## Scraping info about the BSC Faculty

In [2]:
# Run this cell unchanged
!pip install -r requirements.txt



In [38]:
import os
import re
import pandas as pd
from time import sleep
from requests import get
from bs4 import BeautifulSoup
from selenium import webdriver
import chromedriver_autoinstaller


import warnings
warnings.filterwarnings("ignore")
%config Completer.use_jedi = False

The goal for today is to scrape the following for each faculty member:
* Name
* Office
* Office phone
* Email
* The classes taught by the teacher

We will do this for each program that is linked [here](https://www.bsc.edu/academics/index.html).

In [8]:
# Set up the url we want to make requests to. 

# Send request

# Print the request response


It looks like there are some security measures that are keeping us from making a successful request.

One common way around this is to use a python package called `selenium`.

Below is some code to set up a selenium webdriver.

In [21]:
def create_driver(headless=True):
    driver = chromedriver_autoinstaller.install(cwd=True)
    chrome_options = webdriver.ChromeOptions()     
    if headless:
        chrome_options.add_argument("--headless")

    driver = webdriver.Chrome(driver, 
                             chrome_options = chrome_options)
    return driver

driver = create_driver()

Now we can run the code below, and we'll be able to connect to the web page.

In [22]:
root = 'https://www.bsc.edu/academics/'
url = root + 'index.html'
driver.get(url)

To pull the raw html from our webdriver, we can run the following:

In [23]:
html = driver.page_source

print('HTML:')
print(html[:150])

HTML:
<html xmlns="http://www.w3.org/1999/xhtml" class="no-js tablesaw-enhanced" lang="en"><head>
		<meta content="IE=edge" http-equiv="X-UA-Compatible">
		


Next we pass the html into `BeautifulSoup` so we can parse the information embedded inside the html. 

In [24]:
# Pass html into BeautifulSoup
soup = BeautifulSoup(html)

Ok now let's start scraping. 

For this page, we need to collect the name of the department and the url for each department's landing page. 

Let's create a dictionary that has the following format:

```
{department_name: department_url}
```

In [None]:
# Your code here

Next, we need to collect the html for each of the department landing pages

Let's create a dictionary with the following format:

```
{department_name: department_html}
```

In [None]:
# Your code here

Now let's create a dictionary containing each department's `soup` (that is, its parsed html):

```
{department_name: department_soup}
```

In [28]:
# Your code here

Ok. So at this point, we have successfully collected the parsed html for each department's landing page. 

Now, we need to figure out how to write code that will scrape the information for each department's landing page. The problem we will discover is that none of the landing pages are formatted exactly the same. *This* is where the true messiness of web scraping appears. 

In the cell below, let's isolate the soup for the `Sociology` department. 

In [30]:
program = department_soup['Sociology']

Now we need to isolate the section of the html that contains the information we want.

In [31]:
# Your code here

Ok. This is where it gets really hairy. 

We need to:
* Loop over the tags inside the `faculty` variable
* Collect the name of the staff member
* Check if the staff member has a link for a bio
    * If they do collect the url for that page, collect html and parse it
    * Scrape office, phone, email, and classes from the bio
* If not, see if we can find any of the data points
* Append the data we have scraped for a staff member to a list
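
Bio links on the department pages tend to be relative hrefs (the cell below checks for a leading `..`). As a minimal sketch of how such a link resolves to a full address, the standard library's `urljoin` can be used. The page URL and href here are made-up placeholders rather than real BSC paths, and `urljoin` is shown as an alternative to the string concatenation used in the cell below, not as the notebook's own method.

```python
from urllib.parse import urljoin

# Hypothetical department page and relative bio link (illustrative only)
department_url = 'https://www.bsc.edu/academics/sociology/index.html'
href = '../faculty/example-person.html'

# urljoin resolves '..' relative to the directory of the page the link was found on
faculty_url = urljoin(department_url, href)
print(faculty_url)
```

For this placeholder input, the result is the same address that `root + href.replace('../', '')` would build, but `urljoin` also copes with hrefs that contain more than one `..` segment.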

In [45]:
# We are going to define a function for collecting the data 
# for faculty members that do not have a bio page
def parse_faculty_no_link(p_tag):
    """
    Helper function for collecting bio data 
    for staff members who have placed their bio
    in the program faculty dropdown
    """
    # Raw strings keep the regex backslashes from being read as string escapes
    phone = p_tag.find(text = re.compile(r'\(.+\) \d{3}-\d{4}|\d{3}-\d{3}-\d{4}'))
    if phone:
        phone = phone.strip()
    email = p_tag.find(text = re.compile(r'@bsc\.edu'))
    return phone, email

 

# Now let's write code to collect the data for every faculty member
# in a department!

data = []
root = 'https://www.bsc.edu/academics/'

# Loop over the tags in faculty
for person in faculty:

    # Create a default dataset for a staff member
    # that sets the default values to None
    # That way, if we do not find anything, 
    # we have a null value for the observation
    default = {"name": None,
               "office": None,
              "phone": None,
              "email": None,
              "classes": None}

    # Collecting the faculty name turned out to be quite difficult
    # The departments deviate from each other pretty dramatically
    # in the way they format their faculty dropdown.
    # the `.strings` attribute was used because it returns
    # the strings inside the person div as separate objects
    # Whereas `.text` returns one joined string. This poses a problem
    # When faculty members included a bio in the same tag as their name.
    name = list(person.strings)
    if not name:
        # Some p tags were empty. 
        # This if check ensures we ignore them.
        continue
    if len(name[0]) > 150:
        # If the length of the text is >150, it is likely a bio of some kind
        # So we skip this p tag and move to the next
        continue
    if not name[0].strip():
        # If, when we strip the string (remove preceding and trailing spaces)
        # The string becomes empty, it is just a ptag filled with white space
        try:
            # In some cases, the ptag opened with a white space
            # with a second nested ptag where the instructors name 
            # was found. If the first tag was white space
            # We are naively setting the name to the second string
            # This has been placed inside a try except block in case
            # a second string doesn't exist. In which case, 
            # the ptag is completely
            # empty and we jump to the next iteration of ptags
            name = name[1]
        except:
            continue
    else:
        # If the stripped string is not empty and is not >150
        # Then we assume the string is the name of the staff member
        name = name[0]
    # We set the name data to the scraped string
    default['name'] = name

    # Next we check if the ptag contains any `a` tags (links)
    if person.find('a'):
        if person.find('a').attrs['href'][:2] == '..':
        # If an `a` tag was found, we check if the href for the atag
        # contains the string '..'
        # This is a weird design choice by the web developers
        # Where they place '..' at the beginning of relative urls.
        # Ultimately, if the url is relative it is most likely the bio page
        # for the staff member. 
            # If a relative path was found
            # We collect the href and assemble the url
            # for the staff member's bio page 
            href = person.find('a').attrs['href']
            faculty_url = root + href.replace('../', '')
            # Next we connect to the web page,
            # collect the html, and parse it.
            driver.get(faculty_url)
            faculty_html = driver.page_source
            faculty_soup = BeautifulSoup(faculty_html)
            try:
                pass
                # The entire next bit is wrapped in a try except
                # This is mostly to keep our code somewhat more readable
                # If we find that we are losing too much data 
                    # For example if `office` is throwing a lot of errors
                    # which means none of the other data points are collected
                # We could consider wrapping each of these datapoints in their
                # own try except block
                
                # ================= STUDENT WORK =======================
                # Find the office for the staff member
                # YOUR CODE HERE
                
                # Set the office variable in our default dictionary 
                # to the office data
                # YOUR CODE HERE
                
                # Find the phone number for the staff member
                # YOUR CODE HERE
                
                # Set the phone variable in our default dictionary 
                # to the phone data
                # YOUR CODE HERE
                
                # Find the email for staff member
                # YOUR CODE HERE
                
                # Set the email variable in our default dictionary 
                # to the email data
                # YOUR CODE HERE
                
                # Find the classes taught by the staff member
                # YOUR CODE HERE
                
                # Set the classes variable in our default dictionary 
                # to the classes data
                # YOUR CODE HERE
                
            except:
                # If an error is thrown move to the last line in the for loop
                pass
        else:
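            # If the link is not a relative url (for example, a mailto link),
            # we have a helper function parse the information
            # of the staff member in the drop down
                # Some programs have opted to place their entire bio
                # in the dropdown and do not seem to have a separate web page
            # Note: p tags with no `a` tag at all never reach this branch
            # and keep the default None values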
            phone, email = parse_faculty_no_link(person)
            default['phone'] = phone
            default['email'] = email
    # Append the found data for a staff member to a list
    data.append(default)

Let's take a look at the results...

In [47]:
data

[{'name': 'William Holt',
  'office': '313 Harbert',
  'phone': '(205) 226-4834',
  'email': 'wholt@bsc.edu',
  'classes': ['UES 110 Sustainability: Southern Cities & The World',
   'UES 210 Environmental Problems and Policy (1)',
   'SO 373 Urban Sociology (1)',
   'SO 376 Environmental Sociology (1)',
   'UES 470 Senior Seminar in Environmental Studies (1)']},
 {'name': 'Meghan Mills',
  'office': 'Harbert 213',
  'phone': '(205) 226-4791',
  'email': 'mmills@bsc.edu',
  'classes': ['SO 101 Introduction to Sociology',
   'SO293 Health and Society',
   'SO370 Sociology of Medicine']},
 {'name': 'Stephanie Hansard',
  'office': None,
  'phone': None,
  'email': None,
  'classes': None},
 {'name': 'Katie McIntyre',
  'office': None,
  'phone': None,
  'email': None,
  'classes': None}]

This looks good. Now let's put it all together!

To make our web scraping code *reusable*, which is an extremely important part of writing powerful web scrapers, we need to toss this entire process into functions. 

**Please complete the functions below:**
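
As one small illustration of that idea, the name-selection rules from the loop above can be pulled into a function that takes the list of strings found in a tag and returns a name or `None`. `pick_name` and its `max_length` parameter are hypothetical names introduced for this sketch. Unlike the loop above, it also strips surrounding whitespace from the name.

```python
def pick_name(strings, max_length=150):
    """Return the likely faculty name from a tag's strings, or None."""
    if not strings:
        return None  # empty p tag
    first = strings[0]
    if len(first) > max_length:
        return None  # long text is probably a bio, not a name
    if first.strip():
        return first.strip()
    # The first string was only whitespace: fall back to the second, if any
    if len(strings) > 1:
        return strings[1].strip()
    return None


print(pick_name(['William Holt']))
print(pick_name(['  ', 'Meghan Mills ']))
print(pick_name([]))
```

Because the function works on plain lists of strings, it can be checked without a webdriver or any parsed html.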

In [None]:
# This function is completed for you
# =======================================================
def create_driver(headless=True):
    """
    Installs a chromedriver into the current working directory
    and returns an active selenium webdriver.
    """
    # Install chrome driver
    driver = chromedriver_autoinstaller.install(cwd=True)
    # Create a chrome options object to customize the settings
    chrome_options = webdriver.ChromeOptions() 
    # Set the download path to the current working directory
        # (Files will be downloaded in the directory of the notebook
        # rather than in the computer's Download folder)
    prefs = {'download.default_directory' : os.getcwd()}
    chrome_options.add_experimental_option('prefs', prefs)
    # Turn off the gui for the web browser
    if headless:
        chrome_options.add_argument("--headless")
    # Create the webdriver object
    driver = webdriver.Chrome(driver, 
                             chrome_options = chrome_options)
    return driver
# =======================================================

def get_department_links(driver):
    # Set up the root and url for collecting
    # the links for each department's landing page
    # YOUR CODE HERE
    
    # Connect the webdriver to the url
    # YOUR CODE HERE
    
    # Collect the html and parse it with BeautifulSoup
    # YOUR CODE HERE
    
    # Isolate the a tags for the departments
    # YOUR CODE HERE
    
    # Create the dictionary of department links
    # YOUR CODE HERE
    
    # Return the department links
    # YOUR CODE HERE

def get_department_soup(driver, link):
    
    # Connect the webdriver to the link
    # YOUR CODE HERE
    
    # Collect the html from the webdriver
    # YOUR CODE HERE
   
    # Parse the html with BeautifulSoup
    # YOUR CODE HERE
    
    # Return the parsed html
    # YOUR CODE HERE

# This function is completed for you. 
# =======================================================
# Feel free to read through the comments
# That describe the web scraping decisions
def get_faculty_divs(program_soup):
    # Isolate the accordion
    dropdown = program_soup.find('section', {'class': 'accordion'})
    
    # The biology department organizes
    # their web pages in ways that make selecting
    # The h2 tag that contains the words "faculty" or "staff" inadequate.
    # Instead, we will select `i` tags that contain white space, but are
    # present on every program page. Then we will loop over the `i` tags and 
    # run a text search for the word faculty/staff on the `h2` tag directly following
    # the `i` tag. 
    i_tags = dropdown.find_all('i')
    for tag in i_tags:
        if re.findall('STAFF|FACULTY|Faculty|Staff', tag.find_next('h2').text):
            # If we have found an h2 tag inside an i tag
            # that contains the key words, it is the faculty dropdown!
            faculty_drop = tag.find_previous('li')
    
    # Each faculty member has their own p tag so we will find all p tags
    # inside the faculty_drop variable and set that to faculty
        # There are some cases where we will end up with "faculty"
        # that are actually just empty p tags in the html
        # This results in a slightly messier dataset, but overall doesn't
        # impact the integrity of our data. 
    faculty = faculty_drop.find('div', {'class': 'content'}).find_all('p')

    return faculty
# =======================================================

def scrape_faculty_data(driver, faculty_divs):
    data = []
    root = 'https://www.bsc.edu/academics/'
    
    for person in faculty_divs:
        
        # Create a default dataset for a staff member
        # If we do not find a data point for an instructor
        # They still need to have the same number of keys
        # If we want to turn this data into a dataframe
        default = {"name": None,
                   "office": None,
                  "phone": None,
                  "email": None,
                  "classes": None}
        
        # Collecting the faculty name turned out to be quite difficult
        # The departments deviate from each other pretty dramatically
        # in the way they format their faculty dropdown.
        # the `.strings` attribute was used because it returns
        # the strings inside the person div as separate objects
        # Whereas `.text` returns one joined string. This poses a problem
        # When faculty members included a bio in the same tag as their name.
        name = list(person.strings)
        if not name:
            # Some p tags were empty. 
            # This if check ensures we ignore them.
            continue
        if len(name[0]) > 150:
            # If the length of the text is >150, it is likely a bio of some kind
            # So we skip this p tag and move to the next
            continue
        if not name[0].strip():
            # If, when we strip the string (remove preceding and trailing spaces)
            # The string becomes empty, it is just a ptag filled with white space
            try:
                # In some cases, the ptag opened with a white space
                # with a second nested ptag where the instructors name 
                # was found. If the first tag was white space
                # We are naively setting the name to the second string
                # This has been placed inside a try except block in case
                # a second string doesn't exist. In which case, 
                # the ptag is completely
                # empty and we jump to the next iteration of ptags
                name = name[1]
            except:
                continue
        else:
            # If the stripped string is not empty and is not >150
            # Then we assume the string is the name of the staff member
            name = name[0]
        # We set the name data to the scraped string
        default['name'] = name
        
        # Next we check if the ptag contains any `a` tags (links)
        if person.find('a'):
            if person.find('a').attrs['href'][:2] == '..':
            # If an `a` tag was found, we check if the href for the atag
            # contains the string '..'
            # This is a weird design choice by the web developers
            # Where they place '..' at the beginning of relative urls.
            # Ultimately, if the url is relative it is most likely the bio page
            # for the staff member. 
                # If a relative path was found
                # We collect the href and assemble the url
                # for the staff member's bio page 
                href = person.find('a').attrs['href']
                faculty_url = root + href.replace('../', '')
                # Next we connect to the web page,
                # collect the html, and parse it.
                driver.get(faculty_url)
                faculty_html = driver.page_source
                faculty_soup = BeautifulSoup(faculty_html)
                try:
                    # The entire next bit is wrapped in a try except
                    # This is mostly to keep our code somewhat more readable
                    # If we find that we are losing too much data 
                        # For example if `office` is throwing a lot of errors
                        # which means none of the other data points are collected
                    # We could consider wrapping each of these datapoints in their
                    # own try except block
                    # ================= STUDENT WORK =======================
                    # Find the office for the staff member
                    # YOUR CODE HERE

                    # Set the office variable in our default dictionary 
                    # to the office data
                    # YOUR CODE HERE

                    # Find the phone number for the staff member
                    # YOUR CODE HERE

                    # Set the phone variable in our default dictionary 
                    # to the phone data
                    # YOUR CODE HERE

                    # Find the email for staff member
                    # YOUR CODE HERE

                    # Set the email variable in our default dictionary 
                    # to the email data
                    # YOUR CODE HERE

                    # Find the classes taught by the staff member
                    # YOUR CODE HERE

                    # Set the classes variable in our default dictionary 
                    # to the classes data
                    # YOUR CODE HERE
                except:
                    # If an error is thrown move to the last line in the for loop
                    pass
            else:
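                # If the link is not a relative url (for example, a mailto link),
                # we have a helper function parse the information
                # of the staff member in the drop down
                    # Some programs have opted to place their entire bio
                    # in the dropdown and do not seem to have a separate web page
                # Note: p tags with no `a` tag at all never reach this branch
                # and keep the default None values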
                phone, email = parse_faculty_no_link(person)
                default['phone'] = phone
                default['email'] = email
        # Append the found data for a staff member to a list
        data.append(default)
        
    # Return data for every staff member
    return data

# This function is completed for you. 
# =======================================================
def parse_faculty_no_link(p_tag):
    """
    Helper function for collecting bio data 
    for staff members who have placed their bio
    in the program faculty dropdown
    """
    # Raw strings keep the regex backslashes from being read as string escapes
    phone = p_tag.find(text = re.compile(r'\(.+\) \d{3}-\d{4}|\d{3}-\d{3}-\d{4}'))
    if phone:
        phone = phone.strip()
    email = p_tag.find(text = re.compile(r'@bsc\.edu'))
    return phone, email
# =======================================================

def scrape_faculty():
    # Create an empty dataframe
    # YOUR CODE HERE
    
    # Create a webdriver
    # YOUR CODE HERE
    
    # Collect the links for each department
    # YOUR CODE HERE

    # Loop over the keys of the program links dictionary
    # YOUR CODE HERE
    
        # Print the name of the program so we know what
        # is happening as our code runs
        # YOUR CODE HERE
        
        # Collect the link for the program
        # by passing in the program name
        # to the program links dictionary
        # YOUR CODE HERE
        
        # Collect the parsed html for the program
        # YOUR CODE HERE
        
        # Collect the divs for each faculty member
        # YOUR CODE HERE
        
        # Scrape the data for each faculty member
        # YOUR CODE HERE
        
        # Turn the scraped data into a dataframe
        # YOUR CODE HERE
        
        # Add a column called `program` that is set
        # to the name of the program we have scraped
        # data from
        # YOUR CODE HERE
        
        # Append the program dataframe to 
        # the `df` variable that we created before
        # the for loop
        # YOUR CODE HERE
    
    # Return df
    return df
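
The phone pattern used in `parse_faculty_no_link` can be tried on plain strings before it is pointed at a page. The sketch below is only an illustration: the sample strings are invented, and `PHONE_PATTERN` is a name introduced here.

```python
import re

# Same pattern as in parse_faculty_no_link, written as a raw string
PHONE_PATTERN = re.compile(r'\(.+\) \d{3}-\d{4}|\d{3}-\d{3}-\d{4}')

# Invented sample strings (not scraped data)
samples = ['Office: (205) 555-0100', 'Call 205-555-0101', 'No number here']

matches = []
for text in samples:
    found = PHONE_PATTERN.search(text)
    # Keep the matched text, or None when the pattern is absent
    matches.append(found.group() if found else None)

print(matches)
```

The first alternative covers numbers written as `(205) 555-0100`, the second covers `205-555-0101`, and anything else comes back as `None`.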

In [50]:
df = scrape_faculty()

Scraping data for the Applied Computer Science program.
Scraping data for the Mathematics program.
Scraping data for the Art & Art History program.
Scraping data for the Media And Film Studies program.
Scraping data for the Biology program.
Scraping data for the Music program.
Scraping data for the Business program.
Scraping data for the Philosophy program.
Scraping data for the Chemistry program.
Scraping data for the Political Science program.
Scraping data for the Classics program.
Scraping data for the Psychology program.
Scraping data for the Economics program.
Scraping data for the Physics program.
Scraping data for the Education program.
Scraping data for the Religion program.
Scraping data for the English program.
Scraping data for the Sociology program.
Scraping data for the Foreign Languages program.
Scraping data for the Theatre program.
Scraping data for the History program.
Scraping data for the Urban Environmental Studies program.


In [51]:
df.dropna(subset=['email'])

Unnamed: 0,name,office,phone,email,classes,program
0,Dr. Amber Wagner,,,anwagner@bsc.edu,,Applied Computer Science
1,Dr. Anthony Winchester,,,agwinche@bsc.edu,,Applied Computer Science
0,Jeffrey Barton,111 Olin Center,(205) 226-3027,jbarton@bsc.edu,[],Mathematics
1,Catherine Cashio,Olin 109,(205) 226-3025,clcashio@bsc.edu,"[MA 207 General Statistics (1), MA 455 Introdu...",Mathematics
2,E-mail:,,,clcashio@bsc.edu,,Mathematics
...,...,...,...,...,...,...
8,Megan Gibbons,Stephens Science Center 236,(205) 226-4874,mgibbons@bsc.edu,"[BI 115 Organismal Biology (1), BI 225 Evoluti...",Urban Environmental Studies
9,Bill Myers,HC 222,(205) 226-4868,bmyers@bsc.edu,[],Urban Environmental Studies
10,Rebekah Parker,,,rpparker@bsc.edu,,Urban Environmental Studies
11,Kathleen Greer Rossmann,Harbert 309,(205) 226-4603,KRossman@bsc.edu,[],Urban Environmental Studies
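
The output above includes a row whose `name` is `E-mail:`, which is a label that slipped through the name logic rather than a person. One possible clean-up, sketched here on plain dictionaries shaped like the scraped records, is to drop records whose name is empty or ends in a colon. The `looks_like_label` helper and the rule itself are assumptions made for illustration and are not part of the original scraper.

```python
def looks_like_label(name):
    """Return True when a scraped 'name' is missing, empty, or ends with a colon."""
    if name is None:
        return True
    cleaned = name.strip()
    return cleaned == '' or cleaned.endswith(':')


# Records copied from the output above (other fields omitted)
records = [
    {'name': 'Catherine Cashio', 'email': 'clcashio@bsc.edu'},
    {'name': 'E-mail:', 'email': 'clcashio@bsc.edu'},
    {'name': 'Jeffrey Barton', 'email': 'jbarton@bsc.edu'},
]

cleaned_records = [r for r in records if not looks_like_label(r['name'])]
print([r['name'] for r in cleaned_records])
```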


# Senate

Below we scrape the data for the current US Senators from the [US Senators Wikipedia Page](https://en.wikipedia.org/wiki/List_of_current_United_States_senators)

In [52]:
# Set up the url
url = 'https://en.wikipedia.org/wiki/List_of_current_United_States_senators'
# Make a request
response = get(url)
# Collect the html
html = response.text
# Parse the html
soup = BeautifulSoup(html)

First we will look at a really neat web scraping trick with the `pandas` library. 

To start, we isolate the `<table>` tag that contains the information about the senators.

In [56]:
table = soup.find('table', {'id':'senators'})

Next, we convert the isolated html table to a string datatype.

In [57]:
# Convert the soup variable to a string
table_string = str(table)

And now the cool part!

For any `<table>` tag, we can simply pass it into `pd.read_html` and pandas will parse the table for us and return it as a dataframe!

In [62]:
pandas_result = pd.read_html(table_string)
pandas_result

[            State  Portrait               Senator  Party        Party.1  \
 0         Alabama       NaN        Richard Shelby    NaN  Republican[2]   
 1         Alabama       NaN      Tommy Tuberville    NaN     Republican   
 2          Alaska       NaN        Lisa Murkowski    NaN     Republican   
 3          Alaska       NaN          Dan Sullivan    NaN     Republican   
 4         Arizona       NaN        Kyrsten Sinema    NaN     Democratic   
 ..            ...       ...                   ...    ...            ...   
 95  West Virginia       NaN  Shelley Moore Capito    NaN     Republican   
 96      Wisconsin       NaN           Ron Johnson    NaN     Republican   
 97      Wisconsin       NaN         Tammy Baldwin    NaN     Democratic   
 98        Wyoming       NaN         John Barrasso    NaN     Republican   
 99        Wyoming       NaN        Cynthia Lummis    NaN     Republican   
 
         Born                                      Occupation(s)  \
 0   (age 87)     

The above output looks sort of funny. 

The developers of pandas made the choice that `read_html` will always return a `list` of dataframes. They did this with the thinking that you could pass `read_html` the html of an entire webpage, and pandas will parse every `<table>` tag found in the html. 

In this case, we have already isolated the table we want which means the list given to us by `read_html` has a length of `1`.

In [64]:
len(pandas_result)

1

So really, all we need to do is **index** the list returned to us by `read_html`.

In [65]:
senate = pd.read_html(table_string)[0]
senate.head(2)

Unnamed: 0,State,Portrait,Senator,Party,Party.1,Born,Occupation(s),Previous electiveoffice(s),Education,Assumed office,Term up,Residence
0,Alabama,,Richard Shelby,,Republican[2],(age 87),Lawyer,U.S. HouseAlabama Senate,University of Alabama Birmingham School of Law...,"January 3, 1987",2022,Tuscaloosa[3]
1,Alabama,,Tommy Tuberville,,Republican,(age 66),"College football coachPartner, investment mana...",,Southern Arkansas University,"January 3, 2021",2026,Auburn


And we have our data! 🥳

It is worth noting that the data given to us is a bit messy, and in some cases is missing information we might want. This pandas trick can be really powerful and helpful if we're short on time, but a manual scrape will typically produce a cleaner result. 

## A better scrape:

Below, we create a function called `scrape_senate_table` that produces a much cleaner dataframe.

This function is completed for you. We encourage you to read through the comments, and even copy the code into a code cell and play around with it. 

In [75]:
# Run this cell unchanged
def scrape_senate_table():
    
    # Connect to the wikipedia page
    response = get('https://en.wikipedia.org/wiki/List_of_current_United_States_senators')
    # Collect the html from the page
    html = response.text
    # Parse the html with BeautifulSoup
    soup = BeautifulSoup(html)
    # Find the senators table
    table = soup.find('table', {'id': 'senators'})
    
    # The first row in the table contains the column names.
    # Isolate the first row, then find all of the column tags.
    columns = table.find('tr').find_all('th')
    # Collect the row tags for the entire dataset
    rows = table.find_all('tr')[1:]
    # Create an empty list to append the row data to
    senate_data = []
    # Create some cleaning functions for text data
    remove_new_line = lambda x: x.replace('\n', '')
    split_new_line = lambda x: x.split('\n')
    
    # Loop over each row
    for row in rows:
        # find all td tags from the row
        td_tags = row.find_all('td')
        # ===============================================================================
        # The senators table merges the cell for `state name` so it spans both 
        # senators from a state. 
        # When parsing the html, the state name only appears for the senator
        # that appears first in the table, and the second appearing senator has 
        # one less td tag.
        # Because of this we need to check the length of the td tags
        # If there is one less tag, we add  the state name from the previous iteration
        # To the beginning of the list of tags
        # ===============================================================================
        # Check if the list of tags has the full number of columns
        if len(td_tags) == len(columns):
            # If it does, store the first element in the list of tags
            # to a variable called previous_element
            previous_element = td_tags[0]
        # If the list of tags does not have the full number of columns
        # insert the previous_element variable at the beginning of the list of tags
        else:
            td_tags.insert(0, previous_element)
        
        # ===============================================================================
        #                              Parse the row data
        # ===============================================================================
        # Collect the state name
        state = remove_new_line(td_tags[0].text)
        # Collect the image url
        image = td_tags[1].find('img').attrs['src']
        # Collect the name of the senator
        name = remove_new_line(row.find('th').text)
        # Collect the css color string for the political party
        party_color = td_tags[2].attrs['style'].split(':')[-1]
        # Collect the party name
        party_name = remove_new_line(td_tags[3].text)
        # Remove the [2] that occasionally follows the word "Republican"
        party_name = party_name.split('[')[0]
        # Collect the date of birth
        dob = ' '.join(remove_new_line(td_tags[4]\
                                       .text)\
                                       .strip()\
                                       .split(' ')[1:4])
        # Collect the occupation
        occupation = split_new_line(BeautifulSoup(str(td_tags[5])\
                                                  .replace('<br/>', '\n'))\
                                                  .td\
                                                  .text)
        # If only one occupation is present
        # Pull that occupation out of the list
        # And return a single string
        if occupation[1] == '':
            occupation = occupation[0]
        # Collect the previous office
        previous_office = split_new_line(BeautifulSoup(str(td_tags[6])\
                                                       .replace('<br/>', '\n'))\
                                                       .td\
                                                      .text)
        # If only one previous office is present
        # Pull that value out of the list
        # And return a single string
        if previous_office[1] == '':
            previous_office = previous_office[0]
        # Collect assumed office
        assumed_office = remove_new_line(td_tags[7].text)
        # Collect the end of their term
        term_up = remove_new_line(td_tags[8].text)
        # Remove any [number] appendix references
        term_up = term_up.split('[')[0]
        # Collect their residence
        residence = split_new_line(td_tags[9].text)
        
        # If only one residence is present
        # Pull that value out of the list
        # And return a single string
        if residence[1] == '':
            residence = residence[0]
            # Many of the residences have a 
            # link pointing to additional information
            # We do not need this
            if '[' in residence:
                residence = residence[:-3]
        # Create data dictionary
        collected = {'state': state, 'name': name,'dob': dob, 'party': party_name,
                    'party_color': party_color,
                    'occupation': occupation,
                    'previous_office': previous_office,
                    'assumed_office': assumed_office,
                    'term_up': term_up, 'residence': residence,
                    'portrait': image}
        # Append dictionary to the senate_data list
        senate_data.append(collected)
        
    # Once data from all rows has been collected
    # return the data as a pandas dataframe
    return pd.DataFrame(senate_data)
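
The merged state cell handling inside `scrape_senate_table` can be illustrated without any html. In this minimal sketch each row is a plain list of cell values, and a row that is one cell short receives the first cell of the last complete row. `fill_merged_first_cell` is a hypothetical helper written for this example, not part of the function above.

```python
def fill_merged_first_cell(rows, n_columns):
    """Prepend the last seen first cell to rows that are one cell short."""
    filled = []
    previous_first = None
    for cells in rows:
        if len(cells) == n_columns:
            # Complete row: remember its first cell for the rows that follow
            previous_first = cells[0]
            filled.append(list(cells))
        else:
            # Short row: the merged cell belongs to the previous complete row
            filled.append([previous_first] + list(cells))
    return filled


rows = [
    ['Alabama', 'Richard Shelby'],
    ['Tommy Tuberville'],
    ['Alaska', 'Lisa Murkowski'],
    ['Dan Sullivan'],
]
print(fill_merged_first_cell(rows, n_columns=2))
```

The same carry-forward idea applies to any table where one cell spans several rows, as long as the complete row always comes first.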

In [76]:
# Run this cell unchanged
scrape_senate_table()

Unnamed: 0,state,name,dob,party,party_color,occupation,previous_office,assumed_office,term_up,residence,portrait
0,Alabama,Richard Shelby,"May 6, 1934",Republican,#E81B23,Lawyer,"[U.S. House, Alabama Senate, ]",University of AlabamaBirmingham School of LawU...,"January 3, 1987",2022,//upload.wikimedia.org/wikipedia/commons/thumb...
1,Alabama,Tommy Tuberville,"September 18, 1954",Republican,#E81B23,"[College football coach, Partner, investment m...",,Southern Arkansas University,"January 3, 2021",2026,//upload.wikimedia.org/wikipedia/commons/thumb...
2,Alaska,Lisa Murkowski,"May 22, 1957",Republican,#E81B23,Lawyer,Alaska House of Representatives,Georgetown UniversityWillamette University Col...,"December 20, 2002",2022,//upload.wikimedia.org/wikipedia/commons/thumb...
3,Alaska,Dan Sullivan,"November 13, 1964",Republican,#E81B23,"[U.S. Marine Corps officer, Lawyer, Assistant ...",Alaska Attorney General,Culver Military AcademyHarvard UniversityGeorg...,"January 3, 2015",2026,//upload.wikimedia.org/wikipedia/commons/thumb...
4,Arizona,Kyrsten Sinema,"July 12, 1976",Democratic,#3333FF,"[Social worker, Political activist, Lawyer, Co...","[U.S. House, Arizona Senate, Arizona House of ...",Brigham Young University,"January 3, 2019",2024,//upload.wikimedia.org/wikipedia/commons/thumb...
...,...,...,...,...,...,...,...,...,...,...,...
95,West Virginia,Shelley Moore Capito,"November 26, 1953",Republican,#E81B23,"[College career counselor, Director, state Boa...","[U.S. House, West Virginia House of Delegates, ]",Duke UniversityUniversity of Virginia,"January 3, 2015",2026,//upload.wikimedia.org/wikipedia/commons/thumb...
96,Wisconsin,Ron Johnson,"April 8, 1955",Republican,#E81B23,"[Accountant, Corporate executive, ]",,University of Minnesota,"January 3, 2011",2022,//upload.wikimedia.org/wikipedia/commons/thumb...
97,Wisconsin,Tammy Baldwin,"February 11, 1962",Democratic,#3333FF,Lawyer,"[U.S. House, Wisconsin Assembly, Dane County, ...",Smith CollegeUniversity of Wisconsin,"January 3, 2013",2024,//upload.wikimedia.org/wikipedia/commons/thumb...
98,Wyoming,John Barrasso,"July 21, 1952",Republican,#E81B23,"[Orthopedic surgeon, Medical chief of staff, N...",Wyoming Senate,Rensselaer Polytechnic UniversityGeorgetown Un...,"June 25, 2007",2024,//upload.wikimedia.org/wikipedia/commons/thumb...
