# Living Wage Data By **State** for One Adult

**Scrapes MIT's livingwage website ('https://livingwage.mit.edu/')**

1. Data Collection: `requests` is used for simple web scraping, as the website did not have extensive blocking features at the time of writing.
 * First, the individual links for each state are scraped into a list (`links`).
 * Second, the data for each state is extracted.
 * Only the data for a single adult is extracted, for simplicity.
 
2. Data Extraction: BeautifulSoup is used to parse the HTML elements, which are collected into lists.
3. Data Storage: Scraped elements are stored in a DataFrame for further processing.
4. Data Cleaning: No separate cleaning step is required; dollar signs and thousands separators are stripped during extraction.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

In [3]:
url = 'https://livingwage.mit.edu/'

In [4]:
response = requests.get(url)

In [5]:
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    print('success')
else:
    print("Failed to fetch the webpage.")

success
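
The status check above can be wrapped in a small helper so that every later request is handled the same way. This is a minimal sketch rather than the notebook's own code: `fetch_html` and its `get` parameter are hypothetical names, and the 10-second timeout is an assumed value. Note that `raise_for_status()` raises on any 4xx/5xx response, which is slightly different from testing for exactly 200.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page body, raising an HTTPError on a 4xx/5xx status.

    `get` is an injectable callable (hypothetical parameter) so the helper
    can be exercised without network access; it defaults to requests.get.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response.text
```

With such a helper, a failed request stops the run with an exception instead of silently leaving an old `soup` object in place.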


In [6]:
# This is the unordered list (<ul>) that contains an <li> element, including the link, for each state

url_divs = soup.find_all('ul', attrs={'class': 'states list-unstyled row'})  
print(f"Number of divs found: {len(url_divs)}")

Number of divs found: 1


In [7]:
# Getting all the links for each state's page

links = []
for ul in url_divs:
    for li in ul.find_all('li'):
        read_more = li.find('a').get('href') if li.find('a') is not None else ''
        url_link = 'https://livingwage.mit.edu' + read_more
        links.append(url_link)

In [8]:
# Drop '/locations' from each link so that it points to the state's living wage page

links = [link.replace('/locations', '') for link in links]
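
The two steps above (prefixing the base URL, then dropping `/locations`) could also be combined in one function. The following is a sketch under assumptions: `state_page_url` is a hypothetical name, and the href shape `/states/01/locations` is inferred from the parsed link shown in this notebook rather than confirmed. Unlike `str.replace`, it only removes `/locations` when it is the trailing segment.

```python
from urllib.parse import urljoin

BASE_URL = "https://livingwage.mit.edu/"


def state_page_url(href, base=BASE_URL):
    """Build an absolute state URL and drop a trailing '/locations' segment."""
    absolute = urljoin(base, href)  # handles leading slashes correctly
    suffix = "/locations"
    if absolute.endswith(suffix):
        absolute = absolute[: -len(suffix)]
    return absolute


# Assumed href shape, for illustration only.
print(state_page_url("/states/01/locations"))
```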

In [9]:
url = links[0]

In [11]:
# parsed link example
url

'https://livingwage.mit.edu/states/01'

In [10]:
# confirming parsed link is correct
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    print('success')
else:
    print("Failed to fetch the webpage.")

success


In [12]:
expense_divs = soup.find_all('table', attrs={'class': 'results_table table-striped expense_table'}) 
len(expense_divs)

1

In [13]:
state = soup.find_all('h1')[1].text.split(' Calculation for ')[1]
print(state)

Alabama


In [14]:
food_expenses = []

# Find all the rows in the table body
for row in expense_divs[0].tbody.find_all('tr'):
    second = row.find_all('td')
    food = second[1].get_text()
    food_expenses.append(food)


print(food_expenses)

['\n        $3,926\n      ', '\n        $0\n      ', '\n        $2,998\n      ', '\n        $7,954\n      ', '\n        $5,477\n      ', '\n        $3,074\n      ', '\n        $4,253\n      ', '\n        $27,814\n      ', '\n        $4,733\n      ', '\n        $32,547\n      ']
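
The raw cell text above still carries surrounding whitespace, a dollar sign, and thousands separators. A minimal sketch of normalizing one cell follows; `parse_dollars` is a hypothetical helper, and converting to `int` is an extra step beyond what the notebook does (it keeps the cleaned values as strings).

```python
def parse_dollars(cell_text):
    """Convert a raw table cell such as ' $3,926 ' to the integer 3926."""
    cleaned = cell_text.strip().replace("$", "").replace(",", "")
    return int(cleaned)


raw_cells = ["\n        $3,926\n      ", "\n        $0\n      ", "\n        $27,814\n      "]
print([parse_dollars(cell) for cell in raw_cells])  # [3926, 0, 27814]
```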


In [15]:
import pandas as pd

# Create dictionary with expense categories and values
expense_dict = {}
for row in expense_divs[0].tbody.find_all('tr'):
    category = row.find('td', class_='text').text
    values = row.find_all('td')[1:]
    expense_dict[category] = [v.text.strip().replace(',', '').replace('$', '') for v in values]

# Create DataFrame from dictionary
df = pd.DataFrame.from_dict(expense_dict, orient='columns')

In [17]:
# This shows all the data in the table; only the single-adult row (row 0) is needed for our analysis

df.head()

Unnamed: 0,Food,Child Care,Medical,Housing,Transportation,Civic,Other,Required annual income after taxes,Annual taxes,Required annual income before taxes
0,3926,0,2998,7954,5477,3074,4253,27814,4733,32547
1,5795,6659,9059,10406,9851,6107,7420,55428,10817,66245
2,8707,13319,9069,10406,12045,6821,8755,69253,14547,83800
3,11540,19978,8994,13570,14484,9300,9610,87607,20472,108080
4,7198,0,6663,8548,9851,6107,7420,45918,7208,53127


In [20]:
# Define function to extract expenses for one adult

def one_adult(expense_divs):
    expense_dict = {}
    for row in expense_divs[0].tbody.find_all('tr'):
        category = row.find('td', class_='text').text
        values = row.find_all('td')[1:]
        expense_dict[category] = [v.text.strip().replace(',', '').replace('$', '') for v in values]

    # Create DataFrame from dictionary
    df = pd.DataFrame.from_dict(expense_dict, orient='columns')

    # Select the first row of all columns
    first_row = df.iloc[0]

    return first_row
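
Since only the first value column is needed, the same result can be obtained without building the full DataFrame. The sketch below illustrates this on a small inline table: `SAMPLE_HTML` is a hypothetical, heavily trimmed stand-in for the real page markup, and `first_column_expenses` is an illustrative name, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed stand-in for the expense table markup.
SAMPLE_HTML = """
<table class="results_table table-striped expense_table">
  <tbody>
    <tr><td class="text">Food</td><td> $3,926 </td><td> $5,795 </td></tr>
    <tr><td class="text">Housing</td><td> $7,954 </td><td> $10,406 </td></tr>
  </tbody>
</table>
"""


def first_column_expenses(table):
    """Map each expense category to its first value column (as a string)."""
    expenses = {}
    for row in table.tbody.find_all("tr"):
        cells = row.find_all("td")
        category = cells[0].get_text(strip=True)
        value = cells[1].get_text(strip=True).replace("$", "").replace(",", "")
        expenses[category] = value
    return expenses


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="expense_table")
print(first_column_expenses(table))  # {'Food': '3926', 'Housing': '7954'}
```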

In [21]:
# Extract the cost-of-living information for each state (takes under 3 minutes to run)

df_final = pd.DataFrame()

for url in links:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    expense_divs = soup.find_all('table', attrs={'class': 'results_table table-striped expense_table'}) 
    series = one_adult(expense_divs)
    
    state = soup.find_all('h1')[1].text.split(' Calculation for ')[1]
    series['state'] = state
    
    series = series.to_frame().transpose()
    df_final = pd.concat([df_final, series], axis=0) # keep adding a new row per state

df_final.reset_index(drop=True, inplace=True)
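
Calling `pd.concat` inside the loop copies the growing DataFrame on every iteration. A common alternative, sketched here as a suggestion rather than the notebook's method, is to collect one record per state in a list and build the DataFrame once. The sample values are copied from the output shown in this notebook; `scraped_rows` and `df_states` are hypothetical names.

```python
import pandas as pd

# A few values taken from the notebook's output, standing in for scraped data.
scraped_rows = [
    {"Food": "3926", "Housing": "7954", "state": "Alabama"},
    {"Food": "4686", "Housing": "10591", "state": "Alaska"},
]

rows = []
for record in scraped_rows:  # in the notebook this would be one iteration per URL
    rows.append(record)

# Built once at the end; the index is already 0..n-1, so no reset is needed.
df_states = pd.DataFrame(rows)
print(df_states.shape)  # (2, 3)
```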

In [None]:
df_final.to_csv('LivingWageCalculator.csv', index=False)

In [23]:
df_final.head()

Unnamed: 0,Food,Child Care,Medical,Housing,Transportation,Civic,Other,Required annual income after taxes,Annual taxes,Required annual income before taxes,state
0,3926,0,2998,7954,5477,3074,4253,27814,4733,32547,Alabama
1,4686,0,3042,10591,5316,2920,4596,31284,4388,35671,Alaska
2,4686,0,3125,11194,5316,2920,4596,31970,5015,36985,Arizona
3,3926,0,3200,7276,5477,3074,4253,27337,4382,31719,Arkansas
4,4686,0,3136,17076,5316,2920,4596,37863,6312,44175,California


# Living Wage Data By **County** for One Adult

**Scrapes MIT's livingwage website ('https://livingwage.mit.edu/')**

Same as above, except that this code extracts the information for each county in each state.
It takes longer to run but yields more detailed data.

In [39]:
links_for_counties = links
test = links_for_counties[0]

response = requests.get(test)
soup = BeautifulSoup(response.text, 'html.parser')
url_divs = soup.find_all('ul')

divs = url_divs[0].find_all('li')

for div in divs:
    href = div.find('a').get('href')
    county = div.find('a').text.strip()
    print(county)

8

**Now, instead of just using the states, we will extract all the information from the county-specific living wage pages.**

In [54]:
# Getting the lists of county links and county names

county_links = []
county_names = []

for url in links_for_counties:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    url_divs = soup.find_all('ul')
    
    for div in url_divs:
        divs = div.find_all('li')
        
        for li in divs:
            href = li.find('a').get('href')
            county = li.find('a').text.strip()
            url_link = 'https://livingwage.mit.edu' + href
        
            county_links.append(url_link)
            county_names.append(county)
            

# Removing the start-page links we don't want, as well as the metropolitan areas
county_link_cleaned = [link for link in county_links if '/counties' in link]
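
Because every `<ul>` on each state page is scanned, the raw list can contain navigation links and possibly repeats. The sketch below shows the same `'/counties'` filter with an added order-preserving de-duplication step. The de-duplication is an assumption on top of the notebook, `keep_county_links` is a hypothetical name, and the sample URL shapes are illustrative only.

```python
def keep_county_links(all_links):
    """Keep links containing '/counties', dropping repeats but preserving order."""
    seen = set()
    kept = []
    for link in all_links:
        if "/counties" in link and link not in seen:
            seen.add(link)
            kept.append(link)
    return kept


# Hypothetical URL shapes, for illustration only.
sample_links = [
    "https://livingwage.mit.edu/",
    "https://livingwage.mit.edu/counties/01001",
    "https://livingwage.mit.edu/metros/33860",
    "https://livingwage.mit.edu/counties/01001",
    "https://livingwage.mit.edu/counties/01003",
]
print(keep_county_links(sample_links))
```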

In [55]:
len(county_link_cleaned)

3143

In [58]:
# Define function to extract expenses for one adult
def one_adult(expense_divs):
    expense_dict = {}
    for row in expense_divs[0].tbody.find_all('tr'):
        category = row.find('td', class_='text').text
        values = row.find_all('td')[1:]
        expense_dict[category] = [v.text.strip().replace(',', '').replace('$', '') for v in values]

    # Create DataFrame from dictionary
    df = pd.DataFrame.from_dict(expense_dict, orient='columns')

    # Select the first row of all columns
    first_row = df.iloc[0]

    return first_row

In [63]:
# Running this takes ~45 minutes: roughly 15 seconds per 20 URLs, with 3143 URLs to go through in total
df_final = pd.DataFrame()
chunk_size = 20

for i, url in enumerate(county_link_cleaned):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    expense_divs = soup.find_all('table', attrs={'class': 'results_table table-striped expense_table'}) 
    state = soup.find_all('h1')[1].text  #.split(' Calculation for ')[1]
    series = one_adult(expense_divs)  # first row of the table: the single-adult case
    series['state'] = state
    series = series.to_frame().transpose()
    df_final = pd.concat([df_final, series], axis=0)
    
    if (i + 1) % chunk_size == 0:
        print("Processed {} URLs".format(i + 1))

df_final.reset_index(drop=True, inplace=True)

Processed 20 URLs
Processed 40 URLs
Processed 60 URLs
Processed 80 URLs
Processed 100 URLs
Processed 120 URLs
Processed 140 URLs
Processed 160 URLs
Processed 180 URLs
Processed 200 URLs
Processed 220 URLs
Processed 240 URLs
Processed 260 URLs
Processed 280 URLs
Processed 300 URLs
Processed 320 URLs
Processed 340 URLs
Processed 360 URLs
Processed 380 URLs
Processed 400 URLs
Processed 420 URLs
Processed 440 URLs
Processed 460 URLs
Processed 480 URLs
Processed 500 URLs
Processed 520 URLs
Processed 540 URLs
Processed 560 URLs
Processed 580 URLs
Processed 600 URLs
Processed 620 URLs
Processed 640 URLs
Processed 660 URLs
Processed 680 URLs
Processed 700 URLs
Processed 720 URLs
Processed 740 URLs
Processed 760 URLs
Processed 780 URLs
Processed 800 URLs
Processed 820 URLs
Processed 840 URLs
Processed 860 URLs
Processed 880 URLs
Processed 900 URLs
Processed 920 URLs
Processed 940 URLs
Processed 960 URLs
Processed 980 URLs
Processed 1000 URLs
Processed 1020 URLs
Processed 1040 URLs
Processed 106
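
The run time quoted in the cell above can be sanity-checked from the observed rate of about 15 seconds per 20 URLs. This is simple arithmetic, with `estimate_minutes` as a hypothetical helper name; the figure it gives is a lower bound that ignores slow responses and any other overhead, which may explain why the observed time is somewhat longer.

```python
def estimate_minutes(n_urls, urls_per_chunk, seconds_per_chunk):
    """Estimate total run time in minutes from an observed per-chunk rate."""
    chunks = n_urls / urls_per_chunk
    return chunks * seconds_per_chunk / 60


print(round(estimate_minutes(3143, 20, 15), 1))  # 39.3
```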

In [84]:
df_final.head()

Unnamed: 0,State,County,Required annual income before taxes,Food,Medical,Housing,Transportation,Civic,Annual taxes
0,Alabama,Autauga County,32506,3926,2998,7921,5477,3074,4725
1,Alabama,Baldwin County,34476,3926,2998,9510,5477,3074,5106
2,Alabama,Barbour County,30811,3926,2998,6554,5477,3074,4398
3,Alabama,Bibb County,34369,3926,2998,9424,5477,3074,5085
4,Alabama,Blount County,34369,3926,2998,9424,5477,3074,5085


In [71]:
# Get the county and state separately
df_final[['County', 'State']] = df_final['state'].str.split(',', expand=True)

In [82]:
df_final = df_final[['State','County', 'Required annual income before taxes', 'Food', 'Medical','Housing','Transportation', 'Civic', 'Annual taxes']].reset_index(drop=True)
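
Splitting on a bare comma leaves a leading space on the state name and would keep any heading prefix in the county name. A more defensive split is sketched below. It assumes the county page heading has the form `... Calculation for <County>, <State>`, inferred from the state pages used earlier; the exact heading text is not confirmed here, and `split_county_state` is a hypothetical helper.

```python
def split_county_state(heading):
    """Split a heading like '... Calculation for Autauga County, Alabama'."""
    # Keep only the part after the marker, if the marker is present.
    _, marker, tail = heading.partition(" Calculation for ")
    location = tail if marker else heading
    # Split on the last comma so the state is always the final piece.
    county, _, state = location.rpartition(",")
    return county.strip(), state.strip()


# Assumed heading text, for illustration only.
print(split_county_state("Living Wage Calculation for Autauga County, Alabama"))
# ('Autauga County', 'Alabama')
```

Applied to the `state` column, this could be used with `df_final['state'].apply(split_county_state)` before selecting the final columns.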

In [85]:
df_final.to_csv('Living_Wage_County_State.csv')