# Project: Milestone 3

## Cleaning/Formatting Website Data Source

***Instructions)***

Perform at least 5 data transformation and/or cleansing steps to your flat file data. The below examples are not required - they are just potential transformations you could do. If your data doesn't work for these scenarios, complete different transformations. You can do the same transformation multiple times if needed to clean your data. The goal is a clean dataset at the end of the milestone.

- Replace Headers
- Format data into a more readable format
- Identify outliers and bad data
- Find duplicates
- Fix casing or inconsistent values
- Conduct Fuzzy Matching

Make sure you clearly label each transformation (Step #1, Step #2, etc.) in your code and describe what it is doing in 1-2 sentences. You can submit a Jupyter Notebook or a PDF of your code. If you submit a .py file you need to also include a PDF or attachment of your results.


***Answer)***

**#0. Intro Work**

While inspecting the page's HTML elements to locate the table, I found the following class identifier:

- class = "wikitable sortable jquery-tablesorter"

Three tables on the page carry this class value.
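
As a minimal sketch of narrowing the search to those tables, BeautifulSoup can filter on a single class name. The inline HTML below is a made-up stand-in for the saved Wikipedia page, and `wikitables` is an illustrative variable name. The `jquery-tablesorter` class is typically added by the browser at runtime, so matching on `wikitable` alone may be the more robust choice.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the saved page: two wikitables and one unrelated table
html = """
<table class="wikitable sortable jquery-tablesorter"><tr><th>A</th></tr><tr><td>1</td></tr></table>
<table class="infobox"><tr><td>x</td></tr></table>
<table class="wikitable sortable jquery-tablesorter"><tr><th>B</th></tr><tr><td>2</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches any table whose class list contains 'wikitable'
wikitables = soup.find_all("table", class_="wikitable")
print(len(wikitables))
```

Filtering by class keeps the index passed to `tables[...]` stable even if the page contains other tables, such as navigation boxes.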

In [1]:
# Read the page using bs4
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# Open and read the HTML file with BeautifulSoup
from io import StringIO

with open("List of U.S. cities by adjusted per capita personal income - Wikipedia.html", "r", encoding="utf-8") as htmlFile:
    soup = BeautifulSoup(htmlFile, "html.parser")

tables = soup.find_all('table')

# Read the first table on the page into a pandas DataFrame. Wrapping the markup in
# StringIO avoids the deprecation warning newer pandas versions raise for literal HTML strings.
df = pd.read_html(StringIO(str(tables[0])))[0]

df.head(50)

Unnamed: 0,50 largest metropolitan areas,Per Capita Personal Income (PCPI) (2019)[2],Rank in PCPI before adjustment,Purchasing power of $1.00 (2019)[5],PCPI after adjustments for purchasing power of dollar,Rank in PCPI after adjustment
0,"1. New York City-Newark-Jersey City, NY-NJ-PA MSA","$79,844",4.0,$0.80,"$63,875",7.0
1,"2. Los Angeles-Long Beach-Anaheim, CA MSA","$66,684",8.0,$0.84,"$56,015",34.0
2,"3. Chicago-Naperville-Elgin, IL-IN-WI MSA","$63,500",14.0,$0.97,"$61,595",18.0
3,"4. Dallas-Fort Worth-Arlington, TX MSA","$58,725",22.0,$0.99,"$58,138",27.0
4,"5. Houston-The Woodlands-Sugar Land, TX MSA","$58,890",20.0,$0.98,"$57,712",28.0
5,"6. Washington-Arlington-Alexandria, DC-VA-MD-W...","$74,385",6.0,$0.85,"$63,227",9.0
6,"7. Miami-Fort Lauderdale-West Palm Beach, FL MSA","$60,966",16.0,$0.90,"$54,869",39.0
7,"8. Philadelphia-Camden-Wilmington, PA-NJ-DE-MD...","$66,596",9.0,$0.95,"$63,226",10.0
8,"9. Atlanta-Sandy Springs-Alpharetta, GA MSA","$54,557",32.0,$1.02,"$55,648",36.0
9,"10. Phoenix-Mesa-Chandler, AZ MSA","$48,065",47.0,$1.01,"$48,546",48.0


**#1. Remove 'MSA' from the primary Name Column**

The first formatting/cleaning task is to remove unnecessary strings from the primary name variable. Later I will need to conduct fuzzy matching to bring in the official Census MSA ID.

Removing these strings now will improve the success of the fuzzy matching.

In [3]:
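# Remove 'MSA' from the name column for consistency and to improve the fuzzy match
# (regex=False makes the literal-substring behavior explicit across pandas versions)

df['50 largest metropolitan areas'] = df['50 largest metropolitan areas'].str.replace('MSA', '', regex=False)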

In [4]:
df[['50 largest metropolitan areas']].head(5)

Unnamed: 0,50 largest metropolitan areas
0,"1. New York City-Newark-Jersey City, NY-NJ-PA"
1,"2. Los Angeles-Long Beach-Anaheim, CA"
2,"3. Chicago-Naperville-Elgin, IL-IN-WI"
3,"4. Dallas-Fort Worth-Arlington, TX"
4,"5. Houston-The Woodlands-Sugar Land, TX"
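
Removing the substring 'MSA' leaves behind the space that preceded it, which is invisible in the notebook output but can matter for exact comparisons and merges. The sketch below, using two names copied from the table above, shows one way to check for and trim that whitespace. The variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Two names copied from the table output above
names = pd.Series([
    "9. Atlanta-Sandy Springs-Alpharetta, GA MSA",
    "10. Phoenix-Mesa-Chandler, AZ MSA",
])

# regex=False treats 'MSA' as a literal substring
no_suffix = names.str.replace("MSA", "", regex=False)

# The space that preceded 'MSA' is still present at the end of each value
has_trailing_space = no_suffix.str.endswith(" ").all()

# Trim it so later comparisons and merges see identical strings
trimmed = no_suffix.str.strip()
print(has_trailing_space, trimmed.tolist())
```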


**#2. Remove integers from the primary Name Column**

Now I'll do further cleaning of the name column to further improve the success of fuzzy match.

In [5]:
# Remove integers ('1', '2', '3', etc.) from the name column to improve the fuzzy match

df['50 largest metropolitan areas'] = df['50 largest metropolitan areas'].str.replace(r'\d+', '', regex=True)

# Remove the '.' left over from the rank prefix. regex=False treats '.' as a literal period;
# as a regex, '.' matches any character and, in recent pandas versions, would blank out every value.
df['50 largest metropolitan areas'] = df['50 largest metropolitan areas'].str.replace('.', '', regex=False)

In [6]:
df[['50 largest metropolitan areas']].head(5)

Unnamed: 0,50 largest metropolitan areas
0,"New York City-Newark-Jersey City, NY-NJ-PA"
1,"Los Angeles-Long Beach-Anaheim, CA"
2,"Chicago-Naperville-Elgin, IL-IN-WI"
3,"Dallas-Fort Worth-Arlington, TX"
4,"Houston-The Woodlands-Sugar Land, TX"
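
An alternative to removing every digit and every period is to anchor the pattern to the start of the string, so that only the rank prefix is stripped. This is a sketch under assumptions rather than the method used above: the "St. Louis" row and its rank are illustrative, included to show that a period inside a name survives.

```python
import pandas as pd

names = pd.Series([
    "1. New York City-Newark-Jersey City, NY-NJ-PA",
    "10. Phoenix-Mesa-Chandler, AZ",
    "20. St. Louis, MO-IL",  # illustrative name containing an internal period
])

# ^\d+ : one or more digits at the start of the string
# \.   : the literal period after the rank
# \s*  : any whitespace that follows
cleaned = names.str.replace(r"^\d+\.\s*", "", regex=True)
print(cleaned.tolist())
```

Because the pattern only matches at the start of the string, digits or periods that legitimately belong to a name are left alone.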


**#3. Clean up the Column Names**

This next step improves the readability of the column names and makes them easier to use in downstream tasks.

In [7]:
# Clean up some of the column names to make them easier to read

df.rename(columns={
    '50 largest metropolitan areas': 'Metropolitan Area',
    'Per Capita Personal Income (PCPI) (2019)[2]': 'Per Capita Personal Income (PCPI) (2019)',
    'Purchasing power of $1.00 (2019)[5]': 'Purchasing power of $1.00 (2019)'
}, inplace=True)

In [8]:
df.head(5)

Unnamed: 0,Metropolitan Area,Per Capita Personal Income (PCPI) (2019),Rank in PCPI before adjustment,Purchasing power of $1.00 (2019),PCPI after adjustments for purchasing power of dollar,Rank in PCPI after adjustment
0,"New York City-Newark-Jersey City, NY-NJ-PA","$79,844",4.0,$0.80,"$63,875",7.0
1,"Los Angeles-Long Beach-Anaheim, CA","$66,684",8.0,$0.84,"$56,015",34.0
2,"Chicago-Naperville-Elgin, IL-IN-WI","$63,500",14.0,$0.97,"$61,595",18.0
3,"Dallas-Fort Worth-Arlington, TX","$58,725",22.0,$0.99,"$58,138",27.0
4,"Houston-The Woodlands-Sugar Land, TX","$58,890",20.0,$0.98,"$57,712",28.0
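
The bracketed suffixes such as `[2]` and `[5]` are Wikipedia footnote markers. Instead of listing each affected column by hand, they could be stripped from all column names with one pattern. The sketch below uses an empty demo frame with three of the original column names; `demo` is an illustrative name.

```python
import re
import pandas as pd

# Column names as they appear in the scraped table
raw_columns = [
    "50 largest metropolitan areas",
    "Per Capita Personal Income (PCPI) (2019)[2]",
    "Purchasing power of $1.00 (2019)[5]",
]
demo = pd.DataFrame(columns=raw_columns)

# Drop a trailing footnote marker such as '[2]' from every column name
demo.columns = [re.sub(r"\[\d+\]$", "", col).strip() for col in demo.columns]
print(list(demo.columns))
```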


**#4. Perform fuzzy matching to match Metropolitan Area to the official Census MSA Name**

This step is necessary because the web data has its own naming convention for MSAs. To tie this data source to the other data sources, the Census naming convention and MSA IDs need to be imported and matched to the names used in the web data.

To do this, I use a mapping file from the U.S. Census and fuzzy match its 'CBSA Title' column against the 'Metropolitan Area' column of the web data.

In [12]:
# Importing US Census mapping file
dfmapping = pd.read_excel("MappingFile0.xlsx")

dfmapping.head()

Unnamed: 0,CBSA Code,CBSA Title
0,10100,"Aberdeen, SD"
1,10100,"Aberdeen, SD"
2,10140,"Aberdeen, WA"
3,10180,"Abilene, TX"
4,10180,"Abilene, TX"


In [15]:
# Deduplicating the mapping file
dfmappingV2 = dfmapping.drop_duplicates()

dfmappingV2.head()

Unnamed: 0,CBSA Code,CBSA Title
0,10100,"Aberdeen, SD"
2,10140,"Aberdeen, WA"
3,10180,"Abilene, TX"
6,10220,"Ada, OK"
7,10300,"Adrian, MI"


In [16]:
from thefuzz import fuzz, process

# extracting the relevant columns for fuzzy matching
column_a = dfmappingV2['CBSA Title']
column_b = df['Metropolitan Area'] 

# Initialize a list to store the matches
matches = []

# Iterate over each item in column_b
for b in column_b:
    best_match = None
    highest_score = 0

    # Compare with each item in column_a
    for a in column_a:
        score = fuzz.token_sort_ratio(b, a)

        # Update the best match if this score is higher than the current highest
        if score > highest_score:
            best_match = a
            highest_score = score

    # Append the best match and its score to the matches list
    matches.append((b, best_match, highest_score))

# Create a DataFrame from the matches
matches_web_df = pd.DataFrame(matches, columns=['Metropolitan Area', 'CBSA Title', 'Score'])



In [17]:
# matches_web_df now contains the best match for each name; sort ascending to see the weakest matches first
matches_web_df.sort_values(by='Score', ascending=True)

Unnamed: 0,Metropolitan Area,CBSA Title,Score
26,"Pittsburgh-New Castle-Weirton, PA-OH-WV CSA","Washington Court House, OH",48
50,United States,"Athens, TN",55
10,"Boston-Worcester-Providence, MA-RI-NH-CT CSA","Providence-Warwick, RI-MA",60
48,"Buffalo-Niagara Falls, NY","Buffalo-Cheektowaga, NY",61
18,"Denver-Aurora-Lakewood, CO","Denver-Aurora-Centennial, CO",62
32,"Indianapolis-Carmel-Anderson, IN","Indianapolis-Carmel-Greenwood, IN",70
8,"Atlanta-Sandy Springs-Alpharetta, GA","Atlanta-Sandy Springs-Roswell, GA",72
27,"Las Vegas-Henderson-Paradise, NV","Las Vegas-Henderson-North Las Vegas, NV",72
28,"Austin-Round Rock-Georgetown, TX","Austin-Round Rock-San Marcos, TX",74
11,"San Francisco-Oakland-Berkeley, CA","San Francisco-Oakland-Fremont, CA",74


In [40]:
# creating a csv to manually evaluate the lower scored matches
matches_web_df.to_csv("fuzzy_matchedV3_webdata.csv", index=False)

The fuzzy match successfully matched 94% of the records. The remaining records need to be manually evaluated for accuracy. I will perform this evaluation offline in Excel and load the finished CSV once it is complete.
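
One way to prioritize the manual review is to flag every match below a score cutoff. The sketch below uses four rows copied from the sorted output above; the cutoff of 70 is an assumption chosen for illustration, not a value taken from this analysis, and `matches_demo` is an illustrative name.

```python
import pandas as pd

# A few rows copied from the sorted match output above
matches_demo = pd.DataFrame(
    [
        ("Pittsburgh-New Castle-Weirton, PA-OH-WV CSA", "Washington Court House, OH", 48),
        ("Boston-Worcester-Providence, MA-RI-NH-CT CSA", "Providence-Warwick, RI-MA", 60),
        ("Buffalo-Niagara Falls, NY", "Buffalo-Cheektowaga, NY", 61),
        ("Austin-Round Rock-Georgetown, TX", "Austin-Round Rock-San Marcos, TX", 74),
    ],
    columns=["Metropolitan Area", "CBSA Title", "Score"],
)

REVIEW_THRESHOLD = 70  # assumed cutoff, for illustration only

# Rows scoring below the cutoff are set aside for manual checking
needs_review = matches_demo[matches_demo["Score"] < REVIEW_THRESHOLD]
print(needs_review["Metropolitan Area"].tolist())
```

A cutoff only ranks the review work. A low score does not prove a match is wrong, and a high score does not prove it is right, so spot-checking above the cutoff is still sensible.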

**#5. Add MSA ID using the finalized mapping file started by fuzzy match**

In [18]:
# Importing the completed fuzzy matched file with manual verifications
manualFuzzV3df = pd.read_csv("fuzzy_matched_manual_verified_V3_webdata.csv")

manualFuzzV3df.head()

Unnamed: 0,Metropolitan Area,CBSA Title,Score
0,"Pittsburgh-New Castle-Weirton, PA-OH-WV CSA","Pittsburgh, PA",48
1,"Boston-Worcester-Providence, MA-RI-NH-CT CSA","Boston-Cambridge-Newton, MA-NH",60
2,"Buffalo-Niagara Falls, NY","Buffalo-Cheektowaga, NY",61
3,"Denver-Aurora-Lakewood, CO","Denver-Aurora-Centennial, CO",62
4,"Indianapolis-Carmel-Anderson, IN","Indianapolis-Carmel-Greenwood, IN",70


In [20]:
# Merge dfmappingV2 with manualFuzzV3df to attach the CBSA Code to each verified match
merged_mapping = pd.merge(dfmappingV2, manualFuzzV3df, on='CBSA Title', how='right')

# Check the merged data
merged_mapping.head()

Unnamed: 0,CBSA Code,CBSA Title,Metropolitan Area,Score
0,38300,"Pittsburgh, PA","Pittsburgh-New Castle-Weirton, PA-OH-WV CSA",48
1,14460,"Boston-Cambridge-Newton, MA-NH","Boston-Worcester-Providence, MA-RI-NH-CT CSA",60
2,15380,"Buffalo-Cheektowaga, NY","Buffalo-Niagara Falls, NY",61
3,19740,"Denver-Aurora-Centennial, CO","Denver-Aurora-Lakewood, CO",62
4,26900,"Indianapolis-Carmel-Greenwood, IN","Indianapolis-Carmel-Anderson, IN",70


In [21]:
# Merge the above result with the cleaned web-data df
final_df = pd.merge(df, merged_mapping, on='Metropolitan Area', how='left')

# Check the final merged DataFrame
final_df.head()

Unnamed: 0,Metropolitan Area,Per Capita Personal Income (PCPI) (2019),Rank in PCPI before adjustment,Purchasing power of $1.00 (2019),PCPI after adjustments for purchasing power of dollar,Rank in PCPI after adjustment,CBSA Code,CBSA Title,Score
0,"New York City-Newark-Jersey City, NY-NJ-PA","$79,844",4.0,$0.80,"$63,875",7.0,35620.0,"New York-Newark-Jersey City, NY-NJ",89.0
1,"Los Angeles-Long Beach-Anaheim, CA","$66,684",8.0,$0.84,"$56,015",34.0,31080.0,"Los Angeles-Long Beach-Anaheim, CA",100.0
2,"Chicago-Naperville-Elgin, IL-IN-WI","$63,500",14.0,$0.97,"$61,595",18.0,16980.0,"Chicago-Naperville-Elgin, IL-IN",95.0
3,"Dallas-Fort Worth-Arlington, TX","$58,725",22.0,$0.99,"$58,138",27.0,19100.0,"Dallas-Fort Worth-Arlington, TX",100.0
4,"Houston-The Woodlands-Sugar Land, TX","$58,890",20.0,$0.98,"$57,712",28.0,26420.0,"Houston-Pasadena-The Woodlands, TX",82.0
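
A left merge silently keeps rows that found no partner, so it can help to make unmatched rows visible. The sketch below uses pandas' `indicator` and `validate` options on a two-row toy frame. The scenario in which "United States" has no mapping entry is an assumption for illustration, and the frame names are hypothetical.

```python
import pandas as pd

web_demo = pd.DataFrame({
    "Metropolitan Area": ["Dallas-Fort Worth-Arlington, TX", "United States"],
})
mapping_demo = pd.DataFrame({
    "Metropolitan Area": ["Dallas-Fort Worth-Arlington, TX"],
    "CBSA Code": [19100],
})

merged = pd.merge(
    web_demo, mapping_demo,
    on="Metropolitan Area", how="left",
    validate="m:1",   # raise if a name appears more than once in the mapping
    indicator=True,   # adds a '_merge' column: 'both' or 'left_only'
)

# Names from the web data that found no CBSA Code
unmatched = merged.loc[merged["_merge"] == "left_only", "Metropolitan Area"].tolist()
print(unmatched)
```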


In [22]:
# Dropping the 'Score' column since it's not needed in the final dataset.
final_df.drop(columns=['Score'], inplace=True)

final_df.head()

Unnamed: 0,Metropolitan Area,Per Capita Personal Income (PCPI) (2019),Rank in PCPI before adjustment,Purchasing power of $1.00 (2019),PCPI after adjustments for purchasing power of dollar,Rank in PCPI after adjustment,CBSA Code,CBSA Title
0,"New York City-Newark-Jersey City, NY-NJ-PA","$79,844",4.0,$0.80,"$63,875",7.0,35620.0,"New York-Newark-Jersey City, NY-NJ"
1,"Los Angeles-Long Beach-Anaheim, CA","$66,684",8.0,$0.84,"$56,015",34.0,31080.0,"Los Angeles-Long Beach-Anaheim, CA"
2,"Chicago-Naperville-Elgin, IL-IN-WI","$63,500",14.0,$0.97,"$61,595",18.0,16980.0,"Chicago-Naperville-Elgin, IL-IN"
3,"Dallas-Fort Worth-Arlington, TX","$58,725",22.0,$0.99,"$58,138",27.0,19100.0,"Dallas-Fort Worth-Arlington, TX"
4,"Houston-The Woodlands-Sugar Land, TX","$58,890",20.0,$0.98,"$57,712",28.0,26420.0,"Houston-Pasadena-The Woodlands, TX"
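
The CBSA codes display as floats (for example `35620.0`), which is what pandas typically produces when an integer column picks up missing values during a merge. If integer codes are preferred for joining to other sources, the nullable `Int64` dtype is one option. In the sketch below the missing value is illustrative; whether this dataset actually contains one is not shown above.

```python
import pandas as pd

# Two codes from the output above plus an illustrative missing value
codes = pd.Series([35620.0, 31080.0, float("nan")], name="CBSA Code")

# 'Int64' (capital I) is pandas' nullable integer dtype; missing values become <NA>
codes_int = codes.astype("Int64")
print(codes_int.tolist())
```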


I now have the final version of the web data. This data source is rather simple and did not require much processing.

*In conclusion, these were the transformations applied to the data:*
- Cleaned the values in 'Metropolitan Area' by removing certain strings. 
- Cleaned the values in 'Metropolitan Area' by removing integers and certain special characters.
- Cleaned certain column names to improve readability and downstream use.
- Performed fuzzy match on 'Metropolitan Area' to Census MSA Name ('CBSA Title')
- Used the matched values of 'Metropolitan Area' to import Census MSA ID ('CBSA Code')

The MSA ID (CBSA Code) is now included in the final file. The CBSA Code will act as the foreign key (FK) that ties this dataset to the other data sources.

In [23]:
# Creating a CSV of the final version of the web data
final_df.to_csv("web_data_final_version.csv", index=False)