# 1.2 Python data cleaning steps

In this notebook we'll perform data cleaning operations with the dataset `PPP-Data-up-to-150k-080820-HI-OpenRefineCleaned.csv` exported from `OpenRefine`.

We will use `pandas` for data manipulation and `uszipcode` to correct `City` values based on each record's `Zip` code.

In [38]:
# Import necessary libraries
import pandas as pd
from uszipcode import SearchEngine

### Load the data

First, we’ll load the dataset exported from OpenRefine `PPP-Data-up-to-150k-080820-HI-OpenRefineCleaned.csv` into a pandas DataFrame.

In [39]:
# Load the data
df = pd.read_csv('../../data/cleaned/PPP-Data-up-to-150k-080820-HI-OpenRefineCleaned.csv')

### Step 1: Remove duplicate records from the dataset

In [40]:
original_length = len(df)
df = df.drop_duplicates()
print(f"Step #1 -- Removed {original_length - len(df)} duplicate records")

Step #1 -- Removed 17 duplicate records
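Before dropping duplicates, it can help to look at the rows that are about to disappear. The following is a minimal sketch on a toy DataFrame (the column values are made up for illustration and are not taken from the PPP file), using `DataFrame.duplicated`:

```python
import pandas as pd

# Toy data standing in for the PPP records (values are made up for illustration)
toy = pd.DataFrame({
    "City": ["Honolulu", "Honolulu", "Hilo", "Honolulu"],
    "Zip": [96817, 96817, 96720, 96817],
})

# keep=False flags every member of a duplicate group, not just the later copies
duplicate_rows = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first row of each group and removes the rest
deduplicated = toy.drop_duplicates()
removed = len(toy) - len(deduplicated)
print(f"{len(duplicate_rows)} rows belong to duplicate groups; {removed} would be removed")
```

Inspecting `duplicate_rows` first makes it easier to confirm that the removed records are true repeats rather than distinct loans that happen to share every field.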


### Step 2: Rectify `City` and `Zip` associations using the `uszipcode` library

In [41]:
search = SearchEngine()

zipcode_cache = {}
city_corrections = 0
for i, row in df.iterrows():
    # Cache lookups so that each Zip code is queried only once
    if row["Zip"] in zipcode_cache:
        city, state = zipcode_cache[row["Zip"]]
    else:
        # Using uszipcode library to find city and state based on Zip code
        zipcode = search.by_zipcode(row["Zip"])
        city, state = zipcode.major_city, zipcode.state
        zipcode_cache[row["Zip"]] = (city, state)

    # Any Zip that does not resolve to Hawaii is treated as a data-entry error
    if state != "HI":
        # Special case: one record has City="Honolulu" and Zip=97817
        # The Zip code 97817 belongs to Oregon, not Hawaii
        # We change the Zip to 96817 (which is in Hawaii) since the City is Honolulu
        # Caution: the fix is hard-coded, so any other non-HI Zip would get the same values
        df.at[i, "Zip"] = 96817
        df.at[i, "City"] = "Honolulu"
        df.at[i, "State"] = "HI"
    else:
        # Count a correction when the original city differs from the uszipcode city
        # (case-insensitive comparison)
        original_city = df.at[i, "City"]
        if original_city.lower() != city.lower():
            city_corrections += 1
        # Always overwrite City and State with the uszipcode values,
        # which also normalizes capitalization
        df.at[i, "City"], df.at[i, "State"] = city, state
print(f"Step #2 -- Checked and rectified City and Zip associations. {city_corrections} "
      f"City names were corrected based on the Zip code value")

Step #2 -- Checked and rectified City and Zip associations. 1074 City names were corrected based on the Zip code value
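The row-by-row loop can also be written as one lookup per distinct Zip code followed by `Series.map`. The sketch below is only an illustration: `lookup_city` and `KNOWN_ZIPS` are hypothetical stand-ins so the example runs without `uszipcode`, and leaving the city unchanged when a Zip code is unknown is an assumption, not something the notebook does.

```python
import pandas as pd

# Hypothetical stand-in for the uszipcode lookup; the notebook itself calls
# search.by_zipcode(...) and reads .major_city from the result.
KNOWN_ZIPS = {96817: "Honolulu", 96720: "Hilo"}

def lookup_city(zip_code):
    """Return the city for a Zip code, or None when the Zip code is unknown."""
    return KNOWN_ZIPS.get(zip_code)

# Toy records (values are made up for illustration)
toy = pd.DataFrame({
    "City": ["HONOLULU", "Hnl", "Hilo", "Kona"],
    "Zip": [96817, 96817, 96720, 99999],
})

# Look up each distinct Zip once, then broadcast the result to all rows
city_by_zip = {z: lookup_city(z) for z in toy["Zip"].unique()}
looked_up = toy["Zip"].map(city_by_zip)

# Count rows whose city differs from the lookup, ignoring case and unknown Zips
differs = looked_up.notna() & (toy["City"].str.lower() != looked_up.str.lower())
city_corrections = int(differs.sum())

# Keep the original City where the lookup found nothing, otherwise overwrite it
toy["City"] = toy["City"].where(looked_up.isna(), looked_up)
print(city_corrections, toy["City"].tolist())
```

Handling the "not found" case explicitly avoids calling `.lower()` on a missing city name, which the loop version would fail on if a lookup ever returned no city.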


### Step 3: Load the NAICS titles for the new `NAICSTitle` column

In [42]:
# Load NAICS codes and titles from external file 'industry-titles.csv'
# The file was obtained from the U.S. Bureau of Labor Statistics website
# https://www.bls.gov/cew/classifications/industry/industry-titles.htm
naics_df = pd.read_csv('../../data/external/industry-titles.csv')

# Clean up the industry_title by dropping its first two space-separated tokens (the "NAICS" label and the code)
naics_df['industry_title'] = naics_df['industry_title'].apply(lambda x: ' '.join(x.split(' ')[2:]))

# Rename the columns to match the original dataframe
naics_df.columns = ['NAICSCode', 'NAICSTitle']
print(f"Step #3 -- Loaded {len(naics_df)} NAICS codes and titles, and cleaned up the industry titles")

Step #3 -- Loaded 2678 NAICS codes and titles, and cleaned up the industry titles
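Dropping the first two tokens works when every title starts with `NAICS <code>`, but it would also cut two words from a title that lacks that prefix. A hedged alternative is to remove the prefix only when it is present. In this sketch, `strip_naics_prefix` is a hypothetical helper, the sample titles are illustrative, and the assumed prefix format is inferred from the cleanup step above rather than checked against the file:

```python
import re

# Matches a leading "NAICS <code> " prefix; the code may contain digits or a hyphen
# (the exact title format is an assumption based on the cleanup step above)
NAICS_PREFIX = re.compile(r"^NAICS\s+[\d-]+\s+")

def strip_naics_prefix(title):
    """Remove a leading 'NAICS <code>' prefix if present; otherwise return the title unchanged."""
    return NAICS_PREFIX.sub("", title)

print(strip_naics_prefix("NAICS 111110 Soybean farming"))
print(strip_naics_prefix("Total, all industries"))
```

With pandas, the helper could be applied as `naics_df['industry_title'].map(strip_naics_prefix)` before the columns are renamed.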


### Step 4: Preprocess the NAICS codes

In [43]:
# Convert 'NAICSCode' to a numeric dtype; values that cannot be parsed become NaN
naics_df['NAICSCode'] = pd.to_numeric(naics_df['NAICSCode'], errors='coerce')

# Remove rows where 'NAICSCode' is NaN
naics_df = naics_df.dropna(subset=['NAICSCode'])

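# Append a trailing 0 (multiply by 10) to every code x for which the code x * 10 + 1 also exists.
# Note that this check is not restricted to 5-digit codes.
# A set makes each membership test constant-time instead of scanning the whole column.
existing_codes = set(naics_df['NAICSCode'])
naics_df['NAICSCode'] = naics_df['NAICSCode'].apply(
    lambda x: x * 10 if (x * 10 + 1) in existing_codes else x)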

# Special case: Modify 'NAICSCode' in naics_df that have value `99999` to `999990`
naics_df['NAICSCode'] = naics_df['NAICSCode'].apply(lambda x: x * 10 if x == 99999 else x)

# Remove duplicate 'NAICSCode' in naics_df
original_length = len(naics_df)
naics_df.drop_duplicates(subset='NAICSCode', inplace=True)

# Print number of duplicate NAICS codes removed
print(f"Step #4 -- Preprocessed the NAICS codes from the naics_df that holds the NAICS codes and titles. "
      f"Removed {original_length - len(naics_df)} duplicate NAICS codes.")

Step #4 -- Preprocessed the NAICS codes from the naics_df that holds the NAICS codes and titles. Removed 53 duplicate NAICS codes.
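To make the "multiply by 10" rule from this step concrete, here is a small pure-Python sketch. The codes in the list are illustrative only; the real values come from `industry-titles.csv`.

```python
# Illustrative codes only; the real values come from industry-titles.csv
codes = [11211, 112111, 112112, 11212, 112120]

# A set gives constant-time membership checks
existing_codes = set(codes)

# Multiply a code by 10 when a code equal to (code * 10 + 1) also exists
adjusted = [c * 10 if (c * 10 + 1) in existing_codes else c for c in codes]
print(adjusted)
```

Only `11211` changes (to `112110`), because `112111` is in the list. `11212` stays as it is, since `112121` is not.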


### Step 5: Merge the dataframes on `NAICSCode` column

In [44]:
# Convert 'NAICSCode' to float in both dataframes
df['NAICSCode'] = df['NAICSCode'].astype(float)
naics_df['NAICSCode'] = naics_df['NAICSCode'].astype(float)

# Merge the original dataframe with the NAICS dataframe on 'NAICSCode'
original_length = len(df)
df = pd.merge(df, naics_df, on='NAICSCode', how='left')

# Calculate the number of records matched with a 'NAICSTitle' value
naicstitle_added = df['NAICSTitle'].notna().sum()

# Special case: Manually set NAICSTitle "Dual-purpose cattle ranching and farming" to records with NAICSCode `112130`
df.loc[df['NAICSCode'] == 112130, 'NAICSTitle'] = "Dual-purpose cattle ranching and farming"

# Calculate the number of manually set NAICSTitles
manual_naicstitle_count = (df['NAICSTitle'] == "Dual-purpose cattle ranching and farming").sum()
print(f"Step #5 -- Merged the dataframes on 'NAICSCode'. {naicstitle_added} "
      f"records were matched with a 'NAICSTitle' value.\n"
      f"Handled the special case for NAICSCode `112130` and "
      f"NAICSTitle `\"Dual-purpose cattle ranching and farming\"`. "
      f"{manual_naicstitle_count} 'NAICSTitle' values were manually set.")

Step #5 -- Merged the dataframes on 'NAICSCode'. 21775 records were matched with a 'NAICSTitle' value.
Handled the special case for NAICSCode `112130` and NAICSTitle `"Dual-purpose cattle ranching and farming"`. 7 'NAICSTitle' values were manually set.
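A left merge can silently add rows if the lookup table still contains duplicate keys, so it may be worth checking the result. The sketch below uses toy frames (`loans` and `titles` are hypothetical names with illustrative values) together with the `validate` and `indicator` parameters of `pd.merge`:

```python
import pandas as pd

# Toy frames standing in for df and naics_df (values are illustrative)
loans = pd.DataFrame({"NAICSCode": [111110.0, 112130.0, 111110.0]})
titles = pd.DataFrame({
    "NAICSCode": [111110.0, 722511.0],
    "NAICSTitle": ["Soybean farming", "Full-service restaurants"],
})

# validate="many_to_one" raises MergeError if the lookup has duplicate keys;
# indicator=True adds a "_merge" column recording where each row matched
merged = pd.merge(loans, titles, on="NAICSCode", how="left",
                  validate="many_to_one", indicator=True)

# A left join against unique keys must preserve the row count
assert len(merged) == len(loans)

# Codes that found no title in the lookup
unmatched = merged.loc[merged["_merge"] == "left_only", "NAICSCode"]
print(unmatched.value_counts())
merged = merged.drop(columns="_merge")
```

Listing the unmatched codes this way is one way to discover special cases such as `112130` without searching for them by hand.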


### Step 6: Impute one missing value in `BusinessType` column with `"Corporation"` (the most common value in the column)

In [45]:
df['BusinessType'] = df['BusinessType'].fillna('Corporation')
print("Step #6 -- Imputed one (1) missing value in 'BusinessType' with 'Corporation' (the most common value)")

Step #6 -- Imputed one (1) missing value in 'BusinessType' with 'Corporation' (the most common value)
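Instead of hard-coding the fill value and the count, both can be derived from the column itself. This is a minimal sketch on a toy Series (the values are illustrative), assuming that ties in the mode are acceptable to break by taking the first entry:

```python
import pandas as pd

# Toy column with one missing value (values are illustrative)
business_type = pd.Series(["Corporation", "Sole Proprietorship", "Corporation", None])

# mode() ignores missing values by default and returns a Series (ties give several rows)
most_common = business_type.mode().iloc[0]
missing_before = int(business_type.isna().sum())

business_type = business_type.fillna(most_common)
print(f"Imputed {missing_before} missing value(s) with {most_common!r}")
```

Computing the values keeps the printed message accurate if the input file changes.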


### Convert `NAICSCode` column values to string and save the cleaned dataframe to a new CSV file

In [46]:
# Convert 'NAICSCode' to string after ensuring it's not a NaN value
df.loc[~df['NAICSCode'].isna(), 'NAICSCode'] = df.loc[~df['NAICSCode'].isna(), 'NAICSCode'].astype(int).astype(str)

# Save the cleaned data to a new CSV file
df.to_csv('../../data/cleaned/PPP-Data-up-to-150k-080820-HI-OpenRefine-PythonCleaned.csv', index=False)
print("Cleaned data saved to '../../data/cleaned/PPP-Data-up-to-150k-080820-HI-OpenRefine-PythonCleaned.csv'")

Cleaned data saved to '../../data/cleaned/PPP-Data-up-to-150k-080820-HI-OpenRefine-PythonCleaned.csv'
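Recent pandas releases may warn when strings are assigned into a float column through `.loc`, as the conversion above does. One possible alternative, sketched here on a toy Series with illustrative values, goes through the nullable integer dtype `Int64` so that missing codes stay missing:

```python
import pandas as pd

# Toy float column with a missing code (values are illustrative)
codes = pd.Series([112130.0, float("nan"), 722511.0], name="NAICSCode")

# Int64 (capital I) is pandas' nullable integer dtype, so NaN becomes <NA>
# instead of forcing the column to stay float; "string" then drops the ".0"
codes_as_text = codes.astype("Int64").astype("string")
print(codes_as_text.tolist())
```

In the notebook this would correspond to `df['NAICSCode'] = df['NAICSCode'].astype('Int64').astype('string')`. Since `to_csv` uses an empty string for missing values by default, the saved file should still show blank cells for records without a code.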
