## The Dataset

For our summer case study, our group chose to explore data related to the Paycheck Protection Program (PPP), which ran during 2020 and 2021 in response to the COVID-19 pandemic.

The dataset our group chose for this study was the PPP FOIA data for loans of $150,000 and above. The raw data is available for download from:

https://data.sba.gov/dataset/ppp-foia

This dataset captures a large amount of information about the businesses whose PPP loans were $150,000 or more. Additional datasets were available for loans under $150,000; however, those datasets contained a wide assortment of challenges that would complicate analysis and modeling.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Here we will load in the original dataset in its raw form from the Small Business Administration's database. 

DATASET = '/dsa/home/lcmhng/jupyter/casestudy_data/group_8/public_150k_plus_v2.csv'

ppp = pd.read_csv(DATASET)

ppp.head()

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,BusinessType,OriginatingLenderLocationID,OriginatingLender,OriginatingLenderCity,OriginatingLenderState,Gender,Veteran,NonProfit,ForgivenessAmount,ForgivenessDate
0,9547507704,5/1/2020,464,PPP,"SUMTER COATINGS, INC.",2410 Highway 15 South,Sumter,,29150-9662,12/18/2020,...,Corporation,19248,Synovus Bank,COLUMBUS,GA,Unanswered,Unanswered,,773553.37,11/20/2020
1,9777677704,5/1/2020,464,PPP,"PLEASANT PLACES, INC.",7684 Southrail Road,North Charleston,,29420-9000,0,...,Sole Proprietorship,19248,Synovus Bank,COLUMBUS,GA,Male Owned,Non-Veteran,,0.0,
2,5791407702,5/1/2020,1013,PPP,BOYER CHILDREN'S CLINIC,1850 BOYER AVE E,SEATTLE,,98112-2922,3/17/2021,...,Non-Profit Organization,9551,"Bank of America, National Association",CHARLOTTE,NC,Unanswered,Unanswered,Y,696677.49,2/10/2021
3,6223567700,5/1/2020,920,PPP,KIRTLEY CONSTRUCTION INC,1661 MARTIN RANCH RD,SAN BERNARDINO,,92407-1740,0,...,Corporation,9551,"Bank of America, National Association",CHARLOTTE,NC,Unanswered,Unanswered,,0.0,
4,9662437702,5/1/2020,101,PPP,AERO BOX LLC,,,,,0,...,,57328,The Huntington National Bank,COLUMBUS,OH,Unanswered,Unanswered,,370819.35,4/8/2021


In [3]:
ppp.tail()

Unnamed: 0,LoanNumber,DateApproved,SBAOfficeCode,ProcessingMethod,BorrowerName,BorrowerAddress,BorrowerCity,BorrowerState,BorrowerZip,LoanStatusDate,...,BusinessType,OriginatingLenderLocationID,OriginatingLender,OriginatingLenderCity,OriginatingLenderState,Gender,Veteran,NonProfit,ForgivenessAmount,ForgivenessDate
970382,4395967002,4/3/2020,897,PPP,"ROY E PAULSON, JR., P.C.",102 N. Kenwood,CASPER,WY,82601-2724,1/13/2021,...,Corporation,42366,Platte Valley Bank,TORRINGTON,WY,Male Owned,Non-Veteran,,151037.5,12/9/2020
970383,6985647108,4/14/2020,897,PPP,"SWEETWATER COUNTY CHILD DEVELOPMENTAL CENTER, ...",1715 HITCHING POST DR,GREEN RIVER,WY,82935-5783,12/8/2020,...,Non-Profit Childcare Center,122813,NebraskaLand National Bank,NORTH PLATTE,NE,Unanswered,Unanswered,Y,150789.04,11/3/2020
970384,7996438405,2/12/2021,897,PPS,ELECTRICAL SYSTEMS OF WYOMING INC,1105 Adon Rd,Rozet,WY,82727-8465,0,...,Subchapter S Corporation,77189,First National Bank of Gillette,GILLETTE,WY,Female Owned,Unanswered,,0.0,
970385,9054647103,4/15/2020,897,PPP,EDEN LIFE CARE,30 N. Gould Street Suite 4000,SHERIDAN,WY,82801,0,...,Corporation,25901,Small Business Bank,LENEXA,KS,Unanswered,Unanswered,,0.0,
970386,9184687004,4/9/2020,897,PPP,S & S JOHNSON ENTERPRISES INC,7342 Granite Loop Rd,TETON VILLAGE,WY,83025-0550,0,...,Subchapter S Corporation,77193,Bank of Jackson Hole,JACKSON,WY,Unanswered,Unanswered,,0.0,


## Data Carpentry



### Zip Codes

First, we verify and clean up the state and zip code fields.
The code block below reads in a reference file of US zip codes, trims the borrower zips to their five-digit form, and then uses a zip-to-state lookup to make sure each business has its state recorded correctly.

In [4]:
# Read in zip code file. The zip column is read as strings so that its values compare
# equal to the cleaned borrower zips (assumes pandas would otherwise parse it as integers)

ZIPS = '/home/lcmhng/jupyter/casestudy_data/group_8/uszips.csv'
zips = pd.read_csv(ZIPS, dtype={'zip': str})

# Establishing a loop to clean up the zips in our ppp file

clean_zips = []

for i in ppp['BorrowerZip']:
    # Check if instance is a string. If so, keep only the part before the '-' character
    if isinstance(i, str):
        clean_zips.append(i.split('-')[0])
    else:
        # Non-string values (e.g. NaN) are kept as they are
        clean_zips.append(i)

ppp['BorrowerZip'] = clean_zips # Clean zips replace the original column

# Create a dictionary matching the zips to states. States recorded in the PPP file go in first;
# the reference file then overrides them for every zip it lists
ppp_zip = dict(zip(ppp['BorrowerZip'], ppp['BorrowerState']))
ppp_zip.update(dict(zip(zips['zip'], zips['state_id'])))

# Look up the state for each row's zip
ppp['BorrowerState'] = [ppp_zip.get(z) for z in ppp['BorrowerZip']]

# Finally, drop rows without a zip
ppp = ppp.dropna(subset=['BorrowerZip'], axis=0)
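
The same cleanup can also be sketched without explicit loops. The example below is a minimal illustration on a made-up three-row frame, not the notebook's data: `toy_ppp` and `toy_zips` are hypothetical names, and the sketch assumes the reference zips are already strings. Unlike the cell above, it falls back to each row's own recorded state when the zip is not in the reference table.

```python
import pandas as pd

# Hypothetical stand-ins for the PPP data and the zip reference file.
toy_ppp = pd.DataFrame({
    "BorrowerZip": ["29150-9662", "98112", None],
    "BorrowerState": [None, "WA", "OH"],
})
toy_zips = pd.DataFrame({"zip": ["29150", "98112"], "state_id": ["SC", "WA"]})

# Keep only the five-digit part of ZIP+4 values; missing values stay missing.
toy_ppp["BorrowerZip"] = toy_ppp["BorrowerZip"].str.split("-").str[0]

# Look up each zip in the reference table, falling back to the recorded state.
zip_to_state = dict(zip(toy_zips["zip"], toy_zips["state_id"]))
toy_ppp["BorrowerState"] = (
    toy_ppp["BorrowerZip"].map(zip_to_state).fillna(toy_ppp["BorrowerState"])
)

# Drop rows that have no zip at all.
toy_ppp = toy_ppp.dropna(subset=["BorrowerZip"])
```

Here the first row gains its state from the reference table, and the row without a zip is dropped.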

### Franchise, NAICS, Business Type, NonProfit, ForgivenYN

In [5]:
# Creating a Franchise variable: 1 if a franchise name is present, 0 otherwise
ppp['FranchiseName'] = ppp['FranchiseName'].fillna(0)
ppp['FranchiseYN'] = ppp['FranchiseName'].apply(lambda x: 0 if x == 0 else 1)

# A small number of rows had a missing NAICS code. Since the dataset is large enough, we choose to drop
# these rows
ppp = ppp.dropna(subset=['NAICSCode'], axis=0)

# We fill business type with the mode as there were <700 business types that were NA
ppp['BusinessType'] = ppp['BusinessType'].fillna(ppp['BusinessType'].mode()[0])

# Non-profits are handled by assuming every row with a missing NonProfit value is not a non-profit (0).
ppp['NonProfit'] = ppp['NonProfit'].fillna(0)
ppp['NonProfit'] = ppp['NonProfit'].apply(lambda x: 0 if x == 0 else 1)

# Forgiveness date is turned into a Forgiven indicator variable (1 = a forgiveness date is recorded)
ppp['ForgivenessDate'] = ppp['ForgivenessDate'].fillna(0)
ppp['ForgivenYN'] = ppp['ForgivenessDate'].apply(lambda x: 0 if x == 0 else 1)
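
The three indicator variables above all follow one pattern: a missing value becomes 0 and anything else becomes 1. A compact way to express that pattern is sketched below on a made-up frame. The `toy` frame, its values, and the `NonProfitYN` column name are hypothetical, and unlike the cell above this version leaves the source columns unchanged.

```python
import pandas as pd

# Hypothetical sample mimicking the raw columns (not real loan records).
toy = pd.DataFrame({
    "FranchiseName": [None, "EXAMPLE FRANCHISE", None],
    "NonProfit": ["Y", None, None],
    "ForgivenessDate": ["11/20/2020", None, "4/8/2021"],
})

# A value is present -> 1, a value is missing -> 0.
toy["FranchiseYN"] = toy["FranchiseName"].notna().astype(int)
toy["NonProfitYN"] = toy["NonProfit"].notna().astype(int)
toy["ForgivenYN"] = toy["ForgivenessDate"].notna().astype(int)
```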

### Phases 

One of the key features of the Paycheck Protection Program was that it rolled out in three phases. For our analysis, we incorporate a phase variable into the dataset so that we can compare the phases throughout our exploration.

In [6]:
import datetime as dt

# Converting the entire 'DateApproved' column to datetime
dates = pd.to_datetime(ppp['DateApproved'])

phase_list = []

for date in dates:
    if date <= dt.datetime(2020, 4, 26):     # Phase 1: approved on or before April 26, 2020
        phase_list.append(1)
    elif date <= dt.datetime(2020, 8, 8):    # Phase 2: after April 26, through August 8, 2020
        phase_list.append(2)
    else:                                    # Phase 3: everything approved after August 8, 2020
        phase_list.append(3)

ppp['Phase'] = phase_list
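
For a column of close to a million dates, the same rule can be applied to the whole column at once. The following is a small sketch using `np.select` on made-up dates. The cutoffs are the ones used in the loop above, while `toy_dates`, `cutoff_1`, and `cutoff_2` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical approval dates in the same m/d/yyyy format as DateApproved.
toy_dates = pd.to_datetime(
    pd.Series(["4/3/2020", "4/26/2020", "5/1/2020", "8/8/2020", "2/12/2021"]),
    format="%m/%d/%Y",
)

cutoff_1 = pd.Timestamp("2020-04-26")  # last day counted as phase 1
cutoff_2 = pd.Timestamp("2020-08-08")  # last day counted as phase 2

# Conditions are checked in order, so the second one only applies to dates after cutoff_1.
phase = np.select(
    [toy_dates <= cutoff_1, toy_dates <= cutoff_2],
    [1, 2],
    default=3,
)
```

As in the loop, a date that fails both comparisons (including a missing date) ends up in phase 3.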

### Congressional Districts and Political Parties

In [7]:
# We want to remove any CDs that are outside of the main US states
# (armed forces codes AE/AP and the territories AS, GU, MP, PR, VI)

non_state_cds = ['AE-', 'AP-', 'AS-', 'GU-', 'MP-', 'PR-', 'VI-']
ppp = ppp[~ppp['CD'].isin(non_state_cds)]

# Next we can create a variable that connects the loan location to whether it was in a Republican or Democratic CD

data = '/dsa/home/lcmhng/jupyter/casestudy_data/group_8/congressdistricts.csv'
congress = pd.read_csv(data)

ppp = pd.merge(ppp, congress, on="CD") # Merge on CD for the affiliated party

ppp = pd.get_dummies(ppp, columns=['Party']) # One-Hot Encoding the party affiliation.
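
Because `pd.merge` defaults to an inner join, any loan whose `CD` value has no match in the congressional file is silently dropped. One way to see how many rows that affects is sketched below on made-up data. The frames, the `ZZ-99` district, and the `R`/`D` party labels are hypothetical, since the contents of `congressdistricts.csv` are not shown here.

```python
import pandas as pd

# Hypothetical loans and district lookup (not the real files).
toy_loans = pd.DataFrame({"LoanNumber": [1, 2, 3], "CD": ["SC-05", "WA-07", "ZZ-99"]})
toy_congress = pd.DataFrame({"CD": ["SC-05", "WA-07"], "Party": ["R", "D"]})

# A left join with indicator=True keeps every loan and records where each row matched.
# validate="many_to_one" raises if a district appears more than once in the lookup.
checked = toy_loans.merge(
    toy_congress, on="CD", how="left", indicator=True, validate="many_to_one"
)
unmatched = checked.loc[checked["_merge"] == "left_only", "CD"].tolist()

# Keeping only the matched rows reproduces what the inner join would return.
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")
encoded = pd.get_dummies(matched, columns=["Party"], dtype=int)
```

Inspecting `unmatched` before dropping anything makes it clear whether the lost rows are a handful of malformed districts or a systematic mismatch in the key format.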

### One-Hot encoding

Since we plan to run a regression model on our dataset, we need to encode the categorical variables so that every feature is numeric.

We have chosen to one-hot encode every categorical variable we will be using, which is done in the cell below.

In [8]:
BusinessAgeDescription = pd.get_dummies(ppp.BusinessAgeDescription, prefix='BusinessAgeDescription')
ProjectState = pd.get_dummies(ppp.ProjectState, prefix='ProjectState')
Race = pd.get_dummies(ppp.Race, prefix='Race')
Ethnicity = pd.get_dummies(ppp.Ethnicity, prefix='Ethnicity')
BusinessType = pd.get_dummies(ppp.BusinessType, prefix='BusinessType')
Gender = pd.get_dummies(ppp.Gender, prefix='Gender')
Veteran = pd.get_dummies(ppp.Veteran, prefix='Veteran')
RuralUrbanIndicator = pd.get_dummies(ppp.RuralUrbanIndicator, prefix='RuralYN')
Phase = pd.get_dummies(ppp.Phase, prefix='Phase')

encoders = [BusinessAgeDescription, 
            ProjectState, 
            Race, 
            Ethnicity, 
            BusinessType, 
            Gender, 
            Veteran,
            RuralUrbanIndicator,
            Phase]

for encoder in encoders:
    ppp = ppp.join(encoder)
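
The encoding can also be done in a single `pd.get_dummies` call by passing the column list and a prefix mapping. The sketch below uses a made-up frame with three of the columns, and `toy` and `prefixes` are hypothetical names. It also passes `dtype=int` explicitly: as far as we are aware, recent pandas releases return boolean dummy columns by default, and boolean columns are not picked up by a numeric-only filter such as `select_dtypes(['number'])`.

```python
import pandas as pd

# Hypothetical sample with three of the categorical columns.
toy = pd.DataFrame({
    "Gender": ["Male Owned", "Unanswered", "Female Owned"],
    "RuralUrbanIndicator": ["U", "R", "U"],
    "Phase": [1, 2, 1],
})

# Source column -> prefix for the generated dummy columns.
prefixes = {"Gender": "Gender", "RuralUrbanIndicator": "RuralYN", "Phase": "Phase"}

# Encode all listed columns at once; dtype=int yields 0/1 integers.
dummies = pd.get_dummies(
    toy[list(prefixes)], columns=list(prefixes), prefix=prefixes, dtype=int
)

# Join the dummies back so the original columns are kept alongside them.
encoded = toy.join(dummies)

# Integer dummies survive a numeric-only selection.
numeric_only = encoded.select_dtypes(["number"])
```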

### Checking for local banks

This final variable tracks whether the loan came from a lender located in the same state as the borrowing business.

In [9]:
# 1 if the borrower and the originating lender are in the same state, 0 otherwise
local_bank = (ppp['BorrowerState'] == ppp['OriginatingLenderState']).astype(int)

# Insert the new variable as the first column
ppp.insert(0, 'LocalBankYN', local_bank)

### Finally, we drop all the unnecessary columns and create our Cost Per Job variable

First, we keep only the numeric variables, since these are the ones the model needs. From there, we drop any numeric variables that are irrelevant.

In [10]:
# First, we grab only the numeric variables that we need.

ppp = ppp.select_dtypes(['number'])

In [11]:
# Next, we create a list of variables to drop

drop_cols = [
    'LoanNumber',  
    'SBAOfficeCode',
    'ServicingLenderLocationID',
    'OriginatingLenderLocationID',
    'NAICSCode'
]

ppp.drop(drop_cols, axis = 1, inplace=True)

In [12]:
# Now to create our dependent variable of Cost Per Job for the loans

ppp['CostPerJob'] = ppp['CurrentApprovalAmount'] / ppp['JobsReported']
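
One thing to watch with this ratio is a loan that reports zero or missing jobs, which would produce an infinite or missing cost per job. Whether such rows exist in this dataset is not checked above, so the following is only a precautionary sketch on made-up values. `toy`, `jobs`, and `usable` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical loans: one normal row, one with zero jobs, one with jobs missing.
toy = pd.DataFrame({
    "CurrentApprovalAmount": [769358.78, 200000.0, 150000.0],
    "JobsReported": [62, 0, np.nan],
})

# Treat zero (or negative) job counts as missing so the division yields NaN instead of inf.
jobs = toy["JobsReported"].where(toy["JobsReported"] > 0)
toy["CostPerJob"] = toy["CurrentApprovalAmount"] / jobs

# Rows without a usable cost per job can then be excluded before modeling.
usable = toy.dropna(subset=["CostPerJob"])
```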

In [13]:
# Confirming a sample of the data

ppp.head()

Unnamed: 0,LocalBankYN,Term,SBAGuarantyPercentage,InitialApprovalAmount,CurrentApprovalAmount,UndisbursedAmount,JobsReported,UTILITIES_PROCEED,PAYROLL_PROCEED,MORTGAGE_INTEREST_PROCEED,...,Gender_Unanswered,Veteran_Non-Veteran,Veteran_Unanswered,Veteran_Veteran,RuralYN_R,RuralYN_U,Phase_1,Phase_2,Phase_3,CostPerJob
0,0,24,100,769358.78,769358.78,0.0,62,0.0,769358.78,0.0,...,1,0,1,0,0,1,0,1,0,12409.012581
1,0,24,100,7944600.0,7944600.0,0.0,500,0.0,7944600.0,0.0,...,1,0,1,0,1,0,1,0,0,15889.2
2,0,24,100,5800000.0,5800000.0,0.0,380,0.0,5800000.0,0.0,...,1,0,1,0,0,1,1,0,0,15263.157895
3,0,24,100,4932347.5,4932347.5,0.0,327,0.0,4932347.5,0.0,...,1,0,1,0,0,1,1,0,0,15083.631498
4,1,24,100,3713223.0,3713223.0,0.0,309,0.0,3713223.0,0.0,...,1,0,1,0,1,0,0,1,0,12016.902913


## Saving out the data

In [14]:
ppp.to_csv('PPP_DATASET.csv')
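
Note that `to_csv` writes the row index as an unnamed first column by default, so a plain `pd.read_csv` of the saved file shows an extra `Unnamed: 0` column. The sketch below demonstrates the round trip with an in-memory buffer and made-up values. `toy` and `buffer` are hypothetical stand-ins for the saved dataset and file.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the final dataset.
toy = pd.DataFrame({"LocalBankYN": [0, 1], "CostPerJob": [12409.01, 15889.2]})

# Write to an in-memory buffer exactly as to_csv would write to disk.
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading back without options turns the saved index into an 'Unnamed: 0' column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Telling read_csv that the first column is the index restores the original layout.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
```

Passing `index=False` to `to_csv` is the other common option when the index carries no information.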