# U.S. Small Business Administration High Risk Loan Applicant Analysis

## Data Wrangling & Cleaning - Part 1

### Loading assets and data set

In [1]:
import numpy as np
import pandas as pd

from uszipcode import SearchEngine
from library.utils import save_file



In [2]:
apps_ver1 = pd.read_csv('../data/raw/SBAnational.csv', low_memory=False)

### Initial explorations of the data set
The next few cells give an idea of the shape of the data set, the data type of each feature, and which features are missing data. These initial observations will also be used to identify the critical response feature, features that won't be needed, and features that will require some modification.

In [3]:
apps_ver1.shape

(899164, 27)

This dataset contains 899,164 observations and 27 unique features.

In [4]:
apps_ver1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 899164 entries, 0 to 899163
Data columns (total 27 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   LoanNr_ChkDgt      899164 non-null  int64  
 1   Name               899150 non-null  object 
 2   City               899134 non-null  object 
 3   State              899150 non-null  object 
 4   Zip                899164 non-null  int64  
 5   Bank               897605 non-null  object 
 6   BankState          897598 non-null  object 
 7   NAICS              899164 non-null  int64  
 8   ApprovalDate       899164 non-null  object 
 9   ApprovalFY         899164 non-null  object 
 10  Term               899164 non-null  int64  
 11  NoEmp              899164 non-null  int64  
 12  NewExist           899028 non-null  float64
 13  CreateJob          899164 non-null  int64  
 14  RetainedJob        899164 non-null  int64  
 15  FranchiseCode      899164 non-null  int64  
 16  Ur

In [5]:
pd.set_option('display.max_columns', None)  # Allows us to view every feature when the head function is run.
apps_ver1.head()

Unnamed: 0,LoanNr_ChkDgt,Name,City,State,Zip,Bank,BankState,NAICS,ApprovalDate,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,ChgOffDate,DisbursementDate,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv
0,1000014003,ABC HOBBYCRAFT,EVANSVILLE,IN,47711,FIFTH THIRD BANK,OH,451120,28-Feb-97,1997,84,4,2.0,0,0,1,0,N,Y,,28-Feb-99,"$60,000.00",$0.00,P I F,$0.00,"$60,000.00","$48,000.00"
1,1000024006,LANDMARK BAR & GRILLE (THE),NEW PARIS,IN,46526,1ST SOURCE BANK,IN,722410,28-Feb-97,1997,60,2,2.0,0,0,1,0,N,Y,,31-May-97,"$40,000.00",$0.00,P I F,$0.00,"$40,000.00","$32,000.00"
2,1000034009,"WHITLOCK DDS, TODD M.",BLOOMINGTON,IN,47401,GRANT COUNTY STATE BANK,IN,621210,28-Feb-97,1997,180,7,1.0,0,0,1,0,N,N,,31-Dec-97,"$287,000.00",$0.00,P I F,$0.00,"$287,000.00","$215,250.00"
3,1000044001,"BIG BUCKS PAWN & JEWELRY, LLC",BROKEN ARROW,OK,74012,1ST NATL BK & TR CO OF BROKEN,OK,0,28-Feb-97,1997,60,2,1.0,0,0,1,0,N,Y,,30-Jun-97,"$35,000.00",$0.00,P I F,$0.00,"$35,000.00","$28,000.00"
4,1000054004,"ANASTASIA CONFECTIONS, INC.",ORLANDO,FL,32801,FLORIDA BUS. DEVEL CORP,FL,0,28-Feb-97,1997,240,14,1.0,7,7,1,0,N,N,,14-May-97,"$229,000.00",$0.00,P I F,$0.00,"$229,000.00","$229,000.00"


The provided data dictionary found on this [page](https://www.kaggle.com/mirbektoktogaraev/should-this-loan-be-approved-or-denied?select=Should+This+Loan+be+Approved+or+Denied+A+Large+Dataset+with+Class+Assignment+Guidelines.pdf) identifies the **MIS_Status** feature as the critical response feature for this dataset. *P I F* indicates paid in full (didn't default) and *CHGOFF* indicates charged off (did default). This data dictionary is also recreated and stored as a [reference](../references/data_dictionary.md). This study aims to build a model that identifies which features suggest an applicant is at higher risk of defaulting on their loan.

Features like *LoanNr_ChkDgt* and *Name* are unique to the individual and won't be useful for this study. The *Bank* and *BankState* features don't pertain to the individual so those will also not be useful for this study. Those will be dropped later after investigating all the columns. 

The next step is to drop any duplicate observations and count the missing values in each column.

In [6]:
apps_ver2 = apps_ver1.drop_duplicates(keep = 'first')
apps_ver2.isnull().sum()

LoanNr_ChkDgt             0
Name                     14
City                     30
State                    14
Zip                       0
Bank                   1559
BankState              1566
NAICS                     0
ApprovalDate              0
ApprovalFY                0
Term                      0
NoEmp                     0
NewExist                136
CreateJob                 0
RetainedJob               0
FranchiseCode             0
UrbanRural                0
RevLineCr              4528
LowDoc                 2582
ChgOffDate           736465
DisbursementDate       2368
DisbursementGross         0
BalanceGross              0
MIS_Status             1997
ChgOffPrinGr              0
GrAppv                    0
SBA_Appv                  0
dtype: int64
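The raw counts above are easier to judge next to the share of rows they represent. The following is a minimal sketch on a toy frame (the `toy` data and the `missing_summary` name are illustrative assumptions, not part of the original notebook) that pairs each count with its percentage:

```python
import pandas as pd

# Toy frame standing in for apps_ver2; the values are illustrative only.
toy = pd.DataFrame({
    "City": ["EVANSVILLE", None, "ORLANDO", "BOISE"],
    "State": ["IN", "MA", None, "ID"],
    "Zip": [47711, 2401, 32801, 83703],
})

# Count of missing values per column alongside the share of rows affected.
missing_summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})

# Show only the columns that actually have gaps.
print(missing_summary[missing_summary["n_missing"] > 0])
```

Applied to the real data, the same pattern would show that every column except *ChgOffDate* is missing well under 1% of its values.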

The *Name*, *Bank*, and *BankState* missing values won't be an issue as those columns will be dropped. The *City* and *State* missing values can likely be imputed since the *Zip* is not missing on any observation. The **uszipcode** library inside the following function will help with filling in these missing values.

### Data Cleaning

In [7]:
def fill_missing_city_state_values(df, missing_vals_subset):
    """Look up missing City and State values by Zip with the uszipcode.SearchEngine library.

    The function works on a copy so the original dataframe is unaltered, and returns the copy.
    missing_vals_subset holds the rows of df that have a missing City or State."""
    search = SearchEngine()
    df_temp = df.copy()  # copy so the caller's dataframe is not mutated
    for index, row in missing_vals_subset.iterrows():
        zipInfo = search.by_zipcode(row['Zip'])
        if zipInfo is not None:
            # .loc addresses rows by index label, which is what iterrows() yields
            if pd.isnull(df_temp.loc[index, 'City']):
                df_temp.loc[index, 'City'] = zipInfo.major_city
            if pd.isnull(df_temp.loc[index, 'State']):
                df_temp.loc[index, 'State'] = zipInfo.state
    return df_temp
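The fill logic can be tried without the uszipcode database. The sketch below is an assumption-laden stand-in: `fill_city_state`, the `lookup` dictionary, and the toy rows are all hypothetical and only illustrate label-based filling on a copy. It also zero-pads the ZIP, since storing *Zip* as `int64` drops leading zeros.

```python
import pandas as pd


def fill_city_state(df, lookup):
    """Return a copy of df with missing City/State filled from lookup.

    lookup is a hypothetical dict of five-digit ZIP string -> (city, state).
    """
    out = df.copy()
    needs_fill = out["City"].isnull() | out["State"].isnull()
    for label in out.index[needs_fill]:
        key = str(out.loc[label, "Zip"]).zfill(5)  # restore leading zeros
        if key not in lookup:
            continue  # leave the gap when the ZIP is unknown
        city, state = lookup[key]
        if pd.isnull(out.loc[label, "City"]):
            out.loc[label, "City"] = city
        if pd.isnull(out.loc[label, "State"]):
            out.loc[label, "State"] = state
    return out


# Index labels deliberately differ from row positions.
toy = pd.DataFrame(
    {
        "City": ["ALPHA", None, None],
        "State": ["AA", None, "BB"],
        "Zip": [12345, 1234, 99999],
    },
    index=[10, 20, 30],
)
filled = fill_city_state(toy, {"01234": ("Sampletown", "ZZ")})
print(filled)
```

Because the rows are addressed by label rather than by position, the same approach keeps working after rows have been dropped and the index is no longer contiguous.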

In [8]:
missing_city_state_rows = apps_ver2[(apps_ver2['City'].isnull()) | (apps_ver2['State'].isnull())]
apps_ver3 = fill_missing_city_state_values(apps_ver2, missing_city_state_rows)

In [9]:
apps_ver3.isnull().sum()

LoanNr_ChkDgt             0
Name                     14
City                      2
State                     1
Zip                       0
Bank                   1559
BankState              1566
NAICS                     0
ApprovalDate              0
ApprovalFY                0
Term                      0
NoEmp                     0
NewExist                136
CreateJob                 0
RetainedJob               0
FranchiseCode             0
UrbanRural                0
RevLineCr              4528
LowDoc                 2582
ChgOffDate           736465
DisbursementDate       2368
DisbursementGross         0
BalanceGross              0
MIS_Status             1997
ChgOffPrinGr              0
GrAppv                    0
SBA_Appv                  0
dtype: int64

There are still 2 missing city values and 1 missing state value. Those rows are shown below.

In [10]:
apps_ver3[(apps_ver3['City'].isnull()) | (apps_ver3['State'].isnull())]

Unnamed: 0,LoanNr_ChkDgt,Name,City,State,Zip,Bank,BankState,NAICS,ApprovalDate,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,ChgOffDate,DisbursementDate,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv
49244,1380800010,TRYON COATS & LEATHER,JOHNSTOWN NY,,0,KEYBANK NATIONAL ASSOCIATION,NY,0,18-May-66,1966,282,0,0.0,0,0,0,0,N,N,29-Mar-90,16-Aug-66,"$60,000.00",$0.00,CHGOFF,"$6,084.00","$60,000.00","$54,000.00"
437804,4247612009,DOUBLE W MOLD INC,,MA,2401,,,0,3-Aug-81,1981,180,21,1.0,0,0,0,0,N,N,,4-Nov-81,"$38,000.00",$0.00,P I F,$0.00,"$38,000.00","$38,000.00"
437818,4247702001,GEMCO NARROW FABRICS INC,,MA,2165,BAY COLONY DEVEL CORP,MA,0,18-Sep-81,1981,180,50,1.0,0,0,0,0,N,N,,10-Feb-82,"$95,000.00",$0.00,P I F,$0.00,"$95,000.00","$95,000.00"


These 3 observations are likely data entry errors from very old applications, so these rows will be dropped.

In [11]:
apps_ver4 = apps_ver3.dropna(axis = 0, subset=['City', 'State'])

In [12]:
apps_ver4.isnull().sum()

LoanNr_ChkDgt             0
Name                     14
City                      0
State                     0
Zip                       0
Bank                   1558
BankState              1565
NAICS                     0
ApprovalDate              0
ApprovalFY                0
Term                      0
NoEmp                     0
NewExist                136
CreateJob                 0
RetainedJob               0
FranchiseCode             0
UrbanRural                0
RevLineCr              4528
LowDoc                 2582
ChgOffDate           736463
DisbursementDate       2368
DisbursementGross         0
BalanceGross              0
MIS_Status             1997
ChgOffPrinGr              0
GrAppv                    0
SBA_Appv                  0
dtype: int64

The next step is to drop the four columns (*LoanNr_ChkDgt*, *Name*, *Bank*, *BankState*) that won't provide sufficient value to this analysis.

*ApprovalDate*, *DisbursementDate*, and *ApprovalFY* are largely redundant; for simplicity, *ApprovalFY* will be retained and the other two columns will be dropped. *ApprovalFY* will also be converted from an object column to an integer year.

In [13]:
apps_ver5 = apps_ver4.drop(columns=['LoanNr_ChkDgt', 'Name', 'Bank', 'BankState', 'ApprovalDate', 'DisbursementDate'])

In [14]:
# Verifying columns have been removed

apps_ver5.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 899161 entries, 0 to 899163
Data columns (total 21 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   City               899161 non-null  object 
 1   State              899161 non-null  object 
 2   Zip                899161 non-null  int64  
 3   NAICS              899161 non-null  int64  
 4   ApprovalFY         899161 non-null  object 
 5   Term               899161 non-null  int64  
 6   NoEmp              899161 non-null  int64  
 7   NewExist           899025 non-null  float64
 8   CreateJob          899161 non-null  int64  
 9   RetainedJob        899161 non-null  int64  
 10  FranchiseCode      899161 non-null  int64  
 11  UrbanRural         899161 non-null  int64  
 12  RevLineCr          894633 non-null  object 
 13  LowDoc             896579 non-null  object 
 14  ChgOffDate         162698 non-null  object 
 15  DisbursementGross  899161 non-null  object 
 16  Ba

In [15]:
# Inspecting ApprovalFY values before converting the column from object
apps_ver5['ApprovalFY'].value_counts()


2005     77525
2006     76040
2007     71876
2004     68290
2003     58193
1995     45758
2002     44391
1996     40112
2008     39540
1997     37748
2000     37381
1999     37363
2001     37350
1998     36016
1994     31598
1993     23305
1992     20885
2009     19126
2010     16848
1991     15666
1990     14859
1989     13248
2011     12608
2012      5997
2013      2458
1987      2218
1986      2118
1984      2022
1985      1944
1988      1898
1983      1684
1982       719
1981       628
1980       477
1979       352
2014       268
1978       242
1977       137
1976        66
1973        52
1974        42
1975        30
1972        27
1971        20
1976A       18
1970         8
1969         4
1968         2
1967         2
1962         1
1965         1
Name: ApprovalFY, dtype: int64

The *ApprovalFY* feature has an entry that won't convert to datetime: 1976A. This is likely a data entry error, so '1976A' will be replaced with '1976'.

In [16]:
apps_ver5['ApprovalFY'] = apps_ver5['ApprovalFY'].replace(['1976A'], '1976')
apps_ver5['ApprovalFY'] = pd.to_datetime(apps_ver5['ApprovalFY']).dt.year
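Since only the year is kept, the datetime round trip is not strictly required. As a hedged alternative, assuming every value is a string that starts with a four-digit year (which the `object` dtype and the value counts suggest), the year can be extracted directly. The `fy` series below is toy data:

```python
import pandas as pd

# Toy stand-in for the ApprovalFY column, including the stray '1976A' entry.
fy = pd.Series(["1997", "1976A", "2005", "1976"])

# Keep the leading four digits, then convert to a plain integer year.
fy_clean = fy.str.extract(r"^(\d{4})", expand=False).astype(int)
print(fy_clean.tolist())
```

This handles any other suffixed year in the same way, without listing each bad value by hand.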

In [17]:
# Verifying changes
apps_ver5['ApprovalFY'].value_counts()

2005    77525
2006    76040
2007    71876
2004    68290
2003    58193
1995    45758
2002    44391
1996    40112
2008    39540
1997    37748
2000    37381
1999    37363
2001    37350
1998    36016
1994    31598
1993    23305
1992    20885
2009    19126
2010    16848
1991    15666
1990    14859
1989    13248
2011    12608
2012     5997
2013     2458
1987     2218
1986     2118
1984     2022
1985     1944
1988     1898
1983     1684
1982      719
1981      628
1980      477
1979      352
2014      268
1978      242
1977      137
1976       84
1973       52
1974       42
1975       30
1972       27
1971       20
1970        8
1969        4
1968        2
1967        2
1962        1
1965        1
Name: ApprovalFY, dtype: int64

In [18]:
apps_ver5.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 899161 entries, 0 to 899163
Data columns (total 21 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   City               899161 non-null  object 
 1   State              899161 non-null  object 
 2   Zip                899161 non-null  int64  
 3   NAICS              899161 non-null  int64  
 4   ApprovalFY         899161 non-null  int64  
 5   Term               899161 non-null  int64  
 6   NoEmp              899161 non-null  int64  
 7   NewExist           899025 non-null  float64
 8   CreateJob          899161 non-null  int64  
 9   RetainedJob        899161 non-null  int64  
 10  FranchiseCode      899161 non-null  int64  
 11  UrbanRural         899161 non-null  int64  
 12  RevLineCr          894633 non-null  object 
 13  LowDoc             896579 non-null  object 
 14  ChgOffDate         162698 non-null  object 
 15  DisbursementGross  899161 non-null  object 
 16  Ba

The next feature is NAICS. In the [data dictionary](../references/data_dictionary.md), there is a table that describes what these values mean. A NAICS code is an industry classification code, and its first two digits identify the sector. The full code is more granular than this study needs, but grouping companies by sector could be very useful. So a new column, *NAICS_sectors*, will be created from the first two digits of each code, and the NAICS column will be dropped.

It should be noted that not every record has a NAICS code, so there will be plenty of 0 values for companies that do not have one.

In [19]:
apps_ver5['NAICS_sectors'] = apps_ver5['NAICS'].astype(str).str[:2].astype(int)
apps_ver5['NAICS_sectors'].value_counts()

0     201945
44     84737
81     72618
54     68170
72     67600
23     66646
62     55366
42     48743
45     42514
33     38284
56     32685
48     20310
32     17936
71     14640
53     13632
31     11809
51     11379
52      9496
11      9005
61      6425
49      2221
21      1851
22       663
55       257
92       229
Name: NAICS_sectors, dtype: int64
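One detail worth keeping in mind for later grouping: NAICS assigns the prefixes 31-33 to Manufacturing, 44-45 to Retail Trade, and 48-49 to Transportation and Warehousing, so several of the two-digit values above belong to the same sector. The sketch below shows one optional way to merge them. The `MERGED_PREFIXES` mapping and the sample codes are illustrative assumptions, not a step taken in this notebook.

```python
import pandas as pd

# Two-digit prefixes folded into the first prefix of their shared sector.
MERGED_PREFIXES = {32: 31, 33: 31, 45: 44, 49: 48}

# Sample codes; 0 stands for a record without a NAICS code.
naics = pd.Series([451120, 722410, 0, 332710, 484110])

# Same two-digit truncation as the notebook cell, followed by the merge.
prefix = naics.astype(str).str[:2].astype(int)
sector = prefix.replace(MERGED_PREFIXES)
print(sector.tolist())
```

Whether to merge is a modeling choice; leaving the prefixes separate, as the notebook does, keeps slightly more detail at the cost of splitting three sectors across several values.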

The next features, *Term*, *NoEmp*, *CreateJob*, and *RetainedJob*, have no missing values and are integer-based entries. These features are good to go until it is time to investigate outliers.

The next feature to investigate with missing data is *NewExist*.

In [20]:
print('Missing Values: ', apps_ver5['NewExist'].isnull().sum())
apps_ver5['NewExist'].value_counts()

Missing Values:  136


1.0    644867
2.0    253125
0.0      1033
Name: NewExist, dtype: int64

The [data dictionary](../references/data_dictionary.md) states that entries labeled with a 1 are an existing business and those labeled 2 are a new business. An assumption can be made that those labeled 0 are undefined; therefore, the 136 missing values can be relabeled 0 for this study.

In [21]:
apps_ver5['NewExist'] = apps_ver5['NewExist'].fillna(0)
apps_ver5.isnull().sum()

City                      0
State                     0
Zip                       0
NAICS                     0
ApprovalFY                0
Term                      0
NoEmp                     0
NewExist                  0
CreateJob                 0
RetainedJob               0
FranchiseCode             0
UrbanRural                0
RevLineCr              4528
LowDoc                 2582
ChgOffDate           736463
DisbursementGross         0
BalanceGross              0
MIS_Status             1997
ChgOffPrinGr              0
GrAppv                    0
SBA_Appv                  0
NAICS_sectors             0
dtype: int64

Next, the values 0, 1, and 2 are relabeled as unknown, existing_business, and new_business, respectively.

In [22]:
def replace_values(df, col, valueList):
    """
    Function that replaces values in a dataframe column and returns the new Series without modifying df.
    The valueList needs to be a list of tuples as follows (oldVals, newVal).
    """
    result = df[col]
    for oldVals, newVal in valueList:
        # Series.replace returns a new Series, so df itself is left untouched
        result = result.replace(oldVals, newVal)
    return result

In [23]:
newExistVals = [(0, 'unknown'), (1, 'existing_business'), (2, 'new_business')]
apps_ver5['NewExist'] = replace_values(apps_ver5, 'NewExist', newExistVals)

apps_ver5['NewExist'].value_counts()

existing_business    644867
new_business         253125
unknown                1169
Name: NewExist, dtype: int64
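The same relabeling can also be written as a single dictionary lookup. This is a minimal sketch on toy data, with `new_exist` and `labels` as illustrative names, combining the fill and the relabel in one expression:

```python
import pandas as pd

# Toy stand-in for the NewExist column, including one missing value.
new_exist = pd.Series([1.0, 2.0, 0.0, None, 1.0])

labels = {0: "unknown", 1: "existing_business", 2: "new_business"}

# Fill gaps with 0, cast to int so the keys match, then look up each label.
relabeled = new_exist.fillna(0).astype(int).map(labels)
print(relabeled.value_counts())
```

One difference from `replace` is worth noting: `map` turns any value missing from the dictionary into NaN, which makes unexpected codes easy to spot.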

In [24]:
apps_ver5['FranchiseCode'].value_counts()

1        638554
0        208832
78760      3373
68020      1921
50564      1034
          ...  
24421         1
61615         1
81580         1
83876         1
15930         1
Name: FranchiseCode, Length: 2768, dtype: int64

Companies with a FranchiseCode of 1 or 0 are not franchises; the remainder are all franchises. Since each franchise has its own code, it is better to group all franchises under a single value, and likewise for non-franchise companies. A new column called *isFranchise* will be created with the values *not_franchise* and *franchise*. The *FranchiseCode* column will no longer be needed after this.

In [25]:
apps_ver5['isFranchise'] = apps_ver5['FranchiseCode'].apply(lambda x: 'not_franchise' if x == 0 or x == 1 
                                                            else 'franchise')
apps_ver5['isFranchise'].value_counts()

not_franchise    847386
franchise         51775
Name: isFranchise, dtype: int64
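On a frame of this size, a vectorized version of the same rule may run faster than a row-wise `apply`. The following sketch uses a toy `codes` series and assumes the same rule (0 or 1 means not a franchise):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the FranchiseCode column.
codes = pd.Series([1, 0, 78760, 68020, 1])

# isin() builds a boolean mask in one pass; np.where picks a label per row.
is_franchise = pd.Series(
    np.where(codes.isin([0, 1]), "not_franchise", "franchise"),
    index=codes.index,
)
print(is_franchise.value_counts())
```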

In [26]:
## UrbanRural coding: 1 = Urban, 2 = Rural, 0 = Undefined

apps_ver5['UrbanRural'].unique()

array([0, 1, 2])

*UrbanRural* will be updated in a similar manner as *NewExist*, using the same replace_values function. The values 0, 1, and 2 will become 'unknown', 'urban', and 'rural', respectively.

In [27]:
urbanRuralVals = [(0, 'unknown'), (1, 'urban'), (2, 'rural')]
apps_ver5['UrbanRural'] = replace_values(apps_ver5, 'UrbanRural', urbanRuralVals)

apps_ver5['UrbanRural'].value_counts()    

urban      470654
unknown    323164
rural      105343
Name: UrbanRural, dtype: int64

In [28]:
apps_ver5['RevLineCr'].value_counts()

N    420285
0    257602
Y    201397
T     15284
1        23
R        14
`        11
2         6
C         2
3         1
,         1
7         1
A         1
5         1
.         1
4         1
-         1
Q         1
Name: RevLineCr, dtype: int64

In [29]:
apps_ver5['LowDoc'].value_counts()

N    782819
Y    110335
0      1491
C       758
S       603
A       497
R        75
1         1
Name: LowDoc, dtype: int64

For both *RevLineCr* and *LowDoc*, the accepted values are Y for yes and N for no. Both columns therefore contain many data entry errors as well as missing values. To resolve this, every value other than Y or N, including missing values, will be reassigned to 0, the same placeholder originally used in *UrbanRural* and *NewExist*. The 0 value will simply be treated as undefined.

In [30]:
apps_ver5['RevLineCr_v2'] = apps_ver5['RevLineCr'].apply(lambda x: 0 if x != 'Y' and x != 'N' else x)
apps_ver5['LowDoc_v2'] = apps_ver5['LowDoc'].apply(lambda x: 0 if x != 'Y' and x != 'N' else x)
apps_ver5.isnull().sum()

City                      0
State                     0
Zip                       0
NAICS                     0
ApprovalFY                0
Term                      0
NoEmp                     0
NewExist                  0
CreateJob                 0
RetainedJob               0
FranchiseCode             0
UrbanRural                0
RevLineCr              4528
LowDoc                 2582
ChgOffDate           736463
DisbursementGross         0
BalanceGross              0
MIS_Status             1997
ChgOffPrinGr              0
GrAppv                    0
SBA_Appv                  0
NAICS_sectors             0
isFranchise               0
RevLineCr_v2              0
LowDoc_v2                 0
dtype: int64
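The "keep Y and N, collapse everything else" rule can also be expressed with `Series.where`. The sketch below is a variation rather than the notebook's method: it uses the string 'unknown' instead of 0 as the placeholder, which keeps the column a single type. The `rev` series is toy data.

```python
import pandas as pd

# Toy stand-in for RevLineCr with valid, invalid, and missing entries.
rev = pd.Series(["N", "0", "Y", "T", None, "1"])

# Keep values where the mask is True; replace the rest with the placeholder.
rev_clean = rev.where(rev.isin(["Y", "N"]), "unknown")
print(rev_clean.tolist())
```

Missing values fail the `isin` check, so they fall into the placeholder group along with the invalid codes.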

Time to do some cleanup. The following columns will no longer be needed: *NAICS*, *FranchiseCode*, *RevLineCr*, and *LowDoc*. *ChgOffDate* will also likely not be needed in future analysis, as it is only populated **after** the applicant has defaulted. Therefore, it will be removed as well.

In [31]:
apps_ver6 = apps_ver5.drop(columns=['NAICS', 'FranchiseCode', 'RevLineCr', 'LowDoc', 'ChgOffDate'])
apps_ver6.head()

Unnamed: 0,City,State,Zip,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,UrbanRural,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv,NAICS_sectors,isFranchise,RevLineCr_v2,LowDoc_v2
0,EVANSVILLE,IN,47711,1997,84,4,new_business,0,0,unknown,"$60,000.00",$0.00,P I F,$0.00,"$60,000.00","$48,000.00",45,not_franchise,N,Y
1,NEW PARIS,IN,46526,1997,60,2,new_business,0,0,unknown,"$40,000.00",$0.00,P I F,$0.00,"$40,000.00","$32,000.00",72,not_franchise,N,Y
2,BLOOMINGTON,IN,47401,1997,180,7,existing_business,0,0,unknown,"$287,000.00",$0.00,P I F,$0.00,"$287,000.00","$215,250.00",62,not_franchise,N,N
3,BROKEN ARROW,OK,74012,1997,60,2,existing_business,0,0,unknown,"$35,000.00",$0.00,P I F,$0.00,"$35,000.00","$28,000.00",0,not_franchise,N,Y
4,ORLANDO,FL,32801,1997,240,14,existing_business,7,7,unknown,"$229,000.00",$0.00,P I F,$0.00,"$229,000.00","$229,000.00",0,not_franchise,N,N


Two more things to clear up: converting the currency columns to usable floats, and determining what to do with the null values in MIS_Status. The following function will be used to convert the currency to floats.

In [32]:
def convertCurrencyToFloat(df, columns):
    """Takes in a dataframe and copies it to a temporary data frame. The currency columns of the temp data frame
    are converted to float columns, and the temp data frame is returned as the new data frame to be used."""
    temp_df = df.copy()  # copy so the input data frame is left unchanged
    for column in columns:
        # Raw string avoids an invalid-escape warning; the pattern strips '$' and ','
        temp_df[column] = temp_df[column].replace(r'[\$,]', '', regex=True).astype(float)
    return temp_df
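To see what the conversion does to individual values, here is a minimal sketch on a toy series whose strings follow the format shown in the `head()` output above. The `amounts` name is illustrative.

```python
import pandas as pd

# Currency strings in the same format as the raw data.
amounts = pd.Series(["$60,000.00", "$0.00", "$287,000.00"])

# Strip the dollar sign and thousands separators, then cast to float.
as_float = amounts.str.replace(r"[$,]", "", regex=True).astype(float)
print(as_float.tolist())
```

Inside a character class the `$` is a literal character, so no escaping is needed in this pattern.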

In [33]:
currencyColumns = ['DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv', 'ChgOffPrinGr']
apps_ver7 = convertCurrencyToFloat(apps_ver6, currencyColumns)
apps_ver7.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 899161 entries, 0 to 899163
Data columns (total 20 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   City               899161 non-null  object 
 1   State              899161 non-null  object 
 2   Zip                899161 non-null  int64  
 3   ApprovalFY         899161 non-null  int64  
 4   Term               899161 non-null  int64  
 5   NoEmp              899161 non-null  int64  
 6   NewExist           899161 non-null  object 
 7   CreateJob          899161 non-null  int64  
 8   RetainedJob        899161 non-null  int64  
 9   UrbanRural         899161 non-null  object 
 10  DisbursementGross  899161 non-null  float64
 11  BalanceGross       899161 non-null  float64
 12  MIS_Status         897164 non-null  object 
 13  ChgOffPrinGr       899161 non-null  float64
 14  GrAppv             899161 non-null  float64
 15  SBA_Appv           899161 non-null  float64
 16  NA

In [34]:
apps_ver7[apps_ver7['MIS_Status'].isnull()]

Unnamed: 0,City,State,Zip,ApprovalFY,Term,NoEmp,NewExist,CreateJob,RetainedJob,UrbanRural,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv,NAICS_sectors,isFranchise,RevLineCr_v2,LowDoc_v2
343,Saratoga Springs,NY,12866,1998,60,3,existing_business,0,0,unknown,474.0,0.0,,0.0,30000.0,15000.0,0,not_franchise,0,N
611,DANIELSON,CT,6239,1980,180,30,new_business,0,0,unknown,0.0,0.0,,144461.0,300000.0,300000.0,0,not_franchise,N,N
738,BOISE,ID,83703,1997,60,1,existing_business,0,0,unknown,2585.0,0.0,,0.0,10000.0,5000.0,0,not_franchise,0,N
740,SIOUX CITY,IA,51111,1980,120,3,existing_business,0,0,unknown,0.0,0.0,,142666.0,350000.0,350000.0,0,not_franchise,N,N
833,HUNTINGTON,NY,11743,2003,84,1,new_business,0,0,unknown,1276.0,0.0,,0.0,25000.0,12500.0,42,not_franchise,N,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
892340,ALAMEDA,CA,94605,1997,84,1,existing_business,0,0,unknown,0.0,0.0,,0.0,50000.0,25000.0,0,not_franchise,0,N
893791,SAN GABRIEL,CA,91776,1997,84,4,existing_business,0,0,unknown,373.0,0.0,,0.0,25000.0,12500.0,0,not_franchise,0,N
894290,PORTLAND,OR,97206,1997,84,1,existing_business,0,0,unknown,20.0,0.0,,0.0,10000.0,5000.0,0,not_franchise,0,N
896318,RICHMOND,VA,23225,1997,36,1,new_business,0,0,unknown,3500.0,0.0,,0.0,3500.0,2800.0,0,not_franchise,N,Y


In [35]:
apps_ver7.shape

(899161, 20)

In [36]:
1997 / 899161 * 100

0.22209593165183988

There are 1997 (0.22%) observations that do not have their MIS_Status filled out. Since they make up such a small share of the dataset, these observations will be dropped, and the remaining values will be changed from 'P I F' and 'CHGOFF' to 'paid' and 'default', respectively.

In [37]:
apps_ver7['MIS_Status_v2'] = apps_ver7['MIS_Status'].dropna().apply(lambda x: 'paid' if x == 'P I F' else 'default')
apps_ver8 = apps_ver7.dropna(axis=0, subset=['MIS_Status_v2'])

In [38]:
apps_ver8['MIS_Status_v2'].value_counts()

paid       739607
default    157557
Name: MIS_Status_v2, dtype: int64
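The lambda above treats every value other than 'P I F' as a default. As a hedged alternative, an explicit mapping makes any unexpected status show up as NaN instead of being counted as a default. The `status` series and the `STATUS_LABELS` name below are illustrative assumptions:

```python
import pandas as pd

# Toy stand-in for MIS_Status, including one missing value.
status = pd.Series(["P I F", "CHGOFF", None, "P I F"])

STATUS_LABELS = {"P I F": "paid", "CHGOFF": "default"}

# Share of rows without a status, as a percentage.
pct_missing = status.isnull().mean() * 100

# Drop the missing rows first, then relabel with an explicit mapping.
target = status.dropna().map(STATUS_LABELS)
print(pct_missing, target.tolist())
```

With only two status values in this dataset, both approaches give the same result; the mapping is simply stricter about what counts as a default.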

In [39]:
apps_ver8.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 897164 entries, 0 to 899163
Data columns (total 21 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   City               897164 non-null  object 
 1   State              897164 non-null  object 
 2   Zip                897164 non-null  int64  
 3   ApprovalFY         897164 non-null  int64  
 4   Term               897164 non-null  int64  
 5   NoEmp              897164 non-null  int64  
 6   NewExist           897164 non-null  object 
 7   CreateJob          897164 non-null  int64  
 8   RetainedJob        897164 non-null  int64  
 9   UrbanRural         897164 non-null  object 
 10  DisbursementGross  897164 non-null  float64
 11  BalanceGross       897164 non-null  float64
 12  MIS_Status         897164 non-null  object 
 13  ChgOffPrinGr       897164 non-null  float64
 14  GrAppv             897164 non-null  float64
 15  SBA_Appv           897164 non-null  float64
 16  NA

The **MIS_Status** column is no longer needed and can be dropped.

In [40]:
apps_ver9 = apps_ver8.drop(columns=['MIS_Status'])

In [41]:
apps_ver9.isnull().sum()

City                 0
State                0
Zip                  0
ApprovalFY           0
Term                 0
NoEmp                0
NewExist             0
CreateJob            0
RetainedJob          0
UrbanRural           0
DisbursementGross    0
BalanceGross         0
ChgOffPrinGr         0
GrAppv               0
SBA_Appv             0
NAICS_sectors        0
isFranchise          0
RevLineCr_v2         0
LowDoc_v2            0
MIS_Status_v2        0
dtype: int64

The dataset is finally clean and is ready for some exploration. In the EDA phase, relationships between the features will be explored to determine which features appear to have the strongest influence on the **MIS_Status_v2** feature. 

In [42]:
datapath = '../data/interim'
save_file(apps_ver9, 'sba_national_summary.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)Y
Writing file.  "../data/interim/sba_national_summary.csv"
