# Students Do: Encoding Categorical Data for Machine Learning

In this activity, you will encode some categorical and text features of a dataset that contains `2097` loan applications. In forthcoming activities, you will use this dataset to predict defaulted loan applications.

## Dataset Description

The data provided is based on the dataset used in the research paper [_“Should This Loan be Approved or Denied?”: A Large Dataset with Class Assignment Guidelines_](https://doi.org/10.1080/10691898.2018.1434342), published by Min Li, Amy Mickel & Stanley Taylor from California State University in the Journal of Statistics Education.

This dataset contains information about loan applications managed by the U.S. Small Business Administration (SBA) and was adapted for today's class. The dataset is distributed under the [Creative Commons (CC BY-SA 4.0) license](https://creativecommons.org/licenses/by-sa/4.0/).

The columns in the dataset are the following:

* `Year`: The fiscal year of the loan application.
* `Month`: Month of the fiscal year.
* `Amount`: The loan amount issued.
* `Term`: The loan's term in months.
* `Bank`: Name of the bank that issued the loan.
* `State`: Borrower state.
* `City`: Borrower city.
* `Zip`: Borrower ZIP code.
* `CreateJob`: Number of jobs created using the loan.
* `NoEmp`: Number of business employees.
* `RealEstate`: Indicates whether the loan is backed by real estate.
* `RevLineCr`: Indicates if it's a revolving line of credit.
* `UrbanRural`: Location type of the borrower.
* `Default`: Indicates if the loan was defaulted (`1`) or not (`0`).

In [2]:
# Initial imports
import pandas as pd
from pathlib import Path
import calendar
from sklearn.preprocessing import LabelEncoder

## Loading the Data

Load the `sba_loans.csv` data into a Pandas DataFrame. Show the `head` to get familiar with the columns and data values.

In [3]:
# Read in the data
file_path = Path("../Resources/sba_loans.csv")
loans_df = pd.read_csv(file_path)
loans_df.head()

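| Unnamed: 0 | Year | Month | Amount | Term | Bank | State | City | Zip | CreateJob | NoEmp | RealEstate | RevLineCr | UrbanRural | Default |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2001 | November | 32812 | 36 | CALIFORNIA BANK & TRUST | CA | ANAHEIM | 92801 | 0 | 1 | No | Y | Rural | 0 |
| 1 | 2001 | April | 30000 | 56 | CALIFORNIA BANK & TRUST | CA | TORRANCE | 90505 | 0 | 1 | No | Y | Rural | 0 |
| 2 | 2001 | April | 30000 | 36 | CALIFORNIA BANK & TRUST | CA | SAN DIEGO | 92103 | 0 | 10 | No | Y | Rural | 0 |
| 3 | 2003 | October | 50000 | 36 | CALIFORNIA BANK & TRUST | CA | SAN DIEGO | 92108 | 0 | 6 | No | Y | Rural | 0 |
| 4 | 2006 | July | 343000 | 240 | SBA - EDF ENFORCEMENT ACTION | CA | LOS ANGELES | 91345 | 3 | 65 | Yes | N | Urban | 0 |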
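
Before encoding, it can help to list which columns hold text and how many distinct values each one has. The following is a minimal sketch on a toy DataFrame that mirrors a few columns from the preview; `sample_df`, `text_columns`, and `unique_counts` are illustrative names, not part of the activity's starter code.

```python
import pandas as pd

# Toy rows mirroring a few columns from the preview (not the full dataset)
sample_df = pd.DataFrame(
    {
        "Month": ["November", "April", "July"],
        "Amount": [32812, 30000, 343000],
        "RealEstate": ["No", "No", "Yes"],
        "UrbanRural": ["Rural", "Rural", "Urban"],
    }
)

# Non-numeric columns are the ones that need encoding before modeling
text_columns = sample_df.select_dtypes(exclude="number").columns.tolist()
print(text_columns)

# Distinct values per text column, useful when choosing an encoding strategy
unique_counts = sample_df[text_columns].nunique()
print(unique_counts)
```

Columns with only a few distinct values are usually easy to integer-encode, while columns with many distinct values produce many new columns when one-hot encoded.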


## Integer Encoding

### Manual Integer Encoding

Perform a manual integer encoding of the `Month` column. Use a dictionary to map month names to their corresponding numerical values.

In [4]:
# Months dictionary
name_to_num = {name: num for num, name in enumerate(calendar.month_name) if num}
name_to_num

{'January': 1,
 'February': 2,
 'March': 3,
 'April': 4,
 'May': 5,
 'June': 6,
 'July': 7,
 'August': 8,
 'September': 9,
 'October': 10,
 'November': 11,
 'December': 12}

In [None]:
# Encode month name
loans_df["Month"] = loans_df["Month"].apply(lambda x: name_to_num[x])
loans_df.head()
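
As a possible alternative to `apply` with a lambda, a dictionary can be passed directly to `Series.map`. The sketch below uses a toy Series rather than `loans_df`; the names `months`, `encoded`, and `unmatched` are illustrative, and the month names assume the default English locale for `calendar.month_name`.

```python
import calendar

import pandas as pd

# Same month-name dictionary as in the activity
name_to_num = {name: num for num, name in enumerate(calendar.month_name) if num}

# Toy values standing in for the Month column
months = pd.Series(["November", "April", "July"])
encoded = months.map(name_to_num)
print(encoded.tolist())

# Unlike a dictionary lookup inside apply, map yields NaN for unknown keys
# instead of raising a KeyError
unmatched = pd.Series(["April", "NotAMonth"]).map(name_to_num)
print(unmatched.isna().tolist())
```

The trade-off is that `map` fails silently on misspelled month names, so checking for missing values after the mapping is a reasonable precaution.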

### Encoding Data using `LabelEncoder`

Use the `LabelEncoder` class from `sklearn` to perform an integer encoding of the `RealEstate`, `RevLineCr` and `UrbanRural` columns.

In [None]:
# Create the LabelEncoder instance
le = LabelEncoder()

In [None]:
# Fitting and encoding the columns with the LabelEncoder

# Encoding RealEstate column
le.fit(loans_df["RealEstate"])
loans_df["RealEstate"] = le.transform(loans_df["RealEstate"])

# Encoding RevLineCr column
le.fit(loans_df["RevLineCr"])
loans_df["RevLineCr"] = le.transform(loans_df["RevLineCr"])

# Encoding UrbanRural column
le.fit(loans_df["UrbanRural"])
loans_df["UrbanRural"] = le.transform(loans_df["UrbanRural"])

# Preview the DataFrame
loans_df.head()
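
Because a single `LabelEncoder` is refit for each column, only the mapping of the last column remains available afterwards. One way to keep every mapping, sketched here on toy data, is to store one encoder per column; `toy_df` and `encoders` are hypothetical names introduced for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for two of the columns encoded in the activity
toy_df = pd.DataFrame(
    {"RealEstate": ["No", "No", "Yes"], "UrbanRural": ["Rural", "Urban", "Rural"]}
)

# Keep one fitted encoder per column so each mapping can be inspected later
encoders = {}
for column in ["RealEstate", "UrbanRural"]:
    encoder = LabelEncoder()
    toy_df[column] = encoder.fit_transform(toy_df[column])
    encoders[column] = encoder

# classes_ lists the original labels; a label's position is its integer code
print(list(encoders["RealEstate"].classes_))
print(toy_df["UrbanRural"].tolist())

# inverse_transform recovers the original labels from the integer codes
print(list(encoders["UrbanRural"].inverse_transform([0, 1])))
```

`LabelEncoder` assigns codes in sorted label order, so the integers carry no meaning beyond identifying the category.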

### Encoding Data using `get_dummies()`

Perform a one-hot encoding (one binary indicator column per category) of the `Bank`, `State` and `City` columns using the Pandas `get_dummies()` function.

In [None]:
# Encoding the Bank, State and City columns
loans_df = pd.get_dummies(loans_df, columns=["Bank", "State", "City"])
loans_df.head()
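
To see what `get_dummies()` produces, the following sketch applies it to a small toy DataFrame. The `NY` and `ALBANY` values are invented for illustration and are not taken from the dataset; `toy_df` and `encoded_df` are illustrative names.

```python
import pandas as pd

# Toy data; the NY / ALBANY row is invented to give State two categories
toy_df = pd.DataFrame(
    {
        "Amount": [32812, 30000, 343000],
        "State": ["CA", "CA", "NY"],
        "City": ["ANAHEIM", "TORRANCE", "ALBANY"],
    }
)

# Each listed column is replaced by one indicator column per distinct value,
# named <column>_<value>
encoded_df = pd.get_dummies(toy_df, columns=["State", "City"])
print(encoded_df.columns.tolist())
print(encoded_df.shape)

# Exactly one State indicator is set in every row
state_flags = encoded_df[["State_CA", "State_NY"]].sum(axis=1)
print(state_flags.tolist())
```

The number of new columns equals the number of distinct values, so high-cardinality columns such as `City` or `Bank` can widen the DataFrame considerably.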

## Save the Preprocessed File

Finally, save the preprocessed file as `sba_loans_encoded.csv` for use in forthcoming activities.

In [None]:
# Save the file for use in forthcoming activities
file_path = Path("../Resources/sba_loans_encoded.csv")
loans_df.to_csv(file_path, index=False)
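
A quick round trip can confirm that a saved file reloads as expected. This sketch writes a toy DataFrame to a temporary directory instead of `../Resources`, so the path and the names `toy_df`, `out_path`, and `reloaded_df` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the encoded DataFrame
toy_df = pd.DataFrame({"Month": [11, 4], "RealEstate": [0, 1], "Default": [0, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "toy_encoded.csv"
    # index=False keeps the row index out of the file
    toy_df.to_csv(out_path, index=False)
    reloaded_df = pd.read_csv(out_path)

# No extra index column appears, and the values match the original
print(reloaded_df.columns.tolist())
print(reloaded_df.equals(toy_df))
```

Saving without `index=False` writes the row index as an extra unnamed column, which then shows up as `Unnamed: 0` when the file is read back.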