# Goals
The goal of this project is to mimic the full data science life cycle, from data cleaning and feature selection to machine learning. 

We will focus on credit modelling, a well-known data science problem that focuses on modeling a borrower's [credit risk](https://en.wikipedia.org/wiki/Credit_risk).

## Background Information
(source: Dataquest.io)

Credit has played a key role in the economy for centuries and some form of credit has existed since the beginning of commerce. We'll be working with financial lending data from [Lending Club](https://www.lendingclub.com/). Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return. You can read more about their marketplace [here](https://www.lendingclub.com/public/how-peer-lending-works.action).

Each borrower fills out a comprehensive application, providing their past financial history, the reason for the loan, and more. Lending Club evaluates each borrower's credit score using past historical data (and their own data science process!) and assigns an interest rate to the borrower. The interest rate is the percentage the borrower has to pay back in addition to the requested loan amount. You can read more about the interest rate that Lending Club assigns [here](https://www.lendingclub.com/public/borrower-rates-and-fees.action). Lending Club also tries to verify each piece of information the borrower provides, but it can't always verify all of the information (usually for regulatory reasons).

A higher interest rate means that the borrower is riskier and less likely to pay back the loan, while a lower interest rate means that the borrower has a good credit history and is more likely to pay back the loan. The interest rates range from 5.32% all the way up to 30.99% and each borrower is given a [grade](https://www.lendingclub.com/public/rates-and-fees.action) according to the interest rate they were assigned. If the borrower accepts the interest rate, then the loan is listed on the Lending Club marketplace.

Investors are primarily interested in receiving a return on their investments. Approved loans are listed on the Lending Club website, where qualified investors can browse recently approved loans, the borrower's credit score, the purpose for the loan, and other information from the application. Once they're ready to back a loan, they select the amount of money they want to fund. Once a loan's requested amount is fully funded, the borrower receives the money they requested minus the origination fee that Lending Club charges.

The borrower then makes monthly payments back to Lending Club over either 36 months or 60 months. Lending Club redistributes these payments to the investors. This means that investors don't have to wait until the full amount is paid off to start to see money back. If a loan is fully paid off on time, the investors make a return which corresponds to the interest rate the borrower had to pay in addition to the requested amount. Many loans aren't completely paid off on time, however, and some borrowers default on the loan.

While Lending Club has to be extremely savvy and rigorous with their credit modelling, investors on Lending Club need to be equally savvy about determining which loans are more likely to be paid off. At first, you may wonder why investors would put money into anything but low interest loans. The incentive investors have to back higher interest loans is, well, the higher interest! If investors believe the borrower can pay back the loan, even if he or she has a weak financial history, then investors can make more money through the larger additional amount the borrower has to pay. Most investors use a portfolio strategy to invest small amounts in many loans, with healthy mixes of low, medium, and high interest loans.

In this project, we'll **focus on the mindset of a conservative investor who only wants to invest in the loans that have a good chance of being paid off on time**. To do this, we'll need to first understand the features in the dataset and then experiment with building machine learning models that reliably predict whether a loan will be paid off.

## The Data
(source: Dataquest.io)

Lending Club releases data for all of the approved and declined loan applications periodically on their [website](https://www.lendingclub.com/info/download-data.action). Any user can select a few different year ranges to download the datasets (in CSV format) for both approved and declined loans.

Towards the bottom of the page, you'll also find a [data dictionary](https://docs.google.com/spreadsheets/d/191B2yJ4H1ZPXq0_ByhUgWMFZOYem5jFz0Y3by_7YBY4/edit) (in XLS format) which contains information on the different column names.

Before diving into the datasets themselves, let's get familiar with the data dictionary. The **LoanStats** sheet describes the approved loans datasets and the **RejectStats** sheet describes the rejected loans datasets. Since rejected applications don't appear on the Lending Club marketplace and aren't available for investment, we'll be focusing on data for the approved loans only.

The approved loans datasets contain information on current loans, completed loans, and defaulted loans. 

# Defining the Problem
Can we build a machine learning model that can accurately predict whether a borrower will pay off their loan on time?

## Data and Feature Selection
(source: Dataquest.io)
Before we can start doing any machine learning, we need to define what features we want to use and which column represents the target we want to predict.

In this project, we'll focus on approved loans data from 2007 to 2011 since a good number of the loans have already finished. In the datasets for later years, many of the loans are current and still being paid off.

We have reduced the size of the dataset to make it easier to work with, by:
- removing the desc column:
    - which contains a long text explanation for each loan
- removing the url column:
    - which contains a link to each loan on Lending Club which can only be accessed with an investor account
- removing all columns containing more than 50% missing values:
    - which allows us to move faster since we can spend less time trying to fill these values
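
The reduction steps listed above could be reproduced along these lines. This is only a sketch on a made-up frame: the toy values and the `mostly_missing_col` column name are assumptions for illustration, and the original reduction script is not shown in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values and `mostly_missing_col` are made up for illustration.
raw = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0, 10000.0],
    "desc": ["text", "text", None, "text"],
    "url": ["u1", "u2", "u3", "u4"],
    "mostly_missing_col": [np.nan, np.nan, np.nan, 35.0],  # 75% missing
})

# Drop the long-text and link columns named in the list above.
reduced = raw.drop(columns=["desc", "url"])

# Fraction of missing values per column; keep columns at or below the 50% threshold.
missing_fraction = reduced.isnull().mean()
reduced = reduced.loc[:, missing_fraction <= 0.5]

print(list(reduced.columns))
```

Computing the missing fraction once and filtering with a boolean mask keeps the threshold explicit and easy to change.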
    
Let's start by reading in this dataset and exploring it.

In [1]:
import pandas as pd
loans = pd.read_csv("loans_2007.csv", low_memory = False)
# display the first 5 rows
loans.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,...,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,...,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,...,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,...,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,...,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [2]:
# data info: column names, and row/column count with data types
loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42538 entries, 0 to 42537
Data columns (total 52 columns):
id                            42538 non-null object
member_id                     42535 non-null float64
loan_amnt                     42535 non-null float64
funded_amnt                   42535 non-null float64
funded_amnt_inv               42535 non-null float64
term                          42535 non-null object
int_rate                      42535 non-null object
installment                   42535 non-null float64
grade                         42535 non-null object
sub_grade                     42535 non-null object
emp_title                     39909 non-null object
emp_length                    41423 non-null object
home_ownership                42535 non-null object
annual_inc                    42531 non-null float64
verification_status           42535 non-null object
issue_d                       42535 non-null object
loan_status                   42535 non-null object
p

This data contains 52 columns which we will need to dig through in order to select appropriate features. We can use the data dictionary to become familiar with what each column represents and look for any features that:

- leak information from the future (after the loan has already been funded)
- don't affect a borrower's ability to pay back a loan (e.g. a randomly generated ID value by Lending Club)
- are formatted poorly and need to be cleaned up
- require more data or a lot of processing to turn into a useful feature
- contain redundant information
- could be considered our target column

We need to pay special attention to data leakage, since it can cause our model to overfit. A model trained on leaked information relies on data about the target column that won't be available when we use the model on future loans.
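
To make the risk concrete, here is a minimal sketch on made-up numbers. It assumes a hypothetical `fully_paid` target and uses `total_rec_prncp` (principal received to date) as the leaked column; a trivial rule on that column reproduces the target exactly, even though the value would not exist when a new loan is listed.

```python
import pandas as pd

# Made-up loans; `fully_paid` is a hypothetical target column for illustration.
toy = pd.DataFrame({
    "loan_amnt":       [5000.0, 2500.0, 2400.0, 10000.0],
    "total_rec_prncp": [5000.0,  456.0, 2400.0,  3200.0],  # only known after repayment starts
    "fully_paid":      [1, 0, 1, 0],
})

# A "model" that just checks whether all of the principal has been received.
leaky_prediction = (toy["total_rec_prncp"] >= toy["loan_amnt"]).astype(int)

# Share of loans where the leaky rule matches the target.
accuracy = (leaky_prediction == toy["fully_paid"]).mean()
print(accuracy)
```

A score this good on historical data says nothing about performance on new applications, which is why such columns have to go.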

### Selection Details
After analyzing each column, we can conclude that the following features need to be removed:

- `id`: randomly generated field by Lending Club for unique identification purposes only
- `member_id`: also a randomly generated field by Lending Club for unique identification purposes only
- `funded_amnt`: leaks data from the future (after the loan has already started to be funded)
- `funded_amnt_inv`: also leaks data from the future (after the loan has already started to be funded)
- `grade`: contains information that is redundant with the interest rate column (`int_rate`)
- `sub_grade`: also contains information that is redundant with the interest rate column (`int_rate`)
- `emp_title`: requires other data and a lot of processing to potentially be useful
- `issue_d`: also leaks data from the future (after the loan has already been completely funded)
- `zip_code`: redundant with the `addr_state` column since only the first 3 digits of the 5-digit zip code are visible (which can only be used to identify the state the borrower lives in)
- `out_prncp`: also leaks data from the future (after the loan has already started to be paid off)
- `out_prncp_inv`: also leaks data from the future (after the loan has already started to be paid off)
- `total_pymnt`: also leaks data from the future (after the loan has already started to be paid off)
- `total_pymnt_inv`: also leaks data from the future (after the loan has already started to be paid off)
- `total_rec_prncp`: also leaks data from the future (after the loan has already started to be paid off)
- `total_rec_int`: also leaks data from the future (after the loan has already started to be paid off)
- `total_rec_late_fee`: also leaks data from the future (after the loan has already started to be paid off)
- `recoveries`: also leaks data from the future (after the loan has already started to be paid off)
- `collection_recovery_fee`: also leaks data from the future (after the loan has already started to be paid off)
- `last_pymnt_d`: also leaks data from the future (after the loan has already started to be paid off)
- `last_pymnt_amnt`: also leaks data from the future (after the loan has already started to be paid off)


In [3]:
col_to_remove = ['id', 'member_id', 'funded_amnt', 'funded_amnt_inv', 'grade', 'sub_grade', 'emp_title', 'issue_d',
                 'zip_code', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp',
                 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d',
                 'last_pymnt_amnt']
# pass the list of column names directly to drop()
data = loans.drop(col_to_remove, axis = 1)

In [4]:
# data check for 32 columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42538 entries, 0 to 42537
Data columns (total 32 columns):
loan_amnt                     42535 non-null float64
term                          42535 non-null object
int_rate                      42535 non-null object
installment                   42535 non-null float64
emp_length                    41423 non-null object
home_ownership                42535 non-null object
annual_inc                    42531 non-null float64
verification_status           42535 non-null object
loan_status                   42535 non-null object
pymnt_plan                    42535 non-null object
purpose                       42535 non-null object
title                         42522 non-null object
addr_state                    42535 non-null object
dti                           42535 non-null float64
delinq_2yrs                   42506 non-null float64
earliest_cr_line              42506 non-null object
inq_last_6mths                42506 non-null float64
o

### Choosing a Target Column
The `loan_status` column is the only column that directly describes whether a loan was paid off on time, had delayed payments, or was defaulted on by the borrower.

Currently, this column contains text values and we will need to convert it to a numerical one for training a model. Let's explore the different values in this column and come up with a strategy for converting them.

In [5]:
data['loan_status'].value_counts()

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64

There are 9 different possible values for the `loan_status` column.

|Loan Status|Count|Meaning|
|:--- |:--- |:--- |
|Fully Paid|33136|Loan has been fully paid off.|
|Charged Off|5634|Loan for which there is no longer a reasonable expectation of further payments.|
|Does not meet the credit policy. Status:Fully Paid|1988|While the loan was paid off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.|
|Does not meet the credit policy. Status:Charged Off|761|While the loan was charged off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.|
|In Grace Period|20|The loan is past due but still in the grace period of 15 days.|
|Late (16-30 days)|8|Loan hasn't been paid in 16 to 30 days (late on the current payment).|
|Late (31-120 days)|24|Loan hasn't been paid in 31 to 120 days (late on the current payment).|
|Current|961|Loan is up to date on current payments.|
|Default|3|Loan is defaulted on and no payment has been made for more than 121 days.|

(source: Dataquest.io)
From the investor's perspective, we're interested in trying to **predict which loans will be paid off on time and which ones won't be**. Only the `Fully Paid` and `Charged Off` values describe the final outcome of the loan. The other values describe loans that are still ongoing and where the jury is still out on whether the borrower will pay back the loan on time or not. While the `Default` status resembles the `Charged Off` status, in Lending Club's eyes, loans that are charged off have essentially no chance of being repaid while default ones have a small chance.

Since we're interested in being able to predict which of these 2 values (`Fully Paid` or `Charged Off`) a loan will fall under, we can treat the problem as a binary classification one.

First we'll remove all the loans that don't contain either `Fully Paid` or `Charged Off` as the loan's status and then transform the `Fully Paid` values to `1` for the positive case and the `Charged Off` values to `0` for the negative case.

One thing we need to keep in mind is the **class imbalance** between the positive and negative cases. While there are 33,136 loans that have been fully paid off, there are only 5,634 that were charged off. This class imbalance is a common problem in binary classification. During training, the model ends up with a strong bias towards predicting the class with more observations in the training set and will rarely predict the class with fewer observations. The stronger the imbalance, the more biased the model becomes.
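
A quick back-of-the-envelope calculation shows how strong the imbalance is. The sketch below uses the two counts from the `value_counts()` output above. The inverse-frequency weights follow the `n_samples / (n_classes * class_count)` heuristic that scikit-learn documents for `class_weight='balanced'`; weighting is shown here only as one possible mitigation, not necessarily the approach this project ends up using.

```python
# Counts taken from the value_counts() output above
fully_paid = 33136
charged_off = 5634
total = fully_paid + charged_off

# Accuracy of a naive model that predicts "Fully Paid" (1) for every loan
baseline_accuracy = fully_paid / total

# Inverse-frequency class weights: n_samples / (n_classes * class_count)
n_classes = 2
weights = {
    1: total / (n_classes * fully_paid),
    0: total / (n_classes * charged_off),
}

print(round(baseline_accuracy, 3))
print({label: round(weight, 2) for label, weight in weights.items()})
```

Because a model that always predicts the majority class already scores well on plain accuracy, accuracy alone is a weak yardstick for this problem.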

In [6]:
data = data[(data['loan_status'] == "Fully Paid") | (data['loan_status'] == "Charged Off")]

category_replace = {
    "loan_status" : {
        "Fully Paid": 1,
        "Charged Off": 0,
    }
}
data = data.replace(category_replace)

In [7]:
# data check
data['loan_status'].value_counts()

1    33136
0     5634
Name: loan_status, dtype: int64

Next, we'll look for any columns that contain only one unique value and remove them. These columns won't be useful for the model since they don't add any information to each loan application.

For each column, we will first drop the `nan` values and then check whether only one unique value remains. We will compile a list of those column names and use it to remove the columns from our dataframe.

In [8]:
drop_columns = []
original_columns = data.columns
for col in original_columns:
    col_series = data[col].dropna().unique()
    if len(col_series) == 1:
        drop_columns.append(col)
        
data = data.drop(drop_columns, axis = 1)

# Which columns were dropped?
print(drop_columns)


['pymnt_plan', 'initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']
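
For reference, the same check can be written more compactly with `DataFrame.nunique()`, which ignores missing values by default. This is an equivalent sketch on a made-up frame, not the code that produced the output above.

```python
import pandas as pd

# Made-up frame: `policy_code` is constant, `pymnt_plan` is constant apart from a missing value.
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],
    "policy_code": [1.0, 1.0, 1.0],
    "pymnt_plan": ["n", None, "n"],
})

# nunique() skips NaN/None by default, matching the dropna().unique() loop.
unique_counts = toy.nunique()
single_value_cols = unique_counts[unique_counts == 1].index.tolist()

toy = toy.drop(columns=single_value_cols)
print(single_value_cols)
```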


## Preparing The Features for Machine Learning
We'll prepare the data for machine learning by focusing on handling missing values, converting categorical columns to numeric columns, and removing any other extraneous columns we encounter throughout this process.

We will begin by checking for null values.

In [9]:
data.isnull().sum()

loan_amnt                  0
term                       0
int_rate                   0
installment                0
emp_length              1036
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
purpose                    0
title                     11
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                50
total_acc                  0
last_credit_pull_d         2
pub_rec_bankruptcies     697
dtype: int64

We have 5 columns that contain missing values. Three columns have fifty or fewer rows with missing values, and two columns, `emp_length` and `pub_rec_bankruptcies`, contain a relatively high number of missing values.

Domain knowledge tells us that employment length is frequently used in assessing how risky a potential borrower is, so we'll keep this column despite its relatively large amount of missing values (and only remove the rows with the missing values).

Let's take a look at the values in the `pub_rec_bankruptcies` column:

In [10]:
data['pub_rec_bankruptcies'].value_counts(normalize = True, dropna = False)

 0.0    0.939438
 1.0    0.042456
NaN     0.017978
 2.0    0.000129
Name: pub_rec_bankruptcies, dtype: float64

It looks like about 94% of the values in this column are a single value (0). This column does not have much variability and therefore may not offer much predictive value. Let's drop this column entirely.

In [11]:
data = data.drop('pub_rec_bankruptcies', axis = 1)

We will also drop rows containing missing values.

In [12]:
data = data.dropna(axis = 0)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37675 entries, 0 to 39785
Data columns (total 22 columns):
loan_amnt              37675 non-null float64
term                   37675 non-null object
int_rate               37675 non-null object
installment            37675 non-null float64
emp_length             37675 non-null object
home_ownership         37675 non-null object
annual_inc             37675 non-null float64
verification_status    37675 non-null object
loan_status            37675 non-null int64
purpose                37675 non-null object
title                  37675 non-null object
addr_state             37675 non-null object
dti                    37675 non-null float64
delinq_2yrs            37675 non-null float64
earliest_cr_line       37675 non-null object
inq_last_6mths         37675 non-null float64
open_acc               37675 non-null float64
pub_rec                37675 non-null float64
revol_bal              37675 non-null float64
revol_util             37675

We have 11 numerical columns (including the `float64` and `int64` data types) and 11 columns of `object` data type.

### Converting Text Columns
While the numerical columns can be used natively with scikit-learn for machine learning, the object columns that contain text need to be converted to numerical data types. 

Let's create a new DataFrame containing just the object columns so we can explore them in more depth.

In [13]:
object_columns_df = data.select_dtypes(include = ['object'])
object_columns_df.head()

Unnamed: 0,term,int_rate,emp_length,home_ownership,verification_status,purpose,title,addr_state,earliest_cr_line,revol_util,last_credit_pull_d
0,36 months,10.65%,10+ years,RENT,Verified,credit_card,Computer,AZ,Jan-1985,83.7%,Jun-2016
1,60 months,15.27%,< 1 year,RENT,Source Verified,car,bike,GA,Apr-1999,9.4%,Sep-2013
2,36 months,15.96%,10+ years,RENT,Not Verified,small_business,real estate business,IL,Nov-2001,98.5%,Jun-2016
3,36 months,13.49%,10+ years,RENT,Source Verified,other,personel,CA,Feb-1996,21%,Apr-2016
5,36 months,7.90%,3 years,RENT,Source Verified,wedding,My wedding loan I promise to pay back,AZ,Nov-2004,28.3%,Jan-2016


Some of the columns seem like they represent categorical values, but we should confirm by checking the number of unique values in those columns:

- `home_ownership`: home ownership status, can only be 1 of 4 categorical values according to the data dictionary
- `verification_status`: indicates if income was verified by Lending Club
- `emp_length`: number of years the borrower was employed at the time of application
- `term`: number of payments on the loan, either 36 or 60
- `addr_state`: borrower's state of residence
- `purpose`: a category provided by the borrower for the loan request
- `title`: loan title provided by the borrower


There are also some columns that represent numeric values and need to be converted:

- `int_rate`: interest rate of the loan in %
- `revol_util`: revolving line utilization rate, or the amount of credit the borrower is using relative to all available credit

Based on the first row's values for purpose and title, it seems like these columns could reflect the same information. Let's explore the unique value counts separately to confirm if this is true.

Lastly, some of the columns contain date values that would require a good amount of feature engineering for them to be potentially useful:

- `earliest_cr_line`: The month the borrower's earliest reported credit line was opened,
- `last_credit_pull_d`: The most recent month Lending Club pulled credit for this loan.

Since these date features require some feature engineering for modeling purposes, we'll remove these date columns from the DataFrame.
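
For readers curious what that feature engineering might involve, the sketch below turns `earliest_cr_line` strings into an approximate length of credit history in years. This is not part of this project's pipeline: the `reference_date` is a hypothetical cut-off chosen for illustration, and the three date strings are taken from the first rows shown above.

```python
import pandas as pd

# Values in the "Mon-YYYY" format used by the dataset (first rows shown above).
dates = pd.DataFrame({"earliest_cr_line": ["Jan-1985", "Apr-1999", "Nov-2001"]})

# Hypothetical reference point; a real feature would need a date known at application time.
reference_date = pd.Timestamp("2012-01-01")

# Parse the month-year strings, then measure the gap to the reference date in years.
earliest = pd.to_datetime(dates["earliest_cr_line"], format="%b-%Y")
dates["credit_history_years"] = (reference_date - earliest).dt.days / 365.25

print(dates)
```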

In [14]:
# visualizing the unique values and their counts for each column
cat_cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state', 'purpose', 'title']

for col in cat_cols:
    print(data[col].value_counts())

RENT        18112
MORTGAGE    16686
OWN          2778
OTHER          96
NONE            3
Name: home_ownership, dtype: int64
Not Verified       16281
Verified           11856
Source Verified     9538
Name: verification_status, dtype: int64
10+ years    8545
< 1 year     4513
2 years      4303
3 years      4022
4 years      3353
5 years      3202
1 year       3176
6 years      2177
7 years      1714
8 years      1442
9 years      1228
Name: emp_length, dtype: int64
 36 months    28234
 60 months     9441
Name: term, dtype: int64
CA    6776
NY    3614
FL    2704
TX    2613
NJ    1776
IL    1447
PA    1442
VA    1347
GA    1323
MA    1272
OH    1149
MD    1008
AZ     807
WA     788
CO     748
NC     729
CT     711
MI     678
MO     648
MN     581
NV     466
SC     454
WI     427
OR     422
LA     420
AL     420
KY     311
OK     285
UT     249
KS     249
AR     229
DC     209
RI     194
NM     180
WV     164
HI     162
NH     157
DE     110
MT      77
WY      76
AK      76
SD      60
VT  

### Cleaning Categorical Columns
The `home_ownership`, `verification_status`, `emp_length`, `term`, and `addr_state` columns all contain multiple discrete values. We should encode `home_ownership`, `verification_status`, and `term` as dummy variables and keep them. The `addr_state` column contains 49 discrete values and we'd need to add 49 dummy variable columns to use it for classification. This would make our DataFrame much larger and could slow down how quickly the code runs. Let's remove this column from consideration.
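
As a small illustration of what dummy encoding produces, the sketch below applies `pd.get_dummies` to a made-up two-column frame. Each distinct value becomes its own 0/1 column, which is why a column with 49 distinct values would add 49 columns.

```python
import pandas as pd

# Made-up rows using category values that appear in the dataset.
toy = pd.DataFrame({
    "term": ["36 months", "60 months", "36 months"],
    "home_ownership": ["RENT", "MORTGAGE", "OWN"],
})

# One indicator column per distinct value, named "<column>_<value>".
dummies = pd.get_dummies(toy).astype("float64")

print(dummies.columns.tolist())
print(dummies.shape)
```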

We can clean the `emp_length` column and treat it as a numerical one since the values have an ordering (i.e., 2 years of employment is less than 8 years).

It seems like the `purpose` and `title` columns do contain overlapping information. We'll keep the `purpose` column since it contains fewer discrete values. Additionally, the `title` column has data quality issues since many of the values are repeated with slight modifications (e.g. `Debt Consolidation`, `Debt Consolidation Loan` and `debt consolidation`).
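
The data quality issue in `title` can be seen with the three example values quoted above. In this minimal sketch, simple normalization (trimming whitespace and lowercasing) merges only some of the variants, which hints at how much cleaning free-text titles would need.

```python
import pandas as pd

# The three variants quoted in the text above.
titles = pd.Series(["Debt Consolidation", "Debt Consolidation Loan", "debt consolidation"])

# Basic normalization: trim whitespace and lowercase.
normalized = titles.str.strip().str.lower()

# Distinct values before and after normalization.
print(titles.nunique(), normalized.nunique())
```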

In [15]:
# removing outlined columns
drop_cols = ['addr_state', 'title', 'earliest_cr_line', 'last_credit_pull_d']
data = data.drop(drop_cols, axis = 1)

In [16]:
# converting numeric columns as outlined
data["int_rate"] = data["int_rate"].str.rstrip("%").astype("float")
data["revol_util"] = data["revol_util"].str.rstrip("%").astype("float")

# cleaning the `emp_length` column

mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}

data = data.replace(mapping_dict)

In [17]:
# data check
data.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,purpose,dti,delinq_2yrs,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc
0,5000.0,36 months,10.65,162.87,10,RENT,24000.0,Verified,1,credit_card,27.65,0.0,1.0,3.0,0.0,13648.0,83.7,9.0
1,2500.0,60 months,15.27,59.83,0,RENT,30000.0,Source Verified,0,car,1.0,0.0,5.0,3.0,0.0,1687.0,9.4,4.0
2,2400.0,36 months,15.96,84.33,10,RENT,12252.0,Not Verified,1,small_business,8.72,0.0,2.0,2.0,0.0,2956.0,98.5,10.0
3,10000.0,36 months,13.49,339.31,10,RENT,49200.0,Source Verified,1,other,20.0,0.0,1.0,10.0,0.0,5598.0,21.0,37.0
5,5000.0,36 months,7.9,156.46,3,RENT,36000.0,Source Verified,1,wedding,11.2,0.0,3.0,9.0,0.0,7963.0,28.3,12.0


In [18]:
# encoding dummy variables and removing original columns
dummy_columns = ["home_ownership", "verification_status", "purpose", "term"]
dummy_df = pd.get_dummies(data[dummy_columns]).astype('float64')
data = pd.concat([data, dummy_df], axis=1)
data = data.drop(dummy_columns, axis=1)

In [19]:
# data check
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37675 entries, 0 to 39785
Data columns (total 38 columns):
loan_amnt                              37675 non-null float64
int_rate                               37675 non-null float64
installment                            37675 non-null float64
emp_length                             37675 non-null int64
annual_inc                             37675 non-null float64
loan_status                            37675 non-null int64
dti                                    37675 non-null float64
delinq_2yrs                            37675 non-null float64
inq_last_6mths                         37675 non-null float64
open_acc                               37675 non-null float64
pub_rec                                37675 non-null float64
revol_bal                              37675 non-null float64
revol_util                             37675 non-null float64
total_acc                              37675 non-null float64
home_ownership_MORTGAGE    

We have now converted all of the columns to numerical values, which is the only type of value scikit-learn can work with.


## Next Steps

Next, we'll experiment with training models and evaluating accuracy using cross-validation in Part 2 of this project.
