![](../src/logo.svg)

**© Jesús López**

Ask him any questions on **[Twitter](https://twitter.com/jsulopz)** or **[LinkedIn](https://linkedin.com/in/jsulopz)**

# #03 | Train Test Split for Model Selection

## Load the Data

- The goal of this dataset is
- To predict whether the **bank's customers** (rows) would be approved for a credit card (`target`)
- Based on their **sociodemographic characteristics** (columns)

In [1]:
import pandas as pd #!

df_credit = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data',
                 na_values='?', header=None)

df_credit.rename(columns={15: 'target'}, inplace=True)
# Assign the result back: a chained `inplace=True` on a column may not update df_credit
df_credit['target'] = df_credit['target'].map({'+': 1, '-': 0})
df_credit.columns = [str(i) for i in df_credit.columns]
df_credit

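|     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | target |
|-----|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|--------|
| 0   | b | 30.83 | 0.000 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202.0 | 0 | 1 |
| 1   | a | 58.67 | 4.460 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43.0 | 560 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 688 | b | 17.92 | 0.205 | u | g | aa | v | 0.04 | f | f | 0 | f | g | 280.0 | 750 | 0 |
| 689 | b | 35.00 | 3.375 | u | g | c | h | 8.29 | f | f | 0 | t | g | 0.0 | 0 | 0 |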


## Preprocess the Data

In [3]:
df_credit.isna().sum()

0         12
1         12
          ..
14         0
target     0
Length: 16, dtype: int64

In [4]:
df_credit = df_credit.dropna()
df_credit

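|     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | target |
|-----|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|--------|
| 0   | b | 30.83 | 0.000 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202.0 | 0 | 1 |
| 1   | a | 58.67 | 4.460 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43.0 | 560 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 688 | b | 17.92 | 0.205 | u | g | aa | v | 0.04 | f | f | 0 | f | g | 280.0 | 750 | 0 |
| 689 | b | 35.00 | 3.375 | u | g | c | h | 8.29 | f | f | 0 | t | g | 0.0 | 0 | 0 |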


In [13]:
df_credit = pd.get_dummies(df_credit, drop_first=True)
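
To see what `pd.get_dummies(..., drop_first=True)` does to the categorical columns, here is a minimal sketch on a toy DataFrame. The frame `df_toy` and its values are made up for illustration and are not part of the credit dataset.

```python
import pandas as pd

# Toy frame (hypothetical values): one categorical and one numeric column
df_toy = pd.DataFrame({
    'category': ['a', 'b', 'c', 'a'],
    'amount': [1.0, 2.5, 0.3, 4.2],
})

# Categorical columns are expanded into indicator columns;
# numeric columns pass through unchanged.
# drop_first=True drops the first level ('a') of each categorical column,
# so a row with all indicators off belongs to that dropped level.
df_encoded = pd.get_dummies(df_toy, drop_first=True)

print(df_encoded.columns.tolist())
print(df_encoded)
```

Dropping one level per column avoids storing redundant information, since the dropped category can be inferred from the remaining indicators.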

## Feature Selection

In [14]:
X = df_credit.drop(columns='target')

In [15]:
y = df_credit.target

## Build & Compare Models

### `DecisionTreeClassifier()`

In [16]:
from sklearn.tree import DecisionTreeClassifier

model_dt = DecisionTreeClassifier()
model_dt.fit(X,y)

### `RandomForestClassifier()`

In [17]:
from sklearn.ensemble import RandomForestClassifier

In [18]:
model_rf = RandomForestClassifier()
model_rf.fit(X, y)

### `KNeighborsClassifier()`

In [19]:
from sklearn.neighbors import KNeighborsClassifier

In [20]:
model_kn = KNeighborsClassifier()
model_kn.fit(X,y)

## Which Model is the Best?

In [21]:
model_dt.score(X,y)

1.0

In [22]:
model_rf.score(X,y)

1.0

In [23]:
model_kn.score(X,y)

0.7840735068912711
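
A score of `1.0` on the same data used for training does not necessarily mean the model is good. The following sketch, which uses random noise as a stand-in (an assumption, not the credit data), suggests why: a fully grown decision tree can memorize labels that have no relationship with the features at all.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (hypothetical): features and labels are unrelated
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 5))      # features carry no signal
y_noise = rng.integers(0, 2, size=200)   # labels are coin flips

# An unconstrained tree keeps splitting until every training row is classified correctly
model_noise = DecisionTreeClassifier(random_state=0)
model_noise.fit(X_noise, y_noise)

# Accuracy measured on the very rows the tree was fitted on
train_acc = model_noise.score(X_noise, y_noise)
print(train_acc)
```

Because there is nothing to learn here, a perfect training score can only come from memorization, which is why the comparison should be made on data the model has not seen.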

## `train_test_split()` & Compare Again

In [24]:
from sklearn.model_selection import train_test_split

In [25]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
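
As a quick illustration of what `train_test_split()` returns, the sketch below splits a small stand-in array and prints the resulting shapes. The names `X_demo`, `y_demo` and the 100-row size are assumptions chosen for the example, not the credit data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (hypothetical): 100 rows, 3 features, binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# test_size=0.33 holds out roughly a third of the rows for evaluation;
# random_state fixes the shuffle so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

Every row ends up in exactly one of the two sets, so the model can be fitted on one part and scored on the other.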

### `DecisionTreeClassifier()`

In [28]:
model_dt_tts = DecisionTreeClassifier()

In [29]:
model_dt_tts.fit(X_train, y_train)

### `RandomForestClassifier()`

In [30]:
model_rf_tts = RandomForestClassifier()

model_rf_tts.fit(X_train, y_train)

### `KNeighborsClassifier()`

In [31]:
model_kn_tts = KNeighborsClassifier()
model_kn_tts.fit(X_train, y_train)

## Which is the Best Model with `train_test_split()`?

In [32]:
model_dt_tts.score(X_test, y_test)

0.8148148148148148

In [33]:
model_rf_tts.score(X_test, y_test)

0.8564814814814815

In [34]:
model_kn_tts.score(X_test, y_test)

0.6666666666666666
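
The fit-and-score steps above repeat for every model, so they can be wrapped in a function. The sketch below is one possible way to do it: `compare_models` is a hypothetical helper (not part of scikit-learn), and the data comes from `make_classification` as a synthetic stand-in for the credit dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def compare_models(models, X_train, X_test, y_train, y_test):
    """Fit each model on the training set; return {name: (train_acc, test_acc)}."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = (model.score(X_train, y_train),
                         model.score(X_test, y_test))
    return results


# Synthetic stand-in data (assumption): not the credit dataset
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

models = {
    'decision_tree': DecisionTreeClassifier(random_state=0),
    'random_forest': RandomForestClassifier(random_state=0),
    'k_neighbors': KNeighborsClassifier(),
}

scores = compare_models(models, X_tr, X_te, y_tr, y_te)
for name, (train_acc, test_acc) in scores.items():
    print(f'{name}: train={train_acc:.3f}  test={test_acc:.3f}')
```

Printing both accuracies side by side makes the gap between seen and unseen data visible for each model, and the test column is the one to use when choosing between them.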

# Achieved Goals

_Double click on **this cell** and place an `X` inside the square brackets (i.e., [X]) if you think you understand the goal:_

- [ ] Understand the need to **create functions** to avoid repeating code.
- [ ] **Bootstrapping** as a way to create an artificial dataset that helps to reduce the bias.
- [ ] **Classification threshold** to predict categories out of probabilities.
- [ ] Different ways to **compare classification models**.
- [ ] Understand the importance of checking how good a model is on **data not seen during training**.

<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.