<font size="+5">#03 | Model Selection. Decision Tree vs Support Vector Machines vs Logistic Regression</font>

- Subscribe to my [Blog ↗](https://blog.pythonassembly.com/)
- Let's keep in touch on [LinkedIn ↗](https://www.linkedin.com/in/jsulopz) 😄

# Discipline to Search for Solutions on Google

> Apply the following steps when **looking for solutions on Google**:
>
> 1. **Necessity**: How to load an Excel file in Python?
> 2. **Search in Google**: by keywords
>   - `load excel python`
>   - ~~how to load excel in python~~
> 3. **Solution**: What's the `function()` that loads an Excel file in Python?
>   - A function is to programming what the atom is to physics.
>   - Every time you want to do something in programming
>   - **You will need a `function()`** to do it
>   - Therefore, you must **detect parentheses `()`**
>   - Among all the words that you see on a website
>   - Because they indicate the presence of a `function()`.

# Load the Data

Load the dataset from [CIS](https://www.cis.es/cis/opencms/ES/index.html) by executing the lines of code below:
> - The goal of this dataset is
> - To predict the `internet_usage` of **people** (rows)
> - Based on their **sociodemographic characteristics** (columns)

In [13]:
import pandas as pd

url = 'https://raw.githubusercontent.com/py-thocrates/data/main/internet_usage_spain.csv'

df = pd.read_csv(url)
df.head()

| | internet_usage | sex | age | education |
|---|---|---|---|---|
| 0 | 0 | Female | 66 | Elementary |
| 1 | 1 | Male | 72 | Elementary |
| 2 | 1 | Male | 48 | University |
| 3 | 0 | Male | 59 | PhD |
| 4 | 1 | Female | 44 | PhD |


In [14]:
df = pd.get_dummies(df, drop_first=True)

In [15]:
X = df.drop(columns='internet_usage')
y = df.internet_usage
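
To see what `pd.get_dummies(..., drop_first=True)` does before splitting into `X` and `y`, here is a minimal sketch. The `sample` table is a hypothetical four-row stand-in that only mimics the columns shown by `df.head()`; the names `sample`, `encoded`, `X_sample` and `y_sample` are made up for this illustration.

```python
import pandas as pd

# Hypothetical mini-sample (assumption): it mimics the columns of the dataset
sample = pd.DataFrame({
    'internet_usage': [0, 1, 1, 0],
    'sex': ['Female', 'Male', 'Male', 'Female'],
    'age': [66, 72, 48, 59],
    'education': ['Elementary', 'Elementary', 'University', 'PhD'],
})

# Text columns become indicator columns; drop_first=True drops the first
# category of each text column (alphabetical order), which acts as the baseline
encoded = pd.get_dummies(sample, drop_first=True)

# Separate the explanatory variables (X) from the target (y)
X_sample = encoded.drop(columns='internet_usage')
y_sample = encoded['internet_usage']

print(list(X_sample.columns))
```

In this sample, `Female` and `Elementary` are the dropped baseline categories, so only `sex_Male` and the remaining education levels appear as columns.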

# Build & Compare Models

## `DecisionTreeClassifier()` Model in Python

In [16]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/7VeUPuFGJHk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

> - Build the model `model.fit()`
> - And see how good it is `model.score()`

In [17]:
from sklearn.tree import DecisionTreeClassifier

In [18]:
model = DecisionTreeClassifier()

In [19]:
model.fit(X,y)

DecisionTreeClassifier()

In [21]:
model.score(X,y)

0.859877800407332
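
For classifiers, `model.score()` returns the accuracy: the share of rows where the prediction equals the real value. The sketch below checks this by hand. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the internet usage dataset), and the names `X_demo`, `y_demo` and `model_demo` are made up for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): not the internet usage dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model_demo = DecisionTreeClassifier(random_state=0)
model_demo.fit(X_demo, y_demo)

# Accuracy by hand: share of rows where the prediction equals the real value
predictions = model_demo.predict(X_demo)
manual_accuracy = (predictions == y_demo).mean()

# score() should return the same number
builtin_accuracy = model_demo.score(X_demo, y_demo)

print(manual_accuracy, builtin_accuracy)
```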

## `SVC()` Model in Python

In [22]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/efR1C6CvhmE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

> - Build the model `model.fit()`
> - And see how good it is `model.score()`

In [23]:
from sklearn.svm import SVC

In [24]:
model_sv = SVC()

In [27]:
model_sv.fit(X,y)

SVC()

In [28]:
model_sv.score(X,y)

0.7837067209775967

## `LogisticRegression()` Model in Python

In [4]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/yIYKR4sgzI8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

> - Build the model `model.fit()`
> - And see how good it is `model.score()`

In [31]:
from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression(max_iter=1000)

model_lr.fit(X,y)

model_lr.score(X,y)

0.8334012219959267

# Function to Automate Lines of Code

> - We repeated the same code every time:

```python
model.fit(X, y)
model.score(X, y)
```

> - Why not turn these lines into a `function()`
> - To automate the process?
> - So that you would just need
```python
calculate_accuracy(model=dt)

calculate_accuracy(model=svm)

calculate_accuracy(model=lr)
```

> - To calculate the `accuracy`

## Make a Procedure Sample for `DecisionTreeClassifier()`
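
A possible procedure sample is sketched below: write the steps once, for a single model, before turning them into a function. The data is a synthetic stand-in created with `make_classification` (an assumption, so the number will differ from the real dataset), and the names `X_demo`, `y_demo` and `model_dt` are made up for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): not the internet usage dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Step 1: create the model
model_dt = DecisionTreeClassifier(random_state=0)

# Step 2: build the model with the data
model_dt.fit(X_demo, y_demo)

# Step 3: see how good it is (accuracy on the same data used to fit)
accuracy = model_dt.score(X_demo, y_demo)

print(accuracy)
```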

## Automate the Procedure into a `function()`

**Code Thinking**

> 1. Think of the function's `result`
> 2. Store that `object` in a variable
> 3. `return` the `result` at the end
> 4. **Indent the body** of the function to the right
> 5. `def`ine the `function():`
> 6. Think of what is going to change when you execute the function with `different models`
> 7. Locate the **`variable` that you will change**
> 8. Turn it into the `parameter` of the `function()`
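
Following the steps above, one possible version of `calculate_accuracy()` is sketched here. Two assumptions: the function also receives `X` and `y` as parameters so that it does not depend on global variables, and the data is a synthetic stand-in from `make_classification` instead of the real dataset. The names `dt`, `svm` and `lr` are just labels for the three models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): not the internet usage dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)


def calculate_accuracy(model, X, y):
    """Fit the model and return its accuracy on the same data."""
    model.fit(X, y)
    result = model.score(X, y)
    return result


# The model is the only thing that changes between executions
dt = DecisionTreeClassifier()
svm = SVC()
lr = LogisticRegression(max_iter=1000)

accuracies = {
    'dt': calculate_accuracy(model=dt, X=X_demo, y=y_demo),
    'svm': calculate_accuracy(model=svm, X=X_demo, y=y_demo),
    'lr': calculate_accuracy(model=lr, X=X_demo, y=y_demo),
}

print(accuracies)
```

The model is the parameter because it is the only piece that changes from one execution to the next; the `fit()` and `score()` lines stay the same.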

## `DecisionTreeClassifier()` Accuracy

## `SVC()` Accuracy

## `LogisticRegression()` Accuracy

# Which is the Best Model?

> Which model has the **highest accuracy**?

## University Access Exams Analogy

> Let's **imagine**:
>
> 1. You have a `math exam` on Saturday
> 2. Today is Monday
> 3. You want to **calculate if you need to study more** for the math exam
> 4. How do you calibrate your `math level`?
> 5. Well, you've got **100 questions `X` with 100 solutions `y`** from past years' exams
> 6. You may study the 100 questions with 100 solutions `fit(questions, solutions)`
> 7. Then, you may do a `mock exam` with the 100 questions `predict(questions)`
> 8. And compare `your_solutions` with the `real_solutions`
> 9. You've got **90/100 correct answers** `accuracy` in the mock exam
> 10. You think you are **prepared for the math exam**
> 11. And when you do **the real exam on Saturday, the mark is 40/100**
> 12. Why? How could we have prevented this?
> 13. **Solution**: split the 100 questions into
> - `70 train` to study & `30 test` for the mock exam.
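
The analogy can be sketched in code. Everything here is an assumption for illustration: the questions and solutions are random numbers generated with NumPy (not real exam data), the helper `make_exam()` is hypothetical, and the "student" is a `DecisionTreeClassifier`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)


def make_exam(n_questions):
    """Generate hypothetical questions with noisy solutions (0 or 1)."""
    questions = rng.normal(size=(n_questions, 3))
    noise = rng.normal(size=n_questions)
    solutions = (questions[:, 0] + noise > 0).astype(int)
    return questions, solutions


# Past exams to study, and a real exam with questions never seen before
past_questions, past_solutions = make_exam(100)
real_questions, real_solutions = make_exam(100)

# Study the 100 past questions together with their solutions
student = DecisionTreeClassifier(random_state=0)
student.fit(past_questions, past_solutions)

# Mock exam with the very same questions that were studied
mock_accuracy = student.score(past_questions, past_solutions)

# Real exam with new questions
real_accuracy = student.score(real_questions, real_solutions)

print(mock_accuracy, real_accuracy)
```

The mock exam looks perfect because the tree can memorize the studied questions, while the mark on new questions is lower. Keeping some questions aside for the mock exam gives a more honest estimate.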

# `train_test_split()` the Data

> 1. **`fit()` the model with `Train Data`**
>
> - `model.fit(70%questions, 70%solutions)`

> 2. **`.predict()` answers with `Test Data` (mock exam)**
>
> - `your_solutions = model.predict(30%questions)`

> 3. **Compare `your_solutions` with `correct answers` from mock exam**
>
> - `your_solutions == real_solutions`?
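
The three steps can be sketched as follows. The data is a synthetic stand-in from `make_classification` (an assumption, not the internet usage dataset), and the names `questions`, `solutions`, `q_train`, `q_test`, `s_train` and `s_test` are made up to mirror the exam analogy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 100 "questions" with 100 "solutions"
questions, solutions = make_classification(n_samples=100, n_features=5, random_state=0)

# 70 questions to study, 30 questions kept aside for the mock exam
q_train, q_test, s_train, s_test = train_test_split(
    questions, solutions, test_size=0.30, random_state=42)

# 1. fit() the model with the train data
model = DecisionTreeClassifier(random_state=0)
model.fit(q_train, s_train)

# 2. predict() the answers of the mock exam (test data)
your_solutions = model.predict(q_test)

# 3. Compare your solutions with the real solutions
accuracy = (your_solutions == s_test).mean()

print(accuracy)
```

Steps 2 and 3 together are what `model.score(q_test, s_test)` computes in a single call.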

# Optimize All Models & Compare Again

## Make a Procedure Sample for `DecisionTreeClassifier()`

## Automate the Procedure into a `function()`

**Code Thinking**

> 1. Think of the function's `result`
> 2. Store that `object` in a variable
> 3. `return` the `result` at the end
> 4. **Indent the body** of the function to the right
> 5. `def`ine the `function():`
> 6. Think of what is going to change when you execute the function with `different models`
> 7. Locate the **`variable` that you will change**
> 8. Turn it into the `parameter` of the `function()`
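
Applying the same steps to the train and test workflow, one possible function is sketched below. The name `calculate_test_accuracy()` and its parameters are hypothetical, and the data is a synthetic stand-in from `make_classification` (an assumption), so the numbers will differ from the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): not the internet usage dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)


def calculate_test_accuracy(model, X_train, X_test, y_train, y_test):
    """Fit the model on the train data and return its accuracy on the test data."""
    model.fit(X_train, y_train)
    result = model.score(X_test, y_test)
    return result


models = {
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=0),
    'SVC': SVC(),
    'LogisticRegression': LogisticRegression(max_iter=1000),
}

# Same procedure for every model: only the model changes
test_accuracies = {
    name: calculate_test_accuracy(model, X_train, X_test, y_train, y_test)
    for name, model in models.items()
}

print(test_accuracies)
```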

In [32]:
from sklearn.model_selection import train_test_split

In [33]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
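
A small sketch of what `test_size` and `random_state` do. The arrays `X_demo` and `y_demo` are hypothetical stand-ins with 100 rows (an assumption), which makes the sizes easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data (assumption): 100 rows, 2 columns
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# test_size=0.33 keeps about 33% of the rows for the test data
split_a = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
X_train_a, X_test_a, y_train_a, y_test_a = split_a

print(X_train_a.shape, X_test_a.shape)

# The same random_state reproduces the same shuffle and the same split
split_b = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print(same_split)
```

Fixing `random_state` matters when comparing models: every model is then evaluated on exactly the same test rows.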

In [35]:
X

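| | age | sex_Male | education_High School | education_Higher Level | education_No studies | education_PhD | education_University |
|---|---|---|---|---|---|---|---|
| 0 | 66 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 72 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 48 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 59 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 44 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2450 | 43 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2451 | 18 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2452 | 54 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2453 | 31 | 1 | 1 | 0 | 0 | 0 | 0 |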


In [34]:
X_train

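| | age | sex_Male | education_High School | education_Higher Level | education_No studies | education_PhD | education_University |
|---|---|---|---|---|---|---|---|
| 2432 | 66 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1023 | 63 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1087 | 25 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2436 | 82 | 1 | 0 | 0 | 0 | 0 | 0 |
| 841 | 36 | 1 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1638 | 37 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1095 | 35 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1130 | 58 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1294 | 52 | 0 | 0 | 0 | 0 | 0 | 0 |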


In [36]:
X_test

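| | age | sex_Male | education_High School | education_Higher Level | education_No studies | education_PhD | education_University |
|---|---|---|---|---|---|---|---|
| 1598 | 52 | 0 | 0 | 0 | 0 | 0 | 0 |
| 620 | 48 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1266 | 53 | 0 | 0 | 0 | 0 | 1 | 0 |
| 649 | 43 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1908 | 43 | 1 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2279 | 59 | 1 | 0 | 0 | 1 | 0 | 0 |
| 305 | 37 | 1 | 0 | 1 | 0 | 0 | 0 |
| 209 | 33 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1383 | 18 | 1 | 1 | 0 | 0 | 0 | 0 |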


## `DecisionTreeClassifier()` Model in Python

In [16]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/7VeUPuFGJHk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

> - Build the model `model.fit()`
> - And see how good it is `model.score()`

In [37]:
from sklearn.tree import DecisionTreeClassifier

In [38]:
model = DecisionTreeClassifier()

In [39]:
model.fit(X_train,y_train)

DecisionTreeClassifier()

In [40]:
model.score(X_test,y_test)

0.8064118372379778

## `SVC()` Model in Python

In [41]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/efR1C6CvhmE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

> - Build the model `model.fit()`
> - And see how good it is `model.score()`

In [42]:
from sklearn.svm import SVC

In [43]:
model_sv = SVC()

In [47]:
model_sv.fit(X_train,y_train)

SVC()

In [48]:
model_sv.score(X_test,y_test)

0.7866831072749692

## `LogisticRegression()` Model in Python

In [49]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/yIYKR4sgzI8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

> - Build the model `model.fit()`
> - And see how good it is `model.score()`

In [51]:
from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression(max_iter=1000)

model_lr.fit(X_train,y_train)

model_lr.score(X_test,y_test)

0.8508014796547472

# Which is the Best Model with `train_test_split()`?

> Which model has the **highest accuracy**?
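
To answer the question, the accuracies printed in the outputs above can be placed side by side. In this sketch the numbers are copied from those outputs and rounded to four decimals; the column names `same_data` (model scored on the data it was fitted with) and `test_data` (model scored on the test data) are made up for this illustration.

```python
import pandas as pd

# Accuracies copied from the outputs above, rounded to 4 decimals
results = pd.DataFrame({
    'model': ['DecisionTreeClassifier', 'SVC', 'LogisticRegression'],
    'same_data': [0.8599, 0.7837, 0.8334],
    'test_data': [0.8064, 0.7867, 0.8508],
}).set_index('model')

# How much the accuracy changes when the model faces data it has not seen
results['difference'] = results['test_data'] - results['same_data']

# Best model according to each way of measuring
best_same_data = results['same_data'].idxmax()
best_test_data = results['test_data'].idxmax()

print(results.sort_values('test_data', ascending=False))
print(best_same_data, best_test_data)
```

The ranking changes: `DecisionTreeClassifier()` wins when scored on the data it was fitted with, while `LogisticRegression()` wins on the test data.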

# Reflect

> - Banks deploy models to predict the **probability that a customer will repay a loan**
> - If the bank had used the `DecisionTreeClassifier()` instead of the `LogisticRegression()`
> - What would have happened?
> - Is `train_test_split()` always required to compare models?