<font size="+5">#07. Model Selection. Decision Tree vs Support Vector Machines vs Logistic Regression</font>

- Book + Private Lessons [Here ↗](https://sotastica.com/reservar)
- Subscribe to my [Blog ↗](https://blog.pythonassembly.com/)
- Let's keep in touch on [LinkedIn ↗](https://www.linkedin.com/in/jsulopz) 😄

# Load the Data

Load the dataset from [CIS](https://www.cis.es/cis/opencms/ES/index.html) by running the code below:
> - The goal of this dataset is
> - to predict the `internet_usage` of **people** (rows)
> - based on their **sociodemographic characteristics** (columns)

In [1]:
import pandas as pd

url = 'https://raw.githubusercontent.com/py-thocrates/data/main/internet_usage_spain.csv'

df = pd.read_csv(url)
df.head()

Unnamed: 0,internet_usage,sex,age,education
0,0,Female,66,Elementary
1,1,Male,72,Elementary
2,1,Male,48,University
3,0,Male,59,PhD
4,1,Female,44,PhD


# Build & Compare Models

## `DecisionTreeClassifier()` Model in Python

In [2]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/7VeUPuFGJHk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

> - Build the model `model.fit()`
> - And see how good it is `model.score()`

In [3]:
fit()

NameError: name 'fit' is not defined

In [6]:
from sklearn.tree import DecisionTreeClassifier

In [10]:
model = DecisionTreeClassifier()

In [31]:
y = df["internet_usage"]

In [32]:
X = df.drop(labels='internet_usage', axis=1)

In [37]:
df.education.value_counts()

Elementary      1184
Higher Level     374
High School      279
PhD              236
University       205
No studies       177
Name: education, dtype: int64

In [38]:
X = pd.get_dummies(X, drop_first=True)
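As a minimal sketch of what `pd.get_dummies(..., drop_first=True)` does, the toy table below (invented for illustration, not the CIS data) shows a text column being turned into a 0/1 indicator column while the numeric column passes through unchanged. The names `toy` and `toy_encoded` are hypothetical.

```python
import pandas as pd

# Toy data invented for illustration only (not the CIS dataset)
toy = pd.DataFrame({
    "age": [66, 72, 48],
    "sex": ["Female", "Male", "Male"],
})

# Text columns become indicator columns; numeric columns are kept as they are.
# drop_first=True drops the first category of each text column ("Female" here),
# so "sex" ends up encoded by the single column "sex_Male".
toy_encoded = pd.get_dummies(toy, drop_first=True)

print(toy_encoded.columns.tolist())
```

Dropping the first category loses no information: a row with `sex_Male` equal to 0 can only be `Female`.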

In [39]:
model.fit(X,y)

DecisionTreeClassifier()

In [40]:
model.score(X,y)

0.859877800407332

## `SVC()` Model in Python

In [49]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/efR1C6CvhmE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

> - Build the model `model.fit()`
> - And see how good it is `model.score()`

In [42]:
from sklearn.svm import SVC

In [46]:
model = SVC()

In [47]:
model.fit(X,y)

SVC()

In [48]:
model.score(X,y)

0.7837067209775967

## `LogisticRegression()` Model in Python

In [4]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/yIYKR4sgzI8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

> - Build the model `model.fit()`
> - And see how good it is `model.score()`

In [50]:
from sklearn.linear_model import LogisticRegression

In [51]:
model = LogisticRegression()

In [52]:
model.fit(X,y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

In [53]:
model.score(X,y)

0.8325865580448065
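The warning printed above suggests two remedies: raise `max_iter` or scale the data. The sketch below tries both on synthetic stand-in data, so the numbers it produces are not comparable to the scores in this notebook. The names `X_demo`, `y_demo` and `scaled_lr`, and the value `max_iter=1000`, are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 500 rows, 5 numeric features
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

# Inflate one feature to mimic a column on a much larger scale than the others
X_demo[:, 0] = X_demo[:, 0] * 1000

# Scale every feature first, then fit the logistic regression with more iterations
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, y_demo)

accuracy_demo = scaled_lr.score(X_demo, y_demo)
print(accuracy_demo)
```

Wrapping the scaler and the model in one pipeline keeps the `fit()` and `score()` calls identical to the ones used for the other models.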

# Function to Automate Lines of Code

> - We kept repeating the same code:

```python
model.fit(X, y)
model.score(X, y)
```

> - Why not turn these lines into a `function()`
> - to automate the process,
> - so that you would only need

```python
calcular_precision(model=dt)

calcular_precision(model=sv)

calcular_precision(model=lr)
```

> - To calculate the `accuracy`

## Make a Procedure Sample for `DecisionTreeClassifier()`

## Code Thinking

> 1. Think of the function's `result`
> 2. Store that `object` in a variable
> 3. `return` the `result` at the end
> 4. **Indent the body** of the function to the right
> 5. `def`ine the `function():`
> 6. Think about what will change when you execute the function with `different models`
> 7. Locate the **`variable` that you will change**
> 8. Turn it into the `parameter` of the `function()`

## Automate the Procedure into a `function()`

In [54]:
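def calcular_precision(model):
    # "calcular precision" is Spanish for "calculate accuracy"

    # Train the model on the full dataset
    model.fit(X,y)

    # Accuracy measured on the same data the model was trained on
    precision = model.score(X,y)

    return precision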

In [55]:
dt = DecisionTreeClassifier()

In [56]:
lr = LogisticRegression()

In [57]:
sv = SVC()

## `DecisionTreeClassifier()` Accuracy

In [58]:
calcular_precision(dt)

0.859877800407332

## `SVC()` Accuracy

In [60]:
calcular_precision(sv)

0.7837067209775967

## `LogisticRegression()` Accuracy

In [59]:
calcular_precision(lr)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.8325865580448065

# Which is the Best Model?

> Which model has the **highest accuracy**?

## University Access Exams Analogy

> Let's **imagine**:
>
> 1. You have a `math exam` on Saturday
> 2. Today is Monday
> 3. You want to **calculate if you need to study more** for the math exam
> 4. How do you calibrate your `math level`?
> 5. Well, you've got **100 questions `X` with 100 solutions `y`** from past years' exams
> 6. You may study the 100 questions with 100 solutions `fit(questions, solutions)`
> 7. Then, you may do a `mock exam` with the 100 questions `predict(questions)`
> 8. And compare `your_solutions` with the `real_solutions`
> 9. You've got **90/100 correct answers** `accuracy` in the mock exam
> 10. You think you are **prepared for the math exam**
> 11. And when you do **the real exam on Saturday, the mark is 40/100**
> 12. Why? How could we have prevented this?
> 13. **Solution**: separate the 100 questions into `70 train` to study and `30 test` for the mock exam.

In [62]:
from sklearn.model_selection import train_test_split

In [66]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
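The exam analogy can be reproduced with a small sketch. It assumes synthetic stand-in data with deliberately noisy labels (not the CIS dataset), and the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te` and `tree` are hypothetical. The model "studies" the training rows and is then graded twice: once on the rows it has already seen and once on rows it has never seen.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 1000 "questions" with noisy "solutions"
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, flip_y=0.2, random_state=0
)

# Same split settings as in the cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_tr, y_tr)  # study only the training questions

train_accuracy = tree.score(X_tr, y_tr)  # graded on questions already seen
test_accuracy = tree.score(X_te, y_te)   # graded on unseen questions

print(f"train: {train_accuracy:.2f}  test: {test_accuracy:.2f}")
```

The gap between the two numbers is the difference between the mock exam made of memorized questions and the real exam on Saturday.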

# `train_test_split()` the Data

> 1. **`fit()` the model with `Train Data`**
>
> - `model.fit(70%questions, 70%solutions)`

In [68]:
model = DecisionTreeClassifier()

In [69]:
model.fit(X_train, y_train)

DecisionTreeClassifier()

> 2. **`.predict()` answers with `Test Data` (mock exam)**
>
> - `your_solutions = model.predict(30%questions)`

In [71]:
your_solutions = model.predict(X_test)

> 3. **Compare `your_solutions` with the `correct answers` from the mock exam**
>
> - `your_solutions == real_solutions`?

In [75]:
(y_test == your_solutions).mean()

0.8064118372379778
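The manual comparison above is the definition of accuracy: the share of predictions that match the real answers. As a sketch on synthetic stand-in data (an assumption, so the value differs from the one above), the same number can be obtained with `model.score()` or with `accuracy_score()` from `sklearn.metrics`. The variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), split the same way as in the notebook
X_demo, y_demo = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
predictions = tree.predict(X_te)

# Three ways to compute the same accuracy on the test set
manual = (y_te == predictions).mean()           # share of matching answers
via_score = tree.score(X_te, y_te)              # classifier's built-in accuracy
via_metric = accuracy_score(y_te, predictions)  # metric function

print(manual, via_score, via_metric)
```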

# Optimize All Models & Compare Again

## Make a Procedure Sample for `DecisionTreeClassifier()`

In [76]:
model = DecisionTreeClassifier()

model.fit(X_train, y_train)

your_solutions = model.predict(X_test)

(y_test == your_solutions).mean()

0.8064118372379778

## Automate the Procedure into a `function()`

In [77]:
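def precision(model):
    # Despite its name, this function returns the accuracy on the test set

    # Study with the training questions only
    model.fit(X_train, y_train)

    # Answer the mock exam (test questions)
    your_solutions = model.predict(X_test)

    # Share of test answers that match the real solutions
    accuracy = (y_test == your_solutions).mean()

    return accuracy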

## `DecisionTreeClassifier()` Accuracy

In [79]:
precision(dt)

0.8064118372379778

## `SVC()` Accuracy

In [80]:
precision(sv)

0.7866831072749692

## `LogisticRegression()` Accuracy

In [81]:
precision(lr)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.8508014796547472
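To see the three models side by side, the calls above can be collected in a loop. The sketch below assumes synthetic stand-in data, so its numbers will not match the ones in this notebook. The names `models`, `results` and `comparison`, as well as `max_iter=1000` for the logistic regression, are choices made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), split as in the notebook
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "SVC": SVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Fit each model on the training rows, then score it on both sets
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = {
        "train": model.score(X_tr, y_tr),
        "test": model.score(X_te, y_te),
    }

# One row per model, one column per dataset
comparison = pd.DataFrame(results).T
print(comparison)
```

Reading the `train` and `test` columns together shows which model's training score is the most optimistic compared with its score on unseen data.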

# Which is the Best Model with `train_test_split()`?

> Which model has the **highest accuracy**?

# Reflect

> - Banks deploy models to predict the **probability that a customer will repay a loan**
> - If the bank had used the `DecisionTreeClassifier()` instead of the `LogisticRegression()`,
> - what would have happened?
> - Is `train_test_split()` always required to compare models?