![](../src/logo.svg)

**© Jesús López**

Send him any questions on **[Twitter](https://twitter.com/jsulopz)** or **[LinkedIn](https://linkedin.com/in/jsulopz)**

# The Challenge

<div class="alert alert-danger">
    Build different models and choose the best one:
</div>

In [1]:
import pandas as pd #!

df_internet = pd.read_excel('../data/internet_usage_spain.xlsx', sheet_name=1, index_col=0)
df_internet

| name | internet_usage | sex | age | education |
|---|---|---|---|---|
| Josefina | 0 | Female | 66 | Elementary |
| Vicki | 1 | Male | 72 | Elementary |
| ... | ... | ... | ... | ... |
| Christine | 1 | Male | 31 | High School |
| Kimberly | 0 | Male | 52 | Elementary |


# The Covered Solution

<div class="alert alert-success">
    and get the following comparisons ↓
</div>

In [55]:
?? #! read the full story to find out the solution

dt    0.859878
lr    0.833401
sv    0.783707
dtype: float64

In [99]:
?? #! read the full story to find out the solution

lr    0.854817
dt    0.804613
sv    0.778833
dtype: float64

# What will we learn?

- How to automate a repetitive process with a `function()`
- Why we should not evaluate the model with the same data we used to fit it
- How companies can mess up in production if they use overfitted models
- The power of randomisation

# Which concepts will we use?

- Train Test Split
- Unpacking iterable objects into multiple variables
- Overfitting
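
Unpacking is what keeps the `train_test_split()` call later in the notebook readable: the function returns a single list and Python spreads its elements over several variable names. A minimal sketch of the idea, where `split_in_half()` is a hypothetical helper invented for this illustration and the ages are made-up values:

```python
def split_in_half(values):
    """Hypothetical helper: return the first and the second half of a list."""
    middle = len(values) // 2
    return values[:middle], values[middle:]


ages = [66, 72, 31, 52]  # made-up values for illustration

# Without unpacking: one tuple that has to be indexed afterwards
halves = split_in_half(ages)
first_half = halves[0]

# With unpacking: each returned object lands in its own variable
ages_a, ages_b = split_in_half(ages)
print(ages_a, ages_b)
```

The number of names on the left must match the number of objects returned, which is why `train_test_split(X, y)` needs four variables.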

# Requirements?

- Accuracy to evaluate a model
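
As a refresher, accuracy is the share of predictions that match the real values, and it is what the `.score()` method of scikit-learn classifiers returns. A minimal sketch in plain Python, assuming made-up labels and a hypothetical `accuracy()` helper that is not part of the notebook:

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the real values."""
    hits = sum(real == pred for real, pred in zip(y_true, y_pred))
    return hits / len(y_true)


y_real = [0, 1, 1, 0]  # made-up real classes
y_pred = [0, 1, 0, 0]  # made-up predictions: the third one is wrong

result = accuracy(y_real, y_pred)
print(result)  # 3 of 4 predictions are correct -> 0.75
```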

# The starting *thing*

In [1]:
import pandas as pd #!

df_internet = pd.read_excel('../data/internet_usage_spain.xlsx', sheet_name=1, index_col=0)
df_internet

| name | internet_usage | sex | age | education |
|---|---|---|---|---|
| Josefina | 0 | Female | 66 | Elementary |
| Vicki | 1 | Male | 72 | Elementary |
| ... | ... | ... | ... | ... |
| Christine | 1 | Male | 31 | High School |
| Kimberly | 0 | Male | 52 | Elementary |


# Syllabus for the [Notebook](01script_functions.ipynb)

1. Load the Data
2. Preprocess the Data
3. Feature Selection
4. Build & Compare Models' Scores
    1. DecisionTreeClassifier() Model in Python
    2. SVC() Model in Python
    3. LogisticRegression() Model in Python
5. Function to Automate Lines of Code
    1. Make a Procedure Sample for DecisionTreeClassifier()
    2. Automate the Procedure into a function()
        1. Select all cells and [shift] + [M]
        2. Distinguish the line that gives you the result you want and put it into a variable
        3. Add a line with a return to tell the function the object you want in the end
        4. Indent everything to the right
        5. Define the function in the first line
        6. What am I gonna change every time I run the function
        7. Generalize the name of the parameter
        8. Add docstring
6. Calculate Models' Accuracies
    1. DecisionTreeClassifier() Accuracy
    2. SVC() Accuracy
    3. LogisticRegression() Accuracy
7. Which is the Best Model?
8. University Access Exams Analogy
9. train_test_split() the Data
    1. What the heck is the function returning?
    2. fit() the model with Train Data
    3. Compare the predictions with the real data
10. Optimize All Models & Compare Again
    1. Make a Procedure Sample for DecisionTreeClassifier()
    2. Automate the Procedure into a function()
11. Calculate Models' Accuracies
    1. DecisionTreeClassifier() Accuracy
    2. SVC() Accuracy
    3. LogisticRegression() Accuracy
12. Which is the Best Model with train_test_split()?
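
The function-building steps listed under item 5 of the syllabus can be sketched as follows. This is one possible outcome, not the notebook's exact code: this variant receives the data as the parameters `features` and `target` (names chosen here for illustration), and `MajorityClassifier` is a hypothetical stand-in that only mimics the `fit()` and `score()` methods of a scikit-learn model so the example runs on its own.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical stand-in exposing fit() and score() like scikit-learn models."""

    def fit(self, features, target):
        # Remember the most frequent class seen during fitting
        self.majority_ = Counter(target).most_common(1)[0][0]
        return self

    def score(self, features, target):
        # Accuracy: share of rows whose real class equals the majority class
        hits = sum(real == self.majority_ for real in target)
        return hits / len(target)


def calculate_accuracy(model, features, target):
    """Fit `model` and return its accuracy on the same data.

    `model` is the part that changes on every call, so it becomes a
    parameter with a general name instead of `model_dt`.
    """
    model.fit(features, target)
    result = model.score(features, target)  # the line that gives the result

    return result


ages = [[66], [72], [31], [52]]  # made-up rows
uses_internet = [0, 1, 1, 1]     # made-up classes

accuracy_majority = calculate_accuracy(MajorityClassifier(), ages, uses_internet)
print(accuracy_majority)
```

Any object with `fit()` and `score()` can be passed in, which is what lets a single function serve `DecisionTreeClassifier()`, `SVC()` and `LogisticRegression()` alike.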

# The Uncovered Solution

In [2]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

In [9]:
df_internet = pd.get_dummies(df_internet, drop_first=True)

In [10]:
X = df_internet.drop(columns='internet_usage')
y = df_internet.internet_usage

In [16]:
model_dt = DecisionTreeClassifier()
model_sv = SVC()
model_lr = LogisticRegression(max_iter=1000)

In [17]:
def calculate_accuracy(model):
    """Fit `model` on the full data (X, y) and return its accuracy on that same data."""
    model.fit(X, y)
    result = model.score(X, y)

    return result

In [21]:
accuracies = {}

In [22]:
accuracies['dt'] = calculate_accuracy(model_dt)

In [23]:
accuracies['sv'] = calculate_accuracy(model_sv)

In [24]:
accuracies['lr'] = calculate_accuracy(model_lr)

In [26]:
pd.Series(accuracies).sort_values(ascending=False)

dt    0.859878
lr    0.833401
sv    0.783707
dtype: float64

In [27]:
from sklearn.model_selection import train_test_split

In [28]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [29]:
def calculate_accuracy_test(model):
    """Fit `model` on the train data and return its accuracy on the unseen test data."""
    model.fit(X_train, y_train)
    result = model.score(X_test, y_test)

    return result

In [32]:
accuracies_test = {}

In [33]:
accuracies_test['dt'] = calculate_accuracy_test(model_dt)

In [34]:
accuracies_test['sv'] = calculate_accuracy_test(model_sv)

In [35]:
accuracies_test['lr'] = calculate_accuracy_test(model_lr)

In [37]:
pd.Series(accuracies_test).sort_values(ascending=False)

lr    0.850801
dt    0.806412
sv    0.786683
dtype: float64
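
The decision tree falls from 0.859878 to 0.806412 once it is scored on data it has not seen, while the logistic regression moves from 0.833401 to 0.850801. A drop like the tree's is what we would expect from a model that has partly memorised its training rows. The sketch below exaggerates that effect with a hypothetical `Memorizer` class and made-up data. It splits by position to keep the output reproducible, whereas `train_test_split()` shuffles the rows by default.

```python
from collections import Counter


class Memorizer:
    """Hypothetical model that stores every training row verbatim."""

    def fit(self, rows, labels):
        self.memory_ = dict(zip(rows, labels))
        # Fallback for rows it has never seen: the most common training label
        self.fallback_ = Counter(labels).most_common(1)[0][0]
        return self

    def score(self, rows, labels):
        predictions = [self.memory_.get(row, self.fallback_) for row in rows]
        hits = sum(pred == real for pred, real in zip(predictions, labels))
        return hits / len(labels)


# Made-up data: 20 row ids with alternating classes
rows = list(range(20))
labels = [row % 2 for row in rows]

# Positional split: first 14 rows to train, last 6 rows to test
rows_train, rows_test = rows[:14], rows[14:]
labels_train, labels_test = labels[:14], labels[14:]

model = Memorizer().fit(rows_train, labels_train)
train_accuracy = model.score(rows_train, labels_train)
test_accuracy = model.score(rows_test, labels_test)
print(train_accuracy, test_accuracy)
```

The memorised rows are all predicted correctly, while the unseen rows get a constant guess that is right only half of the time. Scoring on the training data alone would have hidden that.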

<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.