# **Exercise 00: model *regularization***

## Configuration:

Import necessary *Python* packages:

In [1]:
import sys

Add path to own modules:

In [2]:
sys.path.append("../../src", )

Import necessary entities:

In [3]:
from joblib import dump
from sklearn.svm import SVC
from warnings import filterwarnings
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from pandas import (
    Series,
    DataFrame,
    read_csv,
)

Import necessary entities from own modules:

In [4]:
from machine_learning_models_utilities import (
    print_classification_model_cross_validation,
)
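
The helper `print_classification_model_cross_validation` comes from the author's own module, and its source is not shown in this notebook. The sketch below is one hypothetical way such a helper could be written with scikit-learn's `cross_validate`. The function name `sketch_print_cross_validation`, the `cv=10` default, the return value, and the pooled summary statistics are all assumptions chosen to resemble the output format printed later, not the author's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate


def sketch_print_cross_validation(X, y, classification_model, cv: int = 10) -> np.ndarray:
    """Hypothetical stand-in for the author's helper (assumed behaviour)."""
    scores = cross_validate(
        classification_model,
        X,
        y,
        cv=cv,
        scoring="accuracy",
        return_train_score=True,
    )
    # One line per fold, in the same format as the notebook output.
    for train_score, test_score in zip(scores["train_score"], scores["test_score"]):
        print(f"train accuracy - {train_score:.3f} | test accuracy - {test_score:.3f}")
    # Assumption: the summary pools train and test scores together.
    pooled = np.concatenate([scores["train_score"], scores["test_score"]])
    print(f"\nClassification model STD is {pooled.std():.3f}.")
    print(f"Classification model average accuracy on cross-validation is {pooled.mean():.3f}.")
    return pooled


# Usage on synthetic data (not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=21)
pooled_scores = sketch_print_cross_validation(
    X_demo,
    y_demo,
    LogisticRegression(max_iter=1000),
)
```

Returning the pooled scores is a convenience added for this sketch; a helper whose name starts with `print_` may well return nothing.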

Ignore all warnings:

In [5]:
filterwarnings("ignore", )

## Preprocessing:

Create a dictionary of parameters for the `read_csv()` call:

In [6]:
read_csv_params: dict[str, str] = {
    "file": "day_of_week.csv",
    "file_path": "../../data/datasets/",
}

Read the file `day_of_week.csv` into a *Pandas* dataframe:

In [7]:
df: DataFrame = read_csv(
    read_csv_params["file_path"] + read_csv_params["file"],
    index_col=0,
)

Check `df` *Pandas* dataframe:

In [8]:
df.head()

Unnamed: 0,uid_user_0,uid_user_1,uid_user_10,uid_user_11,uid_user_12,uid_user_13,uid_user_14,uid_user_15,uid_user_16,uid_user_17,...,labname_laba04,labname_laba04s,labname_laba05,labname_laba06,labname_laba06s,labname_project1,num_trials,hour,day_of_week,naive_prediction
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,-0.788667,-2.562352,4,3
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,-0.756764,-2.562352,4,3
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,-0.724861,-2.562352,4,3
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,-0.692958,-2.562352,4,3
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,-0.661055,-2.562352,4,3


Prepare features and target variables:

In [9]:
X: DataFrame = df.drop(columns=["day_of_week", "naive_prediction", ], )
y: Series = df["day_of_week"]

Check `X` and `y` variables:

In [10]:
X.head()

Unnamed: 0,uid_user_0,uid_user_1,uid_user_10,uid_user_11,uid_user_12,uid_user_13,uid_user_14,uid_user_15,uid_user_16,uid_user_17,...,labname_lab03s,labname_lab05s,labname_laba04,labname_laba04s,labname_laba05,labname_laba06,labname_laba06s,labname_project1,num_trials,hour
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.788667,-2.562352
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.756764,-2.562352
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.724861,-2.562352
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.692958,-2.562352
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.661055,-2.562352


In [11]:
y.head()

0    4
1    4
2    4
3    4
4    4
Name: day_of_week, dtype: int64

Use `train_test_split()` function:

In [12]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify=y,
    test_size=0.2,
    random_state=21,
)
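
Passing `stratify=y` asks `train_test_split()` to keep the class proportions of `y` roughly equal in the train and test parts. The standard-library sketch below illustrates the idea only; it is not scikit-learn's algorithm, and the function name `stratified_split_sketch` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split_sketch(labels, test_size=0.2, seed=21):
    """Return (train_idx, test_idx), sampling `test_size` of every class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from each class so proportions are preserved.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical imbalanced labels: 50 of class 0, 20 of class 1, 30 of class 2.
labels = [0] * 50 + [1] * 20 + [2] * 30
train_idx, test_idx = stratified_split_sketch(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Each class contributes 20% of its samples to the test part, so a rare class cannot end up missing from it by chance.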

Check `X_train`, `X_test`, `y_train`, `y_test` variables:

In [13]:
X_train.head()

Unnamed: 0,uid_user_0,uid_user_1,uid_user_10,uid_user_11,uid_user_12,uid_user_13,uid_user_14,uid_user_15,uid_user_16,uid_user_17,...,labname_lab03s,labname_lab05s,labname_laba04,labname_laba04s,labname_laba05,labname_laba06,labname_laba06s,labname_project1,num_trials,hour
860,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.724861,-0.691561
385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-0.629151,-1.159259
422,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.788667,-1.393108
326,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,-0.533442,0.945382
714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.67888,-0.92541


In [14]:
X_test.head()

Unnamed: 0,uid_user_0,uid_user_1,uid_user_10,uid_user_11,uid_user_12,uid_user_13,uid_user_14,uid_user_15,uid_user_16,uid_user_17,...,labname_lab03s,labname_lab05s,labname_laba04,labname_laba04s,labname_laba05,labname_laba06,labname_laba06s,labname_project1,num_trials,hour
1087,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.316943,0.243835
16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,-0.788667,-0.691561
563,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-0.373926,-1.393108
1381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.182507,-0.223863
1199,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.533442,-0.691561


In [15]:
y_train.head()

860    6
385    4
422    5
326    6
714    1
Name: day_of_week, dtype: int64

In [16]:
y_test.head()

1087    1
16      5
563     6
1381    3
1199    2
Name: day_of_week, dtype: int64

## *Logistic regression regularization*:

Create a model of *logistic regression*:

In [17]:
log_reg_model: LogisticRegression = LogisticRegression(
    random_state=21,
    fit_intercept=False,
)

Use *cross-validation* to evaluate the *accuracy* metric of the *logistic regression* model:

In [18]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=log_reg_model,
)

train accuracy - 0.628 | test accuracy - 0.740
train accuracy - 0.653 | test accuracy - 0.615
train accuracy - 0.654 | test accuracy - 0.609
train accuracy - 0.636 | test accuracy - 0.544
train accuracy - 0.645 | test accuracy - 0.633
train accuracy - 0.645 | test accuracy - 0.580
train accuracy - 0.629 | test accuracy - 0.571
train accuracy - 0.644 | test accuracy - 0.619
train accuracy - 0.636 | test accuracy - 0.601
train accuracy - 0.640 | test accuracy - 0.613

Classification model STD is 0.038.
Classification model average accuracy on cross-validation is 0.627.
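
The two summary lines can be checked against the per-fold numbers. The sketch below reuses the ten printed train and test accuracies of the baseline model. It suggests that the reported average and STD are computed over all twenty scores pooled together, which is an inference from the printed output, since the helper's source is not shown.

```python
from statistics import mean, pstdev

# Per-fold accuracies copied from the baseline logistic regression output.
train = [0.628, 0.653, 0.654, 0.636, 0.645, 0.645, 0.629, 0.644, 0.636, 0.640]
test = [0.740, 0.615, 0.609, 0.544, 0.633, 0.580, 0.571, 0.619, 0.601, 0.613]

pooled = train + test
pooled_mean = mean(pooled)
pooled_std = pstdev(pooled)  # population standard deviation
test_mean = mean(test)

print(f"pooled mean {pooled_mean:.3f} | pooled STD {pooled_std:.3f}")
print(f"test-only mean {test_mean:.4f}")
```

If this reading is right, the reported average sits above the test-only mean whenever train accuracy exceeds test accuracy, so the test column is the safer guide to generalization.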



Get info about *cross-validation* duration for *logistic regression* model:

In [19]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=log_reg_model,
)

295 ms ± 30.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
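
`%%timeit` is an IPython cell magic, so it only works inside a notebook or IPython session. Outside of one, a similar measurement can be sketched with the standard `timeit` module. The function name `time_callable`, the stand-in workload, and the choice of population standard deviation are assumptions for illustration.

```python
import statistics
import timeit


def time_callable(func, repeat=7, number=1):
    """Return (mean, std) in seconds per loop over `repeat` runs."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_loop = [total / number for total in totals]
    return statistics.mean(per_loop), statistics.pstdev(per_loop)


# Hypothetical workload standing in for the cross-validation call.
mean_s, std_s = time_callable(lambda: sum(i * i for i in range(10_000)))
print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e3:.3f} ms per loop")
```

In the notebook, the wrapped callable would be the cross-validation call itself, with its printed output suppressed as `%%capture` does.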


Create an optimized model of *logistic regression* with parameter `penalty=None` (no regularization):

In [20]:
optimized_log_reg_model_one: LogisticRegression = LogisticRegression(
    penalty=None,
    random_state=21,
    fit_intercept=False,
)

Use *cross-validation* to evaluate the *accuracy* metric of the optimized *logistic regression* model with parameter `penalty=None`:

In [21]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_log_reg_model_one,
)

train accuracy - 0.661 | test accuracy - 0.751
train accuracy - 0.664 | test accuracy - 0.633
train accuracy - 0.659 | test accuracy - 0.621
train accuracy - 0.660 | test accuracy - 0.574
train accuracy - 0.667 | test accuracy - 0.663
train accuracy - 0.669 | test accuracy - 0.615
train accuracy - 0.658 | test accuracy - 0.619
train accuracy - 0.656 | test accuracy - 0.637
train accuracy - 0.657 | test accuracy - 0.607
train accuracy - 0.675 | test accuracy - 0.637

Classification model STD is 0.034.
Classification model average accuracy on cross-validation is 0.649.



Get info about *cross-validation* duration for optimized *logistic regression* model with parameter `penalty=None`:

In [22]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_log_reg_model_one,
)

599 ms ± 25.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Create an optimized model of *logistic regression* with parameter `penalty="l1"`:

In [23]:
optimized_log_reg_model_two: LogisticRegression = LogisticRegression(
    penalty="l1",
    random_state=21,
    solver="liblinear",
    fit_intercept=False,
)
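
The `solver="liblinear"` argument is needed here because not every solver supports every penalty. The lookup below transcribes the solver and penalty combinations listed in the scikit-learn documentation for `LogisticRegression`, assuming a release that accepts `penalty=None`. The names `SUPPORTED_PENALTIES` and `solvers_for` are hypothetical.

```python
# Penalties accepted by each LogisticRegression solver, transcribed from the
# scikit-learn documentation (assumption: a release that accepts penalty=None).
SUPPORTED_PENALTIES = {
    "lbfgs": {"l2", None},
    "liblinear": {"l1", "l2"},
    "newton-cg": {"l2", None},
    "newton-cholesky": {"l2", None},
    "sag": {"l2", None},
    "saga": {"elasticnet", "l1", "l2", None},
}


def solvers_for(penalty):
    """Return the solvers that accept the given penalty, sorted by name."""
    return sorted(
        solver
        for solver, penalties in SUPPORTED_PENALTIES.items()
        if penalty in penalties
    )


print(solvers_for("l1"))          # ['liblinear', 'saga']
print(solvers_for("elasticnet"))  # ['saga']
```

This is consistent with the notebook's choices: `liblinear` for the `"l1"` penalty and `saga` for `"elasticnet"`.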

Use *cross-validation* to evaluate the *accuracy* metric of the optimized *logistic regression* model with parameter `penalty="l1"`:

In [24]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_log_reg_model_two,
)

train accuracy - 0.612 | test accuracy - 0.686
train accuracy - 0.634 | test accuracy - 0.598
train accuracy - 0.643 | test accuracy - 0.592
train accuracy - 0.615 | test accuracy - 0.533
train accuracy - 0.630 | test accuracy - 0.598
train accuracy - 0.630 | test accuracy - 0.556
train accuracy - 0.625 | test accuracy - 0.554
train accuracy - 0.637 | test accuracy - 0.625
train accuracy - 0.626 | test accuracy - 0.613
train accuracy - 0.622 | test accuracy - 0.589

Classification model STD is 0.034.
Classification model average accuracy on cross-validation is 0.611.



Get info about *cross-validation* duration for optimized *logistic regression* model with parameter `penalty="l1"`:

In [25]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_log_reg_model_two,
)

190 ms ± 6.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Create an optimized model of *logistic regression* with parameter `penalty="l2"` (the scikit-learn default, so the scores match the baseline model above):

In [26]:
optimized_log_reg_model_three: LogisticRegression = LogisticRegression(
    penalty="l2",
    random_state=21,
    fit_intercept=False,
)

Use *cross-validation* to evaluate the *accuracy* metric of the optimized *logistic regression* model with parameter `penalty="l2"`:

In [27]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_log_reg_model_three,
)

train accuracy - 0.628 | test accuracy - 0.740
train accuracy - 0.653 | test accuracy - 0.615
train accuracy - 0.654 | test accuracy - 0.609
train accuracy - 0.636 | test accuracy - 0.544
train accuracy - 0.645 | test accuracy - 0.633
train accuracy - 0.645 | test accuracy - 0.580
train accuracy - 0.629 | test accuracy - 0.571
train accuracy - 0.644 | test accuracy - 0.619
train accuracy - 0.636 | test accuracy - 0.601
train accuracy - 0.640 | test accuracy - 0.613

Classification model STD is 0.038.
Classification model average accuracy on cross-validation is 0.627.



Get info about *cross-validation* duration for optimized *logistic regression* model with parameter `penalty="l2"`:

In [28]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_log_reg_model_three,
)

328 ms ± 48.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Create an optimized model of *logistic regression* with parameter `penalty="elasticnet"`:

In [29]:
optimized_log_reg_model_four: LogisticRegression = LogisticRegression(
    l1_ratio=0.5,
    solver="saga",
    random_state=21,
    fit_intercept=False,
    penalty="elasticnet",
)
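
The three penalties differ only in the term added to the loss. The sketch below evaluates that term for a hypothetical weight vector, following the penalty definitions given in the scikit-learn user guide. The intercept and the `C` scaling of the loss are left out, and the function names are illustrative.

```python
def l1_penalty(weights):
    """Sum of absolute weights: ||w||_1."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Half the squared Euclidean norm: 0.5 * w^T w."""
    return 0.5 * sum(w * w for w in weights)


def elastic_net_penalty(weights, l1_ratio):
    """Mix of both: (1 - rho) / 2 * w^T w + rho * ||w||_1."""
    return (1.0 - l1_ratio) * l2_penalty(weights) + l1_ratio * l1_penalty(weights)


# Hypothetical weights, chosen so the arithmetic is easy to follow.
weights = [1.0, -2.0, 0.5]
print(l1_penalty(weights))                # 3.5
print(l2_penalty(weights))                # 2.625
print(elastic_net_penalty(weights, 0.5))  # 3.0625
```

With `l1_ratio=0.5`, as in the cell above, the penalty is the even mix of the two; `l1_ratio=0` reduces to the L2 term and `l1_ratio=1` to the L1 term.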

Use *cross-validation* to evaluate the *accuracy* metric of the optimized *logistic regression* model with parameter `penalty="elasticnet"`:

In [30]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_log_reg_model_four,
)

train accuracy - 0.630 | test accuracy - 0.734
train accuracy - 0.651 | test accuracy - 0.615
train accuracy - 0.650 | test accuracy - 0.609
train accuracy - 0.635 | test accuracy - 0.538
train accuracy - 0.647 | test accuracy - 0.633
train accuracy - 0.645 | test accuracy - 0.586
train accuracy - 0.631 | test accuracy - 0.577
train accuracy - 0.644 | test accuracy - 0.631
train accuracy - 0.632 | test accuracy - 0.607
train accuracy - 0.644 | test accuracy - 0.607

Classification model STD is 0.037.
Classification model average accuracy on cross-validation is 0.627.



Get info about *cross-validation* duration for optimized *logistic regression* model with parameter `penalty="elasticnet"`:

In [31]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_log_reg_model_four,
)

3.03 s ± 147 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## *SVC regularization*:

Create a model of *SVC*:

In [32]:
svc_model: SVC = SVC(
    kernel="linear",
    random_state=21,
    probability=True,
)

Use *cross-validation* to evaluate the *accuracy* metric of the *SVC* model:

In [33]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=svc_model,
)

train accuracy - 0.693 | test accuracy - 0.769
train accuracy - 0.711 | test accuracy - 0.686
train accuracy - 0.701 | test accuracy - 0.675
train accuracy - 0.699 | test accuracy - 0.609
train accuracy - 0.701 | test accuracy - 0.698
train accuracy - 0.707 | test accuracy - 0.728
train accuracy - 0.700 | test accuracy - 0.655
train accuracy - 0.711 | test accuracy - 0.637
train accuracy - 0.696 | test accuracy - 0.685
train accuracy - 0.711 | test accuracy - 0.643

Classification model STD is 0.034.
Classification model average accuracy on cross-validation is 0.691.



Get info about *cross-validation* duration for *SVC* model:

In [34]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=svc_model,
)

4.19 s ± 298 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Create an optimized model of *SVC* with parameter `C=0.25`:

In [35]:
optimized_svc_model_one: SVC = SVC(
    C=0.25,
    kernel="linear",
    random_state=21,
    probability=True,
)
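
In a soft-margin SVM, `C` multiplies the loss term, so a smaller `C` means stronger regularization. The sketch below evaluates the textbook binary objective `0.5 * ||w||^2 + C * sum(hinge)` for a hypothetical weight vector and three toy points. It is an illustration of the role of `C`, not of how `SVC` solves the multiclass problem in this notebook.

```python
def soft_margin_objective(w, b, X, y, C):
    """Textbook binary soft-margin objective: 0.5 * ||w||^2 + C * sum(hinge)."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for xi, yi in zip(X, y):
        decision = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hinge_total += max(0.0, 1.0 - yi * decision)
    return margin_term + C * hinge_total


# Hypothetical weights and points; labels are +1 or -1.
w, b = [1.0, -1.0], 0.0
X_toy = [[2.0, 0.0], [0.5, 0.0], [0.0, 0.5]]
y_toy = [1, 1, 1]

for C in (0.25, 1.0, 5.0):
    print(C, soft_margin_objective(w, b, X_toy, y_toy, C))
```

The margin term stays at 1.0 while the weighted hinge loss grows from 0.5 to 10.0, so at `C=5` the optimizer is pushed much harder to fit the training points.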

Use *cross-validation* to evaluate the *accuracy* metric of the optimized *SVC* model with parameter `C=0.25`:

In [36]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_svc_model_one,
)

train accuracy - 0.657 | test accuracy - 0.734
train accuracy - 0.660 | test accuracy - 0.633
train accuracy - 0.646 | test accuracy - 0.604
train accuracy - 0.654 | test accuracy - 0.533
train accuracy - 0.654 | test accuracy - 0.675
train accuracy - 0.654 | test accuracy - 0.633
train accuracy - 0.655 | test accuracy - 0.589
train accuracy - 0.661 | test accuracy - 0.595
train accuracy - 0.652 | test accuracy - 0.667
train accuracy - 0.651 | test accuracy - 0.595

Classification model STD is 0.040.
Classification model average accuracy on cross-validation is 0.640.



Get info about *cross-validation* duration for optimized *SVC* model with parameter `C=0.25`:

In [37]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_svc_model_one,
)

4.84 s ± 648 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Create an optimized model of *SVC* with parameter `C=0.5`:

In [38]:
optimized_svc_model_two: SVC = SVC(
    C=0.5,
    kernel="linear",
    random_state=21,
    probability=True,
)

Use *cross-validation* to evaluate the *accuracy* metric of the optimized *SVC* model with parameter `C=0.5`:

In [39]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_svc_model_two,
)

train accuracy - 0.673 | test accuracy - 0.751
train accuracy - 0.681 | test accuracy - 0.675
train accuracy - 0.680 | test accuracy - 0.663
train accuracy - 0.676 | test accuracy - 0.550
train accuracy - 0.690 | test accuracy - 0.686
train accuracy - 0.678 | test accuracy - 0.686
train accuracy - 0.678 | test accuracy - 0.625
train accuracy - 0.688 | test accuracy - 0.619
train accuracy - 0.678 | test accuracy - 0.679
train accuracy - 0.689 | test accuracy - 0.643

Classification model STD is 0.038.
Classification model average accuracy on cross-validation is 0.669.



Get info about *cross-validation* duration for optimized *SVC* model with parameter `C=0.5`:

In [40]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_svc_model_two,
)

4.77 s ± 314 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Create an optimized model of *SVC* with parameter `C=1` (the scikit-learn default, so the scores match the baseline *SVC* model above):

In [41]:
optimized_svc_model_three: SVC = SVC(
    C=1,
    kernel="linear",
    random_state=21,
    probability=True,
)

Use *cross-validation* to evaluate the *accuracy* metric of the optimized *SVC* model with parameter `C=1`:

In [42]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_svc_model_three,
)

train accuracy - 0.693 | test accuracy - 0.769
train accuracy - 0.711 | test accuracy - 0.686
train accuracy - 0.701 | test accuracy - 0.675
train accuracy - 0.699 | test accuracy - 0.609
train accuracy - 0.701 | test accuracy - 0.698
train accuracy - 0.707 | test accuracy - 0.728
train accuracy - 0.700 | test accuracy - 0.655
train accuracy - 0.711 | test accuracy - 0.637
train accuracy - 0.696 | test accuracy - 0.685
train accuracy - 0.711 | test accuracy - 0.643

Classification model STD is 0.034.
Classification model average accuracy on cross-validation is 0.691.



Get info about *cross-validation* duration for optimized *SVC* model with parameter `C=1`:

In [43]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_svc_model_three,
)

4.25 s ± 332 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Create an optimized model of *SVC* with parameter `C=5`:

In [44]:
optimized_svc_model_four: SVC = SVC(
    C=5,
    kernel="linear",
    random_state=21,
    probability=True,
)

Use *cross-validation* to evaluate the *accuracy* metric of the optimized *SVC* model with parameter `C=5`:

In [45]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_svc_model_four,
)

train accuracy - 0.761 | test accuracy - 0.817
train accuracy - 0.762 | test accuracy - 0.704
train accuracy - 0.772 | test accuracy - 0.757
train accuracy - 0.769 | test accuracy - 0.686
train accuracy - 0.730 | test accuracy - 0.757
train accuracy - 0.760 | test accuracy - 0.763
train accuracy - 0.762 | test accuracy - 0.696
train accuracy - 0.758 | test accuracy - 0.696
train accuracy - 0.771 | test accuracy - 0.738
train accuracy - 0.773 | test accuracy - 0.690

Classification model STD is 0.034.
Classification model average accuracy on cross-validation is 0.746.



Get info about *cross-validation* duration for optimized *SVC* model with parameter `C=5`:

In [46]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_svc_model_four,
)

4.94 s ± 309 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## *Decision tree regularization*:

Create a model of *decision tree*:

In [47]:
tree_model: DecisionTreeClassifier = DecisionTreeClassifier(
    max_depth=10,
    random_state=21,
)

Use *cross-validation* to evaluate the *accuracy* metric of the *decision tree* model:

In [48]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=tree_model,
)

train accuracy - 0.809 | test accuracy - 0.781
train accuracy - 0.822 | test accuracy - 0.763
train accuracy - 0.817 | test accuracy - 0.740
train accuracy - 0.823 | test accuracy - 0.757
train accuracy - 0.813 | test accuracy - 0.787
train accuracy - 0.821 | test accuracy - 0.822
train accuracy - 0.814 | test accuracy - 0.726
train accuracy - 0.819 | test accuracy - 0.720
train accuracy - 0.824 | test accuracy - 0.762
train accuracy - 0.825 | test accuracy - 0.756

Classification model STD is 0.035.
Classification model average accuracy on cross-validation is 0.790.



Get info about *cross-validation* duration for *decision tree* model:

In [49]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=tree_model,
)

107 ms ± 13.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Create an optimized model of *decision tree* with parameter `max_depth=15`:

In [50]:
optimized_tree_model_one: DecisionTreeClassifier = DecisionTreeClassifier(
    max_depth=15,
    random_state=21,
)

Use *cross-validation* to evaluate the *accuracy* metric of the optimized *decision tree* model with parameter `max_depth=15`:

In [51]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_tree_model_one,
)

train accuracy - 0.942 | test accuracy - 0.876
train accuracy - 0.947 | test accuracy - 0.882
train accuracy - 0.955 | test accuracy - 0.864
train accuracy - 0.958 | test accuracy - 0.858
train accuracy - 0.953 | test accuracy - 0.876
train accuracy - 0.956 | test accuracy - 0.917
train accuracy - 0.947 | test accuracy - 0.845
train accuracy - 0.955 | test accuracy - 0.851
train accuracy - 0.947 | test accuracy - 0.839
train accuracy - 0.947 | test accuracy - 0.839

Classification model STD is 0.046.
Classification model average accuracy on cross-validation is 0.908.



Get info about *cross-validation* duration for optimized *decision tree* model with parameter `max_depth=15`:

In [52]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_tree_model_one,
)

101 ms ± 7.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Create an optimized model of *decision tree* with parameter `max_depth=20`:

In [53]:
optimized_tree_model_two: DecisionTreeClassifier = DecisionTreeClassifier(
    max_depth=20,
    random_state=21,
)

Use *cross-validation* to evaluate the *accuracy* metric of the optimized *decision tree* model with parameter `max_depth=20`:

In [54]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_tree_model_two,
)

train accuracy - 0.993 | test accuracy - 0.899
train accuracy - 0.987 | test accuracy - 0.905
train accuracy - 0.990 | test accuracy - 0.888
train accuracy - 0.984 | test accuracy - 0.888
train accuracy - 0.992 | test accuracy - 0.893
train accuracy - 0.986 | test accuracy - 0.929
train accuracy - 0.990 | test accuracy - 0.893
train accuracy - 0.992 | test accuracy - 0.881
train accuracy - 0.989 | test accuracy - 0.857
train accuracy - 0.986 | test accuracy - 0.869

Classification model STD is 0.051.
Classification model average accuracy on cross-validation is 0.940.
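
Raising `max_depth` loosens the regularization of the tree, and the printed folds show the cost: train accuracy climbs faster than test accuracy. The sketch below computes the mean train-test gap for the three depths tried so far, using the per-fold numbers printed above.

```python
from statistics import mean

# Per-fold accuracies copied from the printed outputs, keyed by max_depth.
train_scores = {
    10: [0.809, 0.822, 0.817, 0.823, 0.813, 0.821, 0.814, 0.819, 0.824, 0.825],
    15: [0.942, 0.947, 0.955, 0.958, 0.953, 0.956, 0.947, 0.955, 0.947, 0.947],
    20: [0.993, 0.987, 0.990, 0.984, 0.992, 0.986, 0.990, 0.992, 0.989, 0.986],
}
test_scores = {
    10: [0.781, 0.763, 0.740, 0.757, 0.787, 0.822, 0.726, 0.720, 0.762, 0.756],
    15: [0.876, 0.882, 0.864, 0.858, 0.876, 0.917, 0.845, 0.851, 0.839, 0.839],
    20: [0.899, 0.905, 0.888, 0.888, 0.893, 0.929, 0.893, 0.881, 0.857, 0.869],
}

gaps = {
    depth: mean(train_scores[depth]) - mean(test_scores[depth])
    for depth in train_scores
}
for depth, gap in gaps.items():
    print(f"max_depth={depth}: train-test gap = {gap:.3f}")
```

The gap widens from roughly 0.06 to roughly 0.10 as the depth goes from 10 to 20, a sign of growing overfitting even though test accuracy still improves.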



Get info about *cross-validation* duration for optimized *decision tree* model with parameter `max_depth=20`:

In [55]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_tree_model_two,
)

117 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Create an optimized model of *decision tree* with parameter `min_samples_split=1.0` (a float is interpreted as a fraction of the training samples):

In [56]:
optimized_tree_model_three: DecisionTreeClassifier = DecisionTreeClassifier(
    max_depth=20,
    random_state=21,
    min_samples_split=1.0,
)

Use *cross-validation* to evaluate the *accuracy* metric of the optimized *decision tree* model with parameter `min_samples_split=1.0`:

In [57]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_tree_model_three,
)

train accuracy - 0.351 | test accuracy - 0.414
train accuracy - 0.359 | test accuracy - 0.349
train accuracy - 0.359 | test accuracy - 0.343
train accuracy - 0.359 | test accuracy - 0.349
train accuracy - 0.368 | test accuracy - 0.266
train accuracy - 0.355 | test accuracy - 0.379
train accuracy - 0.363 | test accuracy - 0.310
train accuracy - 0.354 | test accuracy - 0.393
train accuracy - 0.354 | test accuracy - 0.393
train accuracy - 0.355 | test accuracy - 0.381

Classification model STD is 0.030.
Classification model average accuracy on cross-validation is 0.358.
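
The sharp drop in accuracy follows from how scikit-learn reads a float `min_samples_split`: per its documentation, the value is a fraction, and `ceil(min_samples_split * n_samples)` samples are required to split a node. The sketch below reproduces that rule; the function name and the sample count of 1000 are hypothetical.

```python
from math import ceil


def effective_min_samples_split(min_samples_split, n_samples):
    """Minimum node size needed for a split, following the documented rule."""
    if isinstance(min_samples_split, int):
        # An integer is used as an absolute number of samples.
        return min_samples_split
    # A float is a fraction of the training set, with a floor of 2.
    return max(2, ceil(min_samples_split * n_samples))


n_samples = 1000  # hypothetical training-set size
print(effective_min_samples_split(1.0, n_samples))  # 1000
print(effective_min_samples_split(2, n_samples))    # 2
print(effective_min_samples_split(0.5, n_samples))  # 500
```

With `1.0`, only the root node holds enough samples to be split, so the tree is reduced to a single split regardless of `max_depth=20`.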



Get info about *cross-validation* duration for optimized *decision tree* model with parameter `min_samples_split=1.0`:

In [58]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_tree_model_three,
)

73.2 ms ± 5.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Create an optimized model of *decision tree* with parameter `min_samples_split=2` (the scikit-learn default, so the scores match the `max_depth=20` model above):

In [59]:
optimized_tree_model_four: DecisionTreeClassifier = DecisionTreeClassifier(
    max_depth=20,
    random_state=21,
    min_samples_split=2,
)

Use *cross-validation* to evaluate the *accuracy* metric of the optimized *decision tree* model with parameter `min_samples_split=2`:

In [60]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_tree_model_four,
)

train accuracy - 0.993 | test accuracy - 0.899
train accuracy - 0.987 | test accuracy - 0.905
train accuracy - 0.990 | test accuracy - 0.888
train accuracy - 0.984 | test accuracy - 0.888
train accuracy - 0.992 | test accuracy - 0.893
train accuracy - 0.986 | test accuracy - 0.929
train accuracy - 0.990 | test accuracy - 0.893
train accuracy - 0.992 | test accuracy - 0.881
train accuracy - 0.989 | test accuracy - 0.857
train accuracy - 0.986 | test accuracy - 0.869

Classification model STD is 0.051.
Classification model average accuracy on cross-validation is 0.940.



Get info about *cross-validation* duration for optimized *decision tree* model with parameter `min_samples_split=2`:

In [61]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_tree_model_four,
)

102 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## *Random forest tree regularization*:

Create a model of *random forest tree*:

In [62]:
forest_tree_model: RandomForestClassifier = RandomForestClassifier(
    max_depth=14,
    random_state=21,
    n_estimators=50,
)

Use *cross-validation* to evaluate the *accuracy* metric of the *random forest tree* model:

In [63]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=forest_tree_model,
)

train accuracy - 0.969 | test accuracy - 0.917
train accuracy - 0.966 | test accuracy - 0.947
train accuracy - 0.963 | test accuracy - 0.888
train accuracy - 0.966 | test accuracy - 0.834
train accuracy - 0.971 | test accuracy - 0.923
train accuracy - 0.968 | test accuracy - 0.923
train accuracy - 0.973 | test accuracy - 0.887
train accuracy - 0.968 | test accuracy - 0.857
train accuracy - 0.969 | test accuracy - 0.899
train accuracy - 0.967 | test accuracy - 0.887

Classification model STD is 0.042.
Classification model average accuracy on cross-validation is 0.932.



Get info about *cross-validation* duration for *random forest tree* model:

In [64]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=forest_tree_model,
)

1.02 s ± 50.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Create an optimized model of *random forest tree* with parameter `max_depth=20`:

In [65]:
optimized_forest_tree_model_one: RandomForestClassifier = \
RandomForestClassifier(
    max_depth=20,
    random_state=21,
    n_estimators=50,
)

Use *cross-validation* to evaluate the *accuracy* metric of the optimized *random forest tree* model with parameter `max_depth=20`:

In [66]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_forest_tree_model_one,
)

train accuracy - 0.997 | test accuracy - 0.935
train accuracy - 0.998 | test accuracy - 0.941
train accuracy - 0.998 | test accuracy - 0.929
train accuracy - 0.995 | test accuracy - 0.905
train accuracy - 0.997 | test accuracy - 0.935
train accuracy - 0.994 | test accuracy - 0.953
train accuracy - 0.997 | test accuracy - 0.905
train accuracy - 0.996 | test accuracy - 0.881
train accuracy - 0.998 | test accuracy - 0.923
train accuracy - 0.997 | test accuracy - 0.923

Classification model STD is 0.039.
Classification model average accuracy on cross-validation is 0.960.



Get info about *cross-validation* duration for optimized *random forest tree* model with parameter `max_depth=20`:

In [67]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_forest_tree_model_one,
)

1.25 s ± 90.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Create an optimized model of *random forest tree* with parameter `n_estimators=75`:

In [68]:
optimized_forest_tree_model_two: RandomForestClassifier = \
RandomForestClassifier(
    max_depth=20,
    random_state=21,
    n_estimators=75,
)
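
`n_estimators` sets how many trees contribute to each prediction. According to the scikit-learn documentation, `RandomForestClassifier` predicts the class with the highest mean probability estimate across its trees rather than taking a plain majority vote. The sketch below shows that averaging step on hypothetical per-tree probabilities; the function name is illustrative.

```python
def forest_predict(per_tree_probabilities):
    """Average class probabilities over trees and return (class, mean_probs)."""
    n_trees = len(per_tree_probabilities)
    n_classes = len(per_tree_probabilities[0])
    mean_probs = [
        sum(tree[c] for tree in per_tree_probabilities) / n_trees
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: mean_probs[c])
    return predicted, mean_probs


# Hypothetical output of three trees for one sample and two classes.
per_tree = [
    [0.9, 0.1],  # very confident in class 0
    [0.4, 0.6],  # mildly prefers class 1
    [0.4, 0.6],  # mildly prefers class 1
]
predicted_class, mean_probs = forest_predict(per_tree)
print(predicted_class, [round(p, 3) for p in mean_probs])
```

A majority vote would pick class 1 here, while averaging picks class 0 because one tree is far more confident than the other two.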

Use *cross-validation* to evaluate the *accuracy* metric of the optimized *random forest tree* model with parameter `n_estimators=75`:

In [69]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_forest_tree_model_two,
)

train accuracy - 0.999 | test accuracy - 0.941
train accuracy - 0.998 | test accuracy - 0.941
train accuracy - 0.999 | test accuracy - 0.917
train accuracy - 0.993 | test accuracy - 0.905
train accuracy - 0.996 | test accuracy - 0.935
train accuracy - 0.995 | test accuracy - 0.953
train accuracy - 0.998 | test accuracy - 0.899
train accuracy - 0.995 | test accuracy - 0.887
train accuracy - 0.997 | test accuracy - 0.923
train accuracy - 0.997 | test accuracy - 0.917

Classification model STD is 0.040.
Classification model average accuracy on cross-validation is 0.959.



Get info about *cross-validation* duration for optimized *random forest tree* model with parameter `n_estimators=75`:

In [70]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_forest_tree_model_two,
)

1.83 s ± 222 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Create an optimized model of *random forest tree* with parameter `criterion="entropy"`:

In [71]:
optimized_forest_tree_model_three: RandomForestClassifier = \
RandomForestClassifier(
    max_depth=20,
    random_state=21,
    n_estimators=75,
    criterion="entropy",
)

Use *cross-validation* to evaluate the *accuracy* metric of the optimized *random forest tree* model with parameter `criterion="entropy"`:

In [72]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_forest_tree_model_three,
)

train accuracy - 0.999 | test accuracy - 0.935
train accuracy - 0.999 | test accuracy - 0.947
train accuracy - 1.000 | test accuracy - 0.917
train accuracy - 0.998 | test accuracy - 0.911
train accuracy - 0.998 | test accuracy - 0.935
train accuracy - 0.998 | test accuracy - 0.959
train accuracy - 0.999 | test accuracy - 0.899
train accuracy - 0.998 | test accuracy - 0.899
train accuracy - 0.997 | test accuracy - 0.929
train accuracy - 0.999 | test accuracy - 0.911

Classification model STD is 0.040.
Classification model average accuracy on cross-validation is 0.961.



Get info about *cross-validation* duration for optimized *random forest tree* model with parameter `criterion="entropy"`:

In [73]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_forest_tree_model_three,
)

1.91 s ± 89.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Create an optimized model of *random forest tree* with parameter `criterion="log_loss"`:

In [74]:
optimized_forest_tree_model_four: RandomForestClassifier = \
RandomForestClassifier(
    max_depth=20,
    random_state=21,
    n_estimators=75,
    criterion="log_loss",
)

Use *cross-validation* to evaluate the *accuracy* metric of the optimized *random forest tree* model with parameter `criterion="log_loss"`:

In [75]:
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_forest_tree_model_four,
)

train accuracy - 0.999 | test accuracy - 0.935
train accuracy - 0.999 | test accuracy - 0.947
train accuracy - 1.000 | test accuracy - 0.917
train accuracy - 0.998 | test accuracy - 0.911
train accuracy - 0.998 | test accuracy - 0.935
train accuracy - 0.998 | test accuracy - 0.959
train accuracy - 0.999 | test accuracy - 0.899
train accuracy - 0.998 | test accuracy - 0.899
train accuracy - 0.997 | test accuracy - 0.929
train accuracy - 0.999 | test accuracy - 0.911

Classification model STD is 0.040.
Classification model average accuracy on cross-validation is 0.961.
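
These scores are identical to the `criterion="entropy"` run. That is consistent with the scikit-learn documentation, which describes both `"entropy"` and `"log_loss"` as the Shannon information gain. The sketch below computes the two distinct impurity measures, Gini (the default) and entropy, for hypothetical class distributions in a node.

```python
from math import log2


def gini(probabilities):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in probabilities)


def entropy(probabilities):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


# Hypothetical class distributions in a tree node.
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))
```

Both measures are zero for a pure node and largest for an even class mix, which is why switching between them often changes the fitted trees only slightly.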



Get info about *cross-validation* duration for optimized *random forest tree* model with parameter `criterion="log_loss"`:

In [76]:
%%timeit
%%capture
print_classification_model_cross_validation(
    X=X,
    y=y,
    classification_model=optimized_forest_tree_model_four,
)

1.97 s ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Prediction:

Train the best model, `optimized_forest_tree_model_four`, on the training data:

In [77]:
optimized_forest_tree_model_four.fit(X_train, y_train, );

Calculate the *accuracy* metric of the best model, `optimized_forest_tree_model_four`, on the test data:

In [78]:
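# accuracy_score() expects (y_true, y_pred); accuracy is the same either way.
best_model_accuracy: float = accuracy_score(
    y_test,
    optimized_forest_tree_model_four.predict(X_test, ),
)
print(f"The best model accuracy metric is {best_model_accuracy:.3f}", )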

The best model accuracy metric is 0.920


Create a `prediction` *Pandas* dataframe column:

In [79]:
df["prediction"] = optimized_forest_tree_model_four.predict(X, )

Calculate the error percentage for each class over the whole dataset (training rows included):

In [80]:
(
    df[df["day_of_week"] != df["prediction"]]["day_of_week"].value_counts() /
    df["day_of_week"].value_counts()
).dropna() * 100

day_of_week
0    5.147059
1    2.554745
2    2.013423
3    0.505051
4    2.884615
5    1.476015
6    0.561798
Name: count, dtype: float64
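
The same per-class calculation can be written without *Pandas*, which makes each step explicit: count the samples per class, count the misclassified samples per class, and divide. The function name and the toy labels below are hypothetical. Classes with no errors are left out, mirroring the `.dropna()` call above.

```python
from collections import Counter


def error_percentage_by_class(y_true, y_pred):
    """Percentage of misclassified samples for each class that has errors."""
    totals = Counter(y_true)
    errors = Counter(
        true for true, predicted in zip(y_true, y_pred) if true != predicted
    )
    return {
        label: 100 * errors[label] / totals[label]
        for label in sorted(totals)
        if errors[label] > 0
    }


# Hypothetical labels: 4 samples of class 0, 2 of class 1, 5 of class 2.
y_true_toy = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred_toy = [0, 1, 0, 0, 1, 1, 2, 2, 0, 2, 2]
error_pct = error_percentage_by_class(y_true_toy, y_pred_toy)
print(error_pct)  # {0: 25.0, 2: 20.0}
```

Dividing by the class size rather than by the total number of errors keeps large classes from dominating the comparison.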

`For which weekday does the best model make the most errors?`

Answer: `Monday` (class `0`), with about 5.1% of its samples misclassified.

Save the best model:

In [81]:
dump(
    optimized_forest_tree_model_four,
    "../../models/ex_00_best_model.joblib",
);
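
A saved file can be read back with `joblib.load()`. The sketch below shows the round trip in a temporary directory with a hypothetical stand-in object instead of the trained forest; the file name `demo_model.joblib` is an assumption for illustration.

```python
import tempfile
from pathlib import Path

from joblib import dump, load

# Hypothetical stand-in; a fitted scikit-learn model is saved the same way.
stand_in_model = {"name": "demo", "params": {"max_depth": 20}}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.joblib"
    dump(stand_in_model, model_path)
    restored = load(model_path)

print(restored == stand_in_model)
```

Because the format is pickle-based, it is safest to load only files from trusted sources, ideally with the same library versions that were used to save them.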