# **Exercise 04: pipelines**

## Configuration:

Import necessary *Python* packages:

In [1]:
import sys

Add the path to the project's own modules:

In [2]:
sys.path.append("../../src", )

Import the required third-party classes and functions:

In [3]:
from typing import Any
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from pandas import DataFrame, read_csv
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

Import the project's own pipeline building blocks:

In [4]:
from pipelines_blocks import (FeatureExtractor, Finalize, ModelSelection,
                              MyOneHotEncoder, TrainValidationTest)

## Preprocessing:

Create a dictionary of parameters for the `read_csv()` call:

In [5]:
read_csv_params: dict[str, Any] = {
    "file": "checker_submits.csv",
    "file_path": "../../data/datasets/",
    "parse_dates": ["timestamp", ],
}

Read the file `checker_submits.csv` into a *Pandas* dataframe:

In [6]:
df: DataFrame = read_csv(
    read_csv_params["file_path"] + read_csv_params["file"],
    parse_dates=read_csv_params["parse_dates"],
)

Check the `df` *Pandas* dataframe:

In [7]:
df.head()

Unnamed: 0,uid,labname,num_trials,timestamp
0,user_4,project1,1,2020-04-17 05:19:02.744528
1,user_4,project1,2,2020-04-17 05:22:45.549397
2,user_4,project1,3,2020-04-17 05:34:24.422370
3,user_4,project1,4,2020-04-17 05:43:27.773992
4,user_4,project1,5,2020-04-17 05:46:32.275104


Create the preprocessing pipeline:

In [8]:
preprocessing_pipe: Pipeline = Pipeline([
    (
        "feature_extractor",
        FeatureExtractor(),
    ),
    (
        "one_hot_encoder",
        MyOneHotEncoder("day_of_week", ),
    ),
], )

Transform the `df` *Pandas* dataframe with the preprocessing pipeline:

In [9]:
processed_df: DataFrame = preprocessing_pipe.fit_transform(df, )

Check the `processed_df` *Pandas* dataframe:

In [10]:
processed_df.head()

Unnamed: 0,num_trials,hour,weekday,uid_user_0,uid_user_1,uid_user_10,uid_user_11,uid_user_12,uid_user_13,uid_user_14,...,labname_lab02,labname_lab03,labname_lab03s,labname_lab05s,labname_laba04,labname_laba04s,labname_laba05,labname_laba06,labname_laba06s,labname_project1
0,1,5,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,2,5,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,3,5,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,4,5,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,5,5,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
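
`FeatureExtractor` and `MyOneHotEncoder` come from `pipelines_blocks` and are not shown in this notebook. As a rough sketch of how such a step could be written, assuming it derives `hour` and `weekday` from `timestamp` (which the output columns above suggest), a custom transformer might look like the following. The class name `HourWeekdayExtractor` and the sample frame are hypothetical and not part of the original module.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class HourWeekdayExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for FeatureExtractor: adds hour and weekday."""

    def __init__(self, column: str = "timestamp") -> None:
        self.column = column

    def fit(self, X: pd.DataFrame, y=None) -> "HourWeekdayExtractor":
        # Nothing is learned from the data; the transformer is stateless.
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        result = X.copy()
        result["hour"] = result[self.column].dt.hour
        result["weekday"] = result[self.column].dt.weekday  # Monday is 0
        return result.drop(columns=[self.column])


# Two rows copied from the df.head() output above.
sample = pd.DataFrame({
    "uid": ["user_4", "user_4"],
    "labname": ["project1", "project1"],
    "num_trials": [1, 2],
    "timestamp": pd.to_datetime([
        "2020-04-17 05:19:02.744528",
        "2020-04-17 05:22:45.549397",
    ]),
})

sketch_pipe = Pipeline([("feature_extractor", HourWeekdayExtractor())])
sketch_result = sketch_pipe.fit_transform(sample)
print(sketch_result)
```

Inheriting from `BaseEstimator` and `TransformerMixin` provides `get_params()` and `fit_transform()`, which is what lets the class sit inside a `Pipeline`.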


Create a `TrainValidationTest` instance:

In [11]:
train_valid_test_model: TrainValidationTest = TrainValidationTest(
    X=processed_df.drop(columns=["weekday", ], ),
    y=processed_df["weekday"],
)

Get `X_train`, `X_valid`, `X_test`, `y_train`, `y_valid`, `y_test` data:

In [12]:
X_train, X_valid, X_test, y_train, y_valid, y_test = (
    train_valid_test_model.get_train_validation_test_data()
)
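
The internals of `TrainValidationTest` are not shown here. One common way to obtain three subsets, given as a sketch only, is to call `train_test_split` twice. The 60/20/20 proportions, the stratification, and the `_demo` names below are assumptions for illustration, not the values used by the original class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 2 features, 2 balanced classes.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# First cut off the test set, then split the remainder into train and validation.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21, stratify=y_demo,
)
X_train_demo, X_valid_demo, y_train_demo, y_valid_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=21, stratify=y_rest,
)

print(len(X_train_demo), len(X_valid_demo), len(X_test_demo))
```

The second call uses `test_size=0.25` because a quarter of the remaining 80% equals 20% of the full dataset.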

Check `X_train`, `X_valid`, `X_test`, `y_train`, `y_valid`, `y_test` variables:

In [13]:
X_train.head()

Unnamed: 0,num_trials,hour,uid_user_0,uid_user_1,uid_user_10,uid_user_11,uid_user_12,uid_user_13,uid_user_14,uid_user_15,...,labname_lab02,labname_lab03,labname_lab03s,labname_lab05s,labname_laba04,labname_laba04s,labname_laba05,labname_laba06,labname_laba06s,labname_project1
1577,11,22,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
574,22,12,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
796,73,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1301,1,10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
530,2,19,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [14]:
X_valid.head()

Unnamed: 0,num_trials,hour,uid_user_0,uid_user_1,uid_user_10,uid_user_11,uid_user_12,uid_user_13,uid_user_14,uid_user_15,...,labname_lab02,labname_lab03,labname_lab03s,labname_lab05s,labname_laba04,labname_laba04s,labname_laba05,labname_laba06,labname_laba06s,labname_project1
871,14,13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
656,39,12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
781,15,17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1607,3,21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1564,22,21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [15]:
X_test.head()

Unnamed: 0,num_trials,hour,uid_user_0,uid_user_1,uid_user_10,uid_user_11,uid_user_12,uid_user_13,uid_user_14,uid_user_15,...,labname_lab02,labname_lab03,labname_lab03s,labname_lab05s,labname_laba04,labname_laba04s,labname_laba05,labname_laba06,labname_laba06s,labname_project1
1087,67,17,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
16,1,13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
563,14,10,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1381,20,15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1199,9,13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [16]:
y_train.head()

1577    3
574     6
796     4
1301    3
530     5
Name: weekday, dtype: int32

In [17]:
y_valid.head()

871     6
656     0
781     4
1607    6
1564    3
Name: weekday, dtype: int32

In [18]:
y_test.head()

1087    1
16      5
563     6
1381    3
1199    2
Name: weekday, dtype: int32

## Model testing:

Create an *SVC* model:

In [19]:
svc_model: SVC = SVC(
    random_state=21,
    probability=True,
)

Create a parameter grid `svc_model_params_grid` for the *SVC* model:

In [20]:
svc_model_params_grid: dict[str, list[Any]] = {
    "gamma": ["auto", "scale", ],
    "class_weight": [None, "balanced", ],
    "kernel": [
        "rbf",
        "linear",
        "sigmoid",
    ],
    "C": [
        0.01,
        0.1,
        1,
        1.5,
        5,
        10,
    ],
}

Create a *grid search* for the *SVC* model:

In [21]:
svc_model_grid_search: GridSearchCV = GridSearchCV(
    cv=5,
    n_jobs=-1,
    scoring="accuracy",
    estimator=svc_model,
    param_grid=svc_model_params_grid,
)

Create a *decision tree* model:

In [22]:
tree_model: DecisionTreeClassifier = DecisionTreeClassifier(random_state=21, )

Create a parameter grid `tree_model_params_grid` for the *decision tree* model:

In [23]:
tree_model_params_grid: dict[str, Any] = {
    "max_depth": range(1, 50, ),
    "criterion": ["gini", "entropy", ],
    "class_weight": [None, "balanced", ],
}

Create a *grid search* for the *decision tree* model:

In [24]:
tree_model_grid_search: GridSearchCV = GridSearchCV(
    cv=10,
    n_jobs=-1,
    scoring="accuracy",
    estimator=tree_model,
    param_grid=tree_model_params_grid,
)

Create a *random forest* model:

In [25]:
tree_forest_model: RandomForestClassifier = RandomForestClassifier(
    random_state=21,
)

Create a parameter grid `tree_forest_model_params_grid` for the *random forest* model:

In [26]:
tree_forest_model_params_grid: dict[str, Any] = {
    "max_depth": range(1, 50, ),
    "criterion": ["gini", "entropy", ],
    "class_weight": [None, "balanced", ],
    "n_estimators": [
        5,
        10,
        50,
        100,
    ],
}

Create a *grid search* for the *random forest* model:

In [27]:
tree_forest_model_grid_search: GridSearchCV = GridSearchCV(
    cv=10,
    n_jobs=-1,
    scoring="accuracy",
    estimator=tree_forest_model,
    param_grid=tree_forest_model_params_grid,
)
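
To get a feel for the cost of the three grid searches defined above, the number of parameter combinations can be counted with `ParameterGrid` and multiplied by the number of cross-validation folds. The grids below repeat the values from the cells above; the helper `count_cv_fits` is a hypothetical name introduced only for this sketch.

```python
from sklearn.model_selection import ParameterGrid

svc_grid = {
    "gamma": ["auto", "scale"],
    "class_weight": [None, "balanced"],
    "kernel": ["rbf", "linear", "sigmoid"],
    "C": [0.01, 0.1, 1, 1.5, 5, 10],
}
tree_grid = {
    "max_depth": list(range(1, 50)),
    "criterion": ["gini", "entropy"],
    "class_weight": [None, "balanced"],
}
forest_grid = {**tree_grid, "n_estimators": [5, 10, 50, 100]}


def count_cv_fits(grid: dict, cv: int) -> int:
    """Number of model fits during cross-validation (final refit excluded)."""
    return len(ParameterGrid(grid)) * cv


fit_counts = {
    "SVC": count_cv_fits(svc_grid, cv=5),
    "decision_tree": count_cv_fits(tree_grid, cv=10),
    "random_forest_tree": count_cv_fits(forest_grid, cv=10),
}
print(fit_counts)
```

The random forest search dominates the total, which is why `n_jobs=-1` is worthwhile here.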

Create a `ModelSelection` instance:

In [28]:
model_selection_model: ModelSelection = ModelSelection(
    grid_searches=[
        svc_model_grid_search,
        tree_model_grid_search,
        tree_forest_model_grid_search,
    ],
    models_data={
        0: "SVC",
        1: "decision_tree",
        2: "random_forest_tree",
    },
)

Get the *accuracy* scores of the best classification models:

In [29]:
model_selection_model.get_the_best_classification_models_results(
    X_train=X_train,
    y_train=y_train,
    X_valid=X_valid,
    y_valid=y_valid,
)

Unnamed: 0,model,parameters,validation_score
0,SVC,"{'C': 10, 'class_weight': None, 'gamma': 'auto...",0.899408
1,decision_tree,"{'class_weight': None, 'criterion': 'gini', 'm...",0.899408
2,random_forest_tree,"{'class_weight': 'balanced', 'criterion': 'gin...",0.928994


Find the name of the best classification model:

In [30]:
model_selection_model.get_best_classification_model_name(
    X_train=X_train,
    y_train=y_train,
    X_valid=X_valid,
    y_valid=y_valid,
);


Estimator is SVC.



Best classification model parameters are {'C': 10, 'class_weight': None, 'gamma': 'auto', 'kernel': 'rbf'}.
Classification model training accuracy metric is 0.846.
Classification model validation accuracy metric is 0.899.

Estimator is decision_tree.



Best classification model parameters are {'class_weight': None, 'criterion': 'gini', 'max_depth': 25}.
Classification model training accuracy metric is 0.869.
Classification model validation accuracy metric is 0.899.

Estimator is random_forest_tree.



Best classification model parameters are {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 22, 'n_estimators': 100}.
Classification model training accuracy metric is 0.903.
Classification model validation accuracy metric is 0.929.

Classification model with best validation accuracy metric is random_forest_tree.
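
The implementation of `ModelSelection` is not part of this notebook. The selection logic it reports can be approximated as: fit every grid search on the training data, score the refitted best estimator on the validation data, and keep the name with the highest score. The sketch below does this on the Iris dataset with deliberately small grids so that it runs quickly; the dataset, the grids, and all variable names are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=21, stratify=y_iris,
)

searches = {
    "SVC": GridSearchCV(
        SVC(random_state=21), {"C": [0.1, 1, 10]}, cv=5, scoring="accuracy",
    ),
    "decision_tree": GridSearchCV(
        DecisionTreeClassifier(random_state=21),
        {"max_depth": [1, 3, 5]},
        cv=5,
        scoring="accuracy",
    ),
}

validation_scores = {}
for name, search in searches.items():
    search.fit(X_tr, y_tr)  # cross-validated search, then refit on all of X_tr
    validation_scores[name] = search.score(X_va, y_va)  # accuracy on held-out data

best_name = max(validation_scores, key=validation_scores.get)
print(validation_scores, best_name)
```

Scoring on a separate validation set, rather than reusing `best_score_`, compares the models on data that played no part in tuning them.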


## The best model:

Create a `Finalize` instance with the best classification model:

In [31]:
finalize_model: Finalize = Finalize(
    RandomForestClassifier(
        max_depth=22,
        criterion="gini",
        n_estimators=100,
        class_weight="balanced",
    ),
)

Get the *accuracy* score of the best classification model:

In [32]:
finalize_model.get_final_score(
    X_test=X_test,
    y_test=y_test,
    X_train=X_train,
    y_train=y_train,
);

Accuracy metric of the classification model is 0.923.


Save the best model:

In [33]:
finalize_model.save_classification_model("../../models/ex_04_best_model.sav", );

Classification model was successfuly saved: ../../models/ex_04_best_model.sav.
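
How `save_classification_model()` serializes the estimator is not shown here. Assuming it uses `joblib` (an assumption; `pickle` would behave similarly), a save and reload round trip could be checked as in the sketch below. The temporary path, the file name `demo_model.sav`, and the Iris data are placeholders rather than the notebook's real artifacts.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_iris, y_iris = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10, random_state=21).fit(X_iris, y_iris)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.sav"
    joblib.dump(model, model_path)      # write the fitted model to disk
    restored = joblib.load(model_path)  # read it back into memory

# A restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X_iris) == model.predict(X_iris)).all())
print(same_predictions)
```

Serialized models should only be loaded from trusted sources and, ideally, with the same scikit-learn version that produced them.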
