# **Machine Learning: Full Pipeline**

## Library Import, Constants

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

In [2]:
RANDOM_STATE = 42

In [3]:
DATASET_PATH = "https://raw.githubusercontent.com/aiedu-courses/stepik_eda_and_dev_tools/main/datasets/online_shoppers_intention.csv"

## Loading and Preprocessing Data

Loading data:

In [4]:
df = pd.read_csv(DATASET_PATH)

Working with duplicates:

In [5]:
df = df.drop_duplicates().reset_index(drop=True)

Working with missing values:

In [6]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
median_infdur = df['Informational_Duration'].median()
df['Informational_Duration'] = df['Informational_Duration'].fillna(median_infdur)

median_prdur = df['ProductRelated_Duration'].median()
df['ProductRelated_Duration'] = df['ProductRelated_Duration'].fillna(median_prdur)

# .copy() makes the result an independent DataFrame, so later column
# assignments do not trigger SettingWithCopyWarning
df = df.dropna(subset=['ExitRates']).copy()

Normalizing inconsistent month labels:

In [7]:
df['Month'] = df['Month'].replace('aug', 'Aug')


## NB

Let's build a Gaussian Naive Bayes classifier on the numerical features with default parameters.

In [8]:
df

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.000000,0.200000,0.200000,0.000000,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.000000,0.000000,0.100000,0.000000,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.000000,0.200000,0.200000,0.000000,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.050000,0.140000,0.000000,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.500000,0.020000,0.050000,0.000000,0.0,Feb,3,3,1,4,Returning_Visitor,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12216,3,145.0,0,0.0,53,1783.791667,0.007143,0.029031,12.241717,0.0,Dec,4,6,1,1,Returning_Visitor,True,False
12217,0,0.0,0,0.0,5,465.750000,0.000000,0.021333,0.000000,0.0,Nov,3,2,1,8,Returning_Visitor,True,False
12218,0,0.0,0,0.0,6,184.250000,0.083333,0.086667,0.000000,0.0,Nov,3,2,1,13,Returning_Visitor,True,False
12219,4,75.0,0,0.0,15,346.000000,0.000000,0.021053,0.000000,0.0,Nov,2,2,3,11,Returning_Visitor,False,False


In [9]:
X = df[['Administrative', 'Administrative_Duration', 'Informational',
        'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
        'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay']]

# np.int was deprecated in NumPy 1.20 and later removed; the built-in int is equivalent
y = df['Revenue'].astype(int)


In [10]:
y.value_counts()

0    10236
1     1886
Name: Revenue, dtype: int64

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=RANDOM_STATE)

In [12]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

In [13]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

0.8488947542065325

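About 85% of the test observations were classified correctly. Since about 84% of all sessions belong to class 0 (10236 of 12122), this is only slightly better than always predicting Revenue = 0. Let's look at the confusion matrix to draw more specific conclusions.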
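To put the accuracy in context, it helps to compare it with a majority-class baseline. The sketch below uses the test-set class counts that `y_test.value_counts()` reports in this notebook (2545 and 486); the variable names are introduced here only for illustration.

```python
# Class counts of the test split, copied from y_test.value_counts() in this notebook.
n_negative = 2545  # sessions with Revenue = 0
n_positive = 486   # sessions with Revenue = 1

# A classifier that always predicts the majority class (Revenue = 0)
# is right on every negative and wrong on every positive.
baseline_accuracy = n_negative / (n_negative + n_positive)

# GaussianNB accuracy reported by accuracy_score in this notebook.
model_accuracy = 0.8488947542065325

print(f"majority baseline:  {baseline_accuracy:.4f}")
print(f"GaussianNB:         {model_accuracy:.4f}")
print(f"gain over baseline: {model_accuracy - baseline_accuracy:.4f}")
```

A gain of roughly one percentage point suggests that accuracy alone says little on this imbalanced target.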


In [14]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)

array([[2310,  235],
       [ 223,  263]])

In [15]:
y_test.value_counts()

0    2545
1     486
Name: Revenue, dtype: int64

In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes. With Revenue = 1 as the positive class, we have 2310 true negatives, 235 false positives, 223 false negatives, and 263 true positives.

That is, for sessions with actual Revenue = 0, incorrect responses are about 10 times rarer than correct ones (235 vs. 2310).

For sessions with actual Revenue = 1, the numbers of correct and incorrect responses are approximately the same (263 vs. 223).

**The model is not very good: it misses almost half of the purchases.**
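The cell layout can be made explicit by unpacking the matrix. This is a minimal sketch in plain Python, assuming scikit-learn's convention (rows are true labels, columns are predicted labels, labels sorted as 0, 1) and reusing the Naive Bayes matrix printed above.

```python
# Confusion matrix of the default GaussianNB model, copied from the output above.
cm = [[2310, 235],
      [223, 263]]

# Row 0 holds the actual negatives, row 1 the actual positives.
(tn, fp), (fn, tp) = cm

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)    # share of predicted purchases that are real
recall = tp / (tp + fn)       # share of real purchases that are found
specificity = tn / (tn + fp)  # share of real non-purchases that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f} f1={f1:.3f}")
```

With a real NumPy array returned by `confusion_matrix`, the same unpacking is usually written as `tn, fp, fn, tp = cm.ravel()`.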

## KNN

Let's build a K-Nearest Neighbors model on numerical features with default parameters.

In [16]:
from sklearn.neighbors import KNeighborsClassifier

knn_cl = KNeighborsClassifier()

knn_cl.fit(X_train, y_train)

pred_knn = knn_cl.predict(X_test)

In [17]:
accuracy_score(y_test, pred_knn)

0.860442098317387

About 86% of the observations were classified correctly, so in terms of accuracy this model performs slightly **better** than Naive Bayes. Let's look at the confusion matrix to draw more specific conclusions.

In [18]:
confusion_matrix(y_test, pred_knn)

array([[2466,   79],
       [ 344,  142]])

With Revenue = 1 as the positive class, we have 2466 true negatives, 79 false positives, 344 false negatives, and 142 true positives.

That is, for sessions with actual Revenue = 0, incorrect responses are 31 times rarer than correct ones (79 vs. 2466).

For sessions with actual Revenue = 1, correct responses are about 2.4 times rarer than incorrect ones (142 vs. 344).

Such a model might be useful for e-commerce, but it turned out to be too "cautious": it rarely predicts a purchase.

**The model is still not very good.**

## KNN with GridSearchCV

Let's build a K-Nearest Neighbors model on the numerical features with hyperparameters selected by GridSearchCV.

In [22]:
from sklearn.model_selection import GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=RANDOM_STATE)

model = KNeighborsClassifier()

params = {'n_neighbors' : np.arange(2, 20, 2),
          'weights' : ['uniform', 'distance'],
          'p' : [1, 2]}

gs = GridSearchCV(model, params, scoring='accuracy', cv=5, n_jobs=-1, verbose=2)
gs.fit(X_train, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
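The count of 36 candidates is the Cartesian product of the grid: 9 values of `n_neighbors`, 2 of `weights`, and 2 of `p`. The sketch below enumerates them in plain Python, with `range` standing in for `np.arange`; the names `param_grid` and `candidates` are used only for this illustration.

```python
from itertools import product

param_grid = {
    "n_neighbors": list(range(2, 20, 2)),  # same values as np.arange(2, 20, 2)
    "weights": ["uniform", "distance"],
    "p": [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
}

# Every combination of one value per hyperparameter is one candidate.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

cv_folds = 5
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 36 180
print(candidates[0])
```

After these fits, `GridSearchCV` refits the best candidate once more on the whole training set, because `refit=True` is the default.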


In [23]:
gs.best_score_, gs.best_params_

(0.865801214535967, {'n_neighbors': 8, 'p': 2, 'weights': 'uniform'})

In [24]:
pred = gs.best_estimator_.predict(X_test)

accuracy_score(y_test, pred)

0.8630814912570108

About 86% of the observations were classified correctly; in terms of accuracy, the quality of the model has barely changed (0.863 vs. 0.860). Let's look at the confusion matrix to draw more specific conclusions.

In [25]:
confusion_matrix(y_test, pred)

array([[2513,   32],
       [ 383,  103]])

With Revenue = 1 as the positive class, we have 2513 true negatives, 32 false positives, 383 false negatives, and 103 true positives.

That is, for sessions with actual Revenue = 0, incorrect responses are 79 times rarer than correct ones (32 vs. 2513).

For sessions with actual Revenue = 1, correct responses are about 4 times rarer than incorrect ones (103 vs. 383).

The model became better at predicting non-purchases but worse at predicting purchases. It is even more "cautious" than in the previous step.
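Because the grid search optimizes accuracy on imbalanced classes, it can favor settings that rarely predict the minority class. One possible alternative is to select hyperparameters by F1 for the positive class. This is only a sketch on synthetic data: the dataset from `make_classification`, the reduced grid, and all variable names are assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data (about 85% negatives), used only for illustration.
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.85, 0.15], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          stratify=y_demo, random_state=42)

grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}

results = {}
for scoring in ("accuracy", "f1"):
    # Same search, different selection criterion.
    search = GridSearchCV(KNeighborsClassifier(), grid, scoring=scoring, cv=5)
    search.fit(X_tr, y_tr)
    recall = recall_score(y_te, search.predict(X_te))
    results[scoring] = (search.best_params_, recall)
    print(f"{scoring:>8}: best={search.best_params_}, test recall={recall:.3f}")
```

Whether F1, recall, or another metric is the right criterion depends on the relative cost of missed purchases and false alarms.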

## NB with GridSearchCV

Let's build a Gaussian Naive Bayes classifier on the numerical features with hyperparameters selected by GridSearchCV.

In [26]:
from sklearn.model_selection import GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=RANDOM_STATE)

model = GaussianNB()

params = {'priors': [None, [0.7, 0.3], [0.3, 0.7], [0.8, 0.2], [0.2, 0.8], [0.9, 0.1], [0.1, 0.9]],
          'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6]}

gs = GridSearchCV(model, params, scoring='accuracy', cv=5, n_jobs=-1, verbose=2)
gs.fit(X_train, y_train)


Fitting 5 folds for each of 28 candidates, totalling 140 fits


In [27]:
gs.best_score_, gs.best_params_

(0.8568906258410338, {'priors': [0.9, 0.1], 'var_smoothing': 1e-06})
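The selected `priors=[0.9, 0.1]` tells the classifier to treat purchases as rarer than their share in the data (roughly 0.84 vs. 0.16), which pushes borderline sessions toward class 0. The sketch below applies Bayes' rule directly; the likelihood values are hypothetical and chosen only to show a prediction flipping.

```python
def posterior(priors, likelihoods):
    """Return class posteriors from class priors and class-conditional likelihoods."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical likelihoods p(x | class) for one session:
# the observation is 6 times more likely under class 1 than under class 0.
likelihoods = [0.02, 0.12]

empirical = posterior([0.84, 0.16], likelihoods)  # priors close to the class shares
tuned = posterior([0.90, 0.10], likelihoods)      # priors chosen by the grid search

print(f"P(Revenue=1 | x), empirical priors: {empirical[1]:.3f}")  # 0.533 -> predict 1
print(f"P(Revenue=1 | x), tuned priors:     {tuned[1]:.3f}")      # 0.400 -> predict 0
```

The same evidence leads to different decisions, which is one way a prior skewed toward class 0 can raise accuracy while lowering recall on purchases.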

In [28]:
pred = gs.best_estimator_.predict(X_test)

accuracy_score(y_test, pred)

0.8630814912570108

About 86% of the observations were classified correctly. In terms of accuracy, the model does not differ from the tuned KNN. Let's look at the confusion matrix to draw more specific conclusions.

In [29]:
confusion_matrix(y_test, pred)

array([[2396,  149],
       [ 266,  220]])

With Revenue = 1 as the positive class, we have 2396 true negatives, 149 false positives, 266 false negatives, and 220 true positives.

That is, for sessions with actual Revenue = 0, incorrect responses are 16 times rarer than correct ones (149 vs. 2396).

For sessions with actual Revenue = 1, correct responses are 1.2 times rarer than incorrect ones (220 vs. 266).

Compared with the default Naive Bayes, the model became better at predicting non-purchases but worse at predicting purchases. That said, it is not as "cautious" as the KNN models: it finds more of the actual purchases (220, versus 142 and 103).
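The four confusion matrices obtained so far can be compared side by side. The sketch below, in plain Python, derives accuracy, precision, and recall for the purchase class from the matrices printed above; the dictionary labels are ad hoc names.

```python
# Confusion matrices copied from the outputs above (rows: actual, columns: predicted).
matrices = {
    "NB default":  [[2310, 235], [223, 263]],
    "KNN default": [[2466, 79], [344, 142]],
    "KNN tuned":   [[2513, 32], [383, 103]],
    "NB tuned":    [[2396, 149], [266, 220]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    summary[name] = {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

for name, m in summary.items():
    print(f"{name:<12} accuracy={m['accuracy']:.3f} "
          f"precision={m['precision']:.3f} recall={m['recall']:.3f}")
```

The two tuned models reach the same accuracy, yet their recall on purchases differs by more than a factor of two.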


## Keeping categorical attributes

### **Encoding**

In [30]:
df

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.000000,0.200000,0.200000,0.000000,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.000000,0.000000,0.100000,0.000000,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.000000,0.200000,0.200000,0.000000,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.050000,0.140000,0.000000,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.500000,0.020000,0.050000,0.000000,0.0,Feb,3,3,1,4,Returning_Visitor,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12216,3,145.0,0,0.0,53,1783.791667,0.007143,0.029031,12.241717,0.0,Dec,4,6,1,1,Returning_Visitor,True,False
12217,0,0.0,0,0.0,5,465.750000,0.000000,0.021333,0.000000,0.0,Nov,3,2,1,8,Returning_Visitor,True,False
12218,0,0.0,0,0.0,6,184.250000,0.083333,0.086667,0.000000,0.0,Nov,3,2,1,13,Returning_Visitor,True,False
12219,4,75.0,0,0.0,15,346.000000,0.000000,0.021053,0.000000,0.0,Nov,2,2,3,11,Returning_Visitor,False,False


In [31]:
{column: list(df[column].unique()) for column in df.columns if df.dtypes[column] == 'object'}

{'Month': ['Feb',
  'Aug',
  'Mar',
  'May',
  'Oct',
  'June',
  'Jul',
  'Nov',
  'Sep',
  'Dec'],
 'VisitorType': ['Returning_Visitor', 'New_Visitor', 'Other']}

We can encode the values of `Month` as integers in calendar order, which preserves their natural ordering. We will encode `VisitorType` using one-hot encoding.

Let's encode the object columns:

In [32]:
month_ordering = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df['Month'] = df['Month'].apply(lambda x: month_ordering.index(x))
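An equivalent way to express this ordinal encoding is a lookup dictionary, which avoids searching the list for every row. The sketch below is plain Python and reuses `month_ordering` from the cell above; the names `month_to_number` and `sample` are introduced here for illustration.

```python
month_ordering = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June',
                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Map each label to its position in the calendar (Jan -> 0, ..., Dec -> 11),
# which matches the result of month_ordering.index(x).
month_to_number = {name: number for number, name in enumerate(month_ordering)}

sample = ['Feb', 'Aug', 'June', 'Dec']
encoded = [month_to_number[month] for month in sample]

print(encoded)  # [1, 7, 5, 11]
```

With pandas, the same dictionary could be applied as `df['Month'].map(month_to_number)`.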

In [33]:
dummies = pd.get_dummies(df['VisitorType'], prefix='VT', drop_first=True)
df = pd.concat([df, dummies], axis=1)
df = df.drop('VisitorType', axis=1)
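With `drop_first=True`, the first category in sorted order (`New_Visitor`) is dropped and becomes the implicit baseline: a row with zeros in both remaining columns is a new visitor. The sketch below shows this on a toy Series, assuming pandas is installed. `dtype=int` is passed explicitly because recent pandas versions return boolean dummies by default.

```python
import pandas as pd

# Toy data with the three categories found in the VisitorType column.
visitor_type = pd.Series(
    ['Returning_Visitor', 'New_Visitor', 'Other', 'Returning_Visitor'],
    name='VisitorType',
)

# drop_first=True removes the column of the first category (New_Visitor).
dummies = pd.get_dummies(visitor_type, prefix='VT', drop_first=True, dtype=int)

print(dummies)
```

Dropping one level avoids a redundant column, since its value is fully determined by the other two.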

In [34]:
df

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,Weekend,Revenue,VT_Other,VT_Returning_Visitor
0,0,0.0,0,0.0,1,0.000000,0.200000,0.200000,0.000000,0.0,1,1,1,1,1,False,False,0,1
1,0,0.0,0,0.0,2,64.000000,0.000000,0.100000,0.000000,0.0,1,2,2,1,2,False,False,0,1
2,0,0.0,0,0.0,1,0.000000,0.200000,0.200000,0.000000,0.0,1,4,1,9,3,False,False,0,1
3,0,0.0,0,0.0,2,2.666667,0.050000,0.140000,0.000000,0.0,1,3,2,2,4,False,False,0,1
4,0,0.0,0,0.0,10,627.500000,0.020000,0.050000,0.000000,0.0,1,3,3,1,4,True,False,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12216,3,145.0,0,0.0,53,1783.791667,0.007143,0.029031,12.241717,0.0,11,4,6,1,1,True,False,0,1
12217,0,0.0,0,0.0,5,465.750000,0.000000,0.021333,0.000000,0.0,10,3,2,1,8,True,False,0,1
12218,0,0.0,0,0.0,6,184.250000,0.083333,0.086667,0.000000,0.0,10,3,2,1,13,True,False,0,1
12219,4,75.0,0,0.0,15,346.000000,0.000000,0.021053,0.000000,0.0,10,2,2,3,11,False,False,0,1


Let's encode the boolean columns (`Weekend` and `Revenue`):

In [35]:
# np.int was deprecated in NumPy 1.20 and later removed; the built-in int is equivalent
df['Weekend'] = df['Weekend'].astype(int)
df['Revenue'] = df['Revenue'].astype(int)
df


Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,Weekend,Revenue,VT_Other,VT_Returning_Visitor
0,0,0.0,0,0.0,1,0.000000,0.200000,0.200000,0.000000,0.0,1,1,1,1,1,0,0,0,1
1,0,0.0,0,0.0,2,64.000000,0.000000,0.100000,0.000000,0.0,1,2,2,1,2,0,0,0,1
2,0,0.0,0,0.0,1,0.000000,0.200000,0.200000,0.000000,0.0,1,4,1,9,3,0,0,0,1
3,0,0.0,0,0.0,2,2.666667,0.050000,0.140000,0.000000,0.0,1,3,2,2,4,0,0,0,1
4,0,0.0,0,0.0,10,627.500000,0.020000,0.050000,0.000000,0.0,1,3,3,1,4,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12216,3,145.0,0,0.0,53,1783.791667,0.007143,0.029031,12.241717,0.0,11,4,6,1,1,1,0,0,1
12217,0,0.0,0,0.0,5,465.750000,0.000000,0.021333,0.000000,0.0,10,3,2,1,8,1,0,0,1
12218,0,0.0,0,0.0,6,184.250000,0.083333,0.086667,0.000000,0.0,10,3,2,1,13,1,0,0,1
12219,4,75.0,0,0.0,15,346.000000,0.000000,0.021053,0.000000,0.0,10,2,2,3,11,0,0,0,1


### **Splitting and Scaling**

In [36]:
X = df.drop('Revenue', axis=1)
y = df['Revenue']

In [37]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=RANDOM_STATE)

In [38]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(Xtrain)

Xtrain = scaler.transform(Xtrain)
Xtest = scaler.transform(Xtest)
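The scaler is fitted on the training part only and then applied to both parts, so the test set contributes nothing to the scaling statistics. The sketch below shows the same idea in plain Python for a single feature. The toy numbers are assumptions; the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

# Toy values of one feature, split into train and test parts.
train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

# Statistics come from the training part only.
mean = fmean(train)
std = pstdev(train)  # population standard deviation (ddof = 0)

train_scaled = [(x - mean) / std for x in train]
test_scaled = [(x - mean) / std for x in test]  # reuses the train statistics

print([round(x, 3) for x in train_scaled])
print([round(x, 3) for x in test_scaled])
```

The scaled training values have zero mean and unit variance, while the scaled test values generally do not, which is expected.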

In [39]:
Xtrain = pd.DataFrame(Xtrain, columns=X.columns)
Xtest = pd.DataFrame(Xtest, columns=X.columns)

In [40]:
Xtrain

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,Weekend,VT_Other,VT_Returning_Visitor
0,-0.701679,-0.455635,-0.401593,-0.239368,-0.113607,0.071570,-0.449659,-0.417126,-0.326504,-0.308377,0.986859,-0.133381,-0.209625,-0.893424,-0.768960,-0.548072,-0.085095,0.416827
1,-0.401984,-0.367657,1.970228,3.679180,0.177517,-0.221514,-0.209397,-0.390571,-0.326504,-0.308377,0.097953,0.987151,-0.209625,0.350742,-0.521496,-0.548072,-0.085095,0.416827
2,-0.701679,-0.455635,-0.401593,-0.239368,-0.539098,-0.400925,-0.449659,-0.629687,-0.326504,-0.308377,1.283161,2.107682,-0.209625,-0.893424,-0.521496,-0.548072,-0.085095,-2.399078
3,-0.701679,-0.455635,-0.401593,-0.239368,0.692585,0.751241,0.367681,0.554583,-0.326504,-0.308377,-0.198350,0.987151,-0.209625,2.009629,3.932849,-0.548072,-0.085095,0.416827
4,-0.701679,-0.455635,-0.401593,-0.239368,-0.673463,-0.414090,1.789141,2.376536,-0.326504,-0.308377,-1.383558,-0.133381,4.452677,-0.063980,-0.274033,-0.548072,-0.085095,0.416827
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8480,0.197407,0.143716,-0.401593,-0.239368,-0.024031,1.029271,-0.449659,-0.789108,-0.326504,-0.308377,1.283161,0.987151,-0.209625,-0.893424,1.458213,-0.548072,-0.085095,0.416827
8481,-0.401984,-0.400648,-0.401593,-0.239368,-0.471915,-0.061428,-0.001899,-0.028442,-0.326504,-0.308377,-0.790954,-0.133381,-0.209625,-0.893424,-0.521496,1.824577,-0.085095,0.416827
8482,-0.701679,-0.455635,-0.401593,-0.239368,-0.695857,-0.628454,4.027942,3.469709,-0.326504,-0.308377,0.690557,-1.253912,-0.792413,2.009629,-0.768960,1.824577,-0.085095,0.416827
8483,-0.701679,-0.455635,-0.401593,-0.239368,-0.225579,-0.135986,-0.042604,0.157066,-0.326504,-0.308377,-1.383558,-0.133381,-0.209625,-0.063980,-0.768960,-0.548072,-0.085095,0.416827


### **KNN**

Let's focus on the KNN model: it is distance-based, so it should benefit from standardized features, and we can check whether the additional attributes help.

In [45]:
model = KNeighborsClassifier()

params = {'n_neighbors' : np.arange(2, 20, 2),
          'weights' : ['uniform', 'distance'],
          'p' : [1, 2]}

gs = GridSearchCV(model, params, scoring='accuracy', cv=5, n_jobs=-1, verbose=2)
gs.fit(Xtrain, ytrain)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
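One caveat of scaling before the grid search is that each cross-validation fold is scaled with statistics that include its own validation part. A possible alternative puts the scaler inside a `Pipeline`, so it is refitted within every fold. The sketch runs on synthetic data: the data from `make_classification`, the reduced grid, and the step names `scaler` and `knn` are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to keep the example self-contained.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)

# The scaler is a pipeline step, so each CV split fits it on its own training folds.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hyperparameters of a step are addressed as <step name>__<parameter>.
param_grid = {
    "knn__n_neighbors": [4, 8, 12],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

With this layout, `search.predict` also applies the fitted scaler automatically, so unscaled test data can be passed directly.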


In [46]:
gs.best_score_, gs.best_params_

(0.8789628756629344, {'n_neighbors': 10, 'p': 2, 'weights': 'distance'})

In [47]:
pred = gs.best_estimator_.predict(Xtest)

accuracy_score(ytest, pred)

0.8847951608468518

About 88% of the observations were classified correctly, so in terms of accuracy the model has **become better**. Let's look at the confusion matrix to draw more specific conclusions.

In [48]:
confusion_matrix(ytest, pred)

array([[2993,   69],
       [ 350,  225]])

With Revenue = 1 as the positive class, we have 2993 true negatives, 69 false positives, 350 false negatives, and 225 true positives. Note that this test set is larger than before (30% of the data instead of 25%), so absolute counts are not directly comparable.

That is, for sessions with actual Revenue = 0, incorrect responses are 43 times rarer than correct ones (69 vs. 2993).

For sessions with actual Revenue = 1, correct responses are 1.6 times rarer than incorrect ones (225 vs. 350).

The model became better at predicting purchases while remaining good at predicting non-purchases. It is less "cautious" than the previous KNN models.

**We've made the model better.**

## Explainer Dashboard

### **Building and Saving**

In [49]:
!pip install explainerdashboard -q


In [50]:
from explainerdashboard import ClassifierExplainer, ExplainerDashboard

Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)


In [51]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [56]:
explainer = ClassifierExplainer(gs.best_estimator_, Xtest.iloc[:10], ytest.iloc[:10])

Note: shap values for shap='kernel' normally get calculated against X_background, but paramater X_background=None, so setting X_background=shap.sample(X, 50)...
Generating self.shap_explainer = shap.KernelExplainer(model, X, link='identity')


In [57]:
db = ExplainerDashboard(explainer)

Building ExplainerDashboard..
Detected google colab environment, setting mode='external'
For this type of model and model_output interactions don't work, so setting shap_interaction=False...
The explainer object has no decision_trees property. so setting decision_trees=False...
Generating layout...
Calculating shap values...



JupyterDash is deprecated, use Dash instead.
See https://dash.plotly.com/dash-in-jupyter for more details.



  0%|          | 0/10 [00:00<?, ?it/s]

Calculating prediction probabilities...
Calculating metrics...
Calculating confusion matrices...
Calculating classification_dfs...
Calculating roc auc curves...
Calculating pr auc curves...
Calculating liftcurve_dfs...
Calculating dependencies...
Calculating permutation importances (if slow, try setting n_jobs parameter)...
Calculating predictions...
Calculating pred_percentiles...
Reminder: you can store the explainer (including calculated dependencies) with explainer.dump('explainer.joblib') and reload with e.g. ClassifierExplainer.from_file('explainer.joblib')
Registering callbacks...


In [58]:
db.run()

Starting ExplainerDashboard on http://172.28.0.12:8050
You can terminate the dashboard with ExplainerDashboard.terminate(8050)

In [59]:
db.save_html('dashboard.html')

### **Analysis**

#### **Feature Importances**

From the plot in the dashboard, we can see that `PageValues` is by far the most important feature: it stands out significantly among the others. In other words, the average page value plays the most important role in the model.

#### **Metrics**

Let's look at the metrics presented in the dashboard:

|   Metric         |   Score   |
|------------------|-----------|
|   accuracy       |   0.9     |
|   precision      |   0.5     |
|   recall         |   1.0     |
|   f1             |   0.667   |
|   roc_auc_score  |   1.0     |
|   pr_auc_score   |   1.0     |
|   log_loss       |   0.201   |

Accuracy is 0.9, which means that the model predicted the correct class for 90% of the objects.

Precision is 0.5, which means that 50% of the objects predicted as class 1 actually belong to class 1.

Log loss is 0.201, which means that the model predicts class probabilities quite accurately.

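**Some of the metrics from the dashboard raise questions. The confusion matrix we built earlier suggests that not all classes were correctly predicted, but the metrics say otherwise. Note that the explainer was built on only the first 10 test observations (`Xtest.iloc[:10]`), so these metrics describe a 10-row sample rather than the full test set:**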

* Recall is 1, which means that the model found all objects of class 1 in the sample.

* F1-score is 0.667, the harmonic mean of Precision (0.5) and Recall (1.0).

* ROC AUC is 1, which means that every class 1 object in the sample received a higher predicted probability than every class 0 object.

* PR AUC is 1, which likewise means that the predicted probabilities rank the class 1 objects in the sample above all class 0 objects.
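The reported values are consistent with a very small evaluation sample. Since the explainer was built from `Xtest.iloc[:10]`, one can check which confusion-matrix counts over 10 observations reproduce the accuracy, precision, and recall in the table. The brute-force search below is plain Python and is only an illustration, not something the dashboard does.

```python
n = 10  # number of rows passed to ClassifierExplainer

solutions = []
for tp in range(n + 1):
    for fp in range(n + 1 - tp):
        for fn in range(n + 1 - tp - fp):
            tn = n - tp - fp - fn
            if tp + fp == 0 or tp + fn == 0:
                continue  # precision or recall would be undefined
            accuracy = (tp + tn) / n
            precision = tp / (tp + fp)
            recall = tp / (tp + fn)
            # Keep only the counts that match the dashboard table.
            if (abs(accuracy - 0.9) < 1e-9
                    and abs(precision - 0.5) < 1e-9
                    and abs(recall - 1.0) < 1e-9):
                solutions.append({"tp": tp, "fp": fp, "fn": fn, "tn": tn})

print(solutions)  # [{'tp': 1, 'fp': 1, 'fn': 0, 'tn': 8}]
```

The only match contains a single actual purchase, so Recall, ROC AUC, and PR AUC each hinge on one observation and say little about the full test set.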


#### **Individual Predictions**

Let's look at how variables affect some individual predictions.

Let's look at the prediction with **index 3**. `PageValues` had a negative impact of -70.5%, while the other features had little effect. Because of this influence of `PageValues`, the prediction became equal to 1.

Let's look at the prediction with **index 4**. `PageValues` had a positive impact of +16.6%, while the `Weekend` feature had a negative impact of -7%. Due to this influence of `PageValues`, the prediction became equal to 0 (73.7%).

Let's look at the prediction with **index 5**. `PageValues` had a negative impact of -54%, while the `Region` feature had a positive impact of +1.8%. Due to this influence of the variables, the prediction became equal to 1 (80%).