# Practical AI and MLOps : Assignment 2
Download the datasets.
The datasets are downloaded and stored in the pandas dataframes df1 and df2. You are free to rename them. You can split the datasets using the train_test_split function from the scikit-learn library.

First dataset (df1): used for the decision tree

## Problem and Dataset Description

You have been provided with a dataset containing various attributes about the behavior of an online shopper and whether they made a purchase or not. Your task is to build a decision tree model to predict whether a visitor to the webpage actually made a purchase or not based on the provided attributes.

Dataset columns:

*   Electronic_Devices : the number of electronic devices pages visited by the shopper in a session
*   Electronic_Devices_Duration : the total time spent in the electronic devices category by the shopper
*   Groceries : the number of groceries pages visited by the shopper
*   Groceries_Duration : the total time spent in the groceries category by the shopper
*   Sports_Equipments : the number of sports equipment pages visited by the shopper
*   Sports_Equipments_Duration : the total time spent in the sports equipment category by the shopper
*   Bounce_Rates : this feature for a web page refers to the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests
*   Special_Day : this feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother’s Day, Independence Day)
*   Month : the specific month of the year
*   Browser : the browser used by the shopper
*   Region : the region where the searches were made
*   Type_of_visitor : this feature indicates whether the shopper is a returning or new visitor to the page
*   Weekend : Boolean value indicating whether the visit took place on a weekend
*   Purchase_made : Boolean value indicating whether the purchase was made or not

## Problem 1: Decision Tree (2 Marks)


1.   Using the provided dataset, build a decision tree model that can predict whether a visitor will make a purchase during their online session. Additionally, evaluate the performance of your decision tree model using appropriate metrics such as accuracy, precision, recall, and F1-score.
2.   Which attribute(s) did your decision tree identify as the most important for predicting whether a visitor will make a purchase or not?
3.   What is the maximum depth of your decision tree and how did you estimate it?
4.   What is the accuracy of your decision tree model in predicting purchase behavior, and did you employ any techniques to handle categorical features or missing values in the dataset?





In [91]:
# DO NOT EDIT

!pip install gdown
!gdown 18NuvJotUFiTAHW0YgaoLVu2blW_V6YX0
!unzip -o /content/assignment2.zip -d data

import pandas as pd

df1 = pd.read_csv('/content/data/assignment2-1.csv')
df2 = pd.read_csv('/content/data/assignment2-2.csv')

Downloading...
From: https://drive.google.com/uc?id=18NuvJotUFiTAHW0YgaoLVu2blW_V6YX0
To: /content/assignment2.zip
100% 120M/120M [00:02<00:00, 56.7MB/s]
Archive:  /content/assignment2.zip
  inflating: data/assignment2-1.csv  
  inflating: data/assignment2-2.csv  


In [92]:
# Load libraries
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

In [93]:
# Work on a copy of the selected columns so the original df1 stays available if needed later
defdectree = df1[["ElectronicDevices", "ElectronicDevices_Duration", "Groceries", "Groceries_Duration", "SportsRelated", "Sports_Equipments_Duration", "Bounce_Rates", "Special_Day", "Month", "Browser", "Region", "Type_of_visitor","Weekend","Purchase_made"]].copy()

In [94]:
defdectree.head()

Unnamed: 0,ElectronicDevices,ElectronicDevices_Duration,Groceries,Groceries_Duration,SportsRelated,Sports_Equipments_Duration,Bounce_Rates,Special_Day,Month,Browser,Region,Type_of_visitor,Weekend,Purchase_made
0,0,0.0,0,0.0,1,0.0,0.2,0.0,May,Mozilla,1,Old_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.0,May,Edge,1,Old_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.0,May,Mozilla,9,Old_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.0,May,Edge,2,Old_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.0,May,Opera,1,Old_Visitor,True,False
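Question 4 asks about missing values, and `head()` only shows the first five rows, so it cannot rule them out. A minimal sketch of one way to count and fill missing entries follows. The frame `toy` and its values are made up for illustration; median and mode filling are one possible choice, not necessarily the right one for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame (values invented); it stands in for df1.
toy = pd.DataFrame({
    "Groceries_Duration": [0.0, np.nan, 64.0, 2.5],
    "Browser": ["Mozilla", "Edge", None, "Edge"],
})

# Count missing entries per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Fill numeric gaps with the column median and categorical gaps with the most frequent value.
filled = toy.copy()
filled["Groceries_Duration"] = filled["Groceries_Duration"].fillna(
    filled["Groceries_Duration"].median()
)
filled["Browser"] = filled["Browser"].fillna(filled["Browser"].mode()[0])
print(filled)
```

Running `defdectree.isna().sum()` in the same way would show whether any filling is needed at all.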


In [95]:
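# NOTE: X and y are defined in the feature-selection cell further down; run that cell before this one.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1) # 75% training and 25% test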
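The tree plot later in this notebook reports value = [1366, 137] at the root, so purchases make up roughly 9% of the training rows. With a class this rare, a plain random split can give the test set a different purchase rate from the training set. A minimal sketch of a stratified split follows. The names `toy`, `X_toy` and `y_toy` and the data itself are hypothetical, not the assignment data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 10% positives, standing in for the encoded shopper features.
toy = pd.DataFrame({
    "Bounce_Rates": [i / 100 for i in range(40)],
    "Purchase_made": [1 if i % 10 == 0 else 0 for i in range(40)],
})
X_toy = toy[["Bounce_Rates"]]
y_toy = toy["Purchase_made"]

# test_size=0.25 keeps 75% of the rows for training;
# stratify=y_toy preserves the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```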

# Identify the distinct values of each categorical column and map them to integers

In [96]:
print(defdectree['Month'].unique())

['May' 'Mar']


In [97]:
d = {'May': 0, 'Mar': 1}
defdectree['Month'] = defdectree['Month'].map(d)

In [98]:
print(defdectree['Browser'].unique())

['Mozilla' 'Edge' 'Opera' 'Brave' 'Chrome' 'DuckDuckGo' '7' 'Mozilla0' '8'
 '9']


In [99]:
d = {'Mozilla': 0, 'Edge' : 1, 'Opera' : 2, 'Brave' : 3, 'Chrome': 4, 'DuckDuckGo': 5, '7': 6, 'Mozilla0': 7, '8': 8, '9': 9,}
defdectree['Browser'] = defdectree['Browser'].map(d)

In [100]:
print(defdectree['Type_of_visitor'].unique())

['Old_Visitor' 'New_Visitor']


In [101]:
d = {'Old_Visitor': 0, 'New_Visitor': 1}
defdectree['Type_of_visitor'] = defdectree['Type_of_visitor'].map(d)

In [102]:
defdectree['Weekend'] = defdectree['Weekend'].astype(int)

In [103]:
defdectree['Purchase_made'] = defdectree['Purchase_made'].astype(int)
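The hand-written dictionaries above work, but `.map(d)` turns any value missing from the dictionary into NaN, and integer codes imply an order between browsers that does not exist. A minimal sketch of two alternatives in pandas follows: `pd.factorize` for automatic integer codes and `pd.get_dummies` for one-hot columns. The frame `toy` is hypothetical and only mimics the column types of `defdectree`.

```python
import pandas as pd

# Hypothetical mini-frame with the same kinds of columns as defdectree (values invented).
toy = pd.DataFrame({
    "Month": ["May", "Mar", "May"],
    "Browser": ["Mozilla", "Edge", "Opera"],
    "Type_of_visitor": ["Old_Visitor", "New_Visitor", "Old_Visitor"],
})

# Option 1: integer codes in order of first appearance, like the manual dictionaries.
codes, uniques = pd.factorize(toy["Browser"])
toy["Browser_code"] = codes

# Option 2: one-hot columns, which avoid implying an order between categories.
one_hot = pd.get_dummies(toy[["Month", "Type_of_visitor"]], dtype=int)

print(list(uniques))
print(toy["Browser_code"].tolist())
print(sorted(one_hot.columns))
```

A decision tree can split on integer codes, so either option is usable here; one-hot columns mainly make the resulting splits easier to interpret.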

In [104]:
defdectree.head()

Unnamed: 0,ElectronicDevices,ElectronicDevices_Duration,Groceries,Groceries_Duration,SportsRelated,Sports_Equipments_Duration,Bounce_Rates,Special_Day,Month,Browser,Region,Type_of_visitor,Weekend,Purchase_made
0,0,0.0,0,0.0,1,0.0,0.2,0.0,0,0,1,0,0,0
1,0,0.0,0,0.0,2,64.0,0.0,0.0,0,1,1,0,0,0
2,0,0.0,0,0.0,1,0.0,0.2,0.0,0,0,9,0,0,0
3,0,0.0,0,0.0,2,2.666667,0.05,0.0,0,1,2,0,0,0
4,0,0.0,0,0.0,10,627.5,0.02,0.0,0,2,1,0,1,0


In [105]:
features = ['ElectronicDevices', 'ElectronicDevices_Duration', 'Groceries', 'Groceries_Duration', 'SportsRelated', 'Sports_Equipments_Duration', 'Bounce_Rates', 'Special_Day', 'Month', 'Browser', 'Region', 'Type_of_visitor', 'Weekend' ]

In [106]:
X = defdectree[features]
y = defdectree['Purchase_made']

# TRAIN THE MODEL

In [107]:
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()
# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)

In [108]:
conf_matrix = confusion_matrix(y_true=y_test, y_pred=y_pred)

# Compute the metrics: accuracy, precision, recall and F1 score

In [109]:
print('Accuracy:  %.3f' % accuracy_score(y_test, y_pred))
print('Precision: %.3f' % precision_score(y_test, y_pred))
print('Recall: %.3f' % recall_score(y_test, y_pred))
print('F1 Score: %.3f' % f1_score(y_test, y_pred))

Accuracy:  0.846
Precision: 0.185
Recall: 0.233
F1 Score: 0.206
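The accuracy of 0.846 looks high next to a precision of 0.185 and a recall of 0.233. The tree plot in this notebook shows 1366 of the 1503 training rows in the "no purchase" class, so always predicting "no purchase" would already be right about 91% of the time on the training set. A minimal sketch of how the four metrics follow from the confusion matrix, and how a majority-class baseline compares, is given below. The labels `y_true_toy` and `y_hat_toy` are invented for illustration.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Hypothetical labels: 3 purchases among 10 sessions (not the assignment data).
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_hat_toy = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]

# For binary labels the flattened matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_hat_toy).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted purchases that were real
recall = tp / (tp + fn)      # share of real purchases that were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy obtained by always predicting the majority class (0).
baseline = y_true_toy.count(0) / len(y_true_toy)

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1, baseline)
```

In this toy case the baseline accuracy is higher than the model's accuracy, which is why precision, recall and F1 are the more informative numbers for a rare positive class.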


In [110]:
import matplotlib.pyplot as plt
import sys
import matplotlib

# What is the maximum depth of your decision tree and how did you estimate it?

In [134]:
text_representation = tree.export_text(clf)

print('Max Depth of Dec Tree is :: %.3f' % clf.tree_.max_depth)

print(text_representation)

Max Depth of Dec Tree is :: 18.000
|--- feature_5 <= 750.96
|   |--- feature_5 <= 244.58
|   |   |--- feature_1 <= 14.50
|   |   |   |--- class: 0
|   |   |--- feature_1 >  14.50
|   |   |   |--- feature_1 <= 15.50
|   |   |   |   |--- feature_4 <= 6.50
|   |   |   |   |   |--- class: 0
|   |   |   |   |--- feature_4 >  6.50
|   |   |   |   |   |--- class: 1
|   |   |   |--- feature_1 >  15.50
|   |   |   |   |--- feature_0 <= 1.50
|   |   |   |   |   |--- feature_9 <= 2.00
|   |   |   |   |   |   |--- feature_1 <= 112.75
|   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |   |--- feature_1 >  112.75
|   |   |   |   |   |   |   |--- feature_1 <= 162.50
|   |   |   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |   |   |--- feature_1 >  162.50
|   |   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |--- feature_9 >  2.00
|   |   |   |   |   |   |--- class: 1
|   |   |   |   |--- feature_0 >  1.50
|   |   |   |   |   |--- class: 0
|   |--- feature_5 >  244.58
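The text dump above labels splits as `feature_5`, `feature_1` and so on, which has to be translated back to column names by hand. `export_text` accepts a `feature_names` list for this. A minimal sketch on synthetic data follows; the names in `names` are borrowed from the dataset for readability, while the data and the depth limit of 2 are assumptions made to keep the output short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Column labels borrowed from the dataset; the data below is synthetic.
names = ["Bounce_Rates", "Special_Day", "Groceries_Duration", "Sports_Equipments_Duration"]
X_syn, y_syn = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# A shallow tree keeps the printed rules readable.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_syn, y_syn)

# feature_names replaces the generic feature_<i> labels in the report.
report = export_text(small_tree, feature_names=names)
print(report)
```

In the notebook itself the equivalent call would be `tree.export_text(clf, feature_names=features)`, once `features` has been defined.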


In [112]:
fig = plt.figure(figsize=(25,20))
tree.plot_tree(clf)

[Figure: decision tree drawn by `tree.plot_tree(clf)`. Root node: x[5] <= 750.962, gini = 0.166, samples = 1503, value = [1366, 137].]

In [135]:
clf2 = DecisionTreeClassifier(criterion='gini')

# Fit the decision tree classifier
clf2 = clf2.fit(X_train, y_train)
print('Max Depth of Dec Tree is :: %.3f' % clf2.tree_.max_depth)

Max Depth of Dec Tree is :: 17.000
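The two trees report depths of 18 and 17 even though both use the default gini criterion. A likely cause is that `random_state` is not set: scikit-learn permutes the features at each split, so ties between equally good splits can resolve differently from run to run. Reading the depth off an unconstrained tree also only reports how deep the tree grew, not how deep it should be. A minimal sketch of fixing the seed and choosing `max_depth` by cross-validation follows; the synthetic data, the candidate depths and the F1 scoring are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data standing in for the shopper dataset.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=4, weights=[0.9, 0.1], random_state=0
)

# Try a few depth limits with 5-fold cross-validation; None means "grow fully".
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),  # fixed seed makes the tree reproducible
    param_grid={"max_depth": [2, 4, 6, 8, None]},
    scoring="f1",
    cv=5,
)
search.fit(X_syn, y_syn)

best_depth = search.best_params_["max_depth"]
fitted_depth = search.best_estimator_.get_depth()
print("best max_depth:", best_depth, "| depth of refitted tree:", fitted_depth)
```

With `random_state` fixed, refitting the same tree on the same data gives the same depth each time, so the answer to Question 3 no longer changes between runs.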


In [117]:
feature_importances = clf2.feature_importances_

# WHICH FEATURE IS MOST IMPORTANT

In [124]:
import seaborn as sns

# Sort the feature importances from greatest to least using the sorted indices
sorted_indices = feature_importances.argsort()[::-1]
#sorted_feature_names = features[sorted_indices]
#print("SORTED FEATURE NAMES")
#print(sorted_feature_names)

print("SORTED INDICES")
sorted_importances = feature_importances[sorted_indices]
print(sorted_indices)

# Create a bar plot of the feature importances
sns.set(rc={'figure.figsize':(11.7,8.27)})
#sns.barplot(sorted_importances, sorted_feature_names)
print("SORTED IMPORTANCES")
print(sorted_importances)

SORTED INDICES
[ 5  4  1  3  0  9  6 10 12  8 11  7  2]
SORTED IMPORTANCES
[0.28962889 0.15910056 0.15147824 0.08414852 0.06816402 0.06807486
 0.06616778 0.04859415 0.0475849  0.00917867 0.00787941 0.
 0.        ]


In [132]:
#print('Highest Important Feature:  %.3f' % features[0])

print("MOST IMPORTANT FEATURE")
most_important = sorted_indices[0]
print(features[most_important])

print("SECOND MOST IMPORTANT FEATURE")
second_most_important = sorted_indices[1]
print(features[second_most_important])

MOST IMPORTANT FEATURE
Sports_Equipments_Duration
SECOND MOST IMPORTANT FEATURE
SportsRelated
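The commented-out `features[sorted_indices]` fails because `features` is a plain Python list, which cannot be indexed with a NumPy array. A minimal sketch of pairing names with importances through a pandas Series follows. The importance values are copied from the SORTED IMPORTANCES output above and placed back in feature order using SORTED INDICES; the variable names `importances` and `ranked` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

features = ['ElectronicDevices', 'ElectronicDevices_Duration', 'Groceries',
            'Groceries_Duration', 'SportsRelated', 'Sports_Equipments_Duration',
            'Bounce_Rates', 'Special_Day', 'Month', 'Browser', 'Region',
            'Type_of_visitor', 'Weekend']

# Values taken from the printed output above, rearranged into feature order.
importances = np.array([0.06816402, 0.15147824, 0.0, 0.08414852, 0.15910056,
                        0.28962889, 0.06616778, 0.0, 0.00917867, 0.06807486,
                        0.04859415, 0.00787941, 0.0475849])

# A Series keeps each name attached to its score while sorting.
ranked = pd.Series(importances, index=features).sort_values(ascending=False)
print(ranked)
```

In the notebook, `pd.Series(clf2.feature_importances_, index=features)` would give the same ranking directly, and `ranked.plot.barh()` is one way to draw it.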
