# Problem

The market research team at AdRight is tasked with identifying the profile of the typical customer for each treadmill product offered by CardioGood Fitness. The team decides to investigate whether there are differences across the product lines with respect to customer characteristics, and collects data on individuals who purchased a treadmill at a CardioGood Fitness retail store during the prior three months. The data are stored in the CardioGoodFitness.csv file. The team identifies the following customer variables to study: product purchased (TM195, TM498, or TM798); gender; age, in years; education, in years; relationship status (single or partnered); annual household income ($); average number of times the customer plans to use the treadmill each week; average number of miles the customer expects to walk/run each week; and self-rated fitness on a 1-to-5 scale, where 1 is poor shape and 5 is excellent shape. Perform descriptive analytics to create a customer profile for each CardioGood Fitness treadmill product line.

# Basic Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statistics import mean

# Dataset

In [2]:
df = pd.read_csv(r'C:\Users\jyama\Git\CardioGoods\CardioGoodFitness.csv')
df.head()

Unnamed: 0,Product,Age,Gender,Education,MaritalStatus,Usage,Fitness,Income,Miles
0,TM195,18,Male,14,Single,3,4,29562,112
1,TM195,19,Male,15,Single,2,3,31836,75
2,TM195,19,Female,14,Partnered,4,3,30699,66
3,TM195,19,Male,12,Single,3,3,32973,85
4,TM195,20,Male,13,Partnered,4,2,35247,47


In [3]:
df.shape


(180, 9)

# Data Cleaning

In [4]:
# Drop rows with missing values (the shape stays (180, 9), so there were none)
df = df.dropna()
df.shape


(180, 9)

In [5]:
# Look at the unique values in each column
# (items() replaces iteritems(), which was removed in pandas 2.0)
for (columnName, columnData) in df.items():
    print('Column Name : ', columnName)
    print('Column Contents : ', columnData.unique())


Column Name :  Product
Column Contents :  ['TM195' 'TM498' 'TM798']
Column Name :  Age
Column Contents :  [18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
 43 44 46 47 50 45 48 42]
Column Name :  Gender
Column Contents :  ['Male' 'Female']
Column Name :  Education
Column Contents :  [14 15 12 13 16 18 20 21]
Column Name :  MaritalStatus
Column Contents :  ['Single' 'Partnered']
Column Name :  Usage
Column Contents :  [3 2 4 5 6 7]
Column Name :  Fitness
Column Contents :  [4 3 2 1 5]
Column Name :  Income
Column Contents :  [ 29562  31836  30699  32973  35247  37521  36384  38658  40932  34110
  39795  42069  44343  45480  46617  48891  53439  43206  52302  51165
  50028  54576  68220  55713  60261  67083  56850  59124  61398  57987
  64809  47754  65220  62535  48658  54781  48556  58516  53536  61006
  57271  52291  49801  62251  64741  70966  75946  74701  69721  83416
  88396  90886  92131  77191  52290  85906 103336  99601  89641  95866
 104581  95508]
[output truncated]

We see that a couple of the columns are categorical, for example Gender and MaritalStatus.

In [6]:
# One-hot encode the categorical columns
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
features = ohe.fit_transform(df[['Gender','MaritalStatus']]).toarray()
# Flatten the per-column category arrays into one list of column labels
labels = [category for categories in ohe.categories_ for category in categories]

# Reuse df's index so the concat below stays row-aligned
ohedf = pd.DataFrame(features, columns = labels, index = df.index)
df = df.drop(columns = ['Gender','MaritalStatus'])
df = pd.concat([df,ohedf], axis = 1)
df.head()


Unnamed: 0,Product,Age,Education,Usage,Fitness,Income,Miles,Female,Male,Partnered,Single
0,TM195,18,14,3,4,29562,112,0.0,1.0,0.0,1.0
1,TM195,19,15,2,3,31836,75,0.0,1.0,0.0,1.0
2,TM195,19,14,4,3,30699,66,1.0,0.0,1.0,0.0
3,TM195,19,12,3,3,32973,85,0.0,1.0,0.0,1.0
4,TM195,20,13,4,2,35247,47,0.0,1.0,1.0,0.0
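
As a possible alternative, the same encoding can be sketched with `pd.get_dummies`, which works on the frame directly and so keeps the original index without a separate concat. The toy rows below are invented for illustration and are not taken from CardioGoodFitness.csv.

```python
import pandas as pd

# Toy rows, invented for illustration only (not from the real dataset).
toy = pd.DataFrame({
    "Age": [18, 19, 20],
    "Gender": ["Male", "Female", "Male"],
    "MaritalStatus": ["Single", "Partnered", "Partnered"],
})

# prefix="" and prefix_sep="" keep the bare category names as column labels,
# matching the Female/Male/Partnered/Single columns in the notebook.
encoded = pd.get_dummies(
    toy,
    columns=["Gender", "MaritalStatus"],
    prefix="",
    prefix_sep="",
    dtype=float,
)
print(encoded)
```

The non-encoded columns come first, followed by one 0/1 column per category.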


In [7]:
# df['Product'] is our target. It must be encoded numerically.
print(df['Product'].unique(), df['Product'].value_counts())
    

['TM195' 'TM498' 'TM798'] TM195    80
TM498    60
TM798    40
Name: Product, dtype: int64


In [8]:
# Function to map each product name to a numeric code
def categorise(item):
    if item == 'TM195':
        return 1
    elif item == 'TM498':
        return 2
    else:
        return 3
# Apply the function to df['Product']
df['Product'] = df['Product'].apply(categorise)
df['Product'].value_counts()

1    80
2    60
3    40
Name: Product, dtype: int64
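
The same mapping could also be written with a dictionary and `Series.map`. This is only a sketch on a toy Series (invented here); one difference from the `else` branch above is that an unexpected label becomes NaN instead of silently being coded as 3.

```python
import pandas as pd

# Product codes as used in the notebook.
product_codes = {"TM195": 1, "TM498": 2, "TM798": 3}

# Toy Series, invented for illustration only.
products = pd.Series(["TM195", "TM498", "TM798", "TM195"])
coded = products.map(product_codes)
print(coded.value_counts())

# A label missing from the dictionary maps to NaN rather than to 3.
unknown = pd.Series(["TM999"]).map(product_codes)
print(unknown.isna().all())
```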

# About the Data

In [9]:
# Separate the dataframe into three, one per product
df1 = df[df['Product']==1]
df2 = df[df['Product']==2]
df3 = df[df['Product'] ==3]


# Column means over all customers
for (columnName, columnData) in df.items():
    print('Column Name : ', columnName)
    print('Column Mean : ', mean(columnData))

Column Name :  Product
Column Mean :  1.7777777777777777
Column Name :  Age
Column Mean :  28.788888888888888
Column Name :  Education
Column Mean :  15.572222222222223
Column Name :  Usage
Column Mean :  3.4555555555555557
Column Name :  Fitness
Column Mean :  3.311111111111111
Column Name :  Income
Column Mean :  53719.57777777778
Column Name :  Miles
Column Mean :  103.19444444444444
Column Name :  Female
Column Mean :  0.4222222222222222
Column Name :  Male
Column Mean :  0.5777777777777777
Column Name :  Partnered
Column Mean :  0.5944444444444444
Column Name :  Single
Column Mean :  0.40555555555555556


In [10]:
# Column means for product 1 (TM195)
for (columnName, columnData) in df1.items():
    print(columnName, ' : ', mean(columnData))

Product  :  1
Age  :  28.55
Education  :  15.0375
Usage  :  3.0875
Fitness  :  2.9625
Income  :  46418.025
Miles  :  82.7875
Female  :  0.5
Male  :  0.5
Partnered  :  0.6
Single  :  0.4


In [11]:
# Column means for product 2 (TM498)
for (columnName, columnData) in df2.items():
    print(columnName, ' : ', mean(columnData))

Product  :  2
Age  :  28.9
Education  :  15.116666666666667
Usage  :  3.066666666666667
Fitness  :  2.9
Income  :  48973.65
Miles  :  87.93333333333334
Female  :  0.48333333333333334
Male  :  0.5166666666666667
Partnered  :  0.6
Single  :  0.4


In [12]:
# Column means for product 3 (TM798)
for (columnName, columnData) in df3.items():
    print(columnName, ' : ', mean(columnData))

Product  :  3
Age  :  29.1
Education  :  17.325
Usage  :  4.775
Fitness  :  4.625
Income  :  75441.575
Miles  :  166.9
Female  :  0.175
Male  :  0.825
Partnered  :  0.575
Single  :  0.425


We can see that most of the features differ very little between products. The means of 'Age', 'Partnered', and 'Single' are almost identical across all three products, and there is hardly any difference between the customers of products 1 and 2. Product 3 customers, however, report higher 'Usage', 'Fitness', 'Income', and 'Miles', and a large majority of them (82.5%) are male. From this, we can assume that product 3 is aimed at people who are well off, have ambitious fitness goals, and have the time to pursue them, while products 1 and 2 are better suited to fitness beginners who may have a tighter budget.
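
The three per-product loops above can be condensed into one table with `groupby`. The sketch below uses a small toy frame (values invented for illustration) with the same kind of columns; on the real data the call would be `df.groupby('Product').mean()`.

```python
import pandas as pd

# Toy frame, invented for illustration only (not the real dataset).
toy = pd.DataFrame({
    "Product": [1, 1, 2, 2, 3, 3],
    "Income": [30000, 34000, 48000, 50000, 80000, 90000],
    "Miles": [60, 80, 85, 95, 160, 180],
    "Male": [1.0, 0.0, 1.0, 0.0, 1.0, 1.0],
})

# One row per product, one column per feature mean.
profile = toy.groupby("Product").mean()
print(profile)
```

Putting the means side by side makes it easier to spot which features separate the products and which do not.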

# Begin Model Selection

In [13]:
#split dataframe into x and y
data = df.drop(columns = ["Product"])
target = df['Product']

In [14]:
data.head()


Unnamed: 0,Age,Education,Usage,Fitness,Income,Miles,Female,Male,Partnered,Single
0,18,14,3,4,29562,112,0.0,1.0,0.0,1.0
1,19,15,2,3,31836,75,0.0,1.0,0.0,1.0
2,19,14,4,3,30699,66,1.0,0.0,1.0,0.0
3,19,12,3,3,32973,85,0.0,1.0,0.0,1.0
4,20,13,4,2,35247,47,0.0,1.0,1.0,0.0


In [15]:
target.head()

0    1
1    1
2    1
3    1
4    1
Name: Product, dtype: int64

In [16]:
# Drop the features whose means barely differ between products
data = data.drop(columns =['Age', 'Partnered','Single'])
data

Unnamed: 0,Education,Usage,Fitness,Income,Miles,Female,Male
0,14,3,4,29562,112,0.0,1.0
1,15,2,3,31836,75,0.0,1.0
2,14,4,3,30699,66,1.0,0.0
3,12,3,3,32973,85,0.0,1.0
4,13,4,2,35247,47,0.0,1.0
...,...,...,...,...,...,...,...
175,21,6,5,83416,200,0.0,1.0
176,18,5,4,89641,200,0.0,1.0
177,16,5,5,90886,160,0.0,1.0
178,18,4,5,104581,120,0.0,1.0


In [17]:
from sklearn.preprocessing import MinMaxScaler

In [18]:
data = MinMaxScaler().fit_transform(data)

In [19]:
data = pd.DataFrame(data)
data

Unnamed: 0,0,1,2,3,4,5,6
0,0.222222,0.2,0.75,0.000000,0.268437,0.0,1.0
1,0.333333,0.0,0.50,0.030312,0.159292,0.0,1.0
2,0.222222,0.4,0.50,0.015156,0.132743,1.0,0.0
3,0.000000,0.2,0.50,0.045468,0.188791,0.0,1.0
4,0.111111,0.4,0.25,0.075781,0.076696,0.0,1.0
...,...,...,...,...,...,...,...
175,1.000000,0.8,1.00,0.717871,0.528024,0.0,1.0
176,0.666667,0.6,0.75,0.800850,0.528024,0.0,1.0
177,0.444444,0.6,1.00,0.817446,0.410029,0.0,1.0
178,0.666667,0.4,1.00,1.000000,0.292035,0.0,1.0


In [20]:
#import Libraries from sklearn
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.model_selection import KFold
from sklearn import tree
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.linear_model import LogisticRegression

In [21]:
#function to get score of model
def model_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [22]:
#our KFolds
kf = KFold(n_splits = 8)

In [23]:
knn_score = []
tree_score =[]
svc_score = []
rf_score = []
lr_score = []
# K-fold cross-validation loop: append each model's accuracy on every fold
for train_index, test_index in kf.split(data):
    X_train, X_test, y_train, y_test = data.iloc[train_index], data.iloc[test_index], target.iloc[train_index], target.iloc[test_index]
    knn_score.append(model_score(KNN(), X_train, X_test, y_train, y_test))
    tree_score.append(model_score(tree.DecisionTreeClassifier(), X_train, X_test, y_train, y_test))
    svc_score.append(model_score(SVC(), X_train, X_test, y_train, y_test))
    rf_score.append(model_score(RF(), X_train, X_test, y_train, y_test))
    lr_score.append(model_score(LogisticRegression(), X_train, X_test, y_train, y_test))

In [24]:
score_list = [knn_score, tree_score, svc_score, rf_score, lr_score]

In [25]:
# Mean accuracy per model, in order: KNN, decision tree, SVC, random forest, logistic regression
for item in score_list:
    print(mean(item))

0.3824110671936759
0.5170454545454546
0.3050889328063241
0.3890810276679842
0.2507411067193676


The decision tree has the highest mean accuracy (about 0.52), so it seems to be the best model for our problem.
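
One thing worth checking: the rows appear to be ordered by product (the first rows are all TM195), and `KFold` without `shuffle=True` cuts the data into contiguous blocks, so a test fold may contain mostly one product. A minimal sketch of a stratified, shuffled alternative is shown below on synthetic data; the features, the random seed, and the choice of `StratifiedKFold` are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: labels sorted by class with the notebook's class sizes.
y = np.repeat([1, 2, 3], [80, 60, 40])
# Four made-up features whose mean shifts with the class label.
X = rng.normal(size=(180, 4)) + y[:, None]

# Shuffled, stratified folds: every test fold contains all three products.
skf = StratifiedKFold(n_splits=8, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=skf)
print(scores.mean())
```

`cross_val_score` also replaces the manual loop and the `model_score` helper for a single model.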

# Hyperparameter Tuning

In [26]:
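from sklearn.model_selection import GridSearchCV
# Note: 'log_loss' is only accepted by scikit-learn 1.1 and later; on older versions
# every candidate that uses it fails with KeyError: 'log_loss' (see the traceback below).
parameters = {'criterion':('gini', 'entropy', 'log_loss'),
              'splitter': ('best', 'random'),
              'max_depth' : range(10,110,10),
              'min_samples_split' : (2,4,6),
              'max_features': ('auto', 'sqrt', 'log2')
              }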

In [27]:
tree_grid = GridSearchCV(tree.DecisionTreeClassifier(), param_grid = parameters, cv = 5, verbose = True)

In [28]:
# X_train and y_train are the training split left over from the last KFold iteration above
tree_grid.fit(X_train, y_train)

Fitting 5 folds for each of 540 candidates, totalling 2700 fits


Traceback (most recent call last):
  File "C:\Users\jyama\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\jyama\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 903, in fit
    super().fit(
  File "C:\Users\jyama\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 348, in fit
    criterion = CRITERIA_CLF[self.criterion](self.n_outputs_,
KeyError: 'log_loss'

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ('gini', 'entropy', 'log_loss'),
                         'max_depth': range(10, 110, 10),
                         'max_features': ('auto', 'sqrt', 'log2'),
                         'min_samples_split': (2, 4, 6),
                         'splitter': ('best', 'random')},
             verbose=True)

In [29]:
tree_grid.best_estimator_

DecisionTreeClassifier(max_depth=20, max_features='sqrt', min_samples_split=4)
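
When some candidates fail to fit, `GridSearchCV` scores them with `error_score` (NaN by default) and carries on, which is why a best estimator is still returned above. A sketch of a search restricted to values that older and newer scikit-learn versions both accept is shown below; the synthetic data and the smaller grid are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: three classes, four made-up features.
y = np.repeat([1, 2, 3], 30)
X = rng.normal(size=(90, 4)) + y[:, None]

# Only criteria and options that do not depend on a recent scikit-learn release.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10],
    "min_samples_split": [2, 4, 6],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

`search.cv_results_` holds the per-candidate scores, which helps confirm that no candidate was silently scored as NaN.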

In [33]:
new_tree = tree.DecisionTreeClassifier(max_depth=20, max_features='sqrt', min_samples_split=4)

In [34]:
new_tree.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=20, max_features='sqrt', min_samples_split=4)

In [35]:
print('Train Score: ', new_tree.score(X_train, y_train))
print('Test Score: ', new_tree.score(X_test, y_test))

Train Score:  0.9240506329113924
Test Score:  0.9545454545454546


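new_tree is our final model. I am fairly confident in its performance because its train score (0.92) and test score (0.95) are both high and close to each other, which suggests it is not overfitting. Both scores are computed on the split from the last KFold iteration, so they describe that split only.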
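
To check the final model on a test set that contains all three products, a stratified hold-out split is one option. The sketch below uses synthetic data and an arbitrary 25% test size and seed, all of which are assumptions; only the tree's hyperparameters are taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in with the notebook's class sizes (80 / 60 / 40).
y = np.repeat([1, 2, 3], [80, 60, 40])
X = rng.normal(size=(180, 4)) + y[:, None]

# stratify=y keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameters copied from the tuned tree in the notebook.
clf = DecisionTreeClassifier(
    max_depth=20, max_features="sqrt", min_samples_split=4, random_state=0
)
clf.fit(X_tr, y_tr)
print("Train Score: ", clf.score(X_tr, y_tr))
print("Test Score: ", clf.score(X_te, y_te))
```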