# Parameter Play

## Instructions

There are a lot of parameters that are set by default when working with these classifiers. Intellisense in VS Code can help you dig into them. Adopt one of the ML Classification Techniques in this lesson and retrain models tweaking various parameter values. Build a notebook explaining why some changes help the model quality while others degrade it. Be detailed in your answer.



In [1]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report,precision_recall_curve


In [2]:
cuisines_df = pd.read_csv("../data/cleaned_cuisines.csv")
cuisines_df.head()

Unnamed: 0.1,Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,indian,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [3]:
cuisines_label_df = cuisines_df['cuisine']
cuisines_label_df.head()

0    indian
1    indian
2    indian
3    indian
4    indian
Name: cuisine, dtype: object

In [4]:
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
cuisines_feature_df.head()

Unnamed: 0,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,artichoke,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [5]:
# split training and test data
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df,cuisines_label_df,test_size=0.3)
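
The split above has no `random_state`, so every run draws a different test set and the scores below shift slightly from run to run. A minimal sketch of a seeded, stratified split follows; the random binary matrix and the seed value are stand-ins I made up for illustration, not the cuisine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 recipes, 30 binary ingredient flags, 5 cuisines.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 30))
y = np.repeat(["chinese", "indian", "japanese", "korean", "thai"], 40)

# random_state fixes the shuffle; stratify keeps the class proportions
# the same in the training and test portions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Splitting again with the same seed selects exactly the same test rows.
_, X_te_again, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(X_tr.shape, X_te.shape)
print("same test rows:", np.array_equal(X_te, X_te_again))
```

With a fixed seed, any change in accuracy between two cells can be attributed to the parameter that changed rather than to a different split.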

In [6]:
# Reproduce the lesson's models
C = 10
# Create the different classifiers
# (C is reused as the SVC penalty and as the number of neighbors for KNN)
classifiers = {
    'Linear SVC': SVC(kernel='linear',C=C,probability=True,random_state=0),
    'KNN classifier': KNeighborsClassifier(C),
    'SVC': SVC(),
    'RFST': RandomForestClassifier(n_estimators=100),
    'ADA': AdaBoostClassifier(n_estimators=100)
}

# Train each classifier in turn and print out its report.
# Note: the score is measured on the held-out test split,
# even though the printed label says "(train)".
for name, classifier in classifiers.items():
    classifier.fit(X_train,np.ravel(y_train))

    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test,y_pred)
    print('Accuracy (train) for %s:%0.1f%%' % (name,accuracy * 100))
    print(classification_report(y_test,y_pred))

Accuracy (train) for Linear SVC:76.1%
              precision    recall  f1-score   support

     chinese       0.70      0.66      0.68       241
      indian       0.89      0.83      0.86       250
    japanese       0.72      0.77      0.74       240
      korean       0.82      0.73      0.77       244
        thai       0.69      0.83      0.76       224

    accuracy                           0.76      1199
   macro avg       0.77      0.76      0.76      1199
weighted avg       0.77      0.76      0.76      1199

Accuracy (train) for KNN classifier:72.1%
              precision    recall  f1-score   support

     chinese       0.59      0.75      0.66       241
      indian       0.85      0.74      0.79       250
    japanese       0.67      0.85      0.75       240
      korean       0.89      0.54      0.67       244
        thai       0.75      0.72      0.74       224

    accuracy                           0.72      1199
   macro avg       0.75      0.72      0.72      11

I will work with the random forest, as it had the highest accuracy of the five classifiers.
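
A single 70/30 split can favor one model by chance, so a cross-validated comparison is a safer basis for picking a classifier. The sketch below uses `cross_val_score` (already imported in the first cell) on synthetic data from `make_classification`; the data, the two candidate models and the fold count are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 5-class stand-in for the ingredient matrix (not the real data).
X, y = make_classification(
    n_samples=500, n_features=40, n_informative=10,
    n_classes=5, random_state=0,
)
X = (X > 0).astype(int)  # binarize to mimic ingredient present/absent flags

candidates = {
    "KNN": KNeighborsClassifier(n_neighbors=10),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_results = {}
for name, model in candidates.items():
    # Five folds: each model is trained five times and scored on the fold left out.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the mean scores of two models differ by less than their standard deviations, the ranking from a single split should not be trusted too much.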

In [8]:
# Baseline random forest; the accuracy printed below is measured on the test split
classifier = RandomForestClassifier(n_estimators=100)
classifier.fit(X_train,np.ravel(y_train))
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f'Random Forest Accuracy (train): {accuracy * 100}')
print(classification_report(y_test,y_pred))

Random Forest Accuracy (train): 83.15262718932443
              precision    recall  f1-score   support

     chinese       0.80      0.73      0.76       241
      indian       0.89      0.89      0.89       250
    japanese       0.85      0.85      0.85       240
      korean       0.87      0.80      0.83       244
        thai       0.76      0.89      0.82       224

    accuracy                           0.83      1199
   macro avg       0.83      0.83      0.83      1199
weighted avg       0.83      0.83      0.83      1199



Let's tweak the parameters a bit

In [10]:
# Increase n_estimators to 500 and fix random_state so the following runs are comparable
classifier = RandomForestClassifier(n_estimators=500,random_state=40)
classifier.fit(X_train,np.ravel(y_train))
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f'Random Forest Accuracy (train): {accuracy * 100}')
print(classification_report(y_test,y_pred))

Random Forest Accuracy (train): 83.31943286071727
              precision    recall  f1-score   support

     chinese       0.78      0.74      0.76       241
      indian       0.90      0.88      0.89       250
    japanese       0.85      0.85      0.85       240
      korean       0.88      0.81      0.84       244
        thai       0.76      0.88      0.82       224

    accuracy                           0.83      1199
   macro avg       0.83      0.83      0.83      1199
weighted avg       0.84      0.83      0.83      1199



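Accuracy increased slightly, from 83.15% to 83.32%, which amounts to only two more test recipes classified correctly out of 1,199. That direction is expected: more trees make the forest's averaged vote more stable. The gain levels off, though. Past some point, adding trees (n_estimators) generally does not hurt accuracy, but it stops helping and only costs more training time and memory. Also, the baseline run did not fix random_state, so a difference this small is within run-to-run noise.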
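
To judge whether a change in `n_estimators` is real or just noise, it helps to repeat each setting with several seeds. This is a rough sketch on synthetic stand-in data; the tree counts, the seeds and the data itself are my own choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cuisine data (assumption, not the real dataset).
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=10,
    n_classes=5, random_state=0,
)
X = (X > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

summary = {}
for n_trees in (10, 50, 200):
    scores = []
    for seed in range(3):
        # Same data and tree count; only the forest's random seed changes.
        forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        forest.fit(X_tr, y_tr)
        scores.append(forest.score(X_te, y_te))
    summary[n_trees] = (float(np.mean(scores)), float(np.std(scores)))
    print(f"{n_trees:>3} trees: {summary[n_trees][0]:.3f} +/- {summary[n_trees][1]:.3f}")
```

If the spread across seeds is as large as the gap between two tree counts, the gap is not evidence that one setting is better.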

In [11]:
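# max_depth=None is the default (trees grow until the leaves are pure),
# so with the same random_state this reproduces the previous result exactly
classifier = RandomForestClassifier(n_estimators=500,random_state=40,max_depth=None)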
classifier.fit(X_train,np.ravel(y_train))
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f'Random Forest Accuracy (train): {accuracy * 100}')
print(classification_report(y_test,y_pred))

Random Forest Accuracy (train): 83.31943286071727
              precision    recall  f1-score   support

     chinese       0.78      0.74      0.76       241
      indian       0.90      0.88      0.89       250
    japanese       0.85      0.85      0.85       240
      korean       0.88      0.81      0.84       244
        thai       0.76      0.88      0.82       224

    accuracy                           0.83      1199
   macro avg       0.83      0.83      0.83      1199
weighted avg       0.84      0.83      0.83      1199



In [12]:
classifier = RandomForestClassifier(n_estimators=500,random_state=40,max_depth=5)
classifier.fit(X_train,np.ravel(y_train))
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f'Random Forest Accuracy (train): {accuracy * 100}')
print(classification_report(y_test,y_pred))

Random Forest Accuracy (train): 70.97581317764804
              precision    recall  f1-score   support

     chinese       0.71      0.58      0.64       241
      indian       0.87      0.78      0.83       250
    japanese       0.51      0.88      0.65       240
      korean       0.82      0.66      0.73       244
        thai       0.84      0.64      0.73       224

    accuracy                           0.71      1199
   macro avg       0.75      0.71      0.71      1199
weighted avg       0.75      0.71      0.72      1199



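Limiting max_depth to 5 cut the accuracy from 83.3% to 71.0%. This is underfitting: a tree that shallow can test at most five ingredients along any path, which is too few to tell the cuisines apart. The effect is clearest for japanese, where precision fell to 0.51 while recall stayed at 0.88, meaning many recipes from other cuisines are being labelled japanese. Let's increase it a bit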

In [13]:
classifier = RandomForestClassifier(n_estimators=500,random_state=40,max_depth=20)
classifier.fit(X_train,np.ravel(y_train))
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f'Random Forest Accuracy (train): {accuracy * 100}')
print(classification_report(y_test,y_pred))

Random Forest Accuracy (train): 80.90075062552127
              precision    recall  f1-score   support

     chinese       0.81      0.68      0.74       241
      indian       0.92      0.86      0.89       250
    japanese       0.66      0.91      0.76       240
      korean       0.89      0.80      0.84       244
        thai       0.84      0.79      0.81       224

    accuracy                           0.81      1199
   macro avg       0.82      0.81      0.81      1199
weighted avg       0.82      0.81      0.81      1199



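Increasing max_depth from 5 to 20 recovered most of the accuracy (80.9%), but it is still below the 83.3% of the unrestricted forest (max_depth=None). On this dataset, then, the fully grown trees perform best. Very deep trees can overfit individually and capture noise, but averaging many of them usually offsets that, so the limit is worth tuning against held-out data rather than assuming that shallower is safer.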
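
One way to see underfitting versus overfitting is to compare training and test accuracy for each depth limit, and to check how deep the trees actually grow. The sketch below does this on synthetic stand-in data (an assumption, not the cuisine data); `get_depth()` is called on each fitted tree in `estimators_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cuisine data (assumption, not the real dataset).
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=10,
    n_classes=5, random_state=0,
)
X = (X > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

depth_results = {}
for depth in (2, 5, 20, None):
    forest = RandomForestClassifier(
        n_estimators=100, max_depth=depth, random_state=0
    )
    forest.fit(X_tr, y_tr)
    # Deepest tree actually grown under this limit.
    grown = max(tree.get_depth() for tree in forest.estimators_)
    depth_results[depth] = (
        forest.score(X_tr, y_tr),  # training accuracy
        forest.score(X_te, y_te),  # test accuracy
        grown,
    )
    print(f"max_depth={depth}: train={depth_results[depth][0]:.3f} "
          f"test={depth_results[depth][1]:.3f} deepest tree={grown}")
```

Low accuracy on both splits points to underfitting; a training score far above the test score points to overfitting. If the deepest tree under `max_depth=None` is deeper than 20, that would explain why a limit of 20 still changes the result.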

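Let's tinker with max_features, the number of features considered at each split. The default ("auto" in older scikit-learn releases, "sqrt" in current ones) is the square root of the total number of features.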

In [14]:
classifier = RandomForestClassifier(n_estimators=500,random_state=40,max_features=0.5)
classifier.fit(X_train,np.ravel(y_train))
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f'Random Forest Accuracy (train): {accuracy * 100}')
print(classification_report(y_test,y_pred))

Random Forest Accuracy (train): 78.6488740617181
              precision    recall  f1-score   support

     chinese       0.71      0.70      0.71       241
      indian       0.89      0.84      0.86       250
    japanese       0.80      0.81      0.81       240
      korean       0.83      0.73      0.78       244
        thai       0.71      0.86      0.78       224

    accuracy                           0.79      1199
   macro avg       0.79      0.79      0.79      1199
weighted avg       0.79      0.79      0.79      1199



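Setting max_features to 0.5 reduced the accuracy from just over 83% to 78.6%. Each split now considers half of all the features instead of the square root of their number, so the trees tend to pick the same strong ingredients and end up more alike. A forest benefits from averaging trees that make different mistakes, and that is the likely reason the more correlated trees do worse here.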
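
Changing one parameter at a time misses interactions between them, so a small grid search with cross-validation is a natural next step. This is only a sketch: the grid values, the fold count and the synthetic data are assumptions of mine, and a real run would search on the cuisine training set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cuisine data (assumption, not the real dataset).
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=10,
    n_classes=5, random_state=0,
)
X = (X > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Hypothetical grid: 3 max_features values x 2 depth limits = 6 candidates.
param_grid = {
    "max_features": ["sqrt", "log2", 0.5],
    "max_depth": [5, None],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
# Tune on the training portion only, so the test set stays untouched.
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"test accuracy of the refit model: {search.score(X_te, y_te):.3f}")
```

The test accuracy is computed once at the end, on data the search never saw, so it is not inflated by the tuning itself.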