# Build Classification Model

In [1]:
import pandas as pd
cuisines_df = pd.read_csv("../data/cleaned_cuisines.csv")
cuisines_df.head()

Unnamed: 0.1,Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,indian,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [2]:
cuisines_label_df = cuisines_df['cuisine']
cuisines_label_df.head()

0    indian
1    indian
2    indian
3    indian
4    indian
Name: cuisine, dtype: object

In [3]:
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
cuisines_feature_df.head()

Unnamed: 0,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,artichoke,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


# Split the Data

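We follow the path suggested by the scikit-learn cheat sheet. First we import the classifiers and helpers we need, then we split the data into training and test sets.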

In [4]:
# import needed libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report,precision_recall_curve
import numpy as np


In [5]:
# split training and test data
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df,cuisines_label_df,test_size=0.3)
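
As a rough illustration of what a 70/30 hold-out split does, the standard-library sketch below shuffles row indices with a fixed seed and sets aside 30% of them for testing. The helper name `split_indices` and the seed are assumptions made for this example; it is not scikit-learn's implementation. `train_test_split` itself accepts `random_state` to make the split reproducible and `stratify` to keep class proportions similar in both parts.

```python
import math
import random


def split_indices(n_rows, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) for a shuffled hold-out split.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_rows))
    # A fixed seed makes the shuffle, and therefore the split, reproducible.
    random.Random(seed).shuffle(indices)
    # Round the test share up to a whole number of rows.
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10, test_size=0.3, seed=0)
print(len(train_idx), len(test_idx))
```

Without a fixed seed, every run produces a different split, so the reported accuracies will vary slightly from run to run.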

# Linear SVC classifier

Support Vector Classification (SVC) belongs to the Support Vector Machine (SVM) family of ML techniques. In this method, we choose a 'kernel' that determines the shape of the boundary used to separate the labels. The 'C' parameter controls 'regularization': smaller values regularize more strongly, while larger values penalize misclassified training points more heavily. The kernel can be one of several; here we set it to 'linear' so that the boundary is linear. 'probability' defaults to 'False'; here we set it to 'True' so that the model can return probability estimates. Those estimates are computed with an internal cross-validation that shuffles the data, so we set 'random_state' to '0' to make them reproducible.
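
To make the role of 'C' concrete, here is a minimal sketch of the two-class soft-margin objective that a linear SVM minimizes: a margin term plus C times the total hinge loss. The function name, the toy weights, and the data points are assumptions for illustration; scikit-learn's `SVC` solves this kind of problem internally and handles more than two classes by combining binary classifiers.

```python
def soft_margin_objective(w, b, X, y, C):
    """Compute 0.5 * ||w||^2 + C * sum(max(0, 1 - y_i * (w . x_i + b))).

    Labels in y must be +1 or -1. Hypothetical helper for illustration.
    """
    margin_term = 0.5 * sum(wj * wj for wj in w)
    hinge_total = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        # Points on the correct side with margin >= 1 contribute nothing.
        hinge_total += max(0.0, 1.0 - yi * score)
    return margin_term + C * hinge_total


# Toy data: the third point sits exactly on the boundary, so it violates the margin.
X = [[1, 0], [0, 1], [1, 1]]
y = [1, -1, -1]
w, b = [1.0, -1.0], 0.0

low_C = soft_margin_objective(w, b, X, y, C=0.1)
high_C = soft_margin_objective(w, b, X, y, C=10)
print(low_C, high_C)
```

The same margin violation costs far more when C is large, which is why a high C pushes the fit toward classifying every training point correctly at the expense of a narrower margin.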

# Applying a linear SVC

We start by creating a dictionary of classifiers, keyed by name. We will add to this dictionary progressively as we test.

In [6]:
# Start with a Linear SVC:
C = 10
# Create different classifiers
classifiers = {
    'Linear SVC': SVC(kernel='linear',C=C,probability=True,random_state=0)
}

# Train the model on the training set, then report accuracy and per-class metrics on the test set:
for name, classifier in classifiers.items():
    classifier.fit(X_train, np.ravel(y_train))

    # Predictions are made on the held-out test set, so this is test accuracy
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print('Accuracy (test) for %s:%0.1f%%' % (name, accuracy * 100))
    print(classification_report(y_test, y_pred))

Accuracy (test) for Linear SVC:77.0%
              precision    recall  f1-score   support

     chinese       0.67      0.72      0.70       242
      indian       0.85      0.88      0.86       227
    japanese       0.77      0.72      0.74       250
      korean       0.84      0.72      0.78       257
        thai       0.74      0.81      0.77       223

    accuracy                           0.77      1199
   macro avg       0.77      0.77      0.77      1199
weighted avg       0.77      0.77      0.77      1199
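
The columns of a classification report can be reproduced by hand. The sketch below uses made-up labels rather than the cuisine data and computes precision, recall, F1, and support for one class; `per_class_scores` is a hypothetical helper, not part of scikit-learn.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly predicted as label
    fp = sum(1 for t, p in pairs if t != label and p == label)  # wrongly predicted as label
    fn = sum(1 for t, p in pairs if t == label and p != label)  # label rows that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true rows with this label
    return precision, recall, f1, support


# Invented labels for illustration only.
y_true = ["thai", "thai", "thai", "korean", "korean", "indian"]
y_pred = ["thai", "thai", "korean", "korean", "thai", "indian"]

print(per_class_scores(y_true, y_pred, "thai"))
```

Support counts the true test rows of each class, which is why the per-class supports in a report add up to the size of the test set.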



# K-Neighbors classifier

K-Neighbors is part of the "neighbors" family of ML methods, which can be used for both supervised and unsupervised learning. To classify a new point, the method finds a predefined number of training samples closest to it and predicts the label that is most common among them.
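
A minimal sketch of the idea, assuming binary ingredient vectors like those in the cuisine data and a distance that simply counts differing ingredients. For 0/1 features this ranks neighbors the same way as Euclidean distance, which is what `KNeighborsClassifier` uses by default. The toy recipes, the column names, and the helper `knn_predict` are invented for illustration.

```python
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority label among the k training rows closest to query."""
    distances = []
    for features, label in zip(train_X, train_y):
        # Count the positions where the two binary vectors differ.
        dist = sum(a != b for a, b in zip(features, query))
        distances.append((dist, label))
    distances.sort(key=lambda pair: pair[0])  # nearest first
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical columns: [soy_sauce, cumin, fish_sauce, yogurt]
train_X = [
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
]
train_y = ["chinese", "thai", "thai", "indian", "indian"]

print(knn_predict(train_X, train_y, [0, 1, 0, 1], k=3))
```

Nothing is learned ahead of time here: all the work happens at prediction time, when the query is compared against every stored training row.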

Exercise - apply the K-Neighbors classifier

The previous classifier was good, and worked well with the data, but maybe we can get better accuracy. Let's try a K-Neighbors classifier.

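Add a line to your classifier dictionary (add a comma after the Linear SVC item). The first positional argument of KNeighborsClassifier is the number of neighbors, so reusing C here means 10 neighbors: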
```py
'KNN classifier': KNeighborsClassifier(C),
```

In [7]:
# Start with a Linear SVC:
C = 10
# Create different classifiers
classifiers = {
    'Linear SVC': SVC(kernel='linear',C=C,probability=True,random_state=0),
    'KNN classifier': KNeighborsClassifier(C)
}

# Train each classifier on the training set, then report accuracy and per-class metrics on the test set:
for name, classifier in classifiers.items():
    classifier.fit(X_train, np.ravel(y_train))

    # Predictions are made on the held-out test set, so this is test accuracy
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print('Accuracy (test) for %s:%0.1f%%' % (name, accuracy * 100))
    print(classification_report(y_test, y_pred))

Accuracy (test) for Linear SVC:77.0%
              precision    recall  f1-score   support

     chinese       0.67      0.72      0.70       242
      indian       0.85      0.88      0.86       227
    japanese       0.77      0.72      0.74       250
      korean       0.84      0.72      0.78       257
        thai       0.74      0.81      0.77       223

    accuracy                           0.77      1199
   macro avg       0.77      0.77      0.77      1199
weighted avg       0.77      0.77      0.77      1199

Accuracy (test) for KNN classifier:71.6%
              precision    recall  f1-score   support

     chinese       0.64      0.69      0.66       242
      indian       0.80      0.79      0.80       227
    japanese       0.63      0.80      0.71       250
      korean       0.92      0.52      0.67       257
        thai       0.70      0.80      0.75       223

    accuracy                           0.72      1199
   macro avg       0.74      0.72      0.72      11

# Support Vector Classifier

Support Vector Classifiers are part of the Support Vector Machine family of ML methods, which are used for classification and regression tasks. SVMs "map training examples to points in space" so as to maximize the margin between categories. New data is then mapped into the same space and assigned a category according to which side of the boundary it falls on.
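
When called with no arguments, `SVC()` uses an RBF kernel rather than a linear one. As a rough sketch of what that kernel measures, the function below computes `exp(-gamma * ||x - z||^2)` for two vectors. The `gamma` value and the toy vectors are assumptions for illustration; scikit-learn derives its default `gamma` from the data.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Similarity in (0, 1]: 1.0 for identical vectors, decaying with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([1, 0, 1], [1, 0, 1])  # identical recipes
near = rbf_kernel([1, 0, 1], [1, 1, 1])  # one ingredient differs
far = rbf_kernel([1, 0, 1], [0, 1, 0])   # every ingredient differs
print(same, near, far)
```

Because similarity falls off smoothly with distance, an RBF-kernel SVM can draw curved boundaries that a linear kernel cannot.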

Exercise - apply a Support Vector Classifier

Let's try for a little better accuracy with a Support Vector Classifier.

Add a comma after the K-Neighbors item, and then add this line:

```py
'SVC': SVC(),
```

In [8]:
# Start with a Linear SVC:
C = 10
# Create different classifiers
classifiers = {
    'Linear SVC': SVC(kernel='linear',C=C,probability=True,random_state=0),
    'KNN classifier': KNeighborsClassifier(C),
    'SVC': SVC()
}

# Train each classifier on the training set, then report accuracy and per-class metrics on the test set:
for name, classifier in classifiers.items():
    classifier.fit(X_train, np.ravel(y_train))

    # Predictions are made on the held-out test set, so this is test accuracy
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print('Accuracy (test) for %s:%0.1f%%' % (name, accuracy * 100))
    print(classification_report(y_test, y_pred))

Accuracy (test) for Linear SVC:77.0%
              precision    recall  f1-score   support

     chinese       0.67      0.72      0.70       242
      indian       0.85      0.88      0.86       227
    japanese       0.77      0.72      0.74       250
      korean       0.84      0.72      0.78       257
        thai       0.74      0.81      0.77       223

    accuracy                           0.77      1199
   macro avg       0.77      0.77      0.77      1199
weighted avg       0.77      0.77      0.77      1199

Accuracy (test) for KNN classifier:71.6%
              precision    recall  f1-score   support

     chinese       0.64      0.69      0.66       242
      indian       0.80      0.79      0.80       227
    japanese       0.63      0.80      0.71       250
      korean       0.92      0.52      0.67       257
        thai       0.70      0.80      0.75       223

    accuracy                           0.72      1199
   macro avg       0.74      0.72      0.72      11

# Ensemble Classifiers

Let's follow the path to the very end, even though the previous test was quite good. Let's try some 'Ensemble Classifiers', specifically Random Forest and AdaBoost:

```py
  'RFST': RandomForestClassifier(n_estimators=100),
  'ADA': AdaBoostClassifier(n_estimators=100)
```



In [9]:
C = 10
# Create different classifiers
classifiers = {
    'Linear SVC': SVC(kernel='linear',C=C,probability=True,random_state=0),
    'KNN classifier': KNeighborsClassifier(C),
    'SVC': SVC(),
    'RFST': RandomForestClassifier(n_estimators=100),
    'ADA': AdaBoostClassifier(n_estimators=100)
}

# Train each classifier on the training set, then report accuracy and per-class metrics on the test set:
for name, classifier in classifiers.items():
    classifier.fit(X_train, np.ravel(y_train))

    # Predictions are made on the held-out test set, so this is test accuracy
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print('Accuracy (test) for %s:%0.1f%%' % (name, accuracy * 100))
    print(classification_report(y_test, y_pred))

Accuracy (test) for Linear SVC:77.0%
              precision    recall  f1-score   support

     chinese       0.67      0.72      0.70       242
      indian       0.85      0.88      0.86       227
    japanese       0.77      0.72      0.74       250
      korean       0.84      0.72      0.78       257
        thai       0.74      0.81      0.77       223

    accuracy                           0.77      1199
   macro avg       0.77      0.77      0.77      1199
weighted avg       0.77      0.77      0.77      1199

Accuracy (test) for KNN classifier:71.6%
              precision    recall  f1-score   support

     chinese       0.64      0.69      0.66       242
      indian       0.80      0.79      0.80       227
    japanese       0.63      0.80      0.71       250
      korean       0.92      0.52      0.67       257
        thai       0.70      0.80      0.75       223

    accuracy                           0.72      1199
   macro avg       0.74      0.72      0.72      11

# Ensemble Classifiers explained

This method of Machine Learning "combines the predictions of several base estimators" to improve the model's quality. In our example, we used Random Forest and AdaBoost.

Random Forest, an averaging method, builds a 'forest' of 'decision trees' infused with randomness to avoid overfitting. The n_estimators parameter is set to the number of trees.
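
The following is a deliberately simplified sketch of the forest idea, assuming binary features: each "tree" is a one-level stump trained on a bootstrap sample and a randomly chosen feature, and the forest predicts by majority vote. All names and the toy data are invented for illustration. Real random forests grow much deeper trees, and scikit-learn combines its trees by averaging their predicted probabilities rather than counting votes.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """One-level 'tree': majority label on each side of a binary feature."""
    overall = Counter(labels).most_common(1)[0][0]
    sides = {}
    for value in (0, 1):
        side_labels = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        # Fall back to the overall majority if this side is empty in the sample.
        sides[value] = Counter(side_labels).most_common(1)[0][0] if side_labels else overall
    return feature, sides


def fit_forest(rows, labels, n_estimators=25, seed=0):
    """Train n_estimators stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_estimators):
        # Bootstrap sample: draw len(rows) row indices with replacement.
        picks = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in picks]
        sample_labels = [labels[i] for i in picks]
        # Extra randomness: each stump looks at one randomly chosen feature.
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump(sample_rows, sample_labels, feature))
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = [sides[row[feature]] for feature, sides in forest]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical columns: [fish_sauce, cumin]
rows = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
labels = ["thai", "thai", "thai", "indian", "indian", "indian"]

forest = fit_forest(rows, labels, n_estimators=25, seed=0)
print(predict_forest(forest, [1, 0]))
```

An individual stump trained on an unlucky sample can be wrong, but the vote across many differently sampled stumps is far more stable. That stability is what makes the ensemble less prone to overfitting.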

AdaBoost fits a classifier to a dataset and then fits additional copies of that classifier to the same dataset. After each round, it increases the weights of incorrectly classified items so that the next classifier focuses on correcting them.
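
One round of the reweighting can be sketched as follows, assuming the classic two-class formulation with labels +1 and -1. The cuisine problem has five classes, for which scikit-learn uses a multi-class variant, so treat this as an illustration of the principle only; `adaboost_round` and the toy labels are invented.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """Return (alpha, new_weights) after one boosting round.

    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error: total weight of the misclassified items.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # The classifier's say in the final vote grows as its error shrinks.
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink weights of correct items, grow weights of incorrect ones.
    updated = [
        w * math.exp(-alpha if t == p else alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, 1, 1]  # the fourth item is misclassified
weights = [0.2] * 5        # start with uniform weights

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update, the single misclassified item carries half of the total weight, so the next classifier is strongly pushed to get it right.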