# Build Classification Models

In [21]:
import pandas as pd
cuisines_df = pd.read_csv('../data/cleaned_cuisines.csv')
cuisines_df.head()

Unnamed: 0.1,Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,indian,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [22]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve
from sklearn.svm import SVC


In [23]:
# Split the data into features (X) and labels (y) for training. The cuisine column serves as the labels:
cuisines_label_df = cuisines_df['cuisine']
cuisines_label_df.head()

0    indian
1    indian
2    indian
3    indian
4    indian
Name: cuisine, dtype: object

In [24]:
# Drop the "Unnamed: 0" and cuisine columns with drop(), and keep the rest of the data as trainable features:
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
cuisines_feature_df.head()

Unnamed: 0,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,artichoke,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
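Before training, it can help to confirm that the label column did not leak into the features and that every remaining column is a 0/1 indicator. The sketch below runs those checks on a tiny hypothetical frame (`toy` and its values are invented for illustration; in the notebook the same checks would apply to cuisines_feature_df and cuisines_label_df):

```python
import pandas as pd

# Hypothetical miniature of the cuisines table (values invented for illustration).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "cuisine": ["indian", "thai", "korean"],
    "almond": [0, 1, 0],
    "yogurt": [1, 0, 0],
})

labels = toy["cuisine"]
features = toy.drop(["Unnamed: 0", "cuisine"], axis=1)

# Three quick sanity checks before handing the data to a model.
no_label_leak = "cuisine" not in features.columns
all_binary = bool(features.isin([0, 1]).all().all())
same_length = len(features) == len(labels)

print(no_label_leak, all_binary, same_length)
```

If any of the three flags comes back False, the feature frame is worth inspecting before training.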


# Choosing our classifier

Now that our data is clean and ready for training, we have to decide which algorithm to use for the job.

Scikit-learn groups classification under Supervised Learning, and in that category there are many ways to classify. The variety is quite bewildering at first sight. The following methods all include classification techniques:

- Linear Models
- Support Vector Machines
- Stochastic Gradient Descent
- Nearest Neighbors
- Gaussian Processes
- Decision Trees
- Ensemble methods (Voting Classifier)
- Multiclass and multioutput algorithms (multiclass and multilabel classification, multiclass-multioutput classification)
- Neural networks (outside the scope of this lesson)

## What classifier to go with?

So, which classifier should we choose? Often, the practical test is to run through several and see which gives a good result. Scikit-learn offers a side-by-side comparison on a created dataset, comparing KNeighborsClassifier, SVC two ways, GaussianProcessClassifier, DecisionTreeClassifier, RandomForestClassifier, MLPClassifier, AdaBoostClassifier, GaussianNB and QuadraticDiscriminantAnalysis, with the results visualized.
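The "run through several" idea can be sketched in a few lines. This is a minimal illustration, not the Scikit-learn comparison itself: the data is a synthetic five-class set from make_classification standing in for the cuisines table, and the three candidates and their settings are our own choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 5-class data (an assumption; swap in your own features and labels).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=10, n_classes=5, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_neighbors": KNeighborsClassifier(n_neighbors=5),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {}
for name, clf in candidates.items():
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()

# Print best first.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Cross-validation is used here rather than a single split so that one lucky partition does not decide the ranking.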

## A better approach

A better way than wildly guessing, however, is to follow the guidance in the downloadable ML Cheat Sheet.

## Reasoning

Let's see if we can reason our way through different approaches given the constraints we have:

- Neural networks are too heavy. Given our clean but minimal dataset, and the fact that we are running training locally in notebooks, neural networks are too heavyweight for this task.
- No two-class classifier. Our labels span more than two cuisines, so we do not use a two-class classifier, and that rules out one-vs-all.
- Decision tree or logistic regression could work. A decision tree might work, or logistic regression for multiclass data.
- Multiclass Boosted Decision Trees solve a different problem. The multiclass boosted decision tree is most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so it is not useful for us.

# Using Scikit-learn

We will be using Scikit-learn to analyze our data. However, there are many ways to use logistic regression in Scikit-learn. Take a look at the parameters to pass.

Two parameters matter most when we ask Scikit-learn to perform a logistic regression: multi_class and solver. The multi_class value selects how the multiclass problem is handled (for example, ovr or multinomial), and the solver value selects the optimization algorithm to use. Not all solvers can be paired with all multi_class values.

According to the docs, in the multiclass case, the training algorithm:

- Uses the one-vs-rest (OvR) scheme, if the multi_class option is set to ovr
- Uses the cross-entropy loss, if the multi_class option is set to multinomial. (Currently the multinomial option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)
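To make the difference between the two schemes concrete, here is a minimal NumPy sketch of how a row of per-class decision scores could be turned into probabilities under each one. The scores are invented and the function names are ours, not part of Scikit-learn. In real training the two schemes also fit different coefficients, so the scores themselves would differ.

```python
import numpy as np

def ovr_probabilities(scores):
    """One-vs-rest: squash each class score with a sigmoid, then renormalize."""
    per_class = 1.0 / (1.0 + np.exp(-scores))
    return per_class / per_class.sum(axis=1, keepdims=True)

def multinomial_probabilities(scores):
    """Multinomial: one softmax across all class scores."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical decision scores for one sample over five classes.
scores = np.array([[2.0, 1.5, -1.0, -3.0, -4.0]])

p_ovr = ovr_probabilities(scores)
p_multi = multinomial_probabilities(scores)

print("ovr:        ", np.round(p_ovr, 3))
print("multinomial:", np.round(p_multi, 3))
```

Both rows sum to 1 and rank the classes identically for the same scores; what changes is how the probability mass is shared among them.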


# Split the data

We can focus on logistic regression for our first training trial, since you learned about it in a previous lesson. Split your data into training and testing groups by calling train_test_split():

In [25]:
X_train,X_test,y_train,y_test = train_test_split(cuisines_feature_df,cuisines_label_df,test_size=0.3)
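The call above shuffles randomly, so the split (and every score that follows) changes from run to run. As a hedged variation, the sketch below fixes random_state and passes stratify so that each class keeps its share in both groups. The toy frame, the seed value 42, and the variable names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 40% "chinese" and 60% "indian".
features = pd.DataFrame({"soy_sauce": [1, 0] * 50, "yogurt": [0, 1] * 50})
labels = pd.Series(["chinese"] * 40 + ["indian"] * 60, name="cuisine")

X_tr, X_te, y_tr, y_te = train_test_split(
    features,
    labels,
    test_size=0.3,     # same 70/30 split as in the notebook
    random_state=42,   # fixed seed so the split is reproducible
    stratify=labels,   # keep class proportions equal in both groups
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

With stratification, the 40/60 class ratio of the toy labels carries over to both the training and the testing group.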

# Apply logistic regression

Since we are using the multiclass case, we need to choose what scheme to use and what solver to set. Let's use LogisticRegression with a multiclass setting and the liblinear solver to train.


In [26]:
# Create a logistic regression with multi_class set to ovr and the solver set to liblinear:
lr = LogisticRegression(multi_class='ovr',solver='liblinear')
model = lr.fit(X_train,np.ravel(y_train))
accuracy = model.score(X_test,y_test)
print(f'Accuracy is {accuracy:3.3}')

Accuracy is 0.804


In [27]:
# Repeat the training with the lbfgs solver for comparison:
lr = LogisticRegression(multi_class='ovr',solver='lbfgs')
model2 = lr.fit(X_train,np.ravel(y_train))
accuracy_ = model2.score(X_test,y_test)
print(f'Accuracy is {accuracy_:3.3}')

Accuracy is 0.804
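The two cells above can be folded into one loop over solvers. Recent Scikit-learn releases deprecate the multi_class argument, so this sketch expresses the one-vs-rest scheme with OneVsRestClassifier instead. That substitution, the synthetic data, and max_iter=1000 are our assumptions, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 5-class data standing in for the cuisines features (an assumption).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

solver_scores = {}
for solver in ("liblinear", "lbfgs"):
    # One binary logistic regression per class, combined one-vs-rest.
    clf = OneVsRestClassifier(LogisticRegression(solver=solver, max_iter=1000))
    solver_scores[solver] = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{solver}: {solver_scores[solver]:.3f}")
```

Both solvers optimize essentially the same objective here, so close scores are expected, much like the matching accuracies in the notebook output.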


In [29]:
# We can see this model in action by testing one row of data (#55):

print(f'ingredients: {X_test.iloc[55][X_test.iloc[55]!=0].keys()}')
print(f'cuisine: {y_test.iloc[55]}')

ingredients: Index(['chicken', 'scallion', 'soy_sauce', 'wine'], dtype='object')
cuisine: chinese


In [31]:
# Digging deeper, we can check how confident the model is in this prediction by looking at the class probabilities:
test = X_test.iloc[[55]]  # double brackets keep a one-row DataFrame with its feature names
proba = model.predict_proba(test)
classes = model.classes_
resultdf = pd.DataFrame(data=proba, columns=classes)

topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])
topPrediction.head()



Unnamed: 0,0
chinese,0.481102
japanese,0.44851
thai,0.059953
korean,0.009457
indian,0.000979
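The ranking step can be wrapped in a small helper. The sketch below is one possible shape for it: top_k_classes is a hypothetical name of ours, and the probabilities are copied from the output above rather than recomputed.

```python
import pandas as pd

def top_k_classes(probabilities, classes, k=3):
    """Return the k most probable classes for one sample, highest first."""
    ranked = pd.Series(probabilities, index=classes).sort_values(ascending=False)
    return ranked.head(k)

# Probabilities taken from the notebook output above, in model.classes_ order.
classes = ["chinese", "indian", "japanese", "korean", "thai"]
probabilities = [0.481102, 0.000979, 0.44851, 0.009457, 0.059953]

top3 = top_k_classes(probabilities, classes, k=3)
margin = top3.iloc[0] - top3.iloc[1]  # gap between the two best guesses

print(top3)
print(f"margin between top two: {margin:.3f}")
```

The small margin between chinese and japanese shows that this particular prediction was a close call, which a bare predict() result would hide.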


In [32]:
# Print a classification report, as we did in the regression lessons:
y_pred = model.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

     chinese       0.77      0.65      0.71       250
      indian       0.92      0.88      0.90       228
    japanese       0.71      0.77      0.74       241
      korean       0.83      0.84      0.83       234
        thai       0.80      0.89      0.84       246

    accuracy                           0.80      1199
   macro avg       0.81      0.81      0.80      1199
weighted avg       0.81      0.80      0.80      1199
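The summary rows of the report follow from the per-class rows. As a rough check, the sketch below recomputes them from the rounded figures printed above, so the results agree with the report only approximately. The dictionary layout and variable names are ours.

```python
# Per-class figures copied from the classification report above (rounded values).
report = {
    "chinese":  {"precision": 0.77, "recall": 0.65, "support": 250},
    "indian":   {"precision": 0.92, "recall": 0.88, "support": 228},
    "japanese": {"precision": 0.71, "recall": 0.77, "support": 241},
    "korean":   {"precision": 0.83, "recall": 0.84, "support": 234},
    "thai":     {"precision": 0.80, "recall": 0.89, "support": 246},
}

total = sum(row["support"] for row in report.values())

# Accuracy is the support-weighted mean of the per-class recalls.
accuracy = sum(row["recall"] * row["support"] for row in report.values()) / total

# The macro average weights every class equally, regardless of support.
macro_precision = sum(row["precision"] for row in report.values()) / len(report)

# F1 is the harmonic mean of precision and recall.
p, r = report["chinese"]["precision"], report["chinese"]["recall"]
f1_chinese = 2 * p * r / (p + r)

print(f"support total:   {total}")
print(f"accuracy:        {accuracy:.3f}")
print(f"macro precision: {macro_precision:.3f}")
print(f"f1 (chinese):    {f1_chinese:.3f}")
```

Reading the report this way makes it clear that the low recall for chinese (0.65) is what pulls overall accuracy down the most.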

