# Training and testing with different classifiers

This notebook applies several machine learning models to the pre-processed recipe-ingredient dataset. For details of the pre-processing, see the other notebook.

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from nltk import WordNetLemmatizer
import inflect

In [6]:
# Read the dataset
df = pd.read_json('dataset.json')

Load the dataset and pre-process it, based on the findings of the earlier EDA:
1. Converting text into lowercase
2. Removing leading and trailing whitespace
3. Removing punctuation, numbers and special characters
4. Replacing plural words with singular versions
5. Lemmatizing the words

In [7]:
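# Requires the WordNet corpus: nltk.download('wordnet')
wn = WordNetLemmatizer() # lemmatizing instead of stemming to preserve context
p = inflect.engine() # to change to singular instead of stemming

def format_ingredients(ingredient_list):
    # Steps 1-2: trim whitespace and lowercase each ingredient
    formatted = [ing.strip().lower() for ing in ingredient_list]
    # Step 3: keep letters only; this also removes inner spaces, so 'olive oil' becomes 'oliveoil'
    alpha = [(''.join(char for char in ing if char.isalpha())) for ing in formatted]
    # Step 4: singular_noun() returns False when the word is already singular, so fall back to the original
    singular = [p.singular_noun(ing) or ing for ing in alpha]
    # Step 5: lemmatize (treated as a noun by default)
    lemmatized = [wn.lemmatize(ing) for ing in singular]
    # Join into one comma-separated string per recipe for the vectorizer
    return ', '.join(lemmatized)

df['ingredients_formatted'] = df['ingredients'].apply(format_ingredients)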
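
As a rough illustration of steps 1 to 3, the sketch below applies only the standard-library part of the cleaning to a made-up ingredient list. The helper name `clean_ingredient` and the sample values are hypothetical, and the singularization and lemmatization steps are left out so the snippet runs without `inflect` or NLTK data. Note that dropping every non-alphabetic character also drops the spaces inside multi-word ingredients, so each ingredient collapses into a single token.

```python
def clean_ingredient(ingredient):
    """Lowercase, trim, and keep alphabetic characters only (steps 1-3)."""
    trimmed = ingredient.strip().lower()
    return ''.join(char for char in trimmed if char.isalpha())

# Made-up sample values, not taken from the dataset
sample = ['  Extra-Virgin Olive Oil ', '2 Large Eggs', 'Sea salt']
cleaned = [clean_ingredient(ing) for ing in sample]
joined = ', '.join(cleaned)
print(joined)  # extravirginoliveoil, largeeggs, seasalt
```

Because of this, the vectorizer later sees `oliveoil` as one feature rather than `olive` and `oil` as two separate ones.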

### Vectorizing

The ingredient list is already tokenized (as an array of ingredients), but it still needs to be vectorized, i.e. encoded as numeric feature vectors that the machine learning algorithms can train and test on.

Apply TF-IDF vectorization to the formatted ingredients to transform the dataset into a sparse feature matrix.

In [4]:
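from sklearn.feature_extraction.text import TfidfVectorizer

# max_df=0.8 ignores terms that appear in more than 80% of the recipes
# token_pattern=r'\w+' also keeps single-character tokens
tfidf = TfidfVectorizer(stop_words='english', analyzer='word', max_df=0.8, token_pattern=r'\w+')
x_tfidf = tfidf.fit_transform(df['ingredients_formatted'])
print(x_tfidf.shape)  # (number of recipes, number of distinct terms)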

(39774, 6636)
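
To make the weighting concrete, the sketch below computes TF-IDF by hand for three made-up recipes using only the standard library. It assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and ignores the `stop_words` and `max_df` options used above. `tfidf_vectors` and `toy_docs` are hypothetical names, and the recipes are not taken from the dataset.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split(', ') for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many recipes each term appears
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for doc in tokenized:
        term_freq = Counter(doc)
        raw = [term_freq[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Made-up recipes in the same comma-separated format as 'ingredients_formatted'
toy_docs = [
    'salt, garlic, oliveoil',
    'salt, soysauce, ginger',
    'salt, butter, flour',
]
vocab, vectors = tfidf_vectors(toy_docs)
for tok, weight in zip(vocab, vectors[0]):
    print(f'{tok}: {weight:.3f}')
```

In the first recipe, `salt` receives a lower weight than `garlic` because it appears in every recipe, which is the effect TF-IDF is meant to produce: common ingredients say less about the cuisine than rare ones.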


### Evaluate
Assess the performance of the models. The parameters used here were selected through experimentation.
All scripts for the random variations, parameter tuning and grid search that led to these particular values can be found in the [`/scripts`](https://github.com/nutellaweera/ML_Assignment/tree/main/scripts) folder.

The code below is adapted from the 'Batter of Algorithms' code included in the Learning Material (revision) section (Phoebe Pring + HI).

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split

# Hold out a test set (train_test_split defaults to a 75/25 split)
x_train, x_test, y_train, y_test = train_test_split(x_tfidf, df['cuisine'], random_state=42, shuffle=True)

# Hyperparameters come from the experiments in the /scripts folder
models = [
    ('K Nearest Neighbor', KNeighborsClassifier(n_neighbors=17, metric='euclidean')),
    # Note: the multi_class argument is deprecated as of scikit-learn 1.5
    ('Logistic Regression', LogisticRegression(max_iter=50, multi_class='multinomial')),
    ('Naive Bayes', MultinomialNB(alpha=0.01)),
    ('Random Forest', RandomForestClassifier(bootstrap=True, criterion='gini', min_samples_split=2, n_estimators=200)),
]

# 12-fold cross-validation on the training split only; the fixed seed gives every model the same folds
kfold = KFold(n_splits=12, shuffle=True, random_state=42)

results = []
for name, model in models:
    # cross_val_score returns one accuracy value per fold
    results.append((name, cross_val_score(model, x_train, y_train, cv=kfold, scoring='accuracy')))

print(results)
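
Printing `results` directly shows one array of 12 fold accuracies per model, which is hard to compare at a glance. A minimal sketch of one way to condense it is shown below, using only the standard library. The helper name `summarize` is hypothetical, and the scores are placeholder numbers chosen for illustration, not results from this notebook.

```python
from statistics import mean, stdev

def summarize(results):
    """Turn (name, fold_scores) pairs into (name, mean, std) rows, best first."""
    rows = []
    for name, scores in results:
        values = [float(score) for score in scores]  # also accepts NumPy arrays
        # stdev is the sample standard deviation (n - 1 in the denominator)
        rows.append((name, mean(values), stdev(values)))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Placeholder scores for illustration only; NOT actual model results
placeholder_results = [
    ('model_a', [0.70, 0.72, 0.71]),
    ('model_b', [0.78, 0.80, 0.79]),
]
summary = summarize(placeholder_results)
for name, avg, spread in summary:
    print(f'{name}: {avg:.3f} +/- {spread:.3f}')
```

Passing the notebook's own `results` list to `summarize` should work the same way, since each entry is a `(name, scores)` pair.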


