###Analysis
* What is the problem?
  - Predicting whether a dish is vegetarian or not (binary classification).
* What labels do you have?
  - diet

* Dimensions of the data set?
  - There are 9 columns and 255 rows.
* Types of data/features?
  - name : name of the dish
  - ingredients : main ingredients used
  - diet : type of diet - either vegetarian or non vegetarian
  - prep_time : preparation time
  - cook_time : cooking time
  - flavor_profile : flavor profile includes whether the dish is spicy, sweet, bitter, etc
  - course : course of meal - starter, main course, dessert, etc
  - state : state where the dish is famous or originated
  - region : region where the state belongs




###Prediction:
* Which classifier will perform best, and why?
  - SVM, because it often performs well on small datasets.
* Which hyperparameters are relevant, and why?
  - For SVM, C matters: it sets how heavily misclassified training points, including outliers, are penalized.

###Methods
* Which preprocessing steps are needed?
  - The ingredients must be one-hot encoded.
  - The diet values must be converted to numeric values for the models.
  - The other categorical values must be converted to numeric values as well.
* Which classifiers will you compare? (At least 2 per dataset)
  - The SVM classifier and naive Bayes.
* Which performance metric is appropriate?
  - Precision and recall.

###Results

*Don't just report the best result; the comparison/improvement matters just as much!*
* At least 1 visualization/plot per dataset.
* Evaluation/conclusion
* Which classifier gave the best result?
* Does this match your prediction? Why or why not?

In [6]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import math
from sklearn import metrics, model_selection, svm, naive_bayes

%matplotlib inline

In [7]:
df = pd.read_csv("/content/drive/MyDrive/minor/ML/indian_food.csv", sep =",")

In [8]:
# from google.colab import drive
# drive.mount('/content/drive')

In [9]:
df.head()

Unnamed: 0,name,ingredients,diet,prep_time,cook_time,flavor_profile,course,state,region
0,Balu shahi,"Maida flour, yogurt, oil, sugar",vegetarian,45,25,sweet,dessert,West Bengal,East
1,Boondi,"Gram flour, ghee, sugar",vegetarian,80,30,sweet,dessert,Rajasthan,West
2,Gajar ka halwa,"Carrots, milk, sugar, ghee, cashews, raisins",vegetarian,15,60,sweet,dessert,Punjab,North
3,Ghevar,"Flour, ghee, kewra, milk, clarified butter, su...",vegetarian,15,30,sweet,dessert,Rajasthan,West
4,Gulab jamun,"Milk powder, plain flour, baking powder, ghee,...",vegetarian,15,40,sweet,dessert,West Bengal,East


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255 entries, 0 to 254
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   name            255 non-null    object
 1   ingredients     255 non-null    object
 2   diet            255 non-null    object
 3   prep_time       255 non-null    int64 
 4   cook_time       255 non-null    int64 
 5   flavor_profile  255 non-null    object
 6   course          255 non-null    object
 7   state           255 non-null    object
 8   region          254 non-null    object
dtypes: int64(2), object(7)
memory usage: 18.1+ KB


In [11]:
# Check which columns are categorical (non-numeric)
df.select_dtypes(exclude = [np.number]).columns

Index(['name', 'ingredients', 'diet', 'flavor_profile', 'course', 'state',
       'region'],
      dtype='object')

Looking at the info above, much of the data is categorical. Some columns are not needed, so we simply drop them. Others, such as diet, have to be converted.

In [12]:
# Drop the columns that are not used as features
df = df.drop(columns=["state", "region", "name"])

# Encode the label as integers in order of first appearance: 0 = vegetarian, 1 = non vegetarian
df["diet"] = pd.factorize(df.diet)[0]
df.head()

Unnamed: 0,ingredients,diet,prep_time,cook_time,flavor_profile,course
0,"Maida flour, yogurt, oil, sugar",0,45,25,sweet,dessert
1,"Gram flour, ghee, sugar",0,80,30,sweet,dessert
2,"Carrots, milk, sugar, ghee, cashews, raisins",0,15,60,sweet,dessert
3,"Flour, ghee, kewra, milk, clarified butter, su...",0,15,30,sweet,dessert
4,"Milk powder, plain flour, baking powder, ghee,...",0,15,40,sweet,dessert


In [13]:
print(df.isna().sum())

ingredients       0
diet              0
prep_time         0
cook_time         0
flavor_profile    0
course            0
dtype: int64


Fortunately there are no NaN values left, so we can continue. Next we one-hot encode the ingredients: we need them to predict whether a dish is vegetarian, but a model cannot use them while they are a comma-separated text column.

In [14]:
# Collect every distinct ingredient, lowercased and without surrounding whitespace
ingredient_list = set()
for ingredients in df['ingredients']:
    for food in ingredients.split(','):
        # A set ignores duplicates, so no membership check is needed
        ingredient_list.add(food.strip().lower())

# ingredient_list

In [15]:
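def one_hot_encode_ingredients(ingredient_list, df):
    """Add one 0/1 column per ingredient and drop the raw 'ingredients' text column."""
    df = df.copy()  # leave the caller's DataFrame untouched
    for i, ingredients in enumerate(df['ingredients']):
        for ing in ingredients.split(','):
            ing = ing.strip().lower()
            if ing in ingredient_list:
                # Relies on the default RangeIndex, so position i equals the row label
                df.loc[i, ing] = 1
    # Cells that were never set to 1 are NaN here; they mean "ingredient absent"
    df = df.fillna(0)
    return df.drop(columns=['ingredients'])

df = one_hot_encode_ingredients(ingredient_list, df)
df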

Unnamed: 0,diet,prep_time,cook_time,flavor_profile,course,maida flour,yogurt,oil,sugar,gram flour,ghee,carrots,milk,cashews,raisins,flour,kewra,clarified butter,almonds,pistachio,saffron,green cardamom,milk powder,plain flour,baking powder,water,rose water,sugar syrup,lentil flour,maida,corn flour,baking soda,vinegar,curd,turmeric,cardamom,cottage cheese,rice,dried fruits,nuts,...,orange rind,raw papaya,panch phoran masala,eggs,beetroot,brinjal,forbidden black rice,slivered almonds,garlic powder,biryani masala,mixed vegetables,yellow moong daal,whole red,brown rice,soy sauce,coconut milk,lobster,fresh green chilli,lamb,prawns,mustard seed,fish fillet,mint,fermented bamboo shoot,banana flower,mutton,fish roe,pumpkin flowers,dry chilli,tea leaves,soaked rice,cardamom pods,red pepper,watercress,glutinous rice,egg yolks,dry dates,dried rose petals,arrowroot powder,ginger powder
0,0,45,25,sweet,dessert,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,80,30,sweet,dessert,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,15,60,sweet,dessert,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,15,30,sweet,dessert,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,15,40,sweet,dessert,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
250,0,5,30,sweet,dessert,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
251,0,20,60,sweet,dessert,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
252,0,-1,-1,sweet,dessert,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
253,0,20,45,sweet,dessert,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
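
The row-by-row loop above can also be expressed with pandas string methods. The following is a minimal sketch, not the notebook's own method: it assumes the same comma-separated format and uses a small hypothetical `toy` frame built from the first three dishes shown earlier. Note that `str.get_dummies` orders the columns alphabetically rather than by first appearance.

```python
import pandas as pd

# Hypothetical toy frame with the first three ingredient strings from df.head()
toy = pd.DataFrame({
    "ingredients": [
        "Maida flour, yogurt, oil, sugar",
        "Gram flour, ghee, sugar",
        "Carrots, milk, sugar, ghee, cashews, raisins",
    ]
})

# Normalise: lowercase, remove whitespace around commas and at both ends
cleaned = (
    toy["ingredients"]
    .str.lower()
    .str.replace(r"\s*,\s*", ",", regex=True)
    .str.strip()
)

# One 0/1 column per distinct ingredient
onehot = cleaned.str.get_dummies(sep=",")
print(onehot.shape)
print(onehot["ghee"].tolist())
```

On the full dataset the result could be joined back with `pd.concat([df.drop(columns=['ingredients']), onehot], axis=1)`.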


Next, course and flavor_profile still have to be converted to numbers. Let's first look at which categories exist and how many dishes fall into each.

In [16]:
print(f'{df.course.value_counts()}\n')
print(f'{df.flavor_profile.value_counts()}\n')

main course    129
dessert         85
snack           39
starter          2
Name: course, dtype: int64

spicy     133
sweet      88
-1         29
bitter      4
sour        1
Name: flavor_profile, dtype: int64



In flavor_profile, prep_time and cook_time we see values of -1, which cannot be real values; presumably they mark missing data. To keep them from causing problems, we replace them with the median in the two time columns. In flavor_profile we change them to 'other'.

In [17]:
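# Assign the result back instead of calling replace(..., inplace=True) on a column selection
df['flavor_profile'] = df['flavor_profile'].replace('-1', 'other')
# Note: the median is computed while the -1 placeholders are still in the column
df['prep_time'] = df['prep_time'].replace(-1, df['prep_time'].median())
df['cook_time'] = df['cook_time'].replace(-1, df['cook_time'].median())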
df[245:250]

Unnamed: 0,diet,prep_time,cook_time,flavor_profile,course,maida flour,yogurt,oil,sugar,gram flour,ghee,carrots,milk,cashews,raisins,flour,kewra,clarified butter,almonds,pistachio,saffron,green cardamom,milk powder,plain flour,baking powder,water,rose water,sugar syrup,lentil flour,maida,corn flour,baking soda,vinegar,curd,turmeric,cardamom,cottage cheese,rice,dried fruits,nuts,...,orange rind,raw papaya,panch phoran masala,eggs,beetroot,brinjal,forbidden black rice,slivered almonds,garlic powder,biryani masala,mixed vegetables,yellow moong daal,whole red,brown rice,soy sauce,coconut milk,lobster,fresh green chilli,lamb,prawns,mustard seed,fish fillet,mint,fermented bamboo shoot,banana flower,mutton,fish roe,pumpkin flowers,dry chilli,tea leaves,soaked rice,cardamom pods,red pepper,watercress,glutinous rice,egg yolks,dry dates,dried rose petals,arrowroot powder,ginger powder
245,0,10,20,other,main course,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
246,0,10,30,sweet,dessert,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
247,1,15,50,spicy,main course,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
248,0,10,30,other,main course,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
249,0,10,20,spicy,main course,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
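
The cell above takes the median while the -1 placeholders are still in the column, which pulls the median down. A possible refinement, shown here as a sketch on a hypothetical `toy` column rather than on the real data, is to mask the placeholders first and take the median of the valid values only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: -1 marks a missing preparation time
toy = pd.DataFrame({"prep_time": [10, -1, 20, -1, 30]})

# Median with the placeholders still included
median_with_placeholders = toy["prep_time"].median()

# Mask the placeholders as NaN; median() skips NaN by default
valid = toy["prep_time"].replace(-1, np.nan)
median_valid_only = valid.median()

# Fill the gaps with the median of the valid values
toy["prep_time"] = valid.fillna(median_valid_only)

print(median_with_placeholders, median_valid_only)
print(toy["prep_time"].tolist())
```

Whether this changes the notebook's results depends on how many -1 values the real columns contain, which is not shown here.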


In [18]:
# Map each flavor profile to an integer code
df.flavor_profile = df.flavor_profile.map(
    {'spicy': 0, 'sweet': 1, 'bitter': 2, 'sour': 3, 'other': 4}
)

In [19]:
# Map each course to an integer code
df.course = df.course.map(
    {'main course': 0, 'dessert': 1, 'snack': 2, 'starter': 3}
)
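
Integer codes give course and flavor_profile an order (for example spicy < sweet < bitter) that the categories do not really have, and a linear model treats that order as magnitude. One alternative, sketched below on a hypothetical `toy` frame and not used in the rest of this notebook, is to one-hot encode these two columns with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical toy frame with a few of the category values seen above
toy = pd.DataFrame({
    "course": ["main course", "dessert", "snack", "dessert"],
    "flavor_profile": ["spicy", "sweet", "other", "sweet"],
})

# One 0/1 column per category value, prefixed with the original column name
encoded = pd.get_dummies(toy, columns=["course", "flavor_profile"], dtype=int)

print(encoded.columns.tolist())
print(encoded["course_dessert"].tolist())
```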

In [20]:
df[df.diet == 1]

Unnamed: 0,diet,prep_time,cook_time,flavor_profile,course,maida flour,yogurt,oil,sugar,gram flour,ghee,carrots,milk,cashews,raisins,flour,kewra,clarified butter,almonds,pistachio,saffron,green cardamom,milk powder,plain flour,baking powder,water,rose water,sugar syrup,lentil flour,maida,corn flour,baking soda,vinegar,curd,turmeric,cardamom,cottage cheese,rice,dried fruits,nuts,...,orange rind,raw papaya,panch phoran masala,eggs,beetroot,brinjal,forbidden black rice,slivered almonds,garlic powder,biryani masala,mixed vegetables,yellow moong daal,whole red,brown rice,soy sauce,coconut milk,lobster,fresh green chilli,lamb,prawns,mustard seed,fish fillet,mint,fermented bamboo shoot,banana flower,mutton,fish roe,pumpkin flowers,dry chilli,tea leaves,soaked rice,cardamom pods,red pepper,watercress,glutinous rice,egg yolks,dry dates,dried rose petals,arrowroot powder,ginger powder
64,1,10,40,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
65,1,10,30,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
67,1,5,15,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75,1,30,120,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
76,1,10,35,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
79,1,10,35,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
80,1,10,50,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
81,1,120,45,0,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
122,1,240,30,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
123,1,240,30,0,3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
X = df.drop('diet', axis=1)
print(X.shape)

y = df.diet
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 6) 

(255, 369)
(255,)
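
The classes are imbalanced: the classification report further down shows only 6 non-vegetarian dishes among the 51 test rows. With so few positives, a plain random split can shift the class ratio between train and test. The sketch below uses hypothetical toy arrays (`X_toy`, `y_toy`) to show how the `stratify` argument of `train_test_split` keeps the ratio the same in both parts; it is a suggestion, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 90/10 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=6, stratify=y_toy
)

print(len(y_te), int(y_te.sum()))
print(len(y_tr), int(y_tr.sum()))
```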


The SVM model

In [25]:
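from sklearn.svm import LinearSVC

# Named svm_clf so the sklearn.svm module imported earlier is not shadowed
svm_clf = LinearSVC(C=25)
svm_clf.fit(X_train, y_train)
svm_pred = svm_clf.predict(X_test)
svm_acc = accuracy_score(y_test, svm_pred)  # signature is (y_true, y_pred)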

The naive Bayes model

In [23]:
nb = naive_bayes.MultinomialNB()
nb.fit(X_train, y_train)
nb_pred = nb.predict(X_test)
nb_acc = accuracy_score(y_test, nb_pred)  # signature is (y_true, y_pred)

In [26]:
print(classification_report(y_test, svm_pred))
print(classification_report(y_test, nb_pred))

print(f'SVM accuracy: {svm_acc}')
print(f'Naive bayes accuracy: {nb_acc}')

              precision    recall  f1-score   support

           0       0.96      0.98      0.97        45
           1       0.80      0.67      0.73         6

    accuracy                           0.94        51
   macro avg       0.88      0.82      0.85        51
weighted avg       0.94      0.94      0.94        51

              precision    recall  f1-score   support

           0       0.92      0.98      0.95        45
           1       0.67      0.33      0.44         6

    accuracy                           0.90        51
   macro avg       0.79      0.66      0.70        51
weighted avg       0.89      0.90      0.89        51

SVM accuracy: 0.9411764705882353
Naive bayes accuracy: 0.9019607843137255
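
To see how the numbers in the two reports relate, the metrics can be recomputed by hand. The confusion counts below are not notebook output: they are inferred from the rounded precision, recall and support values above, so treat them as an assumption. The sketch also shows the accuracy of always predicting class 0, which puts the two accuracy scores in perspective.

```python
def binary_metrics(tn, fp, fn, tp):
    """Return precision, recall (both for class 1) and overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, accuracy

# Counts inferred from the rounded reports (assumption, see text)
counts = {
    "SVM": dict(tn=44, fp=1, fn=2, tp=4),
    "Naive Bayes": dict(tn=44, fp=1, fn=4, tp=2),
}

for name, c in counts.items():
    p, r, a = binary_metrics(**c)
    print(f"{name}: precision={p:.2f} recall={r:.2f} accuracy={a:.4f}")

# Always predicting class 0 is correct for 45 of the 51 test rows
baseline = 45 / 51
print(f"majority-class baseline accuracy={baseline:.4f}")
```

Under these counts the two models differ by only two test dishes, which is why precision and recall on class 1 say more here than accuracy does.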


#Conclusion

The best result was achieved by the Support Vector Machine: an accuracy of 0.94 against 0.90 for naive Bayes, and a precision/recall of 0.80/0.67 against 0.67/0.33 on the non-vegetarian class. This is probably due to the small amount of data. We also did little parameter optimization for naive Bayes, so it may do better once that is tuned. Note that the test set contains only 6 non-vegetarian dishes, so these differences rest on a handful of samples.

This matches my prediction, because the SVM achieved the higher accuracy score.
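
A possible next step is to tune both models in the same way before comparing them. The sketch below is only an illustration under stated assumptions: it runs on random synthetic binary features (`X_demo`, `y_demo`) instead of the food data, the parameter grids are arbitrary example values, and F1 is chosen as the score because of the class imbalance.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Synthetic stand-in data: 120 rows of 0/1 features, positive when features 0 and 1 are both set
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(120, 20))
y_demo = X_demo[:, 0] & X_demo[:, 1]

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Example grids (assumed values): C for the SVM, smoothing alpha for naive Bayes
searches = {
    "LinearSVC": GridSearchCV(
        LinearSVC(max_iter=10000), {"C": [0.01, 0.1, 1, 10]}, scoring="f1", cv=cv
    ),
    "MultinomialNB": GridSearchCV(
        MultinomialNB(), {"alpha": [0.01, 0.1, 1.0]}, scoring="f1", cv=cv
    ),
}

for name, search in searches.items():
    search.fit(X_demo, y_demo)
    print(name, search.best_params_, round(search.best_score_, 3))
```

On the real data, `X_train` and `y_train` would take the place of the synthetic arrays, with the held-out test set used only for the final comparison.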