<div align=right><img src=UNH-Logo2.png align=right></div> <br>




# Food Ingredients Classifier Notebook
 ---
*The Final Project for COMP 840 Machine Learning* <br>
*illya K & Nick B Dec 6, 2018* <br>

#### Index
<ul>
<a href="#Intro">Introduction</a><br>
<a href="#Import">Import Modules</a><br>
<a href="#Load">Load and Arrange the Data</a><br>
<a href="#Pick">Pick the Likely Columns to Classify to and Reformat the Data</a><br>
<a href="#Set">Set the Training and Test Data Set</a><br>
<a href="#Run">Run Classification</a><br>
<a href="#See">Visualize Results</a><br>
<a href="#Talk">Discussion of Results</a><br>
</ul>

### <a id="Intro">Introduction:</a> <br>
<br>

<div align=left> <img src=openfoodfacts-logo-en.svg align=left></div> <br>
<br>
<br>
<br>
<br>
<br>
<br>

#### Open Food Facts:


***A food products database:*** <br>
Open Food Facts is a database of food products with ingredients, allergens, nutrition facts and all the tidbits of information we can find on product labels. <br>
***Made by everyone:*** <br>
Open Food Facts is a non-profit association of volunteers. <br>
1800+ contributors like you have added 75 000+ products from 150 countries using our Android, iPhone or Windows Phone app or their camera to scan barcodes and upload pictures of products and their labels.<br>
***For everyone:*** <br>
Data about food is of public interest and has to be open. The complete database is published as open data and can be reused by anyone and for any use. Check-out the cool reuses or make your own! <br>

(Source: the description above is quoted from the Open Food Facts website.) <br>





In [None]:
from IPython.display import IFrame    # For linking to documentation 
IFrame('https://world.openfoodfacts.org', width=300, height=400)

### <a id="Import">Import Modules:</a> <br>

In [None]:
import pandas as pd                   # For array manipulations
import matplotlib.pyplot as plt       # For generating plots
import seaborn as sns ; sns.set()     # For beauty treatments
%matplotlib inline

### <a id="Load">Load and Arrange the Data:</a> <br>

In [None]:
df = pd.read_csv('en.openfoodfacts.org.products.csv', sep='\t')

In [None]:
df.shape

In [None]:
dim = df.shape
cols = list(df)
pd.set_option("display.max_rows", dim[0]) # option to be able to display all rows
pd.set_option("display.max_columns", dim[1]) # displays all columns with scroll bar

In [None]:
df.head(3)   # show the first 3 rows (head() defaults to 5)
#df.head(dim[0])   # shows the whole table

In [None]:
df.info()

### <a id="Pick">Pick the Likely Columns to Classify to and Reformat the Data:</a> <br>

#### Leading Thought:
This is a fairly big dataset compared to what we've worked on in class, and it needs to be pared down to manageable and usable entries. The dataset is arranged so that all food ingredients are listed as columns, but certainly not all foods contain all ingredients. Thus, there are a lot of legitimate no-data (NaN) entries in the dataset. In addition, many entries are simply missing for other reasons. This is, after all, a free, crowd-sourced dataset. <br>

***We assume the main use of this dataset is to help you make informed decisions about your food choices.*** So the approach taken is to predict the **nutrition_grade_fr** column from a single ingredient-related column. The user can modify the notebook to test other ingredients. We expect that certain ingredients will consistently score low and should generally be avoided. <br>

For this example, we start with the **additives_n** column and see how it contributes to a given nutrition grade score.  <br>
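One way to see how sparse each column is, before choosing which ones to keep, is to compute the fraction of missing values per column. The sketch below is only illustrative: `toy_df` and its values are made up to stand in for the real table, which would be `df` in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the Open Food Facts table
toy_df = pd.DataFrame({
    "product_name": ["oat bar", "cola", None, "butter"],
    "nutrition_grade_fr": ["b", "e", np.nan, "d"],
    "additives_n": [1.0, 4.0, np.nan, np.nan],
})

# Fraction of missing entries per column, sparsest column first
missing_fraction = toy_df.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

Columns with a very high missing fraction are poor candidates for a single-feature classifier, because most rows would be dropped.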

---


In [None]:
df.describe()

In [None]:
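# Pick the columns to measure; everything else is eliminated.
# .copy() makes sample_df independent of df, so later edits do not trigger SettingWithCopyWarning.
sample_df = df[['product_name','nutrition_grade_fr','additives_n']].copy()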

In [None]:
sample_df.shape

In [None]:
sample_df.isna().sum()

In [None]:
# Encode the letter grades as numbers, in the grade column only
sample_df['nutrition_grade_fr'] = sample_df['nutrition_grade_fr'].map(dict(a=1, b=2, c=3, d=4, e=5))

In [None]:
# Drop the rows with NaN because there is no data. The gaps may be legitimate, and there is no way to fill them in.
sample_df = sample_df.dropna()
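The encoding and clean-up steps can be checked on a small frame first. This is a minimal sketch, assuming the grades are the letters `a` to `e`. The frame `toy_df` and the names `grade_to_number`, `encoded`, and `cleaned` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; note the product that is literally named "a"
toy_df = pd.DataFrame({
    "product_name": ["a", "cola", "plain yogurt", "butter"],
    "nutrition_grade_fr": ["a", "e", np.nan, "d"],
    "additives_n": [0.0, 4.0, 1.0, np.nan],
})

grade_to_number = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}

# Map only the grade column so that product names are left untouched
encoded = toy_df.copy()
encoded["nutrition_grade_fr"] = encoded["nutrition_grade_fr"].map(grade_to_number)

# dropna returns a new frame, so the result has to be assigned
cleaned = encoded.dropna(subset=["nutrition_grade_fr", "additives_n"])
print(cleaned)
```

Restricting the mapping to one column matters because a frame-wide replacement would also rewrite any other cell whose value is exactly one of the grade letters.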

### <a id="Set">Set the Training and Test Data Set</a> <br>

In [None]:
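# Split into training and test sets
from sklearn.model_selection import train_test_split

# Double brackets keep X as a 2-D DataFrame, which scikit-learn estimators expect
X, y = sample_df[["additives_n"]], sample_df["nutrition_grade_fr"]
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=.2, random_state=30)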

### <a id="Run">Run Classification:</a> <br>

In [None]:
# First import the scikit-learn pieces (more than are actually used):

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import average_precision_score
from sklearn.linear_model import LogisticRegression
logit_clf = LogisticRegression(solver='lbfgs') # set explicitly; older scikit-learn versions defaulted to the slower 'liblinear'

In [None]:
# The fancy one -- NOT RUNNING YET:

# It does not run because CountVectorizer expects text and tries to lower-case the numeric feature.

#classifier = Pipeline([('vectorizer', CountVectorizer()),('tfidf', TfidfTransformer()), ('clf', OneVsRestClassifier(LinearSVC()))])
#classifier = classifier.fit(X_train, Y_train)
#predicted = classifier.predict(X_test)

In [None]:
# Simpler: a support vector classifier.

# SVC expects a 2-D feature array, so the single feature column is reshaped to (n_samples, 1).

# Decided on the support vector classifier.
from sklearn.svm import SVC

svm_clf = SVC(gamma='auto')
svm_clf.fit(X_train.to_numpy().reshape(-1, 1), Y_train)
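The 2-D requirement is easy to reproduce on a toy frame. The sketch below uses made-up data and hypothetical names (`toy_df`, `X_series`, `X_frame`, `X_reshaped`), and a decision tree rather than an SVC only to keep it fast. The shape rule is the same for scikit-learn estimators in general.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one numeric feature and an already-encoded grade
toy_df = pd.DataFrame({
    "additives_n": [0, 0, 1, 3, 5, 6],
    "nutrition_grade_fr": [1, 1, 2, 4, 5, 5],
})
y_toy = toy_df["nutrition_grade_fr"]

X_series = toy_df["additives_n"]                 # shape (6,): 1-D
X_frame = toy_df[["additives_n"]]                # shape (6, 1): 2-D
X_reshaped = X_series.to_numpy().reshape(-1, 1)  # shape (6, 1): 2-D

# Fitting on the 1-D Series is rejected with a ValueError
try:
    DecisionTreeClassifier(random_state=0).fit(X_series, y_toy)
    fit_1d_failed = False
except ValueError:
    fit_1d_failed = True

# Fitting on the 2-D frame works
clf = DecisionTreeClassifier(random_state=0).fit(X_frame, y_toy)
prediction = clf.predict(pd.DataFrame({"additives_n": [0]}))[0]
print(X_series.shape, X_frame.shape, fit_1d_failed, prediction)
```

Selecting with double brackets is usually the simpler fix, since it keeps the column name attached to the feature.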

 

### Revised Thought: 
Running into issues with data compatibility and formatting. 

***Trying a straight decision tree:***
The reasoning here is that we are trying to help make informed decisions about foods, so a Decision Tree, which is a white-box model, seems to be a better and clearer choice, almost an intuitive one. <br>

When you look at a food label and try to decide whether something is good or not, you first look at key ingredients one by one. With simple 'good' vs 'bad' decisions, the Gini impurity, a measure of how mixed the classes are at a node, is simple to assess. In that sense, you apply a decision tree without realizing it.

Should have tried this in the first place.
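To make the Gini idea concrete, here is a small worked example. It is a sketch of the standard Gini impurity formula, one minus the sum of squared class proportions. The helper `gini_impurity` and the 'good'/'bad' labels are illustrative only and are not taken from the dataset.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_k^2), where p_k is the share of class k in labels."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

pure_node = ["good", "good", "good", "good"]   # only one class present
mixed_node = ["good", "good", "bad", "bad"]    # evenly split between two classes

print(gini_impurity(pure_node))
print(gini_impurity(mixed_node))
```

A node with a single class has impurity 0, and an even two-class split has impurity 0.5. A decision tree picks the splits that lower this value the most.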

In [None]:
# Import the needed libraries:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
# data has already been split out from above

dt1 = DecisionTreeClassifier(random_state=42)
dt1.fit(X_train, Y_train)

In [None]:
dt_params = [
    {'max_depth': [1, 2, 4, 8, 16, 32, 64],
     'min_samples_leaf' : [1, 2, 3, 4, 5, 6],
    },
]

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(dt1, X_train, Y_train, cv=3, scoring="accuracy")

In [None]:
# We add the GridSearchCV to tune the model
dt_cv = GridSearchCV(estimator=dt1, param_grid=dt_params, cv=4)
dt_cv.fit(X_train, Y_train)
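After fitting, the search object exposes the winning settings through `best_params_` and `best_score_`. The sketch below shows this on synthetic data. The data, the rule that generates the grades, and the names `X_toy`, `y_toy`, `param_grid`, and `search` are all made up for illustration and say nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic single-feature data with a made-up three-class rule
rng = np.random.default_rng(0)
additives = rng.integers(0, 10, size=200)
X_toy = additives.reshape(-1, 1)
y_toy = np.where(additives < 3, 1, np.where(additives < 7, 3, 5))

# A smaller grid than the notebook's, to keep the sketch quick
param_grid = {"max_depth": [1, 2, 4], "min_samples_leaf": [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=4)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

Because the made-up rule needs two thresholds, a tree of depth 1 cannot separate the three classes, so the search should settle on a deeper tree.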

In [None]:
from sklearn.metrics import roc_auc_score
# roc_curve only handles binary targets; the grade has five classes,
# so compute a one-vs-rest ROC AUC from the predicted class probabilities
y_proba = dt_cv.best_estimator_.predict_proba(X_test)
roc_auc = roc_auc_score(Y_test, y_proba, multi_class='ovr')
roc_auc

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(dt1, X_train, Y_train, cv=3, scoring="accuracy")

### <a id="Talk">Discussion of Results:</a> <br>

This has been a comparative study of the nutrition grade as derived from the food ingredients. Some seemingly "healthy" foods do not rank that high. Other foods that are generally considered healthy natural products, like butter, get a low score.  

#### Resources:

The Kaggle website has various notebooks, but almost all of them involve image processing (of the labels).

Google Summer of Code mentions some ideas for ML notebooks, but they also revolve around images.

This notebook was the only one that pursued a similar classification idea:<br>
https://github.com/aromadhony/GSoC2018/blob/master/OpenFoodFacts/get_cat_ingredients.ipynb

And of course, the homework for Week 8, where we went over the Decision Tree, and Chapter 6 of the textbook.