# Healthier Groceries Recommender

### Imputing Food Categories for Null Values

*In this notebook you will find:*
1. KMeans clustering (unsupervised learning) to impute food categories

### Unsupervised Learning Attempts

Initially I attempted to create my own categories for the entire dataset using unsupervised learning. I tried TF-IDF and Word2Vec to vectorize each food item's ingredient list, then clustered those vectors with KMeans and DBSCAN. There was far too much data and there were too many outliers, so these methods did not work.

Then I decided to use clustering only for the items whose category is null. This ran much faster and produced a better silhouette score. I wound up using TF-IDF and KMeans because this combination worked best for my model.
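
To make the TF-IDF plus KMeans idea concrete, here is a minimal sketch on a handful of made-up product descriptions. The `toy_*` names and the text strings are illustrative assumptions, not rows from `Null_Categories.csv`, and `n_clusters=3` is chosen only to suit the toy data.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical product descriptions (not taken from the real dataset).
toy_items = [
    "italian diced tomatoes",
    "diced tomatoes no salt added",
    "crushed tomatoes",
    "ground white turkey",
    "fresh ground turkey",
    "ground turkey breast",
    "ez peel shrimp",
    "raw peeled shrimp",
    "cooked shrimp",
]

# Turn each description into a sparse TF-IDF vector.
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_items)

# Group the vectors into a small number of clusters.
toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(toy_matrix)

# Silhouette score lies in [-1, 1]; higher means tighter, better separated clusters.
toy_score = silhouette_score(toy_matrix, toy_labels)

for text, label in zip(toy_items, toy_labels):
    print(label, text)
print("silhouette:", round(toy_score, 3))
```

Fixing `random_state` makes the toy run repeatable. The cluster ids themselves carry no meaning beyond grouping similar items together.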

In [16]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from scipy import sparse
from sklearn.metrics.pairwise import pairwise_distances
import matplotlib.pyplot as plt
import seaborn as sns
import gensim
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

In [17]:
null_fix = pd.read_csv('Null_Categories.csv')

In [18]:
null_fix.head()

Unnamed: 0,fdc_id,brand_owner,ingredients,serving_size,serving_size_unit,branded_food_category,data_type,description,publication_date,cat_null,...,mufa_grams,protein_grams,fiber_grams,vitamin_a_IU,vitamin_c_MG,calcium_MG,iron_MG,sugar_G,sodium_MG,NRFNn.3
0,344604,Red Gold,"Tomatoes, Tomato Juice, Less Than 2% Of: Salt,...",123.0,g,,branded_food,Tutturosso Green 14.5oz. NSA Italian Diced Tom...,2019-04-01,True,...,0.0,0.0162,0.032,,,0.013,0.0,0.01952,0.084583,
1,344605,Red Gold,"Tomatoes, Tomato Juice, Less Than 2% Of: Salt,...",123.0,g,,branded_food,Tutturosso Green 14.5oz. Italian Diced Tomatoes,2019-04-01,True,...,0.0,0.0162,0.032,,,0.016,0.0,0.01952,0.084583,
2,344606,Cargill,"White Turkey, Natural Flavoring",112.0,g,,branded_food,Honeysuckle White Fresh 97% Ground White Turkey,2019-04-01,True,...,,0.4642,0.0,,0.0,0.0,0.071667,0.0,0.027917,
3,344607,Cargill,"Turkey Breast, Natural Flavoring",112.0,g,,branded_food,Honeysuckle White 97% Ground White Turkey,2019-04-01,True,...,,0.4642,0.0,0.0,0.0,0.0,0.071667,,0.027917,
4,344608,Cargill,"Turkey, natural Flavoring.",112.0,g,,branded_food,Honeysuckle Whtie 85% Ground Turkey,2019-04-01,True,...,,0.375,0.0,0.0178,0.0,0.018,0.053333,,0.042917,


In [19]:
null_fix.shape

(9317, 24)

In [27]:
null_fix.isnull().sum()

fdc_id                      0
brand_owner                 0
ingredients                 5
serving_size                0
serving_size_unit           0
branded_food_category    9317
data_type                   0
description                 0
publication_date            0
cat_null                    0
sat_fat_g                 515
trans_fat_g               844
magnesium_mg             8833
potassium_mg             2748
mufa_grams               5506
protein_grams              42
fiber_grams               747
vitamin_a_IU             2910
vitamin_c_MG             1981
calcium_MG                515
iron_MG                   488
sugar_G                   104
sodium_MG                  51
NRFNn.3                  9137
dtype: int64

In [28]:
# Where the category is flagged as null (cat_null), use the product
# description as the text to cluster on. Assigning through .loc avoids the
# chained indexing that raises pandas' SettingWithCopyWarning.
mask = null_fix['cat_null']
null_fix.loc[mask, 'ingredients'] = null_fix.loc[mask, 'description']


In [29]:
null_fix.isnull().sum()

fdc_id                      0
brand_owner                 0
ingredients                 0
serving_size                0
serving_size_unit           0
branded_food_category    9317
data_type                   0
description                 0
publication_date            0
cat_null                    0
sat_fat_g                 515
trans_fat_g               844
magnesium_mg             8833
potassium_mg             2748
mufa_grams               5506
protein_grams              42
fiber_grams               747
vitamin_a_IU             2910
vitamin_c_MG             1981
calcium_MG                515
iron_MG                   488
sugar_G                   104
sodium_MG                  51
NRFNn.3                  9137
dtype: int64

# Tagging Items

In [20]:
# Note: this X is overwritten by the TF-IDF matrix in the next cell.
X = null_fix[['ingredients', 'description']]

In [30]:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(null_fix['ingredients'])

In [53]:
modelkmeans = KMeans(n_clusters=1000, max_iter=300, init='k-means++')
modelkmeans.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=1000, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [54]:
modelkmeans.labels_

array([568, 568, 942, ..., 314, 515, 515])

In [55]:
null_fix['cluster'] = modelkmeans.labels_
null_fix.head()

Unnamed: 0,fdc_id,brand_owner,ingredients,serving_size,serving_size_unit,branded_food_category,data_type,description,publication_date,cat_null,...,protein_grams,fiber_grams,vitamin_a_IU,vitamin_c_MG,calcium_MG,iron_MG,sugar_G,sodium_MG,NRFNn.3,cluster
0,344604,Red Gold,Tutturosso Green 14.5oz. NSA Italian Diced Tom...,123.0,g,,branded_food,Tutturosso Green 14.5oz. NSA Italian Diced Tom...,2019-04-01,True,...,0.0162,0.032,,,0.013,0.0,0.01952,0.084583,,568
1,344605,Red Gold,Tutturosso Green 14.5oz. Italian Diced Tomatoes,123.0,g,,branded_food,Tutturosso Green 14.5oz. Italian Diced Tomatoes,2019-04-01,True,...,0.0162,0.032,,,0.016,0.0,0.01952,0.084583,,568
2,344606,Cargill,Honeysuckle White Fresh 97% Ground White Turkey,112.0,g,,branded_food,Honeysuckle White Fresh 97% Ground White Turkey,2019-04-01,True,...,0.4642,0.0,,0.0,0.0,0.071667,0.0,0.027917,,942
3,344607,Cargill,Honeysuckle White 97% Ground White Turkey,112.0,g,,branded_food,Honeysuckle White 97% Ground White Turkey,2019-04-01,True,...,0.4642,0.0,0.0,0.0,0.0,0.071667,,0.027917,,942
4,344608,Cargill,Honeysuckle Whtie 85% Ground Turkey,112.0,g,,branded_food,Honeysuckle Whtie 85% Ground Turkey,2019-04-01,True,...,0.375,0.0,0.0178,0.0,0.018,0.053333,,0.042917,,942


In [56]:
silhouette_score(X, modelkmeans.labels_)

0.18001778909289268
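
A single silhouette score is hard to judge in isolation, so one option is to compare it across several cluster counts. The sketch below is an assumption about how that comparison could be done, not the procedure used in this notebook. `silhouette_by_k`, the `demo_*` names, and the candidate values are all hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def silhouette_by_k(matrix, candidate_ks, random_state=0):
    """Fit KMeans once per candidate k and return {k: silhouette score}."""
    scores = {}
    for k in candidate_ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(matrix)
        scores[k] = silhouette_score(matrix, labels)
    return scores


# Hypothetical toy corpus so the example runs on its own.
demo_texts = [
    "italian diced tomatoes",
    "diced tomatoes no salt added",
    "crushed tomatoes",
    "ground white turkey",
    "fresh ground turkey",
    "ground turkey breast",
    "ez peel shrimp",
    "raw peeled shrimp",
    "cooked shrimp",
]
demo_matrix = TfidfVectorizer(stop_words="english").fit_transform(demo_texts)

demo_scores = silhouette_by_k(demo_matrix, [2, 3, 4])
best_k = max(demo_scores, key=demo_scores.get)
print(demo_scores)
print("best k on the toy data:", best_k)
```

On the notebook's `X`, a call such as `silhouette_by_k(X, [250, 500, 1000])` would refit KMeans once per value and could take a while. `silhouette_score` also accepts a `sample_size` argument if scoring every row is too slow.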

In [57]:
null_fix['cluster'].value_counts()

15     88
4      70
515    69
47     47
74     46
61     41
13     39
377    37
126    36
431    35
759    35
795    34
357    33
412    32
136    31
410    31
588    30
85     29
312    28
194    27
904    27
818    27
613    27
37     27
5      26
215    25
614    25
266    25
121    24
859    24
       ..
16      2
530     2
167     2
913     2
895     2
761     2
757     2
813     2
972     2
35      2
938     2
737     2
707     2
809     2
874     2
661     2
748     2
853     2
858     2
994     2
372     1
414     1
40      1
193     1
460     1
468     1
17      1
851     1
29      1
267     1
Name: cluster, Length: 1000, dtype: int64

In [58]:
null_fix[null_fix['cluster'] == 220]

Unnamed: 0,fdc_id,brand_owner,ingredients,serving_size,serving_size_unit,branded_food_category,data_type,description,publication_date,cat_null,...,protein_grams,fiber_grams,vitamin_a_IU,vitamin_c_MG,calcium_MG,iron_MG,sugar_G,sodium_MG,NRFNn.3,cluster
4385,350749,Beaver Street Fisheries Inc.,26/30 EZ PEEL SHRIMP,112.0,g,,branded_food,26/30 EZ PEEL SHRIMP,2019-04-01,True,...,0.2322,0.0,0.0,0.0,0.027,0.0,0.0,0.249167,,220
5163,351530,"BEAVER STREET FISHERIES, INC.",30/40 E-Z PEEL SHRIMP,112.0,g,,branded_food,30/40 E-Z PEEL SHRIMP,2019-04-01,True,...,0.2322,0.0,0.0,0.0,0.027,0.0,0.0,0.249167,,220
6922,353294,Beaver Street Fisheries Inc.,30/40 EZ PEEL SHRIMP,112.0,g,,branded_food,30/40 EZ PEEL SHRIMP,2019-04-01,True,...,0.2322,0.0,0.0,0.026667,0.027,0.0,0.0,0.249167,,220
7016,353388,Beaver Street Fisheries Inc.,21/25 EZ PEEL SHRIMP,112.0,g,,branded_food,21/25 EZ PEEL SHRIMP,2019-04-01,True,...,0.2322,0.0,0.0,0.0,0.027,0.0,0.0,0.249167,,220
7018,353390,"BEAVER STREET FISHERIES, INC.",16/20 EZ PEEL SHRIMP,112.0,g,,branded_food,16/20 EZ PEEL SHRIMP,2019-04-01,True,...,0.2322,0.0,0.0,0.0,0.027,0.0,0.0,0.249167,,220
7200,353572,Beaver Street Fisheries Inc.,21/30 CT. E/Z PEEL SHRIMP,112.0,g,,branded_food,21/30 CT. E/Z PEEL SHRIMP,2019-04-01,True,...,0.2322,0.0,0.0,0.0,0.027,0.0,0.0,0.249167,,220
7316,353688,Beaver Street Fisheries Inc.,40/50 EZ PEEL SHRIMP,112.0,g,,branded_food,40/50 EZ PEEL SHRIMP,2019-04-01,True,...,0.2322,0.0,0.0,0.0,0.027,0.0,0.0,0.249167,,220
7637,354010,"BEAVER STREET FISHERIES, INC.",21/25 EZ PEEL SHRIMP,112.0,g,,branded_food,21/25 EZ PEEL SHRIMP,2019-04-01,True,...,0.2322,0.0,0.0,0.0,0.027,0.0,0.0,0.249167,,220
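
Filtering rows one cluster at a time, as with cluster 220 above, works for spot checks. To get a quick summary of many clusters at once, one possible approach is to list the highest-weighted TF-IDF terms in each KMeans centroid. The helper `top_terms_per_cluster` and the `demo_*` names below are hypothetical and are not part of the original notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def top_terms_per_cluster(fitted_vectorizer, fitted_kmeans, n_terms=3):
    """Return {cluster_id: [terms with the largest centroid weights]}."""
    if hasattr(fitted_vectorizer, "get_feature_names_out"):
        terms = np.asarray(fitted_vectorizer.get_feature_names_out())
    else:  # older scikit-learn releases expose get_feature_names() instead
        terms = np.asarray(fitted_vectorizer.get_feature_names())
    # For each centroid, order the vocabulary indices from largest weight to smallest.
    order = np.argsort(fitted_kmeans.cluster_centers_, axis=1)[:, ::-1]
    return {cid: terms[row[:n_terms]].tolist() for cid, row in enumerate(order)}


# Hypothetical toy corpus so the example runs on its own.
demo_texts = [
    "ez peel shrimp",
    "raw peel shrimp",
    "cooked peel shrimp",
    "ground white turkey",
    "ground turkey breast",
    "lean ground turkey",
]
demo_vectorizer = TfidfVectorizer(stop_words="english")
demo_matrix = demo_vectorizer.fit_transform(demo_texts)
demo_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(demo_matrix)

demo_summary = top_terms_per_cluster(demo_vectorizer, demo_kmeans)
for cluster_id, words in demo_summary.items():
    print(cluster_id, words)
```

With the notebook's objects, the equivalent call would be `top_terms_per_cluster(vectorizer, modelkmeans)`. The resulting term lists could then serve as a starting point for naming each cluster by hand.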


In [59]:
null_fix.to_csv('Null_Categories_Fixed.csv', index=False)