<a href="https://colab.research.google.com/github/ogreen8084/cleaning/blob/master/food_coded.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#College Food Choices (Data Cleaning)
In this notebook we will be cleaning a college food choices dataset. In the process we will use some NLP libraries such as spaCy, NLTK and TextBlob. The dataset can be found here: 
https://www.kaggle.com/datasets/borapajo/food-choices. Since the questions were open ended so there are plenty of opportunites for data cleaning. 

In [2]:
from google.colab import drive

drive.mount('/content/gdrive') 

Mounted at /content/gdrive


# Import Dataset

In [3]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_csv('/content/gdrive/MyDrive/food_coded.csv')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Data Exploration

The questionnaire had 125 responses to 61 questions.


In [5]:
df.shape

(125, 61)

In [6]:
df.head(3)

Unnamed: 0,GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,comfort_food,comfort_food_reasons,comfort_food_reasons_coded,cook,comfort_food_reasons_coded.1,cuisine,diet_current,diet_current_coded,drink,eating_changes,eating_changes_coded,eating_changes_coded1,eating_out,employment,ethnic_food,exercise,father_education,father_profession,fav_cuisine,fav_cuisine_coded,fav_food,food_childhood,fries,fruit_day,grade_level,greek_food,healthy_feeling,healthy_meal,ideal_diet,ideal_diet_coded,income,indian_food,italian_food,life_rewarding,marital_status,meals_dinner_friend,mother_education,mother_profession,nutritional_check,on_off_campus,parents_cook,pay_meal_out,persian_food,self_perception_weight,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight
0,2.4,2,1,430,,315.0,1,none,we dont have comfort,9.0,2.0,9,,eat good and exercise,1,1.0,eat faster,1,1,3,3.0,1,1.0,5.0,profesor,Arabic cuisine,3,1.0,rice and chicken,2,5,2,5,2,looks not oily,being healthy,8,5.0,5,5,1.0,1.0,"rice, chicken, soup",1.0,unemployed,5,1.0,1,2,5.0,3.0,1.0,1.0,1,1165.0,345,car racing,5,1,1315,187
1,3.654,1,1,610,3.0,420.0,2,"chocolate, chips, ice cream","Stress, bored, anger",1.0,3.0,1,1.0,I eat about three times a day with some snacks...,2,2.0,I eat out more than usual.,1,2,2,2.0,4,1.0,2.0,Self employed,Italian,1,1.0,"chicken and biscuits, beef soup, baked beans",1,4,4,4,5,"Grains, Veggies, (more of grains and veggies),...",Try to eat 5-6 small meals a day. While trying...,3,4.0,4,4,1.0,2.0,"Pasta, steak, chicken",4.0,Nurse RN,4,1.0,1,4,4.0,3.0,1.0,1.0,2,725.0,690,Basketball,4,2,900,155
2,3.3,1,1,720,4.0,420.0,2,"frozen yogurt, pizza, fast food","stress, sadness",1.0,1.0,1,3.0,"toast and fruit for breakfast, salad for lunch...",3,1.0,sometimes choosing to eat fast food instead of...,1,3,2,3.0,5,2.0,2.0,owns business,italian,1,3.0,"mac and cheese, pizza, tacos",1,5,3,5,6,usually includes natural ingredients; nonproce...,i would say my ideal diet is my current diet,6,6.0,5,5,7.0,2.0,"chicken and rice with veggies, pasta, some kin...",2.0,owns business,4,2.0,1,3,5.0,6.0,1.0,2.0,5,1165.0,500,none,5,1,900,I'm not answering this.


Now that we have a preview of the data, we'll see if there are any null values

In [7]:
df.isnull().sum()[:10]

GPA                            2
Gender                         0
breakfast                      0
calories_chicken               0
calories_day                  19
calories_scone                 1
coffee                         0
comfort_food                   1
comfort_food_reasons           1
comfort_food_reasons_coded    19
dtype: int64

Two individuals didn't give their GPAs, we'll delete those

In [8]:
df = df[df['GPA'].notnull()]

Calories and Calories in a Scone are numerical values, we'll impute the mean for any null values in these fields

In [86]:
df['calories_day'] = df['calories_day'].fillna(df['calories_day'].mean())

In [None]:
df['calories_scone'] = df['calories_scone'].fillna(df['calories_scone'].mean())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Comfort foods can include more than one food, we'll look for common foods and create distinct columns based on these foods 

In [17]:
foods_listed = {}
for foods in df['comfort_food']:
    try:
        for food in foods.split(", "):
            if food.lower() not in foods_listed.keys():
                foods_listed[food.lower()] = 1
            else:
                foods_listed[food.lower()] += 1
    except:
        pass

In [16]:
from collections import Counter

Now that we have the most common foods we'll create a function to look for the word and assign a value if the word is included

In [18]:
d = Counter(foods_listed)
d.most_common(6)

[('ice cream', 41),
 ('pizza', 35),
 ('chocolate', 25),
 ('chips', 21),
 ('cookies', 15),
 ('mac and cheese', 11)]

In [19]:
def word_finder(val):
    try:
        for word in words.split(" "):
            if word not in val.lower():
                return 0
        return 1
    except:
        return 0

In [20]:
words = 'pizza'
df['pizza'] = df['comfort_food'].map(word_finder)

In [21]:
words = 'ice cream'
df['ice_cream'] = df['comfort_food'].map(word_finder)

In [22]:
words = 'chips'
df['chips'] = df['comfort_food'].map(word_finder)

In [23]:
words = 'chocolate'
df['chocolate'] = df['comfort_food'].map(word_finder)

In [24]:
words = 'mac cheese'
df['mac_cheese'] = df['comfort_food'].map(word_finder)

In [25]:
df.head(3)

Unnamed: 0,GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,comfort_food,comfort_food_reasons,comfort_food_reasons_coded,cook,comfort_food_reasons_coded.1,cuisine,diet_current,diet_current_coded,drink,eating_changes,eating_changes_coded,eating_changes_coded1,eating_out,employment,ethnic_food,exercise,father_education,father_profession,fav_cuisine,fav_cuisine_coded,fav_food,food_childhood,fries,fruit_day,grade_level,greek_food,healthy_feeling,healthy_meal,ideal_diet,ideal_diet_coded,income,indian_food,italian_food,life_rewarding,marital_status,meals_dinner_friend,mother_education,mother_profession,nutritional_check,on_off_campus,parents_cook,pay_meal_out,persian_food,self_perception_weight,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight,comfort_food_sim,pizza,ice_cream,chips,chocolate,mac_cheese
0,2.4,2,1,430,,315.0,1,none,we dont have comfort,9.0,2.0,9,,eat good and exercise,1,1.0,eat faster,1,1,3,3.0,1,1.0,5.0,profesor,Arabic cuisine,3,1.0,rice and chicken,2,5,2,5,2,looks not oily,being healthy,8,5.0,5,5,1.0,1.0,"rice, chicken, soup",1.0,unemployed,5,1.0,1,2,5.0,3.0,1.0,1.0,1,1165.0,345,car racing,5,1,1315,187,0.263932,0,0,0,0,0
1,3.654,1,1,610,3.0,420.0,2,"chocolate, chips, ice cream","Stress, bored, anger",1.0,3.0,1,1.0,I eat about three times a day with some snacks...,2,2.0,I eat out more than usual.,1,2,2,2.0,4,1.0,2.0,Self employed,Italian,1,1.0,"chicken and biscuits, beef soup, baked beans",1,4,4,4,5,"Grains, Veggies, (more of grains and veggies),...",Try to eat 5-6 small meals a day. While trying...,3,4.0,4,4,1.0,2.0,"Pasta, steak, chicken",4.0,Nurse RN,4,1.0,1,4,4.0,3.0,1.0,1.0,2,725.0,690,Basketball,4,2,900,155,0.894638,0,1,1,1,0
2,3.3,1,1,720,4.0,420.0,2,"frozen yogurt, pizza, fast food","stress, sadness",1.0,1.0,1,3.0,"toast and fruit for breakfast, salad for lunch...",3,1.0,sometimes choosing to eat fast food instead of...,1,3,2,3.0,5,2.0,2.0,owns business,italian,1,3.0,"mac and cheese, pizza, tacos",1,5,3,5,6,usually includes natural ingredients; nonproce...,i would say my ideal diet is my current diet,6,6.0,5,5,7.0,2.0,"chicken and rice with veggies, pasta, some kin...",2.0,owns business,4,2.0,1,3,5.0,6.0,1.0,2.0,5,1165.0,500,none,5,1,900,I'm not answering this.,0.781846,1,0,0,0,0


We'll use the Spacy library and sentence similarity to find out how close comfort food reasons are to the statement "bored, sad, stressed"

In [11]:
#!python -m spacy download en_core_web_md

In [26]:
import spacy

nlp = spacy.load('en_core_web_md')

sentence = 'bored, sad, stressed'
def similarity_nlp_col(val):
    try:
        return nlp(val).similarity(nlp(sentence))     
    except:
        return 0

In [27]:
df['comfort_food_sim'] = df['comfort_food_reasons'].map(similarity_nlp_col)

We'll also create binary columns for the most popular comfort food reasons 

In [28]:
stress_listed = {}
for words in df['comfort_food_reasons']:
    try:
        words = words.replace(",","").replace("/","")
        for word in words.split(" "):
            if word.lower() not in stress_listed.keys():
                stress_listed[word.lower()] = 1
            else:
                stress_listed[word.lower()] += 1
    except:
        pass

In [29]:
d = Counter(stress_listed)
d.most_common(15)

[('boredom', 71),
 ('', 41),
 ('sadness', 41),
 ('stress', 33),
 ('and', 26),
 ('i', 19),
 ('when', 12),
 ('anger', 9),
 ('happiness', 9),
 ('comfort', 8),
 ('bored', 8),
 ('am', 8),
 ('eat', 7),
 ('sad', 7),
 ('or', 6)]

In [30]:
stress_words = ['bored','sad','stress']
for word in stress_words:
    words = word
    df['comfort_reason_'+ word] = df['comfort_food_reasons'].map(word_finder)

In [31]:
df.head(3)

Unnamed: 0,GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,comfort_food,comfort_food_reasons,comfort_food_reasons_coded,cook,comfort_food_reasons_coded.1,cuisine,diet_current,diet_current_coded,drink,eating_changes,eating_changes_coded,eating_changes_coded1,eating_out,employment,ethnic_food,exercise,father_education,father_profession,fav_cuisine,fav_cuisine_coded,fav_food,food_childhood,fries,fruit_day,grade_level,greek_food,healthy_feeling,healthy_meal,ideal_diet,ideal_diet_coded,income,indian_food,italian_food,life_rewarding,marital_status,meals_dinner_friend,mother_education,mother_profession,nutritional_check,on_off_campus,parents_cook,pay_meal_out,persian_food,self_perception_weight,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight,comfort_food_sim,pizza,ice_cream,chips,chocolate,mac_cheese,comfort_reason_bored,comfort_reason_sad,comfort_reason_stress
0,2.4,2,1,430,,315.0,1,none,we dont have comfort,9.0,2.0,9,,eat good and exercise,1,1.0,eat faster,1,1,3,3.0,1,1.0,5.0,profesor,Arabic cuisine,3,1.0,rice and chicken,2,5,2,5,2,looks not oily,being healthy,8,5.0,5,5,1.0,1.0,"rice, chicken, soup",1.0,unemployed,5,1.0,1,2,5.0,3.0,1.0,1.0,1,1165.0,345,car racing,5,1,1315,187,0.263932,0,0,0,0,0,0,0,0
1,3.654,1,1,610,3.0,420.0,2,"chocolate, chips, ice cream","Stress, bored, anger",1.0,3.0,1,1.0,I eat about three times a day with some snacks...,2,2.0,I eat out more than usual.,1,2,2,2.0,4,1.0,2.0,Self employed,Italian,1,1.0,"chicken and biscuits, beef soup, baked beans",1,4,4,4,5,"Grains, Veggies, (more of grains and veggies),...",Try to eat 5-6 small meals a day. While trying...,3,4.0,4,4,1.0,2.0,"Pasta, steak, chicken",4.0,Nurse RN,4,1.0,1,4,4.0,3.0,1.0,1.0,2,725.0,690,Basketball,4,2,900,155,0.894638,0,1,1,1,0,1,0,1
2,3.3,1,1,720,4.0,420.0,2,"frozen yogurt, pizza, fast food","stress, sadness",1.0,1.0,1,3.0,"toast and fruit for breakfast, salad for lunch...",3,1.0,sometimes choosing to eat fast food instead of...,1,3,2,3.0,5,2.0,2.0,owns business,italian,1,3.0,"mac and cheese, pizza, tacos",1,5,3,5,6,usually includes natural ingredients; nonproce...,i would say my ideal diet is my current diet,6,6.0,5,5,7.0,2.0,"chicken and rice with veggies, pasta, some kin...",2.0,owns business,4,2.0,1,3,5.0,6.0,1.0,2.0,5,1165.0,500,none,5,1,900,I'm not answering this.,0.781846,1,0,0,0,0,0,1,1


In [32]:
df.drop(columns=['comfort_food_reasons'], inplace=True)

For categorical columns we will impute the mode for missing values

In [33]:
df['comfort_food_reasons_coded'] = df['comfort_food_reasons_coded'].fillna(df['comfort_food_reasons_coded'].mode())
df['cook'] = df['cook'].fillna(df['cook'].mode())

For diets we will use NLTK's pretrained Vader to predict the sentiment towards the respondents' diet

In [34]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


In [35]:
sid = SentimentIntensityAnalyzer()
def diet_sentiment(text):
    try:
        score = sid.polarity_scores(text)['compound']
        if score > 0.33: 
            return 2
        elif score < -0.33:
            return 0
        else:
            return 1
    except:
        return 1

In [36]:
df['diet_sentiment'] = df['diet_current'].map(diet_sentiment)

In [37]:
df[['diet_current','diet_sentiment']][:10]

Unnamed: 0,diet_current,diet_sentiment
0,eat good and exercise,2
1,I eat about three times a day with some snacks...,1
2,"toast and fruit for breakfast, salad for lunch...",1
3,"College diet, cheap and easy foods most nights...",2
4,I try to eat healthy but often struggle becaus...,1
5,My current diet is terrible. I barely have tim...,0
6,I eat a lot of chicken and broccoli for dinner...,1
7,"I eat a very healthy diet. Ocassionally, i wil...",1
8,I eat whatever I want in moderation.,1
9,I eat healthy all the time when possible. I tr...,2


In [38]:
df['cuisine'] = df['cuisine'].fillna(df['cuisine'].mode())

In [39]:
df['diet_current'] = df['diet_current'].fillna('Unknown')

For what eating changes the respondent would make we will take the similarity with their current diet using Vader again

In [40]:
current_minus_change=[]
for i, cols in df.iterrows():
    try:
        current_minus_change.append(nlp(cols['diet_current']).similarity(nlp(cols['eating_changes'])))
    except:
        current_minus_change.append(0)

In [41]:
df['current_minus_change'] = current_minus_change

In [42]:
df.head(3)

Unnamed: 0,GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,comfort_food,comfort_food_reasons_coded,cook,comfort_food_reasons_coded.1,cuisine,diet_current,diet_current_coded,drink,eating_changes,eating_changes_coded,eating_changes_coded1,eating_out,employment,ethnic_food,exercise,father_education,father_profession,fav_cuisine,fav_cuisine_coded,fav_food,food_childhood,fries,fruit_day,grade_level,greek_food,healthy_feeling,healthy_meal,ideal_diet,ideal_diet_coded,income,indian_food,italian_food,life_rewarding,marital_status,meals_dinner_friend,mother_education,mother_profession,nutritional_check,on_off_campus,parents_cook,pay_meal_out,persian_food,self_perception_weight,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight,comfort_food_sim,pizza,ice_cream,chips,chocolate,mac_cheese,comfort_reason_bored,comfort_reason_sad,comfort_reason_stress,diet_sentiment,current_minus_change
0,2.4,2,1,430,,315.0,1,none,9.0,2.0,9,1.0,eat good and exercise,1,1.0,eat faster,1,1,3,3.0,1,1.0,5.0,profesor,Arabic cuisine,3,1.0,rice and chicken,2,5,2,5,2,looks not oily,being healthy,8,5.0,5,5,1.0,1.0,"rice, chicken, soup",1.0,unemployed,5,1.0,1,2,5.0,3.0,1.0,1.0,1,1165.0,345,car racing,5,1,1315,187,0.263932,0,0,0,0,0,0,0,0,2,0.730432
1,3.654,1,1,610,3.0,420.0,2,"chocolate, chips, ice cream",1.0,3.0,1,1.0,I eat about three times a day with some snacks...,2,2.0,I eat out more than usual.,1,2,2,2.0,4,1.0,2.0,Self employed,Italian,1,1.0,"chicken and biscuits, beef soup, baked beans",1,4,4,4,5,"Grains, Veggies, (more of grains and veggies),...",Try to eat 5-6 small meals a day. While trying...,3,4.0,4,4,1.0,2.0,"Pasta, steak, chicken",4.0,Nurse RN,4,1.0,1,4,4.0,3.0,1.0,1.0,2,725.0,690,Basketball,4,2,900,155,0.894638,0,1,1,1,0,1,0,1,1,0.85461
2,3.3,1,1,720,4.0,420.0,2,"frozen yogurt, pizza, fast food",1.0,1.0,1,3.0,"toast and fruit for breakfast, salad for lunch...",3,1.0,sometimes choosing to eat fast food instead of...,1,3,2,3.0,5,2.0,2.0,owns business,italian,1,3.0,"mac and cheese, pizza, tacos",1,5,3,5,6,usually includes natural ingredients; nonproce...,i would say my ideal diet is my current diet,6,6.0,5,5,7.0,2.0,"chicken and rice with veggies, pasta, some kin...",2.0,owns business,4,2.0,1,3,5.0,6.0,1.0,2.0,5,1165.0,500,none,5,1,900,I'm not answering this.,0.781846,1,0,0,0,0,0,1,1,1,0.712965


Given the open format of this questionaire, respondents can use extra terms with their profession. We'll try to pull the nouns out using TextBlob

In [43]:
from textblob import TextBlob
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [46]:
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [55]:
def change_prof(val):
    try:
        return [w.lower() for (w, pos) in TextBlob(val).pos_tags if pos[0] == 'N']
    except:
        return []
df['father_profession'] = df['father_profession'].map(change_prof)
df['mother_profession'] = df['mother_profession'].map(change_prof)

In [51]:
prof_listed = {}
for words in df['father_profession']:
    try:
        for word in words:
            if word not in prof_listed.keys():
                prof_listed[word] = 1
            else:
                prof_listed[word] += 1
    except:
        pass

In [52]:
for words in df['mother_profession']:
    try:
        for word in words:
            if word not in prof_listed.keys():
                prof_listed[word] = 1
            else:
                prof_listed[word] += 1
    except:
        pass

Even after doing so, parent profession is still all over the place, so we'll just delete these columns

In [57]:
df.drop(columns=['mother_profession','father_profession'], inplace=True)

We'll find the most popular favorite cuisines and add a column that will assign a value if a respondent mentioned one of the most popular cuisines

In [58]:
fav_listed = {}
for words in df['fav_cuisine']:
    try:
        words = words.replace(",","").replace("/","")
        for word in words.split(" "):
            if word.lower() not in fav_listed.keys():
                fav_listed[word.lower()] = 1
            else:
                fav_listed[word.lower()] += 1
    except:
        pass

In [59]:
d = Counter(fav_listed)
d.most_common(10)

[('italian', 54),
 ('', 46),
 ('mexican', 12),
 ('chinese', 11),
 ('american', 10),
 ('food', 9),
 ('cuisine', 6),
 ('and', 5),
 ('thai', 4),
 ('or', 4)]

In [60]:
def fav_cuisine_fixer(val):
    try:
        if 'italian' in val.lower():
            return 'Italian'
        elif 'mexican' in val.lower():
            return 'Mexican'
        elif 'chinese' in val.lower():
            return 'Chinese'
        elif 'american' in val.lower():
            return 'American'
        else:
            return 'Other'
    except:
        return 'Other'
df['fav_cuisine'] = df['fav_cuisine'].map(fav_cuisine_fixer)

In [61]:
df['fav_cuisine'].value_counts()

Italian     55
Other       39
Mexican     12
Chinese      9
American     8
Name: fav_cuisine, dtype: int64

In [62]:
df.head(3)

Unnamed: 0,GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,comfort_food,comfort_food_reasons_coded,cook,comfort_food_reasons_coded.1,cuisine,diet_current,diet_current_coded,drink,eating_changes,eating_changes_coded,eating_changes_coded1,eating_out,employment,ethnic_food,exercise,father_education,fav_cuisine,fav_cuisine_coded,fav_food,food_childhood,fries,fruit_day,grade_level,greek_food,healthy_feeling,healthy_meal,ideal_diet,ideal_diet_coded,income,indian_food,italian_food,life_rewarding,marital_status,meals_dinner_friend,mother_education,nutritional_check,on_off_campus,parents_cook,pay_meal_out,persian_food,self_perception_weight,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight,comfort_food_sim,pizza,ice_cream,chips,chocolate,mac_cheese,comfort_reason_bored,comfort_reason_sad,comfort_reason_stress,diet_sentiment,current_minus_change
0,2.4,2,1,430,,315.0,1,none,9.0,2.0,9,1.0,eat good and exercise,1,1.0,eat faster,1,1,3,3.0,1,1.0,5.0,Other,3,1.0,rice and chicken,2,5,2,5,2,looks not oily,being healthy,8,5.0,5,5,1.0,1.0,"rice, chicken, soup",1.0,5,1.0,1,2,5.0,3.0,1.0,1.0,1,1165.0,345,car racing,5,1,1315,187,0.263932,0,0,0,0,0,0,0,0,2,0.730432
1,3.654,1,1,610,3.0,420.0,2,"chocolate, chips, ice cream",1.0,3.0,1,1.0,I eat about three times a day with some snacks...,2,2.0,I eat out more than usual.,1,2,2,2.0,4,1.0,2.0,Italian,1,1.0,"chicken and biscuits, beef soup, baked beans",1,4,4,4,5,"Grains, Veggies, (more of grains and veggies),...",Try to eat 5-6 small meals a day. While trying...,3,4.0,4,4,1.0,2.0,"Pasta, steak, chicken",4.0,4,1.0,1,4,4.0,3.0,1.0,1.0,2,725.0,690,Basketball,4,2,900,155,0.894638,0,1,1,1,0,1,0,1,1,0.85461
2,3.3,1,1,720,4.0,420.0,2,"frozen yogurt, pizza, fast food",1.0,1.0,1,3.0,"toast and fruit for breakfast, salad for lunch...",3,1.0,sometimes choosing to eat fast food instead of...,1,3,2,3.0,5,2.0,2.0,Italian,1,3.0,"mac and cheese, pizza, tacos",1,5,3,5,6,usually includes natural ingredients; nonproce...,i would say my ideal diet is my current diet,6,6.0,5,5,7.0,2.0,"chicken and rice with veggies, pasta, some kin...",2.0,4,2.0,1,3,5.0,6.0,1.0,2.0,5,1165.0,500,none,5,1,900,I'm not answering this.,0.781846,1,0,0,0,0,0,1,1,1,0.712965


We'll use Vader again to try to detect the similarity to childhood foods and current comfort foods

In [63]:
comfort_minus_childhood =[]
for i, cols in df.iterrows():
    try:
        comfort_minus_childhood.append(nlp(cols['food_childhood']).similarity(nlp(cols['comfort_food'])))
    except:
        comfort_minus_childhood.append(0)

  after removing the cwd from sys.path.


In [64]:
df['comfort_vs_child'] = comfort_minus_childhood

A healthy diet consists of protein, vegetables and carbs we'll calculate the respondent's similarity to this statement

In [66]:
def healthy_meal_vs_proto(val):
    proto = 'protein, vegetables and carbs'
    try:
        return nlp(val).similarity(nlp(proto))
    except:
        return 0

In [67]:
df['healthy_deviance'] = df['healthy_meal'].map(healthy_meal_vs_proto)

  after removing the cwd from sys.path.


We'll also look at the similarity of the current diet to the respondent's ideal diet

In [68]:
ideal_vs_current=[]
for i, cols in df.iterrows():
    try:
        ideal_vs_current.append(nlp(cols['diet_current']).similarity(nlp(cols['ideal_diet'])))
    except:
        ideal_vs_current.append(0)

In [69]:
df['ideal_deviance'] = ideal_vs_current

In [70]:
df.head(3)

Unnamed: 0,GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,comfort_food,comfort_food_reasons_coded,cook,comfort_food_reasons_coded.1,cuisine,diet_current,diet_current_coded,drink,eating_changes,eating_changes_coded,eating_changes_coded1,eating_out,employment,ethnic_food,exercise,father_education,fav_cuisine,fav_cuisine_coded,fav_food,food_childhood,fries,fruit_day,grade_level,greek_food,healthy_feeling,healthy_meal,ideal_diet,ideal_diet_coded,income,indian_food,italian_food,life_rewarding,marital_status,meals_dinner_friend,mother_education,nutritional_check,on_off_campus,parents_cook,pay_meal_out,persian_food,self_perception_weight,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight,comfort_food_sim,pizza,ice_cream,chips,chocolate,mac_cheese,comfort_reason_bored,comfort_reason_sad,comfort_reason_stress,diet_sentiment,current_minus_change,comfort_vs_child,healthy_deviance,ideal_deviance
0,2.4,2,1,430,,315.0,1,none,9.0,2.0,9,1.0,eat good and exercise,1,1.0,eat faster,1,1,3,3.0,1,1.0,5.0,Other,3,1.0,rice and chicken,2,5,2,5,2,looks not oily,being healthy,8,5.0,5,5,1.0,1.0,"rice, chicken, soup",1.0,5,1.0,1,2,5.0,3.0,1.0,1.0,1,1165.0,345,car racing,5,1,1315,187,0.263932,0,0,0,0,0,0,0,0,2,0.730432,0.134412,0.351235,0.618729
1,3.654,1,1,610,3.0,420.0,2,"chocolate, chips, ice cream",1.0,3.0,1,1.0,I eat about three times a day with some snacks...,2,2.0,I eat out more than usual.,1,2,2,2.0,4,1.0,2.0,Italian,1,1.0,"chicken and biscuits, beef soup, baked beans",1,4,4,4,5,"Grains, Veggies, (more of grains and veggies),...",Try to eat 5-6 small meals a day. While trying...,3,4.0,4,4,1.0,2.0,"Pasta, steak, chicken",4.0,4,1.0,1,4,4.0,3.0,1.0,1.0,2,725.0,690,Basketball,4,2,900,155,0.894638,0,1,1,1,0,1,0,1,1,0.85461,0.782912,0.883655,0.75346
2,3.3,1,1,720,4.0,420.0,2,"frozen yogurt, pizza, fast food",1.0,1.0,1,3.0,"toast and fruit for breakfast, salad for lunch...",3,1.0,sometimes choosing to eat fast food instead of...,1,3,2,3.0,5,2.0,2.0,Italian,1,3.0,"mac and cheese, pizza, tacos",1,5,3,5,6,usually includes natural ingredients; nonproce...,i would say my ideal diet is my current diet,6,6.0,5,5,7.0,2.0,"chicken and rice with veggies, pasta, some kin...",2.0,4,2.0,1,3,5.0,6.0,1.0,2.0,5,1165.0,500,none,5,1,900,I'm not answering this.,0.781846,1,0,0,0,0,0,1,1,1,0.712965,0.839322,0.786113,0.262112


We'll look for the most common sports by finding nouns in the sport column 

In [71]:
def change_sport(val):
    try:
        holder = [w.lower() for (w, pos) in TextBlob(val).pos_tags if pos[0] == 'N']
        return " ".join(holder)
    except:
        return 'None'
df['type_sports'] = df['type_sports'].map(change_sport)

sport_listed = {}

for words in df['type_sports']:
    try:
        for word in words.split(' '):
            if word not in sport_listed.keys():
                sport_listed[word] = 1
            else:
                sport_listed[word] += 1
    except:
        pass

In [72]:
d = Counter(sport_listed)
d.most_common(6)

[('None', 20),
 ('none', 19),
 ('hockey', 14),
 ('soccer', 11),
 ('basketball', 10),
 ('softball', 10)]

In [73]:
def sport_fixer(val):
    try:
        if 'hockey' in val.lower():
            return 'hockey'
        elif 'soccer' in val.lower():
            return 'soccer'
        elif 'basketball' in val.lower():
            return 'basketball'
        elif 'softball' in val.lower():
            return 'softball'
        else:
            return 'None or Other'
    except:
        return 'None or Other'

df['type_sports'] = df['type_sports'].map(sport_fixer)
df['type_sports']= df['type_sports'].fillna('None or')


In [74]:
df['type_sports'].value_counts(dropna=False)

None or Other    81
hockey           14
soccer           10
basketball        9
softball          9
Name: type_sports, dtype: int64

We'll drop the categorical columns that we no longer need

In [75]:
df.drop(columns=['comfort_food','diet_current','eating_changes','food_childhood','healthy_meal','ideal_diet','meals_dinner_friend'], inplace=True)

In [76]:
df.head(3)

Unnamed: 0,GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,comfort_food_reasons_coded,cook,comfort_food_reasons_coded.1,cuisine,diet_current_coded,drink,eating_changes_coded,eating_changes_coded1,eating_out,employment,ethnic_food,exercise,father_education,fav_cuisine,fav_cuisine_coded,fav_food,fries,fruit_day,grade_level,greek_food,healthy_feeling,ideal_diet_coded,income,indian_food,italian_food,life_rewarding,marital_status,mother_education,nutritional_check,on_off_campus,parents_cook,pay_meal_out,persian_food,self_perception_weight,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight,comfort_food_sim,pizza,ice_cream,chips,chocolate,mac_cheese,comfort_reason_bored,comfort_reason_sad,comfort_reason_stress,diet_sentiment,current_minus_change,comfort_vs_child,healthy_deviance,ideal_deviance
0,2.4,2,1,430,,315.0,1,9.0,2.0,9,1.0,1,1.0,1,1,3,3.0,1,1.0,5.0,Other,3,1.0,2,5,2,5,2,8,5.0,5,5,1.0,1.0,1.0,5,1.0,1,2,5.0,3.0,1.0,1.0,1,1165.0,345,None or Other,5,1,1315,187,0.263932,0,0,0,0,0,0,0,0,2,0.730432,0.134412,0.351235,0.618729
1,3.654,1,1,610,3.0,420.0,2,1.0,3.0,1,1.0,2,2.0,1,2,2,2.0,4,1.0,2.0,Italian,1,1.0,1,4,4,4,5,3,4.0,4,4,1.0,2.0,4.0,4,1.0,1,4,4.0,3.0,1.0,1.0,2,725.0,690,basketball,4,2,900,155,0.894638,0,1,1,1,0,1,0,1,1,0.85461,0.782912,0.883655,0.75346
2,3.3,1,1,720,4.0,420.0,2,1.0,1.0,1,3.0,3,1.0,1,3,2,3.0,5,2.0,2.0,Italian,1,3.0,1,5,3,5,6,6,6.0,5,5,7.0,2.0,2.0,4,2.0,1,3,5.0,6.0,1.0,2.0,5,1165.0,500,None or Other,5,1,900,I'm not answering this.,0.781846,1,0,0,0,0,0,1,1,1,0.712965,0.839322,0.786113,0.262112


For all columns that remain except for weight, we will fill null values with the mode

In [77]:
cols = list(df.columns[df.isnull().any()])
cols.remove('weight')

In [78]:
for col in cols:
    df[col] = df[col].fillna(df[col].mode())

For weight, we only want numeric values, we'll make all non-numeric values nan and assign the mean to these 

In [79]:
import re

def fix_num(val):
    try:
        return int(re.findall(r'\d+', val)[0])
    except:
        return np.nan


In [80]:
df['weight']= df['weight'].map(fix_num)
df['weight'] = df['weight'].fillna(df['weight'].mean())

We also see that there are GPAs that are invalid as well, We'll use the same function to find non-numeric GPAs and delete them from the dataset

In [81]:
df['GPA']= df['GPA'].map(fix_num)

In [82]:
df = df[df['GPA'].notnull()]

In [83]:
df.head(3)

Unnamed: 0,GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,comfort_food_reasons_coded,cook,comfort_food_reasons_coded.1,cuisine,diet_current_coded,drink,eating_changes_coded,eating_changes_coded1,eating_out,employment,ethnic_food,exercise,father_education,fav_cuisine,fav_cuisine_coded,fav_food,fries,fruit_day,grade_level,greek_food,healthy_feeling,ideal_diet_coded,income,indian_food,italian_food,life_rewarding,marital_status,mother_education,nutritional_check,on_off_campus,parents_cook,pay_meal_out,persian_food,self_perception_weight,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight,comfort_food_sim,pizza,ice_cream,chips,chocolate,mac_cheese,comfort_reason_bored,comfort_reason_sad,comfort_reason_stress,diet_sentiment,current_minus_change,comfort_vs_child,healthy_deviance,ideal_deviance
0,2.0,2,1,430,3.0,315.0,1,9.0,2.0,9,1.0,1,1.0,1,1,3,3.0,1,1.0,5.0,Other,3,1.0,2,5,2,5,2,8,5.0,5,5,1.0,1.0,1.0,5,1.0,1,2,5.0,3.0,1.0,1.0,1,1165.0,345,None or Other,5,1,1315,187.0,0.263932,0,0,0,0,0,0,0,0,2,0.730432,0.134412,0.351235,0.618729
1,3.0,1,1,610,3.0,420.0,2,1.0,3.0,1,1.0,2,2.0,1,2,2,2.0,4,1.0,2.0,Italian,1,1.0,1,4,4,4,5,3,4.0,4,4,1.0,2.0,4.0,4,1.0,1,4,4.0,3.0,1.0,1.0,2,725.0,690,basketball,4,2,900,155.0,0.894638,0,1,1,1,0,1,0,1,1,0.85461,0.782912,0.883655,0.75346
2,3.0,1,1,720,4.0,420.0,2,1.0,1.0,1,3.0,3,1.0,1,3,2,3.0,5,2.0,2.0,Italian,1,3.0,1,5,3,5,6,6,6.0,5,5,7.0,2.0,2.0,4,2.0,1,3,5.0,6.0,1.0,2.0,5,1165.0,500,None or Other,5,1,900,159.075,0.781846,1,0,0,0,0,0,1,1,1,0.712965,0.839322,0.786113,0.262112


In [84]:
df.shape

(121, 65)

Any column with a significant amount of nulls will now be deleted (I am defining significant as ~10% or more) 

In [87]:
df[df.columns[df.isnull().sum() > 10]].isnull().sum()

comfort_food_reasons_coded    19
cuisine                       15
exercise                      12
dtype: int64

In [88]:
df.drop(columns=['comfort_food_reasons_coded','cuisine','exercise'], inplace=True)

Any remaining columns with null values will be assigned the mode of the column

In [None]:
for col in df.columns[df.isnull().sum()>0]:
    df[col] = df[col].fillna(df[col].mode())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Our last step will be to make all of the terms the same order of magnitude, to do so if the max value of any column is greater than 10 we will make the max value scale to 1

In [89]:
over_10 = df[df.select_dtypes(include=np.number).columns].max() > 10
over_10_cols = over_10[over_10==True].index
for col in over_10_cols:
    df[col] = df[col]/ df[col].max()

Our data cleaning is now complete

In [90]:
df.head(5)

Unnamed: 0,GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,cook,comfort_food_reasons_coded.1,diet_current_coded,drink,eating_changes_coded,eating_changes_coded1,eating_out,employment,ethnic_food,father_education,fav_cuisine,fav_cuisine_coded,fav_food,fries,fruit_day,grade_level,greek_food,healthy_feeling,ideal_diet_coded,income,indian_food,italian_food,life_rewarding,marital_status,mother_education,nutritional_check,on_off_campus,parents_cook,pay_meal_out,persian_food,self_perception_weight,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight,comfort_food_sim,pizza,ice_cream,chips,chocolate,mac_cheese,comfort_reason_bored,comfort_reason_sad,comfort_reason_stress,diet_sentiment,current_minus_change,comfort_vs_child,healthy_deviance,ideal_deviance
0,2.0,2,1,0.597222,3.0,0.321429,1,2.0,9,1,1.0,1,0.076923,3,3.0,1,5.0,Other,3,1.0,2,5,2,5,2,8,5.0,5,5,1.0,1.0,1.0,5,1.0,1,2,5.0,3.0,1.0,1.0,1,1.0,0.405882,None or Other,5,1,1.0,0.70566,0.263932,0,0,0,0,0,0,0,0,2,0.730432,0.134412,0.351235,0.618729
1,3.0,1,1,0.847222,3.0,0.428571,2,3.0,1,2,2.0,1,0.153846,2,2.0,4,2.0,Italian,1,1.0,1,4,4,4,5,3,4.0,4,4,1.0,2.0,4.0,4,1.0,1,4,4.0,3.0,1.0,1.0,2,0.622318,0.811765,basketball,4,2,0.684411,0.584906,0.894638,0,1,1,1,0,1,0,1,1,0.85461,0.782912,0.883655,0.75346
2,3.0,1,1,1.0,4.0,0.428571,2,1.0,1,3,1.0,1,0.230769,2,3.0,5,2.0,Italian,1,3.0,1,5,3,5,6,6,6.0,5,5,7.0,2.0,2.0,4,2.0,1,3,5.0,6.0,1.0,2.0,5,1.0,0.588235,None or Other,5,1,0.684411,0.600283,0.781846,1,0,0,0,0,0,1,1,1,0.712965,0.839322,0.786113,0.262112
3,3.0,1,1,0.597222,3.0,0.428571,2,2.0,2,2,2.0,1,0.230769,2,3.0,5,2.0,Other,3,1.0,2,4,4,5,7,2,6.0,5,5,2.0,2.0,4.0,2,1.0,1,2,5.0,5.0,1.0,2.0,5,0.622318,0.811765,None or Other,3,1,1.0,0.90566,0.473967,1,1,0,0,1,1,0,0,2,0.68716,0.784329,0.658415,0.543849
4,3.0,1,1,1.0,2.0,0.428571,2,1.0,1,2,2.0,3,0.307692,2,2.0,4,4.0,Italian,1,3.0,1,4,4,4,6,2,6.0,2,5,1.0,1.0,5.0,3,1.0,1,4,2.0,4.0,1.0,1.0,4,0.806867,0.588235,softball,4,2,0.577947,0.716981,0.855828,0,1,1,1,0,1,0,1,1,0.931122,0.796545,0.811562,0.908637
