# **Data Cleaning & EDA Practice** 
by Daniel Lee

The main purpose of this project is to clean the data and exmplanatory data analysis

In [803]:
import pandas as pd
import numpy as np
import math
import nltk
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
# from nltk.corpus import stopwords
import re

The data file is in the data directory and file type is csv file.

## **Data Description**


In [804]:
csv_file_path = "../data/Food_choices/food_coded.csv"
data_ori = pd.read_csv(csv_file_path, delimiter=',', quotechar='"', encoding='utf-8')
data_cleaning = data_ori.copy()

In [805]:
data_cleaning.info()
# data_cleaning.drop_duplicates() # check duplicate

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125 entries, 0 to 124
Data columns (total 61 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   GPA                           123 non-null    object 
 1   Gender                        125 non-null    int64  
 2   breakfast                     125 non-null    int64  
 3   calories_chicken              125 non-null    int64  
 4   calories_day                  106 non-null    float64
 5   calories_scone                124 non-null    float64
 6   coffee                        125 non-null    int64  
 7   comfort_food                  124 non-null    object 
 8   comfort_food_reasons          123 non-null    object 
 9   comfort_food_reasons_coded    106 non-null    float64
 10  cook                          122 non-null    float64
 11  comfort_food_reasons_coded.1  125 non-null    int64  
 12  cuisine                       108 non-null    float64
 13  diet_

Pandas.Dataframe.select_dtypes() function: Return a subset of the dataframe based on the datatypes of column.

In [806]:
# pandas.Dataframe.select_dtypes() function: Return a subset of the DataFrame’s columns based on the column dtypes.

obj_df = data_cleaning.select_dtypes(include=['object'])
num_df = data_cleaning.select_dtypes(exclude=['object'])

#helpfunction to seperate categorical and numerical features


'''separates non-numeric and numeric columns'''
print("Non-Numeric columns:")
for col in obj_df:
    print(col)
print("\nNumeric columns:")
for col in num_df:
    print(col)



Non-Numeric columns:
GPA
comfort_food
comfort_food_reasons
diet_current
eating_changes
father_profession
fav_cuisine
food_childhood
healthy_meal
ideal_diet
meals_dinner_friend
mother_profession
type_sports
weight

Numeric columns:
Gender
breakfast
calories_chicken
calories_day
calories_scone
coffee
comfort_food_reasons_coded
cook
comfort_food_reasons_coded.1
cuisine
diet_current_coded
drink
eating_changes_coded
eating_changes_coded1
eating_out
employment
ethnic_food
exercise
father_education
fav_cuisine_coded
fav_food
fries
fruit_day
grade_level
greek_food
healthy_feeling
ideal_diet_coded
income
indian_food
italian_food
life_rewarding
marital_status
mother_education
nutritional_check
on_off_campus
parents_cook
pay_meal_out
persian_food
self_perception_weight
soup
sports
thai_food
tortilla_calories
turkey_calories
veggies_day
vitamins
waffle_calories


### There are a few problems for missing data method

    * For example, by dropping rows/columns, you’re essentially losing information that might be useful for prediction

    * On the other hand, imputing values will introduce bias to your data but it still might better than removing your features.

___
Here is a great analogy for this dilemma in this article by Elite Data Science.

Missing data is like missing a puzzle piece. If you drop it, that’s like pretending the puzzle slot isn’t there. If you impute it, that’s like trying to squeeze in a piece from somewhere else in the puzzle.

source:
https://medium.com/bitgrit-data-science-publication/data-cleaning-with-python-f6bc3da64e45
___

In [807]:
missing_per_column = data_cleaning.isnull().sum()
# .sum() funciton return series(dataframe with one column; have different parameter from dataframe)

print(missing_per_column.sort_values(ascending = False))

type_sports                     26
calories_day                    19
comfort_food_reasons_coded      19
cuisine                         17
exercise                        13
                                ..
comfort_food_reasons_coded.1     0
diet_current_coded               0
ethnic_food                      0
eating_out                       0
fruit_day                        0
Length: 61, dtype: int64


## Droping Feature

In [808]:
'''
Threshold for drop attributes would be 30% for big data and 20% for small data.
'''

'''
shape function give give dimension of the dataframe which is [x,y]. length:x -> index(row) = 0, width(column):y -> index = 1
Thus, we can use that info to find number of data and attributes.
'''
# dropna(thresh): Require that many non-NA values.

data_cleaning = data_ori.copy()
thresh4data_col = data_cleaning.shape[0]*0.8
thresh4data_row = data_cleaning.shape[1]*0.8

data_cleaning = data_cleaning.dropna(axis =1, thresh=thresh4data_col)
# dropna() function returns a new DataFrame with missing values removed and does not modify the original DataFrame in place.
data_cleaning = data_cleaning.dropna(axis =0, thresh=thresh4data_row)

    When we use the dtypes function to distinguish between categorical and numerical data, we may find that some features that appear to be numerical are instead assigned to the categorical type.
    
    We have observed that the features, GPA and Weight, have the potential to be represented as numerical data. However, the raw data in these features require cleaning to achieve this representation.

In [809]:
obj_df.head()

Unnamed: 0,GPA,comfort_food,comfort_food_reasons,diet_current,eating_changes,father_profession,fav_cuisine,food_childhood,healthy_meal,ideal_diet,meals_dinner_friend,mother_profession,type_sports,weight
0,2.4,none,we dont have comfort,eat good and exercise,eat faster,profesor,Arabic cuisine,rice and chicken,looks not oily,being healthy,"rice, chicken, soup",unemployed,car racing,187
1,3.654,"chocolate, chips, ice cream","Stress, bored, anger",I eat about three times a day with some snacks...,I eat out more than usual.,Self employed,Italian,"chicken and biscuits, beef soup, baked beans","Grains, Veggies, (more of grains and veggies),...",Try to eat 5-6 small meals a day. While trying...,"Pasta, steak, chicken",Nurse RN,Basketball,155
2,3.3,"frozen yogurt, pizza, fast food","stress, sadness","toast and fruit for breakfast, salad for lunch...",sometimes choosing to eat fast food instead of...,owns business,italian,"mac and cheese, pizza, tacos",usually includes natural ingredients; nonproce...,i would say my ideal diet is my current diet,"chicken and rice with veggies, pasta, some kin...",owns business,none,I'm not answering this.
3,3.2,"Pizza, Mac and cheese, ice cream",Boredom,"College diet, cheap and easy foods most nights...",Accepting cheap and premade/store bought foods,Mechanic,Turkish,"Beef stroganoff, tacos, pizza","Fresh fruits& vegetables, organic meats","Healthy, fresh veggies/fruits & organic foods",Grilled chicken \rStuffed Shells\rHomemade Chili,Special Education Teacher,,"Not sure, 240"
4,3.5,"Ice cream, chocolate, chips","Stress, boredom, cravings",I try to eat healthy but often struggle becaus...,I have eaten generally the same foods but I do...,IT,Italian,"Pasta, chicken tender, pizza","A lean protein such as grilled chicken, green ...",Ideally I would like to be able to eat healthi...,"Chicken Parmesan, Pulled Pork, Spaghetti and m...",Substance Abuse Conselor,Softball,190


    We have observed that certain data columns are currently stored as object data types; however, they are expected to be represented as floats. 
    
    To detect non-numerical data within a column, we can examine the unique values present in that specific column.

In [810]:
print(obj_df['GPA'].unique())
print(obj_df['weight'].unique())

['2.4' '3.654' '3.3' '3.2' '3.5' '2.25' '3.8' '3.904' '3.4' '3.6' '3.1'
 nan '4' '2.2' '3.87' '3.7' '3.9' '2.8' '3' '3.65' '3.89' '2.9' '3.605'
 '3.83' '3.292' '3.35' 'Personal ' '2.6' '3.67' '3.73' '3.79 bitch' '2.71'
 '3.68' '3.75' '3.92' 'Unknown' '3.77' '3.63' '3.882']
['187' '155' "I'm not answering this. " 'Not sure, 240' '190' '180' '137'
 '125' '116' '110' '264' '123' '185' '145' '170' '135' '165' '175' '195'
 '105' '160' '167' '115' '205' nan '128' '150' '140' '120' '100' '113'
 '168' '169' '200' '265' '192' '118' '210' '112' '144 lbs' '130' '127'
 '129' '260' '184' '230' '138' '156']


In [811]:
##If you examine the data types of the values, you'll notice that the majority of them are stored as strings, even though they appear to be numerical in nature.

# Use regular expression to find the numeric part
for x in obj_df['GPA'].unique():

    if not isinstance(x, float):
        match = re.search(r'\d+\.\d+|\d+', x)

        if match:
            # Extract and convert the matched part to a float
            convert_float = float(match.group())
        obj_df['GPA'] =obj_df['GPA'].replace(x, convert_float)


for x in obj_df['weight'].unique():

    if not isinstance(x, float):
        match = re.search(r'\d+\.\d+|\d+', x)

        if match:
            # Extract and convert the matched part to a float
            convert_float = float(match.group())
        obj_df['weight'] =obj_df['weight'].replace(x, convert_float)



In [812]:
print(obj_df['GPA'].unique())
print(obj_df['weight'].unique())

[2.4   3.654 3.3   3.2   3.5   2.25  3.8   3.904 3.4   3.6   3.1     nan
 4.    2.2   3.87  3.7   3.9   2.8   3.    3.65  3.89  2.9   3.605 3.83
 3.292 3.35  2.6   3.67  3.73  3.79  2.71  3.68  3.75  3.92  3.77  3.63
 3.882]
[187. 155. 240. 190. 180. 137. 125. 116. 110. 264. 123. 185. 145. 170.
 135. 165. 175. 195. 105. 160. 167. 115. 205.  nan 128. 150. 140. 120.
 100. 113. 168. 169. 200. 265. 192. 118. 210. 112. 144. 130. 127. 129.
 260. 184. 230. 138. 156.]


In [813]:
# The drop() method will remove an entire row from a DataFrame when given a single index label, instead of removing a single value from a specific column.

data_cleaning[['GPA', 'weight']] = obj_df[['GPA', 'weight']]
obj_df = obj_df.drop(columns=['GPA', 'weight'],axis=1) 
num_df = data_cleaning.select_dtypes(exclude=['object'])


In [814]:
obj_columns = data_cleaning.select_dtypes(include="object").columns

# use lambda function tp convert it to lowercase
data_cleaning[obj_columns] = data_cleaning[obj_columns].apply(lambda x: (x.str.lower()))

    
print(data_cleaning[obj_columns].head())
# print(data_cleaning['fav_cuisine'].unique())

                       comfort_food        comfort_food_reasons   
0                              none       we dont have comfort   \
1       chocolate, chips, ice cream        stress, bored, anger   
2   frozen yogurt, pizza, fast food             stress, sadness   
3  pizza, mac and cheese, ice cream                     boredom   
4      ice cream, chocolate, chips   stress, boredom, cravings    

                                        diet_current   
0                              eat good and exercise  \
1  i eat about three times a day with some snacks...   
2  toast and fruit for breakfast, salad for lunch...   
3  college diet, cheap and easy foods most nights...   
4  i try to eat healthy but often struggle becaus...   

                                      eating_changes father_profession   
0                                        eat faster          profesor   \
1                        i eat out more than usual.     self employed    
2  sometimes choosing to eat fast food

In the codebook, there are comments that some features are ideal for perform NLP(Nautral Language Processing). Thus, I used 3 features: comfort_food, comfort_food_reasons and diet_current

I used NLTK python library




## NLTK implementation (NLP)

In [815]:
## NOTE: Tools for NLTK package

tokenizer = TreebankWordTokenizer()
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer() 

In [816]:
'''packages to download '''

# nltk.download("stopwords")
# nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker")
# nltk.download("words")

'packages to download '

https://www.ars.usda.gov/ARSUserFiles/80400530/pdf/1112/food_category_list.pdf


select the data that we are going to apply NLP

In [817]:

data_nlp = data_cleaning[['comfort_food', 'comfort_food_reasons', 'diet_current']]

data_nlp = data_nlp.astype(str)
# data_nlp = data_nlp.applymap(lambda x: x.strip()) # remove space
data_nlp_prev = data_nlp ## copies of orginal data to compare before and after

        

### Chunk Function

In [818]:
'''This is main function that chunk the sentence into word with POS tag using parser. It will also create tree that shows structure of sentence'''
def chunk_NP(text, origin=True):    
    grammar =  r'''
    NP: {<NN.*>|<RB>?<NN.*>+|<NN><CC><NN>|<DT>?<NN.*>+<POS>?}
    '''

    '''
    The rule states that whenever the chunk finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN) then the Noun Phrase(NP) chunk should be formed.
    '''
    
    if type(text) is not str:
        return ['none']
    else:    
        
        tokens = nltk.word_tokenize(text)
        tokens = [lemmatizer.lemmatize(token) for token in tokens]
        lotr_pos_tags = nltk.pos_tag(tokens)
        chunk_parser = nltk.RegexpParser(grammar)
        tree = chunk_parser.parse(lotr_pos_tags)

        noun_phrases = []
        for subtree in tree.subtrees():
            if subtree.label() =='NP':
                np_parts = []
                for leaf in subtree.leaves():
                    np_parts.append(leaf[0])
                noun_phrases.append(" ".join(np_parts))

        
        return noun_phrases


In [819]:
def get_Tree(text):
    grammar =  r'''
    NP: {<NN.*>|<RB>?<NN.*>+|<NN><CC><NN>|<DT>?<NN.*>+<POS>?}
    '''
    
    
    '''RB for extract 'and', NN.* for any word tagged as a noun (NN) along with any additional modifiers
    '''

    tokens = nltk.word_tokenize(text)
    lotr_pos_tags = nltk.pos_tag(tokens)
    chunk_parser = nltk.RegexpParser(grammar)
    tree = chunk_parser.parse(lotr_pos_tags)

    return tree

In [820]:
comfort_food_col1 = []
comfort_food_col2 = []


for x in range(len(data_nlp["comfort_food"])):
    a = chunk_NP(data_nlp.loc[x,'comfort_food'])
    comfort_food_col1.append(a)

    b = chunk_NP(data_nlp.loc[x, 'comfort_food_reasons'])
    comfort_food_col2.append(b)

flat_list_1st = [item for sublist in comfort_food_col1 for item in sublist]

print(sorted(set(flat_list_1st))) #TODO check the output


['/', 'almond', 'any kind', 'anything', 'banana', 'bar', 'beef', 'bread', 'bread/crackers', 'broccoli', 'brownie', 'burger', 'burrito', 'butter', 'cake', 'candy', 'capps', 'carrot', 'cereal', 'cheese', 'cheeseburger', 'cheesecake', 'cheez-its', 'chex-mix', 'chicken', 'chilli', 'chip', 'chipotle', 'chocolate', 'coffee', 'cookie', 'cooky', 'cornbread', 'cottage', 'crea', 'cream', 'cream/milkshake', 'cup', 'curry', 'deli', 'dessets', 'dip', 'dish', 'donut', 'doritos', 'dough', 'doughnut', 'egg', 'fast food', 'finger', 'fire', 'food', 'fritos', 'fruit', 'fry', 'grandma', 'grape', 'hamburger', 'home', 'homemade', 'ice', 'ice-cream', 'icecream', 'jerky', 'jims', 'kat', 'kit', 'lasagna', 'lasagne', 'mac', 'macaroni', 'macaroon', 'masala', 'mcdonalds', 'meatball', 'milkshake', 'mix', 'moe', 'moes', 'mozzarella', 'naan', 'nan', 'none', 'noodle', 'nugget', 'nuggs', 'nutella', 'omelet', 'pasta', 'peanut', 'pepper', 'pepsi', 'pie', 'pierogies', 'pizza', 'plantain', 'pop', 'popcorn', 'pot', 'potato

### These are the results from the unique value from comfort_food:

['/', 'almond', 'any kind', 'banana sandwich', 'beef jerky', 'bread', 'bread/crackers', 'broccoli', 'brownie', 'burger', 'burrito', 'butter', 'butter naan', 'cake', 'candy', 'candy bar', 'candy pop chocolate chipotle moe', 'carrot', 'cereal', 'cheese', 'cheeseburger', 'cheesecake', 'cheez-its', 'chex-mix', 'chicken', 'chicken curry', 'chicken finger', 'chicken nugget', 'chicken nuggs', 'chicken wing', 'chilli', 'chip', 'chocolate', 'chocolate bar', 'chocolate brownie', 'chocolate ice cream', 'coffee', 'cookie dough', 'cooky', 'cornbread', 'cottage cheese', 'cup', 'deli sandwhich', 'dessets', 'dip', 'dish', 'donut', 'doritos', 'doughnut', 'egg', 'fast food', 'fire', 'food', 'fritos', 'fruit', 'fruit snack', 'fry', 'grandma', 'grandma homemade chocolate cake anything homemade', 'grape', 'hamburger', 'home', 'ice capps', 'ice crea', 'ice cream', 'ice cream/milkshake', 'ice-cream', 'icecream', 'kit kat', 'lasagna', 'lasagne', 'mac', 'macaroni', 'macaroon', 'mcdonalds', 'meatball sub', 'milkshake', 'mix', 'moes', 'mozzarella stick', 'nan', 'none', 'noodle', 'nugget', 'nutella', 'omelet', 'pasta', 'peanut butter', 'peanut butter sandwich', 'pepper', 'pepsi', 'pierogies', 'pizza', 'pizza chocolate chip', 'pizza cooky steak', 'plantain chip', 'pop', 'popcorn', 'pot pie', 'potato', 'potato chip', 'potato soup', 'pretzals', 'pretzel', 'protein bar', 'quinoa', 'ranch', 'reese', 'rice', 'ritz', 'salsa', 'salt', 'salty snack', 'slim jims', 'snack', 'soda', 'soup', 'spaghetti', 'sponge candy', 'squash', 'sub', 'sushi', 'sweet', 'terra chip', 'tikka masala', 'toast', 'tomato soup', 'truffle', 'tuna sandwich', 'twizzlers', 'vinegar chip', 'watermelon', 'wine', 'wing', 'yogurt']



#### I manually removed certain words from the list that were not appropriate for describing food.


In [821]:
list_keep_1st = ['almond', 'banana sandwich', 'beef jerky', 'bread',  'broccoli', 'brownie', 'burger', 'burrito', 'butter', 'butter naan', 'cake', 'candy', 'candy bar','carrot', 'cereal', 'cheese', 'cheeseburger', 'cheesecake', 'cheez-its', 'chex-mix', 'chicken', 'chicken curry', 'chicken finger', 'chicken nugget', 'chicken wing',  'chip', 'chocolate', 'chocolate bar', 'chocolate brownie', 'chocolate ice cream', 'coffee', 'cookie dough', 'cooky', 'cornbread', 'cottage cheese', 'deli sandwhich', 'doritos', 'doughnut', 'egg', 'fast food', 'fritos', 'fruit', 'fruit snack', 'fry', 'grape', 'hamburger', 'ice capps', 'ice cream', 'kit kat', 'lasagna', 'macaroni', 'macaroon', 'mcdonalds', 'meatball sub', 'milkshake', 'moes', 'mozzarella stick', 'none', 'noodle', 'nugget', 'nutella', 'omelet', 'pasta', 'peanut butter', 'peanut butter sandwich', 'pepper', 'pepsi', 'pierogies', 'pizza', 'pizza chocolate chip', 'pizza cooky steak', 'plantain chip', 'popcorn', 'pot pie', 'potato', 'potato chip', 'potato soup', 'pretzel', 'protein bar', 'quinoa', 'ranch', 'reese', 'rice', 'ritz', 'salsa', 'salty snack', 'slim jims', 'snack', 'soda', 'soup', 'spaghetti', 'sponge candy', 'squash', 'sub', 'sushi', 'terra chip', 'tikka masala', 'toast', 'tomato soup', 'truffle', 'tuna sandwich', 'twizzlers', 'vinegar chip', 'watermelon', 'wine', 'wing', 'yogurt']


list_remove_1st = list(set(flat_list_1st)-set(list_keep_1st))
print(sorted(list_remove_1st)) ##TODO Check what words werer sorted out 


['/', 'any kind', 'anything', 'banana', 'bar', 'beef', 'bread/crackers', 'capps', 'chilli', 'chipotle', 'cookie', 'cottage', 'crea', 'cream', 'cream/milkshake', 'cup', 'curry', 'deli', 'dessets', 'dip', 'dish', 'donut', 'dough', 'finger', 'fire', 'food', 'grandma', 'home', 'homemade', 'ice', 'ice-cream', 'icecream', 'jerky', 'jims', 'kat', 'kit', 'lasagne', 'mac', 'masala', 'meatball', 'mix', 'moe', 'mozzarella', 'naan', 'nan', 'nuggs', 'peanut', 'pie', 'plantain', 'pop', 'pot', 'pretzals', 'protein', 'salt', 'salty', 'sandwhich', 'sandwich', 'slim', 'sponge', 'steak', 'stick', 'sweet', 'terra', 'tikka', 'tomato', 'tuna', 'vinegar']


In [822]:
def find_sentences_with_word(sentences, target_words):
    matched_sentences = []
    count =0
    for target_word in target_words:
        pattern = r'\b{}\b'.format(re.escape(target_word))
        matched_sentence = [sentence for sentence in sentences if re.search(pattern, sentence, re.IGNORECASE)]
        matched_sentences.append(matched_sentence)
        count = count+1
        print(target_word, matched_sentence, ": \n")
    return matched_sentences

In [823]:
import difflib

def find_sentences_with_word(sentences, target_words, threshold=0.85):
    matched_sentences = []
    
    for target_word in target_words:
        matched_sentence = []
        
        for sentence in sentences:
            words = re.findall(r'\w+', sentence.lower())
            
            for word in words:
                similarity = difflib.SequenceMatcher(None, word, target_word).ratio()
                
                if similarity >= threshold:
                    matched_sentence.append(sentence)
                    break  # No need to check further words in the same sentence
                
        matched_sentences.append(matched_sentence)
        print(target_word, matched_sentence, ": \n")

    return matched_sentences

In [824]:
# in a format of list, you can see some strings are not the same as displayed in the pandas dataframe
print(data_nlp['comfort_food'].tolist())
 

['none', 'chocolate, chips, ice cream', 'frozen yogurt, pizza, fast food', 'pizza, mac and cheese, ice cream', 'ice cream, chocolate, chips ', 'candy, brownies and soda.', 'chocolate, ice cream, french fries, pretzels', 'ice cream, cheeseburgers, chips.', 'donuts, ice cream, chips', 'mac and cheese, chocolate, and pasta ', 'pasta, grandma homemade chocolate cake anything homemade ', 'chocolate, pasta, soup, chips, popcorn', 'cookies, popcorn, and chips', 'ice cream, cake, chocolate', 'pizza, fruit, spaghetti, chicken and potatoes  ', 'cookies, donuts, candy bars', 'saltfish, candy and kit kat ', 'chips, cookies, ice cream', 'chocolate, ice crea ', 'pizza, wings, chinese', 'fast food, pizza, subs', 'chocolate, sweets, ice cream', 'burgers, chips, cookies', 'chilli, soup, pot pie', 'soup, pasta, brownies, cake', 'chocolate, ice cream/milkshake, cookies', 'chips, ice cream, microwaveable foods ', 'chicken fingers, pizza ', 'cookies, hot chocolate, beef jerky', 'tomato soup, pizza, fritos,

In [825]:
# data with wrong format in csv file

pattern = r"\r"

old_values = []
for x in data_nlp['comfort_food']:
    if re.search(pattern, x):
        old_values.append(x)

data_nlp['comfort_food'] = [old_value.replace('\r', ',') for old_value in data_nlp['comfort_food']]  #NOTE the strip() method is used to remove any leading or trailing spaces that might be present around the comma.

print(old_values)


["candy\rpop\rchocolate \rchipotle \rmoe's ", 'burgers, indian and korean food\r', 'noodle ( any kinds of noodle), tuna sandwich, and egg.\r']


In [826]:
avadacadabra = find_sentences_with_word(data_nlp['comfort_food'], list_remove_1st)


tomato ['tomato soup, pizza, fritos, meatball sub, dr. pepper'] : 

kit ['saltfish, candy and kit kat '] : 

donut ['donuts, ice cream, chips', 'cookies, donuts, candy bars', 'little debbie snacks, donuts, pizza'] : 

sandwhich ['mac n cheese, peanut butter and banana sandwich, omelet', 'peanut butter sandwich, pretzals, garlic bread', 'pizza, pretzels, fruit snacks, deli sandwhich', 'noodle ( any kinds of noodle), tuna sandwich, and egg.,'] : 

sandwich ['mac n cheese, peanut butter and banana sandwich, omelet', 'peanut butter sandwich, pretzals, garlic bread', 'pizza, pretzels, fruit snacks, deli sandwhich', 'noodle ( any kinds of noodle), tuna sandwich, and egg.,'] : 

pot ['chilli, soup, pot pie'] : 

pie ['chilli, soup, pot pie'] : 

meatball ['tomato soup, pizza, fritos, meatball sub, dr. pepper'] : 

cottage ["dark chocolate, terra chips, reese's cups(dark chocolate), and bread/crackers with cottage cheese"] : 

fire ['chocolate, chips, ice cream, french fires, pizza'] : 

spong

Here's list of problem of unconsistant string format:

1. no comma between items
2. sepeated by slas or dash instead of comma
3. wrong spelling(typos)
4. too broad name (such as food, salt)
5. name with mark

chinese food, korean food 
mac and cheese vs mac n cheese


## 1st Modification after 1st output

Check list:

1. no comma between items
2. sepeated by slas or dash instead of comma -> done using regular expression
3. wrong spelling
4. too broad name (such as food, salt)
5. name with mark

In [827]:
from textblob import TextBlob

def correct_word_spelling(word):
    
    # word = Word(word)
    
    result = word.correct()
    
    print(result)

I tried to use one of the spelling check libary but it's hard for me to say that it correct Thus, I had to replace word with typos by tokens

In [828]:
'''This is an additional function I developed to sanitize informal language. Given my expertise, I couldn't come up with a more suitable alternative.'''

def informal2formal(text):

    
    text = need_RE(text)
    
    formal_tokens = []

    
    informal_to_formal = {
        "n" : "and",  
        "crea" : "cream", # typo
        'egg.' : "egg", # typo
        'fire' : 'fry', # typo
        'icecream' : 'ice cream', # no space
        'lasagne' : 'lasagna', # typo
        'dessets': 'dessert', # typo
        'macaroni': 'mac', #same word, commonize
        'nuggs': 'nugget',
        'donut': 'dougnut'
        # 'donut': 
        # "\r" : "," # input format : mult,iline cell
    }

    ## convert plural to singular
    tokens = nltk.word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word)for word in tokens]
    combined_word = []
    is_macheese = False

    for token in tokens:
        if informal_to_formal.get(token) is not None:
            formal_tokens.append(informal_to_formal.get(token))
        else:
            formal_tokens.append(token)
    
    ## convert plural to singular
    # formal_tokens = [lemmatizer.lemmatize(word)for word in formal_tokens]
    return formal_tokens
 


In [829]:
def get_Tree(text):
    grammar =  r'''
    NP: {<custom_tags_cc_front><CC><NN.*>?|<custom_tags_foodtypes><NN>|<JJ>?<NN.*>+<POS>?<NN|NNS>?|<RB>?<NN.*>+|<DT>?<NN.*>+<POS>?}
    '''

    custom_tags_foodtypes = ['chinese', 'korean', 'indian']
    custom_tags_waste = ['grandma', 'homemade', 'anything', 'any', 'kind','home', 'back']
    custom_tags_cc_front = ['mac', 'macaroni', 'salt']



    '''
    The rule states that whenever the chunk finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN) then the Noun Phrase(NP) chunk should be formed.
    '''

    if type(text) is not str:
        return ['none']
    else:
        tokens = informal2formal(text)
        
        lotr_pos_tags = [   
            (word, "custom_tags_foodtypes") if word in custom_tags_foodtypes
        else (word, "custom_tags_waste") if word in custom_tags_waste
        else (word, "custom_tags_cc_front") if word in custom_tags_cc_front
        else (word, tag) for word, tag in nltk.pos_tag(tokens)]

        chunk_parser = nltk.RegexpParser(grammar)
        tree = chunk_parser.parse(lotr_pos_tags)

    return tree

In [830]:
def need_RE(text):

    text = text.replace("-", " ")
    text = text.replace("/", ",")
    text = text.replace('&', "and")
    
    return text

In [831]:
test_string1 = 'mac and cheese, lasagna, chinese food '
test_string2 = 'cookies, mac-n-cheese, brownies, french fries, '
test_string3 = 'mac & cheese, frosted brownies, chicken nuggs'


In [832]:
''' This is main function that chunk the sentence into word with POS tag using parser. It will also create tree that shows structure of sentence '''
def chunk_NP(text):
    grammar =  r'''
    NP: {<custom_tags_cc_front><CC><NN.*>?|<custom_tags_foodtypes><NN>|<JJ>?<NN.*>+<POS>?<NN|NNS>?|<RB>?<NN.*>+|<DT>?<NN.*>+<POS>?}
    '''

    custom_tags_foodtypes = ['chinese', 'korean', 'indian']
    custom_tags_waste = ['grandma', 'homemade', 'anything', 'any', 'kind','home', 'back']
    custom_tags_cc_front = ['mac', 'macaroni', 'salt']


    '''
    The rule states that whenever the chunk finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN) then the Noun Phrase(NP) chunk should be formed.
    '''

    if type(text) is not str:
        return ['none']
    else:
        tokens = informal2formal(text)
        
        lotr_pos_tags = [   
            (word, "custom_tags_foodtypes") if word in custom_tags_foodtypes
        else (word, "custom_tags_waste") if word in custom_tags_waste
        else (word, "custom_tags_cc_front") if word in custom_tags_cc_front
        else (word, tag) for word, tag in nltk.pos_tag(tokens)]

        chunk_parser = nltk.RegexpParser(grammar)
        tree = chunk_parser.parse(lotr_pos_tags)
        noun_phrases = []

        pos_index = 0
        is_pos = False
        for subtree in tree.subtrees():
            if subtree.label() =='NP':
                is_pos = False
                np_parts = []
                for leaf in subtree.leaves():
                    np_parts.append(leaf[0])
                    if leaf[1]=='POS':
                        is_pos = True
                        pos_index = subtree.leaves().index(leaf)


                
                if is_pos == False:
                    noun_phrases.append(" ".join(np_parts))
                else: ##NOTE POS (genitive marker) is appropriately added directly after nouns without any intervening spaces.
                    parts = ""
                    for x in range(len(np_parts)):
                        if x == pos_index:
                            parts = parts+ np_parts[x]
                        else:
                            parts = parts+ " "+ np_parts[x] 
                    noun_phrases.append(parts)
        return noun_phrases


In [833]:
example1 = "dark chocolate, terra chips, reese 's cups(dark chocolate), and bread/crackers with cottage cheese"

c = (chunk_NP(example1))
print(c)
# c = [x.replace(" 's", "'s") for x in c]


['dark chocolate', 'terra chip', " reese's cup", 'dark chocolate', 'bread', 'cracker', 'cottage cheese']


In [834]:

comfort_food_col1 = []
comfort_food_col2 = []


for x in range(len(data_nlp["comfort_food"])):
    a = chunk_NP(data_nlp.loc[x,'comfort_food'])
    comfort_food_col1.append(a)

    b = chunk_NP(data_nlp.loc[x, 'comfort_food_reasons'])
    comfort_food_col2.append(b)

flat_list_2nd = [item for sublist in comfort_food_col1 for item in sublist]

print(list(set(flat_list_2nd)- set(list_keep_1st)))

['noodle soup', 'chex mix', " moe's", 'wegmans', 'garlic bread', 'stuffed pepper', 'sweet', 'peruvian food', " reese's cup", 'cracker', 'french fry', 'spaghetti squash', 'chinese food', 'dark chocolate', 'debbie snack', 'pasta dish', 'chocolate cake', 'dougnut', 'mac and cheese', 'pop', 'hot chocolate', 'frozen yogurt', 'chilli', 'microwaveable food', 'dip', 'salt and vinegar', 'sweet popcorn', 'korean food', 'chipotle', 'dessert', 'bagel ice capps', 'pretzals', 'nan']


In [835]:
list_keep_2nd = ['microwaveable food', 'mac and cheese', 'hot chocolate', 'dark chocolate', 'frozen yogurt', 'peruvian food', 'nan', 'chinese food', 'sweet', 'dip', 'pretzals', 'stuffed pepper', 'salt and vinegar', 'spaghetti squash', 'chocolate cake', 'chilli', 'french fry', 'cracker', 'sweet popcorn', 'garlic bread', 'pasta dish', " reese's cup", 'chipotle', 'chex mix', 'bagel ice capps', " moe's", 'korean food', 'dougnut', 'debbie snack', 'wegmans', 'noodle soup', 'dessert', 'pop']


In [836]:


list_remove_2nd = list(set(flat_list_2nd) -set(list_keep_1st) - set(list_keep_2nd))
print(list_remove_2nd) ##TODO Check what words werer sorted out 

[]


In [837]:
avadacadabra = find_sentences_with_word(data_nlp['comfort_food'], list_remove_1st)

count =0
for x in avadacadabra:
    count = count+1
    print(x, count)

tomato ['tomato soup, pizza, fritos, meatball sub, dr. pepper'] : 

kit ['saltfish, candy and kit kat '] : 

donut ['donuts, ice cream, chips', 'cookies, donuts, candy bars', 'little debbie snacks, donuts, pizza'] : 

sandwhich ['mac n cheese, peanut butter and banana sandwich, omelet', 'peanut butter sandwich, pretzals, garlic bread', 'pizza, pretzels, fruit snacks, deli sandwhich', 'noodle ( any kinds of noodle), tuna sandwich, and egg.,'] : 

sandwich ['mac n cheese, peanut butter and banana sandwich, omelet', 'peanut butter sandwich, pretzals, garlic bread', 'pizza, pretzels, fruit snacks, deli sandwhich', 'noodle ( any kinds of noodle), tuna sandwich, and egg.,'] : 

pot ['chilli, soup, pot pie'] : 

pie ['chilli, soup, pot pie'] : 

meatball ['tomato soup, pizza, fritos, meatball sub, dr. pepper'] : 

cottage ["dark chocolate, terra chips, reese's cups(dark chocolate), and bread/crackers with cottage cheese"] : 

fire ['chocolate, chips, ice cream, french fires, pizza'] : 

spong

In [838]:
data_cleaning.loc[:,['comfort_food','comfort_food_reasons', 'diet_current']] = data_nlp


In [839]:
cf_explode = data_cleaning.explode('comfort_food') 
## The explode() function is used to transform each element of a list-like to a row, replicating the index values.

print(cf_explode['comfort_food'].unique())
print(cf_explode['comfort_food'].unique().shape)
print(cf_explode['comfort_food'].shape)

['none' 'chocolate, chips, ice cream' 'frozen yogurt, pizza, fast food'
 'pizza, mac and cheese, ice cream' 'ice cream, chocolate, chips '
 'candy, brownies and soda.'
 'chocolate, ice cream, french fries, pretzels'
 'ice cream, cheeseburgers, chips.' 'donuts, ice cream, chips'
 'mac and cheese, chocolate, and pasta '
 'pasta, grandma homemade chocolate cake anything homemade '
 'chocolate, pasta, soup, chips, popcorn' 'cookies, popcorn, and chips'
 'ice cream, cake, chocolate'
 'pizza, fruit, spaghetti, chicken and potatoes  '
 'cookies, donuts, candy bars' 'saltfish, candy and kit kat '
 'chips, cookies, ice cream' 'chocolate, ice crea '
 'pizza, wings, chinese' 'fast food, pizza, subs'
 'chocolate, sweets, ice cream' 'burgers, chips, cookies'
 'chilli, soup, pot pie' 'soup, pasta, brownies, cake'
 'chocolate, ice cream/milkshake, cookies'
 'chips, ice cream, microwaveable foods ' 'chicken fingers, pizza '
 'cookies, hot chocolate, beef jerky'
 'tomato soup, pizza, fritos, meatball s

grandma homemade chocolate cake anything homemade
ice crea