# Individual assignment Data mining 
### National Health and Nutrition Examination Survey

For this assignment we have a set of 6 different files with data available from survey research. 

In [1]:
import os
import pandas as pd
import LLMConnect
import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, 
                             roc_auc_score, confusion_matrix, classification_report, 
                             silhouette_score)
from sklearn.preprocessing import MinMaxScaler


from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from mlxtend.frequent_patterns import fpgrowth, association_rules


## Importing the data

In [2]:
df_diet = pd.read_csv("data/diet.csv")

In [3]:
[col for col in df_diet.columns if col.startswith("DR1")]

['DR1DRSTZ',
 'DR1EXMER',
 'DR1DBIH',
 'DR1DAY',
 'DR1LANG',
 'DR1MNRSP',
 'DR1HELPD',
 'DR1STY',
 'DR1SKY',
 'DR1TNUMF',
 'DR1TKCAL',
 'DR1TPROT',
 'DR1TCARB',
 'DR1TSUGR',
 'DR1TFIBE',
 'DR1TTFAT',
 'DR1TSFAT',
 'DR1TMFAT',
 'DR1TPFAT',
 'DR1TCHOL',
 'DR1TATOC',
 'DR1TATOA',
 'DR1TRET',
 'DR1TVARA',
 'DR1TACAR',
 'DR1TBCAR',
 'DR1TCRYP',
 'DR1TLYCO',
 'DR1TLZ',
 'DR1TVB1',
 'DR1TVB2',
 'DR1TNIAC',
 'DR1TVB6',
 'DR1TFOLA',
 'DR1TFA',
 'DR1TFF',
 'DR1TFDFE',
 'DR1TCHL',
 'DR1TVB12',
 'DR1TB12A',
 'DR1TVC',
 'DR1TVD',
 'DR1TVK',
 'DR1TCALC',
 'DR1TPHOS',
 'DR1TMAGN',
 'DR1TIRON',
 'DR1TZINC',
 'DR1TCOPP',
 'DR1TSODI',
 'DR1TPOTA',
 'DR1TSELE',
 'DR1TCAFF',
 'DR1TTHEO',
 'DR1TALCO',
 'DR1TMOIS',
 'DR1TS040',
 'DR1TS060',
 'DR1TS080',
 'DR1TS100',
 'DR1TS120',
 'DR1TS140',
 'DR1TS160',
 'DR1TS180',
 'DR1TM161',
 'DR1TM181',
 'DR1TM201',
 'DR1TM221',
 'DR1TP182',
 'DR1TP183',
 'DR1TP184',
 'DR1TP204',
 'DR1TP205',
 'DR1TP225',
 'DR1TP226',
 'DR1.300',
 'DR1.320Z',
 'DR1.330Z',
 'DR1BWATZ',

In [4]:
df_diet["DR1CCMTX"]

KeyError: 'DR1CCMTX'


# Analyzing the vegetarians
In order to analyze the vegetarians in the data, we must first find them. However, the data contains no self-reported vegetarian question. What is available is an .xpt file documented at https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2013/DataFiles/DS1IDS_H.htm. In this file we can see what the daily food intake of each participant, who is represented by a unique SEQN, consisted of. There are two separate data files that list the food eaten by the participants: DR1IFF_H.xpt contains the food data of day 1 and DR2IFF_H.xpt contains the food data of day 2. Not every participant in the provided 5 data files from Kaggle took part in the food interviews, so we only keep the rows whose sequence number exists in both food interviews and in the general interview.

In order to label each SEQN as vegetarian or non-vegetarian, we need a proxy that identifies a vegetarian. As proxy we use whether an individual has eaten no meat, poultry or fish.
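
The proxy can be sketched on toy data as follows. This is only an illustration: the column `contains_meat` is a hypothetical per-item flag (1 if the item contains meat, poultry or fish) that still has to be derived from the food codes, and the SEQN values are made up.

```python
import pandas as pd

# Toy food records: one row per reported food item (SEQN values are made up).
# `contains_meat` is a hypothetical 0/1 flag for meat, poultry or fish.
toy_items = pd.DataFrame({
    "SEQN": [1, 1, 1, 2, 2, 3],
    "contains_meat": [0, 1, 0, 0, 0, 0],
})

# Under the proxy, a participant is vegetarian only if none of the
# reported items contains meat, poultry or fish.
is_vegetarian = toy_items.groupby("SEQN")["contains_meat"].max().eq(0)
labels = is_vegetarian.map({True: "vegetarian", False: "non_vegetarian"})
print(labels.to_dict())
```

Note that a proxy based on one or two recall days can only show that no meat was reported on those days, not that the participant never eats meat.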

In [5]:
df_day1 = pd.read_sas("data/DR1IFF_H.xpt")

In [6]:
print(f"A total of {len(df_day1.SEQN.unique())} unique SEQN numbers are in the dataset")
print(f"A total of {len(df_day1)} rows are in the dataset, which means that each individual has reported {round(len(df_day1)/len(df_day1.SEQN.unique()), ndigits=1)} different food items on day 1")

A total of 8661 unique SEQN numbers are in the dataset
A total of 131394 rows are in the dataset, which means that each individual has reported 15.2 different food items on day 1


In the day 1 dataset, 8661 different participants reported on average 15.2 food items each, for a total of 131394 food items.

To evaluate the food items efficiently, we can use the combination food type column "DR1CCMTX", which divides all possible food items into 16 groups. Below you can see the results. One value looks wrong: the extremely small value 5.4e-79 should be 0. This might be due to the way the .xpt file is read, so we round the values of DR1CCMTX.

In [7]:
df_day1.DR1CCMTX.unique()

array([5.39760535e-79, 1.00000000e+00, 9.00000000e+01, 2.00000000e+00,
       3.00000000e+00, 5.00000000e+00, 1.10000000e+01, 1.00000000e+01,
       9.00000000e+00, 1.20000000e+01, 4.00000000e+00, 1.40000000e+01,
       6.00000000e+00, 8.00000000e+00, 1.30000000e+01, 7.00000000e+00])

In [8]:
df_day1["DR1CCMTX"] = round(df_day1["DR1CCMTX"])
occurrences = df_day1["DR1CCMTX"].value_counts().sort_index()
occurrences

DR1CCMTX
0.0     75335
1.0     10947
2.0      5466
3.0      5300
4.0      5337
5.0     13313
6.0       832
7.0        52
8.0       491
9.0      3690
10.0      666
11.0     1781
12.0     2653
13.0      169
14.0      721
90.0     4641
Name: count, dtype: int64
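
The same placeholder value (5.397605e-79) also shows up in other columns of the day 1 file, for example the nutrient columns. A minimal sketch of a more general cleanup on toy data follows; the tolerance `1e-70` is an assumption, chosen to be far smaller than any real measurement, and rounding would not be suitable for nutrient columns because it would also destroy genuine small values.

```python
import pandas as pd

# Toy frame mimicking the artifact: 5.397605e-79 appears where 0 is expected.
toy = pd.DataFrame({
    "DR1CCMTX": [5.39760535e-79, 1.0, 90.0],
    "DR1IP184": [5.39760535e-79, 5.39760535e-79, 0.108],
})

# Replace values that are practically zero by exactly 0.0 and keep
# every other value (including small but real amounts) unchanged.
cleaned = toy.mask(toy.abs() < 1e-70, 0.0)
print(cleaned)
```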

In [9]:
len(df_day1[df_day1["DR1CCMTX"] == 0]["SEQN"].unique())

8648

## First challenge
The food combination type column "DR1CCMTX" contains a lot of 0 values. The value 0 means that the food combination has no specific code in the researchers' coding system. We hoped to simply exclude the participants who have eaten food with code 0, but that group covers about 99.8% of the participants (8648 of 8661), so we have to find a workaround. In addition, only 721 food items belong to the meat, poultry and fish group, which is code 14.

You can see a table that includes the foodtypes and their respective code below

![IMG](Images/Table-food-type-codes.png)
source:  https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2013/DataFiles/DRXFCD_H.htm
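
The share of participants with at least one code 0 item can be computed directly. The sketch below uses toy rows (made-up SEQN values) to show the calculation and then repeats it with the counts reported in the outputs above.

```python
import pandas as pd

# Toy records: SEQN values are made up, codes follow the DR1CCMTX column.
toy = pd.DataFrame({
    "SEQN": [1, 1, 2, 2, 3],
    "DR1CCMTX": [0.0, 14.0, 0.0, 5.0, 1.0],
})

n_participants = toy["SEQN"].nunique()
n_with_code_zero = toy.loc[toy["DR1CCMTX"] == 0, "SEQN"].nunique()
toy_share = n_with_code_zero / n_participants

# The same ratio with the counts printed earlier in this notebook.
reported_share = 8648 / 8661
print(f"{reported_share:.2%}")
```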

## How to label vegetarians

About 57% of the combination food codes are undefined (code 0), and these items might still contain meat or fish, so we must use a different approach to find out who in this dataset is vegetarian. Another column in this dataset identifies the food type: DR1IFDCD, described as the USDA food code of the food (called DR2IFDCD in the day 2 dataset). By identifying which DR1IFDCD food codes contain meat, poultry or fish, we should be able to label participants as vegetarian or not.

The only missing piece that is left is to find out which food codes contain meat and which do not. Luckily, there is a description of food codes file available at the 

In [10]:
df_day1

Unnamed: 0,SEQN,WTDRD1,WTDR2D,DR1ILINE,DR1DRSTZ,DR1EXMER,DRABF,DRDINT,DR1DBIH,DR1DAY,...,DR1IM181,DR1IM201,DR1IM221,DR1IP182,DR1IP183,DR1IP184,DR1IP204,DR1IP205,DR1IP225,DR1IP226
0,73557.0,16888.327864,12930.890649,1.0,1.0,49.0,2.0,2.0,6.0,2.0,...,3.595000e+00,3.400000e-02,1.000000e-03,9.490000e-01,1.080000e-01,5.397605e-79,5.100000e-02,1.000000e-03,5.397605e-79,1.000000e-02
1,73557.0,16888.327864,12930.890649,2.0,1.0,49.0,2.0,2.0,6.0,2.0,...,5.397605e-79,5.397605e-79,5.397605e-79,4.000000e-03,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79
2,73557.0,16888.327864,12930.890649,3.0,1.0,49.0,2.0,2.0,6.0,2.0,...,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79
3,73557.0,16888.327864,12930.890649,4.0,1.0,49.0,2.0,2.0,6.0,2.0,...,8.100000e-02,5.397605e-79,5.397605e-79,1.030000e-01,3.100000e-02,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79
4,73557.0,16888.327864,12930.890649,5.0,1.0,49.0,2.0,2.0,6.0,2.0,...,2.600000e-02,5.397605e-79,5.397605e-79,2.400000e-02,9.000000e-03,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131389,83731.0,5805.674812,4339.132077,23.0,1.0,49.0,2.0,2.0,12.0,6.0,...,3.798000e+00,3.800000e-02,5.397605e-79,3.372000e+00,4.790000e-01,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79
131390,83731.0,5805.674812,4339.132077,24.0,1.0,49.0,2.0,2.0,12.0,6.0,...,5.260000e-01,5.000000e-03,5.397605e-79,4.730000e-01,8.200000e-02,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79
131391,83731.0,5805.674812,4339.132077,25.0,1.0,49.0,2.0,2.0,12.0,6.0,...,1.483000e+00,1.500000e-02,5.397605e-79,1.346000e+00,1.980000e-01,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79
131392,83731.0,5805.674812,4339.132077,26.0,1.0,49.0,2.0,2.0,12.0,6.0,...,6.830000e-01,5.397605e-79,5.397605e-79,6.000000e-02,2.500000e-02,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79,5.397605e-79


### Importing dataset which contains food code descriptions
In this dataset we can find what each of the food codes means in plain English.

In [16]:
df_foodcodes_description = pd.read_sas("data/DRXFCD_H.xpt")

### Check whether the food codes can be linked to the food interview dataset

In [None]:
95330100 in df_foodcodes_description.DRXFDCD.unique()


The food code 95330100 checked above was taken from df_day1, so the food codes from the interviews can be matched with the descriptions.

In [17]:
df_foodcodes_description["DRXFDCD"] = df_foodcodes_description["DRXFDCD"].apply(lambda x: int(x))
df_foodcodes_description

Unnamed: 0,DRXFDCD,DRXFCSD,DRXFCLD
0,11000000,"b'MILK, HUMAN'","b'Milk, human'"
1,11100000,"b'MILK, NFS'","b'Milk, NFS'"
2,11111000,"b'MILK, WHOLE'","b'Milk, whole'"
3,11111100,"b'MILK, LOW SODIUM, WHOLE'","b'Milk, low sodium, whole'"
4,11111150,"b'MILK, CALCIUM FORTIFIED, WHOLE'","b'Milk, calcium fortified, whole'"
...,...,...,...
8531,95323000,"b'SPORTS DRINK, LOW CALORIE'","b'Sports drink, low calorie'"
8532,95330100,"b'FLUID REPLACEMENT, ELECTROLYTE SOLUTION'","b'Fluid replacement, electrolyte solution'"
8533,95330500,"b'FLUID REPLACEMENT, 5% GLUCOSE IN WATER'","b'Fluid replacement, 5% glucose in water'"
8534,95341000,b'FUZE SLENDERIZE FORTIFIED LOW CALORIE FRUIT ...,b'FUZE Slenderize fortified low calorie fruit ...
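
With the codes stored as integers, the descriptions could be attached to the food records with a left merge. The following is a sketch on toy rows, assuming DR1IFDCD is the food code column of the interview file as described earlier; the pairing of SEQN values and food codes is made up. The `indicator=True` option adds a `_merge` column that reveals food codes without a description.

```python
import pandas as pd

# Toy interview records (the SEQN / food code pairing is made up).
records = pd.DataFrame({
    "SEQN": [73557, 73557, 83731],
    "DR1IFDCD": [11111000, 95330100, 11100000],
})

# Toy excerpt of the description table, already decoded to plain strings.
descriptions = pd.DataFrame({
    "DRXFDCD": [11100000, 11111000, 95330100],
    "DRXFCSD": ["MILK, NFS", "MILK, WHOLE",
                "FLUID REPLACEMENT, ELECTROLYTE SOLUTION"],
})

# Left merge keeps every food record, even if no description is found.
merged = records.merge(descriptions, how="left",
                       left_on="DR1IFDCD", right_on="DRXFDCD",
                       indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched))
```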


### Interpretation of the data
As we can see from the table above, the description of each unique food code is stored as a byte string. Inferring by hand which foods are vegetarian is going to be very difficult; how feasible it is depends on how many different ingredients there are.

### Binary notation
The values in the columns DRXFCSD and DRXFCLD are byte strings, shown with the prefix b''. Such a notation might become an issue later on when we try to evaluate whether a food item contains meat or not. So let's transform each byte string into a list of its comma-separated parts.

In [18]:
def turn_binary_string_to_list(df):
    df["food_list"] = df["DRXFCSD"].apply(lambda x: str(x).replace("b'", "").replace("'", "").split(","))
    return df
df_foodcodes_description = turn_binary_string_to_list(df_foodcodes_description)
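
An alternative to string replacement is to decode the bytes directly. The sketch below is not the approach used in this notebook: it assumes the values are `bytes` objects encoded as UTF-8, and it additionally strips the whitespace around each part, which would change the number of unique parts counted later. The second toy description is made up to show that apostrophes survive decoding.

```python
import pandas as pd

toy = pd.DataFrame({
    "DRXFCSD": [
        b"MILK, LOW SODIUM, WHOLE",
        b"COW'S MILK, WHOLE",  # made-up description containing an apostrophe
    ]
})

def bytes_to_food_list(value):
    """Decode a byte string and split it into trimmed, comma-separated parts."""
    text = value.decode("utf-8") if isinstance(value, bytes) else str(value)
    return [part.strip() for part in text.split(",")]

toy["food_list"] = toy["DRXFCSD"].apply(bytes_to_food_list)
print(toy["food_list"].tolist())
```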

It is important to understand how many individual food items there are. We might be able to use a trick to find out whether the ingredients contain meat or not. Or, if there are not too many, we could do so by hand.

In [38]:
unique_elements_food = set([food for food_list in df_foodcodes_description.food_list.values for food in food_list])
print(f"There are {len(unique_elements_food)} unique ingredients in the foodcodes description")

There are 6191 unique ingredients in the foodcodes description


In [19]:
def turn_food_df_to_dict(df):
    food_code_dict = {}
    df_food_code = df[["DRXFDCD", "food_list"]]
    for index, row in df_food_code.iterrows():
        food_code_dict[row["DRXFDCD"]] = row["food_list"]
    return food_code_dict
food_code_dict = turn_food_df_to_dict(df_foodcodes_description)

## Use of LLM to create food codes
Because the food codes are so diverse, it is not feasible to define a reliable rule for the food category without relying on a large number of if-else statements. Instead, we use an LLM to analyze each food description and flag whether it contains poultry, red meat, fish, shellfish, dairy products, or other animal products; if none of these apply, the food is labeled as "none". The calls are implemented in the LLMConnect Python file, where the process is detailed. The LLM receives a carefully designed prompt, which is shown below. The prompt is sent to GPT-4o mini (gpt-4o-mini), and the output, formatted as JSON, is then used to label the foods and determine whether an individual is vegetarian. While this approach may feel like going down a rabbit hole, it provides a structured way of obtaining the labels.




        system_message = """
                You are a food detecting agent that responds only in JSON mode.
                Analyze the following list of JSON objects representing food meals and their ingredients.
                """
        user_task = """
        You are a food detecting agent that responds only in JSON mode.
        Analyze the following list of JSON objects representing food meals and their ingredients:
        {{input_json}}
        ---

        For each ingredient in each meal, determine if it contains any of the following categories, you are allowed to 
        label multiple ingredients as 1:
        - poultry
        - red meat
        - fish
        - shellfish
        - dairy products
        - other animal products

        Return the result as a JSON array of objects with the structure:
        "meals": [
            {
                "ingredient_list": "list of food ingredients",
                "meal_number: unique_id
                "poultry": 1 or 0,
                "red_meat": 1 or 0,
                "fish": 1 or 0,
                "shellfish": 1 or 0,
                "dairy_products": 1 or 0,
                "other_animal_products": 1 or 0,
                "none": 1 or 0
            }
        ]
        Only return the array of jsons in dict formats, nothing else.
        
        """

In [20]:

def turn_json_into_segments(json_to_divide, dict_per_chunk=30):
    

    smaller_json = {}
    json_list_combined = []
    for index,(key, json_dict) in enumerate(json_to_divide.items()):
        if index % dict_per_chunk ==0 and index !=0:
            json_list_combined.append(smaller_json)
            smaller_json = {}
        smaller_json[key] = json_dict
    json_list_combined.append(smaller_json)
    return json_list_combined
    


def detect_food_json_list_dict(json_list: dict):
    # Split the {food_code: food_list} dictionary into chunks of 30 entries
    json_chunks = turn_json_into_segments(json_to_divide=json_list,
                                          dict_per_chunk=30)

    # DetectFoodIngredients is defined in the local LLMConnect Python file
    LLM_translator = LLMConnect.DetectFoodIngredients(model_name="gpt-4o-mini", max_tokens=9069)

    complete_food_labeled_json_dict = []

    # Send each chunk to the LLM as a JSON string and collect the labeled output
    for json_dict in json_chunks:
        food_json = json.dumps(json_dict)
        food_labeled_json = LLM_translator.detect_food_ingredients(input_json=food_json)
        complete_food_labeled_json_dict.append(food_labeled_json)
    return complete_food_labeled_json_dict
        
        
complete_food_json = detect_food_json_list_dict(json_list=food_code_dict)



                response_format was transferred to model_kwargs.
                Please confirm that response_format is what you intended.
  LLM_translator = LLMConnect.DetectFoodIngredients(model_name="gpt-4o-mini", max_tokens=9069)


I am sending a request to an LLM now
save_file_path =  json/data_1.json
save_file_path =  json/data_1.json
I am sending a request to an LLM now
save_file_path =  json/data_2.json
save_file_path =  json/data_2.json
I am sending a request to an LLM now
save_file_path =  json/data_3.json
save_file_path =  json/data_3.json
I am sending a request to an LLM now
save_file_path =  json/data_4.json
save_file_path =  json/data_4.json
I am sending a request to an LLM now
save_file_path =  json/data_5.json
save_file_path =  json/data_5.json
I am sending a request to an LLM now
save_file_path =  json/data_6.json
save_file_path =  json/data_6.json
I am sending a request to an LLM now
save_file_path =  json/data_7.json
save_file_path =  json/data_7.json
I am sending a request to an LLM now
save_file_path =  json/data_8.json
save_file_path =  json/data_8.json
I am sending a request to an LLM now
save_file_path =  json/data_9.json
save_file_path =  json/data_9.json
I am sending a request to an LLM now


In [22]:
with open("complete_food_codes.json", "w") as f:
    f.write(json.dumps(complete_food_json))

In [35]:
print(f"There are a total of {len(complete_food_json)} chunks of food jsons")
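# Count the items first; reusing double quotes inside an f-string only works on Python 3.12+
n_food_items = sum(len(chunk["meals"]) for chunk in complete_food_json)
print(f"For a total of {n_food_items} of individual food items")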

There are a total of 285 chunks of food jsons
For a total of 8527 of individual food items
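
Before the returned objects are used, each one could be checked against the structure requested in the prompt. The following is a minimal sketch, not part of the LLMConnect file; the helper name `validate_meal` and the toy objects are hypothetical.

```python
EXPECTED_FLAGS = ["poultry", "red_meat", "fish", "shellfish",
                  "dairy_products", "other_animal_products", "none"]

def validate_meal(meal):
    """Return a list of problems found in one labeled meal object."""
    problems = []
    # Every requested key must be present.
    for key in ["ingredient_list", "meal_number"] + EXPECTED_FLAGS:
        if key not in meal:
            problems.append(f"missing key: {key}")
    # Every category flag that is present must be 0 or 1.
    for flag in EXPECTED_FLAGS:
        if flag in meal and meal[flag] not in (0, 1):
            problems.append(f"{flag} is not 0 or 1")
    return problems

# Toy objects (made up) in the structure the prompt asks for.
good = {"ingredient_list": "MILK, WHOLE", "meal_number": 11111000,
        "poultry": 0, "red_meat": 0, "fish": 0, "shellfish": 0,
        "dairy_products": 1, "other_animal_products": 0, "none": 0}
bad = {"ingredient_list": "MILK, WHOLE", "meal_number": 11111000,
       "poultry": 0, "red_meat": 0, "shellfish": 0,
       "dairy_products": "yes", "other_animal_products": 0, "none": 0}

print(validate_meal(good))
print(validate_meal(bad))
```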


### Evaluating LLM output
As we can see from above, there are a total of 8527 individual food items that the LLM has returned. The structure of each chunk can be seen below.

`"meals": [
            {
                "ingredient_list": "list of food ingredients",
                "meal_number": unique_id,
                "poultry": 1 or 0,
                "red_meat": 1 or 0,
                "fish": 1 or 0,
                "shellfish": 1 or 0,
                "dairy_products": 1 or 0,
                "other_animal_products": 1 or 0,
                "none": 1 or 0
            }
        ]`
        
We now have a list of chunks, each containing up to 30 individual food items. The first step is to create a large dictionary where each food number serves as the key and the corresponding labeled food categories are stored as the values. Once this dictionary is constructed, we need to verify that all values are loaded correctly and reflect the intended information. The final step is to match each individual food item to the corresponding entries in the food record table, so that the labels are aligned with the data for further analysis.

In [40]:
from copy import deepcopy
chunked_food_json = deepcopy(complete_food_json)

In [41]:
def turn_json_chunks_df(json_chunks):
    complete_json = []
    for chunk in json_chunks:
        # Check if meals is in the chunk
        if "meals" in chunk:
            complete_json.extend(chunk["meals"])
        # If the llm named the chunk differently
        elif len(chunk.keys()) == 1:
            key_name = list(chunk.keys())[0]
            complete_json.extend(chunk[key_name])
        # If the LLM accidentally didn't give a key for the chunk
        elif len(chunk.keys()) == 9:
            complete_json.append(chunk)
        else:
            # None of the known layouts matched, so stop instead of guessing
            raise ValueError("there is an issue with formatting")

    return complete_json
        
exploded_food_json = turn_json_chunks_df(chunked_food_json)
len(exploded_food_json)

8527
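
From the flat list, a lookup from food number to a vegetarian flag could be built as sketched below. Several things here are assumptions rather than part of the notebook: that `meal_number` holds the food code that was sent to the LLM (possibly returned as a string), that shellfish counts as non-vegetarian alongside poultry, red meat and fish, and that dairy and other animal products do not. The toy objects and the code 99999999 are made up.

```python
# Categories that make a food non-vegetarian under the proxy (assumption).
NON_VEGETARIAN_FLAGS = ["poultry", "red_meat", "fish", "shellfish"]

# Toy labeled objects in the structure returned by the LLM.
toy_meals = [
    {"ingredient_list": "MILK, WHOLE", "meal_number": "11111000",
     "poultry": 0, "red_meat": 0, "fish": 0, "shellfish": 0,
     "dairy_products": 1, "other_animal_products": 0, "none": 0},
    {"ingredient_list": "made-up chicken dish", "meal_number": 99999999,
     "poultry": 1, "red_meat": 0, "fish": 0, "shellfish": 0,
     "dairy_products": 0, "other_animal_products": 0, "none": 0},
]

def is_vegetarian_food(meal):
    """True if none of the non-vegetarian category flags is set."""
    return all(meal.get(flag, 0) == 0 for flag in NON_VEGETARIAN_FLAGS)

# Normalise the key to int so it can be matched with the food code columns.
food_is_vegetarian = {int(meal["meal_number"]): is_vegetarian_food(meal)
                      for meal in toy_meals}
print(food_is_vegetarian)
```

Comparing the keys of such a dictionary with the food codes that were sent would also show which items the LLM did not return.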