# Dataset Analysis

**Goal**: choose the most suitable dataset for the project

In [6]:
import pandas as pd

In [9]:
# Forward slashes keep the relative paths portable across operating systems
nutri_facts = pd.read_csv('../data/raw/nutritional-facts.csv')
nutri_facts_common = pd.read_csv('../data/raw/nutritional-facts-for-most-common-foods-and-products.csv')
nutri_values = pd.read_csv('../data/raw/nutritional-values-for-common-foods.csv')

## Nutritional facts for most common foods and products

### Pros

- `Category` feature is present
- `Measure` (serving size) is present but needs conversion
- Simple nutritional facts

### Cons

- Lacks more detailed nutritional facts (e.g., vitamins and minerals)
- Foods have no IDs
- Some names are "broken" but can be fixed
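
The missing IDs noted above could be addressed with a surrogate key. The following is a minimal sketch, assuming a sequential integer is acceptable as an identifier; the `food_id` column name and the sample rows are hypothetical and only stand in for the real dataset.

```python
import pandas as pd

# Hypothetical sample standing in for `nutri_facts_common`
foods = pd.DataFrame({
    "Food": ["Cows' milk", "Milk skim", "Buttermilk"],
    "Category": ["Dairy products", "Dairy products", "Dairy products"],
})

# Insert a sequential integer key as the first column (assumed naming: `food_id`)
foods.insert(0, "food_id", range(1, len(foods) + 1))

print(foods)
```

A sequential key is only stable as long as the row order does not change, so it would be safest to assign it once, after any name cleanup, and save the result.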


In [14]:
nutri_facts_common.columns

Index(['Food', 'Measure', 'Grams', 'Calories', 'Protein', 'Fat', 'Sat.Fat',
       'Fiber', 'Carbs', 'Category'],
      dtype='object')

## Nutritional facts

### Pros

- `Category Name` feature is present
- Complete nutritional facts

### Cons

- Lacks serving sizes
- Foods have no IDs

In [13]:
nutri_facts.columns

Index(['Food Name', 'Category Name', 'Calcium', 'Calories', 'Carbs',
       'Cholesterol', 'Copper', 'Fats', 'Fiber', 'Folate', 'Iron', 'Magnesium',
       'Monounsaturated Fat', 'Net carbs', 'Omega-3 - DHA', 'Omega-3 - DPA',
       'Omega-3 - EPA', 'Phosphorus', 'Polyunsaturated fat', 'Potassium',
       'Protein', 'Saturated Fat', 'Selenium', 'Sodium', 'Trans Fat',
       'Vitamin A (IU)', 'Vitamin A RAE', 'Vitamin B1', 'Vitamin B12',
       'Vitamin B2', 'Vitamin B3', 'Vitamin B5', 'Vitamin B6', 'Vitamin C',
       'Zinc', 'Choline', 'Fructose', 'Histidine', 'Isoleucine', 'Leucine',
       'Lysine', 'Manganese', 'Methionine', 'Phenylalanine', 'Starch', 'Sugar',
       'Threonine', 'Tryptophan', 'Valine', 'Vitamin D', 'Vitamin E',
       'Vitamin K', 'Omega-3 - ALA', 'Omega-6 - Eicosadienoic acid',
       'Omega-6 - Gamma-linoleic acid', 'Omega-3 - Eicosatrienoic acid',
       'Omega-6 - Dihomo-gamma-linoleic acid', 'Omega-6 - Linoleic acid',
       'Omega-6 - Arachidonic acid'],
      dtype='object')

## Nutritional values for common foods

### Pros

- Complete nutritional facts

### Cons

- Has no category feature
- `serving_size` is always 100 g
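
The claim about a constant serving size can be checked by counting the distinct values in the column. This is a minimal sketch on hypothetical rows; the food names and the `"100 g"` string format are assumptions, and on the real data the same two calls would be applied to `nutri_values['serving_size']`.

```python
import pandas as pd

# Hypothetical rows; the real values may be stored differently (e.g. as numbers)
sample = pd.DataFrame({
    "name": ["Food A", "Food B", "Food C"],
    "serving_size": ["100 g", "100 g", "100 g"],
})

# A column is constant when it holds exactly one distinct value (NaN included)
distinct_sizes = sample["serving_size"].unique()
is_constant = sample["serving_size"].nunique(dropna=False) == 1

print(distinct_sizes, is_constant)
```

If the column does turn out to be constant, it carries no information and could be dropped without loss.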

In [15]:
nutri_values.columns

Index(['Unnamed: 0', 'name', 'serving_size', 'calories', 'total_fat',
       'saturated_fat', 'cholesterol', 'sodium', 'choline', 'folate',
       'folic_acid', 'niacin', 'pantothenic_acid', 'riboflavin', 'thiamin',
       'vitamin_a', 'vitamin_a_rae', 'carotene_alpha', 'carotene_beta',
       'cryptoxanthin_beta', 'lutein_zeaxanthin', 'lucopene', 'vitamin_b12',
       'vitamin_b6', 'vitamin_c', 'vitamin_d', 'vitamin_e', 'tocopherol_alpha',
       'vitamin_k', 'calcium', 'copper', 'irom', 'magnesium', 'manganese',
       'phosphorous', 'potassium', 'selenium', 'zink', 'protein', 'alanine',
       'arginine', 'aspartic_acid', 'cystine', 'glutamic_acid', 'glycine',
       'histidine', 'hydroxyproline', 'isoleucine', 'leucine', 'lysine',
       'methionine', 'phenylalanine', 'proline', 'serine', 'threonine',
       'tryptophan', 'tyrosine', 'valine', 'carbohydrate', 'fiber', 'sugars',
       'fructose', 'galactose', 'glucose', 'lactose', 'maltose', 'sucrose',
       'fat', 'saturated_fatt

## Check dataset overlap by name

In [16]:
nutri_facts_common['Food']

0                 Cows' milk
1                  Milk skim
2                 Buttermilk
3      Evaporated, undiluted
4             Fortified milk
               ...          
330      Fruit-flavored soda
331               Ginger ale
332                Root beer
333                   Coffee
334                      Tea
Name: Food, Length: 335, dtype: object

In [17]:
nutri_facts['Food Name']

0              Acerola
1                Apple
2              Apricot
3          Dried fruit
4              Avocado
             ...      
1169    Sesame chicken
1170        Vermicelli
1171         Baby food
1172          Zwieback
1173      Cherry juice
Name: Food Name, Length: 1174, dtype: object

In [24]:
overlapping_foods = set(nutri_facts['Food Name']).intersection(set(nutri_facts_common['Food']))
print(f"Number of overlapping foods: {len(overlapping_foods)}")
print(overlapping_foods)

Number of overlapping foods: 84
{'Kale', 'Pineapple juice', 'Avocado', 'Sauerkraut', 'Banana', 'Honey', 'Corn', 'Artichoke', 'Veal', 'Oysters', 'Eggplant', 'Spinach', 'Corned beef', 'Tuna', 'Butter', 'Crab meat', 'Cream cheese', 'Beef', 'Haddock', 'Gin', 'Noodles', 'Rutabagas', 'Asparagus', 'Lobster', 'Swordfish', 'Cantaloupe', 'Oyster stew', 'Oatmeal', 'Halibut', 'Ginger ale', 'Herring', 'Gingerbread', 'Steak', 'Broccoli', 'Custard', 'Vegetable', 'Lettuce', 'Dates', 'Parsley', 'Okra', 'Dandelion greens', 'Rice', 'Chocolate syrup', 'Potato chips', 'French dressing', 'Papaya', 'Grape juice', 'Celery', 'Orange juice', 'Watermelon', 'Flour', 'Tea', 'Kohlrabi', 'Tomato soup', 'Mayonnaise', 'Lard', 'Powdered milk', 'Hamburger', 'Ice cream', 'Roast beef', 'Coffee', 'Olive oil', 'Pineapple', 'Tomato juice', 'Fudge', 'Molasses', 'Turnip greens', 'Margarine', 'Cod', 'Shrimp', 'Spanish rice', 'Cauliflower', 'Prunes', 'Endive', 'Shad', 'Beer', 'Figs', 'Peanut butter', 'Rye', 'Cheese', 'Buttermilk
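
The intersection above only matches names that are identical character by character, so differences in case or spacing would be counted as non-overlapping. The sketch below shows one possible way to compare normalized names instead; the `normalize` helper and the sample names are hypothetical, and it has not been run against the real datasets.

```python
import pandas as pd

# Hypothetical samples standing in for the two name columns
names_a = pd.Series(["Apple", "Avocado", "Orange Juice ", "Kale"])
names_b = pd.Series(["avocado", "Orange juice", "Kale", "Cows' milk"])


def normalize(names: pd.Series) -> set:
    """Trim, lowercase, and collapse repeated whitespace in food names."""
    cleaned = (
        names.dropna()
        .astype(str)
        .str.strip()
        .str.lower()
        .str.replace(r"\s+", " ", regex=True)
    )
    return set(cleaned)


# Exact matching, as in the cell above
exact = set(names_a) & set(names_b)

# Matching after normalization
normalized = normalize(names_a) & normalize(names_b)

print(len(exact), len(normalized))
```

Normalization can only increase the overlap count, so comparing the two numbers gives a rough idea of how many matches were lost to formatting differences alone.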

## Conclusions

Because the project is intended for academic purposes, we will use the simplest dataset, `nutri_facts_common`.