# Categorizing Data Features with Pandas and Numpy

### Importing and Previewing a CSV Data File with Pandas

In [1]:
import pandas as pd
dataset_file = "../data/test.csv"
# Load the header of the CSV file
df = pd.read_csv(dataset_file, nrows=0, sep="\t")
df

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,...,ph_100g,fruits-vegetables-nuts_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g


### Creating a list of features with the `columns` attribute of a pandas DataFrame

In [2]:
# Access the column names and convert them to a list
features = df.columns.tolist()
print(features)

['code', 'url', 'creator', 'created_t', 'created_datetime', 'last_modified_t', 'last_modified_datetime', 'product_name', 'generic_name', 'quantity', 'packaging', 'packaging_tags', 'brands', 'brands_tags', 'categories', 'categories_tags', 'categories_fr', 'origins', 'origins_tags', 'manufacturing_places', 'manufacturing_places_tags', 'labels', 'labels_tags', 'labels_fr', 'emb_codes', 'emb_codes_tags', 'first_packaging_code_geo', 'cities', 'cities_tags', 'purchase_places', 'stores', 'countries', 'countries_tags', 'countries_fr', 'ingredients_text', 'allergens', 'allergens_fr', 'traces', 'traces_tags', 'traces_fr', 'serving_size', 'no_nutriments', 'additives_n', 'additives', 'additives_tags', 'additives_fr', 'ingredients_from_palm_oil_n', 'ingredients_from_palm_oil', 'ingredients_from_palm_oil_tags', 'ingredients_that_may_be_from_palm_oil_n', 'ingredients_that_may_be_from_palm_oil', 'ingredients_that_may_be_from_palm_oil_tags', 'nutrition_grade_uk', 'nutrition_grade_fr', 'pnns_groups_1', 

### Classifying Features in a Data Set

In [3]:
# Create a dictionary with an empty list for each category
feature_classification = {
    'metadata': [],
    'product_info': [],
    'location_info': [],
    'ingredients': [],
    'nutritional_info': []
}

# Iterate through the list of features
for feature in features:
    # Check the type of information contained in the feature and add it to the appropriate list in the dictionary
    if '_t' in feature or '_datetime' in feature:
        feature_classification['metadata'].append(feature)
    elif '_name' in feature or '_tags' in feature or '_fr' in feature:
        feature_classification['product_info'].append(feature)
    elif '_places' in feature or '_code_geo' in feature or 'cities' in feature:
        feature_classification['location_info'].append(feature)
    elif 'ingredients' in feature:
        feature_classification['ingredients'].append(feature)
    elif '_100g' in feature or '_serving' in feature or '_value' in feature:
        feature_classification['nutritional_info'].append(feature)

# Create a DataFrame from the dictionary
df = pd.DataFrame.from_dict(feature_classification, orient='index').transpose()

# Display the DataFrame with borders
display(df.style.set_caption('Feature Classification').set_table_styles([{'selector': '*', 'props': [('border', '1px solid black')]}]))



Unnamed: 0,metadata,product_info,location_info,ingredients,nutritional_info
0,created_t,product_name,manufacturing_places,,energy_100g
1,created_datetime,generic_name,first_packaging_code_geo,,energy-from-fat_100g
2,last_modified_t,categories_fr,cities,,fat_100g
3,last_modified_datetime,labels_fr,purchase_places,,saturated-fat_100g
4,packaging_tags,countries_fr,,,butyric-acid_100g
5,brands_tags,allergens_fr,,,caproic-acid_100g
6,categories_tags,traces_fr,,,caprylic-acid_100g
7,origins_tags,additives_fr,,,capric-acid_100g
8,manufacturing_places_tags,ingredients_from_palm_oil_n,,,lauric-acid_100g
9,labels_tags,ingredients_from_palm_oil,,,myristic-acid_100g


This code creates a dictionary whose keys are categories of information (metadata, product info, location info, ingredients, and nutritional info) and whose values are lists of the features in each category. It iterates through the list of features and assigns each one to the first category whose marker strings appear in the feature name. Finally, it builds a DataFrame from the dictionary, transposes it so that each category becomes a column, and displays it with borders. This gives a quick overview of the kinds of features in the data set. Because the rules test for substrings anywhere in the name, some features are misclassified in the output above: `packaging_tags` lands under metadata because it contains `_t`, and `ingredients_from_palm_oil_n` lands under product info because it contains `_fr`.
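A stricter variant of this rule-based grouping can match suffixes with `str.endswith` instead of testing substrings, so that `_t` no longer matches `_tags`. The sketch below is only one possible arrangement: the `SUFFIX_RULES` table, the `classify_features` helper, and the extra `other` bucket are hypothetical names introduced for illustration, not part of the notebook.

```python
# Hypothetical suffix rules, checked in order (an illustrative assumption,
# not the notebook's original logic).
SUFFIX_RULES = [
    ("metadata", ("_t", "_datetime")),
    ("product_info", ("_name", "_tags", "_fr")),
    ("location_info", ("_places", "_code_geo", "cities")),
    ("nutritional_info", ("_100g", "_serving", "_value")),
]


def classify_features(names):
    """Group feature names by suffix, with an 'ingredients' substring fallback."""
    groups = {category: [] for category, _ in SUFFIX_RULES}
    groups["ingredients"] = []
    groups["other"] = []  # features that match no rule at all
    for name in names:
        for category, suffixes in SUFFIX_RULES:
            # str.endswith accepts a tuple and returns True if any suffix matches
            if name.endswith(suffixes):
                groups[category].append(name)
                break
        else:
            # No suffix rule matched this name
            if "ingredients" in name:
                groups["ingredients"].append(name)
            else:
                groups["other"].append(name)
    return groups


sample = [
    "created_t",
    "packaging_tags",
    "ingredients_from_palm_oil_n",
    "cities",
    "energy_100g",
    "code",
]
groups = classify_features(sample)
print(groups)
```

Wrapping the rules in a function also means the same grouping can be applied to any list of feature names without copying the loop.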

## Reading a CSV file with pandas and displaying the data types of its features

In [4]:
df = pd.read_csv("../data/test.csv", sep="\t")
# Get the data type of each feature 
feature_types = df.dtypes 
print(feature_types)



code                        object
url                         object
creator                     object
created_t                   object
created_datetime            object
                            ...   
carbon-footprint_100g      float64
nutrition-score-fr_100g    float64
nutrition-score-uk_100g    float64
glycemic-index_100g        float64
water-hardness_100g        float64
Length: 162, dtype: object


## Distinguishing between quantitative and qualitative features

In [5]:
import numpy as np

# Create empty lists to store the feature names
quantitative_features = []
qualitative_features = []

# Iterate over the feature types
for feature, dtype in feature_types.items():
    # If the data type is float or int, the feature is quantitative
    if dtype == float or dtype == int:
        quantitative_features.append(feature)
    else:
        qualitative_features.append(feature)

If the data type is `float` or `int`, the feature is considered quantitative and its name is added to the `quantitative_features` list. Otherwise, the feature is considered qualitative and its name is added to the `qualitative_features` list.
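The same split can be sketched with `DataFrame.select_dtypes`, which selects columns by dtype family. Comparing a dtype to `float` or `int` matches only NumPy's default float and integer types, whereas `"number"` covers every numeric width. The `toy` frame below is a made-up stand-in for the notebook's `df`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the notebook's DataFrame
toy = pd.DataFrame({
    "product_name": ["a", "b"],
    "additives_n": [1, 2],
    "fat_100g": [0.5, np.nan],
})

# Numeric columns are treated as quantitative, everything else as qualitative
quantitative = toy.select_dtypes(include="number").columns.tolist()
qualitative = toy.select_dtypes(exclude="number").columns.tolist()

print(quantitative)
print(qualitative)
```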

### Displaying the quantitative features of a DataFrame

In [6]:
print(f'Quantitative features:\n{quantitative_features}') 

Quantitative features:
['no_nutriments', 'additives_n', 'ingredients_from_palm_oil_n', 'ingredients_from_palm_oil', 'ingredients_that_may_be_from_palm_oil_n', 'ingredients_that_may_be_from_palm_oil', 'nutrition_grade_uk', 'energy_100g', 'energy-from-fat_100g', 'fat_100g', 'saturated-fat_100g', 'butyric-acid_100g', 'caproic-acid_100g', 'caprylic-acid_100g', 'capric-acid_100g', 'lauric-acid_100g', 'myristic-acid_100g', 'palmitic-acid_100g', 'stearic-acid_100g', 'arachidic-acid_100g', 'behenic-acid_100g', 'lignoceric-acid_100g', 'cerotic-acid_100g', 'montanic-acid_100g', 'melissic-acid_100g', 'monounsaturated-fat_100g', 'polyunsaturated-fat_100g', 'omega-3-fat_100g', 'alpha-linolenic-acid_100g', 'eicosapentaenoic-acid_100g', 'docosahexaenoic-acid_100g', 'omega-6-fat_100g', 'linoleic-acid_100g', 'arachidonic-acid_100g', 'gamma-linolenic-acid_100g', 'dihomo-gamma-linolenic-acid_100g', 'omega-9-fat_100g', 'oleic-acid_100g', 'elaidic-acid_100g', 'gondoic-acid_100g', 'mead-acid_100g', 'erucic-

In [7]:
# Create an empty dictionary
feature_classification2 = {
    'metadata': [],
    'product_info': [],
    'location_info': [],
    'ingredients': [],
    'nutritional_info': []
}

# Iterate through the list of features
for feature in quantitative_features:
    # Check the type of information contained in the feature and add it to the appropriate list in the dictionary
    if '_t' in feature or '_datetime' in feature:
        feature_classification2['metadata'].append(feature)
    elif '_name' in feature or '_tags' in feature or '_fr' in feature:
        feature_classification2['product_info'].append(feature)
    elif '_places' in feature or '_code_geo' in feature or 'cities' in feature:
        feature_classification2['location_info'].append(feature)
    elif 'ingredients' in feature:
        feature_classification2['ingredients'].append(feature)
    elif '_100g' in feature or '_serving' in feature or '_value' in feature:
        feature_classification2['nutritional_info'].append(feature)

# Create a DataFrame from the dictionary
df_quantitative_features = pd.DataFrame.from_dict(feature_classification2, orient='index').transpose()

# Display the DataFrame with borders
display(df_quantitative_features.style.set_caption('Feature Classification').set_table_styles([{'selector': '*', 'props': [('border', '1px solid black')]}]))


Unnamed: 0,metadata,product_info,location_info,ingredients,nutritional_info
0,ingredients_that_may_be_from_palm_oil_n,ingredients_from_palm_oil_n,,,energy_100g
1,ingredients_that_may_be_from_palm_oil,ingredients_from_palm_oil,,,energy-from-fat_100g
2,,,,,fat_100g
3,,,,,saturated-fat_100g
4,,,,,butyric-acid_100g
5,,,,,caproic-acid_100g
6,,,,,caprylic-acid_100g
7,,,,,capric-acid_100g
8,,,,,lauric-acid_100g
9,,,,,myristic-acid_100g


### Displaying the qualitative features of a DataFrame

In [8]:
print(f'Qualitative features:\n{qualitative_features}')

Qualitative features:
['code', 'url', 'creator', 'created_t', 'created_datetime', 'last_modified_t', 'last_modified_datetime', 'product_name', 'generic_name', 'quantity', 'packaging', 'packaging_tags', 'brands', 'brands_tags', 'categories', 'categories_tags', 'categories_fr', 'origins', 'origins_tags', 'manufacturing_places', 'manufacturing_places_tags', 'labels', 'labels_tags', 'labels_fr', 'emb_codes', 'emb_codes_tags', 'first_packaging_code_geo', 'cities', 'cities_tags', 'purchase_places', 'stores', 'countries', 'countries_tags', 'countries_fr', 'ingredients_text', 'allergens', 'allergens_fr', 'traces', 'traces_tags', 'traces_fr', 'serving_size', 'additives', 'additives_tags', 'additives_fr', 'ingredients_from_palm_oil_tags', 'ingredients_that_may_be_from_palm_oil_tags', 'nutrition_grade_fr', 'pnns_groups_1', 'pnns_groups_2', 'states', 'states_tags', 'states_fr', 'main_category', 'main_category_fr', 'image_url', 'image_small_url']


In [9]:
# Create an empty dictionary
feature_classification3 = {
    'metadata': [],
    'product_info': [],
    'location_info': [],
    'ingredients': [],
    'nutritional_info': []
}

# Iterate through the list of features
for feature in qualitative_features:
    # Check the type of information contained in the feature and add it to the appropriate list in the dictionary
    if '_t' in feature or '_datetime' in feature:
        feature_classification3['metadata'].append(feature)
    elif '_name' in feature or '_tags' in feature or '_fr' in feature:
        feature_classification3['product_info'].append(feature)
    elif '_places' in feature or '_code_geo' in feature or 'cities' in feature:
        feature_classification3['location_info'].append(feature)
    elif 'ingredients' in feature:
        feature_classification3['ingredients'].append(feature)
    elif '_100g' in feature or '_serving' in feature or '_value' in feature:
        feature_classification3['nutritional_info'].append(feature)

# Create a DataFrame from the dictionary
df_qualitative_features = pd.DataFrame.from_dict(feature_classification3, orient='index').transpose()

# Display the DataFrame with borders
display(df_qualitative_features.style.set_caption('Feature Classification').set_table_styles([{'selector': '*', 'props': [('border', '1px solid black')]}]))


Unnamed: 0,metadata,product_info,location_info,ingredients,nutritional_info
0,created_t,product_name,manufacturing_places,,
1,created_datetime,generic_name,first_packaging_code_geo,,
2,last_modified_t,categories_fr,cities,,
3,last_modified_datetime,labels_fr,purchase_places,,
4,packaging_tags,countries_fr,,,
5,brands_tags,allergens_fr,,,
6,categories_tags,traces_fr,,,
7,origins_tags,additives_fr,,,
8,manufacturing_places_tags,nutrition_grade_fr,,,
9,labels_tags,states_fr,,,


### Classifying Datetime Columns as Quantitative Features

In [10]:
time_features = [item for item in qualitative_features if item.endswith(('_t', '_datetime'))]
quantitative_features.extend(time_features)
qualitative_features = [item for item in qualitative_features if item not in time_features]

Timestamp and datetime columns are treated here as quantitative features because they represent a measurable quantity (the time at which an event occurred) that can be ordered and subtracted. Note that this cell only moves the column names from one list to the other. The columns themselves still have the `object` dtype reported earlier, so they need to be parsed before any numerical analysis.
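Parsing the time columns into numeric or datetime dtypes could look like the sketch below. It assumes that the `*_t` columns hold Unix timestamps in seconds stored as text and that the `*_datetime` columns hold ISO 8601 strings. Both are assumptions about the file, and the `toy` frame is made up for illustration.

```python
import pandas as pd

# Made-up rows; the third one is deliberately malformed
toy = pd.DataFrame({
    "created_t": ["1489069957", "1489069957", "not-a-number"],
    "created_datetime": ["2017-03-09T14:32:37Z", "2017-03-09T14:32:37Z", None],
})

# Assumption: *_t columns are Unix timestamps (seconds) stored as strings.
# errors="coerce" turns unparseable entries into NaN instead of raising.
toy["created_t"] = pd.to_numeric(toy["created_t"], errors="coerce")

# Assumption: *_datetime columns are ISO 8601 strings.
# Unparseable or missing entries become NaT.
toy["created_datetime"] = pd.to_datetime(
    toy["created_datetime"], errors="coerce", utc=True
)

print(toy.dtypes)
```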

## Analyzing Feature Fill Rates in a DataFrame

In [11]:
def get_fill_rates(df):
    # Calculate the fill rate for each column
    fill_rates = df.count() / df.shape[0]
    
    # Convert the fill rates into a dataframe
    fill_rates_df = pd.DataFrame(fill_rates, columns=['Fill Rate'])
    
    # Sort the dataframe by the Fill Rate column in descending order
    fill_rates_df = fill_rates_df.sort_values(by='Fill Rate', ascending=False)
    
    return fill_rates_df


fill_rates_df = get_fill_rates(df)
fill_rates_df.style.format({'Fill Rate': '{:.2%}'}).background_gradient(cmap='RdYlGn')


Unnamed: 0,Fill Rate
last_modified_t,100.00%
last_modified_datetime,100.00%
creator,100.00%
created_t,100.00%
created_datetime,100.00%
code,99.99%
url,99.99%
states_fr,99.99%
states_tags,99.99%
states,99.99%


In [12]:
# Calculate the fill rate for each column
fill_rate = df.count() / df.shape[0]

# Apply a threshold to the fill rate and count the number of columns below it
threshold = 0.5
low_fill_rate = fill_rate[fill_rate < threshold].count()

# Calculate the proportion of columns with a fill rate below the threshold
low_fill_rate_proportion = low_fill_rate / len(fill_rate)

# Print the results
print(f"Number of features with fill rate under {threshold:.0%}: {low_fill_rate}")
print(f"Proportion of features with fill rate under {threshold:.0%}: {low_fill_rate_proportion:.2%}")


Number of features with fill rate under 50%: 128
Proportion of features with fill rate under 50%: 79.01%


# Observations

The results above show that:

- A large majority (<span style="color:red">**79.01%**</span>, or 128 of 162) of the features have a fill rate below 50%.

For each of these features, more than half of the values are missing, which could affect the accuracy of any model trained on the data.
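To make the relationship between fill rate and missing values concrete, here is a minimal sketch on a made-up frame. It uses `notna().mean()`, which gives the same per-column share of non-null values as `count()` divided by the number of rows.

```python
import numpy as np
import pandas as pd

# Made-up data: four rows with different amounts of missing values
toy = pd.DataFrame({
    "code": ["1", "2", "3", "4"],
    "fat_100g": [1.0, np.nan, np.nan, np.nan],
    "salt_100g": [0.1, 0.2, np.nan, 0.4],
})

fill_rate = toy.notna().mean()   # share of non-null values per column
missing_rate = 1 - fill_rate     # share of null values per column

# Columns where fewer than half of the values are present
sparse_columns = fill_rate[fill_rate < 0.5].index.tolist()

print(fill_rate)
print(sparse_columns)
```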

In [13]:
# Select the Fill Rate column from the fill_rates_df dataframe
fill_rate = fill_rates_df['Fill Rate']

# Create a boolean mask indicating which columns have a fill rate of at least 50%
mask = fill_rate >= 0.5

# Select only the columns with a fill rate of at least 50%
filtered_df = df.loc[:, mask]

filtered_df.shape


(320772, 34)
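An alternative way to express this column filter is `DataFrame.dropna` with the `thresh` argument, which keeps only columns that have at least a given number of non-null values. The sketch below runs on a made-up frame. Rounding the threshold up with `math.ceil` is a choice made here so that the result matches a `>=` comparison on the fill rate.

```python
import math

import numpy as np
import pandas as pd

# Made-up data: fill rates of 100%, 25%, and 50%
toy = pd.DataFrame({
    "code": ["1", "2", "3", "4"],
    "fat_100g": [1.0, np.nan, np.nan, np.nan],
    "salt_100g": [0.1, 0.2, np.nan, np.nan],
})

threshold = 0.5
# Minimum number of non-null values a column needs to be kept
min_non_null = math.ceil(threshold * len(toy))

# Drop every column that has fewer than min_non_null non-null values
filtered = toy.dropna(axis="columns", thresh=min_non_null)

print(filtered.columns.tolist())
```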

In [14]:
fill_rates_df = get_fill_rates(filtered_df)
fill_rates_df.style.format({'Fill Rate': '{:.2%}'}).background_gradient(cmap='RdYlGn')
filtered_df.columns

Index(['code', 'url', 'creator', 'created_t', 'created_datetime',
       'last_modified_t', 'last_modified_datetime', 'product_name', 'brands',
       'brands_tags', 'countries', 'countries_tags', 'countries_fr',
       'ingredients_text', 'serving_size', 'additives_n', 'additives',
       'ingredients_from_palm_oil_n',
       'ingredients_that_may_be_from_palm_oil_n', 'nutrition_grade_fr',
       'states', 'states_tags', 'states_fr', 'energy_100g', 'fat_100g',
       'saturated-fat_100g', 'carbohydrates_100g', 'sugars_100g', 'fiber_100g',
       'proteins_100g', 'salt_100g', 'sodium_100g', 'nutrition-score-fr_100g',
       'nutrition-score-uk_100g'],
      dtype='object')

# Nutritional Comparison Application

If I am building an application that helps users compare the nutritional values of different products, I might want to select the following features:

- `energy_100g`: Energy content per 100g
- `proteins_100g`: Protein content per 100g
- `salt_100g`: Salt content per 100g
- `sodium_100g`: Sodium content per 100g
- `sugars_100g`: Sugar content per 100g
- `fat_100g`: Fat content per 100g
- `carbohydrates_100g`: Carbohydrate content per 100g
- `saturated-fat_100g`: Saturated fat content per 100g
- `nutrition_grade_fr`: Nutrition grade according to the French food labeling system
- `nutrition-score-fr_100g`: Nutrition score according to the French food labeling system, per 100g
- `nutrition-score-uk_100g`: Nutrition score according to the UK food labeling system, per 100g

These features provide important information about the nutritional content of the products, which would be relevant and useful for this type of application.
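As a rough illustration of such a comparison, the sketch below places two products side by side on a subset of these features. The product names and all numeric values are made up for the example, and `COMPARISON_FEATURES` is a hypothetical constant.

```python
import pandas as pd

# Hypothetical subset of the features listed above
COMPARISON_FEATURES = ["energy_100g", "proteins_100g", "sugars_100g", "fat_100g"]

# Made-up products and values, for illustration only
products = pd.DataFrame({
    "product_name": ["Hypothetical cereal", "Hypothetical yogurt"],
    "energy_100g": [1600.0, 400.0],
    "proteins_100g": [8.0, 4.0],
    "sugars_100g": [25.0, 12.0],
    "fat_100g": [6.0, 3.0],
}).set_index("product_name")

# One row per nutrient, one column per product
comparison = products[COMPARISON_FEATURES].T
comparison["difference"] = (
    comparison["Hypothetical cereal"] - comparison["Hypothetical yogurt"]
)

print(comparison)
```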


# Building a Product Classification Model

To build a machine learning model that classifies products based on their nutritional content or other characteristics, I could use several of the features retained after filtering. Some features that might be relevant for this task include:

- `energy_100g`: The energy content of the product, in kilojoules per 100 grams (assuming the data follows the Open Food Facts field conventions that these column names match).
- `fat_100g`: The fat content of the product, in grams per 100 grams.
- `saturated-fat_100g`: The saturated fat content of the product, in grams per 100 grams.
- `carbohydrates_100g`: The carbohydrate content of the product, in grams per 100 grams.
- `sugars_100g`: The sugar content of the product, in grams per 100 grams.
- `fiber_100g`: The fiber content of the product, in grams per 100 grams.
- `proteins_100g`: The protein content of the product, in grams per 100 grams.
- `salt_100g`: The salt content of the product, in grams per 100 grams.
- `sodium_100g`: The sodium content of the product, in grams per 100 grams (under the same Open Food Facts convention, where `_100g` nutrient fields are expressed in grams).
- `nutrition_grade_fr`: The nutritional grade of the product according to the French system, where "A" is the best grade and "E" is the worst.

I could also consider using additional features that provide information about the product, such as `product_name`, `brands`, `ingredients_text`, and `serving_size`, as these might be relevant for determining the overall "healthiness" of a product.
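Preparing such features for a classifier might start with something like the following sketch, which uses only pandas. It assumes that `nutrition_grade_fr` is the target and holds lowercase grades from `a` to `e`, and it fills missing numeric values with the column median. Both are illustrative choices, and the data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical feature subset and target column
NUMERIC_FEATURES = ["energy_100g", "fat_100g", "sugars_100g"]
TARGET = "nutrition_grade_fr"

# Made-up rows, including missing values
toy = pd.DataFrame({
    "energy_100g": [2000.0, 150.0, np.nan, 900.0],
    "fat_100g": [30.0, 0.5, 10.0, np.nan],
    "sugars_100g": [40.0, 3.0, 12.0, 5.0],
    "nutrition_grade_fr": ["e", "a", None, "c"],
})

# Rows without a target label cannot be used for supervised training
labeled = toy.dropna(subset=[TARGET])

# Assumption: median imputation is acceptable for the numeric features
X = labeled[NUMERIC_FEATURES].fillna(labeled[NUMERIC_FEATURES].median())

# Encode the grades as ordered integer codes: a -> 0, ..., e -> 4
y = pd.Categorical(labeled[TARGET], categories=list("abcde"), ordered=True).codes

print(X)
print(y)
```

The resulting `X` and `y` could then be passed to whichever classifier is chosen for the task.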
