<a href="https://www.kaggle.com/code/mcpenguin/mcdonalds-predict-calorie-content-from-nutrients?scriptVersionId=143235089" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# McDonalds Dataset - Predict Calorie Content from Nutrient Composition

In this notebook, we will try to predict the calorie count of McDonalds food items from their nutritional composition.

# 0 Import Libraries Needed

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

import matplotlib.pyplot as plt
import seaborn as sns
import torch # only to check if gpu acceleration is available

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb

# 1 Load Data

We first load our dataset:

In [None]:
DATA_DIR = "../input/mcdonalds-nutrition"

df = pd.read_csv(os.path.join(DATA_DIR, "McDonaldsMenuNutrition.csv"))

# 2 First Look at Data

As a first step, let us see what the data looks like without any feature engineering:

In [None]:
df.head()

The column names are a little messy, so one step we should take before further analysis is to clean them up and make them easier to work with.

Most of these attributes are self-explanatory, but I had never encountered the concept of "Weight Watchers" points before, so I wanted to clarify what they entail. After a quick Google search, it seems that Weight Watchers points quantify how likely a food is to contribute to weight gain: high-sugar foods receive a high point value, whereas high-protein and high-fiber foods receive a low one. You can find more information through this link: https://www.weightwatchers.com/au/how-it-works/points

In [None]:
print("Size of dataset:", df.shape)

Let's also check for any missing values in the dataset:

In [None]:
df.isna().sum()

We see that there are some missing values, including rows where the `Calories` value itself is missing. We will have to exclude these rows from our training.

Before we proceed with any further analysis, let's clean up the column names to get rid of the newline character `\n` and expand any abbreviations to improve clarity:

In [None]:
df = df.rename({
    "Calories from\nFat": "Calories From Fat",
    "Total Fat\n(g)": "Total Fat (g)",
    "Saturated Fat\n(g)": "Saturated Fat (g)",
    "Trans Fat\n(g)": "Trans Fat (g)",
    "Cholesterol\n(mg)": "Cholesterol (mg)",
    "Sodium \n(mg)": "Sodium (mg)",
    "Carbs\n(g)": "Carbohydrates (g)",
    "Fiber\n(g)": "Fiber (g)",
    "Sugars\n(g)": "Sugars (g)",
    "Protein\n(g)": "Protein (g)",
    "Weight Watchers\nPnts": "Weight Watchers Points"
}, axis='columns')
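
Listing every column by hand works well for a dozen names. As a hedged alternative, the newlines could also be collapsed programmatically. The frame below is a stand-in with only a few of the raw column names, and abbreviations such as `Carbs` would still need an explicit mapping:

```python
import pandas as pd

# Stand-in frame with a few of the raw column names (not the full dataset).
demo = pd.DataFrame(columns=["Item", "Total Fat\n(g)", "Sodium \n(mg)", "Carbs\n(g)"])

# Collapse any run of whitespace (including "\n") into a single space.
demo.columns = demo.columns.str.replace(r"\s+", " ", regex=True).str.strip()

# Abbreviations still need an explicit rename.
demo = demo.rename(columns={"Carbs (g)": "Carbohydrates (g)"})

print(list(demo.columns))
```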

Let's examine what the dataset looks like now:

In [None]:
df.head()

Let's also examine the types of the columns to check they align with our intuitions:
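
A minimal sketch of such a check, using a small stand-in frame rather than the real dataset: `DataFrame.dtypes` reports a numeric type for clean columns and `object` for a column that mixes numbers with strings, which is how a stray entry with a unit suffix would show up.

```python
import pandas as pd

# Stand-in data: one clean numeric column and one polluted by a string entry.
demo = pd.DataFrame({
    "Calories": [250.0, 540.0],
    "Saturated Fat (g)": [3.5, "5.5 g"],
})

# Clean columns report float64; the mixed column reports object.
print(demo.dtypes)
```

On the real data, the equivalent call would be `df.dtypes`.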

Next, it might be useful to classify the food items by category, for example Burgers, Sides, Desserts, etc. To do this, let's display the full dataset to see all the different items present:

In [None]:
pd.set_option('display.max_rows', None)

df

From this, we can pick out a few key observations:

* The data points are roughly arranged in order of category, in the sense that similar items appear next to each other in the dataset. This makes categorizing them much easier.

* Many items in the dataset have sizes attributed to them, like **Small, Medium and Large**. One possible feature we could add is a `Size` variate by extracting this information from the item names. However, different items with the same size might still have massively different calorie counts, due to the nature of the food item (e.g. think of the discrepancy between a Large Coke Zero vs. a Large Latte).

* The item **Salad Dressings** (row 64) has basically no information. As such, we can just remove this from the dataset entirely.

* The item **Hamburger Happy Meal** has a `Saturated Fat (g)` value of "5.5 g", which is a string. We should convert this to the corresponding numeric value and change the `dtype` of the associated column to be numeric.

In [None]:
df.loc[df["Item"] == "Hamburger Happy Meal", "Saturated Fat (g)"] = 5.5
df["Saturated Fat (g)"] = df["Saturated Fat (g)"].astype("float64")
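
Patching one cell by hand works when there is a single offender. A more general sketch, assuming any stray entries differ only by a trailing ` g` unit, strips the unit and lets `pd.to_numeric` do the conversion. The values below are made up for illustration:

```python
import pandas as pd

# Made-up values: clean strings, one entry with a unit suffix, one missing.
raw = pd.Series(["3.5", "5.5 g", "10", None], name="Saturated Fat (g)")

# Remove a trailing "g" unit (and surrounding spaces), then convert.
# errors="coerce" turns anything still unparseable into NaN instead of raising.
cleaned = pd.to_numeric(
    raw.str.replace(r"\s*g\s*$", "", regex=True),
    errors="coerce",
)

print(cleaned.tolist())
```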

# 3 Graphical Attributes

Let's plot histograms for all the explanatory variates:

In [None]:
df.hist(bins=30, figsize=(15, 10))
plt.show()

We can also plot the correlation matrix between all the different variates:

In [None]:
corr = df.corr(numeric_only=True)

sns.heatmap(corr)

In particular, as we want to predict the `Calories` variate, we should also find the explanatory variates that correlate most strongly with it:

In [None]:
corr["Calories"].sort_values(ascending=False)

It seems that an item's Weight Watchers points correlate very strongly with its calorie count, which makes sense given the nature of these points. Unsurprisingly, the three main macronutrients (carbohydrates, protein and fat) are also strongly correlated with a food's calorie count.
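
The macronutrient correlations are expected: under the general Atwater factors, carbohydrates and protein supply about 4 kcal per gram and fat about 9 kcal per gram. As a rough sanity check (not part of the model in this notebook), a baseline estimate can be computed directly from the three columns. The values below are illustrative, not rows from the dataset:

```python
import pandas as pd

# Illustrative macronutrient values in grams (not taken from the dataset).
demo = pd.DataFrame({
    "Total Fat (g)": [9.0, 0.0],
    "Carbohydrates (g)": [31.0, 39.0],
    "Protein (g)": [12.0, 0.0],
})

# General Atwater factors: 9 kcal/g for fat, 4 kcal/g for carbs and protein.
demo["Atwater Estimate"] = (
    9 * demo["Total Fat (g)"]
    + 4 * demo["Carbohydrates (g)"]
    + 4 * demo["Protein (g)"]
)

print(demo["Atwater Estimate"].tolist())
```

Comparing such an estimate against the reported `Calories` column would give a simple reference point for judging any learned model.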

# 4 Feature Engineering

We are now ready to add some features to our dataset that might be useful in our predictive analysis.

Firstly, let's add categories to the dataset. These were decided by me by observing the dataset, and might not reflect the actual McDonalds' categories, but they should serve as a good basis:

In [None]:
# default "other", which will include toppings and sauces
df["Category"] = "Other"

df.loc[0:15, "Category"] = "Burgers"
df.loc[15:22, "Category"] = "Sandwiches"
df.loc[22:34, "Category"] = "Wraps"
df.loc[34:36, "Category"] = "Fries"
df.loc[39:42, "Category"] = "McNuggets"
df.loc[46:48, "Category"] = "Chicken Strips"
df.loc[52:62, "Category"] = "Salads"
df.loc[70:104, "Category"] = "Breakfast"
df.loc[105:131, "Category"] = "Desserts"
df.loc[131:145, "Category"] = "Milkshakes"
df.loc[149:175, "Category"] = "Soft Drinks"
df.loc[182:324, "Category"] = "Coffees, Teas and Hot Chocolate"
df.loc[325:330, "Category"] = "Smoothies"

In [None]:
df["Category"].value_counts()
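
One detail worth keeping in mind with the assignments above: label-based slicing with `.loc` includes both endpoints, so where two ranges share a boundary (for example `0:15` and `15:22`), the later assignment wins for the shared row. A small sketch on a toy frame:

```python
import pandas as pd

# Toy frame with a default category, mirroring the approach above.
demo = pd.DataFrame({"Category": ["Other"] * 6})

demo.loc[0:2, "Category"] = "A"  # rows 0, 1 and 2 (both endpoints included)
demo.loc[2:4, "Category"] = "B"  # rows 2, 3 and 4; row 2 is overwritten

print(demo["Category"].tolist())
```

If the shared boundary row is meant to belong to the earlier category, the later range would need to start one row further down.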

We will need to convert these into one-hot encodings when we model this dataset, so let us do so:

In [None]:
df = pd.get_dummies(df, columns=["Category"])
df.head()
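
To illustrate what this transformation produces, here is a toy example with made-up rows. Each distinct category becomes its own indicator column named `Category_<value>`, and the original `Category` column is dropped:

```python
import pandas as pd

# Made-up rows, only to show the shape of the output.
demo = pd.DataFrame({
    "Item": ["Hamburger", "Small Fries", "Latte"],
    "Category": ["Burgers", "Fries", "Coffees"],
})

encoded = pd.get_dummies(demo, columns=["Category"])

print(list(encoded.columns))
```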

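We also need to deal with missing values. Since we are only predicting a food's calorie count from its nutritional breakdown, we do not really care about any missing Weight Watchers Points values. Rows with a missing `Calories` value are dropped first, since they have no target to train on. If any other nutritional value is NA, I will replace it with 0.

In [None]:
# drop rows with no target, as noted when checking for missing values
df = df.dropna(subset=["Calories"])
df = df.fillna(0)
df.isna().sum()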

# 5 Modelling

To model the calorie count of the various food items, we will use gradient boosting regression.

We first train our model:

In [None]:
X = df.loc[:, df.columns != "Calories"]
y = df["Calories"]

# fix the seed so the split (and the reported error) is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

_X_train = X_train.drop(["Item", "Weight Watchers Points"], axis=1)
_X_test = X_test.drop(["Item", "Weight Watchers Points"], axis=1)

dtrain_reg = xgb.DMatrix(_X_train, y_train, enable_categorical=True)
dtest_reg = xgb.DMatrix(_X_test, y_test, enable_categorical=True)

# define parameters for xgboost regression
if torch.cuda.is_available():
    tree_method = "gpu_hist"
else:
    tree_method = "hist"

params = {"objective": "reg:squarederror", "tree_method": tree_method}
model = xgb.train(
   params = params,
   dtrain = dtrain_reg,
   num_boost_round = 100,
)

We are now ready to predict on our test set.

# 6 Prediction

Firstly, we initialize a results table:

In [None]:
results = pd.DataFrame({"Item": X_test["Item"], "Calories": y_test})
results = results.rename({"Calories": "Actual Calories"}, axis='columns')
results.head()

Next, we can add our predictions to the table:

In [None]:
pred = model.predict(dtest_reg)
results["Predicted Results"] = pred
results.head()

To quantify the error between the actual and predicted values, we can use the **Mean Squared Error (MSE)**.

In [None]:
mean_squared_error(y_test, pred)
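
MSE is expressed in squared calories, which is hard to read directly. Its square root (RMSE) is in the same unit as the target. A small sketch with made-up values, showing the computation by hand alongside `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted calorie counts.
actual = np.array([250.0, 540.0, 110.0])
predicted = np.array([260.0, 520.0, 110.0])

# MSE by hand: the mean of the squared differences.
mse_manual = np.mean((actual - predicted) ** 2)

# Same quantity from scikit-learn, and its square root in calorie units.
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)

print(mse_manual, mse, rmse)
```

For the notebook's own results, the equivalent would be `np.sqrt(mean_squared_error(y_test, pred))`.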