# Predict Calorie Expenditure

### *Playground Series - Season 5, Episode 5*  
  
##### **Dataset Description**

The dataset for this competition (both train and test) was generated from a deep learning model trained on the Calories Burnt Prediction dataset. Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating it in training improves model performance.
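
One way to try the second suggestion is to stack the original rows under the synthetic ones and keep a flag recording where each row came from. The sketch below uses two tiny hand-made frames as stand-ins; the names `synthetic`, `original`, and the `is_original` column are assumptions for illustration, not part of the competition files.

```python
import pandas as pd

# Toy stand-ins for the synthetic training data and the original dataset
# (hypothetical values; the real files have many more rows and columns).
synthetic = pd.DataFrame({
    "Sex": ["male", "female"],
    "Age": [30, 45],
    "Duration": [20.0, 10.0],
    "Calories": [110.0, 40.0],
})
original = pd.DataFrame({
    "Sex": ["female"],
    "Age": [52],
    "Duration": [5.0],
    "Calories": [20.0],
})

# Flag the source so a later analysis can tell the rows apart.
synthetic["is_original"] = 0
original["is_original"] = 1

combined = pd.concat([synthetic, original], ignore_index=True)
print(combined)
```

Comparing validation scores with and without the `is_original == 1` rows is one way to see whether the extra data helps.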

###### **Files**

*    `train.csv` - the training dataset; Calories is the continuous target
*    `test.csv` - the test dataset; your objective is to predict the Calories for each row
*    `sample_submission.csv` - a sample submission file in the correct format.

### Exploratory Data Analysis

The Kaggle competition page shows the distribution of values for each field. Here we load the training data and inspect its structure and summary statistics.

In [3]:
import pandas as pd 

url = 'https://raw.githubusercontent.com/maggieclark/kaggle-calories/refs/heads/main/train.csv'
df = pd.read_csv(url)

df.head()

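| | id | Sex | Age | Height | Weight | Duration | Heart_Rate | Body_Temp | Calories |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | male | 36 | 189.0 | 82.0 | 26.0 | 101.0 | 41.0 | 150.0 |
| 1 | 1 | female | 64 | 163.0 | 60.0 | 8.0 | 85.0 | 39.7 | 34.0 |
| 2 | 2 | female | 51 | 161.0 | 64.0 | 7.0 | 84.0 | 39.8 | 29.0 |
| 3 | 3 | male | 20 | 192.0 | 90.0 | 25.0 | 105.0 | 40.7 | 140.0 |
| 4 | 4 | female | 38 | 166.0 | 61.0 | 25.0 | 102.0 | 40.6 | 146.0 |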


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750000 entries, 0 to 749999
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   id          750000 non-null  int64  
 1   Sex         750000 non-null  object 
 2   Age         750000 non-null  int64  
 3   Height      750000 non-null  float64
 4   Weight      750000 non-null  float64
 5   Duration    750000 non-null  float64
 6   Heart_Rate  750000 non-null  float64
 7   Body_Temp   750000 non-null  float64
 8   Calories    750000 non-null  float64
dtypes: float64(6), int64(2), object(1)
memory usage: 51.5+ MB


In [5]:
df.describe()

| | id | Age | Height | Weight | Duration | Heart_Rate | Body_Temp | Calories |
|---|---|---|---|---|---|---|---|---|
| count | 750000.0 | 750000.0 | 750000.0 | 750000.0 | 750000.0 | 750000.0 | 750000.0 | 750000.0 |
| mean | 374999.5 | 41.420404 | 174.697685 | 75.145668 | 15.421015 | 95.483995 | 40.036253 | 88.282781 |
| std | 216506.495284 | 15.175049 | 12.824496 | 13.982704 | 8.354095 | 9.449845 | 0.779875 | 62.395349 |
| min | 0.0 | 20.0 | 126.0 | 36.0 | 1.0 | 67.0 | 37.1 | 1.0 |
| 25% | 187499.75 | 28.0 | 164.0 | 63.0 | 8.0 | 88.0 | 39.6 | 34.0 |
| 50% | 374999.5 | 40.0 | 174.0 | 74.0 | 15.0 | 95.0 | 40.3 | 77.0 |
| 75% | 562499.25 | 52.0 | 185.0 | 87.0 | 23.0 | 103.0 | 40.7 | 136.0 |
| max | 749999.0 | 79.0 | 222.0 | 132.0 | 30.0 | 128.0 | 41.5 | 314.0 |


#### **Pre-processing Steps:** 

* drop the `id` column
* drop duplicate rows
* encode the categorical variable (`Sex`)
* standardize the numeric variables (zero mean, unit variance)
* split the data 80/20 into training and testing sets
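
The steps above can also be sketched as a single scikit-learn pipeline. This is one possible arrangement, not the code used in the cells that follow: it splits first and fits the scaler on the training rows only, so no statistics from the held-out rows leak into preprocessing, and it uses `OneHotEncoder(drop="if_binary")` in place of a manual mapping for `Sex`. The synthetic frame `toy` and its coefficients are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical; not the competition data).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "id": np.arange(n),
    "Sex": rng.choice(["male", "female"], size=n),
    "Age": rng.integers(20, 80, size=n),
    "Duration": rng.uniform(1, 30, size=n),
    "Heart_Rate": rng.uniform(67, 128, size=n),
})
toy["Calories"] = (
    5.0 * toy["Duration"] + 0.5 * toy["Heart_Rate"] + rng.normal(0, 1, size=n)
)

# Drop id first so duplicates are detected on the feature columns.
toy = toy.drop(columns="id").drop_duplicates()

X = toy.drop(columns="Calories")
y = toy["Calories"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=99
)

numeric = ["Age", "Duration", "Heart_Rate"]
preprocess = ColumnTransformer([
    ("sex", OneHotEncoder(drop="if_binary"), ["Sex"]),  # two labels -> one 0/1 column
    ("num", StandardScaler(), numeric),                 # zero mean, unit variance
])

pipe = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
pipe.fit(X_train, y_train)  # scaler statistics come from X_train only
print("R-squared on held-out rows:", pipe.score(X_test, y_test))
```

Keeping the preprocessing inside the pipeline also means the same fitted transform is applied automatically when predicting on new rows.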

In [7]:
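# drop id column and duplicates
# note: id is unique per row, so drop_duplicates() cannot remove anything
# while id is still present; drop id first to deduplicate on the features
df = df.drop_duplicates().drop('id', axis=1)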
df.head()

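| | Sex | Age | Height | Weight | Duration | Heart_Rate | Body_Temp | Calories |
|---|---|---|---|---|---|---|---|---|
| 0 | male | 36 | 189.0 | 82.0 | 26.0 | 101.0 | 41.0 | 150.0 |
| 1 | female | 64 | 163.0 | 60.0 | 8.0 | 85.0 | 39.7 | 34.0 |
| 2 | female | 51 | 161.0 | 64.0 | 7.0 | 84.0 | 39.8 | 29.0 |
| 3 | male | 20 | 192.0 | 90.0 | 25.0 | 105.0 | 40.7 | 140.0 |
| 4 | female | 38 | 166.0 | 61.0 | 25.0 | 102.0 | 40.6 | 146.0 |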
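
The order of `drop_duplicates()` and `drop('id', ...)` changes what counts as a duplicate. A toy frame with made-up values shows the difference between the two orders:

```python
import pandas as pd

# Two rows with identical features but different ids (hypothetical values).
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "Age": [36, 36, 64],
    "Calories": [150.0, 150.0, 34.0],
})

dedupe_then_drop = toy.drop_duplicates().drop("id", axis=1)
drop_then_dedupe = toy.drop("id", axis=1).drop_duplicates()

print(len(dedupe_then_drop))  # 3: the unique id hides the repeated row
print(len(drop_then_dedupe))  # 2: the repeated row is removed
```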


In [8]:
# encode categorical features: Sex is binary (male/female), so map it to 0/1
df['Sex'] = df['Sex'].map({'male':0, 'female':1})

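# standardize numeric features (zero mean, unit variance)
# note: the scaler is fit on all rows here, before the train/test split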
from sklearn.preprocessing import StandardScaler

features = ['Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Body_Temp']
scaler = StandardScaler()
df[features] = scaler.fit_transform(df[features])

In [9]:
df.head()

| | Sex | Age | Height | Weight | Duration | Heart_Rate | Body_Temp | Calories |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -0.357192 | 1.115235 | 0.490201 | 1.266324 | 0.583714 | 1.235772 | 150.0 |
| 1 | 1 | 1.487943 | -0.912137 | -1.083172 | -0.888309 | -1.109436 | -0.431163 | 34.0 |
| 2 | 1 | 0.631273 | -1.068088 | -0.797104 | -1.008011 | -1.215258 | -0.302938 | 29.0 |
| 3 | 0 | -1.411555 | 1.349162 | 1.062337 | 1.146622 | 1.007002 | 0.851095 | 140.0 |
| 4 | 1 | -0.225397 | -0.678209 | -1.011655 | 1.146622 | 0.689536 | 0.722869 | 146.0 |
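
`StandardScaler` replaces each value `x` with `z = (x - mean) / std`, where `mean` and `std` are the column mean and population standard deviation (`ddof=0`). As a check against the summary statistics from `df.describe()`, the first row's `Age` of 36 with mean 41.420404 and standard deviation 15.175049 gives about -0.3572, which agrees with the scaled value in the table. (`describe()` reports the sample standard deviation, but with 750,000 rows the difference is negligible.) The sketch below reproduces the transform by hand on made-up numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up ages, for illustration only.
ages = np.array([[20.0], [36.0], [51.0], [64.0]])

scaled = StandardScaler().fit_transform(ages)

# Same transform by hand; StandardScaler uses the population std (ddof=0).
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0, ddof=0)
print(np.allclose(scaled, manual))  # True

# Worked check using the statistics reported by df.describe().
z_age = (36 - 41.420404) / 15.175049
print(round(z_age, 4))  # -0.3572
```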


In [10]:
# split into training and testing sets 
from sklearn.model_selection import train_test_split 

X = df.drop('Calories', axis=1)
y = df['Calories']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99)

In [13]:
# linear regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

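# the `squared` argument was deprecated in scikit-learn 1.4 and removed in 1.6,
# so take the square root of the MSE instead
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)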
print("R-squared:", r2_score(y_test, y_pred))

RMSE: 11.113501855993082
R-squared: 0.9682386795387966
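
For reference, both metrics can be computed directly from their definitions: RMSE is the square root of the mean squared residual, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up targets and predictions and compares the hand-computed values with scikit-learn's.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([150.0, 34.0, 29.0, 140.0, 146.0])
y_hat = np.array([145.0, 40.0, 25.0, 150.0, 140.0])

residuals = y_true - y_hat
rmse = np.sqrt(np.mean(residuals ** 2))

ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print("RMSE:", rmse, "sklearn:", np.sqrt(mean_squared_error(y_true, y_hat)))
print("R-squared:", r2, "sklearn:", r2_score(y_true, y_hat))
```

Because RMSE is in the units of the target, the linear model's value of about 11.1 can be read against the standard deviation of `Calories`, which `df.describe()` reports as about 62.4.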


In [15]:
# random forest 
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# `squared=False` was removed in scikit-learn 1.6; take the square root instead
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)
print("R-squared:", r2_score(y_test, y_pred))

RMSE: 3.860387983396077
R-squared: 0.9961677078714221
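
A single 80/20 split gives one estimate per model. One way to check whether the gap between two models is stable is k-fold cross-validation. The sketch below runs it on a small synthetic, nonlinear dataset made up for illustration, with `n_estimators` and `random_state` chosen arbitrarily to keep it fast and repeatable; it is not a result for the competition data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic nonlinear data (hypothetical; not the competition data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 100.0 * X[:, 0] * X[:, 1] + rng.normal(0, 1, size=300)

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=99),
}

cv_rmse = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so RMSE is returned negated.
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()
    print(f"{name}: mean RMSE over 5 folds = {cv_rmse[name]:.3f}")
```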
