<a href="https://www.kaggle.com/code/manishkr1754/calories-burnt-prediction?scriptVersionId=144357952" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

---
<center><h1>Calories Burnt Prediction</h1></center>
<center><h3>Part of 30 Days 30 ML Projects Challenge</h3></center>

---

## 1) Understanding Problem Statement
---

In today's health-conscious society, monitoring and managing calorie expenditure is a key aspect of maintaining a healthy lifestyle. Understanding how various activities and individual factors impact calorie burn is crucial for individuals striving to achieve fitness goals. Leveraging the capabilities of data science, we aim to address this health and wellness challenge.

This is a **regression machine learning problem**. The primary objective is **to develop a predictive model for calories burnt**. Using input features such as physical activity type, duration, and intensity, together with individual characteristics like age, weight, and gender, the goal is to build a model that accurately estimates the number of calories burnt during a specific activity. Such a model can give individuals, fitness enthusiasts, and healthcare professionals useful insights for calorie management and physical activity planning.

## 2) Understanding Data
---

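The project uses **Calories Data**, which is split across two CSV files: one holds the per-user exercise records (the independent variables, including `Gender`, `Age`, `Height`, and `Weight`), and the other holds the outcome variable `Calories` (the dependent variable).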

## 3) Getting System Ready
---
Importing required libraries


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")
%matplotlib inline

## 4) Data Eyeballing
---

### Loading Data

In [None]:
calories = pd.read_csv('Datasets/Day16_Calories_Data1.csv') 

In [None]:
calories

In [None]:
exercise = pd.read_csv('Datasets/Day16_Calories_Data2.csv') 

In [None]:
exercise

### Concatenating both dataframes

In [None]:
calories_data = pd.concat([exercise,calories['Calories']], axis=1)

In [None]:
calories_data
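`pd.concat(..., axis=1)` pairs rows by index, so the combined frame is only correct if both files list the users in the same order. The sketch below uses made-up toy frames (`exercise_demo` and `calories_demo` are hypothetical names) and assumes the calories file also carries a `User_ID` column, which the cells above do not confirm. Under that assumption, a key-based merge is one way to guard against misaligned rows.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are made up for illustration).
exercise_demo = pd.DataFrame({"User_ID": [1, 2, 3], "Age": [25, 40, 31]})
calories_demo = pd.DataFrame({"User_ID": [3, 1, 2], "Calories": [120.0, 80.0, 95.0]})

# Positional concat pairs rows by index, so rows in a different order
# end up attached to the wrong user.
by_position = pd.concat([exercise_demo, calories_demo["Calories"]], axis=1)

# Key-based merge pairs rows by User_ID; validate raises an error
# if a User_ID appears more than once on either side.
by_key = exercise_demo.merge(
    calories_demo, on="User_ID", how="inner", validate="one_to_one"
)

print(by_position)
print(by_key)
```

If the two files are known to share the same row order, the positional concat gives the same result as the merge.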

In [None]:
print('The size of Dataframe is: ', calories_data.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
calories_data.info()
print('-'*100)

In [None]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in calories_data.columns if calories_data[feature].dtype != 'O']
categorical_features = [feature for feature in calories_data.columns if calories_data[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

In [None]:
print('Missing Value Presence in different columns of DataFrame are as follows : ')
print('-'*100)
total=calories_data.isnull().sum().sort_values(ascending=False)
percent=(calories_data.isnull().sum()/calories_data.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [None]:
print('Summary Statistics of numerical features for DataFrame are as follows:')
print('-'*100)
calories_data.describe()

In [None]:
print('Summary Statistics of categorical features for DataFrame are as follows:')
print('-'*100)
calories_data.describe(include='object').T

## 5) Data Cleaning & Preprocessing
---

### Data Visualization

#### Gender Distribution

In [None]:
plt.figure(figsize=(4,4))
sns.countplot(x= 'Gender', data= calories_data)
plt.show()

#### Age Distribution

In [None]:
plt.figure(figsize=(4,4))
sns.histplot(calories_data['Age'], kde=True, stat='density')  # distplot is deprecated in seaborn
plt.show()

#### Height Distribution

In [None]:
plt.figure(figsize=(4,4))
sns.histplot(calories_data['Height'], kde=True, stat='density')  # distplot is deprecated in seaborn
plt.show()

#### Weight Distribution

In [None]:
plt.figure(figsize=(4,4))
sns.histplot(calories_data['Weight'], kde=True, stat='density')  # distplot is deprecated in seaborn
plt.show()

#### Heatmap to understand Correlation

In [None]:
correlation = calories_data.corr(numeric_only=True)

In [None]:
# constructing a heatmap to understand the correlation

plt.figure(figsize=(10,6))
sns.heatmap(correlation, cbar=True, square=True, fmt='.1f', annot=True, annot_kws={'size':8}, cmap='Blues')


### Encoding the Categorical Features

In [None]:
# Map the two gender labels to integers (male -> 0, female -> 1)
calories_data['Gender'] = calories_data['Gender'].map({'male': 0, 'female': 1})

In [None]:
calories_data
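A hand-written label mapping only covers the labels it lists. With `replace`, an unlisted label passes through unchanged, while with `Series.map` it becomes `NaN`, which makes it easy to detect. The following minimal sketch uses a made-up series (`gender_demo` is a hypothetical name, and the stray `'Male'` label is invented for illustration) to show one way to check for unmapped labels after encoding.

```python
import pandas as pd

# Made-up labels; the capitalised 'Male' simulates an unexpected value.
gender_demo = pd.Series(["male", "female", "female", "Male"], name="Gender")
mapping = {"male": 0, "female": 1}

# map() returns NaN for any label that is missing from the mapping.
encoded = gender_demo.map(mapping)

# Collect the original labels that were not covered by the mapping.
unmapped = gender_demo[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list means every label in the column was encoded.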

## 6) Model Building
---

### Creating Feature Matrix (Independent Variables) & Target Variable (Dependent Variable)

In [None]:
# separating the features and the target
X = calories_data.drop(columns=['User_ID', 'Calories']) # Feature matrix
y = calories_data['Calories'] # Target variable

In [None]:
X

In [None]:
y

### Data Standardization

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [None]:
scaler.fit(X)

In [None]:
standardized_data = scaler.transform(X)

In [None]:
standardized_data

In [None]:
X = standardized_data

In [None]:
X

### Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

In [None]:
print(y.shape, y_train.shape, y_test.shape)
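The scaler above was fitted on the full feature matrix before the split, so the rows that end up in the test set also contribute to the scaling means and standard deviations. A common alternative is to split first and let a scikit-learn `Pipeline` fit the scaler on the training rows only. The sketch below illustrates this on synthetic data (`X_demo`, `y_demo`, and the choice of `Ridge` are assumptions for illustration, not the notebook's data or final model).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (not the Calories data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Split first, so the test rows never influence the scaling statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=45
)

# The pipeline fits StandardScaler on X_tr only, then reuses those means and
# standard deviations when transforming X_te inside predict/score.
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X_tr, y_tr)

scaler_means = pipe.named_steps["standardscaler"].mean_
print("R2 on held-out rows:", pipe.score(X_te, y_te))
```

With a large dataset the difference in scores is usually small, but the pipeline keeps the test set strictly unseen during preprocessing.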

### Model Comparison : Training & Evaluation

In [None]:
# For Model Building
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
models = [LinearRegression, Lasso, Ridge, SVR, DecisionTreeRegressor, RandomForestRegressor]
mae_scores = []
mse_scores = []
rmse_scores = []
r2_scores = []

for model in models:
    regressor = model().fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    
    mae_scores.append(mean_absolute_error(y_test, y_pred))
    mse_scores.append(mean_squared_error(y_test, y_pred))
    # RMSE is the square root of MSE (the squared=False argument is deprecated in recent scikit-learn)
    rmse_scores.append(np.sqrt(mean_squared_error(y_test, y_pred)))
    r2_scores.append(r2_score(y_test, y_pred))

In [None]:
regression_metrics_df = pd.DataFrame({
    "Model": ["Linear Regression", "Lasso", "Ridge", "SVR", "Decision Tree Regressor", "Random Forest Regressor"],
    "Mean Absolute Error": mae_scores,
    "Mean Squared Error": mse_scores,
    "Root Mean Squared Error": rmse_scores,
    "R-squared (R2)": r2_scores
})

regression_metrics_df.set_index('Model', inplace=True)
regression_metrics_df
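For reference, the four metrics in the table can be computed by hand. The sketch below uses made-up targets and predictions (`y_true` and `y_hat` are hypothetical values, not results from this notebook) so the arithmetic is easy to follow.

```python
import numpy as np

# Made-up targets and predictions, only to make the arithmetic visible.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 205.0, 245.0])

errors = y_true - y_hat                          # [-10, 10, -5, 5]

mae = np.mean(np.abs(errors))                    # mean absolute error
mse = np.mean(errors ** 2)                       # mean squared error
rmse = np.sqrt(mse)                              # same units as the target

ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # share of variance explained

print(f"MAE={mae}, MSE={mse}, RMSE={rmse:.4f}, R2={r2:.2f}")
# prints: MAE=7.5, MSE=62.5, RMSE=7.9057, R2=0.98
```

MAE and RMSE are in the same units as the target (calories here), while R2 is unitless, which is why the table reports them side by side.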

### Inference

Evaluating the machine learning models for calorie expenditure prediction leads to the following findings:

1. **Accuracy**: All models achieve high predictive accuracy, with R-squared (R2) values exceeding 0.96 on the test set. They capture most of the variation in the calorie burn data, which indicates that they are suitable for the task.

2. **Error Metrics**: The Random Forest Regressor achieves the lowest Mean Absolute Error (MAE) and Mean Squared Error (MSE), so its calorie burn predictions have the smallest errors among the models compared.

3. **Generalization**: The Support Vector Regressor (SVR) and Decision Tree Regressor also perform very well, with low errors and high R-squared values on the held-out test set. This suggests that they generalize well and capture the underlying patterns in the data.

In summary, all models provide accurate calorie expenditure predictions, with the Random Forest Regressor showing a slight edge in terms of precision. The choice of model may depend on specific requirements, but overall, these models are well-suited for estimating calorie burn based on the provided features.
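The comparison above rests on a single 80/20 split. One way to check how stable such a ranking is would be k-fold cross-validation, which reports a mean and spread of the score across several splits. The sketch below illustrates this on synthetic data. `X_demo`, `y_demo`, the two candidate models, and `n_estimators=50` are assumptions chosen for a quick run, not the notebook's data or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data with a non-linear interaction term.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(300, 4))
y_demo = (
    30.0 * X_demo[:, 0] * X_demo[:, 1]
    + 5.0 * X_demo[:, 2]
    + rng.normal(scale=0.5, size=300)
)

# Five shuffled folds; every row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=45)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=45),
}

cv_summary = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="r2")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: R2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the gap between two models is smaller than the spread across folds, the single-split ranking should be read with caution.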