## Goal


The goal is to predict an individual's obesity risk, which is related to cardiovascular disease, from demographic, dietary, and lifestyle factors.
     
## About the Dataset
    
The data consist of estimated obesity levels for people from Mexico, Peru, and Colombia, aged 14 to 61, with diverse eating habits and physical condition. The data were generated by a deep learning model trained on the [Obesity risk dataset](https://www.kaggle.com/datasets/aravindpcoder/obesity-or-cvd-risk-classifyregressorcluster).  


<a id="toc"></a>

- [1.1 Import Libraries](#1.1)
- [1.2 Import Data](#1.2)
- [1.3 Quick overview](#1.3)
- [1.4 Summary of the data](#1.4)
- [2. Exploratory Data Analysis](#2)
- [3. Pre-Processing](#3)
- [4. Model building](#4)
- [5. Prediction on Test data](#5)


<a id="1.1"></a>
## <b>1.1 <span style='color:#E1B12D'>Import Libraries</span></b> 

In [None]:
%%capture
!pip install scikit-learn xgboost lightgbm catboost

In [None]:
%%capture
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings 
warnings.filterwarnings('ignore')

from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier


<a id="toc"></a>

<a href="#toc" style="background-color: #E1B12D; color: #ffffff; padding: 7px 10px; text-decoration: none; border-radius: 50px;">Back to top</a><a id="toc"></a>

<a id="1.2"></a>
## <b>1.2 <span style='color:#E1B12D'>Import Data</span></b> 

In [None]:
# Option 1: install the visual_eda package in editable mode via its setup.py.
# Only do this once the package is stable
# (ideally, install it directly into your virtual environment).
# %pip install -e .

# Option 2: append the src folder to sys.path
import os
import sys
project_root = os.getcwd()
src_path = os.path.join(project_root, 'src')
sys.path.append(src_path)



In [None]:

from visual_eda.dataset import Dataset
# original_data = Dataset('/kaggle/input/obesity-or-cvd-risk-classifyregressorcluster/ObesityDataSet.csv')
train_data = Dataset('data/raw/train.csv')
test_data = Dataset('data/raw/test.csv')
print(str(train_data))
print(str(test_data))
train_data.summary()

<a id="toc"></a>

<a href="#toc" style="background-color: #E1B12D; color: #ffffff; padding: 7px 10px; text-decoration: none; border-radius: 50px;">Back to top</a><a id="toc"></a>

<a id="1.3"></a>
## <b>1.3 <span style='color:#E1B12D'>Quick overview</span></b> 

In [None]:
#Let's check the samples of data
display('Train:',train_data.data.head())
display('Test:',test_data.data.head())

The attributes related to eating habits and physical condition are:

In [None]:
from tabulate import tabulate
data = [
    ["FAVC", "Frequent consumption of high caloric food"],
    ["FCVC", "Frequency of consumption of vegetables"],
    ["NCP", "Number of main meals"],
    ["CAEC", "Consumption of food between meals"],
    ["CH2O", "Consumption of water daily"],
    ["CALC", "Consumption of alcohol"],
    ["SCC", "Calories consumption monitoring"],
    ["FAF", "Physical activity frequency"],
    ["TUE", "Time using technology devices"],
    ["MTRANS", "Transportation used"]
]
headers = ["Abbreviation", "Full Form"]

table = tabulate(data, headers, tablefmt="pipe")
print(table)

<a id="toc"></a>

<a href="#toc" style="background-color: #E1B12D; color: #ffffff; padding: 7px 10px; text-decoration: none; border-radius: 50px;">Back to top</a><a id="toc"></a>

<a id="1.4"></a>
## <b>1.4 <span style='color:#E1B12D'>Summary of the data</span></b> 

In [None]:
train_data.summary()

About Data:
- The train dataset contains 20758 rows and 18 columns.
- There are no missing or duplicate values in any of the columns.
- Target Variable - Obesity Classification
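
The points above can be double-checked with plain pandas. This is a minimal sketch, assuming the data is available as a regular `DataFrame`; `quick_quality_report` is a hypothetical helper (not part of `visual_eda`), and the toy frame below uses made-up values rather than the competition data.

```python
import pandas as pd


def quick_quality_report(df: pd.DataFrame) -> dict:
    """Return row/column counts plus missing-value and duplicate-row totals."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy frame standing in for the real data (values are made up):
# one missing Age and one exact duplicate row.
toy = pd.DataFrame({
    "Age": [21.0, 35.0, 21.0, None],
    "Gender": ["Female", "Male", "Female", "Male"],
})
report = quick_quality_report(toy)
print(report)
```

On the real data, the same helper could be called on `train_data.data`, assuming that attribute is the underlying DataFrame.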

<a id="toc"></a>

<a href="#toc" style="background-color: #E1B12D; color: #ffffff; padding: 7px 10px; text-decoration: none; border-radius: 50px;">Back to top</a><a id="toc"></a>

<a id="2"></a>
## <b>2 <span style='color:#E1B12D'> Exploratory Data Analysis</span></b> 

Let's visualize each of the variables:

**Target Variable:**

In [None]:
# Plot for NObeyesdad
train_data.show_plot("NObeyesdad")

- **Obesity_Type_III** is the largest class, with a share of **19.5%**.

In [None]:
# Plot for Gender
train_data.show_plot("Gender")

- **Gender** distribution is fairly equal in the dataset.

In [None]:
# Plot for family_history_with_overweight
train_data.show_plot("family_history_with_overweight")

- **82.0%** of people have a family history of overweight.

In [None]:
train_data.show_plot("FAVC")

- **91.4%** of people **frequently consume high-calorie food**.

In [None]:
train_data.show_plot("CAEC")

- **84.4% sometimes** eat between meals, while ~1.5% say they never do.

In [None]:
train_data.show_plot("SMOKE")

- **98.8%** are non-smokers. That doesn't sound right, but let's trust the data.

In [None]:
train_data.show_plot("SCC")

- **96.7% don't bother** monitoring calorie consumption.

(Counting calories? Only folks who don't truly appreciate the art of savoring food would do that, right?)

In [None]:
train_data.show_plot("CALC")

- **72.6%** consume alcohol sometimes, while 2.5% do so frequently.

(Interesting to note that in Test dataset, we have "Always" as well.)
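
A category that appears only in the test set (such as "Always" in `CALC`) can break encoding later, so it is worth checking for systematically. The sketch below is one way to do that; `unseen_categories` is a hypothetical helper, and the toy frames are made up for illustration.

```python
import pandas as pd


def unseen_categories(train: pd.DataFrame, test: pd.DataFrame, columns) -> dict:
    """Map each column to the set of values found in test but not in train."""
    result = {}
    for col in columns:
        extra = set(test[col].dropna().unique()) - set(train[col].dropna().unique())
        if extra:
            result[col] = extra
    return result


# Toy data: "Always" occurs only in the test frame.
toy_train = pd.DataFrame({
    "CALC": ["no", "Sometimes", "Frequently"],
    "SMOKE": ["no", "yes", "no"],
})
toy_test = pd.DataFrame({
    "CALC": ["Sometimes", "Always", "no"],
    "SMOKE": ["no", "no", "yes"],
})
diff = unseen_categories(toy_train, toy_test, ["CALC", "SMOKE"])
print(diff)
```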

In [None]:
train_data.show_plot("MTRANS")

- **97.6%** use some form of vehicle, while only **~2.4%** prefer walking or cycling. That's concerning!

In [None]:
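# Checking for distributions
# df_train and original are not defined earlier in this notebook, so bind them here:
# the DataFrame held by the Dataset object, and the original dataset (same path as in section 3).
df_train = train_data.data.copy()
original = pd.read_csv('/kaggle/input/obesity-or-cvd-risk-classifyregressorcluster/ObesityDataSet.csv')
numeric_columns = df_train.select_dtypes(include=['float64', 'int64']).drop(columns=['id'])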
def dist(train_dataset, original_dataset, columns_list, rows, cols):
    fig, axs = plt.subplots(rows, cols, figsize=(24, 10))
    plt.suptitle('Distribution for numerical features: Train vs Original Dataset', fontsize=16, fontweight='bold')
    axs = axs.flatten()
    
    for i, col in enumerate(columns_list):
        sns.kdeplot(train_dataset[col], ax=axs[i], fill=True, alpha=0.5, linewidth=0.5, color='#05b0a3', label='Train')
        sns.kdeplot(original_dataset[col], ax=axs[i], fill=True, alpha=0.5, linewidth=0.5, color='#d68c78', label='Original')
        axs[i].set_title(f'{col}, Train skewness: {train_dataset[col].skew():.2f}\n Original skewness: {original_dataset[col].skew():.2f}')
        axs[i].legend()
        
    plt.tight_layout()

In [None]:
dist(train_dataset=df_train, original_dataset=original, columns_list=numeric_columns.columns, rows=2, cols=4)

- Age, Height and Weight are roughly normally distributed, with some skewness.

**Visualizing Features with Target Variable:**

In [None]:
colors = ['#1f77b4', '#fc6c44', '#2b8a2b', '#fc7c7c', '#9467bd', '#4ba4ad', '#c7ad18', '#7f7f7f', '#69d108']
fig, axes = plt.subplots(1, 3, figsize=(20, 10))
ax1 = sns.scatterplot(x=df_train['Height'], y=df_train['Age'], hue="NObeyesdad",
                       data=df_train, palette=colors, edgecolor='grey', alpha=0.8, s=9, ax=axes[0])
axes[0].set_title('Height vs Age')
ax2 = sns.scatterplot(x=df_train['Height'], y=df_train['Weight'], hue="NObeyesdad",
                       data=df_train, palette=colors, edgecolor='grey', alpha=0.8, s=9, ax=axes[1])
axes[1].set_title('Height vs Weight')
ax3 = sns.scatterplot(x=df_train['Age'], y=df_train['Weight'], hue="NObeyesdad",
                       data=df_train, palette=colors, edgecolor='grey', alpha=0.8, s=9, ax=axes[2])
axes[2].set_title('Age vs Weight')
for ax in axes.flatten():
    ax.get_legend().remove()
handles, labels = ax1.get_legend_handles_labels()
fig.legend(handles, labels, loc='lower center', bbox_to_anchor=(0.5, -0.1), ncol=len(df_train['NObeyesdad'].unique()),
           title='')
fig.suptitle('Age, Height, Weight against Target', fontsize=20)
fig.subplots_adjust(bottom=0.5, top=0.9, hspace=0.5)
plt.tight_layout()
plt.show()

- This doesn't provide much information. Let's create BMI as a feature and check it individually against the target.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
df_train['BMI']=  df_train['Weight'] / df_train['Height']**2
ax1 = axes[0]
df_sort = df_train.groupby('NObeyesdad')['BMI'].mean().sort_values(ascending=False).index
sns.barplot(x='BMI', y='NObeyesdad', data=df_train, palette='light:#4caba4_r', order=df_sort,
            estimator=np.mean, ci=None, errwidth=0, ax=ax1)
for p in ax1.patches:
    ax1.annotate(f'{p.get_width():.2f}', (p.get_x() + p.get_width() / 2., p.get_y() + p.get_height()),
                ha='center', va='center', xytext=(0, 20), textcoords='offset points', fontsize=10, color='black')
ax1.set_title('Mean BMI by NObeyesdad')
ax1.set_xlabel('BMI')
ax1.set_ylabel('')
sns.despine(left=True, bottom=True, ax=ax1)

# Violin Plot
ax2 = axes[1]
sns.violinplot(x='BMI', y='NObeyesdad', data=df_train, palette='light:#4caba4_r', order=df_sort, ax=ax2)
ax2.set_title('Distribution of BMI by NObeyesdad')
ax2.set_ylabel("")
plt.yticks([])
sns.despine(left=True, bottom=True, ax=ax2)
plt.tight_layout()
plt.show()

- The plot above shows that Obesity_Type_III has the highest mean BMI (41.78), while Normal_Weight has a mean BMI of 22.0.
- Some categories contain BMI values outside the range their label would suggest. Let's investigate further.

In [None]:
df_train.groupby('NObeyesdad')['BMI'].describe().reset_index().style.background_gradient()

- The BMI ranges within each category are inconsistent: for example, Obesity_Type_II ranges from 24.05 to 46.22 and Obesity_Type_III ranges from 18.18 to 54.99, which is not what the labels would suggest.
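
To put numbers on that inconsistency, BMI can be bucketed with the standard WHO adult cut-offs and compared against the given label. This is a rough sketch only: the class names in `BMI_LABELS` are illustrative and do not match the dataset's `NObeyesdad` values one-to-one (the dataset splits overweight into two levels), and the toy rows are made up.

```python
import pandas as pd

# WHO adult BMI cut-offs in kg/m^2; right=False makes each bin [lower, upper).
BMI_BINS = [0, 18.5, 25, 30, 35, 40, float("inf")]
# Illustrative names, not the dataset's own label strings.
BMI_LABELS = ["Underweight", "Normal", "Overweight",
              "Obesity I", "Obesity II", "Obesity III"]


def who_bmi_class(weight_kg: pd.Series, height_m: pd.Series) -> pd.Series:
    """Compute BMI = weight / height^2 and bucket it into WHO classes."""
    bmi = weight_kg / height_m ** 2
    return pd.cut(bmi, bins=BMI_BINS, labels=BMI_LABELS, right=False)


# Toy rows (made up): BMI is roughly 16.3, 22.9, 29.4 and 47.8.
toy = pd.DataFrame({
    "Weight": [50.0, 70.0, 85.0, 130.0],
    "Height": [1.75, 1.75, 1.70, 1.65],
})
toy["who_class"] = who_bmi_class(toy["Weight"], toy["Height"])
print(toy)
```

Cross-tabulating such a column against `NObeyesdad` with `pd.crosstab` would show how many rows fall outside the BMI band their label implies.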

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
ax1 = axes[0]
df_sort = df_train.groupby('NObeyesdad')['Age'].mean().sort_values(ascending=False).index
sns.barplot(x='Age', y='NObeyesdad', data=df_train, palette='light:#4caba4_r', order=df_sort,
            estimator=np.mean, ci=None, errwidth=0, ax=ax1)
for p in ax1.patches:
    ax1.annotate(f'{p.get_width():.2f}', (p.get_x() + p.get_width() / 2., p.get_y() + p.get_height()),
                ha='center', va='center', xytext=(0, 20), textcoords='offset points', fontsize=10, color='black')
ax1.set_title('Mean Age by NObeyesdad')
ax1.set_xlabel('Age')
ax1.set_ylabel('')
sns.despine(left=True, bottom=True, ax=ax1)

# Violin Plot
ax2 = axes[1]
sns.violinplot(x='Age', y='NObeyesdad', data=df_train, palette='light:#4caba4_r', order=df_sort, ax=ax2)
ax2.set_title('Distribution of Age by NObeyesdad')
ax2.set_ylabel("")
plt.yticks([])
sns.despine(left=True, bottom=True, ax=ax2)
plt.tight_layout()
plt.show()

- People with Normal_Weight or Insufficient_Weight are, on average, younger than the rest.

In [None]:
cross_tab = pd.crosstab(df_train['NObeyesdad'], df_train['MTRANS'])
plt.figure(figsize=(10, 5))
sns.heatmap(cross_tab, annot=True, cmap='Blues', fmt='d', cbar=False)
plt.title(' NObeyesdad and MTRANS')
plt.xlabel('')
plt.ylabel('')
plt.show()

- People with Obesity_Type_II or Obesity_Type_III rarely or never walk or cycle, which suggests a lack of physical activity.

In [None]:
plt.figure(figsize=(15, 6))
ax = sns.countplot(x='Gender', hue='NObeyesdad', data=df_train, palette=colors, dodge=True)
plt.title('Distribution of NObeyesdad across Gender')
sns.despine(left=True, bottom=False)
plt.xlabel('')
plt.ylabel('')
plt.yticks([])
for p in ax.patches:
    height = p.get_height()
    ax.annotate(f'{round(height)}', (p.get_x() + p.get_width() / 2., height),
                ha='center', va='center', xytext=(0, 8), textcoords='offset points')
plt.show()

- Obesity Type II is most common among Males, while Obesity Type III is most common among Females.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
df_train['BMI']=  df_train['Weight'] / df_train['Height']**2
ax1 = axes[0]
df_sort = df_train.groupby('Gender')['BMI'].mean().sort_values(ascending=False).index
sns.barplot(x='BMI', y='Gender', data=df_train, palette='light:#4caba4_r', order=df_sort,
            estimator=np.mean, ci=None, errwidth=0, ax=ax1)
for p in ax1.patches:
    ax1.annotate(f'{p.get_width():.2f}', (p.get_x() + p.get_width() / 2., p.get_y() + p.get_height()),
                ha='center', va='center', xytext=(0, 50), textcoords='offset points', fontsize=10, color='black')
ax1.set_title('Mean BMI by Gender')
ax1.set_xlabel('BMI')
ax1.set_ylabel('')
sns.despine(left=True, bottom=True, ax=ax1)
# Violin Plot
ax2 = axes[1]
sns.violinplot(x='BMI', y='Gender', data=df_train, palette='light:#4caba4_r', order=df_sort, ax=ax2)
ax2.set_title('Distribution of BMI by Gender')
ax2.set_ylabel("")
plt.yticks([])
sns.despine(left=True, bottom=True, ax=ax2)
plt.tight_layout()
plt.show()

- Mean BMI is higher among females than among males, consistent with the previous plot of obesity level by gender.

In [None]:
#Correlation heatmap
numeric_columns_original = original.select_dtypes(include=np.number)
numeric_columns_train = df_train.select_dtypes(include=np.number).drop(['id','BMI'], axis=1)
# original
corr_original = numeric_columns_original.corr(method='pearson')
mask_original = np.triu(np.ones_like(corr_original))
fig, axes = plt.subplots(1, 2, figsize=(20, 8))
sns.heatmap(corr_original, annot=True, fmt='.2f', mask=mask_original, cmap='copper_r', cbar=None, linewidth=2, ax=axes[0])
axes[0].set_title('Original Dataset', fontsize=16, fontweight='bold')

# Train
corr_train = numeric_columns_train.corr(method='pearson')
mask_train = np.triu(np.ones_like(corr_train))
sns.heatmap(corr_train, annot=True, fmt='.2f', mask=mask_train, cmap='copper_r', cbar=None, linewidth=2, ax=axes[1])
axes[1].set_title('Train Dataset', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

- Weight and Height have the highest positive correlation.
- The correlations are nearly identical in the original and train datasets.

In [None]:
#Check for presence of outliers in each feature
numeric_columns = df_train.select_dtypes(include=['float64', 'int64']).drop(columns=['id'], axis=1)
fig = plt.figure(figsize=[32,10])
plt.suptitle('Outliers in the data', fontsize=18, fontweight='bold')
fig.subplots_adjust(top=0.92);
fig.subplots_adjust(hspace=0.5, wspace=0.4);
for i ,col in enumerate(numeric_columns):
    ax = fig.add_subplot(3,3, i+1);
    ax = sns.boxplot(data = df_train, x=col ,  color= colors[i]);
    ax.set_title(f'{col}')
    ax.set_xlabel(f'{col}')
    ax.grid(False)
plt.show()

- Age shows outliers.
- The remaining features show no outliers.
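
The points drawn beyond the boxplot whiskers follow the usual 1.5 × IQR rule (seaborn's default `whis=1.5`). The same rule can be applied directly to count the flagged rows; `iqr_outlier_mask` below is a hypothetical helper, and the ages are toy values rather than the real column.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy ages (made up): Q1 = 19.75, Q3 = 23.25, so the upper fence is 28.5.
ages = pd.Series([18, 19, 20, 21, 22, 23, 24, 61], name="Age")
mask = iqr_outlier_mask(ages)
print(ages[mask])
```

Whether such points should be dropped is a separate question: in a dataset with ages up to 61, they may simply be older participants rather than errors.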

<a id="toc"></a>

<a href="#toc" style="background-color: #E1B12D; color: #ffffff; padding: 7px 10px; text-decoration: none; border-radius: 50px;">Back to top</a><a id="toc"></a>

<a id="3"></a>
## <b>3. <span style='color:#E1B12D'>Pre-Processing</span></b> 


The pre-processing steps and hyperparameters are taken from another notebook. Please check out the original work: https://www.kaggle.com/code/moazeldsokyx/pgs4e2-highest-score-lgbm-hyperparameter-tuning/notebook

In [None]:
# Load the datasets again to revert the changes made during EDA (e.g. the BMI column)
df_train = pd.read_csv('/kaggle/input/playground-series-s4e2/train.csv')
original = pd.read_csv('/kaggle/input/obesity-or-cvd-risk-classifyregressorcluster/ObesityDataSet.csv')
df_test = pd.read_csv('/kaggle/input/playground-series-s4e2/test.csv')

In [None]:
def get_variable_types(dataframe):
    continuous_vars = []
    categorical_vars = []

    for column in dataframe.columns:
        if dataframe[column].dtype == 'object':
            categorical_vars.append(column)
        else:
            continuous_vars.append(column)

    return continuous_vars, categorical_vars

continuous_vars, categorical_vars = get_variable_types(df_train)
categorical_vars.remove('NObeyesdad')

In [None]:
train = pd.concat([df_train, original]).drop(['id'], axis=1).drop_duplicates()
test = df_test.drop(['id'], axis=1)

In [None]:
train = pd.get_dummies(train, columns=categorical_vars, drop_first=True)
test = pd.get_dummies(test, columns=categorical_vars, drop_first=True)
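
Encoding train and test separately only gives matching columns if both frames contain the same categories. As a hedged sketch (toy data, not the competition files), one way to guard against a mismatch is to reindex the encoded test frame to the training columns:

```python
import pandas as pd

# Toy frames (made up): "Always" exists only in test, "Sometimes" only in train.
toy_train = pd.DataFrame({"CALC": ["no", "Sometimes", "Frequently"], "Age": [20, 30, 40]})
toy_test = pd.DataFrame({"CALC": ["Always", "no"], "Age": [25, 35]})

train_enc = pd.get_dummies(toy_train, columns=["CALC"], drop_first=True)
test_enc = pd.get_dummies(toy_test, columns=["CALC"], drop_first=True)

# Reindex test to the training columns: dummies missing from test are
# filled with 0, and dummies unknown to train are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_aligned.columns))
```

One caveat: with `drop_first=True` the dropped baseline category can differ between the two frames, so a test row with an unseen category ends up encoded as the training baseline. Encoding the concatenated train and test frames in one `get_dummies` call, then splitting them again, avoids that ambiguity.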

In [None]:
#Let's check the Shape of data
print(f'The encoded Train dataset has {train.shape[0]} rows and {train.shape[1]} columns')
print(f'The encoded Test dataset has {test.shape[0]} rows and {test.shape[1]} columns')

In [None]:
X = train.drop(['NObeyesdad'], axis=1)
y = train['NObeyesdad']

In [None]:
X.shape

In [None]:
# Hold out 20% of the training data as a validation set (named X_test / y_test below)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
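
The split above is purely random. With several classes of unequal size, one option (not what the original notebook does) is to pass `stratify=y` so that each class keeps the same share in both parts. A small sketch on made-up data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up): 15 rows of one class, 5 of another.
X_toy = pd.DataFrame({"Weight": range(20)})
y_toy = pd.Series(["Normal_Weight"] * 15 + ["Obesity_Type_I"] * 5)

# stratify=y_toy keeps the 3:1 class ratio in both the train and validation parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_va.value_counts())
```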

In [None]:
# Feature Scaling
# from sklearn.preprocessing import StandardScaler
# sc = StandardScaler()
# X_train = sc.fit_transform(X_train)
# X_test = sc.transform(X_test)

<a id="toc"></a>

<a href="#toc" style="background-color: #E1B12D; color: #ffffff; padding: 7px 10px; text-decoration: none; border-radius: 50px;">Back to top</a><a id="toc"></a>

<a id="4"></a>
## <b>4. <span style='color:#E1B12D'>Model Building</span></b> 


**Hyperparameters for LGBMClassifier using Optuna**

In [None]:

# # Define the objective function for Optuna optimization
# import optuna
# from optuna.samplers import TPESampler

# def objective(trial, X_train, y_train, X_test, y_test):
#      # Define parameters to be optimized for the LGBMClassifier
#      param = {
#          "objective": "multiclass",
#          "metric": "multi_logloss",
#          "verbosity": -1,
#          "boosting_type": "gbdt",
#          "random_state": 42,
#          "num_class": 7,
#          "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.2),
#          "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
#          "lambda_l1": trial.suggest_float("lambda_l1", 0.005, 0.015),
#          "lambda_l2": trial.suggest_float("lambda_l2", 0.02, 0.06),
#          "max_depth": trial.suggest_int("max_depth", 5, 20),
#          "colsample_bytree": trial.suggest_float("colsample_bytree", 0.3, 0.9),
#          "subsample": trial.suggest_float("subsample", 0.8, 1.0),
#          "min_child_samples": trial.suggest_int("min_child_samples", 5, 50),
#      }

#  # LGBMClassifier with the suggested parameters
#      lgbm_classifier = LGBMClassifier(**param)
    
# # Fit 
#      lgbm_classifier.fit(X_train, y_train)

# # Evaluate
#      score = lgbm_classifier.score(X_test, y_test, )

#      return score

# # Train Test split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 

# #sampler for Optuna optimization
# sampler = optuna.samplers.TPESampler(seed=42)  # Using Tree-structured Parzen Estimator sampler for optimization

# # Create a study object
# study = optuna.create_study(direction="maximize", sampler=sampler)

# # Run the optimization process
# study.optimize(lambda trial: objective(trial, X_train, y_train, X_test, y_test), n_trials=50)

# # best parameters after optimization
# best_params = study.best_params

# print('='*50)
# print(best_params)

In [None]:
# Best parameters from the Optuna search in the referenced notebook:
# https://www.kaggle.com/code/moazeldsokyx/pgs4e2-highest-score-lgbm-hyperparameter-tuning/notebook

best_params = {
    "objective": "multiclass",          # Objective function for the model
    "metric": "multi_logloss",          # Evaluation metric
    "verbosity": -1,                    # Verbosity level (-1 for silent)
    "boosting_type": "gbdt",            # Gradient boosting type
    "random_state": 42,       # Random state for reproducibility
    "num_class": 7,                     # Number of classes in the dataset
    'learning_rate': 0.01197852738297134,  # Learning rate for gradient boosting
    'n_estimators': 509,                # Number of boosting iterations
    'lambda_l1': 0.009715116714365275,  # L1 regularization term
    'lambda_l2': 0.03853395161282091,   # L2 regularization term
    'max_depth': 11,                    # Maximum depth of the trees
    'colsample_bytree': 0.7364306508830604,  # Fraction of features to consider for each tree
    'subsample': 0.9529973839959326,    # Fraction of samples to consider for each boosting iteration
    'min_child_samples': 17             # Minimum number of data needed in a leaf
}

**LGBMClassifier with the best parameters**

In [None]:
lgbm_classifier = LGBMClassifier(**best_params)
lgbm_classifier.fit(X_train, y_train)
y_pred = lgbm_classifier.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred) 

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
print(classification_report(y_test, y_pred))

In [None]:
# Confusion matrix
from sklearn.metrics import confusion_matrix
plt.figure(figsize=(15, 6))
# Use the fitted class names so the axes show labels rather than bare indices
conf_labels = list(lgbm_classifier.classes_)
conf_matrix = confusion_matrix(y_test, y_pred, labels=conf_labels)
plt.imshow(conf_matrix, interpolation='nearest', cmap=plt.cm.Blues)
plt.xticks(np.arange(conf_matrix.shape[0]), conf_labels, rotation=45)
plt.yticks(np.arange(conf_matrix.shape[0]), conf_labels)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
for i in range(conf_matrix.shape[0]):
    for j in range(conf_matrix.shape[1]):
        plt.text(j, i, str(conf_matrix[i, j]), ha='center', va='center', color='black')
plt.grid(False)
plt.show()

In [None]:
# feature importances
feature_importance = lgbm_classifier.feature_importances_
feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importance})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
plt.figure(figsize=(12, 10))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('Feature Importance')
plt.xlabel('Importance')
plt.ylabel('')
sns.despine(left=True, bottom=True)
plt.show()

- Weight, Height, Age and FAF appear to be the most important features.
- CH2O, time using technology devices (TUE), and NCP are the next most important features.

<a id="toc"></a>

<a href="#toc" style="background-color: #E1B12D; color: #ffffff; padding: 7px 10px; text-decoration: none; border-radius: 50px;">Back to top</a><a id="toc"></a>

<a id="5"></a>
## <b>5. <span style='color:#E1B12D'>Prediction on Test data</span></b> 


In [None]:
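# Predict labels for the competition test set
# (reindexing to the training columns is a no-op when they already match)
predictions = lgbm_classifier.predict(test.reindex(columns=X.columns, fill_value=0))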

**Final Submission**

In [None]:
submission = pd.read_csv("/kaggle/input/playground-series-s4e2/sample_submission.csv")
submission["NObeyesdad"] = predictions
submission.to_csv("submission1.csv", index=False)
submission.head()