<a href="https://colab.research.google.com/github/jassynavarro/CCADMACL_EXERCISES/blob/main/Exercise2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exercise 2: Use Gradient Boost for Regression

Instructions:

- Use the Dataset File to train your model
- Use the Test File to generate your results
- Use the Sample Submission file to generate the same format

Submit your results to:
https://www.kaggle.com/competitions/playground-series-s4e12/overview



In [209]:
import pandas as pd
import seaborn as sns
import numpy as np


from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from xgboost import XGBRegressor
from sklearn import datasets, ensemble
from sklearn.metrics import root_mean_squared_log_error
from matplotlib import pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

## Dataset
The train, test, and sample submission files can be found at this link:
https://www.kaggle.com/competitions/playground-series-s4e12/data

## 1. Load the Data

In [210]:
# put your answer here
df = pd.read_csv("train.csv")
dt = pd.read_csv("test.csv")

In [211]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200000 entries, 0 to 1199999
Data columns (total 21 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   id                    1200000 non-null  int64  
 1   Age                   1181295 non-null  float64
 2   Gender                1200000 non-null  object 
 3   Annual Income         1155051 non-null  float64
 4   Marital Status        1181471 non-null  object 
 5   Number of Dependents  1090328 non-null  float64
 6   Education Level       1200000 non-null  object 
 7   Occupation            841925 non-null   object 
 8   Health Score          1125924 non-null  float64
 9   Location              1200000 non-null  object 
 10  Policy Type           1200000 non-null  object 
 11  Previous Claims       835971 non-null   float64
 12  Vehicle Age           1199994 non-null  float64
 13  Credit Score          1062118 non-null  float64
 14  Insurance Duration    1199999 non-

In [212]:
df.isnull().sum()

Column,Missing values
id,0
Age,18705
Gender,0
Annual Income,44949
Marital Status,18529
Number of Dependents,109672
Education Level,0
Occupation,358075
Health Score,74076
Location,0


In [213]:
mean_imputer = SimpleImputer(strategy='mean')
median_imputer = SimpleImputer(strategy='median')
mode_imputer = SimpleImputer(strategy='most_frequent')

# SimpleImputer returns a 2-D array, so flatten it before assigning to a column
df['Age'] = mean_imputer.fit_transform(df[['Age']]).ravel()
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['Annual Income'] = df['Annual Income'].fillna(df['Annual Income'].median())
df['Number of Dependents'] = median_imputer.fit_transform(df[['Number of Dependents']]).ravel()
df['Health Score'] = median_imputer.fit_transform(df[['Health Score']]).ravel()
df['Previous Claims'] = mode_imputer.fit_transform(df[['Previous Claims']]).ravel()
df['Vehicle Age'] = median_imputer.fit_transform(df[['Vehicle Age']]).ravel()
df['Credit Score'] = median_imputer.fit_transform(df[['Credit Score']]).ravel()
# dropna returns a new frame, so assign the result back
df = df.dropna(subset=['Premium Amount'])

# Encoding categorical data (missing values are encoded as their own class)
label_encoder = LabelEncoder()
df['Occupation'] = label_encoder.fit_transform(df['Occupation'])
df['Marital Status'] = label_encoder.fit_transform(df['Marital Status'])
df['Property Type'] = label_encoder.fit_transform(df['Property Type'])
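The imputation step can also be written so that the fill value is learned once from the training rows and then reused on new rows. The following is a minimal sketch of that idea, not the notebook's own method; the toy frames `train` and `test` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for train.csv and test.csv (values are made up)
train = pd.DataFrame({"Age": [20.0, np.nan, 40.0, 60.0]})
test = pd.DataFrame({"Age": [np.nan, 35.0]})

imputer = SimpleImputer(strategy="median")

# Learn the median from the training rows only
train["Age"] = imputer.fit_transform(train[["Age"]]).ravel()

# Reuse the same learned median for the test rows (transform, not fit_transform)
test["Age"] = imputer.transform(test[["Age"]]).ravel()

print(imputer.statistics_)  # the fill value learned from the training data
```

Calling `transform` on the second frame keeps the fill value identical across both files, which avoids the two datasets being filled with different statistics.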


In [214]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200000 entries, 0 to 1199999
Data columns (total 21 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   id                    1200000 non-null  int64  
 1   Age                   1200000 non-null  float64
 2   Gender                1200000 non-null  object 
 3   Annual Income         1200000 non-null  float64
 4   Marital Status        1200000 non-null  int64  
 5   Number of Dependents  1200000 non-null  float64
 6   Education Level       1200000 non-null  object 
 7   Occupation            1200000 non-null  int64  
 8   Health Score          1200000 non-null  float64
 9   Location              1200000 non-null  object 
 10  Policy Type           1200000 non-null  object 
 11  Previous Claims       1200000 non-null  float64
 12  Vehicle Age           1200000 non-null  float64
 13  Credit Score          1200000 non-null  float64
 14  Insurance Duration    1199999 non-

In [215]:
df.isnull().sum()

Column,Missing values
id,0
Age,0
Gender,0
Annual Income,0
Marital Status,0
Number of Dependents,0
Education Level,0
Occupation,0
Health Score,0
Location,0


In [216]:
df.drop(columns=["Education Level", "Exercise Frequency", "Customer Feedback", "Policy Start Date", "Location", "Insurance Duration"], inplace=True)

In [217]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200000 entries, 0 to 1199999
Data columns (total 15 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   id                    1200000 non-null  int64  
 1   Age                   1200000 non-null  float64
 2   Gender                1200000 non-null  object 
 3   Annual Income         1200000 non-null  float64
 4   Marital Status        1200000 non-null  int64  
 5   Number of Dependents  1200000 non-null  float64
 6   Occupation            1200000 non-null  int64  
 7   Health Score          1200000 non-null  float64
 8   Policy Type           1200000 non-null  object 
 9   Previous Claims       1200000 non-null  float64
 10  Vehicle Age           1200000 non-null  float64
 11  Credit Score          1200000 non-null  float64
 12  Smoking Status        1200000 non-null  object 
 13  Property Type         1200000 non-null  int64  
 14  Premium Amount        1200000 non-

## 2. Perform Data preprocessing

In [218]:
# Split data into features and target
X = df.drop(columns=["Premium Amount", "id"])
y = df["Premium Amount"]

In [219]:
# put your answer here
# Define categorical and numerical features
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_features = X.select_dtypes(
   include=["object"]
).columns.tolist()

numerical_features = X.select_dtypes(
   include=["float64", "int64"]
).columns.tolist()


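preprocessor = ColumnTransformer(
   transformers=[
       # handle_unknown="ignore" keeps transform from failing on categories not seen during fit
       ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
       ("num", StandardScaler(), numerical_features),
   ]
)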
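To see what this kind of transformer produces, here is a small sketch on a made-up two-column frame. The frame `toy` and the name `toy_preprocessor` are hypothetical, and `sparse_threshold=0` is set only so the result prints as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one categorical and one numerical column
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [20.0, 30.0, 40.0],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
        ("num", StandardScaler(), ["Age"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# One-hot columns come first, followed by the standardized numerical column
encoded = toy_preprocessor.fit_transform(toy)
print(encoded.shape)
print(encoded)
```

The categorical column expands into one indicator column per category, while the numerical column is rescaled to zero mean and unit variance.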

In [220]:
xgb_model = XGBRegressor(
    n_estimators=500,
    max_depth=4,
    learning_rate=0.01,
    random_state=42,
)
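Since the exercise title refers to gradient boosting in general, the same hyperparameters can also be tried with scikit-learn's `GradientBoostingRegressor`, which the notebook imports but does not use. This is a sketch on synthetic data under that assumption; `X_toy`, `y_toy`, and `gbr_model` are illustrative names, not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] + np.sin(X_toy[:, 1])

# Same hyperparameter values as the XGBRegressor cell above
gbr_model = GradientBoostingRegressor(
    n_estimators=500,
    max_depth=4,
    learning_rate=0.01,
    random_state=42,
)
gbr_model.fit(X_toy, y_toy)

toy_predictions = gbr_model.predict(X_toy[:5])
print(toy_predictions)
```

On a dataset with over a million rows the scikit-learn estimator is likely to train much more slowly than XGBoost, so this is better suited to experiments on a sample.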

## 3. Create a Pipeline

In [221]:
# put your answer here
from sklearn.pipeline import Pipeline

pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("regressor", xgb_model),
    ]
)
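A pipeline of this shape can be exercised end to end on a tiny frame to confirm that raw columns go in and predictions come out. The sketch below uses made-up data and swaps in `GradientBoostingRegressor` so that it depends only on scikit-learn; all `toy_*` names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training data
toy_X = pd.DataFrame({
    "Policy Type": ["Basic", "Premium", "Basic", "Comprehensive"] * 5,
    "Age": [25.0, 40.0, 31.0, 58.0] * 5,
})
toy_y = pd.Series([100.0, 300.0, 120.0, 500.0] * 5)

toy_pipeline = Pipeline(
    steps=[
        ("preprocessor", ColumnTransformer(transformers=[
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["Policy Type"]),
            ("num", StandardScaler(), ["Age"]),
        ])),
        ("regressor", GradientBoostingRegressor(n_estimators=50, random_state=42)),
    ]
)

# fit() fits the preprocessor and the regressor in sequence
toy_pipeline.fit(toy_X, toy_y)

# predict() applies the already-fitted preprocessing to unseen raw rows
new_row = pd.DataFrame({"Policy Type": ["Unknown"], "Age": [45.0]})
toy_prediction = toy_pipeline.predict(new_row)
print(toy_prediction)
```

Because the encoder and scaler live inside the pipeline, they are fitted only on the rows passed to `fit`, and the same fitted transformations are reused at prediction time.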

## 4. Train the Model

In [222]:
# put your answer here
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size=0.2, random_state=42
)
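As a quick check of what `test_size=0.2` means, the sketch below splits ten made-up rows and prints the resulting shapes. The array names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up rows with two features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# 80% of the rows go to training and 20% are held out
print(X_tr.shape, X_val.shape)
```

Fixing `random_state` makes the split reproducible, so repeated runs evaluate on the same held-out rows.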

## 5. Evaluate the Model

In [223]:
df.isnull().sum()

Column,Missing values
id,0
Age,0
Gender,0
Annual Income,0
Marital Status,0
Number of Dependents,0
Occupation,0
Health Score,0
Policy Type,0
Previous Claims,0


In [224]:
# put your answer here

# Fit the model on the training data
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

In [225]:
# root_mean_squared_log_error already returns the RMSLE, so no extra square root is needed
val_rmsle = root_mean_squared_log_error(y_test, y_pred)

print(f"Validation RMSLE: {val_rmsle:.3f}")

Validation RMSLE: 1.159
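For reference, the root mean squared logarithmic error can be written out by hand as the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + target)`. The sketch below is a plain NumPy version with made-up numbers; the helper name `rmsle` is hypothetical.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    squared_log_errors = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    # The square root is taken exactly once, over the mean of the squared errors
    return float(np.sqrt(np.mean(squared_log_errors)))


# Made-up values chosen so the log-space errors are 0, 0, and 2
y_true_toy = np.expm1([1.0, 2.0, 3.0])
y_pred_toy = np.expm1([1.0, 2.0, 5.0])

toy_score = rmsle(y_true_toy, y_pred_toy)
print(f"{toy_score:.4f}")
```

Because the error is measured in log space, the metric penalizes relative differences rather than absolute ones, and it is undefined for predictions below -1.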


## Generate Submission File

Choose the model that has the best performance to generate a submission file.

In [226]:
mean_imputer = SimpleImputer(strategy='mean')
median_imputer = SimpleImputer(strategy='median')
mode_imputer = SimpleImputer(strategy='most_frequent')

# Note: the imputers and encoders below are refit on the test file, so the fill
# values come from the test data rather than from the training data.
dt['Age'] = mean_imputer.fit_transform(dt[['Age']]).ravel()
# Assign the result back instead of calling fillna(inplace=True) on a column selection
dt['Annual Income'] = dt['Annual Income'].fillna(dt['Annual Income'].median())
dt['Number of Dependents'] = median_imputer.fit_transform(dt[['Number of Dependents']]).ravel()
dt['Health Score'] = median_imputer.fit_transform(dt[['Health Score']]).ravel()
dt['Previous Claims'] = mode_imputer.fit_transform(dt[['Previous Claims']]).ravel()
dt['Vehicle Age'] = median_imputer.fit_transform(dt[['Vehicle Age']]).ravel()
dt['Credit Score'] = median_imputer.fit_transform(dt[['Credit Score']]).ravel()

# Encoding categorical data
# LabelEncoder assigns codes in sorted order, so these codes match the training
# codes only if the test file contains the same set of category values.
label_encoder = LabelEncoder()
dt['Occupation'] = label_encoder.fit_transform(dt['Occupation'])
dt['Marital Status'] = label_encoder.fit_transform(dt['Marital Status'])
dt['Property Type'] = label_encoder.fit_transform(dt['Property Type'])

# Drop the same columns that were removed from the training features
test_features = dt.drop(columns=["Premium Amount", "id", "Education Level", "Exercise Frequency", "Customer Feedback", "Policy Start Date", "Location", "Insurance Duration"], errors='ignore')
test_predictions = pipeline.predict(test_features)

submission_df = pd.DataFrame({
    "id": dt["id"],
    "Premium Amount": test_predictions
})

submission_df.to_csv("submission_file.csv", index=False)
print("Submission file created: submission_file.csv")


Submission file created: submission_file.csv
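Before uploading, it can help to confirm that the submission has the same columns and ids as the sample submission file mentioned in the instructions. The sketch below shows one way to do that; the two small frames are made-up stand-ins, and an in-memory buffer replaces the real CSV file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the sample submission file (ids and values are made up)
sample_submission = pd.DataFrame({
    "id": [1200000, 1200001],
    "Premium Amount": [0.0, 0.0],
})

# Hypothetical stand-in for the generated predictions
toy_submission = pd.DataFrame({
    "id": [1200000, 1200001],
    "Premium Amount": [950.5, 1210.0],
})

# Same column names in the same order, same ids, and no missing predictions
assert list(toy_submission.columns) == list(sample_submission.columns)
assert toy_submission["id"].equals(sample_submission["id"])
assert toy_submission["Premium Amount"].notna().all()

# Write to an in-memory buffer to inspect the CSV text without touching disk
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

With the real files, `sample_submission` would be read with `pd.read_csv` and compared against `submission_df` in the same way.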
