<h1><center>Supervised ML Regression Competition</center></h1>


<img align="center" src="https://compraracciones.com/wp-content/uploads/2021/04/insurance.jpg" style="height:200px" style="width:100px"/>

<hr style="border:2px solid pink"> </hr>

You have been assigned the task of building a model that predicts insurance costs.

You'll find the data in the CSV file `insurance.csv`.


- Target column: `charges`


<hr style="border:2px solid pink"> </hr>


**Guidelines:** 


- `train_test_split`
    - `random_state=42`
    - `test_size=0.3`


- Whoever achieves the highest R² score on the test data wins.
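
As a rough illustration of the guidelines, the sketch below applies the required split settings and the competition metric to synthetic data. The arrays `X_demo` and `y_demo` are made-up stand-ins, not the insurance dataset, and the linear model is only a placeholder.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and the target (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Split exactly as the guidelines specify: 30% test data, random_state=42
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Fit on the training part only, then score on the held-out test part
model = LinearRegression().fit(X_tr, y_tr)
score = r2_score(y_te, model.predict(X_te))
print(f"Test rows: {len(X_te)}, R² on test data: {score:.3f}")
```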


## 1. Initial Data Exploration

Let's start by loading our dataset and taking a first look at it.


In [None]:
# Loads the insurance dataset and provides basic information about it.
# Loads the necessary libraries
import pandas as pd

# Reads the data file
data = pd.read_csv('insurance.csv', header=0)

data.head()

In [None]:
# Provides dataset information
data.info()

In [None]:
# Sets the first row as column headers
#data.columns = data.iloc[0]
#data = data[1:]

# Prints the data types present in the data
print(data.dtypes)

In [None]:
# Ensures 'age' and 'children' columns are integers (non-numeric entries become 0)
data['age'] = pd.to_numeric(data['age'], errors='coerce').fillna(0).astype(int)
data['children'] = pd.to_numeric(data['children'], errors='coerce').fillna(0).astype(int)

# Ensures 'bmi' and 'charges' columns are floats (non-numeric entries become the column mean)
bmi = pd.to_numeric(data['bmi'], errors='coerce')
data['bmi'] = bmi.fillna(bmi.mean()).astype(float)
charges = pd.to_numeric(data['charges'], errors='coerce')
data['charges'] = charges.fillna(charges.mean()).astype(float)

# Prints data types to verify
print(data.dtypes)

## 2. Checking for Missing Values

It's important to know if our data has any missing values. Let's check that next.


In [None]:
# Checks the dataset for missing values
print(data.isnull().sum())

# The dataset has no missing data, so no further analysis of missing values is needed.


## 3. Descriptive Statistics

Understanding the distribution of our data is crucial, so let's calculate some descriptive statistics.


In [None]:
# Shows descriptive statistics for the entire dataset
print("Descriptive Statistics for the Dataset:")
print(data.describe(include='all'))

## 4. Distribution Analysis

Visualizing the distributions of our features can provide valuable insights. Let's plot the distributions for 'age', 'bmi', and 'charges'.

### Task:
- Plot the histogram for 'age'
- Plot the histogram for 'bmi'
- Plot the histogram for 'charges'


In [None]:
# Imports the necessary libraries
import matplotlib.pyplot as plt

# Plots the histogram for 'age'
plt.figure(figsize=(8, 5))
plt.hist(data['age'], bins=20, color='skyblue', edgecolor='black')
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Plots the histogram for 'bmi'
plt.figure(figsize=(8, 5))
plt.hist(data['bmi'], bins=20, color='lightgreen', edgecolor='black')
plt.title('Histogram of BMI')
plt.xlabel('BMI')
plt.ylabel('Frequency')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Plots the histogram for 'charges'
plt.figure(figsize=(8, 5))
plt.hist(data['charges'], bins=20, color='salmon', edgecolor='black')
plt.title('Histogram of Charges')
plt.xlabel('Charges')
plt.ylabel('Frequency')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

## 5. Relationship Between Variables

Let's explore the relationship between some of our features and the target variable 'charges'. We'll create scatter plots to visualize these relationships.

### Task:
- Create a scatter plot for 'age' vs 'charges'
- Create a scatter plot for 'bmi' vs 'charges'
- Create a scatter plot for 'children' vs 'charges'


In [None]:
# Imports the necessary libraries
import matplotlib.pyplot as plt

# Creates a scatter plot for 'age' vs 'charges'
plt.figure(figsize=(8, 5))
plt.scatter(data['age'], data['charges'], alpha=0.7, color='blue', edgecolor='black')
plt.title('Scatter Plot: Age vs Charges')
plt.xlabel('Age')
plt.ylabel('Charges')
plt.grid(alpha=0.3)
plt.show()

# Creates a scatter plot for 'bmi' vs 'charges'
plt.figure(figsize=(8, 5))
plt.scatter(data['bmi'], data['charges'], alpha=0.7, color='green', edgecolor='black')
plt.title('Scatter Plot: BMI vs Charges')
plt.xlabel('BMI')
plt.ylabel('Charges')
plt.grid(alpha=0.3)
plt.show()

# Creates a scatter plot for 'children' vs 'charges'
plt.figure(figsize=(8, 5))
plt.scatter(data['children'], data['charges'], alpha=0.7, color='red', edgecolor='black')
plt.title('Scatter Plot: Children vs Charges')
plt.xlabel('Number of Children')
plt.ylabel('Charges')
plt.grid(alpha=0.3)
plt.show()

## 6. Categorical Analysis

Let's analyze the categorical features 'sex', 'smoker', and 'region' to see how they relate to 'charges'.

### Task:
- Plot the distribution of 'charges' for different 'sex'
- Plot the distribution of 'charges' for different 'smoker'
- Plot the distribution of 'charges' for different 'region'


In [None]:
# Imports the necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Plots the distribution of 'charges' for different 'sex'
plt.figure(figsize=(10, 6))
sns.boxplot(x='sex', y='charges', data=data, hue='sex', dodge=False)
plt.title('Distribution of Charges by Sex')
plt.xlabel('Sex')
plt.ylabel('Charges')
plt.legend([], [], frameon=False)  # Disable redundant legend
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Plots the distribution of 'charges' for different 'smoker'
plt.figure(figsize=(10, 6))
sns.boxplot(x='smoker', y='charges', data=data, hue='smoker', dodge=False)
plt.title('Distribution of Charges by Smoker')
plt.xlabel('Smoker')
plt.ylabel('Charges')
plt.legend([], [], frameon=False)  # Disable redundant legend
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Plots the distribution of 'charges' for different 'region'
plt.figure(figsize=(10, 6))
sns.boxplot(x='region', y='charges', data=data, hue='region', dodge=False)
plt.title('Distribution of Charges by Region')
plt.xlabel('Region')
plt.ylabel('Charges')
plt.legend([], [], frameon=False)  # Disable redundant legend
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

## 7. Correlation Analysis

To understand how our numerical features relate to each other and to the target variable, let's calculate and visualize the correlation matrix.

### Task:
- Calculate the correlation matrix for the dataset
- Visualize the correlation matrix using a heatmap


In [None]:
# Imports the necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

# One-hot encoding
data_encoded = pd.get_dummies(data, drop_first = True)

# Calculates the correlation matrix
correlation_matrix = data_encoded.corr()

# Visualizes the correlation matrix using a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', cbar=True, square=True, linewidths=0.5)
plt.title('Correlation Matrix Heatmap', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
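
The heatmap shows every pairwise correlation, but for modelling the column of correlations with the target is usually the most interesting part. The sketch below shows one way to extract and rank it. The `toy` DataFrame holds made-up values for illustration only, and `dtype=int` is passed to `get_dummies` so the dummy column is explicitly numeric.

```python
import pandas as pd

# Made-up toy data, for illustration only (not the insurance dataset)
toy = pd.DataFrame({
    'age': [19, 25, 33, 41, 52, 60],
    'smoker': ['yes', 'no', 'no', 'yes', 'no', 'yes'],
    'charges': [17000.0, 2500.0, 4200.0, 24000.0, 9800.0, 31000.0],
})

# One-hot encode the categorical column as 0/1 integers
toy_encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

# Keep only the correlations with the target and rank them
target_corr = (
    toy_encoded.corr()['charges']
    .drop('charges')
    .sort_values(ascending=False)
)
print(target_corr)
```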

# Modelling time!

## 1. Find the Naive Baseline

Before we build any models, let's establish a naive baseline. This will help us understand how well our models perform compared to a simple approach. In regression problems, the naive baseline is often the mean of the target variable.

### Task:
- Calculate the mean of the target variable 'charges'
- Explain why it's important to establish a naive baseline


In [None]:
# Calculates the mean of the target variable 'charges'
mean_charges = data['charges'].mean()

# Prints the mean value
print(f"The mean of the target variable 'charges' is: {mean_charges:.2f}")

# It's important to establish a naive baseline because it provides a minimum performance
# standard for my model. If my model can't beat this baseline, it is ineffective.
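
To compare later models against the naive baseline, it helps to express the baseline in the same metrics. The sketch below uses scikit-learn's `DummyRegressor` on synthetic data. The arrays `X_demo` and `y_demo` are assumptions made for illustration, not the insurance data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = 5000 + 2000 * X_demo[:, 0] + rng.normal(scale=500, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# strategy="mean" ignores the features and always predicts the training-set mean
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
baseline_pred = baseline.predict(X_te)

baseline_mae = mean_absolute_error(y_te, baseline_pred)
baseline_r2 = r2_score(y_te, baseline_pred)
print(f"Baseline MAE: {baseline_mae:.2f}")
print(f"Baseline R²: {baseline_r2:.3f}")
```

A constant prediction can never score above zero R² on the test set, so any useful model should land clearly above zero.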

## 2. Initial Modelling Without GridSearch or Pipeline

Let's build a simple linear regression model without any feature engineering, grid search, or pipeline. This will serve as our initial baseline for comparison.

### Task:
- Split the data into training and test sets
- Train a simple linear regression model
- Evaluate its performance using regression metrics
- Write it down as a markdown below so you can keep track. This is a scientific experiment


In [None]:
# Splits the data into training and test sets, trains a simple linear regression model,
# and evaluates its performance using regression metrics.

# Imports the necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Assumes data_encoded (one-hot encoded above) still includes the target variable 'charges'

# Splits the features and the target
X = data_encoded.drop(columns=['charges'])
y = data_encoded['charges']

# Splits the data into training and testing sets
# Note: the guidelines call for test_size=0.3; this notebook uses 0.2 throughout,
# and the results reported below reflect that 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initializes the Linear Regression model
linear_model = LinearRegression()

# Trains the model on the training data
linear_model.fit(X_train, y_train)

# Predicts on the test set
y_pred = linear_model.predict(X_test)

# Evaluates the model's performance
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Prints the metrics
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R²): {r2:.2f}")

In [None]:
**Results**

**Mean Absolute Error (MAE):** 4901.38

**Mean Squared Error (MSE):** 42549664.96

**R-squared (R²):** 0.74

Steps:
1. Split the Data:
    - Divide the dataset into training and testing sets using `train_test_split`.
2. Train a Simple Linear Regression Model:
    - Use the `LinearRegression` class from sklearn to fit the training data.
3. Evaluate the Model:
    - Use regression metrics such as:
        - Mean Absolute Error (MAE)
        - Mean Squared Error (MSE)
        - R-squared (R²)

## 3. Feature Engineering

Now, let's brainstorm and create some new features to see if we can improve the model's performance.

### Questions:
1. Should we create an interaction feature between 'bmi' and 'children'? 
2. Should we create age groups to see if the model improves by categorizing age?
3. Should we create a high-risk indicator based on 'smoker' and 'bmi'?

- Remember, nothing is set in stone: this is your experiment and your hypothesis. You may not need these features, but it's important to explore these questions.

### Task:
- Create new features based on the questions above
- Explain the rationale behind each feature



In [None]:
# Import the necessary libraries

import pandas as pd
import numpy as np

# Create a copy of the dataset to add new features
data_encoded = data_encoded.copy()

data_encoded.head()

In [None]:
# Ensures the one-hot encoded 'smoker_yes' column is numeric (get_dummies may return booleans)
if 'smoker_yes' in data_encoded.columns:
    data_encoded['smoker_yes'] = data_encoded['smoker_yes'].astype(int)

# 1. Interaction feature between 'bmi' and 'children'
data_encoded['bmi_children_interaction'] = data_encoded['bmi'] * data_encoded['children']

# 2. Categorize 'age' into age groups
age_bins = [0, 18, 30, 45, 60, 100]  # Example age bins
age_labels = ['0-18', '19-30', '31-45', '46-60', '61+']
data_encoded['age_group'] = pd.cut(data_encoded['age'], bins=age_bins, labels=age_labels)

# 3. High-risk indicator based on 'smoker' and 'bmi'
high_bmi_threshold = 30  # Example threshold for high BMI
data_encoded['high_risk'] = np.where(
    (data_encoded['smoker_yes'] == 1) & (data_encoded['bmi'] > high_bmi_threshold), 1, 0
)

# Display the first few rows of the modified dataset to verify new features
data_encoded.head()
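
The age bins above rely on the default behaviour of `pd.cut`, where each interval includes its right edge: `(0, 18]`, `(18, 30]`, and so on. The short check below uses the same bins and labels on a handful of boundary ages (chosen here for illustration) to show which group each one falls into.

```python
import pandas as pd

age_bins = [0, 18, 30, 45, 60, 100]
age_labels = ['0-18', '19-30', '31-45', '46-60', '61+']

# Boundary ages picked to show which side of each edge they land on
ages = pd.Series([18, 19, 30, 31, 60, 61])

# right=True is the default, so every bin includes its upper edge
groups = pd.cut(ages, bins=age_bins, labels=age_labels)
print(pd.DataFrame({'age': ages, 'age_group': groups}))
```

Because the right edge is included, an 18-year-old falls in `0-18` and a 30-year-old in `19-30`, which matches the label names.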

## 4. Modelling with Feature Engineering

Now that we have new features, let's see if they improve our model's performance.
Did it improve the performance? Yes? No? Why?

### Task:
- Split the data into training and test sets
- Train a linear regression model with the new features
- Evaluate its performance using regression metrics


In [None]:
# Splits the data with the new features into training and test sets. Trains a linear regression model with the
# new features. Evaluates the performance of this model using regression metrics.

# Imports the necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
print(X.dtypes)

In [None]:
# Features and target variable
X = data_encoded.drop(columns=['charges'])  # Drops the target column
y = data_encoded['charges']  # Target variable

# Uses one-hot encoding for age_group
X = pd.get_dummies(X, drop_first = True)

print(X.dtypes)

# Splits the data into training and test sets (80%-20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Trains a linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Makes predictions on the test data
y_pred = linear_model.predict(X_test)

# Evaluates the model's performance using regression metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Prints the evaluation metrics
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R²): {r2:.2f}")

**Results**

**Mean Absolute Error (MAE):** 3680.32

**Mean Squared Error (MSE):** 26999974.38

**R-squared (R²):** 0.84

The new features improved our model. The **revised** model is better than the initial linear regression model from Section 2: it has a lower MAE and MSE, which indicates more accurate predictions, and its R² increased from 0.74 to 0.84, which means that it explains more of the variance in the target variable (`charges`).

The results are as follows:

| Metric | Baseline Model | Revised Model |
|---|---|---|
| **MAE** | 4901.38 | 3680.32 |
| **MSE** | 42,549,664.96 | 26,999,974.38 |
| **R²** | 0.74 | 0.84 |

## 5. Modelling with Pipeline and Grid Search

Now, let's see how using pipelines can simplify our workflow and prevent data leakage. We'll also use GridSearchCV to find the best hyperparameters.

### Task:
- Create a pipeline that includes scaling and linear regression
- Define a parameter grid for hyperparameter tuning
- Use GridSearchCV to find the best parameters and evaluate the model performance
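
One way to see how a pipeline helps prevent data leakage is to check what the scaler was fitted on. In the sketch below, which uses synthetic data rather than the insurance dataset, the scaler inside the pipeline learns its mean from the training rows only, so no statistic from the test rows influences the preprocessing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(200, 2))
y_demo = 3 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression()),
])
pipe.fit(X_tr, y_tr)  # every step, including the scaler, sees X_tr only

scaler_mean = pipe.named_steps['scaler'].mean_
print("Scaler mean:      ", scaler_mean)
print("Training-set mean:", X_tr.mean(axis=0))
print("Full-data mean:   ", X_demo.mean(axis=0))
```

Inside `GridSearchCV` the same thing happens per fold: the whole pipeline is refitted on each training fold, so the validation fold never contributes to the scaling statistics.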


In [None]:
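# Builds X_encoded here so that it exists before fillna is applied (the next cell rebuilds it)
X_encoded = pd.get_dummies(data_encoded.drop(columns=['charges']), drop_first=True)
# X_encoded = X_encoded.apply(pd.to_numeric, errors = 'coerce')
X_encoded = X_encoded.fillna(0)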

In [None]:
# Creates a pipeline that includes scaling and linear regression, defines a parameter grid for hyperparameter tuning,
# and uses GridSearchCV to find the best parameters and evaluate the model performance

# Imports the necessary libraries
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Encodes categorical columns using one-hot encoding (in case any still remain)
# Assumes 'age_group' is the only categorical column
X_encoded = pd.get_dummies(data_encoded.drop(columns=['charges']), drop_first=True)  # Drop 'charges' and encode
y = data_encoded['charges']  # Target variable

# Splits the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# Creates a pipeline with scaling and linear regression
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Scaling step
    ('regressor', LinearRegression())  # Regression step
])

# Defines a parameter grid for hyperparameter tuning
param_grid = {
    'regressor__fit_intercept': [True, False]  # Whether to calculate the intercept
}

# Uses GridSearchCV to find the best parameters
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2', verbose=1)
grid_search.fit(X_train, y_train)

# Evaluates the best model on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Calculates performance metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Displays the best parameters and evaluation metrics
print("Best Parameters:", grid_search.best_params_)
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R²): {r2:.2f}")

**Results:**

Fitting 5 folds for each of 2 candidates, totalling **10 fits**

**Best Parameters:** {'regressor__fit_intercept': True}

**Mean Absolute Error (MAE):** 3680.32

**Mean Squared Error (MSE):** 26999974.38

**R-squared (R²):** 0.84
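
These metrics are identical to those from the previous section. That is expected for ordinary least squares with an intercept: standardizing the features rescales the coefficients but leaves the predictions unchanged. The sketch below checks this on synthetic data; the arrays are made up for illustration and are not the insurance dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[40, 30, 1], scale=[14, 6, 1], size=(200, 3))
y_demo = (250 * X_demo[:, 0] + 300 * X_demo[:, 1] + 500 * X_demo[:, 2]
          + rng.normal(scale=1000, size=200))

X_tr, X_te, y_tr = X_demo[:150], X_demo[150:], y_demo[:150]

# Same model, with and without scaling
plain = LinearRegression().fit(X_tr, y_tr)
scaled = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression()),
]).fit(X_tr, y_tr)

max_diff = np.max(np.abs(plain.predict(X_te) - scaled.predict(X_te)))
print(f"Largest difference between predictions: {max_diff:.2e}")
```

Scaling matters for regularized or distance-based models such as ElasticNet, SVR, and KNN, which is why it is still worth keeping in the pipeline.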


## 6. Trying Another Model with Pipeline

Let's try using a Gradient Boosting Regressor to see if it performs better.

### Task:
- Create and use a pipeline for Gradient Boosting Regressor
- Define a parameter grid for grid search
- Use GridSearchCV to find the best parameters and evaluate the model


In [None]:
# Creates and uses a pipeline for Gradient Boosting Regressor, defines a parameter grid for grid search,
# Uses GridSearchCV to find the best parameters and evaluates the model

# Imports the necessary libraries
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Encodes categorical columns using one-hot encoding
X_encoded = pd.get_dummies(data_encoded.drop(columns=['charges']), drop_first=True)
y = data_encoded['charges']

# Splits the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# Creates a pipeline for Gradient Boosting Regressor
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Scaling step
    ('gbr', GradientBoostingRegressor(random_state=42))  # Gradient Boosting step
])

# Defines a parameter grid for hyperparameter tuning
param_grid = {
    'gbr__n_estimators': [100, 200, 300],  # Number of trees
    'gbr__learning_rate': [0.01, 0.1, 0.2],  # Learning rate
    'gbr__max_depth': [3, 5, 7]  # Maximum depth of trees
}

# Uses GridSearchCV to find the best parameters
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2', verbose=1)
grid_search.fit(X_train, y_train)

# Evaluates the best model on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Calculates performance metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Displays the best parameters and evaluation metrics
print("Best Parameters:", grid_search.best_params_)
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R²): {r2:.2f}")

**Results**

Fitting 5 folds for each of 27 candidates, totalling **135 fits**

**Best Parameters:** {'gbr__learning_rate': 0.1, 'gbr__max_depth': 3, 'gbr__n_estimators': 100}

**Mean Absolute Error (MAE):** 2360.01

**Mean Squared Error (MSE):** 17472493.13

**R-squared (R²):** 0.89

## 7. GridSearch with Several Models

Finally, let's compare several models using GridSearchCV to find the best one.

### Task:
- Define multiple models and their parameter grids
- Use GridSearchCV to find the best model and parameters


In [None]:
# Defines multiple models and their parameter grids
# Uses GridSearchCV to find the best model and parameters.

# Imports the necessary libraries
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Encodes categorical columns using one-hot encoding
X_encoded = pd.get_dummies(data_encoded.drop(columns=['charges']), drop_first=True)
y = data_encoded['charges']

# Splits the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# Defines multiple models and their parameter grids
models_and_parameters = [
    # Linear Regression
    {
        'pipeline': Pipeline([
            ('scaler', StandardScaler()),
            ('model', LinearRegression())
        ]),
        'param_grid': {
            'model__fit_intercept': [True, False]
        }
    },
    
    # Gradient Boosting Regressor
    {
        'pipeline': Pipeline([
            ('scaler', StandardScaler()),
            ('model', GradientBoostingRegressor(random_state=42))
        ]),
        'param_grid': {
            'model__n_estimators': [100, 200, 300],
            'model__learning_rate': [0.01, 0.1, 0.2],
            'model__max_depth': [3, 5, 7]
        }
    },
    
    # Random Forest Regressor
    {
        'pipeline': Pipeline([
            ('scaler', StandardScaler()),
            ('model', RandomForestRegressor(random_state=42))
        ]),
        'param_grid': {
            'model__n_estimators': [100, 200, 300],
            'model__max_depth': [5, 10, 15]
        }
    },
    
    # Support Vector Regressor (SVR)
    {
        'pipeline': Pipeline([
            ('scaler', StandardScaler()),
            ('model', SVR())
        ]),
        'param_grid': {
            'model__C': [0.1, 1, 10],
            'model__epsilon': [0.01, 0.1, 1]
        }
    },
    
    # ElasticNet
    {
        'pipeline': Pipeline([
            ('scaler', StandardScaler()),
            ('model', ElasticNet(random_state=42))
        ]),
        'param_grid': {
            'model__alpha': [0.01, 0.1, 1],
            'model__l1_ratio': [0.2, 0.5, 0.8]
        }
    },
    
    # K-Nearest Neighbors Regressor (KNN)
    {
        'pipeline': Pipeline([
            ('scaler', StandardScaler()),
            ('model', KNeighborsRegressor())
        ]),
        'param_grid': {
            'model__n_neighbors': [3, 5, 7],
            'model__weights': ['uniform', 'distance']
        }
    },
    
    # XGBoost Regressor
    {
        'pipeline': Pipeline([
            ('scaler', StandardScaler()),
            ('model', XGBRegressor(random_state=42, eval_metric='rmse'))
        ]),
        'param_grid': {
            'model__n_estimators': [100, 200, 300],
            'model__learning_rate': [0.01, 0.1, 0.2],
            'model__max_depth': [3, 5, 7]
        }
    }
]  

# Iterates through the models and evaluates each one using GridSearchCV
best_model = None
best_score = -float('inf')
best_params = None

for item in models_and_parameters:
    pipeline = item['pipeline']
    param_grid = item['param_grid']
    
    grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2', verbose=1)
    grid_search.fit(X_train, y_train)
    
    # Check if the current model performs better
    if grid_search.best_score_ > best_score:
        best_model = grid_search.best_estimator_
        best_score = grid_search.best_score_
        best_params = grid_search.best_params_

# Evaluates the best model on the test set
y_pred = best_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Displays the results
print("Best Model:", type(best_model['model']).__name__)
print("Best Parameters:", best_params)
print(f"Cross-Validation R²: {best_score:.2f}")
print(f"Test Set MAE: {mae:.2f}")
print(f"Test Set MSE: {mse:.2f}")
print(f"Test Set R²: {r2:.2f}")
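
The loop above runs one `GridSearchCV` per model and keeps the best result by hand. An alternative, shown here only as a sketch and not as what this notebook does, is to pass a list of parameter grids to a single `GridSearchCV` and treat the `model` step itself as a parameter. The data and the small grids below are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, nearly linear data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=120)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),  # placeholder, replaced by the grid
])

# Each dict swaps a different estimator into the 'model' step with its own grid
param_grid = [
    {'model': [LinearRegression()],
     'model__fit_intercept': [True, False]},
    {'model': [RandomForestRegressor(random_state=42)],
     'model__n_estimators': [20, 50],
     'model__max_depth': [3, 5]},
]

search = GridSearchCV(pipe, param_grid, cv=3, scoring='r2')
search.fit(X_demo, y_demo)

best_name = type(search.best_estimator_['model']).__name__
print("Best model:", best_name)
print(f"Best cross-validation R²: {search.best_score_:.3f}")
```

With this layout, `best_estimator_`, `best_params_`, and `best_score_` already refer to the overall winner, so no manual bookkeeping is needed.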

In [None]:
# Analyzes the feature importances from GradientBoostingRegressor

# Extracts feature importances from the best model
# (assumes the winning estimator is tree-based and exposes feature_importances_)
feature_importances = best_model['model'].feature_importances_

# Creates a DataFrame for visualization
feature_importances_df = pd.DataFrame({
    'Feature': X_encoded.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Displays the top features
print("Top Features:")
print(feature_importances_df.head(10))

# Plots the feature importances
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
plt.barh(feature_importances_df['Feature'], feature_importances_df['Importance'], color='skyblue')
plt.gca().invert_yaxis()
plt.title("Feature Importances from GradientBoostingRegressor", fontsize=16)
plt.xlabel("Importance", fontsize=12)
plt.ylabel("Feature", fontsize=12)
plt.tight_layout()
plt.show()

In [None]:
# Refines the hyperparameter grid for further tuning.
# Defines a refined parameter grid for GradientBoostingRegressor
refined_param_grid = {
    'model__n_estimators': [100, 300, 500],  # Extends the range to more trees
    'model__learning_rate': [0.01, 0.05, 0.1],  # Focuses on smaller learning rates (adds 0.05, drops 0.2)
    'model__max_depth': [3, 5, 7],  # Same tree depths as the previous grid
    'model__subsample': [0.8, 1.0]  # Adds subsampling, which can help reduce overfitting
}

# Creates a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Scaling step
    ('model', GradientBoostingRegressor(random_state=42))  # Gradient Boosting
])

# Performs GridSearchCV
grid_search_refined = GridSearchCV(pipeline, refined_param_grid, cv=5, scoring='r2', verbose=1)
grid_search_refined.fit(X_train, y_train)

# Displays the best parameters from refined tuning
print("Refined Best Parameters:", grid_search_refined.best_params_)

# Evaluates the refined model on the test set
refined_best_model = grid_search_refined.best_estimator_
y_pred_refined = refined_best_model.predict(X_test)

# Calculates the performance metrics for the refined model
mae_refined = mean_absolute_error(y_test, y_pred_refined)
mse_refined = mean_squared_error(y_test, y_pred_refined)
r2_refined = r2_score(y_test, y_pred_refined)

print(f"Refined Test Set MAE: {mae_refined:.2f}")
print(f"Refined Test Set MSE: {mse_refined:.2f}")
print(f"Refined Test Set R²: {r2_refined:.2f}")

**Results**

Fitting 5 folds for each of 54 candidates, totalling **270 fits**

**Refined Best Parameters:** {'model__learning_rate': 0.01, 'model__max_depth': 3, 'model__n_estimators': 500, 'model__subsample': 1.0}

**Refined Test Set MAE:** 2529.26

**Refined Test Set MSE:** 18161303.68

**Refined Test Set R²:** 0.89

In [None]:
# Looks at the median absolute error and adjusted R².
from sklearn.metrics import median_absolute_error

# Median Absolute Error
medae = median_absolute_error(y_test, y_pred_refined)
print(f"Median Absolute Error (MedAE): {medae:.2f}")

# Adjusted R²
def adjusted_r2(r2, n, p):
    """Calculate Adjusted R²."""
    return 1 - ((1 - r2) * (n - 1) / (n - p - 1))

# Calculates the adjusted R²
n = len(y_test)  # Number of observations
p = X_test.shape[1]  # Number of features
adj_r2 = adjusted_r2(r2_refined, n, p)
print(f"Adjusted R²: {adj_r2:.2f}")

**Results**

**Median Absolute Error (MedAE):** 1659.11

**Adjusted R²:** 0.88
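
For reference, the `adjusted_r2` helper implements the standard adjusted R² formula, where $n$ is the number of observations and $p$ is the number of features:

$$
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
$$

Here $n$ is `len(y_test)` and $p$ is `X_test.shape[1]`, so $p$ counts every one-hot encoded and engineered column. The correction grows as more features are added relative to the number of test observations.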

# Machine Learning: Master Challenge

## 8. Calculating Potential Cost or Loss

### Challenge:
Now that you've built and optimized your models, it's time for the final challenge! Your task is to minimize the Root Mean Squared Error (RMSE) of your model's predictions and calculate the potential financial impact of your model's errors.

### Task:
1. Calculate the RMSE of your final model's predictions.
2. Break down the errors into underestimation and overestimation.
3. Calculate the total potential cost or loss to the company.
4. Compete with your classmates to see who can achieve the lowest RMSE and financial impact!

### Explanation:
The RMSE estimates the typical size of your model's prediction errors in the same units as the target, and it penalizes large errors more heavily than small ones. We will also analyze the errors by categorizing them into underestimations and overestimations to understand their financial impact.

#### Steps to Calculate Underestimation and Overestimation Errors:

1. **Calculate RMSE**:
   - Use the `mean_squared_error` function from `sklearn.metrics` and pass your actual values (`y_test`) and predicted values (`y_pred_final`) to it.
   - Take the square root of the result to get the RMSE.
   
2. **Calculate Underestimation Error**:
   - Identify the instances where the actual charges (`y_test`) are greater than the predicted charges (`y_pred_final`).
   - For these instances, calculate the difference between the actual and predicted charges.
   - Sum these differences to get the total underestimation error.

3. **Calculate Overestimation Error**:
   - Identify the instances where the actual charges (`y_test`) are less than the predicted charges (`y_pred_final`).
   - For these instances, calculate the difference between the predicted and actual charges.
   - Sum these differences to get the total overestimation error.

4. **Calculate Total Potential Cost or Loss**:
   - Add the total underestimation error and the total overestimation error to get the total potential cost or loss.

### Let's see who can build the best model!

#### Detailed Instructions:

1. **Calculate RMSE**:
   - Use `mean_squared_error` with `y_test` and `y_pred_final`.
   - Use `np.sqrt` to take the square root of the result.

2. **Calculate Underestimation Error**:
   - Use a boolean condition to filter `y_test` values that are greater than `y_pred_final`.
   - Subtract the predicted values from the actual values for these instances.
   - Sum these differences.

3. **Calculate Overestimation Error**:
   - Use a boolean condition to filter `y_test` values that are less than `y_pred_final`.
   - Subtract the actual values from the predicted values for these instances.
   - Sum these differences.

4. **Calculate Total Potential Cost or Loss**:
   - Add the results of the underestimation error and overestimation error to get the total potential cost or loss.

### Example Walkthrough:

1. **Calculate RMSE**:
   - `rmse = np.sqrt(mean_squared_error(y_test, y_pred_final))`
   - This gives you the typical prediction error in dollars, with large errors weighted more heavily than in a plain average.

2. **Calculate Underestimation Error**:
   - `underestimation_error = np.sum(y_test[y_test > y_pred_final] - y_pred_final[y_test > y_pred_final])`
   - This gives you the total amount by which the model undercharged.

3. **Calculate Overestimation Error**:
   - `overestimation_error = np.sum(y_pred_final[y_test < y_pred_final] - y_test[y_test < y_pred_final])`
   - This gives you the total amount by which the model overcharged.

4. **Calculate Total Potential Cost or Loss**:
   - `total_potential_loss = underestimation_error + overestimation_error`
   - This gives you the total financial impact of the model's errors.
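
The four steps can be checked by hand on a tiny example. The sketch below uses four made-up actual and predicted values (not taken from the insurance data), and the `_demo` names are hypothetical.

```python
import numpy as np

# Made-up actual and predicted charges, for illustration only
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 180.0, 300.0, 450.0])

# Positive error = model underestimated, negative = model overestimated
errors = y_true_demo - y_pred_demo

rmse_demo = np.sqrt(np.mean(errors ** 2))
under_demo = errors[errors > 0].sum()    # actual above predicted
over_demo = -errors[errors < 0].sum()    # predicted above actual
total_demo = under_demo + over_demo

print(f"RMSE: {rmse_demo:.2f}")
print(f"Underestimation error: {under_demo:.2f}")
print(f"Overestimation error: {over_demo:.2f}")
print(f"Total potential cost or loss: {total_demo:.2f}")
```

The total is simply the sum of all absolute errors, so dividing it by the number of test rows gives the MAE. This is a quick consistency check between the leaderboard figures.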

### Leaderboard:
Post your RMSE score and total potential cost or loss on the class leaderboard. The student with the lowest RMSE and total potential cost or loss wins bragging rights.

### Post Your Results 

- Name
- Model Type
- RMSE
- Underestimation Error
- Overestimation Error
- Total Potential Cost/Loss

In [None]:
# Executes the final challenge, as described above. 
# Imports the necessary libraries
import numpy as np
from sklearn.metrics import mean_squared_error

# Assuming 'best_model' is the trained model and 'X_test' is the test dataset
# Generates predictions from the model
y_pred_final = best_model.predict(X_test)

# Calculates the RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred_final))
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

# Calculates the underestimation error
underestimated_mask = y_test > y_pred_final
underestimation_error = np.sum(y_test[underestimated_mask] - y_pred_final[underestimated_mask])
print(f"Total Underestimation Error: {underestimation_error:.2f}")

# Calculates the overestimation error
overestimated_mask = y_test < y_pred_final
overestimation_error = np.sum(y_pred_final[overestimated_mask] - y_test[overestimated_mask])
print(f"Total Overestimation Error: {overestimation_error:.2f}")

# Calculates the Total Potential Cost or Loss
total_potential_loss = underestimation_error + overestimation_error
print(f"Total Potential Cost or Loss: {total_potential_loss:.2f}")

**Final Submission**

Sylvia Y. Perez-Montero

**Model Type:** GradientBoostingRegressor

**RMSE:** 4180.01 (the square root of the test-set MSE, 17472493.13)

**MAE:** 2360.01

**Underestimation error:**  286027.07

**Overestimation error:** 346456.28

**Total financial impact caused by prediction errors in both directions:** 632483.35

## Conclusion

Congratulations! You've completed the lab. Here's a summary of what we've covered:
1. Established a naive baseline using the mean of the target variable.
2. Built an initial linear regression model without any feature engineering or optimization.
3. Performed feature engineering to create new, potentially useful features.
4. Used pipelines and GridSearchCV to optimize the model.
5. Evaluated the final model's performance using RMSE and the total under- and overestimation errors to understand its business impact.

By following these steps, you now have a robust understanding of how to approach a regression problem, from initial exploration to model optimization and business impact assessment. Great job!
