# Introduction

In this project, the objective is to develop a regression model to predict insurance charges using the provided dataset, insurance.csv. The primary evaluation metric for the model's performance will be the R-Squared Score, which indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.

**Objective:**
* Building a regression model using the insurance.csv dataset to predict charges.
* Evaluating the accuracy of the model using the R-Squared Score.
* Applying the trained model to estimate charges for unseen data in validation_dataset.csv.

**Data Preparation:**
Exploring and preprocessing the insurance.csv dataset to handle missing values, categorical variables, and any outliers.

**Model Development:**
* Training a regression model using the prepared dataset to predict charges.
* Evaluating the model's performance using the R-Squared Score.
* Ensuring that the R-Squared Score exceeds the threshold of 0.65 for the model to be considered successful.

**Model Application:**
* Utilizing the trained regression model to predict charges for the validation_dataset.csv.
* Storing the predictions as a new column named predicted_charges in the validation dataset.
* Handling any negative prediction values by replacing them with the minimum basic charge, set at 1000.

By following these steps, the aim is to build an accurate regression model that effectively predicts insurance charges, thus aiding in decision-making processes within the insurance industry.

# Dataset Description

* **Age:** This shows how old the main person getting insurance is.
* **Sex:** It tells us if the insurance buyer is male or female.
* **BMI:** This number gives an idea of how healthy a person's weight is based on their height and weight.
* **Children:** It tells us how many kids or dependents are covered by the insurance.
* **Smoker:** This lets us know if the person getting insurance smokes or not.
* **Region:** It shows where the person lives, split into four different areas in the US.
* **Charges:** This is how much money the insurance company bills for each person's medical costs.

# Importing Libraries

In [1]:
# Importing Libraries

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Load Dataset

In [2]:
insurance = pd.read_csv('/kaggle/input/main-insurance/insurance.csv')

In [3]:
# display the first few rows of the insurance dataset

insurance.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19.0,female,27.9,0.0,yes,southwest,16884.924
1,18.0,male,33.77,1.0,no,Southeast,1725.5523
2,28.0,male,33.0,3.0,no,southeast,$4449.462
3,33.0,male,22.705,0.0,no,northwest,$21984.47061
4,32.0,male,28.88,0.0,no,northwest,$3866.8552


In [4]:
# Initial checks on the insurance dataset
insurance.shape

(1338, 7)

In [5]:
insurance.isnull().sum()

age         66
sex         66
bmi         66
children    66
smoker      66
region      66
charges     54
dtype: int64

# Cleaning Dataset

In [6]:
def clean_dataset(insurance):
    # Replace gender values
    insurance['sex'] = insurance['sex'].replace({'M': 'male', 'man': 'male', 'F': 'female', 'woman': 'female'})
    
    # Remove dollar sign and convert charges to float
    insurance['charges'] = insurance['charges'].replace({'\$': ''}, regex=True).astype(float)
    
    # Filter out rows with negative age values
    insurance = insurance[insurance['age'] > 0]
    
    # Set negative children values to zero using .loc
    insurance.loc[insurance['children'] < 0, 'children'] = 0
    
    # Convert region values to lowercase using .loc
    insurance.loc[:, 'region'] = insurance['region'].str.lower()

    # Drop rows with any missing values
    insurance = insurance.dropna()

    return insurance


Firstly, the function replaces various representations of gender ('M', 'man', 'F', 'woman') with standard representations ('male', 'female') in the sex column. Then, it removes the dollar sign from the charges column and converts the values to floating-point numbers. Next, it filters out rows where the 'age' column has negative values, effectively removing them from the dataset.

Following that, it utilizes a lambda function to set any negative values in the children column to zero. This lambda function iterates through each value in the 'children' column, replacing any negative value with zero.

Subsequently, it converts all values in the region column to lowercase, ensuring consistency in the format of region names. Lastly, the function drops any rows with missing values from the cleaned dataset before returning the modified DataFrame.

# Creating and Evaluating Regression Model

In [7]:
def create_and_evaluate_regression_model(insurance):
    # Extracting features and target variable
    X = insurance.drop('charges', axis=1)
    y = insurance['charges']
    
    # Defining categorical and numerical features
    categorical_features = ['sex', 'smoker', 'region']
    numerical_features = ['age', 'bmi', 'children']
    
    # Encoding categorical variables as dummy variables
    X_categorical = pd.get_dummies(X[categorical_features], drop_first=True)
    
    # Combining numerical features with dummy variables
    X_processed = pd.concat([X[numerical_features], X_categorical], axis=1)
    
    # Scaling numerical features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_processed)
    
    # Creating a linear regression model
    lin_reg = LinearRegression()
    
    # Constructing a pipeline with scaling and linear regression
    steps = [("scaler", scaler), ("lin_reg", lin_reg)]
    insurance_model_pipeline = Pipeline(steps)
    
    # Fitting the model to the data
    insurance_model_pipeline.fit(X_scaled, y)
    
    # Evaluating the model using cross-validation
    mse_scores = -cross_val_score(insurance_model_pipeline, X_scaled, y, cv=5, scoring='neg_mean_squared_error')
    r2_scores = cross_val_score(insurance_model_pipeline, X_scaled, y, cv=5, scoring='r2')
    mean_mse = np.mean(mse_scores)
    mean_r2 = np.mean(r2_scores)
    
    return insurance_model_pipeline, mean_mse, mean_r2


The create_and_evaluate_regression_model function coordinates the process of building and assessing a linear regression model for predicting insurance charges. It initiates by segregating the dataset into features and the target variable, where categorical and numerical features are identified separately. Categorical features undergo one-hot encoding to convert them into numerical form, while numerical features are standardized using the StandardScaler to ensure uniform scales across all features. A linear regression model is instantiated and combined with feature scaling in a pipeline. This pipeline facilitates seamless integration of preprocessing steps and model training. Subsequently, the model is trained on the scaled features and evaluated using k-fold cross-validation to gauge its generalization performance. Evaluation metrics such as mean squared error (MSE) and R-squared (R2) scores are computed to provide insights into the model's predictive accuracy and goodness of fit. Ultimately, the function returns a tuple comprising the trained pipeline object along with the mean MSE and mean R2 scores, encapsulating the trained model and its performance metrics for further analysis and interpretation.

In [8]:
# Clean the insurance dataset and store the cleaned data
cleaned_insurance = clean_dataset(insurance)

# Use the cleaned data to create and evaluate a regression model
insurance_model, mean_mse, mean_r2 = create_and_evaluate_regression_model(cleaned_insurance)

# Print the evaluation results
print("Evaluation Results:")
print(f"  - Mean MSE: {mean_mse}")
print(f"  - Mean R2: {mean_r2}")


Evaluation Results:
  - Mean MSE: 37431001.52191915
  - Mean R2: 0.7450511466263761


**Mean Mean Squared Error (MSE):** The average value of the squared differences between the actual charges and the predicted charges across all cross-validation folds. In this case, the mean MSE is approximately 37,431,001.52.

**Mean R-Squared (R2) Score:** The average R-squared value obtained from cross-validation. R-squared is a measure of how well the regression model fits the actual data. It indicates the proportion of the variance in the dependent variable (insurance charges) that is predictable from the independent variables (features) in the model. The mean R2 score is approximately 0.745, suggesting that the model explains approximately 74.5% of the variance in the insurance charges.

# Prediction

In [9]:
validation_data = pd.read_csv('/kaggle/input/insurance/validation_dataset.csv')

In [10]:
validation_data_processed = pd.get_dummies(validation_data, columns=['sex', 'smoker', 'region'], drop_first=True)


In [11]:
import warnings

# Suppress all warnings
warnings.filterwarnings("ignore")


In [12]:
# Make predictions using the trained model
validation_predictions = insurance_model.predict(validation_data_processed)

In [13]:
# Add predicted charges to the validation data
validation_data['predicted_charges'] = validation_predictions

In [14]:
validation_data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,predicted_charges
0,18.0,female,24.09,1.0,no,southeast,128624.195643
1,39.0,male,26.41,0.0,yes,northeast,220740.537449
2,27.0,male,29.15,0.0,yes,southeast,181357.588606
3,71.0,male,65.502135,13.0,yes,southeast,423490.68727
4,28.0,male,38.06,0.0,no,southeast,193247.431989


In [15]:
# Ensure minimum charge is $1000
min_charge = 1000
validation_data['predicted_charges'] = validation_data['predicted_charges'].apply(lambda x: max(x, min_charge))

In [16]:
validation_data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,predicted_charges
0,18.0,female,24.09,1.0,no,southeast,128624.195643
1,39.0,male,26.41,0.0,yes,northeast,220740.537449
2,27.0,male,29.15,0.0,yes,southeast,181357.588606
3,71.0,male,65.502135,13.0,yes,southeast,423490.68727
4,28.0,male,38.06,0.0,no,southeast,193247.431989
