![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a data scientist at a leading health insurance company, you will predict customers' healthcare costs using machine learning. Your predictions will help the company tailor its services and help customers plan their healthcare expenses.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



Some data cleaning is needed before modeling: the raw file contains, for example, negative ages, inconsistent capitalization in `region`, and `$` prefixes in `charges`. Once your model is built on the `insurance.csv` dataset, the next step is to apply it to `validation_dataset.csv`. This dataset has the same columns as the training data except `charges`, so it tests how well your model predicts costs for new customers.
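
Since the validation file is described as the training data minus `charges`, a quick column check before modeling can catch a mismatch early. The sketch below uses two toy DataFrames in place of the real CSV files, and `missing_features` is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

# Toy stand-ins for insurance.csv and validation_dataset.csv (illustrative values only)
train = pd.DataFrame({
    "age": [19, 18], "sex": ["female", "male"], "bmi": [27.9, 33.77],
    "children": [0, 1], "smoker": ["yes", "no"],
    "region": ["southwest", "southeast"], "charges": [16884.924, 1725.5523],
})
valid = train.drop(columns="charges")

def missing_features(train_df, valid_df, target="charges"):
    """List training feature columns that are absent from the validation frame."""
    expected = [col for col in train_df.columns if col != target]
    return [col for col in expected if col not in valid_df.columns]

missing = missing_features(train, valid)
print(missing)  # an empty list means every feature column is present
```

With the real files, the same helper could be called on the two frames returned by `pd.read_csv`.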

## Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

In [106]:
#Importing required modules
import pandas as pd 
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

In [107]:
#Loading the training dataset
insurance = pd.read_csv('insurance.csv')
insurance

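|      | age   | sex    | bmi    | children | smoker | region    | charges      |
|------|-------|--------|--------|----------|--------|-----------|--------------|
| 0    | 19.0  | female | 27.900 | 0.0      | yes    | southwest | 16884.924    |
| 1    | 18.0  | male   | 33.770 | 1.0      | no     | Southeast | 1725.5523    |
| 2    | 28.0  | male   | 33.000 | 3.0      | no     | southeast | $4449.462    |
| 3    | 33.0  | male   | 22.705 | 0.0      | no     | northwest | $21984.47061 |
| 4    | 32.0  | male   | 28.880 | 0.0      | no     | northwest | $3866.8552   |
| ...  | ...   | ...    | ...    | ...      | ...    | ...       | ...          |
| 1333 | 50.0  | male   | 30.970 | 3.0      | no     | Northwest | $10600.5483  |
| 1334 | -18.0 | female | 31.920 | 0.0      | no     | Northeast | 2205.9808    |
| 1335 | 18.0  | female | 36.850 | 0.0      | no     | southeast | $1629.8335   |
| 1336 | 21.0  | female | 25.800 | 0.0      | no     | southwest | 2007.945     |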
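
The preview shows three kinds of dirty values: a negative age, mixed-case region names, and `$`-prefixed charges. A rough way to count such issues before cleaning is sketched below on four rows copied from the preview. The `audit` function is a hypothetical helper, and the checks cover only the problems visible in this preview.

```python
import pandas as pd

# Four rows copied from the preview; charges stay as strings so the '$' prefix survives
sample = pd.DataFrame({
    "age": [19.0, 18.0, 28.0, -18.0],
    "region": ["southwest", "Southeast", "southeast", "Northeast"],
    "charges": ["16884.924", "1725.5523", "$4449.462", "2205.9808"],
})

def audit(df):
    """Count rows affected by each data-quality issue seen in the preview."""
    return {
        "non_positive_age": int((df["age"] <= 0).sum()),
        "mixed_case_region": int((df["region"] != df["region"].str.lower()).sum()),
        "dollar_prefixed_charges": int(df["charges"].astype(str).str.startswith("$").sum()),
    }

report = audit(sample)
print(report)
```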


In [108]:
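#Cleaning the dataset
def clean_dataset(insurance):
    #Working on a copy so the caller's DataFrame is left unchanged
    insurance = insurance.copy()
    insurance['sex'] = insurance['sex'].replace({'M': 'male', 'man': 'male', 'F': 'female', 'woman': 'female'})
    #Keeping only rows with a positive age (rows with a missing age are dropped too)
    insurance = insurance[insurance['age'] > 0].copy()
    #Treating negative child counts as zero
    insurance.loc[insurance['children'] < 0, 'children'] = 0
    insurance['region'] = insurance['region'].str.lower()
    #Stripping the '$' prefix (raw string avoids an invalid-escape warning) before converting to float
    insurance['charges'] = insurance['charges'].replace({r'\$': ''}, regex=True).astype(float)
    return insurance.dropna()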
# insurance = clean_dataset(insurance)
# insurance
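
To see what these cleaning steps do, it can help to run them on a tiny frame. The sketch below expresses the same steps as a non-mutating method chain on made-up rows. It is an alternative formulation for illustration, not the notebook's function, and it assumes every `charges` value is a string.

```python
import pandas as pd

# Made-up rows that exercise each cleaning rule
raw = pd.DataFrame({
    "sex": ["F", "man", "female"],
    "age": [19.0, -18.0, 45.0],
    "children": [0.0, 1.0, -2.0],
    "region": ["Southwest", "northeast", "NorthWest"],
    "charges": ["$16884.924", "2205.9808", "$7000.5"],
})

cleaned = (
    raw.assign(
        # Map alternative labels onto 'male' / 'female'
        sex=raw["sex"].replace({"M": "male", "man": "male", "F": "female", "woman": "female"}),
        # Negative child counts become zero
        children=raw["children"].clip(lower=0),
        # Normalize region casing
        region=raw["region"].str.lower(),
        # Remove the literal '$' and convert to float
        charges=raw["charges"].str.replace("$", "", regex=False).astype(float),
    )
    .query("age > 0")  # drop rows with non-positive ages
    .dropna()          # drop rows that still have missing values
)
print(cleaned)
```

Only the rows with index 0 and 2 remain, because the middle row has a negative age.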

In [109]:
def reg_model(insurance):
    #Separating the features from the target
    X = insurance.drop('charges', axis=1)
    y = insurance['charges']
    
    #Categorical and numeric features
    X_categorical = ['sex', 'smoker', 'region']
    X_numerical = ['age', 'bmi', 'children']
    
    #Encoding the categorical variables 
    X_encoded = pd.get_dummies(X[X_categorical], drop_first=True)
    
    #Combining encoded categorical and numerical features
    X_processed = pd.concat([X[X_numerical], X_encoded], axis=1)
    
    #Creating pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('lin_reg', LinearRegression())
    ])
    pipeline.fit(X_processed, y)
    
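    #Evaluating the model with 5-fold cross-validation
    #scikit-learn reports negated MSE so that higher is better; the leading minus flips the sign back
    mse_scores = -cross_val_score(pipeline, X_processed, y, cv=5, scoring="neg_mean_squared_error")
    r2_scores = cross_val_score(pipeline, X_processed, y, cv=5, scoring="r2")
    #Averaging the per-fold scores
    mse_score = np.mean(mse_scores)
    r2_score = np.mean(r2_scores)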
    
    return pipeline, mse_score, r2_score
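
One possible variation is to move the categorical encoding inside the pipeline, so the fitted model remembers the categories it saw and can be applied to raw validation rows directly. The sketch below swaps `pd.get_dummies` for scikit-learn's `ColumnTransformer` and `OneHotEncoder`. It runs on synthetic data generated only for this example, and the coefficients used to build `charges` are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = ["sex", "smoker", "region"]
numerical = ["age", "bmi", "children"]

# Synthetic training data (not the real insurance.csv)
rng = np.random.default_rng(0)
n = 60
train = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 5, n),
    "children": rng.integers(0, 4, n),
    "sex": rng.choice(["male", "female"], n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
})
# Arbitrary relationship so the model has something to learn
train["charges"] = 250 * train["age"] + 20000 * (train["smoker"] == "yes") + rng.normal(0, 500, n)

# Scale numeric columns and one-hot encode categorical ones inside the pipeline
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numerical),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([("preprocess", preprocess), ("lin_reg", LinearRegression())])
model.fit(train.drop(columns="charges"), train["charges"])

# A raw, unencoded row can be passed straight to predict
new_customer = pd.DataFrame({
    "age": [40], "bmi": [28.0], "children": [1],
    "sex": ["male"], "smoker": ["yes"], "region": ["southeast"],
})
prediction = model.predict(new_customer)
print(prediction)
```

Because the encoder is fitted once on the training data, a validation set that happens to lack one region still produces the same feature layout.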

In [110]:
#Usage
cleaned_insurance = clean_dataset(insurance)
insurance_model, mse_score, r2_score = reg_model(cleaned_insurance)
print("Mean mse: ", mse_score)
print("Mean r2: ", r2_score)

Mean mse:  37431001.52191916
Mean r2:  0.7450511466263761
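
The MSE is in squared dollars, which is hard to read. Taking its square root gives an error on the same scale as `charges`. Note that this is the root of the mean cross-validated MSE, which is not necessarily equal to the mean of the per-fold RMSE values.

```python
import math

mean_mse = 37431001.52191916  # mean cross-validated MSE printed above
rmse = math.sqrt(mean_mse)    # back on the same scale as charges
print(f"RMSE: {rmse:.2f}")
```

The result is a little over 6,100, so a typical prediction error is on the order of six thousand in the units of `charges`.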


In [111]:
#Prediction on validation data
validation_data = pd.read_csv('validation_dataset.csv')

#Encoding the categorical features
validation_encoded = pd.get_dummies(validation_data, columns=['sex', 'smoker', 'region'], drop_first=True)

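#Predicting charges; the encoded columns must match the training features in name and order
validation_prediction = insurance_model.predict(validation_encoded)
validation_data['predicted_charges'] = validation_prediction
#Flooring predictions at 1000
validation_data['predicted_charges'] = validation_data['predicted_charges'].clip(lower=1000)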
validation_data

|   | age  | sex    | bmi       | children | smoker | region    | predicted_charges |
|---|------|--------|-----------|----------|--------|-----------|-------------------|
| 0 | 18.0 | female | 24.09     | 1.0      | no     | southeast | 1000.0            |
| 1 | 39.0 | male   | 26.41     | 0.0      | yes    | northeast | 30947.521922      |
| 2 | 27.0 | male   | 29.15     | 0.0      | yes    | southeast | 27951.157717      |
| 3 | 71.0 | male   | 65.502135 | 13.0     | yes    | southeast | 56291.274683      |
| 4 | 28.0 | male   | 38.06     | 0.0      | no     | southeast | 7147.814884       |
| 5 | 70.0 | female | 72.958351 | 11.0     | yes    | southeast | 57910.338597      |
| 6 | 29.0 | female | 32.11     | 2.0      | no     | northwest | 6866.745983       |
| 7 | 42.0 | female | 41.325    | 1.0      | no     | northeast | 13200.828862      |
| 8 | 48.0 | female | 36.575    | 0.0      | no     | northwest | 12562.227822      |
| 9 | 63.0 | male   | 33.66     | 3.0      | no     | southeast | 16010.331763      |
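
Encoding the validation data separately only works if it yields exactly the training columns. A more defensive sketch is to encode without `drop_first` and then reindex to the training feature list, which drops the baseline dummies and fills any absent category with 0. The `train_columns` list below is an assumption, written out by hand from the encoding used in the training cell, and the rows are the first three from the output above.

```python
import pandas as pd

# Assumed training feature order (hypothetical, derived from the training cell's encoding)
train_columns = [
    "age", "bmi", "children", "sex_male", "smoker_yes",
    "region_northwest", "region_southeast", "region_southwest",
]

# First three validation rows from the output above
validation = pd.DataFrame({
    "age": [18.0, 39.0, 27.0],
    "sex": ["female", "male", "male"],
    "bmi": [24.09, 26.41, 29.15],
    "children": [1.0, 0.0, 0.0],
    "smoker": ["no", "yes", "yes"],
    "region": ["southeast", "northeast", "southeast"],
})

# Encode every category, then keep only the columns the model was trained on
encoded = pd.get_dummies(validation, columns=["sex", "smoker", "region"])
aligned = encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

Skipping `drop_first` here matters: with it, pandas drops whichever category sorts first in the validation data, which may differ from the one dropped during training.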
