![](image.jpg)


Dive into the heart of data science with a project that combines healthcare insights and predictive analytics. As a Data Scientist at a top Health Insurance company, you have the opportunity to predict customer healthcare costs using the power of machine learning. Your insights will help tailor services and guide customers in planning their healthcare expenses more effectively.

## Dataset Summary

Meet your primary tool: the `insurance.csv` dataset. Packed with information on health insurance customers, this dataset is your key to unlocking patterns in healthcare costs. Here's what you need to know about the data you'll be working with:

## insurance.csv
| Column    | Data Type | Description                                                      |
|-----------|-----------|------------------------------------------------------------------|
| `age`       | int       | Age of the primary beneficiary.                                  |
| `sex`       | object    | Gender of the insurance contractor (male or female).             |
| `bmi`       | float     | Body mass index, a key indicator of body fat based on height and weight. |
| `children`  | int       | Number of dependents covered by the insurance plan.              |
| `smoker`    | object    | Indicates whether the beneficiary smokes (yes or no).            |
| `region`    | object    | The beneficiary's residential area in the US, divided into four regions. |
| `charges`   | float     | Individual medical costs billed by health insurance.             |



A bit of data cleaning is key to ensure the dataset is ready for modeling. Once your model is built using the `insurance.csv` dataset, the next step is to apply it to the `validation_dataset.csv`. This new dataset, similar to your training data minus the `charges` column, tests your model's accuracy and real-world utility by predicting costs for new customers.

## Let's Get Started!

This project is your playground for applying data science in a meaningful way, offering insights that have real-world applications. Ready to explore the data and uncover insights that could revolutionize healthcare planning? Let's begin this exciting journey!

Develop a regression model using the `insurance.csv` dataset to predict `charges`. Evaluate the model's accuracy using the R-Squared Score. Then, apply the model to estimate `predicted_charges` for unseen data in `validation_dataset.csv`.

- Build a regression model to predict `charges` using the `insurance.csv` dataset. Evaluate the R-Squared Score of your trained model and save it as a variable named `r2_score`. The model's success will be assessed based on its R-Squared Score, which must exceed a threshold of **0.65**.

⚠️ Note: If you encounter errors during model training, make sure the `insurance` DataFrame is properly cleaned and ready for modeling.

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
import warnings
warnings.filterwarnings("ignore")

pd.set_option("display.width", 1000)

In [2]:
insurance = pd.read_csv('insurance.csv')
print(insurance.head())

    age     sex     bmi  children smoker     region       charges
0  19.0  female  27.900       0.0    yes  southwest     16884.924
1  18.0    male  33.770       1.0     no  Southeast     1725.5523
2  28.0    male  33.000       3.0     no  southeast     $4449.462
3  33.0    male  22.705       0.0     no  northwest  $21984.47061
4  32.0    male  28.880       0.0     no  northwest    $3866.8552


In [7]:
print(insurance.sex.unique())
print(insurance.charges.unique())
print(insurance.age.unique())
print(insurance.region.unique())

['female' 'male' 'woman' 'F' 'man' nan 'M']
['16884.924' '1725.5523' '$4449.462' ... '$1629.8335' '2007.945'
 '29141.3603']
[ 19.  18.  28.  33.  32. -31.  46.  37.  60.  25.  62.  23.  56. -27.
  52. -23.  30. -34.  59.  63.  55.  31.  22.  nan  26.  35.  24.  41.
  21.  48.  36.  40.  58.  34.  43.  64.  20.  61.  27.  53.  44.  57.
 -41.  45. -35.  54.  38.  29.  49.  47.  51.  42.  50. -44. -39. -28.
 -40.  39. -25. -52. -26. -47. -45. -57. -43. -50. -58. -56. -30. -51.
 -60. -37. -55. -64. -22. -36. -21. -18. -20. -19. -33.]
['southwest' 'Southeast' 'southeast' 'northwest' 'Northwest' 'Northeast'
 'northeast' 'Southwest' nan]


In [8]:
cleaned_insurance = insurance.copy()

cleaned_insurance['sex'] = cleaned_insurance['sex'].replace({'M': 'male', 'man': 'male', 'F': 'female', 'woman': 'female'})
cleaned_insurance['charges'] = cleaned_insurance['charges'].replace({'\$': ''}, regex=True).astype(float)
cleaned_insurance = cleaned_insurance[cleaned_insurance["age"] > 0]
cleaned_insurance.loc[cleaned_insurance["children"] < 0, "children"] = 0
cleaned_insurance["region"] = cleaned_insurance["region"].str.lower()
cleaned_insurance = cleaned_insurance.dropna()

print(cleaned_insurance.head())

    age     sex     bmi  children smoker     region      charges
0  19.0  female  27.900       0.0    yes  southwest  16884.92400
1  18.0    male  33.770       1.0     no  southeast   1725.55230
2  28.0    male  33.000       3.0     no  southeast   4449.46200
3  33.0    male  22.705       0.0     no  northwest  21984.47061
4  32.0    male  28.880       0.0     no  northwest   3866.85520


In [9]:
X = cleaned_insurance.drop('charges', axis=1)
y = cleaned_insurance['charges']

categorical_features = ['sex', 'smoker', 'region']
numerical_features = ['age', 'bmi', 'children']
    
X_categorical = pd.get_dummies(X[categorical_features], drop_first=True)
    
X_processed = pd.concat([X[numerical_features], X_categorical], axis=1)

In [10]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_processed)

In [11]:
lr = LinearRegression()
    
steps = [("scaler", scaler), ("lin_reg", lr)]
insurance_model_pipeline = Pipeline(steps)
    
insurance_model_pipeline.fit(X_scaled, y)

In [12]:
mse_scores = -cross_val_score(insurance_model_pipeline, X_scaled, y, cv=5, scoring='neg_mean_squared_error')
r2_scores = cross_val_score(insurance_model_pipeline, X_scaled, y, cv=5, scoring='r2')
mean_mse = np.mean(mse_scores)
mean_r2 = np.mean(r2_scores)

In [14]:
print("Mean MSE:", mean_mse)
print("Mean R2:", mean_r2)

Mean MSE: 37431001.52191917
Mean R2: 0.7450511466263761


- Use the trained model to predict charges for the data in `validation_dataset.csv`. Store the predictions in a new column named `predicted_charges` within the validation dataset, and save it as a pandas DataFrame called `validation_data`. Ensure a minimum basic charge of **1000**.

In [18]:
validation_data_path = 'validation_dataset.csv'
validation_data = pd.read_csv(validation_data_path)

print(validation_data.head())

    age     sex        bmi  children smoker     region
0  18.0  female  24.090000       1.0     no  southeast
1  39.0    male  26.410000       0.0    yes  northeast
2  27.0    male  29.150000       0.0    yes  southeast
3  71.0    male  65.502135      13.0    yes  southeast
4  28.0    male  38.060000       0.0     no  southeast


In [19]:
validation_data_processed = pd.get_dummies(validation_data, columns=['sex', 'smoker', 'region'], drop_first=True)

validation_predictions = insurance_model_pipeline.predict(validation_data_processed)

# Add predicted charges to the validation data
validation_data['predicted_charges'] = validation_predictions

# Adjust predictions to ensure minimum charge is $1000
validation_data.loc[validation_data['predicted_charges'] < 1000, 'predicted_charges'] = 1000

# Display the updated dataframe
print(validation_data.head())

    age     sex        bmi  children smoker     region  predicted_charges
0  18.0  female  24.090000       1.0     no  southeast      128624.195643
1  39.0    male  26.410000       0.0    yes  northeast      220740.537449
2  27.0    male  29.150000       0.0    yes  southeast      181357.588606
3  71.0    male  65.502135      13.0    yes  southeast      423490.687270
4  28.0    male  38.060000       0.0     no  southeast      193247.431989
