In [1]:
#!jupyter nbextension install --py --user hide_code
#!jupyter nbextension enable hide_code --user --py
#!jupyter nbextension enable --py --user hide_code

# How expensive is health?

## Dataset

The data comes from [www.kaggle.com](https://www.kaggle.com/datasets/mirichoi0218/insurance).

**Features**
<br>The dataset contains 1338 customer records, each described by 7 columns.




* [1. Goal](#Goal)
* [2. Research](#Research)
* [3. Data processing](#Data_processing)
* [4. Model selection](#Model_selection)
* [5. Predictive analysis on a selected model](#Predictive)
* [6. Final conclusions](#Final_conclusions)
* [7. Application](#Application)

## 1. Goal<a id='Goal'></a>

The aim of the project is to predict individual health costs (`charges`) from the attributes of the insured person.

## 2. Research<a id='Research'></a>
**Columns**
- **age**: Age of primary beneficiary
- **sex**: Gender of the insurance contractor (`female`, `male`)
- **bmi**: Body mass index (kg/m²), an objective measure of body weight relative to height that shows whether weight is relatively high or low
- **children**: Number of children covered by health insurance / number of dependents
- **smoker**: Whether the beneficiary smokes (`yes`, `no`)
- **region**: The beneficiary's residential area in the US, northeast, southeast, southwest, northwest
- **charges**: Individual medical costs billed by health insurance

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.graph_objects as go

init_notebook_mode(connected=True)

In [3]:
df = pd.read_csv('../data/insurance.csv')

In [4]:
df = df.drop_duplicates()

In [5]:
# For every categorical column, show the group sizes and the mean charge per category.
cat = df.select_dtypes(include=['object']).columns.tolist()
for col in cat:
    obs = df[col].value_counts()
    avg_claim = df.groupby(col)["charges"].mean()
    summary = pd.DataFrame({
        "Number of Policyholders": obs,
        "Average Claim Amount": avg_claim.map('${:,.2f}'.format),
    }).sort_values('Number of Policyholders', ascending=False)
    display(summary.style.set_caption("Variable: {}".format(col)))



| sex | Number of Policyholders | Average Claim Amount |
|---|---|---|
| male | 675 | $13,975.00 |
| female | 662 | $12,569.58 |

| smoker | Number of Policyholders | Average Claim Amount |
|---|---|---|
| no | 1063 | $8,440.66 |
| yes | 274 | $32,050.23 |

| region | Number of Policyholders | Average Claim Amount |
|---|---|---|
| southeast | 364 | $14,735.41 |
| southwest | 325 | $12,346.94 |
| northeast | 324 | $13,406.38 |
| northwest | 324 | $12,450.84 |


In [6]:
fig = px.histogram(
    df, x="charges", color="sex", marginal="box",
    title="Distribution of charges for male and female",
    labels={"charges": "Charges", "count": "Count", "sex": "Sex"},
)
fig.show()

In [7]:
fig = px.histogram(
    df, x="charges", color='smoker', marginal="box",
    title="Distribution of charges vs smoker",
    labels={"count": "Count", "charges": "Charges", "smoker": "Smoker"},
)
fig.show()

In [8]:
fig = px.scatter(
    df, x='age', y='charges', facet_col="smoker", color="sex",
    trendline="ols", size='charges',
    title="Scatter plot of smoker, age and sex",
    labels={"age": "Age", "charges": "Charges", "sex": "Sex"},
)
fig.show()

In [9]:
fig = px.scatter(
    df, x='bmi', y='charges', facet_col="smoker", color="sex",
    trendline="ols", size='charges',
    title="Scatter plot of smoker, BMI and sex",
    labels={"bmi": "BMI", "charges": "Charges", "sex": "Sex"},
)
fig.show()

In [10]:
fig = px.scatter(
    df, x='age', y='bmi', facet_col="smoker", color="sex",
    trendline="ols", size='charges',
    title="Scatter plot of smoker, BMI and age",
    labels={"bmi": "BMI", "age": "Age", "sex": "Sex"},
)
fig.show()
# no relationship between age and bmi?
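
One quick way to probe the age/BMI question raised by the last plot is to compute the Pearson correlation between `age` and `bmi`, overall and per smoker group. This is only a minimal sketch and not part of the original notebook: the helper name `age_bmi_correlation` is hypothetical, and the small frame at the bottom is made-up data standing in for `df`.

```python
import pandas as pd


def age_bmi_correlation(frame: pd.DataFrame) -> pd.Series:
    """Pearson correlation between age and bmi, overall and per smoker group."""
    result = {"all": frame["age"].corr(frame["bmi"])}
    for label, group in frame.groupby("smoker"):
        result[f"smoker={label}"] = group["age"].corr(group["bmi"])
    return pd.Series(result, name="pearson_r")


# Made-up rows for illustration only; in the notebook, pass `df` instead.
demo = pd.DataFrame({
    "age":    [19, 25, 33, 41, 52, 60, 28, 47],
    "bmi":    [27.9, 33.8, 22.7, 28.9, 30.8, 25.8, 36.0, 24.1],
    "smoker": ["yes", "no", "no", "no", "yes", "no", "yes", "yes"],
})
correlations = age_bmi_correlation(demo)
print(correlations)
```

Values close to zero on the real data would support the impression from the scatter plot that the two variables are at most weakly related.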

## 3. Data processing<a id='Data_processing'></a>

- After exploring the data, we implemented a function that performs the initial preprocessing for further modelling.
- The function checks for missing values (nulls) and drops the affected rows.
- It also removes duplicated rows and one-hot encodes the categorical columns (`region`, `sex`, `smoker`) using pandas `get_dummies` with `drop_first=True`.
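
The function itself is not shown in this notebook, so the following is a minimal sketch of the steps listed above. The names `preprocess` and `CATEGORICAL_COLUMNS` are hypothetical, and the demo rows are made up purely to exercise the null and duplicate handling.

```python
import pandas as pd

# Categorical columns named in the text above.
CATEGORICAL_COLUMNS = ["region", "sex", "smoker"]


def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with nulls, drop duplicated rows, then one-hot encode."""
    cleaned = frame.dropna().drop_duplicates()
    # drop_first=True removes one dummy per column to avoid redundant columns.
    return pd.get_dummies(cleaned, columns=CATEGORICAL_COLUMNS, drop_first=True)


# Made-up rows: row 2 duplicates row 1, row 3 has a missing bmi.
demo = pd.DataFrame({
    "age":      [20, 30, 30, 40, 50, 60],
    "sex":      ["female", "male", "male", "male", "male", "female"],
    "bmi":      [28.0, 33.5, 33.5, None, 22.5, 26.0],
    "children": [0, 1, 1, 3, 0, 2],
    "smoker":   ["yes", "no", "no", "no", "no", "no"],
    "region":   ["southwest", "southeast", "southeast",
                 "northwest", "northwest", "northeast"],
    "charges":  [17000.0, 1700.0, 1700.0, 4400.0, 22000.0, 13000.0],
})
processed = preprocess(demo)
print(processed.columns.tolist())
```

With `drop_first=True`, the first category of each column (in sorted order) becomes the implicit baseline, so for example `smoker` is represented by a single `smoker_yes` column.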

## 4. Model selection<a id='Model_selection'></a>
<br/>
<br/>

![metrics_1](../img/basic_models_metrics1.png)
- HyperOpt was used to search for the optimal hyperparameters.

<br/>
<br/>

![metrics_final](../img/final_metric1.png)

## 5. Predictive analysis on a selected model<a id='Predictive'></a>

<!-- ![best_params](../img/xgb_optimized.png) -->



<img src="../img/xgb_optimized1.png" alt="Drawing" style="width: 800px;"/>

<!-- ![feature importance](../img/xgb_FI.png) -->
<img src="../img/FI.png" alt="Drawing" style="width: 800px;"/>

![pred1](../img/predictions_horizontal1.png)

![pred2](../img/predictions_vertical1.png)

## 6. Final conclusions<a id='Final_conclusions'></a>
- Insurance premiums are most often underestimated for people who do not smoke, have an above-normal BMI and have fewer than 3 children.
- Premium miscalculation does not depend on region, gender or age.
- The model is safe for the company because it optimizes profit and minimizes losses (2%).
- **Recommendations for business**: to calculate insurance premiums more accurately and avoid losses, it would be appropriate to:
    - increase the number of records
    - acquire more features for analysis
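
The first conclusion can be checked by flagging underestimated policies (actual charges above the predicted value) and comparing their share across customer segments. The sketch below is an assumption-laden illustration: the `predicted` column and the helper `underestimation_rate` are hypothetical, a BMI of 25 or more is taken as "above normal", and the rows are made up rather than taken from the project's results.

```python
import pandas as pd


def underestimation_rate(frame: pd.DataFrame) -> pd.DataFrame:
    """Share of underestimated policies per customer segment."""
    flagged = frame.assign(
        underestimated=frame["charges"] > frame["predicted"],
        bmi_above_normal=frame["bmi"] >= 25,
        fewer_than_3_children=frame["children"] < 3,
    )
    segments = ["smoker", "bmi_above_normal", "fewer_than_3_children"]
    return (
        flagged.groupby(segments)["underestimated"]
        .agg(rate="mean", policies="size")  # mean of booleans = share of True
        .reset_index()
        .sort_values("rate", ascending=False)
    )


# Made-up rows; `predicted` stands in for the model's output.
demo = pd.DataFrame({
    "smoker":    ["no", "no", "no", "yes", "yes", "no"],
    "bmi":       [31.0, 29.5, 22.0, 34.0, 23.0, 30.2],
    "children":  [0, 2, 1, 1, 4, 3],
    "charges":   [9000, 7500, 4000, 40000, 20000, 8000],
    "predicted": [6500, 7000, 4200, 41000, 19000, 8300],
})
rates = underestimation_rate(demo)
print(rates)
```

Reporting the number of policies next to each rate makes it easier to tell a consistent pattern from a segment with only a handful of records.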

## 7. Application<a id='Application'></a>
