# Analyzing Medical Insurance Costs
---
### The main goal of the project
Explore the “Medical Cost Personal Datasets” dataset to uncover patterns and insights that could inform pricing strategies, risk assessment, and personalized health care planning.

**Tasks**:
- Analyze how medical insurance costs vary with age, gender, BMI, number of dependents, smoking habits, and residential region.
- Investigate the relationships between demographic and lifestyle factors and medical expenses to identify the key drivers of insurance costs.
---

In [2]:
import random

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
from scipy import ndimage, stats
from scipy.stats import pearsonr, spearmanr

In [20]:
insurance = pd.read_csv('insurance.csv')

In [21]:
insurance.head()

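|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |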


In [22]:
insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [23]:
insurance.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [24]:
insurance.duplicated().sum()  # count fully duplicated rows
# insurance = insurance.drop_duplicates()  # left disabled: the duplicate row is kept

np.int64(1)
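
The check above reports one fully duplicated row, and the drop itself is commented out. Before deciding whether to remove it, it can help to look at the rows involved. The following is a minimal sketch on a small hypothetical frame (`toy`, invented for illustration, not the real `insurance.csv`):

```python
import pandas as pd

# Hypothetical toy data standing in for insurance.csv;
# the second and third rows are identical on purpose.
toy = pd.DataFrame({
    "age": [19, 18, 18, 33],
    "sex": ["female", "male", "male", "male"],
    "charges": [16884.924, 1725.5523, 1725.5523, 21984.47061],
})

# keep=False flags every member of a duplicate group, so both copies are visible
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first occurrence by default
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```

Whether an identical row is a data-entry duplicate or two distinct people with the same values cannot be decided from the data alone, which may be why the drop is left disabled.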

In [25]:
insurance.shape

(1338, 7)

In [26]:
insurance.columns

# age - The age of the primary beneficiary (insured individual)

# sex - The gender of the policyholder (male or female)

# bmi - Body Mass Index. A measure of a person's weight relative to their height (kg/m^2),
# indicating whether body weight is high or low. The ideal range is 18.5–24.9

# children - The number of children covered by the health insurance (number of dependents)

# smoker - Smoker (yes/no). Indicates whether the beneficiary is a smoker

# region - The region of residence of the beneficiary in the US (e.g., southwest, southeast, northwest, northeast)

# charges - Individual medical costs billed by health insurance plans (the target variable often predicted in analysis)

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

In [27]:
# encode the categorical features 'sex' and 'smoker' as numbers so they can be used in statistical analysis

# transform 'sex'
sex_mapping = {'female': 0, 'male': 1}
insurance['sex_encoded'] = insurance['sex'].map(sex_mapping)

# transform 'smoker'
smoker_mapping = {'no': 0, 'yes': 1}
insurance['smoker_encoded'] = insurance['smoker'].map(smoker_mapping)
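
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a stray label (different capitalization, extra whitespace) would silently become missing. A small sanity check, sketched here on hypothetical toy data with one deliberately mis-capitalized value:

```python
import pandas as pd

# Hypothetical toy column; "Yes" is deliberately capitalized to show the failure mode
toy = pd.DataFrame({"smoker": ["yes", "no", "Yes", "no"]})

smoker_mapping = {"no": 0, "yes": 1}
encoded = toy["smoker"].map(smoker_mapping)

# Values missing from the mapping come back as NaN; list the offending labels
unmapped = toy.loc[encoded.isna(), "smoker"].unique().tolist()
print(unmapped)
```

An empty list means every label was covered by the mapping.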

In [29]:
# transform 'region' feature from nominal to numeric with One-Hot Encoding

region_dummies = pd.get_dummies(insurance['region'], prefix='region')

insurance = pd.concat([insurance, region_dummies], axis=1) # merge with 'insurance' dataset  
# for the optional 'bonus' part, the boolean dummy columns can be converted to numeric features with .astype(int)
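
As an alternative to converting afterwards, `pd.get_dummies` accepts a `dtype` argument, so the indicator columns can be created as integers directly. A minimal sketch on a hypothetical four-row frame (not the real dataset):

```python
import pandas as pd

# Hypothetical toy column with one row per region
toy = pd.DataFrame({"region": ["southwest", "southeast", "northwest", "northeast"]})

# dtype=int yields 0/1 integer indicators instead of booleans
region_dummies = pd.get_dummies(toy["region"], prefix="region", dtype=int)
toy = pd.concat([toy, region_dummies], axis=1)

print(toy)
```

Each row has exactly one indicator set to 1. For a linear model with an intercept, one of the four columns is usually dropped (for example with `drop_first=True`) to avoid perfectly collinear features.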

In [38]:
# standardization (z-score) for 'age' and 'bmi': subtract the mean and divide by the standard deviation, because these features have a wide range of values

insurance['age_stand'] = (insurance['age'] - insurance['age'].mean()) / insurance['age'].std()

insurance['bmi_stand'] = (insurance['bmi'] - insurance['bmi'].mean()) / insurance['bmi'].std()
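
The formula used here is the z-score, $z = (x - \bar{x}) / s$. One detail worth knowing is that pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), whereas many other tools default to the population version (`ddof=0`). The sketch below uses hypothetical toy ages to show both variants:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages, not the real dataset
toy = pd.DataFrame({"age": [19, 18, 28, 33, 32]})

# Same formula as in the cell above: pandas .std() uses ddof=1 by default
age_stand = (toy["age"] - toy["age"].mean()) / toy["age"].std()

# Population variant (ddof=0), which some other libraries use by default
age_stand_pop = (toy["age"] - toy["age"].mean()) / toy["age"].std(ddof=0)

print(age_stand.mean(), age_stand.std())
print(np.allclose(age_stand, age_stand_pop))
```

With 1338 rows the two variants differ only slightly, but the choice matters when comparing results across libraries.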

In [33]:
# min-max normalization for 'charges' (rescales to the [0, 1] range), because this feature has a wide range of values

insurance['charges_norm'] = (insurance['charges'] - insurance['charges'].min()) / (insurance['charges'].max() - insurance['charges'].min())
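
Min-max scaling maps the smallest value to 0 and the largest to 1, and it is invertible as long as the original minimum and maximum are kept. A minimal sketch, using the five `charges` values from the `head()` output above as toy data (the names `lo`, `hi`, and `restored` are introduced here for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: the five 'charges' values shown in the head() output
toy = pd.DataFrame({"charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]})

# Store the bounds so the scaling can be undone later
lo, hi = toy["charges"].min(), toy["charges"].max()
toy["charges_norm"] = (toy["charges"] - lo) / (hi - lo)

# Invert the scaling to recover the original values
restored = toy["charges_norm"] * (hi - lo) + lo

print(toy["charges_norm"].min(), toy["charges_norm"].max())
print(np.allclose(restored, toy["charges"]))
```

Because the bounds come from the extremes, a single very large charge compresses all other values toward 0, which is worth keeping in mind for a right-skewed variable like `charges`.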