# HACKATHON 1 HEALTH INSURANCE - Visualisations

This notebook explores the cleaned insurance dataset with simple visualisations that highlight the relationships between key factors and insurance charges.

## Objectives
- Load the **cleaned insurance dataset**
- Visualise key distributions and relationships that affect **charges**
- Note 2-3 insights per chart to support the analysis
- Engineer features for modelling and visualisation


## Inputs
- Files: 'cleaned_healthcare_insurance.csv'
- Expected columns: 'age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'
- Notes: This file is the cleaned output from 'etl.ipynb' and is the only dataset used in this notebook.
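
A quick check that the expected columns are present can catch a stale or mis-exported file before any chart is drawn. The sketch below is one possible way to do it; `missing_columns` and `EXPECTED_COLUMNS` are hypothetical names introduced here, and the small `sample` frame stands in for the real CSV (it reuses the first two rows of the `df.head()` preview and the lowercase column names shown there).

```python
import pandas as pd

# Assumption: column names are lowercase, as in the df.head() preview.
EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]


def missing_columns(frame, expected):
    """Return the expected column names that are absent from the frame."""
    return [name for name in expected if name not in frame.columns]


# Stand-in for the real file: the first two rows of the preview.
sample = pd.DataFrame(
    {
        "age": [19, 18],
        "sex": ["female", "male"],
        "bmi": [27.9, 33.77],
        "children": [0, 1],
        "smoker": ["yes", "no"],
        "region": ["southwest", "southeast"],
        "charges": [16884.92, 1725.55],
    }
)

# An empty list means every expected column was found.
print(missing_columns(sample, EXPECTED_COLUMNS))
# Dropping a column shows what a failed check looks like.
print(missing_columns(sample.drop(columns=["bmi"]), EXPECTED_COLUMNS))
```

In the notebook itself the same function could be called on `df` right after loading.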



In [11]:
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

In [12]:
# Load the cleaned dataset
df = pd.read_csv('cleaned_healthcare_insurance.csv')
df.head()







Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.77,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.88,0,no,northwest,3866.86


---

# DATA VISUALISATION

In [13]:
import plotly.express as px
import plotly.graph_objects as go



## Chart 1: Age Impact Analysis

Testing **Hypothesis 1**: Insurance charges increase with age

In [14]:

avg_charges_by_age = df.groupby('age', as_index=False)['charges'].mean()
fig = px.line(avg_charges_by_age, x='age', y='charges',
              title='Average Insurance Charges by Age',
              labels={'charges': 'Avg Charges ($)', 'age': 'Age'})
fig.show()
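
A line chart of averages shows the direction of the trend but not its strength. One way to put a number on Hypothesis 1 is a Pearson correlation between age and charges. This is a minimal sketch, not part of the original analysis: the `sample` frame holds only the five preview rows from `df.head()`, far too few to draw conclusions from, and the same call would need to be run on the full `df`.

```python
import pandas as pd

# Assumption: the five preview rows stand in for the full dataset.
sample = pd.DataFrame(
    {
        "age": [19, 18, 28, 33, 32],
        "charges": [16884.92, 1725.55, 4449.46, 21984.47, 3866.86],
    }
)

# Series.corr uses the Pearson coefficient by default (range -1 to 1).
age_charge_corr = sample["age"].corr(sample["charges"])
print(f"Pearson r (age vs charges): {age_charge_corr:.3f}")
```

A value near 1 would support a strong linear increase with age; a value near 0 would suggest that age alone explains little of the variation in charges.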



## Chart 2: Smoking Impact Analysis

Testing **Hypothesis 4**: Smoking significantly increases insurance charges

In [17]:
# Box plot: Charges by Smoking Status
fig = px.box(df, x='smoker', y='charges', 
             title='Insurance Charges by Smoking Status',
             labels={'smoker': 'Smoking Status', 'charges': 'Insurance Charges ($)'})
fig.show()

# Calculate statistics
smoker_stats = df.groupby('smoker')['charges'].agg(['mean', 'median', 'count']).round(2)
print("\nCharges by Smoking Status:")
print(smoker_stats)


Charges by Smoking Status:
            mean    median  count
smoker                           
no       8434.27   7345.40   1064
yes     32050.23  34456.35    274
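
To make the size of the gap concrete, the printed statistics can be turned into ratios. The sketch below copies the figures from the output above into a plain dictionary (the variable names are chosen here for illustration) and computes how many times higher smokers' charges are, along with the share of smokers in the data.

```python
# Figures copied from the "Charges by Smoking Status" output.
smoker_stats = {
    "no": {"mean": 8434.27, "median": 7345.40, "count": 1064},
    "yes": {"mean": 32050.23, "median": 34456.35, "count": 274},
}

# How many times higher are smokers' charges than non-smokers'?
mean_ratio = smoker_stats["yes"]["mean"] / smoker_stats["no"]["mean"]
median_ratio = smoker_stats["yes"]["median"] / smoker_stats["no"]["median"]

# Share of policyholders who smoke.
total = smoker_stats["yes"]["count"] + smoker_stats["no"]["count"]
smoker_share = smoker_stats["yes"]["count"] / total

print(f"Mean ratio:   {mean_ratio:.2f}x")    # 3.80x
print(f"Median ratio: {median_ratio:.2f}x")  # 4.69x
print(f"Smoker share: {smoker_share:.1%}")   # 20.5%
```

The median ratio is larger than the mean ratio, which suggests the gap is not driven by a handful of extreme values.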


## Chart 3: Regional Analysis

Testing **Hypothesis 3**: Regional differences affect insurance charges

In [None]:
# Bar chart: Average charges by region
region_avg = df.groupby('region')['charges'].mean().reset_index()
fig = px.bar(region_avg, x='region', y='charges',
             title='Average Insurance Charges by Region',
             labels={'region': 'Region', 'charges': 'Average Charges ($)'})
fig.show()

print("\nRegional charge differences:")
print(region_avg.round(2))
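
Whether regional differences matter depends on how large the gap between regions is. A possible follow-up, sketched below under the assumption that the five preview rows from `df.head()` stand in for the full data, ranks the regional means and reports the spread between the highest and lowest region. The numbers it prints describe the sample only, not the real dataset.

```python
import pandas as pd

# Assumption: the five preview rows stand in for the full dataset.
sample = pd.DataFrame(
    {
        "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
        "charges": [16884.92, 1725.55, 4449.46, 21984.47, 3866.86],
    }
)

# Mean charge per region, highest first.
region_means = sample.groupby("region")["charges"].mean().sort_values(ascending=False)

# Gap between the most and least expensive region, absolute and relative.
spread = region_means.max() - region_means.min()
relative_spread = spread / region_means.min()

print(region_means.round(2))
print(f"Spread: {spread:.2f} ({relative_spread:.1%} above the lowest region)")
```

Run on the full `df`, a small relative spread would weaken the case for regional pricing adjustments, while a large one would strengthen it.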


Charges by Gender:
            mean  count
sex                    
female  12569.58    662
male    13956.75    676


## 📊 Key Insights and Conclusions

Based on our three focused visualisations:

### Hypothesis Testing Results:

**H1: Age Effect** ✅ **CONFIRMED**
- Average insurance charges show a clear upward trend with age
- The roughly linear rise in the age averages is consistent with age-based pricing models

**H3: Regional Variations** ✅ **CONFIRMED**
- Different regions show distinct average charge levels
- This may reflect regional healthcare cost differences, although the chart alone does not establish the cause

**H4: Smoking Impact** ✅ **STRONGLY CONFIRMED**
- Smokers have dramatically higher charges than non-smokers: a mean of $32,050.23 against $8,434.27, roughly 3.8 times higher
- Strongest predictor of insurance costs among the factors examined in our analysis

### Business Implications:
- **Risk Assessment**: Age and smoking status are key risk factors
- **Pricing Strategy**: Regional adjustments may be warranted
- **Health Programs**: Anti-smoking initiatives could reduce costs