# HACKATHON 1 HEALTH INSURANCE - Visualisations

This notebook explores the cleaned insurance dataset with simple visualisations to highlight relationships between key factors and insurance charges.

## Objectives
- Load the **cleaned insurance dataset**
- Visualise key distributions and relationships that affect **charges**
- Note 1-2 insights per chart to support the analysis
- Engineer features for modelling and visualisation


## Inputs
- Files: 'cleaned_healthcare_insurance.csv'
- Expected columns: 'age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges' (the file also carries an 'Unnamed: 0' row-index column, visible in the `df.head()` output)
- Notes: This file is the cleaned output from 'etl.ipynb' and is the only dataset used in this notebook.
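
A quick check that the expected columns are present can catch a mismatched file before any chart is drawn. The sketch below is only illustrative: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the small `sample` frame stands in for the real CSV using the first two rows shown by `df.head()`.

```python
import pandas as pd

# Columns the rest of the notebook relies on (lower-case, as in df.head()).
EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]


def missing_columns(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from `frame`."""
    return [col for col in expected if col not in frame.columns]


# Stand-in for the real file: the first two rows shown by df.head().
sample = pd.DataFrame(
    {
        "age": [19, 18],
        "sex": ["female", "male"],
        "bmi": [27.9, 33.77],
        "children": [0, 1],
        "smoker": ["yes", "no"],
        "region": ["southwest", "southeast"],
        "charges": [16884.92, 1725.55],
    }
)

missing = missing_columns(sample)
print("Missing columns:", missing)
```

In the notebook itself the same helper could be called on `df` right after `pd.read_csv`.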



In [2]:
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Load the cleaned dataset 
df= pd.read_csv('cleaned_healthcare_insurance.csv')
df.head()







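| Unnamed: 0 | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.92 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.55 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.46 |
| 3 | 33 | male | 22.7 | 0 | no | northwest | 21984.47 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.86 |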


---

# DATA VISUALISATION

In [4]:
import plotly.express as px
import plotly.graph_objects as go



## Descriptive Visualisation

In [5]:

avg_charges_by_age = df.groupby('age', as_index=False)['charges'].mean()
fig = px.line(avg_charges_by_age, x='age', y='charges',
              title='Average Insurance Charges by Age',
              labels={'charges': 'Avg Charges ($)', 'age': 'Age'})
fig.show()



## Correlation Analysis

In [6]:
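# Correlation of the numeric columns, including charges
# (pandas and plotly.express are already imported in earlier cells)

# Calculate the correlation matrix. 'Unnamed: 0' is a leftover row index
# from the CSV export, so it is dropped before correlating.
corr_matrix = (
    df.drop(columns='Unnamed: 0', errors='ignore')
      .corr(numeric_only=True)
      .round(2)
)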


# Full heatmap
fig = px.imshow(
    corr_matrix,
    text_auto=True,
    color_continuous_scale='RdBu_r',
    zmin=-1, zmax=1,
    title='Correlation Matrix (Including Insurance Charges)')



fig.show()
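
Because `numeric_only=True` skips text columns, `smoker` does not appear in the heatmap. One possible way to include it, in line with the feature-engineering objective, is to encode it as a 0/1 flag first. This is a sketch under assumptions: the `smoker_flag` column name is hypothetical, and `sample` reuses the five rows shown by `df.head()` rather than the full file.

```python
import pandas as pd

# Stand-in data: the five rows shown by df.head() earlier in the notebook.
sample = pd.DataFrame(
    {
        "age": [19, 18, 28, 33, 32],
        "bmi": [27.9, 33.77, 33.0, 22.7, 28.88],
        "children": [0, 1, 3, 0, 0],
        "smoker": ["yes", "no", "no", "no", "no"],
        "charges": [16884.92, 1725.55, 4449.46, 21984.47, 3866.86],
    }
)

# Hypothetical engineered feature: 1 for smokers, 0 for non-smokers.
sample["smoker_flag"] = (sample["smoker"] == "yes").astype(int)

# numeric_only=True keeps smoker_flag but still skips the text column.
corr_with_flag = sample.corr(numeric_only=True).round(2)
print(corr_with_flag["charges"])
```

With only five rows the coefficients themselves carry no weight; the point is the encoding step, which could be applied to `df` before building `corr_matrix`.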

## Grouped Predictive Analysis Chart

### Distribution Analysis

Let's examine the distribution of key variables to understand the data better.

In [7]:
# Distribution of Insurance Charges
fig = px.histogram(df, x='charges', nbins=30,
                   title='Distribution of Insurance Charges',
                   labels={'charges': 'Insurance Charges ($)'})
# The y-axis of a count histogram is titled through the layout, not `labels`
fig.update_layout(showlegend=False, yaxis_title='Frequency')
fig.show()

print(f"Mean charges: ${df['charges'].mean():.2f}")
print(f"Median charges: ${df['charges'].median():.2f}")
print(f"Standard deviation: ${df['charges'].std():.2f}")

Mean charges: $13270.42
Median charges: $9382.03
Standard deviation: $12110.01
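
The mean sits well above the median, which hints at a right-skewed distribution. As a rough check that uses only the three printed statistics, Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, can be computed directly. This is a back-of-the-envelope sketch, not a substitute for a skewness estimate on the full data.

```python
# Summary statistics copied from the printed output above.
mean_charges = 13270.42
median_charges = 9382.03
std_charges = 12110.01

# Pearson's second skewness coefficient: 3 * (mean - median) / std.
# Positive values point to a longer right tail.
pearson_skew = 3 * (mean_charges - median_charges) / std_charges
print(f"Pearson's second skewness coefficient: {pearson_skew:.2f}")
```

The result is about 0.96, consistent with a long right tail of high charges.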


### Smoking Impact Analysis

Smoking is often a major factor in insurance charges. Let's analyze this relationship.

In [8]:
# Box plot: Charges by Smoking Status
fig = px.box(df, x='smoker', y='charges', 
             title='Insurance Charges by Smoking Status',
             labels={'smoker': 'Smoking Status', 'charges': 'Insurance Charges ($)'})
fig.show()

# Calculate statistics
smoker_stats = df.groupby('smoker')['charges'].agg(['mean', 'median', 'count']).round(2)
print("\nCharges by Smoking Status:")
print(smoker_stats)


Charges by Smoking Status:
            mean    median  count
smoker                           
no       8434.27   7345.40   1064
yes     32050.23  34456.35    274
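
To put the gap in the table above into numbers, the ratio and difference of the group means can be worked out from the printed values. The variable names below are illustrative only.

```python
# Group means and counts copied from the printed table above.
smoker_mean, smoker_count = 32050.23, 274
non_smoker_mean, non_smoker_count = 8434.27, 1064

# How many times higher the smoker mean is, and by how many dollars.
ratio = smoker_mean / non_smoker_mean
gap = smoker_mean - non_smoker_mean

# Share of policyholders who smoke.
smoker_share = smoker_count / (smoker_count + non_smoker_count)

print(f"Smoker mean is {ratio:.2f}x the non-smoker mean (gap ${gap:,.2f})")
print(f"Smokers make up {smoker_share:.1%} of the dataset")
```

Smokers' mean charges come out at roughly 3.8 times those of non-smokers, while smokers are about a fifth of the records.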


### BMI vs Charges Analysis

Examine the relationship between BMI and insurance charges, colored by smoking status.

In [9]:
# Scatter plot: BMI vs Charges colored by Smoking Status
fig = px.scatter(df, x='bmi', y='charges', color='smoker',
                 title='BMI vs Insurance Charges (by Smoking Status)',
                 labels={'bmi': 'Body Mass Index (BMI)', 'charges': 'Insurance Charges ($)'},
                 hover_data=['age', 'sex'])
fig.show()

# Correlation analysis
bmi_charges_corr = df['bmi'].corr(df['charges'])
print(f"Correlation between BMI and Charges: {bmi_charges_corr:.3f}")

Correlation between BMI and Charges: 0.198
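
The overall coefficient pools smokers and non-smokers together, so it can hide different patterns within each group. A possible follow-up is to compute the correlation per group. In the sketch below, `corr_by_group` is a hypothetical helper and `toy` is made-up data chosen only to show the mechanics; none of its numbers come from the insurance file.

```python
import pandas as pd


def corr_by_group(frame, group_col, x_col, y_col):
    """Pearson correlation between x_col and y_col within each group."""
    return {
        name: group[x_col].corr(group[y_col])
        for name, group in frame.groupby(group_col)
    }


# Made-up toy data, NOT taken from the insurance dataset.
toy = pd.DataFrame(
    {
        "smoker": ["yes", "yes", "yes", "no", "no", "no"],
        "bmi": [22.0, 28.0, 34.0, 22.0, 28.0, 34.0],
        "charges": [20000.0, 30000.0, 40000.0, 5000.0, 4000.0, 6000.0],
    }
)

per_group = corr_by_group(toy, "smoker", "bmi", "charges")
for name, value in per_group.items():
    print(f"smoker={name}: r = {value:.2f}")
```

Calling `corr_by_group(df, 'smoker', 'bmi', 'charges')` in the notebook would give the per-group coefficients for the real data.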


### Regional and Demographic Analysis

Compare charges across different regions and demographic factors.

In [10]:
# Bar chart: Average charges by region
region_avg = df.groupby('region')['charges'].mean().reset_index()
fig = px.bar(region_avg, x='region', y='charges',
             title='Average Insurance Charges by Region',
             labels={'region': 'Region', 'charges': 'Average Charges ($)'})
fig.show()

# Gender comparison
gender_comparison = df.groupby('sex')['charges'].agg(['mean', 'count']).round(2)
print("\nCharges by Gender:")
print(gender_comparison)


Charges by Gender:
            mean  count
sex                    
female  12569.58    662
male    13956.75    676


### Key Insights Summary

Based on the visualizations above, here are the main findings from our analysis:

## 📊 Data Insights & Conclusions

### Key Findings:

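1. **🚬 Smoking Impact**: Smokers' mean charges ($32,050.23) are about 3.8 times those of non-smokers ($8,434.27)
   - This is the largest group difference among the statistics printed above

2. **📈 Age Factor**: Average insurance charges generally increase with age
   - The upward trend is consistent with age-based pricing, although the line chart alone does not establish that the relationship is linear

3. **⚖️ BMI Relationship**: Higher BMI is only weakly correlated with charges overall (correlation of 0.198)
   - The relationship is more noticeable among smokers in the scatter plot

4. **🌍 Regional Variations**: Different regions show varying average charges
   - This may reflect cost of living or healthcare costs by region, which this notebook does not test

5. **👥 Demographics**: Gender appears to have a small impact on charges: the male mean ($13,956.75) is about 11% above the female mean ($12,569.58)
   - Number of children may influence costs, but it is not charted in this notebook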

### Recommendations for Further Analysis:
- **Multi-factor analysis**: Examine combined effects (age + smoking, BMI + age, etc.)
- **Predictive modeling**: Build models to predict charges based on these factors
- **Cost optimization**: Identify factors that could reduce insurance costs
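
A minimal sketch of the multi-factor idea is to group by two factors at once. The `bmi_band` column below is a hypothetical engineered feature using the common BMI of 30 cut-off for obesity, and `sample` reuses the five rows shown by `df.head()`, so the numbers are illustrative rather than findings.

```python
import pandas as pd

# Stand-in data: the five rows shown by df.head() earlier in the notebook.
sample = pd.DataFrame(
    {
        "bmi": [27.9, 33.77, 33.0, 22.7, 28.88],
        "smoker": ["yes", "no", "no", "no", "no"],
        "charges": [16884.92, 1725.55, 4449.46, 21984.47, 3866.86],
    }
)

# Hypothetical engineered feature: BMI of 30 or more labelled "obese".
sample["bmi_band"] = sample["bmi"].ge(30).map({True: "obese", False: "not obese"})

# Mean charges and group size for every smoker / BMI-band combination.
combined = (
    sample.groupby(["smoker", "bmi_band"])["charges"]
    .agg(["mean", "count"])
    .round(2)
)
print(combined)
```

Running the same `groupby` on `df` would show whether the combination of smoking and high BMI stands out in the full dataset.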