# Risk Profiling & Segmentation 
- Identifying high-risk customers. 
- What defines a high-cost customer? 
- Which features are common among high-cost customers? 
- Can we statistically identify risk groups?

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# A forward slash avoids the invalid "\i" escape sequence and works on all platforms
df = pd.read_csv("data/insurance.csv")
df.head()


,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


1) What defines a high-cost customer?

In [3]:
# High-cost customers: charges at or above the 90th percentile (top 10% of the dataset).
threshold = df['charges'].quantile(0.9)
high_cost = df[df['charges'] >= threshold]
high_cost

,age,sex,bmi,children,smoker,region,charges
14,27,male,42.130,0,yes,southeast,39611.75770
19,30,male,35.300,0,yes,southwest,36837.46700
23,34,female,31.920,1,yes,northeast,37701.87680
29,31,male,36.300,2,yes,southwest,38711.00000
30,22,male,35.600,0,yes,southwest,35585.57600
...,...,...,...,...,...,...,...
1300,45,male,30.360,0,yes,southeast,62592.87309
1301,62,male,30.875,3,yes,northwest,46718.16325
1303,43,male,27.800,0,yes,southwest,37829.72420
1313,19,female,34.700,2,yes,southwest,36397.57600
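
The top-decile rule can be wrapped in a small helper so the cutoff is easy to vary. The following is a minimal sketch on made-up numbers; the function name `flag_high_cost` and the toy `charges` values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def flag_high_cost(charges: pd.Series, q: float = 0.9) -> pd.Series:
    """Return a boolean mask: True where charges are at or above the q-th quantile."""
    threshold = charges.quantile(q)  # pandas interpolates linearly by default
    return charges >= threshold

# Toy data (hypothetical), not taken from insurance.csv
toy_charges = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000])
mask = flag_high_cost(toy_charges, q=0.9)
n_flagged = int(mask.sum())  # number of rows treated as high-cost
```

Because the comparison uses `>=`, values tied with the threshold are flagged as well, so the flagged share can slightly exceed 10% when charges repeat.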


---

2) Which features are common among high-cost customers?

In [None]:
# Smoking

smokers = high_cost['smoker'].value_counts(normalize=True)*100
smokers

smoker
yes    97.761194
no      2.238806
Name: proportion, dtype: float64

In [11]:
# Gender

genders = high_cost['sex'].value_counts(normalize=True)*100
genders

sex
male      62.686567
female    37.313433
Name: proportion, dtype: float64

In [12]:
# BMI
high_cost['bmi'].describe()

count    134.000000
mean      35.652500
std        4.483315
min       22.895000
25%       32.775000
50%       35.300000
75%       37.775000
max       52.580000
Name: bmi, dtype: float64

In [13]:
# AGE
high_cost['age'].describe()

count    134.000000
mean      41.783582
std       13.528784
min       18.000000
25%       31.000000
50%       43.000000
75%       53.000000
max       64.000000
Name: age, dtype: float64
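
Shares within the high-cost group are easier to interpret next to the same shares among everyone else. A possible sketch of that comparison, using a small invented table (the column `is_high_cost` and all values are assumptions for illustration):

```python
import pandas as pd

# Toy data (hypothetical), not taken from insurance.csv
toy = pd.DataFrame({
    "charges": [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 30000, 40000],
    "smoker":  ["no", "no", "no", "yes", "no", "no", "no", "no", "yes", "yes"],
})

# The top 20% is flagged here only because the toy table is tiny
toy["is_high_cost"] = toy["charges"] >= toy["charges"].quantile(0.8)

# Row-normalized percentages: smoker share inside each group (False = rest, True = high-cost)
shares = pd.crosstab(toy["is_high_cost"], toy["smoker"], normalize="index") * 100
```

Reading the `True` row against the `False` row shows whether a feature is over-represented among high-cost customers or simply common in the whole dataset.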

---

3) Can we statistically identify risk groups?

In [14]:
low_thr = df['charges'].quantile(0.33)
high_thr = df['charges'].quantile(0.66)

def risk_group(x):
    if x < low_thr:
        return "Low Risk"
    elif x < high_thr:
        return "Medium Risk"
    else:
        return "High Risk"

df['risk_group'] = df['charges'].apply(risk_group)
df['risk_group'].value_counts()


risk_group
High Risk      455
Low Risk       442
Medium Risk    441
Name: count, dtype: int64
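
The same three-way split can be written without a Python-level loop. This sketch uses `pd.cut` with `right=False`, which should reproduce the `x < low_thr` and `x < high_thr` comparisons of `risk_group`; the toy series is hypothetical and stands in for `df['charges']`.

```python
import pandas as pd

# Toy data (hypothetical), not taken from insurance.csv
toy_charges = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000])
low_thr = toy_charges.quantile(0.33)
high_thr = toy_charges.quantile(0.66)

def risk_group(x):
    # Same logic as the notebook cell
    if x < low_thr:
        return "Low Risk"
    elif x < high_thr:
        return "Medium Risk"
    return "High Risk"

looped = toy_charges.apply(risk_group)

# Left-closed bins: [-inf, low_thr), [low_thr, high_thr), [high_thr, inf)
binned = pd.cut(
    toy_charges,
    bins=[float("-inf"), low_thr, high_thr, float("inf")],
    labels=["Low Risk", "Medium Risk", "High Risk"],
    right=False,
)
```

`pd.cut` returns an ordered categorical, so the groups sort as Low, Medium, High rather than alphabetically in later `groupby` or `value_counts` output.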

In [15]:
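# Only the last expression in a cell is displayed, so the crosstab result is
# discarded here; wrap it in display(...) to show the smoker share per group.
pd.crosstab(df['risk_group'], df['smoker'], normalize='index')*100
df.groupby('risk_group')[['bmi','age']].mean()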

risk_group,bmi,age
High Risk,31.069956,43.487912
Low Risk,30.067941,26.357466
Medium Risk,30.840737,47.668934
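
To put a number on how strongly a categorical feature such as `smoker` separates the risk groups, one option (not used in the notebook) is Cramér's V, derived from the chi-square statistic of the contingency table. The sketch below uses only NumPy and pandas; `cramers_v` and the toy labels are assumptions for illustration, and the function assumes every row and column of the table contains at least one observation.

```python
import numpy as np
import pandas as pd

def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V for a contingency table of counts (0 = no association, 1 = perfect)."""
    obs = table.to_numpy(dtype=float)
    n = obs.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    chi2 = ((obs - expected) ** 2 / expected).sum()
    k = min(obs.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Toy data (hypothetical): only the high-risk rows are smokers
toy = pd.DataFrame({
    "risk_group": ["Low Risk"] * 4 + ["Medium Risk"] * 4 + ["High Risk"] * 4,
    "smoker": ["no"] * 8 + ["yes"] * 4,
})
counts = pd.crosstab(toy["risk_group"], toy["smoker"])  # raw counts, not percentages
v = cramers_v(counts)
```

Note that the risk groups are built from `charges`, so a high value describes how well a feature lines up with cost tertiles; it is not an independent validation of the groups.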


# REPORT

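Customers in the top 10% of charges (134 of 1,338) are overwhelmingly smokers (97.8%) and, on average, obese (mean BMI 35.7, where a BMI of 30 or more is classed as obese). They skew male (62.7%), and their ages span the full range from 18 to 64 with a median of 43, so age is a weaker marker than smoking or BMI within this group. These figures are descriptive shares rather than effect estimates: they point to smoking as the dominant factor but do not formally rank the drivers.

Segmenting on the 33rd and 66th percentiles of charges yields three groups of similar size (442 low, 441 medium, 455 high). Low-risk customers are clearly younger (mean age 26.4), but the medium-risk group is on average older (47.7) than the high-risk group (43.5), and mean BMI barely differs across the three groups (30.1 to 31.1). Elevated BMI therefore characterizes the top decile rather than the upper third as a whole. Whether the high-risk tertile is also dominated by smokers depends on the crosstab of risk group against smoker status, whose result is not displayed. With those caveats, the segmentation can help insurers design pricing, wellness programs, and preventive strategies.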