In [23]:
# Let's start by reading in the CSV file and printing the first few rows
import pandas as pd

df = pd.read_csv('insurance_claims.csv')

df.head(5)

Unnamed: 0,months_as_customer,age,policy_number,policy_bind_date,policy_state,policy_csl,policy_deductable,policy_annual_premium,umbrella_limit,insured_zip,...,police_report_available,total_claim_amount,injury_claim,property_claim,vehicle_claim,auto_make,auto_model,auto_year,fraud_reported,_c39
0,328,48,521585,2014-10-17,OH,250/500,1000,1406.91,0,466132,...,YES,71610,6510,13020,52080,Saab,92x,2004,Y,
1,228,42,342868,2006-06-27,IN,250/500,2000,1197.22,5000000,468176,...,?,5070,780,780,3510,Mercedes,E400,2007,Y,
2,134,29,687698,2000-09-06,OH,100/300,2000,1413.14,5000000,430632,...,NO,34650,7700,3850,23100,Dodge,RAM,2007,N,
3,256,41,227811,1990-05-25,IL,250/500,2000,1415.74,6000000,608117,...,NO,63400,6340,6340,50720,Chevrolet,Tahoe,2014,Y,
4,228,44,367455,2014-06-06,IL,500/1000,1000,1583.91,6000000,610706,...,NO,6500,1300,650,4550,Accura,RSX,2009,N,
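
The preview shows a `?` in `police_report_available`, which looks like a placeholder for a missing value, and `policy_bind_date` is read as plain text by default. A minimal sketch of handling both at load time, assuming `?` really does mean "unknown" in this file. The two-row `sample_csv` is illustrative only and stands in for the real file:

```python
import io
import pandas as pd

# Two illustrative rows shaped like the preview above (not the full file).
sample_csv = io.StringIO(
    "policy_number,policy_bind_date,police_report_available,fraud_reported\n"
    "521585,2014-10-17,YES,Y\n"
    "342868,2006-06-27,?,Y\n"
)

# na_values="?" converts the placeholder to NaN (assumption: "?" means unknown);
# parse_dates turns policy_bind_date into a datetime column.
preview = pd.read_csv(sample_csv, na_values="?", parse_dates=["policy_bind_date"])

missing_reports = int(preview["police_report_available"].isna().sum())
print(preview.dtypes)
print("missing police reports:", missing_reports)
```

The same two arguments can be passed to the `pd.read_csv('insurance_claims.csv')` call if the assumption holds for the full dataset.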



## 🧩 Policy & Timeline
| Preferred Name | Also Seen As | What it Means |
|---|---|---|
| `policy_id` | `policy_number` | Unique policy identifier. |
| `policy_bind_date` | `policy_start` | Date the policy began. |
| `days_since_policy_inception` | `policy_tenure` | Days since the policy started at the time of the claim. |

---

## 👤 Insured (Customer)
| Preferred Name | Also Seen As | What it Means |
|---|---|---|
| `insured_age` | `age` | Policyholder’s age (years). |
| `insured_sex` | `sex` | Policyholder gender. |
| `insured_education_level` | — | Highest education attained. |
| `insured_occupation` | — | Job/occupation of the insured. |

---

## 🗺️ Location
| Preferred Name | Also Seen As | What it Means |
|---|---|---|
| `policy_state` | — | U.S. state where the policy is written. |
| `incident_state` | — | U.S. state where the incident occurred. |

---

## 🚨 Incident Details
| Preferred Name | Also Seen As | What it Means |
|---|---|---|
| `incident_date` | `date_of_incident` | When the loss occurred. |
| `incident_type` | — | Type of incident (collision, theft, fire, etc.). |
| `collision_type` | — | Specific collision type. |
| `incident_severity` | — | Severity descriptor for the incident. |
| `number_of_vehicles_involved` | — | Count of vehicles in the incident. |
| `witnesses` | — | Number of witnesses reported. |
| `authorities_contacted` | — | Whether authorities were involved. |
| `police_report_available` | — | Whether a police report is available. |

---

## 💰 Claims & Damages
| Preferred Name | Also Seen As | What it Means |
|---|---|---|
| `property_damage` | — | Whether property was damaged. |
| `property_claim` | — | Property claim amount. |
| `bodily_injuries` | — | Number of injuries. |
| `injury_claim` | — | Injury claim amount. |
| `total_claim_amount` | `total_claims` | Total amount claimed for the incident. |

---

## 🚗 Vehicle
| Preferred Name | Also Seen As | What it Means |
|---|---|---|
| `vehicle_make` | `auto_make` | Vehicle manufacturer (e.g., Toyota). |
| `vehicle_model` | `auto_model` | Vehicle model (e.g., Camry). |

---

## 🎯 Target Label
| Preferred Name | Also Seen As | What it Means |
|---|---|---|
| `fraud_reported` | `is_fraud` | Whether the claim was flagged as fraud. |

---
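
The preferred names in the tables above can be applied with a single rename mapping. The sketch below covers only pairs whose raw name appears in the `df.head()` output. `PREFERRED_NAMES` and `apply_preferred_names` are hypothetical helper names, and the one-row `sample` frame is illustrative:

```python
import pandas as pd

# Raw column name -> preferred name, limited to pairs listed in the tables above.
PREFERRED_NAMES = {
    "policy_number": "policy_id",
    "age": "insured_age",
    "auto_make": "vehicle_make",
    "auto_model": "vehicle_model",
}

def apply_preferred_names(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the frame with raw column names replaced by preferred ones."""
    # rename() skips mapping keys that are absent, so partial frames are fine.
    return frame.rename(columns=PREFERRED_NAMES)

sample = pd.DataFrame(
    {
        "policy_number": [521585],
        "age": [48],
        "auto_make": ["Saab"],
        "auto_model": ["92x"],
        "fraud_reported": ["Y"],
    }
)
renamed = apply_preferred_names(sample)
print(list(renamed.columns))
```

Because `rename` returns a new frame, the original column names stay available on `sample` if later cells still refer to them.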




In [24]:
# Let's get summary statistics for the numeric columns
df.describe()

Unnamed: 0,months_as_customer,age,policy_number,policy_deductable,policy_annual_premium,umbrella_limit,insured_zip,capital-gains,capital-loss,incident_hour_of_the_day,number_of_vehicles_involved,bodily_injuries,witnesses,total_claim_amount,injury_claim,property_claim,vehicle_claim,auto_year,_c39
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,0.0
mean,203.954,38.948,546238.648,1136.0,1256.40615,1101000.0,501214.488,25126.1,-26793.7,11.644,1.839,0.992,1.487,52761.94,7433.42,7399.57,37928.95,2005.103,
std,115.113174,9.140287,257063.005276,611.864673,244.167395,2297407.0,71701.610941,27872.187708,28104.096686,6.951373,1.01888,0.820127,1.111335,26401.53319,4880.951853,4824.726179,18886.252893,6.015861,
min,0.0,19.0,100804.0,500.0,433.33,-1000000.0,430104.0,0.0,-111100.0,0.0,1.0,0.0,0.0,100.0,0.0,0.0,70.0,1995.0,
25%,115.75,32.0,335980.25,500.0,1089.6075,0.0,448404.5,0.0,-51500.0,6.0,1.0,0.0,1.0,41812.5,4295.0,4445.0,30292.5,2000.0,
50%,199.5,38.0,533135.0,1000.0,1257.2,0.0,466445.5,0.0,-23250.0,12.0,1.0,1.0,1.0,58055.0,6775.0,6750.0,42100.0,2005.0,
75%,276.25,44.0,759099.75,2000.0,1415.695,0.0,603251.0,51025.0,0.0,17.0,3.0,2.0,2.0,70592.5,11305.0,10885.0,50822.5,2010.0,
max,479.0,64.0,999435.0,2000.0,2047.59,10000000.0,620962.0,100500.0,0.0,23.0,4.0,2.0,3.0,114920.0,21450.0,23670.0,79560.0,2015.0,
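
`describe()` summarizes every numeric column, including identifier-like ones such as `policy_number` and `insured_zip`, whose mean and standard deviation carry no meaning. One possible way to keep them out of the summary is to cast them to strings first. This is a sketch on a three-row toy frame, and `ID_COLUMNS` is a hypothetical name:

```python
import pandas as pd

# Toy frame with values taken from the preview rows above.
toy = pd.DataFrame(
    {
        "policy_number": [521585, 342868, 687698],
        "insured_zip": [466132, 468176, 430632],
        "age": [48, 42, 29],
    }
)

# Assumption: these columns are identifiers, not quantities.
ID_COLUMNS = ["policy_number", "insured_zip"]

# Casting to str makes them object columns, which describe() skips by default.
typed = toy.astype({col: str for col in ID_COLUMNS})
numeric_summary = typed.describe()
print(numeric_summary)
```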


# 📊 Insurance Claims — Summary Statistics (Numeric Features)

Here are the descriptive statistics from the dataset (n = 1000 claims):

---

## 📅 Customer & Policy
| Variable | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `months_as_customer` | 1000 | 203.95 | 115.11 | 0 | 115.75 | 199.5 | 276.25 | 479 |
| `age` | 1000 | 38.95 | 9.14 | 19 | 32 | 38 | 44 | 64 |
| `policy_number` | 1000 | 546,238.65 | 257,063.01 | 100,804 | 335,980 | 533,135 | 759,100 | 999,435 |
| `policy_deductable` | 1000 | 1,136.00 | 611.86 | 500 | 500 | 1,000 | 2,000 | 2,000 |
| `policy_annual_premium` | 1000 | 1,256.41 | 244.17 | 433.33 | 1,089.61 | 1,257.20 | 1,415.70 | 2,047.59 |
| `umbrella_limit` | 1000 | 1.10e6 | 2.30e6 | -1.00e6 | 0 | 0 | 0 | 1.00e7 |

---

## 🏦 Financials
| Variable | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `insured_zip` | 1000 | 501,214.49 | 71,701.61 | 430,104 | 448,405 | 466,446 | 603,251 | 620,962 |
| `capital-gains` | 1000 | 25,126.10 | 27,872.19 | 0 | 0 | 0 | 51,025 | 100,500 |
| `capital-loss` | 1000 | -26,793.70 | 28,104.10 | -111,100 | -51,500 | -23,250 | 0 | 0 |

---

## 🚨 Incident Details
| Variable | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `incident_hour_of_the_day` | 1000 | 11.64 | 6.95 | 0 | 6 | 12 | 17 | 23 |
| `number_of_vehicles_involved` | 1000 | 1.84 | 1.02 | 1 | 1 | 1 | 3 | 4 |
| `bodily_injuries` | 1000 | 0.99 | 0.82 | 0 | 0 | 1 | 2 | 2 |
| `witnesses` | 1000 | 1.49 | 1.11 | 0 | 1 | 1 | 2 | 3 |

---

## 💰 Claims
| Variable | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `total_claim_amount` | 1000 | 52,761.94 | 26,401.53 | 100 | 41,813 | 58,055 | 70,593 | 114,920 |
| `injury_claim` | 1000 | 7,433.42 | 4,880.95 | 0 | 4,295 | 6,775 | 11,305 | 21,450 |
| `property_claim` | 1000 | 7,399.57 | 4,824.73 | 0 | 4,445 | 6,750 | 10,885 | 23,670 |
| `vehicle_claim` | 1000 | 37,928.95 | 18,886.25 | 70 | 30,293 | 42,100 | 50,823 | 79,560 |

---

## 🚗 Vehicle
| Variable | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `auto_year` | 1000 | 2005.10 | 6.02 | 1995 | 2000 | 2005 | 2010 | 2015 |

---

### ⚠️ Notes
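- `umbrella_limit` has a negative minimum (-1,000,000), which is not a plausible coverage limit and is likely a data-entry error. The 25th through 75th percentiles are all 0, so the nonzero limits (up to 10,000,000) sit in the top quarter of policies.
- `capital-gains` and `capital-loss` are dominated by zeros: the median of `capital-gains` is 0 (at least half the rows), and the 75th percentile of `capital-loss` is 0 (at least a quarter of the rows). `capital-loss` is stored as a non-positive number.
- `injury_claim` and `property_claim` are right-skewed: their maximums are more than three times their medians, and their means sit above their medians.
- `vehicle_claim` and `total_claim_amount` have means below their medians and very small minimums (70 and 100), which points to a group of low-value claims pulling the mean down rather than a heavy right tail.
- `policy_number` and `insured_zip` are identifiers, so their summary statistics are not meaningful.
- `_c39` column contains no values (count = 0) → candidate for drop.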
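
The observations in these notes can be checked programmatically instead of by reading the table. A rough sketch follows, in which `summarize_quality` is a hypothetical helper and the five-row `toy` frame is made up to mimic the patterns described. The sign of mean minus median is only a crude skew indicator:

```python
import numpy as np
import pandas as pd

def summarize_quality(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column share of missing values, share of zeros, negatives, and a crude skew hint."""
    numeric = frame.select_dtypes(include="number")
    return pd.DataFrame(
        {
            "share_missing": numeric.isna().mean(),
            "share_zero": (numeric == 0).mean(),
            "n_negative": (numeric < 0).sum(),
            # Positive values hint at a right tail, negative values at a left tail.
            "mean_minus_median": numeric.mean() - numeric.median(),
        }
    )

# Made-up values that imitate the patterns in the notes (not real rows).
toy = pd.DataFrame(
    {
        "umbrella_limit": [0, 0, 0, 5_000_000, -1_000_000],
        "capital-gains": [0, 0, 0, 51_000, 100_500],
        "_c39": [np.nan] * 5,
    }
)

report = summarize_quality(toy)
empty_columns = report.index[report["share_missing"] == 1.0].tolist()
print(report)
print("completely empty:", empty_columns)
```

Calling `summarize_quality(df)` on the full frame would give one row per numeric column, which makes it easy to spot empty columns and implausible negatives in one pass.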


In [25]:
# Start by dropping any empty columns
import pandas as pd
import numpy as np

# Drop the _c39 column, which is completely empty
df = df.drop(columns=['_c39'])

# umbrella_limit: inspect the distribution before deciding whether to cap or flag
print(df['umbrella_limit'].describe())

count    1.000000e+03
mean     1.101000e+06
std      2.297407e+06
min     -1.000000e+06
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      1.000000e+07
Name: umbrella_limit, dtype: float64
