# Exploratory Data Analysis (EDA)

Welcome to the EDA notebook for the Insurance4Africa project. This notebook explores the cleaned data, focusing on customer demographics, policy attributes, and reported fraud.

## 1. Setup and Data Loading

First, we import necessary libraries and load the cleaned data into a DataFrame.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the cleaned data (replace the placeholder path below with the real file location)
file_path = 'path_to_your_cleaned_data.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataframe
df.head()
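
Every later cell indexes the DataFrame by column name, so a missing or renamed column only surfaces as a `KeyError` deep inside a plotting call. A minimal sketch of an early check is shown below. The helper name `load_checked`, the `REQUIRED_COLUMNS` tuple, and the in-memory sample rows are assumptions made for illustration; the column names themselves are the ones this notebook uses.

```python
import io

import pandas as pd

# Columns this notebook reads later on; adjust to match your file.
REQUIRED_COLUMNS = (
    "age",
    "insured_sex",
    "insured_education_level",
    "insured_occupation",
    "policy_annual_premium",
    "policy_deductable",
    "umbrella_limit",
    "fraud_reported",
)


def load_checked(source, required=REQUIRED_COLUMNS):
    """Read a CSV and raise early if any expected column is missing."""
    frame = pd.read_csv(source)
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return frame


# Tiny in-memory stand-in for the real file (made-up values).
sample_csv = io.StringIO(
    "age,insured_sex,insured_education_level,insured_occupation,"
    "policy_annual_premium,policy_deductable,umbrella_limit,fraud_reported\n"
    "34,MALE,MD,sales,1200.5,1000,0,N\n"
    "45,FEMALE,PhD,tech-support,980.0,500,5000000,Y\n"
)
sample_df = load_checked(sample_csv)
print(sample_df.shape)
```

With the real data, `load_checked(file_path)` could stand in for the plain `pd.read_csv(file_path)` call.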

## 2. Data Overview

Understand the structure of the data by looking at its shape, columns, and basic statistics.

In [2]:
# Data structure overview
print(f'Data Shape: {df.shape}')
print('\nColumns:', df.columns.tolist())

# Basic statistics
df.describe(include='all').T
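
`describe` does not report missing values directly, so even for cleaned data it can be worth confirming per column how many values are absent. The sketch below builds a small overview table; `summarize_columns` is a hypothetical helper and the `demo` rows are made up.

```python
import numpy as np
import pandas as pd


def summarize_columns(frame):
    """Return one row per column: dtype, missing count and share, distinct values."""
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "n_missing": frame.isna().sum(),
            "pct_missing": (frame.isna().mean() * 100).round(1),
            "n_unique": frame.nunique(),  # NaN is not counted as a value
        }
    )


# Made-up rows standing in for the cleaned data.
demo = pd.DataFrame(
    {
        "age": [34, 45, np.nan, 29],
        "insured_sex": ["MALE", "FEMALE", "FEMALE", None],
        "fraud_reported": ["N", "Y", "N", "N"],
    }
)
overview = summarize_columns(demo)
print(overview)
```

Calling `summarize_columns(df)` on the loaded data would give the same table for every column.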

## 3. Customer Demographics

Analyze demographic information such as age, sex, and education level of the insured.

In [3]:
# Age distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['age'], bins=30, kde=True, color='skyblue')
plt.title('Age Distribution of Customers')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Gender distribution
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='insured_sex', palette='viridis')
plt.title('Gender Distribution of Customers')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()

# Education level distribution
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='insured_education_level', palette='coolwarm')
plt.title('Education Level of Customers')
plt.xlabel('Education Level')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
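
A histogram shows the shape of the age distribution, but a statement such as "most customers are between 30 and 45" is easier to check with explicit bands. The following is a rough sketch using `pd.cut`; the band edges, labels, and sample ages are assumptions chosen for illustration, not values taken from the project data.

```python
import pandas as pd


def age_band_shares(
    ages,
    edges=(18, 30, 45, 60, 100),
    labels=("18-29", "30-44", "45-59", "60-99"),
):
    """Bin ages into left-closed bands and return each band's share of customers.

    Ages outside [edges[0], edges[-1]) fall into no band and are dropped.
    """
    bands = pd.cut(ages, bins=list(edges), labels=list(labels), right=False)
    return bands.value_counts(normalize=True).sort_index()


# Made-up ages for demonstration.
ages = pd.Series([22, 31, 38, 44, 45, 52, 61, 35], name="age")
shares = age_band_shares(ages)
print(shares)
```

On the real data this would be `age_band_shares(df['age'])`, with edges adjusted to whatever bands matter for the analysis.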

## 4. Policy Attributes

Examine key policy attributes like annual premium, deductibles, and umbrella limits.

In [4]:
# Annual premium distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['policy_annual_premium'], bins=30, kde=True, color='coral')
plt.title('Annual Premium Distribution')
plt.xlabel('Annual Premium')
plt.ylabel('Frequency')
plt.show()

# Deductible distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['policy_deductable'], bins=30, kde=True, color='lightgreen')
plt.title('Policy Deductible Distribution')
plt.xlabel('Policy Deductible')
plt.ylabel('Frequency')
plt.show()

# Umbrella limit distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['umbrella_limit'], bins=30, kde=True, color='lightskyblue')
plt.title('Umbrella Limit Distribution')
plt.xlabel('Umbrella Limit')
plt.ylabel('Frequency')
plt.show()
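
Histograms with a KDE overlay can be hard to read when a column is heavily skewed or dominated by a single value. A numeric summary alongside the plots may help; the sketch below reports quartiles, skewness, and the share of zeros per column. `describe_numeric` is a hypothetical helper and the `policies` rows are invented.

```python
import pandas as pd


def describe_numeric(frame, columns):
    """Summarize numeric columns: quartiles, skewness, and share of zero values."""
    stats = frame[columns].describe().T[["min", "25%", "50%", "75%", "max"]]
    return stats.assign(
        skew=frame[columns].skew(),
        pct_zero=(frame[columns] == 0).mean() * 100,
    )


# Made-up policy rows for demonstration.
policies = pd.DataFrame(
    {
        "policy_annual_premium": [900.0, 1100.0, 1250.0, 1300.0, 1700.0],
        "policy_deductable": [500, 500, 1000, 2000, 1000],
        "umbrella_limit": [0, 0, 0, 5000000, 0],
    }
)
policy_summary = describe_numeric(policies, list(policies.columns))
print(policy_summary)
```

A high `pct_zero` or a large `skew` is a hint that a log scale or a separate count of non-zero rows might describe the column better than a 30-bin histogram.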

## 5. Fraud Detection Patterns

Investigate patterns and trends related to reported fraud cases.

In [5]:
# Fraud reported by gender
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='insured_sex', hue='fraud_reported', palette='viridis')
plt.title('Fraud Reported by Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.legend(title='Fraud Reported')
plt.show()

# Fraud reported by education level
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='insured_education_level', hue='fraud_reported', palette='coolwarm')
plt.title('Fraud Reported by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Fraud Reported')
plt.show()

# Fraud reported by occupation
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='insured_occupation', hue='fraud_reported', palette='viridis')
plt.title('Fraud Reported by Occupation')
plt.xlabel('Occupation')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.legend(title='Fraud Reported')
plt.show()
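
Count plots compare raw counts, so a large group can appear to have more fraud simply because it has more customers. Comparing the fraud rate per category avoids that. The sketch below assumes `fraud_reported` is coded as `'Y'`/`'N'`, which should be checked against the actual data; `fraud_rate_by` is a hypothetical helper and the `claims` rows are made up.

```python
import pandas as pd


def fraud_rate_by(frame, column, target="fraud_reported", positive="Y"):
    """Return group size and share of positive fraud labels per category."""
    is_fraud = frame[target].eq(positive)  # assumes 'Y' marks a fraud report
    return (
        is_fraud.groupby(frame[column])
        .agg(n="size", fraud_rate="mean")
        .sort_values("fraud_rate", ascending=False)
    )


# Made-up claims for demonstration.
claims = pd.DataFrame(
    {
        "insured_occupation": ["sales", "sales", "sales", "sales", "tech", "tech"],
        "fraud_reported": ["Y", "N", "N", "N", "Y", "Y"],
    }
)
occupation_rates = fraud_rate_by(claims, "insured_occupation")
print(occupation_rates)
```

Keeping the group size `n` next to the rate matters: a high rate in a category with only a handful of rows is weak evidence.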

## 6. Policy Analysis

Explore relationships between policy attributes and reported fraud.

In [6]:
# Policy annual premium vs. fraud
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='fraud_reported', y='policy_annual_premium', palette='muted')
plt.title('Policy Annual Premium vs. Fraud')
plt.xlabel('Fraud Reported')
plt.ylabel('Policy Annual Premium')
plt.show()

# Policy deductible vs. fraud
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='fraud_reported', y='policy_deductable', palette='muted')
plt.title('Policy Deductible vs. Fraud')
plt.xlabel('Fraud Reported')
plt.ylabel('Policy Deductible')
plt.show()
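
Box plots suggest whether two groups differ, but not whether the difference could be due to chance. One option, sketched here as an illustration rather than as the project's chosen method, is a permutation test on the difference in group means using only NumPy. The function name, the number of permutations, the seed, and the sample rows are all assumptions.

```python
import numpy as np
import pandas as pd


def permutation_pvalue(values, is_fraud, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    values = np.asarray(values, dtype=float)
    is_fraud = np.asarray(is_fraud, dtype=bool)
    observed = values[is_fraud].mean() - values[~is_fraud].mean()

    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the labels while keeping the group sizes fixed.
        shuffled = rng.permutation(is_fraud)
        diff = values[shuffled].mean() - values[~shuffled].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_permutations + 1)


# Made-up premiums for demonstration; 'Y' is assumed to mark fraud.
claims = pd.DataFrame(
    {
        "policy_annual_premium": [1000.0, 1050.0, 980.0, 1020.0,
                                  1500.0, 1550.0, 1480.0, 1520.0],
        "fraud_reported": ["N", "N", "N", "N", "Y", "Y", "Y", "Y"],
    }
)
diff, p_value = permutation_pvalue(
    claims["policy_annual_premium"], claims["fraud_reported"].eq("Y")
)
print(f"Mean difference: {diff:.1f}, p-value: {p_value:.3f}")
```

A small p-value indicates the observed gap is unlikely under random relabelling. It says nothing about causation, and running the test on many columns calls for a multiple-comparison correction.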

## 7. Summary and Insights

Summarize key insights derived from the EDA.

In [7]:
# Summary of key insights (written by hand from the analysis above; re-check them if the data changes)
print("\nKey Insights:")
print("1. Age Distribution: Most customers are in the range of 30-45 years.")
print("2. Gender Distribution: There is a balanced representation of genders among customers.")
print("3. Education Level: Most customers have a higher education degree.")
print("4. Annual Premium: Premiums are widely distributed, with a concentration in the lower range.")
print("5. Fraud Patterns: Fraud is reported more often in certain occupations and education levels.")
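
Hand-written insight strings can drift out of date when the data is refreshed. As a rough sketch, the same kind of statement can be generated from the DataFrame so the numbers always match the data. `top_share` is a hypothetical helper, and the `customers` rows are invented for illustration.

```python
import pandas as pd


def top_share(series):
    """Return the most frequent value and its share of non-missing rows."""
    shares = series.value_counts(normalize=True)
    return shares.index[0], float(shares.iloc[0])


# Made-up customers for demonstration.
customers = pd.DataFrame(
    {
        "age": [28, 33, 37, 41, 44, 52],
        "insured_education_level": ["MD", "JD", "MD", "College", "MD", "PhD"],
    }
)

# Share of customers aged 30 to 45, both ends inclusive.
in_range = customers["age"].between(30, 45).mean()
level, level_share = top_share(customers["insured_education_level"])

print(f"Customers aged 30-45: {in_range:.0%}")
print(f"Most common education level: {level} ({level_share:.0%})")
```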

## Conclusion

This EDA notebook examined customer demographics, policy attributes, and reported-fraud patterns in the insurance data. These findings will help shape our strategies for targeting customers and managing risk.