# Task 1: Exploratory Data Analysis (EDA)

This notebook performs exploratory data analysis on the insurance dataset. It covers:
1. Data Loading and Overview
2. Data Summarization (descriptive statistics)
3. Data Quality Checks (data types, missing values, outlier detection)
4. Visualizations (univariate and bivariate analysis)

In [None]:
import sys
import os
import pandas as pd
import numpy as np

# Add the project root (the parent directory) to the import path so the `src` package can be imported
sys.path.append(os.path.abspath('..'))

from src.loader import load_data
from src.eda import calculate_summary_statistics, check_missing_values, detect_outliers_iqr
from src.visualization import plot_histogram, plot_boxplot, plot_scatter, plot_correlation_matrix

## 1. Data Loading and Overview

In [None]:
filepath = '../data/raw/MachineLearningRating_v3.txt'
df = load_data(filepath)
df.head()

In [None]:
df.info()

## 2. Data Summarization
Compute descriptive statistics for the key numerical columns, `TotalPremium` and `TotalClaims`.

In [None]:
numerical_cols = ['TotalPremium', 'TotalClaims']
stats = calculate_summary_statistics(df, numerical_cols)
stats
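
The body of `calculate_summary_statistics` is not shown in this notebook. As a rough sketch of what such a summary can look like in plain pandas, the hypothetical helper `summarize_numeric` below combines `describe()` with skewness and a missing-value count. The toy values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd


def summarize_numeric(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return one row per column with descriptive statistics.

    Hypothetical helper for illustration; it is not the project's
    `calculate_summary_statistics`.
    """
    # describe() yields count, mean, std, min, quartiles and max per column
    summary = df[columns].describe().T
    # Skewness hints at long tails, which are common in monetary columns
    summary["skew"] = df[columns].skew()
    # Count of missing entries per column
    summary["missing"] = df[columns].isna().sum()
    return summary


# Toy data (illustrative values only)
toy = pd.DataFrame({
    "TotalPremium": [100.0, 120.0, 80.0, 100.0],
    "TotalClaims": [0.0, 0.0, 0.0, 400.0],
})
summary = summarize_numeric(toy, ["TotalPremium", "TotalClaims"])
print(summary[["mean", "50%", "max", "skew", "missing"]])
```

Comparing the mean with the median (`50%`) is a quick way to spot a skewed column before plotting it.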

## 3. Data Quality Checks
Check each column for missing values, then flag potential outliers in `TotalPremium` and `TotalClaims` using the interquartile range (IQR) rule.

In [None]:
missing_values = check_missing_values(df)
missing_values
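
The return format of `check_missing_values` is not shown here. One possible shape for such a report, sketched with a hypothetical helper `missing_value_report`, is a table of missing counts and percentages per column, restricted to columns that actually have gaps. The toy values are illustrative only.

```python
import numpy as np
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values for columns that have any.

    Hypothetical helper for illustration; it is not the project's
    `check_missing_values`.
    """
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": 100 * counts / len(df),
    })
    # Keep only columns with gaps, worst first
    report = report[report["missing_count"] > 0]
    return report.sort_values("missing_count", ascending=False)


# Toy data (illustrative values only)
toy = pd.DataFrame({
    "TotalPremium": [100.0, np.nan, 80.0, 100.0],
    "TotalClaims": [np.nan, 0.0, np.nan, 400.0],
})
report = missing_value_report(toy)
print(report)
```

Reporting the percentage alongside the count makes it easier to decide whether a column should be imputed or dropped.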

In [None]:
outliers_premium = detect_outliers_iqr(df, 'TotalPremium')
outliers_claims = detect_outliers_iqr(df, 'TotalClaims')
print(f"Outliers in TotalPremium: {outliers_premium}")
print(f"Outliers in TotalClaims: {outliers_claims}")
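
For reference, the IQR rule flags a value as an outlier when it lies more than `k` times the interquartile range below the first quartile or above the third quartile. The sketch below uses a hypothetical helper `iqr_outlier_mask` with the conventional `k = 1.5`. Whether `detect_outliers_iqr` uses the same multiplier, or returns a count rather than a mask, is an assumption.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask that is True where a value falls outside the IQR fences.

    Hypothetical helper for illustration; it is not the project's
    `detect_outliers_iqr`.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (series < lower) | (series > upper)


# Toy data (illustrative values only): one claim is far above the rest
toy = pd.DataFrame({"TotalClaims": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]})
mask = iqr_outlier_mask(toy["TotalClaims"])
n_outliers = int(mask.sum())
print(f"Outliers in TotalClaims: {n_outliers}")
print(toy.loc[mask, "TotalClaims"].tolist())
```

One caveat of this rule: if at least three quarters of the values in a column are identical (for example, zero), then Q1 and Q3 coincide, the IQR is zero, and every differing value is flagged.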

## 4. Visualizations
Plot the distributions of `TotalPremium` and `TotalClaims`, then the relationship between the two.

In [None]:
# Univariate Analysis
plot_histogram(df, 'TotalPremium', title='Distribution of Total Premium')
plot_histogram(df, 'TotalClaims', title='Distribution of Total Claims')

In [None]:
# Bivariate Analysis
plot_scatter(df, 'TotalPremium', 'TotalClaims', title='Total Premium vs Total Claims')
plot_correlation_matrix(df, ['TotalPremium', 'TotalClaims'])
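
The numbers behind a correlation matrix plot can also be inspected directly. As a minimal sketch, assuming `plot_correlation_matrix` visualizes Pearson correlations (the notebook does not confirm this), the pandas call below computes the same kind of matrix on made-up toy data.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "TotalPremium": [100.0, 200.0, 300.0, 400.0, 500.0],
    "TotalClaims": [0.0, 50.0, 0.0, 120.0, 90.0],
})

# Pairwise Pearson correlation; the result is a symmetric 2x2 matrix
corr = toy[["TotalPremium", "TotalClaims"]].corr(method="pearson")
print(corr)

# The single off-diagonal entry is the premium-to-claims correlation
premium_claims_corr = corr.loc["TotalPremium", "TotalClaims"]
print(f"Correlation: {premium_claims_corr:.3f}")
```

With only two columns the matrix reduces to one informative number, so the scatter plot usually carries more detail than the heatmap.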