# ACIS Insurance Risk Analytics – Task 1

**Objective:** Understand the dataset, check it for data quality issues, and perform Exploratory Data Analysis (EDA) to derive insights about claims, premiums, and risk.

## Setup & Imports

In [None]:
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display  # makes the display() call below explicit

# Add the src directory to the path to import custom modules
sys.path.append(os.path.abspath('../src'))

from data_loader import load_raw_data, preview_data
from eda import summarize_numeric, missing_value_report, top_creative_plots

print(f"Pandas version: {pd.__version__}")
print(f"Numpy version: {np.__version__}")

## Load Dataset

In [None]:
df = load_raw_data('../data/MachineLearningRating_v3.txt')

if df is not None:
    print(f"Dataset Shape: {df.shape}")
    preview_data(df)
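
The `load_raw_data` helper lives in `src/data_loader.py` and is not shown here. The `if df is not None` guards used throughout this notebook suggest it returns `None` when loading fails. The sketch below illustrates that contract only: the name `load_raw_data_sketch`, the `sep` parameter, and the pipe delimiter are all assumptions, not the project's actual implementation or file format.

```python
import os
import tempfile
from typing import Optional

import pandas as pd


def load_raw_data_sketch(path: str, sep: str = "|") -> Optional[pd.DataFrame]:
    """Hypothetical loader: return a DataFrame, or None if reading fails."""
    try:
        # sep="|" is an assumed delimiter; adjust it to the real file.
        return pd.read_csv(path, sep=sep, low_memory=False)
    except (FileNotFoundError, pd.errors.ParserError, pd.errors.EmptyDataError) as exc:
        print(f"Could not load {path}: {exc}")
        return None


# Demonstrate on a throwaway file with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("Province|TotalPremium|TotalClaims\nA|100.0|0.0\nB|250.5|75.0\n")
    loaded = load_raw_data_sketch(demo_path)
    missing = load_raw_data_sketch(os.path.join(tmp, "does_not_exist.txt"))

print(loaded.shape if loaded is not None else "load failed")
```

Returning `None` instead of raising keeps later cells from crashing, at the cost of repeating the `is not None` check in every cell.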

## Column Understanding

In [None]:
if df is not None:
    print(df.columns)

# Explanation of key columns:
# - TotalClaims: The total amount claimed.
# - TotalPremium: The total premium paid.
# - Province: The location of the insured vehicle.
# - Make: The manufacturer of the vehicle.
# - Section: Type of insurance coverage.

## Missing Value Analysis

In [None]:
if df is not None:
    missing_report = missing_value_report(df)
    print(missing_report)

    if not missing_report.empty:
        plt.figure(figsize=(10, 6))
        sns.barplot(x=missing_report.index, y=missing_report['Percent'])
        plt.xticks(rotation=90)
        plt.title('Percentage of Missing Values by Column')
        plt.ylabel('Percentage')
        plt.show()
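
The `missing_value_report` helper is defined in `src/eda.py` and is not shown in this notebook. The plotting code above only relies on it returning a DataFrame indexed by column name with a `Percent` column. A minimal sketch under that assumption follows; the function name `missing_value_report_sketch` and the `Missing` column are illustrative, and the data is made up.

```python
import numpy as np
import pandas as pd


def missing_value_report_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for eda.missing_value_report."""
    missing = frame.isna().sum()
    percent = 100 * missing / len(frame)
    report = pd.DataFrame({"Missing": missing, "Percent": percent})
    # Keep only columns with at least one missing value, worst first.
    report = report[report["Missing"] > 0]
    return report.sort_values("Percent", ascending=False)


toy = pd.DataFrame({
    "TotalPremium": [100.0, 200.0, np.nan, 400.0],
    "Province": ["A", None, None, "B"],
    "Make": ["M1", "M2", "M3", "M4"],
})
toy_report = missing_value_report_sketch(toy)
print(toy_report)
```

Dropping fully populated columns keeps the bar chart readable and makes `report.empty` a natural test for "nothing is missing".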

## Descriptive Statistics

In [None]:
if df is not None:
    stats = summarize_numeric(df)
    display(stats)

### Interpretation
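- **TotalClaims**: Compare the mean with the max to understand claim magnitude; a max far above the mean points to a few very large claims.
- **TotalPremium**: Compare with TotalClaims; the ratio of total claims to total premium (the loss ratio) is a rough proxy for profitability.
- **SumInsured**: Check the spread (standard deviation, min and max) for potential outliers.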
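
The comparison of premium with claims can be made concrete as a loss ratio, total claims divided by total premium. The sketch below uses a small made-up frame rather than the real dataset; only the column names `TotalClaims`, `TotalPremium`, and `Province` come from this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Province": ["A", "A", "B", "B"],
    "TotalPremium": [100.0, 300.0, 200.0, 200.0],
    "TotalClaims": [0.0, 200.0, 500.0, 100.0],
})

# Portfolio-level loss ratio: sum of claims over sum of premium.
overall_loss_ratio = toy["TotalClaims"].sum() / toy["TotalPremium"].sum()

# Aggregate first, then divide, so individual rows with zero premium
# cannot cause a division by zero at row level.
by_province = toy.groupby("Province")[["TotalClaims", "TotalPremium"]].sum()
by_province["LossRatio"] = by_province["TotalClaims"] / by_province["TotalPremium"]

print(f"Overall loss ratio: {overall_loss_ratio:.2f}")
print(by_province)
```

A ratio above 1 means claims exceeded premium for that group.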

## Outlier Detection

In [None]:
if df is not None:
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))

    sns.boxplot(y=df['TotalClaims'], ax=axes[0])
    axes[0].set_title('TotalClaims Boxplot')

    sns.boxplot(y=df['TotalPremium'], ax=axes[1])
    axes[1].set_title('TotalPremium Boxplot')

    sns.boxplot(y=df['SumInsured'], ax=axes[2])
    axes[2].set_title('SumInsured Boxplot')

    plt.tight_layout()
    plt.show()
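
Seaborn boxplots draw whiskers at 1.5 times the interquartile range by default (`whis=1.5`), so the points plotted beyond them can also be counted directly. A minimal sketch follows; the helper name `iqr_outlier_mask` is hypothetical and the values are made up.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


toy_claims = pd.Series([0, 0, 10, 20, 30, 40, 50, 10_000], name="TotalClaims")
mask = iqr_outlier_mask(toy_claims)
print(f"{mask.sum()} outlier(s) out of {len(toy_claims)}")
print(toy_claims[mask])
```

The same mask could be applied to `df['TotalClaims']`, `df['TotalPremium']`, and `df['SumInsured']` to put numbers next to the three boxplots.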

## Correlation Analysis

In [None]:
if df is not None:
    numeric_df = df.select_dtypes(include=[np.number])
    corr_matrix = numeric_df.corr()

    plt.figure(figsize=(12, 10))
    sns.heatmap(corr_matrix, annot=False, cmap='coolwarm')
    plt.title('Correlation Matrix Heatmap')
    plt.show()
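
With `annot=False` the heatmap shows no numbers, so it can help to list the strongest pairs as a table as well. The sketch below runs on made-up data and assumes only that the correlation matrix is the square output of `DataFrame.corr()`; the variable names are illustrative.

```python
from itertools import combinations

import pandas as pd

toy = pd.DataFrame({
    "SumInsured": [1.0, 2.0, 3.0, 4.0, 5.0],
    "TotalPremium": [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * SumInsured
    "TotalClaims": [5.0, 1.0, 4.0, 2.0, 3.0],
})
toy_corr = toy.corr()

# Visit each unordered pair of columns once; this skips the diagonal
# and the mirrored lower triangle of the matrix.
pairs = pd.Series({
    (a, b): toy_corr.loc[a, b] for a, b in combinations(toy_corr.columns, 2)
})

# Order by absolute correlation, strongest first.
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```

Sorting by absolute value keeps strong negative relationships near the top instead of at the bottom.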

## Creative Visualizations

In [None]:
if df is not None:
    top_creative_plots(df)

## Key Insights

1. **Claims vs Premium**: [Insight derived from plots]
2. **Geographical Risk**: [Insight from Province heatmap]
3. **Vehicle Make Impact**: [Insight from Make distribution]
4. **Seasonality**: [Insight from trends]
5. **Outliers**: Significant outliers exist in TotalClaims and TotalPremium, suggesting high-value risk cases.