# FinTech Intelligent Document Processing: Exploratory Data Analysis

This notebook explores the synthetic dataset used for document classification.

**Tasks:**
- Dataset inspection
- Class distribution analysis
- Text length analysis
- Sample preview
- Model readiness checks


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_colwidth', None)
# set_theme is the current name; sns.set is kept only as an alias
sns.set_theme(style="whitegrid")

## Load Dataset

In [None]:
df = pd.read_csv('../pipeline/data/final_dataset.csv')
print("Shape:", df.shape)
df.head()

## Class Distribution

In [None]:
plt.figure(figsize=(7,4))
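# Order bars by frequency so any imbalance is easy to spot
sns.countplot(data=df, x='label', order=df['label'].value_counts().index)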
plt.title('Document Class Distribution')
plt.xticks(rotation=20)
plt.tight_layout()
plt.show()
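
A bar chart makes imbalance visible but not precise. As a minimal sketch, the counts behind the plot can be tabulated and reduced to a single largest-to-smallest ratio. The helper names `class_balance` and `imbalance_ratio`, and the toy class names, are illustrative assumptions and not part of the original notebook; in the notebook, `df['label']` would be passed instead of the toy series.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return per-class counts and shares, largest class first."""
    counts = labels.value_counts()
    return pd.DataFrame({
        "count": counts,
        "share": (counts / counts.sum()).round(3),
    })


def imbalance_ratio(labels: pd.Series) -> float:
    """Ratio of the largest class size to the smallest (1.0 means balanced)."""
    counts = labels.value_counts()
    return float(counts.max() / counts.min())


# Toy labels for illustration only (hypothetical class names).
toy_labels = pd.Series(["invoice", "invoice", "receipt", "contract"])
balance = class_balance(toy_labels)
ratio = imbalance_ratio(toy_labels)
print(balance)
print("Imbalance ratio:", ratio)
```

A ratio close to 1.0 would support describing the dataset as balanced; what counts as "close" is a judgment call that depends on the model being trained.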

## Text Length Distribution

In [None]:
# Count whitespace-separated words; missing texts count as 0 instead of raising
df['text_length'] = df['text'].fillna('').astype(str).str.split().str.len()

plt.figure(figsize=(7,4))
sns.histplot(df['text_length'], bins=30, kde=True)
plt.title('Text Length Distribution')
plt.xlabel('Words per document')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

## Average Length per Class

In [None]:
df.groupby('label')['text_length'].mean().sort_values().plot(kind='bar', figsize=(7,4))
plt.title('Average Text Length by Class')
plt.ylabel('Words')
plt.tight_layout()
plt.show()
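
The mean alone can hide skew or outliers within a class. The sketch below, under the assumption that the dataset has `text` and `label` columns as used above, reports count, mean, median, min, and max word counts per class. The function name `length_stats_by_class` and the toy rows are hypothetical.

```python
import pandas as pd


def length_stats_by_class(df: pd.DataFrame,
                          text_col: str = "text",
                          label_col: str = "label") -> pd.DataFrame:
    """Per-class summary of words per document."""
    # Missing texts are treated as empty, so they contribute a length of 0.
    word_counts = df[text_col].fillna("").astype(str).str.split().str.len()
    return word_counts.groupby(df[label_col]).agg(
        ["count", "mean", "median", "min", "max"]
    )


# Toy rows for illustration only.
toy = pd.DataFrame({
    "text": ["pay now", "total due today", "signed"],
    "label": ["invoice", "invoice", "contract"],
})
stats = length_stats_by_class(toy)
print(stats)
```

If the median and mean diverge sharply for a class, a few very long or very short documents are probably driving the average.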

## Sample Documents

In [None]:
for label in df['label'].unique():
    print(f"\n--- {label.upper()} ---")
    print(df[df['label']==label].iloc[0]['text'])

## Data Quality Checks

In [None]:
print("Missing values:\n", df.isnull().sum())
print("Duplicate rows:", df.duplicated().sum())
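
`isnull()` and `duplicated()` miss a few cases that matter for text classification: whitespace-only documents, repeated texts that differ only in surrounding whitespace, and identical texts that carry different labels. The following is one possible sketch of such checks; `quality_report` and the toy rows are hypothetical, and the `text` and `label` column names are assumed from the cells above.

```python
import pandas as pd


def quality_report(df: pd.DataFrame,
                   text_col: str = "text",
                   label_col: str = "label") -> dict:
    """Count text-specific data quality problems."""
    text = df[text_col]
    normalized = text.fillna("").astype(str).str.strip()
    nonblank_mask = normalized != ""
    nonblank = normalized[nonblank_mask]
    # Number of distinct labels attached to each distinct non-blank text.
    labels_per_text = df.loc[nonblank_mask, label_col].groupby(nonblank).nunique()
    return {
        "missing_text": int(text.isna().sum()),
        "blank_text": int((~nonblank_mask & text.notna()).sum()),
        "duplicate_text": int(nonblank.duplicated().sum()),
        "conflicting_labels": int((labels_per_text > 1).sum()),
    }


# Toy rows for illustration only.
toy = pd.DataFrame({
    "text": ["pay now", "pay now", "  ", None, "pay now "],
    "label": ["invoice", "receipt", "invoice", "contract", "invoice"],
})
report = quality_report(toy)
print(report)
```

Texts with conflicting labels are worth inspecting by hand, since they put a hard ceiling on achievable accuracy.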

## Summary

- The dataset is balanced across the 5 document classes
- Text lengths (words per document) are similar across classes
- There are no missing values or duplicate rows
- The dataset is ready for classical ML and deep learning pipelines
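
The summary points above can also be expressed as an explicit readiness check that is re-run whenever the dataset is regenerated. This is a rough sketch, not the notebook's own method: `readiness_issues` is a hypothetical helper, and the `max_imbalance=1.5` threshold is an arbitrary assumption that should be tuned to the project.

```python
import pandas as pd


def readiness_issues(df: pd.DataFrame,
                     expected_classes: int = 5,
                     max_imbalance: float = 1.5) -> list:
    """Return human-readable problems; an empty list means all checks passed."""
    missing_cols = {"text", "label"} - set(df.columns)
    if missing_cols:
        return [f"missing columns: {sorted(missing_cols)}"]

    issues = []
    if df[["text", "label"]].isnull().any().any():
        issues.append("missing values in text or label")
    if df.duplicated().any():
        issues.append("duplicate rows present")

    counts = df["label"].value_counts()
    if len(counts) != expected_classes:
        issues.append(f"expected {expected_classes} classes, found {len(counts)}")
    if len(counts) > 0:
        ratio = counts.max() / counts.min()
        if ratio > max_imbalance:
            issues.append(
                f"class imbalance ratio {ratio:.2f} exceeds {max_imbalance:.2f}"
            )
    return issues


# Toy frames for illustration only.
clean = pd.DataFrame({"text": ["a b", "c d"], "label": ["a", "b"]})
messy = pd.DataFrame({
    "text": ["x y", "x y", "z", "w"],
    "label": ["a", "a", "a", "b"],
})
clean_issues = readiness_issues(clean, expected_classes=2)
messy_issues = readiness_issues(messy, expected_classes=2)
print(clean_issues)
print(messy_issues)
```

In the notebook, `readiness_issues(df)` would be called on the loaded dataset, and an empty result would back up the summary statements.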
