# Exploratory Data Analysis (EDA) of Consumer Complaints

This notebook performs an exploratory data analysis on the consumer complaints dataset. It uses a modular script `EDA.py` which contains all the core functions for data loading, processing, and visualization. This approach keeps the notebook clean and focused on presenting the results and insights.

### Step 1: Import Necessary Functions

First, we import the required functions from `EDA.py` and define the file paths used throughout the notebook.

In [None]:
import pandas as pd
import os
from EDA import (
    convert_csv_to_parquet,
    load_data,
    plot_product_distribution,
    analyze_narrative_word_count,
    plot_narrative_availability,
    show_unique_products,
    filter_and_process_complaints,
    save_data
)

# Define file paths
BASE_DATA_DIR = '../data'
RAW_CSV_PATH = os.path.join(BASE_DATA_DIR, 'complaints.csv')
RAW_PARQUET_PATH = os.path.join(BASE_DATA_DIR, 'raw_complaints.parquet')

### Step 2: Data Conversion (One-Time Setup)

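Parquet is a columnar, compressed format that stores column types, so for a dataset of this size it loads faster and takes less disk space than CSV. This step converts the raw CSV into Parquet format. It only needs to run once: the cell below skips the conversion when the Parquet file already exists.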

In [None]:
# Check if the Parquet file already exists to avoid re-running
if not os.path.exists(RAW_PARQUET_PATH):
    convert_csv_to_parquet(RAW_CSV_PATH, RAW_PARQUET_PATH)
else:
    print(f"Parquet file already exists at: {RAW_PARQUET_PATH}")
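
The existence check above never notices when `complaints.csv` is replaced by a newer download. The sketch below is one possible refinement that compares file modification times instead. `parquet_is_stale` is a hypothetical helper, not part of `EDA.py`.

```python
import os


def parquet_is_stale(csv_path: str, parquet_path: str) -> bool:
    """Return True when the Parquet file is missing or older than the CSV.

    Hypothetical helper for illustration; it is not defined in EDA.py.
    """
    if not os.path.exists(parquet_path):
        return True
    # A CSV modified after the Parquet file was written means the
    # Parquet copy no longer reflects the raw data.
    return os.path.getmtime(csv_path) > os.path.getmtime(parquet_path)


# Possible use in the notebook, replacing the plain existence check:
#
# if parquet_is_stale(RAW_CSV_PATH, RAW_PARQUET_PATH):
#     convert_csv_to_parquet(RAW_CSV_PATH, RAW_PARQUET_PATH)
```

Modification times are only a heuristic: copying or checking out the files can reset them, so a content hash would be a stricter alternative.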

### Step 3: Load and Inspect the Data

Now we load the Parquet file and perform a quick inspection.

In [None]:
df = load_data(RAW_PARQUET_PATH)
df.head()

In [None]:
df.info()

### Step 4: Analyze Complaint Distributions

#### 4.1 Distribution of Complaints by Product

Let's visualize the number of complaints for each product category.

In [None]:
plot_product_distribution(df)

**Insight:** "Credit reporting or other personal consumer reports" is by far the most common complaint category.
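
To back a plot like this with numbers, the per-product counts and shares can be tabulated directly. The following is a minimal sketch on toy data. It assumes only the `Product` column name used elsewhere in this notebook, and it does not show what `plot_product_distribution` does internally.

```python
import pandas as pd

# Toy rows standing in for the real complaints DataFrame.
toy = pd.DataFrame(
    {
        "Product": [
            "Credit card",
            "Mortgage",
            "Credit card",
            "Debt collection",
            "Credit card",
        ]
    }
)

# value_counts() sorts categories from most to least frequent.
counts = toy["Product"].value_counts()
shares = (counts / counts.sum() * 100).round(1)

summary = pd.DataFrame({"complaints": counts, "share_pct": shares})
print(summary)
```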

#### 4.2 Availability of Complaint Narratives

A significant part of our analysis will focus on the text narratives. Let's see how many complaints actually include a narrative.

In [None]:
plot_narrative_availability(df)

**Insight:** A large majority of the complaints (69.0%) do not have a narrative. This confirms that we must filter out these entries for any text-based analysis.
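
A percentage like the one above can be computed with a boolean mask. The sketch below uses toy data and treats a narrative as missing when it is null or contains only whitespace. That definition is an assumption, and `plot_narrative_availability` may define it differently. `has_narrative` is a hypothetical helper.

```python
import pandas as pd

NARRATIVE_COL = "Consumer complaint narrative"

# Toy rows: two real narratives, two nulls, one whitespace-only entry.
toy = pd.DataFrame(
    {
        NARRATIVE_COL: [
            "My card was charged twice.",
            None,
            "   ",
            "Account closed without notice.",
            None,
        ]
    }
)


def has_narrative(frame: pd.DataFrame, column: str = NARRATIVE_COL) -> pd.Series:
    """Boolean mask: True where the narrative holds non-whitespace text."""
    text = frame[column].fillna("").astype(str)
    return text.str.strip().ne("")


mask = has_narrative(toy)
missing_pct = round((~mask).mean() * 100, 1)
print(f"Complaints without a narrative: {missing_pct}%")
```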

### Step 5: Analyze Narrative Content

#### 5.1 Word Count Distribution

Next, let's analyze narrative length in words. The statistics below cover every row in the dataset, so complaints without a narrative count as zero words.

In [None]:
analyze_narrative_word_count(df)

### 📊 Word Count Statistics for Complaint Narratives

| Statistic | Value | Interpretation |
|-----------|-------|----------------|
| **count** | **9.6 million** | Number of rows analyzed, including complaints with no narrative |
| **mean** | **54.5 words** | Average word count per complaint |
| **std** | **~149.8 words** | Standard deviation, large relative to the mean |
| **min** | **0** | Likely a blank or missing narrative |
| **25%** | **0** | At least 25% of complaints are empty |
| **50%** | **0** | Median; at least half of the complaints are empty |
| **75%** | **50** | 75% of complaints are 50 words or shorter |
| **max** | **6,469 words** | Longest complaint |

---

### 🎯 Interpretation

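- At least half of the rows contain **zero words** (the median is 0), which reinforces the need to **filter out entries with no narrative** in the next step.
- The word-count distribution is **heavily right-skewed**: the mean (54.5) sits well above the median (0), and a small share of **very long complaints** reaches up to 6,469 words.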
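
Because empty rows pull the quartiles to zero, it helps to compare statistics over all rows with statistics over non-empty narratives only. The sketch below does this on toy data. It assumes a whitespace-based word count, which may differ from the tokenization in `analyze_narrative_word_count`.

```python
import pandas as pd

NARRATIVE_COL = "Consumer complaint narrative"

# Toy rows: two narratives, one null, one empty string.
toy = pd.DataFrame(
    {
        NARRATIVE_COL: [
            "I was charged twice",
            None,
            "",
            "The bank closed my account without any notice",
        ]
    }
)

# Missing narratives become empty strings, which split into zero words.
word_counts = toy[NARRATIVE_COL].fillna("").astype(str).str.split().str.len()

all_rows = word_counts.describe()
non_empty = word_counts[word_counts > 0].describe()

print("All rows:\n", all_rows)
print("Non-empty narratives only:\n", non_empty)
```

Reporting both views makes it clear how much of the skew comes from missing narratives rather than from the text itself.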

### Step 6: Filter and Clean Data

Based on the EDA, we will now filter the dataset to meet the requirements for the next task: focus on specific products, ensure a narrative is present, and clean the text.

#### 6.1 Identify Target Products

First, let's review all unique product names to select the relevant ones.

In [None]:
show_unique_products(df)

#### 6.2 Apply Filtering and Cleaning

We will filter for a consolidated list of product categories related to credit cards, personal loans, bank accounts, and money transfers. Then, we'll apply the text cleaning function.

In [None]:
TARGET_PRODUCTS = [
    "Credit card",
    "Credit card or prepaid card",
    "Payday loan, title loan, or personal loan",
    "Payday loan, title loan, personal loan, or advance loan",
    "Checking or savings account",
    "Money transfers",
    "Money transfer, virtual currency, or money service"
]

df_filtered = filter_and_process_complaints(df, TARGET_PRODUCTS)

Let's check a sample to see the result of the cleaning.

In [None]:
df_filtered[['Product', 'Consumer complaint narrative', 'cleaned_narrative']].sample(3, random_state=42)
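
The filtering and cleaning logic lives in `EDA.py` and is not shown in this notebook. As a rough illustration of the steps described in this section, here is a minimal sketch on toy data. The helpers `clean_text` and `filter_and_clean` are hypothetical, and the specific cleaning rules (lowercasing, dropping non-alphanumeric characters, collapsing whitespace) are assumptions rather than the actual behavior of `filter_and_process_complaints`.

```python
import re

import pandas as pd

NARRATIVE_COL = "Consumer complaint narrative"


def clean_text(text: str) -> str:
    """Lowercase, replace non-alphanumeric characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def filter_and_clean(frame: pd.DataFrame, products: list[str]) -> pd.DataFrame:
    """Keep target products with a non-blank narrative, then clean the text."""
    has_text = frame[NARRATIVE_COL].fillna("").astype(str).str.strip().ne("")
    keep = frame["Product"].isin(products) & has_text
    out = frame.loc[keep].copy()
    out["cleaned_narrative"] = out[NARRATIVE_COL].map(clean_text)
    return out


toy = pd.DataFrame(
    {
        "Product": ["Credit card", "Mortgage", "Money transfers", "Credit card"],
        NARRATIVE_COL: [
            "I was CHARGED twice!!",
            "Late fee dispute.",
            None,
            "  Card   declined, no reason given. ",
        ],
    }
)

result = filter_and_clean(toy, ["Credit card", "Money transfers"])
print(result[["Product", "cleaned_narrative"]])
```

In this toy run the mortgage row is dropped by the product filter and the money-transfer row is dropped because it has no narrative.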

### Step 7: Save the Processed Data

Finally, we save the cleaned and filtered DataFrame to a new CSV file for use in subsequent tasks.

In [None]:
FILTERED_CSV_PATH = os.path.join(BASE_DATA_DIR, 'filtered_complaints_2.csv')
save_data(df_filtered, FILTERED_CSV_PATH)
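
One thing to watch when the saved CSV is read back in later tasks: by default `pd.read_csv` turns empty fields into `NaN`, so a cleaned narrative that ended up as an empty string would reappear as a missing value. The sketch below demonstrates this on toy data written to a temporary directory. The file name is illustrative only and is unrelated to `FILTERED_CSV_PATH`.

```python
import os
import tempfile

import pandas as pd

# Toy frame with one empty cleaned narrative.
toy = pd.DataFrame(
    {
        "Product": ["Credit card", "Money transfers"],
        "cleaned_narrative": ["i was charged twice", ""],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "filtered_sample.csv")
    toy.to_csv(path, index=False)

    # Default parsing: the empty field is read back as NaN.
    default_read = pd.read_csv(path)
    # keep_default_na=False keeps the empty field as an empty string.
    strict_read = pd.read_csv(path, keep_default_na=False)

print(default_read["cleaned_narrative"].isna().sum())
print(strict_read["cleaned_narrative"].tolist())
```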