# 📊 XAE-Frame: Exploratory Data Analysis (EDA)
## Amazon Reviews 2023 - All_Beauty Category

**Author:** Nazlı Özgür  
**Date:** December 2024  
**Dataset:** Amazon Reviews 2023 (McAuley Lab)

---

## Objectives
1. Understand data structure and quality
2. Identify patterns relevant to the recommendation system
3. Detect potential bias sources for fairness monitoring
4. Prepare publication-quality visualizations
5. Inform feature engineering decisions

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
from datetime import datetime

# Suppress warnings
warnings.filterwarnings('ignore')

# Set style for publication-quality plots
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['legend.fontsize'] = 10

print("Libraries imported successfully!")

---
## 1. Data Loading

In [None]:
# Load data
data_path = Path("../data/raw/amazon_reviews_All_Beauty.parquet")

if not data_path.exists():
    # Stop here: every later cell depends on `df`
    raise FileNotFoundError(
        f"Data file not found: {data_path}\n"
        "Please run: python scripts/download_data.py --dataset amazon_reviews --category All_Beauty"
    )

df = pd.read_parquet(data_path)
print("  Data loaded successfully!")
print(f"  Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")

In [None]:
# Display first few rows
df.head()

In [None]:
# Data info
print("Dataset Information:")
print("=" * 50)
df.info()

---
## 2. Data Quality Assessment

In [None]:
# Missing values analysis
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_pct
}).sort_values('Missing %', ascending=False)

print("Missing Values Analysis:")
print("=" * 50)
print(missing_df[missing_df['Missing Count'] > 0])

In [None]:
# Visualize missing data (Publication-quality)
Path('../docs/figures').mkdir(parents=True, exist_ok=True)  # ensure the output folder exists
missing_plot = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing %')
if len(missing_plot) > 0:
    # Create the figure only when there is something to plot
    fig, ax = plt.subplots(figsize=(10, 6))
    missing_plot['Missing %'].plot(kind='barh', ax=ax, color='coral')
    ax.set_xlabel('Missing Data (%)', fontsize=12, fontweight='bold')
    ax.set_ylabel('Features', fontsize=12, fontweight='bold')
    ax.set_title('Missing Data Analysis - Amazon Beauty Reviews', 
                 fontsize=14, fontweight='bold', pad=20)
    ax.grid(axis='x', alpha=0.3)
    plt.tight_layout()
    plt.savefig('../docs/figures/missing_data.png', dpi=300, bbox_inches='tight')
    plt.show()
else:
    print("No missing data found!")

In [None]:
# Basic statistics
print("\nBasic Statistics:")
print("=" * 50)
df.describe()

---
## 3. Rating Distribution Analysis

**Key for recommendation systems:** Understanding rating patterns helps design appropriate loss functions and identify potential biases.

In [None]:
# Rating distribution
rating_counts = df['rating'].value_counts().sort_index()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
rating_counts.plot(kind='bar', ax=axes[0], color='steelblue', edgecolor='black')
axes[0].set_xlabel('Rating', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[0].set_title('Rating Distribution', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

# Percentage pie chart
rating_pct = (rating_counts / rating_counts.sum()) * 100
colors = sns.color_palette('Set2', len(rating_pct))
axes[1].pie(rating_pct, labels=rating_pct.index, autopct='%1.1f%%', 
            startangle=90, colors=colors)
axes[1].set_title('Rating Distribution (%)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('../docs/figures/rating_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nRating Statistics:")
print(f"  Mean Rating: {df['rating'].mean():.2f}")
print(f"  Median Rating: {df['rating'].median():.2f}")
print(f"  Std Dev: {df['rating'].std():.2f}")
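
As a toy illustration of how rating skew could feed into a loss function, the sketch below derives inverse-frequency ("balanced") class weights from a rating distribution. The ratings are made-up values, and weighting the loss this way is only one possible option, not something this notebook prescribes.

```python
import pandas as pd

# Toy ratings skewed toward 5 stars (illustrative values, not from the dataset)
ratings = pd.Series([5, 5, 5, 5, 5, 5, 4, 4, 3, 1], name="rating")

# Share of each observed rating value
rating_share = ratings.value_counts(normalize=True).sort_index()

# Inverse-frequency weights: rare ratings get larger weights.
# Dividing by the number of observed classes makes the per-sample weights average to 1.
class_weights = 1.0 / (rating_share * len(rating_share))
sample_weights = ratings.map(class_weights)

print(rating_share)
print(class_weights.round(3))
print(f"Mean sample weight: {sample_weights.mean():.3f}")
```

The resulting `sample_weights` could be passed to any model that accepts per-sample weights. Whether reweighting actually helps would need to be checked on held-out data.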

---
## 4. Temporal Analysis

**Critical for drift detection:** Understanding temporal patterns helps identify when to trigger model retraining.

In [None]:
# Convert timestamp to datetime if needed
if 'timestamp' in df.columns:
    df['date'] = pd.to_datetime(df['timestamp'], unit='ms', errors='coerce')
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['year_month'] = df['date'].dt.to_period('M')
    
    # Reviews over time
    reviews_over_time = df.groupby('year_month').size()
    
    fig, ax = plt.subplots(figsize=(14, 6))
    reviews_over_time.plot(ax=ax, color='darkblue', linewidth=2)
    ax.set_xlabel('Time Period', fontsize=12, fontweight='bold')
    ax.set_ylabel('Number of Reviews', fontsize=12, fontweight='bold')
    ax.set_title('Review Volume Over Time', fontsize=14, fontweight='bold', pad=20)
    ax.grid(alpha=0.3)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('../docs/figures/temporal_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\nTemporal Coverage:")
    print(f"  First Review: {df['date'].min()}")
    print(f"  Last Review: {df['date'].max()}")
    print(f"  Time Span: {(df['date'].max() - df['date'].min()).days} days")
else:
    print("No timestamp column found")
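
One simple way to turn temporal patterns into a retraining signal is to compare the rating distribution of a recent window against an older reference window. The sketch below uses the Population Stability Index (PSI) for that comparison. The helper `rating_psi`, the toy windows, and the choice of PSI itself are assumptions made for illustration, not the XAE-Frame drift detector.

```python
import numpy as np
import pandas as pd

def rating_psi(reference, current, levels=(1, 2, 3, 4, 5), eps=1e-6):
    """Population Stability Index between two samples of star ratings."""
    ref_share = pd.Series(reference).value_counts(normalize=True).reindex(list(levels), fill_value=0.0)
    cur_share = pd.Series(current).value_counts(normalize=True).reindex(list(levels), fill_value=0.0)
    # Clip so empty rating levels do not produce log(0) or division by zero
    ref_share = ref_share.clip(lower=eps)
    cur_share = cur_share.clip(lower=eps)
    return float(((cur_share - ref_share) * np.log(cur_share / ref_share)).sum())

# Toy windows (illustrative values): an older period and a more recent one
older = [5, 5, 5, 4, 4, 3, 5, 5, 4, 1]
recent = [5, 3, 2, 1, 1, 3, 2, 4, 1, 2]

print(f"PSI(older, older)  = {rating_psi(older, older):.4f}")
print(f"PSI(older, recent) = {rating_psi(older, recent):.4f}")
```

On real data the two windows could be built by filtering on `year_month`. Any threshold for triggering retraining is application-specific and would need to be calibrated.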

---
## 5. User and Item Analysis

**Essential for collaborative filtering:** Identifying power users and popular items.

In [None]:
# User activity analysis
user_id_col = 'user_id' if 'user_id' in df.columns else 'reviewerID'
item_id_col = 'parent_asin' if 'parent_asin' in df.columns else 'asin'

user_activity = df.groupby(user_id_col).size().reset_index(name='review_count')
item_popularity = df.groupby(item_id_col).size().reset_index(name='review_count')

print("User Activity Statistics:")
print("=" * 50)
print(f"  Total Unique Users: {df[user_id_col].nunique():,}")
print(f"  Avg Reviews per User: {user_activity['review_count'].mean():.2f}")
print(f"  Median Reviews per User: {user_activity['review_count'].median():.0f}")
print(f"  Max Reviews by Single User: {user_activity['review_count'].max():,}")

print("\nItem Popularity Statistics:")
print("=" * 50)
print(f"  Total Unique Items: {df[item_id_col].nunique():,}")
print(f"  Avg Reviews per Item: {item_popularity['review_count'].mean():.2f}")
print(f"  Median Reviews per Item: {item_popularity['review_count'].median():.0f}")
print(f"  Max Reviews for Single Item: {item_popularity['review_count'].max():,}")
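
Means and medians hide how concentrated activity is. A complementary view is the share of all reviews written by the most active users. The sketch below computes that on a toy interaction log; the helper `top_share` and the data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def top_share(counts, top_frac=0.01):
    """Fraction of all interactions contributed by the most active `top_frac` of entities."""
    ordered = counts.sort_values(ascending=False)
    n_top = max(1, int(len(ordered) * top_frac))
    return ordered.iloc[:n_top].sum() / ordered.sum()

# Toy interaction log (illustrative): one heavy user and nine single-review users
toy = pd.DataFrame({
    "user_id": ["u0"] * 11 + [f"u{i}" for i in range(1, 10)],
    "parent_asin": [f"p{i % 4}" for i in range(20)],
})
reviews_per_user = toy.groupby("user_id").size()

print(f"Top 10% of users write {top_share(reviews_per_user, 0.10):.0%} of reviews")
print(f"Users with a single review: {(reviews_per_user == 1).mean():.0%}")
```

The same function can be applied to per-item counts to quantify popularity concentration.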

In [None]:
# Visualize user and item distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# User distribution (log scale for better visibility)
axes[0].hist(user_activity['review_count'], bins=50, color='teal', edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Number of Reviews per User', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Frequency (log scale)', fontsize=12, fontweight='bold')
axes[0].set_title('User Activity Distribution', fontsize=14, fontweight='bold')
axes[0].set_yscale('log')
axes[0].grid(alpha=0.3)

# Item distribution (log scale)
axes[1].hist(item_popularity['review_count'], bins=50, color='coral', edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Number of Reviews per Item', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Frequency (log scale)', fontsize=12, fontweight='bold')
axes[1].set_title('Item Popularity Distribution', fontsize=14, fontweight='bold')
axes[1].set_yscale('log')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig('../docs/figures/user_item_distribution.png', dpi=300, bbox_inches='tight')
plt.show()
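
The histograms above use 50 equal-width bins, so with heavy-tailed counts most of the mass may fall into the first bin. One alternative is log-spaced bins. The sketch below shows the binning step on synthetic data; the Pareto sample and the number of bins are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic heavy-tailed counts (illustrative): many small values, a few large ones
rng = np.random.default_rng(0)
counts = np.floor(rng.pareto(a=1.5, size=5_000) + 1).astype(int)

# Log-spaced bin edges from 1 up to just past the maximum observed count
edges = np.logspace(0, np.log10(counts.max() + 1), num=20)
hist, _ = np.histogram(counts, bins=edges)

# Print only the non-empty bins
for lo, hi, n in zip(edges[:-1], edges[1:], hist):
    if n > 0:
        print(f"[{lo:8.1f}, {hi:8.1f}) -> {n}")
```

The `edges` array can be passed as `bins=edges` to `ax.hist`, combined with `ax.set_xscale('log')`, to draw the same distribution on log-log axes.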

---
## 6. Sparsity Analysis

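**Critical metric for recommendation systems:** High sparsity means most users and items have very few interactions, which makes collaborative filtering and cold-start recommendations harder.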

In [None]:
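# Calculate sparsity
n_users = df[user_id_col].nunique()
n_items = df[item_id_col].nunique()
# Count each user-item pair once so repeat reviews do not inflate density
n_ratings = len(df[[user_id_col, item_id_col]].drop_duplicates())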

sparsity = 1 - (n_ratings / (n_users * n_items))

print("\nData Sparsity Analysis:")
print("=" * 50)
print(f"  Total Possible Interactions: {n_users * n_items:,}")
print(f"  Actual Interactions: {n_ratings:,}")
print(f"  Sparsity: {sparsity * 100:.4f}%")
print(f"  Density: {(1 - sparsity) * 100:.4f}%")
print("\n Insight: High sparsity indicates need for:")
print("   - Cross-domain transfer learning")
print("   - Hybrid recommendation approaches")
print("   - Content-based features")
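
As a worked check of the formula, sparsity = 1 - interactions / (users x items), the sketch below evaluates it on a tiny toy log in which one user reviewed the same item twice. The data and the `toy_` names are illustrative only.

```python
import pandas as pd

# Toy interaction log (illustrative); note the repeated (u1, p1) pair
toy = pd.DataFrame({
    "user_id":     ["u1", "u1", "u1", "u2", "u3"],
    "parent_asin": ["p1", "p1", "p2", "p2", "p3"],
})

toy_users = toy["user_id"].nunique()
toy_items = toy["parent_asin"].nunique()
# Count each user-item cell once, even if the user reviewed the item twice
toy_pairs = len(toy.drop_duplicates(["user_id", "parent_asin"]))

toy_density = toy_pairs / (toy_users * toy_items)
toy_sparsity = 1 - toy_density
print(f"Users: {toy_users}, Items: {toy_items}, Unique pairs: {toy_pairs}")
print(f"Density: {toy_density:.3f}, Sparsity: {toy_sparsity:.3f}")
```

Here 4 of the 9 possible user-item cells are filled, so density is about 0.444 and sparsity about 0.556. Counting all 5 rows instead would understate sparsity.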

---
## 7. Text Analysis (Review Text)

**For content-based features and sentiment analysis**

In [None]:
# Check if review text exists
if 'text' in df.columns:
    # Remove null values
    df_text = df[df['text'].notna()].copy()
    
    # Calculate text length
    df_text['text_length'] = df_text['text'].str.len()
    df_text['word_count'] = df_text['text'].str.split().str.len()
    
    print("\nReview Text Statistics:")
    print("=" * 50)
    print(f"  Reviews with Text: {len(df_text):,} ({len(df_text)/len(df)*100:.1f}%)")
    print(f"  Avg Characters: {df_text['text_length'].mean():.0f}")
    print(f"  Avg Words: {df_text['word_count'].mean():.0f}")
    print(f"  Median Words: {df_text['word_count'].median():.0f}")
    
    # Visualize text length distribution
    fig, ax = plt.subplots(figsize=(12, 5))
    ax.hist(df_text['word_count'], bins=50, color='purple', edgecolor='black', alpha=0.7)
    ax.set_xlabel('Number of Words', fontsize=12, fontweight='bold')
    ax.set_ylabel('Frequency', fontsize=12, fontweight='bold')
    ax.set_title('Review Text Length Distribution', fontsize=14, fontweight='bold')
    ax.axvline(df_text['word_count'].mean(), color='red', linestyle='--', 
               linewidth=2, label=f"Mean: {df_text['word_count'].mean():.0f}")
    ax.legend()
    ax.grid(alpha=0.3)
    plt.tight_layout()
    plt.savefig('../docs/figures/text_length_distribution.png', dpi=300, bbox_inches='tight')
    plt.show()
else:
    print(" No review text column found")
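
As a minimal sketch of what content-based features from review text could look like, the example below derives a few simple per-review columns with pandas string methods. The feature names (`char_count`, `exclamation_count`, `has_text`) and the toy reviews are hypothetical; a real pipeline would likely add richer representations.

```python
import pandas as pd

# Toy reviews (illustrative), including a missing text
toy = pd.DataFrame({"text": ["Love it!!", "Broke after two days.", None, "ok"]})

# Treat missing text as empty so every review gets a feature row
text = toy["text"].fillna("")
features = pd.DataFrame({
    "char_count": text.str.len(),
    "word_count": text.str.split().str.len(),
    "exclamation_count": text.str.count("!"),
    "has_text": text.str.strip().ne(""),
})
print(features)
```

Keeping missing reviews as zero-length rows, rather than dropping them, keeps the feature table aligned with the rating table.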

---
## 8. Verified Purchase Analysis

**For trust and quality signals**

In [None]:
if 'verified_purchase' in df.columns:
    verified_counts = df['verified_purchase'].value_counts()
    
    print("\nVerified Purchase Analysis:")
    print("=" * 50)
    print(verified_counts)
    print(f"\nVerified Purchase Rate: {verified_counts.get(True, 0) / len(df) * 100:.1f}%")
    
    # Compare ratings: verified vs non-verified
    fig, ax = plt.subplots(figsize=(10, 6))
    df.boxplot(column='rating', by='verified_purchase', ax=ax)
    ax.set_xlabel('Verified Purchase', fontsize=12, fontweight='bold')
    ax.set_ylabel('Rating', fontsize=12, fontweight='bold')
    ax.set_title('Rating Distribution: Verified vs Non-Verified', 
                 fontsize=14, fontweight='bold')
    plt.suptitle('')  # Remove default title
    ax.grid(alpha=0.3)
    plt.tight_layout()
    plt.savefig('../docs/figures/verified_purchase_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()
else:
    print(" No verified_purchase column found")
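
Because ratings take only five discrete values, a boxplot can collapse into a few overlapping lines. A small summary table is a useful companion. The sketch below builds one with a grouped aggregation on toy data; the values and the `share_5_star` column are illustrative assumptions.

```python
import pandas as pd

# Toy reviews (illustrative values)
toy = pd.DataFrame({
    "rating": [5, 4, 5, 1, 2, 5, 3, 5],
    "verified_purchase": [True, True, True, False, False, True, False, True],
})

# One row per group: count, central tendency, and share of 5-star ratings
summary = toy.groupby("verified_purchase")["rating"].agg(
    n="size",
    mean="mean",
    median="median",
    share_5_star=lambda s: (s == 5).mean(),
)
print(summary)
```

A gap between the groups in a table like this is descriptive only. It does not by itself show that unverified reviews are less trustworthy.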

---
## 9. Data Quality Summary & Recommendations

In [None]:
print("\n" + "=" * 70)
print("DATA QUALITY SUMMARY")
print("=" * 70)

print("\n Strengths:")
print(f"  • Dataset Size: {len(df):,} reviews (sufficient for deep learning)")
print(f"  • User Coverage: {df[user_id_col].nunique():,} unique users")
print(f"  • Item Coverage: {df[item_id_col].nunique():,} unique products")
print(f"  • Temporal Range: {(df['date'].max() - df['date'].min()).days if 'date' in df.columns else 'N/A'} days")

print("\n Challenges:")
print(f"  • Sparsity: {sparsity * 100:.2f}% (requires transfer learning)")
print(f"  • Rating Bias: {(df['rating'] >= 4).sum() / len(df) * 100:.1f}% positive ratings")
if 'text' in df.columns:
    print(f"  • Text Coverage: {df['text'].notna().sum() / len(df) * 100:.1f}% have review text")

print("\n Recommendations for XAE-Frame:")
print("  1. Cross-domain transfer: Leverage this dataset to improve cold-start in other domains")
print("  2. Hybrid approach: Combine collaborative + content-based (text) features")
print("  3. Temporal modeling: Use timestamps for drift detection triggers")
print("  4. Fairness monitoring: Check for bias across user segments (if demographic data available)")
print("  5. Text mining: Extract product attributes from reviews for explainability")

print("\n" + "=" * 70)
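
When no demographic columns are available, one possible proxy for the user-segment check in recommendation 4 is to segment users by activity level and compare basic statistics across segments. The sketch below does this on toy data. The segment boundaries and labels are arbitrary assumptions, and activity level is not a substitute for demographic fairness analysis.

```python
import pandas as pd

# Toy reviews (illustrative): user "a" is very active, "c", "d", "e" wrote one review each
toy = pd.DataFrame({
    "user_id": ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"],
    "rating":  [5, 4, 5, 4, 5, 5, 3, 4, 2, 1, 5, 2],
})

# Label each review with its author's activity level (thresholds are illustrative)
reviews_per_user = toy["user_id"].map(toy["user_id"].value_counts())
toy["segment"] = pd.cut(
    reviews_per_user,
    bins=[0, 1, 4, float("inf")],
    labels=["light (1)", "medium (2-4)", "heavy (5+)"],
)

segment_summary = toy.groupby("segment", observed=True).agg(
    n_users=("user_id", "nunique"),
    n_reviews=("rating", "size"),
    mean_rating=("rating", "mean"),
)
print(segment_summary)
```

The same grouping could later be reused to compare model error per segment once a recommender has been trained.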

---
## 10. Save Processed Summary Statistics

In [None]:
# Create summary statistics dictionary
summary_stats = {
    'dataset_size': len(df),
    'n_users': n_users,
    'n_items': n_items,
    'sparsity': sparsity,
    'avg_rating': df['rating'].mean(),
    'rating_std': df['rating'].std(),
    'date_range_days': (df['date'].max() - df['date'].min()).days if 'date' in df.columns else None,
    'avg_reviews_per_user': user_activity['review_count'].mean(),
    'avg_reviews_per_item': item_popularity['review_count'].mean(),
}

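# Save to JSON
import json
output_path = Path('../data/processed/eda_summary.json')
output_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if missing
with open(output_path, 'w') as f:
    json.dump(summary_stats, f, indent=2, default=str)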

print("\n Summary statistics saved to: data/processed/eda_summary.json")
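
A caveat with `default=str`: any value the JSON encoder cannot handle natively, such as a NumPy integer, is written as a quoted string rather than a number. The sketch below contrasts that with a small fallback that converts NumPy scalars to plain Python numbers. The helper `to_native` and the sample values are hypothetical.

```python
import json
import numpy as np

def to_native(value):
    """Fallback for json.dumps: convert NumPy scalars to plain Python numbers."""
    if isinstance(value, np.generic):
        return value.item()
    # Anything else that is not serializable falls back to its string form
    return str(value)

# Sample statistics (illustrative values) mixing NumPy and built-in types
stats = {"n_users": np.int64(1200), "avg_rating": np.float64(4.25), "n_items": 300}

as_str = json.dumps(stats, default=str)
as_native = json.dumps(stats, default=to_native)
print(as_str)
print(as_native)
```

Passing `default=to_native` to `json.dump` keeps counts numeric, which makes the summary easier to consume downstream.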

---
## Next Steps

Following this exploratory analysis, the subsequent phases of the XAE-Frame development include:

1. **Feature Engineering** - Creating explainable, domain-transferable features from raw data
2. **Baseline Model Training** - Implementing LightGBM/XGBoost recommendation models
3. **Explainability Integration** - SHAP value computation and multi-level explanation generation
4. **Adaptive Learning** - Drift detection and automated retraining mechanisms
5. **Fairness Monitoring** - Bias detection across user segments and demographic groups
6. **Cross-Domain Transfer** - Extending the framework to finance and insurance domains

---

**Note:** All visualizations are generated at 300 DPI resolution and saved in `docs/figures/` for use in academic publications and professional presentations.