# Customer Analytics: Exploratory Data Analysis (EDA)

## Project Overview
This notebook performs a comprehensive exploratory data analysis on the `customer_analytics.csv` dataset. The goal is to identify data quality issues, uncover patterns in customer behavior, and derive actionable insights.

### Key Objectives:
1. **Data Inspection**: Load and understand the dataset structure.
2. **Data Quality Assessment**: Identify missing values, duplicates, and outliers.
3. **Statistical Summary**: Examine distributions and central tendencies.
4. **Visual Analysis**: Use plots to discover relationships between variables.
5. **Insights Documentation**: Summarize findings as we go.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Set visualization style
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries imported successfully.")

## 1. Data Loading and Initial Inspection
We start by loading the dataset and taking a quick look at the first few rows.

In [None]:
# Path to the dataset
data_path = "../data/customer_analytics.csv"

# Load dataset; stop here if the file is missing so later cells don't fail on an undefined `df`
if not os.path.exists(data_path):
    raise FileNotFoundError(f"Dataset not found at {data_path}. Please run the generator first.")

df = pd.read_csv(data_path)
print(f"Dataset loaded with {df.shape[0]} rows and {df.shape[1]} columns.")

df.head()

## 2. Data Quality Assessment

### 2.1 Missing Values
Let's identify columns with missing data.

In [None]:
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
pd.DataFrame({'Missing Count': missing_values, 'Percentage (%)': missing_percentage.round(2)})
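
On a wide dataset the full table above can bury the columns that actually have gaps. A minimal sketch of a filtered, sorted variant follows; `missing_summary` and the toy frame are hypothetical names and values used only for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing count and percentage, only for columns that have gaps."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "Missing Count": counts,
        "Percentage (%)": (counts / len(frame) * 100).round(2),
    })
    # Keep only affected columns, worst first
    affected = summary[summary["Missing Count"] > 0]
    return affected.sort_values("Missing Count", ascending=False)


# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Age": [25, 40, 31, 58],
    "AnnualIncome": [52000.0, np.nan, 61000.0, np.nan],
    "Education": ["BSc", None, "MSc", "PhD"],
})
summary = missing_summary(toy)
print(summary)
```

In the notebook, the same helper would be called as `missing_summary(df)`.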

### 2.2 Duplicate Records
Checking for exact duplicate rows.

In [None]:
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows found: {duplicates}")
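
The count above only includes the extra copies, because `duplicated()` defaults to `keep='first'`. To look at the duplicated rows themselves before deciding what to drop, a sketch like the following may help. The toy frame and its `CustomerID` column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3],
    "Age": [25, 40, 40, 31],
})

# keep=False flags every member of a duplicate group, not just the repeats
dupe_rows = toy[toy.duplicated(keep=False)]
print(dupe_rows)

# The default keep='first' counts only the extra copies
n_extra = int(toy.duplicated().sum())

# Dropping duplicates keeps the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)
print(f"{n_extra} extra copy removed, {len(deduped)} rows remain")
```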

## 3. Univariate Analysis

### 3.1 Numerical Distributions
Understanding the spread of key numerical variables.

In [None]:
numerical_cols = ['Age', 'AnnualIncome', 'SpendingScore', 'YearsEmployed', 'LastPurchaseAmount']

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 12))
axes = axes.flatten()

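for i, col in enumerate(numerical_cols):
    sns.histplot(df[col], kde=True, ax=axes[i], color='teal')
    axes[i].set_title(f'Distribution of {col}')

# The 3x2 grid has six panels but only five variables; remove the empty one
for ax in axes[len(numerical_cols):]:
    ax.remove()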

plt.tight_layout()
plt.show()

**Findings:**
- `AnnualIncome` shows potential outliers on the high end.
- `Age` is relatively uniformly distributed across the range.
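
The findings above are read off the histograms. A numeric summary can back them up: a mean well above the median and a large positive skew are typical signs of a right tail. The sketch below uses a hypothetical toy frame; in the notebook the same calls would be applied to `df[numerical_cols]`.

```python
import pandas as pd

# Hypothetical toy data with one deliberately large income value
toy = pd.DataFrame({
    "Age": [22, 35, 41, 50, 63],
    "AnnualIncome": [30000, 42000, 45000, 51000, 250000],
})

# Central tendency and spread per column
stats = toy.describe().T[["mean", "50%", "std"]].rename(columns={"50%": "median"})

# Sample skewness: values near 0 suggest symmetry, large positive values a right tail
stats["skew"] = toy.skew()
print(stats.round(2))
```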

### 3.2 Categorical Columns
Checking the count of customers across different categories.

In [None]:
cat_cols = ['Gender', 'City', 'Education', 'MaritalStatus', 'PreferredDevice']

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 15))
axes = axes.flatten()

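for i, col in enumerate(cat_cols):
    sns.countplot(data=df, x=col, ax=axes[i], palette='viridis')
    axes[i].set_title(f'Count of {col}')
    axes[i].tick_params(axis='x', rotation=45)

# The 3x2 grid has six panels but only five variables; remove the empty one
for ax in axes[len(cat_cols):]:
    ax.remove()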

plt.tight_layout()
plt.show()
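
Count plots make it easy to compare bar heights but hard to read exact shares. As a complement, percentage shares per category can be tabulated with `value_counts`. The toy column below is hypothetical; its device labels are assumed and may not match the strings in the real data.

```python
import pandas as pd

# Hypothetical toy data; the real category labels may differ
toy = pd.DataFrame({
    "PreferredDevice": ["Mobile", "Laptop", "Mobile", "Tablet", None, "Mobile"],
})

# normalize=True gives proportions; dropna=False keeps missing values as their own row
shares = (
    toy["PreferredDevice"]
    .value_counts(normalize=True, dropna=False)
    .mul(100)
    .round(1)
)
print(shares)
```

Keeping `dropna=False` means categorical gaps show up alongside the real categories instead of being silently excluded from the percentages.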

## 4. Bivariate and Multivariate Analysis

### 4.1 Correlations
How are the variables related?

In [None]:
plt.figure(figsize=(10, 8))
correlation_matrix = df.select_dtypes(include=[np.number]).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
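
A heatmap becomes hard to scan as the number of variables grows. One way to complement it is to list each pair of variables once, ranked by the absolute value of the correlation. The helper name `top_correlations` and the toy values are assumptions made for this sketch.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def top_correlations(frame: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """List the n variable pairs with the largest absolute Pearson correlation."""
    corr = frame.select_dtypes(include=[np.number]).corr()
    # combinations() yields each unordered pair once and skips the diagonal
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=["var_a", "var_b", "r"])
    # Rank by strength but keep the sign in the reported value
    ranked = pairs.sort_values("r", key=lambda s: s.abs(), ascending=False)
    return ranked.head(n).reset_index(drop=True)


# Hypothetical toy data; the non-numeric column is ignored by select_dtypes
toy = pd.DataFrame({
    "Age": [25, 35, 45, 55, 65],
    "YearsEmployed": [2, 10, 20, 30, 41],
    "SpendingScore": [60, 20, 80, 40, 55],
    "Gender": ["F", "M", "F", "M", "F"],
})
top = top_correlations(toy, n=3)
print(top)
```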

### 4.2 Income vs Spending Score
Exploring the relationship between earnings and spending behavior.

In [None]:
sns.scatterplot(data=df, x='AnnualIncome', y='SpendingScore', hue='Gender', alpha=0.7)
plt.title("Annual Income vs Spending Score")
plt.show()

## 5. Outlier Detection
Using boxplots to visualize outliers in numerical data.

In [None]:
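# The variables are likely on very different scales (e.g. Age vs AnnualIncome),
# so each one gets its own axis instead of sharing a single y-range.
fig, axes = plt.subplots(nrows=1, ncols=len(numerical_cols), figsize=(15, 6))
for ax, col in zip(axes, numerical_cols):
    sns.boxplot(y=df[col], ax=ax)
    ax.set_title(col)
fig.suptitle("Boxplot of Numerical Variables")
plt.tight_layout()
plt.show()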
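
Boxplots show outliers visually. To put numbers on them, the same 1.5 × IQR rule that boxplot whiskers use by default can be applied per column. The following is a sketch under that assumption: the helper name `iqr_outlier_counts` and the toy values are hypothetical, and 1.5 is a convention rather than a threshold taken from this dataset.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, cols, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in cols:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        # Missing values compare as False on both sides, so they are not counted
        outside = (frame[col] < lower) | (frame[col] > upper)
        counts[col] = int(outside.sum())
    return pd.Series(counts, name="Outlier Count")


# Hypothetical toy data with one deliberately extreme income
toy = pd.DataFrame({
    "Age": [22, 35, 41, 50, 63, 29, 47],
    "AnnualIncome": [30000, 42000, 45000, 51000, 48000, 39000, 250000],
})
outlier_counts = iqr_outlier_counts(toy, ["Age", "AnnualIncome"])
print(outlier_counts)
```

In the notebook this would be called as `iqr_outlier_counts(df, numerical_cols)`.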

## 6. Conclusions and Insights

After performing this structured EDA, we have observed several key points:
1. **Messy Data**: The dataset contains ~5% missing values in `Education` and `AnnualIncome`, and some duplicate records that need cleaning.
2. **Outliers**: High-income outliers are clearly visible, which could affect machine learning models if not handled properly.
3. **Customer Segments**: The `AnnualIncome` vs `SpendingScore` scatterplot suggests that spending behavior may differ across income levels, which makes these two variables candidates for segmentation.
4. **Preferred Devices**: No single device dominates; the split across laptops, mobiles, and tablets can help set user experience priorities.

### Next Steps:
- **Data Cleaning**: Impute missing values and remove duplicates.
- **Feature Engineering**: Create new variables (e.g., spending efficiency).
- **Modeling**: Segment customers using clustering algorithms (like K-Means).
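
As a starting point for the data cleaning step, here is a minimal sketch that removes exact duplicates and then imputes missing values. Median imputation for numeric columns and mode imputation for categorical ones are assumptions chosen for simplicity, not a method this notebook has settled on. `basic_clean` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def basic_clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, then fill gaps with the median or the mode."""
    cleaned = frame.drop_duplicates().copy()
    for col in cleaned.columns:
        if not cleaned[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(cleaned[col]):
            # Median is less sensitive to high-end outliers than the mean
            fill_value = cleaned[col].median()
        else:
            modes = cleaned[col].mode()
            if modes.empty:
                continue  # column is entirely missing; leave it untouched
            fill_value = modes.iloc[0]
        cleaned[col] = cleaned[col].fillna(fill_value)
    return cleaned.reset_index(drop=True)


# Hypothetical toy data: one exact duplicate row and two missing values
toy = pd.DataFrame({
    "AnnualIncome": [50000.0, np.nan, 70000.0, 60000.0, 60000.0],
    "Education": ["BSc", "MSc", None, "BSc", "BSc"],
})
cleaned = basic_clean(toy)
print(cleaned)
```

Duplicates are dropped before imputation so that repeated rows do not pull the median or mode toward their own values.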