# Customer Shopping Behavior – Data Quality Analysis & Cleaning

This notebook demonstrates a **complete, reproducible data quality and cleaning workflow in Python**, designed for portfolio and GitHub sharing.

## 1. Setup Environment

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='whitegrid')

## 2. Import Dataset

In [None]:
df = pd.read_csv('Customer_Shopping_Behavior.csv')
df.head()

## 3. Initial Data Investigation

In [None]:
df.info()
df.describe(include='all')

## 4. Data Quality Analysis
### 4.1 Missing Values

In [None]:
missing = df.isna().mean()*100
missing.sort_values(ascending=False)

### 4.2 Duplicate Detection

In [None]:
df.duplicated().sum()

**Duplicate Keys Explanation:**

Duplicate keys occur when a supposedly unique identifier (e.g., CustomerID + Date) appears more than once, even if the remaining columns differ.


In [None]:
# Count rows whose CustomerID appears more than once (keep=False flags every copy).
df.duplicated(subset=['CustomerID'], keep=False).sum() if 'CustomerID' in df.columns else 'No key column detected'
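
The explanation above mentions a composite key such as CustomerID + Date. As a minimal sketch of that check, the block below uses a small hypothetical sample; the column names `CustomerID`, `Date`, and `Purchase_Amount` are assumptions and may not match the real dataset.

```python
import pandas as pd

# Hypothetical sample; 'CustomerID' and 'Date' are assumed column names.
sample = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 3],
    'Date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02', '2024-01-03'],
    'Purchase_Amount': [20.0, 25.0, 40.0, 15.0, 30.0],
})

key_cols = ['CustomerID', 'Date']

# keep=False flags every row that shares its key with another row,
# so both members of a duplicated pair are visible for review.
dup_mask = sample.duplicated(subset=key_cols, keep=False)
dup_rows = sample[dup_mask].sort_values(key_cols)

# Number of distinct key combinations that occur more than once.
n_dup_keys = sample.loc[dup_mask, key_cols].drop_duplicates().shape[0]

print(dup_rows)
print(f'{dup_mask.sum()} rows share {n_dup_keys} duplicated key(s)')
```

Reviewing the flagged rows before dropping anything helps decide whether they are true duplicates or legitimate repeat purchases.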

### 4.3 Inconsistency Detection

In [None]:
for col in df.select_dtypes(include='object').columns:
    print(col)
    print(df[col].value_counts().head())

**Consistency Between Columns:**
Related columns should agree with each other. For example, Age should fall inside the range named by Age_Group, and Income should not be negative.
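
The two rules above could be checked roughly as follows. This is a sketch under stated assumptions: the columns `Age`, `Age_Group`, and `Income`, the group labels, and the bin edges are all hypothetical and would need to be adapted to the actual data.

```python
import pandas as pd

# Hypothetical sample; 'Age', 'Age_Group' and 'Income' are assumed column names,
# and the bin edges below are illustrative, not taken from the dataset.
sample = pd.DataFrame({
    'Age': [22, 37, 45, 68],
    'Age_Group': ['18-29', '30-44', '30-44', '60+'],
    'Income': [32000, -500, 58000, 41000],
})

# Derive the group implied by Age, then compare it with the recorded group.
bins = [18, 30, 45, 60, 120]
labels = ['18-29', '30-44', '45-59', '60+']
expected_group = pd.cut(sample['Age'], bins=bins, labels=labels, right=False)

age_mismatch = expected_group.astype(str) != sample['Age_Group']
negative_income = sample['Income'] < 0

# Rows that violate at least one rule.
print(sample[age_mismatch | negative_income])
```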


### 4.4 Outlier Detection

In [None]:
num_cols = df.select_dtypes(include=np.number).columns
for col in num_cols:
    sns.boxplot(x=df[col])
    plt.title(col)
    plt.show()
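
Boxplots are easy to read for a handful of columns, but a numeric summary can complement them. The sketch below counts values outside the 1.5 × IQR fences per column; the sample data and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the column names are assumptions for illustration.
sample = pd.DataFrame({
    'Age': [25, 31, 38, 42, 47, 53, 29, 35],
    'Purchase_Amount': [20, 22, 25, 27, 30, 24, 26, 500],
})

numeric = sample.select_dtypes(include=np.number)

# Tukey fences: values beyond 1.5 * IQR from the quartiles are flagged.
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1
is_outlier = numeric.lt(q1 - 1.5 * iqr, axis=1) | numeric.gt(q3 + 1.5 * iqr, axis=1)

# One row per column: how many values were flagged, and what share of the rows.
outlier_summary = pd.DataFrame({
    'n_outliers': is_outlier.sum(),
    'pct_outliers': is_outlier.mean() * 100,
})
print(outlier_summary)
```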

## 5. Data Cleaning
### 5.1 Fix Data Types

In [None]:
# num_cols is already numeric, so this is only a safeguard; unparseable values become NaN.
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')

### 5.2 Handle Missing Values

In [None]:

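for col in num_cols:
    if df[col].isna().mean() < 0.2:
        # Mean for roughly symmetric columns, median when skewed in either direction.
        if abs(df[col].skew()) < 1:
            df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].fillna(df[col].median())
    else:
        # 20% or more missing: fill with 0 as a placeholder (this can distort the distribution).
        df[col] = df[col].fillna(0)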


**Forward Fill / Backward Fill Explanation:**

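Forward fill and backward fill copy the previous or next observed value into a gap. They suit time-series or otherwise ordered data, but not independent customer attributes, where neighbouring rows are unrelated.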
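
As an illustration of the two fill directions, here is a minimal sketch on a hypothetical daily price series. The `Date` and `Price` columns are assumptions and are not part of the customer dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series; the 'Date' and 'Price' columns are assumptions.
prices = pd.DataFrame({
    'Date': pd.to_datetime(['2024-01-03', '2024-01-01', '2024-01-02', '2024-01-04']),
    'Price': [np.nan, 10.0, np.nan, 12.0],
}).sort_values('Date').reset_index(drop=True)

# Forward fill carries the last known value forward;
# backward fill pulls the next known value back.
prices['Price_ffill'] = prices['Price'].ffill()
prices['Price_bfill'] = prices['Price'].bfill()
print(prices)
```

Sorting by date first matters: both methods depend on row order, which is why they are a poor fit for unordered customer records.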

### 5.3 Remove Duplicates

In [None]:
df = df.drop_duplicates()

### 5.4 Standardize Text Data

In [None]:

for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.strip().str.lower()


### 5.5 Treat Outliers

In [None]:

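# Compute every fence on the unfiltered data so the result does not depend on column order.
q1 = df[num_cols].quantile(0.25)
q3 = df[num_cols].quantile(0.75)
iqr = q3 - q1
for col in num_cols:
    df = df[df[col].between(q1[col] - 1.5 * iqr[col], q3[col] + 1.5 * iqr[col])]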


## 6. Clean, Analysis-Ready Dataset

In [None]:
df.to_csv('clean_customer_shopping_behavior.csv', index=False)

## 7. Trend Analysis

In [None]:

if {'Gender', 'Purchase_Amount'}.issubset(df.columns):
    df.groupby('Gender')['Purchase_Amount'].mean().plot(kind='bar', title='Average Purchase by Gender')
    plt.show()


## Conclusion
This notebook demonstrates an end-to-end data quality, cleaning, and analysis workflow. The cleaned CSV it exports can serve as input to downstream analytics or machine learning pipelines.