# Customer Behavior Analysis – Live Project Report

**Scenario:** Small online handmade store; scattered, noisy data.  
**Goal:** Extract insights and predict churn using public + internal data.

---
## 1. Setup and Data Extraction

In [None]:
import sys
from pathlib import Path
ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
sys.path.insert(0, str(ROOT))

from scripts.extract_synthetic import run as gen_synthetic
from config import DATA_RAW, DATA_PROCESSED

gen_synthetic(DATA_RAW)
print("Raw data:", list(DATA_RAW.glob("*.csv")))

## 2. Transform and Single Source of Truth

In [None]:
from scripts.transform import run_all

combined = run_all()
print(combined.shape)
combined.head()

## 3. Exploratory Data Analysis

In [None]:
from scripts.eda import run_eda

r = run_eda()
print("Rows:", r["rows"])
print("Plots:", r.get("plot_paths", []))
print("Top keywords:", list(r.get("top_keywords", {}).keys())[:8])
print("Practical insights:", r.get("practical_insights", {}))

## 4. Modeling: Churn and Segmentation

- **Churn:** Logistic Regression and Naive Bayes with K-Fold CV, scored on Precision, Recall, and F1. Interpretability comes from the Logistic Regression coefficients, reported as feature importance.
- **Segmentation:** K-Means (2–3 clusters) on order_count and total_amount.
- **Sentiment:** VADER in transform; optional pre-trained Hugging Face sentiment if installed.
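
The churn evaluation described above can be sketched as follows. This is a minimal illustration with scikit-learn, not the contents of `scripts.models`: the toy data, the labelling rule, and the names `models`, `results`, and `coefs` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features per customer and a binary churn label.
rng = np.random.default_rng(0)
n = 200
order_count = rng.poisson(3, size=n)
total_amount = order_count * rng.uniform(10, 40, size=n)
X = np.column_stack([order_count, total_amount]).astype(float)
# Assumed labelling rule, for illustration only: few orders -> more likely churned.
y = (order_count + rng.normal(0, 1, size=n) < 2.5).astype(int)

models = {
    # Scaling makes the logistic coefficients comparable across features.
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "naive_bayes": GaussianNB(),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["precision", "recall", "f1"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the five folds.
    results[name] = {m: float(scores[f"test_{m}"].mean()) for m in scoring}
    print(name, results[name])

# Fit once on all rows to read the coefficients of the standardized features.
fitted = models["logistic"].fit(X, y)
coefs = fitted.named_steps["logisticregression"].coef_[0]
print(dict(zip(["order_count", "total_amount"], coefs.round(3))))
```

A negative coefficient means that higher values of the feature lower the predicted churn probability. Because `order_count` and `total_amount` are correlated, the individual coefficients should be read with some caution.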

In [None]:
from scripts.models import run_all_models

summary = run_all_models()
print("Churn F1 (Logistic):", summary.get("churn", {}).get("logistic", {}).get("f1"))
print("Churn F1 (Naive Bayes):", summary.get("churn", {}).get("naive_bayes", {}).get("f1"))
print("Segmentation:", summary.get("segmentation", {}))
if summary.get("churn", {}).get("logistic", {}).get("feature_importance"):
    print("Feature importance (churn):", summary["churn"]["logistic"]["feature_importance"])

## 5. Key Insights and Recommendations

- **Sentiment:** Use VADER scores from feedback to prioritize negative reviews.
- **Churn:** Logistic Regression and Naive Bayes give interpretable signals; customers with more orders and a higher total amount are less likely to churn.
- **Segmentation:** K-Means on order_count/total_amount separates customers into value segments (low/high with two clusters, low/medium/high with three) for targeted campaigns.
- **Limitations:** The data is synthetic; a real deployment needs more data and could optionally add a pre-trained sentiment model (e.g. Hugging Face).
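
As a rough illustration of the segmentation step, the sketch below clusters customers on order_count and total_amount and then names the clusters by spend. The toy data, the scaling step, and the low/medium/high naming rule are assumptions for the example; the project's own logic lives in `scripts.models`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical customers drawn from three loose groups of order counts.
order_count = np.concatenate([
    rng.integers(1, 3, 40),
    rng.integers(4, 8, 40),
    rng.integers(10, 20, 20),
])
total_amount = order_count * rng.uniform(15, 30, size=order_count.size)
X = np.column_stack([order_count, total_amount]).astype(float)

# Scale first so the larger-valued feature does not dominate the distances.
X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X_scaled)

# K-Means cluster ids are arbitrary, so rank clusters by mean total_amount.
cluster_means = {c: X[labels == c, 1].mean() for c in range(3)}
ranked = sorted(cluster_means, key=cluster_means.get)
names = dict(zip(ranked, ["low", "medium", "high"]))
segments = np.array([names[c] for c in labels])

for seg in ["low", "medium", "high"]:
    print(seg, int((segments == seg).sum()))
```

Ranking by mean spend keeps the segment names stable across runs, since the raw cluster numbers can change with the random seed.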

## 6. Key Visualizations (2–3 Main Insights)

Below: customer sentiment distribution, segment distribution, and one relationship plot (scatter or trend).

In [None]:
from IPython.display import Image, display
from pathlib import Path

# Resolve the project root whether the notebook runs from notebooks/ or the repo root.
root = Path("..") if Path.cwd().name == "notebooks" else Path(".")
plots_dir = root / "data" / "processed" / "plots"
key_plots = ["sentiment_distribution.png", "segment_distribution.png", "scatter_wordcount_sentiment.png"]
for name in key_plots:
    p = plots_dir / name
    if p.exists():
        display(Image(filename=str(p.resolve()), width=500))
        print(p.name)
    else:
        # Report plots that have not been generated yet instead of skipping them silently.
        print(f"Missing plot: {p}")

## 7. Practical Insights (Answers to Business Questions)

- **What topics were discussed most?** → Top keywords from EDA.
- **What sentiment towards similar products?** → Sentiment distribution and mean compound score.
- **Any pattern in time of interactions?** → Peak month and trend line.
- **Which data source gave better insights?** → Counts by source_type (internal vs public).
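
The summaries behind these answers could be computed along the following lines. This is a sketch on a hypothetical six-row table; the column names `timestamp` and `compound` are assumptions (only `source_type` appears in this report), and the ±0.05 cut-offs are the thresholds commonly used with VADER compound scores.

```python
import pandas as pd

# Hypothetical combined table; the real one comes from the transform step.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-05", "2024-03-10", "2024-03-21",
        "2024-03-30", "2024-04-02", "2024-04-15",
    ]),
    "source_type": ["internal", "public", "public", "internal", "public", "public"],
    "compound": [0.6, -0.4, 0.1, 0.8, -0.7, 0.3],
})

def label(compound: float) -> str:
    """Map a compound score to a sentiment class using +/-0.05 cut-offs."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# Sentiment distribution and mean compound score.
df["sentiment"] = df["compound"].map(label)
sentiment_counts = df["sentiment"].value_counts().to_dict()
mean_compound = df["compound"].mean()

# Peak month: count interactions per calendar month and take the largest.
monthly = df.groupby(df["timestamp"].dt.to_period("M")).size()
peak_month = str(monthly.idxmax())

# Volume per data source (internal vs public).
source_counts = df["source_type"].value_counts().to_dict()

print(sentiment_counts, round(mean_compound, 3), peak_month, source_counts)
```

Row counts per source only show where the volume comes from; judging which source gives better insights would also need a per-source look at sentiment or keyword coverage.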

In [None]:
# Print practical insights from EDA (run EDA cell first)
insights = r.get("practical_insights", {})
for k, v in insights.items():
    print(f"{k}: {v}")

## 8. Project Story (Challenges, Approach, Findings, Recommendations)

**Challenges:** Noisy and scattered data (missing values, mixed formats); limited internal data; need for interpretable models.

**Approach:** Modular ETL (separate scripts per source); iterative cleaning and NLTK for text; single source of truth; robust models (Logistic, Naive Bayes, K-Means) with K-Fold CV and Precision/Recall/F1.

**Findings:** VADER sentiment helps prioritize negative feedback; churn is most strongly associated with order count and total amount; segmentation yields 2–3 actionable customer groups.

**Recommendations:** Target high-value segments for retention; use sentiment scores to avoid pushing campaigns to dissatisfied customers; collect more structured data and consider a pre-trained Hugging Face sentiment model for richer text analysis.

**Limitations and next steps:** The data is synthetic; a real deployment needs live APIs and a longer history. Next steps are gathering better data and, optionally, adding weather and event features.

In [None]:
# List the paths of all generated plots
from config import DATA_PROCESSED
plots_dir = DATA_PROCESSED / "plots"
if plots_dir.exists():
    for p in plots_dir.glob("*.png"):
        print(p)