# %% [markdown]
# ## Question 1
# **What is the difference between K-Means and Hierarchical Clustering? Provide a use case for each.**
#
# **Answer:**
# - **K-Means:** A partitioning method that assigns data to k clusters by minimizing within-cluster variance. It requires specifying k beforehand, is efficient on large datasets, and assumes roughly spherical clusters of similar size.
#   - *Use case:* Customer segmentation with a known/suspected number of groups and many data points (e.g., segmenting customers by purchase frequency and average order value).
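# - **Hierarchical Clustering:** Builds a tree of nested clusters (a dendrogram), either agglomeratively (bottom-up) or divisively (top-down). Agglomerative clustering starts with each point as its own cluster and repeatedly merges the two closest clusters. You do not need to fix the number of clusters in advance (you can cut the dendrogram at any level), but the standard algorithms typically need memory and time at least quadratic in the number of points, so the method scales poorly to large datasets.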
#   - *Use case:* Exploratory analysis of gene expression data where hierarchy and relationships between clusters are important.
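
# %% [markdown]
# A minimal sketch of the contrast described above, assuming small synthetic blobs and SciPy's `linkage`/`fcluster` for the hierarchical part (the data and the variable names `X_cmp`, `Z`, `hc_labels_3`, `hc_labels_5` are illustrative, not part of the assignment). K-Means needs k before fitting, whereas the merge tree is built once and can then be cut at several levels.

```python
# %%
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# illustrative data (assumption): three well-separated blobs
X_cmp, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

# K-Means: the number of clusters is fixed before fitting
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_cmp)

# Hierarchical: build the full merge tree once (Ward linkage) ...
Z = linkage(X_cmp, method='ward')

# ... then cut the same tree at different levels without refitting
hc_labels_3 = fcluster(Z, t=3, criterion='maxclust')
hc_labels_5 = fcluster(Z, t=5, criterion='maxclust')

print('K-Means clusters:      ', len(np.unique(km_labels)))
print('Tree cut, at most 3:   ', len(np.unique(hc_labels_3)))
print('Tree cut, at most 5:   ', len(np.unique(hc_labels_5)))
```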

# %% [markdown]
# ## Question 2
# **Explain the purpose of the Silhouette Score in evaluating clustering algorithms.**
#
# **Answer:**
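# - The Silhouette Score measures how similar a sample is to its own cluster compared to the nearest other cluster. For each sample, s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. It ranges from -1 to 1.
#   - Values near +1 indicate the sample is well matched to its cluster and far from neighboring clusters.
#   - Values near 0 indicate the sample lies on or near the boundary between two clusters (overlapping clusters).
#   - Negative values indicate the sample may have been assigned to the wrong cluster.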
# - It is useful to compare clustering quality across different algorithms or different numbers of clusters.
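
# %% [markdown]
# As a hedged illustration of comparing numbers of clusters, the sketch below computes the mean silhouette for several k on synthetic blobs and then inspects per-sample values for the best-scoring k. The data and the names `X_sil`, `sil_by_k`, `best_k` are assumptions made for this example only.

```python
# %%
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# illustrative data (assumption): four blobs
X_sil, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# mean silhouette for each candidate k (the score needs at least 2 clusters)
sil_by_k = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_sil)
    sil_by_k[k] = silhouette_score(X_sil, labels_k)
    print(f'k={k}: mean silhouette = {sil_by_k[k]:.3f}')

best_k = max(sil_by_k, key=sil_by_k.get)
print('k with the highest mean silhouette:', best_k)

# per-sample values show which points sit near a boundary or in the wrong cluster
best_labels = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X_sil)
per_sample = silhouette_samples(X_sil, best_labels)
print('Samples with negative silhouette:', int((per_sample < 0).sum()))
```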

# %% [markdown]
# ## Question 3
# **What are the core parameters of DBSCAN, and how do they influence the clustering process?**
#
# **Answer:**
# - **eps (epsilon):** The neighborhood radius. Points within this distance are considered neighbors.
# - **min_samples:** Minimum number of points (including the point itself) that must lie within eps of a point for it to count as a core point of a dense region.
# - Influence:
#   - Increasing **eps** yields larger neighborhoods → typically fewer, larger clusters and fewer noise points; if eps is too large, distinct clusters merge into one.
#   - Increasing **min_samples** makes it harder to form clusters → more points labeled as noise.
# - DBSCAN can find arbitrarily shaped clusters and handles noise, but struggles with varying densities.
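
# %% [markdown]
# The influence of the two parameters can be sketched on moon-shaped data. The helper `dbscan_summary` and the parameter grids below are hypothetical choices for illustration; the point is only that the noise count cannot increase when eps grows and cannot decrease when min_samples grows.

```python
# %%
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# illustrative data (assumption): two noisy half-moons
X_demo, _ = make_moons(n_samples=200, noise=0.1, random_state=42)

def dbscan_summary(eps, min_samples):
    """Return (number of clusters, number of noise points) for one setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_demo)
    n_clusters = len(set(labels.tolist()) - {-1})  # label -1 marks noise
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

# vary eps with min_samples fixed at 5
eps_results = {eps: dbscan_summary(eps, 5) for eps in (0.05, 0.2, 0.5)}
for eps, (n_c, n_n) in eps_results.items():
    print(f'eps={eps:<4} min_samples=5  -> clusters={n_c}, noise={n_n}')

# vary min_samples with eps fixed at 0.2
ms_results = {ms: dbscan_summary(0.2, ms) for ms in (3, 5, 15)}
for ms, (n_c, n_n) in ms_results.items():
    print(f'eps=0.2  min_samples={ms:<2} -> clusters={n_c}, noise={n_n}')
```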

# %% [markdown]
# ## Question 4
# **Why is feature scaling important when applying clustering algorithms like K-Means and DBSCAN?**
#
# **Answer:**
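# - Many clustering algorithms rely on distance metrics (e.g., Euclidean). Features with larger numeric ranges dominate the distance computation and bias cluster assignments; for example, annual income in dollars would swamp age in years.
# - Scaling (StandardScaler, MinMaxScaler) puts features on comparable ranges so that each contributes comparably to the distance. For DBSCAN this also matters because a single eps radius is applied across all features.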
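
# %% [markdown]
# A small worked example of the effect, using four made-up customers (age in years, annual income in dollars). The data and the helper `nearest_to_first` are assumptions for illustration. Before scaling, income differences decide which customer is nearest to the first one; after standardization, age matters as well and the nearest neighbor changes.

```python
# %%
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical customers: [age in years, annual income in dollars]
X_raw = np.array([
    [25, 50_000],   # customer 0 (reference)
    [60, 51_000],   # customer 1: much older, similar income
    [27, 54_000],   # customer 2: similar age, somewhat higher income
    [45, 90_000],   # customer 3: high earner
], dtype=float)

def nearest_to_first(A):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    d = np.linalg.norm(A[1:] - A[0], axis=1)
    return int(np.argmin(d)) + 1

X_std = StandardScaler().fit_transform(X_raw)

raw_nn = nearest_to_first(X_raw)
std_nn = nearest_to_first(X_std)
print('Nearest to customer 0 on raw features:   ', raw_nn)
print('Nearest to customer 0 on scaled features:', std_nn)
```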

# %% [markdown]
# ## Question 5
# **What is the Elbow Method in K-Means clustering and how does it help determine the optimal number of clusters?**
#
# **Answer:**
# - The Elbow Method fits K-Means for a range of k values, records the within-cluster sum of squares (inertia) for each, and plots inertia vs. k.
# - Inertia generally keeps decreasing as k grows, so its minimum is not informative. Instead, look for the point where adding another cluster stops reducing inertia substantially; this "elbow" suggests a reasonable k.
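
# %% [markdown]
# A minimal sketch of the computation behind the elbow plot, assuming synthetic blobs (the data and the names `X_elbow`, `inertias`, `drops` are illustrative). It prints the inertia for each k and the reduction gained by each added cluster; plotting `inertias` against `ks` gives the usual elbow curve.

```python
# %%
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# illustrative data (assumption): four blobs
X_elbow, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_elbow)
    inertias.append(model.inertia_)  # within-cluster sum of squares

# reduction in inertia gained by going from k-1 to k clusters
drops = -np.diff(inertias)

print(f'k=1: inertia={inertias[0]:.1f}')
for k, inertia, drop in zip(ks[1:], inertias[1:], drops):
    print(f'k={k}: inertia={inertia:.1f}  (reduction vs k-1: {drop:.1f})')
```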

# %% [markdown]
# ---
# ## Dataset instructions
# Use `make_blobs`, `make_moons`, and `sklearn.datasets.load_wine()` as specified in subsequent questions.

# %% [markdown]
# ## Question 6
# **Generate synthetic data using `make_blobs(n_samples=300, centers=4)`, apply KMeans clustering, and visualize the results with cluster centers.**

# %%
# Question 6 - Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# generate data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# apply KMeans
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
centers = kmeans.cluster_centers_

# plot
plt.figure(figsize=(8,6))
plt.scatter(X[:,0], X[:,1], c=labels, s=30, cmap='tab10', alpha=0.7)
plt.scatter(centers[:,0], centers[:,1], c='black', s=200, marker='X', label='Centers')
plt.title('Q6: KMeans on make_blobs (4 centers)')
plt.legend()
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

print('Cluster centers:\n', centers)

# %% [markdown]
# ## Question 7
# **Load the Wine dataset, apply `StandardScaler`, and then train a DBSCAN model. Print the number of clusters found (excluding noise).**

# %%
# Question 7 - Code
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

wine = load_wine(as_frame=True)
X_wine = wine.data

# scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_wine)

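# DBSCAN
# eps and min_samples are untuned starting values. With 13 standardized features,
# neighbor distances tend to be large, so a small eps can label most points as noise;
# inspect the label counts below and adjust eps if needed.
db = DBSCAN(eps=0.9, min_samples=5)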
labels_db = db.fit_predict(X_scaled)

# number of clusters (excluding noise label -1)
n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
print(f'Number of clusters found (excluding noise): {n_clusters}')

# print counts per label
unique, counts = np.unique(labels_db, return_counts=True)
print('Label counts:')
for lab, cnt in zip(unique, counts):
    print(f'  Label {lab}: {cnt}')

# %% [markdown]
# ## Question 8
# **Generate moon-shaped synthetic data using `make_moons(n_samples=200, noise=0.1)`, apply DBSCAN, and highlight the outliers in the plot.**

# %%
# Question 8 - Code
from sklearn.datasets import make_moons

X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=42)

# DBSCAN with the default Euclidean metric. eps=0.2 is a hand-picked value for this
# noise level; a k-distance plot is a common way to choose it more systematically.
db_moons = DBSCAN(eps=0.2, min_samples=5)
labels_moons = db_moons.fit_predict(X_moons)

# mask outliers
outliers_mask = labels_moons == -1

plt.figure(figsize=(8,6))
plt.scatter(X_moons[~outliers_mask,0], X_moons[~outliers_mask,1], c=labels_moons[~outliers_mask], s=40, cmap='tab10', alpha=0.7)
plt.scatter(X_moons[outliers_mask,0], X_moons[outliers_mask,1], c='red', s=60, marker='x', label='Outliers')
plt.title('Q8: DBSCAN on make_moons — Outliers highlighted')
plt.legend()
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

print('Unique labels (including -1 for noise):', set(labels_moons))
print('Number of outliers detected:', outliers_mask.sum())
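
# %% [markdown]
# The k-distance heuristic for choosing eps can be sketched as follows. This is an illustrative sketch rather than the procedure used for the value above; the names `X_kd` and `k_dist` and the percentiles printed are assumptions. Each point's distance to its k-th nearest neighbor is sorted, and a candidate eps is usually read off near the sharp bend ("knee") of that curve.

```python
# %%
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

# same kind of data as in Question 8 (regenerated here so the cell is self-contained)
X_kd, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
min_samples = 5

# Querying the training points themselves returns each point as its own first
# neighbor (distance 0), which matches DBSCAN counting the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_kd)
distances, _ = nn.kneighbors(X_kd)

# distance from every point to its min_samples-th neighbor, sorted ascending
k_dist = np.sort(distances[:, -1])

# a few percentiles give a rough picture of where the curve bends
for q in (50, 90, 95, 99):
    print(f'{q}th percentile of k-distance: {np.percentile(k_dist, q):.3f}')
```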

# %% [markdown]
# ## Question 9
# **Load the Wine dataset, reduce it to 2D using PCA, then apply Agglomerative Clustering and visualize the result in 2D with a scatter plot.**

# %%
# Question 9 - Code
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

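# reduce the standardized Wine features (X_scaled from Question 7) to 2D
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

# Agglomerative clustering (default Ward linkage) on the 2D projection
agg = AgglomerativeClustering(n_clusters=3)  # 3 matches the three cultivars in the Wine dataset
agg_labels = agg.fit_predict(X_pca)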

plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=agg_labels, s=40, cmap='tab10', alpha=0.8)
plt.title('Q9: Agglomerative Clustering on Wine (PCA -> 2D)')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.show()

# %% [markdown]
# ## Question 10
# **You are working as a data analyst at an e-commerce company. The marketing team wants to segment customers based on their purchasing behavior to run targeted promotions. The dataset contains customer demographics and their product purchase history across categories.**
#
# **Describe your real-world data science workflow using clustering:**
# - Which clustering algorithm(s) would you use and why?
# - How would you preprocess the data (missing values, scaling)?
# - How would you determine the number of clusters?
# - How would the marketing team benefit from your clustering analysis?
#
# *(Include Python code and output in the code box below.)*

# %%
# Question 10 - Answer + Example Code
# --------------------------------------------------
# 1) Algorithm choice (short):
# - Use KMeans as a baseline when clusters are expected to be roughly convex and similar in size; it is fast and easy to interpret.
# - Use DBSCAN if expecting noise or arbitrary-shaped clusters and want outlier detection.
# - Use Gaussian Mixture Models (GMM) if soft cluster memberships are desired.
# - Use hierarchical clustering for exploratory hierarchy.
#
# 2) Preprocessing:
# - Handle missing values: impute numeric features (median) and categorical features (mode) or use model-based imputation.
# - Encode categorical variables: OneHotEncoder for nominal, OrdinalEncoder if ordinal.
# - Scale numeric features: StandardScaler or MinMaxScaler depending on algorithm.
# - Feature engineering: create RFM features (Recency, Frequency, Monetary), CLTV estimates, or normalized category spend ratios.
#
# 3) Determine number of clusters:
# - Use Elbow Method, Silhouette Score, and domain knowledge. Also consider stability testing (bootstrap) and business constraints.
#
# 4) Business benefit:
# - Personalized promotions, targeted email campaigns, product recommendations, tailored discounts, and lifecycle marketing.
#
# Example pipeline (synthetic demonstration):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

# Synthetic demo dataset
np.random.seed(42)
N = 300
customer_demo = pd.DataFrame({
    'age': np.random.randint(18,70,size=N),
    'gender': np.random.choice(['M','F'], size=N, p=[0.48,0.52]),
    'annual_income': np.random.normal(50000,15000,size=N).astype(int),
})

# Generate RFM-like features
purchases = np.random.poisson(lam=5, size=N)
avg_order_value = np.abs(np.random.normal(100,50,size=N))
recency_days = np.random.randint(1,365,size=N)

customer_demo['purchases'] = purchases
customer_demo['avg_order_value'] = avg_order_value
customer_demo['recency_days'] = recency_days

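# pipeline: impute then scale numeric features; impute then one-hot encode gender
# (the synthetic data has no missing values, so the imputers are a no-op here)
num_features = ['age','annual_income','purchases','avg_order_value','recency_days']
cat_features = ['gender']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('impute', SimpleImputer(strategy='median')),
            ('scale', StandardScaler())
        ]), num_features),
        ('cat', Pipeline([
            ('impute', SimpleImputer(strategy='most_frequent')),
            ('encode', OneHotEncoder(drop='if_binary'))
        ]), cat_features)
    ]
)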

pipe = Pipeline([
    ('prep', preprocessor),
    ('kmeans', KMeans(n_clusters=4, random_state=42, n_init=10))
])

pipe.fit(customer_demo)
labels_demo = pipe.named_steps['kmeans'].labels_
customer_demo['segment'] = labels_demo

# Brief visualisation of segments over two principal components
from sklearn.decomposition import PCA
X_prep = pipe.named_steps['prep'].transform(customer_demo.drop(columns=['segment']))
X_2d = PCA(n_components=2, random_state=42).fit_transform(X_prep)

plt.figure(figsize=(8,6))
plt.scatter(X_2d[:,0], X_2d[:,1], c=labels_demo, s=40, cmap='tab10', alpha=0.8)
plt.title('Q10: Example Customer Segments (synthetic)')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.show()

# Show top-level segment summary
summary = customer_demo.groupby('segment').agg({
    'purchases':'mean',
    'avg_order_value':'mean',
    'recency_days':'mean',
    'annual_income':'mean',
    'age':'mean',
    'gender':lambda x: x.value_counts().index[0]
}).round(2)

print('\nSegment summary:')
print(summary)
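
# %% [markdown]
# The soft cluster memberships mentioned under algorithm choice can be sketched with a Gaussian Mixture Model. This is a hedged illustration on synthetic blobs, not part of the pipeline above; the data and the names `X_gmm`, `proba`, `uncertain` and the 0.9 threshold are assumptions. Each row of `predict_proba` gives a customer's membership probability for every segment, which can flag customers who sit between segments.

```python
# %%
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# illustrative data (assumption): three overlapping blobs
X_gmm, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)

gmm = GaussianMixture(n_components=3, random_state=42).fit(X_gmm)

# soft assignment: one probability per component, rows sum to 1
proba = gmm.predict_proba(X_gmm)
hard_labels = proba.argmax(axis=1)

# points whose strongest membership is below an arbitrary 0.9 threshold
uncertain = proba.max(axis=1) < 0.9
print('Membership probabilities of the first point:', np.round(proba[0], 3))
print('Points with no membership above 0.9:', int(uncertain.sum()))
```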

# --------------------------------------------------
# End of assignment notebook
