# Class 3: Exploring Data Distributions and Feature Selection

**Week 7: Unsupervised Learning and Advanced Data Analysis**

**Objective**: Learn to analyze data distributions and select meaningful features for better modeling.

**Agenda**:
- Explore data using histograms, box plots, and correlation analysis.
- Understand feature selection techniques: removing low-variance features and correlation-based selection.
- Demo: Analyze the mall customer dataset.
- Exercise: Visualize distributions and select features.

Let’s dive into understanding and refining our data!

## 1. Exploring Data Distributions

**Why Explore Data?**
- Understand the shape, spread, and patterns in your data.
- Identify outliers, skewness, or relationships between features.
- Inform preprocessing steps (e.g., scaling for PCA or clustering).

**Tools**:
- **Histograms**: Show the distribution of a single feature (e.g., how spending varies).
- **Box Plots**: Highlight median, quartiles, and outliers.
- **Correlation Analysis**: Measure relationships between features (e.g., does age correlate with income?).

**Key Concepts**:
- **Outliers**: Extreme values that may distort models.
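- **Correlation**: The Pearson correlation coefficient ranges from -1 (perfect negative linear relationship) through 0 (no linear relationship) to 1 (perfect positive linear relationship). A high absolute value between two features may indicate redundancy.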
- **Multicollinearity**: When features are highly correlated, they may add little unique information.
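
To make these concepts concrete, here is a minimal sketch that computes Pearson correlation directly from its definition and shows how a single extreme point can distort it. The arrays are made-up values for illustration (not taken from the mall dataset), and `pearson_r` is a hypothetical helper name.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed directly from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

# Made-up values for illustration only
income = np.array([15, 20, 25, 30, 35, 40, 45, 50], dtype=float)
income_in_dollars = income * 1000   # same information, different unit
spending = np.array([40, 35, 50, 45, 60, 55, 70, 65], dtype=float)

# A rescaled copy of a feature is perfectly correlated with it (redundant)
r_copy = pearson_r(income, income_in_dollars)
# Two different features that tend to rise together
r_clean = pearson_r(income, spending)

# Append one extreme point and recompute
income_out = np.append(income, 200.0)
spending_out = np.append(spending, 5.0)
r_outlier = pearson_r(income_out, spending_out)

print('income vs. rescaled copy:', round(r_copy, 3))
print('income vs. spending:     ', round(r_clean, 3))
print('with one outlier added:  ', round(r_outlier, 3))
```

In this toy data, the rescaled copy adds no new information, and the single extreme point is enough to flip the sign of the correlation, which is why it helps to look at distributions before trusting a correlation value.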

## 2. Feature Selection Techniques

**Why Select Features?**
- Reduce noise and improve model performance.
- Avoid redundancy (e.g., highly correlated features).
- Make models faster and easier to interpret.

**Methods Covered**:
- **Low-Variance Features**: Remove features with little variation (they don’t help distinguish data points). Variance depends on a feature’s units, so compare variances only on comparable scales.
- **Correlation-Based Selection**: Drop one of two highly correlated features to reduce redundancy.
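
As a rough illustration of the low-variance method, the sketch below applies `VarianceThreshold` to synthetic stand-in data before and after rescaling. The column names (`height_m`, `income_k`, `constant`) and the helper `kept_features` are assumptions made up for this example; they are not part of the mall dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

def kept_features(frame, threshold):
    """Return the column names whose variance exceeds the threshold."""
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(frame)
    return frame.columns[selector.get_support()].tolist()

# Synthetic stand-in data with hypothetical columns (not the mall dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'height_m': rng.normal(1.70, 0.05, 200),   # informative, but tiny numbers
    'income_k': rng.normal(60, 25, 200),       # informative, large numbers
    'constant': np.ones(200),                  # truly uninformative
})

raw_kept = kept_features(toy, threshold=0.01)

# Rescale every column to [0, 1] so variances are comparable
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)
scaled_kept = kept_features(scaled, threshold=0.01)

print('Kept on raw data:   ', raw_kept)
print('Kept on scaled data:', scaled_kept)
```

On the raw data, `height_m` is discarded only because it is measured in small units. After rescaling, only the constant column falls below the threshold. The threshold value itself remains a judgment call.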

**Applications**:
- Prepare data for clustering (Class 1) or PCA (Class 2).
- Simplify the mall customer dataset for our mini-project.

Let’s apply these ideas to real data!

## 3. Demo: Analyzing the Mall Customer Dataset

We’ll explore the mall customer dataset, visualize distributions, and select features to prepare for clustering.

**Dataset**: Contains customer data with features like age, annual income, and spending score.

**Setup**: Ensure libraries are installed:
```bash
pip install numpy pandas scikit-learn matplotlib seaborn
```

**Note**: Download `Mall_Customers.csv` from [Kaggle](https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python) or your course platform and place it in your working directory.

```python
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import VarianceThreshold

# Load the dataset
data = pd.read_csv('Mall_Customers.csv')

# Drop the ID column (an identifier, not a feature) and rename for clarity
data = data.drop(columns=['CustomerID'], errors='ignore')
data = data.rename(columns={'Annual Income (k$)': 'Income', 'Spending Score (1-100)': 'Spending'})

# Display first few rows
print(data.head())
```

```python
# Histograms
plt.figure(figsize=(12, 4))
for i, col in enumerate(['Age', 'Income', 'Spending'], 1):
    plt.subplot(1, 3, i)
    sns.histplot(data[col], bins=20, kde=True)
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()
```

```python
# Box plots
plt.figure(figsize=(12, 4))
for i, col in enumerate(['Age', 'Income', 'Spending'], 1):
    plt.subplot(1, 3, i)
    sns.boxplot(y=data[col])
    plt.title(f'Box Plot of {col}')
plt.tight_layout()
plt.show()
```

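```python
# Correlation analysis
# Keep only numeric columns: recent pandas versions raise an error
# if corr() is called on a frame that still contains text columns
corr = data.select_dtypes(include='number').corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Correlation Heatmap')
plt.show()
```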

```python
# Feature selection: Remove low-variance features
# The threshold is arbitrary, and variance depends on each feature's scale
selector = VarianceThreshold(threshold=0.01)
X = data[['Age', 'Income', 'Spending']]  # Numeric features
selector.fit(X)
selected_features = X.columns[selector.get_support()]
print('Features after variance threshold:', selected_features.tolist())

# Correlation-based selection (manual example)
high_corr = 0.7
corr_matrix = X.corr().abs()
# Keep only the upper triangle so each feature pair is checked once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Flag the later feature of any pair whose absolute correlation exceeds the cutoff
to_drop = [column for column in upper.columns if any(upper[column] > high_corr)]
print('Features to drop due to high correlation:', to_drop)
```

**Discussion**:
- **Histograms**: What do the distributions tell us? (e.g., Is spending skewed?)
- **Box Plots**: Are there outliers in income or spending?
- **Correlation**: Are any features strongly correlated? How might this affect clustering?
- **Feature Selection**: Did we drop any features? Why or why not?
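
To back up the visual impressions with numbers, one option is a small summary of skewness and outlier counts. The sketch below is only an illustration: `distribution_summary` is a hypothetical helper, the toy columns are made up, and the 1.5 × IQR rule is the convention that box plot whiskers use by default.

```python
import pandas as pd

def distribution_summary(frame):
    """Return skewness and the count of IQR-rule outliers per numeric column."""
    numeric = frame.select_dtypes(include='number')
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr   # lower whisker limit
    upper = q3 + 1.5 * iqr   # upper whisker limit
    n_outliers = ((numeric < lower) | (numeric > upper)).sum()
    return pd.DataFrame({'skew': numeric.skew(), 'n_outliers': n_outliers})

# Made-up columns for illustration (not the mall dataset)
toy = pd.DataFrame({
    'Symmetric': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'RightSkewed': [1, 1, 2, 2, 3, 3, 4, 5, 40],
})
summary = distribution_summary(toy)
print(summary)
```

Calling `distribution_summary(data)` on the demo DataFrame should give one row per numeric column. A positive skew value suggests a longer right tail, and a negative value a longer left tail.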

## 4. Exercise: Explore and Select Features

Your turn! Analyze the mall customer dataset and select features.

**Task**:
- Create histograms and box plots for Age, Income, and Spending.
- Generate a correlation heatmap.
- Apply low-variance feature selection and check for highly correlated features.
- Interpret what you find.

**Instructions**:
1. Use the code below to load the data.
2. Create visualizations (histograms, box plots, heatmap).
3. Perform feature selection.
4. Answer: Which features would you keep for clustering? Why?

```python
# Load data (same as demo)
data_ex = pd.read_csv('Mall_Customers.csv')
data_ex = data_ex.drop(columns=['CustomerID'], errors='ignore')
data_ex = data_ex.rename(columns={'Annual Income (k$)': 'Income', 'Spending Score (1-100)': 'Spending'})

# Your code: Histograms
plt.figure(figsize=(12, 4))
for i, col in enumerate(['Age', 'Income', 'Spending'], 1):
    plt.subplot(1, 3, i)
    sns.histplot(data_ex[col], bins=20, kde=True)
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()
```

```python
# Your code: Box plots
plt.figure(figsize=(12, 4))
for i, col in enumerate(['Age', 'Income', 'Spending'], 1):
    plt.subplot(1, 3, i)
    sns.boxplot(y=data_ex[col])
    plt.title(f'Box Plot of {col}')
plt.tight_layout()
plt.show()
```

```python
# Your code: Correlation heatmap
# Keep only numeric columns so corr() does not fail on text columns
corr_ex = data_ex.select_dtypes(include='number').corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr_ex, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Your Correlation Heatmap')
plt.show()
```

```python
# Your code: Feature selection
# Low-variance
selector_ex = VarianceThreshold(threshold=0.01)
X_ex = data_ex[['Age', 'Income', 'Spending']]
selector_ex.fit(X_ex)
selected_features_ex = X_ex.columns[selector_ex.get_support()]
print('Selected features (variance):', selected_features_ex.tolist())

# Correlation-based
corr_matrix_ex = X_ex.corr().abs()
upper_ex = corr_matrix_ex.where(np.triu(np.ones(corr_matrix_ex.shape), k=1).astype(bool))
to_drop_ex = [column for column in upper_ex.columns if any(upper_ex[column] > 0.7)]
print('Features to drop (correlation):', to_drop_ex)
```

**Your Interpretation**:
- What do the histograms and box plots reveal?
- Are there strong correlations? Should we drop any features?
- Which features would you keep for clustering? Why?

## 5. Wrap-Up

**Key Takeaways**:
- Histograms and box plots reveal distributions and outliers.
- Correlation analysis identifies relationships and redundancy.
- Feature selection (low-variance, correlation-based) removes uninformative and redundant features before modeling.

**Discussion Questions**:
- What surprised you about the data’s distributions?
- How might outliers affect k-means or PCA?
- Which features seem most important for customer segmentation?

**Homework**:
- Apply feature selection to the mall customer dataset.
- Prepare a clean dataset (selected features) for clustering in Class 4.
- Think about how these features might form clusters.
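
One possible way to approach the dataset-preparation step is sketched below. It assumes the renamed columns from the demo, uses a tiny made-up frame in place of the real data, and standardizes the chosen columns because clustering is sensitive to scale. The helper `prepare_for_clustering`, the feature choice, and the output file name are all hypothetical. Which features to keep is your decision.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def prepare_for_clustering(frame, features):
    """Keep the chosen columns, drop rows with missing values, and standardize."""
    subset = frame[features].dropna()
    scaled = StandardScaler().fit_transform(subset)
    return pd.DataFrame(scaled, columns=features, index=subset.index)

# Tiny made-up frame standing in for the renamed mall data
toy = pd.DataFrame({
    'Age': [19, 21, 35, 52, 64],
    'Income': [15, 16, 40, 88, 120],
    'Spending': [39, 81, 55, 20, 14],
})

# The feature list here is only an example, not a recommendation
clean = prepare_for_clustering(toy, ['Income', 'Spending'])
print(clean.round(2))

# To save the result, something like the following would work
# (hypothetical file name):
# clean.to_csv('mall_selected.csv', index=False)
```

Each returned column has mean 0 and unit variance, so no single feature dominates distance calculations purely because of its units.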

Great job exploring data! Next, we’ll tie it all together with clustering and cross-validation.