<a href="https://colab.research.google.com/github/maahi0401/About/blob/main/Interview_Prep_4/interview_prep_simulation_scenarios_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 4: Feature Engineering & Dimensionality Reduction

## 🎯 Interview Simulation Scenarios
**Context:** This section transitions from theory to application, simulating the pressure and problem-solving flow of a technical data science interview.

### 🛠️ Strategic Approach for Scenarios
When tackling these problems, keep the **"STAR"** method in mind for your explanations:
* **Situation:** Briefly describe the data problem.
* **Task:** Identify which feature engineering/reduction technique is needed.
* **Action:** Write the code and explain your parameter choices (e.g., why `StandardScaler`?).
* **Result:** Interpret the output (e.g., "This reduced our feature space by 60% while retaining 95% variance").

### 🟢 Warm-up: The Scaling Decision (5-10 min)
**Scenario:** You are given a dataset with two features: `Age` (0–100) and `Annual Income` ($0–$1,000,000). You plan to use K-Means clustering.
* **Problem:** What happens if you don't scale? Which scaler would you choose?
* **Goal:** Demonstrate an understanding of distance-based algorithms.
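A minimal sketch of why scaling matters for a distance-based algorithm such as K-Means. The customer values and the choice of `StandardScaler` are illustrative assumptions, not part of the scenario; the point is only to show how much of the squared Euclidean distance comes from `Annual Income` before and after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical customers: columns are [Age, Annual Income]
X = np.column_stack([
    rng.uniform(18, 90, 500),
    rng.uniform(0, 1_000_000, 500),
])

# Two customers 50 years apart in age but only $2,000 apart in income
a = np.array([[20.0, 50_000.0]])
b = np.array([[70.0, 52_000.0]])

# Share of the squared Euclidean distance contributed by income (raw units)
raw_diff = (a - b).ravel()
raw_income_share = raw_diff[1] ** 2 / np.sum(raw_diff ** 2)

# Same share after standardizing both features on the full dataset
scaler = StandardScaler().fit(X)
scaled_diff = (scaler.transform(a) - scaler.transform(b)).ravel()
scaled_income_share = scaled_diff[1] ** 2 / np.sum(scaled_diff ** 2)

print(f"Income share of distance (raw):    {raw_income_share:.4f}")
print(f"Income share of distance (scaled): {scaled_income_share:.6f}")
```

Without scaling, the large age gap is almost invisible to the distance metric, so clusters would effectively be formed on income alone.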

### 🟡 Intermediate: The Redundancy Filter (10-15 min)
**Scenario:** You have a dataset with 50 features. A quick correlation matrix shows that 10 features have a correlation > 0.95 with each other.
* **Problem:** Implement a function to automatically identify and drop one feature from each highly correlated pair.
* **Goal:** Show efficiency in automated feature selection.
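One common way to approach this is to scan the upper triangle of the absolute correlation matrix. The sketch below assumes a numeric `pandas` DataFrame; the function name `drop_highly_correlated` and the demo columns are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.95):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Tiny demo: "a_copy" is a near-duplicate of "a", while "b" is independent
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base,
    "a_copy": base + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

reduced, dropped = drop_highly_correlated(demo)
print("Dropped:", dropped)
print("Remaining:", reduced.columns.tolist())
```

This greedy rule always keeps the first column of a correlated pair. In an interview it is worth mentioning that a smarter tie-break (for example, keeping the feature more strongly related to the target) may be preferable.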

### 🟠 Advanced: PCA for High-Dimensional Noise (15-20 min)
**Scenario:** You are working with genomic data (thousands of features, few samples). The model is overfitting severely.
* **Problem:** Use PCA to reduce the feature space. How do you determine if you've removed "noise" or "signal"?
* **Goal:** Defend the use of a Scree Plot and cumulative variance thresholds.

### 🔴 Challenge: The Production Pipeline (20-30 min)
**Scenario:** You've built a t-SNE visualization that shows perfect class separation. Your manager asks you to put this t-SNE transformation into the real-time production pipeline for new incoming data.
* **Problem:** Explain why this is or isn't possible. Propose an alternative architecture (e.g., using PCA or a small Autoencoder).
* **Goal:** Demonstrate architectural knowledge and awareness of t-SNE's limitations.
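A short sketch of the core limitation, assuming scikit-learn's implementations: `TSNE` offers `fit` and `fit_transform` but no `transform`, so a fitted embedding cannot be applied to new rows, whereas a fitted `PCA` can. The random data below is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 20))  # hypothetical historical data
X_new = rng.normal(size=(5, 20))      # hypothetical incoming production rows

# scikit-learn's TSNE cannot project unseen samples
tsne_has_transform = hasattr(TSNE(), "transform")
print("TSNE has transform():", tsne_has_transform)

# PCA learns a fixed linear projection that can be reused at inference time
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))
X_new_2d = pca.transform(scaler.transform(X_new))
print("Projected new rows:", X_new_2d.shape)
```

The same reasoning applies to an autoencoder: its encoder is a learned function that can be called on new inputs, which is what a real-time pipeline needs.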


## Setup

Run this cell first to import all necessary libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import mutual_info_classif, RFE, SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris, load_wine, fetch_california_housing

import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
print("✓ Setup complete!")

✓ Setup complete!


In [None]:
# Buggy code from junior data scientist
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
X = iris.data
y = iris.target

# They did this:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Then later, on test data:
X_test = X[:10]  # Pretend this is test data
X_test_scaled = scaler.fit_transform(X_test)  # ❌ MISTAKE HERE

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_test_scaled)
print("Transformed test data shape:", X_pca.shape)

# Problem 1.1: Detect the Scaling Mistake 🟢

## 🎯 Interview Prep: Debugging Scenario
**Context:** Identifying common pitfalls in the data preprocessing pipeline that lead to data leakage or model degradation.

### 📚 The Buggy Code
Imagine a junior developer presents the snippet above for scaling a training and test set.

**Your Task:**

1. Identify the mistake in the code above
2. Explain WHY it's wrong
3. Fix it

**Write your answer below:**

In [None]:
# Your answer:

# 1. What's the mistake?

# ANSWER:

# 2. Why is it wrong?

# ANSWER:

# 3. Fixed code:

# Solution: Problem 1.1 - The Scaling Mistake 🟢

## 1. Identify the Mistake
The mistake in the code is the use of **`.fit_transform(X_test)`**.

## 2. Explain WHY it is wrong
In a technical interview, you want to frame this through the lens of **Data Leakage**:

* **Learning Unseen Statistics:** When you call `fit_transform()` on the test set, the scaler calculates a *new* mean ($\mu$) and standard deviation ($\sigma$) based solely on the test data.
* **The Golden Rule:** The test set must simulate "future," unseen data. In a real-world production environment, you wouldn't have the entire "future" dataset to calculate a mean from.
* **Inconsistent Transformation:** If the training set and test set have slightly different distributions, the same raw value (e.g., a "Height" of 180cm) would be mapped to different scaled values (e.g., 0.5 vs 0.7). This confuses the model because the numerical inputs no longer represent the same physical reality it learned during training.


## 3. The Fix
You should **`fit`** your scaler on the training data only. This "locks in" the parameters. You then apply those parameters to any other data (test set or new production samples) using **`transform`**.

```python
# The Correct Implementation
scaler = StandardScaler()

# 1. Fit to training data AND transform it
X_train_scaled = scaler.fit_transform(X_train)

# 2. ONLY transform the test data (using parameters from X_train)
X_test_scaled = scaler.transform(X_test)
```
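To make the inconsistency concrete, the following sketch scales the same raw value with a scaler fitted on the training split and with one incorrectly refitted on the test split. The "height" data is simulated and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=170, scale=10, size=(200, 1))  # hypothetical heights in cm
X_train, X_test = train_test_split(X, test_size=0.1, random_state=42)

value = np.array([[180.0]])

# Correct: parameters come from the training data only
train_scaler = StandardScaler().fit(X_train)
correct = train_scaler.transform(value)[0, 0]

# Incorrect: refitting on the small test split yields different parameters
test_scaler = StandardScaler().fit(X_test)
wrong = test_scaler.transform(value)[0, 0]

print(f"180 cm with train-fitted scaler: {correct:.3f}")
print(f"180 cm with test-fitted scaler:  {wrong:.3f}")
```

The two outputs differ even though the input is identical, which is exactly the mismatch the model would see at prediction time.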

# Problem 1.2: The Dummy Variable Trap 🟢

## 🎯 Interview Prep: Encoding Logic
**Context:** Understanding why "less is more" when converting categorical strings into numerical features for linear models.


### 📚 The Core Question
> **Scenario:** You are encoding a categorical variable with **5 categories** (e.g., "Monday" through "Friday"). How many dummy variables should you create?

**The Short Answer:** You should create **4** dummy variables.


### 🔍 Explaining the "Dummy Variable Trap"
In an interview, you should explain the **why** behind the $N-1$ rule:

1.  **Perfect Multicollinearity:** If you include all $5$ columns, one column can be predicted perfectly by the others. For example, if it is not Monday, Tuesday, Wednesday, or Thursday, it *must* be Friday.
2.  **Mathematical Conflict:** In linear regression with an intercept, the five dummy columns sum to the intercept column, so the design matrix is rank-deficient and $X^\top X$ is singular (non-invertible). This prevents the algorithm from finding a unique solution for the coefficients.
3.  **The Reference Category:** The category you drop becomes the "reference" or "baseline." The coefficients of the remaining 4 variables then represent the difference from that baseline.
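The rank argument can be checked numerically. This is a small sketch with a made-up `Weekday` column; the helper `with_intercept` is hypothetical and simply prepends a column of ones.

```python
import numpy as np
import pandas as pd

days = pd.Series(["Mon", "Tue", "Wed", "Thu", "Fri"] * 4, name="Weekday")

full = pd.get_dummies(days, dtype=float)                      # 5 dummy columns
reduced = pd.get_dummies(days, drop_first=True, dtype=float)  # 4 dummy columns


def with_intercept(dummies):
    """Prepend an intercept column of ones to the dummy matrix."""
    return np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])


X_full = with_intercept(full)        # 6 columns in total
X_reduced = with_intercept(reduced)  # 5 columns in total

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(f"All dummies: {X_full.shape[1]} columns, rank {rank_full}")
print(f"N-1 dummies: {X_reduced.shape[1]} columns, rank {rank_reduced}")
```

With all five dummies the matrix has more columns than its rank, so ordinary least squares has no unique solution; dropping one dummy restores full column rank.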


### 🛠️ Implementation Fix
When using `pandas` or `scikit-learn`, you must explicitly trigger the drop:

```python
# Using Pandas
df_encoded = pd.get_dummies(df, columns=['Weekday'], drop_first=True)

# Using Scikit-Learn
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop='first', sparse_output=False)
```

In [None]:
# Dataset
data = pd.DataFrame({
    'City': ['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix', 'NYC', 'LA', 'Chicago'],
    'Sales': [100, 150, 120, 130, 110, 95, 155, 125]
})

print(f"Unique cities: {data['City'].nunique()}")

# Correct Encoding to avoid the trap
encoded_data = pd.get_dummies(data, columns=['City'], drop_first=True)
print("\nEncoded Data (with drop_first=True):")
print(encoded_data.head())

# Solution: Problem 1.2 - The Dummy Variable Trap 🟢

## 1. Implementation: The Wrong vs. Correct Way

Using the dataset provided, here is how both approaches look in practice:

```python
# 1. The WRONG way (including all columns)
wrong_way = pd.get_dummies(data, columns=['City'], drop_first=False)

# 2. The CORRECT way (avoiding the trap)
correct_way = pd.get_dummies(data, columns=['City'], drop_first=True)

print("Wrong Way columns:", wrong_way.columns.tolist())
print("Correct Way columns:", correct_way.columns.tolist())
```

In [None]:
# Includes all 5 city columns
wrong_way = pd.get_dummies(data, columns=['City'], drop_first=False)

print("Columns created (Wrong):", [col for col in wrong_way.columns if 'City' in col])
# Result: ['City_Chicago', 'City_Houston', 'City_LA', 'City_NYC', 'City_Phoenix']

# 🟢 Problem 1.3: Detecting and Handling Outliers

## 🎯 Interview Prep: Scaler Selection
**Context:** Choosing the right scaling strategy when your data is "dirty" or contains extreme values that might skew a standard distribution.

### 📚 The Scenario
> **Interview Question:** "If your dataset has significant outliers, would you still use `StandardScaler`? What are the alternatives?"

#### 1. The Problem with `StandardScaler`
`StandardScaler` uses the **Mean** and **Standard Deviation**. Both of these metrics are highly sensitive to outliers. A single extreme value can "pull" the mean away from the center of the data and inflate the standard deviation, causing the majority of your data to be squeezed into a very small range after scaling.

#### 2. The Alternative: `RobustScaler`
`RobustScaler` is specifically designed for this. Instead of the mean, it uses the **Median**. Instead of standard deviation, it uses the **Interquartile Range (IQR)**.
* **Median:** The middle value (unaffected by how high the highest value is).
* **IQR:** The range between the 25th and 75th percentiles (captures the "core" of your data).

### 🛠️ Code Comparison

```python
from sklearn.preprocessing import RobustScaler, StandardScaler

# Simulated data with a massive outlier
X = np.array([10, 12, 11, 13, 12, 11, 1000]).reshape(-1, 1)

# StandardScaler result
ss = StandardScaler().fit_transform(X)
# The six typical values all collapse to about -0.41, while 1000 becomes about 2.45

# RobustScaler result
rs = RobustScaler().fit_transform(X)
# Typical values spread over about -1.33 to 0.67 (median 12, IQR 1.5),
# while 1000 maps to about 658.7 and clearly stands out as a massive outlier
```

# Problem 1.3: Quick Imputation Decision 🟢

## 🎯 Interview Prep: Handling Missingness
**Context:** Identifying the best recovery strategy for missing data based on the variable type and distribution.


### 📚 The Decision Matrix

> **Interview Question:** *"How do you decide which imputation method to use for a specific feature?"*

In an interview, don't just say "fill with the mean." Use the following logic to demonstrate a nuanced understanding:

| Feature Type | Distribution | Best Imputation Method | Why? |
| :--- | :--- | :--- | :--- |
| **Numeric** | Normal (Symmetric) | **Mean** | The average is the best central representation of Gaussian data. |
| **Numeric** | Skewed / Outliers | **Median** | The median is robust and isn't pulled by extreme values. |
| **Categorical** | Any | **Mode** (Most Frequent) | You cannot average "New York" and "LA"; you pick the most common class. |
| **Any** | Pattern-based | **KNN / MICE** | Uses other features to "predict" the missing value (Advanced). |

### 🛠️ Strategic Decision Flow

1.  **Is it Categorical?** → Use **Mode** or create a new category called **"Missing"**.
2.  **Is it Numeric?** → Check for skewness or outliers.
    * No outliers? → **Mean**.
    * Significant outliers? → **Median**.
3.  **Is the missingness "Not at Random" (MNAR)?**
    * If the fact that it's missing is a signal (e.g., people with high debt don't report it), add a **Binary Indicator Column** (`is_missing`) to tell the model that the data was absent.
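The indicator-column idea from step 3 can be sketched with scikit-learn's `SimpleImputer`, whose `add_indicator=True` option appends a binary "was missing" column. The `debt` values are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical "reported debt" feature with two unreported values
debt = np.array([[1200.0], [np.nan], [800.0], [np.nan], [1500.0]])

# Median imputation plus an extra binary column flagging imputed rows
imputer = SimpleImputer(strategy="median", add_indicator=True)
debt_imputed = imputer.fit_transform(debt)

print(debt_imputed)
# Column 0: imputed values, column 1: 1.0 where the original value was missing
```

The model then receives both a usable numeric value and the signal that the value was originally absent.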


### 🚀 Interview Soundbite
> "My choice depends on the data distribution. For normally distributed numeric data, I use the **mean**. However, if the data is skewed or has outliers, I prefer the **median** because it's more robust. For categorical data, I default to the **mode** or create a dedicated 'Missing' category to preserve the signal that the information was unavailable."


In [None]:
# IMPLEMENTATION STRATEGY
from sklearn.impute import SimpleImputer

# Dataset 1 (Normal) -> Mean
imputer_mean = SimpleImputer(strategy='mean')
data1_imputed = imputer_mean.fit_transform(data1.reshape(-1, 1))

# Dataset 2 (Skewed) -> Median
imputer_median = SimpleImputer(strategy='median')
data2_imputed = imputer_median.fit_transform(data2.reshape(-1, 1))

# Solution: Problem 1.3 - Imputation Decision Matrix 🟢

## 🎯 Interview Prep: Distribution-Based Imputation
**Context:** Demonstrating that your preprocessing steps are grounded in the statistical properties of the feature, rather than a "one-size-fits-all" approach.


### 📚 The Decision Matrix

| Dataset | Best Method | Reasoning |
| :--- | :--- | :--- |
| **Dataset 1** | **Mean** | This is a **Normal (Gaussian) distribution**. In symmetric data without heavy tails, the mean is the most statistically efficient estimator of the central tendency. |
| **Dataset 2** | **Median** | This is **Heavily Right-Skewed (Exponential)**. The mean would be "pulled" to the right by the long tail, leading to biased imputation. The median is robust to skewness and outliers. |
| **Dataset 3** | **Median (or Mean)** | This is a **Uniform distribution**. While both are mathematically similar here, **Median** is often the safer "production" choice as it remains robust if future incoming data contains unexpected extreme values. |


### 🔍 Deep Dive: Why the choice matters
In an interview, use these specific justifications to show depth:

* **For Dataset 2 (Skewed):** Explain that using the mean would result in **imputing values that are too high** relative to the majority of the data. This creates a "bump" in the distribution that doesn't actually exist in the population.
* **The Outlier Factor:** If Dataset 1 suddenly had an extreme outlier (e.g., a value of 500), the **Mean** would shift significantly, while the **Median** would remain stable. This is why many practitioners default to Median for numeric data unless they are certain the data is strictly Gaussian.
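A quick sketch of the skewed case, using a simulated exponential sample as a stand-in for Dataset 2 (the actual datasets are not shown here, so the scale and sample size are assumptions). It compares the fill value each strategy would use.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)

# Simulated right-skewed feature with 100 missing entries
skewed = rng.exponential(scale=50_000, size=1_000).reshape(-1, 1)
skewed[:100] = np.nan

mean_fill = SimpleImputer(strategy="mean").fit(skewed).statistics_[0]
median_fill = SimpleImputer(strategy="median").fit(skewed).statistics_[0]

# Fraction of observed values that lie below the mean fill value
observed = skewed[~np.isnan(skewed)]
share_below_mean = np.mean(observed < mean_fill)

print(f"Mean fill value:   {mean_fill:,.0f}")
print(f"Median fill value: {median_fill:,.0f}")
print(f"Share of observed values below the mean: {share_below_mean:.0%}")
```

Because the long tail pulls the mean upward, mean imputation places the filled rows above most of the observed data, while the median stays at the 50th percentile by construction.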


### 🚀 Interview Soundbite
> "For Dataset 1, I'd use the **mean** because the data is symmetric and normally distributed. However, for Dataset 2, the exponential nature creates a heavy right skew; in this case, the mean is a poor representation of the 'typical' value, so I'd use the **median** to avoid introducing bias from the long tail."

In [None]:
# Your answer:

# Dataset 1:

# Dataset 2:

# Dataset 3:

## Part 2: Intermediate Problems 🟡

### Problem 2.1: Mutual Information vs. Correlation

**Scenario:** You have a dataset where a feature has a perfect **quadratic relationship** with the target ($y = x^2$).

* **Question:** If you use **Pearson Correlation** for feature selection, what will happen? What should you use instead?

**Interview Focus:** This tests if you understand the mathematical limitations of linear metrics versus information-theoretic ones.

### 🛠️ Implementation & Demonstration

Let's simulate this scenario to see how the two metrics perform:

```python
from sklearn.feature_selection import mutual_info_regression

# Generate non-linear data
x = np.linspace(-10, 10, 100)
y = x**2 + np.random.normal(0, 5, 100) # Quadratic with noise

# Calculate Correlation
correlation = np.corrcoef(x, y)[0, 1]

# Calculate Mutual Information
# (Note: MI requires a 2D array for X; the target is continuous,
#  so the regression variant is used instead of binarizing y)
mi_score = mutual_info_regression(x.reshape(-1, 1), y, random_state=42)[0]

print(f"Pearson Correlation: {correlation:.4f}")
print(f"Mutual Information Score: {mi_score:.4f}")
```

### 📚 The Solution

1. **The Result:** Pearson Correlation will be close to **0**. Because the relationship is a "U-shape" (symmetric around the y-axis), the linear "best fit" line is horizontal, suggesting no relationship at all.
2. **The Problem:** Correlation only measures **linear** dependencies. It is "blind" to non-linear patterns.
3. **The Fix:** Use **Mutual Information (MI)**. MI measures how much information the presence of one variable provides about the other. It captures any kind of statistical dependency (linear or non-linear).

### 🚀 Interview Soundbite

> "Pearson correlation is a great first pass, but it only captures linear relationships. In cases with complex patterns, like quadratic or periodic relationships, it can return a score of zero even if the feature is highly predictive. For a more robust feature selection, I prefer **Mutual Information**, as it uses entropy to capture any form of dependency between the feature and the target."

In a real-world scenario with 50 features, jumping straight to a complex model can lead to overfitting and high computational costs. An interviewer wants to see a **tiered approach**: moving from fast, broad filters to precise, model-based wrappers.

### 🛠️ The "Filter-to-Wrapper" Strategy

1. **Filter Stage (Fast):** Remove constant features (variance = 0) and highly correlated features to reduce redundancy.
2. **Statistical Stage (Medium):** Use **Mutual Information** or **ANOVA F-test** to rank features based on their individual relationship with the target.
3. **Wrapper Stage (Slow but Precise):** Use **Recursive Feature Elimination (RFE)** with a model like Random Forest to capture feature interactions.


### 💻 Implementation: A Multi-Stage Pipeline

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# 1. Initialize the components
# Use Mutual Information to narrow down 50 features to 25 (Filter)
filter_selector = SelectKBest(score_func=mutual_info_classif, k=25)

# 2. Use RFE with a Random Forest to find the final 'Top 10' (Wrapper)
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
wrapper_selector = RFE(estimator=estimator, n_features_to_select=10, step=1)

# 3. Combine into a pipeline
selection_pipeline = Pipeline([
    ('filter', filter_selector),
    ('wrapper', wrapper_selector)
])

# selection_pipeline.fit(X_train, y_train)
```


### 🔍 Interview Explanation: The Trade-offs

| Method | Why use it? | The Trade-off |
| --- | --- | --- |
| **Filter (MI)** | Extremely fast; handles non-linear relationships. | Ignores interactions between features (evaluates features in isolation). |
| **Wrapper (RFE)** | Captures how features work *together* to help a model. | Computationally expensive; prone to overfitting if the dataset is small. |


### 🚀 Interview Soundbite

> "To build a robust feature selection pipeline, I start with a **Filter method** like Mutual Information to quickly discard noise and non-informative features. Then, I apply a **Wrapper method** like RFE with a Random Forest. This two-step approach is more efficient than running RFE on all 50 features at once, as it balances computational speed with the ability to detect complex feature interactions."


In [None]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.ensemble import RandomForestClassifier

# --- STAGE 1: Filter Method ---
# Select top 25 features based on Mutual Information
filter_selector = SelectKBest(score_func=mutual_info_classif, k=25)
X_filter = filter_selector.fit_transform(X, y)

# --- STAGE 2: Wrapper Method ---
# Use RFE to get down to the final 10 features (matching the 'Top 10' target above)
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(estimator=rfc, n_features_to_select=10)
X_final = rfe.fit_transform(X_filter, y)

print(f"Final feature shape: {X_final.shape}")

### 💻 3-Step Selection Pipeline

```python
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Step 1: Filter low-variance features
selector_var = VarianceThreshold(threshold=0.01)
X_var = selector_var.fit_transform(X)

# Step 2: Select top 20 via Mutual Information
selector_k = SelectKBest(score_func=mutual_info_classif, k=20)
X_k = selector_k.fit_transform(X_var, y)

# Step 3: RFE with Logistic Regression to get final 10
estimator = LogisticRegression(solver='liblinear')
selector_rfe = RFE(estimator=estimator, n_features_to_select=10)
X_final = selector_rfe.fit_transform(X_k, y)

# Performance Comparison
model = LogisticRegression(solver='liblinear')
score_all = cross_val_score(model, X, y, cv=5).mean()
score_sel = cross_val_score(model, X_final, y, cv=5).mean()

print(f"Step 1 (Variance): {X_var.shape[1]} features remaining")
print(f"Step 2 (KBest):    {X_k.shape[1]} features remaining")
print(f"Step 3 (RFE):      {X_final.shape[1]} features remaining")
print("-" * 30)
print(f"Accuracy (All 50):      {score_all:.4f}")
print(f"Accuracy (Selected 10): {score_sel:.4f}")
```
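One caveat worth raising in an interview: if the selectors are fitted on the full dataset before cross-validation, the held-out folds have already influenced which features were kept, so the "selected" score can look optimistic. A possible remedy, sketched below on synthetic data (the `make_classification` settings and variable names are assumptions), is to put every selection step inside a `Pipeline` so it is refitted within each fold.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    RFE, SelectKBest, VarianceThreshold, mutual_info_classif,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a 50-feature classification dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=50, n_informative=8, n_redundant=5,
    random_state=42,
)

# Selection steps and the final model live in one pipeline, so each
# cross-validation fold refits the selectors on its own training portion
pipe = Pipeline([
    ("variance", VarianceThreshold(threshold=0.01)),
    ("kbest", SelectKBest(score_func=mutual_info_classif, k=20)),
    ("rfe", RFE(LogisticRegression(solver="liblinear"), n_features_to_select=10)),
    ("model", LogisticRegression(solver="liblinear")),
])

scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(f"Leakage-free CV accuracy: {scores.mean():.4f}")

# Fit once on all data to inspect how many features survive selection
pipe.fit(X_demo, y_demo)
n_selected = pipe.named_steps["rfe"].n_features_
print(f"Features kept by RFE: {n_selected}")
```

The accuracy from this version is the one to quote when comparing against the all-features baseline.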

### 📊 Visualization & Analysis

#### Why this sequence works for an Interview Answer:

1. **Variance Threshold (The Janitor):** We remove features that are essentially constant. If a feature doesn't change, it contains no information to help a model distinguish between classes.
2. **SelectKBest (The Filter):** Using **Mutual Information** is crucial here because it captures both linear and non-linear relationships. It's much faster than RFE, allowing us to cut the feature space in half almost instantly.
3. **RFE (The Specialist):** Recursive Feature Elimination considers feature **interactions**. By the time we reach this step, we are only asking the model to evaluate the "best of the best," making the computation much lighter.

### 🚀 Key Findings

* **Dimensionality Reduction:** We reduced the feature space by **80%** (50 down to 10).
* **Performance:** You will often find that the "Selected 10" performs nearly as well as (or sometimes better than) the "All 50." This happens because we've removed the **noise** and **redundant** features that lead to overfitting.
* **The "Curse" of Dimensionality:** By keeping only the most informative features, we improve the "signal-to-noise" ratio, which is vital when working with smaller datasets.


In [None]:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# =================================================================
# STEP 1: Filter Method - Variance Threshold
# Goal: Remove "boring" features that don't change enough to be useful.
# =================================================================

# 1. Initialize the selector (threshold=0.01 drops any feature whose variance
#    does not exceed 0.01; this is an absolute variance, not a percentage)
selector_var = VarianceThreshold(threshold=0.01)

# 2. Fit and transform the dataset
# X_var = ???


# =================================================================
# STEP 2: Filter Method - Statistical Ranking (SelectKBest)
# Goal: Use Mutual Information to find the top 20 features based
#       on their relationship with the target 'y'.
# =================================================================

# 1. Initialize SelectKBest using mutual_info_classif
# selector_k = SelectKBest(score_func=???, k=20)

# 2. Fit and transform the data from Step 1
# X_k = ???


# =================================================================
# STEP 3: Wrapper Method - Recursive Feature Elimination (RFE)
# Goal: Use a model to find the final 10 features that work best
#       together by iteratively removing the least important ones.
# =================================================================

# 1. Choose a base model for RFE to use for feature ranking
estimator = LogisticRegression(solver='liblinear', random_state=42)

# 2. Initialize RFE to select the top 10 features
# selector_rfe = RFE(estimator=???, n_features_to_select=10)

# 3. Fit and transform the data from Step 2
# X_final = ???


# =================================================================
# PERFORMANCE EVALUATION
# Goal: Compare the "All Features" model vs. the "Selected Features"
#       model using 5-fold cross-validation.
# =================================================================

# model = LogisticRegression(solver='liblinear', random_state=42)

# 1. Calculate accuracy for the original X (50 features)
# score_all = cross_val_score(model, X, y, cv=5).mean()

# 2. Calculate accuracy for X_final (10 features)
# score_selected = cross_val_score(model, X_final, y, cv=5).mean()

# print(f"Accuracy with all features: {score_all:.4f}")
# print(f"Accuracy with selected features: {score_selected:.4f}")

### Problem 2.2: PCA Scree Plot Analysis

**Scenario**
You are presenting PCA results to stakeholders. They ask:
"How many components should we use?"

**Interview Skill Assessed**
Interpreting PCA results and clearly communicating decisions to a non-technical audience.


In [None]:
# Load wine dataset (13 features)
wine = load_wine()
X_wine = pd.DataFrame(wine.data, columns=wine.feature_names)
y_wine = wine.target

# Standardize features
scaler = StandardScaler()
X_wine_scaled = scaler.fit_transform(X_wine)

# Fit PCA with all components
pca = PCA()
pca.fit(X_wine_scaled)

# Explained variance
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)

# Plot scree plot and cumulative variance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scree plot
axes[0].bar(
    range(1, len(explained_variance) + 1),
    explained_variance,
    edgecolor="black"
)
axes[0].set_xlabel("Principal Component")
axes[0].set_ylabel("Explained Variance Ratio")
axes[0].set_title("Scree Plot")

# Cumulative explained variance
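axes[1].plot(
    range(1, len(cumulative_variance) + 1),
    cumulative_variance,
    marker="o"
)
# The source cell is cut off after "range(1, l"; everything from that point on
# is an assumed minimal completion of the cumulative-variance panel.
axes[1].set_xlabel("Number of Components")
axes[1].set_ylabel("Cumulative Explained Variance")
axes[1].set_title("Cumulative Explained Variance")

plt.tight_layout()
plt.show()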


**Your Tasks**

1. Using the elbow method, how many components would you choose?
2. How many components are needed to explain 90% of the variance? How many for 95%?
3. Write a 2‚Äì3 sentence explanation of your recommendation for a non-technical manager.
4. What is the trade-off between using fewer versus more components?



In [None]:
# Your answers

# 1. Elbow method choice
# ANSWER:

# 2. Components needed for variance thresholds
# 90% variance:
# 95% variance:

# 3. Explanation for a non-technical manager
# ANSWER:

# 4. Trade-offs between fewer vs. more components
# ANSWER:


### Solution

**1. Elbow method**
From the scree plot, the elbow appears around principal components 2–3.

**2. Components needed for variance thresholds**

* For 90% variance: 8 components (cumulative variance ≈ 92.0%; 7 components reach only ≈ 89.3%)
* For 95% variance: 10 components (cumulative variance ≈ 96.2%)

**3. Explanation for a non-technical manager**
I recommend using about 8 principal components. This captures roughly 92% of the information in the data while reducing the number of features from 13 to 8. The result is a simpler, faster model with limited loss of useful information; if we must retain 95%, we would keep 10 components.

**4. Trade-offs between fewer vs. more components**

*Fewer components (2–3)*

* Easier to visualize
* Faster computation
* May discard important information

*More components (10+)*

* Preserves more information
* Reduces the benefit of dimensionality reduction
* Harder to interpret

**Interview tip**
Always tie technical decisions back to business value, such as speed, simplicity, and interpretability.
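The variance thresholds in answer 2 can be computed directly rather than read off the plot. This sketch repeats the standardize-then-PCA steps from the problem cell; the helper name `components_for` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the 13 wine features and fit PCA with all components
X_scaled = StandardScaler().fit_transform(load_wine().data)
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)


def components_for(threshold):
    """Smallest number of components whose cumulative variance meets threshold."""
    return int(np.argmax(cumulative >= threshold) + 1)


n_90 = components_for(0.90)
n_95 = components_for(0.95)

print("Cumulative variance:", np.round(cumulative, 3))
print(f"Components for 90%: {n_90}")
print(f"Components for 95%: {n_95}")
```

scikit-learn's `PCA` also accepts a float such as `n_components=0.95`, which applies the same kind of cumulative-variance rule when fitting.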


## Part 3: Advanced Problems

### Problem 3.1: Complete Preprocessing Pipeline

**Scenario**
This is a realistic "messy data" interview question. You are given a dataset with:

* Mixed numeric and categorical features
* Missing values
* Features on different scales
* High dimensionality

**Task**
Design and implement a complete preprocessing pipeline that prepares the data for modeling.

**Why this matters**
This is a very common take-home and on-site interview problem. The goal is to demonstrate both technical correctness and good data engineering judgment.


In [None]:
# Create a realistic messy dataset
np.random.seed(42)
n_samples = 200

messy_data = pd.DataFrame({
    "age": np.random.randint(18, 80, n_samples),
    "income": np.random.exponential(50_000, n_samples),
    "credit_score": np.random.normal(700, 50, n_samples),
    "years_employed": np.random.randint(0, 40, n_samples),
    "education": np.random.choice(["HS", "BS", "MS", "PhD"], n_samples),
    "city": np.random.choice(
        ["NYC", "LA", "Chicago", "Houston", "Phoenix"],
        n_samples
    ),
    "loan_approved": np.random.choice([0, 1], n_samples)
})

# Introduce missing values
messy_data.loc[
    np.random.choice(n_samples, 20, replace=False),
    "income"
] = np.nan

messy_data.loc[
    np.random.choice(n_samples, 15, replace=False),
    "credit_score"
] = np.nan

messy_data.loc[
    np.random.choice(n_samples, 10, replace=False),
    "education"
] = np.nan

# Introduce outliers
messy_data.loc[
    np.random.choice(n_samples, 5, replace=False),
    "income"
] = np.random.uniform(500_000, 1_000_000, 5)

# Inspect dataset
print("Dataset Info:")
print(messy_data.info())

print("\nMissing Values:")
print(messy_data.isnull().sum())

print("\nFirst few rows:")
print(messy_data.head())


**Your Task**

Create a preprocessing function that:

1. Handles missing values appropriately for each feature
2. Detects and treats outliers in numeric features
3. Encodes categorical variables correctly
4. Scales numeric features
5. Returns a clean dataset ready for modeling

**Requirements**

* Must work on both training and test data
* Document your design decisions
* Explain any trade-offs you make



In [None]:
def preprocess_data(df, is_train=True, scaler=None, encoders=None):
    """
    Complete preprocessing pipeline for messy data.

    Parameters
    ----------
    df : pandas.DataFrame
        Input data
    is_train : bool
        If True, fit scalers/encoders.
        If False, use the provided ones.
    scaler : fitted scaler or None
    encoders : dict of fitted encoders or None

    Returns
    -------
    df_processed : pandas.DataFrame
        Cleaned and transformed data
    scaler : fitted scaler
    encoders : dict of fitted encoders
    """

    # YOUR CODE HERE
    pass


# Test your function

X = messy_data.drop("loan_approved", axis=1)
y = messy_data["loan_approved"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42
)

# Process training data
X_train_processed, scaler, encoders = preprocess_data(
    X_train,
    is_train=True
)

# Process test data (using training parameters)
X_test_processed, _, _ = preprocess_data(
    X_test,
    is_train=False,
    scaler=scaler,
    encoders=encoders
)

print("Processed training shape:", X_train_processed.shape)
print("Processed test shape:", X_test_processed.shape)


### Hints

**Suggested approach**

1. Separate numeric and categorical columns.

2. For numeric features:

   * Inspect the distribution before choosing an imputation strategy
   * Detect outliers using the IQR method
   * Apply scaling only after missing values are handled

3. For categorical features:

   * Impute missing values using the mode
   * One-hot encode categories, using `drop_first=True`

4. Combine the processed numeric and categorical features into a single dataset.

**Key insight**
When `is_train=True`, always save the fitted transformers (imputation values, outlier bounds, scalers, encoders). Reuse them on test data to prevent data leakage.

### Solution Overview (Conceptual)

The preprocessing function follows a structured, production-ready workflow:

* A copy of the input data is created to avoid modifying the original dataset.
* Numeric and categorical columns are identified automatically.

**Numeric features**

* Missing values are imputed using the median, which is robust to skewed distributions and outliers.
* Outliers are capped using the interquartile range (IQR) method rather than removed.
* All imputation values and bounds are saved during training and reused during testing.

**Categorical features**

* Missing values are imputed using the most frequent category (mode).
* Categorical variables are converted to numeric form using one-hot encoding.

**Scaling**

* Numeric features are standardized using a scaler fitted on the training data only.
* The same scaler is reused for test data to ensure consistency.

The final output is a clean, fully numeric dataset with no missing values and consistent preprocessing between training and testing.
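
The workflow above can be sketched roughly as follows. This is a minimal illustration, not the reference solution: the toy `income` and `city` columns are hypothetical, and storing medians, IQR bounds, modes, and category lists inside the `encoders` dict is an assumption made here so that the function keeps the three-value return shown in the test cell.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess_data(df, is_train=True, scaler=None, encoders=None):
    """Sketch: impute, cap outliers, scale, and one-hot encode."""
    df = df.copy()  # never mutate the caller's frame

    if is_train:
        # Learn every statistic from the training split only.
        num_cols = df.select_dtypes(include="number").columns.tolist()
        cat_cols = df.select_dtypes(exclude="number").columns.tolist()
        encoders = {"num_cols": num_cols, "cat_cols": cat_cols,
                    "medians": {}, "bounds": {}, "modes": {}, "categories": {}}
        for col in num_cols:
            q1, q3 = df[col].quantile([0.25, 0.75])
            iqr = q3 - q1
            encoders["medians"][col] = df[col].median()
            encoders["bounds"][col] = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
        for col in cat_cols:
            encoders["modes"][col] = df[col].mode().iloc[0]
            encoders["categories"][col] = sorted(df[col].dropna().unique())
    num_cols, cat_cols = encoders["num_cols"], encoders["cat_cols"]

    # Numeric: median imputation, then IQR capping (clip, not drop).
    df[num_cols] = df[num_cols].astype(float)
    for col in num_cols:
        low, high = encoders["bounds"][col]
        df[col] = df[col].fillna(encoders["medians"][col]).clip(low, high)

    # Scaling: fit on training data only, reuse on test data.
    if is_train:
        scaler = StandardScaler().fit(df[num_cols])
    df[num_cols] = scaler.transform(df[num_cols])

    # Categorical: mode imputation, then one-hot with fixed categories so
    # train and test always produce the same columns. Categories unseen
    # in training become all-zero rows (same as the dropped baseline).
    for col in cat_cols:
        filled = df[col].fillna(encoders["modes"][col])
        df[col] = pd.Categorical(filled, categories=encoders["categories"][col])
    df = pd.get_dummies(df, columns=cat_cols, drop_first=True, dtype=float)
    return df, scaler, encoders


# Hypothetical toy data, only to exercise the function.
train = pd.DataFrame({
    "income": [30_000.0, 42_000.0, np.nan, 55_000.0, 1_000_000.0, 48_000.0],
    "city": ["A", "B", "B", None, "C", "A"],
})
test = pd.DataFrame({"income": [np.nan, 5_000_000.0], "city": ["C", None]})

train_out, scaler, encoders = preprocess_data(train, is_train=True)
test_out, _, _ = preprocess_data(
    test, is_train=False, scaler=scaler, encoders=encoders
)
print(train_out.columns.tolist())
print(test_out)
```

Fixing the category list at training time is one way to keep `drop_first=True` consistent: without it, a test split that happens to lack the first category would drop a different column than the training split did.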


### Interview Talking Points

* Median imputation is chosen because features like income are often right-skewed.
* IQR-based capping limits extreme values while preserving overall information.
* Persisting fitted transformers avoids data leakage between train and test sets.
* The approach mirrors real-world, production-ready preprocessing pipelines.



### Problem 3.2: When PCA Goes Wrong

**Scenario**
A colleague applied PCA and obtained very poor results. Your task is to diagnose what went wrong.

**Interview Skill Assessed**
Critical thinking and knowing when *not* to apply a particular technique.

In [None]:
# Your colleague's code

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Create dataset with clear class separation
X, y = make_classification(
    n_samples=300,
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    flip_y=0.05,
    class_sep=3.0,
    random_state=42
)

# Apply PCA
pca = PCA(n_components=1)
X_pca = pca.fit_transform(X)

# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original data
axes[0].scatter(
    X[y == 0, 0],
    X[y == 0, 1],
    label="Class 0",
    alpha=0.6
)
axes[0].scatter(
    X[y == 1, 0],
    X[y == 1, 1],
    label="Class 1",
    alpha=0.6
)
axes[0].set_title("Original Data (2D)")
axes[0].set_xlabel("Feature 1")
axes[0].set_ylabel("Feature 2")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# After PCA
axes[1].scatter(
    X_pca[y == 0],
    np.zeros(sum(y == 0)),
    label="Class 0",
    alpha=0.6
)
axes[1].scatter(
    X_pca[y == 1],
    np.zeros(sum(y == 1)),
    label="Class 1",
    alpha=0.6
)
axes[1].set_title("After PCA (1D)")
axes[1].set_xlabel("PC1")
axes[1].set_yticks([])
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Train models
model_original = SVC(kernel="linear", random_state=42)
model_pca = SVC(kernel="linear", random_state=42)

model_original.fit(X, y)
model_pca.fit(X_pca.reshape(-1, 1), y)

print(
    f"Accuracy with original features: "
    f"{model_original.score(X, y):.3f}"
)
print(
    f"Accuracy with PCA: "
    f"{model_pca.score(X_pca.reshape(-1, 1), y):.3f}"
)


**Your Tasks**

1. **Identify the problem**
   Why did PCA hurt model performance in this case?

2. **Recommend an alternative**
   What dimensionality reduction technique would you use instead, and why?

3. **When is PCA appropriate?**
   Provide clear guidelines for when PCA should (and should not) be used.

**Concept Check**
This question tests whether you understand the fundamental difference between PCA and Linear Discriminant Analysis (LDA).


In [None]:
# Your answers

# 1. Why PCA failed
# ANSWER:

# 2. Better dimensionality reduction technique
# CODE:

# 3. When to use PCA
# ANSWER:


### Solution

**1. Why PCA failed**
PCA identifies directions of maximum variance without looking at the labels, so those directions do not necessarily separate the classes.
Reducing to a single component keeps only the highest-variance direction. If the classes differ mainly along the discarded direction, their projections overlap on PC1 and classification performance drops.

**2. Better technique: Linear Discriminant Analysis (LDA)**
LDA is a supervised dimensionality reduction method. Unlike PCA, it explicitly uses class labels and finds projections that maximize separation between classes while minimizing variance within each class.
As a result, LDA preserves class-discriminative information that PCA may discard.
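
A minimal sketch of how the two projections could be compared on the same kind of data. The held-out split and the `scores` dict are additions for illustration and are not part of the colleague's code; how large the gap between PCA and LDA turns out to be depends on the generated data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same generator settings as the colleague's code.
X, y = make_classification(
    n_samples=300, n_features=2, n_redundant=0, n_informative=2,
    n_clusters_per_class=1, flip_y=0.05, class_sep=3.0, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

reducers = {
    "PCA": PCA(n_components=1),                          # ignores labels
    "LDA": LinearDiscriminantAnalysis(n_components=1),   # uses labels
}

scores = {}
for name, reducer in reducers.items():
    # Fit the projection on training data only, then reuse it on test data.
    Z_train = reducer.fit_transform(X_train, y_train)
    Z_test = reducer.transform(X_test)
    clf = SVC(kernel="linear").fit(Z_train, y_train)
    scores[name] = clf.score(Z_test, y_test)
    print(f"{name}: held-out accuracy = {scores[name]:.3f}")
```

With two classes, LDA can produce at most one discriminant component, which is why `n_components=1` is the only valid choice here.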

**3. When to use PCA**

PCA is appropriate when:

* Performing unsupervised learning or clustering
* Visualizing high-dimensional data
* Reducing multicollinearity
* Removing noise

PCA should be avoided when:

* The goal is maximizing class separation (use LDA instead)
* Features are on different scales and have not been standardized (standardize first; otherwise the largest-scale feature dominates the components)
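
To illustrate the scaling caveat, here is a small sketch on synthetic data. The `age` and `income` arrays are hypothetical and drawn independently, so neither feature carries more structure than the other; only their units differ.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
age = rng.uniform(18, 90, 500)            # range of tens
income = rng.uniform(0, 1_000_000, 500)   # range of hundreds of thousands
X = np.column_stack([age, income])

# Without scaling, PC1 is essentially the income axis.
raw_pca = PCA(n_components=1).fit(X)

# With scaling, both features contribute on equal footing.
scaled = make_pipeline(StandardScaler(), PCA(n_components=1)).fit(X)
scaled_pca = scaled.named_steps["pca"]

print("raw PC1 loadings:   ", raw_pca.components_[0])
print("raw variance ratio: ", raw_pca.explained_variance_ratio_[0])
print("scaled variance ratio:", scaled_pca.explained_variance_ratio_[0])
```

Wrapping the scaler and PCA in one pipeline also means the scaler is fitted only on whatever data the pipeline is fitted on, which helps when the pipeline is later used inside cross-validation.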

**Interview-ready takeaway**
"PCA is an unsupervised technique that focuses on variance, not class separation. When labels are available and the goal is classification, LDA is often a better choice because it directly maximizes class separability."


## Part 4: Challenge Problems

These are timed interview pressure tests. Set a timer and work under interview conditions.

### Challenge 4.1: The Feature Engineering Sprint (25 minutes)

**Scenario**
You are in a live coding interview. The interviewer gives you a dataset and says:
"We've tried models on this data with about 65% accuracy. Can you engineer features to improve it? You have 25 minutes."

**Rules**

* Time limit: 25 minutes
* Must improve baseline accuracy
* Explain your reasoning as you code
* Be prepared to defend your choices


In [None]:
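import time

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split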

start_time = time.time()

# Dataset: Employee attrition prediction
np.random.seed(42)
n = 500

employee_data = pd.DataFrame({
    "age": np.random.randint(22, 65, n),
    "years_at_company": np.random.randint(0, 40, n),
    "monthly_income": np.random.normal(6000, 2000, n),
    "distance_from_home": np.random.randint(1, 50, n),
    "num_companies_worked": np.random.randint(0, 10, n),
    "years_since_last_promotion": np.random.randint(0, 15, n),
    "department": np.random.choice(["Sales", "IT", "HR", "Marketing"], n),
    "education_level": np.random.randint(1, 6, n),      # 1-5 scale
    "job_satisfaction": np.random.randint(1, 5, n),     # 1-4 scale
    "work_life_balance": np.random.randint(1, 5, n),    # 1-4 scale
})

# Target (somewhat correlated with features)
employee_data["attrition"] = (
    (employee_data["job_satisfaction"] < 2).astype(int) * 0.4
    + (employee_data["years_since_last_promotion"] > 5).astype(int) * 0.3
    + (employee_data["monthly_income"] < 4000).astype(int) * 0.2
    + np.random.random(n) * 0.1
)
employee_data["attrition"] = (employee_data["attrition"] > 0.5).astype(int)

print("Dataset shape:", employee_data.shape)
print("\nAttrition distribution:")
print(employee_data["attrition"].value_counts(normalize=True))

# Baseline model
X_baseline = employee_data.drop("attrition", axis=1)
X_baseline = pd.get_dummies(X_baseline, drop_first=True)
y = employee_data["attrition"]

X_train, X_test, y_train, y_test = train_test_split(
    X_baseline,
    y,
    test_size=0.3,
    random_state=42
)

baseline_model = RandomForestClassifier(n_estimators=100, random_state=42)
baseline_model.fit(X_train, y_train)
baseline_acc = baseline_model.score(X_test, y_test)

print(f"\nBASELINE ACCURACY: {baseline_acc:.3f}")
print("\nTimer started! You have 25 minutes.")
print("=" * 60)


**Your Task**

Engineer features to beat the baseline model. Consider using:

* Ratios and interaction features
* Domain knowledge related to employee attrition
* Polynomial or nonlinear transformations
* Aggregations or derived features
* Any other features you believe may improve predictive performance

**Instructions**

Document your feature engineering steps and reasoning below. Explain *why* each feature might help improve model performance.


In [None]:
# Feature engineering (fill in below)

# Some ideas to get started:
# - years_at_company / age  -> tenure relative to age (avoids dividing by zero tenure)
# - monthly_income / age    -> income growth proxy
# - interaction: job_satisfaction * work_life_balance
# - flag for recent hires
# - etc.

# Your code:
# -------------------------------------------------

# Final model with engineered features
# -------------------------------------------------

# Check time
elapsed = time.time() - start_time
print(f"\nTime elapsed: {elapsed/60:.1f} minutes")
print(f"Baseline accuracy: {baseline_acc:.3f}")
print("Your accuracy: ???")
print("Improvement: ???")


### Hints

**Good features for attrition**

1. **Career progression rate**
   `years_at_company / (years_since_last_promotion + 1)`

2. **Income relative to age (income growth proxy)**
   `monthly_income / (age - 21 + 1)`  (assuming work starts around age 22)

3. **Commute burden**
   `distance_from_home * (1 / work_life_balance)`

4. **Job hopper flag**
   `num_companies_worked > threshold`

5. **Stagnation indicator**
   `years_since_last_promotion > years_at_company * 0.3`

6. **Interaction feature**
   `job_satisfaction * work_life_balance`

### Sample Solution (Conceptual)

A strong feature engineering approach here uses domain knowledge to capture:

* **Career trajectory**

  * Promotion rate as a measure of advancement
  * Stagnation flags to represent lack of progression

* **Compensation relative to experience**

  * Income per "working year" to normalize income by age/experience
  * Low income flags relative to the dataset distribution

* **Lifestyle and wellbeing**

  * Commute burden as a stress proxy
  * Interaction between job satisfaction and work-life balance to capture overall wellbeing

* **Risk flags**

  * Recent hire, job hopper, long commute, low satisfaction indicators
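
The feature groups above could be sketched as follows. The function name `add_attrition_features`, the `low_income_cutoff` argument, and the thresholds (fewer than 2 years for a recent hire, more than 4 employers for a job hopper) are hypothetical choices for illustration; the ratios follow the formulas given in the hints.

```python
import pandas as pd


def add_attrition_features(df: pd.DataFrame, low_income_cutoff: float) -> pd.DataFrame:
    """Return a copy of df with engineered attrition features appended."""
    out = df.copy()

    # Career trajectory
    out["promotion_rate"] = out["years_at_company"] / (
        out["years_since_last_promotion"] + 1   # +1 avoids division by zero
    )
    out["is_stagnant"] = (
        out["years_since_last_promotion"] > 0.3 * out["years_at_company"]
    ).astype(int)

    # Compensation relative to experience
    out["income_per_working_year"] = out["monthly_income"] / (out["age"] - 21 + 1)
    out["low_income"] = (out["monthly_income"] < low_income_cutoff).astype(int)

    # Lifestyle and wellbeing
    out["commute_burden"] = out["distance_from_home"] / out["work_life_balance"]
    out["wellbeing"] = out["job_satisfaction"] * out["work_life_balance"]

    # Risk flags (thresholds are illustrative assumptions)
    out["recent_hire"] = (out["years_at_company"] < 2).astype(int)
    out["job_hopper"] = (out["num_companies_worked"] > 4).astype(int)
    return out


# Two hypothetical employees, only to show the computed values.
sample = pd.DataFrame({
    "age": [30, 45],
    "years_at_company": [8, 1],
    "monthly_income": [6000.0, 3500.0],
    "distance_from_home": [10, 40],
    "num_companies_worked": [2, 6],
    "years_since_last_promotion": [3, 0],
    "job_satisfaction": [2, 1],
    "work_life_balance": [4, 1],
})
features = add_attrition_features(sample, low_income_cutoff=4000.0)
print(features.iloc[:, len(sample.columns):])
```

Passing the cutoff in as an argument, rather than computing it inside the function, lets it be derived from the training split (for example as a quantile) and reused unchanged on the test split.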

**Interview explanation (example)**
"I engineered features based on known attrition drivers: career stagnation, compensation relative to experience, and overall wellbeing. Promotion rate and stagnation capture progression, income-per-year captures compensation growth, and satisfaction-balance interaction captures employee experience. These features translate raw fields into more predictive signals while staying interpretable."


### Challenge 4.2: Dimensionality Reduction Decision (20 minutes)

**Scenario**
This is a live technical interview question:

"We have a 100-feature dataset for image classification. Recommend a dimensionality reduction approach and justify your choice."

**Interview Conditions**

* Explain your reasoning clearly
* Compare multiple approaches
* Consider computational cost
* Think about interpretability


In [None]:
# Simulated high-dimensional image data

from sklearn.datasets import load_digits

digits = load_digits()      # 8x8 pixel images = 64 features
X_digits = digits.data
y_digits = digits.target

print(
    f"Dataset: {X_digits.shape[0]} samples, "
    f"{X_digits.shape[1]} features, "
    f"{len(np.unique(y_digits))} classes"
)

print("\nTask: Reduce from 64D to 2D for visualization and classifier input")

print("\nYou must:")
print("1. Try at least 3 different dimensionality reduction techniques")
print("2. Compare them visually and quantitatively")
print("3. Make a recommendation with justification")

print("\nTimer started! 20 minutes.")
print("=" * 60)


In [None]:
# Your code goes here.
# Implement at least 3 dimensionality reduction techniques and compare them.

# Techniques to try:
# 1) PCA
# 2) LDA
# 3) t-SNE
# Bonus: Kernel PCA, Isomap, UMAP (if available), etc.

# For each method:
# - Create a 2D projection
# - Visualize the 2D projection (scatter plot colored by class)
# - Train a classifier on the reduced features
# - Record accuracy and computation time

# Suggested output to produce:
# - One plot per method
# - A short printed summary:
#   method name, fit/transform time, classifier accuracy


### Sample Solution (Readable Summary)

**Setup**

* Split the digits dataset into training and test sets.
* Standardize features using `StandardScaler`.
* Compare three dimensionality reduction techniques:

  * PCA (2 components)
  * LDA (2 components)
  * t-SNE (2 components, training only)


### Methods Compared

**1) PCA (Principal Component Analysis)**

* Unsupervised: finds directions of maximum variance.
* Can transform both train and test data.
* Provides "variance explained," which is useful for justification.

**What to record**

* Runtime for fit/transform
* KNN test accuracy using the 2D PCA representation
* Total variance explained by the two components


**2) LDA (Linear Discriminant Analysis)**

* Supervised: uses labels and maximizes class separation.
* Can transform both train and test data.
* Often strong for classification when labels are available.

**What to record**

* Runtime for fit/transform
* KNN test accuracy using the 2D LDA representation

**3) t-SNE**

* Nonlinear visualization method (primarily for plotting).
* Cannot project unseen test data in standard use: scikit-learn's `TSNE` offers `fit_transform()` but no `transform()` method.
* Often slower and more sensitive to hyperparameters.

**What to record**

* Runtime for fitting on training data
* Visualization quality (cluster separation)
* Test accuracy is typically **not applicable** in a clean train/test pipeline
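
One way the comparison could be run is sketched below. The choice of `KNeighborsClassifier(n_neighbors=5)`, the 70/30 stratified split, and the `results` dict layout are assumptions for illustration; plotting is omitted, but each 2D array (`Z_train`, `Z_tsne`) can be passed to a scatter plot colored by class.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Standardize using training statistics only.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

pca = PCA(n_components=2)
lda = LinearDiscriminantAnalysis(n_components=2)
results = {}

for name, reducer in [("PCA", pca), ("LDA", lda)]:
    start = time.perf_counter()
    Z_train = reducer.fit_transform(X_train_s, y_train)  # PCA ignores y
    Z_test = reducer.transform(X_test_s)
    seconds = time.perf_counter() - start
    knn = KNeighborsClassifier(n_neighbors=5).fit(Z_train, y_train)
    results[name] = {"seconds": seconds,
                     "test_accuracy": knn.score(Z_test, y_test)}

results["PCA"]["variance_explained"] = pca.explained_variance_ratio_.sum()

# t-SNE: embedding of the training data only, for visualization.
start = time.perf_counter()
Z_tsne = TSNE(n_components=2, random_state=42).fit_transform(X_train_s)
results["t-SNE"] = {"seconds": time.perf_counter() - start,
                    "test_accuracy": None}  # no transform() for test data

for name, row in results.items():
    print(name, row)
```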

### Recommendation (Interview-Ready)

"I recommend **LDA** for this classification task because it uses the labels to explicitly maximize class separability, which typically improves downstream classifier accuracy. It is also efficient and, unlike t-SNE, it can transform new (test) data consistently.

If we needed an unsupervised method or cared about preserving variance without using labels, **PCA** would be my second choice.

I would use **t-SNE** primarily for visualization and diagnostic insight, not as a preprocessing step for a classifier, because it does not naturally generalize to unseen data and can be computationally expensive."


## Congratulations!

You've completed the Module 4 simulation scenarios.

### What You've Practiced

* **Warm-up:** Fundamentals (scaling, encoding, imputation)
* **Intermediate:** Multi-step pipelines and interpretation
* **Advanced:** Production-ready code and debugging
* **Challenge:** Interview pressure tests and time-constrained problem solving

### Interview Preparation Checklist

* Can you explain when to use each scaling method?
* Can you avoid the dummy variable trap?
* Can you build a complete preprocessing pipeline?
* Can you debug feature engineering mistakes?
* Can you choose between PCA, LDA, and t-SNE appropriately?
* Can you work under time pressure while staying organized?
* Can you explain trade-offs to non-technical stakeholders?

### Next Steps

1. Review solutions for any problems you struggled with.
2. Time yourself on the challenge problems again.
3. Practice explaining your decisions out loud.
4. Create your own variations of these problems.
5. Review the concepts in `Interview_Prep_4_Terms_and_Concepts.pdf`.

### Common Interview Mistakes to Avoid

* Forgetting to scale before PCA
* Using `fit_transform()` on test data
* Creating the dummy variable trap
* Not documenting preprocessing steps
* Using t-SNE for preprocessing
* Choosing mean instead of median for skewed data
* Ignoring computational cost
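
As a quick illustration of the `fit_transform()` mistake from the list above, here is a small sketch on synthetic data (the array `X` is hypothetical). The leaky version looks harmless because it also yields zero-mean features, but it uses statistics the model would never have at prediction time.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 3))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Correct: learn mean and standard deviation from the training split only.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)   # reuse the training statistics

# Mistake: refitting on the test split uses test-set statistics.
X_test_leaky = StandardScaler().fit_transform(X_test)

print("train means (correct):", X_train_s.mean(axis=0).round(3))
print("test means (correct): ", X_test_s.mean(axis=0).round(3))
print("test means (leaky):   ", X_test_leaky.mean(axis=0).round(3))
```

The correctly scaled test set does not have exactly zero mean, and that is expected: it is measured against the training distribution.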

### Remember

> The interviewer cares more about your thought process than perfect code. Explain your reasoning as you work.

Good luck with your interviews.
