# ML, PyTorch & LLM Interview Preparation Guide

A comprehensive guide covering Machine Learning fundamentals, PyTorch, and Large Language Model concepts for technical interviews.

---

## Study Guide

This guide is organized in optimal learning order. Each section builds on previous concepts.

### Learning Path Overview

| Part | Topic | Category |
|------|-------|----------|
| 1 | Machine Learning Fundamentals | Foundation |
| 2 | Deep Learning Fundamentals | Foundation |
| 3 | PyTorch Fundamentals | Foundation |
| 4 | Convolutional Neural Networks (CNNs) | Core Architecture |
| 5 | LLM & Transformer Concepts | Core Architecture |
| 6 | Generative Models | Specialized |
| 7 | Graph Neural Networks (GNNs) | Specialized |
| 8 | Natural Language Processing Basics | Specialized |
| 9 | Reinforcement Learning Fundamentals | Specialized |
| 10 | Model Deployment & MLOps | Production |
| 11 | Distributed Training | Production |
| 12 | Debugging ML Models | Production |
| 13 | ML System Design | Production |
| 14 | Robotics & Embodied AI Fundamentals | Domain-Specific |
| 15 | Data Augmentation | Techniques |
| 16 | Ethics & Fairness in ML | Techniques |
| 17 | Common Interview Questions | Review |
| 18 | Code Snippets to Know | Review |
| 19 | Quick Reference - Key Formulas | Reference |
| 20 | Additional Interview Tips | Review |

### Quick Navigation by Role

- **ML Engineer (General):** Parts 1-3 → 4-5 → 10-13
- **NLP/LLM Engineer:** Parts 1-3 → 5 → 8 → 10-11
- **Computer Vision Engineer:** Parts 1-3 → 4 → 6 → 15 → 10
- **Robotics/Embodied AI:** Parts 1-3 → 9 → 14 → 4
- **Research Scientist:** Parts 1-3 → 5 → 6 → 9 → 7

---

## Part 1: Machine Learning Fundamentals

### 1.1 Types of Machine Learning

**Supervised Learning**
- Learn from labeled data (input → output pairs)
- Examples: Classification, Regression
- Algorithms: Linear Regression, Logistic Regression, SVM, Decision Trees, Neural Networks

**Unsupervised Learning**
- Learn patterns from unlabeled data
- Examples: Clustering, Dimensionality Reduction, Anomaly Detection
- Algorithms: K-Means, PCA, Autoencoders

**Reinforcement Learning**
- Learn through interaction with environment (actions → rewards)
- Examples: Game playing, Robotics
- Algorithms: Q-Learning, Policy Gradient, PPO

---

### 1.2 Bias-Variance Tradeoff

**Q: Explain bias and variance**

- **Bias:** Error from oversimplified assumptions. High bias = underfitting
- **Variance:** Error from sensitivity to training data fluctuations. High variance = overfitting

| | Low Variance | High Variance |
|---|---|---|
| **Low Bias** | Ideal (good fit) | Overfitting |
| **High Bias** | Underfitting | Worst case |

**Total Error = Bias² + Variance + Irreducible Error**
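
The decomposition can be checked empirically by refitting a model on many resampled training sets and measuring how its predictions at fixed test points shift (bias) and scatter (variance). The following is a minimal sketch under assumptions that are not part of this guide: a sine target, Gaussian noise, and two toy models (a constant predictor and a 1-nearest-neighbor predictor). All function names are hypothetical.

```python
import math
import random


def true_f(x):
    # Assumed ground-truth function for the demo
    return math.sin(2 * math.pi * x)


def sample_train(rng, n=20, noise=0.3):
    # Draw a fresh noisy training set
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def fit_constant(xs, ys):
    # High-bias model: always predict the training mean
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_1nn(xs, ys):
    # High-variance model: copy the label of the nearest training point
    def predict(x):
        nearest = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return ys[nearest]
    return predict


def bias_variance(fit, n_repeats=300, seed=0):
    # Estimate average bias^2 and variance over a grid of test points
    rng = random.Random(seed)
    test_xs = [i / 20 for i in range(1, 20)]
    preds = [[] for _ in test_xs]
    for _ in range(n_repeats):
        model = fit(*sample_train(rng))
        for k, x in enumerate(test_xs):
            preds[k].append(model(x))
    bias_sq = variance = 0.0
    for k, x in enumerate(test_xs):
        mean_pred = sum(preds[k]) / n_repeats
        bias_sq += (mean_pred - true_f(x)) ** 2
        variance += sum((p - mean_pred) ** 2 for p in preds[k]) / n_repeats
    return bias_sq / len(test_xs), variance / len(test_xs)


for name, fit in [("constant", fit_constant), ("1-NN", fit_1nn)]:
    b, v = bias_variance(fit)
    print(f"{name:8s} bias^2={b:.3f} variance={v:.3f}")
```

In this setup the constant model should show the larger bias² and the 1-NN model the larger variance, matching the underfitting and overfitting corners of the table.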

**Q: How to reduce overfitting?**
- More training data
- Regularization (L1, L2, Dropout)
- Simpler model
- Early stopping
- Cross-validation
- Data augmentation

**Q: How to reduce underfitting?**
- More complex model
- More features
- Less regularization
- Train longer

---

### 1.3 Train/Validation/Test Split

**Q: Why split data?**

- **Training set (60-80%):** Learn model parameters
- **Validation set (10-20%):** Tune hyperparameters, model selection
- **Test set (10-20%):** Final unbiased evaluation

**Q: What is cross-validation?**

K-Fold CV: Split data into K folds, train on K-1, validate on 1, rotate K times.
- More reliable estimate of model performance
- Uses all data for both training and validation
- Common: 5-fold or 10-fold

**Code - Cross-Validation:**
```python
from sklearn.model_selection import train_test_split, KFold, cross_val_score

# Simple train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify for classification
)

# K-Fold Cross-Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")

# Manual K-Fold loop
for fold, (train_idx, val_idx) in enumerate(kfold.split(X)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
```

---

### 1.4 Evaluation Metrics

**Classification Metrics:**

| Metric | Formula | Use When |
|--------|---------|----------|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Balanced classes |
| Precision | TP/(TP+FP) | Cost of FP is high |
| Recall | TP/(TP+FN) | Cost of FN is high |
| F1 Score | 2×(P×R)/(P+R) | Imbalanced classes |
| AUC-ROC | Area under ROC curve | Ranking quality |

**Confusion Matrix:**
```
              Predicted
              Pos    Neg
Actual Pos    TP     FN
       Neg    FP     TN
```
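
To connect the formulas in the table to the confusion matrix, here is a small sketch that computes the metrics directly from the four counts. The function name and the example counts are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced problem: 12 actual positives out of 100 samples
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=86)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts accuracy is 0.94 while recall is only about 0.67, which is why accuracy alone can mislead on imbalanced classes.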

**Regression Metrics:**

| Metric | Formula | Notes |
|--------|---------|-------|
| MSE | (1/n)Σ(y-ŷ)² | Penalizes large errors |
| RMSE | √MSE | Same units as target |
| MAE | (1/n)Σ\|y-ŷ\| | Robust to outliers |
| R² | 1 - SS_res/SS_tot | % variance explained |

**Code - Computing Metrics:**
```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Classification
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Regression
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```

---

### 1.5 Feature Engineering

**Q: What is feature scaling and why is it important?**

- **Standardization (Z-score):** x' = (x - μ) / σ → mean=0, std=1
- **Normalization (Min-Max):** x' = (x - min) / (max - min) → range [0,1]

Important for:
- Gradient-based optimization (faster convergence)
- Distance-based algorithms (KNN, SVM, K-Means)
- Regularization (fair penalty across features)

**Code - Feature Scaling:**
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization (Z-score normalization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only!
X_test_scaled = scaler.transform(X_test)        # transform test

# Min-Max Normalization
minmax = MinMaxScaler()
X_normalized = minmax.fit_transform(X_train)

# Manual implementation
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_standardized = (X_train - mean) / std
```

**Q: How to handle missing values?**
- Remove rows/columns (if few missing)
- Imputation: mean, median, mode
- Model-based imputation
- Create "is_missing" indicator feature

**Q: How to handle categorical variables?**
- One-hot encoding (nominal categories)
- Label encoding (ordinal categories)
- Target encoding (high cardinality)
- Embeddings (for neural networks)

**Code - Handling Categorical Variables:**
```python
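import pandas as pd
import torch
import torch.nn as nn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# One-Hot Encoding (pandas)
df_encoded = pd.get_dummies(df, columns=['category_col'], drop_first=True)

# One-Hot Encoding (sklearn >= 1.2; older versions use sparse=False)
onehot = OneHotEncoder(sparse_output=False, drop='first')
encoded = onehot.fit_transform(df[['category_col']])

# Label Encoding: maps categories to integers in sorted order.
# For ordinal features with a meaningful order, OrdinalEncoder(categories=[...])
# lets you state that order explicitly.
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category_col'])

# PyTorch Embedding (for neural networks)
num_categories = df['category_col'].nunique()
embedding = nn.Embedding(num_categories, embedding_dim=8)
category_indices = df['category_encoded'].to_numpy()  # integer ids in [0, num_categories)
embedded = embedding(torch.as_tensor(category_indices, dtype=torch.long))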
```

---

### 1.6 Regularization

**Q: Explain L1 vs L2 regularization**

| | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | λΣ\|w\| | λΣw² |
| Effect | Sparse weights (feature selection) | Small weights |
| Optimization | No closed form in general (not differentiable at 0) | Closed-form solution for linear regression |

**Elastic Net:** Combines L1 + L2
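
One way to see why L1 produces exact zeros while L2 only shrinks is to look at a single weight. As a sketch, assume the data term is the simple quadratic ½(w - z)², where z is the unregularized optimum; this one-dimensional setup and the function names are illustrative assumptions. The L1-penalized minimizer is then soft-thresholding, and the L2-penalized minimizer is a proportional shrink.

```python
def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*|w| (soft-thresholding)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small weights are set exactly to zero


def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*w**2 (proportional shrinkage)."""
    return z / (1.0 + 2.0 * lam)


unregularized = [3.0, 0.4, -0.2, -1.5]
lam = 0.5
l1_weights = [l1_shrink(z, lam) for z in unregularized]
l2_weights = [l2_shrink(z, lam) for z in unregularized]
print("L1:", l1_weights)  # small entries become exactly 0
print("L2:", l2_weights)  # every entry shrinks but stays non-zero
```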

**Code - Regularization:**
```python
import torch
import torch.nn as nn
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# L2 Regularization (Ridge)
ridge = Ridge(alpha=1.0)  # alpha = regularization strength
ridge.fit(X_train, y_train)

# L1 Regularization (Lasso)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print(f"Non-zero coefficients: {(lasso.coef_ != 0).sum()}")  # sparse!

# Elastic Net (L1 + L2)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio: mix of L1/L2
elastic.fit(X_train, y_train)

# PyTorch weight decay (L2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# PyTorch Dropout
dropout = nn.Dropout(p=0.5)  # 50% dropout rate
```

**Q: What is dropout?**

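During training, each neuron's output is set to 0 independently with probability p, and the surviving activations are scaled by 1/(1-p) so the expected activation is unchanged (inverted dropout, which is what `nn.Dropout` does). At evaluation time dropout is disabled. This prevents the network from relying on any single neuron and behaves like training an ensemble of sub-networks.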
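
The mechanism can be written out in a few lines. This is a plain-Python sketch of inverted dropout on a list of activations, not PyTorch's implementation; the function name and arguments are hypothetical.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(values)  # evaluation mode: identity
    keep = 1.0 - p
    # Survivors are divided by the keep probability so the expected value is unchanged
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 10000
train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)
print(sum(train_out) / len(train_out))  # close to 1.0 on average
print(eval_out[:3])
```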

---

### 1.7 Classical ML Algorithms

**Linear Regression**
- Predicts continuous output: ŷ = wᵀx + b
- Loss: MSE
- Closed-form solution or gradient descent

**Logistic Regression**
- Binary classification: P(y=1) = σ(wᵀx + b)
- Loss: Binary Cross-Entropy
- Despite name, it's classification!

**Decision Trees**
- Split data based on feature thresholds
- Greedy algorithm (information gain, Gini impurity)
- Prone to overfitting → use Random Forest

**Random Forest**
- Ensemble of decision trees
- Bagging + random feature subsets
- Reduces variance, more robust

**Gradient Boosting (XGBoost, LightGBM)**
- Sequential ensemble: each tree corrects previous errors
- Often best for tabular data
- Hyperparameters: learning rate, max depth, n_estimators

**SVM (Support Vector Machine)**
- Find hyperplane maximizing margin
- Kernel trick for non-linear boundaries
- Works well in high dimensions

**K-Nearest Neighbors (KNN)**
- Classify based on k nearest training points
- No training phase (lazy learning)
- Sensitive to feature scaling and curse of dimensionality

**K-Means Clustering**
- Partition data into k clusters
- Iterative: assign points → update centroids
- Choose k with elbow method or silhouette score

**PCA (Principal Component Analysis)**
- Dimensionality reduction
- Find directions of maximum variance
- Linear transformation, preserves global structure

**Code - Classical ML Algorithms:**
```python
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Logistic Regression
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
probs = log_reg.predict_proba(X_test)

# Decision Tree
dt = DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)

# Random Forest
rf = RandomForestClassifier(n_estimators=100, max_depth=10)
rf.fit(X_train, y_train)

# Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
gb.fit(X_train, y_train)

# SVM
svm = SVC(kernel='rbf', C=1.0)
svm.fit(X_train, y_train)

# KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X)

# PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Explained variance: {pca.explained_variance_ratio_}")
```

---

### 1.8 Ensemble Methods

**Q: Explain bagging vs boosting**

| | Bagging | Boosting |
|---|---|---|
| Training | Parallel | Sequential |
| Sampling | Bootstrap (with replacement) | Weighted samples |
| Goal | Reduce variance | Reduce bias |
| Example | Random Forest | XGBoost, AdaBoost |

**Q: What is stacking?**

Train multiple models, use their predictions as features for a meta-model.

---

### 1.9 Hyperparameter Tuning

**Methods:**
- Grid Search: Try all combinations (exhaustive)
- Random Search: Sample configurations at random (often more efficient than grid search when only a few hyperparameters matter)
- Bayesian Optimization: Model the objective function
- Early Stopping: Stop training when validation loss stops improving

**Common hyperparameters:**
- Learning rate
- Regularization strength
- Number of layers/neurons
- Batch size
- Number of trees/depth (for tree models)

---

### 1.10 Probability & Statistics Fundamentals

**Q: What is Bayes' Theorem and why is it important?**

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

- **P(A|B):** Posterior - probability of A given B
- **P(B|A):** Likelihood - probability of B given A
- **P(A):** Prior - initial belief about A
- **P(B):** Evidence - normalizing constant

**Applications:** Naive Bayes classifier, Bayesian inference, spam filtering, medical diagnosis
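
A worked medical-diagnosis example makes the roles of prior, likelihood, and evidence concrete. The numbers below (1% prevalence, 95% sensitivity, 5% false-positive rate) are invented for illustration, as is the function name.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) via Bayes' theorem for a binary hypothesis A and observation B."""
    # Evidence P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# P(disease | positive test) under the assumed numbers
p_disease = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(disease | positive) = {p_disease:.3f}")
```

Even with an accurate test, the posterior here is only about 0.16 because the prior (prevalence) is so low.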

**Q: Explain common probability distributions**

| Distribution | Type | Use Case | Parameters |
|--------------|------|----------|------------|
| Bernoulli | Discrete | Single binary outcome | p (success prob) |
| Binomial | Discrete | n binary trials | n, p |
| Poisson | Discrete | Count of rare events | λ (rate) |
| Uniform | Continuous | Equal probability | a, b (bounds) |
| Gaussian/Normal | Continuous | Natural phenomena | μ, σ |
| Exponential | Continuous | Time between events | λ (rate) |

**Key Formulas:**

- **Gaussian PDF:** $p(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- **Binomial:** $P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}$
- **Poisson:** $P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$

**Q: Define key statistical concepts**

| Concept | Formula | Meaning |
|---------|---------|---------|
| Expected Value | $\mathbb{E}[X] = \sum x_i P(x_i)$ | Average outcome |
| Variance | $\text{Var}(X) = \mathbb{E}[(X-\mu)^2]$ | Spread of data |
| Std Deviation | $\sigma = \sqrt{\text{Var}(X)}$ | Spread in original units |
| Covariance | $\text{Cov}(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)]$ | Joint variability |
| Correlation | $\rho = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$ | Normalized covariance [-1,1] |

**Q: What is the Central Limit Theorem?**

The sampling distribution of the mean of i.i.d. samples approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution, provided the population has finite variance.

- Enables hypothesis testing
- Justifies using normal distribution in many ML algorithms
- Rule of thumb: n ≥ 30 is often enough, though heavily skewed or heavy-tailed populations need more
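
A quick simulation illustrates the theorem. The sketch below draws sample means from an exponential distribution (clearly non-normal) and compares their spread with the σ/√n predicted by the CLT; the distribution, sample size, and trial count are arbitrary choices for the demo.

```python
import math
import random


def sample_means(n, trials, rng):
    # Each trial: mean of n draws from Exponential(rate=1), a skewed distribution
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n
            for _ in range(trials)]


rng = random.Random(0)
n = 50
means = sample_means(n, trials=5000, rng=rng)

grand_mean = sum(means) / len(means)
spread = math.sqrt(sum((m - grand_mean) ** 2 for m in means) / len(means))
predicted_spread = 1.0 / math.sqrt(n)  # sigma / sqrt(n), with sigma = 1

# Fraction of sample means within 1.96 standard errors of the true mean (1.0)
inside = sum(abs(m - 1.0) < 1.96 * predicted_spread for m in means) / len(means)
print(f"mean={grand_mean:.3f} spread={spread:.3f} "
      f"predicted={predicted_spread:.3f} inside={inside:.3f}")
```

If the sample means are approximately normal, roughly 95% of them should fall inside the 1.96 standard-error band.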

**Q: Explain Maximum Likelihood Estimation (MLE)**

Find parameters θ that maximize the probability of observing the data:

$$\hat{\theta}_{MLE} = \arg\max_\theta P(D|\theta) = \arg\max_\theta \prod_i P(x_i|\theta)$$

In practice, maximize log-likelihood (easier to compute):
$$\hat{\theta}_{MLE} = \arg\max_\theta \sum_i \log P(x_i|\theta)$$
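
As a concrete instance, the MLE for a coin's heads probability is the observed fraction of heads. The sketch below confirms this by brute-force maximizing the Bernoulli log-likelihood over a grid; the data and the grid resolution are made up for the example.

```python
import math


def bernoulli_log_likelihood(p, flips):
    # Sum of log P(x_i | p) for independent coin flips (1 = heads, 0 = tails)
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in flips)


flips = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]       # 7 heads out of 10
grid = [i / 1000 for i in range(1, 1000)]    # candidate values of p in (0, 1)
p_hat = max(grid, key=lambda p: bernoulli_log_likelihood(p, flips))

print(f"grid-search MLE: {p_hat}")
print(f"closed form (heads / flips): {sum(flips) / len(flips)}")
```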

**Q: MLE vs MAP (Maximum A Posteriori)?**

| | MLE | MAP |
|---|---|---|
| Formula | $\arg\max_\theta P(D \mid \theta)$ | $\arg\max_\theta P(\theta \mid D)$ |
| Prior | No prior | Includes prior P(θ) |
| Regularization | None | Prior acts as regularizer |
| Overfitting | More prone | Less prone |

MAP with Gaussian prior = L2 regularization
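
The equivalence follows in one step from Bayes' rule. As a sketch, assume a zero-mean isotropic Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$, where the prior scale $\tau$ is a symbol introduced here for illustration:

$$\hat{\theta}_{MAP} = \arg\max_\theta \left[\log P(D|\theta) + \log P(\theta)\right] = \arg\min_\theta \left[-\log P(D|\theta) + \frac{1}{2\tau^2}\|\theta\|_2^2\right]$$

So the penalty strength is $\lambda = \frac{1}{2\tau^2}$: a tighter prior (smaller $\tau$) corresponds to stronger regularization.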

**Q: What is hypothesis testing?**

1. **Null Hypothesis (H₀):** Default assumption (no effect)
2. **Alternative Hypothesis (H₁):** What we want to prove
3. **p-value:** Probability of observing data at least as extreme as what was actually observed, assuming H₀ is true
4. **Significance level (α):** Threshold (typically 0.05)
5. **Decision:** Reject H₀ if p-value < α

- **Type I Error (False Positive):** Reject H₀ when it's true
- **Type II Error (False Negative):** Fail to reject H₀ when it's false
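
The steps above can be traced on a small exact test. This sketch checks whether a coin is fair (H₀: p = 0.5) by summing the probabilities of all outcomes at least as far from the expected count as the observed one. The scenario (16 heads in 20 flips) and the function name are invented for illustration.

```python
from math import comb


def two_sided_binomial_p_value(heads, flips):
    """Exact two-sided p-value for H0: the coin is fair (p = 0.5)."""
    expected = flips / 2
    observed_deviation = abs(heads - expected)
    # Count outcomes at least as extreme as the observed one, in both tails
    extreme = sum(comb(flips, k) for k in range(flips + 1)
                  if abs(k - expected) >= observed_deviation)
    return extreme / 2 ** flips


alpha = 0.05
p_value = two_sided_binomial_p_value(heads=16, flips=20)
print(f"p-value = {p_value:.4f}, reject H0: {p_value < alpha}")
```

Here the p-value is about 0.012, so H₀ is rejected at α = 0.05.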

**Q: What is the difference between population and sample statistics?**

| Population | Sample |
|------------|--------|
| μ (mean) | x̄ (sample mean) |
| σ² (variance) | s² (sample variance, divide by n-1) |
| N (size) | n (sample size) |

Bessel's correction (n-1) gives unbiased estimate of population variance.
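
The effect of dividing by n versus n-1 is easy to see by simulation. The sketch below repeatedly draws small samples from a standard normal (true variance 1) and averages both estimators; the sample size and trial count are arbitrary demo choices.

```python
import random


def average_variance_estimates(n, trials, rng):
    """Average the divide-by-n and divide-by-(n-1) variance estimates."""
    biased_total = unbiased_total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        squared_dev = sum((x - mean) ** 2 for x in sample)
        biased_total += squared_dev / n          # population formula
        unbiased_total += squared_dev / (n - 1)  # Bessel's correction
    return biased_total / trials, unbiased_total / trials


biased, unbiased = average_variance_estimates(n=5, trials=20000,
                                              rng=random.Random(0))
print(f"divide by n:   {biased:.3f}")    # systematically below 1
print(f"divide by n-1: {unbiased:.3f}")  # close to the true variance 1
```

With n = 5 the uncorrected estimator averages about (n-1)/n = 0.8 of the true variance.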

**Code - Probability & Statistics:**
```python
import numpy as np
from scipy import stats

# Sampling from distributions
normal_samples = np.random.normal(loc=0, scale=1, size=1000)  # μ=0, σ=1
uniform_samples = np.random.uniform(low=0, high=1, size=1000)
binomial_samples = np.random.binomial(n=10, p=0.5, size=1000)
poisson_samples = np.random.poisson(lam=5, size=1000)

# Descriptive statistics
mean = np.mean(data)
variance = np.var(data, ddof=1)  # ddof=1 for sample variance
std = np.std(data, ddof=1)
median = np.median(data)

# Correlation and covariance
correlation = np.corrcoef(x, y)[0, 1]
covariance = np.cov(x, y)[0, 1]

# Hypothesis testing - t-test
t_stat, p_value = stats.ttest_ind(group1, group2)  # independent samples
t_stat, p_value = stats.ttest_rel(before, after)   # paired samples

# Chi-square test (categorical data)
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# Normality test
stat, p_value = stats.shapiro(data)  # Shapiro-Wilk test

# Confidence interval for mean
confidence = 0.95
sem = stats.sem(data)  # standard error of mean
ci = stats.t.interval(confidence, len(data)-1, loc=np.mean(data), scale=sem)

# MLE example: fit normal distribution
mu_mle, std_mle = stats.norm.fit(data)

# Bayesian inference with conjugate prior (Beta-Binomial)
# Prior: Beta(α, β), Likelihood: Binomial
# Posterior: Beta(α + successes, β + failures)
prior_alpha, prior_beta = 1, 1  # uniform prior
successes, failures = 7, 3
posterior_alpha = prior_alpha + successes
posterior_beta = prior_beta + failures
posterior_mean = posterior_alpha / (posterior_alpha + posterior_beta)
```

**Common Interview Questions - Probability:**

1. **What's the probability of getting at least one 6 in 4 dice rolls?**
   - P(at least one 6) = 1 - P(no 6s) = 1 - (5/6)⁴ ≈ 0.518

2. **Two coins: one fair, one double-headed. You pick one randomly and flip heads. What's P(fair coin)?**
   - Using Bayes: P(fair|H) = P(H|fair)P(fair) / P(H)
   - = (0.5 × 0.5) / (0.5 × 0.5 + 1 × 0.5) = 0.25 / 0.75 = 1/3

3. **What's the expected number of coin flips to get heads?**
   - Geometric distribution: E[X] = 1/p = 1/0.5 = 2 flips

4. **Why do we use log-likelihood instead of likelihood?**
   - Products become sums (numerically stable)
   - Avoids underflow with many samples
   - Easier to differentiate
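
The underflow point from question 4 can be demonstrated directly. In this sketch, the likelihood of 2000 observations that each have probability 0.5 (made-up numbers) underflows to zero in double precision, while the log-likelihood stays perfectly representable.

```python
import math

probs = [0.5] * 2000  # per-sample probabilities (illustrative values)

likelihood = math.prod(probs)                     # product underflows
log_likelihood = sum(math.log(p) for p in probs)  # sum of logs is stable

print(likelihood)      # 0.0
print(log_likelihood)  # a large negative but finite number
```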

---

### 1.11 Common Interview Questions - ML Basics

1. **What's the difference between parametric and non-parametric models?**
   - Parametric: Fixed number of parameters (Linear Regression, Neural Networks)
   - Non-parametric: Parameters grow with data (KNN, Decision Trees)

2. **What is the curse of dimensionality?**
   - As dimensions increase, data becomes sparse
   - Distance metrics become less meaningful
   - Need exponentially more data

3. **How do you handle imbalanced datasets?**
   - Resampling (oversample minority, undersample majority)
   - SMOTE (synthetic samples)
   - Class weights in loss function
   - Use appropriate metrics (F1, AUC, not accuracy)

4. **What is data leakage?**
   - Information from test set leaks into training
   - Examples: scaling before split, using future data
   - Results in overly optimistic performance

5. **Generative vs Discriminative models?**
   - Generative: Model P(x,y), can generate data (Naive Bayes, GANs)
   - Discriminative: Model P(y|x) directly (Logistic Regression, SVM)
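
The claim in question 2 that distance metrics become less meaningful in high dimensions can be checked numerically. The sketch below measures the relative contrast (max distance minus min distance, divided by min distance) between a query and uniformly random points. The uniform data, point count, and dimensions are assumptions chosen for the demo.

```python
import math
import random


def relative_contrast(dim, n_points, rng):
    """(max - min) / min of distances from a random query to random points."""
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


rng = random.Random(0)
low_dim = relative_contrast(dim=2, n_points=200, rng=rng)
high_dim = relative_contrast(dim=500, n_points=200, rng=rng)
print(f"dim=2:   contrast={low_dim:.2f}")
print(f"dim=500: contrast={high_dim:.2f}")
```

In high dimensions the nearest and farthest points end up at nearly the same distance, so "nearest neighbor" carries much less information.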

---

## Part 2: Deep Learning Fundamentals

### 2.1 Neural Network Basics

**Q: What is a neural network?**

Composition of linear transformations and non-linear activations:
```
Input → Linear → Activation → Linear → Activation → ... → Output
```

**Q: Why do we need activation functions?**

Without them, any depth of linear layers collapses to one linear transformation. Activations introduce non-linearity.
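
The collapse is just associativity of matrix multiplication: W₂(W₁x) = (W₂W₁)x. A small plain-Python sketch with arbitrary example weights shows both the collapse and how a ReLU in between breaks it.

```python
def matmul(a, b):
    # Product of two matrices stored as lists of rows
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, v):
    # Matrix-vector product
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]


def relu(v):
    return [max(0.0, x) for x in v]


W1 = [[1.0, -2.0], [0.5, 3.0], [2.0, 1.0]]  # layer 1: 2 -> 3
W2 = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]    # layer 2: 3 -> 2
x = [1.0, 2.0]

two_linear_layers = matvec(W2, matvec(W1, x))
single_layer = matvec(matmul(W2, W1), x)     # same result with one matrix
with_relu = matvec(W2, relu(matvec(W1, x)))  # no single matrix reproduces this for all x

print(two_linear_layers, single_layer, with_relu)
```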

**Q: What is backpropagation?**

Algorithm to compute gradients efficiently using chain rule:
1. Forward pass: compute outputs
2. Backward pass: compute gradients from output to input
3. Update weights using gradients
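
The three steps can be written out by hand for a tiny network. The sketch below uses a hypothetical one-hidden-unit model, y = w2·tanh(w1·x + b1) + b2 with squared-error loss, applies the chain rule manually, and checks the result against finite differences. The parameter values are arbitrary.

```python
import math


def forward(params, x, target):
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)  # hidden activation
    y = w2 * h + b2             # output
    loss = 0.5 * (y - target) ** 2
    return loss, h, y


def backward(params, x, target):
    w1, b1, w2, b2 = params
    _, h, y = forward(params, x, target)
    dy = y - target             # dL/dy
    dw2, db2 = dy * h, dy       # output layer gradients
    dh = dy * w2                # propagate to the hidden activation
    dz = dh * (1.0 - h ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dw1, db1 = dz * x, dz       # input layer gradients
    return [dw1, db1, dw2, db2]


def numerical_grad(params, x, target, eps=1e-6):
    # Central finite differences, used only to verify the chain rule
    grads = []
    for i in range(len(params)):
        plus, minus = list(params), list(params)
        plus[i] += eps
        minus[i] -= eps
        grads.append((forward(plus, x, target)[0]
                      - forward(minus, x, target)[0]) / (2 * eps))
    return grads


params = [0.5, 0.1, -0.3, 0.2]
x, target, lr = 1.5, 1.0, 0.1

analytic = backward(params, x, target)
numeric = numerical_grad(params, x, target)
loss_before = forward(params, x, target)[0]
params_after = [p - lr * g for p, g in zip(params, analytic)]  # step 3: update
loss_after = forward(params_after, x, target)[0]
print(analytic, numeric, loss_before, loss_after)
```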

**Code - Simple Neural Network:**
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

# Manual backprop example (for understanding)
x = torch.tensor([1.0, 2.0], requires_grad=True)
w = torch.tensor([0.5, 0.5], requires_grad=True)
y = (x * w).sum()  # forward pass
y.backward()       # backward pass
print(f"Gradients: x.grad={x.grad}, w.grad={w.grad}")
```

---

### 2.2 Optimization

**Q: Explain gradient descent variants**

| Variant | Description |
|---------|-------------|
| Batch GD | Use all data per update (slow) |
| Stochastic GD | Use 1 sample per update (noisy) |
| Mini-batch GD | Use batch of samples (best of both) |

**Q: What is momentum?**

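Accumulates an exponentially decaying average of past gradients (a velocity), which smooths noisy updates, speeds up progress along consistent directions, and can help carry the optimizer past shallow local minima and saddle points.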

**Q: Explain Adam optimizer**

Combines momentum (first moment) + adaptive learning rates (second moment). Default choice for most tasks.
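
To make the two moments concrete, here is a scalar sketch of the standard Adam update with bias correction, minimizing the toy objective (θ - 3)². The objective, learning rate, and step count are assumptions for the demo; this is not PyTorch's implementation.

```python
import math


def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    m = v = 0.0  # first and second moment estimates
    history = [theta]
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g      # momentum: average of gradients
        v = beta2 * v + (1 - beta2) * g * g  # average of squared gradients
        m_hat = m / (1 - beta1 ** t)         # bias correction for zero init
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter step size
        history.append(theta)
    return theta, history


# Gradient of the toy objective f(theta) = (theta - 3)^2
theta_final, history = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(f"first step size: {history[1] - history[0]:.4f}")
print(f"final theta: {theta_final:.4f}")
```

Note that the very first step has magnitude close to `lr` regardless of the gradient's scale, because the bias-corrected ratio m̂/√v̂ is ±1 at t = 1.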

**Code - Optimizers:**
```python
import torch.optim as optim

# SGD with momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam (default choice)
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# AdamW (Adam with decoupled weight decay)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Learning rate scheduler
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
# or cosine annealing
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# In training loop:
for epoch in range(epochs):
    train_one_epoch()
    scheduler.step()  # update learning rate
```

---

### 2.3 Weight Initialization

**Q: Why does initialization matter?**

- Too small: vanishing gradients
- Too large: exploding gradients
- Same values: symmetry (neurons learn same thing)

**Common methods:**
- Xavier/Glorot: For tanh/sigmoid
- He/Kaiming: For ReLU
- Normal/Uniform with appropriate scale

**Code - Weight Initialization:**
```python
import torch.nn as nn
import torch.nn.init as init

def init_weights(m):
    if isinstance(m, nn.Linear):
        # Xavier for tanh/sigmoid
        init.xavier_uniform_(m.weight)
        # He/Kaiming for ReLU
        # init.kaiming_uniform_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            init.zeros_(m.bias)
    elif isinstance(m, nn.Conv2d):
        init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')

model.apply(init_weights)  # apply to all layers
```

---

### 2.4 Batch Normalization vs Layer Normalization

| | Batch Norm | Layer Norm |
|---|---|---|
| Normalizes over | Batch dimension | Feature dimension |
| Depends on batch size | Yes | No |
| Best for | CNNs | Transformers, RNNs |
| Train/eval difference | Yes | No |
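
The difference in normalization axis can be shown on a small batch stored as a list of rows. This plain-Python sketch omits the learnable scale and shift and BatchNorm's running statistics; the function names are hypothetical.

```python
import math


def normalize(values, eps=1e-5):
    # Zero-mean, unit-variance normalization of a list of numbers
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(batch):
    """Normalize each feature (column) across the samples in the batch."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def layer_norm(batch):
    """Normalize each sample (row) across its own features."""
    return [normalize(row) for row in batch]


batch = [[1.0, 2.0, 3.0],
         [4.0, 6.0, 8.0],
         [7.0, 10.0, 16.0]]
bn_out = batch_norm(batch)
ln_out = layer_norm(batch)
print(bn_out[0])  # depends on the other rows in the batch
print(ln_out[0])  # depends only on row 0
```

Because layer norm looks only at one sample, its output for a row is the same whatever else is in the batch, which is why it has no train/eval difference.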

**Code - Normalization:**
```python
import torch.nn as nn
import torch.nn.functional as F

# Batch Normalization (for CNNs)
bn = nn.BatchNorm1d(num_features=64)
bn = nn.BatchNorm2d(num_features=64)  # for conv layers

# Layer Normalization (for Transformers)
ln = nn.LayerNorm(normalized_shape=512)

# In a model
class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
    
    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

# Remember: model.train() vs model.eval() affects BatchNorm!
model.train()  # use batch statistics
model.eval()   # use running statistics
```

---

### 2.5 Recurrent Neural Networks (RNN, LSTM, GRU)

**Q: What is an RNN and when do you use it?**

RNNs process sequential data by maintaining a hidden state that captures information from previous time steps.

**Use cases:**
- Time series forecasting
- Natural language processing
- Speech recognition
- Video analysis

**Basic RNN equation:**
$$h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b)$$
$$y_t = W_{hy}h_t$$

**Q: What is the vanishing/exploding gradient problem?**

During backpropagation through time (BPTT), gradients are multiplied repeatedly:
- **Vanishing:** Gradients → 0, early time steps don't learn
- **Exploding:** Gradients → ∞, training becomes unstable
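
The repeated multiplication is easiest to see in a scalar linear recurrence h_t = w·h_{t-1}, where the gradient of h_T with respect to h_0 is exactly w^T. This simplification is an assumption made for the sketch; with a tanh nonlinearity each factor is additionally multiplied by a derivative of at most 1, which only shrinks the product further.

```python
def gradient_through_time(w, steps):
    """|d h_T / d h_0| for the linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one chain-rule factor per time step
    return abs(grad)


vanishing = gradient_through_time(w=0.5, steps=50)
exploding = gradient_through_time(w=1.5, steps=50)
print(f"w=0.5: {vanishing:.3e}")
print(f"w=1.5: {exploding:.3e}")
```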

**Solutions:**
- LSTM/GRU architectures (gating mechanisms)
- Gradient clipping (for exploding)
- Skip connections
- Proper initialization

**Q: Explain LSTM (Long Short-Term Memory)**

LSTM uses gates to control information flow:

| Gate | Formula | Purpose |
|------|---------|---------|
| Forget | $f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$ | What to forget from cell state |
| Input | $i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$ | What new info to store |
| Output | $o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$ | What to output |

**Cell state update:**
$$\tilde{C}_t = \tanh(W_C[h_{t-1}, x_t] + b_C)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$h_t = o_t \odot \tanh(C_t)$$

**Key insight:** Cell state acts as a "highway" for gradients, allowing long-range dependencies.

**Q: Explain GRU (Gated Recurrent Unit)**

GRU is a simplified LSTM with 2 gates instead of 3:

| Gate | Formula | Purpose |
|------|---------|---------|
| Reset | $r_t = \sigma(W_r[h_{t-1}, x_t])$ | How much past to forget |
| Update | $z_t = \sigma(W_z[h_{t-1}, x_t])$ | Balance old vs new |

**Hidden state update:**
$$\tilde{h}_t = \tanh(W[r_t \odot h_{t-1}, x_t])$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

**Q: LSTM vs GRU - when to use which?**

| Aspect | LSTM | GRU |
|--------|------|-----|
| Parameters | More (3 gates) | Fewer (2 gates) |
| Training | Slower | Faster |
| Long sequences | Better | Good |
| Small datasets | May overfit | Better |
| Default choice | Complex tasks | Start here |

**Q: What is bidirectional RNN?**

Processes sequence in both directions (forward and backward), capturing context from both past and future:
- Output combines both hidden states
- Doubles the parameters
- Cannot be used for real-time prediction (needs future)

**Code - RNN/LSTM/GRU in PyTorch:**
```python
import torch
import torch.nn as nn

# Example input with shape (batch, seq_len, input_size)
x = torch.randn(32, 15, 10)

# Basic RNN
rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=2, batch_first=True)
# input: (batch, seq_len, input_size)
# output: (batch, seq_len, hidden_size), h_n: (num_layers, batch, hidden_size)
output, h_n = rnn(x)

# LSTM
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, batch_first=True)
# Returns output, (h_n, c_n) - note the cell state!
output, (h_n, c_n) = lstm(x)

# GRU
gru = nn.GRU(input_size=10, hidden_size=20, num_layers=2, batch_first=True)
output, h_n = gru(x)

# Bidirectional LSTM
bilstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, 
                 batch_first=True, bidirectional=True)
# output shape: (batch, seq_len, 2*hidden_size)
output, (h_n, c_n) = bilstm(x)


# Complete LSTM model for sequence classification
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim, 
                 n_layers=2, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, n_layers, 
                           batch_first=True, dropout=dropout, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)  # *2 for bidirectional
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        # x: (batch, seq_len)
        embedded = self.dropout(self.embedding(x))  # (batch, seq_len, embed_dim)
        output, (h_n, c_n) = self.lstm(embedded)
        
        # Concatenate final forward and backward hidden states
        # h_n shape: (n_layers*2, batch, hidden_dim) for bidirectional
        hidden = torch.cat((h_n[-2,:,:], h_n[-1,:,:]), dim=1)
        
        return self.fc(self.dropout(hidden))


# LSTM that predicts one output per time step (e.g., time series); not an encoder-decoder seq2seq
class LSTMSeq2Seq(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, n_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, n_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        output, (h_n, c_n) = self.lstm(x)
        # output: (batch, seq_len, hidden_dim)
        predictions = self.fc(output)  # (batch, seq_len, output_dim)
        return predictions


# Gradient clipping during training
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for batch in dataloader:
    optimizer.zero_grad()
    output = model(batch.text)
    loss = criterion(output, batch.label)
    loss.backward()
    
    # Clip gradients to prevent exploding gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    
    optimizer.step()


# Packing sequences for variable length inputs
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

def forward_with_packing(self, x, lengths):
    # x: padded sequences (batch, max_seq_len, input_dim)
    # lengths: actual lengths of each sequence (list or 1D CPU tensor)
    
    # Pack to ignore padding in LSTM computation
    packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
    output, (h_n, c_n) = self.lstm(packed)
    
    # Unpack back to padded
    output, _ = pad_packed_sequence(output, batch_first=True)
    return output, h_n
```

**Common Interview Questions - RNN:**

1. **Why can't standard RNNs capture long-range dependencies?**
   - Vanishing gradients: gradients shrink exponentially over time steps
   - Information from early steps gets "washed out"

2. **How does LSTM solve the vanishing gradient problem?**
   - Cell state provides a direct path for gradients (additive, not multiplicative)
   - Gates learn what to remember/forget
   - Gradient can flow unchanged through cell state

3. **When would you use RNN vs Transformer?**
   - RNN: Sequential processing required, limited compute, streaming data
   - Transformer: Parallel processing, long sequences, large datasets
   - Modern trend: Transformers dominate most NLP tasks

4. **What is teacher forcing?**
   - During training: feed ground truth as next input (not model's prediction)
   - Speeds up training but can cause exposure bias
   - Solution: scheduled sampling (gradually use model predictions)

5. **How do you handle variable-length sequences?**
   - Padding + masking
   - Pack sequences (PyTorch's pack_padded_sequence)
   - Truncate to fixed length

---

## Part 3: PyTorch Fundamentals

### 3.1 Tensors - The Foundation

**Q: What is a tensor and why is it important?**

A tensor is PyTorch's fundamental data structure - a multi-dimensional array that can hold numbers:
- 0D tensor = scalar (single number)
- 1D tensor = vector (list of numbers)
- 2D tensor = matrix (table of numbers)
- 3D+ tensor = higher dimensions

**Key advantages over NumPy arrays:**
1. GPU acceleration for massive speedups
2. Automatic differentiation for training neural networks

**Common tensor properties:**
- `.shape` or `.size()` - dimensions
- `.dtype` - data type (float32, int64, etc.)
- `.device` - CPU or GPU location
- `.requires_grad` - gradient tracking flag

**Q: Explain reshape vs view**

- `.reshape()` - More flexible: returns a view when possible, otherwise copies the data
- `.view()` - Always shares memory with the original; raises an error if the memory layout is incompatible (call `.contiguous()` first)
- Use `-1` to auto-infer one dimension: `tensor.reshape(-1, 4)`

---

### 3.2 Device Management

**Q: How do you handle CPU/GPU computation?**

```python
# Check availability and set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create tensor on device
x = torch.tensor([1, 2, 3], device=device)

# Move existing tensor
x = x.to(device)  # or x.cuda() / x.cpu()
```

**Critical rule:** All tensors in an operation must be on the same device!

**Q: When to use GPU vs CPU?**
- GPU: Training large models, big batches, production inference
- CPU: Small models, debugging, data preprocessing

---

### 3.3 Automatic Differentiation (Autograd)

**Q: Explain requires_grad and the computation graph**

`requires_grad=True` tells PyTorch to track operations for gradient computation. PyTorch builds a dynamic computation graph during forward pass.

```python
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2  # y has grad_fn pointing to PowBackward0
y.backward()  # Compute gradients
print(x.grad)  # tensor([4.]) - derivative of x² is 2x
```

**Q: What happens during backward()?**

1. Traverses computation graph from output to inputs
2. Applies chain rule at each operation
3. Stores gradients in `.grad` attributes of leaf tensors
4. Graph is freed after backward (unless `retain_graph=True`)

**Q: Why do gradients accumulate?**

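By default, `.backward()` adds to whatever is already stored in `.grad` rather than overwriting it. This is what makes gradient accumulation over several backward passes possible, but it also means you must call `optimizer.zero_grad()` once per optimizer step (before the backward pass) to clear stale gradients.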
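
The accumulation behaviour is easy to observe directly. A minimal sketch, assuming a two-element parameter and the loss `sum(w²)` purely for illustration:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()       # gradient is 2w

(w ** 2).sum().backward()    # second pass adds to .grad instead of replacing it
second = w.grad.clone()      # now 4w

w.grad.zero_()               # manual reset of this one tensor's gradient
```

`optimizer.zero_grad()` performs the equivalent reset for every parameter the optimizer manages.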

---

### 1.4 nn.Module - Building Neural Networks

**Q: Explain the nn.Module pattern**

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyNetwork(nn.Module):
    def __init__(self):
        super().__init__()  # Always call this first!
        self.layer1 = nn.Linear(10, 5)
        self.layer2 = nn.Linear(5, 2)
    
    def forward(self, x):
        x = F.relu(self.layer1(x))
        return self.layer2(x)
```

**Key methods:**
- `.parameters()` - Get all learnable parameters
- `.to(device)` - Move model to CPU/GPU
- `.train()` / `.eval()` - Set training/evaluation mode
- `.state_dict()` - Get model weights for saving

**Q: nn.Parameter vs regular tensor?**

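`nn.Parameter` is automatically registered when assigned as a module attribute, so it is included in `.parameters()` and saved in `state_dict`. A regular tensor attribute is not tracked at all; use `register_buffer()` for non-trainable state (such as running statistics) that should still be saved and moved with the model.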
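
A short sketch of the three cases. The `Scaler` module and its attribute names are hypothetical, made up for this illustration:

```python
import torch
import torch.nn as nn

class Scaler(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(3))               # trainable, saved
        self.register_buffer("running_mean", torch.zeros(3))   # not trainable, saved
        self.scratch = torch.zeros(3)                          # plain attribute: ignored

    def forward(self, x):
        return (x - self.running_mean) * self.scale

m = Scaler()
param_names = [name for name, _ in m.named_parameters()]
state_keys = list(m.state_dict().keys())
```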

**Code - Saving and Loading Models:**
```python
# Save model weights
torch.save(model.state_dict(), 'model.pth')

# Load model weights
model = MyNetwork()
model.load_state_dict(torch.load('model.pth'))
model.eval()

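# Save entire model (less flexible: pickles the class, so its definition
# must be importable at load time)
torch.save(model, 'full_model.pth')
# PyTorch 2.6+ defaults to weights_only=True, which rejects pickled modules
model = torch.load('full_model.pth', weights_only=False)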

# Save checkpoint (for resuming training)
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, 'checkpoint.pth')
```

---

### 1.5 Activation Functions

**Q: Compare ReLU, GeLU, and Softmax**

| Function | Formula | Use Case |
|----------|---------|----------|
| ReLU | max(0, x) | Hidden layers (default) |
| GeLU | x * Φ(x) | Transformers (BERT, GPT) |
| Softmax | exp(xi)/Σexp(xj) | Output layer (classification) |

**Q: Why are activation functions necessary?**

Without them, stacked linear layers collapse to a single linear transformation. Activations introduce non-linearity, enabling networks to learn complex patterns.
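
The collapse can be checked numerically. A minimal sketch, assuming two small `nn.Linear` layers with arbitrary sizes: without an activation, the stack equals a single linear map with W = W₂W₁ and b = W₂b₁ + b₂.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
l1, l2 = nn.Linear(4, 8), nn.Linear(8, 3)
x = torch.randn(5, 4)

stacked = l2(l1(x))

# Equivalent single layer: W = W2 W1, b = W2 b1 + b2
W = l2.weight @ l1.weight
b = l2.weight @ l1.bias + l2.bias
single = x @ W.T + b

# Inserting a ReLU breaks the equivalence
with_relu = l2(torch.relu(l1(x)))
```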

**Q: What is the dying ReLU problem?**

If a neuron's pre-activation (the input to ReLU) is negative for every training example, ReLU outputs zero and its gradient is zero, so the neuron's weights stop updating. Solutions: LeakyReLU, GeLU, a lower learning rate, or careful initialization.

**Code - Activation Functions:**
```python
import torch.nn.functional as F

# ReLU
x = F.relu(x)
# or as module: nn.ReLU()

# LeakyReLU (fixes dying ReLU)
x = F.leaky_relu(x, negative_slope=0.01)

# GeLU (used in transformers)
x = F.gelu(x)

# Softmax (for classification output)
probs = F.softmax(logits, dim=-1)

# Log-Softmax (numerically stable for NLL loss)
log_probs = F.log_softmax(logits, dim=-1)

# Sigmoid (binary classification)
prob = torch.sigmoid(x)

# Tanh
x = torch.tanh(x)
```

---

### 1.6 Loss Functions

**Q: When to use MSE vs Cross-Entropy?**

- **MSE (Mean Squared Error):** Regression tasks (predicting continuous values)
- **Cross-Entropy:** Classification tasks (predicting categories)

**Q: Why does CrossEntropyLoss expect logits, not probabilities?**

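For numerical stability. Computing softmax and then taking the log as two separate steps can overflow or underflow for large-magnitude logits, producing `inf` or `nan`. `CrossEntropyLoss` fuses log-softmax and negative log-likelihood and uses the log-sum-exp trick internally.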
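
A sketch of the failure mode, assuming deliberately extreme logits chosen to trigger float32 underflow:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 0.0, -1000.0]])
target = torch.tensor([1])

# Naive: softmax underflows to exactly 0 for class 1, so the log is -inf
probs = torch.softmax(logits, dim=-1)
naive_loss = -torch.log(probs[0, target[0]])

# Fused: log-softmax computed via log-sum-exp stays finite
stable_loss = F.cross_entropy(logits, target)
```

The naive path loses the information in the underflow, while the fused path returns the finite value implied by the logits.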

**Code - Loss Functions:**
```python
# Classification - Cross Entropy (expects logits, not probabilities!)
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, labels)  # logits: (batch, classes), labels: (batch,)

# Binary Classification
criterion = nn.BCEWithLogitsLoss()  # includes sigmoid
loss = criterion(logits, labels.float())

# Regression - MSE
criterion = nn.MSELoss()
loss = criterion(predictions, targets)

# Regression - MAE (L1)
criterion = nn.L1Loss()

# With class weights (for imbalanced data)
weights = torch.tensor([1.0, 2.0, 3.0])  # weight per class
criterion = nn.CrossEntropyLoss(weight=weights)

# Label smoothing
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```

---

### 1.7 The Training Loop

**Q: Walk through a complete training step**

```python
for epoch in range(num_epochs):
    for batch_x, batch_y in dataloader:
        # 1. Zero gradients
        optimizer.zero_grad()
        
        # 2. Forward pass
        predictions = model(batch_x)
        loss = criterion(predictions, batch_y)
        
        # 3. Backward pass
        loss.backward()
        
        # 4. Update parameters
        optimizer.step()
```

**Q: Why is the order important?**

- `zero_grad()` before `backward()` - prevents gradient accumulation
- `backward()` before `step()` - gradients must exist before updating
- `step()` uses gradients to update parameters

---

### 1.8 Optimizers

**Q: Compare SGD and Adam**

| Aspect | SGD | Adam |
|--------|-----|------|
| Learning rate | Same for all params | Adaptive per param |
| Momentum | Optional | Built-in |
| Typical LR | 0.01-0.1 | 0.001 |
| Use case | With momentum, still common for CNNs; needs more LR tuning | Good default, especially for transformers |

**Q: What is AdamW?**

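Adam with decoupled weight decay. In plain Adam, L2 regularization is added to the gradient and then rescaled by the adaptive per-parameter step size, so the effective decay no longer depends on the weight's magnitude in the intended way. AdamW instead shrinks the weights directly, independent of the adaptive scaling.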
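
The difference shows up in a single step. A minimal sketch, assuming one scalar weight and a zero loss gradient so that only weight decay acts; the helper `one_step` and the hyperparameters are made up for illustration.

```python
import torch

def one_step(opt_cls, **kwargs):
    w = torch.nn.Parameter(torch.tensor([10.0]))
    opt = opt_cls([w], lr=0.1, **kwargs)
    w.grad = torch.zeros_like(w)   # zero loss gradient: only weight decay acts
    opt.step()
    return w.item()

# Adam + L2: the decay term 0.5 * w enters the gradient and is normalized,
# so the step has size ~lr regardless of how large w is
w_adam_l2 = one_step(torch.optim.Adam, weight_decay=0.5)

# AdamW: the weight is multiplied by (1 - lr * weight_decay) directly
w_adamw = one_step(torch.optim.AdamW, weight_decay=0.5)
```

Under these assumptions Adam moves the weight from 10.0 to about 9.9, while AdamW moves it to about 9.5.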

---

## Part 4: Convolutional Neural Networks (CNNs)

### 10.1 Convolution Operations

**Q: What is a convolution and why use it for images?**

Convolution slides a kernel (filter) over input, computing element-wise multiplication and sum at each position.

**Key properties:**
- **Parameter sharing:** Same kernel applied everywhere (fewer parameters)
- **Translation equivariance:** Detects features regardless of position
- **Local connectivity:** Each output depends on local region only

**Convolution formula (as implemented in deep learning frameworks, i.e. cross-correlation without flipping the kernel):**
$$(I * K)(i,j) = \sum_m \sum_n I(i+m, j+n) \cdot K(m,n)$$

**Output size formula:**
$$\text{out} = \left\lfloor\frac{\text{in} + 2p - k}{s}\right\rfloor + 1$$

Where: p = padding, k = kernel size, s = stride
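
The output size formula can be checked against `nn.Conv2d`. A small sketch; `conv_out_size` is a hypothetical helper name, and the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

def conv_out_size(n_in, kernel, stride=1, padding=0):
    """floor((in + 2p - k) / s) + 1"""
    return (n_in + 2 * padding - kernel) // stride + 1

x = torch.randn(1, 3, 32, 32)
conv = nn.Conv2d(3, 8, kernel_size=5, stride=2, padding=1)
actual = conv(x).shape[-1]
predicted = conv_out_size(32, kernel=5, stride=2, padding=1)
```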

**Q: Explain padding and stride**

| Concept | Description | Effect |
|---------|-------------|--------|
| Padding | Add zeros around input | Preserves spatial size |
| Stride | Step size of kernel | Reduces spatial size |
| Valid padding | No padding | Output smaller than input |
| Same padding | Pad to keep size | Output = Input (stride=1) |

**Q: What is pooling?**

Downsampling operation that reduces spatial dimensions:
- **Max Pooling:** Takes maximum value in window (most common)
- **Average Pooling:** Takes average value in window
- **Global Average Pooling:** Average entire feature map to single value

**Benefits:** Reduces computation, provides a degree of translation invariance, and can help reduce overfitting

**Q: What is the receptive field?**

The region of input that affects a particular output neuron. Grows with:
- More layers (depth)
- Larger kernels
- Larger strides
- Dilated convolutions
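
The growth factors listed above can be combined into one recurrence. A sketch assuming a simple chain of conv/pool layers, each described by a `(kernel_size, stride, dilation)` tuple; the function name is made up for this example.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation), ordered input to output."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1      # dilation enlarges the effective kernel
        rf += (k_eff - 1) * jump     # each layer widens the field by (k_eff - 1) jumps
        jump *= s                    # stride spreads later layers further apart
    return rf

two_3x3 = receptive_field([(3, 1, 1)] * 2)
conv_pool_conv = receptive_field([(3, 1, 1), (2, 2, 1), (3, 1, 1)])
dilated = receptive_field([(3, 1, 2)])
```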

**Code - CNN Components:**
```python
import torch
import torch.nn as nn

# Basic convolution
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, 
                 stride=1, padding=1)  # same padding

# Pooling
maxpool = nn.MaxPool2d(kernel_size=2, stride=2)  # halves spatial dims
avgpool = nn.AdaptiveAvgPool2d((1, 1))  # global average pooling

# Transposed convolution (upsampling)
deconv = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)

# Dilated convolution (larger receptive field)
dilated_conv = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
```

---

### 10.2 Classic CNN Architectures

**Q: Describe key CNN architectures**

| Architecture | Year | Key Innovation | Depth |
|--------------|------|----------------|-------|
| LeNet | 1998 | First successful CNN | 5 |
| AlexNet | 2012 | ReLU, Dropout, GPU | 8 |
| VGG | 2014 | Small 3x3 kernels, depth | 16-19 |
| GoogLeNet | 2014 | Inception modules | 22 |
| ResNet | 2015 | Skip connections | 50-152 |
| DenseNet | 2017 | Dense connections | 121-264 |
| EfficientNet | 2019 | Compound scaling | Variable |

**Q: Explain ResNet and skip connections**

Skip connections add input directly to output: $y = F(x) + x$

**Why it works:**
- Easier to learn residual F(x) = H(x) - x than full mapping H(x)
- Gradients flow directly through skip connections
- Enables training very deep networks (100+ layers)
- Identity mapping is easy to learn (just set F(x) = 0)

**Q: What is the Inception module?**

Parallel convolutions with different kernel sizes (1x1, 3x3, 5x5) + pooling, concatenated together. Captures multi-scale features efficiently.

**Code - CNN Architectures:**
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simple CNN for image classification
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(128 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, num_classes)
        self.dropout = nn.Dropout(0.5)
    
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 32x32 -> 16x16
        x = self.pool(F.relu(self.conv2(x)))  # 16x16 -> 8x8
        x = self.pool(F.relu(self.conv3(x)))  # 8x8 -> 4x4
        x = x.view(x.size(0), -1)  # flatten
        x = self.dropout(F.relu(self.fc1(x)))
        return self.fc2(x)


# ResNet Block
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        
        # Skip connection with projection if dimensions change
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride),
                nn.BatchNorm2d(out_channels)
            )
    
    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)  # skip connection
        return F.relu(out)


# Using pretrained models
from torchvision import models

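num_classes = 10  # example value; set to your dataset's class count

# `pretrained=True` is deprecated since torchvision 0.13; use the weights enum
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the pretrained backbone
for param in resnet.parameters():
    param.requires_grad = False

# Replace the final layer; the new layer's parameters are trainable by default
resnet.fc = nn.Linear(resnet.fc.in_features, num_classes)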
```

---

### 10.3 Common CNN Interview Questions

1. **Why use 3x3 kernels instead of larger ones?**
   - Two stacked 3x3 convs cover the same 5x5 receptive field with fewer parameters (18C² vs 25C² weights for C input and output channels)
   - More non-linearities (more ReLUs)
   - VGG showed this works better

2. **What is 1x1 convolution used for?**
   - Channel-wise pooling (reduce/increase channels)
   - Add non-linearity without changing spatial size
   - Used in Inception, ResNet bottleneck

3. **How do you handle different input sizes?**
   - Global Average Pooling (removes spatial dimensions)
   - Adaptive pooling to fixed size
   - Fully convolutional networks

4. **What causes checkerboard artifacts in generated images?**
   - Transposed convolution with stride > 1
   - Solution: resize + convolution, or careful kernel size

---

## Part 5: LLM & Transformer Concepts

### 2.1 Attention Mechanism

**Q: Explain the attention formula**

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

- **Q (Query):** What am I looking for?
- **K (Key):** What do I contain?
- **V (Value):** What information do I provide?
- **√d_k scaling:** Prevents softmax saturation for large dimensions

**Q: Why scale by √d_k?**

If the components of Q and K are independent with mean 0 and variance 1, their dot product has variance d_k. Dividing by √d_k brings the variance back to 1, which keeps the softmax out of its saturated region where gradients vanish.
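
The variance argument can be checked empirically. A sketch assuming standard normal query and key vectors with d_k = 64 and 10,000 samples:

```python
import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(10_000, d_k)
k = torch.randn(10_000, d_k)

raw = (q * k).sum(dim=-1)          # one dot product per row
scaled = raw / d_k ** 0.5

raw_var = raw.var().item()         # close to d_k
scaled_var = scaled.var().item()   # close to 1
```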

**Q: What is multi-head attention?**

Run multiple attention operations in parallel, each free to learn different patterns. An illustrative (not guaranteed) division of labor:
- Head 1: syntactic relationships
- Head 2: semantic relationships
- Head 3: positional patterns

Formula: MultiHead(Q,K,V) = Concat(head₁,...,headₕ)W^O

**Code - Multi-Head Attention:**
```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)
        
        # Linear projections and reshape to (batch, heads, seq, d_k)
        Q = self.W_q(q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(k).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(v).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        
        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        
        # Apply attention to values
        out = torch.matmul(attn, V)
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_k)
        return self.W_o(out)

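# Using PyTorch's built-in module. Note the opposite mask convention: a boolean
# attn_mask marks positions that are NOT allowed to attend (True = blocked),
# while a float attn_mask is added to the attention scores.
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
query = key = value = torch.randn(2, 10, 512)  # (batch, seq, embed_dim)
mask = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)  # causal
output, attn_weights = mha(query, key, value, attn_mask=mask)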
```

---

### 2.2 Self-Attention vs Cross-Attention

| Type | Q, K, V Source | Use Case |
|------|---------------|----------|
| Self-Attention | Same sequence | Encoder, decoder self-attention |
| Cross-Attention | Q: target, K/V: source | Encoder-decoder connection |

---

### 2.3 Transformer Architecture

**Q: Describe the transformer encoder block**

```
Input → LayerNorm → Multi-Head Self-Attention → + (residual)
                                                ↓
      → LayerNorm → Feed-Forward Network → + (residual) → Output
```

**Q: What is the FFN and why is it important?**

Position-wise Feed-Forward Network: FFN(x) = ReLU(xW₁)W₂

- Provides non-linearity (attention is linear in V)
- Typically 4x expansion (d_model → 4*d_model → d_model)
- Contains roughly 2/3 of each block's parameters (about 8·d_model² in the FFN versus 4·d_model² in attention, ignoring embeddings)
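
The parameter share can be verified by counting. A sketch assuming d_model = 512, 8 heads, and the standard 4x expansion:

```python
import torch.nn as nn

d_model = 512
attn = nn.MultiheadAttention(d_model, num_heads=8)
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.ReLU(),
    nn.Linear(4 * d_model, d_model),
)

def count(module):
    return sum(p.numel() for p in module.parameters())

# Share of the block's attention + FFN parameters that live in the FFN
ffn_share = count(ffn) / (count(ffn) + count(attn))
```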

**Q: Pre-norm vs Post-norm?**

- **Post-norm (original):** LayerNorm(x + Sublayer(x))
- **Pre-norm (modern):** x + Sublayer(LayerNorm(x))

Pre-norm is more stable for training deep networks.
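
The two orderings differ only in where the normalization sits relative to the residual add. A sketch with hypothetical wrapper classes, using a plain `nn.Linear` as a stand-in sublayer:

```python
import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))   # normalize after the residual add

class PreNormBlock(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))   # residual path is left untouched

x = torch.randn(2, 5, 16)
post = PostNormBlock(16, nn.Linear(16, 16))(x)
pre = PreNormBlock(16, nn.Linear(16, 16))(x)
```

In the pre-norm variant the identity path from input to output is never rescaled, which is one common explanation for its better stability in deep stacks.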

---

### 2.4 BERT vs GPT

| Aspect | BERT | GPT |
|--------|------|-----|
| Architecture | Encoder-only | Decoder-only |
| Attention | Bidirectional | Causal (left-to-right) |
| Pre-training | Masked LM + NSP | Next token prediction |
| Strength | Understanding | Generation |

**Q: Why can't BERT generate text?**

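BERT is trained with bidirectional attention, so each position's representation depends on tokens to its right. At generation time those future tokens don't exist, and BERT was never trained to predict the next token from left context alone, so it is poorly suited to autoregressive generation.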

**Q: What is causal masking?**

A lower triangular mask that prevents tokens from attending to future positions. Essential for autoregressive generation.

**Code - Causal Mask:**
```python
def create_causal_mask(seq_len):
    """Create lower triangular mask for autoregressive attention"""
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask  # 1s where attention is allowed, 0s where blocked

# Example: seq_len=4
# [[1, 0, 0, 0],
#  [1, 1, 0, 0],
#  [1, 1, 1, 0],
#  [1, 1, 1, 1]]

# Additive float mask for nn.MultiheadAttention: 0 where allowed, -inf where blocked
def create_causal_mask_additive(seq_len):
    mask = torch.triu(torch.ones(seq_len, seq_len) * float('-inf'), diagonal=1)
    return mask
```

---

### 2.5 Positional Encoding

**Q: Why do transformers need positional encoding?**

Transformers process all tokens in parallel with no inherent position awareness. Without positional info, "cat sat mat" and "mat sat cat" would be identical.

**Q: Compare positional encoding methods**

| Method | Learned? | Extrapolation | Used In |
|--------|----------|---------------|---------|
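| Sinusoidal | No | Possible in principle, limited in practice | Original Transformer |
| Learned | Yes | None beyond the trained max length | BERT, GPT-2 |
| RoPE | No | Better; usually extended with scaling (e.g. position interpolation) | LLaMA, Mistral |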

**RoPE (Rotary Position Embeddings):** Encodes position by rotating Q and K vectors. Naturally captures relative positions.
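
A minimal sketch of the rotation, assuming the interleaved-pair layout (channels 0/1, 2/3, ... form the rotated pairs) and a base of 10000; production implementations differ in layout and caching. The `rope` and `score` helpers are made up for this example.

```python
import torch

def rope(x, positions, base=10000.0):
    """Rotate channel pairs of x (..., seq, d) by position-dependent angles."""
    d = x.size(-1)
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    angles = positions.float().unsqueeze(-1) * inv_freq                   # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

torch.manual_seed(0)
q, k = torch.randn(1, 8), torch.randn(1, 8)

def score(m, n):
    """Attention score between q placed at position m and k at position n."""
    return (rope(q, torch.tensor([m])) * rope(k, torch.tensor([n]))).sum()
```

Because both vectors are rotated, `score(m, n)` depends only on the offset `m - n`, which is the relative-position property described above.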

**Code - Positional Encoding:**
```python
import math

class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                            (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))  # (1, max_len, d_model)
    
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

# Learned positional embeddings (like BERT/GPT-2)
class LearnedPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.pos_embed = nn.Embedding(max_len, d_model)
    
    def forward(self, x):
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pos_embed(positions)
```

---

### 2.6 Modern LLM Techniques

**Q: What is RMSNorm and why use it?**

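Simplified LayerNorm without mean subtraction (and without a bias term):
$$\text{RMSNorm}(x) = \gamma \cdot \frac{x}{\text{RMS}(x)}, \quad \text{RMS}(x) = \sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}$$

Somewhat faster than LayerNorm (the exact speedup depends on model and hardware), has fewer parameters, and works comparably in practice.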
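
A minimal RMSNorm sketch, assuming normalization over the last dimension and a small epsilon inside the square root for stability:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # gamma; no bias term

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms

x = torch.randn(4, 10, 32) * 5 + 3   # deliberately not zero-mean
y = RMSNorm(32)(x)
out_rms = y.pow(2).mean(dim=-1).sqrt()
```

With the default weight of ones, every output vector has RMS close to 1 even though its mean is not removed.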

**Q: What is SwiGLU?**

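Gated activation: SwiGLU(x) = Swish(xW₁) ⊙ xW₃. Used as an FFN, the gated result is then projected back with W₂.

Reported to outperform ReLU/GeLU feed-forward layers at similar parameter counts; used in LLaMA, PaLM.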
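
A sketch of a SwiGLU feed-forward layer, assuming bias-free projections and an output projection W₂ as in LLaMA-style blocks; the class name and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)  # value projection
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x):
        # Swish with beta = 1 is the same function as SiLU
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

ffn = SwiGLUFFN(d_model=16, d_hidden=44)
y = ffn(torch.randn(2, 5, 16))
```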

**Q: Explain Grouped Query Attention (GQA)**

Multiple Q heads share K/V heads to reduce KV cache size:
- MHA: 8 Q heads, 8 KV heads
- GQA: 8 Q heads, 2 KV heads (4x smaller cache)

Used in LLaMA 2 70B, Mistral.
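
A sketch of the head sharing, assuming random Q/K/V tensors and PyTorch 2.0+ for `F.scaled_dot_product_attention`. Only the smaller K/V tensors would be cached; they are expanded on the fly to match the query heads.

```python
import torch
import torch.nn.functional as F

batch, seq, d_head = 2, 6, 16
n_q_heads, n_kv_heads = 8, 2
group = n_q_heads // n_kv_heads        # 4 query heads share each KV head

q = torch.randn(batch, n_q_heads, seq, d_head)
k = torch.randn(batch, n_kv_heads, seq, d_head)   # only these are cached
v = torch.randn(batch, n_kv_heads, seq, d_head)

# Expand KV heads so each query head has a matching K/V
k_exp = k.repeat_interleave(group, dim=1)
v_exp = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k_exp, v_exp, is_causal=True)
cache_ratio = (n_q_heads * seq * d_head) / (n_kv_heads * seq * d_head)
```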

---

### 2.7 Efficient Attention

**Q: Why is attention O(n²)?**

Computing QK^T requires n² dot products (one per query-key pair), and the resulting n×n score matrix must also be stored. For 100K tokens, that's 10 billion dot products per head per layer!

**Q: What is Flash Attention?**

IO-aware attention that never materializes the full n×n matrix:
- Tiles computation into blocks
- Computes in fast SRAM
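- Roughly 2-4x faster, with memory linear rather than quadratic in sequence length (reported savings of about 5-20x, depending on sequence length)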
- Exact computation (not approximation)

**Q: What is sliding window attention?**

Each token attends only to the previous W tokens. Because every layer can pass information a further W positions back, the theoretical receptive field after L layers is about L×W. Mistral 7B: 32 layers × 4096 = 131K theoretical context.
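
The mask for this pattern is a banded lower-triangular matrix. A sketch assuming the window counts the current token itself (conventions vary between implementations); the function name is made up.

```python
import torch

def sliding_window_mask(seq_len, window):
    """True where query i may attend to key j: i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=5, window=2)
```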

---

### 2.8 Mixture of Experts (MoE)

**Q: Explain MoE architecture**

Replace FFN with multiple "expert" FFNs. Router selects top-k experts per token.

- Mixtral 8x7B: 47B total params, only 13B active
- More capacity without proportional compute increase
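
A toy top-k MoE layer as a sketch. It assumes a linear router, softmax renormalized over the selected experts, and a simple loop over experts; real systems use batched dispatch and capacity limits. `TinyMoE` and its sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model, d_hidden, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)    # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Tokens that picked expert e, and which of their top-k slots it filled
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() > 0:
                weight = gates[token_ids, slot].unsqueeze(-1)
                out[token_ids] += weight * expert(x[token_ids])
        return out

torch.manual_seed(0)
moe = TinyMoE(d_model=8, d_hidden=16)
tokens = torch.randn(10, 8)
y = moe(tokens)
```

Each token runs through only `top_k` of the experts, which is why active parameters are far fewer than total parameters.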

**Q: What is load balancing loss?**

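Auxiliary loss that discourages the router from sending most tokens to a few experts:
$$\mathcal{L}_{aux} = \alpha \cdot n \cdot \sum_i f_i \cdot P_i$$

Where: n = number of experts, f_i = fraction of tokens routed to expert i, P_i = mean router probability for expert i, α = small weighting coefficient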
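
A sketch of this loss, assuming top-1 routing when computing f_i (the fraction of tokens sent to expert i) and taking P_i as the mean router probability for expert i; the value of alpha is an arbitrary example.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, alpha=0.01):
    """router_logits: (tokens, n_experts). Top-1 routing assumed for f_i."""
    n_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)
    assigned = probs.argmax(dim=-1)
    f = F.one_hot(assigned, n_experts).float().mean(dim=0)  # fraction of tokens per expert
    P = probs.mean(dim=0)                                   # mean router probability per expert
    return alpha * n_experts * (f * P).sum()

balanced = load_balancing_loss(10.0 * torch.eye(4))                  # one token per expert
collapsed = load_balancing_loss(torch.tensor([[10.0, 0, 0, 0]] * 4)) # all tokens to expert 0
```

The loss is smallest when tokens are spread evenly and grows as routing collapses onto one expert.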

---

### 2.9 Fine-Tuning Techniques

**Q: What is LoRA?**

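Low-Rank Adaptation: instead of updating W directly, freeze it and learn a low-rank update W' = W + BA, where B and A share a small inner dimension r (the rank).

- Rank 8 on a 4096×4096 weight: 2 × 4096 × 8 = 65,536 trainable parameters instead of about 16.8M (256x fewer)
- Can merge into weights for zero inference overhead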

**Q: What is QLoRA?**

LoRA + 4-bit quantization:
- Base model quantized to 4-bit
- LoRA adapters kept in 16-bit precision (BF16 or FP16)
- Enables 7B+ models on consumer GPUs

**Code - LoRA Concept:**
```python
class LoRALinear(nn.Module):
    """Low-Rank Adaptation layer"""
    def __init__(self, original_layer, rank=8, alpha=16):
        super().__init__()
        self.original = original_layer
        for p in self.original.parameters():
            p.requires_grad = False  # freeze original weight and bias
        
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        
        # Low-rank matrices
        self.lora_A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
        self.scale = alpha / rank
    
    def forward(self, x):
        # Original output + low-rank update
        return self.original(x) + (x @ self.lora_A @ self.lora_B) * self.scale

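# Apply LoRA to a model (matching on 'attention' in the name is just an example filter)
def apply_lora(model, rank=8):
    # Collect targets first: replacing modules while iterating named_modules() is unsafe
    targets = [(name, module) for name, module in model.named_modules()
               if isinstance(module, nn.Linear) and 'attention' in name]
    for name, module in targets:
        parent_name, _, child_name = name.rpartition('.')
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, LoRALinear(module, rank))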
```

---

### 2.10 Training LLMs

**Q: What learning rate schedule is used?**

Linear warmup + cosine decay:
- Warmup: 1-10% of training
- Peak LR: 1e-4 to 6e-4
- Decay to 10% of peak
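
The schedule described above can be sketched as a plain function of the step. The name `warmup_cosine_lr` and the default values (peak 3e-4, 5% warmup, floor at 10% of peak) are assumptions chosen from the ranges listed, not a fixed recipe.

```python
import math

def warmup_cosine_lr(step, total_steps, peak_lr=3e-4, warmup_frac=0.05, min_ratio=0.1):
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps            # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))
    return peak_lr * (min_ratio + (1 - min_ratio) * cosine)   # decays to min_ratio * peak

lrs = [warmup_cosine_lr(s, total_steps=1000) for s in range(1001)]
```

A function like this can be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the ratio to the base learning rate instead of the absolute value.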

**Q: Why gradient clipping?**

Transformers can have exploding gradients, especially early in training. Clipping the global gradient norm (1.0 is a common choice) stabilizes training.

**Q: What is the KV cache?**

During generation, cache K and V from previous tokens to avoid recomputing them. This reduces per-token attention compute from O(n²) to O(n), at the cost of cache memory that grows linearly with sequence length.
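
A single-head sketch showing that cached decoding reproduces full causal attention. It assumes the per-token q, k, v vectors are already given (in a real model they come from projecting each new token's hidden state).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq, d = 6, 8
q, k, v = torch.randn(seq, d), torch.randn(seq, d), torch.randn(seq, d)

# Reference: full causal attention over the whole sequence
mask = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
scores = (q @ k.T) / d ** 0.5
full = F.softmax(scores.masked_fill(~mask, float('-inf')), dim=-1) @ v

# Incremental decoding: append one K/V row per step, attend with only the new query
k_cache, v_cache, steps = [], [], []
for t in range(seq):
    k_cache.append(k[t])
    v_cache.append(v[t])
    K, V = torch.stack(k_cache), torch.stack(v_cache)      # (t+1, d)
    attn = F.softmax((q[t] @ K.T) / d ** 0.5, dim=-1)      # (t+1,)
    steps.append(attn @ V)
cached = torch.stack(steps)
```

Each step does work proportional to the current length, and no causal mask is needed because future keys are simply not in the cache yet.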

**Code - Training LLM Utilities:**
```python
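# Learning rate schedule with warmup. This is the original Transformer ("Noam")
# schedule: linear warmup followed by inverse-square-root decay (not cosine).
def get_lr(step, d_model, warmup_steps=4000):
    """Transformer learning rate schedule"""
    step = max(step, 1)  # avoid 0 ** -0.5 at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)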

# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Mixed precision training
# (torch.amp.GradScaler needs PyTorch 2.3+; older versions use torch.cuda.amp.GradScaler)
import torch

scaler = torch.amp.GradScaler('cuda')
for inputs, targets in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        output = model(inputs)
        loss = criterion(output, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Gradient accumulation (for larger effective batch size)
accumulation_steps = 4
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(dataloader):
    # Divide so the accumulated gradient matches the mean over the larger batch
    loss = criterion(model(inputs), targets) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

---

## Part 6: Generative Models

### 11.1 Variational Autoencoders (VAE)

**Q: Explain the VAE architecture**

VAE learns a latent representation by:
1. **Encoder:** Maps input x to distribution parameters (μ, σ)
2. **Sampling:** z ~ N(μ, σ²) using reparameterization trick
3. **Decoder:** Reconstructs x from z

**Objective (the ELBO, which is maximized; the training loss is its negative):**
$$\mathcal{L} = \underbrace{\mathbb{E}_{q(z|x)}[\log p(x|z)]}_{\text{Reconstruction}} - \underbrace{D_{KL}(q(z|x) || p(z))}_{\text{Regularization}}$$

**Q: What is the reparameterization trick?**

Instead of sampling z ~ N(μ, σ²) directly (non-differentiable), we:
1. Sample ε ~ N(0, 1)
2. Compute z = μ + σ * ε

This allows gradients to flow through μ and σ.

**Code - VAE:**
```python
class VAE(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_var = nn.Linear(hidden_dim, latent_dim)  # predicts log-variance, not variance
        
        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid()
        )
    
    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_var(h)
    
    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std
    
    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        return self.decoder(z), mu, log_var

def vae_loss(recon_x, x, mu, log_var):
    # Negative ELBO: reconstruction term plus closed-form KL(q(z|x) || N(0, I))
    recon_loss = F.binary_cross_entropy(recon_x, x, reduction='sum')
    kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl_loss
```

---

### 11.2 Generative Adversarial Networks (GANs)

**Q: Explain the GAN framework**

Two networks competing:
- **Generator G:** Creates fake samples from noise z
- **Discriminator D:** Distinguishes real from fake

**Minimax objective:**
$$\min_G \max_D \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

**Training alternates:**
1. Train D to maximize: correctly classify real/fake
2. Train G to minimize: fool D (make D(G(z)) → 1)

**Q: What are common GAN training problems?**

| Problem | Description | Solution |
|---------|-------------|----------|
| Mode collapse | G produces limited variety | Minibatch discrimination, unrolled GAN |
| Vanishing gradients | D too strong, G gets no signal | Wasserstein loss, label smoothing |
| Training instability | Oscillating losses | Spectral normalization, progressive growing |
| Non-convergence | Never reaches equilibrium | Two-timescale update, careful hyperparams |

**Code - Simple GAN:**
```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    def __init__(self, latent_dim, img_shape):
        super().__init__()
        self.img_shape = img_shape
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, int(np.prod(img_shape))),
            nn.Tanh()
        )
    
    def forward(self, z):
        img = self.model(z)
        return img.view(img.size(0), *self.img_shape)

class Discriminator(nn.Module):
    def __init__(self, img_shape):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(int(np.prod(img_shape)), 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )
    
    def forward(self, img):
        return self.model(img.view(img.size(0), -1))

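# Training loop (assumes generator, discriminator, optimizer_G, optimizer_D,
# latent_dim, and a dataloader yielding image batches are already defined)
for real_imgs in dataloader:
    batch_size = real_imgs.size(0)  # the last batch may be smaller

    # Train Discriminator
    z = torch.randn(batch_size, latent_dim)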
    fake_imgs = generator(z)
    
    real_loss = F.binary_cross_entropy(discriminator(real_imgs), torch.ones(batch_size, 1))
    fake_loss = F.binary_cross_entropy(discriminator(fake_imgs.detach()), torch.zeros(batch_size, 1))
    d_loss = (real_loss + fake_loss) / 2
    
    optimizer_D.zero_grad()
    d_loss.backward()
    optimizer_D.step()
    
    # Train Generator
    z = torch.randn(batch_size, latent_dim)
    fake_imgs = generator(z)
    g_loss = F.binary_cross_entropy(discriminator(fake_imgs), torch.ones(batch_size, 1))
    
    optimizer_G.zero_grad()
    g_loss.backward()
    optimizer_G.step()
```

---

### 11.3 Diffusion Models

**Q: Explain how diffusion models work**

Two processes:
1. **Forward (diffusion):** Gradually add noise to data over T steps
2. **Reverse (denoising):** Learn to remove noise step by step

**Forward process:**
$$q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)$$

**Training objective:** Predict the noise added at each step
$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}[||\epsilon - \epsilon_\theta(x_t, t)||^2]$$

**Sampling:** Start from noise, iteratively denoise

**Q: How do diffusion models compare with GANs?**

| Aspect | Diffusion | GAN |
|--------|-----------|-----|
| Training | Stable (simple MSE loss) | Unstable (adversarial) |
| Mode coverage | Better diversity | Mode collapse risk |
| Sample quality | State-of-the-art | Good but less consistent |
| Speed | Slow (many steps) | Fast (single forward pass) |
| Controllability | Easy (classifier guidance) | Harder |

**Code - Simple Diffusion Model:**
```python
class SimpleDiffusion:
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_timesteps = num_timesteps
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1 - self.betas
        self.alpha_cumprod = torch.cumprod(self.alphas, dim=0)
    
    def add_noise(self, x_0, t, noise=None):
        """Forward process: add noise to x_0"""
        if noise is None:
            noise = torch.randn_like(x_0)
        
        sqrt_alpha_cumprod = self.alpha_cumprod[t].sqrt().view(-1, 1, 1, 1)
        sqrt_one_minus = (1 - self.alpha_cumprod[t]).sqrt().view(-1, 1, 1, 1)
        
        return sqrt_alpha_cumprod * x_0 + sqrt_one_minus * noise
    
    @torch.no_grad()  # sampling needs no gradient tracking
    def sample(self, model, shape):
        """Reverse process: denoise from pure noise"""
        x = torch.randn(shape)
        
        for t in reversed(range(self.num_timesteps)):
            t_batch = torch.full((shape[0],), t, dtype=torch.long)
            predicted_noise = model(x, t_batch)
            
            alpha = self.alphas[t]
            alpha_cumprod = self.alpha_cumprod[t]
            beta = self.betas[t]
            
            if t > 0:
                noise = torch.randn_like(x)
            else:
                noise = 0
            
            x = (1 / alpha.sqrt()) * (x - (beta / (1 - alpha_cumprod).sqrt()) * predicted_noise)
            x = x + beta.sqrt() * noise
        
        return x

# Training (assumes model, optimizer, and a dataloader of image batches are defined)
diffusion = SimpleDiffusion()
for x_0 in dataloader:
    t = torch.randint(0, diffusion.num_timesteps, (x_0.size(0),))
    noise = torch.randn_like(x_0)
    x_t = diffusion.add_noise(x_0, t, noise)
    
    predicted_noise = model(x_t, t)
    loss = F.mse_loss(predicted_noise, noise)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

---

## Part 7: Graph Neural Networks (GNNs)

### 12.1 Graph Basics

**Q: When do you use GNNs?**

When data has graph structure:
- Social networks (users = nodes, friendships = edges)
- Molecules (atoms = nodes, bonds = edges)
- Knowledge graphs
- Recommendation systems
- Traffic networks

**Q: What is message passing?**

Core GNN operation:
1. **Aggregate:** Collect messages from neighbors
2. **Update:** Combine with node's own features

$$h_v^{(k+1)} = \text{UPDATE}^{(k)}\left(h_v^{(k)}, \text{AGGREGATE}^{(k)}(\{h_u^{(k)} : u \in \mathcal{N}(v)\})\right)$$
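
One round of message passing in plain PyTorch, as a sketch. It assumes mean aggregation and a deliberately simple UPDATE (averaging a node's features with its aggregate); real layers apply learned weights and a non-linearity.

```python
import torch

# Undirected path graph 0 - 1 - 2, stored as directed edges (source -> target)
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])
h = torch.tensor([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])   # node features

src, dst = edge_index
# AGGREGATE: mean of neighbor features, via scatter-add then divide by degree
agg = torch.zeros_like(h).index_add_(0, dst, h[src])
deg = torch.zeros(h.size(0)).index_add_(0, dst, torch.ones(dst.size(0)))
agg = agg / deg.clamp(min=1).unsqueeze(-1)

# UPDATE: here simply average the node's own features with the aggregate
h_next = 0.5 * (h + agg)
```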

### 12.2 GNN Architectures

**Q: Compare GCN, GAT, and GraphSAGE**

| Model | Aggregation | Key Feature |
|-------|-------------|-------------|
| GCN | Degree-normalized sum (close to a mean) | Spectral-based, simple |
| GAT | Attention-weighted | Learns neighbor importance |
| GraphSAGE | Sample + aggregate | Inductive, scalable |

**GCN layer:**
$$H^{(l+1)} = \sigma(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(l)}W^{(l)})$$

Where $\tilde{A} = A + I$ (add self-loops), $\tilde{D}$ is degree matrix
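
The propagation rule can be written directly with dense matrices. A sketch assuming a tiny 3-node path graph, ReLU as σ, and identity features and weights so the normalized adjacency is visible in the output; practical implementations use sparse operations.

```python
import torch

def gcn_layer(A, H, W):
    """One GCN propagation step with dense matrices (ReLU as sigma)."""
    A_tilde = A + torch.eye(A.size(0))             # add self-loops
    deg = A_tilde.sum(dim=1)                       # diagonal of D-tilde
    d_inv_sqrt = deg.pow(-0.5)
    # D^{-1/2} A_tilde D^{-1/2} via row and column scaling
    A_hat = d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)
    return torch.relu(A_hat @ H @ W)

A = torch.tensor([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])
H = torch.eye(3)
W = torch.eye(3)
out = gcn_layer(A, H, W)
```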

**Code - GNN with PyTorch Geometric:**
```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, GATConv, SAGEConv
from torch_geometric.data import Data

# Create graph data
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]], dtype=torch.long)
x = torch.randn(3, 16)  # 3 nodes, 16 features
data = Data(x=x, edge_index=edge_index)

# GCN Model
class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)
    
    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

# GAT Model (with attention)
class GAT(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, heads=8):
        super().__init__()
        self.conv1 = GATConv(in_channels, hidden_channels, heads=heads)
        self.conv2 = GATConv(hidden_channels * heads, out_channels, heads=1)
    
    def forward(self, x, edge_index):
        x = F.elu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

# GraphSAGE (inductive learning)
class GraphSAGE(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = SAGEConv(in_channels, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, out_channels)
    
    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)
```

---

### 12.3 GNN Tasks and Challenges

**Q: What are the main GNN tasks?**

| Task | Level | Example |
|------|-------|---------|
| Node classification | Node | Predict user interests |
| Link prediction | Edge | Recommend friends |
| Graph classification | Graph | Predict molecule property |
| Node clustering | Node | Community detection |

**Q: What is over-smoothing in GNNs?**

With many layers, node representations converge to nearly identical values, so nodes become hard to distinguish.
- Caused by repeated averaging over neighbors
- Solutions: Skip connections, dropout, fewer layers, jumping knowledge
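
The effect of repeated averaging can be seen in a small experiment. This sketch assumes plain mean aggregation over each node and its neighbors, with no weights or nonlinearity; the graph and the helper `mean_aggregate` are illustrative.

```python
def mean_aggregate(values, neighbors):
    """One layer of mean aggregation over each node's neighbors plus the node itself."""
    return [
        (values[v] + sum(values[u] for u in neighbors[v])) / (1 + len(neighbors[v]))
        for v in range(len(values))
    ]

neighbors = {0: [1], 1: [0, 2], 2: [1]}   # path graph 0 - 1 - 2
values = [1.0, 0.0, -1.0]                  # one scalar feature per node

spread_before = max(values) - min(values)
for _ in range(30):                        # 30 stacked aggregation "layers"
    values = mean_aggregate(values, neighbors)
spread_after = max(values) - min(values)

print(spread_before, spread_after)         # the spread shrinks towards zero
```

After 30 rounds the three nodes carry almost the same value, which is the over-smoothing behavior described above.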

**Q: Transductive vs Inductive learning?**

| | Transductive | Inductive |
|---|---|---|
| Test nodes | Present in the training graph (labels hidden) | Unseen during training |
| Example | GCN (original setup) | GraphSAGE |
| Use case | Fixed graph | Growing graph |

---

## Part 8: Natural Language Processing Basics

### 13.1 Text Preprocessing

**Q: What are the steps in NLP preprocessing?**

1. **Tokenization:** Split text into tokens (words, subwords, characters)
2. **Lowercasing:** Normalize case (optional)
3. **Stop word removal:** Remove common words (optional)
4. **Stemming/Lemmatization:** Reduce to root form
5. **Encoding:** Convert to numbers

**Q: Compare tokenization approaches**

| Method | Granularity | Pros | Cons |
|--------|-------------|------|------|
| Word | Words | Interpretable | Large vocab, OOV |
| Character | Characters | Small vocab, no OOV | Long sequences |
| Subword (BPE) | Subwords | Balance | Requires training |
| SentencePiece | Subwords | Language agnostic | Requires training |
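
To illustrate why subword methods "require training", here is a minimal sketch of how BPE merges could be learned from word counts. It is a simplified toy version (no end-of-word marker, ties broken by first occurrence), and the function names and the tiny corpus are assumptions for illustration.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """corpus maps a tuple of symbols (one word) to its frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(corpus, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_counts, num_merges):
    """Start from characters and greedily merge the most frequent adjacent pair."""
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(corpus)
        if pair is None:
            break
        corpus = merge_pair(corpus, pair)
        merges.append(pair)
    return merges, corpus

merges, corpus = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)
print(list(corpus))
```

Frequent fragments such as `est` become single tokens, while rare words stay split into smaller pieces, which is how BPE avoids out-of-vocabulary tokens.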

**Code - Text Preprocessing:**
```python
import re
from collections import Counter

# Basic tokenization
def simple_tokenize(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    return text.split()

# Build vocabulary
def build_vocab(texts, min_freq=1):
    counter = Counter()
    for text in texts:
        counter.update(simple_tokenize(text))
    
    vocab = {'<PAD>': 0, '<UNK>': 1, '<BOS>': 2, '<EOS>': 3}
    for word, freq in counter.items():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab

# Encode text
def encode(text, vocab, max_len=None):
    tokens = simple_tokenize(text)
    ids = [vocab.get(t, vocab['<UNK>']) for t in tokens]
    if max_len:
        ids = ids[:max_len] + [vocab['<PAD>']] * max(0, max_len - len(ids))
    return ids

# Using HuggingFace tokenizers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
encoded = tokenizer("Hello world!", return_tensors='pt')
# Returns: input_ids, attention_mask, token_type_ids
```

---

### 13.2 Word Embeddings

**Q: What are word embeddings and why use them?**

Dense vector representations of words that capture semantic meaning.
- Similar words have similar vectors
- Enable mathematical operations (king - man + woman ≈ queen)
- Much smaller than one-hot encoding
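
The analogy arithmetic can be sketched with cosine similarity. The 2-d vectors below are hand-made for illustration (axis 0 loosely "royalty", axis 1 loosely "gender"), not trained embeddings, and `analogy` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors, not learned from data
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```

With real embeddings the match is approximate, which is why the query words are usually excluded from the candidate set.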

**Q: Compare Word2Vec, GloVe, and FastText**

| Method | Training | Key Feature |
|--------|----------|-------------|
| Word2Vec (CBOW) | Predict word from context | Fast, local context |
| Word2Vec (Skip-gram) | Predict context from word | Better for rare words |
| GloVe | Matrix factorization | Global co-occurrence stats |
| FastText | Subword n-grams | Handles OOV words |

**Code - Word Embeddings:**
```python
import torch
import torch.nn as nn

# PyTorch Embedding layer
vocab_size = 10000
embed_dim = 300
embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

# Look up embeddings
word_ids = torch.LongTensor([1, 5, 3, 2])
word_vectors = embedding(word_ids)  # (4, 300)

# Load pretrained embeddings (GloVe)
def load_glove(path, vocab, embed_dim):
    embeddings = torch.zeros(len(vocab), embed_dim)
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            if word in vocab:
                embeddings[vocab[word]] = torch.FloatTensor([float(v) for v in values[1:]])
    return embeddings

# Using gensim
from gensim.models import Word2Vec, KeyedVectors

# Train Word2Vec
sentences = [["hello", "world"], ["machine", "learning"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
vector = model.wv["hello"]

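# Load pretrained GloVe vectors
# (raw GloVe text files have no header line; no_header=True needs gensim >= 4.0)
glove = KeyedVectors.load_word2vec_format('glove.txt', binary=False, no_header=True)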
similar = glove.most_similar("king", topn=5)
```

---

### 13.3 Text Classification

**Q: What are common approaches for text classification?**

| Approach | Description | When to Use |
|----------|-------------|-------------|
| Bag of Words + ML | TF-IDF features + classifier | Simple baseline |
| CNN | Convolutions over embeddings | Fast, good for short text |
| LSTM/GRU | Sequential processing | Captures order |
| Transformer | Self-attention | State-of-the-art |

**Code - Text Classification Models:**
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# CNN for Text Classification
class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes, kernel_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        ])
        self.fc = nn.Linear(len(kernel_sizes) * num_filters, num_classes)
        self.dropout = nn.Dropout(0.5)
    
    def forward(self, x):
        # x: (batch, seq_len)
        x = self.embedding(x).transpose(1, 2)  # (batch, embed_dim, seq_len)
        x = [F.relu(conv(x)) for conv in self.convs]  # list of (batch, num_filters, *)
        x = [F.max_pool1d(c, c.size(2)).squeeze(2) for c in x]  # (batch, num_filters) each
        x = torch.cat(x, dim=1)  # (batch, num_filters * len(kernel_sizes))
        x = self.dropout(x)
        return self.fc(x)

# Using HuggingFace for classification
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

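# Tokenize and run a forward pass
# (the classification head is randomly initialized until the model is fine-tuned)
inputs = tokenizer("This is great!", return_tensors='pt')
outputs = model(**inputs)
logits = outputs.logits  # shape: (batch, num_labels)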
```

---

## Part 9: Reinforcement Learning Fundamentals

### 2.1 What is Reinforcement Learning?

**Q: How does RL differ from supervised/unsupervised learning?**

| Paradigm | Data | Feedback | Example |
|----------|------|----------|---------|
| Supervised | Labeled pairs | Correct answer provided | Image classification |
| Unsupervised | Unlabeled | No feedback | Clustering |
| Reinforcement | Interaction | Delayed rewards | Game playing |

**The RL Loop:**
1. Agent observes state
2. Agent takes action
3. Environment transitions to new state
4. Agent receives reward
5. Repeat

---

### 2.2 Core RL Terminology

**Q: Define the key RL concepts**

- **Agent:** The learner and decision maker
- **Environment:** Everything outside the agent
- **State (s):** Current situation representation
- **Action (a):** Choice the agent can make
- **Reward (r):** Immediate feedback signal
- **Policy (π):** Strategy mapping states to actions
- **Value Function V(s):** Expected return from state s
- **Q-Function Q(s,a):** Expected return from taking action a in state s

---

### 2.3 Markov Decision Process (MDP)

**Q: What is an MDP?**

An MDP is defined by tuple (S, A, P, R, γ):
- **S:** State space
- **A:** Action space
- **P(s'|s,a):** Transition probability
- **R(s,a,s'):** Reward function
- **γ:** Discount factor (0 to 1)

**The Markov Property:** Future depends only on present, not past.
$$P(s_{t+1}|s_t, a_t, s_{t-1}, ...) = P(s_{t+1}|s_t, a_t)$$

---

### 2.4 Value Functions and Bellman Equations

**Q: Explain the Bellman equation**

**State-Value Function:**
$$V^\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]$$

where $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ is the discounted return from time $t$.

**Action-Value Function:**
$$Q^\pi(s,a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a]$$

**Bellman Equation for V:**
$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^\pi(s')]$$

**Bellman Optimality Equation:**
$$V^*(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^*(s')]$$
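
Applying the Bellman optimality equation repeatedly as an update rule gives value iteration. The sketch below runs it on a hypothetical two-state MDP; the dictionary layout `P[s][a] = [(prob, next_state, reward), ...]` is an assumption made for this example.

```python
def value_iteration(P, gamma=0.9, tol=1e-10):
    """P[s][a] is a list of (prob, next_state, reward) tuples."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: best expected reward plus discounted next value
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# "stay" keeps the state, "move" switches it.
# Staying in state 1 pays 1.0 per step; every other transition pays 0.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "move": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 1.0)], "move": [(1.0, 0, 0.0)]},
}
V = value_iteration(P)
print(V)  # approximately {0: 9.0, 1: 10.0}
```

The result matches the closed form: $V^*(1) = 1/(1-\gamma) = 10$ and $V^*(0) = \gamma V^*(1) = 9$.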

---

### 2.5 Exploration vs Exploitation

**Q: Explain the exploration-exploitation dilemma**

- **Exploitation:** Use current best knowledge
- **Exploration:** Try new actions to discover better options

**Strategies:**
| Strategy | Description |
|----------|-------------|
| ε-greedy | Random action with prob ε |
| UCB | Optimism in face of uncertainty |
| Softmax/Boltzmann | Sample actions with probability ∝ exp(Q/τ), where τ is a temperature |
| Optimistic Init | Start with high values |
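
A rough sketch of three of these strategies for a single state with a handful of actions. The function names are illustrative, and the UCB form used here (estimate plus `c * sqrt(ln t / n)`) is one common variant rather than the only one.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over Q-values; a lower temperature is closer to greedy."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def ucb_action(q_values, counts, t, c=2.0):
    """Value estimate plus an uncertainty bonus that shrinks as an action is tried more."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # try every action once first
    scores = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])

q = [1.0, 2.0, 0.5]
print(epsilon_greedy(q, epsilon=0.0))            # 1 (pure exploitation)
print(boltzmann_probs(q, temperature=0.5))
print(ucb_action(q, counts=[10, 10, 1], t=21))   # 2 (rarely tried, large bonus)
```

Note how UCB picks the action with the lowest estimate here because it has been tried only once.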

---

### 2.6 Model-Based vs Model-Free RL

**Q: Compare model-based and model-free approaches**

| Aspect | Model-Based | Model-Free |
|--------|-------------|------------|
| Requires | Known or learned model: P(s'\|s,a), R | No model |
| Sample efficiency | High | Low |
| Computation | Planning expensive | Direct learning |
| Examples | Dynamic Programming | Q-Learning, Policy Gradient |

---

### 2.7 Key RL Algorithms

**Dynamic Programming (requires model):**
- **Policy Iteration:** Evaluate → Improve → Repeat
- **Value Iteration:** Combine evaluation and improvement

**Monte Carlo (episode-based):**
- Learn from complete episodes
- Unbiased but high variance
- Only for episodic tasks

**Temporal Difference (step-based):**
- Learn from every step (bootstrap)
- Lower variance, some bias
- Works for continuing tasks

**TD(0) Update:**
$$V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$$
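
As an illustration, the update can be applied to a deterministic two-step chain where the true values are easy to compute by hand. The helper `td0_update` and the chain itself are hypothetical.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    bootstrap = 0.0 if terminal else V[s_next]   # terminal states have value 0
    td_error = r + gamma * bootstrap - V[s]
    V[s] += alpha * td_error
    return td_error

# Chain A -> B -> end, with reward 1.0 on the final transition
V = {"A": 0.0, "B": 0.0}
for _ in range(500):
    td0_update(V, "A", 0.0, "B")
    td0_update(V, "B", 1.0, None, terminal=True)

print(V)  # approximately {'A': 0.9, 'B': 1.0}
```

`V["B"]` converges to the final reward, and `V["A"]` to the discounted value of B, even though A itself never receives a reward. This is the bootstrapping mentioned above.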

---

### 2.8 Q-Learning

**Q: Explain Q-Learning**

Off-policy TD control that learns optimal Q* directly:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$$

**Key properties:**
- Off-policy: learns the greedy (optimal) policy while following a separate exploratory behavior policy
- Model-free: no environment dynamics needed
- Uses max over next state actions (vs SARSA which uses actual next action)

**SARSA vs Q-Learning:**
| Aspect | SARSA | Q-Learning |
|--------|-------|------------|
| Type | On-policy | Off-policy |
| Update | Uses actual A' | Uses max Q |
| Behavior | More conservative (accounts for exploratory actions) | More optimistic (assumes greedy actions afterwards) |

**Code - Q-Learning Implementation:**
```python
import numpy as np

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning algorithm"""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    
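    for episode in range(episodes):
        # Assumes the Gymnasium API (gym >= 0.26): reset() returns (obs, info)
        state, _ = env.reset()
        done = False
        
        while not done:
            # Epsilon-greedy action selection
            if np.random.random() < epsilon:
                action = env.action_space.sample()  # explore
            else:
                action = np.argmax(Q[state])        # exploit
            
            # step() returns 5 values; older gym versions returned (obs, reward, done, info)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            # Q-Learning update (off-policy: uses max); no bootstrapping past a terminal state
            target = reward + gamma * np.max(Q[next_state]) * (not terminated)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state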
    
    return Q

# Epsilon decay schedule
def get_epsilon(episode, min_eps=0.01, max_eps=1.0, decay=0.995):
    return max(min_eps, max_eps * (decay ** episode))
```

---

### 2.9 Deep Q-Networks (DQN)

**Q: What innovations made DQN work?**

1. **Experience Replay:**
   - Store transitions in buffer
   - Sample random mini-batches
   - Breaks correlation, improves sample efficiency

2. **Target Network:**
   - Separate network for TD target
   - Updated periodically (not every step)
   - Stabilizes training

3. **Function Approximation:**
   - Neural network approximates Q(s,a)
   - Handles high-dimensional states (images)
   - Generalizes across similar states

**DQN Loss:**
$$L(\theta) = \mathbb{E}[(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta))^2]$$

**Code - DQN Components:**
```python
import torch
import torch.nn as nn
from collections import deque
import random

# Q-Network
class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )
    
    def forward(self, x):
        return self.net(x)

# Experience Replay Buffer
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.FloatTensor(states), torch.LongTensor(actions),
                torch.FloatTensor(rewards), torch.FloatTensor(next_states),
                torch.FloatTensor(dones))

# DQN Training Step
def train_dqn(q_net, target_net, buffer, optimizer, batch_size=32, gamma=0.99):
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)
    
    # Current Q values
    q_values = q_net(states).gather(1, actions.unsqueeze(1))
    
    # Target Q values (from target network)
    with torch.no_grad():
        next_q = target_net(next_states).max(1)[0]
        targets = rewards + gamma * next_q * (1 - dones)
    
    # squeeze(1) keeps the batch dimension even when batch_size == 1
    loss = nn.MSELoss()(q_values.squeeze(1), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

---

### 2.10 Policy Gradient Methods

**Q: Explain policy gradient vs value-based methods**

**Value-Based (Q-Learning, DQN):**
- Learn Q(s,a), derive policy from it
- Policy is implicit (argmax Q)
- Works well for discrete actions

**Policy-Based (Policy Gradient):**
- Learn policy π(a|s) directly
- Can handle continuous actions
- Can learn stochastic policies

**Policy Gradient Theorem:**
$$\nabla_\theta J(\theta) = \mathbb{E}_\pi[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)]$$

**REINFORCE Algorithm:**
$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t$$

**Code - REINFORCE (Policy Gradient):**
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, action_dim)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=-1)

def reinforce(env, policy, optimizer, episodes=1000, gamma=0.99):
    for episode in range(episodes):
        states, actions, rewards = [], [], []
        state, _ = env.reset()  # assumes the Gymnasium API (gym >= 0.26)
        done = False
        
        # Collect episode
        while not done:
            state_t = torch.FloatTensor(state)
            probs = policy(state_t)
            action = torch.multinomial(probs, 1).item()
            
            # step() returns 5 values; older gym versions returned 4
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            states.append(state_t)
            actions.append(action)
            rewards.append(reward)
            state = next_state
        
        # Compute returns (discounted cumulative rewards)
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.FloatTensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize
        
        # Policy gradient update
        optimizer.zero_grad()
        for state_t, action, G in zip(states, actions, returns):
            probs = policy(state_t)
            log_prob = torch.log(probs[action])
            loss = -log_prob * G  # negative for gradient ascent
            loss.backward()
        optimizer.step()
```

---

### 2.11 Actor-Critic Methods

**Q: What is Actor-Critic?**

Combines policy gradient (actor) with value function (critic):
- **Actor:** Policy network π(a|s; θ)
- **Critic:** Value network V(s; w) or Q(s,a; w)

**Advantage Actor-Critic (A2C):**
$$A(s,a) = Q(s,a) - V(s) \approx r + \gamma V(s') - V(s)$$

Uses advantage instead of return to reduce variance.

**Popular Algorithms:**
- A2C/A3C: Advantage actor-critic; A3C runs workers asynchronously, A2C is the synchronous variant
- PPO: Proximal Policy Optimization (clipped objective)
- SAC: Soft Actor-Critic (entropy regularization)

**Code - Actor-Critic:**
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Linear(state_dim, 128)
        self.actor = nn.Linear(128, action_dim)   # policy head
        self.critic = nn.Linear(128, 1)           # value head
    
    def forward(self, x):
        x = F.relu(self.shared(x))
        policy = F.softmax(self.actor(x), dim=-1)
        value = self.critic(x)
        return policy, value

def a2c_update(model, optimizer, states, actions, rewards, next_states, dones, gamma=0.99):
    states = torch.FloatTensor(states)
    next_states = torch.FloatTensor(next_states)
    actions = torch.LongTensor(actions)
    rewards = torch.FloatTensor(rewards)
    dones = torch.FloatTensor(dones)
    
    policy, values = model(states)
    _, next_values = model(next_states)
    
    # Advantage = TD error
    targets = rewards + gamma * next_values.squeeze() * (1 - dones)
    advantages = targets - values.squeeze()
    
    # Actor loss (policy gradient with advantage)
    log_probs = torch.log(policy.gather(1, actions.unsqueeze(1)))
    actor_loss = -(log_probs.squeeze() * advantages.detach()).mean()
    
    # Critic loss (value function)
    critic_loss = F.mse_loss(values.squeeze(), targets.detach())
    
    # Combined loss
    loss = actor_loss + 0.5 * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

---

### 2.12 Common RL Interview Questions

1. **Why use discount factor γ?**
   - Mathematical: ensures finite returns
   - Practical: prefer sooner rewards
   - Typical values: 0.9, 0.95, 0.99

2. **On-policy vs Off-policy?**
   - On-policy: learn about policy being followed (SARSA)
   - Off-policy: learn about different policy (Q-Learning)

3. **Why does DQN need experience replay?**
   - Breaks correlation between consecutive samples
   - Improves sample efficiency
   - Stabilizes training

4. **When to use policy gradient vs Q-learning?**
   - Q-learning: discrete actions, sample efficient
   - Policy gradient: continuous actions, stochastic policies

5. **What is the deadly triad?**
   - Function approximation + Bootstrapping + Off-policy
   - Can cause instability or divergence; DQN mitigates this with a target network and experience replay

---

## Part 10: Model Deployment & MLOps

### 14.1 Model Optimization

**Q: What is quantization?**

Reduce model precision to speed up inference and reduce memory:

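| Type | Description | Memory reduction |
|------|-------------|------------------|
| FP32 → FP16 | Half precision | 2x |
| FP32 → INT8 | 8-bit integers | 4x |
| FP32 → INT4 | 4-bit integers | 8x |

Actual inference speedup depends on hardware and kernel support and is not guaranteed to match the memory reduction.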

**Types of quantization:**
- **Post-training quantization:** Quantize after training (easy, some accuracy loss)
- **Quantization-aware training:** Simulate quantization during training (better accuracy)
- **Dynamic quantization:** Quantize weights ahead of time; quantize activations on the fly at inference
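
To make the idea concrete, here is a minimal sketch of affine (asymmetric) INT8 quantization of a few weights. It uses a simple min/max range per tensor; the helper names are illustrative, and production libraries use more careful calibration.

```python
def quantize_int8(values):
    """Map floats to int8 codes in [-128, 127] using a scale and a zero point."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0            # guard against a constant tensor
    zero_point = round(-128 - lo / scale)     # code that represents the value `lo` maps to -128
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point

def dequantize(codes, scale, zero_point):
    """Recover approximate float values from int8 codes."""
    return [(c - zero_point) * scale for c in codes]

weights = [-1.0, -0.25, 0.0, 0.4, 1.55]
codes, scale, zp = quantize_int8(weights)
restored = dequantize(codes, scale, zp)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scale, zp, max_err)
```

Each weight is stored in 1 byte instead of 4, and the reconstruction error stays within about one quantization step (`scale`).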

**Q: What is pruning?**

Remove unnecessary weights/neurons:
- **Magnitude pruning:** Remove small weights
- **Structured pruning:** Remove entire channels/layers
- **Lottery ticket hypothesis:** Sparse subnetworks can match full network

**Code - Model Optimization:**
```python
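import torch
import torch.nn as nn
import torch.nn.functional as F

# Dynamic Quantization (easiest)
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Static Quantization (eager mode; the model needs QuantStub/DeQuantStub around quantized ops)
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)
# Calibrate with representative data
with torch.no_grad():
    for data in calibration_loader:
        model(data)
torch.quantization.convert(model, inplace=True)

# FP16 inference (GPU): cast weights and inputs to half precision.
# Alternative: keep FP32 weights and wrap the forward pass in torch.autocast('cuda').
model.half()
with torch.no_grad():
    output = model(input.half())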

# Pruning
import torch.nn.utils.prune as prune

# Prune 30% of weights in a layer
prune.l1_unstructured(model.fc1, name='weight', amount=0.3)

# Check sparsity
sparsity = 100 * float(torch.sum(model.fc1.weight == 0)) / model.fc1.weight.nelement()

# Remove pruning reparametrization (make permanent)
prune.remove(model.fc1, 'weight')

# Knowledge Distillation
def distillation_loss(student_logits, teacher_logits, labels, T=4, alpha=0.7):
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean'
    ) * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

---

### 14.2 Model Export and Serving

**Q: What is ONNX?**

Open Neural Network Exchange - standard format for ML models.
- Export from PyTorch, TensorFlow, etc.
- Run on various runtimes (ONNX Runtime, TensorRT)
- Enables cross-framework deployment

**Q: Compare deployment options**

| Option | Use Case | Pros | Cons |
|--------|----------|------|------|
| TorchScript | PyTorch production | Native, easy | PyTorch only |
| ONNX | Cross-platform | Portable | Some ops unsupported |
| TensorRT | NVIDIA GPUs | Fastest on GPU | NVIDIA only |
| TFLite | Mobile/Edge | Small, fast | Limited ops |
| Triton | Multi-model serving | Scalable | Complex setup |

**Code - Model Export:**
```python
import torch

# TorchScript (tracing)
traced_model = torch.jit.trace(model, example_input)
traced_model.save("model_traced.pt")

# TorchScript (scripting - for control flow)
scripted_model = torch.jit.script(model)
scripted_model.save("model_scripted.pt")

# Load TorchScript model
loaded = torch.jit.load("model_traced.pt")
output = loaded(input)

# Export to ONNX
torch.onnx.export(
    model,
    example_input,
    "model.onnx",
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
)

# Run with ONNX Runtime
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
outputs = session.run(None, {"input": input_numpy})
```

---

### 14.3 MLOps Best Practices

**Q: What is MLOps?**

DevOps practices applied to ML:
- Version control for code, data, and models
- Automated training pipelines
- Model monitoring and retraining
- A/B testing and gradual rollout

**Q: What should you monitor in production?**

| Category | Metrics |
|----------|---------|
| Model performance | Accuracy, latency, throughput |
| Data quality | Missing values, distribution shift |
| System health | CPU/GPU usage, memory, errors |
| Business metrics | Conversion, engagement |

**Q: What is data drift and how do you detect it?**

Data drift: Input distribution changes over time.
- **Detection:** Statistical tests (KS test, PSI), monitoring feature distributions
- **Solutions:** Retrain on new data, online learning, feature engineering
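
As one example of a detection statistic, the Population Stability Index (PSI) compares binned proportions of a reference sample and a new sample. This is a minimal sketch with hand-picked bin edges and toy data; values outside the edges are simply ignored here, and the commonly quoted threshold of roughly 0.25 for significant drift is a rule of thumb, not a standard.

```python
import math

def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between a reference sample and a new sample."""
    def proportions(sample):
        counts = [0] * (len(bin_edges) - 1)
        for x in sample:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                in_bin = bin_edges[i] <= x < bin_edges[i + 1]
                if in_bin or (is_last and x == bin_edges[-1]):
                    counts[i] += 1
                    break
        return [max(c / len(sample), eps) for c in counts]  # eps avoids log(0)

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))

edges = [0.0, 0.25, 0.5, 0.75, 1.0]
reference = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]       # training-time feature values
same = [0.15, 0.22, 0.35, 0.45, 0.55, 0.65, 0.85, 0.95]    # same distribution over bins
shifted = [0.6, 0.7, 0.8, 0.8, 0.9, 0.9, 0.95, 1.0]        # mass moved to the upper bins

print(psi(reference, same, edges), psi(reference, shifted, edges))
```

In monitoring, the same computation would run per feature on a schedule, with an alert when the index crosses a chosen threshold.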

**Code - Basic Model Serving (FastAPI):**
```python
from fastapi import FastAPI
import torch

app = FastAPI()
model = torch.jit.load("model.pt")
model.eval()

@app.post("/predict")
async def predict(data: dict):
    input_tensor = torch.tensor(data["features"])
    with torch.no_grad():
        output = model(input_tensor)
    return {"prediction": output.tolist()}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```

---

## Part 11: Distributed Training

### 15.1 Parallelism Strategies

**Q: Compare data parallelism vs model parallelism**

| | Data Parallelism | Model Parallelism |
|---|---|---|
| Split | Data across GPUs | Model across GPUs |
| When | Model fits in GPU | Model too large |
| Communication | Gradient sync | Activation passing |
| Scaling | Easy | Complex |

**Q: What is gradient accumulation?**

Simulate larger batch sizes with limited memory:
1. Forward/backward on small batch
2. Accumulate gradients (don't zero)
3. After N steps, update weights

Effective batch size = per-GPU batch_size × accumulation_steps × num_gpus
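
A small numeric check shows why dividing each micro-batch loss by the number of accumulation steps reproduces the large-batch gradient. The one-parameter model and the data below are made up for illustration, and the equivalence assumes equally sized micro-batches.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error for the 1-parameter model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
w = 0.5

# One large batch of 4 examples
full_grad = mse_grad(w, data)

# Two micro-batches of 2 examples, each gradient scaled by 1 / accumulation_steps
accumulation_steps = 2
micro_batches = [data[:2], data[2:]]
accumulated = 0.0
for mb in micro_batches:
    accumulated += mse_grad(w, mb) / accumulation_steps

print(full_grad, accumulated)  # the two gradients agree

# Hypothetical setup: 8 samples per GPU, 4 accumulation steps, 2 GPUs
effective_batch_size = 8 * 4 * 2
print(effective_batch_size)    # 64
```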

**Q: Explain ZeRO optimization**

Zero Redundancy Optimizer (DeepSpeed):
- **ZeRO-1:** Partition optimizer states
- **ZeRO-2:** + Partition gradients
- **ZeRO-3:** + Partition parameters

Reduces memory per GPU, enables training larger models.
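
A back-of-the-envelope sketch of the per-GPU memory for model states under each stage. It assumes mixed-precision Adam with 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for optimizer states (FP32 weights, momentum, variance), and it ignores activations and buffers; `zero_memory_gb` is a hypothetical helper.

```python
def zero_memory_gb(num_params, num_gpus, stage):
    """Approximate per-GPU model-state memory in GB (1e9 bytes) for a given ZeRO stage."""
    params = 2 * num_params    # FP16 weights
    grads = 2 * num_params     # FP16 gradients
    opt = 12 * num_params      # FP32 weights + Adam momentum + Adam variance
    if stage >= 1:
        opt /= num_gpus        # ZeRO-1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus      # ZeRO-2: also partition gradients
    if stage >= 3:
        params /= num_gpus     # ZeRO-3: also partition parameters
    return (params + grads + opt) / 1e9

# 7.5B-parameter model on 64 GPUs (stage 0 = plain data parallelism)
for stage in range(4):
    print(stage, round(zero_memory_gb(7.5e9, 64, stage), 1))
```

Under these assumptions the footprint drops from 120 GB per GPU with plain data parallelism to under 2 GB with ZeRO-3, at the cost of extra communication.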

**Code - Distributed Training:**
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Initialize process group
def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

# DDP Training
def train_ddp(rank, world_size):
    setup(rank, world_size)
    
    model = MyModel().to(rank)
    model = DDP(model, device_ids=[rank])
    
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)
    
    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # Important for shuffling
        for batch in dataloader:
            # Training step
            pass
    
    cleanup()

# Launch with torchrun (it sets RANK, LOCAL_RANK and WORLD_SIZE as environment variables)
# torchrun --nproc_per_node=4 train.py

# Gradient Accumulation
accumulation_steps = 4
optimizer.zero_grad()

for i, batch in enumerate(dataloader):
    loss = model(batch) / accumulation_steps
    loss.backward()
    
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Mixed Precision Training
# (recent PyTorch releases expose the same tools as torch.amp.GradScaler and torch.autocast)
scaler = torch.cuda.amp.GradScaler()

for inputs, target in dataloader:
    optimizer.zero_grad()
    
    with torch.cuda.amp.autocast():
        output = model(inputs)
        loss = criterion(output, target)
    
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

---

## Part 12: Debugging ML Models

### 16.1 Common Failure Modes

**Q: Model not learning (loss not decreasing)**

Checklist:
1. Learning rate too high/low
2. Data loading bug (check labels match inputs)
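3. Loss function mismatch (e.g., passing softmax probabilities to NLLLoss, which expects log-probabilities, or applying softmax before CrossEntropyLoss)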
4. Gradient issues (vanishing/exploding)
5. Model in eval mode during training
6. Forgetting optimizer.zero_grad()

**Q: Model overfitting**

Signs: Training loss ↓, validation loss ↑

Solutions:
- More data / data augmentation
- Regularization (dropout, weight decay)
- Simpler model
- Early stopping
- Batch normalization

**Q: Model underfitting**

Signs: Both training and validation loss high

Solutions:
- More complex model
- Train longer
- Better features
- Less regularization
- Check for data issues

### 16.2 Debugging Techniques

**Code - Debugging Tools:**
```python
import torch

# Check for NaN/Inf
def check_nan(tensor, name="tensor"):
    if torch.isnan(tensor).any():
        print(f"NaN detected in {name}")
    if torch.isinf(tensor).any():
        print(f"Inf detected in {name}")

# Gradient checking
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad_norm={param.grad.norm():.4f}")
        if param.grad.norm() == 0:
            print(f"  WARNING: Zero gradient!")
        if param.grad.norm() > 100:
            print(f"  WARNING: Exploding gradient!")

# Hook to inspect activations
activations = {}
def get_activation(name):
    def hook(model, input, output):
        activations[name] = output.detach()
    return hook

model.layer1.register_forward_hook(get_activation('layer1'))

# Verify data loading
for batch_idx, (data, target) in enumerate(dataloader):
    print(f"Batch {batch_idx}: data shape={data.shape}, target shape={target.shape}")
    print(f"  Data range: [{data.min():.2f}, {data.max():.2f}]")
    print(f"  Target values: {target.unique().tolist()}")
    if batch_idx >= 2:
        break

# Overfit on single batch (sanity check)
single_batch = next(iter(dataloader))
for i in range(100):
    loss = train_step(model, single_batch)
    if i % 10 == 0:
        print(f"Step {i}: loss={loss:.4f}")
# Loss should go to ~0 if model can learn
```

---

## Part 13: ML System Design

### 17.1 System Design Framework

**Q: How do you approach ML system design questions?**

1. **Clarify requirements**
   - What's the goal? (metric to optimize)
   - Scale? (QPS, data size, latency)
   - Constraints? (budget, real-time vs batch)

2. **Data pipeline**
   - Data sources and collection
   - Feature engineering
   - Storage (data lake, feature store)

3. **Model selection**
   - Baseline → simple ML → deep learning
   - Online vs offline inference
   - Training infrastructure

4. **Serving architecture**
   - API design
   - Caching strategy
   - Load balancing

5. **Monitoring & iteration**
   - Metrics to track
   - A/B testing
   - Feedback loops

### 17.2 Common System Design Problems

**Recommendation System:**
```
User Request → Candidate Generation → Ranking → Re-ranking → Results
                    ↓                    ↓
              Collaborative         Deep Neural
              Filtering             Network
              (fast, coarse)        (slow, precise)
```

**Key components:**
- Candidate generation: ANN search, collaborative filtering
- Ranking model: Two-tower, cross-attention
- Feature store: User history, item embeddings
- Real-time signals: Recent clicks, context

**Search Ranking:**
```
Query → Query Understanding → Retrieval → Ranking → Results
              ↓                   ↓           ↓
         Spell check          BM25/ANN    Learning
         Query expansion      Inverted    to Rank
         Intent detection     Index       (LTR)
```

**Fraud Detection:**
- Real-time scoring (low latency)
- Imbalanced data handling
- Feature engineering (velocity, patterns)
- Explainability requirements
- Feedback delay (labels come late)

**Content Moderation:**
- Multi-modal (text, image, video)
- Multi-label classification
- Human-in-the-loop
- Edge cases and adversarial content
- Latency vs accuracy tradeoff

---

### 17.3 Scaling Considerations

**Q: How do you handle high QPS?**

| Strategy | Description |
|----------|-------------|
| Caching | Cache predictions for common inputs |
| Batching | Batch requests for GPU efficiency |
| Model distillation | Smaller, faster model |
| Quantization | Reduce precision |
| Horizontal scaling | More replicas |
| Async processing | Queue non-urgent requests |

**Q: How do you handle large-scale training data?**

- Distributed training (data parallelism)
- Data sampling strategies
- Online/incremental learning
- Feature hashing for high cardinality
- Approximate algorithms
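
Feature hashing (the "hashing trick") can be sketched in a few lines. This version uses `hashlib` rather than Python's built-in `hash`, because the built-in is salted per process; the bucket count, the signed-hash variant, and the feature strings are illustrative assumptions.

```python
import hashlib

def hash_features(tokens, num_buckets=16):
    """Map arbitrary string features into a fixed-size vector without a vocabulary."""
    vec = [0.0] * num_buckets
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % num_buckets
        sign = 1.0 if digest[4] % 2 == 0 else -1.0   # signed hashing reduces collision bias
        vec[index] += sign
    return vec

v1 = hash_features(["user_id=48213", "country=DE", "device=ios"])
v2 = hash_features(["user_id=48213", "country=DE", "device=ios"])
print(len(v1), v1 == v2)  # 16 True
```

The vector size stays fixed no matter how many distinct IDs appear, at the price of occasional collisions between unrelated features.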

---

## Part 14: Robotics & Embodied AI Fundamentals

### 8.1 Robot Perception Pipeline

**Q: Describe the perception stack for a humanoid robot**

```
Sensors → Preprocessing → Feature Extraction → Fusion → State Estimation → Planning
```

| Sensor | Data Type | Use Case |
|--------|-----------|----------|
| RGB Camera | Images | Object detection, scene understanding |
| Depth Camera | Point clouds | 3D reconstruction, obstacle detection |
| LiDAR | Sparse 3D points | Long-range mapping, localization |
| IMU | Acceleration, angular velocity | Pose estimation, balance |
| Force/Torque | Contact forces | Manipulation, grasping |
| Joint Encoders | Position, velocity | Proprioception |

**Q: What is sensor fusion?**

Combining data from multiple sensors to get more accurate state estimates than any single sensor alone. Methods:
- Early fusion: Concatenate raw sensor data
- Late fusion: Combine predictions from separate models
- Kalman Filter: Optimal fusion for linear Gaussian systems
- Extended Kalman Filter (EKF): For nonlinear systems
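
To make the Kalman-style fusion concrete, here is a minimal one-dimensional sketch of the measurement update, fusing two noisy range readings. The sensor variances and readings are made-up numbers, and the function name `kalman_update` is an illustrative choice.

```python
def kalman_update(mean, var, z, sensor_var):
    """Fuse a Gaussian estimate (mean, var) with a measurement z of variance sensor_var."""
    k = var / (var + sensor_var)          # Kalman gain: how much to trust the measurement
    new_mean = mean + k * (z - mean)
    new_var = (1.0 - k) * var
    return new_mean, new_var

# Start from a nearly uninformative prior, then fuse two sensors with different noise levels
mean, var = 0.0, 1e6
mean, var = kalman_update(mean, var, z=2.10, sensor_var=0.04)   # e.g. depth camera
mean, var = kalman_update(mean, var, z=1.95, sensor_var=0.01)   # e.g. LiDAR
```

The fused estimate lands closer to the lower-variance sensor, and the fused variance is smaller than either sensor's variance.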

---

### 14.2 Coordinate Frames & Transformations

**Q: Explain common coordinate frames in robotics**

- **World frame:** Fixed global reference
- **Body frame:** Attached to robot base
- **Camera frame:** Attached to camera sensor
- **End-effector frame:** At robot gripper/hand

**Q: How do you represent 3D rotations?**

| Representation | Pros | Cons |
|----------------|------|------|
| Euler angles | Intuitive | Gimbal lock |
| Rotation matrix | No singularities | 9 params, must be orthogonal |
| Quaternion | Compact (4 params), no gimbal lock | Not intuitive |
| Axis-angle | Intuitive for small rotations | Singularity at 0° |

**Code - Transformations:**
```python
import numpy as np
from scipy.spatial.transform import Rotation

# Quaternion to rotation matrix
quat = [0, 0, 0.707, 0.707]  # [x, y, z, w]
R = Rotation.from_quat(quat).as_matrix()

# Euler angles to rotation matrix
euler = [0, 0, np.pi/2]  # roll, pitch, yaw
R = Rotation.from_euler('xyz', euler).as_matrix()

# Homogeneous transformation matrix (4x4)
def make_transform(R, t):
    """Create 4x4 transformation matrix from rotation R and translation t"""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Transform a point
def transform_point(T, point):
    p_homo = np.append(point, 1)  # homogeneous coordinates
    return (T @ p_homo)[:3]

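# Inverse transformation
t = np.array([1.0, 0.0, 0.5])  # example translation
T = make_transform(R, t)
T_inv = np.linalg.inv(T)
# Or more efficiently, using the rigid-transform structure:
# T_inv = make_transform(R.T, -R.T @ t)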
```

---

### 14.3 Robot Kinematics

**Q: Explain forward vs inverse kinematics**

| | Forward Kinematics | Inverse Kinematics |
|---|---|---|
| Input | Joint angles | End-effector pose |
| Output | End-effector pose | Joint angles |
| Solution | Unique | Multiple or none |
| Difficulty | Easy (chain multiplication) | Hard (nonlinear) |

**Q: What is the Jacobian in robotics?**

Maps joint velocities to end-effector velocities:
$$\dot{x} = J(q) \dot{q}$$

- Used for velocity control
- Singularities occur where J loses rank (det(J) = 0 for a square Jacobian); the robot loses one or more directions of end-effector motion
- Pseudo-inverse for redundant robots: $\dot{q} = J^+ \dot{x}$

**Code - Simple 2-Link Arm Kinematics:**
```python
import numpy as np

def forward_kinematics_2link(q1, q2, L1, L2):
    """Forward kinematics for 2-link planar arm"""
    x = L1 * np.cos(q1) + L2 * np.cos(q1 + q2)
    y = L1 * np.sin(q1) + L2 * np.sin(q1 + q2)
    return np.array([x, y])

def jacobian_2link(q1, q2, L1, L2):
    """Jacobian for 2-link planar arm"""
    J = np.array([
        [-L1*np.sin(q1) - L2*np.sin(q1+q2), -L2*np.sin(q1+q2)],
        [L1*np.cos(q1) + L2*np.cos(q1+q2), L2*np.cos(q1+q2)]
    ])
    return J

def inverse_kinematics_2link(x, y, L1, L2):
    """Inverse kinematics for 2-link planar arm (returns the q2 >= 0 branch)"""
    c2 = (x**2 + y**2 - L1**2 - L2**2) / (2 * L1 * L2)
    if abs(c2) > 1:
        raise ValueError("Target is outside the reachable workspace")
    s2 = np.sqrt(1 - c2**2)  # use -s2 for the mirrored elbow configuration
    q2 = np.arctan2(s2, c2)
    q1 = np.arctan2(y, x) - np.arctan2(L2*s2, L1 + L2*c2)
    return q1, q2
```
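
The Jacobian relation above also gives an iterative alternative to the closed-form solution. The following is a minimal sketch of damped least-squares IK for the same 2-link arm, written in plain Python so the 2x2 algebra is visible. The damping value, step gain, unit link lengths, and target are illustrative assumptions.

```python
import math

def fk(q1, q2, l1=1.0, l2=1.0):
    return (l1 * math.cos(q1) + l2 * math.cos(q1 + q2),
            l1 * math.sin(q1) + l2 * math.sin(q1 + q2))

def jacobian(q1, q2, l1=1.0, l2=1.0):
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return ((-l1 * s1 - l2 * s12, -l2 * s12),
            (l1 * c1 + l2 * c12, l2 * c12))

def ik_step(q, target, damping=0.05, gain=0.5):
    """One damped least-squares step: dq = J^T (J J^T + lambda^2 I)^-1 e."""
    (a, b), (c, d) = jacobian(*q)
    x, y = fk(*q)
    ex, ey = target[0] - x, target[1] - y
    # A = J J^T + lambda^2 I (symmetric 2x2)
    a11 = a * a + b * b + damping ** 2
    a12 = a * c + b * d
    a22 = c * c + d * d + damping ** 2
    det = a11 * a22 - a12 * a12
    # v = A^-1 e
    vx = (a22 * ex - a12 * ey) / det
    vy = (-a12 * ex + a11 * ey) / det
    # dq = J^T v, scaled by a step gain for robustness
    return (q[0] + gain * (a * vx + c * vy), q[1] + gain * (b * vx + d * vy))

q = (0.3, 0.5)
target = (1.2, 0.8)
for _ in range(100):
    q = ik_step(q, target)
```

The damping term keeps the step bounded near singularities, where a plain pseudo-inverse would produce very large joint velocities.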

---

### 14.4 Motion Planning

**Q: Compare motion planning algorithms**

| Algorithm | Type | Pros | Cons |
|-----------|------|------|------|
| A* | Graph search | Optimal, complete | Discretization needed |
| RRT | Sampling-based | Works in high-dim | Suboptimal paths |
| RRT* | Sampling-based | Asymptotically optimal | Slower than RRT |
| PRM | Sampling-based | Good for multi-query | Preprocessing needed |
| MPC | Optimization | Handles constraints | Computationally expensive |

**Q: What is Model Predictive Control (MPC)?**

Optimization-based control that:
1. Predicts future states over horizon H
2. Optimizes control sequence to minimize cost
3. Applies only first control
4. Repeats at next timestep (receding horizon)

$$\min_{u_0,...,u_{H-1}} \sum_{t=0}^{H-1} c(x_t, u_t) \quad \text{s.t.} \quad x_{t+1} = f(x_t, u_t)$$
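
The receding-horizon loop can be sketched with random-shooting MPC on a 1D double integrator. This is a toy under stated assumptions: the dynamics, cost weights, horizon, and sample count are all illustrative, and real MPC implementations typically use a proper optimizer (QP, iLQR, CEM) instead of uniform sampling.

```python
import random

def step(state, u, dt=0.1):
    """Double integrator: position and velocity driven by acceleration u."""
    pos, vel = state
    return (pos + vel * dt, vel + u * dt)

def cost(state, u):
    pos, vel = state
    return pos ** 2 + 0.1 * vel ** 2 + 0.01 * u ** 2

def mpc_action(state, horizon=15, num_samples=200, rng=random):
    """Sample control sequences, keep the cheapest, return only its first control."""
    best_cost, best_u0 = float("inf"), 0.0
    for _ in range(num_samples):
        controls = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for u in controls:            # roll the model forward over the horizon
            total += cost(s, u)
            s = step(s, u)
        if total < best_cost:
            best_cost, best_u0 = total, controls[0]
    return best_u0

rng = random.Random(0)
state = (1.0, 0.0)
for _ in range(100):                  # receding horizon: re-plan at every step
    state = step(state, mpc_action(state, rng=rng))
```

Only the first control of each plan is executed; the rest of the plan is discarded and recomputed from the new state.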

**Code - Simple RRT:**
```python
import numpy as np

class RRT:
    def __init__(self, start, goal, bounds, obstacle_fn, step_size=0.5):
        self.start = np.array(start)
        self.goal = np.array(goal)
        self.bounds = bounds
        self.is_collision = obstacle_fn
        self.step_size = step_size
        self.nodes = [self.start]
        self.parents = {0: None}
    
    def sample_random(self):
        return np.random.uniform(self.bounds[0], self.bounds[1])
    
    def nearest_node(self, point):
        distances = [np.linalg.norm(node - point) for node in self.nodes]
        return np.argmin(distances)
    
    def steer(self, from_node, to_point):
        direction = to_point - from_node
        distance = np.linalg.norm(direction)
        if distance > self.step_size:
            direction = direction / distance * self.step_size
        return from_node + direction
    
    def plan(self, max_iter=1000, goal_bias=0.1):
        for _ in range(max_iter):
            # Sample with goal bias
            if np.random.random() < goal_bias:
                sample = self.goal
            else:
                sample = self.sample_random()
            
            # Find nearest and steer
            nearest_idx = self.nearest_node(sample)
            new_node = self.steer(self.nodes[nearest_idx], sample)
            
            # Check collision
            if not self.is_collision(new_node):
                self.nodes.append(new_node)
                self.parents[len(self.nodes)-1] = nearest_idx
                
                # Check if goal reached
                if np.linalg.norm(new_node - self.goal) < self.step_size:
                    return self.extract_path(len(self.nodes)-1)
        return None
    
    def extract_path(self, goal_idx):
        path = []
        idx = goal_idx
        while idx is not None:
            path.append(self.nodes[idx])
            idx = self.parents[idx]
        return path[::-1]
```

---

### 14.5 Control Systems

**Q: Explain PID control**

$$u(t) = K_p e(t) + K_i \int e(t)dt + K_d \frac{de(t)}{dt}$$

- **P (Proportional):** Responds to current error
- **I (Integral):** Eliminates steady-state error
- **D (Derivative):** Dampens oscillations

**Q: What is impedance control?**

Controls the dynamic relationship between force and motion:
$$F = M\ddot{x} + B\dot{x} + K(x - x_d)$$

Makes robot behave like a mass-spring-damper system. Essential for:
- Safe human-robot interaction
- Compliant manipulation
- Contact tasks
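
A quick way to build intuition for the impedance equation is to simulate the target mass-spring-damper under a constant external push. The parameter values below are illustrative assumptions (chosen to be critically damped), not recommended gains.

```python
def simulate_impedance(m=1.0, b=20.0, k=100.0, x_d=0.0, f_ext=5.0, dt=0.001, steps=5000):
    """Integrate M*x'' + B*x' + K*(x - x_d) = f_ext with semi-implicit Euler."""
    x, v = x_d, 0.0
    for _ in range(steps):
        a = (f_ext - b * v - k * (x - x_d)) / m
        v += a * dt
        x += v * dt
    return x

x_final = simulate_impedance()
```

At steady state the end-effector yields by `f_ext / k` from the desired position, so a lower stiffness `k` gives a more compliant robot.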

**Code - PID Controller:**
```python
import numpy as np

class PIDController:
    def __init__(self, Kp, Ki, Kd, dt=0.01):
        self.Kp = Kp
        self.Ki = Ki
        self.Kd = Kd
        self.dt = dt
        self.integral = 0
        self.prev_error = 0
    
    def compute(self, setpoint, measurement):
        error = setpoint - measurement
        
        # Proportional
        P = self.Kp * error
        
        # Integral (with anti-windup)
        self.integral += error * self.dt
        self.integral = np.clip(self.integral, -10, 10)
        I = self.Ki * self.integral
        
        # Derivative
        derivative = (error - self.prev_error) / self.dt
        D = self.Kd * derivative
        
        self.prev_error = error
        return P + I + D
    
    def reset(self):
        self.integral = 0
        self.prev_error = 0
```

---

### 14.6 Vision-Language Models (VLMs)

**Q: What is a Vision-Language Model?**

A model that jointly understands images and text. Architecture typically:
1. Vision encoder (ViT, CLIP) → image embeddings
2. Language model (LLM) → text understanding
3. Cross-modal fusion → joint reasoning

**Key VLMs:**
| Model | Architecture | Key Innovation |
|-------|--------------|----------------|
| CLIP | Dual encoder | Contrastive image-text pretraining |
| BLIP-2 | Q-Former bridge | Efficient vision-language alignment |
| LLaVA | ViT + LLM | Simple projection, instruction tuning |
| GPT-4V | Proprietary | Multimodal GPT |

**Q: How does CLIP work?**

Contrastive learning on image-text pairs:
1. Encode images with vision transformer
2. Encode text with text transformer
3. Train to maximize similarity of matching pairs, minimize non-matching

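$$\mathcal{L} = -\log \frac{\exp(sim(I_i, T_i)/\tau)}{\sum_j \exp(sim(I_i, T_j)/\tau)}$$

This is the image-to-text term for pair $i$; CLIP averages it with the symmetric text-to-image term, and $\tau$ is a learned temperature.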
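
The contrastive objective can be worked through on a tiny similarity matrix. This is a dependency-free sketch of the symmetric form (image-to-text plus text-to-image, averaged); the fixed temperature of 0.07 and the example matrices are assumptions for illustration.

```python
import math

def cross_entropy_rows(logits):
    """Mean of -log softmax(row)[i], where the correct column for row i is i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))  # stable log-sum-exp
        total += log_z - row[i]
    return total / len(logits)

def clip_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over a square matrix sim[i][j] = sim(I_i, T_j)."""
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]   # transpose: text -> image direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))

aligned = [[1.0, 0.1, 0.0], [0.2, 1.0, 0.1], [0.0, 0.1, 1.0]]
shuffled = [[0.1, 1.0, 0.0], [1.0, 0.2, 0.1], [0.0, 0.1, 1.0]]
```

`clip_loss(aligned)` is lower than `clip_loss(shuffled)` because the matching pairs sit on the diagonal, which is exactly what training pushes toward.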

**Code - Using CLIP:**
```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Zero-shot image classification
image = Image.open("robot.jpg")  # any local RGB image
texts = ["a photo of a robot", "a photo of a car", "a photo of a person"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Get similarity scores
logits_per_image = outputs.logits_per_image  # image-text similarity
probs = logits_per_image.softmax(dim=1)  # probabilities
```

---

### 14.7 Vision-Language-Action (VLA) Models

**Q: What is a VLA model?**

Extends VLMs to output robot actions. Takes:
- **Vision:** Camera images (RGB, depth)
- **Language:** Task instructions
- **Outputs:** Robot actions (joint positions, velocities, or end-effector poses)

**Key VLA Models:**
| Model | Base | Action Space | Key Feature |
|-------|------|--------------|-------------|
| RT-1 | EfficientNet + Transformer | Discrete | Large-scale robot data |
| RT-2 | PaLM-E / PaLI-X | Discrete tokens | VLM → actions |
| Octo | Transformer | Continuous | Open-source, multi-robot |
| OpenVLA | Llama 2 + DINOv2/SigLIP | Discrete tokens (256 bins) | 7B params, fine-tunable |

**Q: How does RT-2 work?**

1. Pretrained VLM (PaLI-X or PaLM-E) understands vision + language
2. Each action dimension is discretized into bins and written as a string of integer tokens (e.g., "1 128 64 32 255 128 64")
3. VLM generates action tokens autoregressively
4. Tokens decoded to continuous actions

**Q: What is action chunking?**

Predicting multiple future actions at once instead of one at a time:
- Reduces compounding errors
- Smoother trajectories
- Used in ACT, Diffusion Policy
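
A minimal sketch of how a chunked policy is typically executed: query once, run the first few actions open-loop, then re-plan. The function names, the 1D toy environment, and the `chunk_size` / `execute` values are hypothetical and chosen only for illustration.

```python
def run_with_chunks(policy, env_step, obs, total_steps, chunk_size=8, execute=4):
    """Query the policy every `execute` steps and run the first `execute` actions of each chunk."""
    queries = 0
    t = 0
    while t < total_steps:
        chunk = policy(obs, chunk_size)      # list of chunk_size future actions
        queries += 1
        for action in chunk[:execute]:
            if t >= total_steps:
                break
            obs = env_step(obs, action)
            t += 1
    return obs, queries

# Toy 1D example: the (hypothetical) policy plans bounded steps toward position 10.0
def toy_policy(obs, k):
    plan, pos = [], obs
    for _ in range(k):
        delta = max(-0.5, min(0.5, 10.0 - pos))
        plan.append(delta)
        pos += delta
    return plan

final_obs, num_queries = run_with_chunks(toy_policy, lambda o, a: o + a, 0.0, total_steps=24)
```

Here 24 control steps need only 6 policy queries, which is also why chunking helps with inference latency.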

---

### 14.8 Imitation Learning

**Q: Compare imitation learning approaches**

| Method | Description | Pros | Cons |
|--------|-------------|------|------|
| Behavior Cloning | Supervised learning on demos | Simple | Compounding errors |
| DAgger | Iterative with expert queries | Handles distribution shift | Needs expert access |
| IRL | Learn reward, then RL | Generalizes better | Computationally expensive |
| GAIL | GAN-style imitation | No reward engineering | Training instability |

**Q: What is the distribution shift problem?**

In behavior cloning, small errors accumulate because:
- Training: states from expert distribution
- Testing: states from learned policy (different distribution)
- Model never saw recovery from mistakes

Solutions: DAgger, action chunking, diffusion policies

**Code - Behavior Cloning:**
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BehaviorCloning(nn.Module):
    def __init__(self, obs_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
    
    def forward(self, obs):
        return self.net(obs)

# Training
def train_bc(model, expert_data, epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    
    for epoch in range(epochs):
        for obs, action in expert_data:
            pred_action = model(obs)
            loss = F.mse_loss(pred_action, action)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

---

### 14.9 Diffusion Policy

**Q: What is Diffusion Policy?**

Uses diffusion models to generate robot actions:
1. Start with noise
2. Iteratively denoise to get action sequence
3. Captures multimodal action distributions

**Advantages:**
- Handles multimodal demonstrations (multiple ways to do task)
- Generates smooth action sequences
- State-of-the-art on many manipulation tasks

**Q: How does it differ from standard BC?**

| Aspect | Behavior Cloning | Diffusion Policy |
|--------|------------------|------------------|
| Output | Single action (point estimate) | Sample from a learned action distribution |
| Multimodality | Averages modes | Captures all modes |
| Generation | Single forward pass | Iterative denoising |
| Training | MSE on actions | MSE on predicted noise (denoising objective) |
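
The "averages modes" row is easy to verify numerically. In the toy below, demonstrations at a single state go either left (-1) or right (+1); the MSE-optimal point prediction is their mean, which is an action no expert ever took. The data and seed are made up for illustration.

```python
import random

rng = random.Random(0)
# Demonstrations at one state: roughly half steer left (-1), half steer right (+1)
demos = [rng.choice([-1.0, 1.0]) for _ in range(1000)]

# The constant prediction that minimizes mean squared error is the sample mean
mse_optimal = sum(demos) / len(demos)

def mse(pred):
    return sum((a - pred) ** 2 for a in demos) / len(demos)

# A generative policy instead draws from the demonstrated distribution
sampled_action = rng.choice(demos)
```

`mse_optimal` ends up near 0 (drive straight into the obstacle), while a sampled action is always one of the two valid modes.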

**Code - Simplified Diffusion Policy Concept:**
```python
import torch
import torch.nn as nn

class DiffusionPolicy(nn.Module):
    def __init__(self, obs_dim, action_dim, horizon=16):
        super().__init__()
        self.horizon = horizon
        self.action_dim = action_dim
        
        # Noise prediction network (simplified)
        self.noise_pred = nn.Sequential(
            nn.Linear(obs_dim + action_dim * horizon + 1, 256),  # +1 for timestep
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim * horizon)
        )
    
    def forward(self, obs, noisy_actions, timestep):
        """Predict noise given observation, noisy actions, and diffusion timestep"""
        t_embed = timestep.float().unsqueeze(-1) / 1000  # simple timestep embedding
        x = torch.cat([obs, noisy_actions.flatten(-2), t_embed], dim=-1)
        return self.noise_pred(x).view(-1, self.horizon, self.action_dim)
    
    @torch.no_grad()
    def sample(self, obs, num_steps=50):
        """Generate actions via iterative denoising"""
        device = obs.device
        actions = torch.randn(obs.shape[0], self.horizon, self.action_dim, device=device)
        
        for t in reversed(range(num_steps)):
            timestep = torch.full((obs.shape[0],), t, device=device)
            noise_pred = self(obs, actions, timestep)
            # Toy update for illustration only; a real DDPM/DDIM sampler rescales by the
            # noise schedule (and, for DDPM, adds fresh noise at each step)
            actions = actions - noise_pred * (1.0 / num_steps)
        
        return actions
```

---

### 14.10 Sim-to-Real Transfer

**Q: Why is sim-to-real transfer important?**

- Training in simulation is cheap, safe, and parallelizable
- Real robot data is expensive and slow to collect
- Challenge: simulation doesn't perfectly match reality (reality gap)

**Q: Techniques for sim-to-real transfer**

| Technique | Description |
|-----------|-------------|
| Domain Randomization | Randomize sim parameters (friction, mass, lighting) |
| System Identification | Measure real parameters, match in sim |
| Domain Adaptation | Learn to map sim → real features |
| Real-world fine-tuning | Pre-train in sim, fine-tune on real |

**Q: What is domain randomization?**

Randomize simulation parameters during training so policy becomes robust to variations:
- Physics: friction, mass, damping
- Visual: lighting, textures, camera pose
- Dynamics: delays, noise

**Code - Domain Randomization Example:**
```python
import numpy as np

class DomainRandomizer:
    def __init__(self):
        self.friction_range = (0.5, 1.5)
        self.mass_scale_range = (0.8, 1.2)
        self.observation_noise = 0.01
        self.action_delay_range = (0, 3)  # frames, inclusive
    
    def randomize_physics(self, env):
        # set_friction / scale_masses stand in for simulator-specific APIs
        friction = np.random.uniform(*self.friction_range)
        mass_scale = np.random.uniform(*self.mass_scale_range)
        env.set_friction(friction)
        env.scale_masses(mass_scale)
    
    def add_observation_noise(self, obs):
        noise = np.random.normal(0, self.observation_noise, obs.shape)
        return obs + noise
    
    def randomize_action_delay(self):
        low, high = self.action_delay_range
        return np.random.randint(low, high + 1)  # randint excludes the upper bound
```

---

### 14.11 Recent Advanced Robotics Models

#### ACT (Action Chunking with Transformers)

**Q: What is ACT?**

A transformer-based imitation learning algorithm from Stanford that predicts sequences of actions ("action chunks") rather than single actions.

**Key Features:**
- Uses CVAE (Conditional VAE) to handle multimodal demonstrations
- Predicts chunks of 50-100 future actions at once
- Achieves 80-90% success on bimanual tasks with only 10 min of demos
- Used with ALOHA hardware for low-cost bimanual manipulation

**Architecture:**
```
Observation (images + proprioception) → Encoder → Transformer → Action Chunk (k actions)
                                          ↑
                                    CVAE latent z (captures style variation)
```

**Why action chunking helps:**
- Reduces compounding errors (fewer decision points)
- Captures temporal correlations in demonstrations
- Smoother, more natural trajectories
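
When chunks are re-predicted at every step, several chunks overlap on the current timestep, and their predictions can be blended for extra smoothness. The sketch below shows one exponential weighting scheme of the kind used with ACT; the decay rate `m` and the oldest-first ordering are assumptions to check against the implementation you use.

```python
import math

def temporal_ensemble(predictions, m=0.1):
    """Weighted average of overlapping predictions for one timestep.

    predictions: actions predicted for the current timestep, ordered oldest first.
    Weights are w_i = exp(-m * i), so the oldest prediction (i = 0) weighs the most.
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total

blended = temporal_ensemble([0.0, 1.0])
```

A smaller `m` spreads the weight more evenly across predictions; a larger `m` trusts the earliest prediction more.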

**Code - ACT Concept:**
```python
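import torch
import torch.nn as nn

class ACT(nn.Module):
    """Conceptual sketch; the real model uses image backbones and a larger encoder-decoder."""
    def __init__(self, obs_dim, action_dim, chunk_size=50, latent_dim=32, hidden_dim=256):
        super().__init__()
        self.chunk_size = chunk_size
        self.latent_dim = latent_dim
        
        # CVAE encoder over the demonstrated action chunk (only used during training)
        self.action_embed = nn.Linear(action_dim, hidden_dim)
        enc_layer = nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.latent_proj = nn.Linear(hidden_dim, latent_dim * 2)  # mean, logvar
        
        # Observation encoder and latent projection (conditioning for the decoder)
        self.obs_encoder = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, hidden_dim)
        )
        self.z_proj = nn.Linear(latent_dim, hidden_dim)
        
        # Decoder: one learned query per future step attends to [z, obs]
        self.queries = nn.Parameter(torch.randn(chunk_size, hidden_dim) * 0.02)
        dec_layer = nn.TransformerDecoderLayer(hidden_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.action_head = nn.Linear(hidden_dim, action_dim)
    
    def forward(self, obs, actions=None):
        obs_embed = self.obs_encoder(obs)  # (batch, hidden_dim)
        
        if self.training and actions is not None:
            # Encode the action chunk to a latent z (CVAE posterior), mean-pooled over time
            h = self.encoder(self.action_embed(actions)).mean(dim=1)
            z_mean, z_logvar = self.latent_proj(h).chunk(2, dim=-1)
            z = z_mean + torch.randn_like(z_mean) * torch.exp(0.5 * z_logvar)
        else:
            # At inference, use the prior mean (z = 0) for a deterministic output
            z = torch.zeros(obs.shape[0], self.latent_dim, device=obs.device)
        
        # Decode the action chunk
        memory = torch.stack([self.z_proj(z), obs_embed], dim=1)  # (batch, 2, hidden_dim)
        tgt = self.queries.unsqueeze(0).expand(obs.shape[0], -1, -1)
        return self.action_head(self.decoder(tgt, memory))  # (batch, chunk_size, action_dim)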
```

---

#### ALOHA / Mobile ALOHA

**Q: What is ALOHA?**

ALOHA stands for "A Low-cost Open-source Hardware System for Bimanual Teleoperation": a dual-arm platform for teleoperation and imitation learning.

| Version | Features |
|---------|----------|
| ALOHA | Dual-arm tabletop, ~$20K hardware |
| ALOHA 2 | Improved ergonomics, better teleoperation |
| Mobile ALOHA | Adds mobile base for whole-body control |
| ALOHA Unleashed | DeepMind's diffusion policy on ALOHA 2 |

**Key capabilities:**
- Bimanual manipulation (two 6-DOF arms)
- Leader-follower teleoperation for data collection
- Tasks: tying shoelaces, folding clothes, cooking

---

#### π₀ (Pi-Zero) - Physical Intelligence

**Q: What is π₀?**

A general-purpose robot foundation model from Physical Intelligence that combines VLM pre-training with flow matching for action generation.

**Architecture:**
```
Vision (images) → PaliGemma VLM (3B) → Flow Matching → Robot Actions
Language (task)  ↗                      (iterative denoising)
```

**Key Innovations:**
- Built on PaliGemma (3B parameter VLM)
- Uses flow matching instead of diffusion for action generation
- Trained on diverse robot data across multiple embodiments
- Can fold laundry, clear tables, manipulate deformable objects

**Flow Matching vs Diffusion:**
| Aspect | Diffusion | Flow Matching |
|--------|-----------|---------------|
| Process | Add/remove noise | Learn velocity field |
| Training | Predict noise | Predict flow direction |
| Sampling | Many steps (50-1000) | Fewer steps (10-50) |
| Speed | Slower | Faster |
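
The flow-matching column can be illustrated with scalars. This sketch uses the common straight-line path from noise at t=0 to data at t=1; that convention, the function names, and the stand-in velocity function (which replaces a trained network) are assumptions, and specific models may parameterize time differently.

```python
def interpolate(x0, x1, t):
    """Straight-line path between a noise sample x0 and a data sample x1."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Regression target for the velocity network along the straight-line path."""
    return x1 - x0

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (action) with Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        x = x + velocity_fn(x, i * dt) * dt
    return x

# Stand-in for a trained network: the exact velocity toward a single target action 0.7
x0, x1 = -1.3, 0.7
action = euler_sample(lambda x, t: target_velocity(x0, x1), x0)
```

Because the learned paths are close to straight, a handful of Euler steps is often enough, which is where the sampling speed advantage comes from.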

---

#### NVIDIA GR00T (Generalist Robot 00 Technology)

**Q: What is Project GR00T?**

NVIDIA's foundation model initiative for humanoid robots, announced at GTC 2024.

**Key Features:**
- Multimodal: processes language, vision, and outputs actions
- Learns from human demonstrations via video
- Integrated with NVIDIA Isaac for simulation and RL
- Designed for humanoid robots (partnered with Figure, 1X, etc.)

**Architecture (High-level):**
```
Multimodal Input → Foundation Model → High-level Planning
(language, video)                            ↓
                                    Low-level Control (Isaac)
                                            ↓
                                      Robot Actions
```

**Ecosystem:**
- Isaac Lab: RL training in simulation
- Isaac Sim: High-fidelity physics simulation
- Jetson Thor: Edge compute for humanoids

---

#### OpenVLA

**Q: What is OpenVLA?**

An open-source 7B parameter Vision-Language-Action model for robot control.

**Architecture:**
```
Image → DINOv2 + SigLIP → Projector → Llama 2 (7B) → Action Tokens
                                  ↑
                            Language instruction
```

**Key Details:**
- Base: Llama 2 7B language model
- Vision: Fused DINOv2 + SigLIP features
- Training: 970K trajectories from Open X-Embodiment
- Actions: Discretized into 256 bins, output as tokens

**Code - OpenVLA-style Action Tokenization:**
```python
import torch

def tokenize_actions(actions, num_bins=256, action_min=-1, action_max=1):
    """Discretize continuous actions (a tensor) into integer bin indices"""
    # Normalize to [0, 1]
    normalized = (actions - action_min) / (action_max - action_min)
    # Round to the nearest bin (plain .long() would truncate toward zero)
    tokens = (normalized * (num_bins - 1)).round().long()
    return tokens.clamp(0, num_bins - 1)

def detokenize_actions(tokens, num_bins=256, action_min=-1, action_max=1):
    """Convert tokens back to continuous actions"""
    normalized = tokens.float() / (num_bins - 1)
    actions = normalized * (action_max - action_min) + action_min
    return actions
```

---

#### Octo

**Q: What is Octo?**

An open-source generalist robot policy from UC Berkeley, trained on 800K+ trajectories.

**Key Features:**
- Transformer-based architecture (27M or 93M params)
- Trained on Open X-Embodiment dataset
- Supports multiple robots and observation types
- Designed for easy fine-tuning to new setups

**Architecture:**
```
Observations → Tokenizers → Transformer Backbone → Readout Heads → Actions
(images, lang)   (modular)      (shared)            (task-specific)
```

**Comparison with OpenVLA:**
| Aspect | Octo | OpenVLA |
|--------|------|---------|
| Size | 27M-93M | 7B |
| Base | Custom transformer | Llama 2 |
| Actions | Continuous | Discretized tokens |
| Focus | Efficiency, fine-tuning | Generalization |

---

#### FastSAC / FastTD3

**Q: What is FastSAC?**

An optimized off-policy RL algorithm that enables training humanoid locomotion in 15 minutes on a single GPU.

**Key Innovations:**
- Optimized SAC/TD3 for massively parallel simulation
- Strong domain randomization (dynamics, terrain, perturbations)
- Achieves sim-to-real transfer for humanoid walking
- Single RTX 4090 GPU, 15 minutes training

**Why it matters for humanoids:**
- Traditional RL: days of training
- PPO (on-policy): hours with parallel sim
- FastSAC (off-policy): 15 minutes with better sample efficiency

**Code - FastSAC Key Ideas:**
```python
class FastSAC:
    """Key optimizations for fast humanoid RL"""
    def __init__(self, env, num_envs=4096):
        self.num_envs = num_envs  # Massively parallel
        
        # Larger replay buffer for off-policy
        self.buffer_size = 1_000_000
        
        # Aggressive domain randomization
        self.domain_rand = {
            'friction': (0.5, 2.0),
            'mass_scale': (0.8, 1.2),
            'motor_strength': (0.8, 1.2),
            'terrain_roughness': (0, 0.1),
            'push_force': (0, 200),  # N
        }
    
    def update(self, batch_size=256):
        # Standard SAC update but with:
        # 1. Large batch sizes (256-1024)
        # 2. High UTD ratio (updates per env step)
        # 3. Parallel critic updates
        pass
```

---

#### RT-1 / RT-2 / RT-X

**Q: Compare Google's RT model family**

| Model | Architecture | Actions | Key Innovation |
|-------|--------------|---------|----------------|
| RT-1 | EfficientNet + Transformer | Discrete (256 bins) | Large-scale robot data |
| RT-2 | PaLM-E / PaLI-X VLM | Text tokens | VLM → robot actions |
| RT-X | RT-1-X / RT-2-X trained on Open X-Embodiment data | Discrete tokens | Transfer across robots |

**RT-2 Key Insight:**
Actions can be represented as text tokens, allowing VLMs to directly output robot commands:
```
Input: "Pick up the apple" + image
Output: "1 128 64 32 255 128 64"  (tokenized action)
```
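
Decoding such a token string back into numbers might look like the sketch below. It treats every integer as a bin index over an assumed range of [-1, 1]; the actual RT-2 layout (which positions hold the terminate flag, translation, rotation, and gripper) and the per-dimension ranges are not specified here, so treat both as hypothetical.

```python
def decode_action_string(text, num_bins=256, low=-1.0, high=1.0):
    """Map a string of integer bin indices back to continuous values in [low, high]."""
    bins = [int(tok) for tok in text.split()]
    return [low + (high - low) * b / (num_bins - 1) for b in bins]

values = decode_action_string("1 128 64 32 255 128 64")
```

The resolution is limited to `(high - low) / (num_bins - 1)` per dimension, which is the precision cost of discrete action tokens.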

---

### 14.12 Humanoid Robot Specific Concepts

**Q: What makes humanoid control challenging?**

- High DOF (30+ joints for Optimus)
- Underactuated (can't directly control CoM)
- Balance constraints (ZMP, CoP)
- Contact dynamics (walking, manipulation)

**Q: Explain Zero Moment Point (ZMP)**

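The point on the ground where the horizontal components of the net moment from inertial and gravity forces are zero. For stable walking:
- ZMP must stay strictly within the support polygon (the convex hull of the foot contact area)
- If the ZMP reaches the edge of the support polygon, the foot starts to rotate about that edge and the robot may tip over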
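
For intuition, the ZMP can be computed in closed form under the cart-table (linear inverted pendulum) simplification, which assumes constant CoM height and flat ground. The numbers and the 1D foot extent below are illustrative assumptions.

```python
G = 9.81  # gravitational acceleration, m/s^2

def zmp_x(com_x, com_height, com_accel_x):
    """ZMP along x for the cart-table model (constant CoM height, flat ground)."""
    return com_x - (com_height / G) * com_accel_x

def is_stable(zmp, foot_min, foot_max):
    """1D support check: the ZMP must lie within the foot's extent."""
    return foot_min <= zmp <= foot_max

# CoM 0.8 m high; foot spanning [-0.05, 0.15] m along x
gentle = zmp_x(0.0, 0.8, com_accel_x=0.5)    # small forward acceleration
harsh = zmp_x(0.0, 0.8, com_accel_x=3.0)     # aggressive forward acceleration
```

Accelerating the CoM forward shifts the ZMP backward; the gentle case stays inside the foot while the harsh one leaves it, so the planner would have to slow down or take a step.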

**Q: What is whole-body control?**

Coordinating all joints to achieve multiple objectives:
1. Primary: End-effector task (reach object)
2. Secondary: Maintain balance
3. Tertiary: Joint limit avoidance

Uses null-space projection to satisfy lower-priority tasks without affecting higher-priority ones.
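
Null-space projection can be shown with a one-dimensional primary task and three joints, using $\dot{q} = J^+ \dot{x} + (I - J^+ J)\dot{q}_2$. This is a minimal sketch for a 1xN Jacobian in plain Python; the Jacobian values and secondary motion are made up, and it assumes J is nonzero.

```python
def prioritized_velocity(J, x_dot, q_dot_secondary):
    """q_dot = J^+ x_dot + (I - J^+ J) q_dot_secondary for a 1xN task Jacobian J."""
    jj = sum(j * j for j in J)                      # J J^T (a scalar for a 1xN Jacobian)
    j_pinv = [j / jj for j in J]                    # J^+ = J^T (J J^T)^-1
    primary = [p * x_dot for p in j_pinv]
    # Null-space projection: remove the part of the secondary motion that moves the task
    along = sum(j * q for j, q in zip(J, q_dot_secondary))
    projected = [q - p * along for q, p in zip(q_dot_secondary, j_pinv)]
    return [a + b for a, b in zip(primary, projected)]

J = [1.0, 0.5, 0.25]
q_dot = prioritized_velocity(J, x_dot=0.2, q_dot_secondary=[0.0, 1.0, -1.0])
task_velocity = sum(j * q for j, q in zip(J, q_dot))
```

`task_velocity` still equals the commanded 0.2, so the secondary motion has not disturbed the primary task.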

---

### 14.13 Common Robotics Interview Questions

1. **How would you approach teaching a robot to pick up novel objects?**
   - Use VLA model pretrained on diverse objects (OpenVLA, RT-2)
   - Fine-tune with few demonstrations of new objects
   - Domain randomization for visual variations

2. **How do you handle real-time constraints?**
   - Model optimization (quantization, pruning)
   - Efficient architectures (MobileNet, EfficientNet)
   - Action chunking (predict multiple steps, execute while computing next)
   - Hierarchical control (high-level policy at 10Hz, low-level at 1kHz)

3. **How would you make a robot safe around humans?**
   - Impedance/compliance control
   - Force/torque sensing and limits
   - Predictive models for human motion
   - Fail-safe behaviors

4. **What's your approach to debugging a robot that fails in the real world but works in simulation?**
   - Check sensor calibration
   - Compare state distributions (sim vs real)
   - Add domain randomization
   - Collect failure cases, add to training

5. **How do you handle multimodal inputs (vision + proprioception + language)?**
   - Separate encoders for each modality
   - Cross-attention or concatenation for fusion
   - Pretrained encoders (CLIP for vision, LLM for language)

6. **Compare ACT vs Diffusion Policy for imitation learning**
   - ACT: CVAE + action chunking, faster inference, good for bimanual
   - Diffusion: Better multimodality, smoother trajectories, slower
   - Both use action chunking to reduce compounding errors

7. **Why would you choose π₀ over OpenVLA?**
   - π₀: Flow matching (faster), better for dexterous manipulation
   - OpenVLA: Open-source, easier to fine-tune, larger community
   - Both: VLM-based, language-conditioned

8. **How does FastSAC achieve 15-minute humanoid training?**
   - Off-policy (better sample efficiency than PPO)
   - Massively parallel simulation (4096+ envs)
   - Optimized implementation for GPU
   - Strong domain randomization for sim-to-real

9. **What are the tradeoffs between discrete vs continuous action spaces in VLAs?**
   - Discrete (RT-2, OpenVLA): Leverages LLM tokenization, but loses precision
   - Continuous (Octo, π₀): More precise, but needs different output heads
   - Hybrid: Coarse discrete + fine continuous refinement

10. **How would you deploy a VLA model on a real robot with latency constraints?**
    - Action chunking (compute once, execute many)
    - Model distillation to smaller network
    - Quantization (INT8)
    - Async inference pipeline

---

## Part 15: Data Augmentation

### 15.1 Image Augmentation

**Q: What augmentations are commonly used for images?**

| Augmentation | Effect | When to Use |
|--------------|--------|-------------|
| Random crop | Translation invariance | Almost always |
| Horizontal flip | Mirror invariance | When symmetric |
| Color jitter | Lighting robustness | Natural images |
| Rotation | Rotation invariance | When applicable |
| Cutout/RandomErasing | Occlusion robustness | Classification |
| MixUp | Smoother decision boundary | Classification |
| CutMix | Combines cutout + mixup | Classification |

**Code - Image Augmentation:**
```python
import numpy as np
import torch
import torchvision.transforms as T
from torchvision.transforms import autoaugment

# Standard augmentation pipeline
train_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    T.RandomErasing(p=0.5)
])

# AutoAugment (learned augmentation policy)
auto_aug = T.Compose([
    T.RandomResizedCrop(224),
    autoaugment.AutoAugment(autoaugment.AutoAugmentPolicy.IMAGENET),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# MixUp implementation
def mixup(x, y, alpha=0.2):
    lam = np.random.beta(alpha, alpha)
    batch_size = x.size(0)
    index = torch.randperm(batch_size)
    
    mixed_x = lam * x + (1 - lam) * x[index]
    y_a, y_b = y, y[index]
    return mixed_x, y_a, y_b, lam

# MixUp loss
def mixup_criterion(criterion, pred, y_a, y_b, lam):
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)
```

---

### 15.2 Text Augmentation

**Q: What augmentations work for text?**

| Technique | Description |
|-----------|-------------|
| Synonym replacement | Replace words with synonyms |
| Random insertion | Insert random synonyms |
| Random swap | Swap word positions |
| Random deletion | Delete random words |
| Back-translation | Translate to another language and back |
| EDA | Easy Data Augmentation (combines above) |

**Code - Text Augmentation:**
```python
import random
from nltk.corpus import wordnet  # requires nltk.download('wordnet') once

def synonym_replacement(words, n=1):
    new_words = words.copy()
    random_word_list = list(set([w for w in words if wordnet.synsets(w)]))
    random.shuffle(random_word_list)
    
    for random_word in random_word_list[:n]:
        synonyms = set()
        for syn in wordnet.synsets(random_word):
            for lemma in syn.lemmas():
                name = lemma.name().replace('_', ' ')  # multi-word lemmas use underscores
                if name.lower() != random_word.lower():
                    synonyms.add(name)  # skip the word itself
        if synonyms:
            synonym = random.choice(sorted(synonyms))
            new_words = [synonym if w == random_word else w for w in new_words]
    return new_words

def random_deletion(words, p=0.1):
    if len(words) <= 1:  # also guards against an empty list
        return words
    return [w for w in words if random.random() > p] or [random.choice(words)]

def random_swap(words, n=1):
    new_words = words.copy()
    for _ in range(n):
        if len(new_words) >= 2:
            idx1, idx2 = random.sample(range(len(new_words)), 2)
            new_words[idx1], new_words[idx2] = new_words[idx2], new_words[idx1]
    return new_words
```

---

### 15.3 Tabular Data Augmentation

**Q: How do you augment tabular data?**

| Technique | Description |
|-----------|-------------|
| SMOTE | Synthetic minority oversampling |
| Noise injection | Add Gaussian noise to features |
| Feature crossover | Combine features from different samples |
| Mixup | Interpolate between samples |

**Code - Tabular Augmentation:**
```python
import numpy as np
from imblearn.over_sampling import SMOTE

# SMOTE for imbalanced data (X_train, y_train are your existing training arrays)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Gaussian noise injection
def add_noise(X, noise_factor=0.1):
    noise = np.random.normal(0, noise_factor, X.shape)
    return X + noise

# Per-sample mixup for tabular data (assumes a 1-D numeric y; yields soft labels)
def tabular_mixup(X, y, alpha=0.2):
    lam = np.random.beta(alpha, alpha, size=(len(X), 1))
    index = np.random.permutation(len(X))
    X_mixed = lam * X + (1 - lam) * X[index]
    y_mixed = lam.squeeze() * y + (1 - lam.squeeze()) * y[index]
    return X_mixed, y_mixed
```

---

## Part 16: Ethics & Fairness in ML

### 16.1 Bias in ML Systems

**Q: What types of bias exist in ML?**

| Bias Type | Description | Example |
|-----------|-------------|---------|
| Selection bias | Training data not representative | Hiring model trained on past (biased) decisions |
| Measurement bias | Features measured differently | Different quality cameras for different groups |
| Algorithmic bias | Model amplifies existing bias | Word embeddings encoding stereotypes |
| Evaluation bias | Test set not representative | Face recognition tested mainly on one demographic |

### 19.2 Fairness Metrics

**Q: What fairness metrics should you consider?**

| Metric | Definition | Use When |
|--------|------------|----------|
| Demographic parity | P(Ŷ=1\|A=0) = P(Ŷ=1\|A=1) | Selection rates should match across groups |
| Equalized odds | TPR and FPR equal across groups | Error rates should match across groups |
| Calibration | P(Y=1\|Ŷ=p, A=a) = p for every group a | Scores are consumed as probabilities |
| Individual fairness | Similar individuals treated similarly | Decisions must be consistent case by case |

**Q: How do you mitigate bias?**

| Stage | Technique |
|-------|-----------|
| Pre-processing | Resampling, reweighting data |
| In-processing | Fairness constraints in loss |
| Post-processing | Adjust thresholds per group |
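
The post-processing row can be sketched as follows: choose a separate score threshold per group so that each group's positive rate lands near a shared target. The helper names `group_thresholds` and `predict` and the target-rate rule are illustrative assumptions, not a library API; tied scores can push a group slightly above the target.

```python
def group_thresholds(scores, groups, target_rate):
    """Per-group score thresholds giving roughly `target_rate` positives in each group."""
    thresholds = {}
    for g in set(groups):
        # Scores belonging to this group, highest first
        g_scores = sorted((s for s, gg in zip(scores, groups) if gg == g), reverse=True)
        k = max(1, round(target_rate * len(g_scores)))  # how many positives to admit
        thresholds[g] = g_scores[k - 1]                 # admit scores >= the k-th best
    return thresholds

def predict(scores, groups, thresholds):
    """Apply each sample's group-specific threshold."""
    return [int(s >= thresholds[g]) for s, g in zip(scores, groups)]

scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.5, 0.2, 0.1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
th = group_thresholds(scores, groups, target_rate=0.5)
preds = predict(scores, groups, th)
```

A single global threshold of 0.7 would select two members of group `a` and none of group `b`; the per-group thresholds select two from each.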

**Code - Fairness Evaluation:**
```python
import numpy as np
from sklearn.metrics import confusion_matrix

def demographic_parity(y_pred, sensitive_attr):
    """Check if positive prediction rate is equal across groups"""
    groups = np.unique(sensitive_attr)
    rates = {}
    for g in groups:
        mask = sensitive_attr == g
        rates[g] = y_pred[mask].mean()
    return rates

def equalized_odds(y_true, y_pred, sensitive_attr):
    """Check if TPR and FPR are equal across groups"""
    groups = np.unique(sensitive_attr)
    metrics = {}
    for g in groups:
        mask = sensitive_attr == g
        # labels=[0, 1] forces a 2x2 matrix even if a group contains only one class
        tn, fp, fn, tp = confusion_matrix(y_true[mask], y_pred[mask], labels=[0, 1]).ravel()
        metrics[g] = {
            'tpr': tp / (tp + fn) if (tp + fn) > 0 else 0,
            'fpr': fp / (fp + tn) if (fp + tn) > 0 else 0
        }
    return metrics
```

---

### 19.3 Responsible AI Practices

**Q: What are key responsible AI principles?**

1. **Transparency:** Explain how models make decisions
2. **Accountability:** Clear ownership of model outcomes
3. **Privacy:** Protect user data, differential privacy
4. **Safety:** Test for harmful outputs, red-teaming
5. **Human oversight:** Human-in-the-loop for high-stakes decisions

**Q: What is model interpretability?**

| Method | Type | Description |
|--------|------|-------------|
| LIME | Local | Approximate model locally with interpretable model |
| SHAP | Local/Global | Shapley values for feature importance |
| Attention weights | Model-specific | Visualize what model attends to |
| Integrated gradients | Gradient-based | Attribution to input features |
| Counterfactuals | Example-based | "What would change the prediction?" |
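
To make the SHAP row concrete, here is a brute-force sketch of exact Shapley values for a tiny model. It is not the `shap` library's API: `shapley_values` and the toy `model` are hypothetical, and the cost grows factorially with the number of features, which is why practical tools approximate.

```python
from itertools import permutations

def shapley_values(f, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)   # start from the baseline input
        prev = f(current)
        for i in order:
            current[i] = x[i]      # reveal feature i
            new = f(current)
            phi[i] += new - prev   # marginal contribution of feature i in this ordering
            prev = new
    return [p / len(orders) for p in phi]

def model(v):
    # Toy model (assumption): linear terms plus one interaction between features 0 and 2
    return 2 * v[0] + 3 * v[1] + v[0] * v[2]

phi = shapley_values(model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
```

The attributions sum to `model(x) - model(baseline)`, and the interaction term is split evenly between the two features involved.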

---

## Part 17: Common Interview Questions

### Conceptual Questions

1. **Why do we need non-linear activation functions?**
   - Without them, any depth of linear layers collapses to one linear transformation

2. **What is the vanishing gradient problem?**
   - Gradients become tiny in deep networks, preventing learning. Solutions: ReLU, residual connections, proper initialization

3. **Why do transformers use residual connections?**
   - Provide "gradient highways" for stable training of deep networks

4. **What is teacher forcing?**
   - During training, feed ground truth tokens to decoder instead of predictions. Faster training but can cause exposure bias.

5. **What is the difference between encoder and decoder?**
   - Encoder: bidirectional attention, understands input
   - Decoder: causal attention, generates output
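
The first answer can be checked numerically. This is a small sketch with hand-picked 2x2 weights (the matrices and helper names are arbitrary assumptions): two stacked linear layers equal one layer with the product matrix, and inserting a ReLU breaks that equality.

```python
def matmul(A, B):
    """Matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def relu(v):
    return [max(0.0, t) for t in v]

W1 = [[1.0, 2.0], [0.0, 1.0]]
W2 = [[3.0, -1.0], [2.0, 0.5]]
x = [1.0, -2.0]

two_layers = matvec(W2, matvec(W1, x))       # W2 (W1 x), no activation in between
one_layer = matvec(matmul(W2, W1), x)        # (W2 W1) x, a single linear map
with_relu = matvec(W2, relu(matvec(W1, x)))  # non-linearity prevents the collapse
```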

### Practical Questions

1. **Your model isn't learning. What do you check?**
   - Learning rate (too high/low?)
   - Gradient flow (vanishing/exploding?)
   - Data pipeline (correct labels?)
   - Loss function (appropriate for task?)

2. **How do you handle variable-length sequences?**
   - Padding + attention mask
   - Pack sequences
   - Bucket by length

3. **How do you reduce memory usage during training?**
   - Gradient checkpointing
   - Mixed precision (FP16/BF16)
   - Gradient accumulation
   - LoRA/QLoRA for fine-tuning

4. **How do you speed up inference?**
   - KV caching
   - Quantization (INT8, INT4)
   - Flash Attention
   - Batching
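
For the variable-length question, padding with an attention mask and bucketing by length can be sketched in plain Python. The function names and `pad_id=0` are assumptions for illustration; real tokenizers define their own padding token.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to equal length and build a 1/0 attention mask."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask

def bucket_by_length(sequences, batch_size):
    """Sort by length so each batch holds similar lengths and needs little padding."""
    ordered = sorted(sequences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

ids, mask = pad_batch([[5, 6, 7], [8]])
batches = bucket_by_length([[1, 2, 3], [4], [5, 6], [7]], batch_size=2)
```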

---

## Part 18: Code Snippets to Know

### Basic Training Loop
```python
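model.train()  # enable dropout and BatchNorm batch statistics
for epoch in range(epochs):
    for x, y in dataloader:
        optimizer.zero_grad()      # clear gradients from the previous step
        pred = model(x)
        loss = criterion(pred, y)
        loss.backward()            # backpropagate
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip global grad norm to 1.0
        optimizer.step()
        scheduler.step()  # per-batch schedule (e.g. OneCycleLR); step once per epoch for epoch-based schedulers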
```

### Inference Mode
```python
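import torch
import torch.nn.functional as F

model.eval()  # disable dropout; BatchNorm uses running statistics
with torch.no_grad():  # no autograd graph: less memory, faster
    predictions = model(inputs)  # raw logits
    probabilities = F.softmax(predictions, dim=-1)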
```

### DataLoader and Dataset
```python
import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.long)  # class indices, as CrossEntropyLoss expects
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

dataset = CustomDataset(X_train, y_train)
dataloader = DataLoader(
    dataset, 
    batch_size=32, 
    shuffle=True,
    num_workers=4,
    pin_memory=True  # faster GPU transfer
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for batch_x, batch_y in dataloader:
    batch_x = batch_x.to(device)
    batch_y = batch_y.to(device)
    # training step...
```

### Custom Module
```python
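import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm transformer block: self-attention and FFN, each wrapped in a residual."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        # Default layout is (seq_len, batch, d_model); pass batch_first=True for (batch, seq_len, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4*d_model),
            nn.GELU(),
            nn.Linear(4*d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
    
    def forward(self, x):
        h = self.norm1(x)  # normalize once, reuse as query, key, and value
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.norm2(x))
        return x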
```

### Attention Implementation
```python
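import math
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention; mask is 1 to keep a position, 0 to block it."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))  # blocked positions get zero weight
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V)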
```

---

## Part 19: Quick Reference - Key Formulas

### Machine Learning - Loss Functions
- **MSE (Mean Squared Error):** $\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
- **MAE (Mean Absolute Error):** $\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$
- **Binary Cross-Entropy:** $\mathcal{L} = -\frac{1}{n}\sum[y\log(\hat{y}) + (1-y)\log(1-\hat{y})]$
- **Categorical Cross-Entropy:** $\mathcal{L} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$
- **Huber Loss:** $\mathcal{L} = \begin{cases} \frac{1}{2}(y-\hat{y})^2 & |y-\hat{y}| \leq \delta \\ \delta|y-\hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$
- **KL Divergence:** $D_{KL}(P||Q) = \sum P(x) \log\frac{P(x)}{Q(x)}$
- **Contrastive Loss (InfoNCE):** $\mathcal{L} = -\log\frac{\exp(sim(z_i, z_j)/\tau)}{\sum_{k=1}^{2N}\mathbb{1}_{k\neq i}\exp(sim(z_i, z_k)/\tau)}$
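
Two of these losses written out in plain Python, as a sketch for checking the formulas by hand. The `eps` clamp in the cross-entropy is an added numerical safeguard, not part of the formula.

```python
import math

def huber(y, y_hat, delta=1.0):
    """Quadratic near zero, linear in the tails."""
    r = abs(y - y_hat)
    return 0.5 * r * r if r <= delta else delta * r - 0.5 * delta * delta

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over a batch of probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

small_residual = huber(0.0, 0.5)   # quadratic branch: 0.5 * 0.5**2
large_residual = huber(0.0, 3.0)   # linear branch: 1.0 * 3.0 - 0.5
coin_flip = binary_cross_entropy([1, 0], [0.5, 0.5])  # equals ln 2
```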

### Machine Learning - Activation Functions
- **Sigmoid:** $\sigma(x) = \frac{1}{1 + e^{-x}}$, derivative: $\sigma'(x) = \sigma(x)(1-\sigma(x))$
- **Tanh:** $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, derivative: $\tanh'(x) = 1 - \tanh^2(x)$
- **ReLU:** $f(x) = \max(0, x)$, derivative: $f'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$
- **Leaky ReLU:** $f(x) = \max(\alpha x, x)$ where $\alpha \approx 0.01$
- **GeLU:** $f(x) = x \cdot \Phi(x) \approx 0.5x(1 + \tanh[\sqrt{2/\pi}(x + 0.044715x^3)])$
- **Swish/SiLU:** $f(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$
- **Softmax:** $\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$
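
A short sketch of two details interviewers often probe: the max-subtraction trick that keeps softmax from overflowing, and how close the tanh form of GeLU is to the exact $x\Phi(x)$. Function names are illustrative.

```python
import math

def softmax(z):
    """Numerically stable softmax: subtracting max(z) leaves the result unchanged."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def gelu_exact(x):
    """x * Phi(x), with the Gaussian CDF written via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation from the formula above."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

probs = softmax([1000.0, 1001.0, 1002.0])  # naive exp(1000) would overflow
gap = abs(gelu_exact(1.0) - gelu_tanh(1.0))
```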

### Machine Learning - Regularization
- **L1 (Lasso):** $\mathcal{L}_{reg} = \mathcal{L} + \lambda\sum|w_i|$
- **L2 (Ridge):** $\mathcal{L}_{reg} = \mathcal{L} + \lambda\sum w_i^2$
- **Elastic Net:** $\mathcal{L}_{reg} = \mathcal{L} + \lambda_1\sum|w_i| + \lambda_2\sum w_i^2$
- **Dropout:** $\tilde{y} = \frac{1}{1-p}\, r \odot y$ where $r \sim \text{Bernoulli}(1-p)$ and $p$ is the drop probability (inverted dropout, applied only during training)
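
A minimal sketch of inverted dropout, taking `p_drop` as the probability of zeroing a unit (the convention PyTorch's `nn.Dropout` uses). Scaling the kept units by `1 / (1 - p_drop)` during training keeps the expected activation unchanged, so nothing needs rescaling at inference. The function name and seed are assumptions.

```python
import random

def inverted_dropout(y, p_drop, rng):
    """Zero each unit with probability p_drop and rescale the survivors."""
    keep = 1.0 - p_drop
    return [v / keep if rng.random() < keep else 0.0 for v in y]

rng = random.Random(0)
y = [1.0] * 100_000
out = inverted_dropout(y, p_drop=0.3, rng=rng)
mean_out = sum(out) / len(out)  # stays close to the input mean of 1.0
```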

### Machine Learning - Metrics
- **Accuracy:** $\frac{TP + TN}{TP + TN + FP + FN}$
- **Precision:** $\frac{TP}{TP + FP}$
- **Recall (Sensitivity):** $\frac{TP}{TP + FN}$
- **F1 Score:** $2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$
- **Specificity:** $\frac{TN}{TN + FP}$
- **IoU (Jaccard):** $\frac{|A \cap B|}{|A \cup B|}$
- **R² Score:** $1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$
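
The classification metrics above, computed from raw binary labels in a small sketch. The zero-denominator fallbacks to `0.0` are a convention chosen here for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "f1": f1,
    }

# 3 true positives, 1 false negative, 2 false positives, 4 true negatives
m = classification_metrics([1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
                           [1, 1, 1, 0, 1, 1, 0, 0, 0, 0])
```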

### Machine Learning - Optimization
- **SGD:** $\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}$
- **SGD + Momentum:** $v_t = \gamma v_{t-1} + \eta \nabla_\theta \mathcal{L}$, $\theta_{t+1} = \theta_t - v_t$
- **Adam:** 
  - $m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$ (first moment)
  - $v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$ (second moment)
  - $\hat{m}_t = m_t/(1-\beta_1^t)$, $\hat{v}_t = v_t/(1-\beta_2^t)$ (bias correction)
  - $\theta_{t+1} = \theta_t - \eta \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
- **Learning Rate Decay:** $\eta_t = \eta_0 / (1 + \text{decay} \cdot t)$
- **Cosine Annealing:** $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{t}{T}\pi))$
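
The Adam update can be traced on a one-dimensional problem. This sketch minimizes $(\theta - 3)^2$; the objective, step count, and learning rate are arbitrary choices for illustration.

```python
import math

def adam_minimize(grad, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Scalar Adam following the four update equations above."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # first moment
        v = beta2 * v + (1 - beta2) * g * g    # second moment
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Gradient of (theta - 3)^2 is 2 * (theta - 3); the minimum is at theta = 3
theta = adam_minimize(lambda th: 2 * (th - 3.0), theta=0.0)
```

Early steps move by roughly `lr` regardless of gradient magnitude, because the first and second moment estimates are normalized against each other.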

### Machine Learning - Statistics
- **Bayes' Theorem:** $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
- **Gaussian/Normal:** $p(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- **Variance:** $\text{Var}(X) = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
- **Covariance:** $\text{Cov}(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)]$
- **Entropy:** $H(X) = -\sum p(x) \log p(x)$

---

### Reinforcement Learning - Core
- **Return:** $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
- **State Value:** $V^\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]$
- **Action Value:** $Q^\pi(s,a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a]$
- **Advantage:** $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$
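
For a finite episode, the return at every step can be computed in one backward pass using $G_t = R_{t+1} + \gamma G_{t+1}$. In this sketch `rewards[t]` holds $R_{t+1}$, an indexing assumption made for convenience.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every step t of one episode, computed back to front."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
```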

### Reinforcement Learning - Bellman Equations
- **Bellman Expectation (V):** $V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R + \gamma V^\pi(s')]$
- **Bellman Expectation (Q):** $Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \sum_{a'} \pi(a'|s') Q^\pi(s',a')$
- **Bellman Optimality (V):** $V^*(s) = \max_a \sum_{s'} P(s'|s,a)[R + \gamma V^*(s')]$
- **Bellman Optimality (Q):** $Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a')$
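
Repeatedly applying the Bellman optimality backup for $V$ gives value iteration. The two-state MDP below is a made-up example, and the `P[s][a] = [(prob, next_state, reward), ...]` layout is an assumed encoding.

```python
def value_iteration(P, gamma=0.9, tol=1e-9):
    """Iterate V(s) <- max_a sum_s' P(s'|s,a) [R + gamma V(s')] until the change is below tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Hypothetical deterministic MDP: staying in state 1 pays 2 per step
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
V = value_iteration(P)
```

Here $V^*(1) = 2/(1-0.9) = 20$ and $V^*(0) = 1 + 0.9 \cdot 20 = 19$, which the iteration recovers.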

### Reinforcement Learning - Algorithms
- **Q-Learning:** $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$
- **SARSA:** $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma Q(s',a') - Q(s,a)]$
- **TD(0):** $V(s) \leftarrow V(s) + \alpha[r + \gamma V(s') - V(s)]$
- **Policy Gradient:** $\nabla_\theta J(\theta) = \mathbb{E}_\pi[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)]$
- **REINFORCE:** $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t$
- **Actor-Critic:** $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a|s) \cdot \delta_t$ where $\delta_t = r + \gamma V(s') - V(s)$
- **PPO Clip:** $\mathcal{L}^{CLIP} = \mathbb{E}[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)]$
- **DQN Loss:** $\mathcal{L} = \mathbb{E}[(r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta))^2]$
- **SAC Entropy:** $J(\pi) = \sum_t \mathbb{E}[r(s_t,a_t) + \alpha H(\pi(\cdot|s_t))]$
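
Two of these updates in isolation, as a sketch: a single tabular Q-learning step and the per-sample PPO clipped term. The dictionary-based Q-table and function names are assumptions.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One off-policy TD update toward r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]

def ppo_clip_term(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for one sample."""
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    return min(ratio * advantage, clipped * advantage)

Q = {0: {"left": 0.0, "right": 0.0}, 1: {"left": 1.0, "right": 2.0}}
new_q = q_learning_update(Q, s=0, a="right", r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

With a positive advantage the clip caps the gain from raising the ratio above `1 + eps`; with a negative advantage the unclipped (more pessimistic) term wins.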

### Reinforcement Learning - Exploration
- **ε-greedy:** $a = \begin{cases} \arg\max_a Q(s,a) & \text{prob } 1-\epsilon \\ \text{random} & \text{prob } \epsilon \end{cases}$
- **Boltzmann/Softmax:** $\pi(a|s) = \frac{e^{Q(s,a)/\tau}}{\sum_{a'} e^{Q(s,a')/\tau}}$
- **UCB:** $a = \arg\max_a [Q(s,a) + c\sqrt{\frac{\ln t}{N(s,a)}}]$
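
The three exploration rules as small functions over a list of action values for one state. This is a sketch; trying every untried action first in `ucb_action` is a common convention assumed here to avoid dividing by zero.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def boltzmann_probs(q_values, tau=1.0):
    """Softmax over Q / tau; lower tau means greedier behaviour."""
    m = max(q_values)
    exps = [math.exp((q - m) / tau) for q in q_values]
    s = sum(exps)
    return [e / s for e in exps]

def ucb_action(q_values, counts, t, c=2.0):
    """Pick the action with the highest value plus exploration bonus."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # untried actions go first
    scores = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, `ucb_action([1.0, 0.9], counts=[100, 1], t=101)` prefers the rarely tried second action even though its estimate is lower.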

---

### Deep Learning - Neural Networks
- **Linear Layer:** $y = Wx + b$
- **Batch Norm:** $\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, $y = \gamma\hat{x} + \beta$
- **Layer Norm:** $\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$, $y = \gamma\hat{x} + \beta$ (over features)
- **RMS Norm:** $\hat{x} = \frac{x}{\text{RMS}(x)}$, $y = \gamma\hat{x}$, $\text{RMS}(x) = \sqrt{\frac{1}{n}\sum x_i^2 + \epsilon}$ (no mean subtraction, no bias)
- **Convolution:** $(f * g)(t) = \sum_{\tau} f(\tau)g(t-\tau)$
- **Conv2D output size:** $\text{out} = \lfloor\frac{\text{in} + 2p - k}{s}\rfloor + 1$
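
A sketch of the output-size formula and of the two per-vector normalizations, with the learnable scale and shift left out for brevity. The `eps` defaults are assumptions.

```python
import math

def conv_out_size(n_in, kernel, stride=1, padding=0):
    """floor((in + 2p - k) / s) + 1 along one spatial dimension."""
    return (n_in + 2 * padding - kernel) // stride + 1

def layer_norm(x, eps=1e-5):
    """Subtract the mean and divide by the standard deviation over features."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """Divide by the root mean square; the mean is not removed."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

first_conv = conv_out_size(224, kernel=7, stride=2, padding=3)  # 224 -> 112
same_pad = conv_out_size(32, kernel=3, stride=1, padding=1)     # 32 -> 32
```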

### Deep Learning - Initialization
- **Xavier/Glorot:** $W \sim \mathcal{U}(-\sqrt{6/(n_{in}+n_{out})}, \sqrt{6/(n_{in}+n_{out})})$
- **He/Kaiming:** $W \sim \mathcal{N}(0, 2/n_{in})$ (variance $2/n_{in}$, i.e. standard deviation $\sqrt{2/n_{in}}$)

### Deep Learning - Gradient
- **Chain Rule:** $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}$
- **Gradient Clipping:** $g \leftarrow g \cdot \min(1, \frac{\text{max\_norm}}{||g||})$
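
Clipping by global norm rescales the whole gradient vector and keeps its direction. A sketch on a flat list of gradient values (`clip_by_norm` is an illustrative name):

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale grads so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5 -> rescaled to norm 1
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5 -> unchanged
```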

---

### Transformers - Attention
- **Scaled Dot-Product:** $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
- **Multi-Head:** $\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1,...,\text{head}_h)W^O$
- **Each head:** $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
- **Causal Mask:** $M_{ij} = \begin{cases} 0 & i \geq j \\ -\infty & i < j \end{cases}$

### Transformers - Position Encoding
- **Sinusoidal:** $PE_{(pos,2i)} = \sin(pos/10000^{2i/d})$, $PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d})$
- **RoPE:** $q_m = R_{\Theta,m}W_q x_m$ where $R_{\Theta,m}$ is rotation matrix
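
The sinusoidal encoding can be built directly from its formula. In this sketch the loop variable `i` is the even dimension index, so it plays the role of $2i$ in the formula above.

```python
import math

def sinusoidal_pe(max_len, d_model):
    """Position encoding table of shape (max_len, d_model) as nested lists."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even dimensions
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions
    return pe

pe = sinusoidal_pe(max_len=8, d_model=4)
```

Position 0 encodes as alternating 0 and 1, and higher dimensions oscillate more slowly with position.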

### Transformers - Architecture
- **FFN:** $\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$
- **SwiGLU:** $\text{SwiGLU}(x) = \text{Swish}(xW_1) \odot (xW_3)$; used as an FFN, $\text{FFN}(x) = \text{SwiGLU}(x)W_2$
- **Pre-Norm Block:** $x = x + \text{Attention}(\text{LN}(x))$, $x = x + \text{FFN}(\text{LN}(x))$
- **Post-Norm Block:** $x = \text{LN}(x + \text{Attention}(x))$, $x = \text{LN}(x + \text{FFN}(x))$

### Transformers - Complexity
- **Self-Attention:** $O(n^2 \cdot d)$ time, $O(n^2)$ memory
- **FFN:** $O(n \cdot d^2)$ time
- **Sliding Window:** $O(n \cdot w \cdot d)$ where $w$ = window size

---

### Robotics - Kinematics
- **Forward Kinematics:** $x = f(q)$ (joint angles → end-effector pose)
- **Inverse Kinematics:** $q = f^{-1}(x)$ (end-effector pose → joint angles)
- **Jacobian:** $\dot{x} = J(q)\dot{q}$
- **Jacobian Inverse:** $\dot{q} = J^{-1}(q)\dot{x}$ or $\dot{q} = J^+(q)\dot{x}$ (pseudo-inverse)
- **Singularity:** $\det(J) = 0$ (robot loses DOF)
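
A planar two-link arm is the usual worked example for these definitions. The sketch below assumes unit link lengths by default; its Jacobian determinant is $l_1 l_2 \sin q_2$, so the arm is singular when fully stretched ($q_2 = 0$).

```python
import math

def forward_kinematics(q1, q2, l1=1.0, l2=1.0):
    """End-effector (x, y) of a planar 2-link arm."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y

def jacobian(q1, q2, l1=1.0, l2=1.0):
    """2x2 Jacobian d(x, y) / d(q1, q2)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [ l1 * c1 + l2 * c12,  l2 * c12]]

def det2(J):
    return J[0][0] * J[1][1] - J[0][1] * J[1][0]

stretched = det2(jacobian(0.3, 0.0))      # singular configuration
bent = det2(jacobian(0.3, math.pi / 2))   # well-conditioned configuration
```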

### Robotics - Dynamics
- **Equation of Motion:** $M(q)\ddot{q} + C(q,\dot{q})\dot{q} + G(q) = \tau$
  - $M$: mass matrix, $C$: Coriolis, $G$: gravity, $\tau$: torques
- **Impedance Control:** $F = M_d\ddot{x} + B_d\dot{x} + K_d(x - x_d)$

### Robotics - Control
- **PID:** $u(t) = K_p e(t) + K_i \int_0^t e(\tau)d\tau + K_d \frac{de(t)}{dt}$
- **PD + Gravity Comp:** $\tau = K_p(q_d - q) + K_d(\dot{q}_d - \dot{q}) + G(q)$
- **Computed Torque:** $\tau = M(q)(\ddot{q}_d + K_d\dot{e} + K_p e) + C(q,\dot{q})\dot{q} + G(q)$
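
A discrete-time version of the PID law, as a sketch: the integral becomes a running sum and the derivative a finite difference. Setting the derivative to zero on the first call is a choice made here to avoid a startup spike.

```python
class PID:
    """Discrete PID controller operating on an error signal."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        self.integral += error * dt  # rectangle-rule integral of the error
        if self.prev_error is None:
            derivative = 0.0         # no previous sample yet
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(kp=2.0, ki=1.0, kd=0.5)
u1 = pid.update(error=1.0, dt=0.1)  # 2*1.0 + 1*0.1 + 0
u2 = pid.update(error=0.5, dt=0.1)  # 2*0.5 + 1*0.15 + 0.5*(-5)
```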

### Robotics - Transformations
- **Homogeneous Transform:** $T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$ (4×4)
- **Rotation Matrix (z-axis):** $R_z(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}$
- **Quaternion:** $q = w + xi + yj + zk$, $||q|| = 1$ for rotation
- **Quaternion to Rotation:** $R = I + 2w[\mathbf{v}]_\times + 2[\mathbf{v}]_\times^2$ where $\mathbf{v} = (x,y,z)$
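
The quaternion-to-rotation formula can be checked against $R_z(\theta)$. This sketch normalizes the quaternion first, then evaluates $I + 2w[\mathbf{v}]_\times + 2[\mathbf{v}]_\times^2$ with nested lists; helper names are illustrative.

```python
import math

def skew(v):
    """Cross-product matrix [v]x."""
    x, y, z = v
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def matmul3(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def quat_to_rot(w, x, y, z):
    """Rotation matrix of a quaternion (normalized internally)."""
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    K = skew((x, y, z))
    K2 = matmul3(K, K)
    return [[(1.0 if i == j else 0.0) + 2 * w * K[i][j] + 2 * K2[i][j]
             for j in range(3)] for i in range(3)]

# 90 degrees about z: q = (cos 45deg, 0, 0, sin 45deg)
half = math.pi / 4
R = quat_to_rot(math.cos(half), 0.0, 0.0, math.sin(half))
```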

### Robotics - Planning
- **A* Heuristic:** $f(n) = g(n) + h(n)$ (cost-so-far + estimated-to-goal)
- **RRT Steering:** $q_{new} = q_{near} + \frac{q_{rand} - q_{near}}{||q_{rand} - q_{near}||} \cdot \delta$
- **MPC Cost:** $J = \sum_{t=0}^{H}[x_t^T Q x_t + u_t^T R u_t]$ subject to $x_{t+1} = f(x_t, u_t)$

### Robotics - Balance (Humanoids)
- **ZMP (Zero Moment Point):** Point on the ground where the horizontal components of the net moment from gravity and inertial forces are zero
- **CoM Dynamics:** $\ddot{x}_{CoM} = \frac{g}{z_{CoM}}(x_{CoM} - x_{ZMP})$ (linear inverted pendulum)
- **Stability:** ZMP must stay within support polygon

---

### Diffusion Models
- **Forward Process:** $q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)$
- **Reverse Process:** $p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$
- **Training Loss:** $\mathcal{L} = \mathbb{E}_{t,x_0,\epsilon}[||\epsilon - \epsilon_\theta(x_t, t)||^2]$
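- **Sampling:** $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x_t,t)) + \sigma_t z$ where $\alpha_t = 1-\beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, $z \sim \mathcal{N}(0, I)$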
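
A sketch of the noise schedule behind these formulas, with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$. The linear schedule endpoints and `T = 1000` follow common DDPM settings and are assumptions here; `q_sample` uses the standard closed form $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ that follows from composing the forward steps.

```python
import math

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_t for t = 1..T."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, alpha_bar_t, eps):
    """Jump straight from x_0 to x_t given noise eps ~ N(0, I)."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)  # decays from about 1 toward 0
```

By the last step almost no signal remains, which is why sampling can start from pure Gaussian noise.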

### Flow Matching
- **Velocity Field:** $\frac{dx_t}{dt} = v_\theta(x_t, t)$
- **Training Loss:** $\mathcal{L} = \mathbb{E}_{t,x_0,x_1}[||v_\theta(x_t, t) - (x_1 - x_0)||^2]$ where $x_t = (1-t)x_0 + t x_1$, $x_0$ is noise and $x_1$ is data
- **Sampling:** $x_1 = x_0 + \int_0^1 v_\theta(x_t, t) dt$ (ODE solver)
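
A scalar sketch of both halves: building a training pair on the straight-line path $x_t = (1-t)x_0 + t x_1$, and sampling with Euler steps. The velocity function below is a stand-in for a trained $v_\theta$ (an assumption): it is the exact field when every path ends at a single target value.

```python
def interpolate(x0, x1, t):
    """Point on the straight path and its regression target x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0

def euler_sample(velocity, x0, n_steps=100):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed steps."""
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        x = x + velocity(x, k * dt) * dt
    return x

target = 5.0

def velocity(x, t):
    # Exact velocity when all paths end at `target` (stand-in for v_theta)
    return (target - x) / (1.0 - t)

x1 = euler_sample(velocity, x0=-2.0)
```

Because the paths are straight lines, even coarse Euler steps land on the target here; learned fields are only approximately straight, so the step count matters more in practice.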

---

### Information Theory
- **Entropy:** $H(X) = -\sum_x p(x) \log p(x)$
- **Cross-Entropy:** $H(p,q) = -\sum_x p(x) \log q(x)$
- **KL Divergence:** $D_{KL}(p||q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H(p,q) - H(p)$
- **Mutual Information:** $I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)$
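
The identity $D_{KL}(p||q) = H(p,q) - H(p)$ is easy to verify numerically. This sketch uses natural logarithms (nats) and skips terms with $p(x) = 0$, following the convention $0 \log 0 = 0$.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
gap = kl_divergence(p, q) - (cross_entropy(p, q) - entropy(p))  # zero up to rounding
```

KL is not symmetric: `kl_divergence(p, q)` and `kl_divergence(q, p)` differ for these distributions.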

### Linear Algebra Essentials
- **Matrix Inverse:** $AA^{-1} = I$
- **Pseudo-inverse:** $A^+ = (A^T A)^{-1} A^T$ (left), $A^+ = A^T(AA^T)^{-1}$ (right)
- **SVD:** $A = U\Sigma V^T$
- **Eigendecomposition:** $Av = \lambda v$ for each eigenpair; $A = Q\Lambda Q^{-1}$ when $A$ is diagonalizable
- **Determinant:** $\det(AB) = \det(A)\det(B)$
- **Trace:** $\text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA)$

---

## Part 20: Additional Interview Tips

### 20.1 Coding Interview Tips

1. **Start simple:** Implement baseline first, then optimize
2. **Think aloud:** Explain your reasoning
3. **Test your code:** Walk through with examples
4. **Handle edge cases:** Empty inputs, single elements
5. **Know complexity:** Time and space analysis

### 20.2 ML Interview Tips

1. **Know your resume:** Be ready to deep-dive on any project
2. **Understand tradeoffs:** No solution is perfect
3. **Ask clarifying questions:** Requirements, constraints, scale
4. **Start with baselines:** Simple models first
5. **Discuss failure modes:** What could go wrong?

### 20.3 Common Behavioral Questions

1. **Tell me about a challenging ML project**
   - Problem → Approach → Challenges → Results → Learnings

2. **How do you handle disagreements with teammates?**
   - Listen → Understand → Data-driven discussion → Compromise

3. **Describe a time you failed**
   - Situation → What went wrong → What you learned → How you improved

4. **How do you stay current with ML research?**
   - Papers (arXiv), conferences, blogs, implementations

### 20.4 Questions to Ask Interviewers

- What does the ML infrastructure look like?
- How do you handle model deployment and monitoring?
- What's the team structure and collaboration like?
- What are the biggest technical challenges?
- How do you balance research vs production?

---

*End of Interview Preparation Guide*
