Below is a structured set of answers and explanations to the questions posed, based on typical machine-learning fundamentals and the lecture materials you referenced. The answers are grouped according to the order/topics in your prompt.

---

## Part 1: Introduction

### 1. What is Machine Learning (ML)? When is it suitable?
- **Definition:** Machine Learning is a subfield of AI focused on algorithms that learn patterns from data to make predictions or decisions without being explicitly programmed with if–then rules.  
- **When it’s suitable:**  
  - You have data that likely contains hidden patterns or relationships.  
  - You need to make predictions or decisions for new (unseen) data.  
  - A rule-based (hard-coded) approach is too complex or not feasible.  
  - The problem is too large or complex for explicit human-driven coding of rules (e.g., image classification, language translation, recommendation).

### 2. ML Terminology
Common terms you will see:
- **Features / Predictors / Independent variables:** The input variables \(x\).
- **Target / Dependent variable:** The output we want to predict (classification label or regression value).
- **Samples / Instances:** Individual data points in your dataset.
- **Training, validation, test sets:** Splits of the dataset used for model training, hyperparameter tuning, and final performance estimation.
- **Overfitting vs. Underfitting:** Overfitting is learning too many specifics from training data (poor generalization), while underfitting is failing to capture signal in data.

### 3. ML Types
- **Supervised learning:** Labeled data (classification, regression).
- **Unsupervised learning:** No labels provided (clustering, dimensionality reduction).
- **Semi-supervised learning:** Combination of labeled and unlabeled data.
- **Reinforcement learning:** Learning via rewards and penalties in an environment.

---

## ML Fundamentals

### 1. What are four splits of data we have seen so far?
Many courses or resources describe splitting data into:
1. **Training set** – used to fit (train) the model.
2. **Validation set** – used for model selection and hyperparameter tuning (some classes combine validation into cross-validation).
3. **Test set** – used to get an unbiased final estimate of performance after all tuning.  
4. **(Optional) Hold-out set** – an extra set or “external test set” that you keep to truly confirm your final performance, or that’s provided externally (e.g., in a competition).

In practice, you might see:
- Train/Validation/Test
- Or Train/Cross-Validation folds/Test  
- Or repeated cross-validation plus a final test set.

### 2. What are the advantages of cross-validation?
- **More efficient use of data:** Every data point gets used for both training and validation in different folds.
- **Less variance in performance estimates:** Because performance is averaged over multiple folds.
- **Particularly helpful when the dataset is not very large** – you get a more robust estimate than a single train–validation split.

### 3. Why it’s important to look at sub-scores of cross-validation?
- **Identifies instability or variance in the model’s performance:**  
  If your cross-validation scores vary a lot between folds (e.g., some folds have very high accuracy and others very low), it can signal data distribution issues or overfitting in some folds.
- **Gives you insight on how robust the model is across different subsets** of your data.
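
Inspecting the per-fold scores can be sketched as follows. This is a minimal illustration, assuming a synthetic dataset from `make_classification` and an arbitrarily chosen shallow decision tree; neither comes from the original question.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real training set (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
results = cross_validate(model, X, y, cv=5, return_train_score=True)

fold_scores = results["test_score"]  # one validation score per fold
print("per-fold scores:", fold_scores.round(3))
print("mean:", fold_scores.mean().round(3), "std:", fold_scores.std().round(3))
```

A large standard deviation across folds, or a large gap between `train_score` and `test_score`, is the kind of signal the mean alone would hide.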

### 4. What is the fundamental trade-off in supervised machine learning?
- **Bias–Variance trade-off:**  
  - **Bias:** Error from too simplistic a model (underfitting).  
  - **Variance:** Error from a model that’s too complex and overfits the training data (overfitting).  
  We try to balance these.

### 5. What is the Golden Rule in supervised machine learning?
> **Golden Rule**: Never **peek** at your test (or hold-out) data.  
You must not use test data (or anything that simulates your final evaluation set) to make decisions about model selection, hyperparameter tuning, feature engineering, or preprocessing. The test set should be used only once at the very end to measure final performance.

### 6. Scenarios for Data Leakage
- **Leaking future/target information into features:** For example, if a data-processing step used target information or used test-set statistics.  
- **Using test data in any way to fit or choose models:** Even if indirectly (e.g., if you do fit_transform on the entire dataset including test before splitting).
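
One common way to avoid this kind of leakage is to put preprocessing inside a pipeline so it is re-fit on the training portion of every fold. The sketch below assumes synthetic data and an `SVC` model chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Split FIRST, so no statistic is ever computed on the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The scaler is part of the pipeline: in each CV fold it is fit
# on that fold's training portion only.
pipe = make_pipeline(StandardScaler(), SVC())
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# Final fit on the full training set; the test set is only transformed and scored.
pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)
print(cv_scores.mean().round(3), round(test_score, 3))
```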

---

## Pros, Cons, Parameters, and Hyperparameters of Different ML Models

Below is a concise comparison of some common models.

| **Model**         | **Key Parameters / Hyperparameters**                                                                                                                                                   | **Strengths**                                                                                                                                               | **Weaknesses**                                                                                                                                                                   |
|-------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Decision Trees** | - Max depth<br>- Min samples split<br>- Criterion (e.g., Gini, entropy)<br>- Max leaf nodes                                                                                           | - Easy to interpret<br>- Fast to train<br>- Can handle different feature types<br>- No scaling needed                                                         | - Easily overfits without pruning<br>- High variance<br>- Often outperformed by ensembles                                                                                         |
| **k-Nearest Neighbors (kNN)** | - \(k\) (# of neighbors)<br>- Distance metric (e.g., Euclidean)<br>- Weights (uniform vs distance)                                                                              | - Simple and intuitive<br>- Good baseline for smaller data<br>- No explicit training step                                                                     | - Can be slow at prediction time on large data<br>- Must choose \(k\) carefully<br>- Sensitive to scale and outliers                                                               |
| **SVM with RBF Kernel** | - Regularization parameter \(C\)<br>- Kernel width \(\gamma\)<br>- Possibly other kernel parameters                                                                                | - Good performance on many problems if well tuned<br>- Works well in high-dimensional spaces<br>- Good theoretical foundation                                 | - Parameter tuning can be tricky and slow (C, \(\gamma\))<br>- Not straightforward to interpret                                                                                    |
| **Linear Models (e.g., Logistic, Linear Regression, Ridge, Lasso)** | - Regularization strength (\(\alpha\) or \(C\))<br>- Solver type<br>- For logistic: class_weight can matter                                                                                     | - Fast to train, easy to interpret<br>- Works well when linear assumption is valid or approximate<br>- Scales to large data                                   | - Can underfit if relationships are highly nonlinear<br>- Feature engineering often needed to capture complexities                                                                 |
| **Random Forests** | - # of trees<br>- Max depth<br>- Min samples split<br>- Max features per split                                                                                                         | - Often strong out-of-the-box<br>- Handles different data types<br>- Reduces overfitting vs. single trees                                                    | - Can be slower with many trees<br>- Less interpretable than a single tree                                                                                                         |
| **Gradient Boosting** (e.g., XGBoost, LightGBM, CatBoost) | - Learning rate<br>- # of trees<br>- Max depth<br>- Subsampling rates<br>- Regularization parameters                                                                                                  | - State-of-the-art for many structured data problems<br>- Can handle various data types (CatBoost especially for categorical)<br>- Very flexible and powerful | - Many hyperparameters to tune<br>- Prone to overfitting if not tuned properly<br>- Can be slower to train than simpler methods                                                    |
| **Stacking**      | - Choice of base models<br>- Choice of meta-model<br>- Possibly how data is split for out-of-fold predictions                                                                           | - Can sometimes outperform single models or single ensembles<br>- Leverages complementary strengths of multiple algorithms                                    | - More complex pipeline<br>- Harder to interpret<br>- Risk of overfitting if done incorrectly                                                                                     |
| **Averaging**     | - Choice of models<br>- Possibly weighted average approach                                                                                                                              | - Simple way to ensemble different models<br>- Reduces variance                                                                                               | - Gains are not always large<br>- Doesn’t exploit differences in models as effectively as stacking                                                                                 |

---

## Preprocessing

### 1. What are various data preprocessing steps such as scaling, OHE, ordinal encoding, and handling missing values? Why and when each step is necessary?
- **Scaling:**  
  - Standard scaling (subtract mean, divide by std) or MinMax scaling is often necessary for models like kNN, SVM, neural networks, or linear models with regularization.  
  - Helps gradient-based methods converge faster and prevents features with large ranges from dominating distance-based models.
- **One-Hot Encoding (OHE):**  
  - Converts a categorical feature with N categories into N (or N-1 with a drop) binary features.  
  - Needed for linear models and many tree-based models to properly handle nominal categorical variables.
- **Ordinal Encoding:**  
  - Assigns integer codes to categories (0, 1, 2, …).  
  - Useful for tree-based models. For truly *ordered* categories, it captures ordering.  
  - For *non-ordered* categories, ordinal encoding might inject an artificial numeric ordering.
- **Handling Missing Values:**  
  - *SimpleImputer* – common methods: mean, median, constant.  
  - More sophisticated approaches: KNN imputer, MICE, or domain-specific strategies.  
  - Imputing rather than dropping avoids losing entire rows or columns when only a modest share of values is missing.
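
The steps above are often combined in a single column transformer. The following is a sketch on a hypothetical toy DataFrame; the column names (`age`, `city`, `rating`) and their values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],                   # numeric, one missing value
    "city": ["Paris", "Lima", "Paris", "Oslo"],          # nominal
    "rating": ["Poor", "Good", "Excellent", "Good"],     # ordered
})

preprocessor = make_column_transformer(
    # numeric: fill missing values with the median, then standardize
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), ["age"]),
    # nominal: one binary column per city
    (OneHotEncoder(handle_unknown="ignore"), ["city"]),
    # ordered: integer codes that respect the stated order
    (OrdinalEncoder(categories=[["Poor", "Good", "Excellent"]]), ["rating"]),
)

X_t = preprocessor.fit_transform(df)
print(X_t.shape)  # 1 scaled column + 3 city columns + 1 rating column
```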

### 2. sklearn Transformers vs. Estimators
- **Transformers** (like `StandardScaler`, `OneHotEncoder`, `SimpleImputer`) have:
  - A `fit` method (learn statistics or mappings from training data).
  - A `transform` method (apply those learned transformations).  
- **Estimators** (like `LinearRegression`, `RandomForestClassifier`) have:
  - A `fit` method (learn model parameters).
  - A `predict` method (make predictions).  
Often you will see both referred to as “estimators” in sklearn’s unified API, but “transformers” specifically transform data rather than produce a final prediction.

### 3. Can you think of a better way to impute missing values compared to `SimpleImputer`?
- **KNN Imputer:** Impute a missing value by looking at the feature values of its nearest neighbors.  
- **Model-based imputation:** Train a small regression model to predict the missing feature from other features.  
- **Domain-specific strategies:** e.g., if a missing “salary” might reflect no income, you might choose 0. Or if missing data indicates a distinct category, you add a new “Missing” category.
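
To see how neighbor-based imputation can differ from a global mean, here is a small sketch on made-up numbers in which the second column is roughly ten times the first.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data (invented): column 1 is about 10x column 0; one value is missing.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
    [9.0, 90.0],
])

# Global mean of the observed values in column 1.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Mean of column 1 over the 2 rows closest in the observed feature (column 0).
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print("mean imputation:", mean_filled[2, 1])
print("kNN imputation: ", knn_filled[2, 1])
```

The kNN estimate borrows from the rows with similar values in the other feature, so it stays close to the local pattern, while the global mean is pulled toward the large values.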

### 4. One-Hot Encoding Arguments
- **`handle_unknown="ignore"`**:  
  - If a new category appears at inference time that the encoder never saw in training, it ignores it (instead of raising an error). The new category’s row is encoded as all zeros for that feature. 
- **`sparse_output=False`** (named `sparse=False` before scikit-learn 1.2; the old name was removed in 1.4):  
  - Output an array in dense format rather than a sparse matrix. Good for small to medium feature spaces.  
- **`drop="if_binary"`**:  
  - If the categorical feature has exactly 2 categories, drop one column to avoid perfect collinearity (i.e., you’ll only get 1 column for that feature, 0 or 1).
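
The effect of `handle_unknown="ignore"` can be checked on a tiny made-up example. The sketch calls `.toarray()` on the default sparse output so that it does not depend on which scikit-learn version names the dense-output argument.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["red"], ["green"], ["blue"]])
test = np.array([["green"], ["purple"]])  # "purple" never appears in training

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

# Columns follow the sorted categories: blue, green, red.
encoded = ohe.transform(test).toarray()
print(ohe.categories_[0])
print(encoded)  # the unseen "purple" row is all zeros
```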

### 5. How do you deal with categorical features with only two possible categories?
- If you’re using `OneHotEncoder` with `drop="if_binary"`, you will get a single 0/1 column automatically. Alternatively, you could manually encode it as 0/1 yourself (basically the same effect).  
- If using ordinal encoding, you could encode categories as 0 and 1.

### 6. Ordinal Encoding
- **Difference vs. OHE:** OHE expands a categorical variable into multiple binary columns; ordinal encoding replaces categories with numeric codes (0, 1, 2, …).
- **What if we don’t order the categories?** By default, `OrdinalEncoder` assigns codes in sorted (alphabetical) order, so “Excellent”, “Good”, “Poor” would become 0, 1, 2. For truly ordinal data, pass the intended order through the `categories` argument. For nominal categories with no natural ordering, ordinal encoding imposes an arbitrary numeric order (which may or may not affect some models).  
- **Does it matter if we order ascending vs. descending?** Usually not, as long as the order is consistent: reversing it flips the sign of a linear model’s coefficient and mirrors a tree’s split thresholds. What does mislead linear or distance-based models is an *arbitrary* order, because they treat the codes as magnitudes.  
- **Unknown categories at test time?** `OrdinalEncoder` raises an error by default if it encounters a category it didn’t see in training. In scikit-learn 0.24 and later you can set `handle_unknown="use_encoded_value"` together with `unknown_value` (e.g., -1), or filter out unknown categories beforehand. If an unknown category does show up (e.g., “super poor”), you must decide how to encode it (perhaps as a default code like -1 or the highest code).
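
A short sketch of these points, using the rating labels from the text. The choice of `-1` for unknown categories is an assumption made for illustration, not a recommendation.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train = np.array([["Good"], ["Poor"], ["Excellent"]], dtype=object)
test = np.array([["Excellent"], ["super poor"]], dtype=object)

# Default behaviour: categories are sorted alphabetically.
default_oe = OrdinalEncoder().fit(train)
print(default_oe.categories_[0])

# Explicit order plus a fallback code for categories unseen in training.
ordered_oe = OrdinalEncoder(
    categories=[["Poor", "Good", "Excellent"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # illustrative choice
).fit(train)

codes = ordered_oe.transform(test)
print(codes.ravel())  # "Excellent" gets the top code, "super poor" gets -1
```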

---

## OHE vs. Ordinal Encoding Example

### “Since `enjoy_course` feature is binary, you decide to apply `drop="if_binary"`. Your friend decides to apply ordinal encoding. Will it make any difference?”

In your example:

```python
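import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# grades_df is the DataFrame from the question; it must contain a binary `enjoy_course` column.
# sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False).
ohe = OneHotEncoder(drop="if_binary", sparse_output=False)
ohe_encoded = ohe.fit_transform(grades_df[["enjoy_course"]]).ravel()

oe = OrdinalEncoder()
oe_encoded = oe.fit_transform(grades_df[["enjoy_course"]]).ravel()

# Put both encodings side by side for comparison.
data = {
    "oe_encoded": oe_encoded,
    "ohe_encoded": ohe_encoded,
}
pd.DataFrame(data)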
```

Results:
```
    oe_encoded  ohe_encoded
0         1.0         1.0
1         1.0         1.0
2         1.0         1.0
3         0.0         0.0
...
```

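- The two encodings are identical when `enjoy_course` is strictly binary: both encoders sort the categories, `OrdinalEncoder` maps the first to 0 and the second to 1, and `OneHotEncoder(drop="if_binary")` drops the first category’s column and keeps the indicator for the second. Both methods yield a single 0/1 column, so there’s **no real difference** in the final numeric representation.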

---

## When is it OK to break the Golden Rule?

Generally, you **shouldn’t** break it. Cases that are sometimes considered acceptable:
- **Transformations that learn nothing from the data:** fixed, domain-based steps (e.g., standardizing text against a known general dictionary) can be applied to all data, because no statistics are estimated from the test set.
- **To create baseline or synthetic examples**, as long as no performance claim depends on the test set.
- **Retraining for deployment:** after the final test score has been reported, some industry workflows re-fit the chosen pipeline on all available data. Re-fitting with knowledge of the test data *before* reporting performance inflates the reported score and is a methodological mistake.

---

## Large Categorical Columns

### What are possible ways to deal with categorical columns with a large number of categories?

- **Leave One Out (LOO) Encoding / Target Encoding:** Encode categories based on average target value.  
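- **Hashing trick:** Use a hash function to map categories to a fixed number of columns (e.g., sklearn’s `FeatureHasher`; `HashingVectorizer` plays the same role for text, whereas `CountVectorizer` itself builds an explicit vocabulary rather than hashing).  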
- **Domain knowledge grouping:** Combine rare categories or group them if they are functionally similar.  
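
Combining rare categories can be sketched in a few lines of plain Python. The helper `group_rare`, its `min_count` threshold, and the `"Other"` label are hypothetical names introduced here for illustration.

```python
from collections import Counter

def group_rare(values, min_count=2, other_label="Other"):
    """Replace categories seen fewer than `min_count` times with one label.

    In a real pipeline the counts should come from the training data only.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

cities = ["Paris", "Lima", "Paris", "Oslo", "Lima", "Quito"]
grouped = group_rare(cities)
print(grouped)  # "Oslo" and "Quito" each occur once, so both become "Other"
```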

---

## Excluding Features

### In what scenarios you’ll not include a feature in your model even if it’s a good predictor?

- **Ethical or legal constraints:** e.g., protected attributes (race, gender) that cause fairness or legal concerns.  
- **Data collection cost or practicality:** The feature is expensive or time-consuming to obtain.  
- **Potential for leakage:** If the feature is suspiciously predictive only because it leaks future or target info.  
- **Simplicity or interpretability reasons:** If you want a simpler model and the gain from that feature is tiny.

---

## CountVectorizer on the Test Data

### What’s the problem with calling `fit_transform` on the test data in the context of `CountVectorizer`?

- **Mismatched features**: `fit_transform` learns a new vocabulary from the test set and discards the one learned from training. The resulting columns no longer correspond to the columns the model was trained on (different words, often a different column count), so the model either fails or produces meaningless predictions.  
- **Golden Rule violation**: Test data should only be *transformed* with the vocabulary learned from the training data; fitting on it lets test information shape the representation.

**Correct approach**: Call `fit_transform` on the training data and only `transform` on the test data.
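
The difference is easy to see on a couple of made-up documents. This is a minimal sketch; the example sentences are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free money now", "meeting at noon"]
test_docs = ["free lunch at noon"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)  # learns the vocabulary from training text
X_test = vec.transform(test_docs)        # reuses it; the unseen word "lunch" is dropped

# Wrong: re-fitting on the test set builds a different vocabulary,
# so the columns no longer line up with what the model was trained on.
wrong = CountVectorizer().fit_transform(test_docs)

print(X_train.shape, X_test.shape, wrong.shape)
```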

### Do we need to scale after applying bag-of-words representation?
- Usually **no** for text classification tasks with typical bag-of-words frequencies or TF-IDF vectors.  
- If you feed BOW or TF-IDF vectors to a distance-based model like kNN, you might consider normalizing the vectors (e.g., L2 norm) rather than standard scaling.  
- For linear models or random forests, scaling the BOW features is not typically necessary.

---

## Hyperparameter Optimization

### 1. What makes hyperparameter optimization a hard problem?

- **High dimensional search space:** Many models have multiple hyperparameters (e.g., # of trees, max_depth, learning_rate).  
- **Potentially expensive to evaluate:** Each combination might require re-fitting the model, which is computationally costly.  
- **Hyperparameters can interact:** The best value for one may depend on another.

### 2. Two different tools provided by sklearn for hyperparameter optimization
1. **`GridSearchCV`** – an exhaustive search over specified parameter grids.  
2. **`RandomizedSearchCV`** – tries random combinations from a distribution over hyperparameters.  

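(Others include `HalvingGridSearchCV` and `HalvingRandomSearchCV`, which are still experimental and must be enabled with `from sklearn.experimental import enable_halving_search_cv`, plus scikit-optimize or other external libraries for Bayesian optimization.)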
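
A side-by-side sketch of the two main tools, assuming synthetic data, an `SVC` model, and arbitrarily chosen search ranges (none of these come from the original question):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Grid search: every combination of the listed values (3 x 3 = 9 candidates).
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=3)
grid.fit(X, y)

# Random search: 9 candidates sampled from continuous log-uniform ranges.
rand = RandomizedSearchCV(
    SVC(),
    {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
    n_iter=9,
    cv=3,
    random_state=0,
)
rand.fit(X, y)

print("grid:  ", grid.best_params_, round(grid.best_score_, 3))
print("random:", rand.best_params_, round(rand.best_score_, 3))
```

Both searches fit the same number of candidates here; the random search spends that budget on values the grid would never try.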

### 3. What is optimization bias?
- **Definition**: The bias introduced by searching over many models/hyperparameters and picking the best one that happens to fit random noise in the data.  
- **Effect**: Overestimates your model’s true performance if you continuously tune and retest on the same validation set or small data.
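
Optimization bias can be demonstrated with a small simulation. The setup below is an assumption chosen to isolate the effect: every "model" guesses at random, so its true accuracy is 0.5, and the only thing that varies is luck on a small validation set.

```python
import numpy as np

rng = np.random.default_rng(0)
n_val, n_test, n_models = 50, 10_000, 1_000

y_val = rng.integers(0, 2, size=n_val)
y_test = rng.integers(0, 2, size=n_test)

# Each row holds one random-guessing "model"'s predictions on the validation set.
val_preds = rng.integers(0, 2, size=(n_models, n_val))
val_acc = (val_preds == y_val).mean(axis=1)

# Pick the model that looks best on validation ...
best_val_acc = val_acc.max()

# ... then evaluate it on fresh data. It is still a random guesser,
# so its test predictions are new random guesses.
test_pred_best = rng.integers(0, 2, size=n_test)
test_acc_of_best = (test_pred_best == y_test).mean()

print("best validation accuracy:", best_val_acc)
print("test accuracy of that model:", round(test_acc_of_best, 3))
```

The winning validation score sits well above 0.5 purely because the best of many noisy scores was selected, while the test score falls back to chance.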

---

### Method Comparisons

| **Method**           | **Strengths/Weaknesses**                                                                | **When to use?**                                                                                                                              |
|----------------------|------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| **Nested for loops** | - Basic approach to try multiple values of multiple hyperparameters.<br>- Possibly slow. | - For small parameter grids or if you’re learning the concept.                                                                                 |
| **Grid Search**      | - Exhaustive search can find best combination in a defined grid.<br>- Quickly gets large (curse of dimensionality).  | - If you have a small set of discrete hyperparameter values and enough compute.                                                                |
| **Random Search**    | - More efficient if you suspect many parameters have lesser importance.<br>- Doesn’t guarantee thorough coverage of all ranges.  | - If your hyperparameters are continuous or large-range, and you want a better chance of finding good combos within limited time.              |

---

## Evaluation Metrics

### 1. Why accuracy is not always enough?
- **Class imbalance:** Accuracy can be very high even if the model ignores the minority class.  
- **Different costs of misclassification:** E.g., in medical diagnosis, false negatives might be critical.
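
A quick worked example of the class-imbalance problem, using invented labels with 1% positives and a model that always predicts the majority class:

```python
# 10 positives out of 1000 samples (made-up numbers for illustration).
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # always predict the majority (negative) class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)

print(accuracy)  # high, even though no positive case is ever found
print(recall)
```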

### 2. Why it’s useful to get prediction probabilities?
- **Allows for different thresholds:** We can trade off precision vs. recall depending on the problem.  
- **Calibration:** Helps measure how confident the model is in its predictions.  
- **Ranking scenarios:** e.g., where you want to sort by “likelihood of being positive.”

### 3. In what scenarios do you care more about precision or recall?
- **High precision**: You want fewer false positives (e.g., an alarm system that should rarely cry wolf).  
- **High recall**: You want fewer false negatives (e.g., medical tests that should flag any possible disease cases).

### 4. What’s the main difference between AP score (Average Precision) and F1 score?
- **AP Score** summarizes the whole precision–recall curve: it is the weighted mean of the precision at each threshold, weighted by the increase in recall, which approximates the area under the curve.  
- **F1 Score** is a single precision/recall harmonic mean at **one** threshold.  
- AP accounts for model performance across many decision thresholds, while F1 is at a specific threshold.
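
The threshold dependence of precision, recall, and F1 can be made concrete with plain Python. The labels and probabilities below are invented, and `precision_recall_f1` is a helper written for this sketch.

```python
def precision_recall_f1(y_true, probs, threshold):
    """Precision, recall and F1 when predicting positive for probs >= threshold."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.55, 0.4, 0.35, 0.2, 0.1]

results = {t: precision_recall_f1(y_true, probs, t) for t in (0.3, 0.5, 0.7)}
for t, (p, r, f1) in results.items():
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Lowering the threshold raises recall at the cost of precision, and F1 changes with it. AP is computed from the same probabilities across all thresholds, so it does not depend on any single choice.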

### 5. What are advantages of RMSE or MAPE over MSE?
- **RMSE**:  
  - Has the same units as the target.  
  - More interpretable than MSE’s squared units.  
  - Punishes large errors more heavily (just like MSE) but remains in the original scale.  
- **MAPE**:  
  - Interpreted as average percent error (easy to interpret especially if target scales vary widely).  
  - Downside is it can blow up if targets can be zero or close to zero.
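
The three metrics computed by hand on made-up prices (in dollars) show the difference in units. `regression_errors` is a helper written for this sketch.

```python
import math

def regression_errors(y_true, y_pred):
    """Return (MSE, RMSE, MAPE in percent). Assumes no true value is zero."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(mse)
    mape = 100 / n * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return mse, rmse, mape

y_true = [100.0, 200.0, 400.0]  # dollars
y_pred = [110.0, 180.0, 400.0]

mse, rmse, mape = regression_errors(y_true, y_pred)
print(f"MSE  = {mse:.2f} (dollars squared)")
print(f"RMSE = {rmse:.2f} (dollars)")
print(f"MAPE = {mape:.2f} (percent)")
```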

---

### Classification Metrics Summary

| **Metric** | **How to generate/calculate?**                                                                                                                       | **When to use?**                                                                                                            |
|------------|------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
| **Accuracy**   | \(\frac{\text{number of correct predictions}}{\text{total number of samples}}\)                                                                   | - Balanced classes, general measure.<br>- Not suitable if large class imbalance.                                             |
| **Precision**  | \(\frac{TP}{TP + FP}\)                                                                                                                           | - When false positives are costly (e.g., spam detection).                                                                    |
| **Recall**     | \(\frac{TP}{TP + FN}\)                                                                                                                           | - When false negatives are costly (e.g., disease detection).                                                                 |
| **F1-score**   | Harmonic mean of precision and recall = \(2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}\)             | - Balanced metric if you care about precision and recall equally.                                                            |
| **AP score**   | Area under the Precision–Recall curve over thresholds.                                                                                           | - If classes are highly imbalanced or if you specifically want to track performance across various operating thresholds.     |
| **AUC**        | Area under the ROC curve (plot of TPR vs. FPR).                                                                                                  | - When you want to summarize the trade-off between TPR and FPR at all thresholds.                                           |

### Regression Metrics Summary

| **Metric** | **How to generate/calculate?**                                    | **When to use?**                                                        |
|------------|-------------------------------------------------------------------|--------------------------------------------------------------------------|
| **MSE**    | Mean Squared Error = \(\frac{1}{n}\sum (y - \hat{y})^2\)          | - Default for many problems. Easy to optimize.                          |
| **RMSE**   | \(\sqrt{\mathrm{MSE}}\)                                           | - More interpretable scale than MSE; punishes large errors.             |
| **\(R^2\)**| \(1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}\)         | - Measures fraction of variance explained by the model.                 |
| **MAPE**   | Mean Absolute Percentage Error = \(\frac{100\%}{n}\sum \left\lvert\frac{y - \hat{y}}{y}\right\rvert\) | - If you care about percentage errors and targets are nonzero.          |

---

## Final Key Takeaways

- **Data Splits & The Golden Rule:** Always keep the test set out of all decisions. Use cross-validation on the training set for hyperparameter tuning.  
- **Preprocessing:** Carefully transform training data, then apply the *same transformation parameters* to the test set.  
- **Model Choice & Trade-Offs:** No single model is best for all tasks; choose based on data size, interpretability needs, computational constraints, and domain context.  
- **Hyperparameter Search:** Grid search vs. random search vs. advanced methods; be aware of optimization bias.  
- **Evaluation:** Match metrics to problem constraints (imbalance, cost of errors, interpretability) and interpret performance in context.

Hopefully these address each question and provide a concise summary of fundamental ML concepts, best practices, and trade-offs.

Below is a structured set of answers spanning the various topics (Ensembles, Feature Importances, Feature Engineering & Selection, Clustering, Recommender Systems, Intro to NLP, Multi-class Classification & Computer Vision, Time Series, and Survival Analysis). These answers are informed by commonly taught material in an applied machine learning course and the provided lecture resources.

---

## Part 1: Ensembles

### 1. How does a random forest model inject randomness in the model?

Random Forests (RFs) create an ensemble of decision trees, each trained on a slightly different subset of the data:
1. **Row sampling (bootstrap sampling)** – Each tree is trained on a bootstrapped sample (sampled with replacement) of the training dataset.  
2. **Column (feature) sampling** – When splitting a node, each tree considers only a random subset of all features.  

These two sources of randomness produce decorrelated trees whose predictions can be averaged.  
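
The two sampling steps can be sketched with NumPy alone. This only illustrates the sampling, not tree fitting; the sizes and the square-root rule for the feature subset are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 9
max_features = int(np.sqrt(n_features))  # a common choice for classification

# 1. Bootstrap: draw n_samples row indices WITH replacement (one draw per tree).
row_idx = rng.integers(0, n_samples, size=n_samples)

# 2. Feature subset for one split: drawn WITHOUT replacement (redrawn at every split).
feat_idx = rng.choice(n_features, size=max_features, replace=False)

# A bootstrap sample contains only about 63% of the distinct rows on average.
unique_fraction = len(np.unique(row_idx)) / n_samples
print("distinct rows in bootstrap sample:", unique_fraction)
print("features considered at this split:", sorted(feat_idx.tolist()))
```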

### 2. What’s the difference between random forests and gradient boosted trees?

| **Aspect**                   | **Random Forest**                                                                  | **Gradient Boosted Trees**                                                 |
|-----------------------------|------------------------------------------------------------------------------------|----------------------------------------------------------------------------|
| **Main Idea**               | Parallel ensemble of decision trees, each trained independently.                   | Sequential ensemble that builds new trees to correct errors of the previous ensemble.  |
| **Training Procedure**      | Trees can be trained in parallel on bootstrapped samples; final prediction is majority vote (classification) or average (regression). | Trees added **one at a time**, each fit to reduce the remaining error of the current ensemble (residuals for regression, the gradient of the loss more generally). |
| **Strengths**               | Tends to have good performance out of the box, less hyperparameter tuning needed, robust to outliers. | Often **best-in-class** predictive performance on tabular data if well tuned.           |
| **Weaknesses**              | May require many trees for best performance, can be slower to predict.            | More hyperparameters, can overfit if not tuned carefully.                                |
| **When to use**             | Quick, robust baseline ensemble model.                                             | If you want potentially higher accuracy (with more tuning).                              |

### 3. Why do we need averaging or stacking?

- **Averaging** (a simple ensemble) or **stacking** (a more sophisticated ensemble) can improve predictive performance by **combining multiple models**:
  - Reduces variance (especially if models are not too correlated).
  - Sometimes gains in performance if different models capture different aspects of the data.

### 4. What are the benefits of stacking over averaging?

- **Stacking** uses a “meta-model” (or “second-level model”) trained on the out-of-fold predictions of the first-level models. It can learn *how* best to combine the different models’ predictions.
- This can outperform simple averaging or voting, because the meta-model can learn to weight or combine the base models in a more sophisticated, data-driven way.
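
Both ensembling styles are available in scikit-learn. The sketch below assumes synthetic data and two arbitrarily chosen base models; it shows the mechanics rather than a performance comparison.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

base = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier()),
]

# Averaging: soft voting takes the mean of the predicted probabilities.
averaging = VotingClassifier(base, voting="soft").fit(X, y)

# Stacking: a logistic regression is trained on out-of-fold predictions
# of the base models and learns how much to trust each one.
stacking = StackingClassifier(base, final_estimator=LogisticRegression(), cv=5).fit(X, y)

meta_weights = stacking.final_estimator_.coef_
print("learned weights for (rf, knn):", meta_weights.round(2))
```

With averaging, every base model gets the same weight; the stacked meta-model's coefficients show the data-driven weighting described above.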

---

## Feature Importances

### 1. What are the limitations of looking at simple correlations between features and targets?

- **Linear focus**: Correlation only captures linear relationships; many real-world relationships are nonlinear.  
- **Ignores interactions**: Features may only matter in conjunction with others (nonlinear or interaction effects).  
- **Omitted variable bias**: Correlation might be driven by another unobserved feature or confounder.  
- **Correlation vs. causation**: Correlation doesn’t imply causation, and under confounding a high correlation doesn’t even guarantee direct predictive power.

### 2. How can you get feature importances for non-linear models?

- **Tree-based** (e.g., Random Forest, Gradient Boosting): Often provide built-in feature importances based on split gains or impurity reduction.  
- **Permutation Importance**: Shuffle the values of a single feature and measure how much worse the model’s performance becomes. This is model-agnostic.  
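- **SHAP Values** (SHapley Additive exPlanations): Provide local (per-prediction) and, when aggregated, global explanations. The approach is model-agnostic in general, with faster specialized explainers for tree-based and deep models.  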
- **Partial Dependence Plots / ALE plots**: Show how changing a feature (while holding others constant) affects predictions.
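
Permutation importance is available as `sklearn.inspection.permutation_importance`. The sketch assumes synthetic data in which only the first two features carry signal (`shuffle=False` keeps them in columns 0 and 1).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Columns 0 and 1 are informative; columns 2-5 are pure noise.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle one column at a time on held-out data and record the drop in score.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature {i}: {mean:.3f} +/- {std:.3f}")
```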

### 3. What might you need to explain a single prediction?

- **Local explanation methods**:
  - **SHAP** or **LIME**: Provide instance-specific explanations, showing which features contributed the most to a single prediction.  
  - **Counterfactual explanations**: Show how changing certain features would alter the outcome for an individual data point.

---

## Feature Engineering and Selection

### 1. What’s the difference between feature engineering and feature selection?

- **Feature engineering**: Creating or transforming features (e.g., extracting new columns from existing data, combining features, encoding dates/times, polynomial features).
- **Feature selection**: Choosing which features to keep vs. remove (to improve generalization, reduce overfitting, or improve interpretability).

### 2. Why do we need feature selection?

- **Reduce Overfitting**: Fewer features can help avoid spurious patterns.  
- **Improve Model Interpretability**: Simpler models with fewer features are easier to explain.  
- **Decrease Training Time**: Fewer features → faster training.  
- **Eliminate Redundant or Noisy Features**: Keeping them adds complexity without improving predictive performance.

### 3. What are the three possible ways we looked at for feature selection?

1. **Filter methods**: Select features based on statistical tests or heuristics (e.g., correlation thresholds, mutual information).  
2. **Wrapper methods**: Evaluate subsets of features by training a model (e.g., RFE – Recursive Feature Elimination, or forward/backward selection).  
3. **Embedded methods**: Use models that inherently rank features or shrink coefficients (e.g., Lasso’s zero coefficients, tree-based feature importances).
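
The three families can be sketched with scikit-learn as follows. This is one possible instantiation, not the only one: the synthetic regression data, `k=3`, and `alpha=1.0` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 10 features, only 3 of which influence the target
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=1.0, random_state=0
)

# 1. Filter: score each feature independently with a univariate F-test
filter_mask = SelectKBest(score_func=f_regression, k=3).fit(X, y).get_support()

# 2. Wrapper: repeatedly fit a model and drop the weakest feature (RFE)
wrapper_mask = RFE(LinearRegression(), n_features_to_select=3).fit(X, y).get_support()

# 3. Embedded: keep features whose Lasso coefficient is not shrunk to zero
embedded_mask = SelectFromModel(Lasso(alpha=1.0)).fit(X, y).get_support()

print("filter:  ", filter_mask)
print("wrapper: ", wrapper_mask)
print("embedded:", embedded_mask)
```

Each mask is a boolean array over the 10 columns, so the selections can be compared directly.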

---

## Part 2: Clustering

### 1. Why clustering and what is the problem of clustering?

- **Why**: Group similar data points together when we have *no labels*. Used for:  
  - Exploratory data analysis, dimensionality reduction, customer segmentation, image segmentation, etc.  
- **The problem**: We need to define a measure of similarity and the number or structure of clusters. Clustering is inherently ambiguous without labels—multiple “valid” clusterings may exist.

### 2. Compare and contrast different clustering methods

Common clustering methods:

| **Method**        | **Key Concept**                                 | **Pros**                                                       | **Cons**                                                                    |
|-------------------|------------------------------------------------|----------------------------------------------------------------|-----------------------------------------------------------------------------|
| **K-Means**       | Clusters around centroids (means).             | - Fast, works well on large data<br>- Easy to interpret        | - Assumes spherical clusters<br>- Must choose K<br>- Sensitive to outliers  |
| **Hierarchical**  | Builds a tree of clusters (agglomerative or divisive). | - No need to pre-specify # clusters (though you must cut the dendrogram somewhere)<br>- Can reveal multi-level structure | - Scalability issues for large data<br>- Choosing where to “cut” the dendrogram can be subjective |
| **DBSCAN**        | Density-based, finds core samples and expands clusters. | - Finds irregular-shaped clusters<br>- Robust to outliers<br>- Doesn’t require K   | - Needs parameters \(\epsilon\) and `min_samples` chosen carefully<br>- Doesn’t handle varying densities well |
| **Spectral**      | Use graph-based approach and eigenvectors of similarity matrix.  | - Can handle complex cluster shapes<br>- Flexible with similarity metrics | - Constructing similarity matrix can be expensive<br>- Must choose # of clusters |
| **Gaussian Mixture Models** | Probabilistic approach assuming data from mixture of Gaussians. | - Provides soft cluster assignments<br>- More flexible than K-Means for shapes.   | - Can get stuck in local minima<br>- Need to choose # components (clusters). |
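
To make the "spherical clusters" limitation concrete, here is a small sketch comparing K-Means and DBSCAN on the two-moons toy dataset. The dataset and the values `eps=0.2` and `min_samples=5` are assumptions picked for this example; the true labels are used only to score the result.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: non-spherical clusters
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand Index: 1.0 means perfect agreement with the true grouping
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

K-Means splits the plane with a straight boundary between two centroids, while DBSCAN follows the dense curved regions.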

### 3. What’s the difficulty in evaluation of clustering? How do we evaluate clusters?

- **No universal ground truth** (unsupervised).  
- **Internal metrics** (e.g., Silhouette score, Calinski-Harabasz, Davies-Bouldin) measure compactness, separation, etc.  
- **External metrics** (ARI, NMI) require known labels to compare.  
- **Subjectivity** – “Correctness” depends on the application and domain knowledge.

### 4. Scenario-based Advice

| **Scenario**                                    | **Which clustering method might be best?**                          | **Reasoning**                                                                                             |
|-------------------------------------------------|---------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
| **Well-separated spherical clusters**           | K-Means or GMM                                                      | They do well if clusters are roughly spherical & well separated.                                          |
| **Large datasets**                              | K-Means, Mini-Batch K-Means                                        | K-Means scales well. Mini-batch variant is even more scalable.                                            |
| **Flexibility with cluster shapes**             | DBSCAN or Hierarchical or Spectral                                  | DBSCAN can capture arbitrary shapes as long as cluster densities are similar. Hierarchical or Spectral can also handle complex shapes. |
| **Small to medium datasets**                    | Hierarchical, DBSCAN, or K-Means depending on shape assumption.     | Less worried about scaling issues, hierarchical can provide a nice dendrogram, DBSCAN for complex shapes.  |
| **Prior knowledge on # clusters**              | K-Means, GMM, or Spectral (since you specify K)                     | These require you to specify the number of clusters.                                                       |
| **Clusters roughly of equal size**              | K-Means                                                             | Works best when clusters are of similar “size/variance.”                                                  |
| **Irregularly shaped clusters**                 | DBSCAN, Hierarchical, or Spectral                                   | K-Means is poor if shapes are elongated or irregular. DBSCAN can handle weird shapes if you tune \(\epsilon\). |
| **Clusters with different densities**           | Hierarchical, or maybe a variant of DBSCAN (e.g., HDBSCAN)          | Vanilla DBSCAN can have difficulty if densities vary widely.                                              |
| **Datasets with hierarchical relationships**    | Hierarchical (Agglomerative)                                        | Natural to represent in a dendrogram.                                                                      |
| **No prior knowledge on # clusters**            | DBSCAN or Hierarchical                                              | They don’t require specifying K.                                                                           |
| **Noise and outliers**                          | DBSCAN                                                              | It labels outliers as noise points.                                                                        |

---

### Which clustering method would you use in each scenario? How to represent data?

1. **Scenario 1: Customer segmentation in retail**  
   - **Likely method**: K-Means (common for segmentation if you assume “spherical” groupings) or GMM if you want soft cluster assignments.  
   - **Data representation**: Possibly numeric features: purchase frequency, total spend, product categories visited, demographics. Typically you standardize or scale.

2. **Scenario 2: An environmental study aiming to identify clusters of a rare plant species**  
   - **Likely method**: DBSCAN if you suspect irregular cluster shapes in geospatial data. DBSCAN also helps to find outliers (rare/unusual points).  
   - **Data representation**: Latitude/longitude coordinates, environmental variables. Possibly transform lat/long to a projected coordinate system or use geodesic distances.

3. **Scenario 3: Clustering furniture items for inventory management & recommendations**  
   - **Likely method**: K-Means or Hierarchical, depending on scale. If you have a moderate number of product types, hierarchical might help. If large scale, K-Means.  
   - **Data representation**: Possibly use numeric features: dimensions, price, style vectors, or text embeddings from product descriptions.

### How to decide the number of clusters?
- **Elbow method** (plot SSE vs. K).  
- **Silhouette scores** (pick K with best average silhouette).  
- **Domain knowledge** – e.g., business constraints.  
- **Hierarchical dendrogram** – visually inspect.
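
A minimal sketch of the elbow and silhouette approaches, assuming synthetic data with three well-separated blobs (the centers and spread are made up for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.8,
    random_state=0,
)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                         # SSE, for the elbow plot
    silhouettes[k] = silhouette_score(X, km.labels_)  # higher is better

best_k = max(silhouettes, key=silhouettes.get)
print("inertia by K:   ", {k: round(v, 1) for k, v in inertias.items()})
print("silhouette by K:", {k: round(v, 3) for k, v in silhouettes.items()})
print("K with best silhouette:", best_k)
```

On real data the elbow is often less sharp and the silhouette curve flatter, which is why domain knowledge usually has the final say.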

### (Reiteration) What’s the difficulty in evaluation of clustering? How do we evaluate clusters?
- **No labels** for ground truth.  
- We use **internal clustering metrics** (Silhouette, etc.) or domain-driven validation.  
- Often a trial-and-error approach plus domain expertise.

---

## Recommender Systems

### 1. What’s the utility matrix?

- A matrix \(R\) where rows = users, columns = items, and entries = user-item interaction (ratings, clicks, etc.).  
- Typically very sparse (not all users rate all items).

### 2. How do we evaluate recommender systems?

- **Hold-out** or **cross-validation** approach: split known user-item interactions into train vs. validation/test.  
- Evaluate predictions for withheld user-item pairs using metrics like:
  - **RMSE, MAE** on predicted ratings (if numerical).
  - **Precision@k, Recall@k, MAP, nDCG** if ranking top items is key.

### 3. What are the baseline models we talked about?

- **Global average**: Predict the same average rating for all user-item pairs.  
- **Per user average**: Predict every missing rating for a user as that user’s average observed rating.  
- **Per item average**: Predict every missing rating for an item as that item’s average observed rating.  

These are naive baselines to compare more sophisticated models against.
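
The three baselines reduce to NaN-aware averages over the utility matrix. The small matrix below is hypothetical, with `NaN` marking unrated user-item pairs.

```python
import numpy as np

# Hypothetical utility matrix: rows = users, columns = items, NaN = not rated
R = np.array([
    [5.0, 4.0, np.nan],
    [3.0, np.nan, 2.0],
    [np.nan, 1.0, 1.0],
])

global_avg = np.nanmean(R)        # one number for every user-item pair
user_avg = np.nanmean(R, axis=1)  # one number per user (row)
item_avg = np.nanmean(R, axis=0)  # one number per item (column)

# Three baseline predictions for the missing rating of user 0 on item 2
user, item = 0, 2
print("global baseline:  ", global_avg)
print("per-user baseline:", user_avg[user])
print("per-item baseline:", item_avg[item])
```

In practice these averages would be computed on the training portion only and scored (e.g., with RMSE) on held-out ratings.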

### 4. Compare and contrast KNN Imputer vs. content-based filtering

- **KNN Imputer** for recommendation:
  - Looks for “similar” users or items in the utility matrix and fills in missing ratings with average of neighbors.
  - Purely collaborative filtering approach if you do user–user or item–item similarity.
- **Content-based filtering**:
  - Uses item metadata (features) and/or user features.  
  - Recommends items similar to what the user liked in the past, based on item content (e.g., genre, description, embeddings).
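
The contrast can be sketched on a toy example. The ratings matrix, the two-genre item features, and the "liked means rating of 4 or more" rule are all assumptions for illustration. `KNNImputer` uses only the ratings, while the content-based score uses only item features plus the target user's own history.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: rows = users, columns = items, NaN = not rated
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [5.0, 4.0, 2.0, 1.0],
    [1.0, 2.0, 5.0, np.nan],
    [1.0, 1.0, 5.0, 4.0],
])

# Collaborative: fill each gap from the most similar user who rated that item
R_filled = KNNImputer(n_neighbors=1).fit_transform(R)
print("KNN-imputed rating of user 0 for item 2:", R_filled[0, 2])

# Content-based: hypothetical item features, columns = [action, romance]
item_features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
liked = np.nan_to_num(R[0]) >= 4                  # items user 0 rated highly
profile = item_features[liked].mean(axis=0)       # user 0's taste profile
content_scores = cosine_similarity(profile.reshape(1, -1), item_features)[0]
print("content-based scores for user 0:", content_scores)
```

Both approaches agree here that item 2 is a poor match for user 0, but they reach that conclusion from different information, so content-based filtering still works for a brand-new item with no ratings.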

### 5. Ethical issues associated with recommender systems

- **Filter bubbles**: Reinforces existing biases or preferences, limiting exposure to diverse content.  
- **Privacy**: User preference data can be sensitive.  
- **Fairness**: Some items or creators may be under-represented or systematically disadvantaged by the algorithm.  
- **Manipulation**: Systems can be gamed to push certain items.

---

## Introduction to NLP

### 1. Embeddings

**What are different document and word representations we talked about?**  
- **One-hot encoding** for words – too sparse and high dimensional for larger vocabularies.  
- **Bag-of-words** for documents – counts or TF-IDF.  
- **Word embeddings** (e.g., Word2Vec, GloVe) – dense, low-dimensional vectors for words.  
- **Document embeddings** (averaging word embeddings, or using specialized doc2vec).  
- **Contextual embeddings** (e.g., BERT, GPT) – produce word vectors that depend on context.
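
For the bag-of-words and TF-IDF representations, a minimal sketch with scikit-learn looks like this (the three toy documents are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Bag-of-words: one column per vocabulary word, entries are raw counts
bow = CountVectorizer()
X_counts = bow.fit_transform(docs)

# TF-IDF: same columns, but counts are reweighted to downplay common words
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

the_column = bow.vocabulary_["the"]
print("matrix shape (docs x vocabulary):", X_counts.shape)
print("count of 'the' in first document:", X_counts[0, the_column])
```

Both matrices are sparse and ignore word order, which is the main gap that dense word embeddings and contextual models try to address.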

**Why do we care about creating different representations?**  
- Different tasks demand different levels of linguistic nuance.  
- Word embeddings capture semantics (words used in similar contexts have similar embeddings).  
- BERT-like models can encode context-dependent meaning, beneficial for advanced tasks (question answering, NER, etc.).

### 2. What are pre-trained models? Why are they beneficial?

- **Pre-trained models** (e.g., BERT, GPT) are trained on large corpora to learn general language representations.  
- **Benefits**:  
  - **Reduced training time**: You can fine-tune with a smaller dataset for your task.  
  - **Better performance** on tasks with limited data.  
  - **Transfer learning**: The model already knows generic language patterns.

### 3. Topic Modeling

**What is topic modeling?**  
- Unsupervised method to discover topics (latent themes) in a collection of documents (e.g., using LDA).  

**Inputs/Outputs**:
- **Input**: A set of documents, usually represented with bag-of-words or TF-IDF.  
- **Output**: A set of “topics” (clusters of words) + distribution of topics in each document.

**How is it different from clustering documents with, say, KMeans?**  
- **Topic modeling** focuses on discovering word distributions that define each topic and how each document is composed of multiple topics. Documents can be a mixture of topics.  
- **KMeans** (or any standard clustering) forces each document into exactly one cluster, no mixing.

### 4. Text Preprocessing

Common steps: tokenization, lowercasing, removing punctuation/stopwords, stemming/lemmatization, etc. This ensures consistent input format and can reduce noise in the text representation.

---

## Multiclass Classification and Computer Vision

### 1. How is the Softmax function used by logistic regression in multiclass classification?

- **Logistic regression** (or any linear model) can be extended to multiple classes by assigning a **logit** for each class.  
- **Softmax** converts these logits to a probability distribution across all classes. For a sample \(x\), if the model outputs logit \(z_c\) for class \(c\):
  \[
    P(\text{class}=c \mid x) = \frac{e^{z_c}}{\sum_{k} e^{z_k}}.
  \]
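
The formula above translates directly into NumPy. This sketch subtracts the maximum logit before exponentiating, a standard numerical-stability trick that leaves the result unchanged; the example logits are arbitrary.

```python
import numpy as np

def softmax(logits):
    """Convert logits to probabilities along the last axis."""
    # Subtracting the max avoids overflow in exp() and does not change the output
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

z = np.array([2.0, 1.0, 0.1])  # hypothetical logits for three classes
probs = softmax(z)
print("probabilities:", probs.round(3))
print("predicted class:", probs.argmax())
```

The predicted class is simply the one with the largest logit; softmax matters when calibrated probabilities or a cross-entropy loss are needed.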

### 2. What are the methods we saw to use pre-trained image classification models for our image classification tasks?

1. **Out of the box**: Use a pre-trained model (e.g., ResNet) directly if your classes match exactly.  
2. **Using pre-trained models as feature extractors**: Remove the final classification layer, feed your images through the rest of the network to extract features, then train a simpler model (e.g., logistic regression) on those features.

### 3. How would you use a pre-trained model in each case below?

1. **Identify different cat breeds from photos (quick prototype)**  
   - **Method**: Possibly use **pre-trained model as a feature extractor** (like a ResNet pre-trained on ImageNet). Then train a classifier on top of these features for your specific cat breeds. This is typically quicker to set up.

2. **Predict the city in Canada based on photos of landmarks, with limited training data**  
   - **Method**: Again, use a **pre-trained model** and then **fine-tune** it on your small dataset or use it purely as a feature extractor. If the dataset is extremely limited, a feature-extractor approach plus a simple classifier is a good start.

3. **Develop a system to diagnose specific types of tumors from MRI scans**  
   - **Method**: If you have some medical images and a reasonable amount of labeled data, you might do **transfer learning** (fine-tuning) with a model that’s either pre-trained on large medical imaging or on ImageNet. For best performance, **fine-tuning** the last few layers (or entire network) is common in specialized tasks like tumor detection.

---

## Time Series

### 1. When is time series analysis appropriate?

- If your data has a **temporal component** and you need to respect chronological order.  
- E.g., forecasting future values of a process measured over time.

**Data splitting**: Must split by time to avoid future data leakage.  
**Cross-validation**: Use `TimeSeriesSplit` or rolling forecast origin approach.

### 2. Essential questions for EDA in time series:

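- What is the sampling frequency, and are the timestamps equally spaced?
- How many time series are there (a single series, or one per entity)?
- Are there missing timestamps or gaps?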

### 3. Feature engineering for time series

- **Date/time features** (e.g., day of week, month).  
- **Lag features** (e.g., yesterday’s temperature to predict today’s temperature).  
- Possibly differences or rolling windows.
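
These features can be built with pandas as sketched below. The temperature values and column names are hypothetical. Note that the rolling mean is computed on the shifted series so that each row only uses past values.

```python
import pandas as pd

df = pd.DataFrame(
    {"temperature": [10.0, 12.0, 11.0, 13.0, 15.0, 14.0]},
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

df["day_of_week"] = df.index.dayofweek          # date/time feature (Monday = 0)
df["lag_1"] = df["temperature"].shift(1)        # yesterday's value
df["diff_1"] = df["temperature"].diff()         # change since yesterday
# Mean of the previous three days; shift(1) keeps today's value out of the window
df["rolling_mean_3"] = df["temperature"].shift(1).rolling(window=3).mean()

features = df.dropna()  # the first rows lack enough history
print(features)
```

If `diff_1` were used as a feature to predict `temperature` on the same row it would leak the target, so in a real pipeline it would either be lagged as well or used as the target itself.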

### 4. Baseline model approach

- E.g., a naive forecast that tomorrow = today (for many forecasting problems).
- Compare advanced models against this baseline.
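
A naive "tomorrow = today" baseline takes only a few lines. The series below is made up for illustration, and MAE is just one reasonable choice of metric.

```python
import numpy as np

y = np.array([10.0, 12.0, 11.0, 13.0, 15.0, 14.0])  # hypothetical daily values

naive_pred = y[:-1]  # forecast for day t is the observed value on day t-1
actual = y[1:]

naive_mae = np.mean(np.abs(actual - naive_pred))
print("naive baseline MAE:", naive_mae)
```

A more complex model is only worth its cost if it beats this number on the same evaluation period.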

### 5. TimeSeriesSplit cross-validation

- Splits sequentially, ensuring training always precedes validation in time.
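
A short sketch with scikit-learn's `TimeSeriesSplit` shows the expanding training window. The 12 placeholder observations are assumed to be already sorted by time.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for fold, (train_idx, valid_idx) in enumerate(folds):
    print(f"fold {fold}: train={train_idx.tolist()} valid={valid_idx.tolist()}")
```

Every validation index is later than every training index in the same fold, so no fold trains on the future.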

### 6. Strategies for long-term forecasting

- **Recursive approach**: Predict next step, feed predicted value back in to predict the next, etc.
- This can accumulate error but is a common approach in standard ML frameworks.
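
The recursive approach can be sketched with a one-lag linear model. The toy series and the helper name `recursive_forecast` are hypothetical; any regressor with a `predict` method could be substituted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

y = np.arange(20, dtype=float)  # toy series that increases by 1 each step

# One-step-ahead model: predict y[t] from y[t-1]
X_lag = y[:-1].reshape(-1, 1)
target = y[1:]
model = LinearRegression().fit(X_lag, target)

def recursive_forecast(model, last_value, n_steps):
    """Forecast n_steps ahead by feeding each prediction back in as the next input."""
    preds = []
    current = last_value
    for _ in range(n_steps):
        current = model.predict(np.array([[current]]))[0]
        preds.append(current)
    return np.array(preds)

forecast = recursive_forecast(model, last_value=y[-1], n_steps=3)
print("next three steps:", forecast.round(2))
```

On this noise-free series the one-step model is exact, so nothing accumulates; with real data each step's error becomes part of the next step's input.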

### 7. Trends

- Add a “days since start” feature so the model can learn a trend over time.  
- Or apply differencing (model the change between consecutive values) to remove the trend.

---

## Survival Analysis

### 1. What is right-censored data?

- **Right-censored** data arises when you do **not** observe the event (e.g., churn, failure) for some individuals by the end of the study. So you only know they survived (didn’t churn) up to some time, but not their eventual event time.

### 2. What happens if we treat right-censored data as “regular” data?

- You either:  
  1. Drop them (lose data).  
  2. Assume they “fail” or “churn” at the end date (inaccurate).  
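- Both approaches bias your model toward underestimating time-to-event. Censored individuals tend to be the ones who last longer, so dropping them removes long durations, and treating the study end as the event time records durations that are too short.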

### 3. Predicting churn vs. no churn vs. predicting tenure

- **If you only do binary classification** for churn, you lose information about *when* churn occurs.  
- **Survival analysis** can predict the distribution over times to event (churn), handling censoring properly.

### 4. Kaplan–Meier (KM) vs. Cox Proportional Hazards (CPH)

- **KM model**: Non-parametric estimator of survival function. Does not incorporate features.  
- **CPH model**: A semi-parametric model that relates the hazard to features: the log hazard is a linear function of the features on top of an unspecified baseline hazard. Produces coefficients to interpret how each feature affects hazard.
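
To show how the Kaplan–Meier estimator uses censored observations, here is a from-scratch sketch in NumPy. The function name `kaplan_meier` and the toy durations are hypothetical; in practice a survival-analysis library would normally be used.

```python
import numpy as np

def kaplan_meier(durations, event_observed):
    """Return event times and the estimated survival probability after each one."""
    durations = np.asarray(durations, dtype=float)
    event_observed = np.asarray(event_observed, dtype=bool)

    event_times = np.unique(durations[event_observed])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)  # still under observation just before t
        events = np.sum((durations == t) & event_observed)
        s *= 1.0 - events / at_risk       # conditional probability of surviving t
        survival.append(s)
    return event_times, np.array(survival)

# Toy data: 1 = event observed (e.g., churned), 0 = right-censored
durations = [2, 3, 3, 5, 8, 8]
observed = [1, 1, 0, 1, 0, 0]

times, surv = kaplan_meier(durations, observed)
for t, s in zip(times, surv):
    print(f"S({t:.0f}) = {s:.3f}")
```

Censored individuals never count as events, but they stay in the at-risk denominator until their censoring time, which is how the estimator uses their partial information instead of discarding it.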

---

## Wrap-Up

This summary answers each question from your prompt across ensembles, feature importance, feature engineering and selection, clustering, recommender systems, NLP, multiclass classification & vision, time series, and survival analysis. 

Key takeaways:
- **Ensembles** leverage multiple models to improve performance (via randomness in RF or sequential correction in boosting).  
- **Feature importances** and local interpretability methods are crucial in non-linear models.  
- **Feature selection** can reduce overfitting, improve interpretability, and speed up training.  
- **Clustering** is inherently unsupervised; method choice depends on cluster shape, scale, domain constraints.  
- **Recommender systems** revolve around user-item interactions, evaluated by typical rating/ranking metrics; baselines give important comparison points.  
- **NLP**: Representations and embeddings significantly impact downstream performance; pre-trained models accelerate development.  
- **Computer Vision**: Transfer learning is the dominant approach for new image tasks.  
- **Time Series**: Must respect time order in splits, use naive baselines, create time-based features.  
- **Survival Analysis**: Properly handles censored data, can predict both occurrence and timing of an event.

All these areas emphasize the importance of understanding data peculiarities (temporal, textual, censored) and model constraints (interpretability, overfitting, domain knowledge) to choose suitable methods.

Below is a concise answer set covering **Communication** and **Ethics** topics in Machine Learning (ML) and Data Science (DS). The content is based on common best practices and themes from the provided references (slides on AI fairness, accountability, transparency, and large language models).

---

## Communication

### 1. Why is communication important in ML and Data Science?

- **Complexity of models**: Many ML models are opaque to non-technical stakeholders. Explaining their outputs and justifying decisions is crucial.  
- **Stakeholder alignment**: Collaborations often involve domain experts, managers, end-users, and clients who need different levels of technical depth. Miscommunication can lead to misapplication or mistrust.  
- **Trust and transparency**: Especially when decisions affect people (e.g., healthcare, finance), clarity about how the model works fosters trust.  
- **Interpretation of results**: Data scientists must clearly convey model limitations, assumptions, and uncertainty to inform better decision-making.

### 2. What are different principles of good explanation?

Although various frameworks exist, some general principles include:

1. **Clarity and Conciseness**  
   - **Aim for simplicity**: Avoid jargon where possible, or define it when it’s necessary.  
   - **Focus on key points**: Highlight the most important features, assumptions, or outputs.

2. **Context and Purpose**  
   - Tailor explanations to your audience’s **background and goals**.  
   - Provide **relevant domain context** rather than purely technical details.

3. **Transparency about Limitations**  
   - Mention **data biases** or known failure modes.  
   - Indicate **uncertainty** or confidence intervals where relevant.

4. **Use Visuals Appropriately**  
   - Support textual explanations with well-chosen visuals (charts, plots) to illustrate complex relationships or model behavior.

5. **Interactivity or Engagement**  
   - Encourage questions or interactive demos if possible.  
   - Show example inputs/outputs to make the explanation tangible.

### 3. What to watch out for when producing or consuming visualizations?

- **Data Integrity**: Make sure the data is represented accurately (scales, axes, data transformations).  
- **Misleading Scales or Truncations**: For example, a **y-axis that does not start at zero** can exaggerate small differences.  
- **Overcomplicating Graphics**: Keep the **chart type** and design simple; avoid clutter.  
- **Context**: Check the **labels, legends, and units** are clear and correct.  
- **Cherry-Picked Data**: Ensure no selective omission of data that could distort conclusions.

---

## Ethics

### 1. Fairness, Accountability, Transparency

- **Fairness**: Ensuring that the model’s predictions or decisions do not systematically disadvantage certain groups. For instance:
  - **Equal opportunity**: Among people who truly qualify for the positive outcome, the model should predict that outcome at the same rate in every demographic group (equal true positive rates).  
  - **Group fairness metrics** (e.g., demographic parity, equalized odds).
- **Accountability**:  
  - ML practitioners, organizations, and stakeholders should be responsible for model decisions and impacts.  
  - Transparent processes for auditing and explaining decisions, and proper channels for recourse if harms occur.
- **Transparency**:  
  - Being open about the **data sources, model assumptions, and limitations**.  
  - Using model interpretability tools (e.g., SHAP, LIME) or publishing model cards, datasheets.
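
Group fairness metrics are simple to compute once predictions are split by group. The sketch below uses made-up labels and predictions for two hypothetical groups, `"a"` and `"b"`, and reports a demographic parity difference (gap in positive-prediction rates) and an equal opportunity difference, taken here as the gap in true positive rates.

```python
import numpy as np

# Hypothetical audit data for two demographic groups
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])

def selection_rate(pred):
    """Share of individuals who receive a positive prediction."""
    return pred.mean()

def true_positive_rate(true, pred):
    """Share of truly positive individuals who are predicted positive."""
    return pred[true == 1].mean()

in_a, in_b = group == "a", group == "b"

dp_diff = selection_rate(y_pred[in_a]) - selection_rate(y_pred[in_b])
eo_diff = (
    true_positive_rate(y_true[in_a], y_pred[in_a])
    - true_positive_rate(y_true[in_b], y_pred[in_b])
)
print("demographic parity difference:", dp_diff)
print("equal opportunity difference: ", eo_diff)
```

A value of zero would mean parity on that metric; which metric matters, and how large a gap is acceptable, depends on the application.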

### 2. Representation Bias, Measurement Bias, Historical Bias

These biases often arise in the data and can propagate into ML models:

1. **Representation Bias**  
   - **Underrepresentation** of certain groups in the dataset (e.g., fewer examples of a particular demographic).  
   - Leads to poorer performance or systematic errors for those groups.

2. **Measurement Bias**  
   - Occurs if the **features or labels are measured** in a flawed or skewed way (e.g., instrument error, or a proxy that does not capture the quantity of interest equally well for all groups).  
   - Example: Recorded crime data might overrepresent certain neighborhoods due to policing practices, leading to biased predictive policing models.

3. **Historical Bias**  
   - When **past inequalities** are embedded in the data. Even if a model is learned “fairly,” it can reinforce unjust patterns.  
   - Example: If historically certain groups had reduced access to loans, the model trained on past loan approvals may inadvertently perpetuate the same pattern.

**Mitigation strategies** often require:
- Collecting more representative data.  
- Adjusting or re-weighting data.  
- Revisiting the definition of the target variable.  
- Ongoing monitoring for disparate outcomes.  

---

### Key Takeaways

- **Communication** in ML/DS revolves around clarity, context, and trust. Good explanations are concise, audience-oriented, and transparent about uncertainty.
- **Visualization** must be accurate, unbiased in design, and clearly labeled.
- **Ethics** in ML focuses on fairness (avoiding harm to underrepresented groups), accountability (who is responsible), and transparency (how the model is built and how it performs).
- **Biases** (representation, measurement, historical) can creep in if data is unbalanced or historically skewed, so careful data collection, thorough auditing, and domain-aware adjustments are crucial for equitable ML systems.

These considerations guide both how we **communicate** our work and **practice** ethical data science.