<a href="https://colab.research.google.com/github/puck-arthur/data-learning/blob/main/notebooks/week4_feature_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4 – Feature Engineering & Pipelines

## 1. Dataset Selection & Notebook Setup
- Continue with the Student Depression dataset (so you can compare against Week 3).
- Create a new Colab notebook named `week04_feature_engineering.ipynb` and load your cleaned DataFrame.

## 2. Exploratory Data Analysis Refresh
- Audit missing values and decide whether to drop, impute, or introduce a “Missing” category.
- Check for outliers on all numeric features using boxplots or IQR rules.
- Recompute your correlation heatmap to see how the original features relate to the target, then rerun it after Section 3 to include the newly created features.
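
The three checks above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the actual Student Depression dataset: the target name `depression`, the injected missing values, and the injected outlier are all assumptions made so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned DataFrame (assumption: real columns will differ).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "study_hours": rng.normal(5, 2, 200),
    "sleep_hours": rng.normal(7, 1, 200),
    "social_media_hours": rng.normal(3, 1, 200),
    "depression": rng.integers(0, 2, 200),  # hypothetical binary target
})
df.loc[::25, "sleep_hours"] = np.nan  # inject missing values every 25th row
df.loc[0, "study_hours"] = 40.0       # inject one obvious outlier

# 1. Missing-value audit: count and share of missing entries per column.
missing = df.isna().sum().to_frame("n_missing")
missing["share"] = missing["n_missing"] / len(df)

# 2. IQR rule: flag values more than 1.5 * IQR outside the quartiles.
numeric_cols = ["study_hours", "sleep_hours", "social_media_hours"]
q1 = df[numeric_cols].quantile(0.25)
q3 = df[numeric_cols].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outlier_mask = (
    df[numeric_cols].lt(lower, axis="columns")
    | df[numeric_cols].gt(upper, axis="columns")
)
outlier_counts = outlier_mask.sum()

# 3. Correlation of every feature with the target (the data behind a heatmap).
target_corr = df.corr()["depression"].drop("depression").sort_values()

print(missing)
print(outlier_counts)
print(target_corr)
```

The full matrix returned by `df.corr()` is what a heatmap would visualize; the sorted `target_corr` series is often the quicker thing to read.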

## 3. Feature Engineering
- Add interaction terms (e.g. `study_hours × sleep_hours`, `social_media_hours × study_hours`).
- Create polynomial features to capture non-linear relationships (e.g. `degree=2` on key numerics).
- Bin continuous variables into meaningful categories (low/medium/high).
- (Optional) Extract any text or timestamp features available (word counts, time-of-day, etc.).

## 4. Preprocessing & Pipelines
- Build a numeric pipeline that imputes missing values and scales features.
- Build an ordinal pipeline that imputes and encodes ordered categories.
- Build a nominal pipeline that imputes and one-hot encodes nominal variables.
- Build a custom feature pipeline that applies your interaction and polynomial transformations.
- Combine all pipelines into a `ColumnTransformer`, assigning each pipeline to its column group.
- Wrap the `ColumnTransformer` and your chosen estimator in a top-level `Pipeline`.
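
The layout described above might look something like this. It is a sketch under several assumptions: the columns `stress_level` and `major`, their category values, the imputation strategies, and the `RandomForestClassifier` are all hypothetical stand-ins for whatever your dataset and Week 3 model actually use.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    OneHotEncoder, OrdinalEncoder, PolynomialFeatures, StandardScaler,
)

# Hypothetical column groups; replace with the groups from your own data.
numeric_cols = ["social_media_hours"]
poly_cols = ["study_hours", "sleep_hours"]
ordinal_cols = ["stress_level"]
nominal_cols = ["major"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
ordinal_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    # Explicit category order so "low" < "medium" < "high".
    ("encode", OrdinalEncoder(categories=[["low", "medium", "high"]])),
])
nominal_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
# Custom feature pipeline: degree-2 terms include the pairwise interaction.
feature_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("ord", ordinal_pipe, ordinal_cols),
    ("nom", nominal_pipe, nominal_cols),
    ("feat", feature_pipe, poly_cols),
])
model = Pipeline([
    ("preprocess", preprocessor),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Synthetic data so the sketch runs on its own.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "study_hours": rng.uniform(0, 10, n),
    "sleep_hours": rng.uniform(4, 10, n),
    "social_media_hours": rng.uniform(0, 8, n),
    "stress_level": rng.choice(["low", "medium", "high"], n),
    "major": rng.choice(["arts", "science", "business"], n),
})
X.loc[::20, "sleep_hours"] = np.nan  # missing values for the imputer to fill
y = (X["study_hours"] + rng.normal(0, 2, n) > 5).astype(int)

model.fit(X, y)
n_features_out = model.named_steps["preprocess"].transform(X).shape[1]
print(n_features_out)
```

Keeping imputation, scaling, and feature creation inside the pipeline means they are refit on each training fold during cross-validation, which avoids leaking information from validation rows.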

## 5. Cross-Validation & Evaluation
- Use `cross_val_score` with 5-fold CV and the `"f1_macro"` metric on your pipeline.
- Compare the mean and standard deviation of CV scores against your Week 3 baseline.
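
A minimal sketch of this step, assuming a simple scaler-plus-logistic-regression pipeline on synthetic data in place of your full pipeline. The `week3_baseline` value is a hypothetical placeholder, not a real result; substitute the score you recorded in Week 3.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (stand-in for the real dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# One macro-averaged F1 score per fold.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1_macro")

week3_baseline = 0.70  # hypothetical placeholder, replace with your own value
print(f"CV f1_macro: {scores.mean():.3f} +/- {scores.std():.3f}")
print(f"Change vs. baseline: {scores.mean() - week3_baseline:+.3f}")
```

A difference smaller than the fold-to-fold standard deviation is weak evidence that the new features helped.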

## 6. Hyperparameter Tuning
- Define a parameter grid that includes both model hyperparameters (e.g. tree depth, `n_estimators`) and feature-engineering choices (e.g. polynomial degree, whether to include interactions).
- Run `GridSearchCV` with 5-fold CV and `"f1_macro"` scoring.
- Record `best_params_` and `best_score_` from the search.
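
One way to search over model and feature-engineering choices together is to address pipeline steps with the `step__parameter` naming scheme. The sketch below assumes a two-step pipeline with hypothetical step names `poly` and `clf` and a deliberately small grid on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (stand-in for the real training set).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipeline = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("clf", RandomForestClassifier(random_state=0)),
])

param_grid = {
    # Feature-engineering choices: degree=1 leaves the inputs unchanged;
    # with degree=2, interaction_only=True keeps cross products but drops squares.
    "poly__degree": [1, 2],
    "poly__interaction_only": [False, True],
    # Model hyperparameters.
    "clf__max_depth": [3, None],
    "clf__n_estimators": [25, 50],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_macro")
search.fit(X, y)

print(search.best_params_)
print(f"best f1_macro: {search.best_score_:.3f}")
```

The grid has 16 combinations, so 5-fold CV runs 80 fits; grid size grows multiplicatively, which is worth keeping in mind before adding more values.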

## 7. Final Evaluation & Diagnostics
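- Refit the best configuration on the entire training set (with the default `refit=True`, `GridSearchCV` already does this and exposes the result as `best_estimator_`).
- Evaluate on the held-out test set: accuracy, macro-averaged precision, recall, and F1, and ROC-AUC.
- Plot and inspect the confusion matrix and ROC curve for your final model.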
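
The test-set metrics could be computed along these lines. This sketch assumes a binary target and uses a plain `RandomForestClassifier` on synthetic data as a stand-in for the tuned `best_estimator_`; the 75/25 split is likewise an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_recall_fscore_support,
    roc_auc_score, roc_curve,
)
from sklearn.model_selection import train_test_split

# Synthetic binary data and a stratified hold-out split.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Stand-in for search.best_estimator_ from the tuning step.
best_model = RandomForestClassifier(n_estimators=50, random_state=0)
best_model.fit(X_train, y_train)

y_pred = best_model.predict(X_test)
y_score = best_model.predict_proba(X_test)[:, 1]  # probability of class 1

accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro"
)
auc = roc_auc_score(y_test, y_score)
cm = confusion_matrix(y_test, y_pred)          # rows: true, columns: predicted
fpr, tpr, thresholds = roc_curve(y_test, y_score)  # points on the ROC curve

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1_macro={f1:.3f} roc_auc={auc:.3f}")
print(cm)
```

For the plots, `ConfusionMatrixDisplay.from_predictions(y_test, y_pred)` and `RocCurveDisplay.from_predictions(y_test, y_score)` from `sklearn.metrics` draw both figures directly in a notebook.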