# scikit-learn mini-curriculum
## Goals
- Train/test split, pipelines, and transformers
- Supervised basics: linear/logistic regression, decision trees, k-NN
- Model evaluation: cross-validation, metrics, confusion matrices, ROC curves
- Feature scaling/encoding with `ColumnTransformer`
- Hyperparameter tuning with `GridSearchCV`/`RandomizedSearchCV`
- Persistence with `joblib` and reproducibility

## Step-by-step path
1. Setup & mindset
   - Install scikit-learn + pandas, bookmark the docs, and make setting `random_state` a habit
   - Know when to pick classification vs regression; define target + metric upfront
2. Data loading & splits
   - Use `train_test_split` with `stratify=y` for classification; keep a holdout set untouched until final evaluation
   - Basic EDA: dtypes, missingness, leakage checks, target distribution
3. Baselines & metrics
   - Fit `DummyClassifier`/`DummyRegressor` to set a floor; track chosen metric (`accuracy`, `F1`, `MAE`/`RMSE`)
   - Add `classification_report` and a confusion matrix; under class imbalance, look at ROC and precision-recall curves rather than accuracy alone. For regression, record RMSE/MAE and inspect residual plots
4. Preprocessing pipeline
   - Build a `ColumnTransformer` for numeric (imputer + scaler) and categorical (imputer + `OneHotEncoder(handle_unknown="ignore")`)
   - Wrap preprocessing + model in a `Pipeline` so transformers are fit on training data only, which avoids leakage
5. Core estimators
   - Classification: logistic regression, k-NN, decision tree; note linear vs non-linear trade-offs
   - Regression: linear regression, ridge/lasso (regularization), decision tree regressor
6. Cross-validation & evaluation
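   - Use `cross_validate` with a `scoring` dict to capture multiple metrics; prefer `StratifiedKFold` for classification
   - Compare scores across folds: a large spread suggests an unstable estimate or too little data, and a large train/validation gap (visible with `return_train_score=True`) suggests overfitting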
7. Hyperparameter tuning
   - Use small grids with `GridSearchCV` or wider ranges with `RandomizedSearchCV`; pass the whole pipeline as the estimator so preprocessing is refit inside each fold
   - Review `cv_results_` and learning/validation curves to choose simpler vs more complex models
8. Feature engineering
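   - Scale features (e.g. `StandardScaler`) before distance-based models such as k-NN; add `PolynomialFeatures` or interaction terms sparingly, since the feature count grows quickly
   - Guard against target leakage (features that encode the outcome or would not be available at prediction time); split dates into useful parts (year, month, hour) when applicable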
9. Model diagnostics & interpretability
   - Check feature importances, preferring permutation importance on held-out data over impurity-based importances; plot partial dependence for tree-based models
   - Calibrate probabilities (`CalibratedClassifierCV`) when decision thresholds matter
10. Persistence & delivery
    - Save the final pipeline with `joblib.dump`; include preprocessing so inference matches training
    - Log versions/seeds, freeze dependencies, and document input schema + expected metrics
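
Steps 2 to 6 of the path can be sketched end to end as below. This is a minimal illustration, not a reference implementation: the toy dataset and its column names (`age`, `income`, `city`) are made up for the example, and the choice of logistic regression and the two metrics is just one reasonable default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns, one categorical, binary target.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
y = (X["age"] + rng.normal(0, 5, n) > 40).astype(int)
# Inject a few missing values so the numeric imputer has work to do.
X.loc[rng.choice(n, 20, replace=False), "income"] = np.nan

# Step 2: stratified split; the test part is the untouched holdout.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: separate preprocessing for numeric and categorical columns.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])

# Steps 3 and 5: a dummy floor and one real estimator behind the same preprocessing.
baseline = DummyClassifier(strategy="most_frequent")
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# Step 6: several metrics per fold; zero_division=0 keeps F1 defined for the baseline.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = {"accuracy": "accuracy", "f1": make_scorer(f1_score, zero_division=0)}

results = {}
for name, estimator in {"baseline": baseline, "logreg": model}.items():
    scores = cross_validate(estimator, X_train, y_train, cv=cv, scoring=scoring)
    results[name] = {metric: scores[f"test_{metric}"].mean() for metric in scoring}
    spread = scores["test_accuracy"].std()
    print(name, results[name], f"accuracy std across folds: {spread:.3f}")
```

Because the preprocessing lives inside the pipeline, each fold fits the imputers, scaler, and encoder on its own training portion only. The holdout (`X_test`, `y_test`) is deliberately left unused here so it stays available for the final check.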

## Suggested exercises
- Load a tabular dataset, split data, and build a pipeline with preprocessing + a baseline model
- Compare two algorithms on accuracy/F1 (or RMSE/MAE) and discuss bias/variance from fold results
- Tune a small hyperparameter grid on one model, inspect `best_params_`, and evaluate on the untouched holdout
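
One possible starting point for the tuning exercise is sketched below. It assumes a synthetic dataset from `make_classification` as a stand-in for a real table, a k-NN pipeline, and an arbitrary small grid; the file name `model.joblib` is likewise only illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real tabular dataset.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling sits inside the pipeline so it is refit on each CV training fold.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Parameters of pipeline steps are addressed as "<step name>__<parameter>".
param_grid = {
    "knn__n_neighbors": [3, 5, 11],
    "knn__weights": ["uniform", "distance"],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("best CV F1:", round(search.best_score_, 3))

# Evaluate once on the holdout that the search never saw.
holdout_f1 = f1_score(y_test, search.predict(X_test))
print("holdout F1:", round(holdout_f1, 3))

# Persist the refit best pipeline and confirm the reloaded copy predicts identically.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(search.best_estimator_, path)
    restored = joblib.load(path)

same_predictions = np.array_equal(restored.predict(X_test), search.predict(X_test))
print("round-trip predictions match:", same_predictions)
```

A holdout score that falls well below the cross-validated score is a hint that the search overfit the validation folds, which is one reason to keep grids small.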
