
# Practical Machine Learning Tips for Beginners
**Focus:** Tools (scikit-learn, Kaggle) • Free learning resources • Ready-to-run templates  
**Last updated:** 2025-09-05 17:57  

This notebook is a quick-start companion: concise checklists, pitfalls to avoid, and **offline-runnable** code templates using scikit-learn. It also explains **how to use Kaggle** for datasets and practice.



## Who is this for?
- Students and professionals beginning ML.
- Anyone who wants a practical, **copy‑paste‑friendly** workflow.



## Quick Start: Setup (Local or Cloud)
- **Python**: 3.9+ (3.11+ recommended)
- **Install**: `pip install scikit-learn pandas numpy matplotlib jupyter`
- **Editor**: VS Code (Python extension) or JupyterLab
- **Cloud (free)**: Google Colab or Kaggle Notebooks (no install needed). For a local setup, Anaconda bundles most of these packages.
- **Version control**: Git + GitHub (optional but highly recommended)
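
A quick way to confirm the install worked is to print the installed version of each package. The snippet below is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for this example.

```python
# Sketch: report installed versions of the packages from the install command above.
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["scikit-learn", "pandas", "numpy", "matplotlib", "jupyter"]


def installed_versions(packages=PACKAGES):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name:<14} {ver or 'NOT INSTALLED'}")
```

Any line that prints `NOT INSTALLED` means the `pip install` step needs to be rerun in the same environment your notebook uses.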



## Golden Rules (Pin this!)
1. **Split your data** before preprocessing, tuning, or looking at metrics (train/validation/test).
2. Start with a **simple baseline** (linear/logistic regression).
3. Use **pipelines** for preprocessing → model (no data leakage).
4. Tune via **cross‑validation**; report mean ± std.
5. **Fix random_state** for reproducibility.
6. Pick the **right metric** (MAE/RMSE/R² vs. accuracy/F1/AUC).
7. Track experiments (even a spreadsheet works).
8. Keep a **hold‑out test set** for the final unbiased check.
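
Rules 1, 3, 4, 5, and 8 can be sketched together on synthetic data. This is one possible layout, assuming a 60/20/20 split and 5-fold cross-validation; the sizes and the choice of logistic regression are illustrative, not prescriptive.

```python
# Sketch: hold-out test set, train/validation split, and CV reported as mean ± std.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Rule 8: set aside 20% as the final test set before doing anything else.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# Rule 1: split the remainder into train (60% of total) and validation (20% of total).
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=0
)

# Rule 3: the scaler lives inside the pipeline, so it is fit on training folds only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Rule 4: cross-validate on the development data and report mean ± std.
scores = cross_val_score(model, X_dev, y_dev, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

model.fit(X_train, y_train)
print(f"Validation accuracy: {model.score(X_val, y_val):.3f}")

# Only once model choices are final: refit on all development data, score the test set once.
model.fit(X_dev, y_dev)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```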



## Common Pitfalls (and how to avoid them)
- **Data leakage**: Scaling/encoding fit on the full dataset → Do it **only on train** inside a `Pipeline`/`ColumnTransformer`.
- **Overfitting**: Model too complex → cross‑validate, regularize, add data, simplify features.
- **Wrong split**: Shuffling time‑series → use time‑based split.
- **Imbalanced classes** (classification) → use stratified splits, proper metrics (ROC‑AUC, PR‑AUC, F1), resampling.
- **Chasing the metric**: Tuning hyperparameters on the test set → tune on validation folds and keep the test set untouched until the end.
- **No baseline**: Always compare to something simple.
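
Two of these pitfalls are easy to see in code. The sketch below, on synthetic data with roughly 5% positives (an assumed ratio for illustration), compares a majority-class baseline with logistic regression to show why accuracy misleads on imbalanced data, then uses `TimeSeriesSplit` to show a split where training indices always precede test indices.

```python
# Sketch: baseline vs. model on imbalanced data, plus a time-ordered split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, average_precision_score
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Roughly 95% negatives, 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
logreg = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

results = {}
for name, est in [("baseline", baseline), ("logreg", logreg)]:
    results[name] = {
        "accuracy": accuracy_score(y_test, est.predict(X_test)),
        # Average precision summarizes the precision-recall curve (PR-AUC).
        "pr_auc": average_precision_score(y_test, est.predict_proba(X_test)[:, 1]),
    }
    print(f"{name:<9} accuracy={results[name]['accuracy']:.3f} "
          f"PR-AUC={results[name]['pr_auc']:.3f}")

# Time-based split: every training index comes before every test index.
timeline = np.arange(12).reshape(-1, 1)
folds = list(TimeSeriesSplit(n_splits=3).split(timeline))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

The baseline scores high on accuracy simply by always predicting the majority class, which is why PR-AUC or F1 is the better yardstick here.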



## Tools You Should Know
- **scikit-learn**: classical ML library with clean APIs (models, metrics, pipelines).
- **pandas** & **NumPy**: data handling and array math.
- **matplotlib**: simple plotting.
- **Kaggle**: datasets, notebooks, and competitions (practice ecosystem).
- **Jupyter/Colab**: iterate quickly, share notebooks.
- **(Optional)** MLflow or Weights & Biases for experiment tracking (free tiers).



## Kaggle: How to Use (Datasets & Notebooks)
**Option A: Kaggle Notebooks (zero install)**
1. Create a Kaggle account → open **Notebooks** → **New Notebook**.
2. **Add Data** from the right sidebar (search public datasets).
3. Write code; **Run All**; save and share the notebook.

**Option B: Kaggle CLI (local)**
1. Create an API token: Kaggle → **Account** → **Create New API Token** (downloads `kaggle.json`).
2. Place it at `~/.kaggle/kaggle.json` (Linux/Mac) or `%USERPROFILE%\.kaggle\kaggle.json` (Windows). On Linux/Mac, run `chmod 600 ~/.kaggle/kaggle.json` so other users cannot read the token.
3. Install CLI: `pip install kaggle`
4. Download data (example):
```bash
kaggle datasets download -d zillow/zecon   # example dataset
unzip zecon.zip -d data/
```

> Tip: Start with **Kaggle Learn** micro-courses and beginner competitions (e.g., Titanic).
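
Once a dataset is unzipped into `data/`, a first look with pandas might go as follows. This is a minimal sketch: the helper `preview_csvs` is a made-up name, and it assumes the archive contains CSV files directly under the target folder, which depends on the dataset.

```python
# Sketch: preview the first rows of every CSV in a download folder.
from pathlib import Path

import pandas as pd


def preview_csvs(data_dir="data", nrows=5):
    """Read the first `nrows` rows of each *.csv in data_dir into {filename: DataFrame}."""
    frames = {}
    for path in sorted(Path(data_dir).glob("*.csv")):
        frames[path.name] = pd.read_csv(path, nrows=nrows)
    return frames


if Path("data").is_dir():
    for name, frame in preview_csvs("data").items():
        print(name, frame.shape, list(frame.columns)[:5])
```

Reading only a few rows keeps the preview fast even when a file is large; drop `nrows` to load the full file.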



## Free Learning Resources (Curated)
- **Kaggle Learn** – bite‑sized, hands‑on lessons.
- **scikit‑learn User Guide** – practical, concept‑first docs.
- **Google ML Crash Course** – quick theory + exercises.
- **fast.ai Practical DL** – when you move to deep learning.
- **StatQuest (YouTube)** – crystal‑clear explanations of ML math and ideas.
- **OpenML / UCI** – classic datasets for practice.


## Template 1: Regression Baseline (offline, synthetic data)

```python
# Baseline regression with preprocessing and metrics (runs offline)
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic dataset (like house prices)
X, y = make_regression(n_samples=800, n_features=6, noise=25.0, random_state=42)
cols = [f"f{i}" for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=cols)
df["target"] = y

num_features = cols  # all numeric in this toy example
preproc = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]), num_features)
])

X_train, X_test, y_train, y_test = train_test_split(df[num_features], df["target"], test_size=0.2, random_state=0)

pipe = Pipeline([("pre", preproc), ("model", LinearRegression())])
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)

mae = mean_absolute_error(y_test, preds)
# np.sqrt(MSE) works on every scikit-learn version; the `squared=False`
# argument was deprecated in 1.4 and removed in 1.6.
rmse = np.sqrt(mean_squared_error(y_test, preds))
r2 = r2_score(y_test, preds)
print(f"Linear Regression -> MAE: {mae:.2f} | RMSE: {rmse:.2f} | R²: {r2:.3f}")

# Try Ridge for regularization
ridge = Pipeline([("pre", preproc), ("model", Ridge(alpha=5.0, random_state=0))])
ridge.fit(X_train, y_train)
preds_r = ridge.predict(X_test)
mae_r = mean_absolute_error(y_test, preds_r)
rmse_r = np.sqrt(mean_squared_error(y_test, preds_r))
r2_r = r2_score(y_test, preds_r)
print(f"Ridge(alpha=5)     -> MAE: {mae_r:.2f} | RMSE: {rmse_r:.2f} | R²: {r2_r:.3f}")
```


## Template 2: Classification Baseline (offline, synthetic data)

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Names carry a classification-specific suffix or prefix so this cell does not
# overwrite df, preproc, or the train/test split from Template 1, which
# Templates 3-5 reuse.
X_clf, y_clf = make_classification(n_samples=1000, n_features=8, n_informative=5, class_sep=1.2, random_state=42)
clf_cols = [f"f{i}" for i in range(X_clf.shape[1])]
df_clf = pd.DataFrame(X_clf, columns=clf_cols)
df_clf["label"] = y_clf

preproc_clf = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")), ("sc", StandardScaler())]), clf_cols)
])

clf = Pipeline([("pre", preproc_clf), ("logreg", LogisticRegression(max_iter=2000, random_state=0))])
# Stratify so both splits keep the same class proportions
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    df_clf[clf_cols], df_clf["label"], test_size=0.2, random_state=0, stratify=df_clf["label"]
)
clf.fit(Xc_train, yc_train)
probs = clf.predict_proba(Xc_test)[:, 1]  # probability of the positive class
preds_clf = (probs >= 0.5).astype(int)

acc = accuracy_score(yc_test, preds_clf)
prec, rec, f1, _ = precision_recall_fscore_support(yc_test, preds_clf, average="binary")
auc = roc_auc_score(yc_test, probs)
print(f"Accuracy: {acc:.3f} | Precision: {prec:.3f} | Recall: {rec:.3f} | F1: {f1:.3f} | AUC: {auc:.3f}")
```


## Template 3: Cross-Validation & Grid Search (scikit-learn)

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Reuses preproc and the regression split (X_train, X_test, y_train, y_test)
# from Template 1; rerun that cell first if needed.
rf = Pipeline([("pre", preproc), ("rf", RandomForestRegressor(random_state=0))])
grid = {
    "rf__n_estimators": [100, 200],
    "rf__max_depth": [None, 10, 15]
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
# scikit-learn maximizes scores, so MAE is exposed as its negative
gs = GridSearchCV(rf, grid, cv=cv, scoring="neg_mean_absolute_error")
gs.fit(X_train, y_train)

cv_std = gs.cv_results_["std_test_score"][gs.best_index_]
print("Best params:", gs.best_params_)
print(f"Best CV MAE: {-gs.best_score_:.2f} ± {cv_std:.2f}")
print(f"Test MAE: {mean_absolute_error(y_test, gs.predict(X_test)):.2f}")
```


## Template 4: Learning Curve (how much data do I need?)

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import Pipeline

# Reuses df, num_features, and preproc from Template 1; rerun that cell first if needed.
lin = Pipeline([("pre", preproc), ("lin", LinearRegression())])
train_sizes, train_scores, val_scores = learning_curve(
    lin, df[num_features], df["target"], cv=5, train_sizes=np.linspace(0.1, 1.0, 5), scoring="r2"
)
# Each row of the score arrays holds one training size; average across the CV folds
plt.figure()
plt.plot(train_sizes, train_scores.mean(axis=1), marker="o", label="Train R²")
plt.plot(train_sizes, val_scores.mean(axis=1), marker="s", label="CV R²")
plt.xlabel("Training examples")
plt.ylabel("R² score")
plt.title("Learning Curve: Linear Regression")
plt.legend()
plt.show()
```


## Template 5: Save & Load a Model

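```python
import joblib

# Reuses pipe, X_test, y_test, and num_features from Template 1.
# A relative path saves next to the notebook; change it to suit your setup.
model_path = "baseline_regressor.joblib"
joblib.dump(pipe, model_path)

# joblib files are pickles: only load files you created or trust.
loaded = joblib.load(model_path)
print("Loaded model score (R² on test):", loaded.score(X_test[num_features], y_test))
```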



## 30‑Day Study Plan (1 hour/day)
- **Week 1 (Foundations):** Python, NumPy/Pandas, plotting, train/test split, metrics.
- **Week 2 (Classic ML):** Linear/Logistic Regression, k‑NN, Decision Trees; build 2 mini projects.
- **Week 3 (Pipelines & Tuning):** ColumnTransformer, cross‑validation, grid search; try Kaggle Titanic.
- **Week 4 (Practice & Reporting):** One full project: EDA → baseline → tuning → write a 1‑page report with results and lessons learned.



## Starter Project Ideas
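1. **House prices** (regression) – engineer features like house age, living area per room, location score (avoid features derived from the price itself, such as price per sqft, which leak the target).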
2. **Used car prices** (regression) – brand, year, mileage, fuel, transmission.
3. **Loan default** (classification) – credit score, income, age, history.
4. **Customer churn** (classification) – usage metrics, tenure, support tickets.
5. **Flight delays** (regression/classification) – origin/destination, time, weather proxy.
6. **Energy consumption** (regression) – weather, hour, day type (workday/holiday).
7. **Student performance** (classification) – study time, attendance, prep courses.
8. **Retail sales forecasting** (regression) – promotions, seasonality, store/category.
9. **Health risk scoring** (classification) – anonymized vitals, habits (ethics first!).
10. **Fraud detection** (classification) – highly imbalanced; focus on precision‑recall.

> Use Kaggle/OpenML to find public datasets matching these ideas.
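
Feature engineering for an idea like house prices could start as in the sketch below. The column names (`sqft`, `bedrooms`, `year_built`, `sale_year`) and the two rows are made up for illustration; a real dataset will use its own schema. Note that every derived column uses inputs only, never the price being predicted.

```python
# Sketch: derive simple features from hypothetical house-listing columns.
import pandas as pd

# Toy rows with made-up values, only to show the transformations.
houses = pd.DataFrame({
    "sqft": [1200, 2000],
    "bedrooms": [2, 4],
    "year_built": [1990, 2010],
    "sale_year": [2020, 2020],
})

# Age of the house at sale time.
houses["house_age"] = houses["sale_year"] - houses["year_built"]
# Living area per bedroom as a rough spaciousness measure.
houses["sqft_per_bedroom"] = houses["sqft"] / houses["bedrooms"]

print(houses[["house_age", "sqft_per_bedroom"]])
```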



## Responsible ML (Always)
- Respect privacy; minimize sensitive data.
- Check for **bias** and disparate impact.
- Keep an **audit trail**: data source, preprocessing, model version, metrics.
- Prefer **explainable** models when decisions affect people.
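
An audit trail can be as simple as one JSON line per experiment. The sketch below is one possible layout, not a standard: the helper names, field names, and file name `audit_log.jsonl` are all assumptions, and the metric value is a placeholder to be filled in from your own run.

```python
# Sketch: append one JSON record per experiment to a log file.
import json
import platform
from datetime import datetime, timezone

import sklearn


def make_audit_record(data_source, preprocessing, model_name, metrics):
    """Collect the facts needed to reproduce or review an experiment."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "data_source": data_source,
        "preprocessing": preprocessing,
        "model": model_name,
        "sklearn_version": sklearn.__version__,
        "python_version": platform.python_version(),
        "metrics": metrics,
    }


def append_record(record, path="audit_log.jsonl"):
    """Append the record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


record = make_audit_record(
    data_source="make_regression(n_samples=800, n_features=6, random_state=42)",
    preprocessing=["median imputation", "standard scaling"],
    model_name="LinearRegression",
    metrics={"mae": None},  # placeholder: fill in the value from your run
)
print(json.dumps(record, indent=2))
```

Call `append_record(record)` after each run; a spreadsheet or MLflow serves the same purpose once experiments multiply.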



## Your Personal Checklist (fill these in)
- Goal/target variable:
- Primary metric:
- Baseline result:
- Best model (so far):
- What went wrong? (top 3 issues)
- Next experiments:
