
# ML & AI Basics — Hands‑On Colab Crash Course

Welcome! This notebook is designed to teach you **core machine learning concepts** by doing.  
You'll run cells, tweak parameters, and see how results change — all inside **Google Colab**.

**What you'll learn:**
- The ML workflow: data → split → train → evaluate → iterate
- Classic ML with **scikit-learn** (Iris dataset)
- Model evaluation (accuracy, precision/recall, confusion matrix)
- Hyperparameter tuning with cross-validation
- Intro to **neural networks** with **Keras/TensorFlow** (MNIST digits)
- (Optional) A tiny taste of **NLP** with Hugging Face Transformers

> Tip: In Colab, use `Runtime → Run all` to execute the whole notebook, or run cells one by one with **Shift+Enter**.



## 0) Setup
This notebook uses common Python libraries. The cell below imports what we need; uncomment the `pip` lines if you're running in a fresh environment and something is missing.


In [None]:

# If you're in Colab, most of these are already installed. If anything is missing, uncomment to install.
# !pip -q install -U scikit-learn pandas matplotlib
# Optional (only for the NLP section near the end):
# !pip -q install -U transformers torch --extra-index-url https://download.pytorch.org/whl/cpu

import sys, platform
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score

print(f"Python: {sys.version.split()[0]}  |  Platform: {platform.platform()}")
print("Libraries imported successfully.")



## Part A — Classic ML with scikit‑learn (Iris)
We'll walk through the **end‑to‑end ML workflow** using the classic Iris dataset.

**Concepts covered:**
1. Load and inspect data (features vs. labels)
2. Train/test split
3. Train a baseline model (Logistic Regression)
4. Evaluate (accuracy, precision/recall, confusion matrix)
5. Cross‑validation
6. Hyperparameter tuning with `GridSearchCV`
7. Save the model


In [None]:

# 1) Load & inspect the dataset
iris = datasets.load_iris(as_frame=True)
df = iris.frame.copy()
df.head()


In [None]:

# Basic info & class balance
display(df.describe(include='all'))
print("\nClass distribution (target):\n", df['target'].value_counts())


In [None]:

# 2) Quick EDA plots
# Plot class distribution
df['target'].value_counts().sort_index().plot(kind='bar')
plt.title('Class Distribution (Iris target)')
plt.xlabel('Class ID (0=setosa,1=versicolor,2=virginica)')
plt.ylabel('Count')
plt.show()


In [None]:

# Scatter matrix to visualize feature relationships
from pandas.plotting import scatter_matrix
scatter_matrix(df[iris.feature_names], figsize=(8,8))
plt.suptitle('Feature Scatter Matrix')
plt.show()


In [None]:

# 3) Split data
X = df[iris.feature_names]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Train shape:", X_train.shape, " Test shape:", X_test.shape)
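
The `stratify=y` argument keeps the class proportions the same in the train and test sets. The sketch below illustrates this by repeating the split with and without stratification and counting the test labels. The names `X_demo` and `y_demo` are illustrative and chosen so the notebook's own variables stay untouched.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Illustrative copies of the data; not the notebook's X / y.
X_demo, y_demo = load_iris(return_X_y=True)

# Same settings as the split above: 20% test, fixed seed, stratified on the label.
_, _, _, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Same split without stratification, for comparison.
_, _, _, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

strat_counts = Counter(y_te_strat.tolist())
plain_counts = Counter(y_te_plain.tolist())
print("Stratified test counts:  ", dict(sorted(strat_counts.items())))
print("Unstratified test counts:", dict(sorted(plain_counts.items())))
```

With 50 samples per class and a 30-sample test set, the stratified split places 10 samples of each class in the test set; the unstratified split gives no such guarantee.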


In [None]:

# 4) Build a baseline model (with scaling in a pipeline)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200))
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred))


In [None]:

# 5) Confusion matrix plot
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names)
disp.plot(values_format='d')
plt.title('Confusion Matrix — Logistic Regression (Iris)')
plt.show()
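
To connect the confusion matrix to the numbers in the classification report, the sketch below recomputes precision, recall, and accuracy by hand. It uses a small set of toy labels invented for illustration (not the Iris predictions) and checks the results against scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Toy labels, invented for illustration only.
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])

# Rows are true classes, columns are predicted classes.
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

true_positives = np.diag(cm_toy)
precision_manual = true_positives / cm_toy.sum(axis=0)  # divide by column sums
recall_manual = true_positives / cm_toy.sum(axis=1)     # divide by row sums
accuracy_manual = true_positives.sum() / cm_toy.sum()

print("Confusion matrix:\n", cm_toy)
print("Precision per class:", precision_manual)
print("Recall per class:   ", recall_manual)
print("Accuracy:", accuracy_manual)

# Cross-check against scikit-learn.
precision_sklearn = precision_score(y_true_toy, y_pred_toy, average=None)
recall_sklearn = recall_score(y_true_toy, y_pred_toy, average=None)
accuracy_sklearn = accuracy_score(y_true_toy, y_pred_toy)
```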


In [None]:

# 6) Cross-validation on the training set
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy')
print("Cross‑val scores:", cv_scores)
print(f"Mean ± std: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
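
For intuition about what `cross_val_score` does, here is a minimal sketch that runs the same 5-fold procedure as an explicit loop. It assumes the full Iris data rather than the training split (for brevity) and uses illustrative names (`X_cv`, `y_cv`, `pipe_cv`). For a classifier, an integer `cv` resolves to `StratifiedKFold` without shuffling, so the manual scores should match.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_iris(return_X_y=True)
pipe_cv = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200)),
])

folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in folds.split(X_cv, y_cv):
    # Fit a fresh, unfitted copy on four folds and score it on the held-out fold.
    fold_model = clone(pipe_cv)
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(fold_model.score(X_cv[val_idx], y_cv[val_idx]))
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(pipe_cv, X_cv, y_cv, cv=5, scoring='accuracy')
print("Manual loop:    ", manual_scores)
print("cross_val_score:", auto_scores)
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so no statistics from the held-out fold leak into preprocessing.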


In [None]:

# 7) Hyperparameter tuning with GridSearchCV
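param_grid = {
    'clf__C': [0.01, 0.1, 1.0, 10.0, 100.0],
    'clf__penalty': ['l2'],
    # Both solvers handle multiclass targets directly. 'liblinear' is one-vs-rest
    # only, and recent scikit-learn releases deprecate it for multiclass problems.
    'clf__solver': ['lbfgs', 'newton-cg']
}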

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best CV score:", grid.best_score_)

best_model = grid.best_estimator_
test_acc = best_model.score(X_test, y_test)
print("Test accuracy (best model):", test_acc)
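
Beyond `best_params_`, a fitted `GridSearchCV` stores the score of every candidate in `cv_results_`. The sketch below, which assumes a smaller grid over `C` only and illustrative names (`X_gs`, `y_gs`, `grid_gs`), turns those results into a ranked table.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_gs, y_gs = load_iris(return_X_y=True)
pipe_gs = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200)),
])

# A reduced grid (C only) to keep the example short.
grid_gs = GridSearchCV(
    pipe_gs,
    {'clf__C': [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
    scoring='accuracy',
)
grid_gs.fit(X_gs, y_gs)

# One row per candidate, sorted so the best-ranked setting comes first.
results = pd.DataFrame(grid_gs.cv_results_)
summary = results[
    ['param_clf__C', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(summary.to_string(index=False))
```

Looking at the whole table, rather than only the winner, shows whether the best setting is clearly ahead or within noise of its neighbours.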


In [None]:

# 8) Save the trained model (so you can re‑use it later)
import joblib, os

os.makedirs('artifacts', exist_ok=True)
model_path = 'artifacts/iris_logreg_pipeline.joblib'
joblib.dump(best_model, model_path)
print(f"Saved: {model_path}")
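
Reloading works the same way in reverse with `joblib.load`. The following sketch is a self-contained round trip under a temporary directory (so it does not touch `artifacts/`); the names `pipe_io` and `restored` are illustrative. It checks that the restored pipeline predicts exactly what the original did.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_io, y_io = load_iris(return_X_y=True)
pipe_io = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200)),
]).fit(X_io, y_io)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'iris_logreg_pipeline.joblib')
    joblib.dump(pipe_io, demo_path)    # save the fitted pipeline
    restored = joblib.load(demo_path)  # load it back as a new object

same_predictions = np.array_equal(pipe_io.predict(X_io), restored.predict(X_io))
print("Restored model reproduces predictions:", same_predictions)
```

Saving the whole pipeline, not just the classifier, means the fitted scaler travels with the model. Only load joblib files from sources you trust, and preferably with the same scikit-learn version that wrote them.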



### ✅ Mini‑Exercises (try these right in the notebook)
1. Change `test_size` to `0.3` — what happens to accuracy?
2. Replace `LogisticRegression` with `RandomForestClassifier` — which performs better?
3. Remove `StandardScaler` in the pipeline — how does that impact results?
4. Change `scoring='accuracy'` to `scoring='f1_macro'` in cross‑validation. How do the conclusions change vs. accuracy?
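
A possible starting point for exercises 2 and 4 is sketched below. It assumes the full Iris data and an illustrative `candidates` dictionary, and compares two models under two scoring metrics with `cross_val_score`. Treat it as scaffolding to modify, not as the answer.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_ex, y_ex = load_iris(return_X_y=True)

# Hypothetical candidate set; swap in any estimator you want to compare.
candidates = {
    'logreg': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(max_iter=200)),
    ]),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

comparison = {}
for name, estimator in candidates.items():
    for metric in ('accuracy', 'f1_macro'):
        scores = cross_val_score(estimator, X_ex, y_ex, cv=5, scoring=metric)
        comparison[(name, metric)] = scores.mean()
        print(f"{name:>14} | {metric:<9} | mean={scores.mean():.3f} std={scores.std():.3f}")
```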



## Part B — Intro to Neural Networks with Keras/TensorFlow (MNIST)
We'll train a small fully‑connected neural network on **MNIST** handwritten digits.  
This introduces concepts like **epochs**, **batches**, and **layers**.


In [None]:

# If TensorFlow is missing in your environment, uncomment this in Colab:
# !pip -q install -U tensorflow

import numpy as np
import matplotlib.pyplot as plt

try:
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers
    print("TensorFlow:", tf.__version__)
except ImportError:
    print("TensorFlow not available. If you're in Colab, uncomment the pip install line at the top of this cell.")
    raise  # re-raise with the original traceback


In [None]:

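# 1) Load data
# Note: this rebinds y_train / y_test from Part A; re-run the Part A split before revisiting those cells.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()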

# Normalize to [0, 1]
x_train = x_train.astype('float32') / 255.0
x_test  = x_test.astype('float32') / 255.0

# Flatten 28x28 images to vectors of length 784
x_train = x_train.reshape((-1, 28*28))
x_test  = x_test.reshape((-1, 28*28))

print("Train:", x_train.shape, y_train.shape, " Test:", x_test.shape, y_test.shape)
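
The normalize-and-flatten step can be checked without downloading MNIST. This sketch applies the same two operations to a few random stand-in "images" (synthetic data, assumed here purely for illustration) and confirms the resulting shape and value range.

```python
import numpy as np

# Synthetic stand-in for MNIST: 4 random 28x28 uint8 images.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# Same preprocessing as above: scale to [0, 1], then flatten each image.
scaled = fake_images.astype('float32') / 255.0
flat = scaled.reshape((-1, 28 * 28))

print("Before:", fake_images.shape, fake_images.dtype)
print("After: ", flat.shape, flat.dtype)
print("Value range:", float(flat.min()), "to", float(flat.max()))

# Flattening only changes the layout, so reshaping back recovers the images.
restored_images = flat.reshape((-1, 28, 28))
```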


In [None]:

# 2) Build a simple model
model = keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(256, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()
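
The parameter counts that `model.summary()` reports can be checked by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A small sketch for the layer sizes used above:

```python
# Layer widths of the model above: input, two hidden layers, output.
layer_sizes = [784, 256, 128, 10]

param_counts = []
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_in * n_out  # one weight per input-unit pair
    biases = n_out          # one bias per unit
    param_counts.append(weights + biases)
    print(f"Dense({n_out:>3}) from {n_in:>3} inputs: {weights + biases:,} parameters")

total_params = sum(param_counts)
print(f"Total trainable parameters: {total_params:,}")
```

Most of the parameters sit in the first layer, because it connects all 784 pixels to 256 units.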


In [None]:

# 3) Train
history = model.fit(
    x_train, y_train,
    validation_split=0.1,
    epochs=5,
    batch_size=128,
    verbose=1
)
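
To make **epochs** and **batches** concrete, the arithmetic below works out how many weight updates the call above performs. It assumes the standard MNIST training set of 60,000 images, with `validation_split=0.1` holding out 10% for validation.

```python
import math

n_samples = 60_000       # size of the MNIST training set
validation_split = 0.1
batch_size = 128
epochs = 5

n_val = round(n_samples * validation_split)      # samples held out for validation
n_train = n_samples - n_val                      # samples actually trained on
steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch may be smaller
total_updates = steps_per_epoch * epochs

print(f"Training samples: {n_train}, validation samples: {n_val}")
print(f"Batches (weight updates) per epoch: {steps_per_epoch}")
print(f"Total weight updates over {epochs} epochs: {total_updates}")
```

One epoch is a full pass over the training samples; each batch triggers one gradient update, so a smaller `batch_size` means more (and noisier) updates per epoch.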


In [None]:

# 4) Evaluate
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print(f"Test accuracy: {test_acc:.4f}")


In [None]:

# 5) Plot training curves
plt.figure()
plt.plot(history.history['accuracy'], label='train_acc')
plt.plot(history.history['val_accuracy'], label='val_acc')
plt.title('Accuracy over epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

plt.figure()
plt.plot(history.history['loss'], label='train_loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.title('Loss over epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()


In [None]:

# 6) Predict a few samples
preds = model.predict(x_test[:10])
pred_labels = np.argmax(preds, axis=1)
print("Predicted labels:", pred_labels)
print("True labels:     ", y_test[:10])
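
Each row of `preds` is a probability distribution over the 10 digits, produced by the final `softmax` layer, and `np.argmax` picks the most likely class. The NumPy sketch below shows the same two steps on made-up scores for 2 samples and 3 classes (the `softmax` helper and the `toy_logits` values are illustrative).

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up raw scores: 2 samples x 3 classes.
toy_logits = np.array([
    [2.0, 1.0, 0.1],
    [0.5, 2.5, 0.3],
])

toy_probs = softmax(toy_logits)
toy_labels = np.argmax(toy_probs, axis=1)  # index of the largest probability per row

print("Probabilities:\n", toy_probs.round(3))
print("Row sums:", toy_probs.sum(axis=1))
print("Predicted classes:", toy_labels)
```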



### ✅ Mini‑Exercises
1. Increase `epochs` from 5 → 10 — does accuracy improve?
2. Try a different architecture (e.g., add `Dropout(0.2)` layers) — any change?
3. Change `batch_size` (64, 256) — what happens? Why?



## (Optional) Part C — Tiny Taste of NLP with Transformers
This section shows a pre‑trained text classifier. It uses **Hugging Face Transformers** to run inference with a small sentiment model.

> This may download a small model the first time it runs.


In [None]:

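# If needed in Colab, uncomment to install first:
# !pip -q install -U transformers torch --extra-index-url https://download.pytorch.org/whl/cpu

try:
    from transformers import pipeline
    nlp = pipeline('sentiment-analysis')  # uses the library's default sentiment model
    print(nlp(["I love learning ML!", "This model seems slow."]))
except ImportError:
    print("Transformers not available. If you're in Colab, uncomment the install line above and re-run.")
except Exception as e:
    # For example, the model download failed.
    print(f"Could not run the sentiment demo: {e}")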



## Wrap‑Up & Next Steps
You just went through:
- A complete **classic ML** pipeline with scikit‑learn
- A simple **neural network** with Keras/TensorFlow
- A tiny **NLP** demo with Transformers

**Where to go from here:**
- Try new datasets (e.g., `datasets.load_wine()`, `datasets.load_breast_cancer()`)
- Swap models: SVMs, Random Forests, XGBoost (requires installing `xgboost`)
- Build a simple **Flask** or **FastAPI** service to serve your model (then deploy to Cloud Run)
- Explore **Colab notebooks** from the TensorFlow tutorials and Kaggle

Happy learning! 🚀
