# Machine Learning Course - Pre-Lecture Verification Notebook

**Run this notebook AFTER completing the terminal setup instructions**


## System Information


In [15]:
import sys
import platform
import os
import warnings
warnings.filterwarnings('ignore')

print("🖥️ SYSTEM INFORMATION")
print("=" * 60)
print(f"Python Version: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"Python Executable: {sys.executable}")

# Check available memory
try:
    import psutil
    memory = psutil.virtual_memory()
    print(f"Available Memory: {memory.available / (1024**3):.2f} GB")
    print(f"Total Memory: {memory.total / (1024**3):.2f} GB")
except ImportError:
    print("Memory check: psutil not installed")

# Check GPU availability
try:
    import torch
    if torch.cuda.is_available():
        print(f"GPU Available: ✅ {torch.cuda.get_device_name(0)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / (1024**3):.2f} GB")
    else:
        print("GPU Available: ❌ CPU only (this is fine for the course!)")
except ImportError:
    print("GPU Check: PyTorch not installed")

print("=" * 60)

🖥️ SYSTEM INFORMATION
Python Version: 3.9.6 (default, Apr 30 2025, 02:07:17) 
[Clang 17.0.0 (clang-1700.0.13.5)]
Platform: macOS-15.5-arm64-arm-64bit
Python Executable: /Users/ming/Dropbox/learn-ml-by-building/ml_lectures_env/bin/python
Available Memory: 44.09 GB
Total Memory: 96.00 GB
GPU Available: ❌ CPU only (this is fine for the course!)
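
The check above only looks for CUDA, so on an Apple Silicon Mac like the one in this output it reports CPU only. As a minimal sketch, assuming a PyTorch build that exposes `torch.backends.mps` (the helper name `pick_device` is hypothetical and not part of the course code), the Metal backend can be probed as well:

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch missing: nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps exists in recent PyTorch builds; guard with getattr for older ones
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
print(f"Selected device: {device}")
```

A result of `cpu` is still fine for the course, as the original check says.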


## Verify Package Imports

In [16]:
import importlib

print("🔍 VERIFYING PACKAGE IMPORTS")
print("=" * 60)

packages_to_check = {
    "torch": "PyTorch",
    "transformers": "Transformers",
    "sentence_transformers": "Sentence-Transformers",
    "datasets": "Datasets",
    "numpy": "NumPy",
    "pandas": "Pandas",
    "sklearn": "Scikit-learn",
    "matplotlib": "Matplotlib",
    "seaborn": "Seaborn",
    "tqdm": "TQDM",
    "accelerate": "Accelerate",
}

all_imported = True
for module_name, description in packages_to_check.items():
    try:
        module = importlib.import_module(module_name)
        version = getattr(module, "__version__", "unknown")
        print(f"✅ {description:25s} v{version}")
    except ImportError as e:
        print(f"❌ {description:25s} - Not found")
        all_imported = False

if all_imported:
    print("\n🎉 All packages imported successfully!")
else:
    print("\n⚠️ Some packages missing. Check terminal setup instructions.")
print("=" * 60)

🔍 VERIFYING PACKAGE IMPORTS
✅ PyTorch                   v2.7.1
✅ Transformers              v4.55.2
✅ Sentence-Transformers     v5.1.0
✅ Datasets                  v4.0.0
✅ NumPy                     v2.0.2
✅ Pandas                    v2.3.1
✅ Scikit-learn              v1.6.1
✅ Matplotlib                v3.9.4
✅ Seaborn                   v0.13.2
✅ TQDM                      v4.67.1
✅ Accelerate                v1.10.0

🎉 All packages imported successfully!
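
The loop above falls back to "unknown" when a module has no `__version__` attribute. One possible alternative, sketched here with the standard-library `importlib.metadata`, looks up the installed distribution instead. It takes the distribution name (for example `scikit-learn`), which can differ from the import name (`sklearn`). The helper name `installed_version` is hypothetical.

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

for dist in ["torch", "scikit-learn", "sentence-transformers"]:
    print(f"{dist:25s} {installed_version(dist) or 'not installed'}")
```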


## Download and Verify Gemma-3-270m Model

Now let's download the Gemma-3-270m model. This is a small model (~550MB) that we'll use in class:

In [29]:
# Login to Hugging Face (opens a prompt in notebook)
# Make sure you've accepted the model license first:
# https://huggingface.co/google/gemma-3-270m
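from huggingface_hub import login
login()  # paste your HF token into the prompt that appears (not into this code); it is stored for future use

# Optional: configure the local cache (edit the path if you want).
# Set HF_HOME before huggingface_hub or transformers is first imported
# (restart the kernel if they are already loaded).
# TRANSFORMERS_CACHE is deprecated in recent transformers releases in favor of HF_HOME.
# import os
# os.environ["HF_HOME"] = os.path.expanduser("~/.cache/huggingface")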


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
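
The warning above is informational, and it names its own fix: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in a cell near the top of the notebook before any tokenizer is used:

```python
import os

# Keep an existing value if one is set; otherwise disable tokenizer parallelism.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
print("TOKENIZERS_PARALLELISM =", os.environ["TOKENIZERS_PARALLELISM"])
```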


In [32]:
# Explicit local download (no symlinks) + load from local path

from pathlib import Path
from huggingface_hub import snapshot_download  # pip install huggingface_hub
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# 1) Choose a lecture-local folder
local_dir = (Path.cwd() / "models" / "gemma-3-270m").resolve()
local_dir.mkdir(parents=True, exist_ok=True)
print("Local model dir:", local_dir)

# 2) Ensure you've accepted the license and are logged in (run once in a previous cell):
# from huggingface_hub import login
# login()  # paste your HF token

# 3) Download repo contents directly into local_dir.
#    Recent huggingface_hub releases write real files (no symlinks) whenever local_dir
#    is given, so the deprecated local_dir_use_symlinks flag is not needed.
snapshot_download(
    repo_id="google/gemma-3-270m",
    local_dir=str(local_dir),
)
print("✅ Files downloaded to:", local_dir)

# 4) Load from local path
tokenizer = AutoTokenizer.from_pretrained(local_dir, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    local_dir,
    torch_dtype=torch.float32,      # adjust if you have GPU: torch.float16
    low_cpu_mem_usage=True,
)
model.eval()


Local model dir: /Users/ming/Dropbox/learn-ml-by-building/Lecture 1 Overview/models/gemma-3-270m


Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

README.md:   0%|          | 0.00/28.3k [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.57k [00:00<?, ?B/s]

✅ Files downloaded to: /Users/ming/Dropbox/learn-ml-by-building/Lecture 1 Overview/models/gemma-3-270m


Gemma3ForCausalLM(
  (model): Gemma3TextModel(
    (embed_tokens): Gemma3TextScaledWordEmbedding(262144, 640, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x Gemma3DecoderLayer(
        (self_attn): Gemma3Attention(
          (q_proj): Linear(in_features=640, out_features=1024, bias=False)
          (k_proj): Linear(in_features=640, out_features=256, bias=False)
          (v_proj): Linear(in_features=640, out_features=256, bias=False)
          (o_proj): Linear(in_features=1024, out_features=640, bias=False)
          (q_norm): Gemma3RMSNorm((256,), eps=1e-06)
          (k_norm): Gemma3RMSNorm((256,), eps=1e-06)
        )
        (mlp): Gemma3MLP(
          (gate_proj): Linear(in_features=640, out_features=2048, bias=False)
          (up_proj): Linear(in_features=640, out_features=2048, bias=False)
          (down_proj): Linear(in_features=2048, out_features=640, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma3RMSNorm((640,), eps
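
The layer shapes printed above are enough to sanity-check the "270m" in the model name. The sketch below counts only the weights visible in this truncated printout: the embedding table and the bias-free linear layers of the 18 decoder blocks. The norm weights are left out because the printout is cut off, and the sketch assumes the output head shares the embedding table rather than adding its own matrix.

```python
vocab_size, hidden = 262144, 640
n_layers = 18

embedding = vocab_size * hidden

# Attention projections (bias=False, so in_features * out_features each)
attn = hidden * 1024 + hidden * 256 + hidden * 256 + 1024 * hidden  # q, k, v, o

# MLP projections: gate, up, down
mlp = hidden * 2048 + hidden * 2048 + 2048 * hidden

total = embedding + n_layers * (attn + mlp)
print(f"Embedding:         {embedding:,}")
print(f"Per decoder layer: {attn + mlp:,}")
print(f"Total (approx.):   {total:,}")
print(f"float32 size:      {total * 4 / 1024**3:.2f} GiB")
```

Under these assumptions the count comes to about 268 million parameters, or roughly 1 GiB of weights when loaded as `torch.float32`.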

In [33]:
# Reload from the local path (repeats step 4 of the previous cell; skip if the model is already loaded)
tokenizer = AutoTokenizer.from_pretrained(local_dir, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    local_dir,
    torch_dtype=torch.float32,      # adjust if you have GPU: torch.float16
    low_cpu_mem_usage=True,
)
model.eval()

# 5) Quick smoke test
text = "This is a laptop computer"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print("• LM logits shape:", outputs.logits.shape)

# max_length counts the prompt tokens plus the generated ones
gen_ids = model.generate(**inputs, max_length=40, do_sample=False)
print("Sample:", tokenizer.decode(gen_ids[0], skip_special_tokens=True))
print("✅ Gemma-3-270m loaded from local directory.")

The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


• LM logits shape: torch.Size([1, 6, 262144])
Sample: This is a laptop computer with a 15.6 inch screen. It has a 16GB of RAM and a 1TB of storage. It has a 108
✅ Gemma-3-270m loaded from local directory.
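
The logits shape above reads as (batch size, number of input tokens, vocabulary size): one prompt, 6 tokens, and 262144 scores at each position. Greedy decoding (`do_sample=False`) picks the highest-scoring token at the last position. The following is a toy sketch with random numbers and a deliberately small vocabulary, not the real model output:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, vocab = 1, 6, 32  # toy vocabulary; the real one has 262144 entries
logits = rng.normal(size=(batch, seq_len, vocab))

# Only the final position predicts the next token.
last_position = logits[:, -1, :]  # shape (batch, vocab)
next_token_id = int(last_position.argmax(axis=-1)[0])
print("logits shape:", logits.shape)
print("greedy next-token id:", next_token_id)
```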


## Verify WebShop Dataset

In [36]:
# Verify WebShop "items_ins_v2_1000.json" (attributes/instructions)
import json
import os

print("📦 VERIFYING: items_ins_v2_1000.json")
print("=" * 60)

# If running the notebook from "Lecture 1 Overview/", this is correct:
data_path = "data/items_ins_v2_1000.json"
# If running from repo root, use:
# data_path = "Lecture 1 Overview/data/items_ins_v2_1000.json"

if os.path.exists(data_path):
    with open(data_path, "r", encoding="utf-8") as f:
        items = json.load(f)

    print(f"✅ Dataset loaded. Product IDs: {len(items)}")

    # Expect top-level dict keyed by IDs -> dicts with attributes, optional instruction fields
    has_instruction = sum(1 for v in items.values() if "instruction" in v)
    has_attrs = sum(1 for v in items.values() if "attributes" in v and isinstance(v["attributes"], list))
    attr_lengths = [len(v.get("attributes", [])) for v in items.values()]
    avg_attrs = (sum(attr_lengths) / len(attr_lengths)) if attr_lengths else 0.0

    print("\n📊 Stats:")
    print(f"  • Entries with attributes: {has_attrs}")
    print(f"  • Avg attributes per entry: {avg_attrs:.2f}")
    print(f"  • Entries with instruction: {has_instruction}")

    # Show a sample entry
    first_key = next(iter(items))
    sample = items[first_key]
    print("\n📋 Sample Entry:")
    print(f"  • ID: {first_key}")
    print(f"  • attributes: {sample.get('attributes', [])[:8]}")
    if "instruction" in sample:
        print(f"  • instruction: {sample['instruction'][:80]}...")
        print(f"  • instruction_attributes: {sample.get('instruction_attributes', [])[:8]}")
else:
    print(f"❌ Dataset not found at {data_path}")
    print("   Tip: ensure you run this notebook from 'Lecture 1 Overview/' or fix the relative path.")

print("=" * 60)

📦 VERIFYING: items_ins_v2_1000.json
✅ Dataset loaded. Product IDs: 1000

📊 Stats:
  • Entries with attributes: 1000
  • Avg attributes per entry: 2.83
  • Entries with instruction: 415

📋 Sample Entry:
  • ID: B08GFNJN5R
  • attributes: []
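
The path comments in the cell above cover two working directories. A small sketch that tries both, assuming the folder layout named in those comments (the helper name `find_data_file` is hypothetical):

```python
from pathlib import Path

def find_data_file(filename, base=Path(".")):
    """Return the first existing candidate path for filename, or None."""
    candidates = [
        base / "data" / filename,                         # run from "Lecture 1 Overview/"
        base / "Lecture 1 Overview" / "data" / filename,  # run from the repo root
    ]
    for path in candidates:
        if path.exists():
            return path
    return None

data_file = find_data_file("items_ins_v2_1000.json")
print(data_file if data_file else "Dataset not found in either location")
```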


In [37]:
# Verify "items_shuffle_1000.json" (richer fields, may include name/price/etc.)
import json
import os

path2 = "data/items_shuffle_1000.json"  # or "Lecture 1 Overview/data/..." from repo root
if os.path.exists(path2):
    with open(path2, "r", encoding="utf-8") as f:
        items2 = json.load(f)

    print(f"✅ items_shuffle_1000.json loaded: {len(items2)}")
    # Decide if it's a list or dict
    if isinstance(items2, dict):
        values = items2.values()
    else:
        values = items2

    names = [x.get("name") for x in values if isinstance(x, dict)]
    print("📋 Sample name:", next((n for n in names if n), "N/A"))
else:
    print(f"ℹ️ Not found: {path2}")

✅ items_shuffle_1000.json loaded: 1000
📋 Sample name: Vhomes Lights Reclaimed Wood Console Table The Genessis Collection Sofa-Tables


## Test ML Components


In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sentence_transformers import SentenceTransformer

print("🧠 TESTING ML COMPONENTS")
print("=" * 60)

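# Test scikit-learn (the labels are random, so accuracy near chance, about 0.33,
# is expected; this only checks that the API runs)
X = np.random.rand(100, 10)
y = np.random.randint(0, 3, 100)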
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(f"✅ Scikit-learn KNN: {accuracy:.2f} accuracy")

# Test sentence transformers
print("\nLoading sentence transformer (first run may take a minute)...")
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
test_sentences = ["laptop computer", "cat toy", "fiction book"]
embeddings = sentence_model.encode(test_sentences)
print(f"✅ Sentence embeddings: shape {embeddings.shape}")

# Test clustering
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(embeddings)
print(f"✅ KMeans clustering: {labels}")

# Test matplotlib
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1, figsize=(4, 3))
ax.scatter([1, 2, 3], [1, 4, 2])
ax.set_title("Test Plot")
plt.close(fig)  # close this figure explicitly
print("✅ Matplotlib: working")

print("=" * 60)

## Troubleshooting

### Common Issues:

**"Cannot download Gemma model"**
- Accept license at: https://huggingface.co/google/gemma-3-270m
- Run: `huggingface-cli login` in terminal

**"Module not found"**
- Ensure virtual environment is activated
- Re-run package installation from terminal setup
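
To confirm from inside the notebook that the kernel is running in a virtual environment, one option is to compare `sys.prefix` with `sys.base_prefix`; they differ inside a `venv`. This is a sketch for environments created with `venv` or `virtualenv`; conda environments are not detected this way. The helper name `in_virtualenv` is hypothetical.

```python
import sys

def in_virtualenv():
    """True when the interpreter runs inside a venv/virtualenv environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

print("Python executable:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```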

**"Dataset not found"**
- Check file exists in `Lecture 1 Overview/data/`
- Re-run download commands from terminal setup

**"Out of memory"**
- Close other applications
- The setup only needs ~2GB RAM