<a href="https://colab.research.google.com/github/lkevin2018/cog-320-lecture-examples/blob/main/nlp_resume_bert_workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧪 NLP Workshop: Fine-Tuning BERT on Resumes + Pushing to Hugging Face & GitHub

In this live exercise, you'll:

1. Load a **resume dataset** from Kaggle  
2. Fine-tune a **BERT text classification model** on resume text  
3. Do a **simple bias exploration** of the dataset using a pre-trained BERT-based model  
4. Save the fine-tuned model as a **`.pt` file**  
5. Push the model to a **private Hugging Face repo**  
6. Push this notebook/code into **your own GitHub fork**


## ✅ Step 0: Install required libraries

Run this cell first.

We’ll use:

- `transformers` – BERT models & training utilities  
- `datasets` – dataset handling  
- `accelerate` – required by the `Trainer` API for PyTorch training  
- `pandas` – CSV loading & exploration  
- `sklearn` (installed as `scikit-learn`) – train/test split & metrics  
- `kagglehub` – download the dataset from Kaggle  
- `huggingface_hub` – upload model to Hugging Face  


In [None]:
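# Install the supporting libraries
!pip install -q datasets huggingface_hub accelerate scikit-learn pandas kagglehub
# Remove the preinstalled transformers and TensorFlow/Keras packages
# (likely to avoid version conflicts in Colab)
!pip uninstall -y transformers
!pip uninstall -y tensorflow tensorflow-text keras
# Reinstall the latest transformers with matching accelerate and datasets versions
!pip install -U transformers accelerate datasets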

## ✅ Step 1: Imports & device check

If this cell runs, your environment is ready.


In [None]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import torch

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    pipeline,
)

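from huggingface_hub import HfApi

# HfFolder is deprecated and may be missing from newer huggingface_hub releases
try:
    from huggingface_hub import HfFolder
except ImportError:
    HfFolder = None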

print("Torch version:", torch.__version__)
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

## ✅ Step 2: Load the Kaggle Resume Dataset

We'll use this dataset from Kaggle:  

**Kaggle dataset:** <https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset>


The dataset typically contains columns like:

- `ID` – unique identifier  
- `Resume_str` – resume text as plain string  
- `Resume_html` – resume in HTML format  
- `Category` – job category label (e.g., `HR`, `INFORMATION-TECHNOLOGY`, etc.)  


In [None]:
import kagglehub
import pandas as pd
import os

path = kagglehub.dataset_download("snehaanbhawal/resume-dataset")
print("📁 Dataset downloaded to:", path)

# Try to locate a CSV file automatically
csv_file = None
for root, dirs, files in os.walk(path):
    for f in files:
        if f.lower().endswith(".csv"):
            csv_file = os.path.join(root, f)
            break
    if csv_file:
        break

if not csv_file:
    raise FileNotFoundError(
        f"No CSV found in dataset folder: {path}. "
        "Check Kaggle dataset structure."
    )

print("📄 Using CSV file:", csv_file)

# Load into DataFrame
df = pd.read_csv(csv_file)
print("Dataframe shape:", df.shape)
df.head()

## 🔍 Step 3: Explore the dataset & basic bias signals

We’ll take a quick look at:

- Column names  
- Example rows  
- How many resumes per `Category`  
- Basic text length stats

This is a **very shallow** view, but even this can show **representation imbalance** (e.g., some categories heavily overrepresented).
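
As a rough illustration of what "representation imbalance" means numerically, the sketch below computes each category's share of the data and the ratio between the largest and smallest class. The category names and counts are made up for this example and are not taken from the Kaggle data.

```python
from collections import Counter

# Hypothetical labels; the real notebook reads these from df["Category"].
toy_labels = ["HR"] * 6 + ["ENGINEERING"] * 3 + ["CHEF"] * 1

counts = Counter(toy_labels)
total = sum(counts.values())

# Share of the dataset taken by each category.
shares = {label: n / total for label, n in counts.items()}

# Ratio of the most common to the least common category (1.0 means balanced).
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{label:12s} {share:.0%}")
print("Imbalance ratio:", imbalance_ratio)
```

A ratio far above 1 suggests the classifier will see many more examples of some categories than others, which is worth keeping in mind when reading per-class metrics later.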

In [None]:
print("Columns:", df.columns.tolist())

# Drop rows with missing text or labels
df = df.dropna(subset=["Resume_str", "Category"])

print("\nNumber of rows after dropping missing:", len(df))

print("\nCategory value counts:")
print(df["Category"].value_counts())

# Add a simple text length column
df["text_length"] = df["Resume_str"].str.len()
print("\nText length stats:")
print(df["text_length"].describe())

## ✅ Step 4: Prepare data for BERT fine-tuning

We’ll:

1. Use the `Resume_str` column as the input text  
2. Use `Category` as the label  
3. Map each unique category to an integer ID  
4. Split into **train** and **validation** sets  
5. Wrap everything into a Hugging Face `Dataset`
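
To show what the label mapping and a stratified split do, here is a minimal pure-Python sketch on toy data. The helper `stratified_split` is hypothetical and only a simplified stand-in for scikit-learn's `train_test_split(..., stratify=...)`; the labels are invented.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with roughly equal class proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for validation.
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Hypothetical label column: 10 "HR" rows and 5 "CHEF" rows.
toy_labels = ["HR"] * 10 + ["CHEF"] * 5

# Map each unique category to an integer ID, in sorted order.
label2id = {label: i for i, label in enumerate(sorted(set(toy_labels)))}

train_idx, val_idx = stratified_split(toy_labels)
print(label2id)
print("train:", len(train_idx), "val:", len(val_idx))
```

Because the split is done per class, both classes keep the same 80/20 proportion instead of a small class landing entirely in one split by chance.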


In [None]:
# Use these column names (change here if your CSV differs)
TEXT_COL = "Resume_str"
LABEL_COL = "Category"

# Create label mappings
label_list = sorted(df[LABEL_COL].unique())
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for label, i in label2id.items()}

print("Number of labels:", len(label_list))
print("Example label mapping (first 10):", list(label2id.items())[:10])

df["label"] = df[LABEL_COL].map(label2id)

# Train/validation split
train_df, val_df = train_test_split(
    df[[TEXT_COL, "label"]],
    test_size=0.2,
    random_state=42,
    stratify=df["label"]
)

print("Train size:", len(train_df))
print("Validation size:", len(val_df))

train_ds = Dataset.from_pandas(train_df.reset_index(drop=True))
val_ds = Dataset.from_pandas(val_df.reset_index(drop=True))

## ✅ Step 5: Tokenize text for BERT

We’ll use the base BERT model:

- Model checkpoint: `bert-base-uncased`

We will:

- Tokenize the resume text  
- Truncate long resumes to a max length (e.g. 256 tokens to keep training fast in a workshop)
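
The effect of `truncation=True` and `padding="max_length"` can be sketched with a toy whitespace tokenizer. This is only an illustration: `toy_encode` is a hypothetical helper, and the real BERT tokenizer splits text into WordPiece subwords and adds special `[CLS]` and `[SEP]` tokens, which this sketch leaves out.

```python
def toy_encode(text, max_length=8):
    """Truncate or pad a whitespace-tokenized text to exactly max_length tokens."""
    tokens = text.lower().split()[:max_length]   # truncation: drop extra tokens
    attention_mask = [1] * len(tokens)           # 1 marks a real token
    n_pad = max_length - len(tokens)
    tokens = tokens + ["[PAD]"] * n_pad          # padding: fill up to max_length
    attention_mask = attention_mask + [0] * n_pad  # 0 marks padding
    return {"tokens": tokens, "attention_mask": attention_mask}

short = toy_encode("Python developer with SQL experience")
long = toy_encode("Managed a team of ten engineers across three product lines")

print(short)
print(long)
```

Every example ends up the same length, so batches can be stacked into tensors; the attention mask tells the model which positions to ignore. Anything past `max_length` is simply lost, which is the trade-off of using a small value for speed.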


In [None]:
MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

max_length = 256  # keep it small for live training

def tokenize_batch(batch):
    return tokenizer(
        batch[TEXT_COL],
        padding="max_length",
        truncation=True,
        max_length=max_length,
    )

train_ds_tok = train_ds.map(tokenize_batch, batched=True)
val_ds_tok = val_ds.map(tokenize_batch, batched=True)

# Drop the raw text column, then return PyTorch tensors
train_ds_tok = train_ds_tok.remove_columns([TEXT_COL])
val_ds_tok = val_ds_tok.remove_columns([TEXT_COL])

train_ds_tok.set_format("torch")
val_ds_tok.set_format("torch")

train_ds_tok[0]

## ✅ Step 6: Load pre-trained BERT for classification

We’ll fine-tune:

- `bert-base-uncased`  
- With `num_labels = number of resume categories`

This is where **transfer learning** happens.
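
To make this concrete: the pre-trained encoder weights are reused, and only a small linear classification layer on top is newly initialized. The arithmetic below estimates the size of that new layer. The hidden size of 768 is the `bert-base-uncased` value; the label count of 24 is an assumption for illustration, so use `len(label_list)` for the real number.

```python
HIDDEN_SIZE = 768   # hidden size of bert-base-uncased
NUM_LABELS = 24     # assumed number of resume categories (illustrative)

# A linear layer has one weight per (input, output) pair plus one bias per output.
head_weights = HIDDEN_SIZE * NUM_LABELS
head_biases = NUM_LABELS
head_params = head_weights + head_biases

print("Newly initialized classifier parameters:", head_params)
```

Compared with the roughly 110 million parameters in the pre-trained encoder, the new head is tiny, which is why a short fine-tuning run can already give usable results.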


In [None]:
num_labels = len(label_list)

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)

model.to(device)
print("Model loaded on", device)

## ✅ Step 7: Set up training

We’ll use the Hugging Face `Trainer` API for convenience.

For a **live workshop**, keep training light:

- `num_train_epochs = 1`  
- Small batch size (depending on GPU memory)  
- This is for **demo**, not production!
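
A quick back-of-the-envelope calculation shows how batch size and epoch count translate into optimizer steps and log lines. The training-set size here is hypothetical, and the sketch assumes a single device with no gradient accumulation.

```python
import math

n_train = 2000          # hypothetical number of training rows
batch_size = 8          # per-device batch size
num_train_epochs = 1
logging_steps = 50

# One optimizer step per batch; the last batch may be smaller.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Number of times the training loss is logged.
log_events = total_steps // logging_steps

print("Steps per epoch:", steps_per_epoch)
print("Total steps:", total_steps)
print("Log events:", log_events)
```

Doubling the batch size halves the number of steps but needs more GPU memory, so the value above is a compromise for a live session.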


In [None]:
batch_size = 8

training_args = TrainingArguments(
    output_dir="./resume-bert-output",
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=50,
    num_train_epochs=1,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="none"
)

from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    """Compute accuracy from the logits and labels that Trainer passes in."""
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {"accuracy": accuracy_score(labels, preds)}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds_tok,
    eval_dataset=val_ds_tok,
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument (transformers >= 4.46)
    compute_metrics=compute_metrics,
)

print("Trainer ready")

## ▶️ Step 8: Train the model (live)

Now we fine-tune BERT on the resume dataset.

> This step **may take a few minutes** depending on GPU/TPU.


In [None]:
trainer.train()

## ✅ Step 9: Evaluate on validation set

We’ll get:

- Accuracy (from our metric function)  
- A full classification report for a deeper look
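
The reason for looking beyond accuracy can be seen in a small hand-computed example. The sketch below is a simplified, pure-Python version of the per-class precision and recall that a classification report contains; `per_class_report` is a hypothetical helper and the labels are invented.

```python
def per_class_report(y_true, y_pred):
    """Return precision and recall for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Use 0.0 when a class is never predicted or never present.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[label] = {"precision": precision, "recall": recall}
    return report

y_true = ["HR", "HR", "HR", "CHEF", "CHEF", "ARTS"]
y_pred = ["HR", "HR", "CHEF", "CHEF", "HR", "HR"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_report(y_true, y_pred)

print("Accuracy:", accuracy)
for label, scores in report.items():
    print(label, scores)
```

Here the overall accuracy is 50%, but the `ARTS` class is never predicted at all, which only the per-class view reveals. After a single epoch on many categories, expect several classes to look like this.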


In [None]:
eval_results = trainer.evaluate()
print("Eval results:", eval_results)

# Get predictions & classification report
preds_output = trainer.predict(val_ds_tok)
preds = preds_output.predictions.argmax(axis=-1)
true_labels = preds_output.label_ids

print("\nDetailed classification report:")
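print(classification_report(
    true_labels,
    preds,
    labels=list(range(len(label_list))),  # keep every class, even ones missing from this split
    target_names=label_list,
    zero_division=0,  # report 0 instead of warning for classes that are never predicted
))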

## 💾 Step 10: Save fine-tuned model and `.pt` file

We’ll:

1. Save the full Hugging Face model directory (`trainer.save_model`, plus the tokenizer via `save_pretrained`)  
2. Save a pure PyTorch `.pt` state dict file  


In [None]:
save_dir = "resume-bert-finetuned"
trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)

# Also save a plain .pt file with state dict
pt_path = "resume_bert_finetuned_state_dict.pt"
torch.save(model.state_dict(), pt_path)

print("Saved HF model to:", save_dir)
print("Saved PyTorch .pt file to:", pt_path)

## 🎯 Step 11: Simple bias exploration with a BERT-based zero-shot model

We’ll use a **BERT-based MNLI model** from Hugging Face as a **zero-shot classifier** to infer **high-level, non-sensitive features** about each resume, such as:

- `"technical"`  
- `"non-technical"`  
- `"management"`  
- `"entry-level"`  
- `"senior-level"`  

Then we’ll check how these “features” are distributed across categories to see potential **representation imbalances**.

Model used (BERT-based MNLI): `ishan/bert-base-uncased-mnli`

> ⚠️ Note: This is a **rough heuristic**, not a fairness-certified bias audit. It’s just a teaching tool for thinking critically about datasets.
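
How an NLI model acts as a zero-shot classifier can be sketched in a few lines. Each candidate label is turned into a hypothesis sentence (the pipeline's default template is `"This example is {}."`), the model scores how strongly the resume entails each hypothesis, and with `multi_label=False` those entailment scores are normalized with a softmax across the labels. The logits below are hypothetical placeholders, not model output.

```python
import math

candidate_features = ["technical", "non-technical", "management"]

# One hypothesis per candidate label.
hypotheses = [f"This example is {label}." for label in candidate_features]

# Hypothetical entailment logits, one per hypothesis (a real NLI model produces these).
entailment_logits = [2.0, 0.0, 1.0]

# Softmax across candidate labels so the scores sum to 1.
exps = [math.exp(x) for x in entailment_logits]
scores = [e / sum(exps) for e in exps]

# Rank labels from most to least likely; the top one becomes the inferred feature.
ranked = sorted(zip(candidate_features, scores), key=lambda pair: -pair[1])
top_label = ranked[0][0]

print(hypotheses)
print(ranked)
print("Top label:", top_label)
```

Because the scores are relative to the other candidates, the top label only means "most plausible among the options offered", so the choice of candidate labels shapes the result.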


In [None]:
from transformers import AutoModelForSequenceClassification

feature_model_name = "ishan/bert-base-uncased-mnli"

feature_classifier = pipeline(
    "zero-shot-classification",
    model=feature_model_name,
    tokenizer=feature_model_name,
    device=0 if device == "cuda" else -1,
)

candidate_features = ["technical", "non-technical", "management", "entry-level", "senior-level"]

# Sample a subset for speed
sample_df = df.sample(n=min(200, len(df)), random_state=42).copy()

feature_labels = []

for text in sample_df[TEXT_COL].tolist():
    # Keep resumes short-ish for speed
    text_short = text[:2000]
    result = feature_classifier(
        text_short,
        candidate_features,
        multi_label=False,
    )
    feature_labels.append(result["labels"][0])

sample_df["inferred_feature"] = feature_labels
sample_df[["Category", "inferred_feature"]].head()

### 🔎 Aggregate “feature” distributions

Now we’ll see:

- How often each inferred feature appears overall  
- How they are distributed by `Category`  

This can highlight where the dataset might be skewed (e.g., certain categories mostly considered “technical” or “entry-level”).
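
As a small worked example of a row-normalized cross-tabulation (what `pd.crosstab(..., normalize="index")` computes), the sketch below counts feature labels per category and divides each row by its total. The `(category, feature)` pairs are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (Category, inferred_feature) pairs.
rows = [
    ("HR", "management"),
    ("HR", "management"),
    ("HR", "entry-level"),
    ("ENGINEERING", "technical"),
    ("ENGINEERING", "technical"),
    ("ENGINEERING", "technical"),
    ("ENGINEERING", "management"),
]

# Count how often each feature appears within each category.
counts = defaultdict(Counter)
for category, feature in rows:
    counts[category][feature] += 1

# Divide each count by its row total so every category sums to 1.
proportions = {
    category: {feature: n / sum(c.values()) for feature, n in c.items()}
    for category, c in counts.items()
}

for category, dist in proportions.items():
    print(category, dist)
```

Normalizing by row makes categories of different sizes comparable, but it also hides how few resumes a proportion may be based on, so it helps to check the raw counts as well.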

In [None]:
!pip install -q bokeh

from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.transform import cumsum
from bokeh.palettes import Category10, Category20
from bokeh.layouts import gridplot
import numpy as np

output_notebook()  # render bokeh plots inline in Colab

In [None]:
print("Overall inferred feature distribution:")
print(sample_df["inferred_feature"].value_counts())

print("\nFeature distribution by Category:")
cross_tab = pd.crosstab(sample_df["Category"], sample_df["inferred_feature"], normalize="index")
cross_tab

In [None]:
# ==============================
# Interactive bias visualization with Bokeh
# ==============================

# Overall counts of inferred features
overall_counts = sample_df["inferred_feature"].value_counts()

# Normalized feature distribution by Category (proportions)
cross_tab = pd.crosstab(
    sample_df["Category"],
    sample_df["inferred_feature"],
    normalize="index"
)

print("Categories:", list(cross_tab.index))
print("Inferred features:", list(cross_tab.columns))

# ---------- Overall pie chart (all resumes) ----------
overall_df = overall_counts.reset_index()
overall_df.columns = ["feature", "count"]
overall_df["proportion"] = overall_df["count"] / overall_df["count"].sum()
overall_df["angle"] = overall_df["proportion"] * 2 * np.pi

# Choose a color palette large enough
palette = Category10[10] if len(overall_df) <= 10 else Category20[20]
overall_df["color"] = [palette[i % len(palette)] for i in range(len(overall_df))]

p_overall = figure(
    height=350,
    width=400,
    title="Overall inferred feature distribution",
    tools="hover",
    tooltips="@feature: @proportion{0.0%}",
    x_range=(-1, 1),
    y_range=(-1, 1),
)

p_overall.wedge(
    x=0,
    y=0,
    radius=0.8,
    start_angle=cumsum("angle", include_zero=True),
    end_angle=cumsum("angle"),
    line_color="white",
    fill_color="color",
    legend_field="feature",
    source=overall_df,
)

p_overall.legend.location = "right"

# ---------- Pie chart per Category ----------
category_figs = []
categories = list(cross_tab.index)
features = list(cross_tab.columns)

for cat_idx, cat in enumerate(categories):
    row = cross_tab.loc[cat].reset_index()
    row.columns = ["feature", "proportion"]

    # Skip completely empty rows (just in case)
    if row["proportion"].sum() == 0:
        continue

    row["angle"] = row["proportion"] * 2 * np.pi
    row["color"] = [palette[i % len(palette)] for i in range(len(row))]

    p_cat = figure(
        height=250,
        width=250,
        title=str(cat),
        tools="hover",
        tooltips="@feature: @proportion{0.0%}",
        x_range=(-1, 1),
        y_range=(-1, 1),
    )

    p_cat.wedge(
        x=0,
        y=0,
        radius=0.8,
        start_angle=cumsum("angle", include_zero=True),
        end_angle=cumsum("angle"),
        line_color="white",
        fill_color="color",
        source=row,
    )

    category_figs.append(p_cat)

# Arrange category pies in a grid (3 per row)
grid = gridplot(category_figs, ncols=3)

# Show the grid of per-category pies, then the overall pie
show(grid)
show(p_overall)

## ☁️ Step 12: Push the `.pt` model to a **private Hugging Face repo**

### One-time setup (in browser)

1. Go to <https://huggingface.co> and create an account (if you don’t already have one).  
2. Go to **Settings → Access Tokens** and create a token with **`write`** permissions.  
3. Keep the token ready (you’ll paste it into Colab; it won’t be saved in the notebook).

### In this notebook

We will:

1. Log in programmatically by saving the token securely.  
2. Create (or reuse) a **private** model repo.  
3. Upload the `.pt` file and optionally the full HF model directory.


In [None]:
from getpass import getpass
from huggingface_hub import get_token, login

# 🔐 1. Log in with your Hugging Face token for this Colab session
if not get_token():
    hf_token = getpass("Enter your Hugging Face token (with write permissions): ")
    login(token=hf_token)  # replaces the deprecated HfFolder.save_token
else:
    print("Hugging Face token already set for this session.")

api = HfApi()

# 🔧 2. Set your repo info here
HF_USERNAME = "kevinbjoseph"   # <-- CHANGE THIS
HF_MODEL_REPO = "resume-bert-demo-12-1" # <-- CHANGE THIS if you like

repo_id = f"{HF_USERNAME}/{HF_MODEL_REPO}"

# Create repo if it doesn't exist (private=True)
api.create_repo(repo_id=repo_id, private=True, exist_ok=True)
print("Using Hugging Face repo:", repo_id)

# 3. Upload the .pt file
api.upload_file(
    path_or_fileobj="resume_bert_finetuned_state_dict.pt",
    path_in_repo="resume_bert_finetuned_state_dict.pt",
    repo_id=repo_id,
)

# 4. Optionally upload the full HF model directory in a single commit
# (this lets you later call `from_pretrained(repo_id)` easily)
print(f"Uploading {save_dir} -> {repo_id}")
api.upload_folder(
    folder_path=save_dir,
    repo_id=repo_id,
)

print("✅ Upload complete!")

## 🐙 Step 13: Put this notebook into your own GitHub repo

---






## 🎉 Wrap-Up

In this notebook, you:

- Loaded the **Kaggle resume dataset**  
- Fine-tuned a **BERT text classifier** on resume text  
- Saved the model as a **Hugging Face-style directory** and a **`.pt` file**  
- Used a **BERT-based MNLI model** to infer high-level, non-sensitive “features” and inspect basic dataset bias patterns  
- Uploaded your model to a **private Hugging Face model repo**  
- Learned two ways to put the notebook into **GitHub**

This is a **starting point** for thinking about:

- How dataset composition affects models  
- How to responsibly handle model + data sharing  
- How to structure a small NLP project end-to-end  
