## Phase 4: Transformer-based NLP (BERT)

This is the stage where:

- Accuracy usually improves significantly
- You work with pretrained language models
- You connect to LLM concepts asked in interviews

You’ll learn:

- Tokenizers
- Transformers
- Pretrained models
- Fine-tuning

In [4]:
%pip install transformers
from transformers import pipeline

In [7]:
%pip install torch
import torch
print(torch.__version__)




## Phase 4 Goal

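In this phase, instead of building a model from scratch, we run text through a pretrained pipeline:

Text → BERT tokenizer → BERT model → sentiment prediction


BERT already:

- Encodes general English vocabulary and grammar learned during pretraining
- Represents each word in context, using the words on both sides of it
- Has been pretrained on large unlabeled corpora (BooksCorpus and English Wikipedia)

We only fine-tune it on the sentiment data.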

In [10]:
%pip install pandas
import pandas as pd

%pip install scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments
)



## Step 3: Load dataset

In [12]:
df = pd.read_csv("../data/sentimentdataset.csv")

texts = df["Text"].astype(str)
labels = df["Sentiment"]


## Step 4: Encode labels

In [13]:
encoder = LabelEncoder()
y = encoder.fit_transform(labels)
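
To make the encoding step concrete, here is a minimal pure-Python sketch of the mapping that `LabelEncoder` builds: the unique labels are sorted and each one gets an integer id. The sample labels are hypothetical and are not rows from the dataset.

```python
# Pure-Python sketch of what LabelEncoder.fit_transform does.
# The sample labels below are hypothetical, not taken from the CSV.
sample_labels = ["Positive", "Negative", "Neutral", "Positive"]

# LabelEncoder keeps the sorted unique labels in `classes_`.
classes = sorted(set(sample_labels))
label_to_id = {label: idx for idx, label in enumerate(classes)}

encoded = [label_to_id[label] for label in sample_labels]
decoded = [classes[idx] for idx in encoded]  # what inverse_transform would give back

print(classes)  # ['Negative', 'Neutral', 'Positive']
print(encoded)  # [2, 0, 1, 2]
```

With the real encoder, `encoder.inverse_transform(...)` turns predicted ids back into label strings in the same way.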


## Step 5: Train–test split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.2, random_state=42
)
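
The sketch below illustrates what `test_size=0.2` and a fixed seed mean in practice. It is a simplified stand-in, not scikit-learn's implementation: the helper `simple_split` is hypothetical, it assumes the test count is rounded up, and its shuffle order will differ from the one `train_test_split` produces.

```python
import math
import random


def simple_split(items, test_size=0.2, seed=42):
    """Shuffle-and-slice split; a simplified stand-in for train_test_split."""
    n_test = math.ceil(test_size * len(items))  # assumption: test count is rounded up
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed -> same shuffle every run
    return shuffled[n_test:], shuffled[:n_test]


rows = [f"text {i}" for i in range(10)]  # hypothetical rows
train_rows, test_rows = simple_split(rows)
print(len(train_rows), len(test_rows))  # 8 2
```

Fixing the seed is what makes the split reproducible between notebook runs.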

## Step 6: Load BERT tokenizer

In [15]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
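
BERT's tokenizer splits words into WordPiece subwords. The toy function below sketches the greedy longest-match-first idea behind that split. The vocabulary is a hypothetical six-entry set (the real `bert-base-uncased` vocabulary has about 30,000 entries), and the sketch skips lowercasing, punctuation handling, and special tokens.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"play", "##ing", "##ed", "un", "##happy", "happy"}  # hypothetical
print(wordpiece("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece("unhappy", toy_vocab))  # ['un', '##happy']
print(wordpiece("xyz", toy_vocab))      # ['[UNK]']
```

Splitting into subwords lets the model handle words it never saw whole during pretraining.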

## Step 7: Tokenize text

In [16]:
train_encodings = tokenizer(
    list(X_train),
    truncation=True,
    padding=True,
    max_length=64
)

test_encodings = tokenizer(
    list(X_test),
    truncation=True,
    padding=True,
    max_length=64
)
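
The effect of `truncation=True`, `padding=True`, and `max_length` can be sketched in plain Python. The helper `pad_and_truncate` is hypothetical and simplified: it slices sequences directly, whereas the real tokenizer truncates before adding `[CLS]` and `[SEP]`, so the closing `[SEP]` is preserved. The word ids in the example are made up; only 101 (`[CLS]`), 102 (`[SEP]`), and 0 (`[PAD]`) follow the `bert-base-uncased` convention.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 7592, 102], [101, 2023, 2003, 2307, 999, 102]]  # made-up ids
enc = pad_and_truncate(batch, max_length=4)
print(enc["input_ids"])       # [[101, 7592, 102, 0], [101, 2023, 2003, 2307]]
print(enc["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

Because `padding=True` pads to the longest sequence in each call, the train and test encodings may end up with different widths, each at most `max_length`.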


## Step 8: Create PyTorch dataset

In [17]:
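class SentimentDataset(torch.utils.data.Dataset):
    """Wraps tokenizer output and integer labels for use with Trainer."""

    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict-like: input_ids, attention_mask, ...
        self.labels = labels        # one integer class id per example

    def __getitem__(self, idx):
        # Build one example: a tensor per tokenizer field plus its label.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)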


In [18]:
train_dataset = SentimentDataset(train_encodings, list(y_train))
test_dataset = SentimentDataset(test_encodings, list(y_test))


## Step 9: Load BERT model

In [19]:
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(encoder.classes_)
)


If this cell raises `ImportError: BertForSequenceClassification requires the PyTorch library but it was not found in your environment`, the most likely cause is that the kernel imported `transformers` before `torch` was installed. Let the `%pip install torch` cell finish, restart the kernel, and rerun the notebook from the top. Platform-specific installation instructions are at https://pytorch.org/get-started/locally/.
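
`BertForSequenceClassification` puts a classification layer on top of BERT that outputs one score (logit) per label, which is why `num_labels` must match the number of encoded classes. The sketch below shows how such logits become a prediction. The class names and logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["Negative", "Neutral", "Positive"]  # hypothetical label set
logits = [0.2, -1.0, 2.3]                          # made-up model output for one text
probs = softmax(logits)
predicted = class_names[probs.index(max(probs))]
print(predicted)  # Positive
```

In the notebook, the position of the largest logit would be passed to `encoder.inverse_transform` to recover the original label string.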


## Step 10: Training configuration

In [None]:
training_args = TrainingArguments(
    output_dir="./results",           # checkpoints are written here
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    # logging_dir is deprecated in recent transformers releases;
    # set the TENSORBOARD_LOGGING_DIR environment variable instead.
    logging_steps=10                  # log the training loss every 10 steps
)
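
As a rough guide to what these settings imply, the arithmetic below estimates the number of optimizer steps and log lines. The training-set size is hypothetical, and the estimate assumes a single device, no gradient accumulation, and that the last partial batch is kept.

```python
import math

n_train = 800        # hypothetical number of training examples
batch_size = 8       # per_device_train_batch_size
epochs = 2           # num_train_epochs
logging_steps = 10

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
log_events = total_steps // logging_steps

print(steps_per_epoch, total_steps, log_events)  # 100 200 20
```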


## Step 11: Trainer

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)
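
As configured, the `Trainer` reports only the loss during evaluation. One common way to also get accuracy is a `compute_metrics` function, sketched below under the assumption that it receives a `(logits, labels)` pair of NumPy arrays. The arrays in the example are fabricated purely to exercise the function.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Return accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # highest-scoring class per example
    return {"accuracy": float((predictions == labels).mean())}


# Fabricated values: 4 examples, 2 classes.
fake_logits = np.array([[2.0, 0.1], [0.3, 1.5], [1.2, 0.4], [0.2, 0.9]])
fake_labels = np.array([0, 1, 1, 1])
print(compute_metrics((fake_logits, fake_labels)))  # {'accuracy': 0.75}
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`, then call `trainer.evaluate()` after training.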



## Step 12: Train model

In [None]:
trainer.train()
