# Language data for Machine Learning

### Introduction

Language data—also called *text data*—is a core part of many machine learning tasks, especially those involving Natural Language Processing (NLP). Unlike numbers in a spreadsheet, text is *unstructured*, meaning it doesn’t follow a fixed format that a machine can easily work with. Before we can use it in a model, we need to turn that text into something numerical and structured.

This usually involves steps such as:
- *Tokenisation* – breaking the text into words or smaller parts  
- *Vectorisation* – converting those words into numbers using methods like *TF-IDF*, *word embeddings*, or *one-hot encoding*  
- *Language modelling* – understanding how words relate to each other in a sentence or across documents
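
To make the vectorisation step more concrete, here is a minimal sketch of one-hot encoding in plain Python. The toy sentence and the helper name `one_hot` are made up for illustration; in practice you would normally rely on a library vectoriser rather than writing this by hand.

```python
# Toy sentence, made up for illustration
sentence = "the cat sat on the mat"

# Tokenisation: split on whitespace (a deliberately simple choice)
tokens = sentence.split()

# Build a vocabulary: each unique word gets an index, in alphabetical order
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}

def one_hot(word, vocab):
    """Return a list with a single 1 at the position of `word` in the vocabulary."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

print(vocab)                  # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
print(one_hot("cat", vocab))  # [1, 0, 0, 0, 0]
```

Each vector is as long as the vocabulary and contains a single `1`, which is why one-hot encoding becomes very sparse for large vocabularies.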

There are different kinds of machine learning models we can use for language data:
- Traditional models like *Naïve Bayes* or *Support Vector Machines (SVMs)* work well for simple text classification tasks  
- Deep learning models like *Recurrent Neural Networks (RNNs)* and *Transformers* (e.g. BERT or GPT) are better at capturing context and word meaning in more complex tasks

These approaches are used in real-world applications such as:
- Sentiment analysis (e.g. classifying reviews as positive or negative)  
- Machine translation (e.g. English to French)  
- Chatbots and virtual assistants  
- Text summarisation and generation

In this section, we’ll walk through how to:
- Load raw text data from files  
- Clean and preprocess it to keep the most useful content  
- Convert it into numerical features that can be fed into a machine learning model

This process is the foundation of any NLP project and plays a huge role in how well your model understands and learns from language.

### Common structure for language data
- *Folder-based structure*:  Text files (like `.txt`) may be stored in subfolders corresponding to different categories.  
- *CSV files*:  A single CSV file (or TSV, where a tab character is the delimiter) may contain a text column (e.g., `review_text`) and a label column (e.g., `sentiment`).  
- *Online corpora*: Some libraries (e.g. `nltk`, datasets from HuggingFace, or `sklearn.datasets`) allow you to fetch popular text datasets automatically.

The basic steps, regardless of how your dataset is structured, are:
1. *Collecting/Loading* – reading the raw text.  
2. *Cleaning/Preprocessing* (optional) – removing special characters, lowercasing, tokenising, etc.  
3. *Vectorisation* – converting text strings into numeric representations (e.g., Bag of Words, TF–IDF, Word Embeddings).  
4. *Labelling* – for classification (sentiment analysis in our case) or other supervised tasks, making sure each piece of text has a corresponding label.

## The Cornell Movie Review dataset 
A ready-to-go example that uses exactly this folder structure is the <a href="https://www.cs.cornell.edu/people/pabo/movie-review-data/" target="_blank">*Movie Review Polarity dataset (v1.0)* </a> by Pang and Lee from Cornell University.

The Cornell Movie Review dataset is a collection of movie reviews compiled by researchers Bo Pang and Lillian Lee at Cornell University. It was developed to facilitate research in sentiment analysis, a field that focuses on determining the sentiment expressed in text. The dataset includes movie reviews labeled according to their overall sentiment polarity (positive or negative) and subjective rating (e.g., "two and a half stars"). Additionally, it contains sentences labeled with respect to their subjectivity status (subjective or objective) or polarity.

The primary purpose of creating this dataset was to provide a benchmark for experiments in sentiment analysis. By offering labeled data, it allows researchers to train and evaluate machine learning models aimed at classifying text based on sentiment. The dataset has been instrumental in advancing the development of algorithms that can automatically determine the sentiment expressed in written language.

Over time, the dataset has been expanded and refined. For instance, the polarity dataset version 2.0 comprises 1,000 positive and 1,000 negative processed reviews. Smaller versions such as 1.0 cover 700 instances of each polarity.

### Installing Python libraries

In [None]:
!pip install --upgrade pip

!pip install nltk matplotlib numpy pandas scikit-learn transformers torch gensim

### Downloading the data

In [None]:
import urllib.request
import tarfile
import os

# Optional larger alternative: the Stanford IMDb dataset (80.2MB). Uncomment depending on your hardware.
# Note: it extracts to a different folder layout ('aclImdb/'), so the paths used later would need adjusting.
# url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# Cornell Movie Review polarity dataset (earlier, smaller version for demonstration) - Size 2.2MB
url = "http://www.cs.cornell.edu/people/pabo/movie-review-data/mix20_rand700_tokens_0211.tar.gz"

# Download the archive to the current directory
archive_path = "mix20_rand700_tokens_0211.tar.gz"
urllib.request.urlretrieve(url, archive_path)

# Unpack (extract) the dataset; this creates the 'tokens/' folder used below
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall()


### Loading data from a folder structure

In many real-world projects, text data is organised into folders based on category or label. For example, if you're working with movie reviews, the data might be split into folders like this:

```
tokens/
  ├── neg/
  │    ├── cv000_tok-29416.txt
  │    ├── cv001_tok-19502.txt
  │    ...
  └── pos/
       ├── cv000_tok-29590.txt
       ├── cv001_tok-18431.txt
       ...
```

Each `.txt` file contains one review. The folder name (`neg` or `pos`) tells you the label: whether the review is *negative* or *positive*. This structure is ideal because it clearly separates the classes, making it easy to process and use in machine learning.

To load the data in Python:
- You loop through each folder (e.g. `pos/` and `neg/`)
- For each text file inside, you read its contents
- You store the text in a list called `X`  
- You assign a label (`0` for negative, `1` for positive, depending on the folder name) and store it in a list called `Y`

This creates two simple lists:
- `X`: contains the raw text of each review  
- `Y`: contains the corresponding label for each review  

This format is perfect for training models because each review has a clear, matching label. Once loaded, the data can be tokenised, cleaned, and transformed for use in text classification tasks.

In [None]:
import os  # Provides functions for interacting with the file system

# Define the path to the root directory containing the review folders
root_dir = 'tokens/'  # This should point to the extracted 'tokens' folder

# Define the two categories we're loading: 'neg' for negative and 'pos' for positive reviews
categories = ['neg', 'pos']  # These match the folder names inside 'tokens/'

# Initialise empty lists to store the text data (X) and their corresponding labels (Y)
X, Y = [], []

# Loop over each category and its index (label_idx will be 0 for 'neg' and 1 for 'pos')
for label_idx, cat in enumerate(categories):
    cat_path = os.path.join(root_dir, cat)  # Create the full path to the category folder

    for filename in os.listdir(cat_path):  # Go through each file in the folder
        if filename.endswith('.txt'):  # Process only text files
            file_path = os.path.join(cat_path, filename)  # Build the full path to the file
    
            # Open and read the contents of the file
            with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
                text_content = f.read()
    
            X.append(text_content)         # Add the review text to the X list
            Y.append(label_idx)            # Add the label (0 for neg, 1 for pos) to the Y list

# Print the total number of samples loaded
print("Number of samples:", len(X))


### Loading data from a CSV file directly

Another common situation is having all your text data in one CSV, with columns for text and labels (e.g., `review_text`, `sentiment`). Assume you have something like `reviews.csv`:

| review_text                          | sentiment |
|--------------------------------------|----------|
| "I loved this product, it was great"| 1        |
| "Terrible experience, do not buy!"  | 0        |
| ...                                  | ...      |

We can read the CSV with pandas:

In [None]:
import pandas as pd  # Import the pandas library for working with data tables

# URL of the IMDb dataset in CSV format (hosted on GitHub)
url = "https://raw.githubusercontent.com/SK7here/Movie-Review-Sentiment-Analysis/refs/heads/master/IMDB-Dataset.csv"

# Load the dataset directly from the URL into a pandas DataFrame
data = pd.read_csv(url)

# Display a random sample of 10 rows from the dataset for inspection
print(data.sample(n=10))

# Extract the review text into a list (X will hold the raw review text)
X = data['review'].tolist()

# Extract the sentiment labels into a list (Y will hold 'positive' or 'negative' values)
Y = data['sentiment'].tolist()

# Print the total number of samples loaded
print("Number of samples:", len(X))
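
Note that the labels loaded from this CSV are strings (`'positive'` / `'negative'`), whereas the folder-based loader produced integers. Most models expect numeric labels, so a small mapping step is usually needed. The sketch below uses a short, made-up list in place of the real `sentiment` column, and `label_map` is a name chosen here for illustration.

```python
# Hypothetical string labels, standing in for values like those in the `sentiment` column
raw_labels = ["positive", "negative", "negative", "positive"]

# Explicit mapping from label text to integer class
label_map = {"negative": 0, "positive": 1}

# Indexing the dict (rather than using .get) means an unexpected label
# raises a KeyError instead of slipping through silently
numeric_labels = [label_map[label] for label in raw_labels]

print(numeric_labels)  # [1, 0, 0, 1]
```

With a pandas column, `data['sentiment'].map(label_map)` applies the same mapping in one step.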


We can also combine both approaches: iterate through the folders to collect the reviews, then build a `pandas` DataFrame from the result:

In [None]:
import os  # For accessing the file system (folders, paths, etc.)
import pandas as pd  # For creating and handling data tables

# Define the paths to the folders containing positive and negative reviews
# (root_dir is the 'tokens/' folder defined in the folder-loading cell above)
pos_path = os.path.join(root_dir, 'pos')  # Folder with positive reviews
neg_path = os.path.join(root_dir, 'neg')  # Folder with negative reviews

# Load positive reviews
pos_reviews = []
for file in os.listdir(pos_path):  # Loop through each file in the 'pos' folder
    # Use the same encoding settings as before so unusual characters don't raise an error
    with open(os.path.join(pos_path, file), 'r', encoding='utf-8', errors='ignore') as f:
        pos_reviews.append((f.read(), 'pos'))  # Add the review text and label to the list

# Load negative reviews
neg_reviews = []
for file in os.listdir(neg_path):  # Loop through each file in the 'neg' folder
    with open(os.path.join(neg_path, file), 'r', encoding='utf-8', errors='ignore') as f:
        neg_reviews.append((f.read(), 'neg'))  # Add the review text and label to the list

# Combine both lists and create a DataFrame with two columns: 'text' and 'label'
data = pd.DataFrame(pos_reviews + neg_reviews, columns=['text', 'label'])

# Print the shape of the dataset (number of rows and columns)
print(data.shape)

# Display the first few rows of the dataset
data.head()


### Basic preprocessing

Now that the dataset is loaded, we may need to clean up the text to remove uninformative features. How much cleaning is appropriate depends on the task. For instance, in authorship attribution (identifying the writer of an unknown piece of writing), punctuation can be a key indicator, so it should be kept. In other tasks it is best to remove punctuation if it is not an important feature for your machine learning model. Similarly, in Named Entity Recognition (NER), we largely care about nouns describing people, places, and things, so function words (stop words) can often be removed.

The first step is to split the text into words, and this is where tokenisation comes in. We use the terms *words* and *tokens* interchangeably, since they refer to the same thing in our context.

#### Tokenisation

*Tokenisation* is the process of splitting text into smaller, more manageable pieces—usually words or subwords. It’s one of the first and most important steps in any Natural Language Processing (NLP) task because models can’t work with raw text directly—they need structured input to learn from.

There are different types of tokenisation depending on the level of detail you want:
- *Word-level tokenisation* breaks a sentence into individual words  
- *Character-level tokenisation* breaks words into letters  
- *Subword tokenisation* splits words into smaller parts, which is especially helpful for handling rare or unknown words

In our case, we’re doing *word-level tokenisation*, which means each review is turned into a list of words. For example:

> “I really enjoyed this film!”  
becomes:  
`["I", "really", "enjoyed", "this", "film", "!"]`

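Keeping punctuation can also be useful, especially in tasks like sentiment analysis where punctuation (like exclamation marks or question marks) might carry emotional tone. Note that the simple regex tokeniser used in this section lowercases the text and discards punctuation, so for the sentence above it would return `["i", "really", "enjoyed", "this", "film"]`.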
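
As a rough illustration of the tokenisation levels listed above, the sketch below tokenises the example sentence at word level (keeping punctuation as separate tokens) and at character level. The regular expression is just one reasonable choice, not the only way to do it.

```python
import re

text = "I really enjoyed this film!"

# Word-level: runs of word characters, or single punctuation marks kept as their own tokens
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every character except whitespace becomes a token
char_tokens = [ch for ch in text if not ch.isspace()]

print(word_tokens)      # ['I', 'really', 'enjoyed', 'this', 'film', '!']
print(char_tokens[:6])  # ['I', 'r', 'e', 'a', 'l', 'l']
```

Subword tokenisation needs a learned vocabulary, so it is usually done with a library tokeniser rather than a regular expression.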

Tokenisation helps in:
- *Text classification* (like spam detection or sentiment analysis)  
- *Language modelling* (predicting the next word in a sentence)  
- *Text cleaning* (removing or filtering specific words or patterns)

Without tokenisation, it would be difficult for any machine learning model to extract meaning from unstructured text. It’s the first step in turning raw language into something models can learn from:

In [None]:
import re

# Define a simple word tokeniser using regex
def regex_tokeniser(text):
    return re.findall(r'\b\w+\b', str(text).lower())  # Extract words and convert to lowercase (using a regular expression)

# Apply regex tokenisation using list comprehension
data['tokens'] = [regex_tokeniser(text) for text in data['text']]

# Display the first few rows
data.head()


#### Stopword removal

Stopwords are commonly used words in a language that typically contribute little meaning on their own and are often filtered out in Natural Language Processing (NLP) tasks. They include articles, conjunctions, prepositions, and auxiliary verbs such as "the", "and", "in", "on", and "is".

Removing stopwords reduces noise in text data, which can help in tasks such as text classification and topic modelling. It is not always beneficial, though: stopword lists often contain negations such as "not", which matter a great deal in sentiment analysis. There are several reasons for removing stopwords:

- *Reduces dataset size*: eliminates unnecessary words, making text processing more efficient.
- *Can improve model accuracy*: lets the more meaningful words drive predictions.
- *Enhances feature extraction*: prevents models from assigning unnecessary weight to high-frequency but unimportant words.
- *Speeds up computation*: processing fewer words speeds up NLP tasks like vectorisation and training.
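
One thing to watch for in sentiment tasks is how stopword removal treats negations. The sketch below uses a tiny, hypothetical stopword set (real lists, such as NLTK's English list, are much longer and also contain "not") to show how a negative phrase can lose its meaning, and how excluding a word from the set avoids that.

```python
# A tiny, hypothetical stopword set for illustration; real lists are much longer
stop_words = {"the", "was", "this", "is", "a", "not"}

def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stop_words]

review = ["the", "film", "was", "not", "good"]

# With "not" treated as a stopword, the negation disappears
print(remove_stopwords(review, stop_words))            # ['film', 'good']

# Removing "not" from the stopword set keeps the negation
print(remove_stopwords(review, stop_words - {"not"}))  # ['film', 'not', 'good']
```

Whether to keep negations is a judgment call that depends on the task and the model.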

In [None]:
import nltk  # Natural Language Toolkit – a library for working with human language data
from nltk.corpus import stopwords  # Commonly used words like "the", "and", "is"
from nltk.stem import WordNetLemmatizer  # Tool for reducing words to their base (root) form

# Download the necessary NLTK resources if they haven’t been downloaded already
nltk.download('stopwords')  # Downloads the list of stopwords
nltk.download('wordnet')    # Downloads the WordNet lexical database for lemmatisation

# Create a set of English stopwords (like "the", "and", "in", etc.)
stop_words = set(stopwords.words('english'))

# Create a lemmatiser object
lem = WordNetLemmatizer()

# Apply text cleaning and lemmatisation to each row of the 'tokens' column
# For each list of tokens (words), this line:
# - Converts each word to lowercase
# - Removes any word that is a stopword or not purely alphabetical
# - Applies lemmatisation (e.g. "cars" → "car"; the lemmatiser treats every word as a noun
#   by default, so verb forms such as "running" stay unchanged unless you pass pos='v')
data['cleaned_tokens'] = data['tokens'].apply(
    lambda tokens: [lem.lemmatize(word.lower()) for word in tokens if word.isalpha() and word.lower() not in stop_words]
)

# Show the first few rows of the updated DataFrame
data.head(20)


If you look at the `cleaned_tokens` column, you will see that the stop words present in the `text` and `tokens` columns have been removed.

### Transformation

At this stage, the DataFrame `data` holds the raw text of each review (`text`), its tokens (`tokens` and `cleaned_tokens`), and its label (`label`, either `pos` or `neg`). To train a machine learning model, we need to turn the text into numbers (the labels are usually converted to `0` and `1` as well), and there are several ways to do this depending on the level of complexity and detail you need:

- *CountVectorizer* – Converts each text into a list of word counts. It’s a simple but effective method that builds a vocabulary from the dataset and counts how many times each word appears in a given document.

- *TF-IDF (TfidfVectorizer)* – Builds on simple word counts by weighing how important each word is. Words that appear in many documents are given less weight, while words that are more unique to a document are given more importance.

- *Advanced Word Embeddings* – Techniques like *Word2Vec*, *GloVe*, and *BERT* go beyond just counting. They turn words into dense numerical vectors that capture deeper meaning and context. These embeddings help enrich your data by preserving semantic relationships—for example, putting similar words closer together in the vector space.
<br>
<div style="border: 2px solid silver; border-radius: 5px; background-color: transparent;padding:10px;width:95%;margin: 10px;">
<strong>What is a vector space?</strong>

A <em>vector space</em> is a mathematical idea that helps us work with things like directions, movements, or quantities that can be added together and scaled. In simple terms, a vector space is a set of objects (called *vectors*) that you can add together (like combining two movements), and multiply by numbers (which stretches or shrinks them). These numbers can be ordinary numbers (like 2 or -1), or complex numbers, depending on the setting. For something to be a vector space, these rules must hold: 
- There’s a *zero vector*: a sort of "do nothing" move.
- You can *undo* any vector (by going the opposite way).
- The order in which you add things doesn’t matter.
- Scaling and adding behave nicely together (e.g., doubling a sum is the same as doubling each bit first and then adding).

 Imagine walking around on a flat field:
- You take 5 steps north — that’s a vector.
- Then 3 steps east — another vector.
- You can combine these steps into one overall move — that’s vector addition.
- If someone says "do double that", you just double each part — that’s scalar multiplication.

This sort of system — where steps can be added, scaled, and still make sense — is a vector space.
</div>
These transformations allow you to turn plain text into structured numerical features that a model can learn from—whether you use basic counts or powerful language representations.

#### CountVectorizer

`CountVectorizer` is a basic and widely used tool in Natural Language Processing (NLP) that transforms text into a format that machine learning models can understand. It works by creating a *bag-of-words* representation: it counts how many times each word appears in a document, and stores those counts as numbers in a table.

Each row in the table represents a document (e.g. a review, tweet, or article), and each column represents a unique word from the entire dataset. The values in the table are just counts—how many times that word appeared in that document.

For example, suppose your dataset contains two short texts:  
  - “I love cats”  
  - “I love dogs”  
  
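The resulting table would look like this (the word “I” is missing because `CountVectorizer` lowercases the text and, by default, ignores single-character tokens): 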

  | Text           | cats | dogs | love |
  |----------------|------|------|------|
  | I love cats    |  1   |  0   |  1   |
  | I love dogs    |  0   |  1   |  1   |

You can also make `CountVectorizer` smarter by:
- Automatically converting words to lowercase (`lowercase=True`)  
- Removing common words like “the” and “is” using stopword removal (`stop_words='english'`)  
- Limiting the number of features (e.g. the 5,000 most frequent words, with `max_features=5000`)

These options can be set using parameters in the vectoriser, or you can manually clean your text before applying it—like we did earlier with tokenisation and lemmatisation.
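
The small table above can be reproduced with a few lines of plain Python. This is only a sketch of the bag-of-words idea, not scikit-learn's implementation; the regular expression mirrors `CountVectorizer`'s default token pattern (tokens of two or more word characters), which is why "I" does not appear as a column.

```python
import re
from collections import Counter

docs = ["I love cats", "I love dogs"]

def tokenise(text):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every unique token across all documents, in alphabetical order
vocab = sorted({tok for doc in docs for tok in tokenise(doc)})

# One row of counts per document, one column per vocabulary word
rows = []
for doc in docs:
    counts = Counter(tokenise(doc))
    rows.append([counts[word] for word in vocab])

print(vocab)  # ['cats', 'dogs', 'love']
print(rows)   # [[1, 0, 1], [0, 1, 1]]
```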

`CountVectorizer` is simple, fast, and often surprisingly effective for many text classification tasks. However, it does not account for the *importance* or *context* of words—just how often they appear: 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer  # Tool for converting text into numerical data

# Convert the list of cleaned tokens back into full strings
# For example, ['great', 'movie', 'plot'] becomes "great movie plot"
X = data['cleaned_tokens'].apply(lambda tokens: ' '.join(tokens))

# Create a CountVectorizer object to transform text into a "bag-of-words" format
vectoriser = CountVectorizer(
    lowercase=True,           # Convert all text to lowercase (just in case)
    stop_words='english',     # Optionally remove common stopwords (set to None if already removed earlier)
    max_features=5000         # Keep only the top 5,000 most frequent words (to limit memory and noise)
)

# Apply the vectoriser to the text data
# This converts each document into a vector showing how many times each word appears
X_count = vectoriser.fit_transform(X)

# Print the shape of the resulting matrix
# Rows = number of documents; Columns = number of features (words)
print("Shape of the count matrix:", X_count.shape)

# Print a small sample of the vocabulary (the actual words turned into features)
print("Example vocabulary snippet:", list(vectoriser.vocabulary_.keys())[:30], "...\n")


For more meaningful representations, methods like *TF-IDF* or *word embeddings* may be better suited.

#### Term Frequency–Inverse Document Frequency (TF-IDF)

TF-IDF is a technique used in Natural Language Processing (NLP) to turn text into numbers so that it can be used in machine learning models. It helps measure how important a word is—not just based on how often it appears, but also based on how *unique* it is across the entire dataset.

Unlike simple word counts, which treat every word the same, TF-IDF gives more weight to words that are specific to a document and less weight to common words like *“the”*, *“and”*, or *“is”*, which appear in almost every sentence.

This makes TF-IDF especially useful for:
- *Text classification* (e.g. sorting emails into spam or not spam)  
- *Sentiment analysis* (e.g. deciding if a review is positive or negative)  
- *Topic modelling* (e.g. grouping articles by subject)

TF-IDF has two main parts:

- *TF (Term Frequency)* – This measures how many times a word appears in a single document. The more it appears, the more important it seems—*within that document*.

- *IDF (Inverse Document Frequency)* – This looks across *all documents* and reduces the importance of words that show up everywhere. Words that appear in only a few documents are considered more meaningful, so they get a higher weight.
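
The two parts can be combined by hand on a toy corpus. The sketch below assumes the variant that scikit-learn documents as its default (raw counts for TF, smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1`, then scaling each document vector to unit length); other TF-IDF variants exist and give different numbers. The corpus is made up for illustration.

```python
import math

# Toy corpus, already tokenised (made up for illustration)
docs = [
    ["great", "film", "great", "cast"],
    ["boring", "film"],
    ["great", "soundtrack"],
]

n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: in how many documents does each word appear?
df = {word: sum(word in doc for doc in docs) for word in vocab}

# Smoothed IDF: words found in fewer documents get a larger weight
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

def tfidf_vector(doc):
    # Term frequency (raw count) times IDF, then scaled to unit length (L2 norm)
    weights = [doc.count(word) * idf[word] for word in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

# Show the weights for the second document ("boring film")
for word, weight in zip(vocab, tfidf_vector(docs[1])):
    print(f"{word:12s}{weight:.3f}")
```

In the second document, "boring" (found in one document) receives a higher weight than "film" (found in two), even though both occur once.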

Multiplying TF by IDF gives a balanced score that highlights the words that best describe each document, making the text easier for models to understand and learn from:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer  # Tool for turning text into TF-IDF values

# Create a TF-IDF vectoriser object
vectoriser = TfidfVectorizer(
    lowercase=True,           # Convert all words to lowercase
    stop_words='english',     # Automatically remove common words like 'the', 'and', 'is'
    max_features=5000         # Use only the 5,000 most frequent words (to simplify and reduce memory usage)
)

# Transform the cleaned text data into TF-IDF feature vectors
# TF-IDF stands for Term Frequency–Inverse Document Frequency
# It measures how important a word is to a document: frequent in that document, but rare across the dataset
X_tfidf = vectoriser.fit_transform(X)

# Get the list of words (features) used in the TF-IDF transformation
vocab = vectoriser.get_feature_names_out()

# Print the shape of the resulting TF-IDF matrix
# Rows = number of documents; Columns = number of words (features)
print("Shape of transformed X:", X_tfidf.shape)

# Show a sample of the vocabulary (the first 10 words used as features)
print("Example vocabulary snippet:", vocab[:10], "...\n")


### Advanced Word Embeddings

In Natural Language Processing (NLP), *word embeddings* are a powerful way of turning words into numbers so that machine learning models can understand them. Instead of giving each word its own column of counts or weights (as in one-hot encoding, Bag of Words, or TF-IDF), word embeddings represent words as dense *vectors*: lists of numbers that live in a high-dimensional space.

What makes embeddings special is that they capture the *meaning* of words based on how they’re used in context. Words that appear in similar situations will have similar vector representations, which helps models recognise that “happy” and “joyful” are related, even if those words never appear together. One caveat: opposites such as “happy” and “sad” are also used in similar contexts, so they can end up closer together than you might expect. Let’s take a closer look at three advanced embedding techniques used in modern NLP.

#### *Word2Vec*  
Developed by Google, Word2Vec uses a simple neural network to learn how words relate to each other based on context. It comes in two flavours:
- *CBOW (Continuous Bag of Words)* – predicts a word from its surrounding context  
- *Skip-gram* – predicts surrounding words from a single input word

It produces fixed-size word vectors where similar words are *close together* in the vector space. However, it assigns the *same vector to a word regardless of context*.

<div style="border: 2px solid silver; border-radius: 5px; background-color: transparent;padding:10px;width:95%;margin: 10px;">
  <strong>What do we mean by the same vector to a word regardless of context?</strong> 
  

Word2Vec gives each word just one fixed representation, no matter where or how it’s used in a sentence. For example, the word *"bank"* might appear in different contexts:

- "She sat by the river *bank*."
- "He works at a *bank* downtown."

These are clearly different meanings of *bank* (riverbank vs financial institution), but Word2Vec will give both uses the exact same vector — because it only learns *one* embedding per word, based on all its contexts averaged together during training.

So, Word2Vec captures general word similarity, but it can’t handle words with multiple meanings depending on context — that’s a key limitation. Later models like *BERT* were developed to solve this.
</div>


#### *GloVe* (Global Vectors for word representation)  
Created by Stanford, GloVe combines the strengths of Word2Vec with a focus on *global co-occurrence*—how often words appear together across the entire dataset. GloVe is available as *pre-trained embeddings* in different sizes (e.g. 50, 100, or 200 dimensions), often trained on large datasets like Wikipedia.

Like Word2Vec, GloVe gives the same vector to a word no matter the sentence, but it does a better job at preserving global relationships between words.

#### *BERT* (Bidirectional Encoder Representations from Transformers)  
BERT is a modern and much more powerful method. It comes from Google and is based on the Transformer architecture. BERT is *context-aware*—which means it gives different vector representations to the same word depending on the sentence.

For example, the word *“bank”* will be embedded differently in “river bank” and “money at the bank,” because BERT looks at both the words *before* and *after* the target word. This makes it extremely useful for tasks like sentiment analysis, translation, and question answering.

### Why use advanced embeddings?

Compared to older techniques like *Bag of Words* or *TF-IDF*, advanced embeddings offer several key advantages:

- *Capture meaning* – Words with similar meanings end up close together in vector space.
- *Understand context* – Especially with BERT, the same word can have different meanings based on where it appears.
- *Improve model performance* – Using embeddings as input often leads to higher accuracy in NLP tasks.
- *Reduce data sparsity* – Embeddings are *dense vectors*, meaning they contain more compact and useful information.
- *Leverage pre-trained models* – Many embeddings are trained on massive datasets, saving time and improving results out of the box.
- *Enable transfer learning* – You can fine-tune pre-trained embeddings for your own specific task or dataset.

### Comparison of Word Embeddings

| Embedding | Context-Aware | Pre-trained Available | Captures Word Meaning | Handles Phrases |
|-----------|---------------|-----------------------|-----------------------|-----------------|
| Word2Vec  | No            | Yes                   | Yes                   | No              |
| GloVe     | No            | Yes                   | Yes                   | No              |
| BERT      | Yes           | Yes                   | Yes                   | Yes             |

If you're unsure where to start, *BERT* is a great default choice. It's versatile, powerful, and widely supported. However, Word2Vec and GloVe are still useful for lighter tasks, smaller datasets, or faster training when deep context isn’t required.

### Word Embeddings with Word2Vec

Word2Vec, developed by Google, is a widely used algorithm that generates these vector representations by analysing which words appear near each other in large text corpora.

It ensures that words with similar meanings (e.g., "film" and "cinema") have closely related vector representations, improving the ability of models to understand language structure.  

Word2Vec works using two key architectures: *Continuous Bag of Words (CBOW)*, which predicts a target word based on surrounding words, and *Skip-Gram*, which predicts context words given a target word.

Both architectures train a shallow neural network that adjusts the word vectors so that related words end up closer together. This gives a better semantic understanding of language and makes Word2Vec particularly useful in applications such as chatbots, recommendation systems, and machine translation.

While CBOW is efficient and works well on large datasets, Skip-Gram is better at representing rare words and tends to work well on smaller datasets, so we will use Skip-Gram here:
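
To see what Skip-Gram is trained on, the sketch below generates the (target, context) pairs for a short token list. The function name `skipgram_pairs` and the toy sentence are made up for illustration; real implementations add further refinements (such as negative sampling) on top of this basic idea.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # A word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["the", "film", "was", "great"]
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```

Skip-Gram learns to predict the second element of each pair from the first. CBOW turns this around and predicts the target word from all of its context words at once.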

In [None]:
from gensim.models import Word2Vec

# Train Word2Vec model
# Word2Vec(..., sg=0)  # CBOW (default)
# Word2Vec(..., sg=1)  # Skip-gram
word2vec_model = Word2Vec(sentences=data['cleaned_tokens'], vector_size=100, window=5, min_count=2, workers=4, sg=1)

# Display the words most similar to "director"
print("Most similar words to 'director':", word2vec_model.wv.most_similar("director"))
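
`most_similar` ranks words by the cosine similarity between their vectors. The sketch below shows that measure on hypothetical 3-dimensional vectors invented for illustration (real Word2Vec vectors here have 100 dimensions and are learned from the data).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1.0 mean a similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, invented for illustration
vectors = {
    "film":   [0.9, 0.8, 0.1],
    "cinema": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

print(cosine_similarity(vectors["film"], vectors["cinema"]))  # close to 1
print(cosine_similarity(vectors["film"], vectors["banana"]))  # much lower
```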

Training Word2Vec on a specific dataset ensures that embeddings reflect domain-specific language (e.g., movie reviews), while pre-trained models offer a practical alternative for large-scale applications.

When we leverage methods like Word2Vec, we create richer, more meaningful word representations that can boost the performance of machine learning and deep learning models in NLP.

### Using Hugging Face Transformers

Hugging Face Transformers is a widely used open-source library for Natural Language Processing (NLP) that provides pre-trained transformer models like BERT, GPT, T5, and more. It simplifies working with state-of-the-art deep learning models for various NLP tasks such as text classification, Named Entity Recognition (NER), sentiment analysis, information retrieval, and text clustering and similarity analysis.

Extracting embeddings from BERT allows you to integrate deep learning-powered language understanding into machine learning workflows. 

We import several libraries:

- `BertTokenizer`: Converts text into tokenised input BERT can understand.
- `BertModel`: Loads the pre-trained BERT model for generating embeddings.
- `torch`: PyTorch is used to handle tensor operations (tensors are just multi-dimensional grids of numbers that computers use to represent and process data like images, text, or sounds).


In [None]:
import torch

from transformers import BertTokenizer, BertModel


Next, we need to load a pre-trained BERT model and its corresponding tokeniser. Hugging Face makes this very simple:


In [None]:
# Load pre-trained BERT model and tokeniser
tokeniser = BertTokenizer.from_pretrained("bert-base-uncased")

model = BertModel.from_pretrained("bert-base-uncased")

- `bert-base-uncased`: The base-sized pre-trained BERT model (12 layers, about 110 million parameters). It is *uncased*, meaning the text is lowercased before processing.
- `BertTokenizer`: Converts raw text into token IDs that BERT understands.
- `BertModel`: Loads the full transformer model to generate embeddings.

Next, we take a random sample of 10 reviews and join each review's cleaned tokens back into a single string. BERT processes these documents together as one batch, which makes computation efficient:


In [None]:
# Sample documents from your data
sample_size = 10

seed = 7 # A number we will use a lot to fix things in place so that we reproduce the same result
sampled_rows = data.sample(n=sample_size, random_state=seed)  # random_state ensures reproducibility

# Create a list of documents (each as a single string)
documents = sampled_rows['cleaned_tokens'].apply(lambda tokens: ' '.join(tokens)).tolist()

# Preview
for i, doc in enumerate(documents):
    print(f"Document {i+1}:\n{doc}\n")


First, we need to convert our text into numerical *token IDs* that BERT can understand:

In [None]:
# Tokenise multiple documents at once
tokens = tokeniser(documents, return_tensors="pt", padding=True, truncation=True)

print(tokens)

- `return_tensors="pt"`: Returns the token IDs as PyTorch tensors (ready for BERT input).
- `padding=True`: Pads shorter sequences so that every sequence in the batch has the same length (important for batch processing).
- `truncation=True`: Truncates long sequences to fit within the model’s maximum length (512 tokens for `bert-base-uncased`).
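
To make padding and truncation more concrete, here is a minimal pure-Python sketch of the idea. It is not how the Hugging Face tokeniser is implemented: the function `pad_and_truncate`, the token IDs, and the `max_length=5` cap are all made up for illustration, and the real tokeniser also keeps the closing `[SEP]` token when it truncates. The sketch also builds an attention mask, which is what the `attention_mask` entry in the tokeniser output represents (1 for real tokens, 0 for padding).

```python
# Toy illustration of padding and truncation (NOT the real BERT tokeniser)
PAD_ID = 0  # assumed padding ID for this toy example


def pad_and_truncate(batch, max_length):
    """Pad or cut each list of token IDs to a common length and build attention masks."""
    # The target length is the longest sequence in the batch, capped at max_length
    target = min(max(len(ids) for ids in batch), max_length)
    input_ids, attention_mask = [], []
    for ids in batch:
        kept = ids[:target]         # truncation: drop anything past the target length
        n_pad = target - len(kept)  # padding: number of filler slots needed
        input_ids.append(kept + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(kept) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical token IDs for three "sentences" of different lengths
toy_batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 3231, 102], [101, 102]]
toy_ids, toy_mask = pad_and_truncate(toy_batch, max_length=5)

for ids, mask in zip(toy_ids, toy_mask):
    print(ids, mask)
```

Every row now has the same length, which is what allows the batch to be stored as a single rectangular tensor.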

Now it's time to generate the actual embeddings! We pass our tokenised text into the pre-trained BERT model and read off the representations (embeddings) it produces. We don’t train or change the model itself, so its weights are not updated; we simply use it as it is:

In [None]:
# Get BERT embeddings
with torch.no_grad():
    outputs = model(**tokens)
    
    print(outputs)

The code above performs the following:
- `torch.no_grad()`: Disables gradient computation to save memory and speed up processing (since we're not training).
- `model(**tokens)`: Feeds the tokenised text into the BERT model and retrieves its output.
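
If the `**` syntax is unfamiliar, the sketch below shows what it does using plain Python. The function `toy_model` is a hypothetical stand-in rather than anything from the `transformers` library, and the dictionary values are made-up token IDs. The keys mirror the ones the BERT tokeniser returns (`input_ids`, `token_type_ids`, `attention_mask`).

```python
# Hypothetical stand-in for a model call; it only reports what it received
def toy_model(input_ids, attention_mask=None, token_type_ids=None):
    return {
        "n_sequences": len(input_ids),
        "has_mask": attention_mask is not None,
        "has_type_ids": token_type_ids is not None,
    }


# A dictionary shaped like the tokeniser output (values are made up)
toy_tokens = {
    "input_ids": [[101, 7592, 102]],
    "token_type_ids": [[0, 0, 0]],
    "attention_mask": [[1, 1, 1]],
}

# ** unpacks the dictionary so each key becomes a keyword argument
unpacked = toy_model(**toy_tokens)

# The same call written out by hand
explicit = toy_model(
    input_ids=toy_tokens["input_ids"],
    attention_mask=toy_tokens["attention_mask"],
    token_type_ids=toy_tokens["token_type_ids"],
)

print(unpacked)
print(unpacked == explicit)
```

Unpacking saves us from naming every input by hand, and it keeps working if the tokeniser returns a slightly different set of keys for another model.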

BERT has now processed the text and generated an embedding for each token! Now let’s extract these embeddings from BERT’s output. Each one represents the meaning of a token, in the context of its sentence, as a point in a high-dimensional space:

In [None]:
# Extract word embeddings from last hidden state
embeddings = outputs.last_hidden_state

print("BERT Embedding Shape:", embeddings.shape)
# Expected output: (batch_size, sequence_length, hidden_size)

In the above code, `outputs.last_hidden_state` extracts the final layer embeddings from BERT. Let's inspect the resulting dimensions of our transformed data:
- *Batch size*: The number of input sentences processed together.
- *Sequence length*: The number of tokens in each sentence (after padding/truncation).
- *Hidden size*: The dimensionality of BERT embeddings (typically 768 for BERT-base models).

For example, if you input *3 sentences*, each padded to *10 tokens*, BERT produces embeddings of shape `(3, 10, 768)`.

We can now use these embeddings as our input to a machine learning model. Like our numeric data in the previous tutorial, we have rows of numbers (vectors) that carry some meaning or pattern that our model can learn from.
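
One practical detail: many classical models expect a single fixed-length vector per document, whereas `embeddings` holds one vector per token. A common way to bridge that gap (one option among several, and not something the code above does) is *mean pooling*: averaging the token vectors of each document while skipping padding positions. The sketch below shows the arithmetic in plain Python on tiny made-up numbers; `mean_pool`, `toy_embeddings`, and `pool_mask` are hypothetical names used only for illustration.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1) in one sequence, skipping padding."""
    real = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    hidden_size = len(token_vectors[0])
    return [sum(vec[d] for vec in real) / len(real) for d in range(hidden_size)]


# Toy "embeddings": 2 sequences, 3 token positions, hidden size 2 (BERT-base uses 768)
toy_embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],  # no padding
]
pool_mask = [[1, 1, 0], [1, 1, 1]]

# One vector per document: shape (2, 2) here, (batch_size, 768) with BERT-base
doc_vectors = [mean_pool(seq, mask) for seq, mask in zip(toy_embeddings, pool_mask)]
print(doc_vectors)
```

With the tensors from the cells above, the same idea could be written roughly as `(embeddings * mask).sum(dim=1) / mask.sum(dim=1)`, where `mask = tokens["attention_mask"].unsqueeze(-1)`, giving a result of shape `(batch_size, hidden_size)`.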

## What have we learnt?

Working with language data is a fundamental part of many machine learning and artificial intelligence tasks. Unlike numerical data, text is unstructured and must be transformed into a format that computers can understand. We’ve learnt that the first step in this process is *preprocessing*, which includes cleaning the text (e.g. lowercasing, removing punctuation), tokenising it into smaller parts like words, and reducing words to their base form through lemmatisation or stemming. These steps help simplify the data while keeping the most important information.

We also explored the *various formats* in which text data can appear. Some datasets are organised into folders, where each subfolder represents a category and contains separate `.txt` files. Others are stored in CSV files, with one column for the text and another for the labels. Alternatively, text datasets can be accessed through online libraries such as `nltk`, `sklearn.datasets`, or Hugging Face’s `datasets`, making it easier to work with large, well-known corpora.

Since machine learning models can’t understand raw text, we must convert it into numbers. We covered several *vectorisation techniques*, such as `CountVectorizer`, which simply counts word occurrences, and `TF-IDF`, which adjusts word importance based on how common or rare a word is across documents. We also introduced *advanced word embeddings* like Word2Vec, GloVe, and BERT, which go a step further by capturing word meanings and relationships. BERT’s embeddings are also contextual: the same word gets a different vector depending on the sentence around it. These representations are especially useful for modern NLP models.

Finally, we discussed how different models use these representations. Traditional machine learning models, such as Naïve Bayes or SVM, often work well with TF-IDF vectors. More advanced models—like recurrent neural networks (RNNs) and transformers (e.g. BERT)—work directly with word embeddings and are capable of understanding the structure and meaning of language more deeply.

Altogether, we’ve seen how raw text is transformed into structured, meaningful features suitable for machine learning. Each step in the process—loading the data, cleaning and tokenising it, converting it into numbers, and feeding it into a model—is crucial for building effective NLP systems.

## Recommended datasets  
If you need some well-known datasets to practise with, here are a few that are easy to access:

**20 Newsgroups**  
- *Description*: About 20,000 newsgroup posts across 20 different topics.  
- *Why it’s popular*: Classic for text classification (Naïve Bayes, TF-IDF, etc.).  
- *Where to get it*: Built into scikit-learn (`sklearn.datasets.fetch_20newsgroups`).

**IMDb Movie Reviews**  
- *Description*: 50,000 reviews labelled positive/negative (sentiment analysis).  
- *Why it’s popular*: Straightforward binary classification, good for NLP demos.  
- *Where to get it*: [Stanford IMDb dataset site](https://ai.stanford.edu/~amaas/data/sentiment/) or [Kaggle IMDb](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).

**Reuters-21578**  
- *Description*: 21,578 Reuters newswire documents; the widely used ModApte subset contains 10,788 documents labelled with 90 categories.  
- *Why it’s popular*: Early text classification benchmark; widely cited.  
- *Where to get it*: [UCI ML Repo](https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection).

**Twitter Sentiment Analysis**  
- *Description*: Various Twitter corpora labelled as positive, negative, or neutral.  
- *Why it’s popular*: Real-world social media text for sentiment tasks.  
- *Where to get it*: [Kaggle sentiment datasets](https://www.kaggle.com/datasets) (search “Twitter sentiment”).

**Enron Email Dataset**  
- *Description*: About 500,000 real business emails from roughly 150 Enron employees, made public after the Enron scandal.  
- *Why it’s popular*: Rich set of real emails, used for spam detection or classification.  
- *Where to get it*: [Enron email dataset site](https://www.cs.cmu.edu/~enron/).

Additionally, frameworks like [Hugging Face Datasets](https://github.com/huggingface/datasets) offer a huge range of NLP corpora.