# Introduction to Natural Language Processing

## Learning Objectives

By the end of this notebook, you will be able to:

- **Understand tokenization** and apply different tokenization approaches (simple, NLTK, spaCy)
- **Implement Bag of Words (BoW)** representation from scratch and using scikit-learn
- **Recognize BoW limitations** and understand why word order matters
- **Understand word embeddings** and how Word2Vec captures semantic relationships
- **Train custom Word2Vec models** and use pre-trained embeddings
- **Compare document similarity** using cosine similarity with word embeddings

---


# Basic Tokenization

Tokenization in natural language processing (NLP) is the process of dividing text into smaller, meaningful units known as tokens. These tokens can be words, subwords, or even individual characters, depending on the task and language. Tokenization is typically one of the first and most important steps in preparing text for machine learning or other computational analysis because it transforms raw, unstructured text into a format that algorithms can more easily process.
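To make the idea of token granularity concrete, here is a minimal sketch that splits the same text at two levels using only built-in string operations. The sample text and variable names are illustrative. Subword tokenization is left out because it normally depends on a vocabulary learned from data.

```python
# Illustrative only: the same text tokenized at two granularities.
text = "NLP is fun"

# Word-level: split on whitespace
word_tokens = text.split()

# Character-level: every character (including spaces) becomes a token
char_tokens = list(text)

print(word_tokens)
print(char_tokens)
```

The word-level view yields three tokens, while the character-level view yields ten, including the two spaces.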

In [None]:
# Our sample sentence
sentence = "NLP is a fascinating field of study."
print(f"Original sentence: {sentence}")

# A very simple tokenizer: convert to lowercase and split by spaces
tokens = sentence.lower().split(' ')

print(f"Tokens: {tokens}")

Original sentence: NLP is a fascinating field of study.
Tokens: ['nlp', 'is', 'a', 'fascinating', 'field', 'of', 'study.']


In [None]:

sentence = "Don't you love NLP? It's a fascinating field!"

# Tokenize using the split() method
tokens = sentence.lower().split(' ')

print(tokens)

["don't", 'you', 'love', 'nlp?', "it's", 'a', 'fascinating', 'field!']


Notice the problems. Punctuation stays attached to the words (`"nlp?"`, `"field!"`), and contractions like `"don't"` and `"it's"` are kept as single tokens that cannot be split further. This method is fast, but it is too crude for most real text.
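One possible middle ground, before reaching for a library, is a small regular expression that peels punctuation off the words. The sketch below is an assumption about what a reasonable pattern might look like, not a standard tokenizer; `regex_tokenize` is a hypothetical helper name.

```python
import re

def regex_tokenize(text):
    """Split text into word tokens (keeping internal apostrophes) and punctuation tokens."""
    # \w+(?:'\w+)?  -> a word, optionally followed by an apostrophe and more word characters
    # [^\w\s]       -> any single character that is neither a word character nor whitespace
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

sentence = "Don't you love NLP? It's a fascinating field!"
tokens = regex_tokenize(sentence.lower())
print(tokens)
```

This separates `?` and `!` from the words, but contractions such as `"don't"` still come out as one token, so it does not solve every problem listed above.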

## Using NLTK

The __Natural Language Toolkit (NLTK)__ is a foundational library for NLP education and research. Its `word_tokenize` function is designed to handle many edge cases, like punctuation and contractions.


In [None]:
!pip install nltk -q

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
import nltk

sentence = "Don't you love NLP? It's a fascinating field!"

# Tokenize using NLTK's word_tokenize
tokens = nltk.word_tokenize(sentence)

print(tokens)

['Do', "n't", 'you', 'love', 'NLP', '?', 'It', "'s", 'a', 'fascinating', 'field', '!']


NLTK separates punctuation (`?`, `!`) from the words. It also splits the contraction `"Don't"` into `Do` and `n't`, so later processing steps can treat the negation as its own token.

## spaCy (A Modern, Production-Ready Library)
spaCy is a modern, high-performance NLP library designed for real-world applications. Its tokenizer is fast and efficient, and it's part of a larger pipeline that creates rich document objects, where each token has useful linguistic annotations.

In [None]:
!pip install spacy -q
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import spacy

# Load the small English language model
nlp = spacy.load("en_core_web_sm")

sentence = "Don't you love NLP? It's a fascinating field!"

# Process the sentence with the spaCy pipeline
doc = nlp(sentence)

# The 'doc' object is a sequence of tokens. We can extract their text.
tokens = [token.text for token in doc]

print(tokens)

['Do', "n't", 'you', 'love', 'NLP', '?', 'It', "'s", 'a', 'fascinating', 'field', '!']


# Bag-of-Words from Scratch

Bag of Words (BoW) is a foundational model in natural language processing for representing text data. In this approach, a text (such as a sentence or document) is converted into a collection of its words, where each unique word is treated as an individual feature, and the frequency of each word is recorded.

BoW ignores the order and grammar of words, focusing solely on the presence or frequency of words within the document.

In [None]:
# Our corpus of documents
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat."
]

In [None]:
# --- Step 1: Build the Vocabulary ---
# We'll collect all unique words from the corpus in a single set
print("--- Step 1: Building the Vocabulary ---")
all_words = set()
for sentence in corpus:
    # simple tokenization: lowercase and split by space
    tokens = sentence.lower().replace('.', '').split(' ')
    for word in tokens:
        all_words.add(word)

--- Step 1: Building the Vocabulary ---


In [None]:
all_words

{'cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the'}

In [None]:
# Convert the set to a sorted list to have a consistent order
# (sorted() accepts any iterable and always returns a list)
vocabulary = sorted(all_words)
print(f"Final Vocabulary: {vocabulary}\n")

Final Vocabulary: ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']



In [None]:
# --- Step 2: Create the Vectors ---
# We'll create a vector for each sentence by counting word occurrences
print("--- Step 2: Creating the Vectors ---")
final_vectors = []
for sentence in corpus:
    # Start with a vector of zeros, one position for each word in the vocabulary
    vector = [0] * len(vocabulary)

    # Tokenize the current sentence
    sentence_tokens = sentence.lower().replace('.', '').split(' ')

    # Count the words
    for word in sentence_tokens:
        # Find the index of the word in our vocabulary
        if word in vocabulary:
            index = vocabulary.index(word)
            # Increment the count at that index
            vector[index] += 1

    final_vectors.append(vector)

--- Step 2: Creating the Vectors ---


In [None]:
# Print the resulting count vectors, one per sentence
print("Sentence 1 Vector:", final_vectors[0])
print("Sentence 2 Vector:", final_vectors[1])

Sentence 1 Vector: [1, 0, 0, 1, 1, 1, 2]
Sentence 2 Vector: [1, 1, 1, 0, 0, 0, 2]


In [None]:
# For comparison with the scikit-learn output
import pandas as pd
df = pd.DataFrame(final_vectors, columns=vocabulary, index=['Sentence 1', 'Sentence 2'])
print("\n--- Readable DataFrame ---")
print(df)


--- Readable DataFrame ---
            cat  chased  dog  mat  on  sat  the
Sentence 1    1       0    0    1   1    1    2
Sentence 2    1       1    1    0   0    0    2
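The two loops above can also be written more compactly with `collections.Counter` from the standard library. This is an alternative sketch of the same counting idea, not a replacement for the step-by-step version; the `tokenize` helper is a hypothetical name introduced here.

```python
from collections import Counter

corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat."
]

def tokenize(text):
    # Same simple scheme as above: lowercase, drop periods, split on whitespace
    return text.lower().replace(".", "").split()

# Count tokens once per document
counts = [Counter(tokenize(doc)) for doc in corpus]

# Sorted vocabulary across all documents
vocabulary = sorted(set().union(*counts))

# A Counter returns 0 for missing keys, so no membership check is needed
vectors = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
print(vectors)
```

Because each document is counted once up front, this avoids the repeated `vocabulary.index(word)` lookups, which matters as the vocabulary grows.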


# Bag-of-Words with Scikit-Learn

__CountVectorizer__ from the scikit-learn library is commonly used to implement the Bag of Words (BoW) model. It converts a collection of text documents into a matrix of token counts, where each row represents a document and each column represents a unique word (token) from the entire corpus vocabulary. The values in the matrix indicate the frequency of each word in each document.

CountVectorizer handles tokenization, lowercasing, and counting word occurrences automatically, making it a practical and efficient tool for generating BoW representations. Given a corpus of documents, it produces a sparse matrix of word counts that can be used directly as input features for machine learning models.
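One detail of that automatic tokenization is worth seeing in isolation. By default, CountVectorizer extracts tokens with the regular expression `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters. The sketch below applies that pattern with Python's `re` module (not scikit-learn itself) to an illustrative sentence.

```python
import re

# Default token_pattern documented for CountVectorizer
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sentence = "NLP is a fascinating field of study."
tokens = re.findall(DEFAULT_TOKEN_PATTERN, sentence.lower())
print(tokens)
```

The single-character word `a` and the trailing period both disappear. If one-letter tokens matter for your task, the `token_pattern` argument of CountVectorizer can be set to a different expression.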

In [None]:
# Import the CountVectorizer class from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Our corpus of documents
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat."
]

# 1. Create an instance of CountVectorizer
# This object will learn the vocabulary and generate vectors
vectorizer = CountVectorizer()

# 2. Fit the vectorizer to the corpus and transform the corpus into vectors
# .fit_transform() learns the vocabulary and returns the document-term matrix (our vectors)
X = vectorizer.fit_transform(corpus)

# 3. Get the learned vocabulary
# get_feature_names_out() returns an array of words ordered by their column index in the matrix
# (the word-to-index dictionary itself is available as vectorizer.vocabulary_)
vocabulary = vectorizer.get_feature_names_out()
print(f"Learned Vocabulary: {vocabulary}")


Learned Vocabulary: ['cat' 'chased' 'dog' 'mat' 'on' 'sat' 'the']


In [None]:
# 4. View the vectors
# The result is a sparse matrix. We convert it to a dense array for readability.
vectors = X.toarray()
print("\nResulting Vectors (Document-Term Matrix):")
print(vectors)


Resulting Vectors (Document-Term Matrix):
[[1 0 0 1 1 1 2]
 [1 1 1 0 0 0 2]]


In [None]:
# For a more readable output, let's put it in a pandas DataFrame
df = pd.DataFrame(vectors, columns=vocabulary, index=['Sentence 1', 'Sentence 2'])
print("\n--- Readable DataFrame ---")
print(df)


--- Readable DataFrame ---
            cat  chased  dog  mat  on  sat  the
Sentence 1    1       0    0    1   1    1    2
Sentence 2    1       1    1    0   0    0    2


# BoW Limitation (Loss of Word Order)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Two sentences with completely different meanings
sentences_with_different_meanings = [
    "The dog bit the man.",
    "The man bit the dog."
]

# Use the same CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences_with_different_meanings)

# Display the results
vocabulary = vectorizer.get_feature_names_out()
df = pd.DataFrame(X.toarray(), columns=vocabulary, index=['Sentence A', 'Sentence B'])

print("--- Vocabulary and Vectors ---")
print(df)

--- Vocabulary and Vectors ---
            bit  dog  man  the
Sentence A    1    1    1    2
Sentence B    1    1    1    2


Although the two sentences describe opposite events, their Bag-of-Words representations are identical. This shows that the model discards the information carried by word order.
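A common partial remedy is to count short word sequences (n-grams) instead of single words. The following is a minimal standard-library sketch of that idea; `ngrams` is a hypothetical helper, and the sentences are the two from the example with the punctuation removed.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sent_a = "the dog bit the man".split()
sent_b = "the man bit the dog".split()

# Unigram counts are identical...
same_unigrams = Counter(sent_a) == Counter(sent_b)
print(same_unigrams)

# ...but bigram counts differ, because bigrams encode local word order
bigrams_a = Counter(ngrams(sent_a, 2))
bigrams_b = Counter(ngrams(sent_b, 2))
same_bigrams = bigrams_a == bigrams_b
print(same_bigrams)

# Bigrams that occur only in sentence A
only_in_a = sorted(set(bigrams_a) - set(bigrams_b))
print(only_in_a)
```

In scikit-learn, the same effect is available through CountVectorizer's `ngram_range` parameter, for example `ngram_range=(1, 2)` to count both single words and bigrams. N-grams recover only local order and make the vocabulary much larger.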

# Word2Vec

Word2Vec is a neural network-based technique, introduced by Google researchers in 2013, for learning dense vector representations of words called word embeddings. Unlike Bag of Words (BoW), which represents text by counting word frequencies and ignores word order, Word2Vec captures semantic relationships between words by learning from the contexts in which they co-occur in a corpus.



In [None]:
!pip install gensim -q

## Training Your Own Mini Word2Vec Model

__Gensim__ offers an efficient and straightforward API for training Word2Vec models on custom text corpora. You provide the model with preprocessed, tokenized sentences, and it learns word embeddings using either the CBOW or Skip-Gram method.
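To give a feel for what Skip-Gram learns from, the sketch below generates the (center word, context word) pairs that a fixed window would produce for one sentence. It is a simplification for illustration only: real implementations add refinements such as negative sampling and frequent-word subsampling, and `skipgram_pairs` is a hypothetical helper, not part of Gensim.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["deep", "learning", "models", "require", "data"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

Skip-Gram trains the network to predict the context word from the center word, while CBOW goes the other way and predicts the center word from its surrounding context. In Gensim, the `sg` argument of `Word2Vec` selects between them (`sg=1` for Skip-Gram, `sg=0` for CBOW, which is the default).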



In [None]:
import gensim
import numpy as np
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt') # Download tokenizer data
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
# 1. Sample Corpus
corpus = [
    "Generative AI is a powerful technology.",
    "Large language models can create human-like text.",
    "The transformer architecture revolutionized NLP.",
    "Deep learning models require significant data.",
    "Artificial intelligence is a broad field of study.",
    "Transformers are the basis for models like GPT."
]


Here's a brief outline of how you can train Word2Vec with Gensim:

* Prepare your text data as tokenized sentences (lists of words).

* Initialize the Word2Vec model with parameters such as vector size, window size, minimum word count, and training algorithm.

* Train the model on your sentences.

* Save the trained model for later use.

In [None]:
# 2. Preprocess Data (Tokenization)
tokenized_corpus = [word_tokenize(doc.lower()) for doc in corpus]
tokenized_corpus

[['generative', 'ai', 'is', 'a', 'powerful', 'technology', '.'],
 ['large', 'language', 'models', 'can', 'create', 'human-like', 'text', '.'],
 ['the', 'transformer', 'architecture', 'revolutionized', 'nlp', '.'],
 ['deep', 'learning', 'models', 'require', 'significant', 'data', '.'],
 ['artificial',
  'intelligence',
  'is',
  'a',
  'broad',
  'field',
  'of',
  'study',
  '.'],
 ['transformers', 'are', 'the', 'basis', 'for', 'models', 'like', 'gpt', '.']]

In [None]:
# 3. Train Word2Vec Model
# vector_size: dimensionality of the word vectors
# window: max distance between the current and predicted word within a sentence
# min_count: ignores all words with total frequency lower than this
# Note: passing `sentences` to the constructor already builds the vocabulary and trains the model.
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)
# The explicit train() call below therefore continues training for 10 additional epochs.
model.train(tokenized_corpus, total_examples=len(tokenized_corpus), epochs=10)



(101, 460)

In [None]:
# 4. Create Document Vectors
# We'll use a simple approach: average the word vectors for each document
def get_doc_vector(doc_tokens, model):
    word_vectors = [model.wv[word] for word in doc_tokens if word in model.wv]
    if not word_vectors:
        return np.zeros(model.vector_size)
    return np.mean(word_vectors, axis=0)

doc_vectors = [get_doc_vector(doc, model) for doc in tokenized_corpus]

In [None]:
# --- Step 5: Calculate and Find Most Similar Document (scikit-learn approach) ---
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Let's find documents similar to the first one
query_vector = doc_vectors[0]

# scikit-learn's function expects 2D arrays, so we reshape the query vector
# The output is a 2D array, so we access the first (and only) row with [0]
similarities = cosine_similarity(query_vector.reshape(1, -1), doc_vectors)[0]

# Find the most similar document (excluding itself)
most_similar_idx = np.argsort(similarities)[-2] # -1 is the document itself

print(f"Original Document: '{corpus[0]}'")
print(f"Most Similar Document: '{corpus[most_similar_idx]}'")
print(f"Similarity Score: {similarities[most_similar_idx]:.4f}")

Original Document: 'Generative AI is a powerful technology.'
Most Similar Document: 'Artificial intelligence is a broad field of study.'
Similarity Score: 0.3231
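For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores their magnitude. Here is a minimal pure-Python sketch of that formula; the zero-vector handling is a convention chosen for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

same_direction = cosine([1, 0], [2, 0])
orthogonal = cosine([1, 0], [0, 3])
opposite = cosine([1, 0], [-1, 0])
print(same_direction, orthogonal, opposite)
```

Scores range from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction), regardless of how long the vectors are.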


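## Pre-trained Word Embeddings
In this lab, we'll use a pre-trained word embedding model to find the most similar document to a given query. Training useful embeddings ourselves takes a lot of data and time, so we'll load a model that was already trained on a massive dataset. Note that the model loaded below is a GloVe model rather than Word2Vec; gensim exposes both kinds of embeddings through the same word-vector interface.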

Our strategy will be to represent each document by taking the average of the word vectors of all the words within it. This gives us a single vector that captures the document's overall meaning.
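The averaging step can be illustrated with a tiny hand-made example. The three-dimensional vectors below are invented purely for illustration and do not come from any real model; `average_vector` is a hypothetical helper.

```python
# Hypothetical 3-dimensional embeddings, invented for illustration
toy_embeddings = {
    "sun":  [1.0, 0.0, 2.0],
    "star": [3.0, 0.0, 0.0],
    "moon": [2.0, 3.0, 1.0],
}

def average_vector(tokens, embeddings, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together
    return [sum(component) / len(known) for component in zip(*known)]

doc_vector = average_vector(["sun", "star", "comet"], toy_embeddings)
print(doc_vector)
```

The unknown word `comet` is skipped, so the result is the mean of the `sun` and `star` vectors. Averaging treats every word equally, which means frequent function words such as "the" pull all document vectors toward each other.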

In [None]:
# --- Step 1: Setup and Load the Pre-trained Model ---
# We use gensim's downloader to fetch a pre-trained model.
# 'glove-wiki-gigaword-50' is a small model with 50-dimensional vectors.
import gensim.downloader as api
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

The model `glove-wiki-gigaword-50` is a pretrained word embedding model based on the GloVe (Global Vectors for Word Representation) algorithm developed by Stanford NLP. It provides 50-dimensional dense vector representations for words, trained on a large combined corpus including Wikipedia (2014 dump) and the Gigaword dataset, containing billions of words.

In [None]:
model = api.load("glove-wiki-gigaword-50")



In [None]:
# --- Step 2: Define Our Documents ---
# We have three documents with different topics.
documents = [
    "The sun is the star at the center of the Solar System.", # About space
    "The ocean is a body of salt water that covers most of the Earth.", # About oceans
    "A computer is a machine that can be programmed to carry out sequences of arithmetic or logical operations." # About technology
]
doc_labels = ["Space Document", "Ocean Document", "Technology Document"]


In [None]:
# --- Step 3: Create a Function to Vectorize a Document ---
# This function converts a document into a single vector by averaging its word vectors.
def vectorize_document(doc, model):
    """Converts a document string into a single averaged vector."""
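    # Note: split() keeps punctuation attached (e.g. "rocket."), so such tokens
    # may not match the model's vocabulary and are then skipped below
    words = doc.lower().split()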

    # Get the vector for each word in the document, if the word exists in the model
    word_vectors = [model[word] for word in words if word in model]

    if not word_vectors:
        # If no words in the document are in the model's vocabulary, return a vector of zeros
        return np.zeros(model.vector_size)

    # Return the mean of the word vectors to get a single document vector
    return np.mean(word_vectors, axis=0)
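Because `vectorize_document` splits on whitespace only, a word followed by punctuation can fail the vocabulary lookup and be dropped from the average. The sketch below demonstrates the effect with a small hypothetical vocabulary standing in for the pre-trained model; whether a particular token exists in the real model's vocabulary is not checked here.

```python
import string

# Hypothetical vocabulary standing in for a pre-trained model's known words
toy_vocab = {"the", "astronaut", "travels", "to", "moon", "in", "a", "rocket"}

query = "The astronaut travels to the moon in a rocket."

# Whitespace splitting leaves the final period attached to "rocket."
raw_tokens = query.lower().split()
missed = [t for t in raw_tokens if t not in toy_vocab]
print(missed)

# Stripping leading/trailing punctuation recovers the word
clean_tokens = [t.strip(string.punctuation) for t in raw_tokens]
still_missed = [t for t in clean_tokens if t not in toy_vocab]
print(still_missed)
```

A more thorough option is to reuse one of the tokenizers from the first part of this notebook before looking words up.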

In [None]:
# --- Step 4: Vectorize All Our Documents ---
doc_vectors = [vectorize_document(doc, model) for doc in documents]

In [None]:
# --- Step 5: Define a Query and Find the Most Similar Document ---
query = "The astronaut travels to the moon in a rocket."
print(f"\nQuery: '{query}'")



Query: 'The astronaut travels to the moon in a rocket.'


In [None]:
# Vectorize the query using the same function
query_vector = vectorize_document(query, model)


In [None]:
# Calculate cosine similarity between the query vector and all document vectors
# We need to reshape the vectors for the function to work correctly.
similarities = cosine_similarity(query_vector.reshape(1, -1), doc_vectors)

In [None]:
# Find the index of the most similar document
most_similar_index = np.argmax(similarities)

print(f"\nSimilarity Scores: {similarities[0]}")
print(f"The most similar document is: '{doc_labels[most_similar_index]}'")
print(f"Document Text: '{documents[most_similar_index]}'")


Similarity Scores: [0.9492896  0.90947753 0.8427572 ]
The most similar document is: 'Space Document'
Document Text: 'The sun is the star at the center of the Solar System.'


### Assignment: Word2Vec with a Larger Corpus

Your task is to repeat the Word2Vec model training process on a larger corpus of your choice.

1.  **Find a Corpus**: Find a text file (`.txt`) to use as your corpus. You can use sources like [Project Gutenberg](https://www.gutenberg.org/) to find books in plain text format. Download a book and save it in a `Resources` folder.
2.  **Load and Preprocess**: Load the text data and preprocess it similar to the examples above (e.g., tokenization, lowercasing).
3.  **Train the Model**: Train a `Word2Vec` model on your corpus. You might need to experiment with the model parameters (`vector_size`, `window`, `min_count`, etc.) to get good results.
4.  **Explore the Embeddings**:
    *   Find the most similar words for a few words in your vocabulary.
    *   Perform some word analogies (e.g., "king" - "man" + "woman" = "queen").
    *   Implement a function to find the most similar document (sentence) in your corpus for a given query sentence. Test it with a few queries.
5.  **Reflect**: Briefly describe your findings. Are the results better or worse than the small example? What did you learn?
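As a warm-up for the analogy task, the following sketch shows the vector arithmetic on hand-made two-dimensional vectors. The numbers are invented so that the first axis loosely stands for "royalty" and the second for "gender"; real embeddings will not be this clean, and `analogy` is a hypothetical helper.

```python
import math

# Hypothetical 2-D embeddings: first axis ~ "royalty", second axis ~ "gender"
toy = {
    "king":  [0.9,  0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1,  0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c, embeddings):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))

result = analogy("king", "man", "woman", toy)
print(result)
```

With a trained Gensim model, the equivalent query is `model.wv.most_similar(positive=["king", "woman"], negative=["man"])`, which likewise leaves the input words out of the results.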

In [None]:
import gdown
file_id = '13ETSh6QARlutXhFePwQaqVgZ4TYbRd2o'
output = 'corpus.txt'
gdown.download(f'https://drive.google.com/uc?id={file_id}', output, quiet=False)

In [None]:
# Load the corpus
with open('corpus.txt', 'r') as f:
    corpus = f.readlines()

# Preprocess the data


# Train the Word2Vec model


# Explore the embeddings
# Find the most similar words to 'king'
model.wv.most_similar('king')

In [None]:
# Create document vectors for the entire corpus by averaging word vectors

doc_vectors =

In [None]:
# Define a query sentence
query = "a rocket to the moon"

# Tokenize the query, then vectorize it with the same function used for the corpus
query_tokens =
query_vector =

# scikit-learn's function expects 2D arrays, so we reshape the query vector
# The output is a 2D array, so we access the first (and only) row with [0]
similarities =

# Find the most similar document (the query is not part of the corpus, so nothing needs to be excluded)
most_similar_idx =

print(f"Query: '{query}'")
print(f"Most Similar Document: '{corpus[most_similar_idx]}'")
print(f"Similarity Score: {similarities[most_similar_idx]:.4f}")