# NLP Tutorial: From Text to Vectors

This tutorial walks through a basic Natural Language Processing (NLP) pipeline, from raw text to numeric vectors, in three stages:
1.  **Data Preprocessing**
2.  **Data Cleaning**
3.  **Text to Vectors** (including Word Embeddings: CBOW & Skipgram)

# 1. Data Preprocessing

## 1.1 Tokenization
Tokenization splits text into individual units like words or sentences. It is the fundamental first step in turning unstructured text into structured data.

First, we import the necessary libraries and download the required NLTK data.

In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Download tokenizer data ('punkt_tab' is used by recent NLTK releases, 'punkt' by older ones)
nltk.download('punkt')
nltk.download('punkt_tab')

Now, let's define a sample text to work with.

In [None]:
text = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."
print(text)

### Sentence Tokenization
Split the text into individual sentences.

In [None]:
sentences = sent_tokenize(text)
print("--- Sentences ---")
for sent in sentences:
    print(sent)

### Word Tokenization
Split the text into individual words.

In [None]:
words = word_tokenize(text)
print("--- Words ---")
print(words)
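
To make the idea concrete without NLTK, here is a minimal regex-based sketch. It is not how `word_tokenize` works internally: NLTK handles abbreviations and contractions with more care, and its Treebank-style rules split `shouldn't` into `should` and `n't`, whereas this sketch keeps it whole. The helper name `simple_word_tokenize` and the pattern are assumptions made for illustration.

```python
import re

def simple_word_tokenize(text):
    """Hypothetical helper: split text into word and punctuation tokens."""
    # \w+(?:[-']\w+)* keeps hyphenated words and contractions together;
    # [^\w\s] matches any single punctuation character.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

sample = "The sky is pinkish-blue. You shouldn't eat cardboard."
tokens = simple_word_tokenize(sample)
print(tokens)
```

Punctuation marks come out as separate tokens, which is also what `word_tokenize` does; later steps can filter them out if they are not needed.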

## 1.2 Lowercasing
Lowercasing normalizes text to ensure that words like 'Apple' and 'apple' are treated as identical. This reduces the vocabulary size and complexity.

In [None]:
text_lower = text.lower()
print("Original :", text)
print("Lowercased:", text_lower)
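
The effect on vocabulary size can be seen on a small, made-up token list. This is only a sketch; the list `sample_tokens` is invented for the example.

```python
# Hypothetical token list with the same words in different casings.
sample_tokens = ["Apple", "apple", "APPLE", "Banana", "banana"]

vocab_original = set(sample_tokens)
vocab_lowered = {t.lower() for t in sample_tokens}

print("Distinct tokens before lowercasing:", len(vocab_original))
print("Distinct tokens after lowercasing: ", len(vocab_lowered))
```

Five distinct surface forms collapse to two vocabulary entries.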

## 1.3 Regular Expressions (Regex)
Regular Expressions (Regex) allow for pattern-based text searching and manipulation. They are essential for removing noise like HTML tags, URLs, or special characters.

In [None]:
import re

dirty_text = "Check out this link <a href='test'>Click</a> call 999-999-9999 or email test@example.com!!! #NLP"

### Remove HTML Tags
We use the pattern `<.*?>` to find and remove HTML tags.

In [None]:
clean_text = re.sub('<.*?>', '', dirty_text)
print("No HTML:", clean_text)

### Remove Special Characters
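We keep only ASCII letters and whitespace using the pattern `[^a-zA-Z\s]`. Digits and punctuation are removed as well, so the phone number and the `@` and `.` in the email address disappear.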

In [None]:
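# Use a raw string so that \s reaches the regex engine instead of being
# treated as an invalid Python string escape (a warning in recent Python versions).
clean_text = re.sub(r'[^a-zA-Z\s]', '', clean_text)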
print("Only Letters:", clean_text)
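
URLs and email addresses can be handled in the same way. The sketch below uses deliberately simple patterns that are assumptions for illustration, not complete URL or email validators; the names `URL_PATTERN` and `EMAIL_PATTERN` and the sample string are invented here.

```python
import re

noisy = "Docs at https://example.com/guide or email test@example.com today"

# Assumed patterns: simple on purpose, they will miss unusual URLs and addresses.
URL_PATTERN = r"https?://\S+"
EMAIL_PATTERN = r"\S+@\S+\.\S+"

no_urls = re.sub(URL_PATTERN, "", noisy)
no_emails = re.sub(EMAIL_PATTERN, "", no_urls)
# Collapse the runs of whitespace left behind by the removals.
cleaned = re.sub(r"\s+", " ", no_emails).strip()
print(cleaned)
```

Order matters: structured patterns such as tags, URLs, and emails should be removed before stripping special characters, because the latter destroys the delimiters those patterns rely on.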

# 2. Data Cleaning

## 2.1 Stemming
Stemming is a crude heuristic process that chops off word endings to reduce words to a base form. It is fast but often produces stems that are not dictionary words (e.g., 'flies' -> 'fli').

Initialize the Porter and Lancaster stemmers.

In [None]:
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

Compare the output of the two stemmers on the same word.

In [None]:
word = "history"
print(f"Porter:    {porter.stem(word)}")
print(f"Lancaster: {lancaster.stem(word)}")
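
To illustrate what "chopping off word endings" means, here is a toy suffix stripper. It is a sketch only and is far simpler than the Porter or Lancaster algorithms, which apply ordered rule sets with extra conditions; the function `toy_stem` and its suffix list are assumptions made for this example.

```python
def toy_stem(word, suffixes=("ing", "ies", "ed", "es", "s")):
    """Hypothetical helper: strip the first matching suffix, nothing more."""
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["eating", "flies", "jumped", "cats", "is"]:
    print(w, "->", toy_stem(w))
```

Even this toy version shows the trade-off: 'eating' becomes the real word 'eat', while 'flies' becomes the non-word 'fli'.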

## 2.2 Lemmatization
Lemmatization uses a vocabulary and morphological analysis to return the dictionary form of a word. Unlike stemming, it produces valid words but is computationally more expensive.

NLTK's `WordNetLemmatizer` looks words up in the WordNet database, so the WordNet data must be downloaded first.

In [None]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

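Compare lemmatization with stemming for the verb 'eating'. The `pos='v'` argument tells the lemmatizer to treat the word as a verb; without it, `lemmatize` assumes a noun and returns 'eating' unchanged.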

In [None]:
word = "eating"
print(f"Stem:  {porter.stem(word)}")
print(f"Lemma: {lemmatizer.lemmatize(word, pos='v')}")

## 2.3 Stopwords
Stopwords are high-frequency words (like 'the', 'is', 'and') that carry little semantic meaning. Removing them helps the model focus on the unique, content-rich words.

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
print(f"Total English stopwords: {len(stop_words)}")

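Apply stopword removal to a sample sentence. Punctuation tokens such as ',' and '.' are not in the stopword list, so they remain in the output.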

In [None]:
text_sample = "This is a sample sentence, showing off the stop words filtration."
words = word_tokenize(text_sample)

filtered_sentence = [w for w in words if w.lower() not in stop_words]

print("Original:", words)
print("Filtered:", filtered_sentence)

# 3. Text to Vectors

## 3.1 One Hot Encoding
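One Hot Encoding creates a binary vector for each word in which exactly one element is 1 and all others are 0. It is simple, but the vectors are sparse, their length grows with the vocabulary, and every pair of distinct words is equally dissimilar, so no semantic relationship is encoded.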

In [None]:
import pandas as pd

docs = ["blue house", "red house"]
print("Docs:", docs)

In [None]:
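# One row per token ('house' appears twice), one column per distinct word.
# dtype=int prints 0/1 rather than the True/False that recent pandas returns by default.
one_hot = pd.get_dummies(docs[0].split() + docs[1].split(), dtype=int)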
print(one_hot)
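
The same encoding can be built by hand, which makes the "no semantic relation" point visible. This is a minimal sketch in plain Python; the names `vocab`, `index`, and `one_hot_vector` are chosen for this example.

```python
tokens = "blue house red house".split()

# A sorted vocabulary gives each distinct word a fixed position.
vocab = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocab)}

def one_hot_vector(word):
    """Return a list with a single 1 at the word's vocabulary position."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

for t in tokens:
    print(t, one_hot_vector(t))

# The dot product of two different one-hot vectors is always 0.
dot = sum(a * b for a, b in zip(one_hot_vector("blue"), one_hot_vector("red")))
print("blue . red =", dot)
```

Because every pair of distinct words has a dot product of 0, 'blue' is exactly as far from 'red' as it is from 'house'.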

## 3.2 Bag of Words (BoW)
Bag of Words represents a document as a vector of word counts over a fixed vocabulary. It captures which words occur and how often, but ignores grammar and word order.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())

In [None]:
print("BoW Matrix:\n", X.toarray())
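
The counts can be reproduced by hand with `collections.Counter`. The sketch below assumes `CountVectorizer`'s default behaviour of lowercasing and keeping only tokens of two or more word characters; the names `bow_corpus`, `bow_tokenize`, and `bow_matrix` are chosen for this example.

```python
import re
from collections import Counter

bow_corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def bow_tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

# Alphabetically sorted vocabulary, one column per distinct token.
bow_vocab = sorted({tok for doc in bow_corpus for tok in bow_tokenize(doc)})

bow_matrix = []
for doc in bow_corpus:
    counts = Counter(bow_tokenize(doc))
    bow_matrix.append([counts[word] for word in bow_vocab])

print(bow_vocab)
for row in bow_matrix:
    print(row)
```

The first and fourth documents get identical rows even though one is a statement and the other a question, which is what "ignores word order" means in practice.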

## 3.3 TF-IDF
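TF-IDF (Term Frequency-Inverse Document Frequency) weights each term by how often it appears in a document (TF), scaled down by how many documents in the corpus contain it (IDF). This de-emphasizes words that are common everywhere and highlights terms that are distinctive for specific documents.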

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

In [None]:
print("TF-IDF Matrix:\n", X_tfidf.toarray())
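
As a cross-check, the weights can be computed by hand. The sketch below assumes scikit-learn's default settings: raw counts as term frequency, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names (`tfidf_tokenize`, `tfidf_row`) are invented for this example.

```python
import math
import re
from collections import Counter

tfidf_corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tfidf_tokenize(doc):
    # Same assumed tokenization as CountVectorizer's default.
    return re.findall(r"\b\w\w+\b", doc.lower())

docs_tokens = [tfidf_tokenize(d) for d in tfidf_corpus]
terms = sorted({t for toks in docs_tokens for t in toks})
n_docs = len(docs_tokens)

# Document frequency: number of documents that contain each term.
df = {term: sum(term in toks for toks in docs_tokens) for term in terms}
# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in terms}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in terms]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    return [v / norm for v in raw]

first_row = tfidf_row(docs_tokens[0])
for term, weight in zip(terms, first_row):
    print(f"{term:10s} {weight:.4f}")
```

In the first document, 'first' (found in two documents) receives a higher weight than 'the' (found in all four), even though both occur once.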

## 3.4 Word Embeddings (Word2Vec)
Word Embeddings are dense vector representations where words with similar meanings are located close together. They capture complex semantic relationships that sparse methods miss.

In [None]:
!pip install gensim
from gensim.models import Word2Vec

# Sample Data
sentences = [
    ['i', 'love', 'nlp'],
    ['nlp', 'is', 'awesome'],
    ['i', 'love', 'machine', 'learning'],
    ['deep', 'learning', 'is', 'a', 'subset', 'of', 'machine', 'learning'],
    ['word', 'embeddings', 'are', 'dense', 'vectors']
]

### 3.4.1 CBOW (Continuous Bag of Words)
CBOW predicts the *center* word based on the surrounding *context* words. It is generally faster to train than Skipgram and tends to be slightly more accurate for frequent words.

In [None]:
# Train CBOW Model (sg=0)
model_cbow = Word2Vec(sentences, min_count=1, vector_size=10, window=3, sg=0)
print("CBOW Model Trained.")

vector_cbow = model_cbow.wv['nlp']
print("CBOW Vector for 'nlp':\n", vector_cbow)

### 3.4.2 Skipgram
Skipgram predicts the surrounding *context* words given a *center* word. It tends to work well with small datasets and represents infrequent words better than CBOW, at the cost of slower training.

In [None]:
# Train Skipgram Model (sg=1)
model_skipgram = Word2Vec(sentences, min_count=1, vector_size=10, window=3, sg=1)
print("Skipgram Model Trained.")

vector_skipgram = model_skipgram.wv['nlp']
print("Skipgram Vector for 'nlp':\n", vector_skipgram)
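
The difference between the two objectives is easiest to see in the training examples each one derives from a sentence. The following is a sketch of the pairing step only; it leaves out everything else a real Word2Vec implementation does (for example negative sampling and subsampling), and `context_windows` is a hypothetical helper, not a gensim function.

```python
def context_windows(tokens, window):
    """Yield (center, context_words) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window): i]
        right = tokens[i + 1: i + 1 + window]
        yield center, left + right

sentence = ["i", "love", "machine", "learning"]

# CBOW view: the whole context is the input, the center word is the target.
cbow_examples = [(ctx, center) for center, ctx in context_windows(sentence, 1)]

# Skipgram view: the center word is the input, each context word is a separate target.
skipgram_pairs = [
    (center, c) for center, ctx in context_windows(sentence, 1) for c in ctx
]

print("CBOW:    ", cbow_examples)
print("Skipgram:", skipgram_pairs)
```

With a window of 1, CBOW produces one example per position, while Skipgram produces one example per (center, context) pair, which is part of why it takes longer to train.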

## 3.5 Average Word2Vec
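Word2Vec produces one vector per word. A simple way to get a fixed-length vector for a whole sentence or document is to average the vectors of its words, skipping words that are not in the model's vocabulary.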

In [None]:
import numpy as np

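def avg_word2vec(sentence, model):
    """Return the mean word vector of the in-vocabulary words in `sentence`."""
    # Whitespace tokenization; words must match the casing used during training.
    words = sentence.split()
    # Skip out-of-vocabulary words instead of raising a KeyError.
    word_vectors = [model.wv[word] for word in words if word in model.wv]

    # No known words: fall back to a zero vector of the right size.
    if not word_vectors:
        return np.zeros(model.vector_size)

    return np.mean(word_vectors, axis=0)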

Testing Average Word2Vec with the CBOW model.

In [None]:
new_sentence = "i love deep learning"
doc_vector = avg_word2vec(new_sentence, model_cbow)

print(f"Document Vector (CBOW) for '{new_sentence}':\n", doc_vector)
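
Because trained vectors are hard to check by eye, the arithmetic is easier to follow with hand-made numbers. The sketch below uses a plain dictionary of invented 3-dimensional vectors in place of a trained model; `toy_vectors` and `average_vectors` are hypothetical names for this example.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for this example.
toy_vectors = {
    "deep": np.array([1.0, 0.0, 2.0]),
    "learning": np.array([3.0, 2.0, 0.0]),
}

def average_vectors(sentence, vectors, dim=3):
    """Average the vectors of known words; unknown words are ignored."""
    found = [vectors[w] for w in sentence.split() if w in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

known = average_vectors("deep learning rocks", toy_vectors)
unknown = average_vectors("unknown words only", toy_vectors)
print(known)
print(unknown)
```

The out-of-vocabulary word 'rocks' has no effect on the result, and a sentence with no known words maps to the zero vector.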