<div style="  background: linear-gradient(145deg, #0f172a, #1e293b);  border: 4px solid transparent;  border-radius: 14px;  padding: 18px 22px;  margin: 12px 0;  font-size: 26px;  font-weight: 600;  color: #f8fafc;  box-shadow: 0 6px 14px rgba(0,0,0,0.25);  background-clip: padding-box;  position: relative;">  <div style="    position: absolute;    inset: 0;    padding: 4px;    border-radius: 14px;    background: linear-gradient(90deg, #06b6d4, #3b82f6, #8b5cf6);    -webkit-mask:       linear-gradient(#fff 0 0) content-box,       linear-gradient(#fff 0 0);    -webkit-mask-composite: xor;    mask-composite: exclude;    pointer-events: none;  "></div>    <b>Building tf-idf Document Vectors</b>    <br/>  <span style="color:#9ca3af; font-size: 18px; font-weight: 400;">(Feature Engineering for NLP in Python)</span></div>

## Table of Contents
1. [n-gram Modeling and Motivation](#section-1)
2. [Term Frequency-Inverse Document Frequency (TF-IDF)](#section-2)
3. [TF-IDF Implementation with Scikit-Learn](#section-3)
4. [Cosine Similarity: Theory and Math](#section-4)
5. [Cosine Similarity Implementation](#section-5)
6. [Project: Building a Plot Line Based Recommender](#section-6)
7. [Beyond n-grams: Word Embeddings](#section-7)
8. [Word and Document Similarities with spaCy](#section-8)
9. [Review and Conclusion](#section-9)

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 1. n-gram Modeling and Motivation</span><br>

### Understanding n-gram Modeling
In n-gram modeling, the weight of a dimension is dependent on the frequency of the word corresponding to that dimension.

*   **Example**: A document contains the word `human` in five places.
*   **Result**: The dimension corresponding to `human` has a weight of **5**.
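
As a minimal sketch of this count-based weighting, the snippet below uses scikit-learn's `CountVectorizer` on a one-document corpus. The sentence is a made-up example (not from the course data) chosen so that `human` occurs exactly five times.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical document in which "human" occurs exactly five times
docs = [
    "the human mind, the human body, the human heart, "
    "human rights and human dignity"
]

# Build the bag-of-words matrix (rows = documents, columns = vocabulary terms)
count_vectorizer = CountVectorizer()
bow_matrix = count_vectorizer.fit_transform(docs)

# Look up the column that corresponds to "human" and read its weight
human_col = count_vectorizer.vocabulary_["human"]
human_weight = int(bow_matrix[0, human_col])

print("Weight of 'human':", human_weight)
```

The weight is simply the raw count, which is exactly the limitation the next subsection discusses.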

### The Motivation for Better Vectors
While simple frequency counts are useful, they have significant drawbacks when analyzing a corpus of documents.

**The Problem:**
*   Some words occur very commonly across **all** documents.
*   Consider a corpus of documents about the universe.
    *   One document has `jupiter` and `universe` occurring 20 times each.
    *   `jupiter` rarely occurs in the other documents, but `universe` is common across the whole corpus.
    *   Simple frequency weighting treats them equally (both weight 20).
    *   **Goal**: We want to give more weight to `jupiter` on account of its **exclusivity** to that specific document.

### Applications
Advanced vectorization techniques like TF-IDF allow us to:
1.  Automatically detect stopwords.
2.  Improve Search algorithms.
3.  Build Recommender systems.
4.  Achieve better performance in predictive modeling.

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 2. Term Frequency-Inverse Document Frequency (TF-IDF)</span><br>

TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

### Key Concepts
*   **Term Frequency (TF)**: Proportional to the frequency of the word in the specific document.
*   **Inverse Document Frequency (IDF)**: Inverse function of the number of documents in which the word occurs.

### Mathematical Formula
The weight $w_{i,j}$ of term $i$ in document $j$ is calculated as:

$$ w_{i,j} = tf_{i,j} \cdot \log \left( \frac{N}{df_i} \right) $$

Where:
*   $w_{i,j}$: Weight of term $i$ in document $j$.
*   $tf_{i,j}$: Term frequency of term $i$ in document $j$.
*   $N$: Total number of documents in the corpus.
*   $df_i$: Number of documents containing term $i$.

### Calculation Example
Imagine a corpus where:
*   Total documents ($N$) = 20
*   The word "library" appears in 8 documents ($df_{library} = 8$).
*   In a specific document, "library" appears 5 times ($tf = 5$).

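$$ w_{library, document} = 5 \cdot \log_{10} \left( \frac{20}{8} \right) \approx 1.99 \approx 2 $$

This result assumes a base-10 logarithm. With the natural logarithm, the same inputs give a weight of about 4.58.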
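
The calculation can be reproduced with a few lines of Python. This is a sketch of the textbook formula only; the helper name `tfidf_weight` is hypothetical, and the document frequencies used for `jupiter` and `universe` are assumed values chosen to mirror the motivating example from Section 1.

```python
import math

def tfidf_weight(tf, df, n_docs):
    """Textbook TF-IDF weight: tf * log10(N / df)."""
    return tf * math.log10(n_docs / df)

# The "library" example: tf = 5, df = 8, N = 20
library_weight = tfidf_weight(tf=5, df=8, n_docs=20)
print(f"library:  {library_weight:.2f}")

# Assumed document frequencies: "jupiter" occurs in 1 of 20 documents,
# "universe" occurs in all 20. Both occur 20 times in the document of interest.
jupiter_weight = tfidf_weight(tf=20, df=1, n_docs=20)
universe_weight = tfidf_weight(tf=20, df=20, n_docs=20)
print(f"jupiter:  {jupiter_weight:.2f}")
print(f"universe: {universe_weight:.2f}")
```

Under these assumptions, a word that appears in every document gets a weight of zero because $\log(N/N) = 0$, while the rarer word keeps a large weight despite having the same term frequency.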

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 3. TF-IDF Implementation with Scikit-Learn</span><br>

We can implement TF-IDF easily using `TfidfVectorizer` from scikit-learn.

<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features. It handles tokenization and weighting automatically. </div>



```python
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample Corpus (Created for demonstration purposes)
corpus = [
    "The sun is hot",
    "The sun is bright",
    "The moon is cold",
    "The stars are far"
]

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(corpus)

# Print the vocabulary and the matrix as a dense array
print("Feature Names:", vectorizer.get_feature_names_out())
print("\nTF-IDF Matrix:\n", tfidf_matrix.toarray())
```



**Explanation:**
1.  We initialize the `TfidfVectorizer`.
2.  `fit_transform(corpus)` learns the vocabulary and inverse document frequencies, then returns the document-term matrix.
3.  The output is a sparse matrix where non-zero values represent the TF-IDF weight of a word in a document.
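
The numbers printed by `TfidfVectorizer` will not match the textbook formula from Section 2 exactly. With its default settings (`smooth_idf=True`, `norm='l2'`), scikit-learn uses a smoothed natural-log IDF, $\ln\left(\frac{1 + N}{1 + df_i}\right) + 1$, and then scales each document vector to unit length. The sketch below checks both properties on the same sample corpus; the document frequencies for `the` and `sun` are counted by hand from that corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The sun is hot",
    "The sun is bright",
    "The moon is cold",
    "The stars are far",
]

vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
tfidf_matrix = vectorizer.fit_transform(corpus)

n_docs = len(corpus)
vocab = vectorizer.vocabulary_

# Counted by hand: "the" is in all 4 documents, "sun" is in 2
expected_idf_the = np.log((1 + n_docs) / (1 + 4)) + 1
expected_idf_sun = np.log((1 + n_docs) / (1 + 2)) + 1

print("idf('the'):", vectorizer.idf_[vocab["the"]], "expected:", expected_idf_the)
print("idf('sun'):", vectorizer.idf_[vocab["sun"]], "expected:", expected_idf_sun)

# Each row is L2-normalized, so every document vector has length 1
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)
print("Row norms:", row_norms)
```

Because of the `+ 1` term, a word that occurs in every document keeps a small non-zero weight instead of dropping to zero.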

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 4. Cosine Similarity: Theory and Math</span><br>

To build recommender systems or search engines, we need a way to calculate how similar two documents (vectors) are. **Cosine similarity** is a standard metric for this in NLP.

### The Dot Product
Consider two vectors $V = (v_1, v_2, ..., v_n)$ and $W = (w_1, w_2, ..., w_n)$.
The dot product is:
$$ V \cdot W = (v_1 \times w_1) + (v_2 \times w_2) + ... + (v_n \times w_n) $$

**Example:**
$$ A = (4, 7, 1) $$
$$ B = (5, 2, 3) $$
$$ A \cdot B = (4 \times 5) + (7 \times 2) + (1 \times 3) $$
$$ A \cdot B = 20 + 14 + 3 = 37 $$

### Magnitude of a Vector
The magnitude (length) of a vector is defined as:
$$ ||V|| = \sqrt{(v_1)^2 + (v_2)^2 + ... + (v_n)^2} $$

**Example for Vector A:**
$$ ||A|| = \sqrt{(4)^2 + (7)^2 + (1)^2} $$
$$ ||A|| = \sqrt{16 + 49 + 1} = \sqrt{66} \approx 8.12 $$

### The Cosine Score Formula
The cosine similarity is the cosine of the angle $\theta$ between two vectors.

$$ \text{sim}(A, B) = \cos(\theta) = \frac{A \cdot B}{||A|| \cdot ||B||} $$

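**Calculating the Score for A and B:**

First, the magnitude of B, computed the same way as for A:
$$ ||B|| = \sqrt{(5)^2 + (2)^2 + (3)^2} = \sqrt{25 + 4 + 9} = \sqrt{38} \approx 6.16 $$

Then the score:
$$ \cos(A, B) = \frac{37}{\sqrt{66} \times \sqrt{38}} $$
$$ \cos(A, B) \approx \frac{37}{8.12 \times 6.16} \approx 0.7388 $$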
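
The same arithmetic can be checked step by step with NumPy. This is a direct transcription of the formulas above rather than a library routine.

```python
import numpy as np

A = np.array([4, 7, 1])
B = np.array([5, 2, 3])

# Dot product: sum of element-wise products
dot_product = A @ B

# Magnitudes: square root of the sum of squared components
norm_a = np.linalg.norm(A)
norm_b = np.linalg.norm(B)

# Cosine score: dot product divided by the product of the magnitudes
cosine_score = dot_product / (norm_a * norm_b)

print("A . B =", dot_product)
print("||A|| =", round(norm_a, 2), " ||B|| =", round(norm_b, 2))
print("cos(A, B) =", round(cosine_score, 4))
```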

### Points to Remember
*   Mathematically, the value is between -1 and 1.
*   In NLP (where term counts are non-negative), the value is between **0 and 1**.
*   It is **robust to document length** (unlike Euclidean distance).
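
To illustrate the last point, the sketch below compares a count vector with a copy of itself scaled by two, which is roughly what happens when the same text is repeated twice. The helper name `cosine` is hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

doc = np.array([4.0, 7.0, 1.0])
doc_doubled = 2 * doc  # same word proportions, twice the length

cos_score = cosine(doc, doc_doubled)
euclidean_distance = np.linalg.norm(doc - doc_doubled)

print("Cosine similarity: ", cos_score)
print("Euclidean distance:", euclidean_distance)
```

The two vectors point in the same direction, so the cosine score stays at 1, while the Euclidean distance grows with the difference in length.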

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 5. Cosine Similarity Implementation</span><br>

Scikit-learn provides a utility to calculate this efficiently.



```python
# Import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

# Define two 3-dimensional vectors A and B
A = [4, 7, 1]
B = [5, 2, 3]

# Compute the cosine score of A and B
# Note: cosine_similarity expects 2D arrays (lists of vectors)
score = cosine_similarity([A], [B])

# Print the cosine score
print(score)
```



**Output Interpretation**:
The output `[[0.73881883]]` matches our manual calculation.

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 6. Project: Building a Plot Line Based Recommender</span><br>

We will build a system that recommends movies based on the similarity of their plot descriptions.

### The Data
We will use a small subset of movie data containing titles and overviews.

| Title | Overview |
| :--- | :--- |
| **Shanghai Triad** | A provincial boy related to a Shanghai crime family is recruited by his uncle into cosmopolitan Shanghai in the 1930s to be a servant to a ganglord's mistress. |
| **Cry, the Beloved Country** | A South-African preacher goes to search for his wayward son who has committed a crime in the big city. |
| **The Godfather** | Spanning the years 1945 to 1955, a chronicle of the fictional Italian-American Corleone crime family. |

### Step 1: Data Setup and Preprocessing
First, let's create a DataFrame to simulate the dataset.



```python
import pandas as pd

# Creating a dummy dataset for the recommender
data = {
    'title': [
        'The Godfather',
        'The Godfather: Part II',
        'The Godfather: Part III',
        'Shanghai Triad',
        'Cry, the Beloved Country',
        'Goodfellas',
        'The Lion King',
        'The Lion King 2: Simba\'s Pride'
    ],
    'overview': [
        'Spanning the years 1945 to 1955, a chronicle of the fictional Italian-American Corleone crime family.',
        'The early life and career of Vito Corleone in 1920s New York City is portrayed, while his son, Michael, expands and tightens his grip on the family crime syndicate.',
        'In the midst of trying to legitimize his business dealings in New York City and Italy in 1979, aging Mafia Don Michael Corleone seeks to avow for his sins.',
        'A provincial boy related to a Shanghai crime family is recruited by his uncle into cosmopolitan Shanghai in the 1930s.',
        'A South-African preacher goes to search for his wayward son who has committed a crime in the big city.',
        'The story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito.',
        'Lion prince Simba and his father are targeted by his bitter uncle, who wants to ascend the throne himself.',
        'Simba\'s daughter is the key to a resolution of a bitter feud between Simba\'s pride and the outcast pride led by the mate of Scar.'
    ]
}

df = pd.DataFrame(data)
print(df.head())
```



### Step 2: Generating TF-IDF Vectors
We convert the text overviews into numerical vectors.



```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
# We can use built-in stop words to remove common English words
vectorizer = TfidfVectorizer(stop_words='english')

# Generate matrix of tf-idf vectors
tfidf_matrix = vectorizer.fit_transform(df['overview'])

print("Matrix Shape:", tfidf_matrix.shape)
```



### Step 3: Generating Cosine Similarity Matrix
We calculate the similarity of every movie against every other movie.

<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> By default, <code>TfidfVectorizer</code> L2-normalizes each document vector, so its magnitude is 1. Therefore, the dot product of two TF-IDF vectors is equal to their cosine similarity. We can use <code>linear_kernel</code> instead of <code>cosine_similarity</code> to skip the redundant normalization step, which saves computation time on large corpora. </div>



```python
from sklearn.metrics.pairwise import linear_kernel

# Generate cosine similarity matrix
# linear_kernel calculates the dot product
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

print(cosine_sim)
```
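
The claim in the tip can be verified on a small example. The sketch below builds TF-IDF vectors for the three overviews shown in the data table and compares the output of `linear_kernel` with that of `cosine_similarity`. It assumes the default `norm='l2'` setting; with `norm=None` the two results would differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# The three overviews from the data table above
overviews = [
    "A provincial boy related to a Shanghai crime family is recruited by his "
    "uncle into cosmopolitan Shanghai in the 1930s to be a servant to a "
    "ganglord's mistress.",
    "A South-African preacher goes to search for his wayward son who has "
    "committed a crime in the big city.",
    "Spanning the years 1945 to 1955, a chronicle of the fictional "
    "Italian-American Corleone crime family.",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(overviews)

dot_sim = linear_kernel(tfidf, tfidf)      # plain dot products
cos_sim = cosine_similarity(tfidf, tfidf)  # dot products after normalization

print("Matrices match:", np.allclose(dot_sim, cos_sim))
print("Self-similarity (diagonal):", np.diag(dot_sim))
```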



### Step 4: The Recommender Function
We need a function that:
1.  Takes a movie title.
2.  Finds the index of that movie.
3.  Extracts pairwise similarity scores.
4.  Sorts scores in descending order.
5.  Returns the top similar titles (ignoring the movie itself).



```python
# Construct a reverse map from movie titles to DataFrame indices
indices = pd.Series(df.index, index=df['title']).drop_duplicates()

def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 3 most similar movies (skipping position 0, which is the movie itself)
    sim_scores = sim_scores[1:4]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 3 most similar movies
    return df['title'].iloc[movie_indices]

# Test the recommender
print("Recommendations for 'The Godfather':")
print(get_recommendations('The Godfather'))

print("\nRecommendations for 'The Lion King':")
print(get_recommendations('The Lion King'))
```
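
The function above relies on the queried movie sorting into position 0 and raises an unhelpful error for an unknown title. One possible variant, sketched below, excludes the movie by its index and checks the title first. The names `recommend`, `movies`, `sim`, and `title_to_idx` are hypothetical, and the data is limited to the three rows from the data table to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Three rows from the data table, used only for illustration
movies = pd.DataFrame({
    "title": ["Shanghai Triad", "Cry, the Beloved Country", "The Godfather"],
    "overview": [
        "A provincial boy related to a Shanghai crime family is recruited by "
        "his uncle into cosmopolitan Shanghai in the 1930s to be a servant to "
        "a ganglord's mistress.",
        "A South-African preacher goes to search for his wayward son who has "
        "committed a crime in the big city.",
        "Spanning the years 1945 to 1955, a chronicle of the fictional "
        "Italian-American Corleone crime family.",
    ],
})

tfidf = TfidfVectorizer(stop_words="english").fit_transform(movies["overview"])
sim = linear_kernel(tfidf, tfidf)
title_to_idx = pd.Series(movies.index, index=movies["title"])

def recommend(title, top_n=2):
    """Return up to top_n titles most similar to `title`, excluding itself."""
    if title not in title_to_idx.index:
        raise KeyError(f"Unknown title: {title!r}")
    idx = title_to_idx[title]
    # Indices sorted by descending similarity to the queried movie
    ranked = np.argsort(-sim[idx])
    # Drop the queried movie by index rather than by assumed position
    ranked = [i for i in ranked if i != idx][:top_n]
    return movies["title"].iloc[ranked].tolist()

godfather_recs = recommend("The Godfather")
print(godfather_recs)
```

Filtering by index keeps the result correct even if another movie has an identical overview and ties for the top score.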



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 7. Beyond n-grams: Word Embeddings</span><br>

### The Problem with BoW and TF-IDF
Consider these sentences:
1.  'I am happy'
2.  'I am joyous'
3.  'I am sad'

In Bag-of-Words or TF-IDF, "happy" and "joyous" are just different strings. The model does not know they are synonyms.

### Word Embeddings
Word embeddings map words into an n-dimensional vector space.
*   Produced using deep learning on huge amounts of data.
*   Can discern how similar two words are.
*   Used to detect synonyms and antonyms.
*   **Captures complex relationships**:
    *   King - Queen $\rightarrow$ Man - Woman
    *   France - Paris $\rightarrow$ Russia - Moscow

### Implementation using spaCy
We use the `spaCy` library. (Note: You must download the model via `python -m spacy download en_core_web_lg`).



```python
import spacy

# Load model and create Doc object
# Note: word vectors ship with the 'md' and 'lg' models; the 'sm' models do not include them.
try:
    nlp = spacy.load('en_core_web_lg')
except OSError:
    print("Downloading model...")
    from spacy.cli import download
    download("en_core_web_lg")
    nlp = spacy.load('en_core_web_lg')

doc = nlp('I am happy')

# Generate word vectors for each token
for token in doc:
    print(f"Token: {token.text}, Vector Size: {token.vector.shape}")
    # Printing just the first 5 dimensions for brevity
    print(token.vector[:5])
```



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 8. Word and Document Similarities with spaCy</span><br>

### Word Similarities
We can compare individual tokens to see how semantically close they are.



```python
doc = nlp("happy joyous sad")

# Iterate over tokens to compare them
for token1 in doc:
    for token2 in doc:
        print(f"{token1.text} - {token2.text}: {token1.similarity(token2):.4f}")
```



**Expected Behavior**:
*   `happy` and `joyous` should have a high similarity score.
*   `happy` and `sad` will typically have a lower score, though often not by much: both are emotion words that appear in similar contexts, and the exact values depend on the model version.

### Document Similarities
We can also compare entire documents (sentences). spaCy averages the vectors of the words in the document.



```python
# Generate doc objects
sent1 = nlp("I am happy")
sent2 = nlp("I am sad")
sent3 = nlp("I am joyous")

# Compute similarity between sent1 and sent2
score_1_2 = sent1.similarity(sent2)
print(f"Similarity ('I am happy', 'I am sad'): {score_1_2}")

# Compute similarity between sent1 and sent3
score_1_3 = sent1.similarity(sent3)
print(f"Similarity ('I am happy', 'I am joyous'): {score_1_3}")
```
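
To make the averaging step concrete without loading a spaCy model, the sketch below repeats the comparison with hand-made 3-dimensional vectors. The values in `toy_vectors` are invented for illustration and are not taken from any real model (spaCy's `en_core_web_lg` vectors have 300 dimensions). The helper names are hypothetical as well.

```python
import numpy as np

# Invented toy embeddings: "happy" and "joyous" point in similar directions,
# "sad" points elsewhere. Real embeddings are learned from data.
toy_vectors = {
    "i":      np.array([0.1, 0.5, 0.2]),
    "am":     np.array([0.0, 0.4, 0.3]),
    "happy":  np.array([0.9, 0.1, 0.0]),
    "joyous": np.array([0.8, 0.2, 0.1]),
    "sad":    np.array([-0.7, 0.3, 0.1]),
}

def doc_vector(text):
    """Average the vectors of the whitespace-separated words in `text`."""
    return np.mean([toy_vectors[word] for word in text.lower().split()], axis=0)

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

happy_doc = doc_vector("I am happy")
sim_happy_sad = cosine(happy_doc, doc_vector("I am sad"))
sim_happy_joyous = cosine(happy_doc, doc_vector("I am joyous"))

print(f"Similarity ('I am happy', 'I am sad'):    {sim_happy_sad:.3f}")
print(f"Similarity ('I am happy', 'I am joyous'): {sim_happy_joyous:.3f}")
```

Because the shared words `I` and `am` contribute the same vectors to every average, short sentences that differ in a single word tend to score as fairly similar even when that word changes the meaning.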



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 9. Review and Conclusion</span><br>

### Summary of Feature Engineering for NLP
In this notebook, we have covered the essential techniques for converting text into numerical data that machine learning models can understand.

1.  **Basic Features**: Characters, words, mentions.
2.  **Preprocessing**: Tokenization, Lemmatization, Text cleaning.
3.  **n-gram Modeling**: Capturing context by looking at adjacent words.
4.  **TF-IDF**: Weighting words by their importance (exclusivity) to a document.
5.  **Cosine Similarity**: Measuring the angle between document vectors to find similarities.
6.  **Word Embeddings**: Using deep learning (spaCy) to capture semantic meaning and relationships.

### Next Steps
To further advance your NLP skills, consider exploring:
*   **Advanced NLP with spaCy**: Custom pipelines and entity recognition.
*   **Deep Learning**: Using libraries like TensorFlow or PyTorch for Transformers (BERT, GPT).

**Congratulations on building TF-IDF document vectors and your own Movie Recommender System!**
