<a href="https://colab.research.google.com/github/jenelaineDC/Natural-Language-Processing/blob/main/TFIDF_Movie_Recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TF‑IDF — Movie Recommendations

### Learning goals
- Understand TF-IDF and its two components: **term frequency (TF)** and **inverse document frequency (IDF)**.
- Learn about **vector similarity** and how **cosine similarity** compares document vectors.
- Briefly compare `TfidfVectorizer` and `CountVectorizer`.
- Build a movie recommender using the TMDB dataset.


### TF‑IDF vs CountVectorizer

In a large text corpus, some words appear very often (e.g. “the”, “a”, “is” in English) and therefore carry very little information about the actual contents of a document. If we fed the raw counts directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more informative terms.

To re-weight the count features into floating-point values suitable for a classifier, it is very common to use the tf–idf transform.

| Aspect | CountVectorizer | TfidfVectorizer |
|---|---|---|
| What it outputs | Raw token counts (bag-of-words) | TF‑IDF weighted values (normalized) |
| Use-case | When absolute frequency matters, or as input to other transforms | When you want to emphasize distinctive words |
| Relationship | — | Equivalent to `CountVectorizer` **followed by** `TfidfTransformer` in scikit‑learn |
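
The relationship in the last row can be checked on a tiny corpus. The sketch below is illustrative only: the three sentences are made up, and both routes use default parameters.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, invented for illustration
docs = [
    "the hero saves the city",
    "the villain attacks the city",
    "a quiet romance in paris",
]

# Route 1: raw counts first, then TF-IDF re-weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: TfidfVectorizer performs both steps at once
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print("Count matrix dtype:", counts.dtype)
print("Both routes agree:", same)
```

With matching parameters the two matrices should be numerically identical; the count matrix holds integers while the TF-IDF matrix holds floats.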


### TF-IDF

[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)

```
tfidf(t,d) = tf(t,d) x idf(t)
```
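- `idf(t)` = log[(number of documents) / (number of documents containing the term `t`)]
- `tf(t,d)` = (number of occurrences of `t` in document `d`) / (number of words in `d`)

Note that scikit-learn's `TfidfVectorizer` uses the raw count for `tf(t,d)` and, by default, a smoothed IDF, `ln((1 + N) / (1 + df(t))) + 1`, so its values differ from this textbook definition.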
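
To make the two definitions concrete, here is a minimal by-hand sketch in plain Python. The corpus and the helper names (`tf`, `idf`, `tfidf`) are made up for illustration, and the code follows the textbook definitions (count divided by document length, unsmoothed log) rather than scikit-learn's variant.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (invented for illustration)
docs = [
    "the hero saves the city".split(),
    "the villain attacks the city".split(),
    "the quiet romance".split(),
]

def tf(term, doc):
    # occurrences of the term divided by the number of words in the document
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # log of (number of documents / number of documents containing the term);
    # undefined for a term that appears in no document
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(round(tfidf("the", docs[0], docs), 4))   # 0.0, "the" is in every document
print(round(tfidf("hero", docs[0], docs), 4))  # 0.2197, i.e. (1/5) * ln(3)
```

A word that occurs in every document gets an IDF of zero, so it contributes nothing regardless of how often it appears.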

**TfidfVectorizer() parameters**

| Parameter | What it does | Default |
|---|---|---|
| `input` | Whether input is `filename`, `file` or `content` | `'content'` |
| `lowercase` | Convert text to lowercase before tokenizing | `True` |
| `stop_words` | Remove stop words (e.g. `'english'`) | `None` |
| `token_pattern` | Regex for tokenization (word tokens) | `r'(?u)\b\w\w+\b'` |
| `ngram_range` | Range of n-grams to extract, e.g. `(1,2)` for uni+bi-grams | `(1,1)` |
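| `max_df` / `min_df` | Ignore terms that appear in too many / too few documents (a proportion if float, an absolute count if int) | `1.0` / `1` |
| `max_features` | Keep only the top K terms ranked by term frequency across the corpus | `None` |
| `use_idf` | Enable IDF reweighting | `True` |
| `smooth_idf` | Add one to document frequencies, as if an extra document contained every term once; prevents division by zero | `True` |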
| `sublinear_tf` | Apply `1 + log(tf)` scaling | `False` |
| `norm` | Vector normalization (`'l2'` or `'l1'` or `None`) | `'l2'` |



**Term frequency variations:**
- Binary (1 if the word appears in the document, 0 if it does not)
- Normalized count
  ```
  tf(t,d) = count(t,d) / Σ_t' count(t',d)
  ```
- Log
  ```
  tf(t,d) = log(1 + count(t,d))
  ```
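
The three term-frequency variants can be compared side by side on a single document. This is a plain-Python sketch with an invented sentence; the dictionary names are hypothetical. In scikit-learn, the closest switches are `binary=True` and `sublinear_tf=True` (the latter uses `1 + log(tf)` rather than `log(1 + tf)`).

```python
import math
from collections import Counter

# One made-up document, already tokenized
doc = "the hero saves the city the end".split()
counts = Counter(doc)
total = sum(counts.values())

# Binary: 1 for every term that occurs at least once
tf_binary = {t: 1 for t in counts}

# Normalized count: share of the document taken up by each term
tf_normalized = {t: c / total for t, c in counts.items()}

# Log: dampens the effect of very frequent terms
tf_log = {t: math.log(1 + c) for t, c in counts.items()}

for term in counts:
    print(f"{term:>6}  count={counts[term]}  binary={tf_binary[term]}  "
          f"normalized={tf_normalized[term]:.3f}  log={tf_log[term]:.3f}")
```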

**Inverse document frequency variations** (`N` = number of documents, `N(t)` = number of documents containing `t`):
- Smooth IDF
  ```
  idf(t) = log( N / (1 + N(t)) ) + 1
  ```
- IDF max
  ```
  idf(t) = log( max_t' N(t') / N(t) )
  ```
- Probabilistic IDF
  ```
  idf(t) = log( (N - N(t)) / N(t) )
  ```
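
As a rough numerical comparison, the sketch below evaluates these variants for three hypothetical document frequencies. It reads the smooth variant as `log(N / (1 + N(t))) + 1`, and adds scikit-learn's `smooth_idf=True` formula, `ln((1 + N) / (1 + N(t))) + 1`, for reference. All numbers and function names are made up for illustration.

```python
import math

N = 4                                            # documents in a made-up corpus
doc_freq = {"the": 4, "city": 2, "batman": 1}    # hypothetical N(t) values
max_doc_freq = max(doc_freq.values())

def idf_plain(n_t):
    return math.log(N / n_t)

def idf_smooth(n_t):
    return math.log(N / (1 + n_t)) + 1

def idf_sklearn_smooth(n_t):
    return math.log((1 + N) / (1 + n_t)) + 1

def idf_max(n_t):
    return math.log(max_doc_freq / n_t)

def idf_probabilistic(n_t):
    # log(0) is undefined when every document contains the term
    return math.log((N - n_t) / n_t) if n_t < N else float("-inf")

variants = [idf_plain, idf_smooth, idf_sklearn_smooth, idf_max, idf_probabilistic]
for term, n_t in doc_freq.items():
    values = "  ".join(f"{f.__name__}={f(n_t):.3f}" for f in variants)
    print(f"{term:>7}: {values}")
```

Every variant gives the rarest term the largest weight; they differ mainly in how they treat terms that occur in all or nearly all documents.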

**Normalizing the entire TF-IDF:**
- In scikit-learn, set `TfidfVectorizer(norm='l2')` or `TfidfVectorizer(norm='l1')`; `'l2'` is the default, and `norm=None` disables normalization.
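
The effect of `norm` can be verified directly on the rows of the output matrix. The sketch below uses two invented sentences and checks that L2-normalized rows have unit Euclidean length and L1-normalized rows sum to one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration
docs = [
    "the hero saves the city",
    "the villain attacks the city at night",
]

X_l2 = TfidfVectorizer(norm="l2").fit_transform(docs).toarray()
X_l1 = TfidfVectorizer(norm="l1").fit_transform(docs).toarray()
X_raw = TfidfVectorizer(norm=None).fit_transform(docs).toarray()

l2_lengths = np.sqrt((X_l2 ** 2).sum(axis=1))   # Euclidean length of each row
l1_sums = np.abs(X_l1).sum(axis=1)              # sum of absolute values per row

print("L2 row lengths:", l2_lengths)
print("L1 row sums   :", l1_sums)

# Dividing the unnormalized rows by their length reproduces the L2 result
manual_l2 = X_raw / np.linalg.norm(X_raw, axis=1, keepdims=True)
print("Manual L2 matches:", np.allclose(manual_l2, X_l2))
```

One practical consequence: with L2-normalized rows, the cosine similarity between two documents reduces to a plain dot product.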


### Cosine similarity
Cosine similarity measures how similar two vectors are based on the angle between them. In NLP it is commonly used to compare document vectors, such as the TF-IDF vectors of two movies.

**Why "cosine"?**

Because it uses the cosine of the angle between two vectors.
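- If the angle is small (the vectors point in a similar direction), cosine similarity is close to 1 (very similar).
- If the angle is close to 90° (the vectors have almost nothing in common), cosine similarity is close to 0; beyond 90° it becomes negative, down to -1 for opposite directions. TF-IDF vectors have no negative entries, so their cosine similarity always lies between 0 and 1.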

**Formula:**

For two vectors A and B:

Cosine Similarity = (A⋅B) / (∥A∥ ∥B∥)

- A⋅B = dot product of A and B
- ∥A∥ = magnitude (length) of vector A
- ∥B∥ = magnitude (length) of vector B
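
The formula translates almost directly into NumPy. The sketch below uses a hypothetical helper named `cosine` and a few made-up vectors, and compares the result with scikit-learn's `cosine_similarity`, which expects 2-D inputs.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a
d = -a                          # opposite direction

print("same direction:", round(cosine(a, b), 6))
print("orthogonal    :", round(cosine(a, c), 6))
print("opposite      :", round(cosine(a, d), 6))

# scikit-learn works on matrices whose rows are vectors
sk_value = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]
print("matches scikit-learn:", np.isclose(sk_value, cosine(a, b)))
```

Because `b` is just `a` scaled by two, the similarity is 1 up to floating-point rounding: cosine similarity ignores vector length and looks only at direction.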

### Movie Recommender

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

In [2]:
# Download the TMDB dataset
!wget -q https://lazyprogrammer.me/course_files/nlp/tmdb_5000_movies.csv -O tmdb_5000_movies.csv

In [3]:
df = pd.read_csv('tmdb_5000_movies.csv')
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [4]:
import ast  # safely evaluate stringified lists/dicts

# Function to extract 'name' values
def extract_names(obj):
    try:
        # Safely convert string to list of dicts
        data = ast.literal_eval(obj)
        if isinstance(data, list):
            return [d['name'] for d in data if 'name' in d]
    except (ValueError, SyntaxError, TypeError):
        pass  # malformed or missing value (e.g. NaN)
    return []  # return empty list if parsing fails

# Apply to your dataset
df['genres'] = df['genres'].apply(extract_names)
df['keywords'] = df['keywords'].apply(extract_names)

# Example output for first row
print(df[['title', 'genres', 'keywords']].head(1))

    title                                         genres  \
0  Avatar  [Action, Adventure, Fantasy, Science Fiction]   

                                            keywords  
0  [culture clash, future, space war, space colon...  
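
Since the raw `genres` and `keywords` cells are JSON strings, a `json`-based parser is a possible alternative to `ast.literal_eval`. The function below, `extract_names_json`, is a hypothetical variant written for illustration and is not part of the notebook; the sample string mimics the format shown in the data preview.

```python
import json

def extract_names_json(raw):
    """Return the 'name' values from a JSON list of dicts, or [] on failure."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        # TypeError: value is not a string (e.g. NaN); ValueError: invalid JSON
        return []
    if not isinstance(data, list):
        return []
    return [d["name"] for d in data if isinstance(d, dict) and "name" in d]

sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(extract_names_json(sample))          # ['Action', 'Adventure']
print(extract_names_json(float("nan")))    # []
print(extract_names_json("not json"))      # []
```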


In [5]:
# Convert genres and keywords column to string
df['genres'] = df['genres'].apply(lambda x: ' '.join(x))
df['keywords'] = df['keywords'].apply(lambda x: ' '.join(x))

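# Create tags (a missing overview makes the whole tag NaN, so drop those rows)
df['tags'] = df['overview'] + ' ' + df['genres'] + ' ' + df['keywords']
# Reset the index so row labels match the row positions of the feature matrices
df = df.dropna(subset=['tags']).reset_index(drop=True)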
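
It helps to see how a missing overview propagates through this step. In the sketch below (a made-up three-row frame), concatenating with a missing value yields a missing tag, and `dropna` then leaves a gap in the index labels. That gap matters later because rows of a vectorizer's output matrix are addressed by position, not by label.

```python
import pandas as pd

# Tiny made-up frame with one missing overview
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["space war", None, "heist gone wrong"],
    "genres": ["Action", "Drama", "Crime"],
})

# String concatenation with a missing value gives a missing result
toy["tags"] = toy["overview"] + " " + toy["genres"]
print(toy["tags"].isna().tolist())      # [False, True, False]

kept = toy.dropna(subset=["tags"])
print(list(kept.index))                 # [0, 2], labels no longer match positions

realigned = kept.reset_index(drop=True)
print(list(realigned.index))            # [0, 1], labels equal positions again
```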


#### TF-IDF Recommender

In [6]:
# Create TF-IDF features
tfidf = TfidfVectorizer(max_features=2000, stop_words='english', norm='l2')
X_tfidf = tfidf.fit_transform(df['tags'])

print('TF-IDF matrix shape:', X_tfidf.shape)
print('Sample features (first 20):', tfidf.get_feature_names_out()[:20])

TF-IDF matrix shape: (4800, 2000)
Sample features (first 20): ['000' '10' '11' '12' '15' '17' '18th' '1930s' '1950s' '1960s' '1970s'
 '1980s' '19th' '20' '20th' '30' '3d' 'abandoned' 'ability' 'able']


In [7]:
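# Mapping title -> row index.
# Note: drop_duplicates() compares the values (row indices), which are already unique,
# so duplicate titles stay in the mapping and are handled inside the recommender.
movie2idx = pd.Series(df.index, index=df['title']).drop_duplicates()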
movie2idx.head(5)

Unnamed: 0_level_0,0
title,Unnamed: 1_level_1
Avatar,0
Pirates of the Caribbean: At World's End,1
Spectre,2
The Dark Knight Rises,3
John Carter,4


In [8]:
def recommend_tfidf(title, topn=5):
    # Get the index of the title
    idx = movie2idx.get(title)
    if idx is None:
        raise KeyError(f"Title not found: {title}")

    # Duplicate titles return a Series of indices; keep the first one
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]

    # Get the TF-IDF vector for the query movie
    query_vec = X_tfidf[idx]

    # Compute cosine similarity with all other movies
    scores = cosine_similarity(query_vec, X_tfidf).flatten()

    # Exclude the movie itself and get top-N similar movies
    top_idxs = (-scores).argsort()[1: topn + 1]

    # Return the titles of the most similar movies
    return df['title'].iloc[top_idxs].reset_index(drop=True)

In [9]:
print('Recommendations for "Scream 3":')
print(recommend_tfidf('Scream 3', topn=5))

Recommendations for "Scream 3":
0       The Calling
1        Jackass 3D
2            Scream
3    Disaster Movie
4        Grindhouse
Name: title, dtype: object
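
The slice `[1: topn + 1]` relies on the query movie ranking first in its own similarity list. A slightly more defensive variant, sketched below on a made-up four-movie catalogue, masks the query explicitly and also returns the scores. All names here (`toy`, `recommend_with_scores`) are hypothetical and independent of the notebook's variables.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up catalogue for illustration
toy = pd.DataFrame({
    "title": ["Space Raiders", "Star Patrol", "Paris Nights", "Galaxy Heist"],
    "tags": [
        "space war alien planet battle",
        "space alien patrol galaxy battle",
        "romance paris love dinner",
        "heist galaxy thief space",
    ],
})
X = TfidfVectorizer().fit_transform(toy["tags"])
titles = toy["title"].to_numpy()

def recommend_with_scores(title, topn=2):
    matches = np.flatnonzero(titles == title)
    if matches.size == 0:
        raise KeyError(f"Title not found: {title}")
    pos = matches[0]                       # row position of the query movie
    scores = cosine_similarity(X[pos], X).ravel()
    scores[pos] = -np.inf                  # exclude the query itself explicitly
    top = np.argsort(-scores)[:topn]       # positions of the highest scores
    return pd.DataFrame({"title": titles[top], "score": scores[top]})

print(recommend_with_scores("Space Raiders"))
```

Returning the scores alongside the titles makes it easier to judge whether a recommendation is a strong match or merely the least bad option.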


In [10]:
# Show top terms (by tf-idf) for a specific movie index
def top_tfidf_terms(doc_index, top_n=10):
    vec = X_tfidf[doc_index].toarray().ravel()
    top_indices = vec.argsort()[-top_n:][::-1]
    terms = tfidf.get_feature_names_out()[top_indices]
    scores = vec[top_indices]
    return list(zip(terms, np.round(scores,4)))

# Example: top terms for 'The Dark Knight Rises'
idx = movie2idx['The Dark Knight Rises']
print('Top TF-IDF terms for', df.loc[idx,'title'])
print(top_tfidf_terms(idx, top_n=12))

Top TF-IDF terms for The Dark Knight Rises
[('batman', np.float64(0.4809)), ('attorney', np.float64(0.2797)), ('city', np.float64(0.2527)), ('terrorist', np.float64(0.2377)), ('protect', np.float64(0.2266)), ('department', np.float64(0.1437)), ('district', np.float64(0.1427)), ('imax', np.float64(0.1407)), ('knight', np.float64(0.1407)), ('vigilante', np.float64(0.1398)), ('terrorism', np.float64(0.139)), ('comics', np.float64(0.1365))]


#### CountVectorizer Recommender

In [11]:
# Apply CountVectorizer()
cv = CountVectorizer(max_features=2000, stop_words='english')
X_count = cv.fit_transform(df['tags'])

In [12]:
# Same recommender as recommend_tfidf, but on raw count vectors
def recommend_count(title, topn=5):
    idx = movie2idx.get(title)
    if idx is None:
        raise KeyError(f"Title not found: {title}")
    # Duplicate titles return a Series of indices; keep the first one
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    q = X_count[idx]
    scores = cosine_similarity(q, X_count).flatten()
    top_idxs = (-scores).argsort()[1: topn+1]
    return df['title'].iloc[top_idxs].reset_index(drop=True)

In [13]:
# Compare for a sample movie
movie = 'Mortal Kombat'
print('TF-IDF recommendations for', movie)
print(recommend_tfidf(movie))
print('\nCountVectorizer recommendations for', movie)
print(recommend_count(movie))

TF-IDF recommendations for Mortal Kombat
0    Mortal Kombat: Annihilation
1        Lara Croft: Tomb Raider
2             DOA: Dead or Alive
3                       Æon Flux
4         300: Rise of an Empire
Name: title, dtype: object

CountVectorizer recommendations for Mortal Kombat
0    Mortal Kombat: Annihilation
1        Lara Croft: Tomb Raider
2             DOA: Dead or Alive
3                       Æon Flux
4         300: Rise of an Empire
Name: title, dtype: object
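
To compare the two recommenders beyond eyeballing the lists, one simple option is the Jaccard overlap between the recommended titles. The sketch below uses a hypothetical helper, `jaccard_overlap`, applied to the two lists printed above for "Mortal Kombat".

```python
def jaccard_overlap(recs_a, recs_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(recs_a), set(recs_b)
    if not a and not b:
        return 1.0   # two empty lists are treated as identical
    return len(a & b) / len(a | b)

# Titles copied from the outputs above
tfidf_recs = [
    "Mortal Kombat: Annihilation",
    "Lara Croft: Tomb Raider",
    "DOA: Dead or Alive",
    "Æon Flux",
    "300: Rise of an Empire",
]
count_recs = list(tfidf_recs)   # the CountVectorizer output lists the same five titles

print(jaccard_overlap(tfidf_recs, count_recs))   # 1.0
```

For this movie both representations return the same five titles in the same order; running the comparison over many titles would show how often the two approaches actually diverge.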
