### Context Aware Model

In [1]:
# For data manipulation and analysis
import pandas as pd
import numpy as np

# For text preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
import datetime
import string

# For multilabel classification
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

# For neural networks



# For model evaluation
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix


#### Context Vectors (For CB Model)

1. Explicit Context - extract context features from MovieLens dataframe or imdb link 


Movies: 
Movie Ids
---------

Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id `1` corresponds to the URL <https://movielens.org/movies/1>). Movie ids are consistent between `ratings.csv`, `tags.csv`, `movies.csv`, and `links.csv` (i.e., the same id refers to the same movie across these four data files).


Movies Data File Structure (movies.csv)
---------------------------------------

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres

Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)

In [2]:
movies = pd.read_csv("../dataset/ml-20m/filtered_movies.csv")

In [3]:
movies.drop(columns=['Unnamed: 0'], inplace=True)

movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,4,Waiting to Exhale (1995),Comedy|Drama|Romance
3,5,Father of the Bride Part II (1995),Comedy
4,6,Heat (1995),Action|Crime|Thriller
...,...,...,...
6232,130856,Severe Clear (2010),Comedy|Documentary
6233,130958,Killer Crocodile (1989),Horror
6234,130984,Santo vs. las lobas (1976),Action|Fantasy|Horror
6235,131011,Execution Squad (1972),Crime|Drama


In [4]:
movies.head(10)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,4,Waiting to Exhale (1995),Comedy|Drama|Romance
3,5,Father of the Bride Part II (1995),Comedy
4,6,Heat (1995),Action|Crime|Thriller
5,7,Sabrina (1995),Comedy|Romance
6,10,GoldenEye (1995),Action|Adventure|Thriller
7,11,"American President, The (1995)",Comedy|Drama|Romance
8,12,Dracula: Dead and Loving It (1995),Comedy|Horror
9,14,Nixon (1995),Drama



Links Data File Structure (links.csv)
---------------------------------------

Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,imdbId,tmdbId

movieId is an identifier for movies used by <https://movielens.org>. E.g., the movie Toy Story has the link <https://movielens.org/movies/1>.

imdbId is an identifier for movies used by <http://www.imdb.com>. E.g., the movie Toy Story has the link <http://www.imdb.com/title/tt0114709/>.

tmdbId is an identifier for movies used by <https://www.themoviedb.org>. E.g., the movie Toy Story has the link <https://www.themoviedb.org/movie/862>.

Use of the resources listed above is subject to the terms of each provider.

In [6]:
links = pd.read_csv("../dataset/ml-20m/links.csv")

In [7]:
links.head(10)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
5,6,113277,949.0
6,7,114319,11860.0
7,8,112302,45325.0
8,9,114576,9091.0
9,10,113189,710.0


In [8]:
df_movies = pd.merge(links, movies, on='movieId', how='inner')


In [9]:
df_movies

Unnamed: 0,movieId,imdbId,tmdbId,title,genres
0,1,114709,862.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,113497,8844.0,Jumanji (1995),Adventure|Children|Fantasy
2,4,114885,31357.0,Waiting to Exhale (1995),Comedy|Drama|Romance
3,5,113041,11862.0,Father of the Bride Part II (1995),Comedy
4,6,113277,949.0,Heat (1995),Action|Crime|Thriller
...,...,...,...,...,...
6232,130856,494826,48376.0,Severe Clear (2010),Comedy|Documentary
6233,130958,143338,78402.0,Killer Crocodile (1989),Horror
6234,130984,208423,317168.0,Santo vs. las lobas (1976),Action|Fantasy|Horror
6235,131011,69109,79572.0,Execution Squad (1972),Crime|Drama


#### Converting to correct data types


In [10]:
dtype_dict = {col: 'str' for col in df_movies.columns}
dtype_dict['movieId'] = 'int'
df_movies = df_movies.astype(dtype_dict)

# Verify the conversion
print(df_movies.dtypes)

movieId     int64
imdbId     object
tmdbId     object
title      object
genres     object
dtype: object


### Accessing the links 

- Using web scraping to access the storyline/synopsis from each link
- Stored in separate columns - "imdb_doc", "tmdb_doc"


In [11]:
# !pip install IMDbPY

In [12]:
from imdb import IMDb

ia = IMDb()
movie = ia.get_movie('0114709')  # Use IMDb ID without "tt"
print(movie.get('plot outline'))


A little boy named Andy loves to be in his room, playing with his toys, especially his doll named "Woody". But, what do the toys do when Andy is not with them, they come to life. Woody believes that his life (as a toy) is good. However, he must worry about Andy's family moving, and what Woody does not know is about Andy's birthday party. Woody does not realize that Andy's mother gave him an action figure known as Buzz Lightyear, who does not believe that he is a toy, and quickly becomes Andy's new favorite toy. Woody, who is now consumed with jealousy, tries to get rid of Buzz. Then, both Woody and Buzz are now lost. They must find a way to get back to Andy before he moves without them, but they will have to pass through a ruthless toy killer, Sid Phillips.


In [13]:
links

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
27273,131254,466713,4436.0
27274,131256,277703,9274.0
27275,131258,3485166,285213.0
27276,131260,249110,32099.0


Due to it taking a long time - only using tmdb for now, and then we can add this if necessary 

In [14]:
# import pandas as pd
# from imdb import IMDb
# 
# # Assuming df_movies is your DataFrame and it has a column named 'imdbId'
# # Initialize IMDb object
# ia = IMDb()
# 
# # Create a new column 'imdb_syn' initialized with None only if it does not exist
# if 'imdb_syn' not in df_movies.columns:
#     df_movies['imdb_syn'] = None
# 
# #
# # Loop through the first 10 IMDb IDs in the DataFrame using integer-based indexing
# for index in range(0, 100):
#     try:
#         imdb_id = str(df_movies.loc[index, 'imdbId']).zfill(7)  # Ensure the IMDb ID has leading zeros up to 7 digits
#         movie = ia.get_movie(imdb_id)  # Use IMDb ID without "tt"
#         plot_outline = movie.get('plot outline')
#         
#         # Assign the plot outline to the corresponding entry in 'imdb_syn' column
#         df_movies.loc[index, 'imdb_syn'] = plot_outline
#     except Exception as e:
#         print(f"An error occurred for index {index}, IMDb ID {imdb_id}: {e}")
# 
# # Now df_movies['imdb_syn'] will contain the plot outlines for the first 10 IMDb IDs.


In [15]:
# df_movies.loc[2]

In [16]:
# df_movies.to_csv("../dataset/context_imdb.csv", index=False)

In [17]:
df_movies = pd.read_csv("../dataset/context_imdb.csv")

In [18]:
# df_movies.drop(columns=['Unnamed: 0'], inplace=True)
# df_movies

In [19]:
# # df_movies_null = df_movies[df_movies['imdb_syn'].notna().index]
# non_na_indices = df_movies.index[df_movies['imdb_syn'].notna()]
# if not non_na_indices.empty:
#     df_movies_null_num = non_na_indices.max() + 1
# else:
#     df_movies_null_num = 0

For the tmdb, we use the Python package 'tmdbv3api'
https://github.com/AnthonyBloomer/tmdbv3api

In [20]:
# !pip install tmdbv3api

2cc6b369ade4867c4efa72198cd6dba9 - API KEY

a95e7426cf907141b0b558fef03000ab

In [21]:
# import time
# 
# import pandas as pd
# from concurrent.futures import ThreadPoolExecutor
# from tmdbv3api import TMDb, Movie
# 
# # Initialize TMDb and Movie objects
# tmdb = TMDb()
# movie = Movie()
# 
# # Your TMDb API key
# tmdb.api_key = 'a95e7426cf907141b0b558fef03000ab'
# 
# # Function to fetch movie overview
# def fetch_overview(tmdb_id):
#     try:
#         if tmdb_id:
#             details = movie.details(tmdb_id)
#             if details:
#                 return details.overview
#             else:
#                 print(f"Resource with ID {tmdb_id} could not be found.")
#                 return 'N/A'
#         else:
#             return 'N/A'
#     except Exception as e:
#         print(f"An error occurred: {e}")
#         return 'N/A'
#     finally:
#         # Sleep briefly to avoid overwhelming the API and hitting rate limits
#         time.sleep(0.1)  
# 
# # Function to handle each batch
# def process_batch(batch):
#     max_workers = 5
#     with ThreadPoolExecutor(max_workers=max_workers) as executor:
#         return list(executor.map(fetch_overview, batch))
# 
# # Batch size (adjust based on your requirements)
# batch_size = 50
# 
# # Initialize an empty list to hold the results
# all_results = []
# 
# # Process each batch
# for i in range(0, len(df_movies['tmdbId']), batch_size):
#     batch = df_movies['tmdbId'][i:i + batch_size]
#     batch_results = process_batch(batch)
#     all_results.extend(batch_results)
# 
# # Add the results back into the DataFrame
# df_movies['tmdb_syn'] = all_results
# 
# # Display the updated DataFrame
# print(df_movies)
# 
# 


In [22]:

df_movies.to_csv("../dataset/tmdb_syn_labelled.csv", index=False)


In [23]:
# import pandas as pd
# from concurrent.futures import ThreadPoolExecutor
# from tmdbv3api import TMDb, Movie

# # Initialize TMDb and Movie objects
# tmdb = TMDb()
# movie = Movie()

# # Your TMDb API key
# tmdb.api_key = '2cc6b369ade4867c4efa72198cd6dba9'

# # Function to fetch movie overview
# def fetch_overview(tmdb_id):
#     try:
#         if tmdb_id:
#             details = movie.details(tmdb_id)
#             if details:
#                 return details.overview
#             else:
#                 print(f"Resource with ID {tmdb_id} could not be found.")
#                 return 'N/A'
#         else:
#             return 'N/A'
#     except Exception as e:
#         print(f"An error occurred: {e}")
#         return 'N/A'


# # Function to handle each batch
# def process_batch(batch):
#     max_workers = 10
#     with ThreadPoolExecutor(max_workers=max_workers) as executor:
#         return list(executor.map(fetch_overview, batch))

# # Batch size (adjust based on your requirements)
# batch_size = 100

# # Initialize an empty list to hold the results
# all_results = []

# # Process each batch
# for i in range(0, len(df_movies['tmdbId']), batch_size):
#     batch = df_movies['tmdbId'][i:i + batch_size]
#     batch_results = process_batch(batch)
#     all_results.extend(batch_results)

# # Add the results back into the DataFrame
# df_movies['tmdb_syn'] = all_results

# # Display the updated DataFrame
# print(df_movies)




In [59]:
# df_movies.to_csv("../dataset/tmdb_syn_labelled.csv", index=False)

#### Combining the imdb and tmdb synopsis into one column - 2 Oct: Summarise once imdb is done - if used
- Combine and summarise the synopses using the gensim package
- ratio (float, optional) – Number between 0 and 1 that determines the proportion of the number of sentences of the original text to be chosen for the summary.


In [5]:
df_movies= pd.read_csv("../dataset/tmdb_syn_labelled.csv")

In [24]:
# from transformers import pipeline
# 
# # Initialize the summarizer pipeline
# summarizer = pipeline("summarization")
# 
# def combine_and_summarize(imdb_syn, tmdb_syn):
#     combined_syn = imdb_syn + " " + tmdb_syn
#     # Perform summarization
#     summary = summarizer(combined_syn, max_length=150, min_length=30, do_sample=False)
#     # Extract the summarized text
#     summarized_syn = summary[0]['summary_text']
#     return summarized_syn if summarized_syn else combined_syn  # Use original if summarization fails
# 
# # Apply the function to your DataFrame
# df_movies['summarized_syn'] = df_movies.apply(lambda x: combine_and_summarize(x['imdb_syn'], x['tmdb_syn']), axis=1)


In [13]:
# !pip install gensim
# !pip3 install gensim==3.6.0


Collecting gensim==3.6.0
  Downloading gensim-3.6.0.tar.gz (23.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m23.1/23.1 MB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25h  Preparing metadata (setup.py) ... [?25ldone
Building wheels for collected packages: gensim
  Building wheel for gensim (setup.py) ... [?25ldone
[?25h  Created wheel for gensim: filename=gensim-3.6.0-cp311-cp311-macosx_11_0_arm64.whl size=23218283 sha256=701682c488fd8945db7da094f87e6e8bbc7327bcae9739eb0f604d29824e1250
  Stored in directory: /Users/jiayi/Library/Caches/pip/wheels/6b/84/4d/ea7977f42f89ebaa24f5702cc2d5013300934c7641d48c1521
Successfully built gensim
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 4.3.0
    Uninstalling gensim-4.3.0:
      Successfully uninstalled gensim-4.3.0
Successfully installed gensim-3.6.0


In [19]:
df_movies

Unnamed: 0,movieId,imdbId,tmdbId,title,genres,imdb_syn,tmdb_syn,summarized_syn
0,1,114709,862.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,A little boy named Andy loves to be in his roo...,"Led by Woody, Andy's toys live happily in his ...",A little boy named Andy loves to be in his roo...
1,2,113497,8844.0,Jumanji (1995),Adventure|Children|Fantasy,"Jumanji, one of the most unique--and dangerous...",When siblings Judy and Peter discover an encha...,"Jumanji, one of the most unique--and dangerous..."
2,4,114885,31357.0,Waiting to Exhale (1995),Comedy|Drama|Romance,This story based on the best selling novel by ...,"Cheated on, mistreated and stepped on, the wom...",This story based on the best selling novel by ...
3,5,113041,11862.0,Father of the Bride Part II (1995),Comedy,"In this sequel to ""Father of the Bride"", Georg...",Just when George Banks has recovered from his ...,"In this sequel to ""Father of the Bride"", Georg..."
4,6,113277,949.0,Heat (1995),Action|Crime|Thriller,Hunters and their prey--Neil and his professio...,Obsessive master thief Neil McCauley leads a t...,Hunters and their prey--Neil and his professio...
...,...,...,...,...,...,...,...,...
6178,130856,494826,48376.0,Severe Clear (2010),Comedy|Documentary,,Severe Clear is a film based on the memoirs of...,Severe Clear is a film based on the memoirs o...
6179,130958,143338,78402.0,Killer Crocodile (1989),Horror,,A group of environmentalists arrives at a fara...,A group of environmentalists arrives at a fara...
6180,130984,208423,317168.0,Santo vs. las lobas (1976),Action|Fantasy|Horror,,Also known as Santo vs. the She-Wolves,Also known as Santo vs. the She-Wolves
6181,131011,69109,79572.0,Execution Squad (1972),Crime|Drama,,Bertone is a moderately honest homicide cop. U...,Bertone is a moderately honest homicide cop.\n...


In [18]:
from gensim.summarization import summarize

def count_sentences(text):
    # Simple sentence count based on punctuation
    return len(re.findall(r'\w[.!?]', text))

# Combine and summarize the synopses
def combine_and_summarize(imdb_syn, tmdb_syn):
    imdb_syn = str(imdb_syn) if not pd.isna(imdb_syn) else ""
    tmdb_syn = str(tmdb_syn) if not pd.isna(tmdb_syn) else ""    
    combined_syn = imdb_syn + " " + tmdb_syn
    
    if count_sentences(combined_syn) > 1:
        try:    
            summarized_syn = summarize(combined_syn, ratio=0.7)  # Adjust the ratio as needed
            return summarized_syn if summarized_syn else combined_syn  # Use original if summarization fails
        except Exception:
            return combined_syn
    else:
        return combined_syn 

# Apply the function to your DataFrame
df_movies['summarized_syn'] = df_movies.apply(lambda x: combine_and_summarize(x['imdb_syn'], x['tmdb_syn']), axis=1)


Reading in the tmdb column

In [2]:
df_movies= pd.read_csv("../dataset/tmdb_syn_labelled.csv")

tmdb = pd.read_csv("../dataset/tmdb_syn_labelled.csv")


In [None]:
# tmdb['movieId'] = tmdb['movieId'].astype('int')


In [None]:
# tmdb

In [None]:
# Merge df_movies with tmdb based on 'movieId'
# This will add the 'tmdb_syn' column from tmdb to df_movies
df_movies = pd.merge(df_movies, tmdb[['movieId', 'tmdb_syn']], on='movieId', how='left')

# Display the updated DataFrame
df_movies[["movieId", "title", "summarized_syn"]].head(10)


In [None]:
df_movies

In [None]:
df_movies['tmdb_syn']

Investigating movies with NO tmdb_syn:


In [None]:

# Create a DataFrame with only the rows where 'tmdb_syn' is NaN
df_movies_missing_tmdb_syn = df_movies[pd.isna(df_movies['tmdb_syn'])]

# Display or analyze the DataFrame
print(df_movies_missing_tmdb_syn)

# If you want to know the number of such rows
print("Number of rows with missing tmdb_syn:", len(df_movies_missing_tmdb_syn))

For these 56 missing - will try to get from imdb API:

In [None]:
df_movies_missing_tmdb_syn

In [None]:
from imdb import IMDb

# Initialize IMDb object
ia = IMDb()

# Function to fetch the plot outline
def get_imdb_syn(imdbId):
    try:
        movie = ia.get_movie(str(imdbId))
        return movie.get('plot outline', 'N/A')
    except Exception as e:
        print(f"An error occurred: {e}")
        return 'N/A'

# Find rows where 'tmdb_syn' is NaN
missing_tmdb_syn_idx = df_movies[pd.isna(df_movies['tmdb_syn'])].index.tolist()

# Initialize an empty list to keep track of updated rows
updated_rows = []

# Fetch 'imdb_syn' for these rows
for idx in missing_tmdb_syn_idx:
    imdbId = df_movies.loc[idx, 'imdbId']
    new_syn = get_imdb_syn(imdbId)
    
    if new_syn != 'N/A':
        df_movies.loc[idx, 'imdb_syn'] = new_syn
        updated_rows.append(idx)

# Show the rows that were updated
print("Updated rows:")
print(df_movies.loc[updated_rows])


In [None]:
# import pandas as pd
# from imdb import IMDb
# import time
# 
# # Initialize IMDb object
# ia = IMDb()
# 
# # Function to fetch the plot outline
# def get_imdb_syn(imdbId):
#     try:
#         movie = ia.get_movie(str(imdbId))
#         return movie.get('plot outline', 'N/A')
#     except Exception as e:
#         print(f"An error occurred while fetching plot outline for {imdbId}: {e}")
#         return 'N/A'
# 
# # Sample DataFrame (replace with your actual DataFrame)
# # df_movies = pd.DataFrame({
# #     'imdbId': ['0111161', '0133093', '0133093'],  # Example IMDb IDs
# #     'tmdb_syn': [None, 'Some Synopsis', None]
# # })
# 
# # Find rows where 'tmdb_syn' is NaN
# missing_tmdb_syn_idx = df_movies[pd.isna(df_movies['tmdb_syn'])].index.tolist()
# 
# # Initialize an empty list to keep track of updated rows
# updated_rows = []
# 
# # Fetch 'imdb_syn' for these rows
# for idx in missing_tmdb_syn_idx:
#     imdbId = df_movies.loc[idx, 'imdbId']
#     new_syn = get_imdb_syn(imdbId)
# 
#     if new_syn != 'N/A':
#         df_movies.loc[idx, 'imdb_syn'] = new_syn
#         updated_rows.append(idx)
# 
#     # Add a delay to avoid hitting API rate limits
#     time.sleep(1)
# 
# # Show the rows that were updated
# print("Updated rows:")
# print(df_movies.loc[updated_rows])


In [55]:
df_movies

Unnamed: 0,movieId,imdbId,tmdbId,title,genres,imdb_syn,tmdb_syn_x,tmdb_syn_y,tmdb_syn
0,1,114709,862.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,A little boy named Andy loves to be in his roo...,"Led by Woody, Andy's toys live happily in his ...","Led by Woody, Andy's toys live happily in his ...","Led by Woody, Andy's toys live happily in his ..."
1,2,113497,8844.0,Jumanji (1995),Adventure|Children|Fantasy,"Jumanji, one of the most unique--and dangerous...",When siblings Judy and Peter discover an encha...,When siblings Judy and Peter discover an encha...,When siblings Judy and Peter discover an encha...
2,4,114885,31357.0,Waiting to Exhale (1995),Comedy|Drama|Romance,This story based on the best selling novel by ...,"Cheated on, mistreated and stepped on, the wom...","Cheated on, mistreated and stepped on, the wom...","Cheated on, mistreated and stepped on, the wom..."
3,5,113041,11862.0,Father of the Bride Part II (1995),Comedy,"In this sequel to ""Father of the Bride"", Georg...",Just when George Banks has recovered from his ...,Just when George Banks has recovered from his ...,Just when George Banks has recovered from his ...
4,6,113277,949.0,Heat (1995),Action|Crime|Thriller,Hunters and their prey--Neil and his professio...,Obsessive master thief Neil McCauley leads a t...,Obsessive master thief Neil McCauley leads a t...,Obsessive master thief Neil McCauley leads a t...
...,...,...,...,...,...,...,...,...,...
6178,130856,494826,48376.0,Severe Clear (2010),Comedy|Documentary,,Severe Clear is a film based on the memoirs of...,Severe Clear is a film based on the memoirs of...,Severe Clear is a film based on the memoirs of...
6179,130958,143338,78402.0,Killer Crocodile (1989),Horror,,A group of environmentalists arrives at a fara...,A group of environmentalists arrives at a fara...,A group of environmentalists arrives at a fara...
6180,130984,208423,317168.0,Santo vs. las lobas (1976),Action|Fantasy|Horror,,Also known as Santo vs. the She-Wolves,Also known as Santo vs. the She-Wolves,Also known as Santo vs. the She-Wolves
6181,131011,69109,79572.0,Execution Squad (1972),Crime|Drama,,Bertone is a moderately honest homicide cop. U...,Bertone is a moderately honest homicide cop. U...,Bertone is a moderately honest homicide cop. U...


Just for now, until the imdb API works -> we are removing these movieIds:

In [None]:
missing_tmdb_syn_idx = df_movies[pd.isna(df_movies['tmdb_syn'])].index.tolist()
# missing_tmdb_syn_idx # for 

# Remove rows with indices in missing_tmdb_syn_idx from df_movies
df_movies.drop(missing_tmdb_syn_idx, inplace=True)

# Reset index if needed
df_movies.reset_index(drop=True, inplace=True)


In [20]:
df_movies.to_csv("../dataset/imdb_tmdb_syn_labelled.csv", index=False)

In [None]:
df_movies= pd.read_csv("../dataset/imdb_tmdb_syn_labelled.csv")

#### Text Pre-processing 
- Need to apply to the following colunns: summarized_syn, title, genre


1. Removal of special characters  
2. Uniform size of letters - lowercase
3. Remove punctuation and quotation marks
4. Remove possessive pronouns 
5. Lemmatisation 
6. Removal of “stop words”

1. Removal of special characters  

In [21]:
# Replace non-alphabetical characters with empty string, leaving spaces intact
df_movies['summarized_syn_cleaned'] = df_movies['summarized_syn'].str.replace(r'[^a-zA-Z\s]', '', regex=True)


2. Uniform size of letters - lowercase

In [22]:
df_movies['summarized_syn'] = df_movies['summarized_syn'].str.lower() #lowercase


3. Remove punctuation and quotation marks

In [23]:
df_movies['summarized_syn'] = df_movies['summarized_syn'].str.replace(r'[^\w\s]', '', regex=True)


4. Remove possessive pronouns 

In [24]:
import pandas as pd



# Define a regular expression to match possessive pronouns, with word boundaries
possessive_pronouns = r'\b(my|your|his|her|its|our|their)\b'

# Replace possessive pronouns with empty strings
df_movies['summarized_syn'] = df_movies['summarized_syn'].str.replace(possessive_pronouns, '', regex=True)

# Remove extra spaces (since the possessive pronouns might leave extra spaces when removed)
df_movies['summarized_syn'] = df_movies['summarized_syn'].str.replace(r'\s+', ' ', regex=True).str.strip()

print(df_movies)


      movieId   imdbId    tmdbId                               title  \
0           1   114709     862.0                    Toy Story (1995)   
1           2   113497    8844.0                      Jumanji (1995)   
2           4   114885   31357.0            Waiting to Exhale (1995)   
3           5   113041   11862.0  Father of the Bride Part II (1995)   
4           6   113277     949.0                         Heat (1995)   
...       ...      ...       ...                                 ...   
6178   130856   494826   48376.0                 Severe Clear (2010)   
6179   130958   143338   78402.0             Killer Crocodile (1989)   
6180   130984   208423  317168.0          Santo vs. las lobas (1976)   
6181   131011    69109   79572.0              Execution Squad (1972)   
6182   131015  1430116  143928.0                     Hellgate (2011)   

                                           genres  \
0     Adventure|Animation|Children|Comedy|Fantasy   
1                      Advent

5. Lemmatisation 

In [25]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Download necessary NLTK data
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')  # for word_tokenize

# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Function to map NLTK's POS tags to the first character used by WordNetLemmatizer
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:         
        return None

# Function to lemmatize a single word (removed the keep check)
def lemmatize_word(word):
    pos = nltk.pos_tag([word])[0][1]  # POS tagging
    wordnet_pos = pos_tagger(pos)     # Map POS tag to first character used by WordNetLemmatizer
    if wordnet_pos is None:
        return word
    else:
        return lemmatizer.lemmatize(word, wordnet_pos)

# Tokenize and then lemmatize
df_movies['summarized_syn'] = df_movies['summarized_syn'].apply(
    lambda text: ' '.join([lemmatize_word(word) for word in word_tokenize(text)])
)

# Display the updated DataFrame
print(df_movies)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jiayi/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /Users/jiayi/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


      movieId   imdbId    tmdbId                               title  \
0           1   114709     862.0                    Toy Story (1995)   
1           2   113497    8844.0                      Jumanji (1995)   
2           4   114885   31357.0            Waiting to Exhale (1995)   
3           5   113041   11862.0  Father of the Bride Part II (1995)   
4           6   113277     949.0                         Heat (1995)   
...       ...      ...       ...                                 ...   
6178   130856   494826   48376.0                 Severe Clear (2010)   
6179   130958   143338   78402.0             Killer Crocodile (1989)   
6180   130984   208423  317168.0          Santo vs. las lobas (1976)   
6181   131011    69109   79572.0              Execution Squad (1972)   
6182   131015  1430116  143928.0                     Hellgate (2011)   

                                           genres  \
0     Adventure|Animation|Children|Comedy|Fantasy   
1                      Advent

6. Removal of “stop words”

In [26]:
from nltk.tokenize import word_tokenize
from spacy.lang.en import STOP_WORDS

# Function to remove stop words from a list of words
def remove_stop_words(words_list):
    return [word for word in words_list if word.lower() not in STOP_WORDS]

# First, tokenize the sentences into words
df_movies['summarized_syn_tokens'] = df_movies['summarized_syn'].apply(
    lambda element: word_tokenize(element) if isinstance(element, str) else element
)

# Now remove stop words from the 'summarized_syn_tokens' column
df_movies['summarized_syn_cleaned'] = df_movies['summarized_syn_tokens'].apply(
    lambda element: remove_stop_words(element) if isinstance(element, list) else element
)

# Display the updated DataFrame
df_movies


Unnamed: 0,movieId,imdbId,tmdbId,title,genres,imdb_syn,tmdb_syn,summarized_syn,summarized_syn_cleaned,summarized_syn_tokens
0,1,114709,862.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,A little boy named Andy loves to be in his roo...,"Led by Woody, Andy's toys live happily in his ...",a little boy name andy love to be in room play...,"[little, boy, andy, love, room, play, toy, esp...","[a, little, boy, name, andy, love, to, be, in,..."
1,2,113497,8844.0,Jumanji (1995),Adventure|Children|Fantasy,"Jumanji, one of the most unique--and dangerous...",When siblings Judy and Peter discover an encha...,jumanji one of the most uniqueand dangerousboa...,"[jumanji, uniqueand, dangerousboard, game, fal...","[jumanji, one, of, the, most, uniqueand, dange..."
2,4,114885,31357.0,Waiting to Exhale (1995),Comedy|Drama|Romance,This story based on the best selling novel by ...,"Cheated on, mistreated and stepped on, the wom...",this story base on the best sell novel by terr...,"[story, base, best, sell, novel, terry, mcmill...","[this, story, base, on, the, best, sell, novel..."
3,5,113041,11862.0,Father of the Bride Part II (1995),Comedy,"In this sequel to ""Father of the Bride"", Georg...",Just when George Banks has recovered from his ...,in this sequel to father of the bride george b...,"[sequel, father, bride, george, bank, accept, ...","[in, this, sequel, to, father, of, the, bride,..."
4,6,113277,949.0,Heat (1995),Action|Crime|Thriller,Hunters and their prey--Neil and his professio...,Obsessive master thief Neil McCauley leads a t...,hunter and preyneil and professional criminal ...,"[hunter, preyneil, professional, criminal, cre...","[hunter, and, preyneil, and, professional, cri..."
...,...,...,...,...,...,...,...,...,...,...
6178,130856,494826,48376.0,Severe Clear (2010),Comedy|Documentary,,Severe Clear is a film based on the memoirs of...,severe clear be a film base on the memoir of f...,"[severe, clear, film, base, memoir, lieutenant...","[severe, clear, be, a, film, base, on, the, me..."
6179,130958,143338,78402.0,Killer Crocodile (1989),Horror,,A group of environmentalists arrives at a fara...,a group of environmentalist arrives at a faraw...,"[group, environmentalist, arrives, faraway, tr...","[a, group, of, environmentalist, arrives, at, ..."
6180,130984,208423,317168.0,Santo vs. las lobas (1976),Action|Fantasy|Horror,,Also known as Santo vs. the She-Wolves,also know as santo v the shewolves,"[know, santo, v, shewolves]","[also, know, as, santo, v, the, shewolves]"
6181,131011,69109,79572.0,Execution Squad (1972),Crime|Drama,,Bertone is a moderately honest homicide cop. U...,bertone be a moderately honest homicide cop be...,"[bertone, moderately, honest, homicide, cop, e...","[bertone, be, a, moderately, honest, homicide,..."


#### Encoding genre column using GloVe vectors

- Using pre-trained word vectors (wikipedia) - 200d
ref: https://github.com/stanfordnlp/GloVe 


In [27]:
df_movies['genres'] = df_movies['genres'].apply(lambda x: x.split('|'))

In [28]:
# get a distinct list of genres


# Assuming df_movies['genres'] has been split into lists of strings
unique_genres = set()

# Iterate over the 'genres' column to populate the unique_genres set
for genre_list in df_movies['genres']:
    unique_genres.update(genre_list)

# Convert the set to a list, if needed
unique_genres = list(unique_genres)


In [29]:
unique_genres = [genre.lower() for genre in unique_genres]


In [30]:
unique_genres

['imax',
 'comedy',
 '(no genres listed)',
 'crime',
 'animation',
 'drama',
 'horror',
 'mystery',
 'adventure',
 'action',
 'fantasy',
 'film-noir',
 'documentary',
 'war',
 'sci-fi',
 'children',
 'western',
 'romance',
 'thriller',
 'musical']

"film-noir" has been replaced to "noir"
- As film-noir is not detected in the glove vectors 


In [31]:
unique_genres = [genre.replace("film-noir", "noir") for genre in unique_genres]
unique_genres

['imax',
 'comedy',
 '(no genres listed)',
 'crime',
 'animation',
 'drama',
 'horror',
 'mystery',
 'adventure',
 'action',
 'fantasy',
 'noir',
 'documentary',
 'war',
 'sci-fi',
 'children',
 'western',
 'romance',
 'thriller',
 'musical']

### Getting the glove vectors for each unique genre 

In [33]:
# ref: https://keras.io/examples/nlp/pretrained_word_embeddings/

def get_glove(file_path):

    embeddings_index = {}
    with open(path_to_glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            word, coefs = line.split(maxsplit=1)
            coefs = np.fromstring(coefs, "f", sep=" ")
            embeddings_index[word] = coefs
    return embeddings_index

path_to_glove_file = "../pretrain_model/glove.6B/glove.6B.200d.txt"
glove_vec = get_glove(path_to_glove_file)

print("Found %s word vectors." % len(glove_vec))

 # 200 zero vec is assigned when the word is not found in the GloVe index 

Found 400001 word vectors.


In [34]:
# Initialize an empty dictionary to hold the GloVe vectors for each unique genre
genre_glove_vec = {}

# Populate the genre_glove_vec dictionary
for genre in unique_genres:
    if genre in glove_vec:  # Check if the genre name is available in the GloVe vocab
        genre_glove_vec[genre] = glove_vec[genre]
    else:
        print(f"No GloVe representation found for genre: {genre}")
        genre_glove_vec[genre] = None  # You can also populate with a default vector if needed

# Now, genre_glove_vec contains the GloVe representation for each unique genre


No GloVe representation found for genre: (no genres listed)


Number of movies with  no genres: 

In [35]:
# Count the number of movies with no genres
count_no_genres = df_movies['genres'].apply(lambda x: len(x) == 0).sum()

print(f'Number of movies with no genres: {count_no_genres}')


Number of movies with no genres: 0


All movies have an assigned genre!

In [37]:
# passing the genre_glove_vec to file - which will be read in the CB model 
import pickle

# Writing to file
with open('../pretrain_model/genre_glove_vec.pkl', 'wb') as f:
    pickle.dump(genre_glove_vec, f)



In [None]:
# put movieId and genres into one dataframe too

In [38]:
df_output= df_movies[["movieId", "genres"]]
df_output.to_csv("../dataset/movie_genre.csv",index=False)

In [39]:
df_movies

Unnamed: 0,movieId,imdbId,tmdbId,title,genres,imdb_syn,tmdb_syn,summarized_syn,summarized_syn_cleaned,summarized_syn_tokens
0,1,114709,862.0,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",A little boy named Andy loves to be in his roo...,"Led by Woody, Andy's toys live happily in his ...",a little boy name andy love to be in room play...,"[little, boy, andy, love, room, play, toy, esp...","[a, little, boy, name, andy, love, to, be, in,..."
1,2,113497,8844.0,Jumanji (1995),"[Adventure, Children, Fantasy]","Jumanji, one of the most unique--and dangerous...",When siblings Judy and Peter discover an encha...,jumanji one of the most uniqueand dangerousboa...,"[jumanji, uniqueand, dangerousboard, game, fal...","[jumanji, one, of, the, most, uniqueand, dange..."
2,4,114885,31357.0,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",This story based on the best selling novel by ...,"Cheated on, mistreated and stepped on, the wom...",this story base on the best sell novel by terr...,"[story, base, best, sell, novel, terry, mcmill...","[this, story, base, on, the, best, sell, novel..."
3,5,113041,11862.0,Father of the Bride Part II (1995),[Comedy],"In this sequel to ""Father of the Bride"", Georg...",Just when George Banks has recovered from his ...,in this sequel to father of the bride george b...,"[sequel, father, bride, george, bank, accept, ...","[in, this, sequel, to, father, of, the, bride,..."
4,6,113277,949.0,Heat (1995),"[Action, Crime, Thriller]",Hunters and their prey--Neil and his professio...,Obsessive master thief Neil McCauley leads a t...,hunter and preyneil and professional criminal ...,"[hunter, preyneil, professional, criminal, cre...","[hunter, and, preyneil, and, professional, cri..."
...,...,...,...,...,...,...,...,...,...,...
6178,130856,494826,48376.0,Severe Clear (2010),"[Comedy, Documentary]",,Severe Clear is a film based on the memoirs of...,severe clear be a film base on the memoir of f...,"[severe, clear, film, base, memoir, lieutenant...","[severe, clear, be, a, film, base, on, the, me..."
6179,130958,143338,78402.0,Killer Crocodile (1989),[Horror],,A group of environmentalists arrives at a fara...,a group of environmentalist arrives at a faraw...,"[group, environmentalist, arrives, faraway, tr...","[a, group, of, environmentalist, arrives, at, ..."
6180,130984,208423,317168.0,Santo vs. las lobas (1976),"[Action, Fantasy, Horror]",,Also known as Santo vs. the She-Wolves,also know as santo v the shewolves,"[know, santo, v, shewolves]","[also, know, as, santo, v, the, shewolves]"
6181,131011,69109,79572.0,Execution Squad (1972),"[Crime, Drama]",,Bertone is a moderately honest homicide cop. U...,bertone be a moderately honest homicide cop be...,"[bertone, moderately, honest, homicide, cop, e...","[bertone, be, a, moderately, honest, homicide,..."


In [40]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer




# Initialize MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Transform the 'genres' column to a binary matrix
binary_genres = mlb.fit_transform(df_movies['genres'])

# Create a new DataFrame from the binary matrix
df_genres = pd.DataFrame(binary_genres, columns=mlb.classes_)

# Concatenate the original DataFrame and the new DataFrame
df_movies = pd.concat([df_movies, df_genres], axis=1)

df_movies


Unnamed: 0,movieId,imdbId,tmdbId,title,genres,imdb_syn,tmdb_syn,summarized_syn,summarized_syn_cleaned,summarized_syn_tokens,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,114709,862.0,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",A little boy named Andy loves to be in his roo...,"Led by Woody, Andy's toys live happily in his ...",a little boy name andy love to be in room play...,"[little, boy, andy, love, room, play, toy, esp...","[a, little, boy, name, andy, love, to, be, in,...",...,0,0,0,0,0,0,0,0,0,0
1,2,113497,8844.0,Jumanji (1995),"[Adventure, Children, Fantasy]","Jumanji, one of the most unique--and dangerous...",When siblings Judy and Peter discover an encha...,jumanji one of the most uniqueand dangerousboa...,"[jumanji, uniqueand, dangerousboard, game, fal...","[jumanji, one, of, the, most, uniqueand, dange...",...,0,0,0,0,0,0,0,0,0,0
2,4,114885,31357.0,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",This story based on the best selling novel by ...,"Cheated on, mistreated and stepped on, the wom...",this story base on the best sell novel by terr...,"[story, base, best, sell, novel, terry, mcmill...","[this, story, base, on, the, best, sell, novel...",...,0,0,0,0,0,1,0,0,0,0
3,5,113041,11862.0,Father of the Bride Part II (1995),[Comedy],"In this sequel to ""Father of the Bride"", Georg...",Just when George Banks has recovered from his ...,in this sequel to father of the bride george b...,"[sequel, father, bride, george, bank, accept, ...","[in, this, sequel, to, father, of, the, bride,...",...,0,0,0,0,0,0,0,0,0,0
4,6,113277,949.0,Heat (1995),"[Action, Crime, Thriller]",Hunters and their prey--Neil and his professio...,Obsessive master thief Neil McCauley leads a t...,hunter and preyneil and professional criminal ...,"[hunter, preyneil, professional, criminal, cre...","[hunter, and, preyneil, and, professional, cri...",...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6178,130856,494826,48376.0,Severe Clear (2010),"[Comedy, Documentary]",,Severe Clear is a film based on the memoirs of...,severe clear be a film base on the memoir of f...,"[severe, clear, film, base, memoir, lieutenant...","[severe, clear, be, a, film, base, on, the, me...",...,0,0,0,0,0,0,0,0,0,0
6179,130958,143338,78402.0,Killer Crocodile (1989),[Horror],,A group of environmentalists arrives at a fara...,a group of environmentalist arrives at a faraw...,"[group, environmentalist, arrives, faraway, tr...","[a, group, of, environmentalist, arrives, at, ...",...,0,1,0,0,0,0,0,0,0,0
6180,130984,208423,317168.0,Santo vs. las lobas (1976),"[Action, Fantasy, Horror]",,Also known as Santo vs. the She-Wolves,also know as santo v the shewolves,"[know, santo, v, shewolves]","[also, know, as, santo, v, the, shewolves]",...,0,1,0,0,0,0,0,0,0,0
6181,131011,69109,79572.0,Execution Squad (1972),"[Crime, Drama]",,Bertone is a moderately honest homicide cop. U...,bertone be a moderately honest homicide cop be...,"[bertone, moderately, honest, homicide, cop, e...","[bertone, be, a, moderately, honest, homicide,...",...,0,0,0,0,0,0,0,0,0,0


In [41]:
df_movies[["movieId", "genres"]].head(10)

Unnamed: 0,movieId,genres
0,1,"[Adventure, Animation, Children, Comedy, Fantasy]"
1,2,"[Adventure, Children, Fantasy]"
2,4,"[Comedy, Drama, Romance]"
3,5,[Comedy]
4,6,"[Action, Crime, Thriller]"
5,7,"[Comedy, Romance]"
6,10,"[Action, Adventure, Thriller]"
7,11,"[Comedy, Drama, Romance]"
8,12,"[Comedy, Horror]"
9,14,[Drama]


In [42]:
df_genres

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6178,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
6179,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
6180,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0
6181,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0


### CBOW TF-IDF Method to generate Context Vector 

1. tf-IDF

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df_movies['summarized_syn_cleaned'].apply(' '.join))


In [44]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Assuming df_movies['summarized_syn_cleaned'] is a Series of lists of words
# We'll join the lists into single strings for each document
tfidf_matrix = vectorizer.fit_transform(df_movies['summarized_syn_cleaned'].apply(' '.join))

# Get the words corresponding to the features of the TF-IDF matrix
feature_names = vectorizer.get_feature_names_out()

# Convert the tfidf_matrix to a dense matrix, then to a DataFrame
tfidf_dataframe = pd.DataFrame(tfidf_matrix.todense(), columns=feature_names)

# Show the head of the DataFrame for a preview
tfidf_matrix


<6183x20223 sparse matrix of type '<class 'numpy.float64'>'
	with 128331 stored elements in Compressed Sparse Row format>

In [45]:
tfidf_matrix

<6183x20223 sparse matrix of type '<class 'numpy.float64'>'
	with 128331 stored elements in Compressed Sparse Row format>

2. CBOW Representation with weight with TF-IDF

Need to use 200-dimensional pre-trained word vectors
Why? Because this matches the dimensionality of the vectors that we use for the GloVe vectors
- This is what we need to compare in the WSD operation. They need to have the same dimensionality.
- **Put this in thesis.**

Using GloVe vectors in your Continuous Bag-of-Words (CBOW) model is a common practice. The idea is to replace the word vectors you generate in the CBOW model with pre-trained GloVe vectors. This way, you can leverage the semantic information captured during the GloVe pre-training process.

In [47]:

def get_glove(file_path):

    
    embeddings_index = {}
    with open(path_to_glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            word, coefs = line.split(maxsplit=1)
            coefs = np.fromstring(coefs, "f", sep=" ")
            embeddings_index[word] = coefs
    return embeddings_index

path_to_glove_file = "../pretrain_model/glove.6B/glove.6B.200d.txt"
glove_vec = get_glove(path_to_glove_file)



In [48]:
def get_weighted_cbow_vector(tokens, tfidf_scores, vocab, model):
    weighted_vector = np.zeros((200,))  # Updated dimensionality to match the GloVe vectors
    for token, score in zip(tokens, tfidf_scores):
        if token in vocab:
            try:
                weighted_vector += model[token] * score  # Used 'model' parameter
            except KeyError:  # Token might not be in the model vocabulary
                continue
    if len(tokens) > 0:
        return weighted_vector / len(tokens)
    else:
        return np.zeros((200,))


# Convert sparse tfidf_matrix to dense form and iterate to compute weighted CBOW vectors, WHY: because not all algorithms work well with sparse form. Helps with element-wise operations. 
dense_tfidf = tfidf_matrix.todense()
vocab = set(vectorizer.get_feature_names_out())
df_movies['weighted_cbow_synopsis'] = [get_weighted_cbow_vector(tokens, dense_tfidf[i].tolist()[0], vocab, glove_vec) for i, tokens in enumerate(df_movies['summarized_syn_cleaned'])]


How is tf-IDF used here? With Filtering (vocab): You ensure that the token is not only in the GloVe vocabulary but also relevant in your specific corpus according to TF-IDF. This could be more precise for your use-case but may miss out on some broader semantic relationships captured by GloVe.
- Gives more importance to context words that are relevant within your specific corpus. 



Transforming genre matrix and weighted cbow for each movie in df_movies

These will be separate columns


In [49]:
import numpy as np

# Convert the 'context_vector_synopsis' and 'binary_genres' to NumPy arrays if they are not already
df_movies['final_context_vector'] = df_movies['weighted_cbow_synopsis'].apply(np.array)
df_genres_np = df_genres.to_numpy()


In [50]:
df_genres_np

array([[0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0]])

In [51]:
df_movies

Unnamed: 0,movieId,imdbId,tmdbId,title,genres,imdb_syn,tmdb_syn,summarized_syn,summarized_syn_cleaned,summarized_syn_tokens,...,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,weighted_cbow_synopsis,final_context_vector
0,1,114709,862.0,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",A little boy named Andy loves to be in his roo...,"Led by Woody, Andy's toys live happily in his ...",a little boy name andy love to be in room play...,"[little, boy, andy, love, room, play, toy, esp...","[a, little, boy, name, andy, love, to, be, in,...",...,0,0,0,0,0,0,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,2,113497,8844.0,Jumanji (1995),"[Adventure, Children, Fantasy]","Jumanji, one of the most unique--and dangerous...",When siblings Judy and Peter discover an encha...,jumanji one of the most uniqueand dangerousboa...,"[jumanji, uniqueand, dangerousboard, game, fal...","[jumanji, one, of, the, most, uniqueand, dange...",...,0,0,0,0,0,0,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,4,114885,31357.0,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",This story based on the best selling novel by ...,"Cheated on, mistreated and stepped on, the wom...",this story base on the best sell novel by terr...,"[story, base, best, sell, novel, terry, mcmill...","[this, story, base, on, the, best, sell, novel...",...,0,0,0,1,0,0,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,5,113041,11862.0,Father of the Bride Part II (1995),[Comedy],"In this sequel to ""Father of the Bride"", Georg...",Just when George Banks has recovered from his ...,in this sequel to father of the bride george b...,"[sequel, father, bride, george, bank, accept, ...","[in, this, sequel, to, father, of, the, bride,...",...,0,0,0,0,0,0,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,6,113277,949.0,Heat (1995),"[Action, Crime, Thriller]",Hunters and their prey--Neil and his professio...,Obsessive master thief Neil McCauley leads a t...,hunter and preyneil and professional criminal ...,"[hunter, preyneil, professional, criminal, cre...","[hunter, and, preyneil, and, professional, cri...",...,0,0,0,0,0,1,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6178,130856,494826,48376.0,Severe Clear (2010),"[Comedy, Documentary]",,Severe Clear is a film based on the memoirs of...,severe clear be a film base on the memoir of f...,"[severe, clear, film, base, memoir, lieutenant...","[severe, clear, be, a, film, base, on, the, me...",...,0,0,0,0,0,0,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
6179,130958,143338,78402.0,Killer Crocodile (1989),[Horror],,A group of environmentalists arrives at a fara...,a group of environmentalist arrives at a faraw...,"[group, environmentalist, arrives, faraway, tr...","[a, group, of, environmentalist, arrives, at, ...",...,0,0,0,0,0,0,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
6180,130984,208423,317168.0,Santo vs. las lobas (1976),"[Action, Fantasy, Horror]",,Also known as Santo vs. the She-Wolves,also know as santo v the shewolves,"[know, santo, v, shewolves]","[also, know, as, santo, v, the, shewolves]",...,0,0,0,0,0,0,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
6181,131011,69109,79572.0,Execution Squad (1972),"[Crime, Drama]",,Bertone is a moderately honest homicide cop. U...,bertone be a moderately honest homicide cop be...,"[bertone, moderately, honest, homicide, cop, e...","[bertone, be, a, moderately, honest, homicide,...",...,0,0,0,0,0,0,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [52]:
len(binary_genres)

6183

Passing to a file ->

In [53]:
import json 

df_movies.to_json("../dataset/df_context_vec.json")

### Sparse Users Method 


Reading in the recommendations from CF model - empty ones


In [57]:
# reading in the tags (userId, movieId, tag)
tags = pd.read_csv("../dataset/tags_contentbased.csv")
# tags.drop(columns=['Unnamed: 0','Unnamed: 0.1','Unnamed: 0.2'], inplace=True)
# tags.to_csv("../dataset/tags_contentbased.csv",index=False)
tags

Unnamed: 0,userId,movieId,tag,timestamp,un-lemmatised,glove_vec,has_glove_vec
0,318,260,s,2015-02-20 22:42:49,,[ 0.18209 0.88297 -0.49805 0.53137 -...,True
1,318,115149,action,2015-02-21 15:58:30,action,[ 2.0240e-02 8.4992e-01 -7.8150e-01 -8.2769e-...,True
2,320,2762,twist,2006-04-25 11:33:52,twist,[-9.5859e-02 -1.7472e-01 -3.4692e-02 -3.7307e-...,True
3,320,2959,twist,2006-04-25 11:30:58,twist,[-9.5859e-02 -1.7472e-01 -3.4692e-02 -3.7307e-...,True
4,320,3996,overrate,2006-04-25 11:32:28,overrated,[ 2.8151e-01 -4.2171e-01 -3.8275e-01 1.5364e-...,True
...,...,...,...,...,...,...,...
50186,138280,116797,history,2015-01-30 23:07:25,history,[ 4.5847e-02 7.4334e-02 1.5092e-02 -2.6392e-...,True
50187,138280,116797,informatics,2015-01-30 23:07:35,informatics,[ 1.7728e-01 1.5395e-01 7.7811e-01 1.6527e-...,True
50188,138280,116797,mathematics,2015-01-30 23:07:17,mathematics,[ 1.0033e+00 3.8874e-01 6.4312e-01 -6.8630e-...,True
50189,138280,117871,image,2015-01-30 23:09:16,image,[ 1.1091e-02 4.8461e-01 1.9142e-02 8.3725e-...,True


In [58]:
# reading in the movie context vectors

df_movies = pd.read_json("../dataset/df_context_vec.json")
df_movies


Unnamed: 0,movieId,imdbId,tmdbId,title,genres,imdb_syn,tmdb_syn,summarized_syn,summarized_syn_cleaned,summarized_syn_tokens,...,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,weighted_cbow_synopsis,final_context_vector
0,1,114709,862,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",A little boy named Andy loves to be in his roo...,"Led by Woody, Andy's toys live happily in his ...",a little boy name andy love to be in room play...,"[little, boy, andy, love, room, play, toy, esp...","[a, little, boy, name, andy, love, to, be, in,...",...,0,0,0,0,0,0,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,2,113497,8844,Jumanji (1995),"[Adventure, Children, Fantasy]","Jumanji, one of the most unique--and dangerous...",When siblings Judy and Peter discover an encha...,jumanji one of the most uniqueand dangerousboa...,"[jumanji, uniqueand, dangerousboard, game, fal...","[jumanji, one, of, the, most, uniqueand, dange...",...,0,0,0,0,0,0,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,4,114885,31357,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",This story based on the best selling novel by ...,"Cheated on, mistreated and stepped on, the wom...",this story base on the best sell novel by terr...,"[story, base, best, sell, novel, terry, mcmill...","[this, story, base, on, the, best, sell, novel...",...,0,0,0,1,0,0,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,5,113041,11862,Father of the Bride Part II (1995),[Comedy],"In this sequel to ""Father of the Bride"", Georg...",Just when George Banks has recovered from his ...,in this sequel to father of the bride george b...,"[sequel, father, bride, george, bank, accept, ...","[in, this, sequel, to, father, of, the, bride,...",...,0,0,0,0,0,0,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,6,113277,949,Heat (1995),"[Action, Crime, Thriller]",Hunters and their prey--Neil and his professio...,Obsessive master thief Neil McCauley leads a t...,hunter and preyneil and professional criminal ...,"[hunter, preyneil, professional, criminal, cre...","[hunter, and, preyneil, and, professional, cri...",...,0,0,0,0,0,1,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6178,130856,494826,48376,Severe Clear (2010),"[Comedy, Documentary]",,Severe Clear is a film based on the memoirs of...,severe clear be a film base on the memoir of f...,"[severe, clear, film, base, memoir, lieutenant...","[severe, clear, be, a, film, base, on, the, me...",...,0,0,0,0,0,0,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
6179,130958,143338,78402,Killer Crocodile (1989),[Horror],,A group of environmentalists arrives at a fara...,a group of environmentalist arrives at a faraw...,"[group, environmentalist, arrives, faraway, tr...","[a, group, of, environmentalist, arrives, at, ...",...,0,0,0,0,0,0,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
6180,130984,208423,317168,Santo vs. las lobas (1976),"[Action, Fantasy, Horror]",,Also known as Santo vs. the She-Wolves,also know as santo v the shewolves,"[know, santo, v, shewolves]","[also, know, as, santo, v, the, shewolves]",...,0,0,0,0,0,0,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
6181,131011,69109,79572,Execution Squad (1972),"[Crime, Drama]",,Bertone is a moderately honest homicide cop. U...,bertone be a moderately honest homicide cop be...,"[bertone, moderately, honest, homicide, cop, e...","[bertone, be, a, moderately, honest, homicide,...",...,0,0,0,0,0,0,0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [59]:
df_rec = pd.read_json('../dataset/df_rec.json', orient='split')

df_rec_sparse = df_rec[df_rec['recommendations'].str.len() == 0]

df_rec_sparse




Unnamed: 0,userId,recommendations,tags_movies
160,31935,[],[]
416,77374,[],[]
555,120748,[],[]
586,29096,[],[]
943,117259,[],[]
1014,98756,[],[]
1073,115495,[],[]
1095,11585,[],[]
1105,50687,[],[]
1404,54248,[],[]


Testing

In [64]:

# Extracting genre vectors
# df_movies['genre_vector'] = df_movies[all_genres].values.tolist()

['IMAX',
 'Comedy',
 '(no genres listed)',
 'Crime',
 'Animation',
 'Drama',
 'Horror',
 'Mystery',
 'Adventure',
 'Action',
 'Fantasy',
 'Film-Noir',
 'Documentary',
 'War',
 'Sci-Fi',
 'Children',
 'Western',
 'Romance',
 'Thriller',
 'Musical']

In [66]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# One-hot encode the genres for each movie
df_movies['genres'] = df_movies['genres'].apply(lambda x: x if isinstance(x, list) else [])
all_genres = set([genre for sublist in df_movies['genres'].tolist() for genre in sublist])
for genre in all_genres:
    df_movies[genre] = df_movies['genres'].apply(lambda x: 1 if genre in x else 0)

# Extracting genre vectors
df_movies['genre_vector'] = df_movies[list(all_genres)].values.tolist()

def get_similar_movies_genre_vector(user_movie_vector, threshold=0.98):
    similarities = cosine_similarity([user_movie_vector], df_movies['genre_vector'].tolist())[0]
    df_movies['cosine_similarity'] = similarities
    similar_movies = df_movies[df_movies['cosine_similarity'] >= threshold]
    return similar_movies['movieId'].tolist()

def get_movie_tags(similar_movie_ids, user):
    movie_tags_df = tags[tags['movieId'].isin(similar_movie_ids)][['movieId', 'tag']].groupby('movieId')['tag'].apply(list).reset_index()
    movie_tags_df['userId'] = user
    return movie_tags_df

# Main function using genre vectors
def context_aware_model_for_sparse_users_genre_vector():
    sparse_users = df_rec_sparse['userId'].unique()
    all_tags_for_similar_movies = []
    
    for user in sparse_users:
        # Getting movies tagged by the user
        user_movies = tags[tags['userId'] == user]['movieId'].unique()
        
        for movie in user_movies:
            if df_movies[df_movies['movieId'] == movie]['genre_vector'].shape[0] > 0:
                user_movie_vector = np.array(df_movies[df_movies['movieId'] == movie]['genre_vector'].iloc[0])
            else:
                continue

            similar_movie_ids = get_similar_movies_genre_vector(user_movie_vector)
            movie_tags_df = get_movie_tags(similar_movie_ids, user)
            
            all_tags_for_similar_movies.append(movie_tags_df)
    
    return pd.concat(all_tags_for_similar_movies)

result_df = context_aware_model_for_sparse_users_genre_vector()
print(result_df)


     movieId                                                tag  userId
0         41                         [shakespeare, shakespeare]   31935
1         73                                   [jewish, lawyer]   31935
2        527  [atmospheric, biography, classic, disturb, his...   31935
3        760                                          [germany]   31935
4       1090  [action, action, s, vietnam, action, drama, na...   31935
..       ...                                                ...     ...
795   126387                                          [latvian]   54248
796   127453  [crime, history, investigation, moscow, poland...   54248
797   128604                            [horrible, pretentious]   54248
798   128948                                            [greek]   54248
799   129303                                             [camp]   54248

[5416 rows x 3 columns]


In [67]:
result_df

Unnamed: 0,movieId,tag,userId
0,41,"[shakespeare, shakespeare]",31935
1,73,"[jewish, lawyer]",31935
2,527,"[atmospheric, biography, classic, disturb, his...",31935
3,760,[germany],31935
4,1090,"[action, action, s, vietnam, action, drama, na...",31935
...,...,...,...
795,126387,[latvian],54248
796,127453,"[crime, history, investigation, moscow, poland...",54248
797,128604,"[horrible, pretentious]",54248
798,128948,[greek],54248


In [68]:
result_df.to_csv("../dataset/df_sparse_similar_movies.csv")

In [72]:
result_df[result_df['userId'] == 54248]

Unnamed: 0,movieId,tag,userId
0,14,"[president, drama, nan, politics, president]",54248
1,26,"[shakespeare, shakespeare, shakespeare, shakes...",54248
2,31,"[inspirational, inspirational, nan, education,...",54248
3,43,"[historical, england]",54248
4,55,[musician],54248
...,...,...,...
795,126387,[latvian],54248
796,127453,"[crime, history, investigation, moscow, poland...",54248
797,128604,"[horrible, pretentious]",54248
798,128948,[greek],54248


In [73]:
tags[tags['userId'] == 54248]

Unnamed: 0,userId,movieId,tag,timestamp,un-lemmatised,glove_vec,has_glove_vec
22702,54248,119430,casino,2014-12-24 23:16:11,casino,[ 1.5207e-02 1.9634e-01 -6.1053e-01 2.6841e-...,True
22703,54248,119430,crap,2014-12-24 23:16:18,craps,[ 0.51509 0.036047 0.14851 -0.081221 -...,True
22704,54248,119430,gamble,2014-12-24 23:16:39,gamble,[-0.43183 0.22825 -0.46183 0.30202 ...,True
22705,54248,119430,gamble,2014-12-24 23:16:07,gambling,[-0.43183 0.22825 -0.46183 0.30202 ...,True
22706,54248,119430,hustle,2014-12-24 23:16:33,hustle,[ 0.45497 -0.11613 -0.62842 0.25815 ...,True
22707,54248,119430,poker,2014-12-24 23:16:21,poker,[ 0.42254 0.24923 -0.15362 -0.73099 ...,True
