## Word2vec Model Training
- Using the movie plot summary and synopsis dataset, train a word2vec model that recommends movies with similar plot lines.

In [4]:
# enter path to the location of the dataset here
"""# Mike's desktop paths 
path_to_imdb_dataset = 'C:/Users/123/OneDrive/Academic/5430/data/title.basics.tsv.gz'
path_to_reviews_dataset = 'C:/Users/123/OneDrive/Academic/5430/data/IMDB_reviews.json'
path_to_plots_dataset = 'C:/Users/123/OneDrive/Academic/5430/data/wiki_movie_plots_deduped.csv'
path_to_details_dataset = 'C:/Users/123/OneDrive/Academic/5430/data/IMDB_movie_details.json' """

# Mike's laptop paths
path_to_imdb_dataset = '/Users/yupan/Library/CloudStorage/OneDrive-Personal/Academic/5430/data/title.basics.tsv.gz'
path_to_reviews_dataset = '/Users/yupan/Library/CloudStorage/OneDrive-Personal/Academic/5430/data/IMDB_reviews.json'
path_to_plots_dataset = '/Users/yupan/Library/CloudStorage/OneDrive-Personal/Academic/5430/data/wiki_movie_plots_deduped.csv'
path_to_details_dataset = '/Users/yupan/Library/CloudStorage/OneDrive-Personal/Academic/5430/data/IMDB_movie_details.json'

In [None]:
""" # load imdb dataset
import gzip
with gzip.open(path_to_imdb_dataset, 'rt', encoding='utf-8') as f:
    df_imdb = pd.read_csv(f, delimiter='\t') """

In [None]:
""" # obtain all unique movie titles from the IMDB dataset
unique_titles = df_imdb[df_imdb["titleType"] == 'movie']
unique_titles = unique_titles.drop_duplicates(subset="primaryTitle")
unique_titles.head(5) """

Load the movie details dataset. Upon inspection, we find that it has two plot fields: a short summary (plot_summary) and a full synopsis (plot_synopsis).

In [9]:
import pandas as pd

# load the movie details dataset (one JSON object per line)
plot_details = pd.read_json(path_to_details_dataset, lines=True)

# inspect the movie details dataset
plot_details.head(5)

Unnamed: 0,movie_id,plot_summary,duration,genre,rating,release_date,plot_synopsis
0,tt0105112,"Former CIA analyst, Jack Ryan is in England wi...",1h 57min,"[Action, Thriller]",6.9,1992-06-05,"Jack Ryan (Ford) is on a ""working vacation"" in..."
1,tt1204975,"Billy (Michael Douglas), Paddy (Robert De Niro...",1h 45min,[Comedy],6.6,2013-11-01,Four boys around the age of 10 are friends in ...
2,tt0243655,"The setting is Camp Firewood, the year 1981. I...",1h 37min,"[Comedy, Romance]",6.7,2002-04-11,
3,tt0040897,"Fred C. Dobbs and Bob Curtin, both down on the...",2h 6min,"[Adventure, Drama, Western]",8.3,1948-01-24,Fred Dobbs (Humphrey Bogart) and Bob Curtin (T...
4,tt0126886,Tracy Flick is running unopposed for this year...,1h 43min,"[Comedy, Drama, Romance]",7.3,1999-05-07,Jim McAllister (Matthew Broderick) is a much-a...


According to the dataset description (https://www.kaggle.com/datasets/rmisra/imdb-spoiler-dataset), plot_synopsis contains each movie's plot synopsis, including spoilers. Since we want to compare movies by the similarity of their plot lines, we will use this field to train our word2vec model.

To train the model, let's first start a Spark session and load the dataset into a Spark DataFrame.

In [24]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

In [25]:
df = spark.read.json(path_to_details_dataset)
df = df.select('movie_id','plot_synopsis').filter("plot_synopsis != ''")
print('after cleaning, there is a total of ', df.count(), ' movie plot summaries')


df = df.withColumn('inputText', F.col('plot_synopsis'))
df.show(3)

after cleaning, there is a total of  1339  movie plot summaries
+---------+--------------------+--------------------+
| movie_id|       plot_synopsis|           inputText|
+---------+--------------------+--------------------+
|tt0105112|Jack Ryan (Ford) ...|Jack Ryan (Ford) ...|
|tt1204975|Four boys around ...|Four boys around ...|
|tt0040897|Fred Dobbs (Humph...|Fred Dobbs (Humph...|
+---------+--------------------+--------------------+
only showing top 3 rows



In [26]:
# tokenize and remove stop words in this cell
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, Word2Vec

# regular expression tokenizer to tokenize inputText into individual tokens (words)
regextok = RegexTokenizer(gaps = False, pattern = r'\w+', inputCol = 'inputText', outputCol = 'tokens')

# StopWordsRemover to remove stopwords in the list of tokens
stopwrmv = StopWordsRemover(inputCol = 'tokens', outputCol = 'tokens_sw_removed')

df = regextok.transform(df)
df = stopwrmv.transform(df)
df.show(3)

+---------+--------------------+--------------------+--------------------+--------------------+
| movie_id|       plot_synopsis|           inputText|              tokens|   tokens_sw_removed|
+---------+--------------------+--------------------+--------------------+--------------------+
|tt0105112|Jack Ryan (Ford) ...|Jack Ryan (Ford) ...|[jack, ryan, ford...|[jack, ryan, ford...|
|tt1204975|Four boys around ...|Four boys around ...|[four, boys, arou...|[four, boys, arou...|
|tt0040897|Fred Dobbs (Humph...|Fred Dobbs (Humph...|[fred, dobbs, hum...|[fred, dobbs, hum...|
+---------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows
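
As a rough, Spark-free illustration of what the two stages above do to a single synopsis, the sketch below lowercases the text, keeps every `\w+` match as a token, and then drops stop words. `STOP_WORDS` is a small hypothetical sample rather than Spark's default English list, and `tokenize` and `remove_stop_words` are illustrative helper names that are not part of the notebook.

```python
import re

# Hypothetical, deliberately tiny stop word sample (assumption for illustration);
# Spark's default English list is much longer.
STOP_WORDS = {"a", "an", "and", "is", "in", "of", "on", "the", "with", "his"}

def tokenize(text):
    # Lowercase first, then return every run of word characters,
    # mirroring a tokenizer configured with gaps=False and pattern \w+.
    return re.findall(r"\w+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Keep only the tokens that are not in the stop word set.
    return [t for t in tokens if t not in stop_words]

sample = 'Jack Ryan (Ford) is on a "working vacation" in London with his family.'
tokens = tokenize(sample)
filtered = remove_stop_words(tokens)

print(tokens)
print(filtered)
```

Punctuation and quotes disappear because they never match `\w+`, which is why the `tokens` column above starts with `jack, ryan, ford`.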



In [27]:
# train word2vec model
word2vec = Word2Vec(vectorSize = 100, minCount = 5, inputCol = 'tokens_sw_removed', outputCol = 'wordvectors')
model = word2vec.fit(df)

                                                                                

In [36]:
# using transform to add wordvectors column to dataframe
df = model.transform(df)
chunks = df.select('movie_id', 'plot_synopsis','wordvectors').limit(30000).collect()

                                                                                

In [29]:
# create search query and transform it to word vectors
SEARCH_QUERY = "FBI"
query_df = spark.createDataFrame([(1, SEARCH_QUERY)]).toDF('index','inputText')
query_tok = regextok.transform(query_df)
query_swr = stopwrmv.transform(query_tok)
query_vec = model.transform(query_swr)
query_vec = query_vec.select('wordvectors').collect()[0][0]
query_vec

                                                                                

DenseVector([-0.1636, 0.2109, -0.0792, -0.0412, -0.0555, 0.3304, 0.0679, 0.0579, 0.1853, 0.0019, -0.1646, 0.0478, -0.1155, -0.2133, 0.0258, 0.0115, -0.1741, -0.0595, -0.0189, -0.0008, -0.0009, -0.1454, 0.0192, 0.0461, 0.0599, -0.0331, 0.1709, -0.0365, 0.016, -0.1037, -0.1511, 0.1848, 0.0719, 0.0659, -0.1696, 0.1041, -0.0165, 0.0645, -0.3049, 0.1035, -0.144, 0.0756, -0.048, 0.1401, -0.0729, -0.1139, -0.1896, -0.059, -0.0556, -0.1468, 0.079, 0.001, -0.1078, 0.0609, -0.0428, 0.0462, 0.2305, -0.0971, 0.2571, -0.2279, 0.0789, 0.0351, 0.0323, -0.0549, -0.0537, -0.073, 0.0921, -0.2874, -0.0017, 0.0091, -0.2079, -0.3728, 0.0215, -0.1557, -0.0057, -0.02, 0.1128, 0.3135, 0.0799, -0.0473, 0.0859, -0.0692, 0.0684, 0.156, 0.0597, -0.016, -0.0535, -0.1263, -0.0235, -0.1763, -0.0686, 0.1547, -0.1075, -0.0677, -0.0171, 0.0764, -0.1579, 0.1827, 0.1768, -0.128])
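
Spark's `Word2VecModel.transform` represents a document as the average of the word vectors of its tokens, so a one-word query such as "FBI" should map to the embedding of that single token. The NumPy sketch below illustrates the averaging with made-up 3-dimensional embeddings. `toy_embeddings` and `document_vector` are hypothetical names, and treating unknown tokens as zero vectors while dividing by the full token count is an assumption about Spark's behaviour, not something the notebook verifies.

```python
import numpy as np

# Made-up 3-dimensional embeddings for illustration only;
# the model trained above uses vectorSize=100.
toy_embeddings = {
    "fbi": np.array([1.0, 0.0, 2.0]),
    "agent": np.array([3.0, 2.0, 0.0]),
}

def document_vector(tokens, embeddings, size=3):
    # An empty document maps to the zero vector.
    if not tokens:
        return np.zeros(size)
    total = np.zeros(size)
    for token in tokens:
        # Assumption: out-of-vocabulary tokens contribute a zero vector.
        total += embeddings.get(token, np.zeros(size))
    # Average over all tokens in the document.
    return total / len(tokens)

single = document_vector(["fbi"], toy_embeddings)
pair = document_vector(["fbi", "agent"], toy_embeddings)

print(single)
print(pair)
```

Because longer queries are averaged, adding generic words to a query pulls its vector toward the middle of the embedding space, which tends to make similarity scores less discriminative.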

In [30]:
# define function to calculate cosine similarity
import numpy as np
def cossim(v1, v2): 
    '''
        cossim(v1, v2) calculates the cosine similarity between v1 and v2.
        If v1 or v2 is a zero vector, it returns 0.0.
    '''
    if np.dot(v1, v1) == 0 or np.dot(v2, v2) == 0:
        return 0.0
    return float(np.dot(v1, v2) / np.sqrt(np.dot(v1, v1)) / (np.sqrt(np.dot(v2, v2))))
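
A quick sanity check helps confirm that a cosine similarity function behaves as expected before it is applied to all synopses. The sketch below uses an equivalent formulation based on `np.linalg.norm`. `cosine_similarity` is an illustrative name, not the notebook's `cossim`, and the test vectors are arbitrary.

```python
import numpy as np

def cosine_similarity(v1, v2):
    # Same quantity as cossim above: dot product divided by the product of the norms.
    n1 = np.linalg.norm(v1)
    n2 = np.linalg.norm(v2)
    # Return 0.0 for zero vectors to avoid dividing by zero.
    if n1 == 0 or n2 == 0:
        return 0.0
    return float(np.dot(v1, v2) / (n1 * n2))

a = np.array([1.0, 2.0, 3.0])

same_direction = cosine_similarity(a, 2 * a)  # scaling should not change the angle
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine_similarity(a, -a)
zero_case = cosine_similarity(a, np.zeros(3))

print(same_direction, orthogonal, opposite, zero_case)
```

The four cases should come out close to 1, 0, -1, and exactly 0.0 respectively, up to floating point rounding.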

In [31]:
chunks[:3]

[Row(plot_synopsis='Jack Ryan (Ford) is on a "working vacation" in London with his family. He has retired from the CIA and is a Professor at the US Naval Academy. He is seen delivering a lecture at the Royal Naval Academy in London.Meanwhile, Ryan\'s wife Cathy and daughter Sally are sightseeing near Buckingham Palace. Sally and Cathy come upon a British Royal Guard, and Sally tries to get the guard to react by doing an improvised tap dance in front of him. She\'s impressed when the guard, trained to ignore distraction, doesn\'t react at all, and they leave.As Sally and Cathy walk away from the guard, en route to rendezvous with Ryan, they walk by a stolen cab, in which sit three Ulster Liberation Army terrorists: Kevin O\'Donnell, the driver, as well as Sean Miller (Sean Bean) and his younger brother Patrick. The three are loading bullets into their guns as they prepare to carry out a scheduled ambush on Lord William Holmes, British Secretary of State for Northern Ireland and a distan

In [38]:
data = [(i[0], i[1], float(cossim(query_vec, i[2]))) for i in chunks]
#sim_df = spark.createDataFrame(data).toDF('title', 'similarity').orderBy('similarity', ascending=False)
#sim_df.show(20, truncate = False)
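
Since `data` is already a local Python list at this point, the ranking can also be done without sending the rows back to Spark. The sketch below sorts a stand-in list by its similarity field. `toy_data` holds placeholder IDs and scores (not results from the model), and `top_matches` is a hypothetical helper.

```python
# Placeholder (movie_id, text, similarity) tuples standing in for `data`;
# the IDs, texts, and scores are invented for illustration.
toy_data = [
    ("tt0000001", "plot A ...", 0.12),
    ("tt0000002", "plot B ...", 0.87),
    ("tt0000003", "plot C ...", 0.45),
]

def top_matches(rows, k=2):
    # Sort by the similarity field (index 2), highest first, and keep the top k.
    return sorted(rows, key=lambda row: row[2], reverse=True)[:k]

for movie_id, text, similarity in top_matches(toy_data):
    print(movie_id, round(similarity, 2))
```

With only about 1,300 synopses, a local sort is cheap, and it may avoid the overhead of serializing the full synopsis text into a new Spark DataFrame.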

In [40]:
sim_df = spark.createDataFrame(data).toDF('movie_id','text', 'similarity').orderBy('similarity', ascending=False)
sim_df.show(50, truncate = False)

23/07/27 10:37:47 WARN TaskSetManager: Stage 44 contains a task of very large size (1589 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------