## Word2vec Model Training
- Using the movie plot and summary datasets, train a word2vec model that recommends movies with similar plot lines

In [1]:
# enter path to the location of the dataset here
"""# Mike's desktop paths 
path_to_imdb_dataset = 'C:/Users/123/OneDrive/Academic/5430/data/title.basics.tsv.gz'
path_to_reviews_dataset = 'C:/Users/123/OneDrive/Academic/5430/data/IMDB_reviews.json'
path_to_plots_dataset = 'C:/Users/123/OneDrive/Academic/5430/data/wiki_movie_plots_deduped.csv'
path_to_details_dataset = 'C:/Users/123/OneDrive/Academic/5430/data/IMDB_movie_details.json' """

# Mike's laptop paths
path_to_imdb_dataset = '/Users/yupan/Library/CloudStorage/OneDrive-Personal/Academic/5430/data/title.basics.tsv.gz'
path_to_reviews_dataset = '/Users/yupan/Library/CloudStorage/OneDrive-Personal/Academic/5430/data/IMDB_reviews.json'
path_to_plots_dataset = '/Users/yupan/Library/CloudStorage/OneDrive-Personal/Academic/5430/data/wiki_movie_plots_deduped.csv'
path_to_details_dataset = '/Users/yupan/Library/CloudStorage/OneDrive-Personal/Academic/5430/data/IMDB_movie_details.json'

Three data sources:
1. IMDB: used for matching movie titles and IDs
2. Spoiler: contains plots and movie IDs, used for training
3. Plot: contains plots and movie names, used for training

The plot_synopsis field holds the movies' plot summaries, including spoilers. Since we want to compare the plot lines of movies, we will use this field to train our word2vec model.

To train the model, let's first start a Spark session and load the datasets into Spark DataFrames.

In [2]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()



In [3]:
# reading the IMDB dataset
imdb = spark.read.options(header = True, inferSchema = True, delimiter = "\t").csv(path_to_imdb_dataset)

# filter the imdb dataset so that only movies are included
imdb = imdb.filter("titleType = 'movie'")\
    .filter("primaryTitle != ''")\
    .select('tconst', 'primaryTitle', 'startYear')\
    .withColumnRenamed('startYear', 'Year')

print('there is a total of ', imdb.count(), ' movies left in the imdb dataset')
imdb.show(3)

                                                                                

there is a total of  651281  movies left in the imdb dataset
+---------+--------------------+----+
|   tconst|        primaryTitle|Year|
+---------+--------------------+----+
|tt0000009|          Miss Jerry|1894|
|tt0000147|The Corbett-Fitzs...|1897|
|tt0000502|            Bohemios|1905|
+---------+--------------------+----+
only showing top 3 rows



In [4]:
details = spark.read.json(path_to_details_dataset)
details = details.select('movie_id','plot_synopsis')\
  .filter("plot_synopsis != ''")
print('there is a total of ', details.count(), ' plot summaries left in the details dataset')
details.show(3)

there is a total of  1339  plot summaries left in the details dataset
+---------+--------------------+
| movie_id|       plot_synopsis|
+---------+--------------------+
|tt0105112|Jack Ryan (Ford) ...|
|tt1204975|Four boys around ...|
|tt0040897|Fred Dobbs (Humph...|
+---------+--------------------+
only showing top 3 rows



In [5]:
# join imdb with details by matching the unique identifier (e.g. tt0000000)
imdb_join_details = imdb.join(details, imdb.tconst == details.movie_id, 'inner')\
    .withColumnRenamed('plot_synopsis', 'Plot')\
    .withColumnRenamed('primaryTitle', 'Title')\
    .withColumnRenamed('tconst', 'id')\
    .select('id', 'Title', 'Plot')

print("The joined dataset has ", imdb_join_details.count(), " entries")
# tconst was renamed to id above, so filter on id
imdb_join_details.filter("id == 'tt0472062'").show(truncate = False)

                                                                                

The joined dataset has  1324  entries


[output truncated: table with columns id, Title, Plot]

                                                                                

In [6]:
# reading the plot dataset
plot = spark.read.options(header = True, inferSchema = True, quote = '"', escape = '"', multiLine = True).csv(path_to_plots_dataset)
plot = plot.select('Title', 'Release Year','Plot').withColumnRenamed('Release Year', 'Year')
print('there is a total of ', plot.count(), ' plot summaries in the plot dataset')
plot.show(3)

                                                                                

there is a total of  34886  plot summaries in the plot dataset
+--------------------+----+--------------------+
|               Title|Year|                Plot|
+--------------------+----+--------------------+
|Kansas Saloon Sma...|1901|A bartender is wo...|
|Love by the Light...|1901|The moon, painted...|
|The Martyred Pres...|1901|The film, just ov...|
+--------------------+----+--------------------+
only showing top 3 rows



In [7]:
# join the imdb with the plot dataset by matching movie titles and release year
imdb_join_plot = imdb.join(plot, [imdb.primaryTitle == plot.Title, imdb.Year == plot.Year], 'inner')\
    .withColumnRenamed('tconst', 'id')\
    .select('id', 'Title', 'Plot')

print("The joined dataset has ", imdb_join_plot.count(), " entries")
#imdb_join_plot.filter("id == 'tt0472062'").show(truncate = False)
imdb_join_plot.show(5, truncate=False)

                                                                                

The joined dataset has  26953  entries


[output truncated: table with columns id, Title, Plot]
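The join above only matches rows whose title and release year are exactly equal, so titles that differ in letter case or spacing would be missed. The following is a minimal plain-Python sketch of the same title-and-year matching with a light normalization step. The helper names `normalize_title` and `join_on_title_year` and the toy rows are invented for illustration and are not part of the notebook.

```python
import re

def normalize_title(title):
    """Lowercase a title and collapse runs of whitespace (hypothetical helper)."""
    return re.sub(r'\s+', ' ', title).strip().lower()

def join_on_title_year(imdb_rows, plot_rows):
    """Inner join of (id, title, year) rows with (title, year, plot) rows."""
    # index the plot rows by (normalized title, year) for constant-time lookup
    index = {}
    for title, year, plot in plot_rows:
        index.setdefault((normalize_title(title), int(year)), []).append(plot)
    joined = []
    for movie_id, title, year in imdb_rows:
        for plot in index.get((normalize_title(title), int(year)), []):
            joined.append((movie_id, title, plot))
    return joined

# toy rows, invented for illustration
imdb_rows = [('tt0000001', 'Some Movie', 1999), ('tt0000002', 'Some Movie', 2010)]
plot_rows = [('some  movie', '1999', 'A toy plot.'), ('Other Movie', '2001', 'Another toy plot.')]
matches = join_on_title_year(imdb_rows, plot_rows)
print(matches)
```

Including the year in the key is what keeps the two toy films that share a title apart: only the 1999 entry is matched.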

                                                                                

In [8]:
# stack the two joined datasets; both already contain only id, Title and Plot
df = imdb_join_plot.union(imdb_join_details)

print('after merging & cleaning, there is a total of ', df.count(), ' movie plot entries left in the merged dataset')

df.show(3, truncate = False)

                                                                                

after merging & cleaning, there is a total of  28277  movie plot entries left in the merged dataset


[output truncated: table with columns id, Title, Plot]
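The union keeps every row from both joins, which is why the count equals 26953 + 1324 = 28277. A movie present in both sources would therefore appear twice. Below is a small plain-Python sketch of keeping one row per id. Preferring the first dataset is an assumption of this sketch, and `union_dedup` is a hypothetical helper. In PySpark, `df.dropDuplicates(['id'])` would be the analogous call.

```python
def union_dedup(primary_rows, secondary_rows):
    """Concatenate two row lists and keep the first row seen for each id."""
    seen = set()
    merged = []
    for row in list(primary_rows) + list(secondary_rows):
        movie_id = row[0]
        if movie_id not in seen:
            seen.add(movie_id)
            merged.append(row)
    return merged

# toy rows, invented for illustration: 'id2' appears in both sources
from_plots = [('id1', 'Movie A', 'short plot a'), ('id2', 'Movie B', 'short plot b')]
from_details = [('id2', 'Movie B', 'long synopsis b'), ('id3', 'Movie C', 'long synopsis c')]
merged = union_dedup(from_plots, from_details)
print(len(merged))
```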

                                                                                

## Training the Model

In [9]:
# tokenize and remove stop words in this cell
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, Word2Vec

# create a new field by copying Plot
df = df.withColumn('inputText', F.col('Plot')) 

# regular expression tokenizer to split inputText into individual tokens (words);
# with gaps = False the pattern matches the tokens themselves, not the separators
regextok = RegexTokenizer(gaps = False, pattern = r'\w+', inputCol = 'inputText', outputCol = 'tokens')

# StopWordsRemover to remove stopwords in the list of tokens
stopwrmv = StopWordsRemover(inputCol = 'tokens', outputCol = 'tokens_sw_removed')
df = regextok.transform(df)
df = stopwrmv.transform(df)
df.show(3)

+---------+------------------+--------------------+--------------------+--------------------+--------------------+
|       id|             Title|                Plot|           inputText|              tokens|   tokens_sw_removed|
+---------+------------------+--------------------+--------------------+--------------------+--------------------+
|tt0790799|             $9.99|The film mainly f...|The film mainly f...|[the, film, mainl...|[film, mainly, fo...|
|tt2614684|               '71|Gary Hook, a new ...|Gary Hook, a new ...|[gary, hook, a, n...|[gary, hook, new,...|
|tt0032176|'Til We Meet Again|Total strangers D...|Total strangers D...|[total, strangers...|[total, strangers...|
+---------+------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows
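To make the two preprocessing steps concrete, here is a plain-Python sketch of what they do to a single sentence: lowercase the text, extract runs of word characters, then drop stop words. The sentence and the four-word stop list are invented for illustration. Spark's default English stop-word list is much longer.

```python
import re

# tiny illustrative stop-word list (an assumption, not Spark's default list)
TOY_STOP_WORDS = {'the', 'and', 'a', 'is'}

def tokenize(text):
    """Lowercase the text and return every run of word characters."""
    return re.findall(r'\w+', text.lower())

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize('The ship sinks, and a young artist is lost.')
kept = remove_stop_words(tokens)
print(tokens)
print(kept)
```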



                                                                                

In [10]:
# train word2vec model, the parameters here can be changed to optimize the model
word2vec = Word2Vec(vectorSize = 100, minCount = 5, inputCol = 'tokens_sw_removed', outputCol = 'wordvectors')
model = word2vec.fit(df)

23/08/02 15:45:31 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
                                                                                

In [11]:
# use transform to add a wordvectors column (one vector per plot) to the dataframe
df = model.transform(df)
# collect up to 30000 rows to the driver for the similarity search below
chunks = df.select('id', 'Title','wordvectors', 'Plot').limit(30000).collect()
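Spark's documentation describes the transform step as turning each document into the average of the vectors of its words. The sketch below shows that idea with toy two-dimensional vectors. The vocabulary is invented, and skipping out-of-vocabulary tokens is an assumption of this sketch that may differ from Spark's own handling.

```python
def average_vector(tokens, word_vectors, size):
    """Average the vectors of the tokens found in word_vectors.

    Tokens missing from the vocabulary are skipped (an assumption of this
    sketch); a document with no known tokens yields a zero vector.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * size
    return [sum(v[i] for v in known) / len(known) for i in range(size)]

# toy vocabulary, invented for illustration
toy_vectors = {'ship': [1.0, 0.0], 'iceberg': [0.0, 1.0]}
doc_vec = average_vector(['ship', 'iceberg', 'unknownword'], toy_vectors, size=2)
empty_vec = average_vector(['unknownword'], toy_vectors, size=2)
print(doc_vec, empty_vec)
```

A query made only of unknown words ends up as a zero vector, which is the case a cosine similarity function has to guard against.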

                                                                                

In [12]:
# here I am trying to use the actual plot of the movies as an input, but I haven't completed the code
""" base_movie = "Titanic"
SEARCH_QUERY = df.filter(df.Title == base_movie).select("Plot")
SEARCH_QUERY.show() """

' base_movie = "Titanic"\nSEARCH_QUERY = df.filter(df.Title == base_movie).select("Plot")\nSEARCH_QUERY.show() '
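One possible way to finish the idea above is to reuse the vector that was already computed for the base movie's plot instead of tokenizing the plot again. The sketch below assumes rows shaped like the collected `chunks` list, that is (id, Title, wordvectors, Plot). `find_query_vector` is a hypothetical helper and the rows are toy data.

```python
def find_query_vector(rows, base_title):
    """Return the vector of the first row whose title matches, else None."""
    for movie_id, title, vector, plot in rows:
        if title == base_title:
            return vector
    return None

# toy rows in the (id, Title, wordvectors, Plot) layout, invented for illustration
toy_rows = [
    ('id1', 'Movie A', [0.1, 0.9], 'toy plot a'),
    ('id2', 'Movie B', [0.8, 0.2], 'toy plot b'),
]
query = find_query_vector(toy_rows, 'Movie B')
missing = find_query_vector(toy_rows, 'Movie Z')
print(query, missing)
```

Titles are not guaranteed to be unique, so this sketch simply takes the first match. Looking up by id instead would avoid that ambiguity.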

In [13]:
# create search query and transform it to word vectors
SEARCH_QUERY = "Titanic"
query_df = spark.createDataFrame([(1, SEARCH_QUERY)]).toDF('index','inputText')
query_tok = regextok.transform(query_df)
query_swr = stopwrmv.transform(query_tok)
query_vec = model.transform(query_swr)
query_vec = query_vec.select('wordvectors').collect()[0][0]
query_vec

                                                                                

DenseVector([0.0894, 0.0576, 0.2092, -0.1297, -0.1751, 0.1632, 0.0966, -0.0067, -0.0508, 0.0572, 0.1328, -0.013, 0.0614, -0.1614, 0.0955, -0.0553, -0.0983, -0.0384, 0.142, 0.0748, 0.0825, 0.0458, -0.0424, -0.0764, 0.0668, 0.0377, -0.186, 0.0442, 0.044, -0.014, 0.0129, -0.0195, -0.041, 0.049, 0.023, -0.016, 0.0783, -0.0659, -0.1108, 0.0196, 0.1961, 0.083, 0.0189, 0.0255, 0.165, -0.0683, 0.038, 0.0636, 0.0207, -0.1058, -0.1393, 0.1209, -0.0508, -0.0001, -0.0672, 0.0946, 0.0225, 0.0509, 0.0776, 0.0327, 0.0546, 0.148, 0.1342, -0.1169, 0.097, 0.1636, -0.0741, -0.0993, 0.1836, -0.1829, 0.0725, 0.0268, 0.1521, 0.0145, -0.0623, -0.0112, 0.048, -0.0428, 0.0437, -0.291, -0.1043, 0.0059, -0.0962, -0.1176, -0.1713, 0.0934, 0.0701, 0.0427, 0.0302, 0.0221, -0.1327, -0.085, 0.119, -0.1746, 0.0446, 0.1417, 0.0656, -0.0571, 0.1699, -0.069])

In [14]:
# define function to calculate cosine similarity
import numpy as np
def cossim(v1, v2): 
    '''
        cossim(v1, v2) calculates the cosine similarity between v1 and v2.
        If v1 or v2 is a zero vector, it returns 0.0
    '''
    if np.dot(v1, v1) == 0 or np.dot(v2, v2) == 0:
        return 0.0
    return float(np.dot(v1, v2) / np.sqrt(np.dot(v1, v1)) / (np.sqrt(np.dot(v2, v2))))
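As a quick check of the formula, the following dependency-free sketch computes the same quantity, dot(v1, v2) / (|v1| |v2|), on small vectors whose answers are easy to verify by hand. The name `cosine` is used only for this illustration.

```python
import math

def cosine(v1, v2):
    """Cosine similarity of two equal-length sequences; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine([1.0, 0.0], [-1.0, 0.0])        # opposite directions
with_zero = cosine([0.0, 0.0], [1.0, 1.0])        # zero vector guard
print(same_direction, orthogonal, opposite, with_zero)
```

Because the score depends only on direction, a long plot and a short plot about the same subject can still score close to 1.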

In [15]:
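# score every collected movie against the query vector: (id, similarity, Title, Plot)
data = [(movie_id, cossim(query_vec, vec), title, plot) for movie_id, title, vec, plot in chunks]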

In [17]:
sim_df = spark.createDataFrame(data).toDF('movie_id', 'similarity', 'Title', 'Plot').orderBy('similarity', ascending=False)
sim_df.show(10, truncate = False)

23/08/02 15:49:34 WARN TaskSetManager: Stage 84 contains a task of very large size (6703 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

[output truncated: table with columns movie_id, similarity, Title, Plot]
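Since the scored rows already sit in a Python list on the driver, the top matches can also be picked without building a second Spark DataFrame. This is a minimal sketch using the standard library's `heapq.nlargest`. The `top_k` helper and the toy scores are invented for illustration.

```python
import heapq

def top_k(scored_rows, k):
    """Return the k rows with the highest similarity (the second field)."""
    return heapq.nlargest(k, scored_rows, key=lambda row: row[1])

# toy (movie_id, similarity, Title) rows, invented for illustration
toy_scores = [('id1', 0.12, 'Movie A'), ('id2', 0.87, 'Movie B'), ('id3', 0.45, 'Movie C')]
best = top_k(toy_scores, 2)
for movie_id, score, title in best:
    print(movie_id, round(score, 2), title)
```

If the base movie itself is in the list, it will typically rank first with a similarity near 1, so it may be worth filtering it out before showing recommendations.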