## Word2vec Model Training
- Using the movie plot and summary datasets, train a word2vec model that recommends movies with similar plot lines.

In [1]:
# enter path to the location of the dataset here
"""# Mike's desktop paths 
path_to_imdb_dataset = 'C:/Users/123/OneDrive/Academic/5430/data/title.basics.tsv.gz'
path_to_reviews_dataset = 'C:/Users/123/OneDrive/Academic/5430/data/IMDB_reviews.json'
path_to_plots_dataset = 'C:/Users/123/OneDrive/Academic/5430/data/wiki_movie_plots_deduped.csv'
path_to_details_dataset = 'C:/Users/123/OneDrive/Academic/5430/data/IMDB_movie_details.json' """

# Mike's laptop paths
path_to_imdb_dataset = '/Users/yupan/Library/CloudStorage/OneDrive-Personal/Academic/5430/data/title.basics.tsv.gz'
path_to_reviews_dataset = '/Users/yupan/Library/CloudStorage/OneDrive-Personal/Academic/5430/data/IMDB_reviews.json'
path_to_plots_dataset = '/Users/yupan/Library/CloudStorage/OneDrive-Personal/Academic/5430/data/wiki_movie_plots_deduped.csv'
path_to_details_dataset = '/Users/yupan/Library/CloudStorage/OneDrive-Personal/Academic/5430/data/IMDB_movie_details.json'

Three data sources: 
1. IMDB: used for matching movie titles and IDs
2. Spoiler: contains plots and movie IDs, used for training
3. Plot: contains plots and movie names, used for training

Plot_synopsis holds each movie's plot summary, including spoilers. Since we want to compare the plot lines of movies, we use this variable, together with plot_summary and the Wikipedia plots, to train our word2vec model. 

To train the model, let's first start a Spark session and load the datasets into Spark DataFrames.

In [2]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()



In [3]:
# reading the IMDB dataset
imdb = spark.read.options(header = True, inferSchema = True, delimiter = "\t").csv(path_to_imdb_dataset)

# filter the imdb dataset so that only movies are included
imdb = imdb.filter("titleType = 'movie'")\
  .select('tconst', 'primaryTitle', 'startYear')\
    .withColumnRenamed('startYear', 'Year')\
      .withColumnRenamed('primaryTitle', 'Title')\
        .dropDuplicates(['Title', 'Year'])

print('there is a total of ', imdb.count(), ' movies in the imdb dataset')
imdb.show(1)

                                                                                

there is a total of  637758  movies in the imdb dataset



+---------+--------------------+----+
|   tconst|               Title|Year|
+---------+--------------------+----+
|tt0033122|"Swing it" magistern|1940|
+---------+--------------------+----+
only showing top 1 row



                                                                                

In [6]:
# reading the details dataset, keeping only the movie id and the plot summary
details_summary = spark.read.json(path_to_details_dataset)
details_summary = details_summary\
  .select('movie_id','plot_summary')\
    .withColumnRenamed('plot_summary','Plot')
print('there is a total of ', details_summary.count(), ' plot summaries in the details_summary dataset')

# reading the details dataset again, keeping only the movie id and the non-empty plot synopsis
details_synopsis = spark.read.json(path_to_details_dataset)
details_synopsis = details_synopsis.select('movie_id','plot_synopsis')\
  .filter("plot_synopsis != ''")\
    .withColumnRenamed('plot_synopsis', 'Plot')
print('there is a total of ', details_synopsis.count(), ' plot synopsis in the details_synopsis dataset')

details = details_summary.union(details_synopsis)
print('there is a total of ', details.count(), ' plot descriptions in the details dataset')
details.show(3)

there is a total of  1572  plot summaries in the details_summary dataset
there is a total of  1339  plot synopsis in the details_synopsis dataset
there is a total of  2911  plot descriptions in the details dataset
+---------+--------------------+
| movie_id|                Plot|
+---------+--------------------+
|tt0105112|Former CIA analys...|
|tt1204975|Billy (Michael Do...|
|tt0243655|The setting is Ca...|
+---------+--------------------+
only showing top 3 rows



In [8]:
from pyspark.sql.functions import lit

# join the imdb with details by matching the unique identifier(e.g. tt0000000)
imdb_join_details = imdb.join(details, imdb.tconst == details.movie_id, 'inner')\
  .withColumnRenamed('tconst', 'id')\
    .select('id', 'Title', 'Plot')\
      .withColumn("Source", lit("imdb_details"))

print("The joined dataset has ", imdb_join_details.count(), " entries")

# inspect the joined dataset
imdb_join_details.show(3)

                                                                                

The joined dataset has  2857  entries



+---------+--------------+--------------------+------------+
|       id|         Title|                Plot|      Source|
+---------+--------------+--------------------+------------+
|tt2294449|22 Jump Street|Following their s...|imdb_details|
|tt2294449|22 Jump Street|After making thei...|imdb_details|
|tt0120623|  A Bug's Life|On a small island...|imdb_details|
+---------+--------------+--------------------+------------+
only showing top 3 rows



                                                                                

In [9]:
from pyspark.sql.functions import length
# reading the plot dataset, preserving only three important variables
wiki_plot = spark.read.options(header = True, inferSchema = True, quote = '"', escape = '"', multiLine = True).csv(path_to_plots_dataset)
wiki_plot = wiki_plot.select('Title', 'Release Year','Plot')\
  .withColumnRenamed('Release Year', 'Year')\
    .filter(length(wiki_plot['Plot']) >= 200) # filter out the very short plot descriptions
print('there is a total of ', wiki_plot.count(), ' plot summaries in the plot dataset')
wiki_plot.show(1)

                                                                                

there is a total of  33243  plot summaries in the plot dataset
+--------------------+----+--------------------+
|               Title|Year|                Plot|
+--------------------+----+--------------------+
|Kansas Saloon Sma...|1901|A bartender is wo...|
+--------------------+----+--------------------+
only showing top 1 row



In [10]:
# join the imdb with the plot dataset by matching movie titles and release year
imdb_join_plot = imdb.join(wiki_plot, ["Title", "Year"], 'inner')\
  .withColumnRenamed('tconst', 'id')\
    .select('id', 'Title', 'Plot')\
      .withColumn("Source", lit("wiki_plot"))

print("The joined dataset has ", imdb_join_plot.count(), " entries")

# inspect the joined dataset
imdb_join_plot.show(1)

                                                                                

The joined dataset has  25361  entries



+---------+-----+--------------------+---------+
|       id|Title|                Plot|   Source|
+---------+-----+--------------------+---------+
|tt0790799|$9.99|The film mainly f...|wiki_plot|
+---------+-----+--------------------+---------+
only showing top 1 row
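The join on title and release year can be pictured without Spark. The sketch below is a plain-Python illustration, not the notebook's implementation: the rows, ids, and the `inner_join` helper are made up for the example. It shows why the year is presumably part of the key: two movies that share a title are told apart by their release year.

```python
# Hypothetical rows, invented for illustration only.
imdb_rows = [
    {"tconst": "tt0000001", "Title": "Movie A", "Year": 1999},
    {"tconst": "tt0000002", "Title": "Movie A", "Year": 2010},
    {"tconst": "tt0000003", "Title": "Movie B", "Year": 2005},
]
wiki_rows = [
    {"Title": "Movie A", "Year": 2010, "Plot": "plot of the 2010 movie"},
    {"Title": "Movie C", "Year": 2001, "Plot": "plot with no IMDB match"},
]

def inner_join(left, right, keys):
    """Return merged rows whose values agree on every column in `keys`."""
    # Index the right-hand rows by their key tuple for fast lookup.
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**row, **match})
    return joined

joined = inner_join(imdb_rows, wiki_rows, ["Title", "Year"])
print(joined)
```

Only the 2010 "Movie A" survives the join; joining on the title alone would also attach that plot to the 1999 movie.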



                                                                                

In [11]:
df = imdb_join_plot.union(imdb_join_details)

print('after merging & cleaning, there is a total of ', df.count(), ' movie plot entries left in the merged dataset')

# inspect the combined new dataset
df.show(1)

                                                                                

after merging & cleaning, there is a total of  28218  movie plot entries left in the merged dataset




+---------+-----+--------------------+---------+
|       id|Title|                Plot|   Source|
+---------+-----+--------------------+---------+
|tt0790799|$9.99|The film mainly f...|wiki_plot|
+---------+-----+--------------------+---------+
only showing top 1 row



                                                                                

## Training the Model

In [12]:
# tokenize and remove stop words in this cell
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, Word2Vec

# create a new field by copying Plot
df = df.withColumn('inputText', F.col('Plot')) 

# regular expression tokenizer to tokenize inputText into individual tokens (words)
regextok = RegexTokenizer(gaps = False, pattern = r'\w+', inputCol = 'inputText', outputCol = 'tokens')

# StopWordsRemover to remove stopwords in the list of tokens
stopwrmv = StopWordsRemover(inputCol = 'tokens', outputCol = 'tokens_sw_removed')
df = regextok.transform(df)
df = stopwrmv.transform(df)
df.show(3)

                                                                                

+---------+------------------+--------------------+---------+--------------------+--------------------+--------------------+
|       id|             Title|                Plot|   Source|           inputText|              tokens|   tokens_sw_removed|
+---------+------------------+--------------------+---------+--------------------+--------------------+--------------------+
|tt0790799|             $9.99|The film mainly f...|wiki_plot|The film mainly f...|[the, film, mainl...|[film, mainly, fo...|
|tt2614684|               '71|Gary Hook, a new ...|wiki_plot|Gary Hook, a new ...|[gary, hook, a, n...|[gary, hook, new,...|
|tt0032176|'Til We Meet Again|Total strangers D...|wiki_plot|Total strangers D...|[total, strangers...|[total, strangers...|
+---------+------------------+--------------------+---------+--------------------+--------------------+--------------------+
only showing top 3 rows
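To make the two preprocessing steps concrete, here is a rough plain-Python approximation of what the tokenizer and the stop-word remover do to a single string. It is a sketch under assumptions: the sample sentence and the tiny `STOP_WORDS` set are invented for the example, and Spark's StopWordsRemover uses a much longer default English list.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "is"}

def tokenize(text):
    """Lowercase the text and return every run of word characters."""
    return re.findall(r"\w+", text.lower())

def remove_stop_words(tokens):
    """Drop the tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

sample = "The hero finds a map, and sails to an island."  # made-up sentence
tokens = tokenize(sample)
filtered = remove_stop_words(tokens)
print(tokens)
print(filtered)
```

Matching `\w+` with gaps = False means the pattern describes the tokens themselves rather than the separators, so punctuation is discarded as a side effect.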



In [13]:
# train word2vec model, the parameters here can be changed to optimize the model
word2vec = Word2Vec(vectorSize = 100, minCount = 5, inputCol = 'tokens_sw_removed', outputCol = 'wordvectors')
model = word2vec.fit(df)
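The minCount parameter sets the minimum number of times a token must appear in the corpus to enter the model's vocabulary. The sketch below illustrates that filtering step in plain Python; the helper name `build_vocabulary` and the toy documents are assumptions made for the example, not part of Spark.

```python
from collections import Counter

def build_vocabulary(documents, min_count):
    """Keep the tokens that occur at least `min_count` times across all documents."""
    counts = Counter(token for doc in documents for token in doc)
    return {token for token, n in counts.items() if n >= min_count}

# Toy tokenized "plots", invented for illustration.
docs = [
    ["space", "crew", "wormhole"],
    ["zombie", "outbreak", "crew"],
    ["crew", "space", "rescue"],
]
vocab = build_vocabulary(docs, min_count=2)
print(sorted(vocab))
```

A higher threshold drops rare words, such as unusual character names, which otherwise get poorly trained vectors; a lower one keeps more of the plot-specific vocabulary.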

                                                                                

In [14]:
# using transform to add wordvectors column to dataframe
df = model.transform(df)
chunks = df.select('id', 'Title','wordvectors', 'Plot', 'Source').limit(30000).collect()
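The wordvectors column holds one vector per plot, which Spark's Word2Vec model derives by averaging the vectors of the words in that plot. A minimal sketch of the idea follows, with made-up two-dimensional word vectors. Skipping unknown tokens before averaging is an assumption of this sketch; Spark's exact treatment of out-of-vocabulary tokens may differ.

```python
def average_vector(tokens, word_vectors, size):
    """Average the vectors of the tokens found in `word_vectors`."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No known token: fall back to a zero vector.
        return [0.0] * size
    return [sum(component) / len(known) for component in zip(*known)]

# Hypothetical 2-dimensional word vectors, for illustration only.
toy_vectors = {"space": [1.0, 0.0], "crew": [0.0, 1.0]}
doc_vec = average_vector(["space", "crew", "unknownword"], toy_vectors, size=2)
print(doc_vec)
```

Because every plot is reduced to one fixed-length vector, plots of very different lengths can be compared directly.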

                                                                                

### Advanced movie recommender with Word Vector Arithmetic
Here we can add or subtract elements from the movies that we like. For example, I love the movie Interstellar and I love zombies, so it would be great to find a movie similar to a combination of Interstellar and Resident Evil. To find such a movie, we add the vector of the extra element's plot to the vector of Interstellar's plot. The cells below use the movie 2012 as the extra element.
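Before running this on the real model, the idea can be tried on toy vectors. In the sketch below all vectors and candidate names are invented: one axis stands for "space" and the other for "disaster". Adding the two vectors gives a query that points between them, and the candidate closest to that direction by cosine similarity wins.

```python
import math

def add(v1, v2):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(v1, v2)]

def cosine(v1, v2):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return sum(a * b for a, b in zip(v1, v2)) / (norm1 * norm2)

base_vec = [1.0, 0.0]    # hypothetical "space" plot vector
extra_vec = [0.0, 1.0]   # hypothetical "disaster" plot vector
query = add(base_vec, extra_vec)

# Hypothetical candidates, for illustration only.
candidates = {
    "space only": [1.0, 0.0],
    "disaster only": [0.0, 1.0],
    "space disaster": [1.0, 1.0],
}
best = max(candidates, key=lambda title: cosine(query, candidates[title]))
print(best)
```

Subtracting an element works the same way with a minus sign, which pushes the query away from that direction.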

In [33]:
# writing a function to obtain the plot string from the plot dataset
def acquire_plot(base_movie: str): 
  # input: an exact movie title or a movie id (e.g. tt0000000)
  # output: the movie's plot, or None if the movie is not found

  if base_movie.startswith("tt"):   # search by movie id
    base_movie_row = df.filter(df.id == base_movie).collect()
  else:                             # search by movie title
    base_movie_row = df.filter(df.Title == base_movie).collect()

  if base_movie_row: 
    movie_plot = base_movie_row[0]['Plot']
    return movie_plot
  else: 
    print("Sorry, ", base_movie, " is not found in the database")
    return None

In [34]:
# create extra element and transform it to word vectors
extra_movie = "2012"                           # enter the movie whose plot you wish to add to the base movie
extra_element = acquire_plot(extra_movie)      # look up the plot of that movie
element_df = spark.createDataFrame([(1, extra_element)]).toDF('index','inputText')
element_tok = regextok.transform(element_df)
element_swr = stopwrmv.transform(element_tok)
element_vec = model.transform(element_swr)
element_vec = element_vec.select('wordvectors').collect()[0][0]
element_vec

                                                                                

DenseVector([-0.071, -0.0175, 0.0051, 0.0578, -0.0273, 0.0137, -0.0937, 0.0662, 0.0392, 0.0116, -0.0293, -0.0425, -0.0109, 0.0814, -0.005, -0.0889, 0.0121, 0.125, -0.0171, 0.0197, 0.0546, 0.1209, -0.005, 0.0376, -0.1512, -0.043, 0.0596, 0.131, -0.0744, -0.013, -0.042, 0.0081, -0.0021, -0.0258, 0.0225, -0.0368, 0.0038, -0.0205, -0.0177, -0.0374, 0.1426, -0.0174, -0.0428, 0.0051, -0.0179, -0.0082, -0.0739, 0.0865, 0.0669, 0.0093, -0.0011, 0.0563, 0.0242, -0.0089, -0.0232, -0.0186, 0.0417, 0.0051, -0.0069, 0.0479, 0.029, 0.005, -0.0261, -0.104, -0.0148, 0.0839, 0.0424, -0.0581, 0.0097, 0.0767, -0.0312, -0.0547, -0.0109, -0.0155, -0.016, -0.0165, -0.0101, -0.0455, 0.0216, -0.0352, -0.0055, -0.0456, 0.0119, -0.0189, -0.0476, 0.0368, -0.0039, -0.0475, -0.0112, -0.0173, -0.0705, 0.0967, 0.0369, 0.065, -0.0534, 0.0259, 0.0001, 0.0161, 0.0114, 0.012])

In [35]:
base_movie = "Interstellar"   # enter the movie that you love
# create search query and transform it to word vectors
SEARCH_QUERY = acquire_plot(base_movie)
query_df = spark.createDataFrame([(1, SEARCH_QUERY)]).toDF('index','inputText')
query_tok = regextok.transform(query_df)
query_swr = stopwrmv.transform(query_tok)
query_vec = model.transform(query_swr)
query_vec = query_vec.select('wordvectors').collect()[0][0]
query_vec = query_vec + element_vec
query_vec

                                                                                

DenseVector([-0.1284, -0.0193, -0.0366, 0.1402, -0.0891, -0.0087, -0.1537, 0.088, 0.0369, 0.0179, -0.0451, -0.0296, -0.0393, 0.1159, 0.0216, -0.1689, 0.0372, 0.1925, -0.0528, 0.0453, 0.0902, 0.2886, -0.0181, 0.0868, -0.3125, -0.0935, 0.1206, 0.2614, -0.1437, -0.0137, -0.0856, 0.0152, -0.0007, -0.064, 0.0343, -0.059, 0.0363, -0.0011, -0.0318, -0.11, 0.2378, -0.1037, -0.0946, -0.0308, -0.0561, -0.0161, -0.1514, 0.1581, 0.17, -0.0365, 0.0323, 0.0915, 0.0708, -0.0813, -0.0331, -0.0706, 0.1028, 0.0011, -0.0171, 0.089, 0.0365, 0.0041, -0.016, -0.1871, 0.0166, 0.1351, 0.0507, -0.0841, -0.0444, 0.1798, -0.0667, -0.1086, -0.0817, -0.0125, -0.0242, -0.0104, -0.0624, -0.0631, 0.0192, -0.0867, 0.011, -0.0553, 0.0036, -0.0101, -0.1428, 0.0546, -0.0603, -0.1019, -0.0114, 0.0016, -0.1258, 0.2081, 0.0646, 0.162, -0.0839, 0.055, -0.0341, 0.0047, 0.02, -0.005])

In [36]:
# define function to calculate cosine similarity
import numpy as np
def cossim(v1, v2): 
    '''
        cossim(v1, v2) calculates the cosine similarity between v1 and v2.
        If v1 or v2 is a zero vector, it will return 0
    '''
    if np.dot(v1, v1) == 0 or np.dot(v2, v2) == 0:
        return 0.0
    return float(np.dot(v1, v2) / np.sqrt(np.dot(v1, v1)) / (np.sqrt(np.dot(v2, v2))))

data = [(i[0], float(cossim(query_vec, i[2])), i[1], i[4], i[3]) for i in chunks]

In [37]:
sim_df = spark.createDataFrame(data)\
  .toDF('movie_id', 'similarity', 'Title', 'Source','Plot')\
    .orderBy('similarity', ascending=False)

sim_df.filter((sim_df.Title != base_movie) & (sim_df.Title != extra_movie))\
  .show(10, truncate = False)
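Since the similarity scores are already collected on the driver, the final ranking could also be done in plain Python instead of building a new Spark DataFrame. The sketch below is one possible alternative, not the notebook's method: the `top_k` helper and all scores are made up for illustration.

```python
import heapq

def top_k(scored, k, exclude=()):
    """Return the k highest-scoring (title, similarity) pairs, skipping excluded titles."""
    kept = [(title, sim) for title, sim in scored if title not in exclude]
    return heapq.nlargest(k, kept, key=lambda pair: pair[1])

# Hypothetical similarity scores, invented for illustration.
scored = [
    ("Interstellar", 0.98),
    ("2012", 0.97),
    ("Movie A", 0.91),
    ("Movie B", 0.40),
    ("Movie C", 0.75),
]
recommendations = top_k(scored, 2, exclude={"Interstellar", "2012"})
print(recommendations)
```

This avoids shipping the full plot text back to the executors just to sort a few tens of thousands of rows, which is likely what triggers the large-task warning.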


                                                                                

In [38]:
len("Bekar hai sab kich | url=http://www.imdb.com/title/tt0245977/%7Cdate=September 2015}} ")

86