## Word2vec Model Training
- Using the movie plot and synopsis datasets, train a word2vec model that recommends movies with similar plot lines.

In [1]:
# enter path to the location of the dataset here
"""# Mike's desktop paths 
path_to_imdb_dataset = 'C:/Users/123/OneDrive/Academic/5430/data/title.basics.tsv.gz'
path_to_reviews_dataset = 'C:/Users/123/OneDrive/Academic/5430/data/IMDB_reviews.json'
path_to_plots_dataset = 'C:/Users/123/OneDrive/Academic/5430/data/wiki_movie_plots_deduped.csv'
path_to_details_dataset = 'C:/Users/123/OneDrive/Academic/5430/data/IMDB_movie_details.json' """

# Mike's laptop paths
path_to_imdb_dataset = '/Users/yupan/Library/CloudStorage/OneDrive-Personal/Academic/5430/data/title.basics.tsv.gz'
path_to_reviews_dataset = '/Users/yupan/Library/CloudStorage/OneDrive-Personal/Academic/5430/data/IMDB_reviews.json'
path_to_plots_dataset = '/Users/yupan/Library/CloudStorage/OneDrive-Personal/Academic/5430/data/wiki_movie_plots_deduped.csv'
path_to_details_dataset = '/Users/yupan/Library/CloudStorage/OneDrive-Personal/Academic/5430/data/IMDB_movie_details.json'

Three data sources:
1. IMDB: used for matching movie titles and IDs
2. Spoiler: contains plots and movie IDs, used for training
3. Plot: contains plots and movie names, used for training

The plot_synopsis field holds each movie's plot summary, spoilers included. Since we want to compare movies by plot line, we use this field to train our word2vec model.

To train the model, let's first start a Spark session and load the datasets into Spark dataframes.

In [2]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

23/08/03 19:00:45 WARN Utils: Your hostname, Yus-MacBook-Air-2.local resolves to a loopback address: 127.0.0.1; using 172.19.249.213 instead (on interface en0)
23/08/03 19:00:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/03 19:00:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
# reading the IMDB dataset
imdb = spark.read.options(header = True, inferSchema = True, delimiter = "\t").csv(path_to_imdb_dataset)

# filter the imdb dataset so that only movies are included
imdb = imdb.filter("titleType = 'movie'")\
  .filter("primaryTitle != ''")\
    .select('tconst', 'primaryTitle', 'startYear')\
      .withColumnRenamed('startYear', 'Year')

print('there is a total of ', imdb.count(), ' movies in the imdb dataset')
imdb.show(3)

                                                                                

there is a total of  651281  movies in the imdb dataset
+---------+--------------------+----+
|   tconst|        primaryTitle|Year|
+---------+--------------------+----+
|tt0000009|          Miss Jerry|1894|
|tt0000147|The Corbett-Fitzs...|1897|
|tt0000502|            Bohemios|1905|
+---------+--------------------+----+
only showing top 3 rows



In [4]:
details = spark.read.json(path_to_details_dataset)
details = details.select('movie_id','plot_synopsis')\
  .filter("plot_synopsis != ''")
print('there is a total of ', details.count(), ' plot summaries left in the details dataset')
details.show(3)

there is a total of  1339  plot summaries left in the details dataset
+---------+--------------------+
| movie_id|       plot_synopsis|
+---------+--------------------+
|tt0105112|Jack Ryan (Ford) ...|
|tt1204975|Four boys around ...|
|tt0040897|Fred Dobbs (Humph...|
+---------+--------------------+
only showing top 3 rows



In [5]:
# join the imdb with details by matching the unique identifier(e.g. tt0000000)
imdb_join_details = imdb.join(details, imdb.tconst == details.movie_id, 'inner')\
  .withColumnRenamed('plot_synopsis', 'Plot')\
    .withColumnRenamed('primaryTitle', 'Title')\
      .withColumnRenamed('tconst', 'id')\
        .select('id', 'Title', 'Plot')

print("The joined dataset has ", imdb_join_details.count(), " entries")

# inspect one entry (tconst was renamed to id above, so filter on id)
imdb_join_details.filter("id == 'tt0472062'").show(truncate = False)

                                                                                

The joined dataset has  1324  entries


+---------+--------------------+----+
|id       |Title               |Plot|

                                                                                

In [6]:
# reading the plot dataset
plot = spark.read.options(header = True, inferSchema = True, quote = '"', escape = '"', multiLine = True).csv(path_to_plots_dataset)
plot = plot.select('Title', 'Release Year','Plot').withColumnRenamed('Release Year', 'Year')
print('there is a total of ', plot.count(), ' plot summaries in the plot dataset')
plot.show(3)

                                                                                

there is a total of  34886  plot summaries in the plot dataset
+--------------------+----+--------------------+
|               Title|Year|                Plot|
+--------------------+----+--------------------+
|Kansas Saloon Sma...|1901|A bartender is wo...|
|Love by the Light...|1901|The moon, painted...|
|The Martyred Pres...|1901|The film, just ov...|
+--------------------+----+--------------------+
only showing top 3 rows



In [7]:
# join the imdb with the plot dataset by matching movie titles and release year
imdb_join_plot = imdb.join(plot, [imdb.primaryTitle == plot.Title, imdb.Year == plot.Year], 'inner')\
  .withColumnRenamed('tconst', 'id')\
    .select('id', 'Title', 'Plot')

print("The joined dataset has ", imdb_join_plot.count(), " entries")

# inspect the joined dataset
imdb_join_plot.show(5, truncate=False)

                                                                                

The joined dataset has  26953  entries
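The plot dataset has no IMDB identifier, so the join above keys on the pair (title, year). A minimal pure-Python sketch of that inner join follows; the rows and the helper `inner_join_on_title_year` are hypothetical and only illustrate the matching rule. Plots without an exact title-and-year match are dropped, which would be consistent with the count falling from 34886 plots to 26953 joined entries.

```python
# Hypothetical toy rows standing in for the imdb and plot dataframes.
imdb_rows = [
    {"tconst": "tt0000001", "primaryTitle": "Alpha", "Year": 1950},
    {"tconst": "tt0000002", "primaryTitle": "Alpha", "Year": 1999},
    {"tconst": "tt0000003", "primaryTitle": "Beta", "Year": 2001},
]
plot_rows = [
    {"Title": "Alpha", "Year": 1999, "Plot": "A remake of the 1950 film."},
    {"Title": "Gamma", "Year": 2010, "Plot": "No IMDB match for this one."},
]


def inner_join_on_title_year(imdb_rows, plot_rows):
    """Inner join keyed on (title, year); unmatched rows on either side are dropped."""
    # Index the plots by their composite key so each lookup is O(1).
    plots_by_key = {}
    for p in plot_rows:
        plots_by_key.setdefault((p["Title"], p["Year"]), []).append(p)

    joined = []
    for m in imdb_rows:
        # Every plot sharing the key is paired with the movie, as in a SQL inner join.
        for p in plots_by_key.get((m["primaryTitle"], m["Year"]), []):
            joined.append({"id": m["tconst"], "Title": p["Title"], "Plot": p["Plot"]})
    return joined


joined = inner_join_on_title_year(imdb_rows, plot_rows)
print(joined)
```

Including the year in the key is what keeps the two films titled "Alpha" apart: only the 1999 one receives the plot.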



                                                                                

In [8]:
# stack the two joined datasets; both have the columns (id, Title, Plot) in the same order, which union relies on
df = imdb_join_plot.union(imdb_join_details)

print('after merging & cleaning, there is a total of ', df.count(), ' movie plot entries left in the merged dataset')

# inspect the combined new dataset
df.show(3, truncate = False)

                                                                                

after merging & cleaning, there is a total of  28277  movie plot entries left in the merged dataset
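The merged count equals the sum of the two inputs (26953 + 1324 = 28277) because Spark's `union` behaves like SQL `UNION ALL`: it matches columns by position and keeps duplicate rows. A movie present in both sources therefore appears twice. The sketch below illustrates this with hypothetical `(id, Title, Plot)` tuples and shows one possible way to keep a single row per id; the helper names are illustrative and not part of the notebook.

```python
def union_all(rows_a, rows_b):
    """Concatenate two row lists positionally, keeping duplicates (UNION ALL semantics)."""
    return list(rows_a) + list(rows_b)


def drop_duplicate_ids(rows):
    """Keep only the first row seen for each id (the first tuple element)."""
    seen = set()
    unique = []
    for row in rows:
        if row[0] not in seen:
            seen.add(row[0])
            unique.append(row)
    return unique


# Hypothetical rows: tt1 has a plot in both sources.
from_plot = [("tt1", "A", "plot a")]
from_details = [("tt1", "A", "synopsis a"), ("tt2", "B", "synopsis b")]

merged = union_all(from_plot, from_details)
deduped = drop_duplicate_ids(merged)
print(len(merged), len(deduped))
```

In PySpark, `df.dropDuplicates(['id'])` would play a similar role. Whether to keep both plots of a movie as extra training text is a design choice this notebook leaves open.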



                                                                                

## Training the Model

In [9]:
# tokenize and remove stop words in this cell
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, Word2Vec

# create a new field by copying Plot
df = df.withColumn('inputText', F.col('Plot')) 

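# regular expression tokenizer: with gaps = False the pattern matches the tokens themselves (runs of word characters)
# a raw string keeps Python from treating \w as an escape sequence
regextok = RegexTokenizer(gaps = False, pattern = r'\w+', inputCol = 'inputText', outputCol = 'tokens')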

# StopWordsRemover to remove stopwords in the list of tokens
stopwrmv = StopWordsRemover(inputCol = 'tokens', outputCol = 'tokens_sw_removed')
df = regextok.transform(df)
df = stopwrmv.transform(df)
df.show(3)


+---------+------------------+--------------------+--------------------+--------------------+--------------------+
|       id|             Title|                Plot|           inputText|              tokens|   tokens_sw_removed|
+---------+------------------+--------------------+--------------------+--------------------+--------------------+
|tt0790799|             $9.99|The film mainly f...|The film mainly f...|[the, film, mainl...|[film, mainly, fo...|
|tt2614684|               '71|Gary Hook, a new ...|Gary Hook, a new ...|[gary, hook, a, n...|[gary, hook, new,...|
|tt0032176|'Til We Meet Again|Total strangers D...|Total strangers D...|[total, strangers...|[total, strangers...|
+---------+------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows
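To make the two preprocessing steps concrete, here is a rough pure-Python equivalent. It assumes the defaults of `RegexTokenizer` (lowercasing before matching) and uses a tiny hypothetical stop word list in place of Spark's much longer default English list; the input sentence is made up.

```python
import re

# Hypothetical, deliberately short stop word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and"}


def tokenize(text):
    """Lowercase the text, then return every run of ASCII word characters as a token."""
    # re.ASCII restricts \w to [a-zA-Z0-9_], matching Java's default meaning of \w.
    return re.findall(r"\w+", text.lower(), flags=re.ASCII)


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word set."""
    return [t for t in tokens if t not in stop_words]


tokens = tokenize("Gary Hook, a new recruit")
filtered = remove_stop_words(tokens)
print(tokens)
print(filtered)
```

Punctuation disappears because it never matches `\w+`, and the article "a" is removed in the second step, mirroring the difference between the tokens and tokens_sw_removed columns above.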



                                                                                

In [10]:
# train word2vec model, the parameters here can be changed to optimize the model
word2vec = Word2Vec(vectorSize = 100, minCount = 5, inputCol = 'tokens_sw_removed', outputCol = 'wordvectors')
model = word2vec.fit(df)

23/08/03 19:04:10 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
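Of the parameters above, `minCount = 5` controls the vocabulary: a token must occur at least five times across the corpus to receive a vector, while `vectorSize = 100` sets the length of each vector. The sketch below shows only the vocabulary filter, on hypothetical token lists; `build_vocabulary` is an illustrative helper, not a Spark function.

```python
from collections import Counter


def build_vocabulary(token_lists, min_count=5):
    """Return the set of tokens that occur at least min_count times across all documents."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {word for word, count in counts.items() if count >= min_count}


# Hypothetical tokenized plots.
docs = [["ship", "iceberg"], ["ship"], ["ship", "ocean"]]
vocab = build_vocabulary(docs, min_count=2)
print(vocab)
```

Raising the threshold removes rare words such as character names, which shrinks the model but also discards words that may distinguish one plot from another.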
                                                                                

In [11]:
# use transform to add the wordvectors column (one document vector per plot) to the dataframe
df = model.transform(df)
chunks = df.select('id', 'Title','wordvectors', 'Plot').limit(30000).collect()
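Spark's `Word2VecModel.transform` turns each document into a single vector by averaging the vectors of its words, so every collected row carries one vector per plot. The ranking step is not shown at this point; one common approach, assumed here, is to score each movie by the cosine similarity between its vector and the query's vector. The sketch uses hypothetical 2-dimensional word vectors and illustrative helper names.

```python
import math


def document_vector(tokens, word_vectors, size):
    """Average the vectors of the tokens; out-of-vocabulary tokens contribute nothing."""
    total = [0.0] * size
    if not tokens:
        return total
    for tok in tokens:
        vec = word_vectors.get(tok)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return [t / len(tokens) for t in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is the zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query_vector, rows, top_k=3):
    """Return the titles of the top_k rows most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), title) for title, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_k]]


# Hypothetical word vectors and (title, vector) rows standing in for chunks.
word_vectors = {"space": [1.0, 0.0], "ship": [0.8, 0.2], "love": [0.0, 1.0]}
rows = [
    ("Space Movie", document_vector(["space", "ship"], word_vectors, 2)),
    ("Romance Movie", document_vector(["love"], word_vectors, 2)),
]
query = document_vector(["space"], word_vectors, 2)
top = recommend(query, rows, top_k=1)
print(top)
```

Cosine similarity ignores vector length, so the exact normalization used in the averaging step does not change the ranking.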

                                                                                

In [26]:
# experiment with new queries using different movies' plots; this worked very well!

# Interstellar_plot = """In the mid-21st century, crop blights and dust storms threaten humanity's survival. Joseph Cooper, a widowed engineer and former NASA pilot, runs a farm with his father-in-law Donald, son Tom, and daughter Murphy. Living in a post-truth society, Cooper is reprimanded for telling Murphy that the Apollo missions were not fake; he encourages her to carefully observe and record what she sees. They discover that dust patterns, which Murphy first attributes to a ghost, result from gravity variations, and translate into geographic coordinates. These lead them to a secret NASA facility headed by Cooper's former supervisor, Professor John Brand, who explains that 48 years earlier a wormhole appeared near Saturn, opening a path to a distant galaxy with twelve potentially habitable planets located near a black hole named Gargantua. Volunteers had previously traveled through the wormhole to evaluate the planets, with Miller, Edmunds, and Mann reporting back desirable results. Brand explains he has conceived two plans to ensure humanity’s survival - Plan A involves developing a gravitational propulsion theory, allowing a mass exodus from Earth, while Plan B is a conventional launch of the Endurance spacecraft with 5,000 frozen embryos to colonize a habitable planet. Cooper is recruited to pilot the Endurance. When Murphy refuses to see him off, he leaves her his wristwatch to compare their relative time when he returns.
# The crew consists of Cooper, the robots TARS and CASE, and the scientists Dr. Amelia Brand, (Professor Brand's daughter), Romilly, and Doyle. After traversing the wormhole, Cooper, Doyle, and Brand use a lander to investigate Miller's planet, where time is severely dilated. After landing in knee-high water and finding only wreckage from Miller's expedition, a gigantic tidal wave kills Doyle and waterlogs the lander's engines. By the time the engines restart, 23 years have elapsed in terms of Earth time.
# Having enough fuel for only one of the other two planets, Cooper rules they go to Mann's, as he is still broadcasting. En route, they receive messages from Earth. Murphy Cooper is now a scientist working on Plan A. On his deathbed, Professor Brand reveals to her that Plan B was his only real plan, knowing that Plan A required observations of gravitational singularities from within a black hole.
# On Mann's planet, the Endurance crew revive Mann from cryostasis. He assures them colonization is possible, despite an extreme environment. On an excursion, Mann attempts to kill Cooper and reveals that he falsified the data in the hope of being rescued. He steals Cooper's lander and heads for the Endurance. While a booby trap set by Mann kills Romilly, Brand rescues Cooper with the other lander and they race to the Endurance. Mann is killed in a failed manual docking operation, severely damaging the Endurance. Through a difficult docking maneuver, Cooper regains control.
# With insufficient fuel, they resort to a slingshot around Gargantua which costs them another 51 years. In the process, Cooper and TARS must jettison their landers to allow Brand and CASE to reach Edmunds' planet. Slipping past the event horizon of Gargantua, they eject from their craft and find themselves in a tesseract, possibly constructed by humans of the far future. Across time, Cooper can see through the bookcases of Murphy's old room on Earth and weakly interact with its gravity. Realizing that he is now Murphy's ""ghost"", he manipulates the second hand of the wristwatch he gave her before he left, transmitting via Morse code the quantum data that TARS collected from inside the event horizon.
# The tesseract, its purpose completed, collapses and ejects Cooper and TARS. Cooper wakes on a huge station, orbiting Saturn. He reunites with Murphy, now an old woman nearing death. Using the quantum data, she was able to develop the gravitational propulsion theory, enabling humanity’s exodus and transformation into an advanced spacefaring civilization. She reminds Cooper that Amelia Brand is out there alone. Cooper and TARS take a spacecraft to rejoin Brand and CASE, who are setting up a human colony on Edmunds' habitable planet."""
# 
# Titanic_plot = "In 1996, treasure hunter Brock Lovett and his team aboard the research vessel Akademik Mstislav Keldysh search the wreck of RMS Titanic for a necklace with a rare diamond, the Heart of the Ocean. They recover a safe containing a drawing of a young woman wearing only the necklace dated April 14, 1912, the day the ship struck the iceberg.[Note 1] Rose Dawson Calvert, the woman in the drawing, is brought aboard Keldysh and tells Lovett of her experiences aboard Titanic.\nIn 1912 Southampton, 17-year-old first-class passenger Rose DeWitt Bukater, her fiancé Cal Hockley, and her mother Ruth board the luxurious Titanic. Ruth emphasizes that Rose's marriage will resolve their family's financial problems and retain their high-class persona. Distraught over the engagement, Rose considers suicide by jumping from the stern; Jack Dawson, a penniless artist, intervenes and discourages her. Discovered with Jack, Rose tells a concerned Cal that she was peering over the edge and Jack saved her from falling. When Cal becomes indifferent, she suggests to him that Jack deserves a reward. He invites Jack to dine with them in first class the following night. Jack and Rose develop a tentative friendship, despite Cal and Ruth being wary of him. Following dinner, Rose secretly joins Jack at a party in third class.\nAware of Cal and Ruth's disapproval, Rose rebuffs Jack's advances, but realizes she prefers him over Cal. After rendezvousing on the bow at sunset, Rose takes Jack to her state room; at her request, Jack sketches Rose posing nude wearing Cal's engagement present, the Heart of the Ocean necklace. They evade Cal's bodyguard, Mr. Lovejoy, and have sex in an automobile inside the cargo hold. On the forward deck, they witness a collision with an iceberg and overhear the officers and designer discussing its seriousness.\nCal discovers Jack's sketch of Rose and an insulting note from her in his safe along with the necklace. 
When Jack and Rose attempt to inform Cal of the collision, Lovejoy slips the necklace into Jack's pocket and he and Cal accuse him of theft. Jack is arrested, taken to the master-at-arms' office, and handcuffed to a pipe. Cal puts the necklace in his own coat pocket.\nWith the ship sinking, Rose flees Cal and her mother, who has boarded a lifeboat, and frees Jack. On the boat deck, Cal and Jack encourage her to board a lifeboat; Cal claims he can get himself and Jack off safely. After Rose boards one, Cal tells Jack the arrangement is only for himself. As her boat lowers, Rose decides that she cannot leave Jack and jumps back on board. Cal takes his bodyguard's pistol and chases Rose and Jack into the flooding first-class dining saloon. After using up his ammunition, Cal realizes he gave his coat and consequently the necklace to Rose. He later boards a collapsible lifeboat by carrying a lost child.\nAfter braving several obstacles, Jack and Rose return to the boat deck. The lifeboats have departed and passengers are falling to their deaths as the stern rises out of the water. The ship breaks in half, lifting the stern into the air. Jack and Rose ride it into the ocean and he helps her onto a wooden panel buoyant enough for only one person. He assures her that she will die an old woman, warm in her bed. Jack dies of hypothermia[8] but Rose is saved.\nWith Rose hiding from Cal en route, the RMS Carpathia takes the survivors to New York City where Rose gives her name as Rose Dawson. Rose says she later read that Cal committed suicide after losing all his money in the Wall Street Crash of 1929.\nBack in the present, Lovett decides to abandon his search after hearing Rose's story. Alone on the stern of Keldysh, Rose takes out the Heart of the Ocean — in her possession all along — and drops it into the sea over the wreck site. 
While she is seemingly asleep or has died in her bed,[9] photos on her dresser depict a life of freedom and adventure inspired by the life she wanted to live with Jack. A young Rose reunites with Jack at the Titanic's Grand Staircase, applauded by those who died."
# 
# Furious7_plot = """In the opening scene set in London, England, Deckard Shaw (Jason Statham) stands by his brother Owen's (Luke Evans) bedside as he lays in a coma, badly scarred and crippled after being ejected from the plane (in the last Fast and Furious film). Shaw promises to his brother that he will settle his score. He leaves the room and out into the rest of the hospital, with bodies everywhere. The building continues to burn and crumble around him.Meanwhile, Dom (Vin DIesel) drives Letty (Michelle Rodriguez) to a racetrack in the California desert where hundreds of people from their neighborhood gather for Race Wars, something that Dom and Letty invented when they were younger. Letty goes up for the race and flies past her opponent as his car breaks down on the track. All the patrons cheer her on after she crosses the finish line, followed by Iggy Azalea showing up out of nowhere to congratulate Letty. The excitement from the others is too overwhelming for Letty, and she takes the car and drives away. Dom later finds her that night at the cemetery, staring at her own tombstone. Dom takes a sledgehammer to smash it, but Letty stops him because she thinks the person she used to be is no longer who she is to Dom, and she doesn't want to hurt him for that. She bids Dom goodbye.Brian (Paul Walker) is adjusting to life as a minivan-driving dad, as his son Jack is now old enough for school. Even his wife Mia (Jordana Brewster), Dom's sister, acknowledges that he has had trouble settling down this way.That evening, Hobbs (Dwayne Johnson) continues to do some overnight work while Elena (Elsa Pataky) is getting ready to go out. Hobbs hands her a letter of recommendation that she asked him for. He wishes her luck in her pursuits. Hobbs then sees Shaw in his office hacking into his computer. Hobbs attempts to arrest him as he is gathering the information on the crew that took down his brother. 
Shaw battles Hobbs through the whole floor, smashing each other through glass walls and coffee tables, while Hobbs manages to get a good chokeslam down on Shaw. Elena returns for back-up as Shaw gets a grenade out. He tosses it towards the detectives, forcing Hobbs to run to Elena as the grenade explodes, sending them both out the window where they land Hobbs-first onto a van. Elena is unharmed. In the office, it is shown that Shaw was looking for Han (Sung Kang).Dom visits Mia at her home as Brian is getting Jack ready for school. Outside their house is a large package. Mia tells Dom that she is having another child, but she hasn't told Brian for fear of how he'd react to more changes. Dom gets a call from Shaw, listening to his message after killing Han (from the end of the previous film). Dom realizes there's trouble, and he grabs Mia as the package explodes massively, destroying the whole house. Jack is safe in the minivan, though his parents rush to him in a panic.Dom follows Elena to Hobbs' hospital room. He tells Dom who Shaw is, including his history in the Special Forces, where he was turned into a human killing machine. Hobbs asks Dom to promise him he will take Shaw down for good. Dom agrees. Meanwhile, Brian sends Mia and their son to hide out in the Dominican Republic until this thing with Shaw is taken care of. He promises to Mia that he will return as long as they are safe.Dom flies to Tokyo to bring Han back for a proper burial. He meets with Sean Boswell (Lucas Black), Han's friend (from the third film Fast and Furious 3: Toyko Drift). After the two have their race, Sean gives him the only things they could find from Han's car: a picture of Gisele and a cross necklace.Back in California, the gang gathers for Han's funeral, joined by Tej (Ludacris) and Roman (Tyrese Gibson). Dom spots a car suspiciously driving near the funeral. He follows after it, learning it is Shaw. 
They crash into each other in a tunnel, and they briefly fight until a team of agents come in, giving Shaw a chance to escape. Their leader, a shadowy agent known only as "Mr. Nobody" (Kurt Russell), brings Dom with him and gathers Brian, Tej, Roman, and finally Letty to work together on a mission. A hacker known only as Ramsey has been captured by a terrorist leader named Mose Jakande (Djimon Hounsou), because he is pursuing something Ramsey helped develop called God's Eye, which is a surveillance system that can spot anybody from anywhere in the world. Using this, Dom can locate Shaw. The guys come up with a plan to infiltrate the bus that is carrying Ramsey, while Dom asks Tej to help him put armor on one car.The plan involves the five dropping from a jet in their cars and carefully land close to their target. Roman gets cold feet, prompting Tej to pull the chute out on him and sucking him out of the plane. The other four land close to the bus and break in. Jakande's men shoot at the team. Brian hops on the bus and fights off the guards. He finds Ramsey... a young woman (Nathalie Emmanuel) in her cell and has her jump off the bus and onto the hood of Dom's car for safety. He then dukes it out with a minion named Kiet (Tony Jaa), who is a tough fighter. In the madness, the driver of the bus is accidentally shot. Kiet locks Brian on the bus as he gets out. The bus slides toward the edge of a cliff. Brian climbs out and manages to run off the bus and onto Letty's car as she arrives in time to grab him. Meanwhile, Dom and Ramsey encounter Shaw, leading them both through the woods. Roman appears and knocks Shaw off the road. However, Jakande and his team find Dom. Before they can get him, he drives his car off a cliff, yet he and Ramsey miraculously survive.The team revives Ramsey, who tells them that she gave God's Eye to a friend of hers in Abu Dhabi. They all travel there and meet this person, Safar (Ali Fazal), who says he sold God's Eye to a prince. 
The team goes undercover to a party that the prince is throwing. Dom and Brian find that God's Eye is in a car. Letty ends up fighting three guards and the prince's chief bodyguard Kara (Ronda Rousey). Kara alerts the guards that there are intruders, keeping Tej and Ramsey out of their systems. Dom and Brian drive the car out of there before the gates shut the place down. Shaw comes out and tries to shoot at Dom, until he drives out of the building and through the next one. Dom discovers that the breaks are out, forcing him and Brian to jump to the next building. The two jump out of the car and pull God's Eye out before the car slides out and crashes to the ground below.With God's Eye, the team learns that Shaw is hiding out in an abandoned factory outside the city. After tracking him there, they see that he has a lot of back-up from Jakande. The team evades gunfire, when Mr. Nobody gets shot in the chaos. Dom carries him out of there, and Jakande gets his hands on God's Eye. Mr. Nobody calls for medical assistance, and tells Dom that he will be leaving him from there.The team knows that it's time to end it with Shaw once and for all. They decide to take the fight back to the streets of their hometown in Los Angeles. Brian calls Mia to tell her he loves her in case he doesn't make it back. Mia tells him that they're having a little girl. Brian then promises to come back to her.Tej and Roman take Ramsey with them while they try and hack God's Eye to prevent Jakande from finding Ramsey. Dom finds Shaw and lures him to a parking lot where they have their final showdown. They have a street fight, dueling with wrenches and pipes. Meanwhile, Jakande, in an armored Stealth Black Hawk helicopter, sends a flying armored drone aircraft to find Ramsey. They end up shooting down an electrical tower, which catches Hobbs' attention after watching it on the news. 
He announces "Daddy's gotta go to work" and breaks off the cast on his arm, and then gears up.Ramsey gets switched under the bridge and goes with Lettty while Brian tries to find a new spot to hack God's Eye. He encounters Kiet again and kills him when he hooks him up to a weight and pushes him down an elevator shaft. The drone chases after Letty and Ramsey, nearly getting them until Hobbs rides in on an ambulance and destroys it by crashing into it with the ambulance off the freeway ramp. Hobbs exits the totaled ambulance, shaken but still alive.They succeed in the hack, and Jakande is furious. He and his men locate Dom and Shaw still fighting on the roof of the parking garage. They shoot a missile at the lot, causing the ground to break beneath Shaw's feet. Dom stomps on the concrete and drops Shaw through the lot. Jakande continues shooting at Dom as he drives away, but Hobbs shoots back. Dom grabs a bag of grenades and drives close enough to stick them onto Jakande's chopper. Hobbs shoots at the bag, destroying the chopper and Jakande. Dom crashes his car and is pulled out by Brian and Letty. Letty begs him to stay alive and says she remembers that they got married in the Dominican Republic. That's where he gave her the cross necklace that Han had. Dom awakens and kisses Letty.Shaw is locked up for good in a maximum security black site prison. He threatens to break out, though Hobbs doubts that it will ever happen.The team watches Brian and Mia play with Jack on the beach. They realize this is where he belongs, and they look at them lovingly. Dom gets up to leave. Ramsey asks if he's gonna say goodbye. Dom says, "It's never goodbye." He drives away, only to get caught up with Brian on the road. They look at each other with a smile.We hear Dom's voice say that they both lived life at a quarter mile, and that's why they're brothers. This is cut between scenes of Brian through the whole series and everything he and Dom have been through. 
Dom says Brian will always be his brother. The two continue driving until they finally part ways at a fork in the road, with Brian driving into the sunset.The film closes with the text, "For Paul"."""

### Advanced movie recommender with Word Vector Arithmetic
Here we can add or subtract elements from the movies that we like. For example, I love the movie Interstellar and I love zombies, so it would be great to find a movie that resembles a combination of Interstellar and Resident Evil. To find such a movie, we add the plot vector of a zombie movie (Resident Evil: The Final Chapter) to the plot vector of Interstellar.
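
The arithmetic itself can be sketched with plain NumPy on toy vectors. The helper name `combine` and its `weight` parameter are hypothetical and not part of this notebook, and the three-dimensional vectors are made-up stand-ins for the plot vectors that the word2vec model produces.

```python
import numpy as np

def combine(base_vec, element_vec, weight=1.0):
    """Add (weight > 0) or subtract (weight < 0) an element from a base vector."""
    base = np.asarray(base_vec, dtype=float)
    element = np.asarray(element_vec, dtype=float)
    return base + weight * element

# Toy stand-ins for the plot vectors of two movies (made-up numbers).
space_plot = np.array([1.0, 0.0, 0.0])
zombie_plot = np.array([0.0, 1.0, 0.0])

added = combine(space_plot, zombie_plot, weight=1.0)        # "space + zombies"
subtracted = combine(space_plot, zombie_plot, weight=-1.0)  # "space - zombies"
```

A weight between 0 and 1 would keep the base movie dominant while still nudging the query toward the extra element.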

In [64]:
# here I am trying to use the actual plot of the movies as an input, but I haven't completed the code
def acquire_plot(base_movie: str): 
  """Return the plot for a movie id (starting with "tt") or a movie title; None if not found."""
  if base_movie.startswith("tt"):   # search by movie id
    base_movie_row = df.filter(df.id == base_movie).collect()
  else:                             # search by movie name
    base_movie_row = df.filter(df.Title == base_movie).collect()

  if base_movie_row: 
    movie_plot = base_movie_row[0]['Plot']
    return movie_plot
  else: 
    print("Sorry, ", base_movie, " is not found in the database")
    return None
    
base_movie = acquire_plot("Interstellar")  # enter the movie that you love
print(base_movie)



In the mid-21st century, crop blights and dust storms threaten humanity's survival. Joseph Cooper, a widowed engineer and former NASA pilot, runs a farm with his father-in-law Donald, son Tom, and daughter Murphy. Living in a post-truth society, Cooper is reprimanded for telling Murphy that the Apollo missions were not fake; he encourages her to carefully observe and record what she sees. They discover that dust patterns, which Murphy first attributes to a ghost, result from gravity variations, and translate into geographic coordinates. These lead them to a secret NASA facility headed by Cooper's former supervisor, Professor John Brand, who explains that 48 years earlier a wormhole appeared near Saturn, opening a path to a distant galaxy with twelve potentially habitable planets located near a black hole named Gargantua. Volunteers had previously traveled through the wormhole to evaluate the planets, with Miller, Edmunds, and Mann reporting back desirable results. Brand explains he has

                                                                                

In [65]:
# experiment with word vector arithmetic
# create extra element and transform it to word vectors
extra_element = acquire_plot("Resident Evil: The Final Chapter")      # enter the element you wish to add or subtract from the movie
element_df = spark.createDataFrame([(1, extra_element)]).toDF('index','inputText')
element_tok = regextok.transform(element_df)
element_swr = stopwrmv.transform(element_tok)
element_vec = model.transform(element_swr)
element_vec = element_vec.select('wordvectors').collect()[0][0]
element_vec

                                                                                

DenseVector([-0.0125, -0.0248, 0.0072, -0.0221, -0.0018, 0.0967, -0.0404, -0.0131, -0.0187, 0.099, -0.0923, -0.0037, -0.0613, 0.001, 0.0088, 0.0426, -0.041, 0.0001, 0.0982, 0.0589, -0.0453, -0.0067, -0.1337, -0.037, 0.0664, 0.0126, 0.0177, -0.0226, 0.0136, 0.0644, 0.0489, 0.033, -0.0452, 0.0013, 0.0462, -0.0588, -0.0765, 0.0191, -0.0316, -0.1311, -0.0479, -0.0721, -0.0276, 0.1405, -0.0281, 0.0368, 0.0094, 0.0447, 0.0052, -0.0775, -0.0824, -0.0035, 0.0831, -0.1041, -0.0293, 0.0908, 0.0432, 0.0709, 0.0716, -0.0561, -0.0338, -0.0216, -0.0584, -0.0482, -0.0322, 0.0135, -0.0259, -0.0031, 0.0775, -0.0397, -0.11, -0.0452, -0.0637, 0.1082, 0.0217, -0.0621, -0.014, 0.0363, -0.0811, -0.0045, 0.0175, -0.0077, -0.0657, 0.0496, -0.0774, -0.0288, 0.1014, 0.0797, 0.0886, -0.1616, -0.0412, -0.0028, -0.0451, 0.0919, 0.0192, 0.0206, -0.0686, -0.0443, 0.008, 0.0089])
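
For intuition about what a vector like the one above represents: Spark ML's `Word2VecModel.transform` turns a document into a single vector by averaging the vectors of its words. Below is a minimal NumPy sketch of that averaging. The `toy_embeddings` dictionary and the `document_vector` helper are hypothetical, the two-dimensional numbers are made up, and treating unknown words as zero vectors is an assumption of this sketch rather than a statement about Spark's internals.

```python
import numpy as np

# Made-up 2-d embeddings; the real model learns its vectors from the plots.
toy_embeddings = {
    "space": np.array([1.0, 0.0]),
    "zombie": np.array([0.0, 1.0]),
}

def document_vector(tokens, embeddings, dim=2):
    """Average the word vectors of `tokens`; unknown words count as zeros (assumption)."""
    if not tokens:
        return np.zeros(dim)
    total = np.zeros(dim)
    for tok in tokens:
        total += embeddings.get(tok, np.zeros(dim))
    return total / len(tokens)

doc_vec = document_vector(["space", "zombie", "space", "unknownword"], toy_embeddings)
```

Because the result is an average, a long plot and a short plot end up as vectors of the same dimension, which is what makes them comparable with cosine similarity.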

In [66]:
# create search query and transform it to word vectors
SEARCH_QUERY = base_movie
query_df = spark.createDataFrame([(1, SEARCH_QUERY)]).toDF('index','inputText')
query_tok = regextok.transform(query_df)
query_swr = stopwrmv.transform(query_tok)
query_vec = model.transform(query_swr)
query_vec = query_vec.select('wordvectors').collect()[0][0]
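# NOTE: this subtracts the extra element from the base movie; use `+` instead
# to add it (e.g. Interstellar + zombies)
query_vec = query_vec - element_vec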
query_vec

DenseVector([-0.0177, 0.0224, 0.0069, 0.0062, -0.0417, -0.0544, 0.0451, 0.0324, 0.0063, 0.0258, 0.0305, 0.0022, 0.0104, -0.0374, -0.0075, -0.0197, 0.0356, -0.0242, 0.05, -0.0501, 0.0289, 0.0025, 0.0472, 0.0154, 0.0324, -0.0109, 0.0435, -0.0312, -0.0316, -0.0171, 0.001, -0.0066, 0.0013, 0.0045, -0.0327, 0.0408, -0.0015, 0.0441, -0.0012, 0.0704, 0.0068, 0.0518, -0.0382, 0.026, 0.0826, 0.0313, 0.0132, 0.0043, 0.0037, 0.0112, -0.0234, -0.0006, -0.002, 0.0371, 0.0408, -0.0193, 0.0034, -0.0084, 0.0021, 0.0563, -0.0063, 0.0375, -0.0099, 0.0137, -0.0014, 0.0029, 0.0175, -0.0116, -0.0332, 0.0316, 0.0937, -0.0414, 0.113, -0.0371, 0.0421, -0.0274, -0.0052, -0.0264, 0.0471, 0.0043, 0.0001, -0.0093, 0.0382, 0.0004, -0.0078, -0.0047, -0.0368, -0.0793, -0.0172, 0.0215, -0.0201, 0.0144, 0.0514, -0.073, 0.0234, 0.0128, 0.0119, -0.0023, -0.0162, -0.0311])

In [67]:
# define function to calculate cosine similarity
import numpy as np
def cossim(v1, v2): 
    '''
        cossim(v1, v2) calculates the cosine similarity between v1 and v2.
        If v1 or v2 is a zero vector, it will return 0
    '''
    if np.dot(v1, v1) == 0 or np.dot(v2, v2) == 0:
        return 0.0
    return float(np.dot(v1, v2) / np.sqrt(np.dot(v1, v1)) / (np.sqrt(np.dot(v2, v2))))

data = [(i[0], float(cossim(query_vec, i[2])), i[1], i[3]) for i in chunks]
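
The list comprehension above calls `cossim` once per movie. As an alternative sketch (not the method used in this notebook), the same ranking can be computed in one vectorized NumPy step. The function `rank_by_cosine` and the toy matrix are hypothetical; rows or queries with zero norm are scored 0, mirroring the zero-vector rule in `cossim`.

```python
import numpy as np

def rank_by_cosine(query, matrix, top_k=10):
    """Return (row indices, similarities) of the top_k rows most similar to query."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Product of the norms for each row; zero means the cosine is undefined.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    sims = np.zeros(len(matrix))
    nonzero = norms > 0
    sims[nonzero] = matrix[nonzero] @ query / norms[nonzero]
    order = np.argsort(-sims)[:top_k]  # highest similarity first
    return order, sims[order]

# Toy movie vectors, one per row (made-up numbers).
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
idx, scores = rank_by_cosine([1.0, 0.0], toy_vectors, top_k=2)
```

Here `idx` holds row positions, so mapping them back to movie ids and titles would still require the original list of movies.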

In [68]:
sim_df = spark.createDataFrame(data).toDF('movie_id', 'similarity', 'Title', 'Plot').orderBy('similarity', ascending=False)
sim_df.show(10, truncate = False)

23/08/03 20:35:54 WARN TaskSetManager: Stage 200 contains a task of very large size (6703 KiB). The maximum recommended task size is 1000 KiB.

+---------+-------------------+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|movie_id |similarity         |Title                       |Plot                                                                                                                                                                            

                                                                                