##CUNY - Data 612

#Final Project
##Violeta Stoyanova
##Peter Kowalchuk

#Introduction

“For those of us fortunate enough to be healthy and working from home during the outbreak of COVID-19, the possibilities for entertainment can feel both extremely confined (specifically, to our homes) and overwhelmingly endless (where in our Netflix queues to begin?) (Cohen, 2020).” Now more than ever, in our socially distanced lives, we depend on movie recommendations to fill our quarantine days with entertainment. Some rely on movie experts; others look at ratings to pick flicks and cure some of the overwhelming boredom. For our final project we will therefore attempt to build a recommender system that uses ‘implicit’ movie data to recommend new movies to users.

#Objective

The goal of this project is to build various models that attempt to recommend new movies to users based on ‘implicit’ data. We will be using a dataset from kaggle.com that contains descriptions of 34,886 movies from around the world. This dataset contains a summary of the plot of each movie. The goal is to use this plot to generate recommendations based on the words present in the plot. A first implementation of the recommender will provide a recommendation based on a word or list of words provided. One can imagine how a more complex implementation of this approach could allow the user to write a plot of a movie she/he would like to watch. The system would then provide recommendations based on this plot by using the list of words contained in it as input to the recommender system presented in this project.

#Methodology

For the purposes of this project we will utilize the speed and scalability of Apache Spark via Databricks. The first step will be to tokenize the data, so that we are able to extract some weight with the use of the ‘genre’ variable. We will first remove stop words, then use TF-IDF to weight word frequencies and build a utility matrix. With this approach words will be equivalent to users in a conventional recommender, with the word weight provided by TF-IDF representing the ratings for each movie. Afterwards, using the techniques we have learned and applied in class, such as item-based and user-based collaborative filtering (IBCF and UBCF, respectively) and ALS (Alternating Least Squares), we will build recommender systems that recommend movies to users.
We believe that proper evaluation is important to determine how accurate our algorithms are at recommending movies, so we will use the root mean square error (RMSE) metric.
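
As a minimal sketch of the RMSE metric named above, assuming plain Python lists of observed and predicted ratings (the numbers and the function name `rmse` are illustrative and not taken from the project):

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical TF-IDF "ratings" and model predictions, for illustration only
observed = [3.0, 5.0, 4.0]
estimated = [2.0, 5.0, 6.0]
score = rmse(observed, estimated)
```

A lower score means the predicted weights are closer to the observed ones; the value is in the same units as the ratings themselves.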

#Data

The dataset contains descriptions of 34,886 movies from around the world. Column descriptions are listed below:
  - Release Year - Year in which the movie was released
  - Title - Movie title
  - Origin/Ethnicity - Origin of movie (i.e. American, Bollywood, Tamil, etc.)
  - Director - Director(s)
  - Cast - Main actor and actresses
  - Genre - Movie Genre(s)
  - Wiki Page - URL of the Wikipedia page from which the plot description was scraped
  - Plot - Long form description of movie plot (WARNING: May contain spoilers!!!)

In [3]:
# File location and type
file_location = "/FileStore/tables/wiki_movie_plots_deduped.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7
Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Smashers,"A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.[1]"
1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Light_of_the_Moon,"The moon, painted with a smiling face hangs over a park at night. A young couple walking past a fence learn on a railing and look up. The moon smiles. They embrace, and the moon's smile gets bigger. They then sit down on a bench by a tree. The moon's view is blocked, causing him to frown. In the last scene, the man fans the woman with his hat because the moon has left the sky and is perched over her shoulder to see everything better."
1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Presidents,"The film, just over a minute long, is composed of two shots. In the first, a girl sits at the base of an altar or tomb, her face hidden from the camera. At the center of the altar, a viewing portal displays the portraits of three U.S. Presidents—Abraham Lincoln, James A. Garfield, and William McKinley—each victims of assassination."
In the second shot,which runs just over eight seconds long,"an assassin kneels feet of Lady Justice.""",,,,,
1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_the_Grizzly_King","""Lasting just 61 seconds and consisting of two shots, the first shot is set in a wood during winter. The actor representing then vice-president Theodore Roosevelt enthusiastically hurries down a hillside towards a tree in the foreground. He falls once, but rights himself and cocks his rifle. Two other men, bearing signs reading """"His Photographer"""" and """"His Press Agent"""" respectively"
1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Beanstalk_(1902_film),"The earliest known adaptation of the classic fairytale, this films shows Jack trading his cow for the beans, his mother forcing him to drop them in the front yard, and beig forced upstairs. As he sleeps, Jack is visited by a fairy who shows him glimpses of what will await him when he ascends the bean stalk. In this version, Jack is the son of a deposed king. When Jack wakes up, he finds the beanstalk has grown and he climbs to the top where he enters the giant's home. The giant finds Jack, who narrowly escapes. The giant chases Jack down the bean stalk, but Jack is able to cut it down before the giant can get to safety. He falls and is killed as Jack celebrates. The fairy then reveals that Jack may return home as a prince."
1903,Alice in Wonderland,American,Cecil Hepworth,May Clark,unknown,https://en.wikipedia.org/wiki/Alice_in_Wonderland_(1903_film),"""Alice follows a large white rabbit down a """"Rabbit-hole"""". She finds a tiny door. When she finds a bottle labeled """"Drink me"""""
She enters a kitchen,"in which there is a cook and a woman holding a baby. She persuades the woman to give her the child and takes the infant outside after the cook starts throwing things around. The baby then turns into a pig and squirms out of her grip. """"The Duchess's Cheshire Cat"""" appears and disappears a couple of times to Alice and directs her to the Mad Hatter's """"Mad Tea-Party."""" After a while",she leaves.,,,,,
"The Queen invites Alice to join the """"ROYAL PROCESSION"""": a parade of marching playing cards and others headed by the White Rabbit. When Alice """"unintentionally offends the Queen""""","the latter summons the """"Executioner"""". Alice """"boxes the ears""""","then flees when all the playing cards come for her. Then she wakes up and realizes it was all a dream.""",,,,,


We load the data and check how many movie plots we have.

In [5]:
df.count()
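
Some rows in the loaded table (for example the one beginning "In the second shot") look like fragments of plot text rather than movies, which suggests that plots containing line breaks were split across rows and that such fragments are included in the count. The sketch below uses only Python's standard `csv` module and a made-up two-movie sample to show the difference between splitting on line breaks and quote-aware parsing. In Spark, the CSV reader's `multiLine` and `escape` options target the same situation; whether they fully repair this particular file is an assumption that is not tested here.

```python
import csv
import io

# Hypothetical two-record sample: the first plot contains an embedded newline
# and doubled quotes, mimicking the rows that appear split in the loaded table.
raw = (
    'Title,Plot\n'
    '"Movie A","First shot.\nSecond shot shows ""Justice""."\n'
    '"Movie B","A single-line plot."\n'
)

# Naive line splitting breaks the first record into two fragments.
naive_rows = raw.strip().split("\n")

# A quote-aware parser keeps the embedded newline inside the field.
parsed_rows = list(csv.reader(io.StringIO(raw)))
```

Here `naive_rows` has one entry more than `parsed_rows`, because the quote-aware parser treats the line break inside the quoted plot as part of the field.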

We clean the dataframe to include only the columns we need and drop all rows with missing values, in particular rows without a movie plot.

In [7]:
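# keep only the title, genre and plot columns, renamed from Spark's default _cN names
df=df.selectExpr("_c1 as Title","_c5 as Genre","_c7 as Plot")
# the CSV was read without a header, so remove the row that holds the column names
df = df.filter(df.Title != "Title")
# drop rows with a missing title, genre or plot
df = df.dropna()
display(df)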

Title,Genre,Plot
Kansas Saloon Smashers,unknown,"A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.[1]"
Love by the Light of the Moon,unknown,"The moon, painted with a smiling face hangs over a park at night. A young couple walking past a fence learn on a railing and look up. The moon smiles. They embrace, and the moon's smile gets bigger. They then sit down on a bench by a tree. The moon's view is blocked, causing him to frown. In the last scene, the man fans the woman with his hat because the moon has left the sky and is perched over her shoulder to see everything better."
The Martyred Presidents,unknown,"The film, just over a minute long, is composed of two shots. In the first, a girl sits at the base of an altar or tomb, her face hidden from the camera. At the center of the altar, a viewing portal displays the portraits of three U.S. Presidents—Abraham Lincoln, James A. Garfield, and William McKinley—each victims of assassination."
"Terrible Teddy, the Grizzly King",unknown,"""Lasting just 61 seconds and consisting of two shots, the first shot is set in a wood during winter. The actor representing then vice-president Theodore Roosevelt enthusiastically hurries down a hillside towards a tree in the foreground. He falls once, but rights himself and cocks his rifle. Two other men, bearing signs reading """"His Photographer"""" and """"His Press Agent"""" respectively"
Jack and the Beanstalk,unknown,"The earliest known adaptation of the classic fairytale, this films shows Jack trading his cow for the beans, his mother forcing him to drop them in the front yard, and beig forced upstairs. As he sleeps, Jack is visited by a fairy who shows him glimpses of what will await him when he ascends the bean stalk. In this version, Jack is the son of a deposed king. When Jack wakes up, he finds the beanstalk has grown and he climbs to the top where he enters the giant's home. The giant finds Jack, who narrowly escapes. The giant chases Jack down the bean stalk, but Jack is able to cut it down before the giant can get to safety. He falls and is killed as Jack celebrates. The fairy then reveals that Jack may return home as a prince."
Alice in Wonderland,unknown,"""Alice follows a large white rabbit down a """"Rabbit-hole"""". She finds a tiny door. When she finds a bottle labeled """"Drink me"""""
The Great Train Robbery,western,"The film opens with two bandits breaking into a railroad telegraph office, where they force the operator at gunpoint to have a train stopped and to transmit orders for the engineer to fill the locomotive's tender at the station's water tank. They then knock the operator out and tie him up. As the train stops it is boarded by the bandits‍—‌now four. Two bandits enter an express car, kill a messenger and open a box of valuables with dynamite; the others kill the fireman and force the engineer to halt the train and disconnect the locomotive. The bandits then force the passengers off the train and rifle them for their belongings. One passenger tries to escape but is instantly shot down. Carrying their loot, the bandits escape in the locomotive, later stopping in a valley where their horses had been left."
The Suburbanite,comedy,"The film is about a family who move to the suburbs, hoping for a quiet life. Things start to go wrong, and the wife gets violent and starts throwing crockery, leading to her arrest."
The Little Train Robbery,unknown,"""The opening scene shows the interior of the robbers' den. The walls are decorated with the portraits of notorious criminals and pictures illustrating the exploits of famous bandits. Some of the gang are lounging about, while others are reading novels and illustrated papers. Although of youthful appearance, each is dressed like a typical Western desperado. The """"Bandit Queen"
where the remainder of the gang have been waiting for them. Believing they have successfully eluded their pursuers,abandoning everything,and are obliged to take chances on foot. The police now get in sight of the fleeing robbers and a lively chase follows through tall weeds


Now we have a reduced dataframe with only the title, genre, and plot of each movie.

In [9]:
df.count()

#TF-IDF

In [11]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StopWordsRemover, CountVectorizer

In [12]:
#tokenize
tokenizer = Tokenizer(inputCol="Plot", outputCol="words")
wordsData = tokenizer.transform(df)

#remove stop words
remover = StopWordsRemover(inputCol="words",outputCol="noStopWords")
wordsData = remover.transform(wordsData)
wordsData = wordsData.drop("words")
wordsData=wordsData.selectExpr("Title as Title","Genre as Genre","noStopWords as words")
display(wordsData)

Title,Genre,words
Kansas Saloon Smashers,unknown,"List(bartender, working, saloon,, serving, drinks, customers., fills, stereotypically, irish, man's, bucket, beer,, carrie, nation, followers, burst, inside., assault, irish, man,, pulling, hat, eyes, dumping, beer, head., group, begin, wrecking, bar,, smashing, fixtures,, mirrors,, breaking, cash, register., bartender, sprays, seltzer, water, nation's, face, group, policemen, appear, order, everybody, leave.[1])"
Love by the Light of the Moon,unknown,"List(moon,, painted, smiling, face, hangs, park, night., young, couple, walking, past, fence, learn, railing, look, up., moon, smiles., embrace,, moon's, smile, gets, bigger., sit, bench, tree., moon's, view, blocked,, causing, frown., last, scene,, man, fans, woman, hat, moon, left, sky, perched, shoulder, see, everything, better.)"
The Martyred Presidents,unknown,"List(film,, minute, long,, composed, two, shots., first,, girl, sits, base, altar, tomb,, face, hidden, camera., center, altar,, viewing, portal, displays, portraits, three, u.s., presidents—abraham, lincoln,, james, a., garfield,, william, mckinley—each, victims, assassination.)"
"Terrible Teddy, the Grizzly King",unknown,"List(""lasting, 61, seconds, consisting, two, shots,, first, shot, set, wood, winter., actor, representing, vice-president, theodore, roosevelt, enthusiastically, hurries, hillside, towards, tree, foreground., falls, once,, rights, cocks, rifle., two, men,, bearing, signs, reading, """"his, photographer"""", """"his, press, agent"""", respectively)"
Jack and the Beanstalk,unknown,"List(earliest, known, adaptation, classic, fairytale,, films, shows, jack, trading, cow, beans,, mother, forcing, drop, front, yard,, beig, forced, upstairs., sleeps,, jack, visited, fairy, shows, glimpses, await, ascends, bean, stalk., version,, jack, son, deposed, king., jack, wakes, up,, finds, beanstalk, grown, climbs, top, enters, giant's, home., giant, finds, jack,, narrowly, escapes., giant, chases, jack, bean, stalk,, jack, able, cut, giant, get, safety., falls, killed, jack, celebrates., fairy, reveals, jack, may, return, home, prince.)"
Alice in Wonderland,unknown,"List(""alice, follows, large, white, rabbit, """"rabbit-hole""""., finds, tiny, door., finds, bottle, labeled, """"drink, me"""")"
The Great Train Robbery,western,"List(film, opens, two, bandits, breaking, railroad, telegraph, office,, force, operator, gunpoint, train, stopped, transmit, orders, engineer, fill, locomotive's, tender, station's, water, tank., knock, operator, tie, up., train, stops, boarded, bandits‍—‌now, four., two, bandits, enter, express, car,, kill, messenger, open, box, valuables, dynamite;, others, kill, fireman, force, engineer, halt, train, disconnect, locomotive., bandits, force, passengers, train, rifle, belongings., one, passenger, tries, escape, instantly, shot, down., carrying, loot,, bandits, escape, locomotive,, later, stopping, valley, horses, left.)"
The Suburbanite,comedy,"List(film, family, move, suburbs,, hoping, quiet, life., things, start, go, wrong,, wife, gets, violent, starts, throwing, crockery,, leading, arrest.)"
The Little Train Robbery,unknown,"List(""the, opening, scene, shows, interior, robbers', den., walls, decorated, portraits, notorious, criminals, pictures, illustrating, exploits, famous, bandits., gang, lounging, about,, others, reading, novels, illustrated, papers., although, youthful, appearance,, dressed, like, typical, western, desperado., """"bandit, queen)"
where the remainder of the gang have been waiting for them. Believing they have successfully eluded their pursuers,abandoning everything,"List(, obliged, take, chances, foot., police, get, sight, fleeing, robbers, lively, chase, follows, tall, weeds)"
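
The tokens above keep their punctuation ("saloon," and "customers."), so "beer," and "beer" end up as different vocabulary entries. A minimal plain-Python sketch of punctuation-aware tokenization follows; the stop-word set is a tiny illustrative subset and the function name `tokenize` is hypothetical. In Spark ML, `RegexTokenizer` is the analogous component, though the right pattern for this dataset is left as an assumption.

```python
import re

# Tiny illustrative stop-word list (an assumption; Spark's default list is much longer)
STOP_WORDS = {"a", "is", "at", "to", "the", "with", "his", "he", "after"}

def tokenize(text):
    """Lowercase, keep runs of letters, digits and apostrophes, drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

sample = "A bartender is working at a saloon, serving drinks to customers."
tokens = tokenize(sample)
```

With this approach "saloon," and "saloon" map to the same token, which should reduce the vocabulary size and concentrate the counts that TF-IDF works with.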


In [13]:
#TF
#hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
#featurizedData = hashingTF.transform(wordsData)
cv = CountVectorizer(inputCol="words", outputCol="rawFeatures", minDF=10.0)
model = cv.fit(wordsData)
vocabulary=model.vocabulary #we will use this to map back our words when using the recommender
featurizedData = model.transform(wordsData)

#IDF
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)

#TF-IDF
rescaledData = idfModel.transform(featurizedData)

display(rescaledData.select("Title", "words","features"))

Title,words,features
Kansas Saloon Smashers,"List(bartender, working, saloon,, serving, drinks, customers., fills, stereotypically, irish, man's, bucket, beer,, carrie, nation, followers, burst, inside., assault, irish, man,, pulling, hat, eyes, dumping, beer, head., group, begin, wrecking, bar,, smashing, fixtures,, mirrors,, breaking, cash, register., bartender, sprays, seltzer, water, nation's, face, group, policemen, appear, order, everybody, leave.[1])","List(0, 20749, List(80, 117, 121, 293, 355, 369, 763, 1090, 1163, 1277, 1325, 1407, 1918, 2075, 2135, 2856, 3241, 3376, 3490, 3854, 4343, 4647, 4884, 5092, 6735, 7041, 7345, 7922, 8225, 10926, 11738, 12676, 13125, 13421, 14515, 14968, 16205), List(7.666965822220131, 3.9812690352710836, 3.9943713421682725, 4.600636453296795, 4.775870808207695, 4.729504887649769, 5.3939419623272835, 5.614729799949034, 5.675354421765468, 5.7310433276260575, 5.771498282318371, 5.828117176317879, 6.132112824261183, 12.531288433771412, 6.221519411976768, 6.5020329949499365, 6.690085226452876, 6.66735697537532, 6.701646048853952, 6.8252600048211285, 6.9364856399313535, 7.11294207727291, 7.078455901201741, 7.243970339679314, 7.467113890993524, 7.788697515120986, 7.545075432463236, 7.6902574423077334, 15.31897156728196, 8.02049912917831, 8.02049912917831, 8.111470907384037, 8.16026107155347, 8.21155436594102, 8.322780001051244, 8.322780001051244, 8.5169360154922))"
Love by the Light of the Moon,"List(moon,, painted, smiling, face, hangs, park, night., young, couple, walking, past, fence, learn, railing, look, up., moon, smiles., embrace,, moon's, smile, gets, bigger., sit, bench, tree., moon's, view, blocked,, causing, frown., last, scene,, man, fans, woman, hat, moon, left, sky, perched, shoulder, see, everything, better.)","List(0, 20749, List(6, 8, 17, 46, 107, 119, 159, 174, 355, 380, 414, 452, 473, 524, 667, 933, 1079, 1527, 1955, 2625, 2640, 3739, 4055, 4106, 5092, 5765, 7524, 8071, 8137, 8369, 8408, 9134, 9244, 10018, 11112, 17202, 18182), List(2.7893905123237235, 2.9216937094112065, 3.145965816625726, 3.4837008894787047, 3.9615564940071257, 3.9982578608575534, 4.174987604385731, 4.279728733216694, 4.775870808207695, 4.775870808207695, 4.82271371696777, 4.886897061401198, 4.917668720067952, 5.033500535593074, 5.205350792519733, 5.448883080358584, 5.696407830963301, 5.888135186044132, 12.531288433771412, 6.419794896712965, 6.419794896712965, 6.773966710433578, 6.893313468066144, 6.981606075211823, 7.243970339679314, 7.2641730469968335, 7.518407185381074, 7.6902574423077334, 7.6006452836180465, 7.722006140622314, 7.629632820491299, 7.7547959634453045, 7.860156479103131, 7.860156479103131, 8.064950891749143, 17.516196144618178, 8.591043987645923))"
The Martyred Presidents,"List(film,, minute, long,, composed, two, shots., first,, girl, sits, base, altar, tomb,, face, hidden, camera., center, altar,, viewing, portal, displays, portraits, three, u.s., presidents—abraham, lincoln,, james, a., garfield,, william, mckinley—each, victims, assassination.)","List(0, 20749, List(3, 42, 47, 355, 438, 478, 641, 855, 888, 1380, 1530, 2163, 2257, 3083, 3086, 5311, 5596, 6005, 6600, 6812, 7406, 8655, 11771, 13966, 14126), List(2.62430570754077, 3.45388900505641, 3.522623695427876, 4.775870808207695, 5.020428454025721, 5.011807710981814, 5.15457846714631, 5.531975838920122, 5.438965643701239, 5.857675978559423, 5.898497973079678, 6.243338459371408, 6.311806258648868, 6.581282366604078, 6.64513383859061, 7.130641654372311, 7.185701431555338, 7.2641730469968335, 7.467113890993524, 7.418323726824092, 7.6006452836180465, 7.6902574423077334, 8.064950891749143, 8.265621587211296, 8.265621587211296))"
"Terrible Teddy, the Grizzly King","List(""lasting, 61, seconds, consisting, two, shots,, first, shot, set, wood, winter., actor, representing, vice-president, theodore, roosevelt, enthusiastically, hurries, hillside, towards, tree, foreground., falls, once,, rights, cocks, rifle., two, men,, bearing, signs, reading, """"his, photographer"""", """"his, press, agent"""", respectively)","List(0, 20749, List(3, 36, 38, 81, 327, 333, 1047, 1187, 1700, 1717, 1905, 2902, 2922, 3396, 5468, 6345, 7004, 8122, 9860, 10273, 11322, 12452, 13329, 14818, 16482, 18585, 20430), List(5.24861141508154, 3.278975449485062, 3.352558288155292, 3.7559838279447675, 4.679020982217777, 4.703944390670234, 5.610815900627897, 5.667055618950774, 6.0933983120804935, 6.044005556750917, 6.145358051011204, 6.571025866436888, 6.511602445966087, 6.66735697537532, 7.167009298543186, 7.4424212784031525, 7.4424212784031525, 7.6006452836180465, 7.823788834932256, 8.265621587211296, 7.977939514759514, 8.21155436594102, 8.21155436594102, 8.322780001051244, 8.447943144005249, 8.67108669531946, 8.758098072309089))"
Jack and the Beanstalk,"List(earliest, known, adaptation, classic, fairytale,, films, shows, jack, trading, cow, beans,, mother, forcing, drop, front, yard,, beig, forced, upstairs., sleeps,, jack, visited, fairy, shows, glimpses, await, ascends, bean, stalk., version,, jack, son, deposed, king., jack, wakes, up,, finds, beanstalk, grown, climbs, top, enters, giant's, home., giant, finds, jack,, narrowly, escapes., giant, chases, jack, bean, stalk,, jack, able, cut, giant, get, safety., falls, killed, jack, celebrates., fairy, reveals, jack, may, return, home, prince.)","List(0, 20749, List(9, 18, 20, 24, 30, 36, 60, 122, 193, 195, 198, 209, 257, 259, 263, 348, 411, 446, 630, 915, 1119, 1140, 1194, 1261, 1455, 1583, 2218, 2388, 2416, 2805, 3482, 3785, 4095, 4849, 5309, 5780, 5836, 5997, 6355, 7278, 7422, 7746, 11678, 12235, 15552, 16919, 17559, 19887), List(2.9214283518403246, 6.146564421739759, 3.17460176352739, 3.2699119433317154, 3.2446693261441073, 3.278975449485062, 3.6048064778113105, 4.00294171016998, 4.297428310316095, 4.265384224960293, 4.278697273610032, 37.823104579382914, 9.042719974491549, 4.5135065437402035, 4.52399156771183, 4.734371077300942, 4.820939093609401, 4.88500491324916, 5.144726170703298, 5.432408243155079, 5.642564598942478, 5.6345324272452135, 5.683722671435985, 5.748821573647342, 17.872488941524903, 5.924884728252874, 6.273191422521089, 6.296180940745788, 6.351972300374204, 6.4738621179832405, 6.725176546264146, 6.8252600048211285, 14.191100669122081, 7.078455901201741, 7.130641654372311, 7.327351948618365, 7.371803711189199, 7.394793229413898, 7.327351948618365, 7.492431698977813, 7.492431698977813, 7.7547959634453045, 16.42310873188204, 8.064950891749143, 8.383404622867678, 8.67108669531946, 8.591043987645923, 8.758098072309089))"
Alice in Wonderland,"List(""alice, follows, large, white, rabbit, """"rabbit-hole""""., finds, tiny, door., finds, bottle, labeled, """"drink, me"""")","List(0, 20749, List(18, 220, 330, 394, 2729, 3245, 3596, 3690, 15316), List(6.146564421739759, 4.364771882381275, 4.6836470506065595, 4.973908438390828, 6.446463143795126, 6.725176546264146, 7.11294207727291, 6.865533903959069, 8.383404622867678))"
The Great Train Robbery,"List(film, opens, two, bandits, breaking, railroad, telegraph, office,, force, operator, gunpoint, train, stopped, transmit, orders, engineer, fill, locomotive's, tender, station's, water, tank., knock, operator, tie, up., train, stops, boarded, bandits‍—‌now, four., two, bandits, enter, express, car,, kill, messenger, open, box, valuables, dynamite;, others, kill, fireman, force, engineer, halt, train, disconnect, locomotive., bandits, force, passengers, train, rifle, belongings., one, passenger, tries, escape, instantly, shot, down., carrying, loot,, bandits, escape, locomotive,, later, stopping, valley, horses, left.)","List(0, 20749, List(1, 3, 5, 41, 54, 68, 167, 217, 236, 327, 376, 390, 648, 662, 667, 756, 763, 829, 1062, 1114, 1277, 1342, 1454, 1475, 1698, 1897, 2236, 2331, 2496, 2653, 3598, 3701, 3758, 4025, 4061, 4134, 4688, 5003, 5103, 5858, 6803, 7468, 7840, 10402, 10675, 11846, 12749, 14739, 15884, 16342, 18776), List(2.3838479058623605, 5.24861141508154, 2.7528651099791954, 3.3951001492564363, 3.503922598990978, 7.48163647098833, 8.408442361417098, 4.348058401407534, 18.38551243045687, 4.679020982217777, 4.774177327701362, 14.378895724700984, 5.159541256488439, 5.17205706442027, 5.205350792519733, 5.272670956619181, 5.3939419623272835, 5.350858376190971, 5.57249703632576, 5.610815900627897, 5.7310433276260575, 5.935637520029135, 5.908969272946974, 11.828492660095636, 6.0143297886048, 6.18618004553146, 6.250718566669031, 6.3438089897350425, 6.368501602325414, 6.464645462878316, 6.879327226091404, 6.865533903959069, 6.85192825190329, 6.921886840510201, 28.382201338244162, 13.932677206162069, 6.997110261747788, 7.078455901201741, 7.130641654372311, 7.2641730469968335, 7.394793229413898, 7.518407185381074, 7.6006452836180465, 7.937117520239259, 7.937117520239259, 8.111470907384037, 8.111470907384037, 8.322780001051244, 8.591043987645923, 8.447943144005249, 8.67108669531946))"
The Suburbanite,"List(film, family, move, suburbs,, hoping, quiet, life., things, start, go, wrong,, wife, gets, violent, starts, throwing, crockery,, leading, arrest.)","List(0, 20749, List(5, 11, 16, 17, 44, 70, 132, 229, 279, 342, 367, 908, 1123, 1840, 2484, 4474, 10661), List(2.7528651099791954, 3.0579585889313896, 3.1027421915583626, 3.145965816625726, 3.410990541591621, 3.737212462356666, 4.048567870996755, 4.447909261254391, 4.571201952721744, 4.708687482566247, 4.727888072422864, 5.432408243155079, 5.650661809175097, 6.068397009875076, 6.351972300374204, 6.966338603081034, 7.937117520239259))"
The Little Train Robbery,"List(""the, opening, scene, shows, interior, robbers', den., walls, decorated, portraits, notorious, criminals, pictures, illustrating, exploits, famous, bandits., gang, lounging, about,, others, reading, novels, illustrated, papers., although, youthful, appearance,, dressed, like, typical, western, desperado., """"bandit, queen)","List(0, 20749, List(100, 148, 192, 257, 270, 413, 530, 662, 737, 966, 1229, 1497, 1521, 1717, 2249, 2756, 2857, 6280, 6742, 7220, 7514, 8729, 9588, 11739, 12066, 14126, 14548, 18975), List(3.888467917279288, 4.281794849654165, 4.2313809490589005, 4.521359987245774, 4.5359201385771035, 4.833428105180176, 5.0247668556243195, 5.17205706442027, 5.500001534287607, 5.479239542839178, 5.776095991567001, 5.867726314412924, 5.919551382277511, 6.044005556750917, 6.273191422521089, 6.483164510645554, 6.5020329949499365, 7.327351948618365, 7.418323726824092, 7.467113890993524, 7.518407185381074, 7.6902574423077334, 7.788697515120986, 8.02049912917831, 8.064950891749143, 8.265621587211296, 8.322780001051244, 8.67108669531946))"
where the remainder of the gang have been waiting for them. Believing they have successfully eluded their pursuers,"List(, obliged, take, chances, foot., police, get, sight, fleeing, robbers, lively, chase, follows, tall, weeds)","List(0, 20749, List(0, 9, 13, 28, 220, 988, 1646, 2897, 2953, 5020, 5152, 9212, 11424), List(0.6622776660192281, 2.9214283518403246, 3.167111091798233, 3.2107922126947006, 4.364771882381275, 5.517638675773714, 5.996938045892931, 6.492554250995393, 6.63420476805842, 7.185701431555338, 7.0955503345610405, 7.7547959634453045, 8.02049912917831))"
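
Each `features` entry above is a sparse vector: the vocabulary size (20749), the indices of the words present in the plot, and their TF-IDF weights. The sketch below reproduces the weighting on three made-up plots in plain Python, using the smoothed IDF form log((N + 1) / (df + 1)) documented for Spark ML's `IDF`; the movie names and words are hypothetical.

```python
import math
from collections import Counter

# Three tiny hypothetical "plots", already tokenized
docs = {
    "Movie A": ["train", "bandits", "train"],
    "Movie B": ["train", "moon"],
    "Movie C": ["moon", "fairy"],
}

n_docs = len(docs)
# Document frequency: number of plots that contain each word
doc_freq = Counter(word for words in docs.values() for word in set(words))

def idf(word):
    """Smoothed inverse document frequency, log((N + 1) / (df + 1))."""
    return math.log((n_docs + 1) / (doc_freq[word] + 1))

# Utility-matrix entries as (word, movie, weight) triples:
# words play the role of users and movies the role of items
triples = [
    (word, title, count * idf(word))
    for title, words in docs.items()
    for word, count in Counter(words).items()
]
```

A word that appears twice in a plot gets twice the IDF value, which matches the pattern in the output above where some weights are exact multiples of others.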


#pySpark dataframe

Now that we have weights for all the words in all the movies, we need to prepare a dataframe for the PySpark ALS recommender. The dataframe for ALS has three columns: User, Item, and Rating. In our case words represent the users, movies are the items, and the TF-IDF weights are the ratings. So we use the output from TF-IDF to build the dataframe with the three required columns.

To do this we use Koalas, which Databricks introduced in 2019. Koalas implements the pandas DataFrame API on top of Apache Spark. We use it because of the ease of use of pandas.

In [15]:
import numpy as np
import pandas as pd
import databricks.koalas as ks

We start by converting the output from TF-IDF into a Koalas dataframe

In [17]:
KrescaledData=ks.DataFrame(rescaledData)
display(KrescaledData.head(10))

Unnamed: 0,Title,Genre,words,rawFeatures,features
0,Kansas Saloon Smashers,unknown,"[bartender, working, saloon,, serving, drinks,...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,Love by the Light of the Moon,unknown,"[moon,, painted, smiling, face, hangs, park, n...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.7893905123237..."
2,The Martyred Presidents,unknown,"[film,, minute, long,, composed, two, shots., ...","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 2.62430570754077, 0.0, 0.0, 0...."
3,"Terrible Teddy, the Grizzly King",unknown,"[""lasting, 61, seconds, consisting, two, shots...","(0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 5.24861141508154, 0.0, 0.0, 0...."
4,Jack and the Beanstalk,unknown,"[earliest, known, adaptation, classic, fairyta...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
5,Alice in Wonderland,unknown,"[""alice, follows, large, white, rabbit, """"rabb...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
6,The Great Train Robbery,western,"[film, opens, two, bandits, breaking, railroad...","(0.0, 1.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...","(0.0, 2.3838479058623605, 0.0, 5.2486114150815..."
7,The Suburbanite,comedy,"[film, family, move, suburbs,, hoping, quiet, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 2.7528651099791954, ..."
8,The Little Train Robbery,unknown,"[""the, opening, scene, shows, interior, robber...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
9,where the remainder of the gang have been wai...,abandoning everything,"[, obliged, take, chances, foot., police, get,...","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.6622776660192281, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


Just as we will use the vocabulary to map word indices back to words, we create a list to map movie indices back to titles.

In [19]:
movieList = KrescaledData['Title'].tolist()

We add the first movie to start the PySpark dataframe we need for ALS.

In [21]:
thisMovie=KrescaledData['features'][0].toArray()

#get words for this movie
words=np.nonzero(thisMovie)

#get weights for the words in this movie
weights=thisMovie[thisMovie!=0]

#create an initial Pandas dataframe with the words and their weights for the first movie
matrixData = pd.DataFrame()
matrixData['word']= words[0]
matrixData['rating']=weights
matrixData['movie']=0

#convert Pandas dataframe to Koalas to make use of spark going forward
matrix=ks.DataFrame(matrixData)
#matrix

Now that we have a dataframe with one movie, we can add all the other movies in our dataset by following the same steps.

In [23]:
import pandas as pd

from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType

In [24]:
from multiprocessing.pool import ThreadPool

In [25]:
# allow up to 5 concurrent threads
pool = ThreadPool(5)

# starting movie index of each batch
parameters = [1,7]

# build the (word, rating, movie) rows for the 5 movies starting at index `cluster`
def mllib_random_forest(cluster,movies):
  thisMovie=KrescaledData['features'][cluster].toArray()
  #get words for this movie
  words=np.nonzero(thisMovie)
  #get weights for the words in this movie
  weights=thisMovie[thisMovie!=0]
  #create an initial Pandas dataframe with the words and their weights for the first movie of the batch
  matrixData = pd.DataFrame()
  matrixData['word']= words[0]
  matrixData['rating']=weights
  #label the rows with this movie's own index
  matrixData['movie']=cluster
  #convert Pandas dataframe to Koalas to make use of spark going forward
  matrix=ks.DataFrame(matrixData)  
  #the first movie of the batch is already in the dataframe, so continue with the next one
  for i in range(cluster+1,cluster+5):
    thisMovie=movies[i].toArray()
    #get words for this movie
    words=np.nonzero(thisMovie)
    #get weights for the words in this movie
    weights=thisMovie[thisMovie!=0]
    matrixData = pd.DataFrame()
    matrixData['word']= words[0]
    matrixData['rating']=weights
    matrixData['movie']=i
    matrix=matrix.append(ks.DataFrame(matrixData))
  return [matrix]
  
# run the tasks; the result is a list with one entry per starting index
pool.map(lambda cluster: mllib_random_forest(cluster,KrescaledData['features']), parameters)
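
`pool.map` returns a list with one entry per starting index, so the per-batch tables have to be captured and concatenated before they can be used further. The following is a minimal sketch of that pattern with plain pandas and a toy dense matrix. The names `toy_features`, `build_batch` and `batch_starts` are hypothetical stand-ins, not objects from this notebook, and the data is invented.

```python
from multiprocessing.pool import ThreadPool

import numpy as np
import pandas as pd

# Toy stand-in for the TF-IDF vectors: 6 movies x 4 vocabulary words (invented data).
toy_features = np.array([
    [0.0, 1.5, 0.0, 2.0],
    [0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.1, 0.0],
    [1.2, 0.0, 0.0, 0.4],
    [0.0, 2.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.9],
])

def build_batch(start, size=3):
    """Return (word, rating, movie) rows for movies start .. start+size-1."""
    frames = []
    for i in range(start, min(start + size, len(toy_features))):
        vec = toy_features[i]
        idx = np.nonzero(vec)[0]  # positions of the words present in movie i
        frames.append(pd.DataFrame({"word": idx, "rating": vec[idx], "movie": i}))
    return pd.concat(frames, ignore_index=True)

batch_starts = [0, 3]
with ThreadPool(2) as pool:
    # one DataFrame per starting index, in the same order as batch_starts
    batches = pool.map(build_batch, batch_starts)

# stitch the batches into a single long-format table
toy_matrix = pd.concat(batches, ignore_index=True)
print(toy_matrix)
```

Keeping the return value of `pool.map` is the important part: without it the work done in the threads is thrown away.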
 

In [26]:
for i in range(1,500):
  thisMovie=KrescaledData['features'][i].toArray()
  #get words for this movie
  words=np.nonzero(thisMovie)
  #get weights for the words in this movie
  weights=thisMovie[thisMovie!=0]
  matrixData = pd.DataFrame()
  matrixData['word']= words[0]
  matrixData['rating']=weights
  matrixData['movie']=i
  matrix=matrix.append(ks.DataFrame(matrixData))
#matrix
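
Appending one small dataframe per movie gets slow as the number of movies grows. As a sketch of an alternative, assuming the TF-IDF vectors fit in memory as one dense array, `np.nonzero` on the 2-D array yields every (movie, word) pair in a single pass. The array `tfidf` below is made-up data, not the notebook's.

```python
import numpy as np
import pandas as pd

# Hypothetical dense TF-IDF matrix: rows are movies, columns are vocabulary words.
tfidf = np.array([
    [0.0, 2.79, 0.0],
    [2.62, 0.0, 0.0],
    [5.25, 0.0, 1.10],
])

# Row and column indices of every non-zero weight, in row-major order.
movie_idx, word_idx = np.nonzero(tfidf)

# Long format: one row per (word, movie) pair, weight stored as the rating.
long_matrix = pd.DataFrame({
    "word": word_idx,
    "rating": tfidf[movie_idx, word_idx],
    "movie": movie_idx,
})
print(long_matrix)
```

The resulting pandas frame could then be wrapped with `ks.DataFrame(...)` in the same way as in the cells above.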

Convert the result from a Koalas dataframe to a Spark dataframe

In [28]:
df_matrix=matrix.to_spark()
df_matrix.count()

Split data into training and test

In [30]:
(training,test)=df_matrix.randomSplit([0.8,0.2])

#print('Training: {0}, test: {1}\n'.format(
#  training.count(), test.count())
#)
#training.show(3)
#test.show(3)
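
A random split can leave some words or movies with no rows in the training set. ALS learns no factors for those ids, and with Spark's default `coldStartStrategy` of `"nan"` their predictions come back as NaN. The small pandas sketch below shows how such rows could be spotted. The frames `train_pdf` and `test_pdf` are invented for illustration.

```python
import pandas as pd

# Invented training and test rows in the same (word, movie, rating) layout.
train_pdf = pd.DataFrame({
    "word": [6, 8, 119, 6],
    "movie": [1, 1, 1, 3],
    "rating": [2.8, 2.9, 4.0, 2.8],
})
test_pdf = pd.DataFrame({
    "word": [6, 9001, 8],
    "movie": [1, 42, 3],
    "rating": [2.8, 5.2, 2.9],
})

# A test row is predictable only if both its word and its movie were seen in training.
known_word = test_pdf["word"].isin(train_pdf["word"])
known_movie = test_pdf["movie"].isin(train_pdf["movie"])
cold_rows = test_pdf[~(known_word & known_movie)]
print(cold_rows)
```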

#Model

Train an ALS model. The parameter grid can hold several values of the rank K; here it holds a single value for each parameter.

In [32]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import TrainValidationSplit,ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

In [33]:
als = ALS()

als.setUserCol('word')\
   .setItemCol('movie')

param_grid=ParamGridBuilder()\
            .addGrid(als.rank,[8])\
            .addGrid(als.maxIter,[10])\
            .addGrid(als.regParam,[.1])\
            .build()

evaluator=RegressionEvaluator(metricName="rmse",labelCol="rating",predictionCol="prediction")

tvs=TrainValidationSplit(estimator=als,estimatorParamMaps=param_grid,evaluator=evaluator)
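
To make the roles of `rank`, `regParam` and `maxIter` concrete, here is a toy NumPy sketch of the alternating least squares idea: fix the movie factors and solve a small ridge regression for each word, then swap sides. It is a simplified illustration on invented data, not Spark's implementation, which is distributed and handles regularization differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ratings: rows are words ("users"), columns are movies ("items"); 0 means unobserved.
R = np.array([
    [2.8, 0.0, 5.2, 0.0],
    [0.0, 2.6, 0.0, 2.4],
    [3.0, 0.0, 5.0, 0.0],
    [0.0, 2.7, 0.0, 2.5],
])
observed = R > 0

# Play the same roles as rank / regParam / maxIter in the grid above.
rank, reg, n_iter = 2, 0.1, 10
U = rng.normal(scale=0.1, size=(R.shape[0], rank))  # word factors
V = rng.normal(scale=0.1, size=(R.shape[1], rank))  # movie factors

def solve_side(fixed, ratings, mask, reg):
    """Ridge-regress each row's factors on the fixed side, using observed cells only."""
    k = fixed.shape[1]
    out = np.zeros((ratings.shape[0], k))
    for i in range(ratings.shape[0]):
        cols = mask[i]
        A = fixed[cols]          # factors of the entries this row has ratings for
        b = ratings[i, cols]     # the corresponding observed ratings
        out[i] = np.linalg.solve(A.T @ A + reg * np.eye(k), A.T @ b)
    return out

for _ in range(n_iter):
    U = solve_side(V, R, observed, reg)       # fix movies, update words
    V = solve_side(U, R.T, observed.T, reg)   # fix words, update movies

pred = U @ V.T
rmse = np.sqrt(np.mean((pred[observed] - R[observed]) ** 2))
print("train RMSE:", rmse)
```

A larger `rank` gives each word and movie more latent dimensions to fit the observed weights, while a larger `reg` pulls the factors toward zero.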

In [34]:
import mlflow
import mlflow.sklearn

In [35]:
ALSmodel=tvs.fit(training)

In [36]:
best_model=ALSmodel.bestModel

predictions=best_model.transform(test) 

predicted_ratings_df = predictions.filter(predictions.prediction != float('nan'))

rmse=evaluator.evaluate(predicted_ratings_df)

print("RMSE = " + str(rmse))
print("**Best Model**")
print(" Rank:", best_model.rank)
print(" MaxIter:", best_model._java_obj.parent().getMaxIter())
print(" RegParam:", best_model._java_obj.parent().getRegParam())

In [37]:
display(predicted_ratings_df)

word,rating,movie,prediction
18296,8.591043987645923,1,8.526204
6,2.7893905123237235,1,2.768338
5167,7.243970339679314,1,7.15207
9299,7.860156479103131,1,7.800833
8,2.9216937094112065,1,2.8996422
4087,6.981606075211823,1,6.9289126
1078,5.696407830963301,1,5.6534147
119,3.9982578608575534,1,3.9680815
662,5.17205706442027,6,0.47961414
7066,7.442421278403152,3,7.3923683
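
As a check on what `RegressionEvaluator` measures, RMSE is the square root of the mean squared difference between `rating` and `prediction`. The sketch below applies that definition to the ten rows displayed above only, so its result describes this sample and not the full test set.

```python
import numpy as np

# rating and prediction columns copied from the ten displayed rows
rating = np.array([
    8.591043987645923, 2.7893905123237235, 7.243970339679314,
    7.860156479103131, 2.9216937094112065, 6.981606075211823,
    5.696407830963301, 3.9982578608575534, 5.17205706442027,
    7.442421278403152,
])
prediction = np.array([
    8.526204, 2.768338, 7.15207, 7.800833, 2.8996422,
    6.9289126, 5.6534147, 3.9680815, 0.47961414, 7.3923683,
])

errors = rating - prediction
sample_rmse = np.sqrt(np.mean(errors ** 2))

# position (0-based) of the row with the largest absolute error
worst = int(np.argmax(np.abs(errors)))

print("sample RMSE:", sample_rmse)
print("largest error at row", worst)
```

In this sample, the row for word 662 and movie 6 accounts for almost all of the error, while the other nine predictions are close to their ratings.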


#Recommendations

We start by computing the 10 recommended movies for each word.

In [39]:
user_rec=best_model.recommendForAllUsers(10)

Recommendations are made for each word. For instance, if we think of a given word in the vocabulary, we would like to know which movies best fit that word. Let's say we want recommendations for word 1959 in the vocabulary.

In [41]:
vocabulary[1959]

In [42]:
user = user_rec.filter(user_rec.word==1959)
rec=user.select("recommendations.movie","recommendations.rating")
rec_items=rec.select("movie").toPandas().iloc[0,0]
rec_ratings=rec.select("rating").toPandas().iloc[0,0]
rec_matrix=pd.DataFrame(rec_items,columns=["movie"])
rec_matrix["ratings"]=rec_ratings
ratings_matrix_df=sqlContext.createDataFrame(rec_matrix)
display(ratings_matrix_df)

movie,ratings
1,12.43671
8,5.5398307
2,5.1620774
6,4.6677046
0,3.9043462
3,2.351719
7,2.1631382
9,-4.5959554
5,-5.635323
4,-6.1553035
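
The scores in this table are predicted ratings, that is, the dot product of the word's latent factor vector with each movie's factor vector. They are therefore not bounded by the range of the TF-IDF weights and can be negative unless ALS is fit with `nonnegative=True`. Below is a toy NumPy sketch of that ranking step. The factors `word_factor` and `movie_factors` are made up for illustration and are not taken from the fitted model.

```python
import numpy as np

# Hypothetical latent factors with rank 2.
word_factor = np.array([1.0, -0.5])
movie_factors = np.array([
    [2.0, 0.0],   # movie 0, score 2.0
    [3.0, -2.0],  # movie 1, score 4.0
    [0.5, 4.0],   # movie 2, score -1.5
    [1.0, 1.0],   # movie 3, score 0.5
])

# Predicted rating of every movie for this word.
scores = movie_factors @ word_factor

# Indices of the top-N movies, highest score first.
top_n = 3
ranked = np.argsort(-scores)[:top_n]
for m in ranked:
    print(m, scores[m])
```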


We use the list of movie titles to look up the title of the top movie recommended for the selected word.

In [44]:
movieList[ratings_matrix_df.select('movie').collect()[0].movie]