# Amazon Review Analysis

Data source: https://nijianmo.github.io/amazon/index.html

Useful resources: 
- PySpark cheat sheet: http://web.utk.edu/~wfeng1/doc/cheatSheet_pyspark.pdf
- MLlib documentation: https://spark.apache.org/docs/latest/ml-guide.html
- SparkR documentation: https://spark.apache.org/docs/latest/sparkr.html
- SparkR tutorial: https://rpubs.com/wendyu/sparkr

## Before We Start...

Basic concepts of Spark: 
- RDD (Resilient Distributed Dataset): the fundamental data structure for distributing data among cluster nodes. RDDs are immutable.
- Transformation: an operation on an RDD that returns a new RDD, such as map, filter, and reduceByKey. Transformations are lazy: they are only evaluated when an action needs their result.
- Action: an operation on an RDD that returns a non-RDD value, such as collect, count, and reduce.
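
The split between lazy transformations and eager actions can be illustrated without a cluster. The sketch below is a plain-Python analogy, not Spark code: the built-in `map` and `filter` stand in for transformations, `list()` stands in for an action, and the `calls` list and `double` helper are made up for the illustration.

```python
# Plain-Python analogy (no Spark involved): map/filter objects are lazy like
# transformations; consuming them plays the role of an action.
calls = []  # records which elements were actually processed


def double(x):
    calls.append(x)
    return x * 2


numbers = [1, 2, 3, 4]

# "Transformations": nothing is computed yet, only a pipeline is described.
doubled = map(double, numbers)
large = filter(lambda x: x > 4, doubled)
calls_before_action = list(calls)  # snapshot: still empty at this point

# "Action": materializing the result forces the whole pipeline to run.
result = list(large)

print(calls_before_action)  # no element has been processed before the action
print(result)               # the values that survive the filter
```

In Spark the same idea applies at cluster scale: a chain of transformations only describes the work, and the work runs when an action asks for a result.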

We will mainly use the Spark DataFrame API instead of the RDD API to simplify development.
- Spark DataFrames are very similar to tables in a relational database. They have a schema, and most operations on them resemble relational queries. You can think of a Spark DataFrame as a wrapper on top of an RDD.

## Loading Data

In [4]:
# Reading data from Delta Lake

amazon_review_raw = spark.sql("SELECT * FROM default.video_games_5")

In [5]:
display(amazon_review_raw)

reviewID,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,label
0,5.0,0,True,"10 17, 2015",A1HP7NVNPFMA4N,0700026657,Ambrosia075,"This game is a bit hard to get the hang of, but when you do it's great.",but when you do it's great.,1445040000,0
1,4.0,0,False,"07 27, 2015",A1JGAP0185YJI6,0700026657,travis,"I played it a while but it was alright. The steam was a bit of trouble. The more they move these game to steam the more of a hard time I have activating and playing a game. But in spite of that it was fun, I liked it. Now I am looking forward to anno 2205 I really want to play my way to the moon.","But in spite of that it was fun, I liked it",1437955200,0
2,3.0,0,True,"02 23, 2015",A1YJWEXHQBWK2B,0700026657,Vincent G. Mezera,ok game.,Three Stars,1424649600,0
3,2.0,0,True,"02 20, 2015",A2204E1TH211HT,0700026657,Grandma KR,"found the game a bit too complicated, not what I expected after having played 1602, 1503, and 1701",Two Stars,1424390400,0
4,5.0,0,True,"12 25, 2014",A2RF5B5H74JLPE,0700026657,jon,"great game, I love it and have played it since its arrived",love this game,1419465600,0
5,4.0,0,True,"11 13, 2014",A11V6ZJ2FVQY1D,0700026657,IBRAHIM ALBADI,i liked a lot some time that i haven't play a wonderfull game very simply and funny game verry good game.,Anno 2070,1415836800,0
6,1.0,0,False,"08 2, 2014",A1KXJ1ELZIU05C,0700026657,Creation27,"I'm an avid gamer, but Anno 2070 is an INSULT to gaming. It is so buggy and half-finished that the first campaign doesn't even work properly and the DRM is INCREDIBLY frustrating to deal with. Once you manage to work your way past the massive amounts of bugs and get through the DRM, HOURS later you finally figure out that the game has no real tutorial, so you stuck just clicking around randomly. Sad, sad, sad, example of a game that could have been great but FTW.",Avoid This Game - Filled with Bugs,1406937600,0
7,5.0,0,True,"03 3, 2014",A1WK5I4874S3O2,0700026657,WhiteSkull,"I bought this game thinking it would be pretty cool and that i might play it for a week or two and be done. Boy was I wrong! From the moment I finally got the gamed Fired up (the other commentors on this are right, it takes forever and u are forced to create an account) I watched as it booted up I could tell right off the bat that ALOT of thought went into making this game. If you have ever played Sim city, then this game is a must try as you will easily navigate thru it and its multi layers. I have been playing htis now for a month straight, and I am STILL discovering layers of complexity in the game. There are a few things in the game that could used tweaked, but all in all this is a 5 star game.",A very good game balance of skill with depth of choices,1393804800,0
8,5.0,0,True,"02 21, 2014",AV969NA4CBP10,0700026657,Travis B. Moore,I have played the old anno 1701 AND 1503. this game looks great but is more complex than the previous versions of the game. I found a lot of things lacking such as the sources of power and an inability to store energy with batteries or regenertive fuel cells as buildings in the game need power. Trade is about the same. My main beef with this it requires an internet connection. Other than that it has wonderful artistry and graphics. It is the same as anno 1701 but set in a future world where global warmming as flood the land and resource scarcity has sent human kind to look to the deep ocean for valuable minerals. I recoment the deep ocean expansion or complete if you get this. I found the ai instructor a little corny but other than that the game has some real polish. I wrote my 2 cents worth on suggestions on anno 2070 wiki and you can read 3 pages on that for game ideas I had.,Anno 2070 more like anno 1701,1392940800,0
9,4.0,0,True,"06 27, 2013",A1EO9BFUHTGWKZ,0700026657,johnnyz3,"I liked it and had fun with it, played for a while and got my money's worth. You can certainly go further than I did but I got frustrated with the fact that here we are in this new start and still taking from the earth rather than living with it. Better than simcity in that respect and maybe the best we could hope for.",Pretty fun,1372291200,0


## Cleaning Data

In [7]:
# Drop duplicates

print("Before duplication removal: ", amazon_review_raw.count())
amazon_review_distinct = amazon_review_raw.dropDuplicates(['reviewerID', 'asin'])
print("After duplication removal: ", amazon_review_distinct.count())
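
To make the effect of deduplicating on a column subset concrete, here is a minimal plain-Python sketch of the same idea. The sample rows and the `drop_duplicates` helper are hypothetical, and the helper keeps the first row it sees for each key, which is an assumption of this sketch; it should not be read as a statement about which duplicate Spark retains.

```python
# Hypothetical sample rows; only the key columns matter for deduplication.
reviews = [
    {"reviewerID": "A1", "asin": "B1", "overall": 5.0},
    {"reviewerID": "A1", "asin": "B1", "overall": 4.0},  # same key as above
    {"reviewerID": "A1", "asin": "B2", "overall": 3.0},
    {"reviewerID": "A2", "asin": "B1", "overall": 1.0},
]


def drop_duplicates(rows, subset):
    """Keep the first row seen for each combination of the subset columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[col] for col in subset)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


distinct_reviews = drop_duplicates(reviews, ["reviewerID", "asin"])
print("Before:", len(reviews), "After:", len(distinct_reviews))
```

Two reviews count as duplicates here as soon as the reviewer and the product match, even when the rating or text differs.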

In [8]:
# Convert Unix timestamp to readable date

from pyspark.sql.functions import from_unixtime, to_date

amazon_review_with_date = amazon_review_distinct.withColumn("reviewTime", to_date(from_unixtime(amazon_review_distinct.unixReviewTime))) \
                                                .drop("unixReviewTime")

# Cast the vote column to integer, then fill missing votes with 0

# ArrayType and StringType are used by the stemming UDF further down
from pyspark.sql.types import ArrayType, IntegerType, StringType

amazon_review_fill_vote = amazon_review_with_date.withColumn("vote", amazon_review_with_date.vote.cast(IntegerType())) \
                                                 .fillna(0, subset=["vote"])

In [9]:
display(amazon_review_fill_vote)

reviewID,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,label
479357,1.0,0,True,2014-02-07,A0380485C177Q6QQNJIX,B003N5ZYOG,Franklin Tineo,bad very bad the conect drums is broken,bad very bad the conect drums is broken,0
78637,4.0,0,True,2015-01-10,A102MU6ZC9H1N6,B000EGELP0,Teresa Halbert,My daughter still pulls this out from time to time just to see if she can still keep up with the puzzles. In great shape and purchase and price was good.,Brain Age: Train Your Brain in Minutes a Day,0
186337,4.0,0,True,2013-03-26,A102MU6ZC9H1N6,B001TOQ8JS,Teresa Halbert,"My son is Beatles crazy and when he saw this game, he was so excited to get it ordered and play it with his friends. The game was in good shape and the price was great.",Xbox 360 The Beatles: rock Band,0
274182,5.0,0,True,2012-10-11,A1033LWHDCAJNQ,B0083CJ2X8,S. Ali Limonadi,"Everything from all the previous Kingdom Hearts games, especially 1 & 2, has been leading up to this. I'm NOT going to spoil anything, but just in case you keep making excuses to not get this game: You find out who the man in the Brown Robe is. You find out every single thing about Organization 13. You find out what Riku is. (a blank blank). You find out who Xemnes is. (you only think you know, but this game will let you know the real deal) 2 of the worlds are waking worlds, not sleeping. And there's a 8th for pure cinematic. To unlock secret ending: Beginners mode is 10 to 13 tropies. Normal is 7 tropies......easy without even trying. Proud mode is just 3, I believe. I play with the 3D ON & all the way up 24/7 on every game, including, this......and I must say that games like this & Paper Mario & Kid Icarus & Luigi's Mansion were made for 3D. there was a few places were the frame rate dropped for 1 to 3 seconds, but if you did a Wireless Update and have either 4.2.0-9U or 4.4.0-10U then you won't have any problem.......and the latest Wireless Update Firmware is currently 4.4.0-10U. But the PS3 & 360 only do up to 24 fps with the 3D ON 24/7......so I have no complaints about the 3DS......especially now that I have Wireless Firmware Update 4.4.0-10U. The 3D Augmented Realty requires 0 cards, but you can use the cards to unlock rare Partners. Flowmotion is one of the easiest things to do in this game. You can Link & Dual Link with Spirits......I ended up just doing Dual Link for crazy new abilities during mostly Boss Battles. (because you have to build up your Link Gauge). There's also Reality Shift, which is more easier to do by just using your fingers on the Touch Screen, but sometimes I would used the Stylus after entering Reality Shift for certain things, like if it's a Sword for cutting the White Chains & Links on th touch screen. 
FYI According to the guy that makes every Kingdom Hearts game......""Reality Shift gives you a glimpse of a new gameplay that is exclusive to Kingdom Hearts 3 Only""..........at first this didn't make sense to me because each World has a completely different gameplay functionality for Reality Shift..........but then I thought about what they all have in common, which is Dual Screen interaction......so basically Kingdom Hearts 3 is either coming out on the Wii U or 3DS or both. Anyways everything has been leading up to Kingdom Hearts 3D, so I don't know what they are going to do in Kingdom Hearts 3, because everything has already been done in Kingdom Hearts 3D. But they do have a Tutorial for the previous 6 Kingdom Hearts games to read inside Kingdom Hearts 3D, so it's okay to start off with this......but I would still suggest playing Kingdom Hearts 1 & 2 as well as 3D.......and I hope 3 will be good. The Mark of Mastery comes with 5 3D AR Cards, like the Limited Edition $40 game at WalMart.......but you also get 12 artwork stuff & a great Kingdom Hearts 10th Anniversary 3DS case, which goes great with such 3DS colors as Red & Purple & maybe also White......but I have Blue and it doesn't look that good in bright areas, buti still love wearing it on my 3DS 24/7 anyways. So besides Heartless & Nobodies, there is also just a walking Heart, as well as ISO's & Spirits & Nightmares & Dream Eaters & Reapers. Basically, they are clearly making a sequel for the World Ends With You, for the 3DS. Like Kingdom Hearts 1 & 2......3D is also a main game. Quite frankly, I now don't know how I will play this game without Flowmotion & Reality Shift & Dual Links in the upcoming Kingdom Hearts 3........so I guess you could say that this game has ruin me, LOL. This is a must have game, no matter what kind of gamer you are. And everything gets revealed in this game, which makes me question the upcoming 3, but 3D does leave some things for you to want to see in the last game. 
And they now sell this game by itself for $30 at places like GameStop and probably also on amazon, and so fort. And you are really going to change your mind about hating 3D after you play this entire game with the 3D ON & all the way UP. This game is simply so beatiful in every way, I don't know how else to describe it with the 3D all the way up, without you seeing it for yourself. Even though both HD & Monchromatic 3D are just passing fads that are going to die forever finally. But both IMAX & Stereoscopic 3D are here forever since they first arrived, and the 3DS is Stereoscopic 3D, and like all Stereoscopic Movies, it's 60 fps per video image. So enjoy the great Stereoscopic 3D 60 fps per eye in-sinc experience. Oh yeah, and the 6 to 7 speaker sound the 3DS does thru the Earphone Jack is just Wonderful in Kingdom Hearts 3D, but I have heard better Surround Sound in a few other games, i.e. Zelda Ocarina of Time 3D (Best Surround Sound EVER) & Thor 3DS (Second best Surround Sound) & Dillon's Rolling Western (best Surround Sound for a Digital Only game) Also Kingdom Hearts 3D is getting a Digital release in the eShop to download onto your 3DS instead, if you can wait up to around 3 months or 6. FYI I use ""PLANTRONICS GAMECOM 777"" for my Surround Sound gaming and I just leave the separate Microphone jack plug unplugged, because I am only using the 7.1 Surround Sound Earphone functionality......plus the 3DS already has a Built-in surround sound microphone to pick up sound in beatiful 6 or 7 speaker sound, like in 3D Video Recording. 
If you want other Steroscopic 3D Surround Sound game suggestions: Kid Icarus: Uprising King of Pirates Rodea the Sky Soldier Dillon's Rolling Western Rhythm Thief and the Emperor's Treasure Nano Assault EX Mario Kart 7 Ketzal's Corridors Mutant Mudds (3DS, it has 40 extra levels the PC will never have) Pushmo Cave Story 3D (the Surround Sound is limited in a side-scroller, so you can just use the Speakerphones, which in the 3DS does Virtual Surround Sound, because its just mere Speakerphones). Sonic & All Stars Racing Transformed (3DS) Luigi's Mansion: Dark Moon Lego City Undercover (3DS) Tales of the Abyss 3DS Lego Batman 2 (3DS) (but same old Lego gameplay) Super Smash Bros. 3DS Resident Evil: Revelations Metal Gear Solid: Snake Eater Star Fox 64 3D Crushmo Shinobi 3DS (it's in Stereo) Epic Mickey: Power of illusion (side-scroller, so just use the Speakerphones, I guess) SpeedX 3D (it's just Mono) Virtual Racing Sakura Samurai: Art of the Sword Samurai Sword Destiny Heavy Fire: Special Operations 3D The Denpa Men: They came by wave Pokemon Mystery Dungeon 3DS Code of Princess Yo-Kai Watch",The best or second best Kingdom Hearts EVER!,0
225932,4.0,0,True,2015-04-05,A103KKI1Y4TFNQ,B004EW948E,Ktown,okay,Four Stars,0
401674,3.0,0,True,2016-06-15,A1058LY5U7O6B3,B00YC7ECXS,johnathan,Ummmmm. I guess its soccer.,Three Stars,0
17521,5.0,0,False,2002-07-16,A1085T5S1400VA,B00004UE0V,khiul,"I was on someone's site and saw a reference to this game. I looked into it and it seemed cool. Went out and bought it the next day. It is awesome!! Dark and gothic with very richly detailed worlds and levels and tonnes of interactive characters. I remember seeing this game for sale where I work but never really thought about buying it (never had the hardware to run it anyway). I've only just started playing but this game is worth it!! Top notch graphics and gameplay, I'd take Alice over Lara Croft any day. ;)","twisted and evil, I love it!",0
314824,2.0,39,False,2014-03-11,A109IF09XK3YNQ,B00DC9T2J6,Jeffy,"(-) No single player campaign. I've purchased and enjoyed other on-line only multiplayer games before, such as Left 4 Dead 2 and Team Fortress 2, but those were sold at low price points to make up for the fact that they offer limited experiences (in fact TF2 is free now). $60 for an on-line only shooter is asking a bit much in my opinion (-) The graphics aren't good. In fact it looks like a last gen game that happens to have impressive art direction. Many of the textures look outright poor and muddy, and the lack of any kind of interactive environmental physics feels dated. The game grinds down to single digit framerates when several Titans get into close combat (i7-2600 3.4Ghz quad, 16GB DDR3-1333, two SATA IIIs mirrored, GTX 660 2GB OC, 75Mbps Internet connection) (-) The maps are small and confined, particularly when there are giant robots stomping around with scores of bots (computer-controller characters) scurrying about underfoot. I find myself often spawning right next to enemies, or having only several seconds to get my bearings before being fired upon from above. And for a game featuring jet packs and parkour, the ceiling/height limit of every map feels really low (+/-) Team sizes are very small: 6 players per. To offset the lack of human players, matches include a large number of computer controlled AI opponents along with the featured giant robots. This design decision helps in some ways: the AI bots serve as fodder for players, making it easier to call in the titular Titans, but also alleviating the frustration new players often feel when playing a FPS for the first time. Unfortunately the AI is rudimentary and often makes poor decisions. This includes unmanned Titans, who have the tendency to wander away from their teammates when left to their own devices. 
It feels underwhelming and lacking challenge when the opposing team is comprised primarily of mindless automatons (+/-) Despite the small, cramped maps and limited team sizes the general gunning and gameplay is sound. The controls are rock solid - you'll fluidly run up and over obstacles, scale walls, swoop down into your Titan or go sailing out of a window while blasting away at your enemies. It's fun figuring out ways to flank the enemy position or work with teammates to take down Titans. Unfortunately the gameplay looks and feels extremely similar to Call of Duty - right down to the cadence of combat as you maneuver and take aim, or the tell tale visual queues as your bullets find their target. This may actually be a positive to many players, as the fanbase for CoD is huge and enthusiastic - but for me I was hoping for gameplay that felt new and fresh at its core (+) No issues with the nuts and bolts, like hit registration and client stability","A competent, if confined and limited, last gen shooter",1
19008,3.0,6,False,2000-12-21,A10FBJXMQPI0LL,B00004YKHW,M. Thakery,"Okay, the whacks first then the good stuff. First whack: The intro-opening movie thing just really needs to go away. It was aweful!! It had very little to do with the story and even less to do with making me want to play the game! From crappy leaves to odd duck looking things to that terrible screeching pseudo-song as the movie goes on (and on and on and on. . .), the intro just needs to be trashed and redone completely. Second whack: The camera angles are really lousy. Sure, you control the camera, but that's just it; you spend most of your time trying to manipulate the camera to look at the monsters and generally end up getting whacked, yourself. There are areas in the game that are so difficult to navigate woth the odd angles that you invariably end up quitting and coming back to it. Half the time the camera's not even looking in the direction you're running. VERY annoying! Third whack: The mapping is nice and all, but it is ENTIRELY too easy to fall off a ledge. Sure, it's possible to fall off a cliff in real life, but in real life I don't control my actions with a 3/4 inch button. . . Sneeze and you're history. Last whack: This is actually a two parter- 1) the words on the screen are only acurate about 60% of the time with the words being spoken. Did we get 2 different translations?!? If so, let's pick 1 and stick to it. . . 2) The attributes given to various armaments in the menu rarely matches what you actually get. It will say you're getting one thing, then you get something totally different. Again, annoying. Praise: The gameplay is quick- you change weapons with almost every encounter. There are lots of things to buy from the creepy elephant-man shop owner and most of it is upgradeable. The bosses are not ridiculously powerful and the monsters are actually manageable. The game actually allows you to win on ocassion. The graphics are pretty good and the colors are lively. 
The male character looks realistic and the female looks almost so, except for the fact that her wrists are like a foot long. All in all, I would tell you to rent it first. Otherwise I would tell you to buy Legend of Lagaia. . . Except LOL can't recognize the PS2 mem cards. . . Bummer.",Great game. . . If it were for or PS1,1
396656,5.0,0,True,2017-03-13,A10HJXBYIFBY1J,B00W8FYEU2,Heyitsbud777,good game it is fun,Five Stars,0


For comparison, with a pandas DataFrame you would use .apply() to apply a function to a column. See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

For example: amz_review['Date'] = amz_review['Time'].apply(to_date)
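
The `to_date` function in that pandas example is not defined in this notebook. A minimal sketch of what it could look like, using only the standard library, follows. It assumes the timestamps are seconds since the epoch and interprets them in UTC, whereas Spark's `from_unixtime` uses the session time zone; the two timestamps are taken from the sample rows shown earlier.

```python
from datetime import datetime, timezone


def to_date(unix_seconds):
    """Convert a Unix timestamp in seconds to a calendar date (UTC assumed)."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).date()


# unixReviewTime values of the first two sample reviews
unix_review_times = [1445040000, 1437955200]

# Element-wise application, which is what .apply(to_date) does on a column
review_dates = [to_date(ts) for ts in unix_review_times]
print([d.isoformat() for d in review_dates])
```

The results agree with the original reviewTime strings of those rows ("10 17, 2015" and "07 27, 2015").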

In [11]:
# Tokenization

from pyspark.ml.feature import RegexTokenizer

regexTokenizer = RegexTokenizer(inputCol="reviewText", outputCol="reviewWord", pattern="\\W")

amazon_review_tokenized = regexTokenizer.transform(amazon_review_fill_vote.fillna("", subset=["reviewText"]))
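
As a rough illustration of what splitting on `\W` does to a review, the following plain-Python sketch mimics the tokenizer with the `re` module. It assumes the defaults of `RegexTokenizer` (the pattern marks gaps between tokens, text is lowercased, and empty tokens are dropped); the `regex_tokenize` helper is hypothetical, and Python's `\W` may treat non-ASCII characters differently from the Java regex engine Spark uses.

```python
import re


def regex_tokenize(text, pattern=r"\W", min_token_length=1):
    """Lowercase the text, split on the pattern, and drop too-short tokens."""
    pieces = re.split(pattern, text.lower())
    return [p for p in pieces if len(p) >= min_token_length]


tokens = regex_tokenize("Ummmmm. I guess its soccer.")
print(tokens)

# Apostrophes are non-word characters, so contractions are split in two
print(regex_tokenize("it's great"))
```

Because contractions are split, fragments such as "s", "ve", or "m" can show up as separate tokens in the tokenized column.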

In [12]:
# Remove stop words

from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="reviewWord", outputCol="reviewWordFiltered")
amazon_review_stop_word_removed = remover.transform(amazon_review_tokenized)
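
Stop-word removal is a simple membership filter over the token list. The sketch below shows the idea in plain Python with a tiny, hypothetical stop-word set; `StopWordsRemover` ships with a much longer default English list, and the case-insensitive comparison here is meant to mirror its default behavior.

```python
# A tiny hypothetical stop-word list; the real default list is much longer.
STOP_WORDS = {"i", "its", "the", "is", "very"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


filtered = remove_stop_words(["ummmmm", "i", "guess", "its", "soccer"])
print(filtered)
```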

In [13]:
# Stemming

from nltk.stem.porter import PorterStemmer
from pyspark.sql.functions import udf

def stemming(col):
  # Reduce each token to its Porter stem, e.g. "puzzles" -> "puzzl"
  p_stemmer = PorterStemmer()
  return [p_stemmer.stem(w) for w in col]

stemming_udf = udf(stemming, ArrayType(StringType()))
amazon_review_stemmed = amazon_review_stop_word_removed.withColumn("reviewWordCleaned", stemming_udf(amazon_review_stop_word_removed.reviewWordFiltered))

In [14]:
# Drop temporary columns and cache the result (note that cache is also a lazy operation)

amazon_review_cleaned = amazon_review_stemmed.drop("reviewWord").drop("reviewWordFiltered").cache()

display(amazon_review_cleaned)

reviewID,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,label,reviewWordCleaned
479357,1.0,0,True,2014-02-07,A0380485C177Q6QQNJIX,B003N5ZYOG,Franklin Tineo,bad very bad the conect drums is broken,bad very bad the conect drums is broken,0,"List(bad, bad, conect, drum, broken)"
78637,4.0,0,True,2015-01-10,A102MU6ZC9H1N6,B000EGELP0,Teresa Halbert,My daughter still pulls this out from time to time just to see if she can still keep up with the puzzles. In great shape and purchase and price was good.,Brain Age: Train Your Brain in Minutes a Day,0,"List(daughter, still, pull, time, time, see, still, keep, puzzl, great, shape, purchas, price, good)"
186337,4.0,0,True,2013-03-26,A102MU6ZC9H1N6,B001TOQ8JS,Teresa Halbert,"My son is Beatles crazy and when he saw this game, he was so excited to get it ordered and play it with his friends. The game was in good shape and the price was great.",Xbox 360 The Beatles: rock Band,0,"List(son, beatl, crazi, saw, game, excit, get, order, play, friend, game, good, shape, price, great)"
274182,5.0,0,True,2012-10-11,A1033LWHDCAJNQ,B0083CJ2X8,S. Ali Limonadi,"Everything from all the previous Kingdom Hearts games, especially 1 & 2, has been leading up to this. I'm NOT going to spoil anything, but just in case you keep making excuses to not get this game: You find out who the man in the Brown Robe is. You find out every single thing about Organization 13. You find out what Riku is. (a blank blank). You find out who Xemnes is. (you only think you know, but this game will let you know the real deal) 2 of the worlds are waking worlds, not sleeping. And there's a 8th for pure cinematic. To unlock secret ending: Beginners mode is 10 to 13 tropies. Normal is 7 tropies......easy without even trying. Proud mode is just 3, I believe. I play with the 3D ON & all the way up 24/7 on every game, including, this......and I must say that games like this & Paper Mario & Kid Icarus & Luigi's Mansion were made for 3D. there was a few places were the frame rate dropped for 1 to 3 seconds, but if you did a Wireless Update and have either 4.2.0-9U or 4.4.0-10U then you won't have any problem.......and the latest Wireless Update Firmware is currently 4.4.0-10U. But the PS3 & 360 only do up to 24 fps with the 3D ON 24/7......so I have no complaints about the 3DS......especially now that I have Wireless Firmware Update 4.4.0-10U. The 3D Augmented Realty requires 0 cards, but you can use the cards to unlock rare Partners. Flowmotion is one of the easiest things to do in this game. You can Link & Dual Link with Spirits......I ended up just doing Dual Link for crazy new abilities during mostly Boss Battles. (because you have to build up your Link Gauge). There's also Reality Shift, which is more easier to do by just using your fingers on the Touch Screen, but sometimes I would used the Stylus after entering Reality Shift for certain things, like if it's a Sword for cutting the White Chains & Links on th touch screen. 
FYI According to the guy that makes every Kingdom Hearts game......""Reality Shift gives you a glimpse of a new gameplay that is exclusive to Kingdom Hearts 3 Only""..........at first this didn't make sense to me because each World has a completely different gameplay functionality for Reality Shift..........but then I thought about what they all have in common, which is Dual Screen interaction......so basically Kingdom Hearts 3 is either coming out on the Wii U or 3DS or both. Anyways everything has been leading up to Kingdom Hearts 3D, so I don't know what they are going to do in Kingdom Hearts 3, because everything has already been done in Kingdom Hearts 3D. But they do have a Tutorial for the previous 6 Kingdom Hearts games to read inside Kingdom Hearts 3D, so it's okay to start off with this......but I would still suggest playing Kingdom Hearts 1 & 2 as well as 3D.......and I hope 3 will be good. The Mark of Mastery comes with 5 3D AR Cards, like the Limited Edition $40 game at WalMart.......but you also get 12 artwork stuff & a great Kingdom Hearts 10th Anniversary 3DS case, which goes great with such 3DS colors as Red & Purple & maybe also White......but I have Blue and it doesn't look that good in bright areas, buti still love wearing it on my 3DS 24/7 anyways. So besides Heartless & Nobodies, there is also just a walking Heart, as well as ISO's & Spirits & Nightmares & Dream Eaters & Reapers. Basically, they are clearly making a sequel for the World Ends With You, for the 3DS. Like Kingdom Hearts 1 & 2......3D is also a main game. Quite frankly, I now don't know how I will play this game without Flowmotion & Reality Shift & Dual Links in the upcoming Kingdom Hearts 3........so I guess you could say that this game has ruin me, LOL. This is a must have game, no matter what kind of gamer you are. And everything gets revealed in this game, which makes me question the upcoming 3, but 3D does leave some things for you to want to see in the last game. 
And they now sell this game by itself for $30 at places like GameStop and probably also on amazon, and so fort. And you are really going to change your mind about hating 3D after you play this entire game with the 3D ON & all the way UP. This game is simply so beatiful in every way, I don't know how else to describe it with the 3D all the way up, without you seeing it for yourself. Even though both HD & Monchromatic 3D are just passing fads that are going to die forever finally. But both IMAX & Stereoscopic 3D are here forever since they first arrived, and the 3DS is Stereoscopic 3D, and like all Stereoscopic Movies, it's 60 fps per video image. So enjoy the great Stereoscopic 3D 60 fps per eye in-sinc experience. Oh yeah, and the 6 to 7 speaker sound the 3DS does thru the Earphone Jack is just Wonderful in Kingdom Hearts 3D, but I have heard better Surround Sound in a few other games, i.e. Zelda Ocarina of Time 3D (Best Surround Sound EVER) & Thor 3DS (Second best Surround Sound) & Dillon's Rolling Western (best Surround Sound for a Digital Only game) Also Kingdom Hearts 3D is getting a Digital release in the eShop to download onto your 3DS instead, if you can wait up to around 3 months or 6. FYI I use ""PLANTRONICS GAMECOM 777"" for my Surround Sound gaming and I just leave the separate Microphone jack plug unplugged, because I am only using the 7.1 Surround Sound Earphone functionality......plus the 3DS already has a Built-in surround sound microphone to pick up sound in beatiful 6 or 7 speaker sound, like in 3D Video Recording. 
If you want other Steroscopic 3D Surround Sound game suggestions: Kid Icarus: Uprising King of Pirates Rodea the Sky Soldier Dillon's Rolling Western Rhythm Thief and the Emperor's Treasure Nano Assault EX Mario Kart 7 Ketzal's Corridors Mutant Mudds (3DS, it has 40 extra levels the PC will never have) Pushmo Cave Story 3D (the Surround Sound is limited in a side-scroller, so you can just use the Speakerphones, which in the 3DS does Virtual Surround Sound, because its just mere Speakerphones). Sonic & All Stars Racing Transformed (3DS) Luigi's Mansion: Dark Moon Lego City Undercover (3DS) Tales of the Abyss 3DS Lego Batman 2 (3DS) (but same old Lego gameplay) Super Smash Bros. 3DS Resident Evil: Revelations Metal Gear Solid: Snake Eater Star Fox 64 3D Crushmo Shinobi 3DS (it's in Stereo) Epic Mickey: Power of illusion (side-scroller, so just use the Speakerphones, I guess) SpeedX 3D (it's just Mono) Virtual Racing Sakura Samurai: Art of the Sword Samurai Sword Destiny Heavy Fire: Special Operations 3D The Denpa Men: They came by wave Pokemon Mystery Dungeon 3DS Code of Princess Yo-Kai Watch",The best or second best Kingdom Hearts EVER!,0,"List(everyth, previou, kingdom, heart, game, especi, 1, 2, lead, m, go, spoil, anyth, case, keep, make, excus, get, game, find, man, brown, robe, find, everi, singl, thing, organ, 13, find, riku, blank, blank, find, xemn, think, know, game, let, know, real, deal, 2, world, wake, world, sleep, 8th, pure, cinemat, unlock, secret, end, beginn, mode, 10, 13, tropi, normal, 7, tropi, easi, without, even, tri, proud, mode, 3, believ, play, 3d, way, 24, 7, everi, game, includ, must, say, game, like, paper, mario, kid, icaru, luigi, mansion, made, 3d, place, frame, rate, drop, 1, 3, second, wireless, updat, either, 4, 2, 0, 9u, 4, 4, 0, 10u, won, problem, latest, wireless, updat, firmwar, current, 4, 4, 0, 10u, ps3, 360, 24, fp, 3d, 24, 7, complaint, 3d, especi, wireless, firmwar, updat, 4, 4, 0, 10u, 3d, augment, realti, requir, 0, card, 
use, card, unlock, rare, partner, flowmot, one, easiest, thing, game, link, dual, link, spirit, end, dual, link, crazi, new, abil, mostli, boss, battl, build, link, gaug, also, realiti, shift, easier, use, finger, touch, screen, sometim, use, stylu, enter, realiti, shift, certain, thing, like, sword, cut, white, chain, link, th, touch, screen, fyi, accord, guy, make, everi, kingdom, heart, game, realiti, shift, give, glimps, new, gameplay, exclus, kingdom, heart, 3, first, didn, make, sens, world, complet, differ, gameplay, function, realiti, shift, thought, common, dual, screen, interact, basic, kingdom, heart, 3, either, come, wii, u, 3d, anyway, everyth, lead, kingdom, heart, 3d, know, go, kingdom, heart, 3, everyth, alreadi, done, kingdom, heart, 3d, tutori, previou, 6, kingdom, heart, game, read, insid, kingdom, heart, 3d, okay, start, still, suggest, play, kingdom, heart, 1, 2, well, 3d, hope, 3, good, mark, masteri, come, 5, 3d, ar, card, like, limit, edit, 40, game, walmart, also, get, 12, artwork, stuff, great, kingdom, heart, 10th, anniversari, 3d, case, goe, great, 3d, color, red, purpl, mayb, also, white, blue, doesn, look, good, bright, area, buti, still, love, wear, 3d, 24, 7, anyway, besid, heartless, nobodi, also, walk, heart, well, iso, spirit, nightmar, dream, eater, reaper, basic, clearli, make, sequel, world, end, 3d, like, kingdom, heart, 1, 2, 3d, also, main, game, quit, frankli, know, play, game, without, flowmot, realiti, shift, dual, link, upcom, kingdom, heart, 3, guess, say, game, ruin, lol, must, game, matter, kind, gamer, everyth, get, reveal, game, make, question, upcom, 3, 3d, leav, thing, want, see, last, game, sell, game, 30, place, like, gamestop, probabl, also, amazon, fort, realli, go, chang, mind, hate, 3d, play, entir, game, 3d, way, game, simpli, beati, everi, way, know, els, describ, 3d, way, without, see, even, though, hd, monchromat, 3d, pass, fad, go, die, forev, final, imax, stereoscop, 3d, forev, sinc, first, arriv, 3d, 
stereoscop, 3d, like, stereoscop, movi, 60, fp, per, video, imag, enjoy, great, stereoscop, 3d, 60, fp, per, eye, sinc, experi, oh, yeah, 6, 7, speaker, sound, 3d, thru, earphon, jack, wonder, kingdom, heart, 3d, heard, better, surround, sound, game, e, zelda, ocarina, time, 3d, best, surround, sound, ever, thor, 3d, second, best, surround, sound, dillon, roll, western, best, surround, sound, digit, game, also, kingdom, heart, 3d, get, digit, releas, eshop, download, onto, 3d, instead, wait, around, 3, month, 6, fyi, use, plantron, gamecom, 777, surround, sound, game, leav, separ, microphon, jack, plug, unplug, use, 7, 1, surround, sound, earphon, function, plu, 3d, alreadi, built, surround, sound, microphon, pick, sound, beati, 6, 7, speaker, sound, like, 3d, video, record, want, steroscop, 3d, surround, sound, game, suggest, kid, icaru, upris, king, pirat, rodea, sky, soldier, dillon, roll, western, rhythm, thief, emperor, treasur, nano, assault, ex, mario, kart, 7, ketzal, corridor, mutant, mudd, 3d, 40, extra, level, pc, never, pushmo, cave, stori, 3d, surround, sound, limit, side, scroller, use, speakerphon, 3d, virtual, surround, sound, mere, speakerphon, sonic, star, race, transform, 3d, luigi, mansion, dark, moon, lego, citi, undercov, 3d, tale, abyss, 3d, lego, batman, 2, 3d, old, lego, gameplay, super, smash, bro, 3d, resid, evil, revel, metal, gear, solid, snake, eater, star, fox, 64, 3d, crushmo, shinobi, 3d, stereo, epic, mickey, power, illus, side, scroller, use, speakerphon, guess, speedx, 3d, mono, virtual, race, sakura, samurai, art, sword, samurai, sword, destini, heavi, fire, special, oper, 3d, denpa, men, came, wave, pokemon, mysteri, dungeon, 3d, code, princess, yo, kai, watch)"
225932,4.0,0,True,2015-04-05,A103KKI1Y4TFNQ,B004EW948E,Ktown,okay,Four Stars,0,List(okay)
401674,3.0,0,True,2016-06-15,A1058LY5U7O6B3,B00YC7ECXS,johnathan,Ummmmm. I guess its soccer.,Three Stars,0,"List(ummmmm, guess, soccer)"
17521,5.0,0,False,2002-07-16,A1085T5S1400VA,B00004UE0V,khiul,"I was on someone's site and saw a reference to this game. I looked into it and it seemed cool. Went out and bought it the next day. It is awesome!! Dark and gothic with very richly detailed worlds and levels and tonnes of interactive characters. I remember seeing this game for sale where I work but never really thought about buying it (never had the hardware to run it anyway). I've only just started playing but this game is worth it!! Top notch graphics and gameplay, I'd take Alice over Lara Croft any day. ;)","twisted and evil, I love it!",0,"List(someon, site, saw, refer, game, look, seem, cool, went, bought, next, day, awesom, dark, gothic, richli, detail, world, level, tonn, interact, charact, rememb, see, game, sale, work, never, realli, thought, buy, never, hardwar, run, anyway, ve, start, play, game, worth, top, notch, graphic, gameplay, d, take, alic, lara, croft, day)"
314824,2.0,39,False,2014-03-11,A109IF09XK3YNQ,B00DC9T2J6,Jeffy,"(-) No single player campaign. I've purchased and enjoyed other on-line only multiplayer games before, such as Left 4 Dead 2 and Team Fortress 2, but those were sold at low price points to make up for the fact that they offer limited experiences (in fact TF2 is free now). $60 for an on-line only shooter is asking a bit much in my opinion (-) The graphics aren't good. In fact it looks like a last gen game that happens to have impressive art direction. Many of the textures look outright poor and muddy, and the lack of any kind of interactive environmental physics feels dated. The game grinds down to single digit framerates when several Titans get into close combat (i7-2600 3.4Ghz quad, 16GB DDR3-1333, two SATA IIIs mirrored, GTX 660 2GB OC, 75Mbps Internet connection) (-) The maps are small and confined, particularly when there are giant robots stomping around with scores of bots (computer-controller characters) scurrying about underfoot. I find myself often spawning right next to enemies, or having only several seconds to get my bearings before being fired upon from above. And for a game featuring jet packs and parkour, the ceiling/height limit of every map feels really low (+/-) Team sizes are very small: 6 players per. To offset the lack of human players, matches include a large number of computer controlled AI opponents along with the featured giant robots. This design decision helps in some ways: the AI bots serve as fodder for players, making it easier to call in the titular Titans, but also alleviating the frustration new players often feel when playing a FPS for the first time. Unfortunately the AI is rudimentary and often makes poor decisions. This includes unmanned Titans, who have the tendency to wander away from their teammates when left to their own devices. 
It feels underwhelming and lacking challenge when the opposing team is comprised primarily of mindless automatons (+/-) Despite the small, cramped maps and limited team sizes the general gunning and gameplay is sound. The controls are rock solid - you'll fluidly run up and over obstacles, scale walls, swoop down into your Titan or go sailing out of a window while blasting away at your enemies. It's fun figuring out ways to flank the enemy position or work with teammates to take down Titans. Unfortunately the gameplay looks and feels extremely similar to Call of Duty - right down to the cadence of combat as you maneuver and take aim, or the tell tale visual queues as your bullets find their target. This may actually be a positive to many players, as the fanbase for CoD is huge and enthusiastic - but for me I was hoping for gameplay that felt new and fresh at its core (+) No issues with the nuts and bolts, like hit registration and client stability","A competent, if confined and limited, last gen shooter",1,"List(singl, player, campaign, ve, purchas, enjoy, line, multiplay, game, left, 4, dead, 2, team, fortress, 2, sold, low, price, point, make, fact, offer, limit, experi, fact, tf2, free, 60, line, shooter, ask, bit, much, opinion, graphic, aren, good, fact, look, like, last, gen, game, happen, impress, art, direct, mani, textur, look, outright, poor, muddi, lack, kind, interact, environment, physic, feel, date, game, grind, singl, digit, framer, sever, titan, get, close, combat, i7, 2600, 3, 4ghz, quad, 16gb, ddr3, 1333, two, sata, iii, mirror, gtx, 660, 2gb, oc, 75mbp, internet, connect, map, small, confin, particularli, giant, robot, stomp, around, score, bot, comput, control, charact, scurri, underfoot, find, often, spawn, right, next, enemi, sever, second, get, bear, fire, upon, game, featur, jet, pack, parkour, ceil, height, limit, everi, map, feel, realli, low, team, size, small, 6, player, per, offset, lack, human, player, match, includ, larg, number, 
comput, control, ai, oppon, along, featur, giant, robot, design, decis, help, way, ai, bot, serv, fodder, player, make, easier, call, titular, titan, also, allevi, frustrat, new, player, often, feel, play, fp, first, time, unfortun, ai, rudimentari, often, make, poor, decis, includ, unman, titan, tendenc, wander, away, teammat, left, devic, feel, underwhelm, lack, challeng, oppos, team, compris, primarili, mindless, automaton, despit, small, cramp, map, limit, team, size, gener, gun, gameplay, sound, control, rock, solid, ll, fluidli, run, obstacl, scale, wall, swoop, titan, go, sail, window, blast, away, enemi, fun, figur, way, flank, enemi, posit, work, teammat, take, titan, unfortun, gameplay, look, feel, extrem, similar, call, duti, right, cadenc, combat, maneuv, take, aim, tell, tale, visual, queue, bullet, find, target, may, actual, posit, mani, player, fanbas, cod, huge, enthusiast, hope, gameplay, felt, new, fresh, core, issu, nut, bolt, like, hit, registr, client, stabil)"
19008,3.0,6,False,2000-12-21,A10FBJXMQPI0LL,B00004YKHW,M. Thakery,"Okay, the whacks first then the good stuff. First whack: The intro-opening movie thing just really needs to go away. It was aweful!! It had very little to do with the story and even less to do with making me want to play the game! From crappy leaves to odd duck looking things to that terrible screeching pseudo-song as the movie goes on (and on and on and on. . .), the intro just needs to be trashed and redone completely. Second whack: The camera angles are really lousy. Sure, you control the camera, but that's just it; you spend most of your time trying to manipulate the camera to look at the monsters and generally end up getting whacked, yourself. There are areas in the game that are so difficult to navigate woth the odd angles that you invariably end up quitting and coming back to it. Half the time the camera's not even looking in the direction you're running. VERY annoying! Third whack: The mapping is nice and all, but it is ENTIRELY too easy to fall off a ledge. Sure, it's possible to fall off a cliff in real life, but in real life I don't control my actions with a 3/4 inch button. . . Sneeze and you're history. Last whack: This is actually a two parter- 1) the words on the screen are only acurate about 60% of the time with the words being spoken. Did we get 2 different translations?!? If so, let's pick 1 and stick to it. . . 2) The attributes given to various armaments in the menu rarely matches what you actually get. It will say you're getting one thing, then you get something totally different. Again, annoying. Praise: The gameplay is quick- you change weapons with almost every encounter. There are lots of things to buy from the creepy elephant-man shop owner and most of it is upgradeable. The bosses are not ridiculously powerful and the monsters are actually manageable. The game actually allows you to win on ocassion. The graphics are pretty good and the colors are lively. 
The male character looks realistic and the female looks almost so, except for the fact that her wrists are like a foot long. All in all, I would tell you to rent it first. Otherwise I would tell you to buy Legend of Lagaia. . . Except LOL can't recognize the PS2 mem cards. . . Bummer.",Great game. . . If it were for or PS1,1,"List(okay, whack, first, good, stuff, first, whack, intro, open, movi, thing, realli, need, go, away, awe, littl, stori, even, less, make, want, play, game, crappi, leav, odd, duck, look, thing, terribl, screech, pseudo, song, movi, goe, intro, need, trash, redon, complet, second, whack, camera, angl, realli, lousi, sure, control, camera, spend, time, tri, manipul, camera, look, monster, gener, end, get, whack, area, game, difficult, navig, woth, odd, angl, invari, end, quit, come, back, half, time, camera, even, look, direct, re, run, annoy, third, whack, map, nice, entir, easi, fall, ledg, sure, possibl, fall, cliff, real, life, real, life, control, action, 3, 4, inch, button, sneez, re, histori, last, whack, actual, two, parter, 1, word, screen, acur, 60, time, word, spoken, get, 2, differ, translat, let, pick, 1, stick, 2, attribut, given, variou, armament, menu, rare, match, actual, get, say, re, get, one, thing, get, someth, total, differ, annoy, prais, gameplay, quick, chang, weapon, almost, everi, encount, lot, thing, buy, creepi, eleph, man, shop, owner, upgrad, boss, ridicul, power, monster, actual, manag, game, actual, allow, win, ocass, graphic, pretti, good, color, live, male, charact, look, realist, femal, look, almost, except, fact, wrist, like, foot, long, tell, rent, first, otherwis, tell, buy, legend, lagaia, except, lol, recogn, ps2, mem, card, bummer)"
396656,5.0,0,True,2017-03-13,A10HJXBYIFBY1J,B00W8FYEU2,Heyitsbud777,good game it is fun,Five Stars,0,"List(good, game, fun)"


## Exploratory Analysis

In [16]:
# Let's use Spark SQL for some simple exploratory analysis. First, we need to create a temporary view from the DataFrame.

amazon_review_cleaned.createOrReplaceTempView("amazon_book_reviews")

In [17]:
# Distribution of star ratings (the view name says "book", but the data are the video game reviews loaded earlier)

star_rating = spark.sql('''
  SELECT 
    overall AS star_rating, 
    COUNT(*) AS count 
  FROM
    amazon_book_reviews
  GROUP BY
    overall
  ORDER BY
    overall
''')

display(star_rating)

star_rating,count
1.0,28449
2.0,22327
3.0,45614
4.0,86789
5.0,280104
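
As a quick check on the table above, the share of each rating can be computed directly from the displayed counts. This is a minimal sketch in plain Python; the counts are copied from the output above and the variable names are only illustrative.

```python
# Counts copied from the star_rating output above.
star_counts = {1.0: 28449, 2.0: 22327, 3.0: 45614, 4.0: 86789, 5.0: 280104}

total = sum(star_counts.values())
shares = {star: count / total for star, count in star_counts.items()}

for star, share in sorted(shares.items()):
    print(f"{star:.1f} stars: {share:6.1%}")

# Ratio of 5-star to 1-star reviews, across all reviews (verified or not).
five_to_one = star_counts[5.0] / star_counts[1.0]
print(f"5-star reviews per 1-star review: {five_to_one:.1f}")
```

The distribution is heavily skewed toward 5 stars, which is worth keeping in mind when 1-star and 5-star reviews are later used as the two classes for prediction.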


In [18]:
# Number of reviews over time

review_over_time = spark.sql('''
  SELECT 
    reviewTime AS date, 
    COUNT(*) AS count 
  FROM
    amazon_book_reviews
  WHERE
    reviewTime >= '2015-01-01'
  GROUP BY
    reviewTime
  ORDER BY
    reviewTime
''')

display(review_over_time)

date,count
2015-01-01,432
2015-01-02,382
2015-01-03,501
2015-01-04,389
2015-01-05,392
2015-01-06,375
2015-01-07,436
2015-01-08,303
2015-01-09,397
2015-01-10,248


## Review Score Prediction

For comparison: outside Spark, machine learning in Python is commonly done with scikit-learn (read more: https://scikit-learn.org/stable/user_guide.html), and natural language processing with NLTK (read more: https://www.nltk.org/).

In [21]:
# Extract verified 5-star and 1-star reviews for prediction

# Parenthesize every comparison: in Python, & and | bind more tightly than ==.
prediction_df = amazon_review_cleaned.where( ((amazon_review_cleaned.overall == 1) | (amazon_review_cleaned.overall == 5)) \
                                             & (amazon_review_cleaned.verified == True) )

# This is equivalent to the following Spark SQL query:

prediction_df = spark.sql("SELECT * FROM amazon_book_reviews WHERE (overall = 1 OR overall = 5) AND verified = TRUE")

display(prediction_df)

reviewID,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,label,reviewWordCleaned
479357,1.0,0,True,2014-02-07,A0380485C177Q6QQNJIX,B003N5ZYOG,Franklin Tineo,bad very bad the conect drums is broken,bad very bad the conect drums is broken,0,"List(bad, bad, conect, drum, broken)"
274182,5.0,0,True,2012-10-11,A1033LWHDCAJNQ,B0083CJ2X8,S. Ali Limonadi,"Everything from all the previous Kingdom Hearts games, especially 1 & 2, has been leading up to this. I'm NOT going to spoil anything, but just in case you keep making excuses to not get this game: You find out who the man in the Brown Robe is. You find out every single thing about Organization 13. You find out what Riku is. (a blank blank). You find out who Xemnes is. (you only think you know, but this game will let you know the real deal) 2 of the worlds are waking worlds, not sleeping. And there's a 8th for pure cinematic. To unlock secret ending: Beginners mode is 10 to 13 tropies. Normal is 7 tropies......easy without even trying. Proud mode is just 3, I believe. I play with the 3D ON & all the way up 24/7 on every game, including, this......and I must say that games like this & Paper Mario & Kid Icarus & Luigi's Mansion were made for 3D. there was a few places were the frame rate dropped for 1 to 3 seconds, but if you did a Wireless Update and have either 4.2.0-9U or 4.4.0-10U then you won't have any problem.......and the latest Wireless Update Firmware is currently 4.4.0-10U. But the PS3 & 360 only do up to 24 fps with the 3D ON 24/7......so I have no complaints about the 3DS......especially now that I have Wireless Firmware Update 4.4.0-10U. The 3D Augmented Realty requires 0 cards, but you can use the cards to unlock rare Partners. Flowmotion is one of the easiest things to do in this game. You can Link & Dual Link with Spirits......I ended up just doing Dual Link for crazy new abilities during mostly Boss Battles. (because you have to build up your Link Gauge). There's also Reality Shift, which is more easier to do by just using your fingers on the Touch Screen, but sometimes I would used the Stylus after entering Reality Shift for certain things, like if it's a Sword for cutting the White Chains & Links on th touch screen. 
FYI According to the guy that makes every Kingdom Hearts game......""Reality Shift gives you a glimpse of a new gameplay that is exclusive to Kingdom Hearts 3 Only""..........at first this didn't make sense to me because each World has a completely different gameplay functionality for Reality Shift..........but then I thought about what they all have in common, which is Dual Screen interaction......so basically Kingdom Hearts 3 is either coming out on the Wii U or 3DS or both. Anyways everything has been leading up to Kingdom Hearts 3D, so I don't know what they are going to do in Kingdom Hearts 3, because everything has already been done in Kingdom Hearts 3D. But they do have a Tutorial for the previous 6 Kingdom Hearts games to read inside Kingdom Hearts 3D, so it's okay to start off with this......but I would still suggest playing Kingdom Hearts 1 & 2 as well as 3D.......and I hope 3 will be good. The Mark of Mastery comes with 5 3D AR Cards, like the Limited Edition $40 game at WalMart.......but you also get 12 artwork stuff & a great Kingdom Hearts 10th Anniversary 3DS case, which goes great with such 3DS colors as Red & Purple & maybe also White......but I have Blue and it doesn't look that good in bright areas, buti still love wearing it on my 3DS 24/7 anyways. So besides Heartless & Nobodies, there is also just a walking Heart, as well as ISO's & Spirits & Nightmares & Dream Eaters & Reapers. Basically, they are clearly making a sequel for the World Ends With You, for the 3DS. Like Kingdom Hearts 1 & 2......3D is also a main game. Quite frankly, I now don't know how I will play this game without Flowmotion & Reality Shift & Dual Links in the upcoming Kingdom Hearts 3........so I guess you could say that this game has ruin me, LOL. This is a must have game, no matter what kind of gamer you are. And everything gets revealed in this game, which makes me question the upcoming 3, but 3D does leave some things for you to want to see in the last game. 
And they now sell this game by itself for $30 at places like GameStop and probably also on amazon, and so fort. And you are really going to change your mind about hating 3D after you play this entire game with the 3D ON & all the way UP. This game is simply so beatiful in every way, I don't know how else to describe it with the 3D all the way up, without you seeing it for yourself. Even though both HD & Monchromatic 3D are just passing fads that are going to die forever finally. But both IMAX & Stereoscopic 3D are here forever since they first arrived, and the 3DS is Stereoscopic 3D, and like all Stereoscopic Movies, it's 60 fps per video image. So enjoy the great Stereoscopic 3D 60 fps per eye in-sinc experience. Oh yeah, and the 6 to 7 speaker sound the 3DS does thru the Earphone Jack is just Wonderful in Kingdom Hearts 3D, but I have heard better Surround Sound in a few other games, i.e. Zelda Ocarina of Time 3D (Best Surround Sound EVER) & Thor 3DS (Second best Surround Sound) & Dillon's Rolling Western (best Surround Sound for a Digital Only game) Also Kingdom Hearts 3D is getting a Digital release in the eShop to download onto your 3DS instead, if you can wait up to around 3 months or 6. FYI I use ""PLANTRONICS GAMECOM 777"" for my Surround Sound gaming and I just leave the separate Microphone jack plug unplugged, because I am only using the 7.1 Surround Sound Earphone functionality......plus the 3DS already has a Built-in surround sound microphone to pick up sound in beatiful 6 or 7 speaker sound, like in 3D Video Recording. 
If you want other Steroscopic 3D Surround Sound game suggestions: Kid Icarus: Uprising King of Pirates Rodea the Sky Soldier Dillon's Rolling Western Rhythm Thief and the Emperor's Treasure Nano Assault EX Mario Kart 7 Ketzal's Corridors Mutant Mudds (3DS, it has 40 extra levels the PC will never have) Pushmo Cave Story 3D (the Surround Sound is limited in a side-scroller, so you can just use the Speakerphones, which in the 3DS does Virtual Surround Sound, because its just mere Speakerphones). Sonic & All Stars Racing Transformed (3DS) Luigi's Mansion: Dark Moon Lego City Undercover (3DS) Tales of the Abyss 3DS Lego Batman 2 (3DS) (but same old Lego gameplay) Super Smash Bros. 3DS Resident Evil: Revelations Metal Gear Solid: Snake Eater Star Fox 64 3D Crushmo Shinobi 3DS (it's in Stereo) Epic Mickey: Power of illusion (side-scroller, so just use the Speakerphones, I guess) SpeedX 3D (it's just Mono) Virtual Racing Sakura Samurai: Art of the Sword Samurai Sword Destiny Heavy Fire: Special Operations 3D The Denpa Men: They came by wave Pokemon Mystery Dungeon 3DS Code of Princess Yo-Kai Watch",The best or second best Kingdom Hearts EVER!,0,"List(everyth, previou, kingdom, heart, game, especi, 1, 2, lead, m, go, spoil, anyth, case, keep, make, excus, get, game, find, man, brown, robe, find, everi, singl, thing, organ, 13, find, riku, blank, blank, find, xemn, think, know, game, let, know, real, deal, 2, world, wake, world, sleep, 8th, pure, cinemat, unlock, secret, end, beginn, mode, 10, 13, tropi, normal, 7, tropi, easi, without, even, tri, proud, mode, 3, believ, play, 3d, way, 24, 7, everi, game, includ, must, say, game, like, paper, mario, kid, icaru, luigi, mansion, made, 3d, place, frame, rate, drop, 1, 3, second, wireless, updat, either, 4, 2, 0, 9u, 4, 4, 0, 10u, won, problem, latest, wireless, updat, firmwar, current, 4, 4, 0, 10u, ps3, 360, 24, fp, 3d, 24, 7, complaint, 3d, especi, wireless, firmwar, updat, 4, 4, 0, 10u, 3d, augment, realti, requir, 0, card, 
use, card, unlock, rare, partner, flowmot, one, easiest, thing, game, link, dual, link, spirit, end, dual, link, crazi, new, abil, mostli, boss, battl, build, link, gaug, also, realiti, shift, easier, use, finger, touch, screen, sometim, use, stylu, enter, realiti, shift, certain, thing, like, sword, cut, white, chain, link, th, touch, screen, fyi, accord, guy, make, everi, kingdom, heart, game, realiti, shift, give, glimps, new, gameplay, exclus, kingdom, heart, 3, first, didn, make, sens, world, complet, differ, gameplay, function, realiti, shift, thought, common, dual, screen, interact, basic, kingdom, heart, 3, either, come, wii, u, 3d, anyway, everyth, lead, kingdom, heart, 3d, know, go, kingdom, heart, 3, everyth, alreadi, done, kingdom, heart, 3d, tutori, previou, 6, kingdom, heart, game, read, insid, kingdom, heart, 3d, okay, start, still, suggest, play, kingdom, heart, 1, 2, well, 3d, hope, 3, good, mark, masteri, come, 5, 3d, ar, card, like, limit, edit, 40, game, walmart, also, get, 12, artwork, stuff, great, kingdom, heart, 10th, anniversari, 3d, case, goe, great, 3d, color, red, purpl, mayb, also, white, blue, doesn, look, good, bright, area, buti, still, love, wear, 3d, 24, 7, anyway, besid, heartless, nobodi, also, walk, heart, well, iso, spirit, nightmar, dream, eater, reaper, basic, clearli, make, sequel, world, end, 3d, like, kingdom, heart, 1, 2, 3d, also, main, game, quit, frankli, know, play, game, without, flowmot, realiti, shift, dual, link, upcom, kingdom, heart, 3, guess, say, game, ruin, lol, must, game, matter, kind, gamer, everyth, get, reveal, game, make, question, upcom, 3, 3d, leav, thing, want, see, last, game, sell, game, 30, place, like, gamestop, probabl, also, amazon, fort, realli, go, chang, mind, hate, 3d, play, entir, game, 3d, way, game, simpli, beati, everi, way, know, els, describ, 3d, way, without, see, even, though, hd, monchromat, 3d, pass, fad, go, die, forev, final, imax, stereoscop, 3d, forev, sinc, first, arriv, 3d, 
stereoscop, 3d, like, stereoscop, movi, 60, fp, per, video, imag, enjoy, great, stereoscop, 3d, 60, fp, per, eye, sinc, experi, oh, yeah, 6, 7, speaker, sound, 3d, thru, earphon, jack, wonder, kingdom, heart, 3d, heard, better, surround, sound, game, e, zelda, ocarina, time, 3d, best, surround, sound, ever, thor, 3d, second, best, surround, sound, dillon, roll, western, best, surround, sound, digit, game, also, kingdom, heart, 3d, get, digit, releas, eshop, download, onto, 3d, instead, wait, around, 3, month, 6, fyi, use, plantron, gamecom, 777, surround, sound, game, leav, separ, microphon, jack, plug, unplug, use, 7, 1, surround, sound, earphon, function, plu, 3d, alreadi, built, surround, sound, microphon, pick, sound, beati, 6, 7, speaker, sound, like, 3d, video, record, want, steroscop, 3d, surround, sound, game, suggest, kid, icaru, upris, king, pirat, rodea, sky, soldier, dillon, roll, western, rhythm, thief, emperor, treasur, nano, assault, ex, mario, kart, 7, ketzal, corridor, mutant, mudd, 3d, 40, extra, level, pc, never, pushmo, cave, stori, 3d, surround, sound, limit, side, scroller, use, speakerphon, 3d, virtual, surround, sound, mere, speakerphon, sonic, star, race, transform, 3d, luigi, mansion, dark, moon, lego, citi, undercov, 3d, tale, abyss, 3d, lego, batman, 2, 3d, old, lego, gameplay, super, smash, bro, 3d, resid, evil, revel, metal, gear, solid, snake, eater, star, fox, 64, 3d, crushmo, shinobi, 3d, stereo, epic, mickey, power, illus, side, scroller, use, speakerphon, guess, speedx, 3d, mono, virtual, race, sakura, samurai, art, sword, samurai, sword, destini, heavi, fire, special, oper, 3d, denpa, men, came, wave, pokemon, mysteri, dungeon, 3d, code, princess, yo, kai, watch)"
396656,5.0,0,True,2017-03-13,A10HJXBYIFBY1J,B00W8FYEU2,Heyitsbud777,good game it is fun,Five Stars,0,"List(good, game, fun)"
84232,5.0,0,True,2010-12-27,A10K4042PZXP0X,B000FQBF1M,Kenny Parker,"Similar in scope to Call of Duty. Some of the scenarios require problem solving skills to defeat the foes. For the most part, this is a rock em sock em kick ass game that will get your heart beat up and challenge your trigger finger. I like it enough to pre-order KILLZONE 3.",Killzone GREAT,0,"List(similar, scope, call, duti, scenario, requir, problem, solv, skill, defeat, foe, part, rock, em, sock, em, kick, ass, game, get, heart, beat, challeng, trigger, finger, like, enough, pre, order, killzon, 3)"
191872,5.0,0,True,2014-11-10,A10QJJEI0XKZFD,B002912TAM,Beats,Good game. Hard as I last played it for the wii (2009). I really enjoyed it.,love it.,0,"List(good, game, hard, last, play, wii, 2009, realli, enjoy)"
293378,5.0,0,True,2014-11-18,A10WPJ75QGIIRS,B00BI83EVU,TexasDude,"Excellent fun. I hope they make even better sequels. The online multiplayer has some issues, but that did not detract from the single player experience. The possibility of a random person jumping in your game and coming after you added a bit of spice.",What's not to love about a detailed open world?,0,"List(excel, fun, hope, make, even, better, sequel, onlin, multiplay, issu, detract, singl, player, experi, possibl, random, person, jump, game, come, ad, bit, spice)"
341112,5.0,0,True,2015-12-10,A10XOZQKSOHRUL,B00HKCIT0O,charlie b,good times,Five Stars,0,"List(good, time)"
248057,5.0,0,True,2015-04-06,A10YGP23KD5P67,B0053B66KE,Jinyi Li,Good,Five Stars,0,List(good)
184203,5.0,0,True,2014-12-15,A110WKRYO065GH,B001SGZL2W,Etelvino Centeno,Excellent,Five Stars,0,List(excel)
296830,5.0,0,True,2015-03-27,A116358ORJDPOV,B00BU3ZLJQ,B.Gunn,My son loved it.,Five Stars,0,"List(son, love)"


In [22]:
# Take a stratified sample

print("Number of rows before sampling: ", prediction_df.count())
prediction_df_sampled = prediction_df.sampleBy("overall", fractions = {1:0.001, 5:0.001}, seed = 16).cache()
print("Number of rows after sampling: ", prediction_df_sampled.count())
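
To make the sampling step more concrete, here is a minimal sketch in plain Python of what stratified sampling by fraction does, as I understand `sampleBy`: each row is kept independently with the probability assigned to its stratum, so the sample sizes are approximate rather than exact. The helper `sample_by` and the toy rows are hypothetical and are not part of the notebook.

```python
import random

def sample_by(rows, key, fractions, seed):
    """Keep each row independently with probability fractions[row[key]].

    Rows whose stratum is missing from `fractions` are dropped (fraction 0).
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(row[key], 0.0)]

# Hypothetical toy data: 1000 one-star, 9000 five-star and 500 three-star rows.
rows = [{"overall": 1.0}] * 1000 + [{"overall": 5.0}] * 9000 + [{"overall": 3.0}] * 500

sampled = sample_by(rows, "overall", {1.0: 0.1, 5.0: 0.1}, seed=16)

# Count the sampled rows per stratum.
counts = {}
for row in sampled:
    counts[row["overall"]] = counts.get(row["overall"], 0) + 1
print(counts)
```

Because the same fraction is used for both strata, the sample keeps roughly the class proportions of the filtered data; fixing the seed makes the sample reproducible.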

### TF-IDF with Hashing Trick + Random Forest

In [24]:
# Copy prediction data

prediction_tfidf_hash = prediction_df_sampled.select('*')

In [25]:
# Extract bigrams and combine them with the unigrams

from pyspark.ml.feature import NGram
from pyspark.sql.functions import array_union

ngram = NGram(n = 2, inputCol="reviewWordCleaned", outputCol="reviewBigrams")
prediction_tfidf_hash = ngram.transform(prediction_tfidf_hash)

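# Note: array_union drops duplicate elements, so a word repeated within a review appears only once in reviewNgrams.
prediction_tfidf_hash = prediction_tfidf_hash.withColumn("reviewNgrams", \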
                                                         array_union(prediction_tfidf_hash.reviewWordCleaned, \
                                                                     prediction_tfidf_hash.reviewBigrams))
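
The bigram and union steps can be sketched in plain Python to show what ends up in `reviewNgrams`. The helper names `bigrams` and `union_keep_order` are hypothetical; they only mimic the behaviour of `NGram(n=2)` and `array_union` on a single row.

```python
def bigrams(tokens):
    """Join each pair of consecutive tokens with a space, like NGram(n=2)."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def union_keep_order(left, right):
    """Concatenate two lists, dropping duplicates, keeping first-occurrence order."""
    return list(dict.fromkeys(left + right))

tokens = ["great", "game"]
print(union_keep_order(tokens, bigrams(tokens)))
# → ['great', 'game', 'great game']

repeated = ["good", "good", "game"]
print(bigrams(repeated))
# → ['good good', 'good game']
print(union_keep_order(repeated, bigrams(repeated)))
# → ['good', 'game', 'good good', 'good game']
```

The second example shows the side effect of the union: the repeated word `good` is counted once, so term frequencies computed from `reviewNgrams` reflect distinct n-grams per review rather than raw counts.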

In [26]:
# Compute TF-IDF values for unigrams and bigrams

from pyspark.ml.feature import HashingTF, IDF

hashtf = HashingTF(numFeatures=2**12, inputCol="reviewNgrams", outputCol='TF')
tf = hashtf.transform(prediction_tfidf_hash)
idf = IDF(minDocFreq=3, inputCol="TF", outputCol="TF-IDF")
idfModel = idf.fit(tf)
prediction_tfidf_hash = idfModel.transform(tf)
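
The cell above can be sketched in plain Python to show the idea behind the hashing trick and the IDF weighting. This is an illustration under stated assumptions, not Spark's implementation: `zlib.crc32` stands in for the MurmurHash3 function Spark uses, so the bucket indices will differ from the `TF` column, and the helpers `hashing_tf` and `idf_weights` and the toy corpus are hypothetical. The IDF formula, log((m + 1) / (df + 1)) with 0 for terms seen in fewer than `minDocFreq` documents, follows the MLlib documentation.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 2 ** 12  # same size as numFeatures in the cell above

def hashing_tf(terms, num_features=NUM_FEATURES):
    """Map each term to a bucket index and count occurrences per bucket.

    zlib.crc32 is a stand-in hash for illustration only; Spark uses
    MurmurHash3, so these indices will not match the notebook output.
    """
    return Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in terms)

def idf_weights(tf_vectors, min_doc_freq=3):
    """IDF per bucket: log((m + 1) / (df + 1)), or 0 when df < min_doc_freq."""
    m = len(tf_vectors)
    df = Counter(idx for vec in tf_vectors for idx in vec)
    return {idx: math.log((m + 1) / (d + 1)) if d >= min_doc_freq else 0.0
            for idx, d in df.items()}

# Hypothetical toy corpus of unigrams plus bigrams.
docs = [["great", "game", "great game"], ["good", "game", "good game"],
        ["great", "fun"], ["game", "fun"]]
tf_vectors = [hashing_tf(doc) for doc in docs]
idf = idf_weights(tf_vectors, min_doc_freq=2)
tfidf = [{idx: count * idf[idx] for idx, count in vec.items()} for vec in tf_vectors]
print(tfidf[0])
```

Two things are worth noticing in the output displayed below. Distinct terms can hash to the same bucket, which is one way a count above 1.0 can appear in `TF` even after duplicates were removed. The largest IDF value shown, 4.174387269895637, equals ln(65); that would be consistent with, for instance, 259 sampled reviews and a bucket seen in exactly 3 of them, although the sample size is not shown in the notebook.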

In [27]:
display(prediction_tfidf_hash)

reviewID,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,label,reviewWordCleaned,reviewBigrams,reviewNgrams,TF,TF-IDF
321107,5.0,0,True,2014-07-04,ARW4UBHTV3OCS,B00E4MQODC,Anon,"It work as well as my G700, but with battery life that makes it actually practical to use as a wireless mouse. Great shape, textures, button layout, and tracking. The only things the G700 has on it are wheel functions: I miss the express scroll function a lot, and not having side scrolling on the wheel is slightly missed.",Fantastic wireless gaming mouse.,0,"List(work, well, g700, batteri, life, make, actual, practic, use, wireless, mous, great, shape, textur, button, layout, track, thing, g700, wheel, function, miss, express, scroll, function, lot, side, scroll, wheel, slightli, miss)","List(work well, well g700, g700 batteri, batteri life, life make, make actual, actual practic, practic use, use wireless, wireless mous, mous great, great shape, shape textur, textur button, button layout, layout track, track thing, thing g700, g700 wheel, wheel function, function miss, miss express, express scroll, scroll function, function lot, lot side, side scroll, scroll wheel, wheel slightli, slightli miss)","List(work, well, g700, batteri, life, make, actual, practic, use, wireless, mous, great, shape, textur, button, layout, track, thing, wheel, function, miss, express, scroll, lot, side, slightli, work well, well g700, g700 batteri, batteri life, life make, make actual, actual practic, practic use, use wireless, wireless mous, mous great, great shape, shape textur, textur button, button layout, layout track, track thing, thing g700, g700 wheel, wheel function, function miss, miss express, express scroll, scroll function, function lot, lot side, side scroll, scroll wheel, wheel slightli, slightli miss)","List(0, 4096, List(60, 88, 116, 199, 393, 413, 421, 433, 485, 513, 762, 803, 881, 1152, 1175, 1182, 1227, 1259, 1287, 1288, 1333, 1344, 1382, 1457, 1491, 1575, 1645, 1654, 1657, 1844, 2064, 2098, 2100, 2114, 2192, 2269, 2325, 2341, 2448, 2605, 2631, 2811, 2899, 3359, 3437, 3454, 3489, 3633, 3701, 3809, 3822, 3894, 
3929, 3935, 4068), List(1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 4096, List(60, 88, 116, 199, 393, 413, 421, 433, 485, 513, 762, 803, 881, 1152, 1175, 1182, 1227, 1259, 1287, 1288, 1333, 1344, 1382, 1457, 1491, 1575, 1645, 1654, 1657, 1844, 2064, 2098, 2100, 2114, 2192, 2269, 2325, 2341, 2448, 2605, 2631, 2811, 2899, 3359, 3437, 3454, 3489, 3633, 3701, 3809, 3822, 3894, 3929, 3935, 4068), List(0.0, 2.7880929087757464, 0.0, 4.174387269895637, 3.7689221617874726, 1.9771626925594177, 3.9512437185814275, 2.670309873119363, 3.9512437185814275, 3.9512437185814275, 3.7689221617874726, 4.174387269895637, 3.481240089335692, 3.258096538021482, 3.3634570536793085, 3.9512437185814275, 0.0, 3.9512437185814275, 3.481240089335692, 3.481240089335692, 3.7689221617874726, 4.174387269895637, 3.6147714819602146, 0.0, 3.9512437185814275, 1.9497637183713032, 3.9512437185814275, 3.7689221617874726, 3.6147714819602146, 0.0, 3.7689221617874726, 3.6147714819602146, 4.174387269895637, 0.0, 3.6147714819602146, 0.0, 3.3634570536793085, 3.3634570536793085, 3.7689221617874726, 2.302585092994046, 3.258096538021482, 3.481240089335692, 2.7880929087757464, 3.6147714819602146, 0.0, 0.0, 3.7689221617874726, 3.3634570536793085, 2.2648447650111985, 0.0, 1.1539623837512747, 3.6147714819602146, 0.0, 2.3826278006675823, 3.258096538021482))"
434816,5.0,0,True,2017-01-29,A1G5EKIXBI6NRE,B01CKH0W94,Mitch,Product is high quality. Quick Effective Seller.,Five Stars,0,"List(product, high, qualiti, quick, effect, seller)","List(product high, high qualiti, qualiti quick, quick effect, effect seller)","List(product, high, qualiti, quick, effect, seller, product high, high qualiti, qualiti quick, quick effect, effect seller)","List(0, 4096, List(339, 462, 852, 1432, 1539, 1606, 2812, 3252, 3567, 3575, 3727), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 4096, List(339, 462, 852, 1432, 1539, 1606, 2812, 3252, 3567, 3575, 3727), List(4.174387269895637, 3.3634570536793085, 3.481240089335692, 3.481240089335692, 2.727468286959312, 3.6147714819602146, 3.6147714819602146, 3.3634570536793085, 3.9512437185814275, 0.0, 0.7243997240640498))"
410842,5.0,0,True,2015-11-20,APGYGCZ3H17WM,B00ZQ2YFFI,Lion,Great game,I love it,0,"List(great, game)",List(great game),"List(great, game, great game)","List(0, 4096, List(3036, 3727, 3822), List(1.0, 1.0, 1.0))","List(0, 4096, List(3036, 3727, 3822), List(2.1933858010290535, 0.7243997240640498, 1.1539623837512747))"
338533,5.0,0,True,2017-04-10,A3INITPVRAXFOK,B00GZ1GUNO,T Reynolds,"Really fun, great graphics, I like how your skills build as the game progresses. I would have liked to see more puzzles to figure out and less people trying to kill Lara, but it was a great survival game.",Great,0,"List(realli, fun, great, graphic, like, skill, build, game, progress, like, see, puzzl, figur, less, peopl, tri, kill, lara, great, surviv, game)","List(realli fun, fun great, great graphic, graphic like, like skill, skill build, build game, game progress, progress like, like see, see puzzl, puzzl figur, figur less, less peopl, peopl tri, tri kill, kill lara, lara great, great surviv, surviv game)","List(realli, fun, great, graphic, like, skill, build, game, progress, see, puzzl, figur, less, peopl, tri, kill, lara, surviv, realli fun, fun great, great graphic, graphic like, like skill, skill build, build game, game progress, progress like, like see, see puzzl, puzzl figur, figur less, less peopl, peopl tri, tri kill, kill lara, lara great, great surviv, surviv game)","List(0, 4096, List(11, 188, 191, 346, 411, 593, 630, 692, 1055, 1612, 1952, 2135, 2206, 2252, 2408, 2449, 2589, 2603, 2607, 2719, 2741, 2742, 2793, 2840, 2856, 2965, 3104, 3185, 3227, 3265, 3370, 3458, 3543, 3608, 3727, 3754, 3822, 3896), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 4096, List(11, 188, 191, 346, 411, 593, 630, 692, 1055, 1612, 1952, 2135, 2206, 2252, 2408, 2449, 2589, 2603, 2607, 2719, 2741, 2742, 2793, 2840, 2856, 2965, 3104, 3185, 3227, 3265, 3370, 3458, 3543, 3608, 3727, 3754, 3822, 3896), List(0.0, 3.7689221617874726, 3.481240089335692, 2.727468286959312, 3.6147714819602146, 3.7689221617874726, 3.9512437185814275, 3.6147714819602146, 0.0, 3.9512437185814275, 2.5161591932921046, 0.0, 3.7689221617874726, 0.0, 4.174387269895637, 3.9512437185814275, 
3.7689221617874726, 0.0, 2.302585092994046, 2.670309873119363, 3.9512437185814275, 0.0, 4.174387269895637, 1.466337068793427, 2.094945728215801, 3.258096538021482, 3.481240089335692, 3.3634570536793085, 3.481240089335692, 0.0, 3.6147714819602146, 1.6094379124341003, 3.7689221617874726, 0.0, 0.7243997240640498, 3.9512437185814275, 1.1539623837512747, 0.0))"
280157,5.0,0,True,2014-12-07,A3R0LRE7XSL45F,B009AGXH64,Mukai,I owned a PS3 since 2008 and it was great but now I needed something new so I picked up a Wii U. The thing I like the Wii U is the gamepad because it's simething different and I always wanted like a tablet lol. The console is small and quite and doesn't overheat. Gamepad is fun to use and slick with all the features can do and feels comfortable when holding it. I see myself playing this for a long time since I bought it for Super Smash Bros. :D 5/5 in my opinion!,A happy costumer : ),0,"List(own, ps3, sinc, 2008, great, need, someth, new, pick, wii, u, thing, like, wii, u, gamepad, simeth, differ, alway, want, like, tablet, lol, consol, small, quit, doesn, overheat, gamepad, fun, use, slick, featur, feel, comfort, hold, see, play, long, time, sinc, bought, super, smash, bro, d, 5, 5, opinion)","List(own ps3, ps3 sinc, sinc 2008, 2008 great, great need, need someth, someth new, new pick, pick wii, wii u, u thing, thing like, like wii, wii u, u gamepad, gamepad simeth, simeth differ, differ alway, alway want, want like, like tablet, tablet lol, lol consol, consol small, small quit, quit doesn, doesn overheat, overheat gamepad, gamepad fun, fun use, use slick, slick featur, featur feel, feel comfort, comfort hold, hold see, see play, play long, long time, time sinc, sinc bought, bought super, super smash, smash bro, bro d, d 5, 5 5, 5 opinion)","List(own, ps3, sinc, 2008, great, need, someth, new, pick, wii, u, thing, like, gamepad, simeth, differ, alway, want, tablet, lol, consol, small, quit, doesn, overheat, fun, use, slick, featur, feel, comfort, hold, see, play, long, time, bought, super, smash, bro, d, 5, opinion, own ps3, ps3 sinc, sinc 2008, 2008 great, great need, need someth, someth new, new pick, pick wii, wii u, u thing, thing like, like wii, u gamepad, gamepad simeth, simeth differ, differ alway, alway want, want like, like tablet, tablet lol, lol consol, consol small, small quit, quit doesn, 
doesn overheat, overheat gamepad, gamepad fun, fun use, use slick, slick featur, featur feel, feel comfort, comfort hold, hold see, see play, play long, long time, time sinc, sinc bought, bought super, super smash, smash bro, bro d, d 5, 5 5, 5 opinion)","List(0, 4096, List(6, 20, 56, 100, 189, 259, 298, 317, 346, 362, 413, 433, 459, 492, 555, 590, 677, 705, 795, 834, 849, 871, 939, 995, 1020, 1069, 1101, 1152, 1156, 1209, 1220, 1241, 1281, 1288, 1447, 1460, 1512, 1537, 1623, 1669, 1672, 1702, 1779, 1820, 1840, 1876, 1915, 1938, 1975, 2180, 2241, 2433, 2517, 2589, 2598, 2607, 2631, 2640, 2644, 2655, 2733, 2740, 2750, 2785, 2856, 2906, 2912, 2944, 2949, 3020, 3150, 3215, 3223, 3266, 3458, 3496, 3514, 3622, 3626, 3729, 3755, 3775, 3811, 3817, 3822, 3844, 3881, 3926, 3963), List(1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 4096, List(6, 20, 56, 100, 189, 259, 298, 317, 346, 362, 413, 433, 459, 492, 555, 590, 677, 705, 795, 834, 849, 871, 939, 995, 1020, 1069, 1101, 1152, 1156, 1209, 1220, 1241, 1281, 1288, 1447, 1460, 1512, 1537, 1623, 1669, 1672, 1702, 1779, 1820, 1840, 1876, 1915, 1938, 1975, 2180, 2241, 2433, 2517, 2589, 2598, 2607, 2631, 2640, 2644, 2655, 2733, 2740, 2750, 2785, 2856, 2906, 2912, 2944, 2949, 3020, 3150, 3215, 3223, 3266, 3458, 3496, 3514, 3622, 3626, 3729, 3755, 3775, 3811, 3817, 3822, 3844, 3881, 3926, 3963), List(3.7689221617874726, 3.3634570536793085, 3.7689221617874726, 7.537844323574945, 3.1627863582171574, 4.174387269895637, 3.9512437185814275, 2.921624301400269, 2.727468286959312, 3.7689221617874726, 1.9771626925594177, 
2.670309873119363, 2.6162426518490873, 0.0, 3.7689221617874726, 3.6147714819602146, 3.3634570536793085, 0.0, 3.6147714819602146, 0.0, 3.481240089335692, 2.8526314299133175, 3.6147714819602146, 2.995732273553991, 0.0, 0.0, 1.5533484457830569, 3.258096538021482, 4.174387269895637, 3.9512437185814275, 0.0, 2.921624301400269, 3.6147714819602146, 3.481240089335692, 4.174387269895637, 0.0, 3.7689221617874726, 4.174387269895637, 3.7689221617874726, 3.1627863582171574, 3.9512437185814275, 3.3634570536793085, 4.174387269895637, 3.7689221617874726, 2.6162426518490873, 3.258096538021482, 0.0, 3.258096538021482, 3.9512437185814275, 3.9512437185814275, 3.7689221617874726, 3.9512437185814275, 3.1627863582171574, 3.7689221617874726, 0.0, 2.302585092994046, 3.258096538021482, 3.6147714819602146, 3.7689221617874726, 4.174387269895637, 1.9771626925594177, 0.0, 0.0, 0.0, 2.094945728215801, 0.0, 4.174387269895637, 0.0, 3.3634570536793085, 0.0, 3.6147714819602146, 0.0, 3.7689221617874726, 3.7689221617874726, 1.6094379124341003, 3.6147714819602146, 2.8526314299133175, 2.5161591932921046, 0.0, 0.0, 4.174387269895637, 3.7689221617874726, 0.0, 2.5649493574615367, 1.1539623837512747, 4.174387269895637, 0.0, 4.174387269895637, 3.9512437185814275))"
198505,5.0,0,True,2010-10-13,AVSYVX06HV7PV,B002BSA20M,Bikran Sandhu,never played halo before this. now i wanna play all 4 of the previous games,amazing,0,"List(never, play, halo, wanna, play, 4, previou, game)","List(never play, play halo, halo wanna, wanna play, play 4, 4 previou, previou game)","List(never, play, halo, wanna, 4, previou, game, never play, play halo, halo wanna, wanna play, play 4, 4 previou, previou game)","List(0, 4096, List(479, 518, 747, 770, 1001, 1101, 1360, 1877, 2290, 3081, 3296, 3727, 3973, 3993), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 4096, List(479, 518, 747, 770, 1001, 1101, 1360, 1877, 2290, 3081, 3296, 3727, 3973, 3993), List(0.0, 3.9512437185814275, 3.9512437185814275, 0.0, 3.1627863582171574, 1.5533484457830569, 4.174387269895637, 3.7689221617874726, 3.3634570536793085, 2.8526314299133175, 3.481240089335692, 0.7243997240640498, 3.9512437185814275, 3.258096538021482))"
269034,5.0,0,True,2012-08-17,A1QS9JL35MX6WF,B0074LIX16,Mora,excelentes graficos. lo recomiendo. mi hijo le gusta mucho el juego y es divertido. tiene buen sonido y es muy rapido en xbox live,Excelente,0,"List(excelent, grafico, lo, recomiendo, mi, hijo, le, gusta, mucho, el, juego, y, es, divertido, tien, buen, sonido, y, es, muy, rapido, en, xbox, live)","List(excelent grafico, grafico lo, lo recomiendo, recomiendo mi, mi hijo, hijo le, le gusta, gusta mucho, mucho el, el juego, juego y, y es, es divertido, divertido tien, tien buen, buen sonido, sonido y, y es, es muy, muy rapido, rapido en, en xbox, xbox live)","List(excelent, grafico, lo, recomiendo, mi, hijo, le, gusta, mucho, el, juego, y, es, divertido, tien, buen, sonido, muy, rapido, en, xbox, live, excelent grafico, grafico lo, lo recomiendo, recomiendo mi, mi hijo, hijo le, le gusta, gusta mucho, mucho el, el juego, juego y, y es, es divertido, divertido tien, tien buen, buen sonido, sonido y, es muy, muy rapido, rapido en, en xbox, xbox live)","List(0, 4096, List(37, 38, 285, 387, 530, 570, 989, 1019, 1075, 1124, 1193, 1196, 1365, 1399, 1404, 1519, 1538, 1589, 1665, 1855, 1972, 1975, 2121, 2126, 2156, 2244, 2448, 2505, 2656, 2790, 2932, 3250, 3363, 3388, 3533, 3534, 3625, 3736, 3745, 3814, 3823, 3859, 4034), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 4096, List(37, 38, 285, 387, 530, 570, 989, 1019, 1075, 1124, 1193, 1196, 1365, 1399, 1404, 1519, 1538, 1589, 1665, 1855, 1972, 1975, 2121, 2126, 2156, 2244, 2448, 2505, 2656, 2790, 2932, 3250, 3363, 3388, 3533, 3534, 3625, 3736, 3745, 3814, 3823, 3859, 4034), List(3.9512437185814275, 0.0, 4.174387269895637, 4.174387269895637, 3.481240089335692, 0.0, 4.174387269895637, 3.3634570536793085, 0.0, 0.0, 3.6147714819602146, 4.174387269895637, 0.0, 3.9512437185814275, 
3.9512437185814275, 4.174387269895637, 3.481240089335692, 3.3634570536793085, 3.7689221617874726, 3.9512437185814275, 3.3634570536793085, 3.9512437185814275, 3.0757749812275277, 3.481240089335692, 0.0, 3.7689221617874726, 3.7689221617874726, 0.0, 0.0, 4.174387269895637, 0.0, 0.0, 8.348774539791274, 0.0, 3.7689221617874726, 0.0, 3.6147714819602146, 3.258096538021482, 3.6147714819602146, 0.0, 3.9512437185814275, 4.174387269895637, 0.0))"
134516,5.0,0,True,2017-03-16,A1IEMXUW2UR1ZO,B0015AARJI,Jesus Granados Jr,"Delivered on time, product itself works flawless.",Five Stars,0,"List(deliv, time, product, work, flawless)","List(deliv time, time product, product work, work flawless)","List(deliv, time, product, work, flawless, deliv time, time product, product work, work flawless)","List(0, 4096, List(155, 929, 1348, 1575, 2117, 2698, 2722, 2733, 3727), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 4096, List(155, 929, 1348, 1575, 2117, 2698, 2722, 2733, 3727), List(0.0, 3.9512437185814275, 3.7689221617874726, 1.9497637183713032, 3.6147714819602146, 4.174387269895637, 0.0, 1.9771626925594177, 0.7243997240640498))"
241187,5.0,0,True,2014-08-25,A2B1OM6AN53PKQ,B0050SX3KG,Seth Davis,Wonderful!,Awesome!,0,List(wonder),List(),List(wonder),"List(0, 4096, List(3315), List(1.0))","List(0, 4096, List(3315), List(3.258096538021482))"
371821,5.0,0,True,2014-12-22,A2YMIP3W927JDE,B00M5DUOBK,vuskeedoo,I purchased this with the CHERRY MX REDs. I mostly play FPS games and had to get used to the very fast actuation points. This keyboard definitely improved my game.,I purchased this with the CHERRY MX REDs. I ...,0,"List(purchas, cherri, mx, red, mostli, play, fp, game, get, use, fast, actuat, point, keyboard, definit, improv, game)","List(purchas cherri, cherri mx, mx red, red mostli, mostli play, play fp, fp game, game get, get use, use fast, fast actuat, actuat point, point keyboard, keyboard definit, definit improv, improv game)","List(purchas, cherri, mx, red, mostli, play, fp, game, get, use, fast, actuat, point, keyboard, definit, improv, purchas cherri, cherri mx, mx red, red mostli, mostli play, play fp, fp game, game get, get use, use fast, fast actuat, actuat point, point keyboard, keyboard definit, definit improv, improv game)","List(0, 4096, List(34, 100, 283, 316, 413, 435, 458, 515, 535, 689, 920, 1101, 1205, 1248, 1250, 1424, 1546, 1872, 2198, 2309, 2372, 2373, 2391, 2552, 2866, 2911, 3180, 3200, 3639, 3727, 3749, 3806), List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))","List(0, 4096, List(34, 100, 283, 316, 413, 435, 458, 515, 535, 689, 920, 1101, 1205, 1248, 1250, 1424, 1546, 1872, 2198, 2309, 2372, 2373, 2391, 2552, 2866, 2911, 3180, 3200, 3639, 3727, 3749, 3806), List(3.481240089335692, 3.7689221617874726, 3.1627863582171574, 3.9512437185814275, 1.9771626925594177, 0.0, 3.6147714819602146, 4.174387269895637, 2.670309873119363, 0.0, 3.6147714819602146, 1.5533484457830569, 4.174387269895637, 0.0, 3.1627863582171574, 3.3634570536793085, 3.9512437185814275, 3.7689221617874726, 2.995732273553991, 3.481240089335692, 3.7689221617874726, 3.481240089335692, 3.1627863582171574, 3.9512437185814275, 1.923095471289142, 2.995732273553991, 3.9512437185814275, 4.174387269895637, 
3.1627863582171574, 0.7243997240640498, 3.258096538021482, 3.9512437185814275))"


In [28]:
# Random Forest

from pyspark.ml import Pipeline
from pyspark.ml.feature import IndexToString, StringIndexer
from pyspark.ml.classification import RandomForestClassifier

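# Map each distinct "overall" rating to a numeric index (the most frequent rating gets 0.0)
labelIndexer = StringIndexer(inputCol="overall", outputCol="indexedScore").fit(prediction_tfidf_hash)
rf = RandomForestClassifier(labelCol="indexedScore", featuresCol="TF-IDF", numTrees=40)
# Convert predicted indices back to the original rating values
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

pipeline = Pipeline(stages=[labelIndexer, rf, labelConverter])

# 70/30 train/test split; no seed is set, so the split differs between runs
(trainingData, testData) = prediction_tfidf_hash.randomSplit([0.7, 0.3])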

rf_model = pipeline.fit(trainingData)
predictions = rf_model.transform(testData)


In [29]:
display(predictions.select("overall", "indexedScore", "rawPrediction", "probability", "prediction", "predictedLabel"))

overall,indexedScore,rawPrediction,probability,prediction,predictedLabel
5.0,0.0,"List(1, 2, List(), List(39.427905575581036, 0.5720944244189637))","List(1, 2, List(), List(0.9856976393895259, 0.014302360610474093))",0.0,5.0
5.0,0.0,"List(1, 2, List(), List(39.427905575581036, 0.5720944244189637))","List(1, 2, List(), List(0.9856976393895259, 0.014302360610474093))",0.0,5.0
5.0,0.0,"List(1, 2, List(), List(39.24941095192512, 0.7505890480748778))","List(1, 2, List(), List(0.981235273798128, 0.018764726201871945))",0.0,5.0
5.0,0.0,"List(1, 2, List(), List(37.45762574668051, 2.5423742533194824))","List(1, 2, List(), List(0.936440643667013, 0.06355935633298707))",0.0,5.0
5.0,0.0,"List(1, 2, List(), List(39.427905575581036, 0.5720944244189637))","List(1, 2, List(), List(0.9856976393895259, 0.014302360610474093))",0.0,5.0
5.0,0.0,"List(1, 2, List(), List(36.50706720059278, 3.492932799407217))","List(1, 2, List(), List(0.9126766800148196, 0.08732331998518045))",0.0,5.0
5.0,0.0,"List(1, 2, List(), List(39.427905575581036, 0.5720944244189637))","List(1, 2, List(), List(0.9856976393895259, 0.014302360610474093))",0.0,5.0
5.0,0.0,"List(1, 2, List(), List(36.54999480646258, 3.450005193537425))","List(1, 2, List(), List(0.9137498701615645, 0.08625012983843564))",0.0,5.0
5.0,0.0,"List(1, 2, List(), List(37.739793835464155, 2.260206164535844))","List(1, 2, List(), List(0.9434948458866039, 0.05650515411339611))",0.0,5.0
5.0,0.0,"List(1, 2, List(), List(37.482604343205615, 2.51739565679438))","List(1, 2, List(), List(0.9370651085801406, 0.0629348914198595))",0.0,5.0
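
In the output above, each `rawPrediction` vector sums to roughly 40, the `numTrees` value, and `probability` looks like the same vector rescaled to sum to 1. A minimal sketch in plain Python that checks this on the first displayed row (the helper `normalize` is hypothetical, not a Spark API):

```python
import math


def normalize(raw):
    """Scale a list of non-negative scores so that they sum to 1."""
    total = sum(raw)
    return [value / total for value in raw]


# Values copied from the first row of the display output.
raw_prediction = [39.427905575581036, 0.5720944244189637]
reported_probability = [0.9856976393895259, 0.014302360610474093]

probability = normalize(raw_prediction)

# Compare with a relative tolerance to allow for floating-point rounding.
matches = all(
    math.isclose(p, r, rel_tol=1e-9)
    for p, r in zip(probability, reported_probability)
)
print(probability, matches)
```

The same check can be repeated on any other row by swapping in its two vectors.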


In [30]:
# Calculate AUC on the held-out test split

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# The evaluator reads the "rawPrediction" column by default and treats index 1.0 as the positive class
evaluator = BinaryClassificationEvaluator(labelCol="indexedScore", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print("AUC = %g" % auc)
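
Area under the ROC curve can be read as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as half. A sketch of that definition in plain Python, using made-up labels and scores; `pairwise_auc` is a hypothetical helper, and Spark's evaluator may compute the area differently in detail:

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Toy data, invented for illustration only.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.1]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print("AUC = %g" % toy_auc)
```

The double loop is quadratic in the number of rows, so it is only suitable for small illustrations like this one.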

In [31]:
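# Performance evaluation with 5-fold cross-validation

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# An empty grid yields a single parameter map, so only the pipeline's current settings are evaluated
paramGrid = ParamGridBuilder().build()
cv = CrossValidator(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=paramGrid, numFolds=5)
cvModel = cv.fit(prediction_tfidf_hash)

# avgMetrics has one entry per parameter map: the AUC averaged over the 5 folds
print("Average AUC = %g" % cvModel.avgMetrics[0])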
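
The fold logic behind k-fold cross-validation can be sketched in plain Python: each row is held out for validation exactly once and used for training in the other folds. `k_fold_indices` is a hypothetical helper, and the shuffle followed by round-robin assignment is an assumption for illustration, not necessarily how Spark assigns rows to folds:

```python
import random


def k_fold_indices(n_rows, n_folds, seed=0):
    """Yield (training, validation) index lists, one pair per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Round-robin assignment of the shuffled rows to folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]

    for held_out in range(n_folds):
        validation = folds[held_out]
        training = [
            row
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for row in fold
        ]
        yield training, validation


splits = list(k_fold_indices(n_rows=10, n_folds=5))
for training, validation in splits:
    print(sorted(validation), len(training))
```

A model would be fitted on each `training` list and scored on the matching `validation` list, and the five scores averaged.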

### Doc2Vec + Random Forest

In [33]:
# Copy prediction data

prediction_doc2vec = prediction_df_sampled.select('*')

In [34]:
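# Calculate document vectors ("Doc2Vec") with Spark's Word2Vec:
# the fitted model's transform averages the word vectors of each review into one vector

from pyspark.ml.feature import Word2Vec

# vectorSize is left at its default of 100, which matches the vector length in the output
word2Vec = Word2Vec(inputCol="reviewWordCleaned", outputCol="doc2vec")
w2v_model = word2Vec.fit(prediction_doc2vec)

prediction_doc2vec = w2v_model.transform(prediction_doc2vec)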
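
The `doc2vec` column holds one vector per review, obtained by averaging the vectors of the review's words. A sketch of that averaging in plain Python, with made-up 3-dimensional vectors; `average_vectors` and `toy_vectors` are hypothetical, and the sketch assumes every token has a vector:

```python
def average_vectors(tokens, vectors):
    """Average the vectors of all tokens into a single document vector."""
    if not tokens:
        raise ValueError("cannot average an empty token list")

    dimension = len(next(iter(vectors.values())))
    total = [0.0] * dimension
    for token in tokens:
        # Assumes each token is present in the vocabulary.
        for i, value in enumerate(vectors[token]):
            total[i] += value
    return [value / len(tokens) for value in total]


# Invented word vectors, for illustration only.
toy_vectors = {
    "great": [1.0, 0.0, 2.0],
    "game": [3.0, 2.0, 0.0],
}

doc_vector = average_vectors(["great", "game"], toy_vectors)
print(doc_vector)
```

Because the result is an average, word order is ignored: two reviews with the same words in a different order get the same vector.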

In [35]:
display(prediction_doc2vec)

reviewID,overall,vote,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,label,reviewWordCleaned,doc2vec
321107,5.0,0,True,2014-07-04,ARW4UBHTV3OCS,B00E4MQODC,Anon,"It work as well as my G700, but with battery life that makes it actually practical to use as a wireless mouse. Great shape, textures, button layout, and tracking. The only things the G700 has on it are wheel functions: I miss the express scroll function a lot, and not having side scrolling on the wheel is slightly missed.",Fantastic wireless gaming mouse.,0,"List(work, well, g700, batteri, life, make, actual, practic, use, wireless, mous, great, shape, textur, button, layout, track, thing, g700, wheel, function, miss, express, scroll, function, lot, side, scroll, wheel, slightli, miss)","List(1, 100, List(), List(-0.001739057816467398, -0.001740884948371639, -0.005654011541346629, -0.006864674727342301, -0.004009725984125849, 0.005224136853470437, 0.003777420429903413, 0.003170951232013683, 9.680584508685335E-4, 0.004024521143536173, -1.3964162245693224E-4, 0.010217282778373167, 0.009680151458709471, 0.001778954335848891, -0.0037683945791345207, 0.0018595368657711774, 0.0024294513127496166, -0.008300936240102015, 0.004090982906129812, 0.009098673990416912, 0.004781835288139841, -0.003106387001612494, 0.0035216700090395826, -0.004005957357285004, 0.0035143427911304656, 2.89891310404205E-4, -0.001943070261228469, 3.837787033614492E-4, -0.00921634808483143, 5.013633072526464E-4, -0.0032372582673786147, -0.004009614176597566, -0.003787164324744334, -0.0026525449056568886, 0.003550103965080193, 0.003906347000250413, -9.005187443577714E-4, 0.0029247981447335933, -1.0330189440038896E-4, 5.01754502164981E-4, -2.4493291388235744E-4, -0.004859685792677825, -0.0028130240558135893, -0.0038717661518603563, 0.005446109616558157, 0.003282317364092676, 9.613934033099682E-7, 0.004292821874015874, -0.0077292130914546784, 0.005309816761573236, 0.0027207950911214275, -0.0039022449073531935, 7.779143822018898E-4, 0.005606214027671564, 4.53959428986925E-4, -0.004505529248666379, 0.0024494394648730032, -1.4639646941495517E-4, 
-0.0026106446297959455, 0.004049240100768304, -7.339779235002014E-4, 0.008837451076795977, 0.005435336351154312, -0.006997652284260238, -0.004751631006177875, 0.01045152389504496, 0.0017504526063754793, 0.0012733270451726932, 9.9251726867571E-4, -0.0028882236201165905, -0.0030789193246633776, 0.003906453614153208, 0.001289025501107737, 0.005596715633967711, 0.005501109626024, 0.001839894012758328, 0.00654343327116822, -0.004154805875112934, 3.832423789126258E-4, -0.0030534312301766005, 0.0040173482810778, -0.0040278767663685065, -0.005175469803713983, -0.005340438152122642, -0.0026855588019374876, 0.009362345065681203, 0.005898321257723916, 0.007879095662745738, 0.00273659429512918, -0.0022184345627113454, -0.005787386341140635, 0.0017511701782143884, 0.001617237360351869, 0.002411325447141163, 0.0033318605292738685, 0.005985518654568061, -0.004280053067862266, -0.008548047389805077, 0.004116966244539306, -0.004979017671317823))"
434816,5.0,0,True,2017-01-29,A1G5EKIXBI6NRE,B01CKH0W94,Mitch,Product is high quality. Quick Effective Seller.,Five Stars,0,"List(product, high, qualiti, quick, effect, seller)","List(1, 100, List(), List(-0.003130931523628533, -8.651975804241374E-4, -0.004426174661299834, -0.0025741824841437238, -0.0014074601155395308, 0.0015870376373641193, 2.4803326232358813E-4, 0.003377680550329387, 4.0088181170479703E-4, 6.980187802885969E-4, 4.617229472690572E-4, 0.006896386155858636, 0.0028189145183811584, -2.7151093430196244E-4, -0.00468251493293792, 0.001488291968901952, 0.002140024919450904, -0.003662306582555175, 0.001063969025077919, 0.004734359215945005, 0.0021430464888301986, -7.095987384673208E-4, 0.0018709287396632135, -0.0011254975494618216, 0.0034865329701763885, -5.932013300480321E-5, 1.8221753028531867E-4, 5.96274194928507E-4, -0.00516478701805075, 5.004942261924346E-4, -0.003418778108122448, -6.219925126060843E-4, -0.002432872968104978, -0.001340738427340208, 0.002599755748330305, 0.0033292171040860312, -7.502741160957763E-4, 0.002895497328912218, 0.0015818336153946195, 5.924951304526378E-4, -0.0014984286584270496, -0.0032480868976563215, -0.0024398911433915296, -0.0019980177651935565, 0.0024722209976365166, -5.117135006003082E-4, -2.794964141988506E-4, 9.668514345927784E-4, -0.004547169218615939, 0.004393621697090566, -3.0479214425819613E-5, -0.0011175998661201447, 0.0016453956134985976, -3.513147433598836E-4, 4.3870069081701024E-4, -0.002390481240581721, 1.3015465810894966E-4, 2.6828947496445227E-4, -0.002025628831082334, 0.002085543140613784, 9.885261145730813E-4, 0.005735655276415248, 0.003036866410790632, -0.0033547019702382386, -0.0034601447211268046, 0.0029865365164975324, 0.0023333029045412936, -0.0019559725769795477, 4.0934550149055815E-4, -0.001451579601659129, -0.003167455395062764, 0.0016482974169775844, 7.306359378465761E-4, 0.0015488085142957666, 0.002396421507000923, 0.0015975030158491184, 0.0039489189512096345, -0.001822314428864047, 
-6.382430243926743E-5, -0.0026902149741848307, 0.003789450517312313, -0.0024625695853804546, -0.001934059759757171, -0.002454615741347273, -0.0013457229166912534, 0.004347883475323518, 0.0032607172227775054, 0.005435114765229324, 8.80634703207761E-4, -0.0025987677508965135, -0.0024157472265263396, 0.0019287500423767292, 0.0018597555463202298, 0.002707074493325005, 7.928147097118199E-4, 2.894040662795305E-4, -0.0032621255377307534, -0.0021848522204284864, 0.0017622824913511672, -0.003495717537589371))"
410842,5.0,0,True,2015-11-20,APGYGCZ3H17WM,B00ZQ2YFFI,Lion,Great game,I love it,0,"List(great, game)","List(1, 100, List(), List(-0.00550738675519824, -0.019335724413394928, -0.025965160690248013, -0.04364106059074402, -0.02472805418074131, 0.04327865689992905, 0.04283455479890108, 0.011792869423516095, 0.0064354699570685625, 0.02372697275131941, -0.0028819101862609386, 0.05963536351919174, 0.05582370050251484, 0.019326211884617805, -0.03280782327055931, 0.012413835152983665, 0.017816834151744843, -0.051228826865553856, 0.03545447066426277, 0.06067642383277416, 0.03779289871454239, -0.021903823129832745, 0.0284976027905941, -0.027349733747541904, 0.021066422574222088, -0.00383036769926548, -0.022853283677250147, -0.0053521416848525405, -0.07292715273797512, -0.00519458600319922, -0.02272696327418089, -0.030414637178182602, -0.026914228685200214, -0.005890983622521162, 0.024461491033434868, 0.027253338135778904, -0.010264216922223568, 0.008695583906956017, -0.005264200270175934, -5.964245647192001E-4, -0.0023306431248784065, -0.025403075851500034, -0.014716845471411943, -0.018623576033860445, 0.026487285271286964, 0.03069989662617445, 0.013976351357996464, 0.02344383578747511, -0.05150582827627659, 0.030074117705225945, 0.020312995184212923, -0.03352286946028471, -0.0014923374401405454, 0.0385810686275363, 2.446785510983318E-4, -0.037752426229417324, 0.010772957932204008, -0.00834018511523027, -0.025197061710059643, 0.029983745887875557, -0.020961846690624952, 0.059126853942871094, 0.044398498721420765, -0.033163731917738914, -0.02519364282488823, 0.07267126627266407, 0.006112835486419499, 0.010504544479772449, 0.0019257845706306398, -0.01597684482112527, -0.004165697959251702, 0.027267680503427982, 0.016366453492082655, 0.041882239282131195, 0.02602765429764986, 0.024583175778388977, 0.0427998099476099, -0.028552541509270668, 0.008894333615899086, -0.010606434545479715, 0.03433386608958244, -0.034478710032999516, -0.025232925079762936, -0.04032420739531517, 
-0.03088060300797224, 0.06113614700734615, 0.02531276922672987, 0.05707384645938873, 0.021577881649136543, -0.01025406876578927, -0.032075160183012486, 0.002573128091171384, 0.009362108772620559, 0.02137067075818777, 0.024918872863054276, 0.04401121288537979, -0.036384682171046734, -0.05956268683075905, 0.02876830380409956, -0.03697129059582949))"
338533,5.0,0,True,2017-04-10,A3INITPVRAXFOK,B00GZ1GUNO,T Reynolds,"Really fun, great graphics, I like how your skills build as the game progresses. I would have liked to see more puzzles to figure out and less people trying to kill Lara, but it was a great survival game.",Great,0,"List(realli, fun, great, graphic, like, skill, build, game, progress, like, see, puzzl, figur, less, peopl, tri, kill, lara, great, surviv, game)","List(1, 100, List(), List(-0.004246589802538178, -0.008890061450767374, -0.012672477451685283, -0.021662740857296046, -0.011854755636748104, 0.021187227513153283, 0.01863937215724339, 0.009751040510655867, 0.002933293633672985, 0.010564699641517584, -3.063132254672902E-4, 0.0305243395712404, 0.029261528536499964, 0.008380419203257631, -0.014926217829010316, 0.007303033736423544, 0.009505877188140792, -0.025354324352173576, 0.017150829135928126, 0.029174450552091002, 0.018698395462706685, -0.010516771786136641, 0.012844386078151209, -0.014777697068417354, 0.013245603479888467, -0.0020997742304600595, -0.008796335392010707, -0.0019138261808880736, -0.03444807209251892, -0.001529371859284029, -0.010656772802273432, -0.013822778904189665, -0.014844671734386965, -0.0048091791492576395, 0.012250102507615728, 0.01363024547407847, -0.004900861905688153, 0.005250333104326966, -0.002372049587956142, -7.310695918498649E-4, -0.0019130792837434758, -0.014405329002156143, -0.008489350651091496, -0.009499314989495489, 0.01578044863639488, 0.014869314202639672, 0.004677259335516109, 0.012071308253022531, -0.024290735301162512, 0.01608783371990458, 0.010065565096391808, -0.015938072204811585, -1.1328687902451271E-4, 0.01828972841446687, -1.5854059754582566E-4, -0.018860644466864564, 0.007004267303273082, -0.002514829917345196, -0.012325718817356529, 0.012387274171314423, -0.007128823526381027, 0.029377223391618044, 0.019482783058525195, -0.0185678334507559, -0.013653112353668326, 0.03476729842701128, 0.004816323090857449, 0.0028184571891047413, 
4.92353453799816E-4, -0.008683161000676808, -0.0039468109330517195, 0.01381198641666699, 0.007241938579162316, 0.01974820669662828, 0.014302929031795688, 0.009188671662871326, 0.022614240613100783, -0.014104662706986779, 0.0037282329895311876, -0.0071580172987610454, 0.015950523715998446, -0.015444572305395489, -0.014916992363786057, -0.01950386069005444, -0.014025864740168408, 0.030781781810912345, 0.014157646978717475, 0.027517083522287152, 0.01042323433821799, -0.006553363576087923, -0.016983418991523128, 0.003392977792481404, 0.00523788577300452, 0.01009660351735961, 0.01267363450356892, 0.01973089290827158, -0.016963085039086372, -0.03024422510394028, 0.014198962488167342, -0.018850458253707205))"
280157,5.0,0,True,2014-12-07,A3R0LRE7XSL45F,B009AGXH64,Mukai,I owned a PS3 since 2008 and it was great but now I needed something new so I picked up a Wii U. The thing I like the Wii U is the gamepad because it's simething different and I always wanted like a tablet lol. The console is small and quite and doesn't overheat. Gamepad is fun to use and slick with all the features can do and feels comfortable when holding it. I see myself playing this for a long time since I bought it for Super Smash Bros. :D 5/5 in my opinion!,A happy costumer : ),0,"List(own, ps3, sinc, 2008, great, need, someth, new, pick, wii, u, thing, like, wii, u, gamepad, simeth, differ, alway, want, like, tablet, lol, consol, small, quit, doesn, overheat, gamepad, fun, use, slick, featur, feel, comfort, hold, see, play, long, time, sinc, bought, super, smash, bro, d, 5, 5, opinion)","List(1, 100, List(), List(-0.003000022967616856, -0.003923202227511234, -0.006316248841859324, -0.011236518721229263, -0.005993631098191348, 0.011354084611319157, 0.008462904074599927, 0.0058609840468437, 0.0012879316078508462, 0.005214801788738719, 5.030412433611951E-5, 0.015660323478205472, 0.015307186907323609, 0.0030316974734887476, -0.007561412805035634, 0.002894930926398659, 0.004490690290386199, -0.013160997253310467, 0.008452608561314337, 0.014876043623579398, 0.009106084975242918, -0.005721410293350642, 0.006516602123156189, -0.007879776661569367, 0.006823986996801532, -8.285187056516201E-4, -0.003304687493756337, -5.714341045395299E-4, -0.017180989448893434, -7.229674710269675E-4, -0.004952314997576557, -0.006428510001718009, -0.007383183086747113, -0.0036331747969783534, 0.0061381721431959645, 0.007298475282257232, -0.003131164748064831, 0.003525382776777925, -0.0011529507797344454, -0.0012807494607915606, -5.936217176542636E-4, -0.008141926628756051, -0.004665983577106832, -0.005979502606573895, 0.009637246079438803, 0.0070071760636317175, 0.0017699328071808403, 0.005838561919517814, 
-0.013089886823745102, 0.007942401312234601, 0.0054353051973335745, -0.008033463410672027, 4.422575736071496E-4, 0.008717753095267226, -7.103499782696955E-4, -0.009378523199952074, 0.004085547840745397, -6.278070896787911E-4, -0.006038495923905652, 0.005913531657175294, -0.001910967904390121, 0.014627699540661912, 0.00893334534057245, -0.009898965415658847, -0.006892128807150435, 0.017422015253663516, 0.00307234246533231, 0.0014616372277580047, 5.366661543102592E-4, -0.00515592961580426, -0.0028335034885272684, 0.007846400372170824, 0.0026316826649922494, 0.00908609534547265, 0.007038258983069384, 0.003808487900437749, 0.011592715862207113, -0.006114968933801794, 0.0011011788277644472, -0.003470755241602203, 0.007825997226862046, -0.007052341502216853, -0.00911137364491136, -0.00898756954120472, -0.006570770570055619, 0.015794554751423395, 0.007933911014519327, 0.012945769677812954, 0.005283007920928755, -0.0041397112051656994, -0.009034691837008054, 0.0030149728159553236, 0.0027421693179794414, 0.003840440576563456, 0.006658398551030122, 0.010099652658064602, -0.007983550077722388, -0.015103782137513767, 0.006993011091550698, -0.009395662191494994))"
198505,5.0,0,True,2010-10-13,AVSYVX06HV7PV,B002BSA20M,Bikran Sandhu,never played halo before this. now i wanna play all 4 of the previous games,amazing,0,"List(never, play, halo, wanna, play, 4, previou, game)","List(1, 100, List(), List(-0.0015016455436125398, -0.005970446480205283, -0.00972908319090493, -0.020198109734337777, -0.009665994162787683, 0.019872442120686173, 0.017357236400130205, 0.0034490649995859712, -5.371375591494143E-4, 0.009861042315606028, -0.0016442614178231452, 0.02299764912459068, 0.02106823434587568, 0.008605279232142493, -0.01453108795976732, 0.005323252364178188, 0.005924741679336876, -0.02381530951242894, 0.017111433087848127, 0.022981096262810752, 0.01543221581960097, -0.006151098699774593, 0.011880294070579112, -0.01284467097138986, 0.008388060581637546, -0.0038755656278226525, -0.01139046237221919, -0.002346451045013964, -0.03192975802812725, -0.0032070269817268127, -0.009836870129220188, -0.01323738563223742, -0.011944345838855952, -8.528025355190039E-4, 0.01019457186339423, 0.010366436850745231, -0.003505571323330514, 0.001321127638220787, -0.003536307136528194, -0.002522951952414587, -0.0011333417205605656, -0.011169468722073361, -0.00548514595720917, -0.006350685565848835, 0.009593771363142878, 0.015136149246245623, 0.006665762484772131, 0.010209055384621024, -0.02110478625399992, 0.010147833439987153, 0.008553570572985336, -0.016354400009731762, -0.003566913219401613, 0.016720659798011184, -5.5842669098638E-4, -0.017401277815224603, 0.00493976753205061, -0.0033438233367633075, -0.013043797633145005, 0.011318706150632352, -0.008299027278553694, 0.022321526339510456, 0.01897389981604647, -0.011984176002442837, -0.0075907550053671, 0.029496281931642443, 0.0015100738528417423, 0.002847541152732447, 0.0010984317632392049, -0.006329747760901228, -4.287938354536891E-4, 0.012641759152757004, 0.008439997211098671, 0.017674948379863054, 0.008296807412989438, 0.010319394437829033, 0.016912977065658197, -0.011915580718778074, 
0.003150505173834972, -0.0027908930205740035, 0.015167939942330122, -0.012912939564557746, -0.011278986174147576, -0.01618780061835423, -0.015037215547636151, 0.027188206207938492, 0.008762055193074048, 0.02429126249626279, 0.008352197735803202, -0.0032456695626024157, -0.011821334424894303, 0.0012917714775539935, 0.006296001534792595, 0.008424410305451602, 0.011728471450624056, 0.018169827642850578, -0.014696688100229949, -0.025233856140403077, 0.012533876550151035, -0.016469077789224684))"
269034,5.0,0,True,2012-08-17,A1QS9JL35MX6WF,B0074LIX16,Mora,excelentes graficos. lo recomiendo. mi hijo le gusta mucho el juego y es divertido. tiene buen sonido y es muy rapido en xbox live,Excelente,0,"List(excelent, grafico, lo, recomiendo, mi, hijo, le, gusta, mucho, el, juego, y, es, divertido, tien, buen, sonido, y, es, muy, rapido, en, xbox, live)","List(1, 100, List(), List(4.779516990917424E-4, -4.638826649170369E-4, -4.4767243283179897E-4, -3.5149250955631334E-4, -2.0220236910972744E-4, 1.27443578094244E-4, 3.804647324917217E-4, 2.2114033345133066E-4, -6.901916640345007E-5, 3.909075142776904E-4, 2.106671575650883E-4, 7.621919309409956E-4, 4.2305250341693557E-4, 1.8709919337804118E-4, 2.4553405334396905E-4, -1.9811985354560116E-4, 3.728898203310867E-4, -2.6645759741465247E-4, 3.9586679001028335E-4, 5.468658443229893E-4, 4.4679802764828003E-4, -3.188465295049051E-4, 6.057748881479104E-6, 1.3805367052555084E-4, 1.173547595196093E-4, 1.4719705844375614E-4, -3.2175982293362415E-4, 2.839468167318652E-5, -4.6154784892375267E-4, -2.0854510754967728E-4, -3.758276191850503E-4, 6.507188663817942E-5, -4.246262202893073E-4, -2.543903246987611E-4, 3.176559306060274E-4, 5.233412424180035E-4, 4.706442511330048E-4, 2.6521612502013642E-5, -1.4810576006614912E-4, 5.933360565298547E-5, 5.468343927835424E-4, 2.318953920621425E-4, 1.7342890593378493E-4, 5.051222785065571E-4, 4.928533065443237E-4, 2.1539519851406413E-4, 2.7610435305784144E-5, 3.7742410980475444E-4, -5.331697126772875E-4, 3.451927720258633E-5, -3.0448824327322654E-4, -4.53527415326486E-5, 5.2868553514902786E-5, 5.166809181294714E-4, -1.1525327378573516E-4, -2.07292342868944E-4, -1.5567744655224183E-4, 2.0077278410705426E-4, -4.41716838395223E-4, 8.52983115085711E-5, 1.8698092753766105E-4, 6.09437779833873E-4, 2.1351941783602038E-4, -5.521286705819268E-4, 3.4104947311182815E-4, 5.994581539804736E-4, 5.861433843771616E-6, 4.152528005458104E-4, 5.933582821550468E-4, 4.5759452041238546E-4, 4.076823048914472E-4, 
2.555387715498606E-4, 8.649355731904507E-5, 6.250563698510328E-5, 1.8622316323065508E-5, 2.9388869491716224E-4, 7.755028733906025E-4, -2.271559787914157E-5, -4.290766082704067E-4, 1.8738833023235202E-4, 4.099031599859396E-4, 5.7736149756237864E-5, -2.2348404551545777E-5, -4.0603331581223756E-4, -3.954693675041199E-4, 5.113152631868918E-4, 3.53557242002959E-4, 6.778039532946423E-4, -4.0216974836463726E-4, -1.1881163421397407E-4, 9.783318576713403E-5, 6.492708635050803E-4, -1.5486435343821842E-4, -3.5236614833896357E-4, 4.074648459209129E-4, 4.811165854334831E-5, 3.1209883066670346E-4, -4.537578206509352E-4, -8.376320086730023E-5, -3.821049661686023E-4))"
134516,5.0,0,True,2017-03-16,A1IEMXUW2UR1ZO,B0015AARJI,Jesus Granados Jr,"Delivered on time, product itself works flawless.",Five Stars,0,"List(deliv, time, product, work, flawless)","List(1, 100, List(), List(-8.623044937849045E-4, -4.7149970196187497E-4, -0.004843831714242697, -0.0014606858836486937, -0.0029130149399861694, -9.751326870173217E-5, 7.353790570050479E-5, 0.0037993656005710363, 1.3600797392427922E-4, 0.002079590642824769, 0.0021406146697700024, 0.004654683731496335, 0.004040610417723656, -0.0030940315686166287, 6.445137783885002E-4, 0.0015605701599270107, 3.823445411399007E-4, -0.0012193918228149414, 6.265214178711176E-4, 0.002257491881027818, -5.507285241037607E-4, -6.755328620783985E-4, -5.56408939883113E-4, -0.001620113803073764, 0.003204259159974754, -2.2389595396816732E-4, -7.486822549253703E-4, 8.495186921209097E-4, -6.303736940026284E-4, -1.8123304471373558E-4, -0.0021557070780545474, 0.0017046793596819044, 0.001010160823352635, -0.0016371463658288123, -0.0011129884049296379, -9.827761910855771E-4, 2.506713150069118E-4, 7.734285783953965E-4, 0.003276021685451269, 0.0021941724698990583, 7.752311648800969E-4, -0.0016523465979844333, -0.002268190635368228, -0.00236829275963828, 0.00215314666274935, -6.25100638717413E-4, -0.0011906127212569118, 0.0018381699919700623, -0.0032073656795546415, 0.003642689995467663, 0.00249620444374159, 0.002189051813911647, -0.0011260662460699678, 0.001193884527310729, 8.054919308051467E-4, 3.704815171658993E-5, -0.0021605086512863636, 0.0010543520969804377, 0.001096571795642376, 0.0017019391991198063, 0.002650206512771547, 0.0036980090197175743, 1.2958343140780925E-4, -0.002668064064346254, -0.0028773031895980242, 0.0016271042171865703, 0.004150041565299034, -5.281943595036864E-4, 0.0012098949868232013, -6.055723293684423E-4, -0.0031637141946703196, -0.0010259464150294661, 3.869332140311599E-5, 4.5064284931868316E-4, 0.00253592892549932, -7.152657839469613E-4, 6.204913835972548E-4, -9.725401527248324E-4, 
-6.753502879291774E-4, 5.074325250461698E-4, -0.0025380904204212133, -3.588750492781401E-4, -0.0015854794299229981, 1.2019169516861439E-4, 0.0020587846171110868, 0.00245388918556273, 7.529083639383317E-4, 0.004211642500013113, 0.0013307531364262104, -0.0014939681044779719, -3.6422559060156345E-4, 0.0025531517807394267, 0.0013951724860817194, -0.0012141000013798477, -7.352231186814606E-4, 0.0016440060921013356, -0.0021232647821307184, -4.322208929806948E-4, 0.0025969762122258545, -0.003482493478804827))"
241187,5.0,0,True,2014-08-25,A2B1OM6AN53PKQ,B0050SX3KG,Seth Davis,Wonderful!,Awesome!,0,List(wonder),"List(1, 100, List(), List(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0))"
371821,5.0,0,True,2014-12-22,A2YMIP3W927JDE,B00M5DUOBK,vuskeedoo,I purchased this with the CHERRY MX REDs. I mostly play FPS games and had to get used to the very fast actuation points. This keyboard definitely improved my game.,I purchased this with the CHERRY MX REDs. I ...,0,"List(purchas, cherri, mx, red, mostli, play, fp, game, get, use, fast, actuat, point, keyboard, definit, improv, game)","List(1, 100, List(), List(-0.0030531087095904, -0.004628382941625793, -0.008476411304025747, -0.014931423329364727, -0.008623311405672747, 0.014455824535723557, 0.013683507609290673, 0.004876635701614706, 0.0029458206503049414, 0.008001073722398895, -9.278399280875045E-4, 0.02098732370444957, 0.018981212572030285, 0.005108954771832728, -0.01114465225734474, 0.003971792044467293, 0.006309489877072766, -0.017843610230449804, 0.010982952662743628, 0.020423443946877822, 0.013488760704229422, -0.006788429389844703, 0.009183119482579915, -0.01005363925223184, 0.007132633096154998, -0.0012318055284113677, -0.007047712984357905, -0.0017711501825354336, -0.02461214464924791, -0.002638845316730166, -0.007929257103515898, -0.010244246147682561, -0.009974595864632112, -0.0025613914819105584, 0.008198224025203244, 0.008270128773908843, -0.0025095737689886898, 0.003225092529593145, -0.002537852324381032, -7.017370491452953E-4, -0.0012860114287939401, -0.008460924351204406, -0.005944101167294909, -0.004748525188701665, 0.009454192874460098, 0.010647045818212278, 0.0038480640710879337, 0.009360160459490383, -0.01732827809310573, 0.010316464796607546, 0.00712175780277261, -0.012451998776454917, -1.287885003394502E-4, 0.012855780625934987, -2.4106664247536922E-4, -0.013669871519703199, 0.004296222783844261, -0.0021017248988808956, -0.009028904742616065, 0.008816239643184577, -0.006558165632989532, 0.019066559910938582, 0.014148875139653683, -0.010833226051866351, -0.007970235680284746, 0.023558670846635803, 0.0012812478543149635, 0.002735049179618192, 8.902085494756809E-4, 
-0.006468267316984779, -0.0024356209940057903, 0.008838816765038405, 0.00522031618610901, 0.01382511095075375, 0.008134232820285593, 0.007577163619263207, 0.012961443979293108, -0.009686958152622752, 0.0021247900474597424, -0.0035202775342280373, 0.012015273876707344, -0.009610810262315413, -0.008954601958120131, -0.012578104670597789, -0.011084430111462578, 0.02149583152769243, 0.00833446324309882, 0.01797900685821386, 0.0066224627114613265, -0.002819461508325356, -0.010646570480757338, 0.002804687082329217, 0.0038513395531267368, 0.006479358363910304, 0.008701258428011309, 0.013876027550877017, -0.012169222792555742, -0.0208096696091268, 0.011257618693087031, -0.013414368451134685))"


In [36]:
# Random Forest

from pyspark.ml import Pipeline
from pyspark.ml.feature import IndexToString, StringIndexer
from pyspark.ml.classification import RandomForestClassifier

labelIndexer = StringIndexer(inputCol="overall", outputCol="indexedScore").fit(prediction_doc2vec)
rf = RandomForestClassifier(labelCol="indexedScore", featuresCol="doc2vec", numTrees=40)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

pipeline = Pipeline(stages=[labelIndexer, rf, labelConverter])

(trainingData, testData) = prediction_doc2vec.randomSplit([0.7, 0.3])

rf_model = pipeline.fit(trainingData)
predictions = rf_model.transform(testData)

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol="indexedScore", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print("AUC = %g" % auc)
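
As a rough illustration of what `areaUnderROC` measures, the sketch below computes AUC in plain Python as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The function name `area_under_roc` and the toy labels and scores are hypothetical, and Spark's evaluator arrives at its value differently (it integrates the ROC curve over score thresholds). The sketch assumes exactly two classes; if `indexedScore` carries more than two rating levels, the AUC printed above should be read with care.

```python
def area_under_roc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: label 1 is the positive class
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.1]
toy_auc = area_under_roc(toy_labels, toy_scores)
print("AUC = %g" % toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```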

In [37]:
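# Performance evaluation with 5-fold cross-validation (see numFolds below)

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# An empty grid yields a single parameter setting, so CrossValidator only evaluates the default pipeline
paramGrid = ParamGridBuilder().build()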
cv = CrossValidator(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=paramGrid, numFolds=5)
cvModel = cv.fit(prediction_doc2vec)

print("Average AUC = %g" % cvModel.avgMetrics[0])
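
To make `avgMetrics[0]` more concrete, here is a minimal plain-Python sketch of k-fold cross-validation: each fold is held out once, a model is fit on the remaining rows, and the per-fold metrics are averaged. The helper names (`k_fold_indices`, `cross_validate`, `fit_and_score`) are hypothetical, and rows are assigned to folds by index modulo, whereas Spark assigns them randomly.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train, validation) row-index lists; row i belongs to fold i % n_folds."""
    for fold in range(n_folds):
        validation = [i for i in range(n_rows) if i % n_folds == fold]
        train = [i for i in range(n_rows) if i % n_folds != fold]
        yield train, validation


def cross_validate(n_rows, n_folds, fit_and_score):
    """Average the metric returned by fit_and_score(train, validation) over all folds."""
    scores = [fit_and_score(train, validation)
              for train, validation in k_fold_indices(n_rows, n_folds)]
    return sum(scores) / len(scores)


# Hypothetical stand-in for "fit on train, evaluate on validation":
# it simply reports the share of rows that ended up in the training split.
def share_of_training_rows(train, validation):
    return len(train) / (len(train) + len(validation))


average_metric = cross_validate(n_rows=10, n_folds=5, fit_and_score=share_of_training_rows)
print("Average metric = %g" % average_metric)
```

With a single (empty) parameter grid there is only one entry in `avgMetrics`, which is why index 0 is read above.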

## Model Interpretation

In [39]:
# Extract bigrams and merge them with the unigrams

interpret_tfidf = prediction_df_sampled.select('*')

from pyspark.ml.feature import NGram
from pyspark.sql.functions import array_union

ngram = NGram(n = 2, inputCol="reviewWordCleaned", outputCol="reviewBigrams")
interpret_tfidf = ngram.transform(interpret_tfidf)

# Note: array_union drops duplicate elements, so each ngram appears at most once per review
interpret_tfidf = interpret_tfidf.withColumn(
    "reviewNgrams",
    array_union(interpret_tfidf.reviewWordCleaned, interpret_tfidf.reviewBigrams))
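
The two steps in this cell can be sketched in plain Python on a single hypothetical review: build bigrams by joining adjacent tokens with a space (as `NGram` with `n=2` does), then merge them with the unigrams while dropping duplicates (as `array_union` does). The helper names and the toy tokens are illustrative, and keeping first occurrences in order is a choice made for this sketch.

```python
def bigrams(tokens):
    """Join each adjacent pair of tokens with a space; fewer than 2 tokens gives []."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def union_without_duplicates(left, right):
    """Concatenate two lists and keep only the first occurrence of each element."""
    seen = set()
    merged = []
    for item in left + right:
        if item not in seen:
            seen.add(item)
            merged.append(item)
    return merged


tokens = ["game", "fun", "game", "fun"]   # hypothetical cleaned review
pairs = bigrams(tokens)                   # ['game fun', 'fun game', 'game fun']
combined = union_without_duplicates(tokens, pairs)
print(combined)                           # ['game', 'fun', 'game fun', 'fun game']
```

Because duplicates are dropped, a term that occurs several times in one review is counted once in the later term-frequency step; concatenating the arrays instead would preserve the counts.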

In [40]:
# Calculating TF-IDF without hashing; limit vocabulary to top 2^12 (4096) ngrams

from pyspark.ml.feature import CountVectorizer, IDF

tf = CountVectorizer(inputCol="reviewNgrams", outputCol='TF', minDF=2.0, vocabSize=2**12)
tf_model = tf.fit(interpret_tfidf)
tf_transformed = tf_model.transform(interpret_tfidf)
idf = IDF(minDocFreq=3, inputCol="TF", outputCol="TF-IDF")
idfModel = idf.fit(tf_transformed)
interpret_tfidf = idfModel.transform(tf_transformed)
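
For intuition, the weighting applied by `IDF` can be sketched in plain Python. Spark ML documents the formula as `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term; terms that fall below `minDocFreq` get a weight of 0. The function name `idf_weights` and the toy corpus are hypothetical.

```python
import math


def idf_weights(doc_freq, num_docs, min_doc_freq=0):
    """Return {term: idf}, using log((m + 1) / (df + 1)) and 0.0 for rare terms."""
    return {
        term: math.log((num_docs + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for term, df in doc_freq.items()
    }


# Hypothetical toy corpus of tokenized reviews
docs = [["good", "game"], ["good", "stori"], ["good", "game", "game"], ["bad", "game"]]

# Document frequency: in how many reviews does each term occur?
doc_freq = {}
for doc in docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

weights = idf_weights(doc_freq, num_docs=len(docs), min_doc_freq=2)

# TF-IDF of one review = raw term count * idf weight
third_review_tfidf = {term: docs[2].count(term) * weights[term] for term in set(docs[2])}
print(weights)
print(third_review_tfidf)
```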

In [41]:
# Building a full Random Forest model with all the data, using TF-IDF embedding without hashing

from pyspark.ml import Pipeline
from pyspark.ml.feature import IndexToString, StringIndexer
from pyspark.ml.classification import RandomForestClassifier

labelIndexer = StringIndexer(inputCol="overall", outputCol="indexedScore").fit(interpret_tfidf)
rf = RandomForestClassifier(labelCol="indexedScore", featuresCol="TF-IDF", numTrees=40)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

pipeline = Pipeline(stages=[labelIndexer, rf, labelConverter])

rf_model = pipeline.fit(interpret_tfidf)

In [42]:
# Getting feature importance from the Random Forest model

feature_importance = rf_model.stages[-2].featureImportances
print(feature_importance)

In [43]:
# Get the indices of the 20 most important features and their importance scores

import numpy as np
import pandas as pd

# argsort is ascending, so flip it to rank features from most to least important
top20_indice = np.flip(np.argsort(feature_importance.toArray()))[:20].tolist()
top20_importance = [feature_importance[index] for index in top20_indice]

top20_df = spark.createDataFrame(pd.DataFrame(list(zip(top20_indice, top20_importance)), columns =['index', 'importance']))

display(top20_df)

index,importance
212,0.0389543482455962
522,0.0322075087675973
588,0.0305527243727215
83,0.0259977417354748
262,0.0251409805158823
147,0.0221711394321954
4,0.0203816433753174
131,0.0194958204830844
463,0.0186831241756267
8,0.0174306057391872
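
The ranking step above amounts to sorting feature indices by their importance and keeping the first k. A minimal plain-Python sketch of the same idea, with a hypothetical helper `top_k` and made-up importance values (chosen to sum to 1, as random forest importances are normalized):

```python
def top_k(importances, k):
    """Return the k (index, importance) pairs with the largest importance, best first."""
    order = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return [(i, importances[i]) for i in order[:k]]


toy_importances = [0.05, 0.30, 0.00, 0.45, 0.20]   # hypothetical values
top3 = top_k(toy_importances, 3)
print(top3)   # [(3, 0.45), (1, 0.3), (4, 0.2)]
```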


In [44]:
# Create a map between each ngram and its index

from pyspark.sql.functions import explode, udf, col
from pyspark.sql.types import *

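# Wrap each ngram in a one-element list so the fitted CountVectorizer model can transform it,
# then unwrap it again afterwards (argument names chosen to avoid shadowing the imported `col`)
make_list_udf = udf(lambda ngram: [ngram], ArrayType(StringType()))
remove_list_udf = udf(lambda ngrams: ngrams[0], StringType())

def get_index(vector):
  # The transformed one-ngram document has at most one non-zero entry: the ngram's vocabulary index
  if len(vector.indices) == 0:
    return -1   # Mark the ngram's index as -1 if it is not among the top 2^12 ngrams
  else:
    return int(vector.indices[0])
get_index_udf = udf(get_index, IntegerType())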

ngram_index = interpret_tfidf.select(explode(interpret_tfidf.reviewNgrams).alias("reviewNgrams")).distinct() \
                             .withColumn("reviewNgrams", make_list_udf("reviewNgrams"))
ngram_index = tf_model.transform(ngram_index)
ngram_index = ngram_index.withColumn("reviewNgrams", remove_list_udf("reviewNgrams")) \
                         .withColumn("index", get_index_udf("TF")) \
                         .select("reviewNgrams", "index")

In [45]:
display(ngram_index.where(ngram_index.index > -1))

reviewNgrams,index
sit couch,651
still,27
hope,184
game like,222
imagin,401
art,427
time game,356
import,914
goe,1078
encount,663
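
A fitted `CountVectorizerModel` also exposes its vocabulary as a list (`tf_model.vocabulary`) in which an ngram's position is its feature index, so the same mapping could likely be built without exploding and re-transforming the data. The sketch below shows the idea in plain Python; the `vocabulary` list is a hypothetical stand-in for `tf_model.vocabulary`, and `lookup` is an illustrative helper.

```python
# Hypothetical stand-in for tf_model.vocabulary (position in the list = feature index)
vocabulary = ["game", "like", "play", "game like"]

# Map each ngram to its index in one pass
ngram_to_index = {ngram: index for index, ngram in enumerate(vocabulary)}


def lookup(ngram):
    """Return the ngram's feature index, or -1 if it is not in the vocabulary."""
    return ngram_to_index.get(ngram, -1)


print(lookup("game like"))    # 3
print(lookup("sit couch"))    # -1, because it is missing from this toy vocabulary
```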


In [46]:
# Find the ngrams that map to the top 20 most important features

# Note: with HashingTF, several ngrams can collide on the same index. They would then share a single
# importance score, and there would be no way to separate their individual contributions.
# To avoid such collisions (exactly one index per ngram), I used CountVectorizer instead of HashingTF for encoding.

import pyspark.sql.functions as f

top20_ngram = top20_df.join(ngram_index, on="index", how="left_outer")
display(top20_ngram.groupby("importance").agg(f.collect_list(top20_ngram.reviewNgrams).alias("reviewNgrams")).orderBy("importance", ascending=False))

importance,reviewNgrams
0.0389543482455962,List(suck)
0.0322075087675973,List(manag)
0.0305527243727215,List(sometim)
0.0259977417354748,List(charg)
0.0251409805158823,List(stick)
0.0221711394321954,List(money)
0.0203816433753174,List(like)
0.0194958204830844,List(didn)
0.0186831241756267,List(classic)
0.0174306057391872,List(use)
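
The collision problem described in the comments above can be illustrated with a small plain-Python sketch. It hashes a few of the stemmed terms from the table into a deliberately tiny number of buckets and compares that with a vocabulary-style index. `zlib.crc32` is only a stand-in here (Spark's `HashingTF` uses a different hash function), and the bucket count of 4 is an assumption chosen to force collisions.

```python
import zlib
from collections import defaultdict


def hashed_index(term, num_buckets):
    """Map a term to a bucket by hashing; different terms may share a bucket."""
    return zlib.crc32(term.encode("utf-8")) % num_buckets


terms = ["suck", "manag", "sometim", "charg", "stick", "money"]
num_buckets = 4   # far fewer buckets than terms, so at least one collision must occur

# Hashing: group the terms by the bucket they land in
buckets = defaultdict(list)
for term in terms:
    buckets[hashed_index(term, num_buckets)].append(term)
collisions = {bucket: members for bucket, members in buckets.items() if len(members) > 1}

# Vocabulary-style indexing: every term gets its own index
vocab_index = {term: index for index, term in enumerate(terms)}

print("Buckets shared by several terms:", collisions)
print("One index per term:", vocab_index)
```

Any importance score attached to a shared bucket would belong to all of its terms at once, which is the ambiguity the vocabulary-based encoding avoids.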
