## Amazon movie reviews data cleanup
Using Spark for pre-processing

In [1]:
import pyspark as ps    # for the pyspark suite
spark = (ps.sql.SparkSession
         .builder
         .master('local[4]')
         .appName('moviemood')
         .getOrCreate()
        )

In [2]:
data_path = '../../datasets/'
file_name = 'reviews_Movies_and_TV.json.gz'

In [3]:
amazon_reviews_df = spark.read.json(data_path + file_name)

In [4]:
amazon_reviews_df.cache()
amazon_reviews_df.printSchema()

root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)



In [5]:
total_comment_count = amazon_reviews_df.count()

In [6]:
total_comment_count

4607047

### Data sampling for development

In [7]:
dev_sample_fraction = 0.02

In [8]:
# Reuse the fraction defined above instead of repeating the literal value
reviews_spark_df = amazon_reviews_df.sample(withReplacement=False, fraction=dev_sample_fraction)

In [9]:
reviews_df = reviews_spark_df.toPandas()

In [10]:
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91869 entries, 0 to 91868
Data columns (total 9 columns):
asin              91869 non-null object
helpful           91869 non-null object
overall           91869 non-null float64
reviewText        91869 non-null object
reviewTime        91869 non-null object
reviewerID        91869 non-null object
reviewerName      91524 non-null object
summary           91869 non-null object
unixReviewTime    91869 non-null int64
dtypes: float64(1), int64(1), object(7)
memory usage: 6.3+ MB


In [11]:
reviews_df.head()

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,1526863,"[0, 0]",5.0,"Attractive, memorable video featuring songs th...","09 27, 2009",A1TN0V94A3ECSH,Karilyn S. Eastvold,Hide 'Em in Your Heart,1254009600
1,5004101,"[2, 3]",4.0,See also: Jehovah's Witnesses: A Non-'Prophet'...,"07 2, 2006",A3KROYMQ61M7A,Rich The Reviewer,Very Fine Jeremiah Films production of this pe...,1151798400
2,5019281,"[0, 0]",5.0,"This has not been on TV, that I know of in man...","12 14, 2012",A1ZAQHMCCT297C,Holly Anne Romanoli,So happy I was able to get this,1355443200
3,5019281,"[0, 0]",5.0,I love An American Christmas Carol with Henry ...,"01 4, 2014",A38JFF6SGZEO37,Lauren,The Best Film about Scrooge!,1388793600
4,5019281,"[4, 4]",5.0,"One of the best ""Scrooge"" movies, a very creat...","01 17, 2007",A22J3OCEN0J191,Steve Gick,An American Chrismas Carol- Henry Winkler,1168992000


### Filter out comments that may be about the physical medium (DVD, edition...) rather than the movie

In [33]:
import pandas as pd
pd.set_option('display.max_colwidth', -1)

words_to_remove = ['dvd', 'vhs','edition', 'blue-ray', 'blueray', 'blu-ray', 'bluray', 'price']

In [34]:
# Non-capturing group: str.contains warns when the pattern has capture groups
words_to_remove_re = '(?:' + '|'.join(words_to_remove) + ')'

In [35]:
import warnings
warnings.simplefilter('ignore')
text_removed = reviews_df[reviews_df['reviewText'].str.contains(words_to_remove_re, case=False)]
warnings.simplefilter('default')
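
The pattern above matches substrings, so a word such as "priceless" or "expedition" also triggers the filter. A minimal sketch of a stricter variant, assuming whole-word matching with an optional plural `s` is what is wanted; the sample sentences are made up for illustration:

```python
import re

words_to_remove = ['dvd', 'vhs', 'edition', 'blue-ray', 'blueray',
                   'blu-ray', 'bluray', 'price']

# Word boundaries on both sides, non-capturing group, optional plural "s".
# re.escape protects against regex metacharacters in the keywords.
strict_re = r'\b(?:' + '|'.join(map(re.escape, words_to_remove)) + r')s?\b'
strict = re.compile(strict_re, re.IGNORECASE)

# Substring pattern, for comparison
loose = re.compile('(?:' + '|'.join(words_to_remove) + ')', re.IGNORECASE)

# Hypothetical sentences, not taken from the dataset
samples = [
    "The DVDs arrived scratched",
    "A priceless expedition movie",
    "Great Blu-Ray transfer",
]

strict_hits = [bool(strict.search(s)) for s in samples]
loose_hits = [bool(loose.search(s)) for s in samples]
```

With pandas, the same pattern string could be passed as `reviews_df['reviewText'].str.contains(strict_re, case=False)`.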

In [36]:
text_removed.describe()

Unnamed: 0,overall,unixReviewTime
count,21672.0,21672.0
mean,4.183924,1252101000.0
std,1.262963,118410800.0
min,1.0,894931200.0
25%,4.0,1167782000.0
50%,5.0,1272413000.0
75%,5.0,1358575000.0
max,5.0,1405987000.0


#### Result: fraction of content removed:

In [37]:
text_removed['reviewText'].count() / (total_comment_count * dev_sample_fraction)

0.23520489372042439
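
The denominator above is the expected sample size (`total_comment_count * dev_sample_fraction`), not the number of rows actually drawn. A small check using the counts printed earlier (91869 rows in `reviews_df.info()`, 21672 rows in `text_removed.describe()`) shows that the difference is small:

```python
total_comment_count = 4607047
dev_sample_fraction = 0.02

sample_size = 91869   # rows reported by reviews_df.info()
removed = 21672       # rows reported by text_removed.describe()

# Fraction relative to the expected sample size (what the cell above computes)
approx_fraction = removed / (total_comment_count * dev_sample_fraction)

# Fraction relative to the rows actually sampled
exact_fraction = removed / sample_size

print(round(approx_fraction, 4), round(exact_fraction, 4))
```

In the notebook, `len(text_removed) / len(reviews_df)` would give the exact value directly.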

## Quality check on filtered data

In [43]:
warnings.simplefilter('ignore')
relevant_comments = reviews_df[~reviews_df['reviewText'].str.contains(words_to_remove_re, case=False)]
warnings.simplefilter('default')

In [46]:
relevant_comments.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70197 entries, 0 to 91868
Data columns (total 9 columns):
asin              70197 non-null object
helpful           70197 non-null object
overall           70197 non-null float64
reviewText        70197 non-null object
reviewTime        70197 non-null object
reviewerID        70197 non-null object
reviewerName      69930 non-null object
summary           70197 non-null object
unixReviewTime    70197 non-null int64
dtypes: float64(1), int64(1), object(7)
memory usage: 5.4+ MB


In [43]:
relevant_comments['reviewText'].head(2) # try head(50) or more

0    Attractive, memorable video featuring songs that are catchy and easy to learn.  Great way to write the Word of God on young hearts.  Works for parents too--they like the songs and learn the verses with little effort.  Children love these videos.                                                         
1    See also: Jehovah's Witnesses: A Non-'Prophet' Organization. Started by 33rd degree Masonic Pyramidologist Chas.Taze Russell, in 1874 as the 'Russel-ites'. These poor people are badly bam-boozled. They need our prayers and th-is video / book (if you can still findit) are a step in the right direction!
Name: reviewText, dtype: object

#### Result: reviews look good
Based on a quick review of head(50).

## Keyword-based sentiment analysis

In [54]:
emotion_keywords_path = '../../datasets/'
emotion_keywords = 'emotions-sensor-data-set.zip'
emotions_keywords_df = pd.read_csv(emotion_keywords_path + emotion_keywords)

In [57]:
emotions_keywords_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1104 entries, 0 to 1103
Data columns (total 8 columns):
word        1104 non-null object
disgust     1104 non-null float64
surprise    1104 non-null float64
neutral     1104 non-null float64
anger       1104 non-null float64
sad         1104 non-null float64
happy       1104 non-null float64
fear        1104 non-null float64
dtypes: float64(7), object(1)
memory usage: 69.1+ KB


In [63]:
rel_comments_w_sentiments = relevant_comments.reindex(columns = relevant_comments.columns.tolist() + list(emotions_keywords_df)[1:])

In [64]:
rel_comments_w_sentiments.head(30)

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,disgust,surprise,neutral,anger,sad,happy,fear
0,1526863,"[0, 0]",5.0,"Attractive, memorable video featuring songs that are catchy and easy to learn. Great way to write the Word of God on young hearts. Works for parents too--they like the songs and learn the verses with little effort. Children love these videos.","09 27, 2009",A1TN0V94A3ECSH,Karilyn S. Eastvold,Hide 'Em in Your Heart,1254009600,,,,,,,
1,5004101,"[2, 3]",4.0,"See also: Jehovah's Witnesses: A Non-'Prophet' Organization. Started by 33rd degree Masonic Pyramidologist Chas.Taze Russell, in 1874 as the 'Russel-ites'. These poor people are badly bam-boozled. They need our prayers and th-is video / book (if you can still findit) are a step in the right direction!","07 2, 2006",A3KROYMQ61M7A,Rich The Reviewer,Very Fine Jeremiah Films production of this pernicious cult...,1151798400,,,,,,,
2,5019281,"[0, 0]",5.0,"This has not been on TV, that I know of in many years but really great performance by Henry Winkler. Thank you.","12 14, 2012",A1ZAQHMCCT297C,Holly Anne Romanoli,So happy I was able to get this,1355443200,,,,,,,
3,5019281,"[0, 0]",5.0,I love An American Christmas Carol with Henry Winkler. In my opinion it is the very best version of Scrooge. Hope everyone sees it and enjoys it as much as I do.,"01 4, 2014",A38JFF6SGZEO37,Lauren,The Best Film about Scrooge!,1388793600,,,,,,,
4,5019281,"[4, 4]",5.0,"One of the best ""Scrooge"" movies, a very creative adaptation.","01 17, 2007",A22J3OCEN0J191,Steve Gick,An American Chrismas Carol- Henry Winkler,1168992000,,,,,,,
5,5019281,"[0, 0]",5.0,I always loved this version of Scrooge and I'm glad it's finally been released on blue ray.So so so happy.,"02 18, 2013",A1BHEPVVAU7NPN,ttwitewolf,Finaly found it!,1361145600,,,,,,,
6,5119367,"[2, 2]",5.0,"I am always skeptical of hollywood productions of anything true but this is about as good as it could get. A few variations from the true story, but they were minor. Their take on the issues surrounding this family demonstrates they listenend to the advice they got from Old Testament experts who actually believe the Bible. It was not only well written, but the acting was superb! We have read the biblical story many times, but found ourselves mesmerized by this movie version. HIGHLY Recommend!","12 17, 2007",A12L2L5AQ8HU6Z,C. Rice,Excellent,1197849600,,,,,,,
7,5119367,"[0, 0]",5.0,I knew this particular movie would be great.Iv'e seen it when it came out.This movie came out in 1995.A modern classic.,"06 9, 2013",AHY53YW4CJYFN,drew dowdy,The best Joseph movie Iv'e ever seen,1370736000,,,,,,,
8,5119367,"[1, 1]",5.0,"This is a brilliant epic movie, a tour de force, the best Biblical movie I have seen with the exception ofOne Night with the KingStarring Paul Mercurio in the title role, he gives an intelligent and passionate performance of a man who was prepared to die, to keep to the word of his G-D.Ben Kingsley is his usual magnificent self in his role of Potiphar which he plays with panache. Warren Clark played a sterling role as Potiphar's thuggish overseer Domo Ednan, later to be Joseph's aid de camp. A deliciously evil Potiphar's Wife is captured in brilliant detail by Lesley Ann Warren and the actors playing the ten brothers also acquit themselves well staying true to the form of Israelite tribesmanThe movie begins with the purchase of Joseph by Potiphar, and his rise from that of a lowly slave to Potiphar's major domo where he captures the attention of Potiphar's evil wife, when he resists her attention and she accuses him of assaulting her, Potiphar soon knows he is innocent, and he embarks on the telling of his story of his boyhood, including the narrative of the rape of his half-sister Dinah (played with shyness and prettiness by Mexican actress Paloma Baeza), and the destruction of the Canaanite town of Schechem in revenge by her brothers Simeon (Vincenzo Nicoli) and Levi (Colin Bruce) Then we move to the central story whose tale baring and father's favour lead to his brother's brutal attack of him, where he is thrown into a pit and sold to Ishmaelite traders.His rise in prison, his interpretation of the dreams of Pharaoh (Stefano Dionisi) at the same time as we are taken back to Canaan and see the story of Tamar( a pert and beautiful Kelly Miller ) and Judah (Michael Attwell). 
The final sequence where Joseph's brothers go down to Egypt until the moving scene where he reveals himself to his brothers is the finest.The sex scenes are tasteful and not at all gratuitous in any way , and yes this is an adult movie, but they simply cover what is told in the Bible without going further and it is extremely narrow minded to find them offensive","09 9, 2012",A1G9FX1KV45N41,Gary Selikow,Tour de force,1347148800,,,,,,,
9,5119367,"[0, 0]",5.0,"This is the finest Biblical epoch we have ever seen, including Ben Hur and 10 Commandments. A low budget testament to the power of imagination and good acting. Remarkably faithful to both to historical and scriptural sources, the script brings life to an already marvelous tale of alienation, faith and redemption. The tragic family of Isaac is beautifully portrayed with every character (including all 12 brothers) skillfully defined.","08 8, 2005",A2AP13DP049G1I,George R. Odell,Kingsley conquers Egypt,1123459200,,,,,,,
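
The emotion columns are still empty at this point. One possible way to fill them, sketched under stated assumptions: each review is tokenized naively, and the lexicon scores of every known word are summed per emotion. The tiny lexicon, its values, and the `score_review` helper are hypothetical; only the layout (one row per word, one column per emotion) mirrors `emotions_keywords_df`.

```python
import re
import pandas as pd

# Hypothetical miniature lexicon; the scores are made up for illustration
lexicon = pd.DataFrame(
    {"word": ["love", "great", "poor"],
     "happy": [0.8, 0.6, 0.1],
     "sad": [0.1, 0.1, 0.7]}
).set_index("word")

def score_review(text, lexicon):
    """Sum the lexicon scores of every known word in the text, per emotion."""
    tokens = re.findall(r"[a-z']+", text.lower())
    known = [t for t in tokens if t in lexicon.index]
    # .loc with a list keeps duplicates, so repeated words count repeatedly
    return lexicon.loc[known].sum()

# Hypothetical reviews, not taken from the dataset
reviews = pd.DataFrame({"reviewText": ["I love it, great acting", "Poor script"]})

# apply() returns one row of emotion scores per review
scores = reviews["reviewText"].apply(lambda t: score_review(t, lexicon))
scored = reviews.join(scores)
```

Summing raw scores favors long reviews; dividing by the number of matched words is one alternative worth comparing.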


## Isolating comment helpfulness

In [24]:
helpful = reviews_df['helpful'].map(lambda x: x[0], na_action='ignore')
helpful.describe()

count    91869.000000
mean     2.638866    
std      11.521295   
min      0.000000    
25%      0.000000    
50%      0.000000    
75%      2.000000    
max      1031.000000 
Name: helpful, dtype: float64

In [25]:
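# The `helpful` field appears to be [helpful_votes, total_votes] (x[1] >= x[0] in the rows shown),
# so x[1] is the total number of votes rather than the number of "not helpful" votes
total_votes = reviews_df['helpful'].map(lambda x: x[1], na_action='ignore')
total_votes.describe()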

count    91869.000000
mean     4.093655    
std      14.126499   
min      0.000000    
25%      0.000000    
50%      1.000000    
75%      4.000000    
max      1160.000000 
Name: helpful, dtype: float64
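
Assuming the `helpful` field is `[helpful_votes, total_votes]`, a helpfulness ratio can be derived from the two counts. The sketch below uses a few made-up vote pairs and leaves the ratio undefined (NaN) for reviews with no votes:

```python
import math
import pandas as pd

# Hypothetical vote pairs in the same [helpful_votes, total_votes] layout
votes = pd.Series([[0, 0], [2, 3], [4, 4]])

helpful_votes = votes.map(lambda v: v[0])
total_votes = votes.map(lambda v: v[1])

# Mask zero totals so the division yields NaN instead of dividing by zero
helpful_ratio = helpful_votes / total_votes.where(total_votes > 0)

# Derived count of "not helpful" votes
not_helpful_votes = total_votes - helpful_votes
```

A ratio is easier to compare across reviews than the raw counts, but it is noisy when only a handful of votes were cast.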

### Note: current filtering is probably too aggressive: 'DVD' is mostly used to refer to the movie itself
Filtering on 'DVD' would remove many good comments; the following matched reviews are all appropriate:

In [26]:
import pandas as pd
pd.set_option('display.max_colwidth', -1)
text_removed['reviewText'].head(5)

11    I have watched this version of &#34;Joseph&#34; numerous times also on TV and am once again glad to add this one to my collection of Bible DVD's. It's entertainment but for me it's much than that I'm able to receive a better understanding of my Bible it's like a learning process to test me on my Bible knowledge and see how well I know the scriptures, for instance when I'm able to pick something out of the movie that may not be quite true to scripture I go check my Bible and I'm able to learn my scriptures better and my Bible. I believe by watching the DVD over and over and even having different versions you absorb more into your memory and it can be your own individual Bible study. I was really amazed and excited when I saw that you offered many of these DVD's of (The Bible Collection) which are some of my favorites. Hoping to collect them all and other versions plus other Christian DVD's to learn from them for a better understanding of God's true Word.
13    I use the Hemisphere

In [28]:
# Non-capturing group avoids the "pattern has match groups" warning from str.contains
text_removed = reviews_df[(reviews_df['reviewText'].str.contains('(?:dvd|vhs)', case=False)) & (helpful > 1)]


In [29]:
text_removed[['reviewText','helpful']].head(5)

Unnamed: 0,reviewText,helpful
30,"What with Meryl Streep portraying her in a movie and all, Julia's books and series are getting a monster tidal wave of fresh attention. I wonder how many youngsters (Food Network viewers) will purchase or have purchased for them the classic ""Mastering the Art of French Cooking"" and find it completely overwhelming and not necessarily useful for everyday cooking?I had few cooking skills in the mid 1980s when this book and DVD came out, and I relied on recipes from Fannie Farmer, supermarket recipe cards, and women's magazines. This series elevated my cooking for good. It is French influenced, but simplified in some cases, and there are variations on most recipes which is really useful if you don't have time to run to the supermarket. I have had many variation cookbooks and this is the best. I have not used the dessert section too much (I prefer Pierre Franey for this--fruit salad with Grand Marnier!), but the bread, meat, and veg sections are the best. I recommend ham steaks with peas and mashed potatoes and cornish game hen broiled w/cheese.I am so pleased this is out on DVD. I gathered up the video tapes one-by-one on ebay and this is definitely more affordable!","[31, 34]"
37,"To start off with, I actually have an early release of the dvd, so my stars are based on that, not the cinema.I made it through about six or seven pages on reviews before I decided to write this and from what I see, three subjects usually appear as the reason for bad reviews.The first reason is it is too violent. We all knew this, before we even went to see it. It was all over news, media, etc. We were warned not to take our kids, be prepared for violence, not for the weak of heart. So, if you went expecting scenes of the beach, then that is your own fault. I didn't watch it hoping to see a blood bath, but I knew it was coming.I am a very faithful Protestant, and I do see some Catholic overtones in the story, but that is Mel's interpretation and overall I felt it was pretty accurate. It is interesting to me that all of these people that may or may not have a personal relationship with Jesus, seem to know that if he were here today, he would not support this film.The second reason people complain is because it doesn't tell the full story of Jesus. It isn't supposed to. It is the Passion of the Christ, and passion means suffering. It isn't the story of the Christ or the Gospel of the Christ. It is basically a Passion play put to cinema. He isn't trying to show us the teachings of Jesus, he is showing us the suffering of Jesus. He is showing us what He went through for us. If you see this movie and aren't moved by the sacrifice that He made for all of us to be free, then you don't get it. And if you don't understand and you don't get it, then that is ok, I pray someday that you will.The third thing I pick up on in these reviews is that Mel is using it just for profit. He has made money at ever other movie he has made I am sure, and yet no one ever questions his motives then. Yet, when he comes out with something of a religious nature when we are in a world that requires us to be politically correct, then he was just doing it for money. 
He put up his own money and I personally hope he makes all the money he can off of it, not that he needs it. So what if there is a director's cut later on? How may different directors editions or box sets of Lord of the Rings are out there? I get tired of having to apologize for what I believe. If someone stands up for an important issue, whatever the hot topic of the week is, then that is their right to voice their opinion and they expect everyone to recieve it and agree. But if I as a Christian stand up for what I believe, then I am a ""Jesus Freak"" or a ""Fanatic"". So if you don't agree with me, that is your right, but if I don't believe what you do, then I am narrow minded?? What happened to my right to my own beliefs?While I am still on my soap box, it is amazing to me that people still think this film is anti-semitic. If you have any knowledge of the history prior to this then you should know that Jews were a large part of this area, so who else would send Jesus to the cross? He was Jewish, it was just the circumstances. And if you don't know the history, then read the bible, don't blame Gibson.Back to the movie, I liked the dvd, thought it was a great movie. I tip my hat to Mel Gibson for having the courage to put his money where his mouth is and put his beliefs to film.God Bless","[9, 12]"
40,"I have never seen a movie as intensley astounding in performance as that of ""The Passion Of The Christ"". I am 39 years old and I cannot find the words to express to anyone the expanse, magnitude and brilliance that this movie of The Final hours of Jesus Christ has left within me other than an imprint as deep within as the mystery of the universe itself.As a young catholic growing up, I remember seeing the movie on TV called, ""Francis of Assisi"", starring Bradford Dillman, Dolores Hart (who became an actual nun in 1971) and Stuart Whitman. It was and is one of the greatest movies in the history of Religious Film Making with The Passion of The Christ as Number One and Francis as Number two. No film of St. Francis or any kind then, made such an impression on me, that I actually took His name as mine for my Confirmation Name because I became so ""hypnotized"", if you will, that I became more aware that anyone can desire such holiness and attain such (Holy) Valor as Francis did, for me, when I was only 11.Buy this film!! See ""The Passion of The Christ"" Come Alive for you as you have never seen anything like it ever before!!!So that to see for yourselves what Jesus went through in His Final Hours for our redemption and to accomplish for us something so beautiful through such anguish and pain never before witnessed, that to this day, as I hold the DVD in my hands still unwrapped since I saw it on The Big Screen, it remains in me, as if imprinted on my very soul.Maybe a person may see this film as an atheist and leave it as an atheist, but I know one thing. They will never die as one, as it is said, likewise, that ""there are no atheists in fox holes"".Peace of ""The Christ"" be with you!RJ","[6, 8]"
42,"I frankly don't know how anybody could dislike this film. I cannot think of many superior cinemagraphic experiences than viewing ""The Passion of the Chirst."" With the soldiers and the Pharisees, it is nearly unfathomable that such hate could be conjured up for one's fellow human beings but there are endless reminders around us today which document that such hate still exists. Christ in particular remains despised in numerous hearts and locales.The artistic merits of ""The Passion..."" are weighty. Merely hearing Aramaic and Latin for two hours straight is valuable in itself. All of the acting is strong and this is even true of the minor roles as one's outrage at the individual soldiers is palpable.The choice for Satan was fascinating as I don't recall ever seeing a more perfectly androgynous human being captured on the big screen. The way that Mel Gibson ""fleshed out"" Pilate was stupendous, and quite surprising. In the context of the times, it is possible to feel sympathy for him. The very small things are what makes the film uniquely memorable, such as the use of flashbacks, the motif of the dove, and the death scene of Judas. I loved this movie and the DVD was worth the 20 bucks I paid. Never in my life have I wanted more to more change an ending and never in my life have I been as powerless to do so.","[3, 6]"
44,"I've noticed that many reviewers are complaining because there are no extras on this DVD. So what!?! The point of this movie isn't what Mel Gibson or any of the actors think. The point is what's happening on the screen. If you don't get that, you don't get the message. Why cloud the true message with a bunch of talking heads who are only going to say what's been said in countless magazine/newspaper interviews? This is a powerful movie that needs no explanation. It captures a part of Jesus' life that has been overly sanitized. We (Christians) know that Jesus died for us, but how many of us have actually considered ""how"" he died for us? Watching this movie should make a lot of people sit up and take notice. It's not an easy movie to watch, but once again, that's the point. You're supposed to feel uncomfortable. You're supposed to be horrified by the brutality and callousness of what happened to the Son of God. Mel Gibson did a magnificent job of capturing the last hours of Jesus' life. He didn't sugarcoat anything; he didn't downplay any of the ""ugly"" aspects. I respect him and this movie for that. This is definitely a movie that stays with you after you watch it. Hopefully, if you watched it in the right frame of mind, the true message should stay with you for the rest of your life.","[8, 14]"
