# NLP - Text Analytics of Movie Reviews Data with NLTK and TextBlob

By: Susan Kight

Using movie review data from the IMDb website, I will explore what kind of information reviewers share in their reviews and how text processing can help make sense of it when working with large volumes of text data.

## Prerequisite Libraries

The libraries below are needed for this text analytics project.

In [None]:
#! pip install --user nltk textblob
#! python -m nltk.downloader stopwords
#! python -m textblob.download_corpora

## Importing modules

In [1]:
from collections import Counter              # word counting
import nltk                                  # text processing
import pandas as pd                          # handling Pandas dataframes
from textblob import TextBlob                # sentiment analysis and language detection and translation 

pd.set_option('display.max_colwidth', 200)   # set the maximum column width to 200

## Part 1 - Loading a CSV File into a Pandas Dataframe

The file contains movie review data from IMDb.

In [2]:
df = pd.read_csv("data/imdb_data.csv", sep=",", header=0)
df.columns=['text']

In [4]:
#view the first few rows to check loaded correctly
df.head()

Unnamed: 0,text
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me..."
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p..."
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i..."
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone..."


In [5]:
df.shape

(1000, 1)

The dataset has 1000 rows and 1 column. I will explore this single column in more detail throughout the analysis, using text processing and mining techniques to derive meaningful insights.

## Part 2 - Text Processing and Mining Techniques

### Deriving New Columns From an Existing Column

To start analyzing the data, I need to derive new columns from the existing <i>text</i> column.

#### New Column for Text Length

First, let's look at the first 5 rows of the existing column that we want to work with:

In [6]:
df.text[:5]

0    One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...
1    A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...
2    I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...
3    Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...
4    Petter Mattei's "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be

Now I am ready to add a new column based on this existing column. Using the apply function with a lambda function that returns the length of the text, I create a new column that holds the length of each review in characters.

In [7]:
df["text_length"] = df.text.apply(lambda x: len(x))

In [8]:
#view the first few rows of both columns
df[["text", "text_length"]].head()

Unnamed: 0,text,text_length
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...",1761
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...",998
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...",926
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...,748
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone...",1317


#### Adding a New Column for Tokens

Next, I'll be adding a new column using NLTK for text processing that has a list of tokens from the <i>text</i> column.

In [9]:
df["tokens"] = df.text.apply(lambda x: nltk.word_tokenize(x))

In [10]:
#view the first few rows of the text column and the newly added column
df[["text", "tokens"]].head()

Unnamed: 0,text,tokens
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...","[One, of, the, other, reviewers, has, mentioned, that, after, watching, just, 1, Oz, episode, you, 'll, be, hooked, ., They, are, right, ,, as, this, is, exactly, what, happened, with, me., <, br,..."
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...","[A, wonderful, little, production, ., <, br, /, >, <, br, /, >, The, filming, technique, is, very, unassuming-, very, old-time-BBC, fashion, and, gives, a, comforting, ,, and, sometimes, discomfor..."
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...","[I, thought, this, was, a, wonderful, way, to, spend, time, on, a, too, hot, summer, weekend, ,, sitting, in, the, air, conditioned, theater, and, watching, a, light-hearted, comedy, ., The, plot,..."
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...,"[Basically, there, 's, a, family, where, a, little, boy, (, Jake, ), thinks, there, 's, a, zombie, in, his, closet, &, his, parents, are, fighting, all, the, time., <, br, /, >, <, br, /, >, This,..."
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone...","[Petter, Mattei, 's, ``, Love, in, the, Time, of, Money, '', is, a, visually, stunning, film, to, watch, ., Mr., Mattei, offers, us, a, vivid, portrait, about, human, relations, ., This, is, a, mo..."


The tokens column splits each review string into individual words and punctuation marks, and it is the basis for the additional columns added next.
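The token lists above still contain the pieces of the HTML `<br />` tags (`<`, `br`, `/`, `>`). This notebook keeps them and filters them out later, but one optional alternative is to strip the tags before tokenizing. The sketch below is only an illustration: the helper name `strip_br_tags` is hypothetical, and it assumes line breaks are the only HTML markup in the reviews.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
# Assumption: line-break tags are the only HTML markup present in the reviews.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_br_tags(text):
    """Replace HTML line-break tags with a space and collapse repeated whitespace."""
    without_tags = BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


# Sample text shortened from the first review shown above
sample = "They are right.<br /><br />The first thing that struck me"
cleaned = strip_br_tags(sample)
print(cleaned)
```

If this were applied to the dataframe, it would be something like `df.text.apply(strip_br_tags)` before the tokenizing step. Note that this would also change the text lengths and could shift the sentiment scores slightly.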

#### Adding a New Column for Tagged Tokens

The pos_tag function creates a new column that holds a list of tagged tokens rather than raw text. Each token is paired with its Penn Treebank part-of-speech tag: https://cs.nyu.edu/grishman/jet/guide/PennPOS.html.

In [11]:
df["tagged_tokens"] = df.tokens.apply(lambda x: nltk.pos_tag(x))

In [12]:
df[["text", "tokens", "tagged_tokens"]].head()
#the new column is a list of tuples

Unnamed: 0,text,tokens,tagged_tokens
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...","[One, of, the, other, reviewers, has, mentioned, that, after, watching, just, 1, Oz, episode, you, 'll, be, hooked, ., They, are, right, ,, as, this, is, exactly, what, happened, with, me., <, br,...","[(One, CD), (of, IN), (the, DT), (other, JJ), (reviewers, NNS), (has, VBZ), (mentioned, VBN), (that, IN), (after, IN), (watching, VBG), (just, RB), (1, CD), (Oz, NNP), (episode, NN), (you, PRP), (..."
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...","[A, wonderful, little, production, ., <, br, /, >, <, br, /, >, The, filming, technique, is, very, unassuming-, very, old-time-BBC, fashion, and, gives, a, comforting, ,, and, sometimes, discomfor...","[(A, DT), (wonderful, JJ), (little, JJ), (production, NN), (., .), (<, JJ), (br, NN), (/, NNP), (>, NNP), (<, NNP), (br, NN), (/, NNP), (>, VBZ), (The, DT), (filming, NN), (technique, NN), (is, VB..."
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...","[I, thought, this, was, a, wonderful, way, to, spend, time, on, a, too, hot, summer, weekend, ,, sitting, in, the, air, conditioned, theater, and, watching, a, light-hearted, comedy, ., The, plot,...","[(I, PRP), (thought, VBD), (this, DT), (was, VBD), (a, DT), (wonderful, JJ), (way, NN), (to, TO), (spend, VB), (time, NN), (on, IN), (a, DT), (too, RB), (hot, JJ), (summer, NN), (weekend, NN), (,,..."
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...,"[Basically, there, 's, a, family, where, a, little, boy, (, Jake, ), thinks, there, 's, a, zombie, in, his, closet, &, his, parents, are, fighting, all, the, time., <, br, /, >, <, br, /, >, This,...","[(Basically, NNP), (there, EX), ('s, VBZ), (a, DT), (family, NN), (where, WRB), (a, DT), (little, JJ), (boy, NN), ((, (), (Jake, NNP), (), )), (thinks, VBZ), (there, EX), ('s, VBZ), (a, DT), (zomb..."
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone...","[Petter, Mattei, 's, ``, Love, in, the, Time, of, Money, '', is, a, visually, stunning, film, to, watch, ., Mr., Mattei, offers, us, a, vivid, portrait, about, human, relations, ., This, is, a, mo...","[(Petter, NNP), (Mattei, NNP), ('s, POS), (``, ``), (Love, NNP), (in, IN), (the, DT), (Time, NNP), (of, IN), (Money, NNP), ('', ''), (is, VBZ), (a, DT), (visually, RB), (stunning, JJ), (film, NN),..."


#### Adding New Columns for Sentiment (Polarity and Subjectivity)

Two new columns will be added using TextBlob, which will allow easier analysis of the data at a deeper level:
- Polarity, between -1 (a negative statement) and 1 (a positive statement)
- Subjectivity, between 0 (objective) and 1 (subjective)

In [13]:
df["polarity"] = df.text.apply(lambda x: TextBlob(x).sentiment.polarity)
df["subjectivity"] = df.text.apply(lambda x: TextBlob(x).sentiment.subjectivity)

In [14]:
df[["text", "polarity", "subjectivity"]].head()

Unnamed: 0,text,polarity,subjectivity
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...",0.023433,0.490369
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...",0.109722,0.559343
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...",0.354008,0.65873
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...,-0.057813,0.454167
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone...",0.217952,0.452916


With these additional columns, we can now start to explore the data for interesting patterns and trends, as below.

In [21]:
df[(df.polarity > 0.5)][["text", "polarity"]]

Unnamed: 0,text,polarity
205,"A new way to enjoy Goldsworthy's work, Rivers and Tides allows fans to see his work in motion. Watching Goldsworthy build his pieces, one develops an appreciation for every stone, leaf, and thorn ...",0.562338
232,"I saw this movie in the theater, and was thoroughly impressed by it. Then again, that was when Claire Danes was a good actress, not the foolish, arrogant, Hollywood-ized bitch she is today. Anyway...",0.517658
305,"I thought this movie was stunning, with completely outstanding performances by Valentina Cervi (Artemisia Gentileschi).<br /><br />Cervi portrays Artemisia so beautifully, with tentative yet confi...",0.536667
329,"What a surprisingly good movie this one turned out to be. This is the type of film that I've been looking for ages. Particularly important for me was the fantastic-looking Chicago, which I still k...",0.52
412,"Ah, another movie with motorcycles, hell's angels posse and Steve A-Lame-o as the not-so-cool car driver. This movie does not rely on story but lots of drinking, pot smoking, and lots of moronic a...",0.5375
491,"This movie does a great job of explaining the problems that we faced and the fears that we had before we put man into space. As a history of space flight, it is still used today in classrooms that...",0.55
532,"I read the book before seeing the movie, and the film is one of the best adaptations out there. Very true and faithful to the book. Sean Penn and Sarandon are amazing. Robbins is a talented filmak...",0.581875
812,"As if the film were not of value in itself, this is an excellent way to get an overview of the novel as a preface to reading it. In the summer of 1968 I saw the film in NYC; that fall in graduate ...",0.5375
832,"As far as I know the real guy that the main actor is playing saw his performance and said it was an outstanding portrayal, I'd agree with him. This is a fantastic film about a quite gifted boy/man...",0.557198
853,"So real and surreal, all in one. I remember feeling like Tessa. Heck, I remember being Tessa. This was a beautiful vignette of a relationship ending. I especially liked the protesters tangent. It ...",0.533333


Of the 1000 movie reviews, only 10 have a polarity above 0.5, which suggests that most of these reviewers thoroughly enjoyed the associated movie.

In [24]:
df[(df.polarity < -0.4)][["text", "polarity"]].head()

Unnamed: 0,text,polarity
84,"We brought this film as a joke for a friend, and could of been our worst joke to play. The film is barely watchable, and the acting is dire. The worst child actor ever used and Hasslehoff giving a...",-0.56875
162,I can't believe it that was the worst movie i have ever seen in my life. i laughed a couple of times. ( probably because of how stupid it was ) If someone paid me to see that movie again i wouldn'...,-0.542857
281,"After a very scary, crude opening which gives you that creepy ""Chainsaw massacre""-feeling, everything falls apart.<br /><br />SPOILER ALERT: As soon as the two FBI-officers start jabbing, you know...",-0.446591
413,"A rating of ""1"" does not begin to express how dull, depressing and relentlessly bad this movie is.",-0.530556
554,If you watched Pulp Fiction don't see this movie. This movie is NOT funny. This is the worst parody movie ever. This is a poor attempt of parody films.<br /><br />The cast is bad. The film is bad....,-0.407692


Looking at the more negative reviews, the first 5 with a polarity below -0.4 come from reviewers who were most likely among the least happy with the movie they watched. There is not much data at either extreme of the scale.
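To put a number on how little data sits at each extreme, the polarity scores can be grouped into buckets and counted. The sketch below is a minimal illustration using the same cut-offs as the two filters above (0.5 and -0.4). The function name `polarity_bucket` and the bucket labels are hypothetical, and the sample values are a handful of scores copied from the outputs above rather than the full column.

```python
from collections import Counter


def polarity_bucket(polarity, high=0.5, low=-0.4):
    """Label a polarity score using the same cut-offs as the filters above."""
    if polarity > high:
        return "highly positive"
    if polarity < low:
        return "highly negative"
    return "in between"


# A few polarity values copied from the outputs shown earlier
sample_polarities = [0.023433, 0.109722, 0.354008, -0.057813, 0.217952, 0.562338, -0.56875]

bucket_counts = Counter(polarity_bucket(p) for p in sample_polarities)
print(bucket_counts)
```

On the full dataframe, the same idea would be `df.polarity.apply(polarity_bucket).value_counts()`.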

Often a lot can be told from the first couple of words in a review, so next I will look at only the first fifteen characters of the <i>text</i> column.

In [27]:
df["text_short"] = df.text.apply(lambda x: x[:15])
df[["text", "text_short", "polarity"]].head()

Unnamed: 0,text,text_short,polarity
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me...",One of the othe,0.023433
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire p...",A wonderful lit,0.109722
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue i...",I thought this,0.354008
3,Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenl...,Basically there,-0.057813
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what mone...",Petter Mattei's,0.217952


Together with the polarity, this gives a quick snippet of how each review starts.
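A fixed character slice can cut a word in half, as in "One of the othe" above. If whole words are preferred, a possible variation is to keep the first few whitespace-separated words instead. This is a sketch only: `first_words` is a hypothetical helper and the default of three words is an arbitrary choice.

```python
def first_words(text, n_words=3):
    """Return the first n_words whitespace-separated words of text."""
    return " ".join(text.split()[:n_words])


# Opening of the first review shown above
sample = "One of the other reviewers has mentioned that"

char_snippet = sample[:15]             # the character slice used in this notebook
word_snippet = first_words(sample, 3)  # a word-based alternative

print(char_snippet)
print(word_snippet)
```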

To look at specific keywords, I will analyze the most popular ones below.

## Keyword Analysis

In [34]:
#quick view first 5 rows
df.tagged_tokens[:5]

0    [(One, CD), (of, IN), (the, DT), (other, JJ), (reviewers, NNS), (has, VBZ), (mentioned, VBN), (that, IN), (after, IN), (watching, VBG), (just, RB), (1, CD), (Oz, NNP), (episode, NN), (you, PRP), (...
1    [(A, DT), (wonderful, JJ), (little, JJ), (production, NN), (., .), (<, JJ), (br, NN), (/, NNP), (>, NNP), (<, NNP), (br, NN), (/, NNP), (>, VBZ), (The, DT), (filming, NN), (technique, NN), (is, VB...
2    [(I, PRP), (thought, VBD), (this, DT), (was, VBD), (a, DT), (wonderful, JJ), (way, NN), (to, TO), (spend, VB), (time, NN), (on, IN), (a, DT), (too, RB), (hot, JJ), (summer, NN), (weekend, NN), (,,...
3    [(Basically, NNP), (there, EX), ('s, VBZ), (a, DT), (family, NN), (where, WRB), (a, DT), (little, JJ), (boy, NN), ((, (), (Jake, NNP), (), )), (thinks, VBZ), (there, EX), ('s, VBZ), (a, DT), (zomb...
4    [(Petter, NNP), (Mattei, NNP), ('s, POS), (``, ``), (Love, NNP), (in, IN), (the, DT), (Time, NNP), (of, IN), (Money, NNP), ('', ''), (is, VBZ), (a, DT), (visually, RB), (stunn

I will now loop over the tagged tokens, take the word from each (word, tag) tuple, and count how many reviews each word appears in. Several filler words are not meaningful, so these stopwords will be removed from the analysis.

In [36]:
global_stopwords = nltk.corpus.stopwords.words("english")

In [45]:
counter = Counter()

for pos in df.tagged_tokens:
    word_set = set()
    
    for t in pos:
        word = t[0].lower()
        
        ###################################
        # Check if the word is a stopword
        ###################################
        if word in global_stopwords:
            continue
        else:
            word_set.add(word)
            
    counter.update(word_set)
    
counter.most_common(20)

[('.', 990),
 (',', 959),
 ("'s", 702),
 ('movie', 614),
 ("n't", 597),
 ('/', 588),
 ('br', 587),
 ('>', 587),
 ('<', 587),
 ('film', 544),
 ('one', 539),
 (')', 531),
 ('(', 528),
 ('like', 458),
 ('``', 451),
 ("''", 450),
 ('!', 374),
 ('would', 350),
 ('even', 348),
 ('good', 343)]
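Because each review contributes a set of its words, the counts in this section are the number of reviews that contain a word, not the total number of times the word occurs. The toy example below (hypothetical data, not taken from the dataset) sketches the difference between the two ways of counting.

```python
from collections import Counter

# Hypothetical toy "reviews", already tokenized
reviews = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]

reviews_containing = Counter()  # counts each word once per review (as in this notebook)
total_occurrences = Counter()   # counts every occurrence of each word

for tokens in reviews:
    reviews_containing.update(set(tokens))
    total_occurrences.update(tokens)

print(reviews_containing["good"])  # 2 reviews contain "good"
print(total_occurrences["good"])   # "good" occurs 3 times in total
```

Counting once per review keeps a single review that repeats a word many times from dominating the ranking.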

As a number of punctuation symbols are showing up, these will also be removed and the code re-run.

In [51]:
local_stopwords = [":", ";", ",", ".", "...", "!", "-", "#", "(", ")", "@", "&", "%", "'", "’", "“", "”", 
                   "amp", "/", "<", ">", "?", "n't", "``", "''", "'s", "'m", "'ve", "br", "https", "rt", "t…"] 

counter = Counter()
for pos in df.tagged_tokens:
    word_set = set()
    
    for t in pos:
        word = t[0].lower()
        
        ##########################################################
        # Check if the word is either a global or a local stopword.
        ##########################################################
        if word in (global_stopwords + local_stopwords):
            continue
        else:
            word_set.add(word)
            
    counter.update(word_set)
    
counter.most_common(30)

[('movie', 614),
 ('film', 544),
 ('one', 539),
 ('like', 458),
 ('would', 350),
 ('even', 348),
 ('good', 343),
 ('see', 339),
 ('time', 310),
 ('really', 309),
 ('get', 308),
 ('story', 302),
 ('much', 285),
 ('well', 282),
 ('made', 256),
 ('could', 255),
 ('way', 254),
 ('also', 252),
 ('first', 251),
 ('great', 246),
 ('people', 241),
 ('bad', 238),
 ('make', 231),
 ('watch', 227),
 ('think', 227),
 ('movies', 225),
 ('never', 220),
 ('many', 211),
 ('acting', 206),
 ('plot', 205)]

From this we can see the most popular words in the movie reviews, together with the number of reviews each one appears in. They include some that you would expect - movie, movies, film, plot - but also give deeper insight into other aspects that are shared. For example, it is common for the reviews to include some indication of how good or bad the movie was, along with discussion of the storyline, plot, acting and more. A broad range of topics is covered in the review text data.

As there is still a large amount of data to analyze in this way, I will next look specifically at adjectives and verbs to drill down further.

### Popular Adjectives 

In [52]:
counter = Counter()
for pos in df.tagged_tokens:
    word_set = set()
    
    for t in pos: 
        word = t[0].lower() 
        tag = t[1]
        
        if word in (global_stopwords + local_stopwords):
            continue 
            
        ##########################################################
        # Check if the tag matches.
        ##########################################################
        if tag.startswith("JJ"):
            word_set.add(word)
            
    counter.update(word_set)
    
counter.most_common(15)

[('good', 338),
 ('great', 237),
 ('bad', 231),
 ('many', 210),
 ('first', 195),
 ('much', 192),
 ('little', 181),
 ('best', 166),
 ('real', 138),
 ('old', 120),
 ('least', 118),
 ('better', 118),
 ('funny', 111),
 ('new', 102),
 ('young', 101)]

The most popular adjectives include positive remarks such as 'good' and 'great'. In comparison, 'bad' appears in fewer than half as many reviews as the combined count of those first 2 keywords, which suggests there were fewer negative movie reviews. The count for 'funny' is also fairly high, indicating that a large number of reviews referred to comedy aspects. It is also interesting to see 'best' listed quite high, given how strong a statement it is.
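The `tag.startswith("JJ")` check in the cell above matches all three Penn Treebank adjective tags (JJ, JJR and JJS), which is why comparative and superlative forms such as 'better' and 'best' appear in the list. The sketch below uses hand-written tagged tokens for illustration, not real tagger output.

```python
# The three Penn Treebank adjective tags
ADJECTIVE_TAGS = {
    "JJ": "adjective",
    "JJR": "adjective, comparative",
    "JJS": "adjective, superlative",
}

# Hand-written (hypothetical) tagged tokens, in the same (word, tag) shape as pos_tag output
tagged = [("good", "JJ"), ("better", "JJR"), ("best", "JJS"), ("movie", "NN")]

# Same prefix test as in the cell above
matched = [(word, ADJECTIVE_TAGS[tag]) for word, tag in tagged if tag.startswith("JJ")]
print(matched)
```

To count only base-form adjectives, the test could be changed to `tag == "JJ"`.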

### Popular Verbs

In [53]:
counter = Counter()
for pos in df.tagged_tokens:
    word_set = set()
    
    for t in pos:
        word = t[0].lower()
        tag = t[1]
        
        if word in (global_stopwords + local_stopwords):
            continue
            
        ##########################################################
        # Check the tag
        ##########################################################
        if tag.startswith("VB"):
            word_set.add(word)
            
    counter.update(word_set)
    
counter.most_common(15)

[('see', 335),
 ('get', 298),
 ('made', 255),
 ('make', 229),
 ('think', 221),
 ('watch', 206),
 ('seen', 201),
 ('know', 194),
 ('say', 161),
 ('go', 161),
 ('watching', 156),
 ('find', 142),
 ('makes', 141),
 ('want', 130),
 ('going', 126)]

The most popular verbs are those that I expected, with reviewers describing what they watched and how it made them think. No real surprises jumped out.
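The keyword, adjective and verb cells above differ only in the tag filter, so they could be folded into one helper. The sketch below is one possible way to do that, not the code used to produce the results above. The function name `count_reviews_containing` and its parameters are hypothetical, and the toy input reuses a few tagged tokens from the outputs shown earlier.

```python
from collections import Counter


def count_reviews_containing(tagged_reviews, stopwords, tag_prefix=None):
    """Count how many reviews contain each non-stopword.

    If tag_prefix is given (for example "JJ" or "VB"), only words whose
    part-of-speech tag starts with that prefix are counted.
    """
    stopword_set = set(stopwords)  # built once, so each lookup is fast
    counter = Counter()
    for tagged_tokens in tagged_reviews:
        words = set()
        for word, tag in tagged_tokens:
            word = word.lower()
            if word in stopword_set:
                continue
            if tag_prefix is None or tag.startswith(tag_prefix):
                words.add(word)
        counter.update(words)
    return counter


# Toy input: a few tagged tokens taken from the outputs shown earlier
toy_tagged = [
    [("A", "DT"), ("wonderful", "JJ"), ("little", "JJ"), ("production", "NN"), (".", ".")],
    [("I", "PRP"), ("thought", "VBD"), ("this", "DT"), ("was", "VBD"),
     ("a", "DT"), ("wonderful", "JJ"), ("way", "NN")],
]
toy_stopwords = ["a", "i", "this", "was", "."]  # stand-in for the global and local stopwords

adjectives = count_reviews_containing(toy_tagged, toy_stopwords, tag_prefix="JJ")
verbs = count_reviews_containing(toy_tagged, toy_stopwords, tag_prefix="VB")
print(adjectives.most_common(2))
print(verbs.most_common(2))
```

With the notebook's data, the call would look like `count_reviews_containing(df.tagged_tokens, global_stopwords + local_stopwords, "JJ").most_common(15)`. Building the stopword set once also avoids concatenating the two lists for every token.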

## Conclusion

By applying text processing and mining techniques to this set of movie review data, I was able to identify key patterns and trends. For this dataset, the reviews were largely subjective, but few of them showed polarity near either extreme. The most popular adjectives leaned towards positive views.