In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/imdminterstellar/imdb_reviews_raw.csv


# Project Planning

We’ll start our journey with **Project Planning**.

The idea that I came up with for this notebook is **Reviews Sentiment Analysis**. So, first of all, we’ll have to collect the reviews data. We’ll do this by scraping the movie reviews from IMDb. After this, we’ll have to do some **Data Cleansing and Exploratory Data Analysis**. Then, we’ll move to **Sentiment Analysis and Data Storytelling**.

Let’s start with **Data Collection**. We’ll scrape the reviews of the movie **Interstellar** for this project. The idea is to load all reviews by clicking the “Load more” button while it exists, and then scrape the rating, headline, review, date, and username for each review. We’ll use Selenium because we have to automate the button clicks.
I won’t explain my code in the text; the code is included in the article with comments inside it.

We’ll first load the **libraries** and **chrome driver**.

After that, the **idea** is to locate the load more **button** and click the button **while it exists** on the page. Once the button is **not available** anymore, it means that we’ve reached the bottom of the page and loaded all reviews. 
The next step is to locate **all reviews**, loop through each review, and scrape the headline, username, date, review, and rating.

![](https://miro.medium.com/max/1400/1*StZK-xYd5ZoSvl8l630kyA.jpeg)

In [3]:
import pandas as pd # we'll use pandas for storing the data in a dataframe
from selenium import webdriver # webdriver is a tool used for browser automation
from selenium.webdriver.chrome.service import Service # Selenium 4: wraps the path to the chromedriver executable
from selenium.webdriver.support.ui import WebDriverWait # explicit wait: polls until a condition holds, raises TimeoutException when the delay expires
from selenium.webdriver.support import expected_conditions as EC # ready-made conditions to wait for (element clickable, elements present, ...)
from selenium.webdriver.common.by import By # locator strategies such as By.XPATH
import numpy as np # np.nan marks fields that are missing from a review

movie_link = "https://www.imdb.com/title/tt0816692/reviews?ref_=tt_urv" # provide link to the movie reviews page
driver = webdriver.Chrome(service=Service("your_path")) # load Chrome WebDriver (Selenium 4 style); replace "your_path" with the path to chromedriver
driver.get(movie_link) # go to the provided link

button_exists = True # True value means that load more button exists
delay = 30 # maximum wait time in seconds
while button_exists: # repeat this process while load more button exists
    try:
        load_more_button = WebDriverWait(driver,delay).until(EC.element_to_be_clickable((By.XPATH, '//button[@class="ipl-load-more__button"]'))) # find the Load More button by xpath
        load_more_button.click() # click on Load More Button
    except Exception:
        button_exists = False # an exception (typically a timeout) means that the button no longer exists and we loaded all reviews

all_movie_reviews = WebDriverWait(driver,delay).until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="lister-item-content"]'))) # locate all loaded reviews
all_reviews = [] # we'll store all reviews in this list
for review in all_movie_reviews:
    # find_element(By.XPATH, ...) replaces find_element_by_xpath, which was removed in Selenium 4
    # this will get actual rating from user
    try:
        rating = review.find_element(By.XPATH, './/div[@class="ipl-ratings-bar"]').text
    except Exception:
        rating = np.nan

    # this will get headline/title of review
    try:
        headline = review.find_element(By.XPATH, './/a[@class="title"]').get_attribute('text')
    except Exception:
        headline = np.nan

    # date when the review was posted
    try:
        date = review.find_element(By.XPATH, './/span[@class="review-date"]').text
    except Exception:
        date = np.nan

    # username of person who wrote the review
    try:
        username = review.find_element(By.XPATH, './/span[@class="display-name-link"]//a').text
    except Exception:
        username = np.nan

    try:
        # some reviews are longer than others
        # longer reviews are listed in a different class
        # we'll assume that the review is longer, and try to scrape it like that
        review_text = review.find_element(By.XPATH, './/div[@class="text show-more__control"]').text
    except Exception:
        # an exception is raised if our previous attempt was not successful
        # this means that this is a shorter review
        # shorter reviews are listed under the 'content' class
        review_text = review.find_element(By.XPATH, './/div[@class="content"]').text

    # dictionary to store all scraped elements
    full_review = {
        "rating" : rating,
        "headline" : headline,
        "username" : username,
        "date" : date,
        "review" : review_text # separate name so the loop variable `review` is not overwritten
    }
    all_reviews.append(full_review) # append our dictionary to list

ModuleNotFoundError: No module named 'selenium'
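The scraping cell collects everything in `all_reviews`, but the notebook does not show how that list became the CSV file that is loaded in the next section. A minimal sketch of that step, assuming a plain `DataFrame.to_csv` call, could look like this. The sample records, the output path, and the file name are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the scraped `all_reviews` list of dictionaries
sample_reviews = [
    {"rating": "8/10", "headline": "Example headline", "username": "user_a",
     "date": "22 February 2015", "review": "Example review text."},
    {"rating": None, "headline": "Another headline", "username": "user_b",
     "date": "21 January 2017", "review": "Another example review."},
]

reviews_df = pd.DataFrame(sample_reviews)

# Assumed output location; on Kaggle this would be somewhere under /kaggle/working/
out_path = os.path.join(tempfile.gettempdir(), "imdb_reviews_raw_example.csv")
reviews_df.to_csv(out_path)  # the index is written as an unnamed first column

reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
# ['Unnamed: 0', 'rating', 'headline', 'username', 'date', 'review']
```

Writing the index would explain the `Unnamed: 0` column that shows up when the raw file is read back. Passing `index=False` to `to_csv` avoids it.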

# Data Cleansing
Now we have the data. It’s time to clean and **explore the data**. So, let’s start with **Data Cleansing**.
Let’s take a look at the data and write our ToDo list.

In [7]:
df = pd.read_csv('/kaggle/input/imdminterstellar/imdb_reviews_raw.csv')
df.head()

Unnamed: 0,rating,headline,username,date,review
0,8/10,Why tack on an ending where everything works ...,MartinHafer,22 February 2015,
1,6/10,"Often impressive and very beautiful, but less...",TheLittleSongbird,21 January 2017,"As someone who likes the cast, loved the conce..."
2,4/10,"Another bloated, overrated space epic\n",Leofwine_draca,14 January 2017,
3,7/10,some problems but a few great touches of 2001\n,SnoopyStyle,25 June 2015,"In the near future, Earth is devastated by bli..."
4,7/10,"Intellectual, Although Sometimes Flawed Scien...",Hitchcoc,19 June 2015,I'm treading on some little used ground. From ...


Let’s begin with **rating**. Currently, the rating is a string such as `8/10`, and our goal is to extract the first number and convert it to an **integer**. We’ll also drop **NaNs** from this column. After that, we’ll convert the date column from string to the **DateTime** data type.

And then we’ll focus on our main column, **review**. Here we’ll do a couple of things: drop **NaNs**, convert reviews to **lower-case**, and remove **punctuation**, **numbers**, **links**, and **stopwords** (words that do not add much meaning to a sentence).

**ToDo**:

- Extract the first number from rating and convert it to int
- Drop NaNs from rating and review
- Convert the date column to datetime dtype
- Remove punctuation, numbers, links, and stopwords from reviews, convert to lower-case, and strip whitespace

In [9]:
#1.Ratings are in format X/10 where X is rating. We want to extract X to new column.
df.dropna(subset=["rating","review"],inplace=True) # drop NaNs from rating and review columns
df["rating_number"] = df["rating"].apply(lambda x: x.split("/")[0]) # extract 1st number from rating column
df["rating_number"] = pd.to_numeric(df["rating_number"]) # convert rating number to int


#2. Convert date column to date datatype
df["date"] = pd.to_datetime(df["date"]) # convert date column to datetime dtype


#3. Clean Review column
from nltk.corpus import stopwords
import nltk

nltk.download("stopwords") # download all stopwords from nltk

def clean_reviews(df):
    stop_words = set(stopwords.words("english")) # pull all stopwords from nltk library; a set makes the membership checks below fast
    df["review_clean"] = df["review"].str.lower()  # convert reviews to lower-case
    df["review_clean"] = df["review_clean"].apply(lambda x: x.replace("\n","")) # remove new lines from reviews
    df["review_clean"] = df["review_clean"].str.replace(r'[^\w\s]+', '', regex=True) # remove punctuation from reviews (regex=True is required since pandas 2.0)
    df["review_clean"] = df["review_clean"].str.replace(r'\d+', '', regex=True) # remove numbers from reviews
    df["review_clean"] = df["review_clean"].replace(r'http\S+', '', regex=True).replace(r'www\S+','',regex=True) # remove links from reviews
    df["review_clean"] = df["review_clean"].str.strip() # strip whitespace
    df["review_clean_no_stopwords"] = df["review_clean"].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words])) # remove stopwords from reviews

    return df

cleaned_df = clean_reviews(df)

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
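To see what each cleaning step does, it can help to run the same sequence on a single string. The sketch below mirrors the steps of `clean_reviews` with the standard `re` module. The function name `clean_text`, the sample sentence, and the tiny stop-word set are assumptions for illustration; the notebook itself uses nltk's full English list.

```python
import re

# Tiny hypothetical stop-word set, for illustration only
SAMPLE_STOP_WORDS = {"i", "the", "it", "is", "a", "at", "in", "and"}

def clean_text(text, stop_words=SAMPLE_STOP_WORDS):
    """Apply the same steps as clean_reviews, in the same order, to one string."""
    text = text.lower()                    # lower-case
    text = text.replace("\n", "")          # remove new lines
    text = re.sub(r"[^\w\s]+", "", text)   # remove punctuation
    text = re.sub(r"\d+", "", text)        # remove numbers
    text = re.sub(r"http\S+", "", text)    # remove what is left of http(s) links
    text = re.sub(r"www\S+", "", text)     # remove what is left of www links
    text = text.strip()                    # strip surrounding whitespace
    # drop stop words; split() also collapses repeated spaces
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "I LOVED it! Seen it 3 times in IMAX, see https://example.com\n"
cleaned = clean_text(sample)
print(cleaned)  # loved seen times imax see
```

Note that the link is stripped of its punctuation first and only then removed as the single token `httpsexamplecom`.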




In [10]:
cleaned_df

Unnamed: 0,rating,headline,username,date,review,rating_number,review_clean,review_clean_no_stopwords
1,6/10,"Often impressive and very beautiful, but less...",TheLittleSongbird,2017-01-21,"As someone who likes the cast, loved the conce...",6,as someone who likes the cast loved the concep...,someone likes cast loved concept considers chr...
3,7/10,some problems but a few great touches of 2001\n,SnoopyStyle,2015-06-25,"In the near future, Earth is devastated by bli...",7,in the near future earth is devastated by blig...,near future earth devastated blight corn survi...
4,7/10,"Intellectual, Although Sometimes Flawed Scien...",Hitchcoc,2015-06-19,I'm treading on some little used ground. From ...,7,im treading on some little used ground from re...,im treading little used ground reading previou...
8,10/10,Out of this world\n,kosmasp,2015-05-31,A lot has been said and written about Interste...,10,a lot has been said and written about interste...,lot said written interstellar obviously take a...
10,9/10,Absolutely Brilliant\n,gavin6942,2015-01-25,A team of explorers travel through a wormhole ...,9,a team of explorers travel through a wormhole ...,team explorers travel wormhole attempt ensure ...
...,...,...,...,...,...,...,...,...
4972,10/10,Matthew McConaughey is the GOAT\n,blakeharthcock,2019-02-05,Interstellar is one of the most unique movies ...,10,interstellar is one of the most unique movies ...,interstellar one unique movies experiment diff...
4974,10/10,Best movie to watch and cry\n,periclesjunior,2019-02-19,I watched this movie four times and every time...,10,i watched this movie four times and every time...,watched movie four times every time feel soul ...
4975,10/10,Mindblowing\n,lucasjanssen-82918,2021-05-24,An absolute mind altering audiovisual masterpi...,10,an absolute mind altering audiovisual masterpi...,absolute mind altering audiovisual masterpiece...
4977,9/10,Interesting storyline\n,ak-31680,2019-03-03,Very interesting storyline and ending. The act...,9,very interesting storyline and ending the acti...,interesting storyline ending acting although m...


Before diving into **Sentiment Analysis**, let’s **explore** the data.


In [12]:
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

In [13]:
fig = px.histogram(df, x="rating_number",title="Interstellar Rating Histogram")
fig.show()

Our **histogram** is **left-skewed**: the long tail points toward the low ratings, while most of the ratings are high (positive).

This means that we should expect our **Sentiment Analysis** results to be mostly **positive**.
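The skew can also be checked numerically instead of by eye. A small sketch with pandas' `Series.skew()` follows; a negative value indicates a longer tail toward low values. The ratings here are made-up sample values, not the scraped data.

```python
import pandas as pd

# Hypothetical ratings for illustration only (not the notebook's data)
sample_ratings = pd.Series([10, 10, 10, 9, 9, 8, 8, 7, 6, 3, 1])

skewness = sample_ratings.skew()        # negative => left-skewed
mean_rating = sample_ratings.mean()
median_rating = sample_ratings.median()

# For a left-skewed distribution the mean is usually pulled below the median
print(f"skewness={skewness:.2f}, mean={mean_rating:.2f}, median={median_rating:.1f}")
```

In the notebook, the same check would be `df["rating_number"].skew()`.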

Now, let’s find the most **mentioned persons** and **organizations** in reviews.

We’ll use **Named Entity Recognition (NER)** for this. **NER** helps you easily identify the key elements in a text, like names of people, places, brands, monetary values, and more. For this, we’ll use a library called **spaCy**.

In [14]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
nlp.max_length = 4000000 # raise spaCy's text length limit (default is 1,000,000 characters) so the joined reviews fit

str1 = " " # separator used to join all reviews into one long string
stem1 = str1.join(df["review_clean"])
stem2 = nlp(stem1) # run the spaCy pipeline on the text to produce a Doc object
label= [(x.text, x.label, x.label_) for x in stem2.ents] # (entity text, numeric label id, label name) for every named entity in the document
ents_df = pd.DataFrame(label, columns=["word","entity","label"]) # create new df with label values

# most mentioned persons
persons_df = ents_df[ents_df["label"] == "PERSON"] # extract only persons from ents_df
persons_df = persons_df["word"].value_counts().rename_axis("name").reset_index(name="mentions") # explicit column names, because the default names differ between pandas 1.x and 2.x
persons_df = persons_df.head() # keep only 5 most mentioned persons
fig_most_mentioned_persons = px.bar(persons_df,x="name",y="mentions",title="The Most Mentioned Persons in Reviews")
fig_most_mentioned_persons.show()

# most mentioned organizations
organizations_df = ents_df[ents_df["label"] == "ORG"] # extract only organizations from ents_df
organizations_df = organizations_df["word"].value_counts().rename_axis("name").reset_index(name="mentions")
organizations_df = organizations_df.head() # keep only 5 most mentioned organizations
fig_most_mentioned_organizations = px.bar(organizations_df,x="name",y="mentions",title="The Most Mentioned Organizations in Reviews")
fig_most_mentioned_organizations.show()

**Obviously**, the main characters are the most mentioned in the **reviews**. 

But it’s very interesting to see **Hans Zimmer** as one of the most mentioned persons here. The **Interstellar** soundtrack was composed by **Hans Zimmer**.

**NASA** is the most mentioned organization in reviews. This is expected because the main character, Cooper, is a former **NASA** pilot.
We can see that **spaCy** also labeled quantum physics, jargon, sci, and **Kubrick** as organizations, which is incorrect. One likely contributor is that we ran NER on the lower-cased, punctuation-free text, which removes capitalization cues that the model relies on.

We’ll use **TextBlob** for **Sentiment Analysis**. **TextBlob** returns the **polarity** of a text as a number between -1 and 1. The **polarity** score indicates how negative or positive the text is.
The **idea** is to create a function for calculating the **polarity**, call the function on our cleaned reviews, and label reviews as positive, negative, or neutral.

In [15]:
from textblob import TextBlob
polarity = lambda x: TextBlob(x).sentiment.polarity # function to calculate the polarity
df["polarity"] = df["review_clean_no_stopwords"].apply(polarity) # apply polarity function
df["score"] = ""

for index,row in df.iterrows(): # iterating over dataframe
    # positive reviews
    if(row["polarity"] >= 0.01):
        df.loc[index,"score"] = "positive"
    # negative reviews
    elif(row["polarity"] <= -0.01):
        df.loc[index,"score"] = "negative"
    # neutral reviews
    else:
        df.loc[index,"score"] = "neutral"

results = df.groupby(["score"])["polarity"].count().reset_index() # groupby and store our sentiment analysis results
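The same labeling can be done without iterating over rows. The sketch below uses `np.select` with the thresholds from the cell above (0.01 and -0.01). The helper name `label_polarity` and the sample scores are hypothetical.

```python
import numpy as np
import pandas as pd

def label_polarity(polarity, threshold=0.01):
    """Map polarity scores to 'positive', 'negative', or 'neutral'."""
    polarity = pd.Series(polarity, dtype="float64")
    conditions = [
        (polarity >= threshold).to_numpy(),    # positive reviews
        (polarity <= -threshold).to_numpy(),   # negative reviews
    ]
    # anything matching neither condition falls through to the default
    labels = np.select(conditions, ["positive", "negative"], default="neutral")
    return pd.Series(labels, index=polarity.index, name="score")

# Made-up polarity values for illustration
sample_polarity = [0.35, -0.2, 0.0, 0.01, -0.005]
sample_scores = label_polarity(sample_polarity)
print(sample_scores.tolist())
# ['positive', 'negative', 'neutral', 'positive', 'neutral']
```

In the notebook this would replace the loop with `df["score"] = label_polarity(df["polarity"])`, which should give the same labels while avoiding `iterrows`.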

In [16]:
results

Unnamed: 0,score,polarity
0,negative,303
1,neutral,143
2,positive,3209


This is the **results** data frame. Our **histogram** of user ratings told us that most of the reviews were positive.

So our results are consistent with the ratings. Let’s take the actual rating, label it as positive/negative/neutral as well, and compare it with our results.

We’ll **label** ratings as: Positive (rating ≥ 6), Negative (rating ≤ 4), and Neutral (rating == 5). I left a very narrow band for Neutral because people don’t leave neutral reviews that often.

In [17]:
df["score_original"] = ""
for index,row in df.iterrows(): 
    if(row["rating_number"] >= 6):
        df.loc[index,"score_original"] = "positive"
    elif(row["rating_number"] <= 4):
        df.loc[index,"score_original"] = "negative"
    else:
        df.loc[index,"score_original"] = "neutral"

original_results = df.groupby(["score_original"])["polarity"].count().reset_index()
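As an alternative to the loop, `pd.cut` can bin the ratings in one call. This is only a sketch: the helper name `label_rating` and the sample ratings are hypothetical, and the bin edges assume integer ratings from 1 to 10.

```python
import pandas as pd

def label_rating(ratings):
    """Bin 1-10 ratings: <=4 negative, 5 neutral, >=6 positive."""
    ratings = pd.Series(ratings)
    # right-closed bins: (0, 4], (4, 5], (5, 10]
    binned = pd.cut(ratings, bins=[0, 4, 5, 10],
                    labels=["negative", "neutral", "positive"])
    return binned.astype(str).rename("score_original")

# Made-up ratings for illustration
sample_labels = label_rating([1, 4, 5, 6, 10])
print(sample_labels.tolist())
# ['negative', 'negative', 'neutral', 'positive', 'positive']
```

In the notebook the call would be `df["score_original"] = label_rating(df["rating_number"])`.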

In [19]:
original_results

Unnamed: 0,score_original,polarity
0,negative,412
1,neutral,90
2,positive,3153


**Now** let’s compare **Predicted Results** with **Actual Results**.


In [23]:
original_results.rename(columns={"score_original" : "score",
                                 "polarity" : "actual_result"},inplace=True)

results.rename(columns={"polarity" : "predicted_result"},inplace=True)


In [24]:
compare = pd.merge(left=results,right=original_results,left_on="score",right_on="score") # merge actual results and predicted results
compare["difference"] = compare["predicted_result"] - compare["actual_result"] # calculate the difference

In [25]:
compare

Unnamed: 0,score,predicted_result,actual_result,difference
0,negative,303,412,-109
1,neutral,143,90,53
2,positive,3209,3153,56


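At the aggregate level, our **results** look **good**. The **predicted** positive count (3,209) is within about 2% of the actual count (3,153). The neutral and negative counts are further off: we **predicted** 53 more neutral and 109 fewer negative reviews than the ratings indicate. Keep in mind that these are class totals: they compare the overall distribution of labels, not the label of each individual review.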
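To compare the labels review by review, one option is a confusion table built with `pd.crosstab`, plus a simple accuracy score. The sketch below uses made-up labels, and `agreement_report` is a hypothetical helper name. In the notebook the inputs would be `df["score_original"]` and `df["score"]`.

```python
import pandas as pd

def agreement_report(actual, predicted):
    """Return a confusion table and the share of reviews with matching labels."""
    actual = pd.Series(actual, name="actual")
    predicted = pd.Series(predicted, name="predicted")
    confusion = pd.crosstab(actual, predicted)   # rows: actual, columns: predicted
    accuracy = (actual == predicted).mean()      # fraction of matching labels
    return confusion, accuracy

# Made-up labels: the class totals are identical, but two reviews are mislabeled
sample_actual = ["positive", "positive", "negative", "neutral", "negative", "positive"]
sample_predicted = ["positive", "positive", "positive", "neutral", "negative", "negative"]

confusion, accuracy = agreement_report(sample_actual, sample_predicted)
print(confusion)
print(f"accuracy={accuracy:.2f}")
```

In this toy example both columns contain three positive, two negative, and one neutral label, yet only four of the six reviews agree.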

“A special effect without a story is a pretty boring thing.”
**-George Lucas**

Our **special effect** is our **Sentiment Analysis results**, but the table above would look pretty boring to people. So, we should present our results in **the best light**.

There are many reporting tools, such as **Tableau**, **Power BI**, **Google Data Studio**, etc. For this project, I’m going to use **Google Data Studio**. The **goal** here is to tell the story with **highlights** from the **project**.

![](https://miro.medium.com/max/1400/1*UHZA_WGxiBvfL0UjnrMCBQ.png)

You can see the report [**here**](https://datastudio.google.com/reporting/ee534b7c-d4ef-456e-96d6-f1831e750306). 

**Highlights** here are our **Sentiment Analysis** results (Predicted Results) compared to Actual Results, Top 3 Positive/Negative Words, and the Most Common Persons in Reviews.

It’s **amazing** how much you can do with only one column because everything from here is from the Reviews column.

**Vote** if you enjoyed this simple yet **interesting project**. 🖖
