# Web Scraping 

In this notebook, I use sentiment analysis on Rotten Tomatoes movie reviews to predict users' star ratings. The first step is to scrape the movie rating website Rotten Tomatoes: I parse the user ID, review text and star rating of each review and add these features to a dataframe.

### Importing Modules

In [173]:
import pandas as pd
from collections import defaultdict

import requests
import urllib.request
import re
import json

# Parsing The Raw Data

First, we send a GET request to our chosen website, which is given by the `url` variable. Next, we take the returned HTML document and run a regular expression search to find the movie ID, which is needed later to move on to the next pages of reviews. Rotten Tomatoes shows ten reviews per page, so to scrape many reviews at once we need a way to move on to the next page automatically.
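The regular expression step can be tried offline before sending any requests. The sketch below is only illustrative: `sample_html` is a made-up page fragment and `extract_movie_id` is a hypothetical helper, so the real page contents may differ. It uses the same pattern as the scraping cell, written as a raw string, and returns `None` when the pattern is not found instead of raising an error.

```python
import json
import re

# Hypothetical page fragment; the real Rotten Tomatoes page may look different.
sample_html = (
    '<script>root.RottenTomatoes.context.movieReview = '
    '{"movieId": "abc-123", "title": "Inception"};</script>'
)

def extract_movie_id(html):
    """Return the movieId from the embedded movieReview JSON, or None if absent."""
    match = re.search(r'movieReview\s=\s(.*);', html)
    if match is None:
        return None
    # group(1) holds the JSON object between "movieReview = " and the last ";"
    return json.loads(match.group(1))["movieId"]

print(extract_movie_id(sample_html))      # abc-123
print(extract_movie_id("<html></html>"))  # None
```

Checking for a missing match this way gives a clearer failure than an `AttributeError` on `.group(1)` if the page layout changes.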

In [137]:
# Define the URL and request the HTML document
url = 'https://www.rottentomatoes.com/m/inception/reviews?type=user'
response = requests.get(url)

# Search the HTML for the embedded movieReview JSON and extract the movieId
# (raw string so that \s is passed to the regex engine unchanged)
html_data = json.loads(re.search(r'movieReview\s=\s(.*);', response.text).group(1))
movieId = html_data["movieId"]

# Fetch one page of user reviews, starting after the given cursor
def getReviews(endCursor):
    r = requests.get(
        f"https://www.rottentomatoes.com/napi/movie/{movieId}/reviews/user",
        params={
            "direction": "next",
            "endCursor": endCursor,
            "startCursor": "",
        },
    )
    return r.json()

# Number of pages to scrape
pages = 100

# Request each page in turn, passing the previous page's endCursor
# (the first request uses an empty cursor)
reviews = []
result = {}
for i in range(pages):
    result = getReviews(result["pageInfo"]["endCursor"] if i != 0 else "")
    reviews.extend(result["reviews"])
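The loop above assumes that every one of the requested pages exists. As a minimal sketch of the same cursor-following idea, the version below stops early when a page has no reviews or no `endCursor`. The names `collect_reviews`, `FAKE_PAGES` and `fake_fetch` are hypothetical, and the fake pages only mimic the `reviews` and `pageInfo` keys used in this notebook, so the example runs without network access.

```python
def collect_reviews(fetch_page, max_pages):
    """Follow endCursor values until max_pages is reached or the pages run out."""
    reviews = []
    cursor = ""  # an empty cursor requests the first page
    for _ in range(max_pages):
        result = fetch_page(cursor)
        page_reviews = result.get("reviews", [])
        if not page_reviews:
            break  # nothing left to collect
        reviews.extend(page_reviews)
        cursor = result.get("pageInfo", {}).get("endCursor")
        if not cursor:
            break  # no cursor means there is no next page to request
    return reviews

# Hypothetical stand-in for the real API: maps a cursor to a response.
FAKE_PAGES = {
    "": {"reviews": [{"review": "a"}, {"review": "b"}], "pageInfo": {"endCursor": "c1"}},
    "c1": {"reviews": [{"review": "c"}], "pageInfo": {"endCursor": "c2"}},
    "c2": {"reviews": [], "pageInfo": {}},
}

def fake_fetch(cursor):
    return FAKE_PAGES[cursor]

collected = collect_reviews(fake_fetch, max_pages=100)
print(len(collected))  # 3
```

In the notebook itself, the call would be `collect_reviews(getReviews, pages)`, with `getReviews` taking the place of `fake_fetch`.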

# Creating Review Dataframe

In [174]:
# Empty data dictionary
data = defaultdict(list)

# Finding reviewers who have user IDs (IDs that are nine characters long)
users_all = [review['user']['userId'] for review in reviews]
idx = [i for i, user_id in enumerate(users_all) if len(user_id) == 9]
users = [users_all[i] for i in idx]
data['user'].extend(users)

# Super reviewer flag (1 if the reviewer is a Super Reviewer, otherwise 0)
super_reviewer = [int(reviews[i]['isSuperReviewer']) for i in idx]
data['super_reviewer'].extend(super_reviewer)

# Profanity
profanity = [reviews[i]['hasProfanity'] for i in idx]
profanity = [int(x) for x in profanity]
data['profanity'].extend(profanity)

# Written Review
data['review'].extend([reviews[i]['review'] for i in idx])

# Star rating
star_rating = [reviews[i]['rating'] for i in idx]
star_rating = [float(x.replace('STAR_','').replace('_','.')) for x in star_rating]
data['rating'].extend(star_rating)

# Creating dataframe of reviews
df = pd.DataFrame(data)
df

,user,super_reviewer,profanity,review,rating
0,917890251,0,0,In a world where anything is possible nothing ...,1.0
1,978487850,0,0,the creativity and imagination Nolan put into ...,5.0
2,978919953,0,0,Very interesting and entertaining and ofcourse...,4.5
3,979007855,0,0,"Complex, dense, and surprisingly emotionally a...",5.0
4,978420370,0,0,Watch it on a big screen. Go with the flow! Mi...,5.0
...,...,...,...,...,...
964,967668352,0,0,"The most intelligent popcorn movie in a long, ...",4.5
965,789481207,0,0,"Profound and ambitious like few films are, Inc...",5.0
966,967598413,0,0,Awesome!!! Love it!!,5.0
967,909092042,0,0,A perfect film. You'll watch it then turn aro...,5.0
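The column-by-column construction above can also be written as a single pass that builds one row per review. This is a sketch under assumptions rather than the notebook's actual method: `parse_rating`, `reviews_to_rows` and `sample_reviews` are hypothetical names, and the sample records only imitate the fields used above (`user.userId`, `isSuperReviewer`, `hasProfanity`, `review`, `rating`). It also skips reviews with a missing user ID, which the list-based version would fail on.

```python
def parse_rating(raw):
    """Convert a rating string such as 'STAR_4_5' into the float 4.5."""
    return float(raw.replace("STAR_", "").replace("_", "."))

def reviews_to_rows(reviews):
    """Build one dictionary per review, keeping only nine-character user IDs."""
    rows = []
    for r in reviews:
        user_id = r.get("user", {}).get("userId")
        if user_id is None or len(str(user_id)) != 9:
            continue  # skip reviewers without a usable ID
        rows.append({
            "user": user_id,
            "super_reviewer": int(r["isSuperReviewer"]),
            "profanity": int(r["hasProfanity"]),
            "review": r["review"],
            "rating": parse_rating(r["rating"]),
        })
    return rows

# Made-up records for illustration only.
sample_reviews = [
    {"user": {"userId": "917890251"}, "isSuperReviewer": False,
     "hasProfanity": False, "review": "Great film.", "rating": "STAR_4_5"},
    {"user": {"userId": "abc"}, "isSuperReviewer": False,
     "hasProfanity": True, "review": "Too short an ID.", "rating": "STAR_2"},
    {"user": {"userId": "978487850"}, "isSuperReviewer": True,
     "hasProfanity": False, "review": "A classic.", "rating": "STAR_5"},
]

rows = reviews_to_rows(sample_reviews)
print(len(rows))          # 2
print(rows[0]["rating"])  # 4.5
```

Passing the result to `pd.DataFrame(rows)` should give a dataframe with the same five columns as `df`.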
