Project Luther: Ebertron
--
Ozzie Liu - ozzie@ozzieliu.com

Part 1/3 - Webpage scraping for Roger Ebert's movie reviews on www.rogerebert.com  

This IPython notebook documents the process of scraping web data to analyze and predict Mr. Ebert's reviews with other movie review data.

### Process:
1. <a href='#1'>Collecting movie review metadata from www.rogerebert.com/reviews</a>
2. <a href='#2'>Collecting movie review body from each review webpage</a>

### Packages:
- requests: to fetch HTML pages
- BeautifulSoup: to parse the HTML
- time: to add a sleep delay when scraping
- tqdm: to show a progress bar
- pandas: for data frames
- pickle: to save the results to disk

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import tqdm
import pickle

# 1 - Collecting movie review metadata from main review page
<a id='1'></a>
On www.rogerebert.com/reviews, I wanted to collect the following information:
- Movie title
- URL of movie review page
- Star ratings
- Year of review

Unfortunately, the main review page does not lend itself to easy scraping. The page is generated dynamically with an infinite scroll that loads a few movie reviews at a time, so I scrolled down by hand for a few minutes to get about 22 years of data. Then I saved the HTML locally to load into BeautifulSoup.

In [4]:
## Open the locally saved HTML file and create a BeautifulSoup object
with open('ebert_reviews20.html') as f:
    soup = BeautifulSoup(f, 'lxml')

In [10]:
## Select all elements of <figure class="movie review">. 
## This returns a ResultSet which is just a list of results
all_reviews = soup('figure', {'class':'movie review'})

In [12]:
"""
This function goes through the locally cached webpage of rogerebert.com/reviews and parses each movie review's title,
rating, year, and URL.
Input is a BeautifulSoup ResultSet.
Output is a DataFrame with the movie title, number of stars, year of review, and review URL.
"""
def scrape_eberts_review(review_set):
    review_list = list()
    
    for movie in review_set:
        ## movie.a is the first <a> tag in the figure; take the review URL from its href
        url = movie.a.get('href')
        
        ## Grab the title, which is the text of the 2nd <a> tag
        title = movie.find_all('a')[1].text
        
        ## Calculate the star rating by adding the <i> that's a full star and <i> that's a half star
        stars = len(movie.find_all('i', {'class':'icon-star-full'})) + 0.5 * len(movie.find_all('i', {'class':'icon-star-half'}))
        
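        ## Get the year from <span class="release-year">, slicing off the first and last characters.
        ## Some movies don't have a year listed: find() then returns None and .text raises AttributeError.
        try:
            year = movie.find('span', {'class':'release-year'}).text[1:-1]
        except AttributeError:
            year = ''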
        ## Add the data into the list
        review_list.append([title, stars, year, url])
    
    ## Create a data frame with respective column names and return the data frame
    df = pd.DataFrame(review_list, columns = ['Title', 'EbertStars', 'Year', 'URL'])
    return df
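
The selectors used in `scrape_eberts_review` can be tried out on a small hand-written fragment. This is only a sketch: the markup below is hypothetical, inferred from the selectors in the function, and the real page's HTML may differ. The `sample_` names are made up for the illustration, and `html.parser` is used instead of `lxml` so that nothing beyond BeautifulSoup is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, inferred from the selectors in scrape_eberts_review;
# the actual HTML of rogerebert.com/reviews may look different.
sample_html = """
<figure class="movie review">
  <a href="/reviews/example-movie-2013"><img src="poster.jpg"></a>
  <h5><a href="/reviews/example-movie-2013">Example Movie</a></h5>
  <span class="star-rating">
    <i class="icon-star-full"></i><i class="icon-star-full"></i>
    <i class="icon-star-full"></i><i class="icon-star-half"></i>
  </span>
  <span class="release-year">(2013)</span>
</figure>
"""

# html.parser ships with Python, so this sketch does not need lxml
sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_movie = sample_soup('figure', {'class': 'movie review'})[0]

# First <a> carries the review URL, second <a> carries the title text
sample_url = sample_movie.a.get('href')
sample_title = sample_movie.find_all('a')[1].text

# Three full stars plus one half star
sample_stars = (len(sample_movie.find_all('i', {'class': 'icon-star-full'}))
                + 0.5 * len(sample_movie.find_all('i', {'class': 'icon-star-half'})))

# [1:-1] drops the first and last character around the year
sample_year = sample_movie.find('span', {'class': 'release-year'}).text[1:-1]

print(sample_title, sample_stars, sample_year, sample_url)
```

If the real markup nests the links differently, the index in `find_all('a')[1]` is the first thing to check.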

In [17]:
## Call the function and store the data frame
review_df = scrape_eberts_review(all_reviews)
review_df

| | Title | EbertStars | Year | URL |
|---|---|---|---|---|
| 0 | The Spectacular Now | 4.0 | 2013 | /reviews/the-spectacular-now-2013 |
| 1 | Computer Chess | 2.0 | 2013 | /reviews/computer-chess-2013 |
| 2 | At Any Price | 4.0 | 2012 | /reviews/at-any-price-2012 |
| 3 | Blancanieves | 4.0 | 2012 | /reviews/blancanieves-2012 |
| 4 | Deceptive Practice: The Mysteries and Mentors ... | 3.0 | 2013 | /reviews/deceptive-practice-the-mysteries-and-... |
| 5 | To the Wonder | 3.5 | 2013 | /reviews/to-the-wonder-2013 |
| 6 | From Up on Poppy Hill | 2.5 | 2013 | /reviews/from-up-on-poppy-hill-2013 |
| 7 | The Host | 2.5 | 2013 | /reviews/the-host-2013 |
| 8 | Ginger and Rosa | 3.0 | 2013 | /reviews/ginger-and-rosa-2013 |
| 9 | On the Road | 2.0 | 2013 | /reviews/on-the-road-2013 |


# 2 - Collecting movie review body from each review page
<a id='2'></a>
With the list of movie review URLs, I want to go to each page and gather the MPAA rating, Runtime, Genre, and the
body text of the review. I use `requests` to fetch the webpage and `BeautifulSoup` again to parse the elements I want.

In [39]:
"""
The scrape_webpage(link) function takes the link of a review page, fetches the page, and returns the requested
fields as a list.
"""

def scrape_webpage(link):
    
    ## Build the complete URL, fetch the page, and create the BeautifulSoup object.
    ## The links already start with '/', so the base URL has no trailing slash.
    url = 'http://www.rogerebert.com' + link
    webpage = requests.get(url).text
    soup = BeautifulSoup(webpage, 'lxml')
    
    ## Find the <p> elements with their respective class, select the bold text, slice off any unneeded characters,
    ## and reduce it to plain ASCII text (encode, then decode back to str so the string methods also work on
    ## Python 3). Some movies don't have these fields: find() then returns None and raises AttributeError.
    try:
        mpaa = soup.find('p', {'class':'mpaa-rating'}).strong.text[6:].encode('ascii','ignore').decode('ascii')
    except AttributeError:
        mpaa = ''
    try: 
        runningtime = soup.find('p', {'class':'running-time'}).strong.text[:3].strip().encode('ascii', 'ignore').decode('ascii')
    except AttributeError:
        runningtime = ''
    try:
        genres = soup.find('p', {'class':'genres'}).strong.text.encode('ascii', 'ignore').decode('ascii').replace(',', '').split()
    except AttributeError:
        genres = []
    
    ## Construct the review text by finding the <div itemprop="reviewBody"> element, pulling out its <p> elements,
    ## and extracting their text into a list. Then join the separate paragraphs into one body.
    reviewbody = ' '.join([paragraph.text for paragraph in soup.find('div', {'itemprop':'reviewBody'}).find_all('p')])
    
    ## Return the results as a list for each review URL. Note that genres is parsed above but not returned.
    return [link, mpaa, runningtime, reviewbody]
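
The slicing in `scrape_webpage` is easier to follow on plain strings. The sample values below are assumptions made for illustration (the text of the real `<strong>` elements may differ); they are chosen to be consistent with the `[6:]` and `[:3]` slices in the function.

```python
# Hypothetical field text; the strings on the real page may differ.
mpaa_text = 'Rated R'
runtime_text = '99 minutes'
genres_text = 'Comedy, Science Fiction'

# Drop a 6-character prefix such as "Rated "
mpaa_example = mpaa_text[6:]

# Keep the first three characters, then strip the padding left by two-digit runtimes
runtime_example = runtime_text[:3].strip()

# Same approach as the function: remove commas, then split on whitespace
genres_split_on_space = genres_text.replace(',', '').split()

# Alternative: split on commas so multi-word genres stay in one piece
genres_split_on_comma = [genre.strip() for genre in genres_text.split(',')]

print(mpaa_example, runtime_example)
print(genres_split_on_space)
print(genres_split_on_comma)
```

Splitting on whitespace breaks a multi-word genre into separate words, so splitting on commas may be the safer choice if the genres are used later.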
 

In [44]:
## Iterate through each review page's URL from the previous data frame and call scrape_webpage(). Store each output
## list in scraped_list, and turn it into a data frame.
## Use tqdm to show a progress bar, since this could take a while.
## Also sleep for a short time between requests, to go easy on the server and reduce the chance of getting blocked.
scraped_list = list()

for movie in tqdm.tqdm(review_df['URL']):
    scraped_list.append(scrape_webpage(movie))
    time.sleep(0.5)

review_content = pd.DataFrame(scraped_list, columns = ['URL', 'Rating', 'Runtime', 'Review'])
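
A long scraping run like this stops at the first page that raises an exception, for example a page with no review body. One possible way to keep going is sketched below. The helper `scrape_all` and the stand-in `fake_scrape` are hypothetical names that are not part of the notebook; `fake_scrape` replaces `scrape_webpage` only so that the sketch runs without network access.

```python
import time

def scrape_all(links, scrape_func, delay=0.5):
    """Call scrape_func on each link, pausing `delay` seconds between calls.

    Returns (results, failed): the rows that scraped cleanly, and
    (link, error) pairs for the links that raised an exception.
    """
    results, failed = [], []
    for link in links:
        try:
            results.append(scrape_func(link))
        except Exception as err:
            # Record the failure and move on instead of aborting the whole run
            failed.append((link, repr(err)))
        time.sleep(delay)
    return results, failed


# Stand-in for scrape_webpage, so the sketch does not touch the network
def fake_scrape(link):
    if 'broken' in link:
        raise AttributeError('no review body found')
    return [link, 'R', '99', 'review text']


rows, failures = scrape_all(['/reviews/a', '/reviews/broken', '/reviews/b'],
                            fake_scrape, delay=0)
print(len(rows), len(failures))
```

In the notebook, the call would presumably look like `scrape_all(review_df['URL'], scrape_webpage)`, with the failed links retried or inspected afterwards.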




In [45]:
review_content.head()

| | URL | Rating | Runtime | Review |
|---|---|---|---|---|
| 0 | /reviews/the-spectacular-now-2013 | R | 99 | [Editor's note: Roger Ebert filed this review ... |
| 1 | /reviews/computer-chess-2013 | | 91 | [Editor's note: Roger Ebert filed this review ... |
| 2 | /reviews/at-any-price-2012 | R | 105 | Ramin Bahrani, the best new American director ... |
| 3 | /reviews/blancanieves-2012 | PG-13 | 104 | Note: The following was reworked from a blog p... |
| 4 | /reviews/deceptive-practice-the-mysteries-and-... | NR | 88 | This is another of Roger Ebert's final reviews... |


In [171]:
## Finally, pickle the two resulting data frames for the next step, data wrangling
# pickle.dump(review_df, open('review_metadata.pkl', 'wb'))
with open('ebert_reviews.pkl', 'wb') as f:
    pickle.dump(review_df, f)
with open('review_content.pkl', 'wb') as f:
    pickle.dump(review_content, f)
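
For the next step, the pickled objects have to be read back with `pickle.load`. A minimal round-trip sketch follows, using a plain list as a stand-in for the data frames and a hypothetical file name in a temporary directory, so that it does not touch the files written above. Pickle files should only be loaded from sources you trust.

```python
import os
import pickle
import tempfile

# Stand-in rows; in the notebook the pickled objects are pandas DataFrames
rows_to_save = [['/reviews/example-movie-2013', 'R', '99', 'review text']]

# Work in a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'example.pkl')  # hypothetical file name

    # Write in binary mode, as in the cell above
    with open(path, 'wb') as f:
        pickle.dump(rows_to_save, f)

    # Read back in binary mode
    with open(path, 'rb') as f:
        restored_rows = pickle.load(f)

print(restored_rows == rows_to_save)
```

pandas also offers `DataFrame.to_pickle` and `pd.read_pickle`, which do the same job for data frames without opening the file by hand.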