# Crawling Rotten Tomatoes reviews

Rotten Tomatoes is a review website for movies and TV shows. Unlike other film review websites such as IMDb, the Rotten Tomatoes score (whether the tomato🍅 is fresh or rotten) is aggregated from external reviews by critics rather than from general audiences. The website also provides audience reviews and scores, represented by a popcorn icon🍿.

In this assignment, we only focus on reviews from critics.

Take the movie "GANGS OF NEW YORK" as an example.
The URL of its review page is https://rottentomatoes.com/m/gangs_of_new_york/reviews?page=1
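
The page number at the end of the URL is the only part that changes between review pages. As a minimal sketch, assuming the pattern above holds for every page, the URLs for several pages could be built like this (`review_page_urls` is a hypothetical helper name, not part of the assignment):

```python
def review_page_urls(movie, page_count):
    """Return the review-page URLs for pages 1..page_count of one movie."""
    # URL pattern taken from the example above; the movie slug and the
    # page number are the only variable parts.
    base = 'https://rottentomatoes.com/m/{}/reviews?page={}'
    return [base.format(movie, page) for page in range(1, page_count + 1)]


urls = review_page_urls('gangs_of_new_york', 3)
print(urls[0])
```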

![](https://github.com/hujiayin/WebAnalytics/blob/master/Web%20Content/RottenTomatoesWebsiteNew.png?raw=true)

What we want to crawl is the information about each critic review. The target is the following five fields:

1. The name of the critic.
2. The rating. This should be 'rotten', 'fresh', or 'NA' if the review doesn't have a rating.
3. The source of the review (e.g. 'New York Times'). This should be 'NA' if the review doesn't have a source.
4. The text of the review. This should be 'NA' if the review doesn't have text.
5. The date of the review. This should be 'NA' if the review doesn't have a date.


A single review is enclosed between the tags `<div class="row review_table_row">` and `</div>`.

![](https://github.com/hujiayin/WebAnalytics/blob/master/Web%20Content/RottenTomatoesHtml1.png?raw=true)

Name of the critic: 

`<a href="/critic/joshua-brown" class="unstyled bold articleLink">Joshua Brown</a>`

Rating: 

`<div class="review_icon icon small rotten">`

Source of the review: 

`<em class="subtle critic-publication">London Review of Books</em>`

Text of the review: 

`<div class="the_review">
                                    Gangs of New York is to Fernando Wood's Manhattan what Fellini's Satyricon is to Nero's Rome, with a touch of Monty Python filth on the faces and clothes.
                                </div>`
                                
Date: 

`<div class="review-date subtle small">
                                June 27, 2019
                            </div>`
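
For the rating, the useful information sits in the `class` attribute rather than in the tag text. BeautifulSoup exposes `class` as a list of names, so one way to read the rating is to look for 'fresh' or 'rotten' in that list. The sketch below works on a plain list so it can be tried without a page; `rating_from_classes` is a hypothetical helper, and the second call assumes an unrated review simply lacks both names.

```python
VALID_RATINGS = {'fresh', 'rotten'}


def rating_from_classes(classes):
    """Pick 'fresh' or 'rotten' out of a list of CSS class names, else 'NA'."""
    for name in classes or []:
        if name in VALID_RATINGS:
            return name
    return 'NA'


# Class list taken from the rating tag shown above
print(rating_from_classes(['review_icon', 'icon', 'small', 'rotten']))
# Assumed shape of an unrated review: no 'fresh' or 'rotten' class present
print(rating_from_classes(['review_icon', 'icon', 'small']))
```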


Besides locating the tags of every review, we need to handle fields that may be missing. This covers not only the case where the tag for a field is absent, but also the case where the tag exists and the string inside it is empty.
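
Both cases can be folded into one small function. The following is a minimal sketch, not the code used in the cells below; `or_na` is a hypothetical name, and it takes the extracted string (or `None` when the tag was not found).

```python
def or_na(value):
    """Return the stripped string, or 'NA' when it is None or blank."""
    if value is None:
        # Case 1: the tag for the field was not found at all
        return 'NA'
    value = value.strip()
    # Case 2: the tag exists but holds only whitespace or nothing
    return value if value else 'NA'


print(or_na(None))
print(or_na('   '))
print(or_na('  Joshua Brown\n'))
```

With a BeautifulSoup tag, a call might look like `or_na(tag.text if tag else None)`.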

In [1]:
import requests
import re
import time
from bs4 import BeautifulSoup

In [2]:
# get html for a single page
def get_html(url): 
    my_header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
    
    src = False
    
    # Try up to 3 times to get the page
    for i in range(3): 
        try:
            response = requests.get(url, headers = my_header)
            src = response.content
            break
        except requests.RequestException:
            # Network error: wait 2 seconds before the next attempt
            time.sleep(2)
    return src
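
The retry loop in `get_html` can be tried without touching the network. The sketch below isolates the same idea (call, catch the failure, wait, call again) in a hypothetical helper `call_with_retries` and exercises it with a function that fails twice on purpose. It catches `Exception` broadly only to stay library-free; in a real crawler the narrower `requests.RequestException` is likely the better choice.

```python
import time


def call_with_retries(func, attempts=3, delay=2.0):
    """Call func() up to `attempts` times; return its result, or None if all fail."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Sleep only when another attempt is still to come
            if attempt < attempts - 1:
                time.sleep(delay)
    return None


calls = {'count': 0}


def flaky():
    """Simulated download that fails on the first two calls."""
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated network failure')
    return b'<html></html>'


result = call_with_retries(flaky, attempts=3, delay=0)
print(result, calls['count'])
```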

In [3]:
review_data = []
movie = 'gangs_of_new_york'
page_num = 3

for i in range(page_num):
    url = 'https://rottentomatoes.com/m/' + movie + '/reviews?page=' + str(i+1)
    
    html = get_html(url)
    
    if not html:
        print('Failed to get ' + movie + ' page ' + str(i+1))
        
    else:
        # Pass the raw bytes so BeautifulSoup detects the encoding and keeps non-ASCII characters
        soup = BeautifulSoup(html, 'lxml')
        review_info = soup.find_all('div', {'class':'row review_table_row'})

        for review in review_info:

            # If the tag for any field is not found in the html or its content is empty,
            # set the default value 'NA' for the field.

            # Find critic's name
            name_find = review.find('a', {'href': re.compile('critic')})
            name = name_find.text.strip() if name_find else 'NA'
            name = name if name != '' else 'NA'

            # Find rating: the last class of the icon is 'fresh' or 'rotten' when a rating exists
            rating_find = review.find('div', {'class': re.compile('review_icon')})
            rating = rating_find.attrs['class'][-1].strip() if rating_find else 'NA'
            rating = rating if rating in ('fresh', 'rotten') else 'NA'

            # Find source
            source_find = review.find('em', {'class': re.compile('critic-publication')})
            source = source_find.text.strip() if source_find else 'NA'
            source = source if source != '' else 'NA'

            # Find review content
            review_text_find = review.find('div', {'class': 'the_review'})
            review_text = review_text_find.text.strip() if review_text_find else 'NA'
            review_text = review_text if review_text != '' else 'NA'

            # Find review date
            date_find = review.find('div', {'class': re.compile('review-date')})
            date = date_find.text.strip() if date_find else 'NA'
            date = date if date != '' else 'NA'

            # Append into review_data
            review_data.append([name, rating, source, review_text, date])

# Write the reviews into a .txt file
with open(movie + '_' + str(page_num) + 'pages_reviews.txt', mode='w', encoding='utf-8') as file:
    for review in review_data:
        file.write('\t'.join(review) + '\n')
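
Review text is free-form, so it may itself contain a tab or a line break, which would shift the columns of a hand-built tab-separated line. One possible alternative, sketched here under that assumption, is Python's `csv` module with a tab delimiter, which quotes such fields automatically. The helper name `write_reviews_tsv`, the header row, and the sample row are illustrative and not part of the original output format.

```python
import csv
import io


def write_reviews_tsv(rows, handle):
    """Write review rows as tab-separated values, quoting fields when needed."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    # Header row is an addition for readability (assumption, not in the original file)
    writer.writerow(['name', 'rating', 'source', 'text', 'date'])
    writer.writerows(rows)


# Made-up placeholder row whose text field contains a tab
sample = [['Critic Name', 'fresh', 'Some Publication', 'Text with a\ttab inside', 'NA']]

buffer = io.StringIO()
write_reviews_tsv(sample, buffer)

# Reading the data back recovers the original five fields
rows_back = list(csv.reader(io.StringIO(buffer.getvalue()), delimiter='\t'))
print(rows_back[1] == sample[0])
```

To write to disk instead of a buffer, the same function can be given a file opened with `open(path, mode='w', encoding='utf-8', newline='')`.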

The final result is a list in which each element holds the five fields of one review.

In [4]:
review_data[0]

['Joshua Brown',
 'rotten',
 'London Review of Books',
 "Gangs of New York is to Fernando Wood's Manhattan what Fellini's Satyricon is to Nero's Rome, with a touch of Monty Python filth on the faces and clothes.",
 'June 27, 2019']