# Wisdom of the Tribe 
## Webscraping film and review information from metacritic.com
### *Justin M. Olds* [github.com/jmolds](https://github.com/jmolds)
---
**Project Overview:** The ultimate purpose of this project is to develop a recommendation system for films based on the match between a user's film preferences and the preferences of established film critics. This *Wisdom of the Tribes* approach contrasts with the *Wisdom of the Masses* approaches provided by many popular websites, such as Metacritic and Rotten Tomatoes.

---
In this notebook, I showcase how film and review information was scraped from metacritic.com. This involved a two-pronged approach:
* **First**, film titles were converted to urls based on the metacritic url formatting, and those urls were requested directly.
* **Second**, if the first approach did not return a webpage, an automated web browser (selenium module) inserted the film title into the search bar on metacritic.com.

This resulted in **165,332 critic reviews** covering **7,689 films**.

---
As a first step, a master list of 19,000 film titles was loaded. The list was generated in a separate [webscraping notebook](https://github.com/jmolds/widsom-of-the-tribe/blob/master/imdb-master-film-list-webscraping.ipynb "Notebook Link").

In [8]:
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import IPython
from IPython.display import HTML
from IPython.display import display
import requests
from bs4 import BeautifulSoup
import time
from tqdm import tqdm
import csv
import pandas as pd 
import numpy as np

film_list = pd.read_csv('https://raw.githubusercontent.com/jmolds/widsom-of-the-tribe/master/film_list.csv', encoding = "ISO-8859-1", header=None)

In [13]:
print(len(film_list)) ## total number of films in the master list
film_list[0:10] ## display first 10 film titles 

19000


|   | 0 |
|---|---|
| 0 | Star Wars: Episode V - The Empire Strikes Back |
| 1 | Superman II |
| 2 | Nine to Five |
| 3 | Stir Crazy |
| 4 | Airplane! |
| 5 | Any Which Way You Can |
| 6 | Private Benjamin |
| 7 | Coal Miner's Daughter |
| 8 | Smokey and the Bandit II |
| 9 | The Blue Lagoon |


---
The next step is to create list-type copies of the film list to use for converting film title strings into urls.

Note: The format for urls on metacritic.com is to delete punctuation, replace spaces with hyphens, and use all lowercase. This is shown in the code below.

In [33]:
## save film title strings as list objects (one for the film review pages and one for the film pages)
film_list_hyphens = film_list[0].tolist()
reviews_list_hyphens = film_list[0].tolist()

#replace string elements to correspond with metacritic url formatting
for x in range(0,len(film_list_hyphens)):
    film_list_hyphens[x] = film_list_hyphens[x].replace("& ","") #delete ampersands (done first: once spaces become hyphens, "& " never matches)
    film_list_hyphens[x] = film_list_hyphens[x].replace(" ","-") #replace spaces with hyphens
    film_list_hyphens[x] = film_list_hyphens[x].replace(":","") #delete punctuation: colons
    film_list_hyphens[x] = film_list_hyphens[x].replace(".","") #periods
    film_list_hyphens[x] = film_list_hyphens[x].replace(",","") #commas
    film_list_hyphens[x] = film_list_hyphens[x].replace("'","") #apostrophes
    film_list_hyphens[x] = film_list_hyphens[x].lower()         # all lowercase

#create full urls for review pages and film pages
for x in range(0,len(film_list_hyphens)):
    reviews_list_hyphens[x] = "https://www.metacritic.com/movie/" + film_list_hyphens[x] +  "/critic-reviews"
    film_list_hyphens[x] = "https://www.metacritic.com/movie/" + film_list_hyphens[x]

In [36]:
##examples of each are shown below
print(film_list_hyphens[0])
print(reviews_list_hyphens[0])

https://www.metacritic.com/movie/star-wars-episode-v---the-empire-strikes-back
https://www.metacritic.com/movie/star-wars-episode-v---the-empire-strikes-back/critic-reviews
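
The url formatting rule described above can also be sketched as a single helper function. The names `title_to_slug` and `build_urls` are hypothetical and not part of the notebook. The sketch assumes the same punctuation set the notebook strips (`:`, `.`, `,`, `'`) and removes `& ` before spaces are converted, which is an assumption about the intended behavior rather than a confirmed metacritic rule.

```python
import re

BASE_URL = "https://www.metacritic.com/movie/"

def title_to_slug(title):
    """Hypothetical helper: convert a film title into a metacritic-style url slug."""
    slug = title.lower()                  # urls are all lowercase
    slug = slug.replace("& ", "")         # assumption: drop ampersands and the space after them
    slug = re.sub(r"[:.,']", "", slug)    # delete the punctuation handled in the notebook
    slug = slug.replace(" ", "-")         # replace spaces with hyphens
    return slug

def build_urls(title):
    """Return the (film page url, critic reviews url) pair for a title."""
    film_url = BASE_URL + title_to_slug(title)
    return film_url, film_url + "/critic-reviews"
```

Calling `build_urls` on the first title in the master list should reproduce the two urls printed above. Punctuation outside the listed set (for example the `!` in *Airplane!*) is left untouched, just as in the loop-based version.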


These url strings are used for the first approach to webscraping. The second approach entails using an automated web browser.

**Note:** To run this code, you will need to download a webdriver application and set it as shown below. 

Webdriver link: http://chromedriver.chromium.org/downloads

In [38]:
## edit this path to point to where your chromedriver is saved
browser = webdriver.Chrome("/Users/Justin/Dropbox/Python and SQL/chromedriver")
## open the automated browser to the metacritic movies page
browser.get('http://www.metacritic.com/movies')
## some websites deny get requests that do not include a browser-like User-Agent header.
## Metacritic is one of them, so this header is sent with every request.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

### Webscraping from metacritic
The loop below performs the following steps for each film:
* Request the critic reviews page at the url generated above.
* Check whether a webpage was found (`.status_code` is 200 when the request succeeds).
* If one was found, save both the film page and the review page.
* If one was not found, have the automated web browser insert the film title into the search bar at metacritic.com.
* If the search returns any results, click on the first one and check whether the film title on the retrieved webpage matches the title entered into the search bar.
* If the titles match, save the film webpage and the corresponding critic reviews webpage.
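
The decision logic in the steps above can be sketched separately from the network and browser code. This is a minimal illustration, not the notebook's implementation: `locate_film` and its three callable parameters (`fetch_status`, `search_first_result`, `page_title`) are hypothetical stand-ins for the `requests` and selenium calls, so the branching can be read and tested without contacting metacritic.com.

```python
def locate_film(title, direct_url, fetch_status, search_first_result, page_title):
    """Return the film url to scrape, or None if no trustworthy page is found.

    Hypothetical stand-ins for the real network and browser calls:
      fetch_status(url)          -> HTTP status code of a get request
      search_first_result(title) -> url of the first search hit, or None
      page_title(url)            -> film title displayed on that page
    """
    if fetch_status(direct_url) == 200:
        return direct_url                 # first approach: the generated url worked
    hit = search_first_result(title)      # second approach: use the site search
    if hit is None:
        return None                       # the search returned no results
    if page_title(hit) != title:
        return None                       # first hit is a different film
    return hit
```

Keeping the title comparison as the last gate means a loosely related first search hit is discarded rather than saved under the wrong film.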

**Note:** This loop takes a long time to complete (roughly 2 full days on my laptop). Errors on any attempt are logged so that the loop can be continued later and failed attempts can be double-checked.
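
The resume-and-log pattern from the note can be illustrated in isolation. The sketch below is an assumption-laden simplification: `run_resumable` and its `process` argument are hypothetical names, and `process` stands in for the per-film scraping work. Slots that already hold a result are skipped, and the index of any failing item is recorded once.

```python
def run_resumable(items, results, errors, process):
    """Fill results[i] = process(items[i]) for every slot that is still None.

    results and errors are modified in place so they survive an interruption
    and can be passed back in on the next run.
    """
    for i, item in enumerate(items):
        if results[i] is not None:    # already handled in an earlier run
            continue
        try:
            results[i] = process(item)
        except Exception:
            if i not in errors:       # log each failed index only once
                errors.append(i)
    return results, errors
```

Because a failed slot stays `None`, calling the function again with the same lists retries only the items that are still missing.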

In [None]:
browser.implicitly_wait(3)   ### wait up to 3 seconds when locating page elements
titles = film_list[0].tolist()  ##film titles as a plain list (film_list[x] would select column x of the DataFrame, not row x)
review_pages = [None] * len(titles) ##empty list to save review page information
film_pages = [None] * len(titles)  ##empty list to save film page information
error_index = list()  #empty list to save the index of any erroneous iteration
## when restarting after an interruption, skip the three list assignments above so earlier results are kept

for x in tqdm(range(0, len(titles))):   #tqdm shows a loop progress bar
    if film_pages[x] is None: ##skip films already handled so a restarted loop picks up where it ended
        try:
            review_pages[x] = requests.get(reviews_list_hyphens[x], headers=headers)
            if review_pages[x].status_code != 200:   #the generated url did not return a page, so fall back to the search bar
                searchBar = browser.find_element_by_id('primary_search_box')  #Selenium 3 method name
                searchBar.send_keys(titles[x])
                time.sleep(2)
                elem = browser.find_elements_by_class_name('search_results_item')
                time.sleep(2)
                if len(elem) > 0:  #check if any search results are returned
                    elem[0].click()  ##click on first item
                    time.sleep(3)
                    film_url = browser.current_url
                    critics_url = browser.current_url + '/critic-reviews'
                    review_pages[x] = requests.get(critics_url, headers=headers)
                    film_pages[x] = requests.get(film_url, headers=headers)
                    soup = BeautifulSoup(review_pages[x].content, 'html.parser')
                    header_check = soup.select("a > h1")  #film title heading on the retrieved page
                    header_check = str(header_check)
                    header_check = header_check.partition('</h1>')[0]
                    header_check = header_check.partition('<h1>')[2]
                    if header_check != titles[x]:  #reject the page if its title differs from the searched title
                        review_pages[x] = "No page found"
                        film_pages[x] = "No page found"
                else:
                    review_pages[x] = "No page found"
                    film_pages[x] = "No page found"
                    searchBar.send_keys(100 * Keys.BACKSPACE)  #clear the search bar for the next title
                    time.sleep(2)
            else:
                film_pages[x] = requests.get(film_list_hyphens[x], headers=headers)
        except Exception:  #Exception already covers RuntimeError and ConnectionError
            error_index.append(x)
            continue

## index of the first film without a saved review page (None if every film has been processed)
restart = next((i for i, page in enumerate(review_pages) if page is None), None)

Importantly, the results of such a time-consuming loop should be saved. The shelve module is used to store the objects and to reinstate them later.

In [None]:
import shelve 
s = shelve.open("films.requests.dat") 
s["film_pages"]= film_pages
s["review_pages"]= review_pages
s["error_index"]= error_index
s.close() 

## to reinstate the page objects for later parsing and insertion into a database
r = shelve.open("films.requests.dat") 
film_pages = r["film_pages"] 
review_pages = r["review_pages"] 
error_index = r["error_index"] 
r.close()
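
As a variation on the cell above, the save and reload steps can be wrapped in two small functions that use `shelve.open` as a context manager, so the shelf is closed even if an error occurs partway through. The function names `save_results` and `load_results` are hypothetical and not part of the notebook.

```python
import shelve

def save_results(path, **objects):
    """Store each keyword argument under its own key in a shelf file."""
    with shelve.open(path) as shelf:
        for name, value in objects.items():
            shelf[name] = value

def load_results(path, *names):
    """Read the named objects back, in the order requested."""
    with shelve.open(path) as shelf:
        return [shelf[name] for name in names]
```

For example, `save_results("films.requests.dat", film_pages=film_pages, review_pages=review_pages, error_index=error_index)` would write the three lists, and `film_pages, review_pages, error_index = load_results("films.requests.dat", "film_pages", "review_pages", "error_index")` would reinstate them in a later session.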