# Wisdom of the Tribe 
## Webscraping a Master Film List from imdb.com
### *Justin M. Olds* [github.com/jmolds](https://github.com/jmolds)
---
**Project Overview:** The ultimate purpose of this project is to develop a film recommendation system based on the match between a user's film preferences and the preferences of established film critics. This *Wisdom of the Tribe* approach contrasts with the *Wisdom of the Masses* approaches offered by many popular websites, such as Metacritic and Rotten Tomatoes.

---
In this notebook, I showcase how a master list of 19,000 film titles was webscraped from imdb.com. 

To begin, I will display an example of the webpages from which I scraped film titles. I chose IMDb not only because it likely has the largest film database available online, but also because it allows advanced searches based on release date.

![Example Advanced Search page](https://github.com/jmolds/widsom-of-the-tribe/blob/master/data-images-etc/imdb-page-example.JPG?raw=true)

On these pages, the HTML tags and classes that identify the film title text were located using the Google Chrome dev tools (e.g., Inspect Element). As shown below, the relevant class is named **'lister-item-header'**.
![Example Dev tools page](https://github.com/jmolds/widsom-of-the-tribe/blob/master/data-images-etc/imdb-devtools-example.JPG?raw=true)


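Next, I show the for loop that iterates over the search pages to collect the 500 top-grossing films (US) for each year from 1980 through 2017 and scrapes the film titles from each page. Note that `range(1980,2018)` excludes its end value, so the loop covers 38 years, which at 500 titles per year gives the 19,000 titles reported below.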
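
As a side note, the two search URLs per year can also be assembled from their query parameters rather than by string concatenation. The sketch below is only an illustration: `build_search_url` is a hypothetical helper, not part of the original notebook. It reuses the parameter names that appear in the notebook's URLs, drops the `ref_` parameter, and assumes that `start=1` is equivalent to `page=1` for the first page. Whether IMDb still accepts these parameters is not checked here.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title"


def build_search_url(year, start=1, count=250):
    """Hypothetical helper: build one advanced-search URL for a given year.

    The parameter names mirror the URLs used in this notebook; `start` is the
    1-based rank of the first film on the page (assumed equivalent to paging).
    """
    params = {
        "title_type": "feature",
        "release_date": f"{year}-01-01,{year}-12-31",
        "view": "simple",
        "sort": "boxoffice_gross_us,desc",
        "count": count,
        "start": start,
    }
    # safe="," leaves the commas unescaped, as in the hand-written URLs
    return BASE_URL + "?" + urlencode(params, safe=",")


# the two pages (films 1-250 and 251-500) for a single year
urls_1980 = [build_search_url(1980, start=s) for s in (1, 251)]
for url in urls_1980:
    print(url)
```

Keeping the parameters in a dictionary makes it easier to see what changes between the two pages (only `start`) and what changes between years (only `release_date`).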

In [1]:
import IPython
from IPython.display import HTML
from IPython.display import display
import requests   ##module for obtaining webpage html data
from bs4 import BeautifulSoup ##module for parsing webpage data based on tags etc. 
import time
from tqdm import tqdm ## module for displaying progress while running loops
import csv ## module for reading and writing python objects to and from csv files

film_list = list()   ### create empty list object to add film titles to
films_per_page_range = range(0,250) ### each page contained 250 film titles

## because each loop iteration scrapes two different imdb search pages, separate temporary lists
## were created and added (via .extend) to the master film_list object at the end of each iteration
temp_films_list1 = [None] * len(films_per_page_range) 
temp_films_list2 = [None] * len(films_per_page_range) 

year_range = range(1980,2018) ## the specific range of years to loop over (1980 through 2017;
                              ## the end value is excluded); these 'year' values were added to url strings
    
for year in tqdm(year_range): 
    y1 = year   ## save new iterator variable to concatenate with url strings
    ##url for first 250 films per year
    imdb_page_1 = "https://www.imdb.com/search/title?title_type=feature&release_date=" + str(y1) + "-01-01," + str(y1) + "-12-31&view=simple&sort=boxoffice_gross_us,desc&count=250&page=1&ref_=adv_nxt"
    ##url for second 250 films per year
    imdb_page_2 = "https://www.imdb.com/search/title?title_type=feature&release_date=" + str(y1) + "-01-01," + str(y1) + "-12-31&view=simple&sort=boxoffice_gross_us,desc&count=250&start=251&ref_=adv_nxt"
    #imdb_page_2 = "https://www.imdb.com/search/title?title_type=feature&release_date=" + str(y1) + "-01-01," + str(y1) + "-12-31&view=simple&sort=boxoffice_gross_us,desc&count=250&page=2&ref_=adv_nxt"
    page1 = requests.get(imdb_page_1)  ## obtain html for first 250 films
    page2 = requests.get(imdb_page_2)  ## obtain html for second 250 films
    soup1 = BeautifulSoup(page1.content, 'html.parser')  ##parse html object to obtain film titles
    soup2 = BeautifulSoup(page2.content, 'html.parser') 
    ## the 'lister-item-header' class was identified using the dev tools of google chrome
    lister_tag_list1 = soup1.find_all(class_='lister-item-header') ## find_all returns a list of all elements 
    lister_tag_list2 = soup2.find_all(class_='lister-item-header') ## with the class associated with film titles
    ## embedded loop to isolate the text for each film title returned within the 
    ## lister-tag objects
    for film in films_per_page_range:
        temp_films_list1[film] = str(lister_tag_list1[film])
        temp_films_list1[film] = temp_films_list1[film].partition('</a>')[0] 
        temp_films_list1[film] = temp_films_list1[film].partition('li_tt">')[2]
        temp_films_list2[film] = str(lister_tag_list2[film])
        temp_films_list2[film] = temp_films_list2[film].partition('</a>')[0]
        temp_films_list2[film] = temp_films_list2[film].partition('li_tt">')[2]
    film_list.extend(temp_films_list1) ##add film titles to film_list object
    film_list.extend(temp_films_list2)        
    
## loop to scan entire film list and convert the html entity for ampersands (&amp;) back to '&'
for x in range(0,len(film_list)):
    film_list[x] = film_list[x].replace("&amp;","&")

100%|██████████| 38/38 [10:51<00:00, 10.81s/it]
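
The title extraction in the inner loop can be tried offline on a single header string. The following is a minimal sketch, not the code that produced the list: `extract_title` is a hypothetical helper, and the `sample` markup is a simplified, made-up stand-in for one `lister-item-header` element (the title ID and the title itself are placeholders). It uses the same `'</a>'` and `'li_tt">'` markers as the loop above, and swaps the ampersand replacement for the standard-library `html.unescape`, which decodes other HTML entities as well.

```python
import html


def extract_title(header_html, open_marker='li_tt">', close_marker="</a>"):
    """Hypothetical helper: return the link text between the two markers.

    Returns None when either marker is missing instead of raising an error.
    """
    # everything before the first closing </a> tag
    head, found_close, _ = header_html.partition(close_marker)
    # everything after the end of the opening <a ...> tag
    _, found_open, title = head.partition(open_marker)
    if not (found_close and found_open):
        return None
    # decode HTML entities such as &amp; into plain characters
    return html.unescape(title)


# simplified, made-up markup standing in for one scraped header element
sample = (
    '<span class="lister-item-header">'
    '<a href="/title/tt0000000/?ref_=adv_li_tt">Salt &amp; Pepper</a>'
    "</span>"
)

print(extract_title(sample))
print(extract_title("<span>no link here</span>"))
```

Returning `None` for unexpected markup is one possible design choice; it would let a loop skip malformed entries, or pages that return fewer than 250 results, rather than stop on an indexing error.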


In [3]:
print(len(film_list)) #total number of films
print(film_list[0:100]) ##first 100 films from the list

19000
['Star Wars: Episode V - The Empire Strikes Back', 'Superman II', 'Nine to Five', 'Stir Crazy', 'Airplane!', 'Any Which Way You Can', 'Private Benjamin', "Coal Miner's Daughter", 'Smokey and the Bandit II', 'The Blue Lagoon', 'The Blues Brothers', 'Ordinary People', 'Urban Cowboy', 'Popeye', 'The Shining', 'Seems Like Old Times', "Cheech and Chong's Next Movie", 'Caddyshack', 'Friday the 13th', 'Brubaker', 'Little Darlings', 'Dressed to Kill', 'The Gods Must Be Crazy', 'The Jazz Singer', 'Bronco Billy', 'Raging Bull', 'The Long Riders', 'American Gigolo', 'Xanadu', 'My Bodyguard', 'The Fog', 'Altered States', 'Cruising', 'The Octagon', 'Windwalker', 'The Private Eyes', 'Herbie Goes Bananas', 'Honeysuckle Rose', 'The Final Countdown', 'Hero at Large', 'The Island', 'First Family', 'Raise the Titanic', 'Prom Night', 'The Nude Bomb', 'Oh, God! Book II', 'The Competition', 'Wholly Moses!', "The Last Flight of Noah's Ark", "It's My Turn", 'The Fiendish Plot of Dr. Fu Manchu', 'Stardus

Save the film_list object as a csv file for later use:

In [None]:
## write list object to csv file
with open('C:/Users/Justin/Dropbox/Python and SQL/film_list.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for x in range(0,len(film_list)):
        wr.writerow([film_list[x]])

## read csv file         
film_list = []
with open('C:/Users/Justin/Dropbox/Python and SQL/film_list.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    for row in spamreader:
        film_list.append(row[0])
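
The write and read steps above can be checked with a small round trip before running them on the full list. This is a sketch under a few assumptions: `save_titles` and `load_titles` are hypothetical helper names, the file is written to a temporary directory rather than the path used above, and an explicit `encoding="utf-8"` is added (the original calls rely on the platform default), which may matter for titles containing non-ASCII characters.

```python
import csv
import os
import tempfile


def save_titles(path, titles):
    """Write one title per row, quoting every field."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerows([title] for title in titles)


def load_titles(path):
    """Read the first column of every non-empty row back into a list."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row[0] for row in csv.reader(f) if row]


# a few titles from the scraped list, including a comma and an apostrophe
sample_titles = ["Airplane!", "Oh, God! Book II", "Coal Miner's Daughter"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "film_list.csv")
    save_titles(path, sample_titles)
    round_tripped = load_titles(path)

print(round_tripped == sample_titles)
```

Because every field is quoted on write, a title that contains a comma, such as 'Oh, God! Book II', comes back as a single value rather than being split into two columns.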