In [1]:
#Import relevant libraries
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import requests
from time import sleep
from random import randint
import pandas as pd
import numpy as np
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

### Nathan updating manual url list with url function. This should work now with all queries. 
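
The first cell imports HTTPAdapter and Retry, but none of the cells shown here use them. One way they might be wired up is sketched below; make_session, the retry count, the backoff factor, and the list of status codes are all illustrative assumptions rather than settings taken from this project.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(total_retries=3, backoff_factor=1.0):
    """Build a requests session that retries failed requests (hypothetical helper)."""
    # Retry transient failures; these status codes are an assumed, illustrative choice
    retry = Retry(total=total_retries,
                  backoff_factor=backoff_factor,
                  status_forcelist=[429, 500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Use the retrying adapter for both schemes
    session.mount('https://', adapter)
    session.mount('http://', adapter)
    return session

session = make_session()
```

A session built this way could stand in for the bare requests.get call, for example session.get(url), so that a temporary server error is retried instead of producing an empty row.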

In [2]:
#Writing a single function to extract all title urls from an IMDB search
def url_extractor(search_url):
    #setting up initial BeautifulSoup object from websearch 
    init_resp = urllib.request.urlopen(search_url)
    init_soup = BeautifulSoup(init_resp, 'html.parser')
    #extract the number of films that the query returned. Used to confirm at end and generate each url 
    number_of_films = int(str(init_soup.find_all('div', class_='desc')[0].find_all('span')[0]).split(' ')[2].replace(',',''))
    print(number_of_films)
    #each page has 50 movies, so set up a list of start offsets to iterate through the pages, and a blank list to store the final urls
    #the range ends at number_of_films + 1 so that a last page holding a single film (e.g. film 51 or 101) is not skipped
    iterative_urls = list(range(1, number_of_films + 1, 50))
    url_list = []
    #loop through the 50-spaced integer values to generate the entire list of needed search urls 
    for i in iterative_urls:
        # set url
        url = search_url + '&start=' + str(i)
        #set up the BeautifulSoup object for this specific page of the search
        resp = urllib.request.urlopen(url)
        soup = BeautifulSoup(resp, 'html.parser')
        #generating list of all links on this page
        links = [a.get('href') for a in soup.find_all('a', href=True)]
        # printing out where we are in the query to monitor efficiency
        print('Running query from {} to {}'.format(i, i+49))
        # checking each link in each search page for title/tt keyword
        for link in links:
            #when the length is 4 of the split title url, that means it is part of query 
            #when the length is 3, it means that the movie is ancillary to the actual search (prequel/sequel)
            if 'title/tt' in link and len(link.split('/')) == 4:
                #format the resulting url in the correct manner and append it to the final list if it is new
                title_link = 'https://www.imdb.com' + '/' + link.split('/')[1] + '/' + link.split('/')[2] + '/?ref_=adv_li_i'
                if title_link not in url_list:
                    url_list.append(title_link)
    # Final test to make sure that the length of query equals the length of returned list and returning the final list
    if len(url_list) == number_of_films:
        print('All urls have been extracted successfully')
    else: 
        print('WARNING: The number of films in this query was {}, but {} urls were returned'.format(number_of_films, len(url_list)))
    return url_list

In [3]:
url_list = url_extractor('https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2018-01-31&genres=action')

100
Running query from 1 to 50
Running query from 51 to 100
All urls have been extracted successfully
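
The page arithmetic and the link filter inside url_extractor can be tried in isolation, without contacting IMDB. The sketch below is illustrative only: page_starts and is_result_link are hypothetical helper names that are not part of the notebook, and ending the range at number_of_films + 1 is an assumption made here so that a count such as 101 still produces a third page.

```python
def page_starts(number_of_films, page_size=50):
    """Return the 1-based '&start=' offsets needed to cover every result."""
    return list(range(1, number_of_films + 1, page_size))

def is_result_link(href):
    """True for hrefs shaped like '/title/tt1825683/' (four parts when split on '/')."""
    return 'title/tt' in href and len(href.split('/')) == 4

print(page_starts(100))   # two pages of 50
print(page_starts(101))   # the single extra film needs a third page
print(is_result_link('/title/tt1825683/'))
print(is_result_link('/title/tt1825683/fullcredits/'))
```

Splitting '/title/tt1825683/' on '/' gives an empty string, 'title', the title id, and a trailing piece, which is why the extractor checks for a length of 4.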


<h1>Web Scraper</h1>

<ol> In this part of the project, we use web scraping to gather data from IMDB's web pages about every movie returned by the query, in a format that is usable for analysis. In particular, we collect each movie's title, genres, country, language, filming location, production company, runtime, budget information, release date, directors, writers, stars, rating, review count, and MPAA rating. </ol>
<ol> Since our goal is to scrape information from multiple web pages with the same HTML structure, we first had to understand that structure. To do this, we used Chrome's Developer Tools to inspect a single page. Right-clicking an element and choosing Inspect shows the HTML line that corresponds to the part of the page you clicked. Many HTML elements are nested within each tag, and for each feature we found a unique identifier for the information we wanted to extract.</ol>
<ol> To get the content of each page, we downloaded it with the requests library, which makes a GET request to IMDB's server and downloads the HTML contents of the page. The requests.get method returns a response object whose status_code equals 200 if the page was downloaded successfully. This let us build error catching into our code and issue a warning whenever a page was not downloaded successfully (status_code not equal to 200).</ol>
<ol> The next step is to extract the relevant text from the downloaded HTML documents. We used the BeautifulSoup library to parse them. We first created an instance of the BeautifulSoup class, which gives us a BeautifulSoup object with a nested data structure. The tag objects within this structure allowed us to extract the relevant information. We used the find_all method to find all instances of a tag in a page. Because find_all returns a list, we used list indexing to pick out an element and its text attribute to read the text. We also used find_all to search by class; for example, when collecting genres we search for any div whose class is 'see-more inline canwrap'. Once we are able to uniquely identify how to get each piece of information:</ol>
<ol>
<ul>
<li>we can loop through multiple urls,</li>
<li>place a get request within the loop for each page,</li>
<li>convert the response's HTML content to a BeautifulSoup object,</li>
<li>extract all containers from this object by using the find_all method (if there is no information, then continue to the next variable),</li>
<li>use list indexing to access the information,</li>
<li>and save this information into a pandas data frame.</li>
</ul>
</ol>
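
<ol> The download-and-parse steps described above can be sketched for a single page as follows. This is a minimal illustration, not the project's scraper: fetch_html and extract_genres are hypothetical helper names, the 30-second timeout is an assumption, and SAMPLE_HTML is a hand-written snippet that only mimics the class names mentioned in the text so that the parsing step can run without a network connection.</ol>

```python
import requests
from bs4 import BeautifulSoup

# Hand-written HTML (an assumption) that mimics the structure described above
SAMPLE_HTML = """
<div class="see-more inline canwrap">
  <h4 class="inline">Plot Keywords:</h4>
  <a href="/keyword/king">king</a>
</div>
<div class="see-more inline canwrap">
  <h4 class="inline">Genres:</h4>
  <a href="/genre/action">Action</a> | <a href="/genre/scifi">Sci-Fi</a>
</div>
"""

def fetch_html(url):
    """Download a page and return its HTML, or None if the status code is not 200."""
    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        print('WARNING: {} returned status code {}'.format(url, resp.status_code))
        return None
    return resp.text

def extract_genres(html):
    """Return the link texts found in the container whose heading mentions 'Genres'."""
    soup = BeautifulSoup(html, 'html.parser')
    # find_all returns a list of matching tags; search by tag name and class
    for block in soup.find_all('div', class_='see-more inline canwrap'):
        heading = block.find('h4', class_='inline')
        if heading is not None and 'Genres' in heading.text:
            return [a.text for a in block.find_all('a')]
    return []

print(extract_genres(SAMPLE_HTML))
```

Keeping the download and the parsing in separate functions makes it possible to test the selectors on saved HTML before sending any requests to the server.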
	
<ol>At the end of this process, we have a data frame in which each row holds the information for one scraped movie and each column corresponds to one of the variables we scraped.</ol>
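
<ol> How such a data frame fills up, one row per url, can be sketched on a reduced scale. The example below is an assumption-laden illustration: scrape_titles is a hypothetical function that collects only the title column, the fetch argument is a stand-in for the real download step, and fake_pages is made-up HTML keyed by placeholder names instead of real urls.</ol>

```python
import pandas as pd
from bs4 import BeautifulSoup

def scrape_titles(urls, fetch):
    """Loop over urls, parse each page, and store the <title> text in that url's row.

    `fetch` is a hypothetical callable that takes a url and returns HTML text,
    or None when the download failed.
    """
    # One row per url; cells start as NaN and stay NaN when nothing is found
    movies = pd.DataFrame(index=range(len(urls)), columns=['title'], dtype=object)
    for idx, url in enumerate(urls):
        html = fetch(url)
        if html is None:
            continue  # failed download: leave the row empty
        soup = BeautifulSoup(html, 'html.parser')
        containers = soup.find_all('title')
        if not containers:
            continue  # no information: move on to the next page
        movies.loc[idx, 'title'] = containers[0].text
    return movies

# Made-up pages standing in for downloaded HTML
fake_pages = {
    'page-a': '<html><head><title>Black Panther (2018) - IMDb</title></head></html>',
    'page-b': '<html><head></head></html>',
}
result = scrape_titles(list(fake_pages), fake_pages.get)
print(result)
```

The second row stays NaN because its page has no title tag, which mirrors how missing features appear as empty cells in the scraped output.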

In [4]:
#write a function that scrapes information from multiple urls
def scraper(urls):
    '''Scrape information about movies from IMDB's website

    Input
    =====
    urls: list of urls
    
    Output
    ======
    movies: dataframe containing the scraped movies as rows and the scraped features as columns
    
    '''
    
    #Create a dataframe to store scraped data in (features as columns)
    features = ['genre_0','genre_1','genre_2', 'genre_3','country','language','filming_locs','production_co','runtime',
    'budget','gross_usa','release','director_0','director_1','rating','star_0','star_1','star_2','star_3','writer_0',
    'writer_1','writer_2','review_count','title','open_week','cumulative', 'mpaa_rating']
    
    #number of rows of data frame = # of movies scraped, # of columns = # of features
    #cells are initialized as nan; object dtype lets them hold the scraped strings later
    movies = pd.DataFrame(index=range(len(urls)), columns=features, dtype=object)
    
    #create a list of rating we want to collect
    rating_list = ['   G', '   PG', 'PG-13', '   R', 'NC-17', 'Not Rated', 'Unrated', 
                   'TV-Y', 'TV-Y7', 'TV-G', 'TV-PG','TV-14', 'TV-MA']   
    
    #loop through urls and scrape relevant features
    for idx,url in enumerate(urls):
        
        resp = requests.get(url)#request content of the webpage and store in resp

        #Pause the loop
        sleep(randint(8,15))

        #Warn and skip this page if the status code is not 200
        if resp.status_code != 200:
            print('WARNING: Request: {}; Status code: {}'.format(url, resp.status_code))
            continue

        soup = BeautifulSoup(resp.text, 'html.parser')#use pythons built in html parser
        
        #collect mpaa ratings
        mpaa = soup.find_all('div', class_ = 'title_wrapper')[0].find_all('div', class_ = 'subtext')[0].text
        for i in rating_list:
            if i in mpaa:
                movies.loc[idx, 'mpaa_rating'] = i
                
        #get all genres (up to four; missing ones stay nan)
        genres_spec = soup.find_all('div', class_ = 'see-more inline canwrap')
        for spec in genres_spec:
            if 'Genres' in spec.find('h4', class_ = 'inline').text:
                for j, genre in enumerate(spec.find_all('a')[:4]):
                    movies.loc[idx, 'genre_{}'.format(j)] = genre.text

        #Get country, language, filming location, production co, budget info variables
        other_specs = soup.find_all('div', class_ = 'txt-block')
        for i in range(len(other_specs)):
            if other_specs[i].find('h4', class_ = 'inline') is None:
                pass
            elif 'Country' in other_specs[i].find('h4', class_ = 'inline').text:
                movies.loc[idx,'country'] = other_specs[i].a.text
            elif 'Language' in other_specs[i].find('h4', class_ = 'inline').text:
                movies.loc[idx,'language'] = other_specs[i].a.text
            elif 'Filming Locations' in other_specs[i].find('h4', class_ = 'inline').text:
                movies.loc[idx,'filming_locs'] = other_specs[i].a.text
            elif 'Production Co' in other_specs[i].find('h4', class_ = 'inline').text:
                movies.loc[idx,'production_co'] =other_specs[i].a.text
            elif 'Runtime' in other_specs[i].find('h4', class_ = 'inline').text:
                movies.loc[idx, 'runtime'] = other_specs[i].time.text
            elif 'Budget' in other_specs[i].find('h4', class_ = 'inline').text:
                try:
                    movies.loc[idx,'budget']= other_specs[i].text
                except: 
                    pass
            elif 'Gross USA' in other_specs[i].find('h4', class_ = 'inline').text:
                try:
                    movies.loc[idx,'gross_usa'] = other_specs[i].text
                except:
                    pass
            elif 'Opening Weekend USA' in other_specs[i].find('h4', class_ = 'inline').text:
                try:
                    movies.loc[idx,'open_week'] = other_specs[i].text
                except:
                    pass
            elif 'Cumulative Worldwide Gross' in other_specs[i].find('h4', class_ = 'inline').text:
                try:
                    movies.loc[idx,'cumulative'] = other_specs[i].text
                except:
                    pass

        #get release date
        try:
            movies.loc[idx,'release'] = soup.find_all('div', class_ = 'subtext')[0].find_all('a', title = "See more release dates")[0].text
        except:
            pass

        #Get director, writer and stars
        #keyword in the heading -> (column prefix, number of columns available)
        credit_slots = {'Director': ('director', 2), 'Writer': ('writer', 3), 'Star': ('star', 4)}
        movie_containers = soup.find_all('div', class_ = 'credit_summary_item')
        for container in movie_containers:
            label = container.find('h4', class_ = 'inline').text
            names = container.find_all('a')
            for keyword, (prefix, max_names) in credit_slots.items():
                if keyword in label:
                    #store the linked names in order; columns without a name stay nan
                    for j, name in enumerate(names[:max_names]):
                        movies.loc[idx, '{}_{}'.format(prefix, j)] = name.text

        #Get rating, review count and the title of the movie
        try:
            movies.loc[idx,'rating']=soup.find_all('span',{'itemprop':'ratingValue'})[0].text
        except:
            pass
        try:
            movies.loc[idx,'review_count']=soup.find_all('span',{'itemprop':'ratingCount'})[0].text
        except:
            pass
        movies.loc[idx,'title']=soup.find_all('title')[0].text
        
    return movies

In [5]:
movies_df  = scraper(url_list)

In [6]:
movies_df

Unnamed: 0,genre_0,genre_1,genre_2,genre_3,country,language,filming_locs,production_co,runtime,budget,...,star_3,writer_0,writer_1,writer_2,review_count,title,open_week,cumulative,mpaa_rating,url
0,Action,Adventure,Sci-Fi,,USA,English,Pinewood Atlanta Studios - 461 Sandy Creek Roa...,Marvel Studios,134 min,"\nBudget:$200,000,000\n (estimated)\n",...,See full cast & crew,Ryan Coogler,Joe Robert Cole,2 more credits,554539,Black Panther (2018) - IMDb,"\nOpening Weekend USA: $202,003,951,\n18 Febru...","\nCumulative Worldwide Gross: $1,346,913,161 ...",PG-13,https://www.imdb.com/title/tt1825683/?ref_=adv...
1,Action,Drama,History,War,USA,English,"Albuquerque, New Mexico, USA",Alcon Entertainment,130 min,"\nBudget:$35,000,000\n (estimated)\n",...,See full cast & crew,Ted Tally,Peter Craig,1 more credit,54908,12 Strong (2018) - IMDb,"\nOpening Weekend USA: $15,815,025,\n21 Januar...","\nCumulative Worldwide Gross: $67,450,815",R,https://www.imdb.com/title/tt1413492/?ref_=adv...
2,Action,Crime,Drama,Mystery,USA,English,"Atlanta, Georgia, USA",Atmosphere Entertainment MM,148 min,"\nBudget:$30,000,000\n (estimated)\n",...,See full cast & crew,Christian Gudegast,Christian Gudegast,1 more credit,76391,Den of Thieves (2018) - IMDb,"\nOpening Weekend USA: $15,206,108,\n21 Januar...","\nCumulative Worldwide Gross: $80,509,622",R,https://www.imdb.com/title/tt1259528/?ref_=adv...
3,Action,Mystery,Thriller,,France,English,"Sacramento, California, USA",StudioCanal,105 min,"\nBudget:$30,000,000\n (estimated)\n",...,See full cast & crew,Byron Willinger,Philip de Blasi,3 more credits,90154,The Commuter (2018) - IMDb,"\nOpening Weekend USA: $13,701,452,\n14 Januar...","\nCumulative Worldwide Gross: $119,942,387 ...",PG-13,https://www.imdb.com/title/tt1590193/?ref_=adv...
4,Action,Fantasy,Horror,Mystery,UK,English,Belgium,SpectreVision,121 min,"\nBudget:$6,000,000\n (estimated)\n",...,See full cast & crew,Panos Cosmatos,Aaron Stewart-Ahn,2 more credits,50589,Mandy (2018) - IMDb,"\nOpening Weekend USA: $225,723,\n16 September...","\nCumulative Worldwide Gross: $1,524,880",Not Rated,https://www.imdb.com/title/tt6998518/?ref_=adv...
5,Action,Drama,,,Canada,English,"Newfoundland, Canada",Braven NL,,"\nBudget:$5,000,000\n (estimated)\n",...,See full cast & crew,Thomas Pa'a Sibbett,Michael Nilon,1 more credit,24807,Braven (2018) - IMDb,,"\nCumulative Worldwide Gross: $823,471",R,https://www.imdb.com/title/tt5001754/?ref_=adv...
6,Action,Sci-Fi,Thriller,,USA,English,"Cape Town, South Africa",Gotham Group,143 min,"\nBudget:$62,000,000\n (estimated)\n",...,See full cast & crew,T.S. Nowlin,James Dashner,,97650,Maze Runner: The Death Cure (2018) - IMDb,"\nOpening Weekend USA: $24,167,011,\n28 Januar...","\nCumulative Worldwide Gross: $288,175,335 ...",PG-13,https://www.imdb.com/title/tt4500922/?ref_=adv...
7,Action,Comedy,Crime,Drama,USA,English,,BRON Studios,108 min,"\nBudget:$7,000,000\n (estimated)\n",...,See full cast & crew,Sam Levinson,,,11588,Assassination Nation (2018) - IMDb,"\nOpening Weekend USA: $1,050,021,\n23 Septemb...","\nCumulative Worldwide Gross: $2,584,988",R,https://www.imdb.com/title/tt6205872/?ref_=adv...
8,Action,Crime,Drama,,Canada,English,"Cleveland, Ohio, USA",Colecar Productions,86 min,,...,See full cast & crew,Nicolas Aaron Mezzanatto,,,8647,Acts of Violence (2018) - IMDb,,"\nCumulative Worldwide Gross: $386,790",R,https://www.imdb.com/title/tt6684714/?ref_=adv...
9,Action,Drama,,,USA,English,Thailand,Our House Films,110 min,"\nBudget:$13,000,000\n (estimated)\n",...,See full cast & crew,Jean-Claude Van Damme,Mark DiSalle,2 more credits,3817,Kickboxer: Retaliation (2018) - IMDb,"\nOpening Weekend USA: $3,061,\n28 January 2018","\nCumulative Worldwide Gross: $101,690",R,https://www.imdb.com/title/tt5208950/?ref_=adv...
