In [2]:
#Import relevant libraries
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import requests
from time import sleep
from random import randint
import pandas as pd
import numpy as np
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import datetime 

### Step 1: Extract urls from IMDB search
[IMDB Search](https://www.imdb.com/search/title/)

This project is motivated by the group's interest in movies and by the wish to pull specific datasets for analysis on demand. While IMDB does make some datasets publicly available, they are small (at most 8 columns) and would require significant merging to obtain what we are going for. Further, there is no box office information in these datasets, and earnings is the most significant variable that we would like to analyze. Beyond interest in this industry, this application is also motivated by the fact that web scraping is a valuable tool for every data scientist to have at their disposal. Any time a problem lacks data, there is potentially valuable information available somewhere on the internet, and being able to extract it, clean it, and test its validity allows for solving problems that may otherwise seem unsolvable.

The first step in this application is to take an input search (link above) and return a list of urls: one for each movie that satisfies the search parameters. To implement this function, we need the help of a few libraries: BeautifulSoup, urllib, and webbrowser (for debugging). The first thing to do is look at what the search returns in the actual webpage. Two things jump out: the total number of movies that match the search is listed at the top, and there are 50 movies per page. Therefore, we know two important values right off the bat: how many urls we expect to return and how many webpages we have to iterate over to get them all.

The function, therefore, begins by opening the search url with urllib and creating a BeautifulSoup object (i.e. a parsed HTML file) from the response. Right away, we extract the total number of films returned by the query: the find_all method in BeautifulSoup locates the element holding the count, we remove the comma from the number, and we convert the resulting string to an integer. We then print the number for the user's reference. From here, we need to understand how to move through each page of the query. Looking at page two shows that it is, fortunately, pretty simple: appending "&start=n" to the search url, where n is some number, gives a page of 50 entries starting at the nth entry. Getting the list of these start numbers is a simple list comprehension.
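
The count parsing and page arithmetic described above can be sketched without any network access. This is a minimal illustration rather than the notebook's exact code: the helper names `parse_film_count` and `page_start_urls` are hypothetical, the regular expression is a swapped-in alternative to splitting the string on spaces, and the header text `'1-50 of 1,647 titles.'` is an assumed format.

```python
import re


def parse_film_count(header_text):
    """Return the total from a header such as '1-50 of 1,647 titles.' (assumed format)."""
    match = re.search(r"of\s+([\d,]+)", header_text)
    if match is None:
        raise ValueError("no film count found in {!r}".format(header_text))
    # Drop the thousands separator before converting to an integer.
    return int(match.group(1).replace(",", ""))


def page_start_urls(search_url, number_of_films, page_size=50):
    """Build one url per results page by appending '&start=n'."""
    # The upper bound is number_of_films + 1 so that a final page
    # holding a single film is still included.
    starts = range(1, number_of_films + 1, page_size)
    return ["{}&start={}".format(search_url, n) for n in starts]


total = parse_film_count("1-50 of 1,647 titles.")
urls = page_start_urls("https://www.imdb.com/search/title/?title_type=feature", total)
print(total, len(urls))
print(urls[0])
print(urls[-1])
```

With 1,647 films and 50 films per page, this yields 33 page urls, the last one starting at entry 1601.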

From there, we iterate over each of these urls, creating a new BeautifulSoup object each time and grabbing all links from the page using the find_all and get methods. From the list of links, we look for the specific string 'title/tt' that corresponds to movie webpages and add each match to the list of final urls, after checking that it is not already there. Once the loop has gone through every page and added all urls, a test is run to ensure that the number of movies in the search matches the length of the final url list.
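
The collection step can be illustrated on plain lists of href strings. This is a sketch under assumptions: the hrefs are made-up examples rather than real IMDB output, `collect_title_links` is a hypothetical helper, and the `seen` set is a design choice that keeps the duplicate check constant-time while the list preserves the order in which links were found.

```python
def collect_title_links(pages):
    """Gather unique links containing 'title/tt' from a sequence of pages.

    pages: iterable of lists of href strings, one list per results page.
    """
    url_list = []
    seen = set()
    for hrefs in pages:
        for href in hrefs:
            if "title/tt" in href and href not in seen:
                seen.add(href)
                url_list.append(href)
    return url_list


# Hypothetical hrefs standing in for two scraped pages.
pages = [
    ["/title/tt0000001/", "/name/nm0000001/", "/title/tt0000001/"],
    ["/title/tt0000002/", "/title/tt0000001/", "/search/title/"],
]
links = collect_title_links(pages)

# Final check: compare the number collected with the number expected.
expected_count = 2
if len(links) == expected_count:
    print("All urls have been extracted successfully")
else:
    print("WARNING: expected {}, got {}".format(expected_count, len(links)))
```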

Initially the length of the list exceeded the number of movies in the search, so some debugging was required. In some cases there was only one extra link, while in others there were more than 100. The webbrowser library was used to open every 50th link to see which page was adding extra information. In hindsight, a more efficient option would have been to print how many items were added to the list in each loop and look for values over 50. Regardless, the issue was that some movies had a link to sequels/prequels included with them. These links did not show up on the webpage but were included in the html file. Further, each valid url was listed 3 separate times, one of them with an extra directory in the url. Since the valid links were the only title links with a length of 4 when split on '/', that is ultimately how we filter for valid urls. It also avoids checking the list for duplicate values twice, which improves the runtime.
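
Both ideas from this paragraph, the per-page count used for debugging and the split-length filter, can be sketched on toy data. Everything here is illustrative: the hrefs are hypothetical, the helper names `is_result_link` and `new_links_per_page` are not part of the notebook, and the page size is shrunk to 2 so the example stays short.

```python
def is_result_link(href):
    """True for title links that split into exactly 4 pieces on '/'."""
    return "title/tt" in href and len(href.split("/")) == 4


def new_links_per_page(pages):
    """Count how many previously unseen title links each page contributes."""
    seen = set()
    counts = []
    for hrefs in pages:
        before = len(seen)
        seen.update(h for h in hrefs if "title/tt" in h)
        counts.append(len(seen) - before)
    return counts


# Hypothetical hrefs; the second page carries one ancillary link.
pages = [
    ["/title/tt0000001/", "/title/tt0000002/", "/title/tt0000001/"],
    ["/title/tt0000003/", "/title/tt0000004/", "/title/tt0000009"],
]
page_size = 2  # stands in for the 50 films per page of the real search

counts = new_links_per_page(pages)
flagged = [i for i, n in enumerate(counts) if n > page_size]
print("new links per page:", counts)
print("pages adding too many links:", flagged)

# The split-length filter removes the ancillary link from the flagged page.
print([h for h in pages[1] if is_result_link(h)])
```

A page that contributes more links than the page size is the one to inspect, and the split-length check then shows which of its links do not belong to the result list.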

In [3]:
#Writing a single function to extract all title urls from an IMDB search
def url_extractor(search_url):
    #setting up initial BeautifulSoup object from websearch
    init_resp = urllib.request.urlopen(search_url)
    init_soup = BeautifulSoup(init_resp, 'html.parser')
    #extract the number of films that the query returned. Used to confirm at end and generate each url
    count_span = init_soup.find_all('div', class_='desc')[0].find_all('span')[0]
    number_of_films = int(str(count_span).split(' ')[2].replace(',', ''))
    print(number_of_films)
    #each page has 50 movies, so list the start index of every page
    #(the +1 keeps a final page that holds a single film from being skipped)
    iterative_urls = list(range(1, number_of_films + 1, 50))
    #blank list to store final urls
    url_list = []
    #loop through the 50-spaced integer values to generate entire list of needed search urls
    for i in iterative_urls:
        # set url
        url = search_url + '&start=' + str(i)
        #set up the BeautifulSoup object for this specific page of the search
        resp = urllib.request.urlopen(url)
        soup = BeautifulSoup(resp, 'html.parser')
        #generating list of all links on this page
        links = [a.get('href') for a in soup.find_all('a', href=True)]
        # printing out where we are in the query to monitor progress
        print('Running query from {} to {}'.format(i, i + 49))
        # checking each link in each search page for title/tt keyword
        for link in links:
            parts = link.split('/')
            #when the length is 4 of the split title url, that means it is part of query
            #any other length means the link is ancillary to the actual search (e.g. prequel/sequel)
            if 'title/tt' in link and len(parts) == 4:
                # format the resulting url in the correct manner and append it to final list
                title_link = 'https://www.imdb.com/' + parts[1] + '/' + parts[2] + '/?ref_=adv_li_i'
                if title_link not in url_list:
                    url_list.append(title_link)
    # Final test to make sure that the length of query equals the length of returned list and returning the final list
    if len(url_list) == number_of_films:
        print('All urls have been extracted successfully')
    else:
        print('WARNING: The number of films in this query was {}, but {} urls were returned'.format(number_of_films, len(url_list)))
    return url_list

In [8]:
times = []
for i in range(10):
    start_time = datetime.datetime.now()
    my_test = url_extractor('https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2018-03-01')
    end_time = datetime.datetime.now()
    times.append(end_time - start_time)

1647
Running query from 1 to 50
Running query from 51 to 100
Running query from 101 to 150
Running query from 151 to 200
Running query from 201 to 250
Running query from 251 to 300
Running query from 301 to 350
Running query from 351 to 400
Running query from 401 to 450
Running query from 451 to 500
Running query from 501 to 550
Running query from 551 to 600
Running query from 601 to 650
Running query from 651 to 700
Running query from 701 to 750
Running query from 751 to 800
Running query from 801 to 850
Running query from 851 to 900
Running query from 901 to 950
Running query from 951 to 1000
Running query from 1001 to 1050
Running query from 1051 to 1100
Running query from 1101 to 1150
Running query from 1151 to 1200
Running query from 1201 to 1250
Running query from 1251 to 1300
Running query from 1301 to 1350
Running query from 1351 to 1400
Running query from 1401 to 1450
Running query from 1451 to 1500
Running query from 1501 to 1550
Running query from 1551 to 1600
Running query from 1601 to 1650
All urls have been extracted successfully
[output truncated; the log repeats for the remaining runs]

In [9]:
times

[datetime.timedelta(seconds=29, microseconds=904251),
 datetime.timedelta(seconds=45, microseconds=839683),
 datetime.timedelta(seconds=23, microseconds=486884),
 datetime.timedelta(seconds=20, microseconds=574045),
 datetime.timedelta(seconds=20, microseconds=720692),
 datetime.timedelta(seconds=23, microseconds=879749),
 datetime.timedelta(seconds=24, microseconds=616520),
 datetime.timedelta(seconds=26, microseconds=448705),
 datetime.timedelta(seconds=24, microseconds=201703),
 datetime.timedelta(seconds=20, microseconds=44910)]
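
The raw timedelta list is easier to read once summarized. The following sketch copies the ten values printed above and reduces them with the standard `statistics` module; the variable names `seconds`, `mean_s`, and `median_s` are chosen for illustration only.

```python
import datetime
import statistics

# The ten run times printed above.
times = [
    datetime.timedelta(seconds=29, microseconds=904251),
    datetime.timedelta(seconds=45, microseconds=839683),
    datetime.timedelta(seconds=23, microseconds=486884),
    datetime.timedelta(seconds=20, microseconds=574045),
    datetime.timedelta(seconds=20, microseconds=720692),
    datetime.timedelta(seconds=23, microseconds=879749),
    datetime.timedelta(seconds=24, microseconds=616520),
    datetime.timedelta(seconds=26, microseconds=448705),
    datetime.timedelta(seconds=24, microseconds=201703),
    datetime.timedelta(seconds=20, microseconds=44910),
]

# Convert each timedelta to a float number of seconds.
seconds = [t.total_seconds() for t in times]
mean_s = statistics.mean(seconds)
median_s = statistics.median(seconds)

print("mean   {:.2f} s".format(mean_s))
print("median {:.2f} s".format(median_s))
print("min    {:.2f} s".format(min(seconds)))
print("max    {:.2f} s".format(max(seconds)))
```

For these ten runs the mean is about 26 seconds and the median about 24 seconds; the single 45.8-second run pulls the mean above the median.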