# PA Project - Team 5 : Code File 1/3
## Caglar Dogan - Ekantika Singh - Gurmehr Sohi

The code in this file scrapes the IMDB website for information about the 2000 most popular feature movies (ranked by the IMDB popularity score) that were produced in the USA, are available in English, and were released between 2016-01-01 and 2019-12-31.

We limited the scrape to the first 2000 movies to reduce the time needed to gather the relevant data. We noticed that the great majority of less popular movies do not list US opening weekend gross information and are therefore not usable for our models. Restricting the web scraping process to 2000 movies thus costs few valid data points while drastically reducing the time needed to complete the process.
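
One way to check this coverage claim after scraping is to measure, for each block of popularity ranks, the share of movies with a non-empty opening weekend value. The sketch below is only an illustration: `coverage_by_block` is a hypothetical helper that is not part of the original notebook, and the sample list is made-up toy data, not scraped results.

```python
def coverage_by_block(values, block_size=250):
    """Return the share of non-empty entries for each consecutive block.

    `values` is assumed to be ordered by popularity rank, with "" marking
    a movie whose opening weekend gross was not found.
    """
    shares = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        filled = sum(1 for v in block if v != "")
        shares.append(filled / len(block))
    return shares


# Toy data (NOT real results): 4 movies, checked in blocks of 2
toy = ["$1,000", "$2,000", "", "$500"]
print(coverage_by_block(toy, block_size=2))
```

If the claim holds, the shares for later blocks of the real `openingWeekendUsAndCanada` list should drop noticeably compared to the first blocks.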

The result is saved into a CSV file with the name "IMDB_movie_data.csv". Please note that the data in this file must be cleaned, and relevant sentiment information must be added before the use of this data in modeling.
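
As a rough idea of what the cleaning step might involve, the sketch below converts a scraped money string into an integer. It assumes the strings look like "$200,000,000 estimated" (the scraping code strips the parentheses); the helper name `parse_dollar_amount` and the decision to return `None` for non-dollar amounts are assumptions for illustration, not the cleaning logic used in the other code files.

```python
import re


def parse_dollar_amount(raw):
    """Convert a string such as "$200,000,000 estimated" to an int.

    Returns None for empty strings and for amounts that do not start
    with a dollar sign, since this sketch does not convert currencies.
    """
    match = re.match(r"^\$(\d[\d,]*)", raw.strip())
    if match is None:
        return None
    # Drop the thousands separators before converting
    return int(match.group(1).replace(",", ""))


print(parse_dollar_amount("$200,000,000 estimated"))
print(parse_dollar_amount(""))
```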

In [7]:
#importing required Libraries
import numpy as np
import pandas as pd 
import requests #to send HTTP requests
from bs4 import BeautifulSoup #to parse the files acquired through requests

First, we get the name, run time, and URL information of the relevant movies with an IMDB search.

This part builds upon the IMDB web scraping tutorial: https://www.youtube.com/watch?v=I5L3OJ-xtsw

In [8]:
#creating empty lists to store the values scraped
movie_name = []
time = []
urls=[]
completeUrl=[]

budget = []
openingWeekendUsAndCanada = []
weekendDate=[]

release_dates_begin = "2016-01-01"
release_dates_end = "2019-12-31"

#i is the popularity rank of the first movie on each results page
#For example, at i = 251, information for movies ranked 251-500 will be returned
i_max = 1999
print("Max i:", str(i_max))
print("---")
for i in range(1, i_max, 250):
    print("Current i:", str(i))
    #Assign the URL to be used for a request
    #(will change as i changes to get later pages from the IMDB search results)
    url = 'https://www.imdb.com/search/title/?title_type=feature&release_date='+release_dates_begin+','+release_dates_end+'&countries=us&languages=en&adult=include&start='+str(i)+'&count=250'
    #Use a GET request to get the IMDB search results
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')


    #Store all the individual URL information for movies by scraping it from the relevant headers
    movies = soup.findAll('h3', attrs={'class' : 'lister-item-header'})
    for movie in movies:
        urls.append(movie.a['href'])
        completeUrl.append('https://www.imdb.com'+movie.a['href'])

    #Store the name and run time of each movie by scraping them from the relevant divs
    movie_data = soup.findAll('div', attrs= {'class': 'lister-item mode-advanced'})
    for store in movie_data:
        name = store.h3.a.text
        movie_name.append(name)
        #Keep only the text of the run time span, or "" if no run time is listed
        runtime = store.p.find('span', class_ = 'runtime')
        time.append(runtime.text if runtime is not None else "")

Max i: 1999
---
Current i: 1
Current i: 251
Current i: 501
Current i: 751
Current i: 1001
Current i: 1251
Current i: 1501
Current i: 1751
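
The paging logic above can also be written with the query parameters spelled out, which makes it easier to see why eight requests of 250 results cover ranks 1 to 2000. This is a minimal sketch, not the notebook's code: `build_search_url` and `BASE_URL` are hypothetical names, and the parameter values are copied from the URL used in the loop.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"


def build_search_url(start, count=250,
                     date_begin="2016-01-01", date_end="2019-12-31"):
    """Build the search URL for the results page that begins at rank `start`."""
    params = {
        "title_type": "feature",
        "release_date": f"{date_begin},{date_end}",
        "countries": "us",
        "languages": "en",
        "adult": "include",
        "start": start,
        "count": count,
    }
    # safe="," keeps the comma in the date range unescaped, as in the original URL
    return BASE_URL + "?" + urlencode(params, safe=",")


# Same start values as range(1, i_max, 250) with i_max = 1999
starts = list(range(1, 1999, 250))
print(starts)
print(build_search_url(starts[1]))
```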


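Now, we scrape the budget and the opening weekend gross in the US and Canada, together with the opening weekend date, from the page of each movie.
(Please note that this process is likely to take a long time, since it sends one request per movie.)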

In [9]:
k = 0
for url in completeUrl:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    print('Current iteration:', str(k))
    k = k + 1
    #budget
    individual_movie_data = soup.findAll('li', attrs= {'data-testid': 'title-boxoffice-budget'})  
    if(individual_movie_data == []):
        budget.append("")
    else:
        #use only the first matching entry (the loop breaks after one iteration)
        for store in individual_movie_data:
            budgetsub = store.find('span', class_ = 'ipc-metadata-list-item__list-content-item').text.replace('(', '').replace(')', '') 
            budget.append(budgetsub)
            break

    #opening weekend gross and date information for US and Canada
    individual_movie_data = soup.findAll('li', attrs= {'data-testid': 'title-boxoffice-openingweekenddomestic'})
    if(individual_movie_data == []):
        openingWeekendUsAndCanada.append("")
        weekendDate.append("")
    #otherwise, use only the first matching entry
    else:
        for store in individual_movie_data:    
            elems = store.find_all("li")
            weekend_gross = elems[0].text.replace('(', '').replace(')', '')
            weekend_date = elems[1].text.replace('(', '').replace(')', '')
            openingWeekendUsAndCanada.append(weekend_gross)
            weekendDate.append(weekend_date)
            break
 

Current iteration: 0
Current iteration: 1
Current iteration: 2
Current iteration: 3
Current iteration: 4
Current iteration: 5
Current iteration: 6
Current iteration: 7
Current iteration: 8
Current iteration: 9
Current iteration: 10
Current iteration: 11
Current iteration: 12
Current iteration: 13
Current iteration: 14
Current iteration: 15
Current iteration: 16
Current iteration: 17
Current iteration: 18
Current iteration: 19
Current iteration: 20
Current iteration: 21
Current iteration: 22
Current iteration: 23
Current iteration: 24
Current iteration: 25
Current iteration: 26
Current iteration: 27
Current iteration: 28
Current iteration: 29
Current iteration: 30
Current iteration: 31
Current iteration: 32
Current iteration: 33
Current iteration: 34
Current iteration: 35
Current iteration: 36
Current iteration: 37
Current iteration: 38
Current iteration: 39
Current iteration: 40
Current iteration: 41
Current iteration: 42
Current iteration: 43
Current iteration: 44
Current iteration: 4
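
Because this loop sends one request per movie, a single failed request can interrupt a long run. The sketch below shows one possible way to retry a request a few times before giving up. It is an assumption about how the loop could be hardened, not part of the original notebook: `fetch_with_retries` is a hypothetical helper, and `fetch` stands for any callable that performs the request.

```python
import time as time_module  # aliased because the notebook uses `time` as a list name


def fetch_with_retries(fetch, url, max_attempts=3, pause_seconds=1.0):
    """Call fetch(url) until it succeeds or max_attempts is reached.

    `fetch` is any callable that returns a response or raises an exception.
    Returns None if every attempt fails, so the caller can record "".
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            # Wait before the next attempt, but not after the last one
            if attempt < max_attempts:
                time_module.sleep(pause_seconds)
    return None
```

In the notebook, `fetch` could be a small wrapper such as `lambda u: requests.get(u, timeout=10)`. Note that `requests.get` does not raise for HTTP error status codes on its own, so the wrapper would need to call `raise_for_status()` on the response for those to be retried as well.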

Now, the scraped information can be brought together to form a DataFrame and be stored in a .csv file as follows:

In [10]:
#Column names are kept exactly as written, since later steps may rely on them
FinalDataFrame = pd.DataFrame({
    'Name of movie': movie_name,
    'Watchtime': time,
    'Budget': budget,
    'Opening Weekend Us And Canada': openingWeekendUsAndCanada,
    'Openning Weekend Date' : weekendDate,
})

FinalDataFrame.to_csv("IMDB_movie_data.csv")
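
Building a DataFrame from a dictionary of lists raises a `ValueError` when the lists have different lengths, which can happen here if one of the scraping loops was interrupted or skipped an entry. A small check before the `pd.DataFrame` call can make the cause easier to spot. This is a minimal sketch with a hypothetical helper name, `check_equal_lengths`, and toy input, not part of the original notebook.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length.

    `columns` maps column names to lists. Returns the common length.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Toy input (not scraped data) with matching lengths
print(check_equal_lengths({"Name of movie": ["A", "B"], "Budget": ["", "$1,000"]}))
```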