# Project-1: Using web scraping to build a database of movie related information from: The Movie Database (TMDB) movie data

**Problem statement**:
A common business requirement in the context of information gathering is to extract and filter relevant
data from web pages that host this information. However, access to information spread over several
web pages, hosted potentially on multiple websites is a cumbersome process and we cannot rely on
manual procedures to execute this task. In this project, you will employ a programmatic approach to
access, parse and extract relevant information from a website of interest. 

**Objective**:
The project's goal is to extract data (from a chosen number of pages) from The Movie Database website
(https://www.themoviedb.org/) into a tabular data format so that further analysis (e.g., details about a
movie's genre, cast, and user rating) can be facilitated.
To execute this project, you will have to read the documentation links provided against each task in the
assignment and adapt the code examples provided in the documentation to the task at hand.

##### **Pre-requisites** :
 * **Tools**: Jupyter Notebook or Google Colab or Microsoft Visual Studio IDE
 * **Languages**: Python, HTML
 * **Libraries**: requests, BeautifulSoup (bs4), pandas

1. Establish a connection to the webpage - "https://www.themoviedb.org/movie" - and provide the following details ( 4 marks )
     1. Import the requests library (https://requests.readthedocs.io/en/latest/ ) and formulate a get request to download the contents of the webpage ( " https://www.themoviedb.org/movie " ) ( 1 mark )

In [1]:
import requests
resp = requests.get("https://www.themoviedb.org/movie")
resp

<Response [403]>

Note that the server returned a 403 (Forbidden) status code rather than 200. The request therefore needs browser-like headers (here, a `User-Agent`) to retrieve the page.

In [2]:
needed_headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36(KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
response = requests.get("https://www.themoviedb.org/movie", headers=needed_headers)
response

<Response [200]>
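One plausible reason for the 403 is that, without explicit headers, requests identifies itself with its own default `User-Agent`. The sketch below prints that default and shows how a `requests.Session` could carry the browser-like header for every later call. The name `BROWSER_HEADERS` is hypothetical, and no network request is sent here.

```python
import requests

# The User-Agent that requests sends when no headers are supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # starts with "python-requests/"

# Hypothetical constant holding the same header used in the cell above.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36(KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"
}

# A Session applies its headers to every request made through it.
session = requests.Session()
session.headers.update(BROWSER_HEADERS)
print(session.headers["User-Agent"])

# Usage (not executed here, as it needs network access):
# response = session.get("https://www.themoviedb.org/movie", timeout=10)
```

Using a session is only a convenience. Passing `headers=` on each call, as done above, behaves the same for a single request.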

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;B. Verify the status code of the request and confirm that the request was executed appropriately (https://requests.readthedocs.io/en/latest/user/quickstart/#responsestatus-code) ( 1 mark )

In [3]:
assert(response.status_code == 200)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;C. Print the contents of the page obtained from the response and save it in a variable (https://requests.readthedocs.io/en/latest/user/quickstart/#response-content) ( 1 mark )

In [4]:
response.text

'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Popular Movies &#8212; The Movie Database (TMDB)</title>\n    <meta http-equiv="cleartype" content="on">\n    <meta charset="utf-8">\n    <meta name="keywords" content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast">\n    <meta name="mobile-web-app-capable" content="yes">\n    <meta name="apple-mobile-web-app-capable" content="yes">\n    <meta name="viewport" content="width=device-width,initial-scale=1">\n      <meta name="description" content="The Movie Database (TMDB) is a popular, user editable database for movies and TV shows.">\n    <meta name="msapplication-TileImage" content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png">\n<meta name="msapplication-TileColor" content="#032541">\n<meta name="theme-color" content="#032541">\n<link rel="apple-touch-icon" sizes="180x180" href=

In [5]:
html_content = response.text

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;D. Infer the type of the variable created in part 1c and display the first 200 characters of the content from the server’s response ( 1 Mark )

In [6]:
print("The content of the response is of type ", response.headers['content-type'])
print("The content of extracted data is of type ", type(html_content))


The content of the response is of type  text/html;charset=utf-8
The content of extracted data is of type  <class 'str'>


In [7]:
html_content[:200]

'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Popular Movies &#8212; The Movie Database (TMDB)</title>\n    <meta http-equiv="cleartype" content="on">\n    <meta charset="utf-8">\n  '

 2. Parse the content of HTML response using the BeautifulSoup library and execute the tasks specified in the guidelines mentioned below ( 6 marks )
    1. From the BeautifulSoup library (bs4) import the BeautifulSoup class. Pass the contents of the webpage obtained from step 1c as an argument to create an instance of the BeautifulSoup class ( 2 Marks )

In [8]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
type(soup)

bs4.BeautifulSoup

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; B. Extract the title of the parsed web page content using an appropriate method or attribute of the document object created in part 2a (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tagnames) ( 1 Mark )

In [9]:
print("The title of the page is ", soup.title.string)

The title of the page is  Popular Movies — The Movie Database (TMDB)
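The raw HTML shown earlier stores the title with the entity `&#8212;`, while the parsed title prints a dash character. A minimal offline sketch of this behaviour follows, using a hand-written HTML string (an assumption that mirrors only the `<title>` element of the real page):

```python
from bs4 import BeautifulSoup

# Hand-written sample that copies only the <title> element of the page.
sample_html = (
    "<html><head>"
    "<title>Popular Movies &#8212; The Movie Database (TMDB)</title>"
    "</head><body></body></html>"
)

sample_soup = BeautifulSoup(sample_html, "html.parser")

# .title navigates to the first <title> tag; .string returns its text,
# with the HTML entity already decoded to a Unicode character.
page_title = sample_soup.title.string
print(page_title)
print(sample_soup.title.name)
```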


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; C. Write a user-defined function that generalizes the task presented in Q2a to any URL. Your function should take a URL string as input, retrieve the content of the webpage, and return a correctly formulated BeautifulSoup instance as the output. In your function definition, ensure that appropriate exceptions are raised to the user (through status codes) if they pass in malformed/incorrect URLs. Write two test cases for your function - one with a working URL and another with a URL that gets a 404 response. ( 3 marks )

In [10]:
NEEDED_HEADERS = {'User-Agent': "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36(KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
def get_beautiful_soup_instance_from_url(url, needed_headers=NEEDED_HEADERS): 
    response = requests.get(url, headers = needed_headers)
    response.raise_for_status()
    if response.status_code != 200:  # raise_for_status() only raises for 4xx/5xx, so any other non-200 code is rejected here
        raise Exception(f'Got unexpected status code {response.status_code}')
    content_type = response.headers.get('content-type', '')  # default to '' so a missing header does not cause a TypeError
    if "text/html" in content_type:
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup
    else:
        raise Exception(f'The request returned content-type {content_type!r}. The content must be text/html to be converted into a BeautifulSoup instance')

Testing for valid URL

In [11]:
assert isinstance(get_beautiful_soup_instance_from_url("https://www.themoviedb.org/movie"), BeautifulSoup)

Testing for invalid URL. 

In [12]:

try:
    # Code that might raise an error
    result = get_beautiful_soup_instance_from_url("https://www.themoviedb.org/movie/unknown")
    print("There was no error from accessing invalid url https://www.themoviedb.org/movie/unknown.")
except Exception as e:
    # Handle any type of exception
    print(f"We got the Error: {e}")

We got the Error: 404 Client Error: Not Found for url: https://www.themoviedb.org/movie/unknown


For the invalid URL, the function raises an error based on the response code: **HTTPError: 404 Client Error: Not Found for url: https://www.themoviedb.org/movie/unknown**
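The two failure modes named in the task, a malformed URL and an incorrect URL, surface as different exceptions in requests. The sketch below reproduces both without network access. The `Response` object is built by hand purely for illustration (an assumption, since a real response comes from the server).

```python
import requests

# 1) A malformed URL (no scheme) fails before any network traffic is sent.
try:
    requests.get("www.themoviedb.org/movie")
except requests.exceptions.MissingSchema as err:
    malformed_error = err
    print(f"Malformed URL: {err}")

# 2) raise_for_status() converts a 4xx/5xx status into an HTTPError.
#    This Response is constructed manually to imitate a 404 reply.
fake_response = requests.models.Response()
fake_response.status_code = 404
fake_response.reason = "Not Found"
fake_response.url = "https://www.themoviedb.org/movie/unknown"
try:
    fake_response.raise_for_status()
except requests.exceptions.HTTPError as err:
    http_error = err
    print(f"HTTP error: {err}")
```

Both exception classes derive from `requests.exceptions.RequestException`, so a caller that wants one handler for every requests failure could catch that base class instead of the bare `Exception`.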

3. Extract the content of the webpage - https://www.themoviedb.org/movie - that hosts a current dated listing of popular movies. ( 5 Marks )
    1. Write a function call to the user defined function created in 2c with the url https://www.themoviedb.org/movie as an input and store the response in a variable ( 1 mark )

In [13]:
BASE_MOVIELIST_URL = "https://www.themoviedb.org"
soup_main = get_beautiful_soup_instance_from_url(BASE_MOVIELIST_URL + "/movie")
type(soup_main)

bs4.BeautifulSoup

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; B. Print the HTML content associated with the first movie displayed on the web page using appropriate HTML tags to access this listing on the object created in part 3a ( 1 mark )

In [14]:
movie_obj = soup_main.find('div', "card style_1")

In [15]:
print(movie_obj.prettify())

<div class="card style_1">
 <div class="image">
  <div class="wrapper">
   <a class="image" href="/movie/901362" title="Trolls Band Together">
    <img alt="" class="poster" loading="lazy" src="/t/p/w220_and_h330_face/bkpPTZUdq31UGDovmszsg2CchiI.jpg" srcset="/t/p/w220_and_h330_face/bkpPTZUdq31UGDovmszsg2CchiI.jpg 1x, /t/p/w440_and_h660_face/bkpPTZUdq31UGDovmszsg2CchiI.jpg 2x"/>
   </a>
  </div>
  <div class="options" data-id="901362" data-media-type="movie" data-object-id="619bea97c0ae360089136cff">
   <a class="no_click" href="#">
    <div class="glyphicons_v2 circle-more white">
    </div>
   </a>
  </div>
 </div>
 <div class="content">
  <div class="consensus tight">
   <div class="outer_ring">
    <div class="user_score_chart 619bea97c0ae360089136cff" data-bar-color="#21d07a" data-percent="72.0" data-track-color="#204529">
     <div class="percent">
      <span class="icon icon-r72">
      </span>
     </div>
    </div>
   </div>
  </div>
  <h2>
   <a href="/movie/901362" title="Tr

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;C. Display the name of the first movie using appropriate HTML tags to access this listing on the object created in part 3a ( 1 mark )


In [16]:
title = movie_obj.find('h2').a.text
title

'Trolls Band Together'

In [17]:
print(f"Given the first movie in the list, the name of the movie is {title}.")

Given the first movie in the list, the name of the movie is Trolls Band Together.
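Passing the string `"card style_1"` as the second argument to `find` matches the `class` attribute as an exact string, so it depends on the order of the class names. A CSS selector is order-independent. The sketch below compares the two on a hand-written snippet. The second card and its `/movie/0` link are hypothetical and exist only to show the difference.

```python
from bs4 import BeautifulSoup

# Two simplified cards: the first copies the class order seen on the page,
# the second (hypothetical) lists the same classes in reverse order.
sample_html = """
<div class="card style_1"><h2><a href="/movie/901362">Trolls Band Together</a></h2></div>
<div class="style_1 card"><h2><a href="/movie/0">Hypothetical Movie</a></h2></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact string match on the class attribute: order matters.
exact_matches = sample_soup.find_all("div", "card style_1")

# CSS selector: matches any div carrying both classes, in any order.
css_matches = sample_soup.select("div.card.style_1")

print(len(exact_matches), len(css_matches))
first_title = css_matches[0].select_one("h2 a").get_text(strip=True)
print(first_title)
```

On the page scraped here both approaches return the same cards, so this is a robustness consideration rather than a correction.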


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;D. Display the user rating of the first movie by using appropriate HTML tags to access this listing on the object created in part 3a (1 mark )

In [18]:
rating = float(movie_obj.find('div', 'user_score_chart')['data-percent'])
rating

72.0

In [19]:
print(f"Given the first movie in the list, the rating of the movie is {rating}.")

Given the first movie in the list, the rating of the movie is 72.0.


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;E. For the first movie, extract the part of the url following the string “https://www.themoviedb.org/” using the appropriate HTML tags to extract this portion on the object created in part 3a (do not use built-in string methods). (1 mark )

In [20]:
url_suffix = movie_obj.find('h2').a["href"]
url_suffix

'/movie/901362'

In [21]:
url = BASE_MOVIELIST_URL + url_suffix
print(f"Given the first movie in the list, the individual page URL suffix of the movie is {url_suffix}.")
print(f"Given the first movie in the list, the individual page URL of the movie is {url}.")

Given the first movie in the list, the individual page URL suffix of the movie is /movie/901362.
Given the first movie in the list, the individual page URL of the movie is https://www.themoviedb.org/movie/901362.
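String concatenation works here because the base URL has no trailing slash and the suffix starts with one. As a small sketch of a more forgiving alternative, `urllib.parse.urljoin` from the standard library resolves the suffix against the base either way. `BASE_URL` is a stand-in name for the base used above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.themoviedb.org"  # stand-in for the base URL used above
suffix = "/movie/901362"

# urljoin resolves the suffix against the base, with or without a trailing slash.
full_url = urljoin(BASE_URL, suffix)
with_trailing_slash = urljoin(BASE_URL + "/", suffix)

# Plain concatenation would duplicate the slash in the second case.
concatenated = BASE_URL + "/" + suffix

print(full_url)
print(with_trailing_slash)
print(concatenated)
```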


 4. Write user defined functions for each subsection below (i.e., Q4 a, Q4b, Q4c, Q4d, and Q4e) to return ( 10 marks )
    1. Titles of all the movies on the page as a Python list (2 marks )

In [22]:
def get_titles_from_index_page(soup):
    titles = []
    for movie_obj in soup.find_all('div', "card style_1"):
        titles.append(movie_obj.find('h2').a.text)
    return titles
titles_page1 = get_titles_from_index_page(soup)
titles_page1

['Trolls Band Together',
 'Oppenheimer',
 'The Creator',
 "Five Nights at Freddy's",
 'Jawan',
 'Expend4bles',
 'Mission: Impossible - Dead Reckoning Part One',
 'Fast X',
 'Napoleon',
 'Leo',
 'The Hunger Games: The Ballad of Songbirds & Snakes',
 'The Equalizer 3',
 'Dragon Ball: Mystical Adventure',
 'Blue Beetle',
 'Saw X',
 'Meg 2: The Trench',
 'The Super Mario Bros. Movie',
 'Elemental',
 'Barbie',
 'Gran Turismo']

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;B. User ratings of all the movies on the page as a Python list (2 marks )

In [23]:
def get_ratings_from_index_page(soup):
    ratings = []
    for movie_obj in soup.find_all('div', "card style_1"):
        user_score_chart = movie_obj.find('div', 'user_score_chart')
        if user_score_chart:
            rating = user_score_chart['data-percent']
            if rating:
                ratings.append(float(rating))
                continue
        ratings.append("not rated")
    return ratings
ratings_page1 = get_ratings_from_index_page(soup)
ratings_page1

[72.0,
 81.61,
 71.57,
 78.82,
 71.53999999999999,
 64.34,
 76.0,
 72.06,
 63.89,
 79.0,
 73.0,
 74.21000000000001,
 68.03,
 69.41,
 74.1,
 67.45,
 77.49,
 77.26,
 71.92,
 79.92]
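Two details of this output may matter for later analysis: some values carry long floating-point tails (for example `71.53999999999999`), and the function mixes floats with the string `"not rated"` in one list. The sketch below is one possible cleanup, assuming the long decimals come from the attribute text itself. `parse_rating` is a hypothetical helper, and `None` is used as the missing-value marker instead of a string.

```python
# Raw attribute values as they might appear in data-percent;
# None stands for a card that has no rating attribute.
raw_values = ["72.0", "71.53999999999999", "74.21000000000001", None]


def parse_rating(raw):
    """Return the rating rounded to two decimals, or None when it is missing."""
    if not raw:
        return None
    return round(float(raw), 2)


clean_ratings = [parse_rating(value) for value in raw_values]
print(clean_ratings)
```

Keeping the column numeric (with `None` for gaps) lets pandas treat it as a float column, which a mixed float/string list would prevent.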

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;C. HTML content of all the individual pages of movies collected into a Python list.

In [24]:
from urllib.parse import urljoin

def get_individual_page_content(soup):
    html_contents = []
    for movie_obj in soup.find_all('div', "card style_1"):
        url_suffix = movie_obj.find('h2').a["href"]
        url = urljoin(BASE_MOVIELIST_URL, url_suffix)
        response = requests.get(url,headers = NEEDED_HEADERS)
        html_contents.append(response.text)
    return html_contents
html_pages_page_1 = get_individual_page_content(soup)
print(f"The pages are returned. Each entry is of type {type(html_pages_page_1[0])} and the list is of length {len(html_pages_page_1)}.")

The pages are returned. Each entry is of type <class 'str'> and the list is of length 20.


In [25]:
def get_individual_page_beautiful_soup(soup):
    html_page_soups = []
    for movie_obj in soup.find_all('div', class_="card style_1"):
        url_suffix = movie_obj.find('h2').a["href"]
        url = urljoin(BASE_MOVIELIST_URL, url_suffix)
        page_soup = get_beautiful_soup_instance_from_url(url)  # separate name so the index-page `soup` argument is not overwritten
        html_page_soups.append(page_soup)
    return html_page_soups
html_page_soups_page_1 = get_individual_page_beautiful_soup(soup)
print(f"The pages are returned. Each entry is of type {type(html_page_soups_page_1[0])} and the list is of length {len(html_page_soups_page_1)}.")

The pages are returned. Each entry is of type <class 'bs4.BeautifulSoup'> and the list is of length 20.
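Each of these functions issues one request per movie, so a page of 20 movies means 20 consecutive requests to the same site. A minimal sketch of spacing those requests out follows. `fetch_all` is a hypothetical helper, the one-second default delay is an arbitrary assumption, and the fetching function is passed in so the example runs without network access.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive calls."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # wait before every call except the first
        results.append(fetch(url))
    return results


# Offline demonstration with a fake fetcher and no delay.
demo_urls = [
    "https://www.themoviedb.org/movie",
    "https://www.themoviedb.org/movie/901362",
]
demo_results = fetch_all(demo_urls, fetch=lambda url: f"<html>{url}</html>", delay_seconds=0)
print(demo_results)
```

In the notebook, `get_beautiful_soup_instance_from_url` could be passed as `fetch` to obtain the same list of BeautifulSoup objects with a pause between requests.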


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;D. Genres of all the movies on the page as a Python list ( 2 marks )

In [26]:
def get_generes(html_pages):
    all_generes = []
    for soup in html_pages:
        if not isinstance(soup, BeautifulSoup):
            soup = BeautifulSoup(soup, 'html.parser')
        generes = []
        for genere in soup.find('span', class_='genres').find_all('a'):
            generes.append(genere.get_text().strip())
        all_generes.append(generes)
    return all_generes
generes_page1_html = get_generes(html_pages_page_1)
generes_page1_html

[['Animation', 'Family', 'Music', 'Fantasy', 'Comedy'],
 ['Drama', 'History'],
 ['Science Fiction', 'Action', 'Thriller'],
 ['Horror', 'Mystery'],
 ['Action', 'Adventure', 'Thriller'],
 ['Action', 'Adventure', 'Thriller'],
 ['Action', 'Thriller'],
 ['Action', 'Crime', 'Thriller'],
 ['Drama', 'History', 'War'],
 ['Animation', 'Comedy', 'Family'],
 ['Action', 'Adventure', 'Science Fiction'],
 ['Action', 'Thriller', 'Crime'],
 ['Action', 'Animation'],
 ['Action', 'Science Fiction', 'Adventure'],
 ['Horror', 'Thriller'],
 ['Action', 'Science Fiction', 'Horror'],
 ['Animation', 'Family', 'Adventure', 'Fantasy', 'Comedy'],
 ['Animation', 'Comedy', 'Family', 'Fantasy', 'Romance'],
 ['Comedy', 'Adventure', 'Fantasy'],
 ['Adventure', 'Action', 'Drama']]
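The function above assumes every movie page contains a `<span class="genres">` element. If one is missing, `soup.find(...)` returns `None` and the chained `.find_all` raises an `AttributeError`. The sketch below is one way to guard against that, tested on hand-written HTML fragments. `extract_genres` is a hypothetical single-page helper and the `href="#"` values are placeholders.

```python
from bs4 import BeautifulSoup


def extract_genres(page_html):
    """Return the genre names on one movie page, or [] if none are listed."""
    soup = BeautifulSoup(page_html, "html.parser")
    genre_span = soup.find("span", class_="genres")
    if genre_span is None:
        return []  # page has no genres element
    return [link.get_text(strip=True) for link in genre_span.find_all("a")]


# Hand-written fragments: one with a genres span, one without.
with_genres = '<span class="genres"><a href="#">Drama</a>,&nbsp;<a href="#">History</a></span>'
without_genres = '<div class="facts"></div>'

print(extract_genres(with_genres))
print(extract_genres(without_genres))
```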

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;E. Cast of all the movies on the page as a Python list ( 2 marks )

In [27]:
def get_casts(html_page_soups): # Gets the full cast of each movie, not just the top-billed names
    all_casts=[]
    for soup in html_page_soups:
        if not isinstance(soup, BeautifulSoup):
            soup = BeautifulSoup(soup, 'html.parser')
        casts = set()
        cast_url_suffix = soup.find('section', class_='panel top_billed scroller').find('p', class_='new_button').a["href"]
        cast_url = urljoin(BASE_MOVIELIST_URL, cast_url_suffix)
        cast_soup = get_beautiful_soup_instance_from_url(cast_url)
        for ol_soup in cast_soup.find('ol', class_='people credits').find_all('li'):
            casts.add(ol_soup.select_one('.info p a').get_text().strip())
        all_casts.append(list(casts))
    return all_casts
        
all_casts = get_casts(html_pages_page_1)
all_casts

[['Kid Cudi',
  'Christopher Mintz-Plasse',
  'Troye Sivan',
  'Glozell Green',
  'JC Chasez',
  'Caroline Hjelt',
  'Chris Kirkpatrick',
  'RuPaul',
  'Daveed Diggs',
  'Amy Schumer',
  'Aino Jawo',
  'Zooey Deschanel',
  'Camila Cabello',
  'Joey Fatone',
  'Patti Harrison',
  'Justin Timberlake',
  'Kenan Thompson',
  'Kevin Michael Richardson',
  'Dillon Francis',
  'David Fynn',
  'Anna Kendrick',
  'Lance Bass',
  'Kunal Nayyar',
  'Walt Dohrn',
  'Eric André',
  'Ron Funches',
  'Anderson .Paak',
  'Andrew Rannells',
  'Zosia Mamet'],
 ['Olivia Thirlby',
  'Dylan Arnold',
  'Kenneth Branagh',
  'Petrie Willink',
  'Christopher Denham',
  'Meg Schimelpfenig',
  'Devon Bostick',
  'Drew Kenney',
  'Matt Damon',
  'Jeff Hephner',
  'Josh Hartnett',
  'Benny Safdie',
  'Tony Goldwyn',
  'Jeremy John Wells',
  'Kurt Koehler',
  'Harrison Gilbertson',
  'Emma Dumont',
  'Alden Ehrenreich',
  'Bryce Johnson',
  'Josh Zuckerman',
  'Trond Fausa Aurvåg',
  'Jack Quaid',
  'Clay Bunker',


5. Write a user-defined function that returns a pandas data frame with the following data: ( 5 marks )
    1. Titles of the movies listed on the page
    2. User ratings of the movies listed on the page
    3. Genres of the movies listed on the page
    4. Cast of the movies listed on the page

Input to the user defined function
- The response object created in Q3a
- The list output from Q4c

In [28]:
import pandas as pd
def extract_data(index_page_soup, individual_pages_htmls):
    data = {
        "title":  get_titles_from_index_page(index_page_soup),
        "user ratings": get_ratings_from_index_page(index_page_soup),
        "genres" : get_generes(individual_pages_htmls),
        "cast": get_casts(individual_pages_htmls)
    }
    return pd.DataFrame(data)
data_page1 = extract_data(soup, html_pages_page_1)

In [29]:
pd.set_option('display.max_colwidth', None)
data_page1

Unnamed: 0,title,user ratings,genres,cast
0,Trolls Band Together,72.0,"[Animation, Family, Music, Fantasy, Comedy]","[Kid Cudi, Christopher Mintz-Plasse, Troye Sivan, Glozell Green, JC Chasez, Caroline Hjelt, Chris Kirkpatrick, RuPaul, Daveed Diggs, Amy Schumer, Aino Jawo, Zooey Deschanel, Camila Cabello, Joey Fatone, Patti Harrison, Justin Timberlake, Kenan Thompson, Kevin Michael Richardson, Dillon Francis, David Fynn, Anna Kendrick, Lance Bass, Kunal Nayyar, Walt Dohrn, Eric André, Ron Funches, Anderson .Paak, Andrew Rannells, Zosia Mamet]"
1,Oppenheimer,81.61,"[Drama, History]","[Olivia Thirlby, Dylan Arnold, Kenneth Branagh, Petrie Willink, Christopher Denham, Meg Schimelpfenig, Devon Bostick, Drew Kenney, Matt Damon, Jeff Hephner, Josh Hartnett, Benny Safdie, Tony Goldwyn, Jeremy John Wells, Kurt Koehler, Harrison Gilbertson, Emma Dumont, Alden Ehrenreich, Bryce Johnson, Josh Zuckerman, Trond Fausa Aurvåg, Jack Quaid, Clay Bunker, Adam Kroeger, Máté Haumann, Matthias Schweighöfer, Gustaf Skarsgård, Casey Affleck, Pat Skipper, James D'Arcy, Olli Haaskivi, Christina Hogue, Rory Keane, Ronald Auguste, James Urbaniak, Tom Conti, Sean Avery, Michael Angarano, Steven Houska, David Dastmalchian, Steve Coulter, James Remar, Robert Downey Jr., Kate French, Gregory Jbara, Harry Groener, Will Roberts, Tyler Beardsley, Kerry Westcott, Hap Lawrence, Troy Bronson, Josh Peck, Louise Lombard, Cillian Murphy, Alex Wolff, Dane DeHaan, Tom Jenkins, Florence Pugh, Michael Andrew Baker, Maria Teresa Zuppetta, Britt Kyle, Emily Blunt, Jason Clarke, Guy Burnet, Jack Cutmore-Scott, Sadie Stratton, John Gowans, Rami Malek, David Krumholtz, David Rysdahl, Scott Grimes, Jefferson Hall, Danny Deferrari, Matt Snead, Matthew Modine, Jessica Erin Martin, Flora Nolan, Brett DelBuono, Ted King, Gary Oldman, Macon Blair, Tim DeKay]"
2,The Creator,71.57,"[Science Fiction, Action, Thriller]","[Rad Pereira, Mariam Khummaung, Natthaphong Chaiyawong, Karen Aldridge, Niko Rusakov, Tawee Teesura, Robbie Tann, Anjana Ghogar, Veronica Ngo, Ian Verdun, Ken Watanabe, Agneta Catarina Békassy de Békas, Dana Blouin, Jeb Kreager, Ralph Ineson, Brett Bartholomew, Brett Parks, Molywon Phantarak, Madeleine Yuna Voyles, John David Washington, Teerawat Mulvilai, Michael Esper, Charlie McElveen, James David Henry, Pat Skelton, Gemma Chan, Eoin O'Brien, Marc Menchaca, Mackenzie Lansing, Scott Thomas, Sturgill Simpson, Chananticha Chaipa, Phaithoon Wanglomklang, Elliot Berk, Art Ybarra, Monthatip Suksopha, Kandanai Chotikapracal, Sawanee Utoomma, Amar Chadha-Patel, Apiwantana Duenkhao, Leanna Chea, Stephen Howard Thomas, John Garrett Mahlmeister, Allison Janney, Daniel Ray Rodriguez, Syd Skidmore, Pongsanart Vinsiri, Sahatchai Chumrum, Ron Weaver, Kulsiri Thongrung, Mav Kang, Chalee Sankhavesa]"
3,Five Nights at Freddy's,78.82,"[Horror, Mystery]","[Joseph Poliquin, Theodus Crane, Tadasay Young, Mary Stuart Masterson, Gralen Bryant Banks, Xander Mateo, Jophielle Love, Piper Rubio, David Huston Doty, Garrett Hines, Bailey Winston, Lisa Mackel Smith, Victoria Patenaude, Liam Hendrix, Julia Belanova, Grant Feely, David Lind, Matthew Lillard, Wyatt Parker, Ryan Reinike, Cory Kenshin, Josh Hutcherson, Michael P. Sullivan, Elizabeth Lail, Matthew Patrick, Asher Colton Spence, Kat Conner Sterling, Christian Stokes, Jessica Blackmore, Lucas Grant]"
4,Jawan,71.54,"[Action, Adventure, Thriller]","[Priyamani, Deepika Padukone, Viraj Ghelani, Sukhi Garewal, Sanjay Dutt, Yogi Babu, Shah Rukh Khan, Ravindra Vijay, Lehar Khan, Sanjeeta Bhattacharya, Sunil Grover, Eijaz Khan, Naresh Gosain, Nayanthara, Ashlesha Thakur, Sanya Malhotra, Sirisha Hanumanth, Parth Siddhpura, Smita Tambe, Aaliyah Qureishi, Kenny Basumatary, Ridhi Dogra, Sangay Tsheltrim, Vivek Raghuvanshi, Sharad Vyas, Bharat Raj, Jaffer Sadiq, Seeza Saroj, Ashwin Goshal, Mukesh Chhabra, Vijay Sethupathi, Boxer Dheena, Atlee, Abiral Limboo, Omkar Das Manikpuri, Girija Oak, Direndo Loitangbam, Abhishek Deswal, Priyansh Vatiani, Utsav Narula]"
5,Expend4bles,64.34,"[Action, Adventure, Thriller]","[Levy Tran, Susanne Potrock, Igor Pečenjev, Martin Ghiaurov, Tjaša Perko, Daren Nop, Jacob Scipio, Karim Saidi, Stefan Ivanov, Tony Jaa, Caroline Wilde, Cokey Falkow, Cody Mackie, Jason Lines, Adam Masto, Oat Jenner, Vladimir Mihailov, Alexander Hristozov, Dan Chupong, Sheila Shah, David Nop, Megan Fox, Assen Karanikolov, Iko Uwais, Jason Statham, Dolph Lundgren, Nicole Andrews, Antoni Davidov, Andy García, Kenny ""Cowboy"" Bartram, Sylvester Stallone, 50 Cent, Mike Möller, Stefan Bahrun, Randy Couture, Lucy Newman-Williams, Samuel Black, Eddie Hall]"
6,Mission: Impossible - Dead Reckoning Part One,76.0,"[Action, Thriller]","[Dana Blacklake, Damian Rozanek, Sam Barrett, Esai Morales, Andrea Scarduzio, Ving Rhames, Zahari Baharov, Adrian Bouchet, Nicolas Wang, Pom Klementieff, Alex James-Phelps, Megan Westpfel, Rob Delaney, Ivan Ivashkin, Luke Smith, Michael Kosterin, Cary Elwes, Christopher Sciueref, Marcello Walton, Mikhail Safronov, Os Leanse, Barnaby Kay, Doroteya Toleva, Ioachim Ciobanu, Melissa Anna Bartolini, Joanna Dyce, Mariela Garriga, Brian Law, Marc Wesley DeHaney, Rocky Taylor, Tom Cruise, Philip Hulford, Mark Gatiss, Indira Varma, Grace Jabbari, Frederick Schmidt, Lucia Tong, Katie Collins, Louis Vaughan, Sergej Lopouchanski, Antonio Bustorff, Arevinth V Sarma, Evita Ciri, Nikolaos Brahimllari, Gaetano Bruno, Ira Mandela Siobhan, Nicholas Tredrea, Kaye Dinauto, Yennis Cheung, Henry Czerny, Daniella Carraturo, Marco Sincini, Faycal Attougui, Greg Tarzan Davis, Simon Pegg, Shea Whigham, Jadran Malkovich, Matt Malecki, Taylor Goodridge, Alex Brock, Laura Vortler, Jean Kartal, Gloria Obianyo, Hersha Verity, Jessica Holland, Vanessa Kirby, Simon Rizzoni, Lincoln Conway, Robert Luckay, Rebecca Ferguson, Marcin Dorociński, Marco Lascari, John Akanmu, Charles Parnell, Hayley Atwell, Lee Bridgman]"
7,Fast X,72.06,"[Action, Crime, Thriller]","[Dwayne Johnson, Ludacris, Miraj Grbić, Jordana Brewster, Josh Dun, Luis Da Silva Jr., Pete Davidson, Nathalie Emmanuel, Alexander Capon, Shahir Figueira, Brie Larson, Luka Hays, Charlize Theron, Emily Buchan, Tyrese Gibson, Ben-Hur Santos, Alan Ritchson, Vin Diesel, Debby Ryan, Paul Walker, Daniela Melchior, Jason Statham, Joaquim de Almeida, Gal Gadot, Scott Eastwood, Shadrach Agozino, Ludmilla, Jaz Hutchins, Ali Baddou, John Cena, Michelle Rodriguez, Rita Moreno, Jason Momoa, Leo A. Perry, Sung Kang, Michael Irby, Helen Mirren, Meadow Walker Thornton-Allan]"
8,Napoleon,63.89,"[Drama, History, War]","[Paul Rhys, Ed Hughes, Tim Faulkner, David Verrey, Sophie Lund, John Hollingworth, Mark Bonnar, Tahar Rahim, Ben Miles, Davide Tucci, Robert William Carlisle, Arthur McBain, Ed Eales White, Joaquin Phoenix, Thea Achillea, Riana Duce, Jacob Marshfield, Sam Troughton, Anna Mawn, Edouard Philipponnat, Rupert Everett, Paul Riddell, Kevin Eldon, Jonathan Barnwell, Julian Wadham, Cormac Hyde-Corrin, Youssef Kerkour, Hannah Flynn, Richard McCabe, Julian Rhind-Tutt, Thom Ashley, Cesare Taurasi, Sinéad Cusack, Ludivine Sagnier, Phil Cornwell, Abubakar Salim, Ian McNeice, Edward Hogg, Catherine Walker, Vanessa Kirby, Miles Jupp, Jannis Niewöhner, Gavin Spokes, Erin Ainsworth, Matthew Needham]"
9,Leo,79.0,"[Animation, Comedy, Family]","[Christian Capozzoli, Blake Clark, Jonny Solomon, Giselle Fernández, Nicholas Turturro, John Farley, Ryun Yu, Tiffany Topol, Ryan Bartley, Lileina Joy, Dan Reitz, Aliza Pelavin, Janie Haddad Tompkins, Jo Koy, Corey J, Allison Strong, Sunita Param, Ashley Lambert, Jaquita Ta'le, Cecily Strong, Jackie Sandler, Adam Sandler, Rebecca Vigil, Kelly Stables, Bill Burr, Benjamin Bottani, Chris Titone, Katie Hartman, Elijah Kim, Nick Swardson, Sunny Sandler, Frankie Figliozzi, David Michie, Noah Robbins, Alex Quijano, Ranjani Brow, Kyra Wachtenheim, Stephanie Hsu, Andrew Morgado, Aldan Liam Philipson, Ethan Smigel, Ava Acres, Bryant Tardy, Paul Brittain, Mary Deaton, Warren Sroka, Jonathan Loughran, Robert Marianetti, David Wachtenheim, Doug Dale, Rob Schneider, Germar Terrell Gardner, Gloria Manning, Carson Minniear, Reese Lores, Nora Wyman, Chris Kattan, Terence Mathews, Jason Griffith, Joel Marsh Garland, Heidi Gardner, Nikki Castillo, Roey Smigel, Shelby Young, Chase Fein, Jason Alexander, Sonya Leslie, Scott Menville, Andre Robinson, Coulter Ibanez, Sadie Sandler, TienYa Safko, Rose Abdoo, Robert Smigel, Sheila Carrasco]"


6. Scraping the data and combining the dataframes ( 5 marks )
    1. Write a function that scrapes data (mentioned in Q5) from page number 1, 2, 3, 4 and 5 on the URL https://www.themoviedb.org/movie and returns 5 data frames which can be exported to csv file by calling the functions defined in Q3a, Q4c and Q5 (3 marks)

In [30]:
def get_movie_data_from_pages(start_page=1, end_page=5):
    df_list = []
    if start_page <= 0:
        raise Exception("The starting page should be greater than 0")
    for page_num in range(start_page, end_page+1):
        url = BASE_MOVIELIST_URL + f"/movie?page={page_num}"
        index_page_soup = get_beautiful_soup_instance_from_url(url)
        individual_movie_page_list = get_individual_page_beautiful_soup(index_page_soup)
        if len(individual_movie_page_list) == 0:
            print(f"There is no data in page {page_num}. Returning the result")
            break  # stop here so that an empty page is not processed or exported
        data = extract_data(index_page_soup, individual_movie_page_list)
        data.to_csv(f"tmdb_page{page_num}.csv", index=False)
        df_list.append(data)
    return df_list
        
data_frames = get_movie_data_from_pages(1, 5)
print(f"The data is of length {len(data_frames)} and type of data is {type(data_frames[0])}.")

The data is of length 5 and type of data is <class 'pandas.core.frame.DataFrame'>.


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;B. Combine the data obtained from dataframes in Q6(a) (2 marks)

In [31]:
full_data = pd.concat(data_frames, axis=0, ignore_index=True)
full_data.to_csv("tmdb_full_data.csv", index=False)
full_data

Unnamed: 0,title,user ratings,genres,cast
0,Trolls Band Together,72.00,"[Animation, Family, Music, Fantasy, Comedy]","[Kid Cudi, Christopher Mintz-Plasse, Troye Sivan, Glozell Green, JC Chasez, Caroline Hjelt, Chris Kirkpatrick, RuPaul, Daveed Diggs, Amy Schumer, Aino Jawo, Zooey Deschanel, Camila Cabello, Joey Fatone, Patti Harrison, Justin Timberlake, Kenan Thompson, Kevin Michael Richardson, Dillon Francis, David Fynn, Anna Kendrick, Lance Bass, Kunal Nayyar, Walt Dohrn, Eric André, Ron Funches, Anderson .Paak, Andrew Rannells, Zosia Mamet]"
1,Oppenheimer,81.61,"[Drama, History]","[Olivia Thirlby, Dylan Arnold, Kenneth Branagh, Petrie Willink, Christopher Denham, Meg Schimelpfenig, Devon Bostick, Drew Kenney, Matt Damon, Jeff Hephner, Josh Hartnett, Benny Safdie, Tony Goldwyn, Jeremy John Wells, Kurt Koehler, Harrison Gilbertson, Emma Dumont, Alden Ehrenreich, Bryce Johnson, Josh Zuckerman, Trond Fausa Aurvåg, Jack Quaid, Clay Bunker, Adam Kroeger, Máté Haumann, Matthias Schweighöfer, Gustaf Skarsgård, Casey Affleck, Pat Skipper, James D'Arcy, Olli Haaskivi, Christina Hogue, Rory Keane, Ronald Auguste, James Urbaniak, Tom Conti, Sean Avery, Michael Angarano, Steven Houska, David Dastmalchian, Steve Coulter, James Remar, Robert Downey Jr., Kate French, Gregory Jbara, Harry Groener, Will Roberts, Tyler Beardsley, Kerry Westcott, Hap Lawrence, Troy Bronson, Josh Peck, Louise Lombard, Cillian Murphy, Alex Wolff, Dane DeHaan, Tom Jenkins, Florence Pugh, Michael Andrew Baker, Maria Teresa Zuppetta, Britt Kyle, Emily Blunt, Jason Clarke, Guy Burnet, Jack Cutmore-Scott, Sadie Stratton, John Gowans, Rami Malek, David Krumholtz, David Rysdahl, Scott Grimes, Jefferson Hall, Danny Deferrari, Matt Snead, Matthew Modine, Jessica Erin Martin, Flora Nolan, Brett DelBuono, Ted King, Gary Oldman, Macon Blair, Tim DeKay]"
2,The Creator,71.57,"[Science Fiction, Action, Thriller]","[Rad Pereira, Mariam Khummaung, Natthaphong Chaiyawong, Karen Aldridge, Niko Rusakov, Tawee Teesura, Robbie Tann, Anjana Ghogar, Veronica Ngo, Ian Verdun, Ken Watanabe, Agneta Catarina Békassy de Békas, Dana Blouin, Jeb Kreager, Ralph Ineson, Brett Bartholomew, Brett Parks, Molywon Phantarak, Madeleine Yuna Voyles, John David Washington, Teerawat Mulvilai, Michael Esper, Charlie McElveen, James David Henry, Pat Skelton, Gemma Chan, Eoin O'Brien, Marc Menchaca, Mackenzie Lansing, Scott Thomas, Sturgill Simpson, Chananticha Chaipa, Phaithoon Wanglomklang, Elliot Berk, Art Ybarra, Monthatip Suksopha, Kandanai Chotikapracal, Sawanee Utoomma, Amar Chadha-Patel, Apiwantana Duenkhao, Leanna Chea, Stephen Howard Thomas, John Garrett Mahlmeister, Allison Janney, Daniel Ray Rodriguez, Syd Skidmore, Pongsanart Vinsiri, Sahatchai Chumrum, Ron Weaver, Kulsiri Thongrung, Mav Kang, Chalee Sankhavesa]"
3,Five Nights at Freddy's,78.82,"[Horror, Mystery]","[Joseph Poliquin, Theodus Crane, Tadasay Young, Mary Stuart Masterson, Gralen Bryant Banks, Xander Mateo, Jophielle Love, Piper Rubio, David Huston Doty, Garrett Hines, Bailey Winston, Lisa Mackel Smith, Victoria Patenaude, Liam Hendrix, Julia Belanova, Grant Feely, David Lind, Matthew Lillard, Wyatt Parker, Ryan Reinike, Cory Kenshin, Josh Hutcherson, Michael P. Sullivan, Elizabeth Lail, Matthew Patrick, Asher Colton Spence, Kat Conner Sterling, Christian Stokes, Jessica Blackmore, Lucas Grant]"
4,Jawan,71.54,"[Action, Adventure, Thriller]","[Priyamani, Deepika Padukone, Viraj Ghelani, Sukhi Garewal, Sanjay Dutt, Yogi Babu, Shah Rukh Khan, Ravindra Vijay, Lehar Khan, Sanjeeta Bhattacharya, Sunil Grover, Eijaz Khan, Naresh Gosain, Nayanthara, Ashlesha Thakur, Sanya Malhotra, Sirisha Hanumanth, Parth Siddhpura, Smita Tambe, Aaliyah Qureishi, Kenny Basumatary, Ridhi Dogra, Sangay Tsheltrim, Vivek Raghuvanshi, Sharad Vyas, Bharat Raj, Jaffer Sadiq, Seeza Saroj, Ashwin Goshal, Mukesh Chhabra, Vijay Sethupathi, Boxer Dheena, Atlee, Abiral Limboo, Omkar Das Manikpuri, Girija Oak, Direndo Loitangbam, Abhishek Deswal, Priyansh Vatiani, Utsav Narula]"
...,...,...,...,...
95,The Batman,77.00,"[Crime, Mystery, Thriller]","[Parry Glasspool, Arthur Lee, Marcus Onilude, James Eeles, Kazeem Tosin Amore, Bobby Cuza, Lorraine Tai, Alex Ferns, Dave Simon, Sid Sagar, Lorna Brown, Heider Ali, Joseph Walker, Colin Farrell, John Turturro, Stewart Alexander, Jack Bennett, Roma Torre, Elijah Baker, Itoya Osagiede, Ed Kear, Urielle Klein-Mekongo, Elena Saurel, Chabris Napier-Lawrence, Adam Rojko Vega, Andy Serkis, Con O'Neill, Douglas Russell, Robert Pattinson, Leemore Marrett Jr., Amanda Hurwitz, Jeffrey Wright, Hadas Gold, Will Austin, Stefan Race, Janine Harouni, Nathalie Armin, Madeleine Gray, Andre Nightingale, Jayme Lawson, Pat Battle, Sandra Dickinson, Todd Boyce, Jay Lycurgo, Akie Kotabe, Elliot Warren, Paul Dano, Rupert Penry-Jones, Peter McDonald, Craige Middleburg, Luke Roberts, Charlie Carver, Craig Douglas, Stella Stocker, Phil Aizlewood, Zoë Kravitz, Daniel Rainford, Brandon Bassir, Archie Barnes, Angela Yeoh, Philip Shaun McGuinness, Amanda Blake, Peter Sarsgaard, Bronson Webb, Ezra Elliott, Richard James-Neale, Gil Perez-Abraham, Kosha Engler, Mark Killeen, Joseph Balderrama, Barry Keoghan, Joshua Eldridge-Smith, Ste Johnston, Spike Fearn, Hana Hrzic, Dean Meminger, Mike Capozzola, Jose Palma, Oscar Novak, Jordan Coulson, Rodrig Andrisan, Max Carver]"
96,The Last Voyage of the Demeter,71.54,"[Thriller, Horror]","[Vladimir Cabak, Liam Cunningham, Noureddine Farihi, Javier Botet, Nicolo Pasetti, Rudolf Danielewicz, Christopher York, Woody Norman, Malcolm Galea, Joe Depasquale, Graham Turner, Nikolai Nikolaeff, Andy Murray, Chris Walley, Aisling Franciosi, Jack Doggart, Stefan Kapičić, Jon Jon Briones, Adam Shaw, Sally Reeve, David Dastmalchian, Corey Hawkins, Martin Furulund]"
97,Harry Potter and the Philosopher's Stone,79.15,"[Adventure, Fantasy]","[David Bradley, Will Theakston, Paul Grant, Julie Walters, Geraldine Somerville, Terence Bayler, Leslie Phillips, Simon Fisher-Becker, Fiona Shaw, Chris Rankin, Leila Hoffman, Alan Rickman, Saunders Triplets, Julianne Hough, Adrian Rawlins, Verne Troyer, Alfred Enoch, Ben Borowiecki, Zoe Sugg, Harry Taylor, Rupert Grint, Jean Southern, Kieri Kennedy, Robbie Coltrane, Matthew Lewis, Warwick Davis, Devon Murray, Eleanor Columbus, David Holmes, Violet Columbus, Daniel Radcliffe, Tom Felton, Ray Fearon, Elizabeth Spriggs, Derek Deadman, Sean Biggerstaff, Dani Harmer, Emily Dale, Mark Ballas, Richard Griffiths, Josh Herdman, James Phelps, Ian Hart, Jamie Waylett, Zoë Wanamaker, John Cleese, Danielle Tabor, Derek Hough, Scot Fearn, Bonnie Wright, Leilah Sutherland, Richard Bremmer, Paul Marc Davis, John Hurt, Maggie Smith, Harry Melling, Richard Harris, Jimmy Vee, Luke Youngblood, Emma Watson, Nina Young, Oliver Phelps]"
98,Coco,82.00,"[Family, Animation, Fantasy, Music, Comedy, Adventure]","[Jaime Camil, Edward James Olmos, Selene Luna, Polo Rojas, Herbert Siguenza, Ana Ofelia Murguía, Carla Medina, Benjamin Bratt, Natalia Cordova-Buckley, Montse Hernandez, Octavio Solis, Anthony Gonzalez, Alanna Ubach, Salvador Reyes, Luis Valdez, Gabriel Iglesias, Cheech Marin, Blanca Araceli, Lombardo Boyar, Sofía Espinosa, Renée Victor, Dyana Ortelli, John Ratzenberger, Alfonso Arau, Gael García Bernal]"
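The `genres` and `cast` columns hold Python lists, and a CSV file stores each list as its text representation. Reading the file back therefore yields strings unless they are parsed again. The sketch below shows the round trip on a one-row frame, with an in-memory buffer standing in for the CSV file (an assumption made so the example needs no file on disk).

```python
import ast
import io

import pandas as pd

# One-row frame with a list-valued column, mirroring the structure above.
sample = pd.DataFrame({
    "title": ["Oppenheimer"],
    "user ratings": [81.61],
    "genres": [["Drama", "History"]],
})

# In-memory stand-in for a CSV file written with to_csv.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# A plain read returns the list column as text.
buffer.seek(0)
as_text = pd.read_csv(buffer)
print(type(as_text.loc[0, "genres"]))

# A converter parses the text back into a Python list.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"genres": ast.literal_eval})
print(restored.loc[0, "genres"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text read from a file.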
