1) Connect to the webpage and display the page content returned in the server's response.

In [1]:
import requests
url="https://www.themoviedb.org/movie"
needed_headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
response=requests.get(url,headers = needed_headers)

In [2]:
response.status_code

200

In [3]:
response.status_code == requests.codes.ok

True

The request was successful. The next step is to get the contents of the page using response.text.

In [4]:
page_content=response.text
page_content

'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Popular Movies &#8212; The Movie Database (TMDB)</title>\n    <meta http-equiv="cleartype" content="on">\n    <meta charset="utf-8">\n    <meta name="keywords" content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast">\n    <meta name="mobile-web-app-capable" content="yes">\n    <meta name="apple-mobile-web-app-capable" content="yes">\n    <meta name="viewport" content="width=device-width,initial-scale=1">\n      <meta name="description" content="The Movie Database (TMDB) is a popular, user editable database for movies and TV shows.">\n    <meta name="msapplication-TileImage" content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png">\n<meta name="msapplication-TileColor" content="#032541">\n<meta name="theme-color" content="#032541">\n<link rel="apple-touch-icon" sizes="180x180" href=

In [5]:
type(page_content)

str

The page_content obtained from the server's response is a string (str), so I'll use string slicing to display its first 200 characters.

In [6]:
page_content[:200]

'<!DOCTYPE html>\n<html lang="en" class="no-js">\n  <head>\n    <title>Popular Movies &#8212; The Movie Database (TMDB)</title>\n    <meta http-equiv="cleartype" content="on">\n    <meta charset="utf-8">\n  '

2) Parse the HTML content of the response using the BeautifulSoup library

In [7]:
from bs4 import BeautifulSoup
soup=BeautifulSoup(page_content,'html.parser')

In [8]:
print("The title of the webpage is: ",soup.find('title').text)

The title of the webpage is:  Popular Movies — The Movie Database (TMDB)
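
The raw response text shown earlier contains the character reference `&#8212;`, while the parsed title shows the dash itself, because BeautifulSoup decodes HTML entities while parsing. As a minimal standalone sketch of that decoding step, the standard library's html.unescape performs the same conversion (the variable names here are illustrative and not part of the notebook):

```python
import html

# Title text as it appears in the raw response (copied from the page_content output)
raw_title = "Popular Movies &#8212; The Movie Database (TMDB)"

# html.unescape converts character references such as &#8212;
# into the characters they encode
decoded_title = html.unescape(raw_title)
print(decoded_title)
```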


Below is a user-defined function that takes a URL string as input and returns a BeautifulSoup instance for that page. It calls raise_for_status(), which raises an HTTPError when the response has a 4xx or 5xx status code. The function catches that exception, prints the error, and returns None instead of a BeautifulSoup instance.


In [9]:
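def get_webpage_source(url):
    # Browser-like User-Agent header sent with every request
    needed_headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
    try:
        # The request itself sits inside the try block so that connection errors are caught too
        new_response=requests.get(url,headers=needed_headers)
        # Raise HTTPError for 4xx/5xx status codes
        new_response.raise_for_status()
        page_source=new_response.text
        output=BeautifulSoup(page_source,'html.parser')
        return output

    except requests.exceptions.RequestException as e:
        print("Error : ", e)
        # Make the failure case explicit for callers
        return None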

Test case 1: Working URL

In [10]:
popular_people_url='https://www.themoviedb.org/person'
people_content=get_webpage_source(popular_people_url)
print(people_content.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en">
 <head>
  <title>
   Popular People — The Movie Database (TMDB)
  </title>
  <meta content="on" http-equiv="cleartype"/>
  <meta charset="utf-8"/>
  <meta content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast" name="keywords"/>
  <meta content="yes" name="mobile-web-app-capable"/>
  <meta content="yes" name="apple-mobile-web-app-capable"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="The Movie Database (TMDB) is a popular, user editable database for movies and TV shows." name="description"/>
  <meta content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png" name="msapplication-TileImage"/>
  <meta content="#032541" name="msapplication-TileColor"/>
  <meta content="#032541" name="theme-color"/>
  <link href="/assets/2/apple-touch-icon-57ed4b3b0450fd5e9a0c20f34e814b

In [11]:
people_content.find('title').text

'Popular People — The Movie Database (TMDB)'

Test case 2: Non-working URL

In [12]:
gibberish_url='https://www.themoviedb.org/personhelloxyz'
html_content=get_webpage_source(gibberish_url)

Error :  404 Client Error: Not Found for url: https://www.themoviedb.org/personhelloxyz


3) Extract the content of the webpage https://www.themoviedb.org/movie, which hosts an up-to-date listing of popular movies

In [13]:
movie_url='https://www.themoviedb.org/movie'
movie_content=get_webpage_source(movie_url)
print(movie_content.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en">
 <head>
  <title>
   Popular Movies — The Movie Database (TMDB)
  </title>
  <meta content="on" http-equiv="cleartype"/>
  <meta charset="utf-8"/>
  <meta content="Movies, TV Shows, Streaming, Reviews, API, Actors, Actresses, Photos, User Ratings, Synopsis, Trailers, Teasers, Credits, Cast" name="keywords"/>
  <meta content="yes" name="mobile-web-app-capable"/>
  <meta content="yes" name="apple-mobile-web-app-capable"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="The Movie Database (TMDB) is a popular, user editable database for movies and TV shows." name="description"/>
  <meta content="/assets/2/v4/icons/mstile-144x144-30e7905a8315a080978ad6aeb71c69222b72c2f75d26dab1224173a96fecc962.png" name="msapplication-TileImage"/>
  <meta content="#032541" name="msapplication-TileColor"/>
  <meta content="#032541" name="theme-color"/>
  <link href="/assets/2/apple-touch-icon-57ed4b3b0450fd5e9a0c20f34e814b

In [14]:
first_movie_content=movie_content.find("div", class_="card style_1")
print(first_movie_content)

<div class="card style_1">
<div class="image">
<div class="wrapper glyphicons_v2 picture grey no_image_holder">
<a class="image" href="/movie/933131" title="Badland Hunters">
<img alt="Badland Hunters" class="poster" loading="lazy" src="https://media.themoviedb.org/t/p/w220_and_h330_face/24CL0ySodCF8bcm38xtBeHzHp7W.jpg" srcset="https://media.themoviedb.org/t/p/w220_and_h330_face/24CL0ySodCF8bcm38xtBeHzHp7W.jpg 1x, https://media.themoviedb.org/t/p/w440_and_h660_face/24CL0ySodCF8bcm38xtBeHzHp7W.jpg 2x"/>
</a>
</div>
<div class="options" data-id="933131" data-media-type="movie" data-object-id="61f812b1fe6c180068897669">
<a class="no_click" href="#"><div class="glyphicons_v2 circle-more white"></div></a>
</div>
</div>
<div class="content">
<div class="consensus tight">
<div class="outer_ring">
<div class="user_score_chart 61f812b1fe6c180068897669" data-bar-color="#d2d531" data-percent="69.28" data-track-color="#423d0f">
<div class="percent">
<span class="icon icon-r69"></span>
</div>
</div

In [15]:
first_movie_title=movie_content.find("div", class_="card style_1").h2.a.text
print(first_movie_title)

Badland Hunters


In [16]:
first_movie_rating=movie_content.find("div", class_="user_score_chart").get('data-percent')
print(first_movie_rating)

69.28


In [17]:
first_movie_url=movie_content.find("div", class_="card style_1").h2.a.get('href')
print(first_movie_url[1:])

movie/933131
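
The href stored on each card is a site-relative path, so it has to be combined with the site root before the movie's own page can be requested. A minimal sketch using urllib.parse.urljoin from the standard library; `site_root` and `build_movie_url` are hypothetical names introduced only for illustration:

```python
from urllib.parse import urljoin

# Assumed site root, taken from the URLs used in this notebook
site_root = "https://www.themoviedb.org"

def build_movie_url(relative_href):
    """Combine the site root with a site-relative href such as '/movie/933131'."""
    return urljoin(site_root, relative_href)

print(build_movie_url("/movie/933131"))
```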


4) Write user-defined functions that extract the title, rating, genre, and cast of every movie.

In [18]:
def get_all_movie_title(mov_info):
    movie_title_list=[]
    movie_cards=mov_info.find_all("div", class_="card style_1")
    for movie in movie_cards:
        movie_title_list.append(movie.h2.a.text)
    return movie_title_list
print(get_all_movie_title(movie_content))

['Badland Hunters', 'Sixty Minutes', 'Wonka', 'The Marvels', 'The Beekeeper', 'Lift', 'Migration', 'Wish', 'Aquaman and the Lost Kingdom', 'Orion and the Dark', 'The Hunger Games: The Ballad of Songbirds & Snakes', "The Tiger's Apprentice", 'The Painter', 'Role Play', 'Anyone But You', 'Napoleon', 'Trunk: Locked In', 'The Family Plan', 'Rebel Moon - Part One: A Child of Fire', 'One More Shot']


In [19]:
def get_all_movie_ratings(mov_info):
    movie_rating_list=[]
    movie_cards=mov_info.find_all("div", class_="user_score_chart")
    for movie in movie_cards:
        if movie['data-percent'] == '0':
            movie_rating_list.append('Not rated')
        else:
            movie_rating_list.append(movie['data-percent'])
    return movie_rating_list
print(get_all_movie_ratings(movie_content))

['69.28', '71.0', '72.15', '63.129999999999995', '73.04', '65.84', '77.0', '66.36', '69.17', '68.19', '72.21000000000001', '70.66', '71.0', '60.099999999999994', '70.74', '65.22', '56.0', '73.5', '64.17', '67.9']
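
The data-percent values are strings, and some of them carry floating-point noise such as '63.129999999999995'. One possible cleanup, sketched here as a hypothetical helper `clean_rating` that is not part of the original notebook, converts each value to a float rounded to two decimals and maps unrated entries to None:

```python
def clean_rating(raw_rating):
    """Return the rating as a float rounded to 2 decimals, or None if unrated."""
    # '0' is the raw unrated value; 'Not rated' is the label used by get_all_movie_ratings
    if raw_rating in ('0', 'Not rated'):
        return None
    return round(float(raw_rating), 2)

# Sample values copied from the output above, plus one unrated entry
sample_ratings = ['69.28', '63.129999999999995', '72.21000000000001', 'Not rated']
cleaned_ratings = [clean_rating(r) for r in sample_ratings]
print(cleaned_ratings)
```

Keeping the ratings numeric would also make later sorting and averaging in pandas straightforward.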


In [20]:
def get_all_html_content(mov_info):
    movie_cards=mov_info.find_all("div", class_="card style_1")
    each_movie_html_content_list=[]
    for movie in movie_cards:
        movie_link=movie.h2.a['href']
        each_movie_html=get_webpage_source('https://www.themoviedb.org'+movie_link)
        each_movie_html_content_list.append(each_movie_html)
    return each_movie_html_content_list
all_movie_page_content= get_all_html_content(movie_content)

In [21]:
def get_movie_genre(movie_page_content):
    genre=[]
    for movie in movie_page_content:
        genre_tags=movie.find("span", class_="genres")
        genre.append([anchor.text for anchor in genre_tags.find_all("a")])
    return genre

genres_each_movie_list=get_movie_genre(all_movie_page_content)
print(genres_each_movie_list)

[['Science Fiction', 'Action', 'Drama'], ['Action', 'Drama'], ['Comedy', 'Family', 'Fantasy'], ['Science Fiction', 'Adventure', 'Action'], ['Action', 'Thriller', 'Drama'], ['Action', 'Comedy', 'Crime'], ['Animation', 'Action', 'Adventure', 'Comedy', 'Family'], ['Animation', 'Family', 'Fantasy', 'Adventure'], ['Action', 'Adventure', 'Fantasy'], ['Animation', 'Family', 'Comedy', 'Fantasy'], ['Drama', 'Science Fiction', 'Action'], ['Animation', 'Action', 'Adventure', 'Family', 'Fantasy'], ['Action', 'Thriller'], ['Action', 'Comedy', 'Romance'], ['Comedy', 'Romance'], ['History', 'War', 'Action', 'Drama'], ['Thriller', 'Action', 'Drama'], ['Action', 'Comedy'], ['Science Fiction', 'Drama', 'Action'], ['Action', 'Thriller']]


In [22]:
def get_movie_cast(movie_page_content):
    cast=[]
    cast_tags=[]
    for movie in movie_page_content:
        cast_tags.append(movie.select('.card > p > a'))
    for cast_tag in cast_tags:
        cast.append([anchor.text for anchor in cast_tag])
    return cast
cast_each_movie_list=get_movie_cast(all_movie_page_content)
print(cast_each_movie_list)

[['Ma Dong-seok', 'Lee Hee-jun', 'Lee Jun-young', 'Roh Jeong-eui', 'Ahn Ji-hye', 'Park Ji-hoon', 'Jang Young-nam', 'Park Hyo-joon', 'Seong Byeong-suk'], ['Emilio Sakraya', 'Dennis Mojen', 'Marie Mouroum', 'Florian Schmidtke', 'Paul Wollin', 'Aristo Luis', 'Morik Heydo', 'Alain Blazevic', 'Harry Szovik'], ['Timothée Chalamet', 'Calah Lane', 'Keegan-Michael Key', 'Hugh Grant', 'Paterson Joseph', 'Olivia Colman', 'Tom Davis', 'Jim Carter', 'Rowan Atkinson'], ['Brie Larson', 'Teyonah Parris', 'Iman Vellani', 'Zawe Ashton', 'Samuel L. Jackson', 'Gary Lewis', 'Park Seo-jun', 'Zenobia Shroff', 'Mohan Kapur'], ['Jason Statham', 'Emmy Raver-Lampman', 'Bobby Naderi', 'Josh Hutcherson', 'Jeremy Irons', 'Taylor James', 'Phylicia Rashād', 'Jemma Redgrave', 'Minnie Driver'], ['Kevin Hart', 'Gugu Mbatha-Raw', 'Sam Worthington', "Vincent D'Onofrio", 'Úrsula Corberó', 'Billy Magnussen', 'Kim Yun-jee', 'Viveik Kalra', 'Jean Reno'], ['Kumail Nanjiani', 'Elizabeth Banks', 'Caspar Jennings', 'Tresi Gazal',

5) Write a user-defined function that returns a pandas DataFrame

In [23]:
import pandas as pd
def get_pandas_dataframe(home_page_content,individual_page_content):
    title=get_all_movie_title(home_page_content)
    rating=get_all_movie_ratings(home_page_content)
    genre=get_movie_genre(individual_page_content)
    cast=get_movie_cast(individual_page_content)
    movie_data = pd.DataFrame({
        "Movie Title": title,
        "Movie Rating": rating,
        "Movie Genre": genre,
        "Movie Cast": cast
    })
    return movie_data

movies_df=get_pandas_dataframe(movie_content,all_movie_page_content)
movies_df.head()

Unnamed: 0,Movie Title,Movie Rating,Movie Genre,Movie Cast
0,Badland Hunters,69.28,"[Science Fiction, Action, Drama]","[Ma Dong-seok, Lee Hee-jun, Lee Jun-young, Roh..."
1,Sixty Minutes,71.0,"[Action, Drama]","[Emilio Sakraya, Dennis Mojen, Marie Mouroum, ..."
2,Wonka,72.15,"[Comedy, Family, Fantasy]","[Timothée Chalamet, Calah Lane, Keegan-Michael..."
3,The Marvels,63.13,"[Science Fiction, Adventure, Action]","[Brie Larson, Teyonah Parris, Iman Vellani, Za..."
4,The Beekeeper,73.04,"[Action, Thriller, Drama]","[Jason Statham, Emmy Raver-Lampman, Bobby Nade..."
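
pandas raises a ValueError when the lists passed to pd.DataFrame in a dictionary have different lengths, which would happen here if, for example, a card on the listing page lacked a rating element. A minimal sketch of a pre-check, using a hypothetical helper `column_lengths_match` and small illustrative lists:

```python
def column_lengths_match(columns):
    """Return True if every list in the dict of columns has the same length."""
    lengths = {len(values) for values in columns.values()}
    return len(lengths) <= 1

# Illustrative data only: the second dict is missing one rating on purpose
complete = {"Movie Title": ["Wonka", "Lift"], "Movie Rating": ["72.15", "65.84"]}
incomplete = {"Movie Title": ["Wonka", "Lift"], "Movie Rating": ["72.15"]}

print(column_lengths_match(complete))
print(column_lengths_match(incomplete))
```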


6)  Scraping the data and combining the dataframes

In [24]:
import os

base_link="https://www.themoviedb.org/movie"
csv_folder = input("Please enter the directory name to store the generated CSVs: ")

Please enter the directory name to store the generated CSVs: Pooja_TMDB_CSVs_Folder


In [25]:
# Create folder if it doesn't exist
os.makedirs(csv_folder, exist_ok=True)

In [26]:
def get_multiple_pages_data(base_url,total_num_pages):
    dataframes_list=[]
    for page_no in range(1,total_num_pages+1):
        current_page_url=base_url+f'?page={page_no}'
        current_page_content=get_webpage_source(current_page_url)
        current_page_all_movies_content= get_all_html_content(current_page_content)
        current_data_frame=get_pandas_dataframe(current_page_content,current_page_all_movies_content)
        dataframes_list.append(current_data_frame)
    return dataframes_list
output_data_frames=  get_multiple_pages_data(base_link,5)
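
The page URL above is built by string concatenation. An alternative sketch encodes the query string with urllib.parse.urlencode, which stays correct if more query parameters are added later; `build_page_url` is a hypothetical helper and not part of the notebook:

```python
from urllib.parse import urlencode

def build_page_url(base_url, page_no):
    """Append a ?page=N query string to the listing URL."""
    return f"{base_url}?{urlencode({'page': page_no})}"

# Build the URLs for the first three listing pages
page_urls = [build_page_url("https://www.themoviedb.org/movie", n) for n in range(1, 4)]
print(page_urls)
```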

In [27]:
output_data_frames[0].head()

Unnamed: 0,Movie Title,Movie Rating,Movie Genre,Movie Cast
0,Badland Hunters,69.28,"[Science Fiction, Action, Drama]","[Ma Dong-seok, Lee Hee-jun, Lee Jun-young, Roh..."
1,Sixty Minutes,71.0,"[Action, Drama]","[Emilio Sakraya, Dennis Mojen, Marie Mouroum, ..."
2,Wonka,72.15,"[Comedy, Family, Fantasy]","[Timothée Chalamet, Calah Lane, Keegan-Michael..."
3,The Marvels,63.13,"[Science Fiction, Adventure, Action]","[Brie Larson, Teyonah Parris, Iman Vellani, Za..."
4,The Beekeeper,73.04,"[Action, Thriller, Drama]","[Jason Statham, Emmy Raver-Lampman, Bobby Nade..."


In [28]:
output_data_frames[1].head()

Unnamed: 0,Movie Title,Movie Rating,Movie Genre,Movie Cast
0,Silent Night,63.69,"[Action, Adventure, Thriller, Crime, Drama]","[Joel Kinnaman, Kid Cudi, Harold Torres, Catal..."
1,The Bricklayer,62.27,"[Action, Thriller]","[Aaron Eckhart, Nina Dobrev, Clifton Collins J..."
2,Freelance,65.62,"[Action, Comedy]","[John Cena, Alison Brie, Juan Pablo Raba, Alic..."
3,Due Justice,67.78,"[Crime, Thriller, Action]","[Kellan Lutz, William Forsythe, Jeff Fahey, Ef..."
4,Expend4bles,63.84,"[Action, Adventure, Thriller]","[Jason Statham, Sylvester Stallone, 50 Cent, M..."


In [29]:
output_data_frames[2].head()

Unnamed: 0,Movie Title,Movie Rating,Movie Genre,Movie Cast
0,The Underdoggs,60.69,[Comedy],"[Snoop Dogg, Tika Sumpter, Elias Ferguson, Jon..."
1,Monsters 103 Mercies Dragon Damnation,80.98,"[Animation, Action, Fantasy, Adventure]","[Yoshimasa Hosoya, Kana Hanazawa, Hiroki Touch..."
2,Night Swim,59.55,[Horror],"[Wyatt Russell, Kerry Condon, Amélie Hoeferle,..."
3,Trolls Band Together,72.61999999999999,"[Animation, Family, Music, Fantasy, Comedy]","[Anna Kendrick, Justin Timberlake, Camila Cabe..."
4,Radical,84.93,[Drama],"[Eugenio Derbez, Daniel Haddad, Jennifer Trejo..."


In [30]:
output_data_frames[3].head()

Unnamed: 0,Movie Title,Movie Rating,Movie Genre,Movie Cast
0,Gran Turismo,79.42,"[Adventure, Action, Drama]","[Archie Madekwe, David Harbour, Orlando Bloom,..."
1,Leo,75.17,"[Animation, Comedy, Family]","[Adam Sandler, Bill Burr, Cecily Strong, Jason..."
2,My Fault,80.28999999999999,"[Drama, Romance]","[Nicole Wallace, Gabriel Guevara, Marta Hazas,..."
3,Spider-Man: Across the Spider-Verse,84.0,"[Animation, Action, Adventure, Science Fiction]","[Shameik Moore, Hailee Steinfeld, Jason Schwar..."
4,Genghis Khan,53.5,"[Fantasy, Action, Adventure]","[William Chan Wai-Ting, Lin Yun, Hu Jun, Ni Da..."


In [31]:
output_data_frames[4].head()

Unnamed: 0,Movie Title,Movie Rating,Movie Genre,Movie Cast
0,The Duel,65.0,"[Action, Comedy, Crime]","[Eugenia Suárez, Joaquín Furriel, Juan Ignacio..."
1,Guardians of the Galaxy Vol. 3,79.83,"[Science Fiction, Adventure, Action]","[Chris Pratt, Zoe Saldaña, Dave Bautista, Kare..."
2,Erik Stoneheart,40.0,"[Family, Adventure, Fantasy]","[Herman Avandi, Florin Gussak, Juhan Ulfsak, L..."
3,Spider-Man: No Way Home,79.75,"[Action, Adventure, Science Fiction]","[Tom Holland, Zendaya, Benedict Cumberbatch, J..."
4,Turning Red,74.0,"[Animation, Family, Comedy, Fantasy]","[Rosalie Chiang, Sandra Oh, Ava Morse, Hyein P..."


In [32]:
def generate_csv(df_list):
    # Store each data frame from df_list in a separate CSV file
    for i, dataframe in enumerate(df_list,start=1):
        # Define the file path within the new directory
        file_path = os.path.join(csv_folder, f'movie_page_{i}.csv')
        # Save the DataFrame to the CSV file
        dataframe.to_csv(file_path, index=False)
    return True
csv_flag=generate_csv(output_data_frames)
if csv_flag:
    print(f"Successfully generated {len(output_data_frames)} CSVs and stored them in directory named: {csv_folder}")
else:
    print("Something went wrong...")

Successfully generated 5 CSVs and stored them in directory named: Pooja_TMDB_CSVs_Folder
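
Because the genre and cast columns hold Python lists, to_csv writes each of those cells as the list's string form, and reading the file back yields strings rather than lists. A minimal sketch of converting such a cell back with ast.literal_eval; `parse_list_cell` is a hypothetical helper introduced for illustration:

```python
import ast

def parse_list_cell(cell_text):
    """Convert a cell such as "['Action', 'Drama']" back into a Python list."""
    return ast.literal_eval(cell_text)

genres = parse_list_cell("['Action', 'Drama']")
print(genres)
```

With pandas, a function like this could be supplied through the converters argument of pd.read_csv for the two list columns.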


In [33]:
# Reuse the data frames scraped above instead of requesting all five pages again
consolidated_dataframe=pd.concat(output_data_frames, ignore_index=True)
consolidated_dataframe.head()

Unnamed: 0,Movie Title,Movie Rating,Movie Genre,Movie Cast
0,Badland Hunters,69.28,"[Science Fiction, Action, Drama]","[Ma Dong-seok, Lee Hee-jun, Lee Jun-young, Roh..."
1,Sixty Minutes,71.0,"[Action, Drama]","[Emilio Sakraya, Dennis Mojen, Marie Mouroum, ..."
2,Wonka,72.15,"[Comedy, Family, Fantasy]","[Timothée Chalamet, Calah Lane, Keegan-Michael..."
3,The Marvels,63.13,"[Science Fiction, Adventure, Action]","[Brie Larson, Teyonah Parris, Iman Vellani, Za..."
4,The Beekeeper,73.04,"[Action, Thriller, Drama]","[Jason Statham, Emmy Raver-Lampman, Bobby Nade..."


In [34]:
consolidated_dataframe.shape

(100, 4)