For this exercise, we will scrape a simple web page and extract information from it. Our target is the list of the top 250 movies of all time on the IMDb website. The walkthrough below builds the scraper step by step using Python, the requests library and Beautiful Soup.

In [10]:
!pip install bs4

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.11.2-py3-none-any.whl (129 kB)
     -------------------------------------- 129.4/129.4 kB 2.5 MB/s eta 0:00:00
Collecting soupsieve>1.2
  Downloading soupsieve-2.4-py3-none-any.whl (37 kB)
Installing collected packages: soupsieve, beautifulsoup4, bs4
  Running setup.py install for bs4: started
  Running setup.py install for bs4: finished with status 'done'
Successfully installed beautifulsoup4-4.11.2 bs4-0.0.1 soupsieve-2.4


  DEPRECATION: bs4 is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559

[notice] A new release of pip available: 22.3 -> 23.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [1]:
# The requests library in Python can be used for making web requests
import requests

In [2]:
# URL of the IMDb web page that we are trying to scrape. Visit the URL to understand what information this page contains.
url = "https://www.imdb.com/chart/top"

In [3]:
# Fetch the web page using requests library's get method
response = requests.get(url)

In [4]:
# When the web page is fetched successfully, the response has the status code 200,
# which means that the GET request succeeded.
# Failure to fetch the page produces other status codes (e.g. 403, 404, 500, etc.)
response.status_code

200
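The meaning of a status code can be looked up without making any request. The sketch below uses the standard library's `http.HTTPStatus` enum; the helper name `describe_status` is hypothetical and exists only for this illustration.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    # Codes in the 200-299 range mean the request succeeded
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase}: {outcome}"

for code in (200, 403, 404, 500):
    print(describe_status(code))
```

With a real `requests` response, calling `response.raise_for_status()` is a common way to stop early: it raises an exception for 4xx and 5xx codes.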

In [5]:
# Data type of the web response
type(response)

requests.models.Response

In [6]:
# The html page content we fetched is residing within this response variable
# We need to access the right attribute of response to access the html content.
# Calling help() on 'response' shows the possible attributes and methods of 'response' 

# help(response) # Uncomment this if you're interested

In [6]:
# To get the html content, we are only interested in the attribute named 'content' for now
page_content = response.content

In [7]:
# This shows that the html content is fetched in bytes format
type(page_content)

bytes
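Bytes become readable text once they are decoded. The following is a minimal sketch on made-up sample data (not the IMDb page), assuming the content is UTF-8 encoded; the variable names are illustrative.

```python
# A stand-in for response.content: UTF-8 encoded bytes (sample data, not from IMDb)
sample_bytes = "<title>Café Movies</title>".encode("utf-8")

# bytes.decode() turns raw bytes into a str using the given encoding
sample_text = sample_bytes.decode("utf-8")

print(type(sample_bytes).__name__)  # bytes
print(type(sample_text).__name__)   # str
print(sample_text)
```

The requests library also exposes an already decoded version of the page as `response.text`.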

In [None]:
# now try this: read http://somepage.com and output the contents as a string


In [9]:
# If you look into the html content itself, it may seem like a mess
# Our goal is to pull out the necessary text content from this messy data
# page_content

**BeautifulSoup** is a python library which can be used to parse html content and extract relevant information from web pages.

In [11]:
from bs4 import BeautifulSoup

In [12]:
# Load the html page as BeautifulSoup structure
# The 'soup' object holds the parsed html document
soup = BeautifulSoup(page_content, 'html.parser')

Now that we can access the html content, our next task is to parse it and extract the important information. If you explore the data above closely, you can see different tags (e.g. < a >, < span >, etc.), each containing a piece of information. The elements with tags are known as DOM (Document Object Model) elements. These DOM elements maintain the structure, style and content of an html page. We will use these DOM elements to get different portions of the html page.
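To make the idea of tags, text and attributes concrete, here is a small sketch on a hand-written html snippet. The snippet, its class names and the variable names are made up for illustration and are not taken from the IMDb page.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML snippet (sample data, not taken from IMDb)
sample_html = """
<div class="movie">
  <a href="/title/example/">Example Movie</a>
  <span class="year">(2000)</span>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching element; class_ filters on the 'class' attribute
movie_div = sample_soup.find("div", class_="movie")

print(movie_div.a.text)      # text inside the <a> tag
print(movie_div.a["href"])   # value of the 'href' attribute
print(movie_div.span.text)   # text inside the <span> tag
```

Writing `movie_div.a` is shorthand for the first < a > element found inside `movie_div`.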




In [19]:
soup

Top 250 Movies - IMDb

Top 250 Movies

MenuMoviesRelease CalendarTop 250 MoviesMost Popular MoviesBrowse Movies by GenreTop Box OfficeShowtimes & TicketsMovie NewsIndia Movie SpotlightTV ShowsWhat's on TV & StreamingTop 250 TV ShowsMost Popular TV ShowsBrowse TV Shows by GenreTV NewsIndia TV SpotlightWatchWhat to WatchLatest TrailersIMDb OriginalsIMDb PicksIMDb PodcastsAwards & EventsOscarsBest Picture WinnersBest Picture WinnersIndependent Spirit AwardsWomen's History MonthSXSWSTARmeter AwardsAwards CentralFestival CentralAll EventsCelebsBorn TodayMost Popular CelebsMost Popular CelebsCelebrity NewsCommunityHelp CenterContributor ZonePollsFor Industry ProfessionalsAllAllTitlesTV EpisodesCelebsCompaniesKeywordsAdvanced SearchWatchlistSign InENFully supportedEnglish (United States)Partially supportedFrançais (Canada)Français (France)Deutsch (Deutschland)हिंदी (भारत)Italiano (Italia)Português (Brasil)Español (España

Let's visit the IMDb web page for the top 250 movies:
https://www.imdb.com/chart/top


You can find a long list of movies on the page. At the top, you can see some classic movies like "The Shawshank Redemption", "The Godfather", "The Dark Knight", etc.

When you explore the page, you can see various information about each movie, for example the title, the release year, and the rating out of 10. If you hover your mouse over the title or rating of a movie, more information appears, including the names of the director and major actors and the number of users who voted for the movie on IMDb. In this lab, we will extract this information for all movies on the web page.

Now, let's take the first movie, "The Shawshank Redemption", as an example and try to locate it in the html page content we fetched. This phrase appears in several places in the html. If you explore the html code around it carefully, you can see many < th >, < tr >, and similar tags, which means that the information on this page is organized in a table. Above this code, you can also find the < table > tag for the table that contains the information about the movies.

In [13]:
# Let's get the <table> element
movie_table = soup.find('table')

In [14]:
# Take a closer look at the tags
movie_table

<table class="chart full-width" data-caller-name="chart-top250movie">
<colgroup>
<col class="chartTableColumnPoster"/>
<col class="chartTableColumnTitle"/>
<col class="chartTableColumnIMDbRating"/>
<col class="chartTableColumnYourRating"/>
<col class="chartTableColumnWatchlistRibbon"/>
</colgroup>
<thead>
<tr>
<th></th>
<th>Rank &amp; Title</th>
<th>IMDb Rating</th>
<th>Your Rating</th>
<th></th>
</tr>
</thead>
<tbody class="lister-list">
<tr>
<td class="posterColumn">
<span data-value="1" name="rk"></span>
<span data-value="9.235699555980915" name="ir"></span>
<span data-value="7.791552E11" name="us"></span>
<span data-value="2710615" name="nv"></span>
<span data-value="-1.764300444019085" name="ur"></span>
<a href="/title/tt0111161/"> <img alt="The Shawshank Redemption" height="67" src="https://m.media-amazon.com/images/M/MV5BNDE3ODcxYzMtY2YzZC00NmNlLWJiNDMtZDViZWM2MzIxZDYwXkEyXkFqcGdeQXVyNjAwNDUxODI@._V1_UX45_CR0,0,45,67_AL_.jpg" width="45"/>
</a> </td>
<td class="titleColumn">
    

The < td > (Table Data), < tr > (Table Row) and < th > (Table Header) tags contain the information we are interested in. Let's grab them one by one.

Locate the first movie, "The Shawshank Redemption" again within the html < table > element. You can find it twice in the table, both within separate < td > elements. The 'class' attributes of these two < td > elements are "posterColumn" and "titleColumn". Let's get the < td > element with "titleColumn" for the first movie, "The Shawshank Redemption". 

In [15]:
first_movie_title_element = movie_table.find('td',class_="titleColumn")
first_movie_title_element

<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
<span class="secondaryInfo">(1994)</span>
</td>

Now, we can easily scrape the title from this < td > element by accessing the < a > element inside it.

In [16]:
# Title of the first movie
first_movie_title_element.a.text.strip() # Remember the syntax? Revisit the class code to recap.

'The Shawshank Redemption'

In [17]:
# Exercise:
# Get the title name for the first movie from the < td > element with 'class' value 'posterColumn'

## Solution:
# first_movie_poster_element = movie_table.find('td',class_="posterColumn")
# first_movie_poster_element.a.img['alt'].strip()

In [18]:
# Let's write a function to get all the movie titles as a list
def get_movie_titles(table_element):
  movie_list = []
  td_elements = table_element.find_all('td', class_="titleColumn")
  for elem in td_elements:
    movie_list.append(elem.a.text.strip())
  
  return movie_list

In [19]:
# Calling the function to get all movie titles
movie_titles = get_movie_titles(movie_table)

In [20]:
# Observe the first 10 movie titles
movie_titles[:10]

['The Shawshank Redemption',
 'The Godfather',
 'The Dark Knight',
 'The Godfather Part II',
 '12 Angry Men',
 "Schindler's List",
 'The Lord of the Rings: The Return of the King',
 'Pulp Fiction',
 'The Lord of the Rings: The Fellowship of the Ring',
 'The Good, the Bad and the Ugly']

Let's look at 'first_movie_title_element' again. The director and actor names are stored in the 'title' attribute of the < a > element. Let's scrape that.

In [21]:
first_movie_title_element

<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
<span class="secondaryInfo">(1994)</span>
</td>

In [22]:
# Director and actor names from the <a> element's 'title' attribute
first_dir_and_actor_names = first_movie_title_element.a['title'].strip()
first_dir_and_actor_names

'Frank Darabont (dir.), Tim Robbins, Morgan Freeman'

These comma-separated names can be parsed using .split() on the text.

In [23]:
# We can get the director and actor names as a list
first_movie_persons = first_dir_and_actor_names.split(',')
first_movie_persons

['Frank Darabont (dir.)', ' Tim Robbins', ' Morgan Freeman']

In [24]:
# Director is the first element in the list
first_movie_persons[0]

'Frank Darabont (dir.)'

In [25]:
# Let's remove the "(dir.)" part at the end
first_movie_persons[0].replace("(dir.)", "").strip()

'Frank Darabont'

In [26]:
# Actors are the rest of the elements in the list
first_movie_persons[1:]

[' Tim Robbins', ' Morgan Freeman']
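The individual steps above can be combined into a single helper that also strips the stray spaces. This is a minimal sketch; the function name `split_director_and_actors` is hypothetical, and it assumes the text always follows the pattern 'Director (dir.), Actor, Actor'.

```python
def split_director_and_actors(people_text):
    """Split 'Director (dir.), Actor A, Actor B' into (director, [actors])."""
    # Strip each comma-separated name so no stray spaces remain
    names = [name.strip() for name in people_text.split(",")]
    # The first name is the director; drop the "(dir.)" marker
    director = names[0].replace("(dir.)", "").strip()
    return director, names[1:]

sample_director, sample_actors = split_director_and_actors(
    "Frank Darabont (dir.), Tim Robbins, Morgan Freeman"
)
print(sample_director)  # Frank Darabont
print(sample_actors)    # ['Tim Robbins', 'Morgan Freeman']
```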

In [27]:
# Write a function to get the director and actor names for all the movies
def get_dir_and_actors(table_element):
  directors = []
  actors = []
  td_elements = table_element.find_all('td', class_="titleColumn")
  for elem in td_elements:
    people = elem.a['title'].strip().split(',')
    directors.append(people[0].replace("(dir.)", ""))
    actors.append(people[1:])
      
  return directors, actors

In [28]:
director_names, actors_names = get_dir_and_actors(movie_table)

In [29]:
# Directors of 250 movies
len(director_names)

250

In [30]:
# Observe directors and actors for first 10 movies
director_names[:10], actors_names[:10]

(['Frank Darabont ',
  'Francis Ford Coppola ',
  'Christopher Nolan ',
  'Francis Ford Coppola ',
  'Sidney Lumet ',
  'Steven Spielberg ',
  'Peter Jackson ',
  'Quentin Tarantino ',
  'Peter Jackson ',
  'Sergio Leone '],
 [[' Tim Robbins', ' Morgan Freeman'],
  [' Marlon Brando', ' Al Pacino'],
  [' Christian Bale', ' Heath Ledger'],
  [' Al Pacino', ' Robert De Niro'],
  [' Henry Fonda', ' Lee J. Cobb'],
  [' Liam Neeson', ' Ralph Fiennes'],
  [' Elijah Wood', ' Viggo Mortensen'],
  [' John Travolta', ' Uma Thurman'],
  [' Elijah Wood', ' Ian McKellen'],
  [' Clint Eastwood', ' Eli Wallach']])

The actor names still have leading whitespace. Let's use .strip() to remove it.

In [31]:
actor_names = [[actor.strip() for actor in actor_list] for actor_list in actors_names]
actor_names[:10]

[['Tim Robbins', 'Morgan Freeman'],
 ['Marlon Brando', 'Al Pacino'],
 ['Christian Bale', 'Heath Ledger'],
 ['Al Pacino', 'Robert De Niro'],
 ['Henry Fonda', 'Lee J. Cobb'],
 ['Liam Neeson', 'Ralph Fiennes'],
 ['Elijah Wood', 'Viggo Mortensen'],
 ['John Travolta', 'Uma Thurman'],
 ['Elijah Wood', 'Ian McKellen'],
 ['Clint Eastwood', 'Eli Wallach']]

We will now get the release year of the first movie. Explore the 'first_movie_title_element' we fetched before.

In [32]:
first_movie_title_element

<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
<span class="secondaryInfo">(1994)</span>
</td>

In [33]:
# Exercise:
# Write a function to get release years for all the movies as a list
# Use the 'movie_table' we extracted as the input of the function

# Solution:
def get_release_years(table_element):
  years = []
  td_elements = table_element.find_all('td', class_='titleColumn')
  for elem in td_elements:
    years.append(elem.span.text[1:-1])
  return years

release_years = get_release_years(movie_table)
release_years[:10]

['1994',
 '1972',
 '2008',
 '1974',
 '1957',
 '1993',
 '2003',
 '1994',
 '2001',
 '1966']
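The slice `[1:-1]` assumes the text is exactly of the form '(1994)' and returns the year as a string. As an alternative sketch, a regular expression can pull out the four digits and convert them to an integer; the helper name `parse_year` is hypothetical.

```python
import re

def parse_year(year_text):
    """Extract a four-digit year from text such as '(1994)'; return None if absent."""
    match = re.search(r"\d{4}", year_text)
    return int(match.group()) if match else None

print(parse_year("(1994)"))  # 1994
print(parse_year("n/a"))     # None
```

Integer years are easier to sort and filter later, for example when selecting movies released before 1980.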

In [34]:
# Exercise:
# Get a list of links for every movie. Use the previous code block as base code
# Hint: Link for the first movie, "The Shawshank Redemption" should be 'https://www.imdb.com/title/tt0111161/'
# Try to locate the portion after 'https://www.imdb.com'. For example, for the first movie, it's '/title/tt0111161/' 

# Solution
def get_urls(table_element):
  urls = []
  base_url = 'https://www.imdb.com'

  td_elements = table_element.find_all('td', class_='titleColumn')
  for elem in td_elements:
    urls.append(base_url + elem.a['href'])
  return urls
  

In [35]:
# Getting urls for all the movies
links = get_urls(movie_table)

In [36]:
# Display imdb page link for first 10 movies
links[:10]

['https://www.imdb.com/title/tt0111161/',
 'https://www.imdb.com/title/tt0068646/',
 'https://www.imdb.com/title/tt0468569/',
 'https://www.imdb.com/title/tt0071562/',
 'https://www.imdb.com/title/tt0050083/',
 'https://www.imdb.com/title/tt0108052/',
 'https://www.imdb.com/title/tt0167260/',
 'https://www.imdb.com/title/tt0110912/',
 'https://www.imdb.com/title/tt0120737/',
 'https://www.imdb.com/title/tt0060196/']
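String concatenation works here because every 'href' starts with a slash. As a slightly more defensive sketch, the standard library's `urllib.parse.urljoin` builds the same absolute link and also copes with a base URL that already contains a path; the variable names are illustrative.

```python
from urllib.parse import urljoin

sample_base = "https://www.imdb.com"
sample_href = "/title/tt0111161/"

# urljoin resolves the relative link against the base URL
print(urljoin(sample_base, sample_href))

# It also works when the base is the chart page itself
print(urljoin("https://www.imdb.com/chart/top", sample_href))
```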

We will now get the ratings for the movies. Let's visit the raw html table again and locate the rating information for a movie. You can see that it is within a < td > element whose 'class' value is 'ratingColumn imdbRating'.

In [37]:
# Get the html <td> elements for ratings
ratings_elements = movie_table.find_all('td', class_='ratingColumn imdbRating')

In [38]:
# Observe the first element
ratings_elements[0]

<td class="ratingColumn imdbRating">
<strong title="9.2 based on 2,710,615 user ratings">9.2</strong>
</td>

In [39]:
# Now let's scrape the text here from <strong> element
ratings_elements[0].strong['title']

'9.2 based on 2,710,615 user ratings'

You can split the text using .split() with a space delimiter and grab the rating and the user count.

In [40]:
text_tokens = ratings_elements[0].strong['title'].split(' ')
text_tokens

['9.2', 'based', 'on', '2,710,615', 'user', 'ratings']

In [41]:
# Rating and user count are the first and fourth elements in the list
text_tokens[0], text_tokens[3]

('9.2', '2,710,615')

In [42]:
# You can also find the rating inside the strong element as text
ratings_elements[0].strong.text

'9.2'
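Both values are still strings at this point. The sketch below converts them to numbers, assuming the text always follows the pattern '<rating> based on <count> user ratings'; the helper name `parse_rating_title` is hypothetical.

```python
def parse_rating_title(title_text):
    """Parse text like '9.2 based on 2,710,615 user ratings' into (float, int)."""
    tokens = title_text.split(" ")
    rating = float(tokens[0])
    # Remove the thousands separators before converting to int
    votes = int(tokens[3].replace(",", ""))
    return rating, votes

print(parse_rating_title("9.2 based on 2,710,615 user ratings"))  # (9.2, 2710615)
```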

In [43]:
# Exercise:
# Write a function to get the ratings and user count for all the movies in two lists
# Use 'ratings_elements' list we fetched above as the input of the function

# Solution:
def get_ratings_and_voters(ratings_elements):
  voter_counts = []
  ratings = []

  for elem in ratings_elements:
    ratings.append(elem.strong.text)
    text_tokens = elem.strong['title'].split(' ')
    voter_counts.append(text_tokens[3])

  return ratings, voter_counts

ratings, user_counts = get_ratings_and_voters(ratings_elements) # This line can be provided as a reference to the function name 

# Let's see the first 10 ratings
print(ratings[:10])

# See the voter counts for first 10 movies
user_counts[:10]

['9.2', '9.2', '9.0', '9.0', '9.0', '8.9', '8.9', '8.8', '8.8', '8.8']


['2,710,615',
 '1,882,483',
 '2,683,633',
 '1,285,154',
 '800,827',
 '1,369,466',
 '1,865,542',
 '2,081,146',
 '1,895,085',
 '769,291']

Let's store the information we fetched in a Pandas dataframe, in tabular form.

In [44]:
import pandas as pd

In [45]:
# Exercise:
# If the above tasks are completed successfully, execution of this block should create a dataframe for movies
# Try it out.

movies = pd.DataFrame({
    "Title": movie_titles,
    "Rating": ratings,
    "directors": director_names,
    "actors": actors_names,
    "voter_count": user_counts,
    "years": release_years,
    "link": links
})

movies.head()

| | Title | Rating | directors | actors | voter_count | years | link |
|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 9.2 | Frank Darabont | [ Tim Robbins, Morgan Freeman] | 2710615 | 1994 | https://www.imdb.com/title/tt0111161/ |
| 1 | The Godfather | 9.2 | Francis Ford Coppola | [ Marlon Brando, Al Pacino] | 1882483 | 1972 | https://www.imdb.com/title/tt0068646/ |
| 2 | The Dark Knight | 9.0 | Christopher Nolan | [ Christian Bale, Heath Ledger] | 2683633 | 2008 | https://www.imdb.com/title/tt0468569/ |
| 3 | The Godfather Part II | 9.0 | Francis Ford Coppola | [ Al Pacino, Robert De Niro] | 1285154 | 1974 | https://www.imdb.com/title/tt0071562/ |
| 4 | 12 Angry Men | 9.0 | Sidney Lumet | [ Henry Fonda, Lee J. Cobb] | 800827 | 1957 | https://www.imdb.com/title/tt0050083/ |
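The scraped values are all text, so numeric operations such as sorting by votes would not behave as expected. The sketch below shows one possible way to convert such columns on a small stand-in dataframe; `sample_df` is an illustrative name, and the two rows reuse values shown earlier in this notebook.

```python
import pandas as pd

# A small stand-in for the scraped data, with every value stored as a string
sample_df = pd.DataFrame({
    "Rating": ["9.2", "9.0"],
    "voter_count": ["2,710,615", "800,827"],
    "years": ["1994", "1957"],
})

sample_df["Rating"] = sample_df["Rating"].astype(float)
# Drop the thousands separators, then convert to integers
sample_df["voter_count"] = (
    sample_df["voter_count"].str.replace(",", "", regex=False).astype(int)
)
sample_df["years"] = sample_df["years"].astype(int)

print(sample_df.dtypes)
```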


In [46]:
# Exercise:
# Save the dataframe as a csv file

# Solution:
movies.to_csv('movies250.csv')
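By default, `to_csv` also writes the row index as an unnamed first column, which comes back as 'Unnamed: 0' when the file is read again. The sketch below illustrates the difference on a small stand-in dataframe, using in-memory strings instead of files; `demo_df` is an illustrative name.

```python
import io
import pandas as pd

demo_df = pd.DataFrame({"Title": ["12 Angry Men"], "years": ["1957"]})

# Default: the row index is written as an extra, unnamed first column
csv_with_index = demo_df.to_csv()
# index=False leaves the index out of the output
csv_without_index = demo_df.to_csv(index=False)

# Read both versions back and compare the column names
columns_with_index = pd.read_csv(io.StringIO(csv_with_index)).columns.tolist()
columns_without_index = pd.read_csv(io.StringIO(csv_without_index)).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```

Whether to keep the index is a matter of preference; passing `index=False` simply keeps the saved file limited to the scraped columns.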

Let's consider the 5th movie in the list, "12 Angry Men". We will try to fetch more information about the movie using the link we extracted.

In [47]:
movies.iloc[4]

Title                                   12 Angry Men
Rating                                           9.0
directors                              Sidney Lumet 
actors                  [ Henry Fonda,  Lee J. Cobb]
voter_count                                  800,827
years                                           1957
link           https://www.imdb.com/title/tt0050083/
Name: 4, dtype: object