# Demo: How to scrape multiple things from multiple pages

The goal is to scrape info about the five top-grossing movies for each year, for 10 years. For each movie I want the title, the rank, and how much money it grossed at the box office. In the end I will put the scraped data into a CSV file.

In [8]:
from bs4 import BeautifulSoup
import requests

In [32]:
url = 'https://www.boxofficemojo.com/yearly/chart/?yr=2018'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')

In [37]:
# I discover the data I want is in an HTML table with no class or ID 
tables = soup.find_all( 'table' )
print(len(tables))

11


In [33]:
# I had to test a few numbers before I got the correct tables[] and rows[] numbers
# I just kept changing the number and printing until I found it
rows = tables[6].find_all('tr')
# print(len(rows))
# print(rows[2])
cells = rows[2].find_all('td')
title = cells[1].text
print(title)

Black Panther
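
Instead of changing the number by hand, one option is to ask which tables contain a title I already know is on the page. This is only a sketch: `find_tables_containing` is a hypothetical helper (not part of BeautifulSoup), and `sample_html` is a tiny made-up stand-in for the real page, which had 11 tables.

```python
from bs4 import BeautifulSoup

# Made-up stand-in page with two tables (assumption: the real layout differs)
sample_html = """
<table><tr><td>Navigation</td></tr></table>
<table>
  <tr><td>Rank</td><td>Title</td><td>Studio</td><td>Gross</td></tr>
  <tr><td>1</td><td>Black Panther</td><td>BV</td><td>$700,059,566</td></tr>
</table>
"""

def find_tables_containing(soup, text):
    """Return the index of every table whose text includes `text`."""
    tables = soup.find_all('table')
    return [i for i, table in enumerate(tables) if text in table.get_text()]

soup = BeautifulSoup(sample_html, 'html.parser')
print(find_tables_containing(soup, 'Black Panther'))  # prints [1]
```

On a page with nested tables, several indexes may match, because an outer table's text includes the text of the tables inside it. `find_all()` returns tables in document order, so outer tables come before the ones nested in them.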


In [35]:
# get top 5 movies on this page - I know the first row is [2]
for i in range(2, 7):
    cells = rows[i].find_all('td')
    title = cells[1].text
    print(title)

Black Panther
Avengers: Infinity War
Incredibles 2
Jurassic World: Fallen Kingdom
Aquaman


In [36]:
# I would like to get the total gross number also
for i in range(2, 7):
    cells = rows[i].find_all('td')
    gross = cells[3].text
    print(gross)

$700,059,566
$678,815,482
$608,581,744
$417,719,760
$335,038,565


In [38]:
# next I want to get rank (1-5), title and gross all on one line
for i in range(2, 7):
    cells = rows[i].find_all('td')
    print(cells[0].text, cells[1].text, cells[3].text)

1 Black Panther $700,059,566
2 Avengers: Infinity War $678,815,482
3 Incredibles 2 $608,581,744
4 Jurassic World: Fallen Kingdom $417,719,760
5 Aquaman $335,038,565


In [39]:
# I want to do this for 10 years, ending with 2018
# first create a list of the years I want
years = []
start = 2018
for i in range(0, 10):
    years.append(start - i)
print(years)

[2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009]
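
The same list can probably be built in one line with `range()` and a negative step. A minimal sketch, assuming the same `start` of 2018 and a span of 10 years:

```python
start = 2018

# range(start, stop, step) counts down when step is negative;
# the stop value (2008) is not included
years = list(range(start, start - 10, -1))
print(years)  # prints [2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009]
```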


In [41]:
# create a base url so I can open each year's page
base_url = 'https://www.boxofficemojo.com/yearly/chart/?yr='
# test it
# print(base_url + years[0]) -- ERROR: a str and an int cannot be joined with +
print( base_url + str(years[0]) )

https://www.boxofficemojo.com/yearly/chart/?yr=2018
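
To see why the commented-out line fails, and one alternative to `str()`, here is a small sketch. It only builds strings and does not request any page. The f-string is my substitution, not what the cells in this demo use.

```python
base_url = 'https://www.boxofficemojo.com/yearly/chart/?yr='
year = 2018

# Joining a str and an int with + raises a TypeError
try:
    print(base_url + year)
except TypeError as err:
    print('TypeError:', err)

# An f-string converts the integer to text for us
url = f'{base_url}{year}'
print(url)
```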


In [42]:
# collect all necessary pieces from above to make a loop that gets top 5 movies 
# for each of the 10 years
for year in years:
    url = base_url + str(year)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    tables = soup.find_all( 'table' )
    rows = tables[6].find_all('tr')
    for i in range(2, 7):
        cells = rows[i].find_all('td')
        print(cells[0].text, cells[1].text, cells[3].text)


1 Black Panther $700,059,566
2 Avengers: Infinity War $678,815,482
3 Incredibles 2 $608,581,744
4 Jurassic World: Fallen Kingdom $417,719,760
5 Aquaman $335,038,565
1 Star Wars: The Last Jedi $620,181,382
2 Beauty and the Beast (2017) $504,014,165
3 Wonder Woman $412,563,408
4 Jumanji: Welcome to the Jungle $404,515,480
5 Guardians of the Galaxy Vol. 2 $389,813,101
1 Rogue One: A Star Wars Story $532,177,324
2 Finding Dory $486,295,561
3 Captain America: Civil War $408,084,349
4 The Secret Life of Pets $368,384,330
5 The Jungle Book (2016) $364,001,123
1 Star Wars: The Force Awakens $936,662,225
2 Jurassic World $652,270,625
3 Avengers: Age of Ultron $459,005,868
4 Inside Out $356,461,711
5 Furious 7 $353,007,020
1 American Sniper $350,126,372
2 The Hunger Games: Mockingjay - Part 1 $337,135,885
3 Guardians of the Galaxy $333,176,600
4 Captain America: The Winter Soldier $259,766,572
5 The LEGO Movie $257,760,692
1 The Hunger Games: Catching Fire $424,668,047
2 Iron Man 3 $409,013,994


In [49]:
# I realize now that each line needs to have the year also
# and maybe I should clean the gross so it's a pure integer
# so test that, using .strip() and .replace() chained together
num = '$293,004,164'
print(num.strip('$').replace(',', ''))

293004164
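
Note that the result above is still a string of digits. If I wanted a real integer (to sort or add up the grosses, say), I could wrap the same cleanup in `int()`. A sketch, where `parse_gross` is a hypothetical helper name:

```python
def parse_gross(text):
    """Convert a string such as '$293,004,164' to an integer."""
    # strip() removes surrounding whitespace, lstrip('$') the dollar sign,
    # replace() the thousands separators
    return int(text.strip().lstrip('$').replace(',', ''))

print(parse_gross('$293,004,164'))                                 # prints 293004164
print(parse_gross('$700,059,566') + parse_gross('$678,815,482'))  # prints 1378875048
```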


In [47]:
miniyears = [2017, 2014]
for year in miniyears:
    url = base_url + str(year)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    tables = soup.find_all( 'table' )
    rows = tables[6].find_all('tr')
    for i in range(2, 7):
        cells = rows[i].find_all('td')
        gross = cells[3].text.strip('$').replace(',', '')
        print(year, cells[0].text, cells[1].text, gross)


2017 1 Star Wars: The Last Jedi 620181382
2017 2 Beauty and the Beast (2017) 504014165
2017 3 Wonder Woman 412563408
2017 4 Jumanji: Welcome to the Jungle 404515480
2017 5 Guardians of the Galaxy Vol. 2 389813101
2014 1 American Sniper 350126372
2014 2 The Hunger Games: Mockingjay - Part 1 337135885
2014 3 Guardians of the Galaxy 333176600
2014 4 Captain America: The Winter Soldier 259766572
2014 5 The LEGO Movie 257760692


In [1]:
# I should really save my data into a csv
# this cell repeats the imports, base_url and years from above,
# so it can run on its own in a fresh session

import csv

import requests
from bs4 import BeautifulSoup

# same base url and list of years as above
base_url = 'https://www.boxofficemojo.com/yearly/chart/?yr='
years = []
start = 2018
for i in range(0, 10):
    years.append(start - i)

# open new file for writing -
csvfile = open("movies.csv", 'w', newline='', encoding='utf-8')

# make a new variable, c, for Python's CSV writer object -
c = csv.writer(csvfile)

# write header row to csv
c.writerow( ['year', 'rank', 'title', 'gross'] )

# modified code from above
for year in years:
    url = base_url + str(year)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    tables = soup.find_all( 'table' )
    rows = tables[6].find_all('tr')
    for i in range(2, 7):
        cells = rows[i].find_all('td')
        gross = cells[3].text.strip('$').replace(',', '')
        # print(year, cells[0].text, cells[1].text, gross)
        # instead of printing, I need to make a list and write that list to the CSV as one row
        c.writerow( [year, cells[0].text, cells[1].text, gross] )

# close the file
csvfile.close()


The result is a CSV file, named movies.csv, that has 51 rows: the header row plus 5 movies for each year from 2009 through 2018. It has four columns: year, rank, title, and gross.

Note that **only the final cell above** is needed to create this CSV, which it does by scraping 10 separate web pages. It relies only on the imports, `base_url`, and `years` introduced earlier. Everything *before* that final cell is demonstration, intended to show the problem-solving you need to go through to get to a desired scraping result.
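
One way to sanity-check CSV output is to read it back with `csv.DictReader`. The sketch below does not touch movies.csv or the web. It writes two rows, copied from the 2018 output earlier, to an in-memory buffer (an assumption made so the example runs anywhere) and then reads them back.

```python
import csv
import io

# Two rows copied from the 2018 output, standing in for the full scrape
sample_rows = [
    [2018, '1', 'Black Panther', '700059566'],
    [2018, '2', 'Avengers: Infinity War', '678815482'],
]

# An in-memory text buffer used in place of a real file
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['year', 'rank', 'title', 'gross'])
writer.writerows(sample_rows)

# Rewind and read the rows back as dictionaries keyed by the header row
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(len(records))                                   # prints 2
print(records[0]['title'], int(records[0]['gross']))  # prints Black Panther 700059566
```

With the real file, the same check would open movies.csv for reading in place of the buffer, and I would expect 50 records, since `DictReader` uses the header row for the keys and does not count it.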