# Demo: How to scrape multiple things from multiple pages

This example uses the website [Box Office Mojo](https://www.boxofficemojo.com/).

The goal is to scrape information about the five top-grossing movies of each year, for 10 years. For each movie I want its title, its rank, and how much money it grossed at the box office. In the end I will put the scraped data into a CSV file. This small-scale project demonstrates how a larger-scale project, using other websites instead of Box Office Mojo, might be accomplished.

This demo uses:

* Python 3
* Jupyter Notebook
* [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Requests](https://requests.kennethreitz.org/en/master/)

In [None]:
# load the Python libraries
from bs4 import BeautifulSoup
import requests

In [None]:
# begin to explore ONE page - load the page and capture all of its HTML in a variable named `soup` 
url = 'https://www.boxofficemojo.com/yearly/chart/?yr=2019'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')

Before you can get the data you want, you need to *explore* the HTML of the selected web page, using Chrome Developer Tools.

**Step 1:** Right-click on the exact line or element you wish to capture (scrape). A menu pops up. Select **Inspect.**

The Elements inspection pane will open. If this is on the side (instead of at bottom, as shown), you can move it by clicking the kebab menu icon in the upper right corner, beside the X.

**Step 2:** Examine the HTML elements that hold the data you want to scrape. You'll use these element names to scrape the contents. If you don't know much about HTML, [this brief guide](http://bit.ly/mm-htmltags) will help.

The **circled area** in the HTML shows us that the data we want is contained in a `table` element. HTML tables have rows &mdash; in `tr` elements. The rows contain cells &mdash; in `td` elements. The start of a row, and some cells, can be seen in the **boxed area** above.
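
To make the relationship between `table`, `tr`, and `td` concrete, here is a minimal sketch that parses a tiny, made-up table with Beautiful Soup. The HTML string and the placeholder values in it are assumptions for illustration only; they are not the real Box Office Mojo markup.

```python
from bs4 import BeautifulSoup

# A made-up table, only for illustration (not real Box Office Mojo markup)
sample_html = """
<table>
  <tr><th>Rank</th><th>Title</th><th>Gross</th></tr>
  <tr><td>1</td><td>Example Movie A</td><td>$200</td></tr>
  <tr><td>2</td><td>Example Movie B</td><td>$100</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table')
sample_rows = sample_table.find_all('tr')

# the heading row uses th cells, so find_all('td') returns an empty list for it
for row in sample_rows:
    cell_texts = [td.text for td in row.find_all('td')]
    print(cell_texts)
```

Each row is searched separately for its cells, which is the same pattern used on the real page later in this notebook.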

There are various ways to scrape HTML tables other than what is demonstrated here. Our goal here is to demonstrate a scraping method that can work for more than just tables.

In [None]:
# I discover the data I want is in an HTML table with no class or ID 
# I capture all the tables on the page like this - in a Python list named tables -
tables = soup.find_all( 'table' )

# then I print how many tables are in that list
print(len(tables))

I had to test a few numbers before I got the correct `tables[]` and `rows[]` numbers you see below. I have deleted those testing cells from this notebook. It's important for beginners to understand that scraping requires a lot of trial-and-error work. It would be cumbersome to keep that work in the notebook, so it is gone. However, you will need to *do* that work before you discover how to get the data you desire from a page.

The number in square brackets &mdash; `[6]` &mdash; is the result of my trial-and-error testing. I just kept changing the number in `tables[ ]` and printing until I found the right table.
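
One way to speed up that trial and error is to print a short preview of every table at once, instead of changing the index by hand. The sketch below assumes a made-up page with three tables; the names `sample_html`, `sample_soup`, and `sample_tables` are hypothetical and only stand in for the real `soup` and `tables`.

```python
from bs4 import BeautifulSoup

# Made-up page with three tables, only for illustration
sample_html = """
<table><tr><td>site navigation</td></tr></table>
<table><tr><td>advertisement</td></tr></table>
<table>
  <tr><td>1</td><td>Example Movie A</td></tr>
  <tr><td>2</td><td>Example Movie B</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_tables = sample_soup.find_all('table')

# print each table's index, its number of rows, and the start of its text
for index, table in enumerate(sample_tables):
    row_count = len(table.find_all('tr'))
    preview = table.get_text(' ', strip=True)[:40]
    print(index, row_count, preview)
```

The table with the most rows and a preview that looks like movie data is usually the one to try first. On a real page, tables may be nested inside other tables, so some checking by hand is still likely to be needed.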

In [None]:
# I capture all the rows in that table like this - in a Python list named rows -
rows = tables[6].find_all('tr')

# Here is some of the testing needed to find the correct data row
print(len(rows))
print(rows[2])


Above: There are 106 rows in my list, named `rows`. I can see on the page that the top row contains column headings, which I do not want to scrape. Some more (deleted) testing showed me that `rows[0]` and `rows[1]` contained data I do not want. Therefore I printed `rows[2]` to have a look at its contents.

The contents are a lot of HTML elements all mashed together. Can you see the `td` that contains the first movie title?

In [None]:
# check to see whether I can get the first movie title in the first row
cells = rows[2].find_all('td')
title = cells[1].text
print(title)

When using Python lists, we access items in a list by their index, which is the number you see inside square brackets. `cells[1].text` is the text inside the *second* item in the list named `cells`. The first item in any list has the index `0`.
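
Here is a small sketch of list indexing on its own, without any scraping. The list `sample_cells` and its placeholder values are made up for illustration; they stand in for the text of the `td` cells in one row.

```python
# Placeholder values standing in for the text of each td cell in one row
sample_cells = ['1', 'Example Movie A', 'Example Studio', '$200']

print(sample_cells[0])     # first item, index 0
print(sample_cells[1])     # second item, index 1
print(sample_cells[-1])    # a negative index counts from the end of the list
print(len(sample_cells))   # 4 items, so the valid indexes are 0 through 3
```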

To access only the cells in the first row of data, I made a new list, named `cells`.

My next test is to determine whether I can cleanly get the contents I want from the first five rows in this table. I use a Python for-loop to do this. Using `range(2, 7)` will bring me the rows with the *indexes* 2 through 6 (the end value, 7, is not included in the range).
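
If you are unsure which numbers a `range()` produces, you can turn it into a list and print it. This sketch uses a made-up list named `sample_rows` in place of the real rows.

```python
# range(2, 7) starts at 2 and stops before 7
indexes = list(range(2, 7))
print(indexes)
print(len(indexes))

# slicing follows the same start/stop rule, so this picks the same five items
sample_rows = ['row0', 'row1', 'row2', 'row3', 'row4', 'row5', 'row6', 'row7']
print(sample_rows[2:7])
```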

In [None]:
# get top 5 movies on this page - I know the first row is [2]
for i in range(2, 7):
    cells = rows[i].find_all('td')
    title = cells[1].text
    print(title)

My test was successful! 

In [None]:
# I would like to get the total gross number also
for i in range(2, 7):
    cells = rows[i].find_all('td')
    gross = cells[3].text
    print(gross)

Testing each individual thing you want is easier to debug than trying to get all the things in one big chunk of code.

In [None]:
# next I want to get rank (1-5), title and gross all on one line
for i in range(2, 7):
    cells = rows[i].find_all('td')
    print(cells[0].text, cells[1].text, cells[3].text)

In [None]:
# I want to do this for 10 years, ending with 2019
# first create a list of the years I want
years = []
start = 2019
for i in range(0, 10):
    years.append(start - i)
print(years)

Sure, I could have *typed* out that list, named `years`, by hand, but a for-loop created the list for me. It's nice to use `range()` to do things like this.

**Feel free to try other years!** Box Office Mojo charts go back to 1980, so you could start your list (previous cell) with 1989.
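
The same list of years can also be built without an explicit loop. This is only a sketch of two alternatives; the names `years_a` and `years_b` are hypothetical, chosen so they do not overwrite the `years` list used in this notebook.

```python
start = 2019

# list comprehension: same result as the loop with append()
years_a = [start - i for i in range(10)]

# range() with a negative step counts down; the stop value (2009) is excluded
years_b = list(range(start, start - 10, -1))

print(years_a)
print(years_a == years_b)
```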

In [None]:
# create a base url so I can open each year's page
base_url = 'https://www.boxofficemojo.com/yearly/chart/?yr='
# test it
# print(base_url + years[0]) -- ERROR 
# years[0] is an integer, so I must convert it to a string, with str()
print( base_url + str(years[0]) )
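
To see the error mentioned in the comment without stopping the notebook, you can catch it. The sketch below uses the hypothetical names `base` and `year` so that it does not change `base_url` or `years`, and it also shows an f-string as an alternative to `str()`.

```python
base = 'https://www.boxofficemojo.com/yearly/chart/?yr='
year = 2019

try:
    print(base + year)          # joining a string and an integer is not allowed
except TypeError as error:
    print('TypeError:', error)

# two equivalent ways to build the string
url_a = base + str(year)
url_b = f'{base}{year}'
print(url_a == url_b)
```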

I am working toward making a loop that will open each web page I want (one for each year) and scrape from that page the top five movies and their gross earnings. To do this, I will combine the `base_url` text with the year. That will give me 10 URLs, one for each year.
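
As a quick check before requesting anything, the 10 URLs can be built and inspected as plain strings. This sketch repeats the same `base_url` and `years` values used in this notebook so that it runs on its own; the list name `urls` is an assumption.

```python
base_url = 'https://www.boxofficemojo.com/yearly/chart/?yr='
years = list(range(2019, 2009, -1))

# one URL per year; nothing is requested here, the strings are only built
urls = [base_url + str(year) for year in years]

print(len(urls))
print(urls[0])
print(urls[-1])
```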

Now that my testing process is (mostly) done, I can attempt to make the complete set of instructions to scrape all 10 pages:

In [None]:
# collect all necessary pieces from above to make a loop that gets top 5 movies 
# for each of the 10 years
for year in years:
    url = base_url + str(year)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    tables = soup.find_all( 'table' )
    rows = tables[6].find_all('tr')
    for i in range(2, 7):
        cells = rows[i].find_all('td')
        print(cells[0].text, cells[1].text, cells[3].text)


It worked! However, looking at this, I realize that each line needs to include the year as well. 

I also realize I should clean the gross so it contains only digits, with no dollar sign or commas, and can be treated as an integer. That is a common data-cleaning task, but I will do a test before I add it to my code.

[How `strip()` works](https://docs.python.org/3/library/stdtypes.html#str.strip)

[How `replace()` works](https://docs.python.org/3/library/stdtypes.html#str.replace)

In [None]:
# test cleaning the gross so only digits remain - using .strip() and .replace() chained together -
num = '$293,004,164'
print(num.strip('$').replace(',', ''))
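
Note that the cleaned value is still a string of digits. If you need a real integer, for example to add grosses together, `int()` will convert it. The sketch below uses the same test value; the names `cleaned`, `gross_int`, and `padded` are assumptions for illustration.

```python
num = '$293,004,164'

# strip('$') removes the character only from the two ends of the string;
# replace(',', '') removes every comma, wherever it appears
cleaned = num.strip('$').replace(',', '')
print(cleaned, type(cleaned))       # still a string

# int() converts the digits-only string to a real integer
gross_int = int(cleaned)
print(gross_int, type(gross_int))

# text taken from a web page may carry surrounding whitespace;
# a plain strip() first removes it before the $ is stripped
padded = ' $293,004,164\n'
print(int(padded.strip().strip('$').replace(',', '')))
```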

Now I make another attempt at the complete set of instructions to scrape all 10 pages. This time I am also going to write the year into the line. Instead of using all 10 years (because this is just a test), I create a short list with just two years in it.

In [None]:
miniyears = [2017, 2014]
for year in miniyears:
    url = base_url + str(year)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    tables = soup.find_all( 'table' )
    rows = tables[6].find_all('tr')
    for i in range(2, 7):
        cells = rows[i].find_all('td')
        gross = cells[3].text.strip('$').replace(',', '')
        print(year, cells[0].text, cells[1].text, gross)


Now I am confident this code will work with all 10 years. But before I run it with `years`, I want to be sure to save the data in a new CSV file. This uses Python's standard `csv` module in a routine way, so I do not test it separately. I just add it in. This will be my final cell in the notebook.

After it runs, I will have a new file named `movies.csv` in the same folder as this notebook, and it will contain a header row and 50 rows of movie data.

In [None]:
# to save data into a csv, we must import -

import csv

# open a new file for writing -
csvfile = open("movies.csv", 'w', newline='', encoding='utf-8')

# make a new variable, c, for Python's CSV writer object -
c = csv.writer(csvfile)

# write custom header row to csv
c.writerow( ['year', 'rank', 'title', 'gross'] )

# modified code that was already tested, from above 
for year in years:
    url = base_url + str(year)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    tables = soup.find_all( 'table' )
    rows = tables[6].find_all('tr')
    for i in range(2, 7):
        cells = rows[i].find_all('td')
        gross = cells[3].text.strip('$').replace(',', '')
        # print(year, cells[0].text, cells[1].text, gross)
        # instead of printing, I need to make a list and write that list to the CSV as one row
        c.writerow( [year, cells[0].text, cells[1].text, gross] )

# close the csv file and save it
csvfile.close()
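
A common alternative to calling `close()` yourself is to open the file in a `with` block, which closes it automatically. This is only a sketch with made-up placeholder rows, written to a temporary folder so it does not touch `movies.csv`; the names `sample_rows`, `sample_path`, `sample_file`, and `writer` are assumptions.

```python
import csv
import os
import tempfile

# hypothetical placeholder rows, only to show the file handling
sample_rows = [
    [2019, '1', 'Example Movie A', '200'],
    [2019, '2', 'Example Movie B', '100'],
]

# write into a temporary folder so this sketch does not touch movies.csv
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_movies.csv')

# the with-block closes the file automatically, even if an error occurs inside it
with open(sample_path, 'w', newline='', encoding='utf-8') as sample_file:
    writer = csv.writer(sample_file)
    writer.writerow(['year', 'rank', 'title', 'gross'])
    writer.writerows(sample_rows)

print(sample_file.closed)
```

If the scraping loop were placed inside such a block, the file would still be closed properly even if one of the page requests failed partway through.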


The result is a CSV file, named `movies.csv`, that has 51 rows: the header row plus 5 movies for each year from 2010 through 2019. It has four columns: year, rank, title, and gross.
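
To check a file like this, you can read it back with the `csv` module. The sketch below reads from a made-up string in the same four-column layout instead of the real file, so the titles and numbers are placeholders, not scraped data.

```python
import csv
import io

# hypothetical file contents in the same four-column layout as movies.csv
sample_text = """year,rank,title,gross
2019,1,Example Movie A,200
2019,2,Example Movie B,100
2018,1,Example Movie C,300
"""

# DictReader uses the header row as keys for every following row
reader = csv.DictReader(io.StringIO(sample_text))
records = list(reader)

print(len(records))            # data rows only; the header is not counted
print(records[0]['title'])

# every value is read back as a string, so convert before doing arithmetic
total_2019 = sum(int(r['gross']) for r in records if r['year'] == '2019')
print(total_2019)
```

To run the same check on the real file, replace `io.StringIO(sample_text)` with a file opened by `open('movies.csv', newline='', encoding='utf-8')`.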

Note that **only the final cell above** is needed to create this CSV by scraping 10 separate web pages, provided the imports and the `years` and `base_url` variables from earlier cells have already been defined. Everything else above it is instruction and demonstration. It is intended to show the problem-solving you need to go through to reach a desired scraping result.
