# Part 1: Scraping web data

In this notebook we will scrape actor biographies from IMDb. In Part 2 these biographies will be used to build a recommendation engine. This notebook will guide you through the process. See [here](https://github.com/nestauk/taller_centro_cultura_digital/blob/master/data_collection.ipynb) for one possible solution.

In [None]:
from bs4 import BeautifulSoup
import re
import requests
from retrying import retry
import json

# The first page we're scraping
TOP_PAGE = "https://www.imdb.com/list/ls058011111/"

# The second page we're scraping
BIO_PAGE = "https://imdb.com/name/{}/bio?ref_=nmls_hd"

## Setting the scene

We would like to scrape each actor's biography (for example [Robert De Niro](https://www.imdb.com/name/nm0000134/bio?ref_=nm_ql_1)'s), by iterating through the [list indicated here](https://www.imdb.com/list/ls058011111/).

#### a) Look at the source code (HTML) for the list. You can do that by right-clicking on the page and clicking "show source". Can you see any patterns in the HTML which you could use to infer the URL for Robert De Niro's biography? (Hint: what is special about https://www.imdb.com/name/nm0000134/bio?ref_=nm_ql_1 compared to the biography URL for another actor? Can you find this in the list's source code?)

    <Put your answer here>

This is the general gist of web scraping: finding patterns you can exploit. We can download the source code (HTML) of a web page using Python's `requests` library. Get the source code by executing the following code.

In [None]:
r = requests.get(TOP_PAGE)
r.raise_for_status()

You can then inspect the HTML directly, but it is hard to read because Python treats the HTML as plain text.

In [None]:
r.text

A better idea is to use `BeautifulSoup`, which makes it very easy to work with HTML (and makes it look pretty).

In [None]:
soup = BeautifulSoup(r.text, "lxml")
soup

#### b i) The following example shows how you can find every image on the page, as each image in HTML is denoted by the tag "`img`". Feel free to execute the code. Now change the code, so that instead of scraping `img` tags, you scrape the tags associated with URLs. (Hint: if you right-click and inspect any URL, your browser will indicate what tag you need).

In [None]:
for element in soup.find_all("img"):
    print(element)
    print()

#### b ii) The following code sample shows how to inspect the properties of the `img` tag, by extracting the `width` property of each image. Feel free to execute the code. Now modify the code so that you extract the URL property from the tag you identified above, and store the values in a list called `urls`.

In [None]:
widths = []
for element in soup.find_all("img"):
    if 'width' not in element.attrs:
        continue
    widths.append(element['width'])
print(widths)

# urls = []
# for element in soup.find_all(???):
#     if ??? not in element.attrs:
#         continue
#     urls.append(element[???])
# print(urls)
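
To see what "a tag with properties" means without touching the network, here is a small sketch that uses Python's built-in `html.parser` module instead of `BeautifulSoup` (a swapped-in technique, purely for illustration). The HTML snippet, the `span` tag, the `data-id` property and the `PropertyCollector` class are all made up for this example and are not taken from the IMDb page.

```python
from html.parser import HTMLParser

# Made-up HTML for illustration only; not taken from IMDb.
SAMPLE_HTML = """
<div>
  <span data-id="a1">first</span>
  <span>no property here</span>
  <span data-id="b2">third</span>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collects the value of one property from every matching tag."""

    def __init__(self, tag_name, property_name):
        super().__init__()
        self.tag_name = tag_name
        self.property_name = property_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag != self.tag_name:
            return
        attrs = dict(attrs)
        if self.property_name not in attrs:
            return  # skip tags that lack the property, as in the loop above
        self.values.append(attrs[self.property_name])


collector = PropertyCollector("span", "data-id")
collector.feed(SAMPLE_HTML)
print(collector.values)
```

The second `span` is skipped because it has no `data-id` property, which mirrors the `if 'width' not in element.attrs` check in the cell above.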

#### b iii) Now that we have the URLs, we actually have too many! Filter the URLs by filling in the following code.

#### b iv) Let's extract the codes directly from the URLs. First, uncomment all *lines* which start with a '#'. Next, extract the codes from the URLs using the slice notation indicated below (for example, `url[3:10]` gives the 4th to the 10th characters of `url`).

In [None]:
#codes = []
for url in urls:
    if not url.startswith(???):  #What do the URLs that we need start with?
        continue
    if not url.endswith(???):  # What do the URLs that we need end with?
        continue
    #code = url[???:???]  # What numeric range ([start:end]) is required to extract the actor's code?
    #codes.append(code)
#print(codes)
#print("Found", len(codes), "codes.")
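
If slicing is new to you, the following sketch shows the same filter-then-slice pattern on made-up paths. The paths, `prefix` and `suffix` are hypothetical and deliberately different from the IMDb URLs, so you still need to work out the right values for the exercise.

```python
# Made-up paths for illustration; the real IMDb URLs look different.
paths = [
    "/product/abc123/details",
    "/about",
    "/product/xyz789/details",
    "/product/old",
]

prefix = "/product/"
suffix = "/details"

ids = []
for path in paths:
    if not path.startswith(prefix):
        continue
    if not path.endswith(suffix):
        continue
    # Slice off the prefix and the suffix; what is left is the identifier
    ids.append(path[len(prefix):len(path) - len(suffix)])

print(ids)
```

Computing the slice bounds from `len(prefix)` and `len(suffix)` avoids counting characters by hand.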

#### b v) Let's wrap this up in a function, because we are going to need to use it again later. Check that the function works as expected.

In [None]:
@retry(stop_max_attempt_number=7, wait_fixed=2000)  # Ask a helper why this is needed!
def get_codes_from_top_page(tag_name, property_name, url_start, url_end):
    r = requests.get(TOP_PAGE)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
    
    # Now extract codes from the soup
    codes = []
    for element in soup.find_all(tag_name):
        if property_name not in element.attrs:
            continue
        url = element[property_name]
        if not url.startswith(url_start):  #What do the URLs that we need start with?
            continue
        if not url.endswith(url_end):  # What do the URLs that we need end with?
            continue
        code = url[len(url_start):len(url) - len(url_end)]  # Also works when url_end is empty
        codes.append(code)    
    return codes

# Check the function works
#codes = get_codes_from_top_page(tag_name=???, property_name=???, url_start=???, url_end=???)
print(codes)
print("Found", len(codes), "codes.")
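
On the `@retry` decorator: web requests can fail intermittently (for example, because of a timeout or a temporary server error), so it is useful to try again a few times before giving up. With `stop_max_attempt_number=7` and `wait_fixed=2000`, the function is attempted up to 7 times with a 2000 millisecond pause between attempts. The sketch below imitates that behaviour with the standard library only; `call_with_retries` and `flaky_download` are hypothetical names invented for this illustration and are not part of `retrying`.

```python
import time


def call_with_retries(func, max_attempts=7, wait_seconds=2.0):
    """Call func until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the last error propagate
            time.sleep(wait_seconds)


# A stand-in for a request that fails twice before succeeding
attempts = []


def flaky_download():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


# wait_seconds=0 keeps the demonstration fast
result = call_with_retries(flaky_download, max_attempts=7, wait_seconds=0)
print(result, "after", len(attempts), "attempts")
```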

## Scrolling through pages

Great, so now we should have 100 codes... but wait - doesn't the list say that it has the 'Top 1000 Actors and Actresses'? [Looking back at the list](https://www.imdb.com/list/ls058011111/), click on the 'NEXT' link at the bottom of the page. We can see that the list has been updated to include the next 100 actors.

The URL for the page has also changed slightly. The main URL (before '?') remains the same, but parameters (everything after '?') have been added. The only one we care about is the 'page' parameter, which can be passed to `requests` like so:

    r = requests.get(TOP_PAGE, params={"page": 2})
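
To see what the `page` parameter does to the URL, the sketch below builds the query string by hand with the standard library's `urlencode`. This is only an illustration of roughly what `requests` does for you when you pass `params`; in the exercise you should keep using `params`.

```python
from urllib.parse import urlencode

TOP_PAGE = "https://www.imdb.com/list/ls058011111/"

for page in (1, 2, 3):
    # urlencode turns a dict of parameters into the text after the '?'
    full_url = TOP_PAGE + "?" + urlencode({"page": page})
    print(full_url)
```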

#### c) First, change the function `get_codes_from_top_page` to include the page number in the request. Then extract the first 3 pages by executing the following code:

In [None]:
n_pages = 3
codes = []
for page in range(1, n_pages+1):
    codes += get_codes_from_top_page(tag_name=???, property_name=???, url_start=???, url_end=???, page=page)
    
# Remove duplicates
codes = set(codes)
print(codes)
print("Found", len(codes), "codes.")
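
Note that a `set` removes duplicates but does not keep the codes in their original order. If you would like to preserve the ranking from the list, one alternative is `dict.fromkeys`, sketched below. The codes in `example_codes` are placeholders made up for this illustration.

```python
# Placeholder codes for illustration, with one duplicate
example_codes = ["nm0000003", "nm0000001", "nm0000003", "nm0000002"]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_codes = list(dict.fromkeys(example_codes))
print(unique_codes)
```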

## Extracting biographies

Now we're able to extract all 1000 codes. The next step (and the point of this walkthrough!) is to extract the biographies for each actor. At the top of this notebook, you'll see that a variable `BIO_PAGE` has been defined as:

    BIO_PAGE = "https://imdb.com/name/{}/bio?ref_=nmls_hd"
    
The '{}' is Python syntax for string formatting. Play with the following code until you understand how '{}' works:

In [None]:
name = "joel"
other_name = "juan"

text = "My name is {}"
other_text = "Our names are {} and {}"

print(text)
print(text.format(name))
print(other_text.format(name, other_name))
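
A few other forms of string formatting are worth knowing, although you will not need them for this exercise. The sketch below shows named placeholders, numbered placeholders and f-strings; the `template` URL is made up for illustration.

```python
template = "https://example.com/{section}/{item_id}"  # made-up URL pattern

# Named placeholders are filled by keyword
print(template.format(section="books", item_id=42))

# Positional placeholders can be numbered and reused
print("{0} and {1}, then {0} again".format("joel", "juan"))

# f-strings put the variable name directly inside the braces
name = "joel"
print(f"My name is {name}")
```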

#### d i) Iterate through `codes` and generate all biography urls. Copy and paste a few into your browser to confirm that they are correct.

In [None]:
for code in codes:
    bio_url = ???
    print(bio_url)

#### d ii) Take Robert De Niro's biography url (https://www.imdb.com/name/nm0000134/bio?ref_=nm_ql_1), and inspect the page in your browser. Find the 'div' which contains the biography text, and input the class names below:

In [None]:
bio_url = "https://www.imdb.com/name/nm0000134/bio?ref_=nm_ql_1"
r = requests.get(bio_url)
r.raise_for_status()
soup = BeautifulSoup(r.text, "lxml")

element = soup.find("div", class_=[???,???])  # Enter the two class names here. Ask a helper why "class" ends with "_"!
bio_text = element.text
print(bio_text)
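
On the trailing underscore in `class_`: `class` is a reserved word in Python (it is used to define classes), so it cannot be used as an argument name. `BeautifulSoup` therefore accepts `class_` instead. The short check below confirms this with the standard library's `keyword` module; the expression `f(class='x')` is just a made-up call used to trigger the error.

```python
import keyword

print(keyword.iskeyword("class"))   # is "class" reserved?
print(keyword.iskeyword("class_"))  # is "class_" reserved?

try:
    # Compiling (not running) a call that uses "class" as an argument name
    compile("f(class='x')", "<example>", "eval")
except SyntaxError:
    print("'class' cannot be used as an argument name")
```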

Great, so you've collected the data, but the biography isn't very useful unless we have the actor's name. 

#### d iii) Extract Robert De Niro's name from the soup

In [None]:
element = soup.find(???, property=???)
actor_name = element[???]

## Put it all together

Let's put all of section d) into a single function.

#### e i) Finish writing the code for the following function. Check that it works!

In [None]:
def extract_bio_and_name(imdb_code):
    bio_url = BIO_PAGE.format(???)
    r = requests.get(bio_url)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")

    # Get the bio text
    element = soup.find("div", class_=[???,???])  # Enter the two class names here. Ask a helper why "class" ends with "_"!
    bio_text = element.text
    
    # Get the actor's name
    element = soup.find(???, property=???)
    actor_name = element[???]

    return actor_name, bio_text
    
actor_name, bio_text = extract_bio_and_name("nm0000134")
print("--->", actor_name)
print("===>", bio_text)

#### e ii) Now iterate through all of your codes and generate the output data.

In [None]:
data = {}
for code in codes:
    actor_name, bio_text = extract_bio_and_name(???)
    data[actor_name] = bio_text
    
print("Got", len(data), "biographies")

#### e iii) Finally, save the data:

In [None]:
import os

# Save the collected biographies as JSON, creating the output directory if needed.
os.makedirs("data", exist_ok=True)
with open("data/out-bios.json", "w") as f:
    json.dump(data, f)
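
It is worth checking that the file can be read back. The sketch below does a save-and-load round trip in a temporary directory, so it does not touch your real output; `example_data` is placeholder data standing in for the real `{actor_name: bio_text}` dictionary.

```python
import json
import os
import tempfile

# Placeholder data standing in for the real {actor_name: bio_text} dictionary
example_data = {
    "Actor One": "First placeholder biography.",
    "Actor Two": "Second placeholder biography.",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "out-bios.json")
    with open(path, "w") as f:
        json.dump(example_data, f)
    with open(path) as f:
        loaded = json.load(f)

# True if nothing was lost or changed in the round trip
print(loaded == example_data)
```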

## Extras

Well done, you've finished your first web scraping exercise. I hope you can see that you could apply this approach to many other data sources that are of interest to you. There are still a couple of extra problems that you could solve, however:

#### f i) Most of the biographies contain some text similar to _'- IMDb Mini Biography By:  Pedro Borges'_. Add some code into your function to remove this.

#### f ii) This is only 1000 actors, but there are many more on IMDb. Think of a strategy for collecting more! (Hint: there are many ways to do this; one way might be to look for links on the biography pages.)