# Part 1: Scraping web data

In this analysis we will scrape actor biographies from IMDB, which can be used to build a recommendation engine in part 2. This notebook will guide you through the process. See [here](https://github.com/nestauk/taller_centro_cultura_digital/blob/master/data_collection.ipynb) for one possible solution.

In [2]:
from bs4 import BeautifulSoup
import re
import requests
from retrying import retry
import json

# The first page we're scraping
TOP_PAGE = "https://www.imdb.com/list/ls058011111/"

# The second page we're scraping
BIO_PAGE = "https://imdb.com/name/{}/bio?ref_=nmls_hd"

## Setting the scene

We would like to scrape each actor's biography (for example [Robert De Niro](https://www.imdb.com/name/nm0000134/bio?ref_=nm_ql_1)'s), by iterating through the [list indicated here](https://www.imdb.com/list/ls058011111/).

#### a) Look at the source code (HTML) for the list. You can do that by right-clicking on the page and clicking "show source". Can you see any patterns in the HTML which you could use to infer the URL for Robert De Niro's biography? (Hint: what is special about https://www.imdb.com/name/nm0000134/bio?ref_=nm_ql_1 compared to the biography URL for another actor? Can you find this in the list's source code?)

    <Put your answer here>

This is the general gist of web scraping: finding patterns you can exploit. We can download the source code (HTML) of a web page using the Python `requests` library. Get the source code by executing the following code.

In [5]:
r = requests.get(TOP_PAGE)
r.raise_for_status()

You can then inspect the HTML directly, but it looks horrible because python treats the HTML as plain text.

In [7]:
r.text

'\n\n\n\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///list/ls058011111?src=mdot">\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'function\') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>\n        <title>Top 1000 Actors and Actresses - IMDb</title>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<script>\n    if (typeof uet == \'function\') {\n      uet("be", "LoadTitle", {wb: 1});\n    }\n</script>\n<script

A better idea is to use `BeautifulSoup`, which makes it very easy to work with HTML (and makes it look pretty).

In [10]:
soup = BeautifulSoup(r.text, "lxml")
soup

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///list/ls058011111?src=mdot" name="apple-itunes-app"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Top 1000 Actors and Actresses - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https:

#### b i) The following example shows how you can find every image on the page, as each image in HTML is denoted by the tag "`img`". Feel free to execute the code. Now change the code, so that instead of scraping `img` tags, you scrape the tags associated with URLs. (Hint: if you right-click and inspect any URL, your browser will indicate what tag you need).

In [12]:
for element in soup.find_all("img"):
    print(element)
    print()

<img alt="IMDbPro Menu" src="https://m.media-amazon.com/images/G/01/wprs/images/navbar/imdbpro_logo_nb._CB484021162_.png"/>

<img alt="Go to IMDbPro" height="145" src="https://m.media-amazon.com/images/G/01/wprs/images/navbar/imdbpro_navbar_menu_user._CB484021156_.png" srcset="https://m.media-amazon.com/images/G/01/wprs/images/navbar/imdbpro_navbar_menu_user._CB484021156_.png 1x, https://m.media-amazon.com/images/G/01/wprs/images/navbar/imdbpro_navbar_menu_user_2x._CB484021157_.png 2x" width="127"/>

<img alt="Robert De Niro" height="209" src="https://m.media-amazon.com/images/M/MV5BMjAwNDU3MzcyOV5BMl5BanBnXkFtZTcwMjc0MTIxMw@@._V1_UY209_CR9,0,140,209_AL_.jpg" width="140"/>

<img alt="Jack Nicholson" height="209" src="https://m.media-amazon.com/images/M/MV5BMTQ3OTY0ODk0M15BMl5BanBnXkFtZTYwNzE4Njc4._V1_UY209_CR5,0,140,209_AL_.jpg" width="140"/>

<img alt="Tom Hanks" height="209" src="https://m.media-amazon.com/images/M/MV5BMTQ2MjMwNDA3Nl5BMl5BanBnXkFtZTcwMTA2NDY3NQ@@._V1_UY209_CR2,0,140,

#### b ii) In the following code sample, we inspect the properties of the `img` tag. In the example below, we extract the `width` property of each image. Feel free to execute the code. Now modify the code so that you extract the URL property from the tag you identified above, and store the results in a list called `urls`.

In [17]:
widths = []
for element in soup.find_all("img"):
    if 'width' not in element.attrs:
        continue
    widths.append(element['width'])
print(widths)

urls = []
for element in soup.find_all("a"):
    if 'href' not in element.attrs:
        continue
    urls.append(element['href'])
print(urls)

['127', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '140', '86', '86', '86', '86', '86']
['/?ref_=nv_home', '/?ref_=nv_home', '/search/', '/movies-in-theaters/?ref_=nv_tp_inth_1', '/chart/toptv/?ref_=nv_tp_tv250_2', '/showtimes/?ref_=nv_tp_sh_3', '/title/tt1049413/?ref_=nv_mv_dflt_1', '/title/tt1049413/?ref_=nv_mv_dflt_2', '/chart/top?ref_=nv_mv_dfl
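If you want to see what "tags" and "properties" mean without downloading anything, here is a small offline sketch. Note that it uses Python's built-in `html.parser` rather than `BeautifulSoup`, and that the HTML snippet, the class `HrefCollector` and the variable `toy_urls` are all made up for this example.

```python
from html.parser import HTMLParser

# Toy HTML invented for this example; it is not copied from IMDB.
TOY_HTML = """
<html><body>
  <a href="/name/nm0000001/">An actor</a>
  <img src="photo.jpg" width="140">
  <a>An anchor without an href</a>
  <a href="/search/">Search</a>
</body></html>
"""


class HrefCollector(HTMLParser):
    """Collect the 'href' property of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # Not a link, so skip it (this skips the <img> tag)
        attr_dict = dict(attrs)
        if "href" in attr_dict:
            self.hrefs.append(attr_dict["href"])


collector = HrefCollector()
collector.feed(TOY_HTML)
toy_urls = collector.hrefs
print(toy_urls)
```

The anchor without an `href` is skipped, which is the same check as `if 'href' not in element.attrs` in the cell above.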

#### b iii) Now that we have the URLs, we actually have too many! Filter the URLs by filling in the `???` placeholders in the code cell below.

#### b iv) Let's extract the codes directly from the URLs. First, uncomment all *lines* which start with a '#'. Next, extract the codes from the URLs using the slice notation indicated below (for example, `url[3:10]` gives the 4th to the 10th characters, inclusive).

In [33]:
#codes = []
for url in urls:
    if not url.startswith(???):  #What do the URLs that we need start with?
        continue
    if not url.endswith(???):  # What do the URLs that we need end with?
        continue
    #code = url[???:???]  # What numeric range ([start:end]) is required to extract the actor's code?
    #codes.append(code)
#print(codes)
#print("Found", len(codes), "codes.")

['nm0000134', 'nm0000197', 'nm0000158', 'nm0000008', 'nm0000138', 'nm0000007', 'nm0000136', 'nm0000199', 'nm0000243', 'nm0000059', 'nm0000093', 'nm0000358', 'nm0000129', 'nm0000026', 'nm0000163', 'nm0000022', 'nm0000576', 'nm0000288', 'nm0000060', 'nm0001627', 'nm0000148', 'nm0000075', 'nm0000123', 'nm0000032', 'nm0000151', 'nm0000031', 'nm0000658', 'nm0000006', 'nm0000054', 'nm2225369', 'nm0000701', 'nm0000072', 'nm0000949', 'nm0000030', 'nm0000545', 'nm0000012', 'nm0205626', 'nm0000173', 'nm0000113', 'nm0000204', 'nm0000149', 'nm0001132', 'nm0010736', 'nm0000210', 'nm0000473', 'nm0000038', 'nm0000511', 'nm0000702', 'nm0000234', 'nm0000023', 'nm0000078', 'nm0000056', 'nm0000164', 'nm0000354', 'nm0000128', 'nm0000380', 'nm0000015', 'nm0000018', 'nm0000020', 'nm0000245', 'nm0000080', 'nm0910607', 'nm0005132', 'nm0000125', 'nm0000228', 'nm0000432', 'nm0000553', 'nm0001570', 'nm0000246', 'nm0000011', 'nm0000450', 'nm0000602', 'nm0000146', 'nm0000226', 'nm0000537', 'nm0000046', 'nm0001401'
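If the slice notation from b iv is new to you, it may help to try it on a string where the positions are easy to count. The string `alphabet` and the variable names below are made up for this sketch.

```python
# A made-up string, chosen so that positions are easy to count
alphabet = "abcdefghijklmnop"

# Indices start at 0 and the end index is excluded,
# so [3:10] gives the 4th to the 10th characters
middle = alphabet[3:10]

# Leaving out the start (or the end) means "from the beginning" (or "to the end")
first_three = alphabet[:3]
last_three = alphabet[-3:]

# len() is handy when you want to skip a known prefix
prefix = "abc"
after_prefix = alphabet[len(prefix):]

print(middle)
print(first_three)
print(last_three)
print(after_prefix)
```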

#### b v) Let's wrap this up in a function, because we are going to need to use it again later. Check that the function works as expected.

In [38]:
@retry(stop_max_attempt_number=7, wait_fixed=2000)  # Ask a helper why this is needed!
def get_codes_from_top_page(tag_name, property_name, url_start, url_end):
    r = requests.get(TOP_PAGE)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
    
    # Now extract codes from the soup
    codes = []
    for element in soup.find_all(tag_name):
        if property_name not in element.attrs:
            continue
        url = element[property_name]
        if not url.startswith(url_start):  #What do the URLs that we need start with?
            continue
        if not url.endswith(url_end):  # What do the URLs that we need end with?
            continue
        # Use an explicit end index: url[a:-0] would be empty if url_end were ""
        code = url[len(url_start):len(url) - len(url_end)]
        codes.append(code)
    return codes

# Check the function works
#codes = get_codes_from_top_page(tag_name=???, property_name=???, url_start=???, url_end=???)
print(codes)
print("Found", len(codes), "codes.")

['nm0000134', 'nm0000197', 'nm0000158', 'nm0000008', 'nm0000138', 'nm0000007', 'nm0000136', 'nm0000199', 'nm0000243', 'nm0000059', 'nm0000093', 'nm0000358', 'nm0000129', 'nm0000026', 'nm0000163', 'nm0000022', 'nm0000576', 'nm0000288', 'nm0000060', 'nm0001627', 'nm0000148', 'nm0000075', 'nm0000123', 'nm0000032', 'nm0000151', 'nm0000031', 'nm0000658', 'nm0000006', 'nm0000054', 'nm2225369', 'nm0000701', 'nm0000072', 'nm0000949', 'nm0000030', 'nm0000545', 'nm0000012', 'nm0205626', 'nm0000173', 'nm0000113', 'nm0000204', 'nm0000149', 'nm0001132', 'nm0010736', 'nm0000210', 'nm0000473', 'nm0000038', 'nm0000511', 'nm0000702', 'nm0000234', 'nm0000023', 'nm0000078', 'nm0000056', 'nm0000164', 'nm0000354', 'nm0000128', 'nm0000380', 'nm0000015', 'nm0000018', 'nm0000020', 'nm0000245', 'nm0000080', 'nm0910607', 'nm0005132', 'nm0000125', 'nm0000228', 'nm0000432', 'nm0000553', 'nm0001570', 'nm0000246', 'nm0000011', 'nm0000450', 'nm0000602', 'nm0000146', 'nm0000226', 'nm0000537', 'nm0000046', 'nm0001401'
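The start/end extraction inside `get_codes_from_top_page` can be tried offline on made-up URLs. One thing to watch for with this kind of slicing: an end index of `-len(url_end)` evaluates to `0` when `url_end` is empty, which silently gives an empty string. The helper `extract_code` and the example URLs below are hypothetical and only illustrate the idea.

```python
def extract_code(url, url_start, url_end):
    """Return the text between url_start and url_end, or None if url does not match."""
    if not (url.startswith(url_start) and url.endswith(url_end)):
        return None
    # Compute the end index explicitly, so that an empty url_end also works
    end = len(url) - len(url_end)
    return url[len(url_start):end]


# Made-up URLs, not taken from IMDB
with_suffix = extract_code("/item/abc123?x=1", "/item/", "?x=1")
without_suffix = extract_code("/item/abc123", "/item/", "")
no_match = extract_code("/other/abc123", "/item/", "")

# The pitfall: a negative end index built from an empty string is just 0
pitfall = "/item/abc123"[len("/item/"):-len("")]

print(with_suffix, without_suffix, no_match, repr(pitfall))
```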

## Scrolling through pages

Great, so now we should have 100 codes... but wait - doesn't the list say that it has the 'Top 1000 Actors and Actresses'? [Looking back at the list](https://www.imdb.com/list/ls058011111/), click on the 'NEXT' link at the bottom of the page. We can see that the list has been updated to include the next 100 actors.

The URL for the page has also changed slightly. The main URL (everything before the '?') remains the same, but parameters (everything after the '?') have been added. The only one we care about is the 'page' parameter, which can be passed to `requests` like so:

    r = requests.get(TOP_PAGE, params={"page": 2})
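To see what the final URL looks like once the parameter is attached, here is a sketch using only the standard library. `requests` assembles the query string for you, so you do not need this in practice; the helper `page_url` is hypothetical and only exists to show the result.

```python
from urllib.parse import urlencode

TOP_PAGE = "https://www.imdb.com/list/ls058011111/"


def page_url(page):
    """Build the URL for one page of the list by appending a query string."""
    return TOP_PAGE + "?" + urlencode({"page": page})


page_urls = [page_url(page) for page in range(1, 4)]
for url in page_urls:
    print(url)
```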

#### c) First, change the function `get_codes_from_top_page` to include the page number in the request. Then extract the first 3 pages by executing the following code:

In [41]:
n_pages = 3
codes = []
for page in range(1, n_pages+1):
    codes += get_codes_from_top_page(tag_name=???, property_name=???, url_start=???, url_end=???, page=page)
    
# Remove duplicates
codes = set(codes)
print(codes)
print("Found", len(codes), "codes.")

{'nm0000146', 'nm0000006', 'nm0000060', 'nm0000012', 'nm0001546', 'nm0841797', 'nm0000122', 'nm0001774', 'nm0000138', 'nm0000071', 'nm0447695', 'nm0205626', 'nm1659547', 'nm0000147', 'nm0000380', 'nm0000531', 'nm0001473', 'nm0000059', 'nm0000379', 'nm0001132', 'nm0000685', 'nm0000002', 'nm0000102', 'nm0000151', 'nm0000079', 'nm0001256', 'nm0000028', 'nm1165110', 'nm0461136', 'nm0000658', 'nm0000541', 'nm0000651', 'nm0000072', 'nm0000443', 'nm0000148', 'nm0000489', 'nm0000209', 'nm0000511', 'nm0000201', 'nm0001626', 'nm0177896', 'nm0001401', 'nm0000134', 'nm0000020', 'nm0001159', 'nm0000051', 'nm0000166', 'nm0000046', 'nm0000038', 'nm0001426', 'nm0000949', 'nm0001715', 'nm0000113', 'nm0000313', 'nm0000178', 'nm0000661', 'nm0707023', 'nm0000323', 'nm0182839', 'nm0001224', 'nm0000195', 'nm0005476', 'nm0000564', 'nm0000188', 'nm0000602', 'nm0001479', 'nm0000456', 'nm0000377', 'nm0000226', 'nm0000106', 'nm0790454', 'nm0000197', 'nm0001876', 'nm0000300', 'nm0000140', 'nm0000058', 'nm0000182'

## Extracting biographies

Now we're able to extract all 1000 codes. The next step (and the point of this walkthrough!) is to extract the biographies for each actor. At the top of this notebook, you'll see that a variable `BIO_PAGE` has been defined as:

    BIO_PAGE = "https://imdb.com/name/{}/bio?ref_=nmls_hd"
    
The '{}' is python syntax for string formatting. Play with the following code until you understand how '{}' works:

In [43]:
name = "joel"
other_name = "juan"

text = "My name is {}"
other_text = "Our names are {} and {}"

print(text)
print(text.format(name))
print(other_text.format(name, other_name))

My name is {}
My name is joel
Our names are joel and juan
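The same trick applies to `BIO_PAGE`: the '{}' is replaced by whatever actor code you pass to `format`. A minimal sketch, using two codes that appear in the output earlier in this notebook (the variable names `example_codes` and `bio_urls` are made up here):

```python
BIO_PAGE = "https://imdb.com/name/{}/bio?ref_=nmls_hd"

# Two codes taken from the list output shown earlier
example_codes = ["nm0000134", "nm0000197"]

# Fill the '{}' placeholder with each code in turn
bio_urls = [BIO_PAGE.format(code) for code in example_codes]
for url in bio_urls:
    print(url)
```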


#### d i) 

In [2]:
@retry(stop_max_attempt_number=7, wait_fixed=2000)
def persistent_request_to_soup(*args, **kwargs):
    """Make a request and convert to a HTML soup. Requests sometimes get blocked,
    so wait 2 seconds between failed requests. All args and kwargs are passed
    directly to `requests.get`"""
    r = requests.get(*args, **kwargs)
    r.raise_for_status()
    return BeautifulSoup(r.text, "lxml")


def imdb_code_iter():
    """Yield IMDB actor codes one at a time, moving through the list's
    pages until a page yields no more actor codes."""
    done = False
    ipage = 1
    while not done:
        # Innocent until proven guilty                                  
        done = True
        for code in _code_iter(ipage):
            done = False
            yield code
        # Increment page number                                         
        ipage += 1


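# NOTE: PATTERN and REGEX are used below but are never defined in this notebook.
# The two definitions here are assumptions, not the original author's values.
# They may match several links per actor, so a code can be yielded more than once.
PATTERN = re.compile(r"^/name/nm\d+")  # hrefs that point at an actor page
REGEX = re.compile(r"nm\d+")           # the actor code itself, e.g. 'nm0000134'


def _code_iter(ipage):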
    """Yield an iterator to the next IMDB actor code on this page, if any.
    
    Args:
        ipage (int): The page number.
    Yields:
        code (str) an IMDB actor code.
    """
    # Get the HTML soup for this page
    soup = persistent_request_to_soup(TOP_PAGE, params=dict(page=ipage))
    # Find all IMDB actor codes on this page
    for anchor in soup.find_all('a', href=PATTERN):
        href = anchor['href']
        code = REGEX.findall(href)[0]
        yield code


def fetch_bio(imdb_code):
    """Find the biography associated with the actor with IMDB code `imdb_code`.
    
    Args:
        imdb_code (str): The IMDB code of the actor.
    Returns:
        name, biography (str, str): Name and biography of the actor.
    """
    soup = persistent_request_to_soup(BIO_PAGE.format(imdb_code))
    # Fetch the actor name                                       
    name_meta = soup.find("meta", property="og:title")
    name = name_meta["content"]
    # Fetch the bio                                              
    bio = soup.find("div", class_=["soda", "odd"])
    clean_text = bio.text.strip()
    paragraphs = clean_text.split("\n")
    bio_text = paragraphs[0]
    return name, bio_text
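Wondering what `@retry(stop_max_attempt_number=7, wait_fixed=2000)` is doing? Roughly speaking, it calls the function again if it raises an error, waiting between attempts, and gives up after a maximum number of tries. The sketch below shows that idea with the standard library only. It is not how the `retrying` package is implemented, and `simple_retry` and `flaky` are made-up names.

```python
import functools
import time


def simple_retry(max_attempts, wait_seconds):
    """Retry the decorated function up to max_attempts times, sleeping in between."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # Out of attempts, so let the error through
                    time.sleep(wait_seconds)
        return wrapper
    return decorator


attempts = []


@simple_retry(max_attempts=7, wait_seconds=0)  # No waiting, to keep the demo fast
def flaky():
    """Fail on the first two calls, then succeed (simulates a blocked request)."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("simulated failure")
    return "ok"


result = flaky()
print(result, "after", len(attempts), "attempts")
```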

In [5]:
import os

# Fetch each biography for every code.
code_iter = imdb_code_iter()
bios = {name: text for name, text in map(fetch_bio, code_iter)}
os.makedirs("data", exist_ok=True)  # Make sure the output folder exists
with open("data/out-bios.json", "w") as f:
    json.dump(bios, f)
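The "innocent until proven guilty" loop in `imdb_code_iter` can be hard to follow at first, so here is the same pattern run on made-up data, with no web requests involved. `FAKE_PAGES`, `fake_code_iter` and `all_codes` are hypothetical names used only for this sketch.

```python
# Made-up data: two pages of "codes"; page 3 does not exist
FAKE_PAGES = {1: ["a", "b"], 2: ["c"]}


def fake_code_iter(ipage):
    """Yield the codes on one page; yields nothing for a missing page."""
    yield from FAKE_PAGES.get(ipage, [])


def all_codes():
    """Yield codes page by page, stopping at the first page that yields nothing."""
    done = False
    ipage = 1
    while not done:
        # Assume this page is empty until it yields at least one code
        done = True
        for code in fake_code_iter(ipage):
            done = False
            yield code
        ipage += 1


collected = list(all_codes())
print(collected)
```

The loop stops after asking for page 3, because that page yields nothing and `done` stays `True`.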