Web Scraping
============

Web scraping is an ugly name for a reasonable thing: some data is available via your browser and you want to get it into some kind of reasonable shape without literally copying and pasting everything into a spreadsheet somewhere.

You might think that if you work with a company that wants you to do data science, you'll always get nice data from them in the form of a database connection or just a csv file. That is often not the case. Getting database access can be politically challenging in some situations (database admins can be quite protective of their territory). Sometimes it's better and easier in that situation to just scrape the data from the web front end of whatever system you're working on. This is often good enough for a prototype, which can then be leveraged for more formal access to the data.

Language Selection
==================

As we know from various lectures, your browser is just a fancy HTTP client: it makes requests and renders the responses (sometimes making multiple requests in the background, in response to local events, or even as part of its main rendering strategy). Therefore we can use any language to do web scraping as long as it supports issuing HTTP requests and receiving the body of the response back. In fact, there are very capable command line tools (curl, wget) in Unix that you can call from almost any language to serve this purpose.

But once you get the data you don't want to have to grub through it character by character to find what you're after. It's very helpful if your language of choice makes it easy to parse the resulting HTML or JSON into a native data structure so that you can more easily and reliably extract the data. Both R and Python provide packages to this end. Today we're going to use Python, but similar workflows are available in R or Javascript or Java or whatever. It's a common task.

Don't DDOS
==========

DDOS stands for "distributed denial of service" and it's a sort of attack you can launch at a website or other web service. In a DDOS you flood a server with far more requests than it can handle (potentially millions a second), which tends to bring most servers down. At the very least, it tends to prevent legitimate traffic from accessing the service: real users end up having to wait in line behind all the garbage traffic.

Many websites, therefore, detect unusual access patterns and block IP addresses that make requests too frequently. So even when writing the simplest scraper, it pays to:

1. Make sure you don't issue too many requests too fast
2. Cache the results of requests so that you never request the same page over and over (say during testing)

This is actually pretty easy to set up, so there is no reason not to do it.

BeautifulSoup
=============

We're going to use BeautifulSoup:

```
RUN apt-get update && apt-get install -y emacs openssh-server python3-pip
RUN pip3 install beautifulsoup4 requests pandas numpy pandasql
```

Back in the day people used to write HTML by hand and a lot of it was pretty bad (it failed to follow all the rules). BeautifulSoup will make a decent effort to figure out what was intended, even if the HTML isn't perfect. This is handy.
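To get a feel for that forgiveness, here is a small sketch. The HTML snippet is made up for illustration, and the choice of Python's built-in "html.parser" backend is an assumption (other backends may repair broken markup differently).

```python
from bs4 import BeautifulSoup

# Deliberately sloppy HTML: an unquoted attribute, a <b> that is never
# closed, and a <table> that is never closed either.
messy = "<table><tr><td class=x>Hello <b>world</td></tr>"

soup = BeautifulSoup(messy, "html.parser")

# Despite the mess we can still search the document by tag and class.
cell = soup.find("td", class_="x")
print(cell.get_text())
```

The parser closes the dangling tags for us, so the search still finds the cell and its text.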

Getting Started
===============

Let's begin by learning how to make HTTP requests.

In [1]:
from requests import get as get_raw


get_raw("http://tmbw.net/wiki/Lyrics:Window")

<Response [200]>

There is that "200" HTTP status code, which tells us that the request succeeded. Codes in the 400s and 500s are errors; the ones you'll most likely see are 404 (page not found) and 500 (internal server error). 404 almost always means what it says, and 500 means something broke on the server's side. If you ask for a page you aren't allowed to see you'll usually get a 401 or 403, though some servers coyly answer 404 instead so that they don't even tell you whether the resource exists. It's outside the scope of this lecture to discuss how to scrape "private" pages, but it is possible: you just have to add the right stuff to your request header. This is, I believe, something of a legal grey area, so tread lightly.
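As a rough illustration of how status codes fall into families, here is a sketch that uses only the standard library. The function name `describe_status` is made up for this example; with `requests` you would pass it the `status_code` attribute of a response.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of a standard HTTP status code."""
    # HTTPStatus raises ValueError for codes it does not know about.
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        family = "success"
    elif 300 <= code < 400:
        family = "redirect"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "informational"
    return "{} {} ({})".format(code, phrase, family)

for code in (200, 403, 404, 500):
    print(describe_status(code))
```

If you would rather fail loudly than inspect codes yourself, `requests` responses also have a `raise_for_status()` method that raises an exception on 4xx and 5xx responses.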

Let's get a little more fun information from the page.

In [5]:
get_raw("http://tmbw.net/wiki/Lyrics:Window").content[0:500]

b'<!DOCTYPE html>\n<html lang="en" dir="ltr" class="client-nojs">\n<head>\n<meta charset="UTF-8" />\n<title>Lyrics:Window - TMBW: The They Might Be Giants Knowledge Base</title>\n<meta name="generator" content="MediaWiki 1.25.0" />\n<link rel="alternate" type="application/x-wiki" title="Edit" href="/wiki/index.php?title=Lyrics:Window&amp;action=edit" />\n<link rel="edit" title="Edit" href="/wiki/index.php?title=Lyrics:Window&amp;action=edit" />\n<link rel="apple-touch-icon" href="/apple-touch-icon.png" />'

I've truncated the output we got back, but you can see that the "content" we've received back from the server is a real life HTML document (represented as a bytes object, hence the leading `b`). Our next conceptual step is to parse this and search it for the data we want. But before we do that, let's take care of caching our results so we don't rudely make too many requests against this server.

In [101]:
from hashlib import md5 as hash_raw

def hash(s):
    return hash_raw(s.encode()).hexdigest()

[hash("http://tmbw.net/wiki/Lyrics:Window"),
 hash("http://tmbw.net/wiki/Lyrics:Window"),
 hash("http://tmbw.net/wiki/Lyrics:SnailShell")]


['a8fe29d5e9cb70e54efcc7ae382a2519',
 'a8fe29d5e9cb70e54efcc7ae382a2519',
 'ee7e1fd80caa65cd936de2baf6a9a96c']

We've created a function here which turns our URL into a unique string. We'll save the results of any request to a file named after this string whenever we make a request. But first we'll see if such a file exists and if it does, we'll load the response from disk instead of issuing a request. If we ever want to clear our cache we can just delete the contents of our cache directory.

Caching like this will also tend to speed up complicated scraping jobs (which often revisit the same page repeatedly as they "crawl" a website).

In [100]:
import os

cachedir = "lecture_scrape_cache"
if not os.path.isdir(cachedir):
    os.mkdir(cachedir)

def to_cache_name(url):
    # One cache file per URL, named after the URL's hash.
    return "./{}/{}".format(cachedir, hash(url))

def read_file(s):
    with open(s, 'r') as fd:
        return fd.read()

def write_file(s, content):
    with open(s, 'w') as fd:
        fd.write(content)
    return s

verbose = True

def get(url):
    cache_name = to_cache_name(url)
    if os.path.isfile(cache_name):
        # Already fetched: serve the saved copy from disk.
        if verbose:
            print("Cache hit: {}".format(url))
        return read_file(cache_name)
    else:
        # Not cached yet: fetch over HTTP, then save for next time.
        if verbose:
            print("Cache miss: {}".format(url))
        content = get_raw(url).content.decode()
        write_file(cache_name, content)
        return content



In [102]:
! rm -f lecture_scrape_cache/*

get("http://tmbw.net/wiki/Lyrics:Window")[0:500]

Cache miss: http://tmbw.net/wiki/Lyrics:Window


'<!DOCTYPE html>\n<html lang="en" dir="ltr" class="client-nojs">\n<head>\n<meta charset="UTF-8" />\n<title>Lyrics:Window - TMBW: The They Might Be Giants Knowledge Base</title>\n<meta name="generator" content="MediaWiki 1.25.0" />\n<link rel="alternate" type="application/x-wiki" title="Edit" href="/wiki/index.php?title=Lyrics:Window&amp;action=edit" />\n<link rel="edit" title="Edit" href="/wiki/index.php?title=Lyrics:Window&amp;action=edit" />\n<link rel="apple-touch-icon" href="/apple-touch-icon.png" />'

In [103]:
get("http://tmbw.net/wiki/Lyrics:Window")[0:500]

Cache hit: http://tmbw.net/wiki/Lyrics:Window


'<!DOCTYPE html>\n<html lang="en" dir="ltr" class="client-nojs">\n<head>\n<meta charset="UTF-8" />\n<title>Lyrics:Window - TMBW: The They Might Be Giants Knowledge Base</title>\n<meta name="generator" content="MediaWiki 1.25.0" />\n<link rel="alternate" type="application/x-wiki" title="Edit" href="/wiki/index.php?title=Lyrics:Window&amp;action=edit" />\n<link rel="edit" title="Edit" href="/wiki/index.php?title=Lyrics:Window&amp;action=edit" />\n<link rel="apple-touch-icon" href="/apple-touch-icon.png" />'

I also like to do one more "polite" thing when web scraping, which is to limit the rate of my requests by putting a delay before each access.

In [104]:
from time import sleep

get_delay = 1  # seconds
def get(url):
    cache_name = to_cache_name(url)
    if os.path.isfile(cache_name):
        if verbose:
            print("Cache hit: {}".format(url))
        return read_file(cache_name)
    else:
        if verbose:
            print("Cache miss: {}".format(url))
        # Only real requests are delayed; cache hits return immediately.
        sleep(get_delay)
        content = get_raw(url).content.decode()
        write_file(cache_name, content)
        return content

By modifying the delay we can change how polite we are. One second is probably a bit conservative.

In [105]:
get_delay = 0.1
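The `get` function above sleeps for the full delay before every uncached request, even if the last request was long ago. One possible refinement, sketched here under my own assumptions, is to sleep only for whatever part of the interval has not yet elapsed. The `Throttle` class and its parameters are hypothetical names, not part of any library.

```python
import time

class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # clock and sleep are injectable so the class can be tested
        # without actually waiting.
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                # Too soon since the last call: sleep off the difference.
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now

throttle = Throttle(0.1)
for _ in range(3):
    throttle.wait()  # in a scraper this would sit right before get_raw(url)
```

The first call never sleeps; later calls sleep only if they arrive too soon after the previous one.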

Now we are ready to start scraping in earnest. Let's look at a typical page as an example:
[http://tmbw.net/wiki/Lyrics:Window](http://tmbw.net/wiki/Lyrics:Window). Once we've opened the page we can inspect the HTML elements there and decide what we want to get. In this example I want to extract the song title, the band, the year and the lyrics themselves. The way scraping works is that you try to find some unique properties of the elements you're interested in grabbing. Then you can use BeautifulSoup to search for those elements and extract the results from them.

In this case everything we want is in the "lyricstable" table element, spread over a series of rows. This is actually a slightly tricky scraping job because the individual rows don't tell us much about what they contain. We'll have to rely on the order in which they appear, which seems brittle. We'll try to do a little better than that, but let's begin with the simplest strategy.

In [106]:
from bs4 import BeautifulSoup as bs
def parse_html(html):
    # Name the parser explicitly so results don't depend on what is installed.
    return bs(html, "html.parser")

def get_document(url):
    return parse_html(get(url))

repr(get_document("http://tmbw.net/wiki/Lyrics:Window"))[0:100]

Cache hit: http://tmbw.net/wiki/Lyrics:Window


'<!DOCTYPE html>\n<html class="client-nojs" dir="ltr" lang="en">\n<head>\n<meta charset="utf-8"/>\n<title'

In [37]:
def get_song_table(url):
    doc = get_document(url)
    return doc.find("table", class_="lyricstable")

In [39]:
test_url = "http://tmbw.net/wiki/Lyrics:Window"
get_song_table(test_url)

Cache hit: http://tmbw.net/wiki/Lyrics:Window


<table align="center" cellspacing="3" class="lyricstable">
<tr>
<td style="text-align:right;"><b>Window</b>
</td></tr>
<tr>
<td style="text-align:right;"><b>By: <a href="/wiki/They_Might_Be_Giants" title="They Might Be Giants">They Might Be Giants</a></b>
</td></tr>
<tr>
<td style="text-align:right;"><b>Year: <a href="/wiki/Category:Songs_Released_In_1994" title="Category:Songs Released In 1994">1994</a></b>
</td></tr>
<tr>
<td><div class="lyrics-table" style="white-space: pre-wrap;">
<p>Look at all the people in the window
I'm checking out the people in the window
I was uncomfortable
Now I'm uncomfortable
The trouble I encountered when I thought it was, it was a window
</p><p>It was a catalog
Of many women, men
The window
Window
Window
</p><p>Look at all the people in the window
I'm checking out the people in the window
I was uncomfortable
Now I'm uncomfortable
The trouble I encountered when I thought it was, it was a window
</p>
</div>
</td></tr></table>

This looks like the data we want. First the easy thing: the lyrics. These are at least in a unique div with class "lyrics-table". The other elements are in "td" tags and I think the best bet we've got is to use their "style" attribute (they are all right aligned).

In [107]:
def get_song_data(url):
    tbl = get_song_table(url)
    out = {}
    # The lyrics live in their own div; split them into lines.
    lyrics = tbl.find("div", class_="lyrics-table").text.strip().split("\n")
    # Title, band and year are the right-aligned cells, in that order.
    tds = tbl.find_all("td", attrs={"style": "text-align:right;"})
    out["title"] = tds[0].text.strip()
    out["by"] = tds[1].text.strip().split(":")[1]
    out["year"] = tds[2].text.strip().split(":")[1]
    out["lyrics"] = lyrics
    return out
get_song_data(test_url)

Cache hit: http://tmbw.net/wiki/Lyrics:Window


{'title': 'Window',
 'by': ' They Might Be Giants',
 'year': ' 1994',
 'lyrics': ['Look at all the people in the window',
  "I'm checking out the people in the window",
  'I was uncomfortable',
  "Now I'm uncomfortable",
  'The trouble I encountered when I thought it was, it was a window',
  'It was a catalog',
  'Of many women, men',
  'The window',
  'Window',
  'Window',
  'Look at all the people in the window',
  "I'm checking out the people in the window",
  'I was uncomfortable',
  "Now I'm uncomfortable",
  'The trouble I encountered when I thought it was, it was a window']}
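The positional lookups in `get_song_data` are the brittle part. One way to lean on them less, sketched here as an assumption rather than as the lecture's method, is to key each cell by its "By:" or "Year:" label. The helper name `label_fields` is hypothetical, and its input is assumed to be the stripped text of each right-aligned cell, e.g. `[td.text.strip() for td in tds]`.

```python
def label_fields(cell_texts):
    """Turn ['Window', 'By: X', 'Year: 1994'] into a dict keyed by label."""
    out = {}
    for text in cell_texts:
        label, sep, value = text.partition(":")
        key = label.strip().lower()
        if sep and key in ("by", "year"):
            # A recognised "Label: value" cell; strip the stray whitespace.
            out[key] = value.strip()
        elif "title" not in out:
            # The first cell without a recognised label is taken as the title.
            out["title"] = text.strip()
    return out

print(label_fields(["Window", "By: They Might Be Giants", "Year: 1994"]))
```

This also removes the leading space that `split(":")[1]` leaves on the band and year, and it keeps working if the rows appear in a different order.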

While I worked on this I had to iterate on these functions over and over. Because my code transparently caches the results, I only had to make a single HTTP request for this page, near the top of the lecture. Let's try another page to see if things work the way we expect.

In [109]:
get_song_data("http://tmbw.net/wiki/Lyrics:A_Self_Called_Nowhere")

Cache hit: http://tmbw.net/wiki/Lyrics:A_Self_Called_Nowhere


{'title': 'A Self Called Nowhere',
 'by': ' They Might Be Giants',
 'year': ' 1994',
 'lyrics': ["I'm sitting on the curb",
  'Of the empty parking lot',
  'Of the store where they let me play the organ',
  "I'm waiting for my ride",
  'But I want to wait inside',
  'The store where they let me play the organ',
  "But I'm thinking of a wooden chair",
  'In the room at the top of the stair',
  "And I'm looking down the stairwell",
  'At the vanishing dot',
  'On the map of the spot',
  'Let me take you there',
  'The dotted line',
  'Surrounding the mind',
  'Of a self called nowhere',
  'It\'s a thing named "it"',
  'In a bottomless pit',
  "You can't see it there",
  'The sunken head',
  'That lies in the bed',
  'Of a self called nowhere',
  'Standing in my yard',
  'Where they tore down the garage',
  'To make room for the torn down garage',
  "I'm looking for my car",
  'But I must have sold my car',
  'When I needed to buy an electric organ',
  "But I'm thinking of a wooden chair"

Recursive Scraping
==================

It's not atypical for us to want to scrape a large number of similar pages. In this example we might be interested in the lyrics and information for all of TMBG's songs ever. We could find these URLs one by one and pass them to our scraper, but that is a lot of work. The thing to do here is to let the structure of the website guide us. Let's see if we can figure out how to use the site itself to get the URLs we need.

If we visit the main page of the site we can see there is a discography page. This looks like a great place to start, since it lists all of their albums of any type. If each album page we reach by this page has the song urls somewhere, then we're in business.

Sure enough, each release page contains a track listing section with links to the lyrics pages. It seems like we can extract a list of lyrics links from each page. These pages might be formatted differently from one another, so let's rely on something that seems pretty solid: a link to a lyrics page is an "a" element whose href contains "wiki/Lyrics:" (the links are relative, so the full URL starts with "http://tmbw.net/wiki/Lyrics:" once we put the domain back on).

But first, let's write a scraper to get the album list from the discography page.

Discography
-----------

In [69]:
def get_album_list(url):
    doc = get_document(url)
    discog_table = doc.find("table", id="discog")
    rows = discog_table.find_all("tr")
    def process_row(row):
        # First cell holds the year, second the linked album title.
        vals = row.find_all("td")
        year = vals[0].text.strip()
        album_url = vals[1].find("a")['href']
        title = vals[1].text.strip()
        return {"year": int(year), "title": title, "url": "http://tmbw.net" + album_url}
    # Skip header rows and anything else without at least two cells.
    return [process_row(row) for row in rows if len(row.find_all("td")) >= 2]

get_album_list("http://tmbw.net/wiki/Discography")[0:4]

Cache hit: http://tmbw.net/wiki/Discography


[{'year': 1985,
  'title': 'Wiggle Diskette',
  'url': 'http://tmbw.net/wiki/Wiggle_Diskette'},
 {'year': 1985,
  'title': '1985 Demo Tape',
  'url': 'http://tmbw.net/wiki/1985_Demo_Tape'},
 {'year': 1986,
  'title': 'They Might Be Giants - Joshua Fried Split Single',
  'url': 'http://tmbw.net/wiki/They_Might_Be_Giants_-_Joshua_Fried_Split_Single'},
 {'year': 1986,
  'title': 'They Might Be Giants',
  'url': 'http://tmbw.net/wiki/They_Might_Be_Giants_(Album)'}]

Note that above we put the domain back on the url: web pages typically use "relative" urls to link within themselves but we need the whole thing.
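String concatenation works here because every link on this site is site-relative. A slightly more general option, shown as a sketch, is `urljoin` from the standard library, which resolves a link against the URL of the page it was found on.

```python
from urllib.parse import urljoin

page_url = "http://tmbw.net/wiki/Discography"

# A site-relative link, as found in an href on the page.
album_url = urljoin(page_url, "/wiki/Apollo_18")
print(album_url)

# An href that is already absolute comes back unchanged.
lyrics_url = urljoin(page_url, "http://tmbw.net/wiki/Lyrics:Window")
print(lyrics_url)
```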

Track Lists
-----------

Now that we have our albums we need a scraper to pull the track list out of each album page. The easiest thing to do here is to find all the links whose href contains the string "wiki/Lyrics:" and put the domain back on the front.

In [110]:
def scrape_lyrics_pages(url):
    doc = get_document(url)
    out = []
    for link in doc.find_all("a"):
        # has_attr replaces the deprecated has_key; some <a> tags have no href.
        if link.has_attr("href") and "wiki/Lyrics:" in link['href']:
            out.append("http://tmbw.net" + link['href'])
    return out

scrape_lyrics_pages("http://tmbw.net/wiki/Apollo_18")
                       

Cache miss: http://tmbw.net/wiki/Apollo_18




['http://tmbw.net/wiki/Lyrics:Dig_My_Grave',
 'http://tmbw.net/wiki/Lyrics:I_Palindrome_I',
 'http://tmbw.net/wiki/Lyrics:She%27s_Actual_Size',
 'http://tmbw.net/wiki/Lyrics:My_Evil_Twin',
 'http://tmbw.net/wiki/Lyrics:Mammal',
 'http://tmbw.net/wiki/Lyrics:The_Statue_Got_Me_High',
 'http://tmbw.net/wiki/Lyrics:Spider',
 'http://tmbw.net/wiki/Lyrics:The_Guitar',
 'http://tmbw.net/wiki/Lyrics:Dinner_Bell',
 'http://tmbw.net/wiki/Lyrics:Narrow_Your_Eyes',
 'http://tmbw.net/wiki/Lyrics:Hall_Of_Heads',
 'http://tmbw.net/wiki/Lyrics:Which_Describes_How_You%27re_Feeling',
 'http://tmbw.net/wiki/Lyrics:See_The_Constellation',
 'http://tmbw.net/wiki/Lyrics:If_I_Wasn%27t_Shy',
 'http://tmbw.net/wiki/Lyrics:Turn_Around',
 'http://tmbw.net/wiki/Lyrics:Hypnotist_Of_Ladies',
 'http://tmbw.net/wiki/Lyrics:Everything_Is_Catching_On_Fire',
 'http://tmbw.net/wiki/Lyrics:Fingertips_(Banjo)',
 'http://tmbw.net/wiki/Lyrics:I_Hear_The_Wind_Blow',
 'http://tmbw.net/wiki/Lyrics:Hey_Now_Everybody',
 'http://t
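Two small clean-up steps can be handy with a list like this; the sketch below uses only the standard library. The `%27` in some URLs is percent-encoding for an apostrophe, which `unquote` reverses. Whether album pages really repeat a lyrics link is an assumption on my part, so the duplicate in the sample list is made up to show order-preserving de-duplication.

```python
from urllib.parse import unquote

links = [
    "http://tmbw.net/wiki/Lyrics:She%27s_Actual_Size",
    "http://tmbw.net/wiki/Lyrics:Mammal",
    "http://tmbw.net/wiki/Lyrics:She%27s_Actual_Size",  # repeated on purpose
]

# dict keys are unique and keep insertion order, so this drops repeats
# without shuffling the track order.
unique_links = list(dict.fromkeys(links))

# Recover a readable title from the part of the URL after "Lyrics:".
titles = [unquote(u.split("Lyrics:", 1)[1]).replace("_", " ") for u in unique_links]
print(titles)
```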

Putting it All Together
=======================

We've now got:

1. Code to scrape all the album URLs
2. Code to scrape the track list from each album
3. Code to scrape lyric data from the tracks

So we just need to write a program that puts it all together. It's worth thinking a little about how to put our data together. One possibility is just a dataframe with columns:

1. year
2. album
3. song title
4. lyrics

We could also encode the results as objects and just dump them as JSON. JSON can be more appropriate for this kind of nested data, and we're already pretty familiar with dataframes, so let's use JSON. A JSON document is built from nested arrays and dictionaries ("objects" in JSON's terms) whose leaves are strings, numbers, booleans and null (none of which can hold sub-objects). So our output format will be a list of objects with year, album and song title, plus a lyrics sub-array with the lyrics as lines.
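A minimal sketch of that output format, using the standard `json` module. The album name below is a placeholder rather than scraped data, and the lyrics are cut to two lines for brevity.

```python
import json

songs = [
    {
        "year": 1994,
        "album": "Example Album",  # placeholder value for illustration
        "title": "Window",
        "lyrics": [
            "Look at all the people in the window",
            "I'm checking out the people in the window",
        ],
    }
]

# Serialize to text (this is what we would write to a file) ...
text = json.dumps(songs, indent=2)
print(text)

# ... and read it back into plain Python lists and dicts.
restored = json.loads(text)
```

Lists, dicts, strings and numbers survive the round trip unchanged, which is why this shape is convenient for nested data like lyrics.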

Enter Make
==========

With a task this simple we could write a function like this:

In [111]:
def scrape_tmbg_lyrics(discog_url):
    albums = get_album_list(discog_url)
    output = []
    for album in albums:
        url = album['url']
        tracks = scrape_lyrics_pages(url)
        for track in tracks:
            data = get_song_data(track)
            data['album'] = album['title']
            output.append(data)
    return output

In [112]:
scrape_tmbg_lyrics("http://tmbw.net/wiki/Discography")

Cache miss: http://tmbw.net/wiki/Discography
Cache miss: http://tmbw.net/wiki/Wiggle_Diskette
Cache miss: http://tmbw.net/wiki/Lyrics:Everything_Right_Is_Wrong
Cache miss: http://tmbw.net/wiki/Lyrics:You%27ll_Miss_Me_(Demo_1)
Cache miss: http://tmbw.net/wiki/1985_Demo_Tape
Cache miss: http://tmbw.net/wiki/Lyrics:Put_Your_Hand_Inside_The_Puppet_Head_(Demo_3)
Cache miss: http://tmbw.net/wiki/Lyrics:When_It_Rains_It_Snows_(Demo)
Cache miss: http://tmbw.net/wiki/Lyrics:Number_Three_(1985_Demo)
Cache miss: http://tmbw.net/wiki/Lyrics:Don%27t_Let%27s_Start_(1985_Demo)
Cache miss: http://tmbw.net/wiki/Lyrics:You%27ll_Miss_Me_(Demo_2)
Cache miss: http://tmbw.net/wiki/Lyrics:I_Hope_That_I_Get_Old_Before_I_Die_(Demo)
Cache miss: http://tmbw.net/wiki/Lyrics:The_Biggest_One_(1985_Demo)
Cache miss: http://tmbw.net/wiki/Lyrics:32_Footsteps
Cache miss: http://tmbw.net/wiki/Lyrics:Boat_Of_Car_(1985_Demo)
Cache miss: http://tmbw.net/wiki/Lyrics:Cowtown_(1985_Demo_2)
Cache miss: http://tmbw.net/wiki/Lyr

AttributeError: 'NoneType' object has no attribute 'text'

The scrape died partway through: on at least one page a `find` call came back with `None` (no matching element), and asking `None` for `.text` raises this error. Rather than give up on the whole job, let's catch errors per track, report them, and keep going.

In [113]:
def scrape_tmbg_lyrics(discog_url):
    albums = get_album_list(discog_url)
    output = []
    for album in albums:
        url = album['url']
        tracks = scrape_lyrics_pages(url)
        for track in tracks:
            try:
                data = get_song_data(track)
                data['album'] = album['title']
                output.append(data)
            except Exception as e:
                # Report the bad page and carry on with the rest.
                print("Ran into an error on {} ({})".format(track, e))
    return output

In [114]:
verbose = True
scrape_tmbg_lyrics("http://tmbw.net/wiki/Discography")[0:10]

Cache hit: http://tmbw.net/wiki/Discography
Cache hit: http://tmbw.net/wiki/Wiggle_Diskette
Cache hit: http://tmbw.net/wiki/Lyrics:Everything_Right_Is_Wrong
Cache hit: http://tmbw.net/wiki/Lyrics:You%27ll_Miss_Me_(Demo_1)
Cache hit: http://tmbw.net/wiki/1985_Demo_Tape
Cache hit: http://tmbw.net/wiki/Lyrics:Put_Your_Hand_Inside_The_Puppet_Head_(Demo_3)
Cache hit: http://tmbw.net/wiki/Lyrics:When_It_Rains_It_Snows_(Demo)
Cache hit: http://tmbw.net/wiki/Lyrics:Number_Three_(1985_Demo)
Cache hit: http://tmbw.net/wiki/Lyrics:Don%27t_Let%27s_Start_(1985_Demo)
Cache hit: http://tmbw.net/wiki/Lyrics:You%27ll_Miss_Me_(Demo_2)
Cache hit: http://tmbw.net/wiki/Lyrics:I_Hope_That_I_Get_Old_Before_I_Die_(Demo)
Cache hit: http://tmbw.net/wiki/Lyrics:The_Biggest_One_(1985_Demo)
Cache hit: http://tmbw.net/wiki/Lyrics:32_Footsteps
Cache hit: http://tmbw.net/wiki/Lyrics:Boat_Of_Car_(1985_Demo)
Cache hit: http://tmbw.net/wiki/Lyrics:Cowtown_(1985_Demo_2)
Cache hit: http://tmbw.net/wiki/Lyrics:Chess_Piece_

[{'title': 'Everything Right Is Wrong',
  'by': ' They Might Be Giants',
  'year': ' 1985',
  'lyrics': ['Everything right is wrong again',
   'Just like in The Long, Long Trailer',
   'All the dishes got broken and the car kept driving',
   'And nobody would stop to save her',
   "Wake me when it's over, touch my face",
   'Tell me every word has been erased',
   "Don't you want to know the reason",
   "Why the cupboard's not appealing",
   "Don't you get the feeling that",
   "Everything that's right is wrong again",
   "You're a weasel overcome with dinge",
   'Weasel overcome but not before the damage done',
   "The healing doesn't stop the feeling",
   'Everything right is wrong again',
   'Just like in the long long trailer',
   'All the dishes got broken and the car kept driving',
   'And nobody would stop to save her',
   'And now the song is over now',
   'And now the song is over now',
   'And now the song is over now',
   'The song is over now',
   'Everything right is wrong
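Printing errors is fine for a lecture, but on a long crawl it can help to keep the failures so they can be inspected or retried later. Here is one way that might look; `scrape_all` and `fake_scrape` are hypothetical names, and `fake_scrape` merely stands in for something like `get_song_data` so the example runs without touching the network.

```python
def scrape_all(urls, scrape_one):
    """Apply scrape_one to each url, collecting results and failures separately."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(scrape_one(url))
        except Exception as e:
            # Remember which page failed and why, then keep going.
            failures.append((url, repr(e)))
    return results, failures

def fake_scrape(url):
    # Stand-in scraper: pretends that pages containing "bad" lack the lyrics div.
    if "bad" in url:
        raise AttributeError("'NoneType' object has no attribute 'text'")
    return {"url": url}

results, failures = scrape_all(["page/good", "page/bad"], fake_scrape)
print(results)
print(failures)
```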

Using Make to Make
==================

It's a useful exercise to convert this to a build-system orchestrated set of tasks. What we want to do is isolate each step. These are:

1. grab the album list
2. grab the song list (from the album list)
3. grab the song data (from the song list)

See the makefiles.
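The makefiles themselves aren't reproduced here. As a rough Python illustration of the rule make applies (rebuild a target only if it is missing or older than one of its dependencies), here is a sketch. The file names and the `build_*` functions are made up, and the builders write stand-in text instead of scraping anything.

```python
import os
import tempfile

def needs_rebuild(target, deps):
    """True if target is missing or older than any of its dependencies."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(d) > target_time for d in deps)

def run_step(target, deps, build):
    """Run build(target, deps) only when needed; report whether it ran."""
    if needs_rebuild(target, deps):
        build(target, deps)
        return True
    return False

workdir = tempfile.mkdtemp()
albums = os.path.join(workdir, "albums.txt")
songs = os.path.join(workdir, "songs.txt")

def build_albums(target, deps):
    with open(target, "w") as fd:
        fd.write("album-1\nalbum-2\n")  # stand-in for scraping the discography

def build_songs(target, deps):
    with open(deps[0]) as src, open(target, "w") as fd:
        for line in src:
            fd.write("song-from-" + line)  # stand-in for scraping each album

print(run_step(albums, [], build_albums))      # target missing, so it is built
print(run_step(songs, [albums], build_songs))  # target missing, so it is built
print(run_step(songs, [albums], build_songs))  # up to date, so it is skipped
```

Splitting the job this way means a failure in a late step doesn't force the earlier steps to run again.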

Scraping Async Data
===================

Lots of web pages load their data _after_ the page loads. Consider almost any of the pages on this great website:

https://bl.ocks.org/armollica/3b5f83836c1de5cca7b1d35409a013e3

If we load this into our Python session:


In [115]:
page = get("https://bl.ocks.org/armollica/3b5f83836c1de5cca7b1d35409a013e3")
page

Cache miss: https://bl.ocks.org/armollica/3b5f83836c1de5cca7b1d35409a013e3


'<!DOCTYPE html>\n<meta charset="utf-8">\n<meta name="viewport" content="width=1000">\n<meta name="twitter:card" content="summary">\n<meta name="twitter:site" content="@mbostock">\n<meta property="og:url" content="https://bl.ocks.org/armollica/3b5f83836c1de5cca7b1d35409a013e3">\n<meta property="og:title" content="Joyplot">\n<meta property="og:description" content="Andrew Mollica’s Block 3b5f83836c1de5cca7b1d35409a013e3">\n<meta property="og:image" content="https://bl.ocks.org/armollica/raw/3b5f83836c1de5cca7b1d35409a013e3/783d1bbc2dd3aabbfcba83ece4a670de6f1ec371/thumbnail.png">\n<title>Joyplot - bl.ocks.org</title>\n<link rel="icon" href="/favicon.png">\n<link rel="canonical" href="https://bl.ocks.org/armollica/3b5f83836c1de5cca7b1d35409a013e3">\n<style>\n\n@import url("/style.css");\n\n</style>\n\n  <a class="announce" href="https://observablehq.com/?utm_source=blocks">\n    <div class="column">\n      Join <b>Observable</b> to explore and create live, interactive data visualizations.

Take my word for it: the data which makes these pretty plots isn't here. That's because some Javascript (d3's "tsv" function) loads the data _after_ the page loads. When we get a page with our `get` function we just get the raw HTML as text. Nothing executes the Javascript code, and there isn't even a document object. Even if we parse our page with BeautifulSoup we just parse the HTML. No Javascript is executed.

Are we out of luck?

Visit the Network Tab
=====================

Nope! If we open our inspector, we can visit the network tab, click refresh, and see all the data the page loads. Typically we can then find the URL from which the data is loaded.

Sure enough:

https://bl.ocks.org/armollica/raw/3b5f83836c1de5cca7b1d35409a013e3/783d1bbc2dd3aabbfcba83ece4a670de6f1ec371/data.tsv

If we now invoke this:


In [116]:
content = get("https://bl.ocks.org/armollica/raw/3b5f83836c1de5cca7b1d35409a013e3/783d1bbc2dd3aabbfcba83ece4a670de6f1ec371/data.tsv")
content[0:100]

Cache miss: https://bl.ocks.org/armollica/raw/3b5f83836c1de5cca7b1d35409a013e3/783d1bbc2dd3aabbfcba83ece4a670de6f1ec371/data.tsv


'activity\ttime\tp\tp_peak\tp_smooth\nRunning\t1440.0\t6.45223543281662e-5\t0.04402262110550772\t0.04402262110'

We can now convert this to a data frame, if we want:

In [117]:
import pandas as pd
import io

pd.read_csv(io.StringIO(content), sep="\t")

                             activity    time         p    p_peak  p_smooth
0                             Running  1440.0  0.000065  0.044023  0.044023
1              Playing racquet sports  1440.0  0.000000  0.000000  0.000000
2     Weightlifting/strength training  1440.0  0.000080  0.050097  0.050097
3                              Hiking  1440.0  0.000000  0.000000  0.000000
4                              Biking  1440.0  0.000007  0.008702  0.008702
...                               ...     ...       ...       ...       ...
8087                    Rollerblading  1440.0  0.000026  0.080469  0.080469
8088                          Golfing  1440.0  0.000018  0.007607  0.007607
8089               Playing volleyball  1440.0  0.000000  0.000000  0.000000
8090                          Boating  1440.0  0.000022  0.033145  0.033145


The idea here is: if you can't find the data you want in the HTML, open your browser, look in the Network tab and see if you can find things like:

1. JSON
2. CSV/TSV/TXT
3. HTML/XML

Copy the url and scrape or load that data directly.
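Once you have the raw text from such a URL, the standard library can often turn it into native data structures without any HTML parsing at all. A small sketch follows; `parse_payload` and its `kind` argument are names invented for this example, and the sample strings are made up.

```python
import csv
import io
import json

def parse_payload(text, kind):
    """Parse downloaded text as JSON, CSV or TSV into Python objects."""
    if kind == "json":
        return json.loads(text)
    if kind in ("csv", "tsv"):
        delimiter = "\t" if kind == "tsv" else ","
        # DictReader uses the first line as column names.
        return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
    raise ValueError("unsupported kind: {}".format(kind))

tsv_text = "activity\ttime\nRunning\t1440.0\n"
print(parse_payload(tsv_text, "tsv"))
print(parse_payload('{"activity": "Running"}', "json"))
```

Note that the `csv` module leaves every value as a string; pandas, as used above, infers numeric types for you.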

Recap
=====

Web Scraping is simple and fun. The basic idea is almost always the same:

1. Find a page that has the data you want.
2. If there are many such pages, find a page which lists them all or otherwise find a way to navigate through them by finding "a" tags (links)
3. If you can't find the data in the HTML, try looking at the network activity tab to see if it's loaded after the page is loaded
4. Make sure to cache your requests so that you don't look like a DOS attack, and don't issue a million requests a minute or you might get your IP blocked. It's also just rude.
5. If your pages are behind a login you can usually get around this by copying the header information from a browser after you've logged in, but scraping such pages may be illegal in the US.
6. It's very useful to break down complicated scraping tasks into smaller bits and put them together with a makefile.

