# Intro to Scraping

This is just a test of functionality as I explore Jupyter Notebooks. First we'll import:

In [46]:
from bs4 import BeautifulSoup
from contextlib import closing
from urllib import request

That imports the things we need. Now we will set the URL that we need. 

In [47]:
url = "https://github.com/humanitiesprogramming/scraping-corpus"

Now that we have that link saved as a variable, we can call it up again later. 

In [48]:
print(url)

https://github.com/humanitiesprogramming/scraping-corpus


We can also build on that URL, treating it as a base and adding to it when we need a variation.

In [49]:
print(url + "/subdomain")

https://github.com/humanitiesprogramming/scraping-corpus/subdomain
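Plain string addition works here because the base URL has no trailing slash. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves one URL against another the way a browser would. The names `base`, `concatenated`, `joined`, and `absolute` below are only for illustration.

```python
from urllib.parse import urljoin

base = "https://github.com/humanitiesprogramming/scraping-corpus"

# Plain concatenation, as in the cell above.
concatenated = base + "/subdomain"

# urljoin replaces the last path segment unless the base ends in "/",
# so we add the slash before joining a relative path.
joined = urljoin(base + "/", "subdomain")

# A path that starts with "/" is resolved against the site root instead.
absolute = urljoin(base, "/humanitiesprogramming/scraping-corpus/blob/master/0.txt")

print(concatenated)
print(joined)
print(absolute)
```

The third case is the one that matters for scraping: links inside a page are often written as paths from the site root, and `urljoin` turns them back into full URLs.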


We will use that URL to grab the basic HTML for the page. The following code uses "request", a module in Python's standard "urllib" package, to go out and visit that webpage. The first line says, "Take the link stored in the variable 'url'. Visit it, read back to me what you find, and store that result in a new variable named 'html'." The second line prints the first 2,000 bytes of what came back.

In [50]:
html = request.urlopen(url).read()
print(html[0:2000])

b'\n\n\n\n\n\n<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n\n\n\n  <link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/frameworks-80206cf5276e283a2a42e750a19cfc777c5bc184c6509b5db88bac96930c339f.css" media="all" rel="stylesheet" />\n  <link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/github-e37787f054128b693988d66147a56af54ed8c479fa4abd1a183d787453cc90a6.css" media="all" rel="stylesheet" />\n  \n  \n  <link crossorigin="anonymous" href="https://assets-cdn.github.com/assets/site-f4fa6ace91e5f0fabb47e8405e5ecf6a9815949cd3958338f6578e626cd443d7.css" media="all" rel="stylesheet" />\n  \n\n  <meta name="viewport" content="width=device-width">\n  \n  <title>GitHub - humanitiesprogramming/scraping-corpus</title>\n  <link rel="search" type="application/opensearchdescription+xml" href="/opensearch.xml" title="GitHub">\n  <link rel="fluid-icon" href="https://github.com/fluidicon.png" title="GitHub">\n  <meta property="fb:app
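The `b'...'` at the start of that output means `read()` handed back bytes rather than an ordinary string. Here is a minimal sketch of the difference, using a short hand-written byte string as a stand-in for the downloaded page so that it runs without a network connection. The helper `fetch_html` is hypothetical (it is not part of the notebook) and assumes the page is encoded as UTF-8.

```python
from urllib import request


def fetch_html(url, encoding="utf-8"):
    """Download a page and return it as text (hypothetical helper)."""
    # The with-block closes the connection even if reading fails.
    with request.urlopen(url) as response:
        return response.read().decode(encoding)


# A stand-in for downloaded bytes, so this cell runs offline.
sample_bytes = b'<!DOCTYPE html>\n<html lang="en">\n  <head>'

# decode() turns bytes into a str; the \n escapes become real line breaks.
sample_text = sample_bytes.decode("utf-8")

print(type(sample_bytes).__name__)
print(type(sample_text).__name__)
print(sample_text)
```

Beautiful Soup accepts either bytes or text, so decoding is not required for the next step. It mostly helps when you want to read or print the page yourself.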


Wait - why are we scraping from GitHub instead of Project Gutenberg?
Project Gutenberg does not allow automated scraping of their website. So, instead I have collected a corpus of Project Gutenberg texts and loaded them into a GitHub repository for you to practice on.

So far we just have a whole bunch of HTML. We'll need to turn that into something that Beautiful Soup can actually work with.

In [51]:
soup = BeautifulSoup(html, 'lxml')

This line says, "take the HTML that you've pulled down and get ready to do Beautiful Soup things to it." Think of it this way: you have a certain number of things that you can do in your car:
    
* Drive
* Fill it with gas
* Change the tires
    
But you can only really do those things once you actually get in your car. You couldn't change your tires if you were riding a horse. Horses don't have wheels. In programming speak, we're saying "turn that HTML into a Beautiful Soup **object**." Saying something is an object is a way of saying "I expect this data to have certain characteristics and be able to do certain things." In this case, BeautifulSoup gives us a series of ways to manipulate the HTML. We can do things like:

* Get all the links

In [52]:
soup.find_all('a')[0:10]

[<a class="accessibility-aid js-skip-to-content" href="#start-of-content" tabindex="1">Skip to content</a>,
 <a aria-label="Homepage" class="header-logo-invertocat" data-ga-click="(Logged out) Header, go to homepage, icon:logo-wordmark" href="https://github.com/">
 <svg aria-hidden="true" class="octicon octicon-mark-github" height="32" version="1.1" viewbox="0 0 16 16" width="32"><path d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0 0 16 8c0-4.42-3.58-8-8-8z" fill-rule="evenodd"></path></svg>
 <

* Get all the text

In [20]:
soup.text[0:2000]

'\n\n\n\n\n\n\nGitHub - humanitiesprogramming/scraping-corpus\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to content\n\n\n\n\n\n\n\n\n\n\n\n\n          Features\n \n          Business\n \n          Explore\n \n          Pricing\n \n\n\n\n\nThis repository\n\n\n\n\nSign in\nor\nSign up\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n    Watch\n  \n\n    3\n  \n\n\n\n\n    Star\n  \n\n      0\n    \n\n\n\n\n        Fork\n      \n\n      0\n    \n\n\n\n\nhumanitiesprogramming/scraping-corpus\n\n\n\n\n\n\n\nCode\n\n \n\n\n\nIssues\n0\n\n \n\n\n\nPull requests\n0\n\n \n\n\n      Projects\n      0\n\n\n\n    Pulse\n\n\n\n    Graphs\n\n\n\n\n\n\n\n\n\n\n            No description, website, or topics provided.\n          \n\n\n\n\n\n\n\n\n\n\n\n              1\n            \n            commit\n        \n\n\n\n\n\n            1\n          \n          branch\n        \n\n\n\n\n\n            0\n          \n          releases\n        \n\n\n\n\n\n      1\n    \n    contributor\n\n\

It might not be very clear, but that's just the text of the webpage as one long string with all the HTML stripped out. Here is a slightly prettier version that replaces each '\n' character with a space (those are just a way for Python to note that there should be a line break at that point in the string):

In [21]:
soup.text.replace('\n', ' ')[0:1000]

'       GitHub - humanitiesprogramming/scraping-corpus                                  Skip to content                       Features             Business             Explore             Pricing       This repository     Sign in or Sign up                      Watch         3            Star           0                  Fork               0          humanitiesprogramming/scraping-corpus        Code       Issues 0       Pull requests 0            Projects       0        Pulse        Graphs                       No description, website, or topics provided.                                     1                          commit                           1                      branch                           0                      releases                     1          contributor          Clone or download                Clone with HTTPS                          Use Git or checkout with SVN using the web URL.                     Download ZIP              Find file         Branch: master 
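If the goal is simply to tidy up runs of spaces and line breaks, one common approach (a sketch, not the only way) is to split the string on whitespace and join the pieces back together with single spaces. The `messy` string below is a short hand-made stand-in for `soup.text`, not the real page.

```python
# A short stand-in for soup.text, written by hand for this example.
messy = (
    "\n\n\nGitHub - humanitiesprogramming/scraping-corpus\n\n\n\n"
    "Skip to content\n\n   Features\n \n   Business\n"
)

# str.split() with no arguments splits on any run of whitespace
# (spaces, tabs, line breaks) and drops the empty pieces.
tidy = " ".join(messy.split())

print(tidy)
```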

All that whitespace shows up because we're grabbing the text of the *entire* page, including the gaps between its elements. We can either strip the whitespace out, or we can make a more nuanced request. Instead of getting all the page text first, we can say, "first get me only the HTML for the links on this page. Then give me the text for just these smaller chunks."

In [22]:
soup.find_all('a').text

AttributeError: 'ResultSet' object has no attribute 'text'

Wait, what happened there? Python gave us an error. This is because we got confused about what kind of object we were looking at. The error message says, "This thing you've given me doesn't support the method or attribute '.text'." Let's work backwards to see what we actually get from soup.find_all('a'):

In [26]:
soup.find_all('a')[0:10]

[<a class="accessibility-aid js-skip-to-content" href="#start-of-content" tabindex="1">Skip to content</a>,
 <a aria-label="Homepage" class="header-logo-invertocat" data-ga-click="(Logged out) Header, go to homepage, icon:logo-wordmark" href="https://github.com/">
 <svg aria-hidden="true" class="octicon octicon-mark-github" height="32" version="1.1" viewbox="0 0 16 16" width="32"><path d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0 0 16 8c0-4.42-3.58-8-8-8z" fill-rule="evenodd"></path></svg>
 <

That looks as expected. To see what's going on, let's look at it another way. The following line will tell us what kind of object we're looking at:

In [27]:
type(soup.find_all('a')).__name__

'ResultSet'

Ah! We're getting somewhere. We're looking at a ResultSet, not a BeautifulSoup object, and a ResultSet lets us do different things. In fact, a ResultSet is a list of Tag objects, and each of those Tags responds to a lot of the same things as a BeautifulSoup object. Check it:

In [28]:
type(soup.find_all('a')[0]).__name__

'Tag'

In [29]:
soup.find_all('a')[0].text

'Skip to content'
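The same distinction shows up with plain Python lists, no Beautiful Soup required. As a small illustration (the list `words` is made up for this example), a list of strings does not have string methods, but each string inside it does:

```python
words = ["Skip to content", "Features", "Business"]

# The list as a whole has no string methods...
try:
    words.upper()
except AttributeError as error:
    message = str(error)
    print(message)

# ...but each item inside it does.
first_upper = words[0].upper()
print(first_upper)
```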

How many links are there on this page anyway? We can find out by checking out the length of this ResultSet:

In [30]:
len(soup.find_all('a'))

71

Here we go. soup.find_all() returns something roughly equivalent to a list. And you can do certain things to lists: you can find out how long they are, you can sort them, you can do things to each item. But you can't pull the text out of the list as a whole. That's something that a BeautifulSoup object or a Tag can do. We were trying to change the tires of our horse. We could, though, go through each element in that list and get the text for each individual item. The following lines do just that, printing a divider before each item to make the output more readable. And we'll strip out the line breaks again.

In [32]:
for item in soup.find_all('a')[0:10]:
    print('=======')
    print(item.text.replace('\n', ''))


Skip to content

          Features
          Business
          Explore
          Pricing
This repository
Sign in
Sign up
    Watch  
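A loop like this is often written as a list comprehension, which collects the results in a new list instead of printing them. Here is a sketch on a tiny hand-written page, so it runs offline. The HTML in `sample_html` is invented for this example, and it uses Python's built-in 'html.parser' rather than 'lxml' so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for the GitHub HTML.
sample_html = """
<nav>
  <a href="#start-of-content">Skip to content</a>
  <a href="/features">
      Features
  </a>
  <a href="/pricing">  Pricing </a>
</nav>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# get_text(strip=True) trims the whitespace around each tag's text.
link_texts = [a.get_text(strip=True) for a in sample_soup.find_all("a")]

print(link_texts)
```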


Now we're getting somewhere. Beautiful Soup can pull down data from a link, but we'll just have to be careful that we know what kinds of objects we are working with. So let's pull down only the links that we care about by being a bit more specific.

In [33]:
for link in soup.select("td.content a"):
    print(link.text)

0.txt
1.txt
2.txt
3.txt
4.txt
5.txt
6.txt
7.txt
8.txt
9.txt


The "td.content a" bit uses CSS selector syntax to walk the structure of the HTML document to get to what we want. It says, "find the 'td' tags that have the class 'content' and then give me the 'a' tags inside them." Once we have all that, we print out the text of those 'a' tags. If you haven't worked with CSS before, you can find a good tutorial for CSS selectors [here](https://www.w3schools.com/cssref/css_selectors.asp). Rather than getting the text of those links, this time we will collect their 'href' values and store them in a list for us to scrape.

In [34]:
links_html = soup.select('td.content a')
urls = []
for link in links_html:
    url = link['href']
    urls.append(url)
print(urls)

['/humanitiesprogramming/scraping-corpus/blob/master/0.txt', '/humanitiesprogramming/scraping-corpus/blob/master/1.txt', '/humanitiesprogramming/scraping-corpus/blob/master/2.txt', '/humanitiesprogramming/scraping-corpus/blob/master/3.txt', '/humanitiesprogramming/scraping-corpus/blob/master/4.txt', '/humanitiesprogramming/scraping-corpus/blob/master/5.txt', '/humanitiesprogramming/scraping-corpus/blob/master/6.txt', '/humanitiesprogramming/scraping-corpus/blob/master/7.txt', '/humanitiesprogramming/scraping-corpus/blob/master/8.txt', '/humanitiesprogramming/scraping-corpus/blob/master/9.txt']
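To see the selector and the 'href' lookup in isolation, here is a sketch on a small hand-written table. The HTML is an assumption made up for this example (GitHub's real markup is more involved and may change), but it shows that only links inside 'td' tags with the class 'content' are picked up.

```python
from bs4 import BeautifulSoup

# Hand-written HTML: one link we want to skip and two we want to keep.
sample_html = """
<table>
  <tr>
    <td class="icon"><a href="/ignored">icon</a></td>
    <td class="content"><a href="/humanitiesprogramming/scraping-corpus/blob/master/0.txt">0.txt</a></td>
  </tr>
  <tr>
    <td class="content"><a href="/humanitiesprogramming/scraping-corpus/blob/master/1.txt">1.txt</a></td>
  </tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# "td.content a" means: <a> tags nested inside <td class="content">.
matches = sample_soup.select("td.content a")

# A Tag works like a dictionary for its HTML attributes.
names = [a.text for a in matches]
hrefs = [a["href"] for a in matches]

print(names)
print(hrefs)
```

The link inside the 'icon' cell never appears in the results, which is exactly the filtering we wanted from the real page.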


Getting closer to some usable URLs. We just need to add the base of the website to each one. So here is the same piece of code, reworked slightly. We'll also modify each URL a little because of the way that GitHub formats its URLs. We want to get something like [this](https://raw.githubusercontent.com/walshbr/ohio-five-workshop/master/cli-tutorial.md) instead of [this](https://github.com/walshbr/ohio-five-workshop/blob/master/cli-tutorial.md), which is what we were getting. The former is the raw file, stripped of all the GitHub page formatting.

In [35]:
links_html = soup.select('td.content a')
urls = []
for link in links_html:
    url = link['href'].replace('blob/', '')
    urls.append("https://raw.githubusercontent.com" + url)
print(urls)

['https://raw.githubusercontent.com/humanitiesprogramming/scraping-corpus/master/0.txt', 'https://raw.githubusercontent.com/humanitiesprogramming/scraping-corpus/master/1.txt', 'https://raw.githubusercontent.com/humanitiesprogramming/scraping-corpus/master/2.txt', 'https://raw.githubusercontent.com/humanitiesprogramming/scraping-corpus/master/3.txt', 'https://raw.githubusercontent.com/humanitiesprogramming/scraping-corpus/master/4.txt', 'https://raw.githubusercontent.com/humanitiesprogramming/scraping-corpus/master/5.txt', 'https://raw.githubusercontent.com/humanitiesprogramming/scraping-corpus/master/6.txt', 'https://raw.githubusercontent.com/humanitiesprogramming/scraping-corpus/master/7.txt', 'https://raw.githubusercontent.com/humanitiesprogramming/scraping-corpus/master/8.txt', 'https://raw.githubusercontent.com/humanitiesprogramming/scraping-corpus/master/9.txt']
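One thing to watch with `replace` is that it swaps every occurrence it finds. As a slightly more defensive sketch, the conversion can be wrapped in a small function that only removes the first "/blob/" in the path. The function name `to_raw_url` and the constant `RAW_HOST` are made up for this example.

```python
RAW_HOST = "https://raw.githubusercontent.com"


def to_raw_url(href):
    """Turn a GitHub '/user/repo/blob/branch/file' path into a raw URL."""
    # The third argument limits the replacement to the first match,
    # so a folder or file that happens to be named "blob" is left alone.
    return RAW_HOST + href.replace("/blob/", "/", 1)


hrefs = [
    "/humanitiesprogramming/scraping-corpus/blob/master/0.txt",
    "/humanitiesprogramming/scraping-corpus/blob/master/1.txt",
]
raw_urls = [to_raw_url(href) for href in hrefs]

print(raw_urls)
```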


Bingo! Since we know how to go through a list and run code on each item, we can now scrape each of these URLs and combine the results into a dataset for us to use. We'll be re-using code from above. See if you can remember what each piece is doing:

In [37]:
corpus_texts = []
for url in urls:
    html = request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    text = soup.text.replace('\n', ' ')
    corpus_texts.append(text)

The variable corpus_texts now is a list containing ten different novels. We've got a nice little collection of data, and we can do some other things with it.
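When a loop makes many requests, it is worth pausing between them and deciding what should happen if one link fails. Here is one possible sketch of that, not a definitive implementation. The function names `download_text` and `build_corpus`, the one-second default delay, and the UTF-8 encoding are all assumptions for this example. Because the raw files are plain text, this version decodes them directly instead of passing them through Beautiful Soup.

```python
import time
from urllib import request
from urllib.error import URLError


def download_text(url):
    """Fetch one URL and decode it as UTF-8 (the encoding is an assumption)."""
    with request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def build_corpus(urls, fetch=download_text, delay=1.0):
    """Fetch each URL in turn, pausing between requests.

    Returns (texts, failed) so that one bad link does not stop the run.
    """
    texts, failed = [], []
    for url in urls:
        try:
            texts.append(fetch(url))
        except (URLError, OSError) as error:
            # Remember which link failed and why, then keep going.
            failed.append((url, str(error)))
        time.sleep(delay)
    return texts, failed
```

With the list from above, this would be called as `corpus_texts, failed = build_corpus(urls)`. Passing a different `fetch` function makes it possible to try the loop out without touching the network.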