Lab 1: Simple Web Scraping with Requests

Objective: Learn how to retrieve and parse data from a website in Python.

Successful Outcome: Access a piece of data, save it to a variable, and print it out.

# Step 1: Preliminaries

This is where we import the needed modules and "get" the web page we want to get data from.

In [4]:
##Here we are importing the requests and BeautifulSoup modules
import requests
from bs4 import BeautifulSoup

##Requests is the module that actually goes out and accesses the website.
##BeautifulSoup helps with formatting and parsing the html.

page = requests.get("https://en.wikipedia.org/wiki/Nineteen_Eighty-Four")
##Here we access Wikipedia using the requests module. Running this block
##gives us a snapshot of the page at the time of access.

##When you run this block there should be no output. Go ahead, give it a try.

# Step 2: Reading the Content

In [5]:
## Now that we have grabbed a snapshot of the page, let's see what happens when we print it out!
print(page)


<Response [200]>
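
The 200 in that output is the HTTP status code, and 200 means the request succeeded. Below is a minimal sketch of how the status could be checked before using the page. To keep it runnable without network access, it builds the Response objects by hand, which is an assumption made purely for illustration; normally requests.get() creates them for you.

```python
import requests

# Build Response objects by hand so the example runs offline.
# In the notebook, `page = requests.get(...)` returns an object like these.
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

print(ok_response)               # same form as the output above
print(ok_response.status_code)   # the numeric status code
print(ok_response.ok)            # True for codes below 400

# raise_for_status() raises an exception for 4xx and 5xx codes,
# which stops a script before it tries to parse an error page.
try:
    missing_response.raise_for_status()
except requests.HTTPError as error:
    print("Request failed:", error)
```

Calling page.raise_for_status() right after requests.get() is one common way to catch a failed download early.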


In [6]:
## As you can see, printing the page just gives us the response object and its status code (200 means the request succeeded), when what we want is the page content.
## We can get that by accessing the .content attribute of the page object
## we created earlier and printing it out.

print(page.content)
## This command will give you all of the page's content.

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Nineteen Eighty-Four - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Nineteen_Eighty-Four","wgTitle":"Nineteen Eighty-Four","wgCurRevisionId":796815867,"wgRevisionId":796815867,"wgArticleId":23454753,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","Wikipedia indefinitely move-protected pages","Use British English from May 2012","Use dmy dates from August 2016","All articles with unsourced statements","Articles with unsourced statements from January 2017","Articles with unsourced statements from March 20
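
The b'...' prefix and the literal \n sequences appear because .content holds raw bytes rather than text. The sketch below uses a short stand-in for page.content (an assumption, since the real page is far longer) to show what decoding the bytes does. The requests library also offers page.text, which returns the already decoded string.

```python
# A short stand-in for page.content; the real value is much longer.
raw = (
    b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n'
    b'<head>\n<meta charset="UTF-8"/>\n'
    b'<title>Nineteen Eighty-Four - Wikipedia</title>\n'
)

print(type(raw))            # bytes, which is why the output starts with b'

# Decoding turns the bytes into a normal string, and the \n escapes
# become real line breaks when printed.
text = raw.decode("utf-8")
print(type(text))
print(text)
```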

In [7]:
##As you can see, we have the page content now, but it's not that easy to read. That's why Beautiful Soup is very useful.
##Now we are going to parse the content as HTML using Beautiful Soup's html.parser.
soup = BeautifulSoup(page.content, 'html.parser')

##Let's see what the parser got us!
print(soup)

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Nineteen Eighty-Four - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Nineteen_Eighty-Four","wgTitle":"Nineteen Eighty-Four","wgCurRevisionId":796815867,"wgRevisionId":796815867,"wgArticleId":23454753,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","Wikipedia indefinitely move-protected pages","Use British English from May 2012","Use dmy dates from August 2016","All articles with unsourced statements","Articles with unsourced statements from January 2017","Articles with unsourced statements from March 2014","Arti
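
Besides printing the whole document, the soup object lets you reach individual tags directly. Here is a small sketch on a hand-written HTML snippet (a shortened stand-in for the Wikipedia page, so it runs offline) showing soup.title and prettify().

```python
from bs4 import BeautifulSoup

# A shortened, hand-written stand-in for the downloaded page.
html = (
    b'<!DOCTYPE html><html><head>'
    b'<title>Nineteen Eighty-Four - Wikipedia</title></head>'
    b'<body><h1 class="firstHeading" id="firstHeading" lang="en">'
    b'<i>Nineteen Eighty-Four</i></h1></body></html>'
)

soup = BeautifulSoup(html, "html.parser")

print(soup.title)          # the whole <title> tag
print(soup.title.string)   # only the text inside the tag

# prettify() returns the document as an indented string, one tag per line.
print(soup.prettify())
```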

# Step 3: Parsing the HTML

By the way, if you are not that familiar with HTML, here is a link to an article that explains it much better than I could.

https://www.w3schools.com/html/html_intro.asp

In [8]:
##As you can see, the HTML is now much easier to read, and with some simple commands we can search for objects within it very easily.
##The find_all command is very useful for finding all elements that match the filters you give it.
##The first argument is the tag name (left empty here, so the tag type is not restricted), and class_ filters by CSS class.
##For this notebook we are going to find the heading of the wiki article.
##We know the class we are looking for is 'firstHeading', you can see this in the output above.
##Google Chrome's Inspect tool is very useful for finding the HTML of what you're looking for as well.
body = soup.find_all('', class_='firstHeading')

##Let's see what the soup found.
print(body)

[<h1 class="firstHeading" id="firstHeading" lang="en"><i>Nineteen Eighty-Four</i></h1>]
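
To make the filters more concrete, here is a hedged sketch of a few ways to narrow a search. The HTML snippet is hand-written, and the extra paragraph with the same class is a made-up decoy added only to show the difference between the calls.

```python
from bs4 import BeautifulSoup

# Hand-written snippet; the <p> element is a hypothetical decoy.
snippet = (
    '<div>'
    '<h1 class="firstHeading" id="firstHeading" lang="en">'
    '<i>Nineteen Eighty-Four</i></h1>'
    '<p class="firstHeading">decoy paragraph</p>'
    '</div>'
)
soup = BeautifulSoup(snippet, "html.parser")

# Filter by class only: every tag with that class is returned.
by_class = soup.find_all(class_="firstHeading")
print(len(by_class))

# Filter by tag name and class: only the <h1> matches.
by_tag_and_class = soup.find_all("h1", class_="firstHeading")
print(by_tag_and_class)

# find() returns the first match (or None) instead of a list.
by_id = soup.find(id="firstHeading")
print(by_id.name)
```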


In [9]:
##So it's as easy as that. All we have is the HTML line, so some trimming is necessary. If you want just the title,
##you can cut away the rest of the line using Python's string slicing. (string[startIndex:endIndex])
##First we have to cast it to a string though.
title = str(body)
title = title[57:77]
print(title)

Nineteen Eighty-Four
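
Slicing with fixed positions works for this exact line, but the positions would shift if the tag's attributes or the title length changed. As an alternative sketch, Beautiful Soup's get_text() returns only the text inside a tag. The snippet below is a hand-written copy of the heading so the example runs on its own.

```python
from bs4 import BeautifulSoup

# Hand-written copy of the heading line found above.
snippet = (
    '<h1 class="firstHeading" id="firstHeading" lang="en">'
    '<i>Nineteen Eighty-Four</i></h1>'
)
soup = BeautifulSoup(snippet, "html.parser")

body = soup.find_all(class_="firstHeading")

# find_all returns a list, so take the first match and ask for its text.
title = body[0].get_text()
print(title)
```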


And there it is! With this simple script, a for loop, and a file with valid links, you could grab as many
titles from Wikipedia as you want. By changing the parameters of the find_all call, other pieces of data can be grabbed as well.
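
Here is one way the loop over links could be sketched. The function names, the example.org URLs, and the fake pages are all hypothetical. The download step is passed in as a function so the example can run offline; in practice it could be something like `lambda url: requests.get(url, timeout=10).content`, and the list of links could come from reading a text file line by line.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the text of the first element with class 'firstHeading', or None."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find(class_="firstHeading")
    return heading.get_text(strip=True) if heading else None


def scrape_titles(urls, fetch):
    """Map each URL to its title; `fetch` is any function that returns the HTML for a URL."""
    return {url: extract_title(fetch(url)) for url in urls}


# Hypothetical stand-ins for real pages so the sketch runs without a network.
fake_pages = {
    "https://example.org/a": '<h1 class="firstHeading"><i>Nineteen Eighty-Four</i></h1>',
    "https://example.org/b": "<p>No heading on this page</p>",
}

titles = scrape_titles(fake_pages.keys(), fake_pages.get)
for url, title in titles.items():
    print(url, "->", title)
```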

# Step 4: More Complex Scraping

Now let's try something a bit more complex. On the website onthesnow.com there are tables which hold a bunch of information regarding snowfall. Let's learn how to grab all the information from a table.

In [10]:
##These first lines are just like what we saw earlier: getting the webpage and creating a soup object
##from the page content for easy-to-read formatting and parsing.
page = requests.get("http://www.onthesnow.com/michigan/apple-mountain/historical-snowfall.html?&y=2016&v=list")
soup = BeautifulSoup(page.content, 'html.parser')


##The complex part is working with tables, which are a very common and useful source of data
##on a webpage. The only thing you have to know is that a table is made of 'tr' and 'td' objects.
##A 'tr' object is a row (table row) and each 'td' object is one cell of data in that row (table data).

##So here we are simply looping over every 'tr' object and collecting the text of all of its
##'td' objects.

data = [[cell.get_text(strip=True) for cell in row.find_all('td')] for row in soup.find_all("tr")]

##Now let's print out all of the row data for the table on the page!
for entry in data:
    print(str(entry))



[]
['Jan 11, 2016', '2 in.', '2 in.', '8 in.']
['Jan 12, 2016', '3 in.', '5 in.', '8 in.']
['Feb  9, 2016', '3 in.', '8 in.', '18 in.']
['Feb 16, 2016', '1 in.', '9 in.', '18 in.']
['Feb 17, 2016', '1 in.', '10 in.', '18 in.']
['Feb 25, 2016', '9 in.', '19 in.', '18 in.']
['Mar  2, 2016', '10 in.', '29 in.', '18 in.']
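
The empty list at the top of that output most likely comes from a header row, which typically uses 'th' cells rather than 'td' cells; that is an assumption about the page, not something confirmed here. The sketch below uses a hand-written table with placeholder column names to show how header cells could be read separately and how empty rows could be skipped.

```python
from bs4 import BeautifulSoup

# Hand-written table; the header names are placeholders, not the site's real ones.
table_html = """
<table>
  <tr><th>Column A</th><th>Column B</th><th>Column C</th><th>Column D</th></tr>
  <tr><td>Jan 11, 2016</td><td>2 in.</td><td>2 in.</td><td>8 in.</td></tr>
  <tr><td>Jan 12, 2016</td><td>3 in.</td><td>5 in.</td><td>8 in.</td></tr>
</table>
"""
soup = BeautifulSoup(table_html, "html.parser")
rows = soup.find_all("tr")

# Same comprehension as in the notebook: the header row has no 'td' cells,
# so it becomes an empty list.
all_rows = [[cell.get_text(strip=True) for cell in row.find_all("td")] for row in rows]
print(all_rows[0])

# Read the header cells from the first row and keep only non-empty data rows.
headers = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
data = [row for row in all_rows if row]

print(headers)
for entry in data:
    print(entry)
```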


As you can see, we got every single table entry in the historical snowfall for 2016. With some string manipulation we can extract each entry very easily to get the date and other variables. Now let's trim and save that data.
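
Before saving, here is a hedged sketch of what that string manipulation could look like if you wanted real dates and numbers instead of text. The helper name parse_row is made up for illustration, and the month abbreviations are assumed to be English, which matters because %b depends on the system locale.

```python
from datetime import datetime


def parse_row(row):
    """Turn ['Feb  9, 2016', '3 in.', ...] into a date and a list of integers."""
    date_text, *measurements = row
    # Collapse the double space that appears before single-digit days.
    cleaned_date = " ".join(date_text.split())
    date = datetime.strptime(cleaned_date, "%b %d, %Y").date()
    # Drop the 'in.' unit and convert what is left to an integer.
    inches = [int(value.replace("in.", "").strip()) for value in measurements]
    return date, inches


sample = ['Feb  9, 2016', '3 in.', '8 in.', '18 in.']
date, inches = parse_row(sample)
print(date, inches)
```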

In [11]:
#Here we are creating a text file called "scraper.txt". The with block closes the file for us when we are done.
with open("scraper.txt", "w") as txtfile:
    #Here we are writing the header line to the file.
    txtfile.write("Date, snowfall, basedepth, newsnow\n")

    #Here we are taking each entry, trimming the undesirable characters, and writing it to the file.
    for entry in data:
        #The first entry is an empty list, so skip it instead of writing a blank line.
        if not entry:
            continue
        text = str(entry)
        text = text.replace(']', '')
        text = text.replace('[', '')
        text = text.replace("'", '')
        print(text)
        txtfile.write(text + "\n")

#Go check the file out in the r-training-notebooks/ folder, it's named scraper.txt!
    


Jan 11, 2016, 2 in., 2 in., 8 in.
Jan 12, 2016, 3 in., 5 in., 8 in.
Feb  9, 2016, 3 in., 8 in., 18 in.
Feb 16, 2016, 1 in., 9 in., 18 in.
Feb 17, 2016, 1 in., 10 in., 18 in.
Feb 25, 2016, 9 in., 19 in., 18 in.
Mar  2, 2016, 10 in., 29 in., 18 in.
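
One thing to notice in that output is that the dates themselves contain a comma, so a program reading the file back cannot tell where the date ends. As an alternative sketch, Python's built-in csv module quotes such values automatically. The file name scraper.csv and the temporary folder are assumptions made for this example, and only two of the rows are repeated here.

```python
import csv
import os
import tempfile

# Two rows copied from the output above, plus the empty first entry.
data = [
    [],
    ['Jan 11, 2016', '2 in.', '2 in.', '8 in.'],
    ['Jan 12, 2016', '3 in.', '5 in.', '8 in.'],
]

# Hypothetical output location so the example does not clutter the notebook folder.
path = os.path.join(tempfile.gettempdir(), "scraper.csv")

with open(path, "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Date", "snowfall", "basedepth", "newsnow"])
    # Skip the empty entry; csv.writer quotes the dates because they contain commas.
    writer.writerows(row for row in data if row)

# Reading the file back gives the original four values per row.
with open(path, newline="") as csv_file:
    rows_read = list(csv.reader(csv_file))

for row in rows_read:
    print(row)
```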


Congratulations!
You have learned how to successfully scrape simple and complex structures from webpages!
If you want to train your scraping muscles a little bit more, try this exercise.

https://www.immobiliare.it/Roma/agenzie_immobiliari_provincia-Roma.html
Go to this website, scrape 5 different fields, and save them to a .txt file. This will help you identify how different objects on a webpage require different steps to grab them in their entirety.