# Day 6: Web Scraping and Data Processing with Python

Today we will be introducing Python as a multipurpose tool for fetching, parsing, aggregating, and exporting data! We will be drawing upon the BeautifulSoup library in particular and using some ideas from ["Intro to Beautiful Soup" by Jeri Wieringa](https://programminghistorian.org/en/lessons/intro-to-beautiful-soup).

Before we get going today, please complete the [Python Syntax tutorial at Codecademy](https://www.codecademy.com/courses/learn-python/lessons/python-syntax). 

Note that we will be using Python 3 in this notebook, not Python 2. As the Codecademy tutorial mentions, one critical change between 2 and 3 is that print is a function in Python 3, so the value to print goes inside parentheses: print("Statement")

Click into the code box below for an example of printing in Python 3:

In [None]:
# To run this code, click into this box and then click the Run button at the top of the screen

researchQ = "???"
print("A key question in my current research project is: " + researchQ)

# Now try assigning a different value to researchQ. Run the code again to see a different answer

Today we will be adding a few concepts beyond the Codecademy tutorial. One important technique is the idea of **looping** or **iteration**. Sometimes we may wish to run the same line of code several times, but perhaps manipulating a different value each time.

For instance, consider the following variable assignment for *methods*:

In [None]:
# The methods variable stores a list of five separate strings

methods = ["text analysis", "web mapping", "data cleaning", "lots of Googling", "inventing new methods entirely"]

# Click into this box and select RUN to assign the values to the methods variable

Python uses several data types to store data in variables. We will often want to collect our data into a single collection of values. 

A **list** is an ordered collection of values. We can create a list by using brackets around a collection of items, each of which is separated by a comma.

In this case, we created a list of **strings**. Strings are sequences of characters (letters, digits, spaces, punctuation, and so on). We create a string by using quotation marks (single-quotes, double-quotes, or triple-quotes all work, as long as we use the same type at the beginning and the end).
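
As a quick illustration of lists and strings, here is a minimal sketch. The *tools* list is a made-up example, not part of the lesson's data. It shows that all three quoting styles produce ordinary strings, and that individual items can be looked up by position.

```python
# Hypothetical list, used only for illustration
tools = ["Python", 'BeautifulSoup', """requests"""]

print(len(tools))      # number of items in the list
print(tools[0])        # first item (positions start counting at 0)
print(tools[-1])       # last item
print(type(tools[1]))  # every item here is a string, whatever quotes we used
```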

How would we print out each of these strings one by one? One approach is to tell Python to print multiple times, each time selecting a different item from the list using its "index", the number corresponding to the item's location in the list (counting from 0):

In [None]:
print("Digital humanities involves " + methods[0])
print("Digital humanities involves " + methods[1])
print("Digital humanities involves " + methods[2])
print("Digital humanities involves " + methods[3])
print("Digital humanities involves " + methods[4])


This method works, but is repetitive to write. (In programming, we try to follow the DRY principle: Don't Repeat Yourself!)

Instead, let's loop over the numbers 0, 1, 2, 3, 4 using a for loop and a list:

In [None]:
for index in [0, 1, 2, 3, 4]:
    print("Digital humanities involves " + methods[index])

This method uses far less code and produces the same result.

Sometimes, we would rather ignore the idea of the index entirely. Python lets us ignore indexes if we choose and loop over a list directly using the following syntax:

In [None]:
for method in methods:
    print("Digital humanities involves " + method)


The idea of looping is very powerful, especially when we don't know ahead of time what values we will use.
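
Sometimes we want both the index and the item at the same time. One possible approach, sketched here with a shortened sample list, is Python's built-in enumerate() function, which hands the loop a position and a value on each pass. The *numbered* list is a hypothetical helper used to collect the results.

```python
methods = ["text analysis", "web mapping", "data cleaning"]  # shortened sample list

numbered = []
for index, method in enumerate(methods):
    line = str(index) + ": " + method
    numbered.append(line)   # keep each line so we can reuse it later
    print(line)
```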

In web scraping, we often know a few values like the URL of the target site, and perhaps the location of important information on that site. But we will also discover information as we go, and may want to loop over these items to extract information (such as metadata, content, and links to further valuable information elsewhere). This is where looping helps us out.

## Getting Started with Web Scraping

When working with data that we haven't prepared ourselves, we often face a set of challenges in being able to use that data in a helpful fashion. In **web scraping** with Python, we first must fetch the data from our website. In most cases, that data will exist in HTML format, but it could be another format (like XML or others). 

Once our script has fetched the web data into working memory, we will likely want to transform a few important features of it into data structures Python understands (like strings, lists, and other variables). When scraping a blog, for example, we may wish to keep track of individual posts, and collect data like the title, date, post summary text, and link to the full text.

Sometimes, we will want to perform additional transformations on our data, such as counting or aggregating certain categories of data, or looping over additional pages to fetch more web content.

Finally, we'll want to output our data into a format that is helpful to us. In this example, we will output our data as a table using the **comma-separated values (CSV) format**. Another common output type is **JavaScript Object Notation (JSON)**.

In short, our steps are:
* Fetch data from a website
* Transform the data (give it structure, keep track of important data and metadata, count things)
* Output the data into a helpful format (CSV)
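
To make the two output formats concrete, here is a small sketch that writes the same records as CSV and as JSON. The *posts* records and their field names are invented for illustration, and the output goes to an in-memory buffer rather than a real file.

```python
import csv
import io
import json

# Hypothetical records standing in for scraped blog posts
posts = [
    {"title": "First post", "url": "http://example.com/1"},
    {"title": "Second post", "url": "http://example.com/2"},
]

# CSV: one header row, then one row per post
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["title", "url"])
writer.writeheader()
writer.writerows(posts)
csv_text = csv_buffer.getvalue()
print(csv_text)

# JSON: the same records as a list of objects
json_text = json.dumps(posts, indent=2)
print(json_text)
```

CSV suits flat tables that will be opened in a spreadsheet, while JSON can hold nested data.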


### Step 1: Fetch data from a website using BeautifulSoup

To start scraping web data with Python, we first have to add some extra functionality to our code.

Python comes ready-built with several different functions, such as print(). However, we will often need to include extra functionality depending on our goal. Python collects this extra functionality into distinct collections of code called **libraries**. Some libraries are already installed alongside Python, and others need to be downloaded from the Internet. (On Azure, Microsoft has already installed most of the libraries you would need on their servers.)

**BeautifulSoup** is a web scraping library for Python. Its strength is giving Python the ability to understand the structure of a website, which often includes many different types of **tags** that are organized in a hierarchical fashion. BeautifulSoup lets you find all tags that match your search conditions and extract attributes and text from them. 

BeautifulSoup's tagline: "You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects." You can learn more about BeautifulSoup on [its website](https://www.crummy.com/software/BeautifulSoup).

We need to tell Python explicitly to include this extra functionality. We can do so with the **import** command. A plain import makes all of a library's functionality accessible to us. In this case, we only want to bring one piece of functionality, called BeautifulSoup, into our program from the *bs4* library, so we use the *from ... import ...* form:
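
The difference between the two import styles can be sketched with Python's built-in *math* library, which is used here only because it needs no installation:

```python
import math             # brings in the whole library; names need the "math." prefix
from math import sqrt   # brings in just one name, usable without a prefix

a = math.sqrt(16)
b = sqrt(16)
print(a, b)
```

Both calls run the same function. The second style simply saves some typing when only one or two names are needed.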

In [None]:
from bs4 import BeautifulSoup

We also need to bring in a couple more Python libraries to support BeautifulSoup:
* **requests** allows Python to interact with a web server and fetch HTML documents
* **csv** helps Python output data into CSV tables

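*csv* is part of Python 3's standard library. *requests* is a third-party package, so it has to be installed separately if your environment does not already provide it. Either way, we still need to add their functionality to our program using the *import* command: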

In [None]:
import requests
import csv

Next, let's create a variable called *url* and set it to a website to scrape. 

For today, let's explore the blog of [Bethany Nowviskie](http://nowviskie.org/), Executive Director of the Digital Library Federation (DLF) and a prominent voice in Digital Humanities. Nowviskie's blog is shared under a [Creative Commons CC-BY license](http://nowviskie.org/).

In [None]:
url = "http://nowviskie.org"

Now it's time to scrape our data. This involves a three-part process:

* Query the website using the Requests library and store the result into a variable
* Extract the text data (in this case, HTML code as text) into an output variable
* Generate a BeautifulSoup object based upon the raw HTML text data

In Python, each of these steps only takes a single line:

In [None]:
# Query our URL with requests and store the output into response
response = requests.get(url)
 
# Extract all of the text data from our query (in this case, raw HTML code) and store it into data
data = response.text
 
# Generate a soup object based on our HTML data that lets us navigate the structure of the website
soup = BeautifulSoup(data, 'html5lib')


Note how the syntax varies in each step. In step three in particular, note that we have to specify 'html5lib' as a parameter of BeautifulSoup(). html5lib is a **parser**, which means its job is to read raw HTML code and help BeautifulSoup interpret its structure (for instance, which HTML tags are used and how they are nested). Like BeautifulSoup itself, html5lib is a separate package that needs to be installed in your environment.
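
The three-line version above does not check whether the request actually succeeded. A slightly more defensive sketch might look like the following. The helper name *fetch_html* and the 10-second timeout are assumptions made for this example, not part of the original lesson.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text at url, or raise an error if the request fails."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()   # raises requests.HTTPError for 4xx/5xx responses
    return response.text


# Example usage (requires network access):
# data = fetch_html("http://nowviskie.org")
```

With this helper, a missing page or server error stops the script with a clear message instead of silently handing an error page to BeautifulSoup.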

We now have an object called *soup* which includes the HTML from our site along with a set of additional functionality. We can use this functionality by calling **methods**. Methods are functions that belong to an object, and often manipulate data stored in that object. 
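
Methods are not unique to BeautifulSoup. Ordinary strings have them too, which makes for a simple illustration. The sample string below is hypothetical.

```python
title = "  digital humanities  "   # hypothetical string

cleaned = title.strip()     # method call: returns a copy without leading/trailing whitespace
shouted = cleaned.upper()   # method call: returns an upper-case copy
print(cleaned)
print(shouted)
print(len(cleaned))         # len() is a plain function, not a method
```

Notice the pattern: a method is written after the object and a dot (object.method()), while a plain function takes the object as an argument (function(object)).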

For instance, let's use the **.prettify() method** to output our HTML code as raw text:


In [None]:
print(soup.prettify())

Here is a "pretty" version of the code the *soup* object contains ("pretty" might be an overstatement). Now that we've confirmed our data has been successfully fetched and stored in *soup*, it's time to move on to Step 2.

### Step 2: Transform our web scraped data (give it structure, identify the important elements, run analyses, etc.)

What is our goal in scraping data from Nowviskie's blog?

This step requires that we have some underlying idea of what's important to us and how we intend to use this data. It may help to imagine what we want our output to look like (although we may revise this as our code gets more complicated, as we'll see). Let's imagine our goal in this case is:

* Generate a table with the most recent blog posts
* For now, we want to capture the *post title*, *post summary text*, and *link to the full blog post*

We can imagine this output as a table in .csv format, which includes heading labels, and each post represented as a single row.

Let's get started. First, we need to look at our website and understand how it's organized. There are a number of ways to do this; I prefer to use the "Inspect Element" feature in either Firefox or Chrome.

// **As a group, explore the HTML document and learn how blog posts are structured.** //

We should have discovered a **tag type** and a **class** for each individual blog post on the website.

The **.find_all() method** in our BeautifulSoup object (*soup*) allows us to create a subset of HTML tags and related data that match a search condition. The output of find_all() will be a list of tags that match these conditions. Let's generate a list of blog posts using this method:
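
Before trying this on the real blog, it may help to see find_all() on a tiny, made-up page. In this sketch the HTML, the tag names, and the class names are all invented, and Python's built-in "html.parser" is used instead of html5lib so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; the tag and class names are made up for this example
sample_html = """
<div class="entry"><h2>Post one</h2></div>
<div class="entry"><h2>Post two</h2></div>
<div class="sidebar"><h2>About</h2></div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
entries = sample_soup.find_all("div", class_="entry")

print(len(entries))   # only the two "entry" divs match, not the sidebar
for entry in entries:
    print(entry.h2.get_text())
```

The keyword is written class_ with a trailing underscore because class is a reserved word in Python.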

In [None]:
# In the text below, replace CLASS_TYPE with values for blog posts
# Note that you must surround this value with quotations, e.g. "name_of_class"

articles = soup.find_all('div', class_="CLASS_TYPE")

Let's take a look at what we've collected in the articles variable by using a print() statement

In [None]:
print(articles)

That's a lot of information! It's a little hard to tell what's going on here. Let's use the len() function to make sure that articles is a list that contains all of the blog posts on the website we just scraped.

(Note that this number won't refer to all posts in the entire website, but rather every post that displays on the front page.)

In [None]:
print(len(articles))

If you see the number "5", you have successfully stored data about the five blog posts into the articles list.

Now that we have our data in a list, we need to extract the specific data we're interested in. If you remember from earlier, we decided we want three data points per post:

* Title of the blog post
* Post text summary
* Link to the full article

To do this, we can loop over the articles in the article list. For each article, we can use the **.find(), .get(), and .get_text() BeautifulSoup methods** to extract the information we're interested in. 
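
Here is how the three methods differ, sketched on a made-up snippet of HTML. The tags, class name, and attribute below are hypothetical and are not taken from the blog we are scraping, and the built-in "html.parser" is used to keep the example self-contained.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML for a single post
sample_html = """
<div class="entry">
  <h2>Post one</h2>
  <time datetime="2017-05-01">May 1</time>
</div>
"""

entry = BeautifulSoup(sample_html, "html.parser").find("div", class_="entry")

heading = entry.find("h2")              # .find() returns the first matching tag
heading_text = heading.get_text()       # .get_text() returns the text inside a tag
time_tag = entry.find("time")
date_value = time_tag.get("datetime")   # .get() returns the value of an attribute

print(heading_text)
print(date_value)
```

In short: .find() locates a tag, .get_text() reads what is between its opening and closing tags, and .get() reads a named attribute from inside the opening tag.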

Let's start with a loop that iterates over articles (storing each article information into 'current_article' in each iteration), finds the title tag, and prints the text from the tag:

In [None]:
for current_article in articles:
    article_title = current_article.find('h1')
    article_title_text = article_title.get_text(" ")
    print(article_title_text)

In every iteration, the script finds the heading tag containing the article title, extracts the text within that tag, and prints out the extracted text.

How would you also print out the text summary for each post? Return to the HTML for the blog and identify the class of the tag containing the text for each post. (Hint: the type is "div")

Now, modify the articles loop to identify the post text tag, extract the text, and print it to the screen.

In [None]:
for current_article in articles:
    article_title = current_article.find('h1')
    article_title_text = article_title.get_text(" ")
    print(article_title_text)
    
    # Fill in the class name for articles here:
    
    post_summary = current_article.find('div', class_='REPLACEWITHCLASS')
    
    # How would you extract the text from post_summary? 
    # Fill in the right part of the statement below with the appropriate method
  
    post_summary_text = #FILL IN WITH METHOD for post_summary to generate text...
    
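    # The next lines drop non-ASCII characters and trim extra whitespace.
    # In Python 3, .encode() returns bytes, so .decode() turns the result back into a string.
    
    post_summary_text = post_summary_text.encode('ascii', 'ignore').decode('ascii').strip()
    print(post_summary_text)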
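
Stripping non-ASCII characters can be sketched in isolation. In Python 3, .encode() produces a bytes object rather than a string, so a matching .decode() is one way to get back to ordinary text. The sample string below is hypothetical.

```python
text = "  Café déjà vu  "   # hypothetical string with accents and extra spaces

as_bytes = text.encode("ascii", "ignore")   # bytes; non-ASCII characters are dropped
as_text = as_bytes.decode("ascii")          # back to an ordinary string
trimmed = as_text.strip()                   # remove leading/trailing whitespace

print(as_bytes)
print(trimmed)
```

Note that this approach throws accented letters away entirely. If you want to keep them, one alternative is to skip the encode step and write the file with a UTF-8 encoding instead.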

Now, let's add the URL for the link to the full article. On the blog website, this link appears if you hover over the article title. 

Let's grab the URL from the article title tag, which we already have from the first step! However, we need to look for the link tag *within* the title tag, and then extract the URL (as opposed to the text, which in this case would still be the title of the blog post). Where is the URL stored within a link tag in HTML?

In [None]:
for current_article in articles:
    article_title = current_article.find('h1')
    article_title_text = article_title.get_text()
    print(article_title_text)
    
    ## Copy your code for post_summary_text from above here (3 lines below)
    
    

    # Now, let's use the .find() method to identify the link <a> tag within the article title tag
    full_article_link = article_title.find('a')   
    
    # We need to specifically extract the part of the <a> tag that contains the URL.
    # The .get() command lets us get information about a specific attribute in a tag.
    # Fill in the appropriate attribute below
    
    full_article_URL = full_article_link.get('FILL IN WITH ATTRIBUTE THAT CONTAINS URL IN <A> tag')
    print(full_article_URL)

We now have a functioning script that fetches HTML from our target URL, identifies the set of blog posts on the page, loops over each one to extract three data points (title, post summary, and full article URL), and prints the results to the screen during each iteration. 

However, printing to a screen is often not our desired end result (another programming truism: print() is for humans. That is, it's for coders to see how their code is doing, not for the output or functionality of a program).

With this in mind, let's add in the functionality of writing our output to a .csv file.

### Step 3: Output the data into a helpful format (CSV)

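In Python, there are several ways to generate output files. A general pattern is to call the open() function to build a connection to an output file and store that connection in a variable (sometimes called the **file handle**). To write CSV rows, we then wrap the file handle in a csv.writer object.

Let's open the file as *out_file* and store the writer in *f*:

In [None]:
# Open the output file for writing; newline="" lets the csv module manage line endings itself
out_file = open("blog_output.csv", "w", newline="", encoding="utf-8")
f = csv.writer(out_file)

# When you have finished writing all of your rows, call out_file.close() so everything is saved to disk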
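
A common variation on this pattern is the *with* statement, which closes the file automatically once the indented block ends. The sketch below writes to a temporary folder so it does not touch your real output. The file name and the *demo_* variable names are assumptions made for this example.

```python
import csv
import os
import tempfile

# Write to a temporary folder so this sketch does not touch your real files
demo_path = os.path.join(tempfile.mkdtemp(), "demo_output.csv")

# The "with" block closes the file automatically when the block ends
with open(demo_path, "w", newline="", encoding="utf-8") as demo_file:
    demo_writer = csv.writer(demo_file)
    demo_writer.writerow(["Post Title", "URL"])
    demo_writer.writerow(["A sample title", "http://example.com/post"])

print(demo_file.closed)   # the file has been closed for us
```

In a notebook, where the file is opened in one cell and written to in later cells, keeping the file open across cells is a reasonable choice, as long as it is closed at the end.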


Now that the output file is open, we first want to write our headers to it. Fortunately, with Python and the csv library, this is a two-line process: 
* create a list of header labels, and 
* write them to the new .csv file through *f* using the csv library

Here is the code:

In [None]:
headers = ["Post Title", "Summary Text", "URL"]
f.writerow(headers)

We will follow a similar pattern from here on out, with one important difference: in addition to printing to the screen, we will write a row of data to the file at the end of each iteration.

To accomplish this, copy your code from the end of step 2 below, and include one additional file writing line at the end:

In [None]:
for current_article in articles:
    # Copy code from step 2 here
    # There should be code to generate all three variables used by f.writerow() below
    
    f.writerow([article_title_text, post_summary_text, full_article_URL])

Take a look at your working directory in Azure. You should now see a new file: blog_output.csv! Try downloading this file to your computer and opening in a program like Excel. :)
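
You can also check a CSV file from Python itself by reading it back in. The sketch below first creates a small sample file in a temporary folder so that it runs on its own. The file name and row contents are made up, and with your real data you would open blog_output.csv instead.

```python
import csv
import os
import tempfile

# Create a small sample file first so the sketch runs on its own
check_path = os.path.join(tempfile.mkdtemp(), "sample_output.csv")
with open(check_path, "w", newline="", encoding="utf-8") as out_file:
    sample_writer = csv.writer(out_file)
    sample_writer.writerow(["Post Title", "Summary Text", "URL"])
    sample_writer.writerow(["A sample title", "Some summary, with a comma", "http://example.com/post"])

# Read the file back, using the header row as dictionary keys
with open(check_path, newline="", encoding="utf-8") as in_file:
    rows = list(csv.DictReader(in_file))

print(len(rows))
print(rows[0]["Summary Text"])
```

Notice that the comma inside the summary text survives the round trip: the csv library quotes such values when writing and unquotes them when reading.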

### Step ?: Additional challenges

You've successfully implemented the web scraping pattern: fetch raw data from a website, transform and structure the data, and output it to a .csv file. As you become interested in different types of outputs, you may wish to build on this basic recipe, especially in the second "transform and structure" step. 

Below are some challenge questions, along with a transcript of the functioning code for the recipe above. We'll discuss this a bit in class, but feel free to explore further on your own.

#### Challenge #1: Introduce pagination (to fetch data from all pages instead of just the home page)

You may have noticed that the resulting .csv file captures only the blog posts on the first page. But what about additional pages? **Pagination** is the web crawling technique for dealing with data that is hidden behind additional pages -- in this case, behind the "Older Entries" link.

Try clicking on the Older Entries link at the bottom of the page. You may notice the address bar now reads: http://nowviskie.org/page/2/. Indeed, if you keep clicking back through older entries, you will find the page/ structure continues through the final page, http://nowviskie.org/page/15/. Further, you can use http://nowviskie.org/page/2 instead of http://nowviskie.org/page/2/, and you can also use http://nowviskie.org/page/1 to get to the first/home page. 

What's the significance of this? In our current code, you have used a single for loop to iterate over the articles on the homepage. To complete this challenge with all 15 pages, you will introduce an even larger loop that contains, in itself, the first few steps of the code: fetching from a url, constructing a BeautifulSoup object, and fetching all of the articles. You must, however, also move the file opening and header line writing steps *outside* of this loop, as these things still only happen once.
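
The page addresses themselves can be generated with range() before any scraping happens. Here is a minimal sketch that only builds the list of URLs and fetches nothing. The *page_urls* list is a hypothetical helper, and the page count of 15 comes from the description above.

```python
base_url = "http://nowviskie.org/page/"

page_urls = []
# range(1, 16) yields 1 through 15; the end value itself is not included
for page_num in range(1, 16):
    page_urls.append(base_url + str(page_num))

print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```

When you do fetch many pages in a loop, consider pausing briefly between requests (for example with time.sleep(1)) so that your script does not put unnecessary load on the site.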

Here's a template to get you started:

In [None]:
from bs4 import BeautifulSoup
import requests
import csv

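# Open the output file for writing; newline="" lets the csv module manage line endings itself
# (call out_file.close() after the page_num loop has finished)
out_file = open("blog_output_challenge1.csv", "w", newline="", encoding="utf-8")
f = csv.writer(out_file)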
# Write headers
headers = ["Post Title", "Summary Text", "URL"]
f.writerow(headers)

for page_num in range(1,16):
    url = "http://nowviskie.org/page/" + str(page_num)
    # Query our URL with requests and store the output into response
    response = requests.get(url)
    # Extract all of the text data from our query (in this case, raw HTML code) and store it into data
    data = response.text
    # Generate a soup object based on our HTML data that lets us navigate the structure of the website
    soup = BeautifulSoup(data, 'html5lib')

    articles = soup.find_all('div', class_="article")

    for current_article in articles:
        ## Your remaining code goes here...
        ## Remember to keep all of this code indented one tab in so it belongs to the for page_num loop
        ## (or two tabs if it's part of a second loop inside the page_num loop...)
        article_title = current_article.find('h1')
        article_title_text = article_title.get_text()
        print(article_title_text)

        post_summary = current_article.find('div', class_="content")
        post_summary_text = post_summary.get_text(" ")
        # .encode() returns bytes in Python 3, so decode back to a string before writing
        post_summary_text = post_summary_text.encode('ascii', 'ignore').decode('ascii').strip()
        print(post_summary_text)


        full_article_link = article_title.find('a')   
        full_article_URL = full_article_link.get('href')
        print(full_article_URL)

        f.writerow([article_title_text, post_summary_text, full_article_URL])


#### Challenge #2: Add a fourth category: summary post character count

In [None]:
# Hint: the len() function returns the number of items in a list. 
# Another way to think of a string is a list of alphanumeric characters...

## Remember, you will need to modify the headers list to include a fourth value 
# (and also modify the f.writerow() step in the for loop below)

headers = ["Post Title", "Summary Text", "URL"]
f.writerow(headers)

for current_article in articles:
    article_title = current_article.find('h1')
    article_title_text = article_title.get_text(" ")
    print(article_title_text)
    
    post_summary = current_article.find('div', class_="content")
    post_summary_text = post_summary.get_text(strip=True)
    post_summary_text = post_summary_text.strip()
    print(post_summary_text)   
    

    full_article_link = article_title.find('a')   
    full_article_URL = full_article_link.get('href')
    print(full_article_URL)
    
    f.writerow([article_title_text, post_summary_text, full_article_URL])



#### Challenge #3: Add a fifth category: vocabulary complexity (or average word length)

In [None]:
# You will want to expand on your solution to Challenge #1, so we've left the code box below blank
# Hint: You may find it important to count the number of words in a string.
# To do this, you will need to (1) split a string into a list of words, and (2) count the number of items in that list

# For #1, you can call the .split() method on a string.
# For instance, "Here is a sentence".split() will return ["Here", "is", "a", "sentence"]

# For #2, you may find the len() function handy once again!!

# Remember to modify your output file name to save to a different .csv (along with the header row and f.writerow() step)
# ...
# ...
# ...
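
If you get stuck, the hints above can be combined roughly as follows. This is only a sketch on a hypothetical sentence, with made-up variable names, and connecting it to your scraped summaries is left to you.

```python
sentence = "Here is a sentence"   # hypothetical text

words = sentence.split()
word_count = len(words)

# Add up the length of every word
total_letters = 0
for word in words:
    total_letters = total_letters + len(word)

average_length = total_letters / word_count
print(word_count, average_length)
```

Be aware that an empty summary has zero words, so a real script should check word_count before dividing.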

#### Challenge #4: Add a sixth category to count the instances of a specific word of your choice

In [None]:
# Your code here...!
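
There is more than one reasonable way to count a word, and the approaches can give different answers. The sketch below compares two of them on a hypothetical sentence. Which one fits depends on whether you want partial matches to count.

```python
text = "Maps of maps: a map is not the territory"   # hypothetical text

# Counts "map" wherever it appears, including inside longer words like "maps"
substring_hits = text.lower().count("map")

# Counts only items that are exactly "map" after splitting on whitespace
whole_word_hits = text.lower().split().count("map")

print(substring_hits, whole_word_hits)
```

Note that splitting on whitespace leaves punctuation attached (for example "maps:"), so a more careful version would strip punctuation from each word first.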

## Cheat Sheet

In [None]:
## Appendix

## Here is a cheat sheet featuring the values for the blanks above, in case you get stuck!

# To generate a list of articles:
# articles = soup.find_all('div', class_="article")

# To print out the post summary text:
#     post_summary = current_article.find('div', class_="content")
#     post_summary_text = post_summary.get_text(" ")
#     post_summary_text = post_summary_text.encode('ascii', 'ignore').decode('ascii').strip()

#     print(post_summary_text)

# To extract attribute containing URL from a link:
#     full_article_URL = full_article_link.get('href')

# Challenge 1 solution:

# from bs4 import BeautifulSoup
# import requests
# import csv

# f = csv.writer(open("blog_output_challenge1.csv", "w", newline=""), escapechar='\\')   # Open the output file for writing
# # Write headers
# headers = ["Post Title", "Summary Text", "URL"]
# f.writerow(headers)

# for page_num in range(1,16):
#     url = "http://nowviskie.org/page/" + str(page_num)
#     # Query our URL with requests and store the output into response
#     response = requests.get(url)
#     # Extract all of the text data from our query (in this case, raw HTML code) and store it into data
#     data = response.text
#     # Generate a soup object based on our HTML data that lets us navigate the structure of the website
#     soup = BeautifulSoup(data, 'html5lib')

#     articles = soup.find_all('div', class_="article")

#     for current_article in articles:
#         article_title = current_article.find('h1')
#         article_title_text = article_title.get_text(" ")
#         print(article_title_text)

#         post_summary = current_article.find('div', class_="content")
#         post_summary_text = post_summary.get_text()
#         post_summary_text = post_summary_text.encode('ascii', 'ignore').decode('ascii').strip()
#         print(post_summary_text)   


#         full_article_link = article_title.find('a')   
#         full_article_URL = full_article_link.get('href')
#         print(full_article_URL)

#         f.writerow([article_title_text, post_summary_text, full_article_URL])


