# Web Scraping

Web scraping is a technique for harvesting data from a webpage. *Scraping* (and its slightly more complex counterpart, **crawling**) is an effective way to get text data and process it for analysis. 

There are multiple techniques, but today we're going to use the requests library to gather HTML data and then use BeautifulSoup to process that HTML into the data we want. 

In this notebook, we will **scrape** data from the White House website. In particular, we will scrape the transcripts of press briefings related to COVID-19. We will scrape from this URL: https://www.whitehouse.gov/briefing-room/press-briefings/ 
We will also learn about libraries, caching, defining functions, and some basics of HTML. 

# Libraries
First, we need to **import** our libraries. 
A library in Python is essentially a chunk of code that we can import and use. It might contain new functions or object types that we can use in our code. If you think of programming with Python as spellcasting, each library is like a spellbook. When we "import" a library, we're telling Python to open that spellbook so we can cast its spells (call its functions) at will.

Libraries are often structured as a set of modules, and you can import just the pieces you need (see below, where we **import** BeautifulSoup **from** bs4). Importing a specific name lets you use it directly and makes it clear which part of the library your code relies on, like jumping to a specific chapter in a spellbook. 
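
Here is a small sketch of the two import styles. It uses the standard `math` module as a stand-in (it is not one of the libraries we need for scraping), simply because it is always available:

```python
# Style 1: import the whole library, then reach into it with a dot
import math

# Style 2: import one specific name from the library and use it directly
from math import sqrt

whole_library = math.sqrt(16)  # "open the spellbook, then find the spell"
direct_name = sqrt(16)         # "jump straight to the spell"

print(whole_library, direct_name)
```

Both lines call the same function; the only difference is how we refer to it in our code.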

There are many, many, many python libraries available. Good ones will have documentation available, though the quality of that documentation varies.  

It is standard practice to import any libraries you will use at the very beginning of your file. You can technically import a library any time before you need it, but it's helpful to group them all at the beginning so that you (and future re-users of your code) know what libraries need to be installed. (All libraries we'll use today should already be installed in our environment. We won't cover installation today, but we can help if you need it!)

# What libraries will we use today?

## requests
Requests allows us to "request" data from a data source like an API or a web page.

## json
This library allows us to read and write JSON files. JSON, which stands for JavaScript Object Notation, is a human- and machine-readable data interchange format. In Python, it's often used to store or exchange dictionaries and lists for later use. We will use this for caching, which we will explain later.
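
As a quick illustration, here is a minimal sketch of turning a dictionary into a JSON string and back again. The dictionary contents are made up for this example:

```python
import json

# A made-up dictionary, just for illustration
briefing = {
    "title": "Example briefing",
    "speakers": ["Speaker A", "Speaker B"],
    "word_count": 1200,
}

as_text = json.dumps(briefing)   # dictionary -> JSON string
restored = json.loads(as_text)   # JSON string -> dictionary

print(type(as_text))
print(restored == briefing)
```

The JSON string is plain text, which is what makes it easy to save to a file and load again later.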

## bs4 & BeautifulSoup
BeautifulSoup is a commonly used web scraping tool. It allows us to parse the HTML returned by the requests library. bs4 is the name of the Beautiful Soup 4 package, and BeautifulSoup is the class we import from it. 
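
To give a feel for what parsing looks like, here is a minimal sketch that runs on a tiny, made-up HTML string instead of a real page. The tag names, class name, and links are invented for this example and will not match the real White House pages:

```python
from bs4 import BeautifulSoup

# A tiny made-up page, standing in for the HTML that requests would return
html = """
<html>
  <head><title>Press Briefings</title></head>
  <body>
    <h2 class="headline"><a href="/briefing-1/">Briefing One</a></h2>
    <h2 class="headline"><a href="/briefing-2/">Briefing Two</a></h2>
  </body>
</html>
"""

# Parse the string using Python's built-in HTML parser
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.get_text()                 # text inside <title>
links = [a["href"] for a in soup.find_all("a")]    # href of every <a> tag
headlines = [h2.get_text(strip=True)
             for h2 in soup.find_all("h2", class_="headline")]

print(page_title)
print(links)
print(headlines)
```

On a real page, the main work is inspecting the HTML to figure out which tags and classes hold the data you want.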

## datetime
datetime is a standard library that comes with Python, and it allows us to parse dates and times as objects. 
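
For example, here is a minimal sketch of turning a date written as text into a datetime object. The date string and its format are assumptions made for this example; a real page may write its dates differently:

```python
import datetime

date_text = "March 15, 2021"  # made-up example date

# %B = full month name, %d = day of the month, %Y = four-digit year
briefing_date = datetime.datetime.strptime(date_text, "%B %d, %Y")

# Once it is an object, we can reformat it or compare it to other dates
iso_text = briefing_date.strftime("%Y-%m-%d")
is_after_february = briefing_date > datetime.datetime(2021, 2, 28)

print(iso_text)
print(is_after_february)
```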


In [1]:
import requests
import json
from bs4 import BeautifulSoup
import datetime

# Caching

## What is Caching?
Caching is the process of storing data so that future requests for that data can be served faster. In our context, it means saving the data we download, first as a Python dictionary, then as a JSON file. 

## Why Cache?
Caching is useful for a lot of different reasons:
1) Caching makes your code run faster. If we **request** data from an API or website, we will have to wait until all of it is downloaded. Loading our data from a cache is much faster.
2) Caching protects you from being blocked by APIs or websites. APIs often have rate limits to prevent abuse. Caching your data means that you won't have to make a request every time you run your code, which results in fewer queries and makes it less likely that you will be blocked.
3) Caching keeps a record of your data. You can edit the caching code so that it records the date and time the data was collected, allowing for comparisons.
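
As a sketch of the third point, one way to record when data was collected is to store a small dictionary for each URL instead of just the page text. The function name `make_cache_entry` and the keys `"text"` and `"retrieved"` are made up for this example; they are not part of the caching code used in this notebook:

```python
import datetime
import json

def make_cache_entry(page_text):
    """Bundle the page text with the time it was collected (hypothetical helper)."""
    return {
        "text": page_text,
        # isoformat() gives a plain string, which JSON can store
        "retrieved": datetime.datetime.now().isoformat(),
    }

cache = {}
cache["https://example.com/"] = make_cache_entry("<html>example</html>")

# The entry survives a round trip through JSON
restored = json.loads(json.dumps(cache))

# The timestamp string can be turned back into a datetime object for comparisons
retrieved_at = datetime.datetime.fromisoformat(
    restored["https://example.com/"]["retrieved"]
)
print(retrieved_at)
```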

## How Cache?
In order to get what we need, the cache needs to 1) be readable by python and 2) remain persistent outside the python script. The technique outlined here saves the cache as a python dictionary, then saves that dictionary as a JSON file that can be read the next time the script is run. 
The block of code below is pretty much endlessly reusable--just make sure you update your file name!

In [1]:
CACHE_FNAME = 'wh_cache.json'
# try to open the cache file
try:
    with open(CACHE_FNAME, 'r') as cache_file:  # open the file (it is closed automatically afterwards)
        cache_contents = cache_file.read()  # read the file
    CACHE_DICTION = json.loads(cache_contents)  # load the JSON string into a dictionary called CACHE_DICTION
# if the file doesn't exist yet or doesn't contain valid JSON, create an empty dictionary where we will store the results
except (FileNotFoundError, json.JSONDecodeError):
    CACHE_DICTION = {}


*You may notice that the variables CACHE_DICTION and CACHE_FNAME are uppercase rather than lowercase, which goes against what we've described as Python convention. That is because they are **global** variables, which we will need to access both inside and outside the functions we will define. Making them uppercase prompts the reader to pay attention to these variables throughout the rest of the code. 

In [None]:
#scrape one page and break it down

# Build your own Functions
You've been introduced to the pre-made functions in python, like print(), type(), and len(). But we haven't touched one of the most powerful tools in python, **function definition**, which allows us to create customized functions to suit our own needs. 

## When should I make a function? Why should I bother, when I can just write the code?
This may be confusing to new Python learners who are just starting out writing a little bit of code at a time. Once you start writing more complex scripts, however, defining functions can be a really helpful way of organizing your code and making it readable for other programmers (and for future you!). 

Let's start by making a simple function that turns a string into a list of words. *(Note that python doesn't know what we mean by 'word' here. How do we make it do what we want?)*

## Anatomy of a function
Input: Also known as arguments, this is what you will give to your function to work with. In our example, the function will take a string as input.

Process: The "meat" of your function. What does it do with the input? You can include conditionals, loops, and try/except blocks. *(It might be helpful to include print statements as you're creating a function to make sure it's doing what you want!)*

Return: The object that the function returns. For example, if the function is testing whether a string contains the word "catfish", it might return a boolean value (True or False); or if the function is calculating word frequency, it might return a dictionary with the word as a key and the number of times it appears as its value.


In [3]:
def split_words(text):
    text_list = text.split()
    return text_list

In [4]:
split_words("It was on a dreary night of November that I beheld my man completed.")

['It',
 'was',
 'on',
 'a',
 'dreary',
 'night',
 'of',
 'November',
 'that',
 'I',
 'beheld',
 'my',
 'man',
 'completed.']

In [9]:
frankenstein = split_words("It was on a dreary night of November that I beheld my man completed.")
if len(frankenstein) > 5:
    print("this sentence is long!")
elif len(frankenstein) < 5:
    print("this sentence is short!")
else:
    print("I have no idea what's happening!!")

this sentence is long!
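
The word-frequency function mentioned in the anatomy section could look something like the sketch below. The name `word_frequency` is made up for this example, and, like split_words, it treats anything separated by whitespace as a "word":

```python
def word_frequency(text):
    # input: a string
    counts = {}
    # process: lowercase the text, split it into words, and count each one
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    # return: a dictionary with each word as a key and its count as the value
    return counts

frequencies = word_frequency("the cat and the hat")
print(frequencies)
```

Notice that punctuation is not removed, so "hat" and "hat." would be counted as two different words.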


Now write a function that will test whether the word "catfish" appears in a string:

In [6]:
def is_catfish(): #what is the input?
#     process: what does the function do?
    return #what does the function return?

Then run this code to see if it works:

In [None]:
s1 = "canning of catfish"
s2 = "This is my cat! His name is Fish."

print(is_catfish(s1))
print(is_catfish(s2))
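
If you get stuck, here is one possible solution (there are others). It assumes we want to ignore capitalization, which the exercise does not actually specify:

```python
def is_catfish(text):
    # input: a string
    # process: lowercase the string and check whether "catfish" appears in it
    # return: a boolean (True or False)
    return "catfish" in text.lower()

s1 = "canning of catfish"
s2 = "This is my cat! His name is Fish."

print(is_catfish(s1))
print(is_catfish(s2))
```

The second string contains both "cat" and "Fish", but never the single word "catfish", so the function returns False for it.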

In the next cell you'll find my function to write new data to the cache. Read through it and try to unpack its parts (input, process, return). 


In [None]:
def write_to_cache(url, req_text):
    CACHE_DICTION[url] = req_text
    dumped_json_cache = json.dumps(CACHE_DICTION)
    fw = open(CACHE_FNAME,"w")
    fw.write(dumped_json_cache)
    fw.close()
    return None

Input: 1) the URL given to requests, and 2) a variable called "req_text", a string containing the results of that request. 

Process: 1) create a dictionary entry in CACHE_DICTION with the url as the key and req_text as a value. 2) convert CACHE_DICTION to a json string and save it to the cache filename defined earlier. 3) close the file. 

Return: None! This function doesn't need to return anything, so we set return to None. 
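
To show how a function like this might be used, here is a sketch of a companion function that checks the cache before downloading anything. The names `get_page`, `download`, and `fake_download`, and the temporary cache file, are all made up for this example. In the notebook, `download` could be something like `lambda url: requests.get(url).text`; a fake downloader is used here so the sketch runs without an internet connection:

```python
import json
import os
import tempfile

# Hypothetical cache file in the system temp folder, so this sketch leaves wh_cache.json alone
CACHE_FNAME = os.path.join(tempfile.gettempdir(), "example_cache.json")
CACHE_DICTION = {}

def write_to_cache(url, req_text):
    # same idea as the function above: update the dictionary, then save it as JSON
    CACHE_DICTION[url] = req_text
    with open(CACHE_FNAME, "w") as fw:
        fw.write(json.dumps(CACHE_DICTION))

def get_page(url, download):
    """Return the text for url, calling download(url) only if it is not cached yet."""
    if url in CACHE_DICTION:
        return CACHE_DICTION[url]   # already cached: no request needed
    text = download(url)            # not cached: fetch it once...
    write_to_cache(url, text)       # ...and remember it for next time
    return text

# A fake downloader that records every time it is called
calls = []
def fake_download(url):
    calls.append(url)
    return "<html>hello</html>"

first = get_page("https://example.com/", fake_download)
second = get_page("https://example.com/", fake_download)
print(len(calls))
```

The second call is answered from the cache, so the downloader only runs once.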

# Scraping Time!

Now that we've got our caching functions set, we can get to the fun part: web scraping!

But first: let's look at our data source. 

In [None]:
#scrape one page

In [None]:
#parse with beautiful soup

# Crawling!

Crawling is when you use links found in HTML pages to **crawl** onto those pages and scrape them in turn. You can think of it as request chaining: you scrape a page with a feed for all the links, then scrape each of those pages. This allows us to get a lot of data quickly without having to copy and paste a lot of URLs.  
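
Here is a minimal sketch of that request chaining. To keep it runnable offline, it crawls a made-up "website" stored in a dictionary; `FAKE_SITE`, `fetch`, `crawl`, and the example.com URLs are all invented for this example. With a real site, `fetch` would make a request (ideally through the cache) instead of looking up a dictionary:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# A made-up site: one index page that links to two transcript pages
FAKE_SITE = {
    "https://example.com/briefings/":
        '<a href="/briefings/one/">One</a> <a href="/briefings/two/">Two</a>',
    "https://example.com/briefings/one/": "<p>Transcript of briefing one.</p>",
    "https://example.com/briefings/two/": "<p>Transcript of briefing two.</p>",
}

def fetch(url):
    """Stand-in for downloading a page; returns the HTML for url."""
    return FAKE_SITE[url]

def crawl(index_url):
    # Step 1: scrape the index page and collect its links
    index_soup = BeautifulSoup(fetch(index_url), "html.parser")
    pages = {}
    for link in index_soup.find_all("a", href=True):
        # Links are often relative, so combine them with the index URL
        page_url = urljoin(index_url, link["href"])
        # Step 2: scrape each linked page in turn
        page_soup = BeautifulSoup(fetch(page_url), "html.parser")
        pages[page_url] = page_soup.get_text(strip=True)
    return pages

transcripts = crawl("https://example.com/briefings/")
for url, text in transcripts.items():
    print(url, "->", text)
```

When crawling a real site, it is also worth pausing between requests so you do not overload the server.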