# Scraping lecture

We have some information about pages we want to scrape in a file called `bills.json`. The ultimate goal is to download the full text of each bill and count the number of words.

## Import modules

In [1]:
# parse json file
import json

# what we need for scraping
import requests # request HTTP
from bs4 import BeautifulSoup # parse HTML

# helpful modules for cleaning up text
import re
import string

# good ole pandas to structure our data
import pandas as pd

## Bring in the data

In [2]:
with open('bills.json') as file:
    bills = json.load(file)

I've commented out the below code because a lot of text gets printed out; watch the lecture screen to view the results.

In [None]:
# # this is a way to 'pretty-print' a JSON file
# print(json.dumps(bills, indent=2))
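If you want a quick look at the structure without flooding your screen, one option is to pretty-print only a slice of the list. Here's a minimal sketch; `sample_bills` is a hypothetical stand-in for `bills`, and its second record is made up purely for illustration.

```python
import json

# Hypothetical stand-in for `bills`: the first record is the one used in this
# notebook; the second is invented purely for illustration.
sample_bills = [
    {"congress": 116, "chamber": "house",
     "bill_url": "https://www.congress.gov/bill/116th-congress/house-bill/133/text?r=1&s=3",
     "bill_number": 133},
    {"congress": 116, "chamber": "house",
     "bill_url": "https://example.com/placeholder", "bill_number": 0},
]

# Slicing keeps the output short: only the first record is printed.
preview = json.dumps(sample_bills[:1], indent=2)
print(preview)
```

With the real data, the same idea would be `print(json.dumps(bills[:1], indent=2))`.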

## Start with a test page

We'll start with the first item in the `bills` list.

In [3]:
test_bill = bills[0]
test_bill

{'congress': 116,
 'chamber': 'house',
 'bill_url': 'https://www.congress.gov/bill/116th-congress/house-bill/133/text?r=1&s=3',
 'bill_number': 133}

Create a variable called `test_url` that gets the value of `bill_url` from `test_bill`:

In [4]:
test_url = test_bill['bill_url']
test_url

'https://www.congress.gov/bill/116th-congress/house-bill/133/text?r=1&s=3'

Before we download this page, let's look at the HTML and see if we can find where the bill exists in the HTML.

### Request the url

In [5]:
test_page = requests.get(test_url)

### Save the HTML so we don't have to re-download it later

If you're going to scrape tens or hundreds or thousands of URLs, it could be helpful to save the HTML so you don't have to re-download thousands of pages later. I don't want to clutter up this coding folder so I'm going to create a new directory to save all these pages.

One very cool thing about Jupyter notebooks is that you can execute some basic terminal commands by using an exclamation point. Below, I'm going to create a new directory called 'pages'. When you use the `-p` flag, you won't get an error if the directory already exists.

In [6]:
!mkdir -p pages

I included `pages` in your `.gitignore` file, which means the folder is saved on your hard drive but git won't track it or push it.

In [7]:
# save the test page so i don't have to dl again
# (an explicit encoding avoids errors on systems whose default isn't UTF-8)
with open('pages/test_page.html', 'w', encoding='utf-8') as f:
    f.write(test_page.text)
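Saving the page only pays off if later runs actually read the saved copy. Here's a rough sketch of that idea, not the only way to do it: `load_or_fetch` is a hypothetical helper name, and `fetch` is any function you pass in that returns the page's HTML as a string.

```python
from pathlib import Path

def load_or_fetch(path, fetch):
    """Return cached HTML from `path`, calling `fetch()` only when no cache exists."""
    path = Path(path)
    if path.exists():
        # Cache hit: read the saved copy instead of downloading again.
        return path.read_text(encoding="utf-8")
    html = fetch()  # only runs on a cache miss
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```

In this notebook, a call might look like `load_or_fetch('pages/test_page.html', lambda: requests.get(test_url).text)`. Passing the download step in as a function keeps the helper independent of `requests`, which also makes it easy to try out with fake data.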

### Parse the test page with Beautiful Soup

We'll have Beautiful Soup use `html.parser`, the HTML parser built into Python's standard library. Parsing turns the raw HTML string into a tree of nested elements that we can search.

In [8]:
test_soup = BeautifulSoup(test_page.text, features='html.parser')

_The following code is commented out because it outputs a LOT of text. Look at the screen during lecture to see what the output looks like._

In [9]:
# test_soup

### Find and get what's inside `id='billTextContainer'`
Because we know that all of a bill's text is contained within an element with the ID of 'billTextContainer', we can use bs4's `.find(id='')` method:

In [10]:
bill_text_container = test_soup.find(id='billTextContainer')

Remember that the result is still a Beautiful Soup object (a `Tag`), not a plain string:

In [11]:
type(bill_text_container)

bs4.element.Tag

_The following code is commented out because it outputs a LOT of text. Look at the screen during lecture to see what the output looks like._

In [12]:
# bill_text_container

If we want to extract only the text, we'll use the bs4 method `.get_text()`:

In [15]:
bill_text = bill_text_container.get_text()

_The following code is commented out because it outputs a LOT of text. Look at the screen during lecture to see what the output looks like._

In [16]:
# bill_text

What is the type of `bill_text`?

In [17]:
type(bill_text)

str

In [18]:
len(bill_text)

7692027
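One thing to watch for once you scrape more than one page: `.find()` returns `None` when nothing matches, and calling `.get_text()` on `None` raises an error. The sketch below shows one way to guard against that. The HTML snippets are tiny made-up examples, and `extract_bill_text` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Tiny made-up HTML snippets; the real page is far larger.
html_with_bill = '<div id="billTextContainer"><pre>SEC. 1. SHORT TITLE.</pre></div>'
html_without_bill = '<div id="somethingElse">No bill text here</div>'

def extract_bill_text(html):
    """Return the text inside id='billTextContainer', or None if it is absent."""
    soup = BeautifulSoup(html, features='html.parser')
    container = soup.find(id='billTextContainer')
    if container is None:  # .find() returns None when nothing matches
        return None
    return container.get_text()

print(extract_bill_text(html_with_bill))
print(extract_bill_text(html_without_bill))
```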

### Clean up `bill_text`

The text is pretty messy. We want to:
- replace punctuation with spaces
- replace newlines with spaces (`\n` means "newline")
- replace 2+ spaces with 1 space

#### Replace punctuation with space

In [None]:
# got the code from here: https://stackoverflow.com/a/37221663
punctuation_table = str.maketrans({key: ' ' for key in string.punctuation})
bill_text_cleaned = bill_text.translate(punctuation_table)  

Read more about these string methods in the Python documentation:

- [str.maketrans()](https://docs.python.org/3.3/library/stdtypes.html#str.maketrans)
- [str.translate()](https://docs.python.org/3.3/library/stdtypes.html#str.translate)

_The following code is commented out because it outputs a LOT of text. Look at the screen during lecture to see what the output looks like._

In [None]:
# bill_text_cleaned

#### Replace newlines with space

In [None]:
bill_text_cleaned = re.sub('\\n', ' ', bill_text_cleaned)

In [None]:
# bill_text_cleaned

#### Replace multiple spaces with one space

In [None]:
# the r'' prefix makes this a raw string, so Python passes \s to the regex engine as-is
bill_text_cleaned = re.sub(r'\s{2,}', ' ', bill_text_cleaned)

In [None]:
# bill_text_cleaned
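The three cleaning steps can also be bundled into one function, which is handy if you want to apply them to more than one piece of text. This is a sketch under a couple of assumptions: `clean_text` is a hypothetical name, it uses `str.replace()` for the newline step instead of `re.sub()`, and it adds a final `.strip()` that the cells above don't do. The sample string is made up.

```python
import re
import string

# Map every punctuation character to a space.
PUNCTUATION_TABLE = str.maketrans({key: ' ' for key in string.punctuation})

def clean_text(text):
    """Apply the three cleaning steps in order and return the result."""
    text = text.translate(PUNCTUATION_TABLE)  # punctuation -> space
    text = text.replace('\n', ' ')            # newlines -> space
    text = re.sub(r'\s{2,}', ' ', text)       # 2+ whitespace characters -> one space
    return text.strip()                       # assumption: also trim the ends

sample = "SEC. 2.\n(a) In general.--The  Act"
print(clean_text(sample))  # SEC 2 a In general The Act
```

Notice that `(a)` becomes a standalone `a`, so it will count as its own word later.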

What are some problems you see in the final `bill_text_cleaned`? Do you think it's OK for the purposes of this project?

### Word count

#### Get the word count

You can get the word count of a string by splitting it. By default, `str.split()` splits on any run of whitespace (spaces, tabs, newlines), which leaves you with a list of words. The length of that list, which you get with `len()`, is the number of words in the string.

In [None]:
bill_word_count = len(bill_text_cleaned.split())

In [None]:
bill_word_count
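To see why the default `split()` is the one we want here, compare it with splitting on a single space. The string below is a made-up example.

```python
text = "one  two  three\nfour"

words = text.split()        # no argument: split on any run of whitespace
by_space = text.split(' ')  # explicit ' ': split on every single space

print(words)     # ['one', 'two', 'three', 'four']
print(by_space)  # ['one', '', 'two', '', 'three\nfour']
print(len(words), len(by_space))  # 4 5
```

Splitting on `' '` leaves empty strings wherever there are two spaces in a row and doesn't separate words joined by a newline, so its length is not a reliable word count.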


#### Create the dataframe

Let's make a pandas dataframe where we can save the word count.

The neat thing about `bills` is that it's already structured in a way that makes it very easy to create a dataframe. It's a list of dictionaries that only have one level. (If this doesn't sound familiar to you, you might want to brush up on [lists and dictionaries in the Python documentation](https://docs.python.org/3/tutorial/datastructures.html).)

In [None]:
bills_df = pd.DataFrame(bills)
bills_df

##### Create a new column, method 1

In [None]:
new_columns = list(bills_df.columns) + ['word_count']
new_columns

In [None]:
# reuse the list we built in the previous cell
bills_df = bills_df.reindex(columns=new_columns)

In [None]:
bills_df

##### Create a new column, method 2
You need to `import numpy as np` for this but it's easier!

In [None]:
import numpy as np
bills_df['word_count'] = np.nan

In [None]:
bills_df

#### Save the word count

How do I update Bill 133's 'word_count'? 

You'll use `df.loc`:

```python
df.loc[subset_expression, 'column_to_change'] = new_value
```
In effect, you're subsetting the dataframe and applying a value to a column.

In the below code, we subset for rows where 'bill_number' is 133: `bills_df['bill_number'] == 133`


In [None]:
bills_df.loc[bills_df['bill_number'] == 133, 'word_count'] = bill_word_count

In [None]:
bills_df[bills_df['bill_number'] == 133]

In [None]:
bills_df
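If the `.loc` pattern feels abstract, it can help to look at the subset expression on its own. Here's a small sketch on a toy dataframe; `demo_df`, its records, and the word count of 1500 are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up records and a made-up word count, for illustration only.
demo_df = pd.DataFrame([
    {'chamber': 'house', 'bill_number': 1},
    {'chamber': 'house', 'bill_number': 2},
])
demo_df['word_count'] = np.nan

mask = demo_df['bill_number'] == 2      # Boolean Series: True only for bill 2
demo_df.loc[mask, 'word_count'] = 1500  # assign to that row's 'word_count'
print(demo_df)
```

The subset expression is just a column of `True`/`False` values, and `.loc` assigns the new value only to the rows where it is `True`. Rows that don't match keep their `NaN`.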

## Time for a loop

We wrote all the code for ONE test page. But we have more than one item in `bills`.

### How do we loop through bills?

In [None]:
for bill in bills:
    pass
    # print(bill)

At this point, it'll be useful to check out the Table of Contents of this notebook in JupyterLab. What are the steps we need to take?

- Request the URL
- Save the HTML of the URL
- Parse the page with bs4
- Find and get what's inside `id='billTextContainer'`
- Clean up the bill text
  - Replace punctuation with space
  - Replace newlines with space
  - Replace multiple spaces with one space
- Get the word count
- Save the word count in the dataframe

We're going to switch up a couple things though. The following steps only need to be done once, so they should be executed BEFORE we go through the loop.
- Create the folder for saving all the HTML
- Create the dataframe to save all the information

We'll write the loop in a new notebook for classwork: [`scraping_classwork.ipynb`](scraping_classwork.ipynb).

But before we do, I want to introduce you to another Python module that is really helpful when you're scraping: `tqdm`.

## tqdm

You can wrap `tqdm()` around any iterable (list, array, etc.) to create a progress bar.

In [None]:
from tqdm.notebook import tqdm
from time import sleep # sleep() pauses the loop so we can watch the bar fill up

In [None]:
for n in tqdm(range(20)):
    sleep(0.2)