# Beyond the Standard Library!

Python's standard library is very powerful, but sometimes we need more. We will scrape the Internet Archive to get data on the progress of Jill Stein's [crowdfunding campaign](http://jillstein.nationbuilder.com/recount) for recounts in the 2016 U.S. presidential election. We're also going to look at some data from
a [survey on fanfiction](https://kingsbsd.github.io/scraping_task/).

We're going to use the
[Requests](http://docs.python-requests.org/en/master/) library to download the data, then analyse it
with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) and the
[Natural Language Tool-Kit](http://www.nltk.org/book/) (NLTK), and display it using [MatPlotLib](http://matplotlib.org/).

In [None]:
# Standard library:
from datetime import datetime
import re
import urllib.request

# https://www.crummy.com/software/BeautifulSoup/
from bs4 import BeautifulSoup

# http://matplotlib.org/
import matplotlib.pyplot as plt
import matplotlib.dates as md

# http://www.nltk.org/book/
import nltk

# http://docs.python-requests.org/en/master/
import requests

## Warm-Up: (X-)File IO

The next cell will use the standard library to download an
[episode guide to the X-Files](http://www.textfiles.com/media/xfepgd.txt) from [www.textfiles.com](http://www.textfiles.com/). It
contains a lot of text, including short lists of episodes for each season. Your job is to extract these lists
and save them to another file. Each line in the lists starts with a code, `2X13` for example. The function
`get_x_code` will return a code if a line of text starts with a code, or `False` if it doesn't. It uses regular
expressions, but don't worry about them (yet).

In [None]:
urllib.request.urlretrieve('http://www.textfiles.com/media/xfepgd.txt', 'xfepgd.txt')

In [None]:
def get_x_code(line):
    matches = re.findall('^[1-3]X[0-9]{1,2}',line)
    if matches:
        return matches[0]
    else:
        return False
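
To get a feel for what `get_x_code` returns, here is a small sketch that runs the same pattern over a few lines. The sample lines are made up for illustration and are not taken from the episode guide.

```python
import re

def get_x_code(line):
    # Same pattern as the cell above: a season digit 1-3, an "X", then 1-2 digits,
    # anchored to the very start of the line.
    matches = re.findall(r'^[1-3]X[0-9]{1,2}', line)
    return matches[0] if matches else False

# Hypothetical sample lines (not real rows from xfepgd.txt).
sample_lines = [
    '2X13 Some Episode Title',
    'Season Two',
    '  1X01 indented, so the code is not at the start',
]
codes = [get_x_code(line) for line in sample_lines]
print(codes)  # ['2X13', False, False]
```

Because `False` is "falsy" and a non-empty string is "truthy", the return value can go straight into an `if` statement.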

Have a quick look at the Python documentation for
[reading and writing files](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files). The
next cell shows an example of opening the file `xfepgd.txt` for reading and iterating over each line. Modify
it so that `episodes` is populated with each line that contains an episode starting with a code.

In [None]:
episodes = []
with open('xfepgd.txt','r') as f:
    for line in f.readlines():
        pass

All you need to do now is modify the code slightly to *write* the `episodes` list to a file
called `x-files-episodes.txt`. The first two lines of the file should be the `header` list. Use the `write`
method of a file object. You should see your new file in the notebook server.

In [None]:
header = ['Code Episode                        Air Date  Rebroadcasts',
    '---- -------                        --------  ------------']
# You're on your own now.
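
If you get stuck, the sketch below shows the general shape of writing a list of strings with `write`. It uses made-up data and a temporary file, so `header_demo`, `episodes_demo` and `out_path` are illustrative names, not part of the exercise.

```python
import os
import tempfile

# Hypothetical stand-ins for the notebook's `header` and `episodes` lists.
header_demo = ['Code Episode', '---- -------']
episodes_demo = ['1X01 First Example\n', '1X02 Second Example\n']

# Write into a temporary directory so nothing in the notebook folder is touched.
out_path = os.path.join(tempfile.mkdtemp(), 'x-files-demo.txt')
with open(out_path, 'w') as f:
    for h in header_demo:
        f.write(h + '\n')  # the header strings have no newline of their own
    for line in episodes_demo:
        f.write(line)      # lines read from a file already end in '\n'

# Read the file back to check what was written.
with open(out_path) as f:
    written = f.read().splitlines()
print(written)
```

The thing to watch is the newline: `write` does not add one for you.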

## The Jill Stein Recount

Look up [jillstein.nationbuilder.com/recount](http://jillstein.nationbuilder.com/recount) on the
[Internet Archive](https://archive.org/web/). There are lots of snapshots from November 2016, and we can track how
the amount of money raised and requested evolved with time. It would be nice to be able to automate this.
The [Internet Archive](https://archive.org/web/) provides us with an
[API](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server) for getting data on the snapshots
the archive holds for a given web page. We'll access
it using the Requests library. We'll make a request and use the `.json()` method of the response object to return all the JSON data
(JavaScript Object Notation) from the API call as standard Python collections.

Here's what it should look like:

```
[['urlkey',
  'timestamp',
  'original',
  'mimetype',
  'statuscode',
  'digest',
  'length'],
 ['com,nationbuilder,jillstein)/recount',
  '20161123205129',
  'https://jillstein.nationbuilder.com/recount',
  'text/html',
  '200',
  'W6HBNX5QXRX4X3OGHN6VSUTSMIKVOLSY',
  '13059'],
 ['com,nationbuilder,jillstein)/recount',
  '20161123211323',
  'https://jillstein.nationbuilder.com/recount',
  'text/html',
  '200',
  '32B2DRBBQ525UHKR3KFA5G56YRTUZV2L',
  '15287']]
```

In [None]:
req = requests.get("https://web.archive.org/cdx/search/cdx?url=jillstein.nationbuilder.com/recount&output=json")
req.json()[0:3]

## Getting the Timestamps

We get a list of lists. The first one is a list of headings for all the other lists, which contain the data. We
need the timestamp of each snapshot, which is the second item in each list. Build a list called `timestamps` from
the second list onwards. Use a `for` loop if you like. Here's what the first 5 should look like:

```
['20161123205129', '20161123211323', '20161123234049', '20161123234501', '20161124001441']
```

In [None]:
timestamps = []

In [None]:
print(timestamps[0:5])
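
Here is one way the extraction could look, sketched on the three sample rows shown earlier rather than on a live API call. The names `rows` and `timestamps_demo` are illustrative. Looking up the column by its heading is a design choice, and indexing with `[1]` would work just as well.

```python
# The header row plus the two data rows from the sample output above.
rows = [
    ['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'length'],
    ['com,nationbuilder,jillstein)/recount', '20161123205129',
     'https://jillstein.nationbuilder.com/recount', 'text/html', '200',
     'W6HBNX5QXRX4X3OGHN6VSUTSMIKVOLSY', '13059'],
    ['com,nationbuilder,jillstein)/recount', '20161123211323',
     'https://jillstein.nationbuilder.com/recount', 'text/html', '200',
     '32B2DRBBQ525UHKR3KFA5G56YRTUZV2L', '15287'],
]

# Find the position of the 'timestamp' column from the header row.
ts_index = rows[0].index('timestamp')

timestamps_demo = []
for row in rows[1:]:  # skip the header row
    timestamps_demo.append(row[ts_index])

print(timestamps_demo)  # ['20161123205129', '20161123211323']
```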

## Getting the Snapshots

We need to download the snapshots at each timestamp. The URL of the first snapshot is:

```
http://web.archive.org/web/20161123205129/https://jillstein.nationbuilder.com/recount
```

Complete the `web_archive_request` function so that it returns a response object for a given timestamp using
the Requests library.

In [None]:
def web_archive_request(timestamp, url='https://jillstein.nationbuilder.com/recount'):
    return None
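
The snapshot URL follows a simple pattern: the archive prefix, then the timestamp, then the original URL. The sketch below only builds that string. `snapshot_url` is a hypothetical helper name, and the pattern is inferred from the single example URL above.

```python
def snapshot_url(timestamp, url='https://jillstein.nationbuilder.com/recount'):
    # Assumed pattern: http://web.archive.org/web/<timestamp>/<original url>
    return 'http://web.archive.org/web/{}/{}'.format(timestamp, url)

first_url = snapshot_url('20161123205129')
print(first_url)
# http://web.archive.org/web/20161123205129/https://jillstein.nationbuilder.com/recount
```

Passing a string like this to `requests.get` would then perform the actual download.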

The `content` of the response for the first timestamp should be a big mess of HTML.

In [None]:
recount_1 = web_archive_request('20161123205129').content
print(recount_1)

Let's have a look at one of the
[snapshots](https://web.archive.org/web/20161129012122/https://jillstein.nationbuilder.com/recount). The raised
and goal amounts are contained in pairs of div tags (`<div></div>`) with the CSS classes of `bar-text` and
`bar-goal`. (Try "view source" in your browser.) We can use the BeautifulSoup library to extract this text. Here it is for the first snapshot:
    
```
$87,122.13 raised GOAL: $2,500,000.00
```

In [None]:
recount_soup_1 = BeautifulSoup(recount_1, 'html.parser')
raised_1 = recount_soup_1.find(class_='bar-text')
goal_1 = recount_soup_1.find(class_='bar-goal')
print(raised_1.text, goal_1.text)

## Extracting Values from Text

We need to turn the texts into floating point values. Complete the function `get_dollar_ammount`.
`float('87122.13')` works, but the commas in `float('87,122.13')` will make it fail. We need to get rid of
everything but the numbers and the decimal point. Would the `split()` and `join()` string methods help?

In [None]:
def get_dollar_ammount(text):
    return float(0)

In [None]:
get_dollar_ammount(raised_1.text)
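
As an alternative to `split()` and `join()`, here is a sketch that uses a regular expression instead. `parse_dollars` is a hypothetical name, and it assumes the first number in the text is the amount we want.

```python
import re

def parse_dollars(text):
    # Find the first run of digits and commas, with an optional decimal part.
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if match is None:
        raise ValueError('no number found in: ' + repr(text))
    # Drop the thousands separators before converting.
    return float(match.group(0).replace(',', ''))

print(parse_dollars('$87,122.13 raised'))    # 87122.13
print(parse_dollars('GOAL: $2,500,000.00'))  # 2500000.0
```

Raising an error when no number is found is deliberate: it lets a surrounding `try`/`except` block decide what to do with a bad snapshot.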

## Scraping the Data

We've almost got everything we need to scrape the data and display it. Complete the function `scrape_ammounts`,
which takes the list of timestamps and returns the lists of raised and goal amounts in a tuple. Create a
BeautifulSoup object from the content of the response for each timestamp. Get the contents of the div tags
with the classes `bar-text` and `bar-goal`. You've already written all the functions you need, and examples
of using the `find` method of a BeautifulSoup object have already been given. Not all of the snapshots will contain the
right data, so you need to use try-except blocks to avoid runtime errors. Here are the first 5 data points:
    
```
[87122.13, 131526.2, 626916.47, 646386.47, 780759.0]
[2500000.0, 2500000.0, 2500000.0, 2500000.0, 2500000.0]
```

In [None]:
def scrape_ammounts(times):
    raised = []
    goal = []
    for t in times:
        soup = None
        try:
            raised.append()
        except Exception:
            #print("Can't get raised amount for timestamp: "+t)
            pass
        try:
            goal.append()
        except Exception:
            #print("Can't get goal amount for timestamp: "+t)
            pass
    return raised,goal
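
One thing to watch with two separate try-except blocks: if only one of the two values is missing for a snapshot, the two lists end up with different lengths and no longer line up with the timestamps. The sketch below shows one way to keep everything aligned by skipping a snapshot entirely when either value fails. All the data and names here (`snapshots`, `parse_amount`, `kept_times`) are made up for illustration.

```python
def parse_amount(text):
    # Hypothetical stand-in parser: strip '$' and ',' then convert to float.
    return float(text.replace('$', '').replace(',', ''))

# Made-up (label, raised text, goal text) triples. None marks a snapshot
# where the div was missing, as `find` would return None in that case.
snapshots = [
    ('t1', '$87,122.13', '$2,500,000.00'),
    ('t2', None, '$2,500,000.00'),
    ('t3', '$626,916.47', '$2,500,000.00'),
]

kept_times, raised_demo, goal_demo = [], [], []
for stamp, raised_text, goal_text in snapshots:
    try:
        r = parse_amount(raised_text)
        g = parse_amount(goal_text)
    except (AttributeError, ValueError):
        # Skip the whole snapshot so the three lists stay the same length.
        continue
    kept_times.append(stamp)
    raised_demo.append(r)
    goal_demo.append(g)

print(kept_times)   # ['t1', 't3']
print(raised_demo)  # [87122.13, 626916.47]
```

Catching specific exceptions, rather than everything, also means a genuine bug elsewhere still shows up as an error.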

In [None]:
raised, goal = scrape_ammounts(timestamps)

In [None]:
print(raised[0:5])
print(goal[0:5])

## Plot the Data

You might not have written the most concise, elegant code, but that would have taken longer. Besides, we didn't
want code, we wanted data. Matplotlib is hard to use, so just run the cell below to plot a graph and see how the
amount raised responded to the amount requested. The Internet Archive, the Requests library and BeautifulSoup are
a powerful combination. What would *you* do with them?

In [None]:
%matplotlib inline
n = len(raised)
times = [datetime.strptime(t, '%Y%m%d%H%M%S') for t in timestamps[0:n]]
raised_millions = [i/1E6 for i in raised]
goal_millions = [i/1E6 for i in goal]
fig, ax = plt.subplots()
fmt = md.DateFormatter('%m/%d %H:%M')
ax.xaxis.set_major_formatter(fmt)
plt.xticks(rotation=25)
raised_line, = plt.plot(times, raised_millions, marker='*')
ax.set_xlabel('Time Retrieved')
ax.set_ylabel('Amount in Millions of $')
goal_line, = plt.plot(times, goal_millions)
plt.legend([raised_line, goal_line], ['raised','goal'], loc='lower right')
plt.show()

## Scraping a Fan Fiction Survey

Take a look at this [survey](https://kingsbsd.github.io/scraping_task/) of fan fiction that people have read and written. Some
people responded with neat, comma-separated lists of titles. Others have embedded the titles in free-form text.
There will be spelling and punctuation errors. The idea is to extract titles from the survey, without picking up
too much junk. We will try a very simple Natural Language Processing (NLP) approach. It won't be perfect, but it'll
do.

## Getting the contents of a web page
Store the contents of the web page in the string "fan_doc". Call requests' "get" method to obtain a response object, and use its "text" attribute.

In [None]:
survey_url = 'https://kingsbsd.github.io/scraping_task/'
fan_doc = ''

"fan_doc" should look like this:

In [None]:
print(fan_doc)

## Parsing HTML with BeautifulSoup
We now create a BeautifulSoup object, so we can extract the text we want from the HTML mark-up:

In [None]:
fan_soup = BeautifulSoup(fan_doc, 'html.parser')

The BeautifulSoup object should already be a lot more readable:

In [None]:
print(fan_soup)

## Extracting the desired content
Most of the methods of a BeautifulSoup object return another BeautifulSoup object. We can chain these method calls
together to drill down into the HTML document's structure until we get what we're looking for. Here, we get all the table row
(`<tr>`) elements in the first table in the page with the "find_all" method:

In [None]:
fan_rows = fan_soup.find('table').find_all('tr')

In [None]:
print(fan_rows[0])

Complete the function "get_row_text", so that when passed a BeautifulSoup object for a row element it returns the
text contents of its *first* `<td>` element.

In [None]:
def get_row_text(row):
    #FIXME:
    return None

The first `<td>` element of the second `<tr>` element should look like:
    
```
Buffy/Angel, Harry Potter, Forever Knight, Andromeda, Highlander (currently in a Horsemen-only phase),
 some 'older' anime every now and then (slayers, weiss kreuz, fushigi yuugi, video game)
```

In [None]:
print(get_row_text(fan_rows[1]))

Now we'll combine the content of all the rows into a big list. The cell below does this with a ["list comprehension"](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions). Sometimes
they can be a lot more readable and concise than appending to lists in a for loop. Don't worry about them for now.

In [None]:
fandoms = [get_row_text(row) for row in fan_rows]

## Extracting fiction titles with NLTK
Named Entity Recognition (NER) can be hard, but sometimes we can do quite well with a simple approach. Many of the
titles in the "fandoms" list will be badly typed, but most will follow the
[accepted rules](http://grammar.yourdictionary.com/capitalization/rules-for-capitalization-in-titles.html)
for capitalization, and we can make use of this. Some of the texts will still contain HTML, so we'll strip it out
with BeautifulSoup again. Then we'll use NLTK to chop them into lists of words or punctuation.

In [None]:
def get_clean_words(txt):
    soup = BeautifulSoup(txt, 'html.parser')
    return [str(word) for word in nltk.word_tokenize(soup.text)]

See how we get rid of a rogue `<br />` tag in the 11th text:
    
```

Don't read original.<br />Eroica, Garrison's Gorillas, x-files, Harry Potter, Pirates of the CAribbean, Lord of the Rings, Master & commander, Smallville, Buffy, Wild Wild West, Wiseguy, Quantum Leap and a multitude of others.

['Do', "n't", 'read', 'original.Eroica', ',', 'Garrison', "'s", 'Gorillas', ',', 'x-files', ',', 'Harry', 'Potter', ',', 'Pirates', 'of', 'the', 'CAribbean', ',', 'Lord', 'of', 'the', 'Rings', ',', 'Master', '&', 'commander', ',', 'Smallville', ',', 'Buffy', ',', 'Wild', 'Wild', 'West', ',', 'Wiseguy', ',', 'Quantum', 'Leap', 'and', 'a', 'multitude', 'of', 'others', '.']
```

In [None]:
print(fandoms[10])
print()
print(get_clean_words(fandoms[10]))

We'll use another list comprehension to strip all the HTML out of "fandoms" and put the result in "clean_fandoms":

In [None]:
clean_fandoms = [get_clean_words(f) for f in fandoms]

Complete the function "chunker" that takes lists of words and punctuation, and returns lists of lists of
consecutive words with all the punctuation removed. Here's the algorithm:

* Start with a list of lists that contains one empty list.
* Iterate over the input list of words and punctuation.
  + If the word is alphabetic (`word.isalpha()`) then append it to the last list in the list of lists. Remember
    about [list slicing](https://docs.python.org/3/tutorial/introduction.html#lists).
  + Otherwise, append a new empty list to the list of lists. (Judging by the expected outputs below, a new list is only started when the last one is not already empty.)
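
The steps above can be sketched roughly as follows. This is one possible reading, not the official solution: `chunker_sketch` is a hypothetical name, the tokens are a shortened made-up example, and the check that the last list is non-empty is an assumption inferred from the expected outputs.

```python
def chunker_sketch(words):
    chunks = [[]]  # start with one empty list
    for word in words:
        if word.isalpha():
            # A plain word: add it to the chunk we are currently building.
            chunks[-1].append(word)
        elif chunks[-1]:
            # Punctuation or a mixed token ends the current chunk, but only
            # start a new one if the current chunk has something in it.
            chunks.append([])
    return chunks

# Made-up tokens in the style of the word_tokenize output shown earlier.
tokens = ['Harry', 'Potter', ',', 'x-files', ',', 'Lord', 'of', 'the', 'Rings', '.']
print(chunker_sketch(tokens))
# [['Harry', 'Potter'], ['Lord', 'of', 'the', 'Rings'], []]
```

Note that `'x-files'` is dropped because `isalpha()` is `False` for any string containing a hyphen.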
                         

Here's what you should get for the 14th text. It looks pretty good already, but we've broken
"Xena, the worrior princes" (sic) into two. You really can't win them ALL in this game!

```
[['Lord', 'of', 'the', 'Rings'], ['Queer', 'as', 'Folk'], ['Buffy', 'The', 'Vampire', 'Slayer'], ['Harry', 'Potter'], ['Babylon', 'Five'], ['Xena'], ['The', 'worrior', 'princes'], ['Some', 'different', 'original', 'slash'], ['I', 'do', 'read', 'Het'], ['but', 'I', 'prefer', 'slash'], []]
```

In [None]:
print(chunker(clean_fandoms[13]))

Here's a list of short words we often find in fiction titles that don't need to be capitalized:

In [None]:
shortwords = ['a', 'an', 'as', 'the', 'and', 'but', 'or', 'for', 'on', 'at', 'to', 'from', 'by', 'of']

Complete the "find_titles" function. We'll use a very similar algorithm to extract titles from our lists of words. A title must start and end on a capitalized word, and may contain only capitalized words, or words from the "stopwords" list we pass in.

* "title_list" is a list of lists, that contains one empty list.
* For each word:
  + Append the word to the last list in "title_list" IF:
    - The 1st character is [uppercase](https://docs.python.org/3/library/stdtypes.html#str.isupper).
    - OR the word is in "stopwords" AND the last list has at least one word in it.
  + ELSE append a new empty list to "title_list".
* Join all the words in the lists together with spaces. (This is done for you.)

In [None]:
def find_titles(words, stopwords=shortwords):
    title_list = [[]]
    for w in words:
        if True or (True and True):   
            pass
        else:
            if True:
                pass

    titles = []
    
    for l in title_list:
        if l:
            titles.append(' '.join(l))
                                    
    return titles
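
If the placeholder conditions above are hard to decode, here is a hedged sketch of how the algorithm might be filled in. `find_titles_sketch` and `SHORTWORDS` are illustrative names so they don't clash with the exercise, and starting a new list only when the last one is non-empty is an assumption based on the nested `if` in the template.

```python
SHORTWORDS = ['a', 'an', 'as', 'the', 'and', 'but', 'or', 'for',
              'on', 'at', 'to', 'from', 'by', 'of']

def find_titles_sketch(words, stopwords=SHORTWORDS):
    title_list = [[]]
    for w in words:
        starts_upper = w[:1].isupper()  # w[:1] is safe even for an empty string
        is_joiner = w in stopwords and bool(title_list[-1])
        if starts_upper or is_joiner:
            title_list[-1].append(w)
        elif title_list[-1]:
            # Any other word ends the current title.
            title_list.append([])
    # Join each non-empty word list into a single title string.
    return [' '.join(t) for t in title_list if t]

print(find_titles_sketch(['Lord', 'of', 'the', 'Rings']))  # ['Lord of the Rings']
print(find_titles_sketch(['I', 'do', 'read', 'Het']))      # ['I', 'Het']
```

A stopword can continue a title but never start one, which is why `'of'` on its own produces nothing.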

In [None]:
def extract_titles(fandom, stopwords=shortwords):
    titles = []
    for word_list in chunker(fandom):
        titles.append(find_titles(word_list, stopwords=stopwords))        
    return titles

If we run the previous function on the 3rd and 14th texts, it looks pretty reasonable: we've got rid of a lot of the junk. It's still
not perfect:
 
```
[['I', 'GEN', 'I'], ['I', 'GEN', 'Lord of the Rings'], ['Kung Fu'], ['The Ledgend Continues'], ['I', 'SLASH'], ['GEN'], [], ['Gundam Wing'], ['Yu Yu Hakusho'], ['Highlander'], ['Smallville'], ['Dogma'], ['Pirates of the Carribbean'], ['Star Trek TOS'], ['Star Trek TNG'], ['Stargate'], ['The Matrix'], ['The Vampire Chronicels'], ['Xena and Hercules'], [], [], ['HET'], ['GEN'], [], [], ['I'], [], ['For'], ['Lord of the Rings and Gundam Wing'], ['I DO NOT'], [], [], [], [], [], [], [], ['I'], ['OOC'], []]
```

```
[['Lord of the Rings'], ['Queer as Folk'], ['Buffy The Vampire Slayer'], ['Harry Potter'], ['Babylon Five'], ['Xena'], ['The'], ['Some'], ['I', 'Het'], ['I'], []]
```

In [None]:
print(extract_titles(clean_fandoms[2]))
print()
print(extract_titles(clean_fandoms[13]))

What are the advantages and disadvantages of including "and" in the stopwords? Look at texts 3 and 6:

```
[['I', 'GEN', 'I'], ['I', 'GEN', 'Lord of the Rings'], ['Kung Fu'], ['The Ledgend Continues'], ['I', 'SLASH'], ['GEN'], [], ['Gundam Wing'], ['Yu Yu Hakusho'], ['Highlander'], ['Smallville'], ['Dogma'], ['Pirates of the Carribbean'], ['Star Trek TOS'], ['Star Trek TNG'], ['Stargate'], ['The Matrix'], ['The Vampire Chronicels'], ['Xena', 'Hercules'], [], [], ['HET'], ['GEN'], [], [], ['I'], [], ['For'], ['Lord of the Rings', 'Gundam Wing'], ['I DO NOT'], [], [], [], [], [], [], [], ['I'], ['OOC'], []]
```

```
[['Mostly Harry Potter', 'Lord of the Rings'], [], []]
```

```
[['I', 'GEN', 'I'], ['I', 'GEN', 'Lord of the Rings'], ['Kung Fu'], ['The Ledgend Continues'], ['I', 'SLASH'], ['GEN'], [], ['Gundam Wing'], ['Yu Yu Hakusho'], ['Highlander'], ['Smallville'], ['Dogma'], ['Pirates of the Carribbean'], ['Star Trek TOS'], ['Star Trek TNG'], ['Stargate'], ['The Matrix'], ['The Vampire Chronicels'], ['Xena and Hercules'], [], [], ['HET'], ['GEN'], [], [], ['I'], [], ['For'], ['Lord of the Rings and Gundam Wing'], ['I DO NOT'], [], [], [], [], [], [], [], ['I'], ['OOC'], []]
```

```
[['Mostly Harry Potter and Lord of the Rings'], [], []]
```

In [None]:
no_and = ['a', 'an', 'as', 'the', 'but', 'or', 'for', 'on', 'at', 'to', 'from', 'by', 'of']
print(extract_titles(clean_fandoms[2], stopwords=no_and))
print()
print(extract_titles(clean_fandoms[5], stopwords=no_and))
print()
print(extract_titles(clean_fandoms[2]))
print()
print(extract_titles(clean_fandoms[5]))

Doh! "Quantum Leap" from the 11th text is messed up! We forgot to drop words from the *end* of the title until we
find a capitalized word. For most of the other cases, it didn't matter.
What other improvements could you make? What about very short titles or ones that are all
in caps? We should filter out the empty lists too. Try to filter out as much of the junk as you can without losing any real
fiction titles.

```
[['Do'], [], ['Garrison'], ['Gorillas'], ['Harry Potter'], ['Pirates of the CAribbean'], ['Lord of the Rings'], ['Master'], [], ['Smallville'], ['Buffy'], ['Wild Wild West'], ['Wiseguy'], ['Quantum Leap and a'], []]
```

In [None]:
def improved_find_titles(words, stopwords=shortwords):
    # You're on your own here!
    return None
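
As a starting point for the "Quantum Leap" problem, here is a small sketch of just the trimming step: dropping words from the end of a candidate title until the last word is capitalized. `trim_title_end` is a hypothetical helper, and it only covers this one improvement, not the others suggested above.

```python
def trim_title_end(title_words):
    trimmed = list(title_words)  # work on a copy, leave the input alone
    # Pop trailing words until the last one starts with an uppercase letter.
    while trimmed and not trimmed[-1][:1].isupper():
        trimmed.pop()
    return trimmed

print(trim_title_end(['Quantum', 'Leap', 'and', 'a']))  # ['Quantum', 'Leap']
print(trim_title_end(['Lord', 'of', 'the', 'Rings']))   # ['Lord', 'of', 'the', 'Rings']
```

Short words in the *middle* of a title are untouched, so "Lord of the Rings" survives intact.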

Can you rate ALL the titles in all the texts by their popularity? Can you use BeautifulSoup to repeat the analysis for the *second* table in the page, about written fiction? Have an open-ended play with the data.
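
For the popularity question, `collections.Counter` from the standard library is one option. The sketch below counts titles in a small made-up result, assuming the same nesting that `extract_titles` produces (a list of title lists per text). The data is illustrative and not taken from the survey.

```python
from collections import Counter

# Made-up results for three texts, in the shape extract_titles returns.
extracted = [
    [['Lord of the Rings'], ['Harry Potter'], []],
    [['Harry Potter'], ['Buffy']],
    [['Harry Potter', 'Lord of the Rings']],
]

# Flatten text -> chunk -> title and count each title.
counts = Counter(title
                 for text in extracted
                 for chunk in text
                 for title in chunk)

print(counts.most_common(2))
# [('Harry Potter', 3), ('Lord of the Rings', 2)]
```

Spelling variants such as "Carribbean" and "CAribbean" will be counted separately, so some normalisation (for example lower-casing) may be worth trying first.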