# Basic Web Scraping and Word Frequency

For this module, we will use the Natural Language Toolkit (NLTK) along with several other popular Python packages to build a data science pipeline that plots frequency histograms of the words in HTML novels.

To get started, you will need a Python installation (3.6.3 or later is recommended).
```
$ python --version
Python 3.6.3
```


If needed, run these commands in the terminal to install the packages:

```
$ pip install beautifulsoup4
$ pip install html5lib
$ pip install matplotlib
$ pip install nltk
$ pip install requests
$ pip install seaborn
```

In [None]:
import requests
from bs4 import BeautifulSoup
import re
import nltk
import matplotlib.pyplot as plt
import seaborn as sns

## Get some data
Where do we get data?  That's easy: data is everywhere.  We can import files (CSV, XLSX, TXT), pull from APIs (usually as JSON), or obtain raw HTML.  For this example, we will use the novels freely available online at Project Gutenberg.

Here are several links to well-known HTML books:
- 'https://www.gutenberg.org/files/514/514-h/514-h.htm' # Little Women
- 'https://www.gutenberg.org/files/42671/42671-h/42671-h.htm' # Pride & Prejudice
- 'https://www.gutenberg.org/files/203/203-h/203-h.htm' # Uncle Tom's Cabin
- 'https://www.gutenberg.org/files/205/205-h/205-h.htm' # Walden

In [None]:
# Store url
url = 'https://www.gutenberg.org/files/42671/42671-h/42671-h.htm'


Next, we need to fetch the HTML file.  To do this, we will use a popular package known as ```requests```.  If you are familiar with HTTP requests, we will be submitting a ```GET``` request.

In [None]:
# Make the request and check object type
r = requests.get(url)
type(r)


The ```type``` function returns the datatype.  Here we get a ```Response``` object.

The following commands extract and output the raw HTML.

In [None]:
# Extract HTML from Response object and print
html = r.text
print(html)


## Wrangle the data

**Tag soup** refers to unstructured (or malformed) HTML code.  The package ```BeautifulSoup``` allows you to easily interact with this code.

Because we are in Louisiana, let's refer to our HTML soup as 'gumbo'.

In [None]:
# Create a BeautifulSoup object from the HTML
gumbo = BeautifulSoup(html, 'html5lib')
type(gumbo)


From our ```gumbo``` object, we can extract some information, such as the title.

In [None]:
# Get title as string
gumbo.title.string


We can also find the hyperlinks within a page (```<a>``` tags):

In [None]:
# Get hyperlinks from gumbo and check out first several
gumbo.find_all('a')[:8]

For this project, we want the text from the ```gumbo``` object.  Luckily, there is a ```.get_text()``` method for doing this.

In [None]:
# Get the text out of the gumbo and print it
text = gumbo.get_text()
print(text)
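
To make "getting the text" and "finding the links" less of a black box, here is a rough sketch that uses only Python's standard-library `html.parser` as a stand-in. It is not how BeautifulSoup is implemented; the class name `TextAndLinkCollector` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TextAndLinkCollector(HTMLParser):
    """Hypothetical helper: collects text nodes and <a href> values."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

    def handle_data(self, data):
        # Called for the text found between tags
        self.text_parts.append(data)


# Made-up sample markup (not taken from Project Gutenberg)
sample_html = (
    "<html><body><h1>Chapter 1</h1>"
    '<p>It is a <a href="#truth">truth</a> universally acknowledged.</p>'
    "</body></html>"
)

collector = TextAndLinkCollector()
collector.feed(sample_html)
collector.close()

sample_text = "".join(collector.text_parts)
print(sample_text)
print(collector.links)
```

Note that this sketch simply concatenates every text node, so the heading runs straight into the paragraph. It also makes no attempt to skip script or style content.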


Almost there!  While we have the text of the novel, it still contains some metadata.  Since the metadata is minimal and will not influence our findings, let's move forward with the project.

## Extract Words
Next, we will tokenize the text and use ```nltk``` to remove stopwords.

First, a quick look at regular expressions (regex) in action.

In [None]:
# Define sentence
sentence = 'peter piper picked a peck of pickled peppers'

# Define regex (raw string, so the backslash reaches the regex engine unchanged)
ps = r'p\w+'

# Find all words in sentence that match the regex and print them
re.findall(ps, sentence)

In [None]:
# Find all words and print them
re.findall(r'\w+', sentence)
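
One subtlety with a pattern like `p\w+` is that it can also match in the middle of a word. The sketch below uses a made-up sentence (`demo_sentence` is not part of the notebook) to show how the `\b` word-boundary anchor restricts matches to words that actually start with "p".

```python
import re

# Made-up sentence: "happy" contains a "p" in the middle of the word
demo_sentence = 'a happy piper'

# Without a word boundary, the match can start inside "happy"
inside_matches = re.findall(r'p\w+', demo_sentence)
print(inside_matches)

# With \b, the match must begin at the start of a word
boundary_matches = re.findall(r'\bp\w+', demo_sentence)
print(boundary_matches)
```

The first call returns `['ppy', 'piper']` and the second returns `['piper']`.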

Let's do something similar with the ```text``` object.

In [None]:
# Find all words in the novel and print several
tokens = re.findall(r'\w+', text)
tokens[:8]

# tokens = nltk.word_tokenize(text)


Almost there!  At this point, a word that starts with a capital letter will be counted separately from its lowercase form.  To handle this issue, make all of the words lowercase.

In [None]:
# Initialize new list
words = []

# Loop through list tokens and make lower case
for word in tokens:
    words.append(word.lower())


# Print several items from list as sanity check
words[:8]
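
The same lowercasing step can also be written as a list comprehension, which is a common Python idiom for this kind of loop. A minimal sketch on made-up tokens (`sample_tokens` is not part of the notebook):

```python
# Made-up tokens with mixed capitalization
sample_tokens = ["It", "is", "a", "Truth", "IT"]

# Build the lowercase list in a single expression
lowered = [token.lower() for token in sample_tokens]
print(lowered)
```

After this step, "It" and "IT" both become `'it'`, so they would be counted as the same word.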

Stop words (very common words such as "the" and "and") provide no real insights, so let's remove them.

In [None]:
# Get English stopwords and print some of them
sw = nltk.corpus.stopwords.words('english')
sw[:5]

# If you encounter an error, run the command below.
# nltk.download('stopwords')

In [None]:
# Initialize new list
words_ns = []

# Add to words_ns all words that are in words but not in sw
for word in words:
    if word not in sw:
        words_ns.append(word)


# Print several list items as sanity check
words_ns[:5]
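
Here is a small, self-contained sketch of the same filtering idea. The stop-word collection below is a tiny made-up stand-in for the much longer ```nltk``` list, and it is stored as a `set` on the assumption that membership checks on a set are faster than on a list for a long text.

```python
# Hypothetical mini stop-word set; nltk's English list is much longer
stop_words = {"it", "is", "a", "that", "in", "of", "the"}

# Made-up, already-lowercased words
sample_words = ["it", "is", "a", "truth", "universally", "acknowledged"]

# Keep only the words that are not stop words
kept = [word for word in sample_words if word not in stop_words]
print(kept)
```

In the notebook itself, wrapping the list as `set(sw)` would be one way to apply the same idea.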


## Answering the question
We started this project wanting to know the most frequently used words in a novel.  An easy way to answer this question is to create a graph.

In [None]:
# Figures inline and set visualization style
%matplotlib inline
sns.set()

# Create freq dist and plot
freqdist1 = nltk.FreqDist(words_ns)
freqdist1.plot(40)
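
If you would rather see the counts as numbers than as a plot, the sketch below uses `collections.Counter` from the standard library on a made-up word list. `nltk.FreqDist` is built on `Counter`, so a call such as `freqdist1.most_common(10)` should behave the same way; treat that as an assumption to check against your installed NLTK version.

```python
from collections import Counter

# Made-up words standing in for words_ns
sample_words = ["truth", "fortune", "wife", "truth", "fortune", "truth"]

# Count occurrences of each word
counts = Counter(sample_words)

# The two most frequent words with their counts, most frequent first
top_two = counts.most_common(2)
print(top_two)
```

This prints `[('truth', 3), ('fortune', 2)]`.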

## Bonus: Create a reusable function

There are hundreds of novels on Project Gutenberg, so it makes sense to write a function that reuses our code from above.

In [None]:
def plot_word_freq(url, num=25):
    """Takes a URL and a number of words, and plots the word frequency distribution"""
    # Make the request
    r = requests.get(url)
    # Extract HTML from Response object
    html = r.text
    # Create a BeautifulSoup object from the HTML
    gumbo = BeautifulSoup(html, "html5lib")
    # Get the text out of the gumbo
    text = gumbo.get_text()
    # Create tokens (raw string avoids an invalid escape sequence warning)
    tokens = re.findall(r'\w+', text)
    # Initialize new list
    words = []
    # Loop through list tokens and make lower case
    for word in tokens:
        words.append(word.lower())
    # Get English stopwords
    sw = nltk.corpus.stopwords.words('english')
    # Initialize new list
    words_ns = []
    # Add to words_ns all words that are in words but not in sw
    for word in words:
        if word not in sw:
            words_ns.append(word)
    # Create freq dist and plot the num most common words
    freqdist1 = nltk.FreqDist(words_ns)
    freqdist1.plot(num)
   

In [None]:
plot_word_freq('https://www.gutenberg.org/files/514/514-h/514-h.htm', 10)

## Conclusion

What have we learned?  You now have the foundation for 'scraping' HTML data from a website, extracting data, manipulating text, and plotting output.