# DS106 Machine Learning : Lesson Ten Companion Notebook

### Table of Contents <a class="anchor" id="DS106L10_toc"></a>

* [Table of Contents](#DS106L10_toc)
    * [Page 1 - Introduction](#DS106L10_page_1)
    * [Page 2 - Natural Language Processing](#DS106L10_page_2)
    * [Page 3 - Read in Text](#DS106L10_page_3)
    * [Page 4 - Convert Text to Soup](#DS106L10_page_4)
    * [Page 5 - Tokenize Data](#DS106L10_page_5)
    * [Page 6 - Remove Capitalization](#DS106L10_page_6)
    * [Page 7 - Remove Stopwords](#DS106L10_page_7)
    * [Page 8 - Count and Plot Words](#DS106L10_page_8)
    * [Page 9 - Key Terms](#DS106L10_page_9)
    * [Page 10 - Lesson 5 Hands-On](#DS106L10_page_10)
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Overview of this Module<a class="anchor" id="DS106L10_page_1"></a>

[Back to Top](#DS106L10_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

```python
from IPython.display import VimeoVideo
# Tutorial Video Name: Natural Language Processing
VimeoVideo('388630868', width=720, height=480)
```


The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO106-ML-L05overview.zip)**.

# Introduction

So much information is presented only as the written or spoken word, yet most of the tools you have learned so far for dealing with data won't handle text! In this lesson, you'll learn a technique called *Natural Language Processing* to handle raw text data and turn it into something usable.  By the end of this lesson, you should be able to:

* Read in data from webpages
* Convert text to a usable format
* Utilize HTML tags to pull data parts
* Tokenize your data
* Use for loops to remove capitalization and stopwords
* Count words used in novels and chart the most frequent ones

This lesson will culminate in a hands-on in which you find the most frequently occurring words in Lewis Carroll's *Alice's Adventures in Wonderland*.  Ready to start down the rabbit hole?

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/441207100"> recorded live workshop </a> that goes over the material in this lesson. </p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Natural Language Processing<a class="anchor" id="DS106L10_page_2"></a>

[Back to Top](#DS106L10_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Natural Language Processing

While some information is nicely and neatly laid out in columns and rows, the majority of information is hidden in language! This would not be a problem if language were not such a confusing thing that no two people use in exactly the same way.  

Imagine you are trying to process movie reviews to find out whether a film was popular.  The word "bad" might lead you to believe a review was negative, but in the context of "this movie was bad arse!" it signals a positive review.

*Natural Language Processing (NLP)* is the process by which data scientists try to glean useful information out of the chaos that is the written word.

## NLP in Python

Now that you have a general idea of what natural language processing is, you'll learn how to do it in Python.  

---

## Install Required Packages

You may need to install some packages before you begin. You'll need to open up your terminal and navigate to the folder where you are going to store your data for this exercise.  Then, run this code: 

```bash
pip install bs4
```

Then, open up your Anaconda Command Prompt, and you will run the following: 

```bash
conda install nltk
```

---

## Import those Packages


Now that you've installed a few packages, you will need to import them, along with a lot of other stuff! 

```python
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import RegexpTokenizer
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

You'll use `requests` to read in your data from the webpage, `BeautifulSoup` to help process your raw data, `nltk` as the definitive natural language processing package, and `RegexpTokenizer` to break down your data into words. You should already be familiar with `matplotlib` and `seaborn`, and you'll make use of them to visualize the frequency counts of words at the end.

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Read in Text<a class="anchor" id="DS106L10_page_3"></a>

[Back to Top](#DS106L10_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Read in Text

Then, you will read in the text you'll analyze. You'll start by taking a URL to a webpage and assigning it a variable name, in this case, `url`:

```python
url = 'http://www.gutenberg.org/files/1184/1184-h/1184-h.htm'
```

The URL above goes to an e-book, *The Count of Monte Cristo*, by Alexandre Dumas.  If you haven't read it, you should - great read! Maybe when you're done learning all of data science and have room in your brain for fun again.  Once you have your URL saved to a Python variable, you can make a request to get data from that webpage.  This uses the function `requests.get()`, with the `url` variable you just saved as its argument:

```python
r = requests.get(url)
```

Then, if you like, you can check the type of the object that came back by using the function `type()`:

```python
type(r)
```

The result you should receive is:

```text
requests.models.Response
```
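Before moving on, it can be worth confirming that the request actually succeeded. The sketch below is one way to do that and is not part of the lesson's original code: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page text for `url`, raising an error if the request fails.

    `fetch_html` and the default timeout are illustrative assumptions.
    """
    # The timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text


# Example call (needs an internet connection), using the lesson's URL:
# html = fetch_html('http://www.gutenberg.org/files/1184/1184-h/1184-h.htm')
```

If the page is missing or the server is down, you get an error right away instead of quietly analyzing an error page.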

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Convert Text to Soup<a class="anchor" id="DS106L10_page_4"></a>

[Back to Top](#DS106L10_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Convert Text to Soup

Yes, you read that right! Your next step is to make some soup. Chicken noodle is the recommendation, but feel free to go wild. Chili? Lemon orzo? A good beef stew? The possibilities are endless. The next few lines take the data off the webpage and extract the text, then use Python's built-in `html.parser` to convert it into something you'll be able to process and better understand, called *soup*.

```python
html = r.text
soup = BeautifulSoup(html,"html.parser")
type(soup)
```

You know the first two lines worked, because when you determine the type with that `type()` function, it will give you back:

```text
bs4.BeautifulSoup
```

Feel free to burst into song here. Need some Animal Crackers in your Soup? Bet that gets stuck in your head all day. You're welcome!

---

## Use HTML Tags to Extract Useful Info

If you know your HTML, and the website is designed well, you can then call out certain pieces of this text.  For instance, the title:

```python
soup.title.string
```

`soup` holds the webpage parsed into its HTML structure. Here you are calling the `title` tag from that HTML and asking Python to give its contents back as a string with `.string`.  The result should be this:

```text
'The Count of Monte Cristo, by Alexandre Dumas, père'
```

Which happens to be both the title of the book and the title of the webpage.
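If accented characters in the title ever come back garbled (for instance `pÃ¨re` where you expected `père`), the usual cause is UTF-8 bytes being decoded with the wrong character set. The standard-library sketch below reproduces and reverses the problem on a short sample string, which is an assumption made for illustration.

```python
# UTF-8 bytes for a string containing an accented character
raw_bytes = 'Alexandre Dumas, père'.encode('utf-8')

# Decoding those bytes with the wrong character set garbles the accent
garbled = raw_bytes.decode('latin-1')
print(garbled)

# Decoding with the correct character set recovers the original text
repaired = raw_bytes.decode('utf-8')
print(repaired)
```

With `requests`, a common remedy is to set `r.encoding = 'utf-8'` before reading `r.text`, assuming the page really is encoded as UTF-8.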

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Fun Fact!</h3>
    </div>
    <div class="panel-body">
        <p>There are many other HTML tags, and you can make use of them to get all sorts of information out of a webpage.  The problem is, each website is typically set up a little differently, and you'll actually need to peek at the structure of the webpage to make the right HTML tag calls. If you happen to already know HTML, this will be a great place to play around.  If not, don't worry - there are other ways to play with soup that aren't so messy.</p>
    </div>
</div>
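To get a feel for pulling out other tags, here is a small sketch that runs on a made-up HTML snippet rather than the real book. The snippet and the names `sample_html` and `sample_soup` are assumptions for illustration; the real Gutenberg page may use different tags for its chapters.

```python
from bs4 import BeautifulSoup

# A tiny, made-up webpage to experiment on
sample_html = """
<html>
  <head><title>A Tiny Book</title></head>
  <body>
    <h2>Chapter 1</h2>
    <p>It was a dark night.</p>
    <h2>Chapter 2</h2>
    <p>The sun rose.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# The page title, just like soup.title.string in the lesson
page_title = sample_soup.title.string

# find_all() returns every matching tag; get_text() pulls out the text inside
chapter_headings = [h.get_text() for h in sample_soup.find_all("h2")]

# find() returns only the first matching tag
first_paragraph = sample_soup.find("p").get_text()

print(page_title)
print(chapter_headings)
print(first_paragraph)
```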

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Tokenize Data<a class="anchor" id="DS106L10_page_5"></a>

[Back to Top](#DS106L10_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Tokenize Data

No, all of you Lord of the Rings fans, stand down! J.R.R. Tolkien will not be making an appearance here.  Instead, you're learning about *tokens*: your text, broken down into individual words. The following code uses the function `get_text()` to retrieve your text, and then you will use `RegexpTokenizer()` to break it down into words.  How does it know where a word starts and ends? In regular expressions (RegEx), `\w+` matches a run of one or more "word" characters (letters, digits, and underscores), so spaces and punctuation are left out and act as the boundaries between tokens.

Then the function `tokenize()` will actually perform the operation, and you will get the first five words with `[:5]`:

```python
text = soup.get_text()
tokenizer = RegexpTokenizer(r'\w+')  # raw string keeps the backslash intact for the regex
tokens = tokenizer.tokenize(text)
tokens[:5]
```

Here is the result from this code:

```text
['The', 'Count', 'of', 'Monte', 'Cristo']
```
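To see what the `\w+` pattern actually matches, you can try it on a short sentence with Python's built-in `re` module. This is a sketch on a made-up sample string; `RegexpTokenizer` with its default settings should behave the same way, since it also collects every match of the pattern.

```python
import re

sample = "Don't panic, Alice! It's 4 o'clock."

# \w+ grabs runs of letters, digits, and underscores;
# punctuation and spaces are dropped, so contractions get split
word_tokens = re.findall(r'\w+', sample)
print(word_tokens)
# ['Don', 't', 'panic', 'Alice', 'It', 's', '4', 'o', 'clock']

# For contrast, splitting on whitespace keeps punctuation attached
whitespace_tokens = sample.split()
print(whitespace_tokens)
# ["Don't", 'panic,', 'Alice!', "It's", '4', "o'clock."]
```

Notice the stray `t` and `s` tokens left over from the contractions. Fragments like these can show up in your word counts later.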

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Remove Capitalization<a class="anchor" id="DS106L10_page_6"></a>

[Back to Top](#DS106L10_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Remove Capitalization

Notice that the above has capitalization.  That's great for when you are reading, but when you're trying to break things down for analysis, it is not as useful.  Should `The` be different from `the`? What about typos, like `THe` or `thE`? Probably not important.  You can remove capitalization like this:

```python
words = []
for word in tokens:
    words.append(word.lower())
```

The above code uses a `for` loop and the string method `lower()` to strip caps.  Each lowercased word is appended to a list named `words`. You can take a look at the first five entries in the list like this:

```python
words[:5]
```

And the result should be:

```text
['the', 'count', 'of', 'monte', 'cristo']
```

There.  You're now capitalization blind.
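If you prefer something more compact, the same lowercasing step can be written as a list comprehension. The sketch below uses a short sample list (`sample_tokens` is a stand-in for the lesson's `tokens`) so it runs on its own.

```python
# Stand-in for the lesson's `tokens` list
sample_tokens = ['The', 'Count', 'of', 'Monte', 'Cristo']

# Build the lowercased list in one line instead of a for loop with append()
lowered = [token.lower() for token in sample_tokens]
print(lowered)
# ['the', 'count', 'of', 'monte', 'cristo']
```

Both versions produce the same list, so use whichever one reads more clearly to you.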

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>Removing capitalization is part of text pre-processing, and if you'd like to learn about other text pre-processing steps, <a href="https://www.kdnuggets.com/2019/04/text-preprocessing-nlp-machine-learning.html">check out this article</a>.</p>
    </div>
</div>

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Remove Stopwords<a class="anchor" id="DS106L10_page_7"></a>

[Back to Top](#DS106L10_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Remove Stopwords

There are some words that mean nothing out of context.  Writing a paper on how many times a book uses the word "the" is not very interesting.  Your future supervisor would not be impressed, unless you're going to go into the field of linguistics. These "boring" and "useless" words are considered *stopwords*, and, luckily, `nltk` already has a list of them! You can pull them out and label them like this:

```python
stopwords = nltk.corpus.stopwords.words('english')
```

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>If you get an error that says something like "no stopwords," then you'll need to download them.  You can go into your Anaconda Command Prompt (if you are using Jupyter) and get the stopword list by running <code>python -m nltk.downloader stopwords</code>. Alternatively, run <code>nltk.download('stopwords')</code> in a notebook cell. Then re-run the line above and you should be good to go!</p>
    </div>
</div>

Check 'em out - here are the first ten stopwords in the list:

```python
stopwords[:10]
```

```text
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
```

These words don't carry any emotional context or descriptive information, and they don't tell you anything about the subject. So you'll want to filter them out, along with others like them, so they don't clutter up your analysis.

```python
wordsWithoutStops = []
for word in words:
    if word not in stopwords:
        wordsWithoutStops.append(word)
```

Feel familiar? Yeah, it's another for loop. This builds a list named `wordsWithoutStops` that leaves out all the stopwords, so only the good stuff is left.  Kind of like straining your jam to remove the seeds, so you're only left with yummy jelly.

Want to see the first five words, without stops? Easy - just call:

```python
wordsWithoutStops[:5]
```

And you receive back:

```text
['count', 'monte', 'cristo', 'alexandre', 'dumas']
```

Now that gets to the more important stuff, doesn't it?
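On a full-length novel, checking every word against a list of stopwords can be slow, because Python scans the list each time. Converting the stopwords to a set first usually speeds this up. The sketch below uses a tiny made-up stopword set (`sample_stopwords` is hypothetical, not the `nltk` list) so it runs on its own.

```python
# Stand-ins for the lesson's `words` and `stopwords`
sample_words = ['the', 'count', 'of', 'monte', 'cristo']
sample_stopwords = {'the', 'of', 'and', 'a'}  # hypothetical mini stopword set

# Membership checks against a set are fast, even for long texts
filtered = [word for word in sample_words if word not in sample_stopwords]
print(filtered)
# ['count', 'monte', 'cristo']
```

With the real data, you could build the set with `set(stopwords)` and then filter `words` in the same way.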

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Count and Plot Words<a class="anchor" id="DS106L10_page_8"></a>

[Back to Top](#DS106L10_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Count and Plot Words

You have successfully brought in data from a website with a get request, made it usable with BeautifulSoup, tokenized it, and cleaned it.  Now you can plot it.

```python
sns.set()
frequencyDis = nltk.FreqDist(wordsWithoutStops)
frequencyDis.plot(25)
```

Line two gets a frequency count of all the words in the `wordsWithoutStops` list, using the function `nltk.FreqDist()`.  Then you can easily plot it with the `.plot()` method of the resulting frequency distribution, which draws the chart with `matplotlib` behind the scenes.  The `25` in parentheses says that you're only going to plot the top 25 words, though of course you could change that to whatever you like.

Here is the resulting plot:

![A graph showing the frequency count of the top twenty five words in the words without stops list. Each word is listed on the x axis, which is labeled sample. The y axis is labeled counts and runs from five hundred to three thousand five hundred in increments of five hundred. A line begins at the top left of the graph at three thousand five hundred for the word said, then sharply decreases to one thousand five hundred for the word one, and then decreases more slowly across the remaining twenty three words.](Media/NLP1.png)

When looking at this graph, it is altogether possible that a few more stopwords could have been removed, like "would" or "could." However, these can sometimes be helpful.  For instance, when looking at product reviews, wouldn't it be helpful to see if they "would" recommend the product to a friend or "wouldn't?" You can see how these words in a novel don't help, but how they might help in other contexts.  You do get a sense for the main characters in this novel, though - so this analysis has really gotten to the heart of the issue.
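A plot is great for spotting patterns, but sometimes you want the actual numbers behind it. The sketch below counts words with `collections.Counter` from the standard library, using a short made-up word list. `nltk.FreqDist` builds on `Counter`, so calling `frequencyDis.most_common(25)` on the lesson's data should return the same kind of list.

```python
from collections import Counter

# Stand-in for the lesson's `wordsWithoutStops` list
sample_words = ['said', 'count', 'said', 'one', 'count', 'said']

# Count how many times each word appears
counts = Counter(sample_words)

# most_common(n) returns the n most frequent words with their counts
top_two = counts.most_common(2)
print(top_two)
# [('said', 3), ('count', 2)]
```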

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Key Terms<a class="anchor" id="DS106L10_page_9"></a>

[Back to Top](#DS106L10_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Natural Language Processing</td>
        <td>The process of making raw text data useful.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Soup</td>
        <td>Webpage data converted to a usable format.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Token</td>
        <td>A part of language, such as a word or a sentence.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Stopwords</td>
        <td>Common words, such as "the" or "of," that convey little meaning on their own.</td>
    </tr>
</table>

---

## Key Python Packages

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>requests</td>
        <td>Allows you to pull data from a webpage.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>BeautifulSoup</td>
        <td>Turns raw web data into something you can actually use.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>RegexpTokenizer</td>
        <td>Allows you to split your data into meaningful chunks.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>nltk</td>
        <td>Natural language processing package.</td>
    </tr>
</table>

---

## Key Python Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>requests.get()</td>
        <td>Function to pull data from a webpage.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>type()</td>
        <td>Function to determine the format of your data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>BeautifulSoup()</td>
        <td>Parses raw HTML into a soup object you can search by tag.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>RegexpTokenizer()</td>
        <td>Specifies the regular expression pattern that defines a token in your data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>tokenize()</td>
        <td>Breaks your data down into tokens.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.lower()</td>
        <td>Makes all the words lowercase.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>nltk.corpus.stopwords.words()</td>
        <td>Retrieves a list of stopwords in the language specified in the argument.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>nltk.FreqDist()</td>
        <td>Counts all the words in the data and creates a frequency distribution with them.</td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - Lesson 5 Hands-On<a class="anchor" id="DS106L10_page_10"></a>

[Back to Top](#DS106L10_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


This Hands-On **will** be graded, so make sure you complete each part. When you are done, please submit one document with all of your findings for grading.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Natural Language Processing Hands-On

For this hands-on, you will be using *Alice's Adventures in Wonderland* by Lewis Carroll to practice your newfound NLP skills. The book can be found **[here](https://www.gutenberg.org/files/11/11-h/11-h.htm)**. Follow the process you used on *The Count of Monte Cristo* to create a graphic of the most frequently used words in *Alice's Adventures in Wonderland*.

Please attach a Jupyter Notebook with your code, your graphic, and your conclusions.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>