## 1. Tools for text processing

What are the most frequent words in Herman Melville's novel, Moby Dick, and how often do they occur?

In this notebook, we'll scrape the novel <em>Moby Dick</em> from the website <a href="https://www.gutenberg.org/">Project Gutenberg</a> (which contains a large corpus of books) using the Python package <code>requests</code>. Then we'll extract words from this web data using <code>BeautifulSoup</code>. Finally, we'll analyze the distribution of words using the Natural Language Toolkit (<code>nltk</code>) and <code>Counter</code>.

The data science pipeline we'll build in this notebook can be used to analyze the word frequency distribution of any novel that you can find on Project Gutenberg. The natural language processing tools used here are widely applicable, because much of the world's data is unstructured and includes a great deal of text.

We will start by loading in the main Python packages we are going to use.

In [1]:
# Importing requests, BeautifulSoup, nltk, and Counter
import requests
import nltk
from bs4 import BeautifulSoup
from collections import Counter

## 2. Request Moby Dick
<p>To start analyzing Moby Dick, we need to get its contents from <em>somewhere</em>. Fortunately, the text is freely available online at Project Gutenberg as an HTML file: <a href="https://www.gutenberg.org/files/2701/2701-h/2701-h.htm">https://www.gutenberg.org/files/2701/2701-h/2701-h.htm</a>.</p>

<p>To fetch the HTML file with Moby Dick we're going to use the <code>requests</code> package to make a <code>GET</code> request for that page, which means we're <em>getting</em> data from it.</p>

In [2]:
# Getting the Moby Dick HTML 
r = requests.get("https://www.gutenberg.org/files/2701/2701-h/2701-h.htm")

# Setting the correct text encoding of the HTML page
r.encoding = 'utf-8'

# Extracting the HTML from the request object
html = r.text

# Printing the first 1000 characters in html
print(html[0:1000])

<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE html
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >

<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
  <head>
    <title>
      Moby Dick; Or the Whale, by Herman Melville
    </title>
    <style type="text/css" xml:space="preserve">

    body { background:#ffffff; color:black; margin-left:15%; margin-right:15%; text-align:justify }
    P { text-indent: 1em; margin-top: .25em; margin-bottom: .25em; }
    H1,H2,H3,H4,H5,H6 { text-align: center; margin-left: 15%; margin-right: 15%; }
    hr  { width: 50%; text-align: center;}
    .foot { margin-left: 20%; margin-right: 20%; text-align: justify; text-indent: -3em; font-size: 90%; }
    blockquote {font-size: 100%; margin-left: 0%; margin-right: 0%;}
    .mynote    {background-color: #DDE; color: #000; padding: .5em; margin-left: 10%; margin-right: 10%; font-family: sans-serif; font-size: 95%;}
    .toc       {

## 3. Get the text from the HTML
<p>Currently this HTML is not quite what we want. However, it does <em>contain</em> what we want: the text of <em>Moby Dick</em>. What we need to do now is <em>wrangle</em> this HTML to extract the text of the novel. For this we'll use the package <code>BeautifulSoup</code>.</p>

In [3]:
# Creating a BeautifulSoup object from the HTML, naming the parser explicitly
soup = BeautifulSoup(html, "html.parser")

# Getting the text out of the soup
text = soup.get_text()

# Printing out text between characters 3200 and 3400
print(text[3200:3400])

R 29. Enter Ahab; to Him, Stubb. 


 CHAPTER 30. The Pipe. 


 CHAPTER 31. Queen Mab. 


 CHAPTER 32. Cetology. 


 CHAPTER 33. The Specksnyder. 


 CHAPTER 34. The Cabin-Table. 


 CHAPTER 35. The Ma


## 4. Extract the words
<p>Now we have the text of the novel. There are some unwanted parts at the start and at the end. It is possible to remove them, but for the sake of simplicity we will leave them in for now.</p>
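
<p>As a rough illustration of how the unwanted parts could be removed: Project Gutenberg files usually wrap the book between lines that begin with <code>*** START OF</code> and <code>*** END OF</code>. The exact wording of these markers varies between files, so the marker strings, the helper name <code>strip_gutenberg_boilerplate</code>, and the sample text below are all assumptions made for this sketch, and nothing later in this notebook relies on it.</p>

```python
def strip_gutenberg_boilerplate(text,
                                start_marker="*** START OF",
                                end_marker="*** END OF"):
    """Return the part of text between the start and end marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = text.find(start_marker)
    if start != -1:
        # Skip past the rest of the marker line itself
        newline = text.find("\n", start)
        text = text[newline + 1:] if newline != -1 else ""
    end = text.find(end_marker)
    if end != -1:
        text = text[:end]
    return text.strip()


# Hypothetical sample standing in for the full novel
sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK MOBY DICK ***\n"
    "Call me Ishmael.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK MOBY DICK ***\n"
    "License footer\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

<p>Before using something like this on the real <code>text</code>, it would be worth checking that both markers are actually present in it.</p>
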
<p>With the text in hand, it's time to count how many times each word appears. For this we'll use <code>nltk</code>, the Natural Language Toolkit. We'll start by tokenizing the text, that is, removing everything that isn't a word (whitespace, punctuation, etc.) and splitting the text into a list of words.</p>

In [4]:
# Creating a tokenizer (raw string so the backslashes reach the regex engine unchanged)
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|\$[\d\.]+|\S+,;:')

# Tokenizing the text
tokens = tokenizer.tokenize(text)

# Printing out the first 8 words / tokens 
print(tokens[:8])

['Moby', 'Dick', 'Or', 'the', 'Whale', 'by', 'Herman', 'Melville']
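
<p>To see what this pattern keeps and what it drops, it can be tried on a single sentence. The sketch below uses the standard library's <code>re.findall</code> in place of <code>RegexpTokenizer</code>, on the assumption that the tokenizer returns the non-overlapping matches of its pattern in the same way. The sample sentence is made up for illustration.</p>

```python
import re

# Same pattern as the tokenizer above, written as a raw string
pattern = r'\w+|\$[\d\.]+|\S+,;:'

# Hypothetical sample sentence
sample = "Call me Ishmael. It cost $3.50, didn't it?"

sample_tokens = re.findall(pattern, sample)
print(sample_tokens)
# ['Call', 'me', 'Ishmael', 'It', 'cost', '$3.50', 'didn', 't', 'it']
```

<p>Punctuation disappears, the dollar amount survives as one token, and the apostrophe splits <code>didn't</code> into two tokens. That splitting is why short fragments such as <code>t</code> or <code>s</code> can turn up among the tokens.</p>
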


## 5. Make the words lowercase
<p>So far, so good. Note that in the output above 'Or' has a capital 'O', while elsewhere in the text it may not, yet 'Or' and 'or' should be counted as the same word. For this reason, we should build a list of all words in <em>Moby Dick</em> in which all capital letters have been made lowercase.</p>

In [5]:
# Create a list called words containing all tokens transformed to lower-case
words = [token.lower() for token in tokens]

# Printing out the first 8 words / tokens 
print(words[:8])

['moby', 'dick', 'or', 'the', 'whale', 'by', 'herman', 'melville']
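
<p>The effect of this step on counting can be seen with a handful of tokens. This is a minimal sketch; the token list is invented for illustration.</p>

```python
from collections import Counter

# Hypothetical tokens with mixed capitalization
sample_tokens = ['Or', 'the', 'Whale', 'or', 'The', 'whale', 'OR']

# Without lowercasing, each spelling is counted as its own word
raw_counts = Counter(sample_tokens)
print(raw_counts)

# With lowercasing, the variants collapse into one entry per word
lowered = Counter(token.lower() for token in sample_tokens)
print(lowered)
```
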


## 6. Load in stop words
<p>It is common practice to remove words that appear very often in the English language, such as 'the', 'of' and 'a', because they say little about any particular text. Such words are known as <em>stop words</em>. The package <code>nltk</code> includes a good list of stop words in English that we can use.</p>

In [6]:
# Getting the English stop words from nltk
nltk.download('stopwords')
sw = nltk.corpus.stopwords.words('english')

# Printing out the first six stop words
print(sw[:6])

['i', 'me', 'my', 'myself', 'we', 'our']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ylmza\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 7. Remove stop words in Moby Dick
<p>We will now create a new list with all <code>words</code> in Moby Dick, except those that are stop words (that is, those words listed in <code>sw</code>).</p>

In [7]:
# Create a list words_ns containing all words that are in words but not in sw
words_ns = [w for w in words if w not in sw]

# Printing the first 5 words_ns to check that stop words are gone
print(words_ns[:5])

['moby', 'dick', 'whale', 'herman', 'melville']
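
<p>Because <code>sw</code> is a list, every <code>w not in sw</code> check scans it element by element. Converting the stop words to a <code>set</code> first is a common way to speed up the filtering of a long text. The sketch below shows the idea on a tiny, made-up stop-word list (<code>sample_sw</code> and <code>sample_words</code> are illustrative names, not part of the notebook).</p>

```python
# Hypothetical mini stop-word list; the real one comes from nltk
sample_sw = ['i', 'me', 'my', 'the', 'or', 'by']
sample_words = ['moby', 'dick', 'or', 'the', 'whale', 'by', 'herman', 'melville']

# Membership tests on a set take roughly constant time on average,
# while a list has to be scanned element by element
sample_sw_set = set(sample_sw)
filtered = [w for w in sample_words if w not in sample_sw_set]
print(filtered)

removed = len(sample_words) - len(filtered)
print(f"Removed {removed} of {len(sample_words)} words")
```
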


## 8. Original question
<p>Let's have a look at our original question:</p>
<blockquote>
  <p>What are the most frequent words in Moby Dick and how often do they occur?</p>
</blockquote>
<p>We are now ready to answer this question. Let's answer it using the <code>Counter</code> class we imported earlier.</p>

In [8]:
# Initialize a Counter object from our processed list of words
count = Counter(words_ns)

# Store 10 most common words and their counts as top_ten
top_ten = count.most_common(10)

# Print the top ten words and their counts
print(top_ten)

[('whale', 1245), ('one', 925), ('like', 647), ('upon', 568), ('man', 527), ('ship', 519), ('ahab', 517), ('ye', 473), ('sea', 455), ('old', 452)]
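
<p>A list of tuples is hard to compare at a glance. As one possible way to view it, the sketch below draws a simple text bar chart from the counts printed above, using only built-in Python. The bar width of 40 characters and the layout are arbitrary choices.</p>

```python
# The (word, count) pairs printed above
top_ten = [('whale', 1245), ('one', 925), ('like', 647), ('upon', 568),
           ('man', 527), ('ship', 519), ('ahab', 517), ('ye', 473),
           ('sea', 455), ('old', 452)]

max_count = top_ten[0][1]   # most_common() sorts by descending count
bar_width = 40              # width in characters of the longest bar

lines = []
for word, n in top_ten:
    # Scale each bar relative to the most frequent word
    bar = '#' * round(n / max_count * bar_width)
    lines.append(f"{word:<6} {n:>5} {bar}")

print("\n".join(lines))
```
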


## 9. The most common word
<p>The variable <code>top_ten</code> now holds the answer to our original question.</p>

<p><em>Not surprisingly</em>, the most common word in Moby Dick (once stop words are removed) is <em>'whale'</em>, which occurs 1,245 times.</p>
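
<p>The steps in this notebook can be collected into one function, so that the same analysis can be run on another text. The following is a minimal sketch, not the notebook's own code: the function name <code>most_common_words</code> is made up, tokenization is simplified to <code>\w+</code> with the standard library's <code>re</code> module, and the sample text and stop-word list are invented for illustration.</p>

```python
import re
from collections import Counter


def most_common_words(text, stop_words, n=10):
    """Return the n most frequent words in text that are not stop words."""
    # Tokenize on runs of word characters, then lowercase each token
    words = [token.lower() for token in re.findall(r'\w+', text)]
    stop_set = set(stop_words)
    return Counter(w for w in words if w not in stop_set).most_common(n)


# Hypothetical sample text and stop words
sample_text = "The whale, the white whale! Ahab saw the whale."
result = most_common_words(sample_text, stop_words=['the', 'a', 'of'], n=2)
print(result)
```

<p>For a real novel, <code>text</code> would come from <code>soup.get_text()</code> and <code>stop_words</code> from <code>nltk.corpus.stopwords.words('english')</code>, as in the cells above.</p>
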