## 1. Tools for text processing

What are the most frequent words in Herman Melville's novel, Moby Dick, and how often do they occur?

In this notebook, we'll scrape the novel Moby Dick from the website Project Gutenberg (which contains a large corpus of books) using the Python package requests. Then we'll extract the words from this web data using BeautifulSoup. Finally, we'll analyze the distribution of words using the Natural Language Toolkit (nltk) and Counter from the collections module.

The data science pipeline we'll build in this notebook can be used to compute the word frequency distribution of any novel that you can find on Project Gutenberg. The natural language processing tools used here apply to much of the data that data scientists encounter, since a large proportion of the world's data is unstructured and includes a great deal of text.

Let's start by loading the Python packages we are going to use.


In [48]:
# Importing the packages used in this notebook
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
from nltk.corpus import stopwords



### Request Moby Dick

In [49]:
# Getting the Moby Dick HTML 
r = requests.get('https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm')

# Setting the correct text encoding of the HTML page
r.encoding = 'utf-8'

# Extracting the HTML from the request object
html = r.text

# Printing the first 2000 characters in html
print(html[:2000])

<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE html
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >

<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
  <head>
    <title>
      Moby Dick; Or the Whale, by Herman Melville
    </title>
    <style type="text/css" xml:space="preserve">

    body { background:#faebd0; color:black; margin-left:15%; margin-right:15%; text-align:justify }
    P { text-indent: 1em; margin-top: .25em; margin-bottom: .25em; }
    H1,H2,H3,H4,H5,H6 { text-align: center; margin-left: 15%; margin-right: 15%; }
    hr  { width: 50%; text-align: center;}
    .foot { margin-left: 20%; margin-right: 20%; text-align: justify; text-indent: -3em; font-size: 90%; }
    blockquote {font-size: 100%; margin-left: 0%; margin-right: 0%;}
    .mynote    {background-color: #DDE; color: #000; padding: .5em; margin-left: 10%; margin-right: 10%; font-family: sans-serif; font-size: 95%;}
    .toc       { margin-left: 10%; m

### Get the text from the HTML

In [50]:
# Creating a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, 'html.parser')
# Getting the text out of the soup
text = soup.text

# Printing out text between characters 32000 and 34000
print(text[32000:34000])

ent me
      from deliberately stepping into the street, and methodically knocking
      people’s hats off—then, I account it high time to get to sea as soon
      as I can. This is my substitute for pistol and ball. With a philosophical
      flourish Cato throws himself upon his sword; I quietly take to the ship.
      There is nothing surprising in this. If they but knew it, almost all men
      in their degree, some time or other, cherish very nearly the same feelings
      towards the ocean with me.
    

      There now is your insular city of the Manhattoes, belted round by wharves
      as Indian isles by coral reefs—commerce surrounds it with her surf.
      Right and left, the streets take you waterward. Its extreme downtown is
      the battery, where that noble mole is washed by waves, and cooled by
      breezes, which a few hours previous were out of sight of land. Look at the
      crowds of water-gazers there.
    

      Circumambulate the city of a dreamy Sabbath afte

### Extract the words

In [51]:
# Creating a tokenizer
tokenizer = nltk.tokenize.RegexpTokenizer(r'\b\w+\b')

# Tokenizing the text
tokens = tokenizer.tokenize(text)

# Printing out the first 8 words / tokens
print(tokens[:8])
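
To see what the pattern `\b\w+\b` keeps and what it drops, here is a minimal sketch that uses the standard library's `re.findall` instead of nltk. It assumes that, for this pattern, `re.findall` yields the same tokens as `RegexpTokenizer`. The sample string is taken from the excerpt printed earlier, and the name `sample_tokens` is only illustrative.

```python
import re

# A short sample taken from the excerpt printed earlier in this notebook
sample = "knocking people’s hats off—then, I account it high time"

# \w+ matches runs of letters, digits, and underscores, so punctuation such as
# the apostrophe and the dash acts as a separator between tokens
sample_tokens = re.findall(r'\b\w+\b', sample)
print(sample_tokens)
```

Note that `people’s` is split into `people` and `s`, so single-letter fragments can end up among the tokens.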

### Make the words lowercase


In [52]:
# Creating a list called words containing all tokens transformed to lowercase
words = [word.lower() for word in tokens]

# Printing out the first 8 words / tokens
words[:8]


['moby', 'dick', 'or', 'the', 'whale', 'by', 'herman', 'melville']
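
Lowercasing matters because a counter treats differently capitalized strings as different words. The following is a small sketch on a hypothetical token list (not data from the novel) that shows the effect.

```python
from collections import Counter

# Hypothetical mini token list, for illustration only
mixed_case_tokens = ['The', 'whale', 'Whale', 'the', 'WHALE']

# Without lowercasing, each capitalization is counted as a separate word
raw_counts = Counter(mixed_case_tokens)
print(raw_counts)

# After lowercasing, the variants collapse into one entry per word
lower_counts = Counter(token.lower() for token in mixed_case_tokens)
print(lower_counts.most_common())
```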

### Load in stop words


In [58]:
# Getting the English stop words from nltk
nltk.download('stopwords')
sw = stopwords.words('english')

# Printing out the first six stop words
print(sw[:6])

['i', 'me', 'my', 'myself', 'we', 'our']


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/janderson/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Remove stop words in Moby Dick


In [61]:
# Create a list words_ns containing all words that are in words but not in sw
words_ns = [word for word in words if word not in sw]

# Printing the first 5 words_ns to check that stop words are gone
print(words_ns[:5])

['moby', 'dick', 'whale', 'herman', 'melville']
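
The list comprehension above tests each word against the list `sw`, which means scanning the list once per word. One possible refinement, sketched here on small stand-in lists, is to convert the stop words to a set first. The names `sample_words`, `sample_sw`, and `sample_sw_set` are hypothetical. The three stop words are the ones that were removed from the first eight words above.

```python
# Stand-ins for the notebook's `words` and `sw` lists
sample_words = ['moby', 'dick', 'or', 'the', 'whale', 'by', 'herman', 'melville']
sample_sw = ['or', 'the', 'by']

# Converting the stop words to a set makes each membership test roughly
# constant-time instead of a scan through the whole list
sample_sw_set = set(sample_sw)
sample_words_ns = [word for word in sample_words if word not in sample_sw_set]
print(sample_words_ns)
```

The result is the same as with a list. Only the lookup cost changes, which can be noticeable on a text as long as a novel.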


### We have the answer

Our original question was:

What are the most frequent words in Herman Melville's novel Moby Dick and how often do they occur?

In [63]:
# Initialize a Counter object from our processed list of words
count = Counter(words_ns)

# Store 10 most common words and their counts as top_ten
top_ten = count.most_common(10)

# Print the top ten words and their counts
print(top_ten)


[('whale', 1246), ('one', 925), ('like', 647), ('upon', 568), ('man', 527), ('ship', 519), ('ahab', 517), ('ye', 473), ('sea', 455), ('old', 452)]
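
To get a quick visual impression of these counts without a plotting library, one option is a plain-text bar chart. This is only a sketch: the counts are copied from the output above, and `max_width` is an arbitrary, hypothetical bar width.

```python
# Top-ten counts copied from the output above
top_ten = [('whale', 1246), ('one', 925), ('like', 647), ('upon', 568), ('man', 527),
           ('ship', 519), ('ahab', 517), ('ye', 473), ('sea', 455), ('old', 452)]

max_width = 40  # hypothetical bar width in characters
max_count = top_ten[0][1]  # most_common returns words in descending order of count

bars = []
for word, n in top_ten:
    # Scale each bar relative to the most frequent word
    bar = '#' * round(n / max_count * max_width)
    bars.append(f'{word:<6} {bar} {n}')

print('\n'.join(bars))
```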
