## **1. Tools for text processing**

So we are here to answer a simple question.
*What is the most common word in Moby Dick by Herman Melville?* 

First we will get the data: the novel will be scraped from **Project Gutenberg** using the *requests* package and *BeautifulSoup*.

Then we will inspect the dataset and, after carefully preparing the data where necessary, look for the most frequent words and their number of occurrences.


In [2]:
# Importing requests, BeautifulSoup, nltk, and Counter
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter

Now that we have imported the pertinent packages to build the data pipeline, let's get the data.

## **2. Request Moby Dick**

As mentioned before, we can scrape the novel freely from the **Project Gutenberg** website.
Let's fetch the HTML file using the *requests* package to make a *GET* request.

In [3]:
# Getting the Moby Dick HTML 
r = requests.get('https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm')

# Setting the correct text encoding of the HTML page
r.encoding = 'utf-8'

# Extracting the HTML from the request object
html = r.text

# Printing the first 2000 characters in html
print(html[0:2000])

<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE html
   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >

<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
  <head>
    <title>
      Moby Dick; Or the Whale, by Herman Melville
    </title>
    <style type="text/css" xml:space="preserve">

    body { background:#faebd0; color:black; margin-left:15%; margin-right:15%; text-align:justify }
    P { text-indent: 1em; margin-top: .25em; margin-bottom: .25em; }
    H1,H2,H3,H4,H5,H6 { text-align: center; margin-left: 15%; margin-right: 15%; }
    hr  { width: 50%; text-align: center;}
    .foot { margin-left: 20%; margin-right: 20%; text-align: justify; text-indent: -3em; font-size: 90%; }
    blockquote {font-size: 100%; margin-left: 0%; margin-right: 0%;}
    .mynote    {background-color: #DDE; color: #000; padding: .5em; margin-left: 10%; margin-right: 10%; font-family: sans-serif; font-size: 95%;}
    .toc       { margin-left: 10%; m

The response is raw HTML, so the next step is to pull the text of the book out of it.

## **3. Get the text from the HTML**

Note that we got an HTML file, but what we actually need is the text inside it. To get that text, we will use BeautifulSoup to parse the HTML and extract it.

In [4]:
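# Creating a BeautifulSoup object from the HTML
# (naming the parser explicitly avoids a warning and keeps results consistent across machines)
soup = BeautifulSoup(html, 'html.parser')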

# Getting the text out of the soup
text = soup.get_text()

# Printing out text between characters 28000 and 32000
print(text[28000:32000])



 William Comstock. Another Version of the whale-ship Globe
        narrative.
      

        “The voyages of the Dutch and English to the Northern Ocean, in order,
        if possible, to discover a passage through it to India, though they
        failed of their main object, laid-open the haunts of the whale.” —McCulloch’s
        Commercial Dictionary.
      

        “These things are reciprocal; the ball rebounds, only to bound forward
        again; for now in laying open the haunts of the whale, the whalemen seem
        to have indirectly hit upon new clews to that same mystic North-West
        Passage.” —From “Something” unpublished.
      

        “It is impossible to meet a whale-ship on the ocean without being struck
        by her near appearance. The vessel under short sail, with look-outs at
        the mast-heads, eagerly scanning the wide expanse around them, has a
        totally different air from those engaged in regular voyage.” —Currents
        and Whaling. U.S
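
To make the extraction step more concrete, here is a minimal sketch on a tiny, made-up HTML snippet. It uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup, so it only illustrates the idea of collecting text nodes and is not how `soup.get_text()` is implemented. The `TextExtractor` class and the sample markup are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, skipping <style> and <script>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not CSS or JavaScript
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


# Hypothetical sample markup, much smaller than the real page
sample_html = """
<html>
  <head><title>Moby Dick</title><style>p { color: black; }</style></head>
  <body><h1>Chapter 1</h1><p>Call me <b>Ishmael</b>.</p></body>
</html>
"""

extractor = TextExtractor()
extractor.feed(sample_html)
extractor.close()
sample_text = " ".join(extractor.chunks)
print(sample_text)
```

The tags disappear and only the readable text remains, which is the same kind of result we rely on from BeautifulSoup in the cell above.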

## **4. Extracting the words**

Observe that we still have unwanted content, such as punctuation marks, credits and other things that are not really the words we are after. To be fair, given the size of the book, the credits are probably not a big deal. So, let's count the words.

To do this, we will use *NLTK* to tokenize the text into words and drop the unwanted characters.

In [5]:
# Creating a tokenizer (raw string so that \w is not treated as a string escape)
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

# Tokenizing the text
tokens = tokenizer.tokenize(text)

# Printing out the first 25 words / tokens 
tokens[0:25]

['Moby',
 'Dick',
 'Or',
 'the',
 'Whale',
 'by',
 'Herman',
 'Melville',
 'The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Moby',
 'Dick',
 'or',
 'The',
 'Whale',
 'by',
 'Herman',
 'Melville',
 'This',
 'eBook',
 'is',
 'for']
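
To see what the `\w+` pattern keeps and what it throws away, here is a small sketch on a made-up sentence. It uses `re.findall` from the standard library, which for this pattern should give the same tokens as the `RegexpTokenizer` above; the sample sentence is only an illustration and is not taken from the book.

```python
import re

# Hypothetical sample sentence with a hyphen, apostrophes, digits and punctuation
sample = "The whale-ship didn't sink; Ahab's crew (all 30 of them) cheered."

# \w+ matches runs of letters, digits and underscores; everything else is a separator
sample_tokens = re.findall(r"\w+", sample)
print(sample_tokens)
```

Note that hyphenated words and contractions are split into pieces ("whale" and "ship", "didn" and "t"), and that numbers count as tokens too. For a rough word count over a whole novel this is usually acceptable, but it is worth keeping in mind when reading the results.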

## **5. Make the words lowercase**

To avoid counting "Or" and "or" as different words, we will make all the words lowercase.
We are going to save the lowercased tokens in a new list.

In [7]:
# Create a list called words containing all tokens transformed to lower-case
words = []
for token in tokens:
    words.append(token.lower())

# Printing out the first 25 words / tokens 
print(words[0:25])

['moby', 'dick', 'or', 'the', 'whale', 'by', 'herman', 'melville', 'the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or', 'the', 'whale', 'by', 'herman', 'melville', 'this', 'ebook', 'is', 'for']
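
The effect of lowercasing on the counts can be sketched with a handful of made-up tokens. The list below is hypothetical and only meant to show how the same word in different cases is split across several entries until it is normalized.

```python
from collections import Counter

# Hypothetical tokens mixing upper and lower case
sample_tokens = ["Or", "the", "Whale", "or", "The", "whale", "the"]

# Counting as-is treats "The" and "the" as two different words
case_sensitive = Counter(sample_tokens)

# Lowercasing first merges them into a single entry
case_insensitive = Counter(token.lower() for token in sample_tokens)

print(case_sensitive["the"], case_sensitive["The"])
print(case_insensitive["the"])
```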


## **6. Load in stop words**

Some words are not relevant for our question, like "the", "of", "a", etc.
These words are known as stop words, and again we will use *nltk* to remove them from our list.

In [10]:
# Getting the English stop words from nltk
nltk.download('stopwords')
sw = nltk.corpus.stopwords.words('english')

# Printing out the first eight stop words
print(sw[:8])

# Printing out the full list of stop words
print(sw)

[nltk_data] Downloading package stopwords to /home/rzlaqk/nltk_data...


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'suc

[nltk_data]   Unzipping corpora/stopwords.zip.


## **7. Remove stop words in Moby Dick**

Now that we have the stop words, we can make our final list without them.

In [11]:
# A new list to hold Moby Dick with No Stop words
words_ns = []

# Appending to words_ns all words that are in words but not in sw
for word in words:
    if word not in sw:
        words_ns.append(word)

# Printing the first 15 words_ns to check that stop words are gone
print(words_ns[:15])

['moby', 'dick', 'whale', 'herman', 'melville', 'project', 'gutenberg', 'ebook', 'moby', 'dick', 'whale', 'herman', 'melville', 'ebook', 'use']
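
A small variation on the loop above is sketched here, assuming a tiny made-up stop word list rather than NLTK's. Converting the list to a `set` first makes each membership check a hash lookup instead of a scan through the whole list, which can matter when filtering a full novel. The names `sample_words`, `sample_stop_words` and `stop_set` are hypothetical.

```python
# Hypothetical inputs, much smaller than the real word and stop word lists
sample_words = ["the", "whale", "is", "a", "whale", "of", "the", "sea"]
sample_stop_words = ["the", "is", "a", "of"]  # not NLTK's list

# A set gives constant-time membership checks on average
stop_set = set(sample_stop_words)

# Keep every word that is not a stop word, preserving the original order
filtered = [word for word in sample_words if word not in stop_set]
print(filtered)
```

The result is the same as with the explicit loop; only the lookup structure changes.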


## **8. Answering the question**

Now that we have the list, there are multiple ways to find the occurrences of each word. This time we will use the *Counter* class from the *collections* package to count the words.

In [12]:
# Initialize a Counter object from our processed list of words
count = Counter(words_ns)

# Store 10 most common words and their counts as top_ten
top_ten = count.most_common(10)

# Print the top ten words and their counts
print(top_ten)


[('whale', 1246), ('one', 925), ('like', 647), ('upon', 568), ('man', 527), ('ship', 519), ('ahab', 517), ('ye', 473), ('sea', 455), ('old', 452)]
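
As an illustration of one of the other ways mentioned above, the sketch below counts words with a plain dictionary and compares the result with `Counter.most_common()`. The sample list is made up, and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical sample, standing in for the full list of words
sample_words = ["whale", "sea", "whale", "ship", "sea", "whale"]

# Manual approach: build the counts with a dictionary, then sort by count
manual_counts = {}
for word in sample_words:
    manual_counts[word] = manual_counts.get(word, 0) + 1
manual_top = sorted(manual_counts.items(), key=lambda item: item[1], reverse=True)

# Counter does the counting and the sorting for us
counter_top = Counter(sample_words).most_common()

print(manual_top)
print(counter_top)
```

Both approaches give the same ranking here; `Counter` simply saves the bookkeeping.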


## **9. The most common word**

Finally, we can see that the most common word is "whale" with 1246 occurrences, followed by "one" with 925 occurrences, and so on.

#### *This was collected and solved by jdpalmad. Suggestions come from DataCamp, and the book was scraped from the Project Gutenberg repository.*