
# This notebook will help you find the most common words in any English-language book on Project Gutenberg. 

Inspired by the DataCamp Project "Word frequency in Moby Dick" (https://www.datacamp.com/projects/38), I made this notebook to help a user find the most common words in any book on Project Gutenberg. All you need to do is supply the title of a book and a URL for an HTML file of that book.

First we'll read in an HTML file from Project Gutenberg. Then we'll extract just the text and get rid of any uninteresting words (articles, conjunctions, etc.). Finally, we'll count the words to determine the most common one and display the top 25 words on a plot. Ready? Let's get started.

In [None]:
# Import requests, BeautifulSoup, and nltk
import requests
from bs4 import BeautifulSoup
import nltk

# Now run this cell.

# Now let's find a book.

Navigate to www.gutenberg.org and choose a book. Be sure to choose the HTML version of the book. Copy the URL, paste it below, and fill in the book's title where indicated.

In [None]:
# Type the title of your chosen book between the single quotes.
title = ''

# Use requests.get to fetch your HTML file. Insert the URL of your chosen book between the single quotes in the line below.
request = requests.get('')

# Set the encoding of the HTML file to UTF-8.
request.encoding = 'utf-8'

# Use .text to extract just the text from the HTML file.
html = request.text

# Check that it is working by printing the first 2000 characters of the text.
print(html[0:2000])

# Now run this cell.

# The next step is to extract the text from this HTML file.

We'll use BeautifulSoup to read the HTML file and identify the part that is text. Then .get_text() will help us extract the text of the book from the HTML file.

In [None]:
# Use BeautifulSoup on the HTML file.
soup = BeautifulSoup(html, 'html.parser')

# Use .get_text() to extract just the text.
text = soup.get_text()

# Check the results by printing a 2,000-character passage from inside the text.
print(text[20000:22000])

# Now run this cell.

# Next, we'll cull the individual words from the text.

First, we'll "tokenize" the text; this means splitting it into individual words (tokens) and throwing out things that aren't words, like spaces and punctuation. To do this, we'll use a tokenizer from the Natural Language Toolkit (nltk).

In [None]:
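# Make a tokenizer that matches runs of word characters (letters, digits, and underscores).
# The r prefix makes this a raw string, so Python passes \w to the regex engine unchanged.
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')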

# Now apply the tokenizer function to the text to extract tokens (words).
tokens = tokenizer.tokenize(text)

# Check the results by printing the first 10 tokens.
print(tokens[0:10])

# Now run this cell.
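
To see what the `\w+` pattern keeps and what it discards, here is a small sketch that uses Python's built-in `re` module as a stand-in for the nltk tokenizer. For a simple pattern like this one, `re.findall` should return the same matches. The sample sentence and the names `sample` and `sample_tokens` are made up for illustration.

```python
import re

# A made-up sample sentence (not taken from any particular book).
sample = "Call me Ishmael. Don't panic!"

# \w+ matches runs of letters, digits, and underscores,
# so spaces and punctuation never end up in the results.
sample_tokens = re.findall(r'\w+', sample)

print(sample_tokens)
```

Notice that the apostrophe splits "Don't" into two pieces, "Don" and "t", so contractions show up as separate fragments in the token list.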

# Now we'll make all the tokens lowercase.

Why? Well, if a word appears sometimes at the beginning of a sentence (i.e., capitalized) and sometimes in the middle of a sentence (i.e., uncapitalized), it would get treated like two distinct tokens, when from a human point of view, those two tokens represent the same word.

First we'll create an empty list to store our tokens. Then we'll use a "for loop" to check each token, make it lowercase as needed, and then store it in the list.

In [None]:
# Make an empty list to store the tokens.
words = []

# Use a "for loop" to make each token lowercase and store it in the list.
for word in tokens:
    word = word.lower()
    words.append(word)

# Check the results by printing the first 10 words.
print(words[0:10])

# Now run this cell.
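
If you are comfortable with the "for loop" above, the same step can also be written in one line as a list comprehension. This is only an alternative sketch; the short list `sample_tokens` is made up to stand in for the notebook's `tokens`.

```python
# A small made-up token list standing in for the notebook's `tokens`.
sample_tokens = ['The', 'Whale', 'the', 'WHALE']

# Build the lowercase list in one line with a list comprehension.
sample_words = [token.lower() for token in sample_tokens]

print(sample_words)
```

Both versions produce the same list; the loop is easier to read step by step, while the comprehension is more compact.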

# Let's remove words that are very common but not very interesting (like pronouns).

To do this, we'll download a list of "stopwords"--very common words like "the" and "and" that we don't want to consider in our analysis. We'll be using a default list of stopwords from nltk, but it is also possible to create a custom list.

First, we'll download a list of stopwords to use. Then we'll loop over our word list and throw out any words that appear on the stoplist.

In [None]:
# Download a list of stopwords from nltk.
nltk.download('stopwords')

# Extract the English stopwords and store them as a set, which makes the lookups below fast.
sw = set(nltk.corpus.stopwords.words('english'))

# Make a new empty list for our cleaned-up collection of words.
words_clean = []

# Get rid of words on the stop list.
for word in words:
    if word not in sw:
        words_clean.append(word)
        
# Check results by printing the first 10 words on the new list.
print(words_clean[0:10])

# Now run this cell.
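
As mentioned above, you can also use a custom stoplist. Here is a minimal sketch of that idea, assuming a hand-picked set of words; `custom_sw`, `sample_words`, and `sample_clean` are hypothetical names, and the words themselves are made up for illustration.

```python
# Made-up words standing in for the notebook's `words` list.
sample_words = ['the', 'whale', 'and', 'the', 'sea', 'said', 'ahab']

# A hypothetical custom stoplist; a set makes each lookup fast.
custom_sw = {'the', 'and', 'said'}

# Keep only the words that are not on the stoplist.
sample_clean = [word for word in sample_words if word not in custom_sw]

print(sample_clean)
```

To extend the nltk stopwords rather than replace them, one option would be to combine the two collections with `set(sw) | custom_sw` and filter against the result.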

# At last we can see which words are used most frequently in the chosen book.

We'll count how often each word occurs and plot the top 25 most common words in a convenient chart.

In [None]:
# Use matplotlib to set up an inline plot.
%matplotlib inline

# Calculate word frequencies
word_freq = nltk.FreqDist(words_clean)

# Plot the 25 most common words
word_freq.plot(25)

# Now run this cell.
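
To see what the counting step does without the plot, here is a small sketch using Python's built-in `collections.Counter`, which `nltk.FreqDist` is built on. The word list and the names `sample_clean`, `sample_freq`, and `top_two` are made up for illustration.

```python
from collections import Counter

# Made-up words standing in for the notebook's `words_clean` list.
sample_clean = ['whale', 'sea', 'whale', 'ship', 'whale', 'sea']

# Count how many times each word appears.
sample_freq = Counter(sample_clean)

# Ask for the two most common words as (word, count) pairs.
top_two = sample_freq.most_common(2)

print(top_two)
```

Each entry in the result pairs a word with the number of times it occurs, ordered from most to least frequent.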

# To summarize, let's print the most common word in the chosen book.

In [None]:
# Extract the most common word and its count, then print them with the book title.
top_word, top_count = word_freq.most_common(1)[0]
print(f'The most common word in {title} is "{top_word}."')
print(f'It occurs {top_count} times.')

# Now run this cell.