What are the most frequent words in F. Scott Fitzgerald's novel, The Great Gatsby, and how often do they occur?

In this notebook, we'll scrape the novel's text from Project Gutenberg, split it into words, remove English stop words, and count the most frequent words that remain.

In [9]:
#import main packages
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter

A freely available version of The Great Gatsby is hosted on Project Gutenberg as an HTML file:
https://www.gutenberg.org/files/64317/64317-h/64317-h.htm


In [10]:
#get The Great Gatsby HTML; requests.get returns a response object
link = 'https://www.gutenberg.org/files/64317/64317-h/64317-h.htm'
r = requests.get(link)

#set the correct text encoding (UTF-8, as declared in the page's meta tag)
r.encoding = 'utf-8'

#extract the HTML from the response object
html = r.text

#print first 500 characters in html
print(html[:500])


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<title>The Project Gutenberg eBook of The Great Gatsby</title>
<link rel="coverpage" href="images/cover.jpg" />
<style type="text/css">

body{
margin-left: 20%;
margin-right:


The BeautifulSoup package will be used to extract the text from the HTML.

In [11]:
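#create a BeautifulSoup object from html (naming the parser explicitly
#avoids a warning and keeps results the same across machines)
soup = BeautifulSoup(html, 'html.parser')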

#get text out of soup
text = soup.get_text()

#print characters 10000-10500
print(text[10000:10500])

n a fashion that rather took your breath away: for instance, he’d brought down a string of polo ponies from Lake Forest. It was hard to realize that a man in my own generation was wealthy enough to do that.


Why they came East I don’t know. They had spent a year in France for no particular reason, and then drifted here and there unrestfully wherever people played polo and were rich together. This was a permanent move, said Daisy over the telephone, but I didn’t believe it—I had no sight into 
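
The extracted text also contains Project Gutenberg's own header and license text, which is why tokens such as 'project' and 'gutenberg' show up in the word lists. Here is a minimal sketch of trimming that boilerplate before counting. It assumes the novel sits between marker lines that begin with `*** START OF` and `*** END OF`, which is an assumption about the file's layout that has not been verified in this notebook. `strip_gutenberg_boilerplate` is a hypothetical helper name, and the sample string is made up for illustration.

```python
import re

# Assumed marker lines; inspect the downloaded text before relying on them.
START_RE = re.compile(r'\*\*\*\s*START OF [^\n]*?\*\*\*')
END_RE = re.compile(r'\*\*\*\s*END OF [^\n]*?\*\*\*')

def strip_gutenberg_boilerplate(text):
    """Return the part of text between the START and END markers.

    If a marker is missing, that edge of the text is left untouched.
    """
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()

# made-up sample that imitates the assumed layout
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "In my younger and more vulnerable years\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

If the markers are present, calling the helper on `text` before tokenizing should keep the boilerplate words out of the counts.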


The Natural Language Toolkit (NLTK) will be used to tokenize the text, that is, to split it into a list of words.

In [12]:
#create a tokenizer with the regex '\w+' (runs of word characters)
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

#tokenize the text
tokens = tokenizer.tokenize(text)

#print the first 10 tokens
print(tokens[:10])

['The', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'Great', 'Gatsby', 'The', 'Project']
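
To see how the pattern `\w+` treats punctuation, the small sketch below applies the same regular expression with Python's built-in `re` module. This is a stand-in for the NLTK tokenizer, used here only for illustration; for this pattern it should behave the same way. The sample sentence is taken from the excerpt printed earlier.

```python
import re

sample = "Why they came East I don’t know."

# \w+ matches runs of letters, digits and underscores; everything else
# (spaces, punctuation, the apostrophe) acts as a separator
demo_tokens = re.findall(r'\w+', sample)
print(demo_tokens)
# ['Why', 'they', 'came', 'East', 'I', 'don', 't', 'know']
```

Because the apostrophe is not a word character, a contraction such as "don’t" is split into two tokens.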


Convert the words to lowercase so that, for example, 'The' and 'the' are counted as the same word.

In [13]:
#build the 'words' list from the lowercase version of every token
words = [token.lower() for token in tokens]

#print the first 10 words
print(words[:10])

['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'great', 'gatsby', 'the', 'project']


English stop words (very common words such as 'the' and 'of') have to be removed; otherwise they would dominate the counts.

In [14]:
#downloading and importing stopwords from nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

#assign English stopwords to 'sw' (a set makes membership tests fast)
sw = set(stopwords.words('english'))

#loop through 'words' list and append to 'words_ns' only those
#that do not appear in 'sw'

words_ns = []
for word in words:
    if word not in sw:
        words_ns.append(word)
        
#print the first 10 words
print(words_ns[:10])

['project', 'gutenberg', 'ebook', 'great', 'gatsby', 'project', 'gutenberg', 'ebook', 'great', 'gatsby']


[nltk_data] Downloading package stopwords to /home/lucjan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
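
The filtering step can be tried out without downloading anything. The sketch below uses a tiny hand-written stop word set, a hypothetical stand-in for NLTK's much longer English list, and a list comprehension in place of the explicit loop. All `demo_` names are illustrative.

```python
# tiny illustrative stop word set; NLTK's English list is much longer
demo_sw = {'the', 'of', 'a', 'and'}

demo_words = ['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'great', 'gatsby']

# keep only the words that are not in the stop word set
demo_words_ns = [word for word in demo_words if word not in demo_sw]
print(demo_words_ns)
# ['project', 'gutenberg', 'ebook', 'great', 'gatsby']
```

Storing the stop words in a set rather than a list keeps each membership test fast, which matters when the check runs once per word of a whole novel.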


Finally, count the remaining words and find the most common ones.

In [15]:
#Initialize Counter object from list of words_ns
counter = Counter(words_ns)

#Find top 10 most common words
top_10 = counter.most_common(10)

#print top 10
print(top_10)

[('gatsby', 268), ('said', 235), ('tom', 191), ('daisy', 186), ('one', 154), ('like', 122), ('man', 114), ('back', 109), ('came', 108), ('little', 103)]
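
To make the result easier to read, the (word, count) pairs can be printed one per line, together with each word's share of all counted words. This is a minimal sketch on a small made-up word list: the `demo_` names are illustrative, and the counts below are not taken from the novel.

```python
from collections import Counter

# made-up word list, for illustration only
demo_words_ns = ['gatsby', 'said', 'gatsby', 'tom', 'gatsby', 'said']

demo_counter = Counter(demo_words_ns)
demo_total = sum(demo_counter.values())

# most_common(n) returns (word, count) pairs sorted by descending count
rows = []
for word, count in demo_counter.most_common(2):
    share = count / demo_total
    rows.append((word, count, share))
    # left-align the word, right-align the count and the percentage
    print(f'{word:<10}{count:>5}{share:>8.1%}')
```

Replacing `demo_words_ns` with `words_ns` and `2` with `10` would apply the same formatting to the novel's top 10 words.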
