#### **What are the most frequent words in F. Scott Fitzgerald's novel, The Great Gatsby, and how often do they occur?**

In this notebook, we'll scrape the novel from Project Gutenberg, split it into words, remove English stop words, and count how often each remaining word occurs.

In [1]:
# Import main packages
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter

A freely available version of The Great Gatsby can be found at Project Gutenberg as an HTML file:
https://www.gutenberg.org/files/64317/64317-h/64317-h.htm


In [2]:
# Get The Great Gatsby HTML
link = 'https://www.gutenberg.org/files/64317/64317-h/64317-h.htm'
# Send a GET request; requests.get returns a Response object
r = requests.get(link)

# Stop early if the download failed (e.g. a 404 or 500 status)
r.raise_for_status()

# Set the correct text encoding (UTF-8, per the page's charset declaration)
r.encoding = 'utf-8'

# Extract the HTML from the response object
html = r.text

# Print first 500 characters in html
print(html[:500])


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<title>The Project Gutenberg eBook of The Great Gatsby</title>
<link rel="coverpage" href="images/cover.jpg" />
<style type="text/css">

body{
margin-left: 20%;
margin-right:
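
To see why the encoding step matters, here is a small offline sketch. It assumes nothing about the actual response: it simply encodes a phrase from the novel as UTF-8 bytes and decodes it twice, once correctly and once with a wrong guess (`latin-1`). The names `raw`, `good`, and `bad` are illustrative only.

```python
# The curly apostrophe in "he’d" takes three bytes in UTF-8
raw = "he’d brought down a string of polo ponies".encode("utf-8")

good = raw.decode("utf-8")    # the decoding that r.encoding = 'utf-8' asks for
bad = raw.decode("latin-1")   # a wrong guess still decodes, but garbles the text

print(good)
print(repr(bad))
```

The wrong decoding does not raise an error; it silently turns one apostrophe into three unrelated characters, which would later break words apart during tokenization.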


We'll use the BeautifulSoup package to extract the plain text from the HTML.

In [3]:
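# Create a BeautifulSoup object from html
# (naming the parser explicitly avoids a warning and keeps results consistent across machines)
soup = BeautifulSoup(html, 'html.parser')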

# Get text out of soup
text = soup.get_text()

# Print some characters
print(text[10000:10500])

n a fashion that rather took your breath away: for instance, he’d brought down a string of polo ponies from Lake Forest. It was hard to realize that a man in my own generation was wealthy enough to do that.


Why they came East I don’t know. They had spent a year in France for no particular reason, and then drifted here and there unrestfully wherever people played polo and were rich together. This was a permanent move, said Daisy over the telephone, but I didn’t believe it—I had no sight into 


We'll use the Natural Language Toolkit (NLTK) to tokenize the text, that is, to split it into a list of words.

In [4]:
# Create a tokenizer with regex r'\w+' (runs of word characters)
# The raw string keeps Python from treating '\w' as an invalid escape sequence
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

# Tokenize the text
tokens = tokenizer.tokenize(text)

# Print the first 10 tokens
print(tokens[:10])

['The', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'Great', 'Gatsby', 'The', 'Project']
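
The pattern `\w+` matches runs of letters, digits, and underscores, so every other character, including an apostrophe, acts as a separator. The sketch below illustrates this on one sentence from the novel. It uses the standard library's `re.findall` as a stand-in, on the assumption that it yields the same tokens as `RegexpTokenizer` for this pattern; `sample` and `sample_tokens` are illustrative names.

```python
import re

# One sentence from the text printed earlier, containing a contraction
sample = "Why they came East I don’t know."

# Find every run of word characters, as the pattern r'\w+' describes
sample_tokens = re.findall(r"\w+", sample)

print(sample_tokens)
# ['Why', 'they', 'came', 'East', 'I', 'don', 't', 'know']
```

Note that "don’t" becomes the two tokens 'don' and 't', so fragments of contractions end up in the token list as separate entries.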


Notice that some words are lowercase and some are capitalized. To count 'The' and 'the' as the same word, we convert every token to lowercase.

In [5]:
# Build the 'words' list from the lowercase version of each token
words = [token.lower() for token in tokens]

# Print the first 10 words
print(words[:10])

['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'great', 'gatsby', 'the', 'project']
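
A short sketch of why this step affects the final counts: a `Counter` treats differently cased spellings as different keys. The token list here is a made-up toy example, not data from the novel.

```python
from collections import Counter

# Toy tokens with mixed casing (hypothetical, for illustration only)
toy_tokens = ["The", "the", "Gatsby", "gatsby", "THE"]

# Without lowercasing, every spelling is counted separately
raw_counts = Counter(toy_tokens)

# After lowercasing, the spellings collapse into two words
lower_counts = Counter(token.lower() for token in toy_tokens)

print(len(raw_counts))   # 5
print(lower_counts)      # Counter({'the': 3, 'gatsby': 2})
```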


The list still contains stop words such as 'of' and 'the'. These very common words say little about the novel itself, so we remove all English stop words.

In [6]:
# Download and import stop words from NLTK
nltk.download('stopwords')
from nltk.corpus import stopwords

# Assign English stop words to 'sw'
# (a set makes each membership test fast, which matters for a whole novel)
sw = set(stopwords.words('english'))

# Keep in 'words_ns' only those words that do not appear in 'sw'
words_ns = [word for word in words if word not in sw]

# Print the first 10 words
print(words_ns[:10])

[nltk_data] Downloading package stopwords to /home/lucjan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['project', 'gutenberg', 'ebook', 'great', 'gatsby', 'project', 'gutenberg', 'ebook', 'great', 'gatsby']
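
The output above starts with 'project', 'gutenberg', and 'ebook', which suggests that Project Gutenberg's own header text is being counted along with the novel. One possible refinement, sketched below, is to keep only the text between the start and end markers. The exact marker wording (`*** START OF ... ***` and `*** END OF ... ***`) is an assumption about the file, and `strip_gutenberg_boilerplate` is a hypothetical helper, not part of any library.

```python
import re

# Toy stand-in for the extracted text; the marker lines are assumed, not copied from the file
sample = (
    "The Project Gutenberg eBook of The Great Gatsby\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE GREAT GATSBY ***\n"
    "In my younger and more vulnerable years\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE GREAT GATSBY ***\n"
    "Project Gutenberg License text\n"
)

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END markers, or the text unchanged."""
    start = re.search(r"\*\*\* ?START OF .*?\*\*\*", text)
    end = re.search(r"\*\*\* ?END OF .*?\*\*\*", text)
    if start and end and start.end() < end.start():
        return text[start.end():end.start()].strip()
    # If either marker is missing, leave the text as it is
    return text

body = strip_gutenberg_boilerplate(sample)
print(body)
```

In the notebook, this would be applied to `text` before tokenizing. The helper falls back to the unchanged text when no markers are found, so it is safe to try even if the assumption about the markers turns out to be wrong.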


Finally, we count the remaining words and find the most common ones.

In [7]:
# Initialize Counter object from list of words_ns
counter = Counter(words_ns)

# Find top 10 most common words
top_10 = counter.most_common(10)

# Print top 10
print(top_10)

[('gatsby', 268), ('said', 235), ('tom', 191), ('daisy', 186), ('one', 154), ('like', 122), ('man', 114), ('back', 109), ('came', 108), ('little', 103)]
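
Raw counts answer "how often", but a share of the total can be easier to compare. The sketch below shows one way to report both, using a made-up toy list rather than the novel's words; `toy_words`, `toy_counter`, and `total` are illustrative names.

```python
from collections import Counter

# Toy word list (hypothetical, for illustration only)
toy_words = ["gatsby", "said", "gatsby", "tom", "said", "gatsby"]

toy_counter = Counter(toy_words)
total = sum(toy_counter.values())

# Print each of the top words with its count and its share of all words
for word, count in toy_counter.most_common(2):
    print(f"{word}: {count} ({count / total:.1%})")
# gatsby: 3 (50.0%)
# said: 2 (33.3%)
```

Applied to the notebook's `counter`, the same loop would express each top word as a percentage of all non-stop-word tokens.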


### Excluding stop words, the most frequently used word in The Great Gatsby is **'gatsby'**, with 268 occurrences :D