## html_scraping_bookwordcount

Script to extract the words from an HTML book, find the most commonly occurring words and save them to a CSV file.

Note: "stop words" (very common words such as "the" and "of") are excluded; the stop-word list is currently set to English.

The recommended source of books is Project Gutenberg; choose the HTML version of each book.
<br>Most popular books: https://gutenberg.org/browse/scores/top
<br>Books in Spanish: https://gutenberg.org/browse/languages/es
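
Because a stop-word list is language-specific, a Spanish book needs a Spanish list. In the script this would mean passing `'spanish'` instead of `'english'` to `nltk.corpus.stopwords.words`. The sketch below illustrates the effect without downloading anything; the tiny `STOP_WORDS` dictionary and the `remove_stop_words` helper are hypothetical stand-ins, not NLTK's actual lists.

```python
# Tiny illustrative stop-word lists (assumption: stand-ins for the much longer
# lists returned by nltk.corpus.stopwords.words('english') / ('spanish')).
STOP_WORDS = {
    "english": {"the", "of", "and", "a", "in"},
    "spanish": {"el", "la", "de", "y", "en", "un"},
}

def remove_stop_words(words, language):
    """Keep only the words that are not stop words in the given language."""
    stop = STOP_WORDS[language]
    return [w for w in words if w not in stop]

words = ["en", "un", "lugar", "de", "la", "mancha"]
print(remove_stop_words(words, "spanish"))  # ['lugar', 'mancha']
print(remove_stop_words(words, "english"))  # unchanged: none are English stop words
```

With the wrong language selected, words such as "de" and "la" would be counted as ordinary words and would probably dominate the results.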

Code modified from DataCamp.com 'Word Frequency' project.

The script needs the third-party packages `requests`, `beautifulsoup4` (imported as `bs4`) and `nltk`, which may have to be installed before it will run.
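
One quick way to see which packages are missing is to look up each import name with the standard library's `importlib`. This is only a sketch: the helper name `missing_packages` is hypothetical, and the list of names is taken from the imports used in this script.

```python
from importlib.util import find_spec

def missing_packages(import_names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]

# Third-party modules imported by this script (bs4 is provided by beautifulsoup4)
required = ["requests", "bs4", "nltk"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Anything reported as missing can typically be installed with `pip install requests beautifulsoup4 nltk`.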

Merlin Fair
<br>Created: 2024/02/27
<br>Last Modified: 2024/02/27 (MF)

In [1]:
# Import and download packages
import requests
import csv
from bs4 import BeautifulSoup
import nltk
from collections import Counter
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [2]:
# Get the book HTML (here: The Great Gatsby, Project Gutenberg ebook 64317)
r = requests.get('https://gutenberg.org/cache/epub/64317/pg64317-images.html')

# Set the correct text encoding of the HTML page
r.encoding = 'utf-8'

In [3]:
# Extract the HTML from the request object
html = r.text

# Print a 1000-character sample from the middle of the HTML
print(html[15000:16000])

ight was a colossal affair by any standard—it was a factual imitation of some Hôtel de Ville in Normandy, with a tower on one side, spanking new under a thin beard of raw ivy, and a marble swimming pool, and more than forty acres of lawn and garden. It was Gatsby’s mansion. Or, rather, as I didn’t know Mr. Gatsby, it was a mansion inhabited by a gentleman of that name. My own house was an eyesore, but it was a small eyesore, and it had been overlooked, so I had a view of the water, a partial view of my neighbour’s lawn, and the consoling proximity of millionaires—all for eighty dollars a month.
</p>
<p>
Across the courtesy bay the white palaces of fashionable East Egg glittered along the water, and the history of the summer really begins on the evening I drove over there to have dinner with the Tom Buchanans. Daisy was my second cousin once removed, and I’d known Tom in college. And just after the war I spent two days with them in Chicago.
</p>
<p>
Her husband, among various phys

In [4]:
# Create a BeautifulSoup object from the HTML
html_soup = BeautifulSoup(html, "html.parser")

# Get the text out of the soup
moby_text = html_soup.get_text()

In [5]:
# Create a tokenizer
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')  # raw string avoids an invalid-escape warning

# Tokenize the text
tokens = tokenizer.tokenize(moby_text)
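
To see what the `\w+` pattern keeps and drops, the sketch below applies the same pattern with the standard library's `re.findall`, used here as a stand-in for `RegexpTokenizer` (an assumption made so the example runs without NLTK). The sample sentence is adapted from the HTML excerpt printed earlier.

```python
import re

sample = "I didn’t know Mr. Gatsby; it was a mansion."

# \w+ matches runs of letters, digits and underscores; punctuation,
# apostrophes and whitespace act as separators and are dropped.
tokens = re.findall(r"\w+", sample)
print(tokens)
# ['I', 'didn', 't', 'know', 'Mr', 'Gatsby', 'it', 'was', 'a', 'mansion']
```

Contractions are split at the apostrophe, so fragments such as `didn` and `t` become separate tokens.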

In [6]:
# Create a list called words containing all tokens transformed to lowercase
words = [token.lower() for token in tokens]

# Print out the first eight words
words[:8]

['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'great', 'gatsby']

In [7]:
# Get the English stop words from nltk
stop_words = nltk.corpus.stopwords.words('english')

# Print out the first eight stop words
stop_words[:8]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']

In [8]:
# Create a list words_no_stop containing all words that are in words but not in stop_words
words_no_stop = [word for word in words if word not in stop_words]

# Print the first five words_no_stop to check that stop words are gone
words_no_stop[:5]

['project', 'gutenberg', 'ebook', 'great', 'gatsby']

In [9]:
# Initialize a Counter object from our processed list of words
count_total = Counter(words_no_stop)

# Store ten most common words and their counts as top_ten
top_ten = count_total.most_common(10)

# Print the top ten words and their counts
print(top_ten)

[('gatsby', 268), ('said', 235), ('tom', 191), ('daisy', 186), ('one', 154), ('like', 122), ('man', 114), ('back', 109), ('came', 108), ('little', 103)]
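
The tokenize, lowercase, filter and count steps can be collected into one function. This is a minimal sketch rather than the notebook's own code: the name `top_words` is hypothetical, `re.findall` stands in for the NLTK tokenizer, and the stop-word list is a small hand-written example.

```python
import re
from collections import Counter

def top_words(text, stop_words, n=10):
    """Return the n most common non-stop words in text as (word, count) pairs."""
    words = [token.lower() for token in re.findall(r"\w+", text)]
    # A set makes each membership test fast compared with scanning a list
    stop_set = set(stop_words)
    return Counter(w for w in words if w not in stop_set).most_common(n)

text = "The green light and the bay. The green light fades; the bay waits."
print(top_words(text, ["the", "and"], n=3))
# [('green', 2), ('light', 2), ('bay', 2)]
```

Words with equal counts are returned in the order they first appear in the text.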


In [10]:
# Save the top 100 words to a CSV file
# N.b. change filename to reflect chosen book
filename = 'words_greatgatsby.csv'
with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"',
                        quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["word", "count"])
    # most_common(100) yields (word, count) pairs, highest count first
    for word, count in count_total.most_common(100):
        writer.writerow([word, count])
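
To check the result, the CSV can be read back with `csv.DictReader`. The sketch below is self-contained, so it first writes a small file in the same two-column layout; the temporary path and the file name `words_example.csv` are illustrative assumptions, and the three sample rows are copied from the top-ten output shown earlier.

```python
import csv
import os
import tempfile

# Sample rows in the same layout the script writes: word,count
sample_counts = [("gatsby", 268), ("said", 235), ("tom", 191)]

path = os.path.join(tempfile.mkdtemp(), "words_example.csv")
with open(path, "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["word", "count"])
    writer.writerows(sample_counts)

# Read the file back; DictReader uses the header row as dictionary keys
with open(path, newline="", encoding="utf-8") as csvfile:
    rows = [(row["word"], int(row["count"])) for row in csv.DictReader(csvfile)]

print(rows)  # [('gatsby', 268), ('said', 235), ('tom', 191)]
```

Counts come back as strings from the CSV, which is why they are converted with `int` before use.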