## html_scraping_bookwordcount

Script to extract words from an HTML book, find the most commonly occurring words, and save them to a CSV file.

Note: "stop words" are excluded; the stop-word list is currently set to English.

The recommended source of books is Project Gutenberg; choose the HTML version of each book.
<br>Most popular books: https://gutenberg.org/browse/scores/top
<br>Books in Spanish: https://gutenberg.org/browse/languages/es

Code modified from the DataCamp.com 'Word Frequency' project.

Some packages may need to be installed before the script will run: the third-party imports are `requests`, `bs4` (the `beautifulsoup4` package) and `nltk`.
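
A minimal sketch for checking which third-party modules are missing before running the cells below. The helper name `missing_packages` is hypothetical and not part of the original script; it relies only on the standard library.

```python
from importlib.util import find_spec

def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]

# Import names used by this notebook (bs4 is installed as beautifulsoup4)
required = ["requests", "bs4", "nltk"]
print("Missing modules:", missing_packages(required))
```

An empty list means every import in the first cell should succeed.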

Merlin Fair
<br>Created: 2024/02/27
<br>Last Modified: 2024/02/27 (MF)

In [144]:
# Import and download packages
import requests
import csv
from bs4 import BeautifulSoup
import nltk
from collections import Counter
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/mjf81/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [145]:
# Get the Moby Dick HTML
r = requests.get('https://www.gutenberg.org/cache/epub/2701/pg2701-images.html')

# Stop early if the download failed (e.g. 404 or 500)
r.raise_for_status()

# Set the correct text encoding of the HTML page
r.encoding = 'utf-8'

In [146]:
# Extract the HTML from the request object
html = r.text

# Print a 1000-character sample from the HTML
print(html[15000:16000])

in. </a>
</p>
<p class="toc">
<a href="#link2HCH0086" class="pginternal"> CHAPTER 86. The Tail. </a>
</p>
<p class="toc">
<a href="#link2HCH0087" class="pginternal"> CHAPTER 87. The Grand Armada. </a>
</p>
<p class="toc">
<a href="#link2HCH0088" class="pginternal"> CHAPTER 88. Schools and Schoolmasters. </a>
</p>
<p class="toc">
<a href="#link2HCH0089" class="pginternal"> CHAPTER 89. Fast-Fish and Loose-Fish. </a>
</p>
<p class="toc">
<a href="#link2HCH0090" class="pginternal"> CHAPTER 90. Heads or Tails. </a>
</p>
<p class="toc">
<a href="#link2HCH0091" class="pginternal"> CHAPTER 91. The Pequod Meets The Rose-Bud. </a>
</p>
<p class="toc">
<a href="#link2HCH0092" class="pginternal"> CHAPTER 92. Ambergris. </a>
</p>
<p class="toc">
<a href="#link2HCH0093" class="pginternal"> CHAPTER 93. The Castaway. </a>
</p>
<p class="toc">
<a href="#link2HCH0094" class="pginternal"> CHAPTER 94. A Squeeze of the Hand. </a>
</p>
<p class="toc">
<a href="#link2HCH0095" cl

In [147]:
# Create a BeautifulSoup object from the HTML
html_soup = BeautifulSoup(html, "html.parser")

# Get the text out of the soup
moby_text = html_soup.get_text()

In [148]:
# Create a tokenizer that matches runs of word characters
# (raw string so that \w is not treated as a string escape)
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

# Tokenize the text
tokens = tokenizer.tokenize(moby_text)
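
To see what the `\w+` pattern keeps and discards, here is a small sketch on a made-up sentence. It uses Python's built-in `re` module as a stand-in for the NLTK tokenizer, on the assumption that with default settings the tokenizer simply collects every match of the pattern; the sample text is illustrative only.

```python
import re

sample = "Call me Ishmael. Don't panic!"

# Stand-in for the tokenizer: collect every run of word characters
sample_tokens = re.findall(r"\w+", sample)
print(sample_tokens)
# ['Call', 'me', 'Ishmael', 'Don', 't', 'panic']
```

Punctuation is dropped, and contractions are split at the apostrophe, which is why single-letter tokens such as `t` and `s` can appear in the word list.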

In [149]:
# Create a list called words containing all tokens transformed to lowercase
words = [token.lower() for token in tokens]

# Print out the first eight words
words[:8]

['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or']

In [150]:
# Get the English stop words from nltk
stop_words = nltk.corpus.stopwords.words('english')

# Print out the first eight stop words
stop_words[:8]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']

In [151]:
# Create a list words_no_stop containing all words that are in words but not in stop_words
words_no_stop = [word for word in words if word not in stop_words]

# Print the first five words_no_stop to check that stop words are gone
words_no_stop[:5]

['project', 'gutenberg', 'ebook', 'moby', 'dick']
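
The filter above tests each word against a list, which scans the list on every lookup. A sketch of the same filter using a set, which may be noticeably faster on a full-length book; the miniature stop-word list here is hypothetical and far shorter than the NLTK one.

```python
# Hypothetical miniature stop-word list, for illustration only
sample_stop_words = ["the", "of", "or", "and"]
sample_words = ["the", "project", "gutenberg", "ebook", "of", "moby", "dick", "or"]

# Converting the list to a set makes each membership test constant time on average
stop_set = set(sample_stop_words)
filtered = [word for word in sample_words if word not in stop_set]
print(filtered)
# ['project', 'gutenberg', 'ebook', 'moby', 'dick']
```

The resulting list is the same as with the list-based filter; only the lookup cost changes.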

In [152]:
# Initialize a Counter object from our processed list of words
count_total = Counter(words_no_stop)

# Store ten most common words and their counts as top_ten
top_ten = count_total.most_common(10)

# Print the top ten words and their counts
print(top_ten)

[('whale', 1244), ('one', 925), ('like', 647), ('upon', 568), ('man', 527), ('ship', 519), ('ahab', 517), ('ye', 473), ('sea', 455), ('old', 452)]
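
A small sketch of how `Counter` and `most_common` behave, using a made-up word list rather than the book text:

```python
from collections import Counter

sample_words = ["whale", "sea", "whale", "ship", "sea", "whale"]
sample_counts = Counter(sample_words)

# most_common(n) returns (word, count) pairs sorted from most to least frequent
print(sample_counts.most_common(2))
# [('whale', 3), ('sea', 2)]
```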


In [154]:
# Save the top 100 words to a CSV file
# N.B. change filename to reflect the chosen book
filename = 'words_mobydick.csv'
# utf-8 keeps accented characters intact (e.g. for books in Spanish)
with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"',
                        quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["word", "count"])
    for word, count in count_total.most_common(100):
        writer.writerow([word, count])
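
One way to check the saved file is to read it back. The sketch below writes a few rows in the same two-column layout to a temporary file and loads them again with `csv.DictReader`; the file name `words_sample.csv` and the three rows are placeholders standing in for the real output.

```python
import csv
import os
import tempfile

# Placeholder rows standing in for count_total.most_common(100)
sample_rows = [("whale", 1244), ("one", 925), ("like", 647)]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "words_sample.csv")

    # Write the file with the same header and column order as the cell above
    with open(path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["word", "count"])
        writer.writerows(sample_rows)

    # Read it back; DictReader uses the header row as dictionary keys
    with open(path, newline="", encoding="utf-8") as csvfile:
        loaded = [(row["word"], int(row["count"])) for row in csv.DictReader(csvfile)]

print(loaded)
# [('whale', 1244), ('one', 925), ('like', 647)]
```

Counts come back from the CSV as strings, so they are converted with `int` before use.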