![mobydick](mobydick.jpg)

In this workspace, you'll scrape the novel Moby Dick from [Project Gutenberg](https://www.gutenberg.org/), a website that hosts a large corpus of books, using the Python `requests` package. You'll extract words from this web data using `BeautifulSoup`, then analyze the distribution of words using the Natural Language Toolkit (`nltk`) and `Counter`.

The data science pipeline you'll build in this workspace can be reused to analyze the word frequency distribution of any novel you can find on Project Gutenberg.

What are the most frequent words in Herman Melville's novel Moby Dick, and how often do they occur?

Note that the HTML file you are asked to request is a cached version of the file from Project Gutenberg.

Your project will follow these steps:

- First, request the Moby Dick HTML file using `requests` and set its encoding to `utf-8`. Here is the URL to scrape from: https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm
- Next, extract the HTML and create a `BeautifulSoup` object using an HTML parser to get the text.
- Then initialize a regex tokenizer object, `tokenizer`, using `nltk.tokenize.RegexpTokenizer` to keep only alphanumeric text, and assign the results to `tokens`.
- Convert the tokens to lowercase and remove English stop words, saving the results to `words_no_stop`.
- Finally, initialize a `Counter` object and find the ten most common words, saving the result to `top_ten` and printing it to see what they are.
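
Before working with the full novel, it can help to see the same steps on a tiny input. The sketch below is a miniature, offline version of the pipeline that uses only the standard library so it runs without a network connection or any downloads: `html.parser` stands in for `BeautifulSoup`, `re.findall` stands in for `RegexpTokenizer`, and `sample_html`, `sample_stop_words`, and `TextExtractor` are hypothetical names invented for this illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes from HTML (a simple stand-in for BeautifulSoup's get_text)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Hypothetical sample standing in for the downloaded page
sample_html = "<html><body><h1>Moby Dick</h1><p>The whale, the white whale!</p></body></html>"
# Tiny illustrative stop word list; the project uses nltk's English list instead
sample_stop_words = {"the"}

# Step 2: pull the text out of the HTML
parser = TextExtractor()
parser.feed(sample_html)
parser.close()
text = " ".join(parser.chunks)

# Step 3: keep only runs of word characters
sample_tokens = re.findall(r"\w+", text)

# Step 4: lowercase and drop stop words
sample_words = [t.lower() for t in sample_tokens if t.lower() not in sample_stop_words]

# Step 5: count and rank
sample_top = Counter(sample_words).most_common(2)
print(sample_top)
```

On this sample, "whale" is the only word that appears twice once "the" is filtered out, so it ranks first.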


In [None]:
# Import and download packages
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
nltk.download('stopwords')

# Start coding here... 

In [None]:
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm'

# Make the GET request
response = requests.get(url)

# Proceed only if the request succeeded
if response.status_code == 200:
    response.encoding = 'utf-8'
    # Get the HTML content
    html_content = response.text
    print(html_content[0:500])
else:
    print(f"Failed to retrieve content. Status code: {response.status_code}")
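
Setting `response.encoding = 'utf-8'` tells `requests` how to decode the raw bytes when you read `response.text`. As far as I know, `requests` may fall back to ISO-8859-1 when the server's headers declare a text type without a charset, which would garble non-ASCII characters. The sketch below illustrates the effect with plain byte strings, with no network access; `raw_bytes` is a hypothetical sample, not content taken from the actual file.

```python
# Hypothetical sample containing curly quotes, which are non-ASCII characters
raw_bytes = "Moby Dick; or, “The Whale”".encode("utf-8")

# Decoding with the matching encoding recovers the original text
as_utf8 = raw_bytes.decode("utf-8")

# Decoding the same bytes as Latin-1 turns each curly quote into three wrong characters
as_latin1 = raw_bytes.decode("latin-1")

print(as_utf8)
print(repr(as_latin1))
```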


In [None]:
soup = BeautifulSoup(html_content, 'html.parser')

body_dick = soup.body.get_text()

print(len(body_dick))

In [None]:
import re

cleaned_body_dick = re.sub(r'[\n\r]+', ' ', body_dick)
cleaned_body_dick = re.sub(r'[^a-zA-Z0-9\s]', '', cleaned_body_dick)

print(len(cleaned_body_dick))
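
To see what the two substitutions above do, here is a small sketch that applies the same patterns to a hypothetical sample sentence (`sample` is made up for illustration). It also shows one side effect worth knowing about: deleting punctuation outright joins words that were separated only by punctuation. Replacing punctuation with a space is one possible alternative, shown here as an option rather than as the method this notebook uses.

```python
import re

# Hypothetical sample with a Windows-style line break and a double hyphen
sample = "Call me Ishmael.\r\nSome years ago--never mind how long"

# Same two steps as the cell above
step_one = re.sub(r'[\n\r]+', ' ', sample)
step_two = re.sub(r'[^a-zA-Z0-9\s]', '', step_one)
print(step_two)
# Call me Ishmael Some years agonever mind how long

# Alternative: swap punctuation for a space so "ago--never" stays two words
spaced_words = re.sub(r'[^a-zA-Z0-9\s]', ' ', step_one).split()
print(spaced_words)
```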

In [None]:
from nltk.tokenize import RegexpTokenizer

# Keep only runs of word characters (letters and digits)
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(cleaned_body_dick)

In [None]:
len(tokens)

In [None]:
lower_tokens = [token.lower() for token in tokens]

In [None]:
from nltk.corpus import stopwords
 
nltk.download('stopwords')
# Store the stop words in a set so membership checks are fast
english_stopwords = set(stopwords.words('english'))

In [None]:
words_no_stop = [w for w in lower_tokens if w not in english_stopwords]
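
The order of the two steps matters: the `nltk` English stop word list is lowercase, so a capitalized token such as "The" would slip through if the filter ran before lowercasing. The sketch below shows this on a toy example; `sample_tokens` and the three-word `sample_stop_words` set are hypothetical and only stand in for the real token list and `stopwords.words('english')`.

```python
sample_tokens = ["The", "Whale", "and", "the", "Sea"]
# Hypothetical three-word stop list for illustration only
sample_stop_words = {"the", "and", "of"}

# Filtering before lowercasing misses the capitalized "The"
filtered_first = [t for t in sample_tokens if t not in sample_stop_words]
print(filtered_first)  # ['The', 'Whale', 'Sea']

# Lowercasing first lets every stop word match
sample_lower = [t.lower() for t in sample_tokens]
sample_kept = [w for w in sample_lower if w not in sample_stop_words]
print(sample_kept)  # ['whale', 'sea']
```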

In [None]:
len(words_no_stop)

In [None]:
counter = Counter(words_no_stop)

In [None]:
# most_common returns (word, count) pairs ordered from most to least frequent
top_ten = counter.most_common(10)

In [None]:
top_ten
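
For reference, `Counter.most_common(n)` and sorting the counter's items by count in descending order give the same ranking. The sketch below checks this on a toy word list; `sample_words` is a hypothetical input, not data from the novel.

```python
from collections import Counter

# Hypothetical toy input
sample_words = ["whale", "sea", "whale", "ship", "sea", "whale"]
sample_counter = Counter(sample_words)

# Ranking by sorting the (word, count) pairs manually
by_sorting = sorted(sample_counter.items(), key=lambda item: item[1], reverse=True)[:2]

# Ranking with the built-in helper
by_method = sample_counter.most_common(2)

print(by_method)  # [('whale', 3), ('sea', 2)]
print(by_sorting == by_method)  # True
```

Wrapping either result in `dict(...)` gives a word-to-count mapping if that shape is more convenient.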