[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JamesMTucker/DATA_340_NLP/blob/master/assignment_notebooks/Webscraping.ipynb)

# Webscraping Assignment

Reminder: you are permitted to work with another classmate on this assignment. If you do, please submit a single notebook with both of your names at the top.

## Due date

Friday, February 24 (12:00 pm), 2023

## Assignment description

In this project you will write a Jupyter Notebook or R Markdown file to scrape a selected website. You will need to:

1. Write a function that takes a URL as input and returns the HTML of the page as a string.
2. Inspect the HTML of the page and use regular expressions to extract the documents within the page.
3. Model the documents in a corpus.
4. Analyze the corpus using the bag of words model.
5. Implement a TF-IDF model to extract the n most important words for each document in the corpus.

### Objective

This assignment reinforces previous lecture topics on the linguistic background, properties of language, information theory, and Regular Expressions.


## Submission medium

Jupyter Notebook or R Markdown file. See additional instructions in the final section of this document.

## Code Dependencies

You will need the following packages (`re` ships with Python's standard library and needs no installation):

- `requests`
- `re`
- `beautifulsoup4`
- `nltk`
- `pandas`
- `numpy`
- `matplotlib`


## Grading

This assignment is worth 10 points. Extra credit: 1 point added to your final grade if you create a heatmap of the TF-IDF matrix.

### Mark J Serena - mjserena@wm.edu

In [1]:
import string
import requests
import re
import pandas as pd
import numpy as np
import nltk
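from nltk import download
download('stopwords')  # stop-word lists used in the "Remove stop words" step
# word_tokenize and WordNetLemmatizer need their own NLTK data (e.g. 'wordnet');
# download it the same way if a LookupError appears.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize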
# For Lemmatization
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
# For Stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

from bs4 import BeautifulSoup

#import CountVectorizer if we just want a Bag of Words Model
from sklearn.feature_extraction.text import CountVectorizer
#import TfidfVectorizer if we want TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 150
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mjserena/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Write a function that takes a URL as input and returns the HTML of the page as a string

### 1.1 Write a function that takes a URL as input and returns the HTML of the page as a string

In [2]:
import requests

def get_html(url) -> str:
    """Get the HTML of a webpage and return the HTML as a string.
    
    Parameters
    ----------
    url : str
        The URL of the webpage to scrape.
    
    Returns
    -------
    str
        The HTML of the webpage as a string.
    """
    ## YOUR CODE HERE
    html_source: str = requests.get(url).text
    assert isinstance(html_source, str), "The HTML should be a string."
    return html_source
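
The function above returns whatever the server sends, including error pages. As a minimal sketch of the same step with a timeout and explicit decoding, here is a standard-library variant. It swaps `requests` for `urllib.request`, and the name `fetch_text` and the 10-second default are assumptions made for illustration, not part of the assignment. `urlopen` raises an exception on HTTP error statuses, so a failed request cannot be mistaken for page content.

```python
import urllib.request


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and return its body decoded as text (hypothetical helper)."""
    # urlopen raises urllib.error.HTTPError / URLError on failure,
    # so a successful return always carries real content.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A data: URL needs no network access, which makes the helper easy to try out.
demo = fetch_text("data:text/plain;charset=utf-8,hello%20world")
print(demo)
```

Calling `fetch_text` with the Gutenberg URL used below should behave like `get_html`, assuming the machine has network access.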

### 1.2 Inspect the HTML of the page. Can you identify any patterns in the HTML that might be useful for extracting the documents within the page?

In [3]:
# Extract the HTML source code from the URL (this is the same URL we used in class)
url = "https://www.gutenberg.org/files/1/1-0.txt"


response = requests.get(url)
if response.status_code != 200:
    print('url request failed, best of luck')
html_source = response.text
print('html source returned is type',type(html_source))

any_html = re.findall(r"<html>", html_source)
if any_html == []:
    print('but no html found, hmmm. may be time to brew a beautiful soup.')
else:
    print(any_html)

html source returned is type <class 'str'>
but no html found, hmmm. may be time to brew a beautiful soup.


### 1.3 Use the BeautifulSoup library to create a BeautifulSoup object from the HTML string

In [4]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_source, "lxml")
# let's look at our soup
soup_html_source = soup.prettify()
# let's check the soup for html
any_html = re.findall(r"<.*>", soup_html_source)
if any_html == []:
    print('but no html found, hmmm. may be time to brew a beautiful soup.')
else:
    print('we found html!',any_html)

we found html! ['<html>', '<body>', '<p>', '<hart>', '</hart>', '</p>', '</body>', '</html>']


### 1.3 Extract the HTML body text and examine the contents.

In [5]:
# Please explain what the following line of code does in the cell below.
body = soup.find("body")

# soup.find("body") returns the first <body> tag in the parsed document,
# together with everything nested inside it (or None if there is no such tag).
# Here that is everything between <body> and </body>.

### `soup.find("body")` returns the first `body` tag in the parsed document, together with everything nested inside it (or `None` if there is no such tag). Here that is everything between `<body>` and `</body>`.

### 1.4 Use regular expressions to extract the documents within the page

In [6]:
# let's see what we have in the body of the html source
# look for markers of the form "[Etext #N]", where N is a number

data_check = re.findall(r"\[Etext #\d+\]", body.text)
print('We found ', len(data_check), 'documents named ', data_check)

We found  9 documents named  ['[Etext #1]', '[Etext #2]', '[Etext #3]', '[Etext #4]', '[Etext #5]', '[Etext #6]', '[Etext #7]', '[Etext #8]', '[Etext #9]']


In [7]:
import re

# Your regex here to capture the documents
doc_extractor = r"(?<=\[Etext #\d])([^\f]+?)(?=\[Etext #\d]|\*\*\*End of)" 

# Explain this line of code in the cell below.
# __Note:__ `re.MULTILINE` only changes how `^` and `$` match (at every line
# boundary). The pattern spans lines because `[^\f]` also matches newlines.
found_documents: list = re.findall(doc_extractor, body.text, flags=re.MULTILINE)
    

assert len(found_documents) == 9, "Please check your regex. You should have found a total 9 documents."

## If you are having trouble with the regex, remember that you can use regex101.com to test and debug.


In [8]:
print(len(found_documents))

9


Explain: `documents = re.findall(doc_extractor, body.text, re.MULTILINE)`

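This statement returns every stretch of text that follows an `[Etext #N]` marker and runs up to the next `[Etext #N]` marker or to `***End of`.
The regular expression has a few interesting aspects. One, it is made up of three parts: a lookbehind `(?<=...)`, a capturing group, and a lookahead `(?=...)`. Two, the lookbehind and the lookahead must match, but they are zero-width, so only the middle group is returned. Three, each document spans many lines; the pattern handles this because `[^\f]` matches any character except a form feed, newlines included. `re.MULTILINE` only changes how `^` and `$` behave, so it has no effect on this particular pattern.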
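
The behaviour of the outer two parts of the pattern can be checked on a toy string. This is a small sketch, not the assignment data: `sample` is a made-up stand-in for `body.text`, while the pattern is the one used above.

```python
import re

# Made-up text that mimics the marker layout of the Gutenberg file.
sample = "[Etext #1]\nfirst doc\nline two\n[Etext #2]\nsecond doc\n***End of file"

# Same pattern as in the notebook: lookbehind, capturing group, lookahead.
pattern = r"(?<=\[Etext #\d])([^\f]+?)(?=\[Etext #\d]|\*\*\*End of)"

with_flag = re.findall(pattern, sample, flags=re.MULTILINE)
without_flag = re.findall(pattern, sample)

# The markers themselves are not part of the returned text.
print(with_flag)
# The pattern contains no ^ or $, so the flag changes nothing.
print(with_flag == without_flag)
```

Python requires a lookbehind to have a fixed width, which is why the pattern uses a single `\d` rather than `\d+`; that is sufficient here because the markers run from 1 to 9.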

### 1.5 Explore the contents of the documents

In the matched documents, you will find a heading that Project Gutenberg added to each text. For the purposes of this assignment, I have provided a cleaner function that strips these Gutenberg headings from the text for you.

In [9]:
def clean_gutenberg(text: str) -> str:
    """Clean the text of a Gutenberg document.
    
    Parameters
    ----------
    text : str
        The text of a Gutenberg document.
    
    Returns
    -------
    str
        The cleaned text of the document.
    """
    text = re.sub(r"\[Etext #\d+\]", "", text)
    text = re.sub(r"(\r\n)+", " ", text)
    text = re.sub(r"^ ?The Project Gutenberg.*?Independence\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?\*\*\*\*The Project Gutenberg Etext of The U. S. Bill of Rights\*\*\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?November.*?EST", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?\*\*The Project.*?, USA", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?\*\*\*\*\*The Project.*?corrections\. \*\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?The Project.*?1775\.", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?Officially.*?calendar\]", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?\*\*The Project.*?, 1865", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?The Project.*?, 1861", "", text, flags=re.MULTILINE)
    
    return text.strip()

In [10]:
# clean up corpus: strip the Gutenberg headings from every matched document

corpus = []

for doc in found_documents:
    cleaned_doc = clean_gutenberg(doc)
    corpus.append(cleaned_doc)

In [11]:
# What does the corpus look like?

for i, doc in enumerate(corpus):
    print('doc ID', i, 'is', len(doc), 'characters in length')


doc ID 0 is 8113 characters in length
doc ID 1 is 2851 characters in length
doc ID 2 is 7594 characters in length
doc ID 3 is 1514 characters in length
doc ID 4 is 26730 characters in length
doc ID 5 is 6615 characters in length
doc ID 6 is 2027 characters in length
doc ID 7 is 3975 characters in length
doc ID 8 is 21111 characters in length


# Analyze the above corpus of documents using TF-IDF

In the following cells, please complete these preprocessing steps:

1. Tokenize the documents
2. Lemmatize the tokens
3. Remove stop words
4. Remove punctuation
5. Apply TF-IDF to the corpus
    * You can write a TF-IDF model from scratch or use the `sklearn` library

_tip: see lecture notebooks 4, 5, and 6 for examples of how to work with pandas_
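
The order of these steps can be sketched with the standard library alone. This is an illustration under stated assumptions: `TOY_STOP_WORDS` is a deliberately tiny, hypothetical stop-word set (the assignment uses the `nltk` list), tokenization is a plain whitespace split, and lemmatization is left out because it needs `nltk` data.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only.
TOY_STOP_WORDS = {"the", "of", "and", "a", "to", "in"}


def preprocess(text: str) -> list:
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    # Keep only lowercase letters, digits, and spaces.
    no_punct = re.sub(r"[^a-z0-9 ]", "", lowered)
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]


tokens = preprocess("We, the People of the United States, in Order to form a more perfect Union")
print(tokens)
```

Removing punctuation before tokenizing keeps the tokenizer from emitting punctuation marks as separate tokens, which is the order the cells below follow as well.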


## Tokenize the documents

In [12]:
# let's convert to lower case and remove punctuation
for i in range(len(corpus)):
    corpus[i] = corpus[i].lower()
    corpus[i] = re.sub(r'[^a-zA-Z0-9 ]', '', corpus[i])

# let's save the text from the corpus at this point, before we tokenize it
text = corpus.copy()


In [13]:
# tokenize the corpus
for i in range(len(corpus)):
    corpus[i] = nltk.word_tokenize(corpus[i]) 

In [14]:
# create a dataframe just because
import pandas as pd
corpus_df = pd.DataFrame({"docID": range(len(corpus)), "text": text, "tokens": corpus})
corpus_df.insert(3,'lemmas', ' ', True)

corpus_df.info()

corpus_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   docID   9 non-null      int64 
 1   text    9 non-null      object
 2   tokens  9 non-null      object
 3   lemmas  9 non-null      object
dtypes: int64(1), object(3)
memory usage: 416.0+ bytes


|   | docID | text | tokens | lemmas |
|---|-------|------|--------|--------|
| 0 | 0 | the declaration of independence of the united ... | [the, declaration, of, independence, of, the, ... |  |
| 1 | 1 | the united states bill of rights the ten origi... | [the, united, states, bill, of, rights, the, t... |  |
| 2 | 2 | we observe today not a victory of party but a ... | [we, observe, today, not, a, victory, of, part... |  |
| 3 | 3 | four score and seven years ago our fathers bro... | [four, score, and, seven, years, ago, our, fat... |  |
| 4 | 4 | the constitution of the united states of ameri... | [the, constitution, of, the, united, states, o... |  |
| 5 | 5 | no man thinks more highly than i do of the pat... | [no, man, thinks, more, highly, than, i, do, o... |  |
| 6 | 6 | in the name of god amen we whose names are un... | [in, the, name, of, god, amen, we, whose, name... |  |
| 7 | 7 | fellow countrymen at this second appearing to... | [fellow, countrymen, at, this, second, appeari... |  |
| 8 | 8 | fellow citizens of the united states in compl... | [fellow, citizens, of, the, united, states, in... |  |


## Lemmatize the tokens

In [15]:
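# Lemmatize every token, one document (row) at a time. Assigning the whole
# column avoids chained indexing (corpus_df['lemmas'][i] = ...), which pandas
# warns about and which may silently fail to update the DataFrame.
corpus_df['lemmas'] = corpus_df['tokens'].apply(
    lambda tokens: [wordnet_lemmatizer.lemmatize(token) for token in tokens]
)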

## Remove stop words

You can use the `nltk` library to remove stop words. You can also use the `SpaCy` library to remove stopwords.

In [16]:
stop_words = set(stopwords.words('english'))

# Filter the stop words out of each document's lemmas and store the result in
# the tokens column. Assigning the whole column avoids chained indexing
# (corpus_df['tokens'][i] = ...), which may not update the DataFrame.
corpus_df['tokens'] = corpus_df['lemmas'].apply(
    lambda lemmas: [w for w in lemmas if w not in stop_words]
)


### At this point the tokens column in the corpus dataframe should contain text that is in lower case, has had the punctuation removed, has been tokenized and converted to lemmas, and finally has had the stop words removed

## Remove punctuation

In [17]:
# Punctuation was already removed just before tokenization above; it made more sense to do it there.

## Analyze the documents and corpus using TF-IDF

### After this point a lot of the code was copied from lecture notebooks

In [18]:
# all of this code was copied from the Lecture_06_2023_02_14 notebook


# Unwind the data on the tokens
corpus_tokens = (corpus_df
                  .explode('tokens'))

# First calculate the term frequency for each document
# create a word frequency dataframe
term_freq = (corpus_tokens
                  .groupby(by=['docID', 'tokens'])
                  .agg({'tokens': 'count'})
                  .rename(columns={'tokens': 'term_freq'})
                  .reset_index()
                  .rename(columns={'tokens': 'term'})
                 )
term_freq

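|      | docID | term | term_freq |
|------|-------|------|-----------|
| 0 | 0 | 1972 | 1 |
| 1 | 0 | abdicated | 1 |
| 2 | 0 | abolish | 1 |
| 3 | 0 | abolishing | 3 |
| 4 | 0 | absolute | 3 |
| ... | ... | ... | ... |
| 3580 | 8 | would | 10 |
| 3581 | 8 | written | 4 |
| 3582 | 8 | wrong | 1 |
| 3583 | 8 | year | 4 |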
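
Term frequency is simply a per-document count of each token. As a cross-check on the pandas pipeline, here is a minimal sketch using `collections.Counter`; the two short documents are hypothetical stand-ins for the token lists in `corpus_df['tokens']`.

```python
from collections import Counter

# Hypothetical pre-tokenized documents keyed by docID.
docs = {
    0: ["people", "law", "people", "state"],
    1: ["law", "jury", "law"],
}

# One Counter per document: term -> number of occurrences in that document.
term_freq = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

print(term_freq[0]["people"])
print(term_freq[1].most_common(1))
```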


In [19]:
# let's sort by freq and see the top 10
sorted_term_freq_df = term_freq.sort_values(by='term_freq',ascending = False)
sorted_term_freq_df.head(10)

Unnamed: 0,docID,term,term_freq
1825,4,shall,191
1843,4,state,128
1907,4,united,55
1579,4,law,34
1610,4,may,33
1708,4,president,32
1310,4,congress,29
1518,4,house,28
3470,8,state,26
2882,8,constitution,23


In [20]:
# Document frequency

document_freq = (term_freq
                 .groupby(['docID', 'term'])
                 .size()
                 .unstack()
                 .sum()
                 .reset_index()
                 .rename(columns={0: 'document_freq'})
)
document_freq

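|      | term | document_freq |
|------|------|---------------|
| 0 | 1 | 1.0 |
| 1 | 10 | 1.0 |
| 2 | 15 | 1.0 |
| 3 | 1620 | 1.0 |
| 4 | 1774 | 1.0 |
| ... | ... | ... |
| 2365 | yea | 1.0 |
| 2366 | year | 6.0 |
| 2367 | yet | 3.0 |
| 2368 | york | 1.0 |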
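
Document frequency counts the number of documents that contain a term at least once, regardless of how often it occurs inside each one. A minimal standard-library sketch of that definition follows; the three toy documents are hypothetical.

```python
from collections import Counter

# Hypothetical pre-tokenized documents.
docs = [
    ["people", "law", "people", "state"],
    ["law", "jury", "law"],
    ["state", "war"],
]

# set(tokens) makes each term count at most once per document.
document_freq = Counter(term for tokens in docs for term in set(tokens))

print(document_freq["law"])
print(document_freq["people"])
```

Because the counts here are integers, this version also sidesteps the floating-point values that `unstack()` followed by `sum()` produces in the table above.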


In [21]:
# let's sort by freq and see the top 10
sorted_document_freq_df = document_freq.sort_values(by='document_freq',ascending = False)
sorted_document_freq_df.head(10)

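|      | term | document_freq |
|------|------|---------------|
| 1950 | shall | 9.0 |
| 2294 | war | 8.0 |
| 2158 | time | 8.0 |
| 558 | december | 8.0 |
| 2052 | subject | 7.0 |
| 1003 | government | 7.0 |
| 1554 | people | 7.0 |
| 1419 | nation | 7.0 |
| 1491 | one | 7.0 |
| 1582 | place | 7.0 |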


In [22]:
# Create an instance of the TF-IDF vectorizer that drops English stop words

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
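
Step 5 of the preprocessing list allows a from-scratch model instead of `sklearn`. The following is a minimal sketch, assuming raw counts for term frequency, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each document vector. These are the defaults documented for scikit-learn's `TfidfVectorizer`, but the sketch does no tokenization or stop-word removal of its own, so its numbers only match when both see the same tokens. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in docs for term in set(tokens))
    # Smoothed IDF, as if one extra document contained every term.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        raw = {term: count * idf[term] for term, count in tf.items()}
        # L2-normalise so every non-empty document vector has length 1.
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weighted.append({t: v / norm for t, v in raw.items()} if norm else {})
    return weighted


docs = [["shall", "law", "shall"], ["shall", "jury"]]
scores = tfidf(docs)
print(sorted(scores[1].items(), key=lambda kv: kv[1], reverse=True))
```

In the second toy document, `jury` outranks `shall` even though both occur once, because `shall` appears in every document and therefore receives the smallest possible IDF.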

In [23]:
corpus_df

|   | docID | text | tokens | lemmas |
|---|-------|------|--------|--------|
| 0 | 0 | the declaration of independence of the united ... | [declaration, independence, united, state, ame... | [the, declaration, of, independence, of, the, ... |
| 1 | 1 | the united states bill of rights the ten origi... | [united, state, bill, right, ten, original, am... | [the, united, state, bill, of, right, the, ten... |
| 2 | 2 | we observe today not a victory of party but a ... | [observe, today, victory, party, celebration, ... | [we, observe, today, not, a, victory, of, part... |
| 3 | 3 | four score and seven years ago our fathers bro... | [four, score, seven, year, ago, father, brough... | [four, score, and, seven, year, ago, our, fath... |
| 4 | 4 | the constitution of the united states of ameri... | [constitution, united, state, america, 1787, p... | [the, constitution, of, the, united, state, of... |
| 5 | 5 | no man thinks more highly than i do of the pat... | [man, think, highly, patriotism, well, ability... | [no, man, think, more, highly, than, i, do, of... |
| 6 | 6 | in the name of god amen we whose names are un... | [name, god, amen, whose, name, underwritten, l... | [in, the, name, of, god, amen, we, whose, name... |
| 7 | 7 | fellow countrymen at this second appearing to... | [fellow, countryman, second, appearing, take, ... | [fellow, countryman, at, this, second, appeari... |
| 8 | 8 | fellow citizens of the united states in compl... | [fellow, citizen, united, state, compliance, c... | [fellow, citizen, of, the, united, state, in, ... |


In [38]:
# Fit the vectorizer on the corpus text and transform it into a TF-IDF matrix

vectors = tfidf_vectorizer.fit_transform(corpus_df['text'])

In [47]:
# Create dataframe with terms and tfidf
tfidf_df = pd.DataFrame(vectors.toarray(), index=corpus_df.index.values, columns=tfidf_vectorizer.get_feature_names_out())


In [48]:
#tfidf_vectorizer.get_feature_names_out().tolist()

In [49]:
# Explore some selected terms

tfidf_df[['senate', 'house', 'congress', '1776', 'shall']]

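|   | senate | house | congress | 1776 | shall |
|---|--------|-------|----------|------|-------|
| 0 | 0.0 | 0.0 | 0.024365 | 0.0 | 0.01439 |
| 1 | 0.0 | 0.040452 | 0.080905 | 0.0 | 0.40616 |
| 2 | 0.0 | 0.02204 | 0.0 | 0.0 | 0.065085 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.10772 |
| 4 | 0.136429 | 0.119766 | 0.151009 | 0.0 | 0.587413 |
| 5 | 0.0 | 0.092634 | 0.0 | 0.0 | 0.150455 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.019549 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.114372 |
| 8 | 0.0 | 0.0 | 0.05869 | 0.01809 | 0.117855 |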


In [50]:
# Let's create a heatmap of the highest-scoring terms in some of the documents

tfidf_df = (tfidf_df
            .stack()
            .reset_index()
            .rename(columns={0: 'tf_idf', 'level_0': 'docID', 'level_1': 'term'})
           )

In [51]:
#tfidf_df

In [52]:
# Group the documents by their n highest performing terms

n = 5
top_tfidf = (tfidf_df
             .sort_values(by=['docID','tf_idf'], ascending=[True,False])
             .groupby(['docID'])
             .head(n)
            )

In [53]:
top_tfidf

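|      | docID | term | tf_idf |
|------|-------|------|--------|
| 1308 | 0 | laws | 0.219287 |
| 1597 | 0 | people | 0.176016 |
| 2090 | 0 | states | 0.173932 |
| 191 | 0 | assent | 0.150204 |
| 402 | 0 | colonies | 0.150204 |
| 4441 | 1 | shall | 0.40616 |
| 3732 | 1 | law | 0.242714 |
| 4519 | 1 | states | 0.216578 |
| 3705 | 1 | jury | 0.183134 |
| 4352 | 1 | right | 0.162067 |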
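
Selecting the n highest-scoring terms per document does not depend on pandas. As a small sketch of the same selection, `heapq.nlargest` can pick them from a plain dictionary. The scores below are rounded from the output above, and the helper name `top_n` is hypothetical.

```python
import heapq

# Scores rounded from the table above, keyed by docID and then by term.
scores = {
    0: {"laws": 0.219, "people": 0.176, "states": 0.174, "assent": 0.150},
    1: {"shall": 0.406, "law": 0.243, "states": 0.217, "jury": 0.183},
}


def top_n(scores_by_doc, n):
    """Return the n (term, score) pairs with the highest score per document."""
    return {
        doc_id: heapq.nlargest(n, term_scores.items(), key=lambda kv: kv[1])
        for doc_id, term_scores in scores_by_doc.items()
    }


best = top_n(scores, 2)
print(best)
```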


In [55]:
# this looks way more complicated than seaborn heatmaps

import altair as alt

# adding a little randomness to break ties in term ranking
top_tfidf_rand = top_tfidf.copy()
top_tfidf_rand['tf_idf'] = top_tfidf_rand['tf_idf'] + np.random.rand(top_tfidf.shape[0])*0.0001


base = alt.Chart(top_tfidf_rand, title="Heatmap of TF_IDF for Terms").encode(
    x = 'rank:O',
    y = 'docID:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tf_idf", order="descending")],
    groupby = ["docID"],
)

# heatmap specification
heatmap = base.mark_rect().encode(color = 'tf_idf:Q')

# text labels, white for darker heatmap colors
# (the field is named tf_idf, so the condition must reference alt.datum.tf_idf)
labels = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tf_idf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the two superimposed layers (heatmap and text labels)
(heatmap + labels).properties(width=500,height=500)

# Submission Instructions

Please submit your assignment as a Jupyter Notebook or R Markdown file. You can submit your assignment as a link to a Google Colab notebook or a link to a GitHub repository. If you are submitting a link to a GitHub repository, please make sure that your repository is public. If you email the notebook to me, please zip the file before sending it.

DATA 340-03 NLP - WebScraping Assignment - Mark J. Serena