[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JamesMTucker/DATA_340_NLP/blob/master/assignment_notebooks/Webscraping.ipynb)

# Webscraping Assignment

Reminder: you are permitted to work with another classmate on this assignment. If you do, please submit a single notebook with both of your names at the top.

## Due date

Friday, February 24 (12:00 pm), 2023

## Assignment description

In this project you will write a Jupyter Notebook or R Markdown file to scrape a selected website. You will need to:

1. Write a function that takes a URL as input and returns the HTML of the page as a string.
2. Inspect the HTML of the page and use regular expressions to extract the documents within the page.
3. Model the documents in a corpus.
4. Analyze the corpus using the bag of words model.
5. Implement a TF-IDF model to extract the n most important words for each document in the corpus.

### Objective

This assignment reinforces previous lecture topics on the linguistic background, properties of language, information theory, and Regular Expressions.


## Submission medium

Jupyter Notebook or R Markdown file. See additional instructions in the final section of this document.

## Code Dependencies

You will need to install the following packages:

- `requests`
- `re` (part of the Python standard library, so no installation is required)
- `beautifulsoup4`
- `nltk`
- `pandas`
- `numpy`
- `matplotlib`


## Grading

This assignment is worth 10 points. Extra credit: 1 point is added to your final grade if you create a heatmap of the TF-IDF matrix.

## Write a function that takes a URL as input and returns the HTML of the page as a string

### 1.1 Write a function that takes a URL as input and returns the HTML of the page as a string

In [5]:
import requests

def get_html(url) -> str:
    """Get the HTML of a webpage and return the HTML as a string.
    
    Parameters
    ----------
    url : str
        The URL of the webpage to scrape.
    
    Returns
    -------
    str
        The HTML of the webpage as a string.
    """
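    r = requests.get(url, timeout=30)  # give up instead of hanging on a stalled connection
    r.raise_for_status()  # raise an exception on 4xx/5xx responses instead of returning an error page

    return r.text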

### 1.2 Inspect the HTML of the page. Can you identify any patterns in the HTML that might be useful for extracting the documents within the page?

In [6]:
# Extract the HTML source code from the URL (this is the same URL we used in class)
url = "https://www.gutenberg.org/files/1/1-0.txt"

html_source = get_html(url)

print(html_source)




     NOTE:  This file combines the first two Project Gutenberg
     files, both of which were given the filenumber #1. There are
     several duplicate files here. There were many updates over
     the years.  All of the original files are included in the
     "old" subdirectory which may be accessed under the "More
     Files" listing in the PG Catalog of this file. No changes
     have been made in these original etexts.



**Welcome To The World of Free Plain Vanilla Electronic Texts**

**Etexts Readable By Both Humans and By Computers, Since 1971**

*These Etexts Prepared By Hundreds of Volunteers and Donations*

Below you will find the first nine Project Gutenberg Etexts, in
one file, with one header for the entire file.  This is to keep
the overhead down, and in response to requests from Gopher site
keeper to eliminate as much of the headers as possible.

However, for legal and financial reasons, we must request these
headers be left at the beginning o

Every document begins with a marker of the form "[Etext #_]", where the blank is the document number. Because this pattern repeats at the start of every document, a regular expression that matches it can be used to divide the page into its individual documents.




### 1.3 Use the BeautifulSoup library to create a BeautifulSoup object from the HTML string

In [7]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_source, 'lxml')

### 1.3 Extract the HTML body text and examine the contents.

In [8]:
# Please explain what the following line of code does in the cell below.
body = soup.find("body")
body


     NOTE:  This file combines the first two Project Gutenberg
     files, both of which were given the filenumber #1. There are
     several duplicate files here. There were many updates over
     the years.  All of the original files are included in the
     "old" subdirectory which may be accessed under the "More
     Files" listing in the PG Catalog of this file. No changes
     have been made in these original etexts.



**Welcome To The World of Free Plain Vanilla Electronic Texts**

**Etexts Readable By Both Humans and By Computers, Since 1971**

*These Etexts Prepared By Hundreds of Volunteers and Donations*

Below you will find the first nine Project Gutenberg Etexts, in
one file, with one header for the entire file.  This is to keep
the overhead down, and in response to requests from Gopher site
keeper to eliminate as much of the headers as possible.

However, for legal and financial reasons, we must request these
headers be left at the beginning of each file that is posted 

The code above searches the parsed document for the first "body" tag and returns that element, together with everything nested inside it, as a BeautifulSoup `Tag` object. In this case, the object returned holds the text included in the body of the webpage, excluding anything that falls before or after the body (such as a "head" section).

### 1.4 Use regular expressions to extract the documents within the page

In [9]:
import re

# Your regex here to capture the documents

# Success option 1
doc_extractor = r"#\d\].*?\[Etext|#\d\].*?angels of our nature."

# Explain this line of code in the cell below.
# __Note:__ You will need to use the `re.MULTILINE` flag to ensure that the
# regular expression matches across multiple lines.
found_documents: list = re.findall(doc_extractor, body.text, flags=re.DOTALL)

assert len(found_documents) == 9, "Please check your regex. You should have found a total of 9 documents."

# If you are having trouble with the regex, remember that you can use regex101.com to test and debug.


For my regular expression, I used the re.DOTALL flag rather than the re.MULTILINE flag. re.MULTILINE only changes where "^" and "$" match, whereas my pattern needs "." to match across line breaks.

While the start of each document varies slightly, every one contains the marker "[Etext #_]", so I used this pattern to split the documents. The regular expression matches strings that meet one of two conditions: 1) the string begins with "#", followed by a single digit and "]", and ends with "[Etext", OR 2) the string begins with "#", followed by a single digit and "]", and ends with "angels of our nature." (the text found at the conclusion of the final document). The text between the start and end patterns is matched by ".*?", which lazily matches any characters of any length, stopping at the first occurrence of the end pattern. (The DOTALL flag allows "." to include the newline character.) The `re.findall` function searches the body text for the patterns specified by the regular expression and returns a list of all the non-overlapping substrings that match.
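The difference between the two flags can be checked on a small example. The sketch below is only an illustration: the sample text and the closing phrase "THE END." are made up, standing in for the Gutenberg file and its final sentence.

```python
import re

# Hypothetical miniature of the Gutenberg file: three documents, each introduced
# by an "[Etext #N]" marker, with the last one ending in "THE END."
sample = (
    "header [Etext #1]\nFirst document,\nline two.\n"
    "[Etext #2]\nSecond document.\n"
    "[Etext #3]\nThird document.\nTHE END. trailing footer"
)

# Same shape as the assignment regex: stop at the next marker, or at the
# closing phrase of the final document.
pattern = r"#\d\].*?\[Etext|#\d\].*?THE END\."

# re.MULTILINE only affects "^" and "$", so "." still stops at every newline.
with_multiline = re.findall(pattern, sample, flags=re.MULTILINE)

# re.DOTALL lets "." match newlines, so each match can span several lines.
with_dotall = re.findall(pattern, sample, flags=re.DOTALL)

print(len(with_multiline), len(with_dotall))
```

With re.MULTILINE nothing is found, because every document contains a line break between its start and end patterns. With re.DOTALL all three documents are captured.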

### 1.5 Explore the contents of the Documents

In the matched documents, you will find a heading added to the text by Project Gutenberg. For the purposes of this assignment, I have provided a cleaner function that removes the Gutenberg headings from the text for you.

In [10]:
def clean_gutenberg(text: str) -> str:
    """Clean the text of a Gutenberg document.
    
    Parameters
    ----------
    text : str
        The text of a Gutenberg document.
    
    Returns
    -------
    str
        The cleaned text of the document.
    """
    text = re.sub(r"\[Etext #\d+\]", "", text)
    text = re.sub(r"#\d+\]", "", text) #added
    text = re.sub(r"\[Etext", "", text) #added
    text = re.sub(r"(\r\n)+", " ", text)
    text = re.sub(r"^ ?The Project Gutenberg.*?Independence\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?\*\*\*\*The Project Gutenberg Etext of The U. S. Bill of Rights\*\*\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?November.*?EST", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?\*\*The Project.*?, USA", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?\*\*\*\*\*The Project.*?corrections\. \*\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?The Project.*?1775\.", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?Officially.*?calendar\]", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?\*\*The Project.*?, 1865", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?The Project.*?, 1861", "", text, flags=re.MULTILINE)
    
    return text.strip()

Note: I added the two substitutions marked `#added` to remove the marker fragments ("#_]" and "[Etext") that my regular expression leaves at the edges of each document.

In [11]:
# Clean every extracted document and collect the results in a list.
corpus = [clean_gutenberg(doc) for doc in found_documents]

# Analyze the above corpus of documents using TF-IDF

In the following steps, I would like you to carry out these preprocessing steps:

1. Tokenize the documents
2. Lemmatize the tokens
3. Remove stop words
4. Remove punctuation
5. Apply TF-IDF to the corpus
    * You can write a TF-IDF model from scratch or use the `sklearn` library

_tip: see lecture notebooks 4, 5, and 6 for examples of how to work with pandas_
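As a rough illustration of the tokenizing, stop-word, and punctuation steps above, here is a minimal sketch that uses only the standard library. The `preprocess` function and the tiny `STOP_WORDS` set are hypothetical stand-ins for illustration. Lemmatization is left out because it needs a lexical resource such as the ones shipped with `nltk`.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the English list shipped with nltk is much longer.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "that", "we", "are", "is"}

def preprocess(text: str) -> list[str]:
    """Lowercase, split on whitespace, strip punctuation, and drop stop words."""
    tokens = text.lower().split()
    # Strip leading/trailing punctuation from each token (e.g. "war," -> "war").
    tokens = [tok.strip(string.punctuation) for tok in tokens]
    # Drop empty strings left by pure-punctuation tokens, then the stop words.
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

tokens = preprocess(
    "Now we are engaged in a great civil war, testing whether that nation can long endure."
)
print(tokens)
```

Splitting on whitespace is cruder than a real tokenizer (it does not separate contractions or hyphenated words, for example), so treat this as a starting point rather than a replacement for `nltk` or `spaCy`.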


In [None]:
### TIP ###
## if you want to work with pandas create a dataframe with documents as rows and columns for the document number and the text
#import pandas as pd
#corpus = pd.DataFrame({"docID": range(len(corpus)), "text": corpus})


## Tokenize the documents

## Lemmatize the tokens

## Remove stop words

You can use the `nltk` library to remove stop words. You can also use the `spaCy` library to remove stop words.

## Remove punctuation

In [15]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

All pre-processing steps done here...

In [16]:
import string

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

# Stop-word list extended with punctuation so both are filtered in one pass.
# Numbers are kept because dates could be important given the historical context.
stop_words = set(stopwords.words('english') + list(string.punctuation))

def tokenize_lemmatize_text(text):
    """Tokenize a document, drop stop words and punctuation, and stem the rest.

    Note: despite the name, this applies Porter stemming, not lemmatization.
    """
    final_doc = []  # local list, so the function no longer relies on a global variable
    for token in word_tokenize(text.lower()):
        if re.match(r"\..+", token):  # analysis showed some tokens with a leading dot
            token = token.replace('.', '')
        if token not in stop_words:
            final_doc.append(stemmer.stem(token))
    return final_doc

final_corpus = [tokenize_lemmatize_text(doc) for doc in corpus]

## Analyze the documents and corpus using TF-IDF

#### Method 1

In [17]:
import pandas as pd
df_corpus = pd.DataFrame({"docID": range(len(final_corpus)), "text": final_corpus})

In [18]:
df_corpus

| | docID | text |
|---|---|---|
| 0 | 0 | [declar, independ, unit, state, america, cours... |
| 1 | 1 | [unit, state, bill, right, ten, origin, amend,... |
| 2 | 2 | [observ, today, victori, parti, celebr, freedo... |
| 3 | 3 | [four, score, seven, year, ago, father, brough... |
| 4 | 4 | [constitut, unit, state, america, 1787, peopl,... |
| 5 | 5 | [man, think, highli, patriot, well, abil, wort... |
| 6 | 6 | [name, god, amen, whose, name, underwritten, l... |
| 7 | 7 | [fellow, countrymen, second, appear, take, oat... |
| 8 | 8 | [fellow, citizen, unit, state, complianc, cust... |


In [19]:
df_tokens = df_corpus.explode('text')
df_tokens = df_tokens.rename(columns={'text': 'tokens'})

In [20]:
df_termfreq = (df_tokens
               .groupby(by=['docID', 'tokens'])
               .agg({'tokens': 'count'})
               .rename(columns={'tokens': 'term_frequency'})
               .reset_index()
               .rename(columns={'tokens': 'term'}))

In [21]:
df_docfreq = (df_termfreq
              .groupby(['docID', 'term'])
              .size()
              .unstack()
              .sum()
              .reset_index()
              .rename(columns={0: 'document_frequency'}))
df_docfreq = df_docfreq.drop(labels=[0,1,2,3], axis=0) #first 4 rows were terms that did not provide meaning

In [22]:
df_termfreq = df_termfreq.merge(df_docfreq)

In [23]:
documents_in_corpus = df_termfreq['docID'].nunique()

In [24]:
import numpy as np

df_termfreq['idf'] = np.log((1 + documents_in_corpus) / (1 + df_termfreq['document_frequency'])) + 1

In [25]:
df_termfreq['tfidf'] = df_termfreq['term_frequency'] * df_termfreq['idf']
df_termfreq.sort_values(by=['term_frequency'], ascending=False)

| | docID | term | term_frequency | document_frequency | idf | tfidf |
|---|---|---|---|---|---|---|
| 936 | 4 | shall | 191 | 9.0 | 1.000000 | 191.000000 |
| 955 | 4 | state | 132 | 5.0 | 1.510826 | 199.428982 |
| 1062 | 4 | unit | 55 | 5.0 | 1.510826 | 83.095409 |
| 206 | 8 | constitut | 34 | 6.0 | 1.356675 | 46.126948 |
| 2403 | 4 | presid | 34 | 3.0 | 1.916291 | 65.153885 |
| ... | ... | ... | ... | ... | ... | ... |
| 1424 | 2 | belabor | 1 | 1.0 | 2.609438 | 2.609438 |
| 1426 | 2 | believ | 1 | 3.0 | 1.916291 | 1.916291 |
| 1427 | 7 | believ | 1 | 3.0 | 1.916291 | 1.916291 |
| 1430 | 4 | best | 1 | 3.0 | 1.916291 | 1.916291 |
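The `idf` and `tfidf` columns above can be checked by hand. The cells use the smoothed formula idf = ln((1 + N) / (1 + df)) + 1, where N is the number of documents and df is the number of documents containing the term. The sketch below recomputes two rows of the table using only the standard library. The helper name `smoothed_idf` is introduced here for illustration.

```python
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 9  # documents in the corpus

# Document frequencies and term frequencies taken from the table above.
idf_state = smoothed_idf(n_docs, 5)  # "state" occurs in 5 of the 9 documents
idf_shall = smoothed_idf(n_docs, 9)  # "shall" occurs in all 9 documents

tfidf_state = 132 * idf_state  # "state" occurs 132 times in document 4
tfidf_shall = 191 * idf_shall  # "shall" occurs 191 times in document 4

print(idf_state, tfidf_state)
print(idf_shall, tfidf_shall)
```

A term that appears in every document, such as "shall", receives the minimum IDF of 1 rather than 0. Its TF-IDF score therefore equals its raw count, which is why it still ranks near the top for a long document.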


In [26]:
from sklearn import preprocessing
df_termfreq['tfidf_norm'] = preprocessing.normalize(df_termfreq[['tfidf']], axis=0, norm='l2')

In [27]:
top_n_terms = df_termfreq.sort_values(by=['docID', 'tfidf'], ascending=[True, False]).groupby(['docID']).head(2)

In [28]:
top_n_terms.head(10)

| | docID | term | term_frequency | document_frequency | idf | tfidf | tfidf_norm |
|---|---|---|---|---|---|---|---|
| 1072 | 0 | us | 11 | 5.0 | 1.510826 | 16.619082 | 0.037234 |
| 952 | 0 | state | 10 | 5.0 | 1.510826 | 15.108256 | 0.033849 |
| 933 | 1 | shall | 17 | 9.0 | 1.0 | 17.0 | 0.038088 |
| 953 | 1 | state | 8 | 5.0 | 1.510826 | 12.086605 | 0.027079 |
| 590 | 2 | let | 16 | 4.0 | 1.693147 | 27.090355 | 0.060695 |
| 1073 | 2 | us | 12 | 5.0 | 1.510826 | 18.129907 | 0.040619 |
| 1927 | 3 | dedic | 5 | 1.0 | 2.609438 | 13.04719 | 0.029232 |
| 1926 | 3 | dead | 3 | 1.0 | 2.609438 | 7.828314 | 0.017539 |
| 955 | 4 | state | 132 | 5.0 | 1.510826 | 199.428982 | 0.446811 |
| 936 | 4 | shall | 191 | 9.0 | 1.0 | 191.0 | 0.427926 |
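One thing to note about `tfidf_norm`: with `axis=0`, the whole `tfidf` column is scaled as a single vector, so the long document 4 dominates every other document. By default, scikit-learn's `TfidfVectorizer` (`norm='l2'`) instead scales each document's vector to unit length separately. The sketch below shows that per-document variant using only the standard library. The scores are illustrative values loosely based on the table, and `l2_normalize` is a hypothetical helper.

```python
import math

# Illustrative TF-IDF scores for a short and a long document (not exact corpus values).
tfidf = {
    "short_doc": {"dedic": 13.0, "dead": 7.8, "nation": 5.0},
    "long_doc": {"state": 199.4, "shall": 191.0, "presid": 65.2},
}

def l2_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale one document's scores so the vector has Euclidean length 1."""
    norm = math.sqrt(sum(value * value for value in scores.values()))
    return {term: value / norm for term, value in scores.items()}

# Normalize each document on its own, so document length no longer matters.
normalized = {doc: l2_normalize(scores) for doc, scores in tfidf.items()}

for doc, scores in normalized.items():
    top_term = max(scores, key=scores.get)
    print(doc, top_term, round(scores[top_term], 3))
```

After per-document normalization, the top terms of the short and long documents have scores on the same scale, which makes values comparable across documents. The ranking of terms within a document is unchanged.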


#### Method 2 (with heatmap)

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [30]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

In [31]:
df_rawcorpus = pd.DataFrame({"docID": range(len(corpus)), "text": corpus})

In [32]:
vectors = tfidf_vectorizer.fit_transform(df_rawcorpus['text'])

In [33]:
tfidf_df = pd.DataFrame(vectors.toarray(), index=df_rawcorpus.index.values, columns=tfidf_vectorizer.get_feature_names_out())

In [34]:
tfidf_df = (tfidf_df
            .stack()
            .reset_index()
            .rename(columns={0: 'tfidf', 'level_0': 'docID', 'level_1': 'term'})
           )

In [35]:
n = 2 # keep the n highest-scoring terms for each document
top_tfidf = (tfidf_df
             .sort_values(by=['docID','tfidf'], ascending=[True,False])
             .groupby(['docID'])
             .head(n)
            )

In [36]:
# Import altair for graphing the n highest terms in a heatmap

import altair as alt

# adding a little randomness to break ties in term ranking
top_tfidf_rand = top_tfidf.copy()
top_tfidf_rand['tfidf'] = top_tfidf_rand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

base = alt.Chart(top_tfidf_rand).encode(
    x = 'rank:O',
    y = 'docID:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["docID"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the two superimposed layers (heatmap and text labels)
(heatmap + text).properties(width = 800)

# Submission Instructions

Please submit your assignment as a Jupyter Notebook or R Markdown file. You can submit your assignment as a link to a Google Colab notebook or a link to a GitHub repository. If you are submitting a link to a GitHub repository, please make sure that your repository is public. If you email the notebook to me, please zip the file before sending it.