# OPAN 6609 Text Analytics

**Assignment 1: Web scraping and named entity recognition**

https://en.wikipedia.org/wiki/Georgetown_University

Mike Johnson
****

## Set up

In [1]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import spacy

## 1. Review the source code of the webpage using the inspector tool of your browser. List three types of HTML tags you found on the page and explain what elements of the page are created with those tags.

| HTML Tag | Purpose | Elements Created |
| --- | --- | --- |
| `<head>` | Metadata Tag | Contains the metadata of the page such as title (Georgetown University), character set (UTF-8), styles, scripts and other meta information. |
| `<p>` | Paragraph Tag | Defines the paragraph. Contains the main body of text of the article. |
| `<a>` | Anchor/Link Tag | Defines the hyperlink. Includes links to other Wikipedia pages, external references, and citations. |
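
The tag inventory above can also be checked programmatically. The sketch below is a minimal illustration that counts opening tags with Python's built-in `html.parser`. The `TagCounter` class and the `sample_html` string are hypothetical stand-ins, not the real Wikipedia source.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times each opening tag appears in an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; tag names arrive lowercased.
        self.counts[tag] += 1


# Hypothetical miniature page, not the real Wikipedia source.
sample_html = """
<html>
  <head><title>Georgetown University</title></head>
  <body>
    <p>Georgetown is in <a href="/wiki/Washington,_D.C.">Washington, D.C.</a></p>
    <p>It was founded by <a href="/wiki/John_Carroll">John Carroll</a>.</p>
  </body>
</html>
"""

counter = TagCounter()
counter.feed(sample_html)
tag_counts = dict(counter.counts)
print(tag_counts)
```

Feeding the downloaded page's HTML to the same parser instead of `sample_html` should give a rough count of each tag type on the real page.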


****

## 2. Issue a request for the webpage using the requests library and display the HTTP response code for the request.

In [2]:
# Make request to Georgetown University Wikipedia page
url = "https://en.wikipedia.org/wiki/Georgetown_University"


# The initial request, sent without custom headers, returned 403 Forbidden.
# Adding a browser-like User-Agent header keeps Wikipedia from rejecting the request as coming from a bot.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}


response = requests.get(url, headers=headers)
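
An alternative to imitating a browser is to send a User-Agent that identifies the script and gives a contact address, which is what Wikimedia's User-Agent policy asks of automated clients. The sketch below only builds the header dictionary. The helper `build_headers` and all three values passed to it are hypothetical, and whether this particular string is accepted has not been verified here.

```python
def build_headers(app_name: str, version: str, contact: str) -> dict:
    """Return request headers with a descriptive User-Agent string."""
    return {"User-Agent": f"{app_name}/{version} ({contact})"}


# All three values below are hypothetical placeholders.
example_headers = build_headers("opan6609-scraper", "0.1", "student@example.edu")
print(example_headers["User-Agent"])
```

The result could be passed to the request in place of `headers`, for example `requests.get(url, headers=example_headers, timeout=10)`, where `timeout` keeps the call from waiting indefinitely.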

### 2.1. What is the response status code and what does it indicate?

In [3]:
# Display the response code
print(f"HTTP Response Code: {response.status_code}")
print(f"Response Status: {response.reason}")

HTTP Response Code: 200
Response Status: OK


A response code of 200 means that the request was successful.

### 2.2. What do status codes in the 400s mean?

Status codes in the 400s mean that the error was caused by the client. Some example codes include:
* `400 Bad Request`: The server would not process the request because of something it considers a client error, such as malformed syntax.
* `401 Unauthorized`: The request lacks valid authentication credentials.
* `403 Forbidden`: The server understood the request but refused to process it. This was the initial error received before adding headers to the request.

### 2.3. What do status codes in the 500s mean?

Status codes in the 500s mean that the server failed to fulfill an apparently valid request. Some example codes include:
* `500 Internal Server Error`: The server encountered an unexpected condition that prevented it from fulfilling the request.
* `502 Bad Gateway`: A server acting as a gateway or proxy received an invalid response from the upstream server.
* `503 Service Unavailable`: The server is not ready to handle the request, typically because it is overloaded or down for maintenance.
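
The client/server split described in 2.2 and 2.3 can be sketched with the standard library's `http.HTTPStatus` enum. The helper name `describe_status` is an illustrative assumption. In the notebook itself, `response.status_code` and `response.reason` already provide the code and phrase.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Map an HTTP status code to its standard phrase and broad class."""
    if 200 <= code <= 299:
        family = "success"
    elif 400 <= code <= 499:
        family = "client error"
    elif 500 <= code <= 599:
        family = "server error"
    else:
        family = "other"
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code is not defined in the standard library's enum.
        phrase = "Unknown"
    return f"{code} {phrase} ({family})"


for code in (200, 400, 401, 403, 500, 502, 503):
    print(describe_status(code))
```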

***

## 3. Extract all raw text contained in paragraph tags on the web page. Store the results in a single cell in a pandas DataFrame.

In [4]:
# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Extract all paragraph text
paragraphs = soup.find_all('p')
paragraph_text = ' '.join([p.get_text() for p in paragraphs])

# Create DataFrame with the extracted text
text_df = pd.DataFrame({'raw_text': [paragraph_text]})
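
Text pulled from Wikipedia paragraphs may still carry bracketed footnote markers such as `[1]` and stray line breaks, which can end up attached to entity strings. The sketch below shows one optional cleanup step using only the `re` module. The function `clean_paragraph_text` and the sample string are hypothetical, and this step is not part of the original analysis.

```python
import re


def clean_paragraph_text(text: str) -> str:
    """Remove bracketed numeric footnote markers and collapse whitespace."""
    without_markers = re.sub(r"\[\d+\]", "", text)
    return re.sub(r"\s+", " ", without_markers).strip()


# Hypothetical sample, shaped like text returned by p.get_text().
sample = (
    "Georgetown University is in Washington, D.C.[1]\n"
    " It was founded by Bishop John Carroll.[2][3] "
)
cleaned = clean_paragraph_text(sample)
print(cleaned)
```

If used, it would be applied to `paragraph_text` before the text is stored or passed to the NLP pipeline. Results reported in this notebook were produced without it.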

****

## 4. Load and instantiate a spaCy NLP pipeline. Apply the pipeline to the scraped text to extract the named entities via named entity recognition (NER). Store the results in a new dataframe called ner_df with two columns: the extracted entities/text and its corresponding entity label.

You might need to download the spaCy model first. In either your terminal or Anaconda Prompt, run the following command: `python -m spacy download en_core_web_sm`

In [5]:
# Load spaCy model
nlpspacy = spacy.load("en_core_web_sm")

In [6]:
# Process the text through the NLP pipeline
doc = nlpspacy(paragraph_text)

In [7]:
# Extract the named entities
entities = [(ent.text, ent.label_) for ent in doc.ents]

# Create a dataframe with named entities
ner_df = pd.DataFrame(entities, columns = ['entity_text', 'entity_label'])

ner_df

|  | entity_text | entity_label |
| --- | --- | --- |
| 0 | Georgetown University | ORG |
| 1 | Washington | GPE |
| 2 | D.C. | GPE |
| 3 | United States | GPE |
| 4 | Bishop John Carroll | PERSON |
| ... | ... | ... |
| 1240 | Andrew Morrison | PERSON |
| 1241 | RaMell | PERSON |
| 1242 | Tony Award-winning | PERSON |
| 1243 | John Guare | PERSON |


****

## 5. Print in descending order the top 10 most frequently mentioned people and the number of times each is mentioned. Print in descending order the top 10 most frequently mentioned organizations and the number of times each is mentioned. 

In [8]:
# Print the top 10 most frequently mentioned people and the number of times each is mentioned
names_top10 = ner_df[ner_df['entity_label'] == 'PERSON'].groupby('entity_text').size().sort_values(ascending = False).head(10)
names_top10

entity_text
Healy Hall             3
ROTC                   3
Laura Chinchilla       2
Fulbright Scholars     2
Antonin Scalia         2
Dahlgren Quadrangle    2
George Tenet           2
Henry Kissinger        2
Hilltoss               2
Barack Obama           2
dtype: int64

In [9]:
# Print the top 10 most frequently mentioned organizations and the number of times each is mentioned
orgs_top10 = ner_df[ner_df['entity_label'] == 'ORG'].groupby('entity_text').size().sort_values(ascending = False).head(10)
orgs_top10

entity_text
Georgetown                          58
Georgetown University               13
SFS                                  6
the School of Foreign Service        6
NCAA                                 5
the McDonough School of Business     4
CIA                                  4
the Law Center                       3
the School of Continuing Studies     3
State                                3
dtype: int64
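
The filter, count, and sort logic in the two cells above can be sketched without pandas using `collections.Counter`. The helper `top_entities` and the sample pairs are hypothetical. They mirror the shape of the `entities` list, not its actual contents.

```python
from collections import Counter


def top_entities(entities, label, n=10):
    """Return the n most frequent entity texts carrying the given label."""
    counts = Counter(text for text, ent_label in entities if ent_label == label)
    return counts.most_common(n)


# Hypothetical (text, label) pairs in the same shape as the `entities` list.
sample_entities = [
    ("Georgetown", "ORG"),
    ("Henry Kissinger", "PERSON"),
    ("Georgetown", "ORG"),
    ("NCAA", "ORG"),
    ("Georgetown", "ORG"),
    ("Henry Kissinger", "PERSON"),
    ("Antonin Scalia", "PERSON"),
]

print(top_entities(sample_entities, "ORG", n=2))
print(top_entities(sample_entities, "PERSON", n=2))
```

Within pandas, `Series.value_counts()` performs the same count and returns the results in descending order by default, so it could replace the `groupby(...).size().sort_values(...)` chain.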

## 6. Do #3, #4, and #5 again but instead extract all text contained within anchor `<a>` tags on the web page. Do the results differ from the results in #5?

In [10]:
# Extract all hyperlink text
hyperlinks = soup.find_all('a')
hyperlinks_text = ' '.join([a.get_text() for a in hyperlinks])

# Create DataFrame with the extracted text
hyperlinks_df = pd.DataFrame({'raw_text': [hyperlinks_text]})

In [11]:
# Process the text through the NLP pipeline
doc_a = nlpspacy(hyperlinks_text)

In [12]:
# Extract the named entities
entities_a = [(ent.text, ent.label_) for ent in doc_a.ents]

# Create a dataframe with named entities
ner_df_a = pd.DataFrame(entities_a, columns = ['entity_text', 'entity_label'])

ner_df_a

|  | entity_text | entity_label |
| --- | --- | --- |
| 0 | Contents Current | ORG |
| 1 | Community | ORG |
| 2 | Upload | ORG |
| 3 | 1 | CARDINAL |
| 4 | 1.1 | CARDINAL |
| ... | ... | ... |
| 1688 | June 2020 Commons | DATE |
| 1689 | Wikidata Articles | ORG |
| 1690 | Kartographer | PERSON |
| 1691 | 4.0 | CARDINAL |


In [13]:
# Print the top 10 most frequently mentioned people and the number of times each is mentioned
names_top10_a = ner_df_a[ner_df_a['entity_label'] == 'PERSON'].groupby('entity_text').size().sort_values(ascending = False).head(10)

# Create a DataFrame with both series
names_df = pd.DataFrame({
    'Using <a>': names_top10_a,
    'Using <p>': names_top10
})

print(names_df)

                       Using <a>  Using <p>
entity_text                                
Antonin Scalia               3.0        2.0
Articles                     3.0        NaN
Barack Obama                 NaN        2.0
Bill Clinton                 2.0        NaN
Chris Sacca                  2.0        NaN
Dahlgren Quadrangle          NaN        2.0
David Malpass                2.0        NaN
Edward Douglass White        2.0        NaN
Fulbright Scholars           NaN        2.0
George Tenet                 NaN        2.0
George Washington            2.0        NaN
Healy Hall                   NaN        3.0
Henry Kissinger              2.0        2.0
Hilltoss                     NaN        2.0
James Madison                2.0        NaN
Laura Chinchilla             NaN        2.0
Martha                       2.0        NaN
ROTC                         NaN        3.0


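Yes, the results differ. For people, the `<a>` text produced the cleaner top 10: most entries are real people (for example Antonin Scalia, Bill Clinton, and James Madison), with "Articles" as the clear misclassification and "Martha" as an incomplete name. The `<p>` top 10 contains at least four non-person entities labeled as PERSON: Healy Hall, ROTC, Fulbright Scholars, and Dahlgren Quadrangle. Only Antonin Scalia and Henry Kissinger appear in both lists.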
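
The NaN values in the printed table come from pandas aligning the two series on the union of their index labels. The overlap can be made explicit with plain set operations, as in the sketch below. The counts are copied from the printed table above, and the variable names are illustrative.

```python
# Counts copied from the two PERSON columns in the printed table above.
people_a = {
    "Antonin Scalia": 3, "Articles": 3, "Bill Clinton": 2, "Chris Sacca": 2,
    "David Malpass": 2, "Edward Douglass White": 2, "George Washington": 2,
    "Henry Kissinger": 2, "James Madison": 2, "Martha": 2,
}
people_p = {
    "Healy Hall": 3, "ROTC": 3, "Laura Chinchilla": 2, "Fulbright Scholars": 2,
    "Antonin Scalia": 2, "Dahlgren Quadrangle": 2, "George Tenet": 2,
    "Henry Kissinger": 2, "Hilltoss": 2, "Barack Obama": 2,
}

# Dictionary key views support set operations directly.
shared = sorted(people_a.keys() & people_p.keys())
only_a = sorted(people_a.keys() - people_p.keys())
only_p = sorted(people_p.keys() - people_a.keys())

print("In both lists:", shared)
print("Only in <a>:", only_a)
print("Only in <p>:", only_p)
```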

In [14]:
# Print the top 10 most frequently mentioned organizations and the number of times each is mentioned
orgs_top10_a = ner_df_a[ner_df_a['entity_label'] == 'ORG'].groupby('entity_text').size().sort_values(ascending = False).head(10)

# Create a DataFrame with both series
orgs_df = pd.DataFrame({
    'Using <a>': orgs_top10_a,
    'Using <p>': orgs_top10
})

print(orgs_df)

                                  Using <a>  Using <p>
entity_text                                           
CA                                      3.0        NaN
CIA                                     NaN        4.0
Georgetown                             16.0       58.0
Georgetown University                  26.0       13.0
Georgetown University Law Center        3.0        NaN
Hoya                                    6.0        NaN
ISSN                                    5.0        NaN
ISSN 0362-4331                          4.0        NaN
NCAA                                    3.0        5.0
O'Neill & Williams                      3.0        NaN
SFS                                     NaN        6.0
State                                   NaN        3.0
U.S. News & World Report                7.0        NaN
the Law Center                          NaN        3.0
the McDonough School of Business        NaN        4.0
the School of Continuing Studies        NaN        3.0
the School

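For organizations, the pattern reverses. The `<a>` top 10 includes entries that look like citation residue rather than organizations discussed in the article, such as "ISSN", "ISSN 0362-4331", and "CA". The `<p>` top 10 consists mostly of actual organizations (for example NCAA, CIA, and the McDonough School of Business), although the bare "State" is ambiguous. Counts are generally higher in `<p>` ("Georgetown" appears 58 times versus 16), with "Georgetown University" as the exception (26 in `<a>` versus 13 in `<p>`).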
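
One reason the counts are hard to compare is that the same organization can appear under several names, for example "SFS" and "the School of Foreign Service" in the `<p>` output. The sketch below merges such variants before counting. The `ALIASES` map and the `normalize` helper are hypothetical, and treating every "Georgetown" as the university is a simplifying assumption, since the name can also refer to the neighborhood.

```python
from collections import Counter

# Hypothetical alias map; which names count as "the same" is a judgment call.
ALIASES = {
    "Georgetown": "Georgetown University",
    "SFS": "School of Foreign Service",
}


def normalize(name: str) -> str:
    """Drop a leading 'the ' and map known aliases to one canonical name."""
    stripped = name[4:] if name.lower().startswith("the ") else name
    return ALIASES.get(stripped, stripped)


# Counts copied from the <p> organization output in section 5.
orgs_p = {
    "Georgetown": 58, "Georgetown University": 13, "SFS": 6,
    "the School of Foreign Service": 6, "NCAA": 5,
    "the McDonough School of Business": 4, "CIA": 4, "the Law Center": 3,
    "the School of Continuing Studies": 3, "State": 3,
}

merged = Counter()
for name, count in orgs_p.items():
    merged[normalize(name)] += count

print(merged.most_common(3))
```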

****

## 7. Briefly explain web scraping and named entity recognition/extraction in a way that a non-technical co-worker would understand. 

**Web scraping** is a way of extracting data from websites automatically. It's like having a digital assistant that reads and collects information from websites for you. Think of it as reviewing a document and highlighting all the important parts, except that a computer does it for you and saves the results in an organized format.

**Named entity recognition (NER)** adds an extra layer to the highlighter mentioned before: it also categorizes the important names and terms it finds. When you read a document, you can recognize that 'George Washington' is a person while 'Georgetown University' is an institution. NER gives a computer a similar capability. It scans through the text, picks out names of people, organizations, places, and dates, and labels each one with its category.