# Text Analytics | BAIS:6100
# Module 6: Web Scraping

Instructor: Kang-Pyo Lee 

Topics to be covered:
- Fetching content from a website using <b>requests</b>
- Parsing HTML code using <b>BeautifulSoup</b>

In [None]:
# ! pip install --user --upgrade bs4 requests

## Fetch HTML content from a webpage using Requests

https://fivethirtyeight.com/features/why-was-the-national-polling-environment-so-off-in-2020/

In [None]:
url = "https://fivethirtyeight.com/features/why-was-the-national-polling-environment-so-off-in-2020/"

### *** Please run the cells for HTTP requests only when needed. 

In [None]:
import requests
r = requests.get(url)

https://requests.kennethreitz.org/

In [None]:
r.content

Note that the HTML content you have retrieved does not always correspond to what you actually see in a web browser. Web sites can distinguish access by a program from normal human access through a web browser. Some web sites do not care about program access, whereas others do and block it. When that happens, take a close look at the HTML content and you will see that it does not contain the information you expected.
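
One way to notice this early is to check the response before parsing it. The sketch below is a minimal, hypothetical helper (`looks_blocked` and its `expected_snippet` argument are not part of Requests). It assumes that a blocked request shows up either as a non-200 status code or as a page that lacks a string you know should be there.

```python
def looks_blocked(status_code, html_text, expected_snippet):
    """Return True if the response is probably not the page a browser would show.

    Hypothetical helper: any non-200 status, or a body that lacks
    `expected_snippet`, is treated as a sign that the site served something else.
    """
    if status_code != 200:
        return True
    return expected_snippet not in html_text


# Stand-in values for illustration only.
print(looks_blocked(200, "<h1>Article title</h1>", "Article title"))
print(looks_blocked(403, "<h1>Access denied</h1>", "Article title"))
```

With a real response, this could be called as `looks_blocked(r.status_code, r.text, "a phrase from the article")`. Some sites also respond differently when a `User-Agent` header is passed through the `headers` argument of `requests.get`; whether that helps depends on the site.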

## Load the fetched content as a BeautifulSoup object

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.content, "html.parser")

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

## Extract the title of the webpage

Do not confuse the title of a webpage with the title of an article.

In [None]:
soup.title

In [None]:
soup.title.text

## Extract the title of the article

If you want to extract information from a webpage, always start by identifying the corresponding HTML element in the HTML code using the Inspect feature of the Chrome Browser. 

When searching for an HTML element using <b>BeautifulSoup</b>, you can use either the <b>find</b> method or <b>find_all</b> method.
- The <b>find</b> method returns the first found element. 
- The <b>find_all</b> method returns a list of all found elements in order. 

You can simply use the <b>find</b> method if you are confident that only one element matches or that the element you are searching for is the first match.
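
The difference between the two methods can be sketched on a small, made-up HTML snippet (the markup and the `soup_demo` name are for illustration only and do not come from the target website):

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
html = """
<ul>
  <li class="item">first</li>
  <li class="item">second</li>
  <li class="item">third</li>
</ul>
"""
soup_demo = BeautifulSoup(html, "html.parser")

first = soup_demo.find("li", {"class": "item"})        # a single element
every = soup_demo.find_all("li", {"class": "item"})    # all matching elements, in order

print(first.text)
print([li.text for li in every])

# When nothing matches, find returns None and find_all returns an empty list.
print(soup_demo.find("table"))
print(soup_demo.find_all("table"))
```

Because <b>find</b> returns `None` when nothing matches, chaining `.text` directly onto it raises an error in that case, which is worth keeping in mind when a page layout changes.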

In [None]:
soup.find(name="h1", attrs={"class": "article-title article-title-single entry-title"})

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find

In addition to specifying the tag of the element you are searching for in the first parameter `name`, you can use any clues that help identify the element by specifying them in the second parameter `attrs`. Note the `attrs` parameter takes a dictionary. 

In the above example, if the class name is unique, you will always be able to find the element. If not, the <b>find</b> method will return the first element found, which might not be the element you are trying to find. In that case, you had better use different clues in the `attrs` parameter or consider using the <b>find_all</b> method.
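
When choosing clues, it may help to know how a class value containing spaces is matched. The sketch below uses a made-up one-line document (`soup_demo` is a name introduced here): a full class string is compared as an exact string, a single class name matches any element carrying that class, and a CSS selector passed to `select_one` ignores the order of the classes.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
html = '<h1 class="article-title article-title-single entry-title"> Demo title </h1>'
soup_demo = BeautifulSoup(html, "html.parser")

# The full class string, exactly as written in the HTML, matches.
exact = soup_demo.find("h1", {"class": "article-title article-title-single entry-title"})

# A single class name matches any element that carries that class.
single = soup_demo.find("h1", {"class": "entry-title"})

# The same classes in a different order do not match as a string.
reordered = soup_demo.find("h1", {"class": "entry-title article-title"})

# A CSS selector does not depend on the order of the classes.
css = soup_demo.select_one("h1.entry-title.article-title")

print(exact is not None, single is not None, reordered is None, css is not None)
```

Matching on one distinctive class name is often less brittle than matching the whole class string, although it may also match more elements than intended.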

In [None]:
soup.find("h1", {"class": "article-title article-title-single entry-title"}).text

In [None]:
soup.find("h1", {"class": "article-title article-title-single entry-title"}).text.strip()

## Extract the author name of the article

In [None]:
soup.find("a", {"class": "author url fn"})

In [None]:
soup.find("a", {"class": "author url fn"}).text

In [None]:
soup.find("a", {"class": "author url fn"})["href"]

## Fetch an image from the webpage

In [None]:
soup.find("picture", {"class": "featured-picture"})

When the element you are searching for has no unique clues, you should try finding the parent/ancestor element, by which you can narrow down the scope of search, and then you can start another search from there. This is the beauty of hierarchical search. 

In [None]:
soup.find("picture", {"class": "featured-picture"}).find("img")

Because there is only one img element in the picture element found above, you do not have to add any additional attributes, or clues, in the find method. 

In [None]:
soup.find("picture", {"class": "featured-picture"}).find("img")["src"]

In [None]:
img_url = soup.find("picture", {"class": "featured-picture"}).find("img")["src"]
img_url

In [None]:
from IPython.display import Image

Image(url=img_url)

Note that this is not saving the image. It is just displaying the image fetched from the website. 

In [None]:
import os
os.makedirs("outcome", exist_ok=True)     # Create the output folder if it is missing.
r = requests.get(img_url)

with open("outcome/photo.jpg", "w+b") as fw:
    fw.write(r.content)

In [None]:
Image("outcome/photo.jpg")

## Extract the body text of the article

In [None]:
soup.find("div", attrs={"class": "entry-content single-post-content"})

In [None]:
soup.find("div", attrs={"class": "entry-content single-post-content"}).text

In [None]:
soup.find("div", attrs={"class": "entry-content single-post-content"}).text.replace("\n", " ").strip()
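
Replacing newlines with spaces can leave runs of several spaces behind. A minimal sketch of one alternative, using a made-up string in place of the article body, is to split on any whitespace and join the pieces with single spaces:

```python
# Made-up text standing in for the .text of the article body.
raw_text = "\n\n  First paragraph.\n\nSecond   paragraph.\t\n"

replaced = raw_text.replace("\n", " ").strip()    # the approach used in the cell above
collapsed = " ".join(raw_text.split())            # collapses every run of whitespace

print(repr(replaced))
print(repr(collapsed))
```

Which form is preferable depends on whether paragraph breaks need to be preserved for later analysis.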

## Extract a list of article titles

https://fivethirtyeight.com/features/

In [None]:
url = "https://fivethirtyeight.com/features/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

In [None]:
h2_list = soup.find_all(name="h2", attrs={"class": "article-title entry-title"})
h2_list

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all

The <b>find_all</b> method looks through a tag's descendants and retrieves all descendants that match your filters.

In [None]:
len(h2_list)

In [None]:
for h2 in h2_list:
    print(h2.text.strip())

In [None]:
for h2 in h2_list:
    url = h2.find("a")["href"]     # Starting from each h2 element, go deeper by one level to find an a element 
    print(url)

The hierarchical search of BeautifulSoup is very useful in navigating the nested HTML elements. 
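
The same pattern can be made a little more defensive. The sketch below runs on made-up HTML (the `example.com` URLs, `base_url`, and `soup_demo` are placeholders introduced here). It skips headings that contain no link and uses `urljoin` from the standard library to turn relative links into absolute ones:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com/features/"    # placeholder, not the real site

# Made-up HTML used only for illustration.
html = """
<h2 class="article-title entry-title"><a href="/features/one/">One</a></h2>
<h2 class="article-title entry-title">No link here</h2>
<h2 class="article-title entry-title"><a href="https://example.com/features/two/">Two</a></h2>
"""
soup_demo = BeautifulSoup(html, "html.parser")

links = []
for h2 in soup_demo.find_all("h2", {"class": "article-title entry-title"}):
    a = h2.find("a")
    if a is None or not a.get("href"):    # heading without a usable link
        continue
    links.append(urljoin(base_url, a["href"]))

print(links)
```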

## Handle pagination

In [None]:
urls = ["https://fivethirtyeight.com/features/"]

for i in range(2, 101):     # range(2, 101) yields the integers from 2 to 100.
    url = "https://fivethirtyeight.com/features/page/{}/".format(i)
    urls.append(url)
    
urls

Try to get all of the URLs of the target webpages first before trying to get the contents from those webpages. At this point, it is important to find a rule for creating the URLs. 
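
The rule used above (the first page has no page number, later pages follow `page/N/`) can be wrapped in a small function. This is only a sketch; `listing_urls` and its parameters are names introduced here, and the number of pages that actually exist has to be checked on the site itself.

```python
def listing_urls(base, last_page):
    """Build the list page URLs: the base URL first, then base + "page/N/" for N = 2..last_page."""
    urls = [base]
    urls.extend("{}page/{}/".format(base, i) for i in range(2, last_page + 1))
    return urls


pages = listing_urls("https://fivethirtyeight.com/features/", 4)
for page in pages:
    print(page)
```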

In [None]:
for url in urls:
    print(url)              # Do whatever you want with each web page.

## Write & read an HTML file

In [None]:
url = "https://fivethirtyeight.com/features/why-was-the-national-polling-environment-so-off-in-2020/"
r = requests.get(url)

In [None]:
url[len("https://fivethirtyeight.com/features/"):-1]

In [None]:
file_name = url[len("https://fivethirtyeight.com/features/"):-1] + ".html"
file_name
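
The slicing above assumes that the URL starts with exactly that prefix and ends with a slash. A somewhat more forgiving sketch, using `urlparse` from the standard library, takes the last segment of the URL path instead (`file_name_from_url` and the `"index"` fallback are choices made here for illustration):

```python
from urllib.parse import urlparse


def file_name_from_url(url):
    """Use the last path segment of the URL as the file name."""
    path = urlparse(url).path                  # e.g. "/features/some-article/"
    slug = path.rstrip("/").split("/")[-1]     # last non-empty segment, if any
    return (slug or "index") + ".html"         # fall back to "index" for a bare domain


article = "https://fivethirtyeight.com/features/why-was-the-national-polling-environment-so-off-in-2020/"
print(file_name_from_url(article))
```

Because only the path is used, a query string or a missing trailing slash does not change the resulting name.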

In [None]:
with open("outcome/" + file_name, "w+b") as fw:
    fw.write(r.content)

In [None]:
with open("outcome/" + file_name, "r+b") as fr:
    soup = BeautifulSoup(fr.read(), "html.parser")
    
    print(soup.title.text)          # Do whatever you want with the saved web page.

## Automate the process of saving all articles on the Features list

In [None]:
urls = ["https://fivethirtyeight.com/features/"]

for i in range(2, 10):     # range(2, 10) yields the integers from 2 to 9.
    url = "https://fivethirtyeight.com/features/page/{}/".format(i)
    urls.append(url)

urls

In [None]:
import os

os.makedirs("outcome/HTMLs", exist_ok=True)     # Also creates "outcome" if it is missing.

In [None]:
import time     # Necessary for the sleep function.

In [None]:
for url in urls:
    print(url)
    
    ####################################################
    # Get the content of a page
    ####################################################
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    
    ####################################################
    # Get the list of articles
    ####################################################
    h2_list = soup.find_all(name="h2", attrs={"class": "article-title entry-title"})
    
    for h2 in h2_list:
        ####################################################
        # Find the anchor tag
        ####################################################
        a = h2.find("a")
        
        ####################################################
        # Extract the title & URL of an article
        ####################################################
        title = a.text
        article_url = a["href"]
        
        ####################################################
        # Fetch the content and save it as an HTML file
        ####################################################
        r2 = requests.get(article_url)
                
        file_name = article_url[len("https://fivethirtyeight.com/features/"):-1] + ".html"
        with open("outcome/HTMLs/" + file_name, "w+b") as fw:
            fw.write(r2.content)
        
        print("- " + file_name + " saved.")
        
        ####################################################
        # Sleep for a second to not overload the web site
        ####################################################
        time.sleep(1)
    
    print()
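
In a long run, a single failed request stops the whole loop above. One possible way to keep going is sketched below. It is not the method used in this module: `fetch_all` is a hypothetical helper that takes the download function as an argument, so the example can run here with a fake downloader (`fake_fetch`) and placeholder URLs instead of real HTTP requests.

```python
import time


def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pause between calls, and record failures instead of stopping."""
    pages, failures = {}, {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception as exc:       # deliberately broad in this sketch
            failures[url] = repr(exc)
        time.sleep(delay)              # be polite to the web site
    return pages, failures


def fake_fetch(url):
    """Stand-in for a real download; fails for URLs containing 'bad'."""
    if "bad" in url:
        raise ValueError("simulated failure")
    return b"<html>" + url.encode("utf8") + b"</html>"


pages, failures = fetch_all(
    ["https://example.com/a/", "https://example.com/bad/"], fake_fetch, delay=0
)
print(list(pages))
print(failures)
```

With Requests, the `fetch` argument could be a small function that calls `requests.get(url, timeout=10)`, then `raise_for_status()` on the response, and returns its `content`.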

## Extract information from all HTML files & save it in a CSV file

In [None]:
os.listdir("outcome/HTMLs")

In [None]:
with open("outcome/html_metadata.csv", "w", encoding="utf8") as fw:
    ####################################################
    # Column names on the first row
    ####################################################
    fw.write("file_name\tarticle_title\tarticle_author\n")   # A tab between columns and a new line between rows  

    for file_name in os.listdir("outcome/HTMLs"):
        if not file_name.endswith(".html"):
            continue
        
        ####################################################
        # Column values starting from the second row
        ####################################################
        with open("outcome/HTMLs/" + file_name, "r+b") as fr:
            print(file_name)
            soup = BeautifulSoup(fr.read(), "html.parser")
            article_title = soup.find("h1", {"class": "article-title article-title-single entry-title"}).text.strip()
            article_author = soup.find("a", {"class": "author url fn"}).text
            
            #####################################################################
            # Remove all possible tabs, as tab is being used as column delimiter
            #####################################################################
            article_title = article_title.replace("\t", "")
            article_author = article_author.replace("\t", "")
            
            fw.write("{}\t{}\t{}\n".format(file_name, article_title, article_author))

In [None]:
with open("outcome/html_metadata.csv", "w", encoding="utf8") as fw:
    ####################################################
    # Column names on the first row
    ####################################################
    fw.write("file_name\tarticle_title\tarticle_author\n")

    for file_name in os.listdir("outcome/HTMLs"):
        if not file_name.endswith(".html"):
            continue
        
        ####################################################
        # Column values starting from the second row
        ####################################################
        with open("outcome/HTMLs/" + file_name, "r+b") as fr:
            print(file_name)
            soup = BeautifulSoup(fr.read(), "html.parser")
            article_title = soup.find("h1", {"class": "article-title article-title-single entry-title"}).text.strip()
            
            ####################################################
            # No author exception handling
            ####################################################
            author_tag = soup.find("a", {"class": "author url fn"})
            if author_tag is None:
                article_author = ""
            else:
                article_author = author_tag.text
            
            ####################################################
            # Remove all possible tabs
            ####################################################
            article_title = article_title.replace("\t", "")
            article_author = article_author.replace("\t", "")
                        
            fw.write("{}\t{}\t{}\n".format(file_name, article_title, article_author))
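
As an alternative to removing tabs by hand, the standard-library `csv` module can write tab-delimited rows and quote any value that contains the delimiter. The sketch below uses made-up rows and an in-memory buffer instead of a file, so the column values are placeholders:

```python
import csv
import io

# Made-up rows standing in for the values extracted from the HTML files.
rows = [
    {"file_name": "a.html", "article_title": "Title with\ta tab", "article_author": "Author A"},
    {"file_name": "b.html", "article_title": "Plain title", "article_author": ""},
]

buffer = io.StringIO()    # stands in for the opened output file
writer = csv.DictWriter(
    buffer,
    fieldnames=["file_name", "article_title", "article_author"],
    delimiter="\t",
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)

# Reading the data back with the same delimiter recovers the original values.
buffer.seek(0)
read_back = list(csv.DictReader(buffer, delimiter="\t"))
print(read_back[0]["article_title"] == rows[0]["article_title"])
```

When writing to a real file this way, the `csv` documentation recommends opening it with `newline=""`.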

## Exercises - HTML Parsing Using BeautifulSoup