# Tutorial: Get the top news stories from UCF using web scraping

## Step 1. Verify the website is working

In [None]:
# Verify that the news page is up and running by opening it in a web browser

import webbrowser

ucf_news_site = "https://www.ucf.edu/news/"

webbrowser.open(ucf_news_site)

- Did the website open?
- Can you see the headlines of the first few news stories?
- Hover over a few of the headlines... are they URLs?

## Step 2. Retrieve the HTML of the website and inspect it

In [None]:
# Get the UCF news page without opening a browser window

import re
from curl_cffi import requests

news_response = requests.get(ucf_news_site, impersonate="chrome124")
news_html = news_response.text
news_html = re.sub(r"\s+", " ", news_html)
print("The UCF server responded with status code: ", news_response.status_code)

- What was the status code?
- What does this mean? If you don't remember, go back to the video on web scraping.
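
If you want a quick refresher without leaving the notebook, here is a minimal sketch using Python's built-in `http.HTTPStatus`. The `describe_status` helper is made up for this illustration; it is not part of `curl_cffi` or of the tutorial's own code.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short, human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one that the standard library knows about
        phrase = "Unknown status"

    # The first digit of the code tells us the general outcome
    if 200 <= code < 300:
        outcome = "success"
    elif 300 <= code < 400:
        outcome = "redirect"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "other"
    return f"{code} {phrase} ({outcome})"


for code in (200, 404, 503):
    print(describe_status(code))
```

You could pass `news_response.status_code` to a helper like this to see which family the server's answer falls into.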

In [None]:
# Print out the first 500 characters of the HTML content

import textwrap

print("Here is what the first 500 characters of the response look like:\n")
print(textwrap.fill(news_html[:500], width=80))

Probably pretty difficult to read, right? That's because it's HTML.
We need to parse it, but we'll get to that in a minute.

For now, let's just make sure that the information we need is in there somewhere.

In the cell below, change the value of `news_title` to the title of the first news story
currently on the UCF website. *Make sure you type it exactly as it appears on the website.*

In [None]:
news_title = "11 Lesser-known Facts about the Mayflower and Thanksgiving"

if news_title.lower() in news_html.lower():
    print("Celebrate! The news article title was found in the UCF news page!")
    title_index = news_html.lower().index(news_title.lower())
    print(
        "\nHere is the first instance of the news article title in the HTML "
        "and the surrounding HTML:\n"
    )
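    window = 250
    # Clamp the start at 0 so that a title near the top of the page doesn't
    # produce a negative (wrap-around) slice index
    start = max(0, title_index - window)
    wrapped_text = textwrap.fill(
        news_html[start : title_index + len(news_title) + window],
        width=80,
    )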
    print(wrapped_text)
else:
    print("Uh oh... The news article title was not found in the UCF news page.")

- Was the title of the first news story in the HTML?
- Try replacing the title with the other titles from the website. Are they in the HTML?
- What about the URLs? Are they in the HTML?
- Try replacing the title with some random text. Is that in the HTML?

When you're done, make sure you change the title back to the correct one and run this cell
one more time so that the title is correct for the next step.

## Step 3. Explore the HTML

Now let's parse the HTML so we can extract the information we need.

In [None]:
from bs4 import BeautifulSoup

news_soup = BeautifulSoup(news_html, "html5lib")
print("Let's see the pretty version of the HTML content:\n")
print(news_soup.prettify())

OK, much easier to read, right? However, it's still a bit difficult to find the information we need.

Let's use the `BeautifulSoup` library to see what we can pull from the webpage in general.

In [None]:
print("'a' tags are used for hyperlinks in HTML.")
print("'href' is the attribute that has the URL the hyperlink points to.")
print("So, if we find all 'a' tags and for each one, we get the 'href' "
      "attribute, we can get all the links on the page.")
print("\nHere are all the links on the UCF news page:\n")
for link in news_soup.find_all("a"):
    print(link.get("href"))

Cool, if we wanted to index an entire website, we could do that. 
But we don't want to do that. We just want the news stories.

What if we just wanted all of the images that get displayed on the page?

In [None]:
print("'img' tags are used for images in HTML.")
print("'src' is the attribute that has the URL of the image.")
print("So, if we find all 'img' tags and for each one, we get the 'src' "
      "attribute, we can get all the images on the page.")

print("\nHere are all the images on the UCF news page:\n")
for image in news_soup.find_all("img"):
    print(image.get("src"))


Again cool, if we wanted to grab all of the images from an entire website, 
we could do that. But we don't want to do that. 

What if we just wanted all of the text that gets displayed on the page without 
all of the HTML mark-up?

In [None]:
print("In BeautifulSoup, the .text attribute gives us the text inside a tag.")
print("If we do this for the 'main' tag, we get all text displayed in "
      "the main tag of this page.")

print("\nHere is all the text on the UCF news page:\n")
page_text = textwrap.fill(news_soup.find("main").text, width=80)
print(page_text)

Again cool, if we wanted to grab all of the text from an entire website, 
we could do that. But we don't care about this... We just want the news stories.

...but do you see a trend in what we're doing to get all of the `XYZ` we want 
from the website?
- We're finding the tag that contains the information we want
  - If this tag is unique, we're using `find` to get the tag
  - If this tag is not unique, we're using `find_all` to get a list of
  all such tags
- We're then extracting the information we want from the tag or its contents
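
Here is that pattern in miniature. This is only a sketch: `sample_html` is a tiny made-up page (not the UCF site), and it uses Python's built-in `"html.parser"` so it runs without `html5lib`.

```python
from bs4 import BeautifulSoup

# A tiny made-up page, used only for illustration
sample_html = """
<html><body>
  <main>
    <a href="/one">First</a>
    <a href="/two">Second</a>
  </main>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# 'main' appears once, so find() gives us that single tag
main_tag = sample_soup.find("main")

# 'a' appears more than once, so find_all() gives us a list of tags,
# and we pull the 'href' attribute out of each one
links = [a.get("href") for a in main_tag.find_all("a")]
print(links)

# find() returns None when nothing matches, so check before using the result
print(sample_soup.find("footer"))
```

Notice that `find` hands back `None` rather than raising an error when the tag is missing. That is worth remembering when a page doesn't look the way you expected.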

Let's combine this all together to get the titles of the news stories from the 
UCF website.

## Step 4. Extract the titles of the news stories

Before we start coding, let's figure out what we need and from where. Let's
go back to the website and inspect the HTML to see if there's any 'pattern'
in the way the titles are stored.

The next cell will open up your browser again and show you the UCF website.

In [None]:
webbrowser.open(ucf_news_site)

- Right click on the title of the first news story and click 'Inspect'.
- Look at the HTML that gets highlighted. 
- Right click on the title of the second news story and click 'Inspect'.
- Look at the HTML that gets highlighted.
- Repeat for a few more news stories.
- Do you see a pattern in the way the titles are stored?

It looks like the titles are stored in "span" tags with the class "h3 feature-title".

That class attribute actually holds two separate classes. 'h3' is a styling class named 
after the 'h3' header tag, and so it is likely to be used for both news titles and other 
headings on the page.

On the other hand, 'feature-title' is a class that is likely to be unique to the news 
titles.
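
To see why picking the right class matters, here is a small sketch on made-up HTML (again, not the real UCF page). When a tag lists several classes, `class_` in `find_all` matches against each one individually.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the class names seen in the inspector
sample_html = """
<h3>Section heading</h3>
<span class="h3 feature-title">Story A</span>
<span class="h3">Not a story</span>
<span class="h3 feature-title">Story B</span>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Filtering on 'feature-title' keeps only the story titles
by_feature = [s.text for s in sample_soup.find_all("span", class_="feature-title")]
print(by_feature)

# Filtering on 'h3' also picks up the span that is not a story
by_h3 = [s.text for s in sample_soup.find_all("span", class_="h3")]
print(by_h3)
```

The more specific class gives the cleaner result, which is why it is the better choice for finding the titles.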

Let's use this information to extract the titles of the news stories from the UCF website.

In [None]:
print("\nHere are the titles of the featured articles on the UCF news page:\n")
titles = news_soup.find_all("span", class_="feature-title")
for title in titles:
    print(title.text)

**Bingo!** We have the titles of the news stories from the UCF website.

But what if we want the URLs of the news stories as well?

Go back to the webpage and inspect the HTML again... where are the URLs associated with these titles?

In [None]:
webbrowser.open(ucf_news_site)

**Uh-oh!** The URLs are not inside of the 'span' tags we were using to get 
the titles.

But, if you look at the HTML, you'll see that the 'span' tags are nested inside 
of 'a' tags. Can we use this?

Think back to when we got all of the 'a' tags from the webpage. Do we want 
*all* of those?

No, we only want the 'a' tags that contain the news stories. This is the art
of web scraping. We need to find the right tags to get the information we want. 
Sometimes this is easy and we can find another tag or criterion to filter the 
tags we want. Other times, it's not so easy and we have to get creative. 
Fortunately, this is one of the easier cases.

Look up a level from the 'a' tags... 'article' tags! These contain the 'a' tags
and titles, and they appear to be unique to the news stories.

Let's use this information to extract the URLs of the news stories from the UCF website.

In [None]:
print("\nHere are the titles and URLs of all the articles on the UCF news page:\n")
articles = news_soup.find_all("article")
for article in articles:
    title = article.find("span", class_="feature-title")
    if title:
        print(title.text)
    else:
        print("No title found for this article.")

    url = article.find("a")
    if url:
        print("URL: ", url.get("href"))
    else:
        print("No URL found for this article.")
    print("-" * 30)

**Woo hoo!** We have the titles and URLs of the news stories from the UCF website.
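
As an aside, the nesting we noticed earlier ('span' inside 'a') can also be used directly. The sketch below is an alternative, not the approach used in this tutorial: it starts from each title 'span' and walks *up* to its enclosing 'a' tag with `find_parent`. The HTML and the `example.com` URLs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML with the same nesting: article > a > span
sample_html = """
<article>
  <a href="https://example.com/story-a"><span class="feature-title">Story A</span></a>
</article>
<article>
  <a href="https://example.com/story-b"><span class="feature-title">Story B</span></a>
</article>
<a href="https://example.com/about">About</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

pairs = []
for span in sample_soup.find_all("span", class_="feature-title"):
    # Walk up the tree to the nearest enclosing 'a' tag
    link = span.find_parent("a")
    if link is not None:
        pairs.append((span.text, link.get("href")))

print(pairs)
```

The "About" link is ignored because it contains no title 'span'. Which direction you search in (down from 'article' or up from 'span') mostly depends on which tag is easier to identify reliably.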

## Step 5. Getting the articles' full text

Printing the titles and URLs is great, but it's not very useful.

We can add the titles and URLs to a dataframe. This will make it 
easier to work with the data further and save it to a file.

In [None]:
import pandas as pd

print("The data in a dataframe:\n")

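data = []
articles = news_soup.find_all("article")
for article in articles:
    title = article.find("span", class_="feature-title")
    url = article.find("a")
    # Skip any 'article' tag that is missing a title or a link; calling
    # .text or .get() on None would raise an AttributeError
    if title is None or url is None:
        continue
    data.append(
        {
            "title": title.text,
            "url": url.get("href"),
            "author": "",
            "subtitle": "",
            "date": "",
            "text": "",
        }
    )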

article_df = pd.DataFrame(data)
print(article_df.head())

Notice that the "author", "subtitle", "date", and "text" columns are empty. 
We need to get this information, but it's on the news stories' individual 
pages.

Let's look at the full text of the first one on our list in the browser...

In [None]:
article_1_url = article_df["url"][0]

# Open the web page in a browser
webbrowser.open(article_1_url)

# Load the article into a BeautifulSoup object for you to play around with
article_1_response = requests.get(article_1_url, impersonate="chrome124")
article_1_soup = BeautifulSoup(article_1_response.text, "html5lib")

Use the "inspect" tool to find where the full text is stored in the HTML.

Feel free to play around in the next cell, in a similar way to what we did above.
- The parsed HTML is in the variable `article_1_soup`
- Play around with the `find` and `find_all` functions 
- Try to find the full text of the article without any ancillary text

In [None]:
# Your playground

===============================================================================

OK, here's what I see:

- All of the article meta-data is in a `header` tag with the class `site-header`
  - Within this, the title is in a unique `h1` tag
  - Within this, the subtitle is in a `div` tag with the class `lead mb-3`, 
  but `mb-3` is not unique to the subtitle
  - Within this, the author is in a `span` tag with the word "By" in it
  - Within this, the date is in a `span` tag with the class 
  `d-block d-sm-inline`, but `d-block` is not unique to the date
- The main text is in a `div` tag with the class `post-content`
  - Within this, each paragraph's text is in `p` tags (if I want to work with 
  the paragraphs separately)

Let's use this information to extract the detailed information of the news 
stories from the UCF website.

In [None]:
# Header information
header = article_1_soup.find("header", class_="site-header")
title = header.find("h1").text.strip()
date = header.find("span", class_="d-sm-inline").text.strip()
subtitle = header.find("div", class_="lead").text.strip()
author = ""  # Stays empty if no span starts with "By"
span_tags = header.find_all("span")
for span in span_tags:
    if span.text.strip().startswith("By"):
        author = span.text.strip()[3:].strip()

# Fulltext information
fulltext = article_1_soup.find("div", class_="post-content").text.strip()

# Print the information
print("Title: ", title)
print("\nSubtitle: ", textwrap.fill(subtitle, width=80))
print("\nAuthor: ", author)
print("\nDate: ", date)
print("\nFull text: ", textwrap.fill(fulltext[:500], width=80), "...")


That's it! We have the full text of the news story from the UCF website...

...for **one** news story.

We need to do this for all of the news stories. We don't want to run this 
code individually for each news story. Let's turn it into a function that
we can apply to all of the news stories!

In [None]:
def get_article_info(article_soup):
    # Header information
    header = article_soup.find("header", class_="site-header")
    title = header.find("h1").text.strip()
    date = header.find("span", class_="d-sm-inline").text.strip()
    subtitle = header.find("div", class_="lead").text.strip()
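    # Default to an empty author so the function still returns cleanly
    # when no span starts with "By"
    author = ""
    span_tags = header.find_all("span")
    for span in span_tags:
        if span.text.strip().startswith("By"):
            author = span.text.strip()[3:].strip()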

    # Fulltext information
    fulltext = article_soup.find("div", class_="post-content").text.strip()

    return title, subtitle, author, date, fulltext

OK, we have a function that takes a "soup" object and returns the title,
subtitle, author, date, and text of the news story.

Let's try it out on the first news story.

In [None]:
from pprint import pprint

pprint(get_article_info(article_1_soup))

OK, but we knew that was going to work... let's try this on the 2nd 
news story that we haven't looked at yet.

In [None]:
article_2_url = article_df["url"][1]
article_2_response = requests.get(article_2_url, impersonate="chrome124")
article_2_soup = BeautifulSoup(article_2_response.text, "html5lib")
pprint(get_article_info(article_2_soup))

**Great!** We have the full text of the 2nd news story from the UCF website.

We now have a function that can get the full text of the news stories from
the UCF website.

Let's pull it all together to create a dataframe with the titles, URLs,
authors, subtitles, dates, and full text of the news stories from the UCF 
website.

## Step 6. Pull it all together

In [None]:
# Get the data for each article individually

for index, row in article_df.iterrows():
    article_url = row["url"]
    article_response = requests.get(article_url, impersonate="chrome124")
    article_soup = BeautifulSoup(article_response.text, "html5lib")
    try:
        title, subtitle, author, date, fulltext = get_article_info(article_soup)
        article_df.loc[index, "author"] = author
        article_df.loc[index, "subtitle"] = subtitle
        article_df.loc[index, "date"] = date
        article_df.loc[index, "text"] = fulltext
    except AttributeError:
        print(f"Error processing article at index {index} with URL {article_url}")
        print(
            "This article is probably not hosted on the UCF news site. "
            "And so 'scraping' it would have to be handled differently "
            "than the other articles... We will skip it for now.\n"
        )

print("Articles collected!")

You may find that some of the articles weren't successfully collected/scraped.
This is likely because some of the URLs for the articles link to websites
that are not formatted in the same way as the UCF website.

This is a common problem with web scraping. You need to be able to handle
these situations. For the purposes of this tutorial, we will ignore these
articles. But know that the 'job isn't done' until you've handled these
one way or another in your own projects.
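
One possible way to handle them, sketched under an assumption: we guess that articles hosted on the UCF news site have URLs on `www.ucf.edu` (or `ucf.edu`) with a path starting with `/news/`, like the news page itself. The `is_ucf_news_url` helper is hypothetical, and you should check the assumption against the URLs you actually collected.

```python
from urllib.parse import urlparse


def is_ucf_news_url(url: str) -> bool:
    """Guess whether a URL points to an article on the UCF news site.

    Assumption: UCF-hosted articles live under https://www.ucf.edu/news/.
    """
    parsed = urlparse(url)
    # hostname is None for relative URLs, so fall back to an empty string
    host = (parsed.hostname or "").lower()
    return host in {"www.ucf.edu", "ucf.edu"} and parsed.path.startswith("/news/")


# Made-up example URLs
print(is_ucf_news_url("https://www.ucf.edu/news/some-story/"))
print(is_ucf_news_url("https://example.com/news/some-story/"))
```

With a check like this you could flag or filter off-site links before requesting them, for example with `article_df["url"].map(is_ucf_news_url)`, instead of relying only on the `try`/`except` in the loop.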

For now, let's look at the dataframe we've collected.

In [None]:
# Print out the first 5 articles with their information
print(article_df.head(5))

Let's look at some summary statistics of the dataframe.

In [None]:
n_rows = article_df.shape[0]
n_columns = article_df.shape[1]
print(f"The dataframe has {n_rows} articles with {n_columns} columns of data.")

fulltext_filter = article_df["text"].str.len() != 0
articles_with_text = article_df[fulltext_filter].shape[0]
print(
    f"Of these, {articles_with_text} articles have fulltext "
    f"associated with them that could be used for analysis."
)

Chances are, you aren't going to want to work with the data in this notebook.

Let's save the dataframe to a CSV file so that you can work with it in
another notebook, Excel, or any other program that can read CSV files.

In [None]:
from pathlib import Path
file = Path.cwd() / "ucf_news_articles.csv"
article_df.to_csv(file, encoding="utf-8-sig", index=False, header=True)
print(f"Data saved to {file}")

Now you have the news stories from the UCF website in a CSV file!

Try opening it - does it look like what you expected?
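
A note on `encoding="utf-8-sig"`: this encoding writes a short marker (a "byte order mark") at the start of the file, which, as far as we know, helps Excel recognize the file as UTF-8. The sketch below shows the round trip using only the standard library and a made-up row written to a temporary folder, so it doesn't touch your real CSV.

```python
import csv
import tempfile
from pathlib import Path

# A made-up row with a non-ASCII character in it
rows = [{"title": "Café opens on campus", "url": "https://example.com/cafe"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "sample.csv"

    # Write with utf-8-sig, the same encoding used in the cell above
    with csv_path.open("w", encoding="utf-8-sig", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "url"])
        writer.writeheader()
        writer.writerows(rows)

    # The first three bytes of the file are the byte order mark
    raw_start = csv_path.read_bytes()[:3]
    print(raw_start)

    # Reading with utf-8-sig strips the marker again
    with csv_path.open("r", encoding="utf-8-sig", newline="") as f:
        loaded = list(csv.DictReader(f))

print(loaded)
```

If you read the file back with plain `"utf-8"` instead, the marker can end up stuck to the front of the first column name, so it's worth using the same encoding on both sides.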

Done...