# Data retrieval I

In this notebook, we will work with the following:

- Web scraping process.
- Read one page.
- Find the content we want.
- Automate many pages.

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [None]:
pd.set_option("mode.copy_on_write", True)

# Web scraping

One helpful way of gathering text data is web scraping.
We usually do this in three steps:

1. Retrieve the pages with information we want.
1. Extract the data from the pages.
1. Clean and save the resulting data.

Let's walk through an example of getting press releases from the [Alphabet website](https://abc.xyz/investor/news/2024/).

I often prefer to work out of order as follows:

1. Figure out how to extract data from one page that has the data.
1. Then, figure out how to automate getting the pages of interest.
1. Run those pages through the procedure in step 1.
1. Clean and save.

This has the benefit of solving what is usually the hardest problem first.

## Important note

As you'll see, the difficulty ramps up a lot here.
Web scraping is easily a full-day topic on its own.
Hence, I have two main goals for you:

1. Get a sense of the logic and the process in solving the problem. This is a good start if you want to learn it yourself.
1. Understand what is feasible and achievable. This helps whether you do it yourself or farm it out (and there's a ready talent pool for this).

## Read one page

This is the hardest part.

Note that we add a user agent header, which is sent as part of the request.
The reason is that many web servers block requests whose user agent identifies a web scraping tool.

In [None]:
AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    " (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
)

pr_url_1 = "https://abc.xyz/2024-1010/"

pr_req_1 = requests.get(pr_url_1, headers={"User-Agent": AGENT})
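To see why the header matters, the following sketch compares the user agent that `requests` sends by default with a custom one. It builds the request without sending it, so nothing touches the network. The URL `https://example.com/` and the `custom_agent` string are placeholders of mine, not values from this notebook.

```python
import requests

# The user agent that requests sends when we do not override it.
default_agent = requests.utils.default_user_agent()

# Hypothetical custom agent string, used only for illustration.
custom_agent = "my-research-scraper/0.1"

# Prepare (but do not send) a request to a placeholder URL.
prepared = requests.Request(
    "GET", "https://example.com/", headers={"User-Agent": custom_agent}
).prepare()

print(default_agent)
print(prepared.headers["User-Agent"])
```

The default value names the library itself, which is exactly the kind of string a server can filter on.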

In [None]:
# We want this to be 200, which is the code for OK.
pr_req_1.status_code
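If we would rather have the notebook stop on a bad response than check the code by eye, `requests` provides `raise_for_status()`. This is a minimal sketch, assuming we want any 4xx or 5xx response to raise. `ensure_ok` is a hypothetical helper name, and the 404 response is built by hand so the example runs offline.

```python
import requests


def ensure_ok(response):
    """Return the response unchanged, or raise requests.HTTPError on a 4xx/5xx status."""
    response.raise_for_status()
    return response


# Build a fake 404 response by hand so the example needs no network access.
fake_response = requests.Response()
fake_response.status_code = 404

try:
    ensure_ok(fake_response)
    outcome = "ok"
except requests.HTTPError:
    outcome = "failed"

print(outcome)
```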

### Encoding

This is a deep topic that we only need to touch on briefly.
In short, there are many standards for mapping text to bytes (groups of eight 0 or 1 values).
Many of them overlap significantly because they extend a common underlying standard, such as ASCII, so the wrong encoding often mostly works, but it's better to be sure we're using the right one.

In our example here, the server's response leads `requests` to infer that the text is in the `ISO-8859-1` encoding, though it is actually in the `UTF-8` encoding.
Fortunately, `requests` reports both the encoding it inferred from the response headers (`encoding`) and the encoding it detects from the content itself (`apparent_encoding`), so we can build upon that.

In [None]:
pr_req_1.encoding

In [None]:
pr_req_1.apparent_encoding

In [None]:
pr_req_1.encoding = pr_req_1.apparent_encoding
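To see what goes wrong when the two encodings are mixed up, here is a small standalone sketch. It uses a made-up word rather than the press release text, and needs nothing beyond plain Python strings.

```python
original = "café"

# Encode the text as UTF-8; the accented letter takes two bytes.
raw_bytes = original.encode("utf-8")

# Decoding with the wrong encoding does not raise an error; it silently garbles.
misread = raw_bytes.decode("iso-8859-1")

# Decoding with the right encoding recovers the original text.
recovered = raw_bytes.decode("utf-8")

print(misread)    # cafÃ©
print(recovered)  # café
```

Because the wrong decoding fails silently, garbled characters like these are usually the first sign of an encoding mismatch.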

### Extracting content

In [None]:
# The .text attribute of the response object is the HTML of the page.
pr_soup_1 = BeautifulSoup(pr_req_1.text)

In [None]:
# The meta tags have some data we'd like to get.
# For example, this is the published time.
pr_soup_1.find("meta", property="article:published_time")

In [None]:
# We can get the property attribute of this meta tag,
# which has the name of the data item.
pr_soup_1.find("meta", property="article:published_time")["property"]

In [None]:
# The content attribute has the data item itself.
pr_soup_1.find("meta", property="article:published_time")["content"]

In [None]:
# List of meta tags to get.
# Note: when in doubt, get everything you might possibly use.
#       It's easier to drop stuff than to re-scrape everything.

METAS = [
    "article:published_time",
    "article:modified_time",
    "og:title",
    "og:description",
    "og:updated_time",
    "og:url",
    "article:section",
]

In [None]:
# This loop populates a dict with each of the meta attributes above and its content.
# Discussion: why is this try/except necessary? What happens if we remove it?
pr_data_1 = {}
for meta in METAS:
    try:
        prop = pr_soup_1.find("meta", property=meta)["property"]
        content = pr_soup_1.find("meta", property=meta)["content"]
    except TypeError:
        prop = meta
        content = ""
    pr_data_1.update({prop: content})

In [None]:
pr_data_1

In [None]:
pr_soup_1.find("div", {"class": "RichTextArticleBody RichTextBody"}).find_all("p")

In [None]:
# This is a little gnarly.
pr_data_1["body"] = "\n\n".join(
    [
        i.text
        for i in pr_soup_1.find(
            "div", {"class": "RichTextArticleBody RichTextBody"}
        ).find_all("p")
    ]
)

In [None]:
pr_data_1
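The nested call above is easier to follow when split into steps. The sketch below does that on a small hand-written HTML fragment (the fragment is mine; only the class name comes from the page), so it runs without fetching anything. Naming `html.parser` explicitly is my assumption about which parser to use.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a press release page.
html = """
<div class="RichTextArticleBody RichTextBody">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: find the div that holds the article body.
body_div = soup.find("div", {"class": "RichTextArticleBody RichTextBody"})

# Step 2: collect the text of each paragraph inside it.
paragraphs = [p.get_text() for p in body_div.find_all("p")]

# Step 3: join the paragraphs with blank lines between them.
body = "\n\n".join(paragraphs)
print(body)
```

Matching on the full class string relies on the class names appearing in exactly this order in the page.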

# Automate our one-page work

This is fairly easy. We have the code for it already.
We just need to wrap it in a function.

**Note:** I'm using a `try`/`except` block to guard against the case where these properties don't exist.
I did this iteratively while building this content, because I noticed (from errors) that many press releases do not have modification dates or article sections.

In [None]:
def get_data_from_soup(soup):
    data = {}
    for meta in METAS:
        try:
            prop = soup.find("meta", property=meta)["property"]
            content = soup.find("meta", property=meta)["content"]
        except TypeError:
            prop = meta
            content = ""
        data.update({prop: content})

    data["body"] = "\n\n".join(
        [
            i.text
            for i in soup.find(
                "div", {"class": "RichTextArticleBody RichTextBody"}
            ).find_all("p")
        ]
    )

    return data

In [None]:
# Notice how easy this is once we make a function.
get_data_from_soup(pr_soup_1)
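`find` returns `None` when nothing matches, and indexing `None` is what raises the `TypeError` handled in the function. An alternative sketch checks for `None` explicitly instead. `meta_content` is a hypothetical helper, and the HTML fragment is made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up page with one meta tag present and the others missing.
html = '<html><head><meta property="og:title" content="Example title"></head></html>'
soup = BeautifulSoup(html, "html.parser")


def meta_content(soup, name):
    """Return the content of the named meta tag, or "" if the tag is absent."""
    tag = soup.find("meta", property=name)
    if tag is None:
        return ""
    return tag.get("content", "")


print(meta_content(soup, "og:title"))
print(repr(meta_content(soup, "article:section")))
```

Either approach works; the explicit check makes the missing-tag case visible in the code rather than in the error handling.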

## Read many pages

Now we need to get the URLs for all of the pages we want.

In [None]:
many_pr_url_1 = "https://abc.xyz/investor/news/2024/"
many_pr_page_1 = requests.get(many_pr_url_1, headers={"User-Agent": AGENT}).text
many_pr_soup_1 = BeautifulSoup(many_pr_page_1)

In [None]:
# Here, we find the div containing the listings and then find the links within.
many_pr_soup_1.find("div", {"class": "PageListW-items"}).find_all("a")

In [None]:
# Then, for each of the anchor tags, we can extract the links themselves.
articles = many_pr_soup_1.find("div", {"class": "PageListW-items"}).find_all("a")
links = [i["href"] for i in articles]
links

In [None]:
many_pr_links_1 = links.copy()
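Nothing above depends on whether the `href` values are absolute or relative. If some turn out to be relative, `urllib.parse.urljoin` from the standard library can resolve them against the listing page URL. This is a sketch with made-up `href` values; `normalize_links` is a hypothetical helper.

```python
from urllib.parse import urljoin

LISTING_URL = "https://abc.xyz/investor/news/2024/"


def normalize_links(hrefs, base_url=LISTING_URL):
    """Resolve each href against base_url and drop duplicates, keeping order."""
    absolute = [urljoin(base_url, href) for href in hrefs]
    return list(dict.fromkeys(absolute))


# Made-up hrefs: one relative, one absolute duplicate, one more relative.
example_hrefs = ["/2024-1010/", "https://abc.xyz/2024-1010/", "/2024-0723/"]
print(normalize_links(example_hrefs))
```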

## Automate getting links and data from each

In [None]:
# We need to turn links into soup objects a lot, so let's make a function.
def link_to_soup(link):
    page_request = requests.get(link, headers={"User-Agent": AGENT})
    page_request.encoding = page_request.apparent_encoding
    page = page_request.text
    soup = BeautifulSoup(page)
    return soup


def get_links_from_link_page(link_page):
    soup = link_to_soup(link_page)
    articles = soup.find("div", {"class": "PageListW-items"}).find_all("a")
    links = [i["href"] for i in articles]
    return links


def get_data_from_links(links):
    data_list = []
    for link in links:
        soup = link_to_soup(link)
        data_list.append(get_data_from_soup(soup))

    return data_list

In [None]:
alphabet_prs = pd.DataFrame(get_data_from_links(many_pr_links_1))
alphabet_prs.head()
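When looping over many links, it is common courtesy to pause between requests, and one failed page should not discard the rest. The sketch below is one way to do that, not what the cell above does. `process_links` is a hypothetical helper that takes the per-link function as an argument (for example, a function combining `link_to_soup` and `get_data_from_soup`), and the one-second default delay is an arbitrary choice.

```python
import time


def process_links(links, fetch, delay_seconds=1.0):
    """Apply fetch to each link, pausing between calls and recording failures."""
    results, failures = [], []
    for position, link in enumerate(links):
        if position > 0:
            time.sleep(delay_seconds)  # pause before every call except the first
        try:
            results.append(fetch(link))
        except Exception as error:  # broad on purpose for this sketch
            failures.append((link, repr(error)))
    return results, failures


# Stand-in for real fetching so the example runs offline.
def fake_fetch(link):
    if "bad" in link:
        raise ValueError("could not parse page")
    return {"og:url": link}


results, failures = process_links(["a", "bad", "b"], fake_fetch, delay_seconds=0)
print(results)
print(failures)
```

Keeping the failures in a separate list makes it easy to retry only those links later.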

# Further automation

**Note**: to keep the running time down, we're not going to make a version that handles multiple listing pages, but note that there are year links on the left of the listing pages that can be extracted.

However, we could also notice that the link pages have a year in the URL.
We would have to look at a page to get the earliest year, but we could otherwise simply use a loop to construct a URL for each of those years.

`https://abc.xyz/investor/news/2023/`
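A sketch of that loop, assuming the URL pattern holds for every year. The start year below is a placeholder, since the earliest year would have to be read off the site, and `year_urls` is a hypothetical helper.

```python
URL_TEMPLATE = "https://abc.xyz/investor/news/{year}/"


def year_urls(first_year, last_year):
    """Build one listing-page URL per year, from first_year through last_year."""
    return [URL_TEMPLATE.format(year=year) for year in range(first_year, last_year + 1)]


# Placeholder range; replace 2023 with the earliest year found on the site.
listing_urls = year_urls(2023, 2024)
print(listing_urls)
```

Each URL in the list could then be passed to `get_links_from_link_page` to collect the press release links for that year.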