# Data retrieval I

In this notebook, we will work with the following:

- Web scraping process.
- Read one page.
- Find the content we want.
- Automate many pages.

In [None]:
# requests fetches a URL.
# Beautiful Soup parses the web page that comes back.
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [None]:
pd.set_option("mode.copy_on_write", True)

# Web scraping

One helpful way of gathering text data is web scraping.
We usually do this in three steps:

1. Retrieve the pages with information we want.
1. Extract the data from the pages.
1. Clean and save the resulting data.

Let's walk through an example of getting press releases from the [Microsoft website](https://news.microsoft.com/category/press-releases/).

I often prefer to work out of order as follows:

1. Figure out how to extract data from one page that has the data.
1. Then, figure out how to automate getting the pages of interest.
1. Run those pages through the procedure in step 1.
1. Clean and save.

This has the benefit of solving what is usually the hardest problem first.

## Important note

As you'll see, the difficulty ramps up a lot here.
Web scraping is easily a full-day topic on its own.
Hence, I have two main goals for you:

1. Get a sense of the logic and the process in solving the problem. This is a good start if you want to learn it yourself.
1. Understand what is feasible and achievable. This helps whether you do it yourself or farm it out (and there's a ready talent pool for this).

## Read one page

This is the hardest part.

Note that we add a user agent header that is sent as part of the request.
The reason is that a lot of web servers block user agents that are web scraping tools.
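
To see why the header matters, here is a small sketch that compares the user agent `requests` sends by default with the one we set ourselves. Nothing is sent over the network: the request is only prepared, and the `example.com` URL is a placeholder I made up for illustration.

```python
import requests

# Same browser-like agent string the notebook uses.
AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"

# What requests calls itself when we do not set a header.
default_agent = requests.utils.default_user_agent()
print(default_agent)

# Prepare (but do not send) a request so we can inspect its headers.
# The URL is a placeholder, not a page we scrape.
prepared = requests.Request(
    "GET", "https://example.com/", headers={"User-Agent": AGENT}
).prepare()
print(prepared.headers["User-Agent"])
```

The default string starts with `python-requests/`, which is easy for a server to spot and block.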

In [None]:
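# _AGENT is a constant, so don't change it while the notebook runs.
#    The all-uppercase name signals that.
# Its value should be the user agent string of a real web browser.
# Below, Python automatically concatenates the adjacent strings
#    inside the parentheses into one long URL.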
_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"

pr_url_1 = (
    "https://news.microsoft.com/2018/10/04/"
    "redline-communications-and-microsoft-announce-"
    "partnership-to-lower-the-cost-of-tv-white-space-solutions/"
)
pr_req_1 = requests.get(pr_url_1, headers={"User-Agent": _AGENT})

In [None]:
pr_url_1

In [None]:
# We want this to be 200, which is the code for OK.
pr_req_1.status_code
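
When we scrape many pages, we can't eyeball every status code. A minimal sketch of an automatic check follows, assuming we want a simple yes/no answer. The helper name `page_loaded` is hypothetical, and the two response objects are built by hand only so the example runs without a network connection.

```python
import requests


def page_loaded(response):
    """Return True unless the server answered with a 4xx or 5xx code."""
    try:
        # raise_for_status() raises requests.HTTPError for 4xx and 5xx
        # codes and does nothing otherwise.
        response.raise_for_status()
    except requests.HTTPError:
        return False
    return True


# Hand-built stand-ins for real responses (illustration only).
ok_response = requests.Response()
ok_response.status_code = 200

blocked_response = requests.Response()
blocked_response.status_code = 403

print(page_loaded(ok_response))
print(page_loaded(blocked_response))
```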

In [None]:
# The .text attribute of the response object
# is the HTML of the page.
pr_soup_1 = BeautifulSoup(pr_req_1.text)

In [None]:
# The meta tags have some data we'd like to get.
# For example, this is the published time, standardized as a UTC timestamp.
# Turn on the browser's developer tools to inspect the page's tags.
# Tags sit between less-than and greater-than signs.
# The first argument answers "what type of tag am I trying to find?"
# The keyword argument answers "what property am I looking for?"
pr_soup_1.find("meta", property="article:published_time")

In [None]:
# We can get the property attribute of this meta tag,
# which has the name of the data item.
pr_soup_1.find("meta", property="article:published_time")["property"]

In [None]:
# The content attribute has the data item itself.
pr_soup_1.find("meta", property="article:published_time")["content"]
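
If you want to try this without hitting the website, here is a self-contained sketch of the same lookup. The HTML string is made up to resemble a press release's head section; the values in it are not real data.

```python
from bs4 import BeautifulSoup

# A made-up page head standing in for a real press release.
html = """
<html><head>
<meta property="article:published_time" content="2018-10-04T13:00:00+00:00">
<meta property="og:title" content="Example title">
</head><body></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# .find returns the first tag that matches the name and attribute.
tag = soup.find("meta", property="article:published_time")

# Square brackets read attributes of the tag, like keys of a dict.
print(tag["property"])
print(tag["content"])
```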

In [None]:
# List of meta tags to get.
# Note: when in doubt, get everything you might possibly use.
#       It's easier to drop stuff than to re-scrape everything.

_METAS = [
    "article:published_time",
    "article:modified_time",
    "og:title",
    "og:description",
    "og:updated_time",
    "og:url",
    "article:section",
]

In [None]:
# This loop populates a dict with each of the
# meta attributes above and its content.
# Discussion: why is this TRY/EXCEPT necessary? What happens if we remove it?
# TRY runs the code in its block; if an error occurs, EXCEPT handles it.
# When a tag is missing, .find returns None, and indexing None raises
#    a TypeError ("'NoneType' object is not subscriptable").
# If that happens, we don't stop running. We use the tag name as the
#    property and leave the content empty.
pr_data_1 = {}
for meta in _METAS:
    try:
        prop = pr_soup_1.find("meta", property=meta)["property"]
        content = pr_soup_1.find("meta", property=meta)["content"]
    except TypeError:
        prop = meta
        content = ""
    pr_data_1.update({prop: content})
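
To make the discussion question concrete, here is a small sketch of what happens when a tag is missing. The one-line HTML string is made up, and it deliberately has no `article:section` tag.

```python
from bs4 import BeautifulSoup

# Made-up HTML with only one meta tag.
soup = BeautifulSoup(
    '<meta property="og:title" content="Example title">', "html.parser"
)

# No such tag exists, so .find returns None instead of a tag.
missing = soup.find("meta", property="article:section")
print(missing)

# Indexing None is what raises the TypeError the loop guards against.
caught = False
try:
    missing["content"]
except TypeError as err:
    caught = True
    print(err)
```

Without the `try`/`except`, the first missing tag would stop the whole loop.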

In [None]:
# It is very easy to generalize the code above.
pr_data_1

In [None]:
# Use Inspect Element in the browser to find the tags on a web page.
# .text strips the HTML and returns only the text.
pr_soup_1.find("div", {"class": "entry-content m-blog-content"}).find("h3").text

In [None]:
pr_soup_1.find("div", {"class": "entry-content m-blog-content"}).find("h3")

In [None]:
pr_soup_1.find("div", {"class": "entry-content m-blog-content"})

In [None]:
# The parentheses and indentation are here because the code style tool
#    considers the line too long.
pr_data_1["h3"] = (
    pr_soup_1.find("div", {"class": "entry-content m-blog-content"}).find("h3").text
)

In [None]:
pr_data_1

In [None]:
# .find_all rather than just .find to get all paragraph tags
pr_soup_1.find("div", {"class": "entry-content m-blog-content"}).find_all("p")

In [None]:
# This is a little gnarly.
# The square brackets build a list with the text of each paragraph.
# The .join method joins everything in the list into one long string,
#    with "\n\n" (two newlines) between paragraphs.
# Displayed on a dashboard, that gives paragraph breaks.
pr_data_1["body"] = "\n\n".join(
    [
        i.text
        for i in pr_soup_1.find(
            "div", {"class": "entry-content m-blog-content"}
        ).find_all("p")
    ]
)
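
The same idea is easier to follow when it is split into steps. This sketch uses a made-up `div` with the same class names as the press release page, so you can see exactly what the list and the joined string look like.

```python
from bs4 import BeautifulSoup

# Made-up content block with the same class names as the real page.
html = """
<div class="entry-content m-blog-content">
  <h3>Example subheading</h3>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Step 1: find the block that holds the article.
content = soup.find("div", {"class": "entry-content m-blog-content"})

# Step 2: collect the text of every paragraph tag into a list.
paragraphs = [p.text for p in content.find_all("p")]

# Step 3: join the list into one string, with a blank line between items.
body = "\n\n".join(paragraphs)
print(body)
```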

In [None]:
pr_data_1

# Automate our one-page work

This is fairly easy. We have the code for it already.
We just need to wrap it in a function.

**Note:** I'm using an `if` statement to check whether these properties exist, and guarding against the case where they don't.
I did this iteratively while building this content, because I noticed (from errors) that many press releases do not have modification dates or article sections.

In [None]:
# This function is longer than usual; normally it would be split into pieces.
# Always name the exception in an EXCEPT clause. A bare except also catches
#    keyboard interrupts, so you could not stop a long-running loop.
def get_data_from_soup(soup):
    data = {}
    for meta in _METAS:
        # Look up each tag once, and skip it when the tag or its
        #    content is missing.
        tag = soup.find("meta", property=meta)
        if tag is not None and tag.get("content") is not None:
            data[meta] = tag["content"]

    try:
        data["h3"] = (
            soup.find("div", {"class": "entry-content m-blog-content"})
            .find("h3")
            .text
        )
    except AttributeError:
        data["h3"] = ""

    data["body"] = "\n\n".join(
        [
            i.text
            for i in soup.find(
                "div", {"class": "entry-content m-blog-content"}
            ).find_all("p")
        ]
    )

    return data

In [None]:
# Notice how easy this is once we make a function.
# Give functions descriptive names.
get_data_from_soup(pr_soup_1)

## Read many pages

Now we need to get the URLs for all of the pages we want.

In [None]:
many_pr_url_1 = "https://news.microsoft.com/category/press-releases/"
many_pr_page_1 = requests.get(many_pr_url_1, headers={"User-Agent": _AGENT}).text
many_pr_soup_1 = BeautifulSoup(many_pr_page_1)

In [None]:
# Almost, but note the navigation links at the bottom.
# An a tag (anchor) is a link in HTML.
many_pr_soup_1.find("section", id="primary").find_all("a")

In [None]:
# Here, we further filter down to articles and then get their hrefs to
#    eliminate the navigation links at the bottom.
# href is short for hypertext reference, that is, the link target.
articles = many_pr_soup_1.find("section", id="primary").find_all("article")
links = [i.find("a")["href"] for i in articles]
links
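
Here is the difference between the two approaches on a tiny made-up page. The structure (a `section` with `id="primary"` holding `article` tags plus navigation) mirrors what the notebook describes, but the URLs are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up link page: two articles and one navigation link.
html = """
<section id="primary">
  <article><a href="https://example.com/release-1/">Release 1</a></article>
  <article><a href="https://example.com/release-2/">Release 2</a></article>
  <nav><a href="https://example.com/page/2/">Next</a></nav>
</section>
"""
soup = BeautifulSoup(html, "html.parser")
primary = soup.find("section", id="primary")

# Every anchor tag, including the navigation link we don't want.
all_links = [a["href"] for a in primary.find_all("a")]

# Only the first anchor inside each article.
article_links = [art.find("a")["href"] for art in primary.find_all("article")]

print(all_links)
print(article_links)
```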

In [None]:
many_pr_links_1 = links.copy()

## Automate getting links and data from each

In [None]:
# We need to turn links into soup objects a lot, so let's make a function.
# Prefer short functions that do one thing.
# Below, we create a list of links and pull data from each link.
def link_to_soup(link):
    page = requests.get(link, headers={"User-Agent": _AGENT}).text
    soup = BeautifulSoup(page)
    return soup


def get_links_from_link_page(link_page):
    soup = link_to_soup(link_page)
    articles = soup.find("section", id="primary").find_all("article")
    links = [i.find("a")["href"] for i in articles]
    return links


def get_data_from_links(links):
    data_list = []
    for link in links:
        soup = link_to_soup(link)
        data_list.append(get_data_from_soup(soup))

    return data_list

In [None]:
msft_prs = pd.DataFrame(get_data_from_links(many_pr_links_1))
msft_prs.head()
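
For the clean-and-save step, here is a minimal sketch of what that might look like. The column names come from `_METAS`, but the two rows are made up, the new `published` column name is my own choice, and which cleaning steps you need depends on your data.

```python
import pandas as pd

# Made-up rows shaped like the output of get_data_from_links.
rows = [
    {
        "article:published_time": "2018-10-04T13:00:00+00:00",
        "og:title": " Example A ",
        "body": "Text A",
    },
    {
        "article:published_time": "2018-10-05T09:30:00+00:00",
        "og:title": "Example B",
        "body": "Text B",
    },
]
prs = pd.DataFrame(rows)

# Turn the timestamp strings into real datetimes so we can sort and filter.
prs["published"] = pd.to_datetime(prs["article:published_time"], utc=True)

# Trim stray whitespace from the titles.
prs["og:title"] = prs["og:title"].str.strip()

# With no path, to_csv returns the CSV as a string.
# Pass a file name instead to save it to disk.
csv_text = prs.to_csv(index=False)
print(csv_text)
```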

# Further automation

**Note**: for running time reasons, we're not going to make a multi-links-page version, but note that there's a next page link at the bottom of those pages that can be extracted to build that:

```html
<a href="/category/press-releases/page/2/?paged=3" 
   class="c-glyph x-hidden-focus" 
   aria-label="Go to next page" ms.title="Next Page">
```
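
A rough sketch of pulling that link out follows. I am assuming the `aria-label` text is a stable way to find the tag, which may not hold if the site changes. The snippet is the tag shown above with a closing tag added, and `urljoin` from the standard library turns the relative `href` into a full URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# The page the snippet was found on.
PAGE_URL = "https://news.microsoft.com/category/press-releases/"

snippet = """
<a href="/category/press-releases/page/2/?paged=3"
   class="c-glyph x-hidden-focus"
   aria-label="Go to next page" ms.title="Next Page">Next</a>
"""
soup = BeautifulSoup(snippet, "html.parser")

# Assumption: the aria-label identifies the next page link.
next_tag = soup.find("a", attrs={"aria-label": "Go to next page"})

# The href is relative, so combine it with the page URL.
# On the last page there is no such tag, so fall back to None.
next_url = urljoin(PAGE_URL, next_tag["href"]) if next_tag is not None else None
print(next_url)
```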

However, we could also notice that the link pages have a number in the URL that is incremented by one for each page.
We would have to look at a page to get the end number, but then we could simply use a loop to construct a URL for each of those numbers.

`https://news.microsoft.com/category/press-releases/page/2/`
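
As a sketch of that loop, the hypothetical helper below builds the list of link-page URLs. It assumes the first page is the plain category URL and that later pages follow the `page/N/` pattern shown above; the end number still has to come from looking at the site.

```python
FIRST_PAGE = "https://news.microsoft.com/category/press-releases/"


def build_link_page_urls(last_page):
    """Return the URLs of link pages 1 through last_page."""
    # Page 1 is the category URL itself.
    urls = [FIRST_PAGE]
    # Pages 2 and up add page/N/ to the category URL.
    urls += [f"{FIRST_PAGE}page/{n}/" for n in range(2, last_page + 1)]
    return urls


for url in build_link_page_urls(3):
    print(url)
```

Each of these URLs could then be passed to `get_links_from_link_page`.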

Use `time.sleep` to wait between each link pull so that we don't hammer the server.
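
One possible way to add that pause is sketched below. The helper name `visit_with_pauses` and the one-second default are my own choices, not something the site requires. The `visit` argument is any function that takes a link, so in the notebook it could be `link_to_soup`; here a stand-in keeps the example off the network.

```python
import time


def visit_with_pauses(links, visit, pause_seconds=1.0):
    """Call visit(link) for each link, sleeping between calls."""
    results = []
    for position, link in enumerate(links):
        # No need to wait before the very first request.
        if position > 0:
            time.sleep(pause_seconds)
        results.append(visit(link))
    return results


# Placeholder links and a stand-in visit function (illustration only).
demo_links = ["https://example.com/a/", "https://example.com/b/"]
start = time.monotonic()
visited = visit_with_pauses(demo_links, str.upper, pause_seconds=0.05)
elapsed = time.monotonic() - start
print(visited)
```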