# Data retrieval I

In this notebook, we will work with the following:

- Web scraping process.
- Read one page.
- Find the content we want.
- Automate many pages.

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Web scraping

One helpful way of gathering text data is web scraping.
We usually do this in three steps:

1. Retrieve the pages with information we want.
1. Extract the data from the pages.
1. Clean and save the resulting data.

Let's walk through an example of getting press releases from the [Microsoft website](https://news.microsoft.com/category/press-releases/).

I often prefer to work out of order as follows:

1. Figure out how to extract data from one page that has the data.
1. Then, figure out how to automate getting the pages of interest.
1. Run those pages through the procedure in step 1.
1. Clean and save.

This has the benefit of solving what is usually the hardest problem first.

## Important note

As you'll see, the difficulty ramps up a lot here.
Web scraping is easily a full day topic on its own.
Hence, I have two main goals for you:

1. Get a sense of the logic and the process in solving the problem. This is a good start if you want to learn it yourself.
1. Understand what is feasible and achievable. This helps whether you do it yourself or farm it out (and there's a ready talent pool for this).

## Read one page

This is the hardest part.

Note that we send a `User-Agent` header as part of the request.
Many web servers block requests whose user agent identifies a web scraping tool, so we present a browser's user agent string instead.

In [2]:
_AGENT = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'

pr_url_1 = 'https://news.microsoft.com/2018/10/04/redline-communications-and-microsoft-announce-partnership-to-lower-the-cost-of-tv-white-space-solutions/'
pr_req_1 = requests.get(pr_url_1, headers={'User-Agent': _AGENT})

In [3]:
# We want this to be 200, which is the code for OK.
pr_req_1.status_code

200

In [4]:
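# The .text attribute of the response object is the HTML of the page.
# Naming the parser explicitly avoids a warning and keeps results consistent across machines.
pr_soup_1 = BeautifulSoup(pr_req_1.text, 'html.parser')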

In [5]:
# The meta tags have some data we'd like to get.
# For example, this is the published time.
pr_soup_1.find('meta', property='article:published_time')

<meta content="2018-10-04T13:00:35+00:00" property="article:published_time"/>

In [6]:
# We can get the property attribute of this meta tag, which has the name of the data item.
pr_soup_1.find('meta', property='article:published_time')['property']

'article:published_time'

In [7]:
# The content attribute has the data item itself.
pr_soup_1.find('meta', property='article:published_time')['content']

'2018-10-04T13:00:35+00:00'
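
The timestamp is an ISO 8601 string, so it can be turned into a timezone-aware `datetime` during the cleaning step. The following is a minimal sketch using only the standard library; the variable names are illustrative and not part of the scraping code above.

```python
from datetime import datetime, timedelta

# Value copied from the content attribute shown above.
published_raw = '2018-10-04T13:00:35+00:00'

# fromisoformat (Python 3.7+) understands the +00:00 UTC offset.
published = datetime.fromisoformat(published_raw)

print(published.date())
print(published.utcoffset() == timedelta(0))
```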

In [8]:
# List of meta tags to get.
# Note: when in doubt, get everything you might possibly use.
#       It's easier to drop stuff than to re-scrape everything.

_METAS = [
    'article:published_time',
    'article:modified_time',
    'og:title',
    'og:description',
    'og:updated_time',
    'og:url',
    'article:section'
]

In [9]:
# This loop populates a dict with each of the meta attributes above and its content.
# Discussion: why is this try/except necessary? What happens if we remove it?
pr_data_1 = {}
for meta in _METAS:
    try:
        prop = pr_soup_1.find('meta', property=meta)['property']
        content = pr_soup_1.find('meta', property=meta)['content']
    except TypeError:
        prop = meta
        content = ''
    pr_data_1.update({prop: content})
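
One way to approach the discussion question: `find` returns `None` when no tag matches, and indexing `None` raises a `TypeError`. The sketch below reproduces that without needing a page; `missing_tag` is a stand-in for the result of a `find` call that matched nothing.

```python
# Stand-in for soup.find('meta', property=...) when no such tag exists.
missing_tag = None

try:
    content = missing_tag['content']
except TypeError as err:
    # Without the try/except, this error would stop the loop at the
    # first missing meta tag.
    content = ''
    error_message = str(err)

print(error_message)
```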

In [10]:
pr_data_1

{'article:published_time': '2018-10-04T13:00:35+00:00',
 'article:modified_time': '2018-10-04T14:43:59+00:00',
 'og:title': 'Redline Communications and Microsoft announce partnership to lower the cost of TV White Space solutions - Stories',
 'og:description': 'The partnership will help make broadband more affordable and accessible for unserved communities in rural areas of the U.S. and globally REDMOND, Wash. — Oct. 4, 2018 — On Thursday, Redline Communications (TSX:RDL) and Microsoft Corp. announced a new partnership that will help address the rural broadband gap using TV White Space technology. Redline, a […]',
 'og:updated_time': '',
 'og:url': 'https://news.microsoft.com/2018/10/04/redline-communications-and-microsoft-announce-partnership-to-lower-the-cost-of-tv-white-space-solutions/',
 'article:section': ''}

In [11]:
pr_soup_1.find('div', {'class': 'entry-content m-blog-content'}).find('h3').text

'The partnership will help make broadband more affordable and accessible for unserved communities in rural areas of the U.S. and globally'

In [12]:
pr_data_1['h3'] = pr_soup_1.find('div', 
                                 {'class': 'entry-content m-blog-content'}
                                ).find('h3').text

In [13]:
pr_data_1

{'article:published_time': '2018-10-04T13:00:35+00:00',
 'article:modified_time': '2018-10-04T14:43:59+00:00',
 'og:title': 'Redline Communications and Microsoft announce partnership to lower the cost of TV White Space solutions - Stories',
 'og:description': 'The partnership will help make broadband more affordable and accessible for unserved communities in rural areas of the U.S. and globally REDMOND, Wash. — Oct. 4, 2018 — On Thursday, Redline Communications (TSX:RDL) and Microsoft Corp. announced a new partnership that will help address the rural broadband gap using TV White Space technology. Redline, a […]',
 'og:updated_time': '',
 'og:url': 'https://news.microsoft.com/2018/10/04/redline-communications-and-microsoft-announce-partnership-to-lower-the-cost-of-tv-white-space-solutions/',
 'article:section': '',
 'h3': 'The partnership will help make broadband more affordable and accessible for unserved communities in rural areas of the U.S. and globally'}

In [14]:
pr_soup_1.find('div', 
               {'class': 'entry-content m-blog-content'}
              ).find_all('p')

[<p><strong>REDMOND, Wash. </strong><strong>—</strong> <strong>Oct. 4, 2018</strong> <strong>—</strong> On Thursday, <a href="https://rdlcom.com/">Redline Communications</a> (TSX:RDL) and <a href="https://www.microsoft.com/en-us/">Microsoft Corp.</a> announced a new partnership that will help address the rural broadband gap using TV White Space technology. Redline, a leader in private wireless networks, will provide its Virtual Fiber™ radio technology in the TV White Space band to Microsoft Airband Initiative partners. Together, Redline and Microsoft’s partnership will help make broadband internet more affordable and accessible to unserved and underserved customers in rural areas in the United States and globally.</p>,
 <p>New cloud services and other technologies make broadband connectivity a necessity to start and grow a small business and to take advantage of advances in agriculture, telemedicine and education. It is a vital part of 21st century infrastructure. Yet, more than 19.4 m

In [15]:
# This is a little gnarly: join the text of every <p> in the article body,
#    separated by blank lines.
pr_data_1['body'] = '\n\n'.join(
                        [i.text for i in pr_soup_1.find(
                            'div', 
                            {'class': 'entry-content m-blog-content'}
                            ).find_all('p')])

In [16]:
pr_data_1

{'article:published_time': '2018-10-04T13:00:35+00:00',
 'article:modified_time': '2018-10-04T14:43:59+00:00',
 'og:title': 'Redline Communications and Microsoft announce partnership to lower the cost of TV White Space solutions - Stories',
 'og:description': 'The partnership will help make broadband more affordable and accessible for unserved communities in rural areas of the U.S. and globally REDMOND, Wash. — Oct. 4, 2018 — On Thursday, Redline Communications (TSX:RDL) and Microsoft Corp. announced a new partnership that will help address the rural broadband gap using TV White Space technology. Redline, a […]',
 'og:updated_time': '',
 'og:url': 'https://news.microsoft.com/2018/10/04/redline-communications-and-microsoft-announce-partnership-to-lower-the-cost-of-tv-white-space-solutions/',
 'article:section': '',
 'h3': 'The partnership will help make broadband more affordable and accessible for unserved communities in rural areas of the U.S. and globally',
 'body': 'REDMOND, Wash. — 

# Automate our one-page work

This is fairly easy. We have the code for it already.
We just need to wrap it in a function.

**Note:** I'm using an `if` statement to check whether these properties exist, and guarding against the case where they don't.
I did this iteratively while building this content, because I noticed (from errors) that many press releases do not have modification dates or article sections.

In [17]:
def get_data_from_soup(soup):
    data = {}
    for meta in _METAS:
        # Look each tag up once; find() returns None when the page lacks it.
        tag = soup.find('meta', property=meta)
        # Only record properties that exist and carry a content attribute.
        if tag is not None and tag.get('content') is not None:
            data[meta] = tag['content']
    try:
        data['h3'] = soup.find('div', 
                               {'class': 'entry-content m-blog-content'}
                              ).find('h3').string
    except AttributeError:
        data['h3'] = ''
    
    data['body'] = '\n\n'.join(
                        [i.text for i in soup.find(
                            'div', 
                            {'class': 'entry-content m-blog-content'}
                            ).find_all('p')])
    
    return data

In [18]:
# Notice how easy this is once we make a function.
get_data_from_soup(pr_soup_1)

{'article:published_time': '2018-10-04T13:00:35+00:00',
 'article:modified_time': '2018-10-04T14:43:59+00:00',
 'og:title': 'Redline Communications and Microsoft announce partnership to lower the cost of TV White Space solutions - Stories',
 'og:description': 'The partnership will help make broadband more affordable and accessible for unserved communities in rural areas of the U.S. and globally REDMOND, Wash. — Oct. 4, 2018 — On Thursday, Redline Communications (TSX:RDL) and Microsoft Corp. announced a new partnership that will help address the rural broadband gap using TV White Space technology. Redline, a […]',
 'og:url': 'https://news.microsoft.com/2018/10/04/redline-communications-and-microsoft-announce-partnership-to-lower-the-cost-of-tv-white-space-solutions/',
 'h3': 'The partnership will help make broadband more affordable and accessible for unserved communities in rural areas of the U.S. and globally',
 'body': 'REDMOND, Wash. — Oct. 4, 2018 — On Thursday, Redline Communicatio

## Read many pages

Now we need to get the URLs for all of the pages we want.

In [19]:
many_pr_url_1 = 'https://news.microsoft.com/category/press-releases/'
many_pr_page_1 = requests.get(many_pr_url_1, headers={'User-Agent': _AGENT}).text
many_pr_soup_1 = BeautifulSoup(many_pr_page_1, 'html.parser')

In [20]:
# Almost: this also picks up the navigation links at the bottom.
many_pr_soup_1.find('section', id='primary').find_all('a')

[<a class="f-post-link c-heading-6 m-chevron" href="https://news.microsoft.com/2021/01/12/microsoft-announces-quarterly-earnings-release-date-46/" ms.title="Microsoft announces quarterly earnings release date" rel="bookmark">
 		Microsoft announces quarterly earnings release date	</a>,
 <a class="f-post-link c-heading-6 m-chevron" href="https://news.microsoft.com/2021/01/11/broad-institute-and-verily-partner-with-microsoft-to-accelerate-the-next-generation-of-the-terra-platform-for-health-and-life-science-research/" ms.title="Broad Institute and Verily partner with Microsoft to accelerate the next generation of the Terra platform for health and life science research" rel="bookmark">
 		Broad Institute and Verily partner with Microsoft to accelerate the next generation of the Terra platform for health and life science research	</a>,
 <a class="f-post-link c-heading-6 m-chevron" href="https://news.microsoft.com/2020/12/14/microsoft-and-warner-bros-pictures-assemble-all-star-team-in-lebro

In [21]:
# Here, we further filter down to articles and then get their hrefs to
#    eliminate the navigation links at the bottom.
articles = many_pr_soup_1.find('section', id='primary').find_all('article')
links = [i.find('a')['href'] for i in articles]
links

['https://news.microsoft.com/2021/01/12/microsoft-announces-quarterly-earnings-release-date-46/',
 'https://news.microsoft.com/2021/01/11/broad-institute-and-verily-partner-with-microsoft-to-accelerate-the-next-generation-of-the-terra-platform-for-health-and-life-science-research/',
 'https://news.microsoft.com/2020/12/14/microsoft-and-warner-bros-pictures-assemble-all-star-team-in-lebron-james-bugs-bunny-and-xbox-to-celebrate-gaming-and-coding-education-inspired-by-the-upcoming-animated-live-action-adventure/',
 'https://news.microsoft.com/2020/12/11/industry-leaders-in-tech-education-and-financial-services-join-together-in-new-national-council-to-activate-ai-for-the-greater-good/',
 'https://news.microsoft.com/2020/12/09/deutsche-telekom-and-microsoft-redefine-partnership-to-deliver-high-performance-cloud-computing-experiences/',
 'https://news.microsoft.com/2020/12/08/johnson-controls-and-microsoft-announce-global-collaboration-launch-integration-between-openblue-digital-twin-and-az

In [22]:
many_pr_links_1 = links.copy()

## Automate getting links and data from each

In [23]:
# We need to turn links into soup objects a lot, so let's make a function.
def link_to_soup(link):
    page = requests.get(link, headers={'User-Agent': _AGENT}).text
    soup = BeautifulSoup(page, 'html.parser')
    return soup
    
def get_links_from_link_page(link_page):
    soup = link_to_soup(link_page)
    articles = soup.find('section', id='primary').find_all('article')
    links = [i.find('a')['href'] for i in articles]
    return links

def get_data_from_links(links):
    data_list = []
    for link in links:
        soup = link_to_soup(link)
        data_list.append(get_data_from_soup(soup))
        
    return data_list
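
When a loop like this runs over many links, one failing page stops the whole run, and rapid requests can strain the server. As a hedged sketch (not part of the original notebook), the loop could pause between requests and record failures instead of raising. The names `collect_with_pauses`, `fetch`, and `delay_seconds` are hypothetical, and the demonstration uses a fake fetcher so it runs without network access.

```python
import time


def collect_with_pauses(links, fetch, delay_seconds=1.0):
    """Call fetch(link) for each link, pausing between calls.

    Returns (results, failures); failures holds (link, error text) pairs.
    """
    results, failures = [], []
    for position, link in enumerate(links):
        if position > 0:
            time.sleep(delay_seconds)  # be gentle with the server
        try:
            results.append(fetch(link))
        except Exception as err:  # broad on purpose: keep the loop going
            failures.append((link, repr(err)))
    return results, failures


# Demonstration with a fake fetcher, so no network access is needed.
def fake_fetch(link):
    if 'broken' in link:
        raise ValueError('could not parse page')
    return {'og:url': link}


demo_results, demo_failures = collect_with_pauses(
    ['https://example.com/a', 'https://example.com/broken', 'https://example.com/b'],
    fake_fetch,
    delay_seconds=0,
)
print(len(demo_results), len(demo_failures))
```

In this notebook, `fetch` could be something like `lambda link: get_data_from_soup(link_to_soup(link))`.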


In [24]:
msft_prs = pd.DataFrame(get_data_from_links(many_pr_links_1))
msft_prs.head()

| | article:published_time | article:modified_time | og:title | og:description | og:url | h3 | body |
|---|---|---|---|---|---|---|---|
| 0 | 2021-01-12T21:05:18+00:00 | 2021-01-12T21:05:39+00:00 | Microsoft announces quarterly earnings release... | REDMOND, Wash. — Jan. 12, 2021 — Microsoft Cor... | https://news.microsoft.com/2021/01/12/microsof... | | REDMOND, Wash. — Jan. 12, 2021 — Microsoft Cor... |
| 1 | 2021-01-11T14:00:33+00:00 | | Broad Institute and Verily partner with Micros... | Multiyear partnership brings together advanced... | https://news.microsoft.com/2021/01/11/broad-in... | Multiyear partnership brings together advanced... | \n\nCAMBRIDGE, Mass., SOUTH SAN FRANCISCO, Cal... |
| 2 | 2020-12-14T17:11:01+00:00 | 2020-12-15T16:59:18+00:00 | Microsoft and Warner Bros. Pictures assemble a... | Fans can submit their best video game ideas an... | https://news.microsoft.com/2020/12/14/microsof... | Fans can submit their best video game ideas an... | \n\nREDMOND, Wash. — Dec. 14, 2020 — Microsoft... |
| 3 | 2020-12-11T17:00:11+00:00 | | Industry leaders in tech, education and financ... | Coalition established to identify and solve si... | https://news.microsoft.com/2020/12/11/industry... | Coalition established to identify and solve si... | REDMOND, Wash. — Dec. 11, 2020 — On Friday, le... |
| 4 | 2020-12-10T06:00:14+00:00 | 2020-12-10T17:10:41+00:00 | Deutsche Telekom and Microsoft redefine partne... | Seven-year strategic agreement to help enterpr... | https://news.microsoft.com/2020/12/09/deutsche... | | BONN, Germany, REDMOND, Wash. — Dec. 9, 2020 —... |
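
The third step, cleaning and saving, is not shown in this notebook. As a rough sketch under stated assumptions, the records could be stripped of stray whitespace (some bodies start with `\n\n`) and written out as CSV. The two records below are shortened, made-up examples shaped like the output of `get_data_from_soup`, and the sketch writes to an in-memory buffer with the standard `csv` module. A real run would write to a file, and with a DataFrame in hand `DataFrame.to_csv` is the more likely choice.

```python
import csv
import io

# Two shortened, made-up records shaped like get_data_from_soup's output.
records = [
    {'og:title': 'First release', 'body': '\n\nREDMOND, Wash. ...'},
    {'og:title': 'Second release',
     'article:modified_time': '2020-12-15T16:59:18+00:00',
     'body': 'BONN, Germany ...'},
]

# Clean: strip stray whitespace from every string value.
cleaned = [{key: value.strip() for key, value in record.items()}
           for record in records]

# Collect every column seen, in first-seen order.
columns = []
for record in cleaned:
    for key in record:
        if key not in columns:
            columns.append(key)

# Save: restval fills in columns that a record lacks.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval='')
writer.writeheader()
writer.writerows(cleaned)

# Read the CSV text back to check what was saved.
saved_rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(saved_rows[0]['body'])
```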


# Further automation

**Note:** To keep the running time down, we won't build a version that walks through multiple link pages. Each link page does have a next-page link at the bottom, though, which can be extracted to build one:

```html
<a href="/category/press-releases/page/2/?paged=3" 
   class="c-glyph x-hidden-focus" 
   aria-label="Go to next page" ms.title="Next Page">
```
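
The `href` in that anchor is relative, so it has to be resolved against the address of the link page before it can be requested. A minimal sketch with the standard library's `urljoin`, assuming the `href` has already been pulled out of the tag (for example with `find`, as in the earlier cells); the variable names are illustrative.

```python
from urllib.parse import urljoin

link_page_url = 'https://news.microsoft.com/category/press-releases/'
# href copied from the next-page anchor shown above.
next_href = '/category/press-releases/page/2/?paged=3'

# Resolve the relative href against the page it was found on.
next_page_url = urljoin(link_page_url, next_href)
print(next_page_url)
```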

Alternatively, notice that the link pages have a number in the URL that increases by one for each page.
We would have to look at a page to find the last number, but then a simple loop can construct the URL for each page:

`https://news.microsoft.com/category/press-releases/page/2/`
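
A sketch of that loop, assuming the first link page is the bare category URL and later pages follow the `page/<number>/` pattern above. The value of `last_page` is hypothetical and would have to be read off the site.

```python
base_url = 'https://news.microsoft.com/category/press-releases/'
last_page = 5  # hypothetical; read the real value off the site

# Page 1 is the bare category URL; pages 2..last_page add page/<number>/.
link_page_urls = [base_url] + [
    f'{base_url}page/{number}/' for number in range(2, last_page + 1)
]

for url in link_page_urls:
    print(url)
```

Each of these URLs could then be passed to `get_links_from_link_page` to collect the article links.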