Scraping Data from the Web
=====================================

Today we are going to write a few scripts that we can use to scrape data from the web in a few
different fields.  

1. Medical Data
2. Sports Data
3. General News Articles

We are going to use `requests` and `beautifulsoup4` to accomplish this task.  The first step is to determine the sources
of the data.  

In [1]:
import bs4
import requests
import itertools
import json
import os

from bs4 import BeautifulSoup

Before we go through the different sources, the first thing we want is a helper function
that returns the `status_code` and the `content` of a request.  We also want it to retry a failed URL
(3 times by default), likely introducing some sort of wait between requests.  

In [2]:
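import time

def retry_get(url, retry_count=3, wait_seconds=1):
    """Return (status_code, content) for url, retrying failed requests."""
    attempt_count = 0
    while True:
        attempt_count += 1
        try:
            result = requests.get(url)
            code = result.status_code
            content = result.content

            # Any 2xx response counts as success.
            if 200 <= code < 300:
                return code, content

            # Out of retries: hand back the last response as-is.
            if attempt_count > retry_count:
                return code, content

        except Exception as e:
            print(f'Url {url} failed {attempt_count} times - last error {e}')
            # Out of retries: -1 signals that no response was received.
            if attempt_count > retry_count:
                return -1, None

        # Pause briefly before the next attempt.
        time.sleep(wait_seconds)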
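
The wait between attempts mentioned above does not have to be constant. As a minimal sketch, assuming we want the pause to double after each failure (exponential backoff, which is one common choice rather than anything this notebook requires), the delays could be computed like this. The name `backoff_delays` and its parameters are hypothetical.

```python
def backoff_delays(retry_count=3, base_seconds=1.0, max_seconds=30.0):
    """Yield one wait time per retry, doubling each time up to max_seconds.

    Hypothetical helper for illustration; not part of requests.
    """
    for attempt in range(retry_count):
        yield min(base_seconds * (2 ** attempt), max_seconds)


delays = list(backoff_delays(retry_count=5, base_seconds=1.0, max_seconds=10.0))
print(delays)  # [1.0, 2.0, 4.0, 8.0, 10.0]
```

Each value could then be passed to `time.sleep` between attempts inside `retry_get`.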

## Medical Sources

Today we are going to scrape another medical news site; in this case we are going to grab data 
from [News Medical](https://www.news-medical.net/medical/news).  This page contains a number of
articles that date back as far as __2009__.  

First, let's look at the page that lists the articles (in a paged fashion).  That link is
[News Medical Page 1 - https://www.news-medical.net/medical/news?page=1](https://www.news-medical.net/medical/news?page=1).  

If we look at the structure of the list of articles on the page (there are 20 per page), we see something like the following. 

        <div class="posts">
            <div class="row">
                <div>
                </div>
                <div>
                    <h3>
                        <a href="--article-link--">--Article title--</a>
                    </h3>
                    <p class="item-desc">
                        --Article Description--
                    </p>
                </div>
            </div>
            ...
        </div>

Given this format, we are going to grab all the `item-desc` elements under the `posts` parent, then step up
to each one's parent `div` to reach the `h3` sibling that holds the article link and title.  
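
To check that lookup before pointing it at the live site, it can be tried on a small inline sample. This is only a sketch: the HTML below is made up to mirror the structure shown above, the `href` is a hypothetical path, and it uses Python's built-in `html.parser` so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Made-up sample that mirrors the structure shown above; not real site content.
sample_html = """
<div class="posts">
    <div class="row">
        <div></div>
        <div>
            <h3><a href="/news/example-article.aspx">Example title</a></h3>
            <p class="item-desc">
                Example description
            </p>
        </div>
    </div>
</div>
"""

# 'html.parser' ships with Python, so nothing beyond bs4 is required here.
soup = BeautifulSoup(sample_html, 'html.parser')
posts = soup.find('div', class_='posts')

parsed = []
for desc in posts.find_all('p', class_='item-desc'):
    container = desc.parent      # the <div> holding both the <h3> and the <p>
    link = container.find('a')   # first link in that <div> is the title link
    parsed.append({
        'href': link['href'],
        'title': link.get_text().strip(),
        'desc': desc.get_text().strip(),
    })

print(parsed)
```

Starting from the `item-desc` paragraph and stepping up to its parent avoids picking up the empty `div` that sits alongside it in each row.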

In [7]:
def get_articles_from_medical_news_page(page_number=None):
    base_url = 'https://www.news-medical.net'
    news_url = f'{base_url}/medical/news'
    if page_number is not None:
        page_url = f'{news_url}?page={page_number}'
    else:
        page_url = news_url
        
    try:
        status_code, content = retry_get(page_url)

        if status_code < 200 or status_code >= 300:
            return status_code, []

        soup = BeautifulSoup(content, 'html5lib')

        posts = soup.find('div', class_='posts')
        articles = [p.parent for p in posts.find_all('p', class_='item-desc')]
        
        return_articles = []
        for article in articles:
            a_tag = article.find('a')
            article_link = f'{base_url}{a_tag["href"]}'
            article_desc = article.find('p').get_text().strip()
            article_header = a_tag.get_text()
            return_articles.append({
                'link': article_link,
                'desc': article_desc,
                'title': article_header
            })
            
    except Exception as e:
        # page_url is the listing URL requested above
        print(f'Failed to get details from {page_url} - {e}')
        return -1, []
    
    return status_code, return_articles
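
The function above builds each link by prefixing `base_url` to the `href`, which assumes every `href` is site-relative. If some turn out to be absolute (an assumption; the listing has not been checked for this), `urllib.parse.urljoin` from the standard library handles both cases. The paths below are made up for illustration.

```python
from urllib.parse import urljoin

base_url = 'https://www.news-medical.net'

# Hypothetical href values for illustration only.
relative_href = '/news/example-article.aspx'
absolute_href = 'https://www.news-medical.net/news/example-article.aspx'

relative_link = urljoin(base_url, relative_href)
absolute_link = urljoin(base_url, absolute_href)

# Both forms resolve to the same full URL.
print(relative_link)
print(absolute_link)
```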

So we now have a way to grab all the article links on a listing page.  Let's look at an article to determine
what the structure is there and how to extract data from the page.  

To start, here is the structure we are looking at.  

    <div class='item-body content-item-body'>
        <h1 itemprop='headline'>--Article title--</h1>
        ...
        <div class='content'>
            <div itemprop='articleBody'>
                <div class='article-meta'>
                     <span class='article-meta-contents'>
                         <span class='article-meta-date'>Month Day, Year</span>
                     </span>
                </div>
                ...
                <p>--Article Paragraph--</p>
                <p>--Article Paragraph--</p>
                ...
                <p>--Article Paragraph--</p>
                <p>--Article Paragraph--</p>
                ...    
            </div>        
        </div>
    </div>

In [22]:
def read_news_medical_article(url):
    try:
        status_code, content = retry_get(url)
        
        if status_code < 200 or status_code >= 300:
            return status_code, str(content)
        
        soup = BeautifulSoup(content, 'html5lib')
        article = soup.find('div', class_='content-item-body')
        article_content = article.find('div', itemprop='articleBody')
        
        article_headline = article.find('h1', itemprop='headline').get_text().strip()
        article_date = article_content.find('span', class_='article-meta-date').get_text().strip()
        article_paragraphs = [p.get_text().strip() for p in article_content.find_all('p')]
        
        return status_code, {
            'headline': article_headline,
            'date': article_date,
            'text': '\n--\n'.join(article_paragraphs),
            'paragraphs': article_paragraphs
        }
    except Exception as e:
        print(f'Failed to read article {url} - {e}')
        return -1, None
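
The `date` field comes back as display text in the `Month Day, Year` form shown in the structure above. If a sortable value is wanted later, a sketch along these lines could convert it, assuming English month names and the default locale. `parse_article_date` is a hypothetical helper, and the sample date is made up.

```python
from datetime import datetime


def parse_article_date(text):
    """Convert 'Month Day, Year' text to an ISO date string, or None if it does not match."""
    try:
        return datetime.strptime(text.strip(), '%B %d, %Y').date().isoformat()
    except ValueError:
        return None


print(parse_article_date('March 4, 2019'))   # 2019-03-04
print(parse_article_date('not a date'))      # None
```

Returning `None` on a mismatch keeps a single oddly formatted date from failing the whole article.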
        

At this point we have the two helper methods needed to read in a list of articles and get the content of those articles.  Let's
now create a helper function that we can use to retrieve all the articles for a list of `page_numbers`.  

In [27]:
def read_from_news_medical(page_numbers, save_filename):
    results = {}
    for page_number in page_numbers:
        print(f'Reading page_number {page_number} of {page_numbers}')
        code, articles = get_articles_from_medical_news_page(page_number)
        if code < 200 or code >= 300:
            results[page_number] = (code, articles)
            continue
            
        for article in articles:
            code, result = read_news_medical_article(article['link'])            
            article['status_code'] = code
            article['result'] = result

        results[page_number] = articles
    
    with open(save_filename, 'w') as f:
        json.dump(results, f, indent=4)
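
One detail worth knowing when reading these files back: `json.dump` writes dictionary keys as strings, so the integer page numbers come back as `'1'`, `'2'`, and so on, and the `(code, articles)` tuple stored for a failed page comes back as a list. A small round trip shows this; the values are placeholders.

```python
import json

# Placeholder data shaped like `results`: page 1 succeeded, page 2 failed.
results = {
    1: [{'link': 'https://example.com/a', 'title': 'Example', 'status_code': 200}],
    2: (404, []),
}

loaded = json.loads(json.dumps(results, indent=4))

print(list(loaded.keys()))   # ['1', '2']
print(loaded['2'])           # [404, []]

# Convert the keys back to integers if the page numbers are needed as numbers.
by_page = {int(page): value for page, value in loaded.items()}
```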

Alright, we have the method; let's test it out.

In [28]:
#read_from_news_medical([1], '/tmp/news_medical.json')

Reading page_number 1 of [1]


Alright, that works for a single page, so now let's create a crawler that goes through groups of pages and writes each group out to a file.  

In [30]:
page_size = 10
max_groups = 20
page_groups = [list(range(1 + i * page_size, (1 + page_size) + (i * page_size))) for i in range(max_groups)]
display(page_groups)

[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
 [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 [31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
 [41, 42, 43, 44, 45, 46, 47, 48, 49, 50],
 [51, 52, 53, 54, 55, 56, 57, 58, 59, 60],
 [61, 62, 63, 64, 65, 66, 67, 68, 69, 70],
 [71, 72, 73, 74, 75, 76, 77, 78, 79, 80],
 [81, 82, 83, 84, 85, 86, 87, 88, 89, 90],
 [91, 92, 93, 94, 95, 96, 97, 98, 99, 100],
 [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
 [111, 112, 113, 114, 115, 116, 117, 118, 119, 120],
 [121, 122, 123, 124, 125, 126, 127, 128, 129, 130],
 [131, 132, 133, 134, 135, 136, 137, 138, 139, 140],
 [141, 142, 143, 144, 145, 146, 147, 148, 149, 150],
 [151, 152, 153, 154, 155, 156, 157, 158, 159, 160],
 [161, 162, 163, 164, 165, 166, 167, 168, 169, 170],
 [171, 172, 173, 174, 175, 176, 177, 178, 179, 180],
 [181, 182, 183, 184, 185, 186, 187, 188, 189, 190],
 [191, 192, 193, 194, 195, 196, 197, 198, 199, 200]]
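
The same grouping can be written as a small function, which makes it easier to start from a later page. This is only a sketch; `make_page_groups` and its parameters are hypothetical names.

```python
def make_page_groups(page_size=10, max_groups=20, first_page=1):
    """Split consecutive page numbers into max_groups lists of page_size each."""
    groups = []
    for i in range(max_groups):
        start = first_page + i * page_size
        groups.append(list(range(start, start + page_size)))
    return groups


groups = make_page_groups(page_size=10, max_groups=20)
print(groups[0])                        # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(groups[-1][0], groups[-1][-1])    # 191 200
```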

In [32]:
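output_dir = os.path.join('..', 'raw_data', 'docs', 'medical')
os.makedirs(output_dir, exist_ok=True)  # open() fails if the directory is missing

for page_nums in page_groups:
    print(f'Handling page nums {page_nums}')
    file_name = f'news-medical-{page_nums[0]}-{page_nums[-1]}.json'
    read_from_news_medical(page_nums, os.path.join(output_dir, file_name))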
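
A crawl over 200 pages can be interrupted partway through. One way to make it resumable, sketched here under the assumption that a file on disk means its group finished, is to skip groups whose output file already exists. `pending_groups` is a hypothetical helper, and the demo uses a temporary directory rather than the real output path.

```python
import os
import tempfile


def pending_groups(page_groups, output_dir):
    """Return (page_nums, path) pairs for groups that have no output file yet."""
    todo = []
    for page_nums in page_groups:
        file_name = f'news-medical-{page_nums[0]}-{page_nums[-1]}.json'
        path = os.path.join(output_dir, file_name)
        if not os.path.exists(path):
            todo.append((page_nums, path))
    return todo


with tempfile.TemporaryDirectory() as output_dir:
    demo_groups = [[1, 2], [3, 4], [5, 6]]
    # Pretend the first group was already written by an earlier run.
    with open(os.path.join(output_dir, 'news-medical-1-2.json'), 'w') as f:
        f.write('{}')
    remaining = [page_nums for page_nums, _ in pending_groups(demo_groups, output_dir)]

print(remaining)   # [[3, 4], [5, 6]]
```

Since `read_from_news_medical` only writes its file after every page in the group has been read, an existing file is a reasonable signal that the group completed.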