Scraping Data from the Web
=====================================

Today we are going to write a few scripts that we can use to scrape data from the web in three
different fields:  

1. Medical Data
2. Sports Data
3. General News Articles

We are going to use `requests` and `beautifulsoup4` to accomplish this task.  The first step is to determine the sources
of the data.  

In [115]:
import bs4
import requests
import itertools
import json
import os

from bs4 import BeautifulSoup

Before we start going through the different sources, the first thing that we want to create is a helper function
that we can use to easily return the `status_code` and the `content`.  We also want it to retry at least 3 times
in the case of a failed url (likely introducing some sort of wait between requests).  

In [123]:
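import time

def retry_get(url, retry_count=3, wait_seconds=1):
    attempt_count = 0
    while True:
        try:
            attempt_count += 1
            # A timeout keeps a stalled connection from hanging the whole run.
            result = requests.get(url, timeout=30)
            code = result.status_code
            content = result.content

            if 200 <= code < 300:
                return code, content

            # Give up after the initial attempt plus `retry_count` retries.
            if attempt_count > retry_count:
                return code, content

        except Exception as e:
            print(f'Url {url} failed {attempt_count} times - last error {e}')
            if attempt_count > retry_count:
                return -1, None

        # Wait briefly before retrying so we do not hammer the server.
        time.sleep(wait_seconds)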
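
The wait between requests mentioned above could take the form of exponential backoff.  The following is a minimal
sketch, not the helper used in the rest of this notebook: `retry_with_backoff`, `base_delay`, and the injectable
`fetch` and `sleep` parameters are hypothetical names, introduced so the logic can be exercised without touching
the network.  

```python
import time


def retry_with_backoff(fetch, url, retry_count=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a 2xx code or the retries run out.

    fetch is any callable returning (status_code, content).  Returns the last
    (status_code, content) seen, or (-1, None) if the last attempt raised.
    """
    code, content = -1, None
    for attempt in range(retry_count + 1):  # one initial try plus retry_count retries
        try:
            code, content = fetch(url)
            if 200 <= code < 300:
                return code, content
        except Exception as e:
            code, content = -1, None
            print(f'Url {url} failed on attempt {attempt + 1} - {e}')

        if attempt < retry_count:
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * (2 ** attempt))

    return code, content


# Demo with a fake fetcher that raises twice and then succeeds.
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError('simulated failure')
    return 200, b'ok'

delays = []  # record the requested waits instead of actually sleeping
result = retry_with_backoff(flaky_fetch, 'https://example.com', sleep=delays.append)
print(result, delays)
```

In the notebook, `fetch` could be a small wrapper around `requests.get` that returns `(result.status_code, result.content)`.  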

## Medical Sources

There are a couple of sources that I am going to look into for data that should be "medical" in nature.  

First is [Medical News Today](https://www.medicalnewstoday.com), which lists all of its archived articles
at `https://www.medicalnewstoday.com/archive/<page_number>`.  Omitting `page_number` defaults to the most
recent articles.  

When I first looked at the URL the pager showed pages 1-37; however, when explicitly requesting `page_number`
37 it showed that the archive goes up to **93** pages of articles.  To start, let me give a breakdown
of the HTML in the "archive" page.  

In the body of the page, the list of articles can be found under the following tag.  

        <ul class='listing'>
            <li class='article'>
                <a href='/articles/<number>.php' title='Some article title'>
                    ...
                    <span class='story_metadata'>
                         <span class='story_date'>7 Apr 2016</span>
                    </span>
                </a>
            </li>
        </ul>
        
So our first little function is going to get the HTML for the "archive" page and return, for each article, its
link and title as well as the class name of the list item (I have seen at least `article`, `knowledge`, and
`featured`; the output further down also shows `written`).  



In [124]:
def get_articles_from_archive_page(page_number=None):
    base_url = 'https://www.medicalnewstoday.com'
    archive_url = f'{base_url}/archive/'
    if page_number is not None:
        archive_url = archive_url + str(page_number)

    try:
        status_code, content = retry_get(archive_url)

        if status_code < 200 or status_code >= 300:
            return status_code, []

        soup = BeautifulSoup(content, 'html5lib')

        article_list = soup.find('ul', class_='listing')
        articles = article_list.find_all('li')
    except Exception as e:
        print(f'Failed to get details from {archive_url} - {e}')
        return -1, []
    
    return status_code, [
        { 
            'link': f'{base_url}{listing.a["href"]}',
            'title': listing.a['title'],
            'type': listing['class'][0],
            'span_class': listing.span['class'][0],
            'span_text': listing.span.text.strip(),
            'time': listing.find('span', class_='story_metadata').span.text
        } for listing in articles
    ]        
    

In [125]:
status_code, articles = get_articles_from_archive_page(93)
display(status_code, articles[:2])

200

[{'link': 'https://www.medicalnewstoday.com/articles/308044.php',
  'title': 'Hemp: Health Benefits, Nutritional Information',
  'type': 'knowledge',
  'span_class': 'headline',
  'span_text': 'Hemp: Health Benefits, Nutritional Information',
  'time': '7 Apr 2016'},
 {'link': 'https://www.medicalnewstoday.com/articles/308273.php',
  'title': 'Social media use and depression linked in large study',
  'type': 'written',
  'span_class': 'headline',
  'span_text': 'Social media use and depression linked in large study',
  'time': '23 Mar 2016'}]
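
To check the parsing logic without requesting the live site, the same lookups can be run against a hand-written
snippet.  This is only a sketch under assumptions: `SAMPLE_ARCHIVE_HTML` is made up to mirror the structure
described above, `parse_archive_listing` is a hypothetical name, and the built-in `html.parser` is used instead
of `html5lib` so that no extra parser is needed.  

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://www.medicalnewstoday.com'

# Hand-written sample mirroring the archive structure; not real page content.
SAMPLE_ARCHIVE_HTML = """
<ul class='listing'>
    <li class='knowledge'>
        <a href='/articles/308044.php' title='Hemp: Health Benefits, Nutritional Information'>
            <span class='headline'>Hemp: Health Benefits, Nutritional Information</span>
            <span class='story_metadata'>
                <span class='story_date'>7 Apr 2016</span>
            </span>
        </a>
    </li>
    <li class='article'>An item without a link</li>
</ul>
"""


def parse_archive_listing(html, base_url=BASE_URL):
    """Return one dict per <li> that contains a link; skip items that do not."""
    soup = BeautifulSoup(html, 'html.parser')
    article_list = soup.find('ul', class_='listing')
    if article_list is None:
        return []

    rows = []
    for item in article_list.find_all('li'):
        link = item.find('a', href=True)
        if link is None:
            continue  # malformed entry: nothing to follow
        date_tag = item.find('span', class_='story_date')
        rows.append({
            'link': f'{base_url}{link["href"]}',
            'title': link.get('title', ''),
            'type': item.get('class', [''])[0],
            'time': date_tag.get_text(strip=True) if date_tag else None,
        })
    return rows


rows = parse_archive_listing(SAMPLE_ARCHIVE_HTML)
print(rows)
```

Skipping entries that lack a link, rather than indexing into them directly, keeps one malformed `<li>` from
aborting a whole archive page.  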

Now that we have a function that returns all the links, we need to create code that will iterate through
all the articles and pull the data that we care about from each page.  

We are going to look at a couple of different pages to see if the schemas are the same.  Here is the first
one, taken from [https://www.medicalnewstoday.com/articles/308044.php](https://www.medicalnewstoday.com/articles/308044.php).  

#### Format

The article text is found inside the following structure:

    <div class='article_body'>
        <div itemprop='articleBody'>
            ...
        </div>
        <p id='advertiser_disclosure'>
           ...
        </p>
    </div>
    
There are a few other tags inside the article body whose text we want to ignore or remove, namely:  

    <div class='article_toc ...'> ... </div>
    <span class='imageWidgetWrapper'> ... </span>
    <div class='... leaderboard'> ... </div>
    <script> ... </script>
    <img ...>
    <div class='... related_inline ...'> ... </div>
    
So let's create a function that reads a page from a link and parses out just the text that we care about
on that page.  (We could maybe verify the page by making sure the body has the two classes `article` and `v2`.)  

In [126]:
def read_article_text(url):
    try:
        status_code, content = retry_get(url)
        
        if status_code < 200 or status_code >= 300:
            return status_code, str(content)

        soup = BeautifulSoup(content, 'html5lib')
        article_tag = soup.find('div', class_='article_body')
        article_body = article_tag.find('div', itemprop='articleBody')

        tags_to_remove = itertools.chain(*[
            article_body.find_all('script'),
            article_body.find_all('div', class_='article_toc'),
            article_body.find_all('span', class_='imageWidgetWrapper'),
            article_body.find_all('div', class_='leaderboard'),
            article_body.find_all('img'),
            article_body.find_all('div', class_='related_inline'),
            article_body.find_all('div', class_='photobox_right'),
            article_body.find_all('div', class_='photobox_left')
        ])

        for tag in tags_to_remove:
            tag.extract()

        return status_code, article_body.get_text()
    except Exception as e:
        print(f'Failed to read article {url} - {e}')
        return -1, None

In [127]:
status_code, page_details = read_article_text('https://www.medicalnewstoday.com/articles/308044.php')
print(status_code)
print(page_details[:500].strip() + '...')

200
Hemp is a plant grown in the northern hemisphere that takes about 3-4 months to mature. Hemp seeds can be consumed or used to produce a variety of food products including hemp milk, hemp oil, hemp cheese substitutes and hemp-based protein powder.
Hemp seeds have a mild, nutty flavor. Hemp milk is made from hulled hemp seeds, water, and sweetener. Hemp oil has a strong "grassy" flavor.



Hemp is commonly confused with marijuana. It belongs to the same family, but the two plants are very dif...
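
The output above still contains runs of blank lines where tags were removed.  The sketch below shows one possible
way to strip the unwanted tags with a single CSS selector and then drop the empty lines.  It is an illustration
under assumptions, not the function used in this notebook: `SAMPLE_ARTICLE_HTML` is a made-up snippet,
`extract_article_text` and `UNWANTED_SELECTORS` are hypothetical names, and `html.parser` stands in for `html5lib`.  

```python
from bs4 import BeautifulSoup

# Made-up article snippet following the structure described in "Format".
SAMPLE_ARTICLE_HTML = """
<div class='article_body'>
    <div itemprop='articleBody'>
        <div class='article_toc'>Table of contents</div>
        <p>Hemp seeds have a mild, nutty flavor.</p>
        <script>trackPageView();</script>
        <span class='imageWidgetWrapper'><img src='hemp.jpg'></span>
        <p>Hemp is commonly confused with marijuana.</p>
    </div>
    <p id='advertiser_disclosure'>Sponsored links</p>
</div>
"""

# One selector covering the tags listed as unwanted in the text.
UNWANTED_SELECTORS = (
    'script, img, div.article_toc, span.imageWidgetWrapper, '
    'div.leaderboard, div.related_inline'
)


def extract_article_text(html):
    """Return the article text with unwanted tags and empty lines removed."""
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', itemprop='articleBody')
    if body is None:
        return None  # page does not follow the expected schema

    for tag in body.select(UNWANTED_SELECTORS):
        tag.extract()

    # Strip each line and keep only the non-empty ones.
    lines = (line.strip() for line in body.get_text().splitlines())
    return '\n'.join(line for line in lines if line)


clean_text = extract_article_text(SAMPLE_ARTICLE_HTML)
print(clean_text)
```

Returning `None` when `articleBody` is missing makes pages with a different schema easy to spot afterwards.  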


### Gathering for the entire site

We now have two separate steps that we need to combine into one.  The combined step takes the list of `page_numbers`
to gather data for and the location to save the results to (as a JSON file).  

In [128]:
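def read_from_medicalnewstoday(page_numbers, save_filename):
    """Scrape every article on the given archive pages and save them as JSON."""
    results = {}
    for page_number in page_numbers:
        code, articles = get_articles_from_archive_page(page_number)
        if code < 200 or code >= 300:
            # Record the failed archive page as (status_code, []) and move on.
            results[page_number] = (code, articles)
            continue

        # Attach the status code and text of each article to its listing entry.
        for article in articles:
            code, text = read_article_text(article['link'])
            article['status_code'] = code
            article['text'] = text

        results[page_number] = articles

    # Create the target directory if needed so that open() does not fail on it.
    save_dir = os.path.dirname(save_filename)
    if save_dir:
        os.makedirs(save_dir, exist_ok=True)

    with open(save_filename, 'w') as f:
        json.dump(results, f, indent=4)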

In [122]:
page_nums = list(range(1, 6))
file_name = f'medicalnewstoday-{page_nums[0]}-{page_nums[-1]}.json'
read_from_medicalnewstoday(page_nums, os.path.join('..', 'raw_data', 'docs', 'medical', file_name))

In [129]:
page_nums = list(range(6, 11))
file_name = f'medicalnewstoday-{page_nums[0]}-{page_nums[-1]}.json'
read_from_medicalnewstoday(page_nums, os.path.join('..', 'raw_data', 'docs', 'medical', file_name))

OK, so we have the first two groups (pages 1-10); now let's create the remaining page groups.

In [135]:
page_groups = [list(range(start, start + 5)) for start in range(11, 91, 5)]
page_groups.append(list(range(91, 94)))
display(page_groups)

[[11, 12, 13, 14, 15],
 [16, 17, 18, 19, 20],
 [21, 22, 23, 24, 25],
 [26, 27, 28, 29, 30],
 [31, 32, 33, 34, 35],
 [36, 37, 38, 39, 40],
 [41, 42, 43, 44, 45],
 [46, 47, 48, 49, 50],
 [51, 52, 53, 54, 55],
 [56, 57, 58, 59, 60],
 [61, 62, 63, 64, 65],
 [66, 67, 68, 69, 70],
 [71, 72, 73, 74, 75],
 [76, 77, 78, 79, 80],
 [81, 82, 83, 84, 85],
 [86, 87, 88, 89, 90],
 [91, 92, 93]]
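
The grouping can also be written as a small helper that handles the shorter final group on its own.  This is a
sketch only; `chunk_pages` and `group_filename` are hypothetical names that do not appear elsewhere in this
notebook.  

```python
def chunk_pages(first_page, last_page, size=5):
    """Split the inclusive range first_page..last_page into lists of at most `size` pages."""
    return [
        list(range(start, min(start + size, last_page + 1)))
        for start in range(first_page, last_page + 1, size)
    ]


def group_filename(page_nums):
    """Build the output file name from the first and last page of a group."""
    return f'medicalnewstoday-{page_nums[0]}-{page_nums[-1]}.json'


page_groups = chunk_pages(11, 93)
print(page_groups[0], page_groups[-1], group_filename(page_groups[-1]))
```

Because the last group is clipped to `last_page`, no separate `append` is needed when the page count is not a
multiple of the group size.  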

In [136]:
for page_nums in page_groups:
    print(f'Handling page nums {page_nums}')
    file_name = f'medicalnewstoday-{page_nums[0]}-{page_nums[-1]}.json'
    read_from_medicalnewstoday(page_nums, os.path.join('..', 'raw_data', 'docs', 'medical', file_name))

Handling page nums [11, 12, 13, 14, 15]
Handling page nums [16, 17, 18, 19, 20]
Handling page nums [21, 22, 23, 24, 25]
Handling page nums [26, 27, 28, 29, 30]
Handling page nums [31, 32, 33, 34, 35]
Url https://www.medicalnewstoday.com/articles/321148.php failed 1 times - last error HTTPSConnectionPool(host='www.medicalnewstoday.com', port=443): Max retries exceeded with url: /articles/321148.php (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x000002174EB16630>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))
Handling page nums [36, 37, 38, 39, 40]
Failed to read article https://www.medicalnewstoday.com/articles/221369.php - 'NoneType' object has no attribute 'find'
Failed to read article https://www.medicalnewstoday.com/articles/226119.php - 'NoneType' ob

Alright, all files have been downloaded.  These are the raw files, so the final step is to take them
and remove any links that errored out.  

In [145]:
#os.chdir(os.path.join('..', 'raw_data', 'docs', 'medical'))

files = [f for f in os.listdir() if f.endswith('.json') and f != 'medicalnewstoday-all-ok.json']

values = []
for f in files:
    with open(f, 'r') as data_file:
        # Each file maps a page number to its list of article dicts.  A page whose
        # archive request failed is stored as [status_code, []] instead, so keep
        # only the dict entries.
        for page_articles in json.load(data_file).values():
            values.extend(a for a in page_articles if isinstance(a, dict))

ok_values = [d for d in values if 200 <= d['status_code'] < 300]

with open('medicalnewstoday-all-ok.json', 'w') as out:
    json.dump(ok_values, out, indent=2)
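
Before discarding the failed links it may be useful to see how many there are and which ones could be fetched
again.  The sketch below works on the same `{page_number: [article, ...]}` shape that the raw files hold.
`summarize_status_codes` and `failed_links` are hypothetical helper names, and `sample_pages` is illustrative data,
not real results.  

```python
from collections import Counter


def _articles(pages):
    """Yield only the article dicts, skipping entries from failed archive pages."""
    for page_articles in pages.values():
        for article in page_articles:
            if isinstance(article, dict):
                yield article


def summarize_status_codes(pages):
    """Count how many articles ended with each status code."""
    return Counter(article.get('status_code', -1) for article in _articles(pages))


def failed_links(pages):
    """Return the links of articles whose status code is not 2xx."""
    return [
        article['link']
        for article in _articles(pages)
        if not 200 <= article.get('status_code', -1) < 300
    ]


# Illustrative data only: one good article, one failed article, one failed page.
sample_pages = {
    '1': [
        {'link': 'https://www.medicalnewstoday.com/articles/308044.php',
         'status_code': 200, 'text': 'Hemp is a plant...'},
        {'link': 'https://www.medicalnewstoday.com/articles/221369.php',
         'status_code': -1, 'text': None},
    ],
    '2': [404, []],
}

status_counts = summarize_status_codes(sample_pages)
retry_candidates = failed_links(sample_pages)
print(status_counts, retry_candidates)
```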