# Cleaned Parsing Functions

In [10]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
import re
import dateutil.parser
import datetime

## Overview

Our goal is to extract relevant information on every article tagged with the `Codingbootcamp` tag.  
We accomplish this in three steps:
1. Iterate through each page of the archive for this tag.
2. On each date page (such as 2018/01/02), extract the urls for all links that correspond to articles published on that date tagged with `Codingbootcamp` (don't extract "Home", "signup" links etc).
3. On each url, call the single article parsing function to extract author, author_bio, title, publish date, publisher, and article text.

## Get HTML Function
**Base Source**: https://realpython.com/python-web-scraping-practical-introduction/  
**Update 10/1/18**: This code originated from this tutorial but was later adapted in order to complete step 3 - to get all articles tagged with `Codingbootcamp`.  
I added a second get-HTML function that checks whether the request was redirected at some point while fetching the HTML at the provided link. This was necessary because when you attempt to access a page of the archive that doesn't exist, such as `archive/2013/03/02`, you are instead redirected back to the year or month page. In those situations I didn't want to re-parse the articles on the main year or month page, which would have been redundant parsing (and extra time in an already long function), so a page that has been redirected is simply not parsed.  
I had to keep this in a separate function, because certain single-article URLs, like "https://medium.freecodecamp.org/5-key-learnings-from-the-post-bootcamp-job-search-9a07468d2331", were redirected in the retrieval process (I am not sure why). I attempted to come up with an elegant solution using response object attributes that would differentiate between article redirects and archive redirects, but was unable to do so.  
So the function that retrieves the HTML for articles doesn't check for redirects, but the function that retrieves archive page HTML does check for redirects. 
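
One possible way to tell the two kinds of redirect apart is sketched below. It rests on an assumption taken from the description above: a missing archive page always falls back to its own year or month page, so the final URL (`resp.url` in `requests`) is a strict parent path of the requested URL on the same host. The helper name `is_archive_fallback` is hypothetical, and the article destination URL in the usage lines is made up for illustration.

```python
from urllib.parse import urlparse

def is_archive_fallback(requested_url, final_url):
    """Return True if final_url is a strict parent path of requested_url on the same host."""
    requested, final = urlparse(requested_url), urlparse(final_url)
    if requested.netloc != final.netloc:
        return False
    requested_parts = requested.path.rstrip("/").split("/")
    final_parts = final.path.rstrip("/").split("/")
    return (len(final_parts) < len(requested_parts)
            and requested_parts[:len(final_parts)] == final_parts)

archive = "https://medium.com/tag/codingbootcamp/archive"
# A nonexistent day page that lands on its year page counts as a fallback.
print(is_archive_fallback(archive + "/2013/03/02", archive + "/2013"))
# An article redirected to a different host (hypothetical destination) does not.
print(is_archive_fallback("https://medium.freecodecamp.org/some-article",
                          "https://medium.com/free-code-camp/some-article"))
```

If the assumption holds, a single get function could call this with the requested URL and `resp.url` instead of inspecting `resp.history`, but that has not been checked against the live site.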

In [11]:
# Define get function to get raw HTML
def simple_get_article(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        #closing ensures any network resources are freed when out of scope - good practice
        with closing(get(url, stream=True)) as resp: 
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

# Define get function to get raw HTML from the archive pages, with a twist.
def simple_get_archive(url):
    """
    Retrieves the raw html of a page in the `Codingbootcamp` tag archive (https://medium.com/tag/codingbootcamp/archive).
    But some pages of the archive don't exist (like 2013/01/02) because no stories were published on that date. 
    If you attempt to access these nonexistent urls, you are redirected to the main year or month page.
    We don't want to re-parse those pages (redundant code), so our hack-y solution is to check to see if the HTML response
    object has a redirect (302 status code) in its history. 
    This is defined as a separate function, because some of the article urls are redirected (I am not sure why), and I couldn't 
    come up with a clean solution that separates article redirection vs archive redirection. 
    """
    try:
        #closing ensures any network resources are freed when out of scope - good practice
        with closing(get(url, stream=True)) as resp: 
            if is_redirect(resp):
                return "redirect"
            elif is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None
    
# I added this helper function to check if the HTML response returned by the browser had been redirected at some point
# See http://docs.python-requests.org/en/v0.10.6/api/ for docs on history attribute
def is_redirect(resp):
    """
    Returns True if the resp had been redirected at some point in the retrieval process due to nonexistent url, 
    False otherwise. 
    Arguments: an HTTP response object (as returned by requests)
    """
    resp_history = resp.history
    if resp_history: # not empty - some things happened before the response was returned
        # I specifically want to check for redirects (status code 302)
        statuses = [h.status_code == 302 for h in resp_history]
        # Are there any True values in the list above? Then something was redirected.
        # The built-in any() returns a plain bool and doesn't need numpy.
        return any(statuses)
    else:
        return False
    
def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # .get avoids a KeyError when the server omits the Content-Type header
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and 'html' in content_type)


def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)

## Single Article Info Parsing

The following functions return information parsed from a single article on Medium. 

In [12]:
def get_author(parsed_html):
    """Parses the author name from a Medium article. 
    Arguments:
        parsed_html: object returned by calling `BeautifulSoup(raw_html, 'html.parser')`
    Returns: 
        the author name as a string or None if author tag not present
    """
    author = parsed_html.find('meta', property="author")
    return author['content'] if author else None

def get_author_bio(parsed_html):
    """Parses the author's bio/description from a Medium article. 
    Arguments:
        parsed_html: object returned by calling `BeautifulSoup(raw_html, 'html.parser')`
    Returns: 
        the author bio/description as a string if it exists, and None otherwise
    """
    bios = parsed_html.find_all('div', class_="ui-caption ui-xs-clamp2 postMetaInline")
    # If bios is empty, there is no author bio for the article and bios[0] would raise an error,
    # so we explicitly check and return None if there is no author bio
    return bios[0].text if bios else None

def get_title(parsed_html):
    """Parses the title of a Medium article.
    Arguments:
        parsed_html: object returned by calling `BeautifulSoup(raw_html, 'html.parser')`
    Returns: 
        the title of the article as a string or None if title tag not present
    """
    title = parsed_html.find('meta', property='og:title')
    return title['content'] if title else None

def get_raw_publish_date(parsed_html):
    """Parses the date a Medium article was published. 
    Arguments:
        parsed_html: object returned by calling `BeautifulSoup(raw_html, 'html.parser')`
    Returns: 
        a raw/uncleaned publish date, which looks like '2016-11-19T16:48:30.365Z', or None if tag not present
    """
    date = parsed_html.find('meta', property='article:published_time')
    return date['content'] if date else None

def get_article_publisher(parsed_html):
    """Parses the article's publisher from a Medium article. 
    Arguments:
        parsed_html: object returned by calling `BeautifulSoup(raw_html, 'html.parser')`
    Notes:
        The publisher is encoded as "https://facebook.com/publisher" so I extract just the publisher name.
        Not all articles have a verified publisher - like if it's just the author's personal blog - so publisher is
        just "medium" in that case
    Returns:
        If article is hosted on verified publisher, returns publisher name as string
        If article is on personal blog, returns "medium" as a string. 
        If publisher tag doesn't exist, returns None
        
    """
    long_publisher = parsed_html.find("meta", property='article:publisher')
    return long_publisher['content'].split("/")[3] if long_publisher else None

def get_raw_article_text(parsed_html):
    """Extracts out the text/content of the Medium article.
    Arguments:
        parsed_html: object returned by calling `BeautifulSoup(raw_html, 'html.parser')`
    Returns:
        a raw/uncleaned string of text or None if tag is not present.
        The string is considered "raw" because there are some weird characters that are remnants
        of header formatting and the like.
    """
    text = parsed_html.find_all('div', class_='postArticle-content')
    return text[0].text if text else None

def clean_text(text):
    """Takes a string of text.
    Removes the \xa0 and \u200a characters that are scattered throughout the text,
    adds a space after every period or comma that isn't already followed by whitespace,
    puts a space in front of every capital letter (which splits words that have a capital letter in the middle),
    and collapses the resulting double spaces.
    """
    if text:
        cleaned_text = text.replace("\xa0", " ").replace("\u200a", " ")
        cleaned_text = re.sub(r'(?<=[.,])(?=[^\s])', r' ', cleaned_text)
        cleaned_text = re.sub(r'([A-Z])', r' \1', cleaned_text).lstrip().replace("  ", " ")
        return cleaned_text
    else:
        # text is a None object which you can't use regex on.
        return text

def clean_date(date):
    """Takes a string in RFC 3339 format ('Y-M-D"T"H:M:S.MS"Z"').
    Returns a string of format ('Y-M-D H:M:S').
    """
    if date:
        date = dateutil.parser.parse(date)
        date = date.strftime("%Y-%m-%d %H:%M:%S")
        return date
    else:
        # Date is a None object and you can't parse it
        return date

def get_all_article_info(article_url):
    """Parses a Medium article to get all needed information about author and story.
    Arguments:
        article_url: String of url for article to parse
    Returns:
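        list of [author, author_bio, title, date, publisher, article_text], where each component is a string or None
    """
    raw_html = simple_get_article(article_url)
    # simple_get_article returns None on a failed request or a non-HTML response,
    # and BeautifulSoup can't parse None, so return a row of Nones instead
    if raw_html is None:
        return [None] * 6
    parsed_html = BeautifulSoup(raw_html, 'html.parser')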
    author = get_author(parsed_html)
    author_bio = get_author_bio(parsed_html)
    title = get_title(parsed_html)
    raw_date = get_raw_publish_date(parsed_html)
    cleaned_date = clean_date(raw_date)
    publisher = get_article_publisher(parsed_html)
    raw_text = get_raw_article_text(parsed_html)
    cleaned_text = clean_text(raw_text)
    return [author, author_bio, title, cleaned_date, publisher, cleaned_text]
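
As a side note on `clean_date`: if every timestamp has exactly the shape shown in the `get_raw_publish_date` docstring (`'2016-11-19T16:48:30.365Z'`), the same conversion can be done with the standard library alone. This is a minimal sketch under that assumption; `dateutil` is more forgiving about variations, and `clean_date_stdlib` is a hypothetical name, not a function used elsewhere in this notebook.

```python
from datetime import datetime

def clean_date_stdlib(raw_date):
    """Convert 'Y-M-DTH:M:S.fffZ' to 'Y-M-D H:M:S'; return None for a missing date."""
    if not raw_date:
        return None
    # %f accepts one to six fractional-second digits; the trailing Z is matched literally
    parsed = datetime.strptime(raw_date, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")

print(clean_date_stdlib("2016-11-19T16:48:30.365Z"))  # 2016-11-19 16:48:30
```

A timestamp without fractional seconds would raise a `ValueError` here, which is the main reason to prefer the `dateutil` version when the format is not guaranteed.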

### Single Article Example Usage

In [13]:
test_url_1 = "https://medium.com/launch-school/were-not-a-bootcamp-c33901412c38"
get_all_article_info(test_url_1)

['Chris Lee',
 None,
 "We're Not a Bootcamp – Launch School – Medium",
 '2018-08-01 01:48:19',
 'medium',
 'We’re Not a Bootcamp We’re Something Unique, and Uniquely Effective Photo by Kyle Johnson on Unsplash One of the things about operating in a crowded marketplace is that you tend to get lumped in with the biggest names and most common stereotypes. In the programming education space, that means the now familiar label of “bootcamp. ” I often see people refer to Launch School as a “coding bootcamp, ” which may seem like a reasonable shortcut for helping people understand what we do; however, it fails to capture what makes us special. When I talk about Launch School and what we are trying to achieve, I don’t use the word “bootcamp. ” We are an online school for developers, but more than that, we are a school with an opinionated pedagogy that focuses on fundamentals first with the goal of building skills that last a career. So, why do I steer clear of the “bootcamp” label? To put it si

### Get all open post links from a page

In [14]:
def get_all_open_post_links(parsed_html):
    """Retrieves all links on a page that open to an article (does not return "home", "sign up" links etc).
    Arguments:
        parsed_html: object returned by calling `BeautifulSoup(raw_html, 'html.parser')`
    Notes:
        All links on a page that have the data-attribute 'open-post' are the types of links we want.
        find_all returns a special BeautifulSoup object so I need to extract the string url.
        I think there are multiple links for each post (like the title and "read more"), so the returned list
        has duplicates
    Returns:
        List of link strings
    """
    def href_open_post_data_action(tag):
        """Helper parsing function that checks to see if an "a" tag is an href with an 'open-post' data action.
        Arguments:
            tag: All html tags like <a>, <div> that BeautifulSoup has extracted
        Returns:
            true if tag has href and 'open-post' data-action attribute, false otherwise
        """
        return tag.has_attr('href') and tag.get('data-action') == 'open-post'
    
    link_objects =  parsed_html.find_all(href_open_post_data_action)
    return [link.get('href') for link in link_objects] #retrieves the string url from link object

The above function is the "middle level" function in the grand scheme of our parsing.  
First, we defined the function that parses information for a single article.  
The next step is to retrieve all the links to articles on a web page (like the web page returned by searching "coding bootcamp" for example), and then call our single article parsing function on each link we extracted from the page.  
Note - the name is "open post" links because there are a ton of links on a webpage - to "home", to "sign-up", to "search" etc. We are only interested in extracting the links that correspond to articles about coding bootcamps.  
After inspecting the source html, I discovered that all links that open to coding bootcamp articles have the attribute "data-action" set to "open-post" within the html `<a href>` tag. So this function ensures that it only extracts the links we want. 
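
To make the filtering concrete, here is a small sketch that runs the same idea on a made-up HTML snippet (the markup and URLs are invented for illustration; real Medium pages are far larger). It uses `find_all` with an attribute filter rather than the helper function above, and it assumes `bs4` is installed.

```python
from bs4 import BeautifulSoup

# Invented sample page: two links to the same story plus two links we don't want
sample_html = """
<a href="https://medium.com/">Home</a>
<a href="https://medium.com/p/abc123" data-action="open-post">A story title</a>
<a href="https://medium.com/p/abc123" data-action="open-post">Read more</a>
<a href="https://medium.com/signup" data-action="sign-up">Sign up</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# Keep only <a> tags that have an href and data-action="open-post"
open_post_tags = soup.find_all("a", href=True, attrs={"data-action": "open-post"})
hrefs = [tag["href"] for tag in open_post_tags]
# dict.fromkeys drops duplicates while keeping first-seen order
unique_hrefs = list(dict.fromkeys(hrefs))
print(unique_hrefs)
```

The title link and the "Read more" link point at the same story, which is why the raw list contains the URL twice.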

## Iterate Through All Dates in Archive

The next step is to define a function that will iterate through each "date" page in the archive, gather all the open post links for each "date" page (i.e. all the story links), and then call the single article parser on each link.  
**NOTE**: The following function is currently defined such that it will *ONLY* work on the archive for the `Codingbootcamp` tag (https://medium.com/tag/codingbootcamp/archive), because it operates on the assumption that the years 2013 and 2014 have so few articles that they are not separated by month or day, only by year. 
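
For the years that are split by month and day, the per-day URLs can be generated from the calendar so that impossible dates (such as February 30) are never requested. This is only a sketch: the `archive/YYYY/MM/DD` shape is taken from the examples in this notebook (e.g. `archive/2013/03/02`), and `archive_day_urls` is a hypothetical helper name.

```python
import calendar

BASE_URL = "https://medium.com/tag/codingbootcamp/archive"

def archive_day_urls(year):
    """Yield one archive url per real calendar day of `year`."""
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month)
        _, days_in_month = calendar.monthrange(year, month)
        for day in range(1, days_in_month + 1):
            # :02d zero-pads single-digit months and days
            yield f"{BASE_URL}/{year}/{month:02d}/{day:02d}"

urls_2016 = list(archive_day_urls(2016))
print(len(urls_2016))  # 366, since 2016 is a leap year
print(urls_2016[0])
```

Days on which no story was published would still redirect, so the redirect check remains necessary either way.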

In [15]:
def format_dates_for_url(integer_date):
    """
    Formats dates to enter into the archive url, which requires numbers < 10 to be encoded with a "0" in front,
    but Python doesn't allow 01 as an integer. 
    """
    # zfill pads with leading zeros up to width 2, so 1 -> "01" and 2013 -> "2013"
    return str(integer_date).zfill(2)


def get_specific_Codingbootcamp_links():
    """Collects the urls of all stories in the `Codingbootcamp` tag archive.
    Returns:
        List of unique story url strings
    """
    # For current range of years available
    years = range(2013, 2019)
    months_w_o_zero_in_front = range(1, 13)
    days_w_o_zero = range(1, 32)
    base_url = "https://medium.com/tag/codingbootcamp/archive"
    # Collector variable to store all story urls
    story_links = []
    for year in years:
        # These years don't have month, day subdivisions so to prevent redundant parsing, 
        # just parse the base year and its links.
        if year == 2013 or year == 2014:
            url = base_url + "/" + format_dates_for_url(year)
            raw_year_html = simple_get_archive(url)
            # If raw_year_html is None or a redirect, can't parse it, so skip to the next year
            # (continue, not break: break would abandon all remaining years)
            if not raw_year_html or raw_year_html == "redirect":
                continue
            parsed_year_html = BeautifulSoup(raw_year_html, 'html.parser')
            links = get_all_open_post_links(parsed_year_html)
            #use extend instead of append because just want to add elements, not create nested lists.
            story_links.extend(links)
        else:
            for month in months_w_o_zero_in_front:
                for day in days_w_o_zero:
                    # Archive urls look like archive/2018/01/02, so year, month and day each need a "/"
                    url = (base_url + "/" + format_dates_for_url(year) + "/"
                           + format_dates_for_url(month) + "/" + format_dates_for_url(day))
                    raw_day_html = simple_get_archive(url)
                    # 2015, 2016, 2017, 2018 have some dates with no stories, so GET requests are redirected
                    # and we don't want to parse stuff we already did.
                    # If raw_day_html is None, we can't parse it either.
                    if raw_day_html == "redirect" or not raw_day_html:
                        # skip this day and advance in the for loop (break would skip the rest of the month)
                        continue
                    parsed_day_html = BeautifulSoup(raw_day_html, 'html.parser')
                    links = get_all_open_post_links(parsed_day_html)
                    #use extend instead of append because just want to add elements, not create nested lists.
                    story_links.extend(links)
    # Create a set of links to remove duplicates and then turn back into a list (better data structure)
    return list(set(story_links))

### Collect Text Data

In [16]:
# List of urls for all stories tagged with `Codingbootcamp`
story_links = get_specific_Codingbootcamp_links()

In [17]:
# Get all info like article, publisher, etc for each url gathered above
# THIS CELL TAKES FOREVER BE WARNED - DO NOT CLOSE COMPUTER WHILE RUNNING ELSE THE CONNECTION BREAKS
story_info_list = []
for url in story_links:
    story_info_list.append(get_all_article_info(url))
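
Because this loop runs for a long time and a dropped connection loses everything collected so far, one option is to write each row to disk as soon as it is parsed and to keep a list of the URLs that failed. The sketch below shows the idea with a stand-in fetch function so it runs without network access; `collect_rows` and `fake_fetch` are hypothetical names, and in a real run `fetch_info` would be `get_all_article_info`.

```python
import csv
import io

COLUMNS = ["author", "author_bio", "title", "date", "publisher", "text"]

def collect_rows(urls, fetch_info, out_file):
    """Write one csv row per url as soon as it is parsed; return the urls that failed."""
    writer = csv.writer(out_file)
    writer.writerow(COLUMNS)
    failed = []
    for url in urls:
        try:
            writer.writerow(fetch_info(url))
        except Exception as error:  # broad on purpose: one bad article should not stop the run
            print(f"skipping {url}: {error}")
            failed.append(url)
    return failed

# Stand-in for get_all_article_info (made-up data, no network access)
def fake_fetch(url):
    if "bad" in url:
        raise ValueError("could not parse")
    return ["Some Author", None, "Some Title", "2018-01-02 00:00:00", "medium", "text"]

buffer = io.StringIO()  # a real run would pass an open file, e.g. open(path, "w", newline="")
failed_urls = collect_rows(["https://example.com/ok", "https://example.com/bad"], fake_fetch, buffer)
print(failed_urls)
```

The failed URLs can then be retried later without repeating the requests that already succeeded.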

In [18]:
column_names = ["author", "author_bio", "title", "date", "publisher", "text"]
Codingbootcamp_info = pd.DataFrame(story_info_list, columns = column_names)

In [19]:
Codingbootcamp_info

| | author | author_bio | title | date | publisher | text |
|---|---|---|---|---|---|---|
| 0 | Chris Barfod | Creative programmer, Dvorak keyboard user, alw... | Hello, World! – Chris Barfod – Medium | 2018-03-07 05:47:01 | medium | Hello, World! Come with me on a journey throug... |
| 1 | Rithm School | 18 week program on #JavaScript #ReactJS and #P... | What’s New in Python 3.7: Data Classes – Rithm... | 2018-07-11 22:57:17 | medium | What’s New in Python 3. 7: Data Classes In Jun... |
| 2 | Dex Mills | dexmills.com | React – Dex Mills – Medium | 2017-05-09 20:31:28 | medium | React No matter what anyone tells you, faceboo... |
| 3 | KeepCoding | We create the best learning experience for Ful... | Which Programming Language Should A Beginner L... | 2017-08-10 14:42:12 | medium | Which Programming Language Should A Beginner L... |
| 4 | Holly Valenty | Full-Stack Developer and Tech Education Enthus... | Failing forward. – HollsMarie – Medium | 2018-03-05 04:59:50 | medium | Failing forward. First things first. I earned ... |
| 5 | Etienne Mustow | Spirit and Soul. | Makers Academy PreCourse — Week 1 – Etienne Mu... | 2017-08-07 11:49:31 | medium | Makers Academy Pre Course — Week 1 After hours... |
| 6 | Emily Deans | Washington, D.C.-based web developer. Former c... | Playing Around with the iTunes API – Emily Dea... | 2017-07-22 19:28:02 | medium | Playing Around with the i Tunes A P I Happy Sa... |
| 7 | Azharie Muhammad | I'm Fullstack Web Developer | Setelah Setahun Belajar Coding – Azharie Muham... | 2018-08-05 08:50:03 | medium | Setelah Setahun Belajar Coding1 Agustus kemari... |
| 8 | Yasin Shuman | Blockchain Engineer with a passion for learnin... | Again and again, I am blown away by Austen and... | 2018-08-02 19:51:24 | medium | Again and again, I am blown away by Austen and... |
| 9 | Chance Taken | Facilitator @ Chingu. Apply here: chingu.io | Overheard in Chingu – Chingu – Medium | 2018-04-02 13:56:59 | medium | Overheard in Chingu Accomplishments from teams... |
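
The `text` column above shows two side effects of `clean_text`: `3.7` becomes `3. 7` and `iTunes API` becomes `i Tunes A P I`, because a space is added after every period and in front of every capital letter. A gentler variant is sketched below. It is an assumption about what might be wanted, not the cleaning that produced this dataset, and `clean_text_gentle` is a hypothetical name.

```python
import re

def clean_text_gentle(text):
    """Like clean_text, but leaves decimals and all-caps acronyms alone."""
    if not text:
        return text
    cleaned = text.replace("\xa0", " ").replace("\u200a", " ")
    # Add a space after . or , only when a letter follows, so "3.7" is untouched
    cleaned = re.sub(r"(?<=[.,])(?=[A-Za-z])", " ", cleaned)
    # Split only where a lowercase letter is directly followed by a capital, so "API" stays whole
    cleaned = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", cleaned)
    return cleaned.replace("  ", " ").strip()

print(clean_text_gentle("Python 3.7 is out.Data Classes in the PreCourse API"))
# Python 3.7 is out. Data Classes in the Pre Course API
```

This variant still splits names such as `iTunes` into `i Tunes`, since the lowercase-to-capital boundary cannot distinguish a brand name from two words that were fused together.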


In [20]:
# Convert to csv so I don't have to run the collection function again - it takes FOREVER
# index=False so it doesn't write the index (which we don't use, but sometimes you want to) to the csv file
Codingbootcamp_info.to_csv("codingbootcamp_articles_info.csv", index=False)