# Blog Scraping

## Part 1: Setup

First, we load the modules we need.

In [1]:
from bs4 import BeautifulSoup
import urllib
import sys
from collections import Counter
import nltk
import numpy as np
import pandas as pd
from scipy.stats.mstats import zscore
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

Web scraping requires us to respect the intellectual property of the owners of the sites we scrape and to limit the load we place on their servers.

We crawl sites using known URL structures, such as a blog's monthly archive pages. We do not attempt full-fledged spidering of sites, and we do not follow links.
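As a rough illustration of what crawling by a known URL structure means, the sketch below generates year/month archive addresses of the kind the scraping function in Part 2 requests. The helper name `monthly_archive_urls`, its parameters, and the example.wordpress.com address are hypothetical, and the base URL is assumed to end in a slash.

```python
def monthly_archive_urls(base_url, start_year, start_month, stop_year):
    """Yield base_url + 'YYYY/MM', walking backwards from
    (start_year, start_month) to January of stop_year.

    Hypothetical helper: base_url is assumed to end with '/'.
    """
    year, month = start_year, start_month
    while year >= stop_year:
        yield "%s%04d/%02d" % (base_url.strip(), year, month)
        if month == 1:
            # step from January back to December of the previous year
            year, month = year - 1, 12
        else:
            month -= 1


# Placeholder address, not one of the blogs used in this analysis.
urls = list(monthly_archive_urls("https://example.wordpress.com/", 2018, 2, 2017))
print(urls[:3])
```

Because every address is computed up front, the crawl touches only a fixed, predictable set of pages per blog and never needs to discover links on the pages it fetches.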

## Part 2: Define Functions

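The getHTML function fetches the full HTML of a single web page. It is platform independent (it works for Blogspot, WordPress, etc.) and returns 404 if the request fails or the server responds with any status other than 200.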

In [3]:
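def getHTML(url):
    # Fetch a page; return its raw HTML, or the sentinel value 404 on any failure.
    try:
        f = urllib.urlopen(url)
    except Exception:
        print("ERROR in " + url + " with exception " + str(sys.exc_info()[0]))
        return 404
    # Treat any non-200 response as a missing page.
    if f.getcode() != 200:
        return 404
    return f.read()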
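The cell above relies on urllib.urlopen, which exists only in Python 2. For readers on Python 3, a minimal sketch of an equivalent is given below. The name `get_html_py3`, the `opener` parameter (added only so the function can be exercised without network access), and the 10-second timeout are assumptions, not part of the original notebook.

```python
from urllib.request import urlopen
from urllib.error import URLError


def get_html_py3(url, opener=urlopen, timeout=10):
    """Return the raw bytes of a page, or the sentinel 404 on any failure.

    Hypothetical Python 3 counterpart of getHTML. In Python 3, urlopen
    raises HTTPError (a subclass of URLError) for error statuses such as
    404, so those cases are handled by the except clause.
    """
    try:
        response = opener(url, timeout=timeout)
    except (URLError, ValueError, OSError) as exc:
        print("ERROR in " + url + " with exception " + type(exc).__name__)
        return 404
    # Keep the original check so any other non-200 response is also rejected.
    if response.getcode() != 200:
        return 404
    return response.read()
```

Note that this version returns bytes, so a caller would pass the result to BeautifulSoup directly or decode it first.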

Many bloggers use the WordPress platform. WordPress structures its HTML and its blog entries in a consistent way, so we use a pair of functions: one parses the posts on a single page, and the other walks the blog's monthly archive pages to collect all posts in a given time frame.

In [4]:
def parseWordpressSite(htmltext):
    soup = BeautifulSoup(htmltext, "lxml")

    # remove any .feedback paragraphs: they sit inside our .post divs,
    # and we do not want text such as "Comments" to be included
    for tag in soup.find_all("p", {'class':'feedback'}):
        tag.decompose()

    # remove any .storydate elements: they also sit inside our .post divs,
    # and we do not want text such as "March" to be included
    for tag in soup.find_all(class_ = "storydate"):
        tag.decompose()

    # get only .post divs and .entry-content divs.
    posthtml = soup.find_all("div", class_="post") + soup.find_all("div", class_="entry-content")
    posttext = ""
    for post in posthtml:
        posttext += post.getText()
    return(posttext)
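To make the effect of the parsing steps concrete, here is a small sketch that applies the same sequence (drop .feedback, drop .storydate, then collect .post and .entry-content text) to a made-up page. The sample HTML is invented for illustration, and the built-in "html.parser" is used in place of "lxml" only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Invented sample page; real WordPress themes vary in their markup.
sample_html = """
<html><body>
<div class="post">
  <p class="storydate">March 3, 2015</p>
  <p>We went to the park today.</p>
  <p class="feedback">2 Comments</p>
</div>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Strip the elements whose text should not count as post content.
for tag in soup.find_all("p", {"class": "feedback"}):
    tag.decompose()
for tag in soup.find_all(class_="storydate"):
    tag.decompose()

# Collect the remaining text of every post container.
posts = soup.find_all("div", class_="post") + soup.find_all("div", class_="entry-content")
text = " ".join(post.get_text() for post in posts)

# Collapse runs of whitespace so the result is easy to inspect.
text = " ".join(text.split())
print(text)
```

Only the body sentence survives; the date line and the comment count are removed before the text is extracted.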

In [6]:
def scrapeWordpressBlog(blog):
    year = 2018
    month = 2
    blogstring = ""
    # walk backwards one month at a time, from February 2018 to January 2011,
    # and combine all of this blogger's post text into a single string
    while year > 2010:
        # monthly archive URL of the form <blog>YYYY/MM (assumes blog ends in "/")
        page = blog.strip() + "%04d/%02d" % (year, month)
        if month == 1:
            month = 12
            year = year - 1
        else:
            month = month - 1
        htmltext = getHTML(page)
        if htmltext == 404:
            continue     # go to the next month.
        poststring = parseWordpressSite(htmltext).encode("utf8")
        blogstring = blogstring + " " + poststring
    # remove and replace smart quotes, unreadable characters, new line chars, etc.
    # curly double quotes -> straight single quote
    blogstring = blogstring.replace("\xe2\x80\x9c", "'").replace("\xe2\x80\x9d", "'")
    # figure dash, en dash, em dash -> space
    blogstring = blogstring.replace('\xe2\x80\x92', " ").replace('\xe2\x80\x93', " ").replace('\xe2\x80\x94', " ")
    # curly single quotes -> straight single quote
    blogstring = blogstring.replace("\xe2\x80\x98", "'").replace('\xe2\x80\x99', "'")
    # newlines, tabs, non-breaking spaces -> space
    blogstring = blogstring.replace('\n', " ").replace('\t', " ").replace('\xc2\xa0'," ")
    # "\'" and "'" are the same string in Python, so this line has no effect
    blogstring = blogstring.replace("\'", "'")
    return blogstring
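The cleanup at the end of the function works on UTF-8 encoded byte sequences, which is a Python 2 idiom. The same substitutions can be expressed on decoded text using Unicode code points, as in the sketch below. The names `REPLACEMENTS` and `clean_text` are hypothetical; the code points are the decoded forms of the byte sequences used in the function.

```python
# Each key is the Unicode character behind one of the byte sequences above.
REPLACEMENTS = {
    u"\u201c": u"'",  # left curly double quote  (\xe2\x80\x9c)
    u"\u201d": u"'",  # right curly double quote (\xe2\x80\x9d)
    u"\u2018": u"'",  # left curly single quote  (\xe2\x80\x98)
    u"\u2019": u"'",  # right curly single quote (\xe2\x80\x99)
    u"\u2012": u" ",  # figure dash              (\xe2\x80\x92)
    u"\u2013": u" ",  # en dash                  (\xe2\x80\x93)
    u"\u2014": u" ",  # em dash                  (\xe2\x80\x94)
    u"\u00a0": u" ",  # non-breaking space       (\xc2\xa0)
    u"\n": u" ",
    u"\t": u" ",
}


def clean_text(text):
    """Apply every substitution in REPLACEMENTS to a decoded text string."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text


print(clean_text(u"\u201cHi\u201d\u2014it\u2019s\nfine"))
```

As in the original, curly double quotes are mapped to a straight single quote rather than a straight double quote, so all quotation marks end up as the same character.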

## Part 3: Obtain list of blogs

Note that the actual blogs used are confidential, to preserve the privacy of the bloggers. Here we read them from a delimited text file that lists the main URL of each blog.

In [None]:
autismBlogList = pd.read_table("ASD_wordpress_bloggers.txt", squeeze=True).tolist()
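The `squeeze=True` keyword used above was removed from `read_table` in pandas 2.0. A minimal sketch of the same step on a newer pandas, using the `DataFrame.squeeze` method instead, could look like this. The in-memory sample file, its `blog_url` header, and the example addresses are invented stand-ins for the confidential list; like the original call, the sketch treats the first line of the file as a header.

```python
import io

import pandas as pd

# Hypothetical stand-in for the confidential file:
# a header line followed by one blog URL per line.
sample_file = io.StringIO(
    "blog_url\n"
    "https://first.example.com/\n"
    "https://second.example.com/\n"
)

# Read the single-column table, then collapse it to a Series and a plain list.
blog_table = pd.read_table(sample_file)
blog_list = blog_table.squeeze("columns").tolist()
print(blog_list)
```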