# Blog Scraping

The goal of this script is to perform a number of tasks for each blog:

- Crawl the site's monthly archive pages to find all in-site urls that point to the site's blog posts (using a regular expression to define what a blog post url looks like, as opposed to a listing of posts, an index page, etc.)
- Get all unique blog text urls
- Go to each url and extract the text, adding it to a corpus of the blog texts for that blog
- Save corpora to disk, so that we only have to crawl once and can analyze over and over without hitting the server

After we set things up, we will demonstrate on a sample blog (so that you can see how the blog list should be set up) and also run the blog scraper on our (confidential) list of subject blogs.


## Part 1: Setup

First, we load modules we need.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import re
import sys
import pathlib

It's also important to state our assumptions about the data this script needs.

This script depends on a blog list, which will be a data frame with two columns:
* blog identifier (something like "1", "TEST-1", "Autism-1", etc.)
* https-based blog url, ending in `/` (e.g. "https://my.fake.blog.org/")
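
The expected format above can be checked before any crawling starts. The following is a minimal sketch, not part of the original notebook: the helper name `validate_blog_list` and the column names in the example are assumptions made for illustration.

```python
import pandas as pd

def validate_blog_list(blog_list):
    """Return a list of problems found in a blog list (empty if none)."""
    problems = []
    if blog_list.shape[1] != 2:
        problems.append("expected 2 columns, found %d" % blog_list.shape[1])
        return problems
    # Each row is (blog identifier, blog url), in that order.
    for identifier, url in blog_list.itertuples(index=False, name=None):
        url = str(url)
        if not url.startswith("https://"):
            problems.append("%s: url is not https-based" % identifier)
        if not url.endswith("/"):
            problems.append("%s: url does not end in '/'" % identifier)
    return problems

# Hypothetical blog list: the second row breaks both url rules.
example = pd.DataFrame({
    "blog_identifier": ["TEST-1", "TEST-2"],
    "blog_url": ["https://my.fake.blog.org/", "http://another.fake.blog.org"],
})
print(validate_blog_list(example))
```

The trailing `/` matters because the crawling code builds archive urls by appending `year/month` directly to the base url.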
    
This script will also write files to disk under `../confidential/corpora/`, so if you run it, you will need to run it in a directory that has a sibling directory (a directory at the same level) called "confidential", with a directory under that called "corpora".
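
Because the code below writes to `../confidential/corpora/`, the folders can be created ahead of time. This is a small sketch under that assumption; the helper name `ensure_corpora_dir` and the "notebooks" working directory are hypothetical, and the demonstration runs in a temporary directory so that nothing real is touched.

```python
import pathlib
import tempfile

def ensure_corpora_dir(working_dir):
    """Create <parent of working_dir>/confidential/corpora if it is missing."""
    corpora = pathlib.Path(working_dir).resolve().parent / "confidential" / "corpora"
    corpora.mkdir(parents=True, exist_ok=True)
    return corpora

# Demonstrate inside a throwaway directory so no real folders are touched.
with tempfile.TemporaryDirectory() as tmp:
    notebook_dir = pathlib.Path(tmp) / "notebooks"  # hypothetical working directory
    notebook_dir.mkdir()
    corpora = ensure_corpora_dir(notebook_dir)
    created = corpora.is_dir()

print(created)  # True
```

In the notebook itself, `ensure_corpora_dir(".")` would create the folders relative to the current working directory.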

Web scraping requires us to be respectful of the intellectual property of and server impact to the owners of the site we're scraping.

We will crawl sites based on known sitemap architectures. We do not attempt a full-fledged spidering of sites and do not follow links.
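
The functions in this notebook issue their requests back to back. One possible way to further limit server impact is to pause between requests. The sketch below is an assumption layered on top of the notebook, not something the original code does; `fetch_politely` and `fake_fetch` are hypothetical names, and the stand-in fetcher lets the example run without network access.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each url, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        results[url] = fetch(url)
    return results

# Stand-in fetcher so the sketch runs without touching a real server.
def fake_fetch(url):
    return "<html>" + url + "</html>"

pages = fetch_politely(
    ["https://my.fake.blog.org/2018/03", "https://my.fake.blog.org/2018/02"],
    fake_fetch,
    delay_seconds=0.1,
)
print(len(pages))  # 2
```

In real use, the notebook's `getHTML` function could be passed as the `fetch` argument.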

## Part 2: Define Functions

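The `getHTML` function simply gets an entire webpage and returns its HTML as text. If the request fails or the server responds with anything other than status 200, it returns the number `404` instead, so callers must check for that value. This is platform independent (will work for Blogspot, Wordpress, etc.)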

In [2]:
def getHTML(url):
    # Return the page's HTML as text, or the sentinel value 404 on any failure.
    try:
        r = requests.get(url, timeout=30)
    except requests.exceptions.RequestException as e:
        print("ERROR in " + url + " with exception " + type(e).__name__)
        return 404
    if r.status_code != 200:
        return 404
    return r.text

`getWordpressLinks` takes a base url (like "https://my.fake.blog.org/") as a parameter and uses what we know about Wordpress sites: blog posts are located under the `[site]/year/month/day/` path, and links to the blog posts for a given month can be found at `[site]/year/month`. It walks the monthly archive pages from March 2018 back to January 2014 and collects every link that looks like a blog post. Links are then vetted according to the following rules:

* They must have the form `[site]/year/month/day/[post]`
* They must not point to a comment anchor or a share link (`#comment`, `#respond`, `?share=`)
* They must not be double-counted: we strip duplicates by collecting links in a `set()`
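
The vetting rules can be tried on a few hand-written urls before running the crawler. This sketch reuses the two regular expressions from `getWordpressLinks` (which also rejects comment anchors and share links); the candidate urls are made up for illustration.

```python
import re

site = "https://my.fake.blog.org/"
pattern = re.compile(re.escape(site) + r"\d{4}/\d{2}/\d{2}/.+")
avoid = re.compile(r"#comment|#respond|\?share=")

# Made-up candidate links, as they might appear on an archive page.
candidates = [
    "https://my.fake.blog.org/2017/05/09/a-post/",
    "https://my.fake.blog.org/2017/05/09/a-post/",               # duplicate
    "https://my.fake.blog.org/2017/05/09/a-post/#respond",       # comment anchor
    "https://my.fake.blog.org/2017/05/09/a-post/?share=twitter", # share link
    "https://my.fake.blog.org/2017/05/",                         # monthly archive page
    "https://my.fake.blog.org/about/",                           # not a dated post
]

# Keep links that match the post pattern and contain nothing from the avoid list.
links = {url for url in candidates if pattern.match(url) and not avoid.search(url)}
print(sorted(links))
# ['https://my.fake.blog.org/2017/05/09/a-post/']
```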

In [3]:
def getWordpressLinks(site):
    site = site.strip()
    pattern = re.compile(re.escape(site) + r"\d{4}/\d{2}/\d{2}/.+")
    avoid = r"#comment|#respond|\?share="
    links = set()  # makes entries unique
    year = 2018  # most recent year to crawl
    month = 3  # most recent month to crawl
    # walk the monthly archive pages backwards, from March 2018 to January 2014
    while year >= 2014:
        page = site + "%04d/%02d" % (year, month)
        if month == 1:
            month = 12
            year = year - 1
        else:
            month = month - 1
        html_content = getHTML(page)
        if html_content != 404:
            soup = BeautifulSoup(html_content, 'lxml')
            for a in soup.find_all('a', href=True):
                possible_link = a.get('href')
                m = re.match(pattern, possible_link)
                unwanted = re.search(avoid, possible_link)
                if m and unwanted is None:
                    links.add(possible_link)
    return links
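
The month-by-month walk inside `getWordpressLinks` can be isolated to see exactly which archive pages get requested. This is a sketch only: the generator name `archive_pages` and its parameters are hypothetical, with defaults chosen to match the start and stop points hard-coded above.

```python
def archive_pages(site, start=(2018, 3), stop_year=2014):
    """Yield monthly archive urls from `start` back to January of `stop_year`."""
    year, month = start
    while year >= stop_year:
        yield "%s%04d/%02d" % (site.strip(), year, month)
        if month == 1:
            year, month = year - 1, 12
        else:
            month -= 1

pages = list(archive_pages("https://my.fake.blog.org/"))
print(pages[0])    # https://my.fake.blog.org/2018/03
print(pages[-1])   # https://my.fake.blog.org/2014/01
print(len(pages))  # 51
```

Each blog therefore costs 51 archive-page requests, plus one request per post found.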

The `getLinks` function takes a blog list (see above for the expected format) and returns a data frame with one row for every link that looks like a blog post, alongside the identifier of the blog it came from. It uses the `getWordpressLinks` function.

In [4]:
def getLinks(blogList):
    links = pd.DataFrame(columns=["blog_identifier", "links"])
    for index, site in blogList.iterrows():
        # column 0 is the blog identifier, column 1 is the blog url
        new = pd.DataFrame({"blog_identifier": site.iloc[0], "links": list(getWordpressLinks(site.iloc[1]))})
        links = pd.concat([links, new])
    return links

The `groupLinks` function takes a blog list, calls `getLinks` to build a data frame with many rows (each holding a single blog post link and the blog identifier it came from), and returns a shorter pandas Series indexed by blog identifier, where each entry is a list of all blog post links from that blog.

In [5]:
def groupLinks(blogList):
    grouped = getLinks(blogList).groupby("blog_identifier")["links"].apply(list)
    return grouped
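
The grouping step can be seen on a toy data frame without any crawling. This sketch uses made-up identifiers and urls in the same two-column layout that `getLinks` produces.

```python
import pandas as pd

# Made-up links in the layout produced by getLinks: one row per post.
links = pd.DataFrame({
    "blog_identifier": ["A", "A", "B"],
    "links": [
        "https://a.example/2017/01/02/x/",
        "https://a.example/2017/02/03/y/",
        "https://b.example/2016/05/06/z/",
    ],
})

# One entry per blog, each holding the list of that blog's post links.
grouped = links.groupby("blog_identifier")["links"].apply(list)
print(len(grouped))   # 2
print(grouped["A"])
```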

Many bloggers use the Wordpress platform for blogging. Wordpress structures its html and its blog entries in a predictable way, so we use a pair of functions: `getWordpressLinks` (above) finds all posts in a given time frame, and `parseWordpressSite` (below) parses an individual post, stripping comment links, scripts, ads, sharing widgets, and dates before extracting the text.

In [6]:
def parseWordpressSite(htmltext):
    soup = BeautifulSoup(htmltext, "lxml")
    
    # remove any .feedback paragraphs (they're within our .post divs, so we take them out so that "Comments", e.g., won't be included)
    for p in soup.find_all("p", {'class':'feedback'}): 
        p.decompose()

    for script in soup.find_all ("script"):
        script.decompose()
        
    # remove ads
    for div in soup.find_all("div", {'class': 'wpa'}):
        div.decompose()
        
    # remove sharing links
    for div in soup.find_all("div", {'class': 'sharedaddy'}):
        div.decompose()    
    
    # remove any .storydate (they're within our .post divs so we want them out, so that "March", e.g., won't be included)
    for div in soup.find_all(class_ = "storydate"): 
        div.decompose()

    # get only .post divs and .entry-content divs.
    posthtml = soup.find_all("div", class_="post") + soup.find_all("div", class_="entry-content")
    posttext = ""
    for post in posthtml:
        posttext += post.getText()
    return(posttext)
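
The same cleanup idea can be tried on a tiny hand-written page. This is a reduced sketch of what `parseWordpressSite` does, not a replacement for it: the html is made up, only two of the removal steps are shown, and it uses Python's built-in `html.parser` instead of `lxml` so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up page with one post body, a sharing widget, a script, and a sidebar.
html = """
<html><body>
<div class="entry-content">
<p>Hello readers.</p>
<div class="sharedaddy">Share this:</div>
<script>track();</script>
</div>
<div class="sidebar">Recent posts</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("script"):
    tag.decompose()  # drop scripts
for tag in soup.find_all("div", class_="sharedaddy"):
    tag.decompose()  # drop sharing links

# Keep only the text inside .entry-content divs.
text = "".join(div.get_text() for div in soup.find_all("div", class_="entry-content"))
print(text.split())  # ['Hello', 'readers.']
```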

`writeCorporaToDisk` takes a blog list and the name of a parent directory, groups the blog post links by blog, scrapes every post, and saves the text of each post as its own file. Files go in a directory named after the blog identifier, and each file name is derived from the post's url.

In [10]:
def writeCorporaToDisk(blogList, parent_dir):
    for blog_identifier, post_list in groupLinks(blogList).items():
        print("analyzing blog " + str(blog_identifier))
        directory = pathlib.Path('../confidential/corpora') / parent_dir / ("blog_" + str(blog_identifier))
        directory.mkdir(parents=True, exist_ok=True)
        for blog_url in post_list:
            filename = re.sub(r"[:./]+", '_', str(blog_url))
            htmltext = getHTML(blog_url)
            if htmltext == 404:
                continue  # skip posts that could not be fetched
            poststring = parseWordpressSite(htmltext)
            # replace smart quotes, dashes, new line chars, non-breaking spaces, etc.
            poststring = poststring.replace("\u201c", "'").replace("\u201d", "'")
            poststring = poststring.replace("\u2012", " ").replace("\u2013", " ").replace("\u2014", " ")
            poststring = poststring.replace("\u2018", "'").replace("\u2019", "'")
            poststring = poststring.replace('\n', " ").replace('\t', " ").replace('\xa0', " ")
            # write this post's text to its own file
            with open(directory / (filename + ".txt"), "w", encoding="utf-8") as text_file:
                text_file.write(poststring)
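
The file naming and text cleanup steps can be checked on their own, without fetching anything. This sketch assumes the scraped text is a Python 3 `str`, where each smart quote or dash is a single Unicode character; the helper names `post_filename` and `normalize_text` are hypothetical.

```python
import re

def post_filename(url):
    """Turn a post url into a flat file name, as writeCorporaToDisk does."""
    return re.sub(r"[:./]+", "_", str(url)) + ".txt"

def normalize_text(text):
    """Replace smart quotes, dashes, and whitespace characters with plain ones."""
    replacements = {
        "\u201c": "'", "\u201d": "'",                 # curly double quotes
        "\u2018": "'", "\u2019": "'",                 # curly single quotes
        "\u2012": " ", "\u2013": " ", "\u2014": " ",  # figure, en, and em dashes
        "\n": " ", "\t": " ", "\xa0": " ",            # newline, tab, non-breaking space
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

print(post_filename("https://my.fake.blog.org/2017/05/09/a-post/"))
# https_my_fake_blog_org_2017_05_09_a-post_.txt
print(repr(normalize_text("\u201cHi\u201d\u2014it\u2019s\xa0me\n")))
# "'Hi' it's me "
```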

## Part 3: Obtain list of blogs

Note that the actual blogs used are confidential, to preserve the privacy of the bloggers, so the blog lists are not included in the GitHub repository for this project. We pull here from CSV files with a header row, in the blog list format described in Part 1 (a blog identifier and the main url of each blog). This will *not* work for you unless you create your own `ASD_wordpress_bloggers.csv` and `control_wordpress_bloggers.csv` files in the "confidential" directory.

In [11]:
autismWordpressBlogList = pd.read_csv("../confidential/ASD_wordpress_bloggers.csv", header=0)
controlWordpressBlogList = pd.read_csv("../confidential/control_wordpress_bloggers.csv", header=0)
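
If you need to create your own blog list file, the layout might look like the following. This is a sketch under assumptions: the column names and urls are invented for illustration, and the text is read from memory with `io.StringIO` rather than from a file on disk.

```python
import io
import pandas as pd

# Hypothetical file contents: a header row, then one blog per line.
csv_text = """blog_identifier,blog_url
TEST-1,https://my.fake.blog.org/
TEST-2,https://another.fake.blog.org/
"""

blog_list = pd.read_csv(io.StringIO(csv_text), header=0)
print(list(blog_list.columns))  # ['blog_identifier', 'blog_url']
print(blog_list.iloc[0, 1])     # https://my.fake.blog.org/
```

The functions in Part 2 read the columns by position, so the identifier must come first and the url second.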

## Part 4: Obtain blog texts

In [12]:
writeCorporaToDisk(autismWordpressBlogList, "ASD")
writeCorporaToDisk(controlWordpressBlogList, "controls")

analyzing blog 7
analyzing blog 2
analyzing blog 3
analyzing blog 4
analyzing blog 5
analyzing blog 6


## Part 5: Demonstrate How This Works

We are keeping the blogs themselves secret, so how can you know that this works?

Make sure you run this notebook in a directory that has a sibling directory (a directory at the same level) called "confidential", which in turn contains "corpora". Then you can do the following. NOTE that the blog selected as an example is *not* a blog that is actually used in this research. It is just a sample Wordpress blog, intended to demonstrate how this notebook works and that the script works effectively.

In [13]:
sample_blogs = pd.DataFrame(data = {"blog_identifier":['TEST1'], 
                            "blog_url":["https://en.blog.wordpress.com/"]})

In [14]:
sample_blogs

| | blog_identifier | blog_url |
|---|---|---|
| 0 | TEST1 | https://en.blog.wordpress.com/ |


In [15]:
sampleLinks = getLinks(sample_blogs)

In [16]:
sampleLinks

| | blog_identifier | links |
|---|---|---|
| 0 | TEST1 | https://en.blog.wordpress.com/2014/03/10/monda... |
| 1 | TEST1 | https://en.blog.wordpress.com/2014/01/07/profi... |
| 2 | TEST1 | https://en.blog.wordpress.com/2016/02/17/persp... |
| 3 | TEST1 | https://en.blog.wordpress.com/2014/02/26/twelv... |
| 4 | TEST1 | https://en.blog.wordpress.com/2014/02/11/the-d... |
| 5 | TEST1 | https://en.blog.wordpress.com/2014/10/27/2014-... |
| 6 | TEST1 | https://en.blog.wordpress.com/2014/09/26/early... |
| 7 | TEST1 | https://en.blog.wordpress.com/2014/11/04/us-mi... |
| 8 | TEST1 | https://en.blog.wordpress.com/2016/02/24/amp-f... |
| 9 | TEST1 | https://en.blog.wordpress.com/2015/03/25/press... |


In [17]:
sampleGrouped = groupLinks(sample_blogs)

In [18]:
sampleGrouped

blog_identifier
TEST1    [https://en.blog.wordpress.com/2014/03/10/mond...
Name: links, dtype: object

In [19]:
writeCorporaToDisk(sample_blogs, "sample")

analyzing blog TEST1


Now check your confidential/corpora directory. You should see a new `sample/blog_TEST1` directory that contains the scraped text, with one text file for each post!