# Blog Text Analysis

The goals of this IPython notebook are to:

- Open each corpus (saved to file by a previous script)
- Analyze each corpus for parts of speech
- Analyze each corpus for topics (important for matching)
- Save data about each blog in a deidentified way (using the blog ID number)

## Part 1: Setup

First, we load the modules we need.

In [7]:
import nltk
import pandas as pd
import os
import re
from collections import Counter

## Part 2: Define Functions

`analyzeTextForPOS` tokenizes the text we've scraped and counts parts of speech, using NLTK's default part-of-speech tagger, which labels each token with a Penn Treebank tag.

In [8]:
def analyzeTextForPOS(blogstring):
    # Tokenize the lowercased text
    tokens = nltk.word_tokenize(blogstring.lower())
    # get parts of speech for each token (pos_tag accepts a plain list of tokens)
    tags = nltk.pos_tag(tokens)
    # count how many times each pos is used
    counts = Counter(tag for word, tag in tags)
    # note that the POS abbreviations can be understood here:
    # https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    # return the counts as a dictionary
    return dict(counts)
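
To make the counting step concrete without needing the NLTK models, here is a minimal sketch that applies the same `Counter` expression to a few hand-written `(word, tag)` pairs. The pairs are made up for illustration and are not real tagger output.

```python
from collections import Counter

# made-up (word, tag) pairs standing in for the output of nltk.pos_tag
tags = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
        ("on", "IN"), ("the", "DT"), ("mat", "NN"), (".", ".")]

# same counting expression as in analyzeTextForPOS
counts = dict(Counter(tag for word, tag in tags))
print(counts)
```

A tag that never occurs in a post has no key in the returned dictionary, which is why posts can end up with different sets of keys.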

## Part 3: Get Corpora

We'll iterate through the corpora saved on disk (for the sake of subject privacy, these are not included in the GitHub repo for this project). Each blog has its own directory, which stores one .txt file per blog post.


In [10]:
asd_directory = "../confidential/corpora/ASD/"
controls_directory = "../confidential/corpora/controls/"

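# each immediate subdirectory is one blog; os.walk would also collect nested
# directories, which could not then be found directly under the corpus root
asd_blog_dirs = [d for d in os.listdir(asd_directory)
                 if os.path.isdir(os.path.join(asd_directory, d))]
controls_blog_dirs = [d for d in os.listdir(controls_directory)
                      if os.path.isdir(os.path.join(controls_directory, d))]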

In [11]:
# todo:
# get rid of the ones that end in ?share= ....txt or _#comments.txt
# or keep them from being collected to begin with!
asdPartOfSpeechDicts = []
controlsPartOfSpeechDicts = []
for blog_dir in asd_blog_dirs:
    for filename in os.listdir(os.path.join(asd_directory, blog_dir)):
        # only read the saved blog posts (.txt files)
        if filename.endswith(".txt"):
            path = os.path.join(asd_directory, blog_dir, filename)
            # the with block closes the file once it has been read
            with open(path, "r") as f:
                text = f.read()
            asdPartOfSpeechDicts.append(analyzeTextForPOS(text))
for blog_dir in controls_blog_dirs:
    for filename in os.listdir(os.path.join(controls_directory, blog_dir)):
        # only read the saved blog posts (.txt files)
        if filename.endswith(".txt"):
            path = os.path.join(controls_directory, blog_dir, filename)
            with open(path, "r") as f:
                text = f.read()
            controlsPartOfSpeechDicts.append(analyzeTextForPOS(text))
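
The todo in the cell above could be handled with a small filename filter. The following is only a sketch: `is_post_file` is a hypothetical helper that is not part of this notebook, the sample filenames are made up, and it assumes that share links keep the literal text `?share=` in the saved filename and that comment pages end in exactly `_#comments.txt`.

```python
def is_post_file(filename):
    """Return True for .txt files that look like blog posts.

    Hypothetical helper, not part of the original notebook.
    """
    if not filename.endswith(".txt"):
        return False
    # assumption: share links keep the literal "?share=" in the saved filename
    if "?share=" in filename:
        return False
    # assumption: comment pages are saved with exactly this suffix
    if filename.endswith("_#comments.txt"):
        return False
    return True


# made-up filenames, for illustration only
names = ["first-post.txt", "first-post?share=email.txt",
         "first-post_#comments.txt", "notes.md"]
kept = [name for name in names if is_post_file(name)]
print(kept)
```

A check like this could stand in for the plain `.txt` test in the loops, though filtering at scrape time, as the todo suggests, would keep these files off disk altogether.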

In [12]:
print(pd.DataFrame(controlsPartOfSpeechDicts))

       #    $   ''     (     )     ,   .    :  CC    CD ...    VBD   VBG  \
0    NaN  NaN  NaN   NaN   NaN   NaN   5  NaN   1   NaN ...    NaN   2.0   
1    NaN  NaN  NaN   3.0   3.0   6.0  16  NaN   7   1.0 ...   13.0   2.0   
2    NaN  NaN  NaN   3.0   3.0   9.0  16  NaN  18   5.0 ...    9.0   8.0   
3    NaN  1.0  NaN  10.0  10.0  10.0  26  NaN  19  17.0 ...   21.0  19.0   
4    NaN  NaN  NaN   NaN   NaN   6.0  14  NaN   8   2.0 ...    4.0   8.0   
5    NaN  NaN  NaN   7.0   7.0  23.0  26  NaN  32   8.0 ...   72.0  16.0   
6    NaN  NaN  NaN   3.0   3.0   6.0  20  NaN  12   3.0 ...   28.0  10.0   
7    NaN  NaN  NaN   3.0   3.0   5.0  10  NaN   8   2.0 ...   11.0   5.0   
8    NaN  NaN  NaN   4.0   4.0   5.0  15  NaN  13   5.0 ...   23.0   6.0   
9    NaN  NaN  NaN   5.0   5.0   4.0  15  NaN  21   NaN ...   23.0  10.0   
10   NaN  NaN  NaN   3.0   3.0  10.0  17  NaN  28   6.0 ...   13.0   8.0   
11   NaN  NaN  NaN   NaN   NaN   3.0  11  NaN  11   NaN ...    8.0   4.0   
12   1.0  Na