# Blog Text Analysis

This IPython notebook has four goals:

- Open each corpus (saved to file by a previous script)
- Analyze each corpus for parts of speech
- Analyze each corpus for topics (important for matching)
- Save data about each blog in a deidentified way (using the blog ID number)

## Part 1: Setup

First, we load the modules we need.

In [1]:
import nltk
import pandas as pd
import os
import re
from collections import Counter

## Part 2: Define Functions

`analyzeTextForPOS` lets us count the parts of speech in the text we've scraped. It uses NLTK's default tagger, which labels each token with a Penn Treebank tag.

In [2]:
def analyzeTextForPOS(blogstring):
    # Tokenize the lowercased text
    # (NLTK's tokenizer and tagger data must already be installed; see nltk.download)
    tokens = nltk.word_tokenize(blogstring.lower())
    # get parts of speech for each token, as (word, tag) pairs
    tags = nltk.pos_tag(tokens)
    # count how many times each pos is used
    counts = Counter(tag for word, tag in tags)
    # note that the POS abbreviations can be understood here:
    # https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    # return the counts as a dictionary
    return dict(counts)
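
To make the counting step concrete, here is a minimal sketch that skips tokenizing and tagging altogether. The `tagged` list is hand-written for illustration and only stands in for what `nltk.pos_tag` returns; it is not real tagger output.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for the output of nltk.pos_tag.
# These tags are illustrative assumptions, not actual tagger output.
tagged = [
    ("the", "DT"),
    ("dog", "NN"),
    ("chased", "VBD"),
    ("the", "DT"),
    ("ball", "NN"),
    (".", "."),
]

# Count how often each tag occurs, ignoring the words themselves.
pos_counts = dict(Counter(tag for _, tag in tagged))
print(pos_counts)
```

Punctuation gets its own tags, which is why columns such as `.` and `:` show up alongside `VBD` or `NN` in the per-blog counts.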

We also want to add the word count, and possibly other summary statistics that could be useful for matching.

## Part 3: Access Corpora

We'll iterate through the saved corpora on disk (for the sake of subject privacy, these are not included in the GitHub repo for this project). Each blog has its own directory, which holds one .txt file per blog post.


In [3]:
asd_directory = "../confidential/corpora/ASD/"
controls_directory = "../confidential/corpora/controls/"

asd_blog_dirs = []
controls_blog_dirs = []
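for root, dirs, files in os.walk(asd_directory):
    asd_blog_dirs += dirs
    break  # stop after the top level: each blog is an immediate subdirectory
for root, dirs, files in os.walk(controls_directory):
    controls_blog_dirs += dirs
    break  # stop after the top level: each blog is an immediate subdirectory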
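
Since each blog is an immediate subdirectory of its corpus folder, only the top level matters here; a fully recursive walk would also collect any folders nested inside a blog directory. A minimal sketch of listing just the top-level directories, using a throwaway temporary layout (the helper `list_blog_dirs` and the `blog_demo_*` names are hypothetical):

```python
import os
import tempfile

def list_blog_dirs(corpus_dir):
    """Return the names of the immediate subdirectories of corpus_dir, sorted."""
    with os.scandir(corpus_dir) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())

# Build a throwaway layout: two blog directories, one with a nested folder.
with tempfile.TemporaryDirectory() as corpus_dir:
    os.makedirs(os.path.join(corpus_dir, "blog_demo_1", "images"))
    os.makedirs(os.path.join(corpus_dir, "blog_demo_2"))
    blog_dirs = list_blog_dirs(corpus_dir)

print(blog_dirs)
```

The nested `images` folder is not returned, so it would never be mistaken for a blog.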

## Part 4: Analyze Gross Text Characteristics

We want to get the word count of each blog post, stored in a way that allows us to aggregate over the blogs, identifying which blog any given blog post comes from.

In [4]:
def getWC(fname):
    num_words = 0
    with open(fname, 'r') as f:
        for line in f:
            words = line.split()
            num_words += len(words)
    return(num_words)
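
The cells that follow count words once per blog, on the consolidated text. If per-post counts were wanted instead, one possible shape is to keep the blog ID next to each count and sum afterwards. This is only a sketch with made-up records; the `posts` list and its IDs are hypothetical.

```python
from collections import defaultdict

# Hypothetical (blog_id, post_text) records; the IDs and texts are made up.
posts = [
    ("blog_demo_1", "First post with five words"),
    ("blog_demo_1", "A second, shorter post"),
    ("blog_demo_2", "Only one post here"),
]

# Per-post word counts, keeping the blog ID alongside each count.
# Words are split on whitespace, as in getWC.
post_counts = [(blog_id, len(text.split())) for blog_id, text in posts]

# Aggregate to one total per blog.
blog_totals = defaultdict(int)
for blog_id, n_words in post_counts:
    blog_totals[blog_id] += n_words

print(dict(blog_totals))
```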

In [5]:
asdPartOfSpeechDicts = []
controlsPartOfSpeechDicts = []
asd_blognames = []
controls_blognames = []
asd_wc = []
controls_wc = []

In [6]:
# get word count, column bind it to the POS data frame
# Restrict to 

In [7]:
for blog_dir in asd_blog_dirs:
    asd_blognames.append(blog_dir)
    textToAnalyze = ""
    for filename in os.listdir(asd_directory + blog_dir):
        # only read the .txt files (one per blog post)
        if filename.endswith(".txt"):
            with open(asd_directory + blog_dir + "/" + filename, "r") as f:
                textToAnalyze = textToAnalyze + " " + f.read()
    # save the consolidated text for this blog, then count its words and POS tags
    file_name = "../confidential/corpora/consolidated_texts/" + "ASD" + '/' + blog_dir + ".txt"
    with open(file_name, "w") as text_file:
        text_file.write(textToAnalyze)
    asd_wc.append(getWC(file_name))
    asdPartOfSpeechDicts.append(analyzeTextForPOS(textToAnalyze))

In [8]:
for blog_dir in controls_blog_dirs:
    controls_blognames.append(blog_dir)
    textToAnalyze = ""
    for filename in os.listdir(controls_directory + blog_dir):
        # only read the .txt files (one per blog post)
        if filename.endswith(".txt"):
            with open(controls_directory + blog_dir + "/" + filename, "r") as f:
                textToAnalyze = textToAnalyze + " " + f.read()
    # save the consolidated text for this blog, then count its words and POS tags
    file_name = "../confidential/corpora/consolidated_texts/" + "controls" + '/' + blog_dir + ".txt"
    with open(file_name, "w") as text_file:
        text_file.write(textToAnalyze)
    controls_wc.append(getWC(file_name))
    controlsPartOfSpeechDicts.append(analyzeTextForPOS(textToAnalyze))
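
The ASD and controls loops do the same work on different directories, so the consolidation step could live in one shared helper. A minimal sketch under that assumption follows; `consolidate_posts` is a hypothetical name, and unlike the loops above it joins posts with single spaces rather than prefixing each post with one.

```python
import os
import tempfile

def consolidate_posts(blog_path):
    """Join the contents of every .txt file in blog_path into one string."""
    parts = []
    # Sort the filenames so the result does not depend on directory order.
    for filename in sorted(os.listdir(blog_path)):
        if filename.endswith(".txt"):
            with open(os.path.join(blog_path, filename), "r") as f:
                parts.append(f.read())
    return " ".join(parts)

# Demonstrate on a throwaway blog directory with two posts and one other file.
with tempfile.TemporaryDirectory() as blog_path:
    demo_files = [
        ("post1.txt", "hello world"),
        ("post2.txt", "second post"),
        ("notes.md", "ignored"),
    ]
    for name, body in demo_files:
        with open(os.path.join(blog_path, name), "w") as f:
            f.write(body)
    combined = consolidate_posts(blog_path)

print(combined)
```

Each loop could then call the helper once per blog directory and pass the result to `analyzeTextForPOS`.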

Take a quick peek at our stats!

In [16]:
basic_asd_data = pd.DataFrame({"blog_name" : asd_blognames, "word_count" : asd_wc})
combined_asd_data = pd.concat([basic_asd_data, pd.DataFrame(asdPartOfSpeechDicts)], axis=1)
combined_asd_data.head()

Unnamed: 0,blog_name,word_count,#,$,'',(,),",",.,:,...,VBD,VBG,VBN,VBP,VBZ,WDT,WP,WP$,WRB,``
0,blog_ASD_1,317961,92.0,86.0,23.0,3356.0,3414.0,21996.0,18656.0,2451.0,...,9140.0,7707.0,6365.0,16055.0,8942.0,1833.0,2336.0,36.0,2127.0,
1,blog_ASD_10,14740,1.0,11.0,3.0,19.0,19.0,677.0,932.0,52.0,...,426.0,329.0,273.0,704.0,390.0,70.0,113.0,1.0,121.0,
2,blog_ASD_11,17571,13.0,2.0,2.0,61.0,63.0,467.0,871.0,9.0,...,501.0,468.0,309.0,925.0,513.0,70.0,77.0,,149.0,
3,blog_ASD_12,1285,,,,2.0,2.0,85.0,68.0,10.0,...,29.0,33.0,27.0,66.0,60.0,11.0,1.0,,7.0,
4,blog_ASD_13,95982,35.0,19.0,7.0,485.0,485.0,4013.0,4647.0,365.0,...,2788.0,2566.0,1906.0,4588.0,3037.0,635.0,654.0,5.0,845.0,


In [17]:
basic_controls_data = pd.DataFrame({"blog_name" : controls_blognames, "word_count" : controls_wc})
combined_controls_data = pd.concat([basic_controls_data, pd.DataFrame(controlsPartOfSpeechDicts)], axis=1)
combined_controls_data.head()

Unnamed: 0,blog_name,word_count,#,$,'',(,),",",.,:,...,VBD,VBG,VBN,VBP,VBZ,WDT,WP,WP$,WRB,``
0,blog_control_10,2464,,1.0,,15.0,15.0,188.0,148.0,79.0,...,31.0,30.0,60.0,41.0,43.0,4.0,1.0,1.0,1.0,
1,blog_control_100,211682,3.0,16.0,13.0,919.0,934.0,6620.0,12027.0,1113.0,...,6194.0,4700.0,4122.0,5789.0,4753.0,878.0,632.0,12.0,1342.0,
2,blog_control_101,192525,18.0,18.0,5.0,786.0,794.0,4167.0,9532.0,581.0,...,6007.0,5034.0,4217.0,5594.0,4834.0,841.0,838.0,11.0,1235.0,
3,blog_control_102,94407,16.0,9.0,6.0,300.0,308.0,2130.0,5657.0,281.0,...,3157.0,2758.0,2322.0,4174.0,2533.0,466.0,770.0,6.0,826.0,
4,blog_control_103,24640,1.0,3.0,2.0,21.0,21.0,955.0,1793.0,153.0,...,1523.0,527.0,589.0,783.0,426.0,103.0,91.0,1.0,143.0,
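
The empty cells in the previews above are missing values. When a tag never occurs in a blog, that blog's dictionary has no entry for it, and `pd.DataFrame` fills the gap with NaN. A small sketch with made-up counts; the numbers, and the choice to treat a missing tag as zero, are assumptions rather than part of the original analysis:

```python
import pandas as pd

# Made-up tag counts for two blogs; the second never uses the "$" tag.
pos_dicts = [
    {"NN": 10, "VBD": 4, "$": 1},
    {"NN": 7, "VBD": 2},
]

pos_frame = pd.DataFrame(pos_dicts)
# The missing "$" entry becomes NaN, which also makes that column float-valued.
print(pos_frame)

# One option: treat an absent tag as a count of zero and restore integer dtype.
filled = pos_frame.fillna(0).astype(int)
print(filled)
```

Whether zero is the right fill depends on the later analysis, so the notebook's own data frames are left as they are.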


Eliminate blogs that have fewer than 5000 total words.

In [18]:
final_asd_data = combined_asd_data[(combined_asd_data['word_count'] >= 5000)]
final_controls_data = combined_controls_data[(combined_controls_data['word_count'] >= 5000)]

We're going to write these to disk so we have them to work with later on in our analysis!

In [20]:
final_asd_data.to_csv("../confidential/ASD_POS_stats.csv", index=False)
final_controls_data.to_csv("../confidential/controls_POS_stats.csv", index=False)