# Topic / Keyword Analysis

We want to get an idea of what each blog is about, so that we can match blogs from autistic bloggers with similar blogs from our "control" group.  We will run a Term Frequency / Inverse Document Frequency (TF/IDF) analysis and also use IBM Watson to detect concepts.  We work at the blog level, with all posts from the same blog concatenated into a single text: post-level matching would be prohibitively difficult, and the overall themes of a blog are sufficient for matching purposes.

In [1]:
import os
import gensim
from gensim.models import TfidfModel
import json
import sys
import string
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 \
  import Features, ConceptsOptions, EntitiesOptions, KeywordsOptions

## Add a Word Count Tool

We don't want to work with blogs that have fewer than 5000 words total, so we can skip analyzing these!

In [2]:
def getWC(fname):
    num_words = 0
    with open(fname, 'r') as f:
        for line in f:
            words = line.split()
            num_words += len(words)
    return(num_words)

## Set up for TF/IDF

Much of the TF/IDF work in this section has been shamelessly borrowed from http://carrefax.com/new-blog/2017/11/25/create-a-gensim-corpus-for-text-files-in-a-local-directory .

In [3]:
def list_files(top_directory):
    texts = []
    for root, dirs, files in os.walk(top_directory):
        for file in filter(lambda file: (file.endswith('.txt') and \
                                         getWC(os.path.join(root, file)) >= 5000), files):
            texts.append(file)
    return(texts)
        
def iter_documents(top_directory):
    """Iterate over all documents, yielding a document (=list of utf8 tokens) at a time."""
    for root, dirs, files in os.walk(top_directory):
        for file in filter(lambda file: (file.endswith('.txt') and \
                                         getWC(os.path.join(root, file)) >= 5000), files):
            # read the entire document as one big string, closing the file afterwards
            with open(os.path.join(root, file)) as f:
                document = f.read()
            yield gensim.utils.tokenize(document, lower=True) # or whatever tokenization suits you

class MyCorpus(object):
    def __init__(self, top_dir):
        self.top_dir = top_dir
        self.dictionary = gensim.corpora.Dictionary(iter_documents(top_dir))
        self.dictionary.filter_extremes(no_below=1, keep_n=30000) # check API docs for pruning params

    def __iter__(self):
        for tokens in iter_documents(self.top_dir):
            yield self.dictionary.doc2bow(tokens)

In [4]:
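def getKeyWords(path):
    """Return, for each blog under path, a string of its 40 highest-weighted TF/IDF terms."""
    corpus = MyCorpus(path)
    bows = list(corpus)       # one bag-of-words vector per blog
    tfidf = TfidfModel(bows)  # IDF is computed within this group only
    blogs_keyword_list = []
    for blog in tfidf[bows]:
        sorted_tfidf_weights = sorted(blog, key=lambda w: w[1], reverse=True)
        keywords = ""
        for term_id, weight in sorted_tfidf_weights[:40]:
            keywords = keywords + corpus.dictionary.get(term_id) + ", "
        blogs_keyword_list.append(keywords)
    return(blogs_keyword_list)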

## Perform TF/IDF Analysis

We want to analyze TF/IDF by group: ASD in one pass, controls in another.  Because inverse document frequency is computed within each group, terms that are very common across the blogs written by autistic bloggers (like "autism" itself) are down-weighted instead of dominating every keyword list.
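
To make the effect of per-group IDF concrete, here is a minimal sketch using the textbook formula tf * log(N / df) on a made-up three-blog group. The function name `tfidf_weights` and the toy texts are illustrative assumptions, not part of the analysis; gensim's `TfidfModel` applies its own defaults for the logarithm and normalization, so its absolute weights will differ.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in term_freq.items()})
    return weights

# Toy "group" (invented): every blog mentions "autism", so within this
# group its IDF is log(3 / 3) = 0 and it can never be a top keyword.
toy_group = [
    "autism sensory knitting knitting".split(),
    "autism school school advocacy".split(),
    "autism travel travel photos".split(),
]
toy_weights = tfidf_weights(toy_group)
top_terms = [max(w, key=w.get) for w in toy_weights]
print(top_terms)
```

If the same blogs were pooled with a control group that rarely mentions "autism", the term would get a positive weight and could crowd out the more distinctive themes.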

In [5]:
asd_docs = [x.replace(".txt", "") for x in list_files("../confidential/corpora/consolidated_texts/ASD")]
controls_docs = [x.replace(".txt", "") for x in list_files("../confidential/corpora/consolidated_texts/controls")]

In [6]:
asd_kw = getKeyWords('../confidential/corpora/consolidated_texts/ASD')
controls_kw = getKeyWords('../confidential/corpora/consolidated_texts/controls')

## For Selected Blogs, Get Word Count

In [7]:
asd_wc = [getWC(os.path.join("../confidential/corpora/consolidated_texts/ASD", (x+".txt"))) \
                for x in asd_docs]

controls_wc = [getWC(os.path.join("../confidential/corpora/consolidated_texts/controls", (x+".txt"))) \
               for x in controls_docs]

## Using IBM Watson to Detect Concepts

We can only send text in 5000-character chunks, so we split each text into chunks of at most 5000 characters, observing word boundaries, and send each chunk to Watson.  When we get the data back for each chunk, we can consolidate common topics.  It's important to save the Watson response, even if we don't want to use all of it, since there are limits to the number of free Watson requests we can make.

It's easy and free to get credentials; see more at https://www.ibm.com/watson/services/natural-language-understanding/.

In [8]:
with open('../confidential/watson_credentials.json') as f:
    credentials = json.load(f)

Here we request a specific version of the Watson Natural Language Understanding API.

In [9]:
natural_language_understanding = NaturalLanguageUnderstandingV1(
  username=credentials["username"],
  password=credentials["password"],
  version='2018-03-16')

In case Watson is bogged down, can't analyze a text, etc., we fail gracefully!  We will ask Watson to do hundreds of analyses in one go, so we'd hate for the whole run to break because of a handful of errors.  That's why we use try/except here.

In [10]:
def watsonAnalysis(blog_text):
    try:
        response = natural_language_understanding.analyze(
          text=blog_text,
          features=Features(
            entities=EntitiesOptions(
              limit=15),
            keywords=KeywordsOptions(
              limit=15),
            concepts=ConceptsOptions(
              limit=15)))
    except Exception:  # catch API errors, but let KeyboardInterrupt through
        print("ERROR in Watson analysis with exception " + str(sys.exc_info()[0]))
        return("")
    return(response)

In [11]:
import textwrap

def segmentBlog(path):
    with open(path, 'r') as f:
        blog_text = f.read().replace('\n', ' ')
    # wrap() breaks lines on whitespace, so chunks end at word boundaries
    bite_sized = textwrap.wrap(blog_text, 5000)
    return(bite_sized)

def blogSegmentsWatson(path):
    bite_sized = segmentBlog(path)
    watson = []
    for bite in bite_sized[:15]:
        # Even if a blog has 300k words, only send the first 15 chunks
        # (at most 75,000 characters); that should be enough for a good analysis.
        watson.append(watsonAnalysis(bite))
    return(watson)
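
The chunking behaviour is easy to check on a small scale. The sketch below uses a width of 40 as a stand-in for the 5000 used above, on an invented sample sentence; the variable names are illustrative only.

```python
import textwrap

sample = "the quick brown fox jumps over the lazy dog " * 5
chunks = textwrap.wrap(sample, 40)  # 40 stands in for the 5000 used above

# Every chunk fits within the width...
longest = max(len(chunk) for chunk in chunks)
# ...and rejoining the chunks gives back exactly the original words.
words_preserved = " ".join(chunks).split() == sample.split()

# Keeping at most 15 chunks of 5000 characters caps the text sent per blog.
max_chars_per_blog = 15 * 5000
print(longest, words_preserved, max_chars_per_blog)
```

Two caveats apply to `textwrap.wrap` with its default settings: a single word longer than the width is broken to fit, and hyphenated words may be split at the hyphen.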

Run the analysis!

In [12]:
asd_blog_root = "../confidential/corpora/consolidated_texts/ASD/"
controls_blog_root = "../confidential/corpora/consolidated_texts/controls/"
asd_watson_list = []
controls_watson_list = []

It's unsurprising to get some errors here.  Each blog has enough 5000-character chunks that we can afford to have some requests fail...

In [13]:
for blog in asd_docs:
    path = asd_blog_root + blog + ".txt"
    features = blogSegmentsWatson(path)
    asd_watson_list.append(features)    

ERROR in Watson analysis with exception <class 'watson_developer_cloud.watson_service.WatsonApiException'>
ERROR in Watson analysis with exception <class 'watson_developer_cloud.watson_service.WatsonApiException'>
ERROR in Watson analysis with exception <class 'watson_developer_cloud.watson_service.WatsonApiException'>
ERROR in Watson analysis with exception <class 'watson_developer_cloud.watson_service.WatsonApiException'>


In [14]:
for blog in controls_docs:
    path = controls_blog_root + blog + ".txt"
    features = blogSegmentsWatson(path)
    controls_watson_list.append(features)   

ERROR in Watson analysis with exception <class 'watson_developer_cloud.watson_service.WatsonApiException'>
ERROR in Watson analysis with exception <class 'watson_developer_cloud.watson_service.WatsonApiException'>
ERROR in Watson analysis with exception <class 'watson_developer_cloud.watson_service.WatsonApiException'>
ERROR in Watson analysis with exception <class 'watson_developer_cloud.watson_service.WatsonApiException'>
ERROR in Watson analysis with exception <class 'watson_developer_cloud.watson_service.WatsonApiException'>
ERROR in Watson analysis with exception <class 'watson_developer_cloud.watson_service.WatsonApiException'>
ERROR in Watson analysis with exception <class 'watson_developer_cloud.watson_service.WatsonApiException'>
ERROR in Watson analysis with exception <class 'watson_developer_cloud.watson_service.WatsonApiException'>


In [15]:
import pandas as pd
asd_watson_data = pd.DataFrame(asd_watson_list)
controls_watson_data = pd.DataFrame(controls_watson_list)

In [16]:
asd_watson_data.to_csv("../confidential/watson_asd.csv")
controls_watson_data.to_csv("../confidential/watson_controls.csv")
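
Since each cell of these data frames holds a whole response, the CSV stores them as their string representations, which are awkward to parse back. One possible alternative is to keep the raw responses as JSON as well. This is a sketch under assumptions: the responses are plain dictionaries (as the `.get` calls used on them suggest), failed chunks are empty strings, and `save_responses`, `load_responses`, and the demo path are hypothetical names, not part of the original analysis.

```python
import json
import os
import tempfile

def save_responses(responses_by_blog, blog_names, out_path):
    """Write {blog name: [chunk responses]} to out_path as JSON."""
    payload = dict(zip(blog_names, responses_by_blog))
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(payload, f)

def load_responses(in_path):
    """Read back the mapping written by save_responses."""
    with open(in_path, encoding="utf-8") as f:
        return json.load(f)

# Stand-in data shaped like asd_watson_list: one list of chunk responses
# per blog, with "" marking a chunk whose request failed.
fake_docs = ["blog_a", "blog_b"]
fake_watson_list = [
    [{"keywords": [{"text": "knitting", "relevance": 0.91}]}, ""],
    [{"concepts": [{"text": "Travel", "relevance": 0.88}]}],
]

demo_path = os.path.join(tempfile.mkdtemp(), "watson_demo.json")
save_responses(fake_watson_list, fake_docs, demo_path)
restored = load_responses(demo_path)
```

Reloading from such a file would let later cells be rerun without spending more Watson requests.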

In [17]:
asd_entities = []
asd_keywords = []
asd_concepts = []

In [18]:
for blog in asd_watson_list:
    entities = []
    keywords = []
    concepts = []
    for chunk in blog:
        if len(chunk) > 0 :
            for keyword in chunk.get("keywords"):
                if keyword.get("relevance") > 0.85:
                    keywords.append(keyword.get("text"))
            for entity in chunk.get("entities"):
                if entity.get("relevance") > 0.85:
                    entities.append(entity.get("text"))
            for concept in chunk.get("concepts"):
                if concept.get("relevance") > 0.85:
                    concepts.append(concept.get("text"))
    entities = set(entities)
    keywords = set(keywords)
    concepts = set(concepts)
    asd_entities.append(entities)
    asd_keywords.append(keywords)
    asd_concepts.append(concepts)

In [19]:
controls_entities = []
controls_keywords = []
controls_concepts = []

In [20]:
for blog in controls_watson_list:
    entities = []
    keywords = []
    concepts = []
    for chunk in blog:
        if len(chunk) > 0 :
            for keyword in chunk.get("keywords"):
                if keyword.get("relevance") > 0.85:
                    keywords.append(keyword.get("text"))
            for entity in chunk.get("entities"):
                if entity.get("relevance") > 0.85:
                    entities.append(entity.get("text"))
            if chunk.get("concepts"): # some don't have concepts
                for concept in chunk.get("concepts"):
                    if concept.get("relevance") > 0.85:
                        concepts.append(concept.get("text"))
            else:
                concepts.append("")
    entities = set(entities)
    keywords = set(keywords)
    concepts = set(concepts)
    controls_entities.append(entities)
    controls_keywords.append(keywords)
    controls_concepts.append(concepts)
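
The two loops above differ only in how they handle a missing `concepts` field, so the shared logic could be factored into one helper. The sketch below is an assumption-laden illustration: `high_relevance_texts` is a hypothetical name, the fake responses are invented, and unlike the controls loop it skips a missing feature instead of adding an empty string to the set.

```python
def high_relevance_texts(chunks, feature, threshold=0.85):
    """Collect the distinct `text` values of one feature across a blog's chunks."""
    texts = set()
    for chunk in chunks:
        if not chunk:  # "" marks a failed Watson request
            continue
        for item in chunk.get(feature) or []:  # the feature may be absent
            if item.get("relevance", 0) > threshold:
                texts.add(item["text"])
    return texts

# Invented responses for one blog, shaped like the entries of asd_watson_list.
fake_blog = [
    {"keywords": [{"text": "knitting", "relevance": 0.91},
                  {"text": "weather", "relevance": 0.40}],
     "entities": [],
     "concepts": [{"text": "Craft", "relevance": 0.95}]},
    "",  # failed chunk
    {"keywords": [{"text": "knitting", "relevance": 0.88}],
     "entities": [{"text": "Ravelry", "relevance": 0.90}]},  # no "concepts" key
]

blog_keywords = high_relevance_texts(fake_blog, "keywords")
blog_entities = high_relevance_texts(fake_blog, "entities")
blog_concepts = high_relevance_texts(fake_blog, "concepts")
```

Applied to each blog in `asd_watson_list` and `controls_watson_list`, a helper like this would treat both groups identically.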

## Combine What We Know

In [21]:
asd_blog_features = pd.DataFrame({'blog' : asd_docs,
                                  'word_count' : asd_wc,
                                  'tf_idf_keywords' : asd_kw,
                                  'keywords' : asd_keywords, 
                                  'entities' : asd_entities,
                                  'concepts': asd_concepts})

In [22]:
controls_blog_features = pd.DataFrame({'blog' : controls_docs,
                                       'word_count' : controls_wc,
                                       'tf_idf_keywords' : controls_kw,
                                       'keywords' : controls_keywords, 
                                       'entities' : controls_entities,
                                       'concepts': controls_concepts})

## Write Matching Data to File

We want to review the matching data qualitatively, since it takes too much human knowledge to match similar blogs automatically!

In [23]:
asd_blog_features.to_csv("../confidential/asd_matching.csv")
controls_blog_features.to_csv("../confidential/controls_matching.csv")