In [1]:
import pandas as pd
from collections import defaultdict
from IPython.display import display, HTML

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import textstat

In [2]:
DATASET_PATH = '../datasets/RC_2018-02-28'

In [3]:
# chunksize is a row count, so pass an int; .read() concatenates all chunks into one DataFrame
df = pd.read_json(DATASET_PATH, lines=True, chunksize=10_000).read()
df = df[(df.body != '[deleted]') & (df.body != '[removed]')]

Candidate sentiment features: SentiStrength, VADER (vaderSentiment), and LIWC. SentiStrength and LIWC are both proprietary, so only VADER is used here. LIWC also reports many psychological and linguistic dimensions, which would be useful if it were freely available.

TODO: learned Naive Bayes classifier

In [4]:
def vader_features(analyzer, body):
    vs = analyzer.polarity_scores(body)
    return {'vad_'+k: v for k, v in vs.items()}
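
To check the shape of the returned dict without loading the VADER lexicon, the function can be exercised with a stand-in analyzer. This is only a sketch: `StubAnalyzer` is hypothetical, and its scores are copied from row 1 of the output table further down rather than computed.

```python
def vader_features(analyzer, body):
    vs = analyzer.polarity_scores(body)
    return {'vad_' + k: v for k, v in vs.items()}


class StubAnalyzer:
    """Hypothetical stand-in for SentimentIntensityAnalyzer (no lexicon needed)."""

    def polarity_scores(self, text):
        # Fixed scores copied from row 1 of the notebook's output table.
        return {'neg': 0.098, 'neu': 0.714, 'pos': 0.188, 'compound': 0.25}


features = vader_features(StubAnalyzer(), "My comment was a joke")
print(features)
```

The prefixed keys are what become the `vad_neg`, `vad_neu`, `vad_pos`, and `vad_compound` columns in the featurized DataFrame.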

Readability features: word count, average sentence length, average word length, Gunning Fog, SMOG, and Flesch-Kincaid grade level. LIWC also provides language features, but again it is proprietary.

TODO: COCA fluency

In [5]:
def read_features(body):
    # Readability scores from textstat, keyed by short feature names.
    return {
        'WC': textstat.lexicon_count(body),         # word count
        'WPS': textstat.avg_sentence_length(body),  # words per sentence
        'WL': textstat.avg_letter_per_word(body),   # letters per word
        'GI': textstat.gunning_fog(body),           # Gunning Fog index
        'SMOG': textstat.smog_index(body),          # SMOG index
        'FK': textstat.flesch_kincaid_grade(body),  # Flesch-Kincaid grade level
    }
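
Since the textstat scores are treated with some suspicion later in this notebook, a hand-rolled cross-check of the three simplest features may help. This is a rough sketch, not textstat's implementation: `simple_read_features` is a hypothetical helper, sentences are split on `.`, `!`, and `?`, and words are whitespace-separated tokens with surrounding punctuation stripped, so the numbers may differ from textstat on real comments.

```python
import re
import string


def simple_read_features(body):
    # Sentences: split on runs of terminal punctuation, drop empty pieces.
    sentences = [s for s in re.split(r'[.!?]+', body) if s.strip()]
    # Words: whitespace tokens with leading/trailing punctuation removed.
    words = [w.strip(string.punctuation) for w in body.split()]
    words = [w for w in words if w]
    if not words or not sentences:
        return {'WC': 0, 'WPS': 0.0, 'WL': 0.0}
    return {
        'WC': len(words),                               # word count
        'WPS': len(words) / len(sentences),             # words per sentence
        'WL': sum(len(w) for w in words) / len(words),  # characters per word
    }


sample = "Oh man, that's great. I have loved Rammstein for a long time."
feats = simple_read_features(sample)
print(feats)
```

Comparing these values against `read_features` on a handful of comments is one way to see whether the library's tokenization is what makes the scores look off.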

In [6]:
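import string
from collections import Counter

# A set gives constant-time membership checks per character.
exclude = set(string.punctuation)

def ttr_feature(body):
    # Type-token ratio: distinct words / total words, after stripping ASCII punctuation.
    words = ''.join(ch for ch in body if ch not in exclude).split()
    if not words:
        # No word tokens (e.g. a punctuation-only body): avoid dividing by zero.
        return {'ttr': 0.0}
    return {'ttr': len(Counter(words)) / len(words)}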
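
A small worked example of the type-token ratio may make the feature easier to read. The `ttr` helper below is a sketch along the same lines as `ttr_feature`; the `lowercase` flag is a hypothetical addition, included only to show that the ratio is case-sensitive unless tokens are normalized first.

```python
import string
from collections import Counter

PUNCT = set(string.punctuation)


def ttr(body, lowercase=False):
    # lowercase is a hypothetical option, not part of the notebook's ttr_feature.
    text = body.lower() if lowercase else body
    words = ''.join(ch for ch in text if ch not in PUNCT).split()
    if not words:
        return 0.0
    # Distinct tokens (types) divided by total tokens.
    return len(Counter(words)) / len(words)


sample = "The cat saw the dog, and the dog saw the cat."
case_sensitive = ttr(sample)                  # "The" and "the" count as two types
case_folded = ttr(sample, lowercase=True)     # "The" and "the" collapse into one
print(case_sensitive, case_folded)
```

The sample has 11 tokens: 6 distinct types when case is kept and 5 after lowercasing. Short comments tend toward a ratio of 1.0, so the feature is not directly comparable across bodies of very different lengths.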

In [7]:
# Build the analyzer once; constructing it per row reloads the VADER lexicon each time.
analyzer = SentimentIntensityAnalyzer()

def all_features(comment):
    body = comment.body

    features = {
        'body': body,
        'subreddit': comment.subreddit,
    }

    features.update(vader_features(analyzer, body))
    features.update(ttr_feature(body))

    # Readability features are disabled for now; the textstat scores seem sketchy.
    # features.update(read_features(body))

    return pd.Series(features)

In [8]:
featurized = df.iloc[:20].apply(all_features, axis=1)

with pd.option_context('display.max_colwidth', 500, 'display.max_columns', 100):
    display(featurized)

,body,subreddit,vad_neg,vad_neu,vad_pos,vad_compound,ttr
0,"People loooooove quoting Warren Buffett here. Little do people realize they aren't billionaires with billions in cash waiting on the side lines for the next big market crash. Does Buffett add to his positions when he can, oh yes he does, but he always, always keeps lots of cash as well. Now...I don't have billions in cash, so I'd rather simply buy at a good time and hold until a great time.\n\nNot holding through this downturn has been the difference between me having no cash, and being down...",weedstocks,0.127,0.733,0.139,0.6769,0.690763
1,"My comment was a joke, if Spurs can't hold on to him there's no chance the Quakes would get him.",MLS,0.098,0.714,0.188,0.25,0.95
2,Has anyone played this multiplayer with friends and npcs as well? Anything to say about it? Who tested this? Just skeptical to install and bork my game...I want to believe but a patch that has 53 pages of notes HAS to have some issues somewhere,dawnofwar,0.035,0.819,0.147,0.5171,0.888889
3,"Don’t be a turkey, knock it off with the puns",funny,0.0,1.0,0.0,0.0,1.0
4,"It looks like there's text in this image. I've tried to transcribe it automatically, but I'm still learning -- this may be inaccurate. At the very least, hopefully it will serve as a decent starting point for your work!\n\nPlease note that any formatting instructions above override whatever I provide, so please format my content accordingly if you choose to use it. \n\nProcess time: 0.775s\n\n---\n\nv0.4.2 | This message was posted by a bot. | [FAQ](https://www.reddit.com/r/TranscribersOfRed...",TranscribersOfReddit,0.0,0.843,0.157,0.9274,0.923077
5,"Definetly gambling problem... like all of us here...\n\nEdit: Before you get me wrong (what always happen on the internet) I never spent a dime for pulling units, I just love to crack open this crystals every day.",FFBraveExvius,0.074,0.766,0.16,0.5574,0.973684
6,&gt;Premature death is premature death\n\nThen 200% increase on alcoholic taxes when? It causes far more premature deaths. It has less societal uses than guns.\n\nSuicide is not illegal in my country. What the fuck are you going on about?\n,worldnews,0.292,0.614,0.094,-0.9175,0.875
7,"Oh man, that's great. I have loved Rammstein for a long time. Unfortunately, the last time they were around me stateside, I was underage and couldn't get into the show.",AskReddit,0.07,0.698,0.233,0.765,0.9
8,"Bezos, Brandson and Musk.\n\n""They have changed everything. The price of spaceflight has gone down because of these billionairs""\n\nDespite the fact bezos hasn't launched apayload. (Have sold launches to be fair which could count as lowering cost depending o what that price is)\n\nOh and brandson? The guy who spent $400M on a 4 min joyride that hasn't carried a single paying customer in its whole ten years? (And now bezos might fly customers first, with more safely and cheaper). Yeah he rea...",SpaceXMasterrace,0.037,0.848,0.115,0.7954,0.873563
9,"I believe that it will be a reality, but not anytime soon. Eventually one of the states will adopt universal healthcare, and cause successive states to adopt it as well until it goes nationwide.",AskAnAmerican,0.0,0.749,0.251,0.8126,0.852941


TODO: featurize each comment into a dict, collect the dicts in a list, and vectorize them with sklearn.feature_extraction.DictVectorizer
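
The list-of-dicts step could look roughly like the sketch below. To keep it dependency-free, `vectorize` is a hypothetical stand-in that loosely mirrors what `DictVectorizer` is expected to do (string values one-hot encoded as `name=value`, numeric values passed through, feature names sorted); it is not scikit-learn's implementation. The two records reuse values from rows 1 and 3 of the output table.

```python
def vectorize(records):
    """Hypothetical stand-in for DictVectorizer: one-hot strings, pass numbers through."""
    expanded = []
    for rec in records:
        row = {}
        for key, value in rec.items():
            if isinstance(value, str):
                # Categorical value: one indicator column per (key, value) pair.
                row[f'{key}={value}'] = 1.0
            else:
                row[key] = float(value)
        expanded.append(row)
    # Sorted union of all column names seen across records.
    names = sorted({name for row in expanded for name in row})
    matrix = [[row.get(name, 0.0) for name in names] for row in expanded]
    return names, matrix


# One dict per comment; 'body' is left out on purpose.
records = [
    {'subreddit': 'MLS', 'vad_compound': 0.25, 'ttr': 0.95},
    {'subreddit': 'funny', 'vad_compound': 0.0, 'ttr': 1.0},
]
names, X = vectorize(records)
print(names)
print(X)
```

With scikit-learn installed, `DictVectorizer(sparse=False).fit_transform(records)` should yield an equivalent matrix. The raw `body` string is dropped before vectorizing because a free-text value would otherwise become its own one-hot column for nearly every comment.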