Preparing Data
==============

In [2]:
import json
import cPickle as pickle
import numpy as np
import pandas as pd
from pandas import factorize

In [3]:
!ls bossa/*json

bossa/tasks_export.json  bossa/tasks_runs_export.json


BOSSA Results
-------------

We process the task runs export (`bossa/tasks_runs_export.json`) to obtain, for each task id, the average of its scores. To do that, we first convert the scores from categorical labels (`vneg`, `neg`, `neu`, `pos`, `vpos`) to a numeric scale from -2 to 2.

In [4]:
bossa_results = pd.read_json("bossa/tasks_runs_export.json")
bossa_results.rename(columns={"created": "start_time", "id": "result_id", "info": "score"}, inplace=True)
bossa_results[['start_time']]= bossa_results[['start_time']].apply(pd.to_datetime, dayfirst=True)
bossa_results[['finish_time']]= bossa_results[['finish_time']].apply(pd.to_datetime, dayfirst=True)
bossa_results['score'] = pd.Categorical(bossa_results['score'], categories=['vneg', 'neg', 'neu', 'pos', 'vpos'])
bossa_results['score'].cat.rename_categories([-2, -1, 0, 1, 2], inplace=True)
# Normalize everything to -1, 0, 1
# bossa_results['score'] = bossa_results['score'].astype(float).apply(lambda x: -1 if x < 0 else 1 if x > 0 else 0)
bossa_results["seconds"] = (bossa_results["finish_time"] - bossa_results["start_time"]).astype('timedelta64[us]') / 1e6
bossa_results = bossa_results[["result_id", "seconds", "task_id", "score"]]
bossa_results.ix[[50]]

,result_id,seconds,task_id,score
50,11203,2.5e-05,52775,1
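
As a minimal sketch of the label-to-number conversion described above, the same -2 to 2 scale can also be expressed as a plain dictionary mapping. The toy `labels` Series and the `scale` dictionary are assumptions made for illustration; they are not part of the BOSSA export.

```python
import pandas as pd

# Toy labels standing in for the `score` column (hypothetical data).
labels = pd.Series(["pos", "vneg", "neu", "neg", "vpos", "pos"])

# Explicit label -> number mapping on the same -2..2 scale.
scale = {"vneg": -2, "neg": -1, "neu": 0, "pos": 1, "vpos": 2}
numeric = labels.map(scale)

print(numeric.tolist())
print(numeric.mean())
```

With a dictionary mapping, any label missing from `scale` becomes `NaN`, which makes unexpected categories easy to spot before averaging.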


The information about each sentence comes as a dictionary inside the cells of the `info` column, so we will need to expand it.

In [5]:
bossa_tasks = pd.read_json("bossa/tasks_export.json")
bossa_tasks[['created']]= bossa_tasks[['created']].apply(pd.to_datetime, dayfirst=True)
bossa_tasks.rename(columns={'id': 'task_id'}, inplace=True)
bossa_tasks = bossa_tasks[['task_id', 'info']]
bossa_tasks.ix[[50]]

,task_id,info
50,52851,"{u'search_words': u'founder', u'appears_in_sen..."


Finally, we merge the `DataFrame` of scores with the one containing the sentences, joining on `task_id`.

In [6]:
bossa_tasks_scores = pd.merge(bossa_results, bossa_tasks, on='task_id')
bossa_tasks_scores.ix[[50]]

,result_id,seconds,task_id,score,info
50,11195,2.1e-05,52776,2,"{u'search_words': u'executive', u'appears_in_s..."
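
To make the join concrete, here is a small sketch of `pd.merge` on a shared `task_id` column. The two toy frames are hypothetical stand-ins for `bossa_results` and `bossa_tasks`.

```python
import pandas as pd

# Hypothetical results: two runs for task 10, one for task 11.
results = pd.DataFrame({
    "result_id": [1, 2, 3],
    "task_id": [10, 10, 11],
    "score": [1, -1, 2],
})
# Hypothetical tasks: task 12 has no runs.
tasks = pd.DataFrame({
    "task_id": [10, 11, 12],
    "info": [{"sentence": "A"}, {"sentence": "B"}, {"sentence": "C"}],
})

# pd.merge performs an inner join by default, so only task ids
# present in both frames survive.
merged = pd.merge(results, tasks, on="task_id")
print(merged[["result_id", "task_id", "score"]])
```

Each run keeps its own row and gains the `info` of its task; a task without any run (12 here) drops out of the result.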


Let's now expand the `info` column into one new column per key of the `info` dictionary.

In [7]:
bossa_tasks_scores.ix[50].info.keys()

[u'search_words',
 u'appears_in_sentence',
 u'url',
 u'media',
 u'appears_in_noun_phrases',
 u'noun_phrases',
 u'sentence_id',
 u'text',
 u'sentence',
 u'pub_date',
 u'is_company']

In [8]:
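def json_to_series(info):
    # Turn one `info` dictionary into a Series indexed by its keys
    keys, values = zip(*info.items())
    return pd.Series(values, index=keys)

# One column per key of `info`, aligned on the index of bossa_tasks_scores
bossa_info = bossa_tasks_scores["info"].apply(json_to_series)
bossa = pd.concat([bossa_tasks_scores, bossa_info], axis=1)
bossa.pop("info")  # the original dictionary column is no longer needed
# bossa['id'] = bossa['id'].astype(float)
bossa.ix[50:53]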

,result_id,seconds,task_id,score,search_words,appears_in_sentence,url,media,appears_in_noun_phrases,noun_phrases,sentence_id,text,sentence,pub_date,is_company
50,11195,2.1e-05,52776,2,executive,0,http://dealbook.nytimes.com/2013/05/17/a-toeho...,nyt,0,"[chinese investors, overseas companies, politi...",14,Chinese investors are increasingly opting to b...,Chinese investors are increasingly opting to b...,2013-05-17T11:47:51Z,0
51,11205,1.8e-05,52776,-1,executive,0,http://dealbook.nytimes.com/2013/05/17/a-toeho...,nyt,0,"[chinese investors, overseas companies, politi...",14,Chinese investors are increasingly opting to b...,Chinese investors are increasingly opting to b...,2013-05-17T11:47:51Z,0
52,11207,1.7e-05,52776,1,executive,0,http://dealbook.nytimes.com/2013/05/17/a-toeho...,nyt,0,"[chinese investors, overseas companies, politi...",14,Chinese investors are increasingly opting to b...,Chinese investors are increasingly opting to b...,2013-05-17T11:47:51Z,0
53,11209,1.7e-05,52776,-2,executive,0,http://dealbook.nytimes.com/2013/05/17/a-toeho...,nyt,0,"[chinese investors, overseas companies, politi...",14,Chinese investors are increasingly opting to b...,Chinese investors are increasingly opting to b...,2013-05-17T11:47:51Z,0
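
An alternative way to sketch the same expansion is to build a frame directly from the list of dictionaries instead of applying a function row by row. The toy frame below is hypothetical, and this variant assumes a pandas version where `DataFrame.drop` accepts `columns=`.

```python
import pandas as pd

# Hypothetical frame with a dictionary per row in `info`.
df = pd.DataFrame({
    "task_id": [10, 11],
    "info": [
        {"media": "nyt", "sentence_id": 14},
        {"media": "ft", "sentence_id": 3},
    ],
})

# Build the expanded columns from the dictionaries, reusing the
# original index so that the rows stay aligned in the concat.
expanded = pd.DataFrame(df["info"].tolist(), index=df.index)
flat = pd.concat([df.drop(columns="info"), expanded], axis=1)
print(flat)
```

Reusing `df.index` matters when the frame has been filtered or shuffled, since `pd.concat` with `axis=1` aligns rows by index rather than by position.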


Aggregate
---------

We now aggregate by computing the average score per sentence with a group-by on the `sentence` column. The aggregation discards the individual per-result rows, so we first save the ungrouped data.

In [9]:
bossa.to_csv("sentiment/scores_ungrouped.csv", encoding="utf8")

Finally, we aggregate and create a new `DataFrame` for the different sentences and their score.

In [10]:
sentences = bossa.groupby(['sentence'])[['score']].aggregate(np.average)
sentences.to_csv("sentiment/scores.csv", encoding="utf8")
print sentences.count()
sentences[1001:1004]

score    8996
dtype: int64


sentence,score
"'We must hope after so much prevarication that this time Google's proposals represent a genuine attempt to address the concerns identified,' said David Wood, the legal counsel for Icomp, an industry group backed by Microsoft and a number of other companies.",-0.333333
"'We must push our leaders to step up and commit to action,' said Hugh Evans, the founder and chief executive of the charity.",-0.285714
"'We need them to tell the story of how we are making decisions and putting the organization together,' said George Postolos, the Astros' president and chief executive, who added that the team would not want a broadcaster who was uncomfortable explaining the front office's strategy.",-0.666667

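The aggregation step can be illustrated on a toy frame. This is only a sketch with made-up sentences and scores; it uses `.mean()`, which for this purpose should give the same result as aggregating with `np.average`.

```python
import pandas as pd

# Hypothetical ungrouped scores: three runs for "A", two for "B".
ungrouped = pd.DataFrame({
    "sentence": ["A", "A", "A", "B", "B"],
    "score": [1, -1, -1, 2, 0],
})

# One row per sentence, holding the mean of its scores.
per_sentence = ungrouped.groupby("sentence")[["score"]].mean()
print(per_sentence)
```

The grouping key becomes the index of the result, which is why `reset_index()` is needed later to get `sentence` back as a regular column.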

Sentence Classifier
-------------------

In [11]:
from nltk.corpus import stopwords
from sklearn.cross_validation import cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

Create the training and testing sets (data and labels) from a randomized version of the set of assessed sentences.

In [12]:
sentences.reset_index().count()

sentence    8996
score       8996
dtype: int64

We could consider 3 classes, but it turns out that binary classification seems to produce better results. Still, multi-class classifiers are worth trying.

In [13]:
raw_scores = sentences.reset_index()
scores = raw_scores
scores = scores[scores.score!=0]  # We ignore the neutral sentences
scores['sentiment'] = scores['score'].apply(lambda s: 'pos' if s > 0 else 'neg')
percentage = 0.85  # fraction used for training; the rest is for testing
# We split per class to keep both positive and negative sentiments well represented
sent_min = min(
    scores[scores.sentiment=='pos'].sentiment.count(),
    scores[scores.sentiment=='neg'].sentiment.count(),
)
scores = scores[["sentence", "sentiment"]]
train_data = np.array([])
train_labels = np.array([])
test_data = np.array([])
test_labels = np.array([])
for sent in ('pos', 'neg'):
    sent_scores = scores[scores['sentiment']==sent]
    sent_scores = sent_scores.reindex(np.random.permutation(sent_scores.index))
    sent_sentences_count = int(sent_scores['sentence'].count())
    sent_train = sent_scores[["sentence", "sentiment"]][:int(sent_sentences_count * percentage)]
    # Note: the `+ 1` below skips one sentence per class at the boundary, so it lands in neither set
    sent_test = sent_scores[["sentence", "sentiment"]][int(sent_sentences_count * percentage) + 1:]
    print sent, sent_min, sent_train.sentiment.count(), sent_test.sentiment.count()
    train_data = np.append(train_data, sent_train["sentence"])
    train_labels = np.append(train_labels, sent_train["sentiment"])
    test_data = np.append(test_data, sent_test["sentence"])
    test_labels = np.append(test_labels, sent_test["sentiment"])

pos 2939 4281 755
neg 2939 2498 440


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
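
The per-class split can also be written as a small helper. The following is a sketch under stated assumptions: `stratified_split` is a hypothetical function, the toy frame is made up, and the test part starts exactly where the training part ends, so no row is left out.

```python
import numpy as np
import pandas as pd

def stratified_split(df, label_col, train_fraction, seed=0):
    """Shuffle each class separately and cut it at train_fraction."""
    rng = np.random.RandomState(seed)
    train_parts, test_parts = [], []
    for _, group in df.groupby(label_col):
        shuffled = group.iloc[rng.permutation(len(group))]
        cut = int(len(shuffled) * train_fraction)
        train_parts.append(shuffled.iloc[:cut])
        test_parts.append(shuffled.iloc[cut:])  # starts at `cut`: no row skipped
    return pd.concat(train_parts), pd.concat(test_parts)

# Toy data: 20 positive and 10 negative sentences (hypothetical).
toy = pd.DataFrame({
    "sentence": ["s%d" % i for i in range(30)],
    "sentiment": ["pos"] * 20 + ["neg"] * 10,
})
train, test = stratified_split(toy, "sentiment", 0.85)
print("%d train, %d test" % (len(train), len(test)))
```

Working on an explicit `.copy()` of the filtered frame before adding the `sentiment` column is the usual way to avoid the warning about setting a value on a copy of a slice.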


In [21]:
import nltk

In [27]:
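from nltk.corpus import stopwords
 
def preprocess(text):
    # Lowercase, tokenize, and drop English stopwords
    text = text.lower()
    tokens = nltk.word_tokenize(text)
    english_stopwords = set(stopwords.words('english'))  # build the set once, not per token
    filtered_words = [t for t in tokens if t not in english_stopwords]
    return filtered_words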

In [28]:
sentence = "At eight o'clock on Thursday morning Arthur didn't feel very good. French-Fries"
print preprocess(sentence)

['eight', "o'clock", 'thursday', 'morning', 'arthur', "n't", 'feel', 'good', '.', 'french-fries']


In [100]:
document_df = scores[['sentence', 'sentiment']]
document_df = document_df.reindex(np.random.permutation(document_df.index))

In [101]:
document_df.head()

,sentence,sentiment
2179,As giants like Amazon.com move into the online...,pos
41,"""One reason why Lloyds shares have appreciated...",pos
4605,"It was founded by Vernon Hill, an entrepreneur...",neg
950,"'We are not in a rush in taking decisions,' sa...",pos
6450,"Samsung, which makes smartphones as well as th...",pos


In [102]:
documents = [(r[1]['sentence'], r[1]['sentiment']) for r in document_df.iterrows()]

In [103]:
documents[:5]

[(u'As giants like Amazon.com move into the online grocery business, an investment firm is betting that a technology company can help traditional retailers fight back.',
  'pos'),
 (u'"One reason why Lloyds shares have appreciated so much is that there is no other real pure-play UK bank to buy at a time when the British economy is bouncing back," says one leading asset manager.',
  'pos'),
 (u'It was founded by Vernon Hill, an entrepreneur who pioneered his customer-friendly approach to branches at Commerce Bancorp in the US.',
  'neg'),
 (u"'We are not in a rush in taking decisions,' said Hanan Ashrawi, a member of the executive committee.",
  'pos'),
 (u"Samsung, which makes smartphones as well as the chips that go into many other manufacturers' devices, rose 760 percent.",
  'pos')]

In [104]:
all_words = [w.lower() for d in documents for w in nltk.word_tokenize(d[0])]
freq_dist = nltk.FreqDist(all_words)

In [105]:
# Build the stopword set once instead of reloading the list for every word
english_stopwords = set(stopwords.words('english'))
most_common_words = [word for word, freq in freq_dist.most_common()
                     if word not in english_stopwords]
word_features = most_common_words[:1000]
print word_features[:10]

[u',', u'.', u"'s", u"'", u'executive', u'said', u'chief', u'apple', u'manager', u'company']


In [106]:
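def document_features(document):
    # `document` is the raw sentence, so tokenize it first; set(document)
    # on a string would produce a set of characters, not of words
    document_words = set(w.lower() for w in nltk.word_tokenize(document))
    features = {}
    for word in word_features:
        features['%s' % word] = word in document_words
    return features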

In [107]:
featureset = [(document_features(d[0]), d[1]) for d in documents]
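
To show what one entry of the feature set is meant to look like, here is a self-contained sketch of the word-presence features. It is an illustration only: the vocabulary is hypothetical, the function takes the vocabulary as an explicit argument, and a whitespace split stands in for `nltk.word_tokenize` so that the example has no NLTK dependency.

```python
def document_features(tokens, vocabulary):
    """Map each vocabulary word to True/False depending on its presence."""
    present = set(tokens)
    return {word: (word in present) for word in vocabulary}

vocabulary = ["executive", "chief", "apple"]   # hypothetical top words
sentence = "the chief executive said"
tokens = sentence.split()                      # stand-in for a real tokenizer

features = document_features(tokens, vocabulary)
print(features)

# Pitfall: a set built from the raw string holds characters, not words.
print("chief" in set(sentence))
```

The function has to receive a list of tokens. If it is given the raw sentence, every multi-character word tests as absent and the features carry almost no signal.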

In [75]:
train_set = featureset[100:]
test_set = featureset[:100]

In [76]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [77]:
nltk.classify.accuracy(classifier, test_set)

1.0
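
An accuracy figure on a 100-sentence test set is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the majority class of the training labels. Both `majority_baseline_accuracy` and the toy labels are hypothetical and not part of the notebook.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for label in test_labels if label == majority)
    return hits / float(len(test_labels))

# Hypothetical labels: "pos" is the majority class in training.
train_labels = ["pos"] * 6 + ["neg"] * 4
test_labels = ["pos", "pos", "neg", "pos"]

print(majority_baseline_accuracy(train_labels, test_labels))
```

With the real data, the labels would come from the second element of each pair in `train_set` and `test_set`.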

In [78]:
classifier.show_most_informative_features(5)

Most Informative Features
