# Preparing training sentences

In [1]:
import json
import numpy as np
import pandas as pd
from pandas import factorize

In [2]:
!ls bossa/*json

bossa/control_tasks.json       bossa/tasks_export.json
bossa/control_tasks_runs.json  bossa/tasks_runs_export.json


## BOSSA Results

We process the task runs in `bossa/tasks_runs_export.json` to obtain, for each task id, the average of its scores. To do that, we first convert the scores from categorical labels (`vneg`, `neg`, `neu`, `pos`, `vpos`) to a numeric scale from -2 to 2.

In [3]:
bossa_results = pd.read_json("bossa/tasks_runs_export.json")
bossa_results.rename(columns={"created": "start_time", "id": "result_id", "info": "score"}, inplace=True)
bossa_results[['start_time']]= bossa_results[['start_time']].apply(pd.to_datetime, dayfirst=True)
bossa_results[['finish_time']]= bossa_results[['finish_time']].apply(pd.to_datetime, dayfirst=True)
bossa_results['score'] = pd.Categorical(bossa_results['score'], categories=['vneg', 'neg', 'neu', 'pos', 'vpos'])
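# rename_categories returns a new Series (the inplace option was removed in recent pandas)
bossa_results['score'] = bossa_results['score'].cat.rename_categories([-2, -1, 0, 1, 2])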
# Normalize everything to -1, 0, 1
# bossa_results['score'] = bossa_results['score'].astype(float).apply(lambda x: -1 if x < 0 else 1 if x > 0 else 0)
# Elapsed time per result, in seconds
bossa_results["seconds"] = (bossa_results["finish_time"] - bossa_results["start_time"]).dt.total_seconds()
bossa_results = bossa_results[["result_id", "seconds", "task_id", "score"]]
bossa_results.loc[[50]]

| | result_id | seconds | task_id | score |
|---|---|---|---|---|
| 50 | 11203 | 2.5e-05 | 52775 | 1 |
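
The label-to-number conversion can be illustrated on toy data. This is only a sketch: the labels below are made up, and it uses a plain dictionary with `Series.map` rather than the categorical approach of the cell above. It also shows the three-point collapse hinted at in the commented-out line.

```python
import numpy as np
import pandas as pd

# Made-up labels; "maybe" stands in for an unexpected value
labels = pd.Series(["vneg", "pos", "neu", "vpos", "neg", "maybe"])

# Five-point scale, from most negative to most positive
score_map = {"vneg": -2, "neg": -1, "neu": 0, "pos": 1, "vpos": 2}

# Labels missing from the mapping become NaN instead of raising
scores = labels.map(score_map)

# Optional: collapse to -1, 0, 1 by keeping only the sign
three_point = np.sign(scores)

print(pd.DataFrame({"label": labels, "score": scores, "three_point": three_point}))
```

Mapping through a dictionary makes unexpected labels visible as `NaN`, so they can be counted or dropped before averaging.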


The information about each sentence comes as a dictionary inside the cells of the `info` column, which we expand further down. First we load the tasks and keep only `task_id` and `info`.

In [4]:
bossa_tasks = pd.read_json("bossa/tasks_export.json")
bossa_tasks[['created']]= bossa_tasks[['created']].apply(pd.to_datetime, dayfirst=True)
bossa_tasks.rename(columns={'id': 'task_id'}, inplace=True)
bossa_tasks = bossa_tasks[['task_id', 'info']]
bossa_tasks.loc[[50]]

| | task_id | info |
|---|---|---|
| 50 | 52851 | {'pub_date': '2013-02-22T00:00:00Z', 'appears_... |


Next, we merge the `DataFrame` of scores with the one containing the sentences, joining on `task_id`.

In [5]:
bossa_tasks_scores = pd.merge(bossa_results, bossa_tasks, on='task_id')
bossa_tasks_scores.loc[[50]]

| | result_id | seconds | task_id | score | info |
|---|---|---|---|---|---|
| 50 | 11195 | 2.1e-05 | 52776 | 2 | {'pub_date': '2013-05-17T11:47:51Z', 'appears_... |
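
To make the join behaviour concrete, here is a small sketch on invented data. The frames `results` and `tasks` are hypothetical stand-ins for `bossa_results` and `bossa_tasks`. With the default inner join, results whose `task_id` has no matching task are silently dropped, so it can be worth checking for them.

```python
import pandas as pd

# Invented stand-ins: several results can point at the same task
results = pd.DataFrame({
    "result_id": [1, 2, 3, 4],
    "task_id": [10, 10, 11, 99],   # task 99 does not exist in `tasks`
    "score": [1, -1, 2, 0],
})
tasks = pd.DataFrame({
    "task_id": [10, 11, 12],
    "info": [{"sentence": "A"}, {"sentence": "B"}, {"sentence": "C"}],
})

# Inner join on task_id; validate= raises if a task_id is duplicated in `tasks`
merged = pd.merge(results, tasks, on="task_id", how="inner", validate="many_to_one")

# Results that the inner join drops because their task is missing
unmatched = results[~results["task_id"].isin(tasks["task_id"])]

print(merged)
print(unmatched)
```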


Let's now expand the `info` column into one new column per key of the `info` dictionary.

In [6]:
bossa_tasks_scores.loc[50, "info"].keys()

dict_keys(['pub_date', 'appears_in_noun_phrases', 'sentence', 'is_company', 'search_words', 'sentence_id', 'url', 'appears_in_sentence', 'noun_phrases', 'text', 'media'])

In [7]:
def json_to_series(info):
    # Turn one `info` dictionary into a Series indexed by its keys
    return pd.Series(info)

# One row per result, one column per key of `info`
bossa_info = bossa_tasks_scores["info"].apply(json_to_series)
# Append the expanded columns, then drop the original dictionary column
bossa = pd.concat([bossa_tasks_scores, bossa_info], axis=1)
bossa.pop("info")
# bossa['id'] = bossa['id'].astype(float)
bossa.loc[50:53]

| | result_id | seconds | task_id | score | pub_date | appears_in_noun_phrases | sentence | is_company | search_words | sentence_id | url | appears_in_sentence | noun_phrases | text | media |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 11195 | 2.1e-05 | 52776 | 2 | 2013-05-17T11:47:51Z | 0 | Chinese investors are increasingly opting to b... | 0 | executive | 14 | http://dealbook.nytimes.com/2013/05/17/a-toeho... | 0 | [chinese investors, overseas companies, politi... | Chinese investors are increasingly opting to b... | nyt |
| 51 | 11205 | 1.8e-05 | 52776 | -1 | 2013-05-17T11:47:51Z | 0 | Chinese investors are increasingly opting to b... | 0 | executive | 14 | http://dealbook.nytimes.com/2013/05/17/a-toeho... | 0 | [chinese investors, overseas companies, politi... | Chinese investors are increasingly opting to b... | nyt |
| 52 | 11207 | 1.7e-05 | 52776 | 1 | 2013-05-17T11:47:51Z | 0 | Chinese investors are increasingly opting to b... | 0 | executive | 14 | http://dealbook.nytimes.com/2013/05/17/a-toeho... | 0 | [chinese investors, overseas companies, politi... | Chinese investors are increasingly opting to b... | nyt |
| 53 | 11209 | 1.7e-05 | 52776 | -2 | 2013-05-17T11:47:51Z | 0 | Chinese investors are increasingly opting to b... | 0 | executive | 14 | http://dealbook.nytimes.com/2013/05/17/a-toeho... | 0 | [chinese investors, overseas companies, politi... | Chinese investors are increasingly opting to b... | nyt |
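
As an alternative sketch of the same expansion, a column of dictionaries can be turned into columns by building a `DataFrame` from the list of dictionaries, which avoids calling a Python function once per row. The toy frame `df` below is invented and only mimics the shape of `bossa_tasks_scores`.

```python
import pandas as pd

# Invented frame with a dictionary column, mimicking `info`
df = pd.DataFrame({
    "task_id": [10, 11],
    "info": [
        {"sentence": "A", "media": "nyt"},
        {"sentence": "B", "media": "nyt"},
    ],
})

# Build one column per dictionary key, keeping the original row index
expanded = pd.DataFrame(df["info"].tolist(), index=df.index)

# Replace the dictionary column with the expanded columns
flat = df.drop(columns="info").join(expanded)

print(flat)
```

Passing `index=df.index` keeps the expanded rows aligned with the original ones even when the index is not a simple 0..n-1 range.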


## Aggregate

We now aggregate by computing the average score per `sentence` using a group-by. The aggregation drops the per-result rows, so we lose the source of each score; that is why we save the ungrouped data first.

In [8]:
bossa.to_csv("sentiment/last_tasks_scores_ungrouped.csv", encoding="utf8")

Finally, we aggregate and create a new `DataFrame` with one row per sentence and its average score.

In [9]:
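# The scores are categorical, so cast them to float before averaging
sentences = bossa.astype({'score': float}).groupby('sentence')[['score']].aggregate(np.average)
sentences.to_csv("sentiment/last_tasks_scores.csv", encoding="utf8")
print(sentences.count())
# Positional slice: three example rows
sentences.iloc[1001:1004]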

score    8996
dtype: int64


| sentence | score |
|---|---|
| 'We must hope after so much prevarication that this time Google's proposals represent a genuine attempt to address the concerns identified,' said David Wood, the legal counsel for Icomp, an industry group backed by Microsoft and a number of other companies. | -0.333333 |
| 'We must push our leaders to step up and commit to action,' said Hugh Evans, the founder and chief executive of the charity. | -0.285714 |
| 'We need them to tell the story of how we are making decisions and putting the organization together,' said George Postolos, the Astros' president and chief executive, who added that the team would not want a broadcaster who was uncomfortable explaining the front office's strategy. | -0.666667 |

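
The aggregation step can be sketched on a handful of invented ratings. This is an illustration under assumed data, not the notebook's actual output: alongside the mean it also keeps the number of ratings per sentence, which is lost when only the average is stored.

```python
import pandas as pd

# Invented ratings: sentence "A" was rated three times, "B" twice
ratings = pd.DataFrame({
    "sentence": ["A", "A", "A", "B", "B"],
    "score": [2, -1, 1, -2, 0],
})

# Average score and number of ratings per sentence
summary = ratings.groupby("sentence")["score"].agg(["mean", "count"])

print(summary)
```

Keeping the count makes it possible to tell a well-supported average from one based on a single rating.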