# Process Datasets

To build our sentiment analyser(s) we will be using two core datasets:
* Financial phrase bank: https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10/link/0c96051eee4fb1d56e000000/download
* Domain-specific dictionary: provided on request by Srikumar Krishnamoorthy, the author of Sentiment Analysis of Financial News Articles using Performance Indicators, 2017.

The purpose of this notebook is to:
1. Download the required datasets to the notebook instance.
2. Restructure them into a tabular format.
3. Save the resulting tables to csv so they can be used throughout the project.

<b>NOTE:</b> No further analysis takes place in this notebook.

## Import Libraries

In [1]:
import json
import nltk

# set sys path to access scripts
import sys
sys.path.append('../')
import os

# imports
import scripts.config as config
import numpy as np
import pandas as pd

## Import Data & Process Financial Phrase Bank

The dataset contains four text files. Each file is distinguished by an agreement threshold: the minimum percentage of annotators that agreed on the sentiment. For example, in 'Sentences_66Agree.txt' at least 66% of the annotators had to agree on the underlying sentiment. The phrase "at least" is important here: 'Sentences_50Agree.txt' contains every example from the other three files (at least 66%, at least 75% and 100% agreement) as well as the examples where 50-66% of the annotators agreed.

For this reason, we will only use the 50% or more agree dataset. 

That being said, we can import all four datasets to check that there are no discrepancies in the overlaps. That is, the examples with at least 75% agreement in the 50%-or-more dataset should match the 75%-or-more dataset exactly.
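
The overlap check described above amounts to verifying that the files are nested: every sentence in a stricter file should also appear in each looser file. The following is a minimal sketch of that idea, not the code used in this notebook; the sentences in `sentences_by_file` are made-up toy data and `nesting_violations` is a hypothetical helper name.

```python
# Toy stand-ins for the sentences parsed from each agreement file (hypothetical data).
sentences_by_file = {
    "Sentences_AllAgree.txt": {"profit rose", "sales fell"},
    "Sentences_75Agree.txt": {"profit rose", "sales fell", "outlook unchanged"},
    "Sentences_66Agree.txt": {"profit rose", "sales fell", "outlook unchanged",
                              "new ceo named"},
    "Sentences_50Agree.txt": {"profit rose", "sales fell", "outlook unchanged",
                              "new ceo named", "plant to close"},
}

# Files ordered from the strictest to the loosest agreement threshold.
ordered = ["Sentences_AllAgree.txt", "Sentences_75Agree.txt",
           "Sentences_66Agree.txt", "Sentences_50Agree.txt"]


def nesting_violations(sentences_by_file, ordered):
    """Return (stricter, looser, missing) for each adjacent pair that is not nested."""
    violations = []
    for stricter, looser in zip(ordered, ordered[1:]):
        # Sentences present in the stricter file but absent from the looser one.
        missing = sentences_by_file[stricter] - sentences_by_file[looser]
        if missing:
            violations.append((stricter, looser, missing))
    return violations


violations = nesting_violations(sentences_by_file, ordered)
print(violations)  # an empty list means the files are nested as expected
```

Checking adjacent pairs is enough here because set inclusion is transitive.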

In [12]:
# dataset directory
dataset_dir = config.DATASETS_LOC

# financial phrase bank dataset directory
phrase_dir = dataset_dir + '/FinancialPhraseBank'

In [13]:
# text files and corresponding confidences
conf_mapping = {'Sentences_AllAgree.txt':1, 'Sentences_75Agree.txt':0.75, 
                'Sentences_66Agree.txt':0.66, 'Sentences_50Agree.txt':0.5}
phrase_files = list(conf_mapping.keys())

# label tokens that mark the end of each phrase (used as stop words when parsing)
class_mapping = {'.@positive': 'positive', '.@neutral': 'neutral', '.@negative': 'negative'}
stop_words = list(class_mapping.keys())

# define empty dict to store data
dict_phrasebank = {'sentiment':[], 'text':[], 'confidence':[]}

# initiate empty text
text = ''

# iterate over each text file
for phrase_file in phrase_files:
    
    # get confidence score relating to text file
    confidence = conf_mapping[phrase_file]
    
    # read the text file word by word
    with open(phrase_dir + '/' + phrase_file, 'r', encoding='ISO-8859-1') as f:
        
        # iterate over lines in file
        for line in f:
            
            # iterate over words in line
            for word in line.split():
                
                # when stop word is reached
                if word in stop_words:
                    
                    text = text.lower()
                    
                    # update dictionary to store sentiment, text and confidence
                    sentiment = class_mapping[word]
                    dict_phrasebank['sentiment'].append(sentiment)
                    dict_phrasebank['text'].append(text)
                    dict_phrasebank['confidence'].append(confidence)
                    
                    # reset text for next phrase
                    text = ''
                
                # otherwise add word to body of text 
                else:
                    text = text + ' ' + word

# create dataframe 
df_phrasebook = pd.DataFrame(dict_phrasebank)

# remove duplicate data to save space
#!rm /home/ec2-user/SageMaker/mle-capstone/datasets/FinancialPhraseBank/Sentences_50Agree.txt
#!rm /home/ec2-user/SageMaker/mle-capstone/datasets/FinancialPhraseBank/Sentences_66Agree.txt
#!rm /home/ec2-user/SageMaker/mle-capstone/datasets/FinancialPhraseBank/Sentences_75Agree.txt
#!rm /home/ec2-user/SageMaker/mle-capstone/datasets/FinancialPhraseBank/Sentences_AllAgree.txt
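
The loop above relies on the label arriving as a separate `.@positive`, `.@neutral` or `.@negative` token. If a sentence ended with any other character before the `@`, its words would be carried over into the next phrase. A possible alternative, sketched below under the assumption that every line of the file has the form `sentence@label`, is to split each line on its last `@`. The helper name `parse_phrasebank_line` and the sample lines are hypothetical and not taken from the dataset.

```python
def parse_phrasebank_line(line):
    """Split one 'sentence@label' line into (lower-cased text, sentiment)."""
    # rsplit with maxsplit=1 only splits on the final '@', so an '@' inside
    # the sentence itself (e.g. an e-mail address) is left untouched.
    text, sentiment = line.rstrip("\n").rsplit("@", 1)
    return text.strip().lower(), sentiment.strip()


# Made-up example lines, assumed to follow the 'sentence@label' layout.
sample_lines = [
    "Net sales increased .@positive\n",
    "Contact us at info@example.com for details !@neutral\n",
]

parsed = [parse_phrasebank_line(line) for line in sample_lines]
for text, sentiment in parsed:
    print(sentiment, "|", text)
```

Unlike the word-by-word loop, this variant keeps the final full stop in the text and raises a `ValueError` on a line with no `@`, which makes malformed lines visible instead of silently merging them.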

In [14]:
def create_df(df_phrasebook):
    # create sets
    set_100pct = set(df_phrasebook[df_phrasebook['confidence']==1.0]['text'])
    set_75pct = set(df_phrasebook[df_phrasebook['confidence']==0.75]['text'])
    set_66pct = set(df_phrasebook[df_phrasebook['confidence']==0.66]['text'])
    set_50pct = set(df_phrasebook[df_phrasebook['confidence']==0.50]['text'])

    # between 50% and 66% confidence
    conf_set_1 = set_50pct - set_66pct - set_75pct - set_100pct
    conf_1 = round((0.5+0.66)/2,2)

    # between 66% and 75% confidence
    conf_set_2 = set_66pct - set_75pct - set_100pct
    conf_2 = round((0.66+0.75)/2,2)

    # between 75% and 100% confidence
    conf_set_3 = set_75pct - set_100pct
    conf_3 = round((0.75+1)/2,2)

    # 100% confidence
    conf_set_4 = set_100pct
    conf_4 = 1.0

    # create dataframe of unique values
    df_phrasebook_new = df_phrasebook[['sentiment','text']]
    df_phrasebook_new = df_phrasebook_new.drop_duplicates(['sentiment','text'])
    row_count = df_phrasebook_new.shape[0]
    
    def set_conf_levels(text):
        if text in conf_set_1:
            return conf_1
        elif text in conf_set_2:
            return conf_2
        elif text in conf_set_3:
            return conf_3
        elif text in conf_set_4:
            return conf_4
   
    df_phrasebook_new['confidence'] = df_phrasebook_new['text'].apply(set_conf_levels)
    
    print('Expected row count condition met: ' + str(len(conf_set_4) + len(conf_set_3) + 
                                                     len(conf_set_2) + len(conf_set_1) == row_count))
    
    return df_phrasebook_new   
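
The set differences in `create_df` assign each sentence to the strictest file it appears in and then use the midpoint of that agreement band as its confidence. A roughly equivalent formulation, shown here only as a sketch on made-up rows, takes the maximum file threshold per (sentiment, text) pair with a pandas `groupby`. The names `df_long`, `upper_bound` and `df_max` are hypothetical.

```python
import pandas as pd

# Hypothetical long-format rows, one per (file, sentence), like those built by the parsing loop.
df_long = pd.DataFrame({
    "sentiment": ["positive"] * 4 + ["neutral"] * 3 + ["negative"],
    "text": ["a", "a", "a", "a", "b", "b", "b", "c"],
    "confidence": [1.0, 0.75, 0.66, 0.5, 0.75, 0.66, 0.5, 0.5],
})

# Upper edge of each agreement band, keyed by the band's lower threshold.
upper_bound = {0.5: 0.66, 0.66: 0.75, 0.75: 1.0, 1.0: 1.0}

# Strictest file threshold each (sentiment, text) pair appears under.
df_max = df_long.groupby(["sentiment", "text"], as_index=False)["confidence"].max()

# Replace the threshold with the midpoint of its band, rounded as in create_df.
df_max["confidence"] = df_max["confidence"].map(
    lambda lower: round((lower + upper_bound[lower]) / 2, 2)
)
print(df_max)
```

Because this groups on both columns, a text that carries two different sentiments would still appear twice, so the duplicate check that follows remains necessary.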
    

In [15]:
df_phrasebook_new = create_df(df_phrasebook)

Expected row count condition met: False


In [16]:
# search for duplicated text -- there should be none unless there are duplicated texts with different sentiments
df_phrasebook_new[df_phrasebook_new['text'].duplicated()]['text']

9743      telecomworldwire-7 april 2006-tj group plc se...
10433     the group 's business is balanced by its broa...
Name: text, dtype: object

In [17]:
# duplicated texts
duplicated_1 = df_phrasebook_new[df_phrasebook_new['text'].duplicated()]['text'][9743]
duplicated_2 = df_phrasebook_new[df_phrasebook_new['text'].duplicated()]['text'][10433]

# see duplicated texts in original dataframes
df_phrasebook[(df_phrasebook['text']==duplicated_1) | (df_phrasebook['text']==duplicated_2)]

| | sentiment | text | confidence |
|---|---|---|---|
| 5626 | positive | telecomworldwire-7 april 2006-tj group plc se... | 0.66 |
| 6234 | positive | the group 's business is balanced by its broa... | 0.66 |
| 9743 | neutral | telecomworldwire-7 april 2006-tj group plc se... | 0.5 |
| 9744 | positive | telecomworldwire-7 april 2006-tj group plc se... | 0.5 |
| 10432 | positive | the group 's business is balanced by its broa... | 0.5 |
| 10433 | neutral | the group 's business is balanced by its broa... | 0.5 |


In [18]:
# drop ambiguous examples
drop_indices = [9743,9744,10432,10433]
df_phrasebook = df_phrasebook.drop(index=drop_indices)
df_phrasebook_new = create_df(df_phrasebook)

Expected row count condition met: True
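
The cell above removes the ambiguous rows by hard-coded index, which only works for this exact parse of the files. A more general option, sketched here on made-up data, is to count the distinct sentiment labels attached to each text and keep only texts with a single label. Note that this is a stricter rule than the one used above: it removes every row for a conflicting text, whereas the notebook keeps the 0.66-confidence rows for the two affected sentences. The names `df` and `df_clean` are hypothetical.

```python
import pandas as pd

# Hypothetical rows: text "x" carries two conflicting sentiment labels.
df = pd.DataFrame({
    "sentiment": ["positive", "neutral", "positive", "negative"],
    "text": ["x", "x", "y", "z"],
    "confidence": [0.5, 0.5, 0.5, 0.5],
})

# Number of distinct sentiment labels attached to each text, aligned row by row.
labels_per_text = df.groupby("text")["sentiment"].transform("nunique")

# Keep only rows whose text has exactly one label.
df_clean = df[labels_per_text == 1].reset_index(drop=True)
print(df_clean)
```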


In [20]:
print(df_phrasebook_new.shape)
df_phrasebook_new.head()

(4778, 3)


| | sentiment | text | confidence |
|---|---|---|---|
| 0 | neutral | according to gran , the company has no plans ... | 1.0 |
| 1 | positive | for the last quarter of 2010 , componenta 's ... | 1.0 |
| 2 | positive | in the third quarter of 2010 , net sales incr... | 1.0 |
| 3 | positive | operating profit rose to eur 13.1 mn from eur... | 1.0 |
| 4 | positive | operating profit totalled eur 21.1 mn , up fr... | 1.0 |


In [21]:
# save dataframe to csv
df_phrasebook_new.to_csv(config.FINANCIAL_PHRASE_BANK, index=False)

## Process Dictionaries

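The domain-specific dictionary is supplied as five JSON files (directionality, lagging, leading, negative and positive). Below we load them, merge them into a single word-to-type table, check for duplicates and save the result to csv.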


In [22]:
DICT_DIRECTIONALITY = '/home/ec2-user/SageMaker/financial_headline_sentiment/datasets/FinancialDictionary/directionality.json'
DICT_LAGGING = '/home/ec2-user/SageMaker/financial_headline_sentiment/datasets/FinancialDictionary/lagging.json'
DICT_LEADING = '/home/ec2-user/SageMaker/financial_headline_sentiment/datasets/FinancialDictionary/leading.json'
DICT_NEGATIVE = '/home/ec2-user/SageMaker/financial_headline_sentiment/datasets/FinancialDictionary/negative.json'
DICT_POSITIVE = '/home/ec2-user/SageMaker/financial_headline_sentiment/datasets/FinancialDictionary/positive.json'
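
The absolute paths above tie the notebook to one machine. As a sketch of an alternative, the five paths could be derived from a single base directory, in the same spirit as the `config.DATASETS_LOC` lookup used for the phrase bank. The helper `dictionary_paths` and the base directory `"datasets"` are hypothetical, and it is an assumption that the dictionary files live in a `FinancialDictionary` folder under that base.

```python
import os

# File stems of the five dictionary files used in this notebook.
DICTIONARY_NAMES = ("directionality", "lagging", "leading", "negative", "positive")


def dictionary_paths(base_dir, names=DICTIONARY_NAMES):
    """Map each dictionary name to '<base_dir>/FinancialDictionary/<name>.json'."""
    dict_dir = os.path.join(base_dir, "FinancialDictionary")
    return {name: os.path.join(dict_dir, name + ".json") for name in names}


# "datasets" is a placeholder; in the notebook this would be the datasets location.
paths = dictionary_paths("datasets")
for name, path in paths.items():
    print(name, "->", path)
```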

In [23]:
def load_dictionary(file_path):
    with open(file_path, 'rb') as f:
        return json.load(f)

In [24]:
# load json dictionaries from file
leading = load_dictionary(DICT_LEADING)
lagging = load_dictionary(DICT_LAGGING)
directionality = load_dictionary(DICT_DIRECTIONALITY)
positive = load_dictionary(DICT_POSITIVE)
negative = load_dictionary(DICT_NEGATIVE)

dictionaries = [leading, lagging, directionality, positive, negative]

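# check for duplicated words within each dictionary
# (note: the keys of a parsed JSON object are already unique, so this does
# not detect words that are shared between dictionaries)
total_duplicates = 0
domain_dict = dict()
for dictionary in dictionaries:
    domain_dict.update(dictionary)
    total_duplicates += pd.DataFrame(list(dictionary.keys())).duplicated().sum()
total_duplicates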

0
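
The check above looks at each dictionary on its own, and `dict.update` silently overwrites a word that appears in more than one dictionary. To see whether any words are shared across dictionaries, the key sets can be compared pairwise. The snippet below is a minimal sketch on made-up dictionaries. The helper `shared_words` and the contents of `named_dicts` are hypothetical, and no claim is made about what this would report on the real files.

```python
from itertools import combinations

# Hypothetical stand-ins for the loaded JSON dictionaries (word -> type).
named_dicts = {
    "leading": {"deal": "leading", "stores": "leading"},
    "lagging": {"revenue": "lagging", "deal": "lagging"},
    "positive": {"growth": "positive"},
}


def shared_words(named_dicts):
    """Return {(name_a, name_b): words present in both} for every overlapping pair."""
    overlaps = {}
    for (name_a, dict_a), (name_b, dict_b) in combinations(named_dicts.items(), 2):
        # dict key views support set intersection directly.
        common = dict_a.keys() & dict_b.keys()
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


overlaps = shared_words(named_dicts)
print(overlaps)
```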

In [25]:
# create dataframe from json objects
df_domain_dict = pd.DataFrame(np.array([list(domain_dict.keys()), list(domain_dict.values())]).transpose(), columns=['word','type'])

# rename the 'positive' and 'negative' word types so they are not confused with the sentiment classes
# (restricted to the 'type' column so that entries in the 'word' column are left unchanged)
df_domain_dict['type'] = df_domain_dict['type'].replace({'positive': 'pos', 'negative': 'neg'})

# check for duplicates
df_domain_dict.duplicated().sum()

0

In [26]:
# visual check
df_domain_dict.head()

| | word | type |
|---|---|---|
| 0 | operations | leading |
| 1 | new service | leading |
| 2 | stores | leading |
| 3 | deal | leading |
| 4 | passenger | leading |


In [27]:
# save dictionary
df_domain_dict.to_csv(config.DOMAIN_DICTIONARY, index=False)