# Process Datasets

To build our sentiment analyser(s), we will use two core datasets:
* Financial PhraseBank: https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10/link/0c96051eee4fb1d56e000000/download
* Domain-specific dictionary: https://github.com/jperla/sentiment-data/tree/master/finance

The purpose of this notebook is to:
1. Download the required datasets to the notebook instance.
2. Restructure them into a tabular format.
3. Save the results as CSV files so they can be used throughout the project.

<b>NOTE:</b> No further analysis takes place in this notebook.

## Import Libraries

In [1]:
import pandas as pd
import os

## Setup Project Folder Structure

In [4]:
# get root directory (the notebook's current working directory)
root = os.getcwd()

# dataset directory
dataset_dir = root + '/datasets'

# financial phrase bank dataset directory
phrase_dir = dataset_dir + '/FinancialPhraseBank'

# financial sentiment dictionary dataset directory
dictionary_dir = dataset_dir + '/FinancialDictionary'

# create project folder structure (no error if a folder already exists)
for directory in (dataset_dir, phrase_dir, dictionary_dir):
    os.makedirs(directory, exist_ok=True)
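
The same layout can also be built with `pathlib`, which avoids manual string concatenation. This is a minimal sketch that assumes the folder names used above; the helper name `make_project_dirs` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def make_project_dirs(root):
    """Create the dataset folders under `root` and return their paths."""
    dataset_dir = Path(root) / 'datasets'
    dirs = {
        'dataset': dataset_dir,
        'phrase': dataset_dir / 'FinancialPhraseBank',
        'dictionary': dataset_dir / 'FinancialDictionary',
    }
    for path in dirs.values():
        # parents=True creates missing parents; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
    return dirs
```

Calling `make_project_dirs(os.getcwd())` would reproduce the structure created in the cell above, and calling it a second time is harmless.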

## Get Data

In [9]:
# get financial phrase bank dataset
# ! wget -O /home/ec2-user/SageMaker/mle-capstone/datasets/FinancialPhraseBank-v1.0.zip https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10/link/0c96051eee4fb1d56e000000/download

# unzip
# ! unzip /home/ec2-user/SageMaker/mle-capstone/datasets/FinancialPhraseBank-v1.0.zip -d "/home/ec2-user/SageMaker/mle-capstone/datasets/financialphrasebank"

# get financial dictionary from github into the dictionary directory defined above
!svn checkout https://github.com/jperla/sentiment-data/trunk/finance "{dictionary_dir}"

A    datasets/FinancialDictionary/LoughranMcDonald_Litigious.csv
A    datasets/FinancialDictionary/LoughranMcDonald_ModalStrong.csv
A    datasets/FinancialDictionary/LoughranMcDonald_ModalWeak.csv
A    datasets/FinancialDictionary/LoughranMcDonald_Negative.csv
A    datasets/FinancialDictionary/LoughranMcDonald_Positive.csv
A    datasets/FinancialDictionary/LoughranMcDonald_Uncertainty.csv
Checked out revision 3.
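
GitHub has since retired its Subversion bridge, so the `svn checkout` above may no longer work. As a fallback, the six files can be fetched one by one over HTTPS. The sketch below is not part of the original notebook: the raw-file URL layout, the constant names, and the helper names are all assumptions, and the download itself needs network access.

```python
import os
from urllib.request import urlretrieve

# ASSUMPTION: raw-file URL layout for the repository's master branch
RAW_BASE = 'https://raw.githubusercontent.com/jperla/sentiment-data/master/finance'
WORD_CLASSES = ['Litigious', 'ModalStrong', 'ModalWeak',
                'Negative', 'Positive', 'Uncertainty']


def dictionary_urls():
    """Map each expected CSV filename to its (assumed) raw download URL."""
    return {f'LoughranMcDonald_{c}.csv': f'{RAW_BASE}/LoughranMcDonald_{c}.csv'
            for c in WORD_CLASSES}


def download_dictionaries(dest_dir):
    """Download each CSV into dest_dir (requires network access)."""
    os.makedirs(dest_dir, exist_ok=True)
    for filename, url in dictionary_urls().items():
        urlretrieve(url, os.path.join(dest_dir, filename))
```

In the notebook this would be invoked as `download_dictionaries(dictionary_dir)`.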


## Import Data

### Financial Phrase Bank

In [6]:
# map each phrase bank file to its annotator agreement level (used as a confidence score)
conf_mapping = {'Sentences_AllAgree.txt':1, 'Sentences_75Agree.txt':0.75, 
                'Sentences_66Agree.txt':0.66, 'Sentences_50Agree.txt':0.5}
phrase_files = list(conf_mapping.keys())

# end-of-phrase label markers and the sentiment class each one maps to
class_mapping = {'.@positive': 'positive', '.@neutral': 'neutral', '.@negative': 'negative'}
stop_words = list(class_mapping.keys())

# define empty dict to store data
dict_phrasebank = {'sentiment':[], 'text':[], 'confidence':[]}

# initiate empty text
text = ''

# iterate over each text file
for phrase_file in phrase_files:
    
    # get confidence score relating to text file
    confidence = conf_mapping[phrase_file]
    
    # open the text file using ISO-8859-1 encoding
    with open(phrase_dir + '/' + phrase_file, 'r', encoding='ISO-8859-1') as f:
        
        # iterate over lines in file
        for line in f:
            
            # iterate over words in line
            for word in line.split():
                
                # when stop word is reached
                if word in stop_words:
                    
                    # update dictionary to store sentiment, text and confidence
                    sentiment = class_mapping[word]
                    dict_phrasebank['sentiment'].append(sentiment)
                    dict_phrasebank['text'].append(text.strip())  # strip the leading space added while joining words
                    dict_phrasebank['confidence'].append(confidence)
                    
                    # reset text for next phrase
                    text = ''
                
                # otherwise add word to body of text 
                else:
                    text = text + ' ' + word

# create dataframe 
df_phrasebook = pd.DataFrame(dict_phrasebank)

# save dataframe to csv (index=False avoids an extra unnamed index column when the file is read back)
df_phrasebook.to_csv(phrase_dir + '/FinancialPhraseBook.csv', index=False)

# remove duplicate data to save space
#!rm /home/ec2-user/SageMaker/mle-capstone/datasets/FinancialPhraseBank/Sentences_50Agree.txt
#!rm /home/ec2-user/SageMaker/mle-capstone/datasets/FinancialPhraseBank/Sentences_66Agree.txt
#!rm /home/ec2-user/SageMaker/mle-capstone/datasets/FinancialPhraseBank/Sentences_75Agree.txt
#!rm /home/ec2-user/SageMaker/mle-capstone/datasets/FinancialPhraseBank/Sentences_AllAgree.txt
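
The word-by-word loop above only closes a phrase when it meets a `.@label` token, so a sentence ending in any other character would be merged into the next one. A more direct sketch, assuming every line has the form `sentence@label` (which the `.@positive`-style markers suggest), splits each line at its last `@`. The function name `parse_phrase_lines` is hypothetical and not part of the original notebook.

```python
def parse_phrase_lines(lines, confidence):
    """Parse 'sentence@label' lines into a dict of equal-length columns."""
    rows = {'sentiment': [], 'text': [], 'confidence': []}
    for line in lines:
        line = line.strip()
        if '@' not in line:
            continue  # skip blank or malformed lines
        # split at the last '@' so any '@' inside the sentence is preserved
        text, sentiment = line.rsplit('@', 1)
        rows['sentiment'].append(sentiment)
        rows['text'].append(text.strip())
        rows['confidence'].append(confidence)
    return rows
```

An open file handle can be passed directly as `lines`. Unlike the loop above, this variant keeps the sentence's final punctuation in `text`.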

In [7]:
df_phrasebook.head()

| | sentiment | text | confidence |
|---|---|---|---|
| 0 | neutral | According to Gran , the company has no plans ... | 1.0 |
| 1 | positive | For the last quarter of 2010 , Componenta 's ... | 1.0 |
| 2 | positive | In the third quarter of 2010 , net sales incr... | 1.0 |
| 3 | positive | Operating profit rose to EUR 13.1 mn from EUR... | 1.0 |
| 4 | positive | Operating profit totalled EUR 21.1 mn , up fr... | 1.0 |
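
Because all four agreement files are loaded, a sentence that appears in more than one of them may occur several times in the DataFrame with different confidence values. If one row per sentence is preferred, a sketch like the following keeps the highest confidence for each sentence. This is an optional step that is not part of the original notebook, and the toy rows are invented for illustration.

```python
import pandas as pd

# toy stand-in for df_phrasebook (rows invented for illustration)
df = pd.DataFrame({
    'sentiment': ['positive', 'positive', 'neutral'],
    'text': ['Sales rose', 'Sales rose', 'No change'],
    'confidence': [1.0, 0.5, 0.75],
})

# sort so the highest confidence comes first, then keep the first row per sentence
df_unique = (df.sort_values('confidence', ascending=False)
               .drop_duplicates(subset='text', keep='first')
               .reset_index(drop=True))
```

Whether to deduplicate here or leave it to downstream notebooks is a design choice; keeping every row preserves the per-file agreement information.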


### Financial Dictionaries

In [17]:
df_Litigious = pd.read_csv(dictionary_dir + '/LoughranMcDonald_Litigious.csv', names=['word','year'])
df_ModalStrong = pd.read_csv(dictionary_dir + '/LoughranMcDonald_ModalStrong.csv', names=['word','year'])
df_ModalWeak = pd.read_csv(dictionary_dir + '/LoughranMcDonald_ModalWeak.csv', names=['word','year'])
df_Negative = pd.read_csv(dictionary_dir + '/LoughranMcDonald_Negative.csv', names=['word','year'])
df_Positive = pd.read_csv(dictionary_dir + '/LoughranMcDonald_Positive.csv', names=['word','year'])
df_Uncertainty = pd.read_csv(dictionary_dir + '/LoughranMcDonald_Uncertainty.csv', names=['word','year'])

* Litigious:
* ModalStrong:
* ModalWeak:
* Negative:
* Positive:
* Uncertainty:

It is easier to work with a single labelled DataFrame than with a separate DataFrame for each word class.

In [21]:
# add classes
df_Litigious['type'] = 'litigious'
df_ModalStrong['type'] = 'modalstrong'
df_ModalWeak['type'] = 'modalweak'
df_Negative['type'] = 'negative'
df_Positive['type'] = 'positive'
df_Uncertainty['type'] = 'uncertainty'

# create single dataframe
df_financialDict = pd.concat([df_Litigious, df_ModalStrong, df_ModalWeak, 
                              df_Negative, df_Positive, df_Uncertainty])

# drop year column
df_financialDict.drop('year', axis=1, inplace=True)

# convert words to lowercase
df_financialDict['word'] = df_financialDict['word'].str.lower()

# save dictionary
df_financialDict.to_csv(dictionary_dir+'/LoughranMcDonald_FinancialDictionary.csv', index=False)

# remove the per-class CSV files now that they are merged into one file
for word_class in ['Litigious', 'ModalStrong', 'ModalWeak', 'Negative', 'Positive', 'Uncertainty']:
    os.remove(dictionary_dir + '/LoughranMcDonald_' + word_class + '.csv')

In [22]:
df_financialDict.head()

| | word | type |
|---|---|---|
| 0 | abovementioned | litigious |
| 1 | abrogate | litigious |
| 2 | abrogated | litigious |
| 3 | abrogates | litigious |
| 4 | abrogating | litigious |
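
The read, label, and concatenate steps above repeat the same pattern for each word class, so they can be condensed into a loop. The following is a minimal sketch on in-memory toy CSV text; the sample rows are invented, and the helper name `build_dictionary` is hypothetical. With the real data, the mapping would point at the files in `dictionary_dir` instead.

```python
import io
import pandas as pd


def build_dictionary(sources):
    """Combine per-class word lists into one labelled DataFrame.

    `sources` maps a class label to a file path or file-like object
    holding headerless 'word,year' rows.
    """
    frames = []
    for label, source in sources.items():
        frame = pd.read_csv(source, names=['word', 'year'])
        frame['type'] = label
        frames.append(frame)
    # ignore_index=True gives the combined frame a unique row index
    combined = pd.concat(frames, ignore_index=True).drop(columns='year')
    combined['word'] = combined['word'].str.lower()
    return combined


# toy data standing in for two of the six CSV files
toy = {
    'negative': io.StringIO('ABANDON,2009\nLOSS,2009\n'),
    'positive': io.StringIO('ABLE,2009\n'),
}
df_toy = build_dictionary(toy)
```

Driving the process from a mapping means adding or removing a word class only requires changing one dictionary entry.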
