In [1]:
import os
import pandas as pd
import numpy as np

# This notebook is meant to familiarize the reader with the sentiment lexicon employed for the tasks of polarity words extraction and key phrase sentiment tagging.
# The notebook is organized as follows:
- Lexicons employed
    - Lexicon 1 generation and preprocessing
    - Lexicon 2 generation and preprocessing
- Final lexicon generation

## Lexicons employed
Two sentiment lexicons were combined into a single lexicon for this project. Each contains polarity words and their sentiment on a binary scale of positive or negative. The first was built by Loughran and McDonald (2011) and the second by Hu and Liu (2004). The choice of these lexicons follows Shapiro et al. (2020), who show that the combination of the two outperformed other combinations when classifying financial news sentiment. Although this project uses the lexicons for different tasks, the characteristics that make the combination effective there are expected to carry over.

Both lexicons are introduced below.

### Lexicon 1 - Loughran and McDonald (2011) (updated 2020)
- The lexicon consists of roughly 2,700 polarity words extracted from companies' 10-K reports and was originally used in the paper 'When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks' (Loughran & McDonald, 2011). It has been updated regularly since then, and the latest version is the one employed in this project. The lexicon is freely available on [Bill McDonald's page](https://sraf.nd.edu/textual-analysis/resources/). Despite its rather small size, it is particularly suitable for this project because word sentiment was assessed specifically within the finance/economics domain. As a result, words whose polarity differs outside business contexts are handled appropriately and erroneous interpretation is reduced.

In [2]:
PATH= os.path.join('..','external_datasets')
SRC = os.path.join(PATH,'LoughranMcDonald_MasterDictionary_2020.csv')
master_dictionary= pd.read_csv(SRC)

The master dictionary file includes additional information that is not used in this project. Of the six sentiment categories (negative, positive, uncertainty, litigious, strong_modal, weak_modal), only negative and positive are of interest. A sample of the original file, before preprocessing, is shown below.

In [3]:
master_dictionary.head(10)

Unnamed: 0,Word,Seq_num,Word Count,Word Proportion,Average Proportion,Std Dev,Doc Count,Negative,Positive,Uncertainty,Litigious,Strong_Modal,Weak_Modal,Constraining,Complexity,Syllables,Source
0,AARDVARK,1,312,1.42205e-08,1.335201e-08,3.700747e-06,96,0,0,0,0,0,0,0,0,2,12of12inf
1,AARDVARKS,2,3,1.367356e-10,8.882163e-12,9.362849e-09,1,0,0,0,0,0,0,0,0,2,12of12inf
2,ABACI,3,9,4.102067e-10,1.200533e-10,5.359747e-08,7,0,0,0,0,0,0,0,0,3,12of12inf
3,ABACK,4,15,6.836779e-10,4.080549e-10,1.406914e-07,14,0,0,0,0,0,0,0,0,2,12of12inf
4,ABACUS,5,8009,3.650384e-07,3.798698e-07,3.523914e-05,1058,0,0,0,0,0,0,0,0,3,12of12inf
5,ABACUSES,6,0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,4,12of12inf
6,ABAFT,7,4,1.823141e-10,2.363561e-11,2.491473e-08,1,0,0,0,0,0,0,0,0,2,12of12inf
7,ABALONE,8,140,6.380994e-09,5.055956e-09,1.080602e-06,47,0,0,0,0,0,0,0,0,4,12of12inf
8,ABALONES,9,1,4.557853e-11,8.502178e-11,8.9623e-08,1,0,0,0,0,0,0,0,0,4,12of12inf
9,ABANDON,10,118075,5.381685e-06,4.661541e-06,3.338305e-05,62949,2009,0,0,0,0,0,0,0,3,12of12inf


Polarity is encoded by a year tag in the respective column: a non-zero value gives the year in which the word was added to that category. For example, 'abandon' above carries 2009 in the Negative column, so it was added in 2009 and is classified as negative. Below, the master dictionary is filtered to keep only words tagged as either negative or positive.

The dataframe must be reduced to two columns, word and sentiment. To do so, separate dataframes for the positive and negative words are created and then merged into one. The steps are illustrated below.
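The filtering step described above can also be sketched with a boolean mask on a toy frame. This is a minimal illustration, not the notebook's own cell: the three rows and the helper name `tagged_words` are made up for the example, and only the column names `Word`, `Negative` and `Positive` are taken from the file. The sketch keeps rows whose year tag is greater than zero, which matches a non-zero test as long as the columns hold no negative values (an assumption that is not checked here).

```python
import pandas as pd

# Toy rows that mimic three columns of the master dictionary (values are illustrative).
toy = pd.DataFrame({
    "Word": ["ABANDON", "ABLE", "ABACUS"],
    "Negative": [2009, 0, 0],
    "Positive": [0, 2009, 0],
})

def tagged_words(df, column, label):
    """Hypothetical helper: keep words with a positive year tag in `column`."""
    subset = df.loc[df[column] > 0, ["Word"]].copy()
    subset["sentiment"] = label
    return subset

lexicon = pd.concat(
    [tagged_words(toy, "Positive", "positive"),
     tagged_words(toy, "Negative", "negative")],
    ignore_index=True,
).rename(columns={"Word": "word"})

print(lexicon)
```

Words with no tag in either column, such as the third toy row, drop out of the result.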

In [4]:
""" Create a pandas DataFrame consisting only of columns 'Word' and 'Positive' and replace all 0s with NaN to
ease removing these observations. Following the described operation, only the positive words are left and column 'Positive' is removed
and replaced by column sentiment which is subsequently populated with the sentiment class - positive. The same procedure is 
repeated for the negative words in the master dictionary dataframe. Lastly, the two dataframes are merged to generate 
the final Loughran & McDonald lexicon dataframe."""

# Generate the dataframe consisting of columns 'Word' and 'Positive'
positive_df=master_dictionary[['Word','Positive']]
# Replace all 0s with NaN, drop the rows that have NaN in 'Positive', then keep only the 'Word' column.
positive_df = positive_df.replace(0, np.nan).dropna(subset=['Positive'])[['Word']].copy()
# Column sentiment is added and populated with 'positive'
positive_df['sentiment']='positive'

# Generate the dataframe consisting of columns 'Word' and 'Negative'
negative_df=master_dictionary[['Word','Negative']]
# Replace all 0s with NaN, drop the rows that have NaN in 'Negative', then keep only the 'Word' column.
negative_df = negative_df.replace(0, np.nan).dropna(subset=['Negative'])[['Word']].copy()
# Column sentiment is added and populated with 'negative'
negative_df['sentiment']='negative'

# Merge the positive and negative dataframes to generate the final Loughran & McDonald lexicon.
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.
LM_lexicon = pd.concat([positive_df, negative_df], ignore_index=True)
# Lowercase the 'Word' column name
LM_lexicon.columns=['word','sentiment']
LM_lexicon

Unnamed: 0,word,sentiment
0,ABLE,positive
1,ABUNDANCE,positive
2,ABUNDANT,positive
3,ACCLAIMED,positive
4,ACCOMPLISH,positive
...,...,...
2704,WRONGDOING,negative
2705,WRONGDOINGS,negative
2706,WRONGFUL,negative
2707,WRONGFULLY,negative


The label distribution is shown below. It is heavily skewed towards negative words, which account for close to 87% of all words in the lexicon.


In [5]:
LM_lexicon.sentiment.value_counts()

negative    2355
positive     354
Name: sentiment, dtype: int64
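As a quick check of the 87% figure, the shares can be recomputed from the two counts printed above. The sketch below rebuilds the counts by hand so that it runs on its own; inside the notebook, `LM_lexicon.sentiment.value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"negative": 2355, "positive": 354}, name="sentiment")

# Share of each label in the lexicon, rounded to three decimals.
shares = (counts / counts.sum()).round(3)
print(shares)
```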

### Lexicon 2 - Hu and Liu (2004)
- Hu and Liu developed their lexicon while mining online customer reviews. The lexicon contains close to 6,800 sentiment words classified on a binary scale of negative or positive. Its size makes it a useful complement to Loughran and McDonald's lexicon when generating the sentiment lexicon used in this project. The lexicon is freely distributed on [Bing Liu's page](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon).
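When a one-word-per-line list like this is read with `pd.read_csv`, two details are worth guarding against. The sketch below uses an in-memory toy file. The `;`-prefixed header lines are an assumption about how a raw word list might look, not something shown in this notebook, and the three words are placeholders. `comment=';'` skips such lines, and `keep_default_na=False` stops pandas from converting tokens such as `null` into missing values.

```python
import io
import pandas as pd

# Toy stand-in for a word-list file; the ';' header lines are an assumed layout.
raw = io.StringIO("; illustrative header line\n;\n\nabound\nabundant\nnull\n")

words = pd.read_csv(
    raw,
    header=None,
    names=["word"],
    comment=";",            # ignore lines that start with ';'
    keep_default_na=False,  # keep tokens like 'null' as plain strings
)
words["sentiment"] = "positive"
print(words)
```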

In [6]:
SRC_POS = os.path.join(PATH,'Hu_Liu_positive.txt')
SRC_NEG=os.path.join(PATH,'Hu_Liu_negative.txt')

"""The two separate txt files containing the positive and negative words are processed as separate dataframes. 
Column headers are added to both and their sentiment is populated before merging them into one dataframe."""

# Create the positive words dataframe
positive_words=pd.read_csv(SRC_POS,header=None,names=['word','sentiment'])
positive_words['sentiment']='positive'
# Create the negative words dataframe
negative_words=pd.read_csv(SRC_NEG,header=None,names=['word','sentiment'])
negative_words['sentiment']='negative'

# Merge both to create the final Hu and Liu lexicon dataframe
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
HL_lexicon = pd.concat([positive_words, negative_words], ignore_index=True)
HL_lexicon

Unnamed: 0,word,sentiment
0,a+,positive
1,abound,positive
2,abounds,positive
3,abundance,positive
4,abundant,positive
...,...,...
6784,zaps,negative
6785,zealot,negative
6786,zealous,negative
6787,zealously,negative


The label distribution is shown below. Again, it is skewed towards negative words, which account for about 70% of the polarity words in the lexicon.


In [7]:
HL_lexicon.sentiment.value_counts()

negative    4783
positive    2006
Name: sentiment, dtype: int64

## Final lexicon generation
- The final lexicon is the union of Loughran and McDonald's and Hu and Liu's lexicons. As mentioned, this combination follows Shapiro et al. (2020). Because Loughran and McDonald's lexicon targets the finance/economics domain specifically, whenever a word appears in both lexicons the entry from Loughran and McDonald's lexicon is kept and the one from Hu and Liu's is removed.
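The precedence rule can be illustrated on a toy example. The words below are placeholders rather than real lexicon entries, and the frame names `domain` and `general` are hypothetical. The point is that the domain-specific frame comes first in the concatenation, so `drop_duplicates(keep="first")` retains its label when a word appears in both frames.

```python
import pandas as pd

# Placeholder entries: 'alpha' appears in both frames with conflicting labels.
domain = pd.DataFrame({"word": ["ALPHA", "BETA"],
                       "sentiment": ["negative", "positive"]})
general = pd.DataFrame({"word": ["alpha", "gamma"],
                        "sentiment": ["positive", "negative"]})

# The domain lexicon goes first so that its rows win when duplicates are dropped.
combined = pd.concat([domain, general], ignore_index=True)
# Lowercase before deduplicating, otherwise 'ALPHA' and 'alpha' would both survive.
combined["word"] = combined["word"].str.lower()
combined = combined.drop_duplicates(subset="word", keep="first").reset_index(drop=True)

print(combined)
```

Lowercasing has to happen before the duplicates are dropped, which is also the order used in the cell below.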

In [8]:
# Merge both lexicons to create the final sentiment lexicon
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
sentiment_lexicon = pd.concat([LM_lexicon, HL_lexicon])
# Lowercase all words
sentiment_lexicon["word"] = sentiment_lexicon["word"].str.lower()
# Drop any duplicates, keeping the first observation, which is the word from Loughran and McDonald's lexicon.
sentiment_lexicon = sentiment_lexicon.drop_duplicates(subset="word", keep="first")


sentiment_lexicon

Unnamed: 0,word,sentiment
0,able,positive
1,abundance,positive
2,abundant,positive
3,acclaimed,positive
4,accomplish,positive
...,...,...
6784,zaps,negative
6785,zealot,negative
6786,zealous,negative
6787,zealously,negative


In [9]:
sentiment_lexicon.sentiment.value_counts()

negative    6294
positive    2159
Name: sentiment, dtype: int64

In [10]:
SRC = os.path.join('..','project_datasets')
path = os.path.join(SRC,'sentiment_lexicon.csv')
sentiment_lexicon.to_csv(path, index=False)
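For the downstream polarity-word extraction, a two-column lexicon like this one can be turned into a dictionary for fast lookups. The sketch below is only an illustration of that idea. The helper `tag_tokens` is hypothetical, and the two-row frame stands in for the saved file, using 'able' and 'abandon' with the labels shown earlier in this notebook.

```python
import pandas as pd

# Small stand-in for the saved lexicon.
lexicon = pd.DataFrame({"word": ["able", "abandon"],
                        "sentiment": ["positive", "negative"]})

# Map each word to its sentiment label.
polarity = dict(zip(lexicon["word"], lexicon["sentiment"]))

def tag_tokens(tokens, polarity):
    """Hypothetical helper: return (token, sentiment) pairs for tokens in the lexicon."""
    return [(t, polarity[t.lower()]) for t in tokens if t.lower() in polarity]

tagged = tag_tokens(["Able", "to", "abandon"], polarity)
print(tagged)
```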

### Bibliography

Shapiro, A.H., Sudhof, M. and Wilson, D.J., 2020. Measuring news sentiment. Journal of Econometrics.

[Loughran, T. and McDonald, B., 2011. When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1), pp.35-65.](https://sraf.nd.edu/textual-analysis/resources/)

[Hu, M. and Liu, B., 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004).](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon)