# Automatic Domain Adaptation of Sentiment Dictionary

## Implementation using scikit-learn 

Finance research often measures the “tone” of a text to predict financial outcomes. Sentiment dictionaries are widely used for this purpose, such as the Harvard Psychosociological Dictionary (Harvard-IV-4 TagNeg, H4N) and the Loughran and McDonald (L&M) dictionary.
- The dictionaries can be downloaded from: https://sraf.nd.edu/textual-analysis/resources/#Master%20Dictionary
 

In [1]:
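# Despite its name, H4N_PATH points to the L&M master dictionary CSV, which also carries a Harvard_IV column
H4N_PATH = '/mounts/work/sedinkina/LoughranMcdonald2011_Dictionary/lm_data/LoughranMcDonald_MasterDictionary_2016.csv'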

In [2]:
import pandas as pd
from collections import defaultdict
LMcD = pd.read_csv(H4N_PATH)
LMcD.head(5)

Unnamed: 0,Word,Sequence Number,Word Count,Word Proportion,Average Proportion,Std Dev,Doc Count,Negative,Positive,Uncertainty,Litigious,Constraining,Superfluous,Interesting,Modal,Irr_Verb,Harvard_IV,Syllables,Source
0,AARDVARK,1,275,1.603442e-08,1.306189e-08,3.665256e-06,82,0,0,0,0,0,0,0,0,0,0,2,12of12inf
1,AARDVARKS,2,3,1.749209e-10,1.028197e-11,1.014208e-08,1,0,0,0,0,0,0,0,0,0,0,2,12of12inf
2,ABACI,3,8,4.664558e-10,1.465871e-10,6.401309e-08,7,0,0,0,0,0,0,0,0,0,0,3,12of12inf
3,ABACK,4,6,3.498419e-10,1.758203e-10,7.213526e-08,6,0,0,0,0,0,0,0,0,0,0,2,12of12inf
4,ABACUS,5,6729,3.923477e-07,3.752169e-07,3.452425e-05,845,0,0,0,0,0,0,0,0,0,0,3,12of12inf


Reclassify the L&M negative category:
- First, convert the data into a dictionary that maps each label to its words: {label1: [word1, word2, ...], ...}

In [3]:
#get L&M dictionary 
LMcD["Word"] = LMcD["Word"].str.lower()
unc = LMcD.loc[lambda x: x["Uncertainty"] > 0, "Word"].tolist()
lit = LMcD.loc[lambda x: x["Litigious"] > 0, "Word"].tolist()
pos = LMcD.loc[lambda x: x["Positive"] > 0, "Word"].tolist()
neg =  LMcD.loc[lambda x: x["Negative"] > 0, "Word"].tolist()
LM_dict = {'Uncertainty':unc, 'Litigious': lit, 'Positive':pos, 'Negative': neg}

# In L&M, a word can belong to several categories
LM_word_cl = defaultdict(list)
for cl,words in LM_dict.items():
    for w in words:
        LM_word_cl[w].append(cl)

#convert into a dictionary with two labels 
LABEL1 = 'Negative'
LABEL2 = 'Common'
target_dic = defaultdict(set)
target_dic[LABEL1] = set(neg) #neg category
target_dic[LABEL2] = set(unc+lit+pos) #other words

In [4]:
# Alternatively, reclassify the Harvard (H4N) dictionary instead:
#LMcD["Word"] = LMcD["Word"].str.lower()
#target_dic = defaultdict(set)
#target_dic[LABEL1] = set(LMcD.loc[lambda x: x["Harvard_IV"] > 0, "Word"].tolist())
#target_dic[LABEL2] = set(LMcD.loc[lambda x: x["Harvard_IV"] == 0, "Word"].tolist())

In [5]:
print(f'Number of all words in dictionary is {LMcD.shape[0]}')
print(f'Number of {LABEL1} words in dictionary is {len(target_dic[LABEL1])}')
print(f'Number of {LABEL2} words in dictionary is {len(target_dic[LABEL2])}')

Number of all words in dictionary is 85221
Number of Negative words in dictionary is 2355
Number of Common words in dictionary is 1554
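
The two label counts add up to 3909, while `reclassify()` further down reports 3715 distinct dictionary words, so some words appear to sit in both sets. The sketch below uses hypothetical toy words (not taken from the L&M file) to show how such an overlap can be inspected, and what happens when the sets are flattened into a single word-to-label mapping, as the `word_to_cl` comprehension in the class does.

```python
# Toy label sets; the words are illustrative assumptions, not L&M entries.
toy_dic = {
    "Negative": {"loss", "litigation", "doubt"},
    "Common": {"litigation", "doubt", "approximately", "gain"},
}

# Words that carry both labels.
overlap = toy_dic["Negative"] & toy_dic["Common"]

# Flattening to word -> label keeps only the label seen last for each word,
# so overlapping words end up under the later key ("Common" here).
word_to_label = {w: label for label, words in toy_dic.items() for w in words}

print(sorted(overlap))
print(word_to_label["litigation"])
print(len(word_to_label))
```

If the overlap matters for a given experiment, one option is to remove the overlapping words from one of the two sets before training; whether that is appropriate depends on the use case.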


- Load **domain-specific** word embeddings, e.g. embeddings learned from a corpus of financial 10-K filings, and inspect them briefly

In [6]:
from gensim.models import KeyedVectors
EMB_PATH = '/mounts/work/sedinkina/models/preprocessed_doppelpunkt.vec'
emb_model_fin = KeyedVectors.load_word2vec_format(EMB_PATH)




In [7]:
# 'rebound' is a positive word in the L&M dictionary, yet several of its nearest neighbours here read as negative
emb_model_fin.most_similar('rebound', topn=9)

[('rebounding', 0.5815426707267761),
 ('rebounded', 0.5602948665618896),
 ('soften', 0.5499194860458374),
 ('softening', 0.5360822677612305),
 ('slowdown', 0.533230185508728),
 ('sluggish', 0.5260505080223083),
 ('weak', 0.5184440612792969),
 ('slump', 0.5068018436431885),
 ('downturn', 0.5012232065200806)]
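
`most_similar` ranks words by the cosine similarity of their vectors. The following is a minimal sketch of that ranking in plain NumPy; the three-dimensional vectors and the helper names `cosine` and `nearest` are made up for illustration and are not part of gensim.

```python
import numpy as np

# Hypothetical 3-dimensional vectors; real embeddings have far more dimensions.
toy_vectors = {
    "rebound": np.array([0.9, 0.1, 0.3]),
    "slowdown": np.array([0.8, 0.2, 0.4]),
    "dividend": np.array([0.0, 1.0, 0.1]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(word, vectors, topn=2):
    """Return the topn other words, sorted by descending cosine similarity."""
    query = vectors[word]
    scores = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbours = nearest("rebound", toy_vectors)
print(neighbours)
```

In this toy space "slowdown" comes out as the closest neighbour of "rebound" simply because the two vectors point in nearly the same direction, which mirrors how words used in similar contexts end up close together regardless of their polarity.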

### Train an SVM classifier in a 5-fold cross-validation setup:
- each classifier is trained on four folds and predicts the held-out fold, i.e. 20% of the dictionary
- binary classification using the dictionary labels LABEL1 and LABEL2
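
As a rough illustration of this setup, the sketch below collects out-of-fold class probabilities from `SVC(probability=True)` on synthetic data. It is not the notebook's implementation: it swaps the manual slicing for scikit-learn's `StratifiedKFold`, and the two Gaussian clusters are assumed stand-ins for the word vectors of the two labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for word vectors: two Gaussian clusters, one per label.
X = np.vstack([
    rng.normal(-1.0, 1.0, size=(60, 5)),
    rng.normal(1.0, 1.0, size=(40, 5)),
])
y = np.array([0] * 60 + [1] * 40)

# Every sample is predicted exactly once, by a model that never saw it in training.
oof_proba = np.zeros((len(y), 2))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    clf = SVC(probability=True, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    # Columns of predict_proba follow the order of clf.classes_ (here 0, 1).
    oof_proba[test_idx] = clf.predict_proba(X[test_idx])

print(oof_proba[:3])
```

Stratified folds keep the label ratio roughly constant across folds, which serves a similar purpose to splitting each label's word list separately before merging.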

In [8]:
import random
import numpy as np
from sklearn.svm import SVC
from typing import List, Dict, Tuple

    
class SentimentAnalyser:

    def __init__(self, dict_in: Dict[str, List[str]], emb_model: KeyedVectors): 
        
        self.NUM_FOLDS = 5
        self.dict_in = dict_in
        self.emb_model = emb_model
        self.word_to_cl = {w:cl for cl,words in dict_in.items() for w in words}
        self.label1, self.label2 = dict_in.keys() 
        self.cl_to_int = {self.label1: 0, self.label2: 1}
        self.reclassified = []
    
    def _split_data(self, data: List[str], i: int) -> Tuple[List[str], List[str]]:
        # Fold i is the test slice (about 20% of the data); everything else is used for training.
        # The last fold takes whatever remains after the first four slices.

        subset_size = round(len(data)/self.NUM_FOLDS)    
        if i != self.NUM_FOLDS-1:
            test_data_pos = data[i*subset_size:][:subset_size] 
            train_data_pos = data[:i*subset_size] + data[(i+1)*subset_size:]
        else:
            test_data_pos = data[i*subset_size:]
            train_data_pos = data[:i*subset_size]
        return train_data_pos, test_data_pos
    
    def _getProbs(self, predicted: np.ndarray, clf: SVC, test_data: List[str]) -> Dict[str, List[Tuple[str, float]]]:
        # Map every test word to a list of (label, probability) pairs.
        # The columns of `predicted` follow the order of clf.classes_.
        int_to_cl = {num: cl for cl, num in self.cl_to_int.items()}
        labels = [int_to_cl[c] for c in clf.classes_]
        probs = {}
        for word, row in zip(test_data, predicted):
            probs[word] = list(zip(labels, row))
        return probs
    
    def _train(self,train_data: List[str], test_data: List[str], i:int) -> Dict[str, List[Tuple[str, float]]]:
        
        vectors = np.array([self.emb_model[word] for word in train_data])
        classes = np.array([self.cl_to_int[self.word_to_cl[word]] for word in train_data])
        clf = SVC(probability=True)
        clf.fit(vectors, classes)  
        vectors_test = np.array([self.emb_model[word] for word in test_data])
        predicted = clf.predict_proba(vectors_test)
        probs = self._getProbs(predicted,clf,test_data)
        return probs    
            
    def get_data(self, thresh: float = 0.5) -> Dict[str, str]:
        return {word:pred for prob, pred, word in self.reclassified if prob >=thresh}
    
    def get_df_data(self, thres: float = 0.5) -> pd.DataFrame:
        word_to_cl = self.get_data(thres)
        words_old = sorted([w for cat, word_list in self.dict_in.items() for w in word_list])
        cat = self.label1
        csv_dict = defaultdict(list)
        csv_dict['Word'] =  words_old
        for w in words_old:
            csv_dict[cat].append(int(word_to_cl.get(w, 0) == cat))
        return pd.DataFrame(data=csv_dict)
                
    def reclassify(self):
        
        #words from dictionary for which we have embeddings
        wordsInModel = list(filter(lambda x: x in self.emb_model, list(self.word_to_cl)))
        print(f"{len(wordsInModel)} of dictionary words are in embedding model (out of {len(self.word_to_cl)})")
        
        wordsOfLabel1 = list(filter(lambda x: x in self.emb_model, self.dict_in[self.label1]))
        print(f"Number of {self.label1} words that are in embedding model is {len(wordsOfLabel1)}")
        
        wordsOfLabel2 = list(filter(lambda x: x in self.emb_model, self.dict_in[self.label2]))
        print(f"Number of {self.label2} words that are in embedding model is {len(wordsOfLabel2)}")
        
        for i in range(self.NUM_FOLDS):
            print(f"iteration {i+1}")
            
            train_data_label1,test_data_label1 = self._split_data(wordsOfLabel1,i) #predict 20% of the dictionary
            train_data_label2,test_data_label2 = self._split_data(wordsOfLabel2,i)

            train_data_re = train_data_label1+train_data_label2
            test_data_re = test_data_label1+test_data_label2
            random.shuffle(train_data_re)
            random.shuffle(test_data_re)    
            
            #print(f"len of training data is {len(train_data_re)}")
            #print(f"len of test is {len(test_data_re)}")
            probs = self._train(train_data_re,test_data_re,i)
            for word,y in probs.items():     
                pred,prob = sorted(y,key= lambda x:x[1],reverse=True)[0]
                data = [prob,pred,word]
                self.reclassified.append(data)
            

In [9]:
# SentimentAnalyser receives the old sentiment dictionary and a domain-specific embedding model as input
reclassifier = SentimentAnalyser(dict_in = target_dic,
                                 emb_model = emb_model_fin)

In [10]:
#reclassify initial data
reclassifier.reclassify()

2965 of dictionary words are in embedding model (out of 3715)
Number of Negative words that are in embedding model is 1893
Number of Common words that are in embedding model is 1223
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5


In [11]:
# get reclassified words with probability >= 0.8 (this threshold turned out to work best for predicting financial outcomes)
dictionary_re = reclassifier.get_data(0.8) 
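
The effect of the threshold can be seen on a few hand-made entries. This is a small sketch mirroring the filter in `get_data`; the triples and the function name `filter_by_confidence` are hypothetical.

```python
# Hypothetical (probability, predicted label, word) triples,
# in the same order as the entries stored in `reclassified`.
toy_reclassified = [
    (0.95, "Negative", "slump"),
    (0.62, "Negative", "resolve"),
    (0.81, "Common", "approximately"),
]

def filter_by_confidence(triples, thresh=0.5):
    """Keep only words whose winning class probability reaches the threshold."""
    return {word: label for prob, label, word in triples if prob >= thresh}

kept_loose = filter_by_confidence(toy_reclassified)
kept_strict = filter_by_confidence(toy_reclassified, thresh=0.8)

print(kept_loose)
print(kept_strict)
```

A word that falls below the threshold is simply absent from the result, so in `get_df_data` it receives a 0 for the target category.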

In [12]:
#get reclassified words
print("\n{:20}| {:15} {:15}".format("Reclassified Word", 'New Class', 'Initial L&M Class\n'))

for w, cl in dictionary_re.items():
    if cl == LABEL1 and w not in target_dic[LABEL1]:
        print("{:20}| {:15} {:15}".format(w, cl, " ".join(LM_word_cl[w])) )



Reclassified Word   | New Class       Initial L&M Class

uncontracted        | Negative        Litigious      
uncertainly         | Negative        Uncertainty    
possibly            | Negative        Uncertainty    
probation           | Negative        Litigious      
undetectable        | Negative        Uncertainty    
warrantor           | Negative        Litigious      
conceivably         | Negative        Uncertainty    
rescissions         | Negative        Litigious      
resolve             | Negative        Positive       
speculation         | Negative        Uncertainty    
rescinded           | Negative        Litigious      
risking             | Negative        Uncertainty    
recoupments         | Negative        Litigious      
unconstitutional    | Negative        Litigious      
regaining           | Negative        Positive       
unusually           | Negative        Uncertainty    
rebutting           | Negative        Litigious      
contractile         | Ne

In [13]:
dictionary_re_df = reclassifier.get_df_data(0.8) 
dictionary_re_df.head(10)

Unnamed: 0,Word,Negative
0,abandon,1
1,abandoned,1
2,abandoning,1
3,abandonment,1
4,abandonments,1
5,abandons,1
6,abdicated,0
7,abdicates,0
8,abdicating,0
9,abdication,0


In [15]:
dictionary_re_df.to_csv('test.csv')

### Reclassify all L&M words and save results in .csv

In [16]:
class SentimentAnalyserFactory:
    def create_analyzer(self,dict_in: Dict[str, List[str]], emb_model: KeyedVectors) -> SentimentAnalyser:
        analyzer = SentimentAnalyser(dict_in, emb_model)
        return analyzer

In [17]:
factory = SentimentAnalyserFactory()
new_dict = {}
for current_label, words in LM_dict.items():
    print(f'Reclassifying {current_label} label...')
    target_d = defaultdict(set)
    target_d[current_label] = set(words) 
    rest = [w for l, w_list in LM_dict.items() if l != current_label for w in w_list]
    target_d[LABEL2] = set(rest) #other words
    
    reclassifier = factory.create_analyzer(target_d,emb_model_fin)
    reclassifier.reclassify()
    
    new_word_cl = reclassifier.get_data(0.8)  
    new_cl_word = defaultdict(set)
    for new_w, new_cl in new_word_cl.items():
        new_cl_word[new_cl].add(new_w)

    new_dict[current_label] = new_cl_word[current_label]

Reclassifying Uncertainty label...
2965 of dictionary words are in embedding model (out of 3715)
Number of Uncertainty words that are in embedding model is 275
Number of Common words that are in embedding model is 2727
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
Reclassifying Litigious label...
2965 of dictionary words are in embedding model (out of 3715)
Number of Litigious words that are in embedding model is 622
Number of Common words that are in embedding model is 2457
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
Reclassifying Positive label...
2965 of dictionary words are in embedding model (out of 3715)
Number of Positive words that are in embedding model is 326
Number of Common words that are in embedding model is 2639
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
Reclassifying Negative label...
2965 of dictionary words are in embedding model (out of 3715)
Number of Negative words that are in embedding model is 1893
Number of Common w

In [18]:
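# note: a word that was reclassified into several categories keeps only the last one in this mapping
all_word_cl = {w:cat for cat, d_words in new_dict.items() for w in d_words}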
words_LM = sorted(LM_word_cl.keys())

In [19]:
cat1, cat2, cat3, cat4 = new_dict.keys()

In [20]:
csv_dict = defaultdict(list)
csv_dict['Word'] =  words_LM
for w in words_LM:
    # one 0/1 indicator column per reclassified category
    for cat in (cat1, cat2, cat3, cat4):
        csv_dict[cat].append(int(all_word_cl.get(w, 0) == cat))


In [21]:
reclassified_LM = pd.DataFrame(data=csv_dict)

In [22]:
reclassified_LM.tail(20)

Unnamed: 0,Word,Uncertainty,Litigious,Positive,Negative
3695,worse,0,0,0,1
3696,worsen,0,0,0,1
3697,worsened,0,0,0,1
3698,worsening,0,0,0,1
3699,worsens,0,0,0,1
3700,worst,0,0,0,0
3701,worthless,0,0,0,1
3702,worthy,0,0,1,0
3703,writ,0,0,0,0
3704,writedown,0,0,0,1


In [23]:
reclassified_LM.to_csv('reclassified_LM.csv')
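
By default `to_csv` also writes the row index as an unnamed first column. The sketch below, on a toy frame with assumed values, shows one way to read such a file back without an extra `Unnamed: 0` column; it uses an in-memory buffer instead of a file on disk.

```python
import io

import pandas as pd

# Toy frame with the same layout as the reclassified dictionary (values are illustrative).
toy = pd.DataFrame({"Word": ["worse", "worthy"], "Negative": [1, 0], "Positive": [0, 1]})

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as a first column with an empty header
header_with_index = buffer.getvalue().splitlines()[0]

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as a regular column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # treat the first column as the index again

print(header_with_index)
print(list(naive.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv` is an alternative when the row numbers are not needed downstream.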