# Step 1: Preprocessing

The method requires a corpus on which to train a custom word embeddings model. In this example, the corpus is a collection of political speeches from the Canadian House of Commons. I use data from 1980 to 2015, which should be large enough to fit a reliable word embeddings model. (The full Canadian corpus is available at www.lipad.ca.)

You may download the data here: https://drive.google.com/uc?id=1u-lnejm4bzDm7t3YulSowzLaS26IHLpz

To replicate the approach used in the study, I will lemmatize the corpus and detect the parts of speech for each word. 

Few researchers go to the trouble of lemmatizing and POS tagging a corpus before fitting embeddings. The rationale for lemmatization is that it reduces the size of the vocabulary, and thus should improve the accuracy of word vectors. As for parts of speech, the rationale is that they should help to reduce ambiguity by distinguishing between usages of words with multiple meanings. Since the Penn Treebank categories for parts of speech are more specific than this task requires, I convert the tags into a simplified format that identifies nouns, adjectives, verbs and adverbs (the script below also keeps interjections). This will make it possible to filter the vocabulary for words of substantive interest later on. Alternatively, one could choose to skip these preprocessing steps.
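
To make the target format concrete, here is a minimal sketch based on a hand-tagged sentence. The `SIMPLE_POS` table and the `to_lemma_pos` helper are illustrative names, and the lemmas and tags are written by hand rather than produced by a tagger. Tags outside the table, such as determiners or proper nouns, map to an empty string.

```python
# Illustrative mapping from Penn Treebank tags to one-letter categories.
SIMPLE_POS = {
    'JJ': 'a', 'JJR': 'a', 'JJS': 'a',      # adjectives
    'NN': 'n', 'NNS': 'n',                  # common nouns (proper nouns excluded)
    'RB': 'r', 'RBR': 'r', 'RBS': 'r',      # adverbs
    'VB': 'v', 'VBD': 'v', 'VBG': 'v',
    'VBN': 'v', 'VBP': 'v', 'VBZ': 'v',     # verbs
}

def to_lemma_pos(tagged_lemmas):
    """Join (lemma, Penn tag) pairs into lowercase 'lemma_pos' tokens."""
    return ' '.join(
        f"{lemma.lower()}_{SIMPLE_POS.get(tag, '')}"
        for lemma, tag in tagged_lemmas
    )

# Hand-tagged lemmas for "The minister was deeply worried about Canada."
example = [('the', 'DT'), ('minister', 'NN'), ('be', 'VBD'),
           ('deeply', 'RB'), ('worried', 'JJ'), ('about', 'IN'),
           ('Canada', 'NNP')]
print(to_lemma_pos(example))
# the_ minister_n be_v deeply_r worried_a about_ canada_
```

Tokens that end in a bare underscore carry no category of interest, which is what makes it possible to filter them out later.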

The following script relies on a Python wrapper to call the Stanford CoreNLP library. It writes a copy of our corpus in which each word is replaced by its lemma and associated part of speech.

In [None]:
#==========================================================================#
# Downloading Stanford CoreNLP.
#==========================================================================#
# The exclamation mark indicates that we call the command from our operating system.
! wget https://nlp.stanford.edu/software/stanford-corenlp-latest.zip

In [None]:
#==========================================================================#
# Unzipping the file.
#==========================================================================#
! unzip stanford-corenlp-latest.zip

In [None]:
from stanfordcorenlp import StanfordCoreNLP # Requires installation, e.g. pip install stanfordcorenlp
import json

#==========================================================================#
# Loading CoreNLP wrapper.
#==========================================================================#

nlp = StanfordCoreNLP('stanford-corenlp-4.4.0/') # Make sure the numbers correspond to your version.
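# The lemma annotator depends on part-of-speech tags, so pos is listed before lemma.
props = {'annotators': 'tokenize,ssplit,pos,lemma', 'pipelineLanguage': 'en', 'outputFormat': 'json'}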

#==========================================================================#
# A dictionary to zero in on parts of speech of interest:
#==========================================================================#
posmap = {'JJ':'a', 'JJR':'a', 'JJS':'a', # Adjectives
        'NN':'n', 'NNS':'n', # Nouns, excluding proper nouns
        'RB':'r', 'RBR':'r', 'RBS':'r', # Adverbs
        'UH':'u', # Interjections
        'VB':'v', 'VBD':'v', 'VBG':'v', 'VBN':'v', 'VBP':'v', 'VBZ':'v'} # Verbs

#==========================================================================#
# A function to call CoreNLP and retrieve words converted to 'lemmas_POS' format.
#==========================================================================#
def lemmatize(text):
    res = json.loads(nlp.annotate(text, properties=props))
    lemmas = [token['lemma'] + '_' + posmap.get(token['pos'],'') for s in res['sentences'] for token in s['tokens']]
    lemmas = [l.lower() for l in lemmas if ' ' not in l and l.count('_')==1] # This will exclude malformed tokens.
    return ' '.join(lemmas)

#==========================================================================#
# Our main loop, to read the corpus and transform it.
#==========================================================================#
progress = 0
with open('lipad8015_lemmas.csv', 'w', encoding='utf-8') as fout: # Saving a copy of the output.
    with open('lipad8015.csv', 'r', encoding='utf-8') as fin: # The original corpus (assumed to be UTF-8).
        for line in fin: # Stream lines to save memory.
            newline = lemmatize(line) # Process the line.
            fout.write(newline + '\n') # Save processed line to file.
            progress += 1
            if progress%10000==0:
                print(f"Completed {progress} lines.")           

#==========================================================================#
# Closing connection to CoreNLP
#==========================================================================#
nlp.close()

You may also consider alternatives such as spaCy, Stanza, or other NLP libraries.

# Step 2: Fitting custom word embeddings model

Next, I will use the GloVe program to fit word embeddings. Many alternatives are available, such as the word2vec model or embeddings from large-scale language models.

GloVe is written in C, a compiled language, so it must be compiled on your local machine before use.

Once the program is compiled, we can run it on our custom corpus and generate embeddings. 

The file fit.sh is a shell script that calls GloVe. The model's parameters can be modified at will. For this example, let us fit a model with 300 dimensions and a window of 15 words, similar to the one used in the original study.

In [None]:
#==========================================================================#
# Downloading GloVe.
#==========================================================================#
# The study used the original code (GloVe 1.0), but since a newer version is available, let us use it.
! wget https://nlp.stanford.edu/software/GloVe-1.2.zip
! unzip GloVe-1.2.zip

In [None]:
#==========================================================================#
# Compiling GloVe.
#==========================================================================#
# We need to compile the C program first.
! cd GloVe-1.2 && make

In [None]:
#==========================================================================#
# Computing word embeddings
#==========================================================================#
# Let us call a shell script that will run the GloVe program as desired, using our new corpus. 
# We can call this script from here, after making it executable:
! chmod +x fit.sh
! ./fit.sh

The commands above assume a Mac or Linux operating system; they may need slight edits for use on Windows. Apart from the dimension and window size mentioned above, the GloVe parameters are left at their default values, but you could fit multiple models, as desired, by editing the fit.sh file.
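
The fit.sh script itself is not reproduced in this tutorial. As a rough sketch of what such a script needs to do, the snippet below assembles the four commands of the standard GloVe pipeline (vocab_count, cooccur, shuffle, glove), following the pattern of the demo.sh script shipped with GloVe. The function name, the intermediate file names, and the values for minimum count, memory, iterations and threads are assumptions made for illustration. Only the dimension (300), the window (15), and the corpus, vocabulary and vector file names come from this tutorial.

```python
def glove_commands(corpus='lipad8015_lemmas.csv', build_dir='GloVe-1.2/build',
                   vocab_file='lipad-vocab.txt', save_file='lipad-vectors-300d',
                   vector_size=300, window_size=15,
                   min_count=5, max_iter=15, threads=8, memory=4.0):
    """Return the four GloVe pipeline steps as shell command strings.

    min_count, max_iter, threads and memory are placeholder values,
    not the settings used in the original study.
    """
    # Hypothetical names for the intermediate co-occurrence files.
    cooc, shuf = 'cooccurrence.bin', 'cooccurrence.shuf.bin'
    return [
        # 1. Count word frequencies and write the vocabulary file.
        f'{build_dir}/vocab_count -min-count {min_count} -verbose 2 '
        f'< {corpus} > {vocab_file}',
        # 2. Build the word co-occurrence counts within the chosen window.
        f'{build_dir}/cooccur -memory {memory} -vocab-file {vocab_file} '
        f'-verbose 2 -window-size {window_size} < {corpus} > {cooc}',
        # 3. Shuffle the co-occurrence records before training.
        f'{build_dir}/shuffle -memory {memory} -verbose 2 < {cooc} > {shuf}',
        # 4. Train the vectors; -binary 0 writes text output to '<save_file>.txt'.
        f'{build_dir}/glove -save-file {save_file} -threads {threads} '
        f'-input-file {shuf} -iter {max_iter} -vector-size {vector_size} '
        f'-binary 0 -vocab-file {vocab_file} -verbose 2',
    ]

for command in glove_commands():
    print(command)
```

Each string can be pasted into a shell script, or run from Python with `subprocess.run(command, shell=True, check=True)`. Fitting several models then amounts to calling the function with different values of `vector_size` or `window_size` and a matching `save_file`.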

# Step 3: Generating an anxiety lexicon

The final step is to generate a lexicon for anxiety. We start from seed words that I curated manually, which identify the two poles of an axis ranging from confidence to anxiety.

The objective is to rank the remaining lemmas in our vocabulary based on where they fall on a scale from confidence to anxiety. To do so, we calculate the similarity of each lemma's embedding to those of the lemmas in each group of seeds. The method is inspired by Peter Turney's work on lexicon creation. We describe the adaptation to word embeddings in more detail in a separate study (https://github.com/lrheault/emotion).
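
As a toy illustration of this scoring rule, the sketch below uses made-up two-dimensional vectors in place of real embeddings. The score of a word is its average cosine similarity to the anxiety seeds minus its average cosine similarity to the confidence seeds, so positive values lean toward anxiety. All words, vectors and helper names here are invented for the example and are not the seeds used in the study.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def seed_score(word, anxiety_seeds, confidence_seeds, vectors):
    """Mean similarity to anxiety seeds minus mean similarity to confidence seeds."""
    to_anxiety = sum(cosine(vectors[word], vectors[s])
                     for s in anxiety_seeds) / len(anxiety_seeds)
    to_confidence = sum(cosine(vectors[word], vectors[s])
                        for s in confidence_seeds) / len(confidence_seeds)
    return to_anxiety - to_confidence

# Made-up 2-d "embeddings": the first axis stands for anxiety,
# the second for confidence.
toy_vectors = {
    'fear_n': [1.0, 0.0], 'worry_n': [0.9, 0.1],          # anxiety seeds
    'trust_n': [0.0, 1.0], 'optimism_n': [0.1, 0.9],      # confidence seeds
    'crisis_n': [0.8, 0.2], 'prosperity_n': [0.2, 0.8],   # words to score
}
anxiety = ['fear_n', 'worry_n']
confidence = ['trust_n', 'optimism_n']

for word in ['crisis_n', 'prosperity_n']:
    print(word, round(seed_score(word, anxiety, confidence, toy_vectors), 3))
```

In this toy setup, `crisis_n` receives a positive score and `prosperity_n` a negative one, because each vector points mostly along one of the two axes.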

The commands below create our final lexicon. Optionally, we can filter the lexicon to retain only words in our main categories for parts of speech. The logic is that this excludes words like proper nouns, which we do not expect to be indicators of anxiety in text (although, depending on the use case, you may have a justification to consider proper nouns). Another option, applied here, is to rescale the scores between -1 and 1 (from confidence to anxiety), so that they are simpler to interpret.
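
The rescaling is a min-max transformation: the lowest raw score maps to -1, the highest to 1, and everything else falls linearly in between. Here is a small worked example with invented raw scores, followed by the optional part-of-speech filter. The `min_max_rescale` helper and the sample rows are hypothetical and only serve to show the arithmetic.

```python
def min_max_rescale(values, new_min=-1.0, new_max=1.0):
    """Linearly map values so min(values) -> new_min and max(values) -> new_max."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError('Cannot rescale: all values are identical.')
    return [(new_max - new_min) * (v - low) / (high - low) + new_min
            for v in values]

# Invented (lemma, pos, raw score) rows; an empty pos marks a word outside
# the main part-of-speech categories (for instance a proper noun).
rows = [('threat', 'n', 1.0), ('steady', 'a', 0.25),
        ('ottawa', '', 0.6), ('confident', 'a', -0.5)]

# Rescale all scores first, then drop rows without a main part of speech.
rescaled = min_max_rescale([score for _, _, score in rows])
lexicon = [(lemma, pos, score)
           for (lemma, pos, _), score in zip(rows, rescaled) if pos != '']

for lemma, pos, score in lexicon:
    print(f'{lemma}_{pos}: {score:+.2f}')
# threat_n: +1.00
# steady_a: +0.00
# confident_a: -1.00
```

Note that the bounds are taken over all scored words, so a word removed by the filter can still determine where -1 or 1 falls.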

The resulting dictionary will not be identical to the one used in the original study, but should be similar and just as useful. 

In [None]:
from gensim.models import KeyedVectors
import pandas as pd

#==========================================================================#
# GloVe files have no header, so we must use that option to load our model:
#==========================================================================#
model = KeyedVectors.load_word2vec_format('lipad-vectors-300d.txt', no_header=True) 

#==========================================================================#
# Compute cosine similarity of a word against the two groups of seed words:
#==========================================================================#
def create_score(word, positive_seeds, negative_seeds):
    positive = sum([model.similarity(word, x) for x in positive_seeds])/len(positive_seeds) 
    negative = sum([model.similarity(word, x) for x in negative_seeds])/len(negative_seeds) 
    return (positive - negative)

# A function to rescale/normalize a pandas series (min-max scaling).
def rescale(X, newmin=-1, newmax=1):
    xmin, xmax = X.min(), X.max() # Compute the bounds once rather than for every element.
    return (newmax-newmin)*((X-xmin)/(xmax-xmin))+newmin

#==========================================================================#
# Loading manually curated seed lemmas for anxiety:
#==========================================================================#
seeds = pd.read_csv('seeds.csv')
positive_seeds = [lemma + '_' + pos for lemma,pos,score in zip(seeds.lemma, seeds.pos, seeds.score) if score==1] # Anxiety words 
negative_seeds = [lemma + '_' + pos for lemma,pos,score in zip(seeds.lemma, seeds.pos, seeds.score) if score==-1] # Confidence words

#==========================================================================#
# Loading vocabulary.
#==========================================================================#
vocab = []
with open('lipad-vocab.txt','r') as fin:
    for line in fin:
        word_freq = line.split()
        if len(word_freq)==2: # Eliminates malformed tokens.
            word,freq = word_freq
            if int(freq)>=200: # Filter on frequency as desired.
                vocab.append((word,freq))
vocab = [(word,freq) for word,freq in vocab if word not in positive_seeds and word not in negative_seeds]
print(f'Processing a total of {len(vocab)} words.')
    
#==========================================================================#
# Calculating scores for each word:
#==========================================================================#
progress = 0
anxiety_lexicon = []
for word, freq in vocab:
    lemma_pos = word.split('_')
    if len(lemma_pos)==2 and lemma_pos[0].isalpha():
        lemma, pos = lemma_pos
        score = create_score(word, positive_seeds, negative_seeds)
        anxiety_lexicon.append((lemma, pos, score))
    progress += 1
    if progress%1000==0:
        print(f'Completed {progress} words.')
        
#==========================================================================#
# Storing results in a pandas data frame.
#==========================================================================#
anxiety_lexicon = pd.DataFrame(anxiety_lexicon, columns=['lemma', 'pos', 'score'])
# We can rescale the scores between -1 and 1 
anxiety_lexicon['score'] = rescale(anxiety_lexicon.score)

# We can now implement additional filters on the lexicon. 
# For instance, we can choose to keep only the main POS categories identified earlier.
anxiety_lexicon = anxiety_lexicon[pd.notnull(anxiety_lexicon.pos) & (anxiety_lexicon.pos!='')]
anxiety_lexicon = anxiety_lexicon.sort_values(by=['score'], ascending=False) # Reassign rather than sorting a filtered slice in place.

# We can attach the original seeds to the lexicon, with the respective bounding values.
anxiety_lexicon = pd.concat([anxiety_lexicon,seeds])

#==========================================================================#
# Saving our lexicon:
#==========================================================================#
print(f'The created lexicon contains {anxiety_lexicon.shape[0]} words.')
anxiety_lexicon.to_csv('new_anxiety_lexicon.csv',index=False)