# MultiLexScaled: Process lexica (2021-12-10)

_by A. Maurits van der Veen_  

_Modification history:_  
_2021-11-14 - initial extraction from longer, older notebook_   
_2021-12-05 - clean-up and update for Github_   
_2021-12-10 - double-check availability of all lexica at URLs indicated_  

The sentiment analysis method MultiLexScaled uses 8 widely used, publicly available lexica. They are not all distributed in the same format. This notebook takes each lexicon in the format in which it is available online and converts it to a CSV file.

The lexica are processed in the following (alphabetical) order:
- HuLiu  
- labMT 
- LexicoderSD  
- MPQA  
- NRC  
- SentiWordNet (filtered0.1)
- SO-CAL (including the intensifier dictionary used in valence calculation)
- WordStat

Most of the lexica receive modifications, which are marked by modifying the filename (either adding `_filtered` or simply a capital `X`):
- HuLiu_lexiconX
- labMT_lexicon_filtered
- LSD_lexiconX
- SWN_lexicon_filtered0.1
- SO-CAL_lexiconX & SO-CAL_modifiersX
- WordStat_lexicon2X

### 0. Set-up

Import code modules and specify project folder path. Also define some useful lexicon-editing functions.

In [114]:
STAIRfolder = r"../"

In [156]:
# Code files to import
import sys
import csv
import os

# Print summary version info (for fuller info, simply print sys.version)
print("You are using python version {}.".format(sys.version.split()[0]))

You are using python version 3.11.7.


In [157]:
# CREATE dirs to contain all the individual lexicon files
os.makedirs(f"{STAIRfolder}Corpora/Lexica/English/MultiLexScaled/", exist_ok=True)
# Pathname to contain all the individual lexicon files
SAfolder = STAIRfolder + "Corpora/Lexica/English/MultiLexScaled/"

#### 0.1. Auxiliary functions

In [None]:
def fix_lex(lex, fixdict):
    """Fix a lexicon, by replacing each key in fixdict with the corrected key.

    Used to fix apparent unintentional spelling errors in some of the sentiment lexicon keys.
    """
    for oldlexkey, newlexkey in fixdict.items():
        if oldlexkey in lex:
            lexval = lex[oldlexkey]
            del lex[oldlexkey]
            if newlexkey != "":
                lex[newlexkey] = lexval
                print("Replaced {} by {}".format(oldlexkey, newlexkey))
            else:
                print("Deleted {}".format(oldlexkey))

    return lex

In [None]:
def subsumed(origx, words, wild="*", report=True):
    """See whether origx is subsumed by a wildcarded entry in words."""
    if origx[-1] == wild:
        x = origx[:-2] + wild
    else:
        x = origx + wild
    while len(x) > 1:
        if x in words:
            if report:
                print(x, "subsumes", origx)
            return x
        else:
            x = x[:-2] + wild
    return False

In [None]:
def lex_removesubsumed(lex):
    """Remove all entries in a lexicon that are subsumed by a wildcard entry with the same valence.

    Note that sometimes these are unintended wildcard matches.
    For example: 'terrifi*' (WordStat) is negative because intended for
    'terrified', 'terrifies', etc. However, it also subsumes 'terrific*',
    which is positive.

    In such cases, we want to look for the longest match first, which is
    indeed how our wildcard matching function operates. Therefore, we do not
    delete such subsumptions.
    """
    entries2delete = []
    for key, val in lex.items():
        subsumption = subsumed(
            key, lex, wild="*", report=True
        )  # set report to False for quiet operation
        if subsumption:
            if lex[subsumption] == val:
                entries2delete.append(key)
            else:
                print(
                    "{} and {} have different valences => keeping both".format(
                        subsumption, key
                    )
                )

    if len(entries2delete) > 0:
        print("Found {} subsumed entries; deleting now".format(len(entries2delete)))
        for entry in entries2delete:
            del lex[entry]

    return lex
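
The docstring above relies on a wildcard matcher that tries the longest match first. That matcher is not part of this notebook, so the following is only a minimal sketch of the idea. The helper name `match_longest` and the toy lexicon are hypothetical, and the sketch assumes (as `subsumed` does) that a wildcard can also match zero characters.

```python
def match_longest(word, lex, wild="*"):
    """Return the valence of the longest lexicon entry matching word, or None.

    Hypothetical helper: an exact match wins; otherwise try ever-shorter
    prefixes of the word, each followed by the wildcard character.
    """
    if word in lex:
        return lex[word]
    for end in range(len(word), 0, -1):
        key = word[:end] + wild
        if key in lex:
            return lex[key]
    return None


# Toy lexicon mirroring the WordStat example: both entries must be kept,
# because the longer entry overrides the shorter one for 'terrific'
toy_lex = {"terrifi*": -1, "terrific*": 1}
print(match_longest("terrified", toy_lex))  # matched by 'terrifi*'
print(match_longest("terrific", toy_lex))  # matched by 'terrific*'
```

With a matcher of this kind, deleting `terrific*` as "subsumed" would silently flip its valence, which is why entries with differing valences are kept.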


### 1. Hu & Liu

The sentiment lexicon is available here: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html (look for "opinion lexicon"). Downloading the lexicon should produce a folder `opinion-lexicon-English`, which contains 2 files: `positive-words.txt` and `negative-words.txt`. Please cite the associated paper:

- Minqing Hu and Bing Liu. "Mining and summarizing customer reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004, full paper), Seattle, Washington, USA, Aug 22-25, 2004. 


In [None]:
def importHuLiu(lexiconfolder, posfile, negfile, savepickle=False, savetext=True):
    """Import and merge sentiment lexica from Bing Liu

    The opinion lexicon rar archive unpacks as a folder opinion-lexicon-English
    with the files positive-words.txt and negative-words.txt inside.
    """
    import csv
    import pickle

    words = {}

    # Read in the positive words
    with open(lexiconfolder + posfile, "r", errors="ignore") as infile:
        for line in infile.readlines():
            if line[0] != ";" and line.strip() != "":
                words[line.strip()] = 1
    poscount = len(words)

    # Read in the negative words
    with open(lexiconfolder + negfile, "r", errors="ignore") as infile:
        for line in infile.readlines():
            if line[0] != ";" and line.strip() != "":
                words[line.strip()] = -1

    # print(list(words.items())[:20])
    print(
        "Loaded {} positive and {} negative words, for a total of {} words.".format(
            poscount, len(words) - poscount, len(words)
        )
    )

    if savepickle:
        with open(lexiconfolder + "HuLiu_lexicon.pkl", "wb") as outfile:
            pickle.dump(words, outfile)
    if savetext:
        with open(lexiconfolder + "HuLiu_lexicon.csv", "w") as outfile:
            outwriter = csv.writer(outfile)
            outwriter.writerows(sorted(list(words.items())))

    return words


In [None]:
# Create the folder if it doesn't exist
os.makedirs(f"{SAfolder}HuLiu/opinion-lexicon-English/", exist_ok=True)

In [None]:
lexiconfolder = SAfolder + "HuLiu/opinion-lexicon-English/"
posfile = "positive-words.txt"
negfile = "negative-words.txt"

huliulex = importHuLiu(lexiconfolder, posfile, negfile)

Loaded 2006 positive and 4780 negative words, for a total of 6786 words.


In [None]:
# This lexicon has 'bull****' as a euphemism for 'bullshit'. However, our system
# would recognize the asterisks as wildcards, so delete this entry.

del huliulex["bull****"]

# In addition, it has naïve, but the ï does not come through correctly in the loading, so fix that

huliulex["naïve"] = huliulex["nave"]
del huliulex["nave"]


In [None]:
# Write out the updated file

with open(lexiconfolder + "HuLiu_lexiconX.csv", "w") as outfile:
    outwriter = csv.writer(outfile)
    outwriter.writerows(sorted(list(huliulex.items())))


In [None]:
print(
    "\nLexicon length: {} ({} positive & {} negative)".format(
        len(huliulex),
        sum([1 for x in huliulex if huliulex[x] > 0]),
        sum([1 for x in huliulex if huliulex[x] < 0]),
    )
)


Lexicon length: 6785 (2003 positive & 4782 negative)


### 2. labMT (filtered)

labMT stands for "language analysis by Mechanical Turk". The lexicon is available here: https://github.com/ryanjgallagher/shifterator/tree/master/shifterator/lexicons/labMT (use the file `labMT_English.tsv`). Please cite the associated paper:

- Dodds, Peter Sheridan, Kameron Decker Harris, Isabel M. Kloumann, Catherine A. Bliss, and Christopher M. Danforth. "Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter." PLoS ONE 6, no. 12 (2011).

labMT is centered around 5, rather than 0. Subtract 5 to get a 0-centered lexicon.

Words right around 0 are effectively neutral in valence, so filter out all entries with an absolute value for valence less than 1.

In [None]:
# Create the folder if it doesn't exist
os.makedirs(f"{SAfolder}labMT/", exist_ok=True)

In [None]:
labMT = {}

labMTfolder = SAfolder + "labMT/"
with open(labMTfolder + "labMT_English.tsv", "r") as labMTfile:
    labreader = csv.reader(labMTfile, delimiter="\t")
    for row in labreader:
        labMT[row[0]] = float(row[1]) - 5

len(labMT)  # Should be 10,222

10222

In [34]:
labMTfiltered = {key: val for key, val in labMT.items() if abs(val) >= 1}
len(labMTfiltered)

3731

In [None]:
with open(labMTfolder + "labMT_lexicon_filtered.csv", "w") as outfile:
    outwriter = csv.writer(outfile)
    outwriter.writerows(sorted(list(labMTfiltered.items())))


In [None]:
print(
    "\nLexicon length: {} ({} positive & {} negative)".format(
        len(labMTfiltered),
        sum([1 for x in labMTfiltered if labMTfiltered[x] > 0]),
        sum([1 for x in labMTfiltered if labMTfiltered[x] < 0]),
    )
)


Lexicon length: 3731 (2668 positive & 1063 negative)


### 3. Lexicoder Sentiment Dictionary

The Lexicoder Sentiment Dictionary (LSD) was developed by Lori Young and Stuart Soroka. It is available at http://www.snsoroka.com/data-lexicoder/ and is part of the larger Lexicoder system, which also involves text preprocessing as well as some language substitution. Please cite the accompanying paper: 

- Young, Lori, and Stuart Soroka. "Affective news: The automated coding of sentiment in political texts." Political Communication 29.2 (2012): 205-231.

We use the August 2015 version. There are a few multi-word entries in the list; we ignore those. 


In [None]:
def importLexicoder(
    lexfolder,
    lexfile,
    lexfile_negated,
    savepickle=False,
    savetext=True,
    oldformat=False,
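):
    """Import the Lexicoder Sentiment Dictionary and its negated-phrase file.

    Multi-word entries are skipped. Single words that occur only in the
    negated file (as 'not <word>') are added to the lexicon with the valence
    of the section header they appear under.
    """
    import pickle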

    words, negwords = {}, {}

    # Read in the non-negated values
    curcat = ""
    curval = 0
    with open(lexfolder + lexfile, "r") as infile:
        for line in infile.readlines():
            if oldformat:
                line = line.lower()
            if line[0] == "+" or (oldformat and line[0] != "\t"):
                curcat = "positive" if "positive" in line else "negative"
                curval = 1 if curcat == "positive" else -1
            else:
                aWord = line.strip().lower()
                if oldformat:
                    aWord = aWord[:-4]  # remove ' (1)' at end of each line
                if " " in aWord:
                    print("Phrase: {} -> skipping".format(aWord))
                    continue
                words[aWord] = curval

    # Read in the negated values; see which, if any, are new
    curcat = ""
    with open(lexfolder + lexfile_negated, "r") as infile:
        for line in infile.readlines():
            if oldformat:
                line = line.lower()
            if line[0] == "+" or (oldformat and line[0] != "\t"):
                curcat = "positive" if "positive" in line else "negative"
                curval = 1 if curcat == "positive" else -1
                continue
            linesplit = line.split()
            if oldformat:
                linesplit = linesplit[:-1]
            if len(linesplit) == 2 and linesplit[0] == "not":  # skip phrases and blank lines
                aWord = linesplit[1].lower()
                if aWord not in words:
                    negwords[aWord] = curval
                    # print("New one from negatives: {} ({})".format(aWord, curval))
                else:
                    if words[aWord] != curval:  # should not happen!
                        print(
                            "Warning: {} has valence {} in original, but {} in negated file".format(
                                aWord, words[aWord], curval
                            )
                        )
            else:  # phrase not beginning 'not' -> skip
                # print('{} = negative file phrase or entry not beginning "not": -> skipping'.format(line))
                continue

    # Identify the ones that are in the basic file but not in the negated one
    # print("Words in the main file but not the negated file:")
    # print([x for x in words.keys() if x not in negwords])

    # Identify the ones that are in the negated list but not the basic one
    negwords_pos = [word for word, val in negwords.items() if val == 1]
    negwords_neg = [word for word, val in negwords.items() if val == -1]
    if len(negwords_pos) > 0:
        print(
            "\nPositive-valence words in the negated file only: {}".format(
                ", ".join(negwords_pos)
            )
        )
    if len(negwords_neg) > 0:
        print(
            "\nNegative-valence words in the negated file only: {}".format(
                ", ".join(negwords_neg)
            )
        )

    # Add them to the full list
    for word, val in negwords.items():
        if word not in words:
            words[word] = val

    print(
        "\nLexicon length: {} ({} positive & {} negative)".format(
            len(words),
            sum([words[x] for x in words if words[x] == 1]),
            -sum([words[x] for x in words if words[x] == -1]),
        )
    )

    # Save and return the lexicon
    if savepickle:
        with open(lexfolder + "LexicoderDictionary.pkl", "wb") as LSDout:
            pickle.dump(words, LSDout)

    if savetext:
        with open(lexfolder + "LSD_lexicon.csv", "w") as outfile:
            outwriter = csv.writer(outfile)
            outwriter.writerows(sorted(list(words.items())))

    return words

In [None]:
# Create the folders if they don't exist
os.makedirs(f"{SAfolder}Lexicoder/LSDaug2015/", exist_ok=True)

In [None]:
lsdfolder = SAfolder + "Lexicoder/LSDaug2015/"
lsdfile = "LSD2015.lc3"
lsdfile_negated = "LSD2015_NEG.lc3"

lsd_lex = importLexicoder(
    lsdfolder, lsdfile, lsdfile_negated, savepickle=False, savetext=True
)

Phrase: a lie -> skipping
Phrase: affected manner* -> skipping
Phrase: at odds -> skipping
Phrase: back seat -> skipping
Phrase: beyond the pale -> skipping
Phrase: big lie* -> skipping
Phrase: black hole* -> skipping
Phrase: black mark* -> skipping
Phrase: by a side wind* -> skipping
Phrase: can of worms -> skipping
Phrase: cast down* -> skipping
Phrase: cast off* -> skipping
Phrase: cool reception -> skipping
Phrase: cool relations -> skipping
Phrase: cross the line -> skipping
Phrase: cut to pieces* -> skipping
Phrase: cut up* -> skipping
Phrase: fed up -> skipping
Phrase: god knows* -> skipping
Phrase: half hearted* -> skipping
Phrase: half mast -> skipping
Phrase: hang by a thread -> skipping
Phrase: hang over* -> skipping
Phrase: heart break -> skipping
Phrase: heart rending* -> skipping
Phrase: heart wounding -> skipping
Phrase: heart wrench -> skipping
Phrase: holier than thou -> skipping
Phrase: hot headed* -> skipping
Phrase: hot potato* -> skipping
Phrase: hot seat -> skipping

In [None]:
# Filter out subsumptions (only if they have the same valence)
lsd_lex = lex_removesubsumed(lsd_lex)


aggress* subsumes aggressiv*
angr* subsumes angry*
apologis* subsumes apologist
apologis* and apologist have different valences => keeping both
appal* subsumes appall*
bare* subsumes barely
boast* subsumes boastful*
bore* subsumes boredom*
confiscat* subsumes confiscation*
conspir* subsumes conspira*
controvers* subsumes controversy*
cram* subsumes cramp
cram* subsumes cramped
cram* subsumes cramps
disagre* subsumes disagree*
disagree* subsumes disagreement*
disappoint* subsumes disappointment*
discord* subsumes discordant*
displeas* subsumes displeasur*
distres* subsumes distress*
distress* subsumes distressing*
fals* subsumes falsehood*
fals* subsumes falseness*
filth* subsumes filthy*
fogg* subsumes foggy
goddam* subsumes goddamn*
grave* subsumes grave
grie* subsumes grievous*
hothead* subsumes hotheaded*
injur* subsumes injury*
irascib* subsumes irascibility*
irrita* subsumes irritat*
nervou* subsumes nervous*
oppose* subsumes opposed
oppose* subsumes opposer
oppose* subsumes oppos

In [None]:
# Save the changed lexicon
with open(lsdfolder + "LSD_lexiconX.csv", "w") as outfile:
    outwriter = csv.writer(outfile)
    outwriter.writerows(sorted(list(lsd_lex.items())))

In [None]:
print(
    "\nLexicon length: {} ({} positive & {} negative)".format(
        len(lsd_lex),
        sum([lsd_lex[x] for x in lsd_lex if lsd_lex[x] == 1]),
        -sum([lsd_lex[x] for x in lsd_lex if lsd_lex[x] == -1]),
    )
)


Lexicon length: 4353 (1608 positive & 2745 negative)


### 4. MPQA (Multi-Perspective Question Answering)

We use the lexicon associated with OpinionFinder 2.0, available at http://mpqa.cs.pitt.edu/opinionfinder/opinionfinder_2/. (Note that this is slightly different from the subjectivity lexicon available on the same website at: http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/.) We use the field `mpqapolarity`, which has values strongneg, weakneg, weakpos, and strongpos; we translate those values to -1.0, -0.5, 0.5, and 1.0. Two additional values, neutral and both, are scored as 0. If a word occurs with multiple parts of speech, its valences are averaged, and words whose valences sum to 0 are dropped.

Please cite the associated paper:

- Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005. 
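
As a small illustration of the averaging rule, the sketch below works through made-up entries. The words, parts of speech, and polarity labels are hypothetical and not taken from the MPQA file; only the label-to-value mapping follows the description above.

```python
# Polarity labels mapped to numeric valences, as described above
polarity_values = {
    "strongpos": 1.0,
    "weakpos": 0.5,
    "weakneg": -0.5,
    "strongneg": -1.0,
}

# Hypothetical (word, part of speech, polarity) entries
entries = [
    ("wordA", "adj", "strongpos"),
    ("wordA", "noun", "weakpos"),
    ("wordB", "verb", "weakneg"),
]

# Collect every valence listed for a word, then average per word
valences = {}
for word, _pos, polarity in entries:
    valences.setdefault(word, []).append(polarity_values[polarity])
averaged = {word: sum(vals) / len(vals) for word, vals in valences.items()}

print(averaged)  # wordA averages 1.0 and 0.5; wordB keeps its single value
```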


In [None]:
def importMPQA(lexiconfolder, lexiconfile, savepickle=False, savetext=True):
    """Import the length-1 subjectivity clues files from MPQA; convert to dictionary.

    Original sentiment assessments are strongpos, weakpos, weakneg, strongneg
    Assign values 1, 0.5, -0.5, -1.
    Average across different word usages (parts-of-speech)
    """
    import pickle

    opinionvals = {
        "strongpos": 1,
        "weakpos": 0.5,
        "weakneg": -0.5,
        "strongneg": -1,
        "neutral": 0,
        "both": 0,
    }

    with open(lexiconfolder + lexiconfile, "r") as in1:
        cluesdata = in1.readlines()

    lexicon = {}
    wordcount = 0
    for counter, wordinfo in enumerate(cluesdata):
        if wordinfo[0] != "#":  # skip comment lines (should not be present)
            wordsplit = wordinfo.split()

            # Skip any terms not of length 1
            termlength = [x[-1] for x in wordsplit if x[:3] == "len"]
            if len(termlength) > 0 and termlength[0] == "1":
                # Extract word and polarity; these are not always the same location, so search
                theword = [x[6:] for x in wordsplit if x[:5] == "word1"]
                thesent = [x[13:] for x in wordsplit if x[:12] == "mpqapolarity"]
                if len(theword) > 0 and len(thesent) > 0:
                    # Store multiple uses of the same word (varies by POS)
                    if theword[0] in lexicon:
                        lexicon[theword[0]].append(opinionvals[thesent[0]])
                    else:
                        lexicon[theword[0]] = [
                            opinionvals[thesent[0]],
                        ]
                        wordcount += 1

    # Assign valence by averaging across multiple valences for the same word
    # At the same time, filter out words whose average valence is 0
    newlexicon = {
        key: sum(val) / float(len(val))
        for key, val in lexicon.items()
        if abs(sum(val)) > 0
    }

    print(
        "Total lines: %d; unique words: %d, in lexicon: %d"
        % (counter, wordcount, len(newlexicon))
    )

    if savetext:
        with open(lexiconfolder + "MPQA_lexicon.csv", "w") as outfile:
            outwriter = csv.writer(outfile)
            outwriter.writerows(sorted(list(newlexicon.items())))

    if savepickle:
        with open(lexiconfolder + "MPQA_lexicon.pkl", "wb") as outfile:
            pickle.dump(newlexicon, outfile)

    return newlexicon


In [None]:
# Create the folder if it doesn't exist
os.makedirs(f"{SAfolder}MPQA 2.0/opinionfinderv2.0/lexicons/", exist_ok=True)

In [None]:
mpqafolder = SAfolder + "MPQA 2.0/opinionfinderv2.0/lexicons/"
mpqafile = "subjclueslen1polar.tff"

mpqa_lex = importMPQA(mpqafolder, mpqafile, savepickle=False, savetext=True)

Total lines: 8220; unique words: 6885, in lexicon: 6449


In [None]:
print(
    "\nLexicon length: {} ({} positive & {} negative)".format(
        len(mpqa_lex),
        sum([1 for x in mpqa_lex if mpqa_lex[x] > 0]),
        sum([1 for x in mpqa_lex if mpqa_lex[x] < 0]),
    )
)


Lexicon length: 6449 (2299 positive & 4150 negative)


### 5. NRC (Canadian National Research Council)

The NRC lexicon is a LIWC-style lexicon, with values for multiple categories: each line contains a word, a category, and the value for that category. We use version 0.92 of the lexicon, which is available here: https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm (the filename is `NRC-Emotion-Lexicon-Wordlevel-v0.92.txt`). Please cite the associated paper:

- Crowdsourcing a Word-Emotion Association Lexicon, Saif Mohammad and Peter Turney, Computational Intelligence, 29 (3), 436-465, 2013.

There are 81 words in the lexicon that are listed as both positive and negative. Through 2021 we included these as positive. In the current version we delete them, in parallel with the way we treat the 'both' options in the MPQA lexicon.

There are also 12 words (4 positive, 8 negative) that appear to have been removed from the lexicon since 2015. They are listed at the end of this section for reference.
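
The rule for words listed as both positive and negative can be sketched on a few made-up rows. The words and values below are hypothetical; only the word / category / value layout follows the description above. Collecting words into sets means the result does not depend on the order in which a word's categories appear in the file.

```python
import csv
import io

# Hypothetical rows in the tab-separated word / category / value layout
sample = (
    "wordA\tnegative\t1\n"
    "wordA\tpositive\t0\n"
    "wordB\tnegative\t1\n"
    "wordB\tpositive\t1\n"
    "wordC\tjoy\t1\n"
    "wordC\tpositive\t1\n"
)

positive, negative = set(), set()
for word, category, value in csv.reader(io.StringIO(sample), delimiter="\t"):
    if value == "1" and category == "positive":
        positive.add(word)
    elif value == "1" and category == "negative":
        negative.add(word)
    # all other categories are emotions and are not used here

both = positive & negative  # flagged with both polarities: dropped
lexicon = {word: 1 for word in positive - both}
lexicon.update({word: -1 for word in negative - both})

print(sorted(lexicon.items()), sorted(both))
```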


In [None]:
def importNRC(nrcfolder, nrcfile, savepickle=False, savetext=True):
    """Import NRC sentiment dictionary."""
    import csv, pickle

    words = {}
    doublewords = []

    with open(nrcfolder + nrcfile, "r") as infile:
        for row in csv.reader(infile, delimiter="\t"):
            if len(row) > 0 and row[0] != "":
                word = row[0].strip()
                cat = row[1].strip()
                val = int(row[2])

                if cat == "negative" and val == 1:
                    if word in words:  # should never happen
                        print(
                            "{} was in there already: was {}, now {}.".format(
                                word, words[word], val
                            )
                        )
                    words[word] = -1

                elif (
                    cat == "positive" and val == 1
                ):  # happens when word is in list as both negative and positive
                    if word in words:
                        doublewords.append(word)
                        del words[word]
                    else:
                        words[word] = 1

                else:  # other categories: emotions
                    pass

    print(
        "Words: {}, positive: {}, negative: {}, both (deleted): {}".format(
            len(words),
            len([1 for w in words if words[w] == 1]),
            len([1 for w in words if words[w] == -1]),
            len(doublewords),
        )
    )

    if savepickle:
        with open(nrcfolder + "NRC_lexicon.pkl", "wb") as outfile:
            pickle.dump(words, outfile)

    if savetext:
        with open(nrcfolder + "NRC_lexicon.csv", "w") as outfile:
            outwriter = csv.writer(outfile)
            outwriter.writerows(sorted(list(words.items())))

    return words, doublewords


In [None]:
# Create the folder if it doesn't exist
os.makedirs(f"{SAfolder}NRC/NRC-Emotion-Lexicon-v0.92/", exist_ok=True)

In [None]:
NRCfolder = SAfolder + "NRC/NRC-Emotion-Lexicon-v0.92/"
NRCfile = "NRC-Emotion-Lexicon-Wordlevel-v0.92.txt"

nrclex, posnegwords = importNRC(NRCfolder, NRCfile, savepickle=False, savetext=True)


Words: 5462, positive: 2227, negative: 3235, both (deleted): 81


In [None]:
print(
    "\nLexicon length: {} ({} positive & {} negative)".format(
        len(nrclex),
        sum([1 for x in nrclex if nrclex[x] > 0]),
        sum([1 for x in nrclex if nrclex[x] < 0]),
    )
)


Lexicon length: 5462 (2227 positive & 3235 negative)


### 6. SentiWordNet

SentiWordNet is based on the WordNet semantic network. It is available at https://github.com/aesuli/SentiWordNet. Please cite the associated paper:

- Baccianella, Stefano, Andrea Esuli, and Fabrizio Sebastiani. "SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining." Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010).
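
To make the treatment of polysemous words concrete before the import code, here is a worked example with invented scores. The three senses below are hypothetical; the averaging over all senses (including zero-valued ones) and the 0.1 cut-off follow the approach used in this section.

```python
# Hypothetical (positive, negative) scores for three senses of one word
senses = [(0.5, 0.0), (0.0, 0.25), (0.0, 0.0)]

# Average each score over all senses; zero-valued senses count too
avg_pos = sum(p for p, _ in senses) / len(senses)
avg_neg = sum(n for _, n in senses) / len(senses)
polarity = avg_pos - avg_neg

# Entries whose absolute polarity falls below the cut-off are filtered out
cutoff = 0.1
print(round(polarity, 4), abs(polarity) >= cutoff)
```

Here the word ends up slightly positive overall, but its averaged polarity falls below the cut-off, so it would not survive the filtering step.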



In [143]:
# Create the directory if it doesn't exist
os.makedirs(f"{SAfolder}SWN/", exist_ok=True)

In [144]:
def importSWN(swnfolder, swnfile, savepickle=False, savetext=True):
    """Import SentiWordNet as sentiment analysis lexicon.

    Produce a plain sentiment dictionary by separating out members of each synset.
    For polysemous words, simply average valences.

    File format: tab-separated
    - POS (a, n, ...)
    - WordNet ID
    - positive score (0-1)
    - negative score (0-1)
    - synset (list of members)
    - gloss (verbalization of meaning)

    Synset members: space-separated; each word with the suffix '#n' where n is the sequence
    number for the synset memberships of the same word.

    Ignore multi-word phrases. The dataset contains 83,499 words, of which 20,099 positive,
    20,698 negative (but 9,783 of the positive/negative words are both). If we sum the positive
    and negative, there are 29,502 words with a non-zero sum.

    There is commented-out code in here for an alternative way of generating a lexicon.
    Un-comment if needed. The alternative way is to ignore all the zero-values assigned
    to a word's positive or negative scores and include the non-zero values only. These
    data are tracked in the variable nonzerovals and compiled into the lexicon polarity2.
    """
    import csv, pickle
    from operator import itemgetter

    # Initialize dictionaries
    synsets, wordpos, posvals, negvals, nonzerovals, polarity, polarity2 = (
        {},
        {},
        {},
        {},
        {},
        {},
        {},
    )

    with open(swnfolder + swnfile, "r") as infile:
        # Run down synsets
        for row in csv.reader(infile, delimiter="\t"):
            if len(row) > 0 and row[0] != "" and row[0][0] != "#":  # skip comments &c.
                pos = row[0].strip()
                id = row[1].strip()
                posval = float(row[2])
                negval = float(row[3])
                synsetraw = row[4].split()

                # Run down members of each synset
                for term in synsetraw:
                    word = term[:-2]  # drop the 2-character sense suffix (#1, #2, etc.); assumes single-digit sense numbers
                    if "_" not in word:  # ignore multi-word phrases
                        # Track non-zero values
                        # if posval != 0:
                        #     if word in nonzerovals:
                        #         nonzerovals[word].append(posval)
                        #     else:
                        #         nonzerovals[word] = [posval,]
                        # if negval != 0:
                        #     if word in nonzerovals:
                        #         nonzerovals[word].append(-negval)
                        #     else:
                        #         nonzerovals[word] = [-negval,]

                        # Weight zero-values as equal to non-zero values
                        if word not in synsets:  # first or only meaning of word
                            synsets[word] = [
                                id,
                            ]
                            wordpos[word] = [
                                pos,
                            ]
                            posvals[word] = posval
                            negvals[word] = negval

                        else:  # polysemous word
                            nrmeanings = len(synsets[word])
                            oldposvals = posvals[word] * nrmeanings
                            oldnegvals = negvals[word] * nrmeanings
                            synsets[word].append(id)
                            if pos not in wordpos[word]:
                                wordpos[word].append(pos)
                            posvals[word] = (oldposvals + posval) / float(
                                nrmeanings + 1
                            )
                            negvals[word] = (oldnegvals + negval) / float(
                                nrmeanings + 1
                            )

    # Combine positive and negative
    # Note: this treats 0 values as equal in weight to non-zero values
    for word, val in posvals.items():
        sumvals = val - negvals[word]
        if sumvals != 0:
            polarity[word] = sumvals

    # We could also focus on non-zero values only:
    # for word, vals in nonzerovals.items():
    #     avgval = sum(vals)/len(vals)
    #     if avgval != 0:
    #         polarity2[word] = avgval

    # Report basic data
    print(
        "Words: {}, positive: {}, negative: {}, both: {}, sum non-0: {}".format(
            len(synsets),
            len([1 for w in posvals if posvals[w] > 0]),
            len([1 for w in negvals if negvals[w] > 0]),
            len([1 for w in posvals if posvals[w] > 0 and negvals[w] > 0]),
            len(polarity),
            # len(polarity2),
        )
    )

    if savepickle:
        with open(swnfolder + "SWNlex.pkl", "wb") as outfile:
            pickle.dump((polarity, posvals, negvals, synsets, wordpos), outfile)

    if savetext:
        with open(swnfolder + "SWN_lexicon.csv", "w") as outfile:
            outwriter = csv.writer(outfile)
            outwriter.writerows(sorted(list(polarity.items())))
        # with open(swnfolder + 'SWN_lexicon2.csv', 'w') as outfile:
        #     outwriter = csv.writer(outfile)
        #     outwriter.writerows(sorted(list(polarity2.items())))

    return polarity  # , polarity2


In [145]:
SWNfolder = SAfolder + "SWN/"
SWNfile = "SentiWordNet_3.0.0.txt"

# swnlex, swnlex2 = importSWN(SWNfolder, SWNfile, savepickle=False, savetext=True)
swnlex = importSWN(SWNfolder, SWNfile, savepickle=False, savetext=True)

Words: 83499, positive: 20099, negative: 20698, both: 9783, sum non-0: 29502


In [146]:
# Filter out all the values that are very close to 0 (use 0.1 as the cut-off)

cutoff = 0.1
swn_filtered = {key: val for key, val in swnlex.items() if abs(val) >= cutoff}
len(swn_filtered)

24222

In [147]:
# Save updated version
with open(SWNfolder + "SWN_lexicon_filtered0.1.csv", "w") as outfile:
    outwriter = csv.writer(outfile)
    outwriter.writerows(sorted(list(swn_filtered.items())))


In [148]:
print(
    "\nLexicon length: {} ({} positive & {} negative)".format(
        len(swn_filtered),
        sum([1 for x in swn_filtered if swn_filtered[x] > 0]),
        sum([1 for x in swn_filtered if swn_filtered[x] < 0]),
    )
)


Lexicon length: 24222 (11116 positive & 13106 negative)


### 7. SO-CAL

The SO-CAL dictionaries are available at https://github.com/sfu-discourse-lab/SO-CAL/tree/master/Resources/dictionaries/English. Please cite the associated paper:

- Taboada, Maite, Julian Brooke, Milan Tofiloski, Kimberly Voll and Manfred Stede (2011) Lexicon-Based Methods for Sentiment Analysis. Computational Linguistics 37 (2): 267-307.

Note: SO-CAL's dictionaries provide only the singular for nouns and the infinitive for verbs. We expand those lists by adding plurals as well as basic verb conjugations. For this, we use the Python module `pattern` (https://github.com/clips/pattern).


In [134]:
# Add the folder containing the pattern module to the import path
MODULE = "/pattern"
sys.path.append(MODULE)
import pattern.en

In [135]:
def importSOCAL(socalfolder, socalnames, savepickle=False, savetext=True):
    """Import and pre-process SO-CAL sentiment lexicon

    Expects 5 separate files in the socalfolder, with names specified in the
    dictionary socalnames, which should have the keys 'adjective', 'adverb',
    'noun', and 'verb', plus 'intensifier' for the separate intensifiers list.

    We ignore part-of-speech, and we expand nouns (pluralize) and verbs
    (singular, plural, present, past, etc.) to combine into a single dictionary.
    We use the pattern module for these expansions.
    When words appear in multiple sub-lists, simply average their valence.

    Each file contains 1 word per line, tab-separated from its valence.
    For the 4 POS categories, these are words to be added to the lexicon; for the last,
    the 'valence' is the multiplier to use when the word precedes a valence
    word.

    Note: A number of these 'words' are multi-word phrases, some with
    generic wildcards and POS tags (e.g. "(bowl)_#PER?#_over" ). These are all separated
    by hyphens or underscores; skip them here, to keep single words only in the lexicon.

    Note 2: some entries are in the verb dictionary twice, and their valence is automatically averaged
    - ameliorate, at 1 and 2
    - appall, at -3 and -5
    - befriend (both at 1)
    - belie, at -2 and -3
    - bug, at -1 and -2
    - enthrall (both at 3)
    - extol, at 2 and 3
    - gladden, at 2 and 3
    - loathe at -4 and loath at -5
    - misunderstand (both at -1), plus misunderstood, also at -1
    - quibble (both at -1)
    - uplift, at 2 and 3
    """
    import csv
    from operator import itemgetter
    from pattern.en import pluralize
    import pickle

    words, counts = {}, {}

    # Read in the nouns; handle plurals
    counter = 0
    with open(socalfolder + socalnames["noun"], "r", errors="ignore") as infile:
        inreader = csv.reader(infile, delimiter="\t")
        for row in inreader:
            if len(row) == 2:  # only take lines with 2 entries
                counter += 1
                theword = row[0]
                theval = int(row[1])
                words, counts = updatewordscounts(theword, theval, words, counts)
                words, counts = updatewordscounts(
                    pluralize(theword), theval, words, counts
                )
    lensofar = len(words)
    print(
        "Processed {} nouns; total count (singular + plural): {}".format(
            counter, lensofar
        )
    )

    # Read in the verbs; handle conjugation
    with open(socalfolder + socalnames["verb"], "r") as infile:
        inreader = csv.reader(infile, delimiter="\t")
        for row in inreader:
            if len(row) == 2:
                theval = int(row[1])
                for verbtense in alltenses(row[0]):
                    words, counts = updatewordscounts(verbtense, theval, words, counts)
    print("Verbs (incl. conjugations):", len(words) - lensofar)
    lensofar = len(words)

    # Read in adjectives & adverbs: no special treatment
    for pos in ("adjective", "adverb"):
        with open(socalfolder + socalnames[pos], "r", errors="ignore") as infile:
            inreader = csv.reader(infile, delimiter="\t")
            for row in inreader:
                if len(row) == 2:
                    words, counts = updatewordscounts(
                        row[0], int(row[1]), words, counts
                    )
    print("Adjectives & adverbs:", len(words) - lensofar)

    # Remove phrases, to keep single words only
    terms2delete = []
    for term, val in words.items():
        if "_" in term:  # redundant: or '(' in term or '[' in term
            terms2delete.append(term)
            # print(term)
    print("Deleting {} phrases (identified by underscores)".format(len(terms2delete)))
    for term in terms2delete:
        del words[term]

    print("Total lexicon length:", len(words))

    # Now read in the modifiers
    mods = {}
    with open(socalfolder + socalnames["intensifier"], "r") as infile:
        inreader = csv.reader(infile, delimiter="\t")
        for row in inreader:
            if len(row) == 2:
                mods[row[0]] = float(row[1])
    print("Modifiers:", len(mods))

    # Print some output to double-check everything looks right
    # print(list(words.items())[:20])
    # print("Total nr. of words", len(words))
    # print(list(mods.items())[:20])
    # print("Total nr. of modifiers", len(mods))
    # dupes = [(word, val) for word, val in counts.items() if val > 1]
    # print(sorted(dupes, key=itemgetter(1), reverse=True))
    # print("Total nr. of words encountered more than once", len(dupes))

    # Save SO-CAL dictionary in a text or pickle file
    if savetext:
        with open(socalfolder + "SO-CAL_lexicon.csv", "w") as outfile:
            outwriter = csv.writer(outfile)
            outwriter.writerows(sorted(list(words.items())))
        with open(socalfolder + "SO-CAL_modifiers.csv", "w") as outfile:
            outwriter = csv.writer(outfile)
            outwriter.writerows(sorted(list(mods.items())))
    if savepickle:
        with open(socalfolder + "SO-CAL.pkl", "wb") as outfile:
            pickle.dump((words, mods, counts), outfile)

    return words, mods


def updatewordscounts(word, val, words, counts):
    """Update words & counts dictionaries"""
    if word != "":
        if word in words:
            words[word] = words[word] * counts[word] + val
            counts[word] += 1
            words[word] /= float(counts[word])
        else:
            words[word] = val
            counts[word] = 1
    return words, counts


def alltenses(v):
    """Return all different verb forms, given infinitive.

    Uses the conjugate function from the pattern.en module.
    """
    from pattern.en import conjugate

    # filter out multi-word phrases, which we just ignore
    if "_" in v:
        return []
    return list(
        set(
            [
                conjugate(v, conj)
                for conj in [
                    "inf",
                    "1sg",
                    "2sg",
                    "3sg",
                    "pl",
                    "part",
                    "1sgp",
                    "2sgp",
                    "3sgp",
                    "ppl",
                    "ppart",
                ]
            ]
        )
    )
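
The averaging rule in `updatewordscounts` is an incremental mean: the stored valence is multiplied back up to a running total, the new value is added, and the total is divided by the updated count. A minimal self-contained sketch of the same idea follows. The helper name `running_mean_update` and the toy input list are illustrative only; the `appall` and `befriend` values are the ones listed in the docstring above.

```python
def running_mean_update(word, val, words, counts):
    """Fold one more (word, valence) observation into a running mean."""
    if word in words:
        # Recover the sum seen so far, then add the new value
        total = words[word] * counts[word] + val
        counts[word] += 1
        words[word] = total / counts[word]
    else:
        words[word] = val
        counts[word] = 1
    return words, counts


words, counts = {}, {}
# 'appall' appears twice in the verb list (at -3 and -5); 'befriend' is shown once here
for word, val in [("appall", -3), ("appall", -5), ("befriend", 1)]:
    running_mean_update(word, val, words, counts)

print(words)   # {'appall': -4.0, 'befriend': 1}
print(counts)  # {'appall': 2, 'befriend': 1}
```

Keeping the counts alongside the means is what allows a third or fourth duplicate to be averaged in correctly without storing every individual value.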


In [136]:
os.makedirs(f"{SAfolder}SO-CAL/English (from GitHub)/", exist_ok=True)

In [137]:
# Load the SO-CAL lexicon from the official distribution, version 1.11.

# Note: the pattern library's conjugate function may raise an error
# (StopIteration) the first time this is run. Just execute the cell again.

socalfolder = SAfolder + "SO-CAL/English (from GitHub)/"
socalsuffix = "_dictionary1.11.txt"
socalabbrevs = ["noun", "verb", "adj", "adv", "int"]
socalcats = ["noun", "verb", "adjective", "adverb", "intensifier"]
socalnames = {cat: abbrev + socalsuffix for cat, abbrev in zip(socalcats, socalabbrevs)}

socallex, socalmods = importSOCAL(socalfolder, socalnames)

Processed 1549 nouns; total count (singular + plural): 3070
Verbs (incl. conjugations): 3562
Adjectives & adverbs: 3422
Deleting 29 phrases (identified by underscores)
Total lexicon length: 10025
Modifiers: 210


In [138]:
# There are 2 entries with an accented letter that do not come through correctly:
# cliché and 'scslsiscshs�s s'. The latter can be deleted (not sure what it is supposed to be).
# The former should be corrected. Relatedly, there is an entry for clichd, which should be clichéd.

badkey = ""
for key, val in socallex.items():
    if key[:5] == "clich" and "e" not in key and "d" not in key:
        badkey = key
if len(badkey) > 0:
    print("Deleting bad version of cliché")
    del socallex[badkey]

badkey = ""
for key, val in socallex.items():
    if key[:4] == "scsl":
        badkey = key
if len(badkey) > 0:
    print('Deleting bad "scsl..." term')
    del socallex[badkey]

socallex["cliché"] = -2  # Same as for 'cliche'
socallex["clichés"] = -2  # Same as for 'cliches'

if "clichd" in socallex:
    del socallex["clichd"]
socallex["clichéd"] = -3  # Same as for 'cliched'

Deleting bad version of cliché
Deleting bad "scsl..." term
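
One plausible explanation for entries such as `clichd` is the `errors="ignore"` argument used when reading the files: bytes that are invalid in the decoding used by `open()` are silently dropped. The sketch below reproduces that effect under the assumption (not confirmed by the notebook) that the accented entries are stored in a single-byte encoding such as Latin-1 and are read as UTF-8. The variable names are illustrative.

```python
# Assumption: the dictionary file stores accented letters in Latin-1
raw_cliche = "cliché".encode("latin-1")    # b'clich\xe9'
raw_cliched = "clichéd".encode("latin-1")  # b'clich\xe9d'

# Decoding as UTF-8 with errors="ignore" drops the invalid byte entirely
lost_cliche = raw_cliche.decode("utf-8", errors="ignore")
lost_cliched = raw_cliched.decode("utf-8", errors="ignore")
print(lost_cliche)   # clich
print(lost_cliched)  # clichd

# Decoding with the matching encoding preserves the accent
kept_cliched = raw_cliched.decode("latin-1")
print(kept_cliched)  # clichéd
```

If that assumption holds, passing an explicit `encoding` argument to `open()` would avoid the loss at the source; the manual corrections in the cell above work regardless of the cause.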


In [139]:
# Specify some additional fixes to the SO-CAL lexicon
# Mostly these fix spelling errors

socallex_fix = {  # adjectives
    "redepemption": "",
    "suspensful": "",
    "uncoventional": "",
    "anti-climatic": "anti-climactic",
    "autorcratic": "autocratic",
    "devestating": "devastating",
    "digruntled": "disgruntled",
    "disasterous": "disastrous",
    "forlon": "forlorn",
    "futurisitic": "futuristic",
    "inprudent": "imprudent",
    "intractible": "intractable",
    "juandiced": "jaundiced",
    "less-than-desireable": "less-than-desirable",
    "obsure": "obscure",
    "opressive": "oppressive",
    "plebian": "plebeian",
    "priviledged": "privileged",
    "pgnacious": "pugnacious",
    "strenous": "strenuous",
    "uneveness": "unevenness",
    "unweildy": "unwieldy",
    "uproductive": "unproductive",
    "inpudent": "impudent",
    # adverbs
    "immensly": "immensely",  # this overwrites 'immensely' that was in there at valence 1
    "exceptionaly": "",  # just delete
    "digitaly": "",  # ,,
    "entirly": "entirely",
    "realy": "really",
    "disasterously": "disastrously",
}


In [140]:
# Implement fix
fix_lex(socallex, socallex_fix)

Deleted redepemption
Deleted suspensful
Deleted uncoventional
Replaced anti-climatic by anti-climactic
Replaced autorcratic by autocratic
Replaced devestating by devastating
Replaced digruntled by disgruntled
Replaced disasterous by disastrous
Replaced forlon by forlorn
Replaced futurisitic by futuristic
Replaced inprudent by imprudent
Replaced intractible by intractable
Replaced juandiced by jaundiced
Replaced less-than-desireable by less-than-desirable
Replaced obsure by obscure
Replaced opressive by oppressive
Replaced plebian by plebeian
Replaced priviledged by privileged
Replaced pgnacious by pugnacious
Replaced strenous by strenuous
Replaced uneveness by unevenness
Replaced unweildy by unwieldy
Replaced uproductive by unproductive
Replaced inpudent by impudent
Replaced immensly by immensely
Deleted exceptionaly
Deleted digitaly
Replaced entirly by entirely
Replaced realy by really
Replaced disasterously by disastrously


{'masterpiece': 5,
 'masterpieces': 5,
 'perfection': 5,
 'perfections': 5,
 'classic': 3.5,
 'classics': 4,
 'mastery': 4,
 'masteries': 4,
 'wonder': 4,
 'wonders': 4,
 'zenith': 4,
 'zeniths': 4,
 'beauty': 4,
 'beauties': 4,
 'bliss': 4,
 'blisses': 4,
 'brilliance': 4,
 'brilliances': 4,
 'culmination': 4,
 'culminations': 4,
 'ecstasy': 4,
 'ecstasies': 4,
 'excellence': 4,
 'excellences': 4,
 'exhilaration': 4,
 'exhilarations': 4,
 'genius': 4,
 'genii': 4,
 'magnificence': 4,
 'magnificences': 4,
 'tour-de-force': 4,
 'tour-de-forces': 4,
 'pinnacle': 4,
 'pinnacles': 4,
 'revelation': 4,
 'revelations': 4,
 'jubilation': 4,
 'jubilations': 4,
 'euphoria': 4,
 'euphorias': 4,
 'elation': 3.0,
 'elations': 3.0,
 'stud': 3,
 'studs': 3,
 'angel': 3,
 'angels': 3,
 'freshness': 3,
 'freshnesses': 3,
 'gaiety': 3,
 'gaieties': 3,
 'heroism': 3,
 'heroisms': 3,
 'honor': 2.5,
 'honors': 2.5,
 'intelligence': 3,
 'intelligences': 3,
 'love': 3.0,
 'profoundness': 3,
 'profoundnesses

In [141]:
# Fix the intensifier/modifier dictionary the same way

socalmods_fix = {
    "a_mutltidue_of": "a_multitude_of",
    "visable": "visible",
    "collossal": "colossal",
}

fix_lex(socalmods, socalmods_fix)

# Change one entry
socalmods["more"] = 0.5  # was -0.5 (i.e., a weakening rather than a strengthening)

# Add some additional entries
socalmods["absence_of"] = -1.5  # Negater, not in SO-CAL lexicon
socalmods["devoid_of"] = -1.5  # ,,
socalmods["lack_of"] = -1.5  # ,,
socalmods["not_very"] = -1.5  # Addition to parallel 'not_too'

# Add some intensifiers from an earlier version of SO-CAL
socalmods["low"] = -2.0
socalmods["some"] = -0.2
socalmods["obvious"] = 0.3
socalmods["lots_of"] = 0.3
socalmods["serious"] = 0.3


Replaced a_mutltidue_of by a_multitude_of
Replaced visable by visible
Replaced collossal by colossal
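
This notebook only prepares the modifier values; it does not show how they are applied. As a rough illustration, the sketch below assumes the percentage convention described in the SO-CAL paper, in which a modifier value `m` rescales the valence of the following word by a factor of `(1 + m)`. The function `apply_modifier` and the toy valence are hypothetical; the modifier values are taken from the cell above.

```python
def apply_modifier(valence, modifier):
    """Rescale a valence by (1 + modifier). Assumed convention, not shown in this notebook."""
    return valence * (1 + modifier)


toy_valence = 4  # hypothetical valence of some positive word
toy_mods = {"more": 0.5, "some": -0.2, "lack_of": -1.5}  # values set in the cell above

scaled = {mod: apply_modifier(toy_valence, m) for mod, m in toy_mods.items()}
for mod, value in scaled.items():
    print(mod, value)
```

Under this assumption, a positive value strengthens the following word, a value between -1 and 0 weakens it, and a value below -1 (such as the -1.5 used for the added negaters) flips its sign.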


In [142]:
# Save the updated versions. Add an 'X' to indicate they're modified

with open(socalfolder + "SO-CAL_lexiconX.csv", "w") as outfile:
    outwriter = csv.writer(outfile)
    outwriter.writerows(sorted(list(socallex.items())))

with open(socalfolder + "SO-CAL_modifiersX.csv", "w") as outfile:
    outwriter = csv.writer(outfile)
    outwriter.writerows(sorted(list(socalmods.items())))


### 8. WordStat

Provalis Research, the maker of WordStat, makes a basic sentiment dictionary available for public use. This dictionary used to be located at http://www.provalisresearch.com/wordstat/Sentiment-Analysis.html (URL no longer live, but probably accessible via the Wayback Machine), and that is the version we used through 2021. Our current set-up uses the newer version.

The new version (2.0, dated 2018) is at https://provalisresearch.com/products/content-analysis-software/wordstat-dictionary/sentiment-dictionaries/.


In [None]:
# CREATE directory
os.makedirs(f"{SAfolder}WordStat/WSD 2.0/", exist_ok=True)

wordstatfolder = SAfolder + "WordStat/WSD 2.0/"


'../MultiLexScaledCorpora/Lexica/English/MultiLexScaled/WordStat/WSD 2.0/'

In [150]:
def importWordStat(WSfolder, WSname, savepickle=False, savetext=True):
    """Import WordStat sentiment dictionary.

    The dictionary file begins with several good/bad expressions; skip these because our
    modifiers take care of them. This is followed by a list of negations (again, skip)
    and a list of "double negations" (again, skip).

    Finally, there are two sections titled "NEGATIVE WORDS" and "POSITIVE WORDS".
    These we load into the WordStat dictionary.

    The file is all caps, so make sure to lower-case things.
    """
    import csv
    import pickle

    words = {}
    poswords, negwords = 0, 0
    negationwords, doublenegwords = [], []
    curCat = "prelims"

    with open(WSfolder + WSname, "r", errors="ignore") as infile:
        for aWord in infile.readlines():
            aWord = aWord.lower().strip()
            if aWord == "negations":
                curCat = "negs"
            elif aWord == "double_negation":
                curCat = "doublenegs"
            elif aWord == "positive words":
                curCat = "positive"
            elif aWord == "negative words":
                curCat = "negative"
            elif aWord == "exceptions":
                curCat = "exceptions"
            elif curCat in ("positive", "negative", "negs", "doublenegs"):
                aWord = aWord[:-4]  # remove ' (1)' at end of each line
                if (
                    curCat == "negs"
                ):  # We don't actually do anything with negs and doublenegs, so could skip this
                    negationwords.append(aWord)
                elif curCat == "doublenegs":
                    doublenegwords.append(aWord.replace("_", " "))
                elif curCat == "positive":
                    words[aWord] = 1
                    poswords += 1
                else:  # curCat == 'negative'
                    words[aWord] = -1
                    negwords += 1

    print(
        "Dictionary contained {} positive and {} negative words (including wildcards) for a total of {} words".format(
            poswords, negwords, len(words)
        )
    )

    if savepickle:
        # Optionally add in the negationwords and doublenegwords here
        with open(WSfolder + "WordStatDictionary.pkl", "wb") as outfile:
            pickle.dump(words, outfile)
    if savetext:
        with open(WSfolder + "WordStat_lexicon.csv", "w") as outfile:
            outwriter = csv.writer(outfile)
            outwriter.writerows(sorted(list(words.items())))

    return words


In [151]:
# Load WordStat lexicon (version 2.0)
wordstatfolder = SAfolder + "WordStat/WSD 2.0/"
wordstatfile = "WordStat Sentiments.cat"

wslex2 = importWordStat(wordstatfolder, wordstatfile, savepickle=False, savetext=True)

Dictionary contained 4710 positive and 9579 negative words (including wildcards) for a total of 14289 words
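
The section-by-section parsing in `importWordStat` can be tried out on a small in-memory sample. The sketch below is a simplified variant, not the function above: the sample text is invented to mimic the layout described in the docstring (all-caps section headers, entries ending in ` (1)`), and the names `SECTION_VALENCE`, `parse_sections` and `toy_file` are illustrative.

```python
import io

# Map each section header to the valence assigned to its entries;
# None means the section is recognised but its entries are skipped.
SECTION_VALENCE = {
    "negations": None,
    "double_negation": None,
    "negative words": -1,
    "positive words": 1,
}


def parse_sections(lines):
    """Collect lower-cased entries from the positive and negative sections."""
    lexicon = {}
    valence = None
    for line in lines:
        entry = line.lower().strip()
        if entry in SECTION_VALENCE:
            valence = SECTION_VALENCE[entry]
        elif valence is not None and entry:
            if entry.endswith(" (1)"):
                entry = entry[:-4]  # drop the trailing ' (1)'
            lexicon[entry] = valence
    return lexicon


# Invented sample; the real file is much longer and its contents differ
toy_file = io.StringIO(
    "NEGATIONS\nNOT (1)\nDOUBLE_NEGATION\nNOT_WITHOUT (1)\n"
    "NEGATIVE WORDS\nABUS* (1)\nAWFUL (1)\n"
    "POSITIVE WORDS\nADMIR* (1)\nGOOD (1)\n"
)
toy_lex = parse_sections(toy_file)
print(toy_lex)  # {'abus*': -1, 'awful': -1, 'admir*': 1, 'good': 1}
```

Unlike the function above, this variant strips the suffix only when it is actually present and ignores blank lines, which guards against truncating or storing malformed entries.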


In [152]:
# 2.0 has phrases, with underscores; remove these
keys2delete = []
for key, val in wslex2.items():
    if "_" in key:
        keys2delete.append(key)
print("Deleting {} entries with underscores (phrases)".format(len(keys2delete)))
for key in keys2delete:
    del wslex2[key]

Deleting 258 entries with underscores (phrases)


In [153]:
# It also has a few lines beginning with @ that are comments; remove these too
keys2delete = []
for key, val in wslex2.items():
    if key.startswith("@"):
        keys2delete.append(key)
print("Deleting {} entries starting with @ (comments/notes)".format(len(keys2delete)))
for key in keys2delete:
    del wslex2[key]


Deleting 4 entries starting with @ (comments/notes)


In [154]:
# Filter out subsumptions (if they have the same valence) and save
wslex2 = lex_removesubsumed(wslex2)
with open(wordstatfolder + "WordStat_lexicon2X.csv", "w") as outfile:
    outwriter = csv.writer(outfile)
    outwriter.writerows(sorted(list(wslex2.items())))


abus* subsumes abuse
aggravat* subsumes aggravate
aggress* subsumes aggressive
agoniz* subsumes agonizing
alarm* subsumes alarm
annoy* subsumes annoy
annoy* subsumes annoyed
annoy* subsumes annoying
appall* subsumes appalling
arrogan* subsumes arroganc*
arrogant* subsumes arrogant
arrogan* subsumes arrogant*
attack* subsumes attacked
avoid* subsumes avoid
awe* subsumes aweful
awe* and aweful have different valences => keeping both
awkward* subsumes awkward
bitter* subsumes bitter
blam* subsumes blame
bore* subsumes bore
bother* subsumes bother
bother* subsumes bothered
bother* subsumes bothering
bother* subsumes bothersome
break* subsumes break
brok* subsumes broke
brok* subsumes broken
brutal* subsumes brutal
bullshit* subsumes bullshit
burden* subsumes burden
burden* subsumes burdensome
challeng* subsumes challenging
challeng* and challenging have different valences => keeping both
chao* subsumes chaotic
cheat* subsumes cheat
cheat* subsumes cheated
complain* subsumes complain
compla
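
The log above reflects a simple idea: an entry is redundant when a wildcard entry with a shorter stem already matches it and carries the same valence. The sketch below illustrates that idea on a toy lexicon. It is not the notebook's `lex_removesubsumed` function, which is defined earlier, and the name `remove_subsumed_sketch` and the toy valences are assumptions made for illustration.

```python
def remove_subsumed_sketch(lex):
    """Drop entries covered by another wildcard entry with the same valence."""
    # Stems of all wildcard entries, e.g. 'abus*' -> 'abus'
    stems = {key[:-1]: val for key, val in lex.items() if key.endswith("*")}
    kept = {}
    for key, val in lex.items():
        bare = key.rstrip("*")
        covered = any(
            bare.startswith(stem) and stem_val == val and stem + "*" != key
            for stem, stem_val in stems.items()
        )
        if not covered:
            kept[key] = val
    return kept


# Toy lexicon; valences are invented for the example
toy_lex = {
    "abus*": -1,
    "abuse": -1,
    "arrogan*": -1,
    "arrogant*": -1,
    "awe*": 1,
    "aweful": -1,
}
filtered = remove_subsumed_sketch(toy_lex)
print(filtered)  # {'abus*': -1, 'arrogan*': -1, 'awe*': 1, 'aweful': -1}
```

As in the log, a wildcard can subsume another wildcard (`arrogan*` covers `arrogant*`), while an entry whose valence differs from that of the covering wildcard (`aweful` versus `awe*` here) is kept.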

### Done!