# Loughran-McDonald Dictionary for Sentiment Analysis

Data reference: [Notre Dame](https://sraf.nd.edu/loughranmcdonald-master-dictionary/)

Processing reference: [Wharton](https://wrds-www.wharton.upenn.edu/pages/classroom/sec-filings-dictionary-based-sentiment-analysis/)

## Load in dictionary

In [1]:
import csv
import glob
import re
import string
import sys
import datetime as dt

In [2]:
def utf8len(s):
    """helper function to get the size of string"""
    return len(s.encode("utf-8"))

In [3]:
# Load your master dictionary file. This file requires a
# Word column and a Syllables column. Other columns are optional
# and should be listed in the SENTIMENT_OUTPUT_FIELDS list below.
master_dictionary_file = "Loughran-McDonald_MasterDictionary_1993-2024.csv"

In [4]:
# Load the master dictionary CSV file into a Python dictionary
# with Word as the key.
master_dictionary = {}
with open(master_dictionary_file, newline="") as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=",")
    for row in csv_reader:
        master_dictionary[row["Word"].lower()] = row
print(f"master dictionary has {len(master_dictionary)} words.")

master dictionary has 86553 words.


The dictionary is now loaded into memory. Let's inspect what information it contains for an example word.

In [5]:
master_dictionary["key"]

{'Word': 'KEY',
 'Seq_num': '40955',
 'Word Count': '4987086',
 'Word Proportion': '0.00019519456055297713',
 'Average Proportion': '0.00029559214410589807',
 'Std Dev': '0.0012756908126321722',
 'Doc Count': '1100659',
 'Negative': '0',
 'Positive': '0',
 'Uncertainty': '0',
 'Litigious': '0',
 'Strong_Modal': '0',
 'Weak_Modal': '0',
 'Constraining': '0',
 'Complexity': '0',
 'Syllables': '1',
 'Source': '12of12inf'}

Normalize each entry by lowercasing its field names and the word itself.

In [6]:
for key, item in master_dictionary.items():
    # Lowercase the field names, then the word itself.
    item = {item_key.lower(): item_v for item_key, item_v in item.items()}
    item['word'] = item['word'].lower()
    master_dictionary[key] = item

print(f"master dictionary has {len(master_dictionary)} words.")

master dictionary has 86553 words.


Convert numeric fields from strings to `int` or `float`

In [7]:
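for key, item in master_dictionary.items():
    for field, value in item.items():
        # Keep the word itself as text: float() would parse a word like 'infinity'.
        if not isinstance(value, str) or field == "word":
            continue
        if value.isdigit():
            item[field] = int(value)
        else:
            try:
                item[field] = float(value)
            except ValueError:
                # Non-numeric fields such as 'source' stay as strings.
                pass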

In [8]:
master_dictionary["key"]

{'word': 'key',
 'seq_num': 40955,
 'word count': 4987086,
 'word proportion': 0.00019519456055297713,
 'average proportion': 0.00029559214410589807,
 'std dev': 0.0012756908126321722,
 'doc count': 1100659,
 'negative': 0,
 'positive': 0,
 'uncertainty': 0,
 'litigious': 0,
 'strong_modal': 0,
 'weak_modal': 0,
 'constraining': 0,
 'complexity': 0,
 'syllables': 1,
 'source': '12of12inf'}

In [9]:
type(master_dictionary['key']['word proportion'])

float

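A nonzero sentiment value records the *year* the word was added to that sentiment list, so the values are effectively *categorical*: zero means the word is not on the list, nonzero means it is.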
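
Because the nonzero entries are years, list membership can be read as a plain boolean. A minimal sketch, assuming rows shaped like the entries in this notebook; `sample_rows` holds only the three fields of interest (values copied from the "key" and "happy" outputs in this notebook), and the helper name `sentiment_flags` is made up for illustration:

```python
# Two trimmed-down rows in the same shape as master_dictionary entries.
sample_rows = {
    "key": {"negative": 0, "positive": 0, "uncertainty": 0},
    "happy": {"negative": 0, "positive": 2009, "uncertainty": 0},
}


def sentiment_flags(row, fields=("negative", "positive", "uncertainty")):
    """Map each year-added value to True (on the list) or False (not on it)."""
    return {field: row[field] != 0 for field in fields}


for word, row in sample_rows.items():
    print(word, sentiment_flags(row))
```

The year itself is discarded here, which is the same reading the scoring code below relies on.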

## Calculate score of document based on sentiment

- Assumes the input is a list of words/tokens
- Scores three sentiments: negative, positive, and uncertainty
- Excludes the remaining categories: litigious, strong_modal, weak_modal, constraining, and complexity
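
The first assumption means raw text has to be turned into lowercase tokens before scoring. A minimal sketch of one way to do that, assuming plain English text; the function name `simple_tokenize` and the letters-only rule are illustrative choices, not part of the referenced processing code:

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Lowercase the text and keep each run of ASCII letters as one token."""
    return re.findall(r"[a-z]+", text.lower())


print(simple_tokenize("Terrible, horrible; very BAD day!"))
# ['terrible', 'horrible', 'very', 'bad', 'day']
```

This rule splits contractions at the apostrophe ("don't" becomes "don" and "t") and drops digits, which may or may not suit a given corpus.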

In [10]:
# The SENTIMENT_OUTPUT_FIELDS list below contains the sentiment fields we want
# to include.
SENTIMENT_OUTPUT_FIELDS = [
    "negative",
    "positive",
    "uncertainty",
]

In [11]:
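# Assumes doc has been cleaned and lowercased
def calculate_sentiment_score(doc: list[str]):
    """Return, per sentiment, the share of dictionary tokens flagged with it."""
    token_count = 0
    sentiment_counts = {k: 0 for k in SENTIMENT_OUTPUT_FIELDS}
    for token in doc:
        if token in master_dictionary:
            token_count += 1
            for sentiment in SENTIMENT_OUTPUT_FIELDS:
                sentiment_counts[sentiment] += int(master_dictionary[token][sentiment] != 0)
    if token_count == 0:
        # No token was found in the dictionary; avoid dividing by zero.
        return {k: 0.0 for k in SENTIMENT_OUTPUT_FIELDS}
    return {k: v / token_count for k, v in sentiment_counts.items()}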

In [12]:
test_doc = "terrible horrible very bad day".split(" ")

In [13]:
for token in test_doc:
    print(master_dictionary[token])

{'word': 'terrible', 'seq_num': 76799, 'word count': 2963, 'word proportion': 1.1597182862266085e-07, 'average proportion': 1.0795392194729806e-07, 'std dev': 1.6821892318184515e-05, 'doc count': 276, 'negative': 0, 'positive': 0, 'uncertainty': 0, 'litigious': 0, 'strong_modal': 0, 'weak_modal': 0, 'constraining': 0, 'complexity': 0, 'syllables': 3, 'source': '12of12inf'}
{'word': 'horrible', 'seq_num': 35552, 'word count': 151, 'word proportion': 5.910140439426861e-09, 'average proportion': 9.87359857861409e-09, 'std dev': 1.502871043089561e-06, 'doc count': 102, 'negative': 0, 'positive': 0, 'uncertainty': 0, 'litigious': 0, 'strong_modal': 0, 'weak_modal': 0, 'constraining': 0, 'complexity': 0, 'syllables': 3, 'source': '12of12inf'}
{'word': 'very', 'seq_num': 83165, 'word count': 774789, 'word proportion': 3.0325243714722504e-05, 'average proportion': 3.322288557482321e-05, 'std dev': 9.990595973362199e-05, 'doc count': 363334, 'negative': 0, 'positive': 0, 'uncertainty': 0, 'liti

In [14]:
calculate_sentiment_score(test_doc)

{'negative': 0.2, 'positive': 0.0, 'uncertainty': 0.0}

In [15]:
test_doc2 = "happy sunny awesome ice cream sundae cool".split(" ")

In [16]:
for token in test_doc2:
    print(master_dictionary[token])

{'word': 'happy', 'seq_num': 33779, 'word count': 9917, 'word proportion': 3.8815140885957736e-07, 'average proportion': 3.8237742179555884e-07, 'std dev': 1.434768683064166e-05, 'doc count': 5111, 'negative': 0, 'positive': 2009, 'uncertainty': 0, 'litigious': 0, 'strong_modal': 0, 'weak_modal': 0, 'constraining': 0, 'complexity': 0, 'syllables': 2, 'source': '12of12inf'}
{'word': 'sunny', 'seq_num': 74714, 'word count': 6486, 'word proportion': 2.5386205887498427e-07, 'average proportion': 2.7938888778724973e-07, 'std dev': 1.3410517873374067e-05, 'doc count': 2387, 'negative': 0, 'positive': 0, 'uncertainty': 0, 'litigious': 0, 'strong_modal': 0, 'weak_modal': 0, 'constraining': 0, 'complexity': 0, 'syllables': 2, 'source': '12of12inf'}
{'word': 'awesome', 'seq_num': 4726, 'word count': 723, 'word proportion': 2.8298222104010733e-08, 'average proportion': 2.742281749408698e-08, 'std dev': 3.951523000354606e-06, 'doc count': 339, 'negative': 0, 'positive': 0, 'uncertainty': 0, 'litig

In [17]:
calculate_sentiment_score(test_doc2)

{'negative': 0.0, 'positive': 0.14285714285714285, 'uncertainty': 0.0}
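
Written out, the score computed for a sentiment $s$ is the share of dictionary tokens flagged with $s$. In the notation below (introduced here only for illustration), $t_1, \dots, t_N$ are the tokens of the document that appear in the master dictionary, and $d(t)_s$ is the dictionary value of token $t$ for sentiment $s$:

$$
\text{score}_s = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ d(t_i)_s \neq 0 \right]
$$

This agrees with the two outputs above: one of the five tokens in the first document is flagged negative, giving $1/5 = 0.2$, and one of the seven tokens in the second ("happy") is flagged positive, giving $1/7 \approx 0.1429$.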

## Now use the `LMSentimentDict` class from the `lm_sentiment` module

In [18]:
from importlib import reload
from lm_sentiment import LMSentimentDict

In [19]:
sentiment_dict = LMSentimentDict(master_dictionary_file, SENTIMENT_OUTPUT_FIELDS)

In [20]:
sentiment_dict.master_dictionary['happy']

{'word': 'happy',
 'seq_num': 33779,
 'word count': 9917,
 'word proportion': 3.8815140885957736e-07,
 'average proportion': 3.8237742179555884e-07,
 'std dev': 1.434768683064166e-05,
 'doc count': 5111,
 'negative': 0,
 'positive': 2009,
 'uncertainty': 0,
 'litigious': 0,
 'strong_modal': 0,
 'weak_modal': 0,
 'constraining': 0,
 'complexity': 0,
 'syllables': 2,
 'source': '12of12inf'}

In [25]:
print(test_doc, test_doc2)

['terrible', 'horrible', 'very', 'bad', 'day'] ['happy', 'sunny', 'awesome', 'ice', 'cream', 'sundae', 'cool']


In [21]:
print(sentiment_dict.calculate_sentiment_score(test_doc))
print(sentiment_dict.calculate_sentiment_score(test_doc2))

{'negative': 0.2, 'positive': 0.0, 'uncertainty': 0.0}
{'negative': 0.0, 'positive': 0.14285714285714285, 'uncertainty': 0.0}
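
The `lm_sentiment` module itself is not reproduced in this notebook. The sketch below is a hypothetical stand-in, assuming the class simply wraps the loading and scoring steps from the earlier cells; the class name `SketchSentimentDict`, the temporary CSV, and its three rows are all made up so the example runs on its own, and the real module may differ:

```python
import csv
import os
import tempfile


class SketchSentimentDict:
    """Hypothetical stand-in for LMSentimentDict; not the real module."""

    def __init__(self, dictionary_path, sentiment_fields):
        self.sentiment_fields = list(sentiment_fields)
        self.master_dictionary = {}
        with open(dictionary_path, newline="", encoding="utf-8") as csv_file:
            for row in csv.DictReader(csv_file):
                # Lowercase field names and the word, as in the cells above.
                entry = {name.lower(): value for name, value in row.items()}
                entry["word"] = entry["word"].lower()
                for field in self.sentiment_fields:
                    entry[field] = int(entry[field])
                self.master_dictionary[entry["word"]] = entry

    def calculate_sentiment_score(self, doc):
        # Only tokens found in the dictionary count toward the denominator.
        known = [t for t in doc if t in self.master_dictionary]
        if not known:
            return {field: 0.0 for field in self.sentiment_fields}
        return {
            field: sum(self.master_dictionary[t][field] != 0 for t in known) / len(known)
            for field in self.sentiment_fields
        }


# Tiny made-up dictionary file (not real Loughran-McDonald data).
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("Word,Negative,Positive\nBAD,2009,0\nHAPPY,0,2009\nDAY,0,0\n")
    tmp_path = tmp.name

sketch = SketchSentimentDict(tmp_path, ["negative", "positive"])
demo_scores = sketch.calculate_sentiment_score(["bad", "day", "unknownword"])
os.remove(tmp_path)
print(demo_scores)
# {'negative': 0.5, 'positive': 0.0}
```

Wrapping the dictionary in a class keeps the loaded data and the chosen sentiment fields together, so the scorer no longer depends on notebook-level globals.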


## Run on beige_book_1996_2025.csv

In [22]:
import pandas as pd

In [36]:
bbdf = pd.read_csv('beige_book_1996_2025.csv')

In [37]:
bbdf.head()

Unnamed: 0,year,month,url,text,timestamp
0,1996,10,https://www.federalreserve.gov/fomc/beigebook/...,moderate expansion of business activity charac...,1996-10-01
1,1996,12,https://www.federalreserve.gov/fomc/beigebook/...,moderate economic growth continues to be repor...,1996-12-01
2,1997,1,https://www.federalreserve.gov/fomc/beigebook/...,most district reports characterized early autu...,1997-01-01
3,1997,3,https://www.federalreserve.gov/fomc/beigebook/...,district economies generally continue to expan...,1997-03-01
4,1997,5,https://www.federalreserve.gov/fomc/beigebook/...,district economies generally continued to expa...,1997-05-01


In [38]:
negative_scores = []
positive_scores = []
uncertainty_scores = []
for t in bbdf['text']:
    doc_split = t.split(' ')
    score = sentiment_dict.calculate_sentiment_score(doc_split)
    negative_scores.append(score['negative'])
    positive_scores.append(score['positive'])
    uncertainty_scores.append(score['uncertainty'])

In [39]:
bbdf['negative_score'] = negative_scores
bbdf['positive_score'] = positive_scores
bbdf['uncertainty_score'] = uncertainty_scores
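
The scoring loop and column assignments can also be written without the three intermediate lists. A minimal sketch on a toy frame; `toy_score`, its two word sets, and `toy_df` are made up here as stand-ins for `sentiment_dict.calculate_sentiment_score` and `bbdf`:

```python
import pandas as pd

# Hypothetical word lists, used only to make the example self-contained.
TOY_NEGATIVE = {"bad", "weak"}
TOY_POSITIVE = {"strong", "happy"}


def toy_score(tokens):
    """Stand-in scorer returning one dict of scores per document."""
    n = len(tokens)
    if n == 0:
        return {"negative": 0.0, "positive": 0.0}
    return {
        "negative": sum(t in TOY_NEGATIVE for t in tokens) / n,
        "positive": sum(t in TOY_POSITIVE for t in tokens) / n,
    }


toy_df = pd.DataFrame({"text": ["strong growth", "weak bad demand here"]})

# Score every row, then expand the list of dicts into one column per sentiment.
scores = pd.DataFrame(
    [toy_score(text.split()) for text in toy_df["text"]],
    index=toy_df.index,
).add_suffix("_score")

toy_df = toy_df.join(scores)
print(toy_df)
```

Building the score columns from the returned dicts means a change to the list of sentiments does not require editing the loop.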

In [40]:
bbdf.head()

Unnamed: 0,year,month,url,text,timestamp,negative_score,positive_score,uncertainty_score
0,1996,10,https://www.federalreserve.gov/fomc/beigebook/...,moderate expansion of business activity charac...,1996-10-01,0.029883,0.029883,0.010204
1,1996,12,https://www.federalreserve.gov/fomc/beigebook/...,moderate economic growth continues to be repor...,1996-12-01,0.014639,0.024703,0.007319
2,1997,1,https://www.federalreserve.gov/fomc/beigebook/...,most district reports characterized early autu...,1997-01-01,0.035156,0.026367,0.004883
3,1997,3,https://www.federalreserve.gov/fomc/beigebook/...,district economies generally continue to expan...,1997-03-01,0.023581,0.020087,0.006114
4,1997,5,https://www.federalreserve.gov/fomc/beigebook/...,district economies generally continued to expa...,1997-05-01,0.027429,0.028997,0.001567


In [31]:
max(bbdf['negative_score'])

0.06366630076838639

In [32]:
min(bbdf['negative_score'])

0.010775862068965518

In [33]:
print(max(bbdf['positive_score']))
print(min(bbdf['positive_score']))

0.03530751708428246
0.0023391812865497076


In [None]:
print(max(bbdf['uncertainty_score']))
print(min(bbdf['uncertainty_score']))

0.019243530192435302
0.0
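
The separate min/max checks can be collected into one table with `DataFrame.agg`. A minimal sketch, using made-up numbers in `toy_scores` as a stand-in for the three score columns of `bbdf`:

```python
import pandas as pd

# Made-up scores standing in for the three score columns of bbdf.
toy_scores = pd.DataFrame(
    {
        "negative_score": [0.01, 0.03, 0.02],
        "positive_score": [0.02, 0.01, 0.03],
        "uncertainty_score": [0.0, 0.005, 0.01],
    }
)

# One row per statistic, one column per sentiment.
summary = toy_scores.agg(["min", "max"])
print(summary)
```

On the real frame, the same call would be applied to `bbdf[['negative_score', 'positive_score', 'uncertainty_score']]`.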


In [41]:
# index=False avoids writing the row index as another unnamed column.
bbdf.to_csv("beige_book_sentiment_scores_1996_2025.csv", index=False)