This notebook covers the text analysis stage, where I compute a set of text analysis metrics for the financial reports.

In [1]:
import pandas as pd

In [2]:
corpus = pd.read_csv('corpus.csv')
corpus.head()

Unnamed: 0.1,Unnamed: 0,0,1,2
0,https://www.sec.gov/Archives/edgar/data/3662/0...,'item 7. management\'s discussion analysis f...,,
1,https://www.sec.gov/Archives/edgar/data/3662/0...,"""item 2. management's discussion analysis fi...",,
2,https://www.sec.gov/Archives/edgar/data/3662/0...,,,
3,https://www.sec.gov/Archives/edgar/data/3662/0...,'item 7. management\'s discussion analysis f...,,
4,https://www.sec.gov/Archives/edgar/data/3662/0...,,,


## 1. Analysis of Sentiment

The table above is a sample of the extracted data to work with. Stopwords were already removed during the extraction phase, so that step is not repeated here. To compute sentiment metrics, dictionaries of positive and negative words are needed first; I use the "Loughran McDonald Master Dictionary 2018" for that purpose.

### Creating Positive and Negative Word Dictionaries (Lists)

In [3]:
word_dict = pd.read_csv('LoughranMcDonald_MasterDictionary_2018.csv')
word_dict.head()

Unnamed: 0,Word,Sequence Number,Word Count,Word Proportion,Average Proportion,Std Dev,Doc Count,Negative,Positive,Uncertainty,Litigious,Constraining,Superfluous,Interesting,Modal,Irr_Verb,Harvard_IV,Syllables,Source
0,AARDVARK,1,277,1.48e-08,1.24e-08,3.56e-06,84,0,0,0,0,0,0,0,0,0,0,2,12of12inf
1,AARDVARKS,2,3,1.6e-10,9.73e-12,9.86e-09,1,0,0,0,0,0,0,0,0,0,0,2,12of12inf
2,ABACI,3,8,4.28e-10,1.39e-10,6.23e-08,7,0,0,0,0,0,0,0,0,0,0,3,12of12inf
3,ABACK,4,12,6.41e-10,3.16e-10,9.38e-08,12,0,0,0,0,0,0,0,0,0,0,2,12of12inf
4,ABACUS,5,7250,3.87e-07,3.68e-07,3.37e-05,914,0,0,0,0,0,0,0,0,0,0,3,12of12inf


In [4]:
word_dict.Negative.unique()

array([   0, 2009, 2011, 2014, 2012])

In [5]:
word_dict.Positive.unique()

array([   0, 2009, 2012, 2011])

As the unique values show, the `Positive` and `Negative` columns contain either 0 or a non-zero value such as 2009. I therefore take the words with a non-zero entry in each column as the positive and negative words respectively.

In [6]:
neg_words = word_dict.Word[word_dict.Negative != 0]
neg_words = list(neg_words)

In [7]:
pos_words = word_dict.Word[word_dict.Positive != 0]
pos_words = list(pos_words)

In [8]:
def make_lower(word_list):
    # Lowercase every word so it matches the lowercase report text
    return [word.lower() for word in word_list]

In [9]:
pos_words = make_lower(pos_words)
neg_words = make_lower(neg_words)
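Each token is later checked against these lists with `in`, which scans a list from the start on every lookup. The following is a minimal sketch of the same selection step using sets, which answer membership tests in roughly constant time. The `rows` sample and the `words_flagged` helper are illustrative assumptions, not part of the notebook.

```python
# Hypothetical sample rows mimicking the Word/Negative/Positive columns;
# the values are illustrative, not copied from the dictionary file.
rows = [
    {"Word": "ABANDON", "Negative": 2009, "Positive": 0},
    {"Word": "ACHIEVE", "Negative": 0, "Positive": 2009},
    {"Word": "AARDVARK", "Negative": 0, "Positive": 0},
]


def words_flagged(rows, column):
    """Return the lowercase words whose flag in `column` is non-zero."""
    return {row["Word"].lower() for row in rows if row[column] != 0}


neg_set = words_flagged(rows, "Negative")
pos_set = words_flagged(rows, "Positive")
print(neg_set, pos_set)
```

In the notebook itself, wrapping the existing lists as `set(pos_words)` and `set(neg_words)` would give the same speed-up without changing any score.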

### Extracting the Derived Variables

The text should be tokenised first (using the `nltk` library) so that individual words can be matched against the dictionaries. The first two derived metrics can then be calculated: the positive score and the negative score. I first work through a single report and then generalise.

In [10]:
text = corpus['0'][0]

In [11]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)

In [12]:
positive_score, negative_score = 0, 0
for token in tokens:
    if token in pos_words:
        positive_score += 1
    if token in neg_words:
        negative_score += 1  # count negative hits directly as a positive number
print("Positive score = " + str(positive_score))
print("Negative score = " + str(negative_score))

Positive score = 12
Negative score = 16


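Now the polarity and subjectivity scores can be calculated. Polarity is the difference between the positive and negative scores divided by their sum, and subjectivity is their sum divided by the number of tokens. The small constant in each denominator avoids division by zero.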

In [13]:
polarity_score = (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)
subjectivity_score = (positive_score + negative_score) / (len(tokens) + 0.000001)
print("Polarity score = "+ str(polarity_score))
print("Subjectivity score = " + str(subjectivity_score))

Polarity score = -0.14285713775510223
Subjectivity score = 0.03305785120064008
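As a check on the arithmetic, here is a small sketch that wraps the two formulas in a function and reruns them on the counts printed above. The function name `sentiment_scores` is an assumption, and the token count of 847 is inferred from the printed subjectivity score rather than stated anywhere in the notebook.

```python
EPSILON = 0.000001  # same smoothing constant as in the cell above


def sentiment_scores(positive_score, negative_score, n_tokens):
    """Return (polarity, subjectivity) computed from raw counts."""
    total = positive_score + negative_score
    polarity = (positive_score - negative_score) / (total + EPSILON)
    subjectivity = total / (n_tokens + EPSILON)
    return polarity, subjectivity


# 12 positive and 16 negative hits; 847 tokens is an inferred value.
polarity, subjectivity = sentiment_scores(12, 16, 847)
print("Polarity score = " + str(polarity))
print("Subjectivity score = " + str(subjectivity))
```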


### Sentiment Score Categorisation

Polarity lies between -1 and 1. A text is classified as:
1. Most negative if polarity is below -0.5
2. Negative if polarity is between -0.5 and 0
3. Neutral if polarity is 0
4. Positive if polarity is between 0 and 0.5
5. Most positive if polarity is above 0.5
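The categories can be sketched as a small function. The name `categorise_polarity` is hypothetical, and assigning the exact boundary values -0.5 and 0.5 to the milder categories is an assumption, since the list does not say which side they belong to.

```python
def categorise_polarity(polarity):
    """Map a polarity score to one of the five sentiment categories."""
    if polarity < -0.5:
        return "most negative"
    if polarity < 0:
        return "negative"  # includes -0.5 itself (assumption)
    if polarity == 0:
        return "neutral"
    if polarity <= 0.5:
        return "positive"  # includes 0.5 itself (assumption)
    return "most positive"


for score in (-1.0, -0.142857, 0.0, 0.3, 0.9):
    print(score, categorise_polarity(score))
```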

### Generalisation

In [14]:
def find_pos_neg_score(tokens):
    positive_score, negative_score = 0, 0
    for token in tokens:
        if token in pos_words:
            positive_score += 1
        if token in neg_words:
            negative_score += 1
    polarity_score = (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)
    pos_word_prop = positive_score / len(tokens)
    neg_word_prop = negative_score / len(tokens)
    return positive_score, negative_score, polarity_score, pos_word_prop, neg_word_prop

In [15]:
mda_positive_score, mda_negative_score, mda_polarity_score = [], [], []
mda_positive_word_proportion, mda_negative_word_proportion = [], []

for mda in corpus['0']:
    tokens = word_tokenize(str(mda))
    positive_score, negative_score, polarity_score, pos_word_prop, neg_word_prop = find_pos_neg_score(tokens)
    mda_positive_score.append(positive_score)
    mda_negative_score.append(negative_score)
    mda_polarity_score.append(polarity_score)
    mda_positive_word_proportion.append(pos_word_prop)
    mda_negative_word_proportion.append(neg_word_prop)

In [16]:
qqdmr_positive_score, qqdmr_negative_score, qqdmr_polarity_score = [], [], []
qqdmr_positive_word_proportion, qqdmr_negative_word_proportion = [], []

for qqdmr in corpus['1']:
    tokens = word_tokenize(str(qqdmr))
    positive_score, negative_score, polarity_score, pos_word_prop, neg_word_prop = find_pos_neg_score(tokens)
    qqdmr_positive_score.append(positive_score)
    qqdmr_negative_score.append(negative_score)
    qqdmr_polarity_score.append(polarity_score)
    qqdmr_positive_word_proportion.append(pos_word_prop)
    qqdmr_negative_word_proportion.append(neg_word_prop)

In [17]:
rf_positive_score, rf_negative_score, rf_polarity_score = [], [], []
rf_positive_word_proportion, rf_negative_word_proportion = [], []

for rf in corpus['2']:
    tokens = word_tokenize(str(rf))
    positive_score, negative_score, polarity_score, pos_word_prop, neg_word_prop = find_pos_neg_score(tokens)
    rf_positive_score.append(positive_score)
    rf_negative_score.append(negative_score)
    rf_polarity_score.append(polarity_score)
    rf_positive_word_proportion.append(pos_word_prop)
    rf_negative_word_proportion.append(neg_word_prop)
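The three cells above differ only in the column they read and the prefix of the lists they fill. One possible way to avoid the repetition is sketched below; `score_sections` and `toy_metric` are hypothetical names, and `toy_metric` is a stand-in for `find_pos_neg_score` so that the example runs without `nltk` or the dictionary.

```python
def score_sections(columns, metric_fn, metric_names):
    """Apply metric_fn to every text of every section.

    columns maps a section prefix (e.g. 'mda') to an iterable of texts.
    Returns a dict such as {'mda_positive_score': [...], ...}.
    """
    results = {}
    for prefix, texts in columns.items():
        for name in metric_names:
            results[f"{prefix}_{name}"] = []
        for text in texts:
            values = metric_fn(text)
            for name, value in zip(metric_names, values):
                results[f"{prefix}_{name}"].append(value)
    return results


def toy_metric(text):
    """Stand-in metric: counts two hard-coded words."""
    tokens = str(text).split()
    return tokens.count("gain"), tokens.count("loss")


scores = score_sections(
    {"mda": ["gain gain loss", "loss"], "rf": ["gain"]},
    toy_metric,
    ["positive_score", "negative_score"],
)
print(scores)
```

In the notebook, `columns` could be `{'mda': corpus['0'], 'qqdmr': corpus['1'], 'rf': corpus['2']}` with a metric function that tokenises the text and calls `find_pos_neg_score`. Note that a missing section arrives as NaN and is scored as the literal string `'nan'` once passed through `str(...)`, in the cells above and in this sketch alike.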

## 2. Analysis of Readability

The readability metrics are average sentence length, percentage of complex words and fog index. Again, I start with a single report to make the steps clear and generalise afterwards.

To find the average sentence length, the total numbers of words and sentences are required for each text.

In [18]:
text = corpus['0'][0]
words = text.split()
sentences = text.split('.')
print(len(words))
print(len(sentences))

754
54


### Extracting Derived Variables

In [19]:
avg_sentence_len = len(words) / len(sentences)
print("Average Sentence Length: " + str(avg_sentence_len))

Average Sentence Length: 13.962962962962964
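Splitting on every `.` also breaks decimal numbers apart and leaves an empty piece after the final full stop, which inflates the sentence count. A rough alternative, assuming that a sentence ends with `.`, `!` or `?` followed by whitespace, is sketched below; `split_sentences` and the sample text are illustrative only.

```python
import re


def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace; drop empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = "item 7. revenue rose 3.5 percent. costs fell."
print(len(sample.split(".")))        # naive count used above
print(len(split_sentences(sample)))  # regex-based count
```

This is still a heuristic: a heading such as "item 7." counts as its own sentence. `sent_tokenize` from `nltk.tokenize` would be another option.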


To find the percentage of complex words, the number of complex words in a text is needed. Complex words are those with more than two syllables; note that the cells below test `syllables >= 2`, which also counts two-syllable words. Syllables can be counted with `TextStat`.

In [20]:
from textstat.textstat import textstatistics

In [21]:
num_complex_words = 0
for word in words:
    syllables = textstatistics().syllable_count(word)
    if syllables >= 2:
        num_complex_words += 1
print(num_complex_words)

508


In [22]:
perc_complex_words = num_complex_words / len(words)
print("Percentage of Complex words = "+ str(perc_complex_words))

Percentage of Complex words = 0.6737400530503979
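The sketch below makes the syllable threshold a parameter, so that the strict definition (more than two syllables) and the `>= 2` test used in the cell above can be compared side by side. `naive_syllables` is a crude stand-in that counts runs of vowels and is not how `TextStat` counts syllables; both function names are assumptions.

```python
import re


def naive_syllables(word):
    """Rough stand-in for a syllable counter: counts runs of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def count_complex(words, syllable_count=naive_syllables, min_syllables=3):
    """Count the words with at least `min_syllables` syllables."""
    return sum(1 for word in words if syllable_count(word) >= min_syllables)


words = ["cash", "profit", "revenue", "management"]
strict_count = count_complex(words)                  # more than two syllables
loose_count = count_complex(words, min_syllables=2)  # two or more syllables
print(strict_count, loose_count)
```

In the notebook, `textstatistics().syllable_count` could be passed as `syllable_count` in place of the stand-in.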


In [23]:
fog_index = 0.4 * (avg_sentence_len + perc_complex_words)
print("Fog index = "+ str(fog_index))

Fog index = 5.8546812064053455
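The cell above adds the complex-word share as a fraction (about 0.67). The Gunning fog index is usually stated with that share as a percentage, that is, multiplied by 100. Which convention is intended here is not stated, so the sketch below supports both through a flag; the function and its `as_percentage` parameter are assumptions for illustration.

```python
def fog_index(n_words, n_sentences, n_complex, as_percentage=False):
    """Compute 0.4 * (average sentence length + share of complex words)."""
    avg_sentence_len = n_words / n_sentences
    complex_share = n_complex / n_words
    if as_percentage:
        complex_share *= 100  # express the share as a percentage
    return 0.4 * (avg_sentence_len + complex_share)


# Counts taken from the cells above: 754 words, 54 sentences, 508 complex words.
notebook_value = fog_index(754, 54, 508)
percent_value = fog_index(754, 54, 508, as_percentage=True)
print(notebook_value, percent_value)
```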


### Generalisation

In [24]:
def get_complex_words(words):
    num_complex_words = 0
    for word in words:
        syllables = textstatistics().syllable_count(word)
        if syllables >= 2:
            num_complex_words += 1
    return num_complex_words

In [25]:
def calc_read_metrics(words, sentences):
    avg_sentence_len = len(words) / len(sentences)
    num_complex_words = get_complex_words(words)
    perc_complex_words = num_complex_words / len(words)
    fog_index = 0.4 * (avg_sentence_len + perc_complex_words)
    return avg_sentence_len, perc_complex_words, fog_index, len(words)

In [26]:
mda_average_sentence_length, mda_percentage_of_complex_words = [], []
mda_fog_index, mda_complex_word_count = [], []
mda_word_count = []

for text in corpus['0']:
    words = str(text).split()
    sentences = str(text).split('.')
    avg_sentence_len, perc_complex_words, fog_index, num_words = calc_read_metrics(words, sentences)
    num_complex_words = get_complex_words(words)
    
    mda_average_sentence_length.append(avg_sentence_len)
    mda_percentage_of_complex_words.append(perc_complex_words)
    mda_fog_index.append(fog_index)
    mda_complex_word_count.append(num_complex_words)
    mda_word_count.append(num_words)

In [27]:
qqdmr_average_sentence_length, qqdmr_percentage_of_complex_words = [], []
qqdmr_fog_index, qqdmr_complex_word_count = [], []
qqdmr_word_count = []

for text in corpus['1']:
    words = str(text).split()
    sentences = str(text).split('.')
    avg_sentence_len, perc_complex_words, fog_index, num_words = calc_read_metrics(words, sentences)
    num_complex_words = get_complex_words(words)
    
    qqdmr_average_sentence_length.append(avg_sentence_len)
    qqdmr_percentage_of_complex_words.append(perc_complex_words)
    qqdmr_fog_index.append(fog_index)
    qqdmr_complex_word_count.append(num_complex_words)
    qqdmr_word_count.append(num_words)

In [28]:
rf_average_sentence_length, rf_percentage_of_complex_words = [], []
rf_fog_index, rf_complex_word_count = [], []
rf_word_count = []

for text in corpus['2']:
    words = str(text).split()
    sentences = str(text).split('.')
    avg_sentence_len, perc_complex_words, fog_index, num_words = calc_read_metrics(words, sentences)
    num_complex_words = get_complex_words(words)
    
    rf_average_sentence_length.append(avg_sentence_len)
    rf_percentage_of_complex_words.append(perc_complex_words)
    rf_fog_index.append(fog_index)
    rf_complex_word_count.append(num_complex_words)
    rf_word_count.append(num_words)

## 3. Analysis of Words of Uncertainty and Constraint

The uncertain and constraining words are provided in `uncertainty_dictionary.xlsx` and `constraining_dictionary.xlsx`; the cells below read CSV versions of these files from the `instructions` directory.

In [29]:
uncertain = pd.read_csv('instructions/uncertainty_dictionary.csv')

In [30]:
constraining = pd.read_csv('instructions/constraining_dictionary.csv')

In [31]:
uncertain = uncertain['Word']
uncertain.head()

0       ABEYANCE
1      ABEYANCES
2         ALMOST
3     ALTERATION
4    ALTERATIONS
Name: Word, dtype: object

In [32]:
uncertain = list(uncertain)

In [33]:
constrain = constraining['Word']
constrain.head()

0      ABIDE
1    ABIDING
2      BOUND
3    BOUNDED
4     COMMIT
Name: Word, dtype: object

In [34]:
constrain = list(constrain)

In [35]:
uncertain = make_lower(uncertain)
constrain = make_lower(constrain)

### Extracting Derived Variables

In [36]:
def get_uncertain_constrain(text):
    uncertain_score, constrain_score = 0, 0
    for word in str(text).split():
        if word in uncertain:
            uncertain_score += 1
        if word in constrain:
            constrain_score += 1
    uncertain_prop = uncertain_score / len(str(text).split())
    constrain_prop = constrain_score / len(str(text).split())
    return uncertain_score, constrain_score, uncertain_prop, constrain_prop
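A variant of this function is sketched below under a few assumptions that go beyond the cell above: it uses sets for the lookups, strips punctuation attached to a word so that "almost," still matches "almost", and guards against empty text. The name `count_dictionary_hits` is hypothetical, and the extra cleaning means its counts can differ from those of `get_uncertain_constrain`.

```python
import string


def count_dictionary_hits(text, uncertain_words, constrain_words):
    """Return (uncertain_score, constrain_score, uncertain_prop, constrain_prop)."""
    uncertain_set, constrain_set = set(uncertain_words), set(constrain_words)
    # Strip surrounding punctuation and drop tokens that become empty.
    words = [w.strip(string.punctuation) for w in str(text).lower().split()]
    words = [w for w in words if w]
    uncertain_score = sum(w in uncertain_set for w in words)
    constrain_score = sum(w in constrain_set for w in words)
    n_words = len(words) or 1  # avoid dividing by zero on empty text
    return (uncertain_score, constrain_score,
            uncertain_score / n_words, constrain_score / n_words)


result = count_dictionary_hits(
    "The company is almost bound, almost.", ["almost"], ["bound", "commit"]
)
print(result)
```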

In [37]:
mda_uncertainty_score, mda_constraining_score = [], []
mda_uncertainty_word_proportion, mda_constraining_word_proportion = [], []

for text in corpus['0']:
    uncertain_score, constrain_score, uncertain_prop, constrain_prop = get_uncertain_constrain(text)
    
    mda_uncertainty_score.append(uncertain_score)
    mda_constraining_score.append(constrain_score)
    mda_uncertainty_word_proportion.append(uncertain_prop)
    mda_constraining_word_proportion.append(constrain_prop)

In [38]:
qqdmr_uncertainty_score, qqdmr_constraining_score = [], []
qqdmr_uncertainty_word_proportion, qqdmr_constraining_word_proportion = [], []

for text in corpus['1']:
    uncertain_score, constrain_score, uncertain_prop, constrain_prop = get_uncertain_constrain(text)
    
    qqdmr_uncertainty_score.append(uncertain_score)
    qqdmr_constraining_score.append(constrain_score)
    qqdmr_uncertainty_word_proportion.append(uncertain_prop)
    qqdmr_constraining_word_proportion.append(constrain_prop)

In [39]:
rf_uncertainty_score, rf_constraining_score = [], []
rf_uncertainty_word_proportion, rf_constraining_word_proportion = [], []

for text in corpus['2']:
    uncertain_score, constrain_score, uncertain_prop, constrain_prop = get_uncertain_constrain(text)
    
    rf_uncertainty_score.append(uncertain_score)
    rf_constraining_score.append(constrain_score)
    rf_uncertainty_word_proportion.append(uncertain_prop)
    rf_constraining_word_proportion.append(constrain_prop)

All variables but one have now been extracted. The remaining one is the number of constraining words across the whole report rather than in any particular section. Since I stored only a few sections of each report, I have to fetch the URLs again, but the constraining words can be counted on the fly without storing the text.

In [40]:
corpus = pd.read_csv('corpus.csv', index_col = 0)

In [41]:
corpus.head()

Unnamed: 0,0,1,2
https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt,'item 7. management\'s discussion analysis f...,,
https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-001001.txt,"""item 2. management's discussion analysis fi...",,
https://www.sec.gov/Archives/edgar/data/3662/0000950172-98-000783.txt,,,
https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-002145.txt,'item 7. management\'s discussion analysis f...,,
https://www.sec.gov/Archives/edgar/data/3662/0000950172-98-001203.txt,,,


In [42]:
from bs4 import BeautifulSoup
import requests

constraining_words_whole_report = []
for url in corpus.index:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    soup = str(soup).lower()
    
    _, constraining_words, _, _ = get_uncertain_constrain(soup)
    constraining_words_whole_report.append(constraining_words)
    
    print("Processed " + url)

Processed https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt
Processed https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-001001.txt
Processed https://www.sec.gov/Archives/edgar/data/3662/0000950172-98-000783.txt
Processed https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-002145.txt
Processed https://www.sec.gov/Archives/edgar/data/3662/0000950172-98-001203.txt
Processed https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-002278.txt
Processed https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-002401.txt
Processed https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-002402.txt
Processed https://www.sec.gov/Archives/edgar/data/3662/0000950172-99-000362.txt
Processed https://www.sec.gov/Archives/edgar/data/3662/0000950170-99-000775.txt
Processed https://www.sec.gov/Archives/edgar/data/3662/0000950172-99-000584.txt
Processed https://www.sec.gov/Archives/edgar/data/3662/0000950170-99-001005.txt
Processed https://www.sec.gov/Archives/e

Processed https://www.sec.gov/Archives/edgar/data/6201/0000950134-01-500665.txt
Processed https://www.sec.gov/Archives/edgar/data/6201/0000006201-01-500032.txt
Processed https://www.sec.gov/Archives/edgar/data/6201/0000006201-01-500047.txt
Processed https://www.sec.gov/Archives/edgar/data/6201/0000950134-02-001661.txt
Processed https://www.sec.gov/Archives/edgar/data/6201/0000006201-02-000015.txt
Processed https://www.sec.gov/Archives/edgar/data/6201/0000006201-02-000035.txt
Processed https://www.sec.gov/Archives/edgar/data/6201/0000006201-02-000052.txt
Processed https://www.sec.gov/Archives/edgar/data/6201/0000950134-02-012680.txt
Processed https://www.sec.gov/Archives/edgar/data/6201/0000950134-02-012682.txt
Processed https://www.sec.gov/Archives/edgar/data/6201/0000006201-03-000012.txt
Processed https://www.sec.gov/Archives/edgar/data/6201/0001047469-03-013301.txt
Processed https://www.sec.gov/Archives/edgar/data/6201/0000006201-03-000030.txt
Processed https://www.sec.gov/Archives/e
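A long download loop like this one stops at the first failed request, and the counts collected so far no longer line up with the rows of the index. The sketch below shows one way to keep the alignment, using only the standard library; it assumes the server expects a descriptive `User-Agent` header. `USER_AGENT`, `fetch_text` and `count_constraining` are hypothetical names, and the `fetch` parameter exists so the counting logic can be tried without network access.

```python
import urllib.request

# Placeholder identity string (assumption); replace with real contact details.
USER_AGENT = "Example Research example@example.com"


def fetch_text(url, timeout=30):
    """Download a URL and return its body as lowercase text."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace").lower()


def count_constraining(urls, constrain_words, fetch=fetch_text):
    """Count constraining words per report without storing the text."""
    constrain_set = set(constrain_words)
    counts = []
    for url in urls:
        try:
            text = fetch(url)
        except OSError:  # URLError and timeouts are subclasses of OSError
            counts.append(None)  # keep row alignment when a download fails
            continue
        counts.append(sum(word in constrain_set for word in text.split()))
    return counts


# Offline demonstration with a fake fetch function instead of real requests.
fake_pages = {"a": "must comply and must commit", "b": "nothing here"}


def fake_fetch(url):
    if url not in fake_pages:
        raise OSError("unreachable")
    return fake_pages[url]


demo_counts = count_constraining(["a", "b", "c"], ["must", "commit"], fetch=fake_fetch)
print(demo_counts)
```

Like the cell above, this counts whitespace-separated words in the raw response, so any markup in the file is not stripped.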

## Output

In [43]:
output = pd.read_csv('instructions/cik_list.csv')
output.head()

Unnamed: 0,CIK,CONAME,FYRMO,FDATE,FORM,SECFNAME
0,3662,SUNBEAM CORP/FL/,199803,3/6/98,10-K405,edgar/data/3662/0000950170-98-000413.txt
1,3662,SUNBEAM CORP/FL/,199805,5/15/98,10-Q,edgar/data/3662/0000950170-98-001001.txt
2,3662,SUNBEAM CORP/FL/,199808,8/13/98,NT 10-Q,edgar/data/3662/0000950172-98-000783.txt
3,3662,SUNBEAM CORP/FL/,199811,11/12/98,10-K/A,edgar/data/3662/0000950170-98-002145.txt
4,3662,SUNBEAM CORP/FL/,199811,11/16/98,NT 10-Q,edgar/data/3662/0000950172-98-001203.txt


In [44]:
output['mda_positive_score'] = mda_positive_score
output['mda_negative_score'] = mda_negative_score
output['mda_polarity_score'] = mda_polarity_score
output['mda_average_sentence_length'] = mda_average_sentence_length
output['mda_percentage_of_complex_words'] = mda_percentage_of_complex_words
output['mda_fog_index'] = mda_fog_index
output['mda_complex_word_count'] = mda_complex_word_count
output['mda_word_count'] = mda_word_count
output['mda_uncertainty_score'] = mda_uncertainty_score
output['mda_constraining_score'] = mda_constraining_score
output['mda_positive_word_proportion'] = mda_positive_word_proportion
output['mda_negative_word_proportion'] = mda_negative_word_proportion
output['mda_uncertainty_word_proportion'] = mda_uncertainty_word_proportion
output['mda_constraining_word_proportion'] = mda_constraining_word_proportion

In [45]:
output.head()

Unnamed: 0,CIK,CONAME,FYRMO,FDATE,FORM,SECFNAME,mda_positive_score,mda_negative_score,mda_polarity_score,mda_average_sentence_length,mda_percentage_of_complex_words,mda_fog_index,mda_complex_word_count,mda_word_count,mda_uncertainty_score,mda_constraining_score,mda_positive_word_proportion,mda_negative_word_proportion,mda_uncertainty_word_proportion,mda_constraining_word_proportion
0,3662,SUNBEAM CORP/FL/,199803,3/6/98,10-K405,edgar/data/3662/0000950170-98-000413.txt,12,16,-0.142857,13.962963,0.67374,5.854681,508,754,4,1,0.014168,0.01889,0.005305,0.001326
1,3662,SUNBEAM CORP/FL/,199805,5/15/98,10-Q,edgar/data/3662/0000950170-98-001001.txt,0,3,-1.0,8.545455,0.574468,3.647969,54,94,1,0,0.0,0.027523,0.010638,0.0
2,3662,SUNBEAM CORP/FL/,199808,8/13/98,NT 10-Q,edgar/data/3662/0000950172-98-000783.txt,0,0,0.0,1.0,0.0,0.4,0,1,0,0,0.0,0.0,0.0,0.0
3,3662,SUNBEAM CORP/FL/,199811,11/12/98,10-K/A,edgar/data/3662/0000950170-98-002145.txt,6,41,-0.744681,0.854737,0.580049,0.573914,471,812,5,11,0.004501,0.030758,0.006158,0.013547
4,3662,SUNBEAM CORP/FL/,199811,11/16/98,NT 10-Q,edgar/data/3662/0000950172-98-001203.txt,0,0,0.0,1.0,0.0,0.4,0,1,0,0,0.0,0.0,0.0,0.0


In [46]:
output['qqdmr_positive_score'] = qqdmr_positive_score
output['qqdmr_negative_score'] = qqdmr_negative_score
output['qqdmr_polarity_score'] = qqdmr_polarity_score
output['qqdmr_average_sentence_length'] = qqdmr_average_sentence_length
output['qqdmr_percentage_of_complex_words'] = qqdmr_percentage_of_complex_words
output['qqdmr_fog_index'] = qqdmr_fog_index
output['qqdmr_complex_word_count'] = qqdmr_complex_word_count
output['qqdmr_word_count'] = qqdmr_word_count
output['qqdmr_uncertainty_score'] = qqdmr_uncertainty_score
output['qqdmr_constraining_score'] = qqdmr_constraining_score
output['qqdmr_positive_word_proportion'] = qqdmr_positive_word_proportion
output['qqdmr_negative_word_proportion'] = qqdmr_negative_word_proportion
output['qqdmr_uncertainty_word_proportion'] = qqdmr_uncertainty_word_proportion
output['qqdmr_constraining_word_proportion'] = qqdmr_constraining_word_proportion

In [48]:
output.shape

(152, 34)

In [49]:
output['rf_positive_score'] = rf_positive_score
output['rf_negative_score'] = rf_negative_score
output['rf_polarity_score'] = rf_polarity_score
output['rf_average_sentence_length'] = rf_average_sentence_length
output['rf_percentage_of_complex_words'] = rf_percentage_of_complex_words
output['rf_fog_index'] = rf_fog_index
output['rf_complex_word_count'] = rf_complex_word_count
output['rf_word_count'] = rf_word_count
output['rf_uncertainty_score'] = rf_uncertainty_score
output['rf_constraining_score'] = rf_constraining_score
output['rf_positive_word_proportion'] = rf_positive_word_proportion
output['rf_negative_word_proportion'] = rf_negative_word_proportion
output['rf_uncertainty_word_proportion'] = rf_uncertainty_word_proportion
output['rf_constraining_word_proportion'] = rf_constraining_word_proportion
output['constraining_words_whole_report'] = constraining_words_whole_report

In [50]:
output.head()

Unnamed: 0,CIK,CONAME,FYRMO,FDATE,FORM,SECFNAME,mda_positive_score,mda_negative_score,mda_polarity_score,mda_average_sentence_length,...,rf_fog_index,rf_complex_word_count,rf_word_count,rf_uncertainty_score,rf_constraining_score,rf_positive_word_proportion,rf_negative_word_proportion,rf_uncertainty_word_proportion,rf_constraining_word_proportion,constraining_words_whole_report
0,3662,SUNBEAM CORP/FL/,199803,3/6/98,10-K405,edgar/data/3662/0000950170-98-000413.txt,12,16,-0.142857,13.962963,...,0.4,0,1,0,0,0.0,0.0,0.0,0.0,1333
1,3662,SUNBEAM CORP/FL/,199805,5/15/98,10-Q,edgar/data/3662/0000950170-98-001001.txt,0,3,-1.0,8.545455,...,0.4,0,1,0,0,0.0,0.0,0.0,0.0,916
2,3662,SUNBEAM CORP/FL/,199808,8/13/98,NT 10-Q,edgar/data/3662/0000950172-98-000783.txt,0,0,0.0,1.0,...,0.4,0,1,0,0,0.0,0.0,0.0,0.0,5
3,3662,SUNBEAM CORP/FL/,199811,11/12/98,10-K/A,edgar/data/3662/0000950170-98-002145.txt,6,41,-0.744681,0.854737,...,0.4,0,1,0,0,0.0,0.0,0.0,0.0,642
4,3662,SUNBEAM CORP/FL/,199811,11/16/98,NT 10-Q,edgar/data/3662/0000950172-98-001203.txt,0,0,0.0,1.0,...,0.4,0,1,0,0,0.0,0.0,0.0,0.0,4


In [51]:
output.to_csv('output.csv')