<h1><center> Data Extraction and Text Analysis </h1></center>
<h2><center> By: Gyan Prakash Tripathi </h2></center> 

# *Objective*: 
*The goal of this work is to fetch the data of reports from SEC / EDGAR  **(US Securities and Exchange Commission)** financial reports and perform text analysis on it to calculate following metrics :*
* Positive score, Negative score, Polarity score
* Average Sentence Length
* Percentage of Complex Words
* fog Index
* Complex Word Count
* Word Count
* Uncertainty Score
* Constraining Score
* Positive Word Proportion
* Negative Word Proportion
* Uncertainity Word Proportion
* Constraining Word Proportion
* Constraining Words of Whole Report

*Except the last variable all of other measures need to be calculated for following three sections :*
1. Management's Discussion and Analysis
2. Quantitative and Qualitative Disclosures about Market Risk
3. Risk Factors

# Step 0 : Installing and Importing libraries

In [0]:
#Connecting to Google Drive

from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
pip install requests



In [47]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [0]:
import requests 
import pandas as pd
import numpy as np


In [0]:
from nltk.corpus import stopwords
import re #regular expression
from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 

ps = PorterStemmer() 


# Step 1 : Downloading the data
### Loading the file containing report links

In [0]:
links= pd.read_excel('gdrive/My Drive/Assignment/cik_list.xlsx')

### Let's check the structure of our dataframe

In [26]:
links.head()

Unnamed: 0,CIK,CONAME,FYRMO,FDATE,FORM,SECFNAME
0,3662,SUNBEAM CORP/FL/,199803,1998-03-06,10-K405,edgar/data/3662/0000950170-98-000413.txt
1,3662,SUNBEAM CORP/FL/,199805,1998-05-15,10-Q,edgar/data/3662/0000950170-98-001001.txt
2,3662,SUNBEAM CORP/FL/,199808,1998-08-13,NT 10-Q,edgar/data/3662/0000950172-98-000783.txt
3,3662,SUNBEAM CORP/FL/,199811,1998-11-12,10-K/A,edgar/data/3662/0000950170-98-002145.txt
4,3662,SUNBEAM CORP/FL/,199811,1998-11-16,NT 10-Q,edgar/data/3662/0000950172-98-001203.txt


### Storing the links in a list

In [0]:
t=pd.DataFrame()
t=links['SECFNAME'].astype(str)

### Downloading Reports

In [30]:
DocName=[]
count =1
for l in t:
  c=str(count)+'.txt'
  fetch= 'https://www.sec.gov/Archives/' + l
  print(fetch)
  DocName.append(i)
  r = requests.get(fetch) # create HTTP response object 
  with open(c,'wb') as f: 
  
    # Saving received content as a png file in 
    # binary format 
  
    # write the contents of the response (r.content) 
    # to a new file in binary mode. 
    f.write(r.content) 
    count+=1

https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt
https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-001001.txt
https://www.sec.gov/Archives/edgar/data/3662/0000950172-98-000783.txt
https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-002145.txt
https://www.sec.gov/Archives/edgar/data/3662/0000950172-98-001203.txt
https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-002278.txt
https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-002401.txt
https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-002402.txt
https://www.sec.gov/Archives/edgar/data/3662/0000950172-99-000362.txt
https://www.sec.gov/Archives/edgar/data/3662/0000950170-99-000775.txt
https://www.sec.gov/Archives/edgar/data/3662/0000950172-99-000584.txt
https://www.sec.gov/Archives/edgar/data/3662/0000950170-99-001005.txt
https://www.sec.gov/Archives/edgar/data/3662/0000950172-99-001074.txt
https://www.sec.gov/Archives/edgar/data/3662/0000950170-99-001361.txt
https://www.sec.gov/

# Step 2 : Loading standard lists for pre-processing
### (Stop words, Master dictionary, Uncertain Words, Constraining Words)

### Loading stop words to  a list

In [0]:
#we have added list of stopwords in files stopwords.txt and master file on our drive. Next step is to create a list of stop words, positive words and negative words
#list of stopwords:
f=open('gdrive/My Drive/StopWords_Generic.txt','r')
stopwords=f.read()
stoplist=stopwords.split('\n')

In [32]:
stoplist[0:5]

['ABOUT', 'ABOVE', 'AFTER', 'AGAIN', 'ALL']

### Let's have a look at the structure of Master Dictionary

In [33]:
master=pd.read_csv('gdrive/My Drive/LoughranMcDonald_MasterDictionary_2018.csv')
master.head()

Unnamed: 0,Word,Sequence Number,Word Count,Word Proportion,Average Proportion,Std Dev,Doc Count,Negative,Positive,Uncertainty,Litigious,Constraining,Superfluous,Interesting,Modal,Irr_Verb,Harvard_IV,Syllables,Source
0,AARDVARK,1,277,1.48e-08,1.24e-08,3.56e-06,84,0,0,0,0,0,0,0,0,0,0,2,12of12inf
1,AARDVARKS,2,3,1.6e-10,9.73e-12,9.86e-09,1,0,0,0,0,0,0,0,0,0,0,2,12of12inf
2,ABACI,3,8,4.28e-10,1.39e-10,6.23e-08,7,0,0,0,0,0,0,0,0,0,0,3,12of12inf
3,ABACK,4,12,6.41e-10,3.16e-10,9.38e-08,12,0,0,0,0,0,0,0,0,0,0,2,12of12inf
4,ABACUS,5,7250,3.87e-07,3.68e-07,3.37e-05,914,0,0,0,0,0,0,0,0,0,0,3,12of12inf


### Data types of various columns in the master dataframe

In [34]:
master.dtypes

Word                   object
Sequence Number         int64
Word Count              int64
Word Proportion       float64
Average Proportion    float64
Std Dev               float64
Doc Count               int64
Negative                int64
Positive                int64
Uncertainty             int64
Litigious               int64
Constraining            int64
Superfluous             int64
Interesting             int64
Modal                   int64
Irr_Verb                int64
Harvard_IV              int64
Syllables               int64
Source                 object
dtype: object

*We will use Positive and Negative columns to make a list of positive and negative words.*

### Creating Positive and Negative words' dataframe

In [0]:
#positive and negative df
positive= pd.DataFrame()
negative= pd.DataFrame()

In [0]:
positive[['Word','Positive']]=master[['Word','Positive']]
negative[['Word','Negative']]=master[['Word','Negative']]


In [37]:
negative.head()

Unnamed: 0,Word,Negative
0,AARDVARK,0
1,AARDVARKS,0
2,ABACI,0
3,ABACK,0
4,ABACUS,0


### Removing Unwanted words from the Positive and negative Data Frames

In [0]:
positive = positive[(positive.T != 0).all()]
negative = negative[(negative.T != 0).all()]

In [39]:
positive.head()

Unnamed: 0,Word,Positive
125,ABLE,2009
336,ABUNDANCE,2009
338,ABUNDANT,2009
438,ACCLAIMED,2009
477,ACCOMPLISH,2009


In [40]:
negative.head()

Unnamed: 0,Word,Negative
9,ABANDON,2009
10,ABANDONED,2009
11,ABANDONING,2009
12,ABANDONMENT,2009
13,ABANDONMENTS,2009


In [41]:
print(positive.shape,negative.shape)

(354, 2) (2355, 2)


In [0]:
for i in stoplist:
  positive = positive[(positive.T != i).all()]
  negative = negative[(negative.T != i).all()]

In [43]:
print(positive.shape,negative.shape)

(354, 2) (2355, 2)


### Creating lists of Positive and Negative words

In [0]:
#list of positive and negative words
p=list(positive['Word'])
n=list(negative['Word'])

In [45]:
constrain= list(pd.read_excel('gdrive/My Drive/Assignment/constraining_dictionary.xlsx')['Word'])
constrain[:5]


['ABIDE', 'ABIDING', 'BOUND', 'BOUNDED', 'COMMIT']

In [46]:
unsure=list(pd.read_excel('gdrive/My Drive/Assignment/uncertainty_dictionary.xlsx')['Word'])
unsure[:5]

['ABEYANCE', 'ABEYANCES', 'ALMOST', 'ALTERATION', 'ALTERATIONS']

# Step 3 : Calculating Various Measures
### Function to calculate various metrics (Scorer)

In [0]:
def scorer(content,pre):
  sent=[]
  length=[]
  tempx=[]
  sent=content.split('.')
  for g in sent:
    length.append(len(g))
  #average_sentence_length=length.mean()
  tempx=content.split(' ')
  word_count=len(tempx)
                 
  
  no_links = re.sub(r'http\S+', ' ', content)
  no_unicode = re.sub(r"\\[a-z][a-z]?[0-9]+", ' ', no_links)
  no_special_characters = re.sub('[^A-Za-z ]+', ' ', no_unicode)
  words = no_special_characters.split(" ")
  words = [w for w in words if len(w) > 2]  # ignore a, an, be, ...
  words = [w.upper() for w in words]
  
  complexword=[]
  vowel=['A','E','I','O','U']
  stop_words=set(stopwords.words('english'))
  words = [w for w in words if w not in stop_words]
  
  positive_score=0
  negative_score=0
  constraining_score=0
  uncertainty_score=0
  for w in words:
    for s in ps.stem(w):
      c=0
      if s in vowel:
        c=c+1
      if c>2:
        complexword.append(w)
          
          
    if w in p:
      positive_score += 1
    if w in n:
      negative_score += 1
    if w in constrain:
      constraining_score += 1
    if w in unsure:
      uncertainty_score += 1
      
  #print(positive_score,negative_score)
  test=positive_score+negative_score
  print('Word Count',word_count)
  polarity_score= (positive_score-negative_score)/((positive_score+negative_score)+ 0.000001)

  
  complex_word_count=len(complexword)
  
  average_sentence_length=len(sent)/(word_count + 0.000001)
  
  percentage_of_complex_words=(complex_word_count/(word_count+0.000001))*100
  fog_index=0.4*(average_sentence_length+percentage_of_complex_words)
  positive_word_proportion=positive_score/(word_count+0.000001)
  negative_word_proportion=negative_score/(word_count+0.000001)
  uncertainty_word_proportion=uncertainty_score/(word_count+0.000001)
  constraining_word_proportion=constraining_score/(word_count+0.000001)
  
  var=['positive_score','negative_score','polarity_score','average_sentence_length','percentage_of_complex_words',
      'fog_index','complex_word_count','word_count','uncertainty_score','constraining_score'
      ,'positive_word_proportion','negative_word_proportion','uncertainty_word_proportion',
      'constraining_word_proportion']
  l=[positive_score,negative_score,polarity_score,average_sentence_length,percentage_of_complex_words,
      fog_index,complex_word_count,word_count,uncertainty_score,constraining_score
      ,positive_word_proportion,negative_word_proportion,uncertainty_word_proportion,
      constraining_word_proportion]
  
  #insert values
  k=0
  for itera in var:
    j=str(pre+'_'+itera)
    if word_count==1:
      links.loc[i,j]=np.NaN
    else:
      links.loc[i,j]=l[k]
    k=k+1
          
  #print("The positive score of"+str(i+1)+" is:",ps)
  #print("The negative score of"+str(i+1)+" is:",ns)
  return

### Function to take out the three sections (if present) of each report, and call Scorer function for each of these sections (Caller)

In [0]:
def caller(f):
  ga=f.find('MANAGEMENT\'S DISCUSSION AND ANALYSIS')
  if f.find('MANAGEMENT\'S DISCUSSION AND ANALYSIS',f.find('MANAGEMENT\'S DISCUSSION AND ANALYSIS')+1)==-1:
    gb=ga
  else:
    gb=f.find('MANAGEMENT\'S DISCUSSION AND ANALYSIS',f.find('MANAGEMENT\'S DISCUSSION AND ANALYSIS')+1)
  gc=f[gb:].find('ITEM ') + gb
  g=f[gb:gc]
###############
  ha=f.find('QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK')
  if f.find('QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK',
            f.find('QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK')+1)==-1:
    hb=ha
  else:
    hb=f.find('QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK',
              f.find('QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK')+1)
  hc=f[hb:].find('ITEM ') + hb
  h=f[hb:hc]
################
  iia=f.find('RISK FACTORS')
  if f.find('RISK FACTORS',f.find('RISK FACTORS')+1)==-1:
    iib=iia
  else:
    iib=f.find('RISK FACTORS',f.find('RISK FACTORS')+1)
  iic=f[iib:].find('ITEM ') + iib
  ii=f[iib:iic]

  scorer(g,'MDA')
  scorer(h,'QQDMR')
  scorer(ii,'RF')
  return

### Function to calculate Constraining words in whole report

In [0]:
def constrainwhole(f):
    cs=0
    no_links = re.sub(r'http\S+', ' ', f)
    no_unicode = re.sub(r"\\[a-z][a-z]?[0-9]+", ' ', no_links)
    no_special_characters = re.sub('[^A-Za-z ]+', ' ', no_unicode)
    words = no_special_characters.split(" ")
    words = [w for w in words if len(w) > 2]  # ignore a, an, be, ...
    words = [w.upper() for w in words]

    stop_words=set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words]

    for w in words:      
      if w in constrain:
        cs+=1
    links.loc[i,'constraining_words_whole_report']=cs
    return

# Step 4 : Calculating Metrics of Each Report
### These values are stored in the Links dataframe we loaded in the beginning.

In [127]:

for i in range(count-1):
  fin=open(str(i+1)+'.txt','r')
  f=fin.read().upper()
  caller(f)
  constrainwhole(f)
  print('file '+str(i+1)+' added')
  

Word Count 2666
Word Count 1
Word Count 1
file 1 added
Word Count 2831
Word Count 1
Word Count 1
file 2 added
Word Count 1
Word Count 1
Word Count 1
file 3 added
Word Count 8384
Word Count 1
Word Count 1
file 4 added
Word Count 1
Word Count 1
Word Count 1
file 5 added
Word Count 254
Word Count 1
Word Count 1
file 6 added
Word Count 7826
Word Count 1
Word Count 1
file 7 added
Word Count 7313
Word Count 1
Word Count 1
file 8 added
Word Count 1
Word Count 1
Word Count 1
file 9 added
Word Count 2248
Word Count 1
Word Count 1
file 10 added
Word Count 1
Word Count 1
Word Count 1
file 11 added
Word Count 4832
Word Count 1
Word Count 1
file 12 added
Word Count 1
Word Count 1
Word Count 1
file 13 added
Word Count 11067
Word Count 1
Word Count 1
file 14 added
Word Count 2156
Word Count 1
Word Count 1
file 15 added
Word Count 11784
Word Count 1
Word Count 1
file 16 added
Word Count 7872
Word Count 1
Word Count 1
file 17 added
Word Count 1
Word Count 1
Word Count 1
file 18 added
Word Count 4221
Wo

### Let's have a look at the final dataframe.

In [128]:
links.head()

Unnamed: 0,CIK,CONAME,FYRMO,FDATE,FORM,SECFNAME,MDA_positive_score,MDA_negative_score,MDA_polarity_score,MDA_average_sentence_length,MDA_percentage_of_complex_words,MDA_fog_index,MDA_complex_word_count,MDA_word_count,MDA_uncertainty_score,MDA_constraining_score,MDA_positive_word_proportion,MDA_negative_word_proportion,MDA_uncertainty_word_proportion,MDA_constraining_word_proportion,QQDMR_positive_score,QQDMR_negative_score,QQDMR_polarity_score,QQDMR_average_sentence_length,QQDMR_percentage_of_complex_words,QQDMR_fog_index,QQDMR_complex_word_count,QQDMR_word_count,QQDMR_uncertainty_score,QQDMR_constraining_score,QQDMR_positive_word_proportion,QQDMR_negative_word_proportion,QQDMR_uncertainty_word_proportion,QQDMR_constraining_word_proportion,RF_positive_score,RF_negative_score,RF_polarity_score,RF_average_sentence_length,RF_percentage_of_complex_words,RF_fog_index,RF_complex_word_count,RF_word_count,RF_uncertainty_score,RF_constraining_score,RF_positive_word_proportion,RF_negative_word_proportion,RF_uncertainty_word_proportion,RF_constraining_word_proportion,constraining_words_whole_report
0,3662,SUNBEAM CORP/FL/,199803,1998-03-06,10-K405,edgar/data/3662/0000950170-98-000413.txt,34.0,14.0,0.416667,0.040885,0.0,0.016354,0.0,2666.0,39.0,6.0,0.012753,0.005251,0.014629,0.002251,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1488.0
1,3662,SUNBEAM CORP/FL/,199805,1998-05-15,10-Q,edgar/data/3662/0000950170-98-001001.txt,11.0,56.0,-0.671642,0.039915,0.0,0.015966,0.0,2831.0,67.0,3.0,0.003886,0.019781,0.023667,0.00106,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1047.0
2,3662,SUNBEAM CORP/FL/,199808,1998-08-13,NT 10-Q,edgar/data/3662/0000950172-98-000783.txt,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.0
3,3662,SUNBEAM CORP/FL/,199811,1998-11-12,10-K/A,edgar/data/3662/0000950170-98-002145.txt,36.0,121.0,-0.541401,0.140506,0.0,0.056202,0.0,8384.0,49.0,43.0,0.004294,0.014432,0.005844,0.005129,,,,,,,,,,,,,,,,,,,,,,,,,,,,,717.0
4,3662,SUNBEAM CORP/FL/,199811,1998-11-16,NT 10-Q,edgar/data/3662/0000950172-98-001203.txt,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.0


*The files which don't have the required sections, their cells have been left blank.*
### Let's check the shape  of our final dataframe

In [129]:
links.shape

(152, 49)

*The shape is correct (same as we expected it to be). So our dataframe is perfect.*
# Step 5 : Storing the result in an Excel file

In [0]:
links.to_excel('Results.xlsx')

*We are done with the extraction and calculation of required data and variables. Thanks for following till now!*
<h1><center>Thanks!</h1></center>

<h3><center>Please provide your valuable feedbacks on prakashthegyan@gmail.com</h3></center>