# Text Mining 

## Sanjiv R. Das

### Reading references

- NLTK: http://www.nltk.org/book/
- Introduction to Linguistics: http://www.ling.upenn.edu/courses/Fall_2003/ling001/

In [3]:
%pylab inline
import pandas as pd

Populating the interactive namespace from numpy and matplotlib


## Example Text Handling

In [4]:
text = "Ask not what your country can do for you, \
but ask what you can do for your country."

In [5]:
#How many characters including blanks? 
len(text)

83

In [6]:
#Tokenize the words by splitting on spaces (punctuation stays attached)
x = text.split(" ")
print(x)

['Ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you,', 'but', 'ask', 'what', 'you', 'can', 'do', 'for', 'your', 'country.']


In [7]:
#How many words?
len(x)

18

But this leaves commas and periods attached to the words (`'you,'` and `'country.'`), which is not desired. The regular expressions package, `re`, lets us split on several delimiters at once.

In [8]:
import re
x = re.split('[ ,.]',text)
print(x)

['Ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you', '', 'but', 'ask', 'what', 'you', 'can', 'do', 'for', 'your', 'country', '']


In [9]:
#Use a list comprehension to remove the empty strings
x = [j for j in x if len(j)>0]
print(x)

['Ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you', 'but', 'ask', 'what', 'you', 'can', 'do', 'for', 'your', 'country']


In [10]:
len(x)

18
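
The split-then-filter steps above can also be collapsed into a single call. A minimal sketch, assuming that a word is any run of ASCII letters (an assumption that suits this sentence but would break up tokens such as `post-graduate` or `university's`):

```python
import re

text = ("Ask not what your country can do for you, "
        "but ask what you can do for your country.")

# findall returns every maximal run of letters, so punctuation and
# empty strings never enter the token list in the first place
tokens = re.findall(r"[A-Za-z]+", text)
print(tokens)
print(len(tokens))
```

Matching the words directly, rather than splitting on delimiters, avoids the cleanup pass for empty strings.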

In [11]:
#Unique words
y = [j.lower() for j in x]
z = unique(y)
print(z)

['ask' 'but' 'can' 'country' 'do' 'for' 'not' 'what' 'you' 'your']


In [12]:
len(z)

10
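
Beyond listing the unique words, it is often useful to know how many times each one occurs. A sketch using `collections.Counter` from the standard library (the token list is retyped here so that the block runs on its own):

```python
from collections import Counter

x = ['Ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you',
     'but', 'ask', 'what', 'you', 'can', 'do', 'for', 'your', 'country']

# Lowercase first so that 'Ask' and 'ask' are counted as the same word
counts = Counter(j.lower() for j in x)
print(len(counts))            # number of distinct words
print(counts.most_common(3))  # three most frequent words with their counts
```

The number of keys in `counts` matches the number of unique words found above.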

## Using List Comprehensions to find specific words

In [13]:
#Find words longer than 3 characters
[j for j in x if len(j)>3]

['what', 'your', 'country', 'what', 'your', 'country']

In [14]:
#Find capitalized words
[j for j in x if j.istitle()]

['Ask']

In [15]:
#Find words that begin with c
[j for j in x if j.startswith('c')]

['country', 'can', 'can', 'country']

In [16]:
#Find words that end in t
[j for j in x if j.endswith('t')]

['not', 'what', 'but', 'what']

In [17]:
#Find words that contain a
[j for j in x if "a" in j.lower()]

['Ask', 'what', 'can', 'ask', 'what', 'can']

Alternatively, regular expressions can handle more complex parsing.

For example, `'@[A-Za-z0-9_]+'` matches every substring that starts with `'@'` and is followed by one or more characters, each of which is a:
* capital letter (`'A-Z'`)
* lowercase letter (`'a-z'`)
* digit (`'0-9'`)
* or underscore (`'_'`)
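
To see this pattern in action, here is a small sketch. The sample string and the handles in it are made up for illustration and do not come from the data used in this notebook:

```python
import re

# Hypothetical sample text, invented for this example
tweet = "Thanks to @alice_01 and @Bob99 for the tips!"

# Each match starts at '@' and extends over letters, digits and underscores
handles = re.findall(r'@[A-Za-z0-9_]+', tweet)
print(handles)
```

The match stops at the first character outside the bracketed set, so the trailing space and `!` are not included.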

In [18]:
#Find words that contain 'a' using RE
[j for j in x if re.search('[Aa]',j)]

['Ask', 'what', 'can', 'ask', 'what', 'can']

In [19]:
#Test type of tokens
print(x)
[j for j in x if j.islower()]

['Ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you', 'but', 'ask', 'what', 'you', 'can', 'do', 'for', 'your', 'country']


['not',
 'what',
 'your',
 'country',
 'can',
 'do',
 'for',
 'you',
 'but',
 'ask',
 'what',
 'you',
 'can',
 'do',
 'for',
 'your',
 'country']

In [20]:
[j for j in x if j.isdigit()]

[]

In [21]:
[j for j in x if j.isalnum()]

['Ask',
 'not',
 'what',
 'your',
 'country',
 'can',
 'do',
 'for',
 'you',
 'but',
 'ask',
 'what',
 'you',
 'can',
 'do',
 'for',
 'your',
 'country']

## String operations

In [22]:
y = '  To be or not to be.  '
print(y.strip())
print(y.rstrip())
print(y.lower())
print(y.upper())

To be or not to be.
  To be or not to be.
  to be or not to be.  
  TO BE OR NOT TO BE.  


In [23]:
#Return the starting position of the substring (first and last occurrence)
print(y.find('be'))
print(y.rfind('be'))

5
18
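
`find` and `rfind` report only the first and the last occurrence. To list every starting position, one option is `re.finditer`. A sketch, where `re.escape` is used so that the substring is treated literally even if it contains regex metacharacters:

```python
import re

y = '  To be or not to be.  '

# Collect the start index of every non-overlapping occurrence of 'be'
positions = [m.start() for m in re.finditer(re.escape('be'), y)]
print(positions)
```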


In [24]:
print(y.replace('be','do'))

  To do or not to do.  


In [25]:
y = 'Supercalifragilisticexpialidocious'
ytok = y.split('i')
print(ytok)
print('i'.join(ytok))
print(list(y))

['Supercal', 'frag', 'l', 'st', 'cexp', 'al', 'doc', 'ous']
Supercalifragilisticexpialidocious
['S', 'u', 'p', 'e', 'r', 'c', 'a', 'l', 'i', 'f', 'r', 'a', 'g', 'i', 'l', 'i', 's', 't', 'i', 'c', 'e', 'x', 'p', 'i', 'a', 'l', 'i', 'd', 'o', 'c', 'i', 'o', 'u', 's']


## Read in a URL

In [26]:
## Reading in a URL
import requests

url = 'http://srdas.github.io/bio-candid.html'
f = requests.get(url)
text = f.text
print(text)
f.close()

<HTML>
<BODY background="http://algo.scu.edu/~sanjivdas/graphics/back2.gif">

Sanjiv Das is the William and Janice Terry Professor of Finance and
Data Science at Santa Clara University's Leavey School of Business. He
previously held faculty appointments as Associate Professor at Harvard
Business School and UC Berkeley. He holds post-graduate degrees in
Finance (M.Phil and Ph.D. from New York University), Computer Science
(M.S. from UC Berkeley), an MBA from the Indian Institute of
Management, Ahmedabad, B.Com in Accounting and Economics (University
of Bombay, Sydenham College), and is also a qualified Cost and Works
Accountant (AICWA). He is a senior editor of The Journal of Investment
Management, co-editor of The Journal of Derivatives and The Journal of
Financial Services Research, and Associate Editor of other academic
journals. Prior to being an academic, he worked in the derivatives
business in the Asia-Pacific region as a Vice-President at
Citibank. His current research interests

In [27]:
len(text)

4113

In [28]:
lines = text.splitlines()
print(len(lines))
print(lines[3])

80
Sanjiv Das is the William and Janice Terry Professor of Finance and


### Use BeautifulSoup to strip out the HTML markup

In [29]:
from bs4 import BeautifulSoup
sanjivbio = BeautifulSoup(text,'lxml').get_text()
print(sanjivbio)




Sanjiv Das is the William and Janice Terry Professor of Finance and
Data Science at Santa Clara University's Leavey School of Business. He
previously held faculty appointments as Associate Professor at Harvard
Business School and UC Berkeley. He holds post-graduate degrees in
Finance (M.Phil and Ph.D. from New York University), Computer Science
(M.S. from UC Berkeley), an MBA from the Indian Institute of
Management, Ahmedabad, B.Com in Accounting and Economics (University
of Bombay, Sydenham College), and is also a qualified Cost and Works
Accountant (AICWA). He is a senior editor of The Journal of Investment
Management, co-editor of The Journal of Derivatives and The Journal of
Financial Services Research, and Associate Editor of other academic
journals. Prior to being an academic, he worked in the derivatives
business in the Asia-Pacific region as a Vice-President at
Citibank. His current research interests include: machine learning,
social networks, derivatives pricing models, po

In [30]:
print(len(sanjivbio))
type(sanjivbio)

4016


str
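
If `bs4` is not available, a rough stand-in can be built from the standard library's `html.parser`. This is only a sketch of a different technique, not what BeautifulSoup does internally: the class and function names are made up here, the sample HTML is a shortened stand-in for the page above, and text inside `<script>` or `<style>` tags would be kept as well.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text found between tags (hypothetical helper)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text outside of a tag
        self.chunks.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

# Shortened, made-up sample in the style of the page read above
sample = '<HTML>\n<BODY background="back2.gif">\nSanjiv Das is a professor.\n</BODY>\n</HTML>'
plain = strip_tags(sample)
print(plain.strip())
```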

## Read in a file

In [31]:
## Read in a file
## Here we read in the entire dictionary from Harvard Inquirer

with open('inqdict.txt') as f:   #the with block closes the file automatically
    HIDict = f.read()
HIDict = HIDict.splitlines()
HIDict[:20]

['Entryword Source Pos Neg Pstv Affil Ngtv Hostile Strng Power Weak Subm Actv Psv Pleasure Pain Arousal EMOT Feel Virtue Vice Ovrst Undrst Acad Doctr Econ* Exch ECON Exprs Legal Milit Polit* POLIT Relig Role COLL Work Ritual Intrel Race Kin* MALE Female Nonadlt HU ANI PLACE Social Region Route Aquatic Land Sky Object Tool Food Vehicle Bldgpt Natobj Bodypt Comnobj Comform COM Say Need Goal Try Means Ach Persist Complt Fail Natpro Begin Vary Change Incr Decr Finish Stay Rise Move Exert Fetch Travel Fall Think Know Causal Ought Percv Comp Eval EVAL Solve Abs* ABS Qual Quan NUMB ORD CARD FREQ DIST Time* TIME Space POS DIM Dimn Rel COLOR Self Our You Name Yes No Negate Intrj IAV DAV SV IPadj IndAdj POWGAIN POWLOSS POWENDS POWAREN POWCON POWCOOP POWAPT POWPT POWDOCT POWAUTH POWOTH POWTOT RCTETH RCTREL RCTGAIN RCTLOSS RCTENDS RCTTOT RSPGAIN RSPLOSS RSPOTH RSPTOT AFFGAIN AFFLOSS AFFPT AFFOTH AFFTOT WLTPT WLTTRAN WLTOTH WLTTOT WLBGAIN WLBLOSS WLBPHYS WLBPSYC WLBPT WLBTOT ENLGAIN ENLLOSS ENLENDS

## Sentiment Score the Text using this Dictionary from Harvard Inquirer

In [47]:
#Drop the header row (rerunning this cell would drop one more entry each time)
HIDict = HIDict[1:]
print(HIDict[:5])
print(len(HIDict))
#Extract all the lines that contain the Pos tag
poswords = [j for j in HIDict if "Pos" in j]  #using a list comprehension
poswords = [j.split()[0] for j in poswords]
poswords = [j.split("#")[0] for j in poswords]
poswords = unique(poswords)
poswords = [j.lower() for j in poswords]
print(poswords[:20])
print(len(poswords))

['ABOLISH H4Lvd Neg Ngtv Hostile Strng Power Actv Intrel IAV POWOTH POWTOT SUPV  |', 'ABOLITION Lvd TRANS Noun  ', 'ABOMINABLE H4 Neg Strng Vice Ovrst Eval IndAdj Modif  |', 'ABORTIVE Lvd POWOTH POWTOT Modif POLIT  ', 'ABOUND H4 Pos Psv Incr IAV SUPV  |']
11880
['abound', 'absolve', 'absorbent', 'absorption', 'abundance', 'abundant', 'accede', 'accentuate', 'accept', 'acceptable', 'acceptance', 'accessible', 'accession', 'acclaim', 'acclamation', 'accolade', 'accommodate', 'accommodation', 'accompaniment', 'accomplish']
1644


In [49]:
#Extract all the lines that contain the Neg tag
negwords = [j for j in HIDict if "Neg" in j]  #using a list comprehension
negwords = [j.split()[0] for j in negwords]
negwords = [j.split("#")[0] for j in negwords]
negwords = unique(negwords)
negwords = [j.lower() for j in negwords]
print(negwords[:20])
print(len(negwords))

['abolish', 'abominable', 'abrasive', 'abrupt', 'abscond', 'absence', 'absent', 'absent-minded', 'absentee', 'absurd', 'absurdity', 'abuse', 'abyss', 'accident', 'accost', 'account', 'accursed', 'accusation', 'accuse', 'ache']
2113
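
One caveat about the substring tests used above: the header row lists a `Negate` category alongside `Neg`, so `"Neg" in j` is also true for any line carrying the `Negate` tag. A stricter sketch compares whole tags instead. The function name is made up for illustration, the first two sample lines are copied from the output above, and the third is a hypothetical entry added only to show the difference. Whether this changes the counts for the real file has not been checked here.

```python
def words_with_tag(lines, tag):
    """Return sorted lowercase entry words whose tag list contains `tag` exactly."""
    words = set()
    for line in lines:
        fields = line.split()
        # fields[0] is the entry word, the remaining fields are its tags
        if len(fields) > 1 and tag in fields[1:]:
            words.add(fields[0].split("#")[0].lower())
    return sorted(words)

sample = [
    'ABOLISH H4Lvd Neg Ngtv Hostile Strng Power Actv Intrel IAV POWOTH POWTOT SUPV  |',
    'ABOUND H4 Pos Psv Incr IAV SUPV  |',
    'HYPOTHETICAL#1 H4 Negate',   # made-up line, not from the dictionary
]

exact = words_with_tag(sample, 'Neg')
substring = [j.split()[0] for j in sample if "Neg" in j]
print(exact)
print(substring)
```

The exact comparison keeps only the entry tagged `Neg`, while the substring test also picks up the made-up `Negate` line.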


In [58]:
#Pull clean lowercase version of bio as one long string
text = sanjivbio.replace('\n',' ').lower()
text

'   sanjiv das is the william and janice terry professor of finance and data science at santa clara university\'s leavey school of business. he previously held faculty appointments as associate professor at harvard business school and uc berkeley. he holds post-graduate degrees in finance (m.phil and ph.d. from new york university), computer science (m.s. from uc berkeley), an mba from the indian institute of management, ahmedabad, b.com in accounting and economics (university of bombay, sydenham college), and is also a qualified cost and works accountant (aicwa). he is a senior editor of the journal of investment management, co-editor of the journal of derivatives and the journal of financial services research, and associate editor of other academic journals. prior to being an academic, he worked in the derivatives business in the asia-pacific region as a vice-president at citibank. his current research interests include: machine learning, social networks, derivatives pricing models, 

In [59]:
text = text.split(' ')
text

['',
 '',
 '',
 'sanjiv',
 'das',
 'is',
 'the',
 'william',
 'and',
 'janice',
 'terry',
 'professor',
 'of',
 'finance',
 'and',
 'data',
 'science',
 'at',
 'santa',
 'clara',
 "university's",
 'leavey',
 'school',
 'of',
 'business.',
 'he',
 'previously',
 'held',
 'faculty',
 'appointments',
 'as',
 'associate',
 'professor',
 'at',
 'harvard',
 'business',
 'school',
 'and',
 'uc',
 'berkeley.',
 'he',
 'holds',
 'post-graduate',
 'degrees',
 'in',
 'finance',
 '(m.phil',
 'and',
 'ph.d.',
 'from',
 'new',
 'york',
 'university),',
 'computer',
 'science',
 '(m.s.',
 'from',
 'uc',
 'berkeley),',
 'an',
 'mba',
 'from',
 'the',
 'indian',
 'institute',
 'of',
 'management,',
 'ahmedabad,',
 'b.com',
 'in',
 'accounting',
 'and',
 'economics',
 '(university',
 'of',
 'bombay,',
 'sydenham',
 'college),',
 'and',
 'is',
 'also',
 'a',
 'qualified',
 'cost',
 'and',
 'works',
 'accountant',
 '(aicwa).',
 'he',
 'is',
 'a',
 'senior',
 'editor',
 'of',
 'the',
 'journal',
 'of',
 'i

In [63]:
#Match text to poswords, negwords, use the set operators
posmatches = set(text).intersection(set(poswords))
print(posmatches)
print(len(posmatches))
negmatches = set(text).intersection(set(negwords))
print(negmatches)
print(len(negmatches))

{'fellow', 'real', 'reconcile', 'your', 'pleasure', 'important', 'pretty', 'live', 'his', 'distinct', 'associate', 'mutual', 'credit', 'excitement', 'great', 'open', 'have', 'education', 'meet', 'unique'}
20
{'bad', 'no', 'short', 'cool', 'never', 'unchecked', 'cost', 'default', 'get', 'unreal', 'let', 'mad'}
12
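
Because the text was split on single spaces, tokens such as `'business.'` and `'management,'` keep their punctuation and cannot match a dictionary word. A sketch of how a regex tokenizer changes the matches. The sentence and the two tiny word lists are hypothetical, and the pattern (letters with optional internal hyphens or apostrophes) is one possible choice rather than the method used above:

```python
import re

# Hypothetical sentence and tiny word lists, chosen only for illustration
sample = "his research interests include credit, default risk, and cost."
pos_sample = {'credit', 'great'}
neg_sample = {'cost', 'default'}

# Splitting on spaces leaves 'credit,' and 'cost.' with punctuation attached
by_space = set(sample.split(' '))

# Runs of letters, allowing hyphens or apostrophes inside a word
by_regex = set(re.findall(r"[a-z]+(?:['-][a-z]+)*", sample))

print(by_space & pos_sample, by_space & neg_sample)
print(by_regex & pos_sample, by_regex & neg_sample)
```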


## General Function to Pull Financial Text and score it

In [68]:
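def finScore(url,poswords,negwords):
    #Download the page and strip out the HTML markup
    f = requests.get(url)
    text = f.text
    f.close()
    text = BeautifulSoup(text,'lxml').get_text()    
    #Lowercase and split on single spaces (punctuation stays attached to words)
    text = text.replace('\n',' ').lower()
    text = text.split(' ')
    #Print the distinct words that appear in each word list, and how many
    posmatches = set(text).intersection(set(poswords))
    print(posmatches)
    print(len(posmatches))
    negmatches = set(text).intersection(set(negwords))
    print(negmatches)
    print(len(negmatches))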

In [70]:
#Try this on the same data as before
url = 'http://srdas.github.io/bio-candid.html'
finScore(url,poswords,negwords)

{'fellow', 'real', 'reconcile', 'your', 'pleasure', 'important', 'pretty', 'live', 'his', 'distinct', 'associate', 'mutual', 'credit', 'excitement', 'great', 'open', 'have', 'education', 'meet', 'unique'}
20
{'bad', 'no', 'short', 'cool', 'never', 'unchecked', 'cost', 'default', 'get', 'unreal', 'let', 'mad'}
12
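
`finScore` prints its results, which makes them hard to reuse. A possible variant, sketched here under the assumption that the text has already been downloaded and cleaned, returns the two match sets instead. The name `score_text` and the sample inputs are made up for illustration:

```python
def score_text(text, poswords, negwords):
    """Return (posmatches, negmatches) for a plain-text string."""
    # Same tokenization as finScore: lowercase, then split on single spaces
    tokens = set(text.replace('\n', ' ').lower().split(' '))
    return tokens & set(poswords), tokens & set(negwords)

# Hypothetical input and word lists
pos, neg = score_text("Great credit\nbut bad default",
                      ['great', 'credit'], ['bad', 'default'])
print(len(pos), len(neg))
```

Keeping the download separate from the scoring also lets the same function be applied to text read from a file.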


In [71]:
#Let's get Apple's SEC 10-K filing
url = 'http://investor.apple.com/secfiling.cfm?filingID=320193-17-70&CIK=320193'
finScore(url,poswords,negwords)

{'integrity', 'comprehensive', 'unlimited', 'enhance', 'accordance', 'travel', 'indicative', 'reward', 'relevant', 'favorable', 'supreme', 'award', 'particular', 'achieve', 'secure', 'justice', 'understood', 'continuity', 'competence', 'persuasive', 'offer', 'pro', 'credit', 'complement', 'extraordinary', 'acceptable', 'safe', 'adjustment', 'maturity', 'success', 'useful', 'interest', 'attract', 'protect', 'responsibility', 'create', 'principal', 'allowance', 'back', 'relief', 'home', 'appoint', 'conjunction', 'even', 'best', 'major', 'have', 'facilitate', 'law', 'engage', 'survive', 'aggregate', 'her', 'security', 'precedent', 'their', 'sensitivity', 'intelligent', 'profit', 'essential', 'important', 'defend', 'support', 'effectiveness', 'fitness', 'educational', 'guarantee', 'satisfaction', 'asset', 'gain', 'renewal', 'responsible', 'commitment', 'compliance', 'assist', 'effective', 'portable', 'knowledge', 'upgrade', 'independent', 'help', 'capability', 'actual', 'compensation', 'bo

In [72]:
#Repeat with a different URL from the SEC
url = 'https://www.sec.gov/Archives/edgar/data/320193/000032019317000070/a10-k20179302017.htm'
finScore(url,poswords,negwords)

{'integrity', 'comprehensive', 'unlimited', 'enhance', 'accordance', 'travel', 'indicative', 'reward', 'relevant', 'favorable', 'supreme', 'award', 'particular', 'achieve', 'secure', 'justice', 'continuity', 'competence', 'persuasive', 'offer', 'pro', 'credit', 'complement', 'acceptable', 'safe', 'adjustment', 'maturity', 'success', 'useful', 'interest', 'attract', 'protect', 'responsibility', 'create', 'principal', 'allowance', 'relief', 'home', 'conjunction', 'even', 'best', 'major', 'have', 'law', 'aggregate', 'their', 'her', 'security', 'sensitivity', 'intelligent', 'profit', 'essential', 'important', 'support', 'effectiveness', 'fitness', 'educational', 'guarantee', 'asset', 'gain', 'renewal', 'responsible', 'commitment', 'compliance', 'assist', 'effective', 'portable', 'knowledge', 'upgrade', 'help', 'capability', 'actual', 'compensation', 'bonus', 'free', 'behalf', 'protection', 'commission', 'devote', 'productivity', 'allow', 'better', 'enable', 'contribution', 'privacy', 'defi

**The results are different; not sure why this is so!**
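
One way to start investigating is to look at which words appear in one result but not in the other. A sketch using set differences: the two sets below are small hand-copied subsets of the visible part of the printed results, not the full output, so they illustrate the method only.

```python
# Small subsets retyped from the visible part of the two printed results
first = {'integrity', 'understood', 'extraordinary', 'credit', 'back'}
second = {'integrity', 'credit'}

# Words matched on the first page only, and on the second page only
only_first = sorted(first - second)
only_second = sorted(second - first)
print(only_first)
print(only_second)
```

Searching each page's extracted text for the words that differ would show where they come from.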

## Parts of Speech (POS) Tagging

https://www.cs.toronto.edu/~frank/csc2501/Tutorials/cs485_nltk_krish_tutorial1.pdf