## NLTK

### Installation

On the command line (`cmd` on Windows), run `pip install nltk`.

Then start the interpreter with `python`, and at the Python prompt run `import nltk` followed by `nltk.download()`.

This opens a window where you can select which NLTK data packages (corpora, tokenizer models, and so on) to install. By default, everything is selected, which is fine for following along.
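
The interactive window can also be skipped. As a minimal sketch, assuming only the tokenizer models and the stop word lists are needed for the sections that follow, the resource identifiers can be passed to `nltk.download` directly. The helper name `download_resources` and the choice of identifiers are assumptions made for illustration; newer NLTK releases look for `punkt_tab` rather than `punkt`, so both are listed.

```python
import nltk

# Resource identifiers assumed to cover the examples in this notebook:
# tokenizer models ("punkt", or "punkt_tab" on newer NLTK releases)
# and the stop word lists.
RESOURCES = ("punkt", "punkt_tab", "stopwords")


def download_resources(resources=RESOURCES):
    """Download each resource; return a dict mapping name -> success flag."""
    return {name: nltk.download(name, quiet=True) for name in resources}
```

Calling `download_resources()` once, with network access, stores the data in NLTK's data directory so that later sessions can load it without downloading again.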


## Tokenizing

Tokenizers divide strings into lists of substrings. For example, a sentence tokenizer splits a string into a list of sentences, and a word tokenizer splits a string into a list of words.


In [1]:
from nltk.tokenize import word_tokenize
word_tokenize('Tokenizing this sentence will result in a list with the different elements. Very exciting indeed!')

['Tokenizing',
 'this',
 'sentence',
 'will',
 'result',
 'in',
 'a',
 'list',
 'with',
 'the',
 'different',
 'elements',
 '.',
 'Very',
 'exciting',
 'indeed',
 '!']
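
Sentence tokenizing works along the same lines. The usual entry point is `sent_tokenize(text)`, which relies on a pretrained Punkt model from the downloaded data. The sketch below instead uses an untrained `PunktSentenceTokenizer` so that it runs without any downloaded model; that substitution is an assumption made for illustration, and an untrained tokenizer knows no abbreviations, so it may split after tokens such as "U.S." in real text.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

text = ('Tokenizing this sentence will result in a list with the '
        'different elements. Very exciting indeed!')

# Untrained Punkt tokenizer: no downloaded model, no abbreviation knowledge.
sentence_tokenizer = PunktSentenceTokenizer()
sentences = sentence_tokenizer.tokenize(text)

# Each element of the list is one sentence.
for sentence in sentences:
    print(sentence)
```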

## Stop words

Text often contains stop words such as 'the', 'is', and 'are', which carry little meaning on their own and can be filtered out before further processing. There is no universal list of stop words in NLP research, but the NLTK module ships with one for English and several other languages.

In [2]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

data = "All work and no play makes jack a dull boy. All work and no play makes jack a dull boy."
# English stop words as a set for fast membership tests
stopWords = set(stopwords.words('english'))
words = word_tokenize(data)
# keep tokens that are not stop words; the test is case-sensitive,
# so 'All' is kept even though 'all' is a stop word
wordsFiltered = []
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)
print(wordsFiltered)

['All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.']
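
In the output above, 'All' survives the filter because the stop word list is lowercase and the membership test is case-sensitive. A minimal sketch of the difference follows; the four-word `stop_words` set is a hypothetical stand-in for `stopwords.words('english')` so that the example runs without downloaded data.

```python
# Tokens as word_tokenize returns them for one sentence of `data`.
tokens = ['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', '.']

# Hypothetical mini stop list standing in for stopwords.words('english').
stop_words = {'all', 'and', 'no', 'a'}

# Case-sensitive test: 'All' does not match 'all' and is kept.
case_sensitive = [w for w in tokens if w not in stop_words]

# Lowercasing each token before the test removes 'All' as well.
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing only inside the test keeps the original capitalization of the tokens that remain.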


## Punctuation

As seen in the examples above, punctuation marks are tokens in their own right and are not removed by the stop word filter. They can be filtered separately using `string.punctuation`.

In [4]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [5]:
words = word_tokenize(data)
wordsFiltered = []
for w in words:
    if w not in stopWords and w not in string.punctuation:
        wordsFiltered.append(w)
print(wordsFiltered)

['All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', 'All', 'work', 'play', 'makes', 'jack', 'dull', 'boy']


In [7]:
# as a one-liner
print([w for w in words if w not in stopWords and w not in string.punctuation])

['All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', 'All', 'work', 'play', 'makes', 'jack', 'dull', 'boy']
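
Because `string.punctuation` is a plain string, `w not in string.punctuation` is a substring test. It catches single ASCII marks such as '.', but multi-character tokens such as '``' and typographic quotes such as '’' pass through. The sketch below shows one possible stricter check; the helper name `is_punctuation` and the sample tokens are hypothetical.

```python
import string
import unicodedata

ASCII_PUNCT = set(string.punctuation)


def is_punctuation(token):
    """True if every character is ASCII punctuation or Unicode category P*."""
    return bool(token) and all(
        ch in ASCII_PUNCT or unicodedata.category(ch).startswith('P')
        for ch in token
    )


# Hypothetical sample mixing words, ASCII marks, and typographic quotes.
tokens = ['Company', '’', 's', '“', 'net', 'sales', '”', '``', "''", 'U.S.', '10-K', '.']
kept = [t for t in tokens if not is_punctuation(t)]
print(kept)
```

Tokens that merely contain punctuation, such as 'U.S.' and '10-K', are kept because not every character is a punctuation mark.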


## Simple statistics for Apple 2017 MD&A

In [15]:
import nltk
from nltk import FreqDist
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# read the Apple 2017 MD&A
with open('AAPL_2017.html', 'r') as myfile:
    mda = myfile.read()

# set of English stop words
stopWords = set(stopwords.words('english'))

# tokens excluding stop words (case-insensitive) and ASCII punctuation
mda_tokens = [x for x in word_tokenize(mda) if x.lower() not in stopWords and x not in string.punctuation]
# wrap the tokens in an nltk.Text object
text = nltk.Text(mda_tokens)
# frequency distribution over the tokens
fdist2 = FreqDist(text)
print(fdist2)
# 15 most frequent tokens with their counts
fdist2.most_common(15)

<FreqDist with 1663 samples and 6198 outcomes>


[('Company', 209),
 ('sales', 115),
 ('2017', 101),
 ('’', 93),
 ('net', 92),
 ('2016', 88),
 ('billion', 70),
 ('tax', 55),
 ('2015', 49),
 ('“', 47),
 ('”', 47),
 ('ASU', 44),
 ('primarily', 36),
 ('due', 36),
 ('compared', 31)]
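
In the printed summary, "samples" are the distinct tokens and "outcomes" is the total number of tokens counted. The sketch below shows how these numbers and a few related statistics can be read from a `FreqDist`; the token list is a hypothetical toy example, not taken from the MD&A.

```python
from nltk import FreqDist

# Hypothetical toy tokens; in the cell above these would be mda_tokens.
tokens = ['Net', 'sales', 'net', 'income', 'sales', 'net']

# Fold case first so that 'Net' and 'net' are counted together.
fdist = FreqDist(t.lower() for t in tokens)

print(fdist.N())             # total number of tokens ("outcomes")
print(fdist.B())             # number of distinct tokens ("samples")
print(fdist.most_common(2))  # most frequent tokens with their counts
print(fdist.freq('net'))     # relative frequency of one token
print(fdist.hapaxes())       # tokens that occur exactly once
```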

In [9]:
# distinct tokens longer than 15 characters
V = set(text)
long_words = [w for w in V if len(w) > 15]
sorted(long_words)

['Apple-compatible',
 'Manufacturing-Related',
 'available-for-sale',
 'dollar-denominated',
 'euro-denominated',
 'exchange-related',
 'headcount-related',
 'industry-specific',
 'infrastructure-related',
 'manufacturing-related',
 'other-than-temporarily',
 'other-than-temporary',
 'telecommunications',
 'weighted-average']

## Collocations: bigrams that occur more often than we would expect 

In [31]:
# bigrams -- pairs of adjacent words, in the order they appear
from nltk import bigrams
list(bigrams(['more', 'is', 'said', 'than', 'done']))

[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
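
The same idea extends to longer sequences. As a short sketch, `nltk.util.ngrams` takes the sequence length as its second argument, so `n=3` yields trigrams for the same word list.

```python
from nltk.util import ngrams

words = ['more', 'is', 'said', 'than', 'done']

# n=3 yields trigrams; n=2 would reproduce the bigrams above.
trigrams = list(ngrams(words, 3))
print(trigrams)
```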

In [30]:
import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

# bigram finder over the MD&A tokens (mda_tokens from the cell above)
finder = BigramCollocationFinder.from_words(mda_tokens)
# (bigram, score) pairs, scored by relative frequency
scored = finder.score_ngrams(bigram_measures.raw_freq)

#sorted(bigram for bigram, score in scored)
# ten most frequent bigrams, sorted alphabetically for display
sorted(finder.nbest(bigram_measures.raw_freq, 10))

[('2016', '2015'),
 ('2017', '2016'),
 ('2017', 'compared'),
 ('Company', '’'),
 ('Form', '10-K'),
 ('U.S.', 'dollar'),
 ('compared', '2016'),
 ('due', 'primarily'),
 ('net', 'sales'),
 ('relative', 'U.S.')]
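
Ranking by `raw_freq` returns the most frequent bigrams, which is not quite the same as bigrams that occur more often than expected. An association measure such as pointwise mutual information (PMI) compares the observed count of a pair with the counts of its individual words. The sketch below is one possible setup, assuming a frequency filter of 3 and a hypothetical toy token list rather than the MD&A tokens.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Hypothetical toy tokens; in the cell above these would be the MD&A tokens.
tokens = ('net sales increased net sales decreased the company '
          'the results the outlook net sales grew').split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# Drop bigrams seen fewer than 3 times; PMI is unreliable for rare pairs.
finder.apply_freq_filter(3)

# Rank the remaining bigrams by pointwise mutual information.
top_pmi = finder.nbest(bigram_measures.pmi, 5)
print(top_pmi)
```

Without the frequency filter, PMI tends to favor pairs that appear only once, so the threshold is a judgment call that depends on the size of the text. `bigram_measures.likelihood_ratio` can be passed to `nbest` in the same way.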

## Stemming 

In [35]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# reduce each inflected form to its stem
print([ps.stem(w) for w in ["game", "gaming", "gamed", "games"]])

['game', 'game', 'game', 'game']
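
Stemming is useful before counting, because inflected and capitalized variants of a word then fall into the same bucket. The sketch below combines `PorterStemmer` with `FreqDist` on a hypothetical toy token list. Note that stems are not necessarily dictionary words.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Hypothetical toy tokens with inflected and capitalized variants.
tokens = ['Company', 'company', 'companies', 'sales', 'sale']

# stem() lowercases by default, so case variants collapse as well.
stems = [ps.stem(t) for t in tokens]
print(stems)

# Counting stems instead of raw tokens groups the variants together.
fdist = FreqDist(stems)
print(fdist.most_common())
```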
