# Natural Language Processing: NLTK
## Separating
The first step is to organise the text in some way: you can separate it by paragraphs or by sentences. If you split it into sentences, you can still refer back to the paragraph each sentence came from, as paragraphs tend to be logical groupings of ideas.

## Tokenizing
The word tokenizer separates text into words; the sentence tokenizer separates it into sentences.

## Corpora
A corpus (plural: corpora) is a body of text based around similar things: e.g. medical journals, presidential speeches, anything in the English language.

## Lexicon
Words and their meanings. The same word can mean different things in different domains: investors use words differently from the average person.

### Download Datasets
```
import nltk
nltk.download() # download datasets
```
By default this saves the datasets to an `nltk_data` directory in your home directory.

In [1]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [2]:
example_text = "Hello Mr. Smith, how are you doing today? The weather is great and Python is awesome. The sky is pinkish-blue. You should not eat cardboard"

## Splitting Sentences
You could try to split on punctuation, but what about abbreviations such as 'Mr.'? NLTK handles this:

In [3]:
print(sent_tokenize(example_text))

['Hello Mr. Smith, how are you doing today?', 'The weather is great and Python is awesome.', 'The sky is pinkish-blue.', 'You should not eat cardboard']
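
For contrast, here is a minimal sketch of the naive approach, using only the standard `re` module. The splitting rule (break after '.', '?' or '!' followed by whitespace) is an assumption chosen for illustration; it is not how NLTK works internally.

```python
import re

example_text = ("Hello Mr. Smith, how are you doing today? The weather is great "
                "and Python is awesome. The sky is pinkish-blue. "
                "You should not eat cardboard")

# Naive rule: a sentence ends at '.', '?' or '!' followed by whitespace
naive_sentences = re.split(r"(?<=[.?!])\s+", example_text)
print(naive_sentences)
```

The naive rule breaks after 'Mr.', so it produces five pieces instead of the four sentences that `sent_tokenize` returns.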


In [4]:
print(word_tokenize(example_text))

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', 'not', 'eat', 'cardboard']


## Stop Words
Filler words that you don't need for your analysis, like 'a', 'the', etc. NLTK provides stop word lists for several languages; you pick one by name.

In [5]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [6]:
example_text = "This is an example showing off stop word filtration"
stop_words = set(stopwords.words("english"))
print(stop_words)

{'yourself', 'she', 's', 'don', 'didn', 'have', 'mustn', 'other', 'ma', 'how', 'your', 'can', 'them', 'out', 'itself', 'then', 'wasn', 'should', 'our', 'ourselves', 'about', 'isn', 'themselves', 'm', 'where', 'him', 'nor', 'shan', 'just', 're', 'ain', 'any', 'on', 'this', 'herself', 'so', 'or', 'be', 'with', 'me', 'doesn', 'do', 'shouldn', 'my', 'yours', 'hers', 'no', 'at', 'of', 'against', 'why', 'too', 'it', 'who', 'whom', 'these', 'did', 'into', 'each', 'those', 'd', 'aren', 'am', 'had', 'has', 'for', 'once', 'y', 'before', 'because', 'very', 'a', 'there', 'which', 'that', 'now', 'what', 'himself', 've', 'were', 'having', 'again', 'll', 'if', 'such', 'by', 'hadn', 'below', 'but', 'few', 'after', 'more', 'does', 'all', 'to', 'they', 'until', 'during', 'ours', 'his', 'in', 'as', 'hasn', 'weren', 'their', 'under', 'some', 'will', 'been', 'between', 'up', 'wouldn', 'was', 'theirs', 'is', 'above', 'we', 'through', 'being', 'you', 't', 'haven', 'than', 'not', 'here', 'over', 'its', 'both'

In [7]:
words = word_tokenize(example_text)
filtered_sentence = [w for w in words if w not in stop_words]
print(filtered_sentence)

['This', 'example', 'showing', 'stop', 'word', 'filtration']
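
Note that 'This' survives in the output above: the stop word list is lowercase and the comparison is case-sensitive. A minimal sketch of a case-insensitive filter follows. The `stop_words` set here is a small hand-picked subset used only for illustration, not the full NLTK list.

```python
# Small illustrative subset of stop words (assumption, not the full NLTK list)
stop_words = {"this", "is", "an", "off"}

words = ["This", "is", "an", "example", "showing", "off", "stop", "word", "filtration"]

# Lowercase each token before the membership test so 'This' matches 'this'
filtered_sentence = [w for w in words if w.lower() not in stop_words]
print(filtered_sentence)
```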


## Stemming Words
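Reduce words to a common stem so that words with the same base meaning are treated alike. The stem is not always a real word (e.g. 'pythonli' below).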

In [8]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [9]:
ps = PorterStemmer()
example_words = ['python', 'pythoner', 'pythoning', 'pythoned', 'pythonly']
for w in example_words:
    print(ps.stem(w))

python
python
python
python
pythonli


In [10]:
new_text = "It is very important to be pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
words = word_tokenize(new_text)
for w in words:
    print(ps.stem(w))

It
is
veri
import
to
be
pythonli
while
you
are
python
with
python
.
All
python
have
python
poorli
at
least
onc
.


## Part-of-Speech Tagging
Label each word with its part of speech: noun, verb, adjective, etc.

**POS tag list:**
- CC	coordinating conjunction
- CD	cardinal number
- DT	determiner
- EX	existential there (like: "there is" ... think of it like "there exists")
- FW	foreign word
- IN	preposition/subordinating conjunction
- JJ	adjective	'big'
- JJR	adjective, comparative	'bigger'
- JJS	adjective, superlative	'biggest'
- LS	list marker	1)
- MD	modal	could, will
- NN	noun, singular 'desk'
- NNS	noun plural	'desks'
- NNP	proper noun, singular	'Harrison'
- NNPS	proper noun, plural	'Americans'
- PDT	predeterminer	'all the kids'
- POS	possessive ending	parent's
- PRP	personal pronoun	I, he, she
- PRP$	possessive pronoun	my, his, hers
- RB	adverb	very, silently,
- RBR	adverb, comparative	better
- RBS	adverb, superlative	best
- RP	particle	give up
- TO	to	go 'to' the store.
- UH	interjection	errrrrrrrm
- VB	verb, base form	take
- VBD	verb, past tense	took
- VBG	verb, gerund/present participle	taking
- VBN	verb, past participle	taken
- VBP	verb, sing. present, non-3rd person	take
- VBZ	verb, 3rd person sing. present	takes
- WDT	wh-determiner	which
- WP	wh-pronoun	who, what
- WP\$	possessive wh-pronoun	whose
- WRB	wh-adverb	where, when

In [11]:
#%matplotlib inline
import nltk
from nltk.corpus import state_union
# trained tokenizer
from nltk.tokenize import PunktSentenceTokenizer

In [12]:
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

In [13]:
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [28]:
# break statement so not too much output
try:
    for i in tokenized:
        words = nltk.word_tokenize(i)
        tagged = nltk.pos_tag(words)
        print(tagged)
        break
except Exception as e:
    print(str(e))

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]


## Chunking

After splitting text into sentences and words and tagging each word, chunking groups related words together: typically a noun or named entity along with the descriptive words around it.

In [34]:
# break statement so not too much output
try:
    for i in tokenized:
        words = nltk.word_tokenize(i)
        tagged = nltk.pos_tag(words)

        # regular expression over POS tags: optional adverbs, optional verbs, one or more proper nouns, optional noun
        chunk_gram = r'''Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}'''
        chunk_parser = nltk.RegexpParser(chunk_gram)
        chunked = chunk_parser.parse(tagged)
        print(chunked)
        #chunked.draw()
        break

except Exception as e:
    print(str(e))

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
  'S/POS
  (Chunk ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk THE/NNP UNION/NNP January/NNP)
  31/CD
  ,/,
  2006/CD
  (Chunk THE/NNP PRESIDENT/NNP)
  :/:
  (Chunk Thank/NNP)
  you/PRP
  all/DT
  ./.)


## Chinking

You chink something from a chunk: chinking is the removal of something. You can say you want to chunk everything except for some things.

In [35]:
# break statement so not too much output
try:
    for i in tokenized:
        words = nltk.word_tokenize(i)
        tagged = nltk.pos_tag(words)

        # chunk everything
        chunk_gram = r'''Chunk: {<.*>+}
                    # things to keep out
                    }<VB.?|IN|DT>+{'''
        chunk_parser = nltk.RegexpParser(chunk_gram)
        chunked = chunk_parser.parse(tagged)
        print(chunked)
        #chunked.draw()
        break

except Exception as e:
    print(str(e))

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP 'S/POS ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk
    THE/NNP
    UNION/NNP
    January/NNP
    31/CD
    ,/,
    2006/CD
    THE/NNP
    PRESIDENT/NNP
    :/:
    Thank/NNP
    you/PRP)
  all/DT
  (Chunk ./.))


## Named Entity Recognition

### Type and Examples
- ORGANIZATION - Georgia-Pacific Corp., WHO
- PERSON - Eddy Bonte, President Obama
- LOCATION - Murray River, Mount Everest
- DATE - June, 2008-06-29
- TIME - two fifty a m, 1:30 p.m.
- MONEY - 175 million Canadian Dollars, GBP 10.40
- PERCENT - twenty pct, 18.75 %
- FACILITY - Washington Monument, Stonehenge
- GPE - South East Asia, Midlothian

In [40]:
# break statement so not too much output
try:
    for i in tokenized:
        words = nltk.word_tokenize(i)
        tagged = nltk.pos_tag(words)
        # binary=True labels every entity simply as NE instead of classifying it as PERSON, ORGANIZATION, etc.
        named_entity = nltk.ne_chunk(tagged, binary=True)
        print(named_entity)
        break

except Exception as e:
    print(str(e))

(S
  PRESIDENT/NNP
  (NE GEORGE/NNP)
  W./NNP
  BUSH/NNP
  'S/POS
  (NE ADDRESS/NNP)
  BEFORE/IN
  A/NNP
  JOINT/NNP
  SESSION/NNP
  OF/IN
  (NE THE/NNP)
  (NE CONGRESS/NNP)
  ON/NNP
  THE/NNP
  STATE/NNP
  OF/IN
  (NE THE/NNP UNION/NNP)
  January/NNP
  31/CD
  ,/,
  2006/CD
  THE/NNP
  PRESIDENT/NNP
  :/:
  Thank/NNP
  you/PRP
  all/DT
  ./.)


## Lemmatizing
Similar to stemming, but the result is a real dictionary word (the lemma) rather than a truncated stem. The lemma may look quite different from the original word, e.g. 'better' becomes 'good'.

In [41]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))

cat
cactus
goose
rock
python


In [42]:
# default argument for pos is noun
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))

good
best


In [43]:
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))

run
run


## NLTK Corpus

In [44]:
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize, PunktSentenceTokenizer

sample = gutenberg.raw('bible-kjv.txt')
tok = sent_tokenize(sample)
print(tok[5:15])

['1:5 And God called the light Day, and the darkness he called Night.', 'And the evening and the morning were the first day.', '1:6 And God said, Let there be a firmament in the midst of the waters,\nand let it divide the waters from the waters.', '1:7 And God made the firmament, and divided the waters which were\nunder the firmament from the waters which were above the firmament:\nand it was so.', '1:8 And God called the firmament Heaven.', 'And the evening and the\nmorning were the second day.', '1:9 And God said, Let the waters under the heaven be gathered together\nunto one place, and let the dry land appear: and it was so.', '1:10 And God called the dry land Earth; and the gathering together of\nthe waters called he Seas: and God saw that it was good.', '1:11 And God said, Let the earth bring forth grass, the herb yielding\nseed, and the fruit tree yielding fruit after his kind, whose seed is\nin itself, upon the earth: and it was so.', '1:12 And the earth brought forth grass, and

## WordNet
Can use to look up synonyms, antonyms, definitions and contexts of words.

In [45]:
from nltk.corpus import wordnet

In [46]:
# find synonyms for the word 'program'
syns = wordnet.synsets("program")
print(syns)

[Synset('plan.n.01'), Synset('program.n.02'), Synset('broadcast.n.02'), Synset('platform.n.02'), Synset('program.n.05'), Synset('course_of_study.n.01'), Synset('program.n.07'), Synset('program.n.08'), Synset('program.v.01'), Synset('program.v.02')]


In [47]:
print(syns[0].lemmas())
print(syns[0].lemmas()[0].name())

[Lemma('plan.n.01.plan'), Lemma('plan.n.01.program'), Lemma('plan.n.01.programme')]
plan


In [48]:
print(syns[0].definition())
print(syns[0].examples())

a series of steps to be carried out or goals to be accomplished
['they drew up a six-step plan', 'they discussed plans for a new bond issue']


In [49]:
synonyms = []
antonyms = []

for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))

{'good', 'dependable', 'near', 'sound', 'full', 'unspoiled', 'safe', 'respectable', 'estimable', 'well', 'serious', 'commodity', 'beneficial', 'skillful', 'soundly', 'honorable', 'proficient', 'in_effect', 'secure', 'thoroughly', 'right', 'undecomposed', 'goodness', 'expert', 'just', 'upright', 'honest', 'practiced', 'skilful', 'salutary', 'unspoilt', 'trade_good', 'dear', 'ripe', 'effective', 'in_force', 'adept'}
{'badness', 'ill', 'bad', 'evilness', 'evil'}


## Comparing words

Can use the Wu and Palmer (WUP) method to measure the semantic relatedness of two word senses (synsets). The score depends on how deep the two senses and their most specific common ancestor sit in the WordNet hierarchy.

In [50]:
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('boat.n.01')
w3 = wordnet.synset('cat.n.01')
w4 = wordnet.synset('car.n.01')

# 90% similar
print(w1.wup_similarity(w2))
print(w1.wup_similarity(w3))
print(w1.wup_similarity(w4))

0.9090909090909091
0.32
0.6956521739130435
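
To make the idea concrete, here is a minimal sketch of the Wu-Palmer formula, 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest ancestor the two concepts share. The `PARENT` taxonomy is a hypothetical toy hierarchy, not WordNet's real one, and the depth convention (root has depth 1) is an assumption of this sketch, so the numbers will not match the output above.

```python
# Hypothetical toy taxonomy (child -> parent); not WordNet's real hierarchy
PARENT = {
    "vehicle": "entity",
    "vessel": "vehicle",
    "ship": "vessel",
    "boat": "vessel",
    "animal": "entity",
    "cat": "animal",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    # The root has depth 1 in this sketch
    return len(path_to_root(node))

def wup(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a))
    # First node on b's path to the root that is also on a's path
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

ship_boat = wup("ship", "boat")
ship_cat = wup("ship", "cat")
print(ship_boat, ship_cat)
```

'ship' and 'boat' share the deep ancestor 'vessel' and so score higher than 'ship' and 'cat', which only meet at the root.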


## Text Classification
Can be used to classify text as being about politics/military or to identify the gender of the author. A common use of this technique is to identify spam email.

NLTK has a movie reviews database in its corpus and we're going to try to classify reviews as positive or negative.

In [51]:
import nltk
import random
from nltk.corpus import movie_reviews

# get tuple with words and pos/neg category
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
# shuffle as reviews are grouped by category
random.shuffle(documents)

print(documents[0])

(['anastasia', 'contains', 'something', 'that', 'has', 'been', 'lacking', 'from', 'all', 'of', 'the', 'recent', 'disney', 'releases', '.', '.', '.', '(', 'especially', 'hercules', ')', '.', '.', '.', 'emotion', '.', 'all', 'the', 'wacky', 'characters', 'voiced', 'by', 'celebrities', 'and', 'fantastically', 'animated', 'adventure', 'sequences', 'aren', "'", 't', 'going', 'to', 'hold', 'anyone', "'", 's', 'interest', 'unless', 'there', 'is', 'an', 'emotional', 'core', 'to', 'hold', 'it', 'all', 'together', '.', 'not', 'since', 'disney', "'", 's', 'beauty', '&', 'the', 'beast', 'has', 'there', 'been', 'such', 'a', 'compelling', 'animated', 'film', 'with', 'interesting', 'characters', 'and', 'drama', 'that', 'works', '.', 'the', 'story', 'of', 'the', 'romanov', 'family', ',', 'the', 'rulers', 'of', 'russia', ',', 'and', 'their', 'downfall', 'begins', 'the', 'film', '.', 'anastasia', ',', 'one', 'of', 'the', 'daughters', ',', 'narrowly', 'escapes', 'the', 'mad', 'monk', 'rasputin', '(', 'vo

## Word Frequency Distribution

Ordered from most common words through to least common

In [52]:
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print(len(all_words))
print(all_words.most_common(15))
print(all_words["stupid"])

39768
[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('"', 17612), ('it', 16107), ('that', 15924), ('-', 15595)]
253


In [54]:
# keys() is not sorted by frequency, so this slice is an arbitrary 3000 words
word_features = list(all_words.keys())[:3000]
# to use the most frequent words instead:
#word_features = [w for w, _ in all_words.most_common(3000)]
print(word_features[:20])
#word_features = word_features[100:]

['stayin', 'heigh', 'faceless', 'slight', 'brooms', 'salads', 'galahad', 'liaison', 'laps', 'brethren', 'degenerated', '128', 'humility', 'osmet', 'fascination', 'homicides', 'scrambles', 'moonshiner', 'cross', 'shrinking']
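
As an alternative to slicing the keys, the features can be chosen by frequency. The following is a minimal sketch using `collections.Counter` from the standard library in place of `nltk.FreqDist`. The token list is a hypothetical stand-in for `movie_reviews.words()`, and `find_features_top` is an illustrative name.

```python
from collections import Counter

# Hypothetical tiny corpus standing in for movie_reviews.words()
tokens = "the film was great the film was thin the end".split()
counts = Counter(tokens)

# Keep the N most frequent words as features (N=3 here for illustration)
top_features = [word for word, _ in counts.most_common(3)]

def find_features_top(document):
    # Map each feature word to True/False depending on whether it occurs
    words = set(document)
    return {w: (w in words) for w in top_features}

features = find_features_top(["the", "plot", "twist"])
print(top_features)
print(features)
```

Picking features by frequency makes the feature set reproducible across runs, whereas the order of dictionary keys has depended on the Python version.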


In [55]:
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        # return true/false if word present
        features[w] = (w in words)
    return features

In [60]:
# run example review
example = find_features(movie_reviews.words('neg/cv000_29416.txt'))
print(example['laps'], example['out'])

False True


In [61]:
feature_sets = [(find_features(rev), category) for (rev, category) in documents]
#print(feature_sets[0][0]['swept'])
#print(feature_sets[0][1])

In [62]:
training_set = feature_sets[:1900]
testing_set = feature_sets[1900:]

train = {'pos':0, 'neg':0}
test = {'pos':0, 'neg':0}
for text, verdict in training_set:
    train[verdict] += 1
print(train)

for text, verdict in testing_set:
    test[verdict] += 1
print(test)

{'pos': 962, 'neg': 938}
{'pos': 38, 'neg': 62}


## Naive Bayes Classifier
Assumes that features are independent and is rather basic but tends to work well even when this assumption isn't entirely true.

posterior = (prior occurrences x likelihood) / evidence
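
A small worked example of that formula for a single feature. All counts below are hypothetical and chosen only to keep the arithmetic easy to follow: 10 reviews per class, with the word "wasted" appearing in 1 positive and 6 negative reviews.

```python
# Hypothetical counts: 10 reviews per class, so both priors are 0.5
prior = {"pos": 0.5, "neg": 0.5}

# P("wasted" present | class), from the hypothetical counts above
likelihood = {"pos": 1 / 10, "neg": 6 / 10}

# evidence = P("wasted" present), summed over both classes
evidence = sum(prior[c] * likelihood[c] for c in prior)

# Bayes' rule applied to each class
posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}
print(posterior)
```

With several features, the "naive" independence assumption means the per-feature likelihoods are simply multiplied together.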

In [63]:
classifier = nltk.NaiveBayesClassifier.train(training_set)
print(nltk.classify.accuracy(classifier, testing_set) * 100)
# show_most_informative_features() prints its own output and returns None
classifier.show_most_informative_features()

73.0
Most Informative Features
             fascination = True              pos : neg    =     10.1 : 1.0
               addresses = True              pos : neg    =      9.4 : 1.0
                  hudson = True              neg : pos    =      9.2 : 1.0
              weaknesses = True              pos : neg    =      8.1 : 1.0
              infectious = True              pos : neg    =      7.5 : 1.0
               balancing = True              pos : neg    =      7.5 : 1.0
               behaviour = True              pos : neg    =      7.5 : 1.0
                  shoddy = True              neg : pos    =      6.5 : 1.0
              annoyingly = True              neg : pos    =      6.5 : 1.0
                  wasted = True              neg : pos    =      6.2 : 1.0


## Saving Classifiers

```
# save
import pickle

with open("naivebayes.pickle", "wb") as save_classifier:
    pickle.dump(classifier, save_classifier)
```

```
# load
import pickle

with open("naivebayes.pickle", "rb") as classifier_f:
    classifier = pickle.load(classifier_f)
```

## Scikit-Learn with NLTK

In [64]:
from nltk.classify.scikitlearn import SklearnClassifier
# multinomial = not binary distribution
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB

In [65]:
mnb_classifier = SklearnClassifier(MultinomialNB())
mnb_classifier.train(training_set)

print(nltk.classify.accuracy(mnb_classifier, testing_set) * 100)

70.0


In [66]:
bernoullinb_classifier = SklearnClassifier(BernoulliNB())
bernoullinb_classifier.train(training_set)

print(nltk.classify.accuracy(bernoullinb_classifier, testing_set) * 100)

75.0


In [67]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

In [68]:
log_classifier = SklearnClassifier(LogisticRegression())
log_classifier.train(training_set)

print(nltk.classify.accuracy(log_classifier, testing_set) * 100)

71.0


In [69]:
sgdc_classifier = SklearnClassifier(SGDClassifier())
sgdc_classifier.train(training_set)

print(nltk.classify.accuracy(sgdc_classifier, testing_set) * 100)

71.0


In [70]:
svc_classifier = SklearnClassifier(SVC())
svc_classifier.train(training_set)

print(nltk.classify.accuracy(svc_classifier, testing_set) * 100)

38.0


In [71]:
linear_svc_classifier = SklearnClassifier(LinearSVC())
linear_svc_classifier.train(training_set)

print(nltk.classify.accuracy(linear_svc_classifier, testing_set) * 100)

68.0


In [72]:
nu_svc_classifier = SklearnClassifier(NuSVC())
nu_svc_classifier.train(training_set)

print(nltk.classify.accuracy(nu_svc_classifier, testing_set) * 100)

73.0


## Combining Algorithms

In [73]:
from nltk.classify import ClassifierI
from statistics import mode

In [74]:
class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers
    
    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)
    
    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

In [75]:
voted_classifier = VoteClassifier(classifier, mnb_classifier, bernoullinb_classifier, log_classifier, sgdc_classifier, svc_classifier, nu_svc_classifier)
print((nltk.classify.accuracy(voted_classifier, testing_set))*100)

73.0


In [76]:
for features, _ in testing_set[:6]:
    print("Classification:", voted_classifier.classify(features), "Confidence %:", voted_classifier.confidence(features) * 100)

Classification: pos Confidence %: 85.71428571428571
Classification: neg Confidence %: 57.14285714285714
Classification: pos Confidence %: 100.0
Classification: neg Confidence %: 71.42857142857143
Classification: pos Confidence %: 71.42857142857143
Classification: pos Confidence %: 100.0


## Identifying Bias in Classifier
Determine whether the classifier does better on some classes than on others. Usually you'd generate a confusion matrix.
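
A minimal sketch of a confusion matrix built with the standard library. The `gold` and `predicted` lists are hypothetical; in the notes above they would come from `testing_set` and from calling `classify` on each feature set.

```python
from collections import Counter

# Hypothetical gold labels and predictions for six reviews
gold = ["pos", "pos", "neg", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "neg", "pos", "pos"]

# Count (actual, predicted) pairs
confusion = Counter(zip(gold, predicted))

labels = ["pos", "neg"]
print("actual \\ predicted", *labels)
for actual in labels:
    print(actual, *[confusion[(actual, p)] for p in labels])

# Per-class recall: share of each actual class that was predicted correctly
recall = {lab: confusion[(lab, lab)] / gold.count(lab) for lab in labels}
print(recall)
```

A large gap between the per-class recall values suggests the classifier is biased towards one class. NLTK also ships a `ConfusionMatrix` class in `nltk.metrics` for the same purpose.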

## Stanford NER Tagger

Alternative to NLTK's named entity recognition classifier. It is often regarded as more accurate, but it is slower. It provides multiple models for extracting named entities:

- 3 class model for recognizing locations, persons, and organizations
- 4 class model for recognizing locations, persons, organizations, and miscellaneous entities
- 7 class model for recognizing locations, persons, organizations, times, money, percents, and dates

The tagger is written in Java.

# Natural Language Processing with NLTK and Gensim
## Exploring

- Tokens are not words: tokens are substrings with only structural significance, while words are objects that carry meaning
- Concordance: Searches for text and provides the surrounding context
- Similar: Can find words that occur frequently in the same context as a word. Allows you to do things like understand how words are being used in different texts
- Common Contexts: Identify contexts for sets of words
- Dispersion plot: can use to visualise frequency of words in different texts throughout time.
- Stop Words: Eliminate common words - 'a', 'and', 'the', etc.
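
To illustrate what a concordance does, here is a minimal pure-Python sketch. The function name `concordance`, its `width` parameter, and the sample sentence are assumptions for illustration; NLTK's own version is the `concordance` method on `nltk.Text`.

```python
def concordance(tokens, target, width=2):
    """Return each occurrence of target with `width` tokens of context on both sides."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [tok] + right))
    return hits

tokens = "The sky is blue and the sea is green".split()
hits = concordance(tokens, "is")
for line in hits:
    print(line)
```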

## Frequency Analyses
- Can count token frequency in text. NLTK comes with two useful classes:
	- `FreqDist`
	- `ConditionalFreqDist`
- Words that occur infrequently, or even only once, are often very informative. Filtering out stop words can help when identifying them.
- We can compute:
	- The count of words
	- Vocabulary (unique words)
	- Lexical diversity (ratio of word count to vocabulary). Average number of times a word occurs in a corpus. It is useful for corpus analysis and can help inform you if the corpus has changed significantly under the hood. It can help you identify when you have problems in your analysis.
- `most_common()` lets you retrieve tuples of the most common tokens and their counts
- `counts.hapaxes()` returns the tokens that occur only once
- `counts.freq(word)` returns how often a word occurs relative to the total number of tokens in the corpus
- Conditional Frequency: Frequency of an event given a condition. 
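
The quantities above can be sketched with the standard library alone. This example uses `collections.Counter` rather than `FreqDist` so that it runs without NLTK data. The token list and the genre labels are hypothetical.

```python
from collections import Counter, defaultdict

tokens = "the cat sat on the mat while the dog slept".split()
counts = Counter(tokens)

word_count = len(tokens)                     # total number of tokens
vocabulary = len(counts)                     # number of unique tokens
lexical_diversity = word_count / vocabulary  # average occurrences per word
hapaxes = [w for w, c in counts.items() if c == 1]  # tokens seen only once
freq_the = counts["the"] / word_count        # relative frequency of 'the'

# Conditional frequency: word counts given a condition (hypothetical genre labels)
tagged = [("news", "the"), ("news", "vote"), ("fiction", "the"), ("fiction", "the")]
conditional = defaultdict(Counter)
for genre, word in tagged:
    conditional[genre][word] += 1

print(word_count, vocabulary, lexical_diversity, freq_the)
print(hapaxes)
print(dict(conditional))
```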

## Features
- Document Level Features:
	- Metadata: title, author
	- Paragraphs
	- Sentence construction
- Word Level Features
	- Vocabulary
	- Form (Capitalization)
	- Frequency
- Vector Encoding: Basic representation of documents: vector whose length is equal to the vocabulary of the entire corpus. Word positions in the vector are based on lexicographic (alphabetical) order
- Bag of words: Token frequency is one of the simplest models - calculate the frequency of words in the document and use those numbers as the vector encoding. You can normalise the frequencies according to the total length of the vocabulary of the corpus. Useful for terms that occur frequently that are important. You need to remove stop words for this to be effective.
- One hot encoding: Feature vector encodes vocabulary of document: words are equally distant - give 1 if present, 0 if not. Often used for neural network models.
- TF-IDF Encoding: Highlight terms that are relevant to a document relative to the rest of the corpus by computing the term frequency times the inverse document frequency of the term.
- Distributed representation: Can be used to encode similarity within vector space. Implemented in Gensim's `Doc2Vec` class.
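
The first three encodings can be sketched in a few lines of plain Python. The three-document corpus is hypothetical, and the TF-IDF variant shown is the plain, unsmoothed form (term frequency times log of inverse document frequency); libraries often apply smoothing or normalisation, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],
    ["the", "bird", "flew"],
]

# Lexicographic order fixes each word's position in the vector
vocab = sorted({w for doc in docs for w in doc})

def bag_of_words(doc):
    # Token counts, one slot per vocabulary word
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def one_hot(doc):
    # 1 if the word is present in the document, 0 if not
    present = set(doc)
    return [1 if w in present else 0 for w in vocab]

def tf_idf(doc):
    counts = Counter(doc)
    vector = []
    for w in vocab:
        tf = counts[w] / len(doc)            # term frequency in this document
        df = sum(1 for d in docs if w in d)  # number of documents containing w
        idf = math.log(len(docs) / df)       # plain, unsmoothed IDF
        vector.append(tf * idf)
    return vector

print(vocab)
print(bag_of_words(docs[1]))
print(one_hot(docs[1]))
print(tf_idf(docs[1]))
```

Because 'the' appears in every document, its IDF is log(1) = 0 and it drops out of the TF-IDF vector, which is the behaviour the TF-IDF bullet describes.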