# Let's Say Hello to NLTK

***Natural Language Toolkit***
- It is a suite of *libraries* and *programs* for symbolic and statistical natural language processing (NLP) for *English*, written in the Python programming language.
- It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of *Pennsylvania*.
- NLTK includes graphical demonstrations and sample data.
- It is accompanied by a book that explains the concepts behind the language processing tasks the toolkit supports (*Natural Language Processing with Python*, https://www.nltk.org/book/), plus a cookbook (https://github.com/karanmilan/Automatic-Answer-Evaluation/blob/master/Python%203%20Text%20Processing%20with%20NLTK%203%20Cookbook.pdf).

In [None]:
%matplotlib inline 
import nltk

If you haven't already, download the NLTK data. Calling `nltk.download()` with no arguments opens the interactive downloader.

In [None]:
nltk.download()

# A rude hello to nltk

In [None]:
nltk.chat.rude_chat()

The `book` module contains the *sample data* used in chapter 1 of the NLTK book.

In [None]:
from nltk.book import *

In [None]:
text1

In [None]:
text2

In [None]:
print(text1[:100])

# Searching Text
There are many ways to **examine the context of a text apart from simply reading it**.


- A **concordance** view shows us every occurrence of a given word, together with some context. 

In [None]:
text1.concordance("monstrous")
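
To make the idea concrete, here is a minimal keyword-in-context sketch in plain Python. It is not how NLTK implements `concordance`; the function name `simple_concordance`, the `window` parameter, and the toy token list are assumptions made for illustration.

```python
def simple_concordance(tokens, word, window=3):
    """Return each occurrence of `word` with up to `window` tokens of context per side."""
    lines = []
    target = word.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Toy token list (hypothetical, not taken from text1)
tokens = "the whale was a monstrous size and a monstrous fable".split()
for line in simple_concordance(tokens, "monstrous"):
    print(line)
# whale was a [monstrous] size and a
# size and a [monstrous] fable
```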

- What other words appear in a **similar range of contexts**? 

In [None]:
text1.similar("monstrous")

In [None]:
text2.similar("monstrous")

Austen uses this word quite differently from Melville; for her, monstrous has positive connotations, and sometimes functions as an intensifier like the word *very*.

- We can also examine just the **contexts that are shared** by two or more words.

In [None]:
text2.common_contexts(["monstrous", "very"])
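
One rough way to think about a "context" is the pair of words on either side of a token. The sketch below works under that assumption: it collects (previous, next) pairs for each word and intersects them. The helper names `contexts_by_word` and `shared_contexts` and the toy sentence are hypothetical, and NLTK's own implementation may differ in details such as case handling and ranking.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each word to the set of (previous, next) word pairs it appears between."""
    contexts = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((prev, nxt))
    return contexts

def shared_contexts(tokens, words):
    """Return the contexts in which every word in `words` occurs."""
    contexts = contexts_by_word(tokens)
    return set.intersection(*(contexts[w] for w in words))

# Toy token list (hypothetical, not taken from text2)
tokens = ("a very pretty house and a monstrous pretty garden "
          "am very glad am monstrous glad").split()
print(sorted(shared_contexts(tokens, ["monstrous", "very"])))
# [('a', 'pretty'), ('am', 'glad')]
```

Words that share many such contexts are the kind of candidates `similar` would be expected to surface.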

- We can also determine the **location of a word** in the text: how many words from the beginning it appears. A **dispersion plot** displays this positional information.

`text4` is an artificial text constructed by joining the texts of the Inaugural Address Corpus end-to-end.

In [None]:
text4

In [None]:
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
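
The data behind such a plot is simply a list of token offsets per word. A minimal sketch, assuming a hypothetical helper `word_offsets` and a made-up token list:

```python
def word_offsets(tokens, word):
    """Return the 0-based token positions at which `word` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == word]

# Toy token list (hypothetical, not taken from text4)
tokens = "we the citizens value freedom and the citizens defend freedom".split()
for w in ["citizens", "freedom", "democracy"]:
    print(w, word_offsets(tokens, w))
# citizens [2, 7]
# freedom [4, 9]
# democracy []
```

Each offset corresponds to one mark on the word's row in the plot.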

# Counting Vocabulary

In [None]:
len(text3)

- The vocabulary of a text is just the set of tokens that it uses.

In [None]:
sorted(set(text3))[:20] 

How many distinct words does text3 contain? 

In [None]:
len(set(text3))

In [None]:
text3.plot()  # try text3.plot(10), text3.plot(20), text3.plot(50)

Now, let's calculate a measure of the lexical richness of the text: the ratio of distinct tokens to total tokens.

In [None]:
def lexical_diversity(text):
    """Ratio of distinct tokens to total tokens."""
    return len(set(text)) / len(text)

In [None]:
lexical_diversity(text3)

In [None]:
lexical_diversity(text5)
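
A small worked example may help show what the ratio measures. The token list below is made up, and the lowercasing step is an optional variation (an assumption, not something the notebook applies) that shows how capitalization affects the count of distinct tokens.

```python
def lexical_diversity(text):
    """Ratio of distinct tokens to total tokens."""
    return len(set(text)) / len(text)

# Toy token list (hypothetical)
tokens = "The cat sat on the mat and the dog sat too".split()

print(len(tokens))                # 11 tokens in total
print(len(set(tokens)))           # 9 distinct tokens ("The" and "the" differ)
print(lexical_diversity(tokens))  # 9 / 11

# Optional variation: fold case before counting
lowered = [t.lower() for t in tokens]
print(len(set(lowered)))          # 8 distinct tokens
print(lexical_diversity(lowered)) # 8 / 11
```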

# Frequency Distributions

***What makes a text distinct?***

A **frequency distribution** tells us the frequency of each vocabulary item in the text.

In [None]:
fdist1 = FreqDist(text1)

In [None]:
print(fdist1)

In [None]:
print(fdist1.most_common(50))

In [None]:
fdist1['whale']
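
NLTK's `FreqDist` is built on Python's `collections.Counter`, so the same operations can be tried on a small list without any NLTK data. The token list here is a made-up example.

```python
from collections import Counter

# Toy token list (hypothetical, not taken from text1)
tokens = "the whale and the sea and the ship".split()
counts = Counter(tokens)

print(counts.most_common(2))  # [('the', 3), ('and', 2)]
print(counts["whale"])        # 1
print(counts["harpoon"])      # 0, unseen items count as zero
```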

How about the words that occur only once, the so-called **hapaxes**?

In [None]:
print(fdist1.hapaxes()[:30]) # over 9000
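
Finding hapaxes amounts to keeping the items whose count is exactly 1. A minimal sketch with `collections.Counter`; the function name `find_hapaxes` and the token list are assumptions for illustration.

```python
from collections import Counter

def find_hapaxes(tokens):
    """Return the tokens that occur exactly once, in order of first appearance."""
    counts = Counter(tokens)
    return [word for word, count in counts.items() if count == 1]

# Toy token list (hypothetical, not taken from text1)
tokens = "the whale and the sea and the ship".split()
print(find_hapaxes(tokens))  # ['whale', 'sea', 'ship']
```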

How about **long words**?

In [None]:
V = set(text1) #try it for text5
long_words = [w for w in V if len(w) > 15]
print(sorted(long_words))

To narrow the list, look for the frequently occurring, content-bearing words of the text: those longer than 7 characters that occur more than 7 times.

In [None]:
fdist = FreqDist(text1)  # try it for text5
# Use the same distribution that was just built, so swapping the text works as expected
print(sorted(w for w in fdist if len(w) > 7 and fdist[w] > 7)[:30])

# Collocations and Bigrams
- A **bigram** is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2.

In [None]:
list(bigrams(['more', 'is', 'said', 'than', 'done']))
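
The same pairs can be produced in plain Python by zipping a list with itself shifted by one position. This is only a sketch of the idea; `simple_bigrams` is a hypothetical name, not part of NLTK.

```python
def simple_bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

print(simple_bigrams(['more', 'is', 'said', 'than', 'done']))
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```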

- A **collocation** is a sequence of words that occur together unusually often.
- Thus **red wine** is a collocation, whereas **the wine** is not.
- A characteristic of collocations is that they are *resistant to substitution* with words that have similar senses; for example, **maroon wine** sounds definitely odd.

In [None]:
text4.collocations()

In [None]:
text8.collocations()
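
"Unusually often" can be made precise by comparing how often a pair occurs with how often it would occur if the two words were independent. The sketch below swaps in pointwise mutual information (PMI) as the score; this is an assumption chosen for simplicity, not the measure or the filtering that NLTK's `collocations()` applies. The function name `pmi_collocations`, the `min_count` threshold, and the token list are all hypothetical.

```python
import math
from collections import Counter

def pmi_collocations(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    scored = []
    for (w1, w2), count in pair_counts.items():
        if count < min_count:
            continue  # skip pairs seen too rarely to say anything about
        p_pair = count / (n - 1)  # relative frequency of the pair
        p_independent = (word_counts[w1] / n) * (word_counts[w2] / n)
        scored.append(((w1, w2), math.log2(p_pair / p_independent)))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy token list (hypothetical)
tokens = ("the red wine and the white wine and the red wine "
          "and the old house and the red wine").split()
for pair, score in pmi_collocations(tokens):
    print(pair, round(score, 2))
# The top-ranked pair is ('red', 'wine'); ('the', 'red') scores lower
# because 'the' also appears with many other words.
```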