## Find Frequency Distributions with NLTK (1)

Lab instructor: Zeyang (Jenny) Gong

#### Kernel: Python 2

#### On Blackboard, you can find the following documents:

- Installation instructions
- Lab1.doc (tutorial that you can go over by yourself after the lab)
- Lab1.ipynb (material that will be presented during the lab)
- Frequency.py (code to create a frequency distribution)
- Frequency without stopwords.py (code to create a frequency distribution without stop words)
- Input.txt

If you want to follow along with the instructor during the lab, please have your Jupyter notebook ready beforehand. The installation instructions have been posted on Blackboard.


Today we will be learning how to use the Natural Language Toolkit (NLTK) module to perform basic text analysis. 

NLTK is an add-on Python package that provides many useful functions for analyzing text. One key component of text analysis is finding keyword frequency distributions.

A frequency distribution can be presented as a table showing how many times each word appears in a document. To directly quote the NLTK book, “it’s called a "distribution" because it tells us how the total number of word tokens in the text are distributed across the vocabulary items. Since we often need frequency distributions in language processing, NLTK provides built-in support for them” (http://www.nltk.org/book/ch01.html). 
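To make the idea concrete, here is a minimal sketch that builds such a table for a toy sentence. It uses `collections.Counter` from the standard library rather than NLTK, and the sentence and variable names are made up for illustration. NLTK's `FreqDist`, used later in this lab, offers a similar word-to-count mapping.

```python
from collections import Counter

# Toy text, made up for illustration (not the lab's input file)
text = "the cat sat on the mat the end"
tokens = text.split()

# Count how many times each distinct word (vocabulary item) occurs
counts = Counter(tokens)

# The counts over the vocabulary add up to the total number of tokens
total = sum(counts.values())

# Print the table: word, count, share of all tokens
for word in sorted(counts):
    print("%s\t%d\t%.3f" % (word, counts[word], counts[word] / float(total)))
```

The last column is where the word "distribution" comes from: the shares of all vocabulary items sum to 1.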

Let’s get started.


### Lab1:
1. String manipulation
2. Tokenization
3. Stop words removal

### Lab2:
4. Stemming
5. Part-of-speech (POS) tagging
6. N-grams
7. TF-IDF



## String Manipulation 

In [None]:
dir(str)

In [None]:
s = '  Hello '
s

In [None]:
s = s.strip()
s

In [None]:
#string indexing 
s[0]

In [None]:
s[-1]

In [None]:
s[:4]

In [None]:
s = s.lower()
s

In [None]:
s.isalpha()

In [None]:
s.isdigit()

In [None]:
s = 'http://www.abc.com'
s.startswith('http')

In [None]:
s = 'I love coding'
s_list = s.split() #default delimiter is any run of whitespace
s_list

In [None]:
new_string = '$$'.join(s_list)
new_string

In [None]:
new_string.replace('$$',' ') #replace every '$$' in the joined string with a space
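The string methods shown in this section are often chained into a single cleaning step. The following is a small sketch of that pattern; the example string and variable names are assumptions chosen for illustration.

```python
# Made-up example string (an assumption, not part of the lab files)
line = "  I LOVE   Coding  "

# strip() trims both ends, lower() normalizes case,
# split() with no argument splits on any run of whitespace
words = line.strip().lower().split()

# join() glues the pieces back together with a single space
cleaned = " ".join(words)

print(words)
print(cleaned)
```

Because `split()` without an argument treats consecutive spaces as one separator, the extra spaces between the words disappear without a separate step.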

## Read data

In [None]:
import nltk
from nltk import FreqDist

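#read the file from the local working directory
#'rU' opens it in universal-newline mode (Python 2); the with block closes the file
with open('input.txt','rU') as f:
    raw = f.read()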

In [None]:
type(raw)

In [None]:
raw = raw.replace('\n',' ') 
raw = raw.decode('utf8') #decode the raw bytes as UTF-8 to get a unicode string
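The cells above read bytes and decode them in a separate step, which matches the Python 2 kernel. One possible alternative is `io.open`, which takes an `encoding` argument and exists in both Python 2 and Python 3. In the sketch below, the file name `sample.txt` and its contents are made up so that the example runs on its own.

```python
import io
import os
import tempfile

# Create a small stand-in file so the sketch is self-contained;
# in the lab you would point `path` at your own input file instead.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf8") as f:
    f.write(u"First line\nSecond line with caf\u00e9\n")

# io.open decodes while reading, so no separate decode step is needed,
# and the with block closes the file automatically.
with io.open(path, "r", encoding="utf8") as f:
    raw_text = f.read()

# Same newline clean-up as in the lab cell
raw_text = raw_text.replace(u"\n", u" ")
print(type(raw_text))
```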

## Tokenization 

In [None]:
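#split the text into word and punctuation tokens
#(needs the NLTK 'punkt' tokenizer data: nltk.download('punkt'))
tokens = nltk.word_tokenize(raw)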
type(tokens)

In [None]:
#change all tokens into lower case 
words1 = [w.lower() for w in tokens]   #list comprehension 

#only keep text words, no numbers 
words2 = [w for w in words1 if w.isalpha()]

#encode each token back into a UTF-8 byte string
words3 = [w.encode('utf8') for w in words2]

In [None]:
#another way to build words1: an explicit for loop instead of a list comprehension
words1 = []
for w in tokens: 
    words1.append(w.lower())
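To see what the lower-casing and `isalpha()` filtering steps do without needing NLTK or its data files, here is a rough sketch that swaps in a regular expression for `nltk.word_tokenize`. The regular expression is only a crude stand-in and will not split text exactly as NLTK does; the sample sentence is made up.

```python
import re

# Made-up sample text (an assumption for illustration)
sample = "Lab 1 covers 3 topics. It's a short lab!"

# Crude stand-in for nltk.word_tokenize:
# runs of word characters, or single punctuation marks
rough_tokens = re.findall(r"\w+|[^\w\s]", sample)

# Same two filtering steps as in the lab cells
lowered = [t.lower() for t in rough_tokens]
alpha_only = [t for t in lowered if t.isalpha()]

print(rough_tokens)
print(alpha_only)
```

The numbers and punctuation marks are dropped by `isalpha()`, while every purely alphabetic token survives in lower case.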

In [None]:
#generate a frequency dictionary for all tokens 
freq = FreqDist(words3)

#sort the frequency list in descending order
sorted_freq = sorted(freq.items(),key = lambda k:k[1], reverse = True)
sorted_freq
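The sorting step above can also be written with `most_common`, a method of `collections.Counter`. Recent NLTK versions define `FreqDist` as a `Counter` subclass, so the same call should be available on `freq`, but it is worth checking against your installed version. The sketch below uses a plain `Counter` and a made-up word list to compare the two approaches.

```python
from collections import Counter

# Made-up word list (an assumption for illustration)
toy_words = ["data", "text", "data", "word", "text", "data"]
toy_freq = Counter(toy_words)

# Sorting by count by hand, as in the lab cell
by_hand = sorted(toy_freq.items(), key=lambda k: k[1], reverse=True)

# most_common(n) returns the n highest-count (word, count) pairs
top_two = toy_freq.most_common(2)

print(by_hand)
print(top_two)
```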

In [None]:
freq.plot(30)

## Stop words removal 

In [None]:
from nltk.corpus import stopwords

#use the NLTK English stop word list; a separate name avoids overwriting
#the imported stopwords module, and a set makes lookups fast
stop_words = set(stopwords.words('english'))

In [None]:
#only keep the words that are not in the NLTK stop word list
words_nostopwords = [w.encode('utf8') for w in words2 if w not in stop_words]
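The filtering pattern itself does not depend on NLTK. As a minimal sketch, the example below applies it with a tiny hand-picked stop word set; both the set and the token list are assumptions for illustration, and the NLTK English list is much longer.

```python
# A tiny hand-picked stop word set, for illustration only
toy_stop_words = set(["the", "a", "is", "of", "and"])

# Made-up tokens, already lower-cased
toy_tokens = ["the", "history", "of", "text", "analysis", "is", "long"]

# Keep only the tokens that are not stop words
content_words = [w for w in toy_tokens if w not in toy_stop_words]

print(content_words)
```

Note that the tokens must be lower-cased before filtering, because the comparison is case-sensitive and the stop word list is in lower case.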

In [None]:
#generate a frequency dictionary for the tokens that remain after stop word removal
freq_nostw = FreqDist(words_nostopwords)

#sort the frequency list in descending order
sorted_freq_nostw = sorted(freq_nostw.items(),key = lambda k:k[1], reverse = True)
sorted_freq_nostw

In [None]:
freq_nostw.plot(30)

## Save the result 

In [None]:
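#write one "word<TAB>count" pair per line
#'w' overwrites output.txt, so re-running the cell does not duplicate rows
with open('output.txt','w') as outfile:
    for word, count in sorted_freq_nostw:
        outfile.write(str(word)+'\t'+str(count)+'\n')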
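To check that the saved file has the expected layout, you can read it back and split each line on the tab character. The sketch below does this with a made-up result list and a temporary file (`toy_output.txt` is a hypothetical name), so it does not touch the lab's own output file.

```python
import os
import tempfile

# Made-up (word, count) pairs standing in for the lab's sorted result
toy_sorted = [("data", 3), ("text", 2), ("word", 1)]
out_path = os.path.join(tempfile.mkdtemp(), "toy_output.txt")

# Write one tab-separated pair per line
with open(out_path, "w") as outfile:
    for word, count in toy_sorted:
        outfile.write("%s\t%d\n" % (word, count))

# Read the file back and split each line into [word, count]
with open(out_path) as infile:
    rows = [line.rstrip("\n").split("\t") for line in infile]

print(rows)
```

The counts come back as strings, so convert them with `int()` if you need to do arithmetic on them.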

All done!