# Muse's NLTK Cookbook

[NLTK](http://www.nltk.org) is a powerful tool for natural language processing in Python. There are plenty of great resources on the internet that discuss the features of NLTK at length. This will not be one of those. Instead, this cookbook shares some common recipes that you can start playing with right away. Because it uses the Jupyter/IPython notebook format, you can read each step and execute the code as you go, getting immediate feedback.

Please share any questions or additions on [the GitHub repo](http://github.com/nelsonam) or reach out to me on Twitter [@musegarden](http://twitter.com/musegarden).

The first thing you'll want to do is make sure you have NLTK installed. Official installation instructions can be found on the [NLTK website](http://nltk.org/install.html), but if you want a quick setup on OS X/Linux, just run `pip install nltk` in a terminal. Then you should be able to follow along here.

First, import the nltk library:

In [10]:
import nltk

## Obtaining Corpora

### NLTK Corpora

NLTK includes a wide array of corpora, so you can download the included datasets or use your own. I recommend downloading the "book" collection; it contains a variety of corpora that you can use to follow along with the official NLTK book. Calling `nltk.download()` opens an interactive downloader: `d` starts a download and `l` lists the available packages.

In [12]:
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> l
Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] basque_grammars..... Grammars for Basque
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
                           Extraction Systems in Biology)
  [ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
  [ ] book_grammars....... Grammars from NLTK Book
  [ ] brown............... Brown Corpus
  [ ] brown_tei........... Brown Corpus (TEI XML Version)
  [ ] cess_cat............ CESS-CAT Treebank
  [ ] cess_esp............ CESS-ESP Treebank
  [ ] chat80.............. Chat-80 Data Files
  [ ] city_database....... City Database
  [ 

True

In [20]:
# if you didn't download the "book" collection above, do so now
from nltk.book import *

### Your Own Corpora

You're just as welcome to load in data from your own sources.

In [21]:
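# codecs.open decodes the file as UTF-8 while reading, so unicode is handled gracefully
import codecs

def readInput(filename):
    # the with-block closes the file once it has been read
    with codecs.open(filename, 'r', 'utf-8') as book:
        raw = book.read()
    tokens = nltk.word_tokenize(raw) # breaks up the text into a list of word and punctuation tokens
    text = nltk.Text(tokens) # the standard NLTK Text object
    return text # hand the Text back so the caller can use it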
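
If you'd like to check that a file loads and decodes cleanly before involving NLTK's tokenizer, here is a minimal standard-library sketch. The names `read_tokens` and `sample_path`, the regular expression, and the sample sentence are all illustrative assumptions and not part of NLTK. The regex is only a rough stand-in for `word_tokenize` and will split some text differently.

```python
import os
import re
import tempfile

def read_tokens(filename):
    """Read a UTF-8 text file and split it into word and punctuation tokens."""
    with open(filename, "r", encoding="utf-8") as book:
        raw = book.read()
    # stand-in tokenizer (an assumption, not NLTK's): runs of word
    # characters, or runs of punctuation
    return re.findall(r"\w+|[^\w\s]+", raw)

# write a small sample file so the example is self-contained
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Caf\u00e9 owner seeks friendly barista, 25-40.")
    sample_path = tmp.name

sample_tokens = read_tokens(sample_path)
os.remove(sample_path)  # clean up the temporary file
print(sample_tokens)
```

The resulting list of strings is the same kind of input that `nltk.Text` expects, so you could pass it along once you are happy with how the file reads.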

## Personals Corpus

Let's have a look at what people want in the personals corpus. If you downloaded the NLTK "book" collection above, you should be good to go.

### Concordance

The first thing we'll look at is a concordance, which shows the _context_ in which a given string appears within the corpus. For instance, we can see what people are "seeking".

In [24]:
personals = text8 # text8 from the nltk.book collection
personals.concordance("seeks")

Displaying 25 of 72 matches:
                                     seeks attrac older single lady , for discr
yo SINGLE DAD , sincere friendly DTE seeks r / ship with fem age open S / E 44y
ip with fem age open S / E 44yo tall seeks working single mum or lady below 45 
 Nat Open 6 . 2 35 yr old OUTGOING M seeks fem 28 - 35 for o / door sports - w 
earing from you all . ABLE young man seeks , sexy older women . Phone for fun r
sports , music , cafes , beach & c . Seeks an honest , attractive lady with a E
, financially secure , no children , seeks attractive lady up to 40 y . o . wit
 , cafes , movies , dinner parties . Seeks out there female for friendship and 
th 11 y . o . son , living with me . Seeks nice , caring lady who likes childre
dining etc . OUTGOING GUY Late 30s , seeks lady , 25 - 50 , size unimportant fo
ch , country drives , quiet nights , seeks employed 28 - 40 year old lady for r
nderstanding , with varied interests seeks genuine female for friendship , rela
never marri
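
To make the idea behind a concordance more concrete, here is a tiny pure-Python sketch of a keyword-in-context lookup. It is not how NLTK implements `concordance`: the function name `simple_concordance`, the `width` parameter (counted in tokens here, rather than characters), and the toy sentence are all assumptions made for illustration. Like the output above, which matches both "seeks" and "Seeks", the sketch ignores case when matching.

```python
def simple_concordance(tokens, word, width=3):
    """Return (left context, match, right context) for each hit of `word`."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:  # case-insensitive match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# toy data, not taken from the personals corpus
toy_tokens = "Tall guy seeks honest lady . Lady seeks fun guy .".split()
matches = simple_concordance(toy_tokens, "seeks")

for left, hit, right in matches:
    # right-align the left context so the matches line up in a column
    print("{:>20} {} {}".format(left, hit, right))
```

Lining the matches up in a single column is what makes a concordance easy to scan: your eye can run down the keyword and read the context on either side.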

### POS Tagging and Frequency Analysis

What words do people use most often in their personal ads? Let's find out.

In [34]:
fdist1 = FreqDist(personals)
fdist1.most_common(50)

[(u',', 539),
 (u'.', 353),
 (u'/', 110),
 (u'for', 99),
 (u'to', 74),
 (u'and', 74),
 (u'lady', 68),
 (u'-', 66),
 (u'seeks', 60),
 (u'a', 52),
 (u'with', 44),
 (u'S', 36),
 (u'ship', 33),
 (u'&', 30),
 (u'relationship', 29),
 (u'fun', 28),
 (u'slim', 27),
 (u'build', 27),
 (u'in', 27),
 (u'o', 26),
 (u's', 24),
 (u'smoker', 23),
 (u'y', 23),
 (u'50', 23),
 (u'movies', 22),
 (u'non', 22),
 (u'I', 22),
 (u'good', 21),
 (u'honest', 20),
 (u'out', 19),
 (u'dining', 19),
 (u'looking', 18),
 (u'rship', 18),
 (u'like', 18),
 (u'friendship', 17),
 (u'who', 17),
 (u'attractive', 17),
 (u'age', 17),
 (u'35', 16),
 (u'Looking', 16),
 (u'45', 16),
 (u'40', 16),
 (u'5', 16),
 (u'meet', 15),
 (u'male', 15),
 (u'times', 15),
 (u'MALE', 15),
 (u'life', 15),
 (u'enjoy', 14),
 (u'fit', 14)]

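If you look at the results above, you'll see that some words are counted separately depending on their capitalization: "male" and "MALE" each get their own entry, as do "looking" and "Looking". How can we combine these? One option is to lowercase every token before counting.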

In [35]:
personals = nltk.Text([t.lower() for t in personals.tokens])
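
To see the effect of lowercasing on a small scale, here is a sketch using `collections.Counter` from the standard library, which `FreqDist` subclasses in NLTK 3. The token list is made up for illustration and is not taken from the corpus.

```python
from collections import Counter

# toy tokens with mixed capitalization (illustrative, not corpus data)
toy_tokens = ["MALE", "seeks", "lady", ",", "male", "Seeks", "Lady", "male"]

# counting as-is treats "MALE" and "male" as different words
case_sensitive = Counter(toy_tokens)

# lowercasing first merges the variants into a single entry
case_folded = Counter(t.lower() for t in toy_tokens)

print(case_sensitive.most_common(3))
print(case_folded.most_common(3))
```

Keep in mind that lowercasing throws information away: after folding, you can no longer tell an emphatic "MALE" from a plain "male", so it is worth keeping the original tokens around if capitalization might matter later.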

Let's try that again.

In [36]:
fdist1 = FreqDist(personals)
fdist1.most_common(50)

[(u',', 539),
 (u'.', 353),
 (u'/', 110),
 (u'for', 100),
 (u'lady', 88),
 (u'to', 79),
 (u'and', 76),
 (u'seeks', 72),
 (u'-', 66),
 (u'a', 62),
 (u's', 60),
 (u'with', 46),
 (u'male', 42),
 (u'looking', 34),
 (u'slim', 33),
 (u'ship', 33),
 (u'fun', 31),
 (u'&', 30),
 (u'relationship', 29),
 (u'attractive', 29),
 (u'o', 29),
 (u'build', 27),
 (u'in', 27),
 (u'good', 26),
 (u'y', 26),
 (u'non', 25),
 (u'seeking', 25),
 (u'n', 23),
 (u'smoker', 23),
 (u'50', 23),
 (u'guy', 22),
 (u'movies', 22),
 (u'honest', 22),
 (u'i', 22),
 (u'married', 21),
 (u'age', 21),
 (u'friendship', 20),
 (u'like', 20),
 (u'fit', 19),
 (u'out', 19),
 (u'dining', 19),
 (u'rship', 19),
 (u'single', 19),
 (u'r', 18),
 (u'f', 17),
 (u'who', 17),
 (u'caring', 17),
 (u'enjoy', 16),
 (u'35', 16),
 (u'tall', 16)]