### The Python Natural Language Toolkit (NLTK)

NLTK is a set of modules and corpora that lets the reader do natural language processing on corpora of one or more texts. It goes beyond text mining and provides tools for machine learning, but this notebook barely scratches that surface.

### Install NLTK & Download data

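* pip install nltk -- install NLTK into the correct virtual environment
* import nltk
* nltk.download() -- pops up a GUI for choosing data packages; the cells below need the Punkt tokenizer models and the stopwords corpus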
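When no GUI is convenient (a remote kernel, a script), the same data can be requested by name. The following is a minimal sketch, not the only way to do it: the helper name `download_resources` and the `RESOURCES` tuple are made up for this notebook, and the resource identifiers assume a recent NLTK release (older releases only need `punkt`, newer ones look for `punkt_tab`).

```python
import nltk

# Data packages the cells in this notebook rely on (assumed identifiers):
# "punkt" / "punkt_tab" back nltk.word_tokenize, "stopwords" backs
# nltk.corpus.stopwords.
RESOURCES = ("punkt", "punkt_tab", "stopwords")


def download_resources(resources=RESOURCES):
    """Download each named package and report True/False per package."""
    # nltk.download returns a boolean success flag for a single identifier
    return {name: nltk.download(name, quiet=True) for name in resources}
```

Calling `download_resources()` once per environment should be enough; it needs network access, and a `False` value in the returned dictionary means that package could not be fetched.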

In [1]:
import nltk

### Tokenize Sentences Into Words

In [14]:
# help(nltk.word_tokenize)
my_sent = 'SCENE 1 : [wind] [ clop clops clop ] KING ARTHUR : Whoa there !'
my_tokens = nltk.word_tokenize(my_sent)
my_tokens

['SCENE',
 '1',
 ':',
 '[',
 'wind',
 ']',
 '[',
 'clop',
 'clops',
 'clop',
 ']',
 'KING',
 'ARTHUR',
 ':',
 'Whoa',
 'there',
 '!']
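For comparison, here is a sketch using a different tokenizer, `wordpunct_tokenize`, which is a plain regular-expression splitter and needs no downloaded models. It is swapped in purely for illustration; it is not what `nltk.word_tokenize` does internally.

```python
from nltk.tokenize import wordpunct_tokenize

my_sent = 'SCENE 1 : [wind] [ clop clops clop ] KING ARTHUR : Whoa there !'

# Splits into runs of word characters and runs of punctuation
regex_tokens = wordpunct_tokenize(my_sent)
print(regex_tokens)
```

On this sentence the result matches the list above. The two tokenizers part ways on contractions: `word_tokenize` splits "don't" into `do` and `n't`, while `wordpunct_tokenize` yields `don`, `'`, `t`.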

## Basic Text Processing

In [8]:
w = my_tokens[0]

In [10]:
type(w)

str

In [15]:
words = [w for w in my_tokens if w.isalpha()]
words

['SCENE', 'wind', 'clop', 'clops', 'clop', 'KING', 'ARTHUR', 'Whoa', 'there']
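The `isalpha()` filter is blunt, so it helps to see what it throws away. A small sketch on a hand-picked token list (the sample tokens are invented for illustration):

```python
# str.isalpha() is True only when every character is a letter
samples = ['SCENE', '1', ':', 'clop', "n't", 'well-known', 'café']

kept = [t for t in samples if t.isalpha()]
dropped = [t for t in samples if not t.isalpha()]

print('kept:   ', kept)
print('dropped:', dropped)
```

Numbers and punctuation go, but so do hyphenated words and contraction fragments, which may or may not be what a given analysis wants.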

In [16]:
# create a list of (English) stopwords
# note: this rebinds the name `stopwords` from the corpus reader to a plain list
from nltk.corpus import stopwords
stopwords = stopwords.words('english')

In [17]:
words = [w for w in words if w not in stopwords]
words

['SCENE', 'wind', 'clop', 'clops', 'clop', 'KING', 'ARTHUR', 'Whoa']
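The stopword list is all lowercase, so the membership test above is case-sensitive: 'there' was removed, but a capitalized 'The' would survive. A minimal sketch of the difference, using a tiny hand-written stand-in set (`demo_stopwords` is hypothetical, not the NLTK list):

```python
# Hypothetical stand-in for the NLTK list; a set gives fast membership tests
demo_stopwords = {'the', 'of', 'and', 'there'}

tokens = ['The', 'House', 'of', 'Representatives']

# Exact match: 'The' is kept because only 'the' is in the set
case_sensitive = [t for t in tokens if t not in demo_stopwords]

# Lowercase before comparing: 'The' is dropped as well
case_insensitive = [t for t in tokens if t.lower() not in demo_stopwords]

print(case_sensitive)
print(case_insensitive)
```

Which behavior is right depends on the text; for sentence-initial words the lowercase comparison is usually the safer choice.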

### Word Stemming

In [18]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in words]
stems

['scene', 'wind', 'clop', 'clop', 'clop', 'king', 'arthur', 'whoa']
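One reason to stem is that inflected forms collapse into a single entry when counting. A short sketch, reusing the filtered word list from above and the standard-library `Counter` (the counting step is an addition for illustration, not part of the original notebook):

```python
from collections import Counter

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['SCENE', 'wind', 'clop', 'clops', 'clop', 'KING', 'ARTHUR', 'Whoa']

# The Porter stemmer lowercases by default and strips common suffixes
stems = [stemmer.stem(w) for w in words]

word_counts = Counter(words)
stem_counts = Counter(stems)

print(word_counts['clop'], word_counts['clops'])
print(stem_counts['clop'])
```

Before stemming, 'clop' and 'clops' are counted separately (2 and 1); afterwards they share one entry with a count of 3.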

### Refactor into a function

In [19]:
# Libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Module-level objects used by process_text() below
stopwords = stopwords.words('english')
stemmer = PorterStemmer()

In [20]:
def process_text(text):
    """Tokenize text, drop non-alphabetic tokens and stopwords, then stem."""
    my_tokens = nltk.word_tokenize(text)
    # Keep purely alphabetic tokens (drops punctuation and numbers)
    words = [w for w in my_tokens if w.isalpha()]
    # Case-sensitive match against the lowercase stopword list
    words = [w for w in words if w not in stopwords]
    stems = [stemmer.stem(w) for w in words]
    return words, stems


words, stems = process_text('Fellow - Citizens of the Senate and of the House of Representatives :')
# words, stems = process_text('Fellow fellow - Citizens of the Senate and of the House of Representatives :')
print('Words: ', words)
print('Stems: ', stems)

Words:  ['Fellow', 'Citizens', 'Senate', 'House', 'Representatives']
Stems:  ['fellow', 'citizen', 'senat', 'hous', 'repres']
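`process_text` reads `stopwords` and `stemmer` from the notebook's global state, which makes it awkward to reuse or test. One possible variant passes them in explicitly. This is a sketch under assumptions: the name `process_text_explicit`, the tiny `demo_stop_words` set, and the use of the regex-based `wordpunct_tokenize` (chosen because it needs no downloaded data) are all illustrative choices, not the notebook's method.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize


def process_text_explicit(text, stop_words, stemmer, tokenize=wordpunct_tokenize):
    """Return (words, stems) with every dependency passed in as an argument."""
    tokens = tokenize(text)
    # Keep alphabetic tokens whose lowercase form is not a stopword
    words = [w for w in tokens if w.isalpha() and w.lower() not in stop_words]
    stems = [stemmer.stem(w) for w in words]
    return words, stems


# Hypothetical stand-in for stopwords.words('english')
demo_stop_words = {'of', 'the', 'and'}

words, stems = process_text_explicit(
    'Fellow - Citizens of the Senate and of the House of Representatives :',
    demo_stop_words,
    PorterStemmer(),
)
print('Words: ', words)
print('Stems: ', stems)
```

With the real stopword list passed as a set and `nltk.word_tokenize` as the `tokenize` argument, this should behave like the function above, except that stopwords are matched case-insensitively.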
