# Text Mining: Introduction to NLTK

## Getting Started

Import the NLTK module.

In [None]:
import nltk

Define a sentence.

In [None]:
sentence = "At eight o'clock on Thursday morning Arthur didn't feel very good."

Tokenise the sentence using the built-in tokenizer. This requires NLTK data, for example the `book` collection, which can be fetched with `nltk.download('book')`.

In [None]:
tokens = nltk.word_tokenize(sentence)
print(tokens)
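For comparison, here is a minimal sketch of a naive whitespace split that uses only the standard library. It is not what `nltk.word_tokenize` does; it only illustrates why a dedicated tokenizer is useful, since punctuation stays attached to the neighbouring word.

```python
# Naive baseline: split on whitespace only (standard library, no NLTK data needed).
sentence = "At eight o'clock on Thursday morning Arthur didn't feel very good."
naive_tokens = sentence.split()
print(naive_tokens)

# The final token still carries the full stop.
print(naive_tokens[-1])
```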

Assign part-of-speech tags to the words in the sentence.

In [None]:
tags = nltk.pos_tag(tokens)
print(tags)
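`nltk.pos_tag` returns a list of `(token, tag)` tuples, so the result can be filtered with ordinary list operations. The sketch below uses a short hand-written list in place of real tagger output (the tags shown are illustrative assumptions; actual output depends on the tagger model) and keeps the tokens whose tag starts with `NN`, the Penn Treebank prefix for nouns.

```python
# Hand-written stand-in for tagger output; real tags may differ.
example_tags = [("Arthur", "NNP"), ("did", "VBD"), ("n't", "RB"),
                ("feel", "VB"), ("very", "RB"), ("good", "JJ"), (".", ".")]

# Keep tokens whose tag starts with "NN" (noun tags in the Penn Treebank tag set).
nouns = [token for token, tag in example_tags if tag.startswith("NN")]
print(nouns)
```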

Identify named entities.

In [None]:
entities = nltk.chunk.ne_chunk(tags)
print(entities)

## Movie Reviews

Import the movie review corpus.

In [None]:
from nltk.corpus import movie_reviews

Show the list of document categories in the corpus.

In [None]:
movie_reviews.categories()

Each category comes with a number of examples. Show the file IDs of the first five examples from the *negative* category.

In [None]:
movie_reviews.fileids('neg')[:5]

Store all documents in one list.

In [None]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

In [None]:
print(documents[0])

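Shuffle the data set. The list is built category by category, so without shuffling the first 100 documents, which are held out as the test set below, would all come from the same category.

In [None]:
import random
random.shuffle(documents)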
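Because the document list is built one category at a time, its order matters for any split that slices from the front. The sketch below uses a toy stand-in list (the placeholder words, the name `toy_documents`, and the seed are assumptions for illustration) to show how the labels in the first 100 entries change once the list is shuffled.

```python
import random

# Toy stand-in for `documents`: 1000 'neg' entries followed by 1000 'pos' entries.
toy_documents = ([(["some", "words"], "neg")] * 1000
                 + [(["other", "words"], "pos")] * 1000)

# Without shuffling, the first 100 entries all share one label.
labels_before = {label for _, label in toy_documents[:100]}

# Shuffle a copy with a fixed seed (42 is an arbitrary choice) for reproducibility.
shuffled = list(toy_documents)
random.Random(42).shuffle(shuffled)
labels_after = {label for _, label in shuffled[:100]}

print(labels_before, labels_after)
```

Using a seeded `random.Random` instance rather than the module-level `random.shuffle` is one way to make the split repeatable between runs.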

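Compute a frequency distribution over all words in the corpus and take the 2000 most frequent words as features.

In [None]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# most_common() sorts by frequency; list(all_words) would follow first-seen order instead.
word_features = [word for word, _ in all_words.most_common(2000)]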

In [None]:
all_words
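In NLTK 3, `nltk.FreqDist` is a subclass of `collections.Counter`, so the standard-library class can illustrate the difference between iteration order and frequency order, which matters when selecting the most frequent words. The toy word list below is an assumption for illustration.

```python
from collections import Counter

# Toy word list; 'film' occurs most often but appears last in first-seen order.
words = ["the", "plot", "film", "film", "the", "film"]
counts = Counter(words)

# Iterating over a Counter yields keys in first-seen order, not by frequency.
first_seen = list(counts)[:2]

# most_common(n) returns the n highest-count (word, count) pairs.
top_two = [word for word, _ in counts.most_common(2)]

print(first_seen, top_two)
```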

The following function extracts features from a document: for each feature word, it records whether the word occurs in the document.

In [None]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [None]:
print(document_features(movie_reviews.words('pos/cv957_8737.txt')))
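To make the shape of the feature dictionary easier to see, here is a small sketch of the same idea on toy data. The function `toy_document_features`, the three-word vocabulary, and the sample document are hypothetical stand-ins, not part of the notebook.

```python
def toy_document_features(document, vocabulary):
    """Map each vocabulary word to True/False depending on presence in the document."""
    document_words = set(document)  # set membership tests are fast
    return {"contains({})".format(word): (word in document_words)
            for word in vocabulary}


toy_vocabulary = ["good", "bad", "plot"]       # hypothetical stand-in for word_features
toy_document = ["a", "good", "plot", "twist"]  # hypothetical tokenised review
toy_features = toy_document_features(toy_document, toy_vocabulary)
print(toy_features)
```

Passing the vocabulary as a parameter, instead of reading a global, is a design choice made here only to keep the sketch self-contained.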

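Build the feature sets, hold out the first 100 documents as a test set, and train a Naive Bayes classifier on the rest.

In [None]:
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)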

Evaluate the classifier's accuracy on the test set.

In [None]:
print(nltk.classify.accuracy(classifier, test_set))
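Accuracy here is the fraction of test items whose predicted label matches the true label. The sketch below spells that out with a hand-written function and a stand-in classifier; `toy_accuracy`, `AlwaysPositive`, and the four-item test set are hypothetical and only mimic the `classify(features)` interface that NLTK classifiers expose.

```python
def toy_accuracy(classifier, labelled_featuresets):
    """Fraction of (features, label) pairs for which the prediction equals the label."""
    correct = sum(1 for features, label in labelled_featuresets
                  if classifier.classify(features) == label)
    return correct / len(labelled_featuresets)


class AlwaysPositive:
    """Hypothetical stand-in classifier that labels everything 'pos'."""

    def classify(self, features):
        return "pos"


# Three of the four toy items are 'pos', so the stand-in gets three right.
toy_test_set = [({}, "pos"), ({}, "neg"), ({}, "pos"), ({}, "pos")]
toy_score = toy_accuracy(AlwaysPositive(), toy_test_set)
print(toy_score)
```

A trivial classifier like this one is a useful baseline: on a balanced test set it would score about 0.5, which gives the trained classifier's accuracy something to be compared against.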

Show the five most informative features.

In [None]:
classifier.show_most_informative_features(5)