# Sentiment analysis of free-text comments using NLTK

2015-07-04 -- by [Harald Schilly](http://harald.schil.ly) -- License: Apache 2.0

The following NLTK demo classifies German free-text comments.
It tokenizes each text, cleans it up, stems the words and then trains a naive Bayes classifier.
At the end, a few tests show that it did indeed learn something.

In [1]:
import yaml
from codecs import open
import nltk

### Initialization of tokenizer and stemmer

Both are used with their default settings.

In [2]:
# NLTK tokenizer
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

# NLTK stemmer for German
from nltk.stem.snowball import GermanStemmer
stemmer_de = GermanStemmer()
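
To get a feel for what the tokenizer produces: `WordPunctTokenizer` splits a text into runs of word characters and runs of punctuation. The following is a minimal standard-library sketch of that behaviour, assuming the pattern `\w+|[^\w\s]+`; the helper name `word_punct_tokenize` is made up for illustration and is not part of NLTK.

```python
import re

# Assumed pattern: runs of word characters, or runs of non-space punctuation.
_WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Hypothetical stand-in for WordPunctTokenizer().tokenize(text)."""
    return _WORD_PUNCT.findall(text)

tokens = word_punct_tokenize("Wann sperrt ihr morgen auf?")
print(tokens)
```

Note that the question mark ends up as a separate one-character token, which matters for the length filter applied later.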

## Data Acquisition

For the demo, a map of categories to a list of texts is read from a data file.

In [3]:
# safe_load needs no explicit Loader and does not construct arbitrary objects
with open("reviews1.yaml", "r", "utf-8") as f:
    data = yaml.safe_load(f)
for cat, texts in data.items():
    print("%s: %d entries" % (cat, len(texts)))

negative: 16 entries
positive: 14 entries
question: 7 entries
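
The loop above only relies on the loaded object being a mapping from category name to a list of strings. A miniature of that shape, with invented placeholder comments (not taken from `reviews1.yaml`), could look like this:

```python
# Hypothetical miniature of the structure loaded from the data file:
# each category name maps to a list of free-text comments.
sample_data = {
    "negative": ["Das Essen war kalt."],
    "positive": ["Sehr freundliche Bedienung!", "Gerne wieder."],
    "question": ["Habt ihr am Sonntag offen?"],
}

lines = ["%s: %d entries" % (cat, len(texts)) for cat, texts in sample_data.items()]
print("\n".join(lines))
```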


## Processing texts in `data`

`data2` then maps each category to a list of tokenized and stemmed texts.

In [4]:
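# tokenizing, filtering and stemming applied to each text
def process_text(text):
    words = tokenizer.tokenize(text)
    # drop tokens shorter than 3 characters, then stem the rest
    words = [stemmer_de.stem(w) for w in words if len(w) >= 3]
    # map tokens containing '?' to one marker token; a lone '?' never
    # gets here because the length filter above already removed it
    words = [("<QM>" if '?' in w else w) for w in words]
    return words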

data2 = {}
for cat, texts in data.items():
    data2[cat] = []
    for text in texts:
        data2[cat].append(process_text(text))
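
The order of the steps in `process_text` has a visible effect on question marks. The sketch below replays the filter and the `<QM>` substitution on already tokenized input; `process_tokens` is a hypothetical helper, and `str.lower` is only a placeholder for the Snowball stemmer.

```python
def process_tokens(tokens, stem=str.lower):
    """Filter and <QM> steps of process_text on an already tokenized text.

    `stem` is a placeholder (lower-casing only), not the Snowball stemmer.
    """
    words = [stem(w) for w in tokens if len(w) >= 3]
    return ["<QM>" if "?" in w else w for w in words]

single = process_tokens(["Wann", "sperrt", "ihr", "morgen", "auf", "?"])
triple = process_tokens(["Wirklich", "???"])
print(single)
print(triple)
```

With a single trailing `?` the marker does not appear, because the one-character token is dropped before the substitution runs; only a run of three or more characters containing `?` survives the filter.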

### Feature extraction

In [5]:
all_words = []
for texts in data2.values():
    for text in texts:
        all_words.extend(text)

wordlist = nltk.FreqDist(all_words)
word_features = wordlist.keys()

In [6]:
# 10 most common words
wordlist.most_common(10)

[('und', 16),
 ('ein', 11),
 ('das', 9),
 ('bei', 6),
 ('nicht', 6),
 ('ist', 6),
 ('euch', 5),
 ('noch', 5),
 ('ich', 5),
 ('wied', 4)]

In [7]:
def extract_features(doc):
    doc_words = set(doc)
    features = {}
    for word in word_features:
        features["contains %s" % word] = (word in doc_words)
    return features
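
As an illustration of what such a feature dictionary looks like, here is a small self-contained sketch. The documents are invented, `collections.Counter` stands in for `nltk.FreqDist`, and the names `vocab_features` and `extract_features_sketch` are chosen so that they do not clash with the notebook's own variables.

```python
from collections import Counter

# Hypothetical, already processed documents.
docs = [["gut", "bedien"], ["schlecht", "bedien"]]

vocab = Counter(w for doc in docs for w in doc)   # stand-in for nltk.FreqDist
vocab_features = list(vocab)

def extract_features_sketch(doc):
    """One boolean feature per known word; unknown words are ignored."""
    doc_words = set(doc)
    return {"contains %s" % w: (w in doc_words) for w in vocab_features}

features = extract_features_sketch(["gut", "ess"])
print(features)
```

Every document gets the same set of keys, and a word that was never seen in the training texts (here `ess`) contributes nothing.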

## Training set & Bayes Classifier

NLTK's `NaiveBayesClassifier` is trained using the training set.

In [8]:
# helper: yield a (words, category) pair for every processed text
def get_all_docs():
    for cat, texts in data2.items():
        for words in texts:
            yield (words, cat)

In [9]:
training_set = nltk.classify.apply_features(extract_features, list(get_all_docs()))

**This is where the magic happens:**

In [10]:
classifier = nltk.NaiveBayesClassifier.train(training_set)
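
To take some of the magic out of it, the idea behind a naive Bayes classifier can be sketched in a few lines. This is a simplified illustration, not NLTK's implementation: it assumes boolean features, add-0.5 smoothing for the feature likelihoods and unsmoothed priors, and the functions `train_nb` and `classify_nb` as well as the toy training data are invented.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(labeled_featuresets, gamma=0.5):
    """Count labels and, per (label, feature), how often each value occurs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)   # (label, feature) -> Counter of values
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[label, name][value] += 1
    return label_counts, value_counts, gamma

def classify_nb(model, features):
    """Return the label with the highest log prior plus log likelihoods."""
    label_counts, value_counts, gamma = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = log(n / total)                                 # log prior
        for name, value in features.items():
            count = value_counts[label, name][value]
            score += log((count + gamma) / (n + 2 * gamma))    # 2 values: True/False
        scores[label] = score
    return max(scores, key=scores.get)

# Invented toy data in the same (featureset, label) form as the training set.
train = [
    ({"contains gut": True,  "contains schlecht": False}, "positive"),
    ({"contains gut": True,  "contains schlecht": False}, "positive"),
    ({"contains gut": False, "contains schlecht": True},  "negative"),
    ({"contains gut": False, "contains schlecht": True},  "negative"),
]
model = train_nb(train)
prediction = classify_nb(model, {"contains gut": True, "contains schlecht": False})
print(prediction)
```

The "naive" part is the sum over features: each feature contributes independently, given the label.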

The list of most informative features indicates whether the training worked well. Each row names a feature and the ratio of its likelihood under two labels.

In [11]:
classifier.show_most_informative_features(20)

Most Informative Features
           contains auch = True           questi : negati =      3.5 : 1.0
            contains ist = True           negati : positi =      2.6 : 1.0
           contains wied = True           positi : negati =      2.6 : 1.0
            contains der = True           positi : negati =      2.6 : 1.0
           contains mein = True           positi : negati =      2.6 : 1.0
           contains euch = True           positi : negati =      2.6 : 1.0
            contains ihr = True           questi : negati =      2.1 : 1.0
           contains bitt = True           questi : negati =      2.1 : 1.0
           contains viel = True           questi : negati =      2.1 : 1.0
            contains ich = True           negati : positi =      2.1 : 1.0
            contains und = True           positi : questi =      2.0 : 1.0
            contains imm = True           positi : negati =      1.9 : 1.0
           contains aber = True           positi : negati =      1.9 : 1.0
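
The ratios in this listing can be reproduced by hand. The sketch below assumes the classifier uses add-0.5 smoothing (expected likelihood estimation) with two possible values per feature; the class sizes 7 and 16 come from the data above, while the per-class document counts are assumed for illustration. `ele_prob` is a hypothetical helper.

```python
def ele_prob(count, n_docs, bins=2, gamma=0.5):
    """Smoothed estimate of P(feature value | label) with add-gamma smoothing."""
    return (count + gamma) / (n_docs + gamma * bins)

# Class sizes are taken from the data above; the per-class counts are assumed.
p_question = ele_prob(count=2, n_docs=7)    # feature True in 2 of 7 questions
p_negative = ele_prob(count=1, n_docs=16)   # feature True in 1 of 16 negatives

ratio = p_question / p_negative
print("%.1f : 1.0" % ratio)
```

Under these assumed counts the ratio comes out at about 3.5, the same order of magnitude as the top entries in the listing. With this few documents, a single additional occurrence changes a ratio noticeably.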

## Testing the classifier

In [12]:
t1 = "diese art von bedienung brauchen wir gar nicht."
classifier.classify(extract_features(process_text(t1)))

'negative'

In [13]:
t2 = "Hervorragende Bedienung, jederzeit gerne wieder!"
classifier.classify(extract_features(process_text(t2)))

'positive'

In [14]:
t3 = "Ganz schlechtes Service, kann ich nicht empfehlen ..."
classifier.classify(extract_features(process_text(t3)))

'negative'

In [15]:
t4 = "Wir kommen gerne jederzeit wieder."
classifier.classify(extract_features(process_text(t4)))

'positive'

In [16]:
t5 = "uns hat es sehr gut gefallen"
classifier.classify(extract_features(process_text(t5)))

'positive'

In [17]:
t6 = "Wann sperrt ihr morgen auf?"
classifier.classify(extract_features(process_text(t6)))

'question'

### All probabilities

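Lists the value for every label and each test text, which gives an impression of how well the classification worked. Note that `_prob_dict` is an internal attribute holding base-2 log probabilities, so applying `exp` understates each value and the percentages below do not sum to 100%; the ranking of the labels is unaffected. `probs.prob(label)` returns the normalized probability.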

In [18]:
def all_probabilities(text):
    from math import exp
    print(text)
    probs = classifier.prob_classify(extract_features(process_text(text)))
    for label, lprop in probs._prob_dict.items():
        print("%5.1f%% %s" % (100. * exp(lprop), label))

In [19]:
all_probabilities(t1)

diese art von bedienung brauchen wir gar nicht.
 66.2% negative
  9.4% positive
  1.5% question
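
The three percentages add up to about 77% rather than 100%. A plausible explanation, assuming the values stored by the probability distribution are base-2 logarithms while `exp` treats them as natural logarithms, can be checked with plain arithmetic on the numbers printed above:

```python
from math import log

# Percentages as printed for t1 above, i.e. 100 * exp(stored_value).
printed = {"negative": 66.2, "positive": 9.4, "question": 1.5}

# Undo exp() to recover the stored values, then read them as base-2 logs.
stored = {label: log(p / 100.0) for label, p in printed.items()}
as_base2 = {label: 2 ** value for label, value in stored.items()}

for label, p in as_base2.items():
    print("%5.1f%% %s" % (100.0 * p, label))
total = sum(as_base2.values())
```

Read as base-2 logarithms, the values sum to 100% within rounding, which supports the assumption. The ranking of the labels is the same either way.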


In [20]:
all_probabilities(t2)

Hervorragende Bedienung, jederzeit gerne wieder!
 12.9% negative
 64.7% positive
  0.3% question


In [21]:
all_probabilities(t3)

Ganz schlechtes Service, kann ich nicht empfehlen ...
 99.1% negative
  0.0% positive
  0.0% question


In [22]:
all_probabilities(t4)

Wir kommen gerne jederzeit wieder.
  0.0% negative
 99.4% positive
  0.0% question


In [23]:
all_probabilities(t5)

uns hat es sehr gut gefallen
  7.9% negative
 73.5% positive
  0.4% question


In [24]:
all_probabilities(t6)

Wann sperrt ihr morgen auf?
  0.1% negative
  2.9% positive
 86.9% question
