# Seminar Applied Text Mining
## Session 3: Classifying Documents
## Notebook 1: Bag-of-words model with 1-grams and Logit classifier

## Importing packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `numpy` is the fundamental package for scientific computing with Python.
- `itertools` provides functions for creating iterators for efficient looping through data structures.
- `json` reads and writes JSON files.
- `spacy` offers industrial-strength natural language processing.
- `sklearn` is the de facto standard machine learning package in Python.

In [1]:
import pandas as pd
import numpy as np
import itertools
import json
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

## Load documents

Load the corpus of 10,000 Airline Tweets from a JSON file and display the first tweet.

In [2]:
# Adjust this path to wherever AirlineTweets.json is stored on your machine.
with open('/Users/oliver/Dropbox/10 - Lehre/UPB/Applied Text Mining/Code and Datasets/AirlineTweets.json') as f:
    docs = json.load(f)
docs[0]


## Prepare documents

Perform standard NLP preparation steps with spaCy.

In [None]:
# Load the small English pipeline; NER and dependency parsing are not needed here.
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

for entry in docs:
    doc = nlp(entry['text'])
    # Keep the lemmas of alphabetic tokens that are not stop words.
    # Other token attributes are listed at https://spacy.io/api/token#attributes
    tokens_to_keep = [token.lemma_ for token in doc
                      if token.is_alpha and not token.is_stop]
    # Join the list of lemmas into a single space-separated string.
    entry['text_prep'] = " ".join(tokens_to_keep)

<br>
Transform results into a data frame and display the first couple of lines.

In [None]:
docs_df = pd.DataFrame(docs)
docs_df.head()

Unnamed: 0,airline,date,retweet_count,sentiment,text,text_prep,tweet_created,tweet_id
0,American,2015-02-23 05:08:53 -0800,0,positive,@AmericanAir thank you for doing the best you ...,thank good rebook agent phone amp addtl resolu...,2015-02-23,5.698464e+17
1,American,2015-02-22 20:27:10 -0800,0,positive,@AmericanAir wow that's helpful.,wow helpful,2015-02-22,5.697151e+17
2,United,2015-02-17 14:32:23 -0800,0,negative,@united so I wasted 40mins filling in 2 online...,-PRON- waste fill online form tell receive -PR...,2015-02-17,5.678138e+17
3,American,2015-02-24 06:43:15 -0800,0,negative,@AmericanAir my seat is disgusting. Old and di...,seat disgusting old dirty when go refurbish pl...,2015-02-24,5.702325e+17
4,US Airways,2015-02-22 17:26:18 -0800,0,negative,@USAirways ur specialist said they would talk ...,ur specialist say talk stewardess serve drunk ...,2015-02-22,5.696695e+17


<br>
Split corpus into training (80%) and test (20%) sets.

In [None]:
docs_df_train = docs_df.iloc[0:8000]
print(docs_df_train.shape)
docs_df_test = docs_df.iloc[8000:10000]
print(docs_df_test.shape)

(8000, 8)
(2000, 8)
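
The split above takes the first 8,000 rows in the order in which they are stored. As a minimal sketch (not part of the original notebook), the cut point can be derived from the corpus size instead of being hard-coded; the helper name `split_by_fraction` and the toy list are assumptions made for illustration.

```python
def split_by_fraction(items, train_fraction=0.8):
    """Split a sequence into a leading training part and a trailing test part."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]

# Toy stand-in for the corpus (hypothetical data).
toy_docs = ["doc %d" % i for i in range(10)]
train, test = split_by_fraction(toy_docs)
print(len(train), len(test))  # 8 2
```

Because the rows are not shuffled, such a split only gives a representative test set if the file order is unrelated to the sentiment labels.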


<br>
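Initialize a `CountVectorizer` object to turn the texts into a term-document matrix with term frequencies as cell values. With `ngram_range` set from 1 to 3, it counts single words as well as two- and three-word sequences.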

In [None]:
count_vect = CountVectorizer(min_df=2, ngram_range=(1, 3))

<br>
Apply the `CountVectorizer` object to the training set. Because of `min_df=2`, terms that appear in fewer than 2 documents are ignored.

In [None]:
X = count_vect.fit_transform(docs_df_train["text_prep"].tolist())
print(X.shape)

(8000, 11467)
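
To make the two `CountVectorizer` settings concrete, here is a rough pure-Python sketch of what an n-gram range of 1 to 3 and `min_df=2` do. It is an illustration only: the helper names and the three toy documents are made up, and whitespace splitting stands in for scikit-learn's own tokenizer.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def build_vocabulary(texts, min_df=2):
    """Keep n-grams that occur in at least min_df different documents."""
    doc_freq = Counter()
    for text in texts:
        # Using a set counts each n-gram at most once per document.
        doc_freq.update(set(ngrams(text.split())))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

# Hypothetical, already preprocessed documents.
toy_texts = ["flight delay hour", "flight delay bad", "thank great flight"]
vocabulary = build_vocabulary(toy_texts)
print(vocabulary)  # ['delay', 'flight', 'flight delay']
```

Every surviving n-gram becomes one column of the matrix, which is why the column count grows quickly when 2- and 3-grams are included.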


<br>
Display an extract of the term-document matrix.

In [None]:
X[90:95,90:95].todense()

matrix([[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]])

<br>
Turn frequency counts into tf-idf values.

In [None]:
tfidf_transformer = TfidfTransformer().fit(X)
X = tfidf_transformer.transform(X)
X[90:95,90:95].todense()

matrix([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])
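
With its default settings, `TfidfTransformer` uses a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, multiplies it by the raw count, and then scales each document vector to unit Euclidean length. The sketch below reproduces that computation in plain Python for a made-up 3x3 count matrix; the function name and the data are illustrative assumptions.

```python
import math

def tfidf(counts):
    """Smoothed tf-idf with L2 row normalization for a list-of-lists count matrix."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Hypothetical counts: rows are documents, columns are terms.
toy_counts = [[1, 1, 0],
              [1, 0, 2],
              [1, 0, 0]]
for row in tfidf(toy_counts):
    print([round(v, 3) for v in row])
```

In the first row both terms occur once, but the second term is rarer across documents and therefore receives the larger weight.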

<br>
Extract the labels that we want to predict from the training set.

In [None]:
Y = docs_df_train["sentiment"]
Y.head()

0    positive
1    positive
2    negative
3    negative
4    negative
Name: sentiment, dtype: object

## Train classifier on training set

Perform a logistic regression classification with the term-document matrix as the input variables (or features, or independent variables) and the sentiment classes as the target variable (or label, or dependent variable).

In [None]:
clf = LogisticRegression().fit(X, Y)
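
For a binary problem, the fitted model turns a document's feature vector into a probability by passing a weighted sum through the logistic function. The following sketch shows that calculation; the weights and the intercept are invented for illustration and are not the values learned above.

```python
import math

def predict_positive_proba(features, weights, intercept):
    """P(positive | features) = 1 / (1 + exp(-(w . x + b)))."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights for the terms "thank" and "delay", plus an intercept.
weights = [2.0, -1.5]
intercept = -0.5

p_thank = predict_positive_proba([1.0, 0.0], weights, intercept)  # "thank" only
p_delay = predict_positive_proba([0.0, 1.0], weights, intercept)  # "delay" only
print(p_thank, p_delay)
```

A positive weight pushes the probability above what the intercept alone would give, a negative weight pushes it below.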

<br>
Test whether the classifier works by predicting the sentiment of some fake tweets. We reuse the `count_vect` and `tfidf_transformer` objects to apply the same preprocessing steps to the new data (in a real use case, we would also preprocess the new documents with spaCy).

In [None]:
docs_new = ['I love Delta', 'I love aa']
X_new = count_vect.transform(docs_new)
X_new = tfidf_transformer.transform(X_new)
predicted = clf.predict(X_new)
print(predicted)

[u'positive' u'positive']


<br>
Instead of predicting binary labels, we can also predict the probability of a `positive` or `negative` label.

In [None]:
predicted_prob = clf.predict_proba(X_new)
print(clf.classes_)
print(predicted_prob)

[u'negative' u'positive']
[[0.26238916 0.73761084]
 [0.18022177 0.81977823]]
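
The two outputs are linked: for each row, `predict` returns the class with the largest probability, with the columns ordered as in `clf.classes_`. A small sketch using the numbers printed above (the variable names are chosen for illustration):

```python
# Class order and probabilities copied from the output above.
classes = ["negative", "positive"]
probabilities = [[0.26238916, 0.73761084],
                 [0.18022177, 0.81977823]]

# Pick, per row, the class whose probability is largest.
labels = [classes[row.index(max(row))] for row in probabilities]
print(labels)  # ['positive', 'positive']
```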


## Evaluate accuracy on test set

Instead of just testing on two fake tweets, we evaluate the predictive accuracy of our model on the test set. Again, we reuse the `count_vect` and `tfidf_transformer` objects.

In [None]:
X_test = count_vect.transform(docs_df_test["text_prep"])
X_test = tfidf_transformer.transform(X_test)
Y_test = docs_df_test["sentiment"]

<br>
Call the predict function of our model with the test data and calculate precision, recall and F1-score.

In [None]:
predicted = clf.predict(X_test)
print(metrics.classification_report(Y_test, predicted))

             precision    recall  f1-score   support

   negative       0.87      0.99      0.93      1561
   positive       0.92      0.49      0.64       439

avg / total       0.88      0.88      0.86      2000
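
As a sanity check on the report, the F1-score is the harmonic mean of precision and recall, and the bottom row is the support-weighted average of the per-class values. The sketch below recomputes both from the rounded figures in the table; the helper names are assumptions, and small deviations from the report are possible because the inputs are already rounded.

```python
def harmonic_f1(precision, recall):
    """F1-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def weighted_average(values, supports):
    """Average of per-class values, weighted by class support."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)

supports = [1561, 439]  # negative, positive (from the report)

print(round(harmonic_f1(0.87, 0.99), 2))                   # 0.93 (negative)
print(round(harmonic_f1(0.92, 0.49), 2))                   # 0.64 (positive)
print(round(weighted_average([0.87, 0.92], supports), 2))  # 0.88 (precision)
print(round(weighted_average([0.99, 0.49], supports), 2))  # 0.88 (recall)
```

The recall of 0.49 for `positive` means that roughly half of the positive tweets in the test set are labeled `negative` by the model, which the high overall averages partly hide because negative tweets dominate the test set.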



## Look at model coefficients

Logistic regression is typically not the most accurate classification model, but one big advantage is that it can be interpreted by looking at the coefficients of the input features.

In [None]:
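# For a binary model, clf.coef_[0] holds one coefficient per feature.
coeffs = clf.coef_[0].tolist()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0).
words = count_vect.get_feature_names_out()
words_with_coeffs = pd.DataFrame(coeffs, index=words, columns=["coeff"])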

Here are the coefficients of the first ten input features.

In [None]:
words_with_coeffs.head(10)

Unnamed: 0,coeff
aa,-0.291471
aa agent,-0.041178
aa choice,-0.104358
aa choice bother,-0.027953
aa dallas,0.250298
aa dallas only,0.250298
aa email,-0.024269
aa employee,-0.07293
aa employee rude,-0.038766
aa flight,0.22442


<br>
These are the terms with the most negative impact.

In [None]:
words_with_coeffs.sort_values("coeff", ascending=True).head(10)

Unnamed: 0,coeff
hour,-4.051477
delay,-3.549587
bad,-3.262633
hold,-3.024519
bag,-2.369616
cancel,-2.236620
say,-2.220612
tell,-2.216237
luggage,-2.184288
sit,-2.175515


<br>
And these are the terms with the most positive impact.

In [None]:
words_with_coeffs.sort_values("coeff", ascending=False).head(10)

Unnamed: 0,coeff
thank,11.050532
great,5.64405
awesome,4.682721
love,4.639319
good,3.870915
amazing,3.624084
thx,2.99976
appreciate,2.889685
excellent,2.253851
wonderful,2.221593
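
Because a logistic regression coefficient is a change in log-odds, `exp(coeff)` gives the factor by which the odds of the `positive` class change when the feature increases by one unit. Here is a rough sketch using three coefficients from the tables above. Note that the features in this notebook are tf-idf weights, so a one-unit increase is a stylized quantity rather than one extra occurrence of the term.

```python
import math

# Coefficients copied from the tables above.
coefficients = {"thank": 11.050532, "aa": -0.291471, "hour": -4.051477}

# Odds factor per term: values above 1 raise the odds of 'positive', values below 1 lower them.
odds_factors = {term: math.exp(coeff) for term, coeff in coefficients.items()}

for term, coeff in sorted(coefficients.items(), key=lambda kv: kv[1]):
    direction = "raises" if coeff > 0 else "lowers"
    print("%-6s coeff=%+.3f  %s the odds of 'positive' (factor %.3g)"
          % (term, coeff, direction, odds_factors[term]))
```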
