# Seminar Applied Text Mining
## Session 3: Classifying Documents
## Notebook 2: Tune classifier

## Importing packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `numpy` is the fundamental package for scientific computing with Python.
- `itertools` provides functions for creating iterators for efficient looping through data structures.
- `json` reads and writes JSON files.
- `spacy` offers industrial-strength natural language processing.
- `sklearn` (scikit-learn) is the de facto standard machine learning package in Python.

In [1]:
import pandas as pd
import numpy as np
import itertools
import json
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn import metrics



## Load documents

Load the corpus of 10,000 Airline Tweets from a JSON file and display the first tweet.

In [2]:
# Open the file in a context manager so it is closed after reading
with open('/Users/oliver/Dropbox/10 - Lehre/UPB/Applied Text Mining/Code and Datasets/AirlineTweets.json') as f:
    docs = json.load(f)
docs[0]

{u'airline': u'American',
 u'date': u'2015-02-23 05:08:53 -0800',
 u'retweet_count': 0,
 u'sentiment': u'positive',
 u'text': u'@AmericanAir thank you for doing the best you could to get me rebooked. Agent on phone &amp; addtl resolution on DM was very much appreciated.',
 u'tweet_created': u'2015-02-23',
 u'tweet_id': 5.6984635640934e+17}

## Prepare documents

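Perform standard NLP preparation steps with spaCy: tokenize each tweet, drop non-alphabetic tokens and stop words, and replace the remaining tokens with their lemmas.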

In [3]:
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

for entry in docs:
    text = nlp(entry[u'text'])
    tokens_to_keep = []
    for token in text:
        # Keep alphabetic, non-stop-word tokens; for the other token attributes
        # spaCy provides, see https://spacy.io/api/token#attributes1
        if token.is_alpha and not token.is_stop:
            tokens_to_keep.append(token.lemma_)
    # Join the kept lemmas into a single space-separated string
    entry[u'text_prep'] = " ".join(tokens_to_keep)
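
To see what the `is_alpha` and `is_stop` filter does to a single sentence, here is a minimal sketch. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) instead of `en_core_web_sm` so that it runs without a model download. A blank pipeline has no lemmatizer, so the sketch keeps the lowercased token text rather than `token.lemma_`. The sample sentence is made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline stands in for 'en_core_web_sm'.
# It only tokenizes and exposes lexical attributes such as is_alpha and is_stop.
nlp = spacy.blank("en")

# Hypothetical sample text, loosely modelled on the first tweet
sample = "Thank you for doing the best you could to get me rebooked in 2 hours!"
doc = nlp(sample)

# Same filter as in the notebook, but with lowercased text instead of lemmas
kept = [token.lower_ for token in doc if token.is_alpha and not token.is_stop]
dropped = [token.text for token in doc if not (token.is_alpha and not token.is_stop)]

print(kept)
print(dropped)
```

Numbers, punctuation, and common function words end up in `dropped`; content words such as "rebooked" survive in `kept`.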

Transform the results into a data frame and display the first five rows.

In [4]:
docs_df = pd.DataFrame(docs)
docs_df.head()

| | airline | date | retweet_count | sentiment | text | text_prep | tweet_created | tweet_id |
|---|---|---|---|---|---|---|---|---|
| 0 | American | 2015-02-23 05:08:53 -0800 | 0 | positive | @AmericanAir thank you for doing the best you ... | thank good rebook agent phone amp addtl resolu... | 2015-02-23 | 5.698464e+17 |
| 1 | American | 2015-02-22 20:27:10 -0800 | 0 | positive | @AmericanAir wow that's helpful. | wow helpful | 2015-02-22 | 5.697151e+17 |
| 2 | United | 2015-02-17 14:32:23 -0800 | 0 | negative | @united so I wasted 40mins filling in 2 online... | -PRON- waste fill online form tell receive -PR... | 2015-02-17 | 5.678138e+17 |
| 3 | American | 2015-02-24 06:43:15 -0800 | 0 | negative | @AmericanAir my seat is disgusting. Old and di... | seat disgusting old dirty when go refurbish pl... | 2015-02-24 | 5.702325e+17 |
| 4 | US Airways | 2015-02-22 17:26:18 -0800 | 0 | negative | @USAirways ur specialist said they would talk ... | ur specialist say talk stewardess serve drunk ... | 2015-02-22 | 5.696695e+17 |


Split the corpus by position into a training set (the first 8,000 tweets, 80%) and a test set (the remaining 2,000 tweets, 20%).

In [5]:
docs_df_train = docs_df.iloc[:8000]
print(docs_df_train.shape)
docs_df_test = docs_df.iloc[8000:10000]
print(docs_df_test.shape)

(8000, 8)
(2000, 8)
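
A positional split like the one above is only representative if the tweets are stored in no particular order. As a hedged alternative, the sketch below uses scikit-learn's `train_test_split` with shuffling and stratification on the label, so both sets keep the same class proportions. The toy data frame, its 80/20 class mix, and `random_state=42` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 40 negative and 10 positive documents
toy_df = pd.DataFrame({
    "text_prep": ["doc %d" % i for i in range(50)],
    "sentiment": ["negative"] * 40 + ["positive"] * 10,
})

# Shuffle, hold out 20% for testing, and preserve the class proportions
train_df, test_df = train_test_split(
    toy_df,
    test_size=0.2,
    stratify=toy_df["sentiment"],
    random_state=42,  # fixed seed so the split is reproducible
)

print(train_df.shape)
print(test_df.shape)
print(test_df["sentiment"].value_counts())
```

With the real corpus, `toy_df` would be replaced by `docs_df`.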


## Train and evaluate multiple classifiers with pipelines

### Logistic Regression

To make the vectorizer => transformer => classifier process easier to work with, scikit-learn provides the `Pipeline` class, which chains these steps into a single estimator with one `fit` and one `predict` method.

In [6]:
text_clf_01 = Pipeline([
    ('vect', CountVectorizer(min_df = 2)),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression())
])
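
As a rough illustration of what the pipeline saves us, the sketch below fits the same three steps once through a `Pipeline` and once by hand, then checks that both give identical predictions. The tiny corpus and its labels are invented for the example; `min_df` is left at its default because the toy vocabulary is too small for `min_df = 2` to be meaningful.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical, already preprocessed toy documents
train_texts = [
    "great flight thank", "thank helpful crew", "great crew",
    "delay bag lose", "lose bag rude", "delay rude crew",
]
train_labels = ["positive"] * 3 + ["negative"] * 3
new_texts = ["thank great crew", "rude delay"]

# Variant 1: one Pipeline object handles all three steps
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression()),
])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(new_texts)

# Variant 2: the same steps spelled out manually
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = LogisticRegression()
X_train = tfidf.fit_transform(vect.fit_transform(train_texts))
clf.fit(X_train, train_labels)
# New documents must only be transformed, never re-fitted
X_new = tfidf.transform(vect.transform(new_texts))
manual_pred = clf.predict(X_new)

print(pipe_pred)
print(np.array_equal(pipe_pred, manual_pred))
```

The manual variant makes the `fit_transform` versus `transform` distinction explicit; the pipeline applies it automatically.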

We can now preprocess the texts and train a classifier with a single command.

In [7]:
text_clf_01.fit(docs_df_train["text_prep"], docs_df_train["sentiment"])  

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=2,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        st...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

We can then classify all documents in the test set and evaluate the model's predictive performance (precision, recall, and F1 score per class).

In [8]:
Y_test = docs_df_test["sentiment"]
predicted = text_clf_01.predict(docs_df_test["text_prep"])
print(metrics.classification_report(Y_test, predicted))

             precision    recall  f1-score   support

   negative       0.88      0.99      0.93      1561
   positive       0.91      0.51      0.66       439

avg / total       0.89      0.88      0.87      2000
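
To make the columns of the report concrete, the following sketch computes precision, recall, and F1 for the positive class by hand and compares them with scikit-learn's own functions. The ten labels and predictions are made up for illustration and are unrelated to the tweet corpus.

```python
from sklearn import metrics

# Hypothetical gold labels and predictions
y_true = ["negative"] * 6 + ["positive"] * 4
y_pred = ["negative"] * 5 + ["positive"] + ["positive"] * 2 + ["negative"] * 2

pairs = list(zip(y_true, y_pred))
tp = sum(t == "positive" and p == "positive" for t, p in pairs)  # true positives
fp = sum(t == "negative" and p == "positive" for t, p in pairs)  # false positives
fn = sum(t == "positive" and p == "negative" for t, p in pairs)  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of both

print(round(precision, 2), round(recall, 2), round(f1, 2))

# Cross-check against scikit-learn
sk_precision = metrics.precision_score(y_true, y_pred, pos_label="positive")
sk_recall = metrics.recall_score(y_true, y_pred, pos_label="positive")
sk_f1 = metrics.f1_score(y_true, y_pred, pos_label="positive")
```

A low recall for one class, as for `positive` in the report above, means that many documents of that class are assigned to the other class.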



### Random Forest

In [13]:
text_clf_02 = Pipeline([
    ('vect', CountVectorizer(min_df = 2, ngram_range=(1, 3))),  # unigrams to trigrams
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier(n_estimators=500))
])
text_clf_02.fit(docs_df_train["text_prep"], docs_df_train["sentiment"])

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=2,
        ngram_range=(1, 3), preprocessor=None, stop_words=None,
        st...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

In [14]:
Y_test = docs_df_test["sentiment"]
predicted = text_clf_02.predict(docs_df_test["text_prep"])
print(metrics.classification_report(Y_test, predicted))

             precision    recall  f1-score   support

   negative       0.90      0.97      0.93      1561
   positive       0.85      0.60      0.70       439

avg / total       0.88      0.89      0.88      2000



### Support Vector Machine

In [15]:
text_clf_03 = Pipeline([
    ('vect', CountVectorizer(min_df = 2, ngram_range=(1, 3))),  # unigrams to trigrams
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC())
])
text_clf_03.fit(docs_df_train["text_prep"], docs_df_train["sentiment"])

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=2,
        ngram_range=(1, 3), preprocessor=None, stop_words=None,
        st...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [16]:
Y_test = docs_df_test["sentiment"]
predicted = text_clf_03.predict(docs_df_test["text_prep"])
print(metrics.classification_report(Y_test, predicted))

             precision    recall  f1-score   support

   negative       0.91      0.97      0.94      1561
   positive       0.86      0.66      0.74       439

avg / total       0.90      0.90      0.90      2000
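
The three sections above repeat the same fit, predict, and report steps with a different final estimator. One possible way to reduce the repetition is to loop over the classifiers, as in this sketch. The toy texts, the dictionary keys, the smaller forest (`n_estimators=50`), and the macro-averaged F1 as summary score are all assumptions made for the example.

```python
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical, already preprocessed toy documents
train_texts = [
    "great flight thank", "thank helpful crew", "great crew", "love flight",
    "delay bag lose", "lose bag rude", "delay rude crew", "cancel flight delay",
]
train_labels = ["positive"] * 4 + ["negative"] * 4
test_texts = ["thank great flight", "rude crew lose bag"]
test_labels = ["positive", "negative"]

# Candidate classifiers, keyed by a short name chosen for this sketch
classifiers = {
    "logreg": LogisticRegression(),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "svm": LinearSVC(),
}

scores = {}
for name, clf in classifiers.items():
    # Same feature extraction for every model; only the last step changes
    pipe = Pipeline([
        ("vect", CountVectorizer(ngram_range=(1, 3))),
        ("tfidf", TfidfTransformer()),
        ("clf", clf),
    ])
    pipe.fit(train_texts, train_labels)
    predicted = pipe.predict(test_texts)
    scores[name] = metrics.f1_score(test_labels, predicted, average="macro")

for name in sorted(scores):
    print(name, scores[name])
```

On a corpus this small the scores say nothing about the models; the point is only the structure of the loop.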



## More tuning

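For further tuning options, such as a grid search over the pipeline's parameters, see the scikit-learn tutorial on working with text data: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html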
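
As a starting point for such tuning, here is a minimal sketch of a grid search over pipeline parameters with `GridSearchCV`. Parameters are addressed as `<step name>__<parameter>`. The toy corpus, the candidate values in `param_grid`, the two-fold cross-validation, and the `f1_macro` scoring are assumptions chosen to keep the example small; they are not recommendations for the tweet corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical, already preprocessed toy documents
texts = [
    "great flight thank", "thank helpful crew", "great crew", "love flight",
    "delay bag lose", "lose bag rude", "delay rude crew", "cancel flight delay",
]
labels = ["positive"] * 4 + ["negative"] * 4

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LinearSVC()),
])

# 2 x 2 x 3 = 12 parameter combinations
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "clf__C": [0.1, 1.0, 10.0],
}

# Evaluate every combination with 2-fold cross-validation
search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search` behaves like the best pipeline refitted on all training data, so `search.predict(...)` can be used directly on the test set.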