I will apply text classification techniques to identify the authors of works of fiction. This dataset contains text from works of fiction written by spooky authors of the public domain: Edgar Allan Poe, HP Lovecraft and Mary Shelley. The objective is to accurately identify the author of the sentences in the test set.

The data was prepared by chunking larger texts into sentences using CoreNLP's MaxEnt sentence tokenizer, so you may notice the odd non-sentence here and there. The task is to predict the author (EAP, HPL or MWS) given the text. In simpler words, this is text classification with 3 classes.

In [1]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.svm import SVC
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from nltk import word_tokenize
from nltk.corpus import stopwords

Load data

I'll first load the data and then split it into train and test sets.

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/zariable/UW-MSIS541/master/assignments/hw1/data/data.csv')
data

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL
...,...,...,...
19574,id17718,"I could have fancied, while I looked at it, th...",EAP
19575,id08973,The lids clenched themselves together as if in...,EAP
19576,id05267,"Mais il faut agir that is to say, a Frenchman ...",EAP
19577,id17513,"For an item of news like this, it strikes us i...",EAP


I'll first preprocess the labels by converting them from strings into integers. In particular, I will use the LabelEncoder from scikit-learn to convert the text labels to the integers 0, 1 and 2.

In [3]:
# label encode the author names into 0, 1 and 2 for easy evaluation.
y = preprocessing.LabelEncoder().fit_transform(data.author.values)
y

array([0, 1, 0, ..., 0, 0, 1])
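The cell above discards the fitted encoder, so the mapping between author initials and integers is not visible. The sketch below, on a small made-up label list rather than the real `data.author` column, shows how keeping the encoder exposes that mapping and lets integer predictions be turned back into initials. The variable names are illustrative assumptions.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for data.author.values (illustrative only).
authors = ["EAP", "HPL", "EAP", "MWS"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(authors)

# classes_ holds the original labels in sorted order;
# the position of each label is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer codes back into author initials.
decoded = list(encoder.inverse_transform([2, 0]))
print(decoded)
```

Because the classes are sorted alphabetically, EAP, HPL and MWS map to 0, 1 and 2 respectively.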

Then, I will randomly split the original data into train and test sets, where the training data is used to train the text classifier and the test data is used to evaluate the model performance. I hold out 10% of the data for testing and stratify on the label so both sets keep the same author proportions. We can do this using train_test_split from the model_selection module of scikit-learn.

In [4]:
# split the data into train and test
x_train, x_test, y_train, y_test = train_test_split(
    data.text.values, y, 
    stratify=y, 
    random_state=42, 
    test_size=0.1, 
    shuffle=True
)

print(x_train.shape)
print(x_test.shape)

# print the first training example
print(x_train[0])

(17621,)
(1958,)
Her hair was the brightest living gold, and despite the poverty of her clothing, seemed to set a crown of distinction on her head.
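To illustrate what `stratify=y` does in the split above, here is a minimal sketch on synthetic, imbalanced labels (not the real author distribution). It checks that the class shares in the train and test parts match. The toy arrays and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 50% class 0, 30% class 1, 20% class 2.
y_toy = np.array([0] * 50 + [1] * 30 + [2] * 20)
x_toy = np.arange(len(y_toy))

_, _, y_tr, y_te = train_test_split(
    x_toy, y_toy,
    stratify=y_toy,
    test_size=0.1,
    random_state=42,
    shuffle=True,
)

# Share of each class in the two parts.
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share, test_share)
```

Without `stratify`, a 10% sample could by chance over- or under-represent one author, which would make the test accuracy harder to interpret.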


Task 1: Build a Naive Bayes Model

The first model uses TF-IDF (Term Frequency - Inverse Document Frequency) features followed by a Naive Bayes classifier.

I will also use Grid Search to find the best hyperparameters from the following settings:

- Different n-gram ranges
- Whether or not to remove the stop words
- Whether or not to apply IDF

After identifying the best model, I will use that model to make predictions on the test data and report its accuracy.


Here, I convert the raw text into a vector of counts before feeding it into an ML model. In sklearn, an API called CountVectorizer does the job for us.

In [5]:
# Extracting features from text files
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(
    stop_words = 'english',
    max_features = None,
    ngram_range = (1, 1)
)

X_train_counts = count_vect.fit_transform(x_train)
X_train_counts.shape

(17621, 23675)
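To make the count matrix more concrete, the following sketch runs CountVectorizer on two made-up sentences (not taken from the dataset) and also shows how `ngram_range=(1, 2)` enlarges the vocabulary with bigrams. Stop-word removal is left off here so the output is easy to follow.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (illustrative only).
toy_docs = ["the raven flew", "the raven spoke"]

# Unigrams only: one column per distinct word.
unigram_vect = CountVectorizer(ngram_range=(1, 1))
unigram_counts = unigram_vect.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index; sort terms by that index.
unigram_vocab = sorted(unigram_vect.vocabulary_, key=unigram_vect.vocabulary_.get)
print(unigram_vocab)
print(unigram_counts.toarray())

# Unigrams and bigrams: adds 'the raven', 'raven flew' and 'raven spoke'.
bigram_vect = CountVectorizer(ngram_range=(1, 2))
bigram_vect.fit(toy_docs)
print(len(bigram_vect.vocabulary_))
```

Each row of the matrix is one document and each column one term, which is why the real training matrix above has 17621 rows.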

TF-IDF features are derived from the raw count vectors. Sklearn has an API called TfidfTransformer which converts raw counts to the TF-IDF feature representation.

In [6]:
# TF-IDF
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer(
    norm = 'l2',
    use_idf = True,
    smooth_idf = True
)

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape


(17621, 23675)
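As a sanity check on what the transformer computes with `use_idf=True`, `smooth_idf=True` and `norm='l2'`, the sketch below recomputes TF-IDF by hand on a tiny made-up count matrix and compares it with TfidfTransformer. It assumes scikit-learn's smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 2 documents x 4 terms (illustrative only).
counts = np.array([[1, 1, 0, 1],
                   [0, 1, 1, 1]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF, then scale the counts and L2-normalise each row.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfTransformer(
    norm='l2', use_idf=True, smooth_idf=True
).fit_transform(csr_matrix(counts)).toarray()

print(np.allclose(manual, reference))
```

Terms that appear in every document get an IDF of exactly 1, so they are down-weighted relative to rarer terms but never removed.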

Train Naive Bayes Model

Given the TF-IDF features for the data, we are ready to train the Naive Bayes classifier.



In [7]:
# Training Naive Bayes (NB) classifier on training data.
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_train_tfidf, y_train)
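One detail worth spelling out when the steps are run by hand like this: new text must go through `transform`, not `fit_transform`, so that it is mapped onto the vocabulary and IDF weights learned from the training data. Here is a minimal sketch on a made-up three-sentence corpus; the documents and labels are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, one document per class (illustrative only).
train_docs = [
    "raven nevermore raven",
    "cthulhu dreams cthulhu",
    "monster creature monster",
]
train_labels = [0, 1, 2]
new_docs = ["raven nevermore"]

vect = CountVectorizer()
tfidf = TfidfTransformer()

# Fit the vocabulary and IDF weights on the training data only.
train_features = tfidf.fit_transform(vect.fit_transform(train_docs))
model = MultinomialNB().fit(train_features, train_labels)

# Reuse the fitted objects: transform (not fit_transform) on new text.
new_features = tfidf.transform(vect.transform(new_docs))
prediction = model.predict(new_features)
print(prediction)
```

Calling `fit_transform` on the new text instead would build a different vocabulary, and the resulting columns would no longer line up with what the classifier was trained on.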

Building a pipeline to connect all the components together. Another way to combine the feature generation and model training steps into one is to use the Pipeline API.

In [10]:
# Building a pipeline: we can write less code and do all of the above, as follows.
# The names 'vect', 'tfidf' and 'clf' are arbitrary but will be used later.
# I will be using the 'text_clf' going forward.
from sklearn.pipeline import Pipeline

text_clf = Pipeline(
    [
        ('vect', CountVectorizer()), 
        ('tfidf', TfidfTransformer()), 
        ('clf', MultinomialNB())
    ]
)

text_clf = text_clf.fit(x_train, y_train)

In [11]:
# Performance of NB Classifier
import numpy as np
predicted = text_clf.predict(x_test)
np.mean(predicted == y_test)

0.8161389172625128
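The expression `np.mean(predicted == y_test)` is the fraction of correct predictions. As a small sketch on made-up label arrays, the code below shows that it matches `accuracy_score` from scikit-learn, and that a confusion matrix adds detail on which classes get mixed up. The arrays are illustrative, not results from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy true and predicted labels (illustrative only).
y_true = np.array([0, 1, 2, 0, 1, 2])
y_pred = np.array([0, 1, 1, 0, 2, 2])

# Fraction of positions where the prediction equals the true label.
manual_accuracy = np.mean(y_pred == y_true)
library_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

On the real data, the same `confusion_matrix` call on `y_test` and `predicted` would show which pairs of authors are confused most often.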

In [12]:
# Grid search
# Here, I am creating a dictionary of parameters for which I would like to do performance tuning.
# Each parameter name starts with the pipeline step name, followed by a double underscore.
# E.g. vect__ngram_range: here I am telling it to try unigrams + bigrams, (1, 2), and bigrams + trigrams, (2, 3), and choose the optimal one.

from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 2), (2, 3)],'vect__stop_words' : ('english', None), 'tfidf__use_idf': (True, False)}
parameters

{'tfidf__use_idf': (True, False),
 'vect__ngram_range': [(1, 2), (2, 3)],
 'vect__stop_words': ('english', None)}
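To get a feel for how much work this grid implies, the sketch below expands the same dictionary with `ParameterGrid` and counts the candidate settings. The fold count of 2 is an assumption here, chosen to match the `cv=2` used in the search.

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    'vect__ngram_range': [(1, 2), (2, 3)],
    'vect__stop_words': ('english', None),
    'tfidf__use_idf': (True, False),
}

# Every combination of the listed values is one candidate setting.
grid = list(ParameterGrid(parameters))
n_candidates = len(grid)

cv_folds = 2  # assumed to match cv=2 in the grid search
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
for candidate in grid[:2]:
    print(candidate)
```

With the default `refit=True`, GridSearchCV additionally refits the best candidate once on the full training set after the cross-validation fits.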

In [13]:
# Next, I'll create an instance of the grid search by passing the classifier, the parameters,
# and n_jobs=1, which runs on a single core (n_jobs=-1 would use all available cores). cv=2 uses 2-fold cross-validation.
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=1, cv=2)
gs_clf = gs_clf.fit(x_train, y_train)

In [14]:
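# best_score_ is the best mean cross-validated score and best_params_ the matching settings.
# In a notebook cell only the last expression (best_params_) is displayed.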

gs_clf.best_score_
gs_clf.best_params_


{'tfidf__use_idf': True,
 'vect__ngram_range': (1, 2),
 'vect__stop_words': 'english'}

In [15]:
predicted = gs_clf.predict(x_test)
np.mean(predicted == y_test)

0.7819203268641471

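Here, the tuned Naive Bayes model gives a test accuracy of about 78%, which is lower than the 81.6% of the default pipeline above. Note that the grid did not include the default unigram setting, ngram_range=(1, 1).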
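When a tuned model scores differently than expected, it can help to look at every candidate rather than only the winner. The sketch below runs a small grid search on a made-up two-class corpus and tabulates `cv_results_` with pandas. The corpus, the reduced grid and all variable names are assumptions for illustration and do not reproduce the results above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus with two classes (illustrative only).
toy_docs = [
    "the raven spoke nevermore", "the raven sat above the door",
    "a raven tapped at the chamber door", "nevermore said the raven",
    "the monster fled across the ice", "the creature watched the cottage",
    "a monster was created by the creature", "the creature spoke to the monster",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
toy_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': [True, False],
}

toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=2).fit(toy_docs, toy_labels)

# One row per candidate: its settings, mean CV score and rank.
results = pd.DataFrame(toy_search.cv_results_)[
    ['params', 'mean_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))
```

Applied to `gs_clf`, the same table would show how close the candidates are to each other and whether the cross-validated ranking agrees with the test accuracy.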

Task 2: Build a Support Vector Machines (SVM) Model
Similar to the first task, I'll now build an SVM model for the same problem. I will also use Grid Search to find the best hyperparameters and report the accuracy of the best model on the test set.

In [17]:

# Training Support Vector Machines - SVM and calculating its performance

from sklearn.svm import LinearSVC

text_clf_svm = Pipeline(
    [
        ('vect', CountVectorizer()), 
        ('tfidf', TfidfTransformer()),
        ('clf-svm', LinearSVC(random_state=0))
    ]
)

text_clf_svm = text_clf_svm.fit(x_train, y_train)
predicted_svm = text_clf_svm.predict(x_test)
np.mean(predicted_svm == y_test)

0.8381001021450459

In [18]:
# Similarly doing grid search for SVM
from sklearn.model_selection import GridSearchCV
parameters_svm = {'vect__ngram_range': [(1, 2), (2,3)], 'vect__stop_words' : ('english', None), 'tfidf__use_idf': (True, False)}

gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=1, cv=2)
gs_clf_svm = gs_clf_svm.fit(x_train, y_train)

gs_clf_svm.best_score_
gs_clf_svm.best_params_

{'tfidf__use_idf': True, 'vect__ngram_range': (1, 2), 'vect__stop_words': None}
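The grid above only varies the feature extraction steps; the SVM itself keeps its default settings. As a sketch of how the classifier could be included, the code below lists the tunable parameter names of the 'clf-svm' step and builds a hypothetical extended grid. The `C` values are illustrative assumptions, not settings used or tuned in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

svm_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf-svm', LinearSVC(random_state=0)),
])

# Parameters of a step are addressed as '<step name>__<parameter name>'.
svm_param_names = [
    name for name in svm_pipeline.get_params() if name.startswith('clf-svm__')
]
print(svm_param_names)

# Hypothetical extended grid: the C values are illustrative only.
extended_grid = {
    'vect__ngram_range': [(1, 2), (2, 3)],
    'vect__stop_words': ('english', None),
    'tfidf__use_idf': (True, False),
    'clf-svm__C': [0.1, 1.0, 10.0],
}
```

Adding three values of `C` triples the number of candidates, so the search would take correspondingly longer.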

In [19]:
predicted = gs_clf_svm.predict(x_test)
np.mean(predicted == y_test)

0.8452502553626149

In the case of the SVM, the test accuracy is about 84.5%, which is better than the NB classifier. The best configuration uses IDF and does not remove stop words (stop_words=None).