### Comparing performance homogeneity of text classification algorithms using the McNemar statistic

This workflow uses the McNemar test to compare two classification algorithms, a naive Bayes classifier and a linear support vector machine, on a binary document classification task using a common corpus of music reviews. The objectives of this workflow, which are part of a larger project, are to:
- explore and test hypotheses about the relative performance advantages of text classification algorithms on common domains in different scenarios
- explore test statistics for comparing text classification algorithms in varied contexts
- explore scikit-learn NLP libraries

In [1]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

import matplotlib.pyplot as plt
import psycopg2
from statsmodels.stats.contingency_tables import mcnemar
from collections import Counter

#### I. Import, explore and initially pre-process data

In [2]:
# connect to postgres database containing pitchfork music reviews
conn = psycopg2.connect("dbname=pitchfork_reviews")
cur = conn.cursor()

# query database
cur.execute("""
SELECT genres.genre, content.reviewid, content.content 
FROM content
INNER JOIN genres on content.reviewid = genres.reviewid;
""")

# cast to dataframe
df = pd.DataFrame(cur.fetchall())
df.columns = [i[0] for i in cur.description]
df.head(5)

# drop ~20K rows that contain nulls in genre column
df = df.dropna(how='any')

# create new column that collapses 8 non-rock genres into a single 'not_rock' category
df_2 = df['genre'].replace(['electronic', 'experimental', 'folk/country', 'global', 'jazz',
        'metal', 'pop/r&b', 'rap'], 'not_rock')
df['genre_dichot'] = df_2
df['genre_dichot'].value_counts()

# separate datasets into data (review text) and labels (genres), respectively
data = df['content'].astype(str)
data.head(5)

df_genre = pd.DataFrame(df['genre_dichot'])

label_strings = df['genre_dichot'].astype(str)
label_strings[:5]

# converts label strings into numeric values, 0 and 1
label_encoder = LabelEncoder()
label_nums = label_encoder.fit_transform(label_strings)
label_nums.shape
np.vstack((label_nums[:10], label_nums[:10]))
label_encoder.classes_, len(label_encoder.classes_)

# partition data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data, label_nums, test_size=0.30, random_state=3)

# not_rock takes the value 0, and rock takes the value 1
label_strings.head(5), label_nums[:5]

(0    not_rock
 1    not_rock
 2        rock
 3        rock
 4    not_rock
 Name: genre_dichot, dtype: object, array([0, 0, 1, 1, 0]))

#### II. Tokenize, vectorize and normalize training data

In [3]:
# vectorizes the raw review text directly into TF-IDF-normalized vectors:
# tokenizes each review, builds a dictionary mapping tokens to feature indices, and counts tokens
# (no max_features is set, so the full training vocabulary is used rather than a fixed number of key words)
# normalizes the count-based vectors with term frequency-inverse document frequency (TF-IDF) to down-weight
# words that occur in many documents in the corpus and thus are less informative
tf_vect = TfidfVectorizer()
tf_vect.fit(X_train)
X_train_tf = tf_vect.transform(X_train)
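
As a rough illustration of what `TfidfVectorizer` computes with its default settings, the sketch below applies smoothed IDF weighting, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The toy corpus, the whitespace tokenizer, and the helper name `tfidf_rows` are assumptions for illustration; scikit-learn's own tokenizer and sparse output differ.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

toy_docs = ["loud guitar rock album", "quiet jazz album", "guitar solo"]
vocab, rows = tfidf_rows(toy_docs)
weights = dict(zip(vocab, rows[0]))
# 'album' occurs in two documents and 'loud' in one, so 'album' gets the lower weight
print(weights["album"] < weights["loud"])
```

Terms that appear in more documents receive a smaller weight within the same review, which is the penalty described in the comments above.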

#### III. Train models on training data

##### Train naive bayes classifier

In [4]:
# train naive Bayes classifier on training features (X_train_tf) and training targets (y_train)
# TODO (Lee) - alter the naming convention X_train_tf to reflect tfidf convention
model = MultinomialNB()
model.fit(X_train_tf, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

##### Train linear support vector machine (SVM)

In [5]:
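# defines a pipeline that tokenizes, vectorizes and classifies per the sklearn workflow
# (the pipeline is only defined here; the SVM below is fit on X_train_tf instead)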
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])
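
The pipeline above is defined but never fitted, and the SVM in the next cell is trained on the features produced by `tf_vect` instead. The two preprocessing routes should give the same features, since `CountVectorizer` followed by `TfidfTransformer` is equivalent to `TfidfVectorizer`. The sketch below checks this on a toy corpus, which is an assumption for illustration and not the review data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import Pipeline

toy_docs = ["loud guitar rock album", "quiet jazz album", "guitar solo"]  # illustrative corpus

# two-step route: raw counts, then TF-IDF normalization
two_step = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
features_two_step = two_step.fit_transform(toy_docs)

# one-step route, as used for X_train_tf
features_one_step = TfidfVectorizer().fit_transform(toy_docs)

print(np.allclose(features_two_step.toarray(), features_one_step.toarray()))
```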

In [6]:
model_svm = SGDClassifier(penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)

model_svm.fit(X_train_tf, y_train)

SGDClassifier(alpha=0.001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=5, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=42, shuffle=True,
       tol=None, verbose=0, warm_start=False)

#### IV. Predict genres of test set

##### Vectorize test set

In [7]:
# vectorize X_test set, similar to above for train set, EXCEPT calls transform, NOT fit_transform
X_test_tfidf = tf_vect.transform(X_test)

# inspect shapes
X_test.shape, X_test_tfidf.shape[0]

((6096,), 6096)

##### Predict genres of test set with naive bayes classifier

In [8]:
# uses model to predict vectorized test set
preds_nb = model.predict(X_test_tfidf)
probas_nb = model.predict_proba(X_test_tfidf)

In [9]:
# these are the actual and predicted categories, in terms of the binary music genre labels
y_test[:6], preds_nb[:6]

(array([0, 0, 0, 0, 1, 1]), array([0, 0, 0, 0, 1, 0]))

In [10]:
# these are the predicted probabilities that each test review belongs to class 0 (not_rock) and class 1 (rock)
probas_nb[:6]

array([[0.78572796, 0.21427204],
       [0.65446581, 0.34553419],
       [0.50263802, 0.49736198],
       [0.93077674, 0.06922326],
       [0.38772961, 0.61227039],
       [0.59624457, 0.40375543]])

In [11]:
np.vstack((y_test[:20], preds_nb[:20]))

array([[0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0]])

In [12]:
for doc, category in zip(X_test[:5], preds_nb[:5]):
    print('%r => %s' % (doc[:5], category))

'The L' => 0
'Can y' => 0
'As Ge' => 0
'Few e' => 0
'"We t' => 1


In [13]:
# these are the actual genres of the test set
Counter(y_test)

Counter({0: 3282, 1: 2814})

In [14]:
# these are the predictions of the naive bayes across the two categories
Counter(preds_nb)

Counter({0: 4150, 1: 1946})

In [15]:
# TODO (Lee) - this is not functioning: probas_nb is a 2-D array and Counter cannot hash its rows
# Counter(probas_nb)
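
`Counter(probas_nb)` fails with a `TypeError` because iterating over a 2-D array yields NumPy row arrays, which are not hashable. One possible workaround, sketched below on the six rows printed in cell 10, is to count the most probable class per row, or to bin the probability of class 1. The variable names and bin edges are illustrative assumptions.

```python
from collections import Counter
import numpy as np

# the six probability rows shown in the output of cell 10
probas = np.array([[0.78572796, 0.21427204],
                   [0.65446581, 0.34553419],
                   [0.50263802, 0.49736198],
                   [0.93077674, 0.06922326],
                   [0.38772961, 0.61227039],
                   [0.59624457, 0.40375543]])

# count the most probable class per row (converted to hashable Python ints)
top_class = probas.argmax(axis=1).tolist()
class_counts = Counter(top_class)

# bin P(class 1) into four equal-width intervals on [0, 1]
bin_counts, bin_edges = np.histogram(probas[:, 1], bins=[0.0, 0.25, 0.5, 0.75, 1.0])
print(class_counts, bin_counts)
```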

##### Predict genres of test set with SVM

In [16]:
preds_svm = model_svm.predict(X_test_tfidf)

# TODO (Lee) - probas not functioning: SGDClassifier with loss='hinge' does not provide predict_proba
# probas_svm = model_svm.predict_proba(X_test_tfidf)
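
`predict_proba` is not available on an `SGDClassifier` trained with hinge loss, which is why the commented-out call fails. One option, sketched here on synthetic data, is `decision_function`, which returns a signed distance to the separating hyperplane instead of a probability. The toy features and labels below are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
# synthetic two-class data standing in for the TF-IDF features
X_toy = np.vstack([rng.normal(-2, 1, size=(20, 3)), rng.normal(2, 1, size=(20, 3))])
y_toy = np.array([0] * 20 + [1] * 20)

svm_toy = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                        random_state=42, max_iter=5, tol=None)
svm_toy.fit(X_toy, y_toy)

# hinge loss exposes no probability estimates
print(hasattr(svm_toy, 'predict_proba'))

# signed margins: positive values map to class 1, non-positive to class 0
scores = svm_toy.decision_function(X_toy)
preds_from_scores = (scores > 0).astype(int)
```

If probabilities are needed rather than margins, scikit-learn documents `predict_proba` for `SGDClassifier` with `loss='modified_huber'`, and `CalibratedClassifierCV` can wrap a margin-based classifier. Either choice changes the model being compared, so it is noted here only as an option.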

In [17]:
preds_svm[:5]

array([0, 0, 1, 0, 1])

In [18]:
# probas_svm[:5]

In [19]:
np.vstack((y_test[:5], preds_svm[:5]))

array([[0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1]])

In [20]:
# these are the actual genres of the test set
Counter(y_test)

Counter({0: 3282, 1: 2814})

In [21]:
# these are the predictions of the SVM across the two categories
Counter(preds_svm)

Counter({0: 3286, 1: 2810})

In [22]:
# TODO (Lee) - this is not functioning: probas_svm is never created (see cell 16)
# Counter(probas_svm)

#### V. Evaluate and compare model performance on predictions

##### Evaluate mean accuracy of naive bayes classifier

In [23]:
# this is the accuracy of the naive Bayes classifier in predicting the genre of the test set
np.mean(preds_nb == y_test)

0.6981627296587927
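
Mean accuracy alone does not show how errors are split between the two classes. As a small sketch, the confusion matrix below is tabulated with NumPy from only the first 20 test labels and naive Bayes predictions printed in cell 11, so the numbers describe that sample and not the full test set.

```python
import numpy as np

# first 20 true labels and naive Bayes predictions, copied from the output of cell 11
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0])
y_pred = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0])

# rows = true class, columns = predicted class
confusion = np.zeros((2, 2), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)

accuracy = np.trace(confusion) / confusion.sum()
# fraction of each true class that was predicted correctly
recall_per_class = confusion.diagonal() / confusion.sum(axis=1)
print(confusion)
print(accuracy, recall_per_class)
```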

##### Evaluate mean accuracy of SVM

In [24]:
# this is the accuracy of the SVM in predicting the genre of the test set
np.mean(preds_svm == y_test)

0.7129265091863517

##### Compare performance of naive Bayes and SVM using McNemar test

In [25]:
#### Implement functions to produce data to be passed as input to McNemar
# McNemar test expects four values: 1) # obs when both models predict correctly, 2) # obs when both models predict 
# incorrectly, 3) # obs when nb predicts correctly & svm predicts incorrectly, 4) # obs when nb predicts incorrectly 
# & svm predicts correctly

# creates array of test labels, naive bayes predictions, and svm predictions
cont_table = np.vstack((y_test, preds_nb, preds_svm)).T
cont_table[:5]

array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 1],
       [0, 0, 0],
       [1, 1, 1]])

In [26]:
# el idx 0 = both correct, el idx 1 = nb correct, svm incorrect
# el idx 2 = svm correct, nb incorrect, el idx 3 = both incorrect
def process_row(row):
    """for each row (label, nb prediction, svm prediction), returns a one-hot array marking one of four categories"""
    if row[0] == row[1] and row[0] == row[2]:  # both correct
        result = [1,0,0,0]

    elif row[0] == row[1]:  # nb correct, svm incorrect
        result = [0,1,0,0]

    elif row[0] == row[2]:  # svm correct, nb incorrect
        result = [0,0,1,0]

    else:  # both incorrect
        result = [0,0,0,1]
    
    return np.array(result)

In [27]:
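def process_ndarray(array):
    """sums the per-row categories and arranges them as a 2x2 contingency table"""
    result = sum([process_row(row) for row in array])
    # layout: [[both correct, svm correct only], [nb correct only, both incorrect]]
    return np.array([[result[0], result[2]], [result[1], result[3]]])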

In [28]:
# calls function
contingency_table = process_ndarray(cont_table)
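
The same 2x2 table can also be built without a Python-level loop by comparing each prediction vector with the labels. The sketch below assumes the layout returned by `process_ndarray` (rows: SVM correct / incorrect; columns: naive Bayes correct / incorrect) and uses the five rows of `cont_table` printed above as toy input. `build_table` is a hypothetical helper name.

```python
import numpy as np

def build_table(y, preds_a, preds_b):
    """2x2 table of agreement between two classifiers (hypothetical helper).

    Layout: [[both correct, only b correct], [only a correct, both incorrect]].
    """
    a_ok = preds_a == y
    b_ok = preds_b == y
    return np.array([[np.sum(a_ok & b_ok), np.sum(~a_ok & b_ok)],
                     [np.sum(a_ok & ~b_ok), np.sum(~a_ok & ~b_ok)]])

# first five rows of cont_table: labels, naive Bayes predictions, SVM predictions
y_toy = np.array([0, 0, 0, 0, 1])
nb_toy = np.array([0, 0, 0, 0, 1])
svm_toy_preds = np.array([0, 0, 1, 0, 1])

toy_table = build_table(y_toy, nb_toy, svm_toy_preds)
print(toy_table)
```

Only the off-diagonal cells, where exactly one classifier is correct, enter the McNemar test.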

##### Calculate McNemar test

In [29]:
# calculate mcnemar test
result = mcnemar(contingency_table, exact=True)

In [30]:
# summarizes the finding, with McNemar statistic of 615 and p-value of 0.014
# (with exact=True, the statistic is the smaller of the two discordant counts)
print('statistic=%.3f, p-value=%.3f' % (result.statistic, result.pvalue))

# defines alpha level
alpha = 0.05

if result.pvalue > alpha:
    print('Same proportions of errors (fail to reject H0)')
else:
    print('Different proportions of errors (reject H0)')

statistic=615.000, p-value=0.014
Different proportions of errors (reject H0)
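
With `exact=True`, the p-value comes from a two-sided binomial test on the two discordant counts with success probability 0.5. The sketch below reproduces that calculation with the standard library only. The smaller count, 615, is taken from the output above. The larger count, 705, is not printed in the notebook; it is inferred here from the two accuracies (4346 vs. 4256 correct out of 6096, a difference of 90), so treat it as an assumption. `mcnemar_exact_pvalue` is a hypothetical helper name.

```python
from math import comb

def mcnemar_exact_pvalue(n_small, n_large):
    """Two-sided exact binomial p-value for two discordant counts (hypothetical helper)."""
    n = n_small + n_large
    k = min(n_small, n_large)
    # P(X <= k) for X ~ Binomial(n, 0.5), summed with exact integers
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # double the tail for a two-sided test, capped at 1
    return min(1.0, 2 * lower_tail)

p_value = mcnemar_exact_pvalue(615, 705)
print('p-value=%.3f' % p_value)
```

Under these assumed counts the result should agree with the p-value reported above to three decimal places.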
