## Introduction

This notebook documents a small NLP project on classifying news articles into 31 different categories, from Politics to Sport to Religion (I have taken just the top 10 of those categories for this project). The news headlines and short descriptions come from HuffPost, and the dataset was downloaded from: https://www.kaggle.com/rmisra/news-category-dataset/downloads/news-category-dataset.zip/2.

In the previous notebook, I applied general preprocessing techniques to clean up the data, then used basic BOW and TF-IDF features with Naive Bayes, SVM and Logistic Regression models to establish a baseline.

In this notebook, I will experiment with word embeddings to see if they help improve performance, especially for such short text.

In [52]:
import nltk
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
from nltk.tokenize import word_tokenize,RegexpTokenizer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


## Load Train/Test from Previous

In [2]:
with open('data/traintest.pickle', 'rb') as f:
     X_train, X_test, y_train, y_test = pickle.load(f)

## Word Embeddings
- Let's start with GloVe embeddings

In [8]:
import gensim.downloader as api
glove_model = api.load('glove-wiki-gigaword-50')



In [13]:
list(glove_model.vocab)

['the',
 ',',
 '.',
 'of',
 'to',
 'and',
 'in',
 'a',
 '"',
 "'s",
 'for',
 '-',
 'that',
 'on',
 'is',
 'was',
 'said',
 'with',
 'he',
 'as',
 'it',
 'by',
 'at',
 '(',
 ')',
 'from',
 'his',
 "''",
 '``',
 'an',
 'be',
 'has',
 'are',
 'have',
 'but',
 'were',
 'not',
 'this',
 'who',
 'they',
 'had',
 'i',
 'which',
 'will',
 'their',
 ':',
 'or',
 'its',
 'one',
 'after',
 'new',
 'been',
 'also',
 'we',
 'would',
 'two',
 'more',
 "'",
 'first',
 'about',
 'up',
 'when',
 'year',
 'there',
 'all',
 '--',
 'out',
 'she',
 'other',
 'people',
 "n't",
 'her',
 'percent',
 'than',
 'over',
 'into',
 'last',
 'some',
 'government',
 'time',
 '$',
 'you',
 'years',
 'if',
 'no',
 'world',
 'can',
 'three',
 'do',
 ';',
 'president',
 'only',
 'state',
 'million',
 'could',
 'us',
 'most',
 '_',
 'against',
 'u.s.',
 'so',
 'them',
 'what',
 'him',
 'united',
 'during',
 'before',
 'may',
 'since',
 'many',
 'while',
 'where',
 'states',
 'because',
 'now',
 'city',
 'made',
 'like',
 

In [19]:
# 400000 words in vocab with 50 dim vectors
glove_model.vectors.shape

(400000, 50)

In [20]:
# Build lookup tables between words, their row indices and their vectors
words_to_index = {}
index_to_words = {}
word_to_vec_map = {}
for i, word in enumerate(glove_model.vocab):
    words_to_index[word] = i
    index_to_words[i] = word
    word_to_vec_map[word] = glove_model.vectors[i]

Get a vector for each sentence by averaging the word vectors over that sentence (an alternative is Doc2Vec, but that may need longer sentences). Note that the word embeddings are not stemmed, and stop words like 'the' and 'a' are included in the vocabulary.

In [25]:
def text_preprocess(X):
    # Convert to Lowercase
    X = X.str.lower()

    tokenizer = RegexpTokenizer(r'\w+')
    # Tokenise words
    tokens = X.apply(tokenizer.tokenize).reset_index(drop=True)
    
    return tokens

In [27]:
tokens_train = text_preprocess(X_train)
tokens_test = text_preprocess(X_test)
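
As a quick illustration of what this preprocessing does to punctuation, here is a minimal sketch. It assumes that `re.findall(r"\w+", ...)` gives the same matches as `RegexpTokenizer(r'\w+').tokenize`, and `simple_tokenize` is a hypothetical stand-in for `text_preprocess` that works on a single string rather than a pandas Series.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: lowercase, then keep runs of word characters only
    # (assumed equivalent to RegexpTokenizer(r'\w+').tokenize)
    return re.findall(r"\w+", text.lower())

tokens = simple_tokenize("U.S. doesn't worry")
print(tokens)  # ['u', 's', 'doesn', 't', 'worry']
```

Splitting on `\w+` breaks "U.S." and "doesn't" into pieces, so they no longer correspond to entries such as `'u.s.'` and `"n't"` that appear in the GloVe vocabulary listing above.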

Map each token to a word vector

In [28]:
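def avg_sentence(sentence, word_to_vec_map):
    # Average the vectors of the tokens that exist in the GloVe vocab
    avg = np.zeros(50)
    num_words = 0

    for word in sentence:
        if word in word_to_vec_map:
            avg += word_to_vec_map[word]
            num_words += 1

    # No known tokens: return NaNs explicitly (these rows are dropped later)
    # instead of dividing by zero
    if num_words == 0:
        return np.full(50, np.nan)

    return avg / num_words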


In [29]:
def get_ave_vectors(tokens):
    vector_averages = np.zeros((len(tokens),50))
    for i in range(len(tokens)):
        vector_averages[i] = avg_sentence(tokens[i],word_to_vec_map)
    
    return vector_averages


In [30]:
vectors_train = get_ave_vectors(tokens_train) 
vectors_test = get_ave_vectors(tokens_test) 



In [31]:
vectors_train

array([[ 0.31745181,  0.27095886,  0.15774308, ..., -0.23168436,
        -0.05679811, -0.10682221],
       [ 0.2189454 ,  0.19574832,  0.06896296, ..., -0.09957325,
        -0.05050234, -0.00984774],
       [ 0.17610761,  0.29692621, -0.05693024, ..., -0.2227863 ,
        -0.04790437,  0.02110767],
       ...,
       [ 0.19472769,  0.17020141,  0.14255291, ..., -0.04178312,
         0.17055104,  0.13073211],
       [ 0.42998029, -0.20409162,  0.25457611, ...,  0.11828359,
        -0.43743682, -0.49634174],
       [ 0.2927075 ,  0.204627  , -0.02120767, ...,  0.17845107,
        -0.11459441,  0.26996   ]])
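
To see where all-NaN rows can come from, here is a toy sketch of the averaging step. The 3-dimensional vectors and the names `toy_vectors` and `toy_average` are made up for illustration and stand in for the 50-dimensional GloVe lookup.

```python
import numpy as np

# Hypothetical 3-dim embeddings standing in for the 50-dim GloVe vectors
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "news": np.array([3.0, 2.0, 0.0]),
}

def toy_average(tokens, vectors, dim=3):
    """Mean of the vectors of known tokens, or all NaN if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.full(dim, np.nan)
    return np.mean(known, axis=0)

sentences = [["good", "news"], ["good", "zzzz"], ["qwerty"]]
averages = np.vstack([toy_average(s, toy_vectors) for s in sentences])

# Checking the first column is enough: a row is either fully NaN or not at all
keep = ~np.isnan(averages[:, 0])
filtered = averages[keep]
print(filtered)
```

Unknown tokens are skipped, so the second sentence is represented by "good" alone, and the third sentence, which has no known tokens, is dropped.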

In [35]:
# Remove NA rows
def remove_NA_rows(vectors, y):
    # Concatenate the target with the word vector array, since it is easier to remove NaN rows from both at once
    ads = pd.DataFrame(vectors, columns=["V" + str(col) for col in range(50)])
    ads['category'] = y
    # Remove rows that are all NaN - this just means none of the sentence's words exist in the GloVe vocab
    ads = ads[~np.isnan(ads['V0'])]

    return ads


In [45]:
ads_train = remove_NA_rows(vectors_train,y_train.reset_index(drop=True))
ads_test = remove_NA_rows(vectors_test,y_test.reset_index(drop=True))

In [47]:
ads_train.head()

| | V0 | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V41 | V42 | V43 | V44 | V45 | V46 | V47 | V48 | V49 | category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.317452 | 0.270959 | 0.157743 | 0.127826 | 0.297763 | 0.045279 | -0.513147 | -0.333156 | -0.095779 | 0.021313 | ... | 0.157476 | 0.045669 | -0.158897 | -0.322931 | 0.368688 | -0.242698 | -0.231684 | -0.056798 | -0.106822 | ENTERTAINMENT |
| 1 | 0.218945 | 0.195748 | 0.068963 | -0.021564 | 0.32081 | 0.15386 | -0.432786 | -0.308079 | -0.128172 | 0.09139 | ... | 0.002778 | 0.010209 | 0.114806 | -0.051136 | 0.167961 | -0.05199 | -0.099573 | -0.050502 | -0.009848 | STYLE & BEAUTY |
| 2 | 0.176108 | 0.296926 | -0.05693 | -0.157755 | 0.3193 | 0.008696 | -0.353465 | -0.058974 | -0.128485 | -0.109559 | ... | -0.078528 | -0.011632 | 0.168951 | -0.079328 | 0.168598 | 0.224817 | -0.222786 | -0.047904 | 0.021108 | WELLNESS |
| 3 | 0.334793 | 0.502029 | 0.094673 | 0.255322 | 0.254575 | 0.015837 | 0.02635 | 0.115594 | -0.013418 | -0.082745 | ... | -0.079285 | 0.122779 | -0.270327 | 0.107331 | 0.301835 | -0.105233 | 0.599083 | -0.086085 | -0.273655 | WORLD NEWS |
| 4 | 0.23529 | 0.41421 | -0.217002 | 0.109667 | 0.383724 | 0.131862 | -0.328546 | -0.308277 | -0.183282 | 0.113998 | ... | 0.264618 | -0.057838 | 0.275382 | 0.041859 | -0.189463 | -0.156943 | -0.393461 | 0.084659 | 0.106194 | TRAVEL |


In [49]:
# Fit a classifier, then return its test accuracy and its test predictions
def build_model(classifier, X,y, X_test, y_test):
    classifier.fit(X, y)
    y_pred = classifier.predict(X_test)
    return accuracy_score(y_test, y_pred), y_pred

In [54]:
# Features are the first 50 columns; the target is the 'category' column
X_tr, y_tr = ads_train.iloc[:, :50], ads_train['category']
X_te, y_te = ads_test.iloc[:, :50], ads_test['category']

acc_svm, pred_svm = build_model(LinearSVC(), X_tr, y_tr, X_te, y_te)
print(acc_svm)

# Naive Bayes (MultinomialNB) does not work with negative values. Even after scaling it doesn't work

acc_logistic, pred_log = build_model(LogisticRegression(), X_tr, y_tr, X_te, y_te)
print(acc_logistic)



0.6755733944954129
0.6789755351681958


It looks like the averaged word embeddings are still not as good as the original TF-IDF model (although I should redo the comparison with the train and test sets aligned, to make it fairer).
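
On the Naive Bayes point noted in the cell above: `MultinomialNB` raises an error when any feature is negative, and standardising the features still leaves negative values. The sketch below shows two possible workarounds on synthetic data. The toy arrays `X_toy` and `y_toy` are made up, the kind of scaling originally tried is not shown in this notebook, and nothing here says how either option would score on the real embeddings.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for averaged embeddings: 2 classes, 50 dims, values on both sides of zero
X_toy = np.vstack([rng.normal(-0.5, 1.0, (40, 50)),
                   rng.normal(0.5, 1.0, (40, 50))])
y_toy = np.array(["A"] * 40 + ["B"] * 40)

# MultinomialNB rejects negative feature values when fitting
try:
    MultinomialNB().fit(X_toy, y_toy)
    raised = False
except ValueError:
    raised = True

# Option 1 (assumption): rescale every feature to [0, 1] before MultinomialNB
nb_scaled = make_pipeline(MinMaxScaler(), MultinomialNB()).fit(X_toy, y_toy)

# Option 2 (assumption): GaussianNB, which models real-valued features directly
nb_gauss = GaussianNB().fit(X_toy, y_toy)
```

`GaussianNB` is arguably the more natural fit for dense real-valued vectors, since `MultinomialNB` is designed for count-like features.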

In [55]:
pred_svm[:10]

array(['POLITICS', 'TRAVEL', 'FOOD & DRINK', 'WORLD NEWS',
       'STYLE & BEAUTY', 'FOOD & DRINK', 'ENTERTAINMENT', 'POLITICS',
       'POLITICS', 'ENTERTAINMENT'], dtype=object)

In [56]:
ads_test['category'][:10]

0          POLITICS
1            TRAVEL
2      FOOD & DRINK
3        WORLD NEWS
4    STYLE & BEAUTY
5      FOOD & DRINK
6      QUEER VOICES
7          POLITICS
8    HEALTHY LIVING
9     ENTERTAINMENT
Name: category, dtype: object
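
The two outputs above can also be compared programmatically. This is a small sketch that copies the ten predicted and ten true labels shown above into plain lists; `y_pred_10` and `y_true_10` are illustrative names, and ten rows are far too few to say anything about overall performance.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# First ten SVM predictions and true labels, copied from the outputs above
y_pred_10 = ['POLITICS', 'TRAVEL', 'FOOD & DRINK', 'WORLD NEWS',
             'STYLE & BEAUTY', 'FOOD & DRINK', 'ENTERTAINMENT', 'POLITICS',
             'POLITICS', 'ENTERTAINMENT']
y_true_10 = ['POLITICS', 'TRAVEL', 'FOOD & DRINK', 'WORLD NEWS',
             'STYLE & BEAUTY', 'FOOD & DRINK', 'QUEER VOICES', 'POLITICS',
             'HEALTHY LIVING', 'ENTERTAINMENT']

acc_10 = accuracy_score(y_true_10, y_pred_10)

# (position, true label, predicted label) for every miss
mismatches = [(i, t, p)
              for i, (t, p) in enumerate(zip(y_true_10, y_pred_10))
              if t != p]

# Rows are true labels, columns are predicted labels, both in sorted order
labels = sorted(set(y_true_10) | set(y_pred_10))
cm = confusion_matrix(y_true_10, y_pred_10, labels=labels)

print(acc_10)
print(mismatches)
```

In this sample the two misses are QUEER VOICES predicted as ENTERTAINMENT and HEALTHY LIVING predicted as POLITICS. Running `classification_report` on the full test set would give the per-category picture.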

## Try Doc2Vec?
- A way to get an embedding for the whole sentence, rather than aggregating the word vectors by averaging
- However, Doc2Vec is usually not so good for short sentences

## Try FastText Next?
- Uses character n-grams, so it is robust to typos and to words that haven't been seen before
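
To make the character n-gram idea concrete, here is a minimal pure-Python sketch of how a word can be split into n-grams with `<` and `>` boundary markers, in the style described for FastText. `char_ngrams` is a hypothetical helper, not a FastText API, and the default range of 3 to 6 characters is an assumption about typical settings.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams

trigrams = char_ngrams("where", 3, 3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']

# A misspelling still shares many n-grams with the correct word
shared = set(char_ngrams("politics")) & set(char_ngrams("politcs"))
print(sorted(shared))
```

Because a word vector is built from its n-gram vectors, a typo such as "politcs" still overlaps with "politics", whereas a plain GloVe lookup would treat it as out of vocabulary.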