# Approaching (Almost) Any NLP Problem on Kaggle

In this post I'll talk about approaching natural language processing problems on Kaggle. As an example, we will use the data from the Spooky Author Identification competition. We will first build a very basic model and then improve it with additional features. We will also see how deep neural networks can be used, and end the post with some ideas about ensembling in general.

This covers:

* tfidf
* count features
* logistic regression
* naive bayes
* svm
* xgboost
* grid search
* word vectors
* LSTM
* GRU
* Ensembling

NOTE: This notebook is not meant for achieving a very high score on the Leaderboard for this dataset. However, if you follow it properly, you can get a very high score with some tuning. ;)

So, without wasting any time, let's start by importing the Python modules that I'll be using.


In [1]:
import pandas as pd
import numpy as np
import xgboost as xgb
from tqdm import tqdm
from sklearn.svm import SVC
from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping
from nltk import word_tokenize
from nltk.corpus import stopwords
stop_words = stopwords.words('english')


Using TensorFlow backend.


In [2]:
train = pd.read_csv("spooky-author-identification/train/train.csv")
test = pd.read_csv("spooky-author-identification/test/test.csv")
sample = pd.read_csv("spooky-author-identification/sample_submission/sample_submission.csv")

In [3]:
train.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [4]:
test.head()

Unnamed: 0,id,text
0,id02310,"Still, as I urged our leaving Ireland with suc..."
1,id24541,"If a fire wanted fanning, it could readily be ..."
2,id00134,And when they had broken down the frail door t...
3,id27757,While I was thinking how I should possibly man...
4,id04081,I am not sure to what limit his knowledge may ...


In [5]:
sample.head()

Unnamed: 0,id,EAP,HPL,MWS
0,id02310,0.403494,0.287808,0.308698
1,id24541,0.403494,0.287808,0.308698
2,id00134,0.403494,0.287808,0.308698
3,id27757,0.403494,0.287808,0.308698
4,id04081,0.403494,0.287808,0.308698


In [6]:
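def multiclass_logloss(actual, predicted, eps=1e-15):
    """Multi-class version of the log loss metric.

    :param actual: array of true class labels (or a one-hot matrix)
    :param predicted: matrix of predicted probabilities, one column per class
    """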
    if len(actual.shape)==1:
        actual2 = np.zeros((actual.shape[0],predicted.shape[1]))
        for i,val in enumerate(actual):
            actual2[i,val]=1
        actual = actual2
        
    clip = np.clip(predicted, eps, 1-eps)
    rows = actual.shape[0]
    vsota = np.sum(actual*np.log(clip))
    return -1.0/rows*vsota
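
As a quick sanity check of the metric, the sketch below re-declares the function from the cell above and evaluates it on made-up predictions for three samples. The toy arrays are assumptions for illustration only. A uniform guess over three classes should score ln(3), roughly 1.099, and confident correct predictions should score lower.

```python
import numpy as np


def multiclass_logloss(actual, predicted, eps=1e-15):
    # convert integer labels to a one-hot matrix if needed
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    # clip probabilities so that log(0) never occurs
    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota


# toy labels and predictions (hypothetical, not competition data)
y_true = np.array([0, 1, 2])
uniform = np.full((3, 3), 1.0 / 3.0)
confident = np.array([[0.90, 0.05, 0.05],
                      [0.05, 0.90, 0.05],
                      [0.05, 0.05, 0.90]])

loss_uniform = multiclass_logloss(y_true, uniform)
loss_confident = multiclass_logloss(y_true, confident)
print("uniform: %0.3f, confident: %0.3f" % (loss_uniform, loss_confident))
```

This gives a useful reference point: any model whose validation log loss is close to 1.099 is doing no better than guessing all three authors with equal probability.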


In [7]:
lbl_enc = preprocessing.LabelEncoder()
y=lbl_enc.fit_transform(train.author.values)

In [8]:
xtrain,xvalid,ytrain,yvalid = train_test_split(train.text.values,y,
                                              stratify=y,random_state=42,
                                              test_size=0.1,shuffle=True)

In [9]:
print(xtrain.shape)
print(xvalid.shape)

(17621,)
(1958,)
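
The split above passes `stratify=y`, which keeps the author proportions the same in the training and validation parts. A minimal sketch of that behaviour on made-up labels (the `toy_` arrays are assumptions, not the competition data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical imbalanced toy data: 60% / 30% / 10% class shares
toy_texts = np.array(["sentence %d" % i for i in range(100)])
toy_labels = np.array([0] * 60 + [1] * 30 + [2] * 10)

_, toy_valid_x, _, toy_valid_y = train_test_split(
    toy_texts, toy_labels, stratify=toy_labels,
    random_state=42, test_size=0.1, shuffle=True)

# class counts in the 10-sample validation part
valid_counts = np.bincount(toy_valid_y)
print(valid_counts)
```

The class shares in the validation part should mirror those of the full toy set, which matters for a metric like log loss that is sensitive to the class prior.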


## Building Basic Models
Let's start building our very first model.

Our very first model is a simple __TF-IDF (Term Frequency - Inverse Document Frequency)__ followed by a simple Logistic Regression.

In [10]:
tfv = TfidfVectorizer(min_df=3,max_features =None,
                     strip_accents='unicode',analyzer='word',token_pattern=r'\w{1,}', 
                     ngram_range=(1,3),use_idf=1,smooth_idf=1,sublinear_tf=1,stop_words='english')
                     
tfv.fit(list(xtrain)+list(xvalid))
xtrain_tfv=tfv.transform(xtrain)
xvalid_tfv=tfv.transform(xvalid)
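
To make the TF-IDF step more concrete, here is a small sketch on three made-up sentences (the corpus and the vectorizer settings are assumptions chosen for brevity, not the ones used above). It shows that a word occurring in every document gets a lower inverse document frequency than a rare one, and that each row is L2-normalized by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus
docs = ["the raven sat on the bust",
        "the monster fled across the ice",
        "the raven spoke"]

vec = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X = vec.fit_transform(docs)

vocab = vec.vocabulary_
idf_the = vec.idf_[vocab["the"]]          # appears in all three documents
idf_monster = vec.idf_[vocab["monster"]]  # appears in only one document
row_norms = np.linalg.norm(X.toarray(), axis=1)

print(X.shape)
print("idf(the)=%0.3f idf(monster)=%0.3f" % (idf_the, idf_monster))
```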

In [11]:
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_tfv,ytrain)
predictions =clf.predict_proba(xvalid_tfv)

print('logloss :%0.3f' %multiclass_logloss(yvalid,predictions))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


logloss :0.572


In [12]:
ctv = CountVectorizer(analyzer ='word', token_pattern=r'\w{1,}',
                     ngram_range=(1,3),stop_words='english')

ctv.fit(list(xtrain)+list(xvalid))
xtrain_ctv = ctv.transform(xtrain)
xvalid_ctv = ctv.transform(xvalid)

In [13]:
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_ctv,ytrain)
predictions = clf.predict_proba(xvalid_ctv)

print('logloss : %0.3f'%multiclass_logloss(yvalid,predictions) )

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


logloss : 0.527


In [14]:
clf = MultinomialNB()
clf.fit(xtrain_tfv,ytrain)
predictions = clf.predict_proba(xvalid_tfv)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.578 


In [15]:
clf = MultinomialNB()
clf.fit(xtrain_ctv, ytrain)
predictions = clf.predict_proba(xvalid_ctv)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.485 


In [16]:
svd = decomposition.TruncatedSVD(n_components=120)
svd.fit(xtrain_tfv)
xtrain_svd = svd.transform(xtrain_tfv)
xvalid_svd = svd.transform(xvalid_tfv)

scl = preprocessing.StandardScaler()
scl.fit(xtrain_svd)
xtrain_svd_scl = scl.transform(xtrain_svd)
xvalid_svd_scl = scl.transform(xvalid_svd)

In [17]:
clf = SVC(C=1.0,probability=True)
clf.fit(xtrain_svd_scl,ytrain)
predictions = clf.predict_proba(xvalid_svd_scl)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.735 


In [18]:
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200,colsample_bytree=0.8,
                       subsample=0.8,nthread=10,learning_rate=0.1)
clf.fit(xtrain_tfv.tocsc(),ytrain)
predictions = clf.predict_proba(xvalid_tfv.tocsc())



In [19]:
print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss: 0.782 


In [None]:
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_ctv.tocsc(), ytrain)
predictions = clf.predict_proba(xvalid_ctv.tocsc())

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

In [None]:
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
clf.fit(xtrain_svd, ytrain)
predictions = clf.predict_proba(xvalid_svd)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

In [None]:
clf = xgb.XGBClassifier(nthread=10)
clf.fit(xtrain_svd, ytrain)
predictions = clf.predict_proba(xvalid_svd)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

### Grid Search

In [None]:
mll_scorer = metrics.make_scorer(multiclass_logloss,greater_is_better=False,needs_proba=True)

In [None]:
svd = TruncatedSVD()

scl = preprocessing.StandardScaler()

lr_model = LogisticRegression()

clf = pipeline.Pipeline([('svd',svd),
                        ('scl',scl),
                        ('lr',lr_model)])

In [None]:
param_grid = {'svd__n_components' :[120,180],
             'lr__C':[0.1,1.0,10],
             'lr__penalty':['l1','l2']}

In [None]:
model = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=mll_scorer,
                     verbose=10, n_jobs=-1, refit=True, cv=2)

model.fit(xtrain_tfv,ytrain)
print('Best score : %0.3f' %model.best_score_)
print('Best parameters set : ')
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print('\t%s :%r'%(param_name,best_parameters[param_name]))
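
Depending on the installed scikit-learn version, the custom scorer may not be needed: the built-in `"neg_log_loss"` scoring string computes the same quantity with the sign flipped. The sketch below runs the same kind of pipeline on synthetic data from `make_classification` (the data, grid values and `max_iter` are assumptions chosen so that it finishes quickly).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic 3-class problem standing in for the TF-IDF matrix
X, y = make_classification(n_samples=300, n_features=40, n_informative=10,
                           n_classes=3, random_state=42)

pipe = Pipeline([("svd", TruncatedSVD(random_state=42)),
                 ("scl", StandardScaler()),
                 ("lr", LogisticRegression(max_iter=1000))])

param_grid = {"svd__n_components": [5, 10],
              "lr__C": [0.1, 1.0, 10]}

# "neg_log_loss" is negated log loss, so higher (closer to 0) is better
search = GridSearchCV(pipe, param_grid, scoring="neg_log_loss", cv=2)
search.fit(X, y)

print("Best score: %0.3f" % search.best_score_)
print("Best parameters:", search.best_params_)
```

Because the score is negated, `best_score_` is reported as a negative number; multiply by -1 to compare it with the log loss values printed earlier.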

In [None]:
nb_model = MultinomialNB()

clf = pipeline.Pipeline([('nb',nb_model)])

param_grid = {'nb__alpha':[0.001,0.01,0.1,1,10,100]}

model = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=mll_scorer,
                     verbose=10, n_jobs=-1, refit=True, cv=2)

model.fit(xtrain_tfv,ytrain)
print("Best score: %0.3f" % model.best_score_)
print("Best parameters set:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

### Word Vectors

Without going into too much detail, I will explain how to create sentence vectors and how we can use them to build a machine learning model. I am a fan of GloVe vectors, word2vec and fastText. In this post, I'll be using the GloVe vectors. You can download them from here: http://www-nlp.stanford.edu/data/glove.840B.300d.zip

In [None]:
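# load the GloVe vectors into a dictionary: word -> 300-d float32 vector
embeddings_index = {}

with open('glove.840B.300d/glove.840B.300d.txt', encoding='utf8') as f:
    for line in tqdm(f):
        values = line.rstrip().split(' ')
        # the last 300 fields are the vector; a token may itself contain
        # spaces, so everything before those fields is treated as the word
        word = ' '.join(values[:-300])
        coefs = np.asarray(values[-300:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))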
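
For readers who have not looked inside a GloVe text file: each line holds a token followed by its vector components, separated by single spaces. A minimal sketch of the parsing on two made-up lines (the words, the values and the 3-dimensional size are assumptions; the real file has 300 components per line):

```python
import io

import numpy as np

# hypothetical file content in GloVe's "word v1 v2 ... vn" layout
sample = io.StringIO("raven 0.1 0.2 0.3\nice -0.5 0.0 0.25\n")
dim = 3

toy_index = {}
for line in sample:
    parts = line.rstrip().split(" ")
    word = " ".join(parts[:-dim])  # everything before the vector
    toy_index[word] = np.asarray(parts[-dim:], dtype="float32")

print(sorted(toy_index))
print(toy_index["raven"].dtype, toy_index["raven"].shape)
```

Converting to `float32` at load time matters: without a numeric dtype the components stay strings and cannot be summed later.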

In [None]:
# create a normalized vector for the whole sentence
def sent2vec(s):
    words = str(s).lower()
    words = word_tokenize(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(embeddings_index[w])
        except KeyError:
            # skip words that have no GloVe vector
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    if type(v) != np.ndarray:
        return np.zeros(300)
    return v / np.sqrt((v ** 2).sum())
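
The idea behind the sentence vector is simply "sum the word vectors, then scale to unit length". The sketch below shows it on a hypothetical three-dimensional embedding table, with whitespace tokenization instead of NLTK so that it runs without any downloads; `toy_embeddings` and `toy_sent2vec` are illustrative names, not part of the notebook.

```python
import numpy as np

# hypothetical 3-d vectors, not real GloVe values
toy_embeddings = {
    "raven": np.array([1.0, 0.0, 0.0], dtype="float32"),
    "midnight": np.array([0.0, 2.0, 0.0], dtype="float32"),
    "dreary": np.array([0.0, 0.0, 2.0], dtype="float32"),
}


def toy_sent2vec(sentence, embeddings, dim=3):
    # lowercase, split on whitespace, keep alphabetic tokens only
    tokens = [w for w in sentence.lower().split() if w.isalpha()]
    vectors = [embeddings[w] for w in tokens if w in embeddings]
    if not vectors:
        # no known words: fall back to a zero vector
        return np.zeros(dim)
    v = np.sum(vectors, axis=0)
    return v / np.sqrt((v ** 2).sum())


vec = toy_sent2vec("Midnight dreary raven", toy_embeddings)
empty = toy_sent2vec("zzz qqq", toy_embeddings)
print(vec, empty)
```

Here the summed vector is (1, 2, 2), whose length is 3, so the result is (1/3, 2/3, 2/3). Sentences with no known words map to the zero vector, which is also what `sent2vec` returns in that case.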

In [None]:
xtrain_glove = [sent2vec(x) for x in tqdm(xtrain)]
xvalid_glove = [sent2vec(x) for x in tqdm(xvalid)]

In [None]:
xtrain_glove= np.array(xtrain_glove)
xvalid_glove= np.array(xvalid_glove)

In [None]:
clf = xgb.XGBClassifier(nthread=10, silent=False)
clf.fit(xtrain_glove, ytrain)
predictions = clf.predict_proba(xvalid_glove)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

In [None]:
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1, silent=False)
clf.fit(xtrain_glove, ytrain)
predictions = clf.predict_proba(xvalid_glove)

print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

## Deep Learning
But this is the era of deep learning! We can't live without training a few neural networks. Here, we will train an LSTM and a simple dense network on the GloVe features. Let's start with the dense network: