# <center>An ELMo/XGBoost Model for Text Classification</center>

## Fetching & Processing the '20newsgroups' Dataset

The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and the other for testing (or performance evaluation). Let's extract and process a training set corresponding to two categories: *'alt.atheism'* and *'sci.space'*.

In [1]:
from sklearn.datasets import fetch_20newsgroups
import csv
import re  # RegEx

categories = ['alt.atheism', 'sci.space']  # Extracting two specific categories
label, cnt = 0, 0

with open("train.csv", "w", newline = '') as outfile:
    w = csv.writer(outfile)
    w.writerow(['id', 'comment_text', 'label'])  # Each comment will have an ID and a label ('0' or '1')
    
    for cat in categories:
        train = fetch_20newsgroups(subset = 'train', categories =[cat], remove = ('headers', 'footers', 'quotes'))
        
        for n in range(len(train.data)):
            
            comment = train.data[n].replace('\n', " ")
            # Remove URLs from train and test
            comment = re.sub(r'http\S+', '', comment)
            # Remove punctuation marks
            punctuation = '!"#$%&()*+-/,:;<=>?@[\\]^_`{|}~'
            comment = ''.join(ch for ch in comment if ch not in set(punctuation))
            # Convert text to lowercase
            comment = comment.lower()
            # Remove numbers
            comment = re.sub(r'[0-9]+', '', comment)
            # Remove whitespaces
            comment = ' '.join(comment.split())

            if len(comment) < 1000 and len(comment) > 25:  # Discarding very short comments
                row = [str(cnt), comment, str(label)]
                w.writerow(row)
                cnt += 1
            elif len(comment) >= 1000:
                row = [str(cnt), comment[0:1000], str(label)]  # Limiting large comments to 1000 characters
                w.writerow(row)
                cnt += 1
        label += 1  # Changing the label for the next category

Let's look at a sample row (last row) from the *.csv* file:

In [2]:
print(row)

['1035', "many of you at this point have seen a copy of the lunar resources data purchase act by now. this bill also known as the back to the moon bill would authorize the u.s. government to purchase lunar science data from private and nonprofit vendors selected on the basis of competitive bidding with an aggregate cap on bid awards of million. if you have a copy of the bill and can't or don't want to go through all of the legalese contained in all federal legislationdon't both you have a free resource to evaluate the bill for you. your local congressional office listed in the phone bookis staffed by people who can forward a copy of the bill to legal experts. simply ask them to do so and to consider supporting the lunar resources data purchase act. if you do get feedback negative or positive from your congressional office please forward it to david anderman e. yorba linda blvd. apt g fullerton ca or via email to david.andermanofa.fidonet.org. another resource is your local chapter of t
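The cleaning steps applied to each comment can also be packaged as a single function, which makes them easy to reuse for the test set and to check on small inputs. The sketch below mirrors the steps of the cell above; the function name `clean_comment` and its `max_len`/`min_len` parameters are illustrative assumptions, not part of the original notebook.

```python
import re

# Same punctuation set as in the cell above ('.' and "'" are deliberately kept).
PUNCTUATION = set('!"#$%&()*+-/,:;<=>?@[\\]^_`{|}~')

def clean_comment(text, max_len=1000, min_len=25):
    """Clean one post; return None if the result is too short to keep."""
    text = text.replace('\n', ' ')
    text = re.sub(r'http\S+', '', text)                          # remove URLs
    text = ''.join(ch for ch in text if ch not in PUNCTUATION)   # remove punctuation
    text = text.lower()
    text = re.sub(r'[0-9]+', '', text)                           # remove numbers
    text = ' '.join(text.split())                                # collapse whitespace
    if len(text) <= min_len:
        return None                                              # discard very short comments
    return text[:max_len]                                        # truncate long comments

sample = "Visit http://x.org NOW!\nApollo 11 landed in 1969, didn't it? Yes, it certainly did."
print(clean_comment(sample))
```

A comment of 25 characters or fewer yields `None`, and anything longer than 1000 characters is cut to exactly 1000, which matches the two branches of the original loop.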

Let us now write a corresponding *.json* file:

In [3]:
import json  
  
# Open the CSV; DictReader takes the field names from the header row
with open('train.csv', 'r', newline = '') as f:
    reader = csv.DictReader(f)
    # Parse the CSV into JSON
    out = json.dumps([row for row in reader])

# Save the JSON  
f = open('train.json', 'w')
f.write(out)
f.close()
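To see what the resulting JSON records look like, here is a small self-contained sketch that runs the same CSV-to-JSON conversion on an in-memory string. The two toy rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io
import json

# Toy CSV with the same three columns as train.csv (contents are invented).
csv_text = "id,comment_text,label\n0,first toy comment,0\n1,second toy comment,1\n"

reader = csv.DictReader(io.StringIO(csv_text))  # field names come from the header row
records = [dict(row) for row in reader]
json_text = json.dumps(records)

print(json_text)
```

Note that every value, including `id` and `label`, is written as a string. This is one reason the labels are explicitly converted to integers before training.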

## Computing ELMo Vectors

Next, we compute the ELMo vectors of our training data. Let's begin by loading the training data as a *Pandas* DataFrame:

In [4]:
import pandas as pd
pd.set_option('display.max_colwidth', 50)

import numpy as np
import spacy
import pickle

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # Suppressing TensorFlow C++ log messages; must be set before TensorFlow is imported.
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.FATAL)  # TensorFlow 1.x logging API

from math import floor

# Read data
train = pd.read_json("train.json", orient = "records")

print("At data loading, train is of shape ", train.shape, " and of type ", type(train))
print("At data loading, train['comment_text'] is of type ", type(train['comment_text']))

At data loading, train is of shape  (1036, 3)  and of type  <class 'pandas.core.frame.DataFrame'>
At data loading, train['comment_text'] is of type  <class 'pandas.core.series.Series'>


Let's look at a few examples in the DataFrame *train*:

In [5]:
print(train.head(3))

                                        comment_text  id  label
0  ideologies also split giving more to disagree ...   0      0
1  i would rather be at a higher risk of being ki...   1      0
2  nope germany has extremely restrictive citizen...   2      0


Next, we load a *spaCy* language model, as well as a trainable ELMo model from *TensorFlow Hub*:

In [6]:
import en_core_web_sm  # spaCy's English model. To install: $ python -m spacy download en_core_web_sm
import tensorflow_hub as hub

# Load spaCy's language model
nlp = en_core_web_sm.load(disable = ['parser', 'ner'])

def lemmatization(texts):
    """A function to lemmatize texts"""
    output = []
    for i in texts:
        s = [token.lemma_ for token in nlp(i)]
        output.append(' '.join(s))
    return output

train['comment_text'] = lemmatization(train['comment_text'])

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable = True)

def elmo_vectors(x):
    """A function to compute ELMo embeddings"""
    embeddings = elmo(x.tolist(), signature = "default", as_dict = True)["elmo"]
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        #  return average of ELMo features
        return sess.run(tf.reduce_mean(embeddings,1))

W0716 14:12:18.857589  9884 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14


In [7]:
print("Type of elmo: ", type(elmo))

Type of elmo:  <class 'tensorflow_hub.module.Module'>
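The function *elmo_vectors* averages the token vectors with `tf.reduce_mean` over axis 1. Since the "elmo" output has shape (batch, max_length, 1024), that plain mean divides by the padded length of the batch, so the padding positions of shorter comments take part in the average. The NumPy sketch below contrasts this with a mean over the real tokens only. The function name `masked_mean`, the `lengths` argument, and the toy array are assumptions for illustration; the notebook itself uses the plain mean.

```python
import numpy as np

def masked_mean(embeddings, lengths):
    """Average token vectors over the real tokens of each sequence.

    embeddings: array of shape (batch, max_len, dim)
    lengths:    number of real tokens per sequence, shape (batch,)
    """
    lengths = np.asarray(lengths)
    max_len = embeddings.shape[1]
    # mask[b, t] is True where position t is a real token of sequence b
    mask = np.arange(max_len)[None, :] < lengths[:, None]
    summed = (embeddings * mask[:, :, None]).sum(axis=1)
    return summed / lengths[:, None]

# Toy batch: the first sequence has 2 real tokens and one zero-padded position.
toy = np.array([[[1.0, 1.0], [3.0, 3.0], [0.0, 0.0]],
                [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]]])

plain = toy.mean(axis=1)             # divides by max_len = 3 for every sequence
masked = masked_mean(toy, [2, 3])    # divides by the true lengths 2 and 3
print(plain)
print(masked)
```

Whether the masked variant changes downstream accuracy is not something this notebook measures.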


We are now ready to extract ELMo vectors:

In [8]:
percent_samples = 1 # Set to 1 to train on all of input dataset

batch_size = 50
    
list_train = [train[i:i+batch_size] for i in range(0, floor(percent_samples*train.shape[0]), batch_size)]

# Extract ELMo embeddings - Uncomment the two lines below to compute from scratch; 
# otherwise, load from .pickle file
#elmo_train = [elmo_vectors(x['comment_text']) for x in list_train]
#elmo_train = np.concatenate(elmo_train, axis = 0)
elmo_train = pickle.load(open("elmo_train.pickle", "rb"))
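The batching expression above can be checked without any data by computing only the slice boundaries. The helper below is a hypothetical sketch (the name `batch_bounds` is not from the notebook) that reproduces the same `range` and slicing logic.

```python
from math import floor

def batch_bounds(n_rows, batch_size, percent_samples=1):
    """Return the (start, stop) row ranges produced by the slicing above."""
    limit = floor(percent_samples * n_rows)  # only limits where a batch may start
    return [(start, min(start + batch_size, n_rows))
            for start in range(0, limit, batch_size)]

bounds = batch_bounds(1036, 50)
print(len(bounds), bounds[-1])
```

For the 1036 training rows and a batch size of 50 this gives 21 batches, the last one holding 36 rows. One detail worth knowing: `percent_samples` only restricts where a batch may start, so with a value below 1 the final batch can extend past the requested fraction (for example, rows 500 to 549 when `percent_samples = 0.5`).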

The ELMo vectors took a long time to compute. Let's save them in a *.pickle* file:

In [9]:
# Save elmo_train
pickle_out = open("elmo_train.pickle", "wb")
pickle.dump(elmo_train, pickle_out)
pickle_out.close()

In [10]:
print("elmo_train is of type: ", type(elmo_train), ", and of shape: ", elmo_train.shape)

elmo_train is of type:  <class 'numpy.ndarray'> , and of shape:  (1036, 1024)


## Building an XGBoost Model

In [11]:
import sklearn
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Load trained ELMo embeddings from elmo_train.pickle
with open("elmo_train.pickle", "rb") as pickle_in:
    elmo_train_loaded = pickle.load(pickle_in)

# Load raw training data to extract labels
train = pd.read_json("train.json", orient = "records")

# Converting labels in y_train to integers (np.int was removed in NumPy 1.24; the built-in int is equivalent)
y_train = np.array(train['label']).astype(int)

### Initial Model Setup and Grid Search

We start by tuning the maximum depth of the trees (max_depth) together with min_child_weight. We set the objective to *'binary:logistic'* since this is a binary classification problem.

In [12]:
cv_params = {'max_depth': [1, 3, 5], 'min_child_weight': [1, 3, 5]}
ind_params = {'learning_rate': 0.1, 'n_estimators': 1000, 'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8, 
             'objective': 'binary:logistic'}

optimized_GBM = GridSearchCV(xgb.XGBClassifier(**ind_params), 
                             cv_params, 
                             scoring = 'accuracy', cv = 5, n_jobs = -1)
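It helps to know how much work a grid search implies before launching it. The sketch below enumerates the candidates for the grid defined above using only the standard library; it illustrates the counting and is not how *GridSearchCV* is implemented internally.

```python
from itertools import product

cv_params = {'max_depth': [1, 3, 5], 'min_child_weight': [1, 3, 5]}
n_folds = 5

# Every combination of the listed values is one candidate model.
names = list(cv_params)
candidates = [dict(zip(names, values)) for values in product(*cv_params.values())]

n_fits = len(candidates) * n_folds  # one fit per candidate per fold
print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

With 9 candidates and 5 folds, the search performs 45 fits, plus one final refit of the best candidate on the full training set (since refit is left at its default of True). Each fit builds up to 1000 trees, which is why `n_jobs = -1` is used to run the fits in parallel.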

Now let’s run our grid search with 5-fold cross-validation and see which parameters perform the best.

In [13]:
optimized_GBM.fit(elmo_train_loaded, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=0.8, gamma=0,
                                     learning_rate=0.1, max_delta_step=0,
                                     max_depth=3, min_child_weight=1,
                                     missing=None, n_estimators=1000, n_jobs=1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=0, reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=1, seed=0, silent=None,
                                     subsample=0.8, verbosity=1),
             iid='warn', n_jobs=-1,
             param_grid={'max_depth': [1, 3, 5], 'min_child_weight': [1, 3, 5]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
          

Let's check our best score:

In [14]:
optimized_GBM.best_score_

0.9131274131274131

And here are the parameters that achieved it:

In [15]:
optimized_GBM.best_params_

{'max_depth': 1, 'min_child_weight': 3}

Let’s try optimizing some other hyperparameters now to see if we can beat a mean of 91.31% accuracy. This time, we will play around with subsampling along with lowering the learning rate to see if that helps.

In [16]:
cv_params = {'learning_rate': [0.1, 0.01], 'subsample': [0.7, 0.8, 0.9]}
ind_params = {'n_estimators': 1000, 'seed':0, 'colsample_bytree': 0.8, 
             'objective': 'binary:logistic', 'max_depth': 1, 'min_child_weight': 3}

optimized_GBM = GridSearchCV(xgb.XGBClassifier(**ind_params), 
                            cv_params, 
                             scoring = 'accuracy', cv = 5, n_jobs = -1)

In [17]:
optimized_GBM.fit(elmo_train_loaded, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=0.8, gamma=0,
                                     learning_rate=0.1, max_delta_step=0,
                                     max_depth=1, min_child_weight=3,
                                     missing=None, n_estimators=1000, n_jobs=1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=0, reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=1, seed=0, silent=None,
                                     subsample=1, verbosity=1),
             iid='warn', n_jobs=-1,
             param_grid={'learning_rate': [0.1, 0.01],
                         'subsample': [0.7, 0.8, 0.9]},
             pre_dispatch='2*n_jobs', refit=True, return_t

Let us check our score and parameters again:

In [18]:
optimized_GBM.best_score_

0.9131274131274131

The score is unchanged, which suggests that the subsampling ratio and learning rate were already at their best values within this grid:

In [19]:
optimized_GBM.best_params_

{'learning_rate': 0.1, 'subsample': 0.8}

Based on the two grid searches above, we will use the following parameters:

-  learning_rate = 0.1
-  subsample = 0.8
-  max_depth = 1
-  min_child_weight = 3

Since we are now using *XGBoost*'s native API rather than *sklearn*'s wrapper, we can store the training data in a *DMatrix*, *XGBoost*'s internal data structure, which is optimized for memory efficiency and training speed. This helps most when the number of training examples is very large and training runs for many boosting rounds.

In [20]:
xgdmat = xgb.DMatrix(elmo_train_loaded, y_train) # Creating a DMatrix to make XGBoost more efficient

Now let's specify our parameters and set our stopping criterion: training stops once the cross-validated error has not improved for 100 consecutive boosting rounds.

In [21]:
# xgb optimal parameters
our_params = {'eta': 0.1, 'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8, 
             'objective': 'binary:logistic', 'max_depth':1, 'min_child_weight':3}

In [22]:
cv_xgb = xgb.cv(params = our_params, dtrain = xgdmat, num_boost_round = 3000, nfold = 5,
                metrics = ['error'],
                early_stopping_rounds = 100) # Look for early stopping that minimizes error
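The early-stopping rule itself is simple enough to sketch in plain Python. The function below is a simplified illustration under the assumption that lower error is better and that ties do not count as improvements; the name `early_stopping_round` is hypothetical, and *XGBoost*'s own implementation may differ in details.

```python
def early_stopping_round(errors, patience):
    """Return (best_index, best_error) for a sequence of per-round errors.

    Scanning stops once `patience` consecutive rounds have passed
    without an improvement over the best error seen so far.
    """
    best_idx, best_err = 0, float('inf')
    for i, err in enumerate(errors):
        if err < best_err:
            best_idx, best_err = i, err   # new best round
        elif i - best_idx >= patience:
            break                         # no improvement for `patience` rounds
    return best_idx, best_err

toy_errors = [0.30, 0.20, 0.15, 0.16, 0.17, 0.14, 0.18, 0.19, 0.20]
print(early_stopping_round(toy_errors, patience=2))
print(early_stopping_round(toy_errors, patience=3))
```

With a patience of 2 the scan stops before reaching the later, better round at index 5, while a patience of 3 finds it. A larger value such as the 100 rounds used here trades extra training time for a lower risk of stopping too early.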

We can look at our CV results to see how accurate we were with these settings. The output is automatically saved into a pandas dataframe for us.

In [23]:
cv_xgb.tail()

     train-error-mean  train-error-std  test-error-mean  test-error-std
444          0.000241         0.000482         0.084954        0.012887
445          0.000241         0.000482         0.083027        0.014848
446          0.000241         0.000482         0.083993        0.011766
447          0.000241         0.000482         0.084959        0.012539
448          0.000241         0.000482         0.083027        0.012824


Early stopping truncates the results at the best iteration, so the last row (index 448) is the best one. Our CV test error at this iteration is 8.3%, i.e. 91.7% accuracy.

Now that we have our best settings, let's train a final *XGBoost* model with them that we can reference later.

In [24]:
our_params = {'eta': 0.1, 'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8, 
             'objective': 'binary:logistic', 'max_depth':1, 'min_child_weight':3}
final_gb = xgb.train(our_params, xgdmat, num_boost_round = 448)

Let us now save the trained model as a *.pickle* file:

In [25]:
with open("ELMO_nlp_xgboost.pickle", "wb") as model_file:
    pickle.dump(final_gb, model_file)

### Analyzing Performance on Test Data

Let us construct a test dataset to analyze the performance of our trained model, and generate predictions.

In [26]:
import csv
import re  # RegEx
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'sci.space']  # Extracting two specific categories
label, cnt = 0, 0

with open("ELMo_input_data.csv", "w", newline='') as outfile:
    w = csv.writer(outfile)
    w.writerow(['id', 'comment_text', 'label'])  # Each comment will have an ID and a label ('0' or '1')
    
    for cat in categories:
        test = fetch_20newsgroups(subset = 'test', categories = [cat], remove = ('headers', 'footers', 'quotes'))
        
        for n in range(len(test.data)):            
            
            comment = test.data[n].replace('\n', " ")            
            # Remove URLs from the text
            comment = re.sub(r'http\S+', '', comment)
            # Remove punctuation marks
            punctuation = '!"#$%&()*+-/,:;<=>?@[\\]^_`{|}~'
            comment = ''.join(ch for ch in comment if ch not in set(punctuation))
            # Convert text to lowercase
            comment = comment.lower()
            # Remove numbers
            comment = re.sub(r'[0-9]+', '', comment)
            # Remove whitespaces
            comment = ' '.join(comment.split())

            if len(comment) < 1000 and len(comment) > 25:  # Discarding very short comments
                row = [str(cnt), comment, str(label)]
                w.writerow(row)
                cnt += 1
            elif len(comment) >= 1000:
                row = [str(cnt), comment[0:1000], str(label)]  # Limiting large comments to 1000 characters
                w.writerow(row)
                cnt += 1
        label += 1  # Changing the label for the next category

In [27]:
import json  
  
# Open the CSV; DictReader takes the field names from the header row
with open('ELMo_input_data.csv', 'r', newline = '') as f:
    reader = csv.DictReader(f)
    # Parse the CSV into JSON
    out = json.dumps([row for row in reader])

# Save the JSON  
f = open('ELMo_input_data.json', 'w')  
f.write(out)
f.close()

In [28]:
test = pd.read_json('ELMo_input_data.json', orient = 'records')
print(test.head(2))
print("Test data is of shape: ", test.shape)

                                        comment_text  id  label
0  some big deletions another in a string of idio...   0      0
1  you should wear your nicest boxer shorts and b...   1      0
Test data is of shape:  (685, 3)


Let us generate ELMo vectors for test data, and save them in a *.pickle* file:

In [29]:
test['comment_text'] = lemmatization(test['comment_text'])

percent_samples = 1  # set to 1 to score all of test dataset  
batch_size = 50
list_test = [test[i:i+batch_size] for i in range(0,floor(percent_samples*test.shape[0]),batch_size)]

In [30]:
# Extract ELMo embeddings - Uncomment the two lines below to compute from scratch; 
# otherwise, load from .pickle file
#elmo_vecs = [elmo_vectors(x['comment_text']) for x in list_test]
#elmo_vecs = np.concatenate(elmo_vecs, axis = 0)

In [31]:
elmo_vecs = pickle.load(open("elmo_test.pickle", "rb"))  # Loading saved elmo_vecs

In [32]:
# save elmo_vecs
pickle_out = open("elmo_test.pickle","wb")
pickle.dump(elmo_vecs, pickle_out)
pickle_out.close()

Next, let's load the trained XGBoost model:

In [34]:
with open("ELMO_nlp_xgboost.pickle", "rb") as model_file:
    loaded_model = pickle.load(model_file)  # Loading saved xgboost model

We can now generate predictions on test data:

In [35]:
predictions = loaded_model.predict(xgb.DMatrix(elmo_vecs))

In [36]:
# print(predictions) # Uncomment to view predictions

Now let’s use sklearn’s accuracy metric to see how well we did on the test set.

In [37]:
from sklearn.metrics import accuracy_score

With the *'binary:logistic'* objective, the predict function of *XGBoost*'s native API outputs probabilities rather than class labels. To calculate accuracy, we need to convert these to 0/1 labels, using a probability of 0.5 as the threshold.

In [38]:
predictions[predictions > 0.5] = 1
predictions[predictions <= 0.5] = 0
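An equivalent way to apply the threshold, shown here as a sketch on made-up probabilities, is to build the labels in a new integer array. This keeps the original probabilities available, which is useful if you later want to try a different threshold.

```python
import numpy as np

# Toy probabilities standing in for the model output (values are invented).
probs = np.array([0.92, 0.50, 0.07, 0.51], dtype=np.float32)

# Strictly greater than 0.5 maps to class 1, everything else to class 0,
# which matches the two in-place assignments above.
labels = (probs > 0.5).astype(int)
print(labels)
```

A probability of exactly 0.5 is assigned to class 0 under this rule, just as in the cell above.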

In [39]:
y_test = np.array(test['label']).astype(int)  # Converting labels in y_test to integers

# accuracy_score expects (y_true, y_pred)
print("Accuracy on test data: ", round(100*accuracy_score(y_test, predictions),2),"%")

Accuracy on test data:  86.28 %
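A single accuracy figure does not show which class the errors come from. The sketch below counts (true label, predicted label) pairs on invented toy labels, where 0 stands for *'alt.atheism'* and 1 for *'sci.space'* as in this notebook; the helper name `confusion_counts` is an assumption for illustration, and the numbers are not results from the model.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs for a binary problem."""
    return Counter(zip(y_true, y_pred))

# Toy labels, invented for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts[(0, 0)] + counts[(1, 1)]) / len(y_true)

print("true 0 predicted 1:", counts[(0, 1)])
print("true 1 predicted 0:", counts[(1, 0)])
print("accuracy:", accuracy)
```

Applied to `y_test` and `predictions`, the same counts would show whether misclassifications are spread evenly or concentrated in one of the two newsgroups.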
