# Fetching & Processing the '20newsgroups' Dataset

The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). Let's extract and process a training set correponding to two categories: *'alt.atheism'* and *'sci.space'*.

In [None]:
import csv
import re
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'sci.space'] # Extracting two specific categories
idx=0
cnt=0
with open("train.csv","w",newline='') as outfile:
    w = csv.writer(outfile)
    w.writerow(['id','comment_text','label']) # each comment will have an id and a label ('0' or '1')
    
    for cat in categories:
        train= fetch_20newsgroups(subset='train', categories=[cat], remove=('headers', 'footers', 'quotes'))
        
        for n in range(len(train.data)):
            
            comment = train.data[n].replace('\n'," ")
            
            # remove URL's from train and test
            comment = re.sub(r'http\S+', '', comment)

            # remove punctuation marks
            punctuation = '!"#$%&()*+-/,:;<=>?@[\\]^_`{|}~'
            comment = ''.join(ch for ch in comment if ch not in set(punctuation))

            # convert text to lowercase
            comment = comment.lower()

            # remove numbers
            comment = re.sub(r'[0-9]+', '', comment)

            # remove whitespaces
            comment = ' '.join(comment.split())

            if len(comment)<2000 and len(comment)>25: # discarding very short comments and very large comments
                cnt=cnt+1
                row = [str(cnt), comment, str(idx)]
                w.writerow(row)
        idx=idx+1

Let's look at a sample row (last row) from the .csv file:

In [None]:
print(row)

Let us now write a corresponding *.json* file:

In [None]:
import json  
  
# Open the CSV
f = open('train.csv', 'r')
reader = csv.DictReader(f, fieldnames = ('id','comment_text','label'))  
# Parse the CSV into JSON  
out = json.dumps([ row for row in reader ])  

# Save the JSON  
f = open('train.json', 'w')  
f.write(out)
f.close()

# Computing ELMo Vectors

Next, we compute the ELMo vectors of our training data. Let's begin by loading the training data as a *Pandas* DataFrame:

In [12]:
import pandas as pd
import numpy as np
import spacy
import pickle
import tensorflow as tf
from math import floor

pd.set_option('display.max_colwidth', 50)

#read data
train = pd.read_json("train.json",orient="records")
train = train.iloc[1:] #First record is a repetition of labels - skipping it

print("\nAt data loading, train is of shape ", train.shape, " and of type ", type(train))

Let's look at a couple of examples in the DataFrame *train*:

In [None]:
print(train.head(2))

Next, we load a *spaCy* language model, as well as a trainable *ELMo* model from *tensorflow_hub*:

In [None]:
import en_core_web_sm # to install: $ python -m spacy download en_core_web_sm
import tensorflow_hub as hub
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # Supressing some TensorFlow warnings. Comment this line to display warnings

# load spaCy's language model
nlp = en_core_web_sm.load(disable=['parser', 'ner'])

# function to lemmatize text
def lemmatization(texts):
    output = []
    for i in texts:
        s = [token.lemma_ for token in nlp(i)]
        output.append(' '.join(s))
    return output

train['comment_text'] = lemmatization(train['comment_text'])

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

def elmo_vectors(x):
    embeddings = elmo(x.tolist(), signature="default", as_dict=True)["elmo"]
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        # return average of ELMo features
        return sess.run(tf.reduce_mean(embeddings,1))

We are now ready to extract *ELMo* vectors:

In [None]:
percent_samples = 1 # set to 1 to train all of input dataset, .5 to train half, etc.

batch_size = 50
    
list_train = [train[i:i+batch_size] for i in range(0,floor(percent_samples*train.shape[0]),batch_size)]

# Extract ELMo embeddings - Uncomment the two lines below to compute from scratch; otherwise, load from .pickle file
elmo_train = [elmo_vectors(x['comment_text']) for x in list_train]
elmo_train = np.concatenate(elmo_train, axis = 0)

The *ELMo* vectors took a long time to compute. Let's save them in a *.pickle* file:

In [None]:
# save elmo_train
pickle_out = open("elmo_train.pickle","wb")
pickle.dump(elmo_train, pickle_out)
pickle_out.close()

In [None]:
print("elmo_train is of type: ", type(elmo_train), ", and of shape: ", elmo_train.shape)

# Building an XGBoost Model

In [None]:
import sklearn
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# load elmo_train
pickle_in = open("elmo_train.pickle", "rb")
elmo_train_loaded = pickle.load(pickle_in)

#load raw training data to extract labels
train = pd.read_json("train.json",orient="records")
train = train.iloc[1:]
y_train = train['label']

temp = np.array(y_train)
y_train = temp.astype(np.int) # converting y_train to integers

## Initial Model Setup and Grid Search

We are going to start tuning the *maximum depth* of the trees first, along with the *min_child_weight*. We set the objective to *‘binary:logistic’* since this is a binary classification problem.

In [None]:
cv_params = {'max_depth': [3,5,7], 'min_child_weight': [1,3,5]}
ind_params = {'learning_rate': 0.1, 'n_estimators': 1000, 'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8, 
             'objective': 'binary:logistic'}
optimized_GBM = GridSearchCV(xgb.XGBClassifier(**ind_params), 
                            cv_params, 
                             scoring = 'accuracy', cv = 5, n_jobs = -1)

Now let’s run our grid search with 5-fold cross-validation and see which parameters perform the best.

In [None]:
optimized_GBM.fit(elmo_train_loaded, y_train)

Let's check our best score and best parameters:

In [None]:
optimized_GBM.best_score_

In [None]:
optimized_GBM.best_params_

Let’s try optimizing some other hyperparameters now to see if we can beat a mean of XX.YY% accuracy. This time, we will play around with subsampling along with lowering the learning rate to see if that helps.

In [None]:
cv_params = {'learning_rate': [0.1, 0.01], 'subsample': [0.7,0.8,0.9]}
ind_params = {'n_estimators': 1000, 'seed':0, 'colsample_bytree': 0.8, 
             'objective': 'binary:logistic', 'max_depth': 5, 'min_child_weight': 3}

optimized_GBM = GridSearchCV(xgb.XGBClassifier(**ind_params), 
                            cv_params, 
                             scoring = 'accuracy', cv = 5, n_jobs = -1)

In [None]:
optimized_GBM.fit(elmo_train_loaded, y_train)

Let us check our score and parameters again:

In [None]:
optimized_GBM.best_score_

In [None]:
optimized_GBM.best_params_

Based on the CV testing performed earlier, we want to utilize the following parameters:

* learning_rate = 0.1
* Subsample = 0.8
* Max_depth = 5
* Min_child_weight = 3

To increase the performance of *XGBoost*’s speed through many iterations of the training set, and since we are using only *XGBoost*’s API and not *sklearn*’s anymore, we can create a *DMatrix*. This sorts the data initially to optimize for *XGBoost* when it builds trees, making the algorithm more efficient. This is especially helpful when you have a very large number of training examples.

In [None]:
xgdmat = xgb.DMatrix(elmo_train_loaded, y_train) # Creating a DMatrix to make XGBoost more efficient

Now let’s specify our parameters and set our stopping criteria. For now, let’s be aggressive with the stopping and say we don’t want the accuracy to improve for at least 100 new trees.

In [None]:
# xgb optimal parameters
our_params = {'eta': 0.1, 'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8, 
             'objective': 'binary:logistic', 'max_depth':5, 'min_child_weight':3}

cv_xgb = xgb.cv(params = our_params, dtrain = xgdmat, num_boost_round = 3000, nfold = 5,
                metrics = ['error'],
                early_stopping_rounds = 100) # Look for early stopping that minimizes error

We can look at our CV results to see how accurate we were with these settings. The output is automatically saved into a *Pandas* dataframe for us.

In [None]:
cv_xgb.tail(5)

**Best iteration: 210**. Our CV test error at this number of iterations is 8.4%, or **91.6% accuracy**.

Now that we have our best settings, let’s create this as an *XGBoost* object model that we can reference later.

In [None]:
our_params = {'eta': 0.1, 'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8, 
             'objective': 'binary:logistic', 'max_depth':5, 'min_child_weight':3}
final_gb = xgb.train(our_params, xgdmat, num_boost_round = 210)

Let us now save the trained model as a *.pickle* file:

In [None]:
pickle.dump(final_gb,open("ELMO_nlp_xgboost.pickle","wb"))

# Analyzing Performance on Test Data

Let us construct a test dataset to analyze the performance of our trained model, and generate predictions.

In [None]:
import csv
import re
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'sci.space'] # Extracting two specific categories
idx=0
cnt=0
with open("ELMo_input_data.csv","w",newline='') as outfile:
    w = csv.writer(outfile)
    w.writerow(['id','comment_text','label']) # each comment will have an id and a label ('0' or '1')
    
    for cat in categories:
        test= fetch_20newsgroups(subset='test', categories=[cat], remove=('headers', 'footers', 'quotes'))
        
        for n in range(len(test.data)):            
            comment = test.data[n].replace('\n'," ")            
            # remove URL's from train and test
            comment = re.sub(r'http\S+', '', comment)
            # remove punctuation marks
            punctuation = '!"#$%&()*+-/,:;<=>?@[\\]^_`{|}~'
            comment = ''.join(ch for ch in comment if ch not in set(punctuation))
            # convert text to lowercase
            comment = comment.lower()
            # remove numbers
            comment = re.sub(r'[0-9]+', '', comment)
            # remove whitespaces
            comment = ' '.join(comment.split())

            if len(comment)<2000 and len(comment)>25: # discarding very short comments and very large comments
                cnt=cnt+1
                row = [str(cnt), comment, str(idx)]
                w.writerow(row)
        idx=idx+1

In [None]:
import json  
  
# Open the CSV
f = open('ELMo_input_data.csv', 'r')
reader = csv.DictReader(f, fieldnames = ('id','comment_text','label'))  
# Parse the CSV into JSON  
out = json.dumps([ row for row in reader ])  

# Save the JSON  
f = open('ELMo_input_data.json', 'w')  
f.write(out)
f.close()

In [148]:
test = pd.read_json('ELMo_input_data.json',orient='records')
test = test.iloc[1:]
print(test.head(2))
test.shape

                                        comment_text id label
1  some big deletions another in a string of idio...  1     0
2  you should wear your nicest boxer shorts and b...  2     0


(617, 3)

Let us generate *ELMo* vectors for test data, and save them in a *.pickle* file:

In [None]:
percent_samples = 1 # set to 1 to score all of input dataset  
batch_size = 50
list_test = [test[i:i+batch_size] for i in range(0,floor(percent_samples*test.shape[0]),batch_size)]

# Extract ELMo embeddings - Uncomment the two lines below to compute from scartch; otherwise, load from .pcikle file
# elmo_vecs = [elmo_vectors(x['comment_text']) for x in list_test]
# elmo_vecs = np.concatenate(elmo_vecs, axis = 0)

# save elmo_vecs
pickle_out = open("elmo_test.pickle","wb")
pickle.dump(elmo_vecs, pickle_out)
pickle_out.close()

In [159]:
# If we've previously saved elmo_vecs in a .pickle file, we can simply load them
pickle_in = open("elmo_test_1.pickle", "rb")
elmo_vecs = pickle.load(pickle_in)

Next, let's load the trained *XGBoost* model:

In [48]:
import pickle
loaded_model = pickle.load(open("ELMo_nlp_xgboost.pickle","rb")) # load saved xgboost model

In [55]:
# print(loaded_model.feature_names)

We can now generate predictions on test data:

In [162]:
import xgboost as xgb

In [163]:
predictions = loaded_model.predict(xgb.DMatrix(elmo_vecs))

In [164]:
print(predictions)

[9.87347245e-01 7.31116712e-01 3.28189023e-02 2.58683780e-04
 6.39936002e-03 7.10943580e-01 2.15560973e-01 6.60109101e-03
 4.70442843e-04 9.96160150e-01 8.58237094e-04 1.93643868e-01
 2.04130150e-02 1.43510429e-03 4.26262530e-04 2.59525259e-03
 1.42872185e-01 9.98519480e-01 2.19664257e-02 6.64148748e-01
 1.38154048e-02 1.53272906e-02 1.21697402e-02 4.27198305e-04
 2.21942384e-02 9.52323496e-01 8.99149978e-04 1.43053591e-01
 4.64153141e-02 2.97685619e-03 2.06443062e-03 1.40017986e-01
 6.57744527e-01 1.47331674e-02 1.39694829e-02 5.53940004e-03
 7.73854554e-01 1.45339807e-02 1.32592511e-03 9.90438044e-01
 6.48238584e-02 9.98781741e-01 5.94218553e-04 1.05184328e-03
 2.77987301e-01 9.10708010e-01 7.50156343e-01 4.10835678e-03
 6.78646611e-04 9.36977923e-01 6.52891397e-03 1.23848999e-03
 3.05006057e-01 8.78522933e-01 3.61156076e-01 5.01230417e-04
 9.31613445e-01 1.56352371e-02 7.54740788e-04 6.94161560e-03
 4.38514277e-02 9.81914927e-05 7.04622790e-02 6.76127244e-03
 6.34107709e-01 5.349286

Now let’s use *sklearn*’s accuracy metric to see how well we did on the test set.

In [165]:
from sklearn.metrics import accuracy_score

The predict function for *XGBoost* outputs probabilities by default and not actual class labels. To calculate accuracy we need to convert these to a 0/1 label. We will set 0.5 probability as our threshold.

In [166]:
predictions[predictions > 0.5] = 1
predictions[predictions <= 0.5] = 0

In [167]:
y_test = test['label']
temp = np.array(y_test)
y_test = temp.astype(np.int) # converting y_train to integers

print("Accuracy on test data: ",round(100*accuracy_score(predictions, y_test),2),"%")

Accuracy on test data:  84.12 %


# Deploying to FastScore

To start, we import the *fastscoredeploy* library. This will leverage the FastScore API for deploying assets and models.

In [1]:
from fastscoredeploy import ipmagic
from fastscore.io import Slot

Schemas are going to define the input and output contract of the model execution code with the data transport. We will add one  for input and another output. Schemas leverage the Avro system: https://avro.apache.org/docs/1.8.1/spec.html. The cell magic command **%%schema (name)** at the top defines the name of the schema. 

**Note**: the name in this command must match the name of the corresponding schema name in the model.

In [283]:
%%schema three_strings
{
    "items": {
        "fields": [
            {   "name": "label",        "type": "string"   },
            {   "name": "comment_text", "type": "string"   },
            {   "name": "id",           "type": "string"   }
        ],
        "name": "three_strings",
        "type": "record"
    },
    "type": "array"
}

Schema loaded and bound to three_strings variable


In [336]:
%%schema double
{"type":"double"}

Schema loaded and bound to double variable


Schemata can also be inferred from sample data using **Schema.infer**. The samples must be given as records.

In [337]:
from fastscoredeploy.Schema import infer
import json

In [338]:
input_sample = dict({"id":"5", "comment_text":"Oh my God", "label":"0"})

In [339]:
print(input_sample)

{'id': '5', 'comment_text': 'Oh my God', 'label': '0'}


In [340]:
print(infer([input_sample]))

{
    "type": "record",
    "name": "Rec476098",
    "fields": [
        {
            "name": "label",
            "type": "string"
        },
        {
            "name": "comment_text",
            "type": "string"
        },
        {
            "name": "id",
            "type": "string"
        }
    ]
}


In [184]:
temp = json.loads(infer([input_sample]))
temp = json.dumps(temp,indent=2)

In [185]:
f = open("three_strings.avsc","w")
f.write(temp)
f.close()

Next we need to provide the model execution code. This code will be deployed into the engine and used to score new data. 
 - The cell magic command **%%model** at the top defines the name of the model *ELMo_nlp*.
 - We will not be using *FastScore*'s *call-back* style (begin and action functions), so we use the smar comment **#fastscore.action: unused**
 - The following smart comments map the schemas to the input and output. The names of the schemas in these smart comments must match the names of the schemas in the cell magic commands above (the name after **%%schema**). 
 - Next, use import statements to pull in the dependencies. Since the engine is containerized, you must include these import statements again even though you included them at the beginning of the notebook. These will need to be added to the Fastscore Engine's Dockerfile and import policy if they are not included in the default engine. 
 - Slot(0) and Slot(1) are the default input and output slots, respectively. Data coming to the input stream will be read and processed as a *Pandas DataFrame*. Any data processing that has to be carried  is done before scoring. The trained model is then loaded (from, say a *.pickle* file), and used to assign predictions to input data. these predictions are assigned to the output slot, Slot(1).

In [378]:
import pandas as pd

In [512]:
%%model ELMo_nlp

#fastscore.action: unused
#fastscore.schema.0: three_strings
#fastscore.schema.1: double
#fastscore.recordsets.1: true
#fastscore.module-attached: tensorflow
#fastscore.module-attached: tensorflow_hub
#fastscore.module-attached: xgboost

from fastscore.io import Slot

import xgboost as xgb
import pickle
import tensorflow_hub as hub
import tensorflow as tf
import numpy as np
import pandas as pd
from math import floor

slot0 = Slot(0)
slot1 = Slot(1)

   
#print(tf.__version__)

input_data = slot0.read(format="pandas.standard")
input_data = input_data.iloc[0:]

#print("input_data  type: ", type(input_data))
#print("input_data shape: ", input_data.shape)
#print(input_data.head())

percent_samples = 1 # set to 1 to score all of input dataset

batch_size = 1

temp = input_data.shape[0]

#print([[1,2] for i in range(0,floor(percent_samples*temp),batch_size)])

list_input_data = []

for i in range(0,floor(percent_samples*temp),batch_size):
    list_input_data+=[input_data[i:i+batch_size]]

#print(list_input_data)

#list_input_data = [input_data[i:i+batch_size] for i in range(0,floor(percent_samples*temp),batch_size)]

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)
globals().update(locals())

def elmo_vectors(x):
    embeddings = elmo(x.tolist(), signature="default", as_dict=True)["elmo"]
    with tf.compat.v1.Session() as sess:
        # sess.run(tf.global_variables_initializer())
        sess.run(tf.compat.v1.global_variables_initializer())
        # sess.run(tf.tables_initializer())
        sess.run(tf.compat.v1.tables_initializer())
        # return average of ELMo features
        return sess.run(tf.reduce_mean(embeddings,1))


# Extract ELMo embeddings
globals().update(locals())
elmo_vecs = [elmo_vectors(x['comment_text']) for x in list_input_data]

'''elmo_vecs = []

for x in list_input_data:
    elmo_vecs += [[elmo_vectors(x['comment_text'])]]'''

elmo_vecs = np.concatenate(elmo_vecs, axis = 0)

loaded_model = pickle.load(open("ELMo_nlp_xgboost.pickle","rb"))

predictions = loaded_model.predict(xgb.DMatrix(elmo_vecs))

out = pd.Series(predictions)

slot1.write(out)

Model loaded and bound to ELMo_nlp variable.


In [501]:
# To retrive your model's source code, uncomment the line below
# print(ELMo_nlp.source)

To check that the model is working as expected, we can pass sample data to the *scoreExplicit* function.

**NOTE**: <code>input_data = slot0.read(format="pandas.standard")</code> should be changed to <code>input_data = slot0.read(format="pandas")</code> for *scoreExplicit* to run successfully.

In [502]:
print("Sample input: \n\n", pd.DataFrame([input_sample]).head())

ELMo_nlp.scoreExplicit(pd.DataFrame([input_sample]))
out = Slot(1).output()
print("\nSample score:\n")
print(out)
print(type(out))

Sample input: 

   comment_text id label
0    Oh my God  5     0

Sample score:

[<PandasArray>
[0.6168823]
Length: 1, dtype: float32]
<class 'list'>


Now we're ready to deploy our FastScore-conformed model to a FastScore engine.


In [472]:
#Let us now deploy the model to FastScore to validate it works within the Engine
from fastscoredeploy.suite import Connect

connect = Connect("https://ec2-18-223-205-88.us-east-2.compute.amazonaws.com:8000")
print("     Connect: ",connect.name)

#Then we specify the model-manage to add the model assets to:
model_manage = connect.lookup('model-manage')
print("Model Manage: ",model_manage.name)

#And finally we specify the Engine we will deploy to
connect.prefer('engine','engine-1')
eng = connect.lookup('engine')
print("      Engine: ", eng.name)

# To check current configuration, uncomment the line below
# print("\n",connect.get_config())

# To check current fleet info, uncomment the line below
# print("\n",connect.fleet())

# To retrieve pneumo messages, uncomment the lines below
# pneumo = connect.pneumo.socket()
# print(pneumo.recv())
# pneumo.close()

# To retrieve the Swagger specification of the API supported by the instance, uncomment the lines below
# print(connect.get_swagger())

# To retrieves version information from the instance, uncomment the line below. A successful reply means instance is healthy.
# connect.check_health()



     Connect:  connect
Model Manage:  model-manage-1
      Engine:  engine-1


In [424]:
sid = three_strings.verify(eng)

In [425]:
print(sid)

11


In [426]:
print(eng.verify_data(sid,mydf.to_dict(orient='records')[0:1]))

None


Let's do a sanity check:

In [427]:
print("\n     connect is: ",connect)
print("        eng  is: ",  eng)
print("model_manage is: ",  model_manage)


     connect is:  <fastscoredeploy.suite.Connect.Connect object at 0x00000000354E22E8>
        eng  is:  <fastscoredeploy.suite.Engine.Engine object at 0x000000000F986588>
model_manage is:  <fastscoredeploy.suite.ModelManage.ModelManage object at 0x00000000398FC128>


Everyting looks <span style="color:green"><b>good</b></span>! Let's update Model Manage with our ELMo_nlp model:

In [513]:
ELMo_nlp.update(model_manage=model_manage)

True

In [486]:
print("\n  Models in Model Manage: ",model_manage.models.names())
print("")
print("Schemata in Model Manage: ",  model_manage.schemata.names())


  Models in Model Manage:  ['ELMo_nlp']

Schemata in Model Manage:  ['three_strings', 'double']


Next, we upload to the model (ELMo_nlp) the trained **XGBoost** model (ELMo_nlp_xgboost), which will be used to give ppredictions:

In [487]:
from fastscore.attachment import Attachment

att = Attachment('ELMo_nlp_xgboost.tar.gz', datafile='ELMo_nlp_xgboost.tar.gz')
att.upload(ELMo_nlp)

In [514]:
# Now we deploy to the engine. Start by resetting the engine in case it had previously errored.
eng.reset()

# Deploy! If there are errors, view the container logs for details
ELMo_nlp.deploy(eng)

IntProgress(value=0, max=8)

In [506]:
# To ckeck the engine state
print("The Engine is: ",eng.state)

# To check active model name and type
print("\nActive model name: ",eng.active_model.name)
print("Active model type: ",  eng.active_model.mtype)
# print(engine.active_model.jets)

The Engine is:  RUNNING

Active model name:  ELMo_nlp
Active model type:  python3


In [490]:
print("Model attachments: ",ELMo_nlp.attachments.names())

Model attachments:  ['ELMo_nlp_xgboost.tar.gz']


In [491]:
mydf = pd.DataFrame({"id":["5","6"], "comment_text":["Oh my God","Space is cool"], "label":["0","1"]})

type(mydf.to_dict(orient='records')[0])

dict

In [515]:
#Now we score with our sample data

print("Type of  [mydf.to_dict(orient='records')[0:2] : ", type([mydf.to_dict(orient='records')[0:2]]))
print("Type of [[mydf.to_dict(orient='records')[0:2]]: ", type([[mydf.to_dict(orient='records')[0:2]]]))

eng.score([mydf.to_dict(orient='records')[0:1]])

Type of  [mydf.to_dict(orient='records')[0:2] :  <class 'list'>
Type of [[mydf.to_dict(orient='records')[0:2]]:  <class 'list'>


ConnectionError: ('Connection aborted.', OSError("(10053, 'WSAECONNABORTED')"))