# Creating a Baseline Tag Labeler

Here we use [XGBoost](https://xgboost.readthedocs.io/en/latest/python/python_api.html) and [scikit-learn](https://scikit-learn.org/stable/) to create a baseline multi-class, multi-label classifier that labels our sample of [Stack Overflow](http://stackoverflow.com) posts (questions and their answers), two thirds of which lack labels. It will serve as a basis of comparison for the deep network we will train thereafter. We will create separate [embeddings](https://keras.io/layers/embeddings/) of the posts' language and code and use these as the signal for our model.

In [1]:
import json
import numpy as np
import pandas as pd
import re

print("I'm working!")

I'm working!


#### Load our sample of questions/answers with at least 1 vote and 1 answer

In [2]:
sorted_all_tags = json.load(open('data/stackoverflow/08-05-2019/sorted_all_tags.50000.json'))
max_index = sorted_all_tags[-1][0] + 1

In [3]:
import pyarrow
posts_df = pd.read_parquet(
    'data/stackoverflow/08-05-2019/Questions.Stratified.Final.50000.parquet',
    columns=['_Body'] + ['label_{}'.format(i) for i in range(0, max_index)],
    engine='pyarrow'
)
posts_df.head(5)

Unnamed: 0,_Body,label_0,label_1,label_2,label_3,label_4,label_5,label_6,label_7,label_8,...,label_14,label_15,label_16,label_17,label_18,label_19,label_20,label_21,label_22,label_23
0,"[C, Mono, Winforms, MessageBox, problem, I, fi...",1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"[Are, NET, data, providers, Oracle, require, O...",1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"[How, I, focus, foreign, window, I, applicatio...",1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"[Default, button, hit, windows, forms, trying,...",1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"[Can, I, avoid, JIT, net, Say, code, always, g...",1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
print(
    '{:,} Questions'.format(len(posts_df.index))
)

1,293,018 Questions


## Map from Tags to IDs

In [5]:
tag_index = json.load(open('data/stackoverflow/08-05-2019/tag_index.50000.json'))
index_tag = json.load(open('data/stackoverflow/08-05-2019/index_tag.50000.json'))

## Count the Most Common Tags

In [6]:
label_counts = json.load(open('data/stackoverflow/08-05-2019/label_counts.50000.json'))

# Sanity check that the different files agree on the number of tags
assert(len(label_counts.keys()) == len(tag_index.keys()) == len(index_tag.keys()) == len(sorted_all_tags))

## To Be Consistent: Make Record Count a Multiple of the Batch Size and Post Sequence Length

Although it is not necessary for our baseline labeler, the ELMo embedding in the network model requires that the number of records be a multiple of the batch size times the number of tokens in the padded posts. We trim the data the same way here so that both models see the same records.

In [7]:
import math

BATCH_SIZE = 32
MAX_LEN = 100
TOKEN_COUNT = 10000
EMBED_SIZE = 50

# Convert label columns to numpy array
labels = posts_df[list(posts_df.columns)[1:]].to_numpy()

# training_count must be a multiple of BATCH_SIZE times MAX_LEN for the ELMo embedding layer
highest_factor = math.floor(len(posts_df.index) / (BATCH_SIZE * MAX_LEN))
training_count = highest_factor * BATCH_SIZE * MAX_LEN
print('Highest Factor: {:,} Training Count: {:,}'.format(highest_factor, training_count))

# Remove stopwords - now done in Spark, so can remove once that runs
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')

documents = []
for body in posts_df[0:training_count]['_Body'].values.tolist():
    words = body.tolist()
    documents.append(words)

labels = labels[0:training_count]

# Lengths for x and y match
assert( len(documents) == training_count == labels.shape[0])

Highest Factor: 404 Training Count: 1,292,800


#### Sample the data to speed development

In [8]:
# import random
# random.seed(33)

# SAMPLE_SIZE = 10000
# id_list = list(range(0, len(filtered_code_words)))
# idx = random.sample(id_list, SAMPLE_SIZE)

# # idx = np.random.choice(np.arange(len(matrix_posts)), SAMPLE_SIZE, replace=False)

# sampled_posts = [x for i, x in enumerate(filtered_code_words) if i in idx]
# sampled_labels = [x for i, x in enumerate(new_labels) if i in idx]

# del filtered_code_words
# del new_labels

# len(sampled_posts), len(sampled_labels)

#### REMINDER: When we add text words we must combine the two valid label lists and then create a new list of labels

In [9]:
# MIN_TEXT = 20

# def extract_text(x):
#     doc = BeautifulSoup(x)
#     codes = doc.find_all('code')
#     [code.extract() if code else None for code in codes]
#     return doc.text

# post_text = tag_posts._Body.apply(extract_text)
# post_text_words = [x.split() for x in post_text.tolist()]

# # Take words with > MIN_TEXT (20) instances
# post_text_words = [[y for y in x if tag_counts[y] > MIN_TEXT] for x in post_text_words]

# # Create a new list of labels to match the new non-empty lists of words
# text_post_ids = defaultdict(bool)
# text_post_id_list = []
# for i, post in enumerate(post_text_words):
#     if len(post) == 0:
#         pass
#     else:
#         text_post_ids[i] = True
#         text_post_id_list.append(i)

#### Encode the tags, replacing their string form with their respective IDs

In [10]:
# encoded_tags = []
# raw_tags = []
# for tagset in coded_tags:
#    encoded_tags.append([1 if id in tagset else 0 for id in id_to_tag.keys()])

# labels = np.array(encoded_tags)

# encoded_tags[0]

## Create a Baseline Gradient Boosted Decision Tree Model

It is useful to have a decision tree model as a baseline for comparison with our deep network model. XGBoost's implementation of gradient boosted decision trees is state of the art for this kind of application, but it can't do multi-class, multi-label classification on its own. Therefore we wrap an [`xgboost.XGBClassifier`](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier) in an [`sklearn.multiclass.OneVsRestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html), which trains one binary classifier per label and then applies each of them to every post to predict that label.
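The one-classifier-per-label idea can be sketched on toy data. This is a minimal illustration, not the notebook's model: `LogisticRegression` stands in for `xgboost.XGBClassifier` so that the example needs only scikit-learn, and the feature and label values are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data (hypothetical values): 8 samples, 2 features, 2 independent binary labels.
X = np.array([
    [0.0, 0.0], [0.1, 0.9], [0.9, 0.1], [1.0, 1.0],
    [0.2, 0.1], [0.0, 1.0], [1.0, 0.0], [0.8, 0.9],
])
# Label 0 follows feature 0, label 1 follows feature 1.
Y = np.array([
    [0, 0], [0, 1], [1, 0], [1, 1],
    [0, 0], [0, 1], [1, 0], [1, 1],
])

# A 2D indicator matrix makes OneVsRestClassifier treat the problem as
# multi-label: it fits one independent binary classifier per column of Y.
# LogisticRegression is an assumption here, standing in for XGBClassifier.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, Y)

predictions = ovr.predict(X)
print(len(ovr.estimators_))  # number of fitted binary classifiers
print(predictions.shape)     # one prediction per sample and label
```

Because every label gets its own fitted estimator, memory and training time grow with the number of label columns.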

We set `VOCAB_SIZE`, `MAX_LENGTH` and `TEST_SPLIT` to fix the number of unique words fed into our embedding, the sequence length of each input, and the test/train split used for our performance testing.

#### Encode the data using Gensim and Word2Vec

For the network, we'll create our own embeddings. For the baseline model we'll use Word2Vec.

In [11]:
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer

from gensim.sklearn_api import W2VTransformer
from gensim.models import Word2Vec

import xgboost as xgb

VOCAB_SIZE = 5000
MAX_LENGTH = 100
EMBEDDING_SIZE = 50
NUM_CORES = 12

TEST_SPLIT = 0.2

w2v_model = Word2Vec(
    documents,
    size=EMBEDDING_SIZE,
    min_count=1,
    window=10,
    workers=NUM_CORES,
    iter=10,
    seed=33
)
w2v_model.save('data/stackoverflow/08-05-2019/word2vec.50000.model')



In [12]:
w2v_model.wv.most_similar(positive='program')

[('program,', 0.8360808491706848),
 ('programm', 0.8122567534446716),
 ('programme', 0.7685509324073792),
 ('script', 0.74830162525177),
 ('process', 0.7271444797515869),
 ('program)', 0.7130448818206787),
 ('programs', 0.7013044357299805),
 ('routine', 0.6951900720596313),
 ('program.', 0.6787816286087036),
 ('computer', 0.6732232570648193)]

In [13]:
encoded_docs = [[w2v_model.wv[word] for word in post] for post in documents]
len(encoded_docs)

1292800

In [14]:
encoded_docs[0]

[array([ 5.1444416 ,  2.1435053 ,  9.380815  , -3.5109606 ,  5.5468736 ,
        -4.354106  ,  1.9213352 , -1.9126751 ,  0.61967486, -3.2145436 ,
         3.1040335 ,  1.6090815 ,  3.102138  ,  0.07399222, -2.991248  ,
        -3.7718468 ,  7.078993  ,  1.529843  ,  6.1552277 ,  1.205088  ,
         6.617514  ,  0.0702325 ,  2.4488902 , -2.3876965 , -9.971252  ,
         0.7666695 ,  5.7770705 ,  4.084718  ,  8.94757   , -3.1050925 ,
         4.6770372 , -4.470423  , -4.985759  , -6.3275146 , -0.48854896,
         5.638934  , -2.5682726 , -7.8195734 ,  5.3294067 ,  3.1417184 ,
        -1.0252663 ,  1.161822  ,  3.6086853 ,  2.0949922 ,  1.7200934 ,
         1.8849148 ,  2.8504086 , -4.8772483 ,  4.4279428 , 14.183523  ],
       dtype=float32),
 array([ 1.3306632 ,  0.5822065 ,  1.5964314 , -5.5610666 ,  5.202259  ,
        -1.198007  , -1.1820809 , -5.6698375 , -2.5972545 , -1.6917946 ,
         1.0757146 , -0.9334136 ,  0.35064355, -1.164361  ,  0.11443811,
        -4.1922216 , -1.903

### Pad and limit the posts to MAX_LENGTH (100) words using each post's position-wise maximum and minimum

We will now compute the position-wise maximum and minimum of each post's word vectors and use them, alternating, to pad any post with fewer than `MAX_LENGTH` (100) words. We will simultaneously truncate any post with more than `MAX_LENGTH` words. If we were creating our own embeddings using Keras we would use [`keras.preprocessing.sequence.pad_sequences`](https://keras.io/preprocessing/sequence/#pad_sequences), but since we use [`gensim.models.word2vec`](https://radimrehurek.com/gensim/models/word2vec.html) we pad them on our own.

See [Representation learning for very short texts using weighted word embedding aggregation](https://arxiv.org/pdf/1607.00570.pdf), referenced from [Data Science Stack Exchange](https://datascience.stackexchange.com/a/17348/59975).
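The padding scheme can be sketched on a tiny post. The helper name `pad_with_min_max` and the two-dimensional toy vectors are hypothetical, chosen for illustration only. The sketch returns empty posts unchanged, since they have no minimum or maximum to pad with.

```python
import numpy as np

def pad_with_min_max(vectors, max_length):
    """Pad a post's word vectors to max_length by alternating the post's
    element-wise max and min vectors, then truncate to max_length."""
    vectors = list(vectors)  # copy so the caller's list is not mutated
    if 0 < len(vectors) < max_length:
        pointwise_min = np.minimum.reduce(vectors)
        pointwise_max = np.maximum.reduce(vectors)
        missing = max_length - len(vectors)
        # Each repetition adds two vectors, so round up; any extra is trimmed below.
        vectors += [pointwise_max, pointwise_min] * ((missing + 1) // 2)
    return vectors[:max_length]

# A "post" of two 2-dimensional word vectors, padded to five vectors.
post = [np.array([1.0, -2.0]), np.array([3.0, 0.5])]
padded = pad_with_min_max(post, max_length=5)
print(np.stack(padded))
```

Unlike zero padding, the padded positions carry a summary of the post itself.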

In [15]:
from math import ceil

padded_posts = []
for post in encoded_docs:
    # Pad short posts with alternating max/min
    if len(post) < MAX_LENGTH:
        pointwise_min = np.minimum.reduce(post)
        pointwise_max = np.maximum.reduce(post)
        padding = [pointwise_max, pointwise_min]

        # Each repetition adds two vectors, so round the number of repetitions up
        post += padding * ceil((MAX_LENGTH - len(post)) / 2.0)

    # Shorten long posts, or padded posts that overshot to MAX_LENGTH + 1
    if len(post) > MAX_LENGTH:
        post = post[:MAX_LENGTH]
      
    padded_posts.append(post)

# Verify their lengths
assert(min([len(post) for post in padded_posts]) == MAX_LENGTH)
assert(max([len(post) for post in padded_posts]) == MAX_LENGTH)

# Free up the RAM, since we copied the data
del encoded_docs
len(padded_posts), len(padded_posts[0])

(1292800, 100)

#### Convert the 3D feature array into a wider 2D array

The classifier requires 2D data, so we need to convert our 3D feature array into a wider 2D feature array. We will do this by iterating through the `MAX_LENGTH` (100) padded Word2Vec vectors of each post, each of length `EMBEDDING_SIZE` (50), and concatenating them into one long row of 5,000 values per post.

Note that the type of `padded_posts` is `list(list(np.array))`, an artifact of the Word2Vec mapping.
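On the CPU, the same flattening can be sketched with NumPy alone. The sizes below are toy values chosen for readability, and the random vectors merely mimic the nested `list(list(np.array))` structure; none of this is the notebook's data.

```python
import numpy as np

# Hypothetical toy sizes; the notebook uses MAX_LENGTH = 100 and EMBEDDING_SIZE = 50.
n_posts, max_length, embedding_size = 3, 4, 2

# Mimic the list(list(np.array)) structure produced by the Word2Vec lookup.
rng = np.random.default_rng(33)
padded = [
    [rng.normal(size=embedding_size).astype(np.float32) for _ in range(max_length)]
    for _ in range(n_posts)
]

# np.asarray stacks the nested lists into (n_posts, max_length, embedding_size) ...
cube = np.asarray(padded, dtype=np.float32)
# ... and reshape lays each post's word vectors end to end in a single row.
matrix = cube.reshape(n_posts, max_length * embedding_size)

print(cube.shape, matrix.shape)
```

Row `i` of `matrix` equals `np.concatenate(padded[i])`, so the word order within each post is preserved.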

#### Create one Row per Label Column

Training a `sklearn.multiclass.OneVsRestClassifier` with one `xgboost.XGBClassifier` per label exceeded 64GB of RAM and so we are remapping the data to have one instance of the row for each label column in a given row.

For example:

```python
# Input
rows, labels = [0.1, 0.3, 0.4, ...],[0,1,0,1]

# Output
rows_w_labels = [
    ([0.1, 0.3, 0.4, ...], 0),
    ([0.1, 0.3, 0.4, ...], 1),
    ([0.1, 0.3, 0.4, ...], 0),
    ([0.1, 0.3, 0.4, ...], 1)
]
```
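A minimal NumPy sketch of this remapping, without the label sampling, might look like the following. The helper name `explode_labels` and the toy values are hypothetical.

```python
import numpy as np

def explode_labels(features, label_matrix):
    """Repeat each feature row once per label column and flatten the labels to match."""
    n_labels = label_matrix.shape[1]
    # Row i appears n_labels times in succession.
    rows = np.repeat(features, n_labels, axis=0)
    # Row-major flattening: all labels of row 0, then all labels of row 1, ...
    flat = label_matrix.reshape(-1)
    return rows, flat

features = np.array([[0.1, 0.3, 0.4], [0.5, 0.6, 0.7]])
label_matrix = np.array([[0, 1, 0, 1], [1, 0, 0, 0]])

rows, flat = explode_labels(features, label_matrix)
print(rows.shape, flat.shape)
```

In this layout the repeated rows are identical and the label's column index is not part of the features, which is worth keeping in mind when interpreting the resulting binary classifier.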

In [16]:
import random  # needed for random.sample below

import cupy as cp

row_length = MAX_LENGTH * EMBEDDING_SIZE

matrix_posts = []
flat_labels = []
print_shape = True
print(len(padded_posts), len(sampled_labels))
for i, (post, labels) in enumerate(zip(padded_posts, sampled_labels)):
    # Concatenate the post's embedded words end to end into a single vector,
    # giving it the shape (5000,)
    post = cp.array(post)
    if print_shape:
        print(post.shape)
    post_row = cp.concatenate(post, axis=0)
    if print_shape:
        print(post_row.shape)
    assert(post_row.shape == (row_length,))
    
    # Now add a leading axis to the data, expanding its shape to (1, 5000)
    post_row = cp.expand_dims(post_row, axis=0)
    if print_shape:
        print(post_row.shape)
    assert(post_row.shape == (1,row_length))
    
    if print_shape:
        print(len(sampled_labels), len(sampled_labels[0]))
    
    # Sample the labels to see which to emit. I should really do this by relative frequency.
    SAMPLE_SIZE = 50
    id_list = list(range(0, 709))
    idx = random.sample(id_list, SAMPLE_SIZE)
    for j, label in enumerate(sampled_labels[i]):
        if print_shape:
            print(i, j, label, type(label))
            print_shape = False
        
        if j in idx:
            matrix_posts.append(post_row)
            flat_labels.append(label)
        else:
            continue

# Memory conservation is critical
del padded_posts
len(matrix_posts), len(flat_labels)

ModuleNotFoundError: No module named 'cupy'

#### Sample the Data Once Again

In [20]:
# SAMPLE_SIZE = 1000
# id_list = list(range(0, len(matrix_posts)))
# idx = random.sample(id_list, SAMPLE_SIZE)
# sampled_posts = [post for i, post in enumerate(matrix_posts) if i in idx]
# sampled_labels = [label for i, label in enumerate(flat_labels) if i in idx]

# del matrix_posts
# del flat_labels
# len(sampled_posts), len(sampled_labels)

#### Convert from GPU `cupy.ndarray` to main memory `numpy.ndarray`

In [21]:
matrix_posts = cp.asnumpy(cp.concatenate(matrix_posts, axis=0))

### Train the Baseline Model

In [22]:
X_train, X_test, y_train, y_test = train_test_split(
    matrix_posts,
    flat_labels,
    test_size=TEST_SPLIT,
    random_state=33
)
del matrix_posts
del flat_labels

In [23]:
print(X_train.shape, X_test.shape, len(y_train), len(y_test))
print(X_train.dtype, X_test.dtype, type(y_train), type(y_test))
print(type(X_train))

(400000, 1280) (100000, 1280) 400000 100000
float32 float32 <class 'list'> <class 'list'>
<class 'numpy.ndarray'>


In [24]:
from scipy import sparse

X_train = sparse.csr_matrix(X_train)

params = {
    'booster': 'gbtree',
    'silent': 0,
}

clf = OneVsRestClassifier(
    xgb.XGBClassifier(
        learning_rate=0.2,
        n_estimators=100,
        objective='binary:logistic',
        nthread=1,
        tree_method='gpu_hist'
    ), 
    n_jobs=1,
)
%timeit clf.fit(X_train[:200000], y_train[:200000])

29.8 s ± 312 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [25]:
# from sklearn.ensemble import RandomForestClassifier

# clf = RandomForestClassifier(
#     n_estimators=100,
#     max_depth=3,
#     random_state=33,
#     n_jobs=12
# )
# clf.fit(X_train, y_train)



In [27]:
# d_train = xgb.DMatrix(X_train, label=y_train)
# d_test =  xgb.DMatrix(X_test, label=y_test)

In [28]:
# from sklearn.model_selection import cross_val_score

# cross_val_score(clf, X_train, y_train, cv=2, scoring='accuracy')


In [30]:
from sklearn.metrics import (
    roc_curve, precision_recall_curve, auc, make_scorer, recall_score, 
    accuracy_score, jaccard_score, precision_score, confusion_matrix
)

y_pred = clf.predict(X_test)
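# classification_report orders classes by sorted label value, so 0 (tag absent)
# maps to 'no' and 1 (tag present) maps to 'yes'
report = classification_report(y_test, y_pred, output_dict=False, target_names=['no', 'yes'])#tag_labels)

  'precision', 'predicted', average, warn_for)


In [None]:
print(report)

              precision    recall  f1-score   support

          no       1.00      1.00      1.00     99697
         yes       0.00      0.00      0.00       303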

    accuracy                           1.00    100000
   macro avg       0.50      0.50      0.50    100000
weighted avg       0.99      1.00      1.00    100000



In [None]:
accuracy_score(y_test, y_pred)

jaccard_score(y_test, y_pred)


In [None]:
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV, StratifiedKFold

plt.style.use("ggplot")

xgb_params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5]
}

In [None]:
from keras import Sequential
from keras.layers import Embedding
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=VOCAB_SIZE, lower=True)
tokenizer.fit_on_texts(post_code)
sequences = tokenizer.texts_to_sequences(post_code)
X = pad_sequences(sequences, maxlen=MAX_LENGTH)
X.shape

model = Sequential()
# The embedding's input dimension must cover every index the tokenizer can emit
model.add(Embedding(VOCAB_SIZE, 64, input_length=MAX_LENGTH))


In [None]:
model