# Comparison of Vectorizer Models Using a Classification Task
---
This is a simple comparison using a simple spam/ham <a href="https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection">dataset</a>. The entire process was left as is, with minimal clean-up, as a learning exercise.

In [2]:
import pandas as pd
import sklearn as skl

In [3]:
data = pd.read_csv("train.txt", delimiter = "\t", header=None, names=["label", "text"])

My first thought here is to take a look at the data and see exactly what we're working with. I've got to check whether the data needs any preprocessing. This includes checking the distribution of spam to ham, as this will affect how the model is trained.

In [4]:
data.head()

,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
# better than .86
data.label.value_counts() / data.shape[0]

ham     0.865937
spam    0.134063
Name: label, dtype: float64

Here I take a look at each sentence's number of tokens and at the dataframe's shape. I want to have an idea of what the BOW vectorizer's output will look like. I also build a counter of each word in the corpus, which gives me an idea of how TFIDF will weigh the tokens.

In [6]:
# bag of words
data.text.str.lower().str.split(" ").apply(lambda x: len(x))

0       20
1        6
2       28
3       11
4       13
        ..
5567    30
5568     8
5569    10
5570    26
5571     6
Name: text, Length: 5572, dtype: int64

In [7]:
# get word count
from collections import Counter
results = Counter()
data['text'].str.lower().str.split().apply(results.update)
print(results)



# Bag of Words
---
Again, I want to see what my data looks like at every step. Notice that the beginning and end of the shown vectors are all zeros. This is expected, because each sentence contains only a handful of the words in the entire corpus.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
bag_of_words = count.fit_transform(data.text)
# Show feature matrix
bag_of_words.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [9]:
bag_of_words.shape

(5572, 8713)
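
To make the sparsity concrete, here is a minimal pure-Python sketch of a bag-of-words matrix built from a three-sentence toy corpus. The corpus and the helper name `bag_of_words_rows` are illustrative assumptions, and the whitespace tokenizer is simpler than the one `CountVectorizer` uses.

```python
from collections import Counter

def bag_of_words_rows(docs):
    """Return (vocabulary, rows); each row holds the per-word counts for one doc."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocab])
    return vocab, rows

# Hypothetical toy corpus, not taken from the dataset.
toy_docs = ["free entry to win", "ok see you later", "win a free prize"]
vocab, rows = bag_of_words_rows(toy_docs)

# Share of cells that are non-zero.
nonzero = sum(1 for row in rows for value in row if value)
density = nonzero / (len(rows) * len(vocab))

print(vocab)
print(rows)
print(f"density: {density:.2f}")
```

Even with three short sentences, most cells are zero. In the real matrix the effect is far stronger: the sparse matrix shown further down stores 74169 values out of 5572 x 8713 cells, which is about 0.15%.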

# TFIDF
---
We do the same for TFIDF.

In [10]:
# term freq inverse doc freq (divide by word freq to weigh accordingly)
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf_vec = tfidf.fit_transform(data.text)
# Show feature matrix
tfidf_vec.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

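Here we take a look at the number of whitespace-separated tokens versus the number of tokens the vectorizer actually counted. The difference comes mainly from tokenization: `CountVectorizer`'s default token pattern ignores punctuation and drops single-character tokens (such as "u" or "n"), while splitting on a single space can also produce empty strings when a message contains consecutive spaces.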

In [11]:
import numpy as np
np.sum(bag_of_words, axis=1)

matrix([[18],
        [ 5],
        [27],
        ...,
        [10],
        [24],
        [ 6]])

In [12]:
data.text.str.lower().str.split(" ").apply(lambda x: len(x))

0       20
1        6
2       28
3       11
4       13
        ..
5567    30
5568     8
5569    10
5570    26
5571     6
Name: text, Length: 5572, dtype: int64

In [13]:
data.text[0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
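
The gap between the two counts can be reproduced on a single message. This sketch uses only `re`: the pattern is the default `token_pattern` of scikit-learn's `CountVectorizer`, and the variable names are my own.

```python
import re

# Default token_pattern of CountVectorizer: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

text = "Ok lar... Joking wif u oni..."  # second message shown by data.head()

# What the manual count above does: lowercase, then split on single spaces.
whitespace_tokens = text.lower().split(" ")

# What the vectorizer keeps: punctuation is skipped, the one-letter "u" is dropped.
vectorizer_tokens = TOKEN_PATTERN.findall(text.lower())

print(len(whitespace_tokens), whitespace_tokens)
print(len(vectorizer_tokens), vectorizer_tokens)
```

The counts come out as 6 and 5, matching the second row of the two outputs above.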

In [14]:
np.sum(tfidf_vec, axis=1)

matrix([[4.0774551 ],
        [2.18215459],
        [4.45232481],
        ...,
        [2.91815309],
        [4.59574031],
        [2.0992315 ]])
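
The TFIDF row sums are much smaller than the BOW row sums because `TfidfVectorizer` scales every row to unit L2 length by default. Below is a minimal pure-Python sketch of that default formula (smoothed IDF, then L2 normalization). The toy corpus, the whitespace tokenizer, and the helper name `tfidf_rows` are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2-normalized rows (non-empty docs only)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many docs each word appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical toy corpus, not taken from the dataset.
toy_docs = ["free entry to win", "ok see you later", "win a free prize"]
vocab, rows = tfidf_rows(toy_docs)

for row in rows:
    # Plain sum of the row, then sum of squares (the latter is 1 for every row).
    print(round(sum(row), 3), round(sum(v * v for v in row), 3))
```

A unit-length row with k non-zero entries sums to somewhere between 1 and sqrt(k), which is consistent with the row sums of roughly 2 to 5 shown above. Within a row, words that occur in fewer documents ("entry") get a larger weight than words shared across documents ("free").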

Now it's time to run the vectorized data through a classifier. I chose a random forest because I believe it is a solid baseline model that can serve as a benchmark for any further testing.

In [15]:
from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.model_selection import train_test_split

In [16]:
bag_of_words

<5572x8713 sparse matrix of type '<class 'numpy.int64'>'
	with 74169 stored elements in Compressed Sparse Row format>

In [17]:
X_train, X_test, y_train, y_test = train_test_split(bag_of_words, data.label, test_size=0.33, random_state=42)

In [18]:
clf = rfc(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

RandomForestClassifier(max_depth=2, random_state=0)

In [19]:
preds = clf.predict(X_test)

In [20]:
pd.Series(preds).value_counts()

ham    1839
dtype: int64

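As the results below show, both the BOW and TFIDF models go all in on ham: the forest never predicts spam, which yields an F1 score of ~0.93 for the ham class (accuracy is ~0.87, the share of ham in the test split). This is likely due to the uneven class distribution of the dataset itself, possibly compounded by the very shallow trees (`max_depth=2`).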

In [21]:
from sklearn.metrics import confusion_matrix, f1_score

In [22]:
confusion_matrix(y_test, preds)

array([[1593,    0],
       [ 246,    0]])

In [23]:
f1_score(y_test, preds, pos_label="ham")

0.9283216783216783
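
The score above can be recomputed by hand from the confusion matrix, which shows why an F1 score with ham as the positive class looks good even though no spam is caught. This is a small arithmetic sketch; the variable names are my own.

```python
# Confusion matrix from the run above: rows are true labels, columns are
# predictions, both in the order [ham, spam].
cm = [[1593, 0],
      [246, 0]]

ham_tp = cm[0][0]  # ham predicted as ham
ham_fn = cm[0][1]  # ham predicted as spam
ham_fp = cm[1][0]  # spam predicted as ham

precision = ham_tp / (ham_tp + ham_fp)
recall = ham_tp / (ham_tp + ham_fn)
f1_ham = 2 * precision * recall / (precision + recall)

# Accuracy counts the diagonal; spam recall shows how much spam was caught.
accuracy = (cm[0][0] + cm[1][1]) / sum(sum(row) for row in cm)
spam_recall = cm[1][1] / sum(cm[1])

print(f"F1 (ham):      {f1_ham:.4f}")
print(f"accuracy:      {accuracy:.4f}")
print(f"recall (spam): {spam_recall:.4f}")
```

The ham F1 reproduces the 0.928 reported by `f1_score`, while spam recall is zero. Scoring with `pos_label="spam"` or reporting spam recall alongside would make this failure visible immediately.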

In [24]:
X_train, X_test, y_train, y_test = train_test_split(tfidf_vec, data.label, test_size=0.33, random_state=42)

In [25]:
clf = rfc(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

RandomForestClassifier(max_depth=2, random_state=0)

In [26]:
preds = clf.predict(X_test)

In [27]:
pd.Series(preds).value_counts()

ham    1839
dtype: int64

In [28]:
confusion_matrix(y_test, preds)

array([[1593,    0],
       [ 246,    0]])

In [29]:
f1_score(y_test, preds, pos_label="ham")

0.9283216783216783
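
One way to probe whether class imbalance is behind the all-ham predictions is to stratify the split and reweight the classes. The sketch below runs on synthetic data from `make_classification`, not on the SMS dataset, so it only illustrates the scikit-learn options involved; whether they help on the real data is untested here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the same 87/13 class split as the SMS data.
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.87, 0.13], random_state=0
)

# stratify=y keeps the class ratio the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

# class_weight="balanced" weights samples inversely to their class frequency.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)

# Score the minority class (label 1) so an all-majority predictor is not rewarded.
minority_f1 = f1_score(y_test, preds, pos_label=1)
print(minority_f1)
```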

# Word2Vec
---
Now I move on to some word vectorizers in hopes of improving my score.

In [30]:
import gensim

In [31]:
# glove maybe later
# model = gensim.models.word2vec.Word2Vec.load_word2vec_format(os.path.join(os.path.dirname(__file__), 'GoogleNews-vectors-negative300.bin'), binary=True)

In [32]:
# word2vec
import gensim.downloader as api

api.load("word2vec-google-news-300")

# Load Google's pre-trained Word2Vec model.
# model = gensim.models.KeyedVectors.load_word2vec_format('https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz', binary=True) 

<gensim.models.keyedvectors.KeyedVectors at 0x7ff3fc626fa0>

In [33]:
model_word2vec = api.load("word2vec-google-news-300")

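Note that to format the vectors, I average the vectors of the words in each sentence, which gives the same one-sentence-one-vector format that I used for BOW and TFIDF. This way every sentence is represented by a single vector, and the sentences can be compared. (The cell below divides the summed vectors by `len(sent)`, the number of characters in the sentence, rather than by the number of words, so strictly it produces a length-scaled sum rather than a true mean.)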

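At this point I also notice that some sentences end up without a vector: when none of a sentence's words are in the model's vocabulary, the sum comes out as a bare float instead of an array. These must be taken care of, so I decided to just fill those spots with vectors containing all ones. This will of course have an effect on the results and is somewhere to look if refinement is to be done.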

In [34]:
vectors_model_word2vec = [sum([model_word2vec[word] for word in sent.split(" ") if word in model_word2vec]) / len(sent) for sent in data.text]

In [35]:
vectors_model_word2vec = [np.ones(300) if isinstance(vec, float) else vec for vec in vectors_model_word2vec]
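
For reference, here is a minimal sketch of a true per-word mean with the same all-ones fallback. It is not the code that produced the results in this notebook (that code divides by the character count), and the helper name `sentence_vector` and the tiny 3-dimensional embedding table are hypothetical stand-ins for a pretrained model.

```python
import numpy as np

def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of in-vocabulary words; all ones if none match."""
    found = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not found:
        return np.ones(dim)  # same all-ones fallback as in the notebook
    return np.mean(found, axis=0)

# Hypothetical 3-dimensional embeddings standing in for a pretrained model.
toy_embeddings = {
    "free": np.array([1.0, 0.0, 2.0]),
    "prize": np.array([3.0, 2.0, 0.0]),
}

known = sentence_vector("free prize now", toy_embeddings, dim=3)
unknown = sentence_vector("wif oni", toy_embeddings, dim=3)
print(known)
print(unknown)
```

Only the two known words contribute to the first vector, and the second sentence falls back to ones. Swapping this in for the cells above would change the reported scores, so treat it as a direction for refinement rather than a drop-in replacement.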

In [36]:
word2vec_df = pd.DataFrame.from_dict(zip(data.label, vectors_model_word2vec))

In [37]:
word2vec_df.head()

,0,1
0,ham,"[0.0046023806, 0.005155546, -0.0020323058, 0.0..."
1,ham,"[-0.015317719, 0.010817955, 0.006785426, 0.009..."
2,spam,"[-0.0018367151, -0.009888286, -0.00878355, -0...."
3,ham,"[-0.010165468, 0.008688168, 0.021725401, 0.018..."
4,ham,"[0.018930905, 0.0112199625, 0.008914134, 0.025..."


In [38]:
word2vec_df.columns = ["label", "vec"]
word2vec_df = pd.DataFrame(word2vec_df.vec.to_list())

The result here is not a huge difference by any measure, but a change nonetheless! We now have a single data point correctly labeled as spam.

In [39]:
X_train, X_test, y_train, y_test = train_test_split(word2vec_df, data.label, test_size=0.33, random_state=42)

clf = rfc(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

preds = clf.predict(X_test)

pd.Series(preds).value_counts()

print(confusion_matrix(y_test, preds))

f1_score(y_test, preds, pos_label="ham")

[[1593    0]
 [ 245    1]]


0.9285922471582628

# GloVe
---
Another word vectorizer. I simply reused the process I worked out in the last section.

In [40]:
model_glove = api.load("glove-wiki-gigaword-50")




In [41]:
vectors_model_glove = [sum([model_glove[word] for word in sent.split(" ") if word in model_glove]) / len(sent) for sent in data.text]

In [42]:
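# Note: glove-wiki-gigaword-50 vectors are 50-dimensional, so any 300-wide fallback row
# adds extra columns that are NaN for every other row (hence the fillna further down).
vectors_model_glove = [np.ones(300) if isinstance(vec, float) else vec for vec in vectors_model_glove]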

glove_df = pd.DataFrame.from_dict(zip(data.label, vectors_model_glove))

In [43]:
glove_df.columns = ["label", "vec"]
glove_df = pd.DataFrame(glove_df.vec.to_list())

In [44]:
glove_df.fillna(1.0, inplace=True)

Here we backpedal a bit and lose our only spam prediction. The loss is immeasurable.

In [45]:
X_train, X_test, y_train, y_test = train_test_split(glove_df, data.label, test_size=0.33, random_state=42)

clf = rfc(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

preds = clf.predict(X_test)

pd.Series(preds).value_counts()

print(confusion_matrix(y_test, preds))

f1_score(y_test, preds, pos_label="ham")

[[1593    0]
 [ 246    0]]


0.9283216783216783

Here I come to the realization that we could potentially improve the embedding models by using vectors trained on a Twitter corpus, or by training them on our own data.

# Byte Pair Encoding
---

In [46]:
from bpemb import BPEmb

In [47]:
bpemb_en = BPEmb(lang="en", dim=50)

downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model


100%|██████████| 400869/400869 [00:00<00:00, 1042677.48B/s]


downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d50.w2v.bin.tar.gz


100%|██████████| 1924908/1924908 [00:00<00:00, 3176234.93B/s]


In [48]:
vectors_model_bpemb = [sum(bpemb_en.embed(sent)) / len(sent) for sent in data.text]

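# Note: BPEmb was loaded with dim=50, so any 300-wide fallback row adds extra columns
# that are NaN for every other row (hence the fillna below).
vectors_model_bpemb = [np.ones(300) if isinstance(vec, float) else vec for vec in vectors_model_bpemb]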

bpemb_df = pd.DataFrame.from_dict(zip(data.label, vectors_model_bpemb))

bpemb_df.columns = ["label", "vec"]

bpemb_df = pd.DataFrame(bpemb_df.vec.to_list())

bpemb_df.fillna(1.0, inplace=True)



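Finally! We see some marked improvement in our model's labeling, even with our imbalanced dataset. The ham F1 score jumps by ~3.6 points (0.928 to 0.965), and 140 of the 246 spam messages in the test split are now caught.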

In [49]:
X_train, X_test, y_train, y_test = train_test_split(bpemb_df, data.label, test_size=0.33, random_state=42)

clf = rfc(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

preds = clf.predict(X_test)

pd.Series(preds).value_counts()

print(confusion_matrix(y_test, preds))

f1_score(y_test, preds, pos_label="ham")

[[1583   10]
 [ 106  140]]


0.9646556977452773

# Transformer model
---
RoBERTa, a BERT-style transformer, fine-tuned directly for classification with `simpletransformers`.

In [50]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
import logging

Before this I ran everything on my laptop's CPU with no issues, but when using transformer models a GPU is all but mandatory. With one, the runtime went from 3 hours to 5 minutes.

In [51]:
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)


# Preparing train data
data_df = pd.read_csv("train.txt", delimiter = "\t", header=None, names=["labels", "text"])

le = preprocessing.LabelEncoder()
data_df.labels = le.fit_transform(data_df.labels)


X_train, X_test, y_train, y_test = train_test_split(data_df, data_df.labels, test_size=0.33, random_state=42)

# Optional model configuration
model_args = ClassificationArgs(num_train_epochs=3,overwrite_output_dir=True)

# Create a ClassificationModel
model = ClassificationModel(
    "roberta",
    "roberta-base",
    args=model_args,
    cuda_device=7
)

# Train the model
model.train_model(X_train)


Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_p


INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_train_roberta_128_2_2


INFO:simpletransformers.classification.classification_model: Training of roberta model complete. Saved to outputs/.


(1401, 0.0743371880751869)

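Another milestone in our little comparison venture: accuracy reaches 0.99. Note that the earlier scores are F1 scores for ham rather than accuracy. The BPE model's accuracy, computed from its confusion matrix, is (1583 + 140) / 1839 ≈ 0.94, so this is a gain of roughly 6 points in accuracy.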

In [52]:
from sklearn.metrics import accuracy_score

# Make predictions with the model
X_test_df = pd.DataFrame(X_test)
result, model_outputs, wrong_predictions = model.eval_model(X_test_df, acc=accuracy_score)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.



INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_roberta_128_2_2



INFO:simpletransformers.classification.classification_model:{'mcc': 0.971756935133147, 'tp': 239, 'tn': 1588, 'fp': 5, 'fn': 7, 'auroc': 0.99863095146959, 'auprc': 0.9934571301121983, 'acc': 0.9934747145187602, 'eval_loss': 0.03973725932354801}


Here are the BPE model's confusion matrix and F1 score (ham) for comparison.

\[1583   10\]

\[ 106  140\]

0.9646556977452773
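
To put both models on the same footing, the sketch below computes the ham F1 score and the accuracy for each from the counts reported above. The helper name is my own. The mapping of RoBERTa's `tn`/`fp`/`fn`/`tp` to ham and spam assumes that `LabelEncoder` assigned ham=0 and spam=1 (it sorts labels alphabetically), which agrees with the 1593 ham and 246 spam messages in the test split.

```python
def ham_f1_and_accuracy(ham_as_ham, ham_as_spam, spam_as_ham, spam_as_spam):
    """Return (F1 with ham as the positive class, overall accuracy)."""
    f1 = 2 * ham_as_ham / (2 * ham_as_ham + ham_as_spam + spam_as_ham)
    total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam
    accuracy = (ham_as_ham + spam_as_spam) / total
    return f1, accuracy

# BPEmb + random forest, read off its confusion matrix.
bpe_f1, bpe_acc = ham_f1_and_accuracy(1583, 10, 106, 140)

# RoBERTa, read off the eval_model output: tn=1588, fp=5, fn=7, tp=239,
# with spam (label 1) as the positive class in that output.
roberta_f1, roberta_acc = ham_f1_and_accuracy(1588, 5, 7, 239)

print(f"BPE      F1 (ham): {bpe_f1:.4f}   accuracy: {bpe_acc:.4f}")
print(f"RoBERTa  F1 (ham): {roberta_f1:.4f}   accuracy: {roberta_acc:.4f}")
```

On either metric RoBERTa comes out ahead. The spam recall tells the same story more sharply: 140 of 246 spam messages caught by the BPE model versus 239 of 246 by RoBERTa.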
