The goal of this work is to process a text dataset using Neural Networks and Deep Learning
word embedding and data analytics methods and to extract knowledge from it. Prepare a report
for this work and deposit it on moodle.

In this work you will use 20 Newsgroup dataset, but you a free to use any text data (UCI datasets
repository, kaggle, data.gouv.fr, …) informing the Professor.

The work should contains at least the following 4 parts:
1. Analysis of the text dataset
2. Text processing and Transformation
3. Apply di erent Neural Networks (NN) embedding techniques
4. Clustering and/or classi cation on the embedded data
5. Results analysis and visualisation
6. Theoretical formalism

In [55]:
# In this work you will use 20 Newsgroup dataset
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [56]:
# Analyse the dataset : the context, size, difficulties, detect the objectives.

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

# Load the 20 newsgroups dataset
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

# the context of the dataset
print(newsgroups_train.target_names)
print(newsgroups_train.data[0])

X_train = newsgroups_train.data
Y_train = newsgroups_train.target

X_test = newsgroups_test.data
Y_test = newsgroups_test.target


['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



In [57]:
# analyse the size of the dataset
print(len(X_train))
print(len(X_test))

2257
1502


2. Text Processing and Transformation

In [58]:
# Text Processing and Transformation
# For this part, you should use scikit-learn and you can follow the tutorial:
# https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#tutorial-setup

# Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices)

# For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j where j is the index of word w in the dictionary
def build_X(data, dictionary):
    X = np.zeros((len(data), len(dictionary)), dtype=np.int)
    for i, doc in enumerate(data):
        for word in doc.split():
            X[i, dictionary[word]] += 1
    return X


def build_dictionary(data):
    dictionary = {}
    for doc in data:
        for word in doc.split():
            if word not in dictionary:
                dictionary[word] = len(dictionary)
    return dictionary


dictionary_train = build_dictionary(X_train)
X_bow_train = build_X(X_train, dictionary_train)

dictionary_test = build_dictionary(X_test)
X_bow_test = build_X(X_test, dictionary_test)

In [59]:
# def tokenize(data, dictionary):
#     vectorizer = CountVectorizer(vocabulary=dictionary)
#     X = vectorizer.fit_transform(data)
#     return X

# X_cv_train = tokenize(X_train, dictionary_train)


# print(X_cv_train)

3. Apply different embedding techniques based on Neural Networks

In [60]:
def tfidf(X):
    transformer = TfidfTransformer()
    X = transformer.fit_transform(X)
    return X


X_tfidf_train = tfidf(X_bow_train)
X_tfidf_test = tfidf(X_bow_test)

print(X_tfidf_train[0:5])

  (0, 77)	0.1501782300675281
  (0, 76)	0.1501782300675281
  (0, 75)	0.1501782300675281
  (0, 74)	0.103699862561056
  (0, 73)	0.12962948804367816
  (0, 72)	0.1501782300675281
  (0, 71)	0.1501782300675281
  (0, 70)	0.2851886015699785
  (0, 69)	0.11378139920465326
  (0, 68)	0.07044791025673265
  (0, 67)	0.1501782300675281
  (0, 66)	0.11378139920465326
  (0, 65)	0.12962948804367816
  (0, 64)	0.05877826376929799
  (0, 63)	0.1501782300675281
  (0, 62)	0.1501782300675281
  (0, 61)	0.080491850362952
  (0, 60)	0.035036250633131864
  (0, 59)	0.1501782300675281
  (0, 58)	0.09774341765798598
  (0, 57)	0.02293400626289806
  (0, 56)	0.06833288469445191
  (0, 55)	0.1330396798598353
  (0, 54)	0.08004648953595733
  (0, 53)	0.029526882905000593
  :	:
  (4, 280)	0.031709364894395534
  (4, 278)	0.023816220382475
  (4, 273)	0.04705982356268544
  (4, 221)	0.019082684263568732
  (4, 219)	0.0513555290675402
  (4, 184)	0.0281348399793534
  (4, 182)	0.01993829776038902
  (4, 165)	0.040513867783262845
  (4, 151)

In [61]:
from nltk.tokenize import sent_tokenize, word_tokenize
import gensim
from gensim.models import Word2Vec
import nltk


nltk.download('punkt')



data = pd.DataFrame(X_train, columns=['Text'])
# append X_test
data = data.append(pd.DataFrame(X_test, columns=['Text']))

def get_corpus(data):
    corpus_text = 'n'.join(data[:1000]['Text'])
    data = []
    # iterate through each sentence in the file
    for i in sent_tokenize(corpus_text):
        temp = []
        # tokenize the sentence into words
        for j in word_tokenize(i):
            temp.append(j.lower())
        data.append(temp)
    return data


corpus = get_corpus(data)

# Word2Vec
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, workers=4)

# get the vector for each word in the vocabulary
words_key_to_index = model.wv.key_to_index
words_index_to_key = model.wv.index_to_key
words_vectors = model.wv.vectors

# split words_to_vectors into train and test
X_w2v_train = words_vectors[:len(X_train)]
X_w2v_test = words_vectors[len(X_train):]

print(X_w2v_train[0:5])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ion\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[[ 1.21886760e-01 -8.80167186e-02  5.61207712e-01 -3.76584500e-01
  -5.05346417e-01 -8.75589192e-01  3.35634559e-01  9.01217103e-01
  -3.27931195e-02  1.59283224e-02  1.15245506e-01 -1.64794326e+00
  -2.54266858e-02  6.27756596e-01 -5.54876506e-01 -6.04082525e-01
   5.49870133e-01 -7.30168283e-01 -5.06202042e-01 -1.54111886e+00
   9.64176238e-01  1.64968562e+00  1.25206590e+00 -7.90552348e-02
   3.56570929e-01 -2.71244720e-02 -4.82709289e-01 -2.53956109e-01
  -1.19589233e+00 -5.02319276e-01  1.29548323e+00 -6.09193981e-01
   1.70501685e+00 -1.06452310e+00 -2.71935642e-01  1.53095424e+00
  -3.63216013e-01 -3.79867256e-01  1.09562874e+00 -1.18346441e+00
  -3.10551357e-02 -4.05427396e-01 -7.71491408e-01  2.42090574e-03
   8.88898894e-02  2.47664571e-01 -1.42922366e+00  8.83940339e-01
  -1.36391833e-01  8.82265866e-01  4.79567312e-02 -2.19482020e-01
  -5.07226467e-01 -1.92014024e-01  3.70096527e-02  1.45585850e-01
   1.41330242e+00  2.11777538e-01 -7.45092332e-01  9.18270648e-02
   1.15650

In [62]:
# FastText
from gensim.models import FastText


model = FastText(corpus, vector_size=100, window=5, min_count=1, workers=4)


# get the vector for each word in the vocabulary
words_key_to_index = model.wv.key_to_index
words_index_to_key = model.wv.index_to_key
words_vectors = model.wv.vectors

# split words_to_vectors into train and test
X_ft_train = words_vectors[:len(X_train)]
X_ft_test = words_vectors[len(X_train):]

print(X_ft_train[0:5])

[[ 0.02731436  0.03532854 -0.649094    0.017563    0.92363745 -0.20742893
  -0.651582    1.2316182   0.6634129  -0.16187254  0.6303719  -0.784737
  -0.38953945  1.4947654  -0.3791734   0.32408726  0.77620983 -0.6670806
  -1.6654828  -1.3331561  -0.28174958 -0.24550638 -0.20765585 -1.4201847
  -0.06489483 -0.47188443 -1.0115728  -1.5898122  -0.8926713   0.24811971
  -0.7771993   0.7484617   0.6240752   0.9822568   1.0377414   1.188622
   0.20870152 -0.49757436 -1.331943   -0.37159234  0.10683428  0.30106544
  -0.69923186 -2.2436733  -0.29444557  0.22455022 -1.8583503  -0.470767
  -0.46327186 -0.26368976  1.1957734   0.7683525  -0.43630293 -0.38898528
   0.52679     0.39857554 -0.36295116 -1.0434092  -1.3003485  -0.02464187
   1.0461676  -0.54038966  0.9614785   2.2792933  -0.35931805  0.45200264
  -0.59067184 -0.35829046  0.5966126   1.4900286  -0.34155747  0.7667651
   1.0837072  -0.39562172  1.7894056   0.34115708  0.42516664 -0.0836772
   0.19626091  2.1668816   0.03700379 -0.8008802

In [63]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus)]
model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, workers=4)

words_vectors = model.wv.vectors


# split words_to_vectors into train and test
X_d2v_train = words_vectors[:len(X_train)]
X_d2v_test = words_vectors[len(X_train):]


print(X_d2v_train[0:5])

[[-1.19194150e+00 -7.20928460e-02  1.37898371e-01 -7.25268364e-01
  -1.04224706e+00 -4.24828410e-01  9.80469704e-01  1.66004837e+00
   6.93412900e-01 -2.28604838e-01  1.53491354e+00 -1.31897795e+00
  -4.73413199e-01  9.32623327e-01 -1.48994243e+00 -1.04747891e+00
   8.04982603e-01 -5.29905975e-01 -1.80774406e-01 -3.22483420e-01
   1.63759506e+00  1.40627408e+00  1.50395465e+00 -5.00040889e-01
   3.96168739e-01 -2.32179329e-01  2.52919704e-01 -1.23326814e+00
  -8.69860291e-01 -5.36870420e-01  2.50574499e-01 -4.36893046e-01
   1.04510033e+00 -4.38539505e-01 -1.10824394e+00  1.00409007e+00
   3.90145987e-01 -9.14870799e-01  1.37862968e+00 -1.62709486e+00
  -9.08986628e-01  4.26964253e-01 -5.31982958e-01 -1.28990698e+00
  -1.01923275e+00  1.19198096e+00 -7.74533212e-01  8.85437906e-01
  -5.71339548e-01  1.61840749e+00 -9.20520797e-02 -2.94697762e-01
   1.41870785e+00  1.22994065e+00 -6.24990523e-01 -9.57765996e-01
   1.08501768e+00 -1.45115948e+00 -1.19974267e+00 -1.55025303e-01
   1.77410

In [64]:
# BERT model
def build_bert_X(data):
    from transformers import BertTokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    X = np.zeros((len(data), 768))
    for i, doc in enumerate(data):
        X[i] = tokenizer.encode(doc, add_special_tokens=True, max_length=768, pad_to_max_length=True)
    return X

X_bow_train_list = X_bow_train.tolist()
X_bow_test_list = X_bow_test.tolist()

X_bert_train = build_bert_X(X_bow_train_list)
X_bert_test = build_bert_X(X_bow_train_list)

print(X_bert_train[0:5])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


[[101.   1.   1. ...   0.   0. 102.]
 [101.   1.   0. ...   0.   0. 102.]
 [101.   1.   0. ...   0.   0. 102.]
 [101.   1.   0. ...   0.   0. 102.]
 [101.   1.   0. ...   0.   0. 102.]]


4. Clustering and/or classification on the embedded data

5. Results analysis and visualisation

In [65]:
# KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


def knn(X_train, X_test, y_train, y_test):
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    return y_pred



# first 30 predictions

print("real \t\t", Y_test[:30])

Y_pred_w2v = knn(X_w2v_train, X_w2v_test, Y_train, Y_test)
print("Y_pred_w2v \t", Y_pred_w2v[0:30])

Y_pred_ft = knn(X_ft_train, X_ft_test, Y_train, Y_test)
print("Y_pred_ft \t", Y_pred_ft[0:30])

Y_pred_d2v = knn(X_d2v_train, X_d2v_test, Y_train, Y_test)
print("Y_pred_d2v \t", Y_pred_d2v[0:30])

Y_pred_bert = knn(X_bert_train, X_bert_test, Y_train, Y_test)
print("Y_pred_bert \t", Y_pred_bert[0:30])


# calculate the goodness of fit
from sklearn.metrics import accuracy_score


def goodness_of_fit(y_test, y_pred):
    return accuracy_score(y_test[0:1500], y_pred[0:1500])


print("w2v \t", goodness_of_fit(Y_test, Y_pred_w2v))
print("ft \t", goodness_of_fit(Y_test, Y_pred_ft))
print("d2v \t", goodness_of_fit(Y_test, Y_pred_d2v))
print("bert \t", goodness_of_fit(Y_test, Y_pred_bert))



real 		 [2 2 2 0 3 0 1 3 2 2 1 3 2 3 1 0 1 3 0 0 2 3 1 0 2 1 1 3 2 0]
Y_pred_w2v 	 [2 2 0 0 2 0 3 2 2 0 1 1 3 3 2 3 1 2 0 2 0 3 0 2 0 2 1 0 2 2]
Y_pred_ft 	 [1 1 3 0 2 0 1 2 1 1 0 0 3 2 1 0 0 2 1 1 2 1 2 1 3 2 0 2 1 2]
Y_pred_d2v 	 [1 3 2 0 0 1 0 2 3 0 2 3 1 2 2 0 1 2 0 2 2 1 1 2 1 1 3 0 1 2]
Y_pred_bert 	 [1 1 3 3 3 3 3 2 2 2 3 1 0 0 1 1 2 0 3 0 3 0 3 1 1 1 3 3 2 2]
w2v 	 0.278
ft 	 0.24133333333333334
d2v 	 0.2633333333333333
bert 	 0.24333333333333335


By analyzing the results given by different embedding techniques, you should be able to I observe that the best result is given by:

    - Word2Vec

The other three: FestText, Doc2Vec and Bert gave approximately the same results.

6. Word2Vec Theoretical details

### Word2Vec -Skip-Gram

One of the prominent techniques in representing the word as a vector is Word2Vec. It was created by a team of researchers at Google which was led by Tomas Mikolov. It is an unsupervised learning, neural network algorithms for obtaining vector representation of words. It is trained on a large corpus of text data (Wikipedia) which helps us recognize linear substructures of word vector space. (we will get to this once we get a hold on building the vector representations)

Example:

“The quick brown fox jumps over the lazy dog”

Build a probability matrix of words w.r.t each word appearing in a fixed window size. Here, the window size implies the surroundings of a particular word. So, if the window size is 4, while constructing the probability matrix we consider four words on either side of a particular word.

This variant of Word2Vec model is skip-gram model. This model helps us understand the contextual information of the whole sentence. In this model, we consider only one word at a time and build probability of all those words appearing in a fixed window size, around this word.

Building a probability vector for the word ‘THE’ by calculating the probability of all other words appearing around it, for the example and then follow the same for the rest of the words. For this, it's need to be considered two things:

1. The window size in consideration and

2. The number of unique words appearing around the word “THE”.

Here, for our convenience we consider the window size as 2. Now, the unique words appearing around the word “THE” in our small corpus are: quick, brown, jumps, over, lazy, dog. Hence, the probability of each word in the unique words appearing around “THE” becomes 1/6, whereas, the probability vector for the word “quick” is calculated by considering the unique words: the, brown and fox and hence becomes 1/3 for each of these words. This way the matrix is built for all other words in the corpus.

![alt text for screen readers](m1.png "Text to show on mouseover")

This probability vector for each word in the corpus becomes the target while training a neural network and the one hot embedding vector becomes the input layer in the model.

![alt text for screen readers](m2.png "Text to show on mouseover")


The weights of the hidden layer neurons after the model converges becomes our embedding vectors. Also, we can fix the length of the vector by deciding the number of neurons in the hidden layer while training.


### Word2Vec -CBOW

The other variant of Word2Vec model works on these same lines but with a slightly different approach. In this case, the input layer is not a one hot embedding vector of just one word but two-three words put together i.e. a sequence. Then the model is trained to predict the probability vector, same as in skip-gram model. Once the model converges, we get the weights of the hidden layer nuerons which are the required embeddings of a particular word. Since, we consider a sequence of inputs, it is called ‘continuous bag of words or CBOW’. This variant helps us predict the next word of the given input sequence with good accuracy.

![alt text for screen readers](m3.png "Text to show on mouseover")