The goal of this work is to process a text dataset with neural network and deep learning
word-embedding methods and data analytics methods, and to extract knowledge from it. Prepare a
report on this work and submit it on Moodle.

In this work you will use the 20 Newsgroups dataset, but you are free to use any other text data
(UCI datasets repository, Kaggle, data.gouv.fr, …) provided you inform the Professor.

The work should contain at least the following parts:
1. Analysis of the text dataset
2. Text processing and transformation
3. Apply different Neural Network (NN) embedding techniques
4. Clustering and/or classification on the embedded data
5. Results analysis and visualisation
6. Theoretical formalism

In [241]:
# In this work you will use the 20 Newsgroups dataset
import os
import pickle
import sys
import time
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.metrics import (
    accuracy_score, auc, classification_report, confusion_matrix, f1_score,
    precision_score, recall_score, roc_auc_score, roc_curve,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

warnings.filterwarnings("ignore")

In [242]:
# Analyse the dataset: its context, size, difficulties, and objectives.

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

# Load the 20 newsgroups dataset
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

# the context of the dataset
print(newsgroups_train.target_names)
print(newsgroups_train.data[0])

X_train = newsgroups_train.data
Y_train = newsgroups_train.target

X_test = newsgroups_test.data
Y_test = newsgroups_test.target


['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



In [243]:
# analyse the size of the dataset
print(len(X_train))
print(len(X_test))

2257
1502
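Beyond the raw document counts, the class balance and the document lengths are worth checking before modelling. The following is a minimal sketch, assuming text and label lists shaped like `X_train` and `Y_train`; the helper name `describe_dataset` and the toy values are illustrative only and are not part of the original notebook.

```python
from collections import Counter
from statistics import mean, median


def describe_dataset(texts, labels, target_names):
    """Summarise class balance and whitespace-token document lengths."""
    counts = Counter(labels)
    lengths = [len(text.split()) for text in texts]
    return {
        "n_documents": len(texts),
        "class_counts": {
            target_names[k]: counts.get(k, 0) for k in range(len(target_names))
        },
        "mean_length": mean(lengths),
        "median_length": median(lengths),
        "max_length": max(lengths),
    }


# Toy stand-ins for X_train, Y_train and newsgroups_train.target_names.
texts = [
    "god does not exist",
    "convert tif files to HPGL",
    "the patient has a fever today",
]
labels = [0, 1, 2]
names = ["alt.atheism", "comp.graphics", "sci.med"]

summary = describe_dataset(texts, labels, names)
print(summary)
```

On the real data the same call would be `describe_dataset(X_train, Y_train, newsgroups_train.target_names)`.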


In [244]:
# Text Processing and Transformation
# For this part, you should use scikit-learn and you can follow the tutorial:
# https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#tutorial-setup

# Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices)

# For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j where j is the index of word w in the dictionary
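def build_X(data, dictionary):
    """Return the document-term count matrix of `data` under `dictionary`."""
    X = np.zeros((len(data), len(dictionary)), dtype=int)
    for i, doc in enumerate(data):
        for word in doc.split():
            j = dictionary.get(word)
            # Words that are not in the dictionary have no column; skip them.
            if j is not None:
                X[i, j] += 1
    return X


def build_dictionary(data):
    """Map each whitespace-separated word in `data` to a fixed integer index."""
    dictionary = {}
    for doc in data:
        for word in doc.split():
            if word not in dictionary:
                dictionary[word] = len(dictionary)
    return dictionary


# Build the dictionary on the training set only and reuse it for the test set,
# so that column j refers to the same word in both matrices.
dictionary_train = build_dictionary(X_train)
X_bow_train = build_X(X_train, dictionary_train)
X_bow_test = build_X(X_test, dictionary_train)

In [245]:
# def tokenize(data, dictionary):
#     vectorizer = CountVectorizer(vocabulary=dictionary)
#     X = vectorizer.fit_transform(data)
#     return X

# X_cv_train = tokenize(X_train, dictionary_train)


# print(X_cv_train)

In [246]:
def tfidf(X_counts_train, X_counts_test):
    # Fit the IDF weights on the training counts only, then apply them to both sets.
    transformer = TfidfTransformer()
    X_train_tfidf = transformer.fit_transform(X_counts_train)
    X_test_tfidf = transformer.transform(X_counts_test)
    return X_train_tfidf, X_test_tfidf


X_tfidf_train, X_tfidf_test = tfidf(X_bow_train, X_bow_test)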
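To make the TF-IDF weighting concrete, here is a small NumPy sketch of the computation. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 row normalisation, which is what `TfidfTransformer` does with its default settings as far as I know. The function names `fit_idf` and `apply_tfidf` and the toy matrices are hypothetical.

```python
import numpy as np


def fit_idf(counts):
    """Smoothed IDF per column: ln((1 + n) / (1 + df)) + 1."""
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)  # number of documents containing each word
    return np.log((1 + n_docs) / (1 + df)) + 1.0


def apply_tfidf(counts, idf):
    """Weight raw counts by IDF, then scale each row to unit L2 norm."""
    weighted = counts * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows untouched
    return weighted / norms


# Toy count matrices: rows are documents, columns are dictionary entries.
train_counts = np.array([[2, 1, 0], [0, 1, 1], [1, 0, 0]])
test_counts = np.array([[1, 0, 3]])

idf = fit_idf(train_counts)                  # estimated from the training rows only
train_tfidf = apply_tfidf(train_counts, idf)
test_tfidf = apply_tfidf(test_counts, idf)   # reuses the training IDF
print(idf)
```

Estimating the IDF on the training rows and reusing it for the test rows keeps the two matrices on the same scale.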

In [247]:
from nltk.tokenize import sent_tokenize, word_tokenize
import gensim
from gensim.models import Word2Vec
import nltk


nltk.download('punkt')



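# Stack the training and test documents into a single DataFrame.
# pd.concat is used because DataFrame.append is no longer available in pandas 2.x.
data = pd.concat(
    [pd.DataFrame(X_train, columns=['Text']), pd.DataFrame(X_test, columns=['Text'])],
    ignore_index=True,
)

def get_corpus(data):
    # Only the first 1000 documents are joined into the training corpus.
    corpus_text = '\n'.join(data[:1000]['Text'])
    sentences = []
    # Split the text into sentences, then each sentence into lower-cased tokens.
    for sentence in sent_tokenize(corpus_text):
        sentences.append([token.lower() for token in word_tokenize(sentence)])
    return sentences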


corpus = get_corpus(data)

# Word2Vec
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, workers=4)

# get the vector for each word in the vocabulary
words_key_to_index = model.wv.key_to_index
words_index_to_key = model.wv.index_to_key
words_vectors = model.wv.vectors

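# Caution: words_vectors has one row per vocabulary word, so these slices are
# word vectors, not per-document features aligned with Y_train / Y_test.
X_w2v_train = words_vectors[:len(X_train)]
X_w2v_test = words_vectors[len(X_train):]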

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ion\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
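`model.wv.vectors` holds one row per vocabulary word. A classifier needs one row per document instead, and a common way to get that is to average the vectors of the words in each document. The sketch below shows the averaging step on hypothetical 2-dimensional vectors stored in a plain dict. With gensim, `model.wv` should support the same `in` test and `[]` lookup, but treat that as an assumption to verify against your installed version.

```python
import numpy as np


def document_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens that have one; zeros if none do."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Hypothetical 2-dimensional word vectors standing in for a trained model.
toy_vectors = {
    "image": np.array([1.0, 0.0]),
    "printer": np.array([0.0, 1.0]),
    "doctor": np.array([-1.0, -1.0]),
}

# Each document is a list of lower-cased tokens; unknown tokens are skipped.
docs = [["image", "printer", "unknown"], ["doctor"], ["unseen"]]
X_docs = np.vstack([document_vector(d, toy_vectors, dim=2) for d in docs])
print(X_docs.shape)
```

Built this way, the feature matrix has exactly one row per document, so it lines up with the label arrays without any slicing.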


In [248]:
# FastText
from gensim.models import FastText


model = FastText(corpus, vector_size=100, window=5, min_count=1, workers=4)


# get the vector for each word in the vocabulary
words_key_to_index = model.wv.key_to_index
words_index_to_key = model.wv.index_to_key
words_vectors = model.wv.vectors

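# Caution: the rows of words_vectors are FastText word vectors (one per vocabulary
# word), so these slices do not correspond to the training and test documents.
X_ft_train = words_vectors[:len(X_train)]
X_ft_test = words_vectors[len(X_train):]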
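FastText differs from Word2Vec in that it builds a word's vector from the vectors of its character n-grams, which lets related or unseen word forms share information. The sketch below only illustrates how such n-grams are extracted. The function name `char_ngrams` is hypothetical, the 3 to 6 range is what I understand gensim's `min_n` and `max_n` defaults to be, and the hashing of n-grams into buckets that a real implementation performs is left out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']

# Related word forms share many n-grams, and therefore part of their vectors.
shared = set(char_ngrams("graphic")) & set(char_ngrams("graphics"))
print(len(shared) > 0)
```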

In [249]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

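# Each tokenized sentence of the corpus is tagged with its position in the corpus.
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus)]
model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, workers=4)

# model.wv holds the word vectors; the vectors learned for the tags are in model.dv.
words_vectors = model.wv.vectors


# Caution: these slices are rows of the word-vector matrix, not vectors of the
# training and test documents.
X_d2v_train = words_vectors[:len(X_train)]
X_d2v_test = words_vectors[len(X_train):]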



In [250]:
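# BERT model
# Note: only the BERT tokenizer is used here, so X stores token ids, not BERT embeddings.
def build_bert_X(data):
    from transformers import BertTokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    X = np.zeros((len(data), 768))
    for i, doc in enumerate(data):
        # Pad or truncate every sequence to exactly 768 ids so that it fills one row of X.
        X[i] = tokenizer.encode(doc, add_special_tokens=True, max_length=768,
                                padding='max_length', truncation=True)
    return X

# The inputs are the rows of the bag-of-words count matrices, converted to lists.
X_bow_train_list = X_bow_train.tolist()
X_bow_test_list = X_bow_test.tolist()

X_bert_train = build_bert_X(X_bow_train_list)
X_bert_test = build_bert_X(X_bow_test_list)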

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
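A tokenizer only maps text to token ids. To obtain a BERT embedding, the ids are passed through the encoder and the per-token hidden states are pooled into one vector per document. The encoder is not run in the sketch below: a random array stands in for its output (in Hugging Face `transformers` this would typically be `last_hidden_state`), and `masked_mean_pool` is a hypothetical helper showing one common pooling choice, the mean over non-padding positions.

```python
import numpy as np


def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per document, ignoring padded positions."""
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


# Random stand-in for encoder output: 2 documents, 4 token positions, 768 dims.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2, 4, 768))
attention_mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])  # 0 marks padding

doc_embeddings = masked_mean_pool(hidden_states, attention_mask)
print(doc_embeddings.shape)
```

Using the vector of the first (`[CLS]`) token is another frequent choice. Which pooling works better is an empirical question for the dataset at hand.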


In [252]:
# KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


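def knn(X_train, X_test, y_train, y_test):
    # 1-nearest-neighbour classifier; y_test is accepted but not used here.
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return y_pred


def goodness_of_fit(y_pred, y_test):
    # Accuracy over the first 1500 items only, since the feature matrices passed
    # to knn may not have exactly one row per test document.
    return accuracy_score(y_test[:1500], y_pred[:1500])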


# Y_pred_tfidf = knn(X_tfidf_train, X_tfidf_test, Y_train, Y_test)
# print(goodness_of_fit(Y_pred_tfidf, Y_test))


Y_pred_w2v = knn(X_w2v_train, X_w2v_test, Y_train, Y_test)
print(goodness_of_fit(Y_pred_w2v, Y_test))


Y_pred_ft = knn(X_ft_train, X_ft_test, Y_train, Y_test)
print(goodness_of_fit(Y_pred_ft, Y_test))


Y_pred_d2v = knn(X_d2v_train, X_d2v_test, Y_train, Y_test)
print(goodness_of_fit(Y_pred_d2v, Y_test))


Y_pred_bert = knn(X_bert_train, X_bert_test, Y_train, Y_test)
print(goodness_of_fit(Y_pred_bert, Y_test))


0.26666666666666666
0.23266666666666666
0.23066666666666666
0.24333333333333335
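With four classes, uniform random guessing has an expected accuracy of 1/4, so the four scores above are all close to chance level. A trivial baseline makes this kind of comparison explicit. The sketch below computes the accuracy of always predicting the most frequent training label; the function name and the toy labels are illustrative only.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for y in y_test if y == majority_label)
    return hits / len(y_test)


# Toy labels standing in for Y_train / Y_test.
y_train_toy = [0, 1, 1, 2, 3, 1]
y_test_toy = [1, 0, 1, 2]

baseline = majority_baseline_accuracy(y_train_toy, y_test_toy)
print(baseline)
```

An embedding that carries useful information about the documents should score clearly above this baseline on the real labels.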
