The goal of this work is to process a text dataset using neural-network (deep learning)
word-embedding and data-analytics methods, and to extract knowledge from it. Prepare a report
on this work and submit it on Moodle.

In this work you will use the 20 Newsgroups dataset, but you are free to use any other text data
(UCI dataset repository, Kaggle, data.gouv.fr, …) after informing the Professor.

The work should contain at least the following 6 parts:
1. Analysis of the text dataset
2. Text processing and transformation
3. Apply different Neural Network (NN) embedding techniques
4. Clustering and/or classification on the embedded data
5. Results analysis and visualisation
6. Theoretical formalism

In [20]:
# In this work you will use 20 Newsgroup dataset
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import pickle
import os
import sys
import time
import warnings
warnings.filterwarnings("ignore")

In [21]:
# Analyse the dataset : the context, size, difficulties, detect the objectives.

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

# Load the 20 newsgroups dataset
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

# the context of the dataset
print(newsgroups_train.target_names)
print(newsgroups_train.data[0])


['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



In [22]:
# analyse the size of the dataset
print(len(newsgroups_train.data))
print(len(newsgroups_test.data))

2257
1502
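
Beyond the raw document counts above, the class balance and the document lengths help to judge the difficulties of the dataset. The following is a minimal sketch; `summarize_corpus` is a hypothetical helper (not part of scikit-learn), lengths are measured in whitespace-separated tokens, and the toy lists only stand in for `newsgroups_train.data`, `newsgroups_train.target` and `newsgroups_train.target_names`.

```python
from collections import Counter
from statistics import mean, median


def summarize_corpus(docs, targets, target_names):
    """Return per-class document counts and basic length statistics (in whitespace tokens)."""
    class_counts = Counter(target_names[t] for t in targets)
    lengths = [len(doc.split()) for doc in docs]
    return {
        "n_docs": len(docs),
        "class_counts": dict(class_counts),
        "min_len": min(lengths),
        "median_len": median(lengths),
        "mean_len": mean(lengths),
        "max_len": max(lengths),
    }


# Toy stand-in for the real corpus (made up for illustration only)
toy_docs = ["no proof", "render this image please", "the image is blurry once printed"]
toy_targets = [0, 1, 1]
toy_names = ["alt.atheism", "comp.graphics"]

summary = summarize_corpus(toy_docs, toy_targets, toy_names)
print(summary)
```

On the real data the same call would be `summarize_corpus(newsgroups_train.data, newsgroups_train.target, newsgroups_train.target_names)`.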


In [23]:
# Text Processing and Transformation
# For this part, you should use scikit-learn and you can follow the tutorial:
# https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#tutorial-setup

# Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices)


def build_dictionary(data):
    """
    Build a dictionary from words to integer indices.
    :param data: a list of documents
    :return: a dictionary from words to integer indices
    """
    dictionary = {}
    for doc in data:
        for word in doc.split():
            if word not in dictionary:
                dictionary[word] = len(dictionary)
    return dictionary


dictionary = build_dictionary(newsgroups_train.data)


# For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j where j is the index of word w in the dictionary
def build_X(data, dictionary):
    """
    Build the feature matrix X from the training set.
    :param data: a list of documents
    :param dictionary: a dictionary from words to integer indices
    :return: the feature matrix X
    """
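    X = np.zeros((len(data), len(dictionary)), dtype=int)  # np.int was removed in NumPy 1.24
    for i, doc in enumerate(data):
        for word in doc.split():
            j = dictionary.get(word)
            if j is not None:  # skip words absent from the dictionary (e.g. in test documents)
                X[i, j] += 1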
    return X

X = build_X(newsgroups_train.data, dictionary)

print(X)

[[1 1 1 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 ...
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 1 1 1]
 [1 0 0 ... 0 0 0]]
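
The matrix printed above is almost entirely zeros, so a dense array wastes memory as the vocabulary grows. As a rough sketch of the alternative, the counts can be stored per document as a mapping from column index to count. `build_sparse_counts` is a hypothetical name introduced here, and `build_dictionary` is repeated only to keep the example self-contained; words that are missing from the dictionary are ignored, which is what a test document needs.

```python
from collections import Counter


def build_dictionary(data):
    """Map each whitespace-separated token to a fixed integer index."""
    dictionary = {}
    for doc in data:
        for word in doc.split():
            dictionary.setdefault(word, len(dictionary))
    return dictionary


def build_sparse_counts(data, dictionary):
    """Return one {column_index: count} mapping per document, ignoring unknown words."""
    rows = []
    for doc in data:
        indices = (dictionary[w] for w in doc.split() if w in dictionary)
        rows.append(dict(Counter(indices)))
    return rows


train_docs = ["the cat sat", "the dog sat on the mat"]
test_docs = ["the bird sat"]  # "bird" is not in the training dictionary

vocab = build_dictionary(train_docs)
train_rows = build_sparse_counts(train_docs, vocab)
test_rows = build_sparse_counts(test_docs, vocab)
print(vocab)
print(train_rows)
print(test_rows)
```

In practice scikit-learn's `CountVectorizer` already returns a SciPy sparse matrix, so this sketch is only meant to show the idea.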


In [24]:
# Tokenizing text with scikit-learn


def build_X_scikit(data, dictionary):
    """
    Build the feature matrix X from the training set.
    :param data: a list of documents
    :param dictionary: a dictionary from words to integer indices
    :return: the feature matrix X
    """
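    # Note: by default CountVectorizer lowercases the text and tokenizes it with its own
    # pattern, so dictionary entries produced by str.split() (e.g. "From:" or "Subject:")
    # are never matched and their columns stay at zero.
    vectorizer = CountVectorizer(vocabulary=dictionary)
    X = vectorizer.fit_transform(data)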
    return X


X = build_X_scikit(newsgroups_train.data, dictionary)

In [25]:
# TF-IDF weighting: a count-based baseline, not a neural embedding
# (the neural embedding approaches to test are listed in the next cell)

# import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

def build_X_tfidf(data, dictionary):
    """
    Build the feature matrix X from the training set.
    :param data: a list of documents
    :param dictionary: a dictionary from words to integer indices
    :return: the feature matrix X
    """
    vectorizer = TfidfVectorizer(vocabulary=dictionary)
    X = vectorizer.fit_transform(data)
    return X


X = build_X_tfidf(newsgroups_train.data, dictionary)
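
To make the TF-IDF weighting explicit, here is a small pure-Python sketch. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row, which is the default documented for scikit-learn (`smooth_idf=True`, `norm='l2'`). The helper `tfidf_rows` is hypothetical, and it tokenizes by whitespace, so its numbers will not match `TfidfVectorizer` exactly on real text.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} mapping per document (smoothed idf, L2-normalised rows)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows


rows = tfidf_rows(["image file format", "image of a patient", "patient file"])
for row in rows:
    print(row)
```

In the first toy document, "format" occurs in one document only and therefore gets a larger weight than "image", which occurs in two.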

In [26]:
# You should test different embedding approaches: word2vec, FastText, document2vec, BERT, GloVe
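
Word-level models such as word2vec, FastText or GloVe produce one vector per word, so a document vector still has to be built from them before clustering or classification. A common baseline is to average the vectors of the words that the model knows. The sketch below assumes this averaging strategy; `average_embedding` is a hypothetical helper and `toy_vectors` is a made-up 2-dimensional lookup table, where a real run would use vectors from a trained or pretrained model.

```python
def average_embedding(tokens, vectors, dim):
    """Average the vectors of the known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]


# Hypothetical 2-dimensional word vectors, made up for illustration only
toy_vectors = {
    "image": [1.0, 0.0],
    "pixel": [0.8, 0.2],
    "doctor": [0.0, 1.0],
}

doc_vec = average_embedding("the image has one bad pixel".split(), toy_vectors, dim=2)
print(doc_vec)
```

Averaging ignores word order, which is one reason to also compare against document-level models such as BERT.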

In [27]:
from sklearn.naive_bayes import MultinomialNB


def build_model(X, y):
    """
    Build a Multinomial Naive Bayes model.
    :param X: the feature matrix
    :param y: the target vector
    :return: a MultinomialNB model
    """
    model = MultinomialNB()
    model.fit(X, y)
    return model


model = build_model(X, newsgroups_train.target)


def predict(model, X):
    """
    Predict the labels of the documents in X.
    :param model: a MultinomialNB model
    :param X: the feature matrix
    :return: the predicted labels
    """
    return model.predict(X)


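# Note: these predictions are made on the training documents themselves,
# so the scores printed below are optimistic.
predicted = predict(model, X)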


def evaluate(y, predicted):
    """
    Evaluate the model by computing the confusion matrix and the classification report.
    :param y: the true labels
    :param predicted: the predicted labels
    :return: None (the confusion matrix and the classification report are printed)
    """
    print(confusion_matrix(y, predicted))
    print(classification_report(y, predicted))


evaluate(newsgroups_train.target, predicted)


[[302   0   0 178]
 [  1 569   2  12]
 [  0   3 584   7]
 [  0   1   0 598]]
              precision    recall  f1-score   support

           0       1.00      0.63      0.77       480
           1       0.99      0.97      0.98       584
           2       1.00      0.98      0.99       594
           3       0.75      1.00      0.86       599

    accuracy                           0.91      2257
   macro avg       0.93      0.90      0.90      2257
weighted avg       0.93      0.91      0.91      2257
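
The report above is computed on the documents the model was trained on. A fairer estimate comes from documents the model has not seen, and a scikit-learn `Pipeline` keeps the vectorizer and the classifier together so that the test documents are transformed with the vocabulary learned on the training set. The following is a sketch on a made-up toy corpus, not on the newsgroups data; the step names `"tfidf"` and `"nb"` are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for newsgroups_train / newsgroups_test (made up for illustration)
train_docs = [
    "render the image with more pixels",
    "the image file uses a graphics format",
    "the doctor prescribed a new medicine",
    "the patient saw the doctor about pain",
]
train_labels = [0, 0, 1, 1]
test_docs = ["which graphics format stores this image", "the medicine helped the patient"]
test_labels = [0, 1]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),  # learns the vocabulary and idf weights on the training set
    ("nb", MultinomialNB()),
])
clf.fit(train_docs, train_labels)

# Words unseen during training are ignored when the test documents are transformed
test_predicted = clf.predict(test_docs)
test_accuracy = accuracy_score(test_labels, test_predicted)
print(test_accuracy)
```

On the real data the same pipeline would be fitted with `clf.fit(newsgroups_train.data, newsgroups_train.target)` and evaluated with `clf.predict(newsgroups_test.data)` against `newsgroups_test.target`.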

