# Text Classification

We did not do much classification in class although it is relevant in many industrial settings, for example:
- spam detection
- sentiment analysis
- hate speech detection

There are also several theoretical NLP problems that are framed as classification, such as Natural Language Inference.

Because it is very basic, it gives you freedom to use any NLP method:
- bag of words (not really seen in class)
- word embeddings
- LSTM/RNN
- fine-tuned Transformer Encoder (e.g. BERT)...
- ...with full fine-tuning or parameter efficient fine-tuning (e.g. LoRA)
- prompted LLM (e.g. Llama)...
- ...with standard prompting or chain of thought...
- ...with or without In-Context Learning examples

For this homework, we will study the detection of automatically generated text (more specifically, automatically generated research papers), based on the work of [Liyanage et al. 2022 "A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications"](https://aclanthology.org/2022.lrec-1.501).

> Automatic text generation based on neural language models has achieved performance levels that make the generated text almost indistinguishable from those written by humans. Despite the value that text generation can have in various applications, it can also be employed for malicious tasks. The diffusion of such practices represent a threat to the quality of academic publishing. To address these problems, we propose in this paper two datasets comprised of artificially generated research content: a completely synthetic dataset and a partial text substitution dataset. In the first case, the content is completely generated by the GPT-2 model after a short prompt extracted from original papers. The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences that are generated by the Arxiv-NLP model. We evaluate the quality of the datasets comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE. The more natural the artificial texts seem, the more difficult they are to detect and the better is the benchmark. We also evaluate the difficulty of the task of distinguishing original from generated text by using state-of-the-art classification models.

# Installation and imports

Hit `Ctrl+S` to save a copy of the Colab notebook to your drive

Run on Google Colab GPU:
- Connect
- Modify execution
- GPU

![image.png](https://paullerner.github.io/aivancity_nlp/_static/colab_gpu.png)

In [1]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found



A T4 GPU (on Google Colab) offers 15GB of memory. This should be enough to run inference on, and fine-tune, LLMs of a few billion parameters (or fewer, obviously).

Note: in `float32`, 1 parameter = 4 bytes, so an LLM of 1B parameters takes 4GB of memory.
For full fine-tuning, you will also need to store gradients, activations (fewer of them with gradient checkpointing) and optimizer states (with optimizers like Adam).
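The arithmetic above can be sketched as a quick estimate. This is a rough lower bound under stated assumptions: everything is stored in `float32`, Adam keeps two extra values per parameter, and activations are ignored. The function names are illustrative, not part of any library.

```python
def inference_memory_gb(n_params, bytes_per_param=4):
    """Memory (in GB) needed just to hold the weights."""
    return n_params * bytes_per_param / 1e9


def training_memory_gb(n_params, bytes_per_param=4, optimizer_states=2):
    """Rough lower bound (in GB) for full fine-tuning, ignoring activations."""
    weights = n_params * bytes_per_param
    gradients = n_params * bytes_per_param  # one gradient per parameter
    # Assumption: Adam-like optimizer keeping two moment estimates per parameter
    optimizer = n_params * bytes_per_param * optimizer_states
    return (weights + gradients + optimizer) / 1e9


one_billion = 1_000_000_000
print(inference_memory_gb(one_billion))  # 4.0
print(training_memory_gb(one_billion))   # 16.0
```

Under these assumptions, fully fine-tuning a 1B-parameter model in `float32` already needs about 16GB before counting activations, which is more than the 15GB of a T4.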

Turn to quantization for cheap inference of larger models, or to Parameter-Efficient Fine-Tuning to fine-tune LLMs of a few billion parameters.

Much simpler solution: stick to smaller models of hundreds of millions of parameters (e.g. BERT, GPT-2, T5).
You're not here to beat the state of the art but to learn NLP.

In [2]:
import torch
import os

In [3]:
assert torch.cuda.is_available(), "Connect to GPU and try again"

AssertionError: Connect to GPU and try again

# Data
We will use the Hybrid subset of Liyanage et al., in which some sentences of human-written abstracts were replaced by automatically generated text. Experiments on the fully generated subsets (or any other dataset) may earn bonus points (to do).

No train-test split is provided in the paper, but we keep 80% for training and 20% for testing, following Liyanage et al.
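The loading cell in this section derives both the label and the split from each file name. A minimal sketch of that rule is given here, reusing the same suffix check and ID threshold; the file names are hypothetical and only illustrate the expected shape, and `label_and_split` is an illustrative helper.

```python
from pathlib import Path

TEST_ID_THRESHOLD = 10522  # same threshold as in the loading cell


def label_and_split(file_name):
    """Return (label, split) for a file name, following the notebook's rule."""
    name = Path(file_name).name
    # 1 = generated abstract, 0 = anything else (i.e. the original abstract)
    label = int(name.endswith("generatedAbstract.txt"))
    # Document ID: digits after the last dot of the part before the underscore
    doc_id = int(name.split("_")[0].split(".")[-1])
    split = "test" if doc_id < TEST_ID_THRESHOLD else "train"
    return label, split


# Hypothetical file names, for illustration only
examples = ["2111.10500_generatedAbstract.txt", "2111.10600_originalAbstract.txt"]
for file_name in examples:
    print(file_name, label_and_split(file_name))
```

Splitting on the document ID rather than at random keeps the split reproducible from one run to the next.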

In [None]:
import shutil

# Set this to the path of the directory you want to delete
dossier_a_supprimer = 'GeneratedTextDetection-main'

try:
    # Delete the directory and all of its contents
    shutil.rmtree(dossier_a_supprimer)
    print(f"The directory {dossier_a_supprimer} was deleted successfully.")
except OSError:
    print(f"{dossier_a_supprimer} may not exist")
finally:
    print('downloading the dataset')

In [None]:
!wget https://github.com/vijini/GeneratedTextDetection/archive/refs/heads/main.zip
!unzip main

In [None]:
from pathlib import Path

In [None]:
root = Path("GeneratedTextDetection-main/Dataset/Hybrid_AbstractDataset")

In [None]:
train_texts, train_labels, test_texts, test_labels = [], [], [], []
for path in root.glob("*.txt"):
    with open(path, 'rt') as file:
        text = file.read()
        text = text.lstrip('\ufeff')
    label = int(path.name.endswith("generatedAbstract.txt"))
    doc_id = int(path.name.split("_")[0].split(".")[-1])
    if doc_id < 10522:
        test_texts.append(text)
        test_labels.append(label)
    else:
        train_texts.append(text)
        train_labels.append(label)

In [None]:
len(train_texts), len(train_labels), len(test_texts), len(test_labels)

In [None]:
train_texts[0]

In [None]:
train_labels[0]

In [None]:
train_texts[10]

In [None]:
train_labels[10]

# Good luck!

It's now up to you to solve the problem. You are free to choose any NLP method (cf. the list I gave above)
but you should motivate your choice.
You can also compare several methods to get bonus points. (compare 3 methods)

# Submission instructions


**Deadline: Thursday 27th of February 23:59 (Paris time, CET)** (strict deadline, 5-point penalty per day late, so 4 days late means 0/20)

This is a **group work** of **3 members**.

You will have to submit your **code** and a **report** which will be graded (instructions below) by email to lerner@isir.upmc.fr.

The homework (continuous assessment) will account for 50% of your final grade.

## Report

The report should be **a single .pdf file of max. 4 pages** (concision is key).
Please name the pdf with the name of your group as written in the spreadsheet https://docs.google.com/spreadsheets/d/1UbApMhPC_wof-GoByjkV7kgD5YMbjcFFPqPUCB0YRtQ/edit?usp=sharing for example `ABC.pdf`.

It should follow the following structure:

### Introduction
A few sentences placing the work in context. Limit it to a few paragraphs at most; since your report is based on Liyanage et al., you don't have to motivate that work. However, it should be clear enough what Liyanage et al. is about and what its contributions are.

### Methodology

Describe the methods you are using to tackle the problem and motivate them: why this method and not another?  
What are its advantages and drawbacks?  
What experiments are you running to measure the efficiency or effectiveness of your method?

#### Model Descriptions
Describe the models you used, including the architecture, learning objective and the number of parameters.

#### Datasets
Describe the datasets you used and how you obtained them.

#### Hyperparameters
Describe how you set the hyperparameters and what was the source for their value (e.g., paper, code, or your guess).

#### Implementation
Describe whether you use existing code or write your own code.

#### Experimental Setup
Explain how you ran your experiments, e.g. the CPU/GPU resources.

### Results
Start with a high-level overview of your results. Keep this section as factual and precise as possible. Logically group related results into sections.

Remember to add plots and diagrams to illustrate your methods or results if necessary.



### Discussion

Describe which parts of your project were difficult or took much more time than you expected.


### Contributions

You should state the contributions of each member of the group.



## Code

You can submit your code either as:

- single .zip file with your entire source code (e.g. several .py files)
- link to a GitHub/GitLab repository (in this case, **include the link in your .pdf report**)
- link to a Google Colab Notebook (your code may be quite simple so it may fit in a single notebook;
  likewise, in this case, **include the link in your .pdf report**)

# Let's start

# Installation and Import

In [None]:
!pip install lazypredict

In [None]:
# General
from lazypredict.Supervised import LazyClassifier
from collections import Counter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score

# Process and resource monitoring
import psutil
import time
import subprocess

# TensorFlow / Keras (`tf` itself is needed later for `tf.device`)
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import Callback

In [None]:
import nltk

# Download the NLTK resources to a local directory
nltk_data_path = '/content/nltk_data'
os.makedirs(nltk_data_path, exist_ok=True)
nltk.data.path.append(nltk_data_path)
nltk.download('punkt_tab', download_dir=nltk_data_path)
nltk.download('stopwords', download_dir=nltk_data_path)
print("NLTK search paths:", nltk.data.path)

import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Data analysis

## Distribution: balanced data?

In [None]:
# Show the class distribution in the training and test sets
train_distribution = Counter(train_labels)
test_distribution = Counter(test_labels)

print("Data distribution in training set :", train_distribution)
print("Data distribution in test set :", test_distribution)


## Conclusion: perfectly balanced dataset
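Class balance matters because it sets the score of the trivial baseline that any classifier has to beat. A minimal sketch on toy labels (not the real dataset) is given here; `majority_baseline_accuracy` is an illustrative helper.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


toy_balanced = [0, 1] * 50        # toy labels, not the real dataset
toy_skewed = [0] * 90 + [1] * 10  # toy labels, not the real dataset
print(majority_baseline_accuracy(toy_balanced))  # 0.5
print(majority_baseline_accuracy(toy_skewed))    # 0.9
```

With balanced classes the baseline sits at 50%, so accuracy is a meaningful metric here; on skewed data it would be misleading on its own.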

# Data Preprocessing

In [None]:
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercase the text
    text = text.lower()

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [token for token in tokens if token not in stop_words]

    return tokens

In [None]:
tokenized_train_texts = [preprocess_text(text) for text in train_texts]
print("Preprocessed tokens:", tokenized_train_texts[0][:10])

# Features extraction

## Vector matrix

### Bag of words

In [None]:
# Instantiate the vectorizer
vectorizer_bow = CountVectorizer()

# Turn the training and test texts into count matrices
X_train_bow = vectorizer_bow.fit_transform(train_texts)
X_test_bow = vectorizer_bow.transform(test_texts)

print("Shape of the training matrix (bag-of-words):", X_train_bow.shape)


### TF-IDF
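To make the weighting concrete, here is a from-scratch sketch of TF-IDF on toy documents. It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is the formula documented for scikit-learn's default settings. The whitespace tokenisation is a simplifying assumption, so the values will not match `TfidfVectorizer` exactly on real text.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    # Assumption: naive lowercase + whitespace tokenisation
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(documents)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocabulary, rows


toy_docs = ["the model generates text", "the author writes text", "the model writes"]
vocabulary, rows = tfidf(toy_docs)
for term, weight in zip(vocabulary, rows[0]):
    print(f"{term:10s} {weight:.3f}")
```

In the first toy document, "generates" (present in one document) gets a larger weight than "the" (present in all three), which is the behaviour that distinguishes TF-IDF from raw counts.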

In [None]:
# Instantiate the TF-IDF vectorizer
vectorizer_tfidf = TfidfVectorizer()

# Turn the training and test texts into TF-IDF matrices
X_train_tfidf = vectorizer_tfidf.fit_transform(train_texts)
X_test_tfidf = vectorizer_tfidf.transform(test_texts)

print("Shape of the training matrix (TF-IDF):", X_train_tfidf.shape)


In [None]:

def training(X_train, X_test, Y_train, Y_test):
    print("CPU usage before training:", psutil.cpu_percent(interval=1), "%")
    print("Virtual memory before training:", psutil.virtual_memory())
    print("GPU usage before training:")
    !nvidia-smi

    # Fit and compare a range of classifiers with LazyPredict
    # (the sparse matrices are converted to dense arrays first)
    clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
    models, predictions = clf.fit(X_train.toarray(), X_test.toarray(), Y_train, Y_test)

    # Show CPU and memory usage after training
    print("CPU usage after training:", psutil.cpu_percent(interval=1), "%")
    print("Virtual memory after training:", psutil.virtual_memory())

    # Show the GPU state after training
    print("GPU usage after training:")
    !nvidia-smi

    return models, predictions

# LazyPredict to compare models (TF-IDF)

In [None]:
tfidf_models,tfidf_predictions=training(X_train_tfidf, X_test_tfidf, train_labels, test_labels)


In [None]:
tfidf_models

# LazyPredict to compare models (bag of words)

In [None]:
bow_models,bow_predictions=training(X_train_bow, X_test_bow, train_labels, test_labels)


In [None]:
bow_models

# Neural Network (minimal complexity)
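The report asks for the number of parameters of each model. For the dense network built by `train_nn` later in this section (hidden layers of 128 and 64 units, one sigmoid output), the count can be sketched by hand. The `input_dim` value used here is hypothetical; in the notebook it equals the number of TF-IDF features.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


def mlp_params(input_dim, hidden=(128, 64), n_outputs=1):
    """Total trainable parameters of a stack of dense layers (dropout adds none)."""
    sizes = [input_dim, *hidden, n_outputs]
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))


# Hypothetical input size, for illustration only
print(mlp_params(input_dim=10_000))  # 1288449
```

Almost all parameters sit in the first layer, so the size of this model is driven by the vocabulary size rather than by its depth.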

In [None]:
def print_cpu_usage():
    """Print CPU and memory usage."""
    cpu_usage = psutil.cpu_percent(interval=1)
    mem = psutil.virtual_memory()
    print(f"CPU usage: {cpu_usage}%")
    print(f"Memory usage: {mem.percent}% (Total: {mem.total/1e9:.2f}GB, Used: {mem.used/1e9:.2f}GB, Available: {mem.available/1e9:.2f}GB)")


In [None]:
def print_gpu_usage():
    """Print the GPU state as reported by nvidia-smi."""
    try:
        gpu_info = subprocess.check_output(["nvidia-smi"]).decode("utf-8")
        print("GPU usage:\n", gpu_info)
    except Exception as e:
        print("Unable to get GPU information. Check that the environment has a GPU.")
        print(e)

In [None]:
class ResourceMonitor(Callback):
    """Keras callback that records CPU and GPU utilization at the end of each epoch."""

    def __init__(self):
        super().__init__()
        self.cpu_usage = []
        self.gpu_usage = []

    def on_epoch_end(self, epoch, logs=None):
        # Measure CPU usage
        cpu = psutil.cpu_percent(interval=1)
        self.cpu_usage.append(cpu)
        print(f"Epoch {epoch+1} - CPU Usage: {cpu}%")

        # Measure GPU utilization via nvidia-smi
        try:
            gpu_info = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]
            )
            gpu_util = gpu_info.decode("utf-8").strip()
            print(f"Epoch {epoch+1} - GPU Utilization: {gpu_util}%")
            self.gpu_usage.append(gpu_util)
        except Exception as e:
            print("GPU usage not available:", e)
            self.gpu_usage.append("0")  # On error, store 0

In [None]:
def plot_resource_usage(resource_monitor):
    epochs = range(1, len(resource_monitor.cpu_usage) + 1)

    # Convert the GPU values (stored as strings) to floats
    gpu_usage = []
    for val in resource_monitor.gpu_usage:
        try:
            gpu_usage.append(float(val))
        except ValueError:
            gpu_usage.append(0.0)

    plt.figure(figsize=(14, 6))

    # CPU usage plot
    plt.subplot(1, 2, 1)
    plt.plot(epochs, resource_monitor.cpu_usage, marker='o', linestyle='-', color='blue')
    plt.title("CPU usage per epoch")
    plt.xlabel("Epoch")
    plt.ylabel("CPU Usage (%)")

    # GPU utilization plot
    plt.subplot(1, 2, 2)
    plt.plot(epochs, gpu_usage, marker='o', linestyle='-', color='green')
    plt.title("GPU utilization per epoch")
    plt.xlabel("Epoch")
    plt.ylabel("GPU Utilization (%)")

    plt.tight_layout()
    plt.show()

In [None]:
def train_nn(X_train, train_labels, X_test, test_labels, epochs=10, batch_size=32):
    """
    Train a neural network on precomputed TF-IDF features.

    Parameters:
    - X_train: TF-IDF matrix for training (sparse or dense)
    - train_labels: training labels
    - X_test: TF-IDF matrix for testing (sparse or dense)
    - test_labels: test labels
    - epochs: number of training epochs (default=10)
    - batch_size: batch size (default=32)

    The function trains a simple neural network while recording CPU/GPU usage
    at each epoch, prints the evaluation on the test set and plots resource usage.
    """
    # Convert the labels to NumPy arrays
    train_labels = np.array(train_labels)
    test_labels = np.array(test_labels)

    # Convert to dense format if needed
    if hasattr(X_train, "toarray"):
        X_train_dense = X_train.toarray()
    else:
        X_train_dense = X_train

    if hasattr(X_test, "toarray"):
        X_test_dense = X_test.toarray()
    else:
        X_test_dense = X_test

    # Model definition
    input_dim = X_train_dense.shape[1]
    model = Sequential([
        Dense(128, activation='relu', input_shape=(input_dim,)),
        Dropout(0.5),
        Dense(64, activation='relu'),
        Dense(1, activation='sigmoid')  # For binary classification
    ])

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    # Monitoring
    resource_monitor = ResourceMonitor()

    # Model training
    print("=== Start of training ===")
    history = model.fit(X_train_dense, train_labels, epochs=epochs, batch_size=batch_size, validation_split=0.1, callbacks=[resource_monitor])

    # Evaluation on the test set
    loss, accuracy = model.evaluate(X_test_dense, test_labels)
    print("Test Loss:", loss)
    print("Test Accuracy:", accuracy)

    # Predict probabilities once, then threshold them at 0.5 for the class labels
    probabilities = model.predict(X_test_dense)
    predictions = (probabilities > 0.5).astype("int32")

    print("\n=== Classification Report ===")
    print(classification_report(test_labels, predictions))

    conf_matrix = confusion_matrix(test_labels, predictions)
    precision = precision_score(test_labels, predictions)
    recall = recall_score(test_labels, predictions)
    f1 = f1_score(test_labels, predictions)

    # ROC AUC is computed directly from the predicted probabilities
    roc_auc = roc_auc_score(test_labels, probabilities)

    print("Confusion matrix:\n", conf_matrix)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1 Score:", f1)
    print("ROC AUC:", roc_auc)

    # Resource usage
    print('Resource usage')
    plot_resource_usage(resource_monitor)

    return model, history

In [None]:
model_nn_zc, history_zc = train_nn(X_train_tfidf.toarray(), train_labels, X_test_tfidf.toarray(), test_labels, epochs=100, batch_size=8)

# Complex Neural Network
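The recurrent models in this section take fixed-length integer sequences, which `prepare_data` obtains by padding to the longest training sequence. The sketch here mimics that step in plain Python on toy token ids. It assumes the Keras default of dropping tokens from the front of over-long sequences (`truncating='pre'`), which `prepare_data` does not override; `pad_post` is an illustrative helper, not a Keras function.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad with `value` at the end; keep only the last `maxlen` tokens of longer sequences."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # assumes maxlen >= 1
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


train_seqs = [[5, 2, 9], [7, 1]]  # toy token ids
test_seqs = [[4, 4, 8, 3, 6]]     # longer than anything seen in training
max_length = max(len(s) for s in train_seqs)
print(pad_post(train_seqs, max_length))  # [[5, 2, 9], [7, 1, 0]]
print(pad_post(test_seqs, max_length))   # [[8, 3, 6]]
```

Because the maximum length comes from the training set only, a longer test abstract loses its first tokens, which is worth keeping in mind when interpreting test results.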

In [None]:
def prepare_data(train_texts, test_texts, vocab_size=10000, padding='post'):
    """
    Tokenize the texts and pad them to obtain sequences of equal length.

    Returns:
      - X_train_pad, X_test_pad: padded sequences for training and testing.
      - tokenizer: the Tokenizer object fitted on the training texts.
      - max_length: maximum length used for padding.
    """
    tokenizer = Tokenizer(num_words=vocab_size)
    tokenizer.fit_on_texts(train_texts)
    X_train_seq = tokenizer.texts_to_sequences(train_texts)
    X_test_seq = tokenizer.texts_to_sequences(test_texts)

    # Set the maximum length from the training set
    max_length = max(len(seq) for seq in X_train_seq)

    X_train_pad = pad_sequences(X_train_seq, maxlen=max_length, padding=padding)
    X_test_pad = pad_sequences(X_test_seq, maxlen=max_length, padding=padding)

    return X_train_pad, X_test_pad, tokenizer, max_length

In [None]:
# Step 1: Data preparation
vocab_size = 10000
embed_dim = 128
X_train_pad, X_test_pad, tokenizer, max_length = prepare_data(train_texts, test_texts, vocab_size=vocab_size)

In [None]:
def train_model(model, X_train, train_labels, X_test, test_labels, epochs=10, batch_size=32):
    """
    Train an already-built model on the given data, using a callback to track
    resource usage per epoch, then print the evaluation metrics.
    """
    resource_monitor = ResourceMonitor()

    # Convert inputs and labels to NumPy arrays once
    X_train, train_labels = np.array(X_train), np.array(train_labels)
    X_test, test_labels = np.array(X_test), np.array(test_labels)

    with tf.device('/GPU:0'):
        history = model.fit(X_train, train_labels,
                        epochs=epochs, batch_size=batch_size,
                        validation_split=0.1, callbacks=[resource_monitor])

    loss, accuracy = model.evaluate(X_test, test_labels)
    print("Test Loss:", loss)
    print("Test Accuracy:", accuracy)

    # Predict probabilities once, then threshold them at 0.5 for the class labels
    probabilities = model.predict(X_test)
    predictions = (probabilities > 0.5).astype("int32")
    print("\n=== Classification Report ===")
    print(classification_report(test_labels, predictions))

    conf_matrix = confusion_matrix(test_labels, predictions)
    precision = precision_score(test_labels, predictions)
    recall = recall_score(test_labels, predictions)
    f1 = f1_score(test_labels, predictions)
    roc_auc = roc_auc_score(test_labels, probabilities)

    print("Confusion matrix:\n", conf_matrix)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1 Score:", f1)
    print("ROC AUC:", roc_auc)

    # Plot resource usage
    print('Resource usage')
    plot_resource_usage(resource_monitor)

    return history

## RNN

In [None]:
def build_rnn_model_1(vocab_size, embed_dim, max_length):
    """
    Build and compile a simple RNN model for binary classification.

    Parameters:
      - vocab_size: vocabulary size.
      - embed_dim: embedding dimension.
      - max_length: maximum sequence length.

    Returns:
      - model: the compiled Keras model.
    """
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embed_dim, input_length=max_length))
    model.add(SimpleRNN(128, activation='tanh'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

### training

In [None]:
# Step 3: Train the model with resource monitoring
model_rnn = build_rnn_model_1(vocab_size, embed_dim, max_length)
history_rnn = train_model(model_rnn, X_train_pad, train_labels, X_test_pad, test_labels, epochs=100, batch_size=32)

## LSTM
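Since the report should state model sizes, here is a hand count for the two recurrent models in this notebook (`build_rnn_model_1` and `build_lstm_model_1`), using the notebook's values `vocab_size=10000`, `embed_dim=128` and 128 recurrent units. The formulas are the standard ones for a simple recurrent layer and an LSTM with biases; the helper names are illustrative.

```python
def embedding_params(vocab_size, embed_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embed_dim


def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + bias."""
    return units * (input_dim + units + 1)


def lstm_params(input_dim, units):
    """Four weight blocks (input, forget and output gates, plus the cell candidate)."""
    return 4 * simple_rnn_params(input_dim, units)


def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


vocab_size, embed_dim, units = 10_000, 128, 128  # values used in this notebook
shared = embedding_params(vocab_size, embed_dim) + dense_params(units, 1)
print("SimpleRNN model:", shared + simple_rnn_params(embed_dim, units))  # 1313025
print("LSTM model:", shared + lstm_params(embed_dim, units))             # 1411713
```

The LSTM layer is four times larger than the simple recurrent layer, but in both models the embedding table dominates the total.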

In [None]:
def build_lstm_model_1(vocab_size, embed_dim, max_length):
    """
    Build and compile a simple LSTM model for binary classification.

    Parameters:
      - vocab_size: vocabulary size.
      - embed_dim: embedding dimension.
      - max_length: maximum sequence length.

    Returns:
      - model: the compiled Keras model.
    """
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embed_dim, input_length=max_length))
    model.add(LSTM(128))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

### training

In [None]:
model_lstm = build_lstm_model_1(vocab_size, embed_dim, max_length)
history_lstm = train_model(model_lstm, X_train_pad, train_labels, X_test_pad, test_labels, epochs=10, batch_size=32)