# Text Classification

We did not cover much classification in class, although it is relevant in many industrial settings, for example:
- spam detection
- sentiment analysis
- hate speech detection

There are also several theoretical NLP problems that are framed as classification, such as Natural Language Inference.

Because classification is such a basic task, you are free to use any NLP method:
- bag of words (not really seen in class)
- word embeddings
- LSTM/RNN
- fine-tuned Transformer Encoder (e.g. BERT)...
- ...with full fine-tuning or parameter efficient fine-tuning (e.g. LoRA)
- prompted LLM (e.g. Llama)...
- ...with standard prompting or chain of thought...
- ...with or without In-Context Learning examples
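
As an illustration of the last three items, a prompted LLM needs a text prompt, with or without In-Context Learning examples. The sketch below only builds such a prompt as a string; the function name `build_prompt`, the template wording and the label names are assumptions made for this example, not something prescribed by the homework or by any particular model.

```python
def build_prompt(text, examples=(), labels=("human", "generated")):
    """Build a classification prompt, optionally with in-context examples.

    `examples` is a sequence of (abstract, label) pairs; leave it empty
    for zero-shot prompting.
    """
    blocks = [f"Classify the abstract as one of: {', '.join(labels)}."]
    for example_text, example_label in examples:
        blocks.append(f"Abstract: {example_text}\nLabel: {example_label}")
    # The model is expected to continue the text after the final "Label:"
    blocks.append(f"Abstract: {text}\nLabel:")
    return "\n\n".join(blocks)


# Toy abstracts, invented for the example
zero_shot = build_prompt("We study graph neural networks.")
few_shot = build_prompt(
    "We study graph neural networks.",
    examples=[("We propose a new parser.", "human")],
)
print(few_shot)
```

The same function covers both settings: an empty `examples` gives standard zero-shot prompting, a non-empty one gives In-Context Learning.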

For this homework, we will study the detection of automatically generated text (more specifically, automatically generated research papers), based on the work of [Liyanage et al. 2022 "A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications"](https://aclanthology.org/2022.lrec-1.501).

> Automatic text generation based on neural language models has achieved performance levels that make the generated text almost indistinguishable from those written by humans. Despite the value that text generation can have in various applications, it can also be employed for malicious tasks. The diffusion of such practices represent a threat to the quality of academic publishing. To address these problems, we propose in this paper two datasets comprised of artificially generated research content: a completely synthetic dataset and a partial text substitution dataset. In the first case, the content is completely generated by the GPT-2 model after a short prompt extracted from original papers. The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences that are generated by the Arxiv-NLP model. We evaluate the quality of the datasets comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE. The more natural the artificial texts seem, the more difficult they are to detect and the better is the benchmark. We also evaluate the difficulty of the task of distinguishing original from generated text by using state-of-the-art classification models.

# Installation and imports

Hit `Ctrl+S` to save a copy of the Colab notebook to your drive

Run on a Google Colab GPU:
- Connect
- Change runtime type
- GPU

![image.png](https://paullerner.github.io/aivancity_nlp/_static/colab_gpu.png)

In [1]:
!nvidia-smi

Wed Feb 26 08:13:53 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   48C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                


The T4 GPU (on Google Colab) offers 15 GB of memory. This should be enough to run inference on, and to fine-tune, LLMs of up to a few billion parameters.

Note that in `float32`, 1 parameter = 4 bytes, so an LLM of 1B parameters takes 4 GB of memory.
For full fine-tuning, you also need to store gradients, activations (fewer with gradient checkpointing) and optimizer states (with optimizers like Adam).
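
The arithmetic above can be sketched as a small estimator. This is a rough lower bound under stated assumptions: it counts weights, gradients and two Adam moment buffers per parameter, all stored at the same precision, and it ignores activations, which depend on batch size and sequence length. The function name and the use of 1 GB = 10^9 bytes are choices made for this example.

```python
def training_memory_gb(n_params, bytes_per_param=4, optimizer_states_per_param=2):
    """Rough memory estimate (in GB) for full fine-tuning, activations excluded."""
    gb = 1e9  # assumption: 1 GB = 10^9 bytes
    weights = n_params * bytes_per_param / gb
    gradients = n_params * bytes_per_param / gb
    # Adam keeps two extra values (first and second moments) per parameter
    optimizer = n_params * bytes_per_param * optimizer_states_per_param / gb
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


estimate = training_memory_gb(1_000_000_000)  # 1B parameters in float32
print(estimate)
```

Under these assumptions, a 1B-parameter model already needs about 16 GB before counting activations, which is more than the T4 offers, while inference alone needs only the 4 GB of weights.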

Turn to quantization for cheap inference with larger models, or to Parameter-Efficient Fine-Tuning (instead of full fine-tuning) to fine-tune LLMs of a few billion parameters.
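
To see why LoRA is cheap, recall that it freezes a weight matrix of shape `d_out x d_in` and trains two low-rank factors of shapes `d_out x r` and `r x d_in` instead. The sketch below only counts parameters; the dimensions (4096) and the rank (8) are illustrative assumptions, not values taken from a specific model.

```python
def full_params(d_out, d_in):
    """Number of trainable parameters when the whole matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Number of trainable parameters of the two LoRA factors (B: d_out x r, A: r x d_in)."""
    return rank * (d_out + d_in)


d_out, d_in, rank = 4096, 4096, 8  # hypothetical layer size and rank
full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
ratio = lora / full
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {ratio:.4%}")
```

Since gradients and optimizer states are only needed for trainable parameters, shrinking the trainable set this way is what reduces the fine-tuning memory footprint.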

Much simpler solution: stick to smaller models with hundreds of millions of parameters (e.g. BERT, GPT-2, T5).
You're not here to beat the state of the art but to learn NLP.

In [2]:
import torch
import os

In [3]:
assert torch.cuda.is_available(), "Connect to GPU and try again"

# Data
We will use the Hybrid subset of Liyanage et al., in which some sentences of human-written abstracts were replaced by automatically generated text. Experiments on the fully generated subsets (or any other dataset) may provide bonus points. (to do)

No train-test split is provided in the paper, but we keep 80% for training and 20% for testing, following Liyanage et al.

In [4]:
import shutil

# Path of the folder to delete (left over from a previous run)
dossier_a_supprimer = 'GeneratedTextDetection-main'

try:
  # Delete the folder and all its contents
  shutil.rmtree(dossier_a_supprimer)
  print(f"Folder {dossier_a_supprimer} was deleted successfully.")
except Exception:
  print(f"{dossier_a_supprimer} may not exist")
finally:
  print('downloading the dataset')

GeneratedTextDetection-main may not exist
downloading the dataset


In [5]:
!wget https://github.com/vijini/GeneratedTextDetection/archive/refs/heads/main.zip
!unzip main

--2025-02-26 08:13:57--  https://github.com/vijini/GeneratedTextDetection/archive/refs/heads/main.zip
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/vijini/GeneratedTextDetection/zip/refs/heads/main [following]
--2025-02-26 08:13:57--  https://codeload.github.com/vijini/GeneratedTextDetection/zip/refs/heads/main
Resolving codeload.github.com (codeload.github.com)... 20.205.243.165
Connecting to codeload.github.com (codeload.github.com)|20.205.243.165|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘main.zip’

main.zip                [ <=>                ] 800.25K  --.-KB/s    in 0.07s   

2025-02-26 08:13:57 (11.0 MB/s) - ‘main.zip’ saved [819461]

Archive:  main.zip
ab034465f857a93212a894fe598edb749345b6ff
   creating: GeneratedTextDetection-main/
  inflati

In [6]:
from pathlib import Path

In [7]:
root = Path("GeneratedTextDetection-main/Dataset/Hybrid_AbstractDataset")

In [8]:
train_texts, train_labels, test_texts, test_labels = [], [], [], []
for path in root.glob("*.txt"):
    # Read as UTF-8 explicitly so the result does not depend on the platform default
    with open(path, 'rt', encoding='utf-8') as file:
        text = file.read()
        text = text.lstrip('\ufeff')
    label = int(path.name.endswith("generatedAbstract.txt"))
    doc_id = int(path.name.split("_")[0].split(".")[-1])
    if doc_id < 10522:
        test_texts.append(text)
        test_labels.append(label)
    else:
        train_texts.append(text)
        train_labels.append(label)

In [9]:
len(train_texts), len(train_labels), len(test_texts), len(test_labels)

(160, 160, 40, 40)
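
The label and the split are both derived from the file name in the loading cell. The sketch below isolates that parsing so it can be checked on its own. The file names used here are hypothetical (including the `originalAbstract` suffix); only the `generatedAbstract.txt` suffix and the threshold `10522` come from the cell above.

```python
from pathlib import Path


def parse_name(path, first_train_id=10522):
    """Return (doc_id, label, split) using the same rules as the loading cell."""
    name = Path(path).name
    # Label 1 = contains generated text, 0 = human-written
    label = int(name.endswith("generatedAbstract.txt"))
    # Take the text before the first "_", then the digits after its last "."
    doc_id = int(name.split("_")[0].split(".")[-1])
    split = "test" if doc_id < first_train_id else "train"
    return doc_id, label, split


# Hypothetical file names, for illustration only
print(parse_name("2111.10400_generatedAbstract.txt"))
print(parse_name("2111.10522_originalAbstract.txt"))
```

Because the split depends only on the document identifier, the two versions of the same abstract always fall on the same side of the split.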

In [10]:
train_texts[0]

'Machine learning in medical imaging during clinical routine is impaired by changes in scan- ner protocols, hardware, or policies resulting in a heterogeneous set of acquisition settings. When training a deep learning model on an initial static training set, model performance and reliability suffer from changes of acquisition characteristics as data and targets may become inconsistent. Continual learning can help to adapt models to the changing environ- ment by training on a continuous data stream. However, continual manual expert labelling of medical imaging requires substantial effort. Thus, ways to use labelling resources ef- ficiently on a well chosen sub-set of new examples is necessary to render this strategy feasible. Here, we propose an integrated toolkit for automatic annotation of medical imaging , based on a deep embeddings framework for biomedical data, and present a method to automatically infer such annotation results using the full medical image corpus. The approach auto

In [11]:
train_labels[0]

1

In [12]:
train_texts[10]

'Public policies that supply public goods, especially those involve collaboration by limiting individual liberty, always give rise to controversies over governance legitimacy. Multi-Agent Reinforcement Learning (MARL) methods are appropriate for supporting the legitimacy of the public policies that supply public goods at the cost of individual interests. Among these policies, the inter-regional collaborative pandemic control is a prominent example, which has become much more important for an increasingly inter-connected world facing a global pandemic like COVID-19. Different patterns of collaborative strategies have been observed among different systems of regions, yet it lacks an analytical process to reason for the legitimacy of those strategies. In this paper, we use the inter-regional collaboration for pandemic control as an example to demonstrate the necessity of MARL in reasoning, and thereby legitimizing policies enforcing such inter-regional collaboration. Exper- imental result

In [13]:
train_labels[10]

0

# Good luck!

It's now up to you to solve the problem. You are free to choose any NLP method (cf. the list above),
but you should motivate your choice.
You can also compare several methods to get bonus points. (compare 3 methods)

# Submission instructions


**Deadline: Thursday 27th of February, 23:59 (Paris time, CET)** (strict deadline, 5-point penalty per day late, so 4 days late means 0/20)

This is a **group work** of **3 members**.

You will have to submit your **code** and a **report** which will be graded (instructions below) by email to lerner@isir.upmc.fr.

The homework (continuous assessment) will account for 50% of your final grade.

## Report

The report should be **a single .pdf file of max. 4 pages** (concision is key).
Please name the pdf with the name of your group as written in the spreadsheet https://docs.google.com/spreadsheets/d/1UbApMhPC_wof-GoByjkV7kgD5YMbjcFFPqPUCB0YRtQ/edit?usp=sharing for example `ABC.pdf`.

It should follow the following structure:

### Introduction
A few sentences placing the work in context. Limit it to a few paragraphs at most; since your report is based on Liyanage et al., you don't have to motivate that work. However, it should be clear enough what Liyanage et al. is about and what its contributions are.

### Methodology

Describe the methods you are using to tackle the problem and motivate them: why this method and not another?  
What are its advantages and drawbacks?  
What experiments are you running to measure the efficiency or effectiveness of your method?

#### Model Descriptions
Describe the models you used, including the architecture, learning objective and the number of parameters.

#### Datasets
Describe the datasets you used and how you obtained them.

#### Hyperparameters
Describe how you set the hyperparameters and what was the source for their value (e.g., paper, code, or your guess).

#### Implementation
Describe whether you use existing code or write your own code.

#### Experimental Setup
Explain how you ran your experiments, e.g. the CPU/GPU resources.

### Results
Start with a high-level overview of your results. Keep this section as factual and precise as possible. Logically group related results into sections.

Remember to add plots and diagrams to illustrate your methods or results if necessary.



### Discussion

Describe which parts of your project were difficult or took much more time than you expected.


### Contributions

You should state the contributions of each member of the group.



## Code

You can submit your code either as:

- single .zip file with your entire source code (e.g. several .py files)
- link to a GitHub/GitLab repository (in this case, **include the link in your .pdf report**)
- link to a Google Colab Notebook (your code may be quite simple so it may fit in a single notebook;
  likewise, in this case, **include the link in your .pdf report**)

# Let's start

## Class distribution

In [14]:
from collections import Counter

# Show the class distribution in the training and test sets
train_distribution = Counter(train_labels)
test_distribution = Counter(test_labels)

print("Data distribution in training set :", train_distribution)
print("Data distribution in test set :", test_distribution)


Data distribution in training set : Counter({1: 80, 0: 80})
Data distribution in test set : Counter({1: 20, 0: 20})


## Conclusion: perfectly balanced dataset
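
A balanced dataset also fixes the reference point for every later result: a classifier that always predicts the same class reaches 50% accuracy. A minimal sketch of that majority-class baseline, rebuilding a label list with the same counts as the test set reported above (the variable names are chosen for this example):

```python
from collections import Counter

# Same class counts as the test set above: 20 generated (1), 20 human-written (0)
toy_test_labels = [1] * 20 + [0] * 20

# Predict the most frequent class for every document
majority_label, majority_count = Counter(toy_test_labels).most_common(1)[0]
baseline_accuracy = majority_count / len(toy_test_labels)
print(f"Always predicting {majority_label} gives accuracy {baseline_accuracy:.2f}")
```

Any model evaluated on this test set should therefore be compared against 0.50, not against 0.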

# Data Preprocessing

In [15]:
import nltk

#nltk local download
nltk_data_path = '/content/nltk_data'
os.makedirs(nltk_data_path, exist_ok=True)
nltk.data.path.append(nltk_data_path)
nltk.download('punkt_tab', download_dir=nltk_data_path)
nltk.download('stopwords', download_dir=nltk_data_path)
print("NLTK search paths:", nltk.data.path)

NLTK search paths: ['/root/nltk_data', '/usr/nltk_data', '/usr/share/nltk_data', '/usr/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data', '/content/nltk_data']


[nltk_data] Downloading package punkt_tab to /content/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /content/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [16]:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [17]:
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercase the text
    text = text.lower()

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stop words
    tokens = [token for token in tokens if token not in stop_words]

    return tokens

In [18]:
tokenized_train_texts = [preprocess_text(text) for text in train_texts]
print("Preprocessed tokens:", tokenized_train_texts[0][:10])

Preprocessed tokens: ['machine', 'learning', 'medical', 'imaging', 'clinical', 'routine', 'impaired', 'changes', 'scan', 'ner']


# Feature extraction

## Document-term matrices

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the vectorizer
vectorizer_bow = CountVectorizer()

# Turn the training and test texts into count matrices
# (the vocabulary is learned on the training set only)
X_train_bow = vectorizer_bow.fit_transform(train_texts)
X_test_bow = vectorizer_bow.transform(test_texts)

print("Training matrix shape (bag-of-words):", X_train_bow.shape)


Training matrix shape (bag-of-words): (160, 3180)
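
To make the shape above concrete (one row per document, one column per vocabulary word), here is a minimal pure-Python sketch of a bag-of-words matrix. It splits on whitespace only, which is a simplification of what `CountVectorizer` does by default, and the two documents are toy examples.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of documents."""
    tokenized = [doc.lower().split() for doc in docs]
    # One column per distinct token, in alphabetical order
    vocabulary = sorted({token for doc in tokenized for token in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[token] for token in vocabulary])
    return vocabulary, matrix


toy_vocab, toy_matrix = bag_of_words(
    ["deep learning for text", "text text classification"]
)
print(toy_vocab)
print(toy_matrix)
```

Word order is lost in this representation: only the counts remain.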


In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate the TF-IDF vectorizer
vectorizer_tfidf = TfidfVectorizer()

# Turn the training and test texts into TF-IDF matrices
# (vocabulary and IDF weights are learned on the training set only)
X_train_tfidf = vectorizer_tfidf.fit_transform(train_texts)
X_test_tfidf = vectorizer_tfidf.transform(test_texts)

print("Training matrix shape (TF-IDF):", X_train_tfidf.shape)


Training matrix shape (TF-IDF): (160, 3180)
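
The TF-IDF matrix has the same shape as the count matrix because only the values change: each count is reweighted by how rare the word is across documents. The sketch below follows the weighting that scikit-learn documents for its default settings (smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). Tokenization is simplified to whitespace splitting and the documents are toy examples.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, idf weights, L2-normalized TF-IDF rows)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears
    df = Counter(token for doc in tokenized for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(value * value for value in row)) or 1.0
        rows.append([value / norm for value in row])
    return vocabulary, idf, rows


toy_vocab, toy_idf, toy_rows = tfidf(["generated text", "human text"])
print(toy_vocab)
print(toy_idf)
print(toy_rows)
```

In this toy corpus, "text" appears in every document and gets the minimum weight of 1, while "generated" and "human" are each specific to one document and get a higher weight.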


In [21]:
!pip install lazypredict

Collecting lazypredict
  Downloading lazypredict-0.2.13-py2.py3-none-any.whl.metadata (12 kB)
Downloading lazypredict-0.2.13-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.13


In [22]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from lazypredict.Supervised import LazyClassifier

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



In [25]:
import psutil
import time

def training(X_train, X_test, Y_train, Y_test):
    # Report CPU, RAM and GPU usage before training
    print("CPU usage before training:", psutil.cpu_percent(interval=1), "%")
    print("Virtual memory before training:", psutil.virtual_memory())
    print("GPU usage before training:")
    !nvidia-smi

    # Fit every LazyPredict classifier on dense copies of the sparse feature matrices
    clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
    models, predictions = clf.fit(X_train.toarray(), X_test.toarray(), Y_train, Y_test)

    # Report CPU and RAM usage after training
    print("CPU usage after training:", psutil.cpu_percent(interval=1), "%")
    print("Virtual memory after training:", psutil.virtual_memory())

    # Report GPU usage after training
    print("GPU usage after training:")
    !nvidia-smi

    return models, predictions
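
The function above reports CPU, memory and GPU usage but not the elapsed time of the whole comparison. A minimal sketch of a timing helper based on the standard library follows; the name `timed`, the `timings` dictionary and the toy workload are assumptions made for this example.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration (in seconds) of the enclosed block in `results[label]`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("toy workload", timings):
    total = sum(i * i for i in range(100_000))  # stand-in for a call to training(...)

print(f"toy workload took {timings['toy workload']:.4f} s")
```

Wrapping each call to `training` this way would give one comparable duration per feature set, which fits the "Experimental Setup" part of the report.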

# LazyPredict model comparison (TF-IDF)

In [26]:
tfidf_models,tfidf_predictions=training(X_train_tfidf, X_test_tfidf, train_labels, test_labels)


CPU usage before training: 2.0 %
Virtual memory before training: svmem(total=13609431040, available=11800244224, percent=13.3, used=1463853056, free=5728792576, active=786243584, inactive=6530990080, buffers=405987328, cached=6010798080, shared=12111872, slab=262754304)
GPU usage before training:
Wed Feb 26 08:22:13 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   

100%|██████████| 32/32 [00:08<00:00,  3.87it/s]

[LightGBM] [Info] Number of positive: 80, number of negative: 80
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000386 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2553
[LightGBM] [Info] Number of data points in the train set: 160, number of used features: 138
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000





CPU usage after training: 29.8 %
Virtual memory after training: svmem(total=13609431040, available=11794833408, percent=13.3, used=1469263872, free=5722873856, active=786444288, inactive=6541430784, buffers=406224896, cached=6011068416, shared=12111872, slab=262901760)
GPU usage after training:
Wed Feb 26 08:22:22 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   

In [29]:
tfidf_models

| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | Time Taken (s) |
|---|---|---|---|---|---|
| XGBClassifier | 0.68 | 0.68 | 0.68 | 0.67 | 0.7 |
| LGBMClassifier | 0.62 | 0.62 | 0.62 | 0.62 | 0.2 |
| BernoulliNB | 0.62 | 0.62 | 0.62 | 0.62 | 0.08 |
| SGDClassifier | 0.6 | 0.6 | 0.6 | 0.6 | 0.17 |
| BaggingClassifier | 0.6 | 0.6 | 0.6 | 0.6 | 0.25 |
| Perceptron | 0.57 | 0.57 | 0.58 | 0.57 | 0.12 |
| RandomForestClassifier | 0.57 | 0.57 | 0.57 | 0.55 | 0.33 |
| NearestCentroid | 0.57 | 0.57 | 0.57 | 0.57 | 0.1 |
| LinearDiscriminantAnalysis | 0.55 | 0.55 | 0.55 | 0.55 | 0.24 |
| QuadraticDiscriminantAnalysis | 0.55 | 0.55 | 0.55 | 0.55 | 0.16 |
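
The Accuracy and Balanced Accuracy columns are identical in every row. This is expected on a perfectly balanced test set: balanced accuracy is the mean of the per-class recalls, and with equal class sizes that mean equals plain accuracy. The sketch below checks this on hypothetical confusion counts (20 documents per class, as in the test set); the counts themselves are invented for illustration and are not the predictions of any model above. The F1 computed here is the F1 of the positive class, which may differ from the averaging LazyPredict uses.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, balanced accuracy and positive-class F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall_pos = tp / (tp + fn)  # recall on class 1 (generated)
    recall_neg = tn / (tn + fp)  # recall on class 0 (human-written)
    balanced_accuracy = (recall_pos + recall_neg) / 2
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall_pos / (precision + recall_pos)
    return {"accuracy": accuracy, "balanced_accuracy": balanced_accuracy, "f1": f1}


# Hypothetical counts: 20 positives (tp + fn) and 20 negatives (tn + fp)
metrics = binary_metrics(tp=14, fp=7, tn=13, fn=6)
print(metrics)
```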


# LazyPredict model comparison (bag-of-words)

In [30]:
bow_models,bow_predictions=training(X_train_bow, X_test_bow, train_labels, test_labels)


CPU usage before training: 2.5 %
Virtual memory before training: svmem(total=13609431040, available=11764404224, percent=13.6, used=1499684864, free=5544263680, active=793526272, inactive=6721712128, buffers=410009600, cached=6155472896, shared=12111872, slab=270540800)
GPU usage before training:
Wed Feb 26 08:24:19 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   

100%|██████████| 32/32 [00:07<00:00,  4.03it/s]


[LightGBM] [Info] Number of positive: 80, number of negative: 80
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000470 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 735
[LightGBM] [Info] Number of data points in the train set: 160, number of used features: 138
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
CPU usage after training: 27.3 %
Virtual memory after training: svmem(total=13609431040, available=11756982272, percent=13.6, used=1507115008, free=5536591872, active=793731072, inactive=6748086272, buffers=410202112, cached=6155522048, shared=12111872, slab=270581760)
GPU usage after training:
Wed Feb 26 08:24:28 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15    

In [32]:
bow_models

| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | Time Taken (s) |
|---|---|---|---|---|---|
| ExtraTreesClassifier | 0.65 | 0.65 | 0.65 | 0.64 | 0.33 |
| RandomForestClassifier | 0.65 | 0.65 | 0.65 | 0.65 | 0.32 |
| BernoulliNB | 0.62 | 0.62 | 0.62 | 0.62 | 0.08 |
| LGBMClassifier | 0.6 | 0.6 | 0.6 | 0.6 | 0.19 |
| XGBClassifier | 0.6 | 0.6 | 0.6 | 0.6 | 0.54 |
| ExtraTreeClassifier | 0.6 | 0.6 | 0.6 | 0.6 | 0.07 |
| LinearSVC | 0.57 | 0.57 | 0.57 | 0.57 | 0.88 |
| RidgeClassifierCV | 0.57 | 0.57 | 0.57 | 0.57 | 0.27 |
| RidgeClassifier | 0.57 | 0.57 | 0.57 | 0.57 | 0.26 |
| DecisionTreeClassifier | 0.55 | 0.55 | 0.55 | 0.55 | 0.12 |
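
With only 40 test documents, differences of a few accuracy points between models should be read with caution. A rough way to quantify this is a normal-approximation confidence interval for a proportion. The sketch below applies it to the best accuracies reported for each feature set (0.68 for TF-IDF, 0.65 for bag-of-words); the 95% level (`z = 1.96`) and the approximation itself are assumptions, and the approximation is only indicative for such a small sample.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy measured on n examples."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error


n_test = 40  # size of the test set
for name, accuracy in [("TF-IDF best", 0.68), ("bag-of-words best", 0.65)]:
    low, high = accuracy_interval(accuracy, n_test)
    print(f"{name}: {accuracy:.2f} (approx. 95% interval {low:.2f} to {high:.2f})")

tfidf_low, tfidf_high = accuracy_interval(0.68, n_test)
```

The two intervals overlap almost entirely, so these results alone do not establish that one feature set is better than the other; cross-validation on the training set would give a more stable comparison.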
