This notebook (.ipynb) highlights topics that are relevant to our project: what we learned in class and how we applied that knowledge to the small classification task in front of us.

The challenges of the project are the same as those of NLP in general: the productivity, ambiguity, variability, diversity and sparsity of language.


# STEP 2-TOKENIZATION

The diversity of languages in the data makes the choice of tokeniser more difficult. We need a multilingual tokeniser that can handle texts in many different languages. We chose BERT Multilingual (mBERT).

In [1]:
from transformers import BertTokenizer

# Load the mBERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

# Tokenize a sample text (a French sentence)
example_text = "Ceci est un exemple de texte."
tokens = tokenizer.tokenize(example_text)
print(tokens)





['Ceci', 'est', 'un', 'exemple', 'de', 'texte', '.']
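
The BERT tokenizer splits words into WordPiece subwords using a greedy longest-match-first search over its vocabulary. The sketch below illustrates that idea on a toy scale. The function `wordpiece_tokenize` and the vocabulary `toy_vocab` are hypothetical names made up for illustration, not part of `transformers`, and the toy vocabulary is not the real mBERT vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption for this example only).
toy_vocab = {"token", "##isation", "##s", "texte"}

print(wordpiece_tokenize("tokenisation", toy_vocab))  # ['token', '##isation']
print(wordpiece_tokenize("texte", toy_vocab))         # ['texte']
print(wordpiece_tokenize("قال", toy_vocab))           # ['[UNK]']
```

The last call shows what happens when no piece of a word is in the vocabulary: the word collapses to the unknown token.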


Some texts exceed the limit of 512, as we saw in part 1 (Exploratory Data Analysis). 512 tokens is the maximum sequence length that the multilingual BERT model can handle. However, we assumed that 512 "characters" (strictly speaking, tokens) are enough for our model to guess the language. This is why we decided to truncate all texts/lines to 512 tokens each. Very short texts are padded so that they also reach 512 tokens. This is what we call text normalisation.
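
As a rough illustration of what truncation and padding do to a sequence of token IDs, here is a minimal sketch in plain Python. The function name `pad_or_truncate` is hypothetical, and the padding ID of 0 is an assumption for this example. The sketch also ignores special tokens such as `[CLS]` and `[SEP]`, which the real tokenizer handles itself.

```python
def pad_or_truncate(token_ids, max_length=512, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]        # truncation: drop everything past the limit
    n_pad = max_length - len(kept)       # padding needed for short inputs
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Short input: padded up to the target length.
ids, mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
print(ids)   # [101, 7, 8, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]

# Long input: cut down to the target length.
ids, mask = pad_or_truncate(list(range(10)), max_length=6)
print(ids)   # [0, 1, 2, 3, 4, 5]
print(mask)  # [1, 1, 1, 1, 1, 1]
```

The attention mask is what lets the model tell real tokens from padding once every input has the same length.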

In [2]:
from transformers import BertTokenizer
from datasets import load_dataset

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

# Load the CSV file with `datasets`
dataset = load_dataset('csv', data_files={'train': 'train_cleaned_numeric_labels.csv'}, delimiter=',')

# Tokenization function: truncate and pad every text to exactly 512 tokens
def tokenize_function(examples):
    return tokenizer(examples['Text'], padding="max_length", truncation=True, max_length=512)

# Apply the tokenization to the 'Text' column only
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Check the first entry after tokenization
print(tokenized_datasets['train'][0])



Generating train split: 190099 examples [00:04, 44019.51 examples/s]
Map: 100%|██████████| 190099/190099 [10:19<00:00, 306.63 examples/s]

{'Usage': 'Public', 'Text': 'َ قَالَ النَّبِيُّ ص إِنِّي أَتَعَجَّبُ مِمَّنْ يَضْرِبُ امْرَأَتَهُ وَ هُوَ بِالضَّرْبِ أَوْلَى مِنْهَا لَا تَضْرِبُوا نِسَاءَكُمْ بِالْخَشَبِ فَإِنَّ فِيهِ الْقِصَاصَ وَ لَكِنِ اضْرِبُوهُنَّ بِالْجُوعِ وَ الْعُرْيِ حَتَّى تُرِيحُوا [تَرْبَحُوا] فِي الدُّنْيَا وَ الْآخِرَةِ وَ أَيُّمَا رَجُلٍ تَتَزَيَّنُ امْرَأَتُهُ وَ تَخْرُجُ مِنْ بَابِ دَارِهَا فَهُوَ دَيُّوثٌ وَ لَا يَأْثَمُ مَنْ يُسَمِّيهِ دَيُّوثاً وَ الْمَرْأَةُ إِذَا خَرَجَتْ مِنْ بَابِ دَارِهَا مُتَزَيِّنَةً مُتَعَطِّرَةً وَ الزَّوْجُ بِذَلِكَ رَاضٍ يُبْنَى لِزَوْجِهَا بِكُلِّ قَدَمٍ بَيْتٌ فِي النَّارِ فَقَصِّرُوا أَجْنِحَةَ نِسَائِكُمْ وَ لَا تُطَوِّلُوهَا فَإِنَّ فِي تَقْصِيرِ أَجْنِحَتِهَا رِضًى وَ سُرُوراً وَ دُخُولَ الْجَنَّةِ بِغَيْرِ حِسَابٍ احْفَظُوا وَصِيَّتِي فِي أَمْرِ نِسَائِكُمْ حَتَّى تَنْجُوا مِنْ شِدَّةِ الْحِسَابِ وَ مَنْ لَمْ يَحْفَظْ وَصِيَّتِي فَمَا أَسْوَأَ حَالَهُ بَيْنَ يَدَيِ اللَّهِ وَ قَالَ ع النِّسَاءُ حَبَائِلُ الشَّيْطَان', 'Label': 1, 'text_length': 924, 'input_ids':




Our first attempt with the BERT tokeniser was not very fruitful. The tokeniser failed to identify Arabic words, for example, and produced a single token for the whole sentence. We decided to try another tokeniser.
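
One way to quantify this kind of failure is to measure how many of the produced tokens are the unknown token. The sketch below assumes that the single token observed was `[UNK]`, which the output above does not show. The function `unknown_ratio` and the sample token lists are hypothetical and written by hand, not actual tokenizer output.

```python
def unknown_ratio(tokens, unk_token="[UNK]"):
    """Fraction of tokens equal to the unknown token (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token == unk_token)
    return unknown / len(tokens)


# Hand-written token lists, for illustration only.
well_tokenized = ["Ceci", "est", "un", "exemple", "de", "texte", "."]
badly_tokenized = ["[UNK]"]

print(unknown_ratio(well_tokenized))   # 0.0
print(unknown_ratio(badly_tokenized))  # 1.0
```

Applied to the result of `tokenizer.tokenize(text)`, a ratio close to 1.0 would suggest that the tokenizer's vocabulary does not cover the text.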

In [1]:
# Note: XLMRobertaTokenizer needs the SentencePiece library (pip install sentencepiece)
from transformers import XLMRobertaTokenizer
from datasets import load_dataset

# Load the XLM-RoBERTa tokenizer
tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

# Load the dataset
dataset = load_dataset('csv', data_files={'train': 'train_cleaned_numeric_labels.csv'}, delimiter=',')

# Display the first text of the dataset as a sanity check
text = dataset['train'][0]['Text']
print("Original text:")
print(text)

# Tokenize the first text manually
tokens = tokenizer.tokenize(text)
print("\nGenerated tokens:")
print(tokens)

# Tokenization function for the dataset
# (return_tensors is omitted: call set_format("torch") later if PyTorch tensors are needed)
def tokenize_function(examples):
    return tokenizer(
        examples['Text'],
        padding="max_length",
        truncation=True,
        max_length=512
    )

# Apply the tokenization to the whole dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True, batch_size=1000)

# Display the tokenization results for the first example
print("\nToken IDs for the first example:")
print(tokenized_datasets['train'][0]['input_ids'])

print("\nAttention mask for the first example:")
print(tokenized_datasets['train'][0]['attention_mask'])

# Optional: convert the IDs back to tokens as a check
input_ids = tokenized_datasets['train'][0]['input_ids']
tokens = tokenizer.convert_ids_to_tokens(input_ids)
print("\nTokens rebuilt from the IDs:")
print(tokens)



ImportError: 
XLMRobertaTokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.
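
The error above means the cell never ran, because `sentencepiece` was not installed. A small pre-flight check like the one below could catch this before loading the tokenizer. It is only a sketch: the helper name `missing_packages` and the list of required packages are assumptions based on the imports used in this notebook.

```python
import importlib.util


def missing_packages(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages assumed to be needed by the XLM-RoBERTa cell above.
required = ["sentencepiece", "transformers", "datasets"]

missing = missing_packages(required)
if missing:
    print("Install before running the cell: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

As the error message notes, the runtime may need to be restarted after installing the missing package.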


This step was not mandatory. However, for languages such as Arabic, Chinese or Turkish, which have complex morphology or do not separate words with spaces, tokenising simplifies the problem considerably.
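
To illustrate why whitespace alone is not enough for languages that do not separate words, the sketch below compares a naive whitespace split with a character-level split. Both helper functions are hypothetical and use only the standard library. A character-level split is just a simple baseline here, not what the tokenizers above do.

```python
def whitespace_tokens(text):
    """Split on whitespace, as a naive word tokenizer would."""
    return text.split()


def character_tokens(text):
    """Treat every non-space character as its own token."""
    return [ch for ch in text if not ch.isspace()]


english = "I like learning"
chinese = "我喜欢学习"  # roughly the same sentence, written without spaces

print(whitespace_tokens(english))  # ['I', 'like', 'learning']
print(whitespace_tokens(chinese))  # ['我喜欢学习']
print(character_tokens(chinese))   # ['我', '喜', '欢', '学', '习']
```

The whitespace split returns the whole Chinese sentence as one token, which is why a subword tokenizer is useful for this kind of text.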