## Loading the dataset and building a training dataset made of a dict with 2 keys: text and label

In [4]:
import csv
import pandas as pd

file_path_train = "./data/train_submission.csv"
file_path_test = "./data/test_without_labels.csv"

data_train = pd.read_csv(file_path_train)


In [5]:
data_train

Unnamed: 0,Usage,Text,Label
0,Public,َ قَالَ النَّبِيُّ ص إِنِّي أَتَعَجَّبُ مِمَّن...,hau
1,Public,Filmen forteller historien om Will Hunting en...,nob
2,Public,An Arthrostylidium berryi in uska species han ...,wln
3,Public,Kancunarí enemigosniyquichejta munacuychej al...,quh
4,Public,Warmeqa ama yachachichunchu hermanospa tantaku...,quh
...,...,...,...
190594,Public,Publié par Masken à 22:46 Aucun commentaire:,hat
190595,Public,ειπεν δε προς τους μαθητας ελευσονται ημεραι ο...,grc
190596,Public,Ya bay boch ban’en ni kug rung’aged ni ga be y...,yap
190597,Public,P'alimentase nun absuerben el sangre sinón qu...,ast


## Analysis of the training data

In [6]:
data_train_without_label = data_train[data_train["Label"].isna()]

In [7]:
data_train_without_label

Unnamed: 0,Usage,Text,Label
107,Public,Kòe bô jōa kú hō͘-sū sió-chiá lâi kā góan mn̄...,
803,Public,Söğütçük sī chi̍t ê tī Türkiye Aydın séng Çine...,
1095,Public,Golden Valley Kūn ū khó-lêng sī kóng:,
1894,Public,Tī Montégut-Lauragais ê sì-ûi ū Nogaret Revel...,
2499,Public,Soveria Simeri ùi séng lāi ê hoān-ûi.,
...,...,...,...
189637,Public,Bellebrune sī ūi-tī Hoat-kok Nord-Pas-de-Calai...,
189946,Public,Bô phah-sǹg tī sin-le̍k 10 go̍eh 29 hō ē-po͘ ...,
189959,Public,Wiejki sī chi̍t ê tī Pho-lân Kiōng-hô-kok Podl...,
190397,Public,Tī pún só͘-chāi sì-ûi ê tē-hng ū Valy Veselí ...,


500 instances have no label.

In [8]:
data_train_without_nan_for_label = data_train.dropna()

In [9]:
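# Note: unique() also counts NaN as a value, so the unlabeled rows add one to this total;
# data_train["Label"].nunique() would exclude them.
number_of_languages = len(data_train["Label"].unique())
print(f"There are {number_of_languages} different languages in the training dataset")

There are 390 different languages in the training dataset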


### Summary statistics on the labeled data

In [10]:
dataset_sorted_by_number_instances_by_language = data_train_without_nan_for_label.groupby("Label").count().sort_values('Usage', ascending=False)
dataset_sorted_by_number_instances_by_language

Unnamed: 0_level_0,Usage,Text
Label,Unnamed: 1_level_1,Unnamed: 2_level_1
tgk,1500,1500
guj,1000,1000
tat,1000,1000
crh,1000,1000
kaa,1000,1000
...,...,...
gil,2,2
toi,1,1
gaa,1,1
kua,1,1


The number of examples per language varies enormously. Some languages are over-represented (1,500 instances for the largest), while others have a single instance.

In [11]:
counts = dataset_sorted_by_number_instances_by_language
percentage_of_languages_with_at_least_100_instances = len(counts[counts["Usage"] >= 100]) / len(counts) * 100
print(f"The percentage of languages with at least 100 instances is {percentage_of_languages_with_at_least_100_instances}%")

The percentage of languages with at least 100 instances is 93.31619537275064%
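
The same share can be computed on any list of labels with the standard library alone. The sketch below is only an illustration: `share_with_at_least` and `toy_labels` are hypothetical names, and the toy data is made up rather than taken from the dataset.

```python
from collections import Counter

def share_with_at_least(labels, threshold):
    """Return the percentage of distinct labels that occur at least `threshold` times."""
    # Missing labels (None) are ignored, mirroring the dropna() step above.
    counts = Counter(label for label in labels if label is not None)
    if not counts:
        return 0.0
    frequent = sum(1 for n in counts.values() if n >= threshold)
    return frequent / len(counts) * 100

# Hypothetical toy data: three languages, one of them rare.
toy_labels = ["tgk"] * 5 + ["guj"] * 3 + ["toi"]
print(share_with_at_least(toy_labels, 3))
```

Lowering or raising `threshold` gives a quick view of how many languages would be affected by a minimum-support filter.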


## Preprocessing the training dataset

In [118]:
import string
import re 
import unicodedata

def cleaning(text): 
    """
    Preprocess the text: strip bracketed spans, URLs, capitalized words, digits, punctuation, symbols, and repeated whitespace, then lowercase.
    """

    if not isinstance(text, str):
        return ""

    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"\(.*?\)|\[.*?\]|\{.*?\}|['\"«»„“”‘’]|\<.*?\>", " ", text) # 1. Remove text enclosed in (), [], {}, <> and drop quote characters
    text = re.sub(r"https?://[^\s]+|www\.[^\s]+", " ", text) # 2. Remove URLs
    text = re.sub(r"\b[A-Z]+\d*[A-Z\d]*", " ", text) # 3. Remove acronyms such as "IK10", "ABC123", "X4D"; as written, the pattern also strips the leading ASCII capital(s) of any word
    text = re.sub(r"\b[A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]*", " ", text) # 4. Remove words that start with an uppercase letter (first names, proper nouns, etc.)
    text = re.sub(r"\d+", " ", text)  # 5. Remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # 6. Remove ASCII punctuation
    text = ''.join(c for c in text if unicodedata.category(c)[0] not in ["C", "S"])  # 7. Remove Unicode control characters and symbols (categories C and S), including emoji classified as symbols

    # Additional punctuation used in CJK text
    asian_punctuation = "，。？！《》【】（）；：、。"
    text = text.replace('-', ' ')
    text = text.translate(str.maketrans('', '', string.punctuation + asian_punctuation))

    # Remove emojis
    # Caution: the last range (U+24C2 to U+1F251) also covers CJK ideographs, kana, and Hangul,
    # so text in those scripts is removed here as well.
    emoji_pattern = re.compile(
        "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE
    )

    text_cleaned = emoji_pattern.sub(r'', text)
    text_cleaned = text_cleaned.lower()

    return text_cleaned
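
Two patterns in `cleaning` match more than their comments suggest. The final emoji range runs from U+24C2 to U+1F251 and therefore contains the CJK ideographs, and the acronym pattern matches any run of ASCII capitals at a word start. The sketch below shows one possible narrowing of each. It is an assumption, not the notebook's method: `NARROW_EMOJI_PATTERN`, `BROAD_PATTERN`, `ACRONYM_PATTERN`, and `strip_emojis` are hypothetical names, and the emoji list is not exhaustive.

```python
import re

# Narrower emoji ranges: the original blocks minus the broad U+24C2..U+1F251 span.
# This list is an assumption and is not an exhaustive emoji definition.
NARROW_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "\U00002702-\U000027B0"  # dingbats
    "]+"
)

# The broad range from the original pattern, kept here only for comparison.
BROAD_PATTERN = re.compile("[\U000024C2-\U0001F251]+")

# Acronym pattern that requires a leading capital AND at least one digit.
ACRONYM_PATTERN = re.compile(r"\b(?=[A-Z\d]*\d)[A-Z][A-Z\d]*\b")

def strip_emojis(text):
    """Remove emoji from `text` while leaving letters of any script untouched."""
    return NARROW_EMOJI_PATTERN.sub("", text)

sample = "茂原市處日本千葉縣 😀"
print(repr(strip_emojis(sample)))             # CJK text is kept
print(repr(BROAD_PATTERN.sub("", sample)))    # CJK text is removed as well
print(repr(ACRONYM_PATTERN.sub(" ", "Filmen IK10 NASA")))
```

Swapping either pattern into `cleaning` would change the preprocessed text, so the scores reported in this notebook would need to be recomputed.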


### Building a set of English words, used to strip English words from sentences that mix English with another language

In [13]:
import nltk
from nltk.corpus import words

# Download the English word list (only needed once)
nltk.download('words')

# Set of English words
english_words = set(word.lower() for word in words.words())


[nltk_data] Downloading package words to
[nltk_data]     /Users/hippolytelecomte/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [14]:
data_ang = data_train_without_nan_for_label[data_train_without_nan_for_label["Label"] == 'eng']["Text"]

# Add every word seen in the English-labeled texts to the set
for text in data_ang:
    for word in text.split():
        english_words.add(word.lower())



In [15]:
def remove_most_english_words(text): 
    """
    Remove English words from a text whose language is not English.
    """
    tokens = text.split() 
    filtered_tokens = [word for word in tokens if word.lower() not in english_words]

    return ' '.join(filtered_tokens)
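
As an illustration of how this filter behaves, the sketch below applies the same token test to a tiny vocabulary. `remove_words_in_set` and `toy_english_words` are hypothetical names, introduced so that the example runs without downloading the NLTK word list.

```python
def remove_words_in_set(text, vocabulary):
    """Drop whitespace-separated tokens whose lowercase form is in `vocabulary`."""
    tokens = text.split()
    kept = [token for token in tokens if token.lower() not in vocabulary]
    return " ".join(kept)

# Hypothetical stand-in for the much larger `english_words` set.
toy_english_words = {"the", "film", "is"}

mixed = "The film forteller historien om en ung mann"
print(remove_words_in_set(mixed, toy_english_words))
```

The test is purely lexical: any token spelled like an entry of the set is dropped, even when it is a genuine word of the other language.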

## First approach: CountVectorizer and MultinomialNB

Splitting the data into training and validation sets

In [16]:
from sklearn.model_selection import train_test_split
train_set, val_set = train_test_split(data_train_without_nan_for_label, test_size=0.2, random_state=42)

Applying the preprocessing to the whole dataframe

In [133]:
from tqdm import tqdm
tqdm.pandas()  

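def pre_processing(df, remove_espace = True, not_test = True, need_to_clean = True): 
    """
    Preprocess the 'Text' column of df in place and return df.

    need_to_clean: apply cleaning() to every text.
    not_test: remove English words from rows whose 'Label' is not 'eng' (requires a 'Label' column).
    remove_espace: delete all space characters from the text.
    """
    if need_to_clean: 
        df['Text'] = df['Text'].apply(cleaning)
    
    if not_test: 
        df['Text'] = df.progress_apply(
            lambda row: remove_most_english_words(row['Text']) if row['Label'] != 'eng' else row['Text'], axis=1
        )
    
    if remove_espace: 
        df['Text'] = df['Text'].str.replace(' ', '', regex=False)
    
    return df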


In [153]:
train_set_first_version = train_set.copy()
val_set_first_version = val_set.copy()
train_set_first_version = pre_processing(train_set_first_version, remove_espace=False)
val_set_first_version = pre_processing(val_set_first_version, remove_espace=False)

100%|██████████| 152079/152079 [00:01<00:00, 76878.05it/s]
100%|██████████| 38020/38020 [00:00<00:00, 80662.26it/s]


In [154]:
train_set_first_version.to_csv("train_set_preprocessed.csv", index=False)

In [88]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 6), max_features=10000)
x_train = train_set_first_version['Text'].tolist()
y_train = train_set_first_version['Label'].tolist()
x_val = val_set_first_version['Text'].tolist()
y_val = val_set_first_version['Label'].tolist()
y_total = y_train + y_val

# converting categorical variables to numerical
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(y_total)

y_train = le.transform(y_train)
y_val = le.transform(y_val)
label_mapping = dict(zip(le.classes_, range(len(le.classes_))))


x_train = vectorizer.fit_transform(x_train)
x_val = vectorizer.transform(x_val)

In [89]:
naive_bayes = MultinomialNB(alpha= 0.0001, fit_prior = False) 
naive_bayes.fit(x_train,y_train)

In [19]:
dataset_sorted_by_number_instances_by_language.loc["yue"]

Usage    500
Text     500
Name: yue, dtype: int64

In [90]:
from sklearn.metrics import accuracy_score, classification_report

predictions = naive_bayes.predict(x_val)
accuracy = accuracy_score(y_val, predictions)
print("Accuracy:", accuracy)


Accuracy: 0.7103366649132036
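
Because the classes are strongly imbalanced, accuracy alone can hide languages that are never predicted correctly. A minimal sketch of a macro-averaged F1 score in plain Python follows. `macro_f1` and the toy labels are hypothetical and serve only to show the gap between the two metrics.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, giving every class the same weight."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denominator = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denominator if denominator else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical toy data: the rare class "b" is never predicted.
toy_true = ["a", "a", "a", "a", "b"]
toy_pred = ["a", "a", "a", "a", "a"]
accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(accuracy, macro_f1(toy_true, toy_pred))
```

On this toy data the accuracy stays high while the macro F1 is pulled down by the missed class.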


In [21]:
import numpy as np

# Indices of the classes that appear in y_val or in the predictions
present_classes = np.unique(np.concatenate((y_val, predictions)))

# Keep only the matching class names
filtered_target_names = [le.classes_[i] for i in present_classes]

In [22]:
# Generate the classification report as a dictionary
report = classification_report(y_val, predictions, target_names=filtered_target_names, output_dict=True)

# Keep the entries whose value is a dict: this drops the scalar 'accuracy' entry but keeps 'macro avg' and 'weighted avg'
filtered_report = {label: metrics for label, metrics in report.items() if isinstance(metrics, dict)}

# Sort the languages by decreasing F1-score
sorted_report = sorted(filtered_report.items(), key=lambda x: x[1]['f1-score'], reverse=True)

# Print the sorted report
print("Classification Report (sorted by decreasing F1-score):\n")
for label, metrics in sorted_report:
    print(f"{label}: F1-score = {metrics['f1-score']:.4f}, Precision = {metrics['precision']:.4f}, Recall = {metrics['recall']:.4f}, Support = {metrics['support']}")


Classification Report (sorted by decreasing F1-score):

asm: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 99.0
bpy: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 95.0
bzj: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 104.0
cab: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 83.0
ctu: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 101.0
knv: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 93.0
mau: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 104.0
nav: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 93.0
top: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 95.0
csy: F1-score = 0.9959, Precision = 0.9917, Recall = 1.0000, Support = 120.0
naq: F1-score = 0.9955, Precision = 0.9911, Recall = 1.0000, Support = 111.0
mco: F1-score = 0.9954, Precision = 0.9908, Recall = 1.0000, Support = 108.0
xav: F1-score = 0.9954, Pr

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [23]:
val_set[val_set['Label'] == "lzh"]

Unnamed: 0,Usage,Text,Label
151638,Public,茂原市處日本千葉縣。,lzh
73107,Public,此四者，當世數學之四則定義也。,lzh
12069,Public,至於除夕，同賀新春，共食團年飯。亦制餃子、年糕、春捲等。,lzh
171980,Public,可黎可足，諱赤祖德贊，又作克黎可足、徠巴贍、熱巴中。吐蕃贊普，建元彝泰。其在位時，吐蕃文武極...,lzh
168788,Public,魏室於軍政之外，文學亦為大家。魏三祖皆有成於詩文，而至者為文帝弟植。謝靈運曰：「天下文才一石...,lzh
...,...,...,...
48276,Public,方事故時，適有記者在場。《東京朝日新聞》採其說，旦朝報之。據報道，躓踐之後，憲兵、警察官分散...,lzh
17778,Public,妹穆皇后莧,lzh
137368,Public,阮氏大越本紀,lzh
135298,Public,一月二十日者，公曆之二十日也，距歲暮餘三百又四十五日（閏年乃三百又四十六日）。,lzh


## Second approach: SentencePiece as the tokenizer

### Generating a raw .txt file to train SentencePiece

In [97]:
def minimal_preprocessing(text):
    # Convert to lowercase
    text = text.lower()
    # Collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text)
    return text

In [139]:
# Write the preprocessed "Text" column to a plain-text corpus, one sentence per line
# Note: this path differs from the './data/corpus_multilingue.txt' input passed to the SentencePiece trainer;
# the two must point to the same file.
corpus_path = "corpus_multilingue.txt"  # Output path for the corpus
data_train_preprocessed_for_corpus = data_train.copy()
data_train_preprocessed_for_corpus = pre_processing(data_train_preprocessed_for_corpus, remove_espace=False, not_test=True, need_to_clean=True)
data_train_preprocessed_for_corpus["Text"].dropna().to_csv(corpus_path, index=False, header=False, sep="\n")

print(f"Corpus saved: {corpus_path}, with {len(data_train)} sentences.")

100%|██████████| 190599/190599 [00:02<00:00, 76582.66it/s]


Corpus saved: corpus_multilingue.txt, with 190599 sentences.


### Training SentencePiece and loading the model

In [140]:
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    input='./data/corpus_multilingue.txt',  
    model_prefix='sp_model',
    vocab_size=60000,  
    character_coverage=1.0,  
    model_type='unigram'  
)

sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: ./data/corpus_multilingue.txt
  input_format: 
  model_prefix: sp_model
  model_type: UNIGRAM
  vocab_size: 60000
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_priva

In [141]:
sp = spm.SentencePieceProcessor(model_file='sp_model.model')

def sentencepiece_tokenize(text):
    """Tokenize a text into subword pieces with SentencePiece."""
    return ' '.join(sp.encode(text, out_type=str))

In [142]:
train_set_second_version = train_set.copy()
val_set_second_version = val_set.copy()
# train_set_second_version = pre_processing(train_set_second_version, remove_espace=False, not_test=True, need_to_clean=False)
# val_set_second_version = pre_processing(val_set_second_version, remove_espace=False, not_test=True, need_to_clean=False)

In [143]:
import unicodedata
from collections import defaultdict

# Fine-grained mapping from Unicode script keys to the script names used as label suffixes
SCRIPT_MAP = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillique",
    "ARABIC": "Arabe",
    "HEBREW": "Hébreu",
    "GREEK": "Grec",
    "DEVANAGARI": "Devanagari (Hindi, Sanskrit)",
    "HIRAGANA": "Hiragana (Japonais)",
    "KATAKANA": "Katakana (Japonais)",
    "CJK": "Kanji (Chinois, Japonais, Coréen)",
    "HANGUL": "Hangul (Coréen)",
    "THAI": "Thaï",
    "ARMENIAN": "Arménien",
    "GEORGIAN": "Géorgien",
    "ETHIOPIC": "Éthiopien",
    "TAMIL": "Tamoul",
    "BENGALI": "Bengali",
    "TELUGU": "Télougou",
}

def count_alphabet_characters(text):
    script_counts = defaultdict(int)

    for char in text:
        if char.isalpha():  # Ignore symbols and punctuation
            try:
                char_name = unicodedata.name(char)  # e.g. 'LATIN CAPITAL LETTER A'
                script_key = char_name.split()[0]  # Take the first word of the Unicode name

                if "CJK" in char_name:
                    script_key = "CJK"  # Han characters are named "CJK UNIFIED IDEOGRAPH-..."

                script_name = SCRIPT_MAP.get(script_key, script_key)  # Use the mapping, or keep the raw key
                script_counts[script_name] += 1  # Increment the counter

            except ValueError:
                continue  # The character has no Unicode name

    return dict(script_counts)  # Return the counts as a dictionary

def most_frequent_script(text):
    script_counts = count_alphabet_characters(text)  # Call the function above

    if script_counts:  # Check that the dictionary is not empty
        most_common_script = max(script_counts.items(), key=lambda x: x[1])  # Script with the most characters
        return most_common_script  # Return (script name, number of occurrences)
    else:
        return None  # No script found

from tqdm import tqdm

def add_alphabet_to_label(df):
    for index, row in tqdm(df.iterrows(), total=len(df)):  # Iterate over every row of the DataFrame
        alphabet_most_frequent = most_frequent_script(row['Text'])  # Detect the dominant script

        if alphabet_most_frequent:  # Check that a script was found
            df.at[index, 'Label'] = f"{row['Label']}_{alphabet_most_frequent[0]}"  # Append the script to the label

    return df
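
To see what the detection returns on concrete inputs, the sketch below condenses the two functions above into one standalone helper. `dominant_script` is a hypothetical name, and it returns the raw Unicode script key rather than the display names in `SCRIPT_MAP`.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Return the most frequent script key among alphabetic characters, or None."""
    counts = Counter()
    for char in text:
        if not char.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(char, "")  # empty string if the character is unnamed
        if not name:
            continue
        # Han characters are grouped under "CJK"; otherwise use the first word of the name.
        counts["CJK" if "CJK" in name else name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None

print(dominant_script("Publié par Masken"))
print(dominant_script("ειπεν δε προς τους μαθητας"))
print(dominant_script("茂原市處日本千葉縣。"))
print(dominant_script("12345 !?"))
```

A text without any alphabetic character yields `None`, which is why `add_alphabet_to_label` leaves such labels without a suffix.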


In [144]:
train_set_second_version = add_alphabet_to_label(train_set_second_version)
val_set_second_version = add_alphabet_to_label(val_set_second_version)

100%|██████████| 152079/152079 [00:12<00:00, 12465.21it/s]
100%|██████████| 38020/38020 [00:03<00:00, 12228.62it/s]


In [145]:
# Apply SentencePiece to the dataset
train_set_second_version['Text'] = train_set_second_version['Text'].progress_apply(sentencepiece_tokenize)
val_set_second_version['Text'] = val_set_second_version['Text'].progress_apply(sentencepiece_tokenize)


100%|██████████| 152079/152079 [00:07<00:00, 20590.64it/s]
100%|██████████| 38020/38020 [00:01<00:00, 20673.83it/s]


In [146]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

vectorizer_sp = TfidfVectorizer(analyzer="char", ngram_range=(1, 4), max_features=200000)
naive_bayes_sp = MultinomialNB(alpha= 0.001, fit_prior = False) 

pipeline = Pipeline([
    ('tfidf', vectorizer_sp),
    ('mnb', naive_bayes_sp)
])

x_train_sp = train_set_second_version['Text'].tolist()
y_train_sp = train_set_second_version['Label'].tolist()
x_val_sp = val_set_second_version['Text'].tolist()
y_val_sp = val_set_second_version['Label'].tolist()
y_total_sp = y_train_sp + y_val_sp

# converting categorical variables to numerical
from sklearn.preprocessing import LabelEncoder
le_sp = LabelEncoder()
le_sp.fit(y_total_sp)

y_train_sp = le_sp.transform(y_train_sp)
y_val_sp = le_sp.transform(y_val_sp)
label_mapping = dict(zip(le_sp.classes_, range(len(le_sp.classes_))))


pipeline.fit(x_train_sp, y_train_sp)
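
With `analyzer="char"`, the vectorizer builds its features from the characters of the space-joined pieces, so the SentencePiece boundaries only show up as the space and `▁` characters inside the n-grams, not as whole-token features. The sketch below is a simplified approximation of that extraction in plain Python. `char_ngrams` is a hypothetical helper, not the scikit-learn implementation.

```python
def char_ngrams(text, n_min, n_max):
    """List every character n-gram of `text` for n in [n_min, n_max]."""
    text = " ".join(text.lower().split())  # lowercase and collapse whitespace
    ngrams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            ngrams.append(text[start:start + n])
    return ngrams

# Two pieces taken from the tokenized validation output shown in this notebook.
pieces = "▁kaya ▁vua"
print(char_ngrams(pieces, 1, 2))
```

Treating each piece as one feature instead would require a word-level analyzer that splits on spaces, which is a different design from the one used here.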

In [147]:
from sklearn.metrics import accuracy_score, classification_report

predictions_sp = pipeline.predict(x_val_sp)
accuracy_sp = accuracy_score(y_val_sp, predictions_sp)
print("Accuracy:", accuracy_sp)

Accuracy: 0.8501841136244082


In [148]:
predicted_labels_sp = le_sp.inverse_transform(predictions_sp)
labels_to_predict = le_sp.inverse_transform(y_val_sp)

In [149]:
import numpy as np 

def restore_original_label(label):
    return label.split("_")[0]  # Keep only the part before the first '_'

def restore_labels(liste):
    new_liste = []
    for element in tqdm(liste): 
        new_liste.append(restore_original_label(element))
    return np.array(new_liste)


In [150]:
final_prediction = restore_labels(predicted_labels_sp)
val_to_predict = restore_labels(labels_to_predict)
final_accuracy = accuracy_score(val_to_predict, final_prediction)
print("Accuracy:", final_accuracy)


100%|██████████| 38020/38020 [00:00<00:00, 997269.85it/s]
100%|██████████| 38020/38020 [00:00<00:00, 493087.44it/s]

Accuracy: 0.8521304576538664
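
The score rises slightly once the script suffix is removed, because a prediction with the right language and a different script counts as an error on the composite labels and as a hit on the language codes. The sketch below reproduces that effect on made-up labels. The `accuracy` helper and the toy lists are hypothetical.

```python
def restore_original_label(label):
    """Keep the language code that precedes the first underscore."""
    return label.split("_", 1)[0]

def accuracy(y_true, y_pred):
    """Fraction of positions where the two sequences agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical composite labels in the "<language>_<script>" form used above.
toy_true = ["ron_Latin", "yid_Hébreu", "srp_Cyrillique"]
toy_pred = ["ron_Latin", "yid_Hébreu", "srp_Latin"]

composite_accuracy = accuracy(toy_true, toy_pred)
language_accuracy = accuracy(
    [restore_original_label(label) for label in toy_true],
    [restore_original_label(label) for label in toy_pred],
)
print(composite_accuracy, language_accuracy)
```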





In [151]:
present_classes_sp = np.unique(np.concatenate((y_val_sp, predictions_sp)))

# Keep only the matching class names
filtered_target_names_sp = [le_sp.classes_[i] for i in present_classes_sp]

In [152]:
# Generate the classification report as a dictionary
report_sp = classification_report(y_val_sp, predictions_sp, target_names=filtered_target_names_sp, output_dict=True)

# Keep the entries whose value is a dict: this drops the scalar 'accuracy' entry but keeps 'macro avg' and 'weighted avg'
filtered_report = {label: metrics for label, metrics in report_sp.items() if isinstance(metrics, dict)}

# Sort the languages by decreasing F1-score
sorted_report = sorted(filtered_report.items(), key=lambda x: x[1]['f1-score'], reverse=True)

# Print the sorted report
print("Classification Report (sorted by decreasing F1-score):\n")
for label, metrics in sorted_report:
    print(f"{label}: F1-score = {metrics['f1-score']:.4f}, Precision = {metrics['precision']:.4f}, Recall = {metrics['recall']:.4f}, Support = {metrics['support']}")


Classification Report (sorted by decreasing F1-score):

abk_Cyrillique: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 116.0
ach_Latin: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 22.0
ada_Latin: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 1.0
ahk_Latin: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 89.0
alt_Cyrillique: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 90.0
aoj_Latin: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 98.0
arn_Latin: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 115.0
asm_Bengali: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 99.0
bpy_Bengali: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 95.0
bqc_Latin: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 95.0
bzj_Latin: F1-score = 1.0000, Precision = 1.0000, Recall = 1.0000, Support = 104.0
cab_Latin: F1-score = 1.00

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [196]:
val_set_second_version

Unnamed: 0,Usage,Text,Label
128184,Public,▁ A p ă râ nd ▁fa ț ă ▁în t r - un ▁exerci ți ...,ron_Latin
95049,Public,▁kaya ▁vua ▁ J i su ▁vol ai ▁ta lega ▁ka kua ▁...,fij_Latin
170377,Public,▁אי ן ▁צו וא נ צ יק ▁י אר ▁הא ט ▁דא ס ▁דא ר ף ...,yid_Hébreu
171119,Public,▁ S en iň ▁s özü ň e ▁gu lak ▁ assa ▁do ga nyň...,tuk_Latin
62238,Public,▁gro nta pu ▁fesi . ▁pra ti ▁ini ▁grup u ▁le k...,srn_Latin
...,...,...,...
38370,Public,▁ A ndũ ▁nĩ ▁mara kari re ▁mũno . N ĩ ▁ma tho ...,kik_Latin
175527,Public,▁ T á ▁ ið ▁hu g sa ð ▁verður ▁ska ðar nar ▁ i...,fao_Latin
65025,Public,▁ I ng ku ro ▁oku ▁mi lo ' ▁mo ngu hu p ▁ T om ?,dtp_Latin
150121,Public,▁ A ˬ ▁ma - ah ˇ ▁ya ˆ ▁ phy aw ˇ ▁ya ˆ ▁neh ˬ...,ahk_Latin


Test with SVC

Test with SGDClassifier

In [48]:
# from sklearn.linear_model import SGDClassifier
# from sklearn.feature_extraction.text import TfidfVectorizer

# vectorizer_sp_2 = TfidfVectorizer(analyzer="char", ngram_range=(1,4), max_features=50000)
# sgdclassifier_model = SGDClassifier(loss="hinge", alpha=0.0001, shuffle=True, max_iter=1, warm_start=True)

# x_train_sgdc_vectorized = vectorizer_sp_2.fit_transform(tqdm(x_train_sp, desc="Vectorizing training data"))
# x_val_sgdc_vectorized = vectorizer_sp_2.transform(tqdm(x_val_sp, desc="Vectorizing validation data"))

# sgdclassifier_model.fit(x_train_sgdc_vectorized, y_train_sp)
# val_accuracy = accuracy_score(y_val_sp, sgdclassifier_model.predict(x_val_sgdc_vectorized))
# print(f"Val Accuracy: {val_accuracy:.4f}\n")




Vectorizing training data: 100%|██████████| 152079/152079 [01:11<00:00, 2138.84it/s]
Vectorizing validation data: 100%|██████████| 38020/38020 [00:16<00:00, 2253.33it/s]


Val Accuracy: 0.7837



In [195]:
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

vectorizer_sp_2 = TfidfVectorizer(analyzer="char", ngram_range=(1,4), max_features=20000)
# Hinge loss with an L2 penalty makes this a linear SVM; max_iter=1 limits training to a single pass over the data
sgdclassifier_model = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=1, learning_rate='optimal', early_stopping=False)

pipeline_svm = Pipeline([
    ('tfidf', vectorizer_sp_2),
    ('sgd', sgdclassifier_model)
])

pipeline_svm.fit(x_train_sp, y_train_sp)
predictions_svm = pipeline_svm.predict(x_val_sp)

val_accuracy = accuracy_score(y_val_sp, predictions_svm)
print(f"Val Accuracy: {val_accuracy:.4f}\n")




Val Accuracy: 0.7128



## Using the best model on the test set

In [93]:
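data_test= pd.read_csv(file_path_test)
# Note: need_to_clean defaults to True, so the test text is passed through cleaning(),
# whereas train_set_second_version was tokenized without that step.
test_set = pre_processing(data_test, remove_espace=False, not_test=False)
test_set['Text'] = test_set['Text'].progress_apply(sentencepiece_tokenize)

# Selected pipeline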

best_pipeline = pipeline

x_test = test_set['Text'].tolist()
predictions_test = best_pipeline.predict(x_test)






100%|██████████| 190567/190567 [00:08<00:00, 22595.11it/s]


In [94]:
predicted_labels_test = le_sp.inverse_transform(predictions_test)
predicted_labels_test = restore_labels(predicted_labels_test)
test_set['Label'] = predicted_labels_test

100%|██████████| 190567/190567 [00:00<00:00, 1041060.33it/s]


In [95]:
column_ID = list(range(1, len(test_set) + 1))
test_set['ID'] = column_ID

In [96]:
test_set[['Label', 'ID']].to_csv('test_set_v4_predicted.csv', index=False)