This notebook covers the classifier (lyrics to genre) of the NLP project.
Let's start by installing and importing the required packages.

In [None]:
!pip install kaggle
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 15.8 MB/s eta 0:00:00
Installing collected packages: tiktoken
Successfully installed tiktoken-0.6.0


In [None]:
import os
import random

import matplotlib.pyplot as plt
import nltk
import pandas as pd
import tensorflow as tf
import tiktoken

from google.colab import userdata
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, ConfusionMatrixDisplay)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB  # needed by the Naive Bayes pipeline below
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from spacy.lang.en.stop_words import STOP_WORDS as en_stop


Next, you need to fetch the dataset from Kaggle.
To do so, please follow this tutorial:
https://www.kaggle.com/discussions/general/74235#2580958


In [None]:
nltk.download('punkt')
nltk.download('wordnet')

os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')
os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')
!kaggle datasets download -d carlosgdcj/genius-song-lyrics-with-language-information
!unzip genius-song-lyrics-with-language-information.zip

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Downloading genius-song-lyrics-with-language-information.zip to /content
100% 3.04G/3.04G [01:18<00:00, 41.6MB/s]
100% 3.04G/3.04G [01:18<00:00, 41.7MB/s]
Archive:  genius-song-lyrics-with-language-information.zip
  inflating: song_lyrics.csv         


'Quick' preprocessing step.

In [None]:
# Keep the header plus every 100th row: about 1% of the file, roughly 50,000 rows
df = pd.read_csv("song_lyrics.csv", skiprows=lambda i: i % 100 != 0)
print(df.index)
df = df[df['tag'] != 'misc']
if 'language' in df.columns:
    df = df[df['language'] == 'en']
df = df[['title', 'lyrics', 'tag']]
df.reset_index(drop=True, inplace=True)
# Shuffle the rows randomly
df = df.sample(frac=1)
# Split the data into features (X) and labels (Y)
X = df['lyrics']
Y = df['tag']

# Split the data into training and test sets (80% training, 20% test)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

RangeIndex(start=0, stop=51348, step=1)


Let's use the tokenizers seen in class, once again for the sake of speed.

In [None]:
# Define the tokenizers from the course
def lemma_tokenize(doc):
    wnl = WordNetLemmatizer()
    return [wnl.lemmatize(t) for t in word_tokenize(doc)]

def char_tokenize(doc):
    return [char for char in doc]

def byte_tokenize(doc):
    tokens = doc.encode("utf-8")
    tokens = list(map(int, tokens))
    return [str(token) for token in tokens]

def gpt_tokenize(doc):
    enc = tiktoken.encoding_for_model("gpt-4")
    tokens = enc.encode(doc)
    return [str(token) for token in tokens]
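
To make the difference between the character-level and byte-level tokenizers concrete, here is a minimal sketch that runs standalone copies of `char_tokenize` and `byte_tokenize` on a short string. The sample string is an arbitrary illustration and does not come from the dataset.

```python
# Standalone copies of two of the tokenizers defined above
def char_tokenize(doc):
    return [char for char in doc]

def byte_tokenize(doc):
    # Each UTF-8 byte becomes one token, written as its decimal value
    return [str(b) for b in doc.encode("utf-8")]

sample = "caf\u00e9"  # "café": arbitrary example, not taken from the dataset

char_tokens = char_tokenize(sample)
byte_tokens = byte_tokenize(sample)

print(char_tokens)  # ['c', 'a', 'f', 'é']
print(byte_tokens)  # ['99', '97', '102', '195', '169']
```

The accented character is a single token at the character level but two tokens at the byte level, because UTF-8 encodes it on two bytes.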

Building the classifier with scikit-learn.
We can play with the various parameters and hyperparameters to see whether we get better results.

In [None]:
# Create the model. Pipelines can be tested one by one or customized through hyperparameter tuning.
# list(en_stop): recent scikit-learn versions expect stop_words to be a list rather than a set
model = make_pipeline(CountVectorizer(ngram_range=(1, 1), stop_words=list(en_stop)), MultinomialNB())  # Naive Bayes
#model = make_pipeline(CountVectorizer(ngram_range=(1, 1), stop_words=list(en_stop)), LogisticRegression())  # Logistic Regression
model.fit(X_train, Y_train)
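
Rather than changing the hyperparameters by hand, the search can be automated. The following is a minimal sketch using `GridSearchCV` on a pipeline of the same shape. The toy corpus, the parameter grid and `cv=2` are assumptions chosen only so the example runs quickly; they are not the settings used in this project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, made up for illustration (not taken from the Genius dataset)
toy_lyrics = [
    "money cars street flow", "street flow rhyme money",
    "rhyme beat street money", "flow beat cars rhyme",
    "guitar loud drums scream", "drums scream guitar riff",
    "riff loud guitar amp", "amp scream drums loud",
]
toy_tags = ["rap"] * 4 + ["rock"] * 4

pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# make_pipeline names each step after its lowercased class name,
# hence the "countvectorizer__" and "multinomialnb__" prefixes
param_grid = {
    "countvectorizer__ngram_range": [(1, 1), (1, 2)],
    "multinomialnb__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(toy_lyrics, toy_tags)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

On the real data, the same pattern would apply to `X_train` and `Y_train`, at the cost of fitting one model per parameter combination and per fold.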


We first evaluate the model simply by looking at its accuracy.

In [None]:
accuracy = model.score(X_test, Y_test)
print("Accuracy:", accuracy)

Below is a list of all the combinations that were tested one by one. For an overview of the results, see the appendix in the GitHub repo.

In [None]:
# `scaler` is not defined anywhere in this notebook; the definition below is an assumption.
# with_mean=False is required because CountVectorizer produces a sparse matrix.
scaler = StandardScaler(with_mean=False)
#model = make_pipeline(CountVectorizer(tokenizer=gpt_tokenize, ngram_range=(1, 1)), scaler, LogisticRegression(max_iter=1000, solver='saga', penalty='l2'))
#model = make_pipeline(CountVectorizer(tokenizer=byte_tokenize, ngram_range=(1, 1)), scaler, LogisticRegression(max_iter=1000, solver='saga', penalty='l2'))
#model = make_pipeline(CountVectorizer(tokenizer=word_tokenize, ngram_range=(1, 1)), scaler, LogisticRegression(max_iter=1000, solver='saga', penalty='l2'))
#model = make_pipeline(CountVectorizer(ngram_range=(1, 1)), scaler, LogisticRegression(max_iter=3000, solver='lbfgs'))
#model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), scaler, LogisticRegression(max_iter=3000, solver='lbfgs'))


#model = make_pipeline(CountVectorizer(tokenizer=word_tokenize, ngram_range=(1, 1)), scaler, LogisticRegression(max_iter=1000, solver='saga', penalty='elasticnet', l1_ratio=0.5))


Here is how to evaluate the model in a more meaningful way.
We use:

*   The confusion matrix
*   The classification report provided by scikit-learn
*   Empirical evaluation

Confusion matrix: it shows precisely where the model was wrong and where it predicted correctly.



In [None]:
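# Predict on the test set, then print the classification report and plot the confusion matrix
Y_pred_classes = model.predict(X_test)
print("Classification Report:\n", classification_report(Y_test, Y_pred_classes))
# labels=model.classes_ keeps the matrix rows/columns aligned with the display labels
cm = confusion_matrix(Y_test, Y_pred_classes, labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
plt.show()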
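
To read a confusion matrix, it helps to look at a tiny case first. The sketch below uses made-up labels and predictions (they are not results from this notebook): each row is a true class, each column a predicted class, so off-diagonal cells count the mistakes.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted tags, for illustration only
y_true = ["rap", "rap", "rock", "pop", "rock", "pop"]
y_pred = ["rap", "rock", "rock", "pop", "rock", "rap"]
labels = ["pop", "rap", "rock"]  # fixes the row/column order

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# [[1 1 0]   pop:  1 correct, 1 predicted as rap
#  [0 1 1]   rap:  1 correct, 1 predicted as rock
#  [0 0 2]]  rock: 2 correct

# Per-class recall = diagonal / row sum
recall = cm.diagonal() / cm.sum(axis=1)
print(dict(zip(labels, recall)))
```

The diagonal holds the correct predictions, so a row with large off-diagonal values points to a genre the model tends to confuse with another.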

Classification report: we export it to an Excel file so that it can be compared with other iterations.

In [None]:
y_pred = model.predict(X_test)
print("Classification Report:\n", classification_report(Y_test, y_pred))
# Convert classification report to dictionary
report_dict = classification_report(Y_test, y_pred, output_dict=True)

# Convert the dictionary to a DataFrame
df_excel = pd.DataFrame(report_dict).transpose()

# Convert the DataFrame to Excel format
df_excel.to_excel("classification_report.xlsx")

'Visual' empirical tests: we pick a few songs at random
and check whether our model performs well on them.
Finally, we print the errors and successes.

In [None]:
df_test = pd.read_csv("song_lyrics.csv", skiprows=lambda i: i % 977 != 0 , nrows=10) # Change here to test different values

df_test = df_test[df_test['tag'] != 'misc']
if 'language' in df_test.columns:
    df_test = df_test[df_test['language'] == 'en']
df_test = df_test[['title', 'lyrics', 'tag']]
df_test.reset_index(drop=True, inplace=True)

for song_name, song_lyrics, song_tag in zip(df_test['title'], df_test['lyrics'], df_test['tag']):
    print("Song:", song_name)
    print("Tag:", song_tag)
    # Wrap the lyrics in a list (one sample) and predict class probabilities
    probabilities = model.predict_proba([song_lyrics])

    # Print the probability distribution, skipping negligible values
    print("Distribution of Probabilities:")
    for class_label, probability in zip(model.classes_, probabilities[0]):
        if probability > 0.0001:
            print(f"{class_label}: {probability:.4f}")
    max_prob_index = probabilities[0].argmax()
    predicted_class = model.classes_[max_prob_index]
    if predicted_class != song_tag:
        print(f'Model failed to predict. Actual tag is {song_tag}, predicted tag is {predicted_class}')
    else:
        print(f'Model predicted correctly: {predicted_class}')
    print()

To see whether we could get better performance, we decided to try a classifier that does not rely on scikit-learn's models.

Let's now move on to implementing the classifier with PyTorch.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

from sklearn.preprocessing import LabelEncoder

In [None]:
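vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# The labels are not vectorized here: refitting `vectorizer` on Y_train would
# overwrite the vocabulary learned from the lyrics. The labels are encoded
# with LabelEncoder in the next cell instead.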

In [None]:
# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train_vec.toarray(), dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_vec.toarray(), dtype=torch.float32)

label_encoder = LabelEncoder()
Y_train_indices = label_encoder.fit_transform(Y_train)
Y_test_indices = label_encoder.transform(Y_test)

Y_train_tensor = torch.tensor(Y_train_indices, dtype=torch.long)
Y_test_tensor = torch.tensor(Y_test_indices, dtype=torch.long)

In [None]:
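# Define the logistic regression model.
# Note: this class shadows the LogisticRegression imported from scikit-learn earlier.
class LogisticRegression(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        # A single linear layer: one weight per (vocabulary word, class) pair
        self.linear = nn.Linear(input_size, num_classes)

    def forward(self, x):
        # Return raw logits; CrossEntropyLoss applies the softmax internally
        out = self.linear(x)
        return out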

input_size = X_train_tensor.shape[1]
num_classes = len(Y_train.unique())


We use the cross-entropy loss and the Adam optimizer.

In [None]:
# Initialize the model
model = LogisticRegression(input_size, num_classes)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
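
As a side note on why the model returns raw scores rather than probabilities: `nn.CrossEntropyLoss` expects unnormalized logits and integer class indices, and it combines a log-softmax with the negative log-likelihood. The minimal sketch below checks this equivalence on random toy tensors; the sizes and the seed are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, for reproducibility only

logits = torch.randn(4, 3)            # toy batch: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # class indices (dtype long)

# Built-in loss applied directly to the logits
loss_builtin = nn.CrossEntropyLoss()(logits, targets)

# Manual version: log-softmax, then the mean negative log-probability
# assigned to the true class of each sample
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(4), targets].mean()

print(loss_builtin.item(), loss_manual.item())
```

Both values should match up to floating-point error, which is why no softmax layer is needed in the model's `forward`.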

In [None]:
print(X_train_tensor.shape)
print(X_test_tensor.shape)
print(Y_train_tensor.shape)

torch.Size([25680, 79784])
torch.Size([6420, 79784])
torch.Size([25680])


In [None]:
num_epochs = 100
for epoch in range(num_epochs):
    # Training phase
    model.train()
    outputs = model(X_train_tensor)
    loss = criterion(outputs, Y_train_tensor)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Validation phase
    model.eval()
    with torch.no_grad():
        val_outputs = model(X_test_tensor)
        _, predicted_val = torch.max(val_outputs, dim=1)
        num_correct = (predicted_val == Y_test_tensor).sum().item()
        accuracy = num_correct / Y_test_tensor.size(0) * 100

    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}, Accuracy: {accuracy:.2f}%')

Epoch [1/100], Loss: 1.5848, Accuracy: 56.04%
Epoch [2/100], Loss: 1.5431, Accuracy: 61.50%
Epoch [3/100], Loss: 1.4722, Accuracy: 66.09%
Epoch [4/100], Loss: 1.1467, Accuracy: 48.68%
Epoch [5/100], Loss: 1.1932, Accuracy: 47.96%
Epoch [6/100], Loss: 1.1179, Accuracy: 63.52%
Epoch [7/100], Loss: 0.8890, Accuracy: 65.69%
Epoch [8/100], Loss: 0.8240, Accuracy: 65.59%
Epoch [9/100], Loss: 0.7766, Accuracy: 65.87%
Epoch [10/100], Loss: 0.7327, Accuracy: 61.88%
Epoch [11/100], Loss: 0.7472, Accuracy: 61.03%
Epoch [12/100], Loss: 0.7038, Accuracy: 63.15%
Epoch [13/100], Loss: 0.6065, Accuracy: 64.58%
Epoch [14/100], Loss: 0.5279, Accuracy: 64.97%
Epoch [15/100], Loss: 0.4877, Accuracy: 65.12%
Epoch [16/100], Loss: 0.4731, Accuracy: 65.09%
Epoch [17/100], Loss: 0.4710, Accuracy: 64.95%
Epoch [18/100], Loss: 0.4712, Accuracy: 64.74%
Epoch [19/100], Loss: 0.4654, Accuracy: 64.44%
Epoch [20/100], Loss: 0.4510, Accuracy: 64.13%
Epoch [21/100], Loss: 0.4298, Accuracy: 64.14%
Epoch [22/100], Loss: 

KeyboardInterrupt: 

Model evaluation

In [None]:
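model.eval()
with torch.no_grad():
    outputs = model(X_test_tensor)
    # The predicted class is the index of the largest logit
    _, predicted = torch.max(outputs, 1)

cm = confusion_matrix(Y_test_tensor.numpy(), predicted.numpy())
print(cm)
# Map the integer class indices back to genre names in the report
class_ids = list(range(len(label_encoder.classes_)))
report = classification_report(Y_test_tensor.numpy(), predicted.numpy(),
                               labels=class_ids, target_names=label_encoder.classes_)
print(report)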