# Objective

This project aims to build a binary classifier that predicts the polarity (positive or negative) of a movie review from IMDb, using only its text.

It is a supervised machine learning task and uses a long short-term memory (LSTM) neural network built with TensorFlow/Keras.

# Data

The data come from the IMDb Large Movie Review Dataset, available here:
🔗 https://ai.stanford.edu/~amaas/data/sentiment/

The corpus contains 50,000 reviews in total, split as follows:

train/pos: 12,500 positive reviews

train/neg: 12,500 negative reviews

test/pos: 12,500 positive reviews

test/neg: 12,500 negative reviews

Each review is stored as a .txt file containing the text of a single review.
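Before training, it can be worth checking that the extracted archive matches the split described above. The following is a minimal sketch under the assumption that the archive was extracted to `data/aclImdb`, with `train` and `test` folders each containing `pos` and `neg`. The helper name `count_reviews` is made up for illustration.

```python
import os

def count_reviews(data_dir):
    """Count the .txt review files in each split/label folder under data_dir."""
    counts = {}
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            folder = os.path.join(data_dir, split, label)
            # Skip folders that are missing instead of raising an error.
            if os.path.isdir(folder):
                counts[f"{split}/{label}"] = sum(
                    1 for name in os.listdir(folder) if name.endswith(".txt")
                )
    return counts

# Hypothetical usage, assuming the extraction path above:
# print(count_reviews("data/aclImdb"))
```

If the download is complete, each of the four entries should match the 12,500 figure listed above.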

In [43]:
import os

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, GlobalAveragePooling1D
from tensorflow.keras.optimizers import Adam

1. Loading the IMDb data

In [44]:
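def load_imdb_data(data_dir):
    """Read every review under data_dir/neg and data_dir/pos.

    Returns the review texts and their labels (0 = negative, 1 = positive).
    All negative reviews come first, followed by all positive reviews.
    """
    texts, labels = [], []
    for label_type in ['neg', 'pos']:
        dir_name = os.path.join(data_dir, label_type)
        for fname in os.listdir(dir_name):
            if fname.endswith(".txt"):
                with open(os.path.join(dir_name, fname), encoding="utf-8") as f:
                    texts.append(f.read())
                labels.append(0 if label_type == 'neg' else 1)
    return texts, labels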

2. Preprocessing

In [45]:
train_dir = "data/aclImdb/train"
test_dir = "data/aclImdb/test"

x_train_texts, y_train = load_imdb_data(train_dir)
x_test_texts, y_test = load_imdb_data(test_dir)
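One detail is worth flagging here. `load_imdb_data` reads all negative reviews before any positive one, so the returned lists are sorted by label. Keras's `validation_split` takes its validation samples from the end of the arrays, before any shuffling. With `validation_split=0.2`, the held-out 20% would therefore contain only positive reviews, which may make the validation metrics misleading. A minimal sketch of one possible remedy is to shuffle texts and labels together before training. The helper name `shuffle_together` and the seed are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def shuffle_together(texts, labels, seed=42):
    """Return texts and labels reordered by one shared random permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(texts))
    shuffled_texts = [texts[i] for i in order]
    shuffled_labels = [labels[i] for i in order]
    return shuffled_texts, shuffled_labels

# Toy data ordered like the output of load_imdb_data: negatives, then positives.
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [0] * 5 + [1] * 5

mixed_texts, mixed_labels = shuffle_together(toy_texts, toy_labels)

# Without shuffling, the last 20% of the labels contains a single class.
print(toy_labels[-2:])  # [1, 1]
```

In the notebook this would amount to calling something like `shuffle_together(x_train_texts, y_train)` before tokenization. Passing an explicit `validation_data` argument to `fit` is another option.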

3. Tokenization

In [46]:

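# Tokenization splits a text document into smaller units called tokens,
# which can be words, phrases, or individual characters.
vocab_size = 10000  # vocabulary size cap passed to the Tokenizer
max_length = 200    # number of token ids kept per review

# Build the vocabulary from the training texts only; unknown words map to "<OOV>".
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(x_train_texts)

# Convert each review into a sequence of integer word indices.
x_train_seq = tokenizer.texts_to_sequences(x_train_texts)
x_test_seq = tokenizer.texts_to_sequences(x_test_texts)

# Pad shorter sequences with trailing zeros; longer ones are truncated to max_length.
x_train_pad = pad_sequences(x_train_seq, maxlen=max_length, padding='post')
x_test_pad = pad_sequences(x_test_seq, maxlen=max_length, padding='post')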
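To make the steps in this cell concrete, here is a toy, pure-Python sketch of roughly what tokenization and padding do. It is a simplification and not the Keras implementation: it skips punctuation filtering, and the helper names `build_word_index`, `to_sequence`, and `pad_post` are made up for illustration. It does mirror three behaviors of the cell above: id 1 is reserved for the OOV token, only ids below `num_words` are kept, and padding zeros go at the end while truncation removes tokens from the start (the `pad_sequences` default).

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Map words to integer ids by descending frequency; id 1 is the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, num_words):
    """Replace each word by its id; unknown or too-rare words get the OOV id."""
    oov_id = 1
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word, oov_id)
        ids.append(idx if idx < num_words else oov_id)
    return ids

def pad_post(seq, maxlen):
    """Keep the last maxlen ids, then right-pad with zeros up to maxlen."""
    seq = seq[-maxlen:]
    return seq + [0] * (maxlen - len(seq))

word_index = build_word_index(["good good good film film bad"])
sequence = to_sequence("good bad film new", word_index, num_words=4)
padded = pad_post(sequence, maxlen=6)
print(sequence, padded)
```

Note that with `padding='post'` a short review ends in a long run of zeros, and a recurrent layer without masking will process those zeros as ordinary inputs.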

3. Task 1: Training the embeddings

In [47]:
embedding_dim = 100

# Simple classifier: average the word embeddings of a review,
# then apply a single sigmoid unit.
model_embed = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_length),
    GlobalAveragePooling1D(),
    Dense(1, activation='sigmoid')
])

model_embed.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model_embed.summary()

# The last 20% of the training arrays is held out for validation.
model_embed.fit(x_train_pad, np.array(y_train), epochs=5, batch_size=128, validation_split=0.2)

# Extract the embedding matrix Z (shape: vocab_size x embedding_dim)
embedding_matrix = model_embed.layers[0].get_weights()[0]

Epoch 1/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 11s 43ms/step - accuracy: 0.6223 - loss: 0.6516 - val_accuracy: 0.2602 - val_loss: 0.8338
Epoch 2/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 6s 36ms/step - accuracy: 0.7516 - loss: 0.5288 - val_accuracy: 0.5884 - val_loss: 0.6754
Epoch 3/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 5s 34ms/step - accuracy: 0.8455 - loss: 0.4034 - val_accuracy: 0.7346 - val_loss: 0.5552
Epoch 4/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 6s 37ms/step - accuracy: 0.8802 - loss: 0.3307 - val_accuracy: 0.6964 - val_loss: 0.5878
Epoch 5/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 5s 34ms/step - accuracy: 0.8941 - loss: 0.2876 - val_accuracy: 0.7750 - val_loss: 0.4943


4. Task 2: LSTM model with Z

In [None]:
# Reuse the learned embedding matrix Z as frozen (non-trainable) weights,
# so that only the LSTM and the output layer are trained.
model_lstm = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_length, weights=[embedding_matrix], trainable=False),
    LSTM(64),
    Dense(1, activation='sigmoid')
])

model_lstm.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])
model_lstm.summary()

model_lstm.fit(x_train_pad, np.array(y_train), epochs=5, batch_size=128, validation_split=0.2)

Epoch 1/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 53s 303ms/step - accuracy: 0.7589 - loss: 0.4868 - val_accuracy: 0.8534 - val_loss: 0.3854
Epoch 2/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 50s 317ms/step - accuracy: 0.8465 - loss: 0.4179 - val_accuracy: 0.6798 - val_loss: 0.9343
Epoch 3/5
157/157 ━━━━━━━━━━━━━━━━━━━━ 58s 372ms/step - accuracy: 0.8203 - loss: 0.4172 - val_accuracy: 0.1976 - val_loss: 0.8953
Epoch 4/5
  2/157 ━━━━━━━━━━━━━━━━━━━━ 59s 386ms/step - accuracy: 0.6445 - loss: 0.6327

5. Evaluation on the test set

In [49]:
loss, accuracy = model_lstm.evaluate(x_test_pad, np.array(y_test))
print(f"\nTest Accuracy: {accuracy:.4f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 35s 44ms/step - accuracy: 0.8394 - loss: 0.5201

Test Accuracy: 0.5425
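
The progress bar and the printed value disagree here (0.8394 versus 0.5425). One way to investigate is to recompute accuracy directly from the predicted probabilities and break it down by class. The sketch below is a hedged illustration: `accuracy_report` is a hypothetical helper, and it assumes the model outputs one sigmoid probability per review, cut at 0.5 as in Keras's binary accuracy.

```python
import numpy as np

def accuracy_report(y_true, y_prob, threshold=0.5):
    """Overall and per-class accuracy for sigmoid outputs cut at `threshold`."""
    y_true = np.asarray(y_true).astype(int).ravel()
    # A review is predicted positive when its probability exceeds the threshold.
    y_pred = (np.asarray(y_prob).ravel() > threshold).astype(int)
    correct = y_pred == y_true
    return {
        "overall": float(correct.mean()),
        "negative": float(correct[y_true == 0].mean()),
        "positive": float(correct[y_true == 1].mean()),
    }

# Hypothetical usage with the notebook's objects:
# probs = model_lstm.predict(x_test_pad)
# print(accuracy_report(y_test, probs))
```

A large gap between the "negative" and "positive" entries would suggest that the model favors one class, which an overall accuracy close to 0.5 on a balanced test set can hide.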
