Deep learning

In this notebook, we build a deep learning model for sentiment classification of Amazon book reviews. The model is an LSTM (Long Short-Term Memory) network, a type of RNN (Recurrent Neural Network) that is well suited to sequential data such as text.

In [35]:
!pip install tensorflow
!pip install nltk
import nltk
from preprocessing_pipeline import preprocess
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, SpatialDropout1D
from sklearn.metrics import classification_report, confusion_matrix



In [36]:
#Labels
class_labels = ["negative", "positive"]
category_orders = {"Sentiment_books": class_labels}

In [37]:
from sklearn.metrics import classification_report, confusion_matrix
import plotly.express as px

confusion_matrix_kwargs = dict(
    text_auto=True,
    title="Confusion Matrix", width=1000, height=800,
    labels=dict(x="Predicted", y="True Label"),
    x=class_labels,
    y=class_labels,
    color_continuous_scale='Blues'
)

def report(y_true, y_pred, class_labels):
    print(classification_report(y_true, y_pred, target_names=class_labels))
    # Generate the confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Visualize the confusion matrix with Plotly
    fig = px.imshow(
        cm,
        **confusion_matrix_kwargs
    )
    fig.show()


In [39]:
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

# Load the data
data = pd.read_csv('amazon_books_Data.csv')
data = data.drop('Unnamed: 0', axis=1)
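# Fix a misspelled label; assign the result back instead of relying on a chained in-place replace
data['Sentiment_books'] = data['Sentiment_books'].replace('negaitve', 'negative')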

# Convert labels to numerical values
label_encoder = LabelEncoder()
data['Sentiment_books'] = label_encoder.fit_transform(data['Sentiment_books'])

data['processed_review'] = data['review_body'].apply(preprocess)
X = data['processed_review']
y = to_categorical(data['Sentiment_books'], num_classes=2)  # Convert labels to categories

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


max_features = 2000
tokenizer = Tokenizer(num_words=max_features, split=' ')
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_train = pad_sequences(X_train)

# Recurrent Neural Network (RNN) 
model = Sequential()
model.add(Embedding(max_features, 128, input_length=X_train.shape[1]))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


model.fit(X_train, y_train, epochs=5, batch_size=32, validation_split=0.1)


X_test = tokenizer.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen=X_train.shape[1])
y_pred = model.predict(X_test)

# Convert predictions to class labels
y_pred_labels = [1 if pred[1] > pred[0] else 0 for pred in y_pred]

report(y_test.argmax(axis=1), y_pred_labels, class_labels)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         3
    positive       0.85      1.00      0.92        17

    accuracy                           0.85        20
   macro avg       0.42      0.50      0.46        20
weighted avg       0.72      0.85      0.78        20





During training, accuracy improved and loss decreased over the 5 epochs, reaching 83.33% training accuracy, while validation accuracy stayed steady at 87.5%. On the test set, however, the model failed to identify any negative review: precision and recall for the negative class are both 0, while recall for the positive class is 1.00. In other words, it predicted "positive" for all 20 test reviews. The 85% test accuracy and the weighted F1-score of 0.78 therefore mostly reflect the class imbalance (17 positive reviews out of 20) rather than a real ability to separate the two classes.
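To put the 85% figure in context, the report above can be reproduced by a classifier that ignores its input entirely. The sketch below is a minimal pure-Python illustration: the label lists are rebuilt from the support column (3 negative, 17 positive), and `per_class_metrics` is a hypothetical helper, not part of the notebook.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class; 0.0 when a ratio is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Test-set composition taken from the support column: 0 = negative, 1 = positive
y_true = [0] * 3 + [1] * 17
# A "model" that always answers positive
y_pred = [1] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
neg = per_class_metrics(y_true, y_pred, label=0)
pos = per_class_metrics(y_true, y_pred, label=1)
macro_f1 = (neg[2] + pos[2]) / 2

print(f"accuracy = {accuracy:.2f}")
print(f"negative: precision={neg[0]:.2f} recall={neg[1]:.2f} f1={neg[2]:.2f}")
print(f"positive: precision={pos[0]:.2f} recall={pos[1]:.2f} f1={pos[2]:.2f}")
print(f"macro F1 = {macro_f1:.2f}")
```

The printed values match the classification report line for line, which is why accuracy alone is a weak indicator on such an imbalanced test set; the macro average (0.46) is the more telling number.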

I tried to find a solution to this problem. The next cell again builds an LSTM-based deep learning model for sentiment classification on the Amazon book review data, but first balances the classes. Because negative reviews are heavily under-represented, the script upsamples the minority class (sampling with replacement) until it is almost as large as the positive class. The model is then trained on a more evenly distributed dataset and sees enough examples of both classes to learn from.
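One caveat with upsampling is the order of operations: if the duplicated minority rows are created before the train/test split, copies of the same review can land on both sides of the split. The sketch below is an assumed variation, not the code used in the next cell: it splits first and upsamples only the training portion. It uses only the standard library on toy (text, label) pairs; `upsample_minority` and the toy reviews are hypothetical.

```python
import random

def upsample_minority(rows, minority_label, rng):
    """Duplicate minority rows (sampling with replacement) until both classes have equal size."""
    minority = [r for r in rows if r[1] == minority_label]
    majority = [r for r in rows if r[1] != minority_label]
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

rng = random.Random(42)

# Toy data: 17 positive and 5 negative reviews (hypothetical texts)
rows = [(f"good book {i}", "positive") for i in range(17)]
rows += [(f"bad book {i}", "negative") for i in range(5)]

# 1) Split first, so the test set contains only original, unseen reviews
rng.shuffle(rows)
n_test = len(rows) // 5
test_rows, train_rows = rows[:n_test], rows[n_test:]

# 2) Upsample the minority class inside the training set only
train_balanced = upsample_minority(train_rows, "negative", rng)

n_pos = sum(1 for _, label in train_balanced if label == "positive")
n_neg = sum(1 for _, label in train_balanced if label == "negative")
overlap = {text for text, _ in train_balanced} & {text for text, _ in test_rows}
print(n_pos, n_neg, len(overlap))
```

In the notebook's terms, the same idea would mean calling `train_test_split` on the original data first and passing only the training rows to `resample`, so that the test metrics are computed on reviews the model has never seen in any copy.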

In [43]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
import plotly.express as px


data = pd.read_csv('amazon_books_Data.csv')
data = data.drop('Unnamed: 0', axis=1)
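# Fix a misspelled label; assign the result back instead of relying on a chained in-place replace
data['Sentiment_books'] = data['Sentiment_books'].replace('negaitve', 'negative')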

data['processed_review'] = data['review_body'].apply(preprocess)

# Upsample the minority class to reduce the disparity between positive and negative reviews
data_majority = data[data['Sentiment_books'] == 'positive']
data_minority = data[data['Sentiment_books'] == 'negative']
data_minority_upsampled = resample(data_minority, replace=True, n_samples=len(data_majority)-6, random_state=42)
data_upsampled = pd.concat([data_majority, data_minority_upsampled])

# Split data into training and test set 
X = data_upsampled['processed_review']
y = data_upsampled['Sentiment_books']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

max_features = 2000
tokenizer = Tokenizer(num_words=max_features, split=' ')
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_train = pad_sequences(X_train)

# Build the RNN model
model = Sequential()
model.add(Embedding(max_features, 128, input_length=X_train.shape[1]))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))  
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_train, to_categorical(LabelEncoder().fit_transform(y_train)), epochs=5, batch_size=32, validation_split=0.1)


X_test = tokenizer.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen=X_train.shape[1])
y_pred = model.predict(X_test)

# Map string labels to integers (alphabetical order, matching what LabelEncoder produces)
label_to_int = {'negative': 0, 'positive': 1}
y_test_int = y_test.map(label_to_int)

# Convert predictions to class labels
y_pred_labels_int = [pred.argmax() for pred in y_pred]


class_labels = ["negative", "positive"]


confusion_matrix_kwargs = dict(
    text_auto=True,
    title="Confusion Matrix", width=1000, height=800,
    labels=dict(x="Predicted", y="True Label"),
    x=class_labels,
    y=class_labels,
    color_continuous_scale='Blues'
)


def report(y_true, y_pred, class_labels):
    print(classification_report(y_true, y_pred, target_names=class_labels))
    cm = confusion_matrix(y_true, y_pred)
    fig = px.imshow(cm, **confusion_matrix_kwargs)
    fig.show()

# Call the report function to generate the classification report and confusion matrix
report(y_test_int, y_pred_labels_int, class_labels)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
              precision    recall  f1-score   support

    negative       0.75      0.56      0.64        16
    positive       0.67      0.82      0.74        17

    accuracy                           0.70        33
   macro avg       0.71      0.69      0.69        33
weighted avg       0.71      0.70      0.69        33



Training accuracy improved markedly, reaching 92.24% by the fifth epoch, while validation accuracy settled at 69.23% in the final epoch. The gap between the two suggests some overfitting. On the test set, the model reached 70% accuracy. For the "negative" sentiment class, precision was 0.75, recall 0.56, and F1-score 0.64; for the "positive" sentiment class, precision was 0.67, recall 0.82, and F1-score 0.74. Unlike the first model, this one classifies both positive and negative sentiment. Note that because the upsampling was applied before the train/test split, duplicated negative reviews may appear in both sets, so these scores may be optimistic.
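The per-class figures can be cross-checked against the confusion matrix. Working backwards from the report, the only counts consistent with the rounded precision, recall and support values are 9 of 16 negative reviews and 14 of 17 positive reviews classified correctly. The sketch below recomputes the metrics from that reconstructed matrix; the matrix is inferred from the report, not copied from the notebook output.

```python
# Rows = true class, columns = predicted class, order: [negative, positive]
cm = [
    [9, 7],   # 16 negative reviews: 9 correct, 7 predicted positive
    [3, 14],  # 17 positive reviews: 3 predicted negative, 14 correct
]
labels = ["negative", "positive"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

metrics = {}
for i, name in enumerate(labels):
    tp = cm[i][i]
    predicted = sum(row[i] for row in cm)  # column sum: everything predicted as this class
    actual = sum(cm[i])                    # row sum: the support of this class
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    metrics[name] = (precision, recall, f1)

for name, (precision, recall, f1) in metrics.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy = {accuracy:.2f}")
```

Read this way, the model still misses 7 of the 16 negative reviews in the test set, so most of the remaining error is on the negative class.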