Natural language processing project

In [1]:
!pip install nltk
import nltk
from preprocessing_pipeline import preprocess
import pandas as pd

data = pd.read_csv('amazon_books_Data.csv')
data = data.drop('Unnamed: 0', axis=1)
# Fix a misspelled label: 'negaitve' should read 'negative'
data['Sentiment_books'] = data['Sentiment_books'].replace('negaitve', 'negative')

data['processed_review'] = data['review_body'].apply(preprocess)

# preprocessed data
print(data['processed_review'])




[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0                                       [love, student]
1     [wife, order, 2, book, gave, presentson, frien...
2                      [great, book, like, other, seri]
3                                              [beauti]
4     [enjoy, author, stori, quilt, incred, plan, ma...
                            ...                        
95    [alway, talk, hygien, realli, made, think, tee...
96    [sourc, receiv, digit, copi, book, free, readt...
97    [youll, find, transport, back, wartim, forties...
98    [stori, long, cover, show, true, side, top, ec...
99    [love, bookbr, br, maya, interest, charact, ma...
Name: processed_review, Length: 100, dtype: object


We now have the preprocessed dataset. With tokenization and stemming done, we can proceed to vectorize the Amazon reviews.

Vectorization involves converting textual data into numerical vectors that can be used by a machine learning model. We can either use the Bag of Words or the TF-IDF (Term Frequency-Inverse Document Frequency) approach. I have chosen TF-IDF here because it not only considers the occurrence of a word in a single document but also evaluates its importance in the context of the entire corpus, thus providing a more nuanced representation of the textual data in the Amazon book review dataset.
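To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a made-up three-review corpus. The reviews and the `tfidf` helper are illustrative only and do not come from the dataset. The sketch assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is what scikit-learn documents as the default behavior of `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical, already-preprocessed reviews (not taken from the dataset).
corpus = [
    "great book great stori",
    "bore book",
    "great charact",
]

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times smoothed IDF (assumed to mirror the sklearn default).
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
        # L2-normalize so that every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

vectors = tfidf(corpus)
for doc, vec in zip(corpus, vectors):
    print(doc, {t: round(w, 3) for t, w in sorted(vec.items())})
```

In the second review, "bore" receives a higher weight than "book" because it occurs in only one document, whereas "book" occurs in two. A plain Bag of Words count would give both terms the same value of 1.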

In [2]:
# TF-IDF requires text input as a string rather than a list of words.
data['processed_review'] = data['processed_review'].apply(lambda x: ' '.join(x))


Next, we encode the target labels as integers using the LabelEncoder from the scikit-learn library.

In [3]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Instantiate a label encoder
label_encoder = LabelEncoder()
# Fit the encoder on the labels and replace them with integer codes
data['Sentiment_books'] = label_encoder.fit_transform(data['Sentiment_books'])

# Split the dataset into training and test sets
X = data.drop('Sentiment_books', axis=1)  # features
y = data['Sentiment_books']  # target

test_size = 0.2 
random_state = 42

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
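As a small illustration of what the encoding step produces, the sketch below mimics the mapping in plain Python. `LabelEncoder` stores its classes in sorted order, so "negative" becomes 0 and "positive" becomes 1. The sample labels are made up for the example.

```python
# Hypothetical sample of sentiment labels (not the real column).
labels = ["positive", "negative", "positive", "positive", "negative"]

# Sorted unique classes, as LabelEncoder keeps them in classes_.
classes = sorted(set(labels))
to_code = {label: code for code, label in enumerate(classes)}

encoded = [to_code[label] for label in labels]
print(classes)   # ['negative', 'positive']
print(encoded)   # [1, 0, 1, 1, 0]
```

This ordering is why a list such as ["negative", "positive"] lines up with the codes 0 and 1 when it is used to name the classes in a report.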


In [4]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

X = data['processed_review']
y = data['Sentiment_books']

# Use the train_test_split function to divide the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer on the training reviews only, then transform both sets
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Calculate the model's accuracy on the test set
accuracy = model.score(X_test, y_test)
print(f"Logistic Regression model accuracy: {accuracy}")


Logistic Regression model accuracy: 0.85


In [5]:
#Class labels
class_labels = ["negative", "positive"]
category_orders = {"Sentiment_books": class_labels}

I chose to study the model in more detail by generating a classification report and a confusion matrix to evaluate its performance on each sentiment class.

In [6]:
from sklearn.metrics import classification_report, confusion_matrix
import plotly.express as px

confusion_matrix_kwargs = dict(
    text_auto=True, 
    title="Confusion Matrix", width=1000, height=800,
    labels=dict(x="Predicted", y="True Label"),
    x=class_labels,
    y=class_labels,
    color_continuous_scale='Blues'
)

def report(y_true, y_pred, class_labels):
    print(classification_report(y_true, y_pred, target_names=class_labels))
    
    cm = confusion_matrix(y_true, y_pred)
    
    fig = px.imshow(
        cm, 
        **confusion_matrix_kwargs
    )
    fig.show()


In [7]:
y_true = y_test 
y_pred = model.predict(X_test) 

In [8]:
report(y_true, y_pred, class_labels)

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         3
    positive       0.85      1.00      0.92        17

    accuracy                           0.85        20
   macro avg       0.42      0.50      0.46        20
weighted avg       0.72      0.85      0.78        20



The classification report shows that the model fails to identify negative reviews: precision, recall, and F1-score for the negative class are all 0. The model in fact predicts "positive" for every review in the test set, which is why recall for the positive class is 1.00 and its precision (0.85) equals the share of positive reviews in the test set (17 of 20). The overall accuracy of 0.85 is therefore no better than always predicting the majority class.

The unequal distribution of positive (84) and negative (16) reviews is a likely cause. With so few negative examples, the model may not have seen enough training data for this class to learn to predict it.
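The numbers in the report can be reproduced without any model at all. The sketch below builds a constant predictor that always answers "positive" on a test set with the same composition as ours (3 negative, 17 positive) and computes the metrics by hand. The helper `precision_recall_f1` is written for this illustration only.

```python
# Test-set composition taken from the classification report.
n_negative, n_positive = 3, 17

y_true = [0] * n_negative + [1] * n_positive
y_pred = [1] * len(y_true)  # a "model" that always answers positive

def precision_recall_f1(y_true, y_pred, label):
    """Per-class metrics, returning 0.0 where a ratio is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}")
for label, name in [(0, "negative"), (1, "positive")]:
    p, r, f = precision_recall_f1(y_true, y_pred, label)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The output matches the report line by line (accuracy 0.85, positive 0.85 / 1.00 / 0.92, negative 0.00 throughout), which suggests that accuracy alone is not an informative metric on this dataset.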

First, let's test another model for the classification of our Amazon reviews. I chose an SVM because it can be effective for binary or multiclass classification, and it may handle imbalanced data better.

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

X = data['processed_review']
y = data['Sentiment_books']

# Same split as for the logistic regression, so the results are comparable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit TF-IDF on the training reviews only, then transform both sets
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# SVM with default hyperparameters
model_svm = SVC()
model_svm.fit(X_train, y_train)

# Accuracy on the test set
accuracy_svm = model_svm.score(X_test, y_test)
print(f"Accuracy of SVM : {accuracy_svm}")


Accuracy of SVM : 0.85


In [10]:
y_true = y_test  
y_pred = model_svm.predict(X_test)
report(y_true, y_pred, class_labels)


              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         3
    positive       0.85      1.00      0.92        17

    accuracy                           0.85        20
   macro avg       0.42      0.50      0.46        20
weighted avg       0.72      0.85      0.78        20




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.





Unfortunately, the results still show an accuracy of 0.85, and the precision, recall, and F1-score for the negative class are all 0. This suggests that the SVM model also struggles to classify negative reviews.

Improve on the baseline


To improve results on the negative class, we rebalance the data so that both classes are equally represented.

In [11]:
data['Sentiment_books']

0     1
1     1
2     1
3     1
4     1
     ..
95    1
96    0
97    1
98    1
99    1
Name: Sentiment_books, Length: 100, dtype: int32

Oversampling the minority class to balance negative and positive sentiments

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.utils import resample


data_majority = data[data['Sentiment_books'] == 1]
data_minority = data[data['Sentiment_books'] == 0]

# Oversampling the minority class to resolve the problem
data_minority_upsampled = resample(data_minority,
                                  replace=True,  
                                  n_samples = len(data_majority) ,  
                                  random_state=42)  


# Merge the majority data with the upsampled data
data_upsampled = pd.concat([data_majority, data_minority_upsampled])

# Split the balanced data into training and testing sets
# (caution: splitting after oversampling lets copies of the same review fall in both sets)
X = data_upsampled['processed_review']
y = data_upsampled['Sentiment_books']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Vectorize the oversampled split from the previous cell
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Train the SVM and measure its accuracy on the test set
model_svm = SVC()
model_svm.fit(X_train, y_train)

accuracy_svm = model_svm.score(X_test, y_test)
print(f"Accuracy of SVM : {accuracy_svm}")


Accuracy of SVM : 1.0



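After oversampling, the SVM reaches an accuracy of 1.0 on the test set. This score is misleading rather than reassuring: the minority class was oversampled before the train/test split, so copies of the same negative reviews appear in both the training and the test set, and the model is partly evaluated on reviews it has already seen. The figure therefore says little about how the model would perform on new, unseen data.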
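One way to avoid evaluating on duplicated reviews is to hold out the test set first and oversample only the training rows. The sketch below shows the idea in plain Python on stand-in data: the `(review_id, label)` pairs are hypothetical, and the fixed 3/17 hold-out simply mirrors the class counts seen in the reports. In the notebook itself, the equivalent would presumably be `train_test_split` followed by `resample` applied to the training portion only.

```python
import random

rng = random.Random(42)

# Hypothetical stand-in for the dataset: (review_id, label) pairs with
# 16 negative (0) and 84 positive (1) reviews, as in the notebook.
negatives = [(i, 0) for i in range(16)]
positives = [(i, 1) for i in range(16, 100)]

# 1. Hold out the test set first (3 negative, 17 positive), before resampling.
test = negatives[:3] + positives[:17]
train_minority = negatives[3:]
train_majority = positives[17:]

# 2. Oversample the minority class with replacement, training rows only.
upsampled = rng.choices(train_minority, k=len(train_majority))
train_balanced = train_majority + upsampled

# 3. Check that no test review leaked into the training set.
overlap = {rid for rid, _ in train_balanced} & {rid for rid, _ in test}
print(len(train_balanced), len(overlap))  # 134 0
```

With this ordering the test set keeps its original, imbalanced composition, so its metrics reflect what the model would face on new reviews.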

I then tried another approach: an SVM with explicitly specified class weights.

In [20]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report
import numpy as np

X = data['processed_review']
y = data['Sentiment_books']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

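# Class weights (note: equal weights do not favor the minority class;
# class_weight='balanced' would weight classes inversely to their frequency)
class_weights = {0: 0.5, 1: 0.5}

# Specify class weights
model_svm_weighted = SVC(class_weight=class_weights)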

# Perform k-fold cross-validation
scores = cross_val_score(model_svm_weighted, X_train, y_train, cv=5)

# Print the cross-validation scores
print("Cross-Validation Scores:", scores)

# Fit the model on the training data
model_svm_weighted.fit(X_train, y_train)

# Model accuracy 
accuracy_svm_weighted = model_svm_weighted.score(X_test, y_test)
print(f"Model Accuracy after Class Weighting: {accuracy_svm_weighted}")


unique, counts = np.unique(y_train, return_counts=True)
class_balance = dict(zip(unique, counts))
print("Class Balance:")
print(class_balance)

# Classification report
class_labels = ["negative", "positive"]
y_true = y_test
y_pred = model_svm_weighted.predict(X_test)
print("Classification Report:")
print(classification_report(y_true, y_pred, target_names=class_labels))


Cross-Validation Scores: [0.875  0.875  0.8125 0.8125 0.8125]
Model Accuracy after Class Weighting: 0.85
Class Balance:
{0: 13, 1: 67}
Classification Report:
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         3
    positive       0.85      1.00      0.92        17

    accuracy                           0.85        20
   macro avg       0.42      0.50      0.46        20
weighted avg       0.72      0.85      0.78        20




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.





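The SVM with the specified class weights still fails to classify negative reviews. This is expected here: the weights {0: 0.5, 1: 0.5} are equal, so they scale the penalty for both classes by the same factor and do not compensate for the imbalance. Weights inversely proportional to class frequency (for example class_weight='balanced' in scikit-learn) would be needed to favor the minority class.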
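For reference, here is a minimal sketch of weights that do reflect the imbalance. It applies the heuristic that scikit-learn documents for `class_weight='balanced'`, namely n_samples / (n_classes * class_count), to the training-set class counts printed above. Whether these weights actually improve the negative-class metrics on this dataset has not been tested here.

```python
# Training-set class counts printed by the cell above.
class_counts = {0: 13, 1: 67}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" heuristic: n_samples / (n_classes * count_of_class).
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in balanced_weights.items():
    # With these weights, each class contributes the same total weight.
    total = weight * class_counts[label]
    print(f"class {label}: weight={weight:.3f}, total weight={total:.1f}")
```

The negative class receives a weight of about 3.08 and the positive class about 0.60, so each class carries a total weight of 40. Passing `class_weight='balanced'` to `SVC` should apply the same rule automatically.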

Next, I focused on improving the model by searching for the hyperparameters that maximize its cross-validated accuracy, without oversampling.

In [21]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameters to optimize
param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['rbf', 'linear', 'poly']}

# GridSearchCV 
grid_search = GridSearchCV(SVC(class_weight=class_weights), param_grid, cv=5)

# Run the grid search on the training data
grid_search.fit(X_train, y_train)

# Display the best hyperparameters and the best score
print("Best parameters: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)

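# Retrieve the best model (GridSearchCV already refits it on the full
# training set by default, so the explicit fit below is redundant but harmless)
best_model = grid_search.best_estimator_
best_model.fit(X_train, y_train)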

# Evaluate the performance on the test data
accuracy = best_model.score(X_test, y_test)
print(f"Model Accuracy: {accuracy}")


Best parameters:  {'C': 0.1, 'gamma': 1, 'kernel': 'rbf'}
Best cross-validation score:  0.8375
Model Accuracy: 0.85
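To get a sense of what this grid search evaluates, the sketch below enumerates the candidate combinations with `itertools.product`. It also counts how many are effectively distinct, under the assumption that `gamma` has no effect on the linear kernel (it is a coefficient of the rbf and poly kernels).

```python
from itertools import product

# Same grid and fold count as in the cell above.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1, 0.1, 0.01, 0.001],
    "kernel": ["rbf", "linear", "poly"],
}
cv = 5

# Every combination of one value per hyperparameter.
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(combos) * cv

# Linear-kernel candidates that differ only in gamma are the same model.
distinct = {
    (c["kernel"], c["C"], c["gamma"] if c["kernel"] != "linear" else None)
    for c in combos
}
print(len(combos), n_fits, len(distinct))  # 48 240 36
```

The search therefore runs 240 cross-validation fits for 36 distinct models. Since no `scoring` argument is given, candidates are ranked by accuracy, which on this dataset rewards predicting the majority class. A scorer such as "f1_macro" or "balanced_accuracy" might be a better fit for comparing them.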


Next, I run stratified 5-fold cross-validation to assess the Support Vector Machine (SVM) model on several train/validation splits rather than a single one. The accuracy scores, classification reports, and confusion matrices for each fold give a more complete picture of the model's predictive capabilities.

In [22]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix
import plotly.express as px

X = data['processed_review']
y = data['Sentiment_books']

# Number of folds for cross-validation
n_splits = 5


stratified_kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)


metric_scores = []

# Perform cross-validation
for train_index, val_index in stratified_kfold.split(X, y):
    # split() yields positional indices, so select rows with .iloc
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]

   
    vectorizer = TfidfVectorizer()

    # Vectorize the Amazon reviews (fit on the training fold only)
    X_train = vectorizer.fit_transform(X_train)
    X_val = vectorizer.transform(X_val)

    # Class weights (equal here, so they do not rebalance the classes)
    class_weights = {0: 0.5, 1: 0.5}

    # SVM with the class weights above
    model_svm_weighted = SVC(class_weight=class_weights)
    model_svm_weighted.fit(X_train, y_train)

    # Model's accuracy on the validation set
    accuracy_svm_weighted = model_svm_weighted.score(X_val, y_val)
    metric_scores.append(accuracy_svm_weighted)

    # Confusion matrix
    y_true = y_val
    y_pred = model_svm_weighted.predict(X_val)
    class_labels = ['negative', 'positive']
    report(y_true, y_pred, class_labels)




              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         3
    positive       0.85      1.00      0.92        17

    accuracy                           0.85        20
   macro avg       0.42      0.50      0.46        20
weighted avg       0.72      0.85      0.78        20




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.





              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         3
    positive       0.85      1.00      0.92        17

    accuracy                           0.85        20
   macro avg       0.42      0.50      0.46        20
weighted avg       0.72      0.85      0.78        20



              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         3
    positive       0.85      1.00      0.92        17

    accuracy                           0.85        20
   macro avg       0.42      0.50      0.46        20
weighted avg       0.72      0.85      0.78        20







              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         3
    positive       0.85      1.00      0.92        17

    accuracy                           0.85        20
   macro avg       0.42      0.50      0.46        20
weighted avg       0.72      0.85      0.78        20



              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         4
    positive       0.80      1.00      0.89        16

    accuracy                           0.80        20
   macro avg       0.40      0.50      0.44        20
weighted avg       0.64      0.80      0.71        20







The classification reports show the same pattern in every fold: the model labels every review as positive, so recall for the positive class is 1.00 while precision, recall, and F1-score for the negative class are all 0.00. The accuracy of 0.80 to 0.85 therefore only reflects the share of positive reviews in each validation fold. To address this limitation, we will explore other machine learning models to improve the classification of both positive and negative sentiments.
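The support columns above (3 or 4 negative reviews per fold) follow from stratification. The sketch below illustrates the principle with a simplified round-robin assignment in plain Python. It is not the algorithm `StratifiedKFold` uses internally (which also shuffles here), only an illustration of how class proportions are preserved across folds.

```python
# Class counts in the full dataset: 16 negative (0) and 84 positive (1).
labels = [0] * 16 + [1] * 84
n_splits = 5

# Deal the samples of each class to the folds in turn, keeping one
# running position so that fold sizes stay equal.
folds = [[] for _ in range(n_splits)]
position = 0
for cls in sorted(set(labels)):
    for index, label in enumerate(labels):
        if label == cls:
            folds[position % n_splits].append(index)
            position += 1

for k, fold in enumerate(folds):
    n_neg = sum(labels[i] == 0 for i in fold)
    print(f"fold {k}: {n_neg} negative, {len(fold) - n_neg} positive")
```

Each fold ends up with 20 reviews, one fold holding 4 negative reviews and the other four holding 3, which is the same distribution as in the reports. With so few negative examples per fold, per-class metrics remain noisy even when the model does predict the negative class.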