In [22]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In [23]:
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(bookReviewDataSet_filename)

df.head()

Unnamed: 0,Review,Positive Review
0,This was perhaps the best of Johannes Steinhof...,True
1,This very fascinating book is a story written ...,True
2,The four tales in this collection are beautifu...,True
3,The book contained more profanity than I expec...,False
4,We have now entered a second time of deep conc...,True


1. The dataset used for this analysis is the book review dataset (`bookReviewsData.csv`).
2. The goal of this analysis is to predict the sentiment of a book review. The label is 'Positive Review', a binary label indicating whether the review is positive (True) or negative (False).
3. This is a supervised learning problem because we are predicting a specific outcome from labeled data. It is a binary classification problem because each review is assigned to one of two classes, positive or negative.
4. The only feature is the text of the review ('Review').
5. This information could be useful to a company for several reasons. Customer feedback can be analyzed to find common themes and areas of improvement for products. Depending on the context of the reviews, companies can also measure sentiment around different marketing campaigns and adjust strategies based on customer reactions.

In [24]:
df.describe()

Unnamed: 0,Review,Positive Review
count,1973,1973
unique,1865,2
top,I have read several of Hiaasen's books and lov...,False
freq,3,993


* The count (1973) is larger than the number of unique reviews (1865), so 108 rows repeat the text of an earlier review.
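
The gap between `count` and `unique` can also be checked directly. The sketch below uses a small hypothetical frame (`toy`, not the notebook's data) to show that `duplicated` and "count minus unique" agree:

```python
import pandas as pd

# Hypothetical toy frame standing in for the notebook's `df`.
toy = pd.DataFrame({
    "Review": ["great book", "dull plot", "great book", "loved it"],
    "Positive Review": [True, False, True, True],
})

# Rows whose Review text repeats an earlier row (keep='first' is the default).
n_duplicates = toy.duplicated(subset="Review").sum()

# The same number, derived the way describe() reports it: count minus unique.
n_from_counts = toy["Review"].count() - toy["Review"].nunique()

print(n_duplicates, n_from_counts)
```

Applied to the notebook's frame, the same two expressions should both match the difference between the `count` and `unique` rows shown above.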

In [25]:
df.isna().sum()

Review             0
Positive Review    0
dtype: int64

In [26]:
df['Positive Review'].value_counts()

False    993
True     980
Name: Positive Review, dtype: int64

* Check for class imbalance: the classes are close to balanced, with 993 negative reviews (50.3%) and 980 positive reviews (49.7%).
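
As a quick worked check of the balance, the class proportions can be computed from the two counts printed above. This is plain arithmetic on the reported numbers; the dictionary keys are labels chosen here for readability:

```python
# Counts copied from the value_counts() output above.
counts = {"negative": 993, "positive": 980}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Ratio of the larger class to the smaller one; 1.0 means perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in proportions.items():
    print(f"{label}: {share:.3f}")
print(f"imbalance ratio: {imbalance_ratio:.3f}")
```

On the notebook's frame, `df['Positive Review'].value_counts(normalize=True)` returns the same proportions in one call.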

There is only one feature, the review text, which will be vectorized before modeling. Duplicate reviews will be removed as part of the cleaning process. Preprocessing converts all words to lowercase and then removes punctuation. Common stop words are removed to reduce noise in the data, and the remaining words are reduced to base forms using lemmatization followed by stemming in order to reduce dimensionality.
TF-IDF vectorization will be used to vectorize the input documents, weighting each word by its importance to a document relative to the rest of the corpus.
Logistic regression will be used for modeling, as it is well suited to binary classification and is simple and interpretable.
The dataset will be split into training and testing sets using a 70-30 split. The training data will then be transformed into TF-IDF vectors.
A logistic regression model will then be trained on the training data, using cross-validation as an extra check on generalization.
The model will then be evaluated using the AUC score and a confusion matrix to understand the types of errors it makes.
Hyperparameters can be tuned further, if needed, through techniques like grid search in order to improve performance.
If hyperparameter tuning does not improve performance, the preprocessing may need to be adjusted, for example by changing the n-gram range or dropping stemming/lemmatization.
Other models, such as random forest and XGBoost, will be trained to compare performance and support model selection.
Together, these techniques should produce a model that generalizes well to new data and accurately separates positive from negative book reviews.

In [27]:
!pip install nltk
!pip install xgboost
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


[nltk_data] Downloading package stopwords to /home/ubuntu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ubuntu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/ubuntu/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [28]:
df = df.drop_duplicates(subset='Review', keep='first')

In [29]:
df['Positive Review'] = df['Positive Review'].astype(int)

In [30]:
# Raw string avoids an invalid-escape warning for \w and \s; the pattern itself is unchanged.
df['Review'] = df['Review'].str.lower().str.replace(r'[^\w\s]', '', regex=True)

* Removed duplicate reviews, converted the 'Positive Review' label from boolean to integer (1 = positive, 0 = negative), converted the review text to lowercase, and removed punctuation.
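
To make the effect of the cleaning pattern concrete, here is a minimal sketch that applies the same lowercase-then-strip-punctuation step to a single string with the standard `re` module. The helper name `clean` and the sample sentence are hypothetical:

```python
import re

def clean(text: str) -> str:
    """Lowercase the text, then drop every character that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower())

sample = "Hiaasen's books aren't dull; they're GREAT!"
cleaned = clean(sample)
print(cleaned)  # hiaasens books arent dull theyre great
```

One side effect worth noting: contractions lose their apostrophes ("aren't" becomes "arent"), so they may no longer match apostrophe-bearing entries in a stop word list applied afterwards.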

In [31]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
df['Review'] = df['Review'].apply(lambda text: ' '.join([stemmer.stem(lemmatizer.lemmatize(word)) for word in text.split() if word not in stop_words]))

* Removed stop words, then lemmatized and stemmed each remaining word to reduce noise and dimensionality.

In [32]:
vectorizer = TfidfVectorizer(ngram_range = (1,1))
X = vectorizer.fit_transform(df['Review'])
y = df['Positive Review']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
model = LogisticRegression(max_iter = 200)
model.fit(X_train, y_train)

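* Used TF-IDF vectorization to convert the text into numeric vectors that the models can take as input.
Note that the vectorizer is fit on the full dataset before the split, so the vocabulary and IDF weights also reflect the test reviews; fitting it on the training reviews only would give a cleaner estimate of generalization.
Used a 70/30 train/test split to validate performance, and fit a logistic regression model to the training data.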
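
One way to keep the vectorizer from seeing test reviews is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The following is a minimal sketch under that assumption, not the code used for the results in this notebook; the toy reviews and labels are hypothetical stand-ins for `df['Review']` and `df['Positive Review']`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned toy reviews (1 = positive, 0 = negative).
texts = [
    "love stori great charact", "wonder read enjoy everi page",
    "beauti written move tale", "great plot love end",
    "bore plot weak charact", "wast time poor write",
    "dull stori bad end", "poor edit bore read",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text, not the TF-IDF matrix.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=123, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then fits the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 1)),
    LogisticRegression(max_iter=200),
)
pipeline.fit(text_train, y_train)

# The learned vocabulary contains only words seen in training reviews.
vocabulary = pipeline.named_steps["tfidfvectorizer"].vocabulary_
test_probabilities = pipeline.predict_proba(text_test)[:, 1]
print(len(vocabulary), test_probabilities)
```

A pipeline also works with `cross_val_score` and `GridSearchCV`, which refit the vectorizer inside each fold.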

In [33]:
prob_predictions = model.predict_proba(X_test)[:,1]
class_predictions = model.predict(X_test)

In [34]:
auc = roc_auc_score(y_test, prob_predictions)
print("AUC score:", auc)
accuracy = accuracy_score(y_test, class_predictions)
print("Accuracy:", accuracy)

AUC score: 0.878220200368023
Accuracy: 0.7839285714285714


In [35]:
cm = confusion_matrix(y_test, class_predictions)
print('Confusion Matrix:\n', cm)

Confusion Matrix:
 [[192  76]
 [ 45 247]]
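
The matrix above follows scikit-learn's layout: rows are true classes, columns are predicted classes, so it reads `[[TN, FP], [FN, TP]]`. As a worked example, the usual summary metrics can be derived from the four reported cells:

```python
import numpy as np

# Values copied from the confusion matrix printed above.
cm = np.array([[192, 76],
               [45, 247]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all reviews classified correctly
precision = tp / (tp + fp)        # of reviews predicted positive, share truly positive
recall = tp / (tp + fn)           # of truly positive reviews, share found
specificity = tn / (tn + fp)      # of truly negative reviews, share found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f}")
```

The accuracy recovered this way matches the value printed earlier, and the gap between recall and specificity shows that the model labels negative reviews as positive (76 cases) more often than the reverse (45 cases).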


In [36]:
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
print("Cross-Validation AUC Scores: ", cv_scores)
print("Mean Cross-Validation AUC Score: ", cv_scores.mean())

Cross-Validation AUC Scores:  [0.89168825 0.86207911 0.85954792 0.86656062 0.82968069]
Mean Cross-Validation AUC Score:  0.8619113183228556


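* On the test set, logistic regression reaches an AUC of about 0.88 and an accuracy of about 78%. The mean 5-fold cross-validation AUC on the training set is about 0.86.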
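
The fold scores can be summarized by their spread as well as their mean. A minimal sketch that reuses the five reported values (no model is refit here):

```python
import numpy as np

# Fold scores copied from the cross-validation output above.
cv_scores = np.array([0.89168825, 0.86207911, 0.85954792, 0.86656062, 0.82968069])

mean_auc = cv_scores.mean()
std_auc = cv_scores.std(ddof=1)                # sample standard deviation across folds
score_range = cv_scores.max() - cv_scores.min()

print(f"mean={mean_auc:.4f} std={std_auc:.4f} range={score_range:.4f}")
```

With only five folds the standard deviation is a rough indicator, but it gives a sense of how much a difference between two models' mean scores should be trusted.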

In [37]:
param_grid = {
    'C': [0.1, 1, 10, 100]
}

grid_search = GridSearchCV(LogisticRegression(max_iter=200), param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)

print("Best parameters found: ", grid_search.best_params_)
print("Best grid-search AUC score for 'C' hyperparam in logistic regression: ", grid_search.best_score_)

Best parameters found:  {'C': 10}
Best grid-search AUC score for 'C' hyperparam in logistic regression:  0.8629728609711236


* Grid search was used to find the best value of 'C', the inverse of the regularization strength. The best value, C=10, gave a
  cross-validated AUC of 0.863, essentially the same as the 0.862 obtained with the default C=1, so tuning 'C' had little effect on performance.
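
The project plan also mentions adjusting the n-gram range. One possible way to search over it together with 'C' is to tune a pipeline, so that each candidate refits the vectorizer inside every fold. This is a sketch on hypothetical toy data with made-up step names (`tfidf`, `clf`), not a run on the book review dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical, already-cleaned toy reviews (1 = positive, 0 = negative).
texts = [
    "love stori great charact", "wonder read enjoy everi page",
    "beauti written move tale", "great plot love end",
    "bore plot weak charact", "wast time poor write",
    "dull stori bad end", "poor edit bore read",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=200)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10, 100],
}

# cv=2 only because the toy corpus is tiny; the notebook uses cv=5.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="roc_auc")
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

The scores on eight toy reviews are not meaningful; the point is the structure of the search.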

In [38]:
model = RandomForestClassifier(random_state=123)
model.fit(X_train, y_train)
prob_predictions = model.predict_proba(X_test)[:,1]
class_predictions = model.predict(X_test)

auc = roc_auc_score(y_test, prob_predictions)
accuracy = accuracy_score(y_test, class_predictions)
print("Random Forest AUC score:", auc)
print("Random Forest Accuracy:", accuracy)

Random Forest AUC score: 0.8665086383152729
Random Forest Accuracy: 0.7910714285714285


In [39]:
model = XGBClassifier(random_state=123)
model.fit(X_train, y_train)

prob_predictions = model.predict_proba(X_test)[:, 1]
class_predictions = model.predict(X_test)

auc = roc_auc_score(y_test, prob_predictions)
accuracy = accuracy_score(y_test, class_predictions)
cm = confusion_matrix(y_test, class_predictions)

print("XGBoost AUC score:", auc)
print("XGBoost Accuracy:", accuracy)

XGBoost AUC score: 0.820154365160499
XGBoost Accuracy: 0.7571428571428571


* The data was prepared by removing duplicate reviews, converting the text to lowercase, removing punctuation and stop words, and applying lemmatization and stemming. The text was then vectorized using TF-IDF and split into training and test sets. A logistic regression model achieved a test AUC of 0.878 and an accuracy of 78.4%, with 76 false positives and 45 false negatives. A grid search over the hyperparameter 'C' did not meaningfully improve the cross-validated AUC. A random forest classifier achieved a slightly lower AUC of 0.867 but a slightly higher accuracy of 79.1%, while an XGBoost model with default settings reached an AUC of 0.820 and an accuracy of 75.7%. Logistic regression provided the best balance between the two metrics: it has the highest AUC, and its accuracy is within one percentage point of the random forest's. The AUC score measures how well the predicted probabilities rank positive reviews above negative ones across all decision thresholds, with higher values indicating better separation, while accuracy is the proportion of reviews classified correctly at the default threshold of 0.5. Every model's AUC is higher than its accuracy, but the two numbers are on different scales and are not directly comparable, so this gap alone does not show that ranking reviews is easier than classifying them.
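
The difference between AUC and accuracy can be illustrated on a toy example. In the sketch below (made-up labels and scores, not model output), the scores rank every positive above every negative, so AUC is perfect, yet accuracy at the default 0.5 threshold is poor because all scores sit above 0.5:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.55, 0.60, 0.65, 0.70, 0.80, 0.90])

# AUC depends only on the ranking of the scores.
auc = roc_auc_score(y_true, scores)

# Accuracy depends on where the decision threshold is placed.
acc_default = accuracy_score(y_true, (scores >= 0.5).astype(int))
acc_shifted = accuracy_score(y_true, (scores >= 0.675).astype(int))

print(auc, acc_default, acc_shifted)  # 1.0 0.5 1.0
```

This is why a model can have a high AUC and a modest accuracy at the same time, and why adjusting the decision threshold is one option to explore before changing the model.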