# Problem statement: Classification model to analyze Amazon product reviews

The objective is to create a classification model that will analyze Amazon product reviews to classify sentiments as positive or negative. Here's a breakdown of the steps involved in this workflow:

- Step 1: Load the Dataset
- Step 2: Data Pre-processing
- Step 3: Feature Selection
- Step 4: Model Selection
- Step 5: Training the Model
- Step 6: Model Evaluation
- Step 7: Hyperparameter Tuning
- Step 8: Cross Validation

The notebook contains 7 exercises in total:

* [Exercise 1](#ex_1)
* [Exercise 2](#ex_2)
* [Exercise 3](#ex_3)
* [Exercise 4](#ex_4)
* [Exercise 5](#ex_5)
* [Exercise 6](#ex_6)
* [Exercise 7](#ex_7)

## Step 1: Load the dataset
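First, let's load the dataset. In Google Colab, `files.upload()` lets you upload the CSV file from your own machine; you then read it into a pandas DataFrame. Uploaded files are saved to the current working directory, so adjust the path passed to `read_csv` if your file is not under `Datasets/`.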

In [None]:
from google.colab import files
uploaded = files.upload()

In [18]:
# Import necessary libraries
import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv('Datasets/amazon-product-review-data.csv')

# Display the first few rows to check if the data is loaded correctly
df.head()



Unnamed: 0,market_place,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date,sentiments
0,"""US""","""42521656""","""R26MV8D0KG6QI6""","""B000SAQCWC""","""159713740""","""The Cravings Place Chocolate Chunk Cookie Mix...","""Grocery""",1,0,0,0 \t(N),1 \t(Y),"""Using these for years - love them.""","""As a family allergic to wheat, dairy, eggs, n...",2015-08-31,positive
1,"""US""","""12049833""","""R1OF8GP57AQ1A0""","""B00509LVIQ""","""138680402""","""Mauna Loa Macadamias, 11 Ounce Packages""","""Grocery""",1,0,0,0 \t(N),1 \t(Y),"""Wonderful""","""My favorite nut. Creamy, crunchy, salty, and ...",2015-08-31,positive
2,"""US""","""107642""","""R3VDC1QB6MC4ZZ""","""B00KHXESLC""","""252021703""","""Organic Matcha Green Tea Powder - 100% Pure M...","""Grocery""",1,0,0,0 \t(N),0 \t(N),"""Five Stars""","""This green tea tastes so good! My girlfriend ...",2015-08-31,positive
3,"""US""","""6042304""","""R12FA3DCF8F9ER""","""B000F8JIIC""","""752728342""","""15oz Raspberry Lyons Designer Dessert Syrup S...","""Grocery""",1,0,0,0 \t(N),1 \t(Y),"""Five Stars""","""I love Melissa's brand but this is a great se...",2015-08-31,positive
4,"""US""","""18123821""","""RTWHVNV6X4CNJ""","""B004ZWR9RQ""","""552138758""","""Stride Spark Kinetic Fruit Sugar Free Gum, 14...","""Grocery""",1,0,0,0 \t(N),1 \t(Y),"""Five Stars""","""good""",2015-08-31,positive


## Step 2: Data Pre-processing

In this step, we drop rows with missing values, encode the sentiment labels as integers, convert the review text into TF-IDF features, and split the data into training and testing sets.





In [2]:
# Import necessary libraries for data pre-processing
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Remove any rows with missing values
df.dropna(inplace=True)

# Encode the 'sentiments' column as integers; LabelEncoder sorts the classes,
# so 'negative' becomes 0 and 'positive' becomes 1
le = LabelEncoder()
df['sentiments'] = le.fit_transform(df['sentiments'])

# Text data preprocessing using TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf_vectorizer.fit_transform(df['review_body']).toarray()
y = df['sentiments'].values

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting data
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (400, 3466)
X_test shape: (100, 3466)
y_train shape: (400,)
y_test shape: (100,)
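
To confirm which integer each sentiment label receives, we can inspect the fitted encoder. The sketch below uses a small hypothetical label list (`sample_labels`) in place of `df['sentiments']`; `LabelEncoder` stores its classes in sorted order, so `negative` should map to 0 and `positive` to 1.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for df['sentiments']
sample_labels = ["positive", "negative", "positive", "positive"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sample_labels)

# classes_ is sorted, so the position of each class is its integer code
label_mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print("Mapping:", label_mapping)
print("Encoded labels:", encoded.tolist())
```

Knowing the mapping matters later: in the classification report, row `0` is the negative class and row `1` is the positive class.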


<a name="ex_1"></a>
## Exercise 1

- Use the train_test_split function and change the test_size to 0.3

This way, the training set (X and y) holds 70% of the data and the testing set (X and y) holds 30%.

In [3]:
# Split the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Display the shapes of the resulting data
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (350, 3466)
X_test shape: (150, 3466)
y_train shape: (350,)
y_test shape: (150,)
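
Negative reviews are a small minority in this dataset, so a purely random split can give the test set a different class ratio from the training set. As an optional variation that this notebook does not use, `train_test_split` accepts a `stratify` argument that preserves the ratio. The sketch below runs on hypothetical synthetic labels (`y_demo`, 80 negative and 420 positive), not on the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 negative (0) and 420 positive (1)
y_demo = np.array([0] * 80 + [1] * 420)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratify, both subsets keep about the same share of positive labels
print("Positive share overall:", y_demo.mean())
print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```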


## Step 3: Feature Selection

In this step, we'll perform feature selection to reduce the dimensionality of the TF-IDF vectorized data and potentially improve the model's performance. Here we use the chi-squared (chi2) test to keep the 1,000 highest-scoring features out of 3,466; mutual information is a common alternative scoring function.

In [5]:
from sklearn.feature_selection import SelectKBest, chi2

# Apply feature selection using chi-squared (chi2) test
# You can adjust the number of features (k) as needed
k = 1000
selector = SelectKBest(chi2, k=k)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Display the shapes of the selected feature sets
print("X_train_selected shape:", X_train_selected.shape)
print("X_test_selected shape:", X_test_selected.shape)

X_train_selected shape: (350, 1000)
X_test_selected shape: (150, 1000)


<a name="ex_2"></a>
## Exercise 2

- Compare the X_train_selected shape and X_test_selected shape with the new test_size=0.3

In [6]:
# Display the shapes of the selected feature sets with the new test size
print("X_train_selected shape with test_size=0.3:", X_train_selected.shape)
print("X_test_selected shape with test_size=0.3:", X_test_selected.shape)

X_train_selected shape with test_size=0.3: (350, 1000)
X_test_selected shape with test_size=0.3: (150, 1000)


We have successfully performed feature selection, reducing the dimensionality of the data while retaining the most important features.


## Step 4: Model Selection
For sentiment analysis, you can use various machine learning algorithms like Logistic Regression, Naive Bayes, Support Vector Machines, or even deep learning models like LSTM or BERT. Since you're a beginner, let's start with a simple model like Logistic Regression.

In [13]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
model = LogisticRegression(random_state=42, class_weight='balanced')


<a name="ex_3"></a>
## Exercise 3

What does the random_state (parameter of the LogisticRegression) represent?

**Answer**: The `random_state` parameter seeds the random number generator so that results are reproducible: running the same code with the same `random_state` value gives the same results each time, which is useful for debugging and for comparing models. In `LogisticRegression` it is only used when the solver is `'sag'`, `'saga'`, or `'liblinear'`, which shuffle the data. With the default `'lbfgs'` solver used here, the fit is deterministic and the parameter has no effect, although setting it is harmless and keeps the code reproducible if the solver is changed later.
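
As a quick check of how `random_state` behaves with the default solver, the sketch below fits the same model with three different seeds on hypothetical synthetic data (`X_demo`, `y_demo`) and compares the learned coefficients. It assumes the default `lbfgs` solver, which does not draw random numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data, only for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

coefs = []
for seed in (0, 1, 42):
    clf = LogisticRegression(random_state=seed)  # default solver: 'lbfgs'
    clf.fit(X_demo, y_demo)
    coefs.append(clf.coef_.copy())

# With lbfgs the seed is not used, so every fit yields the same coefficients
same_coefficients = all(np.allclose(coefs[0], c) for c in coefs[1:])
print("Same coefficients for all seeds:", same_coefficients)
```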

## Step 5: Training the Model

Now that we have initialized our Logistic Regression model, it's time to train it on the selected features from the training dataset.



In [14]:

# Train the Logistic Regression model on the selected features
model.fit(X_train_selected, y_train)

# We can now proceed to Step 6: Model Evaluation

## Step 6: Model Evaluation

In this step, we'll evaluate the performance of the trained Logistic Regression model using the testing data.

- We import necessary metrics from `sklearn.metrics` such as `accuracy_score`, `classification_report`, and `confusion_matrix`.
- We use the trained model to predict sentiment labels (`y_pred`) for the test data (`X_test_selected`).
- We calculate the accuracy of the model by comparing the predicted labels to the true labels.
- We display a classification report that includes precision, recall, F1-score, and support for both positive and negative sentiment classes.
- We display a confusion matrix to visualize the true positive, true negative, false positive, and false negative predictions.



In [15]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predict sentiment labels for the test data
y_pred = model.predict(X_test_selected)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Display a classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Display a confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.8266666666666667

Classification Report:
              precision    recall  f1-score   support

           0       0.38      0.22      0.28        23
           1       0.87      0.94      0.90       127

    accuracy                           0.83       150
   macro avg       0.63      0.58      0.59       150
weighted avg       0.79      0.83      0.81       150


Confusion Matrix:
[[  5  18]
 [  8 119]]
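
The numbers in the classification report can be reproduced by hand from the confusion matrix, which helps clarify what precision and recall mean for each class. This is a minimal sketch in plain Python, assuming scikit-learn's convention that rows are true labels and columns are predicted labels; the helper `per_class_metrics` is our own, not a library function.

```python
# Confusion matrix printed above: rows = true labels, columns = predictions
confusion = [[5, 18],
             [8, 119]]

def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1) for class index `cls`."""
    n = len(matrix)
    tp = matrix[cls][cls]
    fp = sum(matrix[r][cls] for r in range(n) if r != cls)  # predicted cls, actually other
    fn = sum(matrix[cls][c] for c in range(n) if c != cls)  # actually cls, predicted other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in confusion)
accuracy_from_cm = (confusion[0][0] + confusion[1][1]) / total

neg_precision, neg_recall, neg_f1 = per_class_metrics(confusion, 0)
pos_precision, pos_recall, pos_f1 = per_class_metrics(confusion, 1)

print(f"accuracy: {accuracy_from_cm:.4f}")
print(f"class 0: precision={neg_precision:.2f} recall={neg_recall:.2f} f1={neg_f1:.2f}")
print(f"class 1: precision={pos_precision:.2f} recall={pos_recall:.2f} f1={pos_f1:.2f}")
```

Only 5 of the 23 negative reviews are recognized as negative, which is why the overall accuracy looks much better than the recall for class 0.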


<a name="ex_4"></a>
## Exercise 4

- Compare the results of the new data split (test_size=0.3) with the results of the original split (test_size=0.2).

Model results with the original train/test split (test_size=0.2):

In [28]:
# Import necessary libraries for data pre-processing
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Remove any rows with missing values
df.dropna(inplace=True)

# Encode the 'sentiments' column as integers; LabelEncoder sorts the classes,
# so 'negative' becomes 0 and 'positive' becomes 1
le = LabelEncoder()
df['sentiments'] = le.fit_transform(df['sentiments'])

# Text data preprocessing using TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf_vectorizer.fit_transform(df['review_body']).toarray()
y = df['sentiments'].values

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting data
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


# Feature selection
from sklearn.feature_selection import SelectKBest, chi2

# Apply feature selection using chi-squared (chi2) test
# You can adjust the number of features (k) as needed
k = 1000
selector = SelectKBest(chi2, k=k)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Display the shapes of the selected feature sets
print("X_train_selected shape:", X_train_selected.shape)
print("X_test_selected shape:", X_test_selected.shape)

# model selection
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
model = LogisticRegression(random_state=42, class_weight='balanced')

# model training

# Train the Logistic Regression model on the selected features
model.fit(X_train_selected, y_train)

# model evaluation
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predict sentiment labels for the test data
y_pred = model.predict(X_test_selected)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Display a classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Display a confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

X_train shape: (400, 3466)
X_test shape: (100, 3466)
y_train shape: (400,)
y_test shape: (100,)
X_train_selected shape: (400, 1000)
X_test_selected shape: (100, 1000)
Accuracy: 0.82

Classification Report:
              precision    recall  f1-score   support

           0       0.30      0.21      0.25        14
           1       0.88      0.92      0.90        86

    accuracy                           0.82       100
   macro avg       0.59      0.57      0.57       100
weighted avg       0.80      0.82      0.81       100


Confusion Matrix:
[[ 3 11]
 [ 7 79]]


**Answer**:

To compare the results with the new data split (test_size=0.3) and the original split (test_size=0.2), we need to look at the accuracy, classification report, and confusion matrix for both cases.

### Original Split (test_size=0.2)
- **Accuracy**: 0.82
- **Classification Report**:
    ```
                  precision    recall  f1-score   support

               0       0.30      0.21      0.25        14
               1       0.88      0.92      0.90        86

        accuracy                           0.82       100
       macro avg       0.59      0.57      0.57       100
    weighted avg       0.80      0.82      0.81       100
    ```
- **Confusion Matrix**:
    ```
    [[ 3 11]
     [ 7 79]]
    ```

### New Split (test_size=0.3)
- **Accuracy**: 0.8266666666666667
- **Classification Report**:
    ```
                  precision    recall  f1-score   support

               0       0.38      0.22      0.28        23
               1       0.87      0.94      0.90       127

        accuracy                           0.83       150
       macro avg       0.63      0.58      0.59       150
    weighted avg       0.79      0.83      0.81       150
    ```
- **Confusion Matrix**:
    ```
    [[  5  18]
     [  8 119]]
    ```

### Conclusion
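- The new split (test_size=0.3) gives a marginally higher accuracy (0.827) than the original split (0.82). The difference is less than one percentage point and the two figures are measured on different test sets (150 vs 100 reviews).
- The per-class results are mixed rather than uniformly better. For the negative class (0), precision rises from 0.30 to 0.38 while recall is essentially unchanged (0.21 vs 0.22). For the positive class (1), precision dips slightly (0.88 to 0.87) and recall rises (0.92 to 0.94). In both splits the model identifies only about one in five negative reviews.
- The new split leaves more data for testing, which gives a less noisy estimate of performance on unseen data, but fewer reviews for training (350 instead of 400).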
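
A rough way to judge whether the two accuracies really differ is to look at the sampling uncertainty of each estimate. The sketch below uses the binomial standard error sqrt(p(1-p)/n), assuming the test reviews are independent; the helper name `accuracy_standard_error` is hypothetical.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples test items."""
    return math.sqrt(accuracy * (1 - accuracy) / n_samples)

acc_80_20 = 82 / 100    # original split: 82 of 100 test reviews correct
acc_70_30 = 124 / 150   # new split: 124 of 150 test reviews correct

se_80_20 = accuracy_standard_error(acc_80_20, 100)
se_70_30 = accuracy_standard_error(acc_70_30, 150)
difference = acc_70_30 - acc_80_20

print(f"80/20 split: {acc_80_20:.3f} +/- {se_80_20:.3f}")
print(f"70/30 split: {acc_70_30:.3f} +/- {se_70_30:.3f}")
print(f"difference:  {difference:.3f}")
```

The difference between the two accuracies (about 0.007) is several times smaller than either standard error (about 0.03 to 0.04), so the two splits cannot be ranked on this evidence alone.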

In [None]:
# Write your code here

<a name="ex_5"></a>
## Exercise 5

Do different training and testing sizes impact the model's learning and response to new data?

**Answer**: Write your answer here

## Step 7: Hyperparameter Tuning

In this step, we'll perform hyperparameter tuning to optimize the Logistic Regression model's performance. We can search for the best hyperparameters using techniques like Grid Search or Random Search.

- We import `GridSearchCV` from `sklearn.model_selection`.
- We define a grid of hyperparameters to search, including 'C' (the inverse of regularization strength, so smaller values mean stronger regularization) and 'max_iter' (maximum number of solver iterations).
- We initialize Grid Search with cross-validation (5-fold) to find the best hyperparameters.
- The best hyperparameters are extracted using `grid_search.best_params_`.
- `GridSearchCV` refits the best configuration on the full training data (its default `refit=True` behavior) and exposes the fitted model as `grid_search.best_estimator_`.
- Finally, we evaluate the tuned model's accuracy on the test data.

In [33]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameters to search
param_grid = {
    'C': [0.1, 1, 10, 100],  # Inverse of regularization strength (smaller = stronger)
    'max_iter': [100, 200, 300]  # Maximum number of iterations
}

# Initialize Grid Search with cross-validation (5-fold)
grid_search = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5, verbose=1, n_jobs=-1)

# Fit the Grid Search to the data
grid_search.fit(X_train_selected, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Evaluate the model with the best hyperparameters
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test_selected)

# Calculate the accuracy of the tuned model
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print("Tuned Model Accuracy:", accuracy_tuned)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best Hyperparameters: {'C': 100, 'max_iter': 100}
Tuned Model Accuracy: 0.86
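
One caveat about the search above: the TF-IDF vectorizer and the chi-squared selector were fitted before cross-validation, so every validation fold has already influenced the features. A common way to avoid this is to wrap all steps in a `Pipeline` so that each fold refits them on its own training portion. The sketch below is an illustration under assumptions, not the notebook's method: it uses a hypothetical mini-corpus (`texts`, `labels`), keeps `class_weight='balanced'`, and scores with macro F1 instead of accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical mini-corpus standing in for the review texts and encoded labels
texts = [
    "love this tea great flavor", "great snack love it",
    "wonderful taste great value", "love the flavor wonderful",
    "great cookies wonderful texture", "love these nuts great crunch",
    "terrible taste awful smell", "awful product stale cookies",
    "stale nuts terrible value", "awful flavor terrible texture",
    "stale tea awful taste", "terrible snack stale smell",
]
labels = [1] * 6 + [0] * 6

# Vectorizer, selector and classifier are refitted inside every CV fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(chi2, k=5)),
    ("clf", LogisticRegression(class_weight="balanced", random_state=42)),
])

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {"clf__C": [0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1_macro")
search.fit(texts, labels)

print("Best C:", search.best_params_["clf__C"])
print("Best mean macro F1:", search.best_score_)
```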


<a name="ex_6"></a>
## Exercise 6

- What is GridSearchCV used for?
- What are hyperparameters?
- Does the model give better results after hyperparameter tuning?

**Answer**: 

1. **What is GridSearchCV used for?**
    GridSearchCV is used for hyperparameter tuning in machine learning models. It performs an exhaustive search over a specified parameter grid to find the best combination of hyperparameters that results in the highest model performance. It uses cross-validation to evaluate the performance of each combination of hyperparameters and selects the best one based on the evaluation metric.

2. **What are hyperparameters?**
    Hyperparameters are parameters that are not learned from the data but are set before the training process begins. They control the behavior of the training algorithm and the structure of the model. Examples of hyperparameters include the learning rate, the number of iterations, the regularization parameter, and the number of hidden layers in a neural network.

3. **Does the model give better results after hyperparameter tuning?**
    On accuracy, yes: on the 80/20 split, test accuracy rose from 0.82 to 0.86 after tuning with GridSearchCV. Two caveats apply. First, the initial model did not use purely default hyperparameters (it set `class_weight='balanced'`), whereas the estimator passed to GridSearchCV omits that option, so the comparison is not like-for-like. Second, because the classes are imbalanced, accuracy alone does not show whether negative reviews are classified any better; the classification report and confusion matrix of the tuned model should be checked as well.
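
When classes are imbalanced, it helps to compare any accuracy figure with what a trivial classifier would achieve. The sketch below takes the test-set class counts from the 80/20 classification report (14 negative, 86 positive) and computes the accuracy of always predicting the majority class; the variable names are our own.

```python
# Test-set class counts from the 80/20 classification report ("support" column)
support = {"negative (0)": 14, "positive (1)": 86}

# A trivial classifier that always predicts the most frequent class
majority_label = max(support, key=support.get)
majority_baseline_accuracy = support[majority_label] / sum(support.values())

print("Majority class:", majority_label)
print("Majority-class baseline accuracy:", majority_baseline_accuracy)
```

The baseline equals the tuned model's accuracy of 0.86. If the tuned model's confusion matrix shows it predicting almost every review as positive, that figure would reflect the class balance rather than better sentiment detection.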

## Step 8: Cross Validation

We'll use cross-validation to estimate how well the model will perform on unseen data and check if the model's performance is consistent across different folds of the data.

- We import `cross_val_score` from `sklearn.model_selection`.
- We perform 5-fold cross-validation on the tuned model (`best_model`) using the training data (`X_train_selected` and `y_train`).
- We calculate the mean cross-validation accuracy to get a more robust estimate of the model's performance.

In [34]:
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation on the tuned model
cv_scores = cross_val_score(best_model, X_train_selected, y_train, cv=5)

# Calculate and display the mean cross-validation accuracy
mean_cv_accuracy = np.mean(cv_scores)
print("Mean Cross-Validation Accuracy:", mean_cv_accuracy)

Mean Cross-Validation Accuracy: 0.7925000000000001
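
The mean hides how much the score varies from fold to fold, so it can be useful to print the individual fold scores and their standard deviation. The sketch below uses hypothetical synthetic data (`X_demo`, `y_demo`) and an explicit `StratifiedKFold` with shuffling, which is an assumption on our part; the cell above relies on the default behavior of `cv=5`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical synthetic data, only for illustration
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Each fold keeps the class ratio of the full data set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=cv)

print("Fold accuracies:", np.round(fold_scores, 3))
print("Mean:", fold_scores.mean())
print("Standard deviation:", fold_scores.std())
```

A large spread between folds would suggest that a single train/test split can give a misleading picture of the model.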


<a name="ex_7"></a>
## Exercise 7

- What is Cross Validation used for?
- Compare the new Validation score (with the new training and testing size)
- What do you conclude ?

**Answer**: 
1. What is Cross Validation used for?
Cross validation is used to assess how well a model will generalize to new, unseen data. It works by:
- Splitting the training data into multiple subsets (folds)
- Training and evaluating the model multiple times using different combinations of these folds
- Providing a more robust estimate of model performance than a single train-test split

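2. Compare the new Validation score (with the new test_size=0.3):
- Mean cross-validation accuracy: 0.7925. This value was computed on the 80/20 training set (400 reviews); the score for the 70/30 split has to be obtained by rerunning Steps 7 and 8 with test_size=0.3.
- This score is lower than both test accuracies measured on the 80/20 split:
    * The initial model accuracy (0.82)
    * The tuned model accuracy (0.86)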

3. Conclusion:
- The cross-validation score suggests that the single train-test split gives a somewhat optimistic picture of the model's performance
- The gap between the cross-validation score (0.7925) and the test accuracy (0.86) does not by itself indicate overfitting, which would show up as training accuracy well above held-out accuracy; with only 100 test reviews, part of the gap may be sampling variation
- A larger test size (0.3) provides more data for testing but reduces the training data, which might affect model performance
- We should consider:
    * Collecting more training data
    * Using regularization techniques
    * Feature engineering to improve model stability