# Problem statement: Classification model to analyze Amazon product reviews

The objective is to create a classification model that will analyze Amazon product reviews to classify sentiments as positive or negative. Here's a breakdown of the steps involved in this workflow:

- Step 1: Load the Dataset
- Step 2: Data Pre-processing
- Step 3: Feature Selection
- Step 4: Model Selection
- Step 5: Training the Model
- Step 6: Model Evaluation
- Step 7: Hyperparameter Tuning
- Step 8: Cross Validation

The notebook contains 7 exercises in total:

* [Exercise 1](#ex_1)
* [Exercise 2](#ex_2)
* [Exercise 3](#ex_3)
* [Exercise 4](#ex_4)
* [Exercise 5](#ex_5)
* [Exercise 6](#ex_6)
* [Exercise 7](#ex_7)

## Step 1: Load the dataset
First, let's load the dataset. In Google Colab, `files.upload()` opens a file picker so you can upload the CSV file from your computer; we then read that file into a pandas DataFrame.

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
# Import necessary libraries
import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv('amazon-product-review-data.csv')

# Display the first few rows to check if the data is loaded correctly
df.head()
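Before moving on, it can help to confirm that the columns used in the later steps are present and to see how many values are missing. The sketch below uses a tiny inline CSV as a stand-in for the real file; the three rows are made up for illustration, and only the column names `review_body` and `sentiments` come from this notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'amazon-product-review-data.csv' (three made-up rows)
sample_csv = io.StringIO(
    "review_body,sentiments\n"
    "Great product and fast delivery,positive\n"
    "Stopped working after a week,negative\n"
    ",positive\n"
)
sample_df = pd.read_csv(sample_csv)

# Confirm the columns used in the later steps exist
expected_columns = {"review_body", "sentiments"}
missing_columns = expected_columns - set(sample_df.columns)
print("Missing columns:", missing_columns)

# Count missing values per column (such rows are dropped in Step 2)
missing_counts = sample_df.isna().sum()
print(missing_counts)

# Check how balanced the two sentiment classes are
class_counts = sample_df["sentiments"].value_counts()
print(class_counts)
```

The same three checks can be run on `df` directly once the real file is loaded.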



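## Step 2: Data Pre-processing

In this step, we remove rows with missing values, encode the `sentiments` labels as integers, convert the review text into TF-IDF features, and split the data into training and testing sets.
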
In [None]:
# Import necessary libraries for data pre-processing
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Remove any rows with missing values
df.dropna(inplace=True)

# Encode the 'sentiments' column as integers (LabelEncoder assigns codes in sorted order, e.g. negative -> 0, positive -> 1)
le = LabelEncoder()
df['sentiments'] = le.fit_transform(df['sentiments'])

# Text data preprocessing using TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf_vectorizer.fit_transform(df['review_body']).toarray()
y = df['sentiments'].values

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting data
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
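To see which integer each sentiment label receives, the encoder's `classes_` attribute can be inspected. This is a minimal sketch that assumes the labels are the strings `positive` and `negative`; the list `labels` is a hypothetical stand-in for `df['sentiments']`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for df['sentiments']
labels = ["positive", "negative", "negative", "positive"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each class is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())

# inverse_transform turns integer codes back into the original strings
decoded = encoder.inverse_transform([0, 1])
print(decoded.tolist())
```

Knowing the mapping makes the classification report and confusion matrix in Step 6 easier to read, since their rows are labelled with these integer codes.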


<a name="ex_1"></a>
## Exercise 1

- Use the `train_test_split` function and change `test_size` to 0.3

This way, the training set (X and y) should contain 70% of the data and the testing set (X and y) should contain 30%.

In [None]:
#Write your code here

## Step 3: Feature Selection

In this step, we'll perform feature selection to reduce the dimensionality of the TF-IDF vectorized data and potentially improve the model's performance. We'll use the chi-squared (chi2) test to keep the features most strongly associated with the sentiment label; mutual information is a common alternative scoring function.

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

# Apply feature selection using chi-squared (chi2) test
# You can adjust the number of features (k) as needed
k = 1000
selector = SelectKBest(chi2, k=k)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Display the shapes of the selected feature sets
print("X_train_selected shape:", X_train_selected.shape)
print("X_test_selected shape:", X_test_selected.shape)

<a name="ex_2"></a>
## Exercise 2

- Compare the shapes of `X_train_selected` and `X_test_selected` after re-running the steps above with the new `test_size=0.3`

In [None]:
#Write your code here

We have successfully performed feature selection, reducing the dimensionality of the data while retaining the most important features.


## Step 4: Model Selection
For sentiment analysis, you can use various machine learning algorithms like Logistic Regression, Naive Bayes, Support Vector Machines, or even deep learning models like LSTM or BERT. Since you're a beginner, let's start with a simple model like Logistic Regression.

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
model = LogisticRegression(random_state=42)
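The classical algorithms mentioned above share the same scikit-learn interface, so swapping one for another is mostly a one-line change. The following is a rough sketch on random non-negative features that stand in for TF-IDF values; the data and the names `demo_X`, `demo_y`, and `candidates` are hypothetical, and the training accuracies it prints say nothing about performance on the review data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical non-negative features standing in for TF-IDF values
rng = np.random.default_rng(0)
demo_X = rng.random((60, 5))
demo_y = (demo_X[:, 0] > demo_X[:, 1]).astype(int)

candidates = {
    "Logistic Regression": LogisticRegression(random_state=42),
    "Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(random_state=42),
}

# Every scikit-learn classifier exposes the same fit/predict/score interface
demo_scores = {}
for name, candidate in candidates.items():
    candidate.fit(demo_X, demo_y)
    demo_scores[name] = candidate.score(demo_X, demo_y)
    print(name, "training accuracy:", round(demo_scores[name], 3))
```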


<a name="ex_3"></a>
## Exercise 3

What does the random_state (parameter of the LogisticRegression) represent?

**Answer**: Write your answer here

## Step 5: Training the Model

Now that we have initialized our Logistic Regression model, it's time to train it on the selected features from the training dataset.



In [None]:

# Train the Logistic Regression model on the selected features
model.fit(X_train_selected, y_train)

# We can now proceed to Step 6: Model Evaluation

## Step 6: Model Evaluation

In this step, we'll evaluate the performance of the trained Logistic Regression model using the testing data.

- We import necessary metrics from `sklearn.metrics` such as `accuracy_score`, `classification_report`, and `confusion_matrix`.
- We use the trained model to predict sentiment labels (`y_pred`) for the test data (`X_test_selected`).
- We calculate the accuracy of the model by comparing the predicted labels to the true labels.
- We display a classification report that includes precision, recall, F1-score, and support for both positive and negative sentiment classes.
- We display a confusion matrix to visualize the true positive, true negative, false positive, and false negative predictions.



In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predict sentiment labels for the test data
y_pred = model.predict(X_test_selected)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Display a classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Display a confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
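The numbers in the classification report can be derived by hand from the confusion matrix. As a worked example, the sketch below uses ten made-up labels (not results from the review data) and recomputes precision, recall, F1, and accuracy from the four cell counts.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_labels = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(true_labels, pred_labels).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Precision:", precision, "Recall:", recall)
print("F1:", round(f1, 3), "Accuracy:", accuracy)

# scikit-learn's own helpers give the same precision and recall
print(precision_score(true_labels, pred_labels), recall_score(true_labels, pred_labels))
```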

<a name="ex_4"></a>
## Exercise 4

- Compare the results obtained with the new data split (`test_size=0.3`) to the results of the original split (`test_size=0.2`).

In [None]:
# Write your code here

<a name="ex_5"></a>
## Exercise 5

Do different training and testing sizes impact the model's learning and response to new data?

**Answer**: Write your answer here

## Step 7: Hyperparameter Tuning

In this step, we'll perform hyperparameter tuning to optimize the Logistic Regression model's performance. We can search for the best hyperparameters using techniques like Grid Search or Random Search.

- We import `GridSearchCV` from `sklearn.model_selection`.
- We define a grid of hyperparameters to search, including 'C' (the inverse of the regularization strength; smaller values mean stronger regularization) and 'max_iter' (the maximum number of solver iterations).
- We initialize Grid Search with cross-validation (5-fold) to find the best hyperparameters.
- The best hyperparameters are extracted using `grid_search.best_params_`.
- By default, `GridSearchCV` refits a model with the best hyperparameters on the full training data; we retrieve it from `grid_search.best_estimator_`.
- Finally, we evaluate the tuned model's accuracy on the test data.

In [None]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameters to search
param_grid = {
    'C': [0.1, 1, 10, 100],  # Inverse of regularization strength (smaller = stronger regularization)
    'max_iter': [100, 200, 300]  # Maximum number of iterations
}

# Initialize Grid Search with cross-validation (5-fold)
grid_search = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5, verbose=1, n_jobs=-1)

# Fit the Grid Search to the data
grid_search.fit(X_train_selected, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Evaluate the model with the best hyperparameters
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test_selected)

# Calculate the accuracy of the tuned model
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print("Tuned Model Accuracy:", accuracy_tuned)
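Beyond the single best combination, `cv_results_` records the cross-validated score of every combination in the grid, which helps judge whether tuning made a real difference. This is a minimal sketch on synthetic data from `make_classification`; the data and the `demo_` names are hypothetical stand-ins for `X_train_selected`, `y_train`, and `grid_search`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data standing in for X_train_selected / y_train
demo_X, demo_y = make_classification(n_samples=200, n_features=10, random_state=42)

demo_grid = {"C": [0.1, 1, 10], "max_iter": [100, 200]}
demo_search = GridSearchCV(LogisticRegression(random_state=42), demo_grid, cv=5)
demo_search.fit(demo_X, demo_y)

# cv_results_ holds one row per hyperparameter combination
results = pd.DataFrame(demo_search.cv_results_)
summary = results[
    ["param_C", "param_max_iter", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(summary.sort_values("rank_test_score"))

# best_score_ is the mean cross-validated accuracy of the best combination
print("Best CV accuracy:", demo_search.best_score_)
```

If the mean scores of several combinations differ by less than their standard deviations, the choice between them matters little.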

<a name="ex_6"></a>
## Exercise 6

- What is GridSearchCV used for?
- What are hyperparameters?
- Does the model give better results after hyperparameter tuning?

**Answer**: Write your answer here

It appears that the hyperparameter tuning did not significantly improve the model's accuracy in this case. The accuracy remains at 0.86.

## Step 8: Cross Validation

We'll use cross-validation to estimate how well the model will perform on unseen data and check if the model's performance is consistent across different folds of the data.

- We import `cross_val_score` from `sklearn.model_selection`.
- We perform 5-fold cross-validation on the tuned model (`best_model`) using the training data (`X_train_selected` and `y_train`).
- We calculate the mean cross-validation accuracy to get a more robust estimate of the model's performance.

In [None]:
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation on the tuned model
cv_scores = cross_val_score(best_model, X_train_selected, y_train, cv=5)

# Calculate and display the mean cross-validation accuracy
mean_cv_accuracy = np.mean(cv_scores)
print("Mean Cross-Validation Accuracy:", mean_cv_accuracy)
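One possible refinement, which is not what the cell above does, is to wrap vectorization, feature selection, and the classifier in a single scikit-learn pipeline, so that each fold fits the TF-IDF vocabulary and the chi-squared selector on its own training portion only. The sketch below assumes a made-up corpus of short reviews; `raw_reviews`, `raw_labels`, and `pipeline` are hypothetical names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 1 = positive, 0 = negative
pos = ["great product", "love it", "works great", "really love this", "great value"]
neg = ["terrible product", "hate it", "broke quickly", "really hate this", "terrible value"]
raw_reviews = pos * 4 + neg * 4
raw_labels = [1] * 20 + [0] * 20

# Each fold refits the vectorizer and selector on its own training part only
pipeline = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=5),
    LogisticRegression(random_state=42),
)
pipeline_scores = cross_val_score(pipeline, raw_reviews, raw_labels, cv=5)

print("Fold accuracies:", pipeline_scores)
print("Mean:", pipeline_scores.mean(), "Std:", pipeline_scores.std())
```

Reporting the standard deviation alongside the mean gives a sense of how consistent the performance is across folds.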

<a name="ex_7"></a>
## Exercise 7

- What is cross-validation used for?
- Compare the new cross-validation score (obtained with the new training and testing sizes) with the original one.
- What do you conclude?

**Answer**: Write your answer here