# Problem statement: Classification model to analyze Amazon product reviews

The objective is to create a classification model that will analyze Amazon product reviews to classify sentiments as positive or negative. Here's a breakdown of the steps involved in this workflow:

- Step 1: Load the Dataset
- Step 2: Data Pre-processing
- Step 3: Feature Selection
- Step 4: Model Selection
- Step 5: Training the Model
- Step 6: Model Evaluation
- Step 7: Hyperparameter Tuning
- Step 8: Cross Validation

The notebook contains 7 exercises in total:

* [Exercise 1](#ex_1)
* [Exercise 2](#ex_2)
* [Exercise 3](#ex_3)
* [Exercise 4](#ex_4)
* [Exercise 5](#ex_5)
* [Exercise 6](#ex_6)
* [Exercise 7](#ex_7)

## Step 1: Load the dataset
First, let's load the dataset. In Google Colab, upload the CSV file from your local machine with `files.upload()`, then read it into a pandas DataFrame.

In [1]:
from google.colab import files
uploaded = files.upload()

Saving amazon-product-review-data.csv to amazon-product-review-data.csv


In [2]:
# Import necessary libraries
import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv('amazon-product-review-data.csv')

# Display the first few rows to check if the data is loaded correctly
df.head()



Unnamed: 0,market_place,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date,sentiments
0,"""US""","""42521656""","""R26MV8D0KG6QI6""","""B000SAQCWC""","""159713740""","""The Cravings Place Chocolate Chunk Cookie Mix...","""Grocery""",1,0,0,0 \t(N),1 \t(Y),"""Using these for years - love them.""","""As a family allergic to wheat, dairy, eggs, n...",2015-08-31,positive
1,"""US""","""12049833""","""R1OF8GP57AQ1A0""","""B00509LVIQ""","""138680402""","""Mauna Loa Macadamias, 11 Ounce Packages""","""Grocery""",1,0,0,0 \t(N),1 \t(Y),"""Wonderful""","""My favorite nut. Creamy, crunchy, salty, and ...",2015-08-31,positive
2,"""US""","""107642""","""R3VDC1QB6MC4ZZ""","""B00KHXESLC""","""252021703""","""Organic Matcha Green Tea Powder - 100% Pure M...","""Grocery""",1,0,0,0 \t(N),0 \t(N),"""Five Stars""","""This green tea tastes so good! My girlfriend ...",2015-08-31,positive
3,"""US""","""6042304""","""R12FA3DCF8F9ER""","""B000F8JIIC""","""752728342""","""15oz Raspberry Lyons Designer Dessert Syrup S...","""Grocery""",1,0,0,0 \t(N),1 \t(Y),"""Five Stars""","""I love Melissa's brand but this is a great se...",2015-08-31,positive
4,"""US""","""18123821""","""RTWHVNV6X4CNJ""","""B004ZWR9RQ""","""552138758""","""Stride Spark Kinetic Fruit Sugar Free Gum, 14...","""Grocery""",1,0,0,0 \t(N),1 \t(Y),"""Five Stars""","""good""",2015-08-31,positive


In [3]:
unique_values = df['sentiments'].unique()
print(unique_values)


['positive' 'negative']


In [4]:
value_counts = df['sentiments'].value_counts()
print(value_counts)


positive    398
negative    102
Name: sentiments, dtype: int64
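The counts above show a roughly 80/20 class imbalance, which gives a useful reference point for the accuracy figures later in the notebook. As a minimal sketch (the label array is rebuilt here from the printed counts rather than from the CSV, and `DummyClassifier` is not used in the original notebook), a model that always predicts the majority class already reaches about 0.80 accuracy:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Rebuild the label distribution from the counts printed above
# (398 positive, 102 negative); the feature values are irrelevant here.
y_counts = np.array([1] * 398 + [0] * 102)
X_placeholder = np.zeros((len(y_counts), 1))

# Baseline that ignores the input and always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_counts)

baseline_accuracy = baseline.score(X_placeholder, y_counts)
print("Majority-class baseline accuracy:", baseline_accuracy)
```

Any accuracy close to this value says little about how well the negative class is handled, which is why the per-class precision and recall matter in the later steps.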


## Step 2: Data Pre-processing





In [5]:
# Import necessary libraries for data pre-processing
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Remove any rows with missing values
df.dropna(inplace=True)

# Encode the 'sentiments' column as integers (LabelEncoder sorts classes: negative -> 0, positive -> 1)
le = LabelEncoder()
df['sentiments'] = le.fit_transform(df['sentiments'])

# Text data preprocessing using TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf_vectorizer.fit_transform(df['review_body']).toarray()
y = df['sentiments'].values

print(X)
print('**************')
print(y)
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting data
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
**************
[1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 0 1
 1 0 1 0 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0
 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 0 1 1 1 1 1
 1 1 1 1 0 1 1 0 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 0 0 0 1 1 1 0 1 1 1
 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 0
 1 1 1 1 1 1 0 0 0 1 1 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 0 1 1 0 1 1 1 1 0 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 0 1 1 1 0 1 0 1
 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 1 0 1 0 1 0 1 0 1 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 0 1 1 1 1 0 1 0 0 1 1 0 0 1
 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1
 1 1

In [6]:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

from sklearn.feature_selection import SelectKBest, chi2

# Apply feature selection using chi-squared (chi2) test
# You can adjust the number of features (k) as needed
k = 1000
selector = SelectKBest(chi2, k=k)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Display the shapes of the selected feature sets
print("X_train_selected shape:", X_train_selected.shape)
print("X_test_selected shape:", X_test_selected.shape)


from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
model = LogisticRegression(random_state=42, class_weight='balanced')
# Train the Logistic Regression model on the selected features
model.fit(X_train_selected, y_train)
# Predict sentiment labels for the test data
y_pred = model.predict(X_test_selected)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Display a classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Display a confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))



X_train_selected shape: (400, 1000)
X_test_selected shape: (100, 1000)
Accuracy: 0.82

Classification Report:
              precision    recall  f1-score   support

           0       0.30      0.21      0.25        14
           1       0.88      0.92      0.90        86

    accuracy                           0.82       100
   macro avg       0.59      0.57      0.57       100
weighted avg       0.80      0.82      0.81       100


Confusion Matrix:
[[ 3 11]
 [ 7 79]]


<a name="ex_1"></a>
## Exercise 1

- Use the train_test_split function and change the test_size to 0.3

This way, the training set (X and y) should contain 70% of the data and the testing set (X and y) 30%.

In [8]:
#Write your code here
# Import necessary libraries for data pre-processing
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Remove any rows with missing values
df.dropna(inplace=True)

# Encode the 'sentiments' column as integers (LabelEncoder sorts classes: negative -> 0, positive -> 1)
le = LabelEncoder()
df['sentiments'] = le.fit_transform(df['sentiments'])

# Text data preprocessing using TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf_vectorizer.fit_transform(df['review_body']).toarray()
y = df['sentiments'].values

# Split the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Display the shapes of the resulting data
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (350, 3466)
X_test shape: (150, 3466)
y_train shape: (350,)
y_test shape: (150,)


## Step 3: Feature Selection

In this step, we'll perform feature selection to reduce the dimensionality of the TF-IDF vectorized data and potentially improve the model's performance. Here we score features with the chi-squared (chi2) test and keep the top `k`; mutual information is a common alternative scoring function.

In [9]:
from sklearn.feature_selection import SelectKBest, chi2

# Apply feature selection using chi-squared (chi2) test
# You can adjust the number of features (k) as needed
k = 1000
selector = SelectKBest(chi2, k=k)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Display the shapes of the selected feature sets
print("X_train_selected shape:", X_train_selected.shape)
print("X_test_selected shape:", X_test_selected.shape)

X_train_selected shape: (350, 1000)
X_test_selected shape: (150, 1000)


<a name="ex_2"></a>
## Exercise 2

- Compare the X_train_selected shape and X_test_selected shape with the new test_size=0.3

In [10]:
#Write your code here
from sklearn.feature_selection import SelectKBest, chi2

# Apply feature selection using chi-squared (chi2) test
# You can adjust the number of features (k) as needed
k = 1000
selector = SelectKBest(chi2, k=k)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Display the shapes of the selected feature sets
print("X_train_selected shape:", X_train_selected.shape)
print("X_test_selected shape:", X_test_selected.shape)

X_train_selected shape: (350, 1000)
X_test_selected shape: (150, 1000)


The shapes of `X_train_selected` and `X_test_selected` indicate the number of samples and the number of features in our training and testing sets, respectively. Here, `X_train_selected` has 350 samples with 1000 features, and `X_test_selected` has 150 samples with 1000 features.

If the `test_size` parameter in the `train_test_split` function is set to 0.3, it means that 30% of the data should be allocated to the test set, and the remaining 70% to the training set. Let's verify if the distribution of samples in `X_train_selected` and `X_test_selected` aligns with this split:

1. **Total number of samples**: The total is 350 (training) + 150 (testing) = 500 samples.
2. **Expected distribution**:
   - Training set: 70% of 500 = 0.7 * 500 = 350 samples
   - Testing set: 30% of 500 = 0.3 * 500 = 150 samples

The shapes `X_train_selected` (350, 1000) and `X_test_selected` (150, 1000) match the expected distribution with a test size of 30%. This means that the split has been done correctly, assigning 70% of the data to the training set and 30% to the test set while keeping the number of features consistent across both sets.
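The arithmetic above can be checked without the review data. The following is a small sketch on placeholder arrays (`X_demo`, `y_demo`, and `split_sizes` are illustrative names, not part of the notebook) that only looks at how many rows `train_test_split` assigns to each side:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 500
X_demo = np.zeros((n_samples, 3))  # placeholder features
y_demo = np.zeros(n_samples)       # placeholder labels

split_sizes = {}
for test_size in (0.2, 0.3):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=42
    )
    # Record (train rows, test rows) for this test_size
    split_sizes[test_size] = (X_tr.shape[0], X_te.shape[0])
    print(f"test_size={test_size}: train={X_tr.shape[0]}, test={X_te.shape[0]}")
```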

We have successfully performed feature selection, reducing the dimensionality of the data while retaining the most important features.


## Step 4: Model Selection
For sentiment analysis, you can use various machine learning algorithms like Logistic Regression, Naive Bayes, Support Vector Machines, or even deep learning models like LSTM or BERT. Since you're a beginner, let's start with a simple model like Logistic Regression.

In [11]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
model = LogisticRegression(random_state=42, class_weight='balanced')

<a name="ex_3"></a>
## Exercise 3

What does the random_state (parameter of the LogisticRegression) represent?

**Answer**:

The `random_state` parameter in `LogisticRegression` sets the seed for the random number generator, ensuring reproducibility of results. In scikit-learn it is used to shuffle the data when the solver is `'sag'`, `'saga'`, or `'liblinear'`; with the default `'lbfgs'` solver the fit does not depend on it.

By setting a `random_state`, the algorithm is instructed to start from the same point each time it runs, which ensures that if we run the same code again with the same data and the same `random_state`, we will get the exact same output.

If we don't set a `random_state`, or if we set it to `None`, each run could produce slightly different results due to the randomness involved in the algorithm's execution. This could make debugging or replicating results challenging. Therefore, setting a `random_state` is a good practice when we need to ensure that our results are repeatable.
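The reproducibility claim can be checked directly. The sketch below uses synthetic data from `make_classification` and the `'saga'` solver, which is one of the solvers that consumes `random_state`; both choices are assumptions made for illustration and differ from the notebook's model, which uses the default solver:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate seeding
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

def fit_coefficients(seed):
    # 'saga' shuffles the data internally, so it actually uses random_state
    clf = LogisticRegression(solver="saga", max_iter=500, random_state=seed)
    clf.fit(X_demo, y_demo)
    return clf.coef_

coef_a = fit_coefficients(42)
coef_b = fit_coefficients(42)

# Two fits with the same seed should produce the same coefficients
same_seed_identical = bool(np.allclose(coef_a, coef_b))
print("Same seed, identical coefficients:", same_seed_identical)
```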


## Step 5: Training the Model

Now that we have initialized our Logistic Regression model, it's time to train it on the selected features from the training dataset.



In [12]:

# Train the Logistic Regression model on the selected features
model.fit(X_train_selected, y_train)

# We can now proceed to Step 6: Model Evaluation

## Step 6: Model Evaluation

In this step, we'll evaluate the performance of the trained Logistic Regression model using the testing data.

- We import necessary metrics from `sklearn.metrics` such as `accuracy_score`, `classification_report`, and `confusion_matrix`.
- We use the trained model to predict sentiment labels (`y_pred`) for the test data (`X_test_selected`).
- We calculate the accuracy of the model by comparing the predicted labels to the true labels.
- We display a classification report that includes precision, recall, F1-score, and support for both positive and negative sentiment classes.
- We display a confusion matrix to visualize the true positive, true negative, false positive, and false negative predictions.



In [13]:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


# Predict sentiment labels for the test data
y_pred = model.predict(X_test_selected)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Display a classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Display a confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.8266666666666667

Classification Report:
              precision    recall  f1-score   support

           0       0.38      0.22      0.28        23
           1       0.87      0.94      0.90       127

    accuracy                           0.83       150
   macro avg       0.63      0.58      0.59       150
weighted avg       0.79      0.83      0.81       150


Confusion Matrix:
[[  5  18]
 [  8 119]]
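The figures in the classification report above follow directly from the confusion matrix. A minimal sketch that recomputes them with NumPy, relying on scikit-learn's convention that rows are true classes and columns are predicted classes:

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class
cm = np.array([[5, 18],
               [8, 119]])

# Accuracy: correct predictions (diagonal) over all predictions
accuracy_from_cm = np.trace(cm) / cm.sum()
# Precision: correct predictions of a class / all predictions of that class (column sums)
precision = np.diag(cm) / cm.sum(axis=0)
# Recall: correct predictions of a class / all true members of that class (row sums)
recall = np.diag(cm) / cm.sum(axis=1)
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy:", round(float(accuracy_from_cm), 4))
for label in (0, 1):
    print(f"class {label}: precision={precision[label]:.2f}, "
          f"recall={recall[label]:.2f}, f1={f1[label]:.2f}")
```

Reading the matrix this way makes it clear that only 5 of the 23 negative reviews were recognized, even though overall accuracy is above 0.8.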


<a name="ex_4"></a>
## Exercise 4

- Compare the results from the new data split (70/30) with the results from the original split (80/20).

In [14]:
# 80/20 Data split

# Import necessary libraries for data pre-processing
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

# Remove any rows with missing values
df.dropna(inplace=True)

# Encode the 'sentiments' column as integers (LabelEncoder sorts classes: negative -> 0, positive -> 1)
le = LabelEncoder()
df['sentiments'] = le.fit_transform(df['sentiments'])

# Text data preprocessing using TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf_vectorizer.fit_transform(df['review_body']).toarray()
y = df['sentiments'].values

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting data
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

k = 1000
selector = SelectKBest(chi2, k=k)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Display the shapes of the selected feature sets
print("X_train_selected shape:", X_train_selected.shape)
print("X_test_selected shape:", X_test_selected.shape)

# Initialize the Logistic Regression model
model = LogisticRegression(random_state=42, class_weight='balanced')

# Train the Logistic Regression model on the selected features
model.fit(X_train_selected, y_train)

# Predict sentiment labels for the test data
y_pred = model.predict(X_test_selected)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Display a classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Display a confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

X_train shape: (400, 3466)
X_test shape: (100, 3466)
y_train shape: (400,)
y_test shape: (100,)
X_train_selected shape: (400, 1000)
X_test_selected shape: (100, 1000)
Accuracy: 0.74

Classification Report:
              precision    recall  f1-score   support

           0       0.12      0.14      0.13        14
           1       0.86      0.84      0.85        86

    accuracy                           0.74       100
   macro avg       0.49      0.49      0.49       100
weighted avg       0.75      0.74      0.75       100


Confusion Matrix:
[[ 2 12]
 [14 72]]


In [15]:
# Write your code here
# 70/30 Data split

# Import necessary libraries for data pre-processing
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

# Remove any rows with missing values
df.dropna(inplace=True)

# Encode the 'sentiments' column as integers (LabelEncoder sorts classes: negative -> 0, positive -> 1)
le = LabelEncoder()
df['sentiments'] = le.fit_transform(df['sentiments'])

# Text data preprocessing using TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf_vectorizer.fit_transform(df['review_body']).toarray()
y = df['sentiments'].values

# Split the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Display the shapes of the resulting data
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

k = 1000
selector = SelectKBest(chi2, k=k)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Display the shapes of the selected feature sets
print("X_train_selected shape:", X_train_selected.shape)
print("X_test_selected shape:", X_test_selected.shape)

# Initialize the Logistic Regression model
model1 = LogisticRegression(random_state=42, class_weight='balanced')

# Train the Logistic Regression model on the selected features
model1.fit(X_train_selected, y_train)

# Predict sentiment labels for the test data (use model1, the model trained on the 70/30 split)
y_pred1 = model1.predict(X_test_selected)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred1)
print("Accuracy:", accuracy)

# Display a classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred1))

# Display a confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred1))

X_train shape: (350, 3466)
X_test shape: (150, 3466)
y_train shape: (350,)
y_test shape: (150,)
X_train_selected shape: (350, 1000)
X_test_selected shape: (150, 1000)
Accuracy: 0.8266666666666667

Classification Report:
              precision    recall  f1-score   support

           0       0.38      0.22      0.28        23
           1       0.87      0.94      0.90       127

    accuracy                           0.83       150
   macro avg       0.63      0.58      0.59       150
weighted avg       0.79      0.83      0.81       150


Confusion Matrix:
[[  5  18]
 [  8 119]]


**Comparison result:**

The results from the two different data splits show some differences in the model's performance:

**Accuracy:**
- 80/20 Split: The accuracy was 0.74.
- 70/30 Split: The accuracy improved to approximately 0.83.

**Precision, Recall, and F1-Score:**
- For class 0 (negative, the minority class), precision rose from 0.12 to 0.38 and recall from 0.14 to 0.22 in the 70/30 split compared to the 80/20 split, indicating a better, though still weak, ability to identify instances of class 0.
- For class 1 (positive), recall rose from 0.84 to 0.94 in the 70/30 split. Precision remains high in both splits (0.86 and 0.87).

**Confusion Matrix:**
- Correct predictions for class 0 rose from 2 (of 14) to 5 (of 23), while class-0 reviews misclassified as class 1 rose from 12 to 18. Because the 70/30 test set is larger, the raw counts are not directly comparable; in relative terms, class-0 recall improved from 0.14 to 0.22, showing an improved but still challenged performance on class 0.
- For class 1, the model performs strongly in both splits: correct predictions went from 72 of 86 to 119 of 127, and class-1 reviews misclassified as class 0 fell from 14 to 8 (recall 0.84 versus 0.94).

**Analysis:**
- The higher accuracy in the 70/30 split should be read with care: the model does not learn from test data, so a larger test set cannot make it better. What a larger test set does provide is a more reliable estimate of the model's performance.
- The gains in precision and recall for class 0 in the 70/30 split rest on very few samples (14 and 23 negative reviews in the two test sets), so part of the difference is likely due to sampling variation between the two splits.
- The improvement in class 1's recall likewise reflects a different train/test partition rather than extra test data helping the model generalize.

Overall, the 70/30 split seems to provide a better balance for training and evaluating the model, giving it more data to test on and a better understanding of its generalization capabilities. However, the performance on class 0 remains a challenge, indicating potential issues with class imbalance or feature representation that might need further tuning or adjustment.
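One adjustment related to the class imbalance mentioned above is a stratified split, which keeps the class ratio the same in the training and test sets. The original notebook does not use it (its test sets contain 14 of 100 and 23 of 150 negative reviews, versus 102 of 500 overall). A minimal sketch, with labels rebuilt from the class counts in Step 1 and placeholder features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the class counts reported in Step 1 (1 = positive, 0 = negative)
y_demo = np.array([1] * 398 + [0] * 102)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

# stratify=y_demo asks for the same class proportions on both sides of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

negative_share_train = float(np.mean(y_tr == 0))
negative_share_test = float(np.mean(y_te == 0))
print("Negative share in train:", round(negative_share_train, 3))
print("Negative share in test: ", round(negative_share_test, 3))
```

With matching class ratios, differences between two splits are less likely to come from one test set simply containing fewer negative reviews.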

<a name="ex_5"></a>
## Exercise 5

Do different training and testing sizes impact the model's learning and response to new data?

**Answer**:

Yes, different training and testing sizes can significantly impact a model's learning and its response to new data:

1. **Training Size Impact:** Larger training sets provide more data points for the model to learn from, capturing a broader range of patterns. However, beyond a certain point additional training data yields diminishing gains while increasing training time, and with a fixed dataset every sample added to the training set is one fewer available for testing.

2. **Testing Size Impact:** The size of the testing set affects the reliability of the model's performance evaluation. A larger testing set can give a more robust and reliable estimate of the model's performance on unseen data. Conversely, a smaller testing set might not fully capture the model's effectiveness and could lead to a less reliable assessment of its generalization capabilities.

In conclusion, the division between training and testing data should be made thoughtfully, balancing the need for a model that learns well (larger training set) and the need to accurately assess its generalization to new data (sufficiently large testing set).
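The effect of test-set size on reliability can be roughed out with the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n). This is a back-of-the-envelope sketch that assumes independent test samples; `accuracy_standard_error` is a helper defined here for illustration, and the inputs are the accuracies reported for the two splits in Exercise 4:

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy estimated from n_test samples."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)

# (test-set size, accuracy) pairs taken from the 80/20 and 70/30 runs
for n_test, acc in ((100, 0.74), (150, 0.83)):
    se = accuracy_standard_error(acc, n_test)
    print(f"n_test={n_test}, accuracy={acc}: standard error ~ {se:.3f}")
```

With only 100 to 150 test reviews, the uncertainty is a few percentage points, so small accuracy differences between splits should not be over-interpreted.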

## Step 7: Hyperparameter Tuning

In this step, we'll perform hyperparameter tuning to optimize the Logistic Regression model's performance. We can search for the best hyperparameters using techniques like Grid Search or Random Search.

- We import `GridSearchCV` from `sklearn.model_selection`.
- We define a grid of hyperparameters to search, including 'C' (the inverse of the regularization strength) and 'max_iter' (the maximum number of solver iterations).
- We initialize Grid Search with cross-validation (5-fold) to find the best hyperparameters.
- The best hyperparameters are extracted using `grid_search.best_params_`.
- We fit the tuned model with the best hyperparameters to the training data.
- Finally, we evaluate the tuned model's accuracy on the test data.

In [20]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameters to search
param_grid = {
    'C': [0.1, 1, 10, 100],  # Inverse regularization strength (smaller C = stronger regularization)
    'max_iter': [100, 200, 300]  # Maximum number of iterations
}

# Initialize Grid Search with cross-validation (5-fold)
grid_search = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5, verbose=1, n_jobs=-1)

# Fit the Grid Search to the data
grid_search.fit(X_train_selected, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Evaluate the model with the best hyperparameters
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test_selected)

# Calculate the accuracy of the tuned model
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print("Tuned Model Accuracy:", accuracy_tuned)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best Hyperparameters: {'C': 100, 'max_iter': 100}
Tuned Model Accuracy: 0.84


<a name="ex_6"></a>
## Exercise 6

- What is GridSearchCV used for?
- What are hyperparameters?
- Does the model give better results after hyperparameter tuning?

**Answer**:

**What is GridSearchCV used for?**

GridSearchCV systematically searches through a specified grid of hyperparameter values, performing cross-validation to determine the combination that yields the best model performance. It automates the process of finding the best settings for a machine learning model.



**What are hyperparameters?**

Hyperparameters are the configuration settings of a model that are set prior to training rather than learned from the data. They control the learning process itself (e.g., the complexity of the model, how fast it learns, etc.) and can significantly impact the performance of the model. Examples include the learning rate, kernel parameters in an SVM, depth for a decision tree, and regularization strength in logistic regression.



**Does the model give better results after hyperparameter tuning?**

Yes, by finding the optimal hyperparameters, the model is usually better tuned to the data, improving its performance and generalization ability. The whole point of hyperparameter tuning is to find the best hyperparameter settings for our model relative to our data and the problem we are solving, so that the model generalizes better from the training data to unseen data.

In this case, however, hyperparameter tuning did not significantly improve the model's accuracy: the tuned model reaches 0.84 on the test set, compared with about 0.83 for the untuned model.
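In the grid search above, the TF-IDF vocabulary and the chi-squared selection are fitted before cross-validation, so each validation fold has already influenced the features. One common way to avoid this is to put all steps into a scikit-learn `Pipeline` and tune that instead. The sketch below uses a hypothetical twelve-review corpus, `k=5`, and a reduced grid purely for illustration; it is not the notebook's method:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews (1 = positive, 0 = negative)
texts = [
    "great taste love it", "excellent product highly recommend",
    "delicious and fresh", "wonderful flavor great value",
    "love this tea so good", "amazing quality will buy again",
    "terrible taste awful", "bad product do not buy",
    "stale and disgusting", "horrible flavor waste of money",
    "hate this tea so bitter", "poor quality never again",
]
labels = [1] * 6 + [0] * 6

# Vectorizer, feature selector, and classifier are refitted inside every CV fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("select", SelectKBest(chi2, k=5)),
    ("clf", LogisticRegression(random_state=42, class_weight="balanced")),
])

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {"clf__C": [0.1, 1, 10]}

grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(texts, labels)

print("Best hyperparameters:", grid.best_params_)
print("Best mean CV accuracy:", grid.best_score_)
```

The same pattern would also let `k` or `max_features` be tuned alongside `C`, since they become ordinary grid entries such as `select__k`.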

## Step 8: Cross Validation

We'll use cross-validation to estimate how well the model will perform on unseen data and check if the model's performance is consistent across different folds of the data.

- We import `cross_val_score` from `sklearn.model_selection`.
- We perform 5-fold cross-validation on the tuned model (`best_model`) using the training data (`X_train_selected` and `y_train`).
- We calculate the mean cross-validation accuracy to get a more robust estimate of the model's performance.

In [24]:
# Import necessary libraries for data pre-processing
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

import warnings
warnings.filterwarnings('ignore')

# Remove any rows with missing values
df.dropna(inplace=True)

# Encode the 'sentiments' column as integers (LabelEncoder sorts classes: negative -> 0, positive -> 1)
le = LabelEncoder()
df['sentiments'] = le.fit_transform(df['sentiments'])

# Text data preprocessing using TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf_vectorizer.fit_transform(df['review_body']).toarray()
y = df['sentiments'].values

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting data
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

k = 1000
selector = SelectKBest(chi2, k=k)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Define hyperparameters to search
param_grid = {
    'C': [0.1, 1, 10, 100],  # Inverse regularization strength (smaller C = stronger regularization)
    'max_iter': [100, 200, 300]  # Maximum number of iterations
}

# Initialize Grid Search with cross-validation (5-fold)
grid_search = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5, verbose=1, n_jobs=-1)

# Fit the Grid Search to the data
grid_search.fit(X_train_selected, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Evaluate the model with the best hyperparameters
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test_selected)


# Perform 5-fold cross-validation on the tuned model
cv_scores = cross_val_score(best_model, X_train_selected, y_train, cv=5)

# Calculate and display the mean cross-validation accuracy
mean_cv_accuracy = np.mean(cv_scores)
print("Mean Cross-Validation Accuracy for 80/20 train-test data split:", mean_cv_accuracy)

X_train shape: (400, 3466)
X_test shape: (100, 3466)
y_train shape: (400,)
y_test shape: (100,)
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best Hyperparameters: {'C': 100, 'max_iter': 100}
Mean Cross-Validation Accuracy for 80/20 train-test data split: 0.79


In [25]:
# Import necessary libraries for data pre-processing
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

import warnings
warnings.filterwarnings('ignore')

# Remove any rows with missing values
df.dropna(inplace=True)

# Encode the 'sentiments' column as integers (LabelEncoder sorts classes: negative -> 0, positive -> 1)
le = LabelEncoder()
df['sentiments'] = le.fit_transform(df['sentiments'])

# Text data preprocessing using TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf_vectorizer.fit_transform(df['review_body']).toarray()
y = df['sentiments'].values

# Split the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Display the shapes of the resulting data
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

k = 1000
selector = SelectKBest(chi2, k=k)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Define hyperparameters to search
param_grid = {
    'C': [0.1, 1, 10, 100],  # Inverse regularization strength (smaller C = stronger regularization)
    'max_iter': [100, 200, 300]  # Maximum number of iterations
}

# Initialize Grid Search with cross-validation (5-fold)
grid_search = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5, verbose=1, n_jobs=-1)

# Fit the Grid Search to the data
grid_search.fit(X_train_selected, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Evaluate the model with the best hyperparameters
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test_selected)


# Perform 5-fold cross-validation on the tuned model
cv_scores = cross_val_score(best_model, X_train_selected, y_train, cv=5)

# Calculate and display the mean cross-validation accuracy
mean_cv_accuracy = np.mean(cv_scores)
print("Mean Cross-Validation Accuracy for 70/30 train-test data split:", mean_cv_accuracy)

X_train shape: (350, 3466)
X_test shape: (150, 3466)
y_train shape: (350,)
y_test shape: (150,)
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best Hyperparameters: {'C': 100, 'max_iter': 100}
Mean Cross-Validation Accuracy for 70/30 train-test data split: 0.7885714285714285


In [21]:
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation on the tuned model
cv_scores = cross_val_score(best_model, X_train_selected, y_train, cv=5)

# Calculate and display the mean cross-validation accuracy
mean_cv_accuracy = np.mean(cv_scores)
print("Mean Cross-Validation Accuracy:", mean_cv_accuracy)

Mean Cross-Validation Accuracy: 0.7885714285714285


<a name="ex_7"></a>
## Exercise 7

- What is Cross Validation used for?
- Compare the new Validation score (with the new training and testing size)
- What do you conclude?

**Answer**:

*1. What is Cross Validation used for?*

  Cross-validation is used to evaluate the generalizability of a model by dividing the data into several subsets, training the model on some subsets while validating it on the remaining ones. This process is repeated multiple times, helping to ensure that the model's performance is consistent across different subsets of the data, reducing the risk of overfitting and providing a more robust estimate of the model's performance on unseen data.
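The procedure described above can be written out by hand and compared with `cross_val_score`. This sketch uses synthetic data from `make_classification` and illustrative names (`demo_model`, `folds`) that are not part of the notebook:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data, used only to illustrate the mechanics
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
demo_model = LogisticRegression(max_iter=1000)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Manual loop: train on four folds, validate on the held-out fold
manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = clone(demo_model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

# The library helper runs the same procedure on the same folds
library_scores = cross_val_score(demo_model, X_demo, y_demo, cv=folds)

print("Manual fold accuracies: ", np.round(manual_scores, 3))
print("Library fold accuracies:", np.round(library_scores, 3))
print("Mean / std:", np.mean(library_scores), np.std(library_scores))
```

Reporting the standard deviation alongside the mean shows how much the score varies from fold to fold, which is the consistency check that cross-validation is meant to provide.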

*2. Compare the new Validation score (with the new training and testing size)*


When comparing the mean cross-validation accuracies for different train-test splits, we are observing how altering the amount of training and testing data impacts the model's ability to generalize.

1. **80/20 Split:**
   - Mean Cross-Validation Accuracy: 0.79
   - The model had more data to train on, potentially allowing it to learn more comprehensive patterns from the data.

2. **70/30 Split:**
   - Mean Cross-Validation Accuracy: 0.7885714285714285
   - Despite having less training data compared to the 80/20 split, the model's performance is quite similar, indicating good generalization.



*3. What do you conclude?*

**Conclusions:**

- The slight difference in cross-validation accuracy between the two splits suggests that the model is relatively stable across different amounts of training data.
- The consistency in best hyperparameters (`'C': 100, 'max_iter': 100`) for both splits implies that the model's optimal configuration is robust to changes in the train-test ratio.
- Since the performance is similar across both splits, it indicates that the model, with the given hyperparameters, is not heavily dependent on the exact proportion of training to testing data, at least within the range tested (80/20 vs. 70/30).

This analysis helps in understanding the trade-off between training with more data and having a larger set to validate the model's generalizability.