# Introduction

In this notebook, we will demonstrate a sentiment analysis machine learning workflow using a linear classifier from the scikit-learn library. The goal is to classify movie reviews as either positive or negative based on their sentiment.
The dataset used for this task is the Sentiment Polarity Data Set v2.0 from Movie Review Data by Pang, Lee, and Vaithyanathan. The dataset contains a complete training set and a test set, which will be utilized in this notebook.

The workflow will include data preprocessing, feature extraction, model training, hyperparameter experimentation, evaluation, and prediction on new samples of text. Before we begin, we import pandas and the scikit-learn components we need, then load the training and test datasets below.


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

reviews_training = pd.read_csv("/kaggle/input/movie-reviews-sentiment-polarity/movie_reviews_train.csv")
reviews_test = pd.read_csv("/kaggle/input/movie-reviews-sentiment-polarity/movie_reviews_test.csv")


## Data Preprocessing

Before diving into the sentiment analysis ML workflow, let's explore the training dataset to gain insights into the data. This step includes printing the column names, viewing the first few rows, analyzing the distribution of sentiment labels, checking for missing values, and applying any necessary data cleaning or preprocessing.

We will start by inspecting the data and performing the required preprocessing steps.


In [None]:
print(reviews_training.columns)

print("Data Overview:")
print(reviews_training.head())

sentiment_distribution = reviews_training['Label'].value_counts()
print("\nSentiment Distribution:")
print(sentiment_distribution)

missing_values = reviews_training.isnull().sum()
print("\nMissing Values:")
print(missing_values)


In [None]:
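import re

def clean_text(text):
    # str.replace matches its argument literally, so re.sub is needed to strip non-alphanumeric characters.
    cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text).lower()
    return cleaned_text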

reviews_training['cleaned_text'] = reviews_training['Content'].apply(clean_text)

print("\nPreprocessed Data:")
print(reviews_training['cleaned_text'].head())


This section gives an overview of the dataset used for sentiment analysis. After preprocessing, the training DataFrame has three columns: `Content` holds the original review text, `Label` holds the sentiment label, and `cleaned_text` holds the preprocessed version of the text. The `clean_text` function handles the cleaning: it removes special characters and converts the text to lowercase.

The data overview shows the first few rows of the dataset, which lets us see the structure and content of the data.
The sentiment distribution returned by `value_counts()` shows 900 samples labeled as positive and 900 samples labeled as negative, so the two sentiment classes are balanced. The missing-value count is `0`, which means no rows need to be dropped or filled in before the next steps.

Lastly, `head()` gives a preview of the preprocessed text, which has been cleaned to remove noise and irrelevant characters.
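The inspection and cleaning steps above can be sketched on a tiny, made-up DataFrame. The two rows below are invented for illustration and are not taken from the dataset; the column names `Content` and `Label` follow the notebook, while `toy_reviews` and `clean_text_sketch` are hypothetical names. The sketch assumes regex-based cleaning with `re.sub`.

```python
import re

import pandas as pd

# Toy stand-in for the real data; the rows are invented for illustration.
toy_reviews = pd.DataFrame({
    "Content": ["A GREAT film!!!", "Dull, slow, boring..."],
    "Label": ["Positive", "Negative"],
})

def clean_text_sketch(text):
    # Drop every character that is not a letter, digit, or whitespace, then lowercase.
    return re.sub(r"[^a-zA-Z0-9\s]", "", text).lower()

toy_reviews["cleaned_text"] = toy_reviews["Content"].apply(clean_text_sketch)

label_counts = toy_reviews["Label"].value_counts()  # number of samples per class
missing_counts = toy_reviews.isnull().sum()         # number of missing values per column

print(toy_reviews["cleaned_text"].tolist())
print(label_counts)
print(missing_counts)
```

On the real data, the same three checks (cleaned text, class counts, missing values) are what the cells above report.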

## Feature Extraction

In order to train a machine learning model on text data, we need to convert the textual content into numerical features that the model can understand. In this step, we will extract relevant features from the preprocessed movie reviews.

We will use a bag-of-words approach to represent the movie reviews as numerical feature vectors. Each review will be represented by a vector where each dimension corresponds to a specific word in the corpus. The value in each dimension will indicate the frequency or presence of that word in the review.

The cell below performs feature extraction with `TfidfVectorizer` from scikit-learn. Calling `TfidfVectorizer()` creates a vectorizer that converts text into numerical feature vectors weighted by TF-IDF. `fit_transform()` learns the vocabulary and IDF weights from the `Content` column of `reviews_training` and returns the training matrix `X_train`. `transform()` then applies the same vocabulary and weights to the `Content` column of `reviews_test`, producing the TF-IDF matrix `X_test`. The sentiment labels from the `Label` column are assigned to `y_train` and `y_test`. Note that the vectorizer reads the raw `Content` column rather than `cleaned_text`; by default, `TfidfVectorizer` lowercases the text and extracts word tokens itself, ignoring punctuation.


In [None]:
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(reviews_training['Content'])
y_train = reviews_training['Label']
X_test = vectorizer.transform(reviews_test['Content'])
y_test = reviews_test['Label']

print("Feature Extraction:")
print("Training data shape:", X_train.shape)
print("Test data shape:", X_test.shape)

Finally, the code prints the shape of the training and test feature matrices using the `X_train.shape` and `X_test.shape` attributes. Each shape gives the number of samples (rows) and the number of features (columns), which shows the dimensionality of the data. Both matrices have the same number of columns because they share the vocabulary learned from the training set.

## Model Training

Now that we have transformed the text data into feature vectors, we can proceed with training a linear classifier using the scikit-learn library. We will split the preprocessed data into training and validation sets, train the model on the training set, and evaluate its performance on the validation set.

In the cell below, we first split the feature vectors `X_train` and the corresponding sentiment labels `y_train` into training and validation sets using the `train_test_split()` function. The training set is used to fit a logistic regression model, and the validation set is used to make predictions and generate a classification report that evaluates the model's performance.

In [None]:
# Hold out 20% of the training data for validation.
X_train_final, X_val, y_train_final, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train_final, y_train_final)
val_predictions = model.predict(X_val)
classification_report_val = classification_report(y_val, val_predictions)

print("Evaluation on Validation Set:")
print(classification_report_val)

This cell trains a logistic regression model on 80% of the training data and evaluates it on the held-out 20%. The resulting validation metrics are stored in `classification_report_val`.
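One possible variation, sketched below under stated assumptions, is to split the raw text before vectorizing and wrap the vectorizer and classifier in a scikit-learn pipeline, so that the IDF weights are learned from the training portion only. The ten toy reviews and the names `toy_texts`, `toy_labels`, and `toy_pipeline` are invented for illustration; this is not the notebook's own method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy reviews standing in for the real `Content` and `Label` columns.
toy_texts = [
    "great film", "loved it", "wonderful acting", "superb plot", "brilliant cast",
    "terrible film", "hated it", "awful acting", "boring plot", "dreadful cast",
]
toy_labels = ["Positive"] * 5 + ["Negative"] * 5

# Split the raw text first so the vectorizer never sees the validation reviews.
texts_train, texts_val, labels_train, labels_val = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# The pipeline fits TF-IDF and the classifier on the training portion only.
toy_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_pipeline.fit(texts_train, labels_train)

val_predictions_toy = toy_pipeline.predict(texts_val)
print(classification_report(labels_val, val_predictions_toy, zero_division=0))
```

With only ten samples the reported metrics are not meaningful; the point of the sketch is the order of operations (split, then fit the vectorizer and model together).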

## Hyperparameter Experimentation

To optimize the performance of our model, we will explore different hyperparameter settings for the linear classifier. This step involves a hyperparameter search, using a technique such as grid search or random search, to find the combination of hyperparameters that yields the highest performance.

In the cell below, we perform hyperparameter experimentation using grid search. We define a parameter grid, `param_grid`, that lists the hyperparameters to tune: `C`, `penalty`, and `solver`. The search reuses the TF-IDF features in `X_train` and the logistic regression `model` from the previous section as its estimator. We use `GridSearchCV` from scikit-learn to run the grid search with 5-fold cross-validation. The best hyperparameters and the corresponding score are printed at the end.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {
    'C': [0.1, 1, 10],
    'penalty': ['l2'],
    'solver': ['lbfgs', 'liblinear']
}

grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best HP: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

The output `Best HP: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}` shows the combination of hyperparameters (HP) that yielded the best performance during the grid search.

The `C` parameter is the inverse of the regularization strength, so smaller values mean stronger regularization. Here, the value 10 is the largest in the grid and therefore corresponds to the weakest regularization of the three candidates. The `penalty` parameter determines the type of regularization; `l2` is also known as ridge regularization. The `solver` parameter specifies the algorithm used to solve the optimization problem; `liblinear`, for example, is well suited to small datasets.

The "Best Score" of 0.848 is the best mean cross-validated score found during the grid search. Since the default score for a classifier is accuracy, this means that the model with these hyperparameters classified 84.8% of the held-out reviews correctly, averaged over the five folds.
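The effect of `C` can be checked with a small experiment. The sketch below is a hedged illustration on synthetic data from `make_classification` (not the movie reviews): it fits the same model with the three `C` values from the grid and compares the size of the learned coefficients. The names `X_toy`, `y_toy`, and `coef_norms` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data; it only serves to compare coefficient sizes.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

coef_norms = {}
for C in [0.1, 1, 10]:
    clf = LogisticRegression(C=C, max_iter=1000)  # L2 regularization is the default
    clf.fit(X_toy, y_toy)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in coef_norms.items():
    print(f"C={C}: coefficient norm = {norm:.3f}")
```

The coefficient norm grows as `C` increases, which is what weaker regularization looks like: the penalty on large weights counts for less relative to the data-fitting term.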

## Evaluation

After determining the optimal hyperparameters, we will train the final model using the entire training dataset. Once trained, we will evaluate the model's performance on the test dataset, including metrics such as accuracy, precision, recall, and F1-score.


In [None]:
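# Use the best hyperparameters found by the grid search for the final model.
model = LogisticRegression(**grid_search.best_params_)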
model.fit(X_train, y_train)
test_predictions = model.predict(X_test)
classification_report_test = classification_report(y_test, test_predictions)

print("Evaluation on Test Dataset:")
print(classification_report_test)

This code initializes a logistic regression model, fits it on the training data `X_train` and `y_train`, makes predictions on the test data `X_test`, and calculates the classification report for evaluating the model's performance.

The `classification_report` function from scikit-learn's `metrics` module generates a detailed report with precision, recall, and F1-score for each class, along with the overall accuracy.

Finally, `print(classification_report_test)` displays the evaluation results on the test dataset.
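To show what the numbers in the report mean, the following sketch computes precision, recall, and F1-score for one class by hand and compares them with scikit-learn's metric functions. The six toy labels and predictions are invented for illustration and do not come from the model above.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true_toy = ["Positive", "Positive", "Positive", "Negative", "Negative", "Negative"]
y_pred_toy = ["Positive", "Positive", "Negative", "Negative", "Negative", "Positive"]

pairs = list(zip(y_true_toy, y_pred_toy))
# Counts for the "Positive" class.
tp = sum(t == "Positive" and p == "Positive" for t, p in pairs)  # true positives
fp = sum(t == "Negative" and p == "Positive" for t, p in pairs)  # false positives
fn = sum(t == "Positive" and p == "Negative" for t, p in pairs)  # false negatives

precision_manual = tp / (tp + fp)  # share of predicted positives that are correct
recall_manual = tp / (tp + fn)     # share of actual positives that were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

precision_sk = precision_score(y_true_toy, y_pred_toy, pos_label="Positive")
recall_sk = recall_score(y_true_toy, y_pred_toy, pos_label="Positive")
f1_sk = f1_score(y_true_toy, y_pred_toy, pos_label="Positive")
accuracy_sk = accuracy_score(y_true_toy, y_pred_toy)

print(precision_manual, recall_manual, f1_manual)
print(precision_sk, recall_sk, f1_sk, accuracy_sk)
```

The manual values and the library values agree; `classification_report` simply repeats this calculation for every class.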

## Prediction

Using the trained model, we will demonstrate making predictions on new samples of text, labeling them as positive or negative. This will showcase the model's ability to generalize to unseen data and its practical application in sentiment analysis tasks.

This code snippet demonstrates making predictions on new samples of text `new_samples` using the trained model. It vectorizes the new samples using the same vectorizer that was used for the training data. The predict method of the model is then used to obtain the predicted labels for the new samples.

The code then prints out the predictions for each new sample, indicating whether it is predicted as positive or negative sentiment.

In [None]:
new_samples = ["Approval Nod", "Disapproval Nod"]
new_samples_vectorized = vectorizer.transform(new_samples)
new_samples_predictions = model.predict(new_samples_vectorized)

print("Predictions for New Samples:")
for sample, prediction in zip(new_samples, new_samples_predictions):
    print(f"Text: {sample}")
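    # Assumes the values in `Label` are the strings 'Positive' and 'Negative';
    # any other predicted value is displayed as 'Negative'.
    print(f"Prediction: {'Positive' if prediction == 'Positive' else 'Negative'}")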
    print()


The code prints the predictions for the new samples. The loop pairs each sample with its corresponding prediction using `zip()` and prints the sample text followed by the predicted sentiment label. The results suggest there is room for improvement in the model's predictions.
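When inspecting predictions on short inputs, it can help to look at class probabilities and at how many words of each sample the vectorizer actually knows. The sketch below is a hedged illustration on four invented training sentences; the names `toy_vectorizer`, `toy_model`, and `new_texts` are hypothetical, and it is only one possible explanation for weak predictions, not a diagnosis of the model above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy training data.
toy_texts = ["great film", "wonderful acting", "terrible film", "awful acting"]
toy_labels = ["Positive", "Positive", "Negative", "Negative"]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

new_texts = ["wonderful film", "Approval Nod"]
X_new = toy_vectorizer.transform(new_texts)

# Number of vocabulary words found in each new sample (non-zero features per row).
known_words = (X_new.toarray() != 0).sum(axis=1)
# One row per sample; the columns follow the order of toy_model.classes_.
probabilities = toy_model.predict_proba(X_new)

for text, n_known, probs in zip(new_texts, known_words, probabilities):
    scores = {label: round(float(p), 3) for label, p in zip(toy_model.classes_, probs)}
    print(f"{text!r}: known words = {n_known}, probabilities = {scores}")
```

A sample with no known words is turned into an all-zero feature vector, so the model has nothing to go on except its intercept; longer, review-like inputs give it more evidence.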

## Conclusion

In this notebook, we have explored a sentiment analysis workflow using a linear classifier from the scikit-learn library. The goal of the analysis was to classify movie reviews as either positive or negative based on their content. We trained and evaluated a logistic regression model using the Sentiment Polarity Data Set v2.0 from Movie Review Data by Pang, Lee, and Vaithyanathan.

We started by preprocessing the data, which involved analyzing the distribution of sentiment labels, checking for missing values, and performing necessary data cleaning. We then transformed the text data into numerical feature vectors using the TfidfVectorizer, which allowed us to represent the reviews as TF-IDF vectors.

Next, we trained a linear classifier, specifically a logistic regression model, on the transformed training data. We used grid search to tune the hyperparameters and selected the best combination based on cross-validation scores. The best configuration reached a mean cross-validated accuracy of 84.8%; performance on the test dataset is given by the classification report in the Evaluation section.

We demonstrated the model's predictive capabilities by making predictions on new samples of text. The model showed the ability to generalize to unseen data, although further improvements could be made, particularly in capturing positive sentiment expressions.

By following the step-by-step process, readers can understand how to preprocess text data, transform it into numerical features, train a linear classifier, and evaluate the model's performance.

To further enhance the sentiment analysis workflow, future work may involve exploring more advanced techniques such as word embeddings, pre-trained language models, and fine-tuning the model architecture.

By applying the concepts and techniques demonstrated in this notebook, readers can adapt and extend this workflow to their own sentiment analysis tasks and explore the fascinating field of natural language processing.

We hope you enjoyed the time you spent reading this notebook.