<a href="https://colab.research.google.com/github/mshojaei77/NLP-Journey/blob/main/01_sentiment-analysis-with-logistic-regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Step 1: Install & Import Libraries

First, install the required packages, then import the libraries used throughout the notebook.


In [1]:
!pip install -q nltk scikit-learn transformers datasets mlflow

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
import mlflow
import mlflow.sklearn
from datasets import load_dataset
from torch.cuda.amp import autocast

### Step 2: Download NLTK Data

NLTK (Natural Language Toolkit) is a popular library for natural language processing. We need to download the data it uses for tokenization, stop-word removal, and lemmatization.




In [3]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

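1. **`nltk.download('punkt')`**: This command downloads the Punkt Tokenizer Models. Punkt is a pre-trained unsupervised machine learning model for splitting text into sentences. `word_tokenize` relies on it: it first splits the text into sentences and then splits each sentence into words. Recent NLTK releases may ask for the `punkt_tab` resource instead; if `word_tokenize` raises a `LookupError`, download the resource named in the error message.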

2. **`nltk.download('stopwords')`**: This command downloads the NLTK stopwords corpus. Stop words are common words in a language (like 'the', 'is', 'in') that typically do not contain important meaning and are often removed from texts during preprocessing to reduce noise and focus on words that carry more significant meaning. NLTK provides a list of stop words for several languages, which can be used to filter out these common words from textual data.

3. **`nltk.download('wordnet')`**: This command downloads WordNet, a large lexical database of English words. WordNet groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. WordNet can be used for word sense disambiguation, text analysis, and natural language understanding tasks.

### Step 3: Load IMDB Dataset
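We load the IMDB movie-review dataset with the Hugging Face `datasets` library. The rest of the notebook uses only its `train` split (25,000 labeled reviews), in which label `0` marks a negative review and `1` a positive one.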

In [4]:
dataset = load_dataset('imdb')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


### Step 4: Preprocess the Data

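Before embedding the reviews, we clean the text: tokenize it, lowercase every token, drop tokens that are not purely alphanumeric (punctuation, for example), remove stop words, and lemmatize what remains. Stop words are common words like 'the', 'is', 'and' that do not carry much meaning.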

In [5]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

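def preprocess(text):
    """Tokenize, lowercase, filter, and lemmatize one review; return a space-joined string."""
    # Split into word and punctuation tokens
    tokens = word_tokenize(text)
    # Normalize case so 'Great' and 'great' are treated alike
    tokens = [word.lower() for word in tokens]
    # Keep purely alphanumeric tokens (drops punctuation and fragments such as "n't")
    tokens = [word for word in tokens if word.isalnum()]
    # Remove stop words
    tokens = [word for word in tokens if word not in stop_words]
    # Reduce each word to its WordNet lemma (treated as a noun by default)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)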

# Extract texts and labels from the dataset
texts = [preprocess(sample['text']) for sample in dataset['train']]
labels = [sample['label'] for sample in dataset['train']]
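
To see what these filters do to a review, here is a toy run-through in plain Python. It is only a sketch: `TOY_STOP_WORDS` and `toy_preprocess` are hypothetical stand-ins, a regular expression replaces `word_tokenize`, and lemmatization is left out. The tiny stop-word set includes "not" because NLTK's English list also contains negations.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (the real list is
# much longer; like this one, it contains the negation "not").
TOY_STOP_WORDS = {"the", "was", "is", "a", "not", "and", "it"}

def toy_preprocess(text):
    # Stand-in for word_tokenize: words, plus each punctuation mark as its own token.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    tokens = [t.lower() for t in tokens]                       # normalize case
    tokens = [t for t in tokens if t.isalnum()]                # drop punctuation
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]    # drop stop words
    return " ".join(tokens)

negative = toy_preprocess("The movie was not good.")
positive = toy_preprocess("The movie was good!")
print(negative)  # movie good
print(positive)  # movie good
```

Both sentences collapse to the same string, so stop-word removal can erase a negation before the text ever reaches BERT. This is worth keeping in mind when inspecting misclassified reviews.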

### Step 5: Text Embeddings using BERT

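BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model that provides rich contextual embeddings. We use `bert-base-uncased` to convert each preprocessed review into a single vector by averaging the token vectors of the last hidden layer. The code moves both the model and the tokenized inputs to the GPU, so it needs a CUDA-enabled runtime.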


In [6]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to('cuda')

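def get_embeddings(texts):
    # Tokenize the batch, pad to the longest text, and truncate to the model's maximum length
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to('cuda')
    # Inference only: disable gradient tracking and run in mixed precision
    with torch.no_grad():
        with autocast():
            outputs = model(**inputs)
    # Mean-pool over the token dimension to get one vector per text
    embeddings = outputs.last_hidden_state.mean(dim=1).cpu().numpy()
    return embeddings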

def get_embeddings_in_batches(texts, batch_size=32):
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings = get_embeddings(batch)
        all_embeddings.append(embeddings)
    return np.vstack(all_embeddings)

X_embeddings = get_embeddings_in_batches(texts)
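
`get_embeddings` averages over every position in the padded batch, so for the shorter texts in a batch the padding positions also contribute to the mean. A common alternative is to weight the average by the attention mask that the tokenizer returns alongside the input IDs. The following is a minimal NumPy sketch of that idea, not the method used in this notebook: `masked_mean_pool` and the toy arrays are hypothetical, and the padding vector is exaggerated to make the effect visible. The results reported later were obtained with the plain mean.

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors per text, ignoring padded positions.

    hidden:         float array of shape (batch, seq_len, dim)
    attention_mask: 0/1 array of shape (batch, seq_len); 0 marks padding
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)                     # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1.0, None)            # guard against division by zero
    return summed / counts

# Toy batch: one text with two real tokens followed by one padding position.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

plain = hidden.mean(axis=1)                         # averages all three positions
masked = masked_mean_pool(hidden, attention_mask)   # averages the two real tokens
print(plain)
print(masked)  # [[2. 3.]]
```

The same arithmetic carries over to torch tensors, where the mask would come from `inputs['attention_mask']`.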

### Step 6: Split the Data into Training and Testing Sets

We split the data into a training set and a testing set so that the model is evaluated on reviews it has not seen. `train_test_split` splits the data randomly, with 80% for training and 20% for testing; `random_state=42` makes the split reproducible.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X_embeddings, labels, test_size=0.2, random_state=42)
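
Conceptually, a seeded random split shuffles the row indices once and cuts the shuffled list in two. The sketch below illustrates that with the standard library only. `simple_split` is a hypothetical helper, and it will not reproduce the exact rows that scikit-learn selects for the same seed.

```python
import math
import random

def simple_split(features, labels, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed, then reserve the first test_size share for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)           # same seed, same order
    n_test = math.ceil(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_tr = [features[i] for i in train_idx]
    X_te = [features[i] for i in test_idx]
    y_tr = [labels[i] for i in train_idx]
    y_te = [labels[i] for i in test_idx]
    return X_tr, X_te, y_tr, y_te

# Toy data: ten "reviews" with alternating labels.
features = [f"review {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

X_tr, X_te, y_tr, y_te = simple_split(features, toy_labels)
print(len(X_tr), len(X_te))  # 8 2
```

Features and labels are indexed with the same shuffled positions, which keeps every review paired with its own label.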

### Step 7: Define a Pipeline with StandardScaler and Logistic Regression

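A pipeline chains multiple steps into a single estimator, so the scaler is fitted on the training data only and the same transformation is applied automatically at prediction time. `StandardScaler` standardizes each feature by removing the mean and scaling to unit variance. Logistic Regression is a simple yet powerful classification algorithm; here it uses L2 regularization, the `liblinear` solver, and balanced class weights.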

In [8]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(penalty='l2', solver='liblinear', class_weight='balanced'))
])
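
As a rough illustration of what the scaling step computes, the NumPy sketch below standardizes a toy matrix by hand. The random arrays are hypothetical stand-ins for the embedding matrices; the point is that the mean and standard deviation come from the training rows and are then reused for the test rows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for embeddings: rows are texts, columns are features.
train = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
test = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

# Statistics are estimated on the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics, do not refit

print(train_scaled.mean(axis=0).round(6))  # approximately 0 per column
print(train_scaled.std(axis=0).round(6))   # approximately 1 per column
```

The test rows end up close to, but not exactly at, zero mean and unit variance, which is expected because they played no part in estimating the statistics.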

### Step 8: Start an MLflow Run

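MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It helps in tracking experiments, packaging code into reproducible runs, and sharing and deploying models. Inside the run below, we train the pipeline, evaluate it on the test set, and log the accuracy, a descriptive parameter, and the fitted model.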

In [9]:
with mlflow.start_run():
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)

    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_param("model", "LogisticRegression with BERT Embeddings")
    mlflow.sklearn.log_model(pipeline, "logistic_regression_bert_model")

    print("Accuracy:", accuracy)
    print("Classification Report:\n", report)
    print("Confusion Matrix:\n", conf_matrix)



Accuracy: 0.8328
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.83      0.83      2515
           1       0.83      0.83      0.83      2485

    accuracy                           0.83      5000
   macro avg       0.83      0.83      0.83      5000
weighted avg       0.83      0.83      0.83      5000

Confusion Matrix:
 [[2099  416]
 [ 420 2065]]
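
The figures in the classification report can be recomputed from the confusion matrix, which helps when reading either one. In scikit-learn's layout, rows are true labels and columns are predicted labels. The short calculation below uses the matrix printed above and treats class `1` (positive) as the class of interest; the variable names are illustrative.

```python
# Confusion matrix from the run above: rows are true labels, columns are predictions.
conf_matrix = [[2099, 416],
               [420, 2065]]

tn, fp = conf_matrix[0]  # true negatives, false positives
fn, tp = conf_matrix[1]  # false negatives, true positives
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)   # of the reviews predicted positive, how many were positive
recall_pos = tp / (tp + fn)      # of the truly positive reviews, how many were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy  = {accuracy:.4f}")       # accuracy  = 0.8328
print(f"precision = {precision_pos:.2f}")  # precision = 0.83
print(f"recall    = {recall_pos:.2f}")     # recall    = 0.83
print(f"f1        = {f1_pos:.2f}")         # f1        = 0.83
```

The row sums (2515 and 2485) match the `support` column of the report, and the two kinds of error are almost equally frequent (416 false positives against 420 false negatives).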


### Step 9: Example Prediction

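Finally, we predict the sentiment of a new piece of text. The text goes through the same preprocessing and embedding steps as the training data before it is passed to the fitted pipeline.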

In [11]:
new_text = ["This movie was absolutely fantastic!"]
new_text_processed = [preprocess(text) for text in new_text]
new_text_embeddings = get_embeddings(new_text_processed)
prediction = pipeline.predict(new_text_embeddings)

# Mapping the prediction to a human-readable sentiment
sentiment_mapping = {0: "negative", 1: "positive"}
predicted_sentiment = sentiment_mapping[prediction[0]]

print("The sentiment of the text '{}' is predicted to be: {}".format(new_text[0], predicted_sentiment))

# Additional explanation for beginners
if predicted_sentiment == "positive":
    print("This means the model thinks the text expresses a positive opinion or feeling.")
else:
    print("This means the model thinks the text expresses a negative opinion or feeling.")

The sentiment of the text 'This movie was absolutely fantastic!' is predicted to be: positive
This means the model thinks the text expresses a positive opinion or feeling.
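
A hard label hides how confident the model is. A scikit-learn pipeline that ends in logistic regression also exposes `predict_proba`, which returns one probability per class. The sketch below shows the call on a small synthetic problem rather than on the BERT embeddings: `toy_pipeline`, the 2-D data, and `new_point` are all hypothetical, and the default solver is used for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical 2-D "embeddings": class 0 clustered near -1, class 1 near +1.
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(1.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

toy_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression()),
])
toy_pipeline.fit(X, y)

new_point = np.array([[2.0, 2.0]])                    # clearly on the class-1 side
proba = toy_pipeline.predict_proba(new_point)[0]      # [P(class 0), P(class 1)]
print({"negative": round(float(proba[0]), 3), "positive": round(float(proba[1]), 3)})
```

On the notebook's own pipeline, the equivalent call would be `pipeline.predict_proba(new_text_embeddings)`, with columns ordered as in `pipeline.classes_`.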


### Summary
This notebook builds a sentiment classifier for IMDB reviews from three parts: NLTK preprocessing, mean-pooled `bert-base-uncased` embeddings, and a scikit-learn pipeline that combines `StandardScaler` with logistic regression. On a held-out 20% of the training split, the model reaches about 83% accuracy. The evaluation uses accuracy, a classification report, and a confusion matrix, and the accuracy and fitted model are logged with MLflow for tracking and reproducibility.

