# Sentiment classification - close to the state of the art

The task of classifying sentiments of texts (for example movie or product reviews) has high practical significance in online marketing as well as financial prediction. This is a non-trivial task, since the concept of sentiment is not easily captured.

For this assignment you have to use the larger [IMDB sentiment](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz) benchmark dataset from Stanford, and achieve close to state-of-the-art results.

The task is to try out multiple models in ascending complexity, namely:

1. TFIDF + classical statistical model (e.g. RandomForest)
2. LSTM classification model
3. LSTM model, where the embeddings are initialized with pre-trained word vectors
4. fastText model
5. BERT based model (you are advised to use a pre-trained one and finetune, since the resource consumption is considerable!)

You should get over 90% validation accuracy (though nearly 94% is achievable).

You are allowed to use any library or tool, though the Keras environment and some wrappers on top of it (e.g. ktrain) make your life easier.





__Groups__
This assignment is to be completed individually, two weeks after the class has finished. For the precise deadline please see canvas.

__Format of submission__
You need to submit a PDF of your Google Colab notebooks.

__Due date__
Two weeks after the class has finished. For the precise deadline please see canvas.

Grade distribution:
1. TFIDF + classical statistical model (e.g. RandomForest) (25% of the final grade)
2. LSTM classification model (15% of the final grade)
3. LSTM model, where the embeddings are initialized with pre-trained word vectors, e.g. fastText, GloVe etc. (15% of the final grade)
4. fastText model (15% of the final grade)
5. BERT based model (you are advised to use a pre-trained one and finetune it, since the resource consumption is considerable!) (20% of the final grade). For BERT you should get over 90% validation accuracy (though nearly 94% is achievable).
6. Try out an LLM more advanced than BERT and achieve a higher accuracy than with BERT (10% of the final grade)


__For each of the models, the marks will be awarded according to the following three criteria__:

(1) The (appropriately measured) accuracy of your prediction for the task. The more accurate the prediction is, the better. Note that you need to validate the predictive accuracy of your model on a hold-out of unseen data that the model has not been trained with.

(2) How well you motivate the use of the model: what in this model's structure makes it suited for representing sentiment? After using the model for the task, how well you evaluate the accuracy you got for each model and discuss the main advantages and disadvantages the model has in this particular modelling task. Ideally, you use parts of your modelling to support your arguments.

(3) The consistency of your take-aways, i.e. what you have learned from your analyses. Also, analyze when the model is good and when and where it does not predict well.

Please make sure that you comment with # on the separate steps of the code you have produced. For the verbal description and analyses please insert markdown cells.


__Plagiarism__: The Frankfurt School does not accept any plagiarism. Data science is a collaborative exercise and you can discuss the research question with your classmates from other groups, if you like. You must not copy any code or text though. Plagiarism will be prosecuted and will result in a mark of 0 and you failing this class.

After carefully reading this document and having had a look at the data you may still have questions. Please submit those questions to the public Q&A board in canvas and we will answer each question, so 

# Data download

In [1]:
#!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
#!tar -xzf aclImdb_v1.tar.gz
#!ls

# Alternative with TensorFlow Datasets (tfds)

In [2]:
#!pip install tensorflow-datasets > /dev/null

In [3]:
import tensorflow_datasets as tfds

In [4]:
(ds_train,ds_test),ds_info = tfds.load(
    name="imdb_reviews",
    split=["train","test"],
    shuffle_files=True,
    as_supervised=True,
    with_info=True
)

In [5]:
ds_info

tfds.core.DatasetInfo(
    name='imdb_reviews',
    full_name='imdb_reviews/plain_text/1.0.0',
    description="""
    Large Movie Review Dataset. This is a dataset for binary sentiment
    classification containing substantially more data than previous benchmark
    datasets. We provide a set of 25,000 highly polar movie reviews for training,
    and 25,000 for testing. There is additional unlabeled data for use as well.
    """,
    config_description="""
    Plain text
    """,
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    data_dir='/Users/nilsmart96/tensorflow_datasets/imdb_reviews/plain_text/1.0.0',
    file_format=tfrecord,
    download_size=80.23 MiB,
    dataset_size=129.83 MiB,
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
        'text': Text(shape=(), dtype=string),
    }),
    supervised_keys=('text', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=25000, num_sha

In [6]:
# Import necessary libraries
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# General text-labels definition for all models.
# Texts and labels are collected in a single pass so that they stay aligned
# even if the dataset order changes between iterations (shuffle_files=True).
train_pairs = [(text.decode('utf-8'), int(label)) for text, label in tfds.as_numpy(ds_train)]
test_pairs = [(text.decode('utf-8'), int(label)) for text, label in tfds.as_numpy(ds_test)]
train_texts, train_labels = map(list, zip(*train_pairs))
test_texts, test_labels = map(list, zip(*test_pairs))

In [7]:
# Task 1: TFIDF + classical statistical model (eg. RandomForest)


In [8]:
# Import necessary additional libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report

# Create TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
# Create RandomForest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Build a Pipeline
pipeline = Pipeline([
    ('tfidf', tfidf_vectorizer),
    ('clf', rf_classifier)
])

# Train the Model
pipeline.fit(train_texts, train_labels)

# Evaluate the Model
predictions = pipeline.predict(test_texts)

accuracy = accuracy_score(test_labels, predictions)
print(f"Accuracy: {accuracy}")

print("\nClassification Report:")
print(classification_report(test_labels, predictions))

Accuracy: 0.83952

Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.85      0.84     12500
           1       0.85      0.83      0.84     12500

    accuracy                           0.84     25000
   macro avg       0.84      0.84      0.84     25000
weighted avg       0.84      0.84      0.84     25000



The model used for this task combines a TF-IDF representation with a RandomForest classifier. Several of its characteristics make it a reasonable baseline for sentiment analysis:

### Advantages:

1. **TF-IDF Representation:**
   - **Feature Selection:** TF-IDF helps in selecting the most informative words in the dataset by giving higher weights to words that are more specific to certain documents and less frequent across the entire dataset.
   - **Sparse Representation:** The TF-IDF matrix is often sparse, meaning it has many zero entries. This can be beneficial in terms of memory efficiency and can lead to faster training and inference.

2. **RandomForest Classifier:**
   - **Ensemble Learning:** RandomForest is an ensemble learning method that builds multiple decision trees and merges their predictions. This helps to reduce overfitting and improves the generalization of the model.
   - **Robust to Noisy Data:** RandomForest is robust to noisy data and outliers, making it suitable for handling real-world data with variations.

3. **Interpretability:**
   - RandomForest models are relatively easy to interpret. You can analyze feature importances to understand which words contribute the most to the sentiment prediction.
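
To make the TF-IDF weighting described above concrete, here is a minimal pure-Python sketch on a hypothetical three-review corpus (not taken from the IMDB data). It uses the textbook formula `tf * log(N / df)`; scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalisation by default, so its absolute numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the IMDB data
docs = [
    "the plot was brilliant",
    "the acting was dull",
    "the ending was brilliant and the cast was brilliant",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many reviews does each word occur?
df = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    counts = Counter(tokens)
    return {w: (c / len(tokens)) * math.log(n_docs / df[w]) for w, c in counts.items()}

weights = tfidf(tokenized[1])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {weight:.3f}")
```

Words that occur in every review ("the", "was") receive a weight of zero, while words specific to one review ("acting", "dull") receive the highest weights.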

### Evaluation:

1. **Accuracy:**
   - The accuracy achieved (83.95% on the 25,000 test reviews) is a solid baseline for a binary sentiment classification task, but it remains well below the 90% target that the assignment sets for BERT.

2. **Precision, Recall, and F1-Score:**
   - Precision, recall, and F1-score are balanced for both positive and negative classes, indicating that the model performs well in terms of both identifying positive and negative sentiments.
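
As a reminder of what the numbers in the classification report mean, the sketch below recomputes accuracy, precision, recall and F1 from raw predictions. The labels are hypothetical toy values, not the notebook's predictions, and `binary_report` is an illustrative helper rather than a scikit-learn function.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels for eight reviews (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
report = binary_report(y_true, y_pred)
print(report)
```

On a balanced test set such as this one (12,500 reviews per class), similar precision and recall for both classes suggest that the classifier is not biased towards one label.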

### Advantages and Disadvantages:

1. **Advantages:**
   - **Interpretability:** RandomForest models provide feature importances, allowing you to interpret which words are influential in sentiment prediction.
   - **Little Hyperparameter Tuning:** RandomForest often requires less hyperparameter tuning than other models, making it easier to use out of the box.

2. **Disadvantages:**
   - **Limited Context Understanding:** TF-IDF representation treats each word independently and doesn't capture the context between words. This limits the model's ability to understand the meaning of phrases or sentences.
   - **Potential Overfitting:** While RandomForest is less prone to overfitting than individual decision trees, it can still overfit noisy data, and the model's complexity may lead to a loss of generalization on unseen data.
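
The limited context understanding can be illustrated with a small sketch. The two reviews below are hypothetical; they express opposite sentiment but contain exactly the same words, so any unigram bag-of-words representation (including unigram TF-IDF) maps them to identical features.

```python
from collections import Counter

def bag_of_words(text):
    """Count tokens while discarding their order, as unigram TF-IDF features do."""
    return Counter(text.lower().split())

# Two hypothetical reviews with opposite sentiment but the same words
review_a = "not good it was bad"
review_b = "not bad it was good"

same_features = bag_of_words(review_a) == bag_of_words(review_b)
print(same_features)
```

Passing `ngram_range=(1, 2)` to `TfidfVectorizer` is one way to recover some local word order, at the cost of a larger feature space.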

### Recommendations:

1. **Experiment with More Advanced Embeddings:**
   - Consider using word embeddings (e.g., Word2Vec, GloVe) or more advanced pre-trained language models (e.g., BERT, GPT) to capture richer semantic relationships between words.

2. **Hyperparameter Tuning:**
   - Experiment with hyperparameter tuning for the RandomForest model to see if further improvements can be achieved.

3. **Ensemble Models:**
   - Explore ensemble models that combine predictions from multiple models, potentially leveraging different types of features or representations.
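
A minimal sketch of the ensemble idea, assuming hard majority voting over hypothetical predictions from three models. `majority_vote` is an illustrative helper; note that scikit-learn's `RandomForestClassifier` combines its trees by averaging class probabilities rather than by hard votes.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample."""
    combined = []
    for votes in zip(*predictions_per_model):
        label, _ = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined

# Hypothetical predictions from three models on four reviews
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]
ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

With an odd number of models and binary labels there are no ties, so each sample receives the label that at least two of the three models agree on.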

In summary, while the TF-IDF and RandomForest approach is a solid starting point, there is room for improvement by exploring more sophisticated representations and models to capture nuanced relationships within the text data.

In [9]:
# Import necessary additional libraries
from sklearn.model_selection import train_test_split
import numpy as np

# Needed for both LSTM models
max_words = 10000
max_len = 200

# Tokenization
tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(train_texts)

# Sequences
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

# Padding
train_padded = pad_sequences(train_sequences, maxlen=max_len, truncating='post', padding='post')
test_padded = pad_sequences(test_sequences, maxlen=max_len, truncating='post', padding='post')

# Train/val split, so we can test with unseen data later
X_train, X_val, y_train, y_val = train_test_split(train_padded, train_labels, test_size=0.2, random_state=42)

# Convert labels to NumPy arrays
y_train = np.array(y_train)
y_val = np.array(y_val)

# Rename the testing variables
X_test = test_padded
y_test = test_labels

In [10]:
#2. LSTM classification model

In [11]:
# Import necessary additional libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional

# Build the LSTM model
model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=64, input_length=max_len))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=3, validation_data=(X_val, y_val))

# Evaluate the model
predictions = (model.predict(X_test) > 0.5).astype("int32")

accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")

print("\nClassification Report:")
print(classification_report(y_test, predictions))

Epoch 1/3
Epoch 2/3
Epoch 3/3
Accuracy: 0.84184

Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.82      0.84     12500
           1       0.83      0.86      0.84     12500

    accuracy                           0.84     25000
   macro avg       0.84      0.84      0.84     25000
weighted avg       0.84      0.84      0.84     25000



Let's analyze the LSTM model's structure and evaluate its performance, discussing the advantages and disadvantages for sentiment analysis:

### Model Structure:

1. **Embedding Layer:**
   - The Embedding layer learns a dense representation of words in a continuous vector space. This allows the model to capture semantic relationships between words and understand their contextual meaning.

2. **Bidirectional LSTM Layer:**
   - The Bidirectional LSTM layer processes input sequences in both forward and backward directions. This bidirectional processing helps capture long-range dependencies in the input data, making it well-suited for understanding the sequential nature of text.

3. **Dense Output Layer:**
   - The Dense layer with a sigmoid activation function is used for binary sentiment classification. It produces a probability score indicating the likelihood of a given input belonging to the positive class.
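
The bidirectional processing described above can be sketched with a toy recurrence. This is a plain scalar tanh recurrence, not an LSTM, and it shares one set of hypothetical weights between both directions (Keras trains separate weights per direction). It only shows why reading the sequence in both directions yields two different summaries that are concatenated, which is why `Bidirectional(LSTM(64))` produces 128 features with the default concatenation.

```python
import math

def run_rnn(inputs, w_in=0.5, w_rec=0.8):
    """Return the final state of a scalar tanh recurrence over the inputs."""
    state = 0.0
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
    return state

def bidirectional(inputs):
    """Concatenate the final forward state with the final backward state."""
    forward = run_rnn(inputs)
    backward = run_rnn(list(reversed(inputs)))
    return [forward, backward]

# Hypothetical scalar "word" inputs; a real model feeds embedding vectors
sequence = [1.0, -2.0, 0.5]
features = bidirectional(sequence)
print(len(features), features[0] != features[1])
```

The forward state is dominated by the last tokens it has seen and the backward state by the first ones, so the concatenation carries information from both ends of the review.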

### Evaluation:

1. **Accuracy:**
   - The LSTM model achieved an accuracy of approximately 84.18% on the test set, which is only a marginal improvement over the TF-IDF and RandomForest approach (83.95%).

2. **Precision, Recall, and F1-Score:**
   - Precision, recall, and F1-score are balanced for both positive and negative classes, indicating that the model performs well in terms of both identifying positive and negative sentiments.

### Advantages:

1. **Contextual Understanding:**
   - The LSTM model captures sequential dependencies in the data, allowing it to understand the context of words in a sentence. This is crucial for sentiment analysis where the meaning often depends on the arrangement of words.

2. **Adaptability to Sequence Length:**
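   - LSTMs can in principle handle variable-length sequences, making them suitable for tasks like sentiment analysis where the length of input texts may vary. In this notebook, however, every review is padded or truncated to 200 tokens before it reaches the model.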

### Disadvantages:

1. **Computational Complexity:**
   - LSTMs, especially bidirectional ones, can be computationally intensive and may require more resources during training and inference compared to simpler models.

2. **Potential Overfitting:**
   - Deep neural networks, including LSTMs, are susceptible to overfitting, especially when dealing with relatively small datasets. Regularization techniques may be necessary to address this issue.

### Recommendations:

1. **Regularization:**
   - Consider adding dropout layers to the model to reduce overfitting and improve generalization.

2. **Hyperparameter Tuning:**
   - Experiment with different hyperparameter settings, such as the number of LSTM units, embedding dimensions, and learning rates, to find optimal values for your specific task.

3. **Ensemble Models:**
   - Explore ensemble models that combine predictions from multiple models, potentially leveraging different architectures or representations.

In summary, the LSTM model's ability to capture sequential dependencies and understand the context of words contributes to its slightly better performance in sentiment analysis compared to the TF-IDF and RandomForest approach. However, it comes with increased computational complexity and a potential risk of overfitting, which should be carefully addressed during model development.

In [12]:
# Import necessary additional libraries
import spacy

# Load the spaCy model with pre-trained word vectors
nlp = spacy.load("en_core_web_lg")

# Create an embedding matrix with spaCy word vectors
embedding_dim = nlp.vocab.vectors.shape[1]
embedding_matrix = np.zeros((max_words, embedding_dim))

for word, index in tokenizer.word_index.items():
    if index < max_words:
        embedding_matrix[index] = nlp(word).vector

# Build the LSTM model with pre-trained embeddings
model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=embedding_dim, weights=[embedding_matrix], input_length=max_len, trainable=False))
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=3, validation_data=(X_val, y_val))

# Evaluate the model
predictions = (model.predict(X_test) > 0.5).astype("int32")

accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")

print("\nClassification Report:")
print(classification_report(y_test, predictions))

Epoch 1/3
Epoch 2/3
Epoch 3/3
Accuracy: 0.8542

Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.86      0.86     12500
           1       0.86      0.85      0.85     12500

    accuracy                           0.85     25000
   macro avg       0.85      0.85      0.85     25000
weighted avg       0.85      0.85      0.85     25000



The LSTM model with pre-trained word embeddings from spaCy demonstrates reasonable performance in sentiment analysis on the IMDb dataset. Let's discuss some aspects of the model and its evaluation:

### Model Structure:
1. **Embedding Layer with Pre-trained Word Vectors:**
   - The model utilizes an embedding layer initialized with pre-trained word vectors from the spaCy model and kept frozen during training (`trainable=False`). These static vectors were learned on a large external corpus, so the model starts with semantic relationships between words that it could not learn from the IMDB reviews alone.
   
2. **Bidirectional LSTM Layers:**
   - The LSTM layers are bidirectional, allowing the model to consider both past and future context for each word. This is particularly useful in understanding the sentiment expressed in longer sequences.

3. **Dense Layer with Sigmoid Activation:**
   - The model ends with a dense layer with a sigmoid activation function, suitable for binary classification tasks like sentiment analysis.
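
The way the embedding layer is initialized can be sketched without spaCy. The vectors and the word index below are hypothetical, and `build_embedding_matrix` is an illustrative helper: row `i` of the matrix receives the pre-trained vector of the word with tokenizer index `i`, and words without a vector keep an all-zero row.

```python
def build_embedding_matrix(word_index, vectors, max_words, dim):
    """Fill row i with the vector of the word whose index is i, else zeros."""
    matrix = [[0.0] * dim for _ in range(max_words)]
    missing = []
    for word, index in word_index.items():
        if index >= max_words:
            continue  # outside the vocabulary kept by the tokenizer
        if word in vectors:
            matrix[index] = list(vectors[word])
        else:
            missing.append(word)  # row stays all zeros
    return matrix, missing

# Hypothetical 2-dimensional vectors; index 0 is reserved for padding
vectors = {"good": [0.9, 0.1], "bad": [-0.8, 0.2]}
word_index = {"<OOV>": 1, "good": 2, "bad": 3, "meh": 4, "rare": 5}
matrix, missing = build_embedding_matrix(word_index, vectors, max_words=5, dim=2)
print(matrix)
print(missing)
```

Tracking the words without a vector, as `missing` does here, is a cheap way to check how much of the vocabulary the pre-trained vectors actually cover.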

### Evaluation Results:
- **Accuracy:** The model achieves an accuracy of approximately 85.4% on the test set, which is a reasonably good performance for sentiment analysis and about one percentage point above the LSTM without pre-trained vectors.

- **Precision, Recall, and F1-Score:**
  - Precision and recall values are balanced, indicating that the model is performing well for both positive and negative sentiment classes.
  - The F1-score, which considers both precision and recall, is also balanced.

### Advantages:
1. **Effective Use of Pre-trained Embeddings:**
   - Utilizing pre-trained word embeddings from spaCy leverages rich semantic information, enabling the model to understand the context and sentiment expressed in the reviews.

2. **Bidirectional LSTM:**
   - The bidirectional nature of the LSTM layers allows the model to capture dependencies in both directions, providing a more comprehensive understanding of the input sequence.

3. **Reasonable Accuracy:**
   - Achieving an accuracy of 85.4% is competitive, especially considering the complexity of sentiment analysis and the presence of nuanced language in movie reviews.

### Disadvantages:
1. **Fixed-Length Sequences:**
   - The model uses fixed-length sequences, which may result in information loss for longer reviews. More sophisticated approaches, such as attention mechanisms, could be considered for handling variable-length sequences more effectively.

2. **Computational Complexity:**
   - The model's architecture, especially with bidirectional LSTM layers, can be computationally expensive and may require significant resources for training.

3. **Potential for Overfitting:**
   - Depending on the size of the dataset and the complexity of the model, there is a potential risk of overfitting. Regularization techniques or further model tuning might be explored to address this.
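
The information loss from fixed-length sequences can be made concrete with a small sketch. `pad_post` is an illustrative helper that mirrors what `pad_sequences(..., truncating='post', padding='post')` does to a single sequence; the token ids are hypothetical, and the notebook itself uses `maxlen=200`.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Truncate after maxlen tokens, then right-pad with pad_value."""
    truncated = sequence[:maxlen]
    return truncated + [pad_value] * (maxlen - len(truncated))

# Hypothetical token ids
short_review = [12, 7]
long_review = [5, 9, 3, 8, 2, 6]

print(pad_post(short_review, 4))
print(pad_post(long_review, 4))
```

With `truncating='post'` everything after the first `maxlen` tokens is dropped, which may matter if the verdict comes late in a review; `truncating='pre'` would keep the end of the review instead.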

In summary, the model demonstrates solid performance in sentiment analysis, leveraging pre-trained embeddings and bidirectional LSTMs. However, as with any modeling task, there are trade-offs, and further exploration and tuning could be done based on specific requirements and constraints.

In [13]:
# Import necessary additional libraries
import fasttext

# Some modification to general labeling format from above.
# Reuse the labels already extracted so they stay aligned with train_texts/test_texts.
train_labels_ft = [f'__label__{label}' for label in train_labels]
test_labels_ft = [f'__label__{label}' for label in test_labels]

# Save data to files as required by fastText
with open('train.txt', 'w', encoding='utf-8') as f:
    for text, label in zip(train_texts, train_labels_ft):
        f.write(f'{label} {text}\n')

with open('test.txt', 'w', encoding='utf-8') as f:
    for text, label in zip(test_texts, test_labels_ft):
        f.write(f'{label} {text}\n')

# Train a supervised model
model = fasttext.train_supervised(input='train.txt', epoch=10, lr=0.5)

# Make predictions
predictions = [model.predict(text)[0][0] for text in test_texts]

# Convert predictions to binary labels
binary_predictions = [int(label.split('__label__')[1]) for label in predictions]

# Evaluate predictions
accuracy = accuracy_score(y_test, binary_predictions)
print(f"Accuracy: {accuracy}")

print("\nClassification Report:")
print(classification_report(y_test, binary_predictions))

Read 5M words
Number of words:  281132
Number of labels: 2
Progress: 100.0% words/sec/thread: 4251333 lr:  0.000000 avg.loss:  0.223636 ETA:   0h 0m 0s


Accuracy: 0.87564

Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.88      0.88     12500
           1       0.88      0.87      0.87     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000



The FastText model you trained for sentiment analysis on the IMDb dataset has shown a good level of accuracy. Let's discuss what in the model's structure makes it suited for representing sentiment, evaluate its accuracy, and discuss its advantages and disadvantages:

### Model Structure:

FastText is a shallow neural network model designed for text classification. Key features of FastText that make it suitable for sentiment analysis:

1. **Word Embeddings:**
   - FastText uses word embeddings to represent words in the text. These embeddings capture semantic information about words, allowing the model to understand the context and meaning of words in the input.

2. **n-gram Features:**
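   - FastText can incorporate subword information through character n-gram features, and local word order through word n-grams. This helps in capturing morphological information and is especially useful for handling out-of-vocabulary words or variations of words. Note that the `train_supervised` call above leaves `minn`, `maxn` and `wordNgrams` at their defaults, which, to my understanding, disable character n-grams and use unigrams only, so the model trained here does not actually exploit these features.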

3. **Linear Classifier:**
   - The model uses a linear classifier at the output layer for classifying the input into different sentiment classes. This simplicity allows for fast training and prediction.
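
The subword features mentioned above can be sketched as follows. `char_ngrams` is an illustrative helper, not part of the fastText API; it assumes fastText's convention of wrapping each word in `<` and `>` boundary markers before extracting character n-grams. In the Python bindings these features are controlled by the `minn` and `maxn` arguments of `train_supervised`.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(wrapped[i:i + n] for i in range(len(wrapped) - n + 1))
    return grams

print(char_ngrams("where"))
```

Because a word's representation is built from such pieces, an unseen word like "wherever" still shares n-grams with "where" and does not start from nothing.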

### Accuracy Evaluation:

The accuracy achieved by the FastText model on the IMDb test set is 87.56%, the best result among the models evaluated so far. This indicates that the model is effective in distinguishing between positive and negative sentiments in movie reviews.

### Advantages:

1. **Efficiency:**
   - FastText is known for its efficiency in training and prediction. It can handle large datasets and train models quickly.

2. **Word Embeddings with Subword Information:**
   - The combination of word embeddings and subword information helps the model handle unseen words and variations of words effectively.

3. **Good Performance on Text Classification:**
   - FastText is particularly effective for text classification tasks, including sentiment analysis. It performs well even with limited labeled data.

### Disadvantages:

1. **Lack of Sequence Information:**
   - FastText doesn't capture sequential information in the text. It treats the text as an unordered bag of words, which might limit its performance on tasks where word order is crucial.

2. **Limited Context Understanding:**
   - While word embeddings capture some context, they may not capture long-range dependencies or nuanced semantic relationships.

3. **Not Suitable for Complex Tasks:**
   - FastText is a simple model and may not perform as well on tasks that require a deep understanding of context, syntax, or intricate linguistic structures.
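
The lack of sequence information can be partly offset with word n-grams, which fastText exposes through the `wordNgrams` argument of `train_supervised`. The sketch below uses hypothetical reviews and plain Python sets (fastText itself hashes n-grams into buckets) to show that bigrams separate two texts that unigrams cannot tell apart.

```python
def word_ngrams(text, n=2):
    """Return the set of contiguous word n-grams in a whitespace-tokenised text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Hypothetical reviews that share all unigrams but differ in meaning
review_a = "great cast but weak plot"
review_b = "weak cast but great plot"

unigrams_equal = word_ngrams(review_a, 1) == word_ngrams(review_b, 1)
bigrams_equal = word_ngrams(review_a, 2) == word_ngrams(review_b, 2)
print(unigrams_equal, bigrams_equal)
```

Retraining with `wordNgrams=2` would be a cheap experiment to check whether local word order improves the accuracy reported above.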

In summary, FastText is a suitable choice for sentiment analysis tasks, especially when efficiency and speed are crucial. Its ability to handle subword information is advantageous, but it may lack the depth of understanding required for more complex natural language processing tasks.

In [14]:
# Import necessary additional libraries
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset, random_split

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Tokenize and encode the training data
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512, return_tensors='pt')
train_labels = torch.tensor(train_labels)

# Tokenize and encode the testing data
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=512, return_tensors='pt')
test_labels = torch.tensor(test_labels)

# Create DataLoader for training and testing data
train_dataset = TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], train_labels)
test_dataset = TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'], test_labels)

# Split the training dataset into training and validation sets
train_size = int(0.8 * len(train_dataset))
val_size = len(train_dataset) - train_size
train_dataset, val_dataset = random_split(train_dataset, [train_size, val_size])

# Create DataLoader for training, validation, and testing
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

# Set up optimizer and training parameters
optimizer = AdamW(model.parameters(), lr=2e-5)
epochs = 3

# Training loop
for epoch in range(epochs):
    model.train()
    for batch in train_loader:
        input_ids, attention_mask, labels = batch
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

# Evaluation on the validation set
model.eval()
val_predictions = []
val_true_labels = []
with torch.no_grad():
    for batch in val_loader:
        input_ids, attention_mask, labels = batch
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=1)
        val_predictions.extend(predictions.cpu().numpy())
        val_true_labels.extend(labels.cpu().numpy())

# Evaluation on the test set
test_predictions = []
test_true_labels = []
with torch.no_grad():
    for batch in test_loader:
        input_ids, attention_mask, labels = batch
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=1)
        test_predictions.extend(predictions.cpu().numpy())
        test_true_labels.extend(labels.cpu().numpy())

# Evaluate predictions
val_accuracy = accuracy_score(val_true_labels, val_predictions)
test_accuracy = accuracy_score(test_true_labels, test_predictions)

print(f"Validation Accuracy: {val_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

print("\nClassification Report (Test Set):")
print(classification_report(test_true_labels, test_predictions))

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


KeyboardInterrupt: 

In [17]:
# Import necessary additional libraries
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset, random_split
from tqdm import tqdm
import matplotlib.pyplot as plt

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Tokenize and encode the training data
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512, return_tensors='pt')
train_labels = torch.as_tensor(train_labels)  # also works if train_labels is already a tensor from an earlier run

# Tokenize and encode the testing data
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=512, return_tensors='pt')
test_labels = torch.as_tensor(test_labels)  # also works if test_labels is already a tensor from an earlier run

# Create DataLoader for training and testing data
train_dataset = TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], train_labels)
test_dataset = TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'], test_labels)

# Split the training dataset into training and validation sets
train_size = int(0.8 * len(train_dataset))
val_size = len(train_dataset) - train_size
train_dataset, val_dataset = random_split(train_dataset, [train_size, val_size])

# Create DataLoader for training, validation, and testing
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

# Set up optimizer and training parameters
optimizer = AdamW(model.parameters(), lr=2e-5)
epochs = 3

# Training loop
for epoch in range(epochs):
    model.train()
    total_loss = 0
    correct_predictions = 0
    total_samples = 0
    
    # Use tqdm for the training loop
    with tqdm(train_loader, desc=f"Epoch {epoch + 1}/{epochs}", unit="batch") as train_pbar:
        for batch in train_pbar:
            input_ids, attention_mask, labels = batch
            optimizer.zero_grad()
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()

            # Backward pass and optimization
            loss.backward()
            optimizer.step()

            # Track correct predictions for accuracy calculation
            predictions = torch.argmax(outputs.logits, dim=1)
            correct_predictions += torch.sum(predictions == labels).item()
            total_samples += labels.size(0)

            # Update tqdm progress bar description
            train_pbar.set_postfix(loss=loss.item())

    # Calculate training accuracy and loss
    train_accuracy = correct_predictions / total_samples
    average_train_loss = total_loss / len(train_loader)

    # Evaluation on the validation set
    model.eval()
    val_loss = 0
    val_correct_predictions = 0
    val_total_samples = 0
    
    with torch.no_grad():
        for batch in val_loader:
            input_ids, attention_mask, labels = batch
            # Pass the labels so that outputs.loss is populated (it is None otherwise)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            logits = outputs.logits
            loss = outputs.loss
            val_loss += loss.item()

            # Track correct predictions for accuracy calculation
            predictions = torch.argmax(logits, dim=1)
            val_correct_predictions += torch.sum(predictions == labels).item()
            val_total_samples += labels.size(0)

    # Calculate validation accuracy and loss
    val_accuracy = val_correct_predictions / val_total_samples
    average_val_loss = val_loss / len(val_loader)

    # Print progress
    print(f"Epoch {epoch + 1}/{epochs} - "
          f"Training Loss: {average_train_loss:.4f}, Training Accuracy: {train_accuracy:.4f}, "
          f"Validation Loss: {average_val_loss:.4f}, Validation Accuracy: {val_accuracy:.4f}")

# Evaluation on the test set
test_predictions = []
test_true_labels = []
with torch.no_grad():
    for batch in test_loader:
        input_ids, attention_mask, labels = batch
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=1)
        test_predictions.extend(predictions.cpu().numpy())
        test_true_labels.extend(labels.cpu().numpy())

# Evaluate predictions
test_accuracy = accuracy_score(test_true_labels, test_predictions)
print(f"Test Accuracy: {test_accuracy:.4f}")
print("\nClassification Report (Test Set):")
print(classification_report(test_true_labels, test_predictions))

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1/3:   2%|▏         | 54/2500 [07:22<5:33:53,  8.19s/batch, loss=0.535]


KeyboardInterrupt: 
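
The interrupted run above reports 2500 batches per epoch at roughly 8.19 s per batch. A quick back-of-the-envelope estimate shows why letting it finish was not practical. The sketch below uses only the two figures from the progress bar and the `epochs = 3` setting; `format_duration` is a hypothetical helper introduced for illustration. The batch count is consistent with 20,000 training reviews at batch size 8, which is an inference from the numbers rather than something the output states.

```python
# Figures read off the interrupted progress bar above.
batches_per_epoch = 2500   # total shown in the bar ("54/2500")
seconds_per_batch = 8.19   # rate reported at the moment of the interrupt
epochs = 3                 # value set in the training cell


def format_duration(seconds: float) -> str:
    """Format a duration in seconds as hours and minutes (hypothetical helper)."""
    total_minutes = round(seconds / 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


# Time for one pass over the training batches, ignoring validation.
epoch_seconds = batches_per_epoch * seconds_per_batch

print("per epoch: ", format_duration(epoch_seconds))           # 5h 41m
print("all epochs:", format_duration(epoch_seconds * epochs))  # 17h 04m
```

A rate of several seconds per batch is typical of CPU execution, so it is probably worth checking `torch.cuda.is_available()` and enabling a GPU runtime before rerunning. This is an assumption, since the excerpt does not show which device was used.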

In [None]:
# Import necessary additional libraries
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Load pre-trained RoBERTa model and tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

# Tokenize and encode the training data
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512, return_tensors='pt')
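# as_tensor avoids the copy-construct UserWarning if the labels are already tensors from an earlier cell
train_labels = torch.as_tensor(train_labels)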

# Tokenize and encode the testing data
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=512, return_tensors='pt')
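# as_tensor avoids the copy-construct UserWarning if the labels are already tensors from an earlier cell
test_labels = torch.as_tensor(test_labels)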

# Create DataLoader for training and testing data
train_dataset = TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], train_labels)
test_dataset = TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'], test_labels)

# Split the training dataset into training and validation sets
train_size = int(0.8 * len(train_dataset))
val_size = len(train_dataset) - train_size
train_dataset, val_dataset = random_split(train_dataset, [train_size, val_size])

# Create DataLoader for training, validation, and testing
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

# Use a GPU when one is available; everything falls back to CPU otherwise
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Set up optimizer and training parameters
optimizer = AdamW(model.parameters(), lr=2e-5)
epochs = 3

# Training loop
for epoch in range(epochs):
    model.train()
    for batch in train_loader:
        # Move the batch to the same device as the model
        input_ids, attention_mask, labels = (t.to(device) for t in batch)
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

# Evaluation on the validation set
model.eval()
val_predictions = []
val_true_labels = []
with torch.no_grad():
    for batch in val_loader:
        input_ids, attention_mask, labels = (t.to(device) for t in batch)
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=1)
        val_predictions.extend(predictions.cpu().numpy())
        val_true_labels.extend(labels.cpu().numpy())

# Evaluation on the test set
test_predictions = []
test_true_labels = []
with torch.no_grad():
    for batch in test_loader:
        input_ids, attention_mask, labels = (t.to(device) for t in batch)
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=1)
        test_predictions.extend(predictions.cpu().numpy())
        test_true_labels.extend(labels.cpu().numpy())

# Evaluate predictions
val_accuracy = accuracy_score(val_true_labels, val_predictions)
test_accuracy = accuracy_score(test_true_labels, test_predictions)

print(f"Validation Accuracy: {val_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

print("\nClassification Report (Test Set):")
print(classification_report(test_true_labels, test_predictions))

ConnectionError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Read timed out.
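
The `ConnectionError` above is a read timeout while downloading the RoBERTa weights, which may well be transient. One option is to wrap the `from_pretrained` calls in a small retry helper with exponential backoff. The sketch below is an assumption about how this could be done, not part of the original notebook: `retry_with_backoff` and `flaky_load` are hypothetical names, and the simulated loader stands in for the real download so the example runs offline.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    load: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call load() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return load()
        except OSError:
            # Network errors from requests and the built-in ConnectionError
            # are both subclasses of OSError.
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Simulated loader (hypothetical): fails twice, then succeeds.
calls = {"count": 0}


def flaky_load() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated read timeout")
    return "loaded"


# Record the delays instead of actually sleeping so the demo runs instantly.
delays: List[float] = []
result = retry_with_backoff(flaky_load, sleep=delays.append)
print(result, delays)  # loaded [2.0, 4.0]
```

In the notebook this could be used as `model = retry_with_backoff(lambda: RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2))`, and likewise for the tokenizer. Files that finished downloading should be served from the local cache on the next attempt.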