# Assignment for Topic 9

For this assignment, you must first download <a href="http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip" target="_blank">__*Sentiment-Analysis-Dataset.zip*__</a> and extract it into your Google Drive. The extracted file *Sentiment Analysis Dataset.csv* contains 1578627 tweets labeled as negative (class 0) or positive (class 1).
<br>
<br>
**Task 1**

Load the first 50 000 tweets from the file and split the data 80:20 into training and test sets with stratification.

Load the `sentence-transformers/all-mpnet-base-v2` model and use it to train (on the training set) and evaluate (on the test set) the following classifiers:
* "Classification without classifier";
* "Zero-Shot Classification" (try to find useful texts as your labels for embedding);
* Supervised classification with Logistic Regression on top;
* Supervised classification with Support Vector Machine on top.

Now load the `cardiffnlp/twitter-roberta-base-sentiment-latest` model and use it as yet another classifier to evaluate on the test set.
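This checkpoint predicts three classes (negative, neutral, positive), while the dataset has only two. One possible way to get binary predictions is to compare the positive and negative scores and ignore the neutral one. The sketch below assumes the per-label scores arrive as one list of `{"label", "score"}` dictionaries per text; the helper name `to_binary` and the toy scores are hypothetical.

```python
def to_binary(all_scores):
    """Map per-label scores to 0 (negative) or 1 (positive), ignoring "neutral"."""
    labels = []
    for scores in all_scores:
        # Index the scores of one text by lower-cased label name.
        by_label = {s["label"].lower(): s["score"] for s in scores}
        labels.append(1 if by_label["positive"] > by_label["negative"] else 0)
    return labels


# Toy input with the assumed shape: one list of label/score dicts per text.
toy_scores = [
    [{"label": "negative", "score": 0.10},
     {"label": "neutral", "score": 0.60},
     {"label": "positive", "score": 0.30}],
    [{"label": "negative", "score": 0.45},
     {"label": "neutral", "score": 0.50},
     {"label": "positive", "score": 0.05}],
]
binary_preds = to_binary(toy_scores)
print(binary_preds)  # [1, 0]
```

With the real model, the scores for all labels could be requested from a `transformers` text-classification pipeline by passing `top_k=None` when calling it, assuming the installed version supports that argument. Without such a mapping, every tweet whose top label is "neutral" ends up in one of the two classes by default.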

Now train and evaluate the following two models:
* Logistic Regression with TF-IDF vectorization;
* Support Vector Machine with TF-IDF vectorization.

Additional notes:
* In total you will have 7 classifiers.
* The evaluation should show at least the F1 measure for each class separately as well as the macro average. You may also report other measures.
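To make the required measures concrete, here is a minimal sketch of how per-class F1 and its macro average are computed from predictions. The function name `f1_per_class` and the toy labels are hypothetical. In practice, `sklearn.metrics.f1_score` with `average=None` and `average='macro'`, or `classification_report`, should report the same values.

```python
import numpy as np


def f1_per_class(y_true, y_pred, classes=(0, 1)):
    """Return F1 for each class plus the unweighted (macro) average."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = {}
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores[c] = float(2 * tp / denom) if denom > 0 else 0.0
    scores["macro"] = float(np.mean([scores[c] for c in classes]))
    return scores


toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 1, 0, 0]
report = f1_per_class(toy_true, toy_pred)
print(report)
```

The macro average weights both classes equally, so it does not hide poor performance on the smaller class the way accuracy can.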

Finally, write conclusions about all the results you got.
<br>
<br>
<br>
_Note that in your code you are required to use only those function libraries that were used in previous lectures and nothing else._

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from google.colab import drive
 

drive.mount('/content/drive')

file_path = "/content/drive/My Drive/Colab Notebooks/Sentiment-Analysis-Dataset/Sentiment Analysis Dataset.csv"

# Handle CSV errors by skipping bad lines
df = pd.read_csv(file_path, encoding='latin-1', on_bad_lines='skip')

# Ensure correct columns
expected_columns = ["ItemID", "Sentiment", "SentimentSource", "SentimentText"]  #Example column names
df.columns = expected_columns[:len(df.columns)]  #Adjust column names dynamically if needed

# Select first 50,000 rows & required columns
df = df.iloc[:50000, [1, 3]]  # Assuming column 1 = sentiment, column 3 = text
df.columns = ["label", "text"]

# Convert labels to integer type
df["label"] = df["label"].astype(int)

#Split dataset with stratification
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

#Load Sentence Transformer Model
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

#Encode text data
X_train_embeddings = model.encode(X_train.tolist(), convert_to_numpy=True)
X_test_embeddings = model.encode(X_test.tolist(), convert_to_numpy=True)

#Classification without Classifier (Cosine Similarity)
avg_positive_embedding = np.mean(X_train_embeddings[np.where(y_train == 1)], axis=0)
avg_negative_embedding = np.mean(X_train_embeddings[np.where(y_train == 0)], axis=0)

cosine_sim_pos = cosine_similarity(X_test_embeddings, avg_positive_embedding.reshape(1, -1))
cosine_sim_neg = cosine_similarity(X_test_embeddings, avg_negative_embedding.reshape(1, -1))

preds_without_classifier = (cosine_sim_pos > cosine_sim_neg).astype(int).flatten()
print("✅ Accuracy (without classifier):", accuracy_score(y_test, preds_without_classifier))

#Logistic Regression Classifier
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_embeddings, y_train)
y_pred_lr = lr.predict(X_test_embeddings)
print("✅ Accuracy (Logistic Regression):", accuracy_score(y_test, y_pred_lr))

#Support Vector Machine Classifier
svm = SVC()
svm.fit(X_train_embeddings, y_train)
y_pred_svm = svm.predict(X_test_embeddings)
print("✅ Accuracy (SVM):", accuracy_score(y_test, y_pred_svm))



In [None]:
# insert your code here
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline
from google.colab import drive


drive.mount('/content/drive')


file_path = "/content/drive/My Drive/Colab Notebooks/Sentiment-Analysis-Dataset/Sentiment Analysis Dataset.csv"

#Handle CSV errors
df = pd.read_csv(file_path, encoding='latin-1', on_bad_lines='skip')

#Ensure correct columns
expected_columns = ["ItemID", "Sentiment", "SentimentSource", "SentimentText"]
df.columns = expected_columns[:len(df.columns)]  # Adjust dynamically if needed

#Select first 50,000 rows & required columns
df = df.iloc[:50000, [1, 3]]  # Assuming column 1 = sentiment, column 3 = text
df.columns = ["label", "text"]
df["label"] = df["label"].astype(int)


X_train, X_test, y_train, y_test = train_test_split( #Split dataset
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

#Load Sentence Transformer Model
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

#Encode text data
X_train_embeddings = model.encode(X_train.tolist(), convert_to_numpy=True)
X_test_embeddings = model.encode(X_test.tolist(), convert_to_numpy=True)

#Classification without Classifier (Cosine Similarity)
avg_positive_embedding = np.mean(X_train_embeddings[np.where(y_train == 1)], axis=0)
avg_negative_embedding = np.mean(X_train_embeddings[np.where(y_train == 0)], axis=0)

cosine_sim_pos = cosine_similarity(X_test_embeddings, avg_positive_embedding.reshape(1, -1))
cosine_sim_neg = cosine_similarity(X_test_embeddings, avg_negative_embedding.reshape(1, -1))

preds_without_classifier = (cosine_sim_pos > cosine_sim_neg).astype(int).flatten()
print("Accuracy (Cosine Similarity Classifier):", accuracy_score(y_test, preds_without_classifier))
print(classification_report(y_test, preds_without_classifier))

#Logistic Regression Classifier (Sentence Transformer Embeddings)
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_embeddings, y_train)
y_pred_lr = lr.predict(X_test_embeddings)
print("Accuracy (Logistic Regression - Sentence Transformer):", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))

#Support Vector Machine Classifier (Sentence Transformer Embeddings)
svm = SVC()
svm.fit(X_train_embeddings, y_train)
y_pred_svm = svm.predict(X_test_embeddings)
print("Accuracy (SVM - Sentence Transformer):", accuracy_score(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))

#TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=5000)  # Limit to 5000 features
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

#Logistic Regression Classifier (TF-IDF)
lr_tfidf = LogisticRegression(max_iter=1000)
lr_tfidf.fit(X_train_tfidf, y_train)
y_pred_lr_tfidf = lr_tfidf.predict(X_test_tfidf)
print("Accuracy (Logistic Regression - TF-IDF):", accuracy_score(y_test, y_pred_lr_tfidf))
print(classification_report(y_test, y_pred_lr_tfidf))

#Support Vector Machine Classifier (TF-IDF)
svm_tfidf = SVC()
svm_tfidf.fit(X_train_tfidf, y_train)
y_pred_svm_tfidf = svm_tfidf.predict(X_test_tfidf)
print("Accuracy (SVM - TF-IDF):", accuracy_score(y_test, y_pred_svm_tfidf))
print(classification_report(y_test, y_pred_svm_tfidf))

#Load Roberta Model (Pre-trained Sentiment Classifier)
roberta_classifier = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-sentiment-latest")

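# Map RoBERTa outputs to 0 (negative) and 1 (positive).
# This checkpoint is expected to return the labels "negative", "neutral" and "positive";
# "LABEL_2" is also accepted in case the configuration exposes generic label names.
# Note: tweets predicted as "neutral" are counted as negative here.
def roberta_predict(texts):
    preds = roberta_classifier(texts, truncation=True)
    return [1 if p['label'].lower() in ('positive', 'label_2') else 0 for p in preds]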

#Evaluate Roberta Model
y_pred_roberta = roberta_predict(X_test.tolist())
print("Accuracy (Roberta Classifier):", accuracy_score(y_test, y_pred_roberta))
print(classification_report(y_test, y_pred_roberta))

#Summary of F1 Scores
print("\nMacro F1 scores summary:")
print(f"Cosine Similarity: {f1_score(y_test, preds_without_classifier, average='macro')}")
print(f"Logistic Regression (ST): {f1_score(y_test, y_pred_lr, average='macro')}")
print(f"SVM (ST): {f1_score(y_test, y_pred_svm, average='macro')}")
print(f"Logistic Regression (TF-IDF): {f1_score(y_test, y_pred_lr_tfidf, average='macro')}")
print(f"SVM (TF-IDF): {f1_score(y_test, y_pred_svm_tfidf, average='macro')}")
print(f"Roberta Classifier: {f1_score(y_test, y_pred_roberta, average='macro')}")
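The cell above covers six of the seven required classifiers; zero-shot classification is not included. A minimal sketch of that approach follows: each text is assigned to the class whose label description has the most similar embedding. The function name `zero_shot_predict` and the toy 2-D vectors are hypothetical stand-ins for real sentence embeddings.

```python
import numpy as np


def zero_shot_predict(text_embeddings, label_embeddings):
    """Return, for each text, the index of the most cosine-similar label embedding."""
    t = np.asarray(text_embeddings, dtype=float)
    lab = np.asarray(label_embeddings, dtype=float)
    # Normalise rows so that the dot product equals cosine similarity.
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    lab = lab / np.linalg.norm(lab, axis=1, keepdims=True)
    return np.argmax(t @ lab.T, axis=1)


# Toy vectors: row 0 stands for the negative description, row 1 for the positive one.
toy_label_embeddings = np.array([[1.0, 0.0],
                                 [0.0, 1.0]])
toy_text_embeddings = np.array([[0.9, 0.2],
                                [0.1, 0.8],
                                [2.0, 1.0]])
toy_preds = zero_shot_predict(toy_text_embeddings, toy_label_embeddings)
print(toy_preds)  # [0 1 0]
```

With the notebook's variables, the label embeddings could come from something like `model.encode(["A negative tweet", "A positive tweet"])`, and the predictions from `zero_shot_predict(X_test_embeddings, label_embeddings)`. The two label texts are only an assumption and are worth varying, since the task asks for useful label texts. Their order must match the class indices (0 = negative, 1 = positive).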

In [None]:
# insert your code here

**Conclusion**

RoBERTa achieved the highest accuracy, most likely because it was pretrained on Twitter data.

Logistic Regression on Sentence Transformer embeddings performed well, surpassing the TF-IDF models.

The TF-IDF-based models were less effective, with Logistic Regression outperforming SVM.

Classification without a classifier (cosine similarity to the class-average embeddings) was the weakest approach.

Overall, the deep learning approaches gave the largest improvements in sentiment classification.

RoBERTa is the best choice, while Logistic Regression on Sentence Transformer embeddings is a strong alternative for lower-resource settings.




---
**After the tasks are done, submit this file. Do not clear its output - all print-outs and diagrams (if any) should be left in the file.**