In this assignment you will practice training and testing SVM models.

1) Use the new SST dataset (check the accompanying README for more details).

2) Provide a description of the training data using descriptive statistics: How many messages are in the data? How many values do the labels take? What is the distribution of the messages across the classes? What is the average number of tokens per message?

3) Using FeatureUnion() and Pipeline(), develop 3 different classifiers that use the coarse-grained labels as target classes. You can combine different features (word and character n-grams) and different vectorisation methods (Tfidf or Count). Test your linear SVMs against the validation/development data of SST. Select the one with the best scores and apply it to the SST test data.

4) Apply your model to the DH_CollectingData2022_review.tsv. What is your performance? 

5) Save your best model using pickle. 


NOTE: if you have problems running the model on your machine, use Google Colab.



In [167]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC, SVC
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.pipeline import Pipeline, FeatureUnion
from nltk.corpus import stopwords

import pickle
import csv

import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/jo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Task 2

In [168]:
train_data = pd.read_csv('SST/stsa.fine.train.converted.csv', sep=',')
train_data.head()

Unnamed: 0,label_fine_grained,label_coarse_grained,sentence
0,4,1,"a stirring , funny and finally transporting re..."
1,1,-1,apparently reassembled from the cutting-room f...
2,1,-1,they presume their audience wo n't sit still f...
3,2,0,the entire movie is filled with deja vu moments .
4,3,1,this is a visually stunning rumination on love...


In [169]:
# Count the number of messages in the train dataset
num_messages = len(train_data)
num_messages

5860

In [170]:
# Count the number of unique fine-grained labels in the train dataset
num_fine_grained = train_data['label_fine_grained'].nunique()
num_fine_grained

5

In [171]:
# Count the number of messages in each fine-grained label class in the train dataset
class_counts_fine_grained = train_data['label_fine_grained'].value_counts()
class_counts_fine_grained

3    1607
1    1515
2    1117
4     860
0     761
Name: label_fine_grained, dtype: int64

In [172]:
# Count the number of unique coarse-grained labels in the train dataset
num_coarse_grained = train_data['label_coarse_grained'].nunique()
num_coarse_grained

3

In [173]:
# Count the number of messages in each coarse-grained label class in the train dataset
class_counts_coarse_grained = train_data['label_coarse_grained'].value_counts()
class_counts_coarse_grained

 1    2467
-1    2276
 0    1117
Name: label_coarse_grained, dtype: int64

In [174]:
# Tokenize the sentences in the train dataset and count the number of tokens in each message
train_data['tokens'] = train_data['sentence'].apply(nltk.word_tokenize)
train_data['num_tokens'] = train_data['tokens'].apply(len)

# Calculate the average number of tokens per message in the train dataset
avg_tokens = train_data['num_tokens'].mean()
avg_tokens

19.167235494880547
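
The coarse-grained counts reported above can also be expressed as proportions, which makes the class imbalance easier to see. The following is a minimal sketch that hard-codes the counts from the output above; the variable names are illustrative and not part of the original notebook.

```python
# Coarse-grained class counts copied from the value_counts() output above
coarse_counts = {1: 2467, -1: 2276, 0: 1117}

# Total number of training messages (should match len(train_data))
total = sum(coarse_counts.values())

# Share of each class in the training data
coarse_shares = {label: count / total for label, count in coarse_counts.items()}

for label, share in coarse_shares.items():
    print(f"label {label:>2}: {share:.1%}")
```

On the DataFrame itself, `train_data['label_coarse_grained'].value_counts(normalize=True)` should give the same proportions without hard-coding the counts.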

### Task 3

In [175]:
dev_data = pd.read_csv('SST/stsa.fine.dev.converted.csv', sep=',')
dev_data.head()

Unnamed: 0,label_fine_grained,label_coarse_grained,sentence
0,2,0,"in his first stab at the form , jacquot takes ..."
1,1,-1,one long string of cliches .
2,1,-1,if you 've ever entertained the notion of doin...
3,0,-1,k-19 exploits our substantial collective fear ...
4,1,-1,it 's played in the most straight-faced fashio...


In [176]:
# Extract sentences and labels from development data

devSentences = dev_data['sentence']
devLabels = dev_data['label_coarse_grained']

In [177]:
test_data = pd.read_csv('SST/stsa.fine.test.converted.csv')
test_data.head()

Unnamed: 0,label_fine_grained,label_coarse_grained,sentence
0,1,-1,"no movement , no yuks , not much of anything ."
1,0,-1,"a gob of drivel so sickly sweet , even the eag..."
2,2,0,` how many more voyages can this limping but d...
3,2,0,so relentlessly wholesome it made me want to s...
4,0,-1,"gangs of new york is an unapologetic mess , wh..."


In [178]:
# Extract sentences and labels from test data

testSentences = test_data['sentence']
testLabels = test_data['label_coarse_grained']

In [179]:
# Extract sentences and labels from training data
trainSentences = train_data['sentence'] 
trainLabels = train_data['label_coarse_grained']

#### Model 1 

In [180]:
# Define feature union that combines CountVectorizer for word and character n-grams
vectorizer_union_01 = FeatureUnion([('cnt_word', CountVectorizer(stop_words='english')),  # CountVectorizer for word n-grams
                                    ('cnt_char', CountVectorizer(analyzer='char', ngram_range=(1, 2)))  # CountVectorizer for character 1- and 2-grams
                               ])

# Define pipeline that sequentially applies feature union and the SVM classifier with Linear Kernel
svm_pipeline_01 = Pipeline([
            ('vectorize', vectorizer_union_01),  # FeatureUnion transformer
            ('classify', SVC(random_state=50, kernel='linear')) # SVM classifier with Linear Kernel
            ])  

# Fit the pipeline on the training data to learn the mapping between features and labels
svm_pipeline_01.fit(trainSentences, trainLabels)

# Predict labels for the development set using the trained pipeline
prediction_01 = svm_pipeline_01.predict(devSentences)
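
To make the role of the feature union more concrete: FeatureUnion fits each vectorizer on the same input and stacks the resulting feature matrices side by side, so the classifier sees one wide matrix. The sketch below checks this on two made-up sentences; `toy_docs` and the other names are hypothetical and only serve as an illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical toy sentences, not taken from SST
toy_docs = ["a funny movie", "a dull movie"]

# Same vectorizer settings as in Model 1
word_vec = CountVectorizer(stop_words='english')
char_vec = CountVectorizer(analyzer='char', ngram_range=(1, 2))
union = FeatureUnion([('cnt_word', CountVectorizer(stop_words='english')),
                      ('cnt_char', CountVectorizer(analyzer='char', ngram_range=(1, 2)))])

# Number of columns produced by each vectorizer on its own and by the union
n_word = word_vec.fit_transform(toy_docs).shape[1]
n_char = char_vec.fit_transform(toy_docs).shape[1]
n_union = union.fit_transform(toy_docs).shape[1]

print(n_word, n_char, n_union)
```

The union should have exactly as many columns as the word and character vectorizers together, which is why mixing Count and Tfidf features in one union (as in Model 3) is technically straightforward.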

In [181]:
# The pipeline was already fitted in the previous cell; fit() returns the pipeline itself,
# so it can be reused here without training it a second time
trained_model_feature_union1 = svm_pipeline_01

In [182]:
# Predict labels for development set using Model 1
dev_prediction1 = trained_model_feature_union1.predict(devSentences)

In [183]:
# Print classification report for Model 1 on development set

print("PREDICTION ON DEVELOPMENT")
print("Model 1:")
print(classification_report(devLabels, dev_prediction1))

PREDICTION ON DEVELOPMENT
Model 1:
              precision    recall  f1-score   support

          -1       0.57      0.62      0.59       428
           0       0.26      0.21      0.23       229
           1       0.63      0.64      0.63       444

    accuracy                           0.54      1101
   macro avg       0.49      0.49      0.49      1101
weighted avg       0.53      0.54      0.53      1101



In [184]:
# Predict labels for test set using Model 1
test_prediction1 = trained_model_feature_union1.predict(testSentences)

In [185]:
# Print classification report for Model 1 on test set

print("PREDICTION ON TEST")
print("Model 1:")
print(classification_report(testLabels, test_prediction1))

PREDICTION ON TEST
Model 1:
              precision    recall  f1-score   support

          -1       0.61      0.62      0.61       912
           0       0.20      0.19      0.19       389
           1       0.67      0.68      0.67       909

    accuracy                           0.57      2210
   macro avg       0.49      0.49      0.49      2210
weighted avg       0.56      0.57      0.56      2210



In [186]:
# Calculate accuracy, macro F1, and weighted F1 scores for Model 1 on test set

accuracy_1 = accuracy_score(testLabels, test_prediction1)
macro_f1_1 = f1_score(testLabels, test_prediction1, average='macro')
weighted_f1_1 = f1_score(testLabels, test_prediction1, average='weighted')

# Print Model 1 scores

print("Model 1 scores:")
print("Accuracy:", accuracy_1)
print("Macro F1:", macro_f1_1)
print("Weighted F1:", weighted_f1_1)

Model 1 scores:
Accuracy: 0.5660633484162896
Macro F1: 0.49254851081937323
Weighted F1: 0.5638487037126336


#### Model 2

In [187]:
# Define feature union and pipeline
vectorizer_union_02 = FeatureUnion([('tfidf_word', TfidfVectorizer(stop_words='english')),
                                    ('tfidf_char', TfidfVectorizer(analyzer='char', ngram_range=(1, 2)))])
svm_pipeline_02 = Pipeline([
            ('vectorize', vectorizer_union_02),
            ('classify', SVC(random_state=50, kernel='linear'))
            ])

# Train the model and make predictions on development set
trained_model_feature_union2 = svm_pipeline_02.fit(trainSentences, trainLabels)
dev_prediction2 = trained_model_feature_union2.predict(devSentences)

In [188]:
# Print classification report for development set

print("PREDICTION ON DEVELOPMENT")
print("Model 2:")
print(classification_report(devLabels, dev_prediction2))

PREDICTION ON DEVELOPMENT
Model 2:
              precision    recall  f1-score   support

          -1       0.61      0.69      0.65       428
           0       0.31      0.13      0.18       229
           1       0.63      0.74      0.68       444

    accuracy                           0.59      1101
   macro avg       0.52      0.52      0.50      1101
weighted avg       0.55      0.59      0.56      1101



In [189]:
# Make predictions on test set and print classification report and scores

test_prediction2 = trained_model_feature_union2.predict(testSentences)

print("PREDICTION ON TEST")
print("Model 2:")
print(classification_report(testLabels, test_prediction2))

PREDICTION ON TEST
Model 2:
              precision    recall  f1-score   support

          -1       0.65      0.72      0.69       912
           0       0.26      0.10      0.15       389
           1       0.67      0.78      0.72       909

    accuracy                           0.64      2210
   macro avg       0.53      0.53      0.52      2210
weighted avg       0.59      0.64      0.61      2210



In [190]:
# Calculate accuracy, macro F1, and weighted F1 scores for Model 2

accuracy_2 = accuracy_score(testLabels, test_prediction2)
macro_f1_2 = f1_score(testLabels, test_prediction2, average='macro')
weighted_f1_2 = f1_score(testLabels, test_prediction2, average='weighted')

print("Model 2 scores:")
print("Accuracy:", accuracy_2)
print("Macro F1:", macro_f1_2)
print("Weighted F1:", weighted_f1_2)

Model 2 scores:
Accuracy: 0.6357466063348416
Macro F1: 0.518419290691118
Weighted F1: 0.6059621637227968


#### Model 3

In [191]:
# Define feature union and pipeline
vectorizer_union_03 = FeatureUnion([('cnt_word', CountVectorizer(stop_words='english')),
                               ('tfidf_char', TfidfVectorizer(analyzer='char', ngram_range=(1, 2)))
                               ])

svm_pipeline_03 = Pipeline([
            ('vectorize', vectorizer_union_03),
            ('classify', SVC(random_state=50, kernel='linear'))
            ])

trained_model_feature_union3 = svm_pipeline_03.fit(trainSentences, trainLabels)
dev_prediction3 = trained_model_feature_union3.predict(devSentences)

In [192]:
print("PREDICTION ON DEVELOPMENT")
print("Model 3:")
print(classification_report(devLabels, dev_prediction3))

PREDICTION ON DEVELOPMENT
Model 3:
              precision    recall  f1-score   support

          -1       0.58      0.64      0.61       428
           0       0.25      0.17      0.20       229
           1       0.62      0.66      0.64       444

    accuracy                           0.55      1101
   macro avg       0.48      0.49      0.48      1101
weighted avg       0.53      0.55      0.54      1101



In [193]:
# Predict on the test set and print classification report and scores

test_prediction3 = trained_model_feature_union3.predict(testSentences)

In [194]:
print("PREDICTION ON TEST")
print("Model 3:")
print(classification_report(testLabels, test_prediction3))

PREDICTION ON TEST
Model 3:
              precision    recall  f1-score   support

          -1       0.62      0.66      0.64       912
           0       0.22      0.19      0.20       389
           1       0.68      0.68      0.68       909

    accuracy                           0.58      2210
   macro avg       0.51      0.51      0.51      2210
weighted avg       0.58      0.58      0.58      2210



In [195]:
# Calculate accuracy, macro F1, and weighted F1 scores for Model 3
accuracy_3 = accuracy_score(testLabels, test_prediction3)
macro_f1_3 = f1_score(testLabels, test_prediction3, average='macro')
weighted_f1_3 = f1_score(testLabels, test_prediction3, average='weighted')

print("Model 3 scores:")
print("Accuracy:", accuracy_3)
print("Macro F1:", macro_f1_3)
print("Weighted F1:", weighted_f1_3)

Model 3 scores:
Accuracy: 0.5846153846153846
Macro F1: 0.5073185463780766
Weighted F1: 0.5796806609875855


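On the development set, Model 2 has the highest accuracy (0.59), macro F1 (0.50), and weighted F1 (0.56). It also scores highest on the test set (accuracy 0.64, macro F1 0.52, weighted F1 0.61). Model 2 is therefore selected as the best model.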
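
Model selection is normally done on the development set, so that the test set stays untouched until the final evaluation. A minimal sketch of that step follows, using the rounded development scores from the reports above and assuming macro F1 as the selection criterion (an illustrative choice, not one prescribed by the assignment).

```python
# Rounded development-set scores copied from the classification reports above
dev_scores = {
    "Model 1": {"accuracy": 0.54, "macro_f1": 0.49, "weighted_f1": 0.53},
    "Model 2": {"accuracy": 0.59, "macro_f1": 0.50, "weighted_f1": 0.56},
    "Model 3": {"accuracy": 0.55, "macro_f1": 0.48, "weighted_f1": 0.54},
}

# Pick the model with the highest macro F1 on the development data
best_model_name = max(dev_scores, key=lambda name: dev_scores[name]["macro_f1"])

print("Selected:", best_model_name)
```

With these numbers the choice does not depend on the criterion: Model 2 is also highest on accuracy and weighted F1.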

### Task 4

In [196]:
# Load the data
dh_data = pd.read_csv('DH_CollectingData2022_review.tsv', sep = '\t', header = None, quoting = csv.QUOTE_NONE)
dh_data.head()

Unnamed: 0,0,1
0,"For Nik, he only wants to silence the cacophon...",0.0
1,"""I can play this two ways",0.0
2,"Mild, because it isn't conclusive, and doesn't...",-1.0
3,You can also get some more information about t...,0.0
4,"Soon, Hero, who has never had friends, is thru...",0.0


In [197]:
dh_data = dh_data.rename(columns={0: 'sentence', 1: 'sentiment'})
dh_data.head()

Unnamed: 0,sentence,sentiment
0,"For Nik, he only wants to silence the cacophon...",0.0
1,"""I can play this two ways",0.0
2,"Mild, because it isn't conclusive, and doesn't...",-1.0
3,You can also get some more information about t...,0.0
4,"Soon, Hero, who has never had friends, is thru...",0.0


In [198]:
# Drop rows with missing values before extracting sentences and labels

dh_data = dh_data.dropna()

# Extract the sentences and labels from the data
dhSentences = dh_data['sentence']
dhLabels = dh_data['sentiment']

# Use the trained model to predict sentiment for the dhSentences
dh_predictions = trained_model_feature_union2.predict(dhSentences)

# Print the classification report for the dh data
print(classification_report(dhLabels, dh_predictions))

              precision    recall  f1-score   support

        -1.0       0.23      0.74      0.36        62
         0.0       0.50      0.10      0.17       146
         1.0       0.61      0.55      0.58       180

    accuracy                           0.41       388
   macro avg       0.45      0.46      0.37       388
weighted avg       0.51      0.41      0.39       388



The classification report shows that the model transfers poorly to this dataset. Accuracy is 0.41, so only 41% of the predictions are correct, compared with 0.64 on the SST test set. The macro-averaged F1 is 0.37 and the weighted F1 is 0.39. The errors are uneven across classes: the negative class (-1) has high recall (0.74) but low precision (0.23), which suggests that the model over-predicts negative sentiment; the neutral class (0) is rarely recovered (recall 0.10); and the positive class (1) is handled best (F1 0.58).
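
The difference between the macro and weighted averages in the report can be reproduced by hand. The sketch below recomputes both from the rounded per-class F1 scores and supports shown above, so the results are approximate; the variable names are illustrative.

```python
# Per-class F1 and support copied (rounded) from the classification report above
per_class = {
    -1: {"f1": 0.36, "support": 62},
    0: {"f1": 0.17, "support": 146},
    1: {"f1": 0.58, "support": 180},
}

# Macro F1: plain mean over classes, every class counts equally
macro_f1 = sum(c["f1"] for c in per_class.values()) / len(per_class)

# Weighted F1: mean over classes weighted by the number of true instances
total_support = sum(c["support"] for c in per_class.values())
weighted_f1 = sum(c["f1"] * c["support"] for c in per_class.values()) / total_support

print("Macro F1:", round(macro_f1, 2))
print("Weighted F1:", round(weighted_f1, 2))
```

The weighted score is slightly higher because the best-handled class (1) is also the largest one in this dataset.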

### Task 5 

In [199]:
# Save the best model (Model 2, selected in Task 3) to a file
with open('best_model.pkl', 'wb') as file:
    pickle.dump(trained_model_feature_union2, file)
    
# Load the saved model from a file
with open('best_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)
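
A quick way to check that pickling worked is to confirm that the restored model gives the same predictions as the original. The sketch below does this on a tiny made-up dataset; the toy sentences and labels are hypothetical, and it uses `LinearSVC` rather than `SVC(kernel='linear')` purely to keep the example small.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, not taken from SST
toy_sentences = ["a wonderful film", "a terrible film", "wonderful acting", "terrible plot"]
toy_labels = [1, -1, 1, -1]

# Small stand-in pipeline with the same step names as in the notebook
toy_pipeline = Pipeline([
    ('vectorize', TfidfVectorizer()),
    ('classify', LinearSVC(random_state=50)),
])
toy_pipeline.fit(toy_sentences, toy_labels)

# Round trip through pickle (in memory instead of a file)
restored = pickle.loads(pickle.dumps(toy_pipeline))

# The restored pipeline should predict exactly what the original predicts
same_predictions = list(toy_pipeline.predict(toy_sentences)) == list(restored.predict(toy_sentences))
print("Predictions identical after reload:", same_predictions)
```

In this notebook, the equivalent check would compare `loaded_model.predict(devSentences)` with the predictions of the model that was saved.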