# SVM - Climate Sentiment Multiclass Classification
## CS522 Project

Linear SVM on LSA (latent semantic analysis) features

### Dataset: 
https://www.kaggle.com/code/luiskalckstein/climate-sentiment-multiclass-classification

### Imports

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
from sklearn.svm import LinearSVC,SVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from Common.DataCenter import data_center
from Common.LSI import SKLearnLSA
from Common.UtilFuncs import DataSize
from Common.UtilFuncs import print_evaluation, EvaluationToDF
import pandas as pd
from Common.preprocessor import normalize_preprocessing
%matplotlib inline
try:
    %load_ext autotime
except Exception:
    # autotime is not installed yet: install it, then load the extension
    !pip install ipython-autotime
    %load_ext autotime
    
TrainSizeBaseLine = DataSize.GetTrainSizeBaseline()
TrainSizeWithNoisyData = DataSize.GetTrainSizeWithNoisyData()
TestDataSize = DataSize.GetTestDataSize()
NoiseDataSize = DataSize.GetNoiseDataSize()
ValidationDataSize = DataSize.GetValidationDataSize()

time: 154 µs (started: 2022-04-10 00:17:26 +08:00)


### Text preprocessing

In [2]:
def text_preprocessing(X_train, X_test):
    """Normalise the raw texts and project them into the LSA latent space.

    Parameters: the original X of the training set and of the test set.
    Returns: the vectorised X of the training set and of the test set.
    """
    # Normalise the raw texts
    X_train = normalize_preprocessing(X_train)
    X_test = normalize_preprocessing(X_test)
    # Build the LSA model on the combined training and test texts
    lsa = SKLearnLSA()
    lsa.BuildModel(X_train + X_test, 2000)
    # Convert texts to vectors
    X_train_vec = lsa.Query2LatentSpace(X_train)
    X_test_vec = lsa.Query2LatentSpace(X_test)
    return X_train_vec, X_test_vec


time: 274 µs (started: 2022-04-10 00:17:26 +08:00)
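`SKLearnLSA` is a project-local helper whose internals are not shown in this notebook. As a rough sketch of what an LSA step can look like with plain scikit-learn, the block below chains `TfidfVectorizer` and `TruncatedSVD`. The function name `lsa_vectors`, the toy sentences, and the tiny `n_components` are illustrative assumptions, not the project's actual implementation.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def lsa_vectors(train_texts, test_texts, n_components=2):
    """Hypothetical helper: TF-IDF followed by truncated SVD (LSA)."""
    lsa = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=n_components, random_state=0),
    )
    # Fit on the training texts only, then project both sets
    train_vec = lsa.fit_transform(train_texts)
    test_vec = lsa.transform(test_texts)
    return train_vec, test_vec


# Toy sentences, made up for illustration
train_texts = [
    "climate change is real",
    "global warming is a hoax",
    "news about climate policy",
    "warming oceans threaten reefs",
]
test_texts = ["climate news today"]

train_vec, test_vec = lsa_vectors(train_texts, test_texts, n_components=2)
print(train_vec.shape, test_vec.shape)
```

Unlike the cell above, which builds the model from `X_train + X_test`, this sketch fits on the training texts only. That is a common choice when the test set should stay unseen during feature fitting; whether it changes the scores reported here has not been checked.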


### One-hot encoding: convert each label to a 4 x 1 vector

In [3]:
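def one_hot_encoding(y_train, y_test):
    """Encode the labels as indicator vectors.

    Parameters: the original y of the training set and of the test set.
    Returns: the encoded y of the training set and of the test set.
    """
    # MultiLabelBinarizer treats str(label) as a collection of characters,
    # so a clean one-hot result relies on single-character labels
    mlb = MultiLabelBinarizer()
    y_train_vec = mlb.fit_transform(map(str, y_train))
    y_test_vec = mlb.transform(map(str, y_test))
    return y_train_vec, y_test_vec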


time: 215 µs (started: 2022-04-10 00:17:26 +08:00)
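To make the encoding step concrete, here is a small sketch of how `MultiLabelBinarizer` behaves when it is fed `map(str, labels)`. The label values `0` to `3` are an assumption for illustration; the actual label values in the dataset are not shown in this notebook.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Assumed single-character labels, for illustration only
labels = [0, 1, 2, 3, 1]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(map(str, labels))
print(mlb.classes_)  # one class per distinct character
print(encoded)       # one row per sample, one column per class

# A multi-character label is split into its characters,
# so "-1" contributes the two classes "-" and "1"
split = MultiLabelBinarizer().fit(["-1", "0"])
print(split.classes_)
```

With single-character labels each row has exactly one `1`, which is the one-hot form used in this notebook. If a label had more than one character, `sklearn.preprocessing.LabelBinarizer` would be one way to keep it as a single class.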


### Run SVM and evaluate the results

In [4]:
def evaluate_SVM(title, X_train_vec, y_train_vec, X_test_vec, y_test_vec):
    """Fit a one-vs-rest linear SVM and evaluate it on the test set.

    Parameters: the vectorised X and encoded y of the training set and test set.
    Returns: a DataFrame holding the evaluation results.
    """
    # Run SVM - fit and predict
    SVM = OneVsRestClassifier(LinearSVC(dual=False, class_weight="balanced"), n_jobs=-1)
    # Alternative with an RBF kernel:
    # SVM = OneVsRestClassifier(SVC(gamma='auto', class_weight="balanced"), n_jobs=-1)
    SVM.fit(X_train_vec, y_train_vec)
    prediction = SVM.predict(X_test_vec)
    print_evaluation(y_test_vec, prediction)
    evaluateDF = EvaluationToDF(title, y_test_vec, prediction)

    return evaluateDF


time: 297 µs (started: 2022-04-10 00:17:26 +08:00)
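`print_evaluation` and `EvaluationToDF` are project-local, so their exact behaviour is not visible here. The following sketch shows, on synthetic data, how a one-vs-rest `LinearSVC` can be fitted on indicator labels and how per-class, micro and macro F1 can be computed with `f1_score`. The cluster data and variable names are assumptions made for the example, and the scores it prints are unrelated to the results in this notebook.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Synthetic stand-in data (not the tweet vectors): four 2-D clusters
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0], [0.0, -5.0]])
labels = np.repeat(np.arange(4), 25)
X = centers[labels] + rng.normal(scale=0.5, size=(100, 2))
Y = MultiLabelBinarizer().fit_transform(map(str, labels))  # shape (100, 4)

# One binary linear SVM per label column
clf = OneVsRestClassifier(LinearSVC(dual=False, class_weight="balanced"))
clf.fit(X, Y)
pred = clf.predict(X)  # evaluated on the training data, for brevity only

per_class_f1 = f1_score(Y, pred, average=None, zero_division=0)
micro_f1 = f1_score(Y, pred, average="micro", zero_division=0)
macro_f1 = f1_score(Y, pred, average="macro", zero_division=0)
print(per_class_f1, micro_f1, macro_f1)
```

Because the targets are an indicator matrix, each binary classifier decides independently, so a predicted row can contain no positive label or more than one. That is worth keeping in mind when reading precision and recall for this setup.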


### Do an experiment

In [5]:
def do_experiment(title, X_train, y_train, X_test, y_test):
    """Run one experiment.

    Parameters: the original X and y of the training set and test set.
    Returns: a DataFrame holding the evaluation results.
    """
    # Convert texts to vectors
    X_train_vec, X_test_vec = text_preprocessing(X_train, X_test)
    y_train_vec, y_test_vec = one_hot_encoding(y_train, y_test)

    # Run SVM and evaluate the results
    evaluateDF = evaluate_SVM(title, X_train_vec, y_train_vec, X_test_vec, y_test_vec)

    # Show the indicators
    # print(" macro_f1: %.4f , weighted_f1: %.4f, macro_precision: %.4f, macro_recall: %.4f" %
    #       (macro_f1, weighted_f1, macro_precision, macro_recall))
    # print(evaluateDF)
    return evaluateDF


time: 321 µs (started: 2022-04-10 00:17:26 +08:00)


### Main entry

In [6]:

noisy_set_sizes = {
    'mislabeled' : 5000,   # max size: 15000
    'irrelevant' : 5000,   # max size: 34259
    'translated' : 5000,   # max size: 5000
}

# Load the dataset and split it into a training set, test set, noisy set and validation set
dc = data_center("twitter_sentiment_data_clean.csv", test_size = 4000, validation_size = 1000,
                 noisy_size = noisy_set_sizes['mislabeled'])

print("####################################################")
print("Total data size: ",       dc.get_len())
print("Total train data size: ", dc.get_train_len())
print("Total test data size: ",  dc.get_test_len())

####################################################
Total data size:  40908
Total train data size:  30908
Total test data size:  4000
time: 131 ms (started: 2022-04-10 00:17:26 +08:00)


**Get the test set for evaluation**

In [7]:
X_test, y_test = dc.get_test()


time: 2.29 ms (started: 2022-04-10 00:17:26 +08:00)


**Set distributions of training set.**

In [8]:
# distribution of training set
train_distribution = None


time: 152 µs (started: 2022-04-10 00:17:26 +08:00)


**Prepare the noisy set.**

In [9]:
lstNoisyInfo = [("mislabeled",dc.get_noisy_len())]
print("Noisy set size is %d"                % dc.get_noisy_len())

# Add external noisy data (irrelevant texts)
# Label distribution of the irrelevant noise; use None to keep the distribution of the original set
irrelevant_noisy_distribution = [0.25, 0.25, 0.25, 0.25]
added_size = dc.add_noisy(noisy_source="irrelevant", distribution = irrelevant_noisy_distribution,
                          size = noisy_set_sizes['irrelevant'])
print("%d noisy samples added" % added_size)
lstNoisyInfo.append(("irrelevant",added_size))

# Add external noisy data (translated texts), keeping each noisy sample's own label
added_size = dc.add_noisy(noisy_source="translated", distribution = "reserve_labels", 
                          size = noisy_set_sizes['translated'])
print("%d noisy samples added" % added_size)
lstNoisyInfo.append(("translated",added_size))

print("Noisy set new size is %d"                % dc.get_noisy_len())



Noisy set size is 5000
5000 noisy samples added
5000 noisy samples added
Noisy set new size is 15000
time: 254 ms (started: 2022-04-10 00:17:26 +08:00)
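How `dc.add_noisy` applies its `distribution` argument is not shown in this notebook. One plausible reading, stated here purely as an assumption, is that the list gives the share of each label among the added samples. The hypothetical helper below turns such a distribution and a target size into per-class sample counts.

```python
def counts_from_distribution(size, distribution):
    """Hypothetical helper: split `size` samples across classes by share."""
    # Floor each class's share, then hand out any leftover samples one by one
    counts = [int(size * share) for share in distribution]
    leftover = size - sum(counts)
    for i in range(leftover):
        counts[i % len(counts)] += 1
    return counts


uniform_counts = counts_from_distribution(5000, [0.25, 0.25, 0.25, 0.25])
print(uniform_counts)  # [1250, 1250, 1250, 1250]

skewed_counts = counts_from_distribution(10, [0.3, 0.3, 0.2, 0.2])
print(sum(skewed_counts))  # 10
```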


**Run experiments with different training sets, and use the same test set.**

In [10]:
evaluateDF = None
print("-----------------------------------------------")
for size in TrainSizeBaseLine:
    # Get a training set without noisy data
    X_train, y_train = dc.get_train(size, train_distribution)
    print("Training set size: %d samples (%.1f%%): " % (len(X_train), len(y_train)/dc.get_train_len()*100))

    # Do an experiment
    title = "%d" % (len(X_train))
    df = do_experiment(title, X_train, y_train, X_test, y_test)
    if evaluateDF is None:
        evaluateDF = df
    else:
        evaluateDF = pd.concat([evaluateDF,df],axis=0)

print("-----------------------------------------------")
for size in TrainSizeWithNoisyData:
    # Get a noisy training set
    X_train, y_train = dc.get_train_with_noisy(size[0], size[1], train_distribution)
    print("Noisy training set size: %d samples (%d original, %d noisy)" % (len(y_train), size[0], size[1]))

    # Do an experiment
    title = "%d samples (%d original, %d noisy)" % (len(y_train), size[0], size[1])
    df = do_experiment(title, X_train, y_train, X_test, y_test)
    if evaluateDF is None:
        evaluateDF = df
    else:
        evaluateDF = pd.concat([evaluateDF,df],axis=0)

-----------------------------------------------
Training set size: 2000 samples (6.5%): 
  f1 of classes: [0.339, 0.396, 0.639, 0.563]
  micro_f1: 0.541 , macro_f1: 0.484 , weighted_f1: 0.550, macro_precision: 0.460, macro_recall: 0.517
Training set size: 4000 samples (12.9%): 
  f1 of classes: [0.353, 0.397, 0.659, 0.591]
  micro_f1: 0.560 , macro_f1: 0.500 , weighted_f1: 0.568, macro_precision: 0.470, macro_recall: 0.540
Training set size: 5000 samples (16.2%): 
  f1 of classes: [0.374, 0.422, 0.662, 0.588]
  micro_f1: 0.567 , macro_f1: 0.512 , weighted_f1: 0.575, macro_precision: 0.476, macro_recall: 0.558
Training set size: 8000 samples (25.9%): 
  f1 of classes: [0.382, 0.464, 0.695, 0.605]
  micro_f1: 0.592 , macro_f1: 0.537 , weighted_f1: 0.603, macro_precision: 0.488, macro_recall: 0.607
Training set size: 10000 samples (32.4%): 
  f1 of classes: [0.372, 0.464, 0.704, 0.632]
  micro_f1: 0.599 , macro_f1: 0.543 , weighted_f1: 0.613, macro_precision: 0.491, macro_recall: 0.623
Training set size: 15000 samples (48.5%): 
  f1 of classes: [0.385, 0.492, 0.731, 0.669]
  micro_f1: 0.625 , macro_f1: 0.569 , weighted_f1: 0.641, macro_precision: 0.507, macro_recall: 0.670
Training set size: 20000 samples (64.7%): 
  f1 of classes: [0.398, 0.507, 0.743, 0.675]
  micro_f1: 0.635 , macro_f1: 0.581 , weighted_f1: 0.653, macro_precision: 0.512, macro_recall: 0.698
-----------------------------------------------
Noisy training set size: 5000 samples (4000 original, 1000 noisy)
  f1 of classes: [0.35, 0.435, 0.703, 0.643]
  micro_f1: 0.592 , macro_f1: 0.533 , weighted_f1: 0.607, macro_precision: 0.466, macro_recall: 0.644
Noisy training set size: 10000 samples (8000 original, 2000 noisy)
  f1 of classes: [0.378, 0.463, 0.717, 0.647]
  micro_f1: 0.602 , macro_f1: 0.551 , weighted_f1: 0.623, macro_precision: 0.471, macro_recall: 0.704
Noisy training set size: 20000 samples (15000 original, 5000 noisy)
  f1 of classes: [0.396, 0.484, 0.733, 0.644]
  micro_f1: 0.613 , macro_f1: 0.564 , weighted_f1: 0.636, macro_precision: 0.476, macro_recall: 0.739
time: 1h 14min 5s (started: 2022-04-10 00:17:26 +08:00)
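The loop above grows `evaluateDF` with one `pd.concat` call per experiment. An equivalent pattern, sketched below, collects the per-experiment frames in a list and concatenates once at the end. The single `macro_f1` column is a simplification assumed for the example (the real layout produced by `EvaluationToDF` is not shown); the two values are the macro F1 scores printed above for the 2000 and 4000 sample runs.

```python
import pandas as pd

# Macro F1 values copied from the output above; the column layout is assumed
reported = [("2000", 0.484), ("4000", 0.500)]

frames = []
for title, macro_f1 in reported:
    # One single-row frame per experiment, indexed by its title
    frames.append(pd.DataFrame({"macro_f1": [macro_f1]}, index=[title]))

# Concatenate once, after all experiments have run
results = pd.concat(frames, axis=0)
print(results)
```

Concatenating once avoids the `if evaluateDF is None` branch and does not copy the accumulated frame on every iteration.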


In [11]:
evaluateDF.to_clipboard(excel=True)

time: 24.2 ms (started: 2022-04-10 01:31:32 +08:00)
