# SVM - Climate Sentiment Multiclass Classification
## CS522 Project

### Dataset: 
https://www.kaggle.com/code/luiskalckstein/climate-sentiment-multiclass-classification

### Imports

In [1]:
%matplotlib inline

# Our base common modules
from Common.DataCenter import data_center
from Common.UtilFuncs import Evaluator, Lab

# Classifiers without denoising
import Common.SvmMethod as SvmMethod

# Denoising methods
import Common.IsolationForestMethod as IsolationForestMethod
import Common.ConfidentLearningMethod as ConfidentLearningMethod
import Common.LocalOutlierFactorMethod as LocalOutlierFactorMethod
from Common.BERTModel import do_experiment_BERT


**The settings of the noise sources.**

In [2]:
# Each item: source -> (size, distribution)
noisy_set_sizes = {
    'mislabeled' : (8600, None),                   # max size: 15000
    'irrelevant' : (8600, [0.25,0.25,0.25,0.25]),  # max size: 34259
    'translated' : (8600, "reserve_labels"),       # max size: 5000
}


In [3]:
lab = Lab("twitter_sentiment_data_clean.csv", noisy_sources = noisy_set_sizes, total_train_size = 20000, total_test_size = 4000)


**Choose an experiment without denoising**

In [4]:
# Each item: name -> (function, optional args, active flag). Note: only the first active entry is used.
experiment_without_denoising = {
    'SVM without denoising' : (SvmMethod.do_experiment, 0),
    'BERT without denoising' : (do_experiment_BERT, (lab,), 1 )
}


**Choose an experiment with denoising**

In [5]:
# Each item: name -> (function, optional args, active flag). Note: only the first active entry is used.
experiment_with_denoising = {
    'Confident Learning' : (ConfidentLearningMethod.do_experiment_with_denoising_for_SVM,   0),
    'Isolation Forest'   : (IsolationForestMethod.do_experiment_with_denoising_for_SVM,     0),
    'LocalOutlierFactor' : (LocalOutlierFactorMethod.do_experiment_with_denoising_for_SVM,  1),
}
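The "first active entry" rule used by both dictionaries above can be sketched as follows. This is a minimal illustration, not the actual code in `Common.UtilFuncs`: the helper `get_active_name` and the stand-in function `placeholder` are hypothetical, and the sketch assumes the last element of each tuple is the active flag.

```python
def get_active_name(experiments):
    """Return the name of the first entry whose flag (last tuple element) is truthy."""
    # Dictionaries preserve insertion order (Python 3.7+), so "first" is well defined.
    for name, spec in experiments.items():
        if spec[-1]:
            return name
    return None


def placeholder(*args):
    """Hypothetical stand-in for a real experiment function."""
    return None


with_denoising = {
    'Confident Learning': (placeholder, 0),
    'Isolation Forest':   (placeholder, 0),
    'LocalOutlierFactor': (placeholder, 1),
}
without_denoising = {
    'SVM without denoising':  (placeholder, 0),
    'BERT without denoising': (placeholder, ("lab",), 1),  # optional args in the middle
}

active_with = get_active_name(with_denoising)
active_without = get_active_name(without_denoising)
print(active_with, "|", active_without)
```

Reading the flag from the last position lets the same helper handle both the two-element and the three-element tuples.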


**The training set of each experiment**

In [6]:
origin_train_set_sizes = [2000, 4000, 5000, 8000, 10000, 15000, 20000]
noisy_train_set_sizes  = [(4000, 1000), (8000, 2000), (15000, 5000)]


### Main entry
**Initialize the lab, which will run a series of experiments.<br>
Split the dataset into a training set, a test set, a noisy set, and a validation set.**

**Review the summary of the whole data**

In [7]:
lab.dc.print_summary()


###################################### Data Summary #############################################
  Total data size: 40908
      sentiments ('Anti', 'Neutral', 'Pro', 'News'): 9.4%, 18.3%, 50.2%, 22.1%
  Training data size: 20000
  Test data size: 4000
  Noisy data size: 25800
  Validation data size: 1000
      noise sources ('mislabeled', 'irrelevant', 'translated'): 33.3%, 33.3%, 33.3%
##################################################################################################
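The noisy data size and the per-source shares in the summary follow directly from the sizes requested in `noisy_set_sizes`. A small check, assuming the shares are simply each source's size divided by the total noisy size:

```python
# Sizes requested per noise source (taken from the settings cell above).
requested_sizes = {'mislabeled': 8600, 'irrelevant': 8600, 'translated': 8600}

total_noisy = sum(requested_sizes.values())

# Share of each source in percent, rounded to one decimal place.
shares = {name: round(100 * size / total_noisy, 1)
          for name, size in requested_sizes.items()}

print(total_noisy)
print(shares)
```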


**Preview the data features on a sample of the training set**

In [8]:
train_df = lab.dc.get_train_with_noisy_df(15000,5000)
data_center.print_data(train_df.head(15))


Unnamed: 0,noise,noise_text,sentiment,origin(sentiment),tweetid...,message...
0,1,mislabeled,3,2,8108306943,Regional/Global seabird stress
1,0,none,2,-,9536174384,I have to write an essay over
2,1,mislabeled,1,3,8645926535,Barack Obama warns climate cha
3,0,none,2,-,8199736905,RT @mitskileaks: want to speci
4,0,none,2,-,8438476460,.@RepBrianFitz Thank you for a
5,0,none,2,-,8182648187,RT @billmckibben: Reading clim
6,0,none,2,-,9556340417,RT @LanreShaper: 'Africa contr
7,1,mislabeled,3,1,9587601354,Keilmuan itu politik. Hawong N
8,0,none,2,-,8401717188,RT @WRIClimate: @CNBC He shoul
9,0,none,0,-,8254569694,@magslol global warming is a C


**Compute the filename used to save the lab**

In [9]:
lab_filename = Lab.get_active_experiment_name(experiment_with_denoising)
if lab_filename is None:
    lab_filename  = Lab.get_active_experiment_name(experiment_without_denoising)
if lab_filename is None:
    print("Nothing to do.")
    exit(0)
lab_filename = "saving/" + lab_filename + str(noisy_train_set_sizes) + ".pk"
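For reference, the sketch below shows what this concatenation produces. It assumes the active experiment name is `'LocalOutlierFactor'`, as selected above. Because `str()` of a list keeps brackets, parentheses, and spaces, the filename contains them too; `safe_suffix` is a hypothetical alternative that is not part of the project.

```python
noisy_train_set_sizes = [(4000, 1000), (8000, 2000), (15000, 5000)]
experiment_name = "LocalOutlierFactor"  # assumed active experiment

# Same construction as in the cell above.
lab_filename = "saving/" + experiment_name + str(noisy_train_set_sizes) + ".pk"
print(lab_filename)

# Hypothetical alternative: a suffix without brackets, commas, or spaces.
safe_suffix = "_".join(f"{first}-{second}" for first, second in noisy_train_set_sizes)
safe_filename = f"saving/{experiment_name}_{safe_suffix}.pk"
print(safe_filename)
```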
    

### Run new experiments (or just review the evaluations saved by previous experiments)

In [10]:
RUN = 1
if RUN:     # Run new experiments
    # Set the function to classify data without denoising
    lab.set_experiment_no_denoising(experiment_without_denoising)

    # Set the function to classify data with denoising
    lab.set_experiment_with_denoising(experiment_with_denoising)

    print("-------------- No noisy training sets ----------")
    lab.do_batch_experiments(origin_train_set_sizes)

    print("-------------- Noisy training sets -------------")
    lab.do_batch_experiments(noisy_train_set_sizes)

    # Save the evaluations of lab
    lab.save(lab_filename)

else:       # Load evaluations saved by previous experiments
    lab = Lab.load(lab_filename)
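The `.pk` extension suggests that evaluations are stored with `pickle`, although the notebook does not confirm this. Under that assumption, a save/load round trip could look like the sketch below; `save_evaluations`, `load_evaluations`, and the sample dictionary are hypothetical and do not reflect the real `Lab.save` / `Lab.load` implementation.

```python
import os
import pickle
import tempfile


def save_evaluations(evaluations, path):
    """Pickle the evaluations, creating the target directory if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(evaluations, f)


def load_evaluations(path):
    """Load evaluations previously written by save_evaluations."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Placeholder content only; these are not results from the experiments.
evaluations = {"placeholder experiment": {"train_size": 2000, "note": "illustrative"}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "saving", "demo.pk")
    save_evaluations(evaluations, path)
    restored = load_evaluations(path)

print(restored == evaluations)
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.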


-------------- No noisy training sets ----------
* 1> Training set size: 2000 samples
  Sentiments ('Anti', 'Neutral', 'Pro', 'News'): 9.4%, 18.3%, 50.2%, 22.1%


Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['vocab_transform', 'activation_13', 'vocab_layer_norm', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.
100%|██████████| 2000/2000 [00:01<00:00, 1552.05it/s]
100%|██████████| 4

Epoch 1/50
Epoch 2/50
Epoch 3/50

KeyboardInterrupt: 

### Show evaluations

In [None]:
# As a table
lab.print()

# In a plot
lab.plot()
