# BERT - Climate Sentiment Multiclass Classification
## CS522 Project

**Dataset:**  
https://www.kaggle.com/code/luiskalckstein/climate-sentiment-multiclass-classification

**Imports**

In [1]:
# ! pip install tensorflow-addons
import os
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score
from transformers import DistilBertTokenizer, TFDistilBertModel, DistilBertConfig
from transformers import logging as hf_logging
from Common.preprocessor import one_hot_encoding
import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
import matplotlib.pyplot as plt
from Common.UtilFuncs import DataSize
from Common.DataCenter import data_center
from Common.UtilFuncs import print_evaluation, print_distribution
from Common.UtilFuncs import Evaluator, Lab
from Common.BERTModel import BERTModel, do_experiment_BERT
try:
    %load_ext autotime
except Exception:
    # autotime is not installed yet: install it, then load the extension
    !pip install ipython-autotime
    %load_ext autotime

hf_logging.set_verbosity_error()
warnings.filterwarnings('ignore')
TrainSizeBaseLine = DataSize.GetTrainSizeBaseline()
TrainSizeWithNoisyData = DataSize.GetTrainSizeWithNoisyData()
# 4000
TestDataSize = DataSize.GetTestDataSize()
NoiseDataSize = DataSize.GetNoiseDataSize()
ValidationDataSize = DataSize.GetValidationDataSize()

%matplotlib inline

time: 0 ns (started: 2022-04-21 17:26:52 +08:00)


**Detect GPU**

In [2]:
# Enable memory growth so TensorFlow allocates GPU memory on demand
# instead of reserving all of it up front.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        print('Set memory autoincrement')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print('Physical GPUs: %d, Logical GPUs: %d' % (len(gpus), len(logical_gpus)))
    except RuntimeError as e:
        # Memory growth must be set before the GPUs are initialized.
        print(e)
else:
    print('GPUs not detected')

Set memory autoincrement
Physical GPUs: 1, Logical GPUs: 1
time: 422 ms (started: 2022-04-21 17:26:52 +08:00)


## 1. Loading Dataset

In [3]:
# Each item: source -> (size, distribution)
noisy_set_sizes = {
    'mislabeled' : (8600, None),                   # max size: 15000
    'irrelevant' : (8600, [0.25,0.25,0.25,0.25]),  # max size: 34259
    'translated' : (5000, "reserve_labels"),       # max size: 5000
}
lab = Lab(
    "twitter_sentiment_data_clean.csv",
    noisy_sources=noisy_set_sizes,
    total_train_size=20000,
    total_test_size=4000,
)

time: 578 ms (started: 2022-04-21 17:26:53 +08:00)


In [4]:
lab.dc.print_summary()


###################################### Data Summary #############################################
  Total data size: 40908
      sentiments ('Anti', 'Neutral', 'Pro', 'News'): 9.4%, 18.3%, 50.2%, 22.1%
  Training data size: 20000
  Test data data: 4000
  Noisy data data: 22200
  Validation data size: 1000
      noise sources ('mislabeled', 'irrelevant', 'translated'): 38.7%, 38.7%, 22.5%
##################################################################################################
time: 0 ns (started: 2022-04-21 17:26:53 +08:00)
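
The noise-source percentages in the summary can be checked against the requested sizes. The sketch below is plain arithmetic on the numbers from the configuration cell; the variable names are illustrative and not part of the `Lab` API.

```python
# Requested noisy-set sizes, copied from the configuration cell above.
noisy_sizes = {"mislabeled": 8600, "irrelevant": 8600, "translated": 5000}

total_noisy = sum(noisy_sizes.values())

# Share of each source as a percentage of all noisy samples.
shares = {name: 100 * size / total_noisy for name, size in noisy_sizes.items()}

print(total_noisy)
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

The total is 22200, matching the noisy data size in the summary. The rounded shares add up to 99.9% rather than 100% because each one is rounded to a single decimal.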


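Inspect the first rows of the training set, which mixes clean and noisy samples. The `noise` and `noise_text` columns mark the noise source of each row, and `origin(sentiment)` appears to hold the label a noisy row had before the noise was applied.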

In [5]:
train_df = lab.dc.get_train_with_noisy_df(150,50)
test_df = lab.dc.get_test_df()
data_center.print_data(train_df.head(15))



| | noise | noise_text | sentiment | origin(sentiment) | tweetid... | message... |
|---|---|---|---|---|---|---|
| 0 | 0 | none | 2 | - | 8341187703 | RT @PoliticsOTM: The people ca |
| 1 | 0 | none | 0 | - | 8261053219 | @lundstephs shut up climate ch |
| 2 | 3 | translated | 3 | 3 | 8407369356 | RT @EcoInternet3: телефон EPA |
| 3 | 0 | none | 1 | - | 9547625090 | @charliespiering Any climate c |
| 4 | 0 | none | 3 | - | 8604160894 | RT @TheEconomist: The impact o |
| 5 | 0 | none | 2 | - | 9534173196 | Ed our PM inspires your presid |
| 6 | 0 | none | 1 | - | 9123373534 | RT @AmyMcGrathKY: Massive lack |
| 7 | 0 | none | 3 | - | 7020372141 | Blame Zika on climate change - |
| 8 | 1 | mislabeled | 0 | 3 | 9106954837 | RT @michaelhallida4: BREAKING |
| 9 | 0 | none | 2 | - | 7966996768 | Dams raise global warming gas: |


time: 47 ms (started: 2022-04-21 17:26:53 +08:00)


In [6]:

evaluateDF = do_experiment_BERT(train_df, test_df, lab)

100%|██████████| 200/200 [00:00<00:00, 1387.62it/s]
100%|██████████| 4000/4000 [00:02<00:00, 1652.21it/s]
100%|██████████| 1000/1000 [00:00<00:00, 1688.60it/s]


Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 5: ReduceLROnPlateau reducing learning rate to 9.999999747378752e-06.
Epoch 6/50
Epoch 6: ReduceLROnPlateau reducing learning rate to 1e-06.
Epoch 7/50
Epoch 8/50
Epoch 8: early stopping
    f1 of classes: [0.0, 0.064, 0.665, 0.553]
    micro_f1: 0.543 , macro_f1: 0.321 , weighted_f1: 0.468, macro_precision: 0.331, macro_recall: 0.362
time: 3min 46s (started: 2022-04-21 17:26:53 +08:00)
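
The odd-looking learning rate `9.999999747378752e-06` in the training log is most likely just `1e-5` after a round trip through single precision. The sketch below assumes the optimizer stores its learning rate as a 32-bit float, which the notebook does not show; `as_float32` is a hypothetical helper that uses only the standard library.

```python
import struct


def as_float32(value: float) -> float:
    """Round a Python float to the nearest float32 and return it as a float."""
    return struct.unpack("f", struct.pack("f", value))[0]


lr = as_float32(1e-5)
print(lr)          # 9.999999747378752e-06, the value shown in the log
print(lr == 1e-5)  # False: the float32 value is slightly below 1e-5
```

The reduced rate is therefore effectively `1e-5`, not a different schedule value.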


In [7]:
evaluateDF

| | Micro F1 | Macro F1 | Weighted F1 | Macro Precision | Macro Recall | F1 of classes |
|---|---|---|---|---|---|---|
| 0 | 0.5435 | 0.320665 | 0.467883 | 0.331355 | 0.361767 | [0.0, 0.064, 0.665, 0.553] |


time: 15 ms (started: 2022-04-21 17:31:30 +08:00)
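
The gap between micro F1 (0.5435) and macro F1 (0.320665) follows from how the two averages are defined. Macro F1 is the unweighted mean of the per-class scores, so the two classes with near-zero F1 pull it down. Micro F1 pools all predictions and, for single-label multiclass data, equals accuracy. The sketch below first recomputes the macro score from the rounded class scores in the table, then shows the same effect on toy labels. The toy labels and the `per_class_f1` helper are made up for illustration; they are not the project's `Evaluator`.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, labels):
    """Return one F1 score per label; labels with no true positives score 0.0."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return scores


# Macro F1 recomputed from the rounded per-class scores in the table above.
reported_class_f1 = [0.0, 0.064, 0.665, 0.553]
macro_from_report = sum(reported_class_f1) / len(reported_class_f1)
print(f"{macro_from_report:.4f}")  # close to the reported 0.320665

# Hypothetical toy labels: the model mostly predicts the majority class (2).
y_true = [2, 2, 2, 2, 3, 3, 0, 1]
y_pred = [2, 2, 2, 2, 3, 2, 2, 2]
labels = [0, 1, 2, 3]

class_f1 = per_class_f1(y_true, y_pred, labels)
macro_f1 = sum(class_f1) / len(labels)
# For single-label multiclass data, micro F1 equals accuracy.
micro_f1 = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(class_f1, macro_f1, micro_f1)
```

In the toy case the classes that are never predicted score 0.0, so macro F1 falls well below micro F1, the same pattern as in the reported results.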
