# Student Information
Name: 丁浩文

Student ID: 109062517

GitHub ID: joshua049

Kaggle name: Hao-Wen Ting

Kaggle private scoreboard snapshot:

![](https://imgur.com/OCktU1I.jpg)

---

# Report

## Preprocessing
I tried several methods to improve my performance in this assignment. Before training and testing, I preprocess the data so that the preprocessed data can be used directly to build my models.

I read *tweets_DM.json* first and create a DataFrame whose columns hold the tweet ids and the corresponding tweet texts. To clean up the data, I use the *re* and *BeautifulSoup* packages to remove noise such as HTML tags and punctuation; emoticons are extracted from the text and appended to its end. I also read *emotion.csv* and *data_identification.csv* as DataFrames. After that, I merge the three DataFrames on *tweet_id* and apply the tokenizer from the *nltk* package to the content column and remove stop words, so I can look up every attribute of a record by querying its *tweet_id*.
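
The cleaning step can be sketched roughly as follows. This is a simplified, standard-library-only illustration rather than the exact cell from the Code section: a plain regex stands in for *BeautifulSoup*, the function name `clean_tweet` is hypothetical, and the emoticon pattern is the one used in the Code section, where emoticons are pulled out and re-appended rather than discarded.

```python
import re

# A plain regex stands in for BeautifulSoup here (assumption for illustration).
TAG_RE = re.compile(r"<[^>]+>")
# Emoticon pattern taken from the preprocessing code: :), :-P, ;D, =( ...
EMOTICON_RE = re.compile(r"(?::|;|=|X)(?:-)?(?:\)|\(|D|P)")
NON_WORD_RE = re.compile(r"\W+")


def clean_tweet(text):
    """Strip tags and punctuation, lowercase, and keep emoticons at the end."""
    text = TAG_RE.sub(" ", text)              # remove HTML tags first
    emoticons = EMOTICON_RE.findall(text)     # remember the emoticons
    text = EMOTICON_RE.sub(" ", text)         # then cut them out of the text
    words = NON_WORD_RE.sub(" ", text.lower()).strip()
    tail = " ".join(e.replace("-", "") for e in emoticons)  # drop the "nose"
    return (words + " " + tail).strip()


print(clean_tweet("<b>Great</b> day!! :-)"))
```

Stripping the tags before the punctuation pass matters: once the angle brackets are removed, an HTML parser can no longer recognise the tags.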

Finally, I use the *identification* column to tell whether a record belongs to the training data. Furthermore, I use *train_test_split* from *sklearn.model_selection* to hold out a validation set for checking my model's performance during training.
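
A minimal sketch of the validation split, assuming scikit-learn is installed. The toy texts and labels are made up for illustration, and the `stratify` argument is an optional addition that the Code section does not use; it keeps the class proportions similar in both parts.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the cleaned tweets and their labels.
texts = [f"tweet {i}" for i in range(20)]
labels = ["joy"] * 12 + ["anger"] * 8

# Hold out 20% of the labelled data as a validation set.
X_train, X_valid, y_train, y_valid = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)
print(len(X_train), len(X_valid))
```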

## Training
What I have tried can be divided into two parts: **traditional machine learning-based method** and **deep learning**. 

### Traditional Machine Learning
In this part, I use *CountVectorizer* and *TfidfVectorizer* to turn the text into a document-term matrix of feature vectors. I then tried several classifiers and compared their performance. After several experiments, I found that most classifiers reached a similar level of performance, except NuSVC and LinearSVC. Hence I use the better-performing classifiers as base estimators and combine them with StackingClassifier to improve the overall performance.

Validation results of models I have tried:

|       Classifier       | f1-score |
| :--------------------: | :------: |
|     MultinomialNB      |   0.53   |
|   LogisticRegression   |   0.51   |
| RandomForestClassifier |   0.49   |
|     XGBClassifier      |   0.47   |
|         NuSVC          |   0.37   |
|       LinearSVC        |   0.39   |
|     LGBMClassifier     |   0.49   |
|   StackingClassifier   |   0.60   |
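
The stacking idea can be illustrated with a small, self-contained sketch, assuming scikit-learn is installed. The toy corpus, the choice of only two base estimators, and `cv=3` are assumptions made to keep the example tiny; the actual estimators and hyperparameters are the ones listed in the Code section.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: two classes with six samples each.
texts = [
    "happy great day", "love this so much", "what a wonderful time",
    "great fun today", "so happy and glad", "love the sunshine",
    "awful terrible day", "hate this so much", "what a horrible time",
    "terrible mess today", "so angry and mad", "hate the traffic",
]
labels = [1] * 6 + [0] * 6

# Base estimators produce out-of-fold predictions; the final estimator
# (logistic regression) learns how to combine them.
stack = StackingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=3,
)
model = Pipeline([("vect", CountVectorizer()), ("clf", stack)])
model.fit(texts, labels)

predictions = model.predict(["happy day", "terrible traffic"])
print(predictions)
```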


### Deep Learning
In this part, I tried a pretrained language model. I use the BERT model released by Google as my backbone, then add two fully-connected layers that output 8 channels as the classification result. During training, I set the weights of the backbone to be non-trainable, so that only the weights in the fully-connected layers are fine-tuned.
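
To make the shape of the classification head concrete, here is a rough NumPy sketch; it is not the Keras model from the Code section. A random matrix stands in for the backbone's [CLS] output, the hidden size of 768 and the tanh activation follow the model code in the Code section, and all weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_classes, batch = 768, 8, 4

# Stand-in for the frozen backbone output: one [CLS] vector per tweet.
cls_vectors = rng.normal(size=(batch, hidden_size))

# Parameters of the two fully-connected layers (random placeholders).
# With a frozen backbone, these are the parameters that training updates.
W1 = rng.normal(scale=0.02, size=(hidden_size, hidden_size))
b1 = np.zeros(hidden_size)
W2 = rng.normal(scale=0.02, size=(hidden_size, num_classes))
b2 = np.zeros(num_classes)


def head(x):
    """Dense(768, tanh) followed by Dense(8, softmax)."""
    h = np.tanh(x @ W1 + b1)
    logits = h @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


probs = head(cls_vectors)
print(probs.shape)
```

Each row of `probs` is a distribution over the 8 classes, and the predicted class is its argmax.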

However, due to a lack of time and computing resources, I could only sample 40000 training records to make my model converge. This method only reached an f1-score of 0.448, which does not exceed that of my stacking classifier. If I had started this experiment earlier, I think it would have achieved better performance.

## Output
During the training process, I used a label encoder for both methods. Hence both methods output a number that represents the classification result, and I can reuse the label encoder to decode the number into the class it represents.
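
The encode/decode round trip looks roughly like this, assuming scikit-learn is installed. The three emotion names are illustrative placeholders, not the full label set of the competition.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels for illustration only.
emotions = ["joy", "anger", "sadness", "joy"]

le = LabelEncoder()
encoded = le.fit_transform(emotions)       # strings -> integer class ids
decoded = le.inverse_transform([2, 0])     # integer class ids -> strings

print(list(le.classes_))
print(encoded.tolist())
print(decoded.tolist())
```

`LabelEncoder` sorts the classes, so indexing `le.classes_` with a predicted id gives the same result as `inverse_transform`.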

## Results
According to my best validation result, my testing score should be around 0.60. However, I only got 0.46 on the private leaderboard. I think this is because I did not do cross-validation, so my single validation split was probably biased relative to the test set, which led to the large difference in performance. Next time I should do cross-validation to get a more reliable estimate of how my model performs on unseen data.
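
A possible way to set up such a cross-validation is sketched below, assuming scikit-learn is installed. This was not part of the submitted experiments; the toy corpus, the three folds, and the single MultinomialNB classifier are assumptions chosen to keep the example short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: two classes with six samples each.
texts = [
    "happy great day", "love this so much", "what a wonderful time",
    "great fun today", "so happy and glad", "love the sunshine",
    "awful terrible day", "hate this so much", "what a horrible time",
    "terrible mess today", "so angry and mad", "hate the traffic",
]
labels = [1] * 6 + [0] * 6

# Keeping the vectorizer inside the pipeline means it is refit on the
# training part of every fold, so no vocabulary leaks from the held-out part.
pipeline = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1_macro")

print(scores.mean(), scores.std())
```

A large spread between folds would be a warning that a single validation score such as 0.60 may not transfer to the leaderboard.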

---

# Code

In [None]:
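import json
import math
import os
import re
import time
from datetime import datetime

import pandas as pd
from tqdm import tqdm

# Everything used in the traditional machine learning cells below
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.svm import LinearSVC, NuSVC
from sklearn.utils import shuffle
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier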

from collections import Counter
import pickle

import tensorflow as tf
import numpy as np
from bs4 import BeautifulSoup

import bert
from bert.tokenization.bert_tokenization import FullTokenizer
from bert.loader import StockBertConfig, map_stock_config_to_params, load_stock_weights
from gensim.models import Word2Vec
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

## Preprocess

In [None]:
nltk.download('stopwords')
stop = stopwords.words('english')

def tokenizer_stem_nostop(text):
    porter = PorterStemmer()
    # raw strings avoid invalid-escape warnings for \s
    return [porter.stem(w) for w in re.split(r'\s+', text.strip())
            if w not in stop and re.match(r'[a-zA-Z]+', w)]

In [None]:
data_dir = './data'

id_data = pd.read_csv(os.path.join(data_dir, 'data_identification.csv'))
emotion_data = pd.read_csv(os.path.join(data_dir, 'emotion.csv'))

In [None]:
with open(os.path.join(data_dir, 'tweets_DM.json')) as f:
    all_content = f.read().splitlines()
len(all_content)

In [None]:
all_tweetids = []
all_scores = []
all_hashtags = []
all_textwords = []
for line in tqdm(all_content):
    # each line is one JSON record; json.loads is safer than eval
    content = json.loads(line)

    score = content['_score']
    hashtag = content['_source']['tweet']['hashtags']
    tweetid = content['_source']['tweet']['tweet_id']

    text = content['_source']['tweet']['text']       

    # strip HTML first: the \W substitution below would otherwise break the tags apart
    text = BeautifulSoup(text, 'html.parser').get_text()

    # regex for matching emoticons, keep emoticons, ex: :), :-P, :-D
    r = r'(?::|;|=|X)(?:-)?(?:\)|\(|D|P)'
    emoticons = re.findall(r, text)
    text = re.sub(r, '', text)

    # convert to lowercase and append all emoticons behind (with space in between)
    # replace('-','') removes nose of emoticons
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-','')


    all_tweetids.append(tweetid)
    all_scores.append(score)
    all_hashtags.append(hashtag)
    all_textwords.append(text)
        
content_data = pd.DataFrame({'tweet_id': all_tweetids, 'score': all_scores, 'hashtag': all_hashtags, 'words': all_textwords})
content_data.to_csv(os.path.join(data_dir, 'content.csv'), index=None, encoding = "ISO-8859-1")

In [None]:
all_data = pd.merge(content_data, id_data, on='tweet_id')
all_data['text_tokenized'] = all_data['words'].apply(lambda x: nltk.word_tokenize(x))
# all_data.to_csv(os.path.join(data_dir, 'all.csv'), index=None, columns=['tweet_id', 'score', 'hashtag', 'words', 'text_tokenized', 'identification'])

In [None]:
training_data = pd.merge(all_data, emotion_data, on='tweet_id')
testing_data = all_data[all_data['identification']=='test']
# training_data = pd.read_csv(os.path.join(data_dir, 'train.csv'))
# testing_data = pd.read_csv(os.path.join(data_dir, 'test.csv'))

## Traditional Machine Learning

In [None]:
target = training_data.emotion
# drop the label from a copy so training_data keeps its emotion column
# (the deep learning cells below still need it)
features = training_data.drop(['emotion'], axis=1)

In [None]:
le = LabelEncoder()
target = le.fit_transform(target)

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

In [None]:
def print_acc(model):
    # evaluate on the held-out validation split
    predicted = model.predict(X_test.words)
    accuracy = np.mean(predicted == y_test) * 100
    print(f'accuracy: {accuracy:.2f}%')
    print('macro f1:   ', f1_score(y_test, predicted, average='macro'))
    print('micro f1:   ', f1_score(y_test, predicted, average='micro'))
    print('weighted f1:', f1_score(y_test, predicted, average='weighted'))

In [None]:
clf_list = [
    MultinomialNB(),
    LogisticRegression(),
    RandomForestClassifier(),
    XGBClassifier(),
    NuSVC(),
    LinearSVC(),
    LGBMClassifier()
]

for clf in clf_list:
    nb_clf = Pipeline([('vect', CountVectorizer()), ('clf', clf)])
    nb_clf = nb_clf.fit(X_train.words,y_train)
    print_acc(nb_clf)

In [None]:
estimators = [
    ('xgb', XGBClassifier(n_estimators=120, max_depth=11, learning_rate=0.1)),
    ('lgb', LGBMClassifier(max_depth=4, num_leaves=20)),
    ('nb', MultinomialNB()),
    ('rf', RandomForestClassifier(n_estimators=120, max_depth=11))
]

start = time.time()
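# `stop` is the nltk stop-word list defined in the preprocessing cells;
# the cleaned text lives in the `words` column
nb_clf = Pipeline([
    ('vect', CountVectorizer(stop_words=stop, dtype=np.float32)),
    ('clf', StackingClassifier(estimators=estimators, final_estimator=LogisticRegression()))
])
nb_clf = nb_clf.fit(X_train.words,y_train)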
print_acc(nb_clf)
print(time.time() - start)

In [None]:
output = le.classes_[nb_clf.predict(testing_data.words)]
output_df = pd.DataFrame({'id': testing_data.tweet_id, 'emotion': output})
output_df.to_csv('output.csv', index=None)

## Deep Learning

### Data Structure of the Dataset

In [None]:
class EmotionalData:
    DATA_COLUMN = "words"
    LABEL_COLUMN = "emotion"
    
    def __init__(self, df, le, tokenizer: FullTokenizer, test=False, max_seq_len=1024, sample_size=None, ratio_dict=None):
        self.tokenizer = tokenizer
        self.max_seq_len = 0
        self.le = le
        
        if not test:        
            if sample_size is not None:
                # train, test = map(lambda df: df.sample(sample_size), [train, test])
                if ratio_dict is not None:
#                     N = (sample_size // len(self.le.classes_)) + 1
                    dfs = [df[df[self.LABEL_COLUMN]==key].sample(int(sample_size * val)) for key, val in ratio_dict.items()]
                    df = pd.concat(dfs)             
                else:
                    df = df.sample(sample_size)
                df = shuffle(df)
            self.train_x, self.train_y = self._prepare(df)            
            print("max seq_len", self.max_seq_len)
            self.max_seq_len = max_seq_len    
            self.train_x, self.train_x_token_types = self._pad(self.train_x)
        else:            
            self.test_x = self._test_prepare(df)
            self.max_seq_len = max_seq_len
            self.test_x, self.test_x_token_types = self._pad(self.test_x)
            

    def _prepare(self, df):
        x, y = [], []
        with tqdm(total=df.shape[0], unit_scale=True) as pbar:
            for ndx, row in df.iterrows():
                text, label = row[EmotionalData.DATA_COLUMN], row[EmotionalData.LABEL_COLUMN]
                tokens = self.tokenizer.tokenize(text)
                tokens = ["[CLS]"] + tokens + ["[SEP]"]
                token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
                self.max_seq_len = max(self.max_seq_len, len(token_ids))
                x.append(token_ids)
                y.append(label)
                pbar.update()
        # token id lists have different lengths, so store them as an object array
        return np.array(x, dtype=object), self.le.transform(np.array(y))
    
    def _test_prepare(self, df):
        x = []
        
        for ndx, row in tqdm(df.iterrows()):
            text = row[EmotionalData.DATA_COLUMN]
            tokens = self.tokenizer.tokenize(text)
            tokens = ["[CLS]"] + tokens + ["[SEP]"]
            token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
            self.max_seq_len = max(self.max_seq_len, len(token_ids))
            x.append(token_ids)
            
        # token id lists have different lengths, so store them as an object array
        return np.array(x, dtype=object)

    def _pad(self, ids):
        x, t = [], []
        token_type_ids = [0] * self.max_seq_len
        for input_ids in tqdm(ids):
            input_ids = input_ids[:min(len(input_ids), self.max_seq_len - 2)]
            input_ids = list(input_ids) + list([0] * (self.max_seq_len - len(input_ids)))
            x.append(np.array(input_ids))
            t.append(token_type_ids)
        return np.array(x), np.array(t)

In [None]:
# enc = OneHotEncoder(handle_unknown='ignore')
# X, y = training_data.words, enc.fit_transform(training_data.emotion.to_numpy().reshape(-1, 1))

# X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

In [None]:
model_dir = "../../../repos/bert/cased_L-12_H-768_A-12"

bert_params = bert.params_from_pretrained_ckpt(model_dir)
l_bert = bert.BertModelLayer.from_params(bert_params, name="bert")

In [None]:
c = Counter(training_data.emotion)
new_dict = {key: val**0.42 for key, val in c.items()}
N = sum(new_dict.values())
ratio_dict = {key: val/N for key, val in new_dict.items()}

tokenizer = FullTokenizer(vocab_file=os.path.join(model_dir, "vocab.txt"))
le = LabelEncoder().fit(training_data.emotion)
train_data = EmotionalData(training_data, le, tokenizer, max_seq_len=128, sample_size=45000, ratio_dict=ratio_dict)
test_data = EmotionalData(testing_data, le, tokenizer, test=True, max_seq_len=128)

In [None]:
print("            train_x", train_data.train_x.shape)
print("train_x_token_types", train_data.train_x_token_types.shape)
print("            train_y", train_data.train_y.shape)
print("        max_seq_len", train_data.max_seq_len)
print("             test_x", test_data.test_x.shape)
print("        max_seq_len", test_data.max_seq_len)

In [None]:
def flatten_layers(root_layer):
    if isinstance(root_layer, tf.keras.layers.Layer):
        yield root_layer
    for layer in root_layer._layers:
        for sub_layer in flatten_layers(layer):
            yield sub_layer


def freeze_bert_layers(l_bert):
    """
    Freezes all but LayerNorm and adapter layers - see arXiv:1902.00751.
    """
    for layer in flatten_layers(l_bert):
        if layer.name in ["LayerNorm", "adapter-down", "adapter-up"]:
            layer.trainable = True
        elif len(layer._layers) == 0:
            layer.trainable = False
        l_bert.embeddings_layer.trainable = False


def create_learning_rate_scheduler(max_learn_rate=5e-5,
                                   end_learn_rate=1e-7,
                                   warmup_epoch_count=10,
                                   total_epoch_count=90):

    def lr_scheduler(epoch):
        if epoch < warmup_epoch_count:
            res = (max_learn_rate/warmup_epoch_count) * (epoch + 1)
        else:
            res = max_learn_rate*math.exp(math.log(end_learn_rate/max_learn_rate)*(epoch-warmup_epoch_count+1)/(total_epoch_count-warmup_epoch_count+1))
        return float(res)
    learning_rate_scheduler = tf.keras.callbacks.LearningRateScheduler(lr_scheduler, verbose=1)

    return learning_rate_scheduler

In [None]:
bert_config_file = os.path.join(model_dir, "bert_config.json")
bert_ckpt_file   = os.path.join(model_dir, "bert_model.ckpt")

def create_model(max_seq_len, adapter_size=64):
    """Creates a classification model."""

    #adapter_size = 64  # see - arXiv:1902.00751

    # create the bert layer
    with tf.io.gfile.GFile(bert_config_file, "r") as reader:
        bc = StockBertConfig.from_json_string(reader.read())
        bert_params = map_stock_config_to_params(bc)
        bert_params.adapter_size = adapter_size
        l_bert = bert.BertModelLayer.from_params(bert_params, name="bert")

    input_ids      = tf.keras.layers.Input(shape=(max_seq_len,), dtype='int32', name="input_ids")
    # token_type_ids = keras.layers.Input(shape=(max_seq_len,), dtype='int32', name="token_type_ids")
    # output         = bert([input_ids, token_type_ids])
    output         = l_bert(input_ids)

    print("bert shape", output.shape)
    cls_out = tf.keras.layers.Lambda(lambda seq: seq[:, 0, :])(output)
    cls_out = tf.keras.layers.Dropout(0.5)(cls_out)
    logits = tf.keras.layers.Dense(units=768, activation="tanh")(cls_out)
    logits = tf.keras.layers.Dropout(0.5)(logits)
    logits = tf.keras.layers.Dense(units=len(le.classes_), activation="softmax")(logits)

    # model = keras.Model(inputs=[input_ids, token_type_ids], outputs=logits)
    # model.build(input_shape=[(None, max_seq_len), (None, max_seq_len)])
    model = tf.keras.Model(inputs=input_ids, outputs=logits)
    model.build(input_shape=(None, max_seq_len))

    # load the pre-trained model weights
    load_stock_weights(l_bert, bert_ckpt_file)

    # freeze weights if adapter-BERT is used
#     if adapter_size is not None:
    freeze_bert_layers(l_bert)

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                # the last Dense layer already applies softmax, so the loss receives probabilities
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                metrics=[tf.keras.metrics.SparseCategoricalAccuracy(name="acc")])

    model.summary()

    return model

In [None]:
adapter_size = None # no adapter layers; create_model still freezes the BERT weights via freeze_bert_layers
model = create_model(train_data.max_seq_len, adapter_size=adapter_size)

In [None]:
# model.load_weights('./emotion.h5')

In [None]:
%%time


log_dir = "./log/" + datetime.now().strftime("%Y%m%d-%H%M%S")
# also creates ./log if it does not exist yet
os.makedirs(log_dir, exist_ok=True)

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)

total_epoch_count = 50
# model.fit(x=(data.train_x, data.train_x_token_types), y=data.train_y,
model.fit(x=train_data.train_x, y=train_data.train_y,
          validation_split=0.2,
          batch_size=8,
          shuffle=True,
          epochs=total_epoch_count,
          callbacks=[tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=2, min_lr=1e-7),
                     tf.keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True),                   
                     tensorboard_callback])

In [None]:
model.save_weights('./emotion_30000_finetune.h5', overwrite=True)

In [None]:
@tf.function
def prediction_step(input_):
    return model(input_, training=False)

In [None]:
def process_output(confidences):
    return confidences.numpy().argmax()


In [None]:
output_list = []
for input_ in tqdm(test_data.test_x):
    class_num = process_output(prediction_step(tf.expand_dims(input_, axis=0)))
    output_list.append(le.classes_[class_num])
    
output_df = pd.DataFrame({'id': testing_data.tweet_id, 'emotion': output_list})
output_df.to_csv('output.csv', index=None)

In [None]:
!kaggle competitions submit -c dm2020-hw2-nthu -f output.csv -m "try bert balanced 0.4"