## Experiment log:
- First, we took only the tweet text, without keywords or locations, and tried a simple model: an embedding layer, followed by a mean layer, then the output layers. Stopwords were not removed and no stemming was performed. Accuracy varied between runs, ranging between 0.625 and 0.96875. We can do better.
- Next, we tried removing stopwords. Accuracy did not peak as high as before, but the variance was lower. Accuracy ranged between 0.46875 and 0.8125.
- We tried stemming alone. The result was similar to removing stopwords: accuracy ranged between 0.5 and 0.8125.
- Combining stemming and stopword removal made training smoother, with accuracy fluctuating less and increasing almost steadily. The range was between 0.46875 and 0.75.
- Looking at the keywords, they seem helpful. They are the one or few words that are the main focus of the tweet. We could use that.
- We tried simply appending the keywords to the end of the original tweets, although they are there anyway. We thought this would help the model pay more attention to the keyword and so capture the sentiment of the tweet. It may have, but we did not see any significant increase in accuracy, which stayed between 0.5 and 0.8125. We need to think of another way to include them.
- Next, we tried an Embedding, LSTM, Fn (select_last), Dense (output), LogSoftmax/Sigmoid/Relu architecture. It was terrible! The accuracy was 0.5 most of the time, and even dropped to 0.40625 briefly.
- We then tried to be creative :D We created two branches of Embedding, Mean, Dense, and LogSoftmax, one to process the tweet text and the other to process the keyword alone, then averaged the log-softmax scores. The results were much better! The accuracy peaked at 1.0 briefly. We then tweaked the learning rate and number of iterations, not just to get higher accuracy but also to make it steadier and more consistent.
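
A side note on the numbers in this log: every accuracy quoted above is a multiple of 1/32. That is what one would expect if each evaluation scores a single batch of 32 tweets, which is an assumption here (based on the batch size used later in the notebook), not something the log states. A minimal sketch of the arithmetic:

```python
from fractions import Fraction

# Assumption: each reported accuracy comes from one evaluation batch of 32 tweets.
EVAL_BATCH = 32

# Accuracy values quoted in the experiment log above.
reported = [0.625, 0.96875, 0.46875, 0.8125, 0.5, 0.75, 0.40625, 1.0]

def correct_count(accuracy, batch=EVAL_BATCH):
    """Return how many tweets out of `batch` were classified correctly."""
    return Fraction(accuracy) * batch

counts = [correct_count(a) for a in reported]
for acc, n in zip(reported, counts):
    print(f"{acc:<8} -> {n} / {EVAL_BATCH} correct")

# A single tweet flipping from wrong to right moves the metric by this much.
step = 1 / EVAL_BATCH
```

If that assumption holds, a move from 0.75 to 0.8125 is only two tweets, so part of the fluctuation between runs may simply be the coarse resolution of the metric. Evaluating over more batches would likely give a steadier reading.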

In [1]:
import os
import shutil
import re
import string
import random
import pandas as pd
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import trax.fastmath.numpy as np
from trax import layers as tl
from trax import optimizers
from trax.supervised import training

## Loading tweets

In [2]:
VAL_PCT = 0.2
MODEL_DIR = './model'
OUTPUT_DIR = './output'
stopwords_english = stopwords.words('english')

In [3]:
all_train_tweets = pd.read_csv('data/train.csv')
all_test_tweets = pd.read_csv('data/test.csv')

In [4]:
all_train_tweets.loc[all_train_tweets.target == 1].head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [5]:
all_train_tweets.loc[all_train_tweets.target == 0].head()

Unnamed: 0,id,keyword,location,text,target
15,23,,,What's up man?,0
16,24,,,I love fruits,0
17,25,,,Summer is lovely,0
18,26,,,My car is so fast,0
19,28,,,What a goooooooaaaaaal!!!!!!,0


In [6]:
all_train_tweets.location.loc[~all_train_tweets.location.isna()].head()

31                       Birmingham
32    Est. September 2012 - Bristol
33                           AFRICA
34                 Philadelphia, PA
35                       London, UK
Name: location, dtype: object

In [7]:
all_train_tweets.keyword.loc[~all_train_tweets.keyword.isna()].value_counts()

fatalities               45
deluge                   42
armageddon               42
sinking                  41
damage                   41
                         ..
forest%20fire            19
epicentre                12
threat                   11
inundation               10
radiation%20emergency     9
Name: keyword, Length: 221, dtype: int64

Keywords 

In [8]:
# This is what the tweets looked like when we tried appending the keywords
(all_train_tweets.text + ' ' + all_train_tweets.keyword.fillna('')).loc[~all_train_tweets.keyword.isna()].to_list()[:5]

['@bbcmtd Wholesale Markets ablaze http://t.co/lHYXEOHY6C ablaze',
 'We always try to bring the heavy. #metal #RT http://t.co/YAo1e0xngw ablaze',
 '#AFRICANBAZE: Breaking news:Nigeria flag set ablaze in Aba. http://t.co/2nndBGwyEi ablaze',
 'Crying out for more! Set me ablaze ablaze',
 'On plus side LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE http://t.co/qqsmshaJ3N ablaze']

## Preprocessing

In [9]:
def process_tweet(tweet, remove_stopwords=False, stem=False):
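    # Remove hyperlinks (this pattern also drops any text that follows the URL on the same line)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # Remove the hash sign from hashtags (the word itself is kept)
    tweet = re.sub(r'#', '', tweet)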
    # Remove stock market tickers
    tweet = re.sub(r'\$\w*', '', tweet)
    # Remove the old-style retweet marker "RT" at the start of the tweet
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # Tokenize tweet
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet = [word for word in tokenizer.tokenize(tweet) if word not in string.punctuation]
    if remove_stopwords:
        tweet = [word for word in tweet if word not in stopwords_english]
    if stem:
        stemmer = PorterStemmer()
        tweet = [stemmer.stem(word) for word in tweet]
    return tweet
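
The hyperlink pattern above ends in `.*`, so it removes the URL together with anything after it on the same line. The sketch below compares it with a narrower `\S+` pattern on a made-up tweet, using only the standard `re` module. The sample text and variable names are illustrative assumptions, not part of the notebook.

```python
import re

# Hypothetical tweet with text after the link.
sample = "Forest fire near the lake http://t.co/abc123 stay safe #wildfire"

# Pattern ending in `.*`: consumes the URL and everything after it on the line.
greedy = re.sub(r'https?://.*[\r\n]*', '', sample)

# Pattern ending in `\S+`: consumes only the URL itself.
url_only = re.sub(r'https?://\S+', '', sample)

# Dropping the '#' keeps the hashtag word as a normal token.
no_hash = re.sub(r'#', '', url_only)

print(repr(greedy))
print(repr(url_only))
print(repr(no_hash))
```

Whether the narrower pattern helps is an open question for this dataset; links often sit at the end of a tweet, in which case both patterns give the same result.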

In [10]:
all_train_tweets['text_clean'] = all_train_tweets.text.apply(process_tweet, args=(True, True))
all_train_tweets['keyword_clean'] = all_train_tweets.keyword.fillna('__UNK__').apply(process_tweet, args=(True, True))

all_test_tweets['text_clean'] = all_test_tweets.text.apply(process_tweet, args=(True, True))
all_test_tweets['keyword_clean'] = all_test_tweets.keyword.fillna('__UNK__').apply(process_tweet, args=(True, True))

In [11]:
# The purpose of this is to show how long the sequences we are dealing with are.
# The longer the sequence, the trickier it is to capture the whole meaning of the tweet.
all_train_tweets.text_clean.map(lambda t: len(t)).quantile([0.0, 0.25, .50, 0.75, 1.0])

0.00     0.0
0.25     6.0
0.50     9.0
0.75    12.0
1.00    26.0
Name: text_clean, dtype: float64

In [12]:
# Building vocabulary
vocab = {
    '__PAD__': 0,
    '__</e>__': 1,
    '__UNK__': 2,
}
for tweet in all_train_tweets.text_clean.to_list():
    for word in tweet:
        if word not in vocab:
            vocab[word] = len(vocab)

# Tweet to tensor
def tweet_to_tensor(tweet, vocab):
    return [vocab.get(token, vocab['__UNK__']) for token in tweet] + [vocab['__</e>__']]

all_train_tweets['text_clean'] = all_train_tweets.text_clean.apply(tweet_to_tensor, args=(vocab,))
all_train_tweets['keyword_clean'] = all_train_tweets.keyword_clean.apply(tweet_to_tensor, args=(vocab,))

all_test_tweets['text_clean'] = all_test_tweets.text_clean.apply(tweet_to_tensor, args=(vocab,))
all_test_tweets['keyword_clean'] = all_test_tweets.keyword_clean.apply(tweet_to_tensor, args=(vocab,))
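
To make the vocabulary and tensor steps concrete, here is a small sketch on toy data. The token lists and the names `toy_vocab` and `to_ids` are made up for illustration; the logic mirrors the cell above, where any word not seen while building the vocabulary falls back to `__UNK__`.

```python
# Toy tokenized tweets (hypothetical, already cleaned).
toy_tweets = [['forest', 'fire', 'near', 'lake'], ['love', 'fire', 'emoji']]

# Same three special tokens as in the notebook's vocabulary.
toy_vocab = {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2}
for tweet in toy_tweets:
    for word in tweet:
        if word not in toy_vocab:
            toy_vocab[word] = len(toy_vocab)

def to_ids(tokens, vocab):
    """Map tokens to ids, unknown words to __UNK__, and append the end marker."""
    return [vocab.get(t, vocab['__UNK__']) for t in tokens] + [vocab['__</e>__']]

seen = to_ids(['forest', 'fire'], toy_vocab)
unseen = to_ids(['flood', 'fire'], toy_vocab)
print(seen, unseen)
```

Because the vocabulary is built from the tweet text only, a keyword that never appears in any tweet text would be encoded as `__UNK__` in the same way as `flood` here.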

In [13]:
# Train/Validation split
all_pos_train = all_train_tweets.loc[all_train_tweets.target == 1]
all_neg_train = all_train_tweets.loc[all_train_tweets.target == 0]

pos_cut_idx = int(all_pos_train.shape[0] * (1 - VAL_PCT))
pos_val = all_pos_train.iloc[pos_cut_idx:]
pos_train = all_pos_train.iloc[:pos_cut_idx]

neg_cut_idx = int(all_neg_train.shape[0] * (1 - VAL_PCT))
neg_val = all_neg_train.iloc[neg_cut_idx:]
neg_train = all_neg_train.iloc[:neg_cut_idx] 

all_train = pd.concat([pos_train, neg_train])
all_val = pd.concat([pos_val, neg_val])
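
The split above cuts each class separately, so the training and validation sets keep roughly the same class ratio. A minimal plain-Python sketch of the same idea follows; `split_by_class` and the toy data are hypothetical names used only for illustration.

```python
def split_by_class(examples, labels, val_pct):
    """Split each class separately so both sets keep the class ratio."""
    train, val = [], []
    for cls in sorted(set(labels)):
        members = [x for x, y in zip(examples, labels) if y == cls]
        cut = int(len(members) * (1 - val_pct))
        train += [(x, cls) for x in members[:cut]]
        val += [(x, cls) for x in members[cut:]]
    return train, val

# Toy data: 10 examples of class 1 followed by 10 of class 0.
examples = [f"tweet{i}" for i in range(20)]
labels = [1] * 10 + [0] * 10

train, val = split_by_class(examples, labels, val_pct=0.2)
print(len(train), len(val))
```

Like the notebook cell, this sketch takes the last 20% of each class in file order without shuffling first; if the file is ordered in some meaningful way (for example by keyword), shuffling before the cut might give a more representative validation set.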

In [71]:
def data_generator(text_pos, text_neg, keyword_pos, keyword_neg, batch_size, vocab, loop=False):
    len_pos = len(text_pos)
    len_neg = len(text_neg)
    
    pos_idx_lines =  list(range(len_pos))
    neg_idx_lines = list(range(len_neg))
    
    pos_idx = 0
    neg_idx = 0
    
    n_to_take = batch_size // 2
    
    random.shuffle(pos_idx_lines)
    random.shuffle(neg_idx_lines)
    
    stop = False
    
    while not stop:
        batch_text = []
        batch_keyword = []
        targets = []
        max_len_text = 0
        max_len_keyword = 0
        for i in range(n_to_take):
            if pos_idx >= len_pos or neg_idx >= len_neg:
                if not loop:
                    stop = True
                    break
                if pos_idx >= len_pos:
                    pos_idx = 0
                    random.shuffle(pos_idx_lines)
                if neg_idx >= len_neg:
                    neg_idx = 0
                    random.shuffle(neg_idx_lines)
                    
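            # Look up through the shuffled index list so each pass sees a new order
            pos_text = text_pos[pos_idx_lines[pos_idx]]
            pos_keyword = keyword_pos[pos_idx_lines[pos_idx]]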
            batch_text.append(pos_text)
            batch_keyword.append(pos_keyword)
            targets.append(1)
            if len(pos_text) > max_len_text:
                max_len_text = len(pos_text)
            if len(pos_keyword) > max_len_keyword:
                max_len_keyword = len(pos_keyword)

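            # Look up through the shuffled index list so each pass sees a new order
            neg_text = text_neg[neg_idx_lines[neg_idx]]
            neg_keyword = keyword_neg[neg_idx_lines[neg_idx]]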
            batch_text.append(neg_text)
            batch_keyword.append(neg_keyword)
            targets.append(0)
            if len(neg_text) > max_len_text:
                max_len_text = len(neg_text)
            if len(neg_keyword) > max_len_keyword:
                max_len_keyword = len(neg_keyword)

            pos_idx += 1
            neg_idx += 1
                
        if stop:
            break
            
        
        # Padding: build new lists so the stored tweets are not extended in place
        batch_text = [elem + [vocab['__PAD__']] * (max_len_text - len(elem)) for elem in batch_text]
        batch_keyword = [elem + [vocab['__PAD__']] * (max_len_keyword - len(elem)) for elem in batch_keyword]
            
        example_weights = np.array([1] * (n_to_take * 2))
            
        yield np.array(batch_text), np.array(batch_keyword), np.array(targets), example_weights
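
The generator pads every sequence in a batch to the length of the longest one. Since it loops over the same lists many times, it helps to pad copies rather than the stored sequences. A minimal sketch, where `pad_batch` and `PAD_ID` are hypothetical names and id 0 is assumed to be the padding token as in the vocabulary above:

```python
PAD_ID = 0  # assumption: id 0 is the padding token

def pad_batch(sequences, pad_id=PAD_ID):
    """Return new lists padded to the longest sequence; inputs are untouched."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Toy batch of id sequences of different lengths.
tweets = [[3, 4, 1], [5, 1], [6, 7, 8, 9, 1]]
padded = pad_batch(tweets)
print(padded)
print(tweets)  # the original lists keep their original lengths
```

Using `seq + [...]` creates a new list, whereas `seq += [...]` would extend the original list and the padding would carry over into later batches.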

In [72]:
def train_generator(batch_size, train_text_pos, train_text_neg, train_keyword_pos, train_keyword_neg, vocab):
    return data_generator(
        text_pos=train_text_pos,
        text_neg=train_text_neg,
        keyword_pos=train_keyword_pos,
        keyword_neg=train_keyword_neg,
        batch_size=batch_size,
        vocab=vocab,
        loop=True
    )
def val_generator(batch_size, val_text_pos, val_text_neg, val_keyword_pos, val_keyword_neg, vocab):
    return data_generator(
        text_pos=val_text_pos,
        text_neg=val_text_neg,
        keyword_pos=val_keyword_pos,
        keyword_neg=val_keyword_neg,
        batch_size=batch_size,
        vocab=vocab,
        loop=True
    )

## Building the model

#### First trial: word embeddings with a single hidden layer. This should suffice for sentiment analysis tasks like this one.

In [73]:
# Embedding dimension. Something to experiment with.
EMBED_DIM = 256
BATCH_SIZE = 32

In [78]:
def select_last(seq):
    return seq[:,-1,:]
    
def calc_avg(text_score, keyword_score):
    return (text_score + keyword_score) / 2


def classifier(vocab_size, embedding_dim):
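    # The standalone layers below are left over from the earlier single-branch and
    # LSTM experiments; the two-branch model returned at the end does not use them.
    embedding = tl.Embedding(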
        vocab_size=vocab_size,
        d_feature=embedding_dim,
    )
    l_mean = tl.Mean(axis=1)
    lstm = tl.LSTM(embedding_dim)
    l_fn = tl.Fn('select_last', select_last)
    dense1 = tl.Dense(n_units=embedding_dim)
    dense2 = tl.Dense(n_units=2)
    logsoftmax = tl.LogSoftmax()
    relu = tl.Relu()
    sigmoid = tl.Sigmoid()
    
    return tl.Serial(
        tl.Parallel(
            tl.Serial(
                tl.Embedding(
                    vocab_size=vocab_size,
                    d_feature=embedding_dim,
                ),
                tl.Mean(axis=1),
                tl.Dense(n_units=2),
                tl.LogSoftmax(),
            ),
            tl.Serial(
                tl.Embedding(
                    vocab_size=vocab_size,
                    d_feature=embedding_dim,
                ),
                tl.Mean(axis=1),
                tl.Dense(n_units=2),
                tl.LogSoftmax(),
            ),
        ),
        tl.Fn('avg', calc_avg)
    )
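
To show what the final `avg` step computes, here is a plain-Python sketch of averaging the log-softmax scores of two branches. It uses only the `math` module rather than Trax, and the logits are made-up numbers; it illustrates the arithmetic, not the trained model.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def average_scores(text_logits, keyword_logits):
    """Average the per-class log-probabilities of the two branches."""
    text_lp = log_softmax(text_logits)
    keyword_lp = log_softmax(keyword_logits)
    return [(t + k) / 2 for t, k in zip(text_lp, keyword_lp)]

# Hypothetical logits: the text branch is unsure, the keyword branch leans to class 1.
combined = average_scores([0.1, 0.0], [-1.0, 2.0])
predicted = max(range(len(combined)), key=combined.__getitem__)
print(combined, predicted)
```

Averaging log-probabilities corresponds to taking the geometric mean of the two branches' probabilities, so the combined scores no longer sum to exactly 1 after exponentiation. For picking the larger class this does not matter.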

## Training the model

In [85]:
def get_train_eval_tasks(
    train_text_pos, train_text_neg, train_keyword_pos, train_keyword_neg,
    val_text_pos, val_text_neg, val_keyword_pos, val_keyword_neg,
    batch_size, vocab
):
    train_task = training.TrainTask(
        labeled_data=train_generator(
            batch_size=batch_size,
            train_text_pos=train_text_pos,
            train_text_neg=train_text_neg,
            train_keyword_pos=train_keyword_pos,
            train_keyword_neg=train_keyword_neg,
            vocab=vocab,
        ),
        loss_layer=tl.CrossEntropyLoss(),
        optimizer=optimizers.Adam(0.001),
        n_steps_per_checkpoint=10,
    )
    eval_task = training.EvalTask(
        labeled_data=val_generator(
            batch_size=batch_size,
            val_text_pos=val_text_pos,
            val_text_neg=val_text_neg,
            val_keyword_pos=val_keyword_pos,
            val_keyword_neg=val_keyword_neg,
            vocab=vocab,
        ),
        metrics=[
            tl.CrossEntropyLoss(),
            tl.Accuracy(),
        ]
    )
    
    return train_task, eval_task

In [86]:
model = classifier(len(vocab), EMBED_DIM)
train_task, eval_task = get_train_eval_tasks(
    train_text_pos=pos_train.text_clean.to_list(),
    train_text_neg=neg_train.text_clean.to_list(),
    train_keyword_pos=pos_train.keyword_clean.to_list(),
    train_keyword_neg=neg_train.keyword_clean.to_list(),
    val_text_pos=pos_val.text_clean.to_list(),
    val_text_neg=neg_val.text_clean.to_list(),
    val_keyword_pos=pos_val.keyword_clean.to_list(),
    val_keyword_neg=neg_val.keyword_clean.to_list(),
    batch_size=BATCH_SIZE,
    vocab=vocab,
)

In [87]:
model

Serial_in2[
  Parallel_in2_out2[
    Serial[
      Embedding_11964_256
      Mean
      Dense_2
      LogSoftmax
    ]
    Serial[
      Embedding_11964_256
      Mean
      Dense_2
      LogSoftmax
    ]
  ]
  avg_in2
]

In [88]:
def train_model(model, train_task, eval_task, n_steps, output_dir):
    training_loop = training.Loop(
        model=model,
        tasks=train_task,
        eval_tasks=eval_task,
        output_dir=output_dir,
    )
    
    training_loop.run(n_steps=n_steps)
    
    return training_loop

In [89]:
# Empty the model directory, otherwise old checkpoints might fail to load and training would fail
os.makedirs(MODEL_DIR, exist_ok=True)  # make sure the directory exists before listing it
for filename in os.listdir(MODEL_DIR):
    file_path = os.path.join(MODEL_DIR, filename)
    try:
        if os.path.isfile(file_path) or os.path.islink(file_path):
            os.unlink(file_path)
        elif os.path.isdir(file_path):
            shutil.rmtree(file_path)
    except Exception as e:
        print('Failed to delete %s. Reason: %s' % (file_path, e))
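
A shorter way to get an empty directory is to delete it as a whole and recreate it. The sketch below demonstrates this on a throwaway temporary directory; `reset_dir` is a hypothetical helper, not part of the notebook.

```python
import os
import shutil
import tempfile

def reset_dir(path):
    """Delete `path` with everything in it, then recreate it empty."""
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path, exist_ok=True)

# Demonstrate on a throwaway directory instead of the real model directory.
demo_dir = os.path.join(tempfile.mkdtemp(), 'model')
os.makedirs(demo_dir)
open(os.path.join(demo_dir, 'old_checkpoint.pkl'), 'w').close()

reset_dir(demo_dir)
remaining = os.listdir(demo_dir)
print(remaining)
```

The trade-off is that `ignore_errors=True` hides deletion failures, whereas the loop above reports each file it could not remove.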

In [90]:
training_loop = train_model(
    model=model,
    train_task=train_task,
    eval_task=[eval_task],
    n_steps=500,
    output_dir=MODEL_DIR,
)


Step      1: Total number of trainable weights: 6126596
Step      1: Ran 1 train steps in 2.25 secs
Step      1: train CrossEntropyLoss |  0.68846506
Step      1: eval  CrossEntropyLoss |  0.68605191
Step      1: eval          Accuracy |  0.78125000

Step     10: Ran 9 train steps in 9.59 secs
Step     10: train CrossEntropyLoss |  0.69075823
Step     10: eval  CrossEntropyLoss |  0.68003577
Step     10: eval          Accuracy |  0.84375000

Step     20: Ran 10 train steps in 4.49 secs
Step     20: train CrossEntropyLoss |  0.68754375
Step     20: eval  CrossEntropyLoss |  0.65674168
Step     20: eval          Accuracy |  0.75000000

Step     30: Ran 10 train steps in 6.76 secs
Step     30: train CrossEntropyLoss |  0.71607256
Step     30: eval  CrossEntropyLoss |  0.68104517
Step     30: eval          Accuracy |  0.59375000

Step     40: Ran 10 train steps in 2.88 secs
Step     40: train CrossEntropyLoss |  0.68957978
Step     40: eval  CrossEntropyLoss |  0.68974495
Step     40: eva

## Testing the model

In [None]:
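# Prediction
# The model takes two inputs (tweet text and keyword), so both are padded and passed together.
def pad(tweet, max_len):
    return tweet + ([0] * (max_len - len(tweet)))

max_len_text = all_test_tweets.text_clean.map(len).max().item()
max_len_keyword = all_test_tweets.keyword_clean.map(len).max().item()

test_text = np.array(all_test_tweets.text_clean.apply(pad, args=(max_len_text,)).to_list())
test_keyword = np.array(all_test_tweets.keyword_clean.apply(pad, args=(max_len_keyword,)).to_list())

preds = training_loop.eval_model((test_text, test_keyword))
# Each prediction is a pair of scores [class 0, class 1]; pick the larger one
target = np.array([pred[1] > pred[0] for pred in preds]).astype(np.float32)
all_test_tweets['target'] = target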
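
The last step of the prediction cell turns a pair of class scores into a 0/1 label. A minimal sketch with made-up scores follows; `to_labels` is a hypothetical helper and the numbers are not model output.

```python
def to_labels(log_probs):
    """Pick class 1 when its score beats class 0, else class 0."""
    return [int(lp[1] > lp[0]) for lp in log_probs]

# Hypothetical model outputs: one [class 0, class 1] pair per tweet.
scores = [[-0.2, -1.7], [-2.3, -0.1], [-0.7, -0.7]]
labels = to_labels(scores)
print(labels)
```

With a strict `>` comparison, an exact tie is labelled 0, as in the third example.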

In [None]:
# Output
os.makedirs(OUTPUT_DIR, exist_ok=True)  # make sure the output directory exists
all_test_tweets[['id', 'target']].to_csv(f'{OUTPUT_DIR}/submission.csv', index=False)