# Can we predict disasters from tweets using NLP?
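This is one of Kaggle's "Getting Started" competitions. The code below fine-tunes a pretrained BERT encoder (`bert-base-uncased`) with a small dense classification head in TensorFlow; TensorFlow's [RNN text classification tutorial](https://www.tensorflow.org/text/tutorials/text_classification_rnn) covers a related approach to the same kind of problem.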

## Section 1: Setup

In [1]:
# packages
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import Model, layers, optimizers
from transformers import BertTokenizer, TFBertModel

# pretrained tokenizer and encoder (weights are downloaded on first use)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = TFBertModel.from_pretrained('bert-base-uncased')

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

# set to False to skip the tweet-cleaning step
is_clean = True

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Num GPUs Available:  1


In [2]:
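# load data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
# only the tweet text is used as a feature; test_df keeps 'id' for the submission file
train_df = train_df.drop(columns=['id', 'keyword', 'location'])
test_df = test_df.drop(columns=['keyword', 'location'])
print('max number of words is', max(len(x.split()) for x in train_df['text']))
# sequence length (in tokens) used by the tokenizer; longer inputs are truncated
max_length = 32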

max number of words is 31
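
The figure above counts whitespace-separated words, while `max_length` applies to tokenizer output. BERT adds `[CLS]` and `[SEP]` and splits punctuation and rare words into separate pieces, so token counts are usually higher than word counts. The sketch below illustrates the gap with a toy regex tokenizer; `toy_tokenize` and `length_report` are hypothetical helpers, not BERT's WordPiece. With the real tokenizer one would measure `len(tokenizer(text)['input_ids'])` instead.

```python
import re

def toy_tokenize(text):
    # Hypothetical stand-in for a subword tokenizer: separates words from
    # punctuation marks. Real WordPiece output is typically longer still.
    return re.findall(r"\w+|[^\w\s]", text)

def length_report(texts, tokenize, special_tokens=2):
    # Longest text measured in whitespace words vs. tokens
    # (special_tokens accounts for [CLS] and [SEP]).
    max_words = max(len(t.split()) for t in texts)
    max_tokens = max(len(tokenize(t)) + special_tokens for t in texts)
    return max_words, max_tokens

# Two short sample strings in the style of the training tweets.
sample = [
    "Forest fire near La Ronge Sask. Canada",
    "#RockyFire Update => California Hwy. 20 closed",
]
max_words, max_tokens = length_report(sample, toy_tokenize)
print(max_words, max_tokens)
```

If the token-based maximum exceeds `max_length`, the longest tweets are truncated, which may or may not matter for this task.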


In [3]:
# data cleaning
import re

def process_tweet(tweet):
    # remove stock market tickers like $GE (disabled)
    # tweet = re.sub(r'\$\w*', '', tweet)
    # remove old-style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks; \S+ stops at the first whitespace, so text after the URL is kept
    tweet = re.sub(r'https?://\S+', '', tweet)
    # remove only the hash sign # from hashtags (disabled)
    # tweet = re.sub(r'#', '', tweet)
    # tweet = re.sub('\n', '', tweet)
    # remove words containing digits (disabled)
    # tweet = re.sub(r'\w*\d\w*', '', tweet)
    return tweet

if is_clean:
    print(train_df.text[:10])
    # assign whole columns at once; element-wise `df.text[j] = ...` is chained
    # assignment and triggers pandas' SettingWithCopyWarning
    train_df['text'] = train_df['text'].apply(process_tweet)
    print(train_df.text[:10])
    test_df['text'] = test_df['text'].apply(process_tweet)

0    Our Deeds are the Reason of this #earthquake M...
1               Forest fire near La Ronge Sask. Canada
2    All residents asked to 'shelter in place' are ...
3    13,000 people receive #wildfires evacuation or...
4    Just got sent this photo from Ruby #Alaska as ...
5    #RockyFire Update => California Hwy. 20 closed...
6    #flood #disaster Heavy rain causes flash flood...
7    I'm on top of the hill and I can see a fire in...
8    There's an emergency evacuation happening now ...
9    I'm afraid that the tornado is coming to our a...
Name: text, dtype: object
0    Our Deeds are the Reason of this #earthquake M...
1               Forest fire near La Ronge Sask. Canada
2    All residents asked to 'shelter in place' are ...
3    13,000 people receive #wildfires evacuation or...
4    Just got sent this photo from Ruby #Alaska as ...
5    #RockyFire Update => California Hwy. 20 closed...
6    #flood #disaster Heavy rain causes flash flood...
7    I'm on top of the hill and I can s
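
The choice of URL pattern matters more than it may seem. In a regex, `.` matches any character except a newline, so a greedy `.*` after `https?://` consumes the rest of the line and not just the link. The small check below, on a made-up tweet, compares such a greedy pattern with one that stops at the first whitespace (`\S+`); the constant names are illustrative.

```python
import re

# Greedy: everything from the URL to the end of the line is matched.
GREEDY_URL = re.compile(r"https?://.*[\r\n]*")
# Bounded: the match ends at the first whitespace character.
BOUNDED_URL = re.compile(r"https?://\S+")

tweet = "Flood warning http://t.co/abc123 stay safe"  # made-up example

greedy = GREEDY_URL.sub("", tweet)
bounded = BOUNDED_URL.sub("", tweet)
print(repr(greedy))
print(repr(bounded))
```

With the bounded pattern the words after the link survive, at the cost of a leftover double space that the tokenizer should tolerate.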


In [4]:
# model setup & training
x_train = tokenizer(
    text=train_df['text'].tolist(),
    add_special_tokens=True,
    max_length=max_length,
    truncation=True,
    padding='max_length',  # pad every sequence to max_length so it matches the Input shape below
    return_tensors='tf',
    return_attention_mask=True,
    verbose=True
)
input_ids = layers.Input(shape=(max_length,), dtype=tf.int32, name='input_ids')
attention_mask = layers.Input(shape=(max_length,), dtype=tf.int32, name='attention_mask')
# index 1 of the BERT outputs is the pooled output (one vector per tweet)
embedding = bert(input_ids, attention_mask=attention_mask)[1]

# classification head (dropout layers left disabled)
# out = layers.Dropout(0.1)(embedding)
out = layers.Dense(64, activation='relu')(embedding)
# out = layers.Dropout(0.1)(out)
out = layers.Dense(16, activation='relu')(out)
y = layers.Dense(1, activation='sigmoid')(out)

nn = Model(inputs=[input_ids, attention_mask], outputs=y)
# fine-tune the BERT encoder together with the dense head
bert.trainable = True
optimizer = optimizers.Adam(learning_rate=1e-4)

y_train = train_df['target'].values

nn.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

nn.fit(
    x={'input_ids': x_train['input_ids'], 'attention_mask': x_train['attention_mask']},
    y=y_train,
    validation_split=0.01,
    epochs=2,
    batch_size=64
)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x2aae7799f790>
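
Keras takes the `validation_split` fraction from the end of the arrays, before any shuffling, so with `0.01` the validation accuracy rests on a small, unshuffled tail of the data. One alternative is to shuffle the row indices and hold out a larger share explicitly, then pass the held-out part through `validation_data`. A minimal sketch, in which `holdout_indices`, the 20% share, and the seed are illustrative choices and not part of the original notebook:

```python
import random

def holdout_indices(n_samples, val_fraction=0.2, seed=42):
    # Shuffle all row indices reproducibly, then cut off the validation share.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(round(n_samples * val_fraction)))
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = holdout_indices(1000)
print(len(train_idx), len(val_idx))
```

The two index lists could then select rows from the tokenized tensors (for example with `tf.gather`) and from the label array.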

In [5]:
# test data
x_test = tokenizer(
    text=test_df.text.tolist(),
    add_special_tokens=True,
    max_length=max_length,
    truncation=True,
    padding='max_length',  # same fixed length as the training inputs
    return_tensors='tf',
    return_attention_mask=True,
    verbose=True
)

predicted = nn.predict({'input_ids': x_test['input_ids'], 'attention_mask': x_test['attention_mask']})
# threshold the sigmoid outputs at 0.5 and flatten to a 1-D array of 0/1 labels
predictions = np.where(predicted > 0.5, 1, 0).reshape(-1)
test_id = test_df['id'].tolist()
df = pd.DataFrame(data={'id': test_id, 'target': predictions})
df.to_csv('submission.csv', index=False)

In [6]:
# accuracy of the model on the training data (disabled)
# predicted = nn.predict({'input_ids':x_train['input_ids'],'attention_mask':x_train['attention_mask']})
# predictions = np.where(predicted>0.5,1,0)
# predictions = predictions.reshape(-1,)
# acc = sum(predictions==y_train)/len(y_train)*100
# acc
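
Accuracy alone can hide poor recall on the positive class, so precision, recall, and F1 are worth reporting alongside it. Below is a dependency-free sketch over thresholded 0/1 predictions; the function name and the toy labels are illustrative, not results from this model.

```python
def precision_recall_f1(y_true, y_pred):
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Applied to a held-out set, `y_pred` would be the flattened 0/1 array produced by thresholding `nn.predict(...)`.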
