<a href="https://colab.research.google.com/github/lee-messi/machine-learning/blob/main/disaster_tweets_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install tensorflow --quiet
!pip install tensorflow-hub --quiet
!pip install tensorflow-text --quiet
!pip install tensorflow-addons --quiet
!pip install tensorflow-datasets --quiet

In [2]:
import os, re
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
import tensorflow_addons as tfa
import tensorflow_datasets as tfds

tf.get_logger().setLevel('ERROR')
os.environ["TFHUB_MODEL_LOAD_FORMAT"]="UNCOMPRESSED"


TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 



In [3]:
if os.environ.get('COLAB_TPU_ADDR'):
  cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
  tf.config.experimental_connect_to_cluster(cluster_resolver)
  tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
  strategy = tf.distribute.TPUStrategy(cluster_resolver)
  print('Using TPU')
elif tf.config.list_physical_devices('GPU'):
  strategy = tf.distribute.MirroredStrategy()
  print('Using GPU')
else:
  raise ValueError('Running on CPU is not recommended.')

Using TPU


## Disaster Tweets

In [4]:
dataset = pd.read_csv('drive/MyDrive/Colab Notebooks/disaster-tweets/train.csv')
print(dataset.columns)

Index(['id', 'keyword', 'location', 'text', 'target'], dtype='object')


We are going to use the tweet text to predict whether or not the tweet is about an actual disaster.

In [5]:
dataset.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64

There are 4,342 tweets with label 0 and 3,271 tweets with label 1. Let's balance the classes by sampling 3,000 tweets from each.

In [6]:
dataset = dataset.groupby('target').sample(3000)
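
`groupby('target').sample(3000)` draws 3,000 rows per class without replacement, so each class must hold at least 3,000 rows (both do here). As a rough illustration of the same per-class downsampling in plain Python, the sketch below uses made-up tweets with the class counts shown above; the helper name `downsample_per_class` and the seed are assumptions for the example, not part of the notebook.

```python
import random
from collections import Counter, defaultdict

def downsample_per_class(rows, n_per_class, seed=0):
    """Draw n_per_class rows without replacement from each label group.

    `rows` is a list of (text, label) pairs. Hypothetical helper for
    illustration only.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for text, label in rows:
        groups[label].append((text, label))
    balanced = []
    for label in sorted(groups):
        # random.sample raises ValueError when a class has fewer than
        # n_per_class rows, much like pandas does with replace=False.
        balanced.extend(rng.sample(groups[label], n_per_class))
    return balanced

# Toy data with the same class counts as the value_counts() output.
rows = [(f"tweet {i}", 0) for i in range(4342)]
rows += [(f"tweet {i}", 1) for i in range(3271)]

balanced = downsample_per_class(rows, 3000)
print(Counter(label for _, label in balanced))
```

Passing a fixed `random_state` to the pandas call would make the notebook's sample reproducible in the same way the seed does here.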

In [7]:
dataset['text'] = dataset['text'].apply(lambda x: re.sub(r'http\S+', '', x))
dataset['text'] = dataset['text'].apply(lambda x: re.sub(r'\W+', ' ', x))
dataset['text'] = dataset['text'].apply(lambda x: re.sub(r'\d+', '', x))
dataset['text'] = dataset['text'].apply(lambda x: x.lower())
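
To make the effect of the four cleaning steps concrete, here is a small sketch that applies them, in the same order, to a single string. The function name `clean_tweet` and the sample tweet are made up for illustration.

```python
import re

def clean_tweet(text):
    """Apply the same four cleaning steps as the cell above, in order."""
    text = re.sub(r'http\S+', '', text)  # drop URLs
    text = re.sub(r'\W+', ' ', text)     # each run of non-word characters becomes one space
    text = re.sub(r'\d+', '', text)      # drop digits
    return text.lower()

# Made-up example tweet, not taken from the dataset.
sample = "Forest fire near La Ronge Sask. Canada http://t.co/abc123 #wildfire 2015!"
cleaned = clean_tweet(sample)
print(repr(cleaned))
```

Two details are worth noting. Underscores survive because `_` counts as a word character, which is why a token such as `robot_rainstorm` stays intact. Removing digits after the whitespace step can also leave consecutive spaces behind.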

In [8]:
dataset.head()

|      | id   | keyword           | location               | text                                               | target |
|------|------|-------------------|------------------------|----------------------------------------------------|--------|
| 5558 | 7933 | rainstorm         |                        | robot_rainstorm we have two vacancies on the c...  | 0      |
| 4079 | 5798 | hail              |                        | thank you so so much to everyone for posting t...  | 0      |
| 3404 | 4874 | explode           | New Orleans, Louisiana | see these guys reaching the front foot out loa...  | 0      |
| 4501 | 6398 | hurricane         |                        | they should name hurricanes with black people ...  | 0      |
| 5126 | 7311 | nuclear%20reactor |                        | nuclear reactor railguns would be a great way ...  | 0      |


In [9]:
train, val, test = np.split(dataset.sample(frac = 1), [int(0.8 * len(dataset)), int(0.9 * len(dataset))])
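
`np.split` with two cut points returns three consecutive pieces of the shuffled dataframe: the first 80% of the rows, the next 10%, and the final 10%. The sketch below reproduces the arithmetic for the 6,000 balanced rows, using a plain list of indices as a stand-in for the dataframe.

```python
n = 6000                   # rows after balancing: 3,000 per class
cut_train = int(0.8 * n)   # first cut point
cut_val = int(0.9 * n)     # second cut point

indices = list(range(n))   # stand-in for the shuffled dataframe rows
train_idx = indices[:cut_train]
val_idx = indices[cut_train:cut_val]
test_idx = indices[cut_val:]

print(len(train_idx), len(val_idx), len(test_idx))  # 4800 600 600
```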

Next, we implement the **df_to_dataset()** function, which creates a **tf.data.Dataset** from the balanced tweets dataframe. It pairs each tweet's text with its label, optionally shuffles the examples, and groups them into batches. You can read more about this approach, and see the function it is based on, [here](https://www.tensorflow.org/tutorials/structured_data/feature_columns). We then apply the function to the training, validation, and test splits. Note that if you use features other than the tweet text, you may have to modify parts of the function.

In [10]:
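def df_to_dataset(dataframe, shuffle = True, batch_size = 128):
    # Work on a copy so the caller's dataframe is left untouched.
    df = dataframe.copy()
    labels = df.target
    texts = df.text
    # Pair each tweet's text with its label.
    ds = tf.data.Dataset.from_tensor_slices((texts, labels))
    if shuffle:
        # A buffer as large as the dataset gives a full shuffle.
        ds = ds.shuffle(buffer_size = len(texts))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(tf.data.AUTOTUNE)
    return ds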

In [11]:
train_ds = df_to_dataset(train)
# Validation and test data do not need shuffling.
val_ds = df_to_dataset(val, shuffle = False)
test_ds = df_to_dataset(test, shuffle = False)
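
With a batch size of 128 and the default `drop_remainder=False`, `Dataset.batch` keeps a smaller final batch instead of discarding the leftover examples. The sketch below works out how many batches each split yields, assuming the 4,800 / 600 / 600 split sizes that follow from 6,000 balanced rows. The helper names are made up for illustration.

```python
import math

def num_batches(n_examples, batch_size=128):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(n_examples / batch_size)

def last_batch_size(n_examples, batch_size=128):
    """Size of the final batch, which may be smaller than batch_size."""
    return n_examples % batch_size or batch_size

for name, n in [("train", 4800), ("val", 600), ("test", 600)]:
    print(name, num_batches(n), last_batch_size(n))
```

Under these assumptions the training set yields 38 batches per epoch, the last of which holds 64 examples.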

## Classifier Model using BERT

In [12]:
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'

In [13]:
bert_preprocess = hub.KerasLayer(tfhub_handle_preprocess)
bert_encoder = hub.KerasLayer(tfhub_handle_encoder)

In [67]:
text_input = tf.keras.layers.Input(shape = (), dtype = tf.string)
encoder_input = bert_preprocess(text_input)
encoder_output = bert_encoder(encoder_input)

l = tf.keras.layers.Dense(100, activation = 'relu')(encoder_output['sequence_output'])
l = tf.keras.layers.LSTM(100, return_sequences = True, dropout = 0.3)(l)
l = tf.keras.layers.LSTM(50, return_sequences = True)(l)
l = tf.keras.layers.LSTM(25)(l)
l = tf.keras.layers.Dense(25, activation = 'relu')(l)
l = tf.keras.layers.Dense(1, activation = 'sigmoid')(l)
model = tf.keras.Model(inputs=[text_input], outputs = [l])
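
Since `hub.KerasLayer` defaults to `trainable=False`, the BERT encoder as loaded above is frozen and only the layers stacked on top of `sequence_output` are trained. The sketch below traces output shapes and parameter counts through that head using the standard formulas for Dense and LSTM layers. It assumes a sequence length of 128 (as the preprocessing output in the model summary reports) and a hidden size of 768 (from the `H-768` in the encoder handle). It is an arithmetic check, not output taken from `model.summary()`.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def lstm_params(n_in, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (n_in + units) + units)

SEQ_LEN, HIDDEN = 128, 768  # assumed from the summary output and the H-768 handle

# (layer, output shape without the batch axis, trainable parameters)
head = [
    ("Dense(100, relu)",                 (SEQ_LEN, 100), dense_params(HIDDEN, 100)),
    ("LSTM(100, return_sequences=True)", (SEQ_LEN, 100), lstm_params(100, 100)),
    ("LSTM(50, return_sequences=True)",  (SEQ_LEN, 50),  lstm_params(100, 50)),
    ("LSTM(25)",                         (25,),          lstm_params(50, 25)),
    ("Dense(25, relu)",                  (25,),          dense_params(25, 25)),
    ("Dense(1, sigmoid)",                (1,),           dense_params(25, 1)),
]

for name, shape, params in head:
    print(f"{name:34s} {str(shape):12s} {params:>7,d}")

total_head_params = sum(params for _, _, params in head)
print("total head parameters:", total_head_params)
```

The last LSTM omits `return_sequences`, so it collapses the 128 time steps into a single 25-dimensional vector before the final Dense layers.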

In [68]:
model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001),
              loss = tf.keras.losses.BinaryCrossentropy(),
              metrics = ['accuracy'])

In [69]:
model.summary()

Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_6 (InputLayer)           [(None,)]            0           []                               
                                                                                                  
 keras_layer (KerasLayer)       {'input_word_ids':   0           ['input_6[0][0]']                
                                (None, 128),                                                      
                                 'input_mask': (Non                                               
                                e, 128),                                                          
                                 'input_type_ids':                                                
                                (None, 128)}                                                

In [70]:
early_stopping = tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', patience = 3)
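
With `patience = 3`, training stops once the validation loss has failed to improve for three consecutive epochs. The sketch below is a simplified model of that rule, not the Keras implementation, and the loss values are hypothetical numbers chosen only to show the counter resetting and then running out.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified model of the patience rule: an epoch counts as an
    improvement only if its loss is strictly below the best seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses, not taken from the training run.
hypothetical_losses = [0.60, 0.50, 0.45, 0.44, 0.43, 0.46, 0.47, 0.45]
print(early_stopping_epoch(hypothetical_losses))  # 8
```

Because `restore_best_weights` is left at its default of `False`, the model keeps the weights from the last epoch it ran rather than those from the best epoch.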

In [71]:
history = model.fit(train_ds,
                    validation_data = val_ds,
                    epochs = 30,
                    callbacks = [early_stopping])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30


In [72]:
final = pd.read_csv('drive/MyDrive/Colab Notebooks/disaster-tweets/test.csv')

In [73]:
final['text'] = final['text'].apply(lambda x: re.sub(r'http\S+', '', x))
final['text'] = final['text'].apply(lambda x: re.sub(r'\W+', ' ', x))
final['text'] = final['text'].apply(lambda x: re.sub(r'\d+', '', x))
final['text'] = final['text'].apply(lambda x: x.lower())

In [74]:
predictions = (model.predict(final.text) > 0.5).astype(int).ravel()



In [75]:
predictions_df = pd.DataFrame(list(zip(final.id, predictions)),
                              columns = ['id', 'target'])

In [76]:
predictions_df.to_csv('predictions.csv', index = False)
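
The last three cells threshold the sigmoid outputs at 0.5 and write `id,target` rows to a CSV file. A minimal standard-library sketch of the same two steps follows. The ids and probabilities are made up, and `to_submission_csv` is a hypothetical helper name.

```python
import csv
import io

def to_submission_csv(ids, probabilities, threshold=0.5):
    """Turn probabilities into 0/1 labels and render id,target rows as CSV text."""
    # Strictly greater than the threshold, matching `model.predict(...) > 0.5`.
    labels = [int(p > threshold) for p in probabilities]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "target"])
    writer.writerows(zip(ids, labels))
    return buffer.getvalue()

# Made-up ids and probabilities for illustration.
print(to_submission_csv([0, 2, 3], [0.91, 0.50, 0.12]))
```

A probability of exactly 0.5 maps to label 0 because the comparison is strict.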