# CS492F Special Topics in Computer Science: AI Industry and Smart Energy
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

## 8-2. Kaggle: Real or Not? NLP with Disaster Tweets
In this practice, we are going to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t.

### [Kaggle - Real or Not? NLP with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started)

> Twitter has become an important communication channel in times of emergency.
> The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. 
> Because of this, more agencies are interested in programatically monitoring Twitter (i.e. disaster relief organizations and news agencies).
>
> But, it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:
> ![On plus side LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE](images/example-tweet.png)
>
> The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid.
> But it’s less clear to a machine.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf

### Load and preprocess dataset

In this notebook, we are going to use `Pandas` to load and process the dataset. Let's load the dataset given in `.csv` format using `pandas.read_csv()`.

The given dataset consists of 5 columns:
- `id`
- `keyword`
- `location`
- `text`
- `target`

Among these columns, we are going to use only `text` and `target` columns.

In [None]:
!wget https://raw.githubusercontent.com/keai-kaist/CS492F-Spring/master/Week%204/Apr%2014/disaster-tweets.csv

In [2]:
df = pd.read_csv('disaster-tweets.csv', usecols=['text', 'target'])
df.head(10)

| | text | target |
|---|---|---|
| 0 | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | Just got sent this photo from Ruby #Alaska as ... | 1 |
| 5 | #RockyFire Update => California Hwy. 20 closed... | 1 |
| 6 | #flood #disaster Heavy rain causes flash flood... | 1 |
| 7 | I'm on top of the hill and I can see a fire in... | 1 |
| 8 | There's an emergency evacuation happening now ... | 1 |
| 9 | I'm afraid that the tornado is coming to our a... | 1 |
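
As a small illustration of how `usecols` behaves, here is a minimal sketch that reads an in-memory CSV with the same five column names. The three rows and the `sample_csv` / `sample_df` names are hypothetical and only stand in for the real file.

```python
import io

import pandas as pd

# Hypothetical three-row sample with the same five columns as the dataset.
sample_csv = io.StringIO(
    'id,keyword,location,text,target\n'
    '1,,,Forest fire near La Ronge Sask. Canada,1\n'
    '2,ablaze,London,The sky was ablaze last night,0\n'
    '3,flood,,Heavy rain causes flash flooding,1\n'
)

# usecols keeps only the listed columns; id, keyword and location are dropped.
sample_df = pd.read_csv(sample_csv, usecols=['text', 'target'])

print(sample_df.columns.tolist())
print(sample_df['target'].value_counts().to_dict())
```

Checking `value_counts()` on `target` is also a quick way to see how balanced the two classes are before training.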


Let's define a `clean_text()` function to clean the texts in the dataset. This function:
- Converts all uppercase characters to lowercase
- Removes accents
- Removes URLs
- Removes all punctuation
- Removes every character other than letters, digits, and whitespace
- Merges consecutive whitespace characters into a single space

In [27]:
import re
import string
import unicodedata 

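def clean_text(text):
    # Lowercase, then decompose accented characters (NFD) and drop the combining marks ('Mn')
    text = text.lower()
    text = ''.join(character for character in unicodedata.normalize('NFD', text) if unicodedata.category(character) != 'Mn')
    
    # Remove URLs first, while their punctuation is still intact
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove ASCII punctuation, then anything that is not a letter, digit, or whitespace
    text = re.sub(fr'[{re.escape(string.punctuation)}]', '', text)
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    
    return text.strip()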

Then, apply `clean_text()` to `df['text']`.

In [None]:
df['text'] = df['text'].apply(clean_text)
df.head(10)
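
To see what each cleaning step contributes, the sketch below applies the same regular expressions one at a time and records the intermediate strings. The input string is a hypothetical tweet-like example, not a row from the dataset.

```python
import re
import string
import unicodedata

# Hypothetical tweet-like string, not taken from the dataset.
text = 'LOOK at the Café!! https://t.co/abc123 #Fire   now'

steps = []

text = text.lower()
steps.append(('lowercase', text))

# NFD splits 'é' into 'e' plus a combining accent (category 'Mn'), which is dropped.
text = ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')
steps.append(('strip accents', text))

text = re.sub(r'https?://\S+|www\.\S+', '', text)
steps.append(('remove URLs', text))

text = re.sub(fr'[{re.escape(string.punctuation)}]', '', text)
steps.append(('remove punctuation', text))

text = re.sub(r'\s+', ' ', text).strip()
steps.append(('merge whitespace', text))

for name, value in steps:
    print(f'{name:>20}: {value!r}')
```

Note that removing punctuation also strips the `#` from hashtags, so `#fire` and `fire` end up as the same token.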

Let's define `train_test_split()` which splits a given `DataFrame` into train and test subsets.

In [None]:
def train_test_split(df, ratio=0.8):
    number_of_rows = df.shape[0]
    positions = np.arange(number_of_rows)
    np.random.shuffle(positions)
    
    pivot = int(number_of_rows * ratio)
    train_positions = positions[:pivot]
    test_positions = positions[pivot:]
    
    return df.iloc[train_positions], df.iloc[test_positions]

Then, we can split the dataset into train, validation, and test `DataFrame`s. Applying the 0.8 ratio twice gives roughly 64%, 16%, and 20% of the rows, respectively.

In [None]:
df_train_validation, df_test = train_test_split(df, ratio=0.8)
df_train, df_validation = train_test_split(df_train_validation, ratio=0.8)
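
Because the shuffle is random, every run produces a different split. The sketch below is a hypothetical seeded variant (`seeded_train_test_split` and its `seed` parameter are not part of the notebook) that makes the split repeatable, checked on a toy `DataFrame` of 100 rows.

```python
import numpy as np
import pandas as pd

def seeded_train_test_split(df, ratio=0.8, seed=0):
    # Hypothetical variant: a local Generator makes the shuffle repeatable.
    rng = np.random.default_rng(seed)
    positions = rng.permutation(df.shape[0])
    pivot = int(df.shape[0] * ratio)
    return df.iloc[positions[:pivot]], df.iloc[positions[pivot:]]

# Toy data standing in for the real dataset.
toy_df = pd.DataFrame({
    'text': [f'tweet {i}' for i in range(100)],
    'target': [i % 2 for i in range(100)],
})

toy_train_validation, toy_test = seeded_train_test_split(toy_df, ratio=0.8, seed=0)
toy_train, toy_validation = seeded_train_test_split(toy_train_validation, ratio=0.8, seed=1)

print(len(toy_train), len(toy_validation), len(toy_test))
```

The three subsets share no rows, which is what keeps the test accuracy an honest estimate.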

Define `prepare_tensors()` to convert the `DataFrame` into `X` and `y` tensors. Use `tf.keras.preprocessing.text.Tokenizer` to tokenize texts into sequences of words and convert them into numerical tensors.

In [None]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='', oov_token='<unk>')
tokenizer.fit_on_texts(df_train['text'])

def prepare_tensors(df):
    sequences = tokenizer.texts_to_sequences(df['text'])
    
    X = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding='post')
    y = df['target'].to_numpy()
    
    return X, y

In [None]:
X_train, y_train = prepare_tensors(df_train)
X_validation, y_validation = prepare_tensors(df_validation)
X_test, y_test = prepare_tensors(df_test)
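
To make the tokenization step less of a black box, here is a simplified pure-Python stand-in for what the tokenizer and padding do. It is not the Keras implementation; `build_word_index`, `encode`, and `pad_post` are hypothetical helpers that only mirror the conventions used above: index 0 is reserved for padding, index 1 for the `<unk>` token, and more frequent words get smaller indexes.

```python
from collections import Counter

import numpy as np

def build_word_index(texts, oov_token='<unk>'):
    # Count whitespace-separated words; index 0 stays free for padding.
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def encode(texts, word_index, oov_token='<unk>'):
    # Words never seen while building the index map to the OOV index.
    oov_index = word_index[oov_token]
    return [[word_index.get(word, oov_index) for word in text.split()] for text in texts]

def pad_post(sequences):
    # Right-pad every sequence with zeros up to the longest one.
    max_length = max(len(sequence) for sequence in sequences)
    padded = np.zeros((len(sequences), max_length), dtype='int32')
    for row, sequence in enumerate(sequences):
        padded[row, :len(sequence)] = sequence
    return padded

train_texts = ['fire in the forest', 'the forest is calm']
word_index = build_word_index(train_texts)
X_toy = pad_post(encode(['fire in the forest', 'the flood'], word_index))

print(word_index)
print(X_toy)
```

The word `flood` does not appear in `train_texts`, so it is encoded as 1. This is also why the notebook fits the tokenizer on `df_train` only: validation and test words that the model never saw during training should look unknown to it.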

#### Build the model

Build the model using `tf.keras.Sequential` and other `tf.keras.layers.*` layers.

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, dropout=0.3)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
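
As a rough sanity check on the size of this architecture, the arithmetic below estimates the number of trainable parameters per layer. The vocabulary size of 20,000 is a hypothetical placeholder (the real value is `len(tokenizer.word_index)`), and the formulas assume the default Keras settings for these layers (biases enabled, forward and backward LSTM outputs concatenated).

```python
vocabulary_size = 20_000  # hypothetical; the real value is len(tokenizer.word_index)
embedding_dim = 64
lstm_units = 64

# One row per token index, including the extra row for padding index 0.
embedding_params = (vocabulary_size + 1) * embedding_dim

# Each LSTM direction has 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params_one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bidirectional_params = 2 * lstm_params_one_direction

# The two directions are concatenated, so the hidden Dense layer sees 2 * lstm_units inputs.
dense_hidden_params = (2 * lstm_units) * 64 + 64
dense_output_params = 64 * 1 + 1

total_params = embedding_params + bidirectional_params + dense_hidden_params + dense_output_params
print(embedding_params, bidirectional_params, dense_hidden_params, dense_output_params, total_params)
```

Under these assumptions the embedding table holds most of the parameters. The `+ 1` in `input_dim` is needed because token indexes start at 1, with 0 reserved for padding.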

#### Compile the model
Compile the model using the `binary_crossentropy` loss and an optimizer of your choice. We can also monitor the performance of the network using `accuracy` as a metric.

In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy'],
)
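
For intuition about what the loss and the metric measure, here is a minimal NumPy sketch of binary cross-entropy and thresholded accuracy on four made-up predictions. The helper names and the clipping constant are assumptions for illustration; Keras computes these internally.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, epsilon=1e-7):
    # Clip to avoid log(0); epsilon is an assumed small constant.
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

def binary_accuracy(y_true, y_pred, threshold=0.5):
    # A probability above the threshold counts as class 1 (disaster).
    return float(np.mean((y_pred > threshold).astype(int) == y_true))

# Made-up labels and sigmoid outputs: two confident hits, two mild misses.
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.1, 0.4, 0.6])

loss = binary_crossentropy(y_true, y_pred)
accuracy = binary_accuracy(y_true, y_pred)
print(f'Loss: {loss:.4f}, Accuracy: {accuracy:.4f}')
```

Accuracy only looks at which side of 0.5 a prediction falls on, while the loss also rewards confidence in correct predictions, which is why the two can move differently during training.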

#### Train the model
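Train the model using `tf.keras.Model.fit()`. Use `tf.keras.callbacks.EarlyStopping` to stop the training once the validation loss has not improved for `patience` consecutive epochs (3 here). With `restore_best_weights=True`, the model is rolled back to the weights from the epoch with the lowest validation loss.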

In [None]:
model.fit(
    X_train, y_train, 
    epochs=10, 
    batch_size=256,
    validation_data=(X_validation, y_validation), 
    verbose=1,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True),
    ]
)
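
The patience logic can be sketched in plain Python. This is a simplified simulation, not the Keras callback itself; `simulate_early_stopping` is a hypothetical helper and the validation losses are made up. Epochs are counted from 0.

```python
def simulate_early_stopping(validation_losses, patience=3):
    # Returns (epoch at which training stops, epoch whose weights would be restored).
    best_loss = float('inf')
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out, so training used every epoch.
    return len(validation_losses) - 1, best_epoch

# Made-up validation losses: the best value so far is at epoch 2.
validation_losses = [0.60, 0.50, 0.48, 0.49, 0.52, 0.51, 0.47]
stopped_epoch, best_epoch = simulate_early_stopping(validation_losses, patience=3)
print(stopped_epoch, best_epoch)
```

In this toy run, training stops before the final 0.47 is ever reached, which shows the trade-off: a small `patience` saves time but can cut off a late improvement.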

#### Evaluate the model

In [None]:
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f'Loss: {loss:.4f}, Accuracy: {accuracy:.4f}')

In [None]:
sample_indexes = np.random.permutation(df_test.shape[0])
for sample_index in sample_indexes[:10]:
    prediction = model(np.expand_dims(X_test[sample_index], axis=0))[0]
    
    print('Input:', df_test.iloc[sample_index]['text'])
    print('Preprocessed:', ' '.join(tokenizer.index_word[index] for index in np.trim_zeros(X_test[sample_index])))
    print('Prediction:', '🌋Disaster' if prediction[0] >= 0.5 else '👍Non-disaster')
    print('Label:', '🌋Disaster' if y_test[sample_index] == 1 else '👍Non-disaster')
    print()