# SEP532 Artificial Intelligence: Theory and Practice
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

## Kaggle: Real or Not? NLP with Disaster Tweets
In this practice, we are going to build a machine learning model that predicts which Tweets are about real disasters and which ones aren't.

### [Kaggle - Real or Not? NLP with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started)

> Twitter has become an important communication channel in times of emergency.
> The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. 
> Because of this, more agencies are interested in programatically monitoring Twitter (i.e. disaster relief organizations and news agencies).
>
> But, it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:
> ![On plus side LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE](images/example-tweet.png)
>
> The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid.
> But it’s less clear to a machine.

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf

### Load and preprocess dataset

In this notebook, we are going to use `Pandas` to load and process the dataset. Let's load the dataset given in `.csv` format using `pandas.read_csv()`.

The given dataset consists of 5 columns:
- `id`
- `keyword`
- `location`
- `text`
- `target`

Among these columns, we are going to use only the `text` and `target` columns.

In [None]:
df = pd.
df.head(10)
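
As a minimal sketch of the loading step, the snippet below reads a CSV with `pandas.read_csv()` and keeps only the `text` and `target` columns. The in-memory CSV and its two rows are made-up stand-ins so the example runs on its own; the real file name (for instance `train.csv`) is an assumption and depends on where the competition data was saved.

```python
import io

import pandas as pd

# Hypothetical stand-in for the competition file. In practice, pass the path
# of the downloaded CSV instead (e.g. 'train.csv', an assumed file name).
csv_data = io.StringIO(
    'id,keyword,location,text,target\n'
    '1,,,Forest fire near the village,1\n'
    '2,ablaze,London,The sky was ablaze last night,0\n'
)

# usecols keeps only the two columns this practice relies on.
df_example = pd.read_csv(csv_data, usecols=['text', 'target'])
print(df_example.head(10))
```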

Let's define a `clean_text()` function to clean the texts in the dataset. For example, this function
- Converts all uppercase characters in a text into lowercase characters
- Removes accents
- Removes URLs in a text
- Removes all punctuation in a text
- Removes all characters other than letters, digits, and whitespace
- Merges consecutive whitespace characters into a single space

In [None]:
import re
import string
import unicodedata 

def clean_text(text):
    pass
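
One possible way to implement the steps listed above is sketched here, using only `re`, `string`, and `unicodedata`. The URL pattern, the choice to delete punctuation rather than replace it with a space, and the final `strip()` are assumptions of this sketch, not requirements of the exercise.

```python
import re
import string
import unicodedata


def clean_text(text):
    # Convert uppercase characters to lowercase.
    text = text.lower()
    # Remove accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(ch for ch in text if not unicodedata.combining(ch))
    # Remove URLs (a simple pattern, assumed to be good enough for tweets).
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    # Remove punctuation.
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove everything other than ASCII letters, digits, and whitespace.
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    # Merge consecutive whitespace into one space and trim the ends.
    return re.sub(r'\s+', ' ', text).strip()


print(clean_text('Café ABLAZE!!  see https://t.co/abc123 #wow'))
# cafe ablaze see wow
```

Because punctuation is deleted rather than replaced, a word such as `don't` becomes `dont`; replacing punctuation with a space is an equally reasonable alternative.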

Then, apply `clean_text()` to `df['text']`.

In [None]:
df['text'] = 
df.head(10)
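
Applying a function to every row of a column is usually done with `Series.apply()`. The sketch below uses a deliberately simplified stand-in for `clean_text()` and a made-up two-row `DataFrame`, so both are assumptions for illustration only.

```python
import pandas as pd


def clean_text(text):
    # Simplified stand-in for the full cleaning function: lowercase the text
    # and merge runs of whitespace.
    return ' '.join(text.lower().split())


# Made-up rows used only to show the call.
df_example = pd.DataFrame({
    'text': ['LOOK  at the SKY', 'Flood   warning'],
    'target': [0, 1],
})

# apply() calls clean_text once per row and returns a new Series.
df_example['text'] = df_example['text'].apply(clean_text)
print(df_example.head(10))
```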

Let's define `train_test_split()` which splits a given `DataFrame` into train and test subsets.

In [None]:
def train_test_split(df, ratio=0.8):
    pass
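
A minimal sketch of such a split is shown below: shuffle the rows, then slice at `ratio`. The `seed` parameter is a hypothetical addition for reproducibility and is not part of the signature given in the exercise.

```python
import pandas as pd


def train_test_split(df, ratio=0.8, seed=42):
    # Shuffle all rows; `seed` (hypothetical) makes the shuffle repeatable.
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    # The first `ratio` share of the rows goes to the first subset.
    split_index = int(len(shuffled) * ratio)
    first = shuffled.iloc[:split_index]
    second = shuffled.iloc[split_index:].reset_index(drop=True)
    return first, second


# Made-up data: ten rows with distinct ids.
df_demo = pd.DataFrame({'id': range(10), 'target': [0, 1] * 5})
df_a, df_b = train_test_split(df_demo)
print(len(df_a), len(df_b))
# 8 2
```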

Then, we can split the dataset into train, validation, and test `DataFrame`s by applying `train_test_split()` twice.

In [None]:
df_train_validation, df_test = 
df_train, df_validation = 
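
The two-stage split can be sketched as follows. With the default `ratio=0.8` applied twice, 100 rows end up as 64 train, 16 validation, and 20 test rows. The shuffle-and-slice `train_test_split()` here, including its `seed` parameter, is an assumed implementation, and the data is made up.

```python
import pandas as pd


def train_test_split(df, ratio=0.8, seed=42):
    # Assumed implementation: shuffle the rows, then slice at `ratio`.
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    split_index = int(len(shuffled) * ratio)
    return (shuffled.iloc[:split_index],
            shuffled.iloc[split_index:].reset_index(drop=True))


# Made-up data: 100 placeholder tweets.
df_demo = pd.DataFrame({
    'text': [f'tweet {i}' for i in range(100)],
    'target': [i % 2 for i in range(100)],
})

# First hold out the test set, then split the remainder again.
df_train_validation, df_test = train_test_split(df_demo)
df_train, df_validation = train_test_split(df_train_validation)
print(len(df_train), len(df_validation), len(df_test))
# 64 16 20
```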

Define `prepare_tensors()` to convert the `DataFrame` into `X` and `y` tensors. Use `tf.keras.preprocessing.text.Tokenizer` to tokenize texts into sequences of words and convert them into numerical tensors.

In [None]:
tokenizer = 
tokenizer.

def prepare_tensors(df):
    pass
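
The conversion can be sketched like this: fit a `Tokenizer` on the texts, map each text to a sequence of word indexes, and pad the sequences with zeros to a fixed length. `MAX_LENGTH`, `num_words`, the `<unk>` token, and the demo rows are all assumptions of this sketch. In practice the tokenizer would normally be fitted on the training texts only.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

MAX_LENGTH = 32  # assumed maximum number of tokens per tweet

# Made-up rows standing in for the cleaned training data.
df_demo = pd.DataFrame({
    'text': ['forest fire near the village',
             'the sky was ablaze',
             'flood warning issued'],
    'target': [1, 0, 1],
})

# num_words and oov_token are assumed settings, not given by the exercise.
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000,
                                                  oov_token='<unk>')
tokenizer.fit_on_texts(df_demo['text'])


def prepare_tensors(df):
    # Map each text to a list of integer word indexes.
    sequences = tokenizer.texts_to_sequences(df['text'])
    # Pad (or truncate) at the end so every row has MAX_LENGTH entries.
    X = tf.keras.preprocessing.sequence.pad_sequences(
        sequences, maxlen=MAX_LENGTH, padding='post', truncating='post')
    y = df['target'].to_numpy(dtype=np.float32)
    return X, y


X_demo, y_demo = prepare_tensors(df_demo)
print(X_demo.shape, y_demo.shape)
```

Index 0 is reserved for padding, which is why padded positions can later be stripped with `np.trim_zeros()`.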

In [None]:
X_train, y_train = 
X_validation, y_validation = 
X_test, y_test = 

#### Build the model

Build the model using `tf.keras.Sequential` and other `tf.keras.layers.*` layers.

In [None]:
model = 
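
Many architectures would work here. As one simple baseline (not necessarily the intended answer), the sketch below averages word embeddings and feeds them to a small dense classifier with a sigmoid output. `VOCAB_SIZE`, `MAX_LENGTH`, and the layer sizes are assumed values.

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 10000  # assumed to match the tokenizer's num_words
MAX_LENGTH = 32     # assumed padded sequence length

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LENGTH,), dtype='int32'),
    # Map each word index to a 64-dimensional vector.
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    # Average the word vectors into one vector per tweet.
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation='relu'),
    # One sigmoid unit: the predicted probability of a real disaster.
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.summary()
```

Replacing the pooling layer with a recurrent layer such as `tf.keras.layers.LSTM` is a natural next experiment.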

#### Compile the model
Compile the model using the `binary_crossentropy` loss and an optimizer of your choice. We can also monitor the performance of the network by using `accuracy` as a metric.

In [None]:
model.
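
A compile call along these lines is sketched below. The tiny stand-in model, the Adam optimizer, and the learning rate are assumptions chosen only so the snippet runs on its own; any other Keras optimizer could be substituted.

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in model so the compile call can be shown in isolation.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # assumed choice
    loss='binary_crossentropy',
    metrics=['accuracy'],
)
```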

#### Train the model
Train the model using `tf.keras.Model.fit()`. Use `tf.keras.callbacks.EarlyStopping` to stop training once the validation loss stops improving.

In [None]:
model.
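
The training step with early stopping can be sketched as follows. The random data, the stand-in model, and the values of `patience`, `epochs`, and `batch_size` are all assumptions for illustration. `restore_best_weights=True` rolls the model back to the epoch with the lowest validation loss.

```python
import numpy as np
import tensorflow as tf

# Made-up data: 8 random features, label depends on their sum.
rng = np.random.default_rng(0)
X_train_demo = rng.random((64, 8)).astype('float32')
y_train_demo = (X_train_demo.sum(axis=1) > 4).astype('float32')
X_val_demo = rng.random((16, 8)).astype('float32')
y_val_demo = (X_val_demo.sum(axis=1) > 4).astype('float32')

# Tiny stand-in model, compiled as described in the previous step.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

# Stop when val_loss has not improved for `patience` consecutive epochs.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=3, restore_best_weights=True)

history = model.fit(
    X_train_demo, y_train_demo,
    validation_data=(X_val_demo, y_val_demo),
    epochs=20, batch_size=16,
    callbacks=[early_stopping], verbose=0,
)
print('Epochs run:', len(history.history['loss']))
```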

#### Evaluate the model

In [None]:
loss, accuracy = 
print(f'Loss: {loss:.4f}, Accuracy: {accuracy:.4f}')

In [None]:
sample_indexes = np.random.permutation(df_test.shape[0])
for sample_index in sample_indexes[:10]:
    prediction = model(np.expand_dims(X_test[sample_index], axis=0))[0]
    
    print('Input:', df_test.iloc[sample_index]['text'])
    print('Preprocessed:', ' '.join(tokenizer.index_word[index] for index in np.trim_zeros(X_test[sample_index])))
    print('Actual:', '🌋Disaster' if y_test[sample_index] == 1 else '🌍Non-disaster')
    print('Predicted:', '🌋Disaster' if prediction[0] >= 0.5 else '🌍Non-disaster')
    print()