<a href="https://colab.research.google.com/github/mindyng/Projects/blob/master/BERT_Using_Tweets_to_Determine_Real_or_Fake_Disaster.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# BERT: Using Tweets to Determine Real or Fake Disaster

 Inspired by: https://www.kaggle.com/nkaenzig/bert-tensorflow-2-huggingface-transformers



In [1]:
!pip install transformers



In [2]:
import numpy as np
import pandas as pd
import torch
from tqdm import tqdm

from transformers import BertTokenizer, BertModel
import os

Below are the steps involved in using a transformer model such as BERT.

# Preprocessing

### 1. Tokenize the input data and build the other inputs BERT needs, such as the attention mask, so that padded positions are ignored by attention.

### 2. Convert tokens to input ID sequences.

### 3. Pad IDs to a fixed length. 

# Modeling

### 1. Load the model and feed in the input ID sequences (in batches, especially when CPU or memory is limited).

### 2. Get the output of the last hidden layer. It is the first element (index 0) of the model's output, hence last_hidden_layer[0]. Within it, the [CLS] vector at position 0 of each sequence serves as the sequence representation.

### 3. Embeddings can be used as inputs for different ML/DL models.
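
The modeling steps can be sketched without downloading a model. In the minimal sketch below, `fake_encoder` is a hypothetical stand-in for BERT that returns random numbers shaped like a last hidden state, `(batch, sequence length, hidden size)`. The hidden size of 768 matches `bert-base-uncased`; the toy batch size and inputs are assumptions for illustration. Only the batching loop and the position-0 slice are the point here.

```python
import numpy as np

HIDDEN_SIZE = 768  # hidden size of bert-base-uncased

def fake_encoder(input_ids):
    """Hypothetical stand-in for BERT: returns random values shaped like
    a last hidden state, (batch, seq_len, hidden)."""
    rng = np.random.default_rng(0)
    batch, seq_len = input_ids.shape
    return rng.standard_normal((batch, seq_len, HIDDEN_SIZE))

def extract_cls_features(input_ids, batch_size):
    """Run the encoder batch by batch and keep the position-0 vector of each sequence."""
    features = []
    for start in range(0, len(input_ids), batch_size):
        batch = input_ids[start:start + batch_size]
        last_hidden = fake_encoder(batch)
        features.append(last_hidden[:, 0, :])  # [CLS] sits at position 0
    return np.vstack(features)

toy_ids = np.zeros((5, 10), dtype=int)  # 5 made-up sequences of length 10
cls_features = extract_cls_features(toy_ids, batch_size=2)
print(cls_features.shape)  # (5, 768)
```

The resulting 2-D array has one row per input sequence and can be passed to a downstream classifier as a feature matrix.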

Using BERT Transformers.

In [7]:
model_type = 'bert-base-uncased'
max_size = 150
batch_size = 200

In [4]:
train_df = pd.read_csv("/content/drive/My Drive/train.csv")
test_df = pd.read_csv("/content/drive/My Drive/test.csv")
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [5]:
test_df

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan
...,...,...,...,...
3258,10861,,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...
3259,10865,,,Storm in RI worse than last hurricane. My city...
3260,10868,,,Green Line derailment in Chicago http://t.co/U...
3261,10874,,,MEG issues Hazardous Weather Outlook (HWO) htt...


## Load Tokenizer and Model

In [8]:
tokenizer = BertTokenizer.from_pretrained(model_type)
model = BertModel.from_pretrained(model_type)


## Convert Text to Tokens

In [9]:
tokenized_input = train_df['text'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

In [10]:
print(tokenized_input[1])
print("Here 101 -> [CLS] and 102 -> [SEP]")

[101, 3224, 2543, 2379, 2474, 6902, 3351, 21871, 2243, 1012, 2710, 102]
Here 101 -> [CLS] and 102 -> [SEP]


* [CLS] token = classification token

* In between [CLS] and [SEP] = the token IDs of the sequence itself. (The embeddings are taken from the model's output layer, not from these IDs.)

* [SEP] = end of sequence
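
To make the role of the special tokens concrete, the sketch below works directly on the ID list printed above for `tokenized_input[1]`. It uses no tokenizer: the IDs 101 and 102 are taken from that printed output, and the helper name `strip_special_tokens` is made up for illustration.

```python
CLS_ID = 101  # [CLS], as printed above
SEP_ID = 102  # [SEP], as printed above

def strip_special_tokens(ids):
    """Return the IDs between the leading [CLS] and the trailing [SEP]."""
    if not ids or ids[0] != CLS_ID or ids[-1] != SEP_ID:
        raise ValueError("expected a sequence wrapped in [CLS] ... [SEP]")
    return ids[1:-1]

example_ids = [101, 3224, 2543, 2379, 2474, 6902, 3351, 21871, 2243, 1012, 2710, 102]
content_ids = strip_special_tokens(example_ids)
print(len(example_ids), len(content_ids))  # 12 10
```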

## Padding sequence to standardize length

In [11]:
padded_tokenized_input = np.array([i + [0]*(max_size-len(i)) for i in tokenized_input.values])

In [12]:
print(padded_tokenized_input[0])

[  101  2256 15616  2024  1996  3114  1997  2023  1001  8372  2089 16455
  9641  2149  2035   102     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0]
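
The list comprehension above assumes every tokenized tweet is at most `max_size` IDs long; a longer one would produce a row of a different length. A minimal sketch of a helper that also truncates is shown below. The name `pad_or_truncate` and the toy sequences are illustrative, and plain truncation drops the trailing [SEP] of an over-long sequence, which may or may not matter for a given use.

```python
import numpy as np

def pad_or_truncate(sequences, max_size, pad_id=0):
    """Right-pad each ID list with pad_id, or cut it, so every row has max_size entries."""
    rows = []
    for ids in sequences:
        ids = list(ids)[:max_size]  # cut over-long sequences
        rows.append(ids + [pad_id] * (max_size - len(ids)))  # pad short ones
    return np.array(rows)

# Made-up ID lists: two shorter than max_size, one longer.
toy_sequences = [[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 5, 102]]
padded = pad_or_truncate(toy_sequences, max_size=5)
print(padded.shape)  # (3, 5)
```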


Build an attention mask (1 for real tokens, 0 for padding) so that BERT ignores the padded positions.

In [13]:
attention_masks = np.where(padded_tokenized_input != 0, 1, 0)
print(attention_masks)

[[1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 ...
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]]
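
As a quick sanity check on the mask, the number of ones in each row should equal the unpadded length of that sequence. The sketch below shows this on a small made-up array; it assumes, as the code above does, that ID 0 is used only for padding.

```python
import numpy as np

# Two made-up padded sequences of length 6.
toy_padded = np.array([
    [101, 7, 8, 102, 0, 0],
    [101, 5, 102, 0, 0, 0],
])

toy_mask = np.where(toy_padded != 0, 1, 0)  # 1 = real token, 0 = padding
real_lengths = toy_mask.sum(axis=1)         # ones per row = unpadded length

print(toy_mask)
print(real_lengths)  # [4 3]
```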
