<a href="https://colab.research.google.com/github/mindyng/Projects/blob/master/BERT_Using_Tweets_to_Determine_Real_or_Fake_Disaster.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# Using Tweets to Determine Real or Fake Disasters with BERT

 Inspired by: https://www.kaggle.com/nkaenzig/bert-tensorflow-2-huggingface-transformers



In [None]:
!pip install transformers

In [None]:
import numpy as np
import pandas as pd
import torch
from tqdm import tqdm

from transformers import BertTokenizer, BertModel
import os

Below are the steps involved when using a transformer such as BERT as a feature extractor.

# Preprocessing

### 1. Tokenize the input text and build the other inputs BERT needs, such as the attention mask, so that padded positions are ignored by attention.

### 2. Convert tokens to input ID sequences.

### 3. Pad IDs to a fixed length. 

# Modeling

### 1. Load the model and feed in the input ID sequences (in batches, which keeps memory use manageable on limited hardware).

### 2. Get the output of the last hidden layer. The model's first output (index 0) is the last hidden state, and position 0 of each sequence holds the [CLS] embedding, which serves as the sequence representation.

### 3. Embeddings can be used as inputs for different ML/DL models.
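To make the indexing in modeling step 2 concrete, here is a minimal sketch. It uses a random NumPy array as a stand-in for BERT's last hidden state (an assumption for illustration, not real model output), so it shows only the shapes involved. The batch size of 4 is arbitrary; 768 is the hidden size of `bert-base-uncased`.

```python
import numpy as np

batch, seq_len, hidden = 4, 150, 768  # 768 is the hidden size of bert-base-uncased

# Stand-in for the last hidden state: one vector per token position.
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch, seq_len, hidden))

# Position 0 of every sequence holds the [CLS] token; keep only that vector.
cls_embeddings = last_hidden_state[:, 0, :]

print(cls_embeddings.shape)  # (4, 768)
```

Each row of `cls_embeddings` is one fixed-length vector per input text, which is the form most classical classifiers expect.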

Using BERT Transformers.

In [None]:
model_type = 'bert-base-uncased'
max_size = 150
batch_size = 200

In [None]:
train_df = pd.read_csv("/content/drive/My Drive/train.csv")
test_df = pd.read_csv("/content/drive/My Drive/test.csv")
train_df.head()

In [None]:
test_df

## Load Tokenizer and Model

In [None]:
tokenizer = BertTokenizer.from_pretrained(model_type)
model = BertModel.from_pretrained(model_type)

## Convert Text to Tokens

In [None]:
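# Truncate to max_size so that every sequence can later be padded to the same length.
tokenized_input = train_df['text'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True, max_length=max_size, truncation=True))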

In [None]:
print(tokenized_input[1])
print("Here 101 -> [CLS] and 102 -> [SEP]")

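* [CLS] = classification token, added at the start of every sequence

* Between [CLS] and [SEP] = the word-piece tokens of the text. (The sequence embedding is taken from the output-layer embedding of the [CLS] token.)

* [SEP] = separator token, marking the end of the sequence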
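As a small illustration of how the special tokens frame a sequence, the sketch below wraps a list of IDs by hand. The helper `wrap_with_special_tokens` and the example IDs are hypothetical and exist only for this illustration; in the notebook, `tokenizer.encode(..., add_special_tokens=True)` does this step.

```python
CLS_ID, SEP_ID = 101, 102  # IDs of [CLS] and [SEP] in the bert-base-uncased vocabulary

def wrap_with_special_tokens(token_ids):
    """Return token_ids framed by [CLS] at the start and [SEP] at the end."""
    return [CLS_ID] + list(token_ids) + [SEP_ID]

# Made-up word-piece IDs standing in for a short tweet.
example_ids = [2023, 2003, 1037, 7099]
wrapped = wrap_with_special_tokens(example_ids)
print(wrapped)  # [101, 2023, 2003, 1037, 7099, 102]
```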

## Padding sequence to standardize length

In [None]:
padded_tokenized_input = np.array([i + [0]*(max_size-len(i)) for i in tokenized_input.values])

In [None]:
print(padded_tokenized_input[0])
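The padding expression above assumes that no sequence is longer than `max_size`. A minimal sketch of a helper that also handles longer inputs is shown below. The function `pad_or_truncate` is hypothetical and not part of the original notebook, and closing a truncated sequence with [SEP] is one possible choice rather than the only one.

```python
import numpy as np

def pad_or_truncate(ids, max_size, pad_id=0, sep_id=102):
    """Return a list of exactly max_size IDs."""
    ids = list(ids)
    if len(ids) > max_size:
        # Keep the first max_size - 1 IDs and close the sequence with [SEP].
        ids = ids[:max_size - 1] + [sep_id]
    # Right-pad with the [PAD] ID (0 in bert-base-uncased).
    return ids + [pad_id] * (max_size - len(ids))

sequences = [[101, 7, 8, 102], [101, 5, 6, 7, 8, 9, 10, 102]]
padded = np.array([pad_or_truncate(s, max_size=6) for s in sequences])
print(padded.shape)  # (2, 6)
```

With a rectangular array guaranteed, the later conversion to a tensor cannot fail on ragged rows.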

Build an attention mask so that BERT ignores the padded positions.

In [None]:
attention_masks = np.where(padded_tokenized_input != 0, 1, 0)
print(attention_masks)
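To show why the mask matters, here is a simplified picture of what it does inside attention: scores at masked positions are pushed to negative infinity before the softmax, so those positions receive zero weight. This is a sketch under that simplifying assumption, not BERT's actual implementation; `masked_softmax` and the uniform scores are made up for illustration.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the last axis, giving zero weight where mask == 0."""
    scores = np.where(mask == 1, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

input_ids = np.array([[101, 7, 8, 102, 0, 0]])
attention_mask = np.where(input_ids != 0, 1, 0)  # same rule as the notebook cell

scores = np.zeros(input_ids.shape)  # made-up, uniform attention scores
weights = masked_softmax(scores, attention_mask)
print(weights)  # the two padded positions get weight 0
```

Without the mask, the same uniform scores would spread weight evenly over all six positions, including the padding.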

In [None]:
input_ids = torch.tensor(padded_tokenized_input)
attention_masks = torch.tensor(attention_masks)

## Get sequence embeddings

In [None]:
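all_train_embedding = []

with torch.no_grad():  # inference only, so skip gradient tracking
  for i in tqdm(range(0, len(input_ids), batch_size)):
    # Slicing clips at the end of the tensor, so the last batch needs no min().
    batch_ids = input_ids[i:i + batch_size]
    batch_mask = attention_masks[i:i + batch_size]
    # Output 0 is the last hidden state; position 0 holds the [CLS] embedding.
    last_hidden_states = model(batch_ids, attention_mask=batch_mask)[0][:, 0, :].numpy()
    all_train_embedding.append(last_hidden_states)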


In [None]:
# Stack the per-batch arrays into one (n_samples, hidden_size) array.
unbatched_train = np.concatenate(all_train_embedding, axis=0)

train_labels = train_df['target']
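The batching and unbatching pattern above can be tried without loading BERT. The sketch below swaps in a hypothetical `fake_encoder` that only mimics the output shape `(batch, seq_len, hidden)`; the toy sizes are assumptions chosen so that the last batch is smaller than the others.

```python
import numpy as np

def fake_encoder(ids):
    """Hypothetical stand-in for BERT: returns a (batch, seq_len, 8) array."""
    return np.repeat(ids[:, :, None].astype(float), 8, axis=2)

input_ids = np.arange(50).reshape(10, 5)  # 10 toy sequences of length 5
batch_size = 4

batches = []
for start in range(0, len(input_ids), batch_size):
    chunk = input_ids[start:start + batch_size]   # slicing clips the last batch
    batches.append(fake_encoder(chunk)[:, 0, :])  # keep position 0 only

embeddings = np.concatenate(batches, axis=0)
print([b.shape for b in batches])  # [(4, 8), (4, 8), (2, 8)]
print(embeddings.shape)            # (10, 8)
```

The row order of `embeddings` matches the row order of `input_ids`, which is what keeps the embeddings aligned with their labels.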

Split the embeddings into training and test sets for use in various downstream models.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(unbatched_train, train_labels, test_size=0.33, random_state=42, stratify=train_labels)
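As one example of a downstream model, the sketch below fits a logistic regression on a stratified split. The data is synthetic (random vectors and made-up labels), and the choice of `LogisticRegression` is an assumption for illustration, not a model the notebook specifies, so the fitted classifier itself is meaningless. The point is the pattern, which carries over to the real embeddings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 16))      # synthetic stand-in for the embeddings
y = np.array([0] * 120 + [1] * 80)  # synthetic binary labels, 60% / 40%

# stratify=y keeps the class ratio roughly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
predictions = clf.predict(X_te)

print(X_tr.shape, X_te.shape)
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))  # class-1 share per split
```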