<a href="https://colab.research.google.com/github/matthewreader/continuous-learning/blob/main/projects/manning-liveprojects/hate-speech-detection-bert/Hate_Speech_Detection_Part_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Import libraries and load the hate speech data**

In [1]:
# 1. A cell containing imports of all the libraries.
!pip install transformers
!pip install SentencePiece
!pip install pytorch_lightning

import pandas as pd
import regex as re
from google.colab import drive
from sklearn.model_selection import train_test_split
import torch
from transformers import AlbertTokenizer
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


In [2]:
# 2. A cell containing code for mounting your Google Drive in Colab.
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
# 3. A cell containing code for reading the data using pandas.
GDRIVE_DATA = "gdrive/My Drive/Colab Notebooks/Data/"
tweets = pd.read_csv(GDRIVE_DATA + "Khilnani_LP_hate_speech_data.csv")
tweets.head()

Unnamed: 0.1,Unnamed: 0,tweet,class
0,0,!!! RT @mayasolovely: As a woman you shouldn't...,0
1,1,""" momma said no pussy cats inside my doghouse """,0
2,2,"""@Addicted2Guys: -SimplyAddictedToGuys http://...",0
3,3,"""@AllAboutManFeet: http://t.co/3gzUpfuMev"" woo...",0
4,4,"""@Allyhaaaaa: Lemmie eat a Oreo &amp; do these...",0


## **Remove words starting with an @ sign**

Usernames appear in our Twitter data as strings beginning with the "@" symbol. It does not make sense to train our model on usernames, so we first remove them from the tweets. We could do so with the regex module, as the Manning liveProject suggests, but since our data is already in a `pandas` DataFrame, I think it is more straightforward to use `pandas.Series.str.replace`.
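
For comparison, here is a minimal sketch of the function-based route. It assumes the standard-library `re` module rather than the third-party `regex` package imported above, and `remove_usernames` is a hypothetical helper name, not something the liveProject defines.

```python
import re

# "@" followed by one or more word characters (letters, digits, underscore).
USERNAME_PATTERN = re.compile(r"@\w+")


def remove_usernames(text):
    """Return `text` with every @username removed (hypothetical helper)."""
    return USERNAME_PATTERN.sub("", text)


sample = "!!! RT @mayasolovely: As a woman you shouldn't complain"
cleaned = remove_usernames(sample)
print(cleaned)
```

On a DataFrame this could be applied with `tweets["tweet"].apply(remove_usernames)`, which should give the same result as the `str.replace` call in the next cell.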


In [4]:
# 4. Cells containing code for writing a function to clean the text data and
# then applying that function on the dataset.
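# Use a raw string for the pattern and pass regex=True explicitly: newer pandas
# versions treat the pattern as a literal string by default.
tweets["tweet_clean"] = tweets["tweet"].str.replace(r"@\w+", "", regex=True)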
tweets.head()

Unnamed: 0.1,Unnamed: 0,tweet,class,tweet_clean
0,0,!!! RT @mayasolovely: As a woman you shouldn't...,0,!!! RT : As a woman you shouldn't complain abo...
1,1,""" momma said no pussy cats inside my doghouse """,0,""" momma said no pussy cats inside my doghouse """
2,2,"""@Addicted2Guys: -SimplyAddictedToGuys http://...",0,""": -SimplyAddictedToGuys http://t.co/1jL4hi8ZM..."
3,3,"""@AllAboutManFeet: http://t.co/3gzUpfuMev"" woo...",0,""": http://t.co/3gzUpfuMev"" woof woof and hot s..."
4,4,"""@Allyhaaaaa: Lemmie eat a Oreo &amp; do these...",0,""": Lemmie eat a Oreo &amp; do these dishes."" O..."


That's a bit better, but there is still a lot that should probably be cleaned up. Since this is Twitter data, hyperlinks can safely be removed as well, and punctuation should probably go too. Hashtags are worth keeping, since they may contain hateful phrases.
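
The extra cleanup described above is not applied in this notebook, but it could look roughly like the sketch below. This rests on my own assumptions: `clean_tweet` is a hypothetical helper, the URL pattern only catches `http://` and `https://` links, HTML entities such as `&amp;` are decoded first, and only ASCII punctuation is stripped. The `#` character is kept so hashtags survive.

```python
import html
import re
import string

# Assumption: links in the tweets always start with http:// or https://.
URL_PATTERN = re.compile(r"https?://\S+")

# Translation table that deletes ASCII punctuation except "#" (keeps hashtags).
PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("#", ""))


def clean_tweet(text):
    """Decode HTML entities, drop links and punctuation, collapse whitespace."""
    text = html.unescape(text)          # "&amp;" becomes "&"
    text = URL_PATTERN.sub("", text)    # remove hyperlinks
    text = text.translate(PUNCT_TABLE)  # remove punctuation, keep "#"
    return " ".join(text.split())       # collapse runs of whitespace


link_example = clean_tweet('": http://t.co/3gzUpfuMev" woof woof #dogs!')
entity_example = clean_tweet("Lemmie eat a Oreo &amp; do these dishes.")
print(link_example)
print(entity_example)
```

Stripping punctuation also removes apostrophes, so "shouldn't" becomes "shouldnt". Whether that helps or hurts the tokenizer is something to check before adopting this.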

In [6]:
# 5. Cells containing code for splitting the data into train and validation
# sets using sklearn. X_test and y_test hold the validation split.
X_train, X_test, y_train, y_test = train_test_split(
    tweets["tweet_clean"].values,
    tweets["class"].values
)

In [8]:
# 6. Cells containing code for loading the pre-trained AlbertTokenizer
# and tokenizing the data.
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')

train_tokens = tokenizer(
    list(X_train),
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=64
)

val_tokens = tokenizer(
    list(X_test),
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=64
)

# Fall back to the CPU when no GPU runtime is attached.
device = "cuda" if torch.cuda.is_available() else "cpu"

train_set = [train_tokens["input_ids"].to(device),
             train_tokens["attention_mask"].to(device),
             train_tokens["token_type_ids"].to(device),
             torch.tensor(y_train).to(device)] 

val_set = [val_tokens["input_ids"].to(device),
           val_tokens["attention_mask"].to(device),
           val_tokens["token_type_ids"].to(device),
           torch.tensor(y_test).to(device)] 

In [22]:
# 7. A cell containing the code for the DataModule class built in 
# PyTorch Lightning

BATCH_SIZE = 32

class ClassificationData(pl.LightningDataModule):
    def __init__(self, train_set, val_set):
        super().__init__()

        self.train_set = DataLoader(
            TensorDataset(*train_set),
            batch_size=BATCH_SIZE)
        self.val_set = DataLoader(
            TensorDataset(*val_set),
            batch_size=BATCH_SIZE)

    def train_dataloader(self):
        return self.train_set

    def val_dataloader(self):
        return self.val_set

dls = ClassificationData(train_set, val_set)
next(iter(dls.train_set))

[tensor([[   2,   98, 3832,  ...,    0,    0,    0],
         [   2,   51, 1781,  ...,    0,    0,    0],
         [   2,   31, 1518,  ...,    0,    0,    0],
         ...,
         [   2,   92,   22,  ...,    0,    0,    0],
         [   2,   13, 5256,  ...,    0,    0,    0],
         [   2,   55,   45,  ...,    0,    0,    0]], device='cuda:0'),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0'),
 tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]], device='cuda:0'),
 tensor([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 1, 0, 0, 1, 0], device='cuda:0')]