In [2]:
import pandas as pd 
import numpy as np


In [3]:
Data=pd.read_csv("Twitter Hate Speech - Twitter Hate Speech.csv")

In [39]:
Data

Unnamed: 0,id,label,tweet,cleaned_tweet
0,1,0,@user when a father is dysfunctional and is so...,when a father is dysfunctional and is so self...
1,2,0,@user @user thanks for #lyft credit i can't us...,thanks for lyft credit i cant use cause they...
2,3,0,bihday your majesty,bihday your majesty
3,4,0,#model i love u take with u all the time in ...,model i love u take with u all the time in u...
4,5,0,factsguide: society now #motivation,factsguide society now motivation
...,...,...,...,...
31957,31958,0,ate @user isz that youuu?ðððððð...,ate isz that youuuðððððððððâï
31958,31959,0,to see nina turner on the airwaves trying to w...,to see nina turner on the airwaves trying to w...
31959,31960,0,listening to sad songs on a monday morning otw...,listening to sad songs on a monday morning otw...
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,...",sikh temple vandalised in in calgary wso cond...


## 1. Data Cleaning with Regex and Python

To clean the tweets, we can use regular expressions (regex) to find and remove or replace the parts of each tweet that are not useful for the analysis. Common steps include:

- **Removing user mentions** (e.g., `@user`)
- **Removing URLs**
- **Removing special characters and emoticons**
- **Optionally, converting all text to lowercase** to ensure uniformity


In [38]:
import pandas as pd
import re

# Function to clean tweets
def clean_tweet(tweet):
    tweet = re.sub(r'@user', '', tweet)  # Remove the @user placeholder
    tweet = re.sub(r'http\S+|www\S+', '', tweet)  # Remove URLs (http\S+ also matches https)
    tweet = re.sub(r'@\w+|#', '', tweet)  # Remove remaining @mentions and the '#' symbol (hashtag text is kept)
    tweet = re.sub(r'\\x\w{2,}', '', tweet)  # Remove literal \xNN escape sequences left by encoding errors
    tweet = re.sub(r'[^\w\s]', '', tweet)  # Remove punctuation; \w is Unicode-aware, so non-ASCII letters remain
    tweet = tweet.lower()  # Convert to lowercase
    return tweet

# Apply cleaning function to tweets
Data['cleaned_tweet'] = Data['tweet'].apply(clean_tweet)

# Show cleaned data
print(Data[['id', 'label', 'cleaned_tweet']].head())



   id  label                                      cleaned_tweet
0   1      0   when a father is dysfunctional and is so self...
1   2      0    thanks for lyft credit i cant use cause they...
2   3      0                                bihday your majesty
3   4      0  model   i love u take with u all the time in u...
4   5      0               factsguide society now    motivation
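
The cleaned output above still contains runs of spaces where mentions and `#` symbols were removed. A minimal sketch of one way to tidy this, assuming a helper named `collapse_whitespace` (a hypothetical name, not part of the original notebook) that could be applied after `clean_tweet`:

```python
import re

def collapse_whitespace(text):
    # Replace every run of whitespace with a single space, then trim both ends.
    return re.sub(r'\s+', ' ', text).strip()

sample = ' model   i love u take with u all the time'
print(collapse_whitespace(sample))
```

Whether this step matters depends on the downstream tokenizer; many tokenizers already split on arbitrary whitespace, so it mainly makes the cleaned column easier to read.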


## Featurization in Machine Learning and NLP

Featurization is a critical process in machine learning and natural language processing (NLP), involving the transformation of raw data into a structured format that models can understand and process effectively. This process enables the models to learn from the data and make predictions or classifications based on it.

### Traditional Machine Learning Models

For traditional machine learning models, featurization typically involves creating vectors of numbers that represent the data. These vectors are constructed through various techniques, each designed to capture different aspects of the data's structure and meaning:

- **Bag of Words (BoW)**: This technique represents text data as a bag (multiset) of its words, disregarding grammar and word order but keeping multiplicity. It involves counting the occurrence of words within a document, which results in a sparse matrix representation where each row corresponds to a document and each column to a word in the dataset's vocabulary.

- **TF-IDF (Term Frequency-Inverse Document Frequency)**: TF-IDF goes a step further than BoW by considering not only the frequency of words in a single document but also how unique these words are across all documents in the corpus. It aims to highlight words that are frequent in a document but not common in the entire dataset, providing a more nuanced representation of the text data.
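
The two techniques above can be sketched in plain Python. This is an illustration only: the three short documents are made up for the example, and the weighting uses the plain textbook formula (term frequency times `log(N / df)`). Libraries such as scikit-learn apply smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, loosely modelled on the cleaned tweets.
docs = [
    "thanks for lyft credit",
    "i love u take with u all the time",
    "thanks for the love",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# Bag of Words: one row per document, one column per vocabulary word.
bow = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]

# Document frequency: number of documents that contain each word.
df = {word: sum(word in tokens for tokens in tokenized) for word in vocab}

def tf_idf(tokens):
    # Term frequency (count / document length) times inverse document frequency.
    counts = Counter(tokens)
    n_docs = len(tokenized)
    return {
        word: (count / len(tokens)) * math.log(n_docs / df[word])
        for word, count in counts.items()
    }

weights = tf_idf(tokenized[0])
print(vocab)
print(bow[1])
print(weights)
```

In the first document, `lyft` appears in only one document of the corpus while `thanks` appears in two, so `lyft` receives the higher TF-IDF weight even though both occur once.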

### Transformers and BertTokenizer

With the advent of transformer models, featurization has taken on new dimensions. In the context of transformers, featurization is handled through sophisticated tokenization and encoding steps, primarily facilitated by tools like the `BertTokenizer`. This involves several key processes:

- **Tokenization**: Raw text is split into tokens (words or subwords), which are then mapped to numerical IDs from the model's predefined vocabulary. This step converts text into a sequence of numbers, making it computationally tractable for the model.

- **Encoding**: Additional steps, such as adding special tokens (`[CLS]`, `[SEP]`, `[PAD]`), padding sequences to a fixed length, and creating attention masks, are performed. These processes ensure that the model can correctly interpret the structure and content of the input data, focusing on meaningful tokens and ignoring padding when necessary.

This approach to featurization leverages the inherent capabilities of transformer models to understand and process text data deeply, capturing contextual relationships and nuances that were challenging to represent with traditional techniques.
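
To make the tokenization and encoding steps concrete, here is a toy sketch that does not use the `transformers` library at all. The vocabulary, the `encode` function, and the whitespace split are all assumptions made for illustration; the real `BertTokenizer` uses a much larger vocabulary and subword (WordPiece) splitting.

```python
# Hypothetical toy vocabulary; a real tokenizer's vocabulary is far larger.
vocab = {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3,
         'thanks': 4, 'for': 5, 'the': 6, 'credit': 7}

def encode(text, max_length=8):
    # Whitespace split stands in for subword tokenization.
    tokens = ['[CLS]'] + text.lower().split() + ['[SEP]']
    # Truncate if needed, keeping [SEP] as the final token.
    if len(tokens) > max_length:
        tokens = tokens[:max_length - 1] + ['[SEP]']
    # Map tokens to IDs; unknown words fall back to [UNK].
    input_ids = [vocab.get(tok, vocab['[UNK]']) for tok in tokens]
    attention_mask = [1] * len(input_ids)
    # Pad up to max_length; padded positions get a mask value of 0.
    n_pad = max_length - len(input_ids)
    input_ids += [vocab['[PAD]']] * n_pad
    attention_mask += [0] * n_pad
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

encoded = encode('thanks for the lyft credit')
print(encoded['input_ids'])       # [2, 4, 5, 6, 1, 7, 3, 0]
print(encoded['attention_mask'])  # [1, 1, 1, 1, 1, 1, 1, 0]
```

The word `lyft` is not in the toy vocabulary, so it maps to `[UNK]`; a subword tokenizer would usually split such a word into known pieces instead.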


In [7]:
from transformers import BertTokenizer
from torch.utils.data import Dataset, DataLoader
import torch


class CustomTweetDataset(Dataset):
    def __init__(self, tweets, labels, tokenizer, max_token_len=128):
        self.tweets = tweets
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_token_len = max_token_len
        
    def __len__(self):
        return len(self.tweets)
    
    def __getitem__(self, index):
        tweet = self.tweets[index]
        label = self.labels[index]
        
        encoding = self.tokenizer.encode_plus(
            tweet,
            add_special_tokens=True,
            max_length=self.max_token_len,
            return_token_type_ids=False,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Prepare the dataset
tweets = Data.cleaned_tweet.to_numpy()
labels = Data.label.to_numpy()
dataset = CustomTweetDataset(tweets=tweets, labels=labels, tokenizer=tokenizer)

# Create a DataLoader; num_workers=0 loads data in the main process, which simplifies troubleshooting
loader = DataLoader(dataset, batch_size=32, num_workers=0)


### Preparing a Dataset for BERT Model Training

The code snippet demonstrates how to prepare a custom dataset for training a BERT model using PyTorch and the Hugging Face `transformers` library. Here's a step-by-step explanation:

#### Importing Libraries

- `BertTokenizer`: From Hugging Face's `transformers`, used to tokenize text into a format BERT understands.
- `Dataset`, `DataLoader`: From `torch.utils.data`, facilitate custom data handling and batching for training.
- `torch`: The main PyTorch library.

#### CustomTweetDataset Class

- **Purpose**: Extends the `Dataset` class to handle the specifics of our text data (tweets).
- **Initialization Parameters**:
  - `tweets`: Array of tweet texts.
  - `labels`: Array of labels corresponding to each tweet.
  - `tokenizer`: Instance of `BertTokenizer`.
  - `max_token_len`: Maximum length for tokenized tweet sequences.
- **Methods**:
  - `__len__`: Returns the total number of tweets in the dataset.
  - `__getitem__`: Fetches a single processed item from the dataset by index. It tokenizes the tweet, applies padding and truncation, and converts it to tensors.
    - **Encoding**:
      - `add_special_tokens`: Adds tokens like [CLS] and [SEP] necessary for BERT.
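      - `max_length`: Sets the maximum sequence length (`max_token_len`, 128 by default).
      - `return_token_type_ids`: Set to `False` to omit token type IDs, which are only needed to tell apart the two segments of a sentence-pair input.
      - `padding` and `truncation`: Pad shorter sequences and truncate longer ones to `max_length`, so every sequence has the same length.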
      - `return_attention_mask`: Generates a mask to distinguish real tokens from padding.
      - `return_tensors='pt'`: Returns PyTorch tensors.
    - **Return Value**: A dictionary with `input_ids`, `attention_mask`, and `labels` for the given tweet.

#### Tokenizer Initialization

- **BERT Tokenizer**: Loaded with `from_pretrained('bert-base-uncased')`. The uncased variant lowercases the input text before tokenizing it.

#### Dataset Preparation

- Converts tweet texts and labels from the DataFrame `Data` into numpy arrays for processing.
- Initializes the `CustomTweetDataset` with these arrays and the tokenizer.

#### DataLoader Creation

- Wraps the dataset in a `DataLoader` for efficient batch processing during model training.
- `batch_size=32`: Determines how many items are processed together as a batch.
- `num_workers=0`: For troubleshooting, this setting avoids parallel data loading to simplify debugging.

This setup covers every step from tokenization to batch loading and produces inputs in the format BERT expects.


In [8]:
for batch in loader:
    print(batch["input_ids"].shape, batch["attention_mask"].shape, batch["labels"].shape)
    break  # This should exit after the first batch


torch.Size([32, 128]) torch.Size([32, 128]) torch.Size([32])


### Understanding DataLoader Output

The output from the DataLoader indicates successful batch preparation. Let's dissect the output for a comprehensive understanding:

#### 1. **`torch.Size([32, 128])` for `input_ids`:**
   - **Batch Size**: 32 tweets per batch.
   - **Sequence Length**: Each tweet is padded or truncated to exactly 128 token IDs.
   - **Content**: Includes the IDs of the actual words/subwords (BERT uses a subword tokenizer), the special tokens `[CLS]` and `[SEP]`, and any padding needed to reach the uniform length of 128.

#### 2. **`torch.Size([32, 128])` for `attention_mask`:**
   - **Batch Size and Sequence Length**: Mirrors the `input_ids`, with 32 attention masks, each 128 positions long.
   - **Function**: Indicates which positions the model should attend to: 1 for real tokens (including `[CLS]` and `[SEP]`) and 0 for padding tokens, which are ignored.

#### 3. **`torch.Size([32])` for `labels`:**
   - **Batch Size**: Corresponds to the 32 labels in each batch, one for each tweet.
   - **Purpose**: The target the model learns to predict during training. In this dataset the `label` column is binary (0 or 1) and marks whether a tweet is classified as hate speech.
