# Exploratory Data Analysis (EDA) Notebook
This notebook provides an exploratory data analysis (EDA) of the dataset, followed by a baseline model. It includes:
- **Data Loading and Inspection**: Understanding the structure of the data.
- **Missing Value Analysis**: Detecting and handling missing data.
- **Statistical Summary and Data Insights**: Gaining insights from summary statistics and visualizations.
- **Data Cleaning and Preparation**: Removing or adjusting problematic data points.
- **Model Building and Training**: Fine-tuning a BERT-based classifier that combines text with bounding box coordinates.


## Import Libraries

In this section, we import the libraries used throughout the notebook: Pandas for data manipulation, `os` for file paths, PyTorch for building and training the neural network model, Transformers for the BERT model and tokenizer, and tqdm for progress bars.

In [16]:
import pandas as pd
import os
import torch.nn as nn
from transformers import BertModel
from transformers import BertTokenizer
import torch
from torch.utils.data import DataLoader, Dataset
from torch.optim import Adam
from tqdm import tqdm


## Load a Sample File and Inspect the Data

Here, we load one sample annotation file to inspect its structure and fields. Although the path variable refers to TSV, the file is parsed with a comma delimiter. Inspecting it shows what information is available, such as text transcripts and bounding box coordinates, so we can plan further data processing steps.

In [17]:

# Define the path to the folder containing the annotation files
path_to_tsv_folder = 'dataset/train/boxes_transcripts_labels'

# Pick the first file returned by os.listdir (its order is not guaranteed)
# and load it with comma as the delimiter
sample_file_path = os.path.join(path_to_tsv_folder, os.listdir(path_to_tsv_folder)[0])
sample_data = pd.read_csv(sample_file_path, sep=',', names=[
    'start_index', 'end_index', 'x_top_left', 'y_top_left', 
    'x_bottom_right', 'y_bottom_right', 'transcript', 'field'
])

# Display the loaded data
print("Sample Data with corrected delimiter:")
print(sample_data.head())
print(sample_data.info())


Sample Data with corrected delimiter:
   start_index  end_index  x_top_left  y_top_left  x_bottom_right  \
0           33         33         215           4             227   
1           35         44         235           3             308   
2           46         51         311           3             349   
3           53         60         352           3             401   
4           62         67         404           3             457   

   y_bottom_right  transcript  field  
0              21           a  OTHER  
1              21  Employee's  OTHER  
2              20      social  OTHER  
3              20    security  OTHER  
4              21      number  OTHER  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 520 entries, 0 to 519
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   start_index     520 non-null    int64 
 1   end_index       520 non-null    int64 
 2   x_top_left      520 non-null
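
Because the folder name suggests TSV while the file is read with commas, it can be worth checking the delimiter before loading. The following is a minimal sketch, not part of the original notebook: `guess_delimiter` is a hypothetical helper, and the sample line is assembled from the first row shown above.

```python
def guess_delimiter(first_line, candidates=(",", "\t")):
    """Return the candidate delimiter that occurs most often in the line."""
    counts = {delim: first_line.count(delim) for delim in candidates}
    return max(counts, key=counts.get)


# Sample line built from the first row of the output above (illustrative only)
sample_line = "33,33,215,4,227,21,a,OTHER"
delimiter = guess_delimiter(sample_line)
print(repr(delimiter))
```

In practice the first line of the real file would be passed in, and the result handed to `pd.read_csv` as `sep`. A simple count like this can be misled by transcripts that themselves contain commas, so it is only a quick sanity check.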

## Data Analysis

In this section, we analyze key characteristics of the data, including the distribution of different entity types (fields), the average length of transcripts, and the size of bounding boxes. This information can be useful in selecting model parameters and understanding the data better.

In [18]:
# Check distribution of 'field' values
field_distribution = sample_data['field'].value_counts()
print("Field Distribution:\n", field_distribution)


Field Distribution:
 field
OTHER                                 491
employerAddressStreet_name              5
employerName                            3
box4SocialSecurityTaxWithheld           3
box17StateIncomeTax                     3
box1WagesTipsAndOtherCompensations      2
box3SocialSecurityWages                 2
employeeName                            2
box16StateWagesTips                     2
ssnOfEmployee                           1
einEmployerIdentificationNumber         1
box2FederalIncomeTaxWithheld            1
employerAddressZip                      1
employerAddressCity                     1
employerAddressState                    1
taxYear                                 1
Name: count, dtype: int64
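
The distribution is heavily skewed toward `OTHER`. One common way to account for this when selecting model parameters is inverse-frequency class weighting. The sketch below is an assumption on our part and is not used later in this notebook; the counts are copied from the output above.

```python
# Counts copied from the field distribution printed above
counts = {
    "OTHER": 491,
    "employerAddressStreet_name": 5,
    "employerName": 3,
    "box4SocialSecurityTaxWithheld": 3,
    "box17StateIncomeTax": 3,
    "box1WagesTipsAndOtherCompensations": 2,
    "box3SocialSecurityWages": 2,
    "employeeName": 2,
    "box16StateWagesTips": 2,
    "ssnOfEmployee": 1,
    "einEmployerIdentificationNumber": 1,
    "box2FederalIncomeTaxWithheld": 1,
    "employerAddressZip": 1,
    "employerAddressCity": 1,
    "employerAddressState": 1,
    "taxYear": 1,
}

total = sum(counts.values())
num_classes = len(counts)

# Share of rows labelled OTHER
other_share = counts["OTHER"] / total

# Inverse-frequency weights: rare classes get larger weights
class_weights = {name: total / (num_classes * n) for name, n in counts.items()}

print(f"OTHER share: {other_share:.1%}")
print(f"weight for taxYear: {class_weights['taxYear']:.2f}")
```

Weights like these could be passed, in label-id order, to the `weight` argument of `nn.CrossEntropyLoss`. Whether that helps on this dataset would need to be verified.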


In [19]:
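# Transcript length in characters (each row holds one OCR token);
# .str.len() also tolerates missing transcripts, unlike apply(len)
sample_data['transcript_length'] = sample_data['transcript'].str.len()
print("Average token length:", sample_data['transcript_length'].mean())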

# Bounding box area calculation
sample_data['bbox_area'] = (sample_data['x_bottom_right'] - sample_data['x_top_left']) * \
                           (sample_data['y_bottom_right'] - sample_data['y_top_left'])
print("Bounding Box Area Stats:\n", sample_data['bbox_area'].describe())


Average token length: 4.961538461538462
Bounding Box Area Stats:
 count     520.000000
mean      908.030769
std       874.764283
min        84.000000
25%       374.750000
50%       674.000000
75%      1073.250000
max      7950.000000
Name: bbox_area, dtype: float64
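
The same two features can double as sanity checks. The sketch below uses a made-up three-row frame (the values are not from the dataset) to show how a missing transcript and boxes with non-positive width would surface.

```python
import pandas as pd

# Toy rows, invented for illustration: one missing transcript, two bad boxes
toy = pd.DataFrame({
    "x_top_left": [10, 50, 90],
    "y_top_left": [5, 5, 5],
    "x_bottom_right": [40, 50, 80],
    "y_bottom_right": [20, 20, 20],
    "transcript": ["a", None, "abc"],
})

# Character length; missing transcripts become NaN instead of raising
toy["transcript_length"] = toy["transcript"].str.len()

width = toy["x_bottom_right"] - toy["x_top_left"]
height = toy["y_bottom_right"] - toy["y_top_left"]
toy["bbox_area"] = width * height

# Boxes with zero or negative extent are likely annotation errors
degenerate = toy[(width <= 0) | (height <= 0)]
print(degenerate[["x_top_left", "x_bottom_right", "bbox_area"]])
```

In the sample file above the minimum area is 84, so no degenerate boxes appear there. Other files may differ.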


## Data Cleaning and Preparation

This section focuses on cleaning and preparing the data by standardizing text (e.g., making it lowercase), grouping transcripts for multi-token entities, and making the data ready for tokenization. This step ensures consistency and accuracy in the data fed into the model.

In [20]:
# Clean transcript tokens
sample_data['transcript'] = sample_data['transcript'].str.lower().str.strip()


In [21]:
# Group by field to combine multi-token entities (tokens are joined in row order,
# so separate occurrences of the same field end up in one string)
grouped_data = sample_data.groupby('field')['transcript'].apply(' '.join).reset_index()
print("Grouped Data:\n", grouped_data.head(10))


Grouped Data:
                                 field  \
0                               OTHER   
1                 box16StateWagesTips   
2                 box17StateIncomeTax   
3  box1WagesTipsAndOtherCompensations   
4        box2FederalIncomeTaxWithheld   
5             box3SocialSecurityWages   
6       box4SocialSecurityTaxWithheld   
7     einEmployerIdentificationNumber   
8                        employeeName   
9                 employerAddressCity   

                                          transcript  
0  a employee's social security number safe, accu...  
1                                          20287. 85  
2                                          1690 . 44  
3                                          41669. 07  
4                                           11182.93  
5                                          53826. 13  
6                                           4117 . 7  
7                                         37-3493491  
8                                   st
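
Grouping by `field` alone merges every token with that label, even when the tokens belong to separate occurrences on the page. As an alternative, the sketch below groups only consecutive rows that share a label. It is our own variation, not the notebook's method, and the toy rows are invented for illustration.

```python
import pandas as pd

# Toy rows, invented for illustration
toy = pd.DataFrame({
    "transcript": ["name", "acme", "corp", "address", "acme"],
    "field": ["OTHER", "employerName", "employerName", "OTHER", "employerName"],
})

# A new run starts whenever the label differs from the previous row
run_id = (toy["field"] != toy["field"].shift()).cumsum().rename("run_id")

# Join the tokens inside each run, keeping the run's label
spans = (
    toy.groupby(run_id)
       .agg(field=("field", "first"), transcript=("transcript", " ".join))
       .reset_index(drop=True)
)
print(spans)
```

This relies on the rows being stored in reading order, which the increasing `start_index` values in the sample suggest but do not guarantee.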

In [22]:
# Recompute the bounding box area feature (same formula as in the analysis cell above)
sample_data['bbox_area'] = (sample_data['x_bottom_right'] - sample_data['x_top_left']) * \
                           (sample_data['y_bottom_right'] - sample_data['y_top_left'])


In [23]:
print(sample_data['bbox_area'])

0       204
1      1314
2       646
3       833
4       954
       ... 
515     442
516     680
517     289
518     774
519     540
Name: bbox_area, Length: 520, dtype: int64


## Model Building

## Label Mapping

We define a mapping of field names (entities) to integer labels, which is necessary for model training. The model will learn to associate these integer labels with different fields in the dataset, enabling it to classify new data points into the correct categories.

In [24]:
# Define the label mapping for the entity labels in your dataset
label_map = {
    'employerName': 0,
    'employerAddressStreet_name': 1,
    'employerAddressCity': 2,
    'employerAddressState': 3,
    'employerAddressZip': 4,
    'einEmployerIdentificationNumber': 5,
    'employeeName': 6,
    'ssnOfEmployee': 7,
    'box1WagesTipsAndOtherCompensations': 8,
    'box2FederalIncomeTaxWithheld': 9,
    'box3SocialSecurityWages': 10,
    'box4SocialSecurityTaxWithheld': 11,
    'box16StateWagesTips': 12,
    'box17StateIncomeTax': 13,
    'taxYear': 14,
    'OTHER': 15  # Label for non-entity tokens
}
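
Two small companions to such a mapping are often useful: an inverse mapping for decoding predictions, and a check that every field seen in the data has a label. The sketch below uses a shortened, illustrative list of field names. `unmapped_fields` is a hypothetical helper, and `box99Unknown` is a made-up field name.

```python
# Shortened list for illustration; the full mapping is defined above
field_names = ["employerName", "employeeName", "taxYear", "OTHER"]
label_map = {name: idx for idx, name in enumerate(field_names)}

# Inverse mapping, used to turn predicted ids back into field names
id_to_label = {idx: name for name, idx in label_map.items()}


def unmapped_fields(observed, label_map):
    """Return the observed field names that have no integer label."""
    return sorted(set(observed) - set(label_map))


observed = ["OTHER", "taxYear", "box99Unknown"]  # last name is made up
missing = unmapped_fields(observed, label_map)
print(id_to_label[2], missing)
```

Running such a check over every file before training would surface labels that would otherwise raise a `KeyError` or map to NaN.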


## Tokenization and Coordinate Preparation

This section involves preparing the data for the model by tokenizing transcripts using BERT and creating tensors for bounding box coordinates and labels. The tokenized text and coordinates will be used as inputs to the model.

In [25]:

# Initialize tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the transcripts and prepare coordinate features
def prepare_data(data, label_map):
    # Tokenize all transcripts as one batch, padded to the longest sequence
    tokens = tokenizer(data['transcript'].tolist(), return_tensors="pt", padding=True, truncation=True)
    
    # Prepare bounding box tensor as float so it can be concatenated with BERT embeddings
    bbox = torch.tensor(data[['x_top_left', 'y_top_left', 'x_bottom_right', 'y_bottom_right']].values,
                        dtype=torch.float)
    
    # Map field names to integer labels (every field must be present in label_map)
    labels = torch.tensor(data['field'].map(label_map).values)
    
    return tokens, bbox, labels

# # Example usage
# tokens, bbox, labels = prepare_data(sample_data, label_map)


## Model Definition

We define a custom model that concatenates BERT's pooled output with the four bounding box coordinates and feeds the result to a linear classification layer. This allows the model to use both the text and spatial information to make predictions, which is helpful for tasks involving both text and visual cues.

In [26]:


class BERTWithCoords(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.fc = nn.Linear(self.bert.config.hidden_size + 4, num_labels)  # +4 for bbox coordinates

    def forward(self, input_ids, attention_mask, bbox):
        # Get BERT embeddings
        bert_output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        
        # Concatenate the pooled [CLS] representation with the bounding box features;
        # cast bbox to float so both tensors share a dtype
        combined_features = torch.cat((bert_output.pooler_output, bbox.float()), dim=1)
        
        # Classification layer
        logits = self.fc(combined_features)
        return logits
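
To make the tensor shapes concrete without downloading BERT, the sketch below replaces the pooled output with random values. The hidden size of 768 matches `bert-base-uncased`, the batch size of 8 mirrors the `DataLoader` used later, and the coordinates reuse the first row of the sample.

```python
import torch
import torch.nn as nn

batch_size, hidden_size, num_labels = 8, 768, 16

# Random stand-in for bert_output.pooler_output
pooled = torch.randn(batch_size, hidden_size)

# Integer pixel coordinates, repeated for every item in the batch
bbox = torch.tensor([[215, 4, 227, 21]] * batch_size)

# Cast to float, then append the 4 coordinates to each 768-dim vector
combined = torch.cat((pooled, bbox.float()), dim=1)

fc = nn.Linear(hidden_size + 4, num_labels)
logits = fc(combined)

print(combined.shape, logits.shape)
```

Each sample yields one vector of 772 features and one row of 16 logits, one per entry in `label_map`.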


## Training Loop

Here, we wrap the data in a custom `Dataset`, build a `DataLoader`, and implement the training loop. Each iteration runs a forward pass to generate predictions, calculates the loss, backpropagates, and updates the model parameters. This process is repeated over three epochs to train the model.

In [27]:
# Custom Dataset class
class EntityDataset(Dataset):
    def __init__(self, data, tokenizer, label_map):
        self.data = data
        self.tokenizer = tokenizer
        self.label_map = label_map

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        
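        # Tokenize the transcript; with no max_length given, padding="max_length"
        # pads every sample to the tokenizer's model maximum (512 for bert-base-uncased)
        tokens = self.tokenizer(row['transcript'], return_tensors="pt", padding="max_length", truncation=True)
        
        # Bounding box tensor (float, to match the BERT embeddings) and label
        bbox = torch.tensor([row['x_top_left'], row['y_top_left'], row['x_bottom_right'], row['y_bottom_right']],
                            dtype=torch.float)
        label = torch.tensor(self.label_map[row['field']])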

        return {
            "input_ids": tokens['input_ids'].squeeze(),  # Remove extra dimensions
            "attention_mask": tokens['attention_mask'].squeeze(),
            "bbox": bbox,
            "labels": label
        }

# Initialize DataLoader
dataset = EntityDataset(sample_data, tokenizer, label_map)
data_loader = DataLoader(dataset, batch_size=8, shuffle=True)

# Initialize model and optimizer
model = BERTWithCoords(num_labels=len(label_map))
optimizer = Adam(model.parameters(), lr=1e-5)

# Training loop
def train_model(model, data_loader, optimizer, epochs=3):
    model.train()
    criterion = nn.CrossEntropyLoss()  # create the loss function once, not per batch
    for epoch in range(epochs):
        loop = tqdm(data_loader, leave=True)
        for batch in loop:
            optimizer.zero_grad()
            
            # Get inputs from the batch
            input_ids = batch['input_ids']
            attention_mask = batch['attention_mask']
            bbox = batch['bbox']
            labels = batch['labels']

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, bbox=bbox)
            loss = criterion(outputs, labels)
            
            # Backward pass
            loss.backward()
            optimizer.step()
            
            # Log the loss
            loop.set_description(f'Epoch {epoch}')
            loop.set_postfix(loss=loss.item())

# Start training
train_model(model, data_loader, optimizer)


Epoch 0: 100%|██████████| 65/65 [09:23<00:00,  8.67s/it, loss=118] 
Epoch 1: 100%|██████████| 65/65 [09:19<00:00,  8.61s/it, loss=80.7]
Epoch 2: 100%|██████████| 65/65 [35:19<00:00, 32.61s/it, loss=67.8]
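
The loss shown in the progress bar is that of the most recent batch only, which can be noisy. A per-epoch average is often easier to compare across epochs. The sketch below shows a minimal accumulator. `RunningMean` is our own helper, and the loss values are made up.

```python
class RunningMean:
    """Accumulates values and reports their mean."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1

    @property
    def mean(self):
        # Avoid division by zero before the first update
        return self.total / self.count if self.count else 0.0


batch_losses = [2.0, 1.0, 3.0]  # made-up values standing in for loss.item()
tracker = RunningMean()
for loss_value in batch_losses:
    tracker.update(loss_value)

print(f"mean loss: {tracker.mean:.2f}")
```

Inside a training loop, `tracker.update(loss.item())` would be called once per batch, with a fresh tracker at the start of each epoch.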


In [28]:
# Save the fine-tuned model's weights (state_dict) with a .pth extension
model_save_path = "bert_with_coords_model.pth"
torch.save(model.state_dict(), model_save_path)
print(f"Model saved to {model_save_path}")


Model saved to bert_with_coords_model.pth
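
Since only the `state_dict` is saved, loading it later requires creating the model class first and then restoring the weights. The sketch below shows that round trip with a small `nn.Linear` standing in for the full model, and a temporary file path that is purely illustrative.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Small stand-in for the real model, to keep the example light
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pth")
    torch.save(model.state_dict(), path)

    # Re-create the architecture, then load the saved weights into it
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

restored.eval()  # switch to inference mode before making predictions
weights_match = torch.equal(model.weight, restored.weight)
print(weights_match)
```

For the notebook's model, the same pattern would apply with `BERTWithCoords(num_labels=len(label_map))` in place of `nn.Linear(4, 2)`.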
