<a href="https://colab.research.google.com/github/jaishrm07/deep-learning-projects/blob/main/BERT_based_ner_consumer_complaints.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers datasets

Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.1-py3-none-any.whl (471 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)

In [2]:
!pip install kaggle



In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
cd /content/drive/My Drive/'Colab Notebooks'/bert-based-ner/

/content/drive/My Drive/Colab Notebooks/bert-based-ner


In [5]:
ls '/content/drive/My Drive/Colab Notebooks/'

 00_pytorch_fundamentals.ipynb   bert-based-ner/
 01_pytorch_first_model.ipynb   'with fastAI and PyTorch'/


In [6]:
!ls

BERT-based-ner-consumer-complaints  complaints.csv  complaints.csv.zip


In [7]:
!ls '/content/drive/My Drive/Colab Notebooks/bert-based-ner/'

BERT-based-ner-consumer-complaints  complaints.csv  complaints.csv.zip


In [8]:
dataset_path = '/content/drive/My Drive/Colab Notebooks/bert-based-ner/complaints.csv'

In [9]:
import pandas as pd
df = pd.read_csv(dataset_path)
print(df.head())

  Date received                                            Product  \
0    2024-09-13  Credit reporting or other personal consumer re...   
1    2024-09-13                                    Debt collection   
2    2024-09-14  Credit reporting or other personal consumer re...   
3    2024-09-14  Credit reporting or other personal consumer re...   
4    2024-08-07  Credit reporting or other personal consumer re...   

        Sub-product                                 Issue  \
0  Credit reporting  Incorrect information on your report   
1        Other debt    False statements or representation   
2  Credit reporting           Improper use of your report   
3  Credit reporting  Incorrect information on your report   
4  Credit reporting  Incorrect information on your report   

                                           Sub-issue  \
0                           Account status incorrect   
1                  Attempted to collect wrong amount   
2  Credit inquiries on your report that you 

## Data Pre-Processing

In [10]:
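# Replace missing (NaN) complaint narratives with an empty string.
# Assign the result back: inplace=True on a selected column is chained assignment and is deprecated in pandas 2.x.
df['Consumer complaint narrative'] = df['Consumer complaint narrative'].fillna('')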

In [11]:
# Total number of rows in the DataFrame
total_rows = df.shape[0]

# Count how many rows have blank spaces or empty strings
blank_rows_count = df['Consumer complaint narrative'].str.strip().eq('').sum()

# Print the results
print(f"Total number of rows: {total_rows}")
print(f"Number of rows with blank or empty complaint narratives: {blank_rows_count}")

Total number of rows: 6260768
Number of rows with blank or empty complaint narratives: 4121013


In [12]:
# Fill the empty complaint narratives with combined data from Product, Issue, and Company, handling NaN values
df['Consumer complaint narrative'] = df.apply(
    lambda row: str(row['Product']) + ' ' + str(row['Issue']) + ' ' + str(row['Company'])
    if str(row['Consumer complaint narrative']).strip() == '' else row['Consumer complaint narrative'],
    axis=1
)

# Verify the change by printing the first few rows
#print(df[['Consumer complaint narrative', 'Product', 'Issue', 'Company']].head())
extraction_columns = df[['Consumer complaint narrative', 'Company', 'Product', 'Issue']]
print(extraction_columns.head())

                        Consumer complaint narrative  \
0  Credit reporting or other personal consumer re...   
1  Debt collection False statements or representa...   
2  Credit reporting or other personal consumer re...   
3  Credit reporting or other personal consumer re...   
4  Credit reporting or other personal consumer re...   

                               Company  \
0  Experian Information Solutions Inc.   
1                  SYNCHRONY FINANCIAL   
2  Experian Information Solutions Inc.   
3  Experian Information Solutions Inc.   
4                        EQUIFAX, INC.   

                                             Product  \
0  Credit reporting or other personal consumer re...   
1                                    Debt collection   
2  Credit reporting or other personal consumer re...   
3  Credit reporting or other personal consumer re...   
4  Credit reporting or other personal consumer re...   

                                  Issue  
0  Incorrect information on you
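
The row-wise `df.apply(..., axis=1)` above calls a Python function once per row, which is slow on 6,260,768 rows. A column-wise version of the same fill might look like the sketch below. `fill_blank_narratives` is a hypothetical helper name, and unlike the `str(...)` calls above it treats a missing `Product`, `Issue`, or `Company` as an empty string rather than the text `nan`; that difference is an assumption made for this illustration.

```python
import pandas as pd


def fill_blank_narratives(df: pd.DataFrame) -> pd.DataFrame:
    """Fill blank narratives with 'Product Issue Company' (hypothetical helper)."""
    out = df.copy()
    narrative = out['Consumer complaint narrative'].fillna('')

    # Fallback text built column-wise. Missing fields become '' here
    # (an assumption; the row-wise version would yield the string 'nan').
    fallback = (
        out['Product'].fillna('') + ' '
        + out['Issue'].fillna('') + ' '
        + out['Company'].fillna('')
    )

    # True where the narrative is empty or whitespace only.
    is_blank = narrative.str.strip().eq('')

    # Keep real narratives and substitute the fallback elsewhere.
    out['Consumer complaint narrative'] = narrative.mask(is_blank, fallback)
    return out


demo = pd.DataFrame({
    'Consumer complaint narrative': ['', '   ', 'My report is wrong.'],
    'Product': ['Debt collection', 'Credit reporting', 'Credit reporting'],
    'Issue': ['False statements', 'Improper use', 'Incorrect information'],
    'Company': ['SYNCHRONY FINANCIAL', 'EQUIFAX, INC.', 'EQUIFAX, INC.'],
})
filled = fill_blank_narratives(demo)
print(filled['Consumer complaint narrative'].tolist())
```

The helper returns a copy, so the original frame is left untouched and the two versions can be compared on a small slice before running on the full data.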

### Splitting the Data into Train, Validation, and Test Sets

In [13]:
df.shape

(6260768, 18)

In [14]:
extraction_columns.shape

(6260768, 4)

In [15]:
from sklearn.model_selection import train_test_split

# Split into training and temp (which will be further split into validation and test)
train_df, temp_df = train_test_split(extraction_columns, test_size=0.3, random_state=42)

# Split temp into validation and test sets
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

# Print the sizes of the splits
print(f"Training set: {train_df.shape[0]} rows")
print(f"Validation set: {val_df.shape[0]} rows")
print(f"Test set: {test_df.shape[0]} rows")

Training set: 4382537 rows
Validation set: 939115 rows
Test set: 939116 rows
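
The sizes printed above follow from how `train_test_split` rounds a fractional `test_size`: the test share is rounded up and the remainder goes to the training side. The arithmetic can be reproduced without scikit-learn. `split_sizes` is a hypothetical helper written only for this check.

```python
import math


def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Mirrors the rounding applied when only test_size is given:
    the test share is rounded up, the rest is the training share.
    """
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


total_rows = 6260768
n_train, n_temp = split_sizes(total_rows, 0.3)  # 70% train, 30% held out
n_val, n_test = split_sizes(n_temp, 0.5)        # held-out rows split in half

print(n_train, n_val, n_test)  # 4382537 939115 939116
```

The result is a 70/15/15 split, with the odd held-out row landing in the test set.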


In [16]:
from datasets import Dataset

# Convert pandas DataFrames to Hugging Face Dataset
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

print(train_dataset)

Dataset({
    features: ['Consumer complaint narrative', 'Company', 'Product', 'Issue', '__index_level_0__'],
    num_rows: 4382537
})


In [18]:
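import torch

# A tokenizer runs on the CPU and has no .to() method, so it cannot be moved to the GPU.
# Pick the device here; only the model and the encoded tensors are moved to it later.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')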

In [17]:
from transformers import BertTokenizer

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples['Consumer complaint narrative'], padding="max_length", truncation=True)

# Tokenize the datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Return only the tokenized inputs as PyTorch tensors. set_format hides the other
# columns rather than removing them, and no label column exists yet.
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
val_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

print(train_dataset[0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]



Map:   0%|          | 0/4382537 [00:00<?, ? examples/s]

KeyboardInterrupt: 
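
A rough size estimate suggests why this `map` call was interrupted. With `padding="max_length"` and no explicit `max_length`, every example is padded to the tokenizer's limit, which is 512 tokens for `bert-base-cased`. The sketch below counts the padded tokens for the training split. The 8 bytes per token is an assumption (ids held as int64), and the on-disk cache may use a narrower integer type.

```python
MAX_LENGTH = 512      # padded length per example for bert-base-cased
BYTES_PER_TOKEN = 8   # assumption: ids held as int64
train_rows = 4382537  # training rows reported by the split above

# Each output column (input_ids, token_type_ids, attention_mask)
# holds one value per padded token position.
padded_tokens = train_rows * MAX_LENGTH
gigabytes_per_column = padded_tokens * BYTES_PER_TOKEN / 1e9

print(f"{padded_tokens:,} padded tokens per column")
print(f"about {gigabytes_per_column:.1f} GB per column at {BYTES_PER_TOKEN} bytes per token")
```

Under these assumptions each of the three columns holds over two billion values for the training split alone, so a shorter `max_length` or padding per batch at training time would cut the cost considerably.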

In [None]:
from datasets import Dataset

# Convert your DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df)

# Tokenize using the dataset.map function, which can run in parallel
def tokenize_function(examples):
    return tokenizer(examples['Consumer complaint narrative'], padding='max_length', truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=4)  # Adjust num_proc for parallelism

Map (num_proc=4):   0%|          | 0/6260768 [00:00<?, ? examples/s]
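
Since 4,121,013 of the 6,260,768 narratives were blank before the fill step, one option is to tokenize only a sample of rows that carry a real narrative. The sketch below assumes that is acceptable for the task. `sample_narratives`, `n_samples`, and the seed are hypothetical choices, and the helper is meant to run before the blank narratives are filled in.

```python
import pandas as pd


def sample_narratives(df: pd.DataFrame, n_samples: int, seed: int = 42) -> pd.DataFrame:
    """Return up to n_samples rows whose narrative is non-blank (hypothetical helper)."""
    text = df['Consumer complaint narrative'].fillna('')
    has_text = text.str.strip().ne('')
    candidates = df[has_text]

    # Never ask for more rows than are available.
    n = min(n_samples, len(candidates))

    # Resetting the index keeps a later Dataset.from_pandas call
    # from storing the old row labels as an extra column.
    return candidates.sample(n=n, random_state=seed).reset_index(drop=True)


demo = pd.DataFrame({
    'Consumer complaint narrative': ['first complaint', '', None, 'second complaint', '  '],
    'Company': ['A', 'B', 'C', 'D', 'E'],
})
subset = sample_narratives(demo, n_samples=10)
print(subset)
```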