#### **Using Google BERT for Text Classification on Wikipedia Data**
#### Author: Jackson Guthrie
#### Last Updated: May 2024

##### The goal of this notebook is to implement Google BERT on a publicly available dataset for a high-cardinality multiclass text classification problem. This Wikipedia dataset contains a column (`profession`) with 70 classes, which should be sufficient for high cardinality.

#### Load Necessary Libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from sklearn.model_selection import train_test_split

for dirname, _, filenames in os.walk('C:/Users/17577/code/kaggle_input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# TensorFlow does not detect a GPU on Windows 10 (this returns an empty list). Might retry on macOS or Linux.
gpus = tf.config.list_physical_devices('GPU')
gpus

C:/Users/17577/code/kaggle_input\DBPEDIA_test.csv
C:/Users/17577/code/kaggle_input\DBPEDIA_train.csv
C:/Users/17577/code/kaggle_input\DBPEDIA_val.csv


[]

#### Read Data Into Python

In [8]:
# Load training data
train_path = 'C:/Users/17577/code/kaggle_input/DBPEDIA_train.csv'
df = pd.read_csv(train_path)
df.rename(columns = {'l1':'class', 'l2':'profession', 'l3':'type'}, inplace = True)

# Load testing data
test_path = 'C:/Users/17577/code/kaggle_input/DBPEDIA_test.csv'
test = pd.read_csv(test_path)
test.rename(columns = {'l1':'class', 'l2':'profession', 'l3':'type'}, inplace = True)

# Size datasets
print("Shape of Train:{}\nShape of Test:{}".format(df.shape, test.shape))

Shape of Train:(240942, 4)
Shape of Test:(60794, 4)


#### EDA on the Data

In [9]:
df.head()

,text,class,profession,type
0,"William Alexander Massey (October 7, 1856 – Ma...",Agent,Politician,Senator
1,Lions is the sixth studio album by American ro...,Work,MusicalWork,Album
2,"Pirqa (Aymara and Quechua for wall, hispaniciz...",Place,NaturalPlace,Mountain
3,Cancer Prevention Research is a biweekly peer-...,Work,PeriodicalLiterature,AcademicJournal
4,The Princeton University Chapel is located on ...,Place,Building,HistoricBuilding


In [10]:
df['profession'].value_counts()

profession
Athlete             31111
Person              19504
Animal              14682
Building            10704
Politician           9504
                    ...  
MusicalArtist         198
RaceTrack             172
ComicsCharacter       144
VolleyballPlayer      137
Database              129
Name: count, Length: 70, dtype: int64

Note: The classes are heavily imbalanced (31,111 Athlete rows versus 129 Database rows), so the train/validation split needs to be stratified by label rather than drawn uniformly at random.
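One common way to account for this kind of imbalance during training is to weight each class inversely to its frequency. The sketch below is an assumption about how that could be done, not a step this notebook takes. It applies the "balanced" heuristic `n_samples / (n_classes * count)` to a toy list standing in for `df['profession']`; the function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {label: n_samples / (n_classes * count)} for every label seen."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Toy stand-in for df['profession']; the real column has 70 classes.
toy_labels = ['Athlete'] * 6 + ['Person'] * 3 + ['Database'] * 1
weights = balanced_class_weights(toy_labels)
print(weights)  # rarer classes receive larger weights
```

With the real data, a dictionary like this keyed by the integer label could be passed to Keras through the `class_weight` argument of `model.fit`.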

#### Encoding the Professions

In [11]:
possible_labels = df.profession.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
label_dict

{'Politician': 0,
 'MusicalWork': 1,
 'NaturalPlace': 2,
 'PeriodicalLiterature': 3,
 'Building': 4,
 'Animal': 5,
 'Organisation': 6,
 'Person': 7,
 'Athlete': 8,
 'Settlement': 9,
 'LegalCase': 10,
 'MotorcycleRider': 11,
 'Company': 12,
 'RouteOfTransportation': 13,
 'SocietalEvent': 14,
 'WinterSportPlayer': 15,
 'ClericalAdministrativeRegion': 16,
 'EducationalInstitution': 17,
 'BodyOfWater': 18,
 'Plant': 19,
 'Infrastructure': 20,
 'FootballLeagueSeason': 21,
 'Actor': 22,
 'SportsManager': 23,
 'Cleric': 24,
 'Boxer': 25,
 'Cartoon': 26,
 'Venue': 27,
 'Artist': 28,
 'Tournament': 29,
 'Coach': 30,
 'ComicsCharacter': 31,
 'Olympics': 32,
 'SportsTeamSeason': 33,
 'Software': 34,
 'Group': 35,
 'Broadcaster': 36,
 'Tower': 37,
 'Race': 38,
 'SportFacility': 39,
 'SportsTeam': 40,
 'SportsEvent': 41,
 'Eukaryote': 42,
 'Scientist': 43,
 'CelestialBody': 44,
 'Engine': 45,
 'BritishRoyalty': 46,
 'Satellite': 47,
 'Comic': 48,
 'WrittenWork': 49,
 'FictionalCharacter': 50,
 'Pre
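Because `label_dict` is built from `unique()`, which returns values in order of first appearance, the integer assigned to each profession depends on row order. A minimal sketch, assuming a reproducible numbering is wanted: sort the names first and keep an inverse mapping for decoding predictions later. The names `build_label_maps`, `label_to_id` and `id_to_label` are hypothetical.

```python
def build_label_maps(names):
    """Assign integer ids to the sorted unique names and return both directions."""
    ordered = sorted(set(names))
    label_to_id = {name: idx for idx, name in enumerate(ordered)}
    id_to_label = {idx: name for name, idx in label_to_id.items()}
    return label_to_id, id_to_label

# Toy input; in the notebook this would be df['profession'].
label_to_id, id_to_label = build_label_maps(
    ['Politician', 'MusicalWork', 'Athlete', 'Politician']
)
print(label_to_id)
print(id_to_label[2])
```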

In [12]:
# Add the integer label to the data; Series.map does a direct dictionary lookup
df['label'] = df.profession.map(label_dict)


In [13]:
# Split the data, stratified by the label since the labels are not balanced. 
x_train, x_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size = 0.2,
    random_state = 24,
    stratify = df.label.values
)

df['data_split'] = 'not_set'
df.loc[x_train, 'data_split'] = 'train'
df.loc[x_val, 'data_split'] = 'val'
df.groupby(['profession', 'label', 'data_split']).count()

profession,label,data_split,text,class,type
Actor,22,train,943,943,943
Actor,22,val,236,236,236
AmusementParkAttraction,54,train,385,385,385
AmusementParkAttraction,54,val,96,96,96
Animal,5,train,11745,11745,11745
...,...,...,...,...,...
Wrestler,63,val,61,61,61
Writer,55,train,860,860,860
Writer,55,val,215,215,215
WrittenWork,49,train,1232,1232,1232
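The counts above show the effect of `stratify`: Actor, for example, has 943 training rows and 236 validation rows, so about 20% of that class is held out. To illustrate the idea, here is a simplified stand-in (not scikit-learn's implementation) that shuffles the indices of each label separately and holds out the same fraction of each. The function name `stratified_split` is hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=24):
    """Split row indices so each label keeps roughly the same train/val ratio."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # per-class holdout size
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

toy = ['Actor'] * 10 + ['Writer'] * 5
train_idx, val_idx = stratified_split(toy)
print(len(train_idx), len(val_idx))
```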


#### Have TensorFlow and BERT Tokenize the Data

In [14]:
bert_model_name = 'small_bert/bert_en_uncased_L-4_H-512_A-8' 

map_name_to_handle = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_base/2',
    'electra_small':
        'https://tfhub.dev/google/electra_small/2',
    'electra_base':
        'https://tfhub.dev/google/electra_base/2',
    'experts_pubmed':
        'https://tfhub.dev/google/experts/bert/pubmed/2',
    'experts_wiki_books':
        'https://tfhub.dev/google/experts/bert/wiki_books/2',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1',
}

map_model_to_preprocess = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'electra_small':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'electra_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_pubmed':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_wiki_books':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
}

tfhub_handle_encoder = map_name_to_handle[bert_model_name]
tfhub_handle_preprocess = map_model_to_preprocess[bert_model_name]

print(f'BERT model selected           : {tfhub_handle_encoder}')
print(f'Preprocess model auto-selected: {tfhub_handle_preprocess}')

BERT model selected           : https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1
Preprocess model auto-selected: https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3
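The encoder name carries its size: in the BERT naming convention, `L` is the number of transformer layers, `H` the hidden size, and `A` the number of attention heads. A small sketch for pulling those numbers out of a name is shown here; `parse_bert_name` is a hypothetical helper, not part of TensorFlow Hub.

```python
import re

def parse_bert_name(name):
    """Extract layers (L), hidden size (H) and attention heads (A) from a model name."""
    match = re.search(r'L-(\d+)_H-(\d+)_A-(\d+)', name)
    if match is None:
        return None  # names such as 'albert_en_base' carry no size suffix
    layers, hidden, heads = (int(group) for group in match.groups())
    return {'layers': layers, 'hidden_size': hidden, 'attention_heads': heads}

config = parse_bert_name('small_bert/bert_en_uncased_L-4_H-512_A-8')
print(config)
```

The hidden size matters for memory: every token position is represented by a vector of that width.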


In [15]:
# Load preprocessing model
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)

In [16]:
# Preprocess Data
text_preprocessed = bert_preprocess_model(df['text'])

In [17]:
# Load BERT Model
bert_model = hub.KerasLayer(tfhub_handle_encoder)

In [18]:
# Encode Text
bert_results = bert_model(text_preprocessed)

print(f'Loaded BERT: {tfhub_handle_encoder}')
print(f'Pooled Outputs Shape:{bert_results["pooled_output"].shape}')
print(f'Pooled Outputs Values:{bert_results["pooled_output"][0, :12]}')
print(f'Sequence Outputs Shape:{bert_results["sequence_output"].shape}')
print(f'Sequence Outputs Values:{bert_results["sequence_output"][0, :12]}')

ResourceExhaustedError: Exception encountered when calling layer "keras_layer_1" "                 f"(type KerasLayer).

Graph execution error:

OOM when allocating tensor with shape[30840576,512] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
	 [[{{node word_embeddings/Gather}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_restored_function_body_27828]

Call arguments received by layer "keras_layer_1" "                 f"(type KerasLayer):
  • inputs={'input_mask': 'tf.Tensor(shape=(240942, 128), dtype=int32)', 'input_type_ids': 'tf.Tensor(shape=(240942, 128), dtype=int32)', 'input_word_ids': 'tf.Tensor(shape=(240942, 128), dtype=int32)'}
  • training=None
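
The failing allocation can be checked by hand. The preprocessor pads every row to 128 tokens, so 240,942 rows give 240,942 × 128 = 30,840,576 token positions, each needing a 512-wide float32 embedding. The sketch below works out that size and shows one possible way around it: feeding the text through in fixed-size batches. `embedding_bytes`, `iter_batches` and the batch size of 32 are assumptions for illustration, not part of the original notebook.

```python
def embedding_bytes(n_rows, seq_len=128, hidden_size=512, bytes_per_float=4):
    """Bytes needed for one float32 tensor of shape [n_rows * seq_len, hidden_size]."""
    return n_rows * seq_len * hidden_size * bytes_per_float

def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

full = embedding_bytes(240942)   # every training row in a single call
per_batch = embedding_bytes(32)  # one batch of 32 rows
print(f'All rows at once: {full / 2**30:.1f} GiB')
print(f'Batch of 32     : {per_batch / 2**20:.1f} MiB')

# Hypothetical usage with the layers loaded above (not run here):
# for batch in iter_batches(df['text'].tolist(), 32):
#     outputs = bert_model(bert_preprocess_model(tf.constant(batch)))
```

This covers only the embedding lookup that failed; later layers need their own activations, so the real per-batch footprint is larger. Each batch's `pooled_output` could be collected and concatenated, or the two layers could be placed inside a Keras model so that `fit` and `predict` handle the batching.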

#### Build a fine-tuned model