# Train a BERT Model

## Installation
If you don't have Python and JupyterLab installed on your computer, install them using the following instructions:

[Python](https://docs.python-guide.org/starting/install3/osx/)

[Jupyter Lab](https://jupyter.org/install)

[Oxen AI](https://oxen.ai/?utm_source=tiktok&utm_medium=paid_social&utm_campaign=allydoesdatascience)





## Getting your data

I used [Oxen AI](https://oxen.ai/?utm_source=tiktok&utm_medium=paid_social&utm_campaign=allydoesdatascience) to find and manage my dataset for this project. 

IMDB Movie Reviews Dataset : [Linked Here!](https://www.oxen.ai/ox/IMDB-Movie-Reviews)

Clone this repo or any other Oxen dataset repo to start your project. Follow the Oxen instructions to set up your first project repo.

Next, we start preprocessing. Once your repo has been cloned, you can access the data itself. I used the preprocessing code provided by Oxen to streamline my training even further.

Navigate to the notebook within your cloned repo (IMDB Movie Reviews/code/process_data.ipynb) and run it as is.

In [1]:
# import packages to use during training

import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer
import tensorflow as tf
from transformers import TFBertForSequenceClassification



In [2]:
# read in your processed dataset using pandas

# this path may be different depending where you are in your current working directory
# you can check using the command "pwd" within a notebook cell, then adjust the path accordingly

df = pd.read_csv("IMDB-Movie-Reviews/supervised.csv") 
df.head()

Unnamed: 0,path,fold,label,rating,review_id,url,preview
0,data/test/neg/1821_4.txt,test,neg,4,1821,http://www.imdb.com/title/tt0138541/usercomments,Alan Rickman & Emma Thompson give good perform...
1,data/test/neg/9487_1.txt,test,neg,1,9487,http://www.imdb.com/title/tt0202521/usercomments,I have seen this movie and I did not care for ...
2,data/test/neg/4604_4.txt,test,neg,4,4604,http://www.imdb.com/title/tt0417658/usercomments,In Los Angeles the alcoholic and lazy Hank Ch...
3,data/test/neg/2828_2.txt,test,neg,2,2828,http://www.imdb.com/title/tt0066105/usercomments,"This film is bundled along with ""Gli fumavano ..."
4,data/test/neg/10890_1.txt,test,neg,1,10890,http://www.imdb.com/title/tt0787505/usercomments,I only comment on really very good films and o...


In [3]:
# create a train, test split using an Sklearn function
train, test = train_test_split(df)
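
Called with no extra arguments, `train_test_split` shuffles the rows and holds out 25% of them for testing, and the split changes on every run unless a `random_state` is passed. The following is a minimal pure-Python sketch of that idea, not scikit-learn's implementation; `simple_split` and its parameters are hypothetical names used only for illustration.

```python
import math
import random

def simple_split(rows, test_fraction=0.25, seed=None):
    """Shuffle rows and hold out a fraction for testing (illustrative only)."""
    rng = random.Random(seed)   # a fixed seed makes the split repeatable
    shuffled = list(rows)       # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(8))
train_rows, test_rows = simple_split(rows, seed=42)
print(len(train_rows), len(test_rows))  # 6 2
```

With scikit-learn itself, the equivalent knob is `train_test_split(df, random_state=42)`, which gives the same split each time the notebook is rerun.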

In [4]:
# Load the tokenizer directly from Hugging Face
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)


In [5]:
# function for encoding of the text reviews
def convert_example_to_feature(review):
  return tokenizer.encode_plus(review,
                add_special_tokens = True, # add [CLS], [SEP]
                max_length = max_length, # max length of the text that can go to BERT
                padding = 'max_length', # add [PAD] tokens up to max_length (replaces the deprecated pad_to_max_length)
                truncation = True, # cut reviews that are longer than max_length
                return_attention_mask = True, # add attention mask to not focus on pad tokens
              )

In [6]:
# defining hyperparameters
max_length = 512
batch_size = 6
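
BERT expects fixed-length inputs, so shorter reviews are filled with `[PAD]` ids, longer ones need to be truncated to `max_length`, and the attention mask marks which positions hold real tokens. The sketch below shows that bookkeeping in plain Python on made-up token ids. `pad_and_mask` is a hypothetical helper, not part of the tokenizer's API; `pad_id=0` is assumed to match the `[PAD]` id of `bert-base-uncased`, and the sketch ignores details such as keeping the final `[SEP]` token when truncating.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or pad a list of token ids and build its attention mask."""
    kept = token_ids[:max_length]                    # truncate if too long
    n_pad = max_length - len(kept)                   # number of [PAD] slots to add
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# made-up ids standing in for a short tokenized review
ids, mask = pad_and_mask([101, 2023, 3185, 102], max_length=6)
print(ids)   # [101, 2023, 3185, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```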

In [7]:
# write a function to format each example as the (features dict, label) pair the model expects as input
def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
  return {
      "input_ids": input_ids,
      "token_type_ids": token_type_ids,
      "attention_mask": attention_masks,
  }, label


In [8]:
# putting all the functions together to complete the tokenization process
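def encode_examples(ds, limit=-1):
  input_ids_list = []
  token_type_ids_list = []
  attention_mask_list = []
  label_list = []
  # map string labels to the integer class ids the loss expects
  # (assumes the label column only contains "neg" and "pos")
  label_to_id = {"neg": 0, "pos": 1}
  if (limit > 0):
      ds = ds.head(limit)  # ds is a pandas DataFrame: keep only the first `limit` rows
  # iterate over the DataFrame that was passed in, not the full df
  for path, label in zip(ds['path'], ds['label']):
    with open(path, "r") as f:
      review = f.read()
    bert_input = convert_example_to_feature(review)
    input_ids_list.append(bert_input['input_ids'])
    token_type_ids_list.append(bert_input['token_type_ids'])
    attention_mask_list.append(bert_input['attention_mask'])
    label_list.append([label_to_id[label]])
  return tf.data.Dataset.from_tensor_slices((input_ids_list, attention_mask_list, token_type_ids_list, label_list)).map(map_example_to_dict)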

Before we can start the training process, ensure your working directory is inside the IMDB Movie Reviews folder:
- Type the command "pwd" to confirm
- Use the command "cd [path_to_dir]" to move to the correct directory

Ex. cd IMDB-Movie-Reviews

If you are not in the correct spot, this code will throw an error saying it can't find your data.
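
The same check can be done from a notebook cell before encoding anything. The sketch below prints the working directory and reports which relative paths cannot be found from it; `find_missing` is a hypothetical helper, and the sample path is simply the first row of the `path` column shown earlier.

```python
import os
from pathlib import Path

def find_missing(paths, base_dir="."):
    """Return the relative paths that do not exist as files under base_dir."""
    base = Path(base_dir)
    return [p for p in paths if not (base / p).is_file()]

print("Working directory:", os.getcwd())

# sample taken from the first row above; in the notebook you could pass df['path'].head()
sample_paths = ["data/test/neg/1821_4.txt"]
missing = find_missing(sample_paths)
if missing:
    print("Not found from here:", missing)
```

An empty result means the review files are reachable and the encoding step should be able to open them.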

In [13]:
# train dataset
ds_train_encoded = encode_examples(train).shuffle(10000).batch(batch_size)
# test dataset
ds_test_encoded = encode_examples(test).batch(batch_size)

In [13]:
# recommended learning rate for Adam 5e-5, 3e-5, 2e-5
learning_rate = 2e-5
# we will do just 1 epoch, though multiple epochs might be better as long as the model does not overfit
number_of_epochs = 1
# model initialization
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')
# choosing Adam optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
# labels are integer class ids, not one-hot vectors, so we use sparse categorical cross entropy and accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
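
The log above notes that the classifier weights are newly initialized, so the compiled model still has to be fit on the encoded training data before its predictions mean anything. The cells shown here do not include that call, so the following is only a sketch of what it might look like, assuming the standard Keras `fit` method. `run_training` is a hypothetical wrapper, and the arguments in the commented call are the names defined in the earlier cells.

```python
def run_training(model, train_ds, val_ds, epochs=1):
    """Fit a compiled Keras-style model and return its training history."""
    # validation_data makes Keras report loss and accuracy on the held-out set each epoch
    return model.fit(train_ds, epochs=epochs, validation_data=val_ds)

# In the notebook this would be called as (not executed here):
# history = run_training(model, ds_train_encoded, ds_test_encoded, number_of_epochs)
```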


In [14]:
test_sentence = "This is a really good movie. I loved it and will watch again"

predict_input = tokenizer.encode(test_sentence,
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")

tf_output = model.predict(predict_input)[0]
tf_prediction = tf.nn.softmax(tf_output, axis=1)
labels = ['Negative','Positive'] #(0:negative, 1:positive)
label = tf.argmax(tf_prediction, axis=1)
label = label.numpy()
print(labels[label[0]])

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Positive
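
The model returns one raw score (logit) per class; softmax turns those scores into probabilities and argmax picks the most likely class. Here is a plain-Python sketch of that last step, using made-up logits rather than real model output:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ['Negative', 'Positive']  # (0:negative, 1:positive)
logits = [-1.2, 2.3]               # made-up example values
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # argmax over the classes
print(labels[best])  # Positive
```

Because softmax preserves the ordering of the logits, the argmax alone decides the label; the probabilities are mainly useful as a rough confidence estimate.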
