Between rounds of labelling data with 'predict_and_label' and fine-tuning the models (to reduce the amount of human intervention required to label the data), I fine-tuned the models in Google Colab so that I could speed up the process with GPUs and parallelization. Below is the notebook that I used after labelling 9 job postings.

**Install/import packages**

In [1]:
! pip3 install transformers
! pip3 install accelerate -U
! pip3 install transformers[torch]
! pip3 install datasets
! pip3 install mysql
! pip3 install mysql.connector
! pip3 install evaluate
! pip3 install seqeval

Collecting transformers
  Downloading transformers-4.33.1-py3-none-any.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd drive/MyDrive/parse_job_postings

/content/drive/MyDrive/parse_job_postings


In [20]:
from training.labelingHelpers.autolabel import autolabel
from training.labelingHelpers.predict_and_label import predict_and_label
from data_retrieval import open_json_safe, save_json_file

In [5]:
import pandas as pd
import json

In [6]:
from training.fine_tune import fine_tune


**Load the models to fine-tune and the data to fine-tune them with**

In [23]:
from transformers import AutoModelForTokenClassification, AutoTokenizer
from training.ft_sentence_classification_helpers import CustomModel

sentence_class_mod = CustomModel("has-abi/distilBERT-finetuned-resumes-sections", num_labels = 2)
sentence_class_tok = AutoTokenizer.from_pretrained("has-abi/distilBERT-finetuned-resumes-sections")

token_class_mod = AutoModelForTokenClassification.from_pretrained("jjzha/jobbert_knowledge_extraction")
token_class_tok = AutoTokenizer.from_pretrained("jjzha/jobbert_knowledge_extraction")

In [9]:
f_name = 'sample_job_descriptions.json'
data = open_json_safe(f_name)

When you run this function, you might overwrite the data in the data structure in python that you are labelling. Are you sure you want to run it? (y/n): y


**Run the fine_tune function once, initially, to determine how unbalanced the sentence classification data is**

In [10]:
fine_tune(data, sentence_class_mod, sentence_class_tok, token_class_mod, token_class_tok)

The breakdown of the sentence classification data is:	
Label '0': 280 data points (61.40% of total)
Label '1': 176 data points (38.60% of total)

The breakdown of the token classification data is:	
Label '2': 3493 data points (71.65% of total)
Label '0': 792 data points (16.25% of total)
Label '1': 590 data points (12.10% of total)
Choose whether to fine tune the...
	sentence classification model (0)
	the token classification model (1)
	both models (2)
	decide not to fine tune (3)
3
The program is quitting.
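
The sentence classification breakdown printed above can be reproduced with a short counting routine. This is only a minimal sketch: `label_breakdown` is a hypothetical helper, not the code inside `fine_tune`, and it assumes the labels are available as a flat list of integers. It also computes how many more sentences with the label '1' would be needed to match the label '0' count.

```python
from collections import Counter


def label_breakdown(labels):
    """Return (label, count, percent) tuples, most frequent label first.

    Hypothetical helper for illustration; not part of `fine_tune`.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, 100 * n / total) for label, n in counts.most_common()]


# Counts taken from the sentence classification output above.
sentence_labels = [0] * 280 + [1] * 176

for label, n, pct in label_breakdown(sentence_labels):
    print(f"Label '{label}': {n} data points ({pct:.2f}% of total)")

# Extra label-'1' sentences needed for the two classes to be the same size.
counts = Counter(sentence_labels)
deficit = counts[0] - counts[1]
print(f"Sentences with label '1' still needed: {deficit}")
```

With 280 and 176 sentences, the deficit is 104, which is where the "~100 sentences" target in the next step comes from.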


**Use the autolabel function a few times to approximately balance the data (need to label ~100 sentences with the label '1')**

In [12]:
word = ' verilog '

HAS_TOKEN = 1

data = autolabel(data, word, HAS_TOKEN)

Which mode for choosing what sentences to label would you like to use ('word' or 'first_and_last_sentences')?word
Which mode would you like to use ('normal' or 'max_and_indices_to_omit')?max_and_indices_to_omit
Please open the file 'sentences_to_label.html' and then input a tuple whose first entry is an integer, and whose next entry is a list of integers. The number in the first entry is the index of the last sentence that you checked  and the list in the second entry are the indices that should NOT be labelled. For example, (6, [3,5]), indicates that sentences at 0, 1, 2, 4 and 6 should be labelled. (3, [] )
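
The 'max_and_indices_to_omit' input format described in the prompt can be illustrated with a few lines of Python. This is a sketch of the rule as the prompt states it, not the actual `autolabel` implementation; `parse_selection` and `expand_selection` are hypothetical names.

```python
import ast


def parse_selection(text):
    """Parse input such as "(6, [3,5])" into (last_index, indices_to_omit)."""
    last_index, omit = ast.literal_eval(text.strip())
    return int(last_index), [int(i) for i in omit]


def expand_selection(last_index, omit):
    """Return every index from 0 to last_index inclusive, except those in omit."""
    skipped = set(omit)
    return [i for i in range(last_index + 1) if i not in skipped]


last_index, omit = parse_selection("(6, [3,5])")
print(expand_selection(last_index, omit))  # [0, 1, 2, 4, 6]
```

This mode is convenient when most of the displayed sentences should receive the label, since only the exceptions have to be typed; the 'normal' mode, which takes an explicit list of indices to label, is shorter when only a few sentences qualify.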


In [13]:
word = ' thermodynamics '

HAS_TOKEN = 1

data = autolabel(data, word, HAS_TOKEN)

Which mode for choosing what sentences to label would you like to use ('word' or 'first_and_last_sentences')?word
Which mode would you like to use ('normal' or 'max_and_indices_to_omit')?max_and_indices_to_omit
Please open the file 'sentences_to_label.html' and then input a tuple whose first entry is an integer, and whose next entry is a list of integers. The number in the first entry is the index of the last sentence that you checked  and the list in the second entry are the indices that should NOT be labelled. For example, (6, [3,5]), indicates that sentences at 0, 1, 2, 4 and 6 should be labelled. (0, [])


In [14]:
word = 'computational biology '

HAS_TOKEN = 1

data = autolabel(data, word, HAS_TOKEN)

Which mode for choosing what sentences to label would you like to use ('word' or 'first_and_last_sentences')?word
Which mode would you like to use ('normal' or 'max_and_indices_to_omit')?max_and_indices_to_omit
Please open the file 'sentences_to_label.html' and then input a tuple whose first entry is an integer, and whose next entry is a list of integers. The number in the first entry is the index of the last sentence that you checked  and the list in the second entry are the indices that should NOT be labelled. For example, (6, [3,5]), indicates that sentences at 0, 1, 2, 4 and 6 should be labelled. (3, [])


In [17]:
word = 'ASIC'

HAS_TOKEN = 1

data = autolabel(data, word, HAS_TOKEN)

Which mode for choosing what sentences to label would you like to use ('word' or 'first_and_last_sentences')?word
Which mode would you like to use ('normal' or 'max_and_indices_to_omit')?normal
Please open the file 'sentences_to_label.html' and then input a list with the indices of the sentences that should get the label.[7,16,22,24,26,27,28,31,33,36,37,38,39,40,41,42,43,47,48,50,51]


In [18]:
word = 'hardware'

HAS_TOKEN = 1

data = autolabel(data, word, HAS_TOKEN)

Which mode for choosing what sentences to label would you like to use ('word' or 'first_and_last_sentences')?word
Which mode would you like to use ('normal' or 'max_and_indices_to_omit')?max_and_indices_to_omit
Please open the file 'sentences_to_label.html' and then input a tuple whose first entry is an integer, and whose next entry is a list of integers. The number in the first entry is the index of the last sentence that you checked  and the list in the second entry are the indices that should NOT be labelled. For example, (6, [3,5]), indicates that sentences at 0, 1, 2, 4 and 6 should be labelled. ( 32, [15, 20] )


In [19]:
word = 'AWS'

HAS_TOKEN = 1

data = autolabel(data, word, HAS_TOKEN)

Which mode for choosing what sentences to label would you like to use ('word' or 'first_and_last_sentences')?word
Which mode would you like to use ('normal' or 'max_and_indices_to_omit')?max_and_indices_to_omit
Please open the file 'sentences_to_label.html' and then input a tuple whose first entry is an integer, and whose next entry is a list of integers. The number in the first entry is the index of the last sentence that you checked  and the list in the second entry are the indices that should NOT be labelled. For example, (6, [3,5]), indicates that sentences at 0, 1, 2, 4 and 6 should be labelled. ( 44, [2,3,4,6,11,12,15,18,19,20,21,24,25,26,27,28,29,31,32,33,36,38,39,42,43 ] )
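
Some of the search words above are padded with spaces (' verilog ') and others are not ('ASIC'). The sketch below shows why that can matter if candidate sentences are found by substring matching. That matching rule, the case-insensitivity, and the helper name `sentences_containing` are all assumptions made for illustration; the notebook does not show how `autolabel` actually selects sentences.

```python
def sentences_containing(sentences, word):
    """Return indices of sentences that contain `word` as a substring.

    Assumption: case-insensitive substring matching. The real `autolabel`
    function may select sentences differently.
    """
    needle = word.lower()
    return [i for i, sentence in enumerate(sentences) if needle in sentence.lower()]


# Made-up example sentences, not taken from the labelled data.
sentences = [
    "Experience with ASIC design flows is required.",
    "Knowledge of basic networking is a plus.",
    "Proficiency in Verilog or VHDL.",
]

print(sentences_containing(sentences, "asic"))    # [0, 1]  ("basic" also matches)
print(sentences_containing(sentences, " asic "))  # [0]
```

Under this assumption, padding a word with spaces trades recall for precision: it avoids matches inside longer words, but it also misses occurrences next to punctuation or at the start or end of a sentence.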


In [24]:
from huggingface_hub import notebook_login

notebook_login()


**Fine tune the models**

Note: At this stage, I do not tune hyperparameters, because the only purpose of this round of fine-tuning is to make the models slightly better so that labelling with 'predict_and_label' is quicker. In a later notebook, I will fine-tune the models while tuning hyperparameters to optimize their performance.

In [25]:
fine_tune(data, sentence_class_mod, sentence_class_tok, token_class_mod, token_class_tok)


The breakdown of the sentence classification data is:	
Label '0': 280 data points (51.95% of total)
Label '1': 259 data points (48.05% of total)

The breakdown of the token classification data is:	
Label '2': 3493 data points (71.65% of total)
Label '0': 792 data points (16.25% of total)
Label '1': 590 data points (12.10% of total)
Choose whether to fine tune the...
	sentence classification model (0)
	the token classification model (1)
	both models (2)
	decide not to fine tune (3)
2


Map:   0%|          | 0/394 [00:00<?, ? examples/s]

Map:   0%|          | 0/99 [00:00<?, ? examples/s]



  0%|          | 0/39 [00:00<?, ?it/s]

  0%|          | 0/12 [00:00<?, ?it/s]

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'f1': 0.8787878787878788}
{'f1': 0.898989898989899}
{'f1': 0.8888888888888888}


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.528196 | 0.687117 | 0.704403 | 0.695652 | 0.889714 |
| 2 | No log | 0.366777 | 0.703488 | 0.761006 | 0.731118 | 0.898197 |
| 3 | No log | 0.333646 | 0.720238 | 0.761006 | 0.740061 | 0.903499 |
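
As a sanity check on the token classification results, the reported F1 values can be compared with the harmonic mean of the reported precision and recall. The sketch below uses only the numbers from the table above; `f1_score` is a hypothetical helper written for this check, not a call into the evaluation library used during training.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (epoch, precision, recall, reported F1) copied from the table above.
rows = [
    (1, 0.687117, 0.704403, 0.695652),
    (2, 0.703488, 0.761006, 0.731118),
    (3, 0.720238, 0.761006, 0.740061),
]

for epoch, precision, recall, reported in rows:
    computed = f1_score(precision, recall)
    print(f"epoch {epoch}: computed F1 = {computed:.6f}, reported F1 = {reported:.6f}")
```

The computed and reported values agree to within rounding, and all three metrics improve or hold steady from epoch 1 to epoch 3 while the validation loss falls.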


Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/431M [00:00<?, ?B/s]

**Saving the sentence classification model:** 

Because part of the sentence classification model is a PyTorch layer that is not part of the Hugging Face model, I do not push this model to the Hub. Instead, I save it to my Google Drive, download it to my local computer, and then load it in a Jupyter notebook.

In [26]:
sentence_class_mod.save_hybrid_hf_torch_model('linear_layer_for_sent_classifier_fr_colab')


Is this being run in a google colab? (y/n)y
Storing the constitutents of the model required to rebuild it in the folder 'model_contents' in your Google Drive.
