This notebook was created while fine-tuning the sentence and token classification models. It shows how the functions predict_and_label, autolabel, and update_tokens_label are used in tandem with fine-tuning to expedite labelling of the data. **The overarching idea is that, as the models improve, it makes more sense to have the models predict the labels and then correct the few predictions that are incorrect than to assign labels to thousands of tokens by hand.**

The models were fine-tuned on 5 job descriptions prior to creating this notebook, so that fewer corrections to the labels had to be made at the outset.

In [1]:
import os 
import sys

In [2]:
path_to_root_dir = os.environ.get('PATH2PARSE_JOBS_DIR')
if path_to_root_dir is not None:  # avoid appending None to sys.path when the variable is unset
    sys.path.append(path_to_root_dir)

In [3]:
from career_fit_tools.training.labeling_helpers.predict_and_label import predict_and_label
from career_fit_tools.misc_code.data_retrieval import open_json_safe, save_json_file

from transformers import AutoModelForTokenClassification, AutoTokenizer

from career_fit_tools.training.ft_sentence_classification_helpers import CustomModel
from career_fit_tools.training.labeling_helpers.annotation_guideline_helpers import get_which_postings_are_labelled, \
    reconstruct_ad_w_bolded_skills, update_tokens_label

from career_fit_tools.training.fine_tune import fine_tune
from career_fit_tools.training.labeling_helpers.autolabel import autolabel


**Loading the list of job description dictionaries:**

In [2]:
f_name = 'sample_job_descriptions.json'
data = open_json_safe(f_name)

When you run this function, you might overwrite the data in the data structure in python that you are labelling. Are you sure you want to run it? (y/n): y


In [3]:
get_which_postings_are_labelled(data)

The sentences in the job descriptions stored at indices [0, 10, 14, 15, 16] are all labelled.

The sentences in the job descriptions stored at indices [0, 10, 14, 15, 16] have all their tokens labelled as well.


**Loading the fine-tuned sentence and token classification models:**

In [6]:
# load the sentence classification model
sentence_class_mod = CustomModel("has-abi/distilBERT-finetuned-resumes-sections", num_labels=2)

p2model = "./model_contents/hf"
f_lin = "./model_contents/linear_layer_for_sent_classifier_fr_colab.pth"

# overwrite its weights with the ones trained in Google Colab
sentence_class_mod.overwrite_w_trained_weights(p2model, f_lin)
# load the sentence classification tokenizer
sentence_class_tok = AutoTokenizer.from_pretrained("has-abi/distilBERT-finetuned-resumes-sections")

# load the token classification model and tokenizer
token_class_mod = AutoModelForTokenClassification.from_pretrained("jfriduss/tok_train_info")
token_class_tok = AutoTokenizer.from_pretrained("jfriduss/tok_train_info")

Some weights of the model checkpoint at has-abi/distilBERT-finetuned-resumes-sections were not used when initializing DistilBertModel: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



**Labelling the job posting at index 17 in the list of posting dictionaries, using 'predict_and_label'**

In [5]:
predict_and_label(data[17], sentence_class_mod, sentence_class_tok, token_class_mod, token_class_tok)

Please open the file 'predictions_on_sentences.html' and then input a list with the indices of the sentences that were labelled incorrectly.[1,4,7,23]

For this job description, the metrics on the sentence classifier are:
	accuracy: 0.8461538461538461
	recall: 0.9285714285714286
Please open the file 'predictions_on_tokens.html' and then input a list of the words that were mislabelled.
The list should be a list of of lists, with each sub list of the form [index_sentence, index_word, label].
	'index_sentence' is the index of the sentence of the mislabelled word,
	'index_word' is the index of the mislabelled word within the sentence, and 
	'label' is either 0, 1 or 2, where 
		0 <--> B <--> 'the beginning of a labelled entity', 
		1 <--> I <--> 'a word that is within in a string of words that makes up a labelled entity', and 
		2 <--> O <--> 'the word should not be labelled'.
[[0, 7, 0],  [0, 9, 0],  [0, 10, 1],  [0, 12, 0],  [0, 17, 0],  [1, 14, 0],  [1, 15, 1],  [2, 25, 0],  [3, 15, 0],
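Because the correction lists are typed by hand, it can help to validate them before they are applied. The following is a minimal sketch, assuming the input arrives as a string in the format described by the prompt above; the function name `parse_token_corrections` is hypothetical and is not part of `predict_and_label`.

```python
import ast

# Label encoding taken from the prompt above: 0 <--> B, 1 <--> I, 2 <--> O.
VALID_LABELS = {0: "B", 1: "I", 2: "O"}


def parse_token_corrections(raw_text):
    """Parse and validate a typed list of [index_sentence, index_word, label] triples.

    Hypothetical helper: raises ValueError when an entry does not have the
    expected shape, uses a negative index, or uses a label other than 0, 1, 2.
    """
    parsed = ast.literal_eval(raw_text)  # safely evaluates the literal, no code execution
    if not isinstance(parsed, list):
        raise ValueError("expected a list of lists")

    corrections = []
    for entry in parsed:
        if not (isinstance(entry, (list, tuple)) and len(entry) == 3):
            raise ValueError(f"expected [index_sentence, index_word, label], got {entry!r}")
        if not all(isinstance(value, int) for value in entry):
            raise ValueError(f"all three values must be integers, got {entry!r}")
        index_sentence, index_word, label = entry
        if index_sentence < 0 or index_word < 0:
            raise ValueError(f"indices must be non-negative, got {entry!r}")
        if label not in VALID_LABELS:
            raise ValueError(f"label must be 0, 1 or 2, got {label!r}")
        corrections.append((index_sentence, index_word, label))
    return corrections


# Example with the same shape of input that is typed into the prompt.
print(parse_token_corrections("[ [6, 12, 0], [6, 13, 1] ]"))
```

Checking the shape up front means that a typo such as a label of 3 is caught before any label in the data structure is overwritten.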

**Two notes:**

**What the 'predictions_on_sentences/tokens.html' files look like, and how they relate to the lists that I input**

(1) The cell below this one shows screenshots of portions of the html files that contain the predictions that the sentence and token classification models make on the sentences/tokens in the posting. Note how, in the first html file, the first sentence is black, indicating that the model predicted it does not contain any tokens to label. But it contains the general area of focus of the company, 'AI robotics', so by criterion 3 of the annotation guidelines, it does contain tokens to be annotated by the token classification model. Therefore, its index is in the list entered above after "Please open the file 'predictions_on_sentences.html' and then input a list with the indices of the sentences that were labelled incorrectly.". The second list contains the analogous information about the predictions from the token classification model: [0,7,0] indicates that the word at index 7 in the sentence at index 0 of the html document should be labelled with a 'B' for 'beginning', instead of an 'O' for 'other'.


**About the metrics output by the function after its predictions are corrected:**

(2) After each classifier's predictions are corrected, the accuracy and recall of the classifier on this posting are output. Recall was chosen over other metrics because it is the most important one here: false negatives are more detrimental to the project's goals than false positives, because they represent information about the position that a prospective applicant would benefit from knowing but does not receive. In contrast, false positives represent unnecessary information, which might be tedious to have, but does not detract from gauging whether the applicant is qualified for the position. Additionally, both metrics were calculated without accounting for whether the model swapped the beginning label, 'B', and the inside label, 'I'; they only distinguish whether a 'B' or 'I' was mistaken for an 'O', or vice versa. This choice was made because words labelled 'B' and 'I' are both extracted from the posting, whereas words labelled 'O' are not, so what matters most is not confusing 'B/I' with 'O', or vice versa.
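The collapsing of 'B' and 'I' described in note (2) can be sketched as follows. This is a minimal illustration that uses the 0/1/2 label encoding from the prompt; the function names and the example labels are hypothetical, and this is not the code that predict_and_label itself runs.

```python
def collapse(label):
    """Map B (0) and I (1) to the positive class 1, and O (2) to the negative class 0."""
    return 0 if label == 2 else 1


def accuracy_and_recall(predicted, corrected):
    """Accuracy and recall on token labels, ignoring B/I swaps.

    `predicted` holds the model's labels and `corrected` holds the labels after
    manual correction, both as flat lists using 0 = B, 1 = I, 2 = O.
    """
    pairs = [(collapse(p), collapse(c)) for p, c in zip(predicted, corrected)]
    n_correct = sum(p == c for p, c in pairs)
    true_pos = sum(p == 1 and c == 1 for p, c in pairs)
    false_neg = sum(p == 0 and c == 1 for p, c in pairs)

    accuracy = n_correct / len(pairs)
    # Recall is undefined when the corrected labels contain no B or I at all.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else float("nan")
    return accuracy, recall


# Hypothetical labels for five words.
predicted = [2, 0, 1, 2, 1]
corrected = [0, 1, 1, 2, 2]
accuracy, recall = accuracy_and_recall(predicted, corrected)
print(accuracy, recall)
```

In this example the second word was predicted 'B' but corrected to 'I'; because both collapse to the positive class, it counts as a correct prediction, whereas the first word (predicted 'O', corrected to 'B') is a false negative and lowers recall.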

![sample predictions on sentences](./pictures/sample_predictions_on_sentences_html2.png)

Above: Screenshot of a portion of the html file that contains the predictions that the sentence classification model made on the sentences in the posting.



![sample predictions on tokens](./pictures/sample_predictions_on_tokens_html2.png)

Above: Screenshot of a portion of the html file that contains the predictions that the token classification model made on the tokens in the posting.
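To make the relationship between the html files and the typed lists concrete, the sketch below shows one way that such correction lists could be applied to a set of predictions. The data layout (a flat list of 0/1 sentence flags and a nested list of per-word labels) and both function names are assumptions made for illustration; the internals of predict_and_label may differ.

```python
import copy


def apply_sentence_corrections(predicted_flags, wrong_indices):
    """Flip the binary flag of every sentence whose index was entered as incorrect.

    Assumed encoding: 1 = 'contains tokens to annotate', 0 = 'does not'.
    """
    corrected = list(predicted_flags)
    for index_sentence in wrong_indices:
        corrected[index_sentence] = 1 - corrected[index_sentence]
    return corrected


def apply_token_corrections(predicted_labels, corrections):
    """Overwrite word labels using [index_sentence, index_word, label] triples."""
    corrected = copy.deepcopy(predicted_labels)  # leave the predictions untouched
    for index_sentence, index_word, label in corrections:
        corrected[index_sentence][index_word] = label
    return corrected


# Hypothetical predictions: five sentences, and one sentence of eleven words all labelled O (2).
sentence_flags = [1, 0, 1, 1, 1]
token_labels = [[2] * 11]

print(apply_sentence_corrections(sentence_flags, [1, 4]))
# The first three triples typed for the posting above.
print(apply_token_corrections(token_labels, [[0, 7, 0], [0, 9, 0], [0, 10, 1]]))
```

Working on a copy keeps the original predictions available, which is what allows accuracy and recall to be computed by comparing them with the corrected labels.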


**Double checking that mistakes were not made while using predict_and_label, by outputting the now-labelled posting, with words colored according to label:**

In [8]:
reconstruct_ad_w_bolded_skills(data[17], token_class_tok, 0)

Founded in 2014 , Kindred is a [35;5;9mrobotics[0m and [35;5;9martificial[0m [33;5;9mintelligence[0m ( [35;5;9mAI[0m ) company that develops [35;5;9mrobots[0m to solve real - world problems .  Its mission is to enhance the lives of human workers with the power of [35;5;9mAI[0m [33;5;9mrobotics[0m .  Kindred ’ s cutting - edge technology is the foundation of a number of proprietary platforms , including CORE with [UNK] , developed to operate [35;5;9mrobots[0m autonomously in dynamic environments .  Its team of scientists , engineers and business operators have set a new standard for [35;5;9mreinforcement[0m [33;5;9mlearning[0m for [35;5;9mrobots[0m .  The company is co-located in San Francisco and Toronto and is part of the UK-based Ocado Group plc.
As a Software Developer on the Software Product Engineering team , you will have the exciting opportunity to design , implement , and maintain cutting - edge [35;5;9msoftware[0m [33;5;9msolutions[0m , across multipl
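The colored words in the output above are wrapped in terminal color codes of the form `[35;5;9m` and `[0m` (in this text export the escape character in front of each code is not visible). The sketch below recovers the labelled entities from such a string. It assumes that code 35 marks a 'B' word and code 33 marks an 'I' word, which is inferred from the output above rather than taken from the reconstruct_ad_w_bolded_skills source; the function name is hypothetical.

```python
import re

# Assumed mapping, inferred from the output above: 35 -> 'B', 33 -> 'I'.
COLOR_TO_LABEL = {"35": "B", "33": "I"}

# Either a color-wrapped word (escape character optional) or a plain token.
TOKEN_RE = re.compile(r"\x1b?\[(\d+);5;9m(\S+?)\x1b?\[0m|(\S+)")


def extract_entities(colored_text):
    """Group each 'B' word with the 'I' words that directly follow it."""
    entities, current = [], []
    for match in TOKEN_RE.finditer(colored_text):
        code, word, plain = match.groups()
        label = "O" if plain is not None else COLOR_TO_LABEL.get(code, "O")
        if label == "B":
            if current:
                entities.append(" ".join(current))
            current = [word]
        elif label == "I":
            current.append(word)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


sample = (
    "Kindred is a [35;5;9mrobotics[0m and [35;5;9martificial[0m "
    "[33;5;9mintelligence[0m ( [35;5;9mAI[0m ) company"
)
print(extract_entities(sample))  # ['robotics', 'artificial intelligence', 'AI']
```

Listing the entities this way gives a second, more compact view for spotting labelling mistakes than reading the full colored text.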

**Some discussion:**

(1) Although the above looks correct, if an error (or a few) had been made, the function 'update_tokens_label' in the annotation_guideline_helpers module would be used to correct it. See the labelling of the next posting for an example of this.

(2) Although the annotation guidelines are moderately clear, it is challenging to determine whether some words, for example 'RCA' and 'high-performance' (in 'high-performance computing'), should or should not be annotated. The extent to which the imperfections of the guidelines will serve as an upper bound on the models' performance will be explored in a future notebook, by either (i) annotating a few different ads multiple times, on different days, or (ii) while annotating a few ads, keeping track of the percentage of tokens whose labels are unclear.
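Option (i) amounts to measuring how well two annotation passes over the same ad agree. One possible way to quantify that (a suggestion, not necessarily what the future notebook will use) is the raw agreement rate together with Cohen's kappa, which discounts the agreement expected by chance. The function name and the example labels below are hypothetical.

```python
from collections import Counter


def agreement_and_kappa(pass_a, pass_b):
    """Raw agreement and Cohen's kappa between two label sequences of equal length."""
    if len(pass_a) != len(pass_b) or not pass_a:
        raise ValueError("both passes must be non-empty and of equal length")
    n = len(pass_a)

    # Fraction of tokens that received the same label in both passes.
    observed = sum(a == b for a, b in zip(pass_a, pass_b)) / n

    # Agreement expected by chance, from each pass's label frequencies.
    counts_a, counts_b = Counter(pass_a), Counter(pass_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined when only one label is ever used
    return observed, (observed - expected) / (1.0 - expected)


# Hypothetical labels (0 = B, 1 = I, 2 = O) from two passes over the same eight tokens.
first_pass = [0, 1, 2, 2, 2, 0, 2, 2]
second_pass = [0, 1, 2, 2, 0, 0, 2, 2]
observed, kappa = agreement_and_kappa(first_pass, second_pass)
print(f"agreement: {observed:.3f}, kappa: {kappa:.3f}")
```

Because most tokens are 'O', raw agreement alone tends to look high; kappa corrects for that, which makes it a more informative estimate of how consistently the guidelines can be applied.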

**Now, saving the data structure with a newly labelled ad 17, and then annotating three to four more postings, before fine tuning the models again:**

In [9]:
save_json_file(data, f_name)

Are you sure you want to save the json file? If you by mistake save it when you are just testing stuff it might be a pain to fix? (y/n): y


*labelling the ad at index 18 in the list of dictionaries*

In [28]:
predict_and_label(data[18], sentence_class_mod, sentence_class_tok, token_class_mod, token_class_tok)

Please open the file 'predictions_on_sentences.html' and then input a list with the indices of the sentences that were labelled incorrectly.[1,7,8,25,27,33]

For this job description, the metrics on the sentence classifier are:
	accuracy: 0.875
	recall: 0.8695652173913043
Please open the file 'predictions_on_tokens.html' and then input a list of the words that were mislabelled.
The list should be a list of of lists, with each sub list of the form [index_sentence, index_word, label].
	'index_sentence' is the index of the sentence of the mislabelled word,
	'index_word' is the index of the mislabelled word within the sentence, and 
	'label' is either 0, 1 or 2, where 
		0 <--> B <--> 'the beginning of a labelled entity', 
		1 <--> I <--> 'a word that is within in a string of words that makes up a labelled entity', and 
		2 <--> O <--> 'the word should not be labelled'.
[ [0, 23, 0],  [0, 24, 1],  [1, 10, 0],  [1, 11, 1],  [1, 12, 1],  [1, 13, 1],  [2, 24, 0],  [2, 25, 1],  [2, 28, 0],  [2

*Checking whether or not mistakes were made when using predict_and_label*

In [29]:
reconstruct_ad_w_bolded_skills(data[18], token_class_tok, 0)

Invitae ( NYSE : NVTA ) is a leading medical genetics company trusted by millions of patients and their providers to deliver timely [35;5;9mgenetic[0m [33;5;9minformation[0m using digital technology .  We aim to provide accurate and actionable answers to strengthen [35;5;9mmedical[0m [33;5;9mdecision[0m [33;5;9m-[0m [33;5;9mmaking[0m for individuals and their families .  Invitae ' s genetics experts apply a rigorous approach to data and research , serving as the foundation of their mission to bring comprehensive [35;5;9mgenetic[0m [33;5;9minformation[0m into mainstream [35;5;9mmedicine[0m to improve [35;5;9mhealthcare[0m for billions of people . 
We have an available opening in our CSI team for talented and motivated Computational Biologists with expertise developing [35;5;9mcomputational[0m [33;5;9mmethods[0m to analyze and model complex [35;5;9mbiological[0m [33;5;9mdatasets[0m and processes .  As a core member of this team , the primary responsibility of 

**Correcting a mistake after having labelled a posting:**

When checking whether or not mistakes were made while using predict_and_label on the ad at index 18 in the list (above), I noticed that a mistake was made: in the first sentence of the second paragraph, 'genetic testing' should have been annotated. Below, I show how to use the 'update_tokens_label' function to correct this mistake.

The 'update_tokens_label' function is used in a similar way to 'predict_and_label', except that, because the postings are for the most part already labelled correctly, the function outputs html files showing how the sentences/tokens are currently labelled, and the user indicates which labels should be changed (as opposed to having the models make predictions on the sentences and tokens).
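To illustrate how an html view makes the [index_sentence, index_word, label] convention easy to read off, here is a minimal sketch of how such a file could be generated. The function name, the colors, and the layout are all hypothetical; the actual files written by update_tokens_label may look different.

```python
from html import escape

# Hypothetical colors for the labels 0 = B, 1 = I, 2 = O.
LABEL_COLORS = {0: "magenta", 1: "orange", 2: "black"}


def render_tokens_html(sentences, labels):
    """Return an html string with each word colored by label and tagged with its index.

    `sentences` is a list of word lists and `labels` is a parallel list of
    label lists, so labels[i][j] is the label of sentences[i][j].
    """
    rows = []
    for index_sentence, (words, word_labels) in enumerate(zip(sentences, labels)):
        spans = [
            f'<span style="color:{LABEL_COLORS[label]}">{escape(word)}<sub>{index_word}</sub></span>'
            for index_word, (word, label) in enumerate(zip(words, word_labels))
        ]
        rows.append(f"<p><b>{index_sentence}:</b> " + " ".join(spans) + "</p>")
    return "<html><body>\n" + "\n".join(rows) + "\n</body></html>"


# Hypothetical sentence in which 'genetic testing' is labelled B, I.
page = render_tokens_html([["offers", "genetic", "testing", "&", "more"]], [[2, 0, 1, 2, 2]])
print(page)
```

Showing the sentence index at the start of each line and the word index as a subscript means a correction such as [6, 12, 0] can be typed directly from what is on screen.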

In [32]:
update_tokens_label(data[18], token_class_tok)

Please open the file 'sentences_w_current_labels.html' and then input a list with the indices of the sentences that are labelled incorrectly.[]
Please open the file 'tokens_w_current_labels.html' and then input a list of the words that are mislabelled.
The list should be a list of of lists, with each sub list of the form [index_sentence, index_word, label].
	'index_sentence' is the index of the sentence of the mislabelled word,
	'index_word' is the index of the mislabelled word within the sentence, and 
	'label' is either 0, 1 or 2, where 
		0 <--> B <--> 'the beginning of a labelled entity', 
		1 <--> I <--> 'a word that is within in a string of words that makes up a labelled entity', and 
		2 <--> O <--> 'the word should not be labelled'.
[ [6,12,0], [6,13,1] ]


In [33]:
reconstruct_ad_w_bolded_skills(data[18], token_class_tok, 0)

Invitae ( NYSE : NVTA ) is a leading medical genetics company trusted by millions of patients and their providers to deliver timely [35;5;9mgenetic[0m [33;5;9minformation[0m using digital technology .  We aim to provide accurate and actionable answers to strengthen [35;5;9mmedical[0m [33;5;9mdecision[0m [33;5;9m-[0m [33;5;9mmaking[0m for individuals and their families .  Invitae ' s genetics experts apply a rigorous approach to data and research , serving as the foundation of their mission to bring comprehensive [35;5;9mgenetic[0m [33;5;9minformation[0m into mainstream [35;5;9mmedicine[0m to improve [35;5;9mhealthcare[0m for billions of people . 
We have an available opening in our CSI team for talented and motivated Computational Biologists with expertise developing [35;5;9mcomputational[0m [33;5;9mmethods[0m to analyze and model complex [35;5;9mbiological[0m [33;5;9mdatasets[0m and processes .  As a core member of this team , the primary responsibility of 

Above, we can see that 'genetic testing', in the first sentence of the second paragraph, is now labelled.

In [34]:
save_json_file(data, f_name)

Are you sure you want to save the json file? If you by mistake save it when you are just testing stuff it might be a pain to fix? (y/n): y


*labelling the ad at index 1 in the list of dictionaries*

In [42]:
predict_and_label(data[1], sentence_class_mod, sentence_class_tok, token_class_mod, token_class_tok)

Please open the file 'predictions_on_sentences.html' and then input a list with the indices of the sentences that were labelled incorrectly.[8,15,21,30,33]

For this job description, the metrics on the sentence classifier are:
	accuracy: 0.8717948717948718
	recall: 0.8666666666666667
Please open the file 'predictions_on_tokens.html' and then input a list of the words that were mislabelled.
The list should be a list of of lists, with each sub list of the form [index_sentence, index_word, label].
	'index_sentence' is the index of the sentence of the mislabelled word,
	'index_word' is the index of the mislabelled word within the sentence, and 
	'label' is either 0, 1 or 2, where 
		0 <--> B <--> 'the beginning of a labelled entity', 
		1 <--> I <--> 'a word that is within in a string of words that makes up a labelled entity', and 
		2 <--> O <--> 'the word should not be labelled'.
[ [0, 14, 2],  [1, 15, 0],  [1, 16, 1],  [1, 17, 1],  [2, 10, 0],  [2, 11, 1],  [2, 12, 1],  [3, 20, 1],  [4,

*Checking whether or not mistakes were made when using predict_and_label*

In [43]:
reconstruct_ad_w_bolded_skills(data[1], token_class_tok, 0)

Summary
Posted: Jul 19, 2022
Weekly Hours: 40
Role Number:200149514
Are you interested in building products that utilize [35;5;9mmachine[0m [33;5;9mlearning[0m and [35;5;9mcomputer[0m [33;5;9mvision[0m [33;5;9mtechnologies[0m ?  Are you looking to apply your state-of-the-art knowledge to produce high-visibility features? Apple ’ s Text Recognition group is responsible for building best - in - class [35;5;9mText[0m [33;5;9mRecognition[0m [33;5;9mtechnologies[0m that fuel innovative and user experiences .  We are an R & D team that develops core [35;5;9mmachine[0m [33;5;9mlearning[0m [33;5;9mtechnologies[0m and pushes the limit of those technologies to produce amazing features like Live Text , Apple Pay Credit Card Capture , and iTunes Gift Card Camera Redemption across all Apple Platforms , including iOS and macOS .  We are looking for an exceptional machine learning engineer to help research and develop our next generation of [35;5;9mtext[0m [33;5;9mrecognition

In [44]:
save_json_file(data, f_name)

Are you sure you want to save the json file? If you by mistake save it when you are just testing stuff it might be a pain to fix? (y/n): y


*labelling the ad at index 19 in the list of dictionaries*

In [46]:
predict_and_label(data[19], sentence_class_mod, sentence_class_tok, token_class_mod, token_class_tok)

Please open the file 'predictions_on_sentences.html' and then input a list with the indices of the sentences that were labelled incorrectly.[0,1,2,3,4,5,6,8,11,14,19,28,29,30,32,38,45,47,53,54,56,61]

For this job description, the metrics on the sentence classifier are:
	accuracy: 0.6451612903225806
	recall: 0.5
Please open the file 'predictions_on_tokens.html' and then input a list of the words that were mislabelled.
The list should be a list of of lists, with each sub list of the form [index_sentence, index_word, label].
	'index_sentence' is the index of the sentence of the mislabelled word,
	'index_word' is the index of the mislabelled word within the sentence, and 
	'label' is either 0, 1 or 2, where 
		0 <--> B <--> 'the beginning of a labelled entity', 
		1 <--> I <--> 'a word that is within in a string of words that makes up a labelled entity', and 
		2 <--> O <--> 'the word should not be labelled'.
[ [0, 7, 0],  [0, 8, 1],  [0, 11, 0],  [0, 12, 1],  [1, 12, 0],  [1, 15, 0],  [1

While annotating tokens, I noticed that two sentences were labelled "has a token to annotate" that should not have been labelled that way. Fixing that below.

In [48]:
update_tokens_label(data[19], token_class_tok)

Please open the file 'sentences_w_current_labels.html' and then input a list with the indices of the sentences that are labelled incorrectly.[2,53]
Please open the file 'tokens_w_current_labels.html' and then input a list of the words that are mislabelled.
The list should be a list of of lists, with each sub list of the form [index_sentence, index_word, label].
	'index_sentence' is the index of the sentence of the mislabelled word,
	'index_word' is the index of the mislabelled word within the sentence, and 
	'label' is either 0, 1 or 2, where 
		0 <--> B <--> 'the beginning of a labelled entity', 
		1 <--> I <--> 'a word that is within in a string of words that makes up a labelled entity', and 
		2 <--> O <--> 'the word should not be labelled'.
[]


*Checking whether or not mistakes were made when using predict_and_label*

In [49]:
reconstruct_ad_w_bolded_skills(data[19], token_class_tok, 0)

Afresh is on a mission to eliminate [35;5;9mfood[0m [33;5;9mwaste[0m and make [35;5;9mfresh[0m [33;5;9mfood[0m accessible to all .  Our first A . I . - powered solution optimizes ordering , [35;5;9mforecasting[0m , and [35;5;9mstore[0m [33;5;9moperations[0m for fresh food departments in brick - and - mortar grocers .  With our Fresh Operating System, regional and national grocery retailers have placed $1.6 billion in produce orders across the US and we've helped our partners prevent 34 million pounds of food from going to waste. Working at Afresh represents a one - of - a - kind opportunity to have massive social impact at scale by leveraging uncommonly impactful [35;5;9msoftware[0m – we hope you ' ll join us ! 
At Afresh , our mission is to make the fresh [35;5;9mfood[0m [33;5;9msupply[0m [33;5;9mchain[0m more efficient , thus dramatically reducing [35;5;9mfood[0m [33;5;9mwaste[0m and making fresh , nutritious food available and accessible to everyone .  Our 

In [50]:
save_json_file(data, f_name)

Are you sure you want to save the json file? If you by mistake save it when you are just testing stuff it might be a pain to fix? (y/n): y


**Because more job postings have been labelled than when the models were imported above, I fine-tuned the models in the 'example_of_fine_tuning_while_labelling_data.ipynb' notebook. Now I will use 'predict_and_label' on a few more postings, and see whether I have to correct fewer labels.**

*loading the newly fine-tuned models*

In [8]:
# load the sentence classification model
sentence_class_mod = CustomModel("has-abi/distilBERT-finetuned-resumes-sections", num_labels=2)

p2model = "./model_contents/hf"
f_lin = "./model_contents/linear_layer_for_sent_classifier_fr_colab.pth"

# overwrite its weights with the ones trained in Google Colab
sentence_class_mod.overwrite_w_trained_weights(p2model, f_lin)
# load the sentence classification tokenizer
sentence_class_tok = AutoTokenizer.from_pretrained("has-abi/distilBERT-finetuned-resumes-sections")

# load the token classification model and tokenizer
token_class_mod = AutoModelForTokenClassification.from_pretrained("jfriduss/tok_train_info")
token_class_tok = AutoTokenizer.from_pretrained("jfriduss/tok_train_info")

Some weights of the model checkpoint at has-abi/distilBERT-finetuned-resumes-sections were not used when initializing DistilBertModel: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


*labelling the posting at index 20 in the list of dictionaries*

In [9]:
predict_and_label(data[20], sentence_class_mod, sentence_class_tok, token_class_mod, token_class_tok)

Please open the file 'predictions_on_sentences.html' and then input a list with the indices of the sentences that were labelled incorrectly.[2,12,23,27]

For this job description, the metrics on the sentence classifier are:
	accuracy: 0.8857142857142857
	recall: 0.9411764705882353
Please open the file 'predictions_on_tokens.html' and then input a list of the words that were mislabelled.
The list should be a list of of lists, with each sub list of the form [index_sentence, index_word, label].
	'index_sentence' is the index of the sentence of the mislabelled word,
	'index_word' is the index of the mislabelled word within the sentence, and 
	'label' is either 0, 1 or 2, where 
		0 <--> B <--> 'the beginning of a labelled entity', 
		1 <--> I <--> 'a word that is within in a string of words that makes up a labelled entity', and 
		2 <--> O <--> 'the word should not be labelled'.
[ [1, 13, 2],  [1, 14, 2],  [2, 20, 1],  [7, 7, 1],  [7, 9, 0],  [7, 10, 1],  [8, 9, 2],  [9, 9, 2],  [9, 10, 2]

In [10]:
reconstruct_ad_w_bolded_skills(data[20], token_class_tok, 0)

We are looking for an experienced Data Engineer to join our growing team. The ideal candidate will have a strong background in [35;5;9mdata[0m [33;5;9mpipeline[0m [33;5;9mdevelopment[0m , [35;5;9mdata[0m [33;5;9mquality[0m [33;5;9mcontrol[0m , and [35;5;9mdata[0m [33;5;9minfrastructure[0m for [35;5;9mmachine[0m [33;5;9mlearning[0m [33;5;9mmodels[0m .  Additionally , the candidate should have a strong understanding of [35;5;9mmarketing[0m [33;5;9mdata[0m and business sense .  In this role , you will be responsible for designing , building , and maintaining [35;5;9mdata[0m [33;5;9mpipelines[0m , ensuring [35;5;9mdata[0m [33;5;9mquality[0m and accuracy , and supporting [35;5;9mmachine[0m [33;5;9mlearning[0m [33;5;9mmodels[0m . 
Responsibilities
Design and build [35;5;9mdata[0m [33;5;9mpipelines[0m to collect , process , and store large amounts of data from multiple sources 
Implement [35;5;9mdata[0m [33;5;9mquality[0m [33;5;9mcontrol[0m [

In [11]:
save_json_file(data, f_name)

Are you sure you want to save the json file? If you by mistake save it when you are just testing stuff it might be a pain to fix? (y/n): y


**In order to determine how much less effort it took to annotate the ad at index 20 in the list using the models fine-tuned on nine job postings, compared to five, I will run predict_and_label on it using the previous models, and compare how many labels have to be changed in that case.**

*do this later*
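Until that comparison is run, the sentence-level numbers already printed in this notebook can be put side by side. The sketch below assumes that the reported accuracy equals (n - k) / n, where k is the number of sentence indices entered as incorrect and n is the number of sentences in the posting. That formula is an assumption rather than something read from the predict_and_label source, although it is consistent with all five outputs above. Token-level corrections are not included, because the typed lists are truncated in the output.

```python
# (posting index, number of sentence indices entered as incorrect, reported accuracy),
# copied from the predict_and_label outputs in this notebook.
reported = [
    (17, 4, 0.8461538461538461),
    (18, 6, 0.875),
    (1, 5, 0.8717948717948718),
    (19, 22, 0.6451612903225806),
    (20, 4, 0.8857142857142857),  # labelled with the re-fine-tuned models
]


def implied_sentence_count(n_incorrect, accuracy):
    """Solve accuracy = (n - k) / n for n, assuming every error was entered."""
    return round(n_incorrect / (1.0 - accuracy))


sentence_counts = {
    index: implied_sentence_count(n_incorrect, accuracy)
    for index, n_incorrect, accuracy in reported
}

for index, n_incorrect, _ in reported:
    n_sentences = sentence_counts[index]
    share = n_incorrect / n_sentences
    print(f"posting {index}: {n_incorrect} of {n_sentences} sentences corrected ({share:.1%})")
```

A single posting is too small a sample to say whether the re-fine-tuned models need fewer corrections, which is why the direct comparison on the same ad is still needed.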