# Data formatting examples:

This notebook outlines the required data format for every component within the project.

In [1]:
import os 
import json

## Format of pipeline inputs:

example file: [sample pipeline inputs](../data/sample_input_to_pipeline.json)

In [10]:
path_to_pipeline_inputs = os.environ.get('PATH2SAMPLE_PIPELINE_INPUTS')


with open(path_to_pipeline_inputs, 'r') as file:
    pipeline_inputs = json.load(file)
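
Note that `os.environ.get` returns `None` when the variable is unset, and `open(None)` then raises a `TypeError` that does not mention the missing variable. Below is a minimal sketch of a loader that fails with a clearer message. The helper name `load_json_from_env` and the UTF-8 encoding are assumptions for illustration; neither comes from the project.

```python
import json
import os


def load_json_from_env(var_name):
    """Load the JSON file whose path is stored in the environment variable var_name.

    Hypothetical helper, not part of the project.
    """
    path = os.environ.get(var_name)
    if path is None:
        # Fail early and name the variable that is missing.
        raise KeyError(f"Environment variable {var_name!r} is not set")
    # Assumption: the sample files are UTF-8 encoded.
    with open(path, "r", encoding="utf-8") as file:
        return json.load(file)
```

With this helper, the cell above could read `pipeline_inputs = load_json_from_env('PATH2SAMPLE_PIPELINE_INPUTS')`.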

In [13]:
print("If you would like to use the pipeline, create a list like the one below:")
print(type(pipeline_inputs))

print("\nEach element in the list should be a string that contains a scraped job description:")
print(type(pipeline_inputs[0]))

print("\nHere are the sample pipeline inputs (the full list):\n")
print(pipeline_inputs)

If you would like to use the pipeline, create a list like the one below:
<class 'list'>

Each element in the list should be a string that contains a scraped job description:
<class 'str'>

Here are the sample pipeline inputs (the full list):

['We help the world run better\nOur company culture is focused on helping our employees enable innovation by building breakthroughs together. How? We focus every day on building the foundation for tomorrow and creating a workplace that embraces differences, values flexibility, and is aligned to our purpose-driven and future-focused work. We offer a highly collaborative, caring team environment with a strong focus on learning and development, recognition for your individual contributions, and a variety of benefit options for you to choose from.Apply now!\nSAP Business Network\nThe global economy has been rapidly evolving from enterprise-centric to network-centric. No enterprise does business alone today. Suppliers, service provide
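
A minimal sketch of producing an input file in this shape, that is, a JSON list of strings. The function name `save_pipeline_inputs` and the UTF-8 encoding are illustrative assumptions, not project code.

```python
import json


def save_pipeline_inputs(job_descriptions, path):
    """Check that job_descriptions is a list of strings, then write it as JSON.

    Hypothetical helper, not part of the project.
    """
    if not isinstance(job_descriptions, list):
        raise TypeError("pipeline inputs must be a list")
    for i, description in enumerate(job_descriptions):
        if not isinstance(description, str):
            raise TypeError(
                f"element {i} is {type(description).__name__}, expected str"
            )
    # Assumption: UTF-8 is an acceptable encoding for the output file.
    with open(path, "w", encoding="utf-8") as file:
        json.dump(job_descriptions, file)
```

A file written this way can be loaded with `json.load` exactly as in the cell above.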

## Format of training data:

example file: [sample training data](../data/sample_training_data.json)

In [11]:
path_to_training_data = os.environ.get('PATH2SAMPLE_TRAINING_DATA')


with open(path_to_training_data, 'r') as file:
    training_data = json.load(file)

In [32]:
print("If you would like to train the models, create a list like the one below:")
print(type(training_data))

print("\nEach element in the list should be a dictionary:")
job_descr_dict = training_data[0]
print(type(job_descr_dict))

print("\nThe dictionaries should have the following keys:")
print(job_descr_dict.keys())

print("\nHere are examples of the values of each key, for the first dictionary in the list:\n")

for key, value in job_descr_dict.items():
    print(key + ': ' + str(value) + '\n')

If you would like to train the models, create a list like the one below:
<class 'list'>

Each element in the list should be a dictionary:
<class 'dict'>

The dictionaries should have the following keys:
dict_keys(['job_id', 'sentences', 'helpers', 'labels'])

Here are examples of the values of each key, for the first dictionary in the list:

job_id: 1

sentences: ['Summary', 'Posted: Mar 3, 2020', 'Role Number:200156495', 'Discover Apple.', 'We’re looking for people like you.', 'Apple is a place where people from all backgrounds get together to do their life’s best work.', 'Come join us.', 'At Apple, great ideas have a way of becoming great products, services and customer experiences very quickly.', 'Bring your passion to your job and there’s no telling what you could accomplish.', 'Apple offers a collaborative work environment that fosters creativity and innovation.', 'Key Qualifications', 'Different together.', 'At Apple, we’re not all the same.', 'And that’s our greatest strength.',
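
A minimal sketch of a check for the top-level structure printed above: a list of dictionaries, each with the four keys. The function name `check_record_keys` is an assumption for illustration and does not come from the project.

```python
EXPECTED_KEYS = {"job_id", "sentences", "helpers", "labels"}


def check_record_keys(training_data):
    """Return a list of problems in the top-level structure; empty if none.

    Hypothetical helper, not part of the project.
    """
    if not isinstance(training_data, list):
        return ["training data is not a list"]
    problems = []
    for i, record in enumerate(training_data):
        if not isinstance(record, dict):
            problems.append(f"record {i} is not a dict")
            continue
        missing = EXPECTED_KEYS - set(record)
        if missing:
            problems.append(f"record {i} is missing keys: {sorted(missing)}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to review a whole file at once.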

**Additional notes on the information in the 'job description dictionaries':**

**job_id**: an integer, the primary key of the job posting in the table where the job postings are stored on the MySQL Community Server.

**sentences and helpers**: To go from a string containing a job description, like one that you would input into the pipeline, to a list of strings like the one found in training_data\[i\]\['sentences'\], and to create the string found in training_data\[i\]\['helpers'\], please see the function 'partion_string' in the file [text_partitioning](../career_fit_tools/misc_code/text_partitioning.py). For the choices made on how to partition a job description into 'sentences', please see the notebook [splitting the posting into sentences](../career_fit_tools/examples/splitting_the_postings_into_sentences.ipynb).

**labels**: a list with the same length as training_data\[i\]\['sentences'\], each of whose elements is itself a two-element list. For more information, see the cell below:

In [33]:
print("The length of training_data[0]['labels'] is:")
print(len(job_descr_dict['labels']))

print("\nNotice that it is the same length as training_data[0]['sentences']:")
print(len(job_descr_dict['sentences']))

print("\nEach element of training_data[0]['labels'] is a list itself:")
print(type(job_descr_dict['labels'][0]))

print("\nAll of these sublists have a length of 2:")
print(len(job_descr_dict['labels'][0]))

print("\nThe first element is either 0 or 1, if the sentence has been labelled, or 'na', if it hasn't been labelled:")
print("\tin this case, it is " + str(job_descr_dict['labels'][0][0]))

The length of training_data[0]['labels'] is:
29

Notice that it is the same length as training_data[0]['sentences']:
29

Each element of training_data[0]['labels'] is a list itself:
<class 'list'>

All of these sublists have a length of 2:
2

The first element is either 0 or 1, if the sentence has been labelled, or 'na', if it hasn't been labelled:
	in this case, it is 0
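
The properties printed above can be collected into one check. This is a minimal sketch under the rules stated in this notebook (one label per sentence, each label a two-element list, first element 0, 1, or 'na'); the function name `check_labels` is an assumption and not part of the project.

```python
def check_labels(record):
    """Check the shape of record['labels'] against record['sentences'].

    Returns a list of problem descriptions; an empty list means none were found.
    Hypothetical helper, not part of the project.
    """
    problems = []
    sentences, labels = record["sentences"], record["labels"]
    if len(labels) != len(sentences):
        problems.append(f"{len(labels)} labels for {len(sentences)} sentences")
    for i, label in enumerate(labels):
        if not isinstance(label, list) or len(label) != 2:
            problems.append(f"label {i} is not a two-element list")
        elif label[0] not in (0, 1, "na"):
            problems.append(f"label {i} has sentence label {label[0]!r}")
    return problems
```

For the sample data, `check_labels(training_data[0])` would be expected to return an empty list.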


To understand the format of the second element in training_data\[0\]\['labels'\]\[i\], the tokenizer used for token classification needs to be loaded:

In [22]:
from transformers import AutoTokenizer

In [23]:
tokenizer_token_classification = AutoTokenizer.from_pretrained("jfriduss/bert_for_job_descr_parsing")

In [46]:
print("\nThe second element is a list whose length equals the length of the input_ids of "
      "tokenizer_token_classification(sentence, padding = False, truncation = False, return_tensors='pt'). "
      "This is shown below:")
print(len(job_descr_dict['labels'][0][1]))

tokenized_sentence_dict = tokenizer_token_classification(job_descr_dict['sentences'][0], padding = False,
                                             truncation = False, return_tensors='pt')

input_ids = tokenized_sentence_dict['input_ids']
print(len(input_ids[0]))

print("Note that the input_ids are set up so that, in theory, multiple sentences could be tokenized. So the zeroth element of"
     " the input_ids tensor has to be examined to see the above.")

print("\nThe values in the second element are the labels of each token in the sentence, or 'na' if "
      "the sentence's tokens have not been labelled, or the sentence does not have tokens to label.")


The second element is a list whose length equals the length of the input_ids of tokenizer_token_classification(sentence, padding = False, truncation = False, return_tensors='pt'). This is shown below:
5
5
Note that the input_ids are set up so that, in theory, multiple sentences could be tokenized. So the zeroth element of the input_ids tensor has to be examined to see the above.

The values in the second element are the labels of each token in the sentence, or 'na' if the sentence's tokens have not been labelled, or the sentence does not have tokens to label.
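
The length rule above can be checked for every sentence of a record. The sketch below takes the token counter as a parameter so that it does not depend on any particular tokenizer. The function name `check_token_label_lengths` is an assumption, and so is the choice to skip entries whose second element is not a list (for example a bare 'na').

```python
def check_token_label_lengths(record, count_tokens):
    """Return indices of sentences whose token-label list has the wrong length.

    count_tokens is any callable mapping a sentence to its number of input_ids.
    Assumption of this sketch: entries whose second element is not a list
    (for example 'na') are skipped rather than reported.
    """
    mismatched = []
    pairs = zip(record["sentences"], record["labels"])
    for i, (sentence, label) in enumerate(pairs):
        token_labels = label[1]
        if isinstance(token_labels, list) and len(token_labels) != count_tokens(sentence):
            mismatched.append(i)
    return mismatched
```

With the tokenizer loaded above, `count_tokens` could be `lambda s: len(tokenizer_token_classification(s, padding=False, truncation=False)['input_ids'])`. Without `return_tensors`, the `input_ids` for a single string are a flat list, so no indexing into a batch dimension is needed.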
