***
# <font color=red>Chapter 5: MedTALN inc.'s Case Study - Dataset Import to DLS</font>
<p style="margin-left:10%; margin-right:10%;">by <font color=teal> John Doe (typica.ai) </font></p>

***


## Overview:
This notebook imports the dataset into OCI Data Labeling Service. It involves several key steps including:
- Setup and Initialization
- Dataset Creation
- Record and Annotation Creation
- Dataset Creation Verification

Initialize the necessary OCI clients for interacting with the Object Storage and Data Labeling services. These clients are authenticated using the resource principal of the notebook session.

The following policy must be added to the Data Science policies:
**allow dynamic-group data-science-dyn-grp to manage data-labeling-family in compartment case-study-cmpt**

In [1]:
import oci

from oci.data_labeling_service_dataplane.data_labeling_client import DataLabelingClient
from oci.data_labeling_service.data_labeling_management_client import DataLabelingManagementClient

# Authenticate with the notebook session's resource principal; the same signer is shared by all clients below
signer = oci.auth.signers.get_resource_principals_signer()
object_storage_client = oci.object_storage.ObjectStorageClient(config={}, signer=signer)

dls_client = DataLabelingManagementClient(config={}, signer=signer)
dls_dp_client = DataLabelingClient(config={}, signer=signer)

## Setup and Initialization

### Function to Create a Dataset in OCI Data Labeling Service
This function automates the creation of a dataset in OCI's Data Labeling Service. It takes in various parameters, including compartment details, object storage information (where the data is stored), and labels, and combines them to define and create a dataset.

In [2]:
from oci.data_labeling_service.models import ObjectStorageSourceDetails
from oci.data_labeling_service.models import DatasetFormatDetails
from oci.data_labeling_service.models import LabelSet
from oci.data_labeling_service.models import Label
from oci.data_labeling_service.models import CreateDatasetDetails
from oci.data_labeling_service.data_labeling_management_client import DataLabelingManagementClient

def create_dataset(compartment_id,
                   namespace,
                   bucket,
                   prefix,
                   ds_display_name,
                   ds_description,
                   ds_format_type,
                   ds_annotation_format,
                   ds_labels):

    # Create the Dataset Source Details object
    dataset_source_details_obj = ObjectStorageSourceDetails(namespace=namespace, bucket=bucket, prefix=prefix)

    # Create the Dataset Format Details object
    dataset_format_details_obj = DatasetFormatDetails(format_type=ds_format_type)


    # Create the LabelSet object from the list of labels
    label_set_obj = LabelSet(
        items=[oci.data_labeling_service.models.Label(name=label) for label in ds_labels]
    )

    # Create the Dataset Details object
    create_dataset_obj = CreateDatasetDetails(display_name=ds_display_name,
                                            description=ds_description,
                                            compartment_id=compartment_id, annotation_format=ds_annotation_format,
                                            dataset_source_details=dataset_source_details_obj,
                                            dataset_format_details=dataset_format_details_obj,
                                            label_set=label_set_obj)

    # Create the dataset; on failure, return the exception so the caller can inspect it
    try:
        response = dls_client.create_dataset(create_dataset_details=create_dataset_obj)
    except Exception as error:
        response = error

    return response

### Function to Create a Record in OCI Data Labeling Service

This function facilitates the creation of a record within a dataset in OCI's Data Labeling Service. It constructs the necessary details from the provided parameters and interacts with the Data Labeling Service to register the record.

In [4]:
from oci.data_labeling_service_dataplane.models import CreateObjectStorageSourceDetails
from oci.data_labeling_service_dataplane.models import CreateRecordDetails


def create_ds_rec(compartment_id, dataset_id, prefix, rec_name):

  relative_path = rec_name
  name = rec_name

  source_details_obj = CreateObjectStorageSourceDetails(relative_path=relative_path)

  create_record_obj = CreateRecordDetails(name=name,
                                          dataset_id=dataset_id,
                                          compartment_id=compartment_id,
                                          source_details=source_details_obj)
  try:
      response = dls_dp_client.create_record(create_record_details=create_record_obj)
  except Exception as error:
      # Return the exception so the caller can inspect its status
      response = error

  return response

### Function to Add Annotations to a Record in OCI Data Labeling Service

This function adds text selection annotations to an existing record in OCI's Data Labeling Service. It processes a list of annotations, creating entities that specify the label, text offset, and length for each annotated segment.

In [5]:
from oci.data_labeling_service_dataplane.models import Label
from oci.data_labeling_service_dataplane.models import TextSelectionEntity
from oci.data_labeling_service_dataplane.models import CreateAnnotationDetails

def add_rec_annotation(record_id, annotations_list):

    entity_type = "TEXTSELECTION"

    # Initialize an empty list to store the entities
    entities_obj = []

    for ent_obj in annotations_list:

        # Extract label, offset, and length
        label = ent_obj["labels"][0]["label_name"]
        offset = ent_obj["textSpan"]["offset"]
        length = ent_obj["textSpan"]["length"]
        # Create the labels_obj with the label
        labels_obj = [oci.data_labeling_service_dataplane.models.Label(label=label)]

        # Create the text_span_obj with offset and length
        span_obj = oci.data_labeling_service_dataplane.models.TextSpan(length=length, offset=offset)

        # Create the TextSelectionEntity and add it to the entities_obj list
        entity = TextSelectionEntity(entity_type=entity_type, labels=labels_obj, text_span=span_obj)
        entities_obj.append(entity)

    # entities_obj now contains the desired list of TextSelectionEntity objects
    #print(entities_obj)
    create_annotation_details_obj = CreateAnnotationDetails(record_id=record_id, compartment_id=compartment_id,
                                                            entities=entities_obj)

    try:
        response = dls_dp_client.create_annotation(create_annotation_details=create_annotation_details_obj)
    except Exception as error:
        # Return the exception so the caller can inspect it
        response = error

    return response
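
To make the offset and length semantics concrete, the sketch below slices a sample string using the same entity structure that `add_rec_annotation` reads (`labels[0]["label_name"]`, `textSpan["offset"]`, `textSpan["length"]`). The sentence, the label names, and the helper `spans_to_text` are made up for illustration, and the sketch assumes that offsets count characters from the start of the record text.

```python
# Hypothetical record text and entities; only the key names follow the notebook.
text = "Aspirin relieves headache."
annotations_list = [
    {"labels": [{"label_name": "DRUG"}], "textSpan": {"offset": 0, "length": 7}},
    {"labels": [{"label_name": "SYMPTOM"}], "textSpan": {"offset": 17, "length": 8}},
]


def spans_to_text(text, annotations_list):
    """Return (label, covered text) pairs for each entity (illustrative helper)."""
    pairs = []
    for ent in annotations_list:
        start = ent["textSpan"]["offset"]
        end = start + ent["textSpan"]["length"]
        pairs.append((ent["labels"][0]["label_name"], text[start:end]))
    return pairs


pairs = spans_to_text(text, annotations_list)
print(pairs)
```

Checking a few records this way before the import can help catch offsets that drifted during preprocessing.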

## Dataset Import 

This section creates the dataset along with its annotated records.

### Load Dataset Metadata from OCI Object Storage

This code block retrieves and processes the metadata for a dataset stored in OCI Object Storage. The metadata is then used to create and label our dataset in the Data Labeling Service.

In [7]:
import json
import os

# Compartment in which to create the dataset
compartment_id = os.environ['NB_SESSION_COMPARTMENT_OCID']
# Object Storage namespace
namespace = object_storage_client.get_namespace().data
# Dataset Object Storage bucket
bucket_name = "labelling-datasets-bkt"

# Dataset name
ds_name = "healthcare_ner_dataset_v1.0.0"
# Dataset metadata file name (JSONL Consolidated file created in the prepare_dataset notebook)
ds_metadata_jsonl_fname = "dataset_metadata.jsonl"

prefix = f"{ds_name}/" # Dataset folder in Object Storage bucket
object_name = f"{prefix}{ds_metadata_jsonl_fname}"

metadata_jsonl = object_storage_client.get_object(
    namespace,
    bucket_name,
    object_name)

print(f"Dataset metadata file {object_name} loaded")

#load jsonl
metadata_jsonl_obj = [json.loads(jline) for jline in metadata_jsonl.data.content.decode('utf-8').splitlines()]


Dataset metadata file healthcare_ner_dataset_v1.0.0/dataset_metadata.jsonl loaded
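
The later cells read the first line of `dataset_metadata.jsonl` as the dataset description and every following line as one record. A minimal sketch of that layout, using a made-up in-memory sample (the dataset name, label names, and file names are hypothetical; only the key names mirror those read elsewhere in this notebook):

```python
import json

# Hypothetical sample: line 1 describes the dataset, the rest describe records.
sample_jsonl = "\n".join([
    json.dumps({
        "displayName": "demo_dataset",
        "description": "Demo",
        "annotationFormat": "ENTITY_EXTRACTION",
        "datasetFormatDetails": {"formatType": "TEXT"},
        "labelsSet": [{"name": "DRUG"}, {"name": "DISEASE"}],
    }),
    json.dumps({
        "sourceDetails": {"path": "rec_0001.txt"},
        "annotations": [{"entities": [
            {"labels": [{"label_name": "DRUG"}],
             "textSpan": {"offset": 0, "length": 7}},
        ]}],
    }),
    json.dumps({"sourceDetails": {"path": "rec_0002.txt"}, "annotations": []}),
])

# Parse one JSON object per non-empty line
lines = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]
header, records = lines[0], lines[1:]

label_names = [label["name"] for label in header["labelsSet"]]
record_paths = [rec["sourceDetails"]["path"] for rec in records]
print(label_names)
print(record_paths)
```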


### Extract Metadata and Create Dataset in OCI Data Labeling Service

This code block reads the dataset metadata from the first line of the JSONL file and uses it to create the dataset in OCI's Data Labeling Service. Once this step completes, the dataset is ready for further operations, such as record creation.

<span style="color:red">**Note:** Dataset creation is asynchronous, so the code waits until the newly created dataset reaches the 'ACTIVE' state. This is done using `oci.wait_until`, which periodically checks the dataset's status.</span>


In [8]:
ds_display_name = metadata_jsonl_obj[0]['displayName']
ds_description = metadata_jsonl_obj[0]['description']
ds_annotation_format = metadata_jsonl_obj[0]['annotationFormat']
ds_format_type = metadata_jsonl_obj[0]["datasetFormatDetails"]['formatType']
ds_labels = [label['name'] for label in metadata_jsonl_obj[0]['labelsSet']]


#print(metadata_jsonl_obj)

print(f"Start the creation of the dataset {ds_display_name} ...")
#print(ds_annotation_format)
#print(ds_format_type)
#print(ds_labels)

ds_resp = create_dataset(compartment_id,
                   namespace,
                   bucket_name,
                   prefix,
                   ds_display_name,
                   ds_description,
                   ds_format_type,
                   ds_annotation_format, 
                   ds_labels)

if ds_resp.status == 201: #status created
    
    # Extract the dataset's OCID (unique identifier)
    dataset_id = ds_resp.data.id
    print(f"Dataset named {ds_display_name} created successfully.\nDataset OCID: {dataset_id}")

    # Retrieve opc-request-id from the response headers (optional for logging)
    opc_request_id = ds_resp.headers.get("opc-request-id")
    print(f"OPC Request ID: {opc_request_id}")
    
    # Wait until the dataset reaches the 'ACTIVE' lifecycle state
    print(f"Wait for the dataset {ds_display_name} to be in ACTIVE status...")

    get_dataset_response = dls_client.get_dataset(dataset_id)

    oci.wait_until(
        dls_client,
        get_dataset_response,
        evaluate_response=lambda r: r.data.lifecycle_state == 'ACTIVE',
        max_wait_seconds=60,  # Give up after 60 seconds
        max_interval_seconds=3  # Wait at most 3 seconds between checks
    )

    print(f"Dataset {ds_display_name} is now ACTIVE. You can start creating records.")


Start the creation of the dataset healthcare_ner_dataset_v1.0.0 ...
Dataset named healthcare_ner_dataset_v1.0.0 created successfully.
Dataset OCID: ocid1.datalabelingdataset.oc1.ca-toronto-1.amaaaaaa3hvgr2qan3yenas7wktkowma6gsjyvfe72ac2xr2pe76iwvkljaq
OPC Request ID: 64B04E4F37A447BDA6C6FB89E0C244BE/19B00E41E5115CDA21C0D00805AFE1E2/999AB8CB07D840AFC03CA3A1B7375B1E
Wait for the dataset healthcare_ner_dataset_v1.0.0 to be in ACTIVE status...
Dataset healthcare_ner_dataset_v1.0.0 is now ACTIVE. You can start creating records.
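
The waiting step can be pictured as a polling loop. The sketch below is a standard-library stand-in, not the SDK's implementation: `poll_until`, `fake_get_dataset`, and the simulated state sequence are hypothetical and only illustrate checking a state at intervals until it matches or a timeout expires.

```python
import time


def poll_until(fetch, is_done, max_wait_seconds=60, interval_seconds=3, sleep=time.sleep):
    """Call fetch() repeatedly until is_done(result) is true or the time budget runs out."""
    deadline = time.monotonic() + max_wait_seconds
    while True:
        result = fetch()
        if is_done(result):
            return result
        # Stop if another sleep would take us past the deadline
        if time.monotonic() + interval_seconds > deadline:
            raise TimeoutError("resource did not reach the desired state in time")
        sleep(interval_seconds)


# Simulated lifecycle: two checks in CREATING, then ACTIVE (made-up sequence)
states = iter(["CREATING", "CREATING", "ACTIVE"])


def fake_get_dataset():
    return next(states)


final_state = poll_until(fake_get_dataset, lambda s: s == "ACTIVE", interval_seconds=0.01)
print(final_state)
```

A short interval keeps the notebook responsive for small datasets, while the overall timeout prevents the cell from hanging if the dataset never becomes active.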


### Create Records and Annotations for Dataset in OCI Data Labeling Service

This code block handles the creation of records and their corresponding annotations for the newly created dataset in OCI's Data Labeling Service.

After the dataset is successfully created, the code iterates through each record in the metadata, creating records in the dataset. For each created record, associated annotations are added by looping through the entities defined in the metadata.

In [9]:
import json
from tqdm import tqdm


# Loop over the records in the metadata and create annotated records in the dataset
for idx, json_obj in enumerate(tqdm(metadata_jsonl_obj[1:], 
                                desc="Importing dataset records", 
                                total=len(metadata_jsonl_obj[1:])
                               )
                          ):

    rec_name = json_obj["sourceDetails"]["path"]
    #print(f'create record {idx} record name : {rec_name}')
    rec_resp =  create_ds_rec(compartment_id, dataset_id, prefix, rec_name)

    if rec_resp.status==200:
      record_id = rec_resp.data.id

      for annot_obj in json_obj["annotations"]:
        annotations_list = annot_obj["entities"]
        annot_resp = add_rec_annotation(record_id, annotations_list)
        #print(annot_resp)


Importing dataset records: 100%|██████████| 9000/9000 [1:15:05<00:00,  2.00it/s]
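
The loop above moves on without comment when a record creation does not return status 200. One possible way to keep track of such records is sketched below with stand-in objects: `FakeResponse`, `fake_create_record`, `import_records`, and the record names are hypothetical, and no OCI call is made.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Stand-in for a service response or error that exposes a status code."""
    status: int


def fake_create_record(rec_name):
    # Pretend that names containing "bad" are rejected by the service
    return FakeResponse(status=409 if "bad" in rec_name else 200)


def import_records(rec_names, create_record):
    """Create each record and separate successes from failures."""
    created, failed = [], []
    for rec_name in rec_names:
        resp = create_record(rec_name)
        status = getattr(resp, "status", None)
        if status == 200:
            created.append(rec_name)
        else:
            failed.append((rec_name, status))
    return created, failed


created, failed = import_records(["rec_1.txt", "bad_2.txt", "rec_3.txt"], fake_create_record)
print(len(created), "created;", len(failed), "failed:", failed)
```

Keeping the failed names makes it possible to retry only those records instead of repeating the whole import.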


## Dataset Creation Verification

### List Datasets in OCI Data Labeling Service

This code block sends a request to the OCI Data Labeling Service to list our newly created dataset. 

In [8]:
# Send the request to the service. Some parameters are optional; see the API
# documentation for details.
list_datasets_response = dls_client.list_datasets(
    compartment_id=compartment_id,
    annotation_format="ENTITY_EXTRACTION",
    lifecycle_state="ACTIVE",
    display_name=ds_name
    )

# Get the data from response
print(list_datasets_response.data)


{
  "items": [
    {
      "annotation_format": "ENTITY_EXTRACTION",
      "compartment_id": "ocid1.compartment.oc1..aaaaaaaa62hn2mwhaice4wivf3zpyqbdawnsmxnoznhv5jytcd3plk6n3feq",
      "dataset_format_details": {
        "format_type": "TEXT",
        "text_file_type_metadata": null
      },
      "defined_tags": {
        "Oracle-Tags": {
          "CreatedBy": "ocid1.datasciencenotebooksession.oc1.ca-toronto-1.amaaaaaa3hvgr2qa33xfstuzyvf4psnypk6h3rnbzibckrjznwx7v7zf5ooq",
          "CreatedOn": "2024-08-19T19:29:12.905Z"
        }
      },
      "display_name": "healthcare_ner_dataset_v1.0.0_300",
      "freeform_tags": {},
      "id": "ocid1.datalabelingdataset.oc1.ca-toronto-1.amaaaaaa3hvgr2qaujghmfstewhlxe2z6fp72flzby37mvmywwpa2bgl5uda",
      "lifecycle_details": null,
      "lifecycle_state": "ACTIVE",
      "system_tags": {},
      "time_created": "2024-08-19T19:29:13.220000+00:00",
      "time_updated": null
    },
    {
      "annotation_format": "ENTITY_EXTRACTION",
      "