# 5. Use CML API to Define a Job Which Populates Chroma Vector DB
In exercise 2 you created a dependent job manually. This exercise does the same with CML APIv2. Creating jobs through the `cmlapi` library lets you define and run them programmatically, so job definitions can be automated, kept under version control, and reproduced across projects instead of being clicked together by hand.

![Populate Chroma architecture](../assets/exercise_5.png)

#### 5.1 Declare imports, create CML API client, and list available runtimes
Import the necessary modules, define a collection name, initialize a CML API client, and retrieve the list of available runtimes that match the given kernel, edition, and editor.

In [None]:
import os
import cmlapi
import random
import string
import json

COLLECTION_NAME = 'cml-default' ## Update if you have changed this
    
## Strip the /api/v1 suffix so the client receives the base URL
client = cmlapi.default_client(
    url=os.getenv("CDSW_API_URL").replace("/api/v1", ""),
    cml_api_key=os.getenv("CDSW_APIV2_KEY")
)
available_runtimes = client.list_runtimes(search_filter=json.dumps({
    "kernel": "Python 3.10",
    "edition": "Nvidia GPU",
    "editor": "JupyterLab"
}))
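
The client cell above assumes that `CDSW_API_URL` is set and ends in `/api/v1`. As a minimal sketch of that step, the helper below derives the base URL and fails with a readable message when the variable is missing, then shows the JSON string that is passed as `search_filter`. The name `api_base_url` and the example URL are hypothetical and are not part of `cmlapi`.

```python
import json


def api_base_url(env):
    """Derive the API base URL from a mapping of environment variables.

    Hypothetical helper, not part of cmlapi: it mirrors the
    .replace("/api/v1", "") call used when the client is created.
    """
    url = env.get("CDSW_API_URL")
    if not url:
        raise RuntimeError("CDSW_API_URL is not set; run this inside a CML session")
    return url.replace("/api/v1", "")


# The runtime filter is an ordinary dict serialized to a JSON string.
runtime_filter = json.dumps({
    "kernel": "Python 3.10",
    "edition": "Nvidia GPU",
    "editor": "JupyterLab",
})

# Example with a made-up URL (assumption, for illustration only).
example_url = api_base_url({"CDSW_API_URL": "https://ml.example.com/api/v1"})
print(example_url)
print(runtime_filter)
```

In the notebook, `api_base_url(os.environ)` could be passed as the `url` argument of `cmlapi.default_client` in place of the inline `replace` call.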

#### 5.2 Retrieve the latest ML Runtime Identifier and save to an environment variable

In [None]:
## Select the first runtime in the returned list (index 0); this exercise treats it as the latest runtime in the environment
## JOB_IMAGE_ML_RUNTIME stores the identifier of the ML Runtime image that will be used to launch the job
print(available_runtimes.runtimes[0])
print(available_runtimes.runtimes[0].image_identifier)
JOB_IMAGE_ML_RUNTIME = available_runtimes.runtimes[0].image_identifier

## Store the ML Runtime for any future jobs in an environment variable so we don't have to do this step again
os.environ['JOB_IMAGE_ML_RUNTIME'] = JOB_IMAGE_ML_RUNTIME
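
Indexing `runtimes[0]` raises an `IndexError` when no runtime matches the filter. The sketch below shows one way to guard that selection. It uses `SimpleNamespace` stand-ins rather than real API response objects, and both the helper name `pick_runtime_identifier` and the image identifiers are made up for illustration.

```python
from types import SimpleNamespace


def pick_runtime_identifier(runtimes, index=0):
    """Return the image identifier of the runtime at `index`.

    Hypothetical helper: raises a descriptive error instead of an
    IndexError when the filtered runtime list is empty.
    """
    if not runtimes:
        raise ValueError("No runtime matched the filter; relax kernel, edition, or editor")
    return runtimes[index].image_identifier


# Stand-in objects with the one attribute the notebook reads (assumed data).
fake_runtimes = [
    SimpleNamespace(image_identifier="registry.example.com/runtime-a:1"),
    SimpleNamespace(image_identifier="registry.example.com/runtime-b:1"),
]

print(pick_runtime_identifier(fake_runtimes))
```

In the notebook this would be called as `pick_runtime_identifier(available_runtimes.runtimes)`. Note that assigning to `os.environ` only affects the current Python process and any child processes it starts, so the variable has to be set again in a new session.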

#### 5.3 Get the current working project
Retrieve the metadata of the current working project. The returned object provides the project ID used when the job is created.

In [None]:
# Look up the current project by the ID stored in CDSW_PROJECT_ID
project = client.get_project(project_id=os.getenv("CDSW_PROJECT_ID"))

#### 5.4 Create and Run Job to Populate Chroma Vector DB

This code generates a random suffix so that the job name is unique, then builds a job request that specifies the project ID, the script to run, the CPU and memory allocation, and the runtime identifier saved in step 5.2. It then creates the job and starts a job run in Cloudera Machine Learning, which populates the Chroma vector DB.

In [None]:
# Random 10-letter suffix keeps job names unique across repeated runs
random_id = ''.join(random.choice(string.ascii_lowercase) for _ in range(10))
job_body = cmlapi.CreateJobRequest(
    project_id = project.id,
    name = "Populate Chroma Vector DB " + random_id,
    script = "5_populate_local_chroma_db/populate_chroma_vectors.py",
    cpu = 1,
    memory = 4,
    runtime_identifier = os.getenv('JOB_IMAGE_ML_RUNTIME')
)

job_result = client.create_job(
    body = job_body, 
    project_id = str(project.id)
)

job_run = client.create_job_run(
    cmlapi.CreateJobRunRequest(),
    project_id = project.id, 
    job_id = job_result.id
)
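
The cell above only requests the job run; it does not wait for the run to finish. One common approach is to poll the run status until it reaches a terminal state. The sketch below is a generic polling loop under stated assumptions: `wait_for_status` is a hypothetical helper, `fetch_status` is any zero-argument callable that returns the current status, and the status strings in the example are placeholders rather than actual CML status values.

```python
import time


def wait_for_status(fetch_status, terminal_states, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll `fetch_status` until it returns a value in `terminal_states`.

    Hypothetical helper: raises TimeoutError if no terminal state is
    seen after roughly `timeout` seconds of waiting.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status in terminal_states:
            return status
        if waited >= timeout:
            raise TimeoutError(f"Still '{status}' after {timeout} seconds")
        sleep(interval)
        waited += interval


# Simulated run: placeholder statuses, not real CML values.
fake_statuses = iter(["scheduling", "running", "succeeded"])
final = wait_for_status(
    lambda: next(fake_statuses),
    terminal_states={"succeeded", "failed"},
    interval=0.0,
)
print(final)
```

In the notebook, `fetch_status` would wrap a call that re-reads the job run created above. The exact call and the set of terminal status values should be checked against the `cmlapi` reference for your CML version.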