# Embed large text files using Batch Transform

If you have large text datasets to embed with Jina's models, Amazon SageMaker's Batch Transform lets you process them in bulk instead of sending texts one at a time. Provide the path to your dataset in an S3 bucket and specify an output path. When the transform job completes, the embeddings are uploaded to the designated S3 location.

In [None]:
# !pip install --upgrade jina-sagemaker

In [1]:
from jina_sagemaker import Client
import boto3

region = boto3.Session().region_name

# Specify the role if needed
role = 'arn:aws:iam::253352124568:role/service-role/AmazonSageMaker-ExecutionRole-20230527T104084'

# Specify the model name
model_name = "jina-embedding-l-en-v1"

# Mapping for Model Packages
model_package_map = {
    "us-east-1": f"arn:aws:sagemaker:us-east-1:253352124568:model-package/{model_name}",
}

# Fail early if the current region has no model package
if region not in model_package_map:
    raise ValueError(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]

## Create the batch transform job

#### Gotchas

- The input file must be a CSV file with no header row.
- If the input file resides on S3, it must have two columns: the first for the document ID and the second for the text to embed.
- If the input is a local file, only the text column is required; the client generates the document IDs automatically.
- The output is a JSON Lines file with the `.out` extension, containing the document ID and the embedding.
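The input layout described above can be produced with Python's standard `csv` module. The following is a minimal sketch, not part of `jina_sagemaker`: the helper name `write_batch_input`, the `with_ids` flag, and the use of running integers as document IDs are all assumptions made for illustration.

```python
import csv
from pathlib import Path


def write_batch_input(texts, path, with_ids=True):
    """Write texts to a headerless CSV (hypothetical helper, for illustration).

    with_ids=True  -> two columns (document ID, text), the layout for S3 input.
    with_ids=False -> a single text column, the layout for local input.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        # csv.writer quotes fields containing commas, quotes, or newlines.
        writer = csv.writer(f)
        for i, text in enumerate(texts):
            # No header row is written; IDs are simple running integers (an assumption).
            writer.writerow([i, text] if with_ids else [text])
    return path
```

For example, `write_batch_input(["first text", "second text"], "batch_transform_dummy_input/input.csv", with_ids=False)` would write a one-column local file at the path used in the next cell.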

In [5]:
client = Client(region_name=region)

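# Option 1: input and output on S3 (two-column CSV: document ID, text)
# input_data_path_batch = "s3://sagemaker-us-east-1-253352124568/a.csv"
# output_data_path = "s3://sagemaker-us-east-1-253352124568/output/a"

# Option 2: local input file and local output directory
input_data_path_batch = 'batch_transform_dummy_input/input.csv'
output_data_path = 'output_dir'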

client.create_transform_job(
    arn=model_package_arn,
    role=role,
    n_instances=1,
    instance_type="ml.g4dn.xlarge",
    input_path=input_data_path_batch,
    output_path=output_data_path,
    logs=False
)

INFO:sagemaker:Creating model with name: jina-embedding-l-en-v1
INFO:sagemaker:Creating transform job with name: jina-embedding-l-en-v1-2023-10-12-08-51-41-040


........................................................................!
Output downloaded to output_dir.
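
Once the output has been downloaded, the `.out` files can be read line by line as JSON. The sketch below is not part of `jina_sagemaker`; the helper name `read_transform_output` is hypothetical, and it searches the directory recursively because the exact file name and location inside `output_dir` are not shown here. The key names inside each record are likewise not shown, so inspect the first record before relying on any particular field.

```python
import json
from pathlib import Path


def read_transform_output(output_dir):
    """Yield one parsed JSON object per non-empty line of every .out file.

    Hypothetical helper: searches output_dir recursively, since the exact
    output file name and location are assumptions.
    """
    for out_file in sorted(Path(output_dir).rglob("*.out")):
        with out_file.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)
```

A typical first step would be `first = next(read_transform_output("output_dir"), None)` followed by printing `first` to see which keys hold the document ID and the embedding.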
