# Embed large input using Batch Transform

If you need embeddings for a large dataset using Jina's models, Amazon SageMaker Batch Transform lets you process the inputs in bulk instead of sending them one at a time. Provide the path to your dataset in an S3 bucket and specify an output path. When the transform job completes, the embeddings are uploaded to the designated S3 location.

## Prerequisites
1. Ensure that the IAM role you use has the **AmazonSageMakerFullAccess** policy attached.
1. To deploy the model successfully, ensure that either:
    1. your IAM role has the following three permissions and you have the authority to make AWS Marketplace subscriptions in the AWS account used:
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**
    2. or your AWS account already has a subscription to the model package.
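
As a rough aid for the permission check above, the following sketch inspects an IAM policy document (already loaded as a Python dict) and reports which of the three Marketplace actions it does not allow. The helper name `missing_marketplace_actions` and the example policy are hypothetical, and the check is deliberately simplified: it only looks at `Allow` statements and ignores `Deny` statements, conditions, and resource restrictions.

```python
from fnmatch import fnmatchcase

# The three AWS Marketplace actions listed in the prerequisites.
REQUIRED_ACTIONS = (
    "aws-marketplace:ViewSubscriptions",
    "aws-marketplace:Unsubscribe",
    "aws-marketplace:Subscribe",
)


def missing_marketplace_actions(policy_document: dict) -> list:
    """Return the required actions not covered by any Allow statement.

    Hypothetical helper: a simplified check, not a full IAM policy evaluation.
    """
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    allowed_patterns = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # "Action" may be a string or a list
            actions = [actions]
        # IAM action names are case-insensitive and may contain wildcards.
        allowed_patterns.extend(action.lower() for action in actions)

    return [
        required
        for required in REQUIRED_ACTIONS
        if not any(fnmatchcase(required.lower(), p) for p in allowed_patterns)
    ]


# Made-up policy that is missing one of the three actions.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "aws-marketplace:ViewSubscriptions",
                "aws-marketplace:Subscribe",
            ],
            "Resource": "*",
        }
    ],
}

print(missing_marketplace_actions(example_policy))
```

An empty list means all three actions are matched by at least one `Allow` statement in that policy document; it does not prove that the role's effective permissions include them.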

# Model package setup

Install the `jina-sagemaker` package and the other dependencies used in this notebook.


```bash
pip install --upgrade jina-sagemaker
pip install datasets
```

```python
# Specify the role as required by SageMaker
role = "..."
```

```python
import boto3

region = boto3.Session().region_name

# Mapping for Model Package Names
model_name_map = {
    "jina-embeddings-v2-base-en": "jina-embeddings-v2-base-en-32555da8a0b431d190bf3eca46758b72",
    "jina-embeddings-v2-small-en": "jina-embeddings-v2-small-en-0e950fb984e3396fa4e1108adf69937c",
    "jina-embeddings-v2-base-code": "jina-embeddings-v2-base-code-7effc955e13e3c3aa0110bde043f9ead",
    "jina-embeddings-v2-base-de": "jina-embeddings-v2-base-de-c269d166764133348365f57b8f1d8c7a",
    "jina-embeddings-v2-base-es": "jina-embeddings-v2-base-es-3ae2ef99284e31dab5dd5a367620fc29",
    "jina-embeddings-v2-base-zh": "jina-embeddings-v2-base-zh-4da30f467aaf347580ba5ed2648e399a",
    "jina-clip-v1": "jina-embeddings-v2-base-zh-tbd",
    "jina-reranker-v1-base-en": "jina-reranker-v1-base-en-77e50152c042315da374fb388ad6f40d",
    "jina-reranker-v1-turbo-en": "jina-reranker-v1-turbo-en-643a1c3bd23a3b298cb4b6caf43eaf91",
    "jina-reranker-v1-tiny-en": "jina-reranker-v1-tiny-en-209f8205ecad33c1a504ac85e9b79a53",
    "jina-colbert-v1-en": "jina-colbert-v1-en-b5d8f6e93044340b8b02c554f9de20d9",
    "jina-colbert-reranker-v1-en": "jina-colbert-reranker-v1-en-1000ef444ec931dbae4dc85828d08a8a",
}

# Specify the model name; jina-embeddings-v2-base-en is picked here as an example
model_name = model_name_map["jina-embeddings-v2-base-en"]

# Mapping for Model Packages
model_package_map = {
    "us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:model-package/{model_name}",
    "us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:model-package/{model_name}",
    "us-west-1": f"arn:aws:sagemaker:us-west-1:382657785993:model-package/{model_name}",
    "us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{model_name}",
    "ca-central-1": f"arn:aws:sagemaker:ca-central-1:470592106596:model-package/{model_name}",
    "eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:model-package/{model_name}",
    "eu-west-1": f"arn:aws:sagemaker:eu-west-1:985815980388:model-package/{model_name}",
    "eu-west-2": f"arn:aws:sagemaker:eu-west-2:856760150666:model-package/{model_name}",
    "eu-west-3": f"arn:aws:sagemaker:eu-west-3:843114510376:model-package/{model_name}",
    "eu-north-1": f"arn:aws:sagemaker:eu-north-1:136758871317:model-package/{model_name}",
    "ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/{model_name}",
    "ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/{model_name}",
    "ap-northeast-2": f"arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/{model_name}",
    "ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/{model_name}",
    "ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:model-package/{model_name}",
    "sa-east-1": f"arn:aws:sagemaker:sa-east-1:270155090741:model-package/{model_name}",
}

# Look up the model package ARN for the current region; fail early if the region is unsupported
if region not in model_package_map:
    raise ValueError(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]
```

## Create the batch transform job

### Input File

The input must be a CSV file, stored either on S3 or locally. The CSV file must have the following properties:

- **No headers**: the file must not include a header row.
- **CSV quoting**: text fields must not be wrapped in surrounding quotes.
- **Escape characters**: the escape character (`\`) is used to prevent special characters from being interpreted as part of the CSV formatting.
- **Column(s)**: each model package has different requirements for the input in each row. Please refer to the sample batch input CSV files in the `examples` directory of this repository.
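
To make the quoting and escaping rules concrete, here is a minimal sketch that uses Python's standard `csv` module to write two made-up, single-column rows with no header, no surrounding quotes, and `\` as the escape character, then reads them back. The sample sentences are illustrative only; the column layout your model package expects may differ.

```python
import csv
import io

# Made-up single-column rows; the second contains a comma and quote characters.
rows = [
    ["A plain sentence without special characters."],
    ['A sentence with a comma, and a "quoted" word.'],
]

# Write without a header row, without surrounding quotes, using "\" to escape.
buffer = io.StringIO()
writer = csv.writer(
    buffer,
    quoting=csv.QUOTE_NONE,  # never wrap fields in quotes
    escapechar="\\",         # prefix special characters with a backslash
    lineterminator="\n",
)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading with the same settings recovers the original text.
parsed = list(
    csv.reader(io.StringIO(csv_text), quoting=csv.QUOTE_NONE, escapechar="\\")
)
print(parsed == rows)
```

Because quoting is disabled, any delimiter inside the text has to be escaped; otherwise the row would be split into extra columns.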


### Output File

The output is a JSON Lines file with the extension `.out`. Each line contains a list of document IDs and their embeddings. If `output_path` is a local path, the output file is downloaded there; if it is an S3 path, the output file is uploaded to that S3 location.
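
A JSON Lines file can be read one line at a time with the standard `json` module. The sketch below assumes nothing about the model's actual record layout: the `id` and `embedding` keys and the file name `imdb.csv.out` are hypothetical stand-ins written to a temporary directory so the example runs on its own. Inspect a real output file to see the exact structure.

```python
import json
import tempfile
from pathlib import Path


def read_jsonlines(path):
    """Parse a JSON Lines file into a list, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


with tempfile.TemporaryDirectory() as tmp_dir:
    # Synthetic stand-in for a real `.out` file; the keys are hypothetical.
    sample_path = Path(tmp_dir) / "imdb.csv.out"
    sample_path.write_text(
        '{"id": 0, "embedding": [0.1, 0.2]}\n'
        '{"id": 1, "embedding": [0.3, 0.4]}\n',
        encoding="utf-8",
    )
    records = read_jsonlines(sample_path)

print(len(records))
```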

### Example

Let's download a sample dataset, store it as a CSV file in the expected format, and compute text embeddings with `jina-embeddings-v2-base-en`.

```python
def save_imdb_dataset(path: str, N: int = 100):
    import csv
    from datasets import load_dataset

    dataset = load_dataset('imdb', split='train')
    dataset.to_pandas().text.head(N).to_csv(
        path, 
        header=False, # no header
        index=False,
        quoting=csv.QUOTE_NONE, # no quotes
        escapechar='\\' # \ is the escape character
    )

# Save the dataset
save_imdb_dataset('imdb.csv')
```

```python
from jina_sagemaker import Client

client = Client(region_name=region)

input_path = 'imdb.csv' # local path to the dataset downloaded above
output_path = 'output_dir' # local path to the output directory

# Create a batch transform job with the model package
client.create_transform_job(
    arn=model_package_arn,
    role=role,
    n_instances=1,
    instance_type="ml.g4dn.xlarge",
    input_path=input_path,
    output_path=output_path,
    logs=False
)
```


Depending on the size of your dataset, this may take a few minutes. The output file will be stored on S3 and downloaded to the `output_path` directory.

### Limitations

By default, SageMaker Batch Transform jobs do not have internet access. If you are using `jina-clip-v1` and specify image URLs in the CSV input for the Batch Transform job, the job cannot fetch the image content.

However, you can [configure Batch Transform jobs to open up access](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-vpc.html) to work around this limitation.
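
Before submitting a job, it can help to check whether the input CSV contains URLs at all. The sketch below is a hypothetical pre-flight check, not part of `jina-sagemaker`: it reads a CSV using the quoting rules described earlier and returns the zero-based indices of rows in which any field is an `http` or `https` URL. The sample file contents are made up.

```python
import csv
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def is_http_url(value: str) -> bool:
    """Return True if the value parses as an http(s) URL with a host."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def rows_with_urls(csv_path) -> list:
    """Return zero-based indices of rows that contain at least one URL field."""
    flagged = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, quoting=csv.QUOTE_NONE, escapechar="\\")
        for row_index, row in enumerate(reader):
            if any(is_http_url(field) for field in row):
                flagged.append(row_index)
    return flagged


with tempfile.TemporaryDirectory() as tmp_dir:
    # Made-up input: one text row and one image URL row.
    sample_path = Path(tmp_dir) / "clip_input.csv"
    sample_path.write_text(
        "a photo of a cat\nhttps://example.com/cat.png\n", encoding="utf-8"
    )
    flagged = rows_with_urls(sample_path)

print(flagged)
```

If the list is non-empty and the job runs without network access, those rows are the ones likely to be affected by the limitation described above.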