# Embed large text files using Batch Transform

If you have large text datasets to embed with Jina's models, Amazon SageMaker's Batch Transform is a convenient option. Instead of embedding texts one at a time, Batch Transform processes the whole dataset in bulk. Provide the path to your dataset in an S3 bucket and specify an output path. Once the transform job completes, the embeddings are uploaded to the designated S3 location.

## Prerequisites
1. Ensure that the IAM role you use has the **AmazonSageMakerFullAccess** policy attached.
1. To deploy this ML model successfully, ensure that either:
    1. your IAM role has these three permissions and you have the authority to make AWS Marketplace subscriptions in the AWS account used:
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to [jina-embedding-model](link).
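
As a rough illustration of the first option, the three Marketplace permissions could be expressed as an IAM policy document like the one built below. This is only a sketch: the helper name `build_marketplace_policy` is hypothetical, and the wildcard `Resource` is an assumption rather than something this guide prescribes.

```python
import json

# The three AWS Marketplace actions listed in the prerequisites above.
MARKETPLACE_ACTIONS = [
    "aws-marketplace:ViewSubscriptions",
    "aws-marketplace:Unsubscribe",
    "aws-marketplace:Subscribe",
]


def build_marketplace_policy() -> dict:
    """Return an IAM policy document allowing the Marketplace actions (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": MARKETPLACE_ACTIONS,
                "Resource": "*",  # assumption: not scoped to specific resources
            }
        ],
    }


# Print the policy as JSON so it can be reviewed before attaching it to a role.
print(json.dumps(build_marketplace_policy(), indent=2))
```

The printed JSON can be reviewed with whoever administers the account before it is attached to the role.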

# Model package setup

Install the `jina-sagemaker` package:


```bash
pip install --upgrade jina-sagemaker
```

```python
# Specify the IAM role as required by SageMaker
role = "..."
```

```python
import boto3

# Region of the current boto3 session (None if no region is configured)
region = boto3.Session().region_name

# Specify the model name
model_name = "jina-embeddings-v2-small-en"

# Mapping from region to model package ARN
model_package_map = {
    'ap-northeast-1': f'arn:aws:sagemaker:ap-northeast-1:253352124568:model-package/{model_name}',
    'ap-northeast-2': f'arn:aws:sagemaker:ap-northeast-2:253352124568:model-package/{model_name}',
    'ap-south-1': f'arn:aws:sagemaker:ap-south-1:253352124568:model-package/{model_name}',
    'ap-southeast-1': f'arn:aws:sagemaker:ap-southeast-1:253352124568:model-package/{model_name}',
    'ap-southeast-2': f'arn:aws:sagemaker:ap-southeast-2:253352124568:model-package/{model_name}',
    'ca-central-1': f'arn:aws:sagemaker:ca-central-1:253352124568:model-package/{model_name}',
    'eu-central-1': f'arn:aws:sagemaker:eu-central-1:253352124568:model-package/{model_name}',
    'eu-north-1': f'arn:aws:sagemaker:eu-north-1:253352124568:model-package/{model_name}',
    'eu-west-1': f'arn:aws:sagemaker:eu-west-1:253352124568:model-package/{model_name}',
    'eu-west-2': f'arn:aws:sagemaker:eu-west-2:253352124568:model-package/{model_name}',
    'eu-west-3': f'arn:aws:sagemaker:eu-west-3:253352124568:model-package/{model_name}',
    'sa-east-1': f'arn:aws:sagemaker:sa-east-1:253352124568:model-package/{model_name}',
    'us-east-1': f'arn:aws:sagemaker:us-east-1:253352124568:model-package/{model_name}',
    'us-east-2': f'arn:aws:sagemaker:us-east-2:253352124568:model-package/{model_name}',
    'us-west-1': f'arn:aws:sagemaker:us-west-1:253352124568:model-package/{model_name}',
    'us-west-2': f'arn:aws:sagemaker:us-west-2:253352124568:model-package/{model_name}'
}

# Fail early if the session's region has no model package
if region not in model_package_map:
    raise ValueError(f"Current boto3 session region {region} is not supported.")

# Select the model package ARN for the current region
model_package_arn = model_package_map[region]
```

## Create the batch transform job

### Gotchas

##### Input File

The input file should be a CSV file, stored either on S3 or locally. It should have the following properties:

- **No Headers**: The file should not include a header row.
- **CSV Quoting**: Fields should not be wrapped in surrounding quotes.
- **Escape Characters**: The escape character (`\`) is used to prevent special characters, such as commas inside the text, from being interpreted as part of the CSV formatting.
- **Column(s)**: If the file is on S3, it must have two columns: the ID and the text to be embedded. If the file is local, only the text column is required, and the client generates the document IDs as needed.
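
To make the two-column S3 layout concrete, here is a minimal sketch that writes such a file with Python's standard `csv` module. The helper name `write_s3_input_csv`, the file name, and the use of running integers as IDs are assumptions for illustration; the expected ID format is not specified here.

```python
import csv


def write_s3_input_csv(texts, path):
    """Write (ID, text) rows with no header, no quoting, and backslash escaping."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(
            f,
            quoting=csv.QUOTE_NONE,  # no surrounding quotes
            escapechar="\\",         # commas inside the text become \,
            lineterminator="\n",
        )
        # Assumption: running integers are acceptable as document IDs.
        for doc_id, text in enumerate(texts):
            writer.writerow([doc_id, text])


# Example: two short documents, the first containing a comma that gets escaped.
write_s3_input_csv(["A great movie, truly.", "Not my favourite"], "imdb_with_ids.csv")
```

The resulting file would still need to be uploaded to S3 before it can be used as an S3 `input_path`.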


##### Output File

The output file will be a JSON Lines file with the extension `.out`. Each line will contain a list of document IDs and their embeddings. If `output_path` is a local path, the output file will be downloaded to it. If it is an S3 path, the output file will be uploaded to the S3 bucket.
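
A downloaded output file can be read back line by line. The sketch below assumes only that each non-empty line is valid JSON; the exact keys inside each line are not specified here, so the hypothetical helper `read_transform_output` returns the parsed lines as-is for inspection.

```python
import json
from pathlib import Path


def read_transform_output(output_dir):
    """Parse every non-empty line of every `.out` file in output_dir as JSON."""
    records = []
    for out_file in sorted(Path(output_dir).glob("*.out")):
        with open(out_file, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    return records
```

Calling `read_transform_output('output_dir')` after a job has finished and printing the first record is a quick way to see how the IDs and embeddings are laid out.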

---

Let's download a sample dataset and store it as a CSV file in the expected format.

```python
import csv

from datasets import load_dataset


def save_imdb_dataset(path: str, N: int = 100):
    """Save the first N IMDB training reviews as a single-column CSV."""
    dataset = load_dataset('imdb', split='train')
    dataset.to_pandas().text.head(N).to_csv(
        path,
        header=False,  # no header
        index=False,
        quoting=csv.QUOTE_NONE,  # no quotes
        escapechar='\\'  # \ is the escape character
    )


# Save the dataset
save_imdb_dataset('imdb.csv')
```

```python
from jina_sagemaker import Client

client = Client(region_name=region)

input_path = 'imdb.csv'  # local path to the dataset downloaded above
output_path = 'output_dir'  # local path to the output directory

# Create a batch transform job with the model package
client.create_transform_job(
    arn=model_package_arn,
    role=role,
    n_instances=1,
    instance_type="ml.g4dn.xlarge",
    input_path=input_path,
    output_path=output_path,
    logs=False
)
```


Depending on the size of your dataset, this may take a few minutes. The output file is stored on S3 and, because `output_path` is a local path here, downloaded to that directory.