# Deploy Mistral to SageMaker + Inferentia2 using HF Optimum Neuron and TGI

This guide details how to compile, deploy and run **Mistral 7B** on AWS Inferentia2 with Amazon SageMaker.

You will learn how to:
- set up your AWS instance,
- compile the Mistral-7B model to the Neuron format using a SageMaker Training Job,
- deploy the model and use it in any application.

Note: This tutorial was created on Amazon SageMaker.

## Prerequisite: Setup AWS environment

*You can skip this section if you are already running this notebook on SageMaker Studio.*

In your AWS Account, follow the [SageMaker Studio Getting Started tutorial](https://aws.amazon.com/sagemaker/studio/) to run this notebook. Then, clone this repo (https://github.com/huggingface/optimum-neuron) into your environment. Navigate through the directories and double click on this notebook.

Select the most appropriate kernel to initialize the notebook.  
**SageMaker Studio Kernel**: Python 3 (ipykernel)  
**SageMaker Studio Classic Kernel**: Python 3 (PyTorch 1.13 Python 3.9 CPU Optimized) / **Instance**: ml.t3.medium

In [None]:
import os
import sagemaker

print(sagemaker.__version__)
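# Compare versions numerically: a plain string comparison would rank "2.99.0" above "2.146.0".
from packaging.version import Version  # packaging is installed as a dependency of the SageMaker SDK
if Version(sagemaker.__version__) < Version("2.146.0"):
    print("You need to upgrade the SageMaker SDK, or restart the kernel if you already upgraded")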

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()
region = sess.boto_region_name
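
Version strings are safer to compare numerically than as plain text. The sketch below uses only the standard library to show the difference; `version_tuple` is a hypothetical helper written for this illustration and is not part of the SageMaker SDK.

```python
import re

def version_tuple(version: str) -> tuple:
    """Turn '2.146.0' into (2, 146, 0); stops at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

MIN_VERSION = "2.146.0"  # minimum SDK version named in the cell above

# Text comparison looks at characters one by one, so "9" beats "1" here.
print("2.99.0" >= MIN_VERSION)                                # True (misleading)
# Comparing integer tuples gives the intended ordering.
print(version_tuple("2.99.0") >= version_tuple(MIN_VERSION))  # False
```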

## 1. Download and compile Mistral using a SageMaker Job
In this step we'll kick off a SageMaker training job that downloads the model and compiles it for Inferentia2 (inf2).

In [None]:
os.makedirs("src", exist_ok=True)

#### Python requirements for model compilation

In [None]:
%%writefile src/requirements.txt
--extra-index-url https://pip.repos.neuron.amazonaws.com
transformers==4.36.2
optimum-neuron==0.0.19
neuronx-distributed==0.6.0
transformers-neuronx==0.9.474

#### Compilation script
This script will be executed by SageMaker. It will download the weights and compile Mistral. 

In [None]:
%%writefile src/compile.py
import os
import json
import argparse
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

if __name__ == "__main__":
    parser = argparse.ArgumentParser()    
    parser.add_argument("--compilation_params", type=str, required=True)
    parser.add_argument("--model_id", type=str, required=False, default="yam-peleg/Experiment26-7B")    
    parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])        
    args, _ = parser.parse_known_args()
    
    # parse compilation params
    compilation_params = json.loads(args.compilation_params)
    print(compilation_params)
    # compile and save the model
    model = NeuronModelForCausalLM.from_pretrained(args.model_id, export=True, **compilation_params)
    model.save_pretrained(args.model_dir)
    # now load and export tokenizer
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
    tokenizer.save_pretrained(args.model_dir)
    print("Done!")

#### SageMaker Estimator to kick-off model compilation
If you want to deploy your own model, change `model_id` (passed to the script as **--model_id**) so that it points to the correct Hugging Face repo.

In [None]:
import json
import logging
from sagemaker.pytorch import PyTorch
num_cores=2
seq_len=2048
compilation_params= {"auto_cast_type": "bf16", "batch_size": 1, "sequence_length": seq_len, "num_cores": num_cores}
model_id='yam-peleg/Experiment26-7B'

estimator = PyTorch(
    entry_point="compile.py", # Specify your train script
    source_dir="src",
    role=role,
    sagemaker_session=sess,
    container_log_level=logging.DEBUG,
    instance_count=1,
    instance_type='ml.trn1.32xlarge',
    output_path=f"s3://{bucket}/output",
    disable_profiler=True,
    disable_output_compression=True,
    
    image_uri=f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training-neuronx:1.13.1-neuronx-py310-sdk2.17.0-ubuntu20.04",
    
    volume_size = 512,
    hyperparameters={         
        "compilation_params": f"'{json.dumps(compilation_params)}'",
        "model_id": model_id
    }
)
estimator.framework_version = '1.13.1' # workaround when using image_uri

In [None]:
estimator.fit()
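
The `compilation_params` hyperparameter travels to `compile.py` as a JSON string wrapped in single quotes. The sketch below replays that round trip locally. It assumes the training container builds a shell-style command line from the hyperparameters; `shlex.split` and the `simulated_cli` string stand in for that step and are illustrative only, not the container's actual mechanism.

```python
import argparse
import json
import shlex

compilation_params = {"auto_cast_type": "bf16", "batch_size": 1, "sequence_length": 2048, "num_cores": 2}

# Same wrapping as the hyperparameter in the estimator: JSON inside single quotes.
hyperparameter_value = f"'{json.dumps(compilation_params)}'"

# Assumption: the arguments reach the script through shell-style parsing.
# shlex.split mimics that and strips the protective single quotes.
simulated_cli = f"--compilation_params {hyperparameter_value} --model_id yam-peleg/Experiment26-7B"
argv = shlex.split(simulated_cli)

# Same argument definitions as compile.py (minus --model_dir, which needs SM_MODEL_DIR).
parser = argparse.ArgumentParser()
parser.add_argument("--compilation_params", type=str, required=True)
parser.add_argument("--model_id", type=str, default="yam-peleg/Experiment26-7B")
args, _ = parser.parse_known_args(argv)

parsed = json.loads(args.compilation_params)
print(parsed == compilation_params)  # True
```

Without the single quotes, the spaces and double quotes inside the JSON would be split or consumed by shell-style parsing before `argparse` sees them.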

## 2. Deploy the compiled model to a SageMaker Endpoint + Inferentia2 + TGI + HF Optimum Neuron

In [None]:
import logging
from sagemaker.utils import name_from_base
from sagemaker.pytorch.model import PyTorchModel

# the number of accelerators available depends on the inf2 instance type you deploy to
# we'll ask SageMaker to launch one worker per group of num_cores NeuronCores

model_data=estimator.model_data
print(f"Model data: {model_data}")

instance_type_idx=1 # default ml.inf2.8xlarge
instance_types=['ml.inf2.xlarge', 'ml.inf2.8xlarge', 'ml.inf2.24xlarge','ml.inf2.48xlarge']
num_workers=[2//num_cores,2//num_cores,12//num_cores,24//num_cores]

print(f"Instance type: {instance_types[instance_type_idx]}. Num SM workers: {num_workers[instance_type_idx]}")
pytorch_model = PyTorchModel(
    image_uri=f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:1.13.1-optimum0.0.18-neuronx-py310-ubuntu22.04-v1.0",
    model_data=model_data,
    role=role,    
    name=name_from_base('tgi-llm'),
    sagemaker_session=sess,
    container_log_level=logging.DEBUG,
    model_server_workers=num_workers[instance_type_idx], # 1 worker per model copy (num_cores NeuronCores each)
    framework_version="1.13.1",
    env = {
        'SAGEMAKER_MODEL_SERVER_TIMEOUT': '3600',
        'HF_MODEL_ID': '/opt/ml/model/',
        ## https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher
        'MAX_BATCH_PREFILL_TOKENS': '1024',
        'MAX_INPUT_LENGTH': '1024',
        'MAX_TOTAL_TOKENS': str(seq_len)
    },
    # for production it is important to define vpc_config and use a vpc_endpoint
    #vpc_config={
    #    'Subnets': ['<SUBNET1>', '<SUBNET2>'],
    #    'SecurityGroupIds': ['<SECURITYGROUP1>', '<DEFAULTSECURITYGROUP>']
    #}
)
pytorch_model._is_compiled_model = True
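
The worker count used above is the number of NeuronCores on the instance divided by the cores each compiled model copy needs. A minimal sketch of that arithmetic follows; the core counts are the numerators from the cell above, and `workers_for` is a hypothetical helper, not a SageMaker API.

```python
# NeuronCores per instance type, matching the numerators used in the cell above.
NEURON_CORES = {
    "ml.inf2.xlarge": 2,
    "ml.inf2.8xlarge": 2,
    "ml.inf2.24xlarge": 12,
    "ml.inf2.48xlarge": 24,
}

def workers_for(instance_type: str, cores_per_model: int) -> int:
    """Number of model copies that fit when each copy uses cores_per_model cores."""
    total = NEURON_CORES[instance_type]
    if cores_per_model < 1 or cores_per_model > total:
        raise ValueError(
            f"{instance_type} has {total} NeuronCores, cannot fit a model that needs {cores_per_model}"
        )
    return total // cores_per_model

for name in NEURON_CORES:
    print(name, workers_for(name, cores_per_model=2))
```

The check matters because the model was compiled for a fixed `num_cores`: a model compiled for more cores than the instance offers cannot be loaded there.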

In [None]:
predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type=instance_types[instance_type_idx],
    model_data_download_timeout=3600, # it takes some time to download all the artifacts and load the model
    container_startup_health_check_timeout=1800
)

## 3. Run some basic tests

In [None]:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

#### Simple Inference

In [63]:
predictor.predict({"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":128}})

[{'generated_text': '\n\nDeep Learning is a subset of Machine Learning, which is a subset of Artificial Intelligence. It is a method of teaching a computer to learn and make decisions by mimicking the way the human brain processes data.\n\nDeep Learning algorithms are designed to recognize patterns in large datasets by using multiple layers of artificial neurons, which are loosely modeled after the biological neurons in the human brain.\n\nThe term "deep" in Deep Learning refers to the depth or number of layers in the neural network. The more layers a network has, the more complex the patterns it can learn.\n\nDeep Learning has shown'}]
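
The endpoint was configured with `MAX_INPUT_LENGTH` of 1024 and `MAX_TOTAL_TOKENS` equal to `seq_len` (2048), so the prompt tokens plus `max_new_tokens` have to fit in that total. A rough client-side check, assuming the prompt's token count is already known (for example from the tokenizer); `max_new_tokens_allowed` is a hypothetical helper, and the exact validation performed by the server may differ.

```python
MAX_INPUT_LENGTH = 1024   # value set in the endpoint environment
MAX_TOTAL_TOKENS = 2048   # equals seq_len used at compile time

def max_new_tokens_allowed(input_tokens: int) -> int:
    """Largest max_new_tokens that keeps prompt + generation within the total budget."""
    if input_tokens > MAX_INPUT_LENGTH:
        raise ValueError(
            f"prompt has {input_tokens} tokens, limit is {MAX_INPUT_LENGTH}"
        )
    return MAX_TOTAL_TOKENS - input_tokens

# A short prompt leaves almost the whole budget for generation.
print(max_new_tokens_allowed(6))     # 2042
# A prompt at the input limit leaves half of it.
print(max_new_tokens_allowed(1024))  # 1024
```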

#### Streaming

In [64]:
import json
import boto3

sm_client = boto3.client('sagemaker-runtime')

body = json.dumps({"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":512}, "stream": True})

resp = sm_client.invoke_endpoint_with_response_stream(
    EndpointName=predictor.endpoint_name,
    Body=body,
    ContentType='application/json',
    Accept='application/json',
)
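text = ""
for event in resp['Body']:
    chunk = event['PayloadPart']['Bytes'].decode('utf-8')
    if chunk.startswith('data:'):
        try:
            payload = json.loads(chunk[5:])
            token_text = payload['token']['text']
            text += token_text  # keep the full answer for later use
            print(token_text, end='')
        except (json.JSONDecodeError, KeyError):
            # skip chunks that do not hold exactly one complete token event
            pass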



Deep Learning is a subset of Machine Learning, which is a subset of Artificial Intelligence. It is a method of teaching a computer to learn from and make sense of large amounts of data, in a similar way to the human brain, using a layered structure with neurons that process information.

The term “deep” refers to the depth or number of layers in a neural network, which is the core of deep learning. The more layers a network has, the more complex the patterns it can learn.

Deep learning has been responsible for many recent breakthroughs in the field of AI, including image and speech recognition, natural language processing, and self-driving cars.

The key to deep learning is the ability to automatically learn hierarchical representations of data, which means the system can identify patterns and features at multiple levels of abstraction. This is in contrast to traditional machine learning methods, which often require manual feature engineering.

Deep learning algorithms are typically
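
The loop above treats every payload part as one complete `data:` event. If an event is split across two parts, or two events arrive in one part, that event is skipped. A more defensive sketch buffers the raw bytes and only parses complete lines. It assumes each event is a single `data:{json}` line followed by a newline; `iter_sse_tokens` and the simulated chunks are illustrative, not part of boto3 or TGI.

```python
import json

def iter_sse_tokens(byte_chunks):
    """Yield token texts from an iterable of raw byte chunks holding 'data:{json}' lines."""
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # Parse every complete line; keep any partial remainder for the next chunk.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.strip()
            if not line.startswith(b"data:"):
                continue  # blank separator lines and other fields are ignored
            event = json.loads(line[len(b"data:"):])
            yield event["token"]["text"]

# Simulated stream: the second event is split across two chunks.
simulated = [
    b'data:{"token": {"text": "Deep"}}\n\nda',
    b'ta:{"token": {"text": " Learn',
    b'ing"}}\n\n',
]
print("".join(iter_sse_tokens(simulated)))  # Deep Learning
```

Against the real endpoint, the generator could be fed with `(event['PayloadPart']['Bytes'] for event in resp['Body'])`. Buffering bytes rather than decoded text also avoids breaking a multi-byte UTF-8 character that straddles two chunks.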

## 4. Cleanup

In [None]:
predictor.delete_model()
predictor.delete_endpoint()