# Building your own algorithm container


In this notebook, we demonstrate how to run GPT-J with DeepSpeed locally and then deploy it in a SageMaker Inference Endpoint.

# 1. Prepare docker image

Open a terminal session in the `../accelerating_gptj_with_deepspeed/` directory and run the `build.sh` bash script. This script performs the following steps:

* Makes `serve` executable and builds the Docker image.
* Downloads GPT-J in half precision to `./run_local/test_dir/` if that directory is empty.
* Optionally runs the container for local testing.
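
The download step is conditional: the weights are fetched only when the target directory has no contents. As a rough illustration of that check, here is a minimal Python sketch. `build.sh` itself is a shell script whose exact logic is not shown here, so the function name `needs_download` is hypothetical.

```python
from pathlib import Path


def needs_download(model_dir: str) -> bool:
    """Return True when model_dir is missing or contains no entries."""
    path = Path(model_dir)
    if not path.is_dir():
        return True
    # any() stops at the first entry found, so the check stays cheap
    # even when the directory already holds many weight files.
    return not any(path.iterdir())
```

Calling `needs_download("./run_local/test_dir/")` before a long download is one way to avoid fetching the weights twice.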

First, make `build.sh` executable:

```sh
chmod +x ./build.sh
```

Then, to build the image and run the container for local testing, pass `test_local` as the second argument:

```sh
./build.sh gptj-inference-endpoint test_local
```

Or, to run without local testing, run:

```sh
./build.sh gptj-inference-endpoint
```

# 2. Local Testing

With the container running locally (started by `build.sh` with the `test_local` argument), you can test the endpoint by running the following cells:

In [19]:
import requests
import json

URL = 'http://127.0.0.1:8080/invocations'
HEADERS = {'Content-type': 'application/json', 'Accept': '*/*'}

def test_endpoint(text, parameters):
    """POST a prompt and generation parameters to the local container; return the raw response body."""
    data = {
        "inputs": {
            "text_inputs": text,
            "parameters": parameters
        }
    }

    # `json=data` serializes the payload, so no explicit json.dumps is needed
    response = requests.post(URL, json=data, headers=HEADERS)

    return response.text
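
Loading GPT-J can take a while after the container starts, so a request sent too early may fail with a connection error. Before calling `test_endpoint`, it can help to poll the container's health check. The sketch below assumes the container follows the standard SageMaker hosting contract and answers `GET /ping` with HTTP 200 once it is ready. The helper names `ping` and `wait_until_ready`, and the default retry values, are illustrative choices rather than part of this project.

```python
import time
import urllib.error
import urllib.request


def ping(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to url is answered with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error codes all count as "not ready".
        return False


def wait_until_ready(probe, attempts: int = 30, delay: float = 10.0) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For the local container, the call would look like `wait_until_ready(lambda: ping("http://127.0.0.1:8080/ping"))`.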


In [54]:
text = """This is a creative writing exercise. Below, you'll be given a prompt. Your story should be based on the prompt.

Prompt: A scary story about a haunted mouse
Story: On a dark and stormy night, the mouse crept in the shadows. """

parameters = {
    "do_sample": True,
    "temperature": 0.7,
    "max_new_tokens": 200,
    "min_tokens": 100,
    "repetition_penalty": 1.1,
    "top_p": 1.0,  # nucleus-sampling threshold; valid values lie in (0, 1]
}

response = json.loads(test_endpoint(text, parameters))
print(response['response'][0]['generated_text'])

This is a creative writing exercise. Below, you'll be given a prompt. Your story should be based on the prompt.

Prompt: A scary story about a haunted mouse
Story: On a dark and stormy night, the mouse crept in the shadows.  He watched as his food was stolen from out of his reach. His family was being taken away because they didn't have enough food to eat. The mouse decided that he would rather die than starve. So, the mouse made up his mind to jump into the trap set for him. He waited until it set, and jumped onto the trap. The trap closed and when he realized what had happened, he could not move. He screamed but no one came to help him. That morning, the mouse found himself in a small cage with two other mice. "What's going on?" Said the first mouse. "We are here because we don't have enough food. And because we can't escape." Said the second mouse. "I don't want to live in this tiny little place" said the first mouse.. "I'm scared." Said the second mouse. "Scared? What's there to be
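
As the output above shows, `generated_text` echoes the prompt before the continuation. If only the new text is wanted, the prompt can be stripped from the front. The following is a small sketch, assuming the response keeps the `response[0]["generated_text"]` shape used in the cell above. The helper name `extract_completion` is hypothetical.

```python
import json


def extract_completion(raw_response: str, prompt: str) -> str:
    """Return only the newly generated text from a raw endpoint response."""
    payload = json.loads(raw_response)
    generated = payload["response"][0]["generated_text"]
    # Drop the echoed prompt only when it really is a prefix of the output.
    if generated.startswith(prompt):
        return generated[len(prompt):]
    return generated
```

With the cell above, this would be called as `extract_completion(test_endpoint(text, parameters), text)`.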

# 3. Deployment

When you're satisfied with your container, you can rebuild it and push it to Amazon ECR using the `push_to_ecr.sh` script. The script takes two arguments: the name of your Docker image and the S3 path where you want to store the model weights.

For example, to push the image built above, named `gptj-inference-endpoint`, run the following in your terminal session:

```bash

export image=gptj-inference-endpoint
export s3Uri=s3://<your_bucket_here>/gptj-float16/model.tar.gz
chmod +x push_to_ecr.sh
./push_to_ecr.sh $image $s3Uri
```

The script first pushes your image to ECR. Note the address of the repository that the container is pushed to, because you will need it when configuring the endpoint in section 4.1. It should appear below the line `Login Succeeded` in the script's output. The script then tars the model weights you downloaded and uploads them to the `s3Uri` you specified.
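
ECR image addresses follow a fixed pattern, `<account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`, which makes it easy to cross-check the value printed by the script. The sketch below assembles such an address. The helper name `ecr_image_uri` is hypothetical, and the `latest` default tag is an assumption, since the tag used by `push_to_ecr.sh` is not shown here.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Build an ECR image URI for the standard (non-China) AWS partition."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12-digit numbers")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


if __name__ == "__main__":
    # Placeholder account ID and region, for illustration only.
    print(ecr_image_uri("123456789012", "us-east-1", "gptj-inference-endpoint"))
```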

# 4. Inference

Now, you can deploy your endpoint as follows:

### 4.1 Initialize configuration variables

In [59]:
import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import RealTimePredictor
import time 

role = sagemaker.get_execution_role()

# Specify the S3 URI of the model.tar.gz uploaded by push_to_ecr.sh
model_data = ""

# Specify the URI of the gptj-inference-endpoint image in ECR
image = ""

# Specify sagemaker model_name
sm_model_name = "gptj-completion-gpu-test"

# Specify endpoint_name
endpoint_name = "gptj-completion-gpu-test"

# Specify instance_type
instance_type = 'ml.g4dn.2xlarge'

# Specify initial_instance_count
initial_instance_count = 1
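
Because `model_data` and `image` start out as empty strings, a quick sanity check before deploying can save a failed endpoint creation. The sketch below assumes the weights are stored as a `.tar.gz` archive on S3 and the image lives in ECR, as set up in section 3. The helper name `config_problems` and the regular expression are illustrative, not part of the SageMaker SDK.

```python
import re

# Matches URIs such as 123456789012.dkr.ecr.us-east-1.amazonaws.com/repo:tag
ECR_PATTERN = re.compile(r"^\d{12}\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com(\.cn)?/\S+$")


def config_problems(model_data: str, image: str) -> list:
    """Return a list of human-readable problems; an empty list means the values look plausible."""
    problems = []
    if not (model_data.startswith("s3://") and model_data.endswith(".tar.gz")):
        problems.append("model_data should be an s3:// URI ending in .tar.gz")
    if not ECR_PATTERN.match(image):
        problems.append("image should be a full ECR image URI")
    return problems
```

Running `config_problems(model_data, image)` in the notebook and checking that it returns an empty list is one way to catch a forgotten placeholder.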


### 4.2 Initialize endpoint

In [61]:
sm_model = Model(
    model_data=model_data,
    image_uri=image,
    role=role,
    predictor_cls=RealTimePredictor,
    name=sm_model_name,
)

# Use the instance count configured in section 4.1
predictor = sm_model.deploy(
    instance_type=instance_type,
    initial_instance_count=initial_instance_count,
    endpoint_name=endpoint_name,
)
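
Endpoint creation can take several minutes. If the notebook kernel is interrupted while `deploy` is running, the endpoint status can still be checked separately through the SageMaker `describe_endpoint` API. The sketch below polls that call until the endpoint reaches a terminal state. The helper name `wait_for_endpoint` and the polling defaults are assumptions, and the client is passed in so the function does not create AWS resources itself.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name: str,
                      poll_seconds: float = 30.0, max_polls: int = 60) -> str:
    """Poll describe_endpoint until the status is InService or Failed; return the last status seen."""
    status = "Unknown"
    for _ in range(max_polls):
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status in ("InService", "Failed"):
            return status
        time.sleep(poll_seconds)
    return status
```

In the notebook this would be called as `wait_for_endpoint(boto3.client("sagemaker"), endpoint_name)`.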

### 4.3 Query model

To query your endpoint, you can use the code below. You can pass any parameters accepted by the Hugging Face `"text-generation"` pipeline.

#### Initialize the SageMaker Runtime client

In [None]:
import boto3
import json 

# Create a boto3 session
sess = boto3.Session()

# Use the session's default AWS Region
aws_region = sess.region_name


# Create a low-level client representing Amazon SageMaker Runtime
sagemaker_runtime = boto3.client("sagemaker-runtime", region_name=aws_region)

In [None]:
%%time

text = """This is a creative writing exercise. Below, you'll be given a prompt. Your story should be based on the prompt.

Prompt: A scary story about a haunted mouse
Story: On a dark and stormy night, the mouse crept in the shadows. """

parameters = {
    "do_sample": True,
    "temperature": 0.7,
    "max_new_tokens": 200,
    "min_tokens": 100,
    "repetition_penalty": 1.1,
    "top_p": 1.0,  # nucleus-sampling threshold; valid values lie in (0, 1]
}

data = {
    "inputs": {
        "text_inputs": text,
        "parameters": parameters
    }
}


body = json.dumps(data)


response = sagemaker_runtime.invoke_endpoint( 
        EndpointName=endpoint_name, 
        Body = body, 
        ContentType = 'application/json'
)

In [None]:
%%time

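# Serialize the request payload
body = json.dumps(data)

# Invoke the endpoint again; %%time reports the round-trip latency
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=body,
    ContentType="application/json",
)

# Decode the JSON response body
result = json.loads(response["Body"].read().decode("utf-8"))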
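
The request and decoding steps above can be wrapped in a single helper for repeated queries. The sketch below assumes the deployed endpoint returns the same `response[0]["generated_text"]` structure as the local container in section 2, which seems likely since it runs the same image. The function name `generate` is hypothetical.

```python
import json


def generate(runtime_client, endpoint_name: str, text: str, parameters: dict) -> str:
    """Send a prompt to the endpoint and return the generated text."""
    body = json.dumps({"inputs": {"text_inputs": text, "parameters": parameters}})
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=body,
        ContentType="application/json",
    )
    # The response body is a stream of bytes containing JSON.
    result = json.loads(response["Body"].read().decode("utf-8"))
    return result["response"][0]["generated_text"]
```

With the client created earlier, a call would look like `generate(sagemaker_runtime, endpoint_name, text, parameters)`.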