# SageMaker JumpStart - deploy text generation model

This notebook demonstrates how to use the SageMaker Python SDK to deploy a SageMaker JumpStart text generation model and invoke the endpoint.

In [1]:
from sagemaker.jumpstart.model import JumpStartModel

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


Select your desired model ID. You can search for available models in the [Built-in Algorithms with pre-trained Model Table](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html).

In [2]:
model_id = "meta-textgeneration-llama-3-1-8b-instruct"

If your selected model is gated, you will need to set `accept_eula` to True to accept the model end-user license agreement (EULA).

In [3]:
accept_eula = True

## Deploy model

Using the model ID, define your model as a JumpStart model. The cell below passes `vpc_config` to attach the model to your VPC subnets and security groups; replace the placeholder subnet and security group IDs with your own. You can also deploy the model on other instance types by passing `instance_type` to `JumpStartModel`. See [Deploy publicly available foundation models with the JumpStartModel class](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-use-python-sdk.html#jumpstart-foundation-models-use-python-sdk-model-class) for more configuration options.

In [4]:
model = JumpStartModel(
    model_id=model_id,
    vpc_config={
        "Subnets": ["subnet-"],  # replace with the subnet ID(s) to use
        "SecurityGroupIds": ["sg-"],  # replace with the security group ID(s) to use
    }
)

Using model 'meta-textgeneration-llama-3-1-8b-instruct' with wildcard version identifier '*'. You can pin to version '2.2.1' for more stable results. Note that models may have different input/output signatures after a major version upgrade.
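The log message above suggests pinning the model version. As a minimal sketch, the helper below (`build_jumpstart_kwargs` is a hypothetical name, not part of the SDK) collects the keyword arguments for `JumpStartModel` and omits any option you leave unset. The version `2.2.1` is taken from the log message; the instance type shown in the usage note is only an illustrative assumption.

```python
def build_jumpstart_kwargs(model_id, model_version=None, instance_type=None,
                           subnets=None, security_group_ids=None):
    """Collect keyword arguments for JumpStartModel, skipping unset options."""
    kwargs = {"model_id": model_id}
    if model_version is not None:
        # Pinning avoids silently picking up a new major version.
        kwargs["model_version"] = model_version
    if instance_type is not None:
        kwargs["instance_type"] = instance_type
    if subnets and security_group_ids:
        # Only add a VPC configuration when both lists are provided.
        kwargs["vpc_config"] = {
            "Subnets": list(subnets),
            "SecurityGroupIds": list(security_group_ids),
        }
    return kwargs


pinned_kwargs = build_jumpstart_kwargs(
    "meta-textgeneration-llama-3-1-8b-instruct",
    model_version="2.2.1",
)
print(pinned_kwargs)

# With the SageMaker SDK installed and AWS credentials configured:
# from sagemaker.jumpstart.model import JumpStartModel
# model = JumpStartModel(**pinned_kwargs)
```

To try another instance type, pass for example `instance_type="ml.g5.2xlarge"` (an assumed value; check which instance types the model supports in your Region).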


You can now deploy your JumpStart model. The deployment might take a few minutes.

In [5]:
predictor = model.deploy(accept_eula=accept_eula)

------------!

## Invoke endpoint

Programmatically retrieve example payloads from the `JumpStartModel` object.

In [6]:
example_payloads = model.retrieve_all_examples()

Now you can invoke the endpoint for each retrieved example payload.

In [7]:
for payload in example_payloads:
    response = predictor.predict(payload.body)
    response = response[0] if isinstance(response, list) else response
    print("Input:\n", payload.body, end="\n\n")
    print("Output:\n", response["generated_text"].strip(), end="\n\n\n")

Input:
 {'inputs': '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nwhat is the recipe of mayonnaise?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', 'parameters': {'max_new_tokens': 256, 'top_p': 0.9, 'temperature': 0.6, 'details': True}}

Output:
 The classic condiment! Mayonnaise is a thick, creamy sauce made from a mixture of oil, egg yolks, vinegar or lemon juice, and seasonings. Here's a simple recipe to make mayonnaise at home:

**Basic Mayonnaise Recipe:**

**Ingredients:**

* 2 large egg yolks
* 1 tablespoon (15 ml) lemon juice or vinegar (white wine vinegar or apple cider vinegar work well)
* 1/2 cup (120 ml) neutral-tasting oil, such as canola or grapeseed oil
* Salt, to taste
* Optional: garlic, mustard, or other flavorings (see variations below)

**Instructions:**

1. **Separate the egg yolks**: Crack the eggs into a medium-sized bowl and separate the yolks from the whites. Set the whites aside for another use (e.g., making meringues or omelets).
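The example input above wraps the user message in the Llama 3 chat special tokens. To send your own question, you can build the same prompt string yourself. The following is a minimal sketch: `build_llama3_prompt` is a hypothetical helper, and the optional system message layout is an assumption based on the same header pattern seen in the example payload.

```python
def build_llama3_prompt(user_message, system_message=None):
    """Format a single-turn prompt in the layout used by the example payload."""
    parts = ["<|begin_of_text|>"]
    if system_message is not None:
        # Assumed layout: same header/eot pattern as the user turn.
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system_message}<|eot_id|>"
        )
    parts.append(
        f"<|start_header_id|>user<|end_header_id|>\n\n{user_message}<|eot_id|>"
    )
    # Leave the assistant header open so the model generates the reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


custom_payload = {
    "inputs": build_llama3_prompt("what is the recipe of mayonnaise?"),
    "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6},
}

# With a deployed endpoint:
# response = predictor.predict(custom_payload)
```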


This model supports the following common payload parameters. You may specify any subset of these parameters when invoking an endpoint.

* **do_sample:** If True, activates logits sampling. If specified, it must be boolean.
* **max_new_tokens:** Maximum number of generated tokens. If specified, it must be a positive integer.
* **repetition_penalty:** A penalty for repetitive generated text. 1.0 means no penalty.
* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.
* **seed:** Random sampling seed.
* **temperature:** Controls the randomness of the output. A higher temperature makes low-probability words more likely to be sampled; a lower temperature favors high-probability words. As `temperature` approaches 0, sampling approaches greedy decoding. If specified, it must be a positive float.
* **top_k:** In each step of text generation, sample from only the `top_k` most likely words. If specified, it must be a positive integer.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **details:** Return generation details, including output token logprobs and IDs.
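Because the endpoint rejects malformed parameters only at invocation time, it can help to check them locally first. The sketch below encodes the constraints from the list above; `validate_parameters` is a hypothetical helper, and treating the `top_p` range as `0 < top_p <= 1` is an assumption, since the list only says "between 0 and 1".

```python
def _is_positive_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_parameters(params):
    """Return a list of problems found in a `parameters` dict (empty if none)."""
    errors = []
    for name in ("do_sample", "return_full_text"):
        if name in params and not isinstance(params[name], bool):
            errors.append(f"{name} must be boolean")
    for name in ("max_new_tokens", "top_k"):
        if name in params and not _is_positive_int(params[name]):
            errors.append(f"{name} must be a positive integer")
    if "temperature" in params:
        value = params["temperature"]
        if not _is_number(value) or value <= 0:
            errors.append("temperature must be a positive float")
    if "top_p" in params:
        value = params["top_p"]
        # Assumption: upper bound inclusive, lower bound exclusive.
        if not _is_number(value) or not (0 < value <= 1):
            errors.append("top_p must be a float between 0 and 1")
    return errors


good = {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6, "details": True}
bad = {"max_new_tokens": 0, "do_sample": "yes"}
print(validate_parameters(good))
print(validate_parameters(bad))
```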

The model also supports additional payload parameters that depend on the image used for this model. You can find the default image by inspecting `model.image_uri`. For information on additional payload parameters, view [LMI input output schema](https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/lmi_input_output_schema.html) or, for text generation inference (TGI), see the following list.
* **stop:** If specified, it must be a list of strings. Text generation stops if any one of the specified strings is generated.
* **truncate:** Truncate input tokens to the given size.
* **typical_p:** Typical decoding mass, according to [Typical Decoding for Natural Language Generation](https://arxiv.org/abs/2202.00666).
* **best_of:** Generate `best_of` sequences and return the one with the highest token logprobs.
* **watermark:** Whether to perform watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226).
* **decoder_input_details:** Return decoder input token logprobs and IDs.
* **top_n_tokens:** Return the N most likely tokens at each step.

## Clean up

In [8]:
predictor.delete_predictor()
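If an invocation cell raises an error, the clean-up cell may never run and the endpoint keeps incurring charges. One possible pattern, sketched here under assumptions, is to wrap the predictor in a context manager so that `delete_predictor()` is always called. `cleanup_on_exit` and `FakePredictor` are hypothetical names; the fake class only stands in for a real predictor so the example runs without AWS access.

```python
from contextlib import contextmanager


@contextmanager
def cleanup_on_exit(predictor):
    """Yield the predictor and delete it afterwards, even if an error occurs."""
    try:
        yield predictor
    finally:
        predictor.delete_predictor()


class FakePredictor:
    """Hypothetical stand-in exposing the two methods used in this notebook."""

    def __init__(self):
        self.deleted = False

    def predict(self, payload):
        return {"generated_text": "ok"}

    def delete_predictor(self):
        self.deleted = True


fake = FakePredictor()
with cleanup_on_exit(fake) as p:
    result = p.predict({"inputs": "hello"})
print(result["generated_text"], fake.deleted)
```

With a real endpoint, you would pass the `predictor` returned by `model.deploy(...)` instead of `FakePredictor()`.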