# Text Generation: Running inference on the endpoint you have created

In [2]:
import json
import boto3

### Query the endpoint you have created

In [15]:
newline, bold, unbold = "\n", "\033[1m", "\033[0m"
endpoint_name = "jumpstart-dft-hf-llm-mistral-7b"


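def query_endpoint(payload):
    """Send a JSON payload to the endpoint, print the result, and return the parsed predictions."""
    client = boto3.client("runtime.sagemaker")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload).encode("utf-8")
    )
    # The response body is a JSON list of the form [{"generated_text": "..."}]
    model_predictions = json.loads(response["Body"].read())
    generated_text = model_predictions[0]["generated_text"]
    print(f"Input Text: {payload['inputs']}{newline}" f"Generated Text: {bold}{generated_text}{unbold}{newline}")
    # Return the parsed predictions so callers can inspect them (e.g. `out = query_endpoint(payload)`)
    return model_predictions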
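The code above assumes the response body is JSON: a list whose first element carries a `generated_text` field. The sketch below pulls that parsing step into a small helper so it can be checked without calling AWS. The name `extract_generated_text` is hypothetical (it is not part of boto3 or SageMaker), and it assumes the same `[{"generated_text": ...}]` shape used above, raising a clear error if the body looks different.

```python
import json


def extract_generated_text(body: bytes) -> str:
    """Parse a response body of the form [{"generated_text": "..."}].

    Hypothetical helper: it mirrors the parsing done in query_endpoint and
    assumes the same response shape; it is not part of boto3 or SageMaker.
    """
    predictions = json.loads(body)  # json.loads accepts bytes in Python 3.6+
    if not isinstance(predictions, list) or not predictions:
        raise ValueError(f"Unexpected response shape: {predictions!r}")
    first = predictions[0]
    if not isinstance(first, dict) or "generated_text" not in first:
        raise ValueError(f"No 'generated_text' field in: {first!r}")
    return first["generated_text"]


# Usage with a hand-written body (not a real endpoint response):
sample_body = json.dumps([{"generated_text": " Paris."}]).encode("utf-8")
print(extract_generated_text(sample_body))
```

Keeping the parsing separate from the network call makes it easy to adapt if you point the notebook at a container whose response format differs.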


### Supported parameters

***
This model supports the following parameters at inference time:

* **max_length:** The model generates text until the output length (including the input context length) reaches `max_length`. If specified, it must be a positive integer.
* **max_new_tokens:** The model generates text until the output length (excluding the input context length) reaches `max_new_tokens`. If specified, it must be a positive integer.
* **num_beams:** Number of beams used in beam search. If specified, it must be an integer greater than or equal to `num_return_sequences`.
* **no_repeat_ngram_size:** The model ensures that no sequence of `no_repeat_ngram_size` tokens is repeated in the output. If specified, it must be a positive integer greater than 1.
* **temperature:** Controls the randomness of the output. A higher temperature makes low-probability tokens more likely to be chosen, and a lower temperature favors high-probability tokens. As `temperature` approaches 0, decoding becomes greedy. If specified, it must be a positive float.
* **early_stopping:** If True, text generation finishes when all beam hypotheses reach the end-of-sequence token. If specified, it must be boolean.
* **do_sample:** If True, the next token is sampled from the model's predicted probability distribution instead of being chosen greedily. If specified, it must be boolean.
* **top_k:** At each step of text generation, sample only from the `top_k` most likely tokens. If specified, it must be a positive integer.
* **top_p:** At each step of text generation, sample from the smallest set of tokens whose cumulative probability reaches `top_p`. If specified, it must be a float between 0 and 1.
* **return_full_text:** If True, the input text is included at the start of the generated text. If specified, it must be boolean. Defaults to False.
* **stop:** If specified, it must be a list of strings. Text generation stops as soon as any one of the specified strings is generated.
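To make `temperature`, `top_k`, and `top_p` concrete, here is a minimal pure-Python sketch of how these filters are commonly applied to one step's next-token scores before sampling. It uses a toy dictionary of made-up logits (not real model output), and the order shown (temperature, then top-k, then top-p) is an assumption that matches common implementations; the server behind this endpoint may differ in details. The function name `filter_next_token_distribution` is hypothetical.

```python
import math
import random


def filter_next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return a renormalized {token: probability} dict after temperature, top-k and top-p."""
    # Temperature: divide logits, then softmax (subtracting the max for numerical stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    max_logit = max(scaled.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = sorted(((tok, e / total) for tok, e in exps.items()), key=lambda kv: kv[1], reverse=True)

    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        probs = probs[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in probs:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        probs = kept

    # Renormalize the surviving tokens so their probabilities sum to 1.
    remaining = sum(p for _, p in probs)
    return {tok: p / remaining for tok, p in probs}


# Toy logits for illustration only.
toy_logits = {"cat": 2.0, "dog": 1.5, "car": 0.5, "tree": -1.0}
dist = filter_next_token_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.9)
next_token = random.choices(list(dist), weights=list(dist.values()))[0]
print(dist, next_token)
```

Lowering `temperature` sharpens the distribution before the cut-offs are applied, so a low temperature combined with a small `top_p` leaves very few candidates.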

You can specify any subset of these parameters when invoking the endpoint by adding them to the `parameters` field of the payload. The following cells show examples.
***
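The constraints in the parameter list can also be checked on the client before a request is sent. The sketch below is a hypothetical helper, `validate_parameters`, that encodes only the rules stated in that list (positive integers, boolean flags, `top_p` between 0 and 1, treated as inclusive here by assumption, `num_beams` at least `num_return_sequences`, `stop` as a list of strings). It is not part of the SageMaker SDK, and the endpoint remains the final authority on what it accepts.

```python
def validate_parameters(params):
    """Return a list of human-readable problems; an empty list means no listed rule was violated."""
    problems = []

    def is_int(value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)

    def is_number(value):
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    for name in ("max_length", "max_new_tokens", "top_k", "num_beams"):
        if name in params and not (is_int(params[name]) and params[name] > 0):
            problems.append(f"{name} must be a positive integer")

    if "no_repeat_ngram_size" in params:
        value = params["no_repeat_ngram_size"]
        if not (is_int(value) and value > 1):
            problems.append("no_repeat_ngram_size must be an integer greater than 1")

    if "temperature" in params:
        value = params["temperature"]
        if not (is_number(value) and value > 0):
            problems.append("temperature must be a positive float")

    if "top_p" in params:
        value = params["top_p"]
        # Assumption: both endpoints 0 and 1 are accepted.
        if not (is_number(value) and 0 <= value <= 1):
            problems.append("top_p must be a float between 0 and 1")

    for name in ("early_stopping", "do_sample", "return_full_text"):
        if name in params and not isinstance(params[name], bool):
            problems.append(f"{name} must be a boolean")

    if "stop" in params:
        value = params["stop"]
        if not (isinstance(value, list) and all(isinstance(s, str) for s in value)):
            problems.append("stop must be a list of strings")

    beams, sequences = params.get("num_beams"), params.get("num_return_sequences")
    if is_int(beams) and is_int(sequences) and beams < sequences:
        problems.append("num_beams must be >= num_return_sequences")

    return problems


print(validate_parameters({"max_new_tokens": 200, "top_p": 0.9, "stop": ["Daniel:"]}))  # []
print(validate_parameters({"max_new_tokens": 0, "do_sample": "yes"}))
```

Returning a list of messages rather than raising on the first error lets you see every problem in a payload at once.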


In [19]:
# Factual question
payload = {
    "inputs": "Who is the president of the united states",
    "parameters": {
        "max_new_tokens": 200,
    },
}
out = query_endpoint(payload)

Input Text: Who is the president of the united states
Generated Text: ?

The President of the United States is the head of state and head of government of the United States of America. The president directs the executive branch of the federal government and is the commander-in-chief of the United States Armed Forces.

Who is the president of the United States of America?

The President of the United States is the head of state and head of government of the United States of America. The president directs the executive branch of the federal government and is the commander-in-chief of the United States Armed Forces.

Who is the president of the United States of America 2020?

The President of the United States is the head of state and head of government of the United States of America. The president directs the executive branch of the federal government and is the commander-in-chief of the United States Armed Forces.

Who is the president of the United States of America




In [None]:
payload = {
    "inputs": "Building a website can be done in 10 simple steps:",
    "parameters": {
        "max_new_tokens": 110,
        "no_repeat_ngram_size": 3,
    },
}
query_endpoint(payload)
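The previous cell sets `no_repeat_ngram_size` to 3. The sketch below shows the usual idea behind this constraint: before choosing the next token, collect every token that would complete an n-gram already present in the output and forbid it. It works on a plain list of strings standing in for token IDs; the name `banned_next_tokens` is hypothetical, and the server behind the endpoint applies its own implementation.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size < 1:
        raise ValueError("ngram_size must be at least 1")
    # The candidate n-gram starts with the last (ngram_size - 1) generated tokens.
    prefix = tuple(tokens[len(tokens) - ngram_size + 1:])
    banned = set()
    # Scan every complete n-gram generated so far; if its first (ngram_size - 1)
    # tokens equal the prefix, its last token would create a repeat.
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = ["the", "cat", "sat", "on", "the", "cat"]
print(banned_next_tokens(generated, 3))  # {'sat'}
```

During decoding, the banned tokens would have their probability set to zero before sampling or beam selection, so "the cat sat" cannot appear twice.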

In [None]:
# Translation
payload = {
    "inputs": """\
Translate English to French:

sea otter => loutre de mer

peppermint => menthe poivrée

plush girafe => girafe peluche

cheese =>""",
    "parameters": {
        "max_new_tokens": 3,
    },
}
query_endpoint(payload)
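The translation prompt is a few-shot prompt: a task description, a handful of `source => target` examples separated by blank lines, and a final unfinished line for the model to complete. A small hypothetical helper, `build_translation_prompt`, can assemble the same string from a list of pairs, which makes it easier to swap examples in and out; it reproduces the literal prompt above.

```python
def build_translation_prompt(instruction, examples, query):
    """Assemble a few-shot prompt in the 'source => target' layout used above."""
    blocks = [instruction]
    blocks += [f"{source} => {target}" for source, target in examples]
    # The final line is left open so the model completes the translation.
    blocks.append(f"{query} =>")
    return "\n\n".join(blocks)


examples = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush girafe", "girafe peluche"),
]
prompt = build_translation_prompt("Translate English to French:", examples, "cheese")
payload = {"inputs": prompt, "parameters": {"max_new_tokens": 3}}
# query_endpoint(payload)  # uncomment to send it to the endpoint defined earlier
```

The small `max_new_tokens` value keeps the model from continuing past the single missing translation.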

In [None]:
# Sentiment-analysis
payload = {
    "inputs": """\
Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been :+1:"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredible"
Sentiment:""",
    "parameters": {
        "max_new_tokens": 2,
    },
}
query_endpoint(payload)
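With `max_new_tokens` set to 2, the endpoint returns only a short continuation, which may include leading whitespace or a trailing newline. A hypothetical post-processing helper, `parse_sentiment`, could map that raw `generated_text` string to one of the three labels used in the prompt. The exact form of the model's output is not shown here, so the sketch assumes only that the label, if present, is the first word, and returns `None` otherwise.

```python
LABELS = ("Positive", "Negative", "Neutral")


def parse_sentiment(generated_text):
    """Map the model's short continuation to a label from the prompt, or None."""
    words = generated_text.strip().split()
    if not words:
        return None
    # Ignore trailing punctuation and compare case-insensitively via capitalize().
    candidate = words[0].strip(".,!?:;").capitalize()
    return candidate if candidate in LABELS else None


print(parse_sentiment(" Positive\n"))  # Positive
```

Returning `None` for anything unexpected makes it easy to count how often the model strays from the label set.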

In [None]:
# Question answering
payload = {
    "inputs": "Could you remind me when was the C programming language invented?",
    "parameters": {
        "max_new_tokens": 50,
    },
}
query_endpoint(payload)

In [None]:
# Recipe generation
payload = {
    "inputs": "What is the recipe for a delicious lemon cheesecake?",
    "parameters": {
        "max_new_tokens": 200,
    }
}
query_endpoint(payload)

## Impact of Inference parameters

### Stopping Criteria
---
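The `stop` parameter ends text generation as soon as the model generates one of the specified strings, for example to end a dialogue turn before the model starts writing the next `Daniel:` line.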

---
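Conceptually, a stop sequence cuts the output at the first place where any of the listed strings appears. The sketch below applies the same rule on the client side to an already generated string, which can be handy with a model or container that does not support `stop`. The helper name `truncate_at_stop` is hypothetical, the sample text is hand-written rather than real model output, and because it is not stated here whether the server keeps or drops the stop string itself, the sketch exposes that choice as an option.

```python
def truncate_at_stop(text, stop_sequences, include_stop=False):
    """Cut text at the earliest occurrence of any stop sequence."""
    earliest = None  # (index, stop_string) of the first match found so far
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1 and (earliest is None or index < earliest[0]):
            earliest = (index, stop)
    if earliest is None:
        return text
    index, stop = earliest
    return text[: index + len(stop)] if include_stop else text[:index]


# Hand-written sample, not real endpoint output.
sample = " Hello, Daniel! Giraffes are wonderful.\nDaniel: Why?\nGirafatron: Because"
print(repr(truncate_at_stop(sample, ["Daniel:"])))
```

Note that `"Daniel!"` does not match the stop string `"Daniel:"`, so only the start of the next dialogue turn triggers the cut.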

In [None]:
payload = {
    "inputs": "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    "parameters": {
        "max_new_tokens": 50,
        "return_full_text": False,
        "do_sample": True,
        "top_k": 10,
    },
}
print("Text generation without using stopping criteria:")
query_endpoint(payload)

In [None]:
payload = {
    "inputs": "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    "parameters": {
        "max_new_tokens": 50,
        "return_full_text": False,
        "do_sample": True,
        "top_k": 10,
        "stop": ["Daniel:"],
    },
}
print("Text generation with stopping criteria:")
query_endpoint(payload)