In [16]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "True"

## **DeepSparse MPT-Instruct Example**

Install DeepSparse from the nightly build, along with `gdown` and `jsonlines`, which the RAG example uses.

In [None]:
!pip install deepsparse-nightly[llm] gdown jsonlines

### **Download and Compile Model**

The following cell downloads a pre-sparsified MPT-Instruct model from our SparseZoo and compiles it.

> Note: Compilation takes a minute or two even if the model is already downloaded, so reuse a pipeline as much as possible once it's been set up.

In [6]:
from deepsparse import TextGeneration

pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

Downloading (…)okenizer_config.json: 100%|██████████| 237/237 [00:00<00:00, 194kB/s]
Downloading (…)yment/tokenizer.json: 100%|██████████| 2.02M/2.02M [00:00<00:00, 28.8MB/s]
Downloading (…)ployment/config.json: 100%|██████████| 1.41k/1.41k [00:00<00:00, 1.27MB/s]
Downloading (…)eployment/model.onnx: 100%|██████████| 1.26M/1.26M [00:00<00:00, 19.8MB/s]
Downloading (…)eployment/model.data: 100%|██████████| 6.39G/6.39G [03:52<00:00, 29.5MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 131/131 [00:00<00:00, 121kB/s]
Downloading (…)quantized/model.onnx: 100%|██████████| 1.26M/1.26M [00:00<00:00, 21.4MB/s]
Overwriting the current location of the File: /home/robertgshaw/.cache/sparsezoo/neuralmagic/mpt-7b-dolly_mpt_pretrain-pruned50_quantized/deployment.tar.gz/deployment/tokenizer_config.json with the new location: /home/robertgshaw/.cache/sparsezoo/neuralmagic/mpt-7b-dolly_mpt_pretrain-pruned50_quantized/deployment/tokenizer_config.json.
Downloading (…)okenizer_config.json: 100%|
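
Since compilation is the slow step, one way to follow the note above is to keep the pipeline behind a small caching helper so it is built only once per process. The sketch below is illustrative and not part of DeepSparse: `get_pipeline`, `fake_factory`, and the `zoo:example-stub` string are hypothetical names, and the stand-in factory is there only so the snippet runs without downloading a model.

```python
from typing import Any, Callable, Dict

# Hypothetical helper (not a DeepSparse API): keeps one pipeline per model stub
# so the expensive download/compile step runs only once per process.
_PIPELINES: Dict[str, Any] = {}


def get_pipeline(model: str, factory: Callable[..., Any]) -> Any:
    """Return the cached pipeline for `model`, building it with `factory` on first use."""
    if model not in _PIPELINES:
        _PIPELINES[model] = factory(model=model)
    return _PIPELINES[model]


# Stand-in factory (assumption, for illustration only): records each build so
# we can see that the second lookup does not trigger another one.
build_calls = []


def fake_factory(model: str) -> dict:
    build_calls.append(model)
    return {"model": model}


first = get_pipeline("zoo:example-stub", fake_factory)
second = get_pipeline("zoo:example-stub", fake_factory)
print(first is second, len(build_calls))
```

In this notebook, the equivalent call would be `get_pipeline("zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized", TextGeneration)`, which forwards to the same `TextGeneration(model=...)` constructor used above.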

### **Generate Text**

We can now call the pipeline to generate a response to a prompt.

In [11]:
prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: How best should I travel from London to Edinburgh, UK ? ### Response:"

output = pipeline(prompt, max_new_tokens=100)
print(output.generations[0].text)

 There are a number of ways to travel from London to Edinburgh. You can take a direct flight from London to Edinburgh, which will take around 1 hour and 20 minutes. Alternatively, you can take the Eurostar train from London to Edinburgh, which will take around 2 hours and 30 minutes. You can also drive from London to Edinburgh, which will take around 8 hours.


To stream the response as it is generated, pass `streaming=True` and iterate over the result.

In [14]:
output_iterator = pipeline(prompt=prompt, streaming=True, max_new_tokens=100)

print(prompt, end="\n\n")
for it in output_iterator:
    print(it.generations[0].text, end="")

Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: How best should I travel from London to Edinburgh, UK ? ### Response:

 There are a number of ways to travel from London to Edinburgh. You can take a direct flight from London to Edinburgh, which will take around 1 hour and 20 minutes. Alternatively, you can take the Eurostar train from London to Edinburgh, which will take around 2 hours and 30 minutes. You can also drive from London to Edinburgh, which will take around 8 hours.<|endoftext|>
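
The streamed output above ends with the `<|endoftext|>` marker. If you want the full response as a single clean string rather than printing each chunk, a helper along these lines may be useful. It is a minimal sketch: `collect_stream` is a hypothetical name, it assumes each streamed item exposes `.generations[0].text` as in the loop above, and the fake chunks below stand in for the real iterator so the snippet runs without a compiled pipeline.

```python
from types import SimpleNamespace
from typing import Iterable

EOS_TOKEN = "<|endoftext|>"  # end-of-text marker seen in the streamed output above


def collect_stream(chunks: Iterable, eos_token: str = EOS_TOKEN) -> str:
    """Join streamed chunks into one string and cut it at the end-of-text marker."""
    text = "".join(chunk.generations[0].text for chunk in chunks)
    # Keep only what precedes the first end-of-text marker, if one is present.
    return text.split(eos_token, 1)[0].strip()


# Fake chunks (assumption): mimic the shape of the items yielded when streaming.
fake_chunks = [
    SimpleNamespace(generations=[SimpleNamespace(text=t)])
    for t in [" There", " are", " ways", ".", "<|endoftext|>"]
]

response = collect_stream(fake_chunks)
print(response)
```

With the real pipeline, you would pass the iterator returned by `pipeline(prompt=prompt, streaming=True, max_new_tokens=100)` in place of `fake_chunks`.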

See [our documentation](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md) for details on how to customize generation strategies.

### **RAG Example**

This example builds a retrieval-augmented generation (RAG) prompt: we download an evaluation dataset in which each question comes with retrieved passages, then pass the top passages to the model as context.

In [20]:
!gdown 1TLKhWjez63H4uBtgCxyoyJsZi-IMgnDb --fuzzy

Downloading...
From (uriginal): https://drive.google.com/uc?id=1TLKhWjez63H4uBtgCxyoyJsZi-IMgnDb
From (redirected): https://drive.google.com/uc?id=1TLKhWjez63H4uBtgCxyoyJsZi-IMgnDb&confirm=t&uuid=464329cb-b24d-48c2-a9af-a52e5b69102a
To: /home/robertgshaw/mpt-example/eval_data.zip
100%|█████████████████████████████████████████| 157M/157M [00:01<00:00, 110MB/s]


In [22]:
!unzip eval_data.zip

Archive:  eval_data.zip
   creating: eval_data/
  inflating: eval_data/asqa_eval_gtr_top100.json  
  inflating: eval_data/factscore_unlabeled_alpaca_13b_retrieval.jsonl  
  inflating: eval_data/popqa_longtail_w_gs.jsonl  
  inflating: eval_data/triviaqa_test_w_gs.jsonl  
  inflating: eval_data/arc_challenge_processed.jsonl  
  inflating: eval_data/health_claims_processed.jsonl  
  inflating: eval_data/popqa_longtail.jsonl  
  inflating: eval_data/triviaqa_test.jsonl  


In [30]:
import jsonlines

def load_jsonlines(file):
    """Read a JSON Lines file and return its records as a list."""
    with jsonlines.open(file, "r") as jsonl_f:
        return list(jsonl_f)

input_data = load_jsonlines("eval_data/triviaqa_test_w_gs.jsonl")

In [73]:
index = 51
datum = input_data[index]

print(f'question={datum["question"]}\n')

# Show only the first three retrieved passages.
for i, ctx in enumerate(datum["ctxs"][:3]):
    print(f'i={i} title={ctx["title"]} \ntext={ctx["text"]}')
    print("\n\n")

question=Which English rowing event is held every year on the River Thames for 5 days (Wednesday to Sunday) over the first weekend in July?

i=0 title=River Thames 
text=the stretch of river from Chiswick to Putney. Two rowing events on the River Thames are traditionally part of the wider English sporting calendar: The University Boat Race (between Oxford and Cambridge) takes place in late March or early April, on the Championship Course from Putney to Mortlake in the west of London. Henley Royal Regatta takes place over five days at the start of July in the upstream town of Henley-on-Thames. Besides its sporting significance the regatta is an important date on the English social calendar alongside events like Royal Ascot and Wimbledon. Other significant or historic rowing events



i=1 title=University rowing (UK) 
text=competition into 2 separate days, with Beginners racing over a shorter course on one day, and Seniors racing on the longer course on the other. However, due to incleme

In [67]:
def format_context(ctx):
    return f'{ctx["title"]}: {ctx["text"]}'

def format_instruction(input_datum):
    question = input_datum["question"]
    ctxs = input_datum["ctxs"]

    EXAMPLE="""{q}

CONTEXT: {ctx0}

CONTEXT: {ctx1}

CONTEXT: {ctx2}

    """.format(
        q=question,
        ctx0=format_context(ctxs[0]),
        ctx1=format_context(ctxs[1]),
        ctx2=format_context(ctxs[2]),
    )

    return EXAMPLE


INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
PROMPT_FOR_GENERATION_FORMAT = """{intro}
{instruction_key}
{instruction}
{response_key}
""".format(
    intro=INTRO_BLURB,
    instruction_key=INSTRUCTION_KEY,
    instruction="{instruction}",
    response_key=RESPONSE_KEY,
)

def format_prompt(input_datum):
    return PROMPT_FOR_GENERATION_FORMAT.format(instruction=format_instruction(input_datum))
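
`format_instruction` above hard-codes exactly three contexts, so it would raise an `IndexError` for a record with fewer than three retrieved passages. A variant that takes the top `k` passages, however many are available, might look like the sketch below. `format_instruction_top_k` and the sample record are hypothetical. The layout matches the template above apart from trailing whitespace.

```python
def format_context(ctx: dict) -> str:
    """Render one retrieved passage as 'title: text' (same as the helper above)."""
    return f'{ctx["title"]}: {ctx["text"]}'


def format_instruction_top_k(input_datum: dict, k: int = 3) -> str:
    """Build the question followed by up to `k` CONTEXT blocks."""
    # Slicing never raises, so records with fewer than k passages still work.
    contexts = input_datum["ctxs"][:k]
    blocks = [input_datum["question"]]
    blocks += [f"CONTEXT: {format_context(ctx)}" for ctx in contexts]
    return "\n\n".join(blocks) + "\n\n"


# Hypothetical record with only two passages, using the same keys as the dataset.
sample = {
    "question": "Which rowing event is held at Henley-on-Thames?",
    "ctxs": [
        {"title": "River Thames", "text": "Henley Royal Regatta takes place in July."},
        {"title": "University rowing (UK)", "text": "Beginners race over a shorter course."},
    ],
}

instruction = format_instruction_top_k(sample, k=3)
print(instruction)
```

To use it in the prompt, you could call `PROMPT_FOR_GENERATION_FORMAT.format(instruction=format_instruction_top_k(datum))` in place of `format_prompt(datum)`.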


In [70]:
prompt = format_prompt(datum)
print(prompt)

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Which English rowing event is held every year on the River Thames for 5 days (Wednesday to Sunday) over the first weekend in July?

CONTEXT: River Thames: the stretch of river from Chiswick to Putney. Two rowing events on the River Thames are traditionally part of the wider English sporting calendar: The University Boat Race (between Oxford and Cambridge) takes place in late March or early April, on the Championship Course from Putney to Mortlake in the west of London. Henley Royal Regatta takes place over five days at the start of July in the upstream town of Henley-on-Thames. Besides its sporting significance the regatta is an important date on the English social calendar alongside events like Royal Ascot and Wimbledon. Other significant or historic rowing events

CONTEXT: University rowing (UK): competition into 2 separate days, with Beginners racing over a shor

In [72]:
output = pipeline(prompt, max_new_tokens=100)
print(output.generations[0].text)

The regatta that is held on the River Thames in England over the first weekend of July is the Henley Royal Regatta.

