# How to Generate Text on CPUs Using Different Decoding Strategies for Language Models With DeepSparse

This notebook walks through different strategies for generating text using DeepSparse on CPUs. Read the accompanying blog post on the [Neural Magic website](https://neuralmagic.com/blog/). 

In [None]:
%pip install "deepsparse-nightly[llm]" langchain sentence-transformers chromadb datasets

In [None]:
from deepsparse import TextGeneration

MODEL_PATH = "hf:neuralmagic/mpt-7b-chat-pruned50-quant"

text_pipeline = TextGeneration(model_path=MODEL_PATH, sequence_length=2048)

Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

2023-11-01 14:32:48 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20231031 COMMUNITY | (1af7b0be) (release) (optimized) (system=avx2, binary=avx2)


In [None]:
from langchain.llms import DeepSparse
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader, DirectoryLoader

In [None]:
DATA_PATH = "docs"

loader = DirectoryLoader(DATA_PATH, glob="*.txt", loader_cls=TextLoader)
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

texts = text_splitter.split_documents(documents)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2", model_kwargs={"device": "cpu"}
)

docsearch = Chroma.from_documents(texts, embeddings)
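
The `chunk_size=1000` and `chunk_overlap=100` arguments above can be pictured as a sliding window over the text. The sketch below is a simplification, not the library's implementation: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, which the hypothetical `chunk_with_overlap` helper ignores. It only shows how consecutive chunks end up sharing characters.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Placeholder text standing in for a loaded document.
document = "".join(str(i % 10) for i in range(2500))
chunks = chunk_with_overlap(document, chunk_size=1000, chunk_overlap=100)
print([len(chunk) for chunk in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, which tends to help retrieval.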

## Temperature

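Temperature controls how random sampling is. The logits are divided by the temperature before they are turned into the probability distribution that tokens are sampled from, so higher values produce more random samples and lower values make the output more predictable. The value should be greater than 0.0.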
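
To make the effect concrete, here is a minimal sketch of temperature-scaled sampling in plain Python. It is an illustration of the general technique, not DeepSparse's internal code; the function names and the four example logits are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities after dividing by the temperature."""
    if temperature <= 0.0:
        raise ValueError("temperature must be greater than 0.0")
    scaled = [value / temperature for value in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_token(logits, temperature, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical logits for a four-token vocabulary.
example_logits = [2.0, 1.0, 0.5, 0.1]
for t in (0.1, 0.7, 1.0):
    rounded = [round(p, 3) for p in softmax_with_temperature(example_logits, t)]
    print(t, rounded)
```

At low temperatures almost all of the probability mass sits on the highest-scoring token, so sampling behaves much like always picking the top token. As the temperature rises, the distribution flattens and less likely tokens are drawn more often.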

### Summarization

Of the temperatures tried below, 0.1 produced the best summary.

In [None]:
generation_config = {"max_new_tokens": 300}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

The difficulty in pruning models without accuracy loss
The difficulty in handling non-differential quantization
The researchers’ solution was to use distillation loss to achieve high sparsity levels.


In [None]:
generation_config = {"temperature": 0.1, "do_sample": True, "max_new_tokens": 300}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

The difficulty in pruning models without accuracy loss
The difficulty in handling non-differential quantizations
The researchers’ solution was to use distillation loss instead of loss-based methods. They also pruned MPT with 75% sparsity without accuracy loss, showing performance that is on par with quantization approaches


In [None]:
generation_config = {"temperature": 0.2, "do_sample": True, "max_new_tokens": 300}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

The difficulty in handling errors introduced by pruning during training


In [None]:
generation_config = {"temperature": 0.3, "do_sample": True, "max_new_tokens": 300}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

The difficulty in pruning larger models without sacrificing accuracy or performance


In [None]:
generation_config = {"temperature": 0.7, "do_sample": True, "max_new_tokens": 300}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

The difficulty in pruning techniques or quantizing weights while preserving performance


In [None]:
generation_config = {"temperature": 0.4, "do_sample": True, "max_new_tokens": 300}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

The difficulty in pruning and reweighing models


In [None]:
generation_config = {"temperature": 0.5, "do_sample": True, "max_new_tokens": 300}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

To achieve state-of Capabilities with fewer parameters than GPT2


In [None]:
generation_config = {"temperature": 0.6, "do_sample": True, "max_new_tokens": 300}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

Tackling both tasks simultaneously posed challenges


In [None]:
generation_config = {"temperature": 0.7, "do_sample": True, "max_new_tokens": 300}

result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

Achieving higher quality gains vs lower quality settings


In [None]:
generation_config = {"temperature": 0.9, "do_sample": True, "max_new_tokens": 300}

result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

And finally achieving both higher sparsities and improvements in accuracy


In [None]:
generation_config = {"temperature": 0.8, "do_sample": True, "max_new_tokens": 300}

result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

The computational complexity associated with handling very dense representations when doing inference
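
The cells above repeat the same call with only the temperature changed. One way to express that sweep more compactly is a small helper that loops over the temperatures. This is a sketch: `sweep_temperatures` and `fake_pipeline` are hypothetical names, and the stand-in pipeline exists only so the example runs without loading a model. In the notebook you would pass `text_pipeline` instead.

```python
from types import SimpleNamespace


def sweep_temperatures(pipeline, prompt, temperatures, max_new_tokens=300):
    """Run the same prompt once per temperature and collect the generated text."""
    outputs = {}
    for temperature in temperatures:
        generation_config = {
            "temperature": temperature,
            "do_sample": True,
            "max_new_tokens": max_new_tokens,
        }
        result = pipeline(prompt=prompt, generation_config=generation_config)
        outputs[temperature] = result.generations[0].text
    return outputs


def fake_pipeline(prompt, generation_config):
    """Stand-in for text_pipeline; it mimics only the result shape used above."""
    text = f"temperature={generation_config['temperature']}"
    return SimpleNamespace(generations=[SimpleNamespace(text=text)])


summaries = sweep_temperatures(fake_pipeline, "Write a concise summary...", [0.1, 0.5, 0.9])
for temperature, text in summaries.items():
    print(temperature, text)
```

Because sampling is random, repeated runs at the same temperature can give different text, so comparing several runs per temperature is a fairer test than a single one.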


### Creative Writing

With the default settings (`do_sample` is not set and the default temperature is 1.0), the output repeats the phrase `As the character`.

In [None]:
generation_config = {"max_new_tokens": 300}

result = text_pipeline(
    prompt="A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?",
    generation_config=generation_config,
)
print(result.generations[0].text)


A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?
The package arrived in the mail one morning, addressed to the character in an unfamiliar handwriting. The package was wrapped in black tape and sealed with a strange symbol etched into the lid. As the character opened the package, they found themselves staring at a strange crystal, unlike anything they’d ever seen before.
The crystal was translucent, and as the character held it in their hand, they could see the faint outlines of whatever was inside. The crystal seemed to pulse and glow, and the character felt a strange sensation in their chest.
As the character continued to examine the crystal, they realized that it was a portal to another dimension. The crystal was a gateway to a realm of infinite possibilities, and the character was suddenly filled with excitement and wonder.
The character spent the next few days exploring the new realm, discoverin
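
Repetition like this can also be checked programmatically instead of by eye. The sketch below counts word sequences that occur more than once. `repeated_phrases` is a hypothetical helper, and the sample text is a made-up excerpt in the style of the output above, not the model's exact output.

```python
from collections import Counter


def repeated_phrases(text, n=3):
    """Return the n-word phrases that occur more than once, with their counts."""
    words = text.lower().split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    return {phrase: count for phrase, count in counts.items() if count > 1}


# Made-up excerpt in the style of the generated story.
sample = (
    "As the character opened the package, they found a crystal. "
    "As the character held it in their hand, it began to glow. "
    "As the character continued to examine it, they realized it was a portal."
)
print(repeated_phrases(sample))
```

In the notebook, the same helper could be applied to `result.generations[0].text` to compare repetition across temperatures.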

At `temperature=0.6`, the model doesn't write the story. Instead, it offers ideas about how the story could go.

In [None]:
generation_config = {"temperature": 0.6, "do_sample": True, "max_new_tokens": 300}
result = text_pipeline(
    prompt="A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?",
    generation_config=generation_config,
)
print(result.generations[0].text)


The answer to this question depends on the individual and their experiences. However, one possible answer could be that the object is a powerful artifact that grants the character immense power and abilities that they never had before. This would drastically change their life, granting them new opportunities and abilities to achieve their goals. On the other hand, if the object is something negative like a curse or curse-like effect, it could have disastrous consequences for the character’s life. Overall, without more information about the specific character and situation, there are many possibilities for how this object could change their life.


At a temperature of 0.8, the model writes a compelling story and doesn't repeat itself.

In [None]:
generation_config = {"temperature": 0.8, "do_sample": True, "max_new_tokens": 300}
result = text_pipeline(
    prompt="A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?",
    generation_config=generation_config,
)
print(result.generations[0].text)


The mysterious object is an ancient talisman that has the power to grant wishes. The character spends time pondering over what they would wish for, eventually settling on the desire for world peace. They hold a ceremony where they pour a glass of milk into a bowl, symbolizing their wish for peace in our world. After this ceremony, things seem to change in the world: conflicts seem to be resolved and misunderstand at workplaces was considerably reduced due and people seemed happier and more content with themselves as if these positive changes were brought about by magic.
While this seems like a happy ending, it seems that there is still much work to be done in achieving real peace on earth. The talisman only grants wishes but does not solve underlying issues causing conflict or unhappiness. However, as the character realizes and acknowledges this fact while also understanding that there are still challenges ahead of them, they feel motivated to continue working towards resolving world 

### RAG

In [None]:
llm = DeepSparse(model=MODEL_PATH)

chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=docsearch.as_retriever(),
)
res = chain({"query": "What are some Dutch breakfast options?"})

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 134432.82it/s]
2023-10-26 06:06:58 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
2023-10-26 06:06:58 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx
2023-10-26 06:07:14 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx


In [None]:
answer = res["result"]
source_documents = res["source_documents"]

In [None]:
answer

' Beschuit, pannenkoeken, and ontbijtkoeken.'
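
The `source_documents` list returned by the chain is stored above but not inspected. A minimal sketch of how the retrieved chunks could be listed, assuming each item exposes `page_content` and `metadata` the way a LangChain `Document` does. The helper `describe_sources`, the `StandInDocument` class, and the example text are hypothetical and only there so the snippet runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class StandInDocument:
    """Hypothetical stand-in with the two attributes used from a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def describe_sources(documents, width=80):
    """Return one line per retrieved chunk: its source path and a short snippet."""
    lines = []
    for i, doc in enumerate(documents, start=1):
        source = doc.metadata.get("source", "unknown")
        # Collapse newlines and repeated spaces, then truncate to `width` characters.
        snippet = " ".join(doc.page_content.split())[:width]
        lines.append(f"[{i}] {source}: {snippet}")
    return lines


# Made-up example data; in the notebook you would pass `source_documents` instead.
example = [
    StandInDocument(
        "Beschuit is a Dutch crisp bake.\nIt is eaten at breakfast.",
        {"source": "docs/example.txt"},
    )
]
for line in describe_sources(example):
    print(line)
```

In the notebook, `describe_sources(source_documents)` would show which chunks from `docs` were passed to the model for this answer.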

In [None]:
generation_config = {"temperature": 0.1, "do_sample": True, "max_new_tokens": 500}
model_config = {"sequence_length": 2048}
llm = DeepSparse(
    model=MODEL_PATH, model_config=model_config, generation_config=generation_config
)
chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=docsearch.as_retriever(),
)
res = chain({"query": "What are some Dutch breakfast options?"})
answer = res["result"]
print(answer)

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 134432.82it/s]
2023-10-26 06:07:40 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
2023-10-26 06:07:40 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx
2023-10-26 06:07:56 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx


 Beschuit (Dutch crisp bakes) is also eaten as a breakfast food, with the same variety of sweet topp in the Netherlands is to serve strawberries on beschuit


In [None]:
generation_config = {"temperature": 0.8, "do_sample": True, "max_new_tokens": 500}
model_config = {"sequence_length": 2048}
llm = DeepSparse(
    model=MODEL_PATH, model_config=model_config, generation_config=generation_config
)
chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=docsearch.as_retriever(),
)
res = chain({"query": "What are some Dutch breakfast options?"})
answer = res["result"]
print(answer)

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 126716.13it/s]
2023-10-26 06:08:23 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
2023-10-26 06:08:23 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx
2023-10-26 06:08:35 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx


 Beschuit; pancake; toast; French toast; scrambled eggs; yogurt; fruit; muesli


### Language Translation

With `"temperature": 0.5` the model gives a good translation, but it translates `wild animals` as `rescued animals` and the output repeats itself.

In [None]:
generation_config = {"temperature": 0.5, "do_sample": True, "max_new_tokens": 500}
result = text_pipeline(
    prompt="Translate the following sentence to French `Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.`",
    generation_config=generation_config,
)
print(result.generations[0].text)



`Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.`

`Il est bon de sortir et jouer au football parce qu’il est jour de soleil. Après cela, il est possible de visiter le parc national pour une balade dans la nature où il est possible de rencontrer certes animaux sauvés.`

`It is good to go out and play football because it is sunny. After that, you can consider visiting the national park for a nature walk while seeing some wild animals.`

`Il est bon de sortir et jouer au football parce qu’il est jour de soleil. Après cela, il est possible de visiter le parc national pour une balade dans la nature où il est possible de rencontrer certes animaux sauvés.`

`Today is a good day to go out and play football because it is sunny. After that, you can consider visiting the national park for a nature walk while seeing some wild animals.


With `"temperature": 0.1` the output is the same as at 0.5, including the repetition.


In [None]:
generation_config = {"temperature": 0.1, "do_sample": True, "max_new_tokens": 500}
result = text_pipeline(
    prompt="Translate the following sentence to French `Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.`",
    generation_config=generation_config,
)
print(result.generations[0].text)



`Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.`

`Il est bon de sortir et jouer au football parce qu’il est jour de soleil. Après cela, il est possible de visiter le parc national pour une balade dans la nature où il est possible de rencontrer certes animaux sauvés.`

`It is good to go out and play football because it is sunny. After that, you can consider visiting the national park for a nature walk while seeing some wild animals.`

`Il est bon de sortir et jouer au football parce qu’il est jour de soleil. Après cela, il est possible de visiter le parc national pour une balade dans la nature où il est possible de rencontrer certes animaux sauvés.`

`Today is a good day to go out and play football because it is sunny. After that, you can consider visiting the national park for a nature walk while seeing some wild animals.


The default temperature setting of `1.0` also translates wild animals as rescued animals.

In [None]:
result = text_pipeline(
    prompt="Translate the following sentence to French `Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.`"
)
print(result.generations[0].text)



`Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.`

`Il est bon de sortir et jouer au football parce qu’il est jour de soleil. Après cela, il est possible de visiter le parc national pour une balade dans la nature où il est possible de rencontrer certes animaux sauvés.`

`It is good to go out and play football because it is sunny. After that, you can consider visiting the national park for a nature walk while seeing some wild animals.`

`Il est bon de sortir et jouer au football parce qu’il est jour de soleil. Après cela, il est possible de visiter le parc national pour une balade dans la nature où il est possible de rencontrer certes animaux sauvés.`

`Today is a good day to go out and play football because it is sunny. After that, you can consider visiting the national park for a nature walk while seeing some wild animals.


At temperatures of 0.9 and 0.8 the model doesn't produce good translations.


In [None]:
generation_config = {"temperature": 0.9, "do_sample": True, "max_new_tokens": 500}

result = text_pipeline(
    prompt="Translate the following sentence to French `Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.`",
    generation_config=generation_config,
)
print(result.generations[0].text)


The translation of the given sentence in French is `Il est bonne journée pour se déplaacer et jouer au football car cela fait bonjour. Après cela peut-être envisager d'aller dans le parc nation de spectateurs quel sont environnats`


In [None]:
generation_config = {"temperature": 0.8, "do_sample": True, "max_new_tokens": 500}
result = text_pipeline(
    prompt="Translate the following sentence to French `Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.`",
    generation_config=generation_config,
)
print(result.generations[0].text)



`This translation is not entirely accurate as it uses "le joueur" instead of "je joue" and does not include the last part of the sentence.


At 0.7 the model translates wild animals as rescued animals.

In [None]:
generation_config = {"temperature": 0.7, "do_sample": True, "max_new_tokens": 500}
result = text_pipeline(
    prompt="Translate the following sentence to French `Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.`",
    generation_config=generation_config,
)
print(result.generations[0].text)



The sentence translates to `Il est bon de sortir et jouer au football ce jour-la parce qu'il est bellement. Ensuite, il est possible de visiter le parc nationaume pour une randonnée en nature où voyager des animaux sauvés.`


At 0.6 the model doesn't translate the sentence.

In [None]:
generation_config = {"temperature": 0.6, "do_sample": True, "max_new_tokens": 500}
result = text_pipeline(
    prompt="Translate the following sentence to French `Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.`",
    generation_config=generation_config,
)
print(result.generations[0].text)



The sentence in English reads: "Today is a good day to go out and play football because it is sunny."
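
The runs above vary only `temperature`. As a rough illustration of why lower values give more deterministic output, the sketch below divides some made-up logits by the temperature before applying softmax. This is the standard formulation of temperature sampling, not DeepSparse's internal code; the function name `softmax_with_temperature` and the logit values are assumptions for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0.0:
        raise ValueError("temperature must be greater than 0.0")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens.
logits = [2.0, 1.0, 0.1]
for temperature in (0.1, 0.5, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate the probability on the highest-scoring token, while higher temperatures spread it across more candidates, which makes sampled text more varied.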


## top_k

The number of highest-probability vocabulary tokens to keep for top-k filtering.
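
A minimal sketch of top-k filtering, assuming the common convention that `top_k=0` disables the filter. The helper names `top_k_filter` and `sample_top_k` and the logits are hypothetical; this illustrates the idea and is not DeepSparse's implementation.

```python
import math
import random


def top_k_filter(logits, k):
    """Keep the k highest logits and set the rest to -inf. k <= 0 disables filtering."""
    if k <= 0 or k >= len(logits):
        return list(logits)
    # The k-th largest value is the cutoff; ties at the cutoff are all kept.
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


def sample_top_k(logits, k, rng=None):
    """Sample a token index from the softmax over the top-k filtered logits."""
    rng = rng or random.Random()
    filtered = top_k_filter(logits, k)
    peak = max(filtered)
    # exp(-inf) is 0.0, so filtered-out tokens can never be drawn.
    weights = [math.exp(x - peak) for x in filtered]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# Made-up scores for four candidate tokens.
logits = [1.0, 3.0, 2.0, 0.5]
print(top_k_filter(logits, 2))
print(sample_top_k(logits, 2, random.Random(0)))
```

With a small `k`, only a few candidates can ever be sampled; with a large `k`, low-probability tokens stay in the pool.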

### Summarization

`top_k=0` and `top_k=50` give similar results

In [None]:
generation_config = {"max_new_tokens": 300}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

The difficulty in pruning models without accuracy loss
The difficulty in handling non-differential quantization
The researchers’ solution was to use distillation loss to achieve high sparsity levels.


Summary with `top_k=50` is quite concise

In [None]:
generation_config = {"top_k": 50, "max_new_tokens": 300}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

The difficulty in pruning models without accuracy loss
The difficulty in handling non-differential quantization
The researchers’ solution was to use distillation loss to achieve high sparsity levels.
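
The summarization cells in this section repeat the same prompt and change only `generation_config`. A hedged sketch of how the comparison could be run in one loop: the helper `compare_generation_configs` is hypothetical, and `stand_in_pipeline` is a stub that only mimics the `result.generations[0].text` shape used in this notebook so the snippet runs without a model.

```python
from types import SimpleNamespace


def compare_generation_configs(pipeline, prompt, configs):
    """Run one prompt under several generation configs and collect the generated text."""
    outputs = {}
    for label, generation_config in configs.items():
        result = pipeline(prompt=prompt, generation_config=generation_config)
        outputs[label] = result.generations[0].text
    return outputs


def stand_in_pipeline(prompt, generation_config):
    """Hypothetical stub: echoes the top_k value instead of generating text."""
    text = f"top_k={generation_config.get('top_k', 0)}"
    return SimpleNamespace(generations=[SimpleNamespace(text=text)])


configs = {
    "top_k=10": {"top_k": 10, "max_new_tokens": 300, "do_sample": True},
    "top_k=50": {"top_k": 50, "max_new_tokens": 300, "do_sample": True},
}
outputs = compare_generation_configs(
    stand_in_pipeline, "Write a concise summary of the following: ...", configs
)
for label, text in outputs.items():
    print(label, "->", text)
```

In the notebook, passing `text_pipeline` in place of `stand_in_pipeline` would print the outputs for each setting together.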


The summary with `top_k=10` is too short.

In [None]:
generation_config = {"top_k": 10, "max_new_tokens": 300, "do_sample": True}

result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

Both tasks involve compressing parameters without overfitting and achieving stateofferency performance in both cases


In [None]:
generation_config = {"top_k": 20, "max_new_tokens": 300, "do_sample": True}

result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

Tricks like unrolling or folding don in some cases do not work effectively for all architectures


In [None]:
generation_config = {"top_k": 40, "max_new_tokens": 300, "do_sample": True}

result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

In contrast


Summary at `top_k=60` is not clear.

In [None]:
generation_config = {"top_k": 60, "max_new_tokens": 300, "do_sample": True}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

Difficulty in achieving competitive precision metrics because it tends towards 2 or 3 bits per parameter due FOSCO (Fosco Lab) Despite reducing precision up until now at 100%, FSPC still shows higher performance when optimizing offline and online quantizers (up until 2 iterations), including quantized layer drops (only 1 bit). A new record achieved 6 iterations thanks


The summary at `top_k=70` is poor. Summaries at high `top_k` values are poor.

In [None]:
generation_config = {"top_k": 70, "max_new_tokens": 300, "do_sample": True}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

Smaller and cheaper hardware can result in performance drops due not accounting for underlying architecture differences between hardware types (such “green” ARM processors vs “gray” Intel CPUs) 


 As I mention


In [None]:
generation_config = {"top_k": 80, "max_new_tokens": 300, "do_sample": True}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

To achieve state


In [None]:
generation_config = {"top_k": 90, "max_new_tokens": 300, "do_sample": True}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

The need
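
The cells above repeat the same call and change only `top_k`. A small helper could run such a sweep in one loop. This is a sketch under assumptions: `sweep_generation_config` is a hypothetical helper, and it assumes the pipeline accepts `prompt=` and `generation_config=` and returns an object exposing `.generations[0].text`, as `text_pipeline` does in the cells above.

```python
def sweep_generation_config(pipeline, prompt, param, values, **fixed):
    """Run `pipeline` once per value of `param` and collect the generated text.

    `fixed` holds generation settings shared by every run.
    """
    outputs = {}
    for value in values:
        generation_config = {param: value, **fixed}
        result = pipeline(prompt=prompt, generation_config=generation_config)
        outputs[value] = result.generations[0].text
    return outputs


# Possible usage with the pipeline defined at the top of this notebook:
# texts = sweep_generation_config(
#     text_pipeline, prompt, "top_k", [70, 80, 90],
#     max_new_tokens=300, do_sample=True,
# )
```

Because sampling is random, repeated sweeps will generally not reproduce the exact outputs shown here.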


### Creative Writing

With the default `top_k=0`, the story has repetitions, for example:
`As the character continued to explore the new realm...`

In [None]:
generation_config = {"max_new_tokens": 300}
result = text_pipeline(
    prompt="A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?",
    generation_config=generation_config,
)
print(result.generations[0].text)


A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?
The package arrived in the mail one morning, addressed to the character in an unfamiliar handwriting. The package was wrapped in black tape and sealed with a strange symbol etched into the lid. As the character opened the package, they found themselves staring at a strange crystal, unlike anything they’d ever seen before.
The crystal was translucent, and as the character held it in their hand, they could see the faint outlines of whatever was inside. The crystal seemed to pulse and glow, and the character felt a strange sensation in their chest.
As the character continued to examine the crystal, they realized that it was a portal to another dimension. The crystal was a gateway to a realm of infinite possibilities, and the character was suddenly filled with excitement and wonder.
The character spent the next few days exploring the new realm, discoverin

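The story with `top_k=50` seems okay, but there is some repetition, which we will address later. Note that `do_sample` is not set in this cell and the output is identical to the `top_k=0` run above, which suggests that `top_k` only takes effect when sampling is enabled.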


In [None]:
generation_config = {"top_k": 50, "max_new_tokens": 300}
result = text_pipeline(
    prompt="A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?",
    generation_config=generation_config,
)
print(result.generations[0].text)


A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?
The package arrived in the mail one morning, addressed to the character in an unfamiliar handwriting. The package was wrapped in black tape and sealed with a strange symbol etched into the lid. As the character opened the package, they found themselves staring at a strange crystal, unlike anything they’d ever seen before.
The crystal was translucent, and as the character held it in their hand, they could see the faint outlines of whatever was inside. The crystal seemed to pulse and glow, and the character felt a strange sensation in their chest.
As the character continued to examine the crystal, they realized that it was a portal to another dimension. The crystal was a gateway to a realm of infinite possibilities, and the character was suddenly filled with excitement and wonder.
The character spent the next few days exploring the new realm, discoverin

`top_k=80` has no repetition

In [None]:
generation_config = {"top_k": 80, "max_new_tokens": 300, "do_sample": True}

result = text_pipeline(
    prompt="A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?",
    generation_config=generation_config,
)
print(result.generations[0].text)


An object that has been lost for centuries is found by a person who hasn’t seen the object since they were a child. What is the object and how does it change their life?
A person discovers an abandoned house, locked door, or buried unknown personal belongings amidst its walls. How does this event change their life?
A young woman suddenly inherits her father’s old watch. What does this gift mean to her? Did it have sentimental or symbolic value? Did she wear the watch as often as her other jewels and accessories?. Were there any emotional repercussions from inheriting this particular item?.
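
The repetition noted in the stories above can be made measurable by counting repeated word n-grams. This is a rough sketch, not part of DeepSparse; `repeated_ngrams` is a hypothetical helper, and `sample` is a shortened toy paraphrase of the story rather than the real output.

```python
from collections import Counter


def repeated_ngrams(text, n=4, min_count=2):
    """Return word n-grams that occur at least `min_count` times in `text`."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy text for illustration only.
sample = (
    "As the character opened the package, they found a crystal. "
    "As the character held it, the crystal seemed to pulse. "
    "As the character continued to examine the crystal, they felt wonder."
)
print(repeated_ngrams(sample, n=3))
```

Passing `result.generations[0].text` instead of `sample` gives a quick way to compare runs at different settings.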


### RAG

The results with the default `top_k` of 0:

In [None]:
model_config = {"sequence_length": 2048}
llm = DeepSparse(
    model=MODEL_PATH,
    model_config=model_config,
)
chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=docsearch.as_retriever(),
)
res = chain({"query": "What are some Dutch breakfast options?"})
answer = res["result"]
print(answer)

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 126334.46it/s]
2023-10-26 06:18:23 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
2023-10-26 06:18:23 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx
2023-10-26 06:18:35 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx


 Beschuit, pannenkoeken, and ontbijtkoeken.


The results with `top_k=50` are the same as those with the default value of `top_k`. The word `ontbijtkoeken` doesn't appear in the given documents.
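
A check like "this word doesn't appear in the given documents" can be automated with a simple word comparison between the answer and the retrieved sources. The sketch below is only a rough proxy for grounding; `ungrounded_terms` is a hypothetical helper and the toy strings are not the notebook's documents.

```python
import re


def _words(text):
    """Lowercased set of word tokens in `text`."""
    return set(re.findall(r"\w+", text.lower()))


def ungrounded_terms(answer, source_texts):
    """Return words from `answer` that appear in none of `source_texts`."""
    source_vocab = set()
    for text in source_texts:
        source_vocab |= _words(text)
    return sorted(_words(answer) - source_vocab)


# Toy strings for illustration only.
toy_sources = ["Beschuit is a rusk and pannenkoeken are Dutch pancakes."]
toy_answer = "Beschuit, pannenkoeken, and ontbijtkoeken."
print(ungrounded_terms(toy_answer, toy_sources))
```

Since the chain is built with `return_source_documents=True`, the same check could be run as `ungrounded_terms(res["result"], [doc.page_content for doc in res["source_documents"]])`.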

In [None]:
generation_config = {"top_k": 50, "max_new_tokens": 500}
model_config = {"sequence_length": 2048}
llm = DeepSparse(
    model=MODEL_PATH, model_config=model_config, generation_config=generation_config
)
chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=docsearch.as_retriever(),
)
res = chain({"query": "What are some Dutch breakfast options?"})
answer = res["result"]
print(answer)

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 129453.83it/s]
2023-10-26 06:19:00 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
2023-10-26 06:19:00 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx
2023-10-26 06:19:12 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx


 Beschuit, pannenkoeken, and ontbijtkoeken.


At `top_k=60` the model seems to try to respond in Dutch, but some of the words don't make sense even in Dutch. Here is the English translation:

```
Typically Dutch breakerwoensesum consists of bredseed and large prawn anald Drenthe turned flower bulbs without summer zinnede garlic and cabbage point wind
```

In [None]:
generation_config = {"top_k": 60, "do_sample": True, "max_new_tokens": 500}
model_config = {"sequence_length": 2048}
llm = DeepSparse(
    model=MODEL_PATH, model_config=model_config, generation_config=generation_config
)
chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=docsearch.as_retriever(),
)
res = chain({"query": "What are some Dutch breakfast options?"})
answer = res["result"]
print(answer)

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 125203.10it/s]
2023-10-26 06:19:39 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
2023-10-26 06:19:39 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx
2023-10-26 06:19:50 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx


 Typically dutch brekerwoensetsum consists og bredsedie en grote smaardanaald drentse omgekekte bloembolten zonder zomerzinnede knoflook en koolepuntvoorwind


At `top_k=70` the output doesn't make sense even when translated to English:

```
Bread that's baked like toast; bagels à la Paris; rondvooraants/frites; olives; fruits like kiataloosen; chocolate goods end and speculation large gelgere! Spare: breed for gulaai tasks! Licking/short cum teenage man/strengths will be bigger with help! Thank you
```

In [None]:
generation_config = {"top_k": 70, "do_sample": True, "max_new_tokens": 500}
model_config = {"sequence_length": 2048}
llm = DeepSparse(
    model=MODEL_PATH, model_config=model_config, generation_config=generation_config
)
chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=docsearch.as_retriever(),
)
res = chain({"query": "What are some Dutch breakfast options?"})
answer = res["result"]
print(answer)

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 89240.51it/s]
2023-10-26 06:20:20 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
2023-10-26 06:20:20 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx
2023-10-26 06:20:31 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx


 Bread that's baked like toast; bagels à la Paris; rondvooraants/frites; olives; fruits like kiatalozen; chocoladewaren eind en speculatiegroot gelgere! Ontziet: bred voor van gulaaitaaken! Oploplikking/kort kom tienerman/strengents will groter zijn worden wordt met behulp! Dankelijk


At `top_k=90` the model also mentions items that don't appear in the given text, such as corn flakes and cereal.

In [None]:
generation_config = {"top_k": 90, "do_sample": True, "max_new_tokens": 500}
model_config = {"sequence_length": 2048}
llm = DeepSparse(
    model=MODEL_PATH, model_config=model_config, generation_config=generation_config
)
chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=docsearch.as_retriever(),
)
res = chain({"query": "What are some Dutch breakfast options?"})
answer = res["result"]
print(answer)

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 152520.15it/s]
2023-10-26 06:21:05 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
2023-10-26 06:21:05 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx
2023-10-26 06:21:17 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx


 Typically yogurt; fruits like bananas; cereal; corn flakes / rayu


At `top_k=99` we don't really get the breakfast options.

In [None]:
generation_config = {"top_k": 99, "do_sample": True, "max_new_tokens": 500}
model_config = {"sequence_length": 2048}
llm = DeepSparse(
    model=MODEL_PATH, model_config=model_config, generation_config=generation_config
)
chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=docsearch.as_retriever(),
)
res = chain({"query": "What are some Dutch breakfast options?"})
answer = res["result"]
print(answer)

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 133576.56it/s]
2023-10-26 06:21:43 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
2023-10-26 06:21:43 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx
2023-10-26 06:21:55 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx


 Many different ways exist; besides traditional eating habits one could consider having cake ("boosterkoken")


### Language Translation

At `top_k=0`, the French output translates back to English as:
```
`It's good to go out and play football because it's a sunny day. After that, it is possible to visit the national park for a nature walk where it is possible to meet some rescued animals.`
```

In [None]:
generation_config = {"max_new_tokens": 300}

result = text_pipeline(
    prompt="Translate the following sentence to French `Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.`",
    generation_config=generation_config,
)
print(result.generations[0].text)



`Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.`

`Il est bon de sortir et jouer au football parce qu’il est jour de soleil. Après cela, il est possible de visiter le parc national pour une balade dans la nature où il est possible de rencontrer certes animaux sauvés.`

`It is good to go out and play football because it is sunny. After that, you can consider visiting the national park for a nature walk while seeing some wild animals.`

`Il est bon de sortir et jouer au football parce qu’il est jour de soleil. Après cela, il est possible de visiter le parc national pour une balade dans la nature où il est possible de rencontrer certes animaux sauvés.`

`Today is a good day to go out and play football because it is sunny. After that, you can consider visiting the national park for a nature walk while seeing some wild animals.


At `top_k=50`, the French output is unchanged and translates back to English as:

```
`It's good to go out and play football because it's a sunny day. After that, it is possible to visit the national park for a nature walk where it is possible to meet some rescued animals.`
```



In [None]:
generation_config = {"top_k": 50, "max_new_tokens": 300}

result = text_pipeline(
    prompt="Translate the following sentence to French `Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.`",
    generation_config=generation_config,
)
print(result.generations[0].text)



`Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.`

`Il est bon de sortir et jouer au football parce qu’il est jour de soleil. Après cela, il est possible de visiter le parc national pour une balade dans la nature où il est possible de rencontrer certes animaux sauvés.`

`It is good to go out and play football because it is sunny. After that, you can consider visiting the national park for a nature walk while seeing some wild animals.`

`Il est bon de sortir et jouer au football parce qu’il est jour de soleil. Après cela, il est possible de visiter le parc national pour une balade dans la nature où il est possible de rencontrer certes animaux sauvés.`

`Today is a good day to go out and play football because it is sunny. After that, you can consider visiting the national park for a nature walk while seeing some wild animals.


## top_p

Keep the smallest set of most probable tokens whose cumulative probability is >= `top_p`, and sample only from that set.
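
The cumulative-probability rule can be sketched in a few lines of plain Python. This is an illustration of nucleus (top-p) filtering in general, not DeepSparse's implementation; `top_p_probs` and the toy logits are hypothetical.

```python
import math


def top_p_probs(logits, top_p):
    """Keep the most probable tokens until their cumulative probability
    reaches `top_p`, then renormalize over the kept tokens."""
    peak = max(logits)
    exp = [math.exp(x - peak) for x in logits]
    total = sum(exp)
    probs = [e / total for e in exp]

    # Walk token ids from most to least probable, accumulating probability mass.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


toy_logits = [4.0, 3.0, 1.0, 0.5, -2.0]  # made-up scores for five tokens
nucleus_70 = top_p_probs(toy_logits, top_p=0.70)  # two tokens survive
nucleus_95 = top_p_probs(toy_logits, top_p=0.95)  # three tokens survive
```

Unlike `top_k`, the number of surviving tokens is not fixed: a peaked distribution keeps few tokens and a flat one keeps many.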

### Summarization

`top_p: 1.0`

In [None]:
generation_config = {"top_p": 1.0, "max_new_tokens": 300}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

The difficulty in pruning models without accuracy loss
The difficulty in handling non-differential quantization
The researchers’ solution was to use distillation loss to achieve high sparsity levels.


`top_p: 0.90`

In [None]:
generation_config = {"top_p": 0.90, "max_new_tokens": 300, "do_sample": True}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

Researchers propose two solutions: 
1) A dynamic loss function that adapts sparsity levels


`top_p: 0.80`

In [None]:
generation_config = {"top_p": 0.80, "max_new_tokens": 300, "do_sample": True}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

To improve upon previous works in achieving state-of representations for machine translation


`top_p: 0.70`

In [None]:
generation_config = {"top_p": 0.70, "max_new_tokens": 300, "do_sample": True}
result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

The difficulty in handling variable sparsity levels during training and inference
To achieve high sparsity levels during


### Creative Writing

The story with the default value of `top_p`, i.e. 0, has no repetitions and is compelling, and it gives the character a name.

At `top_p: 1.0` the model doesn't repeat itself, but instead of a story it gives some ideas on how the story could play out.

In [None]:
generation_config = {"top_p": 1.0, "do_sample": True, "max_new_tokens": 500}

result = text_pipeline(
    prompt="A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?",
    generation_config=generation_config,
)
print(result.generations[0].text)


Our protagonist is an elderly woman named Edna, a widowed and reclusive retiree who lives alone in her small apartment. She has always been interested in mysteries, but generally hasn’t put much effort into pursuing them due (she believes) to her age and lack of interest. However, one day a mysterious package arrives at her doorstep that she otherwise wouldn't have thought to seek out. Despite initial apprehension, she becomes curious about the item and decides to pursue answers regarding its origin and nature by using available resources online. Over time Edna learns the significance of this object: it's a mysterious amulet that grants good luck for those who hold onto it amidst various personal trials on life. As she starts wearing the amulet herself when dealing with stressful situations (including helping others as well that request the amulet), Edna's outlook on life improves significantly; becoming happier in both her personal endeavors as well as interacting with people around 

In [None]:
generation_config = {"top_p": 0.92, "do_sample": True, "max_new_tokens": 500}

result = text_pipeline(
    prompt="A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?",
    generation_config=generation_config,
)
print(result.generations[0].text)


A young child has a nightmare that foreshadows something terrible happening in their real life. How


`top_p: 0.80`

In [None]:
generation_config = {"top_p": 0.80, "do_sample": True, "max_new_tokens": 500}

result = text_pipeline(
    prompt="A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?",
    generation_config=generation_config,
)
print(result.generations[0].text)


The object is a “chocolate egg”, which gives the character superpowers, allowing them to


`top_p: 0.60`

In [None]:
generation_config = {"top_p": 0.60, "do_sample": True, "max_new_tokens": 500}

result = text_pipeline(
    prompt="A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?",
    generation_config=generation_config,
)
print(result.generations[0].text)


The object is a magical amulet, which grants the character immense power and ability. They


`top_p: 0.50`

In [None]:
generation_config = {"top_p": 0.50, "do_sample": True, "max_new_tokens": 500}

result = text_pipeline(
    prompt="A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?",
    generation_config=generation_config,
)
print(result.generations[0].text)


The object is a key. The key unlocks a door that leads to a hidden room.


`top_p: 0.30`

In [None]:
generation_config = {"top_p": 0.30, "do_sample": True, "max_new_tokens": 500}

result = text_pipeline(
    prompt="A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?",
    generation_config=generation_config,
)
print(result.generations[0].text)


A young woman discovers that she has the power to control gravity. She uses this power to help
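
The cells above differ only in the value of `top_p`. A minimal sketch of a helper that runs the same sweep in a loop follows. The names `sweep_top_p`, `generate`, and `fake_generate` are hypothetical and not part of DeepSparse; `generate` is assumed to be any callable that takes a prompt and a generation config and returns the generated text.

```python
from typing import Callable, Dict, Iterable


def sweep_top_p(
    generate: Callable[[str, dict], str],
    prompt: str,
    values: Iterable[float],
    max_new_tokens: int = 500,
) -> Dict[float, str]:
    """Run the same prompt once per top_p value and collect the outputs."""
    outputs = {}
    for top_p in values:
        # Same config shape as the cells above; only top_p changes.
        config = {"top_p": top_p, "do_sample": True, "max_new_tokens": max_new_tokens}
        outputs[top_p] = generate(prompt, config)
    return outputs


def fake_generate(prompt: str, config: dict) -> str:
    # Stand-in for a real model call so the sketch runs without DeepSparse.
    return f"top_p={config['top_p']}: {prompt[:20]}"


demo = sweep_top_p(fake_generate, "A character receives a package", [1.0, 0.8, 0.5])
for top_p, text in demo.items():
    print(top_p, "->", text)
```

With the pipeline from this notebook, `generate` could be `lambda prompt, config: text_pipeline(prompt=prompt, generation_config=config).generations[0].text`.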


### RAG

`top_p: 1.0`

In [None]:
generation_config = {
    "top_p": 1.0,
    "max_new_tokens": 500,
}
model_config = {"sequence_length": 2048}
llm = DeepSparse(
    model=MODEL_PATH, model_config=model_config, generation_config=generation_config
)
chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=docsearch.as_retriever(),
)
res = chain({"query": "What are some Dutch breakfast options?"})
answer = res["result"]
print(answer)

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 125955.08it/s]
2023-10-26 06:24:04 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
2023-10-26 06:24:04 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx
2023-10-26 06:24:16 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx


 Beschuit, pannenkoeken, and ontbijtkoeken.


`top_p: 0.92`

In [None]:
generation_config = {
    "top_p": 0.92,
    "max_new_tokens": 500,
    "do_sample": True,
}
model_config = {"sequence_length": 2048}
llm = DeepSparse(
    model=MODEL_PATH, model_config=model_config, generation_config=generation_config
)
chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=docsearch.as_retriever(),
)
res = chain({"query": "What are some Dutch breakfast options?"})
answer = res["result"]
print(answer)

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 73584.28it/s]
2023-10-31 08:31:38 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.


 Beschuit (crispbakes) can be topped with various fruits or whipped cream for dessert


`top_p: 0.80`

In [None]:
generation_config = {
    "top_p": 0.80,
    "max_new_tokens": 500,
    "do_sample": True,
}
model_config = {"sequence_length": 2048}
llm = DeepSparse(
    model=MODEL_PATH, model_config=model_config, generation_config=generation_config
)
chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=docsearch.as_retriever(),
)
res = chain({"query": "What are some Dutch breakfast options?"})
answer = res["result"]
print(answer)

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 90394.48it/s]
2023-10-31 08:32:28 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.


 Beschuit (Dutch crisp bakes) are typically eaten for breakfast in the Netherlands. Pann


`top_p: 0.70`

In [None]:
generation_config = {
    "top_p": 0.70,
    "max_new_tokens": 500,
    "do_sample": True,
}
model_config = {"sequence_length": 2048}
llm = DeepSparse(
    model=MODEL_PATH, model_config=model_config, generation_config=generation_config
)
chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=docsearch.as_retriever(),
)
res = chain({"query": "What are some Dutch breakfast options?"})
answer = res["result"]
print(answer)

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 137970.53it/s]
2023-10-31 08:33:10 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.


 Beschuit (a savory cake) or pancake (pannenkoeken) with


`top_p: 0.50`

In [None]:
generation_config = {
    "top_p": 0.50,
    "max_new_tokens": 500,
    "do_sample": True,
}
model_config = {"sequence_length": 2048}
llm = DeepSparse(
    model=MODEL_PATH, model_config=model_config, generation_config=generation_config
)
chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=docsearch.as_retriever(),
)
res = chain({"query": "What are some Dutch breakfast options?"})
answer = res["result"]
print(answer)

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 127486.44it/s]
2023-10-31 08:33:48 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.


 Beschuit (Dutch crisp bakes) is also eaten as a breakfast food, with the same


### Language Translation

Translation with `top_p=0`

```
It's good to go out and play football because it's sunny. After that, it is possible to visit the national park for nature hiking and seeing wild animals.
```

In [None]:
generation_config = {"max_new_tokens": 300}
result = text_pipeline(
    prompt="Translate the following sentence to French 'Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.'",
    generation_config=generation_config,
)
print(result.generations[0].text)



English: Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.

French: Il est bon de sortir et jouer au football parce qu'il est soleil. Après cela, il est possible de visiter le parc national pour une randonnée en nature et voir des animaux sauvages.

Translation: It is good to go out and play football because it is sunny. After that, it is possible to visit the national park for a nature walk while seeing some wild animals.


`top_p: 1.0`


In [None]:
generation_config = {
    "top_p": 1.0,
    "max_new_tokens": 500,
}
result = text_pipeline(
    prompt="Translate the following sentence to French 'Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.'",
    generation_config=generation_config,
)
print(result.generations[0].text)



English: Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.

French: Il est bon de sortir et jouer au football parce qu'il est soleil. Après cela, il est possible de visiter le parc national pour une randonnée en nature et voir des animaux sauvages.

Translation: It is good to go out and play football because it is sunny. After that, it is possible to visit the national park for a nature walk while seeing some wild animals.


`top_p: 0.92`

In [None]:
generation_config = {
    "top_p": 0.92,
    "max_new_tokens": 500,
    "do_sample": True,
}
result = text_pipeline(
    prompt="Translate the following sentence to French 'Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.'",
    generation_config=generation_config,
)
print(result.generations[0].text)



Translation: Il est bon de sortir et jouer au football aujourd'hui


`top_p: 0.80`

In [None]:
generation_config = {
    "top_p": 0.80,
    "max_new_tokens": 500,
    "do_sample": True,
}
result = text_pipeline(
    prompt="Translate the following sentence to French 'Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.'",
    generation_config=generation_config,
)
print(result.generations[0].text)



'Today is a good day to go out and play football because it is sunny. After


`top_p: 0.70`

In [None]:
generation_config = {
    "top_p": 0.70,
    "max_new_tokens": 500,
    "do_sample": True,
}
result = text_pipeline(
    prompt="Translate the following sentence to French 'Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.'",
    generation_config=generation_config,
)
print(result.generations[0].text)



Translated to French: 'Aujourd'hui est un bon journée pour


`top_p: 0.50`

In [None]:
generation_config = {
    "top_p": 0.50,
    "max_new_tokens": 500,
    "do_sample": True,
}
result = text_pipeline(
    prompt="Translate the following sentence to French 'Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.'",
    generation_config=generation_config,
)
print(result.generations[0].text)



English: Today is a good day to go out and play football because it is sunny.


## Repetition Penalty

The penalty applied to tokens that have already been generated. The frequencies of existing tokens are summed and subtracted from the corresponding logit values.
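
To make the idea concrete, here is a minimal sketch of one common way to penalize repetition: the multiplicative formulation used, for example, by the `RepetitionPenaltyLogitsProcessor` in Hugging Face Transformers, where a penalty of 1.0 means no penalty. This is an illustration of the general idea and is not necessarily how DeepSparse computes it internally; the description above suggests a subtractive, frequency-based variant. The function name `apply_repetition_penalty` is hypothetical.

```python
from typing import List, Sequence


def apply_repetition_penalty(
    logits: Sequence[float], generated_ids: Sequence[int], penalty: float
) -> List[float]:
    """Make every token id that was already generated less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive logit or multiplying a negative one both lower it
        # when penalty > 1.0; penalty == 1.0 leaves the logits unchanged.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = [2.0, -1.0, 0.5, 3.0]
# Token ids 0 and 1 were generated before, so only their logits change.
print(apply_repetition_penalty(logits, generated_ids=[0, 1, 0], penalty=2.0))
```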

### Summarization

In [None]:
generation_config = {
    "repetition_penalty": 1.0,
    "do_sample": True,
    "max_new_tokens": 300,
}

result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

Large or many layers often lead


`repetition_penalty: 2.0`

In [None]:
generation_config = {
    "repetition_penalty": 2.0,
    "do_sample": True,
    "max_new_tokens": 300,
}

result = text_pipeline(
    prompt="""
Write a concise summary of the following:

Sparse Finetuning for Inference Acceleration of Large Language Models

Fine-tuning large language models to obtain a small but accurate model is extremely difficult. This is because you have to strike a balance between the model’s size and accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning on MPT with remarkable performance. The MPT model was pruned to 75% without a drop in accuracy, showing performance that is on par with quantization approaches.

Particularly, the resulting sparse model can execute fast on CPUs by taking advantage of sparsity.  Instead of performing standard loss-based fine-tuning which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity.
What’s impressive is that the sparse fine-tuned model achieves 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU.
This post will dive into more details from this paper.
The researchers aim to address the high cost of running large language models. One of the most popular techniques for doing so is quantization where the precision of the weights is reduced to 4 bits. However, at around 3 bits per weight, it becomes hard to recover accuracy.

Introducing weight sparsity is an alternative to quantization where certain connections in the network are set to zero. The sparsity introduced during fine-tuning leads to a significantly faster model. In this paper, the authors study sparse fine-tuning for large language models for the following applications:
Speech transcription using Whisper
Machine translation using T5
Higher-level reasoning using the open GPT-type MPT model
Challenges of Large Language Models
When fine-tuning the large language models the researchers faced several challenges which they resolved. The challenges came from the fact that the:
The fine-tuning data may not be as large as the training data
The desire to achieve high sparsity levels
""",
    generation_config=generation_config,
)
print(result.generations[0].text)

Efficient approximations were used such and soft attention matrices did poorly when evaluated against dense alternatives


### Creative Writing

With no repetition penalty, the model repeats the phrases `As the character` and `excitement and wonder`.

But with a repetition penalty of 2.0, the model seems to write a story with no repetition:
```

The character discovers that the object is actually some sort of magical artifact. They begin to experience powers and abilities corresponding with the artifact, which are unpredictable at first but become predictable over time. The mystery gradually opens up into an epic quest complete with battles against evil forces trying to seize control of the artifact for sinister purposes, which eventually culminates in saving humanity from peril in one final desperate effort. Ultimately, after having confronted numerous challenges along the way, including moments where seemingly impossible obstacles stood in their path or enemies threatening death or degradation if they were consumed by greed for power --- even when these encounters threatened losing everything dear to them --- they arrive at a satisfying resolution by leveraging their newfound magic into fighting back against those who sought its possession as well: namely those bent on conquest amidst chaos raging across nations as opposed to peaceful society (i.); hence ensuring that prosperity was guarded by heroically defending principles vital to society’s wellbeing; all while helping others overcome adversity through compassion and empathy toward other people suffering alongside them
```

The text below repeats the phrase `An elderly man uses his understanding`:

In [None]:
generation_config = {
    "repetition_penalty": 1.0,
    "do_sample": True,
    "max_new_tokens": 500,
}

result = text_pipeline(
    prompt="A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?",
    generation_config=generation_config,
)
print(result.generations[0].text)


As a result of receiving the object, the character sees and experiences new things that they otherwise wouldn’t have encountered. Perhaps they discover a new passion or hobby, or maybe they become more compassionate and empathetic due Ingsights, though these experiences are different for each character.
In this story, an elderly man receives instructions on how to access his futureself in order as he faces retirement with his current self. Being empowered with insights into his future self helps him make sound retirement decisions today so he can enjoy retirement without stressing about finances." /> The elderly man uses his understanding of his future self to make sound retirement decisions today so he can enjoy retirement without stressing about finances." />"> An elderly man uses his understanding of his future self to make sound retirement decisions today so he can enjoy retirement without stressing about finances."> /></submitting">An elderly man uses his understanding of his fut

The text below contains no repetitions:


In [None]:
generation_config = {
    "repetition_penalty": 2.0,
    "do_sample": True,
    "max_new_tokens": 300,
}

result = text_pipeline(
    prompt="A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?",
    generation_config=generation_config,
)
print(result.generations[0].text)


The object could be magical, causing the recipient to experience new powers or abilities related to magic or technology. Perhaps the recipient uses this object in conjunction with existing abilities, enhancing them in some way. Or maybe the object has dangerous potential; perhaps this new ability comes with responsibilities and dangers that come along with it.
Alternatively, the item could have historical significance; perhaps someone from another time who helped protect magic/technology ends up getting caught up in human history and becomes relevant once again through these objects sent from that era into our world today through some unforeseen connection of fate (or possibly a technological breakthrough). In any case, there must be consequences for receiving such an unexpected gift as these are often not without explanation – leaving us guessing what otherworldly events may emerge due The Unknown Gift! It’s anyone's guess what lies ahead as we delve into unexplored territory! Maybe 

In [None]:
generation_config = {
    "repetition_penalty": 2.0,
    "do_sample": True,
    "max_new_tokens": 300,
}

result = text_pipeline(
    prompt="A character receives a mysterious package containing an object they’ve never seen before. What is it and how does it change their life?",
    generation_config=generation_config,
)
print(result.generations[0].text)


As the mystery unfolds, our main character discovers that the object can be used as a weapon for good or evil purposes. If you’re looking for deep plot lines, this series is perfect!
The main character must choose: whether to use the mystical item in their possession to fight crime and injustice or help those caught by catastrophic events such Militarized police response teams against terrorist attacks on innocent civilians abroad. If you want stories of heroism this series is ideal!
If you are looking for plots involving magic artifacts that bring peril with them; This may not be ideal if your aim is to avoid complex moral dilemmas & decisions which challenge conventional norms of behavior in society today.


### RAG

In [None]:
model_config = {"sequence_length": 2048}
generation_config = {"max_new_tokens": 500}
llm = DeepSparse(
    model=MODEL_PATH, model_config=model_config, generation_config=generation_config
)
chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=docsearch.as_retriever(),
)
res = chain({"query": "What are some Dutch breakfast options?"})
answer = res["result"]
print(answer)

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 135300.13it/s]
2023-10-26 06:25:38 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
2023-10-26 06:25:38 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx
2023-10-26 06:25:51 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx


 Beschuit, pannenkoeken, and ontbijtkoeken.


In [None]:
generation_config = {
    "repetition_penalty": 1.0,
    #  "do_sample": True,
    "max_new_tokens": 500,
}
model_config = {"sequence_length": 2048}
llm = DeepSparse(
    model=MODEL_PATH, model_config=model_config, generation_config=generation_config
)
chain = RetrievalQA.from_chain_type(
    llm,
    chain_type="stuff",
    return_source_documents=True,
    retriever=docsearch.as_retriever(),
)
res = chain({"query": "What are some Dutch breakfast options?"})
answer = res["result"]
print(answer)

Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 132312.43it/s]
2023-10-26 06:26:17 deepsparse.transformers.pipelines.text_generation INFO     Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
2023-10-26 06:26:17 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx
2023-10-26 06:26:30 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the transformer model at /home/mwiti/.cache/huggingface/hub/models--neuralmagic--mpt-7b-chat-pruned50-quant/snapshots/a1b59e5acd426be155761950cc9ac297635616bf/model.onnx


 Beschuit, pannenkoeken, and ontbijtkoeken.


### Language Translation

`repetition_penalty: 1.0`:
```
I translate this sentence into French as `Today is a good day to play sports games in large areas of the national park to walk in landscapes with real animals.’
```


In [None]:
generation_config = {
    "repetition_penalty": 1.0,
    "do_sample": True,
    "max_new_tokens": 300,
}
result = text_pipeline(
    prompt="Translate the following sentence to French `Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.`",
    generation_config=generation_config,
)
print(result.generations[0].text)

 I translate this sentence in French as `Ce jour est un bonne journée pour allumer en jeu sportif dans de larges espaces du parc nationaume pour se promener dans des paysages avec de vrais animaux.’


`repetition_penalty: 2.0`:

```
"Ah today Thursday is a good day for the celebration of the sport within the framework of a football match, established on this basis are real which was made to be today also appropriate park environment; the be will have to visit also was an old equivalence that a country could offer."
```

In [None]:
generation_config = {
    "repetition_penalty": 2.0,
    "do_sample": True,
    "max_new_tokens": 300,
}
result = text_pipeline(
    prompt="Translate the following sentence to French `Today is a good day to go out and play football because it is sunny. After that, you can consider, visiting the national park for a nature walk while seeing some wild animals.`",
    generation_config=generation_config,
)
print(result.generations[0].text)



I would translate this sentence as follows:  
"Ah today jeudi est un bon jour à cause la célébration du sport dans le cadre d'un match de football, établis sur cette base sont réelles qui était fait pour être aujourd'hui également opportun environnement de parc; l’être devront visiter également était une vieille équivalence que désirous pouvant offrir un pays."


## Conclusion

### Temperature
Increasing the temperature for creative tasks is a good idea to increase variety in the words that are selected by the model. This will ensure that the generated text is interesting because the model doesn't always choose the same words, which is the case for a greedy approach.  

The same behavior may not be desired for RAG applications, where you want the model to answer questions from the provided text. For these types of applications, lowering the temperature may be more desirable. The same can be said for translation tasks, since you don't want the model to get "creative" but to be more "confident" in its responses.

In summarization, you also don't want the model to get too creative but to summarize the given text, so a lower temperature may be better.
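
The effect of temperature can be sketched in a few lines of plain Python: the logits are divided by the temperature before the softmax, so a low temperature sharpens the distribution and a high one flattens it. The function name `softmax_with_temperature` and the example logits are hypothetical, chosen only for illustration.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0.0:
        raise ValueError("temperature must be greater than 0.0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    # The top token dominates at low temperature and loses weight at high temperature.
    print(t, [round(p, 3) for p in probs])
```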

### `top_k`
`top_k` is an important parameter, particularly for creative tasks. As seen in the notebook, the model repeats certain phrases and sentences when restricted to only a few `top_k` words but doesn't repeat them when `top_k` is increased. Increasing the number reduces repetition. You can increase this number gradually and observe whether the story remains coherent.

Best results for RAG were obtained with the default `top_k` value of 0. Since this is not a creative task, the results from this notebook indicate that increasing the value of `top_k` leads to misleading answers that are not even in the given text.

For summarization, widening the word pool by increasing the value of `top_k` leads to poor summaries, where the model sometimes generates text that is not related to the given content.

Using a high `top_k` such as 300 for translation leads to extremely poor results where the translated text is not related to the original text. This can be attributed to the fact that the model becomes more "creative" since there are so many words to pick from. Therefore, for translation tasks, it's better to keep this number low. In this notebook, 50 gave reasonable results.
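
A minimal sketch of `top_k` filtering, assuming the usual formulation: keep the `k` highest-scoring tokens, discard the rest, and renormalize before sampling. The function name `top_k_probs` is hypothetical, and treating `k <= 0` as "no filtering" is an assumption of this sketch that mirrors the default of 0 mentioned above.

```python
import math
from typing import Dict, Sequence


def top_k_probs(logits: Sequence[float], k: int) -> Dict[int, float]:
    """Keep the k highest-scoring token ids and renormalize their probabilities."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    # Assumption of this sketch: k <= 0 disables the filter and keeps every token.
    kept = ranked if k <= 0 else ranked[:k]
    peak = logits[kept[0]]  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in kept}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


logits = [3.0, 1.0, 2.0, 0.0]
# Only token ids 0 and 2 survive with k=2; a larger k widens the pool.
print(top_k_probs(logits, 2))
print(top_k_probs(logits, 0))
```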

### `top_p`
`top_p` sampling is a strategy for choosing the value of `k` dynamically: words are added in order of decreasing probability until their cumulative probability reaches or exceeds `p`. The model picks the smallest set of words that meets this threshold, so the number of candidate words varies from step to step. For instance, if you pick `p` as 0.8, the probabilities of the words picked could be 0.5+0.2+0.1 or 0.3+0.3+0.2.
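
The selection rule can be sketched in plain Python as follows. The function name `nucleus` and the example distributions are hypothetical; the small tolerance in the comparison is only there to absorb floating-point rounding when the probabilities add up to exactly `p`.

```python
from typing import Dict, Sequence


def nucleus(probs: Sequence[float], p: float) -> Dict[int, float]:
    """Return the smallest set of token ids whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p - 1e-9:  # tolerance for floating-point rounding
            break
    # Renormalize so the kept probabilities sum to 1 before sampling.
    return {i: probs[i] / cumulative for i in kept}


# The number of tokens kept depends on how peaked the distribution is.
print(len(nucleus([0.9, 0.05, 0.05], 0.8)))
print(len(nucleus([0.5, 0.2, 0.1, 0.1, 0.1], 0.8)))
print(len(nucleus([0.25, 0.25, 0.25, 0.25], 0.8)))
```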

### Repetition Penalty
It is a good idea to penalize the model for repetition in creative tasks to make sure it doesn't keep repeating the same phrases. As seen earlier in the notebook, we were able to get stories without repetition when we penalized the model for repetition.

Repetition doesn't seem to be a problem for RAG, summarization, and translation tasks. In fact, from this notebook, it looks like penalizing the model for repetition can lead to poor performance in translation tasks, especially when certain words appear several times in the sentence.
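
Repetition in the outputs above was judged by eye. A rough way to check it programmatically is to count word n-grams that occur more than once. The helper `repeated_ngrams` below is a hypothetical sketch, not part of DeepSparse, and the sample text is adapted from the repeated phrase seen in the creative writing output.

```python
from collections import Counter
from typing import Dict


def repeated_ngrams(text: str, n: int = 4) -> Dict[str, int]:
    """Return every n-word phrase that occurs more than once, with its count."""
    words = text.split()
    counts = Counter(" ".join(words[i : i + n]) for i in range(len(words) - n + 1))
    return {phrase: count for phrase, count in counts.items() if count > 1}


sample = (
    "The elderly man uses his understanding of his future self. "
    "An elderly man uses his understanding of his future self."
)
# An empty result means no n-word phrase was repeated.
print(repeated_ngrams(sample, n=4))
```

The same function can be applied to `result.generations[0].text` to compare outputs generated with different `repetition_penalty` values.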

