# Generate responses with a SmolLM

We have seen how to [retrieve](retrieve.ipynb) and [rerank](./augment.ipynb) documents in a RAG pipeline. Now we will show how to generate a response from a query and a set of retrieved documents. We will use a SmolLM model, which we deploy as a microservice at the end of this notebook.


## Inference API

A SmolLM can be served through different providers, such as `llama-cpp`, `vllm` or `ollama`. For simplicity, we stay on the Hub and show two options: running the model locally with transformers and calling a model hosted on the Inference API. In both cases we use the OpenAI API format, which means the `messages` parameter is a list of dictionaries with `role` and `content` keys.
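As a minimal sketch of the `messages` format described above, the snippet below builds a two-turn chat from a system prompt and a user prompt. The helper name `build_messages` is hypothetical and only used for illustration; the functions later in this notebook build the same list inline.

```python
from typing import Dict, List


def build_messages(
    user_prompt: str, system_prompt: str = "You are a helpful assistant."
) -> List[Dict[str, str]]:
    """Return a chat in the OpenAI-style `messages` format (hypothetical helper)."""
    return [
        {"role": "system", "content": system_prompt},  # sets the assistant's behaviour
        {"role": "user", "content": user_prompt},  # the actual question
    ]


messages = build_messages("What is the future of AI?")
print(messages)
```

Each dictionary is one turn of the conversation, and the order of the list is the order of the turns.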

### SmolLM in transformers

We will use [HuggingFaceTB/SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) through [the transformers integration attached to the model on the Hub](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct?library=transformers). Note that the function accepts `kwargs` such as `max_new_tokens`, which are passed on to the pipeline.

In [12]:
from transformers import pipeline
from typing import Union

pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M-Instruct")


def generate_response_transformers(user_prompt: str, system_prompt: str = "You are a helpful assistant.", **kwargs):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ] 
    return pipe(messages, **kwargs)


# Pass the prompt as a plain string; the function builds the chat messages itself.
generate_response_transformers("What is the future of AI?", max_new_tokens=4000)

Device set to use mps:0


[{'generated_text': [{'role': 'user', 'content': 'What is the future of AI?'},
   {'role': 'assistant',
    'content': "AI is evolving rapidly, and we're on the cusp of a new era. Here are some of the key trends and advancements that will shape the future of AI:\n\n1. **Quantum Computing**: Quantum computers are expected to revolutionize AI by solving complex problems that are currently unsolvable by classical computers. They will also enable AI to learn from itself, making it more self-aware and capable of adapting to new situations.\n\n2. **Neuromorphic Computing**: Neuromorphic computing is a type of AI that mimics the human brain's structure and function. It's expected to be used in AI systems that can learn and adapt in real-time, much like the human brain.\n\n3. **Explainable AI (XAI)**: Explainable AI aims to make AI systems more transparent and understandable. This is important because AI systems are often used in critical applications, and it's crucial that they are transparen
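The pipeline returns the whole chat rather than only the answer, so the reply has to be pulled out of the nested structure. A minimal sketch, assuming the output keeps the shape shown above (a list with one dict whose `generated_text` is a list of messages); `extract_assistant_reply` is a hypothetical helper, and the sample data is hand-written, not real model output.

```python
from typing import Any, Dict, List


def extract_assistant_reply(outputs: List[Dict[str, Any]]) -> str:
    """Return the content of the last assistant message in the first result."""
    chat = outputs[0]["generated_text"]
    for message in reversed(chat):
        if message["role"] == "assistant":
            return message["content"]
    raise ValueError("No assistant message found in the pipeline output.")


# Hand-written stand-in that mirrors the structure of the output above.
sample_outputs = [
    {
        "generated_text": [
            {"role": "user", "content": "What is the future of AI?"},
            {"role": "assistant", "content": "placeholder answer"},
        ]
    }
]
print(extract_assistant_reply(sample_outputs))
```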

### SmolLM in Hugging Face Inference API

We will use the [serverless Hugging Face Inference API](https://huggingface.co/inference-api). This is free and means we don't have to worry about hosting the model. We can find models available for inference [through some basic filters on the Hub](https://huggingface.co/models?inference=warm&pipeline_tag=text-generation&sort=trending). We will use the [HuggingFaceTB/SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) model and call it with [the provided inference endpoint snippet](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct?inference_api=true).

In [11]:
from huggingface_hub import get_token, InferenceClient
from typing import Union

inference_client = InferenceClient(api_key=get_token())


def generate_response_api(
    user_prompt: str,
    system_prompt: str = "You are a helpful assistant.",
    model: str = "HuggingFaceTB/SmolLM2-360M-Instruct",
    **kwargs,
):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    completion = inference_client.chat.completions.create(
        model=model, messages=messages, **kwargs
    )

    return completion.choices[0].message


generate_response_api("What is the future of AI?")

ChatCompletionOutputMessage(role='assistant', content="AI, as it stands today, is still evolving and growing rapidly. However, we do see some exciting trends emerging:\n\n1. Emotional Intelligence: AI can better mimic emotions it's  not a replacement. AI humans are more empathetic and have the capability of showing emotions.\n\n2. Personalized Learning: AI can provide personalized learning experiences for individuals based on their preferences, personality and learning style. The learning experience is now more dynamic and engaging for students.\n\n3. Industry Applications: AI is giving jobs a new appearance and role. For example, automation and machines are now able to perform routine tasks that previously were hand written by humans. In the future, AI will help us solve complex problems more efficiently.\n\n4. Health Care: AI can enhance our health by creating personalized medicine plans based on a person's genetics, environmental, lifestyle and medical history. \n\n5. Education: AI 
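The returned message object carries the text in its `content` attribute, and extra keyword arguments are forwarded to `chat.completions.create`. The sketch below shows that flow with an offline stand-in client so it runs without a token or network access. The names `ask` and `_FakeCompletions` are hypothetical, and `max_tokens` and `temperature` are used here on the assumption that the endpoint accepts these OpenAI-style parameters.

```python
from types import SimpleNamespace
from typing import Any


def ask(
    client: Any,
    user_prompt: str,
    model: str = "HuggingFaceTB/SmolLM2-360M-Instruct",
    **kwargs: Any,
) -> str:
    """Send one user message and return only the text of the reply."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_prompt}],
        **kwargs,  # e.g. max_tokens or temperature (assumed to be supported)
    )
    return completion.choices[0].message.content


class _FakeCompletions:
    """Offline stand-in that records the request and returns a canned reply."""

    def __init__(self) -> None:
        self.last_request = None

    def create(self, **request: Any) -> SimpleNamespace:
        self.last_request = request
        message = SimpleNamespace(role="assistant", content="canned reply")
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


fake_completions = _FakeCompletions()
fake_client = SimpleNamespace(chat=SimpleNamespace(completions=fake_completions))

reply = ask(fake_client, "What is the future of AI?", max_tokens=100, temperature=0.2)
print(reply)
```

With the real client from the cell above, the call would be `ask(inference_client, "What is the future of AI?", max_tokens=100)`.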

## Creating a web app and microservice for generating responses

We will use [Gradio](https://github.com/gradio-app/gradio) as a web application tool to create a demo interface for our RAG pipeline. We can develop it locally and then easily deploy it to Hugging Face Spaces. Lastly, we can use the Gradio client as an SDK to interact directly with our RAG pipeline.

### Gradio as sharable app


In [6]:
import gradio as gr

def generate_response(user_prompt: str, max_new_tokens: int = 250) -> str:
    # Run the local transformers pipeline and return only the assistant's reply,
    # which is the last message in the generated chat.
    outputs = generate_response_transformers(
        user_prompt, max_new_tokens=int(max_new_tokens)
    )
    return outputs[0]["generated_text"][-1]["content"]

with gr.Blocks() as demo:
    gr.Markdown("""# RAG - generate
                
                Part of [smol blueprint](https://github.com/davidberenstein1957/smol-blueprint) - a smol blueprint for AI development, focusing on practical examples of RAG, information extraction, analysis and fine-tuning in the age of LLMs.""")

    with gr.Row():
        user_prompt = gr.Textbox(label="Query", lines=3)

    with gr.Row():
        max_new_tokens = gr.Number(250, label="Max new tokens", precision=0)

    submit_btn = gr.Button("Submit")
    response_output = gr.Textbox(label="Response", lines=10)

    # Wire the button to the components defined above.
    submit_btn.click(
        fn=generate_response,
        inputs=[user_prompt, max_new_tokens],
        outputs=[response_output],
    )

demo.launch()

* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.




<iframe
	src="https://smol-blueprint-rag-hub-datasets.hf.space"
	frameborder="0"
	width="850"
	height="450"
></iframe>

### Deploying Gradio to Hugging Face Spaces

We can now [deploy our Gradio application to Hugging Face Spaces](https://huggingface.co/new-space?sdk=gradio&name=rag-hub-datasets).

-  Click on the "Create Space" button.
-  Copy the code from the Gradio interface and paste it into an `app.py` file. Don't forget to copy the `generate_response_*` function, along with the code to execute the RAG pipeline.
-  Create a `requirements.txt` file with `gradio-client` and `sentence-transformers`.
-  Set a Hugging Face API token as the `HF_TOKEN` secret in the Space settings, if you are using the Inference API.
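Secrets configured in the Space settings are exposed to the app as environment variables, so `app.py` can read the token from the environment. A minimal sketch, assuming the secret is named `HF_TOKEN` as in the list above; `read_hf_token` is a hypothetical helper, and the demo uses a placeholder variable so it runs without a real token.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the API token from the environment and fail early if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "add it as a secret in the Space settings."
        )
    return token


# Placeholder variable and value for illustration only, not a real token.
os.environ.setdefault("HF_TOKEN_DEMO", "hf_dummy_value")
print(read_hf_token("HF_TOKEN_DEMO"))
```

In the deployed app, the value returned by `read_hf_token()` could be passed as `api_key` when creating the `InferenceClient`.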

After a couple of minutes the application is deployed, et voilà, we have [a public RAG interface](https://huggingface.co/spaces/smol-blueprint/rag-hub-datasets)!

### Gradio as REST API

We can now use the [Gradio client as an SDK](https://www.gradio.app/guides/getting-started-with-the-python-client) to interact directly with our RAG pipeline. Each Gradio app has an API documentation page that describes the available endpoints and their parameters, which you can access from the button at the bottom of the Gradio app's Space page.

In [40]:
from gradio_client import Client

client = Client("https://smol-blueprint-rag-hub-datasets.hf.space/")
result = client.predict(
    query="What is the future of AI?",
    k_retrieved=10,
    k_reranked=5,
    api_name="/rag_pipeline",
)
result

Loaded as API: https://smol-blueprint-rag-hub-datasets.hf.space/ ✔
('In the future, artificial intelligence (AI) is expected to play an increasingly significant role in various aspects of society, including object recognition, computer vision, natural language processing, robotics, and more. AI is already being used to develop products and services that enhance personal convenience, make life more efficient, and improve the quality of lives.\n\nAI work will also be more methodical and less reliant on high-bandwidth parallelism, setting it free from the "wonder years" of the 21st century. This will allow AI systems to generalize and adapt more effectively, borrowing from examples and experiences to solve problems more efficiently with fewer examples.\n\nArtificial intelligence is expected to see an industry growth of $23 billion by 2023, a $3.75 billion growth compared to the previous year, and to continue growing at $6.9 billion each year.\n\nResearchers are actively exploring various 
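The result is a tuple with one entry per output component: the generated response and the retrieved documents. The sketch below splits the two and turns the documents into one dict per row. It assumes the client returns the dataframe as a dict with `headers` and `data` keys, which may differ between Gradio versions; `unpack_rag_result` is a hypothetical helper, and the sample values are made-up placeholders, not real pipeline output.

```python
from typing import Any, Dict, List, Tuple


def unpack_rag_result(
    result: Tuple[str, Dict[str, Any]]
) -> Tuple[str, List[Dict[str, Any]]]:
    """Split the (response, documents) tuple and convert documents to row dicts."""
    response, documents = result
    headers = documents["headers"]  # assumed key, e.g. chunk, url, distance, rank
    rows = [dict(zip(headers, row)) for row in documents["data"]]  # assumed key
    return response, rows


# Made-up stand-in for `result`; the values are placeholders.
sample_result = (
    "placeholder response",
    {
        "headers": ["chunk", "url", "distance", "rank"],
        "data": [["placeholder chunk", "https://example.com", 0.1, 1]],
    },
)
response, rows = unpack_rag_result(sample_result)
print(response)
print(rows[0]["url"])
```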

## Next steps

We have seen how to build a RAG pipeline with a SmolLM and some rerankers. Next steps would be to monitor the performance of the RAG pipeline and improve it.