# Setup - Ollama

In [None]:
!pip install -qU gradio

This command installs or updates the Gradio library on your system. Gradio is a Python library that makes it easy to create web interfaces for machine learning models and data processing pipelines.

The command has several parts:
- `pip` is Python's package installer that downloads and sets up libraries
- `install` tells pip to add a new package to your system
- `-q` makes the installation "quiet" by hiding most of the technical output
- `-U` forces an upgrade to the newest version if Gradio is already installed
- `gradio` is the name of the package being installed

When you run this, pip will silently download Gradio and all its required dependencies, then install or update them in your Python environment. The quiet flag helps keep your notebook or console clear of installation messages.
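
Because the quiet flag hides pip's output, it can be handy to confirm afterwards that the package really is installed. The snippet below is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Prints a version string if Gradio is installed, otherwise None
print(installed_version("gradio"))
```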

In [None]:
# TERMINAL :

# curl -fsSL https://ollama.com/install.sh | PATH="/sbin:/usr/sbin:$PATH" sh

# ollama serve &

# ollama pull gemma2


'''
curl -fsSL https://ollama.com/install.sh | PATH="/sbin:/usr/sbin:$PATH" sh
'''

This command downloads and runs the installation script for Ollama, a tool for running large language models locally. Let's break down each part to understand how it works:

The command combines two main actions using the pipe operator (|). The first part downloads the script, and the second part executes it with specific path settings.

In the first part, `curl -fsSL https://ollama.com/install.sh`:
- `curl` is a tool that transfers data from or to a server
- `-f` makes curl exit with an error on HTTP failures (such as a 404) instead of passing the server's error page on to the shell
- `-s` runs curl in silent mode, hiding the progress bar
- `-S` shows errors even in silent mode
- `-L` makes curl follow redirects if the URL points to another location
- The URL points to Ollama's installation script on their website

The pipe operator `|` takes the downloaded script and sends it to the shell command that follows.

In the second part, `PATH="/sbin:/usr/sbin:$PATH" sh`:
- `PATH=` temporarily modifies the system's PATH environment variable
- `/sbin:/usr/sbin:$PATH` adds system administration directories to the front of the existing PATH
- These directories contain important system utilities that the installation might need
- `sh` runs the downloaded script using the shell with this modified PATH

Adding `/sbin` and `/usr/sbin` to the PATH ensures the installation script can find all necessary system commands, even if they're not in the default user PATH. This makes the installation more reliable across different system configurations.

When you run this command, it downloads the installation script and executes it, setting up Ollama on your machine. The script may ask for elevated privileges (for example through `sudo`) for the steps that modify the system. Note that the `-s` flag only silences curl's own progress output; the installation script still prints its own messages.
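
To make the PATH handling concrete, here is a small Python sketch of the same idea: put the extra directories in front of the existing search path, then look a command up against that combined path. The helper name `with_sbin_first` is hypothetical, and the sketch only inspects the search path without changing your environment.

```python
import os
import shutil

def with_sbin_first(current_path, extra=("/sbin", "/usr/sbin")):
    """Mimic PATH="/sbin:/usr/sbin:$PATH": the extra directories go in front."""
    return os.pathsep.join([*extra, current_path])

search_path = with_sbin_first(os.environ.get("PATH", ""))

# shutil.which accepts an explicit search path, so nothing global is modified
print(shutil.which("sh", path=search_path))
```

Directories earlier in the list win when the same command exists in more than one place, which is why the order matters.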

'''
ollama serve &
'''
This command starts the Ollama server as a background process. Let me explain what each part does and how they work together:

The command has two main components: `ollama serve` and the `&` operator. Let's start by understanding the core function:

`ollama serve` launches Ollama's server component, which is essential for running language models. When the server starts, it opens up an API endpoint (typically on port 11434) that allows other applications to communicate with and use the language models you've installed.

The `&` at the end is a special shell operator that tells your system to run the process in the background. This means that after starting Ollama:
1. Your terminal remains free for you to type other commands
2. The server keeps running while you do other things in the same shell session
3. The server's log output still goes to this terminal unless you redirect it, so log lines may appear between your commands

Think of this like starting a music player in the background - you want the music to keep playing while you do other things. Similarly, the Ollama server needs to keep running in the background so other applications can interact with it.

One important detail to note is that since the process is running in the background, you won't immediately see if something goes wrong. You might want to check the server status using commands like `ps` or by trying to connect to it with a client application.

If you later need to bring the process back to the foreground or stop it, you can use commands like `fg` to bring it to the foreground or `pkill ollama` to stop it completely.
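
One simple way to check on the background server from Python is to test whether anything is accepting connections on its port. This is a minimal sketch that assumes the default port 11434 mentioned above; the helper name `server_is_up` is made up, and an open port only shows that something is listening there, not that it is definitely Ollama.

```python
import socket

def server_is_up(host="localhost", port=11434, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False

print(server_is_up())
```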

'''
ollama pull gemma2
'''

This command downloads the Gemma 2 language model from Ollama's model repository and installs it on your system. Let me break down how this process works and what's happening behind the scenes.

The command consists of two main parts: `ollama` is the base command that interacts with the Ollama system, and `pull gemma2` specifies that you want to download and install the Gemma 2 model. This is similar to how you might download an app from an app store - you're getting a pre-packaged piece of software that's ready to use.

When you run this command, several things happen in sequence:
1. Ollama first checks if you already have Gemma 2 installed locally
2. If not, it connects to Ollama's model repository, which is like a library of available language models
3. It begins downloading the model files, which can be quite large (often several gigabytes) since they contain all the neural network parameters that make the model work
4. As it downloads, Ollama verifies the integrity of the files to ensure nothing was corrupted during transfer
5. Finally, it sets up the model in your local Ollama installation so it's ready to use

Gemma 2 is based on Google's Gemma architecture, which was designed to be more efficient than many other language models while maintaining strong performance. Think of it like getting a more fuel-efficient car that still has good acceleration and handling.

Once the download completes, the model will be available for use through Ollama's API or command line interface. You'll be able to send it text prompts and receive generated responses, much like having a conversation with an AI assistant.

If you want to use this model later, you can interact with it using commands like `ollama run gemma2`, which will start a conversation with the model in your terminal.
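
You can also confirm from Python that the model is available locally. Ollama exposes a `/api/tags` endpoint that lists installed models; the sketch below separates the parsing (which runs anywhere) from the network call (which needs the server running). The helper names and the sample payload are illustrative, and the payload contains only the one field this example reads.

```python
import json
from urllib.request import urlopen

def model_names(tags_payload):
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]

def has_model(tags_payload, wanted):
    """Match either the full name ("gemma2:latest") or the part before the tag."""
    return any(
        name == wanted or name.split(":")[0] == wanted
        for name in model_names(tags_payload)
    )

def fetch_tags(base_url="http://localhost:11434"):
    """Fetch the list of local models; requires a running Ollama server."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)

# Illustrative payload with made-up contents, shaped like a real response
sample = {"models": [{"name": "gemma2:latest"}]}
print(has_model(sample, "gemma2"))
```

With the server running, `has_model(fetch_tags(), "gemma2")` performs the same check against your actual installation.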

In [None]:
!curl http://localhost:11434/api/generate -d '{  "model": "gemma2",  "prompt":"Why is the sky blue?", "stream": false}'

{"model":"gemma2","created_at":"2025-01-30T21:49:35.469651733Z","response":"The sky appears blue due to a phenomenon called **Rayleigh scattering**. \n\nHere's a simplified explanation:\n\n1. **Sunlight:** Sunlight is actually made up of all the colors of the rainbow. When it enters Earth's atmosphere, it encounters tiny air molecules (mostly nitrogen and oxygen).\n\n2. **Scattering:** These air molecules scatter the sunlight in different directions.  \n   - **Shorter wavelengths (blue and violet) are scattered more strongly** than longer wavelengths (red and orange). \n\n3. **Our Perception:** As a result, we see more blue light scattered throughout the sky than any other color. This makes the sky appear blue to our eyes.\n\n**Why not violet?** While violet light is scattered even more than blue, our eyes are less sensitive to violet.  Therefore, we perceive the sky as blue.\n\n\nLet me know if you'd like a more detailed explanation or have any other questions!","done":true,"done_reas

This command sends a request to Ollama's local API to generate text using the Gemma 2 model. Let's examine how this works by breaking down each component of the command.

The core command `curl` is being used to send an HTTP POST request to `http://localhost:11434/api/generate`. The localhost address tells us this is connecting to a service running on your own computer - specifically the Ollama server we started earlier. Port 11434 is Ollama's default port for receiving API requests.

The `-d` flag in curl indicates that we're sending data with our request. The data is formatted as a JSON object with three key pieces of information:
1. `"model": "gemma2"` specifies which language model should process our request
2. `"prompt": "Why is the sky blue?"` is the actual question we want the model to answer
3. `"stream": false` tells Ollama to wait for the complete response before sending it back, rather than streaming it in small chunks as it is generated

When you run this command, here's what happens in sequence:
1. Your computer sends the request to the Ollama server running locally
2. The server recognizes that you want to use the Gemma 2 model we downloaded earlier
3. It loads the model if it isn't already loaded in memory
4. The model processes the prompt about why the sky is blue
5. Since streaming is disabled, Ollama waits for the model to finish generating its complete response
6. The server sends back the finished response in a single JSON package

This approach of using curl to interact with the API gives you fine-grained control over how you communicate with the model. It's like having a direct phone line to the AI - you can specify exactly what you want and how you want the response delivered.

If you were building an application that uses this model, you might use similar API calls, but they would typically be wrapped in programming language-specific code rather than using curl directly. The API design makes it easy to integrate Ollama's capabilities into larger software systems.
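
As a rough illustration of what such a wrapper could look like, the sketch below builds the same POST request in Python using only the standard library. The helper name `build_generate_request` is hypothetical, and the request is only sent if you pass it to `urlopen` while the Ollama server is running.

```python
import json
from urllib.request import Request, urlopen

def build_generate_request(model, prompt, base_url="http://localhost:11434"):
    """Build (but do not send) the POST request that the curl command makes."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("gemma2", "Why is the sky blue?")
print(req.get_method(), req.full_url)

# To actually send it (requires the server from earlier to be running):
# with urlopen(req) as resp:
#     print(json.load(resp)["response"])
```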

# Setup

In [None]:
import requests, json
import gradio as gr

This code imports three Python libraries that help create web applications:

The first line brings in the 'requests' and 'json' libraries. The requests library lets the code make HTTP requests to other websites and servers - like fetching data from an API. The json library handles JSON data formats, which are commonly used when sending and receiving data over the internet.

The second line imports the gradio library, giving it the shorter nickname 'gr'. Gradio is a framework that makes it simple to build web interfaces for Python code. It lets developers create user-friendly web apps where people can interact with machine learning models or other Python functions through buttons, text boxes, and other interface elements.

Together, these imports set up the foundation for a web application that can communicate with external services and present an interface to users. The requests library will handle the external communication, json will process the data, and gradio will create the user interface.

In [None]:
class CFG:
  model = "gemma2"

This cell defines a small configuration class, `CFG`, with a single attribute: `model = "gemma2"`. It names the Ollama model that every request will use, so switching to another model you have pulled only requires changing this one line. The rest of the notebook reads the value as `CFG.model`.

With the imports and this configuration in place, the pieces of the application are:
1. `requests` sends HTTP requests to the Ollama server
2. `json` parses the JSON data that comes back
3. `gradio` (imported as `gr`) builds the interface users interact with
4. `CFG.model` tells the server which model to run

`requests` and `json` handle the backend communication with the model, while Gradio provides the frontend. Keeping these concerns separate makes the code more organized and easier to maintain.

# Functions

In [None]:
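def generate(prompt, context, top_k, top_p, temp):
    # Send the prompt, prior context, and sampling options to the local Ollama server.
    r = requests.post('http://localhost:11434/api/generate',
                     json={
                         'model': CFG.model, 'prompt': prompt,
                         'context': context,
                         'options':{
                             'top_k': int(top_k),  # top_k is a count, so pass an integer
                             'temperature': temp,  # was swapped with top_p
                             'top_p': top_p
                         } },
                     stream=False)
    r.raise_for_status()

    response = ""

    # The reply arrives as one JSON object per line, each carrying a piece of the text.
    for line in r.iter_lines():
        if not line:
            continue  # skip blank lines
        body = json.loads(line)
        response_part = body.get('response', '')
        # Echo each piece on the same line instead of one token per line.
        print(response_part, end='', flush=True)
        if 'error' in body:
            raise Exception(body['error'])

        response += response_part

        # The final object has done=True and carries the updated context.
        if body.get('done', False):
            context = body.get('context', [])
            return response, context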

Let me explain how this function works - it's handling communication with our AI model and managing the response in a structured way.

The function `generate` takes five parameters that control how the AI generates text:
- `prompt`: The text we want the AI to respond to
- `context`: Previous conversation history that helps maintain coherent back-and-forth
- `top_k`: Controls how many highest-probability words the model considers at each step
- `top_p`: The nucleus sampling parameter that helps balance creativity and coherence
- `temp`: Temperature setting that affects how random or focused the responses are
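
To build some intuition for these three settings, here is a toy sketch that applies temperature, top-k, and top-p filtering to a made-up set of token scores. It is a generic textbook-style illustration, not Ollama's actual sampler, which may order or combine these steps differently. The function name, the scores, and the order of operations are all assumptions, and the sketch requires a temperature greater than zero.

```python
import math

def filter_distribution(logits, top_k, top_p, temperature):
    """Return {token: probability} after temperature, top-k, then top-p filtering."""
    # Temperature: divide scores before softmax; below 1 sharpens, above 1 flattens
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exp = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exp.values())
    ranked = sorted(((tok, e / total) for tok, e in exp.items()),
                    key=lambda pair: pair[1], reverse=True)

    # top-k: keep only the k most likely tokens
    ranked = ranked[:top_k]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}

# Made-up scores for four candidate next tokens
toy_logits = {"blue": 3.0, "grey": 2.0, "red": 1.0, "green": 0.0}
print(filter_distribution(toy_logits, top_k=3, top_p=0.9, temperature=0.8))
```

Lowering any of the three values shrinks or sharpens the set of candidates, which is why lower settings give more conservative text.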

When this function runs, it first sends a POST request to our local Ollama server. Think of this like dropping a letter in a mailbox - we're sending our request to a specific address (localhost:11434/api/generate) with specific instructions enclosed. The request includes:
1. The model name (stored in CFG.model)
2. Our prompt text
3. Any context from previous exchanges
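4. Sampling options (`top_k`, `top_p`, `temperature`) that shape how the model picks each token

The `stream=False` argument belongs to `requests.post`, not to Ollama: it tells `requests` to download the whole response body before returning. Because the JSON payload does not set Ollama's own `"stream"` field, the server still replies in its default streaming format, one JSON object per line, which is why the function reads the body line by line.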

After sending the request, `r.raise_for_status()` checks if anything went wrong. If the server reports an error, this line will raise an exception - it's like checking if our letter was actually delivered successfully.

The function then processes the response line by line. For each line:
1. It converts the JSON response into a Python object using `json.loads(line)`
2. Extracts any response text using `body.get('response', '')`
3. Prints each piece of the response
4. Checks for errors
5. Builds up the complete response by adding each new piece

If the server signals it's done (`body.get('done', False)`), the function wraps up by:
1. Grabbing any updated context for future exchanges
2. Returning both the complete response and this context

This two-part return value (response and context) is crucial - it's like getting both the answer to your current question and notes that help the AI remember the conversation for next time. The context helps maintain a coherent dialogue across multiple exchanges, much like how humans remember previous parts of a conversation to give relevant responses.

The error handling throughout the function helps ensure we know if something goes wrong, making the system more reliable and easier to debug when problems occur.
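
The line-by-line processing described above can be tried without a running server. The sketch below feeds hand-written chunks through the same accumulate-until-done logic. The helper name `collect_stream` is hypothetical, and the sample chunks (including the context numbers) are invented to match the shape of the fields the function reads.

```python
import json

def collect_stream(lines):
    """Accumulate text from newline-delimited JSON chunks, as generate() does."""
    response = ""
    for line in lines:
        if not line:
            continue  # ignore blank lines
        body = json.loads(line)
        if "error" in body:
            raise RuntimeError(body["error"])
        response += body.get("response", "")
        if body.get("done", False):
            return response, body.get("context", [])
    return response, []  # stream ended without a done marker

# Made-up chunks in the shape of a streamed reply
sample_lines = [
    b'{"response": "The sky", "done": false}',
    b'{"response": " is blue.", "done": false}',
    b'{"response": "", "done": true, "context": [1, 2, 3]}',
]
print(collect_stream(sample_lines))
```

Unlike the notebook's `generate`, this version also returns a value when the stream ends without a `done` marker, which is one way to avoid an implicit `None`.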

In [None]:
def chat(input, chat_history, top_k, top_p, temp):

    chat_history = chat_history or []

    global context
    output, context = generate(input, context, top_k, top_p, temp)

    chat_history.append((input, output))

    return chat_history, chat_history


The `chat` function takes five parameters:
- `input`: The user's latest message
- `chat_history`: A record of all previous exchanges
- `top_k`, `top_p`, and `temp`: The same generation parameters we saw earlier that control how the AI responds

Let's examine what happens step by step:

First, the line `chat_history = chat_history or []` is a safety check. If `chat_history` is None or empty (which would be falsy in Python), it creates a new empty list. This ensures we always have a valid list to work with, even if this is the very first message in a conversation. Think of it like starting a new notebook if you don't already have one to write in.

Next, we see `global context`. This tells Python we want to use and modify the `context` variable that exists outside this function. The context is like the AI's short-term memory of the conversation - it helps maintain coherent back-and-forth exchanges by remembering what was previously discussed.

The line `output, context = generate(input, context, top_k, top_p, temp)` is where the magic happens. It:
1. Calls our previous `generate` function with the new message and existing context
2. Gets back both the AI's response and updated context
3. Updates our global context with this new information

Then, `chat_history.append((input, output))` adds the new exchange to our conversation record. It creates a tuple containing the user's input and the AI's output, like recording both sides of a dialogue. Each entry in chat_history captures one complete turn in the conversation.

Finally, `return chat_history, chat_history` returns the updated history twice because the click handler in the interface is wired to two outputs: the first copy refreshes the visible `gr.Chatbot` display, and the second is stored in `gr.State` so it is passed back in as `chat_history` on the next call.

The overall structure of this function creates a continuous conversation flow:
1. It maintains the context needed for coherent exchanges
2. Generates appropriate responses to each new input
3. Keeps a record of the entire conversation
4. Makes this history available for display or further processing

This design allows for natural back-and-forth conversation while maintaining the context and history needed for meaningful exchanges. It's like having a secretary who not only helps you communicate but also keeps perfect records of every exchange.
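
The `chat` function relies on a module-level `context` variable. As one possible variation, the sketch below passes the context in and out explicitly, which makes the flow of a turn easy to test in isolation. The names `chat_turn` and `fake_generate` are hypothetical, and `fake_generate` is only a stand-in for the real model call.

```python
def chat_turn(user_message, chat_history, context, generate_fn):
    """Run one conversation turn and return the updated history and context."""
    chat_history = list(chat_history or [])  # start a fresh list on the first turn
    reply, context = generate_fn(user_message, context)
    chat_history.append((user_message, reply))
    return chat_history, context

def fake_generate(prompt, context):
    # Stand-in for the model: echoes the prompt and grows the context
    return f"echo: {prompt}", context + [len(prompt)]

history, ctx = chat_turn("hi", None, [], fake_generate)
history, ctx = chat_turn("again", history, ctx, fake_generate)
print(history)
print(ctx)
```

Each turn receives the context produced by the previous one, which is the same hand-off the global variable performs in the notebook's version.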

# Chatbot

In [None]:
context = []

In [None]:
block = gr.Blocks()

with block:

    gr.Markdown("""<h1><center> Muh private chatbot </center></h1>
    """)

    message = gr.Textbox(placeholder="Type here")
    chatbot = gr.Chatbot()


    state = gr.State()

    with gr.Accordion("Advanced Settings", open=False):

      with gr.Row():
          top_k = gr.Slider(0.0,100.0, label="top_k", value=40, info="Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40)")
          top_p = gr.Slider(0.0,1.0, label="top_p", value=0.9, info=" Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)")
          temp = gr.Slider(0.0,2.0, label="temperature", value=0.8, info="The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8)")


    submit = gr.Button("SEND")

    submit.click(chat, inputs=[message, state, top_k, top_p, temp], outputs=[chatbot, state])

block.launch(debug=True, share = True)



Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://89e1c16b07d264420a.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


I am Gemma, an open-weights AI assistant. I'm a large language model trained by Google DeepMind. My purpose is to help people by understanding and responding to their requests in a helpful, informative, and impartial way.

Here are some key things to know about me:

* **Open-Weights:** My weights are publicly available, meaning anyone can access and study them.
* **Text-Only:** I can only communicate through text. I can't generate images, sound, or videos.
* **Limited Knowledge:** I don't have access to real-time information or the internet. My knowledge is based on the data I was trained on, which has a cutoff point.
* **Created by the Gemma Team:** I was developed by a team of engineers and researchers at Google DeepMind.

I'm always learning and improving, and I'm excited to see how people use me t



This code creates an interactive web interface for a chatbot using Gradio. Let's break it down step by step.

First, we create a container for our interface using gr.Blocks() and store it in the variable 'block'. Think of this like creating a blank canvas for our web page.

Inside this container, we set up several interface elements:

The gr.Markdown creates a centered heading that says "Muh private chatbot" at the top of the page. The HTML tags control the size and positioning.

Next, we add two key components for the chat interaction:
- A textbox (gr.Textbox) where users can type their messages, with a helpful "Type here" placeholder
- A chatbot interface (gr.Chatbot) that will display the back-and-forth conversation

The gr.State() creates a hidden component that maintains the conversation's state - like keeping track of the chat history between messages.

The interface includes an advanced settings section, hidden by default (open=False), that users can expand. Inside this accordion menu, we have three sliders arranged in a row:

1. top_k: Controls how selective the model is when choosing words. At 100, it considers many options, while at 10 it sticks to the most likely choices.
2. top_p: Works with top_k to control text diversity. Higher values (like 0.95) encourage more varied responses, while lower values (like 0.5) keep responses more focused.
3. temperature: Affects the model's creativity. Higher temperatures (up to 2.0) make responses more imaginative, while lower values make them more predictable.

At the bottom, there's a "SEND" button. When clicked, it triggers the `chat` function defined in the Functions section with five inputs: the message, conversation state, and the three slider values. The function's results update both the chatbot display and the conversation state.

Finally, block.launch() starts up the web interface with debugging enabled (debug=True) and makes it accessible to others through a temporary public URL (share=True).

The structure flows naturally from top to bottom - from the header, to the main chat interface, to advanced settings, to the send button - creating an intuitive user experience where all controls are easily accessible.