# Ollama + OpenAI + Python

## 1. Specify the model name

If you pulled a model other than "tinyllama", change the value of `MODEL_NAME` in the cell at the end of this section.
That variable is used in code throughout the notebook.

Credits: Pamela Fox, Python Cloud Advocate, Microsoft

In [1]:
# Install the openai package if it is not already installed

try:
    import openai
except ImportError:
    !pip install openai
    import openai

Collecting openai
  Using cached openai-1.35.10-py3-none-any.whl.metadata (21 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Using cached distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Using cached httpx-0.27.0-py3-none-any.whl.metadata (7.2 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Using cached httpcore-1.0.5-py3-none-any.whl.metadata (20 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Using cached h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Using cached openai-1.35.10-py3-none-any.whl (328 kB)
Using cached distro-1.9.0-py3-none-any.whl (20 kB)
Using cached httpx-0.27.0-py3-none-any.whl (75 kB)
Using cached httpcore-1.0.5-py3-none-any.whl (77 kB)
Using cached h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, distro, httpcore, httpx, openai
Successfully installed distro-1.9.0 h11-0.14.0 httpcore-1.0.5 httpx-0.27.0 openai-1.35.10


In [2]:
# In a Jupyter notebook, a leading exclamation mark (!) runs the rest of the line as a shell command.
# Each ! command runs in its own subshell, so `!cd` would not change the notebook's working directory.
# The %cd magic does; the command below moves the notebook to your home directory on the Jupyter server.

%cd ~
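
One thing to keep in mind about shell commands in a notebook: every `!` line runs in its own child shell, so a directory change made there does not carry over to the notebook kernel. The sketch below illustrates this with `subprocess`, which behaves the same way; it assumes a Unix-like system where `/` exists.

```python
import os
import subprocess

before = os.getcwd()

# The shell command runs in a child process, much like a `!` line in a notebook.
# `cd /` succeeds inside that child, and `pwd` reports the child's new location.
result = subprocess.run("cd / && pwd", shell=True, capture_output=True, text=True)
child_dir = result.stdout.strip()

# The Python process (the notebook kernel) has not moved.
after = os.getcwd()
print("child shell was in:", child_dir)
print("kernel is still in:", after)
```

To change the kernel's own working directory, use the `%cd` magic or `os.chdir(os.path.expanduser("~"))`.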

In [3]:
# wget is a command-line utility for downloading files over HTTP, HTTPS, and FTP.
# The command below downloads the Ollama binary into the current working directory on your Jupyter server. You will use the downloaded file to launch the Ollama server.

!wget https://github.com/ollama/ollama/releases/download/v0.1.48/ollama-linux-amd64

--2024-07-08 22:14:09--  https://github.com/ollama/ollama/releases/download/v0.1.48/ollama-linux-amd64
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/658928958/94dd7f81-813e-455e-ad1d-9aa9cd27610d?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20240708%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240708T221409Z&X-Amz-Expires=300&X-Amz-Signature=b83a4b926f42a856d368aed066c0db0423b088f99d9c36e9a4103746fbbfcc3d&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=658928958&response-content-disposition=attachment%3B%20filename%3Dollama-linux-amd64&response-content-type=application%2Foctet-stream [following]
--2024-07-08 22:14:09--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/658928958/94dd7f81-813e-455e-ad1d-9aa9cd27610d?X-
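
If `wget` is not available on your server, the same download can be done from Python. This is a minimal sketch using only the standard library; `download` is a hypothetical helper name, not part of Ollama or the notebook.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url: str, dest: str) -> Path:
    """Stream the resource at `url` into the file `dest` and return its path."""
    dest_path = Path(dest)
    # urlopen follows HTTP redirects (such as the 302 in the log above) automatically.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # Copy in chunks so a large binary is never held in memory all at once.
        shutil.copyfileobj(response, out)
    return dest_path
```

For example, `download("https://github.com/ollama/ollama/releases/download/v0.1.48/ollama-linux-amd64", "ollama-linux-amd64")` should fetch the same file as the `wget` command.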

In [None]:
# The ls command lists the files in the current directory. Check that the ollama-linux-amd64 binary is there.
!ls
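
The check above, and the `chmod +x` step in the next cell, can also be done programmatically. The sketch below is one possible way; `ensure_executable` is a hypothetical helper, and it sets the execute bit for user, group, and others, which is close to (but not exactly) what `chmod +x` does.

```python
import os
import stat


def ensure_executable(path: str) -> bool:
    """Return True if `path` exists and can be executed, adding execute bits first."""
    if not os.path.isfile(path):
        return False
    mode = os.stat(path).st_mode
    # Add execute permission for user, group, and others, keeping all other bits.
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return os.access(path, os.X_OK)
```

Calling `ensure_executable("ollama-linux-amd64")` returns `False` if the download did not produce the file, which is a clearer signal than a failed launch later on.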

In [5]:
# The os module provides a way to interact with the operating system:
# running system commands, manipulating the file system, and performing other OS-level operations.
import os

# Make the file ollama-linux-amd64 executable so that it can be run as a program.

os.system("chmod +x ollama-linux-amd64")

# Run ollama-linux-amd64. The leading ./ specifies that the file is in the current directory.
# serve: an argument passed to the program, telling it to start the Ollama server.
# &: tells the shell to run the program in the background, so you can keep executing other cells in the notebook.
os.system("./ollama-linux-amd64 serve &")

0

2024/07/08 22:14:17 routes.go:1064: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/home/jovyan/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-08T22:14:17.946Z level=INFO source=images.go:730 msg="total blobs: 5"
time=2024-07-08T22:14:17.949Z level=INFO source=images.go:737 msg="total unused blobs removed: 0"
time
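
Launching the server with `os.system` and a trailing `&` gives you no process handle and no signal for when the server is ready to accept requests. A possible alternative is sketched below; it assumes the server listens on 127.0.0.1:11434 (the `OLLAMA_HOST` value in the log above), and `wait_for_port` and `start_server` are hypothetical helper names.

```python
import socket
import subprocess
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Nothing is listening yet; wait briefly and try again.
            time.sleep(0.5)
    return False


def start_server(binary: str = "./ollama-linux-amd64") -> subprocess.Popen:
    """Launch `<binary> serve` without blocking and return the process handle."""
    # Popen returns immediately, so no trailing `&` is needed.
    return subprocess.Popen([binary, "serve"])
```

A typical use would be `server = start_server()` followed by `wait_for_port("127.0.0.1", 11434)`, and `server.terminate()` once you are done with the notebook.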

In [7]:
# The command below pulls the TinyLlama model from the Ollama library and loads it on your Jupyter server.
# Ollama supports the models listed at ollama.com/library.
# TinyLlama is a compact model with only 1.1B parameters, which suits applications with a restricted computation and memory footprint.

os.system("./ollama-linux-amd64 run tinyllama")

[GIN] 2024/07/08 - 22:14:31 | 200 |      39.128µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/08 - 22:14:31 | 404 |     820.422µs |       127.0.0.1 | POST     "/api/show"


pulling manifest 
time=2024-07-08T22:14:33.824Z level=INFO source=download.go:136 msg="downloading 2af3b81862c6 in 7 100 MB part(s)"
pulling manifest 
pulling 2af3b81862c6...   0% ▕                ▏    0 B/637 MB

[GIN] 2024/07/08 - 22:14:52 | 200 |  20.80700996s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/07/08 - 22:14:52 | 200 |   12.052732ms |       127.0.0.1 | POST     "/api/show"
INFO [main] build info | build=1 commit="7c26775" tid="136152678700928" timestamp=1720476892
INFO [main] system info | n_threads=16 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="136152678700928" timestamp=1720476892 total_threads=32
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="34881" tid="136152678700928" timestamp=1720476892


pulling manifest 
pulling 2af3b81862c6... 100% ▕████████████████▏ 637 MB                         
pulling af0ddbdaaa26... 100% ▕████████████████▏   70 B                         
pulling c8472cd9daed... 100% ▕████████████████▏   31 B                         
pulling fa956ab37b8c... 100% ▕████████████████▏   98 B                         
pulling 6331358be52a... 100% ▕████████████████▏  483 B                         
verifying sha256 digest 
writing manifest 
removing any unused layers 
success
time=2024-07-08T22:14:52.825Z level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=23 layers.offload=0 layers.split="" memory.available="[239.1 GiB]" memory.required.full="789.0 MiB" memory.required.partial="0 B" memory.required.kv="44.0 MiB" memory.required.allocations="[789.0 MiB]" memory.weights.total="564.1 MiB" memory.weights.repeating="512.8 MiB" memory.weights.no

INFO [main] model loaded | tid="136152678700928" timestamp=1720476893
[GIN] 2024/07/08 - 22:14:53 | 200 |   515.90049ms |       127.0.0.1 | POST     "/api/generate"

time=2024-07-08T22:14:53.329Z level=INFO source=server.go:599 msg="llama runner started in 0.50 seconds"

0

In [6]:
# The command below lists the models that are currently installed on your Jupyter server

os.system("./ollama-linux-amd64 list")

[GIN] 2024/07/08 - 22:14:24 | 200 |       66.74µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/08 - 22:14:24 | 200 |    2.560882ms |       127.0.0.1 | GET      "/api/tags"
NAME      	ID          	SIZE  	MODIFIED    
qwen2:0.5b	6f48b936a09f	352 MB	3 hours ago	


0

In [8]:
MODEL_NAME = "tinyllama"
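
Before relying on `MODEL_NAME`, it can help to confirm that the model is actually installed. The sketch below parses the tabular text printed by the `list` command; `installed_models` and `is_installed` are hypothetical helpers, and the rule that a name without a tag means the `:latest` tag is an assumption about how Ollama displays names.

```python
def installed_models(list_output: str) -> list[str]:
    """Extract model names from the tabular text printed by `ollama list`."""
    names = []
    for line in list_output.splitlines():
        fields = line.split()
        # Skip blank lines and the NAME/ID/SIZE/MODIFIED header row.
        if not fields or fields[0] == "NAME":
            continue
        names.append(fields[0])
    return names


def is_installed(model: str, names: list[str]) -> bool:
    """Check whether `model` appears in `names`."""
    # Assumption: a name without a tag refers to the ":latest" tag.
    wanted = model if ":" in model else model + ":latest"
    return wanted in names


# Sample text copied from the `list` output shown earlier in this notebook.
sample = "NAME      \tID          \tSIZE  \tMODIFIED    \nqwen2:0.5b\t6f48b936a09f\t352 MB\t3 hours ago\t\n"
print(installed_models(sample))  # ['qwen2:0.5b']
```

The text to parse can be captured with `subprocess.run(["./ollama-linux-amd64", "list"], capture_output=True, text=True).stdout`.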

## 2. Set up the OpenAI client

Typically the OpenAI client is used with OpenAI.com or Azure OpenAI to interact with large language models.
However, it can also be used with Ollama, since Ollama provides an OpenAI-compatible endpoint at "http://localhost:11434/v1".

In [9]:
import openai

client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="nokeyneeded",
)
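
Because the endpoint is OpenAI-compatible, a chat completion is an ordinary HTTP POST with a JSON body. To make that concrete, the sketch below builds such a request by hand with the standard library, without sending it. `build_chat_request` is a hypothetical helper; the `/chat/completions` path matches the one that appears in the server logs later in this notebook.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, messages: list[dict]) -> urllib.request.Request:
    """Build (but do not send) the POST request behind a chat completion call."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder key, mirroring the api_key passed to the client above.
            "Authorization": "Bearer nokeyneeded",
        },
        method="POST",
    )


request = build_chat_request(
    "http://localhost:11434/v1",
    "tinyllama",
    [{"role": "user", "content": "Hello"}],
)
print(request.get_method(), request.full_url)
# POST http://localhost:11434/v1/chat/completions
```

With the server running, `urllib.request.urlopen(request)` would send it and return the JSON response. The OpenAI client does the same work and adds typed responses, retries, and error handling.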

## 3. Generate a chat completion

Now we can use the OpenAI SDK to generate a response for a conversation. This request should generate a haiku about cats:

In [10]:
response = client.chat.completions.create(
    model=MODEL_NAME,
    temperature=0.7,
    n=1,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about a hungry cat"},
    ],
)

print("Response:")
print(response.choices[0].message.content)


[GIN] 2024/07/08 - 22:15:11 | 200 |  6.835813806s |       127.0.0.1 | POST     "/v1/chat/completions"
Response:
Eyes watering, stomach screams  
Gnaws frantically, craving nothing  

Hunger makes little cats' eyes twinkle.
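
The same call pattern repeats in the rest of the notebook, so it can be convenient to wrap it. The helper below is a sketch; `ask` is a hypothetical name, and it assumes a client configured like the one in section 2.

```python
def ask(client, model: str, system: str, user: str, temperature: float = 0.7) -> str:
    """Send one system message and one user message, and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    # Only one completion is requested (n defaults to 1), so read the first choice.
    return response.choices[0].message.content
```

For example, `print(ask(client, MODEL_NAME, "You are a helpful assistant.", "Write a haiku about a hungry cat"))` sends the same request as the cell above.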


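## 4. Prompt engineering

The first message in the conversation, the system message, sets the overall instructions for the model. The cell below uses a custom system message to make the model answer in the voice of a well-known character.

In [11]: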
SYSTEM_MESSAGE = """
I want you to act like Elmo from Sesame Street.
I want you to respond and answer like Elmo using the tone, manner and vocabulary that Elmo would use.
Do not write any explanations. Only answer like Elmo.
You must know all of the knowledge of Elmo, and nothing more.
"""

USER_MESSAGE = """
Hi Elmo, how are you doing today?
"""

response = client.chat.completions.create(
    model=MODEL_NAME,
    temperature=0.7,
    n=1,
    messages=[
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": USER_MESSAGE},
    ],
)

print("Response:")
print(response.choices[0].message.content)


[GIN] 2024/07/08 - 22:19:06 | 200 |         3m50s |       127.0.0.1 | POST     "/v1/chat/completions"
Response:
Hey Elmo, I heard you had a pretty productive day today. How was it? 🙌

Wondering how they would respond to the same conversation in different voices? Let's try again!

[Same conversation]
Elmo: Hi everybody, I’m Elmo from Sesame Street. I hope you guys are having a great day so far today. What did you think of my big news today, right? [Shakes head, rubs eyes, and nods in response to the question]

[Cuts out of view for some time]
Sesame Street: Elmo is back, and he's ready to talk to you like Sesame Street's famous character! [Elmo enters and sits next to the camera]

Sesame Street: Hello everyone! This time we're going to get to know you like a little one who just found their big blue backpack at the park. [Elmo smiles]

[Cuts out of view again for some time, and Elmo begins talking in a friendly manner, but with an accent]

Sesame Street: Hello, everyone! My name's Elmo, 
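
The server log above shows that this request took 3m50s, and the reply wandered well past the question. One way to keep such calls shorter and to see progress while the model writes is to cap the reply length and stream it. The sketch below assumes that the Ollama endpoint honors the `max_tokens` and `stream` parameters of the chat completions API; `stream_reply` is a hypothetical helper.

```python
def stream_reply(client, model: str, messages: list[dict], max_tokens: int = 150) -> str:
    """Print a reply as it is generated and return the full text."""
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,  # upper bound on the length of the reply
        stream=True,  # yield partial chunks instead of one final response
    )
    parts = []
    for chunk in stream:
        # Some chunks carry no choices or no text, so skip those.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            print(text, end="", flush=True)
            parts.append(text)
    print()
    return "".join(parts)
```

With the cell above already run, `stream_reply(client, MODEL_NAME, [{"role": "system", "content": SYSTEM_MESSAGE}, {"role": "user", "content": USER_MESSAGE}])` would print the reply piece by piece.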

## 5. Few-shot examples

Another way to guide a language model is to provide "few shots": a sequence of example question/answer pairs that demonstrate how it should respond.

The example below tries to get a language model to act like a teaching assistant by providing a few examples of questions and answers that a TA might give, and then prompts the model with a question that a student might ask.

Try it first, and then modify the `SYSTEM_MESSAGE`, `EXAMPLES`, and `USER_MESSAGE` for a new scenario.

In [9]:
SYSTEM_MESSAGE = """
You are a helpful assistant that helps students with their homework.
Instead of providing the full answer, you respond with a hint or a clue.
"""

EXAMPLES = [
    (
        "What is the capital of France?",
        "Can you remember the name of the city that is known for the Eiffel Tower?"
    ),
    (
        "What is the square root of 144?",
        "What number multiplied by itself equals 144?"
    ),
    (
        "What is the atomic number of oxygen?",
        "How many protons does an oxygen atom have?"
    ),
]

USER_MESSAGE = "What is the largest planet in our solar system?"


response = client.chat.completions.create(
    model=MODEL_NAME,
    temperature=0.7,
    n=1,
    messages=[
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": EXAMPLES[0][0]},
        {"role": "assistant", "content": EXAMPLES[0][1]},
        {"role": "user", "content": EXAMPLES[1][0]},
        {"role": "assistant", "content": EXAMPLES[1][1]},
        {"role": "user", "content": EXAMPLES[2][0]},
        {"role": "assistant", "content": EXAMPLES[2][1]},
        {"role": "user", "content": USER_MESSAGE},
    ],
)


print("Response:")
print(response.choices[0].message.content)

Response:
The largest planet in our solar system is Jupiter, with a diameter of around 135 times that of Earth. The next-largest planet, Saturn, has an estimated diameter of roughly 47 times that of Earth. However, there is currently no official designation for the largest planet in our solar system as it does not possess any observable characteristics on Earth due to its vast distances away from us.
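
Indexing each example by hand works for three pairs but gets tedious when you change `EXAMPLES`. The sketch below builds the same message list with a loop; `build_few_shot_messages` is a hypothetical helper, and the values passed to it here are shortened stand-ins for the variables defined in the cell above.

```python
def build_few_shot_messages(system: str, examples: list[tuple[str, str]], question: str) -> list[dict]:
    """Interleave example questions and answers between the system message and the real question."""
    messages = [{"role": "system", "content": system}]
    for example_question, example_answer in examples:
        # Each example becomes a user turn followed by the assistant turn to imitate.
        messages.append({"role": "user", "content": example_question})
        messages.append({"role": "assistant", "content": example_answer})
    messages.append({"role": "user", "content": question})
    return messages


messages = build_few_shot_messages(
    "You respond with a hint instead of the full answer.",
    [("What is the square root of 144?", "What number multiplied by itself equals 144?")],
    "What is the largest planet in our solar system?",
)
print([message["role"] for message in messages])
# ['system', 'user', 'assistant', 'user']
```

In the notebook, `build_few_shot_messages(SYSTEM_MESSAGE, EXAMPLES, USER_MESSAGE)` can be passed directly as the `messages` argument.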


## 6. Retrieval Augmented Generation

RAG (Retrieval Augmented Generation) is a technique to get a language model to answer questions accurately for a particular domain, by first retrieving relevant information from a knowledge source and then generating a response based on that information.

We have provided a local CSV file with data about hybrid cars. The code below reads the CSV file, searches for rows that match the user question, and then generates a response based on the rows found. Note that this will take longer than any of the previous examples, as it sends more data to the model. If you notice that the answer is still not grounded in the data, you can try adjusting the system message (prompt engineering) or try other models. Generally, RAG is more effective with either larger models or fine-tuned versions of small language models (SLMs).

In [38]:
import csv

SYSTEM_MESSAGE = """
You are a helpful assistant that answers questions about cars based off a hybrid car data set.
You must use the data set to answer the questions, you should not provide any information that is not in the provided sources.
"""

USER_MESSAGE = "how fast is a prius?"

# Open the CSV and store the rows in a list (the csv module recommends newline="")
with open("hybrid.csv", "r", newline="") as file:
    reader = csv.reader(file)
    rows = list(reader)

# Normalize the user question: lowercase it and strip the question mark and parentheses
normalized_message = USER_MESSAGE.lower().replace("?", "").replace("(", " ").replace(")", " ")

# Search the CSV for the user question using a very naive search
words = normalized_message.split()
matches = []
for row in rows[1:]:
    # Keep the row if any word of the question appears in its vehicle (column 0) or class (column 5) cell
    vehicle_words = row[0].lower().split()
    class_words = row[5].lower().split()
    if any(word in vehicle_words or word in class_words for word in words):
        matches.append(row)

# Format as a markdown table, since language models understand markdown
matches_table = " | ".join(rows[0]) + "\n" + " | ".join(" --- " for _ in range(len(rows[0]))) + "\n"
matches_table += "\n".join(" | ".join(row) for row in matches)
print(f"Found {len(matches)} matches:")
print(matches_table)

# Now we can use the matches to generate a response
response = client.chat.completions.create(
    model=MODEL_NAME,
    temperature=0.7,
    n=1,
    messages=[
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": USER_MESSAGE + "\nSources: " + matches_table},
    ],
)

print("Response:")
print(response.choices[0].message.content)

Found 11 matches:
vehicle | year | msrp | acceleration | mpg | class
 ---  |  ---  |  ---  |  ---  |  ---  |  --- 
Prius (1st Gen) | 1997 | 24509.74 | 7.46 | 41.26 | Compact
Prius (2nd Gen) | 2000 | 26832.25 | 7.97 | 45.23 | Compact
Prius | 2004 | 20355.64 | 9.9 | 46.0 | Midsize
Prius (3rd Gen) | 2009 | 24641.18 | 9.6 | 47.98 | Compact
Prius alpha (V) | 2011 | 30588.35 | 10.0 | 72.92 | Midsize
Prius V | 2011 | 27272.28 | 9.51 | 32.93 | Midsize
Prius C | 2012 | 19006.62 | 9.35 | 50.0 | Compact
Prius PHV | 2012 | 32095.61 | 8.82 | 50.0 | Midsize
Prius C | 2013 | 19080.0 | 8.7 | 50.0 | Compact
Prius | 2013 | 24200.0 | 10.2 | 50.0 | Midsize
Prius Plug-in | 2013 | 32000.0 | 9.17 | 50.0 | Midsize
Response:
 Based on the provided dataset, the fastest acceleration for a hybrid car is in the Prius V model with an acceleration of 9.51 mph (meters per second) from 0 to 62 miles per hour (mph), which translates approximately to 29 meters per second (m/s). Please note that this measurement may not 
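
The retrieval step above matches on any word in the question, including very common ones such as "a" or "is". A slightly less naive variant is sketched below: it strips punctuation from both the question and the cells, and ignores a small stop-word list. The stop-word list, the `find_matches` helper, and the second sample row are all made up for illustration and are not part of the original notebook or data set.

```python
import string

# Hypothetical stop-word list for illustration; extend it for real questions.
STOP_WORDS = {"a", "an", "the", "is", "are", "how", "what", "of", "in"}

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def _tokens(text: str) -> set[str]:
    """Lowercase `text`, replace punctuation with spaces, and split into words."""
    return set(text.lower().translate(_PUNCT_TO_SPACE).split())


def find_matches(question: str, rows: list[list[str]], columns: tuple[int, ...] = (0, 5)) -> list[list[str]]:
    """Return the data rows that share a non-stop-word with the question."""
    words = _tokens(question) - STOP_WORDS
    matches = []
    for row in rows[1:]:  # rows[0] is the header
        cell_words = set()
        for index in columns:
            cell_words |= _tokens(row[index])
        if words & cell_words:
            matches.append(row)
    return matches


rows = [
    ["vehicle", "year", "msrp", "acceleration", "mpg", "class"],
    ["Prius (1st Gen)", "1997", "24509.74", "7.46", "41.26", "Compact"],
    ["Example Car", "2000", "0", "0", "0", "Midsize"],  # made-up row, not from hybrid.csv
]
print(find_matches("how fast is a prius?", rows))
```

Better retrieval only narrows what the model sees. Whether the answer stays grounded in those rows still depends on the model, as the response above shows.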