# Locally using Ollama Rest end point

Make sure to have docker installed locally
Pull Ollama image using and serve using CPU only option

docker pull ollama/ollama

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

CURL command to test if your local setup works

curl --location 'http://localhost:11434/api/generate' \
--header 'Content-Type: application/json' \
--data '{
  "model": "llama3.2:1b",
  "prompt": "Translate the following English text to French: '\''Hello, how are you?'\''","max_tokens": 10
}'

We will do the same from Python now (version used 3.12.7)

In [14]:
import os
import requests
import json
# Set environment variable for Ollama model URL
os.environ['OLLAMA_API_URL'] = 'http://localhost:11434/api/generate'

# Function to query the Ollama model
def query_ollama_model(prompt):    
    url = os.environ['OLLAMA_API_URL']    
    payload = {
    "model": "llama3.2:1b",
    "prompt": prompt,
    "max_tokens": 60
    }
    # Convert the payload to a JSON string
    payload_json = json.dumps(payload)

    # Set the headers to specify JSON content
    headers = {
        "Content-Type": "application/json"
    }

    # Send the POST request
    response = requests.post(url, data=payload_json, headers=headers, stream=True)
    # Check the response
    if response.status_code == 200:
        full_response = ""
        # Process each line in the streaming response
        for line in response.iter_lines():
            if line:
                # Decode the line and parse it as JSON
                json_response = json.loads(line.decode('utf-8'))
                if 'response' in json_response:
                    full_response += json_response['response']
                if json_response.get('done', False):
                    break
        return full_response
    else:
        print(f"Request failed with status code {response.status_code}")
        response.raise_for_status()
        
# Example usage
prompt = "Translate the following English text to French: 'Hello, how are you?'"
response_text = query_ollama_model(prompt)
print(response_text)

The translation of "Hello, how are you?" from English to French is:

"Bonjour, comment vas-tu ?"

Here's a breakdown of each word:

- Bonjour (good day)
- comment (how, in this case, meaning "in what way", but also literally "what")
- vas-tu (you)


# Rag Locally using Ollama

Prerequiste for Chroma to work locally :

Download: https://visualstudio.microsoft.com/visual-cpp-build-tools/

Install "Desktop development with C++"

For Ollama to work without Docker, install the package from :

https://github.com/ollama/ollama?tab=readme-ov-file

Then go to CMD, type

ollama pull llama3.2:1b

You also need to pull "nomic-embed-text" in similar manner.
This is for generating your embeddings.


Now we know that our local model works. And by giving a prompt without any external context, i was able to get back response from my model. And able to print the response. Next step would be to demonstrate the same with a RAG pipeline, where we would add a context along with the payload to get back a response from the model.

In [2]:
# Document loading, retrieval methods and text splitting
%pip install -qU langchain langchain_community

# Local vector store via Chroma
%pip install -qU langchain_chroma

# Local inference and embeddings via Ollama
%pip install -qU langchain_ollama

# Web Loader
%pip install -qU beautifulsoup4

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.




Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [None]:
import os
from dotenv import load_dotenv

# Load environment variables FIRST, before any other imports
load_dotenv()

# Set environment variables
os.environ["LANGCHAIN_API_KEY"] = "llanchain-api-key"  # Add this line
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "your-project-name"  # Add this line
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"  # Add this line

from langchain_chroma import Chroma
from langchain import hub
from langchain_ollama import OllamaEmbeddings
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

#### INDEXING ####

# Load Documents
loader = WebBaseLoader("https://www.cherryhillfreeclinic.org/care-services/")

docs = loader.load()

# Check if docs is empty and handle it
if not docs:
    print("No documents loaded. Check the web paths or bs_kwargs.")
else:
    # Split
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    splits = text_splitter.split_documents(docs)

    # Check if splits is empty and handle it
    if not splits:
        print("No splits generated. Check the text splitter configuration.")
    else:
        #Local Embeddings for Ollama
        local_embeddings = OllamaEmbeddings(model="nomic-embed-text")
        
        # Embed
        vectorstore = Chroma.from_documents(documents=splits, embedding=local_embeddings)

        retriever = vectorstore.as_retriever()

        #### RETRIEVAL and GENERATION ####

        # Prompt
        prompt = hub.pull("rlm/rag-prompt")

        # LLM
        #llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
        llm = ChatOllama(
            model="llama3.2:1b",
        )

        # Post-processing
        def format_docs(docs):
            return "\n\n".join(doc.page_content for doc in docs)

        # Chain
        rag_chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt
            | llm
            | StrOutputParser()
        )

        # Question
        rag_chain.invoke("any treatments for ADHD available?")

