# Install necessary libraries.


In [None]:
!pip install rdflib SPARQLWrapper bs4

In [None]:
!pip install langchain_community

In [None]:
!pip install chromadb

# Install an LLM.
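On your own machine, go to https://github.com/ollama/ollama and download the framework for your platform. The cells below use the `mistral` model for generation and `nomic-embed-text` for embeddings, so pull both first (`ollama pull mistral`, `ollama pull nomic-embed-text`). Ollama also runs Llama, the free (but not open) model from Meta: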

https://en.wikipedia.org/wiki/Llama_(language_model)
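
Before running the cells below, it can help to check that the Ollama server is up and that the models are available. The following is a minimal sketch, assuming Ollama listens on its default port and that its REST endpoint `/api/tags` returns a JSON object with a `models` list. The helper names `model_names` and `list_local_models` are made up for this notebook.

```python
import json
import urllib.error
import urllib.request


def model_names(tags_payload):
    """Extract the model names from the JSON returned by /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]


def list_local_models(base_url="http://localhost:11434", timeout=5):
    """Return the locally available model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Server not running, wrong port, or an unexpected response body
        return None


if __name__ == "__main__":
    names = list_local_models()
    if names is None:
        print("Ollama does not seem to be running on localhost:11434")
    else:
        print("Available models:", names)
```

If `mistral` or `nomic-embed-text` is missing from the list, pull it before continuing.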

There are many other ways to get local LLMs and RAGs running. 
- [Hugging Face](https://huggingface.co/models) is the central hub for all sorts of models.
- [LM Studio](https://lmstudio.ai/) and [GPT4All](https://gpt4all.io/index.html) are two prominent GUIs that make installation and use a breeze.
  



In [None]:
from langchain_community.llms import Ollama
ollama = Ollama(
    base_url='http://localhost:11434',
    model="mistral"
)
print(ollama.invoke("Who is in the program committee of the LDAC 2024 workshop?"))

# Ingest
Let's retrieve some information to embed.
Why don't we start with our own context: LDAC 2024! Yay!


In [None]:
# WebBaseLoader lives in langchain_community, which was installed above
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://linkedbuildingdata.net/ldac2024/")

data = loader.load()

Before it can be fed to our RAG, the page has to be chopped up into chunks. There are a lot of loaders and splitters to address different types of input data, e.g. PDFs.
In this case we are dealing with HTML, and we don't want to pollute our embeddings with HTML tags like `<b> <pre> <div>`, JavaScript etc. `WebBaseLoader` already extracts the page text for us (that is what `bs4` was installed for), so only the splitting is left.

LangChain has us covered with a number of different splitters.
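
To make the "no tags, no JavaScript" point concrete, here is a small sketch of text extraction using only the standard library. It is not what `WebBaseLoader` does internally (that goes through BeautifulSoup). The class `TextExtractor` and the function `html_to_text` are illustrative names, and the sample HTML is invented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and skip the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = "<div><b>LDAC 2024</b> workshop<script>var x = 1;</script></div>"
print(html_to_text(sample))
```

Only the text between the tags survives; the markup and the script body never reach the embedding step.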

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
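
To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch: a fixed-width sliding window over the characters. `RecursiveCharacterTextSplitter` is smarter than this, since it prefers to cut at separators such as paragraph breaks and spaces, so treat the hypothetical `split_text` below as an illustration of the two parameters only.

```python
def split_text(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=0))
print(split_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With an overlap of 0, as in the cell above, a sentence that straddles a chunk boundary is cut in two. A small overlap keeps some shared context on both sides at the cost of more chunks.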

The raw data is ready, so we can feed it into a vector store. We are using Chroma. There are a lot of other options available, and two months from the time of this writing things will have broken already... we are living in very dynamic times.
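
What a vector store does can be sketched in a few lines: it keeps one vector per chunk and returns the chunks whose vectors are closest to the query vector. In the toy version below the "embedding" is just a word count and the ranking uses cosine similarity. A real setup uses dense vectors from an embedding model, and Chroma's actual indexing and distance metric may differ. `ToyVectorStore`, `embed` and `cosine` are illustrative names, and the sample sentences are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real embedding model returns a dense vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    def __init__(self, texts):
        # Embed every chunk once, at ingestion time
        self.entries = [(text, embed(text)) for text in texts]

    def similarity_search(self, query, k=4):
        query_vector = embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine(query_vector, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore([
    "the program committee reviews the submitted papers",
    "a vector store keeps one embedding per chunk",
    "chunks are cut from the web page text",
])
print(store.similarity_search("who is in the program committee", k=1))
```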

In [None]:
# Both classes are provided by langchain_community, which was installed above
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
oembed = OllamaEmbeddings(base_url="http://localhost:11434", model="nomic-embed-text")
vectorstore = Chroma.from_documents(documents=all_splits, embedding=oembed)

In [None]:
from langchain_community.llms import Ollama
ollama = Ollama(
    base_url='http://localhost:11434',
    model="mistral"
)
print(ollama.invoke("what is the LDAC and who is part of the committee"))

Let's see which documents the vector store considers most similar to our question.

In [None]:
question = "who is in the program committee of the LDAC 2024?"
docs = vectorstore.similarity_search(question)
len(docs)

In [None]:
docs

In [None]:
from langchain.chains import RetrievalQA
qachain = RetrievalQA.from_chain_type(ollama, retriever=vectorstore.as_retriever())
res = qachain.invoke({"query": question})
print(res['result'])
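
Roughly speaking, the chain above retrieves the most similar chunks and "stuffs" them into a single prompt together with the question. The sketch below shows the idea. The wording of the template is invented for illustration and is not the prompt LangChain actually sends, and `build_prompt` is a hypothetical helper.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
print(build_prompt("What does the first chunk say?", example_chunks))
```

This is also why chunk size matters: everything retrieved has to fit into the model's context window alongside the question.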

In [None]:
question = "what is the semantic scope with respect to data of the LDAC"
# The chain built above can be reused; it retrieves the matching chunks itself
res = qachain.invoke({"query": question})
print(res['result'])

In [None]:
import json

def save_vectorstore_to_file(vectorstore, filename):
    # Fetch the stored chunks once; by default this returns the documents
    # and their metadata, but not the embedding vectors
    stored = vectorstore._collection.get()
    documents = stored["documents"]
    metadatas = stored["metadatas"]

    # Prepare data for serialization
    data_to_save = {
        "documents": documents,
        "metadatas": metadatas
    }

    # Save to file
    with open(filename, 'w') as f:
        json.dump(data_to_save, f)

# Save the chunk texts and metadata (not the embeddings) to a JSON file
save_vectorstore_to_file(vectorstore, 'vectorstore_data.json')
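
The saved file holds only the chunk texts and their metadata, so reading it back is plain JSON handling. A minimal sketch, assuming the file layout written above (`documents` and `metadatas` as parallel lists); `load_vectorstore_data` is a hypothetical helper, and the demo writes its own small file so it runs without a vector store.

```python
import json
import os
import tempfile


def load_vectorstore_data(filename):
    """Read the saved chunks back as a list of (text, metadata) pairs."""
    with open(filename, encoding="utf-8") as f:
        data = json.load(f)

    documents = data["documents"]
    metadatas = data["metadatas"]
    if len(documents) != len(metadatas):
        raise ValueError("documents and metadatas differ in length")
    return list(zip(documents, metadatas))


# Demo with a small file in the same layout (contents are made up)
demo_path = os.path.join(tempfile.mkdtemp(), "vectorstore_demo.json")
with open(demo_path, "w", encoding="utf-8") as f:
    json.dump(
        {"documents": ["first chunk"], "metadatas": [{"source": "demo"}]},
        f,
    )

for text, metadata in load_vectorstore_data(demo_path):
    print(metadata, text)
```

Since the embeddings are not part of the file, the texts would have to be embedded again to rebuild a searchable store, for example with `Chroma.from_texts`.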
