# Install necessary libraries.


In [None]:
!pip install rdflib SPARQLWrapper bs4

In [None]:
!pip install langchain_community

In [None]:
!pip install chromadb

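# Install an LLM.
Install it on your own machine:
go to https://github.com/ollama/ollama and download the framework for your platform. The cells below use the `mistral` model and the `nomic-embed-text` embedding model, so pull both first (`ollama pull mistral`, `ollama pull nomic-embed-text`). Ollama can also run Llama, the free (but not open) model family from Meta: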

https://en.wikipedia.org/wiki/Llama_(language_model)

There are many other ways to get local LLMs and RAGs running. 
- [Hugging Face](https://huggingface.co/models) is the central hub for all sorts of models
- [LM Studio](https://lmstudio.ai/) and [GPT4All](https://gpt4all.io/index.html) are two prominent GUIs that make installation and use a breeze.
  



In [36]:
from langchain_community.llms import Ollama
ollama = Ollama(
    base_url='http://localhost:11434',
    model="mistral"
)
print(ollama.invoke("Who is in the program committee of the LDAC 2024 workshop?"))

 I don't have real-time data, so I can't provide you with the exact Program Committee for the LDAC (Large-scale Data and Applications Conference) 2024 workshop. However, I can tell you that typically, such committees are composed of respected researchers, experts, and practitioners in the field. They are often selected based on their experience, contributions to the relevant research areas, and ability to review papers effectively. If you're interested in the LDAC 2024 workshop, I recommend checking the official conference website for updates as they become available.


# Ingest
Let's retrieve some information to embed.
Why don't we start with our own context: LDAC 2024! Yay!


In [37]:
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://linkedbuildingdata.net/ldac2024/")

data = loader.load()

Before the page can be fed to our RAG, let's chop the text up into chunks. There are many loaders and splitters for different types of input data, e.g. PDFs.
In this case we are dealing with HTML, and we don't want to pollute our embeddings with HTML tags like `<b> <pre> <div>`, JavaScript, etc.

LangChain has us covered with a number of different splitter variants.

In [38]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter=RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
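
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch of fixed-size chunking in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that one prefers to cut at separators such as paragraph breaks and newlines); the helper name `naive_chunks` and the tiny sizes are made up for illustration.

```python
def naive_chunks(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "Linked Data in Architecture and Construction"
for chunk in naive_chunks(sample, chunk_size=16, chunk_overlap=4):
    print(repr(chunk))
```

With `chunk_overlap=0`, as in the cell above, neighbouring chunks share no text, so a sentence cut at a chunk boundary is split between two embeddings.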

The raw data is ready, so we can feed it into a vector store. We are using Chroma. There are a lot of different options available, and two months from the time of this writing, things will have broken already... we are living in very dynamic times.

In [39]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
oembed = OllamaEmbeddings(base_url="http://localhost:11434", model="nomic-embed-text")
vectorstore = Chroma.from_documents(documents=all_splits, embedding=oembed)

In [40]:
from langchain_community.llms import Ollama
ollama = Ollama(
    base_url='http://localhost:11434',
    model="mistral"
)
print(ollama.invoke("what is the LDAC and who is part of the committee"))

 LDAC stands for Low-Latency High-Performance Lossless Audio Codec. It's a modern audio coding technology developed by Sony. LDAC provides a wireless Bluetooth audio transmission method with high-fidelity sound close to that of CD-quality, while keeping the data efficiency necessary for wireless transmission.

As for who is part of the committee, it seems there isn't a specific publicly known committee for the development of LDAC. However, Sony Electronics Inc. is the primary developer and promoter of this technology. Other companies may have collaborated or contributed to its development as part of industry partnerships, but the exact details aren't openly disclosed. It's always a good idea to check the official websites or publications from Sony for more accurate information regarding LDAC development.


Let's see which documents the vector store finds as most similar to our question. By default, `similarity_search` returns the four closest chunks.

In [41]:
question = "who is in the program committee of the LDAC 2024?"
docs = vectorstore.similarity_search(question)
len(docs)

4

In [42]:
docs

[Document(page_content='Programme Committee\n\n\n\nJakob Beetz\nCalin Boje\nLasitha Chamari\nDavid Chaves\nAndrea Cimmino\nGonçal Costa\nAaron Costin\nAlex Donkers\n\n\n\n\n\n\xa0\n\n\n\nDiellza Elshani\nRaúl García-Castro\nPhilipp Hagedorn\nAna Iglesias-Molina\nMaxime Lefrançois\nDimitris Mavrokapnidis\nClaudio Mirarchi\nJyrki Oraskari\n\n\n\n\n\n\xa0\n\n\n\nNicolas Pauen\nPieter Pauwels\nEkaterina Petrova\nMaría Poveda-Villalón\nDimitrios Rovas\nAna Roxin\nOliver Schulz\n\n\n\n\n\n\xa0\n\n\n\nMadhumitha Senthilvel\nÁlvaro Sicilia\nDaniele Spoladore\nWalter Terkaj\nEdlira Vakaj\nJeroen Werbrouck\nSven Zentgraf\n\n\n \n \n\n\n\nImportant Dates\n\n\n\n                                Registration opens: \nJanuary 19, 2024 January 31, 2024\n                                Abstract submission deadline (optional - no review): \nJanuary 19, 2024 January 31, 2024\nPaper submission deadline: \nFebruary 02, 2024 February 16, 2024\t\t\t\t\t\n                                Poster track submissio
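
What the vector store does here can be sketched in a few lines: embed the question, compare that vector with every stored chunk vector, and return the closest ones. The toy below uses cosine similarity on made-up three-dimensional vectors; Chroma's actual index and distance metric may differ, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors most similar to query_vec."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Made-up "embeddings"; real ones have hundreds of dimensions.
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.3, 0.1],
]
query_vec = [1.0, 0.0, 0.0]
print(top_k(query_vec, doc_vecs, k=2))
```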

In [43]:
from langchain.chains import RetrievalQA
qachain=RetrievalQA.from_chain_type(ollama, retriever=vectorstore.as_retriever())
res = qachain.invoke({"query": question})
print(res['result'])

 The Program Committee of LDAC 2024 consists of the following individuals: Jakob Beetz, Calin Boje, Laszlo Bujtor, Laszlo Garai, Laszlo Konya, Laszlo Molnar, Balazs Nagy, Attila Pinter, Miklos Szegedi, Janos Vincze, Zoltan Nagy, Peter Palotas, Csaba Pinter, Imre Sinka, Gabor Tardos, Andras Vanek, Gergely Varro, Istvan Veres, Bence Kovacs, Gyorgy Palfi, Balazs Bajko, Daniel Kiss, Janos Molnar, Zoltan Horvath, Peter Palotas, Jeno Farkas, Laszlo Szigeti, Robert Fiser, Gabor Szelei, Zoltan Toth, Istvan Gyorfi, Tamas Fogarasi, Zsolt Pallagi, Attila Fekete, Peter Nagy, Tamas Racz, Balazs Danko, Tamas Tardos, Jozsef Kovacs, Gabor Papp, Tibor Szalai, Peter Varhelyi, Andras Kiss, Bela Toth, Miklos Varga, Csaba Farkas, Laszlo Racz, Attila Nagy, Zoltan Olah, Laszlo Toth, Janos Kovacs, Csaba Horvath, Gabor Mihaly, Attila Tordai, Laszlo Veres, Imre Szep, Balazs Farkas, Andras Biro, Istvan Varga, Peter Kovacs, Tibor Vizler, Janos Papp, Gergely Varga, Zoltan Bartha, Andras Szilagyi, Attila Horvath, L
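
Conceptually, a retrieval QA chain pastes the retrieved chunks into the prompt ahead of the question before calling the LLM. The sketch below imitates that step in plain Python. The template wording and the `build_prompt` helper are assumptions for illustration, not LangChain's actual prompt, and the chunks are abbreviated excerpts from the search result shown earlier.

```python
def build_prompt(question, chunks):
    """Join the retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Abbreviated stand-ins for the chunks returned by the vector store.
chunks = [
    "Programme Committee\nJakob Beetz\nCalin Boje",
    "Important Dates\nPaper submission deadline: February 16, 2024",
]
print(build_prompt("Who is in the programme committee of LDAC 2024?", chunks))
```

Printing the assembled prompt like this is a handy way to check what the model really sees when an answer looks off.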

In [44]:
question="what is the semantic scope with respect to data of the LDAC"
docs = vectorstore.similarity_search(question)
qachain=RetrievalQA.from_chain_type(ollama, retriever=vectorstore.as_retriever())
res = qachain.invoke({"query": question})
print(res['result'])

 The semantic scope with respect to data for the LDAC (Linked Data in Architecture and Construction) conference pertains to the usage of semantic web, linked data, and web of data technologies in the context of architecture and construction. This includes research related to design, engineering, construction, and operation within these fields.


In [None]:
import json

def save_vectorstore_to_file(vectorstore, filename):
    # Fetch the stored texts and metadata in a single call.
    # Note: `_collection` is a private attribute of the LangChain Chroma wrapper,
    # and the embeddings themselves are not returned by default.
    stored = vectorstore._collection.get()

    # Prepare data for serialization
    data_to_save = {
        "documents": stored["documents"],
        "metadatas": stored["metadatas"]
    }

    # Save to file
    with open(filename, 'w') as f:
        json.dump(data_to_save, f)

# Save the vectorstore contents to a JSON file
save_vectorstore_to_file(vectorstore, 'vectorstore_data.json')
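
Reading the file back is plain JSON. A minimal sketch, assuming a file with the `documents` and `metadatas` keys used above; `load_vectorstore_data` is a hypothetical helper name, and the round trip uses stand-in data so it runs without Chroma or Ollama. Since the embeddings are not in the file, the texts would have to be re-embedded to rebuild a vector store.

```python
import json
import os
import tempfile


def load_vectorstore_data(filename):
    """Return (documents, metadatas) from a JSON file with those two keys."""
    with open(filename, encoding="utf-8") as f:
        data = json.load(f)
    return data["documents"], data["metadatas"]


# Stand-in data in the same layout as the saved file.
sample = {
    "documents": ["Programme Committee ..."],
    "metadatas": [{"source": "https://linkedbuildingdata.net/ldac2024/"}],
}
path = os.path.join(tempfile.mkdtemp(), "vectorstore_data.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f)

documents, metadatas = load_vectorstore_data(path)
print(len(documents), metadatas[0]["source"])
```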
