# 3. LLM Pipeline
- based on [LangChain QA](https://python.langchain.com/docs/use_cases/question_answering/) and [LangChain RAG over in-memory documents](https://python.langchain.com/docs/use_cases/question_answering/in_memory_question_answering)
- **make sure to place your OpenAI API key in `.env`**

### ToDos
- use LLM to generate answer to user
- retrieve all (relevant) chunks from all retrieved sources
    - check for each chunk whether it is relevant to the user query
    - chain: let the LLM summarize each chunk or decide whether it is relevant
    - use [refine](https://python.langchain.com/docs/modules/chains/document/refine) or [map reduce](https://python.langchain.com/docs/modules/chains/document/map_reduce)
- rewrite the user query or generate additional queries to improve retrieval
    - see [MultiQueryRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever)
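
The per-chunk relevance check from the list above could be structured roughly as follows. This is only a sketch: `filter_relevant_chunks` and `keyword_judge` are hypothetical names, and the keyword judge is a stand-in for the LLM call that would decide relevance in the real pipeline.

```python
from typing import Callable, List


def filter_relevant_chunks(
    query: str,
    chunks: List[str],
    is_relevant: Callable[[str, str], bool],
) -> List[str]:
    """Keep only the chunks that the judge considers relevant to the query."""
    return [chunk for chunk in chunks if is_relevant(query, chunk)]


def keyword_judge(query: str, chunk: str) -> bool:
    """Hypothetical stand-in for an LLM judge: true if any longer query word occurs in the chunk."""
    words = [w for w in query.lower().split() if len(w) > 3]
    return any(w in chunk.lower() for w in words)


chunks = [
    "Integration von dezentralen Energiequellen und Speichern",
    "Alumni-Netzwerk der RWTH Aachen",
]
relevant = filter_relevant_chunks("dezentralen technologien", chunks, keyword_judge)
print(relevant)
```

Passing the judge as a callable keeps the filtering logic unchanged when the keyword stand-in is later replaced by a function that prompts the LLM.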


## Prerequisites

In [1]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

openai.api_key = os.environ['OPENAI_API_KEY']

In [2]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embedding_function = OpenAIEmbeddings(show_progress_bar=True)

### Load database from disk

In [3]:
vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)

In [4]:
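# The number of entries in the database equals the number of chunks created previously.
# Note: `_collection` is a private attribute of the Chroma wrapper and may change between versions.
print(vectordb._collection.count())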

57834


## Query the database

In [None]:
# German for: "Who works on decentralized technologies?" (the indexed documents are German)
query = "wer arbeitet an dezentralen technologien?"

### Using `semantic similarity search`
- `k` sets the number of documents (here: chunks) to retrieve

In [20]:
results_sss = vectordb.similarity_search(query, k=10)

  0%|          | 0/1 [00:00<?, ?it/s]

In [21]:
# Example of first result retrieved
print(results_sss[0].page_content)

Mitarbeiter, TU Darmstadt 1995 - 1997 Systemingenieur, ESG Elektroniksystem- und Logistik-GmbH Mitglied in Netzwerken AeCS Aero-Club der Schweiz Alumni-Netzwerk der RWTH Aachen Alumni-Netzwerk der TU Darmstadt Projekte LINA: Shared Large-scale Infrastructure for the Development and Safe Testing of Autonomous Systems / Stellv. ProjektleiterIn / Projekt laufend Risiken aus Radardaten / Stellv. ProjektleiterIn / Projekt abgeschlossen SORA Planungswerkzeug für die Genehmigung von zivilen


In [22]:
# Number of different sources retrieved
sources = set()
for result in results_sss:
    sources.add(result.metadata["source"])

print(len(sources))
print(sources)

4
{'suet', 'lehh', 'lieh', 'wele'}


In [23]:
len(results_sss)

10
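
The source count above shows how many distinct sources were hit, but not how the 10 chunks are distributed across them. A small helper could report both. This is a sketch: `count_chunks_per_source` is a hypothetical name, and `FakeDocument` is a stand-in that only mimics the `.metadata` attribute of the retrieved documents so the example runs without the database.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing `.page_content` and `.metadata` like a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def count_chunks_per_source(results):
    """Return a Counter mapping each source to the number of retrieved chunks from it."""
    return Counter(doc.metadata["source"] for doc in results)


demo_results = [
    FakeDocument("a", {"source": "suet"}),
    FakeDocument("b", {"source": "suet"}),
    FakeDocument("c", {"source": "korb"}),
]
per_source = count_chunks_per_source(demo_results)
print(len(per_source), per_source.most_common())
```

In the notebook, the same helper could be called on `results_sss` directly.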

### Using `maximum marginal relevance`
- balances relevance to the query against diversity among the results

In [24]:
results_mmr = vectordb.max_marginal_relevance_search(query, k=10)

  0%|          | 0/1 [00:00<?, ?it/s]

In [25]:
# Number of different sources retrieved
sources = set()
for result in results_mmr:
    sources.add(result.metadata["source"])

print(len(sources))
print(sources)

8
{'suet', 'lieh', 'wele', 'korb', 'gruj', 'lehh', 'mach', 'bohe'}


In [26]:
results_mmr[2]

Document(page_content='Strategies Arbeits- und Forschungsschwerpunkte, Spezialkenntnisse Elektrische Energietechnik, Regelungstechnik, Netzdynamik, Integration von dezentralen (erneuerbaren) Energiequellen und Speichern, Weitbereichsüberwachung und -regelung, intelligente Leittechnik für Verteilnetze. Aus- und Fortbildung 1999 – 2000, University of Manchester Institute of Science & Technology (UMIST),Control Systems Centre, Grossbritannien, Postdoc.1998, Delft University of Technology, Holland, Dept. of Information', metadata={'source': 'korb'})
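
To illustrate why maximum marginal relevance returns more distinct sources than plain similarity search, the greedy selection can be sketched in pure Python. This is a generic illustration of the MMR idea on toy 2-D vectors, not the code LangChain runs internally; `mmr_select`, `cosine`, and the example vectors are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick up to k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: highest similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


query_vec = [1.0, 0.2]
doc_vecs = [[1.0, 0.0], [1.0, 0.05], [0.5, 1.0]]  # the first two are near-duplicates

print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # diversity-aware
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # pure relevance
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to a plain relevance ranking, so both near-duplicate vectors are returned. With `lambda_mult=0.5` the second pick is the less similar but more diverse vector.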

## Building the context for our LLM prompt
- we can fetch all chunks belonging to a retrieved source (`"shorthandSymbol"`) and feed them to a `refine` chain to build the context for our prompt

In [None]:
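# Fetch every chunk stored for the source of the top similarity-search hit
# (results_mmr could be used here in the same way).
complete_query = vectordb.get(where={"source": results_sss[0].metadata["source"]})
all_chunks_from_single_source = complete_query["documents"]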

In [None]:
print(len(all_chunks_from_single_source))
print(all_chunks_from_single_source[0])
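
The `refine` idea mentioned at the top of this section can be sketched without any LLM dependency: answer from the first chunk, then ask the model to revise that answer once per additional chunk. This is a simplified illustration, not LangChain's refine chain. `refine_answer`, the prompt wording, and `echo_llm` (a stand-in for a real model call) are all assumptions.

```python
from typing import Callable, List


def refine_answer(question: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """Build an answer from the first chunk, then refine it with each further chunk."""
    if not chunks:
        return ""
    answer = llm(f"Question: {question}\nContext: {chunks[0]}\nAnswer:")
    for chunk in chunks[1:]:
        prompt = (
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"Additional context: {chunk}\n"
            "Refine the existing answer if the context helps; otherwise repeat it."
        )
        answer = llm(prompt)
    return answer


calls = []  # records every prompt sent to the stand-in model


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns a numbered draft."""
    calls.append(prompt)
    return f"draft {len(calls)}"


final = refine_answer(
    "wer arbeitet an dezentralen technologien?",
    ["chunk A", "chunk B", "chunk C"],
    echo_llm,
)
print(final, len(calls))
```

The model is called once per chunk, so cost grows linearly with the length of `all_chunks_from_single_source`. Filtering for relevant chunks first would keep the loop short.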