`https://python.langchain.com/docs/tutorials/rag/`

In [1]:
from langchain.schema import Document   # Document class of the pickled docs; adjust the import path to your LangChain version
import gzip, pickle
from langchain_text_splitters import RecursiveCharacterTextSplitter
from helper.academicCloudEmbeddings import AcademicCloudEmbeddings
from langchain_community.vectorstores import FAISS  # `langchain.vectorstores` is deprecated in newer LangChain releases
import streamlit as st

# Indexing
## 1. Load the data
*In our case, the data has to be crawled first (see `crawl.ipynb`). That notebook stores the documents in a gzipped pickle file, which we load here.*

In [2]:
with gzip.open("docs.pkl.gz", "rb") as f:
    docs = pickle.load(f)
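
The load step above mirrors a matching save step in the crawler. As a minimal sketch of that round trip, the snippet below writes and re-reads a gzipped pickle. `crawl.ipynb` is not shown here, so `save_docs`, `load_docs`, the sample records and the temporary file name are all hypothetical, and plain dicts stand in for `Document` objects.

```python
import gzip
import os
import pickle
import tempfile


def save_docs(docs, path):
    """Write a list of documents to a gzip-compressed pickle file."""
    with gzip.open(path, "wb") as f:
        pickle.dump(docs, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_docs(path):
    """Read the list of documents back from a gzip-compressed pickle file."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical stand-ins for crawled pages (plain dicts instead of Document objects).
sample_docs = [
    {"page_content": "First page text", "metadata": {"source": "https://example.org/a"}},
    {"page_content": "Second page text", "metadata": {"source": "https://example.org/b"}},
]

# Use a temporary directory so the sketch does not touch the real docs.pkl.gz.
sample_path = os.path.join(tempfile.mkdtemp(), "docs_sample.pkl.gz")
save_docs(sample_docs, sample_path)
restored_docs = load_docs(sample_path)
```

Unpickling can execute arbitrary code, so only load pickle files that you created yourself or otherwise trust.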

## 2. Split the loaded data
We split large documents into smaller chunks before indexing them and passing them to a model. Large chunks are harder to retrieve precisely, since a single embedding has to represent more text.

In [3]:
# Split the documents; every chunk keeps its source URL as metadata
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_documents(docs)
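
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sliding-window sketch. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which tries the separators in order and merges pieces up to the size limit; the function `sliding_window_chunks` and the sample text are hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# 25 characters of repeating digits make the overlap easy to see.
sample_text = "".join(str(i % 10) for i in range(25))
sample_chunks = sliding_window_chunks(sample_text, chunk_size=10, chunk_overlap=4)
```

Each chunk starts with the last four characters of the previous one, so a sentence that straddles a boundary still appears in full in at least one chunk as long as it is shorter than the overlap.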

## 3. Store
We store the created chunks in two steps:
1. Create embeddings using the GWDG model.
2. Store them in a FAISS vector store, which can be saved locally and reused later. That way, the store does not have to be rebuilt every time the app starts.

In [4]:
# Embeddings and FAISS
embedder = AcademicCloudEmbeddings(
    api_key=st.secrets["GWDG_API_KEY"],
    url=st.secrets["BASE_URL_EMBEDDINGS"],
)
store = FAISS.from_documents(chunks, embedder)
store.save_local("faiss_wiki_index")
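
As a rough picture of what the vector store does at query time, the sketch below embeds a query and returns the most similar chunk. Everything in it is a toy assumption: `toy_embed` counts words from a fixed vocabulary instead of calling the GWDG model, the search is a linear scan with cosine similarity rather than a FAISS index, and the metric of the real index may differ.

```python
import math


def toy_embed(text, vocabulary):
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, texts, vocabulary, k=1):
    """Return the k texts whose toy embeddings are most similar to the query."""
    query_vector = toy_embed(query, vocabulary)
    scored = [(cosine_similarity(query_vector, toy_embed(t, vocabulary)), t) for t in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical vocabulary and chunks, chosen only for illustration.
vocabulary = ["exam", "library", "canteen", "hours", "registration"]
toy_chunks = [
    "exam registration opens in january",
    "library opening hours on weekends",
    "canteen menu for this week",
]
top_hits = search("when are the library hours", toy_chunks, vocabulary, k=1)
```

In the app itself, the saved folder is typically reloaded with `FAISS.load_local("faiss_wiki_index", embedder)` using the same embedder; recent LangChain versions also require `allow_dangerous_deserialization=True`, because part of the saved index is a pickle file.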
