## Overview

This notebook tests loading the vector database and retrieving Documents from it.

In [1]:
import os
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from pathlib import Path

OPENAI_API_KEY = ""
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
LANGCHAIN_API_KEY = ""
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = LANGCHAIN_API_KEY

In [3]:
import lingtypology.glottolog as glotto

glotto.get_iso_by_glot_id('arap1274')

def get_by_iso(iso_code):
    """Return the language name for an ISO code, with spaces replaced by underscores."""
    name = glotto.get_by_iso(iso_code)
    return name.replace(" ", "_")

## Retrieve From Database

Assumes the database has already been loaded.
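
Since the cells in this section rely on an existing database, a quick check of the persist directory can catch a wrong path before any query runs. The sketch below rests on an assumption about what "loaded" means: it only tests that the directory exists and is non-empty, not that it holds a valid Chroma store. The helper name `chroma_db_exists` is hypothetical.

```python
from pathlib import Path


def chroma_db_exists(path: str) -> bool:
    """Return True if `path` is a directory with at least one entry.

    Assumption: a non-empty directory is treated as "loaded"; the contents
    are not validated as a Chroma store.
    """
    db_dir = Path(path)
    return db_dir.is_dir() and any(db_dir.iterdir())
```

Calling `chroma_db_exists("./chroma_db")` before building the retriever would then flag a missing or empty directory early.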

In [4]:
language = 'min'
language_name = get_by_iso(language)
print(language_name)

CHROMA_PATH = "./chroma_db"
# TEXT_DOC_PATH = f"./resources/{language.lower()}/grammar_book_long.txt"

Minangkabau


## Create a Custom Retriever

In [17]:
from typing import List, Tuple
from pydantic import PrivateAttr
from langchain_core.documents import Document  # Schema for document objects
from langchain_core.retrievers import BaseRetriever  # Base class for retrievers
from langchain_core.vectorstores import VectorStore  # Base class for vector stores

class CustomRetriever(BaseRetriever):
    _vector_store: VectorStore = PrivateAttr()

    def __init__(self, vector_store: VectorStore):
        super().__init__()
        self._vector_store = vector_store

    def _get_relevant_documents(self, query: str) -> List[Document]:
        """Retrieve relevant documents with scores and IDs."""
        results = self._vector_store.similarity_search_with_score(query)
        
        # Store scores and IDs in document metadata for future use
        retrieved_docs = []
        for doc, score in results:
            doc.metadata["score"] = score
            doc.metadata["id"] = doc.metadata.get("id", "unknown_id")
            retrieved_docs.append(doc)
        
        return retrieved_docs
    
    def retrieve_with_scores_and_ids(self, query: str, top_k: int = 5) -> List[Tuple[Document, float, str]]:
        """Retrieve documents with similarity scores and IDs."""
        results = self._vector_store.similarity_search_with_score(query, k=top_k)
        
        retrieved_docs = []
        for doc, score in results:
            doc_id = doc.metadata.get("id", "unknown_id")
            retrieved_docs.append((doc, score, doc_id))
        
        return retrieved_docs
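
The scores returned by the retriever are easiest to use once their direction is clear. With LangChain's Chroma wrapper, `similarity_search_with_score` returns a distance, so lower values mean closer matches. A minimal sketch of filtering and ordering such results follows. It uses plain `(text, score, id)` tuples as a stand-in for the retriever's output, and `filter_by_distance`, the sample data, and the threshold value are all hypothetical.

```python
from typing import List, Tuple

# Stand-in for the (Document, score, id) tuples returned by
# retrieve_with_scores_and_ids; plain strings replace Document objects here.
Result = Tuple[str, float, str]


def filter_by_distance(results: List[Result], max_distance: float) -> List[Result]:
    """Keep results whose distance is at most `max_distance`, closest first."""
    kept = [r for r in results if r[1] <= max_distance]
    return sorted(kept, key=lambda r: r[1])


# Made-up example data, for illustration only
sample = [
    ("chunk about articles", 0.62, "doc-2"),
    ("chunk about demonstratives", 0.31, "doc-1"),
    ("unrelated chunk", 0.95, "doc-3"),
]
closest = filter_by_distance(sample, max_distance=0.7)
print([doc_id for _, _, doc_id in closest])
```

A suitable threshold depends on the embedding model and the distance metric of the collection, so it would need to be chosen by inspecting real results.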



## Query Vectorstore

In [62]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
vectorstore = Chroma(embedding_function=embeddings, persist_directory=CHROMA_PATH)
chroma_client = vectorstore._client

# collections = db_chroma._client.list_collections()
collections = chroma_client.list_collections()

for collection in collections:
    print(collection.name)

# Delete a collection by name
# chroma_client.delete_collection('ilo')

iloko
kalamang
mizo
southern_jinghpaw
minangkabau
langchain


In [63]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
embedding_db = Chroma(collection_name=language_name.lower(),
                      embedding_function=embeddings,
                      persist_directory=CHROMA_PATH
)

retriever = embedding_db.as_retriever(search_kwargs={'k': 2})

In [64]:
retriever.invoke("Are there definite articles?")

[Document(metadata={'source': 'resources/kalamang/grammar_book_long.txt'}, page_content="still day hundred=obj=foc wait\n'[We are] still waiting for the hundredth day.' \nQuantifiers are discussed in more detail in Chapter .quantifier)\nPossessive pronounspossession!pronominal\nBesides nominal possessors, there are two adnominal markers of possession: nominal possessive suffixes and freestanding possessive pronouns. They are described in detail in Chapter . Possessive suffixes attach to the head noun, as in ().\nan kewe-an temun=at paruo\n1sg house-1sg.poss big=obj make\n'I am making my big house.' \nPossessive pronouns occupy a slot between head nouns and demonstratives (more precisely, between quantifiers and attributively used verbs, but no example illustrating this is available). () illustrates a possessive pronoun preceding a demonstrative.\ngambar kain yuwane\npicture 2sg.poss prox\n'this picture of yours' \nDemonstrativesdemonstrative(\nAs introduced in §, Kalamang has five demo

## Using CustomRetriever

In [18]:
# Initialize with a vector store (like FAISS, ChromaDB, etc.)
my_retriever = CustomRetriever(embedding_db)

# Using the standard LangChain interface
relevant_docs = my_retriever.invoke("What is retrieval-augmented generation?")
for doc in relevant_docs:
    print(f"ID: {doc.metadata.get('id')}, Score: {doc.metadata.get('score')}, Content: {doc.page_content}")


ID: unknown_id, Score: 0.47433075308799744, Content: (Elicitation)

Nagari tu

disabuik nagari nan sadang bakambang.

country DEM:dist PV-refer country REL PROG MID-bloom

‘That country can be described as a developing country.’

(Elicitation)

The ba- verbs in examples (228), (229) and (230): babaka, ‘burn’, bagadangan, ‘stretch’, and baiduikan, ‘switch on’, resemble passives but can also be characterised as middle verbs. These verbs involve a change of state but their pivots possess both undergoer and actor elements.

For example, notice that the pivot lauak tu, ‘the fish’, is semantically undergoer-like in (228a) but no external agent is implied. In (228b) the verb baka, ‘burn’, has been passivised by di-. In this example lauak tu is clearly an undergoer and even though an actor argument is not overtly expressed, it is clear that an external participant acted on the fish so that it became roasted. Also compare babaka in (228) to tabaka in example (197) (see Section 5.2.1). In (197) 

## Load ChromaDB with Grammar Books

Languages represented in Grambank that are also in the Back to School paper. Languages are looked up in the languages.csv file.
- min : Minangkabau
- lus : Mizo
- Wolof
- Dinka???
- Chuvash
- gug : Guarani
- kgv : Kalamang
- ilo : Iloko (aka Ilokano)
- kac : Kachin (aka Southern Jinghpaw)
- ntu : Natugu
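
The code-to-name pairs in the list above can be kept in a small lookup table, together with the collection-name convention used in this notebook (spaces replaced by underscores, then lower-cased). This is only a sketch: `LANGUAGES` and `collection_name` are hypothetical names, and the entries listed without an ISO code are left out rather than guessed.

```python
# ISO code -> language name, copied from the list above.
# "kac" uses the name "Southern Jinghpaw", which the list gives as an alias.
LANGUAGES = {
    "min": "Minangkabau",
    "lus": "Mizo",
    "gug": "Guarani",
    "kgv": "Kalamang",
    "ilo": "Iloko",
    "kac": "Southern Jinghpaw",
    "ntu": "Natugu",
}


def collection_name(language_name: str) -> str:
    """Build a collection name: underscores for spaces, then lower case."""
    return language_name.replace(" ", "_").lower()


for iso_code, name in LANGUAGES.items():
    print(iso_code, collection_name(name))
```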

### Setup

Set up the database connections and print the existing collections.

In [55]:
# language = 'min'
# language_name = get_by_iso(language)
# print(language_name)

CHROMA_PATH = "./chroma_db"

languages = ['min', 'lus', 'kac', 'ilo', 'kgv']

for language in languages:
    print(get_by_iso(language))

Minangkabau
Mizo
Southern_Jinghpaw
Iloko
Kalamang


In [59]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
vectorstore = Chroma(embedding_function=embeddings, persist_directory=CHROMA_PATH)
chroma_client = vectorstore._client

collections = chroma_client.list_collections()

for collection in collections:
    print(collection.name)


langchain


In [50]:
collection.id

UUID('ae95619f-5ece-4ede-bf72-996bf59f4339')

### Delete Existing Database

In [58]:
# Delete a collection by name
# chroma_client.delete_collection('mizo')

for collection in collections:
    print(f"Deleting collection {collection.name}")
    chroma_client.delete_collection(collection.name)
    

Deleting collection langchain
Deleting collection minangkabau


### Load Grammar Books into Database

In [21]:
# Inspect the raw contents of one collection
collection = chroma_client.get_or_create_collection(name='minangkabau')
results = collection.get()

# List document IDs
# document_ids = results['ids']
# for doc_id in document_ids:
#     print(f"Document ID: {doc_id}")

# Collect the distinct source files
# my_set = set()
# for metadata in results['metadatas']:
#     my_set.add(metadata['source'])
# print(my_set)

In [30]:
results.keys()

dict_keys(['ids', 'embeddings', 'metadatas', 'documents', 'uris', 'data', 'included'])

In [46]:
def get_resources_list(lang_name):
    """Return the set of source files stored in the collection for `lang_name`."""
    collection = chroma_client.get_or_create_collection(name=lang_name)
    results = collection.get()

    # Each chunk's metadata records the file it was loaded from
    sources = set()
    for metadata in results['metadatas']:
        sources.add(metadata['source'])
    return sources
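
Because the loading cell adds every file each time it runs, rerunning it can store the same book more than once. One possible use of `get_resources_list` is to skip files whose path is already recorded as a `source`. The sketch below assumes the stored `source` strings match `str(path)` for the files on disk. `select_new_resources` and the file names are hypothetical, and the helper works on plain paths so it can be tried without a database.

```python
from pathlib import Path
from typing import Iterable, List, Set


def select_new_resources(resources: Iterable[Path], loaded_sources: Set[str]) -> List[Path]:
    """Return the resources whose path is not yet in `loaded_sources`."""
    return [path for path in resources if str(path) not in loaded_sources]


# Illustrative values only; in the notebook, `loaded` would come from
# get_resources_list(...) and `found` from resource_directory.glob('*.txt').
loaded = {str(Path("resources/mizo/book_a.txt"))}
found = [Path("resources/mizo/book_a.txt"), Path("resources/mizo/book_b.txt")]
print(select_new_resources(found, loaded))
```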

In [61]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

# Create the embedding function once and reuse it for every collection
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

for language in languages:
    language_name = get_by_iso(language)
    print(f"COLLECTION NAME: {language_name}")

    resource_directory = Path(f"./resources/{language_name.lower()}")
    resources = list(resource_directory.glob('*.txt'))

    for resource in resources:
        print(f"Loading resource: {resource}")
        loader = TextLoader(resource)
        pages = loader.load()

        # Split each book into overlapping chunks before embedding
        chunks = text_splitter.split_documents(pages)

        # Add the chunks to the collection named after the language
        db_chroma = Chroma.from_documents(
            documents=chunks,
            collection_name=language_name.lower(),
            embedding=embeddings,
            persist_directory=CHROMA_PATH,
        )

COLLECTION NAME: Minangkabau
Loading resource: resources/minangkabau/adelaar_proto-malayic1992v2_o.txt
Loading resource: resources/minangkabau/zarbaliev_minangkabau1987_o.txt
Loading resource: resources/minangkabau/reibaud_minangkabau2004_o.txt
Loading resource: resources/minangkabau/crouch_minangkabau2009.txt
COLLECTION NAME: Mizo
Loading resource: resources/mizo/weidert_lushai1975_o.txt
Loading resource: resources/mizo/subbarao_mizo1998_o.txt
Loading resource: resources/mizo/chhangte_mizo1993_o.txt
Loading resource: resources/mizo/chhangte_mizo1989_o.txt
COLLECTION NAME: Southern_Jinghpaw
Loading resource: resources/southern_jinghpaw/kurabe_jinghpaw2017_o.txt
Loading resource: resources/southern_jinghpaw/hertz_kachin1902_o.txt
Loading resource: resources/southern_jinghpaw/qingxia-diehl_jingpho2003_s.txt
COLLECTION NAME: Iloko
Loading resource: resources/iloko/espiritu_ilokano1984_o.txt
COLLECTION NAME: Kalamang
Loading resource: resources/kalamang/grammar_book_long.txt
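
The splitter settings in the loading cell (`chunk_size=2000`, `chunk_overlap=200`) mean each chunk holds at most 2000 characters and neighbouring chunks may share up to 200 of them. The sketch below illustrates the overlap idea with a plain fixed-size sliding window. It is a simplification, not the algorithm `RecursiveCharacterTextSplitter` uses, which also tries to break on separators such as paragraph breaks. `sliding_window_chunks` is a hypothetical name.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split `text` into fixed-size chunks that overlap by `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Small numbers make the overlap easy to see
for chunk in sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.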
