In [1]:
# Importing the required keys from the .env file
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

HUGGINGFACE_API_KEY = os.getenv("HUGGINGFACE_API_KEY")
GROQ_API_KEY = os.getenv("GROQ_API_KEY")

if not HUGGINGFACE_API_KEY or not GROQ_API_KEY:
    raise ValueError("Please set the environment variables in the .env file")

#### **Step 1: Loading**

Load the PDF with LangChain's `PyPDFLoader`, which returns one `Document` per page. This gives the chatbot access to the raw text that the later steps split, embed, and search.

In [2]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("./data/korean_doc.pdf")
pages = loader.load()

#### **Step 2: Splitting**

Split the loaded pages into smaller, overlapping chunks so that each one fits within the model's context window. The overlap (200 characters here, on chunks of up to 1,024 characters) keeps text that straddles a chunk boundary from losing its surrounding context.

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_size = 1024
chunk_overlap = 200

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

docs = text_splitter.split_documents(pages)
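
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a plain sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class first tries to break on paragraph, line, and word separators); the helper name `split_with_overlap` and the sample string are assumptions made for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Hypothetical sample text, small enough to read the overlap by eye
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, which is the same role the 200-character overlap plays in the cell above.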

#### **Step 3: Embedding**
Use the LaBSE (Language-agnostic BERT Sentence Embedding) model from Hugging Face to convert each chunk into a numerical vector. LaBSE maps text from many languages into one shared vector space, so chunks with similar meaning end up close together even when the question and the document are written in different languages. The retrieval step relies on this property.

In [4]:
from langchain_huggingface import HuggingFaceEmbeddings

embedding = HuggingFaceEmbeddings(
    model_name="sentence-transformers/LaBSE"
)
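
"Close together" is usually measured with cosine similarity. The following is a minimal sketch of that comparison, assuming made-up three-dimensional vectors in place of real LaBSE output (which has 768 dimensions); the variable names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only, not real model output
query_vec = [0.9, 0.1, 0.0]      # stands in for an embedded question
related_vec = [0.8, 0.2, 0.1]    # stands in for a chunk on the same topic
unrelated_vec = [0.0, 0.1, 0.9]  # stands in for a chunk on a different topic

related_score = cosine_similarity(query_vec, related_vec)
unrelated_score = cosine_similarity(query_vec, unrelated_vec)
print(related_score > unrelated_score)
```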


#### **Step 4: Storing**
Store the embeddings in an Epsilla vector database running locally in Docker (the client below connects to `localhost:8888`). Indexing the vectors up front lets the chatbot find relevant chunks with a similarity search instead of rescanning the whole document for every question.

In [6]:
from langchain_community.vectorstores import Epsilla
from pyepsilla import vectordb as vector_db

vectordb = vector_db.Client()

vector_store = Epsilla.from_documents(
    documents=docs,
    embedding=embedding,
    client=vectordb,
    db_path="/vdb/kr",
    db_name="localDB",
    collection_name="KoreanDocCollection"
)

[INFO] Connected to localhost:8888 successfully.
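
Conceptually, a vector store keeps `(text, vector)` pairs and answers "which stored vectors are nearest to this one?". The sketch below is a brute-force, in-memory stand-in for that idea. It says nothing about how Epsilla indexes or searches data; the class name `ToyVectorStore`, the two-dimensional vectors, and the sample texts are all assumptions.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ToyVectorStore:
    """Brute-force nearest-neighbour store, for illustration only."""
    records: list[tuple[str, list[float]]] = field(default_factory=list)

    def add(self, text: str, vector: list[float]) -> None:
        self.records.append((text, vector))

    def search(self, query_vector: list[float], k: int = 2) -> list[str]:
        # Rank every stored vector by Euclidean distance to the query
        ranked = sorted(self.records, key=lambda rec: math.dist(query_vector, rec[1]))
        return [text for text, _ in ranked[:k]]


# Hypothetical chunks with made-up 2-dimensional vectors
store = ToyVectorStore()
store.add("severance pay rules", [1.0, 0.0])
store.add("safety regulations", [0.0, 1.0])
store.add("termination notice", [0.9, 0.2])

hits = store.search([1.0, 0.1], k=2)
print(hits)
```

A real vector database replaces the full scan in `search` with an index so that lookups stay fast as the collection grows.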


#### **Step 5: Retrieval**
Use LangChain's `RetrievalQA` chain with the Llama 3.1 70B model served by Groq. For each question, the retriever fetches the most similar chunks from the vector database and passes them to the LLM as context, so the answer is grounded in the document rather than in the model's general knowledge.

In [7]:
from langchain.chains.retrieval_qa.base import RetrievalQA
from langchain_groq import ChatGroq

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatGroq(
        api_key=GROQ_API_KEY,
        model="llama-3.1-70b-versatile",
    ),
    retriever=vector_store.as_retriever()
)
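
`RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which places all retrieved chunks into a single prompt. The sketch below shows that idea in plain Python. The function name `build_stuffed_prompt` and the prompt wording are assumptions for illustration, not LangChain's actual template.

```python
def build_stuffed_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Join the retrieved chunks into one context block and append the question."""
    context = "\n\n".join(retrieved_chunks)
    # Illustrative wording only; the real chain uses its own template
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# Hypothetical retrieved chunks and question
prompt = build_stuffed_prompt(
    "What is the notice period?",
    ["chunk A", "chunk B"],
)
print(prompt)
```

Because every retrieved chunk is sent in one request, the chunk size and the number of retrieved chunks together determine how much of the model's context window each question consumes.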

#### **Results**

In [12]:
question = "what is this document about?"
result = qa_chain.invoke({"query": question})
print(result["result"])

This document appears to be a collection of questions and answers related to labor laws and employment regulations in India. It covers various topics such as:

* Company employment rules and regulations
* Labor laws and their impact on employer-employee relationships
* Industrial safety and health regulations
* Employment contracts and agreements
* Termination of employment and severance pay
* Employee rights and protections

The document seems to be a guide or a resource for employers, HR professionals, or individuals seeking to understand the labor laws and regulations in India.


In [13]:
question = "which language is the document in?"
result = qa_chain.invoke({"query": question})
print(result["result"])

The document appears to be written in a mix of Korean and English, with some Korean text and some English phrases and sentences. However, the majority of the text seems to be in Korean.


In [15]:
# Hindi for: "Give me a brief description of this document"
question = "इस दस्तावेज़ के बारे में मुझे संक्षिप्त विवरण दें"
result = qa_chain.invoke({"query": question})
print(result["result"])

यह दस्तावेज़ भारत में श्रम कानूनों और विनियमों से संबंधित है, विशेष रूप से कर्मचारियों की छंटनी और उनके अधिकारों के बारे में बताता है। इसमें कर्मचारियों को दी जाने वाली सेवा समाप्ति की सूचना के बारे में जानकारी दी गई है, जिसमें कर्मचारी को दी जाने वाली जानकारी और कंपनी की संपत्ति की वापसी की प्रक्रिया शामिल है।

(English translation of the output above: This document relates to labor laws and regulations in India, and in particular describes employee layoffs and employee rights. It gives information about the notice of termination of service provided to employees, including the information given to the employee and the process for returning company property.)
