A simple pipeline that:
- reads all the PDF files from a folder
- chunks the data
- applies embeddings
- saves the vectors to pgvector and Qdrant (preprod); only the Qdrant path is shown here
- retrieves similar chunks from the vector store
- answers a question using Mistral


## Import libraries

In [11]:
import requests
import json
import os
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Qdrant
from dotenv import load_dotenv



## Set up initial parameters

In [12]:
load_dotenv()
local_path=os.getenv("LOCAL_PATH")
collection_name=os.getenv('COLLECTION_NAME')
embedding_model=os.getenv('EMBEDDING_MODEL')
chunk_size=int(os.getenv('CHUNK_SIZE'))
chunk_overlap=int(os.getenv('CHUNK_OVERLAP'))
pgddisconnection=os.getenv('PGDDISCONNECTION')
qdrant_url = os.getenv("QDRANT_URL", "")
qdrant_api_key = os.getenv("QDRANT_API_KEY", "")
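`os.getenv` returns `None` for an unset variable, so `int(os.getenv('CHUNK_SIZE'))` fails with a `TypeError` that does not name the missing setting. A minimal sketch of a stricter loader is shown below; the helper names `require_env` and `require_int_env` are hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if value is None or value.strip() == "":
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


def require_int_env(name: str) -> int:
    """Return an environment variable parsed as an integer."""
    raw = require_env(name)
    try:
        return int(raw)
    except ValueError as exc:
        raise RuntimeError(f"{name} must be an integer, got {raw!r}") from exc
```

With these helpers, the chunking parameters could be read as `chunk_size = require_int_env("CHUNK_SIZE")`, which reports the offending variable by name.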



## Set up embedding model

In [13]:
# Local Hugging Face embeddings running on CPU; the model name comes from EMBEDDING_MODEL
embeddings = HuggingFaceEmbeddings(
            model_name=embedding_model,
            model_kwargs = {'device': 'cpu'})


## Load Data

In [14]:
print(local_path)

data/pdfs


In [15]:
loader = DirectoryLoader(local_path, glob="./*.pdf")

documents = loader.load()
len(documents)

1

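## Recursive Text Splitter => Chunks

No custom separators are passed, so the splitter falls back to its default separator list (`"\n\n"`, `"\n"`, `" "`, `""`); only chunk size and overlap are configured.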

In [16]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
texts = text_splitter.split_documents(documents)
len(texts)


60
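To make the effect of chunk size and overlap concrete, here is a simplified sketch. It uses a fixed sliding window over characters, which is an assumption made for illustration: `RecursiveCharacterTextSplitter` first splits on separators and then merges the pieces, so its chunk boundaries will differ. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

A larger overlap repeats more text between neighbouring chunks, which helps a sentence that straddles a boundary remain retrievable at the cost of storing more vectors.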

In [17]:
len(documents)

1

In [19]:
db=Qdrant.from_documents(
    texts,
    embeddings,
    url=qdrant_url,
    api_key=qdrant_api_key,
    port=None,
    collection_name=collection_name,
    force_recreate=True
)



## Getting similar documents from db

In [20]:
question="tell about Normal Working Hours"

In [33]:
docs = db.similarity_search_with_score(question, k=2)
docs

[(Document(metadata={'source': 'data/pdfs/India Associate Handbook V 1.1(23 June).pdf', '_id': '08c6462c-bbc5-4562-83cb-e50ced6cbde3', '_collection_name': 'india_collection_1'}, page_content='As an Associate, it is your responsibility to immediately update any changes in your personal information, such as your home address, telephone number, marital status, and number of dependents on Red Hat’s human resources information system (Workday).\n\nNewly hired Associates should verify and validate their personal information on Red Hat’s human resources information system (Workday) to ensure that their information is accurate.\n\nWORKING HOURS & RECORDING OF TIME WORKED\n\nNormal Working Hours Normal working hours are Monday through Friday, 9:30 a.m. to 6:30 p.m. with breaks as required by law. Your specific normal working hours can be found in your employment agreement and can be modified from time to time by your manager based on your roles and responsibilities, subject to applicable law.\n
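Conceptually, a scored similarity search embeds the question and ranks the stored chunk vectors against it. The sketch below assumes cosine similarity, which may not match the distance configured on the actual Qdrant collection, and uses a brute-force scan rather than Qdrant's index. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (text, score) pairs most similar to query_vec, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Each returned pair has the same shape as the `(Document, score)` tuples above: the matched content followed by its similarity score.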

In [38]:
context = '\n'.join([x[0].page_content for x in docs])


prompt = f"""You are a helpful chatbot that can answer questions based on the provided context. 
You need not make use of the entire context provided to you.
Try to interpret the question. If it is a general question asking for definitions, you can rephrase the content without changing the meaning of it.
If the asked question demands steps or process or procedure, do not change the content and stick to the original form as possible. Also if context has Red Hat specific knowledge add that in answer.
Also provide the source from which you took the answer under source: tag

Context: {context}
Question: {question}"""
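The prompt asks the model to cite a source, but the context is built from `page_content` only, so the model sees a source only when the chunk text happens to mention one. A possible variant, sketched below, prefixes each chunk with its `metadata['source']` value. `Doc` is a hypothetical stand-in for the LangChain `Document` class so the example runs on its own, and `build_context` is likewise an illustrative name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_context(scored_docs) -> str:
    """Join retrieved chunks, labelling each with the file it came from."""
    blocks = []
    for doc, _score in scored_docs:
        source = doc.metadata.get("source", "unknown")
        blocks.append(f"[source: {source}]\n{doc.page_content}")
    return "\n\n".join(blocks)
```

Since the helper only reads `page_content` and `metadata`, it should also accept the `(Document, score)` tuples returned by `similarity_search_with_score`, for example `context = build_context(docs)`.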

In [47]:
len(prompt)

2495

In [42]:
url = 'https://ddis-mistral-7b.apps.int.stc.ai.preprod.us-east-1.aws.paas.redhat.com/v1/chat/completions'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
}
data = {
    "messages": [
        {
            "role": "user",
            "content": prompt
        }
    ],
    "model": "mistral-7b",
    "stream": False
}

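try:
    # verify=False skips TLS certificate verification; only acceptable for a trusted internal endpoint
    response = requests.post(url, headers=headers, data=json.dumps(data), verify=False, timeout=30)
    response.raise_for_status()  # Raise on HTTP 4xx/5xx responses
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")
    raise  # Stop here so later cells never use an undefined or failed response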



In [43]:
print(response.json()['choices'][0]['message']['content']) 

 Normal working hours at Red Hat, as stated in the India Associate Handbook, are from 9:30 a.m. to 6:30 p.m., Monday through Friday. These hours can be found in the employment agreement and may be modified based on roles and responsibilities, subject to applicable law. If an Associate needs to work beyond these hours, also known as Overtime, they must have their manager approve it in writing beforehand and accurately record the Overtime worked to be eligible for additional compensation.

Source: India Associate Handbook | Version - 01.1 <<23-JUNE-2022>>.
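Indexing `response.json()['choices'][0]['message']['content']` directly raises an unhelpful `KeyError` or `IndexError` if the endpoint returns an unexpected body. A small defensive sketch follows; it assumes the response follows the chat-completions shape used above, and `extract_answer` is a hypothetical helper name.

```python
def extract_answer(payload: dict) -> str:
    """Pull the assistant text out of a chat-completions style response body."""
    choices = payload.get("choices") or []
    if not choices:
        raise ValueError("Response contains no choices")
    message = choices[0].get("message") or {}
    content = message.get("content")
    if content is None:
        raise ValueError("First choice has no message content")
    return content.strip()
```

It could be called as `print(extract_answer(response.json()))`, which also trims the leading space visible in the output above.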
