<a href="https://colab.research.google.com/github/khushiiagrawal/GenAI-Workshop-Projects/blob/main/FileQA_opensourcemodel_langchain_day_2_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Setup in Google Colab**

In [None]:
!pip install langchain langchain-community
!pip install faiss-cpu
!pip install pypdf docx2txt
!pip install sentence-transformers
!pip install transformers




**Upload File in Colab**

In [None]:
from google.colab import files
uploaded = files.upload()

file_path = list(uploaded.keys())[0]
print("Uploaded:", file_path)


Saving KAR.txt to KAR.txt
Uploaded: KAR.txt


**Load & Split Document**

In [None]:
from langchain_community.document_loaders import PyPDFLoader, TextLoader, Docx2txtLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

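# Pick a loader based on the file extension (case-insensitive, so ".PDF" also matches)
ext = file_path.lower()
if ext.endswith(".pdf"):
    loader = PyPDFLoader(file_path)
elif ext.endswith(".docx"):
    loader = Docx2txtLoader(file_path)
else:
    # Anything else is treated as plain text and read as UTF-8
    loader = TextLoader(file_path, encoding="utf-8")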

try:
    docs = loader.load()
    print(f"Number of loaded documents: {len(docs)}")
    # Print content of docs for debugging
    # print(docs)

    if not docs:
        print("Warning: No documents were loaded.")
        documents = []
    else:
        # Split into smaller chunks
        splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
        documents = splitter.split_documents(docs)

    print(f"Total Chunks: {len(documents)}")

except Exception as e:
    print(f"An error occurred during document loading or splitting: {e}")
    documents = []

Number of loaded documents: 1
Total Chunks: 9
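
To make the `chunk_size=500, chunk_overlap=100` settings concrete, here is a simplified sketch of overlapping chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); the function name `chunk_text` is hypothetical and it only illustrates the sliding-window idea.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> List[str]:
    """Cut text into fixed-width character windows.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Made-up sample text: 1200 characters cycling through the alphabet
sample = "".join(chr(97 + i % 26) for i in range(1200))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.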


**Create Embeddings & Vector Store**

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

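# Small, fast sentence-embedding model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# FAISS cannot build an index from an empty list, so fail early with a clear message
if not documents:
    raise ValueError("No chunks to index; check that the uploaded file loaded correctly.")

vectorstore = FAISS.from_documents(documents, embeddings)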
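
Conceptually, the vector store keeps one embedding per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. The sketch below shows that idea as a brute-force search with Euclidean distance, which, as far as I know, is the default metric for this FAISS wrapper. The 3-dimensional vectors are made up for illustration (real MiniLM embeddings have many more dimensions), and `top_k` is a hypothetical helper, not a LangChain or FAISS function.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec: Sequence[float],
          chunk_vecs: List[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (chunk index, distance) for the k chunks nearest to the query."""
    scored = [(i, l2_distance(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy embeddings, one per chunk (illustrative values only)
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.0, 0.0]

nearest = top_k(toy_query, toy_vectors, k=2)
print([index for index, _ in nearest])
```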


**Load an Open-Source LLM**

In [None]:
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline

# Load a small instruction-tuned model that runs on CPU
flan_pipeline = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",   # switch to google/flan-t5-small for faster inference
    max_length=512                 # upper bound on the number of generated tokens
)

llm = HuggingFacePipeline(pipeline=flan_pipeline)


Device set to use cpu


**Retrieval Q&A**

In [None]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    chain_type="stuff"
)

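# Test query; invoke() returns a dict whose "result" key holds the answer
# (qa.run() still works in older LangChain releases but is deprecated)
query = "Give me a short summary of the document"
print(qa.invoke({"query": query})["result"])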


The economy of Karnataka is among the most productive in the country with a gross state domestic product (GSDP) of 25.01 trillion (US$300 billion) and a per capita GSDP of 332,926 (US$3,900) for the financial year 2023–24.[15][16] The economy of Karnataka is among the most productive in the country with a gross state domestic product (GSDP) of 25.01 trillion (US$300 billion) and a per capita GSDP of 332,926 (US$3,900) for the financial year 2023–24.[15][16] The economy of Karnataka is among the most productive in the country with a gross state domestic product (GSDP) of 25.01 trillion (US$300 billion) and a per capita GSDP of 332,926 (US$3,900) for the financial year 2023–24.[15][16] The economy of Karnataka is among the most productive in the country with a gross state domestic product (GSDP) of 25.01 trillion (US$300 billion) and a per capita GSDP of 332,926 (US$3,900) for the financial year 2023–24.[15][16] The economy of Karnataka is among the most productive in the country with a 
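
With `chain_type="stuff"`, the retrieved chunks (here `k=3`) are all placed into a single prompt together with the question. A minimal sketch of that assembly step follows; `build_stuff_prompt` is a hypothetical helper and the prompt wording is illustrative, not the template LangChain actually uses.

```python
from typing import List


def build_stuff_prompt(chunks: List[str], question: str) -> str:
    """Concatenate all retrieved chunks into one context block, then append the question."""
    context = "\n\n".join(chunks)  # blank line between chunks
    # Illustrative wording only; the library's real template differs
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunk texts standing in for the retriever's output
retrieved = ["chunk one text", "chunk two text", "chunk three text"]
prompt = build_stuff_prompt(retrieved, "Give me a short summary of the document")
print(prompt)
```

Because everything goes into one prompt, the number of retrieved chunks multiplied by the chunk size has to fit within the model's input window, which is one reason to keep both `k` and `chunk_size` small with a compact model.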

**Chat Loop**

In [None]:
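while True:
    q = input("Ask a question (or 'exit'): ").strip()
    if q.lower() == "exit":
        break
    if not q:
        continue  # ignore empty input instead of querying the chain
    print("Answer:", qa.invoke({"query": q})["result"])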


Answer: The economy of Karnataka is among the most productive in the country with a gross state domestic product (GSDP) of 25.01 trillion (US$300 billion) and a per capita GSDP of 332,926 (US$3,900) for the financial year 2023–24.
Answer: no
Answer: no
Answer: yes
Answer: yes
Answer: Rajendra Prasad
Answer: Karnataka has contributed significantly to both forms of Indian classical music, the Carnatic and Hindustani minority languages spoken include Urdu, Konkani, Marathi, Tulu, Tamil, Telugu, Malayalam, Kodava and Beary.
Answer: Rajendra Prasad
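
The loop above is tied to `input()` and the live chain, which makes it hard to check without a model loaded. One possible refactoring, sketched here under the assumption that the answering step can be passed in as a plain function, separates the loop logic from both; `chat` and `fake_answer` are hypothetical names used only for this example.

```python
from typing import Callable, Iterable, List


def chat(answer_fn: Callable[[str], str], questions: Iterable[str]) -> List[str]:
    """Feed questions to answer_fn until 'exit' is seen; return the answers."""
    answers = []
    for q in questions:
        q = q.strip()
        if q.lower() == "exit":
            break
        if not q:
            continue  # skip blank lines
        answers.append(answer_fn(q))
    return answers


def fake_answer(question: str) -> str:
    """Stand-in for the QA chain so the loop can be exercised without a model."""
    return f"echo: {question}"


print(chat(fake_answer, ["What is GSDP?", "   ", "EXIT", "never asked"]))
```

In the notebook, `answer_fn` could be a small wrapper around the `qa` chain and `questions` could come from repeated `input()` calls.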
