# **RAG Implementation**

# Install Required Libraries

LangChain is a framework for building applications that connect language models (such as OpenAI's GPT models) to external tools and data sources. It simplifies the development of intelligent systems by providing modular components for:

* [Prompt Engineering](https://python.langchain.com/docs/concepts/prompt_templates/): Easily create and manage prompts to guide the language model’s responses.

* Chains: Combine multiple steps (e.g., fetching data, processing it, and generating output) into workflows.

* [Tools](https://python.langchain.com/docs/concepts/tools/) & Agents: Extend model capabilities using tools (e.g., search engines, APIs) and empower dynamic decision-making with agents.

* Memory: Maintain state and context in conversations or across tasks.

* Data Augmentation: Integrate retrieval systems (like Pinecone) for Retrieval-Augmented Generation (RAG).

*Official LangChain Documentation: https://python.langchain.com/docs/introduction/*
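To make the "Chains" idea from the list above concrete, here is a minimal plain-Python sketch of composing steps into a pipeline. It does not use LangChain: the `Step` class and the three toy steps are hypothetical names chosen for illustration, and LangChain's own runnables offer far more than this.

```python
from typing import Any, Callable


class Step:
    """Wraps a function so steps can be joined with `|`, similar in spirit to a chain."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # The combined step runs `self` first, then feeds its result to `other`.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


# Three toy steps standing in for "fetch data", "process it", and "generate output".
fetch = Step(lambda query: {"query": query, "facts": ["fees", "cut-offs"]})
process = Step(lambda data: f"{data['query']} -> {', '.join(data['facts'])}")
generate = Step(lambda text: text.upper())

pipeline = fetch | process | generate
result = pipeline.invoke("admission")
print(result)  # ADMISSION -> FEES, CUT-OFFS
```

The same left-to-right reading applies to the `|` pipeline built later in this notebook: each stage receives the previous stage's output.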

In [1]:
pip install langchain



In [2]:
pip install langchain-community


Groq builds super-fast processors designed specifically for artificial intelligence (AI) and machine learning tasks. Their technology focuses on delivering high performance for AI applications, such as natural language processing, computer vision, and large-scale data analysis.

In simple terms, Groq's processors are like turbocharged engines for running AI models. They are built for inference (serving responses from an already trained model) rather than for training, which makes them well suited to real-time applications such as chatbots. In this notebook, Groq's hosted API serves the LLM that generates the answers.

*Groq Website: https://console.groq.com/login*

In [3]:
pip install -qU langchain-groq

In [6]:
%pip install unstructured


Cohere Embeddings are dense vector representations of text generated by Cohere's language models. These embeddings capture semantic meaning, making them ideal for natural language processing (NLP) tasks like:

* Semantic Search: Find text with similar meaning, even if phrased differently.
* Clustering: Group similar documents, sentences, or phrases.
* Recommendation Systems: Match users with relevant content or products based on semantic similarity.
* Retrieval-Augmented Generation (RAG): Retrieve contextually relevant information for LLMs.
* Sentiment Analysis: Understand sentiment or intent in text data.

Cohere embeddings are pre-trained, highly effective for zero-shot and transfer learning tasks, and easy to integrate into applications for enhanced AI capabilities.

*Official Website:* https://docs.cohere.com/v2/docs/embeddings
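Semantic search over embeddings usually comes down to comparing vectors, and cosine similarity is a common way to do that. The sketch below uses hand-written three-dimensional toy vectors, not real Cohere output (real embeddings have far more dimensions), purely to show how similar meanings end up with higher scores.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written toy vectors (NOT real Cohere embeddings).
toy_embeddings = {
    "What are the tuition fees?": [0.9, 0.1, 0.0],
    "How much does the course cost?": [0.8, 0.2, 0.1],
    "Where is the hostel located?": [0.1, 0.2, 0.9],
}

query = toy_embeddings["What are the tuition fees?"]
for text, vector in toy_embeddings.items():
    # Print each sentence with its similarity to the query sentence.
    print(f"{cosine_similarity(query, vector):.3f}  {text}")
```

The two fee-related sentences score close to each other, while the hostel sentence scores much lower, even though none of them share many words.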

In [61]:
pip install -qU langchain-cohere

In [None]:
%pip install -qU langchain-text-splitters

Chroma Vector Database is a specialized open-source database designed for storing and querying vector embeddings, making it ideal for AI applications involving large language models (LLMs). Key features include:

* Vector Storage: Efficiently stores high-dimensional embeddings generated from text, images, or other data.
* Semantic Search: Supports similarity searches to find contextually related items.
* Integrations: Works seamlessly with LLMs and embedding models for use cases like Retrieval-Augmented Generation (RAG).
* Scalable and Fast: Handles large datasets with low-latency queries.
* Open-Source: Fully transparent and customizable for developers.

Chroma is widely used in applications like chatbots, recommendation systems, and knowledge retrieval, where quick and accurate access to semantic information is essential.

*Official Website:* https://www.trychroma.com/

In [64]:
pip install langchain-chroma


# Environment Setup

In [42]:
from google.colab import userdata
import os
from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_cohere import CohereEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_groq import ChatGroq
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

**Generate API Keys**:

* Groq: https://console.groq.com/keys

* Cohere: https://dashboard.cohere.com/api-keys

In [None]:
os.environ["COHERE_API_KEY"] = userdata.get('COHERE_API')
os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')
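`google.colab.userdata` is only available inside Colab. For running the notebook elsewhere, one possible approach is sketched below: read the key from an environment variable and prompt for it only when it is missing. The helper name `ensure_api_key` is an assumption made for this example, not part of any library.

```python
import os
from getpass import getpass


def ensure_api_key(name: str) -> str:
    """Return the key stored in environment variable `name`, prompting once if it is missing."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed key so it does not appear in the notebook output.
        value = getpass(f"Enter {name}: ")
        os.environ[name] = value
    return value


# Example usage (commented out so the cell does not block waiting for input):
# ensure_api_key("COHERE_API_KEY")
# ensure_api_key("GROQ_API_KEY")
```

Either way, keys should stay out of the notebook source so they are not committed or shared by accident.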

# RAG Code

**Dataset:**

Collegedunia Webpage: https://collegedunia.com/college/13585-government-college-of-engineering-gcek-satara/admission

In [8]:
file_path = "/content/Government College of Engineering_collegeduniya.html"

loader = UnstructuredHTMLLoader(file_path)
data = loader.load()

In [10]:
print(data)

[Document(metadata={'source': '/content/Government College of Engineering_collegeduniya.html'}, page_content="Select Goal &\n\nCity\n\nSelect Goal\n\nWrite a ReviewGet Upto ₹500*\n\nExplore\n\nExplore More\n\nStudy AbroadGet upto 50% discount on Visa Fees\n\nTop Universities & Colleges\n\nAbroad Exams\n\nTop Courses\n\nExams\n\nRead College Reviews\n\nNews\n\nAdmission Alerts 2024\n\nEducation Loan\n\nInstitute (Counselling, Coaching and More)\n\nAsk a Question\n\nCollege Predictor\n\nTest Series\n\nPractice Questions\n\nCourse Finder\n\nScholarship\n\nAll Courses\n\nB.Tech\n\nMBA\n\nM.Tech\n\nMBBS\n\nB.Com\n\nB.Sc\n\nB.Sc (Nursing)\n\nBA\n\nBBA\n\nBCA\n\nCourse Finder\n\nAll Courses\n\nB.Tech\n\nMBA\n\nM.Tech\n\nMBBS\n\nB.Com\n\nB.Sc\n\nB.Sc (Nursing)\n\nBA\n\nBBA\n\nBCA\n\nB.Arch\n\nB.Ed\n\nB.Pharm\n\nB.Sc (Agriculture)\n\nBAMS\n\nLLB\n\nLLM\n\nM.Pharm\n\nM.Sc\n\nMCA\n\nBachelor of Physiotherapy\n\nB.Des\n\nM.Planning\n\nB.Planning\n\nAgriculture\n\nArts\n\nCommerce\n\nComputer Appli

In [29]:
processed_data = data[0].page_content

In [30]:
processed_data

"Select Goal &\n\nCity\n\nSelect Goal\n\nWrite a ReviewGet Upto ₹500*\n\nExplore\n\nExplore More\n\nStudy AbroadGet upto 50% discount on Visa Fees\n\nTop Universities & Colleges\n\nAbroad Exams\n\nTop Courses\n\nExams\n\nRead College Reviews\n\nNews\n\nAdmission Alerts 2024\n\nEducation Loan\n\nInstitute (Counselling, Coaching and More)\n\nAsk a Question\n\nCollege Predictor\n\nTest Series\n\nPractice Questions\n\nCourse Finder\n\nScholarship\n\nAll Courses\n\nB.Tech\n\nMBA\n\nM.Tech\n\nMBBS\n\nB.Com\n\nB.Sc\n\nB.Sc (Nursing)\n\nBA\n\nBBA\n\nBCA\n\nCourse Finder\n\nAll Courses\n\nB.Tech\n\nMBA\n\nM.Tech\n\nMBBS\n\nB.Com\n\nB.Sc\n\nB.Sc (Nursing)\n\nBA\n\nBBA\n\nBCA\n\nB.Arch\n\nB.Ed\n\nB.Pharm\n\nB.Sc (Agriculture)\n\nBAMS\n\nLLB\n\nLLM\n\nM.Pharm\n\nM.Sc\n\nMCA\n\nBachelor of Physiotherapy\n\nB.Des\n\nM.Planning\n\nB.Planning\n\nAgriculture\n\nArts\n\nCommerce\n\nComputer Applications\n\nDesign\n\nEngineering\n\nLaw\n\nManagement\n\nMedical\n\nParamedical\n\nPharmacy\n\nScience\n\nArc

In [31]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([processed_data])
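To show what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts text into fixed-width windows sharing a few characters with their neighbours. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also prefers to break on separators such as paragraph breaks and spaces); the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into fixed-width chunks where neighbours share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.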

In [62]:
embeddings_model = CohereEmbeddings(model="embed-english-v3.0")

In [65]:
db = Chroma.from_documents(texts, embeddings_model)

In [67]:
retriever = db.as_retriever(search_kwargs={"k": 5})

In [68]:
retriever

VectorStoreRetriever(tags=['Chroma', 'CohereEmbeddings'], vectorstore=<langchain_chroma.vectorstores.Chroma object at 0x7dd41f475ea0>, search_kwargs={'k': 5})
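The `k` in `search_kwargs={"k": 5}` sets how many of the best-matching chunks the retriever returns. The toy sketch below illustrates that selection step with a plain dictionary as the store and a dot product as the score. Both are assumptions made for brevity: the distance metric Chroma actually applies depends on its configuration, and the chunk labels and vectors here are invented stand-ins.

```python
import heapq


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int) -> list[str]:
    """Return the k stored texts whose vectors score highest against the query (dot product)."""
    def score(item: tuple[str, list[float]]) -> float:
        _, vec = item
        return sum(q * v for q, v in zip(query_vec, vec))

    best = heapq.nlargest(k, store.items(), key=score)
    return [text for text, _ in best]


# Toy store: chunk text -> hand-written vector (stand-ins for real embeddings).
store = {
    "B.Tech first-year fees": [1.0, 0.0],
    "Hostel and campus size": [0.0, 1.0],
    "Fee structure by branch": [0.9, 0.1],
    "Sports facilities": [0.1, 0.9],
}

hits = top_k([1.0, 0.0], store, k=2)
print(hits)  # ['B.Tech first-year fees', 'Fee structure by branch']
```

A larger `k` gives the LLM more context to work with but also more room for unrelated chunks, so the value is worth tuning per dataset.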

In [84]:
template = """
You are an intelligent and helpful chatbot specializing in answering questions about the Government College of Engineering, Karad.
You have access to data from the Collegedunia website {context}, specifically related to this college.
While the data may include information about advertisements or other institutions, your focus must remain strictly on the Government College of Engineering, Karad.

The website data contains the following:
1. Academic programs, sports, admissions, living conditions, cut-offs, and more.
2. Student reviews of the college.
3. Rankings of the college.
4. Other general information specific to the college.

Your task is to:
- Respond concisely and accurately based only on the provided data.
- Ignore any unrelated content, such as advertisements, promotions, or information about other colleges.

Avoid adding subjective phrases or unnecessary context, such as "I think", "The answer to this question is", or "Based on the provided data".

Provide an answer strictly based on the Government College of Engineering, Karad, and do not use information from other colleges.

Input:
- User Query: {input}

Output:
- Your response:
"""

prompt_for_college = PromptTemplate(
    template=template,
    input_variables=["context", "input"]
)
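The `{context}` and `{input}` placeholders are filled in when the chain runs. As a rough illustration, the sketch below fills a shortened toy template with Python's built-in `str.format`; `PromptTemplate` adds input validation and integration with the rest of the chain on top of this kind of substitution. The template text and values here are illustrative only.

```python
# A shortened stand-in for the full template above.
toy_template = """Context: {context}

User Query: {input}

Your response:"""

filled = toy_template.format(
    context="(retrieved chunks would appear here)",
    input="what is the fees for the b.tech",
)
print(filled)

# Leaving out a variable fails loudly instead of sending a half-filled prompt.
try:
    toy_template.format(context="only context given")
except KeyError as missing:
    print(f"Missing template variable: {missing}")
```

One practical consequence: any literal curly braces wanted in the prompt text must be doubled (`{{` and `}}`) so they are not treated as placeholders.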

In [81]:
llm = ChatGroq(model="llama3-8b-8192")
llm

ChatGroq(client=<groq.resources.chat.completions.Completions object at 0x7dd418aae9b0>, async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x7dd418aadd50>, model_name='llama3-8b-8192', model_kwargs={}, groq_api_key=SecretStr('**********'))

In [85]:
rag_chain = (
    {"context": retriever, "input": RunnablePassthrough()}
    | prompt_for_college
    | llm
    | StrOutputParser()
)
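In the chain above, the retriever's output (a list of document objects) is placed into `{context}` as-is, so the prompt receives the list's string form, metadata included. A common refinement is to join only the chunk text first. The sketch below shows such a helper with a small stand-in class; `Doc` and `format_docs` are names chosen for this example, and `Doc` only mimics the `page_content` attribute of LangChain's `Document`.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a retrieved document; only the `page_content` field matters here."""
    page_content: str


def format_docs(docs: list[Doc]) -> str:
    """Join the text of each retrieved chunk, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [Doc("first chunk of text"), Doc("second chunk of text")]
context = format_docs(retrieved)
print(context)
```

With a helper like this, the `context` entry of the chain could be written as `retriever | format_docs` instead of `retriever`. This is an optional change and not something the original notebook does.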

In [88]:
rag_chain.invoke("How is the college infrastructure")

'According to the reviews on Collegedunia, the college infrastructure is described as old and not well-maintained by one student. However, another student mentions that the college campus is around 35 acres, including hostels, and is situated nearby a market place.'

In [90]:
rag_chain.invoke("what is the fees for the b.tech")

'The first year fees for B.Tech at Government College of Engineering, Karad is ₹27836.'

In [93]:
rag_chain.invoke("what is the cut off for Electronics & Telecommunication")

'The cutoff for Electronics & Telecommunication Engineering at Government College of Engineering, Karad is 81.71.'

In [94]:
rag_chain.invoke("what is the cut off for Electronics & Telecommunication")

'According to the GCEK Cutoff 2024, the cutoff percentile for B.Tech Electronics & Telecommunication Engineering is 81.71.'

In [98]:
inp = input("\nEnter Your query: ")
while inp != "bye":
    print(rag_chain.invoke(inp))
    inp = input("\nEnter Your query: ")


Enter Your query: how was your day
I'm just an AI, I don't have personal experiences or days, so I don't have an answer to this question. However, I can provide you with information about the Government College of Engineering, Karad if you'd like.

Please feel free to ask another question, and I'll do my best to provide you with accurate and concise information based on the provided data.

Enter Your query: bye
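The loop above exits only on the exact string `bye` and sends blank input to the chain. A slightly more forgiving variant is sketched below. The function name `chat_loop` and the set of exit words are assumptions for this example, and the answer function is passed in so the logic can be tried without calling the LLM.

```python
from typing import Callable, Iterable


def chat_loop(queries: Iterable[str], answer: Callable[[str], str]) -> list[str]:
    """Answer each query until an exit word is seen; blank queries are skipped."""
    exit_words = {"bye", "exit", "quit"}
    replies = []
    for raw in queries:
        query = raw.strip()
        if query.lower() in exit_words:
            break
        if not query:
            continue  # nothing to send to the chain
        replies.append(answer(query))
    return replies


# Demo with a fake answer function; in the notebook, `rag_chain.invoke` would take its place.
demo = chat_loop(["  fees?  ", "", "Bye", "ignored"], answer=lambda q: f"echo: {q}")
print(demo)  # ['echo: fees?']
```

For interactive use, `queries` could be a generator that repeatedly yields `input("\nEnter Your query: ")`.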
