## What is RAG?
Retrieval-augmented generation (RAG) is a technique for augmenting an LLM's knowledge with additional data.
A typical RAG application has two main components:
1. **Indexing:** a pipeline for ingesting data from a source and indexing it. This usually happens offline and is also called **data ingestion**.
2. **Retrieval and generation:** the actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model.

[Source](https://python.langchain.com/v0.2/docs/tutorials/rag/) *See LangChain Documentation*
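
The two components can be sketched in plain Python, without any library. This is only an illustration of the flow: the word-overlap scoring, the sample sentences, and the `fake_llm` stub are assumptions made up for the example, not what LangChain does internally.

```python
# Toy two-phase RAG flow: index offline, then retrieve and generate at query time.
# Word-overlap scoring and `fake_llm` are illustrative stand-ins only.

def tokenize(text):
    """Lowercase a string and split it into a set of words."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())

def build_index(documents):
    """Indexing: precompute the word set of every document."""
    return [(doc, tokenize(doc)) for doc in documents]

def retrieve(query, index, k=1):
    """Retrieval: return the k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(index, key=lambda item: len(query_words & item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query, context_docs):
    """Generation input: put the retrieved context in front of the question."""
    context = "\n".join(context_docs)
    return f"Use the context to answer.\nContext:\n{context}\nQuestion: {query}\nAnswer:"

def fake_llm(prompt):
    """Placeholder for a real model call."""
    return f"[model output for a prompt of {len(prompt)} characters]"

# Made-up sample documents, used only to exercise the functions.
docs = [
    "The registration fee is paid once at enrolment.",
    "Classes are held on campus in the morning and evening.",
]
index = build_index(docs)                      # offline step
question = "When are classes held?"
top_docs = retrieve(question, index, k=1)      # run-time step 1
prompt = build_prompt(question, top_docs)      # run-time step 2
answer = fake_llm(prompt)
```

In a real application the word sets are replaced by embedding vectors and `fake_llm` by a hosted or local model, which is what the rest of this notebook sets up.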

In [None]:
!pip install langchain
!pip install langchain-huggingface
!pip install langchain-community
!pip install pypdf
!pip install sentence-transformers==2.2.2
!pip install InstructorEmbedding
!pip install langchain_chroma


## Required Libraries and Sources

## Step 1: Indexing/Data Ingestion: Load, Split and Store
*   **Document loaders** --- load files (*Document loaders are responsible for loading documents from a variety of sources.*)
*   **Text splitters** --- needed when working with long documents or paragraphs (*Text splitters take a document and split it into chunks that can be used for retrieval.*)
*   **Embeddings** --- `langchain_community.embeddings` (*Embedding models take a piece of text and create a numerical representation of it.*)
*   **Vector DB** --- `langchain.vectorstores` (old) / `langchain_chroma` (*Vector stores are databases that can efficiently store and retrieve embeddings.*)

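## Step 2: Retrieval and Generation
*   **Retrievers** --- use the vector DB as a retriever, then pass the retrieved documents on to generation
*   **LLM** --- set the model used for generation
*   **Generation** --- select the type of chain (`RetrievalQA`) with `from_chain_type`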

[how to guide](https://python.langchain.com/v0.2/docs/how_to/)

In [None]:
# Required libraries:
import os

from google.colab import drive
from langchain.chains import LLMChain, RetrievalQA
from langchain_chroma import Chroma
from langchain_community.document_loaders import CSVLoader
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFaceEndpoint

# Mount Google Drive so the FAQ file can be read from it.
drive.mount('/content/drive')

Mounted at /content/drive


# **Load**


In [None]:
# loader = PyPDFLoader(file_path='/content/drive/MyDrive/Colab Notebooks/KS RAG FAQ Chatbot/socialmedia_qaa_unmod.pdf')

In [None]:
from langchain_community.document_loaders import CSVLoader

# Load the FAQ CSV; each row becomes one Document whose `source` metadata is its Question.
loader = CSVLoader(
    file_path="/content/drive/MyDrive/Colab Notebooks/KS RAG FAQ Chatbot/faq5.csv",
    source_column="Question",
    encoding="Windows-1252",
)
data = loader.load()

In [None]:
data

[Document(metadata={'source': 'What is the difference between introductory and advanced courses?', 'row': 0}, page_content='Question: What is the difference between introductory and advanced courses?\nAnswer: Introductory courses are designed for beginners with no prior knowledge in IT or CS, focusing on basic concepts and skills. Advanced courses require some technical background and build on foundational knowledge to cover more complex topics.'),
 Document(metadata={'source': 'I have no tech background. Can I still enroll in these courses?', 'row': 1}, page_content='Question: I have no tech background. Can I still enroll in these courses?\nAnswer: Yes, you can start with the introductory courses, which are specifically designed for individuals with no prior technical experience.'),
 Document(metadata={'source': 'What if I have forgotten the fundamentals of programming or tech concepts?', 'row': 2}, page_content="Question: What if I have forgotten the fundamentals of programming or te
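
The output above shows the shape of each loaded document: one per CSV row, with `page_content` built from `column: value` lines and the `source_column` value copied into the metadata. A rough standard-library imitation of that layout follows. The function `rows_to_documents` and the sample row are hypothetical and exist only for illustration; this is not the loader's actual implementation.

```python
import csv
import io

def rows_to_documents(csv_text, source_column):
    """Mimic the row-per-document layout seen in the loader output."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        # One "column: value" line per CSV column, joined by newlines.
        page_content = "\n".join(f"{column}: {value}" for column, value in row.items())
        metadata = {"source": row[source_column], "row": row_number}
        documents.append({"page_content": page_content, "metadata": metadata})
    return documents

# Made-up sample row, not taken from the FAQ file.
sample_csv = "Question,Answer\nIs there a demo class?,Placeholder answer text.\n"
docs = rows_to_documents(sample_csv, source_column="Question")
```

Because the question and its answer end up in the same `page_content`, a query that resembles a stored question pulls its answer into the retrieved context along with it.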

# **Split**
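
No splitter is applied in this notebook: the loaded CSV rows are passed to the vector store as they are. For longer sources, such as the PDF mentioned in the Load step, fixed-size chunks with some overlap are one common approach. The sketch below shows the idea in plain Python; `split_text` is a hypothetical helper, and the `chunk_size` and `chunk_overlap` values are arbitrary.

```python
def split_text(text, chunk_size=40, chunk_overlap=10):
    """Split text into fixed-size character chunks that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

text = "abcdefghij" * 10  # 100 characters of filler text
chunks = split_text(text, chunk_size=40, chunk_overlap=10)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.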

# **Store**


In [None]:
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
# Instantiated without arguments, so the class's default INSTRUCTOR model is downloaded.
instructor_embeddings = HuggingFaceInstructEmbeddings()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


load INSTRUCTOR_Transformer
max_seq_length  512


In [None]:
from langchain_chroma import Chroma
# Embed every loaded document and store the vectors in a Chroma collection.
vector_db = Chroma.from_documents(documents=data, embedding=instructor_embeddings)

# **Retrieval**


In [None]:
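retriver = vector_db.as_retriever()
# get_relevant_documents() is deprecated in recent LangChain releases; retriver.invoke(query) is the replacement.
relevent_docs = retriver.get_relevant_documents("what will be the learning outcome of advance data science course?")
relevent_docs[0]  # top-ranked document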


Document(metadata={'row': 16, 'source': 'What will I learn in the technical component of the "Intermediate Data Science with Machine Learning" course?'}, page_content='Question: What will I learn in the technical component of the "Intermediate Data Science with Machine Learning" course?\nAnswer: You\'ll learn statistical concepts, data analysis, data visualization, machine learning algorithms, and tools like TensorFlow and scikit-learn through hands-on projects.')
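
Conceptually, the retriever embeds the query and returns the stored documents whose vectors are closest to it. The toy store below illustrates that with word-count vectors and cosine similarity. Both choices are assumptions made for the example: the real embeddings are dense vectors produced by the INSTRUCTOR model, and the distance metric Chroma uses depends on how the collection is configured. The class `ToyVectorStore` and the sample texts are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a word-count vector (a real model returns dense floats)."""
    words = text.lower().replace("?", " ").replace(".", " ").split()
    return Counter(words)

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

class ToyVectorStore:
    """Keeps (text, vector) pairs and ranks them against a query vector."""

    def __init__(self, texts):
        self.entries = [(text, embed(text)) for text in texts]

    def similarity_search(self, query, k=4):
        query_vector = embed(query)
        scored = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vector, entry[1]),
            reverse=True,
        )
        return [text for text, _ in scored[:k]]

# Made-up sample texts, used only to exercise the store.
texts = [
    "Question: What will I learn in the data science course?",
    "Question: Where is the campus located?",
]
store = ToyVectorStore(texts)
results = store.similarity_search("what is taught in data science?", k=1)
```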

# **Generation**

In [None]:
from langchain_huggingface import HuggingFaceEndpoint

In [None]:
from google.colab import userdata
sec_key = userdata.get("huggingfacehubtoken")

In [None]:
import os
os.environ["HUGGINGFACEHUB_API_TOKEN"] = sec_key

In [None]:
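repo_id = "mistralai/Mistral-7B-Instruct-v0.3"
# As the warnings below report, `max_length` and `token` are not fields of HuggingFaceEndpoint
# and are moved into model_kwargs. The class's own fields for these settings are
# `max_new_tokens` and `huggingfacehub_api_token`.
llm = HuggingFaceEndpoint(repo_id=repo_id, max_length=128, temperature=0.3, token=sec_key)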

                    max_length was transferred to model_kwargs.
                    Please make sure that max_length is what you intended.
                    token was transferred to model_kwargs.
                    Please make sure that token is what you intended.


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Setting up the question-answering chain (no custom prompt template is passed, so the chain's default prompts are used)

In [None]:
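# 'refine' answers from the first retrieved document, then revises the answer with each further document.
chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='refine',
    retriever=retriver,
    input_key='Question',  # key under which the query appears in the returned dict
    return_source_documents=False,
)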
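
The `refine` chain type can be pictured as a loop: draft an answer from the first retrieved document, then ask the model to revise that draft once for each remaining document. The sketch below shows only the control flow. The prompt wording and the `recording_llm` stub are assumptions for illustration and are not LangChain's actual prompts.

```python
def refine_answer(question, documents, llm):
    """Build an answer from the first document, then revise it with each later one."""
    if not documents:
        return ""
    answer = llm(f"Context: {documents[0]}\nQuestion: {question}\nAnswer:")
    for document in documents[1:]:
        answer = llm(
            f"Question: {question}\nExisting answer: {answer}\n"
            f"Additional context: {document}\nRefined answer:"
        )
    return answer

calls = []

def recording_llm(prompt):
    """Stand-in model: records the prompt and returns a numbered draft."""
    calls.append(prompt)
    return f"draft {len(calls)}"

final = refine_answer("What are the timings?", ["doc A", "doc B", "doc C"], recording_llm)
```

One consequence of this design is that the model is called once per retrieved document, so latency grows with the number of documents the retriever returns.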

In [None]:
chain('fee structure kiya hoga?')  # Roman Urdu: "What will the fee structure be?"


{'Question': 'fee structure kiya hoga?',
 'result': '\n------------\nAnswer: The fee structure for on-campus programs is as follows: Rs. 1,20,000 for advanced courses and Rs. 60,000 for introductory courses. This fee covers the cost of high-quality education, experienced instructors, and comprehensive career support services. For online programs, the registration fee is 1000 PKR, MERN Stack is 6,000 PKR/month, and Tech Launchpad is 4,000 PKR/month. Financial aid is available in the form of zero upfront tuition for graduates and final year students. The program fees help maintain the resources and facilities necessary for effective learning and training.'}

In [None]:
chain('knowledge streams ki location kaha pe hai?')  # Roman Urdu: "Where is Knowledge Streams located?"

{'Question': 'knowledge streams ki location kaha pe hai?',
 'result': '\n------------\nAnswer: The On-Campus Bootcamp is held at Knowledge Streams’ campus located at 157M Madar e Millat Road, Township, Lahore. However, KStreams Online is designed to provide high-quality IT education globally, accessible from anywhere with an internet connection. Additionally, Knowledge Streams provides comprehensive career support services, including resume building, interview preparation, mock interviews, job search strategies, and networking opportunities with industry professionals. Career support services are included at no extra cost for on-campus trainees. For outstation students, a list of nearby hostels is provided for convenience. More details can be found on the Knowledge Streams website.'}

In [None]:
chain('What are the courses offered by Knowledge streams')

{'Question': 'What are the courses offered by Knowledge streams',
 'result': '\n\nAnswer: KStreams Online offers a variety of courses, including MERN, Introduction to Programming, and Introduction to Data Science & AI. What sets these programs apart is their industry-driven curriculum, expert instructors, comprehensive support system, hands-on projects, networking opportunities with industry leaders, and a strong focus on both technical and soft skills training, making them unique compared to similar offerings elsewhere. Additionally, KStreams provides comprehensive career support services, including resume building, interview preparation, mock interviews, job search strategies, and networking opportunities with industry professionals. Career support services are included at no extra cost for on-campus trainees.'}

In [None]:
chain('kon kon se courses hote hain?')  # Roman Urdu: "Which courses are offered?"

{'Question': 'kon kon se courses hote hain?',
 'result': '\n\nAnswer: Courses include MERN Stack, Python & Django, Data Science: Machine Learning, Cyber Security, Introduction to Programming, and Introduction to Data Science & AI, which is specifically designed for people with no programming background, covering fundamental data science principles and techniques.'}

In [None]:
chain('what are the timings?')

{'Question': 'what are the timings?',
 'result': '\n------------\nAnswer: There are two sessions for the On-Campus Bootcamp, with the first session running from 9am to 2pm, and the second session running from 6:00pm to 10:30pm. For more information about the schedule of a specific course, please ask for it.'}

In [None]:
chain('Data science program ki kiya timings hain?')  # Roman Urdu: "What are the timings of the data science program?"

{'Question': 'Data science program ki kiya timings hain?',
 'result': '\n------------\nQuestion: What are the timings for the data science program suitable for recent graduates in non-CS fields?\nAnswer: The "Introduction to Data Science & AI" course, designed for those with no programming background, is an ideal starting point for recent graduates in non-CS fields who want to transition into IT or CS. The course schedule is flexible, catering to both full-time and part-time students. For more specific information about the course\'s timings, please visit the course\'s official website or contact the admissions office.'}

In [None]:
chain('What will be the fee of Data Science course?')

{'Question': 'What will be the fee of Data Science course?',
 'result': '\n------------\nQuestion: What will be the fee of Data Science course?\nAnswer: The fee for the "Intermediate Data Science with Machine Learning" course is Rs. 120,000 for the on-campus program. If you\'re interested in an online course, the fee for the KStreams Data Science course isn\'t specified, but the registration fee is 1000 PKR. The course covers statistical concepts, data analysis, data visualization, machine learning algorithms, and tools like TensorFlow and scikit-learn through hands-on projects.'}
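
The `result` strings above start with newlines, sometimes a dashed separator, sometimes an echoed `Question:` line, and then an `Answer:` label. Before showing an answer to a user, that prefix could be stripped with a small helper such as the one below. `clean_result` is a hypothetical function written against the patterns visible in these outputs, and the sample strings in the usage lines are shortened placeholders.

```python
import re

def clean_result(result):
    """Strip leading whitespace, a dashed separator, an echoed question, and the 'Answer:' label."""
    text = result.strip()
    # Drop a leading run of dashes such as '------------'.
    text = re.sub(r"^-+\s*", "", text)
    # Drop an optional echoed 'Question: ...' line followed by the 'Answer:' label.
    text = re.sub(r"^(?:Question:[^\n]*\n)?Answer:\s*", "", text)
    return text.strip()

with_separator = clean_result("\n------------\nAnswer: Two sessions.")
without_separator = clean_result("\n\nAnswer: Several courses.")
with_echo = clean_result("\n------------\nQuestion: What is the fee?\nAnswer: Not listed.")
```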

# **Required Data:**
## Documents in PDF format
* Details of Knowledge Streams: what it is, its goal, and where it is located
* Courses offered at each level (introductory and advanced); for each course: prerequisites, curriculum/content, and outcomes
* Fee structure: the detailed fee for every course, plus any offers
* Timings and schedule for every course

## Structured Data (Excel/CSV)
* Synced FAQs (in question-and-answer format)
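
Since the loader relies on the `Question` column and the loaded documents also show an `Answer` column, a quick check of the FAQ file before indexing can catch incomplete rows. The sketch below uses only the standard library. `find_incomplete_rows` is a hypothetical helper and the sample CSV is made up.

```python
import csv
import io

REQUIRED_COLUMNS = ("Question", "Answer")

def find_incomplete_rows(csv_text):
    """Return the indices of rows with an empty Question or Answer."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [
        index
        for index, row in enumerate(reader)
        if not all((row[column] or "").strip() for column in REQUIRED_COLUMNS)
    ]

# Made-up sample: the second data row has no answer.
sample = "Question,Answer\nQ1?,A1\nQ2?,\n"
incomplete = find_incomplete_rows(sample)
```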

### Additionally, links can be added alongside each piece of information