<a href="https://colab.research.google.com/github/i-suhas/RAG/blob/main/RAGapplication.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Task
Build a RAG application for "knowledge_base_autism.txt" using Hugging Face models and a vector database, and create a user interface for interacting with the application.

## Load and split the document

### Subtask:
Load the `knowledge_base_autism.txt` file and split it into smaller chunks for processing.


**Reasoning**:
Import necessary libraries, load the text file, split it into chunks, and store the chunks in a variable.



In [None]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the document
# Adjust this path to wherever knowledge_base_autism.txt lives in your Google Drive
loader = TextLoader("/content/drive/MyDrive/ASD-COLAB/autism/knowledge_base_autism.txt")
documents = loader.load()

# Initialize a text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Split the document into chunks
document_chunks = text_splitter.split_documents(documents)

# Display the number of chunks and the first chunk
print(f"Number of chunks: {len(document_chunks)}")
print("First chunk:")
print(document_chunks[0].page_content)

Number of chunks: 13
First chunk:
Title: Autism Knowledge Base — Research, Tech Landscape, and Gaps (2025)

Last updated: 9 September 2025

--------------------------------------------------------------------------------
1) What is Autism Spectrum Disorder (ASD)?
- ASD is a neurodevelopmental condition characterized by differences in social communication/interaction plus restricted or repetitive behaviors/interests. Diagnostic criteria come from DSM‑5‑TR and require persistent deficits in all three areas of social communication and at least two of four categories of restricted/repetitive behaviors. (Reference: DSM‑5‑TR/APA; CDC diagnostic summary)
- Autism is a spectrum—support needs vary (often classified as Levels 1–3 in DSM‑5‑TR).

Key references:
- American Psychiatric Association, DSM‑5‑TR overview for ASD.
- CDC clinical page summarizing DSM‑5 criteria.
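
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, which this toy version does not do. The function name `sliding_chunks` is hypothetical and exists only for this illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.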


In [None]:
!pip install -U langchain-community



In [None]:
!pip install -U langchain-huggingface



## Set up the RAG chain

### Subtask:
Configure a RAG chain that uses a Hugging Face language model and the created vector store to retrieve relevant information and generate answers to user queries.

**Reasoning**:
Import necessary libraries, initialize a Hugging Face language model, create a retriever from the vector store, and set up a RetrievalQA chain.

In [None]:
# Install necessary libraries
!pip install -U transformers



In [None]:
# Install necessary libraries
!pip install -U sentence-transformers faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31.4/31.4 MB 16.1 MB/s eta 0:00:00
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.12.0


In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Initialize a Hugging Face embedding model
# You can choose a different model from the Hugging Face model hub
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create embeddings for the document chunks and store them in a FAISS vector store
db = FAISS.from_documents(document_chunks, embeddings)

print("Vector store created successfully!")

Vector store created successfully!
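
Conceptually, a vector store maps each chunk to a vector and answers a query by returning the chunks whose vectors lie closest to the query vector. The sketch below shows that idea as a brute-force search over hand-made 3-dimensional vectors. Everything in it is illustrative: the vectors are invented rather than produced by `all-MiniLM-L6-v2` (which outputs 384-dimensional embeddings), the chunk labels are placeholders, and `top_k` is a hypothetical helper, not part of FAISS or LangChain.

```python
import math


def top_k(query_vec, indexed, k=2):
    """Return the k (distance, text) pairs closest to query_vec by Euclidean distance."""
    scored = [(math.dist(query_vec, vec), text) for text, vec in indexed]
    return sorted(scored)[:k]


# Toy "embeddings": invented numbers, not real model output.
indexed = [
    ("prevalence statistics", (0.9, 0.1, 0.0)),
    ("diagnostic criteria", (0.1, 0.9, 0.1)),
    ("assistive technology", (0.0, 0.2, 0.9)),
]

query_vec = (0.8, 0.2, 0.1)  # pretend this encodes a question about prevalence
results = top_k(query_vec, indexed, k=2)
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```

A real index avoids comparing the query against every stored vector one by one in Python, but the ranking principle is the same: smaller distance means a more relevant chunk.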


In [None]:
from langchain.chains import RetrievalQA
from langchain_huggingface import HuggingFacePipeline
from transformers import pipeline

# Initialize a Hugging Face language model
# You can choose a different model from the Hugging Face model hub
# Consider a smaller model like 'google/flan-t5-small' if memory is an issue
generator = pipeline("text2text-generation", model="google/flan-t5-base")

llm = HuggingFacePipeline(pipeline=generator)

# Create a retriever from the FAISS vector store
retriever = db.as_retriever(search_kwargs={"k": 3})

# Set up the RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff"  # You can experiment with other chain types like "map_reduce" or "refine"
)

print("RAG chain set up successfully!")

Device set to use cpu


RAG chain set up successfully!
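
With `chain_type="stuff"`, the retrieved chunks are concatenated into a single prompt together with the question, and the model is called once. A rough sketch of that assembly step follows. The template wording and the `build_stuff_prompt` helper are assumptions made for illustration; they are not LangChain's actual prompt or API.

```python
def build_stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Concatenate all retrieved chunks into one context block, then append the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for the k=3 documents a retriever would return.
retrieved = ["<chunk 1 text>", "<chunk 2 text>", "<chunk 3 text>"]
prompt = build_stuff_prompt("What is ASD?", retrieved)
print(prompt)
```

Because every chunk lands in one prompt, the prompt length grows with both `k` and `chunk_size`, which matters for models with a short input limit.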




In [None]:
# Simple Q&A in the notebook
query = input("Enter your query about autism: ")

if query:
    # qa_chain must already be defined by the RAG chain cell above
    response = qa_chain.invoke({"query": query})
    print("\nResponse:")
    print(response["result"])  # RetrievalQA returns its answer under the "result" key

Enter your query about autism: which gender is likely to get autism more?


Token indices sequence length is longer than the specified maximum sequence length for this model (766 > 512). Running this sequence through the model will result in indexing errors



Response:
Male:female ratio remains >3:1; disparities by race/ethnicity have narrowed compared with prior years.
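
The warning in the output above reports that the assembled prompt was 766 tokens, more than the 512-token maximum it cites for this model. Possible remedies include a smaller `k`, a smaller `chunk_size`, or dropping context that does not fit. The sketch below illustrates the last option. It is only an approximation under stated assumptions: `approx_tokens` counts whitespace-separated words rather than real tokenizer tokens, the `reserve` value is arbitrary, and `fit_chunks_to_budget` is a hypothetical helper, not a LangChain feature.

```python
def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_chunks_to_budget(question, chunks, max_tokens, reserve=50):
    """Keep chunks in ranked order until the remaining budget is too small.

    `reserve` sets aside room for the prompt template around the context.
    """
    budget = max_tokens - reserve - approx_tokens(question)
    kept = []
    for chunk in chunks:
        cost = approx_tokens(chunk)
        if cost > budget:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        budget -= cost
    return kept


chunks = ["word " * 200, "word " * 200, "word " * 200]  # three 200-word chunks
kept = fit_chunks_to_budget("which group is diagnosed more often?", chunks, max_tokens=512)
print(f"kept {len(kept)} of {len(chunks)} chunks")
```

For an exact count, the word-based estimate would be replaced by the model's own tokenizer, since subword tokenizers usually produce more tokens than there are words.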


In [None]:
# Simple Q&A in the notebook
query = input("Enter your query about autism: ")

if query:
    # qa_chain must already be defined by the RAG chain cell above
    response = qa_chain.invoke({"query": query})
    print("\nResponse:")
    print(response["result"])  # RetrievalQA returns its answer under the "result" key

Enter your query about autism: Fairness & Bias in autism

Response:
Systematic audits for sex/gender differences (girls/women underidentified), race/ethnicity effects, and socioeconomic factors.
