<a href="https://www.kaggle.com/code/bubblyboy/anybioinforma?scriptVersionId=236590575" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 🧬 AnyBioInforma: Bioinformatics Research Assistant Using Gemini, FAISS & RAG

## 🔬 Project Overview

This notebook implements an intelligent **Bioinformatics Assistant** that helps researchers analyze **genomic data** using the power of Google's **Gemini model**, **vector search**, and **retrieval-augmented generation (RAG)**.

The assistant is designed to:

1. **Read and understand biomedical PDFs**, extracting scientific insights.  
2. **Build a semantic knowledge base** using chunked embeddings and FAISS vector storage.  
3. **Retrieve relevant scientific context** from the document corpus when answering queries.  
4. **Generate practical bioinformatics guidance**—not just theoretical answers.  
5. **Adapt responses to multiple languages** (English, Amharic, Arabic, Spanish) for global accessibility.

---

## 🌍 Multilingual Support

The assistant supports the following languages:

- English (`en`)  
- Español (`es`)  
- Amharic (`am`)  
- Arabic (`ar`)

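Language-specific UI messages are stored in a `TEXTS` dictionary keyed by language code. In the code below only the `en` entry is populated, so status messages fall back to English for the other languages; the language of the model's answer is set separately through the prompt.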
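
As a minimal sketch of how such a lookup with an English fallback might work, consider the snippet below. The helper `get_text` and the Spanish entry are hypothetical and only illustrate the idea; they are not part of the notebook.

```python
# Hypothetical, trimmed-down message table; the Spanish entry is illustrative only.
TEXTS = {
    "en": {"thinking": "Thinking...", "sources": "Sources"},
    "es": {"thinking": "Pensando..."},
}

def get_text(key, language="en"):
    """Return the message for `key`, falling back to English if it is missing."""
    localized = TEXTS.get(language, {})
    return localized.get(key, TEXTS["en"][key])

print(get_text("thinking", "es"))  # Spanish entry exists
print(get_text("sources", "es"))   # key missing in Spanish, falls back to English
print(get_text("thinking", "am"))  # language missing, falls back to English
```

A per-key fallback like this lets a translation be added incrementally without breaking messages that have not been translated yet.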

---

In [1]:
#Basic Configuration
!pip install --upgrade google-generativeai pypdf langchain langchain-core langchain-community faiss-cpu langchain-google-genai --quiet
import google.generativeai as genai
import pypdf
import os
import time
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import FAISS
import faiss
from kaggle_secrets import UserSecretsClient

KAGGLE_DATASET_NAME = "bionformatis"
PDF_DIR = f"/kaggle/input/{KAGGLE_DATASET_NAME}"
FAISS_INDEX_PATH = "/kaggle/working/vectordb"
GEMINI_EMBEDDING_MODEL = "models/embedding-001"
GEMINI_GENERATIVE_MODEL = "gemini-1.5-flash"
CHUNK_SIZE = 1500
CHUNK_OVERLAP = 150

# Language Configuration 
selected_language = "en" 

LANGUAGES = {
    "en": "English",
    "es": "Español",
    "am": "አማርኛ",
    "ar": "العربية"
}


TEXTS = {
    "en": {
        "pdf_dir_not_found": f"ERROR: PDF directory '{PDF_DIR}' not found. Make sure the dataset '{KAGGLE_DATASET_NAME}' is attached.",
        "no_pdfs_found": f"WARNING: No PDF files found in '{PDF_DIR}'.",
        "error_reading_pdf": "ERROR: Error reading '{}': {}",
        "no_text_extracted": "ERROR: No text could be extracted from any PDF files.",
        "no_chunks_available": "ERROR: No text chunks available to create vector store.",
        "loading_index_failed": "WARNING: Failed to load existing FAISS index ({}). Rebuilding...",
        "setup_faiss_failed": "ERROR: Failed to setup FAISS vector store: {}",
        "vector_store_not_init": "ERROR: Vector store not initialized.",
        "retrieval_error": "ERROR: Error retrieving context from FAISS: {}",
        "response_generation_error": "ERROR generating response: {}",
        "api_key_needed": "ERROR: Google AI API Key not found in Kaggle Secrets. Please add it as 'GOOGLE_API_KEY'.",
        "api_key_retrieval_error": "ERROR: Could not retrieve 'GOOGLE_API_KEY' from Kaggle Secrets: {}",
        "invalid_api_key": "ERROR: Invalid API Key or configuration error: {}",
        "init_knowledge_base": "Initializing knowledge base...",
        "init_failed": "ERROR: Knowledge base initialization failed.",
        "ask_question": "Ask a question (or type 'quit' to exit): ",
        "thinking": "Thinking...",
        "no_context_found": "I couldn't find relevant information in the documents to answer your question.",
        "sources": "Sources",
        "pdf_load_error": "ERROR: Failed to load or process PDFs. Cannot initialize chatbot."
    },
    
}

current_texts = TEXTS.get(selected_language, TEXTS["en"])

print(f"Configuration loaded. PDF Directory: {PDF_DIR}, FAISS Index Path: {FAISS_INDEX_PATH}")


Configuration loaded. PDF Directory: /kaggle/input/bionformatis, FAISS Index Path: /kaggle/working/vectordb


## ⚙️ Step 1: Configuration & Initialization

This section sets up the environment by installing necessary packages, loading language-specific prompts and error messages, and preparing paths for PDF data and FAISS storage.

**Key Libraries:**
- `google.generativeai` – for LLM access (Gemini)  
- `langchain` – for RAG framework and embedding utilities  
- `faiss-cpu` – for building and querying the vector database  
- `pypdf` – for PDF parsing and text extraction  


In [2]:
# Helper Functions

def load_and_process_pdfs():
    """Loads PDFs from global PDF_DIR, extracts text, and splits into chunks."""
    pdf_directory = PDF_DIR
    all_texts = []
    if not os.path.exists(pdf_directory):
        print(current_texts["pdf_dir_not_found"])
        return None, None

    pdf_files = [f for f in os.listdir(pdf_directory) if f.lower().endswith(".pdf")]
    if not pdf_files:
        print(current_texts["no_pdfs_found"])
        return None, None

    print(f"Found {len(pdf_files)} PDF(s) in '{pdf_directory}'. Processing...")
    for i, filename in enumerate(pdf_files):
        filepath = os.path.join(pdf_directory, filename)
        try:
            reader = pypdf.PdfReader(filepath)
            file_text = ""
            for page_num, page in enumerate(reader.pages):
                page_text = page.extract_text()
                if page_text:
                    file_text += page_text + "\n"
            if file_text:
                all_texts.append({"filename": filename, "content": file_text})
                print(f"  Processed '{filename}'")
            else:
                print(f"  WARNING: Could not extract text from '{filename}'. Skipping.")
        except Exception as e:
            print(current_texts["error_reading_pdf"].format(filename, e))

    if not all_texts:
        print(current_texts["no_text_extracted"])
        return None, None

    print("Splitting documents into chunks...")
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        length_function=len,
    )

    all_chunks = []
    metadatas = []
    for doc in all_texts:
        chunks = text_splitter.split_text(doc["content"])
        for chunk_idx, chunk in enumerate(chunks):
            all_chunks.append(chunk)
            metadatas.append({"source": doc["filename"], "chunk": chunk_idx})

    print(f"Created {len(all_chunks)} chunks.")
    return all_chunks, metadatas

def setup_faiss_vector_store(chunks, metadatas, api_key):
    """Creates or loads a FAISS vector store with Gemini embeddings."""
    if not chunks or not metadatas:
        print(current_texts["no_chunks_available"])
        return None

    try:
        embeddings = GoogleGenerativeAIEmbeddings(model=GEMINI_EMBEDDING_MODEL, google_api_key=api_key)

        if os.path.exists(FAISS_INDEX_PATH):
            try:
                print(f"Attempting to load existing FAISS index from {FAISS_INDEX_PATH}...")
                vector_store = FAISS.load_local(
                    FAISS_INDEX_PATH,
                    embeddings,
                    allow_dangerous_deserialization=True
                )
                print("FAISS index loaded successfully.")
                return vector_store
            except Exception as load_e:
                print(current_texts["loading_index_failed"].format(load_e))

        print(f"Creating new FAISS index ({len(chunks)} chunks)...")
        start_time = time.time()
        vector_store = FAISS.from_texts(chunks, embedding=embeddings, metadatas=metadatas)
        end_time = time.time()
        print(f"FAISS index created in {end_time - start_time:.2f} seconds.")

        print(f"Saving FAISS index to {FAISS_INDEX_PATH}...")
        vector_store.save_local(FAISS_INDEX_PATH)
        print("FAISS index saved.")
        return vector_store

    except Exception as e:
        print(current_texts["setup_faiss_failed"].format(e))
        return None

def get_relevant_context(query, vector_store, n_results=5):
    """Retrieves relevant context from FAISS vector store."""
    if not vector_store:
        print(current_texts["vector_store_not_init"])
        return [], []
    try:
        print(f"Searching for relevant context for query: '{query}'")
        results_with_scores = vector_store.similarity_search_with_score(query, k=n_results)
        context_docs = [doc.page_content for doc, score in results_with_scores]
        context_metadatas = [doc.metadata for doc, score in results_with_scores]
        print(f"Retrieved {len(context_docs)} context snippets.")
        return context_docs, context_metadatas
    except Exception as e:
        print(current_texts["retrieval_error"].format(e))
        return [], []

def generate_response(query, context_docs, context_metadatas, api_key, language="en"):
    """Generates response using Gemini based on query, context, and language."""
    try:
        model = genai.GenerativeModel(GEMINI_GENERATIVE_MODEL)

        context_string = ""
        sources = set()
        for doc, meta in zip(context_docs, context_metadatas):
            source_file = meta.get('source', 'Unknown source')
            context_string += f"Source: {source_file}\nContent:\n{doc}\n\n---\n\n"
            sources.add(source_file)

        language_name = LANGUAGES.get(language, "English")
        prompt = f"""You are a helpful bioinformatician assisting a biomedical researcher with the practical analysis of genomic data. Answer questions based on the provided text snippets.

        In your response, give practical assistance. So that the researcher can make full use of your knowledge, begin with a quick introduction to how the analysis will be structured.
        The response should combine the relevant theory with instructions for performing the bioinformatic analysis on real data.

        Sections are labelled according to their intended purpose, as described below:

        General text:

            Text which does not have any special formatting and looks (plain) like this will guide you through the practicals, providing background and explaining what analyses we are performing, and why.

        Checkpoints:

            Text which appears in boxes of this colour aims to inform you of a significant checkpoint in your analysis. When you see this, it is good to take a moment to think about what you have accomplished so far and what you have yet to do.
        
        Code:

            Text which appears in boxes of this colour will tell that you are looking at a terminal command.
            You can copy and paste from here straight to the terminal but before you do take a moment to understand what the command is actually doing.
            Several command lines may be present, with each new line representing a single command. 

        Screen output:

            Text appearing in these boxes represents output you might expect to see in the terminal in response to a command.
            Check to see if you get a similar output!

        Questions:
            Text in these boxes will usually ask an open ended question.

        Answer:
            You can answer the above questions in the text boxes provided.


**Please provide the answer in {language_name}.**

Context from documents:
{context_string}

Question:
{query}

Answer ({language_name}):
"""
        print("Generating response...")
        start_time = time.time()
        response = model.generate_content(prompt)
        end_time = time.time()
        print(f"Response generated in {end_time - start_time:.2f} seconds.")

        response_text = response.text

        return response_text

    except Exception as e:
        print(current_texts["response_generation_error"].format(e))
        return "Sorry, I encountered an error while generating the response."

print("Helper functions defined.")

Helper functions defined.


## 📂 Step 2: PDF Loading & Text Chunking

The function `load_and_process_pdfs()` scans a specified directory of biomedical PDFs and:

1. **Reads all pages** of each file using `pypdf`.  
2. **Extracts plain text**, skipping unreadable files.  
3. **Splits the text into overlapping chunks** using LangChain’s `RecursiveCharacterTextSplitter`.

These chunks serve as the foundation for building a searchable semantic knowledge base.
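
To make the roles of `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a simplified sketch of overlapping chunking with a fixed sliding window. It is not LangChain's algorithm (the recursive splitter also tries to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper and the sample text is dummy data.

```python
def chunk_text(text, chunk_size=1500, chunk_overlap=150):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "ACGT" * 1000  # 4000 characters of dummy text
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])
print(chunks[0][-150:] == chunks[1][:150])  # neighbouring chunks share the overlap
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.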

## 🧠 Step 3: FAISS Vector Store Setup

The function `setup_faiss_vector_store()` handles vectorization and indexing:

- **Creates embeddings** using Google’s `embedding-001` model.  
- **Loads or builds** a FAISS index of the chunked documents.  
- **Persists the index** to disk so it can be reused in future sessions.

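If an index already exists at `FAISS_INDEX_PATH`, the function tries to load it first; if none exists or loading fails, it builds a new one from scratch. A loaded index is returned as-is without being compared to the current chunks, so delete the saved index after changing the PDFs to force a rebuild.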
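
The load-first behaviour can be sketched independently of FAISS. In the snippet below a JSON file stands in for the saved index, and `load_or_build` is a hypothetical helper written for illustration, not a LangChain or FAISS function.

```python
import json
import os
import tempfile

def load_or_build(path, build_fn):
    """Load a cached JSON artifact from `path`, or build it and save it."""
    if os.path.exists(path):
        try:
            with open(path, "r", encoding="utf-8") as fh:
                return json.load(fh), "loaded"
        except (OSError, json.JSONDecodeError) as err:
            print(f"Cache unreadable ({err}); rebuilding...")
    artifact = build_fn()
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(artifact, fh)
    return artifact, "built"

cache_path = os.path.join(tempfile.mkdtemp(), "index.json")
first, first_status = load_or_build(cache_path, lambda: {"chunks": 583})
second, second_status = load_or_build(cache_path, lambda: {"chunks": 0})
print(first_status, second_status, second)
```

The second call returns the cached value even though its builder would produce something different, which is the trade-off of any load-first cache.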


## 🔎 Step 4: Semantic Search with FAISS

The `get_relevant_context()` function uses semantic similarity to:

- **Search the FAISS index** using the user’s query.  
- **Return the most relevant chunks and metadata**, providing context for response generation.

This is the retrieval part of the RAG framework.
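
Conceptually, the search ranks stored vectors by their distance to the query vector and keeps the closest `k`. The brute-force sketch below uses toy 2-D vectors and squared Euclidean (L2) distance, where a smaller score means a closer match; the assumption that the index uses an L2 metric should be checked against the installed LangChain version, and FAISS itself relies on optimized index structures rather than this loop.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vec, doc_vecs, k=5):
    """Return (index, distance) pairs for the k vectors closest to the query."""
    scored = ((squared_l2(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs))
    return [(idx, dist) for dist, idx in heapq.nsmallest(k, scored)]

# Toy "embeddings"; real ones have hundreds of dimensions.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
hits = top_k([1.0, 0.0], doc_vecs, k=2)
print(hits)
```

Here the query matches vector 0 exactly and vector 2 nearly, so those two indices come back first.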


In [3]:
# API Key Handling and Initialization

api_key = None
initialization_successful = False
faiss_vector_store = None

try:
    user_secrets = UserSecretsClient()
    api_key = user_secrets.get_secret("GOOGLE_API_KEY")
    if not api_key:
        print(current_texts["api_key_needed"])
    else:
        print("API Key retrieved from Kaggle Secrets.")
except Exception as e:
    print(current_texts["api_key_retrieval_error"].format(e))
    api_key = None 

if not api_key:
    print("Stopping execution as API key is missing or could not be retrieved.")
else:
    try:
        genai.configure(api_key=api_key)
        print("Checking API key validity...")
        list(genai.list_models())
        print("API key configured successfully.")

        # --- Initialization ---
        print(current_texts["init_knowledge_base"])
        init_start_time = time.time()
        chunks, metadatas = load_and_process_pdfs()
        
        if chunks is None or metadatas is None:
            print(current_texts["pdf_load_error"])
        elif chunks and metadatas:
            faiss_vector_store = setup_faiss_vector_store(chunks, metadatas, api_key)
            if faiss_vector_store:
                initialization_successful = True
        else:
            print(current_texts["no_chunks_available"])


        init_end_time = time.time()

        if initialization_successful:
            print(f"Knowledge base initialized successfully in {init_end_time - init_start_time:.2f} seconds.")
        else:
            # A more specific error was printed above; this line summarizes the outcome.
            print(current_texts["init_failed"])

    except Exception as e:
        print(current_texts["invalid_api_key"].format(e))
        initialization_successful = False


API Key retrieved from Kaggle Secrets.
Checking API key validity...
API key configured successfully.
Initializing knowledge base...
Found 1 PDF(s) in '/kaggle/input/bionformatis'. Processing...
  Processed 'Bioinformatics_for_Beginners_2014.pdf'
Splitting documents into chunks...
Created 583 chunks.
Creating new FAISS index (583 chunks)...
FAISS index created in 10.18 seconds.
Saving FAISS index to /kaggle/working/vectordb...
FAISS index saved.
Knowledge base initialized successfully in 17.56 seconds.


## 💬 Step 5: Gemini-Powered Bioinformatics Q&A

The core function `generate_response()` combines the user’s query with the retrieved context to:

1. Build a **context-aware prompt** tailored for bioinformatics tasks.  
2. Ask the Gemini model to act like a **practical research assistant**.  
3. Include structured guidance on analysis, checkpoints, code, screen output, and questions.  
4. Generate a **multilingual, practical answer** for hands-on genomics work.
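
The context-assembly step can be sketched on its own. `build_context` mirrors the loop inside `generate_response()`, while `append_sources` is a hypothetical addition showing one way the collected source names could be listed under the answer; the documents and file names are made-up sample data.

```python
def build_context(context_docs, context_metadatas):
    """Join retrieved chunks into one prompt block and collect their sources."""
    parts = []
    sources = []
    for doc, meta in zip(context_docs, context_metadatas):
        source = meta.get("source", "Unknown source")
        parts.append(f"Source: {source}\nContent:\n{doc}")
        if source not in sources:  # keep first-seen order, no duplicates
            sources.append(source)
    return "\n\n---\n\n".join(parts), sources

def append_sources(answer, sources, label="Sources"):
    """Hypothetical helper: list the source files under the answer text."""
    if not sources:
        return answer
    return f"{answer}\n\n{label}: {', '.join(sources)}"

# Made-up sample data standing in for retrieved chunks.
docs = ["FASTA stores sequences.", "BLAST aligns sequences."]
metas = [{"source": "guide.pdf", "chunk": 0}, {"source": "guide.pdf", "chunk": 7}]
context, sources = build_context(docs, metas)
print(append_sources("Use BLAST for alignment.", sources))
```

Listing the sources next to the answer makes it easier for a researcher to check a claim against the original PDF.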

In [4]:
import ipywidgets as widgets
from IPython.display import display, clear_output

chat_history = []

# Only run this if initialization is successful
if initialization_successful and faiss_vector_store:
    print("\n--- Starting Chat ---")
    print("Type 'quit' to exit.")

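    text_input = widgets.Text(
        value='',
        placeholder='Type your question here...',
        description='You:',
        disabled=False,
        # Update `value` only on Enter or focus loss, so the observer below
        # is not triggered on every keystroke.
        continuous_update=False
    )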

    display(text_input)

    def on_submit(change):
        prompt = change['new'].strip()
        if prompt.lower() == 'quit':
            print("Exiting chat.")
            text_input.disabled = True
            return

        if not prompt:
            return

        chat_history.append({"role": "user", "content": prompt})

        print(f"\nYou: {prompt}")
        print(current_texts["thinking"])

        context_docs, context_metadatas = get_relevant_context(prompt, faiss_vector_store)

        if not context_docs:
            full_response = current_texts["no_context_found"]
        else:
            full_response = generate_response(prompt, context_docs, context_metadatas, api_key, selected_language)

        print(f"\nAssistant: {full_response}")
        chat_history.append({"role": "assistant", "content": full_response})

        # Clear text input for next question
        text_input.value = ''

    text_input.observe(on_submit, names='value')

else:
    print("\nChat cannot start. Please check initialization steps in previous cells for errors.")



--- Starting Chat ---
Type 'quit' to exit.



## 💡 Key GenAI Concepts Demonstrated

- **RAG (Retrieval-Augmented Generation)**: Combines search + generation.  
- **FAISS indexing**: Fast similarity search over vectorized documents.  
- **PDF parsing and chunking**: Real-world biomedical content ingestion.  
- **Multilingual prompts**: Localized experience for researchers worldwide.  
- **Practical prompt engineering**: Tailored output for bioinformatics workflows.

---
## 🚀 Final Thoughts

This project demonstrates how **Generative AI + Semantic Search** can empower life science researchers to:

- Get **hands-on assistance** with their genomic pipelines.  
- Leverage complex scientific texts effortlessly.  
- Ask questions in **natural language** and get **practical responses** grounded in real documents.