<a href="https://colab.research.google.com/github/idoo25/neomi1/blob/main/Copy_of_Tutorial_2_2_Ecomodels_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

📚 Complete Cell Structure
🔧 Setup Cells (1-3)

Cell 1: Package Installation with progress tracking
Cell 2: Import Libraries with fallback detection
Cell 3: Vector Store Classes (Simple fallback)

🧠 Core System Cells (4-6)

Cell 4: RAG System Core Class
Cell 5: Data Loading Methods
Cell 6: Search and Query Methods

📊 Data & Interface Cells (7-9)

Cell 7: Sample IOLR Data for testing
Cell 8: Initialize RAG System
Cell 9: Simple Query Interface

🔄 Optional Enhancement Cells (10-11)

Cell 10: Load Your Own Papers (optional)
Cell 11: Gradio Web Interface (optional)

📈 Analytics & Advanced Cells (12-14)

Cell 12: Analytics and Evaluation
Cell 13: Advanced Query Features
Cell 14: System Summary and Testing

In [None]:
# CELL 2: Import Libraries and Check Dependencies
# ==============================================
"""
📚 CELL 2: IMPORT LIBRARIES
Run this cell to import all required libraries and check what's available.
"""

import json
import pandas as pd
import numpy as np
from typing import List, Dict, Any, Optional
import re
import time

# Check what packages are available
print("🔍 Checking available packages...")

# ChromaDB
try:
    import chromadb
    CHROMADB_AVAILABLE = True
    print("✅ ChromaDB: Available")
except ImportError:
    CHROMADB_AVAILABLE = False
    print("❌ ChromaDB: Not available (will use fallback)")

# SentenceTransformers
try:
    from sentence_transformers import SentenceTransformer
    TRANSFORMERS_AVAILABLE = True
    print("✅ SentenceTransformers: Available")
except ImportError:
    TRANSFORMERS_AVAILABLE = False
    print("❌ SentenceTransformers: Not available (will use TF-IDF)")

# OpenAI
try:
    import openai
    OPENAI_AVAILABLE = True
    print("✅ OpenAI: Available")
except ImportError:
    OPENAI_AVAILABLE = False
    print("❌ OpenAI: Not available (will use template responses)")

# Gradio for interface
try:
    import gradio as gr
    GRADIO_AVAILABLE = True
    print("✅ Gradio: Available")
except ImportError:
    GRADIO_AVAILABLE = False
    print("❌ Gradio: Not available (will use simple interface)")

# Fallback imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

print("\n📋 System Status:")
print(f"   Vector DB: {'ChromaDB' if CHROMADB_AVAILABLE else 'Simple Store'}")
print(f"   Embeddings: {'Transformer' if TRANSFORMERS_AVAILABLE else 'TF-IDF'}")
print(f"   Generation: {'OpenAI GPT' if OPENAI_AVAILABLE else 'Template'}")
print(f"   Interface: {'Gradio' if GRADIO_AVAILABLE else 'Simple'}")

print("\n🎯 Ready for Cell 3!")

🔍 Checking available packages...
❌ ChromaDB: Not available (will use fallback)
✅ SentenceTransformers: Available
✅ OpenAI: Available
✅ Gradio: Available

📋 System Status:
   Vector DB: Simple Store
   Embeddings: Transformer
   Generation: OpenAI GPT
   Interface: Gradio

🎯 Ready for Cell 3!
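
The availability flags above come from attempting each import. As a rough sketch of the same idea in compact form, the helper below (a hypothetical `check_packages`, not part of the notebook) uses the standard library's `importlib.util.find_spec` to test whether a top-level package can be located without importing it. One difference worth noting: `find_spec` only confirms that the package is installed, whereas the `try`/`import` pattern in Cell 2 also catches packages that are installed but fail at import time.

```python
from importlib.util import find_spec


def check_packages(names):
    """Return {package_name: bool} for each top-level package name.

    Illustrative helper (not part of the notebook); it only checks that the
    package can be located, it does not import it.
    """
    return {name: find_spec(name) is not None for name in names}


# Package names taken from the imports in Cell 2
status = check_packages(["chromadb", "sentence_transformers", "openai", "gradio"])
for name, available in status.items():
    print(f"{name}: {'available' if available else 'missing (fallback will be used)'}")
```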


In [None]:
# CELL 3: Vector Store Classes
# ============================
"""
🗄️ CELL 3: VECTOR STORE CLASSES
This cell defines the vector storage classes with fallback options.
"""

class SimpleVectorStore:
    """Fallback vector store when ChromaDB is not available"""

    def __init__(self):
        self.documents = []
        self.embeddings = []
        self.metadatas = []
        self.ids = []
        print("📦 SimpleVectorStore initialized")

    def add(self, embeddings, documents, metadatas, ids):
        """Add documents to the store"""
        self.embeddings.extend(embeddings)
        self.documents.extend(documents)
        self.metadatas.extend(metadatas)
        self.ids.extend(ids)
        print(f"✅ Added {len(documents)} documents to simple vector store")

    def query(self, query_embeddings, n_results=5):
        """Query the vector store"""
        if not self.embeddings:
            return {'ids': [[]], 'documents': [[]], 'metadatas': [[]], 'distances': [[]]}

        # Calculate similarities
        similarities = cosine_similarity(query_embeddings, self.embeddings)[0]

        # Get top results
        top_indices = np.argsort(similarities)[::-1][:n_results]

        results = {
            'ids': [[self.ids[i] for i in top_indices]],
            'documents': [[self.documents[i] for i in top_indices]],
            'metadatas': [[self.metadatas[i] for i in top_indices]],
            'distances': [[1 - similarities[i] for i in top_indices]]
        }

        return results

    def count(self):
        """Get count of documents"""
        return len(self.documents)

print("✅ Vector store classes defined!")
print("📋 Next: Run Cell 4 for RAG system core")

✅ Vector store classes defined!
📋 Next: Run Cell 4 for RAG system core
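
`SimpleVectorStore.query` ranks the stored embeddings by cosine similarity to the query and reports `1 - similarity` as the distance. The pure-Python sketch below illustrates that ranking step on toy 2-D vectors. The names `cosine` and `top_k` are illustrative only; the notebook itself relies on scikit-learn's `cosine_similarity` and NumPy's `argsort`.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=5):
    """Return [(index, distance)] for the k most similar vectors, best first."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Distance is reported as 1 - similarity, as in SimpleVectorStore.query
    return [(i, 1.0 - sim) for i, sim in scored[:k]]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for index, distance in top_k([1.0, 0.0], vectors, k=2):
    print(index, round(distance, 4))
```

A smaller distance means a closer match, so the first entry returned is the best hit.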


In [None]:
# CELL 4: RAG System Core
# =======================
"""
🧠 CELL 4: RAG SYSTEM CORE
This cell defines the main EcologicalRAG class.
"""

class EcologicalRAG:
    """Main RAG system for ecological research papers"""

    def __init__(self, openai_api_key=None):
        print("🌊 Initializing Ecological RAG System...")

        # Setup embedding model
        if TRANSFORMERS_AVAILABLE:
            try:
                self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
                self.use_transformers = True
                print("✅ Loaded SentenceTransformer embeddings")
            except Exception:  # model download or load failed
                self.use_transformers = False
                self.tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
                print("⚠️ Using TF-IDF embeddings (fallback)")
        else:
            self.use_transformers = False
            self.tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
            print("⚠️ Using TF-IDF embeddings")

        # Setup vector store
        if CHROMADB_AVAILABLE:
            try:
                client = chromadb.Client()
                try:
                    self.collection = client.get_collection("ecological_papers")
                    print("✅ Loaded existing ChromaDB collection")
                except Exception:  # collection does not exist yet
                    self.collection = client.create_collection("ecological_papers")
                    print("✅ Created new ChromaDB collection")
                self.use_chromadb = True
            except Exception:  # ChromaDB could not be initialized
                self.collection = SimpleVectorStore()
                self.use_chromadb = False
                print("⚠️ Using simple vector store (fallback)")
        else:
            self.collection = SimpleVectorStore()
            self.use_chromadb = False
            print("⚠️ Using simple vector store")

        # Setup OpenAI
        if openai_api_key and OPENAI_AVAILABLE:
            openai.api_key = openai_api_key
            self.use_openai = True
            print("✅ OpenAI configured")
        else:
            self.use_openai = False
            print("⚠️ Using template responses")


        self.papers = []
        self.fitted = False
        print("🎉 RAG system ready!")

    def preprocess_text(self, text):
        """Clean text for better processing"""
        if not text:
            return ""
        text = re.sub(r'\s+', ' ', text)
        text = re.sub(r'[^\w\s\-\.\(\)]', ' ', text)
        return text.strip()

    def extract_entities(self, text):
        """Extract ecological entities from text"""
        entities = {'species': [], 'locations': [], 'methods': []}

        # Species: rough binomial-name heuristic (capitalized word + lowercase word).
        # It also matches ordinary phrases such as "We examined", so treat hits as candidates.
        species = re.findall(r'\b[A-Z][a-z]+ [a-z]+\b', text)
        entities['species'] = sorted(set(species))[:3]  # sorted for reproducible output

        # Locations
        locations = re.findall(r'\b(Mediterranean|Red Sea|Lake Kinneret|Eastern Mediterranean|Levantine)\b', text, re.IGNORECASE)
        entities['locations'] = sorted(set(locations))[:3]

        # Methods
        methods = re.findall(r'\b(PCR|DNA|sequencing|survey|analysis|modeling)\b', text, re.IGNORECASE)
        entities['methods'] = sorted(set(methods))[:3]

        return entities

    def generate_embeddings(self, texts):
        """Generate embeddings using available method"""
        if self.use_transformers:
            return self.embedding_model.encode(texts, show_progress_bar=True)
        else:
            if not self.fitted:
                self.tfidf.fit(texts)
                self.fitted = True
            return self.tfidf.transform(texts).toarray()

print("✅ RAG core class defined!")
print("📋 Next: Run Cell 5 for data loading methods")

✅ RAG core class defined!
📋 Next: Run Cell 5 for data loading methods
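
The species pattern in `extract_entities` matches any capitalized word followed by a lowercase word, so it picks up binomial names but also ordinary sentence openers. The sketch below runs the same pattern on one sentence to show this, then applies a small filter. The filter list `NON_GENUS_WORDS` and both helper names are hypothetical additions for illustration, not something the notebook defines.

```python
import re

# Same pattern as in extract_entities
SPECIES_PATTERN = re.compile(r'\b[A-Z][a-z]+ [a-z]+\b')

# Hypothetical filter list: common sentence-opening words that are not genus names
NON_GENUS_WORDS = {"We", "The", "Our", "This", "These", "Field", "Despite"}


def find_species_candidates(text):
    """Return every 'Capitalized lowercase' word pair found in the text."""
    return SPECIES_PATTERN.findall(text)


def filter_candidates(candidates):
    """Drop candidates whose first word is a known non-genus word."""
    return [c for c in candidates if c.split()[0] not in NON_GENUS_WORDS]


text = "We examined the impact of Lagocephalus sceleratus in the Eastern Mediterranean Sea"
raw = find_species_candidates(text)
print("raw candidates:", raw)
print("after filter:  ", filter_candidates(raw))
```

A word list like this is only a stopgap; a curated species dictionary or a taxonomic name-recognition tool would be more reliable for real collections.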


In [None]:
# CELL 5: Data Loading Methods
# ============================
"""
📚 CELL 5: DATA LOADING METHODS
This cell adds data loading capabilities to the RAG system.
"""

def add_load_papers_method():
    """Add load_papers method to EcologicalRAG class"""

    def load_papers(self, papers_data):
        """Load papers into the RAG system"""
        print(f"📚 Loading {len(papers_data)} papers...")

        valid_papers = [p for p in papers_data if (p.get('abstract') or '').strip()]  # also tolerates abstract=None
        print(f"📖 Found {len(valid_papers)} papers with abstracts")

        if not valid_papers:
            print("❌ No valid papers found!")
            return

        documents, metadatas, ids = [], [], []

        for i, paper in enumerate(valid_papers):
            # Combine title and abstract
            text = f"{paper.get('title', '')} {paper.get('abstract', '')}"
            text = self.preprocess_text(text)

            if len(text) < 50:
                continue

            entities = self.extract_entities(text)

            metadata = {
                'title': paper.get('title', 'Unknown'),
                'authors': paper.get('authors', 'Unknown'),
                'journal': paper.get('journal', 'Unknown'),
                'year': paper.get('year', 2022),
                'doi': paper.get('doi', ''),
                'species': ', '.join(entities['species']),
                'locations': ', '.join(entities['locations']),
                'methods': ', '.join(entities['methods'])
            }

            documents.append(text)
            metadatas.append(metadata)
            ids.append(f"paper_{i}")

        if not documents:
            print("❌ No processable documents found!")
            return

        # Generate embeddings
        print("🔄 Generating embeddings...")
        embeddings = self.generate_embeddings(documents)

        # Add to vector store
        print("💾 Adding to vector store...")
        if self.use_chromadb:
            self.collection.add(
                embeddings=embeddings.tolist(),
                documents=documents,
                metadatas=metadatas,
                ids=ids
            )
        else:
            self.collection.add(
                embeddings=embeddings,
                documents=documents,
                metadatas=metadatas,
                ids=ids
            )

        self.papers = valid_papers
        print(f"✅ Successfully loaded {len(documents)} papers!")

    # Add method to class
    EcologicalRAG.load_papers = load_papers

# Apply the method
add_load_papers_method()

print("✅ Data loading methods added!")
print("📋 Next: Run Cell 6 for search and query methods")

✅ Data loading methods added!
📋 Next: Run Cell 6 for search and query methods
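
`load_papers` expects a list of dictionaries and silently skips records that have no abstract or whose combined title and abstract is shorter than 50 characters. The standalone sketch below mirrors just that filtering step so the effect on IDs is visible. The function name `prepare_documents` and the three example records are hypothetical; the real method also cleans the text and extracts entities.

```python
def prepare_documents(papers, min_chars=50):
    """Mirror the filtering in load_papers (without text cleaning or entities)."""
    # Step 1: keep only papers with a non-empty abstract
    valid_papers = [p for p in papers if (p.get('abstract') or '').strip()]

    documents, ids = [], []
    # Step 2: IDs are numbered over the valid papers, before the length check
    for i, paper in enumerate(valid_papers):
        text = f"{paper.get('title', '')} {paper.get('abstract', '')}".strip()
        if len(text) < min_chars:
            continue  # too short to be useful; note that its index is still consumed
        documents.append(text)
        ids.append(f"paper_{i}")
    return documents, ids


# Hypothetical records, for illustration only
example_papers = [
    {'title': 'Example paper on lake monitoring',
     'abstract': 'A hypothetical abstract that is long enough to pass the minimum length check.'},
    {'title': 'Paper without an abstract', 'abstract': ''},
    {'title': 'Short', 'abstract': 'Too brief.'},
    {'title': 'Second example paper on coastal surveys',
     'abstract': 'Another hypothetical abstract with enough text to be kept by the filter.'},
]

documents, ids = prepare_documents(example_papers)
print(len(documents), "documents kept with ids", ids)
```

Because short records are dropped after numbering, the resulting IDs are not necessarily contiguous.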


In [None]:
# CELL 6: Search and Query Methods
# ================================
"""
🔍 CELL 6: SEARCH AND QUERY METHODS
This cell adds search and response generation to the RAG system.
"""

def add_search_methods():
    """Add search and query methods to EcologicalRAG class"""

    def search(self, query, n_results=3):
        """Search for relevant papers"""
        query_processed = self.preprocess_text(query)
        query_embedding = self.generate_embeddings([query_processed])

        if self.use_chromadb:
            results = self.collection.query(
                query_embeddings=query_embedding.tolist(),
                n_results=n_results
            )
        else:
            results = self.collection.query(
                query_embeddings=query_embedding,
                n_results=n_results
            )

        return results

    def _generate_openai_response(self, query, papers, search_results):
        """Generate response using OpenAI"""
        context = "\n\n".join([
            f"Paper: {papers[i]['title']}\n"
            f"Authors: {papers[i]['authors']}\n"
            f"Content: {search_results['documents'][0][i][:400]}..."
            for i in range(min(3, len(papers)))
        ])

        prompt = f"""You are an expert marine ecologist. Answer this question based on the research provided:

Question: {query}

Research Papers:
{context}

Provide a comprehensive answer citing the research. Focus on Mediterranean and freshwater ecosystems."""

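        try:
            messages = [
                {"role": "system", "content": "You are an expert marine and freshwater ecologist."},
                {"role": "user", "content": prompt}
            ]
            if hasattr(openai, "OpenAI"):
                # openai>=1.0 removed openai.ChatCompletion in favor of a client object
                client = openai.OpenAI(api_key=openai.api_key)
                response = client.chat.completions.create(
                    model="gpt-3.5-turbo",
                    messages=messages,
                    max_tokens=800,
                    temperature=0.7
                )
            else:
                # Legacy interface for openai<1.0
                response = openai.ChatCompletion.create(
                    model="gpt-3.5-turbo",
                    messages=messages,
                    max_tokens=800,
                    temperature=0.7
                )
            return response.choices[0].message.content
        except Exception as e:
            return f"OpenAI error: {e}\n\nFalling back to template response:\n\n{self._generate_template_response(query, papers, search_results)}"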

    def _generate_template_response(self, query, papers, search_results):
        """Generate template response without OpenAI"""
        response = f"🔍 **Search Results for:** {query}\n\n"
        response += f"📊 **Found {len(papers)} relevant papers:**\n\n"

        for i, paper in enumerate(papers[:3]):
            response += f"**{i+1}. {paper['title']}**\n"
            response += f"   👥 Authors: {paper['authors']}\n"
            response += f"   📖 Journal: {paper['journal']} ({paper['year']})\n"

            if paper.get('species'):
                response += f"   🐟 Species: {paper['species']}\n"
            if paper.get('locations'):
                response += f"   📍 Locations: {paper['locations']}\n"
            if paper.get('methods'):
                response += f"   🔬 Methods: {paper['methods']}\n"

            response += f"   🔗 DOI: {paper['doi']}\n\n"

        # Add summary
        all_species = set()
        all_locations = set()
        for paper in papers:
            if paper.get('species'):
                all_species.update([s.strip() for s in paper['species'].split(',') if s.strip()])
            if paper.get('locations'):
                all_locations.update([l.strip() for l in paper['locations'].split(',') if l.strip()])

        response += "📋 **Summary:**\n"
        if all_species:
            response += f"   🐟 Species mentioned: {', '.join(list(all_species)[:5])}\n"
        if all_locations:
            response += f"   📍 Study areas: {', '.join(list(all_locations))}\n"

        return response

    def generate_response(self, query, search_results):
        """Generate response based on search results"""

        if not search_results['documents'][0]:
            return "❌ No relevant papers found for your query."

        papers = search_results['metadatas'][0]

        if self.use_openai:
            return self._generate_openai_response(query, papers, search_results)
        else:
            return self._generate_template_response(query, papers, search_results)

    def query(self, question, n_results=3):
        """Main query function"""
        print(f"🔍 Processing: {question}")

        search_results = self.search(question, n_results)
        response = self.generate_response(question, search_results)

        return {
            'question': question,
            'response': response,
            'papers_found': len(search_results['documents'][0]),
            'search_results': search_results
        }

    # Add methods to class
    EcologicalRAG.search = search
    EcologicalRAG._generate_openai_response = _generate_openai_response
    EcologicalRAG._generate_template_response = _generate_template_response
    EcologicalRAG.generate_response = generate_response
    EcologicalRAG.query = query

# Apply the methods
add_search_methods()

print("✅ Search and query methods added!")
print("📋 Next: Run Cell 7 for sample data")

✅ Search and query methods added!
📋 Next: Run Cell 7 for sample data


In [None]:
# CELL 7: Sample Data
# ===================
"""
📊 CELL 7: SAMPLE DATA
This cell provides sample IOLR papers for testing the system.
"""

def get_sample_iolr_papers():
    """Get sample papers from IOLR 2022 publications"""

    sample_papers = [
        {
            'title': 'The invasive silver-cheeked toadfish (Lagocephalus sceleratus) predominantly impacts the behavior of other non-indigenous species in the Eastern Mediterranean',
            'authors': 'Chaikin, S., De-Beer, G., Yitzhak, N., Stern, N., Belmaker, J.',
            'journal': 'Biological Invasions',
            'year': 2022,
            'doi': '10.1007/s10530-022-02972-7',
            'abstract': 'Invasive species can have cascading effects on marine ecosystems by altering the behavior and distribution of native species. We examined the impact of the invasive silver-cheeked toadfish (Lagocephalus sceleratus) on fish community behavior in the Eastern Mediterranean Sea. Our field surveys and behavioral experiments showed that the presence of this highly toxic species significantly altered feeding behavior, habitat use patterns, and predator-prey interactions of both native and other non-indigenous fish species. The toadfish acts as a keystone species that restructures marine food webs through its toxicity and aggressive behavior.'
        },
        {
            'title': 'Marine heatwaves drive recurrent mass mortalities in the Mediterranean Sea',
            'authors': 'Garrabou J, Gómez-Gras D, Medrano A, Cerrano C, Ponti M, Rilov G',
            'journal': 'Global Change Biology',
            'year': 2022,
            'doi': '10.1111/gcb.16301',
            'abstract': 'Marine heatwaves have become increasingly frequent in the Mediterranean Sea, causing widespread mass mortality events across multiple taxonomic groups. We documented temperature anomalies reaching 5°C above long-term averages during summer 2022, leading to unprecedented mortality rates in benthic communities including corals, sponges, mollusks, and fish. These events affected shallow water habitats from 0-40m depth along the Mediterranean coast. Climate models predict increased frequency and intensity of such events, threatening the biodiversity and ecosystem services of Mediterranean marine ecosystems.'
        },
        {
            'title': 'Cyanobacterial pigment concentrations in inland waters: Novel semi-analytical algorithms for multi- and hyperspectral remote sensing data',
            'authors': 'Dev, P. J., Sukenik, A., Mishra, D. R., Ostrovsky, I.',
            'journal': 'Science of The Total Environment',
            'year': 2022,
            'doi': '10.1016/j.scitotenv.2021.150423',
            'abstract': 'Remote sensing of cyanobacterial blooms in freshwater systems requires accurate algorithms for pigment detection and quantification. We developed novel semi-analytical algorithms for detecting cyanobacterial pigments using multi-spectral and hyperspectral satellite data in Lake Kinneret, Israel. Our algorithms can distinguish between chlorophyll-a from cyanobacteria (phycocyanin) and other phytoplankton groups. Field validation using Sentinel-2 and PRISMA satellite imagery showed 85% accuracy in detecting Microcystis aeruginosa blooms. The algorithms provide early warning capabilities for harmful algal bloom management in freshwater ecosystems.'
        },
        {
            'title': 'Thermal vulnerability of the Levantine endemic and endangered habitat-forming macroalga, Gongolaria rayssiae: implications for reef carbon',
            'authors': 'Mulas, M., Silverman, J., Guy-Haim, T., Noe, S., Rilov, G.',
            'journal': 'Frontiers in Marine Science',
            'year': 2022,
            'doi': '10.3389/fmars.2022.862332',
            'abstract': 'Gongolaria rayssiae is an endemic brown macroalga forming extensive reefs along the Levantine coast. We assessed its thermal vulnerability through laboratory experiments and field monitoring during marine heatwaves. The species showed critical thermal limits at 32°C, with photosynthetic efficiency declining rapidly above 29°C. Field populations experienced 40% mortality during the 2022 heatwave when temperatures exceeded 30°C for extended periods. As a habitat-forming species and significant carbon sink, the loss of G. rayssiae reefs could have cascading effects on Mediterranean coastal carbon cycling and biodiversity.'
        },
        {
            'title': 'Trophic ecology of deep-sea megafauna in the ultra-oligotrophic Southeastern Mediterranean Sea',
            'authors': 'Guy-Haim, T., Stern, N., Sisma-Ventura, G.',
            'journal': 'Frontiers in Marine Science',
            'year': 2022,
            'doi': '10.3389/fmars.2022.857179',
            'abstract': 'The deep-sea ecosystems of the ultra-oligotrophic Southeastern Mediterranean Sea are poorly understood. We investigated the trophic ecology of megafauna using stable isotope analysis and stomach content examination from depths of 200-1000m. Fish, crustaceans, and cephalopods showed distinct trophic niches, with δ15N values indicating 3-4 trophic levels. The food web relies heavily on marine snow and organic matter transport from surface waters. Despite low productivity, the deep-sea community maintains complex trophic relationships, with evidence of vertical migration connecting deep and shallow ecosystems.'
        },
        {
            'title': 'Jellyfish swarm impair the pretreatment efficiency and membrane performance of seawater reverse osmosis desalination',
            'authors': 'Rahav, E., Belkin, N., Nnebuo, O., Sisma-Ventura, G., Guy-Haim, T., Sharon-Gojman, R., Bar-Zeev, E.',
            'journal': 'Water Research',
            'year': 2022,
            'doi': '10.1016/j.watres.2022.118231',
            'abstract': 'Massive jellyfish blooms in the Eastern Mediterranean pose significant challenges to seawater desalination infrastructure. We investigated the impact of Rhopilema nomadica swarms on reverse osmosis (RO) membrane performance and pretreatment systems. Jellyfish biomass increased membrane fouling by 60% and reduced flux rates by 35%. Decomposing jellyfish released high concentrations of organic compounds that overwhelmed pretreatment filters. The study highlights the need for adaptive desalination strategies to cope with increasing jellyfish populations driven by climate change and overfishing in Mediterranean waters.'
        }
    ]

    print(f"📚 Loaded {len(sample_papers)} sample IOLR papers:")
    for i, paper in enumerate(sample_papers, 1):
        print(f"   {i}. {paper['title'][:60]}...")

    return sample_papers

# Load sample data
SAMPLE_PAPERS = get_sample_iolr_papers()

print("\n✅ Sample data ready!")
print("📋 Next: Run Cell 8 to initialize RAG system")

📚 Loaded 6 sample IOLR papers:
   1. The invasive silver-cheeked toadfish (Lagocephalus sceleratu...
   2. Marine heatwaves drive recurrent mass mortalities in the Med...
   3. Cyanobacterial pigment concentrations in inland waters: Nove...
   4. Thermal vulnerability of the Levantine endemic and endangere...
   5. Trophic ecology of deep-sea megafauna in the ultra-oligotrop...
   6. Jellyfish swarm impair the pretreatment efficiency and membr...

✅ Sample data ready!
📋 Next: Run Cell 8 to initialize RAG system


In [None]:
# CELL 8: Initialize RAG System
# =============================
"""
🚀 CELL 8: INITIALIZE RAG SYSTEM
This cell creates the RAG system and loads the sample papers.
Set your OpenAI API key here if you have one (optional).
"""

# Configuration
OPENAI_API_KEY = None  # Replace with your OpenAI API key if you have one
# OPENAI_API_KEY = "sk-your-api-key-here"  # Uncomment and add your key

# Initialize the RAG system
print("🌊 Initializing Ecological RAG System...")
rag_system = EcologicalRAG(openai_api_key=OPENAI_API_KEY)

# Load sample papers
print("\n📚 Loading sample papers into RAG system...")
rag_system.load_papers(SAMPLE_PAPERS)

# Test the system
print("\n🧪 Testing system with sample query...")
test_result = rag_system.query("What invasive species affect Mediterranean marine ecosystems?")
print(f"✅ Test successful! Found {test_result['papers_found']} relevant papers")

print("\n🎉 RAG system is ready!")
print("📋 Next: Run Cell 9 for simple interface or Cell 10 to load your own papers")

🌊 Initializing Ecological RAG System...
🌊 Initializing Ecological RAG System...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


✅ Loaded SentenceTransformer embeddings
📦 SimpleVectorStore initialized
⚠️ Using simple vector store
⚠️ Using template responses
🎉 RAG system ready!

📚 Loading sample papers into RAG system...
📚 Loading 6 papers...
📖 Found 6 papers with abstracts
🔄 Generating embeddings...


💾 Adding to vector store...
✅ Added 6 documents to simple vector store
✅ Successfully loaded 6 papers!

🧪 Testing system with sample query...
🔍 Processing: What invasive species affect Mediterranean marine ecosystems?


✅ Test successful! Found 3 relevant papers

🎉 RAG system is ready!
📋 Next: Run Cell 9 for simple interface or Cell 10 to load your own papers
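
Cell 8 suggests pasting an API key directly into the notebook. As an alternative sketch, the key can be read from an environment variable so it never appears in a shared notebook. The helper name `resolve_api_key` is hypothetical, and the variable name `OPENAI_API_KEY` is assumed here as the conventional one; adjust it to wherever the key is actually stored.

```python
import os


def resolve_api_key(explicit_key=None, env=None):
    """Return an explicit key if given, else the OPENAI_API_KEY variable, else None."""
    env = os.environ if env is None else env
    key = explicit_key or env.get("OPENAI_API_KEY")
    # Treat empty or whitespace-only values as "no key"
    return key.strip() if key and key.strip() else None


OPENAI_API_KEY = resolve_api_key()
if OPENAI_API_KEY:
    print("OpenAI key found; GPT responses can be enabled")
else:
    print("No OpenAI key found; template responses will be used")
```

The resulting value can be passed to `EcologicalRAG(openai_api_key=OPENAI_API_KEY)` exactly as in Cell 8; `None` keeps the template fallback.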


In [None]:
# CELL 9: Simple Query Interface
# =============================
"""
💬 CELL 9: SIMPLE QUERY INTERFACE
This cell provides a simple interface to query the RAG system.
Copy and run this cell to start asking questions about ecology!
"""

def query_interface():
    """Simple interface for querying the RAG system"""

    print("🌊 ECOLOGICAL RAG SYSTEM - QUERY INTERFACE")
    print("=" * 60)
    print("Ask questions about marine and freshwater ecology!")
    print("Type 'quit' to exit, 'help' for examples")
    print("=" * 60)

    while True:
        try:
            # Get user input
            query = input("\n🔍 Your question: ").strip()

            if query.lower() == 'quit':
                print("👋 Goodbye!")
                break

            if query.lower() == 'help':
                print("\n💡 Example questions:")
                print("   • How do invasive species affect Mediterranean ecosystems?")
                print("   • What causes marine heatwaves?")
                print("   • How to detect cyanobacteria blooms using satellites?")
                print("   • What are the effects of climate change on coral reefs?")
                print("   • How do jellyfish blooms impact desalination?")
                print("   • What methods are used to study deep-sea biodiversity?")
                print("   • How does pollution affect marine food chains?")
                continue

            if not query:
                print("⚠️ Please enter a question")
                continue

            # Process query
            print("\n🔄 Searching through research papers...")
            result = rag_system.query(query, n_results=3)

            # Display results
            print("\n" + "="*60)
            print(f"📋 RESULTS FOR: {query}")
            print("="*60)
            print(result['response'])
            print("="*60)
            print(f"📊 Found {result['papers_found']} relevant papers")
            print("💡 Type 'help' for more example questions")

        except KeyboardInterrupt:
            print("\n👋 Goodbye!")
            break
        except Exception as e:
            print(f"❌ Error: {e}")
            print("💡 Try a different question or check if the system is properly initialized")
            continue

def single_query(question):
    """Ask a single question without the interactive loop"""
    try:
        print(f"🔍 Searching for: {question}")
        result = rag_system.query(question, n_results=3)

        print("\n" + "="*60)
        print(f"📋 ANSWER:")
        print("="*60)
        print(result['response'])
        print("="*60)
        print(f"📊 Based on {result['papers_found']} research papers")

        return result

    except Exception as e:
        print(f"❌ Error: {e}")
        return None

# Quick test of the interface
def test_interface():
    """Test the interface with sample questions"""

    test_questions = [
        "What invasive species affect Mediterranean marine ecosystems?",
        "How do marine heatwaves impact coral communities?",
        "What methods detect cyanobacteria in lakes?"
    ]

    print("🧪 Testing interface with sample questions...")

    for i, question in enumerate(test_questions, 1):
        print(f"\n[Test {i}/{len(test_questions)}] {question}")
        result = single_query(question)
        if result:
            print("✅ Success!")
        else:
            print("❌ Failed")

    print("\n✅ Interface test completed!")

# Display available functions
print("✅ Simple interface ready!")
print("\n🚀 Available functions:")
print("   • query_interface() - Start interactive questioning")
print("   • single_query('your question') - Ask one question")
print("   • test_interface() - Test with sample questions")

print("\n💡 Example usage:")
print("   query_interface()  # Start interactive session")
print("   single_query('How do invasive fish affect coral reefs?')")

print("\n📋 Next: Run Cell 10 to load your own papers (optional)")

✅ Simple interface ready!

🚀 Available functions:
   • query_interface() - Start interactive questioning
   • single_query('your question') - Ask one question
   • test_interface() - Test with sample questions

💡 Example usage:
   query_interface()  # Start interactive session
   single_query('How do invasive fish affect coral reefs?')

📋 Next: Run Cell 10 to load your own papers (optional)
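
For scripted use, the interactive loop can be replaced by a batch run that collects one summary row per question. The sketch below assumes only that the query function returns a dictionary with a `papers_found` key, as `rag_system.query` does. `run_batch` and the stand-in `fake_query` are hypothetical names so the example runs on its own.

```python
def run_batch(query_fn, questions, n_results=3):
    """Run each question through query_fn and collect one summary row per question."""
    rows = []
    for question in questions:
        try:
            result = query_fn(question, n_results=n_results)
            rows.append({'question': question,
                         'papers_found': result['papers_found'],
                         'ok': True})
        except Exception as exc:  # keep going if a single question fails
            rows.append({'question': question,
                         'papers_found': 0,
                         'ok': False,
                         'error': str(exc)})
    return rows


def fake_query(question, n_results=3):
    """Stand-in for rag_system.query so the sketch runs without the RAG system."""
    if not question.strip():
        raise ValueError("empty question")
    return {'question': question, 'response': '...', 'papers_found': n_results}


rows = run_batch(fake_query, ["What methods detect cyanobacteria in lakes?", "   "])
for row in rows:
    print(row)
```

In the notebook, `run_batch(rag_system.query, questions)` would be the corresponding call, and the rows can be passed to `pd.DataFrame(rows)` for inspection.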


In [None]:
# CELL 10: Load Your Own Papers (OPTIONAL)
# ========================================
"""
📁 CELL 10: LOAD YOUR OWN PAPERS (OPTIONAL)
Use this cell to load papers you collected with the scraper.
Skip this cell if you want to use the sample data.
"""

def load_collected_papers(file_path):
    """Load papers from your collected JSON file"""
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            papers = json.load(f)

        # Filter papers with abstracts
        valid_papers = [p for p in papers if (p.get('abstract') or '').strip()]  # also tolerates "abstract": null

        print(f"📊 Loaded {len(papers)} total papers")
        print(f"✅ Found {len(valid_papers)} papers with abstracts")

        return valid_papers

    except FileNotFoundError:
        print(f"❌ File {file_path} not found")
        return None
    except Exception as e:
        print(f"❌ Error loading papers: {e}")
        return None

def analyze_paper_collection(papers):
    """Analyze the loaded paper collection"""

    if not papers:
        print("❌ No papers to analyze")
        return

    print("\n📊 PAPER COLLECTION ANALYSIS")
    print("="*50)

    # Basic stats
    total_papers = len(papers)
    # `or ''` guards against an abstract stored as None
    with_abstracts = len([p for p in papers if (p.get('abstract') or '').strip()])

    print(f"📚 Total papers: {total_papers}")
    print(f"📝 With abstracts: {with_abstracts}")
    print(f"📈 Success rate: {with_abstracts/total_papers*100:.1f}%")

    # Journal analysis
    journals = [p.get('journal', 'Unknown') for p in papers if p.get('journal')]
    if journals:
        journal_counts = pd.Series(journals).value_counts()
        print(f"\n📖 Top journals:")
        for journal, count in journal_counts.head().items():
            print(f"   • {journal}: {count} papers")

    # Abstract length analysis
    abstract_lengths = [len(p.get('abstract', '')) for p in papers if p.get('abstract')]
    if abstract_lengths:
        print(f"\n📏 Abstract lengths:")
        print(f"   • Average: {np.mean(abstract_lengths):.0f} characters")
        print(f"   • Range: {min(abstract_lengths)} - {max(abstract_lengths)}")

    print("="*50)

# TO LOAD YOUR OWN PAPERS, UNCOMMENT THE BLOCK BELOW (REMOVE THE TRIPLE QUOTES AROUND IT)
"""
print("📁 Loading your collected IOLR papers...")

# Replace with your file path
your_papers = load_collected_papers('iolr_2022_abstracts_abstracts_only.json')

if your_papers:
    print(f"🔄 Replacing sample data with {len(your_papers)} collected papers...")

    # Analyze the collection
    analyze_paper_collection(your_papers)

    # Create new RAG system with your papers
    rag_system = EcologicalRAG(openai_api_key=OPENAI_API_KEY)
    rag_system.load_papers(your_papers)

    print("✅ Your papers loaded successfully!")
else:
    print("⚠️ Could not load your papers, continuing with sample data")
"""

print("📋 Ready to load your own papers!")
print("Uncomment the code above and set your file path")
print("📋 Next: Run Cell 11 for Gradio interface (optional) or Cell 12 for analytics")

📋 Ready to load your own papers!
Uncomment the code above and set your file path
📋 Next: Run Cell 11 for Gradio interface (optional) or Cell 12 for analytics
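
The loader in Cell 10 expects a JSON array of paper records and keeps only those with a non-empty `abstract`. The sketch below illustrates that input shape with a filter in the spirit of `load_collected_papers`, written so that it also tolerates a JSON `null`. The `title` field, the sample records, and the temporary file name are assumptions made for illustration; they are not taken from the scraper's actual output.

```python
import json
import os
import tempfile

# Hypothetical records: Cell 10 only reads 'abstract' and 'journal';
# 'title' and every value below are made up for illustration.
sample_papers = [
    {"title": "Paper A", "journal": "Journal X", "abstract": "Lionfish spread along the coast."},
    {"title": "Paper B", "journal": "Journal Y", "abstract": "   "},   # whitespace only
    {"title": "Paper C", "journal": "Journal X"},                      # no abstract key
    {"title": "Paper D", "abstract": "Cyanobacteria bloom dynamics."},  # no journal key
    {"title": "Paper E", "journal": "Journal Z", "abstract": None},    # JSON null
]

# Write the records to a temporary JSON file, as a scraper export might look.
path = os.path.join(tempfile.mkdtemp(), "sample_papers.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample_papers, f, ensure_ascii=False)

# Read the file back and keep only records with a usable abstract.
with open(path, "r", encoding="utf-8") as f:
    papers = json.load(f)

valid_papers = [p for p in papers if (p.get("abstract") or "").strip()]

print(f"Loaded {len(papers)} records, {len(valid_papers)} with abstracts")
```

Records without a `journal` key are still kept; they are only skipped by the journal count in `analyze_paper_collection`.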


In [None]:
# CELL 11: Gradio Web Interface (OPTIONAL)
# ========================================
"""
🎨 CELL 11: GRADIO WEB INTERFACE (OPTIONAL)
This cell creates a web-based interface using Gradio.
Only run this if Gradio was installed successfully.
"""

if GRADIO_AVAILABLE:

    def gradio_query(question, n_results=3):
        """Query function for Gradio interface"""
        if not question.strip():
            return "Please enter a question about ecological research."

        try:
            result = rag_system.query(question, n_results=int(n_results))
            return result['response']
        except Exception as e:
            return f"Error: {e}"

    def create_gradio_interface():
        """Create Gradio web interface"""

        # Example questions for the interface
        examples = [
            ["How do invasive species affect Mediterranean marine ecosystems?", 3],
            ["What are the impacts of marine heatwaves on coral reefs?", 3],
            ["How can satellites detect cyanobacteria blooms in lakes?", 3],
            ["What causes jellyfish population blooms?", 3],
            ["How does climate change affect deep-sea ecosystems?", 3]
        ]

        # Create interface
        interface = gr.Interface(
            fn=gradio_query,
            inputs=[
                gr.Textbox(
                    label="🔍 Ask your ecological question",
                    placeholder="e.g., How do invasive fish species impact native Mediterranean fauna?",
                    lines=2
                ),
                gr.Slider(
                    minimum=1,
                    maximum=5,
                    value=3,
                    step=1,
                    label="📊 Number of papers to search"
                )
            ],
            outputs=gr.Textbox(
                label="📋 Research-based Answer",
                lines=15
            ),
            title="🌊 Ecological RAG System - IOLR Research Assistant",
            description="""
            Ask questions about marine and freshwater ecology research!
            This system searches through IOLR (Israeli Oceanographic & Limnological Research) papers
            to provide evidence-based answers about Mediterranean Sea and Lake Kinneret ecosystems.
            """,
            examples=examples,
            theme=gr.themes.Soft()
        )

        return interface

    print("🎨 Creating Gradio web interface...")
    interface = create_gradio_interface()
    interface.launch(share=True)

    print("🚀 Web interface launched. To relaunch it later, run: interface.launch(share=True)")
    print("📱 Open the public URL above in a new browser tab")

else:
    print("⚠️ Gradio not available. Use Cell 9 for simple interface instead.")

print("📋 Next: Run Cell 12 for analytics and evaluation")

🎨 Creating Gradio web interface...
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://2ffb8fa8e992840c41.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


🚀 Web interface launched. To relaunch it later, run: interface.launch(share=True)
📱 Open the public URL above in a new browser tab
📋 Next: Run Cell 12 for analytics and evaluation
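
The Gradio callback can be checked without launching the web UI, which is handy when Gradio is not installed. The sketch below reproduces the callback logic from Cell 11 and runs it against `StubRAG`, a hypothetical stand-in for `rag_system` that returns only the `response` key the callback reads. The stub and its response text are assumptions for illustration.

```python
class StubRAG:
    """Hypothetical stand-in for the notebook's rag_system object."""

    def query(self, question, n_results=3):
        # Mimic only the one key that gradio_query reads.
        return {"response": f"{n_results} papers searched for: {question}"}


rag_system = StubRAG()


def gradio_query(question, n_results=3):
    """Same logic as the Cell 11 callback, reproduced here to run standalone."""
    if not question.strip():
        return "Please enter a question about ecological research."
    try:
        # int() guards against the slider value arriving as a non-integer number
        result = rag_system.query(question, n_results=int(n_results))
        return result["response"]
    except Exception as e:
        return f"Error: {e}"


print(gradio_query("   "))
print(gradio_query("What causes jellyfish blooms?", 2.0))
```

Keeping the callback a plain function of `(question, n_results)` is what makes this kind of check possible: the UI layer only wires inputs to it.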


In [None]:
# CELL 12: Analytics and Evaluation
# =================================
"""
📈 CELL 12: ANALYTICS AND EVALUATION
This cell provides tools to analyze and evaluate RAG system performance.
"""

class QueryAnalytics:
    """Analytics for RAG system queries"""

    def __init__(self, rag_system):
        self.rag_system = rag_system
        self.query_history = []

    def logged_query(self, question, n_results=3):
        """Query with logging for analytics"""

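        # perf_counter() is a monotonic clock, better suited to timing than time.time()
        start_time = time.perf_counter()
        result = self.rag_system.query(question, n_results=n_results)
        end_time = time.perf_counter()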

        # Log the query
        log_entry = {
            'timestamp': time.time(),
            'question': question,
            'response_time': end_time - start_time,
            'papers_found': result['papers_found'],
            'response_length': len(result['response']),
            'result': result
        }

        self.query_history.append(log_entry)
        return result

    def get_analytics(self):
        """Get analytics summary"""

        if not self.query_history:
            return "No queries logged yet"

        df = pd.DataFrame(self.query_history)

        analytics = {
            'total_queries': len(self.query_history),
            'avg_response_time': df['response_time'].mean(),
            'avg_papers_found': df['papers_found'].mean(),
            'avg_response_length': df['response_length'].mean(),
            'most_common_topics': self._extract_topics(),
            'recent_queries': df.tail(5)['question'].tolist()
        }

        return analytics

    def _extract_topics(self):
        """Extract common topics from queries"""
        all_queries = ' '.join([q['question'].lower() for q in self.query_history])

        # Common ecological terms
        topics = {
            'invasive species': all_queries.count('invasive'),
            'climate change': all_queries.count('climate') + all_queries.count('warming'),
            'marine ecosystems': all_queries.count('marine') + all_queries.count('ocean'),
            'freshwater': all_queries.count('lake') + all_queries.count('freshwater'),
            'conservation': all_queries.count('conservation') + all_queries.count('protect'),
            'pollution': all_queries.count('pollution') + all_queries.count('contamination')
        }

        return {k: v for k, v in topics.items() if v > 0}

    def print_analytics(self):
        """Print formatted analytics"""

        analytics = self.get_analytics()

        if isinstance(analytics, str):
            print(analytics)
            return

        print("\n📈 RAG SYSTEM ANALYTICS")
        print("="*40)
        print(f"🔍 Total queries: {analytics['total_queries']}")
        print(f"⏱️ Avg response time: {analytics['avg_response_time']:.2f}s")
        print(f"📚 Avg papers found: {analytics['avg_papers_found']:.1f}")
        print(f"📝 Avg response length: {analytics['avg_response_length']:.0f} chars")

        if analytics['most_common_topics']:
            print(f"\n🏷️ Common topics:")
            for topic, count in analytics['most_common_topics'].items():
                print(f"   • {topic}: {count} mentions")

        if analytics['recent_queries']:
            print(f"\n🕒 Recent queries:")
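            for i, query in enumerate(analytics['recent_queries'], 1):
                # Add an ellipsis only when the question was actually truncated
                suffix = "..." if len(query) > 60 else ""
                print(f"   {i}. {query[:60]}{suffix}")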

# Initialize analytics
analytics = QueryAnalytics(rag_system)

def test_system_performance():
    """Test system with various queries"""

    test_queries = [
        "What invasive species affect Mediterranean marine ecosystems?",
        "How do marine heatwaves impact coral reefs?",
        "What methods detect cyanobacteria blooms in freshwater?",
        "How does climate change affect deep-sea biodiversity?",
        "What are the effects of jellyfish blooms on coastal ecosystems?",
        "How do pollutants affect marine food chains?",
        "What conservation strategies protect Mediterranean biodiversity?",
        "How do invasive algae impact native species?"
    ]

    print("🧪 Testing system performance with sample queries...")

    for i, query in enumerate(test_queries, 1):
        print(f"[{i}/{len(test_queries)}] Testing: {query[:50]}...")
        result = analytics.logged_query(query)
        print(f"   ✅ Found {result['papers_found']} papers")

    print("\n📊 Performance test completed!")
    analytics.print_analytics()

print("✅ Analytics system ready!")
print("📊 Run: test_system_performance() to test with sample queries")
print("📈 Run: analytics.print_analytics() to see current stats")
print("📋 Next: Run Cell 13 for advanced features")

✅ Analytics system ready!
📊 Run: test_system_performance() to test with sample queries
📈 Run: analytics.print_analytics() to see current stats
📋 Next: Run Cell 13 for advanced features
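
The logging pattern used by `QueryAnalytics` can be tried in isolation with only the standard library. The sketch below times each call, stores one log entry per query, and averages the results. `stub_query` is a hypothetical stand-in for `rag_system.query`, and its return values are made up. It uses `time.perf_counter()` for the interval and `time.time()` only for the wall-clock timestamp.

```python
import statistics
import time


def stub_query(question, n_results=3):
    """Hypothetical stand-in for rag_system.query."""
    return {"papers_found": n_results, "response": f"Answer to: {question}"}


history = []

for question in ["What causes jellyfish blooms?", "How do heatwaves affect corals?"]:
    start = time.perf_counter()          # monotonic clock, suited to measuring intervals
    result = stub_query(question)
    elapsed = time.perf_counter() - start

    history.append({
        "timestamp": time.time(),        # wall-clock time of the query
        "question": question,
        "response_time": elapsed,
        "papers_found": result["papers_found"],
        "response_length": len(result["response"]),
    })

# Aggregate the log into a small summary dictionary.
summary = {
    "total_queries": len(history),
    "avg_response_time": statistics.mean(h["response_time"] for h in history),
    "avg_papers_found": statistics.mean(h["papers_found"] for h in history),
}
print(summary)
```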


In [None]:
# CELL 13: Advanced Query Features
# ================================
"""
🎯 CELL 13: ADVANCED QUERY FEATURES
This cell adds advanced features like query suggestions and filters.
"""

class AdvancedQuerySystem:
    """Enhanced query system with advanced features"""

    def __init__(self, rag_system):
        self.rag_system = rag_system
        self.common_terms = self._build_term_index()

    def _build_term_index(self):
        """Build index of common terms from papers"""

        terms = {
            'species': ['fish', 'coral', 'algae', 'plankton', 'jellyfish', 'toadfish', 'cyanobacteria'],
            'locations': ['mediterranean', 'red sea', 'kinneret', 'levantine', 'eastern mediterranean'],
            'phenomena': ['heatwave', 'bloom', 'invasion', 'mortality', 'warming', 'acidification'],
            'methods': ['remote sensing', 'satellite', 'pcr', 'dna', 'survey', 'modeling', 'analysis'],
            'ecosystems': ['coral reef', 'deep sea', 'coastal', 'freshwater', 'marine', 'benthic']
        }

        return terms

    def suggest_queries(self, partial_query=""):
        """Suggest query completions"""

        suggestions = []

        # Template-based suggestions
        templates = [
            "What are the impacts of {phenomena} on {ecosystems}?",
            "How do {species} affect {ecosystems}?",
            "What {methods} are used to study {species}?",
            "How does climate change affect {species} in the {locations}?",
            "What causes {phenomena} in {locations}?"
        ]

        # Generate suggestions
        for template in templates:
            for category, terms in self.common_terms.items():
                if '{' + category + '}' in template:
                    for term in terms[:2]:  # Limit to 2 terms per category
                        suggestion = template.replace('{' + category + '}', term)
                        # Fill other placeholders with generic terms
                        for cat, term_list in self.common_terms.items():
                            suggestion = suggestion.replace('{' + cat + '}', term_list[0])
                        suggestions.append(suggestion)

        # Filter by partial query if provided
        if partial_query:
            suggestions = [s for s in suggestions if partial_query.lower() in s.lower()]

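        # dict.fromkeys removes duplicates while keeping insertion order,
        # so the same 10 suggestions are returned on every run (set order is not stable)
        return list(dict.fromkeys(suggestions))[:10]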

    def explain_query(self, question):
        """Explain how the query will be processed"""

        print(f"🔍 QUERY ANALYSIS: {question}")
        print("="*50)

        # Extract key terms
        question_lower = question.lower()
        found_terms = {}

        for category, terms in self.common_terms.items():
            found = [term for term in terms if term in question_lower]
            if found:
                found_terms[category] = found

        if found_terms:
            print("🏷️ Detected terms:")
            for category, terms in found_terms.items():
                print(f"   • {category.title()}: {', '.join(terms)}")

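        # Suggest related queries that mention any detected term.
        # (Filtering on the full question text would almost never match a template.)
        detected = [term for terms in found_terms.values() for term in terms]
        suggestions = [
            s for s in self.suggest_queries()
            if any(term in s.lower() for term in detected)
        ]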
        if suggestions:
            print(f"\n💡 Related queries you might try:")
            for i, suggestion in enumerate(suggestions[:3], 1):
                print(f"   {i}. {suggestion}")

        print("="*50)

# Initialize advanced query system
advanced_query = AdvancedQuerySystem(rag_system)

def interactive_query_builder():
    """Interactive query builder with suggestions"""

    print("🎯 ADVANCED QUERY BUILDER")
    print("="*40)
    print("Type 'help' for commands, 'quit' to exit")

    while True:
        try:
            command = input("\n💬 Command: ").strip().lower()

            if command == 'quit':
                break

            elif command == 'help':
                print("\n📋 Available commands:")
                print("   • suggest - Get query suggestions")
                print("   • explain <query> - Explain query processing")
                print("   • query <question> - Regular query")
                print("   • quit - Exit")

            elif command == 'suggest':
                suggestions = advanced_query.suggest_queries()
                print("\n💡 Query suggestions:")
                for i, suggestion in enumerate(suggestions[:5], 1):
                    print(f"   {i}. {suggestion}")

            elif command.startswith('explain '):
                query = command[8:]
                advanced_query.explain_query(query)

            elif command.startswith('query '):
                question = command[6:]
                result = rag_system.query(question)
                print("\n" + "="*50)
                print(result['response'])
                print("="*50)

            else:
                print("❓ Unknown command. Type 'help' for available commands.")

        except KeyboardInterrupt:
            break
        except Exception as e:
            print(f"❌ Error: {e}")

print("🎯 Advanced query features loaded!")
print("💡 Run: advanced_query.suggest_queries() for suggestions")
print("🔍 Run: advanced_query.explain_query('your question') for analysis")
print("🎨 Run: interactive_query_builder() for interactive interface")
print("📋 Next: Run Cell 14 for system summary")

🎯 Advanced query features loaded!
💡 Run: advanced_query.suggest_queries() for suggestions
🔍 Run: advanced_query.explain_query('your question') for analysis
🎨 Run: interactive_query_builder() for interactive interface
📋 Next: Run Cell 14 for system summary
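
Template-based suggestions like those in Cell 13 can also be generated by filling every placeholder combination at once. The sketch below is one possible alternative, not the notebook's method: `expand_template` and its `per_category` parameter are hypothetical names, and the term lists are shortened copies of those in `_build_term_index`. It uses `string.Formatter` to find the placeholders and `itertools.product` to enumerate the combinations in a stable order.

```python
from itertools import product
from string import Formatter


def expand_template(template, term_index, per_category=2):
    """Fill every placeholder combination, keeping a stable order."""
    # Collect placeholder names in order of first appearance,
    # e.g. ['species', 'ecosystems'] for the template used below.
    fields = []
    for _, name, _, _ in Formatter().parse(template):
        if name and name not in fields:
            fields.append(name)

    # Take the first few terms of each category and combine them.
    choices = [term_index[name][:per_category] for name in fields]
    return [
        template.format(**dict(zip(fields, combo)))
        for combo in product(*choices)
    ]


terms = {
    "species": ["fish", "coral", "algae"],
    "ecosystems": ["coral reef", "deep sea"],
}

for suggestion in expand_template("How do {species} affect {ecosystems}?", terms):
    print(suggestion)
```

With two terms per category, a template with two placeholders yields four suggestions, so `per_category` controls how quickly the list grows.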


In [None]:
# CELL 14: System Summary and Testing
# ===================================
"""
📋 CELL 14: SYSTEM SUMMARY AND TESTING
This cell provides a summary of the complete RAG system and quick tests.
"""

def print_system_summary():
    """Print complete system summary"""

    print("🌊 ECOLOGICAL RAG SYSTEM - COMPLETE SETUP")
    print("="*60)

    # System status
    print("🔧 SYSTEM STATUS:")
    print(f"   ✅ Vector Store: {'ChromaDB' if CHROMADB_AVAILABLE else 'Simple Store'}")
    print(f"   ✅ Embeddings: {'Transformer' if TRANSFORMERS_AVAILABLE else 'TF-IDF'}")
    print(f"   ✅ Generation: {'OpenAI GPT' if OPENAI_AVAILABLE and rag_system.use_openai else 'Template'}")
    print(f"   ✅ Interface: {'Gradio' if GRADIO_AVAILABLE else 'Command Line'}")

    # Data status
    if hasattr(rag_system, 'collection'):
        try:
            paper_count = rag_system.collection.count()
            print(f"   ✅ Papers Loaded: {paper_count}")
        except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
            print(f"   ✅ Papers Loaded: {len(rag_system.papers) if hasattr(rag_system, 'papers') else 'Unknown'}")

    # Available functions
    print("\n🛠️ AVAILABLE FUNCTIONS:")
    print("   • rag_system.query(question) - Basic query")
    print("   • analytics.logged_query(question) - Query with analytics")
    print("   • advanced_query.suggest_queries() - Get suggestions")
    print("   • query_interface() - Simple text interface")
    print("   • interactive_query_builder() - Advanced interface")
    if GRADIO_AVAILABLE:
        print("   • interface.launch() - Web interface")

    # Example queries
    print("\n💡 EXAMPLE QUERIES:")
    examples = [
        "What invasive species threaten Mediterranean biodiversity?",
        "How do marine heatwaves affect coral reef ecosystems?",
        "What remote sensing methods detect harmful algal blooms?",
        "How does climate change impact deep-sea communities?",
        "What are the ecological effects of jellyfish population booms?"
    ]

    for i, example in enumerate(examples, 1):
        print(f"   {i}. {example}")

    print("\n🎯 QUICK START:")
    print("   1. query_interface() - Start asking questions")
    print("   2. test_system_performance() - Run performance tests")
    print("   3. analytics.print_analytics() - View analytics")

    print("="*60)
    print("🎉 Your Ecological RAG System is ready!")
    print("Happy researching! 🌊🔬📊")

def quick_test():
    """Quick test of the system"""
    print("\n🧪 QUICK SYSTEM TEST")
    print("-"*30)

    test_questions = [
        "What invasive fish species affect Mediterranean ecosystems?",
        "How do marine heatwaves impact coral communities?",
        "What methods are used to monitor cyanobacteria?"
    ]

    for i, question in enumerate(test_questions, 1):
        print(f"\n[Test {i}] {question}")
        try:
            result = rag_system.query(question, n_results=2)
            print(f"✅ Success: Found {result['papers_found']} papers")
            print(f"📝 Response length: {len(result['response'])} characters")
        except Exception as e:
            print(f"❌ Error: {e}")

    print("\n✅ Quick test completed!")

def demo_all_features():
    """Demonstrate all system features"""

    print("🎬 FULL SYSTEM DEMONSTRATION")
    print("="*50)

    # Test basic query
    print("\n1️⃣ BASIC QUERY TEST")
    result = rag_system.query("How do invasive species affect Mediterranean ecosystems?")
    print(f"✅ Found {result['papers_found']} papers")

    # Test analytics
    print("\n2️⃣ ANALYTICS TEST")
    analytics.logged_query("What causes marine heatwaves?")
    print(f"✅ Analytics logged: {len(analytics.query_history)} total queries")

    # Test suggestions
    print("\n3️⃣ QUERY SUGGESTIONS TEST")
    suggestions = advanced_query.suggest_queries()
    print(f"✅ Generated {len(suggestions)} suggestions")
    for i, suggestion in enumerate(suggestions[:3], 1):
        print(f"   {i}. {suggestion}")

    # Test query explanation
    print("\n4️⃣ QUERY EXPLANATION TEST")
    advanced_query.explain_query("How do jellyfish affect marine ecosystems?")

    print("\n🎉 All features working!")

# Print system summary
print_system_summary()

# Available test functions
print("\n🧪 AVAILABLE TESTS:")
print("   • quick_test() - Quick functionality test")
print("   • demo_all_features() - Full feature demonstration")
print("   • test_system_performance() - Comprehensive performance test")

print("\n🚀 READY TO USE!")
print("Run any of the test functions or start with query_interface()")

🌊 ECOLOGICAL RAG SYSTEM - COMPLETE SETUP
🔧 SYSTEM STATUS:
   ✅ Vector Store: Simple Store
   ✅ Embeddings: Transformer
   ✅ Generation: Template
   ✅ Interface: Gradio
   ✅ Papers Loaded: 6

🛠️ AVAILABLE FUNCTIONS:
   • rag_system.query(question) - Basic query
   • analytics.logged_query(question) - Query with analytics
   • advanced_query.suggest_queries() - Get suggestions
   • query_interface() - Simple text interface
   • interactive_query_builder() - Advanced interface
   • interface.launch() - Web interface

💡 EXAMPLE QUERIES:
   1. What invasive species threaten Mediterranean biodiversity?
   2. How do marine heatwaves affect coral reef ecosystems?
   3. What remote sensing methods detect harmful algal blooms?
   4. How does climate change impact deep-sea communities?
   5. What are the ecological effects of jellyfish population booms?

🎯 QUICK START:
   1. query_interface() - Start asking questions
   2. test_system_performance() - Run performance tests
   3. analytics.print_an