# Agentic Corrective RAG for Document Extraction

Complete pipeline:
1. PDF → Markdown conversion (Docling)
2. Advanced chunking (min/max size, tables, overlap)
3. LLM tagging & metadata extraction
4. Vector storage (ChromaDB) + embeddings
5. Hybrid retrieval (Semantic + BM25)
6. **Agentic Corrective RAG** (query rewriting, relevance grading, iterative retrieval)
7. Final document extraction

In [None]:
# Install dependencies
# pip install chromadb rank-bm25 groq docling

import os
import re
from typing import List, Dict
from groq import Groq
import chromadb
from chromadb.utils import embedding_functions
from rank_bm25 import BM25Okapi
import numpy as np
from docling.document_converter import DocumentConverter

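# Set API key: keep GROQ_KEY if it is already exported; otherwise fall back to an
# empty placeholder (export GROQ_KEY or fill it in before making API calls)
os.environ.setdefault("GROQ_KEY", "")
groq_client = Groq(api_key=os.environ["GROQ_KEY"])

print("✅ All imports successful!")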


✅ All imports successful!


In [34]:
# pip install docling python-dotenv groq
import os
from pathlib import Path
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, PictureDescriptionApiOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

GROQ_API_KEY = os.environ["GROQ_KEY"]  # set this in your env

source = r"C:\Users\madha\OneDrive\Documents\PDF Data Extraction Pipeline\Data\s12887-024-04607-3.pdf"

# 1) Configure Docling to call a remote VLM
opts = PdfPipelineOptions(
    enable_remote_services=True,         # REQUIRED for remote APIs
)
opts.do_picture_description = True
opts.picture_description_options = PictureDescriptionApiOptions(
    # Groq's OpenAI-compatible chat.completions endpoint:
    url="https://api.groq.com/openai/v1/chat/completions",
    headers={"Authorization": f"Bearer {GROQ_API_KEY}"},
    params={
        # Pick a Groq vision model that accepts image inputs:
        "model": "meta-llama/llama-4-scout-17b-16e-instruct",
        # optional tuning:
        "max_completion_tokens": 350,
        "temperature": 0.1,
        "top_p": 0.10,
        # "seed": 42,  # (optional) for reproducibility
    },
    prompt="If it's a scanned picture of text, just extract the text. If it's a diagram or figure, describe it concisely and accurately, including axes/units if visible. If the image is a flowchart, describe the steps in order. If the image is a diagram, describe the components and their relationships. If the image is a graph, identify and describe the axes, trends, and patterns.",
    timeout=90,
)

converter = DocumentConverter(format_options={
    InputFormat.PDF: PdfFormatOption(pipeline_options=opts)
})

doc = converter.convert(source).document

2025-10-04 05:26:50,790 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-10-04 05:26:50,808 - INFO - Going to convert document batch...
2025-10-04 05:26:50,809 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 6aac4dd683237d460df8729816c2accd
2025-10-04 05:26:50,810 - INFO - Accelerator device: 'cuda:0'
2025-10-04 05:26:53,868 - INFO - Accelerator device: 'cuda:0'
2025-10-04 05:26:55,527 - INFO - Accelerator device: 'cuda:0'
2025-10-04 05:26:56,522 - INFO - Processing document s12887-024-04607-3.pdf
2025-10-04 05:27:24,467 - INFO - Finished converting document s12887-024-04607-3.pdf in 33.72 sec.


In [35]:
# Step 2: Advanced Chunking Functions
print("Step 2: Setting up chunking functions...")
print("=" * 60)

def extract_tables(markdown_text: str) -> List[Dict[str, object]]:
    """Extract tables from markdown."""
    tables = []
    table_pattern = r'(\|[^\n]+\|[\n\r]+\|[-:\s|]+\|[\n\r]+(?:\|[^\n]+\|[\n\r]+)*)'
    
    for match in re.finditer(table_pattern, markdown_text):
        tables.append({
            'content': match.group(0),
            'start': match.start(),
            'end': match.end()
        })
    return tables

def recursive_chunk_with_overlap(text: str, max_size: int = 8000, overlap: int = 200) -> List[str]:
    """Recursively split text at paragraph/sentence/word boundaries with overlap."""
    if len(text) <= max_size:
        return [text]
    
    chunks = []
    
    # Try paragraphs first
    paragraphs = text.split('\n\n')
    if len(paragraphs) > 1:
        current_chunk = ""
        
        for para in paragraphs:
            if len(current_chunk) + len(para) + 2 <= max_size:
                current_chunk += ("\n\n" if current_chunk else "") + para
            else:
                if current_chunk:
                    chunks.append(current_chunk)
                    overlap_text = current_chunk[-overlap:] if len(current_chunk) > overlap else current_chunk
                    current_chunk = overlap_text + "\n\n" + para
                else:
                    sub_chunks = recursive_chunk_with_overlap(para, max_size, overlap)
                    chunks.extend(sub_chunks[:-1])
                    current_chunk = sub_chunks[-1] if sub_chunks else ""
        
        if current_chunk:
            chunks.append(current_chunk)
        return chunks
    
    # Try sentences
    sentences = re.split(r'(?<=[.!?])\s+', text)
    if len(sentences) > 1:
        current_chunk = ""
        
        for sent in sentences:
            if len(current_chunk) + len(sent) + 1 <= max_size:
                current_chunk += (" " if current_chunk else "") + sent
            else:
                if current_chunk:
                    chunks.append(current_chunk)
                    overlap_text = current_chunk[-overlap:]
                    current_chunk = overlap_text + " " + sent
                else:
                    sub_chunks = recursive_chunk_with_overlap(sent, max_size, overlap)
                    chunks.extend(sub_chunks[:-1])
                    current_chunk = sub_chunks[-1] if sub_chunks else ""
        
        if current_chunk:
            chunks.append(current_chunk)
        return chunks
    
    # Last resort: words
    words = text.split()
    current_chunk = ""
    
    for word in words:
        if len(current_chunk) + len(word) + 1 <= max_size:
            current_chunk += (" " if current_chunk else "") + word
        else:
            if current_chunk:
                chunks.append(current_chunk)
                overlap_text = current_chunk[-overlap:]
                current_chunk = overlap_text + " " + word
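            else:
                # A single word longer than max_size: hard-split it into max_size pieces
                pieces = [word[k:k + max_size] for k in range(0, len(word), max_size)]
                chunks.extend(pieces[:-1])
                current_chunk = pieces[-1]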
    
    if current_chunk:
        chunks.append(current_chunk)
    
    return chunks

def chunk_document(markdown_text: str, 
                   min_chunk_size: int = 1000, 
                   max_chunk_size: int = 8000,
                   overlap: int = 200) -> List[Dict[str, str]]:
    """Main chunking function with all strategies."""
    
    tables = extract_tables(markdown_text)
    table_ranges = [(t['start'], t['end']) for t in tables]
    
    # Split by headers
    pattern = r'(^#{2,3}\s+.+?)(?=\n#{2,3}\s+|\Z)'
    matches = list(re.finditer(pattern, markdown_text, re.MULTILINE | re.DOTALL))
    
    # If no headers, use recursive chunking
    if not matches:
        print("No markdown headers found. Using recursive chunking...")
        chunk_texts = recursive_chunk_with_overlap(markdown_text, max_chunk_size, overlap)
        return [{
            "header": f"Chunk {i+1}",
            "content": text,
            "char_count": len(text),
            "contains_table": False
        } for i, text in enumerate(chunk_texts)]
    
    # Process header-based chunks
    raw_chunks = []
    for match in matches:
        chunk_text = match.group(1).strip()
        if chunk_text and len(chunk_text) > 50:
            header_match = re.match(r'^(#{2,3}\s+)(.+)', chunk_text)
            header = header_match.group(2).strip() if header_match else "Unknown"
            
            chunk_start = match.start()
            chunk_end = match.end()
            contains_table = any(
                (chunk_start <= t_start < chunk_end) or (chunk_start < t_end <= chunk_end)
                for t_start, t_end in table_ranges
            )
            
            raw_chunks.append({
                "header": header,
                "content": chunk_text,
                "char_count": len(chunk_text),
                "contains_table": contains_table
            })
    
    # Combine small, split large
    processed_chunks = []
    i = 0
    
    while i < len(raw_chunks):
        current = raw_chunks[i]
        
        # Keep tables as-is
        if current['contains_table']:
            processed_chunks.append(current)
            i += 1
            continue
        
        # Combine small chunks
        if current['char_count'] < min_chunk_size:
            combined_content = current['content']
            combined_header = current['header']
            j = i + 1
            
            while j < len(raw_chunks) and len(combined_content) < min_chunk_size:
                if raw_chunks[j]['contains_table']:
                    break
                combined_content += "\n\n" + raw_chunks[j]['content']
                combined_header += " + " + raw_chunks[j]['header']
                j += 1
            
            # Check if combined is now too large
            if len(combined_content) > max_chunk_size:
                split_texts = recursive_chunk_with_overlap(combined_content, max_chunk_size, overlap)
                for k, text in enumerate(split_texts, 1):
                    processed_chunks.append({
                        "header": f"{combined_header} (Part {k})" if len(split_texts) > 1 else combined_header,
                        "content": text.strip(),
                        "char_count": len(text.strip()),
                        "contains_table": False
                    })
            else:
                processed_chunks.append({
                    "header": combined_header,
                    "content": combined_content,
                    "char_count": len(combined_content),
                    "contains_table": False
                })
            i = j
            continue
        
        # Split large chunks
        if current['char_count'] > max_chunk_size:
            split_texts = recursive_chunk_with_overlap(current['content'], max_chunk_size, overlap)
            for k, text in enumerate(split_texts, 1):
                processed_chunks.append({
                    "header": f"{current['header']} (Part {k})" if len(split_texts) > 1 else current['header'],
                    "content": text.strip(),
                    "char_count": len(text.strip()),
                    "contains_table": False
                })
            i += 1
            continue
        
        # Just right
        processed_chunks.append(current)
        i += 1
    
    return processed_chunks

print("✅ Chunking functions ready!")

Step 2: Setting up chunking functions...
✅ Chunking functions ready!
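
To see what the overlap does in practice, here is a minimal standalone sketch of the paragraph-packing step only. `pack_paragraphs` is a hypothetical helper written for this illustration, not part of the notebook: it mirrors the paragraph branch of `recursive_chunk_with_overlap` but, as a simplifying assumption, does not recurse into oversized paragraphs.

```python
from typing import List

def pack_paragraphs(text: str, max_size: int = 60, overlap: int = 10) -> List[str]:
    """Greedy paragraph packing with a character overlap between chunks (illustrative only)."""
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        if len(current) + len(para) + 2 <= max_size:
            # Paragraph still fits: append it to the chunk being built
            current += ("\n\n" if current else "") + para
        else:
            if current:
                chunks.append(current)
                # Carry the tail of the finished chunk into the next one
                current = current[-overlap:] + "\n\n" + para
            else:
                # Oversized first paragraph: kept whole in this sketch
                current = para
    if current:
        chunks.append(current)
    return chunks

# Three short paragraphs of repeated words, purely made-up sample text
sample = "\n\n".join(["alpha " * 5, "beta " * 5, "gamma " * 5]).strip()
parts = pack_paragraphs(sample, max_size=60, overlap=10)
for p in parts:
    print(len(p), repr(p[:20]))
```

Each chunk after the first starts with the last `overlap` characters of the previous one. Because the overlap is prepended before the next paragraph is added, a chunk can end up longer than `max_size`; the same holds for the paragraph branch of the function above.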


In [36]:
markdown_content = doc.export_to_markdown()
print(markdown_content)

†

## RESEARCH

## Open Access

## Factors influencing necrotizing enterocolitis in premature infants in China: a systematic review and meta-analysis

<!-- image -->

Shuliang Zhao 1,2† , Huimin Jiang 1† , Yiqun Miao 3 , Wenwen Liu 4 , Yanan Li 1 , Hui Liu 1 , Aihua Wang 1* , Xinghui Cui 2* and Yuanyuan Zhang 1

## Abstract

Background Necrotizing enterocolitis (NEC) is a multifactorial gastrointestinal disease with high morbidity and mortality among premature infants. However, studies with large samples on the factors of NEC in China have not been reported. This meta-analysis aims to systematically review the literature to explore the influencing factors of necrotizing enterocolitis in premature infants in China and provide a reference for the prevention of NEC.

Methods PubMed, Embase, Web of Science, Cochrane Library, China National Knowledge Infrastructure (CNKI), China Biomedical Literature Database (CBM), Wanfang and VIP databases were systematically searched from incep

In [38]:
# Step 3: Chunk the document
print("Step 3: Chunking document...")
print("=" * 60)
markdown_content = doc.export_to_markdown()
chunks = chunk_document(markdown_content, min_chunk_size=1000, max_chunk_size=3000, overlap=200)
if len(chunks) == 0:
    markdown_content = doc.export_to_html()
    chunks = chunk_document(markdown_content, min_chunk_size=1000, max_chunk_size=3000, overlap=200)

print(f"\n✅ Created {len(chunks)} chunks")
print(f"\nSize distribution:")
print(f"  Small (<1000): {sum(1 for c in chunks if c['char_count'] < 1000)}")
print(f"  Medium (1000-8000): {sum(1 for c in chunks if 1000 <= c['char_count'] <= 8000)}")
print(f"  Large (>8000): {sum(1 for c in chunks if c['char_count'] > 8000)}")
print(f"  Tables: {sum(1 for c in chunks if c.get('contains_table', False))}")

print(f"\nFirst 5 chunks:")
for i, chunk in enumerate(chunks[:5]):
    marker = " [TABLE]" if chunk.get('contains_table') else ""
    print(f"  {i+1}. {chunk['header'][:60]}... ({chunk['char_count']} chars){marker}")

Step 3: Chunking document...

✅ Created 28 chunks

Size distribution:
  Small (<1000): 5
  Medium (1000-8000): 20
  Large (>8000): 3
  Tables: 3

First 5 chunks:
  1. Factors influencing necrotizing enterocolitis in premature i... (3000 chars)
  2. Factors influencing necrotizing enterocolitis in premature i... (1411 chars)
  3. Introduction (Part 1)... (2624 chars)
  4. Introduction (Part 2)... (1248 chars)
  5. Search strategy... (1163 chars)


In [40]:
# Step 4: Tag chunks and extract metadata
print("Step 4: Tagging chunks with LLM...")
print("=" * 60)

tagging_prompt = """Analyze this text chunk from a research paper:

1. Assign tags (comma-separated):
   - <summary>: abstract, introduction, conclusion
   - <research_methods>: methodology, study design, data collection
   - <findings_conclusion>: results, findings, conclusions
   - <metadata>: authors, dates, affiliations

2. Extract metadata:
   - Authors: (list or None)
   - Date: (date or None)

Format:
TAGS: <tag1>, <tag2>
AUTHORS: names or None
DATE: date or None

Text:
{chunk_text}"""

document_metadata = {"authors": None, "date": None}
tagged_chunks = []

for i, chunk in enumerate(chunks):
    print(f"  Processing {i+1}/{len(chunks)}...", end='\r')
    
    prompt = tagging_prompt.format(chunk_text=chunk['content'][:2000])
    
    try:
        response = groq_client.chat.completions.create(
            model="openai/gpt-oss-120b",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
            max_completion_tokens=200,
        )
        
        result = response.choices[0].message.content
        tags = []
        
        for line in result.split('\n'):
            if line.startswith('TAGS:'):
                tags = [t.strip() for t in line.replace('TAGS:', '').strip().split(',') if t.strip()]
            elif line.startswith('AUTHORS:'):
                authors = line.replace('AUTHORS:', '').strip()
                if authors.lower() != 'none' and not document_metadata["authors"]:
                    document_metadata["authors"] = authors
            elif line.startswith('DATE:'):
                date = line.replace('DATE:', '').strip()
                if date.lower() != 'none' and not document_metadata["date"]:
                    document_metadata["date"] = date
        
        tagged_chunks.append({**chunk, "tags": tags, "chunk_id": i})
    except Exception as e:
        print(f"\n  Error on chunk {i}: {e}")
        tagged_chunks.append({**chunk, "tags": [], "chunk_id": i})

print(f"\n\n✅ Tagged {len(tagged_chunks)} chunks")
print(f"\nMetadata:")
print(f"  Authors: {document_metadata['authors']}")
print(f"  Date: {document_metadata['date']}")

tag_counts = {}
for chunk in tagged_chunks:
    for tag in chunk['tags']:
        tag_counts[tag] = tag_counts.get(tag, 0) + 1

print(f"\nTag distribution:")
for tag, count in sorted(tag_counts.items(), key=lambda x: x[1], reverse=True):
    print(f"  {tag}: {count}")

Step 4: Tagging chunks with LLM...
  Processing 1/28...

2025-10-04 05:32:58,061 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  Processing 2/28...

2025-10-04 05:32:58,774 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  Processing 3/28...

2025-10-04 05:32:59,889 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  Processing 4/28...

2025-10-04 05:33:00,595 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  Processing 5/28...

2025-10-04 05:33:01,539 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  Processing 6/28...

2025-10-04 05:33:02,544 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  Processing 7/28...

2025-10-04 05:33:03,445 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  Processing 8/28...

2025-10-04 05:33:04,024 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  Processing 9/28...

2025-10-04 05:33:04,587 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  Processing 10/28...

2025-10-04 05:33:05,759 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  Processing 11/28...

2025-10-04 05:33:06,877 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  Processing 12/28...

2025-10-04 05:33:07,851 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


  Processing 13/28...

2025-10-04 05:33:09,037 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 05:33:09,093 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 05:33:09,094 - INFO - Retrying request to /openai/v1/chat/completions in 3.000000 seconds


  Processing 14/28...

2025-10-04 05:33:13,638 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 05:33:13,718 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 05:33:13,719 - INFO - Retrying request to /openai/v1/chat/completions in 1.000000 seconds


  Processing 15/28...

2025-10-04 05:33:16,009 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 05:33:16,066 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 05:33:16,070 - INFO - Retrying request to /openai/v1/chat/completions in 6.000000 seconds


  Processing 16/28...

2025-10-04 05:33:23,691 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 05:33:23,750 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 05:33:23,751 - INFO - Retrying request to /openai/v1/chat/completions in 5.000000 seconds


  Processing 17/28...

2025-10-04 05:33:29,566 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 05:33:29,624 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 05:33:29,624 - INFO - Retrying request to /openai/v1/chat/completions in 4.000000 seconds


  Processing 18/28...

2025-10-04 05:33:34,464 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 05:33:34,519 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 05:33:34,521 - INFO - Retrying request to /openai/v1/chat/completions in 3.000000 seconds


  Processing 19/28...

2025-10-04 05:33:38,378 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 05:33:38,438 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 05:33:38,439 - INFO - Retrying request to /openai/v1/chat/completions in 4.000000 seconds


  Processing 20/28...

2025-10-04 05:33:43,270 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 05:33:43,334 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 05:33:43,335 - INFO - Retrying request to /openai/v1/chat/completions in 3.000000 seconds


  Processing 21/28...

2025-10-04 05:33:47,325 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 05:33:47,467 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 05:33:47,468 - INFO - Retrying request to /openai/v1/chat/completions in 5.000000 seconds


  Processing 22/28...

2025-10-04 05:33:53,345 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 05:33:53,401 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 05:33:53,403 - INFO - Retrying request to /openai/v1/chat/completions in 4.000000 seconds


  Processing 23/28...

2025-10-04 05:33:58,980 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 05:33:59,044 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 05:33:59,045 - INFO - Retrying request to /openai/v1/chat/completions in 5.000000 seconds


  Processing 24/28...

2025-10-04 05:34:04,819 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 05:34:04,881 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 05:34:04,882 - INFO - Retrying request to /openai/v1/chat/completions in 7.000000 seconds


  Processing 25/28...

2025-10-04 05:34:12,923 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 05:34:12,991 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 05:34:12,992 - INFO - Retrying request to /openai/v1/chat/completions in 5.000000 seconds


  Processing 26/28...

2025-10-04 05:34:19,031 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 05:34:19,091 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 05:34:19,091 - INFO - Retrying request to /openai/v1/chat/completions in 6.000000 seconds


  Processing 27/28...

2025-10-04 05:34:26,198 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 05:34:26,253 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 05:34:26,254 - INFO - Retrying request to /openai/v1/chat/completions in 2.000000 seconds


  Processing 28/28...

2025-10-04 05:34:28,956 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"




✅ Tagged 28 chunks

Metadata:
  Authors: Aihua Wang, Xinghui Cui, Huimin Jiang
  Date: 2024

Tag distribution:
  <research_methods>: 8
  <findings_conclusion>: 7
  <metadata>: 5
  <summary>: 3
  findings_conclusion: 3
  research_methods: 2
  summary: 1
  <: 1
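
The tag distribution above mixes `<research_methods>` with `research_methods` and even contains a stray `<`, presumably because the model does not always reproduce the angle brackets from the prompt (or its reply is cut off). One possible way to make the counts consistent is to normalize tags before storing them. The sketch below is an assumption, not part of the notebook: `normalize_tags` and `ALLOWED_TAGS` are hypothetical names, and the vocabulary is copied from the tagging prompt.

```python
import re
from typing import List

# Tag vocabulary taken from the tagging prompt (hypothetical constant name)
ALLOWED_TAGS = {"summary", "research_methods", "findings_conclusion", "metadata"}

def normalize_tags(tags_line: str) -> List[str]:
    """Parse a 'TAGS: ...' line into canonical tag names without angle brackets."""
    raw = tags_line.split(":", 1)[1] if ":" in tags_line else tags_line
    seen: List[str] = []
    for piece in raw.split(","):
        # Drop angle brackets and whitespace, then compare case-insensitively
        name = re.sub(r"[<>\s]", "", piece).lower()
        if name in ALLOWED_TAGS and name not in seen:
            seen.append(name)
    return seen

print(normalize_tags("TAGS: <research_methods>, findings_conclusion, <"))
```

With this in place, `<summary>` and `summary` would be counted as the same tag, and fragments such as `<` would be dropped instead of being stored.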


In [26]:
# Step 5: Store in ChromaDB
print("Step 5: Storing in ChromaDB...")
print("=" * 60)

chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(
    name="research_papers",
    embedding_function=embedding_functions.DefaultEmbeddingFunction()
)

documents = [chunk['content'] for chunk in tagged_chunks]
metadatas = [{
    "header": chunk['header'],
    "tags": ','.join(chunk['tags']),
    "char_count": chunk['char_count'],
    "chunk_id": chunk['chunk_id']
} for chunk in tagged_chunks]
ids = [f"chunk_{chunk['chunk_id']}" for chunk in tagged_chunks]

collection.add(documents=documents, metadatas=metadatas, ids=ids)

print(f"✅ Stored {len(documents)} chunks with embeddings")
print(f"Collection size: {collection.count()} items")

Step 5: Storing in ChromaDB...
✅ Stored 22 chunks with embeddings
Collection size: 22 items


In [41]:
# Step 6: Hybrid Search (Semantic + BM25)
print("Step 6: Setting up hybrid search...")
print("=" * 60)

tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

def hybrid_search(query: str, tag_filter: str = None, top_k: int = 5, alpha: float = 0.5):
    """Hybrid search: semantic (alpha) + BM25 (1-alpha)"""
    
    # Semantic search
    where_filter = {"tags": {"$contains": tag_filter}} if tag_filter else None
    semantic_results = collection.query(
        query_texts=[query],
        n_results=top_k * 2,
        where=where_filter
    )
    
    # BM25 keyword search
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    
    # Normalize and combine
    semantic_ids = semantic_results['ids'][0]
    semantic_distances = semantic_results['distances'][0]
    
    max_dist = max(semantic_distances) if semantic_distances and max(semantic_distances) > 0 else 1
    semantic_scores_norm = {
        semantic_ids[i]: 1 - (semantic_distances[i] / max_dist)
        for i in range(len(semantic_ids))
    }
    
    max_bm25 = max(bm25_scores) if max(bm25_scores) > 0 else 1
    bm25_scores_norm = bm25_scores / max_bm25
    
    combined_scores = {}
    for i, chunk_id in enumerate(ids):
        sem_score = semantic_scores_norm.get(chunk_id, 0)
        bm25_score = bm25_scores_norm[i]
        combined_scores[chunk_id] = alpha * sem_score + (1 - alpha) * bm25_score
    
    top_chunks = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
    
    results = []
    for chunk_id, score in top_chunks:
        chunk_idx = int(chunk_id.split('_')[1])
        chunk = tagged_chunks[chunk_idx]
        results.append({
            "chunk_id": chunk_id,
            "score": score,
            "header": chunk['header'],
            "content": chunk['content'],
            "tags": chunk['tags']
        })
    
    return results

print("✅ Hybrid search ready!")

# Test it
print("\nTest query: 'research methodology'")
test_results = hybrid_search("research methodology", top_k=5)
for i, r in enumerate(test_results, 1):
    print(f"{i}. {r['header'][:60]}... (score: {r['score']:.3f})")

Step 6: Setting up hybrid search...
✅ Hybrid search ready!

Test query: 'research methodology'
1. Introduction (Part 1)... (score: 0.500)
2. Discussion... (score: 0.484)
3. Protect factors for NEC (Part 2)... (score: 0.043)
4. Sensitivity analysis and publication bias (Part 1)... (score: 0.039)
5. Risk factors for NEC... (score: 0.032)
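
The score blending inside `hybrid_search` can be followed with plain numbers. The sketch below uses made-up distances and BM25 scores and a hypothetical helper, `fuse_scores`, that applies the same normalization as the function above: semantic score = 1 - distance / max distance, keyword score = BM25 / max BM25, combined with weight `alpha`.

```python
from typing import Dict

def fuse_scores(distances: Dict[str, float],
                bm25: Dict[str, float],
                alpha: float = 0.5) -> Dict[str, float]:
    """Blend normalized semantic and BM25 scores per chunk id (illustrative only)."""
    max_dist = max(distances.values(), default=0) or 1
    max_bm25 = max(bm25.values(), default=0) or 1
    fused = {}
    for chunk_id in bm25:
        # Chunks missing from the semantic results get a semantic score of 0
        sem = 1 - distances[chunk_id] / max_dist if chunk_id in distances else 0.0
        kw = bm25[chunk_id] / max_bm25
        fused[chunk_id] = alpha * sem + (1 - alpha) * kw
    return fused

# Made-up numbers for illustration only
distances = {"chunk_0": 0.2, "chunk_1": 0.8}
bm25 = {"chunk_0": 2.0, "chunk_1": 4.0, "chunk_2": 1.0}
fused = fuse_scores(distances, bm25)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

One consequence of this normalization is that the farthest of the retrieved chunks gets a semantic score of 0, the same as a chunk that was not retrieved at all, so its rank depends entirely on BM25.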


In [42]:
# Step 7: Agentic Corrective RAG Functions
print("Step 7: Setting up Corrective RAG...")
print("=" * 60)

def grade_chunk_relevance(chunk_content: str, query: str) -> dict:
    """Grade chunk relevance using LLM."""
    grading_prompt = f"""You are a grading expert. Is this chunk relevant to the query?

Query: {query}

Chunk:
{chunk_content[:1000]}

Format:
RELEVANT: yes or no
SCORE: high, medium, or low
REASON: brief explanation (1 sentence)
"""
    
    try:
        response = groq_client.chat.completions.create(
            model="openai/gpt-oss-120b",
            messages=[{"role": "user", "content": grading_prompt}],
            temperature=0.1,
            max_completion_tokens=150,
        )
        
        result = response.choices[0].message.content
        relevant = False
        score = "low"
        reason = "Unknown"
        
        for line in result.split('\n'):
            if line.startswith('RELEVANT:'):
                relevant = 'yes' in line.lower()
            elif line.startswith('SCORE:'):
                score = line.replace('SCORE:', '').strip().lower()
            elif line.startswith('REASON:'):
                reason = line.replace('REASON:', '').strip()
        
        return {'relevant': relevant, 'score': score, 'reason': reason}
    except Exception as e:
        return {'relevant': True, 'score': 'medium', 'reason': f'Error: {e}'}

def rewrite_query(original_query: str, feedback: str = None) -> str:
    """Rewrite query to improve retrieval."""
    if feedback:
        prompt = f"""Original query failed to retrieve relevant results.

Original: {original_query}
Feedback: {feedback}

Rewrite for better retrieval. Return only the rewritten query."""
    else:
        prompt = f"""Improve this query for better retrieval:

{original_query}

Return only the improved query."""
    
    try:
        response = groq_client.chat.completions.create(
            model="openai/gpt-oss-120b",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            max_completion_tokens=100,
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error rewriting: {e}")
        return original_query

def corrective_rag_retrieval(query: str, max_iterations: int = 3, top_k: int = 5):
    """Agentic Corrective RAG with iterative retrieval."""
    print(f"\n{'='*70}")
    print(f"Query: {query}")
    print(f"{'='*70}\n")
    
    current_query = query
    iteration = 0
    all_relevant_chunks = []
    
    while iteration < max_iterations:
        iteration += 1
        print(f"🔄 Iteration {iteration}: '{current_query}'")
        
        # Retrieve
        retrieved = hybrid_search(current_query, tag_filter=None, top_k=top_k, alpha=0.5)
        
        if not retrieved:
            print("  ⚠️  No results. Rewriting...")
            current_query = rewrite_query(current_query, "No results found")
            continue
        
        # Grade
        print(f"  📊 Grading {len(retrieved)} chunks...")
        relevant_count = 0
        
        for chunk in retrieved:
            grade = grade_chunk_relevance(chunk['content'], query)
            chunk['relevance_grade'] = grade
            
            if grade['relevant'] and grade['score'] in ['high', 'medium']:
                if not any(c['chunk_id'] == chunk['chunk_id'] for c in all_relevant_chunks):
                    all_relevant_chunks.append(chunk)
                    relevant_count += 1
                    print(f"    ✅ {chunk['header'][:50]}... - {grade['score'].upper()}")
            else:
                print(f"    ❌ {chunk['header'][:50]}... - Rejected")
        
        print(f"  📈 Found {relevant_count} relevant (Total: {len(all_relevant_chunks)})")
        
        # Check if enough
        if len(all_relevant_chunks) >= 3:
            print(f"\n✅ SUCCESS: {len(all_relevant_chunks)} relevant chunks!")
            break
        
        # Rewrite
        if iteration < max_iterations:
            feedback = f"Only {len(all_relevant_chunks)} relevant. Need more."
            print(f"  🔄 Rewriting query...")
            current_query = rewrite_query(query, feedback)
    
    if not all_relevant_chunks:
        print("\n⚠️  No relevant chunks. Using fallback...")
        all_relevant_chunks = hybrid_search(query, top_k=top_k, alpha=0.5)
    
    return {
        'query': query,
        'final_query': current_query,
        'iterations': iteration,
        'relevant_chunks': all_relevant_chunks,
        'total_found': len(all_relevant_chunks)
    }

print("✅ Corrective RAG ready!")

Step 7: Setting up Corrective RAG...
✅ Corrective RAG ready!
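
The retrieve, grade, rewrite loop can be exercised without any API calls by swapping in stubs. The sketch below is a simplified stand-in, not the notebook's implementation: `corrective_loop`, `fake_search`, `fake_grade`, `fake_rewrite`, and `CORPUS` are all hypothetical names, and the grader is reduced to a boolean. It also skips re-grading chunks that were already accepted, which is a small deviation from `corrective_rag_retrieval`.

```python
from typing import Callable, Dict, List

def corrective_loop(query: str,
                    search: Callable[[str], List[Dict]],
                    grade: Callable[[str, str], bool],
                    rewrite: Callable[[str], str],
                    max_iterations: int = 3,
                    enough: int = 3) -> Dict:
    """Retrieve, grade, and rewrite until `enough` relevant chunks are collected."""
    current, kept, iteration = query, [], 0
    while iteration < max_iterations:
        iteration += 1
        for chunk in search(current):
            already = any(c["chunk_id"] == chunk["chunk_id"] for c in kept)
            # Grade against the original query, not the rewritten one
            if not already and grade(chunk["content"], query):
                kept.append(chunk)
        if len(kept) >= enough:
            break
        if iteration < max_iterations:
            current = rewrite(query)
    return {"final_query": current, "iterations": iteration, "relevant_chunks": kept}

# Stubs standing in for hybrid_search, grade_chunk_relevance and rewrite_query
CORPUS = [
    {"chunk_id": "chunk_0", "content": "methods: database search strategy"},
    {"chunk_id": "chunk_1", "content": "methods: statistical analysis"},
    {"chunk_id": "chunk_2", "content": "background on NEC"},
    {"chunk_id": "chunk_3", "content": "methods: quality assessment"},
]

def fake_search(q: str) -> List[Dict]:
    # The first query sees only part of the corpus; the rewritten one sees all of it
    return CORPUS if "methodology" in q else CORPUS[:3]

def fake_grade(content: str, q: str) -> bool:
    return content.startswith("methods")

def fake_rewrite(q: str) -> str:
    return q + " methodology"

result = corrective_loop("research methods", fake_search, fake_grade, fake_rewrite)
print(result["iterations"], [c["chunk_id"] for c in result["relevant_chunks"]])
```

Passing the search, grading, and rewriting steps in as callables keeps the loop testable offline; the real functions defined above could be wrapped to fit the same signatures.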


In [43]:
# Step 8: Test Corrective RAG with a single query
print("Step 8: Testing Corrective RAG...")
print("=" * 60)

test_query = "What are the research methods used in this study?"
result = corrective_rag_retrieval(test_query, max_iterations=3, top_k=5)

print(f"\n{'='*70}")
print("TEST RESULTS")
print(f"{'='*70}")
print(f"Original Query: {result['query']}")
print(f"Final Query: {result['final_query']}")
print(f"Iterations: {result['iterations']}")
print(f"Relevant Chunks: {result['total_found']}")

if result['relevant_chunks']:
    print(f"\nTop chunks:")
    for i, chunk in enumerate(result['relevant_chunks'][:3], 1):
        print(f"{i}. {chunk['header'][:60]}...")
        print(f"   Preview: {chunk['content'][:150]}...\n")

Step 8: Testing Corrective RAG...

Query: What are the research methods used in this study?

🔄 Iteration 1: 'What are the research methods used in this study?'
  📊 Grading 5 chunks...


2025-10-04 05:34:55,366 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ✅ Risk factors for NEC... - MEDIUM


2025-10-04 05:34:56,414 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ❌ Search strategy... - Rejected


2025-10-04 05:34:57,074 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ❌ Introduction (Part 1)... - Rejected


2025-10-04 05:34:57,643 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ✅ Data synthesis and statistical analysis + Study se... - HIGH


2025-10-04 05:35:00,602 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ✅ Discussion... - MEDIUM
  📈 Found 3 relevant (Total: 3)

✅ SUCCESS: 3 relevant chunks!

TEST RESULTS
Original Query: What are the research methods used in this study?
Final Query: What are the research methods used in this study?
Iterations: 1
Relevant Chunks: 3

Top chunks:
1. Risk factors for NEC...
   Preview: ## Risk factors for NEC

This meta-analysis showed that MSAF, history of antibiotic use and preterm infection were risk factors for NEC in  preterm  i...

2. Data synthesis and statistical analysis + Study selection...
   Preview: ## Data synthesis and statistical analysis

We used Stata 14.0 software to perform statistical analysis of the extracted data. Heterogeneity among the...

3. Discussion...
   Preview: ## Discussion

To our knowledge, this is the first meta-analysis of factors influencing NEC in preterm infants in China to obtain an updated and thoro...
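
Grading issues one LLM call per retrieved chunk, so a run with several queries sends many requests in quick succession, and the client log further down in this notebook shows HTTP 429 responses. One possible mitigation, sketched here under assumptions, is to space calls out on the client side. `MinIntervalThrottle` is a hypothetical helper, not part of the Groq SDK; the clock and sleep functions are injectable so the demo below runs instantly against a fake clock.

```python
import time
from typing import Callable, List, Optional


class MinIntervalThrottle:
    """Ensure at least `min_interval` seconds between successive calls."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None  # None until the first call

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
            if delay > 0:
                self._sleep(delay)
        self._last_call = now + delay
        return delay


# Demo with a fake clock so nothing actually sleeps.
fake_now: List[float] = [0.0]
slept: List[float] = []


def fake_clock() -> float:
    return fake_now[0]


def fake_sleep(seconds: float) -> None:
    slept.append(seconds)
    fake_now[0] += seconds


throttle = MinIntervalThrottle(2.0, clock=fake_clock, sleep=fake_sleep)
first = throttle.wait()   # the first call is never delayed
fake_now[0] += 0.5        # pretend the LLM call took 0.5 s
second = throttle.wait()  # waits out the remaining 1.5 s
fake_now[0] += 5.0        # a long pause before the next call
third = throttle.wait()   # interval already elapsed, no delay
print(first, second, third)
```

In the notebook, `throttle.wait()` would be called just before each `grade_chunk_relevance` or `rewrite_query` call. The right interval depends on the account's rate limits, which are not stated here.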



In [30]:
result

{'query': 'What are the research methods used in this study?',
 'final_query': 'What research methodologies and approaches were employed in this investigation to collect and analyze data?',
 'iterations': 2,
 'relevant_chunks': [{'chunk_id': 'chunk_15',
   'score': np.float64(0.5037165725314711),
   'header': 'Chunk 16',
   'content': ' LDA score for microbial pathways; E) histogram of the overall fecal metabolites and  PCA  as  inset. **P&lt;0.01. two-way  ANOVA  followed  by  a  two-stage  linear  step-up procedure  of  Benjamini. Krieger  and  Yekutieli  to  correct  for  multiple  comparisons  by controlling the False Discovery Rate (&lt;0.05); N=15 for H and N=4 for NEC-1. Fig. 2. Analysis of gut microbiota, microbiome and metabolome in the second decade of life in healthy vs. NEC-1 children. A) Gut microbiota analysis via LDA score between healthy (H) vs. NEC-1 children, in the second decade of life 11 to 20 days (d) ( the score is only shown for NEC-1 children meaning that no ba
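
Echoing `result` prints every chunk's full content, which quickly becomes hard to read. A compact view can be easier to scan. This is a sketch under assumptions: `summarize_result` is a hypothetical helper that relies only on the keys visible in the output above (`query`, `final_query`, `iterations`, `relevant_chunks`, `chunk_id`, `header`, `content`), and the sample dict is an abbreviated copy of that output.

```python
from typing import Any, Dict, List


def summarize_result(result: Dict[str, Any], preview_chars: int = 80) -> List[str]:
    """Return one short line per field and per chunk instead of the raw dict."""
    lines = [
        f"Original query: {result['query']}",
        f"Final query: {result['final_query']}",
        f"Iterations: {result['iterations']}",
        f"Relevant chunks: {len(result['relevant_chunks'])}",
    ]
    for i, chunk in enumerate(result["relevant_chunks"], 1):
        # Collapse runs of whitespace before truncating the preview.
        preview = " ".join(chunk["content"].split())[:preview_chars]
        lines.append(f"{i}. [{chunk['chunk_id']}] {chunk['header']}: {preview}")
    return lines


sample = {
    "query": "What are the research methods used in this study?",
    "final_query": (
        "What research methodologies and approaches were employed in this "
        "investigation to collect and analyze data?"
    ),
    "iterations": 2,
    "relevant_chunks": [
        {
            "chunk_id": "chunk_15",
            "score": 0.5037165725314711,
            "header": "Chunk 16",
            "content": " LDA score for microbial pathways; E) histogram of the overall fecal metabolites",
        }
    ],
}

for line in summarize_result(sample, preview_chars=40):
    print(line)
```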

In [46]:
# Step 9: Complete Document Extraction with Corrective RAG
print("Step 9: Running complete extraction...")
print("=" * 60)

def extract_document_with_corrective_rag():
    """Extract all document info using corrective RAG."""
    print("\n" + "="*70)
    print("COMPLETE DOCUMENT EXTRACTION")
    print("="*70 + "\n")
    
    queries = {
        'summary': "What is the abstract, main purpose, and overview of this paper?",
        'methods': "What research methods, study design, and analytical approaches were used?",
        'findings': "What are the key findings, results, and conclusions?",
        'type': "What type of document is this? Ex. case study, clinical trial, review article, meta-analysis, research article, technical workshop paper, etc."
    }
    
    extraction_results = {}
    
    for section, query in queries.items():
        print(f"\n{'─'*70}")
        print(f"EXTRACTING: {section.upper()}")
        print(f"{'─'*70}")
        result = corrective_rag_retrieval(query, max_iterations=2, top_k=5)
        extraction_results[section] = result
    
    # Compile contexts: up to three chunks per section, each truncated to 2,500 characters
    def build_context(section):
        chunks = extraction_results[section]['relevant_chunks'][:3]
        return "\n\n---\n\n".join(c['content'][:2500] for c in chunks) if chunks else "Not found"

    summary_ctx = build_context('summary')
    methods_ctx = build_context('methods')
    findings_ctx = build_context('findings')
    type_ctx = build_context('type')
    
    # Generate extraction
    final_prompt = f"""Extract comprehensive information:
Always generate normal text. Do not try to use markdown/HTML or special formatting.
## Document Information
**Author(s):** {document_metadata['authors'] or 'Not found'}
**Date:** {document_metadata['date'] or 'Not found'}
**Document Type:** [case study, clinical trial, review article, meta-analysis, research article, technical workshop paper, etc.]

## Document Summary
[2-3 sentence summary]

## Research Methods
[Summarize: study design, data sources, sample size, analytical methods]

## Key Findings and Conclusions
[Summarize: primary outcomes, statistical significance, conclusions, implications]

---
DOCUMENT TYPE:
{type_ctx[:3000]}

---
SUMMARY:
{summary_ctx[:4000]}

---
METHODS:
{methods_ctx[:4000]}

---
FINDINGS:
{findings_ctx[:3000]}
"""
    
    print(f"\n{'='*70}")
    print("GENERATING FINAL EXTRACTION...")
    print(f"{'='*70}\n")
    
    response = groq_client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": final_prompt}],
        temperature=0.2,
        max_completion_tokens=1500,
    )
    
    final_extraction = response.choices[0].message.content
    
    print("="*70)
    print("FINAL EXTRACTION")
    print("="*70)
    print(final_extraction)
    print("="*70)
    
    print(f"\nStatistics:")
    print(f"  Summary chunks: {len(extraction_results['summary']['relevant_chunks'])}")
    print(f"  Methods chunks: {len(extraction_results['methods']['relevant_chunks'])}")
    print(f"  Findings chunks: {len(extraction_results['findings']['relevant_chunks'])}")
    print(f"  Type chunks: {len(extraction_results['type']['relevant_chunks'])}")
    print(f"  Total iterations: {sum(r['iterations'] for r in extraction_results.values())}")
    
    return final_extraction

# Run complete extraction
final_result = extract_document_with_corrective_rag()

Step 9: Running complete extraction...

COMPLETE DOCUMENT EXTRACTION


──────────────────────────────────────────────────────────────────────
EXTRACTING: SUMMARY
──────────────────────────────────────────────────────────────────────

Query: What is the abstract, main purpose, and overview of this paper?

🔄 Iteration 1: 'What is the abstract, main purpose, and overview of this paper?'
  📊 Grading 5 chunks...


2025-10-04 06:12:50,271 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ✅ Conclusion... - MEDIUM


2025-10-04 06:12:50,963 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ❌ Data synthesis and statistical analysis + Study se... - Rejected


2025-10-04 06:12:51,554 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ❌ Introduction (Part 1)... - Rejected


2025-10-04 06:12:52,174 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ❌ Sensitivity analysis and publication bias (Part 1)... - Rejected


2025-10-04 06:12:52,766 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ❌ Inclusion and exclusion criteria... - Rejected
  📈 Found 1 relevant (Total: 1)
  🔄 Rewriting query...


2025-10-04 06:12:53,310 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


🔄 Iteration 2: ''
  📊 Grading 5 chunks...


2025-10-04 06:12:54,114 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ❌ Sensitivity analysis and publication bias (Part 1)... - Rejected


2025-10-04 06:12:54,728 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ❌ Inclusion and exclusion criteria... - Rejected


2025-10-04 06:12:55,322 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ❌ Search strategy... - Rejected


2025-10-04 06:12:56,733 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ❌ Introduction (Part 1)... - Rejected


2025-10-04 06:12:57,474 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ✅ Factors influencing necrotizing enterocolitis in p... - HIGH
  📈 Found 1 relevant (Total: 2)

──────────────────────────────────────────────────────────────────────
EXTRACTING: METHODS
──────────────────────────────────────────────────────────────────────

Query: What research methods, study design, and analytical approaches were used?

🔄 Iteration 1: 'What research methods, study design, and analytical approaches were used?'
  📊 Grading 5 chunks...


2025-10-04 06:12:58,446 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ✅ Data synthesis and statistical analysis + Study se... - HIGH


2025-10-04 06:12:59,623 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ❌ Introduction (Part 1)... - Rejected


2025-10-04 06:13:00,248 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ❌ Discussion... - Rejected


2025-10-04 06:13:00,899 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ❌ Data extraction and quality assessment... - Rejected


2025-10-04 06:13:02,392 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ✅ Inclusion and exclusion criteria... - MEDIUM
  📈 Found 2 relevant (Total: 2)
  🔄 Rewriting query...


2025-10-04 06:13:02,909 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


🔄 Iteration 2: ''
  📊 Grading 5 chunks...


2025-10-04 06:13:03,767 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ✅ Sensitivity analysis and publication bias (Part 1)... - MEDIUM


2025-10-04 06:13:04,711 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ❌ Inclusion and exclusion criteria... - Rejected


2025-10-04 06:13:05,391 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ❌ Search strategy... - Rejected


2025-10-04 06:13:06,047 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:13:06,107 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:13:06,109 - INFO - Retrying request to /openai/v1/chat/completions in 2.000000 seconds


    ❌ Introduction (Part 1)... - Rejected


2025-10-04 06:13:08,706 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ✅ Factors influencing necrotizing enterocolitis in p... - HIGH
  📈 Found 2 relevant (Total: 4)

✅ SUCCESS: 4 relevant chunks!

──────────────────────────────────────────────────────────────────────
EXTRACTING: FINDINGS
──────────────────────────────────────────────────────────────────────

Query: What are the key findings, results, and conclusions?

🔄 Iteration 1: 'What are the key findings, results, and conclusions?'
  📊 Grading 5 chunks...


2025-10-04 06:13:08,970 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:13:08,972 - INFO - Retrying request to /openai/v1/chat/completions in 3.000000 seconds
2025-10-04 06:13:12,633 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:13:12,690 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:13:12,692 - INFO - Retrying request to /openai/v1/chat/completions in 3.000000 seconds


    ❌ Data synthesis and statistical analysis + Study se... - Rejected


2025-10-04 06:13:16,200 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:13:16,256 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:13:16,259 - INFO - Retrying request to /openai/v1/chat/completions in 4.000000 seconds


    ❌ Study characteristics and quality... - Rejected


2025-10-04 06:13:20,845 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:13:20,910 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:13:20,911 - INFO - Retrying request to /openai/v1/chat/completions in 3.000000 seconds


    ✅ Risk factors for NEC... - HIGH


2025-10-04 06:13:24,538 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:13:24,749 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"


    ❌ Sensitivity analysis and publication bias (Part 1)... - Rejected


2025-10-04 06:13:24,761 - INFO - Retrying request to /openai/v1/chat/completions in 3.000000 seconds
2025-10-04 06:13:28,445 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:13:28,501 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:13:28,502 - INFO - Retrying request to /openai/v1/chat/completions in 1.000000 seconds


    ✅ Protect factors for NEC (Part 2)... - HIGH
  📈 Found 2 relevant (Total: 2)
  🔄 Rewriting query...


2025-10-04 06:13:31,128 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


🔄 Iteration 2: ''
  📊 Grading 5 chunks...


2025-10-04 06:13:31,552 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:13:31,553 - INFO - Retrying request to /openai/v1/chat/completions in 1.000000 seconds
2025-10-04 06:13:33,272 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:13:33,334 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:13:33,346 - INFO - Retrying request to /openai/v1/chat/completions in 3.000000 seconds


    ❌ Sensitivity analysis and publication bias (Part 1)... - Rejected


2025-10-04 06:13:37,186 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:13:37,255 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:13:37,256 - INFO - Retrying request to /openai/v1/chat/completions in 4.000000 seconds


    ❌ Inclusion and exclusion criteria... - Rejected


2025-10-04 06:13:41,947 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:13:42,006 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:13:42,007 - INFO - Retrying request to /openai/v1/chat/completions in 3.000000 seconds


    ❌ Search strategy... - Rejected


2025-10-04 06:13:45,875 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:13:45,942 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:13:45,943 - INFO - Retrying request to /openai/v1/chat/completions in 3.000000 seconds


    ❌ Introduction (Part 1)... - Rejected


2025-10-04 06:13:49,598 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


    ❌ Factors influencing necrotizing enterocolitis in p... - Rejected
  📈 Found 0 relevant (Total: 2)

──────────────────────────────────────────────────────────────────────
EXTRACTING: TYPE
──────────────────────────────────────────────────────────────────────

Query: What type of document is this? Ex. case study, clinical trial, review article, meta-analysis, research article, technical workshop paper, etc.

🔄 Iteration 1: 'What type of document is this? Ex. case study, clinical trial, review article, meta-analysis, research article, technical workshop paper, etc.'


2025-10-04 06:13:49,894 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:13:49,895 - INFO - Retrying request to /openai/v1/chat/completions in 5.000000 seconds


  📊 Grading 5 chunks...


2025-10-04 06:13:56,035 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:13:56,094 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:13:56,095 - INFO - Retrying request to /openai/v1/chat/completions in 3.000000 seconds


    ❌ Risk factors... - Rejected


2025-10-04 06:13:59,792 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:13:59,844 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:13:59,845 - INFO - Retrying request to /openai/v1/chat/completions in 3.000000 seconds


    ❌ Introduction (Part 2)... - Rejected


2025-10-04 06:14:03,504 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:14:03,559 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:14:03,560 - INFO - Retrying request to /openai/v1/chat/completions in 3.000000 seconds


    ❌ Introduction (Part 1)... - Rejected


2025-10-04 06:14:07,162 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:14:07,217 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:14:07,219 - INFO - Retrying request to /openai/v1/chat/completions in 4.000000 seconds


    ❌ Data synthesis and statistical analysis + Study se... - Rejected


2025-10-04 06:14:12,116 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:14:12,174 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:14:12,174 - INFO - Retrying request to /openai/v1/chat/completions in 1.000000 seconds


    ❌ Sensitivity analysis and publication bias (Part 2)... - Rejected
  📈 Found 0 relevant (Total: 0)
  🔄 Rewriting query...


2025-10-04 06:14:13,682 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


🔄 Iteration 2: ''


2025-10-04 06:14:13,988 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:14:14,000 - INFO - Retrying request to /openai/v1/chat/completions in 3.000000 seconds


  📊 Grading 5 chunks...


2025-10-04 06:14:17,635 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:14:17,700 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:14:17,704 - INFO - Retrying request to /openai/v1/chat/completions in 3.000000 seconds


    ❌ Sensitivity analysis and publication bias (Part 1)... - Rejected


2025-10-04 06:14:22,461 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:14:22,516 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:14:22,516 - INFO - Retrying request to /openai/v1/chat/completions in 2.000000 seconds


    ❌ Inclusion and exclusion criteria... - Rejected


2025-10-04 06:14:25,163 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:14:25,274 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:14:25,275 - INFO - Retrying request to /openai/v1/chat/completions in 4.000000 seconds


    ✅ Search strategy... - HIGH


2025-10-04 06:14:29,887 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:14:29,950 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:14:29,953 - INFO - Retrying request to /openai/v1/chat/completions in 3.000000 seconds


    ❌ Introduction (Part 1)... - Rejected


2025-10-04 06:14:33,655 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-04 06:14:33,715 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"
2025-10-04 06:14:33,716 - INFO - Retrying request to /openai/v1/chat/completions in 29.000000 seconds


    ✅ Factors influencing necrotizing enterocolitis in p... - HIGH
  📈 Found 2 relevant (Total: 2)

GENERATING FINAL EXTRACTION...



2025-10-04 06:15:05,960 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


FINAL EXTRACTION
Document Information  
Authors: Aihua Wang, Xinghui Cui, Huimin Jiang (and co-authors Shuliang Zhao, Yiqun Miao, Wenwen Liu, Yanan Li, Hui Liu, Yuanyuan Zhang)  
Date: 2024 (published online February 2023, data collection up to February 2023)  
Document Type: Systematic review and meta-analysis  

Document Summary  
This study systematically reviewed Chinese literature to identify risk and protective factors for necrotizing enterocolitis (NEC) in preterm infants. By pooling data from 38 case-control or cohort studies (total n = 8 616 infants), the authors quantified the association of each factor with NEC using odds ratios (ORs) and evaluated the robustness of the findings through sensitivity and publication-bias analyses.  

Research Methods  
- Study design: Systematic review and meta-analysis of observational case-control and cohort studies.  
- Data sources: PubMed, Embase, Web of Science, Cochrane Library, CNKI, CBM, Wanfang, and VIP database