In [None]:
# Core imports for scientific poster analysis
from mcp import stdio_client, StdioServerParameters  # Model Context Protocol for tool integration
from strands import Agent                            # Strands Agent SDK for orchestrating AI workflows
from strands.tools.mcp import MCPClient             # MCP client wrapper for Strands
from strands.models import BedrockModel             # Amazon Bedrock model integration
from strands_tools import file_read, file_write, shell  # Built-in Strands tools
import os
import urllib3

# Scientific Poster Data Extraction with AI

## Overview

This notebook demonstrates how to extract structured data from scientific posters. It combines **Amazon Bedrock Data Automation** (BDA) for multimodal document processing with the **Strands Agent SDK** for agent orchestration, connected through the Model Context Protocol (MCP).

## Value Proposition

### Why Amazon Bedrock Data Automation MCP?

[Amazon Bedrock Data Automation](https://aws.amazon.com/bedrock/bda/) provides enterprise-grade capabilities for processing unstructured multimodal content:

- **Multimodal Understanding**: Processes documents, images, audio, and video with state-of-the-art AI models
- **Scalable Infrastructure**: Handles large volumes of content with AWS's robust cloud infrastructure
- **Pre-built Intelligence**: Leverages foundation models trained on diverse scientific and technical content
- **Cost-Effective**: Pay-per-use model eliminates the need for maintaining specialized ML infrastructure
- **Security & Compliance**: Enterprise-grade security with data residency and privacy controls

### Why Strands Agent SDK?

The [Strands Agent SDK](https://strandsagents.com/0.1.x/) provides a powerful framework for building AI agents:

- **Tool Integration**: Seamlessly connects with external services via Model Context Protocol (MCP)
- **Workflow Orchestration**: Manages complex multi-step AI workflows with error handling and retries
- **Model Flexibility**: Works with various LLMs including Amazon Bedrock models
- **Developer Experience**: Intuitive Python API that reduces boilerplate code
- **Production Ready**: Built-in logging, monitoring, and debugging capabilities

### Combined Benefits

Together, these technologies enable:
- **Rapid Prototyping**: From concept to working solution in minutes
- **Scientific Accuracy**: Specialized understanding of research content and terminology
- **Structured Output**: Convert unstructured posters into queryable, structured data
- **Scalable Processing**: Handle individual files or batch process hundreds of documents

## Use Cases

This approach is particularly valuable for:
- **Research Institutions**: Digitizing and cataloging poster sessions from conferences
- **Pharmaceutical Companies**: Analyzing clinical trial posters and research findings
- **Academic Libraries**: Creating searchable databases of research presentations
- **Grant Agencies**: Reviewing and categorizing funded research outcomes


In [None]:
# Configuration: suppress urllib3's InsecureRequestWarning during development.
# Note: this only hides the warning; it does not change certificate verification.
# Remove it in production environments.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

In [None]:
# Configuration: Input file and AWS settings
pdf_path = os.path.abspath("/Users/necibea/poster.pdf")  # Path to scientific poster PDF

# Initialize Amazon Bedrock Data Automation MCP Client
# This client provides access to BDA's multimodal processing capabilities
# through the Model Context Protocol (MCP) interface
aws_bda_client = MCPClient(
    lambda: stdio_client(
        StdioServerParameters(
            command="uvx",  # Use uvx to run the MCP server
            args=["awslabs.aws-bedrock-data-automation-mcp-server@latest"],
            env={
                "AWS_PROFILE": "default",                    # AWS credentials profile
                "AWS_REGION": "us-east-1",                     # AWS region for BDA service
                "AWS_BUCKET_NAME": "clinical-poster-analysis-bucket",  # S3 bucket for temporary storage
                "BASE_DIR": "/Users/necibea/",                # Base directory for file operations
                "FASTMCP_LOG_LEVEL": "ERROR"                 # Reduce log verbosity
            }
        )
    )
)
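The `env` mapping above hard-codes the profile, region, and base directory. A minimal sketch of building the same mapping from the caller's environment is shown below. The helper name `build_bda_env` and its parameters are hypothetical, introduced only for illustration; the keys and default values mirror the ones used in the cell above.

```python
import os

def build_bda_env(bucket_name, base_dir, environ=None):
    """Assemble the env mapping for the MCP server (hypothetical helper).

    Values already present in the caller's environment take precedence;
    otherwise the defaults from the cell above are used.
    """
    environ = os.environ if environ is None else environ
    return {
        "AWS_PROFILE": environ.get("AWS_PROFILE", "default"),
        "AWS_REGION": environ.get("AWS_REGION", "us-east-1"),
        "AWS_BUCKET_NAME": bucket_name,
        # Expand "~" and normalise to an absolute path
        "BASE_DIR": os.path.abspath(os.path.expanduser(base_dir)),
        "FASTMCP_LOG_LEVEL": environ.get("FASTMCP_LOG_LEVEL", "ERROR"),
    }

# Example with an explicit environment so the outcome does not depend on the machine
env = build_bda_env(
    "clinical-poster-analysis-bucket",
    "~",
    environ={"AWS_REGION": "us-west-2"},
)
print(env["AWS_REGION"], env["AWS_PROFILE"])
```

The returned dictionary could be passed as the `env` argument of `StdioServerParameters`, which keeps machine-specific paths out of the notebook.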

## Analysis Process

The following cell demonstrates the core functionality of our poster analysis pipeline:

1. **Document Upload**: The PDF is automatically uploaded to the configured S3 bucket
2. **Multimodal Processing**: BDA analyzes both text and visual elements (figures, tables, charts)
3. **Structure Extraction**: The service identifies document hierarchy, sections, and relationships
4. **Content Analysis**: Advanced AI models extract semantic meaning from scientific content
5. **Structured Output**: Results are returned in JSON format with rich metadata

### Key Features Demonstrated

- **Element Detection**: Automatically identifies figures, tables, and text blocks
- **Markdown Conversion**: Converts poster content to structured markdown format
- **Metadata Extraction**: Provides detailed statistics about document composition
- **Visual Understanding**: Processes charts, graphs, and scientific diagrams
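
To make the element detection and markdown conversion features concrete, the sketch below tallies element types and joins their markdown. The `sample_result` dictionary is a hypothetical, simplified stand-in: its keys (`elements`, `type`, `representation`, `markdown`) are assumptions for illustration, and the actual BDA output may be shaped differently.

```python
from collections import Counter

# Hypothetical, simplified stand-in for an analysis result (not real BDA output)
sample_result = {
    "elements": [
        {"type": "TEXT", "representation": {"markdown": "## Methods"}},
        {"type": "FIGURE", "representation": {"markdown": "![Figure 1](fig1.png)"}},
        {"type": "TABLE", "representation": {"markdown": "| Arm | N |\n|---|---|\n| A | 42 |"}},
        {"type": "TEXT", "representation": {"markdown": "Patients were randomized."}},
    ]
}

def summarize_elements(result):
    """Count elements by type and concatenate their markdown in document order."""
    elements = result.get("elements", [])
    counts = Counter(e.get("type", "UNKNOWN") for e in elements)
    markdown = "\n\n".join(
        e.get("representation", {}).get("markdown", "") for e in elements
    )
    return dict(counts), markdown

counts, markdown = summarize_elements(sample_result)
print(counts)
```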


In [None]:
# Execute the poster analysis
bedrock_model = BedrockModel(
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    temperature=0.2,
)

SYSTEM_PROMPT = """
You are a clinical analyst. Your responsibility is to extract information from poster data.
Use Bedrock Data Automation to analyse documents and extract text. Make sure to extract the data in figures. Create a new blueprint if necessary.
"""

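with aws_bda_client:
    # Discover the tools exposed by the BDA MCP server
    tools = aws_bda_client.list_tools_sync()

    agent = Agent(
        tools=tools,
        model=bedrock_model,
        system_prompt=SYSTEM_PROMPT,
    )

    # Call the analyzeasset tool directly rather than prompting the model
    result = agent.tool.analyzeasset(assetPath=pdf_path)

    # The tool result carries a list of content blocks; print the first text block
    print(result['content'][0]['text'])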
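
Indexing `result['content'][0]['text']` raises an unhelpful `IndexError` or `KeyError` if the tool call fails or returns no text. A minimal defensive sketch follows. It assumes the result is a dictionary with a `content` list of blocks, as used above, plus a `status` field; the `status` field and the helper name `tool_result_text` are assumptions for illustration.

```python
def tool_result_text(result):
    """Return the joined text blocks of a tool result (hypothetical helper).

    Raises RuntimeError if the result reports an error status, and
    ValueError if it contains no text blocks at all.
    """
    if result.get("status") == "error":
        raise RuntimeError(f"Tool call failed: {result.get('content')}")
    texts = [block["text"] for block in result.get("content", []) if "text" in block]
    if not texts:
        raise ValueError("Tool result contained no text blocks")
    return "\n".join(texts)

# Illustrative result with two text blocks and one non-text block
ok_result = {
    "status": "success",
    "content": [{"text": "a"}, {"json": {}}, {"text": "b"}],
}
joined = tool_result_text(ok_result)
print(joined)
```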

In [None]:
# List all available Bedrock Data Automation projects in your account
with aws_bda_client:
    tools = aws_bda_client.list_tools_sync()
    agent = Agent(tools=tools)
    response = agent("Do I have any Amazon Bedrock Data Automation projects?")

In [None]:
# Analyse a local document and ask the agent to extract a specific part of it
with aws_bda_client:
    tools = aws_bda_client.list_tools_sync()
    agent = Agent(tools=tools)
    response = agent(f"Analyse the file at {pdf_path}. Can you extract Figure 4 in CSV format?")
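
Model-generated CSV is worth checking before it is used downstream. The sketch below parses a CSV string with the standard `csv` module and rejects rows whose width does not match the header. It assumes the agent's reply has already been obtained as a plain string; the helper name `parse_csv_reply` is hypothetical, and the sample values are made up for illustration rather than taken from any poster.

```python
import csv
import io

def parse_csv_reply(text):
    """Parse CSV text into (header, rows), validating row widths (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text.strip()))
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        raise ValueError("No CSV rows found")
    header, data = rows[0], rows[1:]
    bad = [r for r in data if len(r) != len(header)]
    if bad:
        raise ValueError(f"{len(bad)} row(s) do not match the header width")
    return header, data

# Made-up sample standing in for the agent's reply
header, data = parse_csv_reply("timepoint,response_rate\nweek_4,0.31\nweek_8,0.47\n")
print(header, data)
```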

## Next Steps and Enhancements

This notebook demonstrates the foundational capabilities of combining Amazon Bedrock Data Automation with Strands Agent SDK. Here are potential enhancements for production use:

### Immediate Improvements

1. **Batch Processing**: Process multiple posters concurrently rather than one file at a time
2. **Custom Extraction**: Define specific data schemas for different types of scientific content
3. **Quality Validation**: Add confidence scoring and human-in-the-loop validation workflows
4. **Output Formats**: Export results to databases, knowledge graphs, or structured formats (JSON-LD, RDF)
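
As one way to approach the batch processing item, the sketch below fans a per-file analysis function out over a standard-library thread pool. It does not use any Strands-specific parallelism. `analyze_many` and `fake_analyze` are hypothetical names; `fake_analyze` stands in for a real call such as the `analyzeasset` invocation shown earlier, and whether a single MCP client can safely be shared across threads is not verified here.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_many(paths, analyze_one, max_workers=4):
    """Run analyze_one over paths concurrently (hypothetical helper).

    Returns a dict mapping each path to its result, or to the exception
    raised for that path, so one failure does not abort the whole batch.
    """
    def safe(path):
        try:
            return analyze_one(path)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its result
        return dict(zip(paths, pool.map(safe, paths)))

def fake_analyze(path):
    """Stand-in for a real per-file analysis call."""
    if not path.endswith(".pdf"):
        raise ValueError(f"unsupported file type: {path}")
    return f"analyzed {path}"

results = analyze_many(["a.pdf", "b.pdf", "notes.txt"], fake_analyze)
print(results)
```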

### Advanced Features

1. **Semantic Search**: Build searchable indexes of extracted content using vector embeddings
2. **Cross-Reference Analysis**: Link related findings across multiple posters and publications
3. **Trend Analysis**: Identify emerging research themes and methodological patterns
4. **Citation Networks**: Extract and map citation relationships between research works

### Integration Opportunities

- **Research Databases**: Connect with PubMed, arXiv, or institutional repositories
- **Lab Information Systems**: Integrate with LIMS for automated research cataloging
- **Grant Management**: Link extracted data to funding sources and outcomes
- **Collaboration Platforms**: Share structured insights across research teams

### Performance Optimization

- **Caching**: Implement intelligent caching for frequently accessed content
- **Streaming**: Process large documents in chunks for better memory efficiency
- **Cost Optimization**: Use appropriate model sizes based on content complexity
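
The caching idea can be sketched with a content-addressed cache: the result is stored under the SHA-256 digest of the file, so re-analysing an unchanged poster skips the service call. The names `cached_analysis` and `counting_analyze` are hypothetical, and the sketch assumes the analysis result is JSON-serialisable.

```python
import hashlib
import json
import os
import tempfile

def cached_analysis(path, analyze_one, cache_dir):
    """Return a cached result for the file's content, computing it on a miss."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    cache_file = os.path.join(cache_dir, f"{digest}.json")
    if os.path.exists(cache_file):
        with open(cache_file, "r", encoding="utf-8") as f:
            return json.load(f)
    result = analyze_one(path)
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_file, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result

calls = []

def counting_analyze(path):
    """Stand-in for a real analysis call; records each invocation."""
    calls.append(path)
    return {"source": os.path.basename(path), "figures": 0}

with tempfile.TemporaryDirectory() as tmp:
    poster = os.path.join(tmp, "poster.pdf")
    with open(poster, "wb") as f:
        f.write(b"%PDF-1.4 placeholder")  # dummy bytes, not a real PDF
    cache = os.path.join(tmp, "cache")
    first = cached_analysis(poster, counting_analyze, cache)
    second = cached_analysis(poster, counting_analyze, cache)

print(len(calls))
```

Keying on file content rather than file name means a renamed copy still hits the cache, while an edited poster is analysed again.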

Together, Amazon Bedrock Data Automation and the Strands Agent SDK provide a foundation for document intelligence applications that can grow from research prototypes into production systems.
