In [None]:
pip install graphviz


In [None]:
from graphviz import Digraph

dot = Digraph(comment='AI System', format='png')

# Node creation
dot.node('A', 'Data Ingestion & Preprocessing')
dot.node('B1', 'AI Module 1\n(e.g., Literature Review)')
dot.node('B2', 'AI Module 2\n(e.g., Feasibility Analysis)')
dot.node('B3', 'AI Module 3\n(e.g., Patient Population Analysis)')
dot.node('B4', 'AI Module 4\n(e.g., Historical Trial Data Analysis)')
dot.node('B5', 'AI Module 5\n(e.g., Real-World Data Analysis)')
dot.node('B6', 'AI Module 6\n(e.g., Site Selection Analysis)')
dot.node('C', 'Integration Layer')
dot.node('D', 'User Interface')
dot.node('E', 'Authentication Service')
dot.node('F', 'API Gateway')
dot.node('G', 'Security & Compliance Layer')
dot.node('H', 'Feedback & Monitoring')

# Edges creation indicating API calls
edges = [('A', 'B1'), ('A', 'B2'), ('A', 'B3'), ('A', 'B4'), ('A', 'B5'), ('A', 'B6'), 
         ('B1', 'C'), ('B2', 'C'), ('B3', 'C'), ('B4', 'C'), ('B5', 'C'), ('B6', 'C'), 
         ('C', 'D'), ('E', 'D')]  # F -> D is added below with a label, so it is not repeated here

for edge in edges:
    dot.edge(*edge)

dot.edge('F', 'A', label='Data API Calls')
dot.edge('F', 'B1', label='API Calls')
dot.edge('F', 'B2', label='API Calls')
dot.edge('F', 'B3', label='API Calls')
dot.edge('F', 'B4', label='API Calls')
dot.edge('F', 'B5', label='API Calls')
dot.edge('F', 'B6', label='API Calls')
dot.edge('F', 'C', label='API Calls')
dot.edge('F', 'D', label='UI API Calls')

dot.attr(overlap='false')  # This helps with better layout

dot.render('AI_system_detailed.gv', view=True)


Each of these steps is broken down further below:

1. **Literature Review - Natural Language Processing (NLP):**

    **Plan:** Develop an NLP model that can read, understand, and summarize large volumes of medical literature relevant to the drug or treatment in question. This could involve creating an ontology or taxonomy of terms related to the disease or treatment and training the model to recognize these terms in text.

    **Challenges:** The medical literature is vast and complex, and the language used can be ambiguous or context-dependent. Maintaining up-to-date and comprehensive knowledge graphs and ontologies could be challenging.

    **How to Overcome Challenges:** Engage with medical experts to validate the model’s understanding and summaries. Use transfer learning from pretrained models to overcome issues with medical language.

2. **Feasibility Analysis - AI Algorithms:**

    **Plan:** Develop predictive models using historical trial data to predict key feasibility metrics like patient enrollment rates and trial duration. These models would take into account factors such as the trial design, patient population, and disease area.

    **Challenges:** Historical trial data may be incomplete or inconsistent. Additionally, future trials may not always align well with past trials, limiting the applicability of predictions.

    **How to Overcome Challenges:** Utilize robust data cleaning and preprocessing pipelines. Implement techniques for dealing with missing or inconsistent data. Leverage domain experts for feature engineering and model validation.

3. **Patient Population Analysis - EHR and Real-World Data:**

    **Plan:** Develop AI models to analyze EHR and other real-world data to understand the patient population. This could involve predicting patient availability, stratifying patients based on health conditions, and predicting patient outcomes.

    **Challenges:** EHR data is often messy and unstructured. Data privacy regulations make sharing and utilizing EHR data challenging.

    **How to Overcome Challenges:** Implement robust data cleaning and preprocessing pipelines. Use NLP techniques to extract useful information from unstructured data. Collaborate closely with data privacy experts and legal teams to ensure compliance with regulations.

4. **Historical Trial Data Analysis - Machine Learning:**

    **Plan:** Use machine learning models to analyze historical trial data, identifying patterns that could predict trial success or failure. This could involve both supervised learning (if success metrics are available for past trials) and unsupervised learning (to identify underlying patterns or clusters in the data).

    **Challenges:** As with feasibility analysis, the data may be incomplete or inconsistent, and past trials may not be fully representative of future trials.

    **How to Overcome Challenges:** Implement robust data cleaning and preprocessing pipelines. Use techniques like imputation or model-based methods to handle missing data.

5. **Real-World Data Analysis - AI:**

    **Plan:** Extend the use of AI in patient population analysis to gain insights into disease epidemiology, treatment patterns, and health outcomes. This would involve building predictive models and possibly also causal inference models.

    **Challenges:** Real-world data is noisy and heterogeneous, and establishing causal relationships in such data is challenging.

    **How to Overcome Challenges:** Use rigorous statistical techniques to control for confounding variables. Validate findings with experts.

6. **Site Selection Analysis - AI:**

    **Plan:** Develop models to predict site performance based on factors like past performance, site resources, and patient population. This would involve both predictive modeling (e.g., of patient recruitment rates) and optimization (to select the best combination of sites).

    **Challenges:** Site performance can be influenced by many factors, some of which may not be captured in the available data. The optimization problem could be complex, especially if there are many potential sites and constraints to consider.

    **How to Overcome Challenges:** Use ensemble models or other sophisticated machine learning techniques to capture complex relationships in the data. Use robust optimization techniques, potentially including heuristics or metaheuristics for large or complex problems.
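The optimization half of site selection (item 6) can be illustrated with a minimal greedy heuristic. This is only a sketch under stated assumptions: the `Site` fields, the budget constraint, and the example numbers are hypothetical, and the predicted enrollment would come from whatever predictive model is eventually built.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    predicted_enrollment: float  # hypothetical model output, patients per month
    cost: float                  # hypothetical activation cost, arbitrary units


def select_sites(sites, budget):
    """Greedy heuristic: take sites by enrollment per unit cost while the budget allows."""
    ranked = sorted(sites, key=lambda s: s.predicted_enrollment / s.cost, reverse=True)
    chosen, spent = [], 0.0
    for site in ranked:
        if spent + site.cost <= budget:
            chosen.append(site)
            spent += site.cost
    return chosen


sites = [
    Site("Site A", predicted_enrollment=12.0, cost=4.0),
    Site("Site B", predicted_enrollment=9.0, cost=2.0),
    Site("Site C", predicted_enrollment=5.0, cost=5.0),
]
selected = select_sites(sites, budget=6.0)
print([s.name for s in selected])
```

A greedy pass like this is fast but not guaranteed optimal; with many sites and constraints, an integer programming solver or a metaheuristic would be the more robust choice.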

Each phase of the product roadmap is detailed below:

**Phase 1 - Planning & Design (Months 1-3)**

1. **Define Requirements:**
    - Gather requirements from users and stakeholders for each AI module.
    - Determine required features and priorities.
    - Develop user stories and use cases to guide design and development.

2. **System Architecture & Data Model:**
    - Design the overall system architecture, considering components like data ingestion, AI services, APIs, and user interfaces.
    - Design the data model, considering data sources, types, storage, privacy, and security requirements.
    
3. **Technical Stack:**
    - Decide on programming languages (Python, R, Java, etc.), machine learning frameworks (TensorFlow, PyTorch, etc.), cloud service providers (AWS, Google Cloud, etc.), and database systems (SQL, NoSQL, etc.).
    
4. **Development Environments:**
    - Set up separate environments for development, testing, and production, ensuring proper isolation between them.
    
5. **Data Privacy & Security Plan:**
    - Define a data handling policy in compliance with GDPR, HIPAA or other regional data protection laws.
    - Determine data encryption standards and access control mechanisms.

**Phase 2 - Data Infrastructure & AI Development (Months 4-9)**

1. **Data Infrastructure:**
    - Implement data ingestion modules for EHRs, scientific literature, and other relevant data sources.
    - Develop data cleaning and preprocessing pipelines, which may involve techniques like outlier detection, missing value imputation, data normalization, and others.
    - Set up databases or data lakes for structured and unstructured data.

2. **AI Module Development:**
    - Acquire and preprocess necessary training data.
    - Develop AI models for literature review, feasibility analysis, patient population analysis, historical trial data analysis, real-world data analysis, and site selection analysis.
    - Evaluate and refine models using techniques like cross-validation, precision-recall analysis, and ROC curves.

3. **Integration Layer & APIs:**
    - Develop a robust API layer for communication between different components of the system.
    - Ensure the API handles different data formats, is scalable, and secure.
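The ROC-based evaluation mentioned under AI Module Development can be sketched without any library. The function below computes the area under the ROC curve as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The labels and scores are made-up illustration data, not results from this project.

```python
def roc_auc(labels, scores):
    """AUC as P(score of positive > score of negative); ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 0, 1]            # hypothetical ground truth
scores = [0.9, 0.8, 0.7, 0.3, 0.2]  # hypothetical model scores
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise loop is quadratic in the number of examples, which is fine for a sanity check; for real evaluation a library routine such as scikit-learn's `roc_auc_score` would normally be used.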

**Phase 3 - User Interface & Integration (Months 10-15)**

1. **User Interface:**
    - Design an intuitive user interface based on user requirements and UX best practices.
    - Implement the UI, focusing on usability, performance, and responsiveness.

2. **Integration:**
    - Integrate the AI modules, data layer, and user interface.
    - Implement user authentication and data security features, possibly using OAuth 2.0 with OpenID Connect for authentication and HTTPS (TLS) for secure data transmission.

**Phase 4 - Testing & Refinement (Months 16-18)**

1. **Testing:**
    - Conduct unit testing of individual modules, integration testing of multiple modules, and system testing of the complete system.
    - Involve users in acceptance testing to ensure the system meets their needs and expectations.

2. **Refinement:**
    - Use testing and user feedback to refine the system, fixing bugs, improving performance, and enhancing usability.

3. **Compliance & Security Audit:**
    - Conduct a security audit to identify and fix vulnerabilities.
    - Ensure compliance with healthcare regulations, privacy laws, and data security standards.

**Phase 5 - Deployment & Monitoring (Month 19 onwards)**

1. **Deployment:**
    - Deploy the platform in a production environment, ensuring it can handle real-world load and data volumes.

2. **User Training & Support:**
    - Provide user training and ongoing support to ensure users can effectively use the platform.

3. **Monitoring & Refinement:**
    - Implement monitoring systems to track performance, usage, and user feedback.
    - Continually refine the platform based on this information and advances in AI technology.

The project should follow an Agile methodology, with regular sprints, scrum meetings, and iterative development and improvement. Frequent, transparent communication with stakeholders is critical to ensure alignment and manage expectations.

Building an integrated platform involves connecting various AI services or modules. APIs, or Application Programming Interfaces, serve as the connection points between these different software components, enabling them to communicate with each other and share data and functionality.

In this scenario, each AI module (such as literature review, feasibility analysis, etc.) would expose an API that allows it to receive input data, process it, and return the results. The integration layer of the platform would use these APIs to orchestrate the overall workflow, passing data between the AI modules as needed.

A high-level design of the platform:

Data Ingestion and Preprocessing: This layer collects data from various sources (EHRs, literature databases, etc.), preprocesses it, and stores it in a suitable format. It might also expose APIs to allow the AI modules to query and retrieve the necessary data.

AI Modules: Each AI module is a standalone microservice that performs a specific task, such as analyzing patient populations or reviewing scientific literature. Each module exposes an API that accepts input data, processes it, and returns the results.

Integration Layer: This layer sits between the AI modules and the user interface. It uses the APIs of the AI modules to orchestrate the overall workflow. For example, it might first call the literature review API to get information on previous trials, then pass this information to the feasibility analysis API to get predictions on trial success.

User Interface: The UI makes requests to the integration layer to initiate analyses, receives the results, and displays them to the user in an interpretable and actionable format.

Security and Compliance Layer: This layer ensures that all data is handled securely and in compliance with relevant regulations. It might include features like encryption, access controls, and audit logs.

Feedback and Monitoring: These are integrated throughout the platform to monitor performance, identify issues, and collect user feedback for continuous improvement.
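The orchestration role of the integration layer can be sketched as a function that calls each module in turn and passes the accumulated results forward. Everything here is a hypothetical stand-in: `literature_review` and `feasibility_analysis` are plain functions in place of real microservice API calls, and the payload keys are invented for illustration.

```python
from typing import Callable, Dict, List

# Each "module" stands in for a microservice endpoint; a real deployment
# would make HTTP requests through the API gateway instead.
Module = Callable[[dict], dict]


def literature_review(payload: dict) -> dict:
    # Placeholder: pretend prior trials were found for the condition.
    return {"prior_trials": [f"trial about {payload['condition']}"]}


def feasibility_analysis(payload: dict) -> dict:
    # Placeholder: derive a toy score from the number of prior trials.
    return {"feasibility_score": min(1.0, 0.5 * len(payload["prior_trials"]))}


def run_workflow(request: dict, steps: List[Module]) -> dict:
    """Call each module in order, merging its output into a shared context."""
    context: Dict[str, object] = dict(request)
    for step in steps:
        context.update(step(context))
    return context


result = run_workflow({"condition": "asthma"}, [literature_review, feasibility_analysis])
print(result)
```

Keeping the workflow as an ordered list of callables means the UI only talks to `run_workflow`, and modules can be added or reordered without changing the caller.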



In [None]:
!pip install graphviz pygraphviz


In [None]:
import pygraphviz as pgv
from IPython.display import Image, display

plantuml_code = """
digraph PlantUMLDiagram {
    // Define nodes
    UI [label="User Interface"]
    auth [label="Authentication Service"]
    gateway [label="API Gateway"]
    storage [label="Data Storage"]
    cloudProvider [label="Cloud Provider"]
    module1 [label="AI Module 1\n(e.g., Literature Review)"]
    module2 [label="AI Module 2\n(e.g., Feasibility Analysis)"]
    module3 [label="AI Module 3\n(e.g., Patient Population Analysis)"]
    module4 [label="AI Module 4\n(e.g., Historical Trial Data Analysis)"]
    module5 [label="AI Module 5\n(e.g., Real-World Data Analysis)"]
    module6 [label="AI Module 6\n(e.g., Site Selection Analysis)"]
    integration [label="Integration Layer"]
    security [label="Security & Compliance Layer"]
    feedback [label="Feedback & Monitoring"]
    ingestion [label="Data Ingestion & Preprocessing"]
    monitoring [label="Feedback & Monitoring"]

    // Define edges
    UI -> auth
    auth -> gateway
    gateway -> module1
    gateway -> module2
    gateway -> module3
    gateway -> module4
    gateway -> module5
    gateway -> module6
    module1 -> integration
    module2 -> integration
    module3 -> integration
    module4 -> integration
    module5 -> integration
    module6 -> integration
    integration -> storage
    integration -> cloudProvider
    ingestion -> storage
    feedback -> monitoring
}
"""

# Create the graph from the source string (despite the variable name, this is Graphviz DOT, not PlantUML)
graph = pgv.AGraph(string=plantuml_code)

# Layout the graph using Graphviz
graph.layout(prog="dot")

# Save the graph as an image file
output_file = "plantuml_diagram.png"
graph.draw(output_file)

# Display the image
display(Image(filename=output_file))


In [None]:
from graphviz import Digraph

dot = Digraph(comment='The Detailed Common Data Layer Architecture')

# Data Sources nodes
dot.node('A1', 'Genomic Data')
dot.node('A2', 'EHR Data')
dot.node('A3', 'Imaging Data')
dot.node('A4', 'Environmental Data')
dot.node('A5', 'Wearable Device Data')
dot.node('A6', 'Social Determinants')

# Data Integration Node
dot.node('B', 'Data Integration')

# Data Processing nodes
dot.node('C1', 'Multi-Modal AI Models')
dot.node('C2', 'NLP Models')
dot.node('C3', 'Graph Neural Networks')

# Causal Inference Node
dot.node('D', 'Causal Inference Models')

# Insights Generation Node
dot.node('E', 'Explainable AI')

# Output Node
dot.node('F', 'Output')

# Data Sources to Data Integration edges
dot.edge('A1', 'B')
dot.edge('A2', 'B')
dot.edge('A3', 'B')
dot.edge('A4', 'B')
dot.edge('A5', 'B')
dot.edge('A6', 'B')

# Data Integration to Data Processing edges
dot.edge('B', 'C1')
dot.edge('B', 'C2')
dot.edge('B', 'C3')

# Data Processing to Causal Inference edges
dot.edge('C1', 'D')
dot.edge('C2', 'D')
dot.edge('C3', 'D')

# Causal Inference to Insights Generation edge
dot.edge('D', 'E')

# Insights Generation to Output edge
dot.edge('E', 'F')

# Show the graph
dot.view()


In [None]:
from graphviz import Digraph

dot = Digraph(comment='Clinical Trial Literature Review Automation')

# Begin defining subgraphs and nodes
with dot.subgraph(name='cluster_0') as c0:
    c0.attr(style='filled', color='lightgrey')
    c0.node_attr.update(style='filled', color='white')
    c0.attr(label='Clinical Trial Literature Review Automation')

    with c0.subgraph(name='cluster_1') as c1:
        c1.attr(label='LLM Data Processing Pipeline')
        c1.node_attr.update(style='filled', color='lightblue')
        c1.node('node1', 'Input: Clinical Trial Papers')
        c1.node('node2', 'LLM Data Extraction')
        c1.node('node3', 'LLM Quality Assessment')
        c1.node('node4', 'LLM Data Synthesis')
        c1.node('node5', 'Output: Synthesized Information')
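        # edges() with two-character strings would link nodes named '1', '2', ...;
        # pass explicit (tail, head) pairs so the edges connect node1..node5
        c1.edges([('node1', 'node2'), ('node2', 'node3'), ('node3', 'node4'), ('node4', 'node5')])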

    with c0.subgraph(name='cluster_2') as c2:
        c2.attr(label='Human Review')
        c2.node_attr.update(style='filled', color='lightblue')
        c2.node('node6', 'Manual Data Review & Interpretation')
        c2.node('node7', 'Final Review Report')
        c2.edge('node6', 'node7')

    c0.edge('node5', 'node6')

# Save the source to file and render
dot.render('test-output/round-table.gv', view=True)


In [None]:
from graphviz import Digraph

dot = Digraph(comment='Clinical Trial Literature Review Automation', format='png')

# Creating main process
with dot.subgraph(name='cluster_main') as main:
    main.attr(label='Clinical Trial Literature Review Automation')

    # Creating LLM Data Processing Pipeline subgraph
    with main.subgraph(name='cluster_llm') as llm:
        llm.attr(label='LLM Data Processing Pipeline', color='blue')
        llm.node('A1', 'Input: Clinical Trial Papers')
        llm.node('A2', 'LLM Data Extraction')
        llm.node('A3', 'LLM Quality Assessment')
        llm.node('A4', 'LLM Data Synthesis')
        llm.node('A5', 'Output: Synthesized Information')
        llm.edges([('A1', 'A2'), ('A2', 'A3'), ('A3', 'A4'), ('A4', 'A5')])

    # Creating Human Review subgraph
    with main.subgraph(name='cluster_human') as human:
        human.attr(label='Human Review', color='blue')
        human.node('B1', 'Manual Data Review & Interpretation')
        human.node('B2', 'Final Review Report')
        human.edge('B1', 'B2')

    # Linking LLM Data Processing Pipeline with Human Review
    main.edge('A5', 'B1')

dot.view()


In [None]:
pip install requests transformers


In [40]:
import requests
import torch
import transformers
import urllib.parse  # quote() lives in the urllib.parse submodule
import time


def get_literature(query, api_key):
    """Fetches relevant clinical trial literature from PubMed."""
    query = urllib.parse.quote(query)
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=20&sort=relevance&term={query}&api_key={api_key}"
    response = make_request_with_retry(url)

    if response.status_code == 200:
        literature = response.json()
        #print(literature)  # Print the fetched literature
        return literature
    else:
        raise Exception(f"Error fetching literature: {response.status_code}")



def preprocess_literature(literature):
    """Converts the collected data into a format suitable for the LLM."""
    processed_literature = []
    for study_id in literature["esearchresult"]["idlist"]:
        study = get_study_details(study_id)
        study["text"] = study.get("abstract", "")  # esummary returns no abstract field, so this is usually empty
        processed_literature.append(study)
    print(processed_literature)
    return processed_literature






def get_study_details(study_id):
    """Fetches detailed information for a specific study from PubMed."""
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=json&id={study_id}"
    response = make_request_with_retry(url)

    if response.status_code == 200:
        study = response.json()
        study_details = study["result"][study_id]
        study_details["id"] = study_id
        # esummary returns no abstract field, so "review" is usually empty
        study_details["review"] = study_details.get("abstract", "")
        return study_details
    else:
        raise Exception(f"Error fetching study details: {response.status_code}")



def make_request_with_retry(url, max_retries=3):
    """Makes a GET request, retrying with exponential backoff on failure."""
    retry_delay = 1
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=30)  # avoid hanging forever on a stalled connection
            if response.status_code == 200:
                return response
            if response.status_code == 429:
                print(f"Rate limit exceeded. Retrying in {retry_delay} seconds...")
            else:
                # Raise on other HTTP errors so they are handled below
                response.raise_for_status()
        except requests.RequestException:
            print(f"Request failed. Retrying in {retry_delay} seconds...")
        # Back off before the next attempt, doubling the delay each time
        time.sleep(retry_delay)
        retry_delay *= 2
    raise Exception("Max retries exceeded. Unable to complete the request.")


def train_llm(literature):
    """Fine-tunes a pre-trained LLM on a specialized corpus of clinical trial literature."""
    model = transformers.AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
    # Training code for LLM
    return model


def extract_data(llm, literature, tokenizer):
    """Uses the trained LLM to extract relevant data from the literature."""
    for study in literature:
        inputs = tokenizer.encode_plus(
            study["text"],
            add_special_tokens=True,
            truncation=True,
            max_length=512,
            padding="max_length",
            return_tensors="pt"
        )
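        with torch.no_grad():  # inference only, so no gradients are needed
            outputs = llm(**inputs)
        logits = outputs.logits
        predicted_label = torch.argmax(logits, dim=1)

        # Placeholder values: the class index from an untrained classification
        # head stands in for real extraction results.
        study["study_design"] = predicted_label.item()
        study["sample_size"] = int(study["study_design"])  # not a real sample size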
        study["interventions"] = study.get("interventions", [])
        study["outcome_measures"] = study.get("outcome_measures", [])
        study["main_findings"] = study.get("main_findings", "")
    #print(literature)
    return literature


def generate_report(literature):
    """Generates a comprehensive review report."""
    report = ""
    for study in literature:
        study_id = study.get("id")
        review = study.get("review")
        #print(f"Study ID: {study_id}")  # Print the study ID for debugging purposes
        #print(f"Review: {review}")  # Print the review for debugging purposes
        if study_id and review:
            report += f"Study {study_id}: {review}\n"
    
    #print(report)  # Print the generated report for debugging purposes
    
    return report



if __name__ == "__main__":
    query = input("Enter a search query for clinical trial literature: ")
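    import os  # imported here so the change stays local to the entry point
    # Read the NCBI API key from the environment instead of hardcoding it.
    # NCBI_API_KEY is an assumed variable name; set it before running.
    api_key = os.environ["NCBI_API_KEY"]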
    literature = get_literature(query, api_key)
    processed_literature = preprocess_literature(literature)  # Add this line
    llm = train_llm(processed_literature)  # Pass the processed literature
    tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
    literature = extract_data(llm, processed_literature, tokenizer)  # Pass the processed literature
    report = generate_report(literature)
    #print(report)


Enter a search query for clinical trial literature: covid19
{'header': {'type': 'esearch', 'version': '0.3'}, 'esearchresult': {'count': '351426', 'retmax': '20', 'retstart': '0', 'idlist': ['34817268', '34312178', '34498109', '33494237', '32425996', '33464914', '33705239', '33692022', '34391321', '33484452', '32788312', '33989487', '34227079', '32512530', '34357885', '32599968', '34099197', '34904908', '33035150', '35568052'], 'translationset': [{'from': 'covid19', 'to': '"covid-19"[MeSH Terms] OR "covid-19"[All Fields] OR "covid19"[All Fields]'}], 'querytranslation': '"covid 19"[MeSH Terms] OR "covid 19"[All Fields] OR "covid19"[All Fields]'}}
Rate limit exceeded. Retrying in 1 seconds...
Rate limit exceeded. Retrying in 1 seconds...
Rate limit exceeded. Retrying in 1 seconds...
Rate limit exceeded. Retrying in 1 seconds...
Rate limit exceeded. Retrying in 1 seconds...
Rate limit exceeded. Retrying in 1 seconds...
[{'uid': '34817268', 'pubdate': '2022 Jan 1', 'epubdate': '2021 Nov 24

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i
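The repeated "Rate limit exceeded" messages in the output above suggest that the per-study requests are sent faster than the service allows. NCBI documents a limit of 3 requests per second without an API key and 10 with one, and the `esummary` URL in `get_study_details` does not pass the key. One way to stay under the limit is to space the calls out on the client side. The `Throttle` class below is a hypothetical helper, not part of `requests` or any NCBI client library.

```python
import time


class Throttle:
    """Spaces calls so that at most `rate` of them start per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


throttle = Throttle(rate=3)  # limit for requests sent without an API key
for study_id in ["34817268", "34312178", "34498109"]:
    throttle.wait()
    # a real loop would call get_study_details(study_id) here
```

The `clock` and `sleep` parameters exist only so the helper can be exercised with a fake clock; in normal use the defaults are enough.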

In [None]:
!pip install torch torchvision



## V2

In [50]:
import requests
import torch
import transformers
import urllib.parse  # quote() lives in the urllib.parse submodule
import time


def get_literature(query, api_key):
    """Fetches relevant clinical trial literature from PubMed."""
    query = urllib.parse.quote(query)
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=20&sort=relevance&term={query}&api_key={api_key}"
    response = make_request_with_retry(url)

    # make_request_with_retry returns None once retries are exhausted
    if response is not None and response.status_code == 200:
        literature = response.json()
        return literature
    else:
        raise Exception("Error fetching literature: request failed after retries")


def get_full_text(pmid):
    """Fetches the full text of an article using the BioC-PMC API."""
    url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_xml/{pmid}/unicode"
    response = make_request_with_retry(url)

    if response and response.status_code == 200:
        return response.text
    else:
        print(f"Failed to retrieve full text for PMID: {pmid}")
        return ""


def preprocess_literature(literature):
    """Converts the collected data into a format suitable for the LLM."""
    processed_literature = []
    for study_id in literature["esearchresult"]["idlist"]:
        study = get_study_details(study_id)
        study["full_text"] = get_full_text(study_id)  # Retrieve the full text
        study["text"] = study.get("abstract", "")
        study["label"] = 1  # Placeholder label; with a single class, fine-tuning learns nothing useful
        processed_literature.append(study)
    return processed_literature




def get_study_details(study_id):
    """Fetches detailed information for a specific study from PubMed."""
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=json&id={study_id}"
    response = make_request_with_retry(url)

    # make_request_with_retry returns None once retries are exhausted
    if response is not None and response.status_code == 200:
        study = response.json()
        study_details = study["result"][study_id]
        study_details["id"] = study_id
        study_details["review"] = study_details.get("abstract", "")
        return study_details
    else:
        raise Exception(f"Error fetching study details for {study_id}: request failed after retries")


def make_request_with_retry(url, max_retries=3):
    """Makes a request with retry and exponential backoff."""
    retry_delay = 1
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                return response
            elif response.status_code == 429:
                print(f"Rate limit exceeded. Retrying in {retry_delay} seconds...")
                time.sleep(retry_delay)
                retry_delay *= 2
                continue
            else:
                response.raise_for_status()
        except Exception as e:
            print(f"Request failed. Retrying in {retry_delay} seconds...")
            time.sleep(retry_delay)
            retry_delay *= 2
    # Return None if retries are exceeded; callers must check for it
    return None




def train_llm(literature):
    """Fine-tunes a pre-trained LLM on a specialized corpus of clinical trial literature."""
    train_texts = [study["full_text"] for study in literature]
    train_labels = [study["label"] for study in literature]

    tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
    model = transformers.AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

    # Tokenize and encode the training texts
    train_encodings = tokenizer(train_texts, truncation=True, padding=True, return_tensors="pt")

    # Convert the labels to tensors
    train_labels = torch.tensor(train_labels)

    # Create a Dataset object
    train_dataset = torch.utils.data.TensorDataset(train_encodings["input_ids"], train_encodings["attention_mask"], train_labels)

    # Create a DataLoader for training
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=4, shuffle=True)

    # Training loop
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    for epoch in range(3):
        model.train()
        for batch in train_loader:
            batch = tuple(t.to(device) for t in batch)
            input_ids, attention_mask, labels = batch

            optimizer.zero_grad()
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            logits = outputs.logits
            loss.backward()
            optimizer.step()

    return model





def extract_data(llm, literature, tokenizer):
    """Uses the trained LLM to extract relevant data from the literature."""
    for study in literature:
        inputs = tokenizer.encode_plus(
            study["full_text"],  # Use the full text instead of the abstract
            add_special_tokens=True,
            truncation=True,
            max_length=512,
            padding="max_length",
            return_tensors="pt"
        )
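        llm.eval()  # disable dropout for inference
        # Move inputs to the model's device (train_llm may have placed it on the GPU)
        inputs = {name: tensor.to(llm.device) for name, tensor in inputs.items()}
        with torch.no_grad():
            outputs = llm(**inputs)
        logits = outputs.logits
        predicted_label = torch.argmax(logits, dim=1)

        # Placeholder values: the predicted class index stands in for real extraction results
        study["study_design"] = predicted_label.item()
        study["sample_size"] = int(study["study_design"])  # not a real sample size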
        study["interventions"] = study.get("interventions", [])
        study["outcome_measures"] = study.get("outcome_measures", [])
        study["main_findings"] = study.get("main_findings", "")
    return literature


def generate_report(literature):
    """Generates a comprehensive review report."""
    report = ""
    for study in literature:
        study_id = study.get("id")
        full_text = study.get("full_text")
        if study_id and full_text:
            report += f"Study {study_id}: {full_text}\n"
    return report


if __name__ == "__main__":
    query = input("Enter a search query for clinical trial literature: ")
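    import os  # imported here so the change stays local to the entry point
    # Read the NCBI API key from the environment instead of hardcoding it.
    # NCBI_API_KEY is an assumed variable name; set it before running.
    api_key = os.environ["NCBI_API_KEY"]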
    literature = get_literature(query, api_key)
    processed_literature = preprocess_literature(literature)
    llm = train_llm(processed_literature)
    tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
    literature = extract_data(llm, processed_literature, tokenizer)
    report = generate_report(literature)
    print(report)


Enter a search query for clinical trial literature: covid 19
Request failed. Retrying in 1 seconds...
Request failed. Retrying in 2 seconds...
Request failed. Retrying in 4 seconds...
Failed to retrieve full text for PMID: 33664170
Request failed. Retrying in 1 seconds...
Request failed. Retrying in 2 seconds...
Request failed. Retrying in 4 seconds...
Failed to retrieve full text for PMID: 34355645
Request failed. Retrying in 1 seconds...
Request failed. Retrying in 2 seconds...
Request failed. Retrying in 4 seconds...
Failed to retrieve full text for PMID: 32749914


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

Study 33400058: <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE collection SYSTEM "BioC.dtd">
<collection><source>PMC</source><date>20210914</date><key>pmc.key</key><document><id>7784226</id><infon key="license">CC BY</infon><passage><infon key="article-id_doi">10.1208/s12248-020-00532-2</infon><infon key="article-id_pmc">7784226</infon><infon key="article-id_pmid">33400058</infon><infon key="article-id_publisher-id">532</infon><infon key="elocation-id">14</infon><infon key="issue">1</infon><infon key="kwd">antiviral drugs anti-SARS-CoV-2 antibody antiviral vaccines ARDS convalescent plasma therapy immunotherapy nanotherapeutics</infon><infon key="license">Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indic
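
The report above is raw BioC XML because `generate_report` writes `full_text` verbatim. One way to make it readable is to keep only the passage text. The sketch below assumes the BioC layout visible in the output, with `<passage>` elements that each carry a `<text>` child. `extract_passages` is a hypothetical helper, and the sample string is made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List


def extract_passages(bioc_xml: str) -> List[str]:
    """Return the non-empty <text> content of every <passage>, in document order."""
    if not bioc_xml.strip():
        return []  # get_full_text returns "" when retrieval fails
    root = ET.fromstring(bioc_xml)
    passages = []
    for passage in root.iter("passage"):
        text = passage.findtext("text", default="").strip()
        if text:
            passages.append(text)
    return passages


# Made-up sample in the assumed BioC shape (not real article content).
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<!DOCTYPE collection SYSTEM "BioC.dtd">'
    "<collection><source>PMC</source><document><id>1</id>"
    "<passage><infon key='type'>title</infon><offset>0</offset>"
    "<text>Example title</text></passage>"
    "<passage><offset>14</offset><text>Example abstract.</text></passage>"
    "</document></collection>"
)
print(extract_passages(sample))  # ['Example title', 'Example abstract.']
```

Joining the returned list with newlines would give `generate_report` plain text to print instead of markup.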

## LLM interface - working V3

In [54]:
import requests
import torch
import transformers
import urllib.parse  # urllib.parse.quote needs the submodule imported explicitly
import time


def get_literature(query, api_key):
    """Fetches relevant clinical trial literature from PubMed."""
    query = urllib.parse.quote(query)
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=20&sort=relevance&term={query}&api_key={api_key}"
    response = make_request_with_retry(url)

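    if response is not None and response.status_code == 200:
        return response.json()
    # make_request_with_retry returns None once its retries are exhausted.
    status = response.status_code if response is not None else "no response"
    raise Exception(f"Error fetching literature: {status}")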


def get_full_text(pmid):
    """Fetches the full text of an article using the BioC-PMC API."""
    url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_xml/{pmid}/unicode"
    response = make_request_with_retry(url)

    if response is not None and response.status_code == 200:
        return response.text
    else:
        print(f"Failed to retrieve full text for PMID: {pmid}")
        return ""


def preprocess_literature(literature):
    """Converts the collected data into a format suitable for the LLM."""
    processed_literature = []
    for study_id in literature["esearchresult"]["idlist"]:
        study = get_study_details(study_id)
        study["full_text"] = get_full_text(study_id)  # Retrieve the full text
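        # NOTE: esummary records do not appear to include an "abstract" field,
        # so this is usually an empty string.
        study["text"] = study.get("abstract", "")
        study["label"] = 1  # Placeholder: every study gets the same label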
        processed_literature.append(study)
    return processed_literature


def get_study_details(study_id):
    """Fetches detailed information for a specific study from PubMed."""
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=json&id={study_id}"
    response = make_request_with_retry(url)

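    if response is not None and response.status_code == 200:
        study = response.json()
        study_details = study["result"][study_id]
        study_details["id"] = study_id
        study_details["review"] = study_details.get("abstract", "")
        return study_details
    # make_request_with_retry returns None once its retries are exhausted.
    status = response.status_code if response is not None else "no response"
    raise Exception(f"Error fetching study details: {status}")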


def make_request_with_retry(url, max_retries=3):
    """Makes a request with retry and exponential backoff."""
    retry_delay = 1
    for attempt in range(1, max_retries + 1):
        try:
            # A timeout keeps a stalled connection from hanging the notebook.
            response = requests.get(url, timeout=30)
            if response.status_code == 200:
                return response
            elif response.status_code == 429:
                print(f"Rate limit exceeded. Retrying in {retry_delay} seconds...")
                time.sleep(retry_delay)
                retry_delay *= 2
                continue
            else:
                response.raise_for_status()
        except requests.RequestException:
            print(f"Request failed. Retrying in {retry_delay} seconds...")
            time.sleep(retry_delay)
            retry_delay *= 2
    # Retries exhausted: return None so callers can detect the failure
    return None


def train_llm(literature):
    """Fine-tunes bert-base-uncased as a sequence classifier on the retrieved literature."""
    train_texts = [study["full_text"] for study in literature]
    train_labels = [study["label"] for study in literature]

    tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
    model = transformers.AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

    # Tokenize and encode the training texts
    train_encodings = tokenizer(train_texts, truncation=True, padding=True, return_tensors="pt")

    # Convert the labels to tensors
    train_labels = torch.tensor(train_labels)

    # Create a Dataset object
    train_dataset = torch.utils.data.TensorDataset(train_encodings["input_ids"], train_encodings["attention_mask"], train_labels)

    # Create a DataLoader for training
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=4, shuffle=True)

    # Training loop
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    for epoch in range(3):
        model.train()
        for batch in train_loader:
            batch = tuple(t.to(device) for t in batch)
            input_ids, attention_mask, labels = batch

            optimizer.zero_grad()
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss  # the logits are not needed during training
            loss.backward()
            optimizer.step()

    return model


def ask_clinical_question(question, llm, tokenizer):
    """Classifies a clinical question with the fine-tuned model.

    Returns the predicted class index (an int), not a free-text answer.
    """
    inputs = tokenizer.encode_plus(
        question,
        add_special_tokens=True,
        truncation=True,
        max_length=512,
        padding="max_length",
        return_tensors="pt"
    )

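    inputs = {k: v.to(llm.device) for k, v in inputs.items()}
    llm.eval()  # Disable dropout, which train_llm leaves enabled
    with torch.no_grad():  # No gradients are needed for inference
        outputs = llm(**inputs)
    logits = outputs.logits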
    predicted_label = torch.argmax(logits, dim=1)

    answer = predicted_label.item()
    return answer


if __name__ == "__main__":
    query = input("Enter a search query for clinical trial literature: ")
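    import os  # local import so this cell runs on its own

    # Read the NCBI API key from the environment instead of hard-coding a secret
    # (the variable name NCBI_API_KEY is an assumption; use whatever you export).
    api_key = os.environ["NCBI_API_KEY"]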
    literature = get_literature(query, api_key)
    processed_literature = preprocess_literature(literature)
    llm = train_llm(processed_literature)
    tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")

    while True:
        question = input("Ask a clinical question (type 'exit' to quit): ")
        if question.lower() == "exit":
            break

        answer = ask_clinical_question(question, llm, tokenizer)
        print(f"Answer: {answer}")


Enter a search query for clinical trial literature: covid 19
Request failed. Retrying in 1 seconds...
Request failed. Retrying in 2 seconds...
Request failed. Retrying in 4 seconds...
Failed to retrieve full text for PMID: 33664170
Request failed. Retrying in 1 seconds...
Request failed. Retrying in 2 seconds...
Request failed. Retrying in 4 seconds...
Failed to retrieve full text for PMID: 34355645
Request failed. Retrying in 1 seconds...
Request failed. Retrying in 2 seconds...
Request failed. Retrying in 4 seconds...
Failed to retrieve full text for PMID: 32749914


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

Ask a clinical question (type 'exit' to quit): How many clinical trials were phase 2?
Answer: 1
Ask a clinical question (type 'exit' to quit): how many trials occurred in 2020
Answer: 1
Ask a clinical question (type 'exit' to quit): how many total trials occurred
Answer: 0
Ask a clinical question (type 'exit' to quit): what was the most popular trial protocol
Answer: 1
Ask a clinical question (type 'exit' to quit): exit
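
The answers above are 0 or 1 because `ask_clinical_question` returns `torch.argmax` over the logits of a classification head, that is, the index of the largest logit, whatever the question asks. The head appears to have two classes here, which is the library default when `num_labels` is not set. The arithmetic can be sketched in plain Python. The logit values are invented for illustration, and `softmax` and `predict_label` are hypothetical helper names.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits: List[float]) -> int:
    """Index of the largest logit, like torch.argmax over one row of logits."""
    return max(range(len(logits)), key=logits.__getitem__)


# Invented logits for a two-class head.
logits = [0.12, 0.31]
probs = softmax(logits)
print(predict_label(logits))  # 1
print([round(p, 3) for p in probs])
```

With logits this close together the two probabilities are nearly equal, so the printed class index carries little information. That is consistent with a classifier head that was newly initialized and then trained on a single label value.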


In [2]:
import graphviz

# Create a new graph
graph = graphviz.Digraph()

# Add nodes to the graph
graph.node("Get Literature (PubMed API)")
graph.node("Fetch Full Text (BioC-PMC API)")
graph.node("Preprocess Literature")
graph.node("Get Study Details (PubMed API)")
graph.node("Make Request with Retry")
graph.node("Train LLM")
graph.node("Ask Clinical Question")

# Add edges between nodes
graph.edge("Get Literature (PubMed API)", "Fetch Full Text (BioC-PMC API)")
graph.edge("Fetch Full Text (BioC-PMC API)", "Preprocess Literature")
graph.edge("Preprocess Literature", "Get Study Details (PubMed API)")
graph.edge("Get Study Details (PubMed API)", "Make Request with Retry")
graph.edge("Make Request with Retry", "Train LLM")
graph.edge("Train LLM", "Ask Clinical Question")

# Render and save the graph as a PDF file
graph.render("clinical_trial_code", format="pdf", view=True)


'clinical_trial_code.pdf'