# ArXiv Synth

<p align="center">
   <img src="./arxiv-synth.png" alt="ArXiv synth retro logo" width="250"/>
</p>
ArXiv Synth is an educational demonstration project that uses large language models to summarize and analyze arXiv papers matching a given query. It fetches paper metadata (titles, abstracts, authors, and links) from arXiv, assembles it into a single prompt, and generates summaries with OpenAI's and Anthropic's language models.

<br><br>
The project is meant as an educational demonstration of these capabilities, not as a research tool.

# Fetch the paper information from ArXiv

In [12]:
openai_api_key = ""  # <Add API key here>
claude_api_key = ""  # <Add API key here>

In [13]:
import json
import sys
from IPython.display import display, Markdown, clear_output

import arxiv
import anthropic
from openai import OpenAI

##########################################################
## Functions to interact with ArXiv and cache papers    ##
##########################################################
def fetch_papers(query, category=None, n_papers=20):
    # Construct the API client
    client = arxiv.Client()
    
    # Build the search query
    search_query = query
    if category:
        search_query += f" AND cat:{category}"
    
    # Initialize the search
    search = arxiv.Search(
        query=search_query,
        max_results=n_papers,
        sort_by=arxiv.SortCriterion.SubmittedDate
    )
    
    # Fetch the results
    results = client.results(search)
    
    # Initialize a list to hold data dictionaries for each paper
    papers = []
    
    for paper in results:
        paper_data = {
            'title': paper.title,
            'abstract': paper.summary.replace('\n', ' '),  # Replace new lines in abstracts with spaces
            'authors': [author.name for author in paper.authors],
            'link': paper.entry_id  # URL of the paper's arXiv entry, used for markdown links in the prompt
        }
        papers.append(paper_data)
    return papers

    
def write_papers(papers, filename="./arxiv_papers.jsonl"):
    # Write data to a JSONL file. Replace with DB write if you want something more interesting
    with open(filename, 'w') as outfile:
        for paper_data in papers:
            json.dump(paper_data, outfile)
            outfile.write('\n')


##########################################################
## Functions to get responses from OpenAI and Anthropic ##
##########################################################
def get_openai(prompt,
               model="gpt-3.5-turbo-0125", 
               api_key=None, 
               system_prompt="You are a helpful assistant.", 
               max_tokens=2000, 
               temperature=0.8):
    # models "gpt-4-turbo", "gpt-3.5-turbo-0125"
    client = OpenAI(api_key=api_key)
    
    response = client.chat.completions.create(
      model=model,
      messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt}
      ],
      temperature=temperature,
      max_tokens=max_tokens,
      top_p=1,
      frequency_penalty=0,
      presence_penalty=0
    )
    return response.choices[0].message.content

def get_anthropic(prompt, 
                  model="claude-3-haiku-20240307", 
                  api_key=None,
                  system_prompt="You are a helpful assistant.", 
                  max_tokens=2000, 
                  temperature=0):
    # Other models "claude-3-opus-20240229", "claude-3-sonnet-20240229", "claude-3-haiku-20240307"

    client = anthropic.Anthropic(
        api_key=api_key,
    )
    message = client.messages.create(
        model=model,
        max_tokens=max_tokens,      # use the caller's values instead of hardcoded ones
        temperature=temperature,
        system=f"{system_prompt}",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": prompt
                    }
                ]
            }
        ]
    )
    return message.content[0].text

def get_streaming_anthropic(input_for_summary, 
                            model="claude-3-haiku-20240307", 
                            api_key=None,
                            system_prompt="You are a helpful assistant.", 
                            max_tokens=2000, 
                            temperature=0):

    # Use the api_key argument rather than the global claude_api_key
    client = anthropic.Anthropic(api_key=api_key)

    response = ""
    with client.messages.stream(
        max_tokens=max_tokens,
        temperature=temperature,
        system=f"{system_prompt}",
        messages=[{"role": "user", "content": input_for_summary}],
        model=model,
    ) as stream:
        for text in stream.text_stream:
            response += text
            clear_output(wait=True)
            display(Markdown(response))
            sys.stdout.flush()

# Collect Data from ArXiv

In [14]:
query = "au:blaiszik"
n_papers = 40
category = ""
db_file = './arxiv_papers.jsonl'

papers = fetch_papers(query, category=category, n_papers=n_papers)
print(f"== Fetched {len(papers)} papers from ArXiv. ==> ")

write_papers(papers, db_file)
print(f"== Wrote papers to temporary cache at {db_file} ==>")

== Fetched 29 papers from ArXiv. ==> 
== Wrote papers to temporary cache at ./arxiv_papers.jsonl ==>
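
The search string sent to arXiv is the query plus an optional category filter. The sketch below isolates that step so it can be checked without a network call. The helper name `build_search_query` is hypothetical (it is not part of the notebook or the `arxiv` package), and `cond-mat.mtrl-sci` is used only as an example category.

```python
def build_search_query(query, category=None):
    """Mirror the query-building step of fetch_papers without calling arXiv."""
    search_query = query
    if category:  # both None and "" skip the category filter
        search_query += f" AND cat:{category}"
    return search_query


# An empty category, as in the cell above, leaves the query unchanged
print(build_search_query("au:blaiszik", category=""))
# prints: au:blaiszik

# Restricting the same author search to one arXiv category
print(build_search_query("au:blaiszik", category="cond-mat.mtrl-sci"))
# prints: au:blaiszik AND cat:cond-mat.mtrl-sci
```

Because `category = ""` is falsy, the run above searches across all categories.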


# (optional) Read papers back into memory from JSONL file
- This is stubbed out for database or cached file retrieval
- TODO: For a deeper retrieval-augmented approach, we could keep a database of arXiv papers with precomputed embeddings and pull the top N matches into the context as well
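
One possible shape for the embedding idea in the TODO is sketched below. It is only an illustration: the `embedding` field, the toy three-dimensional vectors, and the helper names `cosine_similarity` and `top_n_matches` are all assumptions, and a real setup would compute embeddings with a model and store them in a database.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_matches(query_embedding, papers, n=3):
    """Return the n papers whose stored embedding is most similar to the query."""
    ranked = sorted(
        papers,
        key=lambda p: cosine_similarity(query_embedding, p["embedding"]),
        reverse=True,
    )
    return ranked[:n]


# Toy data: hypothetical papers with made-up 3-dimensional embeddings
toy_papers = [
    {"title": "Paper A", "embedding": [1.0, 0.0, 0.0]},
    {"title": "Paper B", "embedding": [0.9, 0.1, 0.0]},
    {"title": "Paper C", "embedding": [0.0, 1.0, 0.0]},
]

matches = top_n_matches([1.0, 0.0, 0.0], toy_papers, n=2)
print([p["title"] for p in matches])
# prints: ['Paper A', 'Paper B']
```

The selected papers could then be appended to the prompt in the same way as the freshly fetched ones.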


In [15]:
papers = []
with open('./arxiv_papers.jsonl', 'r') as file:
    for line in file:
        papers.append(json.loads(line))
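
The JSONL cache stores one JSON object per line, so writing and reading are symmetric. The following is a minimal round-trip sketch under stated assumptions: `write_jsonl` and `read_jsonl` are hypothetical stand-ins for the notebook's write and read steps, the sample record is placeholder data, and a temporary directory is used so nothing is left on disk.

```python
import json
import os
import tempfile


def write_jsonl(records, filename):
    """Write one JSON object per line."""
    with open(filename, "w", encoding="utf-8") as outfile:
        for record in records:
            outfile.write(json.dumps(record) + "\n")


def read_jsonl(filename):
    """Read records back, skipping any blank lines."""
    records = []
    with open(filename, "r", encoding="utf-8") as infile:
        for line in infile:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Placeholder record with the same keys that fetch_papers produces
sample = [
    {
        "title": "Example paper",
        "abstract": "Placeholder abstract.",
        "authors": ["A. Author", "B. Author"],
        "link": "https://example.org/paper",
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "papers.jsonl")
    write_jsonl(sample, path)
    restored = read_jsonl(path)

print(restored == sample)
# prints: True
```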

# Create the Prompt and Add Context

In [16]:
input_for_summary = """Summarize the following arXiv papers at the level of an advanced Ph.D. student, making interconnections between the papers where possible.
                       First create a summary paragraph that includes the most important breakthroughs in the contained papers. 
                       Second, create a summary tweet thread describing the papers. 
                       Next, provide a section on interconnections where the information is grouped in a structured way that makes it easy to understand, rather than by each paper separately. 
                       Each paper should always be referenced by its link in markdown format. Be sure to include markdown style links to the papers and to references to the other papers. 
                       \n\n"""
for paper in papers:
    input_for_summary += "<paper>\n"
    input_for_summary += f"### {paper['title']}\n\n"
    input_for_summary += f"**Abstract:** {paper['abstract']}\n\n"
    input_for_summary += f"**Authors:** {', '.join(paper['authors'])}\n\n"
    input_for_summary += f"[Link to paper]({paper['link']})\n\n"
    input_for_summary += "</paper>\n\n"
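
To see what the assembled context looks like, the loop above can be expressed as a small pure function and run on placeholder data. This is a sketch, not the notebook's own code: `format_paper` and `build_prompt` are hypothetical helper names, and the sample paper is invented for illustration.

```python
def format_paper(paper):
    """Wrap one paper's metadata in <paper> tags, matching the loop above."""
    return (
        "<paper>\n"
        f"### {paper['title']}\n\n"
        f"**Abstract:** {paper['abstract']}\n\n"
        f"**Authors:** {', '.join(paper['authors'])}\n\n"
        f"[Link to paper]({paper['link']})\n\n"
        "</paper>\n\n"
    )


def build_prompt(instructions, papers):
    """Concatenate the instructions with one <paper> block per paper."""
    return instructions + "".join(format_paper(p) for p in papers)


# Placeholder paper; real entries come from fetch_papers
toy_paper = {
    "title": "Example paper",
    "abstract": "Placeholder abstract.",
    "authors": ["A. Author", "B. Author"],
    "link": "https://example.org/paper",
}

prompt = build_prompt("Summarize the following papers.\n\n", [toy_paper])
print(prompt)
```

Wrapping each paper in `<paper>` tags gives the model a clear boundary between entries, and including the link in every block is what lets the summaries cite papers with markdown links.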

# Get OpenAI Summary++

In [17]:
system_prompt = "You are modeling the mind of a researcher who has obtained a PhD in the field of study for the papers retrieved."

# models "gpt-4-turbo", "gpt-3.5-turbo-0125"

oai = get_openai(input_for_summary, 
                 model="gpt-3.5-turbo-0125",
                 system_prompt=system_prompt,
                 max_tokens=2000,
                 api_key=openai_api_key
                )
display(Markdown(oai))

### Summary Paragraph:
These papers cover a wide range of topics in scientific research, including advancements in machine learning, AI, high-performance computing, and materials science. Breakthroughs include the development of a benchmark dataset for studying surrogate model accuracy in protein-ligand docking, the introduction of funcX for federated function serving, and the proposal of Data Stations for secure data architecture. Other notable contributions include the use of AI and HPC for lead generation in targeting SARS-CoV-2, the creation of a data ecosystem for supporting machine learning in materials science, and the development of machine learning models for predicting atomization energies accurately. These papers collectively reflect the cutting-edge research at the intersection of technology and scientific discovery.

### Tweet Thread Summary:
1. **Twins in rotational spectroscopy:** Explore the uniqueness of rotational spectra in identifying molecules, challenging assumptions with AI methods. [Link to paper](http://arxiv.org/abs/2404.04225v1)
2. **Trillion Parameter AI Infrastructure:** Survey envisioning TPM ecosystem for scientific discovery, highlighting technical challenges and system design. [Link to paper](http://arxiv.org/abs/2402.03480v1)
3. **Accelerating Electronic Stopping Power Predictions:** TDDFT and ML combo reduces time for electronic stopping power assessment, enabling predictions in hours. [Link to paper](http://arxiv.org/abs/2311.00787v1)
4. **Modular Science Factories:** Proposal for computation and AI-enabled SDLs, emphasizing modularity and scalability for diverse scientific applications. [Link to paper](http://arxiv.org/abs/2308.09793v2)
5. **ML Transforming Materials Science:** LLMs hackathon showcases LLM applications in chemistry and materials science, highlighting rapid prototype generation. [Link to paper](http://arxiv.org/abs/2306.06283v4)

### Interconnections:

- **Machine Learning and Materials Science:**
  - The paper on "Machine Learning Prediction of Atomization Energies" showcases the application of ML in predicting accurate energy values for organic molecules.
  - The "JARVIS-Leaderboard" paper introduces a platform for benchmarking materials design methods, which includes AI approaches.
- **AI and Scientific Discovery:**
  - The "funcX" paper describes a federated function serving fabric that enables scalable and high-performance remote function execution, catering to scientific needs.
  - "Serverless Supercomputing" introduces a serverless approach for high-performance function as a service, serving as a valuable resource for scientific applications.
- **AI and Drug Discovery:**
  - The "Targeting SARS-CoV-2 with AI-Enabled Lead Generation" and "IMPECCABLE" papers focus on AI-driven lead generation for COVID-19 drug discovery, demonstrating the potential of AI in accelerating drug screening processes.
- **Hybrid Computing Systems:**
  - The "DLHub" paper presents a system for model and data serving tailored for scientific applications, emphasizing the reproducibility and scalability of ML models in the scientific community.
  - "Accelerated, Scalable and Reproducible AI-driven Gravitational Wave Detection" leverages AI models to process large datasets rapidly, showcasing the efficiency of combining AI and HPC for scientific discovery.

# Get Anthropic Summary++

In [18]:
system_prompt = "You are modeling the mind of a researcher who has obtained a PhD in the field of study for the papers retrieved."

# Other models "claude-3-opus-20240229", "claude-3-sonnet-20240229", "claude-3-haiku-20240307"

anth = get_anthropic(input_for_summary, 
                         api_key=claude_api_key,
                         model="claude-3-haiku-20240307",
                         system_prompt=system_prompt, 
                         max_tokens=2000, 
                         temperature=0)
display(Markdown(anth))

Summary Paragraph:

The papers presented cover a wide range of topics in materials science and chemistry, showcasing how advanced computational methods, machine learning, and data infrastructure can accelerate scientific discovery. Key breakthroughs include demonstrating that rotational spectra do not uniquely identify molecules, enabling 10 million-fold speedups in electronic stopping power predictions, developing modular architectures for scalable "science factories", creating large-scale benchmarks for materials design methods, and leveraging large language models to transform materials research. The papers also highlight the importance of FAIR (Findable, Accessible, Interoperable, Reusable) principles for AI models and data, and present infrastructure like funcX and DLHub to enable federated, scalable, and reproducible computational workflows. Overall, these papers illustrate how the integration of AI, HPC, and data-centric approaches can drive transformative advances across materials science and chemistry.

Summary Tweet Thread:

1/ Fascinating new research in materials science and chemistry, showing how advanced computational methods, ML, and data infrastructure can accelerate discovery:

2/ Rotational spectra don't uniquely identify molecules, enabling 10M-fold speedups in electronic stopping power predictions, and modular "science factories" for scalable automation.

3/ Large benchmarks for materials design methods, leveraging large language models to transform research, and FAIR principles for AI models/data to enable reproducible workflows.

4/ Infrastructure like funcX and DLHub enabling federated, scalable, and reproducible computational pipelines across materials science and chemistry.

5/ Integration of AI, HPC, and data-centric approaches driving transformative advances - a glimpse of the future of materials research!

Interconnections:

The papers presented showcase how advanced computational methods, machine learning, and data infrastructure are transforming materials science and chemistry research. Several key interconnections emerge:

**Rotational Spectra and Molecular Identification**: The paper on "Twins in rotational spectroscopy" demonstrates that rotational spectra do not uniquely identify molecules, challenging a long-held assumption in the field. This has implications for techniques like rotational spectroscopy that rely on this assumption for structural determination.

**Accelerating Computational Methods**: Multiple papers focus on enabling massive speedups in computational methods, from the 10 million-fold acceleration in electronic stopping power predictions to the real-time analysis of flame spray pyrolysis experiments. These advances are critical for enabling high-throughput computational screening and optimization.

**Modular and Scalable Architectures**: The paper on "Towards a Modular Architecture for Science Factories" presents a framework for building scalable, reconfigurable experimental platforms, which can be combined with the computational infrastructure described in other papers to create integrated "science factories".

**Benchmarking and Reproducibility**: The "JARVIS-Leaderboard" paper and the work on FAIR principles for AI models highlight the importance of rigorous benchmarking and reproducibility in materials research. These efforts provide critical foundations for validating and comparing the various computational and machine learning methods described in the other papers.

**Leveraging Large Language Models**: The paper on how large language models can transform materials science demonstrates the potential of these powerful AI models to accelerate research across diverse domains, from knowledge extraction to experimental design.

**Data-Centric Computational Frameworks**: Several papers, including those on funcX, DLHub, and the Data Station, describe infrastructure for enabling federated, scalable, and reproducible computational workflows that tightly integrate data management, AI/ML, and high-performance computing. These frameworks are crucial for realizing the full potential of the computational and machine learning advances presented in the other papers.

By considering the interconnections between these diverse research directions, it becomes clear that the integration of advanced computational methods, machine learning, and data-centric approaches is driving transformative progress across materials science and chemistry.

# Anthropic Streaming Example
TODO: fix markdown rendering

In [20]:
# Other models "claude-3-opus-20240229", "claude-3-sonnet-20240229", "claude-3-haiku-20240307"

# Call the function with your input
get_streaming_anthropic(input_for_summary, 
                             model="claude-3-haiku-20240307",
                             api_key=claude_api_key,
                             system_prompt=system_prompt, 
                             max_tokens=750, 
                             temperature=0.5)


Summary Paragraph:

The collection of papers presented here highlights several important breakthroughs in the application of machine learning and artificial intelligence to accelerate scientific discovery across a range of domains, including materials science, chemistry, and astrophysics. Key advances include the development of benchmark datasets and evaluation frameworks to rigorously assess the performance of machine learning models [JARVIS-Leaderboard, Benchmark tests of atom segmentation], the creation of federated and modular platforms to enable scalable and reproducible AI-driven experimentation and analysis [funcX, Towards a Modular Architecture for Science Factories], and the establishment of FAIR principles to promote the findability, accessibility, interoperability, and reusability of AI models and scientific data [FAIR for AI, Community Action on FAIR Data]. Additionally, the papers showcase successful applications of machine learning to problems such as predicting molecular properties [Machine Learning Prediction of Accurate Atomization Energies], automating experimental workflows [Deep Learning for Automated Experimentation], and accelerating the detection of gravitational waves [Accelerated, Scalable and Reproducible AI-driven Gravitational Wave Detection]. Collectively, these advances demonstrate the transformative potential of AI and machine learning to revolutionize scientific discovery across a wide range of disciplines.

Summary Tweet Thread:

1/ This thread summarizes a collection of papers showcasing breakthroughs in applying machine learning and AI to accelerate scientific discovery. Key advances include:

2/ Development of benchmark datasets and evaluation frameworks to rigorously assess ML model performance [JARVIS-Leaderboard, Benchmark tests of atom segmentation]

3/ Creation of federated and modular platforms to enable scalable and reproducible AI-driven experimentation and analysis [funcX, Towards a Modular Architecture for Science Factories]

4/ Establishment of FAIR principles to promote the findability, accessibility, interoperability, and reusability of AI models and scientific data [FAIR for AI, Community Action on FAIR Data]

5/ Successful applications of ML to predict molecular properties [Machine Learning Prediction of Accurate Atomization Energies], automate experimental workflows [Deep Learning for Automated Experimentation], and accelerate gravitational wave detection [Accelerated, Scalable and Reproducible AI-driven Gravitational Wave Detection]

6/ These advances demonstrate the transformative potential of AI/ML to revolutionize scientific discovery across disciplines. The future of data-driven science is bright!

Interconnections:

The papers presented here can be grouped into several interconnected themes:

**Benchmark Datasets and Evaluation Frameworks**
- [JARVIS-Leaderboard] and [Benchmark tests of atom segmentation] describe the development of benchmark datasets and evaluation frameworks to rigorously assess the performance of machine learning models in materials science and microscopy applications.
- These benchmark efforts are crucial for promoting reproducibility, enabling fair comparisons between different modeling approaches, and accelerating progress in the field.

**Federated and Modular AI Platforms**
- [funcX] and [Towards a Modular Architecture for Science Factories] present the development of federated and modular platforms that enable scalable, flexible, and reproducible AI-driven experimentation and analysis.
- These platforms decouple the management of computational resources from the execution of scientific workflows, allowing for efficient utilization of diverse computing infrastructure (e.g., clouds, clusters, supercomputers) and