In [1]:
from utils.download_pdfs_from_url import download_papers
from utils import icml_parser 
from pathlib import Path
import pickle

## Project Parameters

In [2]:
PROJECT_DIR = './examples/icml_2024'

paper_pdf_dir = Path(PROJECT_DIR, 'paper_pdfs')
paper_parsed_dir = Path(PROJECT_DIR, 'paper_parsed.pkl')
paper_summaries = Path(PROJECT_DIR, 'paper_summaries.pkl')  # {'overview': <>, 'best': [{'tag': <>, 'summary': <>}, ...], 'all': []}
paper_mp3 = Path(PROJECT_DIR, 'mp3')  # overview.mp3, <tag>.mp3

## Download and Parse all ICML papers

In [3]:
#_ = download_papers('https://proceedings.mlr.press/v235/', paper_pdf_dir)

#all_papers = icml_parser.parse_folder(paper_pdf_dir)
#with open(paper_parsed_dir, 'wb') as f:
#    pickle.dump(all_papers, f)
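
The commented-out cell above is a download-and-parse step that only needs to run once, after which the result is read back from the pickle. A minimal sketch of that caching pattern follows; `load_or_build` is a hypothetical helper, not part of this project's `utils` package.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_build(cache_path: Path, build: Callable[[], Any]) -> Any:
    """Return the pickled object at cache_path, building and caching it if absent.

    Hypothetical helper: `build` is any zero-argument callable that produces
    the object to cache (for example, a call that parses a folder of PDFs).
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: skip the expensive build step entirely.
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    result = build()
    # Create the parent folder if needed, then store the result for next time.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result
```

With the names defined in this notebook, a call might look like `all_papers = load_or_build(paper_parsed_dir, lambda: icml_parser.parse_folder(paper_pdf_dir))`, which would avoid commenting the parsing cell in and out by hand.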

## Load ICML papers from pkl

In [4]:
with open(paper_parsed_dir, 'rb') as f:
    all_papers = pickle.load(f)

In [5]:
len(all_papers)

2610
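
Before summarizing, it can help to check what a parsed paper looks like. The sketch below assumes each paper is a dict with a `'sections'` list whose items carry a `'section_content'` string, which matches how the notebook later reads `all_papers[0]['sections'][0]['section_content']`; any other keys are unknown here. `section_lengths` and `example_paper` are hypothetical names used only for illustration.

```python
from typing import Any, Dict, List


def section_lengths(paper: Dict[str, Any]) -> List[int]:
    """Character count of each section's text, in document order."""
    return [len(s.get('section_content', '')) for s in paper.get('sections', [])]


# Hypothetical stand-in with the assumed layout, not real parser output.
example_paper = {
    'sections': [
        {'section_content': 'Abstract text'},
        {'section_content': 'Intro'},
    ]
}
print(section_lengths(example_paper))
```

Running `section_lengths(all_papers[0])` on the loaded data would show how many sections the parser found and flag empty ones.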

## Test single paper summary

In [6]:
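# Google AI Studio parameters
from utils.paper_ontology import *  # the wildcard import also brings `genai` into scope
import os
from IPython.display import Markdown

# Read the API key from the environment instead of hard-coding it in the notebook.
# GOOGLE_API_KEY is an assumed variable name; export it before starting the kernel.
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
flash = genai.GenerativeModel('gemini-1.5-flash')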

In [7]:
# Test single paper summary
po = PaperOntology(all_papers[0], flash)
ps = po.create_summary()
Markdown(ps)

## Overview of "Unveiling the Potential of AI for Nanomaterial Morphology Prediction" for Business Stakeholders

This research paper explores how Artificial Intelligence (AI) can revolutionize the creation of nanomaterials – incredibly tiny materials with unique properties used in various products.  Currently, designing nanomaterials with specific shapes and sizes is a slow, expensive, and highly experimental process.

**1. The Problem:**

Creating nanomaterials with desired characteristics (like size and shape) requires numerous costly and time-consuming experiments.  Current methods rely heavily on trial-and-error and expert intuition, making the process inefficient and expensive.  This significantly hinders innovation and the development of new nanomaterial-based products.

**2. Use Cases Impacted:**

This research directly impacts industries that rely on nanomaterials, including:

* **Medicine:** Drug delivery systems, targeted therapies.
* **Energy:** Improved batteries, fuel cells, solar cells.
* **Catalysis:** More efficient chemical reactions in manufacturing.
* **Manufacturing:** Development of new materials with enhanced properties (strength, conductivity, etc.)

Essentially, any industry using nanomaterials could benefit from faster, cheaper, and more efficient design processes.

**3. Proposed Approach and Its Novelty:**

The paper proposes using AI, specifically machine learning (ML) and large language models (LLMs), to predict the morphology (shape and size) of nanomaterials based on the synthesis parameters (the ingredients and conditions used to create them).  This is a significant departure from the current experimental approach.  Instead of many physical experiments, AI models could predict the outcome, dramatically reducing costs and time.

The novelty lies in:

* **A significantly larger dataset:** The researchers created a much larger dataset of nanomaterial syntheses and their resulting morphologies than previous studies, enabling more powerful AI models.
* **Use of LLMs:**  This is one of the first studies to use LLMs (like GPT-4) for nanomaterial morphology prediction, allowing for prediction based on natural language descriptions of the synthesis process.
* **Text-to-image generation:** The researchers also explored generating images of the predicted nanomaterials directly from text descriptions, offering a more intuitive way to visualize the results.


**4. Fundamental Techniques:**

The approach is based on:

* **Machine Learning (ML):**  Specifically, the researchers used classical ML techniques like Random Forest and Gradient Boosted Trees to predict nanomaterial shapes and sizes from synthesis data.
* **Large Language Models (LLMs):**  These models process natural language (descriptions of the synthesis process) to predict the nanomaterial morphology.
* **Statistical Analysis:**  Statistical methods were used to identify which synthesis parameters are most important in determining the final morphology.
* **Image Processing:** Techniques were used to analyze microscopy images of the nanomaterials to extract features and train the AI models.

**5. Methods, Algorithms, and Frameworks:**

The researchers utilized:

* **Random Forest & Gradient Boosted Trees:** ML algorithms for classification and prediction.
* **GPT-4, GPT-3.5-turbo, Mistral LLMs:** State-of-the-art LLMs for natural language processing and prediction.
* **Variational Autoencoders (VAEs):**  A type of neural network used for the text-to-image generation.
* **BERT:** A natural language processing model used for feature extraction from text descriptions.


**6. Benchmarks and Metrics:**

The researchers tested their models on a large dataset of calcium carbonate nanomaterials. They used standard classification metrics:

* **Accuracy:** The percentage of correctly predicted nanomaterial morphologies.
* **F1-score:** A metric that considers both precision and recall (correctly identifying positive and negative cases).

**7. Outperformance:**

The proposed LLM-based approach achieved higher accuracy in predicting nanomaterial shapes than traditional machine learning methods (Random Forest and XGBoost).  In certain cases, the performance on underrepresented categories was significantly improved by LLMs.


**8. Main Conclusion and Impact:**

The paper concludes that AI, particularly LLMs, holds significant promise for accelerating nanomaterial design.  The use of LLMs allows scientists to leverage natural language descriptions of synthesis procedures, making the process more accessible and less reliant on complex ML expertise. This leads to:

* **Reduced Costs:** Fewer physical experiments needed.
* **Faster Innovation:** More efficient design cycles.
* **Enhanced Material Properties:**  Better control over nanomaterial characteristics.

For businesses, this translates to potential cost savings, faster product development, and the ability to create novel nanomaterial-based products with improved performance.  Future research should focus on refining the LLMs and text-to-image systems, as well as exploring applications to other types of nanomaterials.  The improved scalability and affordability of LLMs, as exemplified by the comparable performance of Mistral-small to GPT-4, further promotes adoption in industrial settings.


## Test Single Ontology Creation

In [8]:
# For Neo4j Python driver installation, see https://neo4j.com/docs/python-manual/current/install/
# See also the comments in paper_ontology.py

URI = "neo4j://localhost:7687"
AUTH = ("neo4j", "secretgraph")
with GraphDatabase.driver(URI, auth=AUTH) as driver:
    driver.verify_connectivity()
    print("Connection established.")

Connection established.


In [9]:
# Test single ontology creation
pj = po.create_ontology_json()

In [11]:
kg = OntologyKG(URI, AUTH[0], AUTH[1])

kg.clean()
kg.insert(pj)

# Go to http://localhost:7474 and click * to check the graph
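
How `OntologyKG.insert` maps the ontology JSON onto the graph is not shown in this notebook. As a rough illustration of the general idea, the sketch below turns a small JSON-like dict into parameterized Cypher `MERGE` statements. The schema (`'nodes'` with `'name'`/`'kind'`, `'relations'` with `'source'`/`'target'`/`'type'`), the `Concept` label, and the `RELATED` relationship type are all assumptions made for this example, not the format produced by `create_ontology_json`.

```python
from typing import Any, Dict, List, Tuple

# A Cypher query string paired with its parameter dict.
Statement = Tuple[str, Dict[str, Any]]


def ontology_to_cypher(ontology: Dict[str, Any]) -> List[Statement]:
    """Build idempotent Cypher statements from a hypothetical ontology dict."""
    statements: List[Statement] = []
    # MERGE creates a node only if one with the same name does not exist yet.
    for node in ontology.get('nodes', []):
        statements.append((
            "MERGE (n:Concept {name: $name}) SET n.kind = $kind",
            {'name': node['name'], 'kind': node.get('kind', 'unknown')},
        ))
    # Relationship types cannot be query parameters in Cypher, so the
    # relation's label is stored as a property on a fixed RELATED type.
    for rel in ontology.get('relations', []):
        statements.append((
            "MATCH (a:Concept {name: $source}), (b:Concept {name: $target}) "
            "MERGE (a)-[r:RELATED {type: $type}]->(b)",
            {'source': rel['source'], 'target': rel['target'], 'type': rel['type']},
        ))
    return statements


# Hypothetical input, loosely modeled on the summary of the first paper.
example = {
    'nodes': [
        {'name': 'Nanomaterial morphology prediction', 'kind': 'problem'},
        {'name': 'Large language models', 'kind': 'method'},
    ],
    'relations': [
        {'source': 'Large language models',
         'target': 'Nanomaterial morphology prediction',
         'type': 'ADDRESSES'},
    ],
}
for query, params in ontology_to_cypher(example):
    print(query, params)
```

With a recent Neo4j Python driver, each pair could be run with `driver.execute_query(query, parameters_=params)`. Using `MERGE` rather than `CREATE` keeps repeated insertions from duplicating nodes.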

## Test QA Engine using neo4j

In [6]:
from utils.paper_QA import *

#kg.clean()

docs = paper2doc([all_papers[0]])
len(docs)

10

In [7]:
db = EmbeddingDB("neo4j://localhost:7687", "neo4j", "secretgraph", docs)
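
Queries against an embedding store like this one typically rank documents by a vector-similarity score between the query embedding and each document embedding. Whether `EmbeddingDB` uses cosine similarity, and whether its scores are rescaled, is an assumption here. The sketch below shows plain cosine-similarity ranking on toy vectors; `cosine_similarity` and `rank_documents` are hypothetical helpers.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
toy_docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
for index, score in rank_documents([1.0, 0.0], toy_docs):
    print(index, round(score, 4))
```

A query taken verbatim from a stored document, as in the next cell, would be expected to score close to the top of the range for that document.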

In [8]:
query = all_papers[0]['sections'][0]['section_content']
_ = db.query(query)

--------------------------------------------------------------------------------
Score:  0.9593460559844971
Creation of nanomaterials with specific morphol-
ogy remains a complex experimental process,
even though there is a growing demand for these
materials in various industry sectors. This study
explores the potential of AI to predict the morphol-
ogy of nanoparticles within the data availability
constraints. For that, we first generated a new
multi-modal dataset that is double the size of anal-
ogous studies. Then, we systematically evaluated
performance of classical machine learning and
large language models in prediction of nanoma-
terial shapes and sizes. Finally, we prototyped
a text-to-image system, discussed the obtained
empirical results, as well as the limitations and
promises of existing approaches.

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.9230