# Preamble

## Library Imports

In [1]:
%%capture
!pip install -r 'lib/requirements.txt'

In [2]:
from lib.openai import gpt4
from lib.display_md import display_md
from lib.llama2 import llama2

# Work Diary

## Sep 19 - `naive-zero-shot.template.ipynb`



### one-pager text

In [3]:
one_pager = """
Background
* A primary barrier to the widespread adoption of clean hydrogen fuel storage is discovering a practical material for Hydrogen Electrolysis.
* At the current rate of materials science advancement, researchers project that the discovery and adoption of such an advanced material will take until 20XX[a].
* However, climate change and pollution are occurring rapidly, so it is critical to accelerate this R&D timeframe to 2035.
* The most significant barrier to accelerating this timeframe is the quantity of researcher toil required per cycle of advanced materials experiments, because [order of magnitude][b] experiments are projected to be needed to make the necessary discoveries.
* The primary root cause of researcher toil is the human-inaccessibility of Hydrogen Electrolysis Materials Science (HEMS) experimental data.
* HEMS experimental data is inaccessible because there is no widely adopted ontology for HEMS; thus, tens of millions[c] of HEMS experiments are stored in heterogeneous formats by thousands of materials synthesis labs[d]
* We presume that HEMS data will remain heterogeneous without Practical Automated Ontologization for HEMS Heterogeneous Experimental Data due to intractable issues such as researcher turnover at labs.
* Forschungszentrum Jülich Department of AI and Data Analytics for Integrated Clean Energy Technologies seeks research advancements to commercialize practical, clean hydrogen fuel storage.
* ValuestreamAI seeks a challenging, impactful engineering problem to solve.
Hackathon Problem Statement
* Automate the classification of non-standardized CSV headers to Node Labels compliant with the Proposed EMMO Extension Ontology for Hydrogen Electrolysis Materials Science.
* Why we chose this subproblem:
   * Alignment with the broader goals of the client research department
   * Urgency for the principal researcher
   * Impact on the principal researcher
   * Estimated business and technical analysis scope smaller than 70 hours
   * Estimated implementation scope smaller than 30 hours
   * Representative of the department's larger problem set: Header Classification is a microcosm of the entire problem of solving the HEMS Heterogeneous Experimental Data problem.
   * It is approachable enough for ValuestreamAI to upskill local engineers to do strong junior-level analysis and implementation in less than 150 hours.
* KPIs
   * ValuestreamAI seeks to choose KPIs that are an accurate proxy for implementing Practical Automated Ontologization for HEMS Heterogeneous Experimental Data.
   * Classifier Precision: precision of the classifier is paramount because a human user will distrust the inference system if they feel they cannot trust positive inferences
   * Classifier Recall: recall determines the percent of researcher toil automated for the given subtask
   * Runtime cost per million rows: determines the ongoing per-unit cost of testing and using the inference pipeline. Fortunately, this cost is decreasing exponentially due to ongoing industry competition.
   * Upfront R&D costs: the one-time costs incurred before the inference pipeline is usable. There should be an attainable cost curve for achieving useful milestones.
   * Researcher satisfaction with deliverable
   * Common KPIs considered but not included:
      * F1 score: because we already include precision and recall
      * Concept Drift Adaptability: future work
      * Training time and cost: because we are starting with zero-shot techniques
Solution Constraints
   * The proposed solution must plausibly
      * represent the larger problem of Practical Automated Ontologization for HEMS Heterogeneous Experimental Data
      * scale to reduce a top category of researcher toil by at least 90%
      * be adaptable to support high-quality integration with neo4j and VIMI
Hackathon Proposed Deliverable
   * Deliverable will be a GitHub repo including the following:
   * Jupyter notebooks, including inference proof of concept, experiments, evaluation, and analysis
   * A self-contained development environment for reproducing or extending the experiments
   * Well-written documentation for relevant R&D workflows
   * Up to three hours of recorded live technical deep dive sessions for the client research team
   * Deliverable attributes
      * Complementary to existing research work
      * MIT license
      * High standard of readability
Payment Terms
   * ValuestreamAI requests payment only if the deliverable meets or exceeds the client research team's expectations.
   * NET60 from the day of delivery to the client research team
   * Requested payment amount is the approximate cost of competition preparation and travel expenses, $4000 CAD
   * Engineering hours are granted pro bono for this project.
Future Work
   * The proposed deliverable is only a tiny slice of the larger problem of Practical Automated Ontologization for HEMS Heterogeneous Experimental Data
   * However, the ValuestreamAI team is in the early stages of understanding the broader context of this research field
   * The next known subproblems are:[e]
      * Defining the complete problem scope of Practical Automated Ontologization for HEMS Heterogeneous Experimental Data
      * Inferring ontological relationships between heterogeneous data fields
      * Investigating the potential for collaboration with other research institutions to pool resources and data
      * Evaluating the effectiveness of the ontology in practice and making necessary adjustments based on feedback from researchers
      * Exploring the potential for machine learning algorithms to predict outcomes based on the HEMS data
Jargon and Entities
   * Hydrogen Electrolysis Materials Science (HEMS): The field of research
   * HEMS Heterogeneous Experimental Data: The subproblem under research
   * Proposed EMMO Extension Ontology: The department's proposed ontology for HEMS
   * Practical Automated Ontologization for HEMS Heterogeneous Experimental Data: The proposed solution to inaccessible experimental data
   * neo4j: an industry-standard graph database
   * VIMI: [f]


[a]What's the year?
[b]What's the quantity?
[c]quantity true?
[d]quantity correct?
[e]Max to add a few more bullet points
[f]What does VIMI stand for?
"""

### prompt

In [4]:
prompt = f"""
What sections should be in a proof of concept notebook for doing zero-shot prediction of materials science ontology classes from experimental csv data headers?

For context, the engineering proposal is given in <one-pager>

<one-pager>
{one_pager}
</one-pager>
"""

In [5]:
completion = gpt4(prompt)

In [6]:
display_md(completion)

Based on the provided context, the following sections should be included in a proof of concept notebook for doing zero-shot prediction of materials science ontology classes from experimental CSV data headers:

1. **Introduction**: This section should provide a brief overview of the problem, the purpose of the notebook, and the approach used to solve the problem.

2. **Data Loading and Exploration**: This section should include code and explanations for loading the CSV data, exploring its structure, and understanding the headers.

3. **Data Preprocessing**: This section should detail the steps taken to clean and preprocess the data for the zero-shot prediction task. This might include handling missing values, standardizing header names, etc.

4. **Ontology Loading and Exploration**: This section should include code and explanations for loading the Proposed EMMO Extension Ontology for Hydrogen Electrolysis Materials Science, exploring its structure, and understanding the Node Labels.

5. **Zero-Shot Prediction Model**: This section should detail the zero-shot prediction model used, including its architecture, how it was trained (if applicable), and how it makes predictions.

6. **Model Application**: This section should show how the model is applied to the CSV headers to predict ontology classes. It should include code, explanations, and visualizations of the results.

7. **Evaluation**: This section should evaluate the performance of the model using the KPIs mentioned in the proposal, such as Classifier Precision, Classifier Recall, Runtime cost per million rows, and Upfront R&D costs.

8. **Discussion**: This section should discuss the results, including any limitations, potential improvements, and how well the model meets the solution constraints.

9. **Conclusion and Future Work**: This section should summarize the findings and discuss potential future work, as outlined in the proposal.

10. **References**: This section should list any references used in the notebook.

Remember to include clear explanations and visualizations throughout the notebook to make it easy to understand.
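
As a starting point for sections 2, 3, and 6 of the outline above, the following is a rough sketch of reading only the header row of a CSV, normalizing header names, and building a zero-shot classification prompt per header. Everything here is an assumption made for illustration: the function names `read_headers`, `normalize_header`, and `build_prompt`, the sample CSV, and the candidate labels are hypothetical, and the real Node Labels would come from the Proposed EMMO Extension Ontology.

```python
import csv
import io
import re


def read_headers(csv_text: str) -> list[str]:
    """Return the first row of a CSV document as the list of raw headers."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader, [])


def normalize_header(header: str) -> str:
    """Lowercase a header and collapse runs of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", header.lower()).strip()


def build_prompt(header: str, labels: list[str]) -> str:
    """Build a zero-shot prompt asking for exactly one of the candidate labels."""
    options = "\n".join(f"- {label}" for label in labels)
    return (
        "Classify the CSV header into exactly one of the candidate Node Labels, "
        "or answer NONE if no label fits.\n\n"
        f"<header>{header}</header>\n\n"
        f"<labels>\n{options}\n</labels>"
    )


# Hypothetical sample data and labels, for illustration only.
sample_csv = "Temp_(C),V-cell [V],Current  Density\n25,1.8,0.5\n"
labels = ["Temperature", "Voltage", "CurrentDensity"]

headers = read_headers(sample_csv)
normalized = [normalize_header(h) for h in headers]
prompts = [build_prompt(h, labels) for h in headers]

print(normalized)
# prints: ['temp c', 'v cell v', 'current density']
```

Each string in `prompts` could then be passed to the notebook's `gpt4` helper, assuming it accepts a single prompt string as in the cell above. The sketch sends the raw header rather than the normalized one, since units and brackets may carry useful signal; the normalized form is mainly useful for deduplicating headers before inference.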