# Preamble

## Library Imports

In [1]:
%%capture
!pip install -r 'lib/requirements.txt'

In [2]:
from lib.openai import gpt4
from lib.display_md import display_md
from lib.llama2 import llama2

# 19 Sep - Editing Outline

## Outline Text

In [3]:
outline = """
Background
* A primary barrier to the widespread adoption of clean hydrogen fuel storage is discovering a practical material for Hydrogen Electrolysis.
* At the current rate of materials science advancement, researchers project that the discovery and adoption of such an advanced material will take until 20XX[a].
* However, climate change and pollution are occurring rapidly, so it is critical to accelerate this R&D timeframe to 2035.
* The most significant barrier to accelerating this timeframe is the quantity of researcher toil required per cycle of advanced materials experiments because [order of magnitude][b] experiments are projected to be needed to make the necessary discoveries
* The primary root cause of researcher toil is the human-inaccessibility of Hydrogen Electrolysis Materials Science (HEMS) experimental data.
* HEMS experimental data is inaccessible because there is no widely adopted ontology for HEMS; thus, tens of millions[c] of HEMS experiments are stored in heterogeneous formats by thousands of materials synthesis labs[d]
* We presume that HEMS data will remain heterogeneous without Practical Automated Ontologization for HEMS Heterogeneous Experimental Data due to intractable issues such as researcher turnover at labs.
* Forschungszentrum Jülich Department of AI and Data Analytics for Integrated Clean Energy Technologies seeks research advancements to commercialize practical, clean hydrogen fuel storage.
* ValuestreamAI seeks a challenging, impactful engineering problem to solve.
Hackathon Problem Statement
* Automate the classification of non-standardized CSV headers to Node Labels compliant with the Proposed EMMO Extension Ontology for Hydrogen Electrolysis Materials Science.
* Why we chose this subproblem:
   * Alignment with the broader goals of the client research department
   * Urgency for the principal researcher
   * Impact on the principal researcher
   * Estimated business and technical analysis scope smaller than 70 hours
   * Estimated implementation scope smaller than 30 hours
   * Representative of the department's larger problem set: Header Classification is a microcosm of the entire problem of solving the HEMS Heterogeneous Experimental Data problem.
   * It is approachable enough for ValuestreamAI to upskill local engineers to do strong junior-level analysis and implementation in less than 150 hours.
* KPIs
   * ValuestreamAI seeks to choose KPIs that are an accurate proxy for implementing Practical Automated Ontologization for HEMS Heterogeneous Experimental Data.
   * Classifier Precision: precision of the classifier is paramount because a human user will distrust the inference system if they feel they cannot trust positive inferences
   * Classifier Recall: recall determines the percent of researcher toil automated for the given subtask
   * Runtime cost per million rows: determines the ongoing per-unit cost of testing and using the inference pipeline. Fortunately, this cost is falling exponentially due to ongoing industry price competition.
   * Upfront R&D costs: the upfront costs. There should be an attainable cost curve for achieving useful milestones.
   * Researcher satisfaction with deliverable
   * Common KPIs considered but not included:
      * F1 score: because we already include precision and recall
      * Concept Drift Adaptability: future work
      * Training time and cost: because we are starting with zero-shot techniques
Solution Constraints
   * The proposed solution must plausibly
      * represent the larger problem of Practical Automated Ontologization for HEMS Heterogeneous Experimental Data
      * scale to reduce a top category of researcher toil by at least 90%
      * be adaptable to support high-quality integration with neo4j and VIMI
Hackathon Proposed Deliverable
   * Deliverable will be a GitHub repo including the following:
      * Jupyter notebooks, including inference proof of concept, experiments, evaluation, and analysis
      * A self-contained development environment for reproducing or extending the experiments
      * Well-written documentation for relevant R&D workflows
      * Up to three hours of recorded live technical deep dive sessions for the client research team
   * Deliverable attributes
      * Complementary to existing research work
      * MIT license
      * High standard of readability
Payment Terms
   * ValuestreamAI requests payment only if the deliverable meets or exceeds the client research team's expectations.
   * NET60 from the day of delivery to the client research team
   * Requested payment amount is the approximate cost of competition preparation and travel expenses, $4000 CAD
   * Engineering hours are granted pro bono for this project.
Future Work
   * The proposed deliverable is only a tiny slice of the larger problem of Practical Automated Ontologization for HEMS Heterogeneous Experimental Data
   * However, the ValuestreamAI team is in the early stages of understanding the broader context of this research field
   * The next known subproblems are:[e]
      * Defining the complete problem scope of Practical Automated Ontologization for HEMS Heterogeneous Experimental Data
      * Inferring ontological relationships between heterogeneous data fields
      * Investigating the potential for collaboration with other research institutions to pool resources and data
      * Evaluating the effectiveness of the ontology in practice and making necessary adjustments based on feedback from researchers
      * Exploring the potential for machine learning algorithms to predict outcomes based on the HEMS data
Jargon and Entities
   * Hydrogen Electrolysis Materials Science (HEMS): The field of research
   * HEMS Heterogeneous Experimental Data: The subproblem under research
   * Proposed EMMO Extension Ontology: The department's proposed ontology for HEMS
   * Practical Automated Ontologization for HEMS Heterogeneous Experimental Data: The proposed solution to inaccessible experimental data
   * neo4j: an industry-standard graph database
   * VIMI: [f]


[a]What's the year?
[b]What's the quantity?
[c]Is this quantity true?
[d]Is this quantity correct?
[e]Max to add a few more bullet points
[f]What does VIMI stand for?
"""

## Background

In [4]:
prompt = f"""
I am writing a one-page engineering proposal outline for a research team at Forschungszentrum Jülich
Department of AI and Data Analytics for Integrated Clean Energy Technologies.

The text of the proposal is given in <text>

Answer the following questions about the background section:
- Does the problem narrative flow well?
- Does any bullet point appear potentially factually incorrect?
- Are there any other obvious problems with the prose?

<text>
{outline}
</text>
"""

In [5]:
completion = gpt4(prompt)

In [6]:
display_md(completion)

The background section of the proposal is generally well-written and provides a clear narrative of the problem at hand. However, there are a few areas that could be improved:

1. Problem Narrative Flow: The narrative flow is generally good, but it could be improved by providing more context or explanation for some of the technical terms and concepts. For example, the term "Hydrogen Electrolysis Materials Science (HEMS)" is introduced without any explanation of what it involves. Similarly, the concept of "Practical Automated Ontologization for HEMS Heterogeneous Experimental Data" could be explained in more detail.

2. Factual Accuracy: There are several bullet points where the information provided is unclear or potentially incorrect. For example, the bullet point "At the current rate of materials science advancement, researchers project that the discovery and adoption of such an advanced material will take until 20XX[a]" lacks a specific year. Similarly, the bullet points "[order of magnitude][b] experiments are projected to be needed to make the necessary discoveries" and "tens of millions[c] of HEMS experiments are stored in heterogeneous formats by thousands of materials synthesis labs[d]" lack specific quantities or sources to verify the information.

3. Prose Problems: There are a few instances where the prose could be improved for clarity and readability. For example, the bullet point "We presume that HEMS data will remain heterogeneous without Practical Automated Ontologization for HEMS Heterogeneous Experimental Data due to intractable issues such as researcher turnover at labs" is quite long and complex, making it difficult to understand. It could be broken down into simpler sentences or rephrased for clarity.

4. Unexplained Acronyms: The acronym "VIMI" is used without any explanation of what it stands for. This could potentially confuse readers who are not familiar with this term.

5. Incomplete Information: There are several places where information seems to be missing or incomplete, such as the placeholders "[a]", "[b]", "[c]", "[d]", "[e]", and "[f]". These should be replaced with the relevant information or references.
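The prompt used in this section follows a fixed template: client context, a pointer to `<text>`, a list of section-specific questions, and the outline itself. As a minimal sketch, the template could be factored into a helper so that each section only supplies its name and questions. The names `CLIENT_CONTEXT` and `build_review_prompt` are hypothetical and not part of the notebook's `lib` package.

```python
# Client context shared by every review prompt in this notebook.
CLIENT_CONTEXT = (
    "I am writing a one-page engineering proposal outline for a research team at "
    "Forschungszentrum Jülich\n"
    "Department of AI and Data Analytics for Integrated Clean Energy Technologies."
)


def build_review_prompt(section: str, questions: list[str], outline: str) -> str:
    """Assemble a section-review prompt in the same shape as the notebook cells."""
    question_lines = "\n".join(f"- {q}" for q in questions)
    return (
        f"\n{CLIENT_CONTEXT}\n\n"
        "The text of the proposal is given in <text>\n\n"
        f"Answer the following questions about the {section} section:\n"
        f"{question_lines}\n\n"
        f"<text>\n{outline}\n</text>\n"
    )


# Placeholder outline text, used here only to show the resulting prompt shape.
example = build_review_prompt(
    "Payment Terms",
    ["Do the payment terms make sense?"],
    outline="(outline text goes here)",
)
print(example)
```

In the notebook, the returned string would be passed to `gpt4(...)` in place of the hand-written `prompt` variable.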

## Problem Statement

In [7]:
prompt = f"""
I am writing a one-page engineering proposal outline for a research team at Forschungszentrum Jülich
Department of AI and Data Analytics for Integrated Clean Energy Technologies.

The text of the proposal is given in <text>

Answer the following questions about the Problem Statement section:
- Is the problem statement named accurately?
- Can the problem statement name be made more concise?
- Does the rationale for choosing the problem statement make sense?
- Should any rationale be added, removed, or changed?

<text>
{outline}
</text>
"""

In [8]:
completion = gpt4(prompt)

In [9]:
display_md(completion)

Problem Statement Section Analysis:

- The problem statement is named accurately. It clearly states the problem of automating the classification of non-standardized CSV headers to Node Labels compliant with the Proposed EMMO Extension Ontology for Hydrogen Electrolysis Materials Science.

- The problem statement name is already concise and specific. It might not be advisable to make it more concise as it might lose its specificity and clarity.

- The rationale for choosing the problem statement makes sense. It aligns with the broader goals of the client research department, has urgency and impact for the principal researcher, and is representative of the department's larger problem set.

- The rationale seems comprehensive and well thought out. However, it might be beneficial to add a rationale explaining how this problem statement aligns with the overall mission of Forschungszentrum Jülich Department of AI and Data Analytics for Integrated Clean Energy Technologies. This could provide a broader context and further justify the choice of the problem statement.
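To make the problem statement concrete, the following is a minimal sketch of how a zero-shot header classifier might be framed: one prompt per CSV header, with the model reply mapped back onto a closed label set. The label list is hypothetical and for illustration only; the real Node Labels would come from the Proposed EMMO Extension Ontology. The functions `build_header_prompt` and `parse_label` are likewise assumptions, not existing code.

```python
from typing import Optional

# Hypothetical label set for illustration only; the real Node Labels would
# come from the Proposed EMMO Extension Ontology.
EXAMPLE_LABELS = ["Temperature", "CurrentDensity", "Voltage", "Material"]


def build_header_prompt(header: str, labels: list[str]) -> str:
    """Build a zero-shot prompt asking for exactly one label, or NONE."""
    label_lines = "\n".join(f"- {label}" for label in labels)
    return (
        "Classify the CSV column header given in <header> as one of the node labels below.\n"
        "Reply with the label only, or NONE if no label fits.\n\n"
        f"{label_lines}\n\n"
        f"<header>{header}</header>\n"
    )


def parse_label(response: str, labels: list[str]) -> Optional[str]:
    """Map a raw model reply onto the label set; anything else counts as no prediction."""
    cleaned = response.strip().strip(".").lower()
    by_lowercase = {label.lower(): label for label in labels}
    return by_lowercase.get(cleaned)


prompt_text = build_header_prompt("temp_degC", EXAMPLE_LABELS)
# In the notebook, the reply would come from a call such as gpt4(prompt_text).
print(parse_label(" temperature.\n", EXAMPLE_LABELS))  # Temperature
print(parse_label("I think it is a pressure reading", EXAMPLE_LABELS))  # None
```

Treating any reply outside the label set as "no prediction" favors precision over recall, which matches the outline's emphasis on users being able to trust positive inferences.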

## KPIs

In [10]:
prompt = f"""
I am writing a one-page engineering proposal outline for a research team at Forschungszentrum Jülich
Department of AI and Data Analytics for Integrated Clean Energy Technologies.

The text of the proposal is given in <text>

Answer the following questions about the KPIs section:
- Do the KPIs chosen make sense?
- Are there any additional KPIs I should consider?
- Do the rationales for using the KPIs make sense?

<text>
{outline}
</text>
"""

In [11]:
completion = gpt4(prompt)

In [12]:
display_md(completion)

The KPIs chosen for this proposal do make sense. They are directly related to the project's objectives and will provide a clear measure of the project's success. The KPIs chosen are:

1. Classifier Precision: This is important as it measures the accuracy of the classification system. A high precision means that the system is correctly identifying the headers, which is crucial for the project's success.

2. Classifier Recall: This measures the percentage of correct classifications out of the total classifications. A high recall means that the system is not missing any relevant headers, which is also crucial for the project's success.

3. Runtime cost per million rows: This measures the cost-effectiveness of the system. A lower cost per million rows means that the system is efficient and cost-effective.

4. Upfront R&D costs: This measures the initial investment required for the project. A lower upfront cost means that the project is financially feasible.

5. Researcher satisfaction with deliverable: This measures the satisfaction of the researchers with the final product. A high satisfaction rate means that the product meets the researchers' needs and expectations.

In addition to these KPIs, you might also consider including KPIs related to the system's scalability and adaptability. For example, you could measure the system's ability to handle increasing amounts of data without a significant increase in runtime cost. You could also measure the system's ability to adapt to changes in the data format or structure.

The rationales for using these KPIs also make sense. They are directly related to the project's objectives and will provide a clear measure of the project's success. For example, the rationale for using Classifier Precision and Recall is that a high precision and recall are crucial for the success of the project. The rationale for using Runtime cost per million rows and Upfront R&D costs is that a lower cost per million rows and a lower upfront cost make the project financially feasible. The rationale for using Researcher satisfaction with deliverable is that a high satisfaction rate means that the product meets the researchers' needs and expectations.
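For reference, the two classifier KPIs can be computed with the standard definitions: precision is the share of predicted labels that are correct, and recall is the share of headers with a true label that the classifier labels correctly. The sketch below assumes a classifier that may abstain (represented as `None`) and uses micro-averaging across labels; the data and label names are hypothetical.

```python
from typing import Optional


def precision_recall(
    truths: list[Optional[str]], predictions: list[Optional[str]]
) -> tuple[float, float]:
    """Micro-averaged precision and recall, where None means "no label"."""
    correct = sum(
        1
        for truth, pred in zip(truths, predictions)
        if pred is not None and pred == truth
    )
    predicted = sum(1 for pred in predictions if pred is not None)
    labelable = sum(1 for truth in truths if truth is not None)
    precision = correct / predicted if predicted else 0.0
    recall = correct / labelable if labelable else 0.0
    return precision, recall


# Hypothetical data: five headers; the classifier gets one wrong and abstains on two.
truths = ["Temperature", "Voltage", "Voltage", "Material", None]
predictions = ["Temperature", "Voltage", "CurrentDensity", None, None]

precision, recall = precision_recall(truths, predictions)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```

Here 2 of 3 predicted labels are correct (precision 0.67) and 2 of 4 labelable headers are recovered (recall 0.50), so the recall figure corresponds to the share of the labeling subtask that would be automated.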

## Solution Constraints

In [13]:
prompt = f"""
I am writing a one-page engineering proposal outline for a research team at Forschungszentrum Jülich
Department of AI and Data Analytics for Integrated Clean Energy Technologies.

The text of the proposal is given in <text>

Answer the following questions about the Solution Constraints section:
- Do the solution constraints make sense for the client?

<text>
{outline}
</text>
"""

In [14]:
completion = gpt4(prompt)

In [15]:
display_md(completion)

The solution constraints outlined in the proposal do make sense for the client, Forschungszentrum Jülich Department of AI and Data Analytics for Integrated Clean Energy Technologies. The constraints are aligned with the client's broader goals of commercializing practical, clean hydrogen fuel storage and reducing researcher toil by automating the classification of non-standardized CSV headers. 

The constraints also take into account the need for the solution to be scalable and adaptable, which are crucial for handling the large and diverse datasets involved in Hydrogen Electrolysis Materials Science (HEMS). The requirement for the solution to support high-quality integration with neo4j and VIMI is also sensible, as these tools are likely to be part of the client's existing data infrastructure.

However, it's important to note that the success of the proposed solution will depend on its ability to meet these constraints in practice. This will require careful planning, rigorous testing, and ongoing collaboration with the client to ensure that the solution is effectively addressing their needs.

## Hackathon Proposed Deliverables

In [16]:
prompt = f"""
I am writing a one-page engineering proposal outline for a research team at Forschungszentrum Jülich
Department of AI and Data Analytics for Integrated Clean Energy Technologies.

The text of the proposal is given in <text>

Answer the following questions about the Hackathon Proposed Deliverables section:
- Do the deliverables proposed make sense for the proposed problem scope?
- Are there any other deliverables I should consider?

<text>
{outline}
</text>
"""

In [17]:
completion = gpt4(prompt)

In [18]:
display_md(completion)

The deliverables proposed in the Hackathon Proposed Deliverables section make sense for the proposed problem scope. The deliverables include a Github repo with Jupyter notebooks, a self-contained development environment, well-written documentation, and recorded technical deep dive sessions. These deliverables align with the problem of automating the classification of non-standardized CSV headers to Node Labels compliant with the Proposed EMMO Extension Ontology for Hydrogen Electrolysis Materials Science.

However, there are a few additional deliverables you might consider:

1. A detailed project report: This could include the methodology, challenges faced, solutions implemented, and key learnings from the project. This would be beneficial for future reference and for other teams who might work on similar problems.

2. A user manual or guide: This would be helpful for the research team to understand how to use the developed system effectively.

3. A maintenance and support plan: Given the technical nature of the project, it would be beneficial to have a plan in place for addressing any issues or bugs that might arise after the project is delivered.

4. A roadmap for future work: As mentioned in the proposal, the deliverable is only a small part of the larger problem. Providing a roadmap for future work could help the research team understand the next steps and how they can continue to build on the work done in this project. 

5. Training sessions: Depending on the complexity of the system, it might be beneficial to offer training sessions to the research team to ensure they can effectively use and maintain the system.

## Payment Terms

In [19]:
prompt = f"""
I am writing a one-page engineering proposal outline for a research team at Forschungszentrum Jülich
Department of AI and Data Analytics for Integrated Clean Energy Technologies.

The text of the proposal is given in <text>

Answer the following questions about the Payment Terms section:
- Do the payment terms make sense?

<text>
{outline}
</text>
"""

In [20]:
completion = gpt4(prompt)

In [21]:
display_md(completion)

The Payment Terms section of the proposal seems to make sense. It states that ValuestreamAI will only request payment if the deliverable meets or exceeds the client research team's expectations. The payment is due NET60 from the day of delivery to the client research team. The requested payment amount is $4000 CAD, which is stated to cover the approximate cost of competition preparation and travel expenses. The engineering hours for the project are granted pro bono. This suggests that the team is not charging for the time spent on the project, but only for the expenses incurred.

## Future Work

In [22]:
prompt = f"""
I am writing a one-page engineering proposal outline for a research team at Forschungszentrum Jülich
Department of AI and Data Analytics for Integrated Clean Energy Technologies.

The text of the proposal is given in <text>

Answer the following questions about the Future Work section:
- Does the proposed future work make sense?
- Are there any other obvious places for future work?

<text>
{outline}
</text>
"""

In [23]:
completion = gpt4(prompt)

In [24]:
display_md(completion)

The proposed future work does make sense. It is a logical continuation of the project, focusing on defining the complete problem scope, inferring ontological relationships, collaborating with other research institutions, evaluating the effectiveness of the ontology, and exploring machine learning algorithms. These steps are all crucial for the development and implementation of the Practical Automated Ontologization for HEMS Heterogeneous Experimental Data.

In terms of other obvious places for future work, the proposal could consider:

- Developing a standardized training program for researchers to understand and use the new ontology effectively.
- Investigating the potential for integrating the ontology with other existing systems or databases.
- Exploring the potential for commercializing the ontology or any derived technologies.
- Conducting a detailed cost-benefit analysis of implementing the ontology in various research settings.
- Assessing the potential environmental impact of widespread adoption of the ontology and any resulting technologies.

## Jargon and Entities

In [25]:
prompt = f"""
I am writing a one-page engineering proposal outline for a research team at Forschungszentrum Jülich
Department of AI and Data Analytics for Integrated Clean Energy Technologies.

The text of the proposal is given in <text>

Answer the following questions about the Jargon and Entities section:
- Does the jargon make sense?
- Can any of the jargon be simplified?

<text>
{outline}
</text>
"""

In [26]:
completion = gpt4(prompt)

In [27]:
display_md(completion)

The jargon in the proposal does make sense, especially for the intended audience of researchers and engineers in the field of AI and Data Analytics for Integrated Clean Energy Technologies. However, it might be challenging for individuals outside of this field to understand. 

Here are some suggestions to simplify the jargon:

1. Hydrogen Electrolysis Materials Science (HEMS): This term could be simplified to "Hydrogen Fuel Materials" or "Hydrogen Storage Materials". However, this might lose some of the specificity of the original term.

2. HEMS Heterogeneous Experimental Data: This could be simplified to "Varied Hydrogen Fuel Data" or "Diverse Hydrogen Storage Data". Again, this might lose some of the specificity.

3. Proposed EMMO Extension Ontology: This term is quite technical and specific to the field. It could be simplified to "Proposed Data Classification System", but this would lose the specificity of referring to an ontology, which is a specific type of data classification system.

4. Practical Automated Ontologization for HEMS Heterogeneous Experimental Data: This could be simplified to "Automated Classification for Varied Hydrogen Fuel Data". However, this simplification loses the reference to ontologization, which is a specific type of data classification.

5. Neo4j: This is the name of a specific software product and cannot be simplified without losing its meaning.

6. VIMI: Without knowing what VIMI stands for, it's hard to suggest a simplification. However, it's likely that this is also a specific software product or technical term that cannot be simplified without losing its meaning.

In general, while the jargon could be simplified, doing so might lose the specificity and technical accuracy of the terms. Therefore, it might be best to keep the jargon as is, but provide clear and concise definitions for each term in the Jargon and Entities section.