# Preamble

## OpenAI Setup

In [19]:
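# openai.ChatCompletion (used in gpt4 below) only exists in the pre-1.0 SDK, so pin the version
!pip install "openai<1" python-dotenv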

import openai
import os
from dotenv import load_dotenv

# Load .env file
load_dotenv()

# Get API key from .env
openai.api_key = os.getenv("OPENAI_API_KEY")

def gpt4(prompt):
    completion = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0.0,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"{prompt}"}
        ]
    )

    return completion.choices[0].message.content
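
The `openai.ChatCompletion` interface used above was removed in version 1.0 of the `openai` package. For environments where a 1.x release is installed, the following is a minimal sketch of the same request against the client-object interface. The names `gpt4_v1` and `make_echo_client` are hypothetical and not part of the original notebook; the echo client only mimics the response shape so the helper can be tried without an API key.

```python
from types import SimpleNamespace  # only needed for the offline stand-in below


def gpt4_v1(prompt, client=None, model="gpt-4"):
    """Same request as gpt4(), written for the openai>=1.0 client interface."""
    if client is None:
        # Imported lazily so the sketch can be exercised without the SDK installed.
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    )
    return completion.choices[0].message.content


def make_echo_client():
    """Hypothetical stand-in that mimics the response shape, for offline checks."""
    def create(model, temperature, messages):
        reply = SimpleNamespace(content="echo: " + messages[-1]["content"])
        return SimpleNamespace(choices=[SimpleNamespace(message=reply)])

    completions = SimpleNamespace(create=create)
    return SimpleNamespace(chat=SimpleNamespace(completions=completions))


print(gpt4_v1("hello", client=make_echo_client()))  # prints "echo: hello"
```

Calling `gpt4_v1(prompt)` without a `client` argument sends a real request, so `load_dotenv()` should run first.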



### Install libraries

In [2]:
%pip install langchain huggingface_hub python-dotenv chainlit

import os
from dotenv import load_dotenv

%pip install ipywidgets # Update ipywidgets to include IProgress

from langchain import HuggingFaceHub  # Access LLMs from Hugging Face Hub

from langchain import LLMChain, PromptTemplate # For prompt chaining

Note: you may need to restart the kernel to use updated packages.


## Pretty Print Functions

### Markdown

In [3]:
from IPython.display import display, Markdown

def display_md(markdown_string):
    display(Markdown(markdown_string))

# Usage
markdown_content = """
###### This is a heading
####### This is a sub-heading

- This is a list item
- Another list item

**Bold Text**

*Italic Text*
"""

display_md(markdown_content)



###### This is a heading
####### This is a sub-heading

- This is a list item
- Another list item

**Bold Text**

*Italic Text*


### CSV

In [4]:
!pip install pandas



In [5]:
import pandas as pd
from io import StringIO
from IPython.display import display

def display_csv(csv_string):
    try:
        csv_buffer = StringIO(csv_string)
        df = pd.read_csv(csv_buffer)

        # Simply displaying the DataFrame will render it as a table
        display(df)
    except pd.errors.EmptyDataError:
        print("The CSV data is empty.")
    except pd.errors.ParserError:
        print("Could not parse the string as a CSV.")

# Usage
# Replace 'your_csv_string' with the actual CSV string data you want to display as a DataFrame
your_csv_string = """
col1,col2,col3
1,a,4.5
2,b,4.6
3,c,5.1
"""

display_csv(your_csv_string)


Unnamed: 0,col1,col2,col3
0,1,a,4.5
1,2,b,4.6
2,3,c,5.1


### JSON

In [6]:
import json

def display_json(json_string):
    try:
        # Parse the JSON string
        parsed_json = json.loads(json_string)

        # Pretty-print the JSON data with indentation
        formatted_json_string = json.dumps(parsed_json, indent=4)

        print(formatted_json_string)

    except json.JSONDecodeError:
        print("Could not parse the string as JSON.")

# Usage
# Replace 'your_json_string' with the actual JSON string data you want to pretty-print
your_json_string = '''
{
    "name": "Alice",
    "age": 30,
    "is_student": false,
    "courses": ["Math", "Physics"]
}
'''

display_json(your_json_string)


{
    "name": "Alice",
    "age": 30,
    "is_student": false,
    "courses": [
        "Math",
        "Physics"
    ]
}


## Count Tokens for Context Window

In [7]:
!pip install tiktoken



In [8]:
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

count = num_tokens_from_string("tiktoken is great!", "cl100k_base")
print(count)

6
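
The token count is only useful once it is compared against a limit. Below is a minimal sketch of that comparison. The function name `fits_context_window`, the 8,192-token default (the base `gpt-4` window) and the 1,024 tokens reserved for the reply are assumptions, not values from the original notebook. The counter is passed in as a callable so the check can be tried without `tiktoken`; chat requests also carry a few tokens of per-message overhead that this sketch ignores.

```python
def fits_context_window(text, count_tokens, context_window=8192, reserved_for_reply=1024):
    """Return (fits, tokens_left) for a prompt measured against a token budget."""
    budget = context_window - reserved_for_reply  # tokens available for the prompt
    used = count_tokens(text)
    return used <= budget, budget - used


# With the helper defined above (requires tiktoken):
#   fits, left = fits_context_window(
#       prompt, lambda s: num_tokens_from_string(s, "cl100k_base"))

# Offline stand-in counter: one "token" per whitespace-separated word.
def rough_count(s):
    return len(s.split())


print(fits_context_window("tiktoken is great!", rough_count,
                          context_window=10, reserved_for_reply=5))  # (True, 2)
```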


# Copyediting Context

## Judging Criteria
[From the hackathon Notion page.](https://samsungnext.notion.site/Next-Gen-AI-Hackathon-Attendee-Guide-bec2d308f82245bfbb885d1fa093521a)

In [9]:
judging_criteria = """
Judging Criteria

There will be three award categories:

- **Breakthrough Award:** The project that demonstrates outstanding deep core technical or architectural innovations that enable or leverage generative AI for mobile use cases.
- **Edge Award:** The project with greatest impact that leverages only models that have been optimized and deployed locally on the device.
- **Pioneer Award:** The project that most uniquely leverages mobile sensor, data, or services to create novel or more impactful mobile user experiences.

All projects will be evaluated on the following criteria.

- **Impact:** Does this project have the potential to impact millions - even billions - around the world?
- **Creativity:** Does the project solve a problem in a differentiated or unique way?
- **User Experience:** Does the project deliver a delightful user experience?
- **Centrality of Generative AI:** Does generative AI play a critical role in enabling the solution?
- **Technical feasibility:** Does the technical implementation work? If the project is not fully functional, could this approach work technically?
- **Fidelity:** How “finished” is the project? How close is it to an alpha launch?

Participants are expected to share a 3-minute pitch deck including a demo and high-level architectural diagrams. We will have mentors available to help folks fine-tune and prepare for the final presentations.

"""

## Research Proposal

In [10]:
research_proposal = """
Proof of Concept Proposal: Automated Ontologization for Heterogeneous Experimental Data
Toby Tobkin - Solutions Architect - toby@valuestream.ai

Prepared for Forschungszentrum Jülich Department of AI and Data Analytics for Integrated Clean Energy Technologies
Background
A primary barrier to the widespread adoption of clean hydrogen fuel storage is discovering a practical material for Hydrogen Energy Devices. At the current rate of materials science advancement, some experts project that the discovery and adoption of requisite advanced materials such as catalysts and membrane materials could take until 2050. However, climate change and pollution are occurring rapidly, so it is critical to accelerate this R&D timeframe to 2035. The most significant barrier to accelerating this timeframe is the quantity of researcher toil required per cycle of advanced materials science (AES) experiments. The primary root cause of researcher toil is the human-inaccessibility of experimental data from applied energy materials research. This inaccessibility exists because no widely adopted ontology exists for applied energy materials research. Thus, experimental data is stored heterogeneously, effectively siloing it between scientific domains and individual labs. To overcome this, material scientists must manually de-silo from tens of millions of data points about materials and fabrication methods, drastically slowing down applied materials research. Due to intractable issues such as lab researcher turnover, we presume that AES data will remain heterogeneous without Practical Automated Ontologization for AES Heterogeneous Experimental Data (HED).

Forschungszentrum Jülich Department of AI and Data Analytics for Integrated Clean Energy Technologies seeks research advancements to commercialize practical, clean hydrogen fuel storage. ValuestreamAI seeks a challenging, impactful engineering problem to solve.
Programming Competition Venue & Scope
The ValuestreamAI team will compete in the 2023 Samsung Gen AI programming competition in San Francisco.  The team comprises three engineers and one technical business analyst. Approximately one week of engineering preparation work is done prior to the competition. The competition is approximately ten hours of execution time. Thus, the scope must be limited while still shipping a usable product.
Problem Statement
Automate the classification of non-standardized CSV headers to Node Labels compliant with the Proposed EMMO Extension Ontology for Hydrogen Energy Conversion Materials Science. 
  
Problem Selection Rationale
Header Classification to Node Label was chosen because it is a microcosm of the AES HED problem.

The problem was selected using the following process:
1. The Client research team presented several relevant research problems. 
2. Sort by urgency to solve
3. Tie-break by the impact of solving
4. Filter by:
   1. Estimated analysis scope under 70 hours
   2. Estimated implementation scope under 30 hours

Given additional time, we may also build a proof of concept for Node Attribute classification.
________________
KPIs
ValuestreamAI seeks to choose KPIs that are an accurate proxy for implementing Practical Automated Ontologization for AES HED. The following KPIs were chosen:

Precision: Precision of the classifier is paramount because a human user will distrust the inference system if they feel they cannot trust positive inferences.
Recall: Recall determines the percent of researcher toil automated for the given subtask.
Confusion Matrix: A slideshow-friendly version of Precision and Recall.
Runtime Cost Per Million Rows: Determines the ongoing per-unit cost of testing and using the inference pipeline.
R&D Costs: The upfront costs. There should be an attainable cost curve for achieving useful milestones.
Licensing Costs: Costs related to licensing technology.
Suitability for On-prem Cluster: How compatible is the solution with the client research team's compute cluster?
Data Privacy: Is data processed and stored in a secure manner that is verifiable by a layman?
Client Satisfaction: Does the deliverable satisfy its intended purpose?
 
Solution Constraints
The proposed solution must:
* Represent the larger problem of Practical Automated Ontologization for AES HED
* Plausibly scale to reduce a top category of researcher toil by at least 90%
* Be adaptable to support high-quality integration with neo4j and VIMI
________________

Hackathon Proposed Deliverables
The deliverable is a GitHub repo with the following contents:
Jupyter Notebooks: Python notebooks that reproduce the inference experiments for the problem under study.
Development Environment: English instructions and Linux scripts documenting how to reproduce the desktop environment for reproducing or extending the experiments delivered.
Workflows: English instructions and Linux scripts documenting how to reproduce or extend the experiments delivered.
Live Training Recording: Up to three hours of recorded live sessions for the client research team for training and analysis.

The deliverable will have the following attributes: 1. Complementary to existing work, 2. MIT license, 3. High standard of readability

Payment Terms
ValuestreamAI requests payment only if the deliverable meets or exceeds the client research team's expectations. Payment shall be due NET60 from the day of invoice to the client research team. The requested payment amount is the approximate cost of competition preparation and travel expenses, CAD$4000. Engineering hours are granted pro bono for this project.

Future Work
The proposed deliverable is only a small slice of the larger problem of Practical Automated Ontologization for AES HED. The ValuestreamAI team is in the early stages of understanding the broader context of this research field. However, they understand the following future work is required:
* Defining the complete engineering scope of Practical Automated Ontologization for AES HED
* Inferring ontological relationships between heterogeneous data fields
* Investigating the potential for collaboration with other research institutions to pool resources and data
* Evaluating the effectiveness of the ontology in practice and making necessary adjustments based on feedback from researchers
* Exploring the potential for machine learning algorithms to predict outcomes based on the AES data

Jargon and Entities
AES: Advanced Materials Science.
HED: Heterogeneous Experimental Data.
AES HED: The subproblem under research.
Hydrogen Energy Devices: Water electrolyzers and fuel cells.
FAIR: Findable, Accessible, Interoperable, and Reusable
neo4j: an industry-standard graph database
EMMO: EMMO, or the European Materials Modelling Ontology, is a versatile ontology for materials sciences developed by the European Materials Modelling Council. It consists of three levels: the top level defines real-world objects and introduces "perspectives" to reflect their pluralistic nature, the middle level contains specific perspectives that make EMMO applicable to various domains, and the bottom level contains ontologies of specific materials science domains. EMMO is used to standardize the knowledge representation of a domain, making it particularly useful in the context of FAIR (Findable, Accessible, Interoperable, Reusable) data generation.
VIMI:  Virtual Materials Intelligence platform, is a system that integrates data-driven methods for materials science research. It uses the Python framework Django for seamless integration and data management. The platform is designed to handle data from fabrication workflows, measurements, and simulations and encourages researchers to share data to accelerate their research. VIMI is particularly useful in the field of energy materials, where it can assist in the screening of materials, experimental design, the optimization of workflows, and the orchestration of devices within self-driven labs. VIMI is a spin-off Project from the Institute of Energy and Climate Research, IEK-13 at Forschungszentrum Juelich.
"""

# Copyediting Functions

In [11]:
prompt_preamble = """
I am writing a talk track for a three-minute video presentation for a generative AI programming competition
hosted by Samsung in San Francisco.
"""

## hackathon_judging_criteria_addressed()

In [12]:
def hackathon_judging_criteria_addressed(talk_track_md):
    prompt = f"""
    {prompt_preamble} 
    
    1. Rate how well my talk track addresses each of the hackathon judging criteria
    2. Rate how well my talk track suits each of the three winner categories

    The talk track is given in <talk-track>. The judging criteria are provided in <judging-criteria>

    <talk-track>
        {talk_track_md}
    </talk-track>

    <judging-criteria>
        {judging_criteria}
    </judging-criteria>
    """
    completion = gpt4(prompt)
    display_md(completion)


## consistent_with_research_proposal()

In [25]:
def consistent_with_research_proposal(talk_track_md):
    prompt = f"""
    {prompt_preamble}
    
    Is the talk track inconsistent with the provided research proposal in any way?

    The talk track is given in <talk-track>. The research proposal is given in <research-proposal>.

    <talk-track>
        {talk_track_md}
    </talk-track>

    <research-proposal>
        {research_proposal}
    </research-proposal>
    """
    completion = gpt4(prompt)
    display_md(completion)


## technical_fact_check()

In [23]:
def technical_fact_check(talk_track_md):
    prompt = f"""
    {prompt_preamble}
    
    Are any of the claims made in the talk track potentially disputable by an educated observer? 
    Could any of the claims have the appearance of being disputable, even if they are technically accurate?

    The talk track is given in <talk-track>.

    <talk-track>
        {talk_track_md}
    </talk-track>
    """
    completion = gpt4(prompt)
    display_md(completion)


## estimate_talk_time()

In [33]:
def estimate_talk_time(talk_track_md):
    prompt = f"""
    {prompt_preamble}
    
    At 140 wpm, approximately how many seconds will the following talk track take to read?

    The talk track is given in <talk-track>.

    <talk-track>
        {talk_track_md}
    </talk-track>
    """
    completion = gpt4(prompt)
    display_md(completion)
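
Word counting is arithmetic that can also be done locally, which gives a deterministic figure to compare against the model's estimate. The sketch below assumes the same 140 wpm pace as the prompt; the name `estimate_talk_seconds` is hypothetical and not part of the original notebook.

```python
import re


def estimate_talk_seconds(talk_track_md, words_per_minute=140):
    """Estimate reading time in seconds from a whitespace-delimited word count."""
    # Every run of non-whitespace counts as one word, so Markdown markers such
    # as "-" or "##" are counted too; the result is a slight overestimate.
    words = re.findall(r"\S+", talk_track_md)
    return round(len(words) / words_per_minute * 60)


print(estimate_talk_seconds("word " * 420))  # 420 words at 140 wpm -> 180
```

A three-minute slot at this pace holds about 420 words, so any draft that returns more than 180 needs trimming.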


## includes_required_talking_points()

In [34]:
def includes_required_talking_points(talk_track_md):
    prompt = f"""
    {prompt_preamble}
    
    Does the talk track given include all of the talking points in <talking-points>?

    The talk track is given in <talk-track>.

    <talk-track>
        {talk_track_md}
    </talk-track>

    <talking-points>
        - The high level story is this: Toby immigrated to Canada and his goal is to hit $10k in profit building AI software via face to face local sales only
        - We pitched this pilot to the lab last Tuesday, September 12
        - We were offered the opportunity to pitch the research lab because we hosted a local picnic for AI developers.
        - The developers on this solution all volunteered to do this project on under 24 hour notice, paid upon payment receipt from client
        - A key factor in getting the pilot opportunity was our differentiated open source licensing
    </talking-points>
    """
    completion = gpt4(prompt)
    display_md(completion)


## story_told_chronologically()

In [37]:
def story_told_chronologically(talk_track_md):
    prompt = f"""
    {prompt_preamble}
    
    Is the narrative in the talk track given chronologically in a manner that is easy for an educated audience
    to follow?

    The talk track is given in <talk-track>.

    <talk-track>
        {talk_track_md}
    </talk-track>    
    """
    completion = gpt4(prompt)
    display_md(completion)
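
Each draft goes through the same six checks, so a small runner can save re-typing the calls. This is a minimal sketch: `run_all_checks` is a hypothetical helper, not part of the original notebook, and it assumes every check takes the talk track as its only argument, as the functions above do.

```python
def run_all_checks(talk_track_md, checks):
    """Call each check in order; collect failures by name instead of stopping."""
    failures = {}
    for check in checks:
        print(f"--- {check.__name__} ---")
        try:
            check(talk_track_md)
        except Exception as exc:  # e.g. a rate limit or network error from the API
            failures[check.__name__] = exc
    return failures


# Intended use with the functions defined above:
#   run_all_checks(draft, [
#       hackathon_judging_criteria_addressed,
#       consistent_with_research_proposal,
#       technical_fact_check,
#       estimate_talk_time,
#       includes_required_talking_points,
#       story_told_chronologically,
#   ])
```

Collecting exceptions rather than raising means one failed API call does not discard the output of the checks that already ran.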


# Drafting

## Draft 1 - Just Read The Research Proposal

In [21]:
hackathon_judging_criteria_addressed(research_proposal)

Based on the provided talk track and judging criteria, here's how your talk track addresses each of the hackathon judging criteria:

1. **Impact:** Your project addresses a significant global issue - the need for clean hydrogen fuel storage. The potential impact is vast, as it could accelerate the discovery and adoption of advanced materials for Hydrogen Energy Devices. However, the talk track could be more explicit about the potential global impact. Rating: 8/10

2. **Creativity:** The project proposes a unique solution to a complex problem - automating the classification of non-standardized CSV headers to Node Labels. This is a creative approach to addressing the issue of data silos in applied energy materials research. Rating: 9/10

3. **User Experience:** The talk track doesn't explicitly address user experience. However, it does mention that the deliverable will have a high standard of readability, which suggests a focus on user experience. Rating: 6/10

4. **Centrality of Generative AI:** The talk track doesn't explicitly mention the use of generative AI. However, the proposed solution involves automating classification, which could potentially involve generative AI. Rating: 5/10

5. **Technical feasibility:** The talk track provides a detailed explanation of the problem and the proposed solution, suggesting that the approach is technically feasible. However, it doesn't provide a clear demonstration that the project is functional. Rating: 7/10

6. **Fidelity:** The talk track suggests that the project is in the early stages of development, but it provides a clear plan for future work. Rating: 7/10

As for the three winner categories:

1. **Breakthrough Award:** The project proposes a novel solution to a significant problem, suggesting that it could be a contender for the Breakthrough Award. However, the talk track doesn't explicitly mention any deep core technical or architectural innovations. Rating: 7/10

2. **Edge Award:** The talk track doesn't mention whether the models have been optimized and deployed locally on the device, so it's unclear whether the project would be a contender for the Edge Award. Rating: 5/10

3. **Pioneer Award:** The project proposes a unique solution to a complex problem, suggesting that it could be a contender for the Pioneer Award. However, the talk track doesn't explicitly mention the use of mobile sensor, data, or services. Rating: 7/10

Overall, the talk track provides a clear and detailed explanation of the problem and the proposed solution. However, it could be improved by explicitly addressing each of the judging criteria and winner categories.

In [26]:
consistent_with_research_proposal(research_proposal)

The talk track and the research proposal are identical in content. There is no inconsistency between the two. Both documents outline the same problem, solution, and future work, and they both use the same jargon and entities. The talk track and the research proposal both provide the same information about the upcoming Samsung Gen AI programming competition, the team's preparation for the competition, and the problem they will be addressing. Therefore, there is no inconsistency between the talk track and the research proposal.

In [27]:
technical_fact_check(research_proposal)

The talk track is well-structured and provides a comprehensive overview of the proposed project. However, there are a few points that could potentially be disputed by an educated observer or could appear disputable, even if they are technically accurate:

1. The claim that "a primary barrier to the widespread adoption of clean hydrogen fuel storage is discovering a practical material for Hydrogen Energy Devices" could be disputed. While it is true that finding a practical material is a significant challenge, there are other barriers such as the cost of production, infrastructure for distribution and storage, and safety concerns that could also be considered primary barriers.

2. The statement that "the most significant barrier to accelerating this timeframe is the quantity of researcher toil required per cycle of advanced materials science (AES) experiments" could be seen as an oversimplification. While reducing researcher toil is important, there are other significant barriers such as funding, regulatory approval, and technological limitations that could also slow down the rate of advancement.

3. The assertion that "no widely adopted ontology exists for applied energy materials research" could be disputed. While it may be true that there is no single, universally accepted ontology, there may be multiple ontologies in use within different research communities.

4. The claim that "material scientists must manually de-silo from tens of millions of data points about materials and fabrication methods, drastically slowing down applied materials research" could be seen as an exaggeration. While it is true that dealing with heterogeneous data can slow down research, the claim that scientists must manually sift through tens of millions of data points could be seen as overstating the problem.

5. The statement that "the proposed solution must plausibly scale to reduce a top category of researcher toil by at least 90%" could be seen as overly ambitious. While it is good to aim high, a 90% reduction in researcher toil is a very significant claim and could be seen as unrealistic.

6. The claim that "ValuestreamAI requests payment only if the deliverable meets or exceeds the client research team's expectations" could be seen as risky. While this shows confidence in the team's ability to deliver, it could also be seen as a potential financial risk for ValuestreamAI.

7. The statement that "the ValuestreamAI team is in the early stages of understanding the broader context of this research field" could be seen as a potential weakness. While it is honest, it could also raise questions about the team's ability to deliver a solution that meets the needs of the research field.

In [28]:
estimate_talk_time(research_proposal)

The talk track contains approximately 1,400 words. Given a speech rate of 140 words per minute, it would take approximately 10 minutes to read the entire talk track. This is significantly longer than the three-minute time frame for the video presentation. You may need to condense the talk track or focus on key points to fit within the three-minute limit.

In [35]:
includes_required_talking_points(research_proposal)

The talk track provided does not include all the talking points mentioned. Here's how you can incorporate the missing talking points into your talk track:

1. High level story: "Hello, my name is Toby Tobkin, a Solutions Architect at ValuestreamAI. I immigrated to Canada with a goal to hit $10k in profit building AI software via face to face local sales only. Today, I am here to present our proposal for Automated Ontologization for Heterogeneous Experimental Data."

2. Pitching the pilot: "We had the opportunity to pitch this pilot to the Forschungszentrum Jülich Department of AI and Data Analytics for Integrated Clean Energy Technologies last Tuesday, September 12."

3. Opportunity to pitch: "We were offered this opportunity because of our active involvement in the AI community. We recently hosted a local picnic for AI developers, which allowed us to connect with the research lab."

4. Developers volunteering: "Our dedicated team of three engineers and one technical business analyst volunteered to work on this project with less than 24 hours' notice. They are committed to delivering a high-quality product, and their payment is contingent upon receipt of payment from the client."

5. Open source licensing: "A key factor in securing this pilot opportunity was our unique approach to open source licensing. We believe in the power of collaboration and the sharing of knowledge, which is why our deliverable will be under an MIT license, allowing others to learn from and build upon our work."

Remember to weave these points naturally into your presentation to maintain a coherent narrative.

In [38]:
story_told_chronologically(research_proposal)

Yes, the narrative in the talk track is given chronologically and is easy for an educated audience to follow. It starts with the background of the problem, then moves on to the venue and scope of the competition, the problem statement, the rationale behind the problem selection, the KPIs, the solution constraints, the proposed deliverables, payment terms, and future work. It ends with a glossary of jargon and entities. This structure allows the audience to understand the context, the problem, the proposed solution, and the expected outcomes in a logical order.