# Preamble

## Library Imports

In [1]:
%%capture
!pip install -r 'lib/requirements.txt'

from lib.openai import gpt4
from lib.display_md import display_md
from lib.llama2 import llama2

# Processing HTML to reduce tokens

Extracted as HTML, the paper is *11293* tokens, which exceeds the GPT-4 context window of 8000 tokens. We need an approach that reduces the token count enough to fit in the GPT-4 context window, or an LLM of equivalent quality with a larger context window.
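
One way to cut the token count is to drop markup that carries no content (wrapper `div`s, `class`/`id`/`data-*` attributes, image-viewer widgets, link tags) while keeping a few structural tags. The sketch below uses only the standard-library `html.parser`. The names `KEEP_TAGS`, `TagStripper` and `compact_html` are illustrative and not part of `lib`, and the choice of tags to keep is an assumption. Character count is used only as a rough proxy; the real saving has to be measured with the model's tokenizer.

```python
from html.parser import HTMLParser

# Tags kept because they carry document structure (an assumption; adjust as needed).
# All other tags are dropped, but the text inside them is preserved.
KEEP_TAGS = {"h2", "h3", "p", "sub", "sup"}


class TagStripper(HTMLParser):
    """Rebuild HTML keeping only whitelisted tags, with all attributes removed."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into plain text;
        # they are not re-escaped, which is fine for LLM input but not for rendering.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in KEEP_TAGS:
            self.parts.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in KEEP_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    # HTML comments such as <!----> are dropped because handle_comment is not overridden.


def compact_html(html: str) -> str:
    """Return a reduced version of `html` with collapsed whitespace."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Small sample in the style of the extracted paper (not the full text).
sample = (
    '<div data-v-58c5b51c="" class="article_bg">'
    '<h2 data-v-58c5b51c="" id="art_Abstract">Abstract</h2>'
    '<p class="">Graphs store data<sup>[<a href="#B1" class="Link_style">1</a>]</sup>.</p>'
    "</div>"
)
compact = compact_html(sample)
print(compact)
print(len(sample), "->", len(compact), "characters")
```

Applied to the full text, this would be called as `compact_html(paper_text)` once `paper_text` is defined. Whether the result fits the context window still needs to be checked by counting tokens.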

In [8]:
paper_text = """
<div data-v-58c5b51c="" class="article_bg"><h2 data-v-58c5b51c="" id="art_Abstract">Abstract</h2><div data-v-58c5b51c="" id="seo_des" class="article_Abstract mag_btn10"><p>The escalating adoption of high-throughput methods in applied materials science dramatically increases the amount of generated data and allows for the deployment and use of sophisticated data-driven methods. To exploit the full potential of these accelerated approaches, the generated data need to be managed, preserved and shared. The heterogeneity of such data calls for highly flexible models to represent the data from fabrication workflows, measurements and simulations. We propose the use of a native graph database to store the data instead of relying on rigid relational data models. To develop a flexible and extendable data model, we create an ontology that serves as the blueprint of the data model. The Python framework Django is used to enable seamless integration into the virtual materials intelligence platform VIMI. The Django framework relies on the Object Graph Mapper neomodel to create a mapping between database classes and Python objects. The model can store the whole bandwidth of the data from fabrication to simulation data. 
Implementing the database into a platform will encourage researchers to share data while profiting from rich and highly curated data to accelerate their research.</p></div><!----><h2 data-v-58c5b51c="" id="art_Keywords">Keywords</h2><div data-v-58c5b51c="" class="article_Abstract"><span data-v-58c5b51c=""><span data-v-58c5b51c="">FAIR</span><i data-v-58c5b51c="">, </i></span><span data-v-58c5b51c=""><span data-v-58c5b51c="">energy materials</span><i data-v-58c5b51c="">, </i></span><span data-v-58c5b51c=""><span data-v-58c5b51c="">fabrication workflow optimization</span><i data-v-58c5b51c="">, </i></span><span data-v-58c5b51c=""><span data-v-58c5b51c="">ontologies</span><i data-v-58c5b51c="">, </i></span><span data-v-58c5b51c=""><span data-v-58c5b51c="">graph databases</span><!----></span></div></div>

<div data-v-58c5b51c="" id="artDivBox" class="art_cont"><div id="sec11" class="article-Section"><h2>INTRODUCTION</h2><p class="">Accelerating the development of clean energy devices is pivotal for the energy transition. A significant proportion of development efforts in this realm is devoted to the complex materials, such as electrocatalysts, multifunctional electrodes and ionic and porous transport media. Integrating these materials into devices necessitates a symbiotic combination of their properties to achieve the target device functionality. The intertwined requirements define a need for energy materials to be comprehensively and thoroughly screened, characterized and fabricated. Consequently, this calls for the development and implementation of a holistic and seamless platform to manage and analyze the rapidly growing datasets along the materials-to-device development workflow<sup>[<a href="#B1" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B1">1</a>]</sup>.</p><p class="">High-throughput methods in materials research have rapidly evolved in recent years and are expected to greatly speed up the rate at which novel materials are developed<sup>[<a href="#B2" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B2">2</a>-<a href="#B5" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B5">5</a>]</sup>. The screening for new energy materials and the optimization of manufacturing processes are already being conducted using high-throughput computation (HTC) and high-throughput experimentation (HTE). HTC is enabled by the steep growth in computing power along with the robust and efficient implementation of physics-based models<sup>[<a href="#B6" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B6">6</a>-<a href="#B7" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B7">7</a>]</sup>. 
HTC paves the way for the automated large-scale screening of materials with the desired combinations of properties<sup>[<a href="#B8" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B8">8</a>-<a href="#B13" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B13">13</a>]</sup>. HTE allows many experiments to be conducted in parallel, thereby enabling fast materials screening<sup>[<a href="#B14" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B14">14</a>,<a href="#B15" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B15">15</a>]</sup>. The further acceleration of scientific research can be achieved by automating fabrication workflows. Material acceleration platforms (MAPs) seek to enable closed-loop development by performing HTE in a fully autonomous fashion<sup>[<a href="#B16" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B16">16</a>,<a href="#B17" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B17">17</a>]</sup>. Most MAPs are deployed to optimize a set of materials or workflow parameters with respect to predefined target properties<sup>[<a href="#B18" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B18">18</a>-<a href="#B21" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B21">21</a>]</sup>. A high degree of autonomy calls for a sophisticated computational backend where data from previous fabrication cycles must be extracted and used to design the next cycle on-the-fly. Bayesian optimization is the most commonly used method in closed-loop experimentation<sup>[<a href="#B18" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B18">18</a>]</sup>. 
These techniques generate a large amount of data along the materials development pipeline, thereby necessitating the need for efficient data management strategies<sup>[<a href="#B22" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B22">22</a>,<a href="#B23" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B23">23</a>]</sup>.</p><p class="">The standard approach to data management is to build rudimentary data infrastructures suited to the needs of a particular project. This method is effective for studies of limited scope, where the collected data can be used for incremental improvements on specific materials classes or target applications. Since the data lack standardization, it becomes, however, challenging to compare them from various sources or to reuse data for other purposes with similar scope. An effective data management system should adhere to the FAIR principles, i.e., the data should be findable, accessible, interoperable and reusable<sup>[<a href="#B24" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B24">24</a>,<a href="#B25" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B25">25</a>]</sup>. Data FAIRification improves the reproducibility of scientific results and makes data accessible to the whole research community. Thus, FAIR data management enlarges the pool of available highly curated data and allows the application of a wider variety of data-driven methods<sup>[<a href="#B26" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B26">26</a>,<a href="#B27" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B27">27</a>]</sup>. 
Standardized formats that originate from following the FAIR principles require less processing to make data machine-readable and AI-ready<sup>[<a href="#B27" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B27">27</a>]</sup>.</p><p class="">In materials research and other fields, database projects have emerged to collect and manage increasing amounts of research data. Data types represented by these projects tend to be specific, with examples including the Materials Project<sup>[<a href="#B28" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B28">28</a>]</sup>, a database containing materials simulation data, and the Cambridge Crystallographic Data Centre (CCDC)<sup>[<a href="#B29" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B29">29</a>]</sup>, which gathers crystallographic data on materials. Enabling the full potential of data-driven approaches for accelerated materials discovery requires databases that include not only simulation data but also suitable fabrication and characterization data.</p><p class="">A suitable database must be flexible to effectively represent heterogenous data that contain insights into fabrication processes, measurements and simulations in materials research. Furthermore, materials, components and devices need to be described on multiple spatial and temporal scales. A database that is capable of storing data with such bandwidth could be the foundation of platforms that can orchestrate and accelerate materials research. These platforms can assist in the screening of materials, experimental design, the optimization of workflows and the orchestration of devices within self-driven labs. 
Our efforts in this regard tie in with the recently developed VIMI platform and offer data management for simulation and fabrication data, providing data-driven analytics, accelerated characterization and computer-aided materials design to its users<sup>[<a href="#B1" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B1">1</a>]</sup>.</p><p class="">In this communication, we present a flexible data management approach for energy materials platforms to accelerate the search for advanced materials by exploiting the full potential of data-driven research. Our approach contains the following:</p><p class="">● An extension of the European Materials Modelling Ontology (EMMO) to create a standardized representation of the energy materials domain for mandating FAIR data generation.</p><p class="">● A new graph data model based on the classes manufacturing, measurement, matter and property, as well as the relations between them. The data model provides an intelligible, flexible and extendable representation of fabrication workflows, measurements and simulation data.</p><p class="">● Data storage within the native graph database neo4j for efficient access to its highly connected content.</p><p class="">● An encapsulation of the database in a Django framework to allow a straightforward integration into VIMI or other platforms.</p><p class="">● A mapping from objects within the database to Python classes <i>via</i> an Object Graph Mapper (OGM).</p></div><div id="sec12" class="article-Section"><h2>METHODOLOGY</h2><div id="sec21" class="article-Section"><h3>Ontology</h3><p class="">Ontologies are a formal representation of knowledge that connect various metadata and make them machine-readable<sup>[<a href="#B30" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B30">30</a>]</sup>. 
As shown in <a href="#fig1" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="fig1">Figure 1</a>, an ontology employs classes, which can have properties stored as key-value pairs. The classes within an ontology are connected <i>via</i> relationships and rules and constraints can be specified. Ontologies are crucial for representing domain knowledge in various scientific fields and provide the basis for data and knowledge exchange among researchers within a specific domain. Ontologies are particularly useful in the context of FAIR data generation since they standardize the knowledge representation of a domain.</p><div class="Figure-block" id="fig1"><div xmlns="http://www.w3.org/1999/xhtml" class="article-figure-image"><a href="5476.fig.1.jpg" class="Article-img" alt="" target="_blank"><img src="https://oaepublishstorage.blob.core.windows.net/0bad8455-ee56-4a22-ad15-4fef7fce06e5/5476.fig.1.jpg" class="" title="" alt="" id="fig1"><span><span class="img_btn" data-img="https://oaepublishstorage.blob.core.windows.net/0bad8455-ee56-4a22-ad15-4fef7fce06e5/5476.fig.1.jpg"></span></span></a></div><div class="article-figure-note"><p class="figure-note"></p><p class="figure-note">Figure 1. A minimalistic example of an ontology. Instances of the Process class are connected to Object instances <i>via</i> the has Participant relationship. The Manufactured class is a child class of the Object class, meaning that all properties and relationships can be inferred from Object to Manufactured. A Manufactured object is composed of other Manufactured parts. This mereological perspective is represented by the hasPart relationship.</p></div></div><p class="">The interdisciplinary nature of materials science renders the standardization of information imperative for communication. 
The European Materials Modelling Council<sup>[<a href="#B31" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B31">31</a>]</sup> implements EMMO<sup>[<a href="#B32" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B32">32</a>]</sup>, which is a versatile ontology for materials sciences. EMMO consists of three levels. The top level defines real-world objects and introduces “perspectives” to reflect their pluralistic nature (e.g., materials can be defined <i>via</i> their composition or function). The middle level contains specific perspectives that make EMMO applicable to various domains. Each of the perspectives represents a different branch that defines objects from a holistic, physicalistic, semiotic or mereotopological perspective.</p><p class="">The different perspectives enable EMMO to represent the fabrication, characterization and simulation of materials up to the device scale on a very general level, as illustrated in <a href="#fig2" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="fig2">Figure 2</a>. The bottom level contains ontologies of specific materials science domains. Other projects that require domain-specific ontologies can extend the bottom level of EMMO, while the higher levels of EMMO offer a ruleset that can be inferred from these extensions. Projects that use EMMO-based ontologies include NOMAD<sup>[<a href="#B33" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B33">33</a>,<a href="#B34" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B34">34</a>]</sup>, CHAMEO<sup>[<a href="#B35" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B35">35</a>]</sup> and BigMap<sup>[<a href="#B36" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B36">36</a>,<a href="#B37" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B37">37</a>]</sup>. 
Building on EMMO to introduce an ontology substantially simplifies its creation since the basics for most application domains are already contained in the EMMO. An EMMO-based ontology can therefore rely on a variety of existing relationships/classes and constraints. Extending the EMMO branches allows properties, relationships and constraints from the parent classes, e.g., Manufactured is a subclass of Object, to be inferred and it therefore also has the hasParticipant relationship to Process classes [<a href="#fig1" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="fig1">Figures 1</a> and <a href="#fig2" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="fig2">2</a>]. Creating an ontology relying on EMMO also improves interoperability with other EMMO-based ontologies, since they have the same structure and are following the same basic rule sets. In many aspects, ontologies are comparable to languages, centered around narrow domains, and like languages, the power of an ontology strongly relies on its level of adoption.</p><div class="Figure-block" id="fig2"><div xmlns="http://www.w3.org/1999/xhtml" class="article-figure-image"><a href="5476.fig.2.jpg" class="Article-img" alt="" target="_blank"><img src="https://oaepublishstorage.blob.core.windows.net/0bad8455-ee56-4a22-ad15-4fef7fce06e5/5476.fig.2.jpg" class="" title="" alt="" id="fig2"><span><span class="img_btn" data-img="https://oaepublishstorage.blob.core.windows.net/0bad8455-ee56-4a22-ad15-4fef7fce06e5/5476.fig.2.jpg"></span></span></a></div><div class="article-figure-note"><p class="figure-note"></p><p class="figure-note">Figure 2. A minimalistic example of the overall EMMO structure. The green shapes on the bottom level represent application domains that can use and extend the respective EMMO classes in the middle level. 
The top level is strongly simplified since it contains further fundamental classes, relationships and constraints that the rest of the EMMO is based on.</p></div></div></div><div id="sec22" class="article-Section"><h3>Database</h3><p class="">At the core of the data infrastructure, a database is required to store fabrication, simulation and measurement data in a FAIR format. Furthermore, the metadata need to be sufficient to ensure the reproducibility of each entry in the database. Relational databases, such as MySQL or Postgres, are the current industry standards<sup>[<a href="#B38" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B38">38</a>,<a href="#B39" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B39">39</a>]</sup>. These databases contain tables, each of them representing a class of real-world objects, e.g., materials or measurements. Each row within a table refers to a specific instance of these real-world objects, e.g., a specific material. Relationships between different objects are represented by joining tables with foreign keys. Relational databases excel when highly structured and sparsely connected data need to be stored. However, the usage of foreign keys makes processing these relationships slow and their table structure makes the data model rigid<sup>[<a href="#B40" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B40">40</a>]</sup>. The differences in performance between relational databases and NoSQL databases have been investigated and benchmarked previously<sup>[<a href="#B41" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B41">41</a>,<a href="#B42" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B42">42</a>]</sup>.</p><p class="">There is also a wide variety of NoSQL databases for the development of a more dynamic data model and these can be divided into document-oriented, key-value and graph databases. 
Document-oriented data storage uses documents in a specific encoding, such as XML, YAML, JSON or BSON. Each document is addressed <i>via</i> a unique key and can be related to other documents by joint keys. The difference to relational databases is the versatile structure of the documents, which allows for a more flexible data model<sup>[<a href="#B43" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B43">43</a>]</sup>. The handling of relationships nevertheless remains inefficient, since documents are still joined <i>via</i> foreign keys, just as in the relational data model<sup>[<a href="#B40" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B40">40</a>]</sup>. Key-value databases store data as a single opaque collection of key-value pairs. This data structure also enables highly flexible data models, but its primitive structure needs to be extended for complex use cases<sup>[<a href="#B44" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B44">44</a>,<a href="#B45" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B45">45</a>]</sup>.</p><p class="">In this work, we chose neo4j, a native graph database, for data storage. Native graph databases use graph theory to represent and store data. Graphs contain nodes that are connected <i>via</i> relationships<sup>[<a href="#B46" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B46">46</a>]</sup>. They can be used to solve mathematical problems or to represent and store data using adjacency lists. Graphs can naturally represent complex domains inside the real world, allowing for data modelling without imposing layers of abstraction between the natural world and the associated data model. Directed graphs also contain valuable information since they can represent asymmetric relationships. 
The data structure of native graph databases allows for a highly efficient traversal of the graph along its relationships, which can be described as pointer hopping or dereferencing pointers. Since the adjacent nodes of each node are stored directly at the node itself, queries that require graph traversal <i>via</i> a path of relationships exhibit an O(1) complexity<sup>[<a href="#B40" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B40">40</a>]</sup>.</p><p class="">In neo4j, nodes and relationships can have properties stored as key-value pairs. Relationships in neo4j are always directed, which allows for the representation of asymmetric relations. These directed relationships provide more context, e.g., a simple workflow can be represented as several Process nodes, connected <i>via </i>followedBy relationships that need to be directed to fully represent the workflow. Relations between nodes of the same class are only unambiguous if they are intrinsically symmetric or if they are represented by directed relationships. Graph databases do not require a data schema since they are naturally additive, making them highly advantageous for storing heterogeneous data.</p><p class="">From a materials research perspective, fabrication workflows underline the heterogeneity of data in materials science research, since fabrication and characterization process data can contain a wide range of parameters, materials and subprocesses. These workflows and measurements can naturally be represented by graphs, since they are mostly sequences of subprocesses that have materials and parameters as inputs and manufactured materials or properties as outputs. A process can also be represented within a table, but it requires its structure to be predefined. If a process changes, its table must be altered or a new table must be created, leading either to massive, sparsely filled tables or many small tables representing variations of the same process. 
Graph databases store workflows as node sequences, one sequence for every stored workflow. Variations of these workflows do not require any changes to existing ones within the database since each workflow is stored as a separate sequence of nodes. Representing the workflows in <a href="#fig3" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="fig3">Figure 3</a> within a graph database would lead to two different sequences, which can both be accessed by queries that ask for workflows, leading from the given precursors to the corresponding product. The database is flexible in that regard since the data model does not predefine how a specific process is structured.</p><div class="Figure-block" id="fig3"><div xmlns="http://www.w3.org/1999/xhtml" class="article-figure-image"><a href="5476.fig.3.jpg" class="Article-img" alt="" target="_blank"><img src="https://oaepublishstorage.blob.core.windows.net/0bad8455-ee56-4a22-ad15-4fef7fce06e5/5476.fig.3.jpg" class="" title="" alt="" id="fig3"><span><span class="img_btn" data-img="https://oaepublishstorage.blob.core.windows.net/0bad8455-ee56-4a22-ad15-4fef7fce06e5/5476.fig.3.jpg"></span></span></a></div><div class="article-figure-note"><p class="figure-note"></p><p class="figure-note">Figure 3. Schematic of fabrication workflows. The first row represents simplistic coating workflows. The workflows differ slightly since the left workflow contains an intermediate stirring and heating step. The fabrication workflows are represented as graphs and tables (bottom).</p></div></div></div></div><div id="sec13" class="article-Section"><h2>SYNERGIZING ONTOLOGIES AND GRAPH DATABASES</h2><div id="sec21" class="article-Section"><h3>Ontology</h3><p class="">Ontologies are naturally represented as graphs since they contain classes represented <i>via</i> relationships. The structure of an ontology makes it easily transferrable to a graph data model. 
The created ontology is an extension of the EMMO, which is already established as a highly sophisticated framework of classes, rules and constraints to represent materials science. The ontology with its central classes is shown in <a href="#fig4" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="fig4">Figure 4</a>.</p><div class="Figure-block" id="fig4"><div xmlns="http://www.w3.org/1999/xhtml" class="article-figure-image"><a href="5476.fig.4.jpg" class="Article-img" alt="" target="_blank"><img src="https://oaepublishstorage.blob.core.windows.net/0bad8455-ee56-4a22-ad15-4fef7fce06e5/5476.fig.4.jpg" class="" title="" alt="" id="fig4"><span><span class="img_btn" data-img="https://oaepublishstorage.blob.core.windows.net/0bad8455-ee56-4a22-ad15-4fef7fce06e5/5476.fig.4.jpg"></span></span></a></div><div class="article-figure-note"><p class="figure-note"></p><p class="figure-note">Figure 4. High-level schematic of proposed EMMO-based data model.</p></div></div><p class="">The classes manufacturing and measurement both share the same parent class, Process. Matter represents all physical objects, from single atoms to manufactured components or devices. Property is a child class of PhysicalQuantity and is yielded by a Measurement process. Meta Data represents a collection of classes to make the scheme more understandable. It contains all classes that contain information stored as metadata to a specific process (e.g., measurement instruments, researchers or institutions, or experimental parameters). The Manufacturing, Measurement and Matter instances can be divided into components. The Matter instances are inputs to the Manufacturing and Measurement instances and outputs of the Manufacturing nodes. Processes can have metadata assigned to them and Measurement can have physical quantities as Property instances as outputs.</p><p class="">The ontology is tailored to the specific domain of energy materials by introducing domain-specific child classes. 
Most properties and relationships of these child classes can be inferred from the parent classes. Only specific properties and constraints need to be added to the class definition.</p><p class="">Currently, the EMMO extension focuses on fuel cell fabrication and characterization, but it can easily be extended to other technology domains. Classes of components, materials and fabrication procedures specific to the fuel cell domain were extracted from several well-cited fuel cell reviews. Therefore, the added taxonomy follows a widely accepted classification.</p></div><div id="sec22" class="article-Section"><h3>Data model</h3><p class="">The original data model developed in this work is inspired by the open provenance model in which every simulation, measurement or fabrication process acts as a function<sup>[<a href="#B47" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B47">47</a>]</sup>. It takes an input, e.g., a dataset or a selection of materials, to a function, e.g., a simulation or a fabrication step, and yields an output, e.g., a manufactured material or a measurement result. Due to the directed nature of the neo4j graph database, each process can naturally be represented by a tuple comprising an input node, a process node and an output node.</p><p class="">The overall data model follows the flowchart in <a href="#fig4" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="fig4">Figure 4</a>, thereby making it highly flexible and adaptable. To improve the findability of the data, we imported specific parts of the ontology into the database, namely, the classes Process, Matter and Quantity, including all their subclasses. These ontology branches represent the abstract concept of the real-world objects we aim to store in our database. The database therefore contains an ontology domain of the abstract concepts and a real-world domain that contains the actual data. 
Each node of the real-world domain is a specific instance of an ontology class and can be linked to a corresponding node of the ontology domain. The ontology nodes are used as labels for the nodes in the real-world domain. The connectivity of the ontology nodes creates not only a single label but also alternative labels. A real-world node that represents H<sub>2</sub>O as a solvent would be connected to the ontology node called PolarSolvent, which is a subclass of Solvent. Therefore, H<sub>2</sub>O is labeled with PolarSovent and Solvent can be retrieved as an alternative label [<a href="#fig5" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="fig5">Figure 5</a>]. These alternative labels greatly improve the findability of the data and the ontology domain can be easily extended to maintain the flexibility of the database.</p><div class="Figure-block" id="fig5"><div xmlns="http://www.w3.org/1999/xhtml" class="article-figure-image"><a href="5476.fig.5.jpg" class="Article-img" alt="" target="_blank"><img src="https://oaepublishstorage.blob.core.windows.net/0bad8455-ee56-4a22-ad15-4fef7fce06e5/5476.fig.5.jpg" class="" title="" alt="" id="fig5"><span><span class="img_btn" data-img="https://oaepublishstorage.blob.core.windows.net/0bad8455-ee56-4a22-ad15-4fef7fce06e5/5476.fig.5.jpg"></span></span></a></div><div class="article-figure-note"><p class="figure-note"></p><p class="figure-note">Figure 5. Screenshot of the imported Matter ontology within the database (left). Flowcharts of the ontology and real-world domain within the graph database (right) with the mapping between these domains via the isA relationship. Inheritance in the ontology domain is represented <i>via</i> the EMMO_isA relationship. 
The real-world data represented here shows H<sub>2</sub>O and platinum on carbon catalyst (PtC) processed by an arbitrary Manufacturing step.</p></div></div><p class="">Furthermore, for the holistic perspective on objects, their mereological description is crucial since materials, components and processes span multiple spatial and temporal scales. Device fabrication is a process that contains subprocesses and each of these can be split into a sequence of subprocesses [<a href="#fig6" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="fig6">Figure 6</a>]. The mereological representation of processes and matter allows for the representation of these objects down to an arbitrary degree of precision. This method of fractioning creates tree-structured graphs. Trees are specific graphs in which two nodes are always connected by exactly one path, making it a connected acyclic graph. Each node within a tree can have an arbitrary number of child nodes but must have only one parent node, except for the root node, which has no parent node. Each node can be treated as the root node of its subtree, thereby allowing recursion to traverse a tree.</p><div class="Figure-block" id="fig6"><div xmlns="http://www.w3.org/1999/xhtml" class="article-figure-image"><a href="5476.fig.6.jpg" class="Article-img" alt="" target="_blank"><img src="https://oaepublishstorage.blob.core.windows.net/0bad8455-ee56-4a22-ad15-4fef7fce06e5/5476.fig.6.jpg" class="" title="" alt="" id="fig6"><span><span class="img_btn" data-img="https://oaepublishstorage.blob.core.windows.net/0bad8455-ee56-4a22-ad15-4fef7fce06e5/5476.fig.6.jpg"></span></span></a></div><div class="article-figure-note"><p class="figure-note"></p><p class="figure-note">Figure 6. 
Schematic of the mereotopological structure of a process and its subprocesses (orange nodes) and a device, its components and their materials (purple, green and blue nodes, respectively).</p></div></div><p class="">The data model must represent experimental setups to enforce the reproducibility of the measurements and fabrication workflows. This ability will be crucial when the database is employed as part of a data infrastructure with an interface to automated labs. The scientific setup can be represented as a subgraph of connected nodes representing specific instruments/devices. Steps within fabrication workflows can then be mapped to the corresponding devices of the experimental setup, thereby allowing for precise process representations and enabling specific workflow optimization and troubleshooting. In the field of automation, challenging tasks include the orchestration of different devices and the exact positions of samples and other moving or moveable parts of the setup at different times. This leads to the necessity to represent these workflows in extremely high granularity. The supplementary data generated by publications regarding automated labs show the level of detail that is needed to achieve automation<sup>[<a href="#B21" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B21">21</a>,<a href="#B48" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B48">48</a>,<a href="#B49" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B49">49</a>]</sup>. Another important aspect of data management in automated labs is tracking the process itself in real-time<sup>[<a href="#B16" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B16">16</a>]</sup>. Tracking and high-resolution representation could be intuitively accomplished using the proposed graph data model. 
Although a relational database could also satisfy these needs for a specific automated lab, it would be very complicated to use the same data model for a different setup.</p><p class="">The described data model is intuitive, meaning that the data model and the real-world domain are not separated by layers of abstraction that would add unnecessary complexity to the data model. It uses the middle-level classes and relationships of the EMMO ontology. Furthermore, using trees as data structures allows for the implementation of highly efficient queries along the hierarchical breakdown of objects into their components. Its performance concerning possible user requests must also be evaluated by a data model. The data model allows for intuitive querying for processing sequences or device compositions, as well as requests of fabrication workflows as sequences of input-function-output tuples.</p></div><div id="sec23" class="article-Section"><h3>Object graph mapper</h3><p class="">Researchers in materials science, as users of the database, should be able to upload and access data from fabrications, measurements and simulations. The data need to be retrievable in different formats, including in the form of Python objects. The usage of layers that connect the database with the platform it is embedded in, are a common practise. Platforms that rely on relational databases use object-relational mappers to map their database object and datatypes to a given programming language, e.g., the AiiDA platform for computational materials science employs an object-relational mapper to map from their SQL database<sup>[<a href="#B50" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B50">50</a>]</sup>.</p><p class="">We utilize the OGM neomodel to facilitate mapping between the objects and data types within a graph database and the Python programming language<sup>[<a href="#B51" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B51">51</a>]</sup>. 
The neomodel library is also available as a Django plugin, as a robust Python framework to reduce the complexity of web application creation, making the OGM an ideal key component for an upcoming API implementation due to its Python compatible interface. The OGM allows for the introduction of domain-specific Python classes (e.g., Manufacturing Process). These classes contain properties and member functions, reflecting how they are defined within the ontology. The Python inheritance rules can create a hierarchy of classes. Neomodel enables mapping of the Python classes, including their properties and data types, to the classes within the database [<a href="#fig7" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="fig7">Figure 7</a>].</p><div class="Figure-block" id="fig7"><div xmlns="http://www.w3.org/1999/xhtml" class="article-figure-image"><a href="5476.fig.7.jpg" class="Article-img" alt="" target="_blank"><img src="https://oaepublishstorage.blob.core.windows.net/0bad8455-ee56-4a22-ad15-4fef7fce06e5/5476.fig.7.jpg" class="" title="" alt="" id="fig7"><span><span class="img_btn" data-img="https://oaepublishstorage.blob.core.windows.net/0bad8455-ee56-4a22-ad15-4fef7fce06e5/5476.fig.7.jpg"></span></span></a></div><div class="article-figure-note"><p class="figure-note"></p><p class="figure-note">Figure 7. Nodes and relationships within the database (A) can be mapped to Python classes (B).</p></div></div><p class="">Neomodel can also be used to generate, modify and query in a high-level interface that is agnostic to database architectural details. Using neomodel enables query formulation following the Python syntax, thereby offering a Python-based interface to the database that makes interacting with it more intuitive due to the broad acceptance of the Python programming language in the scientific community.</p><p class="">Neomodel allows for the creation, deletion and update of the nodes and relationships within the database. 
For example, it might therefore be used to update the single information of single nodes. Nevertheless, the OGM lacks efficiency for large-scale database operations. Each node/relationship that is added leads to separate requests to the database, resulting in high costs for the ingestion of large datasets. Thus, for large chunks of data, neomodel allows for custom-made cypher queries to conduct complex operations in an efficient manner. To ingest a fabrication workflow, for example, requires a predefined cypher query since such a workflow contains multiple nodes with a high degree of connectivity. Using an OGM has the advantage that complex database operations can be wrapped into Python functions that have well defined input and output interfaces. Neomodel creates an additional abstraction layer between database operations and the data management system itself. Cypher queries for complex database operations are thoroughly tested, wrapped into a Python function and are ready to be used in a Python environment, thereby hiding their underlying complexity.</p></div><div id="sec24" class="article-Section"><h3>Database</h3><p class="">The proposed data model is centered around the creation of node sequences. Labels are introduced to create sets of nodes that improve the structure within the database. These labels correspond to the Python classes created using neomodel and the classes defined within the ontology. Nodes can have multiple labels, e.g., a fuel cell node might have the labels Matter, Manufactured, Device and Electrochemical Device. The choice of indexing nodes of a specific label mainly depends on the nature of queries containing that label. Each label represents an entity containing attributes as key-value pairs necessary to identify that entity (e.g., unique identifiers or the SMILES representation of a molecule). Nodes that share one label can be indexed, thereby accelerating the node retrieval of a particular label. 
However, indexing has the drawback of slowing down the write efficiency and requires more storage capacity.</p><p class="">The property space of materials, components and devices and the parameter space of processes in materials science is high dimensional and constantly expanding. These physical quantities are stored within external Property and Parameter nodes to allow for the representation of all parameters and properties. The externalization of these nodes also allows for the efficient querying of nodes that share the same or similar properties and fabrication workflows that share the same or similar parameters. The database stores measurement results, fabrication parameters or properties in the form of scalars or a scalar array.</p><p class="">The highly efficient traversal of relationships is made possible by the data structure of neo4j, which has a complexity of O(1). To further increase query efficiencies, neo4j physically stores adjacent nodes close to each other and tailors the database architecture specifically to frequently used queries. The caching during queries also improves the efficiency of reading, writing and matching commands.</p><p class="">Large binary data types, such as image data from imaging techniques, will be stored within an external file server and are only referenced as metadata for properties derived from the images (e.g., size distribution). Referencing of the image files is carried out <i>via</i> UML links. Externalizing large data types, such as images and videos, is a good practice for data modelling. It keeps the image files connected to their corresponding data while queries do not have to handle clunky image data chunks, thereby improving the overall performance of the database. 
Furthermore, keeping binary data chunks within the database does not yield benefits since the database itself cannot query, index or compare binary data.</p><p class="">The data management system will be implemented into the VIMI platform<sup>[<a href="#B1" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="B1">1</a>]</sup>, so that researchers and users from industry can upload their data <i>via</i> well-defined interfaces, e.g., the dragging and dropping of CSV files. Furthermore, cooperations with automated labs are in place for which custom tailored APIs will be created to streamline their generated data directly to the database.</p></div><div id="sec25" class="article-Section"><h3>Data representation</h3><p class="">To test the data model, a batch of data for the fabrication and characterization of fuel cells was stored. The batch spans data from the materials to the device including the data from various measurements on different length scales. The heterogeneity of the data and its high dimensionality makes them ideal for testing the proposed data model.</p><p class="">The data were ingested into the database <i>via</i> CSV files and represented following the proposed data model. <a href="#fig8" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="fig8">Figure 8</a> shows the fabrication data and how they are stored in the graph database. <a href="#fig8" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="fig8">Figure 8A</a> shows a single fabrication workflow from the materials to the fuel cell device, and <a href="#fig8" class="Link_style" data-jats-ref-type="bibr" data-jats-rid="fig8">Figure 8B</a> presents 25 fabrication workflows within the database. 
The screenshots were taken from the neo4j browser interface, which offers visual representations of the stored data.</p><div class="Figure-block" id="fig8"><div xmlns="http://www.w3.org/1999/xhtml" class="article-figure-image"><a href="5476.fig.8.jpg" class="Article-img" alt="" target="_blank"><img src="https://oaepublishstorage.blob.core.windows.net/0bad8455-ee56-4a22-ad15-4fef7fce06e5/5476.fig.8.jpg" class="" title="" alt="" id="fig8"><span><span class="img_btn" data-img="https://oaepublishstorage.blob.core.windows.net/0bad8455-ee56-4a22-ad15-4fef7fce06e5/5476.fig.8.jpg"></span></span></a></div><div class="article-figure-note"><p class="figure-note"></p><p class="figure-note">Figure 8. Representation of a single fabrication workflow from the starting materials to the fuel cell device (A) and 25 fabrication workflows and how they are linked to the ontology domain (yellow) within the database (B).</p></div></div><p class="">Ingesting sample data shows that the proposed data model is indeed able to represent complex fabrication workflows with a degree of detail. Even though the data model can represent these workflows, it is challenging to make them intuitively retrievable. The user of the data management system should be able to retrieve all variations of fabrication procedures for specific devices or components with a single command. This requires sophisticated queries scanning the database in an efficient manner to find the requested node sequences. The wider the bandwidth of the fabrication data stored within the database, the more challenging it will be to retrieve the requested data. These queries will be wrapped in Python functions and implemented into APIs to make the data accessible. 
We are currently cooperating with experimentalists from different parts of the energy materials domain to enrich our database, test our data model and especially improve and challenge our queries.</p></div></div><div id="sec14" class="article-Section"><h2>CONCLUSIONS</h2><p class="">We propose a new data model as the basis of a highly adaptable data infrastructure for the fabrication, measurements and simulations of energy materials. The data model is designed to represent workflows and processes at an arbitrary level of complexity. It can be modified to incorporate new materials, components or processes. The hydrogen technology domain, with an emphasis on fabrication and characterization, is represented in the data model by introducing an EMMO-based ontology. The data are stored in the native graph database neo4j and its structure allows for the efficient traversal of fabrication processes. To further increase efficiency, tree data structures can be used to represent the fabrication workflows in their subprocesses and the dissection of devices, components or materials into their constituent parts.</p><p class="">A use case for the proposed data management system is automated labs since they require automated data management. Current automated labs usually create data management systems tailored to their specific labs, thus, each automated lab requires a new data management system and a new data model. This creates additional overhead when these labs are set up, and it leads to small unconnected data lakes that lack standardization. The proposed data model is an answer to the growing number of automated labs and their need for data management since it can represent given workflow in an arbitrary level of detail.</p><p class="">For more advanced approaches to data-driven workflow optimization, it is essential to include data FAIRification in experimental workflows. 
In particular, the data generated in fabrication processes lack FAIR features due to their heterogeneous nature and the absent workflows for standardization. Storing data that are both FAIR and suitable for AI-driven models is possible by abandoning the relational data model and transit to the flexible, graph-based data model. Access to FAIR experimental data further pave the way for data-driven techniques.</p><p class="">The next phases of this project involve the integration of graph databases into the VIMI platform. This will make the database accessible to other users and the data infrastructure will be used to streamline data into the generation of training datasets and the creation of machine learning models. Since the data model is based on the concepts of the EMMO, it can represent characterization and fabrication of other domains in materials science. Its flexibility also allows for applications in other related research domains, such as batteries or solar cells.</p></div><div class="article-Section article-declarations" id="declarations"><h2>DECLARATIONS</h2><span>Acknowledgements</span><p>The authors gratefully acknowledge the cooperation of Prof. Jasna Jankovic (UCONN, USA) and Fabian Tipp (Forschungszentrum Jülich) for providing fabrication and simulation data. The authors also gratefully acknowledge the Gauss Centre for Supercomputing e.V. 
(<a target="_blank" href="http://www.gauss-centre.eu" xmlns:xlink="http://www.w3.org/1999/xlink">http://www.gauss-centre.eu</a>) for funding this project by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS at Jülich Supercomputing Centre (JSC).</p><span>Authors’ contributions</span><p>Performed the research and drafted the manuscript: Dreger M</p><p>Revised and finalized the manuscript: Dreger M, Eslamibidgoli MJ, Eikerling MH, Malek K</p><span>Availability of data and materials</span><p>The EMMO-based ontologies as well as the project itself are available at <a target="_blank" href="https://github.com/MaxDreger92/MatGraphAI" xmlns:xlink="http://www.w3.org/1999/xlink">https://github.com/MaxDreger92/MatGraphAI</a>.</p><span>Financial support and sponsoring</span><p>The authors acknowledge the financial support from the Federal Ministry of Science and Education (BMBF) under the German-Canadian Materials Acceleration Centre (GC-MAC) grant number 01DM21001A and the financial support from the HiTEC graduate school for doctoral candidates at the Forschungszentrum Jülich.</p><span>Conflict of interest</span><p>All authors declare that there are no conflicts of interest.</p><span>Ethical approval and consent to participate</span><p>Not applicable.</p><span>Consent for publication</span><p>Not applicable.</p><span>Copyright</span><p>© The Author(s) 2023.</p></div></div>
"""

In [9]:
html_tokens = num_tokens_from_string(paper_text,"cl100k_base")
print(html_tokens)

11293
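
The helper `num_tokens_from_string` is called above but is not defined in the cells shown here. A minimal sketch of what it might look like follows, assuming it uses the `tiktoken` package in the usual way. The optional `get_encoding` parameter is a hypothetical addition for this sketch, so the function can be exercised without `tiktoken` installed.

```python
from typing import Callable, Optional


def num_tokens_from_string(
    string: str,
    encoding_name: str,
    get_encoding: Optional[Callable] = None,  # hypothetical hook, not in the notebook
) -> int:
    """Return the number of tokens in `string` under the named encoding."""
    if get_encoding is None:
        # Imported lazily so this cell still loads where tiktoken is absent.
        import tiktoken

        get_encoding = tiktoken.get_encoding
    encoding = get_encoding(encoding_name)
    # encode() returns one integer id per token, so the length is the token count.
    return len(encoding.encode(string))
```

With `tiktoken` installed, `num_tokens_from_string(paper_text, "cl100k_base")` matches the call used in the cell above.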


## Approach: Converting HTML to Markdown

If possible, stylistic information such as bullet points should be maintained, since it provides additional context. Markdown preserves this stylistic information in a far more concise format than HTML tags.

### Using GPT to generate code to turn HTML into Markdown

In [10]:
prompt = f"""
Write a python script using html2text that accepts valid HTML and converts it to markdown, ignoring links and images
"""

In [11]:
completion = get_completion(prompt)
display_md(completion)

Sure, here's a simple Python script that uses the `html2text` library to convert HTML to Markdown. This script also ignores links and images.

```python
import html2text

def convert_html_to_markdown(html):
    # Create an html2text.HTML2Text object
    h = html2text.HTML2Text()

    # Ignore converting links
    h.ignore_links = True

    # Ignore converting images
    h.ignore_images = True

    # Convert the HTML to Markdown
    markdown = h.handle(html)

    return markdown

# Test the function
html = "<h1>Hello, World!</h1><p>This is a <a href='https://example.com'>link</a>.</p><img src='image.jpg' alt='An image'>"
markdown = convert_html_to_markdown(html)
print(markdown)
```

This script will convert the HTML to Markdown, but it will ignore any links and images in the HTML. The `html2text.HTML2Text().handle()` method does the actual conversion. The `ignore_links` and `ignore_images` attributes of the `HTML2Text` object are set to `True` to ignore links and images.

### Measuring the token count of the HTML converted to Markdown

In [12]:
!pip install html2text

Collecting html2text
  Downloading html2text-2020.1.16-py3-none-any.whl (32 kB)
Installing collected packages: html2text
Successfully installed html2text-2020.1.16


In [13]:
import html2text

def convert_html_to_markdown(html):
    # Create an html2text.HTML2Text object
    h = html2text.HTML2Text()

    # Ignore converting links
    h.ignore_links = True

    # Ignore converting images
    h.ignore_images = True

    # Convert the HTML to markdown
    markdown = h.handle(html)

    return markdown

# Test the function
html = "<h1>Hello, World!</h1><p>This is a <a href='https://example.com'>link</a>.</p><img src='image.jpg' alt='image'>"
markdown = convert_html_to_markdown(html)
print(markdown)

# Hello, World!

This is a link.




In [14]:
paper_text_markdown = convert_html_to_markdown(paper_text)
stripped_markdown_tokens = num_tokens_from_string(paper_text_markdown,"cl100k_base")
print(stripped_markdown_tokens)

7065


This approach reduced the token count by roughly 37% (from 11293 to 7065), allowing the paper to fit in the GPT-4 context window of 8k tokens without losing any body text or stylistic information; only link targets and images are dropped.
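
The arithmetic behind that figure can be checked with a short sketch. It uses the two token counts printed above and the 8000-token window stated earlier; the function name `token_budget` is made up for illustration.

```python
def token_budget(html_tokens: int, markdown_tokens: int, context_window: int) -> dict:
    """Summarize the token savings and the room left in the context window."""
    saved = html_tokens - markdown_tokens
    return {
        "saved": saved,
        "reduction": saved / html_tokens,  # fraction of the HTML tokens removed
        "headroom": context_window - markdown_tokens,  # left for question and answer
    }


budget = token_budget(html_tokens=11293, markdown_tokens=7065, context_window=8000)
print(f"{budget['reduction']:.1%} fewer tokens, {budget['headroom']} tokens of headroom")
# prints "37.4% fewer tokens, 935 tokens of headroom"
```

The headroom matters because the context window covers the prompt and the completion together, so the question and the model's answer have to share the remaining 935 tokens.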

# Chat With Paper

## What is FAIR?

In [15]:
prompt = f"""
An academic paper is provided below in Markdown.

What is FAIR?

{paper_text_markdown}
"""

completion = get_completion(prompt)

display_md(completion)

The academic paper discusses the concept of FAIR (Findable, Accessible, Interoperable, and Reusable) in the context of data management in materials science. The authors propose a new data model based on a native graph database to store and manage data generated from high-throughput methods in applied materials science. The data model is designed to represent workflows and processes at an arbitrary level of complexity and can be modified to incorporate new materials, components, or processes. The authors also discuss the integration of this data model into the VIMI platform, making the database accessible to other users. The paper concludes by acknowledging the financial support from various institutions and declaring no conflicts of interest.

## What is VIMI?

In [16]:
prompt = f"""
An academic paper is provided below in Markdown.

What is VIMI?

{paper_text_markdown}
"""

completion = get_completion(prompt)

display_md(completion)

VIMI, as mentioned in the academic paper, stands for Virtual Materials Intelligence platform. It is a platform that integrates data-driven methods for materials science research. The paper discusses the use of a native graph database to store data and the Python framework Django for seamless integration into the VIMI platform. The platform is designed to manage, preserve, and share data generated from high-throughput methods in applied materials science.

## What is EMMO?

In [None]:
prompt = f"""
An academic paper is provided below in Markdown.

What is EMMO?

{paper_text_markdown}
"""

completion = get_completion(prompt)

display_md(completion)

EMMO, or the European Materials Modelling Ontology, is a versatile ontology for materials sciences developed by the European Materials Modelling Council. It consists of three levels: the top level defines real-world objects and introduces "perspectives" to reflect their pluralistic nature, the middle level contains specific perspectives that make EMMO applicable to various domains, and the bottom level contains ontologies of specific materials science domains. Other projects that require domain-specific ontologies can extend the bottom level of EMMO, while the higher levels of EMMO offer a ruleset that can be inferred from these extensions.
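
The three question cells above repeat the same prompt with only the question changed. A small helper could capture that pattern. This is a sketch only: `build_paper_prompt` is a hypothetical name, and the template text is copied from the prompts used in this section.

```python
def build_paper_prompt(question: str, paper_markdown: str) -> str:
    """Build the question prompt used in this section for a given paper."""
    return (
        "\nAn academic paper is provided below in Markdown.\n\n"
        f"{question}\n\n"
        f"{paper_markdown}\n"
    )


# Example with a stand-in paper body; in the notebook this would be paper_text_markdown.
example_prompt = build_paper_prompt("What is EMMO?", "# Abstract\n\nExample body.")
print(example_prompt)
```

The result can be passed to `get_completion` just like the hand-written prompts, which keeps the wording identical across questions.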

## Summarize Paper

In [None]:
prompt = f"""
Summarize this academic paper in one sentence. It is provided in Markdown format below:

{paper_text_markdown}
"""

In [None]:
completion = get_completion(prompt)

In [None]:
display_md(completion)

## Relative Merits of Graph DBs

In [None]:
prompt = f"""
According to the following academic paper, what are the relative
merits of using a graph database in the materials sciences domain?

{paper_text_markdown}
"""

In [None]:
completion = get_completion(prompt)

In [None]:
display_md(completion)

The academic paper discusses the merits of using a graph database in the materials sciences domain. The authors propose a new data model based on a native graph database to store data generated from high-throughput methods in applied materials science. The benefits of this approach include:

1. Flexibility: The graph database allows for a flexible and extendable data model, which is crucial for managing the heterogeneity of data in materials science. It can represent workflows and processes at an arbitrary level of complexity and can be modified to incorporate new materials, components, or processes.

2. Efficiency: The structure of the graph database allows for efficient traversal of fabrication processes. It uses tree data structures to represent fabrication workflows in their subprocesses and the dissection of devices, components, or materials into their constituent parts.

3. FAIR Principles: The data model adheres to the FAIR principles (Findable, Accessible, Interoperable, and Reusable), which improves the reproducibility of scientific results and makes data accessible to the whole research community.

4. Integration: The Python framework Django is used to enable seamless integration into the virtual materials intelligence platform VIMI. 

5. Data Representation: The data model can represent complex fabrication workflows with a degree of detail, making it ideal for automated labs that require automated data management.

6. Acceleration of Research: Implementing the database into a platform will encourage researchers to share data while profiting from rich and highly curated data to accelerate their research. 

In conclusion, the use of a graph database in the materials sciences domain offers a flexible, efficient, and FAIR approach to data management, which can accelerate research and discovery in the field.