# Writing project-specific Data Management Plans using chatGPT
In this notebook I will use the [OpenAI API](https://openai.com/blog/openai-api) and chatGPT 4.0 to turn a fictitious project description and a skeleton for a Data Management Plan (DMP) into a project-specific DMP. If you want to rerun this notebook, you need an OpenAI API key, and executing the notebook may cost money.

In [1]:
import openai
from IPython.display import display, Markdown

We define a helper function to send a prompt to chatGPT and retrieve the result.

In [2]:
def prompt(message: str, model: str = "gpt-4") -> str:
    """A prompt helper function that sends a message to OpenAI
    and returns the text of the response.
    """
    # Note: openai.ChatCompletion is the interface of openai<1.0.
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": message}]
    )
    # The reply text is stored in the first choice of the response.
    return response['choices'][0]['message']['content']
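
The helper above relies on `openai.ChatCompletion`, which was removed in version 1.0 of the `openai` Python package. The following is a minimal sketch of an equivalent helper for the newer client interface, assuming `openai>=1.0` is installed and the API key is available in the `OPENAI_API_KEY` environment variable. The names `prompt_v1`, `make_fake_client` and the optional `client` parameter are not part of this notebook; I introduce them only so that the function can be tried with a stand-in object and without network access.

```python
from types import SimpleNamespace


def prompt_v1(message: str, model: str = "gpt-4", client=None) -> str:
    """Send one user message via the openai>=1.0 client and return the reply text.

    `prompt_v1` and `client` are hypothetical names used for illustration.
    """
    if client is None:
        # Imported lazily so the function can be tried without the package.
        from openai import OpenAI
        client = OpenAI()  # reads the key from the OPENAI_API_KEY environment variable
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": message}],
    )
    return response.choices[0].message.content


def make_fake_client(reply: str):
    """Build a stand-in client that always answers with `reply` (offline testing only)."""
    def create(model, messages):
        message = SimpleNamespace(content=reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])

    completions = SimpleNamespace(create=create)
    return SimpleNamespace(chat=SimpleNamespace(completions=completions))


# Offline usage example; no request is sent to OpenAI here.
fake = make_fake_client("- Data description\n- Storage")
print(prompt_v1("List DMP sections.", client=fake))
```

Calling `prompt_v1("...")` without the `client` argument would contact the API and may cost money, just like the original helper.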

## Asking chatGPT about DMPs

In [3]:
result = prompt("""
Give me a short list of typical sections of a Data Management Plan. 
Write bullet points and no detailed explanation.
""")

display(Markdown(result))

1. Introduction / Overview
2. Types of Data to be Collected or Created
3. Data Collection Methods
4. Data Backup and Security 
5. Data Documentation and Metadata 
6. Ethical Issues and Confidentiality
7. Data Storage and Preservation 
8. Data Sharing and Access 
9. Data Reuse and Distribution 
10. Roles and Responsibilities 
11. Data Quality Assurance and Control
12. Budget and Resources 
13. Data Retention and Disposal  
14. Review and Update of the Data Management Plan

In [4]:
result = prompt("""
What is commonly described in a section about "Backup and Archiving" in a 
Data Management Plan? Answer in 3 sentences.
""")

display(Markdown(result))

The "Backup and Archiving" section in a Data Management Plan typically describes the strategies and procedures for regularly backing up data to prevent loss due to unforeseen circumstances, including details on the frequency of backups, backup locations, and data recovery processes. It also outlines the plans for long-term preservation and archiving of data, often specifying the archiving duration, formats to be used, and strategies for data migration to overcome technology obsolescence. Relevant aspects such as security measures, access controls, and responsibility assignations for both backup and archiving processes are also elucidated.

## Our project description
In the following cell you find a description of a fictitious project. It contains all aspects of such a project that came to my mind when I thought of the aspects chatGPT mentioned above. It is structured chronologically, listing things that happen early in the project first and moving towards the publication of a manuscript, code and data.

In [5]:
project_description = """
In our project we investigate the underlying physical principles for Gastrulation 
in Tribolium castaneum embryo development. Therefore, we use light-sheet microscopes
to acquire 3D timelapse imaging data. We store this data in the NGFF file format. 
After acquisition, two scientists, typically a PhD student and a post-doc or 
group leader look into the data together and decide if the dataset will be analyzed 
in detail. In case yes, we upload the data to an Omero-Server, a research data 
management solution specifically developed for microscopy imaging data. Data on 
this server is automatically backed-up by the compute center of our university. We then login 
to the Jupyter Lab server of the institute where we analyze the data. Analysis results
are also stored in the Omero-Server next to the imaging data results belong to. The
Python analysis code we write is stored in the institutional git-server. Also this 
server is backed up by the compute center. When the project advances, we start writing
a manuscript using overleaf, an online service for collaborative manuscript editing 
based on latex files. After every writing session, we save back the changed manuscript 
to the institutional git server. As soon as the manuscript is finished and 
submitted to the bioRxiv, a preprint server in the life-sciences, we also publish the 
project-related code by marking the project on the git-server as public. We also
tag the code with a release version. At the same time we publish the imaging data 
by submitting a copy of the dataset from the Omero-Server to zenodo.org, a 
community-driven repository for research data funded by the European Union. Another 
copy of the data, the code and the manuscript is stored on the institutional archive 
server. This server, maintained by the compute center, guarantees to archive data for 
15 years. Documents and data we publish are licensed under the CC-BY 4.0 license. The code 
we publish is licensed BSD3. The entire project and all steps of the data life-cycle 
are documented in an institutional labnotebook where every user has to pay 10 Euro 
per month. Four people will work on the project. The compute center estimates the 
costs for storage and maintenance of the infrastructure to 20k Euro and half a 
position of an IT specialist. The project duration is four years.
"""

We can then use this project description as part of a prompt to chatGPT to turn this unstructured text into a DMP.

In [6]:
result = prompt(f"""
You are a professional grant proposal writer. In the following comes a description of 
a common project in our "Tribolium Development" Research Group at the University. 
Your task is to reformulate this project description into a Data Management Plan.

{project_description}
""")

display(Markdown(result))

Data Management Plan for Investigation of Gastrulation in Tribolium Castaneum Embryo Development

1. Data Collection: Data is procured through the use of light-sheet microscopes, which produces 3D timelapse imaging data, archived in the NGFF file format. The data review is performed by a minimum of two researchers, following which a decision on detailed data analysis is made.

2. Data Storage: If the data is chosen for further analysis, it is uploaded to an Omero-Server, a platform specifically designed for microscope imaging data management. The platform is backed up by the university's compute center, ensuring the data's security. Analysis results are stored alongside the original data for easy access and reference.

3. Data Processing: Data processing is carried out on the institute's Jupyter Lab server. Python is used to code the analysis, which is then stored on the institutional git-server, offering another level of data backup.

4. Documentation: The team documents every stage of the project in an institutional lab notebook. The manuscript drafting is conducted via Overleaf, an online collaborative tool, with every version being saved back into the institutional git-server.

5. Data Availability: When complete, manuscripts are submitted to bioRxiv, a preprint server in life sciences. The code relative to the project is made public on the git-server and tagged with a release version. Imaging data is submitted to zenodo.org, a EU-funded research data repository. Additionally, a complete copy of project materials is stored on the institutional archive server for a guaranteed period of 15 years.

6. Licensing: All published documents and data fall under the CC-BY 4.0 license, while the published code is licensed BSD3.

7. Personnel: A team of four is assigned to the project. The university's compute center also dedicates half an IT specialist’s role to this project for maintenance and support.

8. Costs: Estimated costs for storage and infrastructure maintenance are roughly 20,000 Euros, with an expected project duration of four years.

9. Compliance: All team members are required to adhere to the guidelines set within this data management plan.


## Combining information and structure
We next modify the prompt to also add information about the structure we need. This structure may differ from funding agency to funding agency, so this step is crucial for customizing the DMP according to the given formal requirements.
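
Because the required headings change between funding agencies, it can help to keep them in a list and render them into the prompt. The following is a minimal sketch of that idea; `REQUIRED_SECTIONS` and `build_dmp_prompt` are hypothetical names that are not used elsewhere in this notebook, and the wording of the generated prompt only approximates the prompt used in this section.

```python
from typing import Sequence

# Headings of the structure used in this section; swap the list for another agency.
REQUIRED_SECTIONS = [
    "Data description",
    "Documentation and data quality",
    "Storage and technical archiving the project",
    "Legal obligations and conditions",
    "Data exchange and long-term data accessibility",
    "Responsibilities and resources",
]


def build_dmp_prompt(project_description: str,
                     sections: Sequence[str],
                     title: str = "Data Management Plan") -> str:
    """Assemble a prompt that asks for a DMP with the given Markdown headings."""
    # One top-level heading followed by one second-level heading per section.
    outline = "\n".join([f"# {title}"] + [f"## {s}" for s in sections])
    return (
        "You are a professional grant proposal writer. "
        "Your task is to reformulate the following project description "
        "into a Data Management Plan.\n\n"
        f"{project_description.strip()}\n\n"
        "The required structure for the data management plan is:\n\n"
        f"{outline}\n\n"
        "Use Markdown for headlines and text style.\n"
    )


# Usage example with a placeholder description.
example = build_dmp_prompt("We image beetle embryos.", REQUIRED_SECTIONS)
print(example)
```

The returned string can be passed to the `prompt` helper in place of a hand-written f-string.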

In [7]:
result = prompt(f"""
You are a professional grant proposal writer. In the following comes a description of 
a common project in our "Tribolium Development" Research Group at the University. 
Your task is to reformulate this project description into a Data Management Plan.

{project_description}

The required structure for the data management plan we need to write is as follows:

# Data Management Plan
## Data description
## Documentation and data quality
## Storage and technical archiving the project
## Legal obligations and conditions 
## Data exchange and long-term data accessibility
## Responsibilities and resources

Use Markdown for headlines and text style.
""")

display(Markdown(result))

# Data Management Plan

## Data Description
Our study aims at understanding the fundamental physical principles informing Gastrulation in Tribolium castaneum embryo development. Through this endeavor, we will generate 3D timelapse imaging data gathered from light-sheet microscopes. The data sets are stored using the NGFF file format. In the process of the project, we will create Python-based code for data analysis, research papers drafted on Overleaf, which incorporates the findings and results of our study.

## Documentation and Data Quality
Once the required data is acquired, it undergoes a scrutiny process where two scientists, generally a post-doc or group leader along with a PhD scholar, determine whether the data merits a detailed analysis. If affirmed, it will be stored on our Omero-Server - an exclusive data management solution created for microscopy imaging data. The analysis results, Python analysis scripts, and manuscript edits are also stored in this server and always updated after each modification.

## Storage and Technical Archiving of the Project
The data on the Omero-Server and the institutional git-server, where we store the Python analysis code and edits to our manuscript, are automatically backed up by our university's compute center. Completed aspects of the project are retained in the institutional archive server for a guaranteed period of 15 years. This archive likewise holds a copy of all published data, code, and manuscripts attached to the project.

## Legal Obligations and Conditions
Our published documents and data are licensed under the CC-BY 4.0 license, while the published code follows the BSD3 license. Individuals working on the project and accessing the institutional lab notebook are required to pay a monthly fee of 10 Euros.

## Data Exchange and Long-term Data Accessibility
To ensure wider accessibility and visibility, we publish our finalized manuscripts to the bioRxiv - a preprint server primarily geared towards the life-sciences. We further open-source our project-related Python code by switching the settings on our git-server to public and tagging the code with a release version. All imaging data will be made accessible by uploading a copy of the data set from the Omero-Server to zenodo.org - a community-driven repository for research data backed by the European Union.

## Responsibilities and Resources
The research group comprises four members who will be actively involved in the project that spans over four years. The compute center estimates the infrastructural costs for both storage and maintenance to be 20k Euros and one-half of an IT specialist's position.
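
The response does not necessarily reproduce the requested headings literally. In the output above, for example, "Storage and technical archiving the project" came back as "Storage and Technical Archiving of the Project". A small check can flag such deviations before the text goes into a proposal. This is a sketch only: `find_missing_sections` is a hypothetical helper, and it assumes the headings are written as Markdown `#` headings and compares them case-insensitively.

```python
import re
from typing import List, Sequence


def find_missing_sections(markdown_text: str, required: Sequence[str]) -> List[str]:
    """Return the required headings that do not occur as Markdown headings.

    The comparison ignores case and surrounding whitespace.
    """
    # Collect the text of every line that starts with one to six '#' characters.
    found = {
        match.group(1).strip().lower()
        for match in re.finditer(
            r"^#{1,6}[ \t]+(.+?)[ \t]*$", markdown_text, flags=re.MULTILINE
        )
    }
    return [s for s in required if s.strip().lower() not in found]


# Usage example with two headings taken from the response above.
required = ["Data description", "Storage and technical archiving the project"]
answer = (
    "# Data Management Plan\n"
    "## Data Description\n"
    "...\n"
    "## Storage and Technical Archiving of the Project\n"
    "...\n"
)
print(find_missing_sections(answer, required))
# → ['Storage and technical archiving the project']
```

An empty list means that every requested heading was found. A non-empty list points to headings that are missing or that were reworded.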