<a href="https://colab.research.google.com/github/jchburmester/tree-decisions/blob/main/TD_data_wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Wrangling for a Modeling Research Project

This Python notebook hosts a prototype modeling framework for tree growth and carbon deposition in wood, with a focus on tree-ring increments and small-scale processes.
- An updated version of this notebook will be stored in the GitHub repository: https://github.com/jchburmester/tree-decisions
- Later on, the actual data used in this project will be stored in Zenodo: https://zenodo.org/me/uploads?q=&f=shared_with_me%3Afalse&l=list&p=1&s=10&sort=newest. The datasets will be accessed directly from this notebook.
- The notebook first loads, preprocesses, and aligns the data, and then analyses it. Based on the analysis results, a prototype sub-model for vegetation will be developed and, ideally, integrated into LPJ-GUESS.

##### **FAIR** Data Principles
This notebook adheres to the FAIR principles by ensuring all data and materials are findable through clear metadata and persistent identifiers, and accessible via Zenodo. Interoperability is supported by using widely accepted data formats and libraries such as CSV, NumPy, and pandas. Reusability is promoted through comprehensive documentation within this notebook and the accompanying README file on GitHub, which also includes licensing information. Additionally, the notebook and related resources will be openly shared and updated on the public GitHub repository and Zenodo, facilitating transparency and reproducibility.

##### Imports
So far, only standard packages are used. Version control and dependency pinning will be decided once the methods for the later analysis and model development, such as Machine Learning (ML) techniques, are fixed. Any updates will be documented and pushed to an environment file within the GitHub repository. Specific versions can be installed using `!pip install <package>==<version>`.

In [3]:
# Packages to connect to Zenodo and load/request data.
import requests
import json
import os

# To analyse and store data.
import pandas as pd
import numpy as np

# To store private access token for Zenodo.
from google.colab import userdata

# For direct access to Zenodo.
zenodo_token = userdata.get('ZENODO_TOKEN')
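
Until an environment file exists, the versions active in the current session can be recorded directly. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the project code.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string}, or None for packages that are not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages imported in this notebook so far.
for name, ver in installed_versions(["requests", "pandas", "numpy"]).items():
    # One "name==version" line per package, ready to paste into a requirements file.
    print(f"{name}=={ver}" if ver else f"# {name} is not installed")
```

The printed lines can be copied into a requirements file once the package set has settled.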

##### To test direct access to Zenodo

In [None]:
# Specifying and requesting access.
headers = {"Authorization": f"Bearer {zenodo_token}"}
deposition = requests.post(
    "https://zenodo.org/api/deposit/depositions",
    json={},
    headers=headers
)
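print("Create deposition status:", deposition.status_code)
# Stop here on failure (e.g. an invalid token) instead of hitting a KeyError below.
deposition.raise_for_status()
dep_json = deposition.json()
print(dep_json)

# The deposition id and the file bucket URL are needed for the steps below.
deposition_id = dep_json["id"]
bucket_url = dep_json["links"]["bucket"]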

# Uploading a small dummy file (create locally first).
filename = "test_upload.txt"
with open(filename, "w") as f:
    f.write("This is a test upload to Zenodo from Colab.\n")

# Uploading file to the bucket.
with open(filename, "rb") as fp:
    upload = requests.put(
        f"{bucket_url}/{filename}",
        data=fp,
        headers=headers
    )
print("File upload status:", upload.status_code)
print(upload.json())

# Adding some metadata.
metadata = {
    "metadata": {
        "title": "Test Upload from API",
        "upload_type": "dataset",
        "description": "A simple test upload using Zenodo API.",
        "creators": [{"name": "Your Name"}]
    }
}
update = requests.put(
    f"https://zenodo.org/api/deposit/depositions/{deposition_id}",
    json=metadata,
    headers={**headers, "Content-Type": "application/json"},
)
print("Metadata update status:", update.status_code)
print(update.json())

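# Publishing the deposition (test).
# Note: publishing on zenodo.org is permanent. sandbox.zenodo.org is Zenodo's
# test instance and needs its own account and token.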
publish = requests.post(
    f"https://zenodo.org/api/deposit/depositions/{deposition_id}/actions/publish",
    headers=headers
)
print("Publish status:", publish.status_code)
print(publish.json())

Create deposition status: 201
{'created': '2025-09-02T09:39:15.901052+00:00', 'modified': '2025-09-02T09:39:17.279313+00:00', 'id': 17035705, 'conceptrecid': '17035704', 'metadata': {'access_right': 'open', 'prereserve_doi': {'doi': '10.5281/zenodo.17035705', 'recid': 17035705}}, 'title': '', 'links': {'self': 'https://zenodo.org/api/deposit/depositions/17035705', 'html': 'https://zenodo.org/deposit/17035705', 'badge': 'https://zenodo.org/badge/doi/.svg', 'files': 'https://zenodo.org/api/deposit/depositions/17035705/files', 'bucket': 'https://zenodo.org/api/files/7b1c7777-be28-4426-b1cf-cb8088970434', 'latest_draft': 'https://zenodo.org/api/deposit/depositions/17035705', 'latest_draft_html': 'https://zenodo.org/deposit/17035705', 'publish': 'https://zenodo.org/api/deposit/depositions/17035705/actions/publish', 'edit': 'https://zenodo.org/api/deposit/depositions/17035705/actions/edit', 'discard': 'https://zenodo.org/api/deposit/depositions/17035705/actions/discard', 'newversion': 'https

#### Data Loading (from GitHub Repository for now, later Zenodo)
For now and as a test, data will be loaded as CSV files. Later, data from multiple sources and formats will be aligned and standardised into Pandas DataFrames, NumPy arrays, or PyTorch tensors, depending on the requirements of downstream tasks.

In [4]:
# Load and store data.
csv_url = 'https://raw.githubusercontent.com/jchburmester/tree-decisions/refs/heads/main/data/TD_test_file.csv'
df_test = pd.read_csv(csv_url)

In [5]:
# Show dataframe.
df_test.head()
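
Once the data has moved to Zenodo, the CSV files of a published record could be located from the record's JSON description and passed to `pd.read_csv` in the same way as the GitHub URL above. The sketch below only covers the lookup step and runs offline. It assumes that the record JSON lists its files under `files`, each with a `key` (file name) and a `links.self` download URL; this layout should be checked against a real response. The function name, the example record, and the `example.org` URLs are placeholders.

```python
def csv_links_from_record(record):
    """Map file name -> download URL for every CSV file listed in a record dict."""
    links = {}
    for entry in record.get("files", []):
        name = entry.get("key", "")
        url = entry.get("links", {}).get("self")
        if name.lower().endswith(".csv") and url:
            links[name] = url
    return links


# Hand-written stand-in for a record description (assumed layout, placeholder URLs).
example_record = {
    "files": [
        {"key": "TD_test_file.csv", "links": {"self": "https://example.org/files/TD_test_file.csv"}},
        {"key": "README.md", "links": {"self": "https://example.org/files/README.md"}},
    ]
}

for name, url in csv_links_from_record(example_record).items():
    # In the notebook, each URL could then be read with pd.read_csv(url).
    print(name, "->", url)
```

In the notebook, `example_record` would be replaced by the parsed JSON of a `requests.get` call for the record in question.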

#### Pre-processing Pipeline
This section will preprocess the input data: removing outliers to ensure data quality, filling gaps and interpolating missing values with ML-based methods, and formatting the data into a unified, consistent structure. These steps prepare the dataset for subsequent analysis by making it clean, complete, and standardised.

In [None]:
# Here will be code.
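
As a placeholder for the pipeline described above, the following sketch shows the two core steps on a single series. It makes several assumptions: outliers are detected with a simple interquartile-range (IQR) rule, plain linear interpolation stands in for the planned ML-based gap filling, and the ring-width values are made up for illustration. The function name `preprocess_series` is hypothetical.

```python
import numpy as np
import pandas as pd


def preprocess_series(values, iqr_factor=1.5):
    """Mask IQR outliers as NaN, then fill interior gaps by linear interpolation."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    # Values outside the fences become NaN; existing NaNs stay NaN.
    cleaned = values.where(values.between(lower, upper))
    # Linear interpolation is a stand-in, not the ML-based method named in the text.
    return cleaned.interpolate(method="linear", limit_area="inside")


# Made-up ring-width series (mm) with one gap and one implausible spike.
ring_width = pd.Series([1.2, 1.3, np.nan, 1.5, 9.9, 1.4, 1.6], name="ring_width_mm")
filled = preprocess_series(ring_width)
print(filled)
```

With `limit_area="inside"`, only gaps surrounded by valid values are filled, so leading or trailing missing values are left for an explicit decision later.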

#### Data Visualisation & Analysis
Data visualisation will help to show general patterns and anomalies within the dataset. We will then conduct regression and correlation analyses to examine relationships. Finally, statistical and ML approaches will be used to identify more complex and less obvious trends, to improve our understanding of carbon storage processes in wood biomass.

In [None]:
# Here will be code.
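
The regression and correlation step could start from something as small as the sketch below, which fits a straight line by least squares and reports the Pearson correlation coefficient with NumPy. The variable pairing (temperature against ring width) and all numbers are invented for illustration and are not project data; `linear_fit` is a hypothetical helper.

```python
import numpy as np


def linear_fit(x, y):
    """Return (slope, intercept, pearson_r) for a straight-line least-squares fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    r = np.corrcoef(x, y)[0, 1]
    return float(slope), float(intercept), float(r)


# Made-up numbers for illustration only, not project data.
temperature_c = [7.1, 7.8, 8.4, 9.0, 9.7, 10.3]
ring_width_mm = [1.10, 1.25, 1.31, 1.48, 1.52, 1.70]

slope, intercept, r = linear_fit(temperature_c, ring_width_mm)
print(f"slope={slope:.3f} mm per degree C, intercept={intercept:.3f} mm, r={r:.3f}")
```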

#### Sub-Model Prototype
Based on insights from the data analysis, process equations will be implemented to represent the system's carbon dynamics. The model will undergo calibration, validation, and uncertainty analysis to ensure robustness. Finally, the calibrated sub-model will be integrated into the existing head-model framework for a more complete ecosystem simulation.

In [None]:
# Here will be code.
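
To make the planned workflow concrete, the sketch below implements one hypothetical process equation and calibrates its single parameter. Everything in it is an assumption for illustration: the one-pool carbon balance `C[t+1] = C[t] + input[t] - k * C[t]`, the parameter names, the grid-search calibration, and the synthetic "observations" generated from the model itself. It is not the sub-model of this project and not LPJ-GUESS code.

```python
import numpy as np


def simulate_pool(k, inputs, c0=0.0):
    """One-pool carbon balance in annual steps: C[t+1] = C[t] + inputs[t] - k * C[t]."""
    pool = np.empty(len(inputs) + 1)
    pool[0] = c0
    for t, carbon_in in enumerate(inputs):
        pool[t + 1] = pool[t] + carbon_in - k * pool[t]
    return pool


def calibrate_k(observed, inputs, candidates, c0=0.0):
    """Grid search: return the turnover rate with the lowest RMSE, and that RMSE."""
    errors = [
        np.sqrt(np.mean((simulate_pool(k, inputs, c0) - observed) ** 2))
        for k in candidates
    ]
    best = int(np.argmin(errors))
    return float(candidates[best]), float(errors[best])


# Synthetic setup: constant carbon input and "observations" from a known rate.
inputs = np.full(20, 2.0)
observed = simulate_pool(0.05, inputs)
candidates = np.linspace(0.01, 0.10, 10)

best_k, rmse = calibrate_k(observed, inputs, candidates)
print(f"calibrated k = {best_k:.2f}, RMSE = {rmse:.2e}")
```

Recovering a known parameter from synthetic data is a common first check before calibrating against measurements; validation and uncertainty analysis would then need data that was not used for calibration.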