# Theory chapters 

```{admonition} Summary
:class: hint
- The **FAIR principles** provide guidelines to make data **Findable**, **Accessible**, **Interoperable**, and **Reusable**, ensuring data is well-organized, machine-readable, and optimized for reuse across disciplines.
- **Data provenance** is the documented record of the origin, history, and processing of data.
- **Metadata** is information that describes and organizes data, enabling easier discovery and use.
- A **license** defines the permissions, restrictions, and terms under which data or software can be used, shared, and modified.
```

## **FAIR principles**

The **F**indable, **A**ccessible, **I**nteroperable, **R**eusable (**FAIR**) principles are the culmination of more than 20 years of agreements and discussions within industry and academia on how to manage the most crucial asset of any research activity: the **data**.

**Findable**
The first step in (re)using data is to find them. Metadata and data should be easy for both humans and computers to find. Machine-readable metadata plays a crucial role in enabling the automatic discovery of datasets and services. 

- **F1** (Meta)data are assigned a globally unique and persistent identifier

- **F2** Data are described with rich metadata (defined by R1 below)

- **F3** Metadata clearly and explicitly include the identifier of the data they describe

- **F4** (Meta)data are registered or indexed in a searchable resource

**Accessible**
Once users have found the required data, they need to understand how to access it. This involves determining whether the data is openly available or requires authentication and authorization, such as login credentials. Users must know the methods for retrieving the data, whether through direct downloads, APIs, or repositories. Finally, it is essential to consider any restrictions or conditions on access.

- **A1** (Meta)data are retrievable by their identifier using a standardised communications protocol

- **A2** Metadata are accessible, even when the data are no longer available
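
As a minimal sketch of **A1**, the snippet below prepares an HTTP request that asks a DOI resolver for machine-readable metadata through content negotiation. The DOI `10.1234/example-dataset` is hypothetical, and the media type assumes the DOI is registered with an agency that serves DataCite JSON. The request is only built here, not sent.

```python
from urllib.parse import quote
from urllib.request import Request

DOI_RESOLVER = "https://doi.org/"
# Assumption: the registration agency behind the DOI serves this media type.
DATACITE_JSON = "application/vnd.datacite.datacite+json"


def build_metadata_request(doi: str, media_type: str = DATACITE_JSON) -> Request:
    """Prepare (but do not send) an HTTP request for the metadata behind a DOI."""
    # Keep the "/" between DOI prefix and suffix, escape anything unusual.
    url = DOI_RESOLVER + quote(doi.strip(), safe="/")
    # The Accept header asks for metadata instead of the human landing page.
    return Request(url, headers={"Accept": media_type})


request = build_metadata_request("10.1234/example-dataset")  # hypothetical DOI
print(request.full_url)
print(request.get_header("Accept"))
```

With a real DOI, passing the prepared request to `urllib.request.urlopen` would retrieve the metadata record over plain HTTPS, which is the kind of standardised protocol that A1 refers to.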

**Interoperable**
The data usually needs to be integrated with other data. In addition, the data needs to interoperate with applications or workflows for analysis, storage, and processing.

- **I1** (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

- **I2** (Meta)data use vocabularies that follow FAIR principles

- **I3** (Meta)data include qualified references to other (meta)data

**Reusable**
The ultimate goal of FAIR is to optimise data reuse. To achieve this, metadata and data should be described well enough that they can be replicated and/or combined in different settings.

- **R1** (Meta)data are richly described with accurate and relevant attributes.
    - **R1.1** (Meta)data are released with a clear and accessible data usage license.
    - **R1.2** (Meta)data are associated with detailed provenance.
    - **R1.3** (Meta)data meet domain-relevant community standards.

## **Data Provenance** 
In scientific research, ensuring reproducibility remains a cornerstone of the scientific method. Reproducibility allows other researchers to verify findings by following the same methodology, reanalyzing data, and obtaining consistent results. In Data Science, it is fundamental to provide **transparent documentation, well-structured metadata, standardized workflows, and detailed <mark>provenance tracking</mark>** to capture every step of data processing and analysis. 
Unlike workflows, which serve as structured guidelines, **provenance** functions more like a detailed logbook that systematically records every step taken to generate a specific result. This allows researchers to trace, review, and even replicate the exact process that led to a particular outcome, ensuring its validity.

For example, in typical geoscience research, provenance can include:

- **Data source:** raw measurements, original vector and raster data, ground control data
- **Pre-processing methods:** reprojection of the geodata; clipping the dataset to the bounds of a specific study area; data cleaning (e.g., removing clouds or irrelevant features)
- **Data processing and analysis:** transformations such as filtering, aggregation, resampling, and joining, and/or model design with the related statistical analysis
- **Model or statistical parameters**, with the related functions and code used in computations
- **The final output** and how it was generated
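
The items above can be captured in a simple machine-readable log. The following is a minimal sketch rather than an established provenance format: the `ProvenanceLog` class, the file name, and all parameter values are hypothetical and only illustrate recording the data source and each processing step.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(payload: bytes) -> str:
    """Return a SHA-256 checksum that identifies the exact input content."""
    return hashlib.sha256(payload).hexdigest()


class ProvenanceLog:
    """Hypothetical helper that records the data source and every step applied."""

    def __init__(self, source_name: str, raw: bytes):
        self.record = {
            "data_source": {"name": source_name, "sha256": fingerprint(raw)},
            "steps": [],
        }

    def add_step(self, stage: str, description: str, **parameters):
        # Each entry stores what was done, with which parameters, and when.
        self.record["steps"].append({
            "stage": stage,
            "description": description,
            "parameters": parameters,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def to_json(self) -> str:
        return json.dumps(self.record, indent=2)


# Placeholder bytes stand in for the content of a raw raster file.
log = ProvenanceLog("landcover_raw.tif", b"placeholder for the raw file content")
log.add_step("pre-processing", "reproject", target_crs="EPSG:3035")
log.add_step("pre-processing", "clip to study area", bbox=[4.0, 50.0, 6.0, 52.0])
log.add_step("analysis", "resample", resolution_m=100, method="bilinear")
print(log.to_json())
```

Storing the resulting JSON next to the final output lets a reader see which input was used and how it was transformed, step by step.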


**A Jupyter Notebook** is an excellent tool for maintaining provenance in computational research. It records the entire workflow and provides a detailed logbook of all the data processing and analysis steps. Moreover, a Jupyter Notebook makes it easy to annotate each step, improving clarity and documentation.
 

## **Metadata (MD):** *"Data About Data"*

Metadata (MD) is often described as "data about data." It provides <mark>**structured information**</mark> about research data, enabling better organization, discovery, and context of datasets.  

### Why Is Metadata Important?
Metadata plays a crucial role in:  
- **Enhancing discoverability:** Well-documented metadata allows researchers to find relevant datasets quickly (*e.g., by using keywords in their search*).  
- **Ensuring data interoperability:** Standardized metadata enhances searchability and data integration by providing consistent descriptors (*e.g., using controlled vocabularies and standardized keywords for geospatial data*). It also facilitates the **collection and processing** of datasets across different platforms (*e.g., in the case of geospatial data, adopting Open Geospatial Consortium (OGC) standard services such as WMS, WFS, or WCS allows seamless data retrieval and processing across various software and systems, including R, Python, QGIS, ArcGIS, and web-based GIS applications*). Consistent metadata (*e.g., uniformly defined coordinate reference systems, spatial extent, and thematic attributes*) significantly improves interoperability, enabling researchers and analysts to integrate datasets from diverse sources efficiently.
- **Improving data reproducibility:** Metadata provides details about how data was collected and processed (*e.g., links to the original data source, pre-processing algorithms, analysis-ready data, post-processing algorithms, replication packages, and any related documentation such as data or software description articles*).  
- **Facilitating long-term data usability and fitness for purpose:** Metadata includes essential details such as data format, provenance/lineage, licensing, and links to other resources, supporting the long-term sustainability and usability of research data.  
- **Promoting proper attribution, credit, and citation:** Metadata elements like Creator and License identify who created the data and who holds the rights to it, so that creators can be appropriately credited, while defining the conditions for sharing and reuse.  
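
To illustrate the OGC services mentioned above, the sketch below builds a standard `GetCapabilities` request URL, which asks a service to describe itself (its layers, formats, and supported reference systems). The endpoint `https://example.org/geoserver/ows` is hypothetical, and the helper assumes the base URL has no query string of its own.

```python
from urllib.parse import urlencode


def capabilities_url(base_url: str, service: str = "WMS", version: str = "1.3.0") -> str:
    """Build a GetCapabilities URL for an OGC web service (WMS, WFS, or WCS)."""
    query = urlencode({
        "service": service,
        "version": version,
        "request": "GetCapabilities",
    })
    return f"{base_url}?{query}"


# Hypothetical endpoint; replace it with the address of a real OGC service.
print(capabilities_url("https://example.org/geoserver/ows"))
# -> https://example.org/geoserver/ows?service=WMS&version=1.3.0&request=GetCapabilities
```

Because the request follows the same pattern for every compliant server, the same helper can be reused from any client, which is the practical benefit of standardized services.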

### What Does “Structured Metadata” Mean?  
Metadata is considered structured because it follows a defined format with specific elements that describe various aspects of the data. 
The metadata descriptors vary depending on the application domain, as different application domains require specialized descriptors to represent their data accurately. 
To address these differences, metadata standards have been developed to define and organize metadata descriptors based on their intended use. 
These standards are generally classified into two categories:
- **General-Purpose Metadata Standards:** Designed for broad applicability across multiple disciplines, providing a standardized way to describe datasets regardless of the domain.
- **Domain-Specific Metadata Standards:** Tailored to specific fields, incorporating specialized descriptors relevant to particular types of data, such as geospatial information, biomedical research, or social sciences.

### Metadata Standards  
To ensure consistency and usability, metadata often follows established standards such as the following.

<mark>**The DataCite Metadata Schema**</mark> is a widely used **general-purpose** standard for describing digital resources. It is designed for research datasets across all disciplines and focuses on citation, discovery, and persistent identification.

Common metadata elements include:  
- **Title:** The name of the dataset or research work.  
- **Creator:** The individual(s) or organization(s) responsible for generating the data.  
- **Abstract:** A summary of the dataset’s content and purpose.  
- **Keywords:** Terms that help categorize and index the data for easier retrieval.  
- **Format:** The file type or structure of the dataset (e.g., CSV, PDF, XML).  
- **Subject:** The broader topic or discipline related to the data.  
- **Persistent Identifier (PID):** A unique identifier (such as a DOI) that keeps the dataset findable and citable over time.  
- **License:** The terms of use specifying how the data can be shared and reused.  
- **Provenance/Lineage:** Information on the origin and history of the dataset, including how it was created and modified.  
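
The elements listed above can be written down as a small structured record. The sketch below is illustrative only: the field names mirror the list rather than the official DataCite property names, and every value (including the DOI) is hypothetical. The helper simply reports which expected elements are missing or empty.

```python
import json

# Assumption: these names follow the list above, not the official schema.
EXPECTED_ELEMENTS = (
    "title", "creator", "abstract", "keywords", "format",
    "subject", "identifier", "license", "provenance",
)


def missing_elements(record: dict) -> list:
    """Return the expected metadata elements that are absent or empty."""
    return [name for name in EXPECTED_ELEMENTS if not record.get(name)]


record = {
    "title": "Land cover map of the study area",        # hypothetical values
    "creator": ["Example Research Group"],
    "abstract": "Classified land cover derived from satellite imagery.",
    "keywords": ["land cover", "remote sensing"],
    "format": "GeoTIFF",
    "subject": "Earth observation",
    "identifier": "https://doi.org/10.1234/example-dataset",
    "license": "CC-BY-4.0",
    "provenance": "Reprojected, clipped, and resampled from raw imagery.",
}

print(missing_elements(record))        # complete record: nothing missing
print(json.dumps(record, indent=2))    # machine-readable form of the record
```

A check like this can run before a dataset is published, so that gaps in the description are caught while the people who know the data are still available.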

<mark>**ISO 19115**</mark> is a **domain-specific** metadata standard tailored specifically to **geodata**, providing extensive additional detail on the spatial, temporal, and thematic aspects of datasets, such as:

- **Spatial Reference Information:** Coordinate Reference System (CRS), Projection details, Spatial resolution (scale, ground sampling distance)
- **Temporal Extent:** Period covered by the data, Frequency of updates (e.g., daily, annually)
- **Detailed Lineage and Data Provenance:** Source data origin (e.g., satellite imagery, field surveys), Data processing history (e.g., transformations, filtering, aggregation), Quality control procedures applied
- **Data Quality:** Positional accuracy (spatial precision), Logical consistency (topological and attribute correctness), Completeness (missing data, coverage gaps)
- **Geospatial Feature and Attribute Information:** Vector feature types (e.g., points, lines, polygons), Raster properties (resolution, pixel size, band information), Thematic classification (e.g., land cover categories)
- **Geospatial Services:** Web services (e.g., WMS, WFS, WCS from OGC)
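
As a rough illustration of the spatial and temporal descriptors above, the sketch below stores a few of them in a plain Python structure and checks them for obvious inconsistencies. It is not an ISO 19115 encoding: the `GeoMetadata` class and its field names are hypothetical, and the bounding box is assumed to be given in decimal degrees as (west, south, east, north) without crossing the antimeridian.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class GeoMetadata:
    """Hypothetical container for a few geospatial metadata descriptors."""
    crs: str                # e.g. "EPSG:4326"
    bbox: tuple             # (west, south, east, north) in decimal degrees
    start: date             # first day covered by the data
    end: date               # last day covered by the data
    resolution_m: float     # ground sampling distance in metres

    def problems(self) -> list:
        """Return human-readable descriptions of inconsistent values."""
        issues = []
        west, south, east, north = self.bbox
        if not (-180 <= west <= east <= 180):
            issues.append("longitude bounds invalid")
        if not (-90 <= south <= north <= 90):
            issues.append("latitude bounds invalid")
        if self.start > self.end:
            issues.append("temporal extent ends before it starts")
        if self.resolution_m <= 0:
            issues.append("spatial resolution must be positive")
        return issues


good = GeoMetadata("EPSG:4326", (4.0, 50.0, 6.0, 52.0),
                   date(2020, 1, 1), date(2020, 12, 31), 10.0)
bad = GeoMetadata("EPSG:4326", (4.0, 52.0, 6.0, 50.0),
                  date(2021, 1, 1), date(2020, 1, 1), 10.0)
print(good.problems())
print(bad.problems())
```

Checks of this kind support the data quality aspects listed above, since a swapped bounding box or an impossible date range is a common cause of datasets failing to overlay correctly.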

## Licences 
