# MMB Data Pipeline

This notebook implements a structured data pipeline for Mini Module Baseline (MMB) data from NOMAD. It follows a logical workflow:

1. **Setup**: Import libraries and configure environment
2. **Authentication**: Connect to the NOMAD API with proper credentials
3. **Data Retrieval**: Fetch MMB-related samples and archive data
4. **Relationship Analysis**: Identify related entries and references
5. **Data Processing**: Transform raw data into structured DataFrames
6. **Analysis & Visualization**: Prepare data for modeling and analysis

Date: June 27, 2025

## 1. Setup and Environment Configuration

First, we'll import all necessary libraries and set up the environment for working with the NOMAD API and processing the data.

In [1]:
# Ensure we can load the .env file
from pathlib import Path
from dotenv import load_dotenv

# Find the .env file in the project root (one level up from the notebook's working directory)
env_path = Path().absolute().parent / '.env'
if env_path.exists():
    load_dotenv(dotenv_path=env_path)
    print(f"Loaded environment from: {env_path}")
else:
    print(f"Warning: No .env file found at {env_path}")

# Now we can import the rest of our dependencies
import os
import sys
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from tqdm.notebook import tqdm

# Import NOMAD API modules
sys.path.append('../')
from nomad_api.auth import authenticate, OASIS_OPTIONS
from nomad_api.client import NomadClient
from nomad_api.data import query_sample_entries, get_all_samples_with_authors

# Set plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook", font_scale=1.2)

# Display settings for pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 1000)

Loaded environment from: /home/qkg/Documents/1_PROJECTS/NOMAD-Tools/NOMAD-Admin-Tools/.env
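
The setup cell looks for `.env` in exactly one place, the parent of the working directory. If the notebook may be started from different folders, a more tolerant lookup can walk up the directory tree instead. The following is a minimal sketch using only the standard library; `find_env_file` is a hypothetical helper name and is not part of `nomad_api` or `python-dotenv`.

```python
from pathlib import Path
from typing import Optional


def find_env_file(start: Path, filename: str = ".env") -> Optional[Path]:
    """Return the first `filename` found in `start` or any of its parents."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None


env_file = find_env_file(Path.cwd())
print(env_file if env_file else "No .env file found in this directory tree")
```

The returned path can be passed to `load_dotenv(dotenv_path=...)` in the same way as `env_path` in the cell above.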


## 2. Authentication with NOMAD API

To access NOMAD data, we need to authenticate with the API. Authentication uses an access token or a username and password taken from environment variables, and prompts for credentials if neither is available.

In [2]:
# Use the SE Oasis URL - this can be changed to other OASIS options if needed
OASIS_URL = OASIS_OPTIONS['SE Oasis']

# The authenticate function will automatically try to:
# 1. Use NOMAD_CLIENT_ACCESS_TOKEN if available
# 2. Fall back to NOMAD_USERNAME and NOMAD_PASSWORD from .env file
# 3. Prompt for credentials if neither is available
token, user_info = authenticate(base_url=OASIS_URL)

print(f"Successfully authenticated as: {user_info.get('name', user_info.get('username'))}")

# Create the client with the obtained token
client = NomadClient(base_url=OASIS_URL, token=token)

Successfully authenticated as: Paolo Graniero
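
The fallback order described in the comments of the cell above (token, then username and password, then an interactive prompt) can be illustrated in isolation. This is only a sketch of that decision logic under stated assumptions: `resolve_credentials` is a hypothetical function, not the implementation inside `nomad_api.auth.authenticate`, and the environment is passed in as a plain mapping so the example runs without real credentials.

```python
from typing import Callable, Mapping, Tuple


def resolve_credentials(
    env: Mapping[str, str],
    prompt: Callable[[str], str],
) -> Tuple[str, dict]:
    """Pick an authentication method in the order: token, password, prompt."""
    token = env.get("NOMAD_CLIENT_ACCESS_TOKEN")
    if token:
        return "token", {"token": token}

    username = env.get("NOMAD_USERNAME")
    password = env.get("NOMAD_PASSWORD")
    if username and password:
        return "password", {"username": username, "password": password}

    # Nothing usable in the environment: ask the user interactively.
    return "prompt", {
        "username": prompt("NOMAD username: "),
        "password": prompt("NOMAD password: "),
    }


# Example with an in-memory environment instead of os.environ
method, creds = resolve_credentials(
    {"NOMAD_USERNAME": "alice", "NOMAD_PASSWORD": "secret"},
    prompt=lambda message: "",
)
print(method)
```

In an interactive session one would typically pass `os.environ` as `env` and use `getpass.getpass` for the password prompt so that it is not echoed.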


## 3. Data Retrieval

In this section, we'll retrieve all relevant MMB data from NOMAD:
1. Fetch all HySprint samples
2. Filter for MMB-related samples
3. Extract metadata about the dataset

In [3]:
# Use get_all_samples_with_authors to fetch all samples
# This function handles pagination and admin/visible access automatically

# Step 1: Fetch all HySprint samples
all_samples = get_all_samples_with_authors(
    client=client,
    section_type="HySprint_Sample",
    page_size=1000  # Adjust based on your needs
)
print(f"Total HySprint_Sample entries found: {len(all_samples)}")

# Step 2: Extract and filter upload names
unique_upload_names = list(set(sample['upload_name'] for sample in all_samples if 'upload_name' in sample))
unique_upload_names = [name for name in unique_upload_names if name]  # Filter out empty names
print(f"Unique upload names found: {len(unique_upload_names)}")

# Step 3: Filter for MMB-related uploads
mmb_uploads_names = [name for name in unique_upload_names if "MMB" in name]
print(f'MMB uploads found: {len(mmb_uploads_names)}')
print('MMB upload names:')
for name in sorted(mmb_uploads_names):
    print(f"- {name}")

# Step 4: Get all MMB samples
mmb_samples = [sample for sample in all_samples if sample.get('upload_name') in mmb_uploads_names]
print(f"Total MMB samples found: {len(mmb_samples)}")

# Display a brief summary of the first sample (if available)
if mmb_samples:
    sample_summary = {
        "entry_id": mmb_samples[0].get('entry_id'),
        "upload_id": mmb_samples[0].get('upload_id'),
        "upload_name": mmb_samples[0].get('upload_name'),
        "entry_name": mmb_samples[0].get('entry_name'),
    }
    print("\nSample entry summary:")
    for key, value in sample_summary.items():
        print(f"- {key}: {value}")


Attempting to retrieve samples with admin access...
Admin access failed, falling back to visible access...
Attempting to retrieve samples with visible access...
Found 1513 samples (approximately 2 pages)
Processed page 1/2
Processed page 2/2
Total HySprint_Sample entries found: 1513
Unique upload names found: 82
MMB uploads found: 23
MMB upload names:
- HZB_MMB_8_2001
- MMB Batch 12.0
- MMB Batch 12.1
- MMB Batch 12.10
- MMB Batch 12.11
- MMB Batch 12.12
- MMB Batch 12.5
- MMB Batch 12.7
- MMB Batch 12.8
- MMB Batch 12.9
- MMB Batch 13.0
- MMB Batch 14.0
- MMB Batch 15.0
- MMB Batch 16.0
- MMB Batch 17.0
- MMB Batch 19.0
- MMB Batch 20.0
- MMB Batch 2000
- MMB Batch 2002
- MMB Batch 21.0
- MMB Batch 23.0
- MMB Batch 24.0
- MMB Batch 25.0
Total MMB samples found: 483

Sample entry summary:
- entry_id: -635Z62MBAoFkXDxqYfamU3qcgCX
- upload_id: Uq7aoxCCRKe0g2sprkDbMg
- upload_name: MMB Batch 2002
- entry_name: None
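
To see how the samples are distributed across uploads, the same substring filter can be combined with a counter. The sketch below runs on a few made-up records shaped like the entries in `all_samples`; `count_by_upload` is a hypothetical helper, not a `nomad_api` function.

```python
from collections import Counter

# Illustrative records shaped like the entries in `all_samples`
samples = [
    {"entry_id": "a", "upload_name": "MMB Batch 12.0"},
    {"entry_id": "b", "upload_name": "MMB Batch 12.0"},
    {"entry_id": "c", "upload_name": "HZB_MMB_8_2001"},
    {"entry_id": "d", "upload_name": "MAFA Batch 5"},
    {"entry_id": "e", "upload_name": None},
    {"entry_id": "f"},
]


def count_by_upload(records, keyword="MMB"):
    """Count records per upload name, keeping only names containing `keyword`."""
    names = (record.get("upload_name") for record in records)
    return Counter(name for name in names if name and keyword in name)


counts = count_by_upload(samples)
for name, n in sorted(counts.items()):
    print(f"- {name}: {n}")
```

The match is a case-sensitive substring test, so an upload whose name spells out "Module Baseline" without the abbreviation would not be counted.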


In [4]:
unique_upload_names = list(set(sample['upload_name'] for sample in all_samples if 'upload_name' in sample))
unique_upload_names = [name for name in unique_upload_names if name]  # Filter out empty names

print(f"Unique upload names found: {len(unique_upload_names)}")
print('Upload names:')
for name in sorted(unique_upload_names):
    print(f"- {name}")


Unique upload names found: 82
Upload names:
- 1st_Batch_HySPRINT_Yuxin
- 1st_batch_IRIS_Yuxin
- 2nd_Batch_IRIS_Yuxin
- AF_SDC_MAPI_ink_B4
- AF_SDC_MAPIink_Batch7
- Batch VII - SAM Wettability
- Batch_5_CSMB_Yuxin
- Calender Week 15 2024 Module Baseline
- HZB_MMB_8_2001
- IJP-BL_01
- Introduction to Nomad - Workshop Material KJ
- KW10 Module Baseline
- KW15 FACs and PEtOx60
- KW16 FACs and PEtOx
- KW22 PEtOx60 in FACs with MACl
- KW25 POx in FACs - Polymer Variation I
- MAFA Batch 5
- MAFA Batch 8
- MAFA Batch 9
- MAFA Batch2
- MAPI ink_ref_spin_coated
- MMB Batch 12.0
- MMB Batch 12.1
- MMB Batch 12.10
- MMB Batch 12.11
- MMB Batch 12.12
- MMB Batch 12.5
- MMB Batch 12.7
- MMB Batch 12.8
- MMB Batch 12.9
- MMB Batch 13.0
- MMB Batch 14.0
- MMB Batch 15.0
- MMB Batch 16.0
- MMB Batch 17.0
- MMB Batch 19.0
- MMB Batch 20.0
- MMB Batch 2000
- MMB Batch 2002
- MMB Batch 21.0
- MMB Batch 23.0
- MMB Batch 24.0
- MMB Batch 25.0
- MMX B7
- SDC-PSC-7_8_Dec2023_b2
- SOP-02_20241010_TM
- SOP_CSMB

In [5]:
mmb_samples = [sample for sample in all_samples if sample.get('upload_name') in mmb_uploads_names]
print(f"Total MMB samples found: {len(mmb_samples)}")
if mmb_samples:  # Avoid an IndexError when no MMB samples were found
    print(f'Example MMB sample: {json.dumps(mmb_samples[0], indent=2)}')

Total MMB samples found: 483
Example MMB sample: {
  "entry_id": "-635Z62MBAoFkXDxqYfamU3qcgCX",
  "upload_id": "Uq7aoxCCRKe0g2sprkDbMg",
  "lab_id": "HZB_MMB_8-2002-3-0",
  "main_author": "df8bc696-58aa-4571-95fb-d71a800e1c07",
  "coauthors": [],
  "coauthor_groups": [
    "MjM4ze-URpu0NrHBulkRYg"
  ],
  "upload_create_time": "2025-03-19T11:06:57.233000",
  "published": false,
  "license": "CC BY 4.0",
  "upload_name": "MMB Batch 2002"
}
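
Records of this shape convert directly into a pandas DataFrame, which is the form the later processing steps work with. The following is a minimal sketch, assuming the keys shown in the example sample; the second record and the `n_coauthors` column are made up for illustration.

```python
import pandas as pd

# Two illustrative records using the same keys as the sample printed above.
# The values in the second record are made up for the example.
records = [
    {
        "entry_id": "-635Z62MBAoFkXDxqYfamU3qcgCX",
        "upload_id": "Uq7aoxCCRKe0g2sprkDbMg",
        "lab_id": "HZB_MMB_8-2002-3-0",
        "coauthors": [],
        "upload_create_time": "2025-03-19T11:06:57.233000",
        "published": False,
        "upload_name": "MMB Batch 2002",
    },
    {
        "entry_id": "example-entry-id",
        "upload_id": "example-upload-id",
        "lab_id": "example-lab-id",
        "coauthors": ["example-user-id"],
        "upload_create_time": "2025-03-20T09:00:00.000000",
        "published": False,
        "upload_name": "MMB Batch 2002",
    },
]

samples_df = pd.DataFrame.from_records(records)
# Parse ISO timestamps so they can be sorted and filtered as dates
samples_df["upload_create_time"] = pd.to_datetime(samples_df["upload_create_time"])
# Lists are awkward in a flat table; keep a simple count instead
samples_df["n_coauthors"] = samples_df["coauthors"].apply(len)
samples_df = samples_df.drop(columns="coauthors").set_index("entry_id")

print(samples_df.dtypes)
```

With the real data, `mmb_samples` would take the place of `records`.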


In [6]:
# Function to get archive data for a specific entry
def get_sample_archive(client, entry_id):
    """
    Retrieve the complete archive data for a specific entry using the NOMAD API.
    
    Args:
        client (NomadClient): Authenticated NOMAD client
        entry_id (str): The entry ID of the sample
        
    Returns:
        dict: The complete archive data for the entry
    """
    try:
        # Prepare the request body
        request_body = {
            "required": "*"
        }
        
        # Use the make_request method with the correct endpoint pattern and request body
        response = client.make_request(
            'post',
            f'entries/{entry_id}/archive/query',
            json_data=request_body
        )
        return response
    except Exception as e:
        print(f"Error retrieving archive data for entry {entry_id}: {e}")
        return None
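
Since the MMB dataset contains several hundred samples, the single-entry call is usually wrapped in a loop that records failures instead of stopping at the first one. The sketch below shows one way to do this. It assumes the response nests the archive under `['data']['archive']`, and it uses a hypothetical `FakeClient` in place of `NomadClient` so that it runs offline; `fetch_archives` is likewise an illustrative name.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class FakeClient:
    """Stand-in for NomadClient so the example runs offline (hypothetical)."""

    def __init__(self, archives: Dict[str, dict]):
        self._archives = archives

    def make_request(self, method: str, endpoint: str, json_data: Optional[dict] = None) -> dict:
        entry_id = endpoint.split("/")[1]  # 'entries/<entry_id>/archive/query'
        if entry_id not in self._archives:
            raise KeyError(f"unknown entry {entry_id}")
        return {"data": {"archive": self._archives[entry_id]}}


def fetch_archives(client, entry_ids: Iterable[str]) -> Tuple[Dict[str, dict], List[str]]:
    """Fetch one archive per entry ID; return (archives, failed_ids)."""
    archives, failed = {}, []
    for entry_id in entry_ids:
        try:
            response = client.make_request(
                "post", f"entries/{entry_id}/archive/query", json_data={"required": "*"}
            )
            archives[entry_id] = response["data"]["archive"]
        except Exception:
            # Record the failure and continue with the remaining entries
            failed.append(entry_id)
    return archives, failed


client_stub = FakeClient({"e1": {"data": {"name": "sample 1"}}})
archives, failed = fetch_archives(client_stub, ["e1", "e2"])
print(f"fetched={list(archives)} failed={failed}")
```

Returning the failed IDs separately makes it possible to retry only those entries later.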


In [None]:
archive_data = get_sample_archive(client, mmb_samples[0]['entry_id'])
archive_data['data']['archive']['m_ref_archives']

In [None]:
archive_data

## 5. Relationship Analysis

MMB data in NOMAD has complex relationships between entries. This section provides functions to trace relationships between entries and understand the data provenance.

Let's create a function to find all entries that reference a specific target entry ID. This will help us track the relationships between different entries in the database.

In [7]:
# Function to get entries that reference a specific target entry
def get_referencing_entries(client, target_entry_id):
    """
    Find all entries that reference a specific target entry.
    
    Args:
        client (NomadClient): Authenticated NOMAD client
        target_entry_id (str): The entry ID to search for in references
        
    Returns:
        list: List of entries that reference the target entry
    """
    # Construct the query to search for entries with matching target_entry_id in references
    query = {
        "owner": "visible",
        "query": {
            "entry_references.target_entry_id": target_entry_id
        }
    }

    try:
        # Use the make_request method to query the entries
        response = client.make_request('post', 'entries/query', json_data=query)
        if response and 'data' in response:
            return response['data']
        return []
    except Exception as e:
        print(f"Error searching for referencing entries: {e}")
        return []
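
The query above returns whatever the server includes in a single response, so an entry with many references may be only partially covered. The following is a sketch of a paginated variant, assuming the API accepts a `pagination` object with `page_size` and `page_after_value` and reports `next_page_after_value` in the response. These field names should be checked against the NOMAD API documentation. `iter_referencing_entries` and `PagedFakeClient` are hypothetical names, and the fake client exists only so the example runs offline.

```python
from typing import Iterator, List, Optional


def iter_referencing_entries(client, target_entry_id: str, page_size: int = 100) -> Iterator[dict]:
    """Yield every entry referencing `target_entry_id`, following pagination."""
    page_after_value: Optional[str] = None
    while True:
        pagination = {"page_size": page_size}
        if page_after_value is not None:
            pagination["page_after_value"] = page_after_value
        body = {
            "owner": "visible",
            "query": {"entry_references.target_entry_id": target_entry_id},
            "pagination": pagination,
        }
        response = client.make_request("post", "entries/query", json_data=body) or {}
        yield from response.get("data", [])
        # Assumed response field: absent or None on the last page
        page_after_value = response.get("pagination", {}).get("next_page_after_value")
        if not page_after_value:
            break


class PagedFakeClient:
    """Offline stand-in that serves a fixed list in pages (hypothetical)."""

    def __init__(self, entries: List[dict]):
        self._entries = entries

    def make_request(self, method, endpoint, json_data=None):
        size = json_data["pagination"]["page_size"]
        start = int(json_data["pagination"].get("page_after_value", 0))
        page = self._entries[start:start + size]
        end = start + len(page)
        next_value = str(end) if end < len(self._entries) else None
        return {"data": page, "pagination": {"next_page_after_value": next_value}}


fake = PagedFakeClient([{"entry_id": f"ref-{i}"} for i in range(5)])
found = list(iter_referencing_entries(fake, "target", page_size=2))
print(len(found))
```

A generator keeps memory use flat and lets the caller stop early once the entry of interest has been found.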

In [None]:
# Visualizing the processed data
import matplotlib.pyplot as plt
import seaborn as sns

# Function to create visualizations based on the available data
def visualize_mmb_data(wide_df=None, narrow_df=None):
    """
    Create visualizations for MMB data in either wide or narrow format.
    
    Args:
        wide_df (DataFrame): Wide format DataFrame (optional)
        narrow_df (DataFrame): Narrow format DataFrame (optional)
    """
    plt.figure(figsize=(12, 8))
    
    # Check if we have wide format data
    if wide_df is not None and not wide_df.empty:
        # Example 1: Distribution of layer types
        if 'layer_type' in wide_df.columns:
            plt.subplot(2, 2, 1)
            layer_counts = wide_df['layer_type'].value_counts()
            sns.barplot(x=layer_counts.index, y=layer_counts.values)
            plt.title('Distribution of Layer Types')
            plt.xlabel('Layer Type')
            plt.ylabel('Count')
            plt.xticks(rotation=45)
        
        # Example 2: Annealing temperature by process name
        if 'annealing_temperature' in wide_df.columns and 'process_name' in wide_df.columns:
            plt.subplot(2, 2, 2)
            sns.boxplot(x='process_name', y='annealing_temperature', data=wide_df)
            plt.title('Annealing Temperature by Process')
            plt.xlabel('Process Name')
            plt.ylabel('Temperature (°C)')
            plt.xticks(rotation=45)
    
    # Check if we have narrow format data
    if narrow_df is not None and not narrow_df.empty:
        # Example 3: Parameter value distribution
        if 'parameter_name' in narrow_df.columns and 'parameter_value' in narrow_df.columns:
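            # Keep only numeric parameters and store them as numbers, so the
            # box plot and the pivot-table mean below operate on numeric data
            numeric_values = pd.to_numeric(narrow_df['parameter_value'], errors='coerce')
            numeric_params = narrow_df.assign(parameter_value=numeric_values).dropna(subset=['parameter_value'])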
            
            if not numeric_params.empty:
                # Get top 5 most frequent parameters
                top_params = numeric_params['parameter_name'].value_counts().nlargest(5).index.tolist()
                
                # Filter for top parameters
                top_param_data = numeric_params[numeric_params['parameter_name'].isin(top_params)]
                
                if not top_param_data.empty:
                    plt.subplot(2, 2, 3)
                    sns.boxplot(x='parameter_name', y='parameter_value', data=top_param_data)
                    plt.title('Distribution of Top Parameter Values')
                    plt.xlabel('Parameter')
                    plt.ylabel('Value')
                    plt.xticks(rotation=45)
                    
                    # Example 4: Parameter values by sample
                    plt.subplot(2, 2, 4)
                    pivot_data = top_param_data.pivot_table(
                        index='sample_lab_id', 
                        columns='parameter_name', 
                        values='parameter_value',
                        aggfunc='mean'
                    )
                    sns.heatmap(pivot_data, annot=True, cmap='viridis', fmt='.2f')
                    plt.title('Parameter Values by Sample')
                    plt.ylabel('Sample ID')
    
    plt.tight_layout()
    plt.show()

# Try to visualize the data if available
try:
    # Get wide format DataFrame if available
    wide_format_df = None
    if 'wide_df' in globals() and not (isinstance(wide_df, pd.DataFrame) and wide_df.empty):
        wide_format_df = wide_df
    
    # Get narrow format DataFrame if available
    narrow_format_df = None
    if 'narrow_df' in globals() and not (isinstance(narrow_df, pd.DataFrame) and narrow_df.empty):
        narrow_format_df = narrow_df
    
    # Create visualizations
    if wide_format_df is not None or narrow_format_df is not None:
        visualize_mmb_data(wide_format_df, narrow_format_df)
    else:
        print("No processed DataFrames available for visualization.")
        print("Run the data processing cells first to generate 'wide_df' and/or 'narrow_df'.")
except NameError as e:
    print(f"Missing variable: {e}")
    print("Run the data processing cells first to generate the necessary DataFrames.")
except Exception as e:
    print(f"Error during visualization: {e}")
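
The plotting function accepts data in either wide or narrow format. For readers unfamiliar with the distinction, the sketch below converts a small made-up table between the two with `melt` and `pivot`. The column names `sample_lab_id`, `parameter_name`, `parameter_value` and `annealing_temperature` follow the plotting code; `annealing_time` and all values are invented for the example.

```python
import pandas as pd

# Illustrative wide table: one row per sample, one column per parameter.
# Column names follow those used in the plotting function; values are made up.
wide_example = pd.DataFrame(
    {
        "sample_lab_id": ["S1", "S2"],
        "annealing_temperature": [100.0, 120.0],
        "annealing_time": [10.0, 15.0],
    }
)

# Wide -> narrow: one row per (sample, parameter) pair
narrow_example = wide_example.melt(
    id_vars="sample_lab_id",
    var_name="parameter_name",
    value_name="parameter_value",
)

# Narrow -> wide: rebuild the sample-by-parameter table
round_trip = (
    narrow_example.pivot(
        index="sample_lab_id", columns="parameter_name", values="parameter_value"
    )
    .reset_index()
    .rename_axis(columns=None)
)

print(narrow_example)
```

The narrow form suits grouped plots such as the per-parameter box plot, while the wide form is closer to what most modeling libraries expect as input.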

## 8. Summary and Next Steps

This notebook has demonstrated a complete data pipeline for Mini Module Baseline (MMB) data from NOMAD:

1. **Authentication** with the NOMAD API using the proper credentials
2. **Data Retrieval** to fetch all MMB-related samples and their metadata
3. **Archive Data Access** to get detailed information for specific samples
4. **Relationship Analysis** to understand the connections between entries
5. **Data Processing** to transform raw data into structured formats for analysis
6. **Visualization** to gain insights from the processed data

### Next Steps

To extend this pipeline, you could:

1. **Implement batch processing** to handle all MMB samples at once
2. **Add more advanced visualizations** specific to your research questions
3. **Develop machine learning models** to predict properties or optimize processes
4. **Create interactive dashboards** for sharing insights with collaborators
5. **Set up automated data update pipelines** for continuous monitoring

### Usage Guide

1. Ensure you have proper authentication credentials in your `.env` file
2. Run all cells in sequence to set up the complete pipeline
3. Use the `get_all_samples_with_authors()` function to fetch relevant samples
4. Process the data using the provided transformation functions
5. Visualize results using the visualization tools

For any issues or questions, refer to the NOMAD API documentation or contact the NOMAD team.