# 🚀 Quick Start - Environment Setup

## Recommended: UV (Fast & Simple)

```powershell
# 1. Install UV (if not already installed)
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# 2. Navigate to your local DARA repository (adjust the path to your clone)
cd C:\Users\kedargroup_ws01\Documents\Haiwen\Repos\dara

# 3. Create virtual environment
uv venv .venv --python 3.11
.\.venv\Scripts\Activate.ps1

# 4. Install DARA and dependencies
uv pip install -e ".[docs]"

# 5. Register Jupyter kernel
python -m ipykernel install --user --name=dara-mp --display-name="DARA MP Tutorial"

# 6. Launch this notebook
jupyter notebook mp_database_tutorial.ipynb
```

<details>
<summary>💡 Alternative: Use Conda (click to expand)</summary>

If you prefer Conda or already have a `Pymatgen_hw` environment:

```powershell
conda activate Pymatgen_hw
cd C:\Users\kedargroup_ws01\Documents\Haiwen\Repos\dara
pip install -e .
jupyter notebook mp_database_tutorial.ipynb
```
</details>

## ✅ Required Database Files

Before running, ensure these index files exist:
- ✅ `indexes/icsd_index_filled.parquet` (229,487 structures)
- ✅ `indexes/cod_index_filled.parquet` (501,975 structures)
- ✅ `indexes/mp_index.parquet` (169,385 structures) **← Generate if missing!**

**If MP index is missing**, run:
```powershell
python scripts/index_mp.py --input "path/to/df_MP_20250211.pkl" --output indexes/mp_index.parquet --cif-dir mp_cifs
```
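
The file check above can also be scripted. The following is a minimal sketch that assumes the three paths listed above are relative to the repository root; `REQUIRED_INDEXES` and `missing_indexes` are illustrative names, not part of DARA.

```python
from pathlib import Path

# Index files listed above, relative to the repository root.
REQUIRED_INDEXES = (
    "indexes/icsd_index_filled.parquet",
    "indexes/cod_index_filled.parquet",
    "indexes/mp_index.parquet",
)


def missing_indexes(repo_root):
    """Return the required index files that are not present under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_INDEXES if not (root / rel).is_file()]
```

Calling `missing_indexes(Path.cwd())` from the repository root returns an empty list when all three indexes are in place; any returned entry names a file that still has to be generated.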

📖 **Detailed setup guide**: [`ENVIRONMENT_SETUP.md`](ENVIRONMENT_SETUP.md)

---

# Tutorial 3: Using Materials Project Database with DARA

This tutorial demonstrates how to use the newly integrated **Materials Project (MP)** database alongside traditional **ICSD** and **COD** databases for phase search and refinement in DARA.

## What's New in DARA v3.0

- 🗃️ **Materials Project Integration**: 169,385 crystal structures (35% experimental + 65% theoretical)
- 🔬 **Experimental/Theoretical Filtering**: Distinguish between experimental and DFT-computed structures
- ⚡ **Thermodynamic Stability**: Filter phases by `energy_above_hull` for material discovery
- 🎯 **Unified Query Interface**: Single API for all three databases

## Tutorial Overview

We will analyze the same XRD pattern (`GeO2-ZnO` reaction sample) using **three local database indexes**:

1. **COD Database** (local index - 501,975 structures)
2. **ICSD Database** (local index - 229,487 structures)
3. **Materials Project** (local index - 169,385 structures with exp/theory classification)

All three databases use **local indexes** for fast querying - no internet connection needed!

Then we'll compare the results and perform refinement.

> **Note**: This tutorial requires all three database indexes to be generated. See `scripts/README.md` for instructions.

In [1]:
%pip install ipywidgets nbformat

Note: you may need to restart the kernel to use updated packages.


c:\Users\kedargroup_ws01\Documents\Haiwen\Repos\dara\.venv\Scripts\python.exe: No module named pip


## Setup: Import Libraries and Define Pattern

In [2]:
import sys
from pathlib import Path

# Import DARA
from dara import search_phases
from dara.refine import do_refinement_no_saving

# Import the new database tools
sys.path.insert(0, str(Path.cwd().parent / 'scripts'))
from dara_adapter import prepare_phases_for_dara, get_index_stats
from database_interface import StructureDatabaseIndex

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


In [3]:
# Define the XRD pattern and chemical system
# Use absolute paths to avoid working directory issues
repo_root = Path.cwd().parent
pattern_path = str(repo_root / "notebooks" / "tutorial_data" / "GeO2-ZnO_700C_60min.xrdml")
chemical_system = "Ge-O-Zn"
required_elements = ['Ge', 'O', 'Zn']

# Create directories for CIF files from each database in the user's Documents folder
# An absolute path outside the repository avoids permission issues
user_docs = Path.home() / "Documents"
work_dir = user_docs / "dara_tutorial_temp"
work_dir.mkdir(parents=True, exist_ok=True)

cod_dir = work_dir / "cifs_cod"
icsd_dir = work_dir / "cifs_icsd"
mp_dir = work_dir / "cifs_mp"

cod_dir.mkdir(exist_ok=True)
icsd_dir.mkdir(exist_ok=True)
mp_dir.mkdir(exist_ok=True)

print(f"📁 Repository root: {repo_root}")
print(f"Pattern: {pattern_path}")
print(f"Chemical system: {chemical_system}")
print(f"\n📁 Working directory: {work_dir}")
print(f"   COD CIFs: {cod_dir}")
print(f"   ICSD CIFs: {icsd_dir}")
print(f"   MP CIFs: {mp_dir}")

📁 Repository root: c:\Users\kedargroup_ws01\Documents\Haiwen\Repos\dara
Pattern: c:\Users\kedargroup_ws01\Documents\Haiwen\Repos\dara\notebooks\tutorial_data\GeO2-ZnO_700C_60min.xrdml
Chemical system: Ge-O-Zn

📁 Working directory: C:\Users\kedargroup_ws01\Documents\dara_tutorial_temp
   COD CIFs: C:\Users\kedargroup_ws01\Documents\dara_tutorial_temp\cifs_cod
   ICSD CIFs: C:\Users\kedargroup_ws01\Documents\dara_tutorial_temp\cifs_icsd
   MP CIFs: C:\Users\kedargroup_ws01\Documents\dara_tutorial_temp\cifs_mp


---

## Part 1: Database Setup and Phase Preparation

This section covers loading the three database indexes (COD, ICSD, MP) and preparing phase lists for search.

### 1.1 COD Database (Local Index)

Now let's use the **local COD index** that we've already built - no need to download from the internet!

#### 1.1.1 Load COD Index Statistics

In [4]:
# Use the local COD index (already built)
cod_index_path = Path.cwd().parent / 'indexes' / 'cod_index_filled.parquet'

print(f"📂 Loading COD index from: {cod_index_path}")

# Get COD index statistics
cod_stats = get_index_stats(cod_index_path)
print(f"\n📊 COD Index Statistics:")
print(f"  Total records: {cod_stats['total_records']:,}")
if 'spacegroup' in cod_stats['completeness']:
    print(f"  Spacegroup coverage: {cod_stats['completeness']['spacegroup']['percentage']:.1f}%")
if 'path' in cod_stats['completeness']:
    print(f"  CIF path coverage: {cod_stats['completeness']['path']['percentage']:.1f}%")

📂 Loading COD index from: c:\Users\kedargroup_ws01\Documents\Haiwen\Repos\dara\indexes\cod_index_filled.parquet

📊 COD Index Statistics:
  Total records: 501,975
  Spacegroup coverage: 98.5%
  CIF path coverage: 100.0%
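
The coverage figures can be read as "share of records with a usable value in that column". How `get_index_stats` computes them internally is not shown in this notebook, so the following is only a sketch of that idea on plain dictionaries; `coverage` is a hypothetical helper.

```python
def coverage(records, field):
    """Percentage of records whose `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for rec in records if rec.get(field) not in (None, ""))
    return 100.0 * filled / len(records)


# Toy index rows (made up for illustration, not taken from the COD index).
rows = [
    {"spacegroup": "P-1", "path": "a.cif"},
    {"spacegroup": None, "path": "b.cif"},
    {"spacegroup": "Fm-3m", "path": "c.cif"},
    {"path": "d.cif"},
]
print(f"Spacegroup coverage: {coverage(rows, 'spacegroup'):.1f}%")  # 50.0%
print(f"CIF path coverage: {coverage(rows, 'path'):.1f}%")          # 100.0%
```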



In [5]:
# Prepare COD phases using dara_adapter
print(f"\n🔍 Filtering COD phases for {chemical_system}...")

cod_cif_paths = prepare_phases_for_dara(
    index_path=cod_index_path,
    required_elements=required_elements,
    max_phases=500  # Limit for performance
)

print(f"✅ Found {len(cod_cif_paths)} COD phases")

# Show examples
print("\nExample COD phases:")
for path in cod_cif_paths[:5]:
    print(f"  - {Path(path).name}")


🔍 Filtering COD phases for Ge-O-Zn...
✅ Found 61 COD phases

Example COD phases:
  - 1509810.cif
  - 1524557.cif
  - 1525094.cif
  - 1525736.cif
  - 1529522.cif
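
Element filtering for a chemical system can mean two different things, and this notebook does not show which one `prepare_phases_for_dara` applies to `required_elements`. The sketch below illustrates both readings on formula strings; the helper names are hypothetical and the regular expression only handles simple formulas without parentheses or charges.

```python
import re


def elements_of(formula):
    """Extract element symbols from a simple formula such as 'Zn2GeO4'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def within_system(formula, system_elements):
    """True if the phase contains no element outside the chemical system."""
    return elements_of(formula) <= set(system_elements)


def contains_all(formula, required):
    """True if the phase contains every required element (others allowed)."""
    return set(required) <= elements_of(formula)


system = ["Ge", "O", "Zn"]
for formula in ["ZnO", "GeO2", "Zn2GeO4", "Zn2SiO4"]:
    print(formula, within_system(formula, system), contains_all(formula, system))
```

Under the first reading, binary oxides such as ZnO and GeO2 stay in the candidate list; under the second, only phases containing Ge, O and Zn together do.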


---

### 1.2 ICSD Database (Local Index)

Now let's use the local ICSD index to prepare reference phases.

In [6]:
# Use the new unified interface to query ICSD
icsd_index_path = Path.cwd().parent / 'indexes' / 'icsd_index_filled.parquet'

print(f"📂 Loading ICSD index from: {icsd_index_path}")

# Get ICSD index statistics
icsd_stats = get_index_stats(icsd_index_path)
print(f"\n📊 ICSD Index Statistics:")
print(f"  Total records: {icsd_stats['total_records']:,}")
if 'spacegroup' in icsd_stats['completeness']:
    print(f"  Spacegroup coverage: {icsd_stats['completeness']['spacegroup']['percentage']:.1f}%")
if 'path' in icsd_stats['completeness']:
    print(f"  CIF path coverage: {icsd_stats['completeness']['path']['percentage']:.1f}%")

📂 Loading ICSD index from: c:\Users\kedargroup_ws01\Documents\Haiwen\Repos\dara\indexes\icsd_index_filled.parquet

📊 ICSD Index Statistics:
  Total records: 229,487
  Spacegroup coverage: 97.2%
  CIF path coverage: 100.0%



In [7]:
# Prepare ICSD phases using dara_adapter
print(f"\n🔍 Filtering ICSD phases for {chemical_system}...")

icsd_cif_paths = prepare_phases_for_dara(
    index_path=icsd_index_path,
    required_elements=required_elements,
    max_phases=500  # Limit for performance
)

print(f"✅ Found {len(icsd_cif_paths)} ICSD phases")

# Show examples
print("\nExample ICSD phases:")
for path in icsd_cif_paths[:5]:
    print(f"  - {Path(path).name}")


🔍 Filtering ICSD phases for Ge-O-Zn...
✅ Found 138 ICSD phases

Example ICSD phases:
  - 8846.cif
  - 12011.cif
  - 12014.cif
  - 12015.cif
  - 14174.cif


---

### 1.3 Materials Project Database (Experimental + Theoretical)

Now let's explore the new MP database integration. MP contains both experimental structures (35%) and DFT-computed theoretical structures (65%).

#### 1.3.1 Explore MP Database Statistics

In [8]:
# Load MP index
mp_index_path = Path.cwd().parent / 'indexes' / 'mp_index.parquet'

print(f"📂 Loading MP index from: {mp_index_path}")

# Get MP index statistics
mp_stats = get_index_stats(mp_index_path)
print(f"\n📊 Materials Project Index Statistics:")
print(f"  Total records: {mp_stats['total_records']:,}")

# Count experimental/theoretical from sources if available
if 'sources' in mp_stats and mp_stats['sources']:
    # MP data has experimental_status info
    # We need to read the actual data to count
    import pandas as pd
    mp_df = pd.read_parquet(mp_index_path)
    if 'experimental_status' in mp_df.columns:
        exp_count = (mp_df['experimental_status'] == 'experimental').sum()
        theo_count = (mp_df['experimental_status'] == 'theoretical').sum()
        print(f"  Experimental structures: {exp_count:,}")
        print(f"  Theoretical structures: {theo_count:,}")

if 'spacegroup' in mp_stats['completeness']:
    print(f"  Spacegroup coverage: {mp_stats['completeness']['spacegroup']['percentage']:.1f}%")
if 'path' in mp_stats['completeness']:
    print(f"  CIF path coverage: {mp_stats['completeness']['path']['percentage']:.1f}%")

📂 Loading MP index from: c:\Users\kedargroup_ws01\Documents\Haiwen\Repos\dara\indexes\mp_index.parquet

📊 Materials Project Index Statistics:
  Total records: 169,385
  Experimental structures: 59,936
  Theoretical structures: 109,449
  Spacegroup coverage: 100.0%
  CIF path coverage: 100.0%



#### 1.3.2 Filter MP Phases (Experimental + Theoretical)

For this tutorial, we'll include **both experimental and theoretical phases** from MP. We can also filter by thermodynamic stability using `energy_above_hull`.
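
As a rough picture of what such a filter does, the sketch below keeps every experimental record and admits theoretical ones only up to an energy-above-hull cutoff. It is an illustration on plain dictionaries, not the adapter's actual implementation; `select_phases` and the sample records are hypothetical, while the field names `experimental_status` and `energy_above_hull` follow the MP index columns used in this notebook.

```python
def select_phases(records, include_theoretical=False, max_e_above_hull=None):
    """Keep experimental records; optionally add theoretical ones below a cutoff."""
    selected = []
    for rec in records:
        if rec["experimental_status"] == "experimental":
            selected.append(rec)
            continue
        if not include_theoretical:
            continue
        e_hull = rec.get("energy_above_hull")
        # Without a cutoff every theoretical record passes; with a cutoff,
        # records lacking an energy value are dropped.
        if max_e_above_hull is None or (e_hull is not None and e_hull <= max_e_above_hull):
            selected.append(rec)
    return selected


# Made-up records for illustration (energies in eV/atom).
demo = [
    {"id": "a", "experimental_status": "experimental", "energy_above_hull": 0.30},
    {"id": "b", "experimental_status": "theoretical", "energy_above_hull": 0.05},
    {"id": "c", "experimental_status": "theoretical", "energy_above_hull": 0.20},
]
print([r["id"] for r in select_phases(demo, include_theoretical=True, max_e_above_hull=0.1)])
# ['a', 'b']
```

In this sketch the cutoff applies only to theoretical entries, so an experimentally reported phase is kept even when its computed energy lies above the threshold. Whether the adapter behaves the same way is an assumption.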

In [9]:
# Option 1: Get ONLY experimental MP phases
print(f"🔬 Filtering MP EXPERIMENTAL phases for {chemical_system}...")

mp_exp_cif_paths = prepare_phases_for_dara(
    index_path=mp_index_path,
    required_elements=required_elements,
    experimental_only=True,  # Only experimental structures
    max_phases=500
)

print(f"✅ Found {len(mp_exp_cif_paths)} MP experimental phases")

🔬 Filtering MP EXPERIMENTAL phases for Ge-O-Zn...
✅ Found 23 MP experimental phases


In [10]:
# Option 2: Get experimental + stable theoretical phases
print(f"\n🧪 Filtering MP phases (EXPERIMENTAL + STABLE THEORETICAL) for {chemical_system}...")

mp_all_cif_paths = prepare_phases_for_dara(
    index_path=mp_index_path,
    required_elements=required_elements,
    include_theoretical=True,  # Include theoretical structures
    max_e_above_hull=0.1,  # Only stable/metastable phases (≤ 0.1 eV/atom)
    max_phases=500
)

print(f"✅ Found {len(mp_all_cif_paths)} MP phases (exp + stable theoretical)")
print(f"   Additional theoretical phases: {len(mp_all_cif_paths) - len(mp_exp_cif_paths)}")

# Show examples
print("\nExample MP phases:")
for path in mp_all_cif_paths[:5]:
    print(f"  - {Path(path).name}")


🧪 Filtering MP phases (EXPERIMENTAL + STABLE THEORETICAL) for Ge-O-Zn...
✅ Found 71 MP phases (exp + stable theoretical)
   Additional theoretical phases: 48

Example MP phases:
  - mp-5909.cif
  - mp-8285.cif
  - mp-27843.cif
  - mp-1190949.cif
  - mp-17392.cif


#### 1.3.3 Detailed Analysis: Compare Experimental vs Theoretical Coverage

In [11]:
# Use the database interface for detailed analysis
import pandas as pd  # needed below for the empty-DataFrame fallback
db_mp = StructureDatabaseIndex(mp_index_path)

# Filter for Ge-O-Zn system
mp_ge_o_zn = db_mp.filter_by_elements(required=required_elements)

# Split by experimental status (work with DataFrame directly)
if 'experimental_status' in mp_ge_o_zn.columns:
    mp_exp = mp_ge_o_zn[mp_ge_o_zn['experimental_status'] == 'experimental']
    mp_theo = mp_ge_o_zn[mp_ge_o_zn['experimental_status'] == 'theoretical']
else:
    # If no experimental_status column, assume all are experimental (COD/ICSD)
    mp_exp = mp_ge_o_zn
    mp_theo = pd.DataFrame(columns=mp_ge_o_zn.columns)

total = len(mp_ge_o_zn)
print(f"\n📊 MP {chemical_system} Phase Distribution:")
print(f"  Total MP phases: {total}")
if total > 0:  # avoid division by zero when no phase matches
    print(f"  Experimental: {len(mp_exp)} ({len(mp_exp)/total*100:.1f}%)")
    print(f"  Theoretical: {len(mp_theo)} ({len(mp_theo)/total*100:.1f}%)")

# Analyze stability distribution for theoretical phases
if len(mp_theo) > 0:
    # Filter by energy_above_hull directly from the DataFrame
    if 'energy_above_hull' in mp_theo.columns:
        # Note: the metastable count is cumulative and includes the stable phases
        stable_theo = mp_theo[mp_theo['energy_above_hull'] <= 0.0]
        metastable_theo = mp_theo[mp_theo['energy_above_hull'] <= 0.1]
        
        print(f"\n⚡ Theoretical Phase Stability:")
        print(f"  Stable (E_hull = 0): {len(stable_theo)}")
        print(f"  Metastable (E_hull ≤ 0.1): {len(metastable_theo)}")
        print(f"  Less stable (E_hull > 0.1): {len(mp_theo) - len(metastable_theo)}")
    else:
        print(f"\n⚠️ Energy above hull data not available for stability filtering")


📊 MP Ge-O-Zn Phase Distribution:
  Total MP phases: 126
  Experimental: 23 (18.3%)
  Theoretical: 103 (81.7%)

⚡ Theoretical Phase Stability:
  Stable (E_hull = 0): 1
  Metastable (E_hull ≤ 0.1): 52
  Less stable (E_hull > 0.1): 51
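
Because both masks in the cell use `<=`, the "Metastable" count of 52 includes the 1 stable phase. If non-overlapping bins are preferred, the counts can be split as in the following sketch; `stability_bins` is a hypothetical helper that works on a plain list of energies in eV/atom, with `None` standing for a missing value.

```python
def stability_bins(e_hull_values, threshold=0.1):
    """Count phases in disjoint stability classes by energy above hull."""
    bins = {"stable": 0, "metastable": 0, "unstable": 0, "unknown": 0}
    for e_hull in e_hull_values:
        if e_hull is None:
            bins["unknown"] += 1
        elif e_hull <= 0.0:
            bins["stable"] += 1
        elif e_hull <= threshold:
            bins["metastable"] += 1  # strictly above the hull, within the cutoff
        else:
            bins["unstable"] += 1
    return bins


print(stability_bins([0.0, 0.03, 0.1, 0.25, None]))
# {'stable': 1, 'metastable': 2, 'unstable': 1, 'unknown': 1}
```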


---

### 1.4 Database Comparison Summary

Let's compare the number of phases available from each database.

In [12]:
import pandas as pd

# Create comparison table
comparison = pd.DataFrame([
    {
        'Database': 'COD',
        'Total Phases': len(cod_cif_paths),
        'Experimental': len(cod_cif_paths),
        'Theoretical': 0,
        'Type': 'Local Index'
    },
    {
        'Database': 'ICSD',
        'Total Phases': len(icsd_cif_paths),
        'Experimental': len(icsd_cif_paths),
        'Theoretical': 0,
        'Type': 'Local Index'
    },
    {
        'Database': 'MP (Exp Only)',
        'Total Phases': len(mp_exp_cif_paths),
        'Experimental': len(mp_exp_cif_paths),
        'Theoretical': 0,
        'Type': 'Local Index'
    },
    {
        'Database': 'MP (Exp + Theory)',
        'Total Phases': len(mp_all_cif_paths),
        'Experimental': len(mp_exp_cif_paths),
        'Theoretical': len(mp_all_cif_paths) - len(mp_exp_cif_paths),
        'Type': 'Local Index'
    }
])

print("\n" + "="*70)
print(f"📊 Phase Coverage Comparison for {chemical_system}")
print("="*70)
print(comparison.to_string(index=False))
print("="*70)


📊 Phase Coverage Comparison for Ge-O-Zn
         Database  Total Phases  Experimental  Theoretical        Type
              COD            61            61            0 Local Index
             ICSD           138           138            0 Local Index
    MP (Exp Only)            23            23            0 Local Index
MP (Exp + Theory)            71            23           48 Local Index


---

### 1.5 Custom CIF Files (Optional)

💡 **Add your own CIF files to the phase search!**

You can include custom CIF files from any source (downloaded from databases, self-synthesized materials, etc.) alongside the database phases.

**Three usage modes available:**
- **Mode A**: Use custom CIF files only
- **Mode B**: Add custom CIFs to each database list separately
- **Mode C**: Use database CIFs only (default)

In [13]:
# ========== CUSTOM CIF CONFIGURATION ==========
# Set your usage mode here:
# "A" = Use ONLY custom CIF files
# "B" = Add custom CIFs to each database list (COD+custom, ICSD+custom, MP+custom)
# "C" = Use database CIFs only (no custom CIFs)

CUSTOM_CIF_MODE = "C"  # Change this to "A", "B", or "C"

# ========== SETUP CUSTOM CIF DIRECTORY ==========
custom_cif_folder = work_dir / "custom_cifs"
custom_cif_folder.mkdir(exist_ok=True)

# Read all CIF files from the custom directory
custom_cif_files = list(custom_cif_folder.glob("*.cif"))
custom_cifs = [str(p) for p in custom_cif_files]

print(f"📁 Custom CIF folder: {custom_cif_folder}")
print(f"📄 Custom CIF files found: {len(custom_cifs)}")
if len(custom_cifs) > 0:
    print(f"\nCustom CIF files:")
    for cif in custom_cifs[:5]:
        print(f"  - {Path(cif).name}")
    if len(custom_cifs) > 5:
        print(f"  ... and {len(custom_cifs) - 5} more")

# ========== APPLY USAGE MODE ==========
print(f"\n🔧 Usage Mode: {CUSTOM_CIF_MODE}")

if CUSTOM_CIF_MODE == "A":
    # Mode A: Use ONLY custom CIFs
    print("   Using ONLY custom CIF files")
    if len(custom_cifs) == 0:
        print("   ⚠️ WARNING: No custom CIF files found! Add .cif files to the custom folder.")
        print(f"   Folder: {custom_cif_folder}")
    
    # Override database lists with custom only
    final_cod_phases = custom_cifs.copy()
    final_icsd_phases = custom_cifs.copy()
    final_mp_phases = custom_cifs.copy()
    
elif CUSTOM_CIF_MODE == "B":
    # Mode B: Add custom CIFs to each database
    print("   Adding custom CIFs to each database list")
    print(f"   - COD phases: {len(cod_cif_paths)} + {len(custom_cifs)} = {len(cod_cif_paths) + len(custom_cifs)}")
    print(f"   - ICSD phases: {len(icsd_cif_paths)} + {len(custom_cifs)} = {len(icsd_cif_paths) + len(custom_cifs)}")
    print(f"   - MP phases: {len(mp_all_cif_paths)} + {len(custom_cifs)} = {len(mp_all_cif_paths) + len(custom_cifs)}")
    
    final_cod_phases = cod_cif_paths + custom_cifs
    final_icsd_phases = icsd_cif_paths + custom_cifs
    final_mp_phases = mp_all_cif_paths + custom_cifs
    
else:  # Mode C (default)
    # Mode C: Use database CIFs only
    print("   Using database CIFs only (no custom CIFs)")
    print(f"   - COD phases: {len(cod_cif_paths)}")
    print(f"   - ICSD phases: {len(icsd_cif_paths)}")
    print(f"   - MP phases: {len(mp_all_cif_paths)}")
    
    final_cod_phases = cod_cif_paths.copy()
    final_icsd_phases = icsd_cif_paths.copy()
    final_mp_phases = mp_all_cif_paths.copy()

print(f"\n✅ Phase lists prepared:")
print(f"   final_cod_phases: {len(final_cod_phases)} phases")
print(f"   final_icsd_phases: {len(final_icsd_phases)} phases")
print(f"   final_mp_phases: {len(final_mp_phases)} phases")

📁 Custom CIF folder: C:\Users\kedargroup_ws01\Documents\dara_tutorial_temp\custom_cifs
📄 Custom CIF files found: 0

🔧 Usage Mode: C
   Using database CIFs only (no custom CIFs)
   - COD phases: 61
   - ICSD phases: 138
   - MP phases: 71

✅ Phase lists prepared:
   final_cod_phases: 61 phases
   final_icsd_phases: 138 phases
   final_mp_phases: 71 phases


#### 💡 How to Add Custom CIF Files

**Step 1:** Download or prepare your CIF files from any source:
- Crystallography Open Database: https://www.crystallography.net/cod/
- Materials Project: https://next-gen.materialsproject.org/
- ICSD: https://icsd.fiz-karlsruhe.de/
- Or your own synthesized materials

**Step 2:** Copy the `.cif` files to the custom folder:
```
C:\Users\...\Documents\dara_tutorial_temp\custom_cifs\
```

**Step 3:** Change the `CUSTOM_CIF_MODE` in the cell above:
- Set to `"A"` to use ONLY your custom CIF files
- Set to `"B"` to add custom CIFs to all database lists
- Set to `"C"` to use database CIFs only (default)

**Step 4:** Re-run the configuration cell above to apply changes
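
The three modes reduce to one small selection rule per database. The sketch below expresses it as a function; `build_phase_list` is a hypothetical helper, and unlike the configuration cell it drops duplicate paths in mode B and raises on an unknown mode instead of silently falling back to mode C.

```python
def build_phase_list(database_cifs, custom_cifs, mode="C"):
    """Combine database and custom CIF paths according to the usage mode."""
    if mode == "A":  # custom CIFs only
        return list(custom_cifs)
    if mode == "B":  # database CIFs plus custom CIFs, order kept, duplicates dropped
        return list(dict.fromkeys([*database_cifs, *custom_cifs]))
    if mode == "C":  # database CIFs only
        return list(database_cifs)
    raise ValueError(f"Unknown CUSTOM_CIF_MODE: {mode!r} (expected 'A', 'B' or 'C')")


db = ["cod/1.cif", "cod/2.cif"]
custom = ["custom/x.cif", "cod/2.cif"]
print(build_phase_list(db, custom, "B"))
# ['cod/1.cif', 'cod/2.cif', 'custom/x.cif']
```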

---

## Part 2: Phase Search with Different Databases

Now let's perform phase search using each database and compare the results.

### 2.1 Phase Search with COD

In [14]:
# IMPORTANT: Change working directory to repository root
# The CIF paths in indexes are relative to repo root (e.g., 'cod_cifs/cif/...')
import os
original_cwd = os.getcwd()
os.chdir(repo_root)
print(f"✅ Changed working directory to: {os.getcwd()}")
print(f"   (Original: {original_cwd})")

✅ Changed working directory to: c:\Users\kedargroup_ws01\Documents\Haiwen\Repos\dara
   (Original: c:\Users\kedargroup_ws01\Documents\Haiwen\Repos\dara\notebooks)
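
Changing the working directory in one cell and restoring it in another leaves the kernel in the wrong directory if a cell in between fails. One alternative is a context manager that always restores the previous directory; the sketch below is illustrative and `working_directory` is not a DARA function.

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch the current working directory, restoring it on exit."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises, so the original directory always comes back.
        os.chdir(previous)
```

A search call could then be wrapped as `with working_directory(repo_root): ...`. On Python 3.11 and later, the standard library's `contextlib.chdir` provides the same behavior.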


In [15]:
print("🔍 Running phase search with COD database (local index)...\n")

search_results_cod = search_phases(
    pattern_path=pattern_path,
    phases=final_cod_phases,  # Using final phase list (may include custom CIFs)
    wavelength="Cu",
    instrument_profile="Aeris-fds-Pixcel1d-Medipix3",
)

print(f"\n✅ COD Phase Search Complete!")
print(f"   Found {len(search_results_cod)} solution(s)")

if len(search_results_cod) > 0:
    print(f"   Best solution Rwp: {search_results_cod[0].refinement_result.lst_data.rwp:.2f}%")

🔍 Running phase search with COD database (local index)...



2025-10-30 11:11:55,738	INFO worker.py:2012 -- Started a local Ray instance.


2025-10-30 11:11:57,626 INFO dara.search.tree Detecting peaks in the pattern.


(pid=gcs_server) [2025-10-30 11:12:25,914 E 78732 49252] (gcs_server.exe) gcs_server.cc:302: Failed to establish connection to the event+metrics exporter agent. Events and metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(raylet) [2025-10-30 11:12:27,463 E 29864 48376] (raylet.exe) main.cc:975: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14

2025-10-30 11:12:30,033 INFO dara.search.tree The wmax is automatically adjusted to 64.08.
2025-10-30 11:12:30,035 INFO dara.search.tree The intensity threshold is automatically set to 8.53 % of maximum peak intensity.
2025-10-30 11:12:30,035 INFO dara.search.tree Creating the root node.
2025-10-30 11:12:30,036 INFO dara.search.tree Refining all the phases in the dataset.
2025-10-30 11:16:23,914 INFO dara.search.tree The initial value of eps2 is automatically set to 0.000000_-0.05^0.05.
2025-10-30 11:16:23,916 INFO dara.search.tree Finished refining 61 phases, with 18 phases removed.
2025-10-30 11:16:23,916 INFO dara.search.tree Express mode is enabled. Grouping phases before starting.
2025-10-30 11:16:23,9

### 2.2 Phase Search with ICSD

In [16]:
print("🔍 Running phase search with ICSD database...\n")

search_results_icsd = search_phases(
    pattern_path=pattern_path,
    phases=final_icsd_phases,  # Using final phase list (may include custom CIFs)
    wavelength="Cu",
    instrument_profile="Aeris-fds-Pixcel1d-Medipix3",
)

print(f"\n✅ ICSD Phase Search Complete!")
print(f"   Found {len(search_results_icsd)} solution(s)")

if len(search_results_icsd) > 0:
    print(f"   Best solution Rwp: {search_results_icsd[0].refinement_result.lst_data.rwp:.2f}%")

🔍 Running phase search with ICSD database...

2025-10-30 11:16:28,494 INFO dara.search.tree Detecting peaks in the pattern.
2025-10-30 11:17:00,713 INFO dara.search.tree The wmax is automatically adjusted to 64.08.
2025-10-30 11:17:00,716 INFO dara.search.tree The intensity threshold is automatically set to 8.53 % of maximum peak intensity.
2025-10-30 11:17:00,717 INFO dara.search.tree Creating the root node.
2025-10-30 11:17:00,718 INFO dara.search.tree Refining all the phases in the dataset.
2025-10-30 11:18:40,065 INFO dara.search.tree The initial value of eps2 is automatically set to 0.000000_-0.05^0.05.
2025-1

### 2.3 Phase Search with Materials Project (Exp + Theory)

In [17]:
print("🔍 Running phase search with MP database (Exp + Stable Theory)...\n")

search_results_mp = search_phases(
    pattern_path=pattern_path,
    phases=final_mp_phases,  # Using final phase list (may include custom CIFs)
    wavelength="Cu",
    instrument_profile="Aeris-fds-Pixcel1d-Medipix3",
)

print(f"\n✅ MP Phase Search Complete!")
print(f"   Found {len(search_results_mp)} solution(s)")

if len(search_results_mp) > 0:
    print(f"   Best solution Rwp: {search_results_mp[0].refinement_result.lst_data.rwp:.2f}%")

🔍 Running phase search with MP database (Exp + Stable Theory)...

2025-10-30 11:18:55,756 INFO dara.search.tree Detecting peaks in the pattern.
2025-10-30 11:19:27,784 INFO dara.search.tree The wmax is automatically adjusted to 64.08.
2025-10-30 11:19:27,785 INFO dara.search.tree The intensity threshold is automatically set to 8.53 % of maximum peak intensity.
2025-10-30 11:19:27,785 INFO dara.search.tree Creating the root node.
2025-10-30 11:19:27,786 INFO dara.search.tree Refining all the phases in the dataset.
2025-10-30 11:25:33,555 INFO dara.search.tree The initial value of eps2 is automatically set to -0.0000

### 2.4 Compare Phase Search Results

In [18]:
# Create comparison table
search_comparison = pd.DataFrame([
    {
        'Database': 'COD',
        'Solutions Found': len(search_results_cod),
        'Best Rwp (%)': search_results_cod[0].refinement_result.lst_data.rwp if len(search_results_cod) > 0 else None,
        'Input Phases': len(final_cod_phases)
    },
    {
        'Database': 'ICSD',
        'Solutions Found': len(search_results_icsd),
        'Best Rwp (%)': search_results_icsd[0].refinement_result.lst_data.rwp if len(search_results_icsd) > 0 else None,
        'Input Phases': len(final_icsd_phases)
    },
    {
        'Database': 'MP (Exp + Theory)',
        'Solutions Found': len(search_results_mp),
        'Best Rwp (%)': search_results_mp[0].refinement_result.lst_data.rwp if len(search_results_mp) > 0 else None,
        'Input Phases': len(final_mp_phases)
    }
])

print("\n" + "="*70)
print("📊 Phase Search Results Comparison")
print("="*70)
print(search_comparison.to_string(index=False))
print("="*70)


📊 Phase Search Results Comparison
         Database  Solutions Found  Best Rwp (%)  Input Phases
              COD                1         49.43            61
             ICSD                2         44.96           138
MP (Exp + Theory)                1         56.71            71
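
A lower Rwp means a better fit between the calculated and measured patterns, so the table can be reduced to a single "best database" pick. The sketch below does this on the values printed above; `best_by_rwp` is a hypothetical helper that skips databases with no solution (`None`).

```python
def best_by_rwp(rwp_by_database):
    """Return the database name with the lowest Rwp, ignoring missing values."""
    scored = {name: rwp for name, rwp in rwp_by_database.items() if rwp is not None}
    if not scored:
        return None
    return min(scored, key=scored.get)


# Best Rwp (%) per database, copied from the comparison table above.
rwp_values = {"COD": 49.43, "ICSD": 44.96, "MP (Exp + Theory)": 56.71}
print(best_by_rwp(rwp_values))  # ICSD
```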


---

## Part 3: Visualize Results

Let's visualize and compare the refinement results from each database.

In [19]:
# Restore original working directory
os.chdir(original_cwd)
print(f"✅ Restored working directory to: {os.getcwd()}")

✅ Restored working directory to: c:\Users\kedargroup_ws01\Documents\Haiwen\Repos\dara\notebooks


### 3.1 COD Result Visualization

In [20]:
# Helper function to extract phase details for visualization
from pymatgen.core import Structure
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

def extract_phase_details(result, source="search"):
    """
    Extract detailed phase information from search or refinement results.
    
    Args:
        result: list of search results (source="search") or a single
            refinement result (source="refinement")
        source: "search" or "refinement"
    
    Returns:
        pandas DataFrame with phase details
    """
    phase_data = []
    
    if source == "search":
        # For search results, get phases from the best solution
        if len(result) == 0:
            return pd.DataFrame()
        
        best_solution = result[0]
        phase_list = [phases[0] for phases in best_solution.phases]  # Get first alternative
        phase_results = best_solution.refinement_result.lst_data.phases_results
    else:
        # For refinement results
        phase_list = [result.get_phase(name) for name in result.lst_data.phases_results.keys()]
        phase_results = result.lst_data.phases_results
    
    for phase in phase_list:
        try:
            # Load structure from CIF
            structure = Structure.from_file(str(phase.path))
            
            # Get phase name from results
            phase_name = phase.path.stem
            phase_result = phase_results.get(phase_name, {})
            
            # Get lattice parameters and symmetry information
            # (the crystal system comes from the symmetry analysis, not the lattice object)
            lattice = structure.lattice
            sg_symbol, sg_number = structure.get_space_group_info()
            crystal_system = SpacegroupAnalyzer(structure).get_crystal_system()
            
            phase_info = {
                'Phase Name': phase_name,
                'Composition': structure.composition.reduced_formula,
                'Space Group': sg_symbol,
                'SG Number': sg_number,
                'Crystal System': crystal_system,
                'a (Å)': f"{lattice.a:.4f}",
                'b (Å)': f"{lattice.b:.4f}",
                'c (Å)': f"{lattice.c:.4f}",
                'α (°)': f"{lattice.alpha:.2f}",
                'β (°)': f"{lattice.beta:.2f}",
                'γ (°)': f"{lattice.gamma:.2f}",
                'Weight %': f"{phase_result.get('weight_percent', 0):.2f}" if phase_result else "N/A"
            }
            phase_data.append(phase_info)
            
        except Exception as e:
            print(f"⚠️ Could not extract details for {phase.path.name}: {e}")
    
    return pd.DataFrame(phase_data)

In [21]:
if len(search_results_cod) > 0:
    print("📊 COD Phase Search Result:")
    print(f"   Rwp: {search_results_cod[0].refinement_result.lst_data.rwp:.2f}%\n")
    
    # Visualize
    search_results_cod[0].visualize()
    
    # Extract and display detailed phase information
    print("\n" + "="*80)
    print("Identified Phases - Detailed Information:")
    print("="*80)
    phase_details = extract_phase_details(search_results_cod, source="search")
    if not phase_details.empty:
        display(phase_details)
    else:
        print("⚠️ Could not extract phase details")
else:
    print("❌ No solution found with COD database")

📊 COD Phase Search Result:
   Rwp: 49.43%


Identified Phases - Detailed Information:
⚠️ Could not extract details for 1539614.cif: [Errno 2] No such file or directory: 'cod_cifs\\cif\\1\\53\\96\\1539614.cif'
⚠️ Could not extract phase details
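
The `[Errno 2]` messages above show relative paths such as `cod_cifs\cif\1\53\96\1539614.cif`, so one plausible cause is that the working directory at this point is not the directory those paths are relative to. A minimal sketch for checking this before extracting details follows. `find_missing_cifs` is a hypothetical helper, not part of DARA, and it assumes the paths should be resolved against a base directory that you pass in:

```python
from pathlib import Path


def find_missing_cifs(cif_paths, base_dir):
    """Return the CIF paths that do not exist once resolved against base_dir."""
    base = Path(base_dir)
    missing = []
    for cif_path in cif_paths:
        cif_path = Path(cif_path)
        # Absolute paths are checked as-is; relative ones are joined to base_dir.
        resolved = cif_path if cif_path.is_absolute() else base / cif_path
        if not resolved.is_file():
            missing.append(resolved)
    return missing
```

In this notebook it could be called as `find_missing_cifs([phases[0].path for phases in search_results_cod[0].phases], repo_root)`. An empty list would mean every CIF is reachable from that directory.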


### 3.2 ICSD Result Visualization

In [22]:
if len(search_results_icsd) > 0:
    print("📊 ICSD Phase Search Result:")
    print(f"   Rwp: {search_results_icsd[0].refinement_result.lst_data.rwp:.2f}%\n")
    
    # Visualize
    search_results_icsd[0].visualize()
    
    # Extract and display detailed phase information
    print("\n" + "="*80)
    print("Identified Phases - Detailed Information:")
    print("="*80)
    phase_details = extract_phase_details(search_results_icsd, source="search")
    if not phase_details.empty:
        display(phase_details)
    else:
        print("⚠️ Could not extract phase details")
else:
    print("❌ No solution found with ICSD database")

📊 ICSD Phase Search Result:
   Rwp: 44.96%


Identified Phases - Detailed Information:
⚠️ Could not extract details for 67035.cif: [Errno 2] No such file or directory: 'icsd_cifs\\cif\\0\\06\\70\\67035.cif'
⚠️ Could not extract details for 111561.cif: [Errno 2] No such file or directory: 'icsd_cifs\\cif\\0\\11\\15\\111561.cif'
⚠️ Could not extract phase details


### 3.3 MP Result Visualization

In [23]:
if len(search_results_mp) > 0:
    print("📊 MP Phase Search Result (Exp + Stable Theory):")
    print(f"   Rwp: {search_results_mp[0].refinement_result.lst_data.rwp:.2f}%\n")
    
    # Visualize
    search_results_mp[0].visualize()
    
    # Extract and display detailed phase information
    print("\n" + "="*80)
    print("Identified Phases - Detailed Information:")
    print("="*80)
    phase_details = extract_phase_details(search_results_mp, source="search")
    if not phase_details.empty:
        display(phase_details)
    else:
        print("⚠️ Could not extract phase details")
else:
    print("❌ No solution found with MP database")

📊 MP Phase Search Result (Exp + Stable Theory):
   Rwp: 56.71%


Identified Phases - Detailed Information:
⚠️ Could not extract details for mp-5909.cif: [Errno 2] No such file or directory: 'mp_cifs\\5\\59\\mp-5909.cif'
⚠️ Could not extract phase details


---

## Part 4: Advanced Refinement

Now let's perform more detailed refinement with customized parameters using phases from each database.

In [24]:
# Change working directory back to repository root for refinement
# (Refinement also needs access to CIF files with relative paths)
os.chdir(repo_root)
print(f"✅ Changed working directory to: {os.getcwd()} (for refinement)")

✅ Changed working directory to: c:\Users\kedargroup_ws01\Documents\Haiwen\Repos\dara (for refinement)
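
Because the directory change above has to be undone later with a second `os.chdir`, an exception in between would leave the notebook in the wrong directory. A minimal sketch of a context manager that restores the previous directory automatically follows. `working_directory` is a hypothetical helper name, not part of DARA:

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def working_directory(path):
    """Temporarily switch the working directory, restoring it even on error."""
    previous = Path.cwd()
    os.chdir(path)
    try:
        yield Path.cwd()
    finally:
        # Runs on normal exit and when the body raises.
        os.chdir(previous)
```

A refinement cell could then be wrapped as `with working_directory(repo_root): ...`. On Python 3.11 and later, the standard library's `contextlib.chdir` offers the same behavior.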


### 4.1 Extract Best Phases from Search Results

In [25]:
# Function to extract best phase CIFs from search results
def get_best_phases(search_result):
    """Extract CIF paths from the best search solution"""
    if len(search_result) == 0:
        return []
    
    best_solution = search_result[0]
    # Get the first alternative for each phase
    cif_paths = [phases[0].path for phases in best_solution.phases]
    return cif_paths

# Extract best phases from each database
best_cod_phases = get_best_phases(search_results_cod)
best_icsd_phases = get_best_phases(search_results_icsd)
best_mp_phases = get_best_phases(search_results_mp)

print("Best phases extracted:")
print(f"  COD: {len(best_cod_phases)} phases")
print(f"  ICSD: {len(best_icsd_phases)} phases")
print(f"  MP: {len(best_mp_phases)} phases")

Best phases extracted:
  COD: 1 phases
  ICSD: 2 phases
  MP: 1 phases


### 4.2 Refinement with Customized Parameters (COD)

In [26]:
if len(best_cod_phases) > 0:
    print("🔬 Running advanced refinement with COD phases...\n")
    
    # Define refinement parameters
    phase_params = {
        "lattice_range": 0.05,  # 5% lattice variation
        "b1": "0_0^0.005",      # Particle size parameter
        "k1": "0_0^1",          # Size distribution
        "k2": "fixed",          # Microstrain (fixed)
        "gewicht": "SPHAR2"    # Preferred orientation
    }
    
    refinement_cod = do_refinement_no_saving(
        pattern_path,
        best_cod_phases,
        phase_params=phase_params
    )
    
    print(f"\n✅ COD Refinement Complete!")
    print(f"   Rwp: {refinement_cod.lst_data.rwp:.2f}%")
    
    # Visualize
    refinement_cod.visualize()
    
    # Extract and display detailed phase information
    print("\n" + "="*80)
    print("Refined Phases - Detailed Information:")
    print("="*80)
    phase_details = extract_phase_details(refinement_cod, source="refinement")
    if not phase_details.empty:
        display(phase_details)
    else:
        print("⚠️ Could not extract phase details")
else:
    print("⚠️ No COD phases available for refinement")

🔬 Running advanced refinement with COD phases...


✅ COD Refinement Complete!
   Rwp: 44.89%

Refined Phases - Detailed Information:


AttributeError: 'RefinementResult' object has no attribute 'get_phase'
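
The `b1` and `k1` strings in the `phase_params` dictionary of section 4.2 appear to follow the BGMN `start_lower^upper` notation, so `"0_0^0.005"` would mean a starting value of 0 refined between 0 and 0.005. A small sketch that makes the three numbers explicit follows. `parse_range` is a hypothetical helper for reading the notation, not a DARA function, and it treats anything without both separators (such as `"fixed"` or `"SPHAR2"`) as a non-range value:

```python
def parse_range(spec):
    """Split 'start_lower^upper' into three floats; return None for non-range values."""
    if "_" not in spec or "^" not in spec:
        return None
    # Everything before the first underscore is the starting value.
    start, bounds = spec.split("_", 1)
    lower, upper = bounds.split("^", 1)
    return float(start), float(lower), float(upper)


for name, spec in {"b1": "0_0^0.005", "k1": "0_0^1", "k2": "fixed"}.items():
    print(name, parse_range(spec))
```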

### 4.3 Refinement with Customized Parameters (ICSD)

In [None]:
if len(best_icsd_phases) > 0:
    print("🔬 Running advanced refinement with ICSD phases...\n")
    
    phase_params = {
        "lattice_range": 0.05,
        "b1": "0_0^0.005",
        "k1": "0_0^1",
        "k2": "fixed",
        "gewicht": "SPHAR2"
    }
    
    refinement_icsd = do_refinement_no_saving(
        pattern_path,
        best_icsd_phases,
        phase_params=phase_params
    )
    
    print(f"\n✅ ICSD Refinement Complete!")
    print(f"   Rwp: {refinement_icsd.lst_data.rwp:.2f}%")
    
    # Visualize
    refinement_icsd.visualize()
    
    # Extract and display detailed phase information
    print("\n" + "="*80)
    print("Refined Phases - Detailed Information:")
    print("="*80)
    phase_details = extract_phase_details(refinement_icsd, source="refinement")
    if not phase_details.empty:
        display(phase_details)
    else:
        print("⚠️ Could not extract phase details")
else:
    print("⚠️ No ICSD phases available for refinement")

🔬 Running advanced refinement with ICSD phases...


✅ ICSD Refinement Complete!
   Rwp: 40.84%


### 4.4 Refinement with Customized Parameters (MP)

In [None]:
if len(best_mp_phases) > 0:
    print("🔬 Running advanced refinement with MP phases...\n")
    
    phase_params = {
        "lattice_range": 0.05,
        "b1": "0_0^0.005",
        "k1": "0_0^1",
        "k2": "fixed",
        "gewicht": "SPHAR2"
    }
    
    refinement_mp = do_refinement_no_saving(
        pattern_path,
        best_mp_phases,
        phase_params=phase_params
    )
    
    print(f"\n✅ MP Refinement Complete!")
    print(f"   Rwp: {refinement_mp.lst_data.rwp:.2f}%")
    
    # Visualize
    refinement_mp.visualize()
    
    # Extract and display detailed phase information
    print("\n" + "="*80)
    print("Refined Phases - Detailed Information:")
    print("="*80)
    phase_details = extract_phase_details(refinement_mp, source="refinement")
    if not phase_details.empty:
        display(phase_details)
    else:
        print("⚠️ Could not extract phase details")
else:
    print("⚠️ No MP phases available for refinement")

🔬 Running advanced refinement with MP phases...


✅ MP Refinement Complete!
   Rwp: 52.64%
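
Sections 4.2 to 4.4 run the same refinement with the same `phase_params` and differ only in the list of CIF paths. A sketch of how the three cells could be collapsed into one loop follows. `refine_all` is a hypothetical helper, and `refine_fn` stands in for `do_refinement_no_saving`, passed as an argument so the sketch stays self-contained:

```python
def refine_all(pattern_path, phases_by_db, refine_fn, phase_params):
    """Run refine_fn once per database that has phases; skip empty ones."""
    results = {}
    for db_name, cif_paths in phases_by_db.items():
        if not cif_paths:
            print(f"No {db_name} phases available for refinement")
            continue
        results[db_name] = refine_fn(
            pattern_path, cif_paths, phase_params=phase_params
        )
    return results


# Usage in this notebook (assumes the variables defined in the cells above):
# refinements = refine_all(
#     pattern_path,
#     {"COD": best_cod_phases, "ICSD": best_icsd_phases, "MP": best_mp_phases},
#     do_refinement_no_saving,
#     phase_params,
# )
```

Keeping the results in one dictionary also means a later comparison step can check `"COD" in refinements` instead of relying on separately named variables.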


---

## Part 5: Final Comparison and Analysis

In [None]:
# Restore original working directory after refinement
os.chdir(original_cwd)
print(f"✅ Restored working directory to: {os.getcwd()}")

✅ Restored working directory to: c:\Users\kedargroup_ws01\Documents\Haiwen\Repos\dara\notebooks


In [None]:
# Create final comparison table
final_comparison = []

if len(search_results_cod) > 0:
    final_comparison.append({
        'Database': 'COD',
        'Phase Search Rwp (%)': search_results_cod[0].refinement_result.lst_data.rwp,
        'Advanced Refinement Rwp (%)': refinement_cod.lst_data.rwp if len(best_cod_phases) > 0 else None,
        'Phases Used': len(best_cod_phases)
    })

if len(search_results_icsd) > 0:
    final_comparison.append({
        'Database': 'ICSD',
        'Phase Search Rwp (%)': search_results_icsd[0].refinement_result.lst_data.rwp,
        'Advanced Refinement Rwp (%)': refinement_icsd.lst_data.rwp if len(best_icsd_phases) > 0 else None,
        'Phases Used': len(best_icsd_phases)
    })

if len(search_results_mp) > 0:
    final_comparison.append({
        'Database': 'MP (Exp + Theory)',
        'Phase Search Rwp (%)': search_results_mp[0].refinement_result.lst_data.rwp,
        'Advanced Refinement Rwp (%)': refinement_mp.lst_data.rwp if len(best_mp_phases) > 0 else None,
        'Phases Used': len(best_mp_phases)
    })

if len(final_comparison) > 0:
    final_df = pd.DataFrame(final_comparison)
    print("\n" + "="*80)
    print("📊 Final Refinement Comparison")
    print("="*80)
    print(final_df.to_string(index=False))
    print("="*80)
else:
    print("⚠️ No results available for comparison")


📊 Final Refinement Comparison
         Database  Phase Search Rwp (%)  Advanced Refinement Rwp (%)  Phases Used
              COD                 49.43                        44.92            1
             ICSD                 44.96                        40.84            2
MP (Exp + Theory)                 56.71                        52.64            1
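
To read the table above as "how much did the customized refinement help", the Rwp reduction can be computed per database. A small sketch using the values printed above follows. `rwp_gain` is a hypothetical helper, and it returns `None` when no refined value is available, mirroring the `None` fallback in the comparison cell:

```python
# (database, phase-search Rwp %, advanced-refinement Rwp %) copied from the table above
rows = [
    ("COD", 49.43, 44.92),
    ("ICSD", 44.96, 40.84),
    ("MP (Exp + Theory)", 56.71, 52.64),
]


def rwp_gain(search_rwp, refined_rwp):
    """Return (absolute drop, relative drop in %) or None if refinement is missing."""
    if refined_rwp is None:
        return None
    drop = search_rwp - refined_rwp
    return round(drop, 2), round(100 * drop / search_rwp, 1)


gains = {name: rwp_gain(search, refined) for name, search, refined in rows}
# Database with the lowest refined Rwp among those that have one
best = min((row for row in rows if row[2] is not None), key=lambda row: row[2])[0]

for name, gain in gains.items():
    print(name, gain)
print("Lowest refined Rwp:", best)
```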


---

## Summary and Conclusions

### Key Takeaways:

1. **All Three Databases Use Local Indexes** 🚀:
   - **COD**: 501,975 structures, local index, experimental only
   - **ICSD**: 229,487 structures, local index, curated experimental data
   - **MP**: 169,385 structures, local index, **35% experimental + 65% theoretical**

2. **MP Unique Advantages**:
   - ✅ 100% spacegroup and CIF coverage
   - ✅ Experimental/theoretical classification
   - ✅ Thermodynamic stability filtering (energy_above_hull)
   - ✅ Fast local queries (< 2 seconds)
   - ✅ Includes DFT-computed metastable phases for material discovery

3. **When to Use Each Database**:
   - **COD**: Largest experimental database (501K structures)
   - **ICSD**: High-quality curated experimental data (229K structures)
   - **MP (Exp only)**: Validate against experimental structures (60K)
   - **MP (Exp + Theory)**: Discover new materials, include metastable phases (169K)

4. **Best Practices**:
   - ✅ Use local indexes for all three databases (no internet needed!)
   - ✅ Use `max_e_above_hull ≤ 0.1` (eV/atom) for stable/metastable phases (MP only)
   - ✅ Combine experimental (ICSD/COD) and theoretical (MP) for comprehensive analysis
   - ✅ Filter by experimental status when needed (MP)
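
The two MP-specific filters in the best practices above (stability cutoff and experimental status) can be sketched on plain records. The field names `energy_above_hull` and `is_experimental` and the helper `filter_mp_entries` are assumptions for illustration; the actual column names in `mp_index.parquet` may differ:

```python
def filter_mp_entries(entries, max_e_above_hull=0.1, experimental_only=False):
    """Keep entries within the hull-energy cutoff (eV/atom), optionally experimental only."""
    kept = []
    for entry in entries:
        if experimental_only and not entry["is_experimental"]:
            continue
        if entry["energy_above_hull"] > max_e_above_hull:
            continue
        kept.append(entry)
    return kept


# Illustrative records only; the values are made up and not taken from the MP index.
sample = [
    {"id": "entry-1", "energy_above_hull": 0.00, "is_experimental": True},
    {"id": "entry-2", "energy_above_hull": 0.05, "is_experimental": False},
    {"id": "entry-3", "energy_above_hull": 0.30, "is_experimental": False},
]

stable_or_metastable = filter_mp_entries(sample)
experimental_subset = filter_mp_entries(sample, experimental_only=True)
```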

### Next Steps:

- Explore other chemical systems (e.g., Li-Mn-O for battery materials)
- Use stability filtering to discover novel metastable phases
- Compare theoretical predictions with experimental results
- Export refined structures for further analysis

---

## Bonus: Export Refined Structures

You can export the refined structures from any of the refinements.

In [None]:
# Example: Export refined structure from MP refinement
if len(best_mp_phases) > 0 and 'refinement_mp' in locals():
    # Get phase names
    phase_names = list(refinement_mp.lst_data.phases_results.keys())
    
    print("📦 Exporting refined structures from MP refinement...\n")
    
    for phase_name in phase_names:
        try:
            structure = refinement_mp.export_structure(phase_name)
            output_file = f"refined_{phase_name}.cif"
            structure.to(filename=output_file, symprec=1e-3)
            print(f"✅ Exported: {output_file}")
        except Exception as e:
            print(f"⚠️ Could not export {phase_name}: {e}")
else:
    print("⚠️ No MP refinement results available for export")

---

**Tutorial Complete!** 🎉

You've learned how to:
- Use the new Materials Project database integration
- Filter phases by experimental/theoretical status
- Filter by thermodynamic stability
- Compare results from COD, ICSD, and MP databases
- Perform phase search and refinement with all three databases

For more information, see:
- `scripts/README.md` - Complete database documentation
- `CHANGELOG.md` - Version history and new features
- `README.md` - Project overview

---

## 🧹 Cleanup (Optional)

The tutorial created temporary CIF files in your Documents folder. You can delete them if needed:

In [None]:
# Check the size of temporary files
import shutil

if work_dir.exists():
    total_size = sum(f.stat().st_size for f in work_dir.rglob('*') if f.is_file())
    total_files = sum(1 for f in work_dir.rglob('*') if f.is_file())
    
    print(f"📊 Temporary files statistics:")
    print(f"   Location: {work_dir}")
    print(f"   Total files: {total_files}")
    print(f"   Total size: {total_size / 1024 / 1024:.2f} MB")
    print(f"\n🗑️  To delete, uncomment and run the next cell")
else:
    print("No temporary directory found")

In [None]:
# Uncomment to delete all temporary CIF files
# import shutil
# if work_dir.exists():
#     shutil.rmtree(work_dir)
#     print(f"✅ Deleted temporary directory: {work_dir}")
# else:
#     print("No temporary directory to delete")