# Creating Archive-Ready Metadata
The raw data is split into a few different files:
- [A mapping of tests to filenames](./raw-data/Summary_of_CAMP_Cells.xlsx)
- [A mapping of tests to battery design](./raw-data/Summary_of_builds.xlsx)
- The actual raw data from the machines in MACCOR format
- [A list of the cells used by Paulson et al.](./raw-data/known-cells.csv)
This notebook combines them into a single HDF5 file for each cell.

In [1]:
from battdat.io.maccor import MACCORReader
from battdat.schemas.battery import ElectrodeDescription, ElectrolyteDescription, BatteryDescription
from battdat.schemas import BatteryMetadata
from battdat.data import BatteryDataset
from concurrent.futures import ProcessPoolExecutor, as_completed
from datetime import datetime
from shutil import rmtree
from tqdm.auto import tqdm
from pathlib import Path
import pandas as pd
import numpy as np

Configuration

In [2]:
data_path = Path('./data/raw-data/CAMP_data/')
h5_path = Path('./data/hdf5')

In [3]:
h5_path.mkdir(exist_ok=True)

## Load in the Mapping Spreadsheets
These spreadsheets describe the contents of our MACCOR files.

In [4]:
test_descriptions = pd.read_excel('data/raw-data/Summary_of_CAMP_Cells.xlsx')
test_descriptions.head(2)

Unnamed: 0.1,Unnamed: 0,File Name,Owner,Batch,Cell Number,Cell Test,Start Time,Initial Cycle Number,Last Cycle,Test Time,Max Capacity (Ah),Max Energy,Max Current (A),Min Voltage,Max Voltage,Date of Test,Path,File Comments,Procedure,Number of Cycles in file
0,0,ARGONNE #20_SET-LN3024-104-1a.001,SET,LN3024_104,1,1a,03/31/2016 16:05:31,0.0,0.0,1.1667,0.0,0.0,0.0,3.305715,3.306783,\t03/31/2016\t,\tC:\Data\MIMS\Backup\ARGONNE #20\SET-LN3024-1...,SET-LN3024-104 Targray NCM811 [LN2086-32-4] ...,ABRHV-NCM523-Form-4p1.000NCM 523 Formation T...,0.0
1,1,ARGONNE #20_SET-LN3024-104-1aa.001,SET,LN3024_104,1,1aa,03/31/2016 16:07:53,0.0,3.0,4942.6788,0.003038,0.01179,0.000242,2.999924,4.300908,\t03/31/2016\t,\tC:\Data\MIMS\Backup\ARGONNE #20\SET-LN3024-1...,SET-LN3024-104 Targray NCM811 [LN2086-32-4] ...,ABRHV-NCM523-Form-4p3.000NCM 523 Formation T...,3.0


In [5]:
cell_descriptions = pd.read_excel('data/raw-data/Summary_of_builds.xlsx')
cell_descriptions.head(2)

Unnamed: 0,build,anode,cathode,cathode_full_name,cathode_supplier,anode_full_name,anode_supplier,electrolyte,electrolyte_additive,total_cathode_area (cm2),...,number_interfaces,target_capacity (Ah),anode_thickness (um),anode_loading (mg/cm2),anode_porosity,cathode_thickness (um),cathode_loading (mg/cm2),cathode_porosity,temperature (C),Notes
0,B1,C,HE5050,HE5050,Commercial,Graphite,Commercial,Gen 2,NONE,,...,,0.375,86.0,5.75,35.0,68.0,14.5,42.0,30.0,
1,B1A,C,HE5050,HE5050,Commercial,Graphite,Commercial,Gen 2,NONE,,...,,0.375,86.0,5.75,35.0,68.0,14.5,42.0,30.0,


## Get the Cells from the Paper
Get the cells listed in a table from Noah Paulson that describes the source of the data in Table S1 of [our paper](https://www.sciencedirect.com/science/article/pii/S0378775322001495#appsec1).

In [6]:
known_cells = pd.read_csv('data/raw-data/known-cells.csv')
print(f'Loaded {len(known_cells)} known cells')

Loaded 300 known cells


Go from filename to batch name and cell number

In [7]:
known_cells['cell_number'] = known_cells['cellname'].str.split("_").apply(lambda x: x[2])
known_cells['batch'] = known_cells['cellname'].str.split("_").apply(lambda x: x[1])
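
As a minimal sketch of the same parsing step, the snippet below splits each name once with `expand=True` and picks out the second and third fields. The cell names are hypothetical; the real file is only assumed to follow a `<prefix>_<batch>_<cell number>` pattern, as the indices used above imply.

```python
import pandas as pd

# Hypothetical cell names following the assumed "<prefix>_<batch>_<cell number>" pattern
known = pd.DataFrame({'cellname': ['CAMP_B1A_4', 'CAMP_B12A_1']})

# Split once and reuse the pieces instead of splitting separately for each column
parts = known['cellname'].str.split('_', expand=True)
known['batch'] = parts[1]
known['cell_number'] = parts[2]  # Note: these values are strings, not integers
```

Because the parsed cell numbers are strings, any later comparison against a numeric column would need the types aligned first.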

## Filter down to best-documented cells
Keep only the tests whose "Batch" appears in the cell descriptions.

In [8]:
is_documented = test_descriptions['Batch'].isin(cell_descriptions['build'])  # True where the batch has a known build

In [9]:
print(f'Found descriptions for {is_documented.sum()}/{len(is_documented)} tests')

Found descriptions for 3544/8618 tests


In [10]:
test_descriptions = test_descriptions[is_documented]
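
The membership filter can be illustrated on a toy table. This is a sketch with made-up rows (only the column names `Batch` and `build` come from the spreadsheets above), using `Series.isin` to build the boolean mask.

```python
import pandas as pd

# Toy stand-ins for the two spreadsheets; the rows are invented for illustration
tests = pd.DataFrame({
    'Batch': ['B1', 'LN3024_104', 'B1A'],
    'File Name': ['a.001', 'b.001', 'c.001'],
})
builds = pd.DataFrame({'build': ['B1', 'B1A']})

# True for every test whose batch has a matching build description
is_documented = tests['Batch'].isin(builds['build'])
documented = tests[is_documented]
```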

In [11]:
print(f'There is a total of {len(test_descriptions[["Batch", "Cell Number"]].value_counts())} unique cells')

There is a total of 645 unique cells


## Remove Formation Tests
Get tests that do not have "formation" in the procedure

In [12]:
is_formation = test_descriptions['Procedure'].str.lower().str.contains('formation')

In [13]:
print(f'Found {is_formation.sum()}/{len(is_formation)} formation experiments')

Found 667/3544 formation experiments


In [14]:
test_descriptions = test_descriptions[~is_formation]
print(f'There is a total of {len(test_descriptions[["Batch", "Cell Number"]].value_counts())} unique cells')

There is a total of 602 unique cells
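
A small sketch of the same text filter follows, with invented procedure names. It assumes some entries could be missing and therefore passes `na=False` so that a missing procedure counts as "not formation" rather than propagating a missing value into the mask.

```python
import pandas as pd

# Invented procedure descriptions, including one missing entry
procedures = pd.Series(['NCM 523 Formation Test', 'Cycle Life Aging', None, 'FORMATION 4p1'])

# Case-insensitive substring match; missing values are treated as non-matches
is_formation = procedures.str.contains('formation', case=False, na=False)
kept = procedures[~is_formation]
```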


## Create a Function to Document Cell
Build battdat-compliant metadata for a test from the information in the "test descriptions" and "cell descriptions" spreadsheets.
The new format contains the same information, mapped to community-agreed names for each concept.

First get an example record

In [15]:
record = test_descriptions.iloc[0]
record

Unnamed: 0                                                                205
File Name                                         ARGONNE_10_CFF-B12A-P1b.008
Owner                                                                     CFF
Batch                                                                    B12A
Cell Number                                                                 1
Cell Test                                                                 P1b
Start Time                                                           10:51:04
Initial Cycle Number                                                      0.0
Last Cycle                                                               12.0
Test Time                                                              50.202
Max Capacity (Ah)                                                         0.0
Max Energy                                                                0.0
Max Current (A)                                                 

Look up the cell metadata

In [16]:
cell_metadata = cell_descriptions.query(f'build == "{record["Batch"]}"').iloc[0]
cell_metadata

build                             B12A
anode                                C
cathode                         HE5050
cathode_full_name               HE5050
cathode_supplier            Commercial
anode_full_name               Graphite
anode_supplier              Commercial
electrolyte                      Gen 2
electrolyte_additive              NONE
total_cathode_area (cm2)         169.2
number_layers                     13.0
number_interfaces                 12.0
target_capacity (Ah)               0.3
anode_thickness (um)              86.0
anode_loading (mg/cm2)            5.75
anode_porosity                    35.0
cathode_thickness (um)            63.0
cathode_loading (mg/cm2)          14.3
cathode_porosity                  38.0
temperature (C)                   30.0
Notes                              NaN
Name: 21, dtype: object

We just need to rearrange this data into the structure provided by `battdat`.

In [17]:
cathode_metadata = ElectrodeDescription(
    name=cell_metadata['cathode'],
    supplier=cell_metadata['cathode_supplier'],
    thickness=cell_metadata['cathode_thickness (um)'],
    area=cell_metadata['total_cathode_area (cm2)'],
    loading=cell_metadata['cathode_loading (mg/cm2)'],
    porosity=cell_metadata['cathode_porosity']
)
print(cathode_metadata.model_dump_json(indent=2))

{
  "name": "HE5050",
  "supplier": "Commercial",
  "product": null,
  "thickness": 63.0,
  "area": 169.2,
  "loading": 14.3,
  "porosity": 38.0
}


We put all of this into a single function for convenience

In [18]:
def describe_cell(test_record: pd.Series) -> BatteryDescription:
    """Create a single metadata record
    
    Args:
        test_record: Record for a certain test
    Returns:
        Formatted metadata for the battery
    """
    
    # Match cell description
    matches = cell_descriptions.query(f'build == "{test_record["Batch"]}"')
    if len(matches) > 1: 
        print(f'WARNING: Found {len(matches)} descriptions for build="{test_record["Batch"]}". Picking the first')
    cell_metadata = matches.iloc[0].to_dict()
    
    # Replace NaNs with None so that we know they are missing
    cell_metadata = dict((k, None if isinstance(v, float) and np.isnan(v) else v) for k, v in cell_metadata.items())

    # Describe the electrodes
    cathode_metadata = None
    anode_metadata = None
    if cell_metadata['cathode'] != "unknown":
        cathode_metadata = ElectrodeDescription(
            name=cell_metadata['cathode'],
            supplier=cell_metadata['cathode_supplier'],
            thickness=cell_metadata['cathode_thickness (um)'],
            area=cell_metadata['total_cathode_area (cm2)'],
            loading=cell_metadata['cathode_loading (mg/cm2)'],
            porosity=cell_metadata['cathode_porosity']
        )
    if cell_metadata['anode'] != "unknown":
        anode_metadata = ElectrodeDescription(
            name=cell_metadata['anode'],
            supplier=cell_metadata['anode_supplier'],
            product=cell_metadata['anode_full_name'],
            thickness=cell_metadata['anode_thickness (um)'],
            loading=cell_metadata['anode_loading (mg/cm2)'],
            porosity=cell_metadata['anode_porosity']
        )

    # Get the electrolyte information
    additives = cell_metadata['electrolyte_additive']
    # A missing entry (None) is treated the same as an explicit 'NONE'
    additives = [] if additives in (None, 'NONE') else [{'name': x.strip()} for x in additives.split(",")]
    electrolyte = ElectrolyteDescription(
        name=cell_metadata['electrolyte'],
        additives=additives
    )
    
    # Combine to form a cell description
    battery = BatteryDescription(
        anode=anode_metadata,
        cathode=cathode_metadata,
        electrolyte=electrolyte,
    )
    if cell_metadata['target_capacity (Ah)'] != "unknown":
        battery.nominal_capacity = cell_metadata['target_capacity (Ah)']
    if cell_metadata['number_layers'] != "unknown":
        value = cell_metadata['number_layers']
        battery.layer_count = int(value) if value is not None else value
    return battery

describe_cell(record).model_dump(exclude_unset=True)

{'layer_count': 13,
 'anode': {'name': 'C',
  'supplier': 'Commercial',
  'product': 'Graphite',
  'thickness': 86.0,
  'loading': 5.75,
  'porosity': 35.0},
 'cathode': {'name': 'HE5050',
  'supplier': 'Commercial',
  'thickness': 63.0,
  'area': 169.2,
  'loading': 14.3,
  'porosity': 38.0},
 'electrolyte': {'name': 'Gen 2', 'additives': []},
 'nominal_capacity': 0.3}
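
Two of the cleaning steps inside `describe_cell` can be pulled out and tried in isolation. The helpers below are a sketch, not part of `battdat`: `nan_to_none` and `parse_additives` are hypothetical names, the additive names in the usage lines are invented, and treating `None` or any capitalization of "NONE" as "no additives" is an assumption that goes slightly beyond the exact comparison used above.

```python
import math


def nan_to_none(record: dict) -> dict:
    """Replace float NaN values with None so missing fields are explicit"""
    return {k: None if isinstance(v, float) and math.isnan(v) else v for k, v in record.items()}


def parse_additives(value) -> list[dict]:
    """Turn a comma-separated additive string into a list of {'name': ...} records"""
    # Assumption: a missing value or the word "NONE" both mean no additives
    if value is None or value.strip().upper() == 'NONE':
        return []
    return [{'name': part.strip()} for part in value.split(',') if part.strip()]


cleaned = nan_to_none({'Notes': float('nan'), 'target_capacity (Ah)': 0.3, 'electrolyte': 'Gen 2'})
two_additives = parse_additives('VC, FEC')  # Invented additive names
```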

## Load in an Example Test
Tests are stored in MACCOR format. Let's load one in to see how the data looks.


In [19]:
reader = MACCORReader(ignore_time=True)  # The datetime column is a problem in a few files

In [20]:
data = reader.read_file(data_path / record['File Name'])
data.head(2)

Unnamed: 0,cycle_number,file_number,test_time,state,current,voltage,step_index,method,substep_index
0,0,0,0.0,ChargingState.unknown,0.0,0.073548,0,ControlMethod.constant_current,0
1,0,0,0.162,ChargingState.unknown,0.0,0.0,0,ControlMethod.constant_current,0


## Process all known cells
Loop through everything and save it into HDF5 format

In [21]:
for path in h5_path.iterdir():
    if path.is_dir():
        rmtree(path)

Make a function we can run in parallel

In [22]:
def process_cell(extractor: MACCORReader, files: list[str], metadata: BatteryMetadata, path: str):
    """Convert then save to HDF5 format"""

    # Read every test file for the cell into a single dataset
    data = extractor.read_dataset(files, metadata=metadata)

    # Store floats as 32-bit to reduce the file size
    for col in data.raw_data.columns:
        if data.raw_data[col].dtype == np.float64:
            data.raw_data[col] = data.raw_data[col].astype(np.float32)

    # Validate the dataset, then write it with the highest compression level
    data.validate()
    data.to_hdf(path, complevel=9)
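
The downcasting step can be sketched on a plain DataFrame. `downcast_floats` is a hypothetical helper, and the values are made up; it selects the 64-bit float columns with `select_dtypes` and converts only those, leaving integer columns unchanged.

```python
import numpy as np
import pandas as pd


def downcast_floats(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the frame with float64 columns stored as float32"""
    float_cols = frame.select_dtypes(include='float64').columns
    return frame.astype({col: np.float32 for col in float_cols})


# Made-up measurements: one float column and one integer column
raw = pd.DataFrame({'voltage': [3.305715, 3.306783], 'cycle_number': [0, 1]})
small = downcast_floats(raw)
```

Single precision keeps roughly seven significant digits, which is a trade-off worth checking against the resolution of the instrument before applying it to new data.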

Parse the test date

In [23]:
test_descriptions['start_date'] = test_descriptions['Date of Test'].apply(lambda x: datetime.strptime(x.strip(), '%m/%d/%Y'))
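
An equivalent vectorized form, shown here as a sketch on invented values, strips the tab padding seen in the "Date of Test" column and parses the whole column at once with `pd.to_datetime`.

```python
import pandas as pd

# Invented dates padded with tabs, mimicking the "Date of Test" column
dates = pd.Series(['\t04/02/2016\t', '\t03/31/2016\t'])

# Strip the whitespace, then parse with an explicit month/day/year format
parsed = pd.to_datetime(dates.str.strip(), format='%m/%d/%Y')
ordered = parsed.sort_values()
```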

Run for all cells

In [24]:
refined = 0
futures = []
failures = []
known_cells['matched'] = False
success_count = 0
for (batch_id, cell_id), group in tqdm(test_descriptions.groupby(['Batch', 'Cell Number'])):
    group = group.sort_values('start_date')
    
    # Get the metadata for the cell 
    cell_name = f'batch_{batch_id}_cell_{cell_id}'
    cell_metadata = describe_cell(group.iloc[0])
    
    # Determine if this is in the "refined" set
    known_matches = np.logical_and(known_cells.batch == batch_id, known_cells.cell_number == cell_id)
    known_cells.loc[known_matches, 'matched'] = True
    is_refined = known_matches.any()
    if is_refined:
        refined += 1

    # Assemble the metadata for everything else
    metadata = BatteryMetadata(
        name=f'CAMP_{cell_name}',
        battery=cell_metadata,
        dataset_name='camp_2023',
        authors=[
            ['Logan', 'Ward'],
            ['Joseph', 'Kubal'],
            ['Susan J.', 'Babinec'],
            ['Wenquan', 'Lu'],
            ['Allison', 'Dunlop'],
            ['Steve', 'Trask'],
            ['Andrew', 'Jansen'],
            ['Noah H.', 'Paulson'],
        ],
        associated_ids=['https://doi.org/10.1016/j.jpowsour.2022.231127']
    )

    # Get the test results
    files = group['File Name'].apply(lambda x: data_path / x).tolist()

    # Determine the path to save the file
    name = f'{cell_name}.h5'
    sub_dir = 'refined' if is_refined else 'other'
    (h5_path / sub_dir).mkdir(exist_ok=True)
    save_path = h5_path / sub_dir / name

    # Convert and save the cell, recording the reason for any failure
    try:
        process_cell(reader, files, metadata, save_path)
    except Exception as exc:
        record = group.iloc[0].to_dict()
        record['reason'] = str(exc)
        failures.append(record)
    else:
        success_count += 1
failures = pd.DataFrame(failures)
print(f'Succeeded in parsing {success_count} cells')

Succeeded in parsing 602 cells
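
The loop above converts the cells one at a time. If the conversion were submitted to an executor instead, collecting results with `as_completed` might look like the sketch below. `run_jobs` and `fake_convert` are hypothetical names, and `fake_convert` merely stands in for `process_cell`. A thread pool is used so the sketch runs anywhere; a `ProcessPoolExecutor` could be passed in its place, provided the worker and its arguments can be pickled.

```python
from concurrent.futures import Executor, ThreadPoolExecutor, as_completed
from typing import Callable


def run_jobs(executor: Executor, worker: Callable, jobs: dict[str, tuple]) -> tuple[list[str], list[dict]]:
    """Submit every job, then sort the outcomes into successes and failures"""
    # Map each future back to the name of the job that produced it
    futures = {executor.submit(worker, *args): name for name, args in jobs.items()}

    successes, failures = [], []
    for future in as_completed(futures):
        name = futures[future]
        try:
            future.result()  # Re-raises any exception raised inside the worker
        except Exception as exc:
            failures.append({'name': name, 'reason': str(exc)})
        else:
            successes.append(name)
    return successes, failures


def fake_convert(cell_name: str, n_files: int) -> None:
    """Hypothetical stand-in for `process_cell`: fails when a cell has no files"""
    if n_files == 0:
        raise ValueError(f'No files for {cell_name}')


jobs = {
    'batch_B1A_cell_4': ('batch_B1A_cell_4', 3),
    'batch_B1_cell_9': ('batch_B1_cell_9', 0),
}
with ThreadPoolExecutor(max_workers=2) as pool:
    successes, failures = run_jobs(pool, fake_convert, jobs)
```

Catching the exception per future means one unreadable file does not stop the remaining cells from being converted.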


Save the failure information to disk

In [25]:
failures.to_csv('failures.csv', index=False)

Make sure we found all cells from Noah's paper

In [26]:
n_matched = known_cells['matched'].sum()
print(f'Matched {n_matched}/{len(known_cells)} known cells')

Matched 300/300 known cells
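
If the count ever fell short, listing the unmatched cells would be the natural next step. The sketch below uses invented rows and a left merge with `indicator=True`. It also casts the spreadsheet's cell numbers to strings first, on the assumption that the two tables might store them with different types.

```python
import pandas as pd

# Invented data: parsed cell numbers are strings, spreadsheet values are integers
known = pd.DataFrame({'batch': ['B1A', 'B1A', 'B2'], 'cell_number': ['4', '5', '1']})
tests = pd.DataFrame({'Batch': ['B1A', 'B2'], 'Cell Number': [4, 1]})

# Build comparable keys so that '4' and 4 are treated as the same cell
test_keys = pd.DataFrame({
    'batch': tests['Batch'],
    'cell_number': tests['Cell Number'].astype(str),
}).drop_duplicates()

# Rows marked "left_only" are known cells with no matching test
merged = known.merge(test_keys, on=['batch', 'cell_number'], how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
```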


Show off the metadata for one of the cells

In [27]:
example_cell = './data/hdf5/refined/batch_B1A_cell_4.h5'

In [28]:
data = BatteryDataset.from_hdf(str(example_cell))

In [29]:
print(data.metadata.model_dump_json(exclude_defaults=True, indent=2))

{
  "name": "CAMP_batch_B1A_cell_4",
  "battery": {
    "anode": {
      "name": "C",
      "supplier": "Commercial",
      "product": "Graphite",
      "thickness": 86.0,
      "loading": 5.75,
      "porosity": 35.0
    },
    "cathode": {
      "name": "HE5050",
      "supplier": "Commercial",
      "thickness": 68.0,
      "loading": 14.5,
      "porosity": 42.0
    },
    "electrolyte": {
      "name": "Gen 2"
    },
    "nominal_capacity": 0.375
  },
  "dataset_name": "camp_2023",
  "authors": [
    [
      "Logan",
      "Ward"
    ],
    [
      "Joseph",
      "Kubal"
    ],
    [
      "Susan J.",
      "Babinec"
    ],
    [
      "Wenquan",
      "Lu"
    ],
    [
      "Allison",
      "Dunlop"
    ],
    [
      "Steve",
      "Trask"
    ],
    [
      "Andrew",
      "Jansen"
    ],
    [
      "Noah H.",
      "Paulson"
    ]
  ],
  "associated_ids": [
    "https://doi.org/10.1016/j.jpowsour.2022.231127"
  ]
}
