### This notebook is used to test array handling for the whole dataset

First, we get the metadata that is not stored in any array. This is the "acquisition" metadata, and we obtain it based on the acquisition schema.

In [1]:
from acquisitionMapper import extract_metadata_addresses

In [2]:
mapFile = '/Users/elias/Desktop/PP13_Mapping/pp13-mapper/schemas/new_sem_fib_nested_schema_map.json'

xmlMap, imgMap = extract_metadata_addresses(mapFile)

The metadata map is split into two sets of "addresses": those which point into the acquisition's XML file, and those which we can read from any one of the images of the acquisition.
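To make the idea of a split map concrete, here is a minimal sketch. It assumes a flat map from target JSON key to source address and assumes that image addresses start with `Images.`; the real schema map file is nested and `extract_metadata_addresses` may work quite differently. `split_map` and the XML address `EMProject.Project.Name` are hypothetical names used only for illustration.

```python
# Hypothetical flat map: target JSON key -> source address.
# The real schema map is a nested JSON file and may be organised differently.
example_map = {
    # hypothetical XML address, for illustration only
    "acquisition.genericMetadata.projectName": "EMProject.Project.Name",
    "acquisition.genericMetadata.source": "Images.SEM Image.SliceImage.System.source",
    "acquisition.genericMetadata.pump": "Images.SEM Image.SliceImage.System.pump",
}


def split_map(flat_map, image_prefix="Images."):
    """Split a map into XML-sourced and image-sourced entries by address prefix."""
    xml_part, img_part = {}, {}
    for target, address in flat_map.items():
        if address.startswith(image_prefix):
            img_part[target] = address
        else:
            xml_part[target] = address
    return xml_part, img_part


xml_part, img_part = split_map(example_map)
print(sorted(xml_part))
print(sorted(img_part))
```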

Next, we read in the XML file as a Python dictionary:

In [3]:
from acquisitionMapper import xml_to_dict

In [4]:
xmlFile = '/Users/elias/Desktop/NFDI Tomographiedaten/20200818_AlSi13 XRM tomo2/EMproject.emxml'
xmlMetadata = xml_to_dict(xmlFile)
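For readers without access to `acquisitionMapper`, the following sketch shows one way an XML document can be turned into nested dictionaries using only the standard library. It is not the implementation of `xml_to_dict`; `element_to_dict` is a hypothetical helper, the XML fragment is made up, and attributes are ignored for brevity.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Recursively convert an Element into nested dicts, lists, and strings."""
    children = list(element)
    if not children:
        # Leaf element: keep its text (empty string if there is none)
        return (element.text or "").strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tag: collect the values in a list
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


# Made-up XML fragment, loosely modelled on the keys used in this notebook
sample = """
<EMProject>
  <Datasets>
    <Dataset><Name>SEM Image</Name></Dataset>
    <Dataset><Name>SEM Image 2</Name></Dataset>
  </Datasets>
</EMProject>
""".strip()

root = ET.fromstring(sample)
as_dict = {root.tag: element_to_dict(root)}
print(as_dict["EMProject"]["Datasets"]["Dataset"][1]["Name"])  # SEM Image 2
```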

Then we pull the values that our map specifies out of the XML dictionary:

In [5]:
from acquisitionMapper import extract_values

acqXmlMetadata = extract_values(xmlMap, xmlMetadata)

The result is a dictionary whose keys are the JSON addresses from the map and whose values are taken from the XML file.
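The lookup itself can be pictured as following a dot-separated address through nested dictionaries. The sketch below assumes the map has the shape `{json_key: xml_address}`; the actual `extract_values` may use a different map shape and also accepts a dataset number, which is not modelled here. `get_by_address`, `extract_by_map`, and the sample data are hypothetical.

```python
def get_by_address(nested, address, sep="."):
    """Follow a separator-delimited address through nested dictionaries."""
    current = nested
    for part in address.split(sep):
        current = current[part]
    return current


def extract_by_map(address_map, nested):
    """Return {json_key: value}, assuming address_map is {json_key: xml_address}."""
    return {
        json_key: get_by_address(nested, address)
        for json_key, address in address_map.items()
    }


# Made-up stand-in for the XML dictionary and the XML part of the map
xml_like = {"EMProject": {"Project": {"Name": "demo"}, "FileVersion": "1.2"}}
demo_map = {
    "acquisition.genericMetadata.projectName": "EMProject.Project.Name",
    "acquisition.genericMetadata.fileVersion": "EMProject.FileVersion",
}

extracted = extract_by_map(demo_map, xml_like)
print(extracted)
```

A missing address raises `KeyError` in this sketch; whether missing values should be skipped or reported is a design decision left open here.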

The next step is to obtain the acquisition data from one of the images. For this we read a single image file. It doesn't matter which one, as this data should be common to all images in the acquisition.

In [6]:
from imageMapper import readFile, formatMetadata, extractImageMappings, extractImageData, headerMapping

imgFile = '/Users/elias/Desktop/NFDI Tomographiedaten/20200818_AlSi13 XRM tomo2/Images/SEM Image/SEM Image - SliceImage - 001.tif'

imgMetadata = readFile(imgFile)

Then it must be properly formatted:

In [7]:
formattedImgMetadata = formatMetadata(imgMetadata)
formattedImgMetadata

{'Images.SEM Image.SliceImage.User.date': '08/18/2020',
 'Images.SEM Image.SliceImage.User.time': '01:40:03 PM',
 'Images.SEM Image.SliceImage.User.user': 'user',
 'Images.SEM Image.SliceImage.User.usertext': '',
 'Images.SEM Image.SliceImage.User.usertextunicode': '',
 'Images.SEM Image.SliceImage.System.type': 'DualBeam',
 'Images.SEM Image.SliceImage.System.dnumber': '9952707',
 'Images.SEM Image.SliceImage.System.software': '14.5.1.432',
 'Images.SEM Image.SliceImage.System.buildnr': '432',
 'Images.SEM Image.SliceImage.System.source': 'FEG',
 'Images.SEM Image.SliceImage.System.column': 'Elstar',
 'Images.SEM Image.SliceImage.System.finallens': 'Elstar',
 'Images.SEM Image.SliceImage.System.chamber': 'xT-SDB',
 'Images.SEM Image.SliceImage.System.stage': '110 x 110',
 'Images.SEM Image.SliceImage.System.pump': 'TMP',
 'Images.SEM Image.SliceImage.System.esem': 'no',
 'Images.SEM Image.SliceImage.System.aperture': 'AVA',
 'Images.SEM Image.SliceImage.System.scan': 'PIA 3.0',
 'Imag

Then we extract only the necessary values according to the map:

In [8]:
extractedImgMetadata = extractImageData(formattedImgMetadata, imgMap)
extractedImgMetadata

{'Images.SEM Image.SliceImage.System.source': 'FEG',
 'Images.SEM Image.SliceImage.System.column': 'Elstar',
 'Images.SEM Image.SliceImage.System.stage': '110 x 110',
 'Images.SEM Image.SliceImage.System.pump': 'TMP',
 'Images.SEM Image.SliceImage.System.esem': 'no',
 'Images.SEM Image.SliceImage.System.eucwd': '0.004',
 'Images.SEM Image.SliceImage.System.systemtype': 'Helios G4 PFIB CXe'}

Then it simply needs to be mapped:

In [9]:
acqImgMetadata = headerMapping(extractedImgMetadata, imgMap)
acqImgMetadata

{'acquisition.genericMetadata.pump': 'TMP',
 'acquisition.genericMetadata.column': 'Elstar',
 'acquisition.genericMetadata.source': 'FEG',
 'acquisition.genericMetadata.eucentricWorkingDistance.value': '0.004',
 'acquisition.genericMetadata.ESEM': 'no',
 'acquisition.genericMetadata.systemType': 'Helios G4 PFIB CXe',
 'acquisition.genericMetadata.stage': '110 x 110'}
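The two steps above (keep only the mapped keys, then rename them) can be sketched with two dictionary comprehensions. This assumes the image map has the shape `{header_key: json_key}`, which may not match what `extractImageData` and `headerMapping` actually expect; `filter_keys` and `rename_keys` are hypothetical helpers, and the header values are copied from the output shown earlier.

```python
# A few entries from the formatted header shown earlier
header = {
    "Images.SEM Image.SliceImage.System.source": "FEG",
    "Images.SEM Image.SliceImage.System.pump": "TMP",
    "Images.SEM Image.SliceImage.System.scan": "PIA 3.0",
}

# Assumed map shape: header key -> target JSON key
key_map = {
    "Images.SEM Image.SliceImage.System.source": "acquisition.genericMetadata.source",
    "Images.SEM Image.SliceImage.System.pump": "acquisition.genericMetadata.pump",
}


def filter_keys(flat, key_map):
    """Keep only the entries whose key appears in the map."""
    return {key: value for key, value in flat.items() if key in key_map}


def rename_keys(flat, key_map):
    """Replace each header key by its target JSON key."""
    return {key_map[key]: value for key, value in flat.items()}


selected = filter_keys(header, key_map)
renamed = rename_keys(selected, key_map)
print(renamed)
# {'acquisition.genericMetadata.source': 'FEG', 'acquisition.genericMetadata.pump': 'TMP'}
```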

Now we have two dictionaries, `acqXmlMetadata` and `acqImgMetadata`. We combine these to get all of the metadata we need for the acquisition in one nicely formatted Python dictionary:

In [10]:
acqMetadata = {**acqXmlMetadata, **acqImgMetadata}
acqMetadata

{'acquisition.genericMetadata.program.programName': 'Auto Slice & View 4',
 'acquisition.genericMetadata.program.programVersion': '4.2.1.1982',
 'acquisition.genericMetadata.applicationId.identifierValue': 'ASV',
 'acquisition.genericMetadata.fileVersion': '1.2',
 'acquisition.genericMetadata.projectName': '20200818_AlSi13 XRM tomo2',
 'acquisition.genericMetadata.zCutSpacing.value': '2.0000000000000002E-07',
 'acquisition.genericMetadata.numberOfCuts': '719',
 'acquisition.genericMetadata.pump': 'TMP',
 'acquisition.genericMetadata.column': 'Elstar',
 'acquisition.genericMetadata.source': 'FEG',
 'acquisition.genericMetadata.eucentricWorkingDistance.value': '0.004',
 'acquisition.genericMetadata.ESEM': 'no',
 'acquisition.genericMetadata.systemType': 'Helios G4 PFIB CXe',
 'acquisition.genericMetadata.stage': '110 x 110'}
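One detail of the `{**a, **b}` merge is worth keeping in mind: if the same key occurs in both dictionaries, the value from the right-hand dictionary wins silently. The small example below uses made-up values to show this, together with an optional check for overlapping keys.

```python
# Made-up values; the "stage" key is deliberately present in both dictionaries
xml_part = {
    "acquisition.genericMetadata.fileVersion": "1.2",
    "acquisition.genericMetadata.stage": "value from xml",
}
img_part = {"acquisition.genericMetadata.stage": "110 x 110"}

merged = {**xml_part, **img_part}
print(merged["acquisition.genericMetadata.stage"])  # 110 x 110

# Optional guard: list the keys present in both sources before merging
overlap = sorted(xml_part.keys() & img_part.keys())
print(overlap)  # ['acquisition.genericMetadata.stage']
```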

The acquisition metadata is now ready to be output to the resulting json metadata document.

# Dataset Handling
The generated JSON document can have several elements under "dataset", as one acquisition may consist of more than one dataset, each with their own unique parameters (which may be common to images within that dataset, but not shared with other datasets). Therefore, we need a unique element for each dataset in the resulting JSON metadata document.

In our example PP13 data acquisition we have three datasets, but we only focus on two of them, named `SEM Image` and `SEM Image 2`. The name of each dataset is specified in the XML file, which is exactly where the map says we should take it from. So, just for fun, let's list the names of the datasets included in the XML file. We already have the XML file as a Python dictionary, `xmlMetadata`, so this should be easy. We know from the XML file that the datasets fall under `'EMProject'`, `'Datasets'`, `'Dataset'`:

In [11]:
datasets = xmlMetadata['EMProject']['Datasets']['Dataset']
numDatasets = len(datasets)
datasetNames = [d['Name'] for d in datasets]

datasetNames

['SEM Image', 'SEM Image 2', 'EDS']
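A caveat, stated as an assumption because it depends on how `xml_to_dict` is implemented: some XML-to-dictionary converters (the `xmltodict` package, for example) return a single dictionary rather than a list when a tag occurs only once. If that applies here, an acquisition with a single dataset would break the list comprehension above. A small normalising helper, here called `as_list` (a hypothetical name), avoids this.

```python
def as_list(value):
    """Wrap a single item in a list; leave lists unchanged; map None to []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Made-up examples of the two shapes a converter might return
single = {"Name": "SEM Image"}
several = [{"Name": "SEM Image"}, {"Name": "SEM Image 2"}]

names_single = [d["Name"] for d in as_list(single)]
names_several = [d["Name"] for d in as_list(several)]
print(names_single)   # ['SEM Image']
print(names_several)  # ['SEM Image', 'SEM Image 2']
```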

Now we can use the functions which have already been developed to extract metadata for each dataset. First extract the information we need for each dataset from the map:

In [12]:
from datasetMapper import extract_metadata_addresses_dataset

datasetXmlMap, datasetImgMap = extract_metadata_addresses_dataset(mapFile)

Now that we have the necessary maps for each dataset, we must do the following **for each dataset**:

1. Extract the values from the xml file
2. Extract the values from an image in the dataset
3. Map these values according to `datasetXmlMap` and `datasetImgMap`

Then we store the results as an array in the JSON document under 'dataset'.

To start, let's write a function which does steps 1-3. It should output the metadata dictionary which should be written to the json file.

In [13]:
import os
def processDatasets(datasetNum, imageDirectory):
    # Extract xml data for this dataset
    mappedEMMetadata = extract_values(datasetXmlMap, xmlMetadata, datasetNum)
    
    # Find one image in the folder named after this dataset
    # (datasetNum is 1-based, datasetNames is 0-based)
    datasetName = datasetNames[datasetNum - 1]
    imgPath = None
    for root, dirs, files in os.walk(imageDirectory):
        if os.path.basename(root) == datasetName:
            for file in files:
                if file.endswith('.tif'):
                    imgPath = os.path.join(root, file)
                    break
            break
    if imgPath is None:
        raise FileNotFoundError(f"No .tif image found for dataset '{datasetName}'")
    
    # Read the image header, then format, filter, and map it
    imageData = readFile(imgPath)
    formattedMetadata = formatMetadata(imageData)
    imageMetadata = extractImageData(formattedMetadata, datasetImgMap)
    mappedImgMetadata = headerMapping(imageMetadata, datasetImgMap)
    
    return {**mappedEMMetadata, **mappedImgMetadata}
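The folder search can also be written with `pathlib`. The sketch below is an alternative, not the notebook's method: `first_tif` is a hypothetical helper that returns the alphabetically first `.tif` in the folder named after the dataset and raises a clear error if none exists. The demonstration runs on a temporary directory with empty placeholder files.

```python
import tempfile
from pathlib import Path


def first_tif(image_directory, dataset_name):
    """Return the alphabetically first .tif in a folder named dataset_name."""
    root = Path(image_directory)
    folders = sorted(
        p for p in root.rglob("*") if p.is_dir() and p.name == dataset_name
    )
    for folder in folders:
        tifs = sorted(folder.glob("*.tif"))
        if tifs:
            return tifs[0]
    raise FileNotFoundError(
        f"no .tif file found for dataset {dataset_name!r} under {root}"
    )


# Demonstration on a throwaway directory tree with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "Images" / "SEM Image"
    folder.mkdir(parents=True)
    for name in ("slice_002.tif", "slice_001.tif", "notes.txt"):
        (folder / name).touch()
    found_name = first_tif(Path(tmp) / "Images", "SEM Image").name

print(found_name)  # slice_001.tif
```

Sorting makes the choice of image reproducible, since the order in which the file system lists files is not guaranteed.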

Now that we have a function which will process one dataset, we can loop through the known datasets and produce a list of these dictionaries (we will only do the first two as only those are relevant for us):

In [14]:
imgDirectory = '/Users/elias/Desktop/NFDI Tomographiedaten/20200818_AlSi13 XRM tomo2/Images'

datasetMetadata = []
for i, dataset in enumerate(datasetNames[:2]):
    print(i, dataset)
    datasetMetadata.append(processDatasets(i+1, imgDirectory))

0 SEM Image
1 SEM Image 2


Now, `datasetMetadata` is an array (list) containing as many mapped and ready dictionaries as there are datasets we wish to extract metadata from (in our case, only two):

In [15]:
datasetMetadata

[{'acquisition.dataset.rows': '1',
  'acquisition.dataset.columns': '1',
  'acquisition.dataset.tileColumn': '0',
  'acquisition.dataset.user.userName': 'user',
  'acquisition.dataset.program.programName': '14.5.1.432',
  'acquisition.dataset.instrument.beamType': 'EBeam',
  'acquisition.dataset.instrument.spot': '1',
  'acquisition.dataset.instrument.eBeam.accelerationVoltage.value': '15000',
  'acquisition.dataset.instrument.eBeam.beamCurrent.value': '1.6e-009',
  'acquisition.dataset.instrument.eBeam.scanRotation.value': '0',
  'acquisition.dataset.instrument.eBeam.imageMode.value': 'Normal',
  'acquisition.dataset.instrument.eBeam.apertureSetting.size.value': '4.53e-005',
  'acquisition.dataset.instrument.eBeam.horizontalFieldWidth.value': '0.000592',
  'acquisition.dataset.instrument.eBeam.verticalFieldWidth.value': '0.000394667',
  'acquisition.dataset.instrument.eBeam.tiltCorrectionIsOn': 'no',
  'acquisition.dataset.instrument.eBeam.dynamicFocusIsOn': 'no',
  'acquisition.datas

# Image Array Handling

This is the most complex case. We have been able to create a metadata dictionary for a single image; now we need a list of such dictionaries, one per image, for each dataset. We start by writing a function which processes a single image:

In [16]:
imgMappings = extractImageMappings(mapFile)
def processImage(imgPath):
    # read image file
    rawImgMetadata = readFile(imgPath)
    formattedMetadata = formatMetadata(rawImgMetadata)
    imageMetadata = extractImageData(formattedMetadata, imgMappings)
    mappedImgMetadata = headerMapping(imageMetadata, imgMappings)
    
    return mappedImgMetadata

This function processes a single image: it takes the image path and returns the mapped image metadata as a Python dictionary. Next, we need to loop through all of the images in a single dataset. Since we already have the `processDatasets()` function, we can add a step which processes the images in that dataset and have it additionally return a list of dictionaries, where each dictionary is the metadata extracted for one image. Let's do that.

In [17]:
def processDatasets(datasetNum, imageDirectory):
    # Extract xml data for this dataset
    mappedEMMetadata = extract_values(datasetXmlMap, xmlMetadata, datasetNum)
    
    # Collect every .tif in the folder(s) named after this dataset
    # (datasetNum is 1-based, datasetNames is 0-based)
    datasetName = datasetNames[datasetNum - 1]
    imgPaths = []
    for root, dirs, files in os.walk(imageDirectory):
        if os.path.basename(root) == datasetName:
            # Sort so the image order does not depend on the file system
            imgPaths.extend(os.path.join(root, file)
                            for file in sorted(files) if file.endswith('.tif'))
    if not imgPaths:
        raise FileNotFoundError(f"No .tif image found for dataset '{datasetName}'")
    
    # Dataset-level metadata is common to all images, so read it from the first one
    imageData = readFile(imgPaths[0])
    formattedMetadata = formatMetadata(imageData)
    imageMetadata = extractImageData(formattedMetadata, datasetImgMap)
    mappedImgMetadata = headerMapping(imageMetadata, datasetImgMap)
    
    # Image-level metadata: one dictionary per image
    imageMetadataList = [processImage(imgPath) for imgPath in imgPaths]
    
    return {**mappedEMMetadata, **mappedImgMetadata}, imageMetadataList

In [18]:
datasetMetadata = []
imageMetadata   = []
for i, dataset in enumerate(datasetNames[:2]):
    print(i, dataset)
    # processDatasets returns one dictionary and one list of dictionaries
    datasetMetadataDict, imageMetadataList = processDatasets(i+1, imgDirectory)
    datasetMetadata.append(datasetMetadataDict)
    imageMetadata.append(imageMetadataList)

0 SEM Image
1 SEM Image 2


Now we have two variables:

* `imageMetadata` of type `list` and `len` 2
    * each element is itself a list with `len` equal to the number of images in that dataset, and each element of that inner list is the metadata dictionary for one image
* `datasetMetadata` of type `list` and `len` 2
    * each element is the metadata dictionary for one dataset
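The shapes described in this list can be verified with a few lines of code before anything is written to disk. The sketch below uses made-up stand-ins for the two variables; `check_shapes` and the key `example.key` are hypothetical.

```python
# Made-up stand-ins with the same shapes as datasetMetadata and imageMetadata
dataset_metadata = [
    {"acquisition.dataset.rows": "1"},
    {"acquisition.dataset.rows": "1"},
]
image_metadata = [
    [{"example.key": "img 1"}, {"example.key": "img 2"}, {"example.key": "img 3"}],
    [{"example.key": "img 1"}],
]


def check_shapes(dataset_metadata, image_metadata):
    """Validate the nesting and return the number of images per dataset."""
    if len(dataset_metadata) != len(image_metadata):
        raise ValueError("expected one list of image dictionaries per dataset")
    for index, image_list in enumerate(image_metadata):
        if not all(isinstance(item, dict) for item in image_list):
            raise TypeError(f"dataset {index}: every image entry should be a dict")
    return [len(image_list) for image_list in image_metadata]


counts = check_shapes(dataset_metadata, image_metadata)
print(counts)  # [3, 1]
```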



We now have all the elements we need to build the JSON file: the acquisition metadata `acqMetadata`, the dataset metadata `datasetMetadata`, and the image metadata `imageMetadata`. They now need to be exported to a JSON file, so we must write a function which does this.


The JSON file needs to be structured as follows:

* `acqMetadata` is a dictionary whose keys are the dot-separated levels of hierarchy in our JSON file, and whose values are stored at the innermost (last) level named in the key.
* `datasetMetadata` is a list with one element per dataset in the acquisition. Each element is a dictionary structured like `acqMetadata`: dot-separated keys, with each value stored at the innermost level of its key. The caveat is that these elements must appear under the 'dataset' level as an array, in the same order, e.g. the first element of `datasetMetadata` becomes the first element of the 'dataset' array in the JSON file.
* `imageMetadata` is a list of lists of image metadata dictionaries. It has one list per dataset, and each element of that list is the metadata dictionary for one image. The list corresponding to a dataset needs to go under the 'images' key of that dataset in the JSON, so that in the end we have a very deeply nested structure:
    * Under the 'dataset' level there should be an array with one entry per dataset. This array should have the same length as `datasetMetadata` (we should assert this as a final check that everything is ok). Each entry holds the values from the corresponding element of `datasetMetadata`. Remember that the keys of each of these dictionaries are the dot-separated levels of hierarchy as they should appear in the JSON file.
    * Similarly, under 'dataset' and then under 'images', there should be an array in which each element is the metadata dictionary for one image of that specific dataset.
    
In summary, we have {acquisition{dataset1{img1,img2,...},dataset2{img1,img2,...}}}, not including all the stuff in between of course.
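The core operation behind this structure is turning dot-separated keys into nested dictionaries. A minimal sketch, with `unflatten` as a hypothetical helper name and two keys taken from `acqMetadata`:

```python
def unflatten(flat, sep="."):
    """Turn {'a.b.c': value} into {'a': {'b': {'c': value}}}."""
    nested = {}
    for dotted_key, value in flat.items():
        parts = dotted_key.split(sep)
        node = nested
        for part in parts[:-1]:
            # Create the intermediate level if it does not exist yet
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


flat = {
    "acquisition.genericMetadata.pump": "TMP",
    "acquisition.genericMetadata.eucentricWorkingDistance.value": "0.004",
}
nested = unflatten(flat)
print(nested)
# {'acquisition': {'genericMetadata': {'pump': 'TMP', 'eucentricWorkingDistance': {'value': '0.004'}}}}
```

This sketch assumes that no key is a prefix of another (for example `a.b` and `a.b.c` in the same dictionary); such a conflict would need separate handling.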

In [31]:
from datetime import datetime

def writeToJSON(acqMetadata, datasetMetadata, imageMetadata, output_directory):
    # TODO: build the nested structure described above and write it to output_directory
    print("JSON file created successfully.")

In [30]:
output_file_path = "/Users/elias/Desktop/PP13_Mapping/pp13-mapper/result_jsons"
writeToJSON(acqMetadata, datasetMetadata, imageMetadata, output_file_path)

JSON file created successfully.
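Since `writeToJSON` is still a stub, here is one possible sketch of how it could be filled in, following the structure described above. Several things are assumptions rather than facts from this notebook: that dataset keys start with `acquisition.dataset.`, that image keys start with `acquisition.dataset.images.`, and that the output file is named `metadata_<timestamp>.json`. The helper names (`set_nested`, `strip_prefix`, `build_document`, `write_to_json_sketch`) and the image key `fileName` are hypothetical.

```python
import json
import os
import tempfile
from datetime import datetime


def set_nested(target, dotted_key, value):
    """Store value in target at the path given by a dot-separated key."""
    parts = dotted_key.split(".")
    node = target
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = value


def strip_prefix(key, prefix):
    """Remove prefix from key if present (assumed key layout, see lead-in)."""
    return key[len(prefix):] if key.startswith(prefix) else key


def build_document(acq, datasets, images):
    """Assemble acquisition -> dataset array -> images array."""
    assert len(datasets) == len(images), "one image list per dataset expected"
    document = {}
    for key, value in acq.items():
        set_nested(document, key, value)
    dataset_array = []
    for dataset_flat, image_list in zip(datasets, images):
        entry = {}
        for key, value in dataset_flat.items():
            set_nested(entry, strip_prefix(key, "acquisition.dataset."), value)
        entry["images"] = []
        for image_flat in image_list:
            image_entry = {}
            for key, value in image_flat.items():
                short_key = strip_prefix(key, "acquisition.dataset.images.")
                set_nested(image_entry, short_key, value)
            entry["images"].append(image_entry)
        dataset_array.append(entry)
    document.setdefault("acquisition", {})["dataset"] = dataset_array
    return document


def write_to_json_sketch(acq, datasets, images, output_directory):
    """Write the assembled document to a timestamped file and return its path."""
    document = build_document(acq, datasets, images)
    os.makedirs(output_directory, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    # Hypothetical file name; the notebook does not specify one
    path = os.path.join(output_directory, f"metadata_{stamp}.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(document, handle, indent=2)
    return path


# Tiny made-up inputs with the same shapes as the notebook variables
acq = {"acquisition.genericMetadata.pump": "TMP"}
datasets = [{"acquisition.dataset.rows": "1"}]
images = [[{"acquisition.dataset.images.fileName": "a.tif"}]]  # hypothetical key

doc = build_document(acq, datasets, images)
with tempfile.TemporaryDirectory() as tmp:
    written = write_to_json_sketch(acq, datasets, images, tmp)
    with open(written, encoding="utf-8") as handle:
        reloaded = json.load(handle)

print(json.dumps(doc, indent=2))
```

The values are written as strings, exactly as they come out of the XML file and the image header; converting them to numbers would be a separate step.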
