In [1]:
import numpy as np
import sys
sys.path.append("./src")
import run_model
import types
import os

## Default arguments
Below are all of the arguments and their default values.

- targets: the list of space-separated target morphologies,
>--targets cylinder disk sphere cs_cylinder cs_disk cs_sphere

- datadir: A directory containing all the data. There should be a file called "TRAIN_[target].csv" and "TEST_[target].csv" for each target,
>--datadir data
- configdir: The directory containing the configuration files,
>--configdir configs
- resultsdir: A directory to save results to.
>--resultsdir results
- hierarchy_file: A file containing the structure of the hierarchical model. It should be in the configdir directory.
>--hierarchy_file hierarchical_structure.txt
- reg_file: A file containing the hyperparameters and targets for the regression models. It should be in the configdir directory and must contain one set of hyperparameters for each desired target for each morphology.
>--reg_file krr_hyperparameters.txt
- extrapolation: A flag for whether to limit the test data to aspect ratios and shell ratios outside the range of the training data.
>--extrapolation False
- evaluate_file: An optional path to a file containing curves to evaluate, this is where to point to new data of interest. Curves must have the same q values.
>--evaluate_file None
- quotient: A flag. If true, pre-process with the quotient transform as defined in B. Yildirim, J. Doutch and J. M. Cole, Digital Discovery, 2024, 3, 694–704.
>--quotient False
- uncertainty: A flag. If true, report uncertainty bounds on the curves in evaluate_file, using conformal prediction.
>--uncertainty True
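
For readers unfamiliar with conformal prediction, the following is a minimal sketch of the split-conformal idea for a single regressed parameter. It is illustrative only: the function name `split_conformal_interval`, the calibration arrays, and the `alpha` level are assumptions, and the variant used by the model may differ.

```python
import numpy as np

def split_conformal_interval(cal_true, cal_pred, new_pred, alpha=0.1):
    """Return (lower, upper) bounds with roughly 1 - alpha coverage.

    Hypothetical helper, not part of run_model.
    """
    cal_true = np.asarray(cal_true, dtype=float)
    cal_pred = np.asarray(cal_pred, dtype=float)
    # Nonconformity score: absolute residual on held-out calibration data
    scores = np.sort(np.abs(cal_true - cal_pred))
    n = scores.size
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha))
    k = int(np.ceil((n + 1) * (1 - alpha)))
    # With too few calibration points the interval is unbounded
    half_width = scores[k - 1] if k <= n else np.inf
    new_pred = np.asarray(new_pred, dtype=float)
    return new_pred - half_width, new_pred + half_width
```

The interval width comes entirely from residuals on data the regression model was not fitted on, which is what gives the coverage guarantee under the usual exchangeability assumption.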

In [2]:
arguments = {"targets":['cylinder', 'disk', 'sphere', 'cs_cylinder', 'cs_disk', 'cs_sphere'],
             "datadir": 'data',
             "configdir": 'configs',
             "resultsdir": 'results',
             "hierarchy_file": 'hierarchical_structure.txt',
             "reg_file": 'krr_hyperparameters.txt',
             "extrapolation": False,
             "evaluate_file": None,
             "quotient": False}

# Running the model

This document first explains how to run the model, and then dives a little deeper into how the model works.
First we can run the model.
There is a script to do this, called "example_run.sh", which sets a few arguments and runs the model.

```python3 src/run_model.py  --datadir data --configdir configs --resultsdir results --evaluate_file ./data/experimental_curves.csv --extrapolation True```

This configures a few key arguments and runs the full model.
The arguments it configures are:

1. datadir: the directory containing all the data, in this case the directory called "data".
2. configdir: the directory containing the configuration files for the hierarchical model and each of the component classifiers, as well as a separate file for the hyperparameters defining each of the regression models.
3. evaluate_file: a separate file containing the experimental curves shown in the paper. These are curves that the model is not trained on. If you want to test the model on other data, point this argument there.
4. extrapolation: a boolean flag specific to this application. In the accompanying paper we display results in which we train on a subsampled training set containing only points with a small aspect ratio or shell-to-total ratio. If this flag is set to true, that subsampled set is used. The performance will not be quite as strong as when all the data are used, so for practical use set this to false.
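
To make the extrapolation idea concrete, here is a small sketch of how test points outside the training range could be selected. The function `extrapolation_mask` and the aspect-ratio values are made up for illustration; the actual selection logic in `run_model` is not shown here and may differ.

```python
import numpy as np

def extrapolation_mask(train_values, test_values):
    """True for test points outside the [min, max] range of the training values.

    Hypothetical helper, not part of run_model.
    """
    train_values = np.asarray(train_values, dtype=float)
    test_values = np.asarray(test_values, dtype=float)
    return (test_values < train_values.min()) | (test_values > train_values.max())

train_aspect_ratio = np.array([1.0, 2.0, 3.0])      # made-up values
test_aspect_ratio = np.array([0.5, 2.5, 3.0, 4.0])  # made-up values
mask = extrapolation_mask(train_aspect_ratio, test_aspect_ratio)
print(test_aspect_ratio[mask])  # only the points the model must extrapolate to
```

The same mask could be applied to the shell-to-total ratio for the core-shell morphologies.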

The other arguments, which allow the run to be fully customized, are listed in the "Default arguments" section. For now let's look at these results. We can also invoke the model from within this notebook: instead of command-line arguments, a dictionary of arguments can be used.
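
A dictionary can stand in for command-line arguments because `types.SimpleNamespace` exposes its keys as attributes, much like the `argparse.Namespace` object a command-line parser returns. A small sketch, using a hypothetical `describe` function in place of `run_model.main` (whose internals are not shown here):

```python
import types

def describe(args):
    # Stand-in for a function that reads parsed arguments as attributes
    return f"data in {args.datadir}, extrapolation={args.extrapolation}"

demo_arguments = {"datadir": "data", "extrapolation": True}
demo_args = types.SimpleNamespace(**demo_arguments)
print(describe(demo_args))  # prints "data in data, extrapolation=True"
```

Keys must match the argument names exactly. A misspelled key simply becomes an extra, unused attribute, and the intended argument keeps whatever value it had before.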


In [1]:
arguments["evaluate_file"] =  './data/experimental_curves.csv'
arguments["extrapolation"] = True
arguments["quotient"] = True
run_model.main(types.SimpleNamespace(**arguments))


## Results
The results are saved to the results directory:

1. *classification_results.txt* contains the same classification report printed when the model is run. This shows the breakdown of precision, recall, and f1-score of each class as well as the overall accuracy.
2. *correct_<TARGET>.csv* contains the details of each correctly classified curve, for each respective class. These contain both the structural parameters used to generate the curves and the results of the regressions.
Each line starts with the index of that curve in its respective dataset, then the true structural parameters, indicated by the all-caps "TRUE".
Following these are the structural parameters as suggested by the regression models, labeled with an all-caps "REGRESSED".
3. *incorrect_<TARGET>.txt* are similar to their correct counterparts, but contain the respective incorrectly classified curves, as well as the morphology those curves were classified as and the results of regression using those incorrect classes. In many cases these incorrect classes and structural parameters can in fact fit the curves quite well.
These also start with the index and the true structural parameters, but the first entry in the "REGRESSED" section is the predicted morphology.

These give detailed information both on the classification breakdown and on each individual regression model.
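
As a quick sanity check on these files, the sketch below counts the lines in each `correct_*.csv` and `incorrect_*.txt` file. The helper `count_result_lines` is hypothetical, and it assumes one curve per line; whether the files also contain header lines is not stated here, so treat the counts as an upper bound on the number of curves.

```python
from pathlib import Path

def count_result_lines(results_dir="results"):
    """Map each correct_*.csv / incorrect_*.txt file name to its line count.

    Hypothetical helper, not part of run_model.
    """
    counts = {}
    results = Path(results_dir)
    for pattern in ("correct_*.csv", "incorrect_*.txt"):
        for path in sorted(results.glob(pattern)):
            with path.open("r") as handle:
                counts[path.name] = sum(1 for _ in handle)
    return counts
```

Calling `count_result_lines("results")` after a run returns a dictionary keyed by file name. The glob patterns skip other entries in the directory, such as `predictions.txt`.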

In [6]:
print(os.listdir("results"))
# Print the saved classification report
with open("results/classification_results.txt", "r") as f:
    print(f.read())
# Show the first five lines of the correctly classified cs_cylinder curves
with open("results/correct_cs_cylinder.csv", "r") as f:
    print("\n".join(f.readlines()[:5]))
# Show the first five lines of the misclassified cs_cylinder curves
with open("results/incorrect_cs_cylinder.txt", "r") as f:
    print("\n".join(f.readlines()[:5]))


['classification_results.txt', 'correct_sphere.csv', 'incorrect_cs_disk', 'incorrect_cs_disk.txt', 'incorrect_cylinder', 'correct_cs_disk.csv', 'incorrect_disk.txt', 'incorrect_cs_cylinder.txt', 'incorrect_disk', 'incorrect_cylinder.txt', 'incorrect_sphere.txt', 'correct_cylinder.csv', 'incorrect_cs_cylinder', 'predictions.txt', 'correct_cs_sphere.csv', 'correct_disk.csv', 'correct_cs_cylinder.csv', 'incorrect_cs_sphere', 'incorrect_cs_sphere.txt', 'incorrect_sphere']
              precision    recall  f1-score   support

         0.0       0.93      0.94      0.94      1000
         1.0       0.92      0.88      0.90      1000
         2.0       0.91      0.98      0.95      1000
         3.0       0.83      0.90      0.86      1000
         4.0       0.84      0.80      0.82      1000
         5.0       0.85      0.78      0.82      1000

    accuracy                           0.88      6000
   macro avg       0.88      0.88      0.88      6000
weighted avg       0.88      0.88      

## Quotient transform
One of the available arguments is the quotient transform defined in [Yildirim, Doutch, and Cole, 2024](https://pubs.rsc.org/en/content/articlelanding/2024/dd/d3dd00225j).
To use the quotient transform, pass the argument "--quotient True". Below are the effects on our data.

|preprocessing|dataset|accuracy|f1-score|
|:--|:--|:--|:--|
|Background subtraction & high-q shift|constant scale|0.88|0.88|
||extrapolated scale|0.86|0.86|
|quotient transform|constant scale|0.84|0.84|
||extrapolated scale|0.81|0.80|

Background subtraction and high-q shift is the method preferred in our work. This involves subtracting the background intensity from the curves, then vertically shifting all curves to have the same value in the high-q region, at incoherence.

The quotient transform is a method in which the value of the point at index i is divided by the value at index i+1 prior to taking the log.
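
The description above can be sketched in a few lines of NumPy. This is an illustration, not the code used by `run_model`: the function name `quotient_transform` is hypothetical, the natural log is an assumption (the base is not stated here), and the intensities are assumed to be strictly positive.

```python
import numpy as np

def quotient_transform(intensity):
    """log(I[i] / I[i+1]) for consecutive points; the output has one fewer point.

    Hypothetical helper; the log base is an assumption.
    """
    intensity = np.asarray(intensity, dtype=float)
    return np.log(intensity[:-1] / intensity[1:])

curve = np.array([8.0, 4.0, 2.0, 1.0])  # made-up intensities
print(quotient_transform(curve))
# A constant multiplicative scale cancels in the ratio
print(np.allclose(quotient_transform(curve), quotient_transform(1.5 * curve)))
```

Because each value is divided by its neighbor, any constant factor multiplying the whole curve drops out, so the transformed curve does not depend on the overall intensity scale.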

In our study we held the value of the scale parameter at 1. In the scale-extrapolated set we varied that parameter from 0.5 to 1.5 to evaluate how variation in that parameter affected the reliability of our model.
