# NSORT with Python
- Author: Tyler Martin 
- Contact: tyler.martin@nist.gov
- Last updated: 03/18/19
- Version: 0.3-dev

The goal of this notebook is to allow users to interactively stitch together **reduced** ABS files produced by the NCNR Igor macros. Note that this notebook only performs the NSORT portion of the reduction process, i.e. combining reduced scattering data from multiple instrument configurations into a single curve.

This notebook works by extracting the trial/sample portion of each measurement label and combining the measurements that share it. For example, suppose you had the following set of measurement labels:

- AC5-116 1p15m 5A Offset Scatt
- AC5-116 4p7m 5A Scatt
- AC5-116 4p7m 12A Scatt
- AC5-117 1p15m 5A Offset Scatt
- AC5-117 4p7m 5A Scatt
- AC5-117 4p7m 12A Scatt

..your goal would be to construct a regular expression (regex) that extracts the "AC5-11x" portion of each measurement label, so that the first three and the last three measurements are each combined into a single curve.
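As a rough sketch of this idea (not the notebook's actual code), the grouping can be illustrated with the standard `re` module alone. The helper name `group_by_trial` and the pattern used here are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

labels = [
    'AC5-116 1p15m 5A Offset Scatt',
    'AC5-116 4p7m 5A Scatt',
    'AC5-116 4p7m 12A Scatt',
    'AC5-117 1p15m 5A Offset Scatt',
    'AC5-117 4p7m 5A Scatt',
    'AC5-117 4p7m 12A Scatt',
]

def group_by_trial(labels, pattern):
    """Group labels by the first capture group of `pattern` (hypothetical helper)."""
    groups = defaultdict(list)
    for label in labels:
        m = re.search(pattern, label)
        if m is None:
            continue  # label does not match the pattern; leave it out
        groups[m.group(1)].append(label)
    return dict(groups)

# Capture everything before the first whitespace character
trial_groups = group_by_trial(labels, r'^(\S+)\s')
for trial, members in trial_groups.items():
    print(trial, '->', len(members), 'measurements')
```

Each key of `trial_groups` corresponds to one combined curve, and its list holds the configurations that would be stitched together.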

## Global Instructions

- This notebook should be worked through linearly from top to bottom
- All cells can be run by using the 'play' symbol in the toolbar or by pressing [Shift] + [Enter] simultaneously
- Section headers denote the level of user interaction
    - !> cells in this section require interaction/modification by the user
    - \>\> cells in this section should just be run and their output checked


## >> Setting up environment

The next several cells may take up to a minute or two to finish running.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets
import pathlib
import re
import time

The next cell is non-essential and can be skipped if it fails (i.e. if Seaborn is not installed) 

In [2]:
# optional plot styling; skip this cell if seaborn is not installed
import seaborn as sns
sns.set(context='notebook',style='ticks',palette='bright')

If the next cell fails, either
    
    a) Install ipympl via conda or pip (e.g. conda install -c conda-forge ipympl), or
    
    b) Change widget --> notebook

In [3]:
%matplotlib widget 

In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
#hack in the typySANS directory to the PYTHONPATH (for now)
import sys
sys.path.insert(0,'../')
import typySANS

## !> Pick path to .ABS files

The title says it! Write in the path to your reduced ABS files. Use [Tab] to autocomplete the paths as you type them

In [12]:
ABS_path = pathlib.Path('../dev/1804-BottleBrush/2018-11-08 NGBSANS83 - solvents/reduction')
ABS_path = pathlib.Path('membranes/')
# ABS_path = pathlib.Path('membrane_solutions/')

## >> Scan the .ABS files and build label table

In [13]:
dfLabel =[]
for file_path in ABS_path.glob('*ABS'):
    file_name = file_path.parts[-1]
    
    with open(file_path,'r') as f:
        lines = [f.readline() for _ in range(4)]
        
    ## We don't want COMBINED ABS files
    if 'COMBINED' in lines[0]:
        continue
        
    ## Parse the LABEL: row (split on the first ':' only, in case the label itself contains one)
    raw_label = lines[1].strip().split(':',1)[-1].strip()
    dfLabel.append([file_name,file_path,raw_label])

dfLabel = pd.DataFrame(dfLabel,columns=['file_name','file_path','label'])
dfLabel = dfLabel.set_index('file_name').squeeze()
dfLabel.head()

file_name,file_path,label
MAR19305.ABS,membranes/MAR19305.ABS,Dow 07002 80 D2O 1p15m 5A Offset Scatt T=30C
MAR19311.ABS,membranes/MAR19311.ABS,Dow 08003 60 D2O 1p15m 5A Offset Scatt T=30C
MAR19107.ABS,membranes/MAR19107.ABS,Dow 06004 Dry 4p7m 5A Scatt T=30C
MAR19113.ABS,membranes/MAR19113.ABS,Dow 08005 Dry 4p7m 5A Scatt T=30C
MAR19098.ABS,membranes/MAR19098.ABS,Dow 07004 Dry 1p15m 5A Offset Scatt T=30C


## !> Build regex to extract trial label

The goal here is to construct a 'regular expression' (i.e. a regex) that will extract the **non-configuration** portion of the trial label. This entire notebook works by finding the portion of the label that is common between the different instrument configurations and combining them. The key is to precisely construct this regex so that the correct measurements will be combined together. 

Some general regex notes:
- A period "." represents any single character (not just alphanumeric ones), except a newline
- A star "\*" denotes that the **previous** character can be repeated any number of times (including zero times)
- A question mark "?" denotes that the **previous** character occurs either 0 or 1 times
- Parentheses () denote 'capture' groups. This is how we extract substrings
- Brackets [] denote character lists i.e. [mA] is a single character equal to m **or** A
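The notes above can be tried out directly with Python's `re` module. The snippet below is only a small illustration on one of the example labels; it also uses `\d` (any digit) and `+` (one or more repeats), which are not listed above.

```python
import re

label = 'AC5-116 4p7m 12A Scatt'

# "." matches any single character: here it stands in for the "p"
dot = re.search(r'4.7m', label).group(0)
# "*" repeats the previous character: "1*" covers both 1s
star = re.search(r'AC5-1*6', label).group(0)
# "?" makes the previous element optional: one or two digits before "A"
optional = [re.search(r'\d\d?A', s).group(0) for s in ('4p7m 5A', '4p7m 12A')]
# "()" captures a substring: the digits in front of "A"
captured = re.search(r'(\d+)A', label).group(1)
# "[]" is a character list: digits followed by either "m" or "A"
char_list = re.findall(r'\d+[mA]', label)

print(dot, star, optional, captured, char_list)
```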

Example 1:

    Consider the following set of measurement labels
    
    - AC5-116 1p15m 5A Offset Scatt
    - AC5-116 4p7m 5A Scatt
    - AC5-116 4p7m 12A Scatt
    - AC5-117 1p15m 5A Offset Scatt
    - AC5-117 4p7m 5A Scatt
    - AC5-117 4p7m 12A Scatt
    
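    ..your goal would be to construct a regular expression (regex) that extracts the "AC5-11x" portion of each label, so that the first three and the last three measurements are each combined into a single curve. Because ".*" is greedy (it matches as much as possible), it must be made lazy with a trailing "?" when it is followed by \s; otherwise the capture runs up to the last space in the label. The following regular expressions would work in this case:
    
    - (AC5.*?)\s
    - (.*?)\s
    - ([0-9a-zA-Z-]*)
    - (.{7})\s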
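A quick way to check a candidate pattern is to run it over every label and look at the distinct trial labels it extracts; a pattern that works should yield exactly two. The sketch below does this for a few candidates. It also shows a common pitfall: a greedy `.*` followed by `\s` runs to the last whitespace in the label, so it captures the configuration as well, while the lazy form `.*?` stops at the first whitespace.

```python
import re

labels = [
    'AC5-116 1p15m 5A Offset Scatt',
    'AC5-116 4p7m 5A Scatt',
    'AC5-116 4p7m 12A Scatt',
    'AC5-117 1p15m 5A Offset Scatt',
    'AC5-117 4p7m 5A Scatt',
    'AC5-117 4p7m 12A Scatt',
]

candidates = [
    r'(AC5.*)\s',        # greedy: captures up to the last whitespace
    r'(AC5.*?)\s',       # lazy: stops at the first whitespace
    r'([0-9a-zA-Z-]*)',  # letters, digits and hyphens only
    r'(.{7})\s',         # exactly seven characters, then whitespace
]

# For each pattern, collect the distinct first-capture-group values
results = {
    pattern: sorted({re.search(pattern, label).group(1) for label in labels})
    for pattern in candidates
}
for pattern, extracted in results.items():
    print(f'{pattern!r:22} -> {len(extracted)} distinct label(s)')
```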
    
Example 2:

    - Full Sample Label: AC5-116-42k dPS 4p7m 5A Scatt T=25
    - Regex: (.*) dPS 
        - Explanation: Capture 0 or more characters which precede the characters dPS
    - Captured Groups: ('AC5-116-42k',)


**Note**: This code will always use the *first* capture group as the trial label

In [14]:
# this is a general regex that works well for many samples coming off of the 10m
# regex_init = r'(.*)\s*(.*)[mA]\s*(.*)[mA]'
# raw string, so that \s reaches the regex engine unchanged
regex_init = r'(.*)\s[0-9]p'

regex_label = ipywidgets.Dropdown(options=dfLabel['label'].values,description='Label:',layout={'width':'450px'})
regex = ipywidgets.Text(value=regex_init,description='Regex:',layout={'width':'600px'})
regex_output = ipywidgets.Output()

def match(event):
    regex_output.clear_output()
    with regex_output:
        try:
            re_result = re.search(regex.value,regex_label.value)
        except re.error:
            print('Error! Bad regular-expression!')
        else:
            if re_result is None:
                print('Error! No match!')
            else:
                groups = re_result.groups()
                print('\n')
                print('All Groups: {}'.format(groups))
                print()
                print('Extracted Trial Label: {}'.format(groups[0]))

regex_label.observe(match)
regex.observe(match)
match(None)
display(ipywidgets.VBox([regex_label,regex,regex_output]))

VBox(children=(Dropdown(description='Label:', layout=Layout(width='450px'), options=('Dow 07002 80 D2O 1p15m 5…

## >> Gather trial labels and configuration information

If you created a correct regex for *all* trials above, this cell should produce a table with the extracted trial label along with the sample-to-detector distance (SDD) and wavelength (LAM).

In [15]:
#This is hopefully a somewhat generic regex
cre = re.compile(regex.value)

## get lambda and SDD 
dfABS =[]
for file_name,sdf in dfLabel.iterrows():
    file_path = sdf['file_path']
    label = sdf['label']
    
    ABS,config = typySANS.readABS(file_path)
    LAMBDA = float(config['LAMBDA'])
    SDD = float(config['DET DIST'])
    
    ## Parse the LABEL: row
    re_result = cre.search(label)
    if not re_result: #if regex doesn't match, skip
        print('Warning: skipping {} because regex failed!'.format(file_name))
        continue
    label   = re_result.groups()[0].strip()
    
    dfABS.append([label,SDD,LAMBDA,file_name,file_path])

dfABS = pd.DataFrame(dfABS,columns=['label','SDD','LAM','file_name','file_path'])
dfABS = dfABS.sort_values(['label','SDD','LAM'])
dfABS.head()

,label,SDD,LAM,file_name,file_path
36,Dow 06001 100 D2O,1.15,5.0,MAR19300.ABS,membranes/MAR19300.ABS
103,Dow 06001 100 D2O,4.75,5.0,MAR19290.ABS,membranes/MAR19290.ABS
42,Dow 06001 100 D2O,4.75,12.0,MAR19260.ABS,membranes/MAR19260.ABS
135,Dow 06001 100 D2O,4.75,16.0,MAR19240.ABS,membranes/MAR19240.ABS
84,Dow 06001 Dry,1.15,5.0,MAR19085.ABS,membranes/MAR19085.ABS
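Because labels that fail the regex are skipped with only a printed warning, it can help to list them explicitly before going further. The following is a minimal sketch using pandas' `Series.str.extract`; the three labels are illustrative (the last one is made up to show a non-match), and in the notebook the series would be `dfLabel['label']` with `regex.value` as the pattern.

```python
import pandas as pd

labels = pd.Series(
    [
        'Dow 07002 80 D2O 1p15m 5A Offset Scatt T=30C',
        'Dow 06004 Dry 4p7m 5A Scatt T=30C',
        'Empty Cell',  # illustrative label with no configuration part
    ],
    name='label',
)
pattern = r'(.*)\s[0-9]p'

# str.extract returns one column per capture group; rows without a match are NaN
extracted = labels.str.extract(pattern, expand=True)[0].str.strip()
unmatched = labels[extracted.isna()]

print(extracted.tolist())
print('Unmatched labels:', unmatched.tolist())
```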


## >> Create NSORT Table

Now the real magic: Using the power of [Pandas](https://pandas.pydata.org/), we can automatically group the above table by the label column. If the regex was properly constructed, this cell will output a table which lists all of the individual instrument configurations for each sample label.

In [16]:
dfNSORT = []
for i, sdf in dfABS.groupby('label'):
    dd = {'label':sdf.label.iloc[0]}
    for j,ssdf in sdf.iterrows():
        dd[ssdf.SDD,ssdf.LAM,'fname'] = ssdf.file_name
        dd[ssdf.SDD,ssdf.LAM,'fpath'] = ssdf.file_path
    dfNSORT.append(dd)

dfNSORT = pd.DataFrame(dfNSORT)
dfNSORT.set_index('label',inplace=True)
dfNSORT.columns = pd.MultiIndex.from_tuples(dfNSORT.columns.tolist(),names=['SDD','LAM','datatype'])
dfNSORT.sort_index(axis=0,inplace=True)
dfNSORT.sort_index(axis=1,inplace=True)
dfNSORT.tail().T

SDD,LAM,datatype,Dow 10003 Dry,Dow 10004 00 D2O,Dow 10004 Dry,Dow 10005 15 D2O,Dow 10005 Dry
1.15,5.0,fname,MAR19096.ABS,MAR19315.ABS,MAR19100.ABS,MAR19409.ABS,MAR19104.ABS
1.15,5.0,fpath,membranes/MAR19096.ABS,membranes/MAR19315.ABS,membranes/MAR19100.ABS,membranes/MAR19409.ABS,membranes/MAR19104.ABS
4.75,5.0,fname,MAR19106.ABS,MAR19327.ABS,MAR19110.ABS,MAR19401.ABS,MAR19114.ABS
4.75,5.0,fpath,membranes/MAR19106.ABS,membranes/MAR19327.ABS,membranes/MAR19110.ABS,membranes/MAR19401.ABS,membranes/MAR19114.ABS
4.75,12.0,fname,,MAR19333.ABS,,MAR19397.ABS,
4.75,12.0,fpath,,membranes/MAR19333.ABS,,membranes/MAR19397.ABS,
4.75,16.0,fname,MAR19136.ABS,MAR19351.ABS,MAR19140.ABS,MAR19385.ABS,MAR19144.ABS
4.75,16.0,fpath,membranes/MAR19136.ABS,membranes/MAR19351.ABS,membranes/MAR19140.ABS,membranes/MAR19385.ABS,membranes/MAR19144.ABS


## !> Choose Global Trim Params

Next, trim and shift parameters need to be chosen to be applied to all trials. Use the widget produced by the cell below to demo trim parameters and shift-factors for different systems. 

In [17]:
# extract only the full-path information
dfNSORTPath = dfNSORT.xs('fpath',level='datatype',axis=1)

plt.figure(figsize=(6,3))
tp =  typySANS.TrimPlot(dfNSORTPath)
tp.run_widget()

FigureCanvasNbAgg()

VBox(children=(HBox(children=(Dropdown(description='System:', options=('Dow 06001 100 D2O', 'Dow 06001 Dry', '…
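Conceptually, trimming drops a number of points from the low-q and high-q ends of each configuration's curve before the curves are combined. How `typySANS.TrimPlot` stores its trim parameters is not shown in this notebook, so the sketch below simply assumes a pair of point counts per curve; `trim_curve` is a hypothetical helper.

```python
import numpy as np

def trim_curve(q, I, n_lo, n_hi):
    """Drop n_lo points from the low-q end and n_hi points from the high-q end."""
    stop = len(q) - n_hi
    return q[n_lo:stop], I[n_lo:stop]

# Synthetic curve with ten points
q = np.linspace(0.01, 0.10, 10)
I = 1.0 / q**2

q_trim, I_trim = trim_curve(q, I, n_lo=2, n_hi=3)
print(len(q), '->', len(q_trim), 'points')
```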

## >> Check Shift Factors

Ensure that the shift factors below make sense for all systems/configurations. Ideally, the factors should be between 0.95 and 1.05.

In [98]:
df_trim = tp.df_trim
shiftConfig = eval(tp.shift_config.value)

shifts=[]
for label,df in dfNSORTPath.iterrows():
    df_xy = []
    index = []
    for config,fpath in df.items():  # Series.iteritems() was removed in pandas 2.0
        if pd.isna(fpath):
            continue
        index.append(config)
        sdf = typySANS.readABS(fpath)[0]
        df_xy.append(sdf.set_index('q',drop=False)[['q','I','dI']])
    df_xy = pd.Series(df_xy,index=pd.MultiIndex.from_tuples(index))
    df_xy = df_xy.sort_index(axis=0)
    
    dfShift = typySANS.buildShiftTable(df_xy,df_trim,shiftConfig)
    dfShift.name = label
    # shifts.append(dfShift.values)
    shifts.append(dfShift)
    
df_shift = pd.concat(shifts,axis=1).T
# df_shift = pd.DataFrame(shifts,index=dfNSORTPath.index,columns=dfNSORTPath.columns)
df_shift = df_shift.sort_values(by=dfNSORTPath.columns.tolist(),axis=0)
df_shift

SDD,1.15,4.75,4.75
LAM,5.0,5.0,16.0
Dow 07 D2O Soln 10CB,1.0,1.0,1.0
Dow 07 D2O Soln 6ROT,1.0,1.0,1.0
Dow 07 dTHF Soln 10CB,1.0,1.0,1.0
Dow 07 dTHF Soln 6ROT,1.0,1.0,1.0
Dow 08 D2O Soln 10CB,1.0,1.0,1.0
Dow 08 D2O Soln 6ROT,1.0,1.0,1.0
Dow 08 dTHF Soln 10CB,1.0,1.0,1.0
Dow 08 dTHF Soln 6ROT,1.0,1.0,1.0
SWC 04 D2O Soln 10CB,1.0,1.0,1.0
SWC 04 D2O Soln 6ROT,1.0,1.0,1.0
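To give a feel for what a shift factor represents, the sketch below estimates the multiplicative factor that brings one curve onto another within their overlapping q-range. It is one simple approach on synthetic data and is not necessarily what `typySANS.buildShiftTable` does; `estimate_shift` is a hypothetical helper, and it assumes strictly positive intensities so that logarithms can be taken.

```python
import numpy as np

def estimate_shift(q_ref, I_ref, q_other, I_other):
    """Scale factor that brings I_other onto I_ref in the overlapping q-range."""
    q_min = max(q_ref.min(), q_other.min())
    q_max = min(q_ref.max(), q_other.max())
    mask = (q_other >= q_min) & (q_other <= q_max)
    if not mask.any():
        raise ValueError('curves do not overlap in q')
    # Interpolate the reference onto the other curve's q-points (in log-log space)
    log_I_ref = np.interp(np.log(q_other[mask]), np.log(q_ref), np.log(I_ref))
    # Geometric mean of the point-by-point intensity ratio
    return float(np.exp(np.mean(log_I_ref - np.log(I_other[mask]))))

# Two synthetic configurations following the same power law, with a 3 % mismatch
q_low = np.geomspace(0.003, 0.05, 40)
q_high = np.geomspace(0.02, 0.4, 40)
I_low = q_low**-2.0
I_high = 1.03 * q_high**-2.0

shift = estimate_shift(q_low, I_low, q_high, I_high)
print(f'shift factor: {shift:.4f}')
```

A factor far from 1 would suggest that the curves disagree in the overlap region by more than a small calibration difference, which is worth investigating before combining them.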


## >> Write all ABS Files

In [99]:
AUTONSORT_path = ABS_path / 'AUTONSORTED'
AUTONSORT_path.mkdir(exist_ok=True)
    
for label,sdfABS in dfNSORTPath.iterrows():
    sdfShift = df_shift.loc[label]
    fname = label.strip() + '.ABS'
    print('--> Writing {}'.format(AUTONSORT_path/fname))
    typySANS.writeABS(fname,sdfABS,sdfShift,shiftConfig,df_trim,path=AUTONSORT_path,shift=True)

--> Writing membrane_solutions/AUTONSORTED2/Dow 07 D2O Soln 10CB NoShift.ABS
--> Writing membrane_solutions/AUTONSORTED2/Dow 07 D2O Soln 6ROT NoShift.ABS
--> Writing membrane_solutions/AUTONSORTED2/Dow 07 dTHF Soln 10CB NoShift.ABS
--> Writing membrane_solutions/AUTONSORTED2/Dow 07 dTHF Soln 6ROT NoShift.ABS
--> Writing membrane_solutions/AUTONSORTED2/Dow 08 D2O Soln 10CB NoShift.ABS
--> Writing membrane_solutions/AUTONSORTED2/Dow 08 D2O Soln 6ROT NoShift.ABS
--> Writing membrane_solutions/AUTONSORTED2/Dow 08 dTHF Soln 10CB NoShift.ABS
--> Writing membrane_solutions/AUTONSORTED2/Dow 08 dTHF Soln 6ROT NoShift.ABS
--> Writing membrane_solutions/AUTONSORTED2/SWC 04 D2O Soln 10CB NoShift.ABS
--> Writing membrane_solutions/AUTONSORTED2/SWC 04 D2O Soln 6ROT NoShift.ABS
--> Writing membrane_solutions/AUTONSORTED2/SWC 04 dTHF Soln 10CB NoShift.ABS
--> Writing membrane_solutions/AUTONSORTED2/SWC 04 dTHF Soln 6ROT NoShift.ABS


## !> Check AUTO-NSORTED ABS Files

In [90]:
AUTONSORT_path = ABS_path / 'AUTONSORTED'
AUTO_ABS_PATH = list(AUTONSORT_path.glob('*ABS'))
MABS = typySANS.MultiPlotABS(AUTO_ABS_PATH)
MABS.run_widget()

VBox(children=(HBox(children=(VBox(children=(SelectMultiple(layout=Layout(width='400px'), options=('Dow 07 D2O…
