# Intro


## Goal
**WHAT**: Automatic report generation from Hamilton measurements.  
**WHY**: Speed up report generation and avoid human errors (copying data, subjective evaluation, ...).

## Tools
Fast iteration in an agile way.  
Generic approach: different plate setups, parameters, ... all handled by the same code, no changes needed.  

**Python** programming language.  
**Jupyter** notebook is currently used, with some functions split into small modules.  
**Visual Studio Code** IDE (Integrated Development Environment).  
**Markdown** (*.md) format for the generated report (simple, human-readable).  

## Input:
 - Worklist file path (*.xls) as used for Hamilton input.
   - Sample name
   - Dilution
   - Viscosity
 - Measurement results file path (*.xls) as output from Hamilton.
 - Parameters; constants in code (file path *.json)
   - CV (Coefficient of variation) threshold
   - Reference value (1.7954e+10 cp/ml)
   - Dilutions [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
   - Decimal digits for output
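
A minimal sketch of how such a parameters file might be read, assuming a flat JSON object. The key names (`cv_threshold`, `reference_value`, `dilutions`, `decimal_digits`), the 0.2 threshold, and the `load_params` helper are hypothetical and only illustrate the idea; they are not the project's `read_params`.

```python
import json

# Hypothetical parameters file content; key names are illustrative only.
EXAMPLE_PARAMS_JSON = """
{
    "cv_threshold": 0.2,
    "reference_value": 1.7954e+10,
    "dilutions": [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0],
    "decimal_digits": 3
}
"""

REQUIRED_KEYS = ("cv_threshold", "reference_value", "dilutions", "decimal_digits")


def load_params(text):
    """Parse the JSON text and fail early if a required key is missing."""
    params = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise KeyError(f"missing parameter(s): {missing}")
    return params


params = load_params(EXAMPLE_PARAMS_JSON)
print(params["reference_value"], params["dilutions"])
```

Failing early on a missing key keeps a misconfigured run from producing a report with silent defaults.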

## Output:
  - Report (*.md, printable to pdf)
    - Could be manually edited
    - Image files
    - Result sheets
  - Estimated size <2kB (current)

## Done
  - Invalid sample:
    - CV >THRESHOLD
    - Only one point
    - Parameters file (*.csv, *.json)
  - Multiple plates (in worklist file)
  - Modules
  - Running modes
    - Python script - automatic run (command line with parameters)
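
The invalid-sample rule from the list above can be sketched as follows, assuming CV is the sample standard deviation divided by the mean. The helper names and the 0.2 threshold are hypothetical and used only for illustration.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def is_valid_sample(values, cv_threshold):
    """Invalid when there are fewer than two points or CV exceeds the threshold."""
    if len(values) < 2:
        return False
    return coefficient_of_variation(values) <= cv_threshold


print(is_valid_sample([10, 11, 14], 0.2))  # True
print(is_valid_sample([10], 0.2))          # False (only one point)
print(is_valid_sample([10, 30], 0.2))      # False (CV above threshold)
```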

## TODO:
  - Finalize the report
    - 2 decimal places
  - Running modes
    - GUI; use modules to create an App (code remains the same, but used from GUI)
  - Tests (unit, integration)
  - checksum (*.sdax); put into report
  - Extensive testing...
  - Automatic print to *.pdf ?
  - md2pdf

## Conclusion
End-to-end evaluation time reduced from approximately 2 h to 20 min per measurement. (thx Felix)


# Generate report  - POC

[AAV9 data folder](<../../Users/hwn6193/OneDrive - Takeda/General - Gene Therapy Analytics (AD+PA)/3_Teams/3.1_Protein_Quantification/_AAV9 Capsid ELISA>)

## Review bugs
### TODO

### Fixed
- mask sample point(s) if `CV>CV_THRESHOLD` and `valid sample_points <= MIN_VALID_SAMPLE_POINTS` (Igor)
- `CV[%]` one `{:.1f}` decimal digit (Felix)
- `Result [cp/ml]` three `{:.3e}` (Felix)
- `nan` -> `NA` (Felix)
- control sample image line ending (Sebastian)
- `CV[%]` column format to 2 decimal digits with trailing zeroes (Sebastian/Robert)
- Fit parameter description https://teams.microsoft.com/l/message/19:4ba886dcae16442f802adcc65edc04bb@thread.v2/1688557386620?context=%7B%22contextType%22%3A%22chat%22%7D (Felix)
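
The formatting fixes above can be sketched as two small helpers. This assumes the later two-decimal `CV[%]` format supersedes the earlier one-decimal one; `fmt_cv` and `fmt_result` are hypothetical names, not functions from the project modules.

```python
import math


def fmt_cv(cv_percent):
    """Format CV[%] with two decimal digits, keeping trailing zeroes."""
    return "NA" if math.isnan(cv_percent) else f"{cv_percent:.2f}"


def fmt_result(value, digits=3):
    """Format Result [cp/ml] in scientific notation with `digits` decimals."""
    return "NA" if math.isnan(value) else f"{value:.{digits}e}"


print(fmt_cv(12.5))              # 12.50
print(fmt_result(1.7954e10))     # 1.795e+10
print(fmt_result(float("nan")))  # NA
```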

## Imports

In [None]:
VERBOSE_NOTEBOOK = False
WARNING_DISABLE = True
DEBUG = False

In [None]:
from os import path
import warnings
from scipy.optimize import OptimizeWarning

if WARNING_DISABLE:
    warnings.simplefilter('ignore', RuntimeWarning)
    warnings.simplefilter('ignore', OptimizeWarning)
    warnings.filterwarnings('ignore', category=UserWarning, module='openpyxl')

In [None]:
from mkinout import make_input_paths
WORKING_DIR = './reports/input/'
BASE_NAME = '230426_GN004240-033_-_'

# WORKING_DIR = './reports/all/230628_AAV9-ELISA_sey_GN004240-046'
# BASE_NAME = '230628_GN004240-046_-_'

input_files = make_input_paths(WORKING_DIR, BASE_NAME)
WORKLIST_FILE_PATH = input_files['worklist']
PARAMS_FILE_PATH = input_files['params']

DATA_DIR = './data'

## Layouts

In [None]:
from readdata import read_layouts

PLATE_LAYOUT_ID = 'plate_layout_ident.csv'
PLATE_LAYOUT_NUM = 'plate_layout_num.csv'
PLATE_LAYOUT_DIL_ID = 'plate_layout_dil_id.csv'


g_lay = read_layouts(path.join(DATA_DIR, PLATE_LAYOUT_ID),
                     path.join(DATA_DIR, PLATE_LAYOUT_NUM),
                     path.join(DATA_DIR, PLATE_LAYOUT_DIL_ID))

if VERBOSE_NOTEBOOK:
    display(g_lay)

## Worklist

In [None]:
from worklist import read_worklist, check_worklist
from readdata import read_params

g_wl_raw = read_worklist(WORKLIST_FILE_PATH)
g_valid_plates = check_worklist(g_wl_raw)
g_params = read_params(PARAMS_FILE_PATH)

## Dilution to Concentration

Define the dilution dataframe. The dataframe is indexed according to the plate layout; the index of the reference dataframe corresponds to the reference entries of `plate_layout_dil`.
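
A minimal sketch of the dilution-to-concentration step, assuming each concentration is the reference value divided by the dilution factor and that the index is 1-based. The `sketch_concentration` helper is hypothetical; the actual `make_concentration` in the `sample` module may differ.

```python
import pandas as pd

REF_VAL_MAX = 1.7954e+10
DILUTIONS = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]


def sketch_concentration(ref_val_max, dilutions):
    """Return one row per dilution step with its assumed concentration."""
    df = pd.DataFrame({"dilution": dilutions})
    # Assumption: concentration = reference value / dilution factor.
    df["concentration"] = ref_val_max / df["dilution"]
    # Assumption: 1-based index matching the reference numbering of the layout.
    df.index = range(1, len(dilutions) + 1)
    return df


conc = sketch_concentration(REF_VAL_MAX, DILUTIONS)
print(conc)
```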

In [None]:
# TODO: read reference value from parameters
REF_VAL_MAX = 1.7954e+10
DILUTIONS = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]

from sample import make_concentration
g_reference_conc = make_concentration(REF_VAL_MAX, DILUTIONS)

if VERBOSE_NOTEBOOK:
    display(g_reference_conc)

## Report generation

In [None]:
from reportmain import report_plate, check_report_crc
from mkinout import make_output_paths

def gen_report(valid_plates, worklist, params, layout, reference_conc,
               working_dir, base_name):
    reports = []
    for plate in valid_plates:
        print('Processing plate {} of {}'.format(plate, len(valid_plates)))

        output_files = make_output_paths(working_dir, base_name, plate)
        result_file_path = output_files['results']
        report_file_path = output_files['report']
        report_dir = path.dirname(path.abspath(report_file_path))
        reports.append(report_plate(plate, worklist, params, layout,
                    reference_conc, result_file_path, report_dir, report_file_path
                    ))
    return reports


reports = gen_report(g_valid_plates, g_wl_raw, g_params, g_lay, g_reference_conc, WORKING_DIR, BASE_NAME)

In [None]:
CHECK_REPORT_CRC = True
REPORT_PLATES_CRC = [985937237, 3856888741]
if CHECK_REPORT_CRC:
    for report, crc in zip(reports, REPORT_PLATES_CRC):
        check_report_crc(report, crc)
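
A checksum comparison of this kind might look like the sketch below. It assumes the checksum is a CRC-32 over the report file bytes (the expected values above fit in 32 bits, but the algorithm is not confirmed here); `file_crc32` and `sketch_check_crc` are hypothetical helpers, not the `check_report_crc` from `reportmain`.

```python
import os
import tempfile
import zlib


def file_crc32(file_path):
    """Compute the CRC-32 of a file's bytes as an unsigned 32-bit integer."""
    crc = 0
    with open(file_path, "rb") as f:
        # Read in chunks so large reports do not need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def sketch_check_crc(file_path, expected_crc):
    """Raise if the file's CRC-32 differs from the expected value."""
    actual = file_crc32(file_path)
    if actual != expected_crc:
        raise ValueError(f"CRC mismatch for {file_path}: {actual} != {expected_crc}")


# Demonstration on a temporary file standing in for a generated report.
with tempfile.NamedTemporaryFile("wb", suffix=".md", delete=False) as tmp:
    tmp.write(b"# Report\n")
demo_path = tmp.name
demo_crc = file_crc32(demo_path)
sketch_check_crc(demo_path, demo_crc)  # matching CRC: no exception
os.remove(demo_path)
```

Comparing against stored checksums acts as a cheap regression test: any change in the generated report text changes the CRC.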

Use pandoc to convert markdown to Word.
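
The conversion can also be driven from Python instead of a shell escape. This is a sketch assuming `pandoc` is available on `PATH`; the helper names and the file names are hypothetical.

```python
import shutil
import subprocess


def pandoc_docx_command(md_path, docx_path, pandoc="pandoc"):
    """Build the argument list for a Markdown-to-docx conversion."""
    return [pandoc, "-f", "markdown", "-t", "docx", "-o", docx_path, md_path]


def convert_to_docx(md_path, docx_path):
    """Run pandoc if it is on PATH; return False when it is not installed."""
    pandoc = shutil.which("pandoc")
    if pandoc is None:
        return False
    subprocess.run(pandoc_docx_command(md_path, docx_path, pandoc), check=True)
    return True


print(pandoc_docx_command("report.md", "output.docx"))
```

Passing the arguments as a list avoids shell quoting problems with report paths that contain spaces.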

In [None]:
# ! c:/work/pandoc/pandoc -o output.docx -f markdown -t docx {reports[0]}

# DEBUG

In [None]:
import pandas as pd
import numpy as np
  
# Initialize data to Dicts of series.
d = {'concentration': pd.Series([10, 11, 7, 14],
                      index=[0, 1, 2, 3]),
    'mask': pd.Series([np.nan, np.nan, '<8', np.nan],
                      index=[0, 1, 2, 3])}
  
# creates Dataframe.
dfd = pd.DataFrame(d)
  
# print the data.
dfd

In [None]:
from scipy.stats import variation
from itertools import combinations

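def mask_sample_cv(df_in, valid_pts, cv_threshold):
    """Find the largest subset of unmasked points with CV below `cv_threshold`.

    Returns (indices to mask, indices to keep, CV of the kept subset).
    If no subset qualifies, every unmasked point is returned for masking.
    NOTE: `valid_pts` is currently unused.
    """
    # Only consider points that are not masked yet
    df = df_in[df_in['mask'].isna()]
    display(df)
    cv_min = cv_threshold # variation(df['concentration'], ddof=1)
    non_mask_idx = []
    indices = df.index
    # Try the largest subsets first to stop as soon as `CV` < `cv_threshold`
    for size in reversed(range(2, len(indices) + 1)):
        for subset in combinations(indices, size):
            comb = list(subset)
            t = df.loc[comb]
            display(t)
            cv = variation(t['concentration'], ddof=1)
            print(comb, cv)
            # Keep the subset with the lowest CV seen so far
            if cv < cv_min:
                non_mask_idx = comb
                cv_min = cv
                print(f'!!! min {cv}')
        # break if CV drops below threshold
        if cv_min < cv_threshold:
            break

    # Everything not in the kept subset gets masked
    mask_idx = list(set(indices).symmetric_difference(non_mask_idx))
    return mask_idx, non_mask_idx, cv_min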

display(dfd['mask'].isna())
m_idx,_,_ = mask_sample_cv(dfd, 2, 0.2)
display(m_idx)
dfd.loc[m_idx, ['mask']] = "cv-masked"
display(dfd)

In [None]:
idx = pd.MultiIndex.from_product([['A'],
                                  [1, 2, 3, 4]],
                                 names=['col', 'row'])
col = ['concentration', 'mask']

dfm = pd.DataFrame([(10,np.nan) ,(11,np.nan),(6,'<8'),(16,np.nan)], idx, col)
display(dfm)

# display(dfm['mask'].isna())
m_idx,_,_ = mask_sample_cv(dfm, 2, 0.2)
display(m_idx)
dfm.loc[m_idx, ['mask']] = "cv-masked"
display(dfm)

In [None]:
DIGITS = 3
'{0} {1:.{dgts}e}'.format(DIGITS, 1.234572, dgts=DIGITS)