# Script Description

* This script is a preprocessing step for the comparative analysis of baseline fitting algorithms.
* `2.dsp-executer.ipynb` preprocesses the following two kinds of data:
    * It outputs the `DSP2.1` results for the CFX Data (RFU Baseline Subtracted by CFX Manager) `parquets` preprocessed by `1.cfx-to-parquet-converter.ipynb`.
    * It outputs the `DSP2.1` results for the Raw Sample Data `parquets` preprocessed by `1.cfx-to-parquet-converter.ipynb`.

# Preprocessing for the DSP2.1 Computation

## [2024-02-19]

## 1. Package Import

In [1]:
# Config
import pydsptools.config.pda as pda
import pydsptools.biorad as biorad

# Analysis Preparation
import polars as pl
import pandas as pd
import numpy as np
from scipy.stats import chi2 # https://en.cppreference.com/w/cpp/numeric/random/chi_squared_distribution

# DSP Processing
import pydsp.run.worker
import pydsptools.biorad.parse as bioradparse

# PreProcessing
import pprint
import pyarrow as pa
import os
import shutil
import subprocess
import pathlib as Path
import psutil # memory checking
from package.preprocess import (check_memory_status,
                                get_disk_usage)
# Visualization
import pydsptools.plot as dspplt
import plotly.express as px
import matplotlib.pyplot as plt


In [2]:
# Son's Module
import pydsptools.utils
import pydsptools.biorad.parse as pdabioradparse
import pydsptools.plot as dspplt

import sys
import yaml
import fsspec

from dataclasses import dataclass
from numpy import diff
import seaborn as sns
import matplotlib.patches as patches
import scipy.io as sio
import math
import plotly.express as px
import plotly.graph_objects as go

from sklearn import metrics
from sklearn.metrics import ConfusionMatrixDisplay

parent_dir = os.path.abspath('../hsson1')
sys.path.append(parent_dir)

## Import the entire module
import source.general_derivative_analysis
import source.derivative_peak_analysis_ver1_9
import source.baseline_estimation_ver1_3
import source.calculate_ct_ver1_1
import source.plot_analysis

import importlib

## Reload the module to reflect any changes
importlib.reload(source.general_derivative_analysis)
importlib.reload(source.derivative_peak_analysis_ver1_9)
importlib.reload(source.baseline_estimation_ver1_3)
importlib.reload(source.calculate_ct_ver1_1)
importlib.reload(source.plot_analysis)

## Now, import the specific functions from the reloaded module
from source.general_derivative_analysis import (moving_average,
                                                compute_derivative_LSR,
                                                compute_max_deriv_data,
                                                plot_max_deriv_data,
                                                compute_max_deriv_smoothed_data,
                                                plot_max_deriv_smoothed_data)
from source.derivative_peak_analysis_ver1_9 import (get_peak_groups,
                                             calculate_crossing_points_for_data,
                                             get_peak_properties_for_row,
                                             plot_peak_properties,
                                             plot_derivative_baseline_modeling)
from source.baseline_estimation_ver1_3 import (coder_scd_fitting,
                                        coder_section_rp2,
                                        adjusted_r2,
                                        operation_baseline_fitting,
                                        calculate_base_rfu,
                                        plot_baseline,
                                        plot_comparison)
from source.calculate_ct_ver1_1 import (compute_crossing_point,
                                        label_threshold_result,
                                        plot_data_threshold_crossing)
from source.plot_analysis import (plot_Signal,
                                  plot_2d_scatter,
                                  PGR_manager_plot)


## 2. Path Setup

In [3]:
# root_path = Path.Path.cwd() # local: /home/kmkim/pda/dsp-research-strep-a/kkm, PDA-pro: /home/jupyter-kmkim/dsp-research-strep-a/kkm
# prefix = 'data'

## 3. Create Directories for the DSP Results (Skipped)
The `config__dsp2_orig/basesub` directory must exist in order to obtain the autobaseline results. DSP2 appears to create it automatically, whereas DSP1 appears to require creating the directory by hand.
(Manager Kim Hyeong-gyu (김형규) plans to improve this.)
* `{parent path}/computed/{TESTNAME}/config__dsp1_orig/dsp`
* `{parent path}/computed/{TESTNAME}/config__dsp2_orig/basesub`
* `{parent path}/computed/{TESTNAME}/config__dsp2_orig/dsp`


In [3]:
# [Path.Path(f"./data/baseline-subtracted/computed/example1/config__dsp2_orig/{subdir}").mkdir(parents=True, exist_ok=True) for subdir in ["dsp", "basesub", "log"]]
# [Path.Path(f"./data/pda-raw-sample/computed/example1/config__dsp2_orig/{subdir}").mkdir(parents=True, exist_ok=True) for subdir in ["dsp", "basesub", "log"]]
# [Path.Path(f"./data/GI-B1/computed/example1/config__dsp2_orig/{subdir}").mkdir(parents=True, exist_ok=True) for subdir in ["dsp", "basesub", "log"]]

[None, None, None]

## 5. Running DSP

1) Running DSP with Docker

```
docker run -it --rm -v $(pwd):/code seegene/pydsp:2.1.0-alpha.1 python -m pydsp.run.worker multiple \
    -i /code/computed/example1/pcr_results \
    -c /code/config/yaml/PRJDS001/RP1/dsp1_orig.yml \
    -o /code/computed/example1/config__dsp1_orig
```

2) Running DSP with a Python script

```
!python -m pydsp.run.worker multiple \
    -i ./data/cfx-baseline-subtracted/computed/example1/pcr_results/ \
    -c config/yaml/PRJDS001/GI-B-I/dsp2_generic.yml \
    -o ./data/cfx-baseline-subtracted/computed/example1/config__dsp2_orig
```

`pydsp.run.worker` is the entry point that runs DSP. It takes three options:
* `-i`: path containing the input data
* `-c`: path to the DSP configuration file
* `-o`: path where the DSP results are written
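
The three options map directly onto an argument list. As a minimal sketch, assuming the `python -m pydsp.run.worker multiple` invocation shown in this notebook is the one in use, the command can be assembled and reviewed before it is launched. `build_dsp_command` is a hypothetical helper and is not part of `pydsp`.

```python
import shlex
import sys


def build_dsp_command(input_dir, config_path, output_dir):
    """Assemble the argument list for the CLI call shown in this notebook.

    Hypothetical helper: it only uses the `multiple` subcommand and the
    -i / -c / -o options that appear in the notebook itself.
    """
    return [
        sys.executable, "-m", "pydsp.run.worker", "multiple",
        "-i", str(input_dir),    # input data directory
        "-c", str(config_path),  # DSP configuration YAML
        "-o", str(output_dir),   # directory for the DSP results
    ]


cmd = build_dsp_command(
    "./data/cfx-baseline-subtracted/computed/example1/pcr_results/",
    "config/yaml/PRJDS001/GI-B-I/dsp2_generic.yml",
    "./data/cfx-baseline-subtracted/computed/example1/config__dsp2_orig",
)
print(shlex.join(cmd))  # review the command line before running it
# To launch it (requires pydsp to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the list first keeps paths with spaces intact, because no shell quoting is involved when the list is passed to `subprocess.run`.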

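A separate YAML configuration file exists for each product and each DSP version. Note that DSP1 does not output autobaseline results (nobody pointed this out, so it took several rounds of trial and error to discover).
Autobaseline results are only output when running DSP2, and in that case the output path must contain the directories `config__dsp2_orig/basesub`, `config__dsp2_orig/dsp`, and `config__dsp2_orig/log`.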
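
The directory requirement above can be met with a few lines of `pathlib`. The following is a sketch under the assumption that the three subdirectory names are exactly `dsp`, `basesub`, and `log`, as stated in the text; `ensure_dsp_output_dirs` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def ensure_dsp_output_dirs(output_root, subdirs=("dsp", "basesub", "log")):
    """Create the result subdirectories under one DSP output root.

    Hypothetical helper; the default subdirectory names come from the
    notebook text, not from the pydsp documentation.
    """
    root = Path(output_root)
    created = []
    for name in subdirs:
        target = root / name
        target.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(target)
    return created


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "computed" / "example1" / "config__dsp2_orig"
    dirs = ensure_dsp_output_dirs(root)
    print([d.name for d in dirs])  # ['dsp', 'basesub', 'log']
```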

The code below is run to obtain the following seven datasets.

* [After BPN] RFU: data obtained by transforming the original RFU with BPN
* [Modified DSP] Original ABSD: data obtained with modified configuration values so that absd-orig is also available for signals that would otherwise be dropped as negative; **final_ct may be inaccurate** for negative data. (final_ct inaccurate, configuration modified)
* [Auto] Baseline-Subtracted RFU: the result shown in the raw tab of pgr-manager (final_ct inaccurate, configuration modified)
* [CFX] Baseline-Subtracted RFU: the baseline-subtracted data from CFX Manager (final_ct inaccurate, configuration modified)
* [Strep] Baseline-Subtracted RFU (final_ct inaccurate, configuration modified)
* [Strep+n] Baseline-Subtracted RFU (final_ct inaccurate, configuration modified)
* [Control DSP] Baseline-Subtracted RFU (final_ct accurate, configuration intact)

### 1) Running DSP for `[After BPN] RFU` + `[DSP] Original ABSD`

- PDA-Raw-Sample Data: a subset of the GI-B-I 8-strip plates
    - plate_number_002, plate_number_005, plate_number_031, plate_number_032, plate_number_036, plate_number_041
    - Selection criterion: signals for which CFX Manager's baseline subtraction visibly underperforms (judged by eye)
    - Only the raw data needed for `[After BPN] RFU` is extracted here; BPN itself is applied in `3.baseline-analysis.ipynb`
    - `[DSP] Original ABSD` is included in the DSP output

### 2) Running DSP for `[CFX] Baseline-Subtracted RFU Data`

In [6]:
# Execution in the local Environment
!python -m pydsp.run.worker multiple \
    -i ./data/cfx-baseline-subtracted/computed/example1/pcr_results/ \
    -c config/yaml/PRJDS001/GI-B-I/dsp2_generic_research.yml \
    -o ./data/cfx-baseline-subtracted/computed/example1/config__dsp2_orig

/opt/tljh/user/bin/python: Error while finding module specification for 'pydsp.run.worker' (ModuleNotFoundError: No module named 'pydsp')
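
The error above means that `pydsp` is not importable by the interpreter behind `!python` in the local environment. A small pre-flight check, sketched below with the standard library only, makes that situation explicit before a long run is attempted. `module_available` is a hypothetical helper name.

```python
import importlib.util


def module_available(name):
    """Return True if `name` can be located by the current interpreter.

    For dotted names, find_spec imports the parent package first and raises
    ModuleNotFoundError when the parent is missing, so that case is caught.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        return False


if module_available("pydsp.run.worker"):
    print("pydsp is importable; the DSP run can proceed.")
else:
    print("pydsp is not importable here; run in the PDA-pro environment instead.")
```

Note that this checks the interpreter running the notebook kernel, which may differ from the `python` found on the shell `PATH` by `!python`.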


In [None]:
# Execution in the PDA-pro
pydsp.run.worker.multiple_tasks(
    "./data/cfx-baseline-subtracted/computed/example1/pcr_results/",        # Input directory
    "config/yaml/PRJDS001/GI-B-I/dsp2_generic_research.yml",                # Configuration
    "./data/cfx-baseline-subtracted/computed/example1/config__dsp2_orig",   # Output directory
    4,                                      # Number of processes
    is_verbose=True                         # Verbose mode
)

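### 3) Running DSP for `[Strep] Baseline-Subtracted RFU Data`

- Algorithm applied to the Strep assay = Multi-Amp, the algorithm for handling multiple amplification

1) Create a configuration yml file with Multi-Amp enabled (is_multiamp = 1)
   - Only needs to be done once

2) Run DSP
    - Change the configuration values to produce the multi-amplification algorithm results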

## Running DSP for the GI-B1 Half Data (plate_data_001~100)

- Visualize how each algorithm performs for each signal pattern
- The PDA-Raw-Sample Data (a subset of the GI-B-I 8-strip plates: plate_number_002, _005, _031, _032, _036, _041) is not sufficient
- Decided to run 100 plates

### [2024-04-19]

- Run DSP with the latest version of pydsp-pro
- **DSP computation**
    * INPUT_PARQUET_DIR : directory path of the PCR data parquet files
    * CONFIG_YML_PATH : path of the DSP config YAML file
    * TO_DSP_RESULT_DIR : directory path for the DSP result parquet files
    * Run DSP: call `pydsp.run.worker.multiple_tasks()`
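
Because every run in this section is defined by the same three paths plus a process count, they can be bundled and checked before the call. The following is only a sketch: `DspRunSpec` and its `problems` method are hypothetical names, and the checks reflect assumptions about what a usable run needs (an input directory with `.parquet` files and an existing config file).

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DspRunSpec:
    """Hypothetical bundle of the inputs used by the DSP cells in this notebook."""
    input_parquet_dir: Path
    config_yml_path: Path
    to_dsp_result_dir: Path
    n_processes: int = 4

    def problems(self):
        """Return a list of human-readable problems; an empty list means ready."""
        issues = []
        if not self.input_parquet_dir.is_dir():
            issues.append(f"input directory not found: {self.input_parquet_dir}")
        elif not any(self.input_parquet_dir.glob("*.parquet")):
            issues.append(f"no .parquet files in: {self.input_parquet_dir}")
        if not self.config_yml_path.is_file():
            issues.append(f"config file not found: {self.config_yml_path}")
        if self.n_processes < 1:
            issues.append("n_processes must be at least 1")
        return issues


spec = DspRunSpec(
    Path("./data/GI-B-I/GI-B-I-100/computed/pcr_results/"),
    Path("./config/yaml/PRJDS001/GI-B-I/dsp2_generic_config_no-MuDT.yml"),
    Path("./data/GI-B-I/raw_data/computed/dsp2_generic_config_no_MuDT"),
    n_processes=2,
)
for issue in spec.problems():
    print(issue)
# With no problems reported, the call used in this notebook would be:
# pydsp.run.worker.multiple_tasks(str(spec.input_parquet_dir),
#                                 str(spec.config_yml_path),
#                                 str(spec.to_dsp_result_dir),
#                                 spec.n_processes, is_verbose=False)
```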

#### 1-1) Running DSP for Raw Data with `No MuDT` +`DRFU=0` + `BPN_rv=0` + `fb=0`

In [None]:
INPUT_PARQUET_DIR = "./data/GI-B-I/GI-B-I-100/computed/pcr_results/"
CONFIG_YML_PATH = "./config/yaml/PRJDS001/GI-B-I/dsp2_generic_config_no-MuDT.yml"
TO_DSP_RESULT_DIR = "./data/GI-B-I/raw_data/computed/dsp2_generic_config_no_MuDT"

pydsp.run.worker.multiple_tasks(
    INPUT_PARQUET_DIR, # Input directory
    CONFIG_YML_PATH,   # Configuration
    TO_DSP_RESULT_DIR,     # Output directory
    2,                 # Number of processes
    is_verbose=False    # Verbose mode
)

2024-05-08 04:11:41,320 - 120598 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/2. admin_2017-05-22 14-06-05_BR101459 GI9801XY MOM 소량 반제품1 민감도(Vpara, CdB, Vchol).parquet
2024-05-08 04:11:41,354 - 120598 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/4. admin_2017-05-22 14-38-11_BR101644 GI9801XY MOM 소량 반제품1 민감도(Ahyd, Avero, Sbong).parquet
2024-05-08 04:11:41,366 - 120598 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/5. admin_2017-05-22 14-40-33_BR101645 GI9801XY MOM 소량 반제품1 민감도(Styphi, IC).parquet
2024-05-08 04:11:41,376 - 120598 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/admin_2015-02-23 02-32-41_BR100160_Allplex GI_B1_TCF_specificity-1.parquet
2024-05-08 04:11:41,386 - 120598 - INFO - Check entry /home/jupyter-kmkim/dsp-research-st

#### 1-2) Running DSP for `MuDT` +`(-) DRFU` + `[DSP] Original ABSD`

In [9]:
CONFIG_YML_PATH = "./config/yaml/PRJDS001/GI-B-I/config__dsp2_generic_research.yml"
TO_DSP_RESULT_DIR = "./data/GI-B-I/GI-B-I-100/computed/config_dsp2_generic_research"

pydsp.run.worker.multiple_tasks(
    INPUT_PARQUET_DIR, # Input directory
    CONFIG_YML_PATH,   # Configuration
    TO_DSP_RESULT_DIR,     # Output directory
    4,                 # Number of processes
    is_verbose=False    # Verbose mode
)

2024-05-07 00:13:53,568 - 6012 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/2. admin_2017-05-22 14-06-05_BR101459 GI9801XY MOM 소량 반제품1 민감도(Vpara, CdB, Vchol).parquet
2024-05-07 00:13:53,568 - 6012 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/2. admin_2017-05-22 14-06-05_BR101459 GI9801XY MOM 소량 반제품1 민감도(Vpara, CdB, Vchol).parquet
2024-05-07 00:13:53,568 - 6012 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/2. admin_2017-05-22 14-06-05_BR101459 GI9801XY MOM 소량 반제품1 민감도(Vpara, CdB, Vchol).parquet
2024-05-07 00:13:53,587 - 6012 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/4. admin_2017-05-22 14-38-11_BR101644 GI9801XY MOM 소량 반제품1 민감도(Ahyd, Avero, Sbong).parquet
2024-05-07 00:13:53,587 - 6012 - INFO - Check entry /home/jupyter-kmkim/dsp

1

#### 2) Running DSP for `[CFX] Baseline-Subtracted RFU Data`

- Already run

#### 3) Running DSP for `[Strep] Baseline-Subtracted RFU Data`

In [8]:
CONFIG_YML_PATH = "../pro/config/240213_multiamp_research.yml"
TO_DSP_RESULT_DIR = "./data/GI-B-I/strep/computed/config__240213_multiamp_research"

pydsp.run.worker.multiple_tasks(
    INPUT_PARQUET_DIR, # Input directory
    CONFIG_YML_PATH,   # Configuration
    TO_DSP_RESULT_DIR,     # Output directory
    4,                 # Number of processes
    is_verbose=True    # Verbose mode
)


#### 4) Running DSP for `[Auto] Baseline-Subtracted RFU Data`

In [9]:
CONFIG_YML_PATH = "./config/yaml/PRJDS001/GI-B-I/dsp2_generic_research_auto.yml"
TO_DSP_RESULT_DIR = "./data/GI-B-I/auto/computed/config__dsp2_generic_research_auto"

pydsp.run.worker.multiple_tasks(
    INPUT_PARQUET_DIR, # Input directory
    CONFIG_YML_PATH,   # Configuration
    TO_DSP_RESULT_DIR,     # Output directory
    4,                 # Number of processes
    is_verbose=True    # Verbose mode
)


#### 5) Running DSP for `[Strep+1] Baseline-Subtracted RFU Data`

In [14]:
data_path = './data/GI-B-I/GI-B-I-100/computed/config__dsp2_generic_research_auto/dsp/'
# Tag if we are running MC Simulation or Real Data?
#MC_or_Real = 'MC'
MC_or_Real = 'Real'
def get_all_parquet_files(data_path):
    # get all parquet files from that directory
    files = [f for f in os.listdir(data_path) if f.endswith('.parquet')]
    
    # create an empty list to store dataframes
    df_list = []
    
    # read each parquet file and append it to the list
    for file in files:
        df = pd.read_parquet(data_path + file)
        df_list.append(df)

    # concatenate all dataframes in the list
    dsp_result_df = pd.concat(df_list, ignore_index=True)

    return dsp_result_df
dsp_result_df = get_all_parquet_files(data_path)


# Consumable, well type, channel, and temperature settings
#Consumable = "8-strip"
#Welltype = "Sample"
#Channel = "HEX"
#Temperature = 'High'

#if Temperature == 'Low':
#    TM = 0
#elif Temperature == 'High':
#    TM = 1

# DataFrame produced with the single-amp configuration
df = dsp_result_df

# Build x from the list in the original_rfu column of an arbitrary row (x is identical for every row and starts at 1)
# x = [i for i in range(1, len(dsp_result_df.loc[2]['original_rfu'])+1)] -> fixed at 45 values

df['new_jump_corrected_rfu'] = df.apply(lambda row: [a - b for a, b in zip(row['preproc_rfu'], row['analysis_rd_diff'])], axis=1)

DFM = 30 # Threshold on the derivative maximum (applies to all amplifications)
DFC = 50 # Threshold on the derivative maximum (raised criterion for early-amplification signals)

# Derivative diff threshold: threshold at which the second derivative crosses from positive to negative, i.e. the point just before a peak forms.
DDT = 5
# First Derivative Peak threshold (used to analyze valid peaks)
FDPT = 25
# EFC threshold: threshold used to compute EFC in the derivative analysis
# [Caution] FDPT must be greater than EFCT.
EFCT = 15 # This value needs tuning; 15 has worked well so far.
columns_to_assign = [
    None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 
    'new_efc0', None, None, 'derivative_maximum', None,
    None, None, None, None, 'derivative_baseline', None, None, None, None, None, None
]

# Assuming df is your DataFrame and get_peak_properties_for_row is correctly defined
results = df.apply(lambda row: get_peak_properties_for_row(row, 'new_jump_corrected_rfu', DDT, FDPT, EFCT), axis=1)

# Now assign the results to the DataFrame columns where applicable
for idx, col_name in enumerate(columns_to_assign):
    if col_name is not None:  # This checks if a column name has been provided
        df[col_name] = [res[idx] if res is not None else None for res in results]

# Create the 'is_early_amp' column
df['is_early_amp'] = np.where(df['new_efc0'] < 13, 1, 0)

# baseline modeling
SFC = 4
df = calculate_base_rfu(df, SFC, 'new_jump_corrected_rfu', DFM)
df['new_absd'] = [np.array(row['new_jump_corrected_rfu']) - np.array(row['new_baseline']) for index, row in df.iterrows()]

# make the df['derivative_baseline'] column homogeneous to export the dataframe df into a parquet
df['derivative_baseline'] = df['derivative_baseline'].apply(lambda x: np.array([x]) if not isinstance(x, np.ndarray) else x)

# export df
df.to_parquet('./data/GI-B-I/strep_plus1/dsp2_strep-assay-plus1.parquet')

Unnamed: 0,name,steps,pcr_system,consumable,welltype,well,channel,temperature,original_rfu,has_melt,...,final_dataprocnum,final_ct,setval_is_multiamp,analysis_deriv_lsr,analysis_cq,postproc_cq,ctalk_cq,final_cq,experiment_name,plate_name
0,admin_2015-03-11 15-35-24_CC015842_Allplex GI ...,"[4, 5]",,8-strip,Sample,A01,HEX,Low,"[3353.95186057254, 3345.93540829, 3340.4045094...",0,...,0,27.555666,0,0.0,,,,,MY EXPERIMENT,PLATE_NUM_001_100
1,admin_2015-03-11 15-35-24_CC015842_Allplex GI ...,"[4, 5]",,8-strip,Sample,A01,HEX,High,"[3581.28236764036, 3583.21549558755, 3574.2886...",0,...,8,-1.000000,0,0.0,,,,,MY EXPERIMENT,PLATE_NUM_001_100
2,admin_2015-03-11 15-35-24_CC015842_Allplex GI ...,"[4, 5]",,8-strip,Sample,A01,Cal Red 610,Low,"[6169.79894808816, 6127.3424965504, 6113.79868...",0,...,5,-1.000000,0,0.0,,,,,MY EXPERIMENT,PLATE_NUM_001_100
3,admin_2015-03-11 15-35-24_CC015842_Allplex GI ...,"[4, 5]",,8-strip,Sample,A01,Cal Red 610,High,"[6132.98502148135, 6127.82623545015, 6127.6832...",0,...,4,-1.000000,0,0.0,,,,,MY EXPERIMENT,PLATE_NUM_001_100
4,admin_2015-03-11 15-35-24_CC015842_Allplex GI ...,"[4, 5]",,8-strip,Sample,A01,FAM,Low,"[9430.35766948379, 9399.53418917517, 9392.6687...",0,...,5,-1.000000,0,0.0,,,,,MY EXPERIMENT,PLATE_NUM_001_100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
63507,admin_2015-05-11 17-45-31_CC013751_18D_PM,"[4, 5]",,8-strip,NC,G10,Cal Red 610,High,"[7567.64282180831, 7560.59491197985, 7549.6078...",0,...,13,-1.000000,0,0.0,,,,,MY EXPERIMENT,PLATE_NUM_001_100
63508,admin_2015-05-11 17-45-31_CC013751_18D_PM,"[4, 5]",,8-strip,NC,G10,FAM,Low,"[13943.442947638, 13893.8277504585, 13847.6944...",0,...,13,-1.000000,0,0.0,,,,,MY EXPERIMENT,PLATE_NUM_001_100
63509,admin_2015-05-11 17-45-31_CC013751_18D_PM,"[4, 5]",,8-strip,NC,G10,FAM,High,"[13418.2363947505, 13412.5975986545, 13379.481...",0,...,13,-1.000000,0,0.0,,,,,MY EXPERIMENT,PLATE_NUM_001_100
63510,admin_2015-05-11 17-45-31_CC013751_18D_PM,"[4, 5]",,8-strip,NC,G10,Quasar 670,Low,"[7576.97980467324, 7565.75992022617, 7550.8423...",0,...,6,-1.000000,0,0.0,,,,,MY EXPERIMENT,PLATE_NUM_001_100
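
The `columns_to_assign` list in the cell above encodes, by position, which elements of the value returned by `get_peak_properties_for_row` become DataFrame columns. The same mapping can be written as a dictionary keyed by position, which is easier to read and to keep in sync. The sketch below uses a stand-in tuple of length 31 (the length of the list); the real return values are not shown in this notebook, and `pick_named_values` is a hypothetical helper.

```python
# Positions (0-based) read off the columns_to_assign list in this notebook.
COLUMN_BY_POSITION = {
    15: "new_efc0",
    18: "derivative_maximum",
    24: "derivative_baseline",
}


def pick_named_values(result, column_by_position=COLUMN_BY_POSITION):
    """Map one result sequence to {column name: value}.

    A None result (no peak properties for that row) yields None for every column,
    mirroring the `if res is not None else None` guard in the original loop.
    """
    if result is None:
        return {name: None for name in column_by_position.values()}
    return {name: result[pos] for pos, name in column_by_position.items()}


# Stand-in data: a fake 31-element result whose values equal their positions.
fake_result = tuple(range(31))
print(pick_named_values(fake_result))
# {'new_efc0': 15, 'derivative_maximum': 18, 'derivative_baseline': 24}
print(pick_named_values(None))

# Rebuilding the sparse list form shows the two representations agree.
sparse = [None] * 31
for pos, name in COLUMN_BY_POSITION.items():
    sparse[pos] = name
```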


#### 5-1) Running DSP for `[Control DSP] + [Strep+1]`


In [8]:
# CONFIG_YML_PATH = "./config/yaml/PRJDS001/GI-B-I/dsp2_generic.yml"
# TO_DSP_RESULT_DIR = "./data/GI-B-I/control/computed/config__dsp2_generic"

# pydsp.run.worker.multiple_tasks(
#     INPUT_PARQUET_DIR, # Input directory
#     CONFIG_YML_PATH,   # Configuration
#     TO_DSP_RESULT_DIR,     # Output directory
#     4,                 # Number of processes
#     is_verbose=True    # Verbose mode
# )


In [None]:
data_path = TO_DSP_RESULT_DIR + "/dsp/"
# Tag if we are running MC Simulation or Real Data?
#MC_or_Real = 'MC'
MC_or_Real = 'Real'
def get_all_parquet_files(data_path):
    # get all parquet files from that directory
    files = [f for f in os.listdir(data_path) if f.endswith('.parquet')]
    
    # create an empty list to store dataframes
    df_list = []
    
    # read each parquet file and append it to the list
    for file in files:
        df = pd.read_parquet(data_path + file)
        df_list.append(df)

    # concatenate all dataframes in the list
    dsp_result_df = pd.concat(df_list, ignore_index=True)

    return dsp_result_df
dsp_result_df = get_all_parquet_files(data_path)


# Consumable, well type, channel, and temperature settings
#Consumable = "8-strip"
#Welltype = "Sample"
#Channel = "HEX"
#Temperature = 'High'

#if Temperature == 'Low':
#    TM = 0
#elif Temperature == 'High':
#    TM = 1

# DataFrame produced with the single-amp configuration
df = dsp_result_df

# Build x from the list in the original_rfu column of an arbitrary row (x is identical for every row and starts at 1)
# x = [i for i in range(1, len(dsp_result_df.loc[2]['original_rfu'])+1)] -> fixed at 45 values

df['new_jump_corrected_rfu'] = df.apply(lambda row: [a - b for a, b in zip(row['preproc_rfu'], row['analysis_rd_diff'])], axis=1)

DFM = 30 # Threshold on the derivative maximum (applies to all amplifications)
DFC = 50 # Threshold on the derivative maximum (raised criterion for early-amplification signals)

# Derivative diff threshold: threshold at which the second derivative crosses from positive to negative, i.e. the point just before a peak forms.
DDT = 5
# First Derivative Peak threshold (used to analyze valid peaks)
FDPT = 25
# EFC threshold: threshold used to compute EFC in the derivative analysis
# [Caution] FDPT must be greater than EFCT.
EFCT = 15 # This value needs tuning; 15 has worked well so far.
columns_to_assign = [
    None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 
    'new_efc0', None, None, 'derivative_maximum', None,
    None, None, None, None, 'derivative_baseline', None, None, None, None, None, None
]

# Assuming df is your DataFrame and get_peak_properties_for_row is correctly defined
results = df.apply(lambda row: get_peak_properties_for_row(row, 'new_jump_corrected_rfu', DDT, FDPT, EFCT), axis=1)

# Now assign the results to the DataFrame columns where applicable
for idx, col_name in enumerate(columns_to_assign):
    if col_name is not None:  # This checks if a column name has been provided
        df[col_name] = [res[idx] if res is not None else None for res in results]

# Create the 'is_early_amp' column
df['is_early_amp'] = np.where(df['new_efc0'] < 13, 1, 0)

# baseline modeling
SFC = 4
df = calculate_base_rfu(df, SFC, 'new_jump_corrected_rfu', DFM)
df['new_absd'] = [np.array(row['new_jump_corrected_rfu']) - np.array(row['new_baseline']) for index, row in df.iterrows()]

# make the df['derivative_baseline'] column homogeneous to export the dataframe df into a parquet
df['derivative_baseline'] = df['derivative_baseline'].apply(lambda x: np.array([x]) if not isinstance(x, np.ndarray) else x)

# export df
df.to_parquet('./data/GI-B-I/strep_plus1/control_strep-assay-plus1.parquet')
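
`get_all_parquet_files` is redefined with the same body in several cells of this notebook. As a sketch of the file-discovery half only, the version below uses `pathlib`, so the result does not depend on whether `data_path` ends with a slash, and sorts the files for a reproducible order. `list_parquet_files` is a hypothetical name; reading and concatenating would stay as in the cells above.

```python
import tempfile
from pathlib import Path


def list_parquet_files(data_path):
    """Return the .parquet files directly inside data_path, sorted by path."""
    directory = Path(data_path)
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix == ".parquet"
    )


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.parquet", "a.parquet", "notes.txt"):
        (Path(tmp) / name).touch()
    print([p.name for p in list_parquet_files(tmp)])  # ['a.parquet', 'b.parquet']

# With pandas available, the concatenation step would then read:
# dsp_result_df = pd.concat(
#     [pd.read_parquet(p) for p in list_parquet_files(data_path)],
#     ignore_index=True,
# )
```

Defining the helper once near the top of the notebook would remove the repeated definitions without changing what each cell loads.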

#### 5-2) Running DSP for `[Strep+1] with MuDT`

In [4]:
INPUT_PARQUET_DIR = "./data/GI-B-I/GI-B-I-100/computed/pcr_results/"
CONFIG_YML_PATH = "./config/yaml/PRJDS001/GI-B-I/dsp2_strep_plus1_MuDT.yml" 
TO_DSP_RESULT_DIR = "./data/GI-B-I/strep_plus1/computed/dsp2_strep_plus1_MuDT"

pydsp.run.worker.multiple_tasks(
    INPUT_PARQUET_DIR, # Input directory
    CONFIG_YML_PATH,   # Configuration
    TO_DSP_RESULT_DIR,     # Output directory
    4,                 # Number of processes
    is_verbose=False    # Verbose mode
)

2024-05-03 05:30:25,062 - 856 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/2. admin_2017-05-22 14-06-05_BR101459 GI9801XY MOM 소량 반제품1 민감도(Vpara, CdB, Vchol).parquet
2024-05-03 05:30:25,273 - 856 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/4. admin_2017-05-22 14-38-11_BR101644 GI9801XY MOM 소량 반제품1 민감도(Ahyd, Avero, Sbong).parquet
2024-05-03 05:30:25,291 - 856 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/5. admin_2017-05-22 14-40-33_BR101645 GI9801XY MOM 소량 반제품1 민감도(Styphi, IC).parquet
2024-05-03 05:30:25,308 - 856 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/admin_2015-02-23 02-32-41_BR100160_Allplex GI_B1_TCF_specificity-1.parquet
2024-05-03 05:30:25,323 - 856 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/

0

In [8]:
check_memory_status()

Memory used/ Memory total: 263.40 MB
Total memory: 7831.25 MB
Available memory: 6952.23 MB


In [14]:
TO_DSP_RESULT_DIR = "./data/GI-B-I/strep_plus1/computed/dsp2_strep_plus1_config_MuDT"
data_path = TO_DSP_RESULT_DIR + '/dsp/'
# Tag if we are running MC Simulation or Real Data?
#MC_or_Real = 'MC'
MC_or_Real = 'Real'
def get_all_parquet_files(data_path):
    # get all parquet files from that directory
    files = [f for f in os.listdir(data_path) if f.endswith('.parquet')]
    
    # create an empty list to store dataframes
    df_list = []
    
    # read each parquet file and append it to the list
    for file in files:
        df = pd.read_parquet(data_path + file)
        df_list.append(df)

    # concatenate all dataframes in the list
    dsp_result_df = pd.concat(df_list, ignore_index=True)

    return dsp_result_df
dsp_result_df = get_all_parquet_files(data_path)

# Consumable, well type, channel, and temperature settings
#Consumable = "8-strip"
#Welltype = "Sample"
#Channel = "HEX"
#Temperature = 'High'

#if Temperature == 'Low':
#    TM = 0
#elif Temperature == 'High':
#    TM = 1

# DataFrame produced with the single-amp configuration
df = dsp_result_df

# Build x from the list in the original_rfu column of an arbitrary row (x is identical for every row and starts at 1)
# x = [i for i in range(1, len(dsp_result_df.loc[2]['original_rfu'])+1)] -> fixed at 45 values

df['new_jump_corrected_rfu'] = df.apply(lambda row: [a - b for a, b in zip(row['preproc_rfu'], row['analysis_rd_diff'])], axis=1)

DFM = 30 # Threshold on the derivative maximum (applies to all amplifications)
DFC = 50 # Threshold on the derivative maximum (raised criterion for early-amplification signals)

# Derivative diff threshold: threshold at which the second derivative crosses from positive to negative, i.e. the point just before a peak forms.
DDT = 5
# First Derivative Peak threshold (used to analyze valid peaks)
FDPT = 25
# EFC threshold: threshold used to compute EFC in the derivative analysis
# [Caution] FDPT must be greater than EFCT.
EFCT = 15 # This value needs tuning; 15 has worked well so far.
columns_to_assign = [
    None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 
    'new_efc0', None, None, 'derivative_maximum', None,
    None, None, None, None, 'derivative_baseline', None, None, None, None, None, None
]

# Assuming df is your DataFrame and get_peak_properties_for_row is correctly defined
results = df.apply(lambda row: get_peak_properties_for_row(row, 'new_jump_corrected_rfu', DDT, FDPT, EFCT), axis=1)

# Now assign the results to the DataFrame columns where applicable
for idx, col_name in enumerate(columns_to_assign):
    if col_name is not None:  # This checks if a column name has been provided
        df[col_name] = [res[idx] if res is not None else None for res in results]

# Create the 'is_early_amp' column
df['is_early_amp'] = np.where(df['new_efc0'] < 13, 1, 0)

# baseline modeling
SFC = 4
df = calculate_base_rfu(df, SFC, 'new_jump_corrected_rfu', DFM)
df['new_absd'] = [np.array(row['new_jump_corrected_rfu']) - np.array(row['new_baseline']) for index, row in df.iterrows()]

# make the df['derivative_baseline'] column homogeneous to export the dataframe df into a parquet
df['derivative_baseline'] = df['derivative_baseline'].apply(lambda x: np.array([x]) if not isinstance(x, np.ndarray) else x)

# export df
df.to_parquet('./data/GI-B-I/strep_plus1/dsp2_strep-plus1_config_MuDT.parquet')

In [10]:
check_memory_status()

Memory used/ Memory total: 1189.25 MB
Total memory: 7831.25 MB
Available memory: 5998.14 MB


#### 5-3) Running DSP for `[Strep+1] without MuDT`

In [2]:
INPUT_PARQUET_DIR = "./data/GI-B-I/GI-B-I-100/computed/pcr_results/"
CONFIG_YML_PATH = "./config/yaml/PRJDS001/GI-B-I/dsp2_strep_plus1_config_no-MuDT.yml" 
TO_DSP_RESULT_DIR = "./data/GI-B-I/strep_plus1/computed/dsp2_strep_plus1_config_no-MuDT"

pydsp.run.worker.multiple_tasks(
    INPUT_PARQUET_DIR, # Input directory
    CONFIG_YML_PATH,   # Configuration
    TO_DSP_RESULT_DIR,     # Output directory
    4,                 # Number of processes
    is_verbose=False    # Verbose mode
)

In [3]:
TO_DSP_RESULT_DIR = "./data/GI-B-I/strep_plus1/computed/dsp2_strep_plus1_config_no-MuDT"
data_path = TO_DSP_RESULT_DIR + '/dsp/'
# Tag if we are running MC Simulation or Real Data?
#MC_or_Real = 'MC'
MC_or_Real = 'Real'
def get_all_parquet_files(data_path):
    # get all parquet files from that directory
    files = [f for f in os.listdir(data_path) if f.endswith('.parquet')]
    
    # create an empty list to store dataframes
    df_list = []
    
    # read each parquet file and append it to the list
    for file in files:
        df = pd.read_parquet(data_path + file)
        df_list.append(df)

    # concatenate all dataframes in the list
    dsp_result_df = pd.concat(df_list, ignore_index=True)

    return dsp_result_df
dsp_result_df = get_all_parquet_files(data_path)


# Consumable, well type, channel, and temperature settings
#Consumable = "8-strip"
#Welltype = "Sample"
#Channel = "HEX"
#Temperature = 'High'

#if Temperature == 'Low':
#    TM = 0
#elif Temperature == 'High':
#    TM = 1

# DataFrame produced with the single-amp configuration
df = dsp_result_df

# Build x from the list in the original_rfu column of an arbitrary row (x is identical for every row and starts at 1)
# x = [i for i in range(1, len(dsp_result_df.loc[2]['original_rfu'])+1)] -> fixed at 45 values

df['new_jump_corrected_rfu'] = df.apply(lambda row: [a - b for a, b in zip(row['preproc_rfu'], row['analysis_rd_diff'])], axis=1)

DFM = 30 # Threshold on the derivative maximum (applies to all amplifications)
DFC = 50 # Threshold on the derivative maximum (raised criterion for early-amplification signals)

# Derivative diff threshold: threshold at which the second derivative crosses from positive to negative, i.e. the point just before a peak forms.
DDT = 5
# First Derivative Peak threshold (used to analyze valid peaks)
FDPT = 25
# EFC threshold: threshold used to compute EFC in the derivative analysis
# [Caution] FDPT must be greater than EFCT.
EFCT = 15 # This value needs tuning; 15 has worked well so far.
columns_to_assign = [
    None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 
    'new_efc0', None, None, 'derivative_maximum', None,
    None, None, None, None, 'derivative_baseline', None, None, None, None, None, None
]

# Assuming df is your DataFrame and get_peak_properties_for_row is correctly defined
results = df.apply(lambda row: get_peak_properties_for_row(row, 'new_jump_corrected_rfu', DDT, FDPT, EFCT), axis=1)

# Now assign the results to the DataFrame columns where applicable
for idx, col_name in enumerate(columns_to_assign):
    if col_name is not None:  # This checks if a column name has been provided
        df[col_name] = [res[idx] if res is not None else None for res in results]

# Create the 'is_early_amp' column
df['is_early_amp'] = np.where(df['new_efc0'] < 13, 1, 0)

# baseline modeling
SFC = 4
df = calculate_base_rfu(df, SFC, 'new_jump_corrected_rfu', DFM)
df['new_absd'] = [np.array(row['new_jump_corrected_rfu']) - np.array(row['new_baseline']) for index, row in df.iterrows()]

# make the df['derivative_baseline'] column homogeneous to export the dataframe df into a parquet
df['derivative_baseline'] = df['derivative_baseline'].apply(lambda x: np.array([x]) if not isinstance(x, np.ndarray) else x)

# export df
df.to_parquet('./data/GI-B-I/strep_plus1/dsp2_strep-plus1_config_no-MuDT.parquet')

In [3]:
check_memory_status()

Memory used: 247.87 MB
Total memory: 7831.25 MB
Available memory: 5199.89 MB


#### 6) Running DSP for `[Control DSP]`

#### 8-1) Running DSP for `[Strep+2] with MuDT`

In [None]:
INPUT_PARQUET_DIR = "./data/GI-B-I/GI-B-I-100/computed/pcr_results/"
CONFIG_YML_PATH = "./config/yaml/PRJDS001/GI-B-I/dsp2_strep_plus2_MuDT_multiamp.yml"
TO_DSP_RESULT_DIR = "./data/GI-B-I/strep_plus2/computed/dsp2_strep_plus2_MuDT_multiamp"

pydsp.run.worker.multiple_tasks(
    INPUT_PARQUET_DIR, # Input directory
    CONFIG_YML_PATH,   # Configuration
    TO_DSP_RESULT_DIR,     # Output directory
    4,                 # Number of processes
    is_verbose=False    # Verbose mode
)


2024-05-07 04:26:21,375 - 40269 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/2. admin_2017-05-22 14-06-05_BR101459 GI9801XY MOM 소량 반제품1 민감도(Vpara, CdB, Vchol).parquet
2024-05-07 04:26:21,407 - 40269 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/4. admin_2017-05-22 14-38-11_BR101644 GI9801XY MOM 소량 반제품1 민감도(Ahyd, Avero, Sbong).parquet
2024-05-07 04:26:21,418 - 40269 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/5. admin_2017-05-22 14-40-33_BR101645 GI9801XY MOM 소량 반제품1 민감도(Styphi, IC).parquet
2024-05-07 04:26:21,431 - 40269 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/admin_2015-02-23 02-32-41_BR100160_Allplex GI_B1_TCF_specificity-1.parquet
2024-05-07 04:26:21,445 - 40269 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a

#### 8-2) Running DSP for `[Strep+2] without MuDT`

In [4]:
INPUT_PARQUET_DIR = "./data/GI-B-I/GI-B-I-100/computed/pcr_results/"
CONFIG_YML_PATH = "./config/yaml/PRJDS001/GI-B-I/dsp2_strep_plus2_no-MuDT_multiamp.yml"
TO_DSP_RESULT_DIR = "./data/GI-B-I/strep_plus2/computed/dsp2_strep_plus2_no-MuDT_multiamp"

pydsp.run.worker.multiple_tasks(
    INPUT_PARQUET_DIR, # Input directory
    CONFIG_YML_PATH,   # Configuration
    TO_DSP_RESULT_DIR,     # Output directory
    4,                 # Number of processes
    is_verbose=False    # Verbose mode
)


2024-05-07 06:32:02,760 - 85498 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/2. admin_2017-05-22 14-06-05_BR101459 GI9801XY MOM 소량 반제품1 민감도(Vpara, CdB, Vchol).parquet
2024-05-07 06:32:02,831 - 85498 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/4. admin_2017-05-22 14-38-11_BR101644 GI9801XY MOM 소량 반제품1 민감도(Ahyd, Avero, Sbong).parquet
2024-05-07 06:32:02,852 - 85498 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/5. admin_2017-05-22 14-40-33_BR101645 GI9801XY MOM 소량 반제품1 민감도(Styphi, IC).parquet
2024-05-07 06:32:02,862 - 85498 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a/kkm/data/GI-B-I/GI-B-I-100/computed/pcr_results/admin_2015-02-23 02-32-41_BR100160_Allplex GI_B1_TCF_specificity-1.parquet
2024-05-07 06:32:02,872 - 85498 - INFO - Check entry /home/jupyter-kmkim/dsp-research-strep-a

0

In [5]:
check_memory_status()

Memory used: 265.05 MB
Total memory: 7831.25 MB
Available memory: 5193.57 MB


In [6]:
get_disk_usage()

Total Disk Capacity: 48.28 GB
Used Disk Space: 26.54 GB
Free Disk Space: 21.72 GB
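
`get_disk_usage` is imported from `package.preprocess`, whose source is not shown in this notebook. As a rough sketch of how figures like the ones above can be obtained, the standard library's `shutil.disk_usage` returns the total, used, and free bytes for a path. The function name `disk_usage_gb` and the conversion factor (1024³ bytes per GB) are assumptions, not the actual implementation.

```python
import shutil


def disk_usage_gb(path="."):
    """Return total/used/free space for the filesystem holding `path`, in GB.

    Assumption: 1 GB is taken as 1024**3 bytes; the real helper may differ.
    """
    usage = shutil.disk_usage(path)
    gb = 1024 ** 3
    return {
        "total": usage.total / gb,
        "used": usage.used / gb,
        "free": usage.free / gb,
    }


report = disk_usage_gb(".")
print(f"Total Disk Capacity: {report['total']:.2f} GB")
print(f"Used Disk Space: {report['used']:.2f} GB")
print(f"Free Disk Space: {report['free']:.2f} GB")
```

Used and free space do not always add up to the total, because some filesystems reserve blocks that are counted in neither figure.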
