##### **Workflow example**

polyloxpgen contains two main methods:
- polylox_merge: merge samples from multiple raw barcode files ([RPBPBR](https://github.com/hoefer-lab/RPBPBR) output)
- polylox_pgen: purge barcodes and compute pgen for single or multiple samples


##### **Installation**

to use polyloxpgen and the above methods, install it via

```
pip install polyloxpgen
```

In [1]:
# after installation, import the package via
import polyloxpgen
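
if the import fails, it can help to check whether the package is visible to the current Python environment. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name and not part of polyloxpgen.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None

# prints True only if polyloxpgen is installed in this environment
print(is_installed('polyloxpgen'))
```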

##### **Notes on file format**

- all input and output files here are tab-separated values (TSV) files (saved as .txt)
- the required file formats (headers, columns, rows) can be seen in the examples folder
- by default, barcodes are surrounded by quote marks (for Excel)
    - otherwise Excel would sometimes wrongly convert barcodes to dates or to numbers in scientific notation
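
to illustrate the quoting, here is a minimal sketch of reading such a file with Python's standard `csv` module. The header names, barcodes, and counts are made up for illustration; the real layout should be taken from the files in the examples folder.

```python
import csv
import io

# hypothetical TSV content: a header row, then one quoted barcode and a read count per row
tsv_text = 'Barcode\tSample1\n"1E5"\t12\n"123456789"\t3\n'

# the reader splits on tabs and strips the surrounding quote marks
reader = csv.reader(io.StringIO(tsv_text), delimiter='\t', quotechar='"')
header = next(reader)
rows = [(barcode, int(count)) for barcode, count in reader]
```

the barcodes arrive as plain strings, so a value like `1E5` is not reinterpreted as a number.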

##### **polylox_merge**

below we merge two data sets (located in this example folder) into a combined dataframe


In [2]:
### USER INPUT: the user has to specify the following information
# the (absolute or relative) paths of the input data sets
location_files_in = ['./sample1.barcode.count.txt', './sample2.barcode.count.txt']

# how the individual data sets are referred to ("samples")
sample_names =  ['Sample1', 'Sample2']

# folder/directory for the merged output dataframe
merge_location_dir_out = './'

# name of the output dataframe
merge_file_name_out = 'sample1_sample2_merged'
###
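
before merging, the user input can be sanity-checked. The following is a minimal sketch, assuming one sample name per input file; `check_merge_inputs` is a hypothetical helper and not part of polyloxpgen.

```python
from pathlib import Path

def check_merge_inputs(files_in, names):
    """Hypothetical helper (not part of polyloxpgen): validate the merge inputs."""
    # assumption: each input file is referred to by exactly one sample name
    if len(files_in) != len(names):
        raise ValueError('need exactly one sample name per input file')
    # collect all paths that do not point to an existing file
    missing = [f for f in files_in if not Path(f).is_file()]
    if missing:
        raise FileNotFoundError(f'input files not found: {missing}')
```

it could be called as `check_merge_inputs(location_files_in, sample_names)`; it returns nothing and raises an error if something is off.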

In [3]:
# run polylox_merge via
df_merged = polyloxpgen.polylox_merge(location_files_in, sample_names, merge_location_dir_out, merge_file_name_out)

Merging data ... Done


In [4]:
# the merged dataframe is saved in the examples folder and
# can be displayed here (remove the semicolon to show the output)
df_merged;
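
conceptually, merging combines the per-sample read counts into one table with a row per barcode and a column per sample. The sketch below only illustrates this idea with plain dictionaries; it is not how `polylox_merge` is implemented, the barcodes and counts are made up, and filling absent barcodes with 0 is an assumption.

```python
def merge_counts(per_sample_counts):
    """Illustrative only: combine {sample: {barcode: reads}} into one row per barcode."""
    samples = list(per_sample_counts)
    # union of all barcodes seen in any sample
    barcodes = sorted({b for counts in per_sample_counts.values() for b in counts})
    # assumption: a barcode absent from a sample gets a read count of 0
    return {b: [per_sample_counts[s].get(b, 0) for s in samples] for b in barcodes}

merged = merge_counts({
    'Sample1': {'123456789': 10, '1': 4},   # made-up barcodes and counts
    'Sample2': {'123456789': 7, '9': 2},
})
```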

##### **polylox_pgen**

the merged dataframe from before (or any dataframe in this format) can then be used
- to purge barcodes and reads (eliminate impossible/false barcodes and reads)
- to compute the generation probability (pgen) for the purged barcodes

In [5]:
### USER INPUT
location_file_in = './sample1_sample2_merged.txt'

pgen_location_dir_out = './'
pgen_file_name_out = 'sample1_sample2_pgen'
###

In [6]:
# run polylox_pgen via
df_pgen = polyloxpgen.polylox_pgen(location_file_in, pgen_location_dir_out, pgen_file_name_out)

Loading input data ... Done
Loading Polylox libraries ... Done
Purging barcodes and reads ... Done
Finding minimal recombinations ... Done
Computing generation probabilities ... Done
Creating output ... Done


In [7]:
# remove semicolon to display output dataframe
df_pgen;

***Additional options for polylox_pgen***

1) **float decimal delimiter**: to open the output files with a European/German version of Excel, you may want to change the decimal separator of floats from '.' (default) to ','

use the option as shown below to set any desired decimal separator (here ',')

```python
df_pgen = polyloxpgen.polylox_pgen(location_file_in, pgen_location_dir_out, pgen_file_name_out, decimal_float=',')
```
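
values written with a comma as decimal separator are no longer valid Python float literals, so they need converting when read back in Python. This is a minimal sketch, assuming the values contain no thousands separators; `parse_decimal_comma` is a hypothetical helper and not part of polyloxpgen.

```python
def parse_decimal_comma(text):
    """Hypothetical helper: convert a string such as '1,5e-07' to a float."""
    # assumption: the only comma in the string is the decimal separator
    return float(text.replace(',', '.'))

value = parse_decimal_comma('1,5e-07')
```

when loading a whole file with pandas, `pandas.read_csv` accepts a `decimal` argument (for example `decimal=','`) that serves the same purpose.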

2) **transition matrix**: by default, the pgen computation is based on a uniform transition matrix

to reproduce the pgens of the original publication ([Pei et al., Nature, 2017](https://www.nature.com/articles/nature23653)), a length-dependent transition matrix can be selected via

```python
df_pgen = polyloxpgen.polylox_pgen(location_file_in, pgen_location_dir_out, pgen_file_name_out, path_matrix_type='ld_2017')
```

(this is mainly a note for reproducibility; the option only marginally affects the computed pgens)

##### **References / Final notes**

- these scripts are based on the original [polylox (MATLAB)](https://github.com/hoefer-lab/polylox) implementation; see that repository for more information
- original publication: [Pei et al., Nature, 2017](https://www.nature.com/articles/nature23653)