# HowTo: Use easyPheno as a pip package

In this Jupyter notebook, we show how to use easyPheno as a pip package and walk through the steps that easyPheno performs when an optimization run is triggered.

If you want to run this tutorial yourself, please clone the whole GitHub repository, as the tutorial needs the data stored there and the paths defined below assume the repository layout: ``git clone https://github.com/grimmlab/easyPheno.git``

Then, start a Jupyter notebook server on your machine and open this Jupyter notebook, which is placed at ``docs/source/tutorials`` in the repository.

Alternatively, you can download the individual files and define the paths yourself:
- The Jupyter notebook can be downloaded here: [HowTo: Use easyPheno as a pip package.ipynb](https://github.com/grimmlab/easyPheno/tree/main/docs/source/tutorials/HowTo:%20Use%20easyPheno%20as%20a%20pip%20package.ipynb)

- The data we use can be found here: [tutorial data](https://github.com/grimmlab/easyPheno/tree/main/docs/source/tutorials/tutorial_data)


## Installation, imports and paths
First, we make sure that easyPheno is installed as a pip package. Then, we import easyPheno as well as further libraries that we need in this tutorial. Finally, we define some paths and filenames that we use repeatedly throughout this tutorial. We will save the results in the directory that contains the cloned repository.

In [None]:
!pip3 uninstall -y easypheno

In [None]:
!python3 -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ easypheno==0.1.23
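
Since the installation pins a specific version, it can help to confirm which version is actually active in the current environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is our own and not part of easyPheno.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if easypheno is installed, otherwise None
print(installed_version("easypheno"))
```

If this prints ``None`` or an unexpected version, the notebook kernel is probably using a different Python environment than the one pip installed into.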

In [None]:
import easypheno

In [None]:
import easypheno
import pathlib
import pandas as pd
import datetime
import pprint

In [None]:
# Definition of paths and filenames
cwd = pathlib.Path.cwd()
data_dir = cwd.joinpath('tutorial_data')
save_dir = cwd.parents[3]
genotype_matrix = 'x_matrix.csv'
phenotype_matrix = 'y_matrix.csv'
phenotype = 'continuous_values'
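
To see why ``cwd.parents[3]`` points to the directory that contains the repository, the sketch below walks up a path level by level. The absolute path is a made-up example that assumes the notebook is started from ``docs/source/tutorials`` inside a clone located in ``/home/user/projects``.

```python
from pathlib import PurePosixPath

# Hypothetical location of the notebook directory inside a cloned repository
cwd = PurePosixPath("/home/user/projects/easyPheno/docs/source/tutorials")

# parents[0] is the direct parent, each further index moves one level up
for level in range(4):
    print(level, cwd.parents[level])

# Three levels up is the repository root, four levels up its containing directory
repo_root = cwd.parents[2]
save_dir = cwd.parents[3]
```

If you start the notebook from a different working directory, the index has to be adjusted accordingly, or ``save_dir`` can simply be set to an explicit path.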

## Run whole optimization pipeline at once
As shown for the [Docker workflow](https://easypheno.readthedocs.io/en/latest/tutorials/tut_run_docker.html), easyPheno offers a function [optim_pipeline.run()](https://github.com/grimmlab/easyPheno/blob/main/easypheno/optim_pipeline.py) that triggers the whole optimization run. 

In the definition of ``optim_pipeline.run()``, we set several default values. In order to run it using our tutorial data, we just need to define the data and directories we want to use as well as the models we want to optimize. Furthermore, we set values for the ``datasplit`` and ``n_trials`` to limit the waiting time for getting the results.

When calling the function, we first see some information regarding the data preprocessing and the configuration of our optimization run, e.g. the data that is used. Then, the current progress of the Optuna optimization is shown together with the results of the individual trials. Finally, a summary of the whole optimization run is printed.

In [None]:
easypheno.optim_pipeline.run(
    data_dir=data_dir, genotype_matrix=genotype_matrix, phenotype_matrix=phenotype_matrix, phenotype=phenotype,
    save_dir=save_dir, models=['xgboost'], n_trials=10, datasplit='cv-test'
)

Within the defined ``save_dir``, a ``results`` folder will be created. 
   
Below it, easyPheno's default folder structure follows: ``name_of_genotype_matrix/name_of_phenotype_matrix/phenotype/``. This means that all phenotype matrices assigned to the same genotype matrix are gathered in the same subdirectory (``name_of_genotype_matrix/``). The same applies to all phenotypes assigned to the same phenotype matrix (``name_of_genotype_matrix/name_of_phenotype_matrix/``).

We can see this structure below with all optimization results for the defined ``phenotype``.

In [None]:
result_folders = list(save_dir.joinpath('results', genotype_matrix.split('.')[0], phenotype_matrix.split('.')[0], phenotype).glob('*'))
for results_dir in result_folders:
    print(results_dir)
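
The folder structure described above can also be composed without touching the file system. The sketch below mirrors the notebook's own logic of dropping the file extension with ``split('.')[0]``; the function name `expected_results_dir` and the example ``save_dir`` are assumptions for illustration.

```python
from pathlib import PurePosixPath


def expected_results_dir(save_dir, genotype_matrix, phenotype_matrix, phenotype):
    """Compose the results directory for one phenotype from the file names."""
    return (
        PurePosixPath(save_dir)
        / "results"
        / genotype_matrix.split(".")[0]   # file name without extension
        / phenotype_matrix.split(".")[0]  # file name without extension
        / phenotype
    )


# Hypothetical save_dir combined with the tutorial's file names
example_dir = expected_results_dir(
    "/home/user/projects", "x_matrix.csv", "y_matrix.csv", "continuous_values"
)
print(example_dir)
```

Note that ``split('.')[0]`` cuts at the first dot, so this sketch only behaves as intended for file names that contain a single dot.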

The result folder names encode the ``datasplit`` (type and parameters; e.g. for ``cv-test``, the part ``5-20`` means 5 cross-validation folds and a test set consisting of 20 percent of the data). Furthermore, they show the MAF filter that was applied (``MAF``), the ``models`` that were optimized, and a time stamp.

In the example below, we can see that each result folder contains a ``Results_overview_*.csv`` as well as detailed results for each of the optimized models. In case of ``nested-cv``, this is preceded by a subfolder for each of the outer folds. 

In [None]:
result_elements = list(result_folders[0].glob('*'))
for result_element in result_elements:
    print(result_element)

The ``Results_overview_*.csv`` file contains the best parameters as well as evaluation and runtime metrics for each of the optimized models, as we can see in the example below.

In [None]:
results_overview_file = [overview_file for overview_file in result_elements if 'Results_overview' in str(overview_file)][0]
pd.read_csv(results_overview_file)
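
Indexing a list comprehension with ``[0]`` raises an ``IndexError`` if no overview file exists yet, for example when a run was interrupted. A slightly more defensive lookup could look like the sketch below. The helper `find_overview_file`, the file name ``Results_overview_example.csv`` and its content are made up for the demonstration in a temporary directory.

```python
import tempfile
from pathlib import Path


def find_overview_file(result_dir):
    """Return the first Results_overview_*.csv in result_dir, or None if absent."""
    candidates = sorted(Path(result_dir).glob("Results_overview_*.csv"))
    return candidates[0] if candidates else None


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Empty folder: nothing to find
    missing = find_overview_file(tmp_dir)
    # Create a dummy overview file and a model subfolder, then search again
    (tmp_dir / "Results_overview_example.csv").write_text("model,metric\n")
    (tmp_dir / "xgboost").mkdir()
    found_name = find_overview_file(tmp_dir).name

print(missing, found_name)
```

The returned path can then be passed to ``pd.read_csv()`` as in the cell above, after checking that it is not ``None``.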

Beyond that, we see below that the detailed results for each optimized model contain validation and test results, saved prediction models, an Optuna database, a runtime overview with information for each trial (useful for debugging, as pruning reasons are also documented) and, for some prediction models, feature importances.

In [None]:
# Everything that is not the overview file is a folder with detailed model results
model_dirs = [element for element in result_elements if 'Results_overview' not in str(element)]
for path in model_dirs[0].rglob('*'):
    print(path)

## Single elements of the optimization pipeline
For a better understanding of the whole optimization pipeline, we subsequently show some of the single elements which are called within ``optim_pipeline.run()``. 

First, ``optim_pipeline.run()`` contains some functions to check the specified arguments, which we will skip for this tutorial. However, we need to define some of the default values and create ``pathlib.Path`` objects.

In [None]:
data_dir = pathlib.Path(data_dir)
save_dir = pathlib.Path(save_dir)
datasplit = 'cv-test'
n_innerfolds = 5
test_set_size_percentage = 20
maf_percentage = 0
models = ['xgboost']
n_trials = 10
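
The ``maf_percentage`` value refers to a minor allele frequency (MAF) filter. As a rough illustration of the concept, the sketch below computes the MAF per SNP and keeps only SNPs at or above a threshold. It assumes an additive 0/1/2 genotype encoding and a "greater or equal" rule, uses made-up data, and is not taken from easyPheno's code.

```python
def minor_allele_frequency(snp_column):
    """MAF of one SNP given additive genotypes (0, 1 or 2 copies of one allele)."""
    allele_freq = sum(snp_column) / (2 * len(snp_column))
    return min(allele_freq, 1 - allele_freq)


def maf_filter(genotypes, maf_percentage):
    """Return the column indices of SNPs whose MAF is at least maf_percentage."""
    n_snps = len(genotypes[0])
    keep = []
    for j in range(n_snps):
        column = [row[j] for row in genotypes]
        if minor_allele_frequency(column) * 100 >= maf_percentage:
            keep.append(j)
    return keep


# Four samples (rows) and three SNPs (columns), made-up values
genotypes = [
    [0, 2, 1],
    [0, 2, 0],
    [0, 1, 2],
    [1, 2, 1],
]
kept_without_filter = maf_filter(genotypes, maf_percentage=0)
kept_with_filter = maf_filter(genotypes, maf_percentage=20)
print(kept_without_filter, kept_with_filter)
```

With a threshold of 0, as used in this tutorial, no SNP is removed in this sketch.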

The first step of the optimization pipeline is the preparation of the raw data files using ``easypheno.preprocess.raw_data_functions.prepare_data_files()``. If the format matches our [Data Guide](https://easypheno.readthedocs.io/en/latest/data.html#), the raw data files are preprocessed.

If a genotype matrix is used for the first time, it is converted to a unified ``.h5`` file that is saved under the same name as the raw file.

The phenotype matrix is checked for a valid format, but it is not saved in a different format.

An index file containing the indices for filtering the data (e.g. MAF or duplicates) and for creating the data splits is saved. If the file already exists, it is updated when a requested datasplit is not yet present in it. This ensures reproducibility of the preprocessing and data splits.

In [None]:
easypheno.preprocess.raw_data_functions.prepare_data_files(
    data_dir=data_dir, genotype_matrix_name=genotype_matrix, phenotype_matrix_name=phenotype_matrix,
    phenotype=phenotype, datasplit=datasplit, n_outerfolds=5, n_innerfolds=n_innerfolds,
    test_set_size_percentage=test_set_size_percentage, val_set_size_percentage=20,
    models=models, user_encoding=None, maf_percentage=maf_percentage
)
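
To illustrate why storing split indices makes runs reproducible, here is a minimal sketch of a ``cv-test``-style split: a fixed share of samples is held out as test set and the rest is divided into folds. It uses a seeded shuffle from the standard library and is only meant to convey the idea; the function `cv_test_split` is hypothetical and does not reflect how easyPheno builds or stores its index file.

```python
import random


def cv_test_split(n_samples, n_folds=5, test_percentage=20, seed=42):
    """Return test indices and a list of (train, validation) index lists."""
    rng = random.Random(seed)  # fixed seed makes the shuffle repeatable
    indices = list(range(n_samples))
    rng.shuffle(indices)

    n_test = round(n_samples * test_percentage / 100)
    test_idx = sorted(indices[:n_test])
    remaining = indices[n_test:]

    folds = []
    for k in range(n_folds):
        val = set(remaining[k::n_folds])  # every n_folds-th sample, offset k
        train_idx = sorted(i for i in remaining if i not in val)
        folds.append((train_idx, sorted(val)))
    return test_idx, folds


test_idx, folds = cv_test_split(n_samples=100)
print(len(test_idx), [len(val) for _, val in folds])
```

Calling the function again with the same arguments yields identical indices, which is the property a saved index file guarantees across sessions.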

After setting all seeds for reproducibility using ``easypheno.utils.helper_functions.set_all_seeds()``, a model for the current optimization is selected. This information is then used to retrieve its ``standard_encoding`` if the user did not define an encoding. 

With this information, the ``easypheno.preprocess.base_dataset.Dataset`` object is initialized.  
We also print some information regarding the current progress, as loading the data might take some time for bigger datasets.   
When running the optimization for multiple models, these are sorted according to their encoding, and the dataset is only reloaded if the encoding changes between models. 

In [None]:
easypheno.utils.helper_functions.set_all_seeds()
current_model_name = models[0]
encoding = easypheno.utils.helper_functions.get_mapping_name_to_class()[current_model_name].standard_encoding

dataset = easypheno.preprocess.base_dataset.Dataset(
    data_dir=data_dir, genotype_matrix_name=genotype_matrix, phenotype_matrix_name=phenotype_matrix,
    phenotype=phenotype, datasplit=datasplit, n_outerfolds=5, n_innerfolds=n_innerfolds,
    test_set_size_percentage=test_set_size_percentage, val_set_size_percentage=20,
    encoding=encoding, maf_percentage=maf_percentage
)
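
The idea of sorting models by encoding so that the dataset is loaded only once per encoding can be sketched as follows. The model names and encoding labels in the mapping are placeholders, and `group_models_by_encoding` is a hypothetical helper, not an easyPheno function.

```python
from itertools import groupby

# Hypothetical mapping of model names to encodings, for illustration only
standard_encodings = {
    "model_a": "012",
    "model_b": "onehot",
    "model_c": "012",
}


def group_models_by_encoding(models, encodings):
    """Sort models by encoding and group those that can share one dataset."""
    ordered = sorted(models, key=lambda name: encodings[name])
    return [
        (encoding, list(group))
        for encoding, group in groupby(ordered, key=lambda name: encodings[name])
    ]


plan = group_models_by_encoding(["model_a", "model_b", "model_c"], standard_encodings)
for encoding, names in plan:
    # One dataset load per encoding instead of one per model
    print(f"load dataset with encoding {encoding!r} for models {names}")
```

In this example, three models require only two dataset loads.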

After retrieving the type of ML ``task`` using ``easypheno.utils.helper_functions.test_likely_categorical()`` as well as the time stamp for saving the results, we create an ``easypheno.optimization.optuna_optim.OptunaOptim`` object. For this purpose, we hand over all information that is needed for the hyperparameter search.

In [None]:
task = 'classification' if easypheno.utils.helper_functions.test_likely_categorical(dataset.y_full) else 'regression'
models_start_time = '+'.join(models) + '_' + datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

optim_run = easypheno.optimization.optuna_optim.OptunaOptim(
    save_dir=save_dir, genotype_matrix_name=genotype_matrix, phenotype_matrix_name=phenotype_matrix,
    phenotype=phenotype, n_outerfolds=5, n_innerfolds=n_innerfolds, val_set_size_percentage=20,
    test_set_size_percentage=test_set_size_percentage, maf_percentage=maf_percentage, n_trials=n_trials, 
    save_final_model=False, batch_size=None, n_epochs=10000, task=task, 
    models_start_time=models_start_time, current_model_name=current_model_name, dataset=dataset
)
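
The ``models_start_time`` string built above joins the model names with ``+`` and appends a time stamp. The sketch below reproduces that format with a fixed date and shows one way to split it back into its parts. The second model name is a placeholder and both helper functions are our own.

```python
import datetime


def build_models_start_time(models, now):
    """Join model names and a time stamp in the format used in this tutorial."""
    return '+'.join(models) + '_' + now.strftime("%Y-%m-%d_%H-%M-%S")


def parse_models_start_time(label):
    """Split the label into the list of model names and the start time."""
    # The last two underscore-separated parts are date and time
    models_part, date_part, time_part = label.rsplit('_', 2)
    started = datetime.datetime.strptime(
        f"{date_part}_{time_part}", "%Y-%m-%d_%H-%M-%S"
    )
    return models_part.split('+'), started


# Fixed date for a repeatable example; 'another_model' is a placeholder name
label = build_models_start_time(
    ['xgboost', 'another_model'], datetime.datetime(2023, 1, 15, 9, 30, 0)
)
parsed_models, parsed_start = parse_models_start_time(label)
print(label)
```

Splitting from the right keeps the parsing correct even if a model name itself contains an underscore.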

Finally, we just need to call the method ``run_optuna_optimization()`` of our ``easypheno.optimization.optuna_optim.OptunaOptim`` object to start the Bayesian hyperparameter search, which prints the current progress and returns a dictionary with summary results. 

In [None]:
summary_results = optim_run.run_optuna_optimization()
pprint.PrettyPrinter(depth=4).pprint(summary_results)
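
The ``depth`` argument of ``pprint.PrettyPrinter`` limits how many nesting levels are printed, which keeps a deeply nested summary readable. The sketch below demonstrates this on a made-up dictionary; the real summary returned by easyPheno has a different structure.

```python
import pprint

# Made-up nested dictionary, five levels deep
summary = {"outer": {"level2": {"level3": {"level4": {"level5": "hidden"}}}}}

# Levels beyond the fourth are replaced by a placeholder
shallow = pprint.pformat(summary, depth=4)
full = pprint.pformat(summary)
print(shallow)
print(full)
```

If parts of the summary appear as ``{...}``, increasing ``depth`` or omitting it shows the full content.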

Beyond that, ``easypheno.optimization.optuna_optim.OptunaOptim`` creates and saves the ``Results_overview_*.csv`` files, which we showed above in this tutorial.

## Further information
This notebook shows how to use the easyPheno pip package to run an optimization. Furthermore, we give an overview of the individual steps within ``optim_pipeline.run()``.

For more information on specific topics, see the following links:
- [Documentation of the whole package](https://easypheno.readthedocs.io/en/latest/index.html)

- [easyPheno's GitHub repository](https://github.com/grimmlab/easyPheno)

- Prepare your data according to our format: [Data Guide](https://easypheno.readthedocs.io/en/latest/data.html)

- The [Installation Guide](https://easypheno.readthedocs.io/en/latest/install_docker.html) as well as [basic tutorial](https://easypheno.readthedocs.io/en/latest/tutorials/tut_run_docker.html) for the Docker workflow as an alternative

- Summarize and analyze prediction results with easyPheno: [HowTo: Summarize prediction results with easyPheno](https://easypheno.readthedocs.io/en/latest/tutorials/tut_sum_results.html)

- Several [advanced topics](https://easypheno.readthedocs.io/en/latest/tutorials/tut_adv.html) such as adjusting existing prediction models or creation of new ones
