add some info about how/what order to run scripts
jjc2718 committed Jun 12, 2023
1 parent 26f2439 commit 131ae24
Showing 1 changed file with 15 additions and 1 deletion.
16 changes: 15 additions & 1 deletion 01_stratified_classification/README.md
@@ -17,11 +17,25 @@ A more detailed description of the results and takeaways can be found in the man
|-- lasso_range_gene_optimizers.ipynb: plot detailed results for a single cancer gene (Figures 1A/B and 3C/D)
|-- lasso_range_gene_learning_rate.ipynb: plot detailed results for varying learning rate schedules (Figure 2)
|-- optimizer_figures.ipynb: script to generate multi-panel figures in manuscript
|-- run_stratified_classification.py: script to train classifiers and write results (performance, coefficients, loss function values)
|-- run_stratified_lasso_penalty.py: script to train LASSO penalized classifiers and write results (performance, coefficients, loss function values)
|-- run_stratified_nn.py: script to train neural network classifier (not used in final paper)
```

## Setup and testing pipeline

To set up the environment for running the code in this repo, use the conda environment described in the [parent directory](https://github.com/greenelab/pancancer-evaluation#setup).
The parent directory README also contains instructions for running tests to ensure the repo/environment are set up correctly.

## Running experiments

To train classifiers and generate the primary results files for the optimization comparison (e.g. classification metrics, best model coefficients, loss function curves), run the `run_stratified_lasso_penalty.py` script. By default, it uses the `liblinear` (coordinate descent) optimizer; if the `--sgd` flag is included, it uses stochastic gradient descent (SGD) instead.

If the `--sgd` flag is included, the `--sgd_lr_schedule` argument selects the learning rate schedule. The default is `optimal` (also the scikit-learn default), but most experiments in the paper use the `constant_search` option. Other options are described in the [scikit-learn `SGDClassifier` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html), under the `learning_rate` parameter.
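
For intuition about what a learning rate schedule controls, here is a minimal sketch of two of the schedules documented for scikit-learn's `learning_rate` parameter (`constant` and `invscaling`). The function name `learning_rate_at` and the `eta0` value are illustrative assumptions, not code from this repo, and the repo-specific `constant_search` option is not covered because its details are not described here.

```python
def learning_rate_at(schedule: str, t: int, eta0: float = 0.01, power_t: float = 0.5) -> float:
    """Return the step size at update t (1-indexed) for a named schedule.

    Hypothetical helper for illustration only. Formulas follow the
    scikit-learn documentation:
      constant:   eta = eta0
      invscaling: eta = eta0 / t**power_t
    """
    if t < 1:
        raise ValueError("t must be >= 1")
    if schedule == "constant":
        # The step size never changes over training.
        return eta0
    if schedule == "invscaling":
        # The step size shrinks as more updates are applied.
        return eta0 / (t ** power_t)
    raise ValueError(f"schedule not covered by this sketch: {schedule!r}")


if __name__ == "__main__":
    for step in (1, 4, 100):
        print(step, learning_rate_at("constant", step), learning_rate_at("invscaling", step))
```

A constant schedule keeps the same step size for every update, so results depend on choosing `eta0` well, whereas `invscaling` decays the step size as training proceeds.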

The `--num_features` argument controls feature selection. It defaults to 8000 features, selected by median absolute deviation. For the experiments in the paper we used 16042 features, which is every feature in the preprocessed TCGA gene expression dataset.
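
As a rough illustration of selection by median absolute deviation (MAD), the sketch below ranks the columns of a samples-by-features matrix by MAD and keeps the top `num_features`. The function `select_by_mad` is a hypothetical helper written for this example; the repo's own implementation may differ in details such as tie handling.

```python
import numpy as np


def select_by_mad(X, num_features):
    """Return sorted column indices of the num_features columns with the largest MAD.

    Hypothetical helper, not the repo's implementation.
    X is assumed to be a (samples x features) array-like.
    """
    X = np.asarray(X, dtype=float)
    # Per-feature median, then the median of absolute deviations from it.
    medians = np.median(X, axis=0)
    mad = np.median(np.abs(X - medians), axis=0)
    # Rank features from highest to lowest MAD (stable sort for ties).
    order = np.argsort(-mad, kind="stable")
    # Asking for at least as many features as exist keeps all of them.
    return np.sort(order[:num_features])


if __name__ == "__main__":
    toy = [[0, 0, 0], [1, 10, 0], [2, 20, 0], [3, 30, 0]]
    print(select_by_mad(toy, 2))
```

In this sketch, requesting at least as many features as the matrix has columns keeps every feature, which mirrors passing the full feature count to use the whole dataset.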

The script will write output to the `--results_dir` directory, which defaults to `01_stratified_classification/results`.
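
The options described above can be combined into a single invocation. The following is a minimal sketch that only assembles the command line; the helper `build_command` is hypothetical, and the script may accept or require other arguments that are not described in this README. Only the flags named above are used.

```python
import sys


def build_command(sgd=False, sgd_lr_schedule=None, num_features=None, results_dir=None):
    """Assemble an argument list for run_stratified_lasso_penalty.py.

    Hypothetical helper for illustration; it uses only the flags
    mentioned in this README and leaves unset options to the script's defaults.
    """
    cmd = [sys.executable, "run_stratified_lasso_penalty.py"]
    if sgd:
        cmd.append("--sgd")
        if sgd_lr_schedule is not None:
            cmd += ["--sgd_lr_schedule", sgd_lr_schedule]
    elif sgd_lr_schedule is not None:
        # Design choice of this sketch: refuse a schedule without --sgd,
        # since the schedule is only described for the SGD optimizer.
        raise ValueError("sgd_lr_schedule requires sgd=True")
    if num_features is not None:
        cmd += ["--num_features", str(num_features)]
    if results_dir is not None:
        cmd += ["--results_dir", str(results_dir)]
    return cmd


if __name__ == "__main__":
    # Settings described for most experiments in the paper.
    print(" ".join(build_command(sgd=True, sgd_lr_schedule="constant_search", num_features=16042)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from whichever working directory the script expects.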

## Analysis and visualization of results

The Jupyter notebooks in this directory can be used to visualize the results generated by `run_stratified_lasso_penalty.py` and ultimately to generate the figures in the paper, as described above in the "repository layout" section. Each notebook has a `results_dir` variable (or multiple variables) defined near the top, which can either be set manually or modified programmatically using [papermill command line arguments](https://papermill.readthedocs.io/en/latest/).
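
To show what setting `results_dir` programmatically might look like, the sketch below assembles a papermill command using its `-p NAME VALUE` parameter option. The helper `papermill_command` and the output notebook name are hypothetical. Papermill overrides variables defined in a cell tagged `parameters`; whether the notebooks here carry that tag is not stated above, so treat this as an assumption to verify.

```python
def papermill_command(notebook, output_notebook, results_dir):
    """Build a papermill CLI call that overrides the results_dir variable.

    Hypothetical helper for illustration; `-p NAME VALUE` is papermill's
    option for passing a single parameter.
    """
    return [
        "papermill",
        str(notebook),
        str(output_notebook),
        "-p", "results_dir", str(results_dir),
    ]


if __name__ == "__main__":
    cmd = papermill_command(
        "lasso_range_gene_optimizers.ipynb",
        "lasso_range_gene_optimizers_out.ipynb",  # hypothetical output name
        "01_stratified_classification/results",
    )
    print(" ".join(cmd))
```

Notebooks with several results directory variables would need one `-p` pair per variable.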
