add some info about how/what order to run scripts

greenelab · Jun 12, 2023 · 131ae24
1 parent 26f2439
commit 131ae24
Showing 1 changed file with 15 additions and 1 deletion.
diff --git a/01_stratified_classification/README.md b/01_stratified_classification/README.md
@@ -17,11 +17,25 @@ A more detailed description of the results and takeaways can be found in the man
 |-- lasso_range_gene_optimizers.ipynb: plot detailed results for a single cancer gene (Figures 1A/B and 3C/D)
 |-- lasso_range_gene_learning_rate.ipynb: plot detailed results for varying learning rate schedules (Figure 2)
 |-- optimizer_figures.ipynb: script to generate multi-panel figures in manuscript
-|-- run_stratified_classification.py: script to train classifiers and write results (performance, coefficients, loss function values)
+|-- run_stratified_lasso_penalty.py: script to train LASSO penalized classifiers and write results (performance, coefficients, loss function values)
 |-- run_stratified_nn.py: script to train neural network classifier (not used in final paper)
 ```
 
 ## Setup and testing pipeline
 
 To set up the environment for running the code in this repo, use the conda environment described in the [parent directory](https://github.com/greenelab/pancancer-evaluation#setup).
 The parent directory README also contains instructions for running tests to ensure the repo/environment are set up correctly.
+
+## Running experiments
+
+To train classifiers and generate the primary results files for the optimization comparison (e.g. classification metrics, best model coefficients, loss function curves), run the `run_stratified_lasso_penalty.py` script. By default, the script uses the `liblinear` (coordinate descent) optimizer; if the `--sgd` flag is included, it uses stochastic gradient descent (SGD) instead.
+
+If the `--sgd` flag is included, the `--sgd_lr_schedule` argument selects the learning rate schedule. The default is `optimal` (also the scikit-learn default), but most experiments in the paper use the `constant_search` option. Other options are described in the [scikit-learn `SGDClassifier` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html), under the `learning_rate` parameter.
+
+The `--num_features` argument controls feature selection. It defaults to 8000 features, selected by median absolute deviation. The experiments in the paper used 16042 features, which is every feature in the preprocessed TCGA gene expression dataset.
+
+The script will write output to the `--results_dir` directory, which defaults to `01_stratified_classification/results`.
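
As a minimal sketch of how these options combine, the Python snippet below assembles a command line from the flags described above. The helper name `build_command` is hypothetical, and the script may require other arguments not covered in this section, so treat it as an illustration rather than a complete invocation.

```python
import shlex


def build_command(sgd=False, sgd_lr_schedule=None, num_features=None, results_dir=None):
    """Assemble an argument list for run_stratified_lasso_penalty.py.

    Hypothetical helper: only the flags described in this README are handled.
    """
    cmd = ["python", "run_stratified_lasso_penalty.py"]
    if sgd:
        cmd.append("--sgd")
        # The learning rate schedule is only meaningful together with --sgd.
        if sgd_lr_schedule is not None:
            cmd += ["--sgd_lr_schedule", sgd_lr_schedule]
    if num_features is not None:
        cmd += ["--num_features", str(num_features)]
    if results_dir is not None:
        cmd += ["--results_dir", results_dir]
    return cmd


# SGD with the schedule and feature count described for most paper experiments.
paper_like = build_command(sgd=True, sgd_lr_schedule="constant_search", num_features=16042)
print(shlex.join(paper_like))
```

The returned list could be passed to `subprocess.run` if you want to launch the script from Python. Omitting every keyword leaves the script on its defaults (`liblinear`, 8000 features, default results directory).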
+
+## Analysis and visualization of results
+
+The Jupyter notebooks in this directory can be used to visualize the results generated by `run_stratified_lasso_penalty.py` and ultimately to generate the figures in the paper, as described above in the "repository layout" section. Each notebook has a `results_dir` variable (or several such variables) defined near the top, which can either be set manually or modified programmatically using [papermill command line arguments](https://papermill.readthedocs.io/en/latest/).
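
To illustrate the programmatic route, here is a hedged sketch that builds a papermill command line overriding `results_dir` for one of the notebooks. It assumes the notebook's `results_dir` assignment sits in a cell tagged `parameters` (which papermill relies on to override values); the README does not confirm this. The helper name, output notebook name, and results path are all hypothetical.

```python
import shlex


def papermill_command(notebook, output_notebook, results_dir):
    """Build a papermill CLI call that sets a notebook's results_dir parameter."""
    return [
        "papermill",
        notebook,
        output_notebook,
        # -p NAME VALUE passes a single parameter to the notebook.
        "-p", "results_dir", results_dir,
    ]


cmd = papermill_command(
    "lasso_range_gene_optimizers.ipynb",
    "lasso_range_gene_optimizers.out.ipynb",  # hypothetical output name
    "./results",  # hypothetical results location
)
print(shlex.join(cmd))
```

Notebooks that define more than one results variable would need one `-p` pair per variable, using whatever names the notebook actually defines.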