# LLM-Lasso Tutorial

## 1. Setup Instructions
1. Install `LLM-Lasso` as an editable package:
    ```
    $ pip install -e .
    ```
    for `pip`, or
    ```
    $ conda develop .
    ```
    for `conda`. Note that this requires you to `conda install conda-build`.

2. Initialize the `adelie` submodule:
    ```
    $ git submodule init
    $ git submodule update
    ```
3. Install `adelie` as an editable package (`adelie` is used for solving LASSO with penalty factors).
    ```
    $ cd adelie-fork
    $ pip install -e .
    ```
    or the equivalent for `conda`.

4. Copy the file `sample_constants.py` to `_my_constants.py` and populate relevant API keys.

The values from `_my_constants.py` are automatically loaded into `constants.py`.
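
One common way to get this kind of automatic loading is an optional import that overlays user values on top of defaults. The sketch below illustrates that pattern only; it is an assumption, not the repository's actual `constants.py`. The helper name `load_overrides` and the key `OPENAI_API_KEY` are hypothetical.

```python
import importlib


def load_overrides(defaults, module_name):
    """Return a copy of `defaults` updated with upper-case names from a module.

    If the module cannot be imported, the defaults are returned unchanged.
    """
    merged = dict(defaults)
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return merged
    for name in dir(module):
        if name.isupper():
            merged[name] = getattr(module, name)
    return merged


# Hypothetical default; a real key would live in `_my_constants.py`.
DEFAULTS = {"OPENAI_API_KEY": ""}
constants = load_overrides(DEFAULTS, "_my_constants")
```

With this pattern, a missing `_my_constants.py` leaves the defaults in place instead of raising an error at import time.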

### 1.1 Common issues:
Installing `adelie` as an editable package requires compiling from source, which may come with several issues:
- `adelie` requires some C++ libraries, namely `eigen`, `llvm`, and `openmp` (which may be installed as `libomp`). For Unix-based systems, these should be available through your package manager, and releases are also available online.
- There may be issues with the `eigen` library (and others) not being in the `C_INCLUDE_PATH` and `CPLUS_INCLUDE_PATH`. For this, you need to:
    - Find where the `eigen` include directory is on your machine (it should be a directory with subdirectories `Eigen` and `unsupported`). For macOS with `eigen` installed via `homebrew`, this may be in a directory that looks like `/opt/homebrew/Cellar/eigen/3.4.0_1/include/eigen3/`. For Linux, this may be `/usr/include/eigen3/` or `/usr/local/include/eigen3/`, for instance.

    - Run the following:
        ```
        $ export C_INCLUDE_PATH="the_path_from_the_previous_step:$C_INCLUDE_PATH"
        $ export CPLUS_INCLUDE_PATH="the_path_from_the_previous_step:$CPLUS_INCLUDE_PATH"
        ```
    You may also have to do this with other libraries, like `libomp`.

- If you installed `llvm` via `homebrew` on macOS, make sure you run the following:
    ```
    $ export LDFLAGS="-L/opt/homebrew/opt/llvm/lib"
    $ export CPPFLAGS="-I/opt/homebrew/opt/llvm/include"
    ```
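
To locate the `eigen` include directory described in the list above, a small check for the `Eigen` and `unsupported` subdirectories can help. This is a minimal sketch; the helper names are hypothetical, and the candidate paths are just the example locations mentioned above, which may not match your machine.

```python
from pathlib import Path


def is_eigen_include_dir(path):
    """True if `path` contains the `Eigen` and `unsupported` subdirectories."""
    root = Path(path)
    return (root / "Eigen").is_dir() and (root / "unsupported").is_dir()


def find_eigen_include_dirs(candidates):
    """Return the candidates that look like an Eigen include directory."""
    return [str(c) for c in candidates if is_eigen_include_dir(c)]


# Example locations taken from the text above; adjust for your machine.
CANDIDATES = [
    "/opt/homebrew/Cellar/eigen/3.4.0_1/include/eigen3",
    "/usr/include/eigen3",
    "/usr/local/include/eigen3",
]

# Print the export line for every candidate that exists on this machine.
for found in find_eigen_include_dirs(CANDIDATES):
    print(f'export CPLUS_INCLUDE_PATH="{found}:$CPLUS_INCLUDE_PATH"')
```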

## 2. Imports

In [None]:
from llm_lasso.task_specific_lasso.llm_lasso import *
from llm_lasso.task_specific_lasso.plotting import plot_heatmap, plot_llm_lasso_result
from llm_lasso.data_splits import read_train_test_splits, read_baseline_splits
import numpy as np
import warnings
import json
warnings.filterwarnings("ignore")  # Suppress warnings

In [None]:
%load_ext autoreload
%autoreload 2

## 3. Small-Scale Classification Example: Diabetes
The first 4 steps will be run on the command line, and the remainder of the tutorial will be run using this notebook.
### Step 1: Generate Training and Test Splits
For evaluation, we consider 50/50 balanced training and test splits generated with different random seeds. As the same splits are used for the LASSO portion of LLM-Lasso and the data-driven baselines, we generate them beforehand.

To generate $k$ train/test splits (here, $k = 10$, set via `--n-splits`), run the following on the command line from the base directory of this repository:
```
$ python scripts/small_scale_splits.py \
        --dataset Diabetes \
        --save_dir data/splits/diabetes \
        --n-splits 10
```

**Corresponding shell script**: run `./shell_scripts/diabetes/step_01_splits.sh`
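
The idea of seeded 50/50 splits can be sketched in a few lines. The code below assumes that "balanced" means stratified by class label, which the text does not spell out, and it is not the logic of `scripts/small_scale_splits.py`. The function name and the toy labels are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_half_split(labels, seed):
    """Split indices 50/50 so that each class is divided evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        half = len(indices) // 2
        train.extend(indices[:half])
        test.extend(indices[half:])
    return sorted(train), sorted(test)


# Toy labels (assumption): 6 negatives and 4 positives.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
# One split per random seed, mirroring the "different random seeds" idea.
toy_splits = [stratified_half_split(labels, seed) for seed in range(10)]
```

Fixing the seed per split is what lets the baselines and the LASSO runs see identical training and test sets.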


### Step 2: Run Data-Driven Baselines
Next, run the baseline feature-selection methods that require access to the training splits, e.g., mutual information.
```
$ python scripts/run_baselines.py \
        --split-dir data/splits/diabetes \
        --n-splits 10 \
        --save-dir data/baselines/diabetes
```

**Corresponding shell script**: run `./shell_scripts/diabetes/step_02_baselines.sh`
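
To make the mutual-information baseline concrete, here is a minimal sketch that ranks discrete features by their mutual information with the label, using only the training data. It is an illustration under simplifying assumptions, not what `scripts/run_baselines.py` does. Continuous features would need binning or a nearest-neighbor estimator such as scikit-learn's `mutual_info_classif`. The feature names `a` and `b` are made up.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written in terms of raw counts.
        mi += p_xy * math.log(count * n / (px[x] * py[y]))
    return mi


def rank_features(columns, labels):
    """Return feature names sorted from most to least informative."""
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy training data (assumption): `a` copies the label, `b` is independent of it.
train_labels = [0, 0, 1, 1]
train_columns = {"b": [0, 1, 0, 1], "a": [0, 0, 1, 1]}
ranking = rank_features(train_columns, train_labels)
```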


### Step 3: Run the LLM-Score Baseline
For example:
```
$ python scripts/llm_score.py \
        --prompt-filename prompts/llm-select/diabetes_prompt.txt \
        --feature_names_path small_scale/data/Diabetes_feature_names.pkl \
        --category Diabetes \
        --wipe \
        --save_dir data/llm-score/diabetes \
        --n-trials 1 \
        --step 1 \
        --model-type gpt-4o \
        --temp 0
```

**Corresponding shell script**: run `./shell_scripts/diabetes/step_03_llm_score_baseline.sh`


### Step 4: Generate LLM-Lasso Penalties
Note that there is no RAG setup for the small-scale datasets, so we will not enable RAG in the following script.
```
$ python scripts/llm_lasso_scores.py \
        --prompt-filename prompts/small_scale_prompts/diabetes_prompt.txt \
        --feature_names_path small_scale/data/Diabetes_feature_names.pkl \
        --category Diabetes \
        --wipe \
        --save_dir data/llm-lasso/diabetes \
        --n-trials 1 \
        --model-type gpt-4o \
        --temp 0
```

**Corresponding shell script**: run `./shell_scripts/diabetes/step_04_llm_lasso_penalties.sh`
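
To show what the penalties are used for downstream, the sketch below solves a Lasso with per-feature penalty factors $w_j$, i.e. it minimizes $\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \sum_j w_j |\beta_j|$ by coordinate descent. This is a toy, pure-Python illustration for squared loss. It is not the `adelie` solver, and a classification task such as Diabetes would use a logistic loss instead. All names and data here are made up.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def weighted_lasso(X, y, lam, penalty_factors, n_iters=200):
    """Coordinate descent for the Lasso with per-feature penalty factors."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta, with beta = 0 initially
    for _ in range(n_iters):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue
            # Correlation of feature j with the partial residual.
            rho = sum(c * r for c, r in zip(col, residual)) / n + z * beta[j]
            new_beta = soft_threshold(rho, lam * penalty_factors[j]) / z
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - c * delta for r, c in zip(residual, col)]
                beta[j] = new_beta
    return beta


# Toy data (assumption): both features are equally predictive of y.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [1.0, 1.0, -1.0, -1.0]

uniform = weighted_lasso(X, y, lam=0.1, penalty_factors=[1.0, 1.0])
skewed = weighted_lasso(X, y, lam=0.1, penalty_factors=[1.0, 10.0])
```

With uniform factors both coefficients survive, while a large factor on the second feature drives its coefficient to zero. A small penalty factor therefore acts as a statement that the feature is likely to matter.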


### Step 5: Run LLM-Regularized LASSO

#### **Prepare Data**
First, load in the required data splits, penalty factors, and baseline-selected features.

In [None]:
# Load in splits
N_SPLITS = 10
splits = read_train_test_splits("../data/splits/diabetes", N_SPLITS)
n_features = splits[0].x_train.shape[1]

In [None]:
# Load in LLM-Lasso Penalties
penalty_list={
    "plain": np.array(
        np.load("../data/llm-lasso/diabetes/final_scores_plain.pkl", allow_pickle=True)
    ),
}

In [None]:
# Load in baseline features
feature_baseline = read_baseline_splits(
    "../data/baselines/diabetes", n_splits=N_SPLITS, n_features=n_features)

with open("../data/llm-score/diabetes/llmselect_selected_features.json", "r") as f:
    llm_select_genes = json.load(f)[f"{n_features}"]

feature_baseline["llm_score"] = [llm_select_genes] * N_SPLITS
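
Two details of the cell above are easy to miss: JSON object keys are always strings, which is why the lookup uses `f"{n_features}"`, and list multiplication repeats a reference to the same list for every split. The sketch below demonstrates both with made-up file contents. The assumed layout (feature count mapped to a list of feature names) is inferred from the lookup, not confirmed by the source.

```python
import json

# Made-up file contents: keys are feature counts, values are selected features.
raw = '{"2": ["glucose", "bmi"], "3": ["glucose", "bmi", "age"]}'
selected = json.loads(raw)

n_feat = 3
# JSON object keys are always strings, so the integer must be converted.
top_features = selected[str(n_feat)]

n_splits = 4
# Every entry refers to the same list object; this is fine for read-only use.
per_split = [top_features] * n_splits
```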

#### **Run Experiments**

LLM-Lasso experiments are set up in a **modular** fashion, so you run the baselines, Lasso, and LLM-Lasso separately.

**Experiment Configuration:**

In [None]:
config = LLMLassoExperimentConfig(
    folds_cv=5, # number of cross-validation folds
    regression=False, # this is classification, not regression
    score_type=PenaltyType.PF, # we have penalty factors from the LLM,
                               # not importance scores
    max_imp_power=4,
    lambda_min_ratio=0.01, # Lasso parameter
    n_threads=8, # number of threads to use for computation
    run_pure_lasso_after=5,
    lasso_downstream_l2=True,
    cross_val_metric=CrossValMetric.ERROR
)
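
For intuition about `lambda_min_ratio`, the sketch below builds a regularization path under the glmnet-style convention: a log-spaced grid from the largest penalty `lambda_max` down to `lambda_max * lambda_min_ratio`. That this is exactly how `LLMLassoExperimentConfig` interprets the parameter is an assumption, and the function name and grid size are made up.

```python
import math


def lambda_path(lambda_max, lambda_min_ratio, n_lambdas):
    """Log-spaced grid from lambda_max down to lambda_max * lambda_min_ratio."""
    if n_lambdas < 2:
        return [lambda_max]
    log_max = math.log(lambda_max)
    log_min = math.log(lambda_max * lambda_min_ratio)
    step = (log_min - log_max) / (n_lambdas - 1)
    return [math.exp(log_max + i * step) for i in range(n_lambdas)]


# Assumed values: lambda_max = 1.0 and a 5-point grid, for illustration only.
path = lambda_path(lambda_max=1.0, lambda_min_ratio=0.01, n_lambdas=5)
```

Under this convention, a smaller ratio extends the path toward weaker regularization, so more features can enter the model.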

**Run Data-Driven Baselines**

This, along with all of the following experiment functions, outputs a pandas `DataFrame`.

In [None]:
baselines = run_downstream_baselines_for_splits(
    splits=splits,
    feature_baseline=feature_baseline,
    config=config
)

In [None]:
baselines

**Run Lasso Baseline**

In [None]:
lasso = run_lasso_baseline_for_splits(
    splits=splits,
    config=config
)

**Run LLM-Lasso**

In [None]:
llm_lasso = run_llm_lasso_cv_for_splits(
    splits=splits,
    scores=penalty_list,
    config=config,
    verbose=False
)

In [None]:
llm_lasso[llm_lasso["n_features"] == 1]

In [None]:
lasso[lasso["n_features"] == 1]

**Plotting Results**

To plot the test error and AUROC, use `plot_llm_lasso_result` and pass in a list of all of the dataframes output by the previous experiments.

In [None]:
dataframes_to_plot = [df[df["n_features"] > 0] for df in [lasso, baselines, llm_lasso]]
plot_llm_lasso_result(
    dataframes_to_plot,
    bolded_methods=["1/imp - plain"],
    plot_error_bars=False,
)

You can also plot a feature inclusion heatmap.

In [None]:
plot_heatmap(
    [llm_lasso, lasso],
    method_models=["1/imp - plain", "Lasso"], # these are from the method_model column of the dataframe
    labels=["LLM-Lasso", "Lasso"], # this is how each method_model will be labeled on the plot
    feature_names=splits[0].x_train.columns,
    sort_by="LLM-Lasso"
)
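
One natural quantity behind a feature inclusion heatmap is the fraction of splits in which each feature is selected. The sketch below computes that quantity from made-up selections. Whether `plot_heatmap` computes exactly this is an assumption, and the function and feature names are hypothetical.

```python
from collections import Counter


def inclusion_frequency(selected_per_split, feature_names):
    """Fraction of splits in which each feature was selected."""
    counts = Counter()
    for selected in selected_per_split:
        counts.update(set(selected))  # count each feature once per split
    n_splits = len(selected_per_split)
    return {name: counts[name] / n_splits for name in feature_names}


# Made-up selections for four splits.
selections = [
    ["glucose", "bmi"],
    ["glucose"],
    ["glucose", "age"],
    ["bmi", "glucose"],
]
freq = inclusion_frequency(selections, ["glucose", "bmi", "age", "insulin"])
```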

## 4. Small-Scale Regression Example: Spotify
As in the Diabetes example, the first four steps are run on the command line, and the remainder of the tutorial is run in this notebook.
### Command-Line Component
Here are the commands to save data splits, run baselines, and generate penalties, same as for the Diabetes example:
```
$ python scripts/small_scale_splits.py \
        --dataset Spotify \
        --save_dir data/splits/spotify \
        --n-splits 10

$ python scripts/run_baselines.py \
        --split-dir data/splits/spotify \
        --n-splits 10 \
        --save-dir data/baselines/spotify

$ python scripts/llm_score.py \
        --prompt-filename prompts/llm-select/spotify_prompt.txt \
        --feature_names_path small_scale/data/Spotify_feature_names.pkl \
        --category "number of Spotify streams" \
        --wipe \
        --save_dir data/llm-score/spotify \
        --n-trials 1 \
        --step 1 \
        --model-type gpt-4o \
        --temp 0

$ python scripts/llm_lasso_scores.py \
        --prompt-filename prompts/small_scale_prompts/spotify_prompt.txt \
        --feature_names_path small_scale/data/Spotify_feature_names.pkl \
        --category "number of Spotify streams" \
        --wipe \
        --save_dir data/llm-lasso/spotify \
        --n-trials 1 \
        --model-type gpt-4o \
        --temp 0
```
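
Since the first two commands differ between datasets only in the dataset name, they can be generated programmatically. The sketch below builds the argument lists for those two steps using only the flags shown above. Note that the scripts mix `--save_dir` and `--save-dir`, so the spelling must match each script. The helper name is hypothetical, and lower-casing the dataset name for the directory is an assumption that matches both examples in this tutorial.

```python
def split_and_baseline_commands(dataset, n_splits=10):
    """Build the argument lists for the split and baseline steps."""
    key = dataset.lower()  # e.g. "Spotify" -> "spotify"
    return [
        [
            "python", "scripts/small_scale_splits.py",
            "--dataset", dataset,
            "--save_dir", f"data/splits/{key}",  # underscore in this script
            "--n-splits", str(n_splits),
        ],
        [
            "python", "scripts/run_baselines.py",
            "--split-dir", f"data/splits/{key}",
            "--n-splits", str(n_splits),
            "--save-dir", f"data/baselines/{key}",  # hyphen in this script
        ],
    ]


commands = split_and_baseline_commands("Spotify")
# To execute from the repository root, each list could be passed to
# subprocess.run(cmd, check=True).
```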

### LLM-Regularized LASSO
First, load in the required data splits, penalty factors, and baseline-selected features.

In [None]:
# Load in splits
N_SPLITS = 10
splits = read_train_test_splits("../data/splits/spotify", N_SPLITS)
n_features = splits[0].x_test.shape[1]

In [None]:
# Load in LLM-Lasso Penalties
penalty_list={
    "plain": np.array(
        np.load("../data/llm-lasso/spotify/final_scores_plain.pkl", allow_pickle=True)
    ),
}

In [None]:
# Load in baseline features
feature_baseline = read_baseline_splits(
    "../data/baselines/spotify", n_splits=N_SPLITS, n_features=n_features
)

with open("../data/llm-score/spotify/llmselect_selected_features.json", "r") as f:
    llm_select_genes = json.load(f)[f"{n_features}"]

feature_baseline["llm_score"] = [llm_select_genes] * N_SPLITS

Compute test error and AUROC for LLM-Lasso and the baselines, averaged across the splits.

Make sure to pass in **`regression=True`** to `LLMLassoExperimentConfig`!

In [None]:
config = LLMLassoExperimentConfig(
    folds_cv=10, # number of cross-validation folds
    regression=True, # This is regression!!
    score_type=PenaltyType.PF, # we have penalty factors from the LLM,
                               # not importance scores
    lambda_min_ratio=0.001, # Lasso parameter
    n_threads=8, # number of threads to use for computation
)
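
The `folds_cv` setting controls how many cross-validation folds are used. As a reminder of what that means, the sketch below partitions sample indices into $k$ folds and yields one train/validation pair per fold. It is a generic illustration, not the library's internal splitting logic, and the function name and sample count are made up.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Return a list of (train_indices, validation_indices), one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin into n_folds groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    return [
        (
            sorted(i for fold in folds[:k] + folds[k + 1:] for i in fold),
            sorted(folds[k]),
        )
        for k in range(n_folds)
    ]


# Assumed toy size: 25 samples split into 10 folds.
cv_folds = k_fold_indices(n_samples=25, n_folds=10)
```

Each sample lands in exactly one validation fold, so every training observation contributes once to the cross-validation estimate.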

In [None]:
baselines = run_downstream_baselines_for_splits(
    splits=splits,
    feature_baseline=feature_baseline,
    config=config
)

In [None]:
lasso = run_lasso_baseline_for_splits(
    splits=splits,
    config=config
)

In [None]:
adaptive_lasso = run_adaptive_lasso_for_splits(
    splits=splits,
    config=config
)
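
For context, the adaptive Lasso in its standard form reweights the penalty per feature using an initial coefficient estimate, with weights proportional to $1 / |\hat{\beta}_j|^{\gamma}$. The sketch below computes such weights. How `run_adaptive_lasso_for_splits` obtains its initial estimate or chooses $\gamma$ is not shown in this tutorial, so the function name, defaults, and coefficients here are assumptions.

```python
def adaptive_weights(initial_coefs, gamma=1.0, eps=1e-8):
    """Adaptive Lasso weights: large initial coefficients get small penalties.

    `eps` guards against division by zero for coefficients that are exactly 0.
    """
    return [1.0 / (abs(b) + eps) ** gamma for b in initial_coefs]


# Made-up initial estimates, e.g. from a ridge or least-squares fit.
weights = adaptive_weights([2.0, -0.5, 0.0])
```

These weights play the same role as the LLM-derived penalty factors, except that they come from the data rather than from prior knowledge.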

In [None]:
xgboost = run_xgboost_for_splits(
    splits=splits,
    ordered_features=feature_baseline["xgboost"],
    config=config
)

In [None]:
llm_lasso = run_llm_lasso_cv_for_splits(
    splits=splits,
    scores=penalty_list,
    config=config,
    verbose=False
)

In [None]:
plot_llm_lasso_result(
    [lasso, adaptive_lasso, baselines, xgboost, llm_lasso],
    bolded_methods=["1/imp - plain"],
    plot_error_bars=False
)

Plot the feature inclusion heatmap.

In [None]:
plot_heatmap(
    [llm_lasso, lasso],
    method_models=["1/imp - plain", "Lasso"], # these are from the method_model column of the dataframe
    labels=["LLM-Lasso", "Lasso"], # this is how each method_model will be labeled on the plot
    feature_names=splits[0].x_train.columns,
    sort_by="LLM-Lasso"
)