# Environment Setup

Follow these steps to set up and activate the development environment for this project:

1. **Create the Environment**

   Open a terminal in the project root directory and run one of the following commands:

   - For the standard environment:
     ```bash
     conda env create -f environment-dev.yml
     ```
   - For GPU support (e.g., with Sockeye):
     ```bash
     conda env create -f environment-dev-gpu.yml
     ```

2. **Activate the Environment**

   - For the standard environment:
     ```bash
     conda activate mds-afforest-dev
     ```
   - For GPU support:
     ```bash
     conda activate mds-afforest-dev-gpu
     ```

These commands will install all required dependencies for development.  
**Note:** Ensure you have [conda](https://docs.conda.io/en/latest/miniconda.html) installed before proceeding.
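
Running `conda env create` for an environment that already exists typically reports an error, so it can help to check first. The following is a minimal sketch, not part of this project: `env_in_list` is a hypothetical helper that reads the output of `conda env list` from standard input.

```bash
# Hypothetical helper: succeed if the environment named $1 appears in the
# first column of `conda env list` output read from standard input.
env_in_list() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Example usage (requires conda, so it is left commented out here):
# conda env list | env_in_list mds-afforest-dev \
#   || conda env create -f environment-dev.yml
```

The same check works for the GPU environment by substituting `mds-afforest-dev-gpu` and `environment-dev-gpu.yml`.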


# Running the Scripts

To run the scripts, make sure you are in the project root directory with the environment activated, then invoke them through `make`. The available targets are defined in the `Makefile` in the project root and are organized into sections such as data processing, model training, and evaluation.
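
To get a quick overview of what the `Makefile` offers, the target names can be extracted with standard tools. This is a rough sketch under the assumption that targets are declared one per line in the usual `name: prerequisites` form; `list_targets` is a hypothetical helper, not something the project provides.

```bash
# Hypothetical helper: print the explicit target names found in a Makefile
# read from standard input. Variable assignments such as `VAR:=value` and
# special targets starting with a dot (for example `.PHONY`) are skipped.
list_targets() {
  grep -E '^[A-Za-z0-9_][A-Za-z0-9_./-]*:([^=]|$)' | cut -d: -f1 | sort -u
}

# Example usage from the project root:
# list_targets < Makefile
```

To preview what a target would run without executing it, GNU Make's `-n` flag can be used, for example `make -n preprocess_features`.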

## Pre-requisites
Before running the scripts, ensure that:
- The environment is installed and activated as described above.
- The raw data has been downloaded. It is not included in the repository by default, but you can download it from: [Google Drive Link](https://drive.google.com/file/d/1GengsSVG29m0wH9EET1oaVhadv48dgGj/view?usp=drive_link)
- The data files are placed in the `data/raw` directory of the project.
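
A missing raw file is easier to diagnose before `make` starts than from a failure partway through. A minimal sketch, assuming the raw file name used in the load command in this guide; `require_file` is a hypothetical helper:

```bash
# Hypothetical helper: fail with a message on stderr if the file $1 is missing.
require_file() {
  if [ ! -f "$1" ]; then
    echo "Missing required file: $1" >&2
    return 1
  fi
}

# Example usage from the project root:
# require_file data/raw/AfforestationAssessmentDataUBCCapstone.rds
```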

## Load the Data
To convert the raw data to Parquet format, run the following command in the terminal:
```bash
make data/raw/raw_data.parquet RAW_DATA_PATH=data/raw/AfforestationAssessmentDataUBCCapstone.rds
```

## Preprocess the Data
To preprocess the data, you can use the following command:
```bash
make preprocess_features
```

# Data Processing for Classical Models

## Set the Threshold
The `THRESHOLD` variable sets the survival-rate cutoff that the pivoting operation uses to separate high from low survival rates. Define it in your shell before running the pivoting script, as shown below.
```bash
THRESHOLD=0.7
```
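
The example value `0.7` suggests that the threshold is a proportion. Assuming it must lie between 0 and 1 (the source does not state the valid range), a value could be sanity-checked before being passed to `make`. `valid_threshold` is a hypothetical helper:

```bash
# Hypothetical helper: succeed only if $1 is a plain decimal number in [0, 1].
# The [0, 1] range is an assumption based on the example value 0.7.
valid_threshold() {
  case "$1" in
    ''|*[!0-9.]*|*.*.*) return 1 ;;   # empty, non-numeric, or two dots
  esac
  awk -v t="$1" 'BEGIN { exit !(t >= 0 && t <= 1) }'
}

# Example usage:
# valid_threshold "$THRESHOLD" || echo "Unexpected THRESHOLD: $THRESHOLD" >&2
```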

## Pivot the Data
To pivot the data, you can use the following command:
```bash
make pivot_data THRESHOLD=${THRESHOLD}
```
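
As an illustration of what a survival threshold does, and not the project's actual pivoting code, the sketch below labels each survival rate as `high` or `low`. Treating a rate equal to the threshold as `high` is an assumption, and `label_survival` is a hypothetical helper.

```bash
# Hypothetical helper: read one survival rate per line from standard input and
# print it with a label, using $1 as the cutoff. Rates at or above the cutoff
# are labelled "high" (an assumption about how ties are handled).
label_survival() {
  awk -v t="$1" '{ print $1, ($1 + 0 >= t + 0 ? "high" : "low") }'
}

printf '0.9\n0.5\n0.7\n' | label_survival 0.7
# prints:
# 0.9 high
# 0.5 low
# 0.7 high
```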

## Split the Data  
To split the processed data into training and test sets:  
```bash
make data_split THRESHOLD=${THRESHOLD}
```  
This will execute the `data_split.py` script to generate the train and test datasets in the specified directory.


## Execute all Preprocessing for Classical Models

**To execute all of these commands at once:**

```bash
make data_for_classical_models THRESHOLD=${THRESHOLD}
```


# Data Processing for RNN Models

## Split the Cleaned Data
The RNN model does not use a threshold, so the data must be split at the interim processing stage:
```bash
make data_split_RNN
```
This splits the partially cleaned dataset into training and testing subsets.

## Generate Training Sequence Data
To generate the training time series data for the RNN modelling:
```bash
make time_series_train_data
```
This will execute the `get_time_series.py` script to generate the training lookup table, the sequences and the `norm_stats.json` file used for standard scaling.

## Generate Testing and Validation Sequence Data

To generate the test and validation time series data for the RNN modelling:
```bash
make time_series_test_data
```
This will execute the `get_time_series.py` script to generate the test and validation lookup tables and sequences, and use `norm_stats.json` to standardize the features.
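
The reason the statistics are written once and then reused is that test and validation features should be scaled with the mean and standard deviation fitted on the training data. The sketch below shows that idea on a single column of numbers. It is not the logic of `get_time_series.py`; `fit_stats` and `apply_stats` are hypothetical helpers, and the use of the population standard deviation is an assumption.

```bash
# Hypothetical helper: print "mean std" for the numbers on standard input
# (population standard deviation).
fit_stats() {
  awk '{ s += $1; ss += $1 * $1; n++ }
       END { m = s / n; printf "%.4f %.4f\n", m, sqrt(ss / n - m * m) }'
}

# Hypothetical helper: standardize the numbers on standard input with a
# previously fitted mean ($1) and standard deviation ($2).
apply_stats() {
  awk -v m="$1" -v sd="$2" '{ printf "%.4f\n", ($1 - m) / sd }'
}

# Fit on "training" values, then reuse the same statistics on "test" values.
stats="$(printf '2\n4\n4\n4\n5\n5\n7\n9\n' | fit_stats)"   # "5.0000 2.0000"
printf '7\n3\n' | apply_stats ${stats}                      # prints 1.0000 and -1.0000
```

Here `${stats}` is deliberately left unquoted so that the mean and standard deviation are passed as two separate arguments.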


## Execute all Preprocessing for RNN Models

**To execute all of these commands at once:**

```bash
make data_for_RNN_models
```


## Train the Models

### Classical Machine Learning Models
To train the models using the provided pipelines, run the following commands:

- **Logistic Regression:**
    ```bash
    make logistic_regression_pipeline
    ```
    This will train a logistic regression model and save it to `models/logistic_regression.joblib`.

- **Random Forest:**
    ```bash
    make random_forest_pipeline
    ```
    This will train a random forest model and save it to the `models/` directory.

- **Gradient Boosting:**
    ```bash
    make gradient_boosting_pipeline
    ```
    This will train a gradient boosting model and save it to the `models/` directory.

- **All models:**
    ```bash
    make all_classical_models
    ```
    This will train all the models defined in the `Makefile` and save them to the `models/` directory.

## Fine-tune the Classical Models
You can customize hyperparameters for tuning by setting the following variables in your command:

- `TUNING_METHOD`: Specify the tuning method (e.g., `grid` or `random`).
- `PARAM_GRID`: Define the parameter grid as a string (e.g., `"{'C':[0.1,1,10]}"`).
- `NUM_ITER`: Number of iterations for randomized search.
- `NUM_FOLDS`: Number of cross-validation folds.
- `SCORING`: Scoring metric (e.g., `accuracy`, `f1`).
- `RANDOM_STATE`: Random seed for reproducibility.
- `RETURN_RESULTS`: Set to `True` to return full results.


### Set up arguments for tuning
```bash
TUNING_METHOD=random
PARAM_GRID={}
NUM_ITER=20
NUM_FOLDS=5
SCORING=f1
RANDOM_STATE=42
RETURN_RESULTS=True
```
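
If `PARAM_GRID` contains spaces, an unquoted `${PARAM_GRID}` is split into several words before `make` sees it. The sketch below only demonstrates the shell's behaviour. `count_args` is a hypothetical helper, and whether the tuning scripts accept a grid written with spaces is not confirmed here.

```bash
# A parameter grid with spaces, quoted so the shell stores it as one value.
PARAM_GRID="{'C': [0.1, 1, 10]}"

# Hypothetical helper: print how many arguments it received.
count_args() { echo "$#"; }

count_args PARAM_GRID=${PARAM_GRID}     # unquoted: prints 4
count_args "PARAM_GRID=${PARAM_GRID}"   # quoted: prints 1
```

Quoting the whole `NAME=value` argument keeps the grid intact when it is passed to `make`.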

You can tune each classical model separately using the following commands:

- **Tune Gradient Boosting:**
    ```bash
    make tune_gbm \
        TUNING_METHOD=${TUNING_METHOD} \
        PARAM_GRID=${PARAM_GRID} \
        NUM_ITER=${NUM_ITER} \
        NUM_FOLDS=${NUM_FOLDS} \
        SCORING=${SCORING} \
        RANDOM_STATE=${RANDOM_STATE} \
        RETURN_RESULTS=${RETURN_RESULTS} \
        THRESHOLD_PCT=${THRESHOLD_PCT}
    ```

- **Tune Random Forest:**
    ```bash
    make tune_rf \
        TUNING_METHOD=${TUNING_METHOD} \
        PARAM_GRID=${PARAM_GRID} \
        NUM_ITER=${NUM_ITER} \
        NUM_FOLDS=${NUM_FOLDS} \
        SCORING=${SCORING} \
        RANDOM_STATE=${RANDOM_STATE} \
        RETURN_RESULTS=${RETURN_RESULTS} \
        THRESHOLD_PCT=${THRESHOLD_PCT}
    ```

- **Tune Logistic Regression:**
    ```bash
    make tune_lr \
        TUNING_METHOD=${TUNING_METHOD} \
        PARAM_GRID=${PARAM_GRID} \
        NUM_ITER=${NUM_ITER} \
        NUM_FOLDS=${NUM_FOLDS} \
        SCORING=${SCORING} \
        RANDOM_STATE=${RANDOM_STATE} \
        RETURN_RESULTS=${RETURN_RESULTS} \
        THRESHOLD_PCT=${THRESHOLD_PCT}
    ```

- **Tune all models:**
    ```bash
    make tune_classical_models \
        TUNING_METHOD=${TUNING_METHOD} \
        PARAM_GRID=${PARAM_GRID} \
        NUM_ITER=${NUM_ITER} \
        NUM_FOLDS=${NUM_FOLDS} \
        SCORING=${SCORING} \
        RANDOM_STATE=${RANDOM_STATE} \
        RETURN_RESULTS=${RETURN_RESULTS} \
        THRESHOLD_PCT=${THRESHOLD_PCT}
    ```
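
The tuning commands above pass `THRESHOLD_PCT`, which the setup block does not define, so it expands to an empty string unless it is set elsewhere. A minimal sketch for catching unset variables before calling `make`; `missing_vars` is a hypothetical helper:

```bash
# Hypothetical helper: print the name of every listed shell variable that is
# unset or empty, one per line.
missing_vars() {
  for _name in "$@"; do
    eval "_value=\${$_name:-}"
    [ -n "$_value" ] || echo "$_name"
  done
}

# Example usage before running a tuning target:
# missing="$(missing_vars TUNING_METHOD PARAM_GRID NUM_ITER NUM_FOLDS \
#   SCORING RANDOM_STATE RETURN_RESULTS THRESHOLD_PCT)"
# [ -z "$missing" ] || echo "Unset variables: $missing" >&2
```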



## Deep Learning Models (RNNs)

### Set up arguments for RNN model

Set the following environment variables to configure your RNN model architecture and training:

- `INPUT_SIZE`: Number of input features per time step (default: 12)
- `HIDDEN_SIZE`: Number of hidden units in the RNN (default: 16)
- `SITE_FEATURES_SIZE`: Number of site-level features to concatenate (default: 4)
- `RNN_TYPE`: Type of RNN cell to use (`LSTM`, `GRU`)
- `NUM_LAYERS`: Number of stacked RNN layers (default: 1)
- `DROPOUT_RATE`: Dropout rate between RNN layers (default: 0.2)
- `CONCAT_FEATURES`: Whether to concatenate site features (`True` or `False`)
- `RNN_PIPELINE_PATH`: Path to save the trained RNN pipeline

```bash
INPUT_SIZE=12
HIDDEN_SIZE=16
SITE_FEATURES_SIZE=4
RNN_TYPE=LSTM
NUM_LAYERS=2
DROPOUT_RATE=0.3
CONCAT_FEATURES=True
RNN_PIPELINE_PATH=models/rnn_model.pth
```

### Set up arguments for RNN training
Set the following environment variables to configure your RNN training:

- `LR`: Learning rate for the optimizer (default: 0.01)
- `BATCH_SIZE`: Batch size for training (default: 64)
- `EPOCHS`: Number of training epochs (default: 10)
- `PATIENCE`: Early stopping patience (default: 5)
- `NUM_WORKERS`: Number of workers for data loading (default: 0)
- `SITE_COLS`: Comma-separated list of site-level feature columns (default: Density,Type_Conifer,Type_Decidous,Age)
- `SEQ_COLS`: Comma-separated list of sequential feature columns (default: NDVI,SAVI,MSAVI,EVI,EVI2,NDWI,NBR,TCB,TCG,TCW,log_dt,neg_cos_DOY)
- `TRAINED_RNN_OUTPUT_PATH`: Path to save the trained RNN model outputs

```bash
LR=0.01
BATCH_SIZE=64
EPOCHS=10
PATIENCE=5
NUM_WORKERS=0
SITE_COLS=Density,Type_Conifer,Type_Decidous,Age
SEQ_COLS=NDVI,SAVI,MSAVI,EVI,EVI2,NDWI,NBR,TCB,TCG,TCW,log_dt,neg_cos_DOY
TRAINED_RNN_OUTPUT_PATH=models/trained_rnn_model.pth
```
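
The defaults are consistent with `INPUT_SIZE` matching the number of `SEQ_COLS` entries (12) and `SITE_FEATURES_SIZE` matching the number of `SITE_COLS` entries (4). Assuming that relationship is required, which this guide does not state explicitly, a quick consistency check might look like the following. `count_cols` is a hypothetical helper.

```bash
# Hypothetical helper: count the entries in a comma-separated list.
count_cols() {
  printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

SEQ_COLS=NDVI,SAVI,MSAVI,EVI,EVI2,NDWI,NBR,TCB,TCG,TCW,log_dt,neg_cos_DOY
SITE_COLS=Density,Type_Conifer,Type_Decidous,Age
INPUT_SIZE=12
SITE_FEATURES_SIZE=4

# Warn if the sizes and column lists disagree (assumed relationship).
[ "$(count_cols "$SEQ_COLS")" -eq "$INPUT_SIZE" ] \
  || echo "INPUT_SIZE does not match the number of SEQ_COLS" >&2
[ "$(count_cols "$SITE_COLS")" -eq "$SITE_FEATURES_SIZE" ] \
  || echo "SITE_FEATURES_SIZE does not match the number of SITE_COLS" >&2
```

If you change either column list, this kind of check flags a stale size value before the model is created.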

- **Create RNN model:**  
    To create the RNN model with the specified architecture, run the following command:
    ```bash
    make rnn_model \
        INPUT_SIZE=${INPUT_SIZE} \
        HIDDEN_SIZE=${HIDDEN_SIZE} \
        SITE_FEATURES_SIZE=${SITE_FEATURES_SIZE} \
        RNN_TYPE=${RNN_TYPE} \
        NUM_LAYERS=${NUM_LAYERS} \
        DROPOUT_RATE=${DROPOUT_RATE} \
        CONCAT_FEATURES=${CONCAT_FEATURES} \
        RNN_PIPELINE_PATH=${RNN_PIPELINE_PATH}
    ```

- **Train RNN model:**
    To train the RNN model with the specified parameters, run:
    ```bash
    make rnn_training \
        LR=${LR} \
        BATCH_SIZE=${BATCH_SIZE} \
        EPOCHS=${EPOCHS} \
        PATIENCE=${PATIENCE} \
        NUM_WORKERS=${NUM_WORKERS} \
        SITE_COLS=${SITE_COLS} \
        SEQ_COLS=${SEQ_COLS} \
        RNN_PIPELINE_PATH=${RNN_PIPELINE_PATH} \
        TRAINED_RNN_OUTPUT_PATH=${TRAINED_RNN_OUTPUT_PATH}
    ```


## Test
To run the tests, you can use the following command:
```bash
make test
```

## Clean Up

To clean up generated data and models, you can use the following commands:

- **Clean data files:**
    ```bash
    make clean_data
    ```
    This will remove the raw, interim, and processed data files and recreate the necessary directories with `.gitkeep` files.

- **Clean model files:**
    ```bash
    make clean_models
    ```
    This will remove all files in the `models` directory.

- **Clean all generated files:**
    ```bash
    make clean_all
    ```