# Modelling

This notebook both trains models and uses them to generate predictions. Feel free to begin configuring model code!

Below are some models we may use, with a few observations on each.
1. N-BEATS
    - Sequence modelling.
    - Easily interpretable.
    - Very performant for time-series forecasting.
    - More stable to train than RNN-based neural networks (N-BEATS itself is built from fully connected blocks, not recurrence).
    - [Docs](https://unit8co.github.io/darts/generated_api/darts.models.forecasting.nbeats.html).
2. XGBoost
    - Feature modelling. Can use [tsfresh](https://tsfresh.readthedocs.io/en/latest/) to automatically generate and filter a large set of candidate features.
    - Quick training.
    - Can handle categorical features natively (`enable_categorical=True` on `category`-dtype columns), so manual label-encoding is often unnecessary.
    - Performant with limited data.
    - Handles missing values.
    - Requires good-quality training data.
    - [Docs](https://xgboost.readthedocs.io/en/stable/).
3. DLinear
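    - Incredibly simple, yet surprisingly performant for seasonal and cyclical time-series.
    - Splits the series into trend and seasonal components and fits a linear layer to each; loosely, an N-BEATS-style decomposition without the deep network.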
    - [Docs](https://unit8co.github.io/darts/generated_api/darts.models.forecasting.dlinear.html).
4. Temporal Fusion Transformer
    - Sequence modelling. Handles time-series very well.
    - Good interpretability.
    - Requires a good deal of training data to perform well. Whether we have enough is an open question.
    - [Docs](https://unit8co.github.io/darts/generated_api/darts.models.forecasting.tft_model.html).
5. Random Forest Classifier
    - A simpler tree ensemble (bagging instead of boosting); usually less performant than XGBoost.
    - Feature modelling.
    - [Docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
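
To make the difference between "feature modelling" and "sequence modelling" in the list above concrete: a feature-based model such as XGBoost typically sees the series as a flat table, where each row holds a few past values (lags) and the target is the next value. The sketch below illustrates that idea only; the function name `make_lag_features` and the toy numbers are assumptions, not project code.

```python
def make_lag_features(series, n_lags=3):
    """Turn a 1-D series into (features, target) rows of lagged values."""
    rows, targets = [], []
    for t in range(n_lags, len(series)):
        rows.append(series[t - n_lags:t])  # the n_lags values just before step t
        targets.append(series[t])          # the value the model should predict
    return rows, targets


# Toy series (illustrative only)
X, y = make_lag_features([10, 11, 13, 16, 20], n_lags=3)
print(X)  # [[10, 11, 13], [11, 13, 16]]
print(y)  # [16, 20]
```

Sequence models such as N-BEATS or the Temporal Fusion Transformer take the ordered window directly, so this flattening step is mainly relevant for the tree-based models.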

## Setup

### Imports

### Data Loading
I have used these load/save functions in earlier projects, but I have not verified that they work properly in this one. Please fix the code if you find errors.

In [None]:
import os

import geopandas as gpd
import pandas as pd


def load_data(filename, folder="raw"):
    """
    Load the data from the specified folder with appropriate delimiter.
    
    Parameters
    ----------
    filename : str
        The name of the file to load.
    folder : str, optional
        The folder to load the data from. Default is "raw".
    
    Returns
    -------
    pandas.DataFrame or geopandas.GeoDataFrame
        The loaded data.
    """
    # Delimiters for specific files; all other files fall back to the default
    delimiters = {
        'ais_test.csv': ',',
        'ais_sample_submission.csv': ',',
        'default': '|'
    }
    
    BASE_DIR = os.getcwd()
    file_path = os.path.join(BASE_DIR, f"../data/{folder}/{filename}")
    
    try:
        if folder == "maps":
            # Map files (land/ocean) are read with geopandas
            if "land" in filename or "ocean" in filename:
                df = gpd.read_file(file_path)
            else:
                raise ValueError(f"Unsupported map file: {filename}")
        else:
            # Use the file-specific delimiter, falling back to the default
            delimiter = delimiters.get(filename, delimiters['default'])
            df = pd.read_csv(file_path, sep=delimiter)

        print(f"Data loaded successfully from {file_path}")
        return df
    except Exception as e:
        # Report the failure, then re-raise so callers never get a silent None
        print(f"An error occurred while loading the file: {e}")
        raise

def save_data(df, filename, folder="2_interim"):
    """
    Save the dataframe to a CSV file in the specified folder.
    
    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe to save.
    filename : str
        The name of the file to save, without the ".csv" extension (it is appended automatically).
    folder : str, optional
        The folder to save the data in. Default is "2_interim".
    """
    BASE_DIR = os.getcwd()
    file_path = os.path.join(BASE_DIR, f"../data/{folder}/{filename}.csv")
    
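    try:
        # Pipe-delimited to match the default delimiter expected by load_data
        df.to_csv(file_path, sep='|', index=False)
        print(f"Data saved successfully to {file_path}")
    except Exception as e:
        # Report the failure, then re-raise instead of silently continuing
        print(f"An error occurred while saving the file: {e}")
        raise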
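
Both helpers build their paths relative to `os.getcwd()`, so they assume the notebook is started from a folder that sits next to `data/`. A minimal sketch for checking where a file would be read from, using a hypothetical `resolve_data_path` helper and an assumed `notebooks/` folder (neither is part of the project code):

```python
from pathlib import Path


def resolve_data_path(filename, folder="raw", base_dir=None):
    """Return the absolute path that load_data would read from.

    base_dir defaults to the current working directory, mirroring os.getcwd().
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    # "../data/<folder>/<filename>" relative to the base directory
    return (base / ".." / "data" / folder / filename).resolve()


# Assumed layout: /tmp/project/notebooks sits next to /tmp/project/data
path = resolve_data_path("ais_test.csv", base_dir="/tmp/project/notebooks")
print(path, path.exists())
```

If `path.exists()` is `False`, the working directory is probably not the one the helpers expect.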

### Helper Functions

## XGBoost

This section implements the XGBoost model and then uses it for prediction.

Pros:
- Quick training
- Can handle categorical features natively (`enable_categorical=True`), so manual label-encoding is often unnecessary
- Performant with limited data
- Handles missing values

Cons:
- Requires good-quality training data

Got a tip that XGBoost can be configured to use quantile regression as the objective function. We need to use the objective ```reg:quantileerror``` (available from XGBoost 2.0) with the parameter ```quantile_alpha=0.2```; note that plain ```alpha``` is the L1 regularization term, not the quantile level. This way we optimize the model for the 20-percent quantile, as the task asks for.
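
To see what optimizing for the 20-percent quantile means, it helps to look at the pinball (quantile) loss that this objective minimizes. The sketch below is plain Python and does not call XGBoost. The function name `pinball_loss`, the toy targets, and the `xgb_params` dictionary are illustrative assumptions, and `quantile_alpha` is the parameter name as I understand it from XGBoost 2.0 onwards.

```python
def pinball_loss(y_true, y_pred, quantile=0.2):
    """Mean pinball (quantile) loss; lower is better."""
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        error = actual - predicted
        # Under-prediction is weighted by `quantile`, over-prediction by 1 - quantile
        total += quantile * error if error >= 0 else (quantile - 1.0) * error
    return total / len(y_true)


# Assumed parameter dict, e.g. for xgboost.XGBRegressor(**xgb_params)
xgb_params = {"objective": "reg:quantileerror", "quantile_alpha": 0.2}

# Toy targets: among constant predictions, the loss is lowest at the 20% quantile
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
best = min(y, key=lambda c: pinball_loss(y, [c] * len(y)))
print(best)  # 2.0
```

With a quantile of 0.2, predicting too high costs four times as much as predicting too low by the same amount, which is what pulls the fitted values towards the lower end of the target distribution.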

### Training

### Prediction

### Model Evaluation