# Second Assignment - FINTECH 540 - Machine Learning for FinTech - Cryptocurrency Options Data Analysis

In this assignment, you'll work with cryptocurrency options transaction data between 2022-08-04 and 2022-11-18 from Binance. The primary objective is to achieve a satisfactory performance in predicting option prices. This is a regression task, and you must ensure that your model can predict well on the test set (out-of-sample).

## Dataset Overview

You have been provided with a dataset containing the following columns:

- **Symbol**: Unique identifier for each option contract. For instance, the option contract *ETH-221125-900-C* refers to a Call option (C) for Ethereum (ETH), with a strike price (also called execution price) of 900, expiring on November 25th, 2022 (221125). If a holder has this option, they can buy Ethereum at a price of 900 units of currency on or before this expiration date, regardless of Ethereum's actual market price.
- **Time**: Timestamp indicating when the trade occurred.
- **Price**: Price at which the option was traded. **This is the target you want to predict for this task**.
- **Quantity (Qty)**: Number of option contracts traded.
- **Strike Price**: The predetermined price at which the holder can buy or sell the underlying asset.
- **Underlying Price**: The current market price of the underlying asset (e.g., BTC or ETH for this dataset).
- **Time to Maturity (Tt)**: Duration (expressed as a fraction of a year) until the option's expiration date. For instance, if `Tt=0.145152`, there are 0.145152*365 = 52.98 days until the expiration. The Time to Maturity has already been calculated for you, and you do not need to convert it back to days for this exercise.
- **RV_lag0, RV_lag7, RV_lag15, RV_lag30**: Realized underlying asset volatility at various lag values. The lag 0 reflects the underlying asset's volatility on the day the transaction occurred.
- **Underlying Asset**: The asset upon which the option contract is based (e.g., BTC or ETH).
- **Option Type**: Specifies whether it's a 'Call' (C) or 'Put' (P) option.
- **Side**: Represents whether the trade was a buy (`1`) or a sell (`-1`) since these are transaction data that reflect both sides of the order book.
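The symbol layout and the `Tt` conversion described in the list above can be sketched in a few lines. This is only an illustration: the helper names `parse_option_symbol` and `tt_to_days` are hypothetical, and the sketch assumes every symbol follows the four-part `ASSET-YYMMDD-STRIKE-TYPE` layout of the example.

```python
from datetime import date, datetime


def parse_option_symbol(symbol: str) -> dict:
    """Split a symbol such as 'ETH-221125-900-C' into its four parts.

    Assumes the layout ASSET-YYMMDD-STRIKE-TYPE shown in the example above.
    """
    underlying, expiry, strike, option_type = symbol.split("-")
    return {
        "underlying_asset": underlying,
        "expiry": datetime.strptime(expiry, "%y%m%d").date(),
        "strike_price": float(strike),
        "option_type": option_type,  # 'C' for Call, 'P' for Put
    }


def tt_to_days(tt: float, days_per_year: int = 365) -> float:
    """Convert a time to maturity given as a fraction of a year into days."""
    return tt * days_per_year


parsed = parse_option_symbol("ETH-221125-900-C")
print(parsed)
print(round(tt_to_days(0.145152), 2))  # 52.98, as in the Tt example above
```

The dataset already ships `strike_price`, `underlying_asset`, `option_type`, and `Tt` as separate columns, so parsing the symbol is mainly useful as a consistency check.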

## Task and General Hints

In this assignment, you are tasked with building a predictive regression model on cryptocurrency options data. Your primary goal is to ensure accurate out-of-sample predictions and evaluate them with the metrics below.

To guide you through this process, consider breaking down your tasks into the following three phases:

**Preprocessing**
The dataset is already free of inconsistencies, missing values, or outliers. 
- **Feature Engineering**: You might want to create additional variables, perform transformations, or encode categorical variables if necessary. Ensure that all the variables you want to use for modeling are correctly preprocessed. You don't need to use all the variables necessarily. You will eventually refine your choices while modeling.
- **Data Splitting**: Partition your data into training and test sets. Ensure you set the seed to `42` for reproducibility and use `0.33` as the test size.
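As a minimal sketch of both preprocessing steps, the snippet below builds a few candidate features and then splits the data with the required seed and test size. The rows in `toy` are made up, and the engineered features (`log_moneyness`, `sqrt_Tt`, `is_call`, `is_eth`) are illustrative suggestions rather than a prescribed feature set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy rows standing in for the real dataset (all values are made up).
toy = pd.DataFrame({
    "price": [12.5, 3.1, 250.0, 40.2, 7.7, 95.0],
    "strike_price": [1500, 1200, 20000, 19000, 1700, 21000],
    "underlying_price": [1550.0, 1300.0, 19500.0, 19800.0, 1600.0, 20500.0],
    "Tt": [0.05, 0.10, 0.02, 0.15, 0.08, 0.03],
    "underlying_asset": ["ETH", "ETH", "BTC", "BTC", "ETH", "BTC"],
    "option_type": ["C", "P", "P", "C", "C", "P"],
})

# Hypothetical engineered features; whether they help is an empirical question.
features = pd.DataFrame({
    "log_moneyness": np.log(toy["underlying_price"] / toy["strike_price"]),
    "sqrt_Tt": np.sqrt(toy["Tt"]),
    "is_call": (toy["option_type"] == "C").astype(int),
    "is_eth": (toy["underlying_asset"] == "ETH").astype(int),
})
target = toy["price"]

# Split with the seed and test size required by the assignment.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.33, random_state=42
)
print(X_train.shape, X_test.shape)
```

Tree ensembles do not need feature scaling, so the binary encodings can be used as they are.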

**Model Selection**
- This notebook focuses on using ensembles of trees for regression. You can experiment with all the ensembles we have seen in class. However, feel free to compare the performance against a linear regression model. 
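One way to set up such a comparison is to score each candidate with the same cross-validation scheme. The sketch below uses synthetic data (`X_demo`, `y_demo`) with a deliberately nonlinear target, and scikit-learn's `RandomForestRegressor` and `GradientBoostingRegressor` as stand-ins for whichever ensembles you choose; it is not meant to suggest which model will win on the options data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: a hinge-like term plus a quadratic term and a little noise.
rng = np.random.default_rng(42)
X_demo = rng.uniform(-1.0, 1.0, size=(300, 3))
y_demo = (
    10.0 * np.maximum(X_demo[:, 0], 0.0)
    + X_demo[:, 1] ** 2
    + rng.normal(0.0, 0.1, size=300)
)

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

# Mean 5-fold cross-validated R^2 for each candidate.
cv_r2 = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, est in candidates.items()
}
for name, score in cv_r2.items():
    print(f"{name}: mean CV R^2 = {score:.3f}")
```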

**Model Tuning and Evaluation**
- Once you've selected a model, you'll want to fine-tune its parameters to achieve the best out-of-sample performance.
- You may adjust parameters manually, but consider using Grid Search or Randomized Search for a more systematic and often more practical approach.
- When employing Grid Search (or Randomized Search), you can use cross-validation schemes provided in **Notebook 8**.
- Evaluate your final model using the $R^2$ and the Root Mean Squared Error (RMSE) metrics from `scikit-learn`. For the RMSE, you can either calculate the MSE and take its square root or use the `mean_squared_error` function from `scikit-learn` with the parameter `squared=False`. Remember, this is the primary criterion on which you will be graded. You can carry out the calculations on your own while developing your solution. However, the final cell of this notebook also takes care of it, so follow the naming convention stated at the bottom of the notebook.
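The two metrics can be computed as in the sketch below, which uses made-up arrays `y_true` and `y_pred`. It takes the square root of the MSE explicitly; recent scikit-learn releases deprecate the `squared` argument in favor of a separate `root_mean_squared_error` function, so the explicit square root should behave the same regardless of the installed version.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, only to show the calls.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 39.0])

# RMSE as the square root of the MSE (version-independent).
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(f"RMSE = {rmse:.4f}")
print(f"R^2  = {r2:.4f}")
```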

**Note**: Parameter choices and tuning are up to you, but make them thoughtfully. Carefully study the documentation of the tree ensemble you are testing to see which parameters you can fine-tune. Notebook 8 contains the structure of a simple grid search over a small grid of parameters. You can borrow that structure and modify it accordingly.

**IMPORTANT REMARK**: 

In scikit-learn, if you use cross-validation tools like `cross_val_score`, `GridSearchCV`, or `RandomizedSearchCV` with a specified cross-validation scheme, you only need to pass the training dataset. The cross-validation function will automatically split the training data into training and validation subsets multiple times according to the selected cross-validation strategy. Therefore, creating a separate validation set to tune the hyperparameters is unnecessary.

Instead, use the test set obtained from the split solely as data the model has never seen before. Your grade is based on the results on that part of the dataset.
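The workflow described in this remark can be sketched as follows. The data are synthetic, and the estimator, the small parameter grid, and the `KFold` scheme are placeholders; the cross-validation schemes from Notebook 8 would take the place of `KFold` here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(200, 4))
y_all = 3.0 * X_all[:, 0] + np.abs(X_all[:, 1]) + rng.normal(0.0, 0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.33, random_state=42
)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid={"max_depth": [3, 6], "min_samples_leaf": [1, 5]},
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_mean_squared_error",
)
# Only the training data enters the search; the validation folds are
# carved out of it internally by the cross-validation scheme.
search.fit(X_tr, y_tr)

# The test set is used once, at the very end, on the refitted best model.
test_r2 = r2_score(y_te, search.predict(X_te))
print(search.best_params_, round(test_r2, 3))
```

By default `GridSearchCV` refits the best configuration on the whole training set, which is why `search.predict` can be called directly.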

## Grading Rubric

Your grade will be determined by combining the $R^2$ value and the **normalized Root Mean Squared Error (RMSE)** your model achieves on the test set. Specifically, your grade will be calculated as:

$$ \text{Grade} = (0.7 \times R^2 + 0.3 \times \text{Normalized RMSE}) \times 100 $$

which will be a number between 0 and 100. Grades may be curved before being released.

The normalization for RMSE is defined as:

$$ \text{Normalized RMSE} = 1 - \left( \frac{\text{RMSE}}{\text{MAX_POSSIBLE_RMSE}} \right) $$

Here, `MAX_POSSIBLE_RMSE` is a reference value for the worst acceptable RMSE on your dataset; in this notebook it is set to the standard deviation of the test target values. This normalization keeps the RMSE term between 0 (worst) and 1 (best) as long as the model's RMSE does not exceed that standard deviation.
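As a worked illustration of the two formulas above, the hypothetical helper `assignment_grade` below computes the grade from scratch with NumPy. It assumes `MAX_POSSIBLE_RMSE` is the sample standard deviation of the test targets, which is what `y_test.std()` returns for a pandas Series.

```python
import numpy as np


def assignment_grade(y_true, y_pred) -> float:
    """Grade = (0.7 * R^2 + 0.3 * normalized RMSE) * 100."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    # Assumption: sample standard deviation (ddof=1), as in pandas' Series.std().
    max_possible_rmse = y_true.std(ddof=1)
    normalized_rmse = 1.0 - rmse / max_possible_rmse

    return (0.7 * r2 + 0.3 * normalized_rmse) * 100


perfect = assignment_grade([1, 2, 3, 4], [1, 2, 3, 4])
mean_only = assignment_grade([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5])
print(perfect, mean_only)
```

A perfect prediction yields 100, while always predicting the mean of the targets yields a grade close to 0, because $R^2$ is 0 and the RMSE is almost equal to the standard deviation.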

**Rationale**

The rationale for using both $R^2$ and **RMSE** in your grade is to ensure a holistic assessment of your model's performance:

- **$R^2$** captures the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher $R^2$ indicates a model that explains more of the variation, providing insight into the model's goodness of fit.
  
- **RMSE** measures the average magnitude of the errors between predicted and observed values. It offers a more direct interpretation of how much, on average, predictions deviate from the actual values, allowing for a clear understanding of the model's accuracy. The RMSE is normalized to allow being combined with the $R^2$ and results in a number between 0 and 100.

By weighting $R^2$ and **RMSE** with weights of 0.7 and 0.3, respectively, we emphasize the model's ability to explain variance while holding it accountable for its accuracy in terms of error magnitude.

The quality of the prediction assessed by those metrics results from all the choices you make: preprocessing the features, including them in a model, selecting and evaluating a proper regression model, and finally optimizing the hyperparameters.

In [1]:
### START CODE HERE ###
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt

In [2]:
raw_data = pd.read_csv('raw_options_binance.csv')
# 1. Data preprocessing
# 1.1 Select the relevant columns.
# Time and Symbol are not used as features here.
# The first column is just an index, so it is dropped as well.

selected_columns = [ 'price', 'qty', 'strike_price',
       'underlying_price', 'Tt', 'RV_lag0', 'RV_lag7', 'RV_lag15', 'RV_lag30',
       'underlying_asset', 'option_type', 'side']
# .copy() returns an independent DataFrame, so the column assignments below
# do not raise pandas' SettingWithCopyWarning.
df = raw_data[selected_columns].copy()
# 1.2 Encode underlying_asset and option_type numerically so a model can use them.
df['trans_underlying_asset'] = (df['underlying_asset'] == "ETH").astype(int)
df['trans_option_type'] = (df['option_type'] == "C").astype(int)


In [3]:
df['option_type'].unique()
X = df[['qty', 'strike_price',
       'underlying_price', 'Tt', 'RV_lag0', 'RV_lag7', 'RV_lag15', 'RV_lag30',
       'trans_underlying_asset', 'trans_option_type', 'side']]
y = df['price']

In [4]:
# 1.3 fitting model
max_components = min(X.shape)
model_results = []
for n in range(5, max_components + 1):
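    # perform PCA (note: it is fitted on the full dataset, i.e. before the
    # train/test split, so the test rows also influence the components)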
    pca = PCA(n_components = n)
    pca.fit(X)
    trans_X = pca.transform(X)
    
    # split the data
    X_train, X_test, y_train, y_test = train_test_split(trans_X, y, test_size=0.33, random_state=42)
    print(X_train.shape)

    # build model
    model = XGBRegressor(random_state = 42)
    
    # hyperparameter grid to search
    param_grid = {
        "learning_rate": np.arange(0.1,0.5,0.1),
        "max_depth": range(3, 10),
        "n_estimators":[50,500]
    }
    model_gs = GridSearchCV(
        model, param_grid, scoring="neg_mean_squared_error",  n_jobs=-1, verbose=1
    )
    
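    # Record the result (best_score_ is the mean cross-validated negative MSE
    # of the best candidate, not a score on the full training set)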
    model_gs.fit(X_train,y_train)
    result = {}
    result['PCA_n'] = n
    result['best_params'] = model_gs.best_params_
    result['best_score'] = model_gs.best_score_
    model_results.append(result)
    print("number of components from PCA:",n)
    print(f"Best parameters: {model_gs.best_params_}")
    print(f"Neg MSE (Training set): {model_gs.best_score_:.4f}")

(27731, 5)
Fitting 5 folds for each of 56 candidates, totalling 280 fits
number of components from PCA: 5
Best parameters: {'learning_rate': 0.4, 'max_depth': 8, 'n_estimators': 50}
Neg MSE (Training set): -12050.5804
(27731, 6)
Fitting 5 folds for each of 56 candidates, totalling 280 fits
number of components from PCA: 6
Best parameters: {'learning_rate': 0.4, 'max_depth': 3, 'n_estimators': 500}
Neg MSE (Training set): -10110.2720
(27731, 7)
Fitting 5 folds for each of 56 candidates, totalling 280 fits
number of components from PCA: 7
Best parameters: {'learning_rate': 0.2, 'max_depth': 7, 'n_estimators': 500}
Neg MSE (Training set): -8045.2063
(27731, 8)
Fitting 5 folds for each of 56 candidates, totalling 280 fits
number of components from PCA: 8
Best parameters: {'learning_rate': 0.30000000000000004, 'max_depth': 7, 'n_estimators': 500}
Neg MSE (Training set): -7250.5993
(27731, 9)
Fitting 5 folds for each of 56 candidates, totalling 280 fits
number of components from PCA: 9
Best 

In [5]:
max_dict = max(model_results, key=lambda x: x['best_score'])
print(max_dict)  # configuration with the highest (least negative) cross-validated score

{'PCA_n': 9, 'best_params': {'learning_rate': 0.30000000000000004, 'max_depth': 7, 'n_estimators': 500}, 'best_score': -7036.130389498916}


In [6]:
# 1.4 Refit the best configuration to obtain the graded predictions


pca = PCA(n_components=max_dict['PCA_n'])
pca.fit(X)
trans_X = pca.transform(X)
X_train, X_test, y_train, y_test = train_test_split(trans_X, y, test_size=0.33, random_state=42)

best_model = XGBRegressor(
    learning_rate=max_dict['best_params']['learning_rate'],
    max_depth=max_dict['best_params']['max_depth'],
    n_estimators=max_dict['best_params']['n_estimators'],
    random_state=42,
)
best_model.fit(X_train, y_train)
y_test_pred = best_model.predict(X_test)
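In the cells above, the PCA is fitted on the full dataset before the split, so information from the test rows reaches the transformation. A possible alternative, sketched here under stated assumptions, is to place the PCA inside a scikit-learn `Pipeline` and tune the number of components together with the model, so that every fit only sees training data. The data are synthetic, `GradientBoostingRegressor` stands in for `XGBRegressor` to keep the example free of extra dependencies, and the `StandardScaler` step is an addition that the original cells do not use.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(300, 6))
y_syn = 2.0 * X_syn[:, 0] - X_syn[:, 1] ** 2 + rng.normal(0.0, 0.1, size=300)

# Split first, so the test rows never influence any fitted step.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),  # assumption: PCA is scale-sensitive
    ("pca", PCA()),
    ("model", GradientBoostingRegressor(n_estimators=50, random_state=42)),
])

# The number of components is tuned like any other hyperparameter.
grid = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": [3, 6], "model__max_depth": [2, 3]},
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X_tr, y_tr)

y_te_pred = grid.predict(X_te)
print(grid.best_params_)
```

With this layout the separate loop over `n` is no longer needed, because the grid search refits the scaler and the PCA on each training fold.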

**Instructions to let the next code cell run:**

Before running the cell below, ensure the following:
1. The target variable of your problem has to be named exactly `y_test`, while the out-of-sample prediction variable has to be named `y_test_pred`. The calculation of `MAX_POSSIBLE_RMSE` also relies on this naming convention to determine the standard deviation of the test target values.

By adhering to these naming conventions, the grading cell can compute the final score without any issues.

In [7]:
import math
from sklearn.metrics import mean_squared_error, r2_score
MAX_POSSIBLE_RMSE = y_test.std()
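# Square root of the MSE, equivalent to mean_squared_error(..., squared=False)
# but independent of the installed scikit-learn version.
normalized_rmse = 1 - (math.sqrt(mean_squared_error(y_test, y_test_pred)) / MAX_POSSIBLE_RMSE)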
R2 = r2_score(y_test,y_test_pred)

Grade = 0.7 * R2 + 0.3 * normalized_rmse
print('The grade for this assignment is ',math.ceil(Grade*100))

The grade for this assignment is  97
