<hr style="border:1.5px solid gray">

# DATA3888 Final Report

<hr style="border:1.5px solid gray">

## Optiver 1

---
SIDs: 520434835, 




### Executive Summary

Financial markets are complex systems in which participants buy and sell financial instruments, such as stocks, bonds, derivatives, currencies and commodities, through a structured marketplace under regulatory oversight. Financial data provides a rich environment for data analysis, spanning market research, investment strategies, risk management, algorithmic trading, and predictive modelling. An order book is one example of a relatively simple dataset: it records the bid and ask prices and sizes of a particular stock at a given moment in the market. High-Frequency Trading (HFT) market makers like Optiver provide liquidity to the market and earn small profits over large volumes of trades. For these trading firms, capturing price fluctuations of the products being traded is essential to profitability. In this report, we discuss the impact of feature engineering and clustering on the performance of model predictions.

Realised volatility is defined as the square root of the sum of squared consecutive log returns, where the returns are calculated from the Weighted Average Price (WAP):
$$
WAP = \frac{\text{BidPrice}_1\cdot\text{AskSize}_1+\text{AskPrice}_1\cdot\text{BidSize}_1} {\text{BidSize}_1+\text{AskSize}_1}
$$
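As a minimal sketch of these two definitions, the Python below computes the WAP for a few order book snapshots and then the realised volatility of the resulting price series. The function names and the snapshot values are hypothetical and chosen only for illustration; they are not taken from the Optiver data or from our pipeline.

```python
import math

def wap(bid_price, ask_price, bid_size, ask_size):
    """Weighted Average Price from the best bid and ask quotes."""
    return (bid_price * ask_size + ask_price * bid_size) / (bid_size + ask_size)

def log_returns(prices):
    """Log returns between consecutive prices."""
    return [math.log(later / earlier) for earlier, later in zip(prices, prices[1:])]

def realised_volatility(prices):
    """Square root of the sum of squared log returns."""
    return math.sqrt(sum(r * r for r in log_returns(prices)))

# Hypothetical snapshots: (BidPrice1, AskPrice1, BidSize1, AskSize1)
snapshots = [
    (99.0, 101.0, 100, 100),
    (99.5, 100.5, 300, 100),
    (100.0, 101.0, 50, 150),
]

waps = [wap(*snapshot) for snapshot in snapshots]
rv = realised_volatility(waps)
print(waps, rv)
```

Note that each price is weighted by the size on the opposite side of the book, so a large bid size pulls the WAP towards the ask price.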

To supplement the attributes in the stock datasets with additional model variables, we applied feature engineering to the bid and ask prices and sizes, creating various spread measures for each time ID. We then compared our optimised linear regression model against a baseline linear model. The engineered features were also used in our tree-based models, LightGBM and Random Forest. For comparison, time series models, namely GARCH and AutoRegressive Integrated Moving Average (ARIMA), were also developed. These models were compared to identify the one that predicts volatility with the lowest error.
 
*Write something on clustering?*

Our final product for traders is an interactive Shiny application that allows users to upload stock data and visualise predicted volatility through an optimised ensemble model, which combines all models’ predictions and selects the one with the lowest error.


<img src="Figure_1.drawio.png" alt="figure 1" />

## Method

### Linear Regression

### LightGBM 

### Random Forest

### GARCH

### ARIMA

In this study, an ARIMA model was employed to predict stock volatility. ARIMA is a widely used approach for forecasting time series data, particularly in finance[1]. Specifically, an ARIMA(3,1,1) model was used. The parameter p = 3 means that the model uses the past three observations to predict the current value. The parameter d = 1 means that the data was differenced once to remove trends and achieve stationarity. The parameter q = 1 means that the model includes one moving average term, which uses the previous forecast error to improve the prediction. The values of p, d and q were selected through a combination of theoretical knowledge and empirical analysis, in which multiple parameter sets were tested. The ARIMA(3,1,1) model exhibited a low root mean square error (RMSE), indicating high accuracy, and a narrow spread of RMSE values, indicating consistent predictions.
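For reference, the ARIMA(3,1,1) structure described above can be written in the standard textbook form below. The notation is an assumption introduced here for illustration: $y_t$ is the volatility series, $B$ is the backshift operator ($B y_t = y_{t-1}$), $\phi_1, \phi_2, \phi_3$ are the autoregressive coefficients, $\theta_1$ is the moving average coefficient, and $\varepsilon_t$ is the forecast error. A constant term is omitted for simplicity.
$$
(1 - \phi_1 B - \phi_2 B^2 - \phi_3 B^3)(1 - B)\, y_t = (1 + \theta_1 B)\, \varepsilon_t
$$

Equivalently, writing the differenced series as $w_t = y_t - y_{t-1}$:
$$
w_t = \phi_1 w_{t-1} + \phi_2 w_{t-2} + \phi_3 w_{t-3} + \varepsilon_t + \theta_1 \varepsilon_{t-1}
$$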


#### Moving Window

Combining a moving window approach with the ARIMA model allows the model to adapt to evolving patterns in the dataset, improving its predictive capability[2]. With a sliding window of a fixed size, the model is refitted as the data changes, so that predictions reflect the most recent observations. In this study, the 'step' parameter was set to 4, corresponding to a 60-second prediction window. The original dataset of 40 volatilities was divided into 36 for training and 4 for testing, so the model is evaluated on 4 held-out values that it did not see during training.
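The split and the sliding window can be sketched as follows. This is a minimal illustration rather than our actual implementation: the helper names `split_last` and `moving_windows` are hypothetical, the volatility values are placeholders, and the ARIMA fit on each training window is left out.

```python
def split_last(series, n_test):
    """Hold out the final n_test observations for testing."""
    return series[:-n_test], series[-n_test:]

def moving_windows(series, window, step):
    """Yield (train, test) pairs as a fixed-size window slides forward by `step`."""
    for start in range(0, len(series) - window - step + 1, step):
        train = series[start:start + window]
        test = series[start + window:start + window + step]
        yield train, test

# Placeholder data: 40 volatility values (not real Optiver values).
volatilities = [0.001 * (i + 1) for i in range(40)]

# 36 values for training, 4 for testing, as described above.
train, test = split_last(volatilities, n_test=4)

# With 40 values, a window of 36 and a step of 4 there is exactly one window.
# A longer series would produce one (train, test) pair per step.
windows = list(moving_windows(volatilities, window=36, step=4))
print(len(train), len(test), len(windows))
```

In a full pipeline, a model would be fitted on each `train` slice and its forecasts for the next `step` values compared against the matching `test` slice.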

## Results

## Discussion

### Shiny

The machine learning model for predicting stock volatility was deployed in a Python-based Shiny app (https://nithya7612.shinyapps.io/my-app/). The app provides traders with quick and accurate volatility forecasts for a range of stocks, supporting their decision-making in dynamic financial markets.

The app mimics real-time streaming of data and provides access to the Optiver stock database, delivering an authentic trading experience. Currently, it includes 20 stock files, allowing users to select from these pre-loaded options. Because of the limits of the free Posit plan, not all 100 stock files could be deployed, so only 1GB worth of data was uploaded. Once a file is chosen, the app produces volatility predictions within approximately one second. Traders simply select the stock ID to receive one-minute-ahead predictions, which makes the app an efficient tool for rapid decision-making and helps them react quickly to market changes.

Shiny for Python was used to design the prediction app. To deploy the app, the command `rsconnect deploy shiny ./dashboard/ --name nithya7612 --title my-app` was executed in the terminal. Rsconnect first verifies the connection to the Shiny server to ensure it is reachable and checks that the app mode is compatible with the Shiny deployment requirements. The next step is making a bundle. The deployment tool gathers all the necessary files from the application directory, including Python scripts, data files and CSS files. The requirements.txt file is parsed to identify all the dependencies required by the application. All of this is compressed into a single bundle file, together with metadata about the files, which streamlines the upload by reducing the number of individual file transfers.

The next stage is to deploy the bundle itself. The bundle is uploaded to the server and then unpacked, which involves decompressing the archive and extracting all the files. The server creates an environment for the application to run in by installing the application files (Python scripts etc.) and the required dependencies (requirements.txt). Images of the application are then pushed to the server. Following this, the server begins the staging phase in preparation for deployment. A new instance of the application is rolled forward and any old instances are terminated. The final stage, “Verifying Deploying Content”, confirms that the application runs as expected on the server.

A multidisciplinary approach was crucial to the development of the volatility prediction application, integrating finance, data science and human-computer interaction. Finance underpins the app’s purpose, ensuring it meets the financial needs and objectives of Optiver’s options traders. Data science provides the models used to predict volatility and to assess the accuracy of those predictions. Human-computer interaction focuses on optimising how options traders interact with the application, ensuring a user-friendly interface.

### Future Work
Future work for the app involves continuously updating the model and improving its predictive capabilities by training on data received throughout the day. Each night, once trading closes, the app would use the latest stock data to train a new model, so that predictions are based on the most recent information. With this approach, the app would start each day with a freshly trained model, which should improve accuracy and relevance in a dynamic stock market environment. By prioritising more recent data, the model should better capture the latest market trends and shifts, which have a greater impact on future volatility than significantly older data.


## Conclusion

### Future Work

## References

[1] https://www.sciencedirect.com/science/article/pii/S0020025523015360?casa_token=1rjJEPgVSvsAAAAA:iITm0ttv6qfGLVIvuPKeL-xbOJ7GZrDO03L140nkd2CN5PXTX1WxsZCjc3xCsGqhmJ5kQN3VPQ#br0240 <br/>
[2] https://arxiv.org/abs/2405.08284

## Student Contribution