# Jane Street Market Prediction 💰
> Jane Street Market Prediction Kaggle Competition part 2

- toc: true 
- badges: true
- comments: true
- author: Jaekang Lee
- categories: [MLP, python, feature engineering, imputation, Jane Street, Kaggle, Visualization, Big Data, random forest]

Got a score of 9443.499 (249th place out of 3616 competitors) using an MLP.

### Library 📂

### Null Values 🈳

As discussed before in my [EDA notebook](https://leejaeka.github.io/jaekangai/python/eda/jane%20street/kaggle/visualization/big%20data/2021/01/23/JaneStreet-Copy1.html), we have a couple of options for handling null values. <br>
1. Drop all nans
2. Impute with median or mean
3. Forward fill/backward fill
4. KNN imputer
5. Be creative! 
<br>

In this notebook, I used a KNN imputer with 5 nearest neighbors to fill the NaNs. This takes a long time to run, so I suggest downloading the imputed data files from [here](https://www.kaggle.com/louise2001/janestreetimputeddata) by louise2001. Note that the same dataset also includes soft and iterative imputations.
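As a rough illustration of what KNN imputation with 5 neighbors does, here is a minimal sketch using scikit-learn's `KNNImputer` on a toy matrix. The toy data is made up for the example; it is not the competition data, and the downloaded files above may have been produced differently.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix with missing values (a stand-in for the real feature columns).
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [5.0, 6.0, 9.0],
    [7.0, 8.0, 12.0],
    [9.0, 10.0, 15.0],
    [11.0, 12.0, 18.0],
])

# Each NaN is replaced by the mean of that feature over the 5 nearest rows
# that have the feature observed. Distances ignore missing coordinates.
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

On the full dataset this is expensive because every row with a missing value needs a neighbor search, which is why loading precomputed files is the practical choice here.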

### Import Data 📚

In this notebook, we are just going to load the imputed data instead of running the imputation here, since it is very time consuming and takes a lot of RAM.

### Feature Engineering 🔧

We start with two filtering steps right off the bat.
1. We drop any rows whose 'weight' column equals 0. A zero weight means the overall gain from such a trade is 0, so keeping these rows would only teach the machine to guess. <br>
2. We drop all dates before day 85. Plotting the data over time shows why: the trend before day 85 is clearly and quite drastically different from the trend after it.
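The two filters above can be sketched in pandas as follows. The tiny dataframe is made up for illustration, and the `date` column name is assumed to match the competition's `train.csv`.

```python
import pandas as pd

# Toy stand-in for train.csv; 'date' and 'weight' are assumed column names.
train = pd.DataFrame({
    "date":   [0, 50, 85, 86, 120, 300],
    "weight": [0.0, 1.2, 0.5, 0.0, 3.1, 0.7],
    "resp":   [0.01, -0.02, 0.03, 0.00, -0.01, 0.02],
})

train = train[train["weight"] != 0]   # drop zero-weight trades
train = train[train["date"] >= 85]    # drop everything before day 85
train = train.reset_index(drop=True)

print(train)
```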

Note that we only have 130 features compared to over 2 million rows. We can easily add more features without running into the curse of dimensionality.

Let us apply a log transform and add the results as new columns to the dataframe. Since doing this on all features gives me an out-of-memory error, let's do it only on group_0, the features that have tag_0 in features.csv. For more information, check out my [EDA notebook](https://leejaeka.github.io/jaekangai/python/eda/jane%20street/kaggle/visualization/big%20data/2021/01/23/JaneStreet-Copy1.html).
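A minimal sketch of this step is shown below. Two things are assumptions: the toy `tags` frame only mimics the layout of features.csv, and the signed log `sign(x) * log(1 + |x|)` is swapped in because a plain log is undefined for negative feature values. The exact transform used in the original notebook is not shown here.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: 'tags' mimics features.csv (one row per feature, boolean tag columns).
tags = pd.DataFrame({
    "feature": ["feature_1", "feature_2", "feature_3"],
    "tag_0":   [True, False, True],
})
train = pd.DataFrame({
    "feature_1": [-2.0, 0.0, 3.0],
    "feature_2": [1.0, 2.0, 3.0],
    "feature_3": [0.5, -0.5, 10.0],
})

# group_0: the features flagged with tag_0.
group_0 = tags.loc[tags["tag_0"], "feature"].tolist()

# Signed log (assumption): keeps the sign and stays defined for negative inputs.
for col in group_0:
    train[f"log_{col}"] = np.sign(train[col]) * np.log1p(train[col].abs())

print(train.columns.tolist())
```

Restricting the loop to one tag group keeps the number of new columns small, which is the point when memory is the bottleneck.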

Other ideas for feature engineering:
1. aggregating categorical columns by 'tags' on features.csv
2. count above mean, mean abs change, abs energy
3. log transform, kurt transform and other transforms
4. get creative!
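For reference, the three statistics in idea 2 have simple definitions. The sketch below writes them as plain NumPy functions over a 1-D array; whether to apply them across a group of features or along time for a single feature is left open here.

```python
import numpy as np

def count_above_mean(x):
    """Number of values strictly greater than the mean of x."""
    x = np.asarray(x, dtype=float)
    return int((x > x.mean()).sum())

def mean_abs_change(x):
    """Mean absolute difference between consecutive values of x."""
    x = np.asarray(x, dtype=float)
    return float(np.abs(np.diff(x)).mean())

def abs_energy(x):
    """Sum of squared values of x."""
    x = np.asarray(x, dtype=float)
    return float((x ** 2).sum())

x = [1.0, 2.0, 3.0, 6.0]
print(count_above_mean(x), mean_abs_change(x), abs_energy(x))
```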

Reasons not to do more feature engineering:
1. We have no idea what the features represent, so engineered features might be meaningless or even harmful
2. The dataset is really big, so adding even a couple more columns makes me run out of memory
3. Computation becomes much slower

### Split data ✂️

We are going to use approximately 20000 rows as the test set. Our target value is action, which we have already defined as 1 for any trade where weight times resp is above 0 (positive trades).
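A minimal sketch of the target definition and the split, on made-up data. The feature names are placeholders, and `shuffle=False` (holding out the most recent rows) is an assumption, since it is not stated whether the original split was shuffled.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
train = pd.DataFrame({
    "feature_1": rng.normal(size=n),
    "feature_2": rng.normal(size=n),
    "weight": rng.uniform(0.1, 5.0, size=n),
    "resp": rng.normal(scale=0.01, size=n),
})

# action = 1 for positive trades (weight * resp > 0), else 0.
train["action"] = (train["weight"] * train["resp"] > 0).astype(int)

X = train[["feature_1", "feature_2"]]
y = train["action"]

# test_size can be an absolute row count: 200 here, about 20000 in the post.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=200, shuffle=False
)
print(X_train.shape, X_test.shape)
```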

### Random Forest Classifier

### Result 1

So we got about 52.4% accuracy with random forest. <br>
From the confusion matrix, we can tell that the model has a harder time predicting 0's correctly. It does a good job of classifying 1's, though! So with this model, we can expect to catch lots of good trades but also fail to avoid bad ones.
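The kind of evaluation described above can be sketched as follows. The data is synthetic with a deliberately weak signal, and `n_estimators=50` is an arbitrary choice, so the numbers will not match the results quoted here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
# Weak signal plus heavy noise, to mimic a hard-to-predict target.
y = (X[:, 0] + rng.normal(scale=2.0, size=600) > 0).astype(int)

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

acc = accuracy_score(y_test, pred)
# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
recall_0 = tn / (tn + fp)  # share of bad trades correctly skipped
recall_1 = tp / (tp + fn)  # share of good trades correctly taken

print(f"accuracy={acc:.3f} recall_0={recall_0:.3f} recall_1={recall_1:.3f}")
```

Comparing `recall_0` with `recall_1` is what shows whether the model is weaker on 0's than on 1's.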

### MLP 

A classic multilayer perceptron with AUC (Area Under the Curve) as the metric. After looking at many notebooks on Kaggle, MLPs seem to perform the best with a short run time. Let us build one ourselves.
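To give a feel for the setup, here is a small stand-in that trains an MLP and reports AUC. It uses scikit-learn's `MLPClassifier` rather than a Keras model to stay self-contained, and the data, layer sizes, and batch size are all made up for the example, so it is a sketch of the idea rather than the model behind the score above.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=2000) > 0).astype(int)
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

# Two hidden layers; sizes are hypothetical, not the ones used in this notebook.
mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    learning_rate_init=1e-3,
    batch_size=256,
    max_iter=50,       # upper bound on epochs for the default 'adam' solver
    random_state=0,
)
mlp.fit(X_train, y_train)

# AUC is computed from predicted probabilities, not from hard 0/1 labels.
proba = mlp.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print(f"test AUC = {auc:.3f}")
```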

### Result 2


This is actually good, although one could say that the machine is only doing slightly better than I would if I went to Jane Street and randomly decided to 'action' on trades. <br>



It is important to note that even though we are only getting around 55% accuracy, this is actually considered good for trading markets. Since Jane Street has billions in capital and makes a huge number of trades, individual losses matter little as long as the expected return is positive, because in the long run the gains outweigh the losses. It is like going to a casino knowing that you have a better chance of winning than losing: the more time you spend there, the more you gain!
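The casino argument can be checked with a quick simulation. The sketch below assumes, for simplicity, that every trade wins or loses exactly one unit with a 55% win probability; real trades differ in size, so this only illustrates why a small edge adds up over many trades.

```python
import numpy as np

rng = np.random.default_rng(0)
p_win = 0.55
n_trades = 100_000

# Each trade wins or loses one unit (simplifying assumption).
outcomes = np.where(rng.random(n_trades) < p_win, 1.0, -1.0)
cumulative = outcomes.cumsum()

# Expected gain per trade: p * (+1) + (1 - p) * (-1) = 2p - 1.
expected_per_trade = p_win * 1.0 + (1 - p_win) * -1.0

print(f"expected gain per trade: {expected_per_trade:.2f}")
print(f"simulated average gain:  {outcomes.mean():.3f}")
print(f"total after {n_trades} trades: {cumulative[-1]:.0f}")
```

Any short stretch of trades can still lose money, but the cumulative total drifts upward at roughly `2p - 1` units per trade.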

### Hyper-parameter tuning

Random search and grid search easily run out of memory.

From trial and error, I've learned that with a learning rate of 1e-3 and a batch_size of around 5000, the model overfits quickly, at around 10 epochs. However, the model wasn't able to learn much with fewer than 100 epochs. One solution is to add more layers and perceptrons, which is what I did, and Result 2 is the outcome of this manual hyperparameter tuning.
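Manual tuning of this kind can be organised as a simple loop that fits one configuration at a time and keeps only its validation score, which may be lighter on memory than an automated search that clones models and data. The sketch below uses scikit-learn's `MLPClassifier` on synthetic data; the candidate values are hypothetical.

```python
import itertools
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 8))
y = (X[:, 0] + X[:, 1] + rng.normal(size=1200) > 0).astype(int)
X_train, X_val, y_train, y_val = X[:900], X[900:], y[:900], y[900:]

# Hypothetical candidate values, tried one combination at a time.
grid = {
    "hidden_layer_sizes": [(32,), (64, 32)],
    "learning_rate_init": [1e-3, 1e-2],
}

results = []
for sizes, lr in itertools.product(grid["hidden_layer_sizes"],
                                   grid["learning_rate_init"]):
    model = MLPClassifier(hidden_layer_sizes=sizes, learning_rate_init=lr,
                          batch_size=128, max_iter=30, random_state=0)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    results.append({"hidden_layer_sizes": sizes,
                    "learning_rate_init": lr,
                    "val_auc": auc})
    del model  # keep only the score, not the fitted model

best = max(results, key=lambda r: r["val_auc"])
print(best)
```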

### Conclusion


For my final review and conclusion, check out my [blog post]()

Other things to try/explore:
1. Weighted training. We know that sometimes we will encounter 'monster' deals. It is crucial for the Kaggle competition to get these right, since they will probably outweigh most other trades. So we could build a model that focuses more on these heavy trades (rows with a high weight × resp).
2. Split the data and train multiple models. The idea is that we could split the data in two by feature_0, with one model optimized for the '1' rows and another for the '-1' rows.
3. Make many more features and explore more of the data (requires time and big data machines).
4. One interesting thing I learned is that, apparently, in finance it is sometimes good to heavily overfit the model. It has something to do with volatility. I've experimented with this, and indeed my utility score for the competition went up a lot when the model was heavily overfitted with over 200 epochs.
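As a starting point for idea 1, many scikit-learn estimators accept a `sample_weight` argument in `fit`. The sketch below uses `|weight * resp|` as the per-row importance, which is one possible choice and not something tested in this notebook, with a random forest on synthetic data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
weight = rng.uniform(0.1, 10.0, size=n)
resp = rng.normal(scale=0.01, size=n)
y = (weight * resp > 0).astype(int)

# Assumption: a trade's importance is the size of its weighted return.
sample_weight = np.abs(weight * resp)
sample_weight = sample_weight / sample_weight.mean()  # rescale to mean 1

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)  # heavy trades count more in each split

pred = clf.predict(X)
print(pred[:10])
```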

### Reference
[Imputing-missing-values](https://www.kaggle.com/louise2001/imputing-missing-values) <br>
[OWN Jane Street with Keras NN](https://www.kaggle.com/tarlannazarov/own-jane-street-with-keras-nn)