# Jane Street Market EDA
> Jane Street Market Prediction Kaggle Competition

- toc: true 
- badges: true
- comments: true
- author: Jaekang Lee
- categories: [python, EDA, Jane Street, Kaggle, Visualization, Big Data]

<img src="images/jane_logo.png">

Problem: Maximize profit by deciding, for each trading opportunity, whether to take it or pass, using 130 anonymized features.

### Import Library 📂

In [1]:
#!pip install datatable > /dev/null
import datatable as dt
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.gridspec as gridspec
import plotly.express as px
import plotly.graph_objects as go
from collections import defaultdict
# garbage collector to keep RAM in check
import gc 
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Import Data 📚

In [4]:
# df = pd.read_csv("../input/jane-street-market-prediction/train.csv")
df = dt.fread('../input/train.csv')
df = df.to_pandas()

In [7]:
feat = pd.read_csv("../input/features.csv")

In [6]:
#hide_input
print("df.shape: " + str(df.shape))
print("how many days? " + str(len(df.date.unique())) + "days")
df.head()

df.shape: (2390491, 138)
how many days? 500days


Unnamed: 0,date,weight,resp_1,resp_2,resp_3,resp_4,resp,feature_0,feature_1,feature_2,...,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129,ts_id
0,0,0.0,0.009916,0.014079,0.008773,0.00139,0.00627,1,-1.872746,-2.191242,...,,1.168391,8.313583,1.782433,14.018213,2.653056,12.600292,2.301488,11.445807,0
1,0,16.673515,-0.002828,-0.003226,-0.007319,-0.011114,-0.009792,-1,-1.349537,-1.704709,...,,-1.17885,1.777472,-0.915458,2.831612,-1.41701,2.297459,-1.304614,1.898684,1
2,0,0.0,0.025134,0.027607,0.033406,0.03438,0.02397,-1,0.81278,-0.256156,...,,6.115747,9.667908,5.542871,11.671595,7.281757,10.060014,6.638248,9.427299,2
3,0,0.0,-0.00473,-0.003273,-0.000461,-0.000476,-0.0032,-1,1.174378,0.34464,...,,2.838853,0.499251,3.033732,1.513488,4.397532,1.266037,3.856384,1.013469,3
4,0,0.138531,0.001252,0.002165,-0.001215,-0.006219,-0.002604,1,-3.172026,-3.093182,...,,0.34485,4.101145,0.614252,6.623456,0.800129,5.233243,0.362636,3.926633,4


In [7]:
df.describe()

Unnamed: 0,date,weight,resp_1,resp_2,resp_3,resp_4,resp,feature_0,feature_1,feature_2,...,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129,ts_id
count,2390491.0,2390491.0,2390491.0,2390491.0,2390491.0,2390491.0,2390491.0,2390491.0,2390491.0,2390491.0,...,2320637.0,2390268.0,2390268.0,2374408.0,2374408.0,2381638.0,2381638.0,2388570.0,2388570.0,2390491.0
mean,247.8668,3.031535,0.0001434969,0.0001980749,0.0002824183,0.0004350201,0.0004083113,0.009838565,0.3855776,0.3576875,...,0.2687757,0.3435523,0.2799973,0.3351537,0.2448752,0.3391778,0.2323809,0.3425608,0.2456182,1195245.0
std,152.2746,7.672794,0.008930163,0.01230236,0.01906882,0.03291224,0.02693609,0.9999518,2.559373,2.477335,...,2.174238,2.087842,1.977643,1.742587,2.242853,2.534498,1.795854,2.30713,1.765419,690075.5
min,0.0,0.0,-0.3675043,-0.5328334,-0.5681196,-0.5987447,-0.5493845,-1.0,-3.172026,-3.093182,...,-7.471971,-5.862979,-6.029281,-4.08072,-8.136407,-8.21505,-5.765982,-7.024909,-5.282181,0.0
25%,104.0,0.16174,-0.001859162,-0.002655044,-0.005030704,-0.009310415,-0.007157903,-1.0,-1.299334,-1.263628,...,-1.123252,-1.114326,-0.9512009,-0.913375,-1.212124,-1.452912,-0.899305,-1.278341,-0.8544535,597622.5
50%,254.0,0.708677,4.552665e-05,6.928179e-05,0.0001164734,0.0001222579,8.634997e-05,1.0,-1.870182e-05,-7.200577e-07,...,0.0,7.006233000000001e-17,6.054629000000001e-17,4.8708260000000006e-17,-2.558675e-16,1.015055e-16,5.4199200000000007e-17,8.563069000000001e-17,4.8695290000000005e-17,1195245.0
75%,382.0,2.471791,0.002097469,0.002939111,0.005466336,0.009804649,0.007544347,1.0,1.578417,1.526399,...,1.342829,1.405926,1.308625,1.228277,1.409687,1.767275,1.111491,1.582633,1.125321,1792868.0
max,499.0,167.2937,0.2453477,0.2949339,0.3265597,0.5113795,0.4484616,1.0,74.42989,148.0763,...,110.7771,48.12516,127.6908,65.14517,70.52807,58.72849,69.32221,51.19038,116.4568,2390490.0


In [8]:
feat.describe()

Unnamed: 0,feature,tag_0,tag_1,tag_2,tag_3,tag_4,tag_5,tag_6,tag_7,tag_8,...,tag_19,tag_20,tag_21,tag_22,tag_23,tag_24,tag_25,tag_26,tag_27,tag_28
count,130,130,130,130,130,130,130,130,130,130,...,130,130,130,130,130,130,130,130,130,130
unique,130,2,2,2,2,2,2,2,2,2,...,2,2,2,2,2,2,2,2,2,2
top,feature_72,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
freq,1,113,113,113,113,113,122,90,128,128,...,123,125,125,121,82,118,118,118,118,118


### Cleaning Data 🧹

In [9]:
#hide_input
has_nulls = set(df.columns[df.isnull().sum()!=0])
print("There are "+str(len(has_nulls))+" many cols with at least one null value")
print(has_nulls)

There are 88 many cols with at least one null value
{'feature_108', 'feature_91', 'feature_115', 'feature_128', 'feature_93', 'feature_33', 'feature_24', 'feature_4', 'feature_79', 'feature_28', 'feature_19', 'feature_88', 'feature_56', 'feature_117', 'feature_31', 'feature_21', 'feature_7', 'feature_94', 'feature_16', 'feature_76', 'feature_96', 'feature_12', 'feature_55', 'feature_29', 'feature_120', 'feature_35', 'feature_124', 'feature_32', 'feature_74', 'feature_17', 'feature_116', 'feature_97', 'feature_86', 'feature_105', 'feature_127', 'feature_36', 'feature_99', 'feature_34', 'feature_104', 'feature_10', 'feature_100', 'feature_58', 'feature_87', 'feature_111', 'feature_122', 'feature_80', 'feature_78', 'feature_25', 'feature_18', 'feature_59', 'feature_26', 'feature_73', 'feature_92', 'feature_15', 'feature_81', 'feature_27', 'feature_13', 'feature_112', 'feature_109', 'feature_125', 'feature_3', 'feature_98', 'feature_82', 'feature_84', 'feature_45', 'feature_90', 'feature_9

Many of the histograms for columns with null values show extreme outliers, so the median is a safer fill value than the mean. The other imputation methods considered were the mean and KNN imputation. Check out my other notebook, where KNN imputation was used to train an MLP.
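To see why outliers matter for the fill value, here is a minimal sketch on a made-up column (the values are hypothetical, not taken from train.csv). A single extreme value drags the mean far away from the bulk of the data, while the median barely moves.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one extreme outlier and one missing value.
col = pd.Series([-1.0, 0.0, 1.0, 100.0, np.nan])

mean_fill = col.fillna(col.mean())      # fill value is pulled toward the outlier
median_fill = col.fillna(col.median())  # fill value stays near the bulk of the data

print("mean-imputed value:  ", mean_fill.iloc[-1])
print("median-imputed value:", median_fill.iloc[-1])
```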


If we simply dropped every row that contains a NaN, we would be removing more than 16.54% of the dataset.
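A figure like this can be checked by measuring the share of rows that contain at least one NaN. The sketch below uses a tiny made-up frame; the helper name `share_of_rows_with_nan` and the `toy` data are illustrative only and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def share_of_rows_with_nan(frame: pd.DataFrame) -> float:
    """Fraction of rows that frame.dropna() would remove."""
    return float(frame.isna().any(axis=1).mean())

# Toy stand-in for the training data (hypothetical values).
toy = pd.DataFrame({
    "feature_1": [0.1, np.nan, 0.3, 0.4],
    "feature_2": [1.0, 2.0, np.nan, 4.0],
    "feature_3": [5.0, 6.0, 7.0, 8.0],
})

lost = share_of_rows_with_nan(toy)
print(f"dropna() would remove {lost:.2%} of the rows")
```

On the real data the same call would be `share_of_rows_with_nan(df)`, run before any imputation.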

In [None]:
#hide_input
df = df.apply(lambda x: x.fillna(x.median()),axis=0)
print("Number of features with null values after median imputation: ",np.sum(df.isna().sum()>0))

Interesting points so far:
- feature_0 is binary.
- Many features seem to be normally distributed.
- There are a lot of missing values.

### Plots & Visualization 📊

#### resp, resp_1, resp_2, resp_3, resp_4

We can see that resp is closely related to resp_4 (blue and purple). Resp_1 and resp_2 also seem to be closely related to each other, but they are much more linear. Resp_3 seems to sit in the middle: its shape is closer to the upper group, but its position is slightly closer to green and orange.
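The comparison above can be reproduced roughly as follows. This is a sketch that assumes the plot shows the running total of each resp column in row order; the `toy` values are made up and `cumulative_resp` is a hypothetical helper name.

```python
import pandas as pd

RESP_COLS = ["resp_1", "resp_2", "resp_3", "resp_4", "resp"]

def cumulative_resp(frame: pd.DataFrame) -> pd.DataFrame:
    """Running total of each resp column, in row (time) order."""
    return frame[RESP_COLS].cumsum()

# Toy rows with the same column names as train.csv (values are made up).
toy = pd.DataFrame({
    "resp_1": [0.001, -0.002, 0.003],
    "resp_2": [0.002, -0.003, 0.004],
    "resp_3": [0.004, -0.006, 0.008],
    "resp_4": [0.010, -0.012, 0.015],
    "resp":   [0.008, -0.010, 0.012],
})

curves = cumulative_resp(toy)
print(curves)

# Pairwise correlation is another quick way to see which columns move together.
print(toy[RESP_COLS].corr().round(2))
```

Calling `curves.plot()` on the result would draw one line per resp column.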

#### Weights

Note: **weight** and **resp** multiplied together represent the return on a trade.

We can see that most weights are around 0.2, with two 'peaks' at roughly 0.2 and 0.3. Note that the maximum weight, 167.29, is represented by 1.0 on the x-axis. Thus 0.2 represents around 33.458 and 0.3 represents around 50.187.
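The conversion from the 0 to 1 axis back to raw weights is a simple rescaling. A small sketch, assuming the axis was min-max scaled with a minimum weight of 0 (the helper `to_raw_weight` is a hypothetical name):

```python
MAX_WEIGHT = 167.29  # maximum weight quoted in the text above

def to_raw_weight(scaled: float, max_weight: float = MAX_WEIGHT) -> float:
    """Map a position on the 0-1 x-axis back to a raw weight.

    Assumes min-max scaling with a minimum weight of 0.
    """
    return scaled * max_weight

for peak in (0.2, 0.3):
    print(f"{peak} on the x-axis -> raw weight of about {to_raw_weight(peak):.3f}")
```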

Note that the graph plots all the positive gains (the 1's in our action column). We can see that there were 'bigger' gains in the beginning and that the gains become smaller as time approaches day 500. In conclusion, the earlier trades are much bigger, but we don't know what they will look like in the competition test set.
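For reference, the positive-gain view could be built along these lines. This is a sketch under one assumption that is mine rather than the notebook's: a row gets action = 1 when its resp is positive. The `toy` frame is made up.

```python
import pandas as pd

# Toy rows with the same column names as train.csv (values are made up).
toy = pd.DataFrame({
    "date":   [0, 0, 1, 1],
    "weight": [0.0, 16.0, 0.5, 2.0],
    "resp":   [0.006, -0.01, 0.02, 0.004],
})

# Assumption: a trade is labelled action = 1 when its resp is positive.
toy["action"] = (toy["resp"] > 0).astype(int)

# Return on each trade is weight * resp.
toy["trade_return"] = toy["weight"] * toy["resp"]

# Keep the positive-gain rows only and total them per day.
daily_gain = toy.loc[toy["action"] == 1].groupby("date")["trade_return"].sum()
print(daily_gain)
```

Note that a row with weight 0 contributes nothing to the total, even when its resp is positive.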

We would probably want to put more 'weight' on trades with a bigger 'resp' (return). What we learn here is that high weights only occur when resp is close to 0. In other words, it is unwise to trade if resp is far from 0, but it is safe to invest, even heavily, if resp is near 0.

In the Kaggle community, there has been a lot of discussion about how the trends changed significantly around day ~85. We can see many more trades happening before day 100. The remaining days are still very active but not as noisy. This suggests that Jane Street changed its trading model, as discussed [here](https://www.kaggle.com/c/jane-street-market-prediction/discussion/201930) by [Carl](https://www.kaggle.com/carlmcbrideellis).
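One way to check for such a shift is to count rows per day and compare the average before and after a cutoff. A minimal sketch on made-up data; `trades_per_day` is a hypothetical helper, and the cutoff of 2 only fits the toy frame (on the real data something near 85 would be the natural choice).

```python
import pandas as pd

def trades_per_day(frame: pd.DataFrame) -> pd.Series:
    """Number of rows (trading opportunities) recorded on each date."""
    return frame.groupby("date").size()

# Toy stand-in: dates 0-2 are busy, dates 3-4 are quiet (made-up data).
toy = pd.DataFrame({"date": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 4]})

counts = trades_per_day(toy)
cutoff = 2  # hypothetical cutoff for the toy data
early_mean = counts[counts.index <= cutoff].mean()
late_mean = counts[counts.index > cutoff].mean()
print(f"mean trades/day up to day {cutoff}: {early_mean:.1f}, after: {late_mean:.1f}")
```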

Let us look at the most important feature, 'feature_0'.

In [14]:
df['feature_0'].value_counts()

 1    1207005
-1    1183486
Name: feature_0, dtype: int64

Interestingly, when feature_0 is 1 the plot shows a negative slope, whereas when feature_0 is -1 the plot shows a positive slope. My guess is that feature_0 corresponds to Buy (1) and Sell (-1), or vice versa. Under the second reading, setting action to 1 when feature_0 = 1 means we are selling, and setting action to 1 when feature_0 = -1 means we are buying. This makes sense, since we can lose or gain money whether we are buying or selling.
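The per-group curves can be computed with a grouped cumulative sum. The sketch below uses toy values that I made up to mimic the opposite slopes, so it only illustrates the mechanics and is not evidence about the real data.

```python
import pandas as pd

# Toy rows (made-up values chosen so the two groups drift in opposite directions).
toy = pd.DataFrame({
    "feature_0": [1, -1, 1, -1, 1, -1],
    "resp":      [-0.01, 0.02, -0.02, 0.01, -0.01, 0.03],
})

# Cumulative resp computed separately for each value of feature_0.
toy["cum_resp"] = toy.groupby("feature_0")["resp"].cumsum()

# Final value of each curve: its sign tells us the overall direction.
final = toy.groupby("feature_0")["cum_resp"].last()
print(final)
```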

#### Features

Remember that we have another file called features.csv, which can help us understand the 100+ features and maybe cluster them into groups. Let's take a look. <br>
<img src="images/tags.png">

In [1]:
#hide_input
# fig = px.bar(feat.set_index('feature').T.sum(), title='Number of tags for each feature')

# fig.layout.xaxis.tickangle = 300
# fig.update_traces( showlegend = False)
# fig.layout.xaxis.dtick = 5
# fig.layout.xaxis.title = ''
# fig.layout.yaxis.title = ''
# fig.show()

Let us see what the tag_0 group tells us.

Correlation between the features tagged with tag_0. There certainly is correlation between most elements of the group, with a few exceptions.
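A correlation matrix for one tag group can be built by picking the features whose tag is True and calling `corr()` on those columns. A sketch on toy stand-ins for features.csv and train.csv; `features_with_tag` is a hypothetical helper and all values are made up.

```python
import pandas as pd

def features_with_tag(feat: pd.DataFrame, tag: str) -> list:
    """Names of the features whose `tag` column is True."""
    return feat.loc[feat[tag], "feature"].tolist()

# Toy stand-ins for features.csv and train.csv (made-up values).
toy_feat = pd.DataFrame({
    "feature": ["feature_1", "feature_2", "feature_3"],
    "tag_0":   [True, False, True],
})
toy_df = pd.DataFrame({
    "feature_1": [1.0, 2.0, 3.0, 4.0],
    "feature_2": [4.0, 1.0, 3.0, 2.0],
    "feature_3": [2.0, 4.0, 6.0, 8.0],
})

tag0_cols = features_with_tag(toy_feat, "tag_0")
corr = toy_df[tag0_cols].corr()
print(corr)
```

The resulting matrix can be passed to `sns.heatmap` for a visual check.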

Interesting points:
- feature_0 has no tags
- features 79 to 119 all have 4 tags
- features 7 to 36 alternate periodically between 3 and 4 tags
- A similar trend appears in features 2 to 7, 37 to 40, and 120 to 129
- tag_n doesn't tell us much about the features
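The tag counts behind these points come from a row-wise sum over the boolean tag columns. A minimal sketch on a made-up three-feature table with the same layout as features.csv:

```python
import pandas as pd

# Toy stand-in for features.csv (made-up tags).
toy_feat = pd.DataFrame({
    "feature": ["feature_0", "feature_1", "feature_2"],
    "tag_0":   [False, True, True],
    "tag_1":   [False, False, True],
    "tag_2":   [False, True, True],
})

# Booleans sum as 0/1, so a row-wise sum gives the number of tags per feature.
tag_counts = toy_feat.set_index("feature").sum(axis=1)
print(tag_counts)
```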

### Reference 📖
- [Jane Street: EDA of day 0 and feature importance](https://www.kaggle.com/carlmcbrideellis/jane-street-eda-of-day-0-and-feature-importance)
- [Jane_street_Extensive_EDA & PCA starter 📊⚡](https://www.kaggle.com/muhammadmelsherbini/jane-street-extensive-eda-pca-starter)
- [EDA / A Quant's Prespective](https://www.kaggle.com/hamzashabbirbhatti/eda-a-quant-s-prespective#Weight)

### Submission

That lives in another notebook. Thoughts going into the prediction phase:
1. Days before ~100 can be dropped on suspicion of a model shift.
2. feature_0 seems very important for determining the slope of cumulative resp.
3. Resp near 0 is preferred over other values.
4. A lot of features are normally distributed.
5. We have over 2 million rows, so it should be safe to add many more features (feature engineering).
6. There are a lot of missing values too. We can try mean, median, or KNN imputation.
7. Note that although this is time-series-like data, we can only predict with features 0 to 129.
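As a starting point for that notebook, points 1 and 7 could look something like the sketch below. It rests on assumptions of mine: the cutoff is exactly day 100, and the target is action = 1 when resp is positive. `prepare_training_frame` is a hypothetical helper and the `toy` rows are made up.

```python
import pandas as pd

def prepare_training_frame(frame: pd.DataFrame, first_day: int = 100) -> pd.DataFrame:
    """Drop days before `first_day` and add a binary target.

    Assumption: action = 1 when resp is positive.
    """
    out = frame.loc[frame["date"] >= first_day].copy()
    out["action"] = (out["resp"] > 0).astype(int)
    return out

# Toy rows with a few of the real column names (values are made up).
toy = pd.DataFrame({
    "date":      [50, 99, 100, 250],
    "resp":      [0.01, -0.02, 0.03, -0.01],
    "feature_0": [1, -1, 1, -1],
})

train = prepare_training_frame(toy)

# Only the feature_* columns are available as model inputs.
feature_cols = [c for c in train.columns if c.startswith("feature_")]
print(train[feature_cols + ["action"]])
```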
