# EDA IBM stock 1min ticks

## Takeaways - December 29, 2019
* The distribution of the first differences of the time series has very long tails (see the histogram plot). This was expected, but it is still ugly.
* Created a very clean dataset with 'pastValues', 'currentValue', 'futureValue', 'deltaMinutes'.
* The precision-recall curve that we already have looks quite good, but the model behind it was trained only once.
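
One way to put numbers on the long tails is to compare the extreme quantiles with the interquartile range and to compute the excess kurtosis. The sketch below runs on synthetic Student-t draws as a stand-in for `ds.Open.diff()`; the helper name `tail_summary` and the 1%/99% cutoffs are illustrative assumptions, not part of this notebook.

```python
import numpy as np
import pandas as pd


def tail_summary(diffs, lower=0.01, upper=0.99):
    """Summarize how heavy the tails of a series of differences are."""
    diffs = pd.Series(diffs).dropna()
    q_low, q25, q75, q_high = diffs.quantile([lower, 0.25, 0.75, upper])
    # share of observations lying outside the [lower, upper] quantile band
    outside = ((diffs < q_low) | (diffs > q_high)).mean()
    return {
        "iqr": q75 - q25,
        "central_range": q_high - q_low,
        "full_range": diffs.max() - diffs.min(),
        "excess_kurtosis": diffs.kurt(),  # 0 for a normal distribution
        "fraction_outside": outside,
    }


# synthetic heavy-tailed data (assumption: stands in for the real differences)
rng = np.random.default_rng(0)
synthetic_diffs = pd.Series(rng.standard_t(df=3, size=10_000))
summary = tail_summary(synthetic_diffs)
print(summary)
```

A `full_range` much larger than `central_range` means a few extreme moves stretch the histogram axis, so plotting only the values inside the quantile band (and reporting the rest separately) may make the bulk of the distribution visible.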

## Takeaways - December 30, 2019
* Plot a precision-recall curve with several (monthly/weekly) training batches.

## To do

* There is a delicate issue with rescaling before training that needs to be fixed ASAP.
* Turn the code that produces the 'clean' dataset into something more reusable.
* The current version of the target is not very realistic.
* Need a more elaborate way of looking at the histograms: the tails hide everything else, and they are a major concern.
* Look for sklearn methods to do cross-validation in our setting: do not reinvent the wheel.
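
For the cross-validation item, scikit-learn ships `TimeSeriesSplit`, which always trains on earlier rows and validates on later ones, and it can be passed as `cv` to `GridSearchCV`. A minimal sketch on synthetic data follows; the features, labels, and the small parameter grid are placeholders, not the setup used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# synthetic, time-ordered data (assumption: rows are sorted by timestamp)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=300) > 0

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X_demo):
    # every training row precedes every validation row
    assert train_idx.max() < test_idx.min()

grid_demo = GridSearchCV(
    RandomForestClassifier(n_estimators=10, class_weight='balanced', random_state=0),
    {'max_depth': [3, 5]},
    cv=tscv,  # replaces a plain integer such as cv=2
    scoring='average_precision',
)
grid_demo.fit(X_demo, y_demo)
print(grid_demo.best_params_)
```

With an integer `cv`, a classifier grid search uses stratified folds, so a fold can be trained on rows that come after the rows it is validated on; the splitter above avoids that.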

## Questions for Jake
* Data provider (currently using sample (adjusted) data from Kibot): he uses Polygon.
* How can I automate placing orders? Is there an API? How does this even work in real life? He mentioned the 'efficient frontier'.
* Cost per order (0.5 cents per share or 2 dollars per trade).
* Latency issues to be aware of.
* How, and at what point, can we know if we are `moving the market` too much? In the afternoon there is very little volume.
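
The two fee schedules in the cost question cross at a fixed order size: 0.5 cents per share equals 2 dollars per trade at 400 shares. A small sketch of that arithmetic follows; the function names are illustrative, and the sketch ignores minimum fees, exchange fees, and anything else a broker may charge.

```python
# fees expressed in cents so the arithmetic stays exact
PER_SHARE_FEE_CENTS = 0.5   # 0.5 cents per share
FLAT_FEE_CENTS = 200.0      # 2 dollars per trade


def per_share_cost_cents(shares):
    """Cost of one order under the per-share schedule."""
    return shares * PER_SHARE_FEE_CENTS


def flat_cost_cents(shares):
    """Cost of one order under the flat schedule (independent of size)."""
    return FLAT_FEE_CENTS


# order size at which both schedules cost the same
break_even_shares = FLAT_FEE_CENTS / PER_SHARE_FEE_CENTS

for shares in (100, 400, 1000):
    print(shares, per_share_cost_cents(shares), flat_cost_cents(shares))
```

Below the break-even size the per-share schedule is cheaper; above it the flat fee is.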


## Notes from Jake
* Tick data might be more useful for quant analysis.
* polygon.io: source of data.
* thinkorswim.com: 2 dollars per trade.
* Interactive Brokers: half a cent per share.
* Kelly criterion?
* IBridgePy: take Quantopian to real life.
* zipline: Quantopian type of thing.
* Efficient frontier: for blending the strategies.
* kygo: his thing.
* Use Quantopian!
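
The Kelly criterion note presumably refers to the Kelly position-sizing rule. In its textbook form for a repeated binary bet with win probability `p` and payoff ratio `b` (profit per unit staked on a win, the full unit lost otherwise), the fraction of capital to stake is `f = p - (1 - p) / b`. The sketch below implements only that simplified form; how to map it to real trades (estimating `p` and `b`, accounting for costs) is not covered by these notes and is left open.

```python
def kelly_fraction(p, b):
    """Textbook Kelly fraction for a binary bet.

    p: probability of winning.
    b: profit per unit staked on a win (a loss costs the full unit staked).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if b <= 0:
        raise ValueError("b must be positive")
    f = p - (1.0 - p) / b
    return max(f, 0.0)  # a negative edge means: do not bet


print(kelly_fraction(0.55, 1.0))  # small positive fraction, roughly 0.10
print(kelly_fraction(0.40, 1.0))  # no edge, so the fraction is clipped to 0
```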

In [7]:
cd ~/Desktop/MyProjects/moneyManager/

/Users/lduque/Desktop/MyProjects/moneyManager


In [8]:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn.metrics import plot_precision_recall_curve
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

In [12]:
# only loading some rows
names = ['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume']
df = pd.read_csv('data/sampleKibotData/minuteIntraday/IBM_adjusted.txt', header=None, names=names, nrows=2000)
# combine the date and time columns into a single timestamp
df['DateTime'] = pd.to_datetime(df.Date + ' ' + df.Time)
ds = df.drop(columns=['Date', 'Time'])

In [13]:
ds

Unnamed: 0,Open,High,Low,Close,Volume,DateTime
0,33.69,33.69,33.69,33.69,207820,1998-01-02 09:30:00
1,33.65,33.69,33.65,33.65,33499,1998-01-02 09:31:00
2,33.67,33.69,33.65,33.69,41254,1998-01-02 09:32:00
3,33.67,33.69,33.65,33.65,52110,1998-01-02 09:33:00
4,33.65,33.69,33.65,33.65,14892,1998-01-02 09:34:00
...,...,...,...,...,...,...
1995,33.19,33.21,33.17,33.17,29467,1998-01-09 10:49:00
1996,33.17,33.17,33.15,33.15,64210,1998-01-09 10:50:00
1997,33.15,33.17,33.15,33.17,19541,1998-01-09 10:51:00
1998,33.17,33.17,33.15,33.15,3722,1998-01-09 10:52:00


In [None]:
ds.Open.plot(figsize=(15,5), title='IBM stock');

In [None]:
ds.head()

In [None]:
ds.DateTime.apply(lambda x: x.year).value_counts().sort_index().plot.bar(figsize=(15, 5), title='ticks per year');

In [None]:
ds.Open.diff().plot.hist(bins=200, figsize=(15,5), title= 'distribution of the differential');

In [14]:
def pivotTimeSeries(ds):
    pastWindow = 5
    futureWindow = 10
    
    dsTimeStamp = ds.DateTime
    pastLow = pd.concat([ds.Low.shift(i) for i in range(pastWindow, 0, -1)], axis=1).apply(lambda x: list(x), axis=1)
    pastHigh = pd.concat([ds.High.shift(i) for i in range(pastWindow, 0, -1)], axis=1).apply(lambda x: list(x), axis=1)
    pastVolume = pd.concat([ds.Volume.shift(i) for i in range(pastWindow, 0, -1)], axis=1).apply(lambda x: list(x), axis=1)
    
    currentValue = ds.Close
    currentVolume = ds.Volume
    
    futureHigh = pd.concat([ds.High.shift(i) for i in range(-1, -futureWindow-1, -1)], axis=1).apply(lambda x: max(list(x)), axis=1)
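    # total_seconds() includes whole days; .seconds alone would ignore them and could let a window spanning a multi-day gap pass the filter below
    deltaMinutes = (ds.DateTime.shift(-futureWindow) - ds.DateTime.shift(pastWindow)).dt.total_seconds()//60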
    
    dg = pd.concat([dsTimeStamp, pastLow, pastHigh, pastVolume, currentValue, currentVolume, futureHigh, deltaMinutes], axis=1)
    dg.columns = ['DateTime', 'pastLow', 'pastHigh', 'pastVolume', 'currentValue', 'currentVolume', 'futureHigh', 'deltaMinutes']
    dg['target']= dg.futureHigh>dg.currentValue
    dg = dg.drop(columns='futureHigh')
    dg = dg[dg.deltaMinutes==(pastWindow+futureWindow)].drop(columns='deltaMinutes')
    dg = dg.set_index('DateTime')
    return dg

In [15]:
def produceFeatures(dg):
    # rescale pastLow, pastHigh (using currentValue), remove currentValue
    scaledPastLow = dg.apply(lambda x: np.array(x.pastLow)/x.currentValue, axis=1)
    scaledPastHigh = dg.apply(lambda x: np.array(x.pastHigh)/x.currentValue, axis=1)
    # past volumes are rescaled by the current volume
    scaledPastVolume = dg.apply(lambda x: np.array(x.pastVolume)/x.currentVolume, axis=1)
    
    # summary statistics of each rescaled window
    pastLowFeatures = scaledPastLow.apply(listToFeatures)
    pastHighFeatures = scaledPastHigh.apply(listToFeatures)
    pastVolumeFeatures = scaledPastVolume.apply(listToFeatures)
    
    W = (pastLowFeatures + pastHighFeatures + pastVolumeFeatures).apply(lambda x: pd.Series(x))
    
    # consider rescaling only at the very end?
    # rescale pastVolume (using currentVolume) remove currentVolume
    return W
    
    
def listToFeatures(x):
    L = list(x)
    features = [max(L), min(L), np.mean(L), np.std(L), np.median(L)]
    return features
    

In [None]:
produceFeatures(dg).describe()

In [None]:
L = [3,4,5,6]

In [None]:
np.array(L)/4

In [None]:
dg = pivotTimeSeries(ds)

In [None]:
dg.head()

In [None]:
dg.target.value_counts().plot.bar()

In [None]:
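# NOTE: this cell expects a 'pastValues' column, which pivotTimeSeries as defined above does not produce
# (it emits pastLow, pastHigh, pastVolume instead)
X = pd.concat([dg.pastValues.apply(lambda x:pd.Series(x)), dg.currentValue, dg.target], axis=1)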
y = X.pop('target')

In [None]:
# rescale every column by currentValue, then drop currentValue (constant 1 after rescaling)
X = X.div(X.currentValue, axis=0)
X = X.drop(columns='currentValue')

## A precision-recall curve with only one training split

In [None]:
# temporal split of train/test
trainPercentage = 20
testBegins = (len(X)*trainPercentage)//100
Xtrain, ytrain = X[:testBegins], y[:testBegins]
Xtest, ytest = X[testBegins:], y[testBegins:]

In [None]:
parameters = {
    'min_samples_leaf' : [10, 20, 100],
    'max_depth': [5, 10, 20],
    'n_estimators': [10, 20, 100],   
    'max_features': ['sqrt']
}

regr = RandomForestClassifier(class_weight='balanced')
grid = GridSearchCV(regr, parameters, cv=2, scoring='average_precision')

In [None]:
grid.fit(Xtrain, ytrain)
model = grid.best_estimator_

In [None]:
base_rate = sum(ytest)/len(ytest)
ypredicted=model.predict_proba(Xtest)[:,1]
average_precision = average_precision_score(ytest, ypredicted)
disp = plot_precision_recall_curve(model, Xtest, ytest)
disp.ax_.set_title('2-class Precision-Recall curve: AP={0:0.2f}'.format(average_precision))
plt.plot([0, 1], [base_rate, base_rate]);

## A precision-recall curve with multiple training batches

In [None]:
X = pd.concat([dg.pastValues.apply(lambda x:pd.Series(x)), dg.currentValue, dg.target], axis=1)
firstTimeStamp=X.index[0]
first_day = datetime(firstTimeStamp.year, firstTimeStamp.month, firstTimeStamp.day)
X['week']=pd.Series(X.index).apply(lambda x: (datetime(x.year, x.month, x.day)-first_day).days//7).values

In [None]:
# note that week zero has no model: there is no earlier week to train on
trainSets = [X[X.week==w] for w in X.week.unique()]
trainTestBarches = [(None, None, None, None)]+[(trainSets[i].drop(columns='target'),trainSets[i].target,trainSets[i+1].drop(columns='target'),trainSets[i+1].target) for i in range(len(trainSets)-1)]
models = [None] + [GridSearchCV(regr, parameters, cv=2, scoring='average_precision') for _ in range(len(trainSets)-1)]

In [None]:
Xtrain, ytrain, Xtest, ytest = trainTestBarches[1]

In [None]:
Xtrain # note: 'week' should not be a model feature, but it's OK for now

In [None]:
for i in range(1, len(models)):
    print(i)
    Xtrain, ytrain, Xtest, ytest = trainTestBarches[i]
    models[i].fit(Xtrain.div(Xtrain.currentValue, axis=0), ytrain)

In [None]:
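# this function is worth saving somewhere!
def indexedModelEvaluation(x):
    # score one row with the model stored for that row's week
    row = x.copy()
    week = int(row['week'])  # cast so the value is always a valid list index
    row = row.drop('target')
    row = [list(row/row.currentValue)]  # same rescaling as in training
    model = models[week]
    # week zero has no model, so its rows get NaN
    return np.nan if model is None else model.predict_proba(row)[0][1]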

In [None]:
ypredicted = X.apply(indexedModelEvaluation,axis=1)

In [None]:
X['predicted']=ypredicted

In [None]:
dh = X[['predicted', 'target']].copy()

In [None]:
dh = dh[dh.predicted.notna()] 

In [None]:
ytest = dh.target
ypredicted = dh.predicted
base_rate = sum(ytest)/len(ytest)
average_precision = average_precision_score(ytest, ypredicted)
precision, recall, _ = precision_recall_curve(ytest, ypredicted)

In [None]:
plt.plot(recall, precision)
plt.plot([0, 1], [base_rate, base_rate]);