# Data Science on Bitcoin Trading

## Introduction

Bitcoin is a new and fast-growing currency, and several markets now trade it. Compared with stocks, index funds, and futures, the Bitcoin market more closely approximates a free market and is considered less complex to analyze. This project therefore applies data science to Bitcoin trading, with the goal of building an automated trading algorithm that makes money on its own. More importantly, we want to investigate the factors that influence the Bitcoin market and uncover some of its characteristics from a data science perspective.

## Contents of This Notebook
* Getting Raw Data
* First Try: Applying Linear Regression
* For Trading Decision: Applying Linear Classification
* For Price Change Prediction: Applying Bayesian Regression for "latent source model"

### Getting Raw Data

In [5]:
import requests
import json
import numpy as np
import pandas as pd
import datetime as dtime

We obtained the data from https://bitcoincharts.com/, which provides a CSV file of roughly 500 MB recording every transaction over the past two years. We preprocessed the raw data and wrote the result to a file to make later analysis easier.

In [27]:
filename = "data/coinbaseUSD.csv"
df = pd.read_csv(filename)

In [56]:
df['Time'] = df['Time'].apply(lambda x : dtime.datetime.fromtimestamp(float(x)))

In [57]:
df = df.groupby(['Time']).agg({'Price':'mean', 'Quantity':'sum'})

In [58]:
df.to_csv("processed.csv")

In [6]:
df = pd.read_csv("processed.csv")
# convert string to datetime
df['Time'] = pd.to_datetime(df['Time'])

One non-trivial step above is merging transactions that occur at nearly the same time (trades sharing a timestamp are averaged in price and summed in quantity), because we do not need the exact time of each individual transaction to millisecond precision.
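
To make the effect of this merge concrete, here is a minimal sketch on a hypothetical three-row table (the values are invented for illustration). It repeats the `groupby`/`agg` step used above and, as an alternative the notebook does not use, also shows a volume-weighted price for each timestamp.

```python
import pandas as pd

# Hypothetical toy trades: the first two share a timestamp and get merged.
trades = pd.DataFrame({
    "Time": [1463702400, 1463702400, 1463702401],
    "Price": [440.0, 442.0, 441.0],
    "Quantity": [0.5, 1.5, 2.0],
})

# Same aggregation as in the notebook: mean price, total quantity.
merged = trades.groupby("Time").agg({"Price": "mean", "Quantity": "sum"})
print(merged)

# Alternative (not used in the notebook): volume-weighted average price.
trades["Notional"] = trades["Price"] * trades["Quantity"]
grouped = trades.groupby("Time")
vwap = grouped["Notional"].sum() / grouped["Quantity"].sum()
print(vwap)
```

The plain mean treats a 0.5 BTC trade and a 1.5 BTC trade equally, while the volume-weighted version leans toward the larger trade; which is preferable depends on how the price is used downstream.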

### First Try: Applying Linear Regression
The first thing we tried was basic linear regression. For each sample, the features are the price differences relative to the current price at fixed lookback points (10 s, 20 s, 30 s, ... up to 590 s before), and the label is the maximum price between 30 and 90 seconds ahead, also relative to the current price. If an effective model exists for this data format, we can build our automated trading algorithm on it.
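
Because trades arrive at irregular times, the price at an exact lookback point usually has to be estimated. The helper below does this by linear interpolation with `np.interp`; the following is a minimal sketch of that idea on hypothetical data, with only three lookback points instead of the full set.

```python
import numpy as np

# Hypothetical trades: time offsets in seconds relative to "now"
# (negative = past) and the traded prices. Offsets must be increasing.
trade_offsets = np.array([-40.0, -20.0, 0.0])
trade_prices = np.array([440.0, 442.0, 441.0])

lookbacks = np.array([10, 20, 30])  # seconds before now

# Interpolated price now and at each lookback point.
current_price = np.interp(0.0, trade_offsets, trade_prices)
past_prices = np.interp(-lookbacks, trade_offsets, trade_prices)

# Features are price differences relative to the current price.
features = past_prices - current_price
print(features)
```

Note that `np.interp` clamps to the first or last known price outside the observed range, so lookback points older than the earliest trade in the segment simply repeat that trade's price.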

In [7]:
# predict the max price 30-90 seconds ahead, relative to the current price

def extract_feature(current_time, dataframe):
    # lookback offsets in seconds: 10, 20, ..., 590
    data_points = list(range(10, 600, 10))
    # volume windows in seconds (only used if consider_Q() is called)
    Q_points = [100, 200, 500]
    predict_interval = (30, 90)

    lookback = dtime.timedelta(seconds=max(data_points))
    times = dataframe['Time']

    # positional indices of the rows closest to the window boundaries
    end_i = int(np.argmin(np.abs(times - current_time).values)) + 1
    start_i = int(np.argmin(np.abs(times - (current_time - lookback)).values)) - 1
    start_i = max(start_i, 0)  # avoid a negative index wrapping around

    predict_start = current_time + dtime.timedelta(seconds=predict_interval[0])
    predict_end = current_time + dtime.timedelta(seconds=predict_interval[1])
    predict_slice = dataframe[(times >= predict_start) & (times <= predict_end)]

    def is_good_sample():
        y_threshold = 2    # minimum number of trades in the prediction window
        x_threshold = 180  # max gap (seconds) between lookback start and nearest trade
        res_count = len(predict_slice)
        x_difference = np.abs((times.iloc[start_i] -
                               (current_time - lookback)).total_seconds())
        return res_count >= y_threshold and x_difference <= x_threshold

    if not is_good_sample():
        return None

    segment = dataframe.iloc[start_i:end_i]

    def time_to_float(x):
        # seconds relative to current_time (negative means in the past)
        return (x - current_time).total_seconds()

    x_list = segment['Time'].apply(time_to_float).tolist()
    y_list = segment['Price'].tolist()

    # linearly interpolate the price at "now" and at each lookback point
    current_price = np.interp(0, x_list, y_list)
    result_prices = [np.interp(-x, x_list, y_list) - current_price
                     for x in data_points]

    result_Q = []

    def consider_Q():
        # traded volume in consecutive windows going back from current_time
        last_time = current_time
        for x in Q_points:
            new_time = last_time - dtime.timedelta(seconds=x)
            window = dataframe[(times > new_time) & (times <= last_time)]
            result_Q.append(window['Quantity'].sum())
            last_time = new_time

    # consider_Q() is not called, so result_Q stays empty and
    # the features are price differences only
    X = result_prices + result_Q
    Y = np.max(predict_slice['Price']) - current_price

    return (X, Y)

In [8]:
# test on a particular example
extract_feature(dtime.datetime(2016,5,20),df)

([0.046895424666672625,
  0.09379084933334525,
  0.12605228709998073,
  0.026607842766679823,
  -0.07283660156667793,
  -0.15935985726667923,
  -0.12959241526669985,
  -0.09982497326666362,
  0.019324528390484375,
  0.19806206981905916,
  0.3767996112475771,
  0.5372050651999984,
  0.5326217318666409,
  0.5280383985332833,
  0.5192050651999693,
  0.4721217318666504,
  0.4250383985333315,
  0.3798717318666718,
  0.3519550652000021,
  0.3240383985333324,
  0.3070383985333365,
  0.38828839853334784,
  0.46953839853330237,
  0.5424967318666631,
  0.540830065199998,
  0.5391633985333328,
  0.5376633985333115,
  0.5376633985333115,
  0.5376633985333115,
  0.5376633985333115,
  0.5376633985333115,
  0.5376633985333115,
  0.5380205413904378,
  0.5415919699618712,
  0.5451633985333046,
  0.5776633985333319,
  0.6776633985332978,
  0.7776633985333206,
  0.35980174233975504,
  0.5849264513720414,
  0.8100511604042708,
  1.0126633985333342,
  1.0126633985333342,
  1.0126633985333342,
  1.011163398

Given this helper function, which extracts the features for a particular time, we wrote a function that procedurally generates a training set for the time interval and dataframe passed as parameters.

In [9]:
def prepare_samples(start_time, end_time, dataframe):
    delta_seconds = 60  # take one sample per minute

    # restrict to the interval padded by one hour on each side,
    # so lookback and prediction windows near the edges have data
    scope_x = start_time - dtime.timedelta(hours=1)
    scope_y = end_time + dtime.timedelta(hours=1)
    scope = dataframe[(dataframe['Time'] > scope_x)
                      & (dataframe['Time'] < scope_y)].reset_index(drop=True)

    pointer = start_time
    X = []
    Y = []
    while pointer < end_time:
        temp = extract_feature(pointer, scope)
        if temp is not None:  # skip times without enough nearby trades
            X.append(temp[0])
            Y.append(temp[1])
        pointer = pointer + dtime.timedelta(seconds=delta_seconds)

    return (X, Y)

Generate a particular training set:

In [13]:
samples = prepare_samples(dtime.datetime(2016,5,20), dtime.datetime(2016,5,21), df)

Now, let's see how well a Linear Regression model can be trained on this data.

In [10]:
from sklearn import linear_model

In [11]:
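# Note: the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions, drop it (or scale the features explicitly before fitting).
reg = linear_model.LinearRegression(normalize=True)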

In [14]:
reg.fit(samples[0], samples[1])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)

In [16]:
reg.score(samples[0], samples[1])

0.080965710556980763

### For Trading Decision: Applying Linear Classification
The score above is the coefficient of determination R^2 of the prediction, that is, the proportion of the variance in the dependent variable that is predictable from the independent variables. A value of about 0.08, measured on the training set itself, is a "terrible" score, so we think predicting the exact price change may be too hard. Training a binary classification SVM that predicts whether there is a chance of making money in the next 90 seconds might be more accurate. This is our immediate next step.
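
To make the meaning of this score concrete, the sketch below computes R^2 by hand and shows how regression targets could be turned into binary labels for a classifier. The function name `r_squared`, the sample values, and the `min_profit` threshold are hypothetical and chosen only for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y, y)                        # predictions match exactly
baseline = r_squared(y, np.full(4, y.mean()))    # always predict the mean

# Hypothetical price changes and a hypothetical minimum profit (e.g. fees).
price_changes = np.array([0.8, -0.2, 0.05, 1.3])
min_profit = 0.1
labels = (price_changes > min_profit).astype(int)  # 1 = worth trading
```

A model that always predicts the mean scores 0, so a training score of 0.08 means the regression explains little more than that baseline.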

### For Price Change Prediction: Applying Bayesian Regression for "latent source model"
Because simple linear regression performs poorly at predicting the Bitcoin price from historical prices, we searched the academic literature and decided to apply Bayesian regression for the "latent source model", following the approach of Shah and Zhang.
#### Trading Strategy
At each time, we maintain a position of +1, 0, or −1 Bitcoin. At each time instance, we predict the average price movement over a 10-second interval, say ∆p, using the Bayesian regression described by Shah and Zhang in "Bayesian Regression and Bitcoin". If ∆p > t, where t is a threshold, we buy a bitcoin if the current position is ≤ 0; if ∆p < −t, we sell a bitcoin if the current position is ≥ 0; otherwise we do nothing. The time steps at which we make these trading decisions are chosen carefully by looking at recent trends.
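
The threshold rule described above can be sketched as a small state-update function. The function name, the sample predictions, and the threshold value are hypothetical; only the buy/sell conditions come from the description.

```python
def next_position(position, delta_p, threshold):
    """Return the new position (-1, 0 or +1) after applying the threshold rule."""
    if delta_p > threshold and position <= 0:
        return position + 1   # buy one bitcoin
    if delta_p < -threshold and position >= 0:
        return position - 1   # sell one bitcoin
    return position           # otherwise do nothing

# Walk through a hypothetical sequence of predicted price changes.
predictions = [0.3, 0.4, -0.05, -0.6, -0.7, 0.2]
position = 0
positions = []
for delta_p in predictions:
    position = next_position(position, delta_p, threshold=0.1)
    positions.append(position)
print(positions)
```

Because a buy is only allowed when the position is at most 0 and a sell only when it is at least 0, the position can never leave the set {−1, 0, +1}.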
#### Predicting Price Change
Given a time series of Bitcoin price variation over an interval of a few months, sampled every 10 seconds, we have a very long time series (a vector). From this historical series we generate three sets of time-series samples of different lengths: $S_1$ of length 30 minutes, $S_2$ of length 60 minutes, and $S_3$ of length 120 minutes. At a given point in time, to predict the future change $\Delta p$, we use historical data of three lengths: the previous 30, 60, and 120 minutes, denoted $x_1$, $x_2$, and $x_3$. We use $x_j$ together with the historical samples $S_j$ in Bayesian regression to predict the average price change $\Delta p_j$ for $1 \le j \le 3$. We also calculate $r = \frac{v_{bid} - v_{ask}}{v_{bid} + v_{ask}}$, where $v_{bid}$ is the total volume people are willing to buy in the top 60 orders and $v_{ask}$ is the total volume people are willing to sell in the top 60 orders, based on the current order book data. The final estimate of $\Delta p$ is
$$\Delta p = w_0 + \sum_{j=1}^{3} w_j \Delta p_j + w_4 r$$
where $w = (w_0, \ldots, w_4)$ are learnt parameters.
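
As a rough illustration of the final combination step, the sketch below computes the order book ratio $r$, combines it with three per-horizon predictions, and recovers the weights from synthetic data. All function names and data are hypothetical, and ordinary least squares is an assumption about how the weights might be fitted, not a statement of the exact procedure.

```python
import numpy as np

def order_book_imbalance(v_bid, v_ask):
    """r = (v_bid - v_ask) / (v_bid + v_ask), in the range [-1, 1]."""
    return (v_bid - v_ask) / (v_bid + v_ask)

def combine(w, dp, r):
    """Final estimate: w0 + sum_j w_j * dp_j + w4 * r, with dp of length 3."""
    return w[0] + np.dot(w[1:4], dp) + w[4] * r

# Synthetic data (assumption): 200 past predictions and imbalance values.
rng = np.random.default_rng(0)
n = 200
dp_hist = rng.normal(size=(n, 3))        # columns: dp_1, dp_2, dp_3
r_hist = rng.uniform(-1, 1, size=n)
true_w = np.array([0.1, 0.5, 0.3, 0.2, -0.4])

# Design matrix with an intercept column, then a noiseless target.
design = np.column_stack([np.ones(n), dp_hist, r_hist])
target = design @ true_w

# Least-squares fit of w (assumed fitting method).
w_hat, *_ = np.linalg.lstsq(design, target, rcond=None)
estimate = combine(w_hat, np.array([1.0, 1.0, 1.0]), order_book_imbalance(30.0, 10.0))
```

With noiseless synthetic targets the fit recovers `true_w` up to floating-point error; on real data the weights would of course only be approximate.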

It remains to find $S_j$, $1 \le j \le 3$, and to learn $w$. We divide the entire time span into three roughly equal periods. The first period is used to find the patterns $S_j$, the second to learn the parameters $w$, and the third to evaluate the performance of the algorithm. Given the selected $S_j$, $w$ is learnt simply by finding the best linear fit.

To select $S_j$, we take all possible time series of the appropriate length (effectively vectors of dimension 180, 360, and 720 for $S_1$, $S_2$, and $S_3$ respectively). Each of these forms a sample $x_i$, and its label $y_i$ is the average price change in the 10-second interval following the end of $x_i$. This data repository is extremely large. To make the computation feasible on a single machine with 128 GB of RAM and 32 cores, we clustered the patterns into 100 clusters using the k-means algorithm, chose the 20 most effective clusters, and took representative patterns from them.
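
The text does not spell out how a prediction is formed from the stored patterns. One plausible reading is a similarity-weighted average of the pattern labels, sketched below. The exponential squared-distance weight and the constant `c` are assumptions made for illustration; the paper should be consulted for the exact similarity measure.

```python
import numpy as np

def bayesian_regression_estimate(x, patterns, labels, c=0.25):
    """Weighted average of labels, weighting each pattern by its closeness to x.

    Assumed weight: exp(-c * squared Euclidean distance).
    """
    x = np.asarray(x, dtype=float)
    patterns = np.asarray(patterns, dtype=float)
    labels = np.asarray(labels, dtype=float)
    sq_dist = np.sum((patterns - x) ** 2, axis=1)
    # Shift by the minimum distance for numerical stability;
    # the shift cancels in the ratio below.
    weights = np.exp(-c * (sq_dist - sq_dist.min()))
    return np.dot(weights, labels) / weights.sum()

# Two hypothetical patterns with opposite labels.
patterns = [[0.0, 0.0], [10.0, 10.0]]
labels = [1.0, -1.0]

near_first = bayesian_regression_estimate([0.0, 0.0], patterns, labels)
midway = bayesian_regression_estimate([5.0, 5.0], patterns, labels)
```

A query that coincides with the first pattern yields an estimate close to that pattern's label, while a query equidistant from both patterns averages the two labels to zero.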


### Reference
* Shah, Devavrat, and Kang Zhang. "Bayesian regression and Bitcoin." arXiv preprint arXiv:1410.1231 (2014).