
# Crypto prediction model creation
*This notebook builds a model to predict crypto prices from a range of sources. It is an experiment to see whether this is possible, or whether prices follow a truly random walk. Data sources include:
- Previous Bitcoin transaction and price data (using Kraken)
- Google Trends data
- Activity among other Bitcoin users (from Blockcypher)*

---



Things to do:
1. Add news coverage of Bitcoin and other keywords
2. Look at whales by pulling from the blockchain and counting the transactions over a certain amount
3. Look into other factors from https://www.coinspeaker.com/guides/how-big-news-influence-bitcoin-price/ and https://www.forbes.com/sites/billybambrough/2020/03/22/heres-how-to-predict-major-moves-in-the-price-of-bitcoin/#90adc4d7a4ab
4. Improve the buy and sell metric


## Chapter 1: Prepare the script
Import packages and set up a link to Google Drive, where the data will be stored.

In [1]:
###Install and import packages
#Basics
import os
import random
import pandas as pd
import numpy as np
import time
import shutil
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns; sns.set()
from functools import reduce
import scipy.stats as stats
import gc
import sys

#For pydrive
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive 
from google.colab import auth 
from oauth2client.client import GoogleCredentials

#For Kraken API
import requests
import json

#For Google trends
import datetime
!pip install pytrends
from pytrends.request import TrendReq 

#For blockchain
!pip install --default-timeout=100 blockcypher
from blockcypher import get_address_overview
from blockcypher import get_address_details
from blockcypher import get_address_full

#For Sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils import class_weight #for overweighting users
from sklearn.metrics import SCORERS
from pickle import dump
from pickle import load

#Tensorflow
!pip install tensorflow-determinism
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Dropout, Flatten, Activation, LSTM, Input, Masking
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.models import Sequential, load_model, Model
from tensorflow.keras.callbacks import Callback, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.regularizers import l1, l2
from tensorflow.random import set_seed

#Set seed
seed_object = 7140
os.environ['PYTHONHASHSEED']=str(seed_object)
os.environ['TF_DETERMINISTIC_OPS']='1' #This flag is an on/off switch, not a seed
random.seed(seed_object)
np.random.seed(seed_object)
set_seed(seed_object)


In [2]:
#Set visualization options
pd.set_option('display.max_columns', 500)
pd.options.mode.chained_assignment = None

In [3]:
#Mount Google Drive (to allow saving)
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
#Set Google Drive filepath. In this, we need "Raw Data", "Last Date", "Cleaned Data" and "Models"
my_filepath = "drive/My Drive/Data Science/Crypto/"

## Chapter 2: Pull previous transactions and price data
Download transactions data from Kraken. This is designed for updating previously downloaded data, but can be adjusted for a first pull
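
The cells below pass the `since` cursor to Kraken as a nanosecond timestamp string, while the trade times that come back are in seconds. A minimal sketch of converting between the two, assuming whole-second precision is enough for inspection; the helper names `seconds_to_since` and `since_to_datetime` are hypothetical and not part of the notebook:

```python
from datetime import datetime, timezone

NANOS_PER_SECOND = 1_000_000_000

def seconds_to_since(unix_seconds):
    """Format a Unix time in seconds as a nanosecond string with no exponent."""
    return str(int(round(unix_seconds * NANOS_PER_SECOND)))

def since_to_datetime(since):
    """Turn a nanosecond cursor string back into a UTC datetime (second precision)."""
    return datetime.fromtimestamp(int(since) // NANOS_PER_SECOND, tz=timezone.utc)

cursor = seconds_to_since(1609459200)  # 2021-01-01 00:00:00 UTC
print(cursor, since_to_datetime(cursor))
```

Keeping the cursor as an integer-like string avoids scientific notation, which is the same reason the notebook formats its timestamps with `'.0f'`.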

In [None]:
###Create function for downloading new data
def downloading_new_data(pair, pair_adj, start_time, end_time):
  #Create objects to be updated for the time and the final dataset
  temp_time = start_time
  transactions_data = pd.DataFrame(columns = [0,1,2,3,4,5])
  printcounter = 0
  URL = "https://api.kraken.com/0/public/Trades"
  #Run a loop to collect data
  while int(temp_time)<int(end_time):
    #Print the time to keep track
    if (printcounter % 100 == 0):
      print("Time pulled:", pd.to_datetime(int(temp_time)))
      print("Time of request:", pd.to_datetime(time.time()*1000000000))
    #Sleep to avoid pulling from the API too much
    if printcounter % 5 == 0:
      time.sleep(15)
    #Pull transactions data (Try 3 times)
    tries = 3
    for i in range(tries):
      try:
          transactions_data_temp = json.loads(requests.get(URL, params = {"pair": pair,
                                                      "since": temp_time}).content)
          transactions_data_temp = pd.DataFrame.from_dict(transactions_data_temp['result'][pair_adj])
      except (KeyError, ValueError) as e:
          if i < tries - 1: # i is zero indexed
              print("Retrying")
              continue
          else:
              print("Failed")
              raise
      break
    #Update objects
    transactions_data = pd.concat([transactions_data, transactions_data_temp])
    temp_time = str(format(transactions_data_temp[2].max()*1000000000, '.0f'))
    printcounter += 1
  
  #Remove objects
  del(transactions_data_temp, printcounter)

  #Return object
  return(transactions_data, temp_time)
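
The retry logic inside the function above can also be pulled out into a small helper. This is only a sketch under stated assumptions: `call_with_retries` and `flaky` are hypothetical names, and the growing wait between attempts is an addition that the notebook does not use.

```python
import time

def call_with_retries(func, tries=3, wait_seconds=0.0, retry_on=(KeyError, ValueError)):
    """Call func() up to `tries` times, re-raising the last error if all attempts fail."""
    for attempt in range(tries):
        try:
            return func()
        except retry_on:
            if attempt == tries - 1:
                raise
            time.sleep(wait_seconds * (attempt + 1))  # wait a little longer each time

# Usage with a stand-in that fails twice and then succeeds
attempts = []

def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise KeyError("result")
    return "ok"

result = call_with_retries(flaky)
print(result, len(attempts))
```

Only the listed exception types are retried, so an unexpected error still surfaces immediately.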

In [None]:
###Prepare and pull the data for bitcoin (change the last date to the first date of interest if this is the first pull)
#Pull the start and end times (end time is current - if this fails do this in batches)
last_date_bitcoin = np.loadtxt(my_filepath + "Last Date/Last_Date_Bitcoin.txt")
start_time_bitcoin = str(format(last_date_bitcoin,'.0f'))
end_time = str(format(time.time()*1000000000, '.0f'))
##Pull the new data
new_data_bitcoin, final_time_bitcoin = downloading_new_data("XBTEUR", "XXBTZEUR", start_time_bitcoin, end_time)

Time pulled: 2020-12-26 16:27:10.300000
Time of request: 2021-01-06 13:16:52.032999936
Time pulled: 2020-12-27 17:29:57.963699968
Time of request: 2021-01-06 13:22:26.829432576
Time pulled: 2020-12-29 09:45:14.122899968
Time of request: 2021-01-06 13:28:01.631375872
Time pulled: 2020-12-30 21:01:28.665699840
Time of request: 2021-01-06 13:33:43.274811392
Time pulled: 2021-01-01 20:36:12.234200064
Time of request: 2021-01-06 13:39:25.905658624
Time pulled: 2021-01-02 20:54:51.448800
Time of request: 2021-01-06 13:45:11.135453952
Time pulled: 2021-01-03 15:30:10.384300032
Time of request: 2021-01-06 13:50:59.571577600
Time pulled: 2021-01-04 10:20:14.366099968
Time of request: 2021-01-06 13:56:53.686587392
Time pulled: 2021-01-05 09:08:08.078400
Time of request: 2021-01-06 14:02:53.700824320
Time pulled: 2021-01-06 10:03:23.735300096
Time of request: 2021-01-06 14:08:57.347183872


In [None]:
###Rename and adjust types
new_data_bitcoin.rename(columns={0: 'price', 1: 'volume', 2: 'time', 
                           3: 'buy_sell', 4: 'market_limit', 
                           5: 'misc'}, inplace=True)
new_data_bitcoin['time'] = pd.to_datetime(new_data_bitcoin['time'],unit='s')

In [None]:
###Pull the previous data, append and drop duplicates for bitcoin. If this is the first pull, simply rename new_data_bitcoin as bitcoin_data
bitcoin_data = pd.read_csv(my_filepath + "Raw Data/Raw_Data_Bitcoin.csv")
bitcoin_data = pd.concat([bitcoin_data, new_data_bitcoin])
new_data_bitcoin = pd.DataFrame()
del(new_data_bitcoin)
bitcoin_data.drop_duplicates(keep='first', inplace = True)

In [None]:
###Save for bitcoin
bitcoin_data.to_csv(my_filepath + 'Raw Data/Raw_Data_Bitcoin.csv',index=False)
with open(my_filepath + 'Last Date/Last_Date_Bitcoin.txt', 'w') as f:
  print(final_time_bitcoin, file=f)

In [None]:
###Delete bitcoin objects
bitcoin_data = pd.DataFrame()
del(bitcoin_data, last_date_bitcoin, start_time_bitcoin, final_time_bitcoin)

In [None]:
###Prepare and pull the data for ethereum (change the last date to the first date of interest if this is the first pull)
#Pull the start and end times (end time is current - if this fails do this in batches)
last_date_ethereum = np.loadtxt(my_filepath + "Last Date/Last_Date_Ethereum.txt")
start_time_ethereum = str(format(last_date_ethereum,'.0f'))
end_time = str(format(time.time()*1000000000, '.0f'))
##Pull the new data and save the last date
new_data_ethereum, final_time_ethereum = downloading_new_data("ETHEUR", "XETHZEUR", start_time_ethereum, end_time)

Time pulled: 2020-09-13 12:59:54.925499904
Time of request: 2021-01-06 16:09:39.559493376
Time pulled: 2020-09-23 14:25:05.702699776
Time of request: 2021-01-06 16:15:17.455465216
Time pulled: 2020-10-07 06:23:11.575000064
Time of request: 2021-01-06 16:20:55.863287808
Time pulled: 2020-10-21 15:53:03.841299968
Time of request: 2021-01-06 16:26:41.686530816
Time pulled: 2020-11-02 08:26:35.325499904
Time of request: 2021-01-06 16:32:29.048775424
Time pulled: 2020-11-09 00:21:02.664300032
Time of request: 2021-01-06 16:38:18.667533824
Time pulled: 2020-11-17 17:06:13.772100096
Time of request: 2021-01-06 16:44:13.994258432
Time pulled: 2020-11-22 15:02:30.868499968
Time of request: 2021-01-06 16:50:14.943987968
Time pulled: 2020-11-25 17:42:45.881499904
Time of request: 2021-01-06 16:56:12.909421824
Time pulled: 2020-11-29 08:25:05.971500032
Time of request: 2021-01-06 17:02:16.573541120
Time pulled: 2020-12-03 20:41:56.505299968
Time of request: 2021-01-06 17:08:25.871072768

In [None]:
###Rename and adjust types
new_data_ethereum.rename(columns={0: 'price', 1: 'volume', 2: 'time', 
                           3: 'buy_sell', 4: 'market_limit', 
                           5: 'misc'}, inplace=True)
new_data_ethereum['time'] = pd.to_datetime(new_data_ethereum['time'],unit='s')


In [None]:
###Pull the previous data, append and drop duplicates for ethereum. If this is the first pull, simply rename new_data_ethereum as ethereum_data
ethereum_data = pd.read_csv(my_filepath + "Raw Data/Raw_Data_Ethereum.csv")
ethereum_data = pd.concat([ethereum_data, new_data_ethereum])
new_data_ethereum = pd.DataFrame()
del(new_data_ethereum)
ethereum_data.drop_duplicates(keep='first', inplace = True)

In [None]:
###Save for ethereum
ethereum_data.to_csv(my_filepath + 'Raw Data/Raw_Data_Ethereum.csv',index=False)
with open(my_filepath + 'Last Date/Last_Date_Ethereum.txt', 'w') as f:
  print(final_time_ethereum, file=f)

In [None]:
###Delete ethereum objects
ethereum_data = pd.DataFrame()
del(ethereum_data, last_date_ethereum, start_time_ethereum, final_time_ethereum)

## Chapter 3: Pull google trends data
Download hourly Google Trends data in overlapping weekly windows, merge the windows together, and normalise them to a common scale. Again, this is designed for updating previously downloaded data, but can be adjusted for a first pull.
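
The rescaling step can be illustrated on plain lists. The sketch below assumes each window starts at the timestamp where the previous one ends, and, like the notebook, falls back to a ratio of 1 when the overlap value is zero. The function name `chain_rescale` and the sample numbers are made up for illustration.

```python
def chain_rescale(windows):
    """Rescale each window so its first point matches the last point of the previous one."""
    rescaled = [list(windows[0])]
    for window in windows[1:]:
        previous_last = rescaled[-1][-1]
        first = window[0]
        # Fall back to a ratio of 1 when the overlap value is zero
        ratio = previous_last / first if first != 0 else 1.0
        rescaled.append([value * ratio for value in window])
    return rescaled

week_1 = [50.0, 100.0, 80.0]  # ends at the shared timestamp
week_2 = [40.0, 20.0, 60.0]   # starts at the shared timestamp
week_3 = [0.0, 10.0, 5.0]     # zero at the overlap, so it is left unscaled

rescaled = chain_rescale([week_1, week_2, week_3])
print(rescaled)
```

Each ratio is computed against the already rescaled previous window, so the scale of the first window carries through the whole chain.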

In [None]:
def downloading_google_trends(year_s, month_s, day_s, year_e, month_e, day_e, keyword, location):
  #Specify start and end date, and required keyword
  start_date = datetime.date(year_s, month_s, day_s)
  end_date = datetime.date(year_e, month_e, day_e)
  
  #Make a list of weeks
  weekly_date_list = []
  start_date_temp = start_date
  weekly_date_list.append(start_date_temp)
  while start_date_temp+datetime.timedelta(days=7) < end_date:
      start_date_temp += datetime.timedelta(days=7)
      weekly_date_list.append(start_date_temp)
  if start_date_temp+datetime.timedelta(days=7) >= end_date:
      weekly_date_list.append(end_date)

  #Make a list of data for each week, and remove the second of any duplicates (the value for the beginning of next week)
  interest_list = []
  for i in range(len(weekly_date_list)-1):
      p = TrendReq()
      keyword_list = [keyword] 
      interest = p.get_historical_interest(keyword_list, weekly_date_list[i].year, weekly_date_list[i].month, weekly_date_list[i].day, 0,  weekly_date_list[i+1].year, weekly_date_list[i+1].month, weekly_date_list[i+1].day, 0, geo = location, sleep = 1).reset_index()
      interest.rename(columns = {'date': "Date"}, inplace = True)
      interest.drop_duplicates(keep='first', subset = "Date", inplace = True)
      interest_list.append(interest)

  #Rescale the data 
  interest_list[0]["Check"] = interest_list[0][keyword_list[0]]
  if len(interest_list) > 1:
    ratio_list = []
    for i in range(len(interest_list)-1):
        #Calculation of the ratio
        try:
          ratio = float(interest_list[i][keyword_list[0]].iloc[-1])/float(interest_list[i+1][keyword_list[0]].iloc[0]) 
        except ZeroDivisionError:
          ratio = 1
        ratio_list.append(ratio)
        interest_list[i+1]["Check"] = interest_list[i+1][keyword_list[0]].apply(lambda x:x*ratio_list[i])
        interest_list[i+1][keyword_list[0]] = interest_list[i+1]["Check"]

  #Combine the data and return the object
  df = pd.concat(interest_list)
  df.drop(labels = keyword_list[0] , axis = 1, inplace = True)
  df = df[df['isPartial'].astype(str) != "True"] #Boolean mask: the concatenated index has repeated labels, and isPartial may be bool or string
  df.drop(labels = "isPartial", axis = 1, inplace = True)
  df.rename(columns = {'Check': keyword}, inplace = True)
  return(df)

In [None]:
#List the terms, country ISO codes, and country labels to be pulled (i.e. the keywords we want, and the countries/regions that they are being searched from)
keyword_list = ["Bitcoin", "Ethereum", "Coinbase"]
ISO_codes = ["", "US", "GB", "JP", "IN"]
country_labels = ["World", "US", "GB", "Japan", "India"]

In [None]:
# #Start collecting data if we don't have any already (don't run if this isn't the first pull)
# Google_Trends_List = []
# for i in range(len(ISO_codes)): 
#   print(country_labels[i])
#   for j in range(len(keyword_list)): 
#     Google_Trends_Temp = downloading_google_trends(2015, 1, 1, 2015, 1, 15, keyword_list[j], ISO_codes[i])
#     Google_Trends_Temp = Google_Trends_Temp.add_prefix(country_labels[i] + "_")
#     Google_Trends_Temp.columns.values[0] = "Date"
#     Google_Trends_Temp.drop_duplicates(keep='first', subset = "Date", inplace = True)
#     Google_Trends_List.append(Google_Trends_Temp)
# 
# #Merge together
# Google_Trends = reduce(lambda x, y: pd.merge(x, y, on = 'Date'), Google_Trends_List)
# 
# #Capture last date
# last_date = Google_Trends['Date'].iloc[-1]

In [None]:
#Read in previous data, and capture the last date, if we have it (don't run if this is the first pull)
Google_Trends = pd.read_csv(my_filepath + "Raw Data/Google_Trends.csv")
last_date = pd.to_datetime(Google_Trends['Date'].iloc[-1])

In [None]:
#Pull new data (don't run if this is the first pull)
#end_time = pd.to_datetime(time.time()*1000000000)
end_time = last_date + datetime.timedelta(days=7)
Google_Trends_New_List = []
for i in range(len(ISO_codes)): 
  print(country_labels[i])
  for j in range(len(keyword_list)): 
    Google_Trends_Temp = downloading_google_trends(last_date.year, last_date.month, last_date.day, end_time.year, end_time.month, end_time.day, keyword_list[j], ISO_codes[i])
    Google_Trends_Temp = Google_Trends_Temp.add_prefix(country_labels[i] + "_")
    Google_Trends_Temp.columns.values[0] = "Date"
    Google_Trends_Temp.drop_duplicates(keep='first', subset = "Date", inplace = True)
    Google_Trends_New_List.append(Google_Trends_Temp)

#Merge together
Google_Trends_New = reduce(lambda x, y: pd.merge(x, y, on = 'Date'), Google_Trends_New_List)

World
The request failed: Google returned a response with code 500.
The request failed: Google returned a response with code 500.
The request failed: Google returned a response with code 500.
US
The request failed: Google returned a response with code 500.
The request failed: Google returned a response with code 500.
The request failed: Google returned a response with code 500.
GB
The request failed: Google returned a response with code 500.
The request failed: Google returned a response with code 500.
The request failed: Google returned a response with code 500.
Japan
The request failed: Google returned a response with code 500.
The request failed: Google returned a response with code 500.
The request failed: Google returned a response with code 500.
India
The request failed: Google returned a response with code 500.
The request failed: Google returned a response with code 500.
The request failed: Google returned a response with code 500.


In [None]:
###Append objects together (don't run if this is the first pull)
#Normalise to the original object's ratio
for i in range(1, Google_Trends.shape[1]):
    try:
      ratio = float(Google_Trends.iloc[-1,i])/float(Google_Trends_New.iloc[0,i]) 
    except ZeroDivisionError:
      ratio = 1
    Google_Trends_New.iloc[:,i] = Google_Trends_New.iloc[:,i]*ratio   

#Append and delete duplicates
Google_Trends = pd.concat([Google_Trends, Google_Trends_New]).reset_index(drop=True)
Google_Trends_New = pd.DataFrame()
del(Google_Trends_New)
Google_Trends.drop_duplicates(keep='first', subset = "Date", inplace = True)

In [None]:
#Save the data
Google_Trends.to_csv(my_filepath + 'Raw Data/Google_Trends.csv',index=False)

## Chapter 4: Pull ledger data 
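Pull data on the biggest accounts and their transactions. The list of accounts comes from Blockchair (see the commented-out cell below) and the transaction history from BlockCypher. Again, this is designed for updating previously downloaded data, but can be adjusted for a first pull.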
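
Turning a list of transactions into an hourly balance series amounts to flooring each timestamp to the hour and carrying the last known balance forward. A minimal sketch with the standard library, assuming a zero balance before the first transaction and taking the latest transaction within each hour; `hourly_balances` and the sample data are hypothetical:

```python
from datetime import datetime, timedelta

def hourly_balances(transactions, start, end):
    """Return (hour, balance) pairs from start to end inclusive, carrying the last known balance forward."""
    # Keep the balance of the latest transaction within each hour
    last_in_hour = {}
    for confirmed, balance in sorted(transactions):
        hour = confirmed.replace(minute=0, second=0, microsecond=0)
        last_in_hour[hour] = balance
    series = []
    current = 0.0  # assumed balance before the first transaction
    hour = start
    while hour <= end:
        current = last_in_hour.get(hour, current)
        series.append((hour, current))
        hour += timedelta(hours=1)
    return series

transactions = [
    (datetime(2021, 1, 1, 1, 15), 2.0),
    (datetime(2021, 1, 1, 1, 40), 1.5),
    (datetime(2021, 1, 1, 3, 5), 4.0),
]
series = hourly_balances(transactions, datetime(2021, 1, 1, 0), datetime(2021, 1, 1, 4))
balances = [balance for _, balance in series]
print(balances)
```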

In [None]:
# #Pull the 5000 accounts with the most bitcoin at the moment
# Accounts_List = []
# for i in range(50):
#   response_temp = requests.get("https://api.blockchair.com/bitcoin/addresses", params = {'limit': 100,
#                                                                                   'offset': i*100})
#   print(response_temp.status_code)
#   response_temp = pd.DataFrame(json.loads(response_temp.text)['data'])
#   Accounts_List.append(response_temp)
#   time.sleep(3)
# 
# Accounts = pd.concat(Accounts_List).reset_index(drop=True)
# account = Accounts['address']
# 
# #Save
# account.to_csv(my_filepath + 'Raw Data/Accounts.csv',index=False)

In [None]:
#Read in list of the largest accounts
accounts = pd.read_csv(my_filepath + 'Raw Data/Accounts.csv')['address']

In [None]:
#Loop through, pull the data, and convert this to an hourly dataset with the balance for each account
ledger_data_List = []
dates = pd.DataFrame({'Date': pd.date_range(start="2001-01-01",end=(datetime.datetime.today()+datetime.timedelta(days=1)).strftime('%Y-%m-%d'), freq = "H")})

for j in range(100):
  if j % 20 == 0:
    print(j)
  i = j + 500
  #Pull the data, make an hour column, and clean (single attempt: no retry is implemented here)
  ledger_data_temp = pd.DataFrame(get_address_details(accounts[i], txn_limit = 50000)['txrefs'])[['ref_balance', 'confirmed']]
  ledger_data_temp['confirmed'] = pd.to_datetime(ledger_data_temp['confirmed'], errors='coerce', utc=True).values.astype('datetime64[s]')
  if ledger_data_temp[ledger_data_temp['confirmed'].isnull()].shape[0] > 0:
    print("Account " + str(i) + " missing dates: " + str(ledger_data_temp[ledger_data_temp['confirmed'].isnull()].shape[0]))
  ledger_data_temp = ledger_data_temp[ledger_data_temp['confirmed'].notnull()]
  ledger_data_temp['Date'] = pd.Series(ledger_data_temp['confirmed']).dt.floor("H")
  ledger_data_temp.drop(labels = "confirmed", axis = 1, inplace = True)
  ledger_data_temp['ref_balance'] = ledger_data_temp['ref_balance']/100000000

  #Expand this to be a full hourly dataset and append to list
  ledger_data_temp = pd.merge(dates, ledger_data_temp, how = 'left', on = 'Date')
  ledger_data_temp.iloc[0,1] = 0
  ledger_data_temp = ledger_data_temp.ffill()
  ledger_data_temp = ledger_data_temp.groupby('Date')['ref_balance'].aggregate(balance = 'last')
  ledger_data_temp.rename(columns = {'balance': "balance_" + str(i+1)}, inplace = True)
  ledger_data_List.append(ledger_data_temp)
  time.sleep(5)

#Merge together the datasets
ledger_data = reduce(lambda x, y: pd.merge(x, y, on = 'Date'), ledger_data_List)

0


RateLimitError: ignored

In [None]:
#Open previous data
ledger_data_old = pd.read_csv(my_filepath + 'Raw Data/Ledger_Data_All.csv').set_index('Date')

In [None]:
#Merge
ledger_data = pd.merge(ledger_data_old, ledger_data, left_index= True, right_index= True, how = 'inner')

In [None]:
### Save data
ledger_data.to_csv(my_filepath + 'Raw Data/Ledger_Data_All.csv')

In [None]:
### Add columns for summed balances
#First count the number of non-zero columns per row
cols = ledger_data.columns
ledger_data_summary = pd.DataFrame({'non_zero': ledger_data[cols].gt(0).sum(axis=1)})
#Then sum, and divide by the number of non-zero columns
ledger_data_summary['adjusted_balance'] = ledger_data.sum(axis=1)/np.where(ledger_data_summary['non_zero']==0,1,ledger_data_summary['non_zero'])
ledger_data_summary.drop(labels = "non_zero", axis = 1, inplace = True)
#Finally, take only the bits from 2015 onwards
ledger_data_summary = ledger_data_summary.iloc[120000:]

In [None]:
### Save data
ledger_data_summary.to_csv(my_filepath + 'Raw Data/Ledger_Data.csv')

## Chapter 5: Clean all datasets 
Where required, clean the datasets (currently just Google Trends).

Clean the trends data to account for missing values. Each gap is filled by linear interpolation from the last value before the gap to the first value after it.
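
As a rough illustration of what filling a gap with the trend means, the sketch below linearly interpolates runs of missing values between their nearest known neighbours. It is a simplified stand-in for the pandas call used in this chapter: leading and trailing gaps are left untouched, and the name `interpolate_gaps` is hypothetical.

```python
def interpolate_gaps(values):
    """Linearly fill runs of None using the nearest known values on each side."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            weight = (i - left) / gap
            filled[i] = filled[left] + weight * (filled[right] - filled[left])
    return filled

filled = interpolate_gaps([10.0, None, None, None, 18.0, 20.0])
print(filled)
```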

In [None]:
###Open data
Google_Trends = pd.read_csv(my_filepath + 'Raw Data/Google_Trends.csv')

In [None]:
###Replace with missing if the world bitcoin value for that hour is 0 (this indicates missing data rather than a true 0)
Google_Trends.loc[Google_Trends["World_Bitcoin"] == 0, ['World_Ethereum','World_Coinbase',]] = np.nan, np.nan
Google_Trends.loc[Google_Trends["World_Bitcoin"] == 0, ['US_Bitcoin','US_Ethereum','US_Coinbase',]] = np.nan, np.nan, np.nan
Google_Trends.loc[Google_Trends["World_Bitcoin"] == 0, ['GB_Bitcoin','GB_Ethereum','GB_Coinbase',]] = np.nan, np.nan, np.nan
Google_Trends.loc[Google_Trends["World_Bitcoin"] == 0, ['Japan_Bitcoin','Japan_Ethereum','Japan_Coinbase',]] = np.nan, np.nan, np.nan
Google_Trends.loc[Google_Trends["World_Bitcoin"] == 0, ['India_Bitcoin','India_Ethereum','India_Coinbase',]] = np.nan, np.nan, np.nan
Google_Trends.loc[Google_Trends["World_Bitcoin"] == 0, 'World_Bitcoin'] = np.nan

In [None]:
###Replace with the trend
Google_Trends.interpolate(method='linear', inplace=True)

In [None]:
#Save
Google_Trends.to_csv(my_filepath + 'Raw Data/Google_Trends_Filled.csv',index=False)

## Chapter 6: Collapse data into period by period data
Collapse into period by period statistics (i.e. statistics by hour rather than by transaction)
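
The idea of collapsing trades into hourly statistics can be sketched without pandas. This assumes trades arrive sorted by time and uses Kraken's `"b"`/`"s"` side codes as in the raw data. Unlike `pd.Grouper`, it does not emit rows for hours with no trades. The function name `summarise_by_hour` and the sample trades are made up.

```python
from collections import defaultdict
from datetime import datetime

def summarise_by_hour(trades):
    """Group (time, price, volume, side) trades by hour; trades are assumed sorted by time."""
    buckets = defaultdict(list)
    for trade_time, price, volume, side in trades:
        hour = trade_time.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append((price, volume, side))
    summary = {}
    for hour, rows in buckets.items():
        summary[hour] = {
            "end_price": rows[-1][0],  # price of the last trade in the hour
            "mean_volume": sum(r[1] for r in rows) / len(rows),
            "number_sales": sum(1 for r in rows if r[2] == "s"),
            "number_purchases": sum(1 for r in rows if r[2] == "b"),
        }
    return summary

trades = [
    (datetime(2021, 1, 1, 10, 5), 24000.0, 0.5, "b"),
    (datetime(2021, 1, 1, 10, 40), 24100.0, 1.5, "s"),
    (datetime(2021, 1, 1, 11, 10), 24050.0, 2.0, "s"),
]
summary = summarise_by_hour(trades)
print(summary[datetime(2021, 1, 1, 10)])
```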

In [None]:
###Read in data and convert the datetime variable
bitcoin_data = pd.read_csv(my_filepath + "Raw Data/Raw_Data_Bitcoin.csv")
bitcoin_data['time'] = pd.to_datetime(bitcoin_data['time'])
ethereum_data = pd.read_csv(my_filepath + "Raw Data/Raw_Data_Ethereum.csv")
ethereum_data['time'] = pd.to_datetime(ethereum_data['time'])
Google_Trends = pd.read_csv(my_filepath + 'Raw Data/Google_Trends_Filled.csv')
Google_Trends['Date'] = pd.to_datetime(Google_Trends['Date'])
ledger_data_summary = pd.read_csv(my_filepath + 'Raw Data/Ledger_Data.csv')
ledger_data_summary['Date'] = pd.to_datetime(ledger_data_summary['Date'])

In [None]:
###Set time period for 1 hour 
time_period = "1H"

In [None]:
###Create summary statistics by period for bitcoin
cleaned_data_bitcoin = bitcoin_data.groupby(pd.Grouper(key = 'time', freq=time_period))['price'].aggregate(end_price_bitcoin = 'last')
cleaned_data_bitcoin['mean_volume_bitcoin'] = bitcoin_data.groupby(pd.Grouper(key = 'time', freq=time_period))['volume'].aggregate(np.mean).to_frame()
cleaned_data_bitcoin['number_sales_bitcoin'] = bitcoin_data[bitcoin_data['buy_sell']=="s"].groupby(pd.Grouper(key = 'time', freq=time_period))['buy_sell'].aggregate('count').to_frame()
cleaned_data_bitcoin['number_sales_bitcoin'] = np.where(pd.isnull(cleaned_data_bitcoin['number_sales_bitcoin']),0,cleaned_data_bitcoin['number_sales_bitcoin'])
cleaned_data_bitcoin['number_big_sales_bitcoin'] = bitcoin_data[(bitcoin_data['buy_sell']=="s") & ((bitcoin_data['price']*bitcoin_data['volume'])>10000)].groupby(pd.Grouper(key = 'time', freq=time_period))['buy_sell'].aggregate('count').to_frame()
cleaned_data_bitcoin['number_big_sales_bitcoin'] = np.where(pd.isnull(cleaned_data_bitcoin['number_big_sales_bitcoin']),0,cleaned_data_bitcoin['number_big_sales_bitcoin'])
cleaned_data_bitcoin['number_purchases_bitcoin'] = bitcoin_data[bitcoin_data['buy_sell']=="b"].groupby(pd.Grouper(key = 'time', freq=time_period))['buy_sell'].aggregate('count').to_frame()
cleaned_data_bitcoin['number_purchases_bitcoin'] = np.where(pd.isnull(cleaned_data_bitcoin['number_purchases_bitcoin']),0,cleaned_data_bitcoin['number_purchases_bitcoin'])
cleaned_data_bitcoin['number_big_purchases_bitcoin'] = bitcoin_data[(bitcoin_data['buy_sell']=="b") & ((bitcoin_data['price']*bitcoin_data['volume'])>10000)].groupby(pd.Grouper(key = 'time', freq=time_period))['buy_sell'].aggregate('count').to_frame()
cleaned_data_bitcoin['number_big_purchases_bitcoin'] = np.where(pd.isnull(cleaned_data_bitcoin['number_big_purchases_bitcoin']),0,cleaned_data_bitcoin['number_big_purchases_bitcoin'])
cleaned_data_bitcoin['number_market_bitcoin'] = bitcoin_data[bitcoin_data['market_limit']=="m"].groupby(pd.Grouper(key = 'time', freq=time_period))['market_limit'].aggregate('count').to_frame()
cleaned_data_bitcoin['number_market_bitcoin'] = np.where(pd.isnull(cleaned_data_bitcoin['number_market_bitcoin']),0,cleaned_data_bitcoin['number_market_bitcoin'])
cleaned_data_bitcoin['number_limit_bitcoin'] = bitcoin_data[bitcoin_data['market_limit']=="l"].groupby(pd.Grouper(key = 'time', freq=time_period))['market_limit'].aggregate('count').to_frame()
cleaned_data_bitcoin['number_limit_bitcoin'] = np.where(pd.isnull(cleaned_data_bitcoin['number_limit_bitcoin']),0,cleaned_data_bitcoin['number_limit_bitcoin'])

In [None]:
###Create summary statistics by period for ethereum
cleaned_data_ethereum = ethereum_data.groupby(pd.Grouper(key = 'time', freq=time_period))['price'].aggregate(end_price_ethereum = 'last')
cleaned_data_ethereum['mean_volume_ethereum'] = ethereum_data.groupby(pd.Grouper(key = 'time', freq=time_period))['volume'].aggregate(np.mean).to_frame()
cleaned_data_ethereum['number_sales_ethereum'] = ethereum_data[ethereum_data['buy_sell']=="s"].groupby(pd.Grouper(key = 'time', freq=time_period))['buy_sell'].aggregate('count').to_frame()
cleaned_data_ethereum['number_sales_ethereum'] = np.where(pd.isnull(cleaned_data_ethereum['number_sales_ethereum']),0,cleaned_data_ethereum['number_sales_ethereum'])
cleaned_data_ethereum['number_big_sales_ethereum'] = ethereum_data[(ethereum_data['buy_sell']=="s") & ((ethereum_data['price']*ethereum_data['volume'])>10000)].groupby(pd.Grouper(key = 'time', freq=time_period))['buy_sell'].aggregate('count').to_frame()
cleaned_data_ethereum['number_big_sales_ethereum'] = np.where(pd.isnull(cleaned_data_ethereum['number_big_sales_ethereum']),0,cleaned_data_ethereum['number_big_sales_ethereum'])
cleaned_data_ethereum['number_purchases_ethereum'] = ethereum_data[ethereum_data['buy_sell']=="b"].groupby(pd.Grouper(key = 'time', freq=time_period))['buy_sell'].aggregate('count').to_frame()
cleaned_data_ethereum['number_purchases_ethereum'] = np.where(pd.isnull(cleaned_data_ethereum['number_purchases_ethereum']),0,cleaned_data_ethereum['number_purchases_ethereum'])
cleaned_data_ethereum['number_big_purchases_ethereum'] = ethereum_data[(ethereum_data['buy_sell']=="b") & ((ethereum_data['price']*ethereum_data['volume'])>10000)].groupby(pd.Grouper(key = 'time', freq=time_period))['buy_sell'].aggregate('count').to_frame()
cleaned_data_ethereum['number_big_purchases_ethereum'] = np.where(pd.isnull(cleaned_data_ethereum['number_big_purchases_ethereum']),0,cleaned_data_ethereum['number_big_purchases_ethereum'])
cleaned_data_ethereum['number_market_ethereum'] = ethereum_data[ethereum_data['market_limit']=="m"].groupby(pd.Grouper(key = 'time', freq=time_period))['market_limit'].aggregate('count').to_frame()
cleaned_data_ethereum['number_market_ethereum'] = np.where(pd.isnull(cleaned_data_ethereum['number_market_ethereum']),0,cleaned_data_ethereum['number_market_ethereum'])
cleaned_data_ethereum['number_limit_ethereum'] = ethereum_data[ethereum_data['market_limit']=="l"].groupby(pd.Grouper(key = 'time', freq=time_period))['market_limit'].aggregate('count').to_frame()
cleaned_data_ethereum['number_limit_ethereum'] = np.where(pd.isnull(cleaned_data_ethereum['number_limit_ethereum']),0,cleaned_data_ethereum['number_limit_ethereum'])

In [None]:
###Merge data
cleaned_data = pd.merge(cleaned_data_bitcoin, cleaned_data_ethereum, how = 'inner', on = 'time')

In [None]:
#Delete unneeded objects
del(cleaned_data_bitcoin, cleaned_data_ethereum)

In [None]:
###Merge with the Google data
cleaned_data = cleaned_data.reset_index()
Google_Trends.rename(columns = {'Date':'time'}, inplace = True)
cleaned_data = pd.merge(cleaned_data, Google_Trends, how = 'inner', on = 'time')

In [None]:
###Merge with the ledger data
ledger_data_summary.rename(columns = {'Date':'time'}, inplace = True)
cleaned_data = pd.merge(cleaned_data, ledger_data_summary, how = 'inner', on = 'time')

In [None]:
###Save as cleaned data
cleaned_data.to_csv(my_filepath + 'Cleaned Data/Data_Cleaned_1Hour.csv',index=False)

## Chapter 7: Lag the dataset
Create a lagged dataset for feeding into the model
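
To make the shape of the lagged dataset concrete, here is a small sketch on a single series. Each output row holds the current value (`_L0`) and the values from the previous periods, mirroring the `_L<i>` suffix used in this chapter. The helper `build_lagged_rows` and the column stem `value` are hypothetical.

```python
def build_lagged_rows(series, number_periods):
    """Return one dict per time step holding the value at lags 0..number_periods."""
    rows = []
    # The first usable step is the one that already has number_periods values behind it
    for t in range(number_periods, len(series)):
        rows.append({"value_L" + str(lag): series[t - lag] for lag in range(number_periods + 1)})
    return rows

prices = [100, 101, 103, 102, 105]
lagged = build_lagged_rows(prices, number_periods=2)
print(lagged[0])
```

The first `number_periods` time steps are lost, which is the same effect as the inner merge on `time` in the notebook.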

In [5]:
###Read data
cleaned_data = pd.read_csv(my_filepath + 'Cleaned Data/Data_Cleaned_1Hour.csv')
cleaned_data['time'] = pd.to_datetime(cleaned_data['time'])

In [6]:
###Set the number of periods to be lagged
number_periods = 500

In [None]:
##Create lagged datasets for merge
lagged_data = []
#Loop through each of the lags and print progress occasionally
for i in range(number_periods, -1, -1):
  if (i % 10 == 0):
      print("Lag:", i)
  #Create new dataset lagged by the amount
  lagged_data_temp = cleaned_data.copy()
  lagged_data_temp['time'] = lagged_data_temp['time'] + datetime.timedelta(hours=i)
  lagged_data_temp.columns = lagged_data_temp.columns + "_L" + str(i)
  lagged_data_temp.columns = lagged_data_temp.columns.str.replace('time_L'+str(i), 'time')
  #Append to list
  lagged_data.append(lagged_data_temp)

#Merge all together
lagged_data = reduce(lambda  left,right: pd.merge(left,right,on=['time'],
                                            how='inner'), lagged_data)

Lag: 500
Lag: 490
Lag: 480
Lag: 470
Lag: 460
Lag: 450
Lag: 440
Lag: 430
Lag: 420
Lag: 410
Lag: 400
Lag: 390
Lag: 380
Lag: 370
Lag: 360
Lag: 350
Lag: 340
Lag: 330
Lag: 320
Lag: 310
Lag: 300
Lag: 290
Lag: 280
Lag: 270
Lag: 260
Lag: 250
Lag: 240
Lag: 230
Lag: 220
Lag: 210
Lag: 200
Lag: 190
Lag: 180
Lag: 170
Lag: 160
Lag: 150
Lag: 140
Lag: 130
Lag: 120
Lag: 110
Lag: 100
Lag: 90
Lag: 80
Lag: 70
Lag: 60
Lag: 50
Lag: 40
Lag: 30
Lag: 20
Lag: 10
Lag: 0


In [None]:
###Add hour of day and day of week
lagged_data['hour_of_day'] = lagged_data['time'].dt.hour
lagged_data['day_of_week'] = lagged_data['time'].dt.weekday

In [None]:
#Set time as the index
lagged_data.set_index('time', inplace = True)

In [None]:
###Save the lagged data (index=False means the time index is not written to the file)
lagged_data.to_csv(my_filepath + 'Cleaned Data/Data_Cleaned_Lagged_1Hour.csv',index=False)

## Chapter 8: Put into form for machine learning
First, decide on the length of time to look at for the outcome data (i.e. predicting changes over the next hour, 3 hours etc.). If this length is longer than one period in the data, observations need to be dropped so that the outcome windows of neighbouring rows do not overlap. To choose this, set the number of periods to drop (0 if the outcome window equals one period, 1 for double, 2 for triple etc.).
Then split into train, validation and test sets as usual.
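
As a rough sketch of why rows are dropped, the toy example below computes the percentage change over `periods_to_drop + 1` periods and then keeps every fourth row so that the outcome windows do not overlap. The price series and the names `horizon` and `kept` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

periods_to_drop = 3
horizon = periods_to_drop + 1  # the outcome window covers this many periods

# Toy data: twelve hourly prices rising by 1 each hour (assumed values)
prices = pd.Series(np.arange(100.0, 112.0))

# Percentage change between now and `horizon` periods earlier
outcome = 100 * (prices - prices.shift(horizon)) / prices.shift(horizon)

# Keep one row per window so that no two outcomes share the same hours
kept = outcome[outcome.index % horizon == 0].dropna()
print(kept)
```

With `periods_to_drop = 3`, the kept rows are four hours apart, so each outcome describes a separate four-hour window.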

In [None]:
#Choose the length of time to look at for the outcome data (0 if equal to the length of each period, 1 is double, 2 triple etc.) 
periods_to_drop = 3

In [None]:
###Read in the lagged data
lagged_data_master = pd.read_csv(my_filepath + "Cleaned Data/Data_Cleaned_Lagged_1Hour.csv")

In [None]:
###Copy the dataset
lagged_data = lagged_data_master.copy() #Do this to avoid having to reload every time

In [None]:
###Capture the number of time-varying variables
number_vars = len(lagged_data.columns[pd.Series(lagged_data.columns).str.endswith(("_L0"))])

In [None]:
###Create the outcome variable, which depends on the number of periods to drop
lagged_data["Bitcoin_percentage_change"] = 100*(lagged_data["end_price_bitcoin_L0"]-lagged_data["end_price_bitcoin_L"+str(periods_to_drop+1)])/lagged_data["end_price_bitcoin_L"+str(periods_to_drop+1)]
#Then drop all variables that we don't need (all L0 ones, plus any others if there are periods to drop)
for i in range(0, periods_to_drop+1):
  cols_to_drop = [col for col in lagged_data if col.endswith('L' + str(i))]
  lagged_data.drop(axis = 1, columns = cols_to_drop, inplace = True)

In [None]:
#Drop rows to stop outcome measure overlap
lagged_data = lagged_data[lagged_data.index % (periods_to_drop+1) == 0]

In [None]:
###Change all the predictors to be percentage difference
#For each lag, replace the level with its change from the previous period, relative to the current value (0.1 is used as the denominator where the value is 0)
for i in range(periods_to_drop+1, number_periods):
  for j in [col for col in lagged_data if col.endswith('L' + str(i))]:
    col_stub = j.replace("_L" + str(i), "")
    j_lag = col_stub + "_L" + str(i+1)
    lagged_data[j] = (lagged_data[j] - lagged_data[j_lag])/np.where(lagged_data[j] == 0, 0.1, lagged_data[j])

#Remove the final lag
cols_to_drop = [col for col in lagged_data if col.endswith('L' + str(number_periods))]
lagged_data.drop(axis = 1, columns = cols_to_drop, inplace = True)
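
The relative-change calculation above can be checked on a small example. This is a sketch under assumptions: the helper name `relative_change` and the toy `volume` columns are hypothetical, and the zero replacement of 0.1 is taken from the cell above.

```python
import numpy as np
import pandas as pd

def relative_change(current, previous, zero_fill=0.1):
    """Hypothetical helper: (current - previous) / current, with zero denominators replaced."""
    current = np.asarray(current, dtype=float)
    previous = np.asarray(previous, dtype=float)
    denominator = np.where(current == 0, zero_fill, current)
    return (current - previous) / denominator

# Toy data: one variable at lag 4 and at lag 5 (assumed values)
toy = pd.DataFrame({"volume_L4": [2.0, 0.0, 5.0], "volume_L5": [1.0, 1.0, 5.0]})
toy["volume_L4"] = relative_change(toy["volume_L4"], toy["volume_L5"])
print(toy)
```

Note that the change is divided by the current value rather than the earlier one, so a move from 1 to 2 gives 0.5 rather than 1.0.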

In [None]:
###Drop any with NA or inf
lagged_data = lagged_data.replace(-np.inf, np.nan)
lagged_data = lagged_data.replace(np.inf, np.nan)
lagged_data = lagged_data.dropna()

In [None]:
###Split into train, val and test
#First split out the test set. This should be the most recent observations
full_data_train_val, full_data_test  = np.split(lagged_data, [int(0.9*len(lagged_data))])

#Then split the remainder into train and validation
np.random.seed(seed_object+1)
rand = np.random.rand(len(full_data_train_val))
msk_train = rand < 0.8
full_data_train = full_data_train_val.iloc[msk_train]
full_data_val = full_data_train_val.iloc[~msk_train]
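
The split above can be sketched as a small function. This is an illustration only: `split_frames` is a hypothetical name, it uses NumPy's `default_rng` rather than the global seed used in the notebook, and it assumes the rows are ordered so that the final rows are the ones to hold out.

```python
import numpy as np
import pandas as pd

def split_frames(df, test_frac=0.1, val_frac=0.2, seed=0):
    """Hypothetical helper: hold out the last rows as test, then split the rest at random."""
    cut = int((1 - test_frac) * len(df))
    train_val, test = df.iloc[:cut], df.iloc[cut:]
    rng = np.random.default_rng(seed)
    is_train = rng.random(len(train_val)) >= val_frac
    # Copies avoid pandas warnings when columns are added to the pieces later
    return train_val[is_train].copy(), train_val[~is_train].copy(), test.copy()

toy = pd.DataFrame({"x": np.arange(100)})
train, val, test = split_frames(toy)
print(len(train), len(val), len(test))
```

Because the train and validation rows are drawn at random from the same stretch of time, validation scores may look better than scores on the held-out test rows.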

In [None]:
###Make train data balanced
#Create a binary outcome variable and check its balance
full_data_train['binary'] = np.where(full_data_train["Bitcoin_percentage_change"]>0, 1, 0)
print(full_data_train['binary'].value_counts())
#Drop randomly from the overweighted group (this assumes that group 1 is the larger one)
full_data_train_1 = full_data_train[full_data_train['binary'] == 1]
full_data_train_0 = full_data_train[full_data_train['binary'] == 0]
np.random.seed(seed_object)
full_data_train_1 = full_data_train_1.sample(len(full_data_train_0))
#DataFrame.append was removed in pandas 2.0, so use pd.concat
full_data_train = pd.concat([full_data_train_0, full_data_train_1])
#Check that the outcome is balanced
print(full_data_train['binary'].value_counts())
full_data_train.drop(axis = 1, columns = 'binary', inplace = True)
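
The cell above assumes that the positive class is the larger one. A minimal sketch of a version that works whichever class is larger is shown below; the helper name `balance_by_downsampling` and the toy data are assumptions.

```python
import pandas as pd

def balance_by_downsampling(df, label_col, seed=0):
    """Hypothetical helper: sample every class down to the size of the smallest class."""
    counts = df[label_col].value_counts()
    n_min = int(counts.min())
    parts = [
        df[df[label_col] == label].sample(n=n_min, random_state=seed)
        for label in counts.index
    ]
    return pd.concat(parts)

# Toy data: seven rows of class 1 and three rows of class 0
toy = pd.DataFrame({"x": range(10), "binary": [1] * 7 + [0] * 3})
balanced = balance_by_downsampling(toy, "binary")
print(balanced["binary"].value_counts())
```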

In [None]:
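#Make outcome data binary (this overwrites the percentage change with 0/1, so later steps that use Bitcoin_percentage_change see 0/1 rather than the size of the move)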
full_data_train['Bitcoin_percentage_change'] = np.where(full_data_train['Bitcoin_percentage_change']>0, 1, 0)
full_data_val['Bitcoin_percentage_change'] = np.where(full_data_val['Bitcoin_percentage_change']>0, 1, 0)
full_data_test['Bitcoin_percentage_change'] = np.where(full_data_test['Bitcoin_percentage_change']>0, 1, 0)

In [None]:
#Split out data into X and Y
y_train = full_data_train[["Bitcoin_percentage_change"]]
x_train = full_data_train.drop(axis = 1, columns =  "Bitcoin_percentage_change")
y_val = full_data_val[["Bitcoin_percentage_change"]]
x_val = full_data_val.drop(axis = 1, columns = "Bitcoin_percentage_change")
y_test = full_data_test[["Bitcoin_percentage_change"]]
x_test = full_data_test.drop(axis = 1, columns =  "Bitcoin_percentage_change")

In [None]:
#Scale the data
scaler = StandardScaler()
x_train = pd.DataFrame(scaler.fit_transform(x_train), columns = x_train.columns, index = x_train.index)
x_val = pd.DataFrame(scaler.transform(x_val), columns = x_val.columns, index = x_val.index)
x_test = pd.DataFrame(scaler.transform(x_test), columns = x_test.columns, index = x_test.index)

## Chapter 9: Run basic machine learning
Try first with an elastic net and a random forest as a quick check and to look for important variables

In [None]:
#Set up gridsearch model for elasticnet
gsc = GridSearchCV(
        estimator=LogisticRegression(penalty = 'elasticnet', solver='saga',
                                     random_state = 71166, max_iter = 150),
        param_grid={
            'C': [0.000005, 0.00005, 0.0005],
            'l1_ratio': [0, 0.3, 0.7, 1]
        },
        scoring = {'Accuracy': 'accuracy', 'Balanced_Accuracy': 'balanced_accuracy', 
                   'F1':'f1', 'AUC': 'roc_auc'}, 
        refit='AUC',
        cv=2, verbose=10, n_jobs = 2)

In [None]:
#Fit elasticnet model
model = gsc.fit(x_train, y_train.values.ravel())

In [None]:
#Show elasticnet results
pd.DataFrame(model.cv_results_)

In [None]:
#Get the most important variables for elasticnet
importance = pd.DataFrame({"Feature": x_train.columns,
                           "stdev": np.std(x_train, 0),
                           "coef": model.best_estimator_.coef_[0]})
importance['Importance'] = 1000*importance['stdev']*importance['coef']
importance.sort_values(by = 'Importance', ascending = False, inplace = True)

In [None]:
#Show the most important
importance.head(20)

In [None]:
#Set up gridsearch model for random forest
gsc = GridSearchCV(
        estimator=RandomForestClassifier(verbose = 10),
        param_grid={
            'n_estimators': [100, 150],
            'min_samples_split': [50, 200]
        },
        scoring = {'Accuracy': 'accuracy', 'Balanced_Accuracy': 'balanced_accuracy', 
                   'F1':'f1', 'AUC': 'roc_auc'}, 
        refit='AUC',
        cv=2, verbose=10, n_jobs = 2)

In [None]:
#Fit model for random forest
model = gsc.fit(x_train, y_train.values.ravel())

In [None]:
#Show results for random forest
pd.DataFrame(model.cv_results_)

## Chapter 10: Put into form for LSTM

In [None]:
#Split X vars into time varying and time invariant components and put into numpy arrays
y_train = y_train.values
x_train_static = x_train[['hour_of_day', 'day_of_week']].values
x_train_time = x_train.drop(axis = 1, columns =  ['hour_of_day', 'day_of_week']).values
y_val = y_val.values
x_val_static = x_val[['hour_of_day', 'day_of_week']].values
x_val_time = x_val.drop(axis = 1, columns = ['hour_of_day', 'day_of_week']).values
y_test = y_test.values
x_test_static = x_test[['hour_of_day', 'day_of_week']].values
x_test_time = x_test.drop(axis = 1, columns =  ['hour_of_day', 'day_of_week']).values

In [None]:
#Reshape time variant data to (observation,period,variables)
x_train_time = x_train_time.reshape((full_data_train.shape[0],number_periods-periods_to_drop-1,number_vars))
x_val_time = x_val_time.reshape((full_data_val.shape[0],number_periods-periods_to_drop-1,number_vars))
x_test_time = x_test_time.reshape((full_data_test.shape[0],number_periods-periods_to_drop-1,number_vars))
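
The reshape above relies on the columns being ordered lag by lag, with all variables for the oldest lag first. A small NumPy sketch with assumed toy sizes (2 observations, 3 periods, 2 variables) shows how a flat row maps to the 3D array:

```python
import numpy as np

n_obs, n_periods, n_vars = 2, 3, 2

# Assumed column order per row: a_L3, b_L3, a_L2, b_L2, a_L1, b_L1
flat = np.arange(n_obs * n_periods * n_vars).reshape(n_obs, n_periods * n_vars)

# Each block of n_vars consecutive columns becomes one time step
cube = flat.reshape((n_obs, n_periods, n_vars))
print(cube[0])
```

If the columns were instead grouped by variable (all lags of `a`, then all lags of `b`), the same reshape would mix variables and time steps, so the ordering is worth checking before fitting the LSTM.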

## Chapter 11: Create and fit an LSTM model
Run a model which uses LSTM layers for the time varying data and fully connected layers for the time invariant data, and which joins them at the top. Explore how well this works.

In [None]:
##Model
#First the LSTM side
input_tensor_1 = Input(shape=(number_periods-periods_to_drop-1, number_vars))
X_1 = LSTM(units = 16, return_sequences = True, kernel_initializer=keras.initializers.glorot_uniform(seed=seed_object))(input_tensor_1)
X_1 = LSTM(units = 8, return_sequences = False, kernel_initializer=keras.initializers.glorot_uniform(seed=seed_object))(X_1)

#Fully connected
input_tensor_2 = Input(shape = (2,)) 
X_2 = Dense(16, activation="relu")(input_tensor_2)
X_2 = Dropout(0.1)(X_2)

#Join together
X_3 = keras.layers.concatenate([X_1,X_2])
X_3 = Dense(128, activation="relu")(X_3)
X_3 = Dropout(0.05)(X_3)
X_3 = Dense(64, activation="relu")(X_3)
X_3 = Dropout(0.05)(X_3)
out = Dense(1, activation="sigmoid")(X_3)
model = Model([input_tensor_1, input_tensor_2], out) 

#Compile the model and create callbacks
opt = Adam(lr=0.001, beta_1=0.9, beta_2=0.9999, decay=0.00001)
#model.compile(loss='mean_squared_error', optimizer=opt, metrics=["mae"])
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
filepath = my_filepath + "Models/model_1h.h5"
#checkpoint = ModelCheckpoint(filepath, monitor='val_mae', verbose=False, 
#                             save_best_only=True, mode='min')
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=False, 
                             save_best_only=True, mode='max')
callbacks_list = [checkpoint]

#Fit the model
history = model.fit([x_train_time, x_train_static], y_train, batch_size = 512, epochs=20,
          validation_data = ([x_val_time, x_val_static], y_val), shuffle=True,
          verbose = True, callbacks=callbacks_list) #, class_weight = weights_train) #, sample_weight = weights_train)

#Print the accuracy and correlation
model_best = load_model(my_filepath + 'Models/model_1h.h5')
full_data_train['prediction'] = model_best.predict([x_train_time, x_train_static])
#full_data_train['prediction_rounded'] = np.where(full_data_train['prediction'] > 0,1,0)
full_data_train['prediction_rounded'] = np.where(full_data_train['prediction'] > 0.5,1,0)
full_data_train['actual_rounded'] = np.where(full_data_train["Bitcoin_percentage_change"] > 0,1,0)
correlation_temp_train, pvalue_temp_train = stats.pearsonr(full_data_train['prediction'], full_data_train['Bitcoin_percentage_change'])
full_data_val['prediction'] = model_best.predict([x_val_time, x_val_static])
#full_data_val['prediction_rounded'] = np.where(full_data_val['prediction'] > 0,1,0)
full_data_val['prediction_rounded'] = np.where(full_data_val['prediction'] > 0.5,1,0)
full_data_val['actual_rounded'] = np.where(full_data_val["Bitcoin_percentage_change"] > 0,1,0)
correlation_temp_val, pvalue_temp_val = stats.pearsonr(full_data_val['prediction'], full_data_val['Bitcoin_percentage_change'])
print("Accuracy train: " + str(sum(full_data_train['prediction_rounded']==full_data_train['actual_rounded'])/full_data_train.shape[0]))
print("Correlation train: " + str(correlation_temp_train))
print("Accuracy val: " + str(sum(full_data_val['prediction_rounded']==full_data_val['actual_rounded'])/full_data_val.shape[0]))
print("Correlation val: " + str(correlation_temp_val))

# Delete objects
del(input_tensor_1, X_1, out, filepath, checkpoint, callbacks_list, correlation_temp_train, pvalue_temp_train, correlation_temp_val, pvalue_temp_val)

In [None]:
#Explore the balance of predictions
full_data_train['prediction_rounded'].value_counts()

## Chapter 12: Explore results on the test set
Train the model several times and aggregate the results on the test dataframe. This is needed because we can't set a seed when using a GPU, so a single run may not be representative.

In [None]:
#Set key variables
epochs = 20
number_runs = 20
learning_rate = 0.001

In [None]:
#Try the model several times to get the outcome metrics
results_models_list = []

for j in range(number_runs):
  print(j)
  ##Model
  #First the LSTM side
  input_tensor_1 = Input(shape=(number_periods-periods_to_drop-1, number_vars))
  X_1 = LSTM(units = 16, return_sequences = True, kernel_initializer=keras.initializers.glorot_uniform(seed=seed_object))(input_tensor_1)
  X_1 = LSTM(units = 8, return_sequences = False, kernel_initializer=keras.initializers.glorot_uniform(seed=seed_object))(X_1)

  #Fully connected
  input_tensor_2 = Input(shape = (2,)) 
  X_2 = Dense(16, activation="relu")(input_tensor_2)
  X_2 = Dropout(0.1)(X_2)

  #Join together
  X_3 = keras.layers.concatenate([X_1,X_2])
  X_3 = Dense(128, activation="relu")(X_3)
  X_3 = Dropout(0.05)(X_3)
  X_3 = Dense(64, activation="relu")(X_3)
  X_3 = Dropout(0.05)(X_3)
  out = Dense(1, activation="sigmoid")(X_3)
  model = Model([input_tensor_1, input_tensor_2], out) 

  #Compile the model and create callbacks
  opt = Adam(lr=learning_rate, beta_1=0.9, beta_2=0.9999, decay=0.00001)
  #model.compile(loss='mean_squared_error', optimizer=opt, metrics=["mae"])
  model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
  filepath = my_filepath + "Models/model_1h.h5"
  #checkpoint = ModelCheckpoint(filepath, monitor='val_mae', verbose=False, 
  #                             save_best_only=True, mode='min')
  checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=False, 
                               save_best_only=True, mode='max')
  callbacks_list = [checkpoint]

  #Fit the model
  history = model.fit([x_train_time, x_train_static], y_train, batch_size = 512, epochs=epochs,
          validation_data = ([x_val_time, x_val_static], y_val), shuffle=True,
            verbose = True, callbacks=callbacks_list) #, class_weight = weights_train) #, sample_weight = weights_train)

  # Delete objects
  del(input_tensor_1, X_1, out, filepath, checkpoint, callbacks_list)

  #Read models in
  model_best = load_model(my_filepath + 'Models/model_1h.h5')

  #Evaluate model (the second value returned is accuracy, the metric the model was compiled with)
  train_loss, train_acc = model_best.evaluate([x_train_time, x_train_static], y_train, verbose = False)
  val_loss, val_acc = model_best.evaluate([x_val_time, x_val_static], y_val, verbose = False)
  test_loss, test_acc = model_best.evaluate([x_test_time, x_test_static], y_test, verbose = False)

  #Predict outcomes for the test, train and validation set
  full_data_test['prediction'] = model_best.predict([x_test_time, x_test_static])
  #full_data_test['prediction_rounded'] = np.where(full_data_test['prediction'] > 0,1,0)
  full_data_test['prediction_rounded'] = np.where(full_data_test['prediction'] > 0.5,1,0)
  full_data_test['actual_rounded'] = np.where(full_data_test["Bitcoin_percentage_change"] > 0,1,0)
  #full_data_test['prediction_rounded_2'] = np.where(full_data_test['prediction'] >= 0.01,2,np.where(full_data_test['prediction'] <= -0.01,0,1))
  full_data_test['prediction_rounded_2'] = np.where(full_data_test['prediction'] >= 0.51,2,np.where(full_data_test['prediction'] <= 0.49,0,1))
  full_data_train['prediction'] = model_best.predict([x_train_time, x_train_static])
  #full_data_train['prediction_rounded'] = np.where(full_data_train['prediction'] > 0,1,0)
  full_data_train['prediction_rounded'] = np.where(full_data_train['prediction'] > 0.5,1,0)
  full_data_train['actual_rounded'] = np.where(full_data_train["Bitcoin_percentage_change"] > 0,1,0)
  full_data_val['prediction'] = model_best.predict([x_val_time, x_val_static])
  #full_data_val['prediction_rounded'] = np.where(full_data_val['prediction'] > 0,1,0)
  full_data_val['prediction_rounded'] = np.where(full_data_val['prediction'] > 0.5,1,0)
  full_data_val['actual_rounded'] = np.where(full_data_val["Bitcoin_percentage_change"] > 0,1,0)

  #Calculate the accuracy for each
  Accuracy_temp_train = sum(full_data_train['prediction_rounded']==full_data_train['actual_rounded'])/full_data_train.shape[0]
  Accuracy_temp_val = sum(full_data_val['prediction_rounded']==full_data_val['actual_rounded'])/full_data_val.shape[0]
  Accuracy_temp_test = sum(full_data_test['prediction_rounded']==full_data_test['actual_rounded'])/full_data_test.shape[0]
  try:
    Accuracy_temp_test_m05 = sum(full_data_test[full_data_test['Bitcoin_percentage_change'] < -0.5]['prediction_rounded']==full_data_test[full_data_test['Bitcoin_percentage_change'] < -0.5]['actual_rounded'])/full_data_test[full_data_test['Bitcoin_percentage_change'] < -0.5].shape[0]
  except ZeroDivisionError: 
    Accuracy_temp_test_m05 = 0
  try:
    Accuracy_temp_test_m025 = sum(full_data_test[(full_data_test['Bitcoin_percentage_change'] < -0.25) & (full_data_test['Bitcoin_percentage_change'] > -0.5)]['prediction_rounded']==full_data_test[(full_data_test['Bitcoin_percentage_change'] < -0.25) & (full_data_test['Bitcoin_percentage_change'] > -0.5)]['actual_rounded'])/full_data_test[(full_data_test['Bitcoin_percentage_change'] < -0.25) & (full_data_test['Bitcoin_percentage_change'] > -0.5)].shape[0]
  except ZeroDivisionError:
    Accuracy_temp_test_m025 = 0
  try:
    Accuracy_temp_test_m01 = sum(full_data_test[(full_data_test['Bitcoin_percentage_change'] < -0.1) & (full_data_test['Bitcoin_percentage_change'] > -0.25)]['prediction_rounded']==full_data_test[(full_data_test['Bitcoin_percentage_change'] < -0.1) & (full_data_test['Bitcoin_percentage_change'] > -0.25)]['actual_rounded'])/full_data_test[(full_data_test['Bitcoin_percentage_change'] < -0.1) & (full_data_test['Bitcoin_percentage_change'] > -0.25)].shape[0]
  except ZeroDivisionError:
    Accuracy_temp_test_m01 = 0
  try:
    Accuracy_temp_test_m005 = sum(full_data_test[(full_data_test['Bitcoin_percentage_change'] < -0.05) & (full_data_test['Bitcoin_percentage_change'] > -0.1)]['prediction_rounded']==full_data_test[(full_data_test['Bitcoin_percentage_change'] < -0.05) & (full_data_test['Bitcoin_percentage_change'] > -0.1)]['actual_rounded'])/full_data_test[(full_data_test['Bitcoin_percentage_change'] < -0.05) & (full_data_test['Bitcoin_percentage_change'] > -0.1)].shape[0]
  except ZeroDivisionError:
    Accuracy_temp_test_m005 = 0
  try:
    Accuracy_temp_test_m0 = sum(full_data_test[(full_data_test['Bitcoin_percentage_change'] < 0) & (full_data_test['Bitcoin_percentage_change'] > -0.05)]['prediction_rounded']==full_data_test[(full_data_test['Bitcoin_percentage_change'] < 0) & (full_data_test['Bitcoin_percentage_change'] > -0.05)]['actual_rounded'])/full_data_test[(full_data_test['Bitcoin_percentage_change'] < 0) & (full_data_test['Bitcoin_percentage_change'] > -0.05)].shape[0]
  except ZeroDivisionError:
    Accuracy_temp_test_m0 = 0
  try:
    Accuracy_temp_test_0 = sum(full_data_test[(full_data_test['Bitcoin_percentage_change'] >= 0) & (full_data_test['Bitcoin_percentage_change'] < 0.05)]['prediction_rounded']==full_data_test[(full_data_test['Bitcoin_percentage_change'] >= 0) & (full_data_test['Bitcoin_percentage_change'] < 0.05)]['actual_rounded'])/full_data_test[(full_data_test['Bitcoin_percentage_change'] >= 0) & (full_data_test['Bitcoin_percentage_change'] < 0.05)].shape[0]
  except ZeroDivisionError:
    Accuracy_temp_test_0 = 0
  try:
    Accuracy_temp_test_005 = sum(full_data_test[(full_data_test['Bitcoin_percentage_change'] > 0.05) & (full_data_test['Bitcoin_percentage_change'] < 0.1)]['prediction_rounded']==full_data_test[(full_data_test['Bitcoin_percentage_change'] > 0.05) & (full_data_test['Bitcoin_percentage_change'] < 0.1)]['actual_rounded'])/full_data_test[(full_data_test['Bitcoin_percentage_change'] > 0.05) & (full_data_test['Bitcoin_percentage_change'] < 0.1)].shape[0]
  except ZeroDivisionError:
    Accuracy_temp_test_005 = 0
  try:
    Accuracy_temp_test_01 = sum(full_data_test[(full_data_test['Bitcoin_percentage_change'] > 0.1) & (full_data_test['Bitcoin_percentage_change'] < 0.25)]['prediction_rounded']==full_data_test[(full_data_test['Bitcoin_percentage_change'] > 0.1) & (full_data_test['Bitcoin_percentage_change'] < 0.25)]['actual_rounded'])/full_data_test[(full_data_test['Bitcoin_percentage_change'] > 0.1) & (full_data_test['Bitcoin_percentage_change'] < 0.25)].shape[0]
  except ZeroDivisionError:
    Accuracy_temp_test_01 = 0
  try:
    Accuracy_temp_test_025 = sum(full_data_test[(full_data_test['Bitcoin_percentage_change'] > 0.25) & (full_data_test['Bitcoin_percentage_change'] < 0.5)]['prediction_rounded']==full_data_test[(full_data_test['Bitcoin_percentage_change'] > 0.25) & (full_data_test['Bitcoin_percentage_change'] < 0.5)]['actual_rounded'])/full_data_test[(full_data_test['Bitcoin_percentage_change'] > 0.25) & (full_data_test['Bitcoin_percentage_change'] < 0.5)].shape[0]
  except ZeroDivisionError:
    Accuracy_temp_test_025 = 0
  try:
    Accuracy_temp_test_05 = sum(full_data_test[full_data_test['Bitcoin_percentage_change'] > 0.5]['prediction_rounded']==full_data_test[full_data_test['Bitcoin_percentage_change'] > 0.5]['actual_rounded'])/full_data_test[full_data_test['Bitcoin_percentage_change'] > 0.5].shape[0]
  except ZeroDivisionError:
    Accuracy_temp_test_05 = 0
  correlation_temp, pvalue_temp = stats.pearsonr(full_data_test['prediction'], full_data_test['Bitcoin_percentage_change'])

  ### Calculate what would happen to $100 invested at the start (no transaction fee) with basic strategy
  #Set start date (bottom) with 100
  full_data_test['invested_basic'] = 0
  full_data_test['invested_basic'].iloc[-1] = 100
  #Loop upwards, calculating the next row each time
  for i in range(full_data_test.shape[0]-1):
    #Calculate the new value
    row_number_temp = full_data_test.shape[0]-2-i
    full_data_test['invested_basic'].iloc[row_number_temp] = full_data_test['invested_basic'].iloc[row_number_temp+1] * np.where(full_data_test['prediction_rounded'].iloc[row_number_temp] == 1, 1 + full_data_test['Bitcoin_percentage_change'].iloc[row_number_temp]/100, 1)

  ### Calculate what would happen to $100 invested at the start (with a transaction fee) with basic strategy
  #Set start date (bottom) with 100
  full_data_test['invested_fee_basic'] = 0
  full_data_test['invested_fee_basic'].iloc[-1] = 100
  #Loop upwards, calculating the next row each time
  for i in range(full_data_test.shape[0]-1):
    #Calculate the new value
    row_number_temp = full_data_test.shape[0]-2-i
    full_data_test['invested_fee_basic'].iloc[row_number_temp] = full_data_test['invested_fee_basic'].iloc[row_number_temp+1] * np.where(full_data_test['prediction_rounded'].iloc[row_number_temp] == 1, 1 + full_data_test['Bitcoin_percentage_change'].iloc[row_number_temp]/100, 1)
    #Minus the fee if there was a change
    full_data_test['invested_fee_basic'].iloc[row_number_temp] = full_data_test['invested_fee_basic'].iloc[row_number_temp] - np.where((full_data_test['prediction_rounded'].iloc[row_number_temp] == 1) & (full_data_test['prediction_rounded'].iloc[row_number_temp+1] == 0), full_data_test['invested_fee_basic'].iloc[row_number_temp+1] * 0.0021, 0)
    full_data_test['invested_fee_basic'].iloc[row_number_temp] = full_data_test['invested_fee_basic'].iloc[row_number_temp] - np.where((full_data_test['prediction_rounded'].iloc[row_number_temp] == 0) & (full_data_test['prediction_rounded'].iloc[row_number_temp+1] == 1), full_data_test['invested_fee_basic'].iloc[row_number_temp+1] * 0.0021, 0)

  ### Calculate what would happen to $100 invested at the start (with a transaction fee) with complex strategy
  #Set start date (bottom) with 100
  full_data_test['invested_fee_complex'] = 0
  full_data_test['invested_fee_complex'].iloc[-1] = 100
  #Loop upwards, calculating the next row each time
  for i in range(full_data_test.shape[0]-1):
    row_number_temp = full_data_test.shape[0]-2-i
    #Calculate the new value
    if i == 0:
      #Start conservatively. Only buy if the prediction is buy (2). Otherwise stay out (and change prediction to 0)
      full_data_test['invested_fee_complex'].iloc[row_number_temp] = full_data_test['invested_fee_complex'].iloc[row_number_temp+1] * np.where(full_data_test['prediction_rounded_2'].iloc[row_number_temp] == 2, 1 + full_data_test['Bitcoin_percentage_change'].iloc[row_number_temp]/100, 1)
      full_data_test['prediction_rounded_2'].iloc[row_number_temp]  = np.where(full_data_test['prediction_rounded_2'].iloc[row_number_temp] in [0,1], 0, 2)
    else:
      #For holding previously
      if full_data_test['prediction_rounded_2'].iloc[row_number_temp+1] == 2:
        #If 1 or 2, then keep (and change prediction to 2). If 0, then sell (and keep prediction as 0)
        full_data_test['invested_fee_complex'].iloc[row_number_temp] = full_data_test['invested_fee_complex'].iloc[row_number_temp+1] * np.where(full_data_test['prediction_rounded_2'].iloc[row_number_temp] in [1,2], 1 + full_data_test['Bitcoin_percentage_change'].iloc[row_number_temp]/100, 1)
        full_data_test['prediction_rounded_2'].iloc[row_number_temp]  = np.where(full_data_test['prediction_rounded_2'].iloc[row_number_temp] in [1,2], 2, 0)
        #If 0, then minus fees
        full_data_test['invested_fee_complex'].iloc[row_number_temp] = full_data_test['invested_fee_complex'].iloc[row_number_temp] - np.where(full_data_test['prediction_rounded_2'].iloc[row_number_temp] == 0, full_data_test['invested_fee_complex'].iloc[row_number_temp] * 0.0021, 0)
      #For not holding previously
      if full_data_test['prediction_rounded_2'].iloc[row_number_temp+1] == 0:
        #If 0 or 1, then don't buy (and change prediction to 0). If 2, then buy (and keep prediction as 2)
        full_data_test['invested_fee_complex'].iloc[row_number_temp] = full_data_test['invested_fee_complex'].iloc[row_number_temp+1] * np.where(full_data_test['prediction_rounded_2'].iloc[row_number_temp] == 2, 1 + full_data_test['Bitcoin_percentage_change'].iloc[row_number_temp]/100, 1)
        full_data_test['prediction_rounded_2'].iloc[row_number_temp]  = np.where(full_data_test['prediction_rounded_2'].iloc[row_number_temp] in [0,1], 0, 2)
        #If 2, then minus fees
        full_data_test['invested_fee_complex'].iloc[row_number_temp] = full_data_test['invested_fee_complex'].iloc[row_number_temp] - np.where(full_data_test['prediction_rounded_2'].iloc[row_number_temp] == 2, full_data_test['invested_fee_complex'].iloc[row_number_temp] * 0.0021, 0)

  ### Compare to simply leaving the investment as is
  #Set start date (bottom) with 100
  full_data_test['invested_simple'] = 0
  full_data_test['invested_simple'].iloc[-1] = 100
  #Loop upwards, calculating the next row each time
  for i in range(full_data_test.shape[0]-1):
    #Calculate the new value
    row_number_temp = full_data_test.shape[0]-2-i
    full_data_test['invested_simple'].iloc[row_number_temp] = full_data_test['invested_simple'].iloc[row_number_temp+1] * (1 + (full_data_test['Bitcoin_percentage_change'].iloc[row_number_temp])/100)

  #Calculate final values
  simple_temp = full_data_test['invested_simple'].iloc[0]
  basic_temp = full_data_test['invested_basic'].iloc[0]
  basic_fee_temp = full_data_test['invested_fee_basic'].iloc[0]
  complex_fee_temp = full_data_test['invested_fee_complex'].iloc[0]

  #Add items to list of results
  results_models_list.append([correlation_temp, pvalue_temp, Accuracy_temp_train, Accuracy_temp_val, Accuracy_temp_test, Accuracy_temp_test_m05, Accuracy_temp_test_m025, Accuracy_temp_test_m01, Accuracy_temp_test_m005, Accuracy_temp_test_m0, Accuracy_temp_test_0, Accuracy_temp_test_005, Accuracy_temp_test_01, Accuracy_temp_test_025, Accuracy_temp_test_05, simple_temp, basic_temp, basic_fee_temp, complex_fee_temp])

results_models = pd.DataFrame(results_models_list, columns=["Correlation", "P-value", "Accuracy_train", "Accuracy_val", "Accuracy_test", "Accuracy_test_m05", "Accuracy_test_m025", "Accuracy_test_m01", "Accuracy_test_m005", "Accuracy_test_m0", "Accuracy_test_0", "Accuracy_test_005", "Accuracy_test_01", "Accuracy_test_025", "Accuracy_test_05","Simple", "Basic_No_Fee", "Basic_Fee", "Complex_Fee"])
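
The basic strategy with a fee can also be written without a row-by-row loop. The sketch below is an assumption-laden illustration, not the notebook's code: `simulate_basic` is a hypothetical name, the inputs are assumed to be ordered from oldest to newest (the reverse of the bottom-up loops above), and `returns_pct` is assumed to hold the percentage change itself rather than a 0/1 label.

```python
import numpy as np

def simulate_basic(returns_pct, signal, fee=0.0021, start=100.0):
    """Hypothetical helper: value over time when holding only while signal == 1."""
    returns_pct = np.asarray(returns_pct, dtype=float)
    signal = np.asarray(signal, dtype=int)
    # Grow with the market when holding, otherwise stay flat
    growth = np.where(signal == 1, 1 + returns_pct / 100, 1.0)
    # A fee (as a share of the previous value) is paid whenever the position changes
    changed = np.concatenate([[False], signal[1:] != signal[:-1]])
    factor = growth - fee * changed
    factor[0] = 1.0  # the first row is the starting point
    return start * np.cumprod(factor)

no_fee = simulate_basic([0, 10, -5, 20], [1, 1, 0, 1], fee=0.0)
with_fee = simulate_basic([0, 10, -5, 20], [1, 1, 0, 1], fee=0.01)
print(no_fee)
print(with_fee)
```

Because the fee in the loop version is a fraction of the previous value, each step is a single multiplication, which is what allows the cumulative product to be used here.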

In [None]:
#Print aggregated results
print("Train accuracy: " + str(results_models['Accuracy_train'].agg('mean')))
print("Validation accuracy: " + str(results_models['Accuracy_val'].agg('mean')))
print("Test accuracy: " + str(results_models['Accuracy_test'].agg('mean')))
print("Correlation coefficient: " + str(results_models['Correlation'].agg('mean')))
print("Share of runs with correlation coefficient >= 0: " + str(results_models[results_models['Correlation'] >=0].shape[0]/results_models.shape[0]))
print("Correlation p-value: " + str(results_models['P-value'].agg('mean')))
#Bucket labels match the thresholds used when the accuracies were calculated
print("Accuracy test < -0.5: " + str(results_models['Accuracy_test_m05'].agg('mean')))
print("Accuracy test -0.5 to -0.25: " + str(results_models['Accuracy_test_m025'].agg('mean')))
print("Accuracy test -0.25 to -0.1: " + str(results_models['Accuracy_test_m01'].agg('mean')))
print("Accuracy test -0.1 to -0.05: " + str(results_models['Accuracy_test_m005'].agg('mean')))
print("Accuracy test -0.05 to 0: " + str(results_models['Accuracy_test_m0'].agg('mean')))
print("Accuracy test 0 to 0.05: " + str(results_models['Accuracy_test_0'].agg('mean')))
print("Accuracy test 0.05 to 0.1: " + str(results_models['Accuracy_test_005'].agg('mean')))
print("Accuracy test 0.1 to 0.25: " + str(results_models['Accuracy_test_01'].agg('mean')))
print("Accuracy test 0.25 to 0.5: " + str(results_models['Accuracy_test_025'].agg('mean')))
print("Accuracy test > 0.5: " + str(results_models['Accuracy_test_05'].agg('mean')))
print("Simple end: " + str(results_models['Simple'].agg('mean')))
print("Basic no fee end: " + str(results_models['Basic_No_Fee'].agg('mean')))
print("Basic fee end: " + str(results_models['Basic_Fee'].agg('mean')))
print("Complex fee end: " + str(results_models['Complex_Fee'].agg('mean')))
print("Basic no fee beats simple: " + str(100*sum(results_models['Basic_No_Fee']>results_models['Simple'])/results_models.shape[0]))
print("Basic fee beats simple: " + str(100*sum(results_models['Basic_Fee']>results_models['Simple'])/results_models.shape[0]))
print("Complex fee beats simple: " + str(100*sum(results_models['Complex_Fee']>results_models['Simple'])/results_models.shape[0]))
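
The per-bucket accuracies could also be computed in one step with `pd.cut`. This is a sketch under assumptions: `accuracy_by_bucket` is a hypothetical helper, the bucket edges are taken from the thresholds used in the loop above, and the input is assumed to be the percentage change itself rather than a 0/1 label.

```python
import numpy as np
import pandas as pd

def accuracy_by_bucket(change_pct, predicted_up,
                       edges=(-0.5, -0.25, -0.1, -0.05, 0, 0.05, 0.1, 0.25, 0.5)):
    """Hypothetical helper: share of correct up/down calls within each size bucket."""
    change_pct = pd.Series(change_pct, dtype=float).reset_index(drop=True)
    predicted_up = pd.Series(predicted_up).reset_index(drop=True).astype(int)
    actual_up = (change_pct > 0).astype(int)
    # right=False gives buckets of the form [low, high), so 0 falls in [0, 0.05)
    bucket = pd.cut(change_pct, bins=[-np.inf, *edges, np.inf], right=False)
    correct = predicted_up == actual_up
    # Empty buckets come back as NaN rather than 0
    return correct.groupby(bucket, observed=False).mean()

bucket_acc = accuracy_by_bucket([-1.0, -0.3, 0.02, 0.02, 0.7], [0, 1, 1, 0, 1])
print(bucket_acc)
```

Returning NaN for empty buckets keeps them out of the averages across runs, whereas a value of 0 would pull the mean accuracy down.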

In [None]:
#Save all results for exploring
results_models.to_csv(my_filepath + 'Results/Repeat_Model_Results_higher.csv',index=False)

In [None]:
#Save the test data, keeping only the columns for the last couple of lagged periods plus the outcome, prediction and investment columns
full_data_test.iloc[:,number_vars*(number_periods-periods_to_drop-3):].to_csv(my_filepath + 'Results/Test_Date.csv',index=False)

## Chapter 13: Try copying accounts
Explore what happens if you copy some of the more active whales. Work in progress

In [None]:
###Read in data
cleaned_data = pd.read_csv(my_filepath + 'Cleaned Data/Data_Cleaned_1Hour.csv', index_col = 'time')
ledger_data = pd.read_csv(my_filepath + 'Raw Data/Ledger_Data_All.csv', index_col = 'Date')
account_names = pd.read_csv(my_filepath + 'Raw Data/Accounts.csv')['address']

In [None]:
###Create list of relevant accounts, and the start and end rows for the exercise
#accounts = [19, 21, 25, 37, 44, 81, 90]
accounts = list(range(ledger_data.shape[1]))
start = 0
end = 100000

In [None]:
#Keep only the end price for bitcoin
cleaned_data = pd.DataFrame(cleaned_data['end_price_bitcoin'])

In [None]:
#Keep only the ledger data with the relevant numbers 
ledger_data = ledger_data.iloc[:, accounts]

In [None]:
#Merge the datasets
ledger_data.index.names = ['time']
full_data = pd.merge(cleaned_data, ledger_data, on = "time", how = 'inner')

In [None]:
#Create dataframe with the account name and the first id number (i.e. first trade)
summary = pd.DataFrame({"First_id": full_data.reset_index(drop = True).ne(0).idxmax()}).reset_index(drop = True)
summary2 = pd.DataFrame({"Account_name": ['No account']})
summary3 = pd.DataFrame({"Account_name": account_names[accounts]})
summary2 = pd.concat([summary2, summary3]).reset_index(drop = True) #DataFrame.append was removed in pandas 2.0
summary = pd.merge(summary2, summary, left_index= True, right_index = True, how = 'inner')

In [None]:
#Calculate percentage change, and shift price up one to match with previous period's transactions
full_data = full_data.pct_change()
full_data['end_price_bitcoin'] = full_data['end_price_bitcoin'].shift(-1)
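
To make the alignment concrete, here is a small sketch on toy data (the values are invented; only the column naming follows this notebook). After `shift(-1)`, row t holds the price change from t to t+1, so it sits next to the balance change observed in row t.

```python
import pandas as pd

# Toy data (hypothetical values): a price series and one account balance
toy = pd.DataFrame({
    "end_price_bitcoin": [100.0, 110.0, 99.0, 99.0],
    "balance_1": [5.0, 5.0, 6.0, 3.0],
})

changes = toy.pct_change()
# Move each price change up one row: row t now holds the return from t to t+1
changes["end_price_bitcoin"] = changes["end_price_bitcoin"].shift(-1)

print(changes)
```

The last row of the price column becomes NaN because there is no following period to compare against.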

In [None]:
#Make each of the percentage changes in balance -1, 0 or 1
#np.sign gives -1, 0 or 1; missing values (e.g. the first row after pct_change) are set to 0
balance_columns = full_data.columns != 'end_price_bitcoin'
full_data.loc[:, balance_columns] = np.sign(full_data.loc[:, balance_columns]).fillna(0)

In [None]:
#Filter by the start and end time
full_data = full_data.iloc[start:end]

In [None]:
#Add 100 USD at the start
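#Use floats so later fractional amounts fit, and assign by position to avoid chained assignment
full_data['Amount_Basic'] = 0.0
full_data.iloc[0, full_data.columns.get_loc('Amount_Basic')] = 100
for i in range(len(accounts)):
  full_data['Amount_Copying_' + str(accounts[i])] = 0.0
  full_data.iloc[0, full_data.columns.get_loc('Amount_Copying_' + str(accounts[i]))] = 100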

In [None]:
#Calculate the basic amount over time
#Loop upwards, calculating the next row each time
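#Look up column positions once, then assign with a single .iloc call (avoids chained assignment)
price_col = full_data.columns.get_loc('end_price_bitcoin')
basic_col = full_data.columns.get_loc('Amount_Basic')
for i in range(full_data.shape[0]-1):
  #Calculate the new value
  row_number_temp = i + 1
  full_data.iloc[row_number_temp, basic_col] = full_data.iloc[row_number_temp-1, basic_col] * (1 + full_data.iloc[row_number_temp-1, price_col])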
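
As a cross-check on the loop above, the buy-and-hold value can also be computed without a loop. This is a sketch, not the notebook's method: `buy_and_hold` is a hypothetical helper, and it assumes `returns[t]` is the price change from row t to row t+1, as produced by the `shift(-1)` step.

```python
import numpy as np
import pandas as pd

def buy_and_hold(returns, initial=100.0):
    """Compound `initial` through each period's return.

    returns[t] is the change from row t to row t+1, so it first
    affects the value in row t+1 (hence the shift by one row).
    """
    growth = (1 + returns).shift(1).fillna(1.0)
    return initial * growth.cumprod()

# Toy returns (hypothetical); the final NaN mirrors the last row after shift(-1)
returns = pd.Series([0.10, -0.10, 0.0, np.nan])
values = buy_and_hold(returns)
print(values)
```

One difference to be aware of: `fillna(1.0)` treats a missing return in the middle of the series as no change, whereas the loop would carry the NaN forward into every later row.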

In [None]:
#Calculate the copying amounts over time
#Loop upwards, calculating the next row each time
#Assign by position with a single .iloc call to avoid chained assignment
price_col = full_data.columns.get_loc('end_price_bitcoin')
for j in range(len(accounts)):
  if j % 5 == 0:
    print(j)
  amount_col = full_data.columns.get_loc('Amount_Copying_' + str(accounts[j]))
  balance_col = full_data.columns.get_loc('balance_' + str(accounts[j]+1))
  for i in range(full_data.shape[0]-1):
    row_number_temp = i + 1
    signal = full_data.iloc[row_number_temp, balance_col]
    if i == 0 or full_data.iloc[row_number_temp-1, balance_col] == 1:
      #Start by having bitcoin. If holding: 0 or 1 means keep, -1 means sell
      hold = signal in [0,1]
    else:
      #If not holding: -1 or 0 means don't buy, 1 means buy
      hold = signal == 1
    #Apply the price change only while holding, then store the position (1 = holding, -1 = not holding)
    growth = 1 + full_data.iloc[row_number_temp-1, price_col] if hold else 1
    full_data.iloc[row_number_temp, amount_col] = full_data.iloc[row_number_temp-1, amount_col] * growth
    full_data.iloc[row_number_temp, balance_col] = 1 if hold else -1

0
5
10
...
590
595
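
The copying rule in the loop above can be written as a small state machine. The sketch below is a simplified stand-alone version on toy inputs; `copy_account` is a hypothetical helper, and it assumes signals are already coded as -1, 0 or 1 and that `returns[t]` is the price change from row t to row t+1.

```python
import numpy as np

def copy_account(signals, returns, initial=100.0):
    """Follow one account's buy/sell signals.

    Start out holding bitcoin. While holding, only a -1 signal sells.
    While not holding, only a +1 signal buys back in. The previous
    row's return is applied only in rows where the position is held.
    """
    n = len(signals)
    amounts = np.empty(n)
    amounts[0] = initial
    holding = True
    for t in range(1, n):
        if holding:
            holding = signals[t] != -1
        else:
            holding = signals[t] == 1
        growth = 1 + returns[t - 1] if holding else 1.0
        amounts[t] = amounts[t - 1] * growth
    return amounts

# Toy inputs (hypothetical): sell at row 2, buy back at row 4
signals = [0, 0, -1, 0, 1, 0]
returns = [0.1, 0.1, -0.5, 0.2, 0.1, 0.0]
copied = copy_account(signals, returns)

# An account that never shows a decrease simply holds throughout
never_sold = copy_account([0] * 6, returns)
print(copied[-1], never_sold[-1])
```

Note that an account with no sell signals reproduces the buy-and-hold result exactly, which is worth keeping in mind when comparing final amounts.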


In [None]:
###Save details in a dataframe, looking at starting at different points in time
#Create list of times
times = list(range(0,41000,250))

#Loop through these, summarise, and merge with summary data
amount_data = full_data.loc[:, full_data.columns.str.startswith('Amount')]
for i in range(len(times)):
  full_data_temp = amount_data.iloc[times[i]:].reset_index(drop = True)
  #Final amount as a percentage of the amount at the chosen start row
  final_pct = 100*full_data_temp.iloc[-1]/full_data_temp.iloc[0]
  summary_2 = pd.DataFrame({"Start_" + str(times[i]): final_pct}).reset_index(drop=True)
  summary = pd.merge(summary, summary_2, left_index = True, right_index = True, how = 'inner')

In [None]:
###Do the same, but over fixed windows of 250 rows (from each start row to the matching end row)
#Create list of times
times_start = list(range(0,41000,250))
times_end = list(range(250,41250,250))

#Loop through these, summarise, and merge with summary data
amount_data = full_data.loc[:, full_data.columns.str.startswith('Amount')]
for i in range(len(times_start)):
  full_data_temp = amount_data.iloc[times_start[i]:times_end[i]].reset_index(drop = True)
  #Final amount in the window as a percentage of the amount at the window start
  final_pct = 100*full_data_temp.iloc[-1]/full_data_temp.iloc[0]
  summary_2 = pd.DataFrame({"Start_End_" + str(times_start[i]): final_pct}).reset_index(drop=True)
  summary = pd.merge(summary, summary_2, left_index = True, right_index = True, how = 'inner')
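
Both summary loops compute the same quantity: the last amount in a slice as a percentage of the first. A possible way to share that logic is sketched below. `window_results` is a hypothetical helper, and skipping windows with fewer than two rows is an added assumption: with the hard-coded range up to 41000, a start row beyond the end of the data would otherwise produce an empty slice and an `IndexError` on `.iloc[-1]`.

```python
import pandas as pd

def window_results(amounts, starts, length=None):
    """Final value as a percentage of the first value for each start row.

    length=None runs from the start row to the end of the data;
    otherwise each window covers `length` rows.
    """
    out = {}
    for s in starts:
        stop = None if length is None else s + length
        window = amounts.iloc[s:stop]
        if len(window) < 2:
            # Assumption: skip starts that fall outside the data
            continue
        out["Start_" + str(s)] = 100 * window.iloc[-1] / window.iloc[0]
    return pd.DataFrame(out)

# Toy amounts (hypothetical values), named like the notebook's columns
amounts = pd.DataFrame({
    "Amount_Basic": [100.0, 110.0, 121.0, 133.1],
    "Amount_Copying_0": [100.0, 100.0, 110.0, 110.0],
})

to_end = window_results(amounts, [0, 2, 10])
fixed = window_results(amounts, [0, 2], length=2)
print(to_end)
print(fixed)
```

The result has one row per amount column and one column per start row, which matches the shape merged into `summary` above.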

In [None]:
### Add the account ID
summary3 = pd.DataFrame({"Account_ID": [-99]})
summary2 = pd.DataFrame({"Account_ID": accounts})
#DataFrame.append was removed in pandas 2.0; pd.concat with drop = True also avoids the extra "index" column
summary2 = pd.concat([summary3, summary2]).reset_index(drop = True)
summary = pd.merge(summary2, summary, left_index = True, right_index = True, how = 'inner')

In [None]:
#print results for total
print("Leaving in: " + str(full_data['Amount_Basic'].iloc[-1]))
for j in range(len(accounts)):
  print("Account " + str(accounts[j]) + ": " + str(full_data['Amount_Copying_' + str(accounts[j])].iloc[-1]))

Leaving in: 3207.723450638447
Account 0: 4113.546703758477
Account 1: 2997.9721266843517
Account 2: 3207.723450638447
Account 3: 3207.723450638447
Account 4: 3207.723450638447
Account 5: 3207.723450638447
Account 6: 3207.723450638447
Account 7: 3384.941610727739
Account 8: 3230.0616880377443
Account 9: 2243.6574531718265
Account 10: 3207.723450638447
Account 11: 2996.225820093803
Account 12: 3207.723450638447
Account 13: 3449.9390991006594
Account 14: 2297.756278340381
Account 15: 3506.166554982679
Account 16: 3207.723450638447
Account 17: 3207.723450638447
Account 18: 3207.723450638447
Account 19: 566.4775590098149
Account 20: 3207.723450638447
Account 21: 3580.8196698293696
Account 22: 3207.723450638447
Account 23: 3207.723450638447
Account 24: 3207.723450638447
Account 25: 4155.311767434355
Account 26: 3853.287964275811
Account 27: 2899.823614226554
Account 28: 3207.723450638447
Account 29: 3207.723450638447
Account 30: 946.815380451089
Account 31: 3207.723450638447
Account 32: 1035
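
Many accounts in the output above finish on exactly the "Leaving in" amount. Under the copying rule, an account that never produces a sell signal holds bitcoin throughout, so it matches buy-and-hold. The sketch below labels accounts relative to that baseline; `compare_to_baseline` is a hypothetical helper, and the four values are copied from the printed output purely as an example.

```python
import math

def compare_to_baseline(results, baseline, rel_tol=1e-9):
    """Label each account as 'same', 'better' or 'worse' than the baseline."""
    labels = {}
    for account, value in results.items():
        if math.isclose(value, baseline, rel_tol=rel_tol):
            labels[account] = "same"
        elif value > baseline:
            labels[account] = "better"
        else:
            labels[account] = "worse"
    return labels

# A few values taken from the printed output above
baseline = 3207.723450638447
results = {
    0: 4113.546703758477,
    1: 2997.9721266843517,
    2: 3207.723450638447,
    19: 566.4775590098149,
}

labels = compare_to_baseline(results, baseline)
print(labels)
```

Filtering out the "same" accounts before comparing would leave only the accounts whose signals actually changed the outcome.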

In [None]:
#Plot this
#plotting_data = full_data[full_data.columns[pd.Series(full_data.columns).str.startswith('Amount')]].reset_index(drop = True).reset_index()
#plotting_data = plotting_data.iloc[:]
#plotting_data = pd.melt(plotting_data, ['index'])
#ax = sns.lineplot(x = 'index', y="value", hue = 'variable', data=plotting_data, ci = None)

In [None]:
#Save all results for exploring
summary.to_csv(my_filepath + 'Results/Results from copying.csv',index=False)