* Wrangle your data. Get it into the notebook in the best form possible for your analysis and model building.

* Explore your data. Make visualizations and conduct statistical analyses to explain what’s happening with your data, why it’s interesting, and what features you intend to take advantage of for your modeling.

* Build a modeling pipeline. Your model should be built in a coherent pipeline of linked stages that is efficient and easy to implement.

* Evaluate your models. You should have built multiple models, which you should thoroughly evaluate and compare via a robust analysis of residuals and failures.

* Present and thoroughly explain your product. Describe your model in detail: why you chose it, why it works, what problem it solves, and how it will run in a production-like environment. What would you need to do to maintain it going forward?

In [None]:
import pandas as pd
import numpy as np
import sys
import time
import random
import matplotlib.pyplot as plt
import ccxt
import os
import statistics
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
%matplotlib inline

Historical market cap was downloaded from https://coin.dance/stats/marketcaphistorical.

In [None]:
file = '../data/historical market cap.csv'
df = pd.read_csv(file)

# Convert date column to epoch time
# (renames the first column, which is assumed to hold the dates, to 'date')
df.rename(columns={df.columns[0]: 'date'}, inplace=True)
dates = [int(time.mktime(time.strptime(day, '%m/%d/%Y'))) for day in df['date']]
df['date'] = dates

# Total market cap
df['Total Market Cap'] = df['Altcoin Market Cap'] + df['Bitcoin Market Cap']

# Save file
df.set_index('date', drop=True, inplace=True)
df.to_csv('../data/historical market cap.csv')
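
The conversion above uses `time.mktime`, which interprets the parsed date in the local time zone, while a later cell formats the timestamps back with `time.gmtime` (UTC). A minimal sketch of a UTC-consistent conversion follows; `to_epoch_utc` is a hypothetical helper name, not something defined in this notebook.

```python
import calendar
import time


def to_epoch_utc(day, fmt='%m/%d/%Y'):
    """Convert a date string to seconds since the Unix epoch, treating it as UTC."""
    return calendar.timegm(time.strptime(day, fmt))


epoch = to_epoch_utc('01/02/2018')

# Formatting with gmtime gives back the same calendar day
round_trip = time.strftime('%m/%d/%Y', time.gmtime(epoch))
print(epoch, round_trip)
```

Because both directions use UTC, the round trip does not depend on the time zone of the machine running the notebook.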

Historical market prices of coins

HODL simulation function (returns HODL simulation dataframe)

In [None]:
def simulate_HODL():
    simulations = pd.DataFrame(index=sim_dates)

    for sim_num in range(1000):
        # Randomly select basket of coins (indices into the full list of coins)
        random_list = random.sample(range(len(coins)), num_coins)

        # Determine amount of each coin bought on day 0
        coin_amts = amt_each / historical_prices[0, random_list]

        # Use coins as column name
        col = '-'.join([coins[i] for i in random_list])

        # Dot multiply list of coin amounts with array of historical prices of selected coins
        simulations[col] = historical_prices[:, random_list].dot(coin_amts)

    simulations.to_csv(file_path +  'HODL.csv')
    return simulations
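
To make the dot product in `simulate_HODL` concrete, here is a small sketch on made-up numbers. The price table, the basket and the dollar amount are all hypothetical; only the arithmetic mirrors the function above.

```python
import numpy as np

# Hypothetical toy data: 3 days of prices (rows) for 4 coins (columns)
prices = np.array([
    [10.0, 2.0, 5.0, 1.0],
    [12.0, 1.0, 5.0, 2.0],
    [15.0, 4.0, 10.0, 0.5],
])
basket = [0, 2]    # column indices of the coins held
amt_each = 100.0   # dollars put into each coin on day 0

# Units of each coin bought at day-0 prices
coin_amts = amt_each / prices[0, basket]

# Portfolio value per day = sum over coins of (price * units held)
portfolio_value = prices[:, basket].dot(coin_amts)
print(portfolio_value)
```

Since the holdings never change, a single matrix-vector product values the portfolio on every day at once.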

Rebalance simulation function
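
The rebalance function itself is not shown in this notebook. Purely as an illustration of the idea, the sketch below resets a portfolio to equal dollar weights after every row of prices. The function name `rebalance_equal_weight`, the daily rebalancing interval, and the absence of fees and taxes are all assumptions, so this should not be read as the `simulate_rebalance` used later.

```python
import numpy as np


def rebalance_equal_weight(prices, start_amt):
    """Portfolio value per day when holdings are reset to equal dollar weights daily.

    Hypothetical sketch: ignores trading fees and taxes.
    """
    n_days, n_coins = prices.shape
    values = np.empty(n_days)
    value = float(start_amt)
    values[0] = value
    for day in range(1, n_days):
        # Each coin held value / n_coins dollars at the previous day's prices
        growth = prices[day] / prices[day - 1]
        value = (value / n_coins) * growth.sum()
        values[day] = value
    return values


# Hypothetical prices: 3 days (rows) for 2 coins (columns)
toy_prices = np.array([[10.0, 5.0], [12.0, 5.0], [15.0, 10.0]])
rebalanced_value = rebalance_equal_weight(toy_prices, 200.0)
print(rebalanced_value)
```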

Simulations

In [None]:
historical_prices = pd.read_csv(file_path + 'historical prices.csv')

coins = historical_prices.columns.tolist()[1:]

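# Keep the dates for the simulation index (assumes the first column holds the dates),
# then exclude the date column from historical prices
sim_dates = historical_prices.iloc[:, 0].to_numpy()
historical_prices = np.array(historical_prices[coins])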

# get date ranges used for simulations
historical_cap = pd.read_csv(file_path + 'historical market cap.csv')
historical_cap = np.array(historical_cap)

start_dates = historical_cap[:len(historical_cap) - 365]
end_dates = historical_cap[365:]

# Change in total market cap over each 365-day window (total market cap is in the 4th column)
cap_diffs = list(end_dates[:, 3] - start_dates[:, 3])

# Make sure there's an odd number of dates, so the median value can be indexed
if len(cap_diffs) % 2 == 0:
    cap_diffs.pop(len(cap_diffs)-1)

# Start date for simulations
start_date = cap_diffs.index(np.median(cap_diffs))

# Limit dataframe dates to the date range
historical_prices = historical_prices[start_date:start_date + 365]
sim_dates = sim_dates[start_date:start_date + 365]

# Retrieve all current tickers on exchange
exchange = ccxt.bittrex()
tickers = set(exchange.fetch_tickers())

# Start with $5000 of Bitcoin at day 0 price
start_amt = 5000
num_coins = 5
amt_each = start_amt / num_coins

df = simulate_HODL()
simulate_rebalance(df)
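
The start date above is the window whose change in total market cap is the median over all windows, which gives a "typical" year rather than an unusually good or bad one. A small sketch of that selection on a made-up series, with a 3-day window standing in for the 365-day one:

```python
import numpy as np

# Hypothetical total market cap, one value per day
total_cap = np.array([100.0, 110.0, 120.0, 130.0, 180.0, 160.0, 190.0, 195.0])
window = 3  # stand-in for the 365-day window

# Change in market cap over every possible window
cap_diffs = list(total_cap[window:] - total_cap[:-window])

# With an even count the median is an average of two entries and may not
# appear in the list, so drop the last entry to keep the count odd
if len(cap_diffs) % 2 == 0:
    cap_diffs.pop()

# Position of the median change = index of the window's first day
start_index = cap_diffs.index(np.median(cap_diffs))
print(cap_diffs, start_index)
```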

In [None]:
path = 'C:/Users/Carter/Documents/Github/Thinkful__Projects/Final Capstone/'

# DataFrames we've created
historical_data = pd.read_csv(path + 'data/historical prices.csv')
hodl_df = pd.read_csv(path + 'hodl.csv')
rebalanced_df = pd.read_csv(path + 'rebalanced.csv')
summary_df = pd.read_csv(path + 'summary.csv')

# Date range used for simulations
start_date, end_date = historical_data['date'][0], historical_data['date'][len(historical_data)-1]
start_date = time.strftime('%m/%d/%Y', time.gmtime(start_date))
end_date = time.strftime('%m/%d/%Y', time.gmtime(end_date))


# list of coins used in each portfolio simulation
coins = historical_data.columns[1:].tolist()
cols = hodl_df.columns[1:]
# For each simulation, make a list of the coins randomly chosen
coin_lists = [i.split('-') for i in cols]

print('Coins used in analysis', coins)
print('Date range of simulation: {} - {}'.format(start_date, end_date)) 

In [None]:
# End prices 
# Note: explain how taxes were calculated
end_price_HODL = np.array(summary_df['end_price_HODL'] - summary_df['taxes_HODL'])
end_price_rebalanced = np.array(summary_df['end_price_rebalanced'] - summary_df['taxes_rebalanced'])
performance = list((end_price_rebalanced - end_price_HODL) / end_price_HODL)
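
As a worked example of the performance measure, the sketch below computes the after-tax relative difference between the two strategies for three portfolios. All dollar figures are made up for illustration.

```python
import numpy as np

# Hypothetical end-of-simulation figures for three portfolios (dollars)
end_value_hodl = np.array([6000.0, 8000.0, 5000.0])
taxes_hodl = np.array([200.0, 500.0, 0.0])
end_value_rebalanced = np.array([7000.0, 7500.0, 5500.0])
taxes_rebalanced = np.array([620.0, 750.0, 250.0])

# After-tax end values
net_hodl = end_value_hodl - taxes_hodl
net_rebalanced = end_value_rebalanced - taxes_rebalanced

# Positive means rebalancing beat HODL, as a fraction of the HODL result
performance = (net_rebalanced - net_hodl) / net_hodl
print(performance)
```

Here the first and third portfolios beat HODL by 10% and 5%, and the second trails it by 10%.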

In [None]:
# Dataframe to compare coin impact on outperforming HODL
# One row per simulation, one boolean column per coin
df = pd.DataFrame(False, index=range(len(coin_lists)), columns=coins)

# Fill Dataframe with coins used for each simulation
for i, coin_list in enumerate(coin_lists):
    df.loc[i, coin_list] = True

# Target: did the rebalanced portfolio beat HODL?
df['beat market'] = np.array(performance) > 0
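
The table built above is a membership matrix: one row per simulation, one column per coin, `True` where the coin was in that simulation's basket. A minimal sketch on hypothetical coin names and column labels:

```python
import pandas as pd

coins = ['BTC', 'ETH', 'LTC', 'XRP']      # hypothetical coin universe
cols = ['BTC-ETH', 'LTC-XRP', 'ETH-XRP']  # hypothetical simulation column names

# Recover each basket from its dash-joined column name
coin_lists = [name.split('-') for name in cols]

# Start with all False, then mark the coins present in each simulation
membership = pd.DataFrame(False, index=range(len(coin_lists)), columns=coins)
for i, coin_list in enumerate(coin_lists):
    membership.loc[i, coin_list] = True

print(membership)
```

Note that splitting on `-` assumes no coin symbol itself contains a dash.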

In [None]:
# Feature importance analysis
tree = RandomForestClassifier()
X = df[coins]
Y = df['beat market']
tree.fit(X, Y)

feature_importance = tree.feature_importances_
feature_importance = 100 * (feature_importance / max(feature_importance))

# Take only top 10 features (argsort keeps tied importances as distinct columns)
sorted_features = np.argsort(feature_importance)[::-1][:10]
pos = np.arange(sorted_features.shape[0]) + .5
plt.barh(pos, feature_importance[sorted_features], align='center')
plt.yticks(pos, X.columns[sorted_features])
plt.show()
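
A minimal sketch of picking the top features by importance with `np.argsort`, which returns positions directly and so keeps tied scores as separate features. The importance values and feature names are made up.

```python
import numpy as np

# Hypothetical importances for five features; A and C are tied
importances = np.array([0.10, 0.40, 0.10, 0.25, 0.15])
names = np.array(['A', 'B', 'C', 'D', 'E'])

# Scale so the most important feature is 100
scaled = 100 * importances / importances.max()

# Indices of the top_k largest values, most important first
top_k = 3
order = np.argsort(scaled)[::-1][:top_k]
top_names = names[order].tolist()
print(top_names)
```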