# Bitcoin Data Analysis
#### Oct 2019  |   Work in Progress   |   Jason Su

## Introduction

Bitcoin was the first cryptocurrency and is still the largest cryptocurrency by market capitalization. Some people think of it as an investment vehicle (similar to commodities in nature) while others think of it as a store of value (similar to gold). Bitcoin was designed to be a decentralized global virtual currency with minimal friction in transactions and a high level of security due to the nature of its underlying blockchain network. Since its inception, Bitcoin has been rapidly adopted by enthusiasts and investors worldwide, and it is now traded on multiple online exchanges by people all over the world. Consequently, the number of factors affecting Bitcoin's price movements is huge and the underlying mechanisms are complex.

Every investor wishes to gain a competitive advantage in predicting Bitcoin's price movements. In this notebook I try to analyze the relationships between historical Bitcoin price movements and other relevant indicators, such as the level of Bitcoin adoption and the difficulty of Bitcoin mining. The objective is to gain insight into how different factors affect the price of Bitcoin and how this knowledge could translate into a strategic advantage in everyday Bitcoin trading.

## Data Wrangling

First let's obtain and clean our Bitcoin price dataset to get ready for analysis.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # making plots and charts
import requests # getting data through APIs

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

### Historical Bitcoin Price Data

The following dataset comes from [here](https://www.kaggle.com/mczielinski/bitcoin-historical-data) (Bitcoin price data at 1-minute intervals from select exchanges during the time period from Jan 2012 to August 2019): 

In [None]:
# Import the data from CSV file and save it to a dataframe

bitstamp = pd.read_csv('../input/bitcoin-historical-data/bitstampUSD_1-min_data_2012-01-01_to_2019-08-12.csv')

In [None]:
# Inspect the first few rows of the data to see what cleaning might be needed

bitstamp.head()

Hmm, it seems there are a lot of missing values, and the first column is encoded as Unix time.
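
Both observations (missing values and Unix timestamps) can be checked on a small scale. The sketch below uses a tiny synthetic frame, not the real dataset; the column names mirror the ones in the notebook, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the raw data: Unix timestamps in seconds, with gaps
sample = pd.DataFrame({
    'Timestamp': [1325317920, 1325317980, 1325318040, 1325318100],
    'Close': [4.39, np.nan, np.nan, 4.41],
})

# Count missing values per column
missing_per_column = sample.isna().sum()

# Convert Unix seconds into readable datetimes
sample['Datetime'] = pd.to_datetime(sample['Timestamp'], unit='s')

print(missing_per_column)
print(sample)
```

On the real data, `bitstamp.isna().sum()` would give the same per-column count.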

In [None]:
# Check the number of rows, columns and the datatypes of each column

bitstamp.info()

There are 8 columns and around 4 million rows of data. The datatypes seem to be fine since they are mostly float64, which is suitable for price data.

In [None]:
# Quickly check the summary statistics of each column to see if they make sense
# As a sanity check: historically the price of Bitcoin went from close to 0
# to an all-time high of around $20,000 per coin

bitstamp.describe()

In [None]:
bitstamp.shape

In [None]:
bitstamp.columns

In [None]:
bitstamp['Open'].value_counts(dropna = False)

In [None]:
# Pass the plot type by keyword; recent pandas versions reject a positional kind
bitstamp['Open'].plot(kind='hist')

In [None]:
bitstamp.boxplot(column=['Open', 'High', 'Low', 'Close'])

In [None]:
pd.melt(bitstamp, id_vars = 'Timestamp', var_name = 'Measurement', value_name = 'dollars/trade volume')

In [None]:
print(bitstamp.dtypes)

In [None]:
bitstamp_no_na = bitstamp.dropna()

In [None]:
bitstamp_no_na.info()

In [None]:
bitstamp.set_index(pd.to_datetime(bitstamp['Timestamp'], unit='s'), inplace=True, drop=True)

In [None]:
bitstamp.head()

In [None]:
# Forward-fill missing values (fillna(method='ffill') is deprecated in recent pandas)
bitstamp.ffill(inplace=True)
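
Forward-filling carries the last observed value into each gap. The sketch below shows that behaviour on a small synthetic frame. It also shows one possible alternative, which is an assumption on my part and not what the notebook does: fill prices forward but set the volume to zero, on the reasoning that a minute without data had no trades.

```python
import numpy as np
import pandas as pd

# Synthetic minute-level data with a two-minute gap
toy = pd.DataFrame(
    {'Close': [4.39, np.nan, np.nan, 4.41],
     'Volume_(BTC)': [0.5, np.nan, np.nan, 1.2]},
    index=pd.date_range('2012-01-01 00:00', periods=4, freq='min'),
)

# Option 1: forward-fill every column (the approach used in the notebook)
filled_all = toy.ffill()

# Option 2 (hypothetical alternative): carry prices forward, zero out volume
filled_split = toy.copy()
filled_split['Close'] = filled_split['Close'].ffill()
filled_split['Volume_(BTC)'] = filled_split['Volume_(BTC)'].fillna(0)

print(filled_all)
print(filled_split)
```

Which option is appropriate depends on whether the volume columns are used later; for the price plots in this notebook the two give the same result.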

In [None]:
bitstamp.head()

In [None]:
bitstamp.tail()

In [None]:
# Pass the plot type by keyword; recent pandas versions reject a positional kind
bitstamp['Close'].plot(kind='hist')

In [None]:
bitstamp_clean = bitstamp.loc[:, ['Open', 'High', 'Low', 'Close', 'Volume_(BTC)', 'Volume_(Currency)', 'Weighted_Price']]

In [None]:
bitstamp_clean.info()

Below is a chart of the Bitcoin price over the period covered by the dataset, from January 2012 to August 2019. Similar plots can be found on any website that lists the price of Bitcoin.

In [None]:
bitstamp_clean.plot(y='Close')

Recent price swings are so large compared with past prices that the early history looks flat and seems meaningless. However, to make sense of a long-term price trend, all past prices should carry some weight. The effect arises because a linear scale is inconvenient for a quantity that spans so many orders of magnitude; a logarithmic scale is more useful [2]. A logarithmic scale gives the same spacing from, e.g., 0.01 to 0.1 as from 1000 to 10000. Seen in this way, the bigger picture of Bitcoin's price evolution becomes more visible:

In [None]:
bitstamp_clean.plot(y='Close', logx=False, logy=True)
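
The equal-spacing property of the logarithmic scale can be verified with a little arithmetic. This is only an illustration of the idea; the variable names are made up.

```python
import numpy as np

# Distance on a log10 axis between 0.01 and 0.1
gap_small = np.log10(0.1) - np.log10(0.01)

# Distance on a log10 axis between 1000 and 10000
gap_large = np.log10(10000) - np.log10(1000)

# Both gaps are one decade, so they take up the same space on the axis
print(gap_small, gap_large)
```

Each tenfold price increase therefore takes up the same vertical distance, whether it happens at cents or at thousands of dollars.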

In the above plot, the price (y-axis) has been scaled logarithmically, but not the time (x-axis). Let’s see what happens when the x-axis is also scaled logarithmically, in a so-called log-log plot:

In [None]:
bitstamp_clean.plot(y='Close', logx=True, logy=True)
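
Because the x-axis here is a datetime index, a log-scaled time axis is easier to interpret when time is expressed as a count of days since some starting point. The sketch below assumes the first observation is day 1; that choice of origin is my assumption, not something the notebook specifies, and the prices are synthetic.

```python
import pandas as pd

# Synthetic daily prices indexed by date
idx = pd.date_range('2012-01-01', periods=5, freq='D')
prices = pd.Series([5.0, 5.2, 5.1, 5.5, 6.0], index=idx)

# Days elapsed since the first observation, starting at 1 so log(0) is avoided
days_elapsed = (prices.index - prices.index[0]).days + 1

# Re-index the prices by elapsed days, ready for a log-log plot
loglog_ready = pd.Series(prices.values, index=days_elapsed)
print(loglog_ready)
```

Calling `loglog_ready.plot(logx=True, logy=True)` would then draw the log-log chart against elapsed days. Note that the shape of such a plot depends on the chosen starting date.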

Now let's import an additional dataset "[Bitcoin My Wallet Number of Users](https://www.quandl.com/data/BCHAIN/MWNUS)" which tells us the number of Bitcoin wallets using My Wallet Services on a global scale. This is an indicator of the degree of adoption of Bitcoins worldwide.

In [None]:
# Load the Bitcoin wallet data obtained from Quandl
# (the daily number of wallets hosted by My Wallet Service from 2009 to 2019)
wallet = pd.read_csv('../input/bitcoin-my-wallet-number-of-users/BCHAIN-MWNUS.csv')

In [None]:
# Inspect the first 5 rows to see the latest wallet data
wallet.head()

In [None]:
# Inspect the last 5 rows to see the oldest data from 2009
wallet.tail()

In [None]:
# Convert the date column to datetime format for easier processing later
# Also rename the columns while we are here

wallet['Date'] = pd.to_datetime(wallet['Date'])
wallet.rename(columns={'Value': 'Wallets'}, inplace=True)

In [None]:
# Group our Bitcoin price data by day so that it could be plotted on the same scale
# against the daily wallet data

bitstamp_clean_day = bitstamp_clean.resample('D').mean()
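
To see what the daily resampling does, here is a minimal sketch on four synthetic minute-level prices that straddle midnight. The numbers are invented; only the `resample('D')` call matches the notebook.

```python
import pandas as pd

# Four one-minute observations: two late on Aug 1, two early on Aug 2
minute_index = pd.date_range('2019-08-01 23:58', periods=4, freq='min')
minute_prices = pd.DataFrame({'Close': [100.0, 102.0, 110.0, 114.0]},
                             index=minute_index)

# Daily mean, as used in the notebook
daily_mean = minute_prices.resample('D').mean()

# Alternative aggregation: the last observed price of each day
daily_last = minute_prices.resample('D').last()

print(daily_mean)
print(daily_last)
```

The mean smooths intraday movement, whereas `last()` would correspond to a conventional daily closing price; the notebook uses the mean.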

In [None]:
# Create a date column in the bitstamp_clean_day dataframe

bitstamp_clean_day['Date'] = bitstamp_clean_day.index

In [None]:
# Inspect the first 5 rows to confirm that the timestamps are indeed grouped by days

bitstamp_clean_day.head()

In [None]:
# Join the two dataframes (bitstamp_clean_day and wallet) by matching their dates columns

df = pd.merge(bitstamp_clean_day, wallet, how='inner', on='Date')
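
An inner join silently drops any date that appears in only one of the two frames. A quick way to audit this, sketched here on synthetic frames with made-up values, is to run an outer join with `indicator=True` and look at the unmatched rows.

```python
import pandas as pd

# Two synthetic daily frames whose date ranges only partly overlap
prices = pd.DataFrame({
    'Date': pd.to_datetime(['2019-08-01', '2019-08-02', '2019-08-03']),
    'Close': [1.0, 2.0, 3.0],
})
wallets = pd.DataFrame({
    'Date': pd.to_datetime(['2019-08-02', '2019-08-03', '2019-08-04']),
    'Wallets': [10, 20, 30],
})

# Inner join keeps only the dates present in both frames
inner = pd.merge(prices, wallets, how='inner', on='Date')

# Outer join with an indicator column shows which side each row came from
audit = pd.merge(prices, wallets, how='outer', on='Date', indicator=True)
unmatched = audit[audit['_merge'] != 'both']

print(inner)
print(unmatched)
```

If `unmatched` is large on the real data, the date ranges or date formats of the two sources are worth a second look before plotting.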

In [None]:
# Inspect the first few rows to confirm the data looks good to go

df.head()

In [None]:
# Plot both daily prices and daily number of wallets for Bitcoin on the same graph

plt.plot(df['Date'], df['Close'], 'r', df['Date'], df['Wallets']/10000, 'b')
plt.yscale('log')
plt.xlabel('Year')
plt.ylabel('Price (USD) and Number of Wallets (x10,000)')
plt.title('Bitcoin Price compared to the Number of Wallets')
plt.show()

From the above plot, there appears to be a correlation between the number of wallets (a proxy for worldwide Bitcoin adoption) and the Bitcoin price on a log scale. Monitoring the growth or decline in the total number of wallets may therefore help anticipate the overall trend of Bitcoin prices over the next couple of years. It is also worth noting that the rate of change of both quantities seems to be slowing down, which suggests that volatility is stabilizing.

Now let's import another dataset "[Bitcoin Difficulty](https://www.quandl.com/data/BCHAIN/DIFF)" which is a measure of how difficult it is to find a hash below a given target. This is an indicator of the level of difficulty of Bitcoin mining, which in turn implies the level of scarcity of new Bitcoin supply.

In [None]:
# Import the Bitcoin difficulty dataset

diff = pd.read_csv('../input/bitcoin-difficulty/BCHAIN-DIFF.csv')

In [None]:
# Rename the columns for easier processing
# Also change the data format of the "Date" column while we are here

diff.rename(columns={'Value': 'Difficulty'}, inplace=True)
diff['Date'] = pd.to_datetime(diff['Date'])

In [None]:
# Inspect the first few rows of the dataset

diff.head()

In [None]:
# Merge these data with Bitcoin price dataframe for comparison later

df2 = pd.merge(bitstamp_clean_day, diff, how='inner', on='Date')

In [None]:
# Inspect the first few rows of df2

df2.head()

In [None]:
# Plot both daily prices and level of difficulty for Bitcoin mining on the same graph

plt.plot(df2['Date'], df2['Close'], 'r', df2['Date'], df2['Difficulty']/100000, 'b')
plt.yscale('log')
plt.xlabel('Year')
plt.ylabel('Price (USD) and Level of Difficulty (x100,000)')
plt.title('Bitcoin Price compared to the Level of Difficulty')
plt.show()

Again, there appears to be a correlation between mining difficulty and price over the long run. Mining difficulty has increased by many orders of magnitude since Bitcoin's launch, despite occasional downward adjustments. The difficulty adjustment is hard-coded into the protocol: it retargets every 2016 blocks (roughly every two weeks) so that blocks keep arriving about every ten minutes, no matter how much mining power joins the network. This keeps new coins on a fixed issuance schedule, which, together with the capped total supply, maintains Bitcoin's scarcity and avoids the inflation issues experienced with traditional currencies. It is therefore economically plausible that prices rise as new supply becomes scarcer.