**Problem Statement:**

Fargo, a firm, has decided to invest in cryptocurrency (Bitcoin). Bitcoin's price fluctuates considerably in USD terms, which makes future prices difficult to predict. Predictions can be made only if there is a relationship between past data and future Bitcoin values. Fargo therefore decides to hire a data scientist to build a model using an RNN with LSTM to predict Bitcoin prices. Our task is to build that model and predict the price of Bitcoin.

**Dataset Description:**

Bitcoin is one of the best-known cryptocurrencies, released as open source in 2009. It functions as a decentralized digital medium of exchange, with transactions verified and recorded on a public blockchain without the need for a trusted record-keeping authority or central intermediary. The dataset consists of 627190 rows and 12 columns, covering Bitcoin markets, cryptocurrency pairs, volume, bid and ask values, and so on.

In [0]:
!pip install -U tensorflow # upgrading tensorflow

Requirement already up-to-date: tensorflow in /usr/local/lib/python3.6/dist-packages (2.1.0)


**Importing the required libraries**

In [1]:
#for calculations and importing dataframes
import numpy as np
import pandas as pd
import math
#for scaling, metrics and tensorflow
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import tensorflow as tf

In [2]:
#for building LSTM model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM

In [3]:
#for visualizing the prediction
from bokeh.plotting import figure
from bokeh.io import output_notebook, show, push_notebook
from bokeh.models import Legend

**Import the Dataset**

In [4]:
bt_coin= pd.read_csv("Bitcoin.csv")

In [5]:
bt_coin.head() #Having a look at the dataset

Unnamed: 0,date_id,datetime_id,market,rpt_key,last,diff_24h,diff_per_24h,bid,ask,low,high,volume
0,31-05-2019,01-06-2019 00:00,bitstamp,btc_eur,1996.72,2029.99,-1.638924,2005.5,2005.56,1950.0,2063.73,2314.50075
1,31-05-2019,01-06-2019 00:00,bitflyer,btc_jpy,267098.0,269649.0,-0.946045,267124.0,267267.0,267124.0,267267.0,70922.88011
2,31-05-2019,01-06-2019 00:00,korbit,btc_krw,3003500.0,3140000.0,-4.347134,3003500.0,3004000.0,3002000.0,3209500.0,6109.752872
3,31-05-2019,01-06-2019 00:00,bitstamp,btc_usd,2237.4,2239.37,-0.087971,2233.09,2237.4,2154.28,2293.46,13681.28202
4,31-05-2019,01-06-2019 00:00,okcoin,btc_usd,2318.82,2228.7,4.043613,2319.4,2319.99,2129.78,2318.82,4241.641516


**Counting the rows for each currency pair**

The counts below show that `btc_usd` has the most rows, so we proceed with it for our prediction.

In [6]:
bt_coin['rpt_key'].value_counts()

btc_usd       125438
btc_krw        62719
btc_jpy        62719
fx_btc_jpy     62719
eth_btc        62719
eth_krw        62719
etc_krw        62719
ltc_usd        62719
btc_eur        62719
Name: rpt_key, dtype: int64

**Keeping only the BTC/USD rows**

In [7]:
bt_coin1 = bt_coin.loc[(bt_coin['rpt_key'] == 'btc_usd')]

In [8]:
bt_coin1.head() #having a look at the data

Unnamed: 0,date_id,datetime_id,market,rpt_key,last,diff_24h,diff_per_24h,bid,ask,low,high,volume
3,31-05-2019,01-06-2019 00:00,bitstamp,btc_usd,2237.4,2239.37,-0.087971,2233.09,2237.4,2154.28,2293.46,13681.28202
4,31-05-2019,01-06-2019 00:00,okcoin,btc_usd,2318.82,2228.7,4.043613,2319.4,2319.99,2129.78,2318.82,4241.641516
15,01-06-2019,01-06-2019 00:01,bitstamp,btc_usd,2248.39,2242.44,0.265336,2247.77,2248.38,2154.28,2293.46,13701.6986
16,01-06-2019,01-06-2019 00:01,okcoin,btc_usd,2320.42,2228.4,4.12942,2320.99,2321.49,2129.78,2322.0,4260.261516
23,01-06-2019,01-06-2019 00:02,bitstamp,btc_usd,2248.35,2238.58,0.436437,2248.35,2248.69,2154.28,2293.46,13742.11091


**Parsing the timestamps and keeping only the rows after a cutoff date**

In [9]:
bt_coin1 = bt_coin1.reset_index(drop=True)
bt_coin1['datetime'] = pd.to_datetime(bt_coin1['datetime_id'])
bt_coin1 = bt_coin1.loc[bt_coin1['datetime'] > pd.to_datetime('2019-06-28 00:00:00')]
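One thing worth checking at this step: the `date_id` and `datetime_id` strings printed earlier (for example `31-05-2019`) appear to be day-first, while `pd.to_datetime` defaults to `dayfirst=False` and reads ambiguous strings month-first. The sketch below uses only the standard library and a made-up timestamp in the same layout to show how the two readings fall on opposite sides of the `2019-06-28` cutoff. If the day-first reading is the intended one, passing `dayfirst=True` (or an explicit `format`) to `pd.to_datetime` would be the usual remedy.

```python
from datetime import datetime

# Hypothetical value written in the same layout as the datetime_id column.
raw = "07-06-2019 00:00"

# Two possible readings of the same string.
month_first = datetime.strptime(raw, "%m-%d-%Y %H:%M")  # July 6, 2019
day_first = datetime.strptime(raw, "%d-%m-%Y %H:%M")    # June 7, 2019

cutoff = datetime(2019, 6, 28)

# The month-first reading passes the "after the cutoff" filter,
# the day-first reading does not.
print(month_first.date(), month_first > cutoff)
print(day_first.date(), day_first > cutoff)
```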

In [10]:
bt_coin1 = bt_coin1[['datetime', 'last','low', 'high', 'volume']] #keeping only these columns

In [11]:
bt_coin1.head() #having a look at the data

Unnamed: 0,datetime,last,low,high,volume
17280,2019-07-06 00:00:00,2820.0,2608.25,2933.0,21078.15618
17281,2019-07-06 00:00:00,2940.32,2684.28,3095.0,4019.756
17282,2019-07-06 00:01:00,2820.0,2608.25,2933.0,21038.90547
17283,2019-07-06 00:01:00,2946.0,2684.28,3095.0,4018.371
17284,2019-07-06 00:02:00,2819.99,2608.25,2933.0,21035.46284


**We use only the `last` column, which holds the most recently updated price**

In [12]:
bt_coin2 = bt_coin1['last']
bt_coin2

17280     2820.00
17281     2940.32
17282     2820.00
17283     2946.00
17284     2819.99
           ...   
125433    2394.03
125434    2320.47
125435    2394.03
125436    2320.47
125437    2394.03
Name: last, Length: 47676, dtype: float64

**We rescale the data because neural networks are sensitive to the scale of their inputs, particularly with activation functions such as tanh or sigmoid**

In [13]:
scaler = MinMaxScaler(feature_range=(1, 2))
bt_coin2 = scaler.fit_transform(np.array(bt_coin2).reshape(-1,1))
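As a rough illustration of what this scaling does, the sketch below applies the min-max formula by hand with NumPy: each value is mapped to `(x - min) / (max - min) * (high - low) + low`, here with the range `(1, 2)` used above. The helper names `min_max_scale` and `min_max_invert` and the price values are made up for the example; the notebook itself relies on `MinMaxScaler`.

```python
import numpy as np

def min_max_scale(x, low=1.0, high=2.0):
    """Map x linearly so that its minimum becomes `low` and its maximum `high`."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    scaled = (x - x_min) / (x_max - x_min) * (high - low) + low
    return scaled, (x_min, x_max)

def min_max_invert(scaled, bounds, low=1.0, high=2.0):
    """Undo min_max_scale using the stored (min, max) of the original data."""
    x_min, x_max = bounds
    return (np.asarray(scaled) - low) / (high - low) * (x_max - x_min) + x_min

prices = [2000.0, 2500.0, 3000.0]  # made-up prices, not taken from the dataset
scaled, bounds = min_max_scale(prices)
restored = min_max_invert(scaled, bounds)

print(scaled)    # smallest price maps to 1.0, largest to 2.0
print(restored)  # back in the original units
```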

In [14]:
bt_coin2 = bt_coin2.ravel().tolist() #flatten the (n, 1) scaler output into a plain list of values

**Splitting the dataset into 60-40 for training and testing**

In [15]:
train_size = int(len(bt_coin2) * 0.6)
test_size = len(bt_coin2) - train_size
train, test = bt_coin2[0:train_size], bt_coin2[train_size:len(bt_coin2)]
print(len(train), len(test))

28605 19071


In [16]:
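# convert an array of values into a dataset matrix
def create_dataset(dataset, look_back):
  """Return (X, y): each X row holds look_back consecutive values, y is the value that follows."""
  dataX, dataY = [], []
  # note: the extra -1 leaves the last possible window unused
  for i in range(len(dataset)-look_back-1):
    a = dataset[i:(i+look_back)]  # input window
    dataX.append(a)
    dataY.append(dataset[i + look_back])  # next value is the target
  return np.array(dataX), np.array(dataY)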
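To make the windowing concrete, here is a small sketch that runs the same function on a toy sequence with `look_back = 3`. The toy values are made up; the function body is copied from the cell above so the example runs on its own.

```python
import numpy as np

def create_dataset(dataset, look_back):
    dataX, dataY = [], []
    for i in range(len(dataset) - look_back - 1):
        dataX.append(dataset[i:(i + look_back)])  # input window
        dataY.append(dataset[i + look_back])      # value right after the window
    return np.array(dataX), np.array(dataY)

toy = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]  # made-up series
X, y = create_dataset(toy, look_back=3)

print(X.shape, y.shape)
print(X[0], "->", y[0])    # first window and its target
print(X[-1], "->", y[-1])  # last window that is actually used
```

With 10 values and a window of 3, this yields 10 - 3 - 1 = 6 samples, and the final value of the series never appears as a target.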

In [17]:
look_back = 10
trainX, trainY =create_dataset(train,look_back) 
testX, testY = create_dataset(test,look_back)

In [18]:
trainX

array([[1.61532178, 1.74297142, 1.61532178, ..., 1.75926712, 1.62448811,
        1.76394577],
       [1.74297142, 1.61532178, 1.74899743, ..., 1.62448811, 1.76394577,
        1.62526258],
       [1.61532178, 1.74899743, 1.61531117, ..., 1.76394577, 1.62526258,
        1.78331813],
       ...,
       [1.36827643, 1.30647796, 1.36827643, ..., 1.30588385, 1.36827643,
        1.30659467],
       [1.30647796, 1.36827643, 1.30248891, ..., 1.36827643, 1.30659467,
        1.36827643],
       [1.36827643, 1.30248891, 1.36827643, ..., 1.30659467, 1.36827643,
        1.30659467]])

In [19]:
trainY

array([1.62526258, 1.78331813, 1.62911371, ..., 1.36827643, 1.30659467,
       1.36827643])

In [20]:
# reshape input to [samples, time steps, features]: here 1 time step with look_back features
trainX = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = np.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
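The reshape above places the whole window on the feature axis. The sketch below, on a made-up array, contrasts that layout with the alternative of treating each value in the window as its own time step. The variable names are hypothetical, and the second layout is shown only for comparison; the notebook uses the first.

```python
import numpy as np

look_back = 3
# Four made-up windows of length look_back: shape (samples, look_back).
windows = np.arange(12, dtype=float).reshape(4, look_back)

# Layout used in this notebook: 1 time step, look_back features per step.
as_features = windows.reshape(windows.shape[0], 1, look_back)

# Alternative layout: look_back time steps, 1 feature per step.
as_steps = windows.reshape(windows.shape[0], look_back, 1)

print(as_features.shape)  # (4, 1, 3)
print(as_steps.shape)     # (4, 3, 1)
```

With the first layout the LSTM sees each window as a single step, so the matching `input_shape` is `(1, look_back)`; the second would need `input_shape=(look_back, 1)`.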

**Building LSTM model**

In [21]:
model = Sequential()
model.add(LSTM(4, input_shape=(1, look_back))) #one LSTM layer with 4 units
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=256, verbose=1)

Train on 28594 samples
Epoch 1/100
...
Epoch 76/10

<tensorflow.python.keras.callbacks.History at 0x7f9e7f2565d0>
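For a sense of how small this network is, the sketch below counts its trainable parameters by hand, assuming the standard Keras layout: an LSTM layer has four gates, each with input weights, recurrent weights and a bias, and a Dense layer has one weight per input plus a bias per output. The helper functions are hypothetical and are not part of Keras; the result should agree with what `model.summary()` reports for this architecture.

```python
def lstm_param_count(units, input_dim):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(inputs, outputs):
    """One weight per input/output pair plus one bias per output."""
    return inputs * outputs + outputs

look_back = 10  # number of features per time step in this notebook
lstm_params = lstm_param_count(units=4, input_dim=look_back)
dense_params = dense_param_count(inputs=4, outputs=1)
total = lstm_params + dense_params

print(lstm_params, dense_params, total)
```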

**Predicting with the model**

In [22]:
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)

In [23]:
# undo the scaling so that predictions and targets are back in price units
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([trainY])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([testY])

**Finding out the RMSE value**

In [24]:
# RMSE in the original price units (math was imported in the first cell)
trainScore = math.sqrt(mean_squared_error(trainY[0], trainPredict[:, 0]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = math.sqrt(mean_squared_error(testY[0], testPredict[:, 0]))
print('Test Score: %.2f RMSE' % (testScore))

Train Score: 6.73 RMSE
Test Score: 7.73 RMSE
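For reference, RMSE is the square root of the mean squared difference between targets and predictions, so it is expressed in the same units as the prices. A minimal NumPy sketch with made-up numbers (the helper `rmse` is hypothetical and not part of the notebook):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

actual = [2.0, 4.0, 6.0]     # made-up targets
predicted = [1.0, 4.0, 8.0]  # made-up predictions

# Errors are -1, 0 and 2, so the mean squared error is (1 + 0 + 4) / 3.
score = rmse(actual, predicted)
print(round(score, 4))
```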


**Plotting the predictions with Bokeh**

In [25]:
# shift train predictions for plotting
trainPredictPlot = np.full(len(bt_coin2), np.nan)
trainPredictPlot[look_back:len(trainPredict) + look_back] = trainPredict[:,0]

In [26]:
# shift test predictions for plotting
testPredictPlot = np.full(len(bt_coin2), np.nan)  # NaN-filled float array as long as the full series
testPredictPlot[len(trainPredict) + (look_back * 2) + 1:len(bt_coin2)-1] = testPredict[:,0]
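The offsets used above follow from how the windows were built: each split loses `look_back + 1` samples, so the first test prediction lines up with index `train_size + look_back` of the full series. The sketch below checks this on a made-up series of 20 points with `look_back = 3`; all names and sizes are hypothetical stand-ins for the notebook's variables.

```python
import numpy as np

n, look_back = 20, 3            # made-up series length and window size
train_size = int(n * 0.6)       # same 60-40 split rule as the notebook
test_size = n - train_size

# Each split yields len(split) - look_back - 1 samples.
n_train_pred = train_size - look_back - 1
n_test_pred = test_size - look_back - 1
train_pred = np.arange(n_train_pred, dtype=float)  # stand-in predictions
test_pred = np.arange(n_test_pred, dtype=float)

# Training predictions start at the first training target.
train_plot = np.full(n, np.nan)
train_plot[look_back:look_back + n_train_pred] = train_pred

# Test predictions start at the first test target in the full series.
start = n_train_pred + (look_back * 2) + 1
test_plot = np.full(n, np.nan)
test_plot[start:n - 1] = test_pred

print(start, train_size + look_back)
print(np.count_nonzero(~np.isnan(test_plot)), n_test_pred)
```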


In [27]:
output_notebook()

In [28]:
p=figure(plot_width=900,plot_height=300)
p.line(np.arange(len(bt_coin1['last'])),bt_coin1['last'], legend_label='Actual',color='Black')
tpp=pd.DataFrame(trainPredictPlot, columns=["close"], index=bt_coin1.index).close
qpp=pd.DataFrame(testPredictPlot, columns=["close"], index=bt_coin1.index).close
p.line(np.arange(len(tpp)),tpp, legend_label='Training',color='blue')
p.line(np.arange(len(qpp)),qpp,legend_label='Testing',color='orange')
p.legend.location = "top_right"
show(p)