# Smart Trading Agent
In this notebook, we predict cryptocurrency transactions using a dataset with five fields:
- Timestamp (2018-05-01 00:00:00 - 2018-05-02 23:59:59)
- Price 
- Mid Price 
- Book Feature 
- Side (Sell / Buy)

<br/>

## Introduction
In this tutorial we will use the popular deep learning library Keras, together with the visualization libraries Matplotlib and Seaborn, to build a simple classification model. 
The libraries NumPy and pandas will help us along the way.

In [71]:
# Importing pandas
import pandas as pd

# Reading the csv file into a pandas DataFrame
data = pd.read_csv('data/2018-05-newtrade.csv')

# Printing out the first 10 rows of our data
data[:10]
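
Before plotting, it can help to confirm the column types and check for missing values. The following is a minimal sketch, assuming the CSV has the columns shown in the sample output later in this notebook. The three rows are copied from that output, and the inline CSV text and the name `sample_df` are stand-ins used only for illustration:

```python
import io
import pandas as pd

# Three rows copied from the sample output later in this notebook,
# used here as a stand-in for 'data/2018-05-newtrade.csv'
csv_text = """timestamp,price,side,mid_price,book_feature
2018-05-01 01:06:16,10163000,1,10141500.0,-3264355.0
2018-05-01 01:27:22,10153000,1,10147000.0,-2416599.0
2018-05-01 03:15:42,10175000,0,10169000.0,7247598.0
"""

# parse_dates turns the timestamp strings into datetime values
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=['timestamp'])

# Basic sanity checks: no missing values and a binary 'side' column
n_missing = int(sample_df.isna().sum().sum())
side_values = sorted(sample_df['side'].unique().tolist())
print(sample_df.dtypes)
print("missing values:", n_missing, "| side values:", side_values)
```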

## Visualization
We will now start thinking of which of these features we will use in our model.
First let's make a plot of our data to see how it looks. 
To visualize our data, we will use matplotlib and seaborn.

Intuitively, it makes sense that the price of BTC ('price') would play a big role in whether a trade is a buy or a sell ('side').
Let's see if this hypothesis is correct:

In [72]:
# Importing matplotlib, seaborn
import matplotlib.pyplot as plt
import seaborn as sns

x = data.price.values

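# Plotting the distribution (histogram, KDE curve and rug marks)
# sns.distplot is deprecated since seaborn 0.11; histplot and rugplot replace it
sns.histplot(x, kde=True, stat='density')
sns.rugplot(x)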
plt.show()

In [73]:
sns.violinplot(x='side', y='price', data=data)
plt.show()

Unfortunately, it is a bit hard to visualize prices since we have a lot of different samples. 

We can still make out a curve, though. 
There are two peaks, at about 0.99 and about 1.015 on the price axis (which is scaled by 1e7, so about 9,900,000 and 10,150,000 won). Here we treat 1.015 as the second peak and ignore 0.995, which is actually the second-highest point.
At these points, the trader bought and sold BTC most frequently.


To get a closer look, we will 'bin-ify' the prices, grouping them into bins according to their value. Prices that are close together will then appear as one bin, which makes them easier to visualize.

The function we will use divides the prices by a given factor and rounds the result. To make our lives easier, we will use numpy.
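
To see what this rounding does, here is a small sketch with hypothetical prices and an assumed bin width of 100,000 won. Note that `np.around` rounds exact halves to the nearest even integer, which affects prices that sit exactly between two bins:

```python
import numpy as np

factor = 100000  # assumed bin width in won, for illustration

# Hypothetical prices chosen to show how rounding assigns bins
prices = np.array([9_940_000, 9_960_000, 10_050_000, 10_150_000])
bins = np.around(prices / factor)

# 99.4 -> 99, 99.6 -> 100, and the exact halves go to the even neighbour:
# 100.5 -> 100, 101.5 -> 102
print(bins)
```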

In [74]:
# Importing numpy
import numpy as np

# Helper that bins a column: divide by `factor`, then round to the nearest integer
def make_bins(d, col, factor=2):
    rounding = lambda x: np.around(x / factor)
    d[col] = d[col].apply(rounding)
    return d

t = make_bins(data.copy(True), 'price', 100000)

# Plotting the bar graphs
sns.barplot(x='price', y='side', data = t)
plt.show()

There doesn't seem to be much correlation between the price and the sell rate.
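
The bar heights in a seaborn bar plot are, by default, the mean of the y column within each x group. For a 0/1 column such as `side`, that mean is the fraction of rows equal to 1. The sketch below reproduces this number with a `groupby`, using hypothetical trades and assuming that 1 encodes a sell:

```python
import numpy as np
import pandas as pd

# Hypothetical trades; side is assumed to be 1 for a sell and 0 for a buy
trades = pd.DataFrame({
    'price': [9_910_000, 9_930_000, 9_890_000, 10_020_000, 10_040_000],
    'side':  [1, 0, 1, 0, 0],
})

# Same binning idea as make_bins with factor=100000
trades['price_bin'] = np.around(trades['price'] / 100000)

# Mean of a 0/1 column per bin = fraction of sells in that bin
sell_rate = trades.groupby('price_bin')['side'].mean()
print(sell_rate)
```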

How about counts?

In [75]:
# Plotting the bar graphs
sns.countplot(x='price', data = t)
plt.show()

It is now clearer that the trader makes the highest number of transactions around 9,900,000 won. 
The number of transactions grows as the price approaches 9,900,000 won and decreases after that. It rises again around 10,150,000 won, but the counts there are much smaller than before.
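
The counts read off the plot can also be obtained as numbers. This is a minimal sketch with hypothetical prices, in which the bins are converted back to won so the labels are easier to read:

```python
import numpy as np
import pandas as pd

# Hypothetical prices in won, for illustration only
prices = pd.Series([9_880_000, 9_910_000, 9_920_000, 9_940_000,
                    10_010_000, 10_140_000, 10_160_000])

# Bin to the nearest 100,000 won and convert back to won for readability
binned = (np.around(prices / 100000) * 100000).astype(int)

# Number of trades per bin, ordered by price
counts = binned.value_counts().sort_index()
busiest_bin = counts.idxmax()
print(counts)
print("busiest bin:", busiest_bin)
```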

### Conclusion 1:
The main trade price in May 2018 is about 9,900,000 won, and most trades are below 10,100,000 won.


<br/>
<br/>
<br/>
Now to check the book feature:

In [76]:
t = make_bins(data.copy(True), 'book_feature', 1000000)

# Plotting the bar graphs
sns.countplot(x='book_feature', data = t)
plt.show()

The plot shows a relationship between the transaction counts and the book feature.

In [None]:
# Plotting the bar graphs
sns.countplot(x='book_feature',hue='side', data = t)
plt.show()

The same relationship appears on both sides, buy and sell.

Something does seem to be going on with 'Book Feature'. Once it is above zero, the total transaction count decreases as the feature grows.
In particular, the count in bin 3 is more than twice the count in bin 8.
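
Comparisons like this are easier to verify from a table than from a plot. The following sketch tabulates counts per book-feature bin and side with `pd.crosstab`. The values in `t_demo` are hypothetical and serve only to show the shape of the result:

```python
import pandas as pd

# Hypothetical binned book-feature values (in millions of won) and sides
t_demo = pd.DataFrame({
    'book_feature': [1, 1, 2, 2, 2, 3, 3, 8],
    'side':         [1, 1, 0, 0, 1, 1, 0, 1],
})

# Rows: book-feature bin, columns: side (0 or 1), cells: trade counts
table = pd.crosstab(t_demo['book_feature'], t_demo['side'])
print(table)
```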

### Conclusion 2:
So far, we have found that the trader usually makes transactions at prices from 9,900,000 won to 10,000,000 won, and that the book feature of those trades is very likely under 7,000,000 won.

<br/>
<br/>
<br/>
Let's look at the two transaction types separately. From now on, to generalize, we focus only on transaction prices under 10,000,000 won and book features under 7,000,000 won (the filter below uses a slightly wider upper bound of 8,000,000 won).

In [None]:
datacopy = data[data['price'] <= 10000000]

# Two conditions to extract a book-feature range
top = datacopy['book_feature'] < 8000000
bottom = datacopy['book_feature'] > 0
datacopy = datacopy[top & bottom]

Let's look at this narrower range:

In [None]:
c = make_bins(datacopy.copy(True), 'book_feature', 1000000)

# Plotting the bar graphs
sns.countplot(x='book_feature', hue='side', data = c)
plt.show()

The highest number of sells is around 1,000,000 won, whereas the highest number of purchases is around 2,000,000 won.
The second-highest number of sells is around 3,000,000 won, and so is the second-highest number of purchases.

In any case, the trend is decreasing in both graphs.


<br/>
<br/>
<br/>
Next, let's consider the transaction time:

In [None]:
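# Work on a copy so the original timestamps in `data` are preserved
datacopy = data.copy()
# Keep only the hour part ('HH') of each 'YYYY-MM-DD HH:MM:SS' timestamp
datacopy['timestamp'] = datacopy['timestamp'].map(lambda x: x.split(' ')[1].split(':')[0].strip())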

# Plotting the bar graphs
sns.countplot(x='timestamp', data=datacopy)
plt.show()
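
An alternative to string splitting is to parse the timestamps and read the hour through the `.dt` accessor. The sketch below uses three timestamps copied from the sample output in this notebook. It yields integer hours, which sort numerically on a plot axis:

```python
import pandas as pd

# Timestamps copied from the sample output in this notebook
ts = pd.Series(['2018-05-01 01:06:16', '2018-05-01 03:15:42', '2018-05-01 07:19:21'])

# .dt.hour yields integers instead of zero-padded strings such as '01'
hours = pd.to_datetime(ts).dt.hour
print(hours.tolist())
```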


There doesn't seem to be much correlation to transaction time.

## Final conclusion
So far, we have looked at several factors that could explain the trader's transaction patterns.
Price came first, and it gave us our first conclusion about the main transaction price.
Second was the book feature, which is the difference between price and mid-price. It was clear that most of the trader's transactions have a book feature under 7,000,000 won.
Lastly, we looked at transaction time. It did not show a strong correlation with trades, even though time is an important factor in almost every case.

As a result, we could not find a very obvious trend in the 2018-05-newtrade.csv dataset. Instead of general statements, we found some dealing patterns and weak trend lines:
- Transaction counts vs. book feature: the total counts of sells and purchases both decrease steadily once the book feature exceeds 1,000,000 won.
- The trader is most likely to purchase around 2,000,000 won. This is the only book-feature bin where purchases outnumber sells; in the other bins the trader tends to sell BTC.

## Plotting the data

First let's make a plot of our data to see how it looks. In order to have a 2D plot, let's ignore the timestamp and mid-price.

In [None]:
# Function to help us plot
def plot_points(data):
    X = np.array(data[['price', 'book_feature']])
    y = np.array(data['side'])
    # Boolean masks select the rows of each class directly
    purchases = X[y == 0]
    sells = X[y == 1]
    plt.scatter(purchases[:, 0], purchases[:, 1], s = 25, color = 'red', edgecolor = 'k')
    plt.scatter(sells[:, 0], sells[:, 1], s = 25, color = 'cyan', edgecolor = 'k')
    plt.xlabel('Price')
    plt.ylabel('Book Feature')
    
# Plotting the points    
plot_points(data)
plt.show()

Roughly, it looks like most trades happened at prices from 9,820,000 won to 10,000,000 won with book features from -500,000 won to 2,000,000 won, while few happened at high prices, but the data is not as nicely separable as we hoped. 
Maybe it would help to separate the book-feature ranges? Let's make 5 plots, each covering a range of 4,000,000 won.

In [None]:
# Separating the book-feature ranges
# Lower bounds use >= so values exactly on a boundary are not left out
cond1 = data['book_feature'] < 0
cond2 = data['book_feature'] >= 0
cond3 = data['book_feature'] < 4000000
cond4 = data['book_feature'] >= 4000000
cond5 = data['book_feature'] < 8000000
cond6 = data['book_feature'] >= 8000000
cond7 = data['book_feature'] < 12000000
cond8 = data['book_feature'] >= 12000000

data_range1 = data[cond1]
data_range2 = data[cond2 & cond3]
data_range3 = data[cond4 & cond5]
data_range4 = data[cond6 & cond7]
data_range5 = data[cond8]

# Plotting the graphs
plot_points(data_range1)
plt.title("Book Feature 1(~0)")
plt.show()
plot_points(data_range2)
plt.title("Book Feature 2(0~4,000,000)")
plt.show()
plot_points(data_range3)
plt.title("Book Feature 3(4,000,000~8,000,000)")
plt.show()
plot_points(data_range4)
plt.title("Book Feature 4(8,000,000~12,000,000)")
plt.show()
plot_points(data_range5)
plt.title("Book Feature 5(12,000,000~)")
plt.show()

It seems that range4 has the fewest transactions, followed by range3. Most transactions are in range1 and range2. 
Across all ranges, few transactions fall at prices from 1,000,000 won to 12,000,000 won.
Let's use the book feature as one of our inputs. In order to do this, we should one-hot encode it.

## One-hot encoding the book feature
For this, we'll use the `get_dummies` function in pandas.

Before encoding, we replace each book_feature value with a simple integer that identifies its range.

In [None]:
# Replace book-feature price with five range numbers
data.loc[cond1, 'book_feature'] = 1 
data.loc[cond2 & cond3, 'book_feature'] = 2
data.loc[cond4 & cond5, 'book_feature'] = 3 
data.loc[cond6 & cond7, 'book_feature'] = 4 
data.loc[cond8, 'book_feature'] = 5
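
The same five-way recoding can be sketched in one step with `pd.cut`. The book-feature values below are copied from the sample output in this notebook. The edge list and the name `range_id` are illustrative, and `right=False` makes each bin include its lower edge, so a value exactly on a boundary always lands in a range:

```python
import numpy as np
import pandas as pd

# Book-feature values copied from the sample output in this notebook
book_feature = pd.Series([-3264355.0, 7247598.0, 10058654.0, 3889905.0, 17147702.0])

# Bin edges matching the five ranges; right=False makes each bin [low, high)
edges = [-np.inf, 0, 4_000_000, 8_000_000, 12_000_000, np.inf]
range_id = pd.cut(book_feature, bins=edges, labels=[1, 2, 3, 4, 5], right=False)
print(range_id.tolist())
```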

In [None]:
# Make dummy variables for book_feature
one_hot_data = pd.concat([data, pd.get_dummies(data['book_feature'], prefix='book_feature')], axis=1)

# Drop the original book_feature column
one_hot_data = one_hot_data.drop('book_feature', axis=1)

# Print the first 10 rows of our data
one_hot_data[:10]
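
To make the effect of `get_dummies` concrete, here is a small sketch on a hypothetical three-row frame. Only the categories that actually occur produce a column, so a range with no rows would have no dummy column:

```python
import pandas as pd

# Hypothetical range numbers (1-5) standing in for the recoded book_feature
demo = pd.DataFrame({'price': [10163000, 10175000, 10160000],
                     'book_feature': [1, 3, 4]})

# One indicator column per distinct value, named '<prefix>_<value>'
dummies = pd.get_dummies(demo['book_feature'], prefix='book_feature')
encoded = pd.concat([demo.drop('book_feature', axis=1), dummies], axis=1)
print(encoded.columns.tolist())
```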



## Scaling the data
The next step is to scale the data. The one-hot book-feature columns only take the values 0 and 1, whereas the price ranges roughly from 9,000,000 to 13,000,000, which is much larger. Features on such different scales are hard for a neural network to handle. Let's fit the price into a range of 0-1 by dividing it by 13,000,000; the book-feature columns need no scaling because they are already in that range.

In [None]:
# Copying our data (an explicit copy avoids modifying one_hot_data)
processed_data = one_hot_data.copy()

# Scaling the price column; 'book_feature' was dropped during one-hot encoding,
# so there is no such column left to scale
processed_data['price'] = processed_data['price']/13000000
processed_data[:10]
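
Dividing by a fixed constant such as 13,000,000 keeps prices near 10,000,000 won in a narrow band around 0.78. As a possible alternative, not used in this notebook, min-max scaling spreads the observed values over the full 0-1 range. A sketch with prices copied from the sample output:

```python
import pandas as pd

# Prices copied from the sample output in this notebook
price = pd.Series([10163000, 10153000, 10175000, 10216000], dtype=float)

# Min-max scaling maps the smallest value to 0 and the largest to 1
scaled = (price - price.min()) / (price.max() - price.min())
print(scaled.round(3).tolist())
```

In practice the minimum and maximum would usually be taken from the training set only and reused for the test set.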

<br/>
<br/>
<br/>

In order to test our algorithm, we'll split the data into a Training and a Testing set. The size of the testing set will be 10% of the total data.


In [82]:
# np.random.choice returns index labels, so select them with .loc
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)
train_data, test_data = data.loc[sample], data.drop(sample)

print("Number of training samples is", len(train_data))
print("Number of testing samples is", len(test_data))

train_data[:10]
test_data[:10]

Number of training samples is 820
Number of testing samples is 92


| Unnamed: 0 | timestamp | price | side | mid_price | book_feature |
| --- | --- | --- | --- | --- | --- |
| 1 | 2018-05-01 01:06:16 | 10163000 | 1 | 10141500.0 | -3264355.0 |
| 10 | 2018-05-01 01:27:22 | 10153000 | 1 | 10147000.0 | -2416599.0 |
| 33 | 2018-05-01 03:15:42 | 10175000 | 0 | 10169000.0 | 7247598.0 |
| 39 | 2018-05-01 04:08:35 | 10160000 | 1 | 10144000.0 | 10058654.0 |
| 46 | 2018-05-01 04:15:59 | 10154000 | 1 | 10137000.0 | 3889905.0 |
| 47 | 2018-05-01 04:20:00 | 10155000 | 0 | 10137000.0 | 4694330.0 |
| 52 | 2018-05-01 04:56:41 | 10179000 | 0 | 10161000.0 | 6163763.0 |
| 57 | 2018-05-01 06:50:41 | 10170000 | 0 | 10157000.0 | -1347888.0 |
| 62 | 2018-05-01 06:54:09 | 10161000 | 1 | 10151000.0 | 4052992.0 |
| 80 | 2018-05-01 07:19:21 | 10216000 | 1 | 10189000.0 | 17147702.0 |
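
Because the split is random, the rows that end up in each set change on every run. A minimal sketch of a reproducible variant follows. It assumes a hypothetical 912-row frame, which matches the 820 + 92 samples reported above, and uses an arbitrary seed of 42:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 912 rows, matching the 820 + 92 samples reported above
demo = pd.DataFrame({'price': np.arange(912)})

# A seeded generator makes the split reproducible across runs
rng = np.random.default_rng(42)
train_idx = rng.choice(demo.index.to_numpy(), size=int(len(demo) * 0.9), replace=False)

# Select training rows by label; the remaining rows form the test set
train_demo = demo.loc[train_idx]
test_demo = demo.drop(train_idx)
print(len(train_demo), len(test_demo))
```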


array([[0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.]], dtype=float32)
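
The array above has the form of one-hot encoded binary labels. The cell that produced it is not shown, so the following is only an assumption about how such an array can be built, here with plain NumPy and hypothetical labels chosen to reproduce the same rows:

```python
import numpy as np

# Hypothetical binary labels chosen so the result matches the array above
labels = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 1])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(2, dtype=np.float32)[labels]
print(one_hot[:3])
```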