# Cryptocurrency Closing Price Prediction

File name: CryptoCurrencyAI.ipynb

Author: kogni7

Date: September 2021

## Contents
* 1 Preparation
* 2 Data
* 3 Training
* 4 Prediction and Submission

This notebook uses only the data sets provided by ZINDI, which contain information about a cryptocurrency. No external features are added. The task is to predict the closing price of the cryptocurrency.

The file system for this project is:

* CryptoCurrencyAI (root)
    * CryptoCurrencyAI_SVR.ipynb (this notebook)
    * Data
        * Train.csv
        * Test.csv
        * SampleSubmission.csv
    * Submission
        * 1 - x: Submission directories named by the version number
            * submission.csv
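
As a rough sketch, the layout above can be expressed with `pathlib`. The helper name `project_paths` and the relative root used in the example are assumptions made for illustration; the notebook itself builds its paths by string concatenation from `WD`.

```python
from pathlib import Path


def project_paths(root, version):
    """Return the data and submission paths for the layout listed above.

    `project_paths` is a hypothetical helper, not part of the notebook.
    """
    root = Path(root)
    return {
        "train": root / "Data" / "Train.csv",
        "test": root / "Data" / "Test.csv",
        "sample_submission": root / "Data" / "SampleSubmission.csv",
        "submission": root / "Submission" / str(version) / "submission.csv",
    }


paths = project_paths("CryptoCurrencyAI", "SVR_13")
for name, path in paths.items():
    print(name, path.as_posix())
```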

This Jupyter notebook runs in Google Colab without special configuration. The GPU is disabled.

This notebook uses a Support Vector Regression based approach: a linear SVR (`LinearSVR` from scikit-learn) fitted on standardized features.

This notebook builds on ideas from the Starter Notebook.
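
To make the SVR approach more concrete, here is a minimal sketch of the per-sample loss selected later in the notebook by `LOSS = 'squared_epsilon_insensitive'`. The function name and the toy values are assumptions for illustration only. Residuals smaller than `epsilon` cost nothing, and larger ones are penalized quadratically beyond that margin.

```python
import numpy as np


def squared_epsilon_insensitive(y_true, y_pred, epsilon):
    """Per-sample loss: max(0, |y - f| - epsilon) ** 2 (illustrative helper)."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon) ** 2


# Toy values, invented for illustration
y_true = np.array([10.0, 10.0, 10.0])
y_pred = np.array([10.005, 10.5, 8.0])

losses = squared_epsilon_insensitive(y_true, y_pred, epsilon=0.01)
print(losses)
```

Because the pipeline standardizes only the features and not the target, `EPSILON = 0.01` is expressed in price units, so in practice almost every residual falls outside the insensitive zone.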

## 1 Preparation
### Time

In [1]:
import time
start_time = time.time()

### Libraries and Seed

In [2]:
# Seed, Libraries
SEED = 42

# Math
import numpy as np
print("Numpy Version: " + str(np.__version__))

import random
import os
os.environ['PYTHONHASHSEED'] = str(SEED)

np.random.seed(SEED)

random.seed(SEED)

# CSV
import pandas as pd
print("Pandas Version: " + str(pd.__version__))

# Machine Learning
import sklearn
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR
print("SciKit-Learn Version: " + str(sklearn.__version__))

from tqdm import tqdm
import gc

Numpy Version: 1.19.5
Pandas Version: 1.1.5
SciKit-Learn Version: 0.22.2.post1


### Parameters

In [3]:
CV = 5

EPSILON = 0.01
TOLERANCE = 1e-5
C = 75
LOSS = 'squared_epsilon_insensitive'
MAX_ITER = 10000

# The Version
VERSION = "SVR_13"

# for use in Google Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
WD = os.getcwd() + "/drive/My Drive/CryptoCurrencyAI"
print(WD)

/content/drive/My Drive/CryptoCurrencyAI


## 2 Data

In [5]:
train_csv = pd.read_csv(WD + "/Data/Train.csv")
train_csv.head()

Unnamed: 0,id,asset_id,open,high,low,volume,market_cap,url_shares,unique_url_shares,reddit_posts,reddit_posts_score,reddit_comments,reddit_comments_score,tweets,tweet_spam,tweet_followers,tweet_quotes,tweet_retweets,tweet_replies,tweet_favorites,tweet_sentiment1,tweet_sentiment2,tweet_sentiment3,tweet_sentiment4,tweet_sentiment5,tweet_sentiment_impact1,tweet_sentiment_impact2,tweet_sentiment_impact3,tweet_sentiment_impact4,tweet_sentiment_impact5,social_score,average_sentiment,news,price_score,social_impact_score,correlation_rank,galaxy_score,volatility,market_cap_rank,percent_change_24h_rank,volume_24h_rank,social_volume_24h_rank,social_score_24h_rank,medium,youtube,social_volume,percent_change_24h,market_cap_global,close
0,ID_322qz6,1,9422.849081,9428.490628,9422.849081,713198600.0,173763500000.0,1689.0,817.0,55.0,105.0,61.0,271.0,3420.0,1671.0,11675867.0,39.0,1343.0,448.0,2237.0,124.0,330.0,331.0,2515.0,120.0,506133.0,1326610.0,1159677.0,8406185.0,281329.0,11681999.0,3.6,69.0,2.7,3.6,3.3,66.0,0.007118,1.0,606.0,2.0,1.0,1.0,2.0,5.0,4422,1.434516,281806600000.0,9428.279323
1,ID_3239o9,1,7985.359278,7992.059917,7967.567267,400475500.0,142694200000.0,920.0,544.0,20.0,531.0,103.0,533.0,1491.0,242.0,5917814.0,195.0,1070.0,671.0,3888.0,1.0,52.0,315.0,1100.0,23.0,1320.0,381117.0,1706376.0,3754815.0,80010.0,5924770.0,3.7,1.0,2.0,2.0,1.0,43.5,0.009419,1.0,,,,,,,2159,-2.459507,212689700000.0,7967.567267
2,ID_323J9k,1,49202.033778,49394.593518,49068.057046,3017729000.0,916697700000.0,1446.0,975.0,72.0,1152.0,187.0,905.0,9346.0,4013.0,47778746.0,104.0,2014.0,1099.0,11476.0,331.0,923.0,864.0,6786.0,442.0,9848462.0,5178557.0,2145663.0,25510267.0,5110490.0,47796942.0,3.7,22.0,3.1,3.0,3.3,65.5,0.01353,1.0,692.0,3.0,1.0,1.0,,,10602,4.942448,1530712000000.0,49120.738484
3,ID_323y5P,1,,,,,,,,17.0,424.0,268.0,443.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,285,,,
4,ID_324kJH,1,10535.737119,10535.737119,10384.798216,1150053000.0,192118300000.0,1012.0,638.0,24.0,42.0,50.0,173.0,3262.0,1652.0,14422172.0,21.0,511.0,190.0,2284.0,86.0,280.0,443.0,2284.0,169.0,311017.0,1977833.0,731277.0,10964321.0,440730.0,14426405.0,3.7,22.0,4.7,3.8,4.4,83.0,0.010332,1.0,749.0,2.0,1.0,1.0,,2.0,3996,2.609576,338692500000.0,10384.798216


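As in the Starter Notebook, we impute missing values (NAs) with 0. Note that this also sets a missing `close` target to 0 (see row 3 in the output below).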

In [6]:
train_csv = train_csv.fillna(0)
train_csv.head()

Unnamed: 0,id,asset_id,open,high,low,volume,market_cap,url_shares,unique_url_shares,reddit_posts,reddit_posts_score,reddit_comments,reddit_comments_score,tweets,tweet_spam,tweet_followers,tweet_quotes,tweet_retweets,tweet_replies,tweet_favorites,tweet_sentiment1,tweet_sentiment2,tweet_sentiment3,tweet_sentiment4,tweet_sentiment5,tweet_sentiment_impact1,tweet_sentiment_impact2,tweet_sentiment_impact3,tweet_sentiment_impact4,tweet_sentiment_impact5,social_score,average_sentiment,news,price_score,social_impact_score,correlation_rank,galaxy_score,volatility,market_cap_rank,percent_change_24h_rank,volume_24h_rank,social_volume_24h_rank,social_score_24h_rank,medium,youtube,social_volume,percent_change_24h,market_cap_global,close
0,ID_322qz6,1,9422.849081,9428.490628,9422.849081,713198600.0,173763500000.0,1689.0,817.0,55.0,105.0,61.0,271.0,3420.0,1671.0,11675867.0,39.0,1343.0,448.0,2237.0,124.0,330.0,331.0,2515.0,120.0,506133.0,1326610.0,1159677.0,8406185.0,281329.0,11681999.0,3.6,69.0,2.7,3.6,3.3,66.0,0.007118,1.0,606.0,2.0,1.0,1.0,2.0,5.0,4422,1.434516,281806600000.0,9428.279323
1,ID_3239o9,1,7985.359278,7992.059917,7967.567267,400475500.0,142694200000.0,920.0,544.0,20.0,531.0,103.0,533.0,1491.0,242.0,5917814.0,195.0,1070.0,671.0,3888.0,1.0,52.0,315.0,1100.0,23.0,1320.0,381117.0,1706376.0,3754815.0,80010.0,5924770.0,3.7,1.0,2.0,2.0,1.0,43.5,0.009419,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2159,-2.459507,212689700000.0,7967.567267
2,ID_323J9k,1,49202.033778,49394.593518,49068.057046,3017729000.0,916697700000.0,1446.0,975.0,72.0,1152.0,187.0,905.0,9346.0,4013.0,47778746.0,104.0,2014.0,1099.0,11476.0,331.0,923.0,864.0,6786.0,442.0,9848462.0,5178557.0,2145663.0,25510267.0,5110490.0,47796942.0,3.7,22.0,3.1,3.0,3.3,65.5,0.01353,1.0,692.0,3.0,1.0,1.0,0.0,0.0,10602,4.942448,1530712000000.0,49120.738484
3,ID_323y5P,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.0,424.0,268.0,443.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,285,0.0,0.0,0.0
4,ID_324kJH,1,10535.737119,10535.737119,10384.798216,1150053000.0,192118300000.0,1012.0,638.0,24.0,42.0,50.0,173.0,3262.0,1652.0,14422172.0,21.0,511.0,190.0,2284.0,86.0,280.0,443.0,2284.0,169.0,311017.0,1977833.0,731277.0,10964321.0,440730.0,14426405.0,3.7,22.0,4.7,3.8,4.4,83.0,0.010332,1.0,749.0,2.0,1.0,1.0,0.0,2.0,3996,2.609576,338692500000.0,10384.798216
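
A minimal sketch of what zero imputation does, on a hypothetical toy frame (the values are invented; only the column names mirror Train.csv). It also shows one possible variation that is not used in this notebook: recording which rows were missing before filling, so that a filled 0 can still be told apart from a genuine 0.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names mirror Train.csv
toy = pd.DataFrame({
    "open": [9422.8, np.nan, 10535.7],
    "volume": [7.1e8, np.nan, 1.15e9],
    "reddit_posts": [55.0, 17.0, 24.0],
})

# Optional variation (not in the notebook): flag rows where `open` was missing
open_missing = toy["open"].isna()

# Same call as in the notebook: every NaN becomes 0
filled = toy.fillna(0)

print(filled)
print(open_missing.tolist())
```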


In [7]:
test_csv = pd.read_csv(WD + "/Data/Test.csv")
test_csv.head()

Unnamed: 0,id,asset_id,open,high,low,volume,market_cap,url_shares,unique_url_shares,reddit_posts,reddit_posts_score,reddit_comments,reddit_comments_score,tweets,tweet_spam,tweet_followers,tweet_quotes,tweet_retweets,tweet_replies,tweet_favorites,tweet_sentiment1,tweet_sentiment2,tweet_sentiment3,tweet_sentiment4,tweet_sentiment5,tweet_sentiment_impact1,tweet_sentiment_impact2,tweet_sentiment_impact3,tweet_sentiment_impact4,tweet_sentiment_impact5,social_score,average_sentiment,news,price_score,social_impact_score,correlation_rank,galaxy_score,volatility,market_cap_rank,percent_change_24h_rank,volume_24h_rank,social_volume_24h_rank,social_score_24h_rank,medium,youtube,social_volume,percent_change_24h,market_cap_global
0,ID_323Sn2,1,,,,,,,,7.0,56.0,2.0,11.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,9,,
1,ID_325SNW,1,11335.062188,11351.690956,11335.062188,1064152000.0,210146300000.0,1664.0,1045.0,64.0,213.0,51.0,274.0,6046.0,3034.0,23453171.0,119.0,2305.0,1367.0,6252.0,151.0,565.0,603.0,4553.0,174.0,2900568.0,1898920.0,2268741.0,14056214.0,2338771.0,23465365.0,3.7,39.0,3.2,3.4,2.8,65.5,0.004407,1.0,711.0,2.0,1.0,1.0,1.0,1.0,7245,-0.555698,363105200000.0
2,ID_325uzE,1,6322.560756,6328.362354,6294.714484,1516268000.0,115386200000.0,397.0,255.0,11.0,72.0,30.0,112.0,2404.0,304.0,3831278.0,12.0,346.0,73.0,604.0,39.0,103.0,668.0,1406.0,188.0,29147.0,411178.0,873284.0,2389256.0,129448.0,3832828.0,3.7,2.0,3.0,3.0,3.4,65.5,0.024035,1.0,715.0,2.0,1.0,1.0,,,2702,1.68937,177107500000.0
3,ID_328qCx,1,,,,,,,,8.0,96.0,217.0,244.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,225,,
4,ID_3293uJ,1,,,,,,,,26.0,49.0,33.0,38.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,59,,


As with the training data, we impute missing values (NAs) in the test set with 0.

In [8]:
test_csv = test_csv.fillna(0)
test_csv.head()

Unnamed: 0,id,asset_id,open,high,low,volume,market_cap,url_shares,unique_url_shares,reddit_posts,reddit_posts_score,reddit_comments,reddit_comments_score,tweets,tweet_spam,tweet_followers,tweet_quotes,tweet_retweets,tweet_replies,tweet_favorites,tweet_sentiment1,tweet_sentiment2,tweet_sentiment3,tweet_sentiment4,tweet_sentiment5,tweet_sentiment_impact1,tweet_sentiment_impact2,tweet_sentiment_impact3,tweet_sentiment_impact4,tweet_sentiment_impact5,social_score,average_sentiment,news,price_score,social_impact_score,correlation_rank,galaxy_score,volatility,market_cap_rank,percent_change_24h_rank,volume_24h_rank,social_volume_24h_rank,social_score_24h_rank,medium,youtube,social_volume,percent_change_24h,market_cap_global
0,ID_323Sn2,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,56.0,2.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9,0.0,0.0
1,ID_325SNW,1,11335.062188,11351.690956,11335.062188,1064152000.0,210146300000.0,1664.0,1045.0,64.0,213.0,51.0,274.0,6046.0,3034.0,23453171.0,119.0,2305.0,1367.0,6252.0,151.0,565.0,603.0,4553.0,174.0,2900568.0,1898920.0,2268741.0,14056214.0,2338771.0,23465365.0,3.7,39.0,3.2,3.4,2.8,65.5,0.004407,1.0,711.0,2.0,1.0,1.0,1.0,1.0,7245,-0.555698,363105200000.0
2,ID_325uzE,1,6322.560756,6328.362354,6294.714484,1516268000.0,115386200000.0,397.0,255.0,11.0,72.0,30.0,112.0,2404.0,304.0,3831278.0,12.0,346.0,73.0,604.0,39.0,103.0,668.0,1406.0,188.0,29147.0,411178.0,873284.0,2389256.0,129448.0,3832828.0,3.7,2.0,3.0,3.0,3.4,65.5,0.024035,1.0,715.0,2.0,1.0,1.0,0.0,0.0,2702,1.68937,177107500000.0
3,ID_328qCx,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,96.0,217.0,244.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,225,0.0,0.0
4,ID_3293uJ,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,26.0,49.0,33.0,38.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,59,0.0,0.0


In [9]:
sample_submission_csv = pd.read_csv(WD + "/Data/SampleSubmission.csv")
sample_submission_csv.head()

Unnamed: 0,id,close
0,ID_323Sn2,0
1,ID_325SNW,0
2,ID_325uzE,0
3,ID_328qCx,0
4,ID_3293uJ,0


## 3 Training

In [10]:
cv = 1

STATISTICS = {}
STATISTICS['TRAIN_RMSE'] = np.zeros(CV)
STATISTICS['VALIDATION_RMSE'] = np.zeros(CV)

MODELS = []

for train, val in KFold(n_splits=CV, shuffle=True, random_state=SEED).split(train_csv.id):
    print("Run {} of {}.".format(cv, CV))

    # Data
    features = train_csv.drop(columns=['id', 'close'])
    target = train_csv.close.astype(float)

    X_train = np.array(features.iloc[train])
    y_train = np.array(target.iloc[train])

    X_val = np.array(features.iloc[val])
    y_val = np.array(target.iloc[val])

    # Fit
    pipe = make_pipeline(StandardScaler(), LinearSVR(epsilon=EPSILON, tol=TOLERANCE, C=C, loss=LOSS, random_state=SEED, max_iter=MAX_ITER))
    pipe.fit(X_train, y_train)

    # RMSE
    y_pred = pipe.predict(X_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, y_pred))

    y_val_pred = pipe.predict(X_val)
    val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))

    STATISTICS['TRAIN_RMSE'][cv-1] = train_rmse
    STATISTICS['VALIDATION_RMSE'][cv-1] = val_rmse

    MODELS.append(pipe)

    del train_rmse, val_rmse, features, target, X_train, y_train, X_val, y_val, pipe, y_pred, y_val_pred
    gc.collect()
    cv += 1

print("Result:")
for cv in range(CV):
    print("TRAINING: {:.1f}; VALIDATION: {:.1f}".format(STATISTICS['TRAIN_RMSE'][cv], STATISTICS['VALIDATION_RMSE'][cv]))
print("\n")
print("TRAINING: {:.1f}; VALIDATION: {:.1f}".format(np.mean(STATISTICS['TRAIN_RMSE']), np.mean(STATISTICS['VALIDATION_RMSE'])))

Run 1 of 5.
Run 2 of 5.
Run 3 of 5.
Run 4 of 5.
Run 5 of 5.
Result:
TRAINING: 54.2; VALIDATION: 76.4
TRAINING: 61.1; VALIDATION: 61.4
TRAINING: 101.1; VALIDATION: 92.0
TRAINING: 60.6; VALIDATION: 52.6
TRAINING: 61.6; VALIDATION: 55.8


TRAINING: 67.7; VALIDATION: 67.6
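
The last line of this output is the arithmetic mean of the five per-fold RMSE values. As a worked check, the sketch below recomputes it from the rounded validation values printed above and compares it with a pooled RMSE (the square root of the mean squared per-fold RMSE). The pooled figure assumes equally sized folds, which holds only approximately here, and both numbers inherit the rounding of the printed values.

```python
import math

# Rounded per-fold validation RMSE values copied from the output above
val_rmse = [76.4, 61.4, 92.0, 52.6, 55.8]

# Mean of per-fold RMSEs, as reported by the notebook
mean_rmse = sum(val_rmse) / len(val_rmse)

# Pooled RMSE under the assumption of equally sized folds
pooled_rmse = math.sqrt(sum(r ** 2 for r in val_rmse) / len(val_rmse))

print("mean of fold RMSEs: {:.2f}".format(mean_rmse))
print("pooled RMSE:        {:.2f}".format(pooled_rmse))
```

The pooled value is never smaller than the mean of the fold values, and it weights the high-error third fold more heavily.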




## 4 Prediction and Submission

In [11]:
predictions = np.zeros(len(sample_submission_csv))

for cv in range(CV):
    # Data
    features = test_csv.drop(columns=['id'])
    X_test = np.array(features)

    # Model
    pipe = MODELS[cv]

    # Prediction
    y_pred = pipe.predict(X_test)
    predictions += y_pred

    del features, X_test, pipe, y_pred
    gc.collect()

sample_submission_csv['close'] = predictions / CV

# Create the version directory; exist_ok avoids an error when the cell is re-run
os.makedirs(WD + '/Submission/' + str(VERSION), exist_ok=True)
sample_submission_csv.to_csv(WD + '/Submission/' + str(VERSION) + '/submission.csv', index=False)
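
The loop above averages the predictions of the five fold models with equal weights. Here is a small sketch of the same idea, using hypothetical stand-in models (plain functions instead of fitted pipelines) so that it runs without the competition data:

```python
import numpy as np

# Hypothetical stand-ins for the fitted fold pipelines: each maps X to predictions
fold_models = [lambda X, b=b: X.sum(axis=1) + b for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]

X_test = np.array([[1.0, 2.0], [3.0, 4.0]])

# Running sum divided by the number of models, as in the notebook
running = np.zeros(len(X_test))
for model in fold_models:
    running += model(X_test)
running /= len(fold_models)

# Equivalent: stack the per-model predictions and average over the model axis
stacked = np.mean([model(X_test) for model in fold_models], axis=0)

print(running, stacked)
```

Both forms give the same result. The running sum needs less memory, while the stacked form keeps the per-model predictions available for inspection.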

In [12]:
sample_submission_csv

Unnamed: 0,id,close
0,ID_323Sn2,14.069649
1,ID_325SNW,11342.195215
2,ID_325uzE,6311.218819
3,ID_328qCx,11.473426
4,ID_3293uJ,13.432424
...,...,...
6217,ID_zufSPk,8263.159983
6218,ID_zuz9yf,10849.906365
6219,ID_zvrMSX,11.991896
6220,ID_zy9Cfv,14.079326


In [13]:
drive.flush_and_unmount()

In [14]:
end_time = time.time()
print("Runtime of the Notebook: {} min".format(np.round((end_time - start_time) / 60, 2)))

Runtime of the Notebook: 2.4 min
