# CAPSTONE 3. Predicting Next Cryptocurrency Market Cycle Peak
## Data Wrangling

In this notebook we will perform data wrangling for our project. We will:<br>
<ol>1. Retrieve historical data for four major cryptocurrencies:<br>
    <ol><i>1.1. Bitcoin (<b>BTC</b>)<br>
        1.2. Ethereum (<b>ETH</b>)<br>
        1.3. XRP (<b>XRP</b>)<br>
        1.4. Litecoin (<b>LTC</b>)<br>
    </ol>
    2. Organize it and make sure it's well defined and ready for the next step - Exploratory Data Analysis
</ol>

In [1]:
#importing all the necessary modules and libraries
import pandas as pd
import os
import glob
from functools import reduce

First, let's read all the data we downloaded from Yahoo Finance.

In [2]:
#creating one dataframe for each token
df_BTC = pd.read_csv('../datasets/BTC-USD.csv', parse_dates=True).sort_values(by='Date')
df_ETH = pd.read_csv('../datasets/ETH-USD.csv', parse_dates=True).sort_values(by='Date')
df_XRP = pd.read_csv('../datasets/XRP-USD.csv', parse_dates=True).sort_values(by='Date')
df_LTC = pd.read_csv('../datasets/LTC-USD.csv', parse_dates=True).sort_values(by='Date')

In [3]:
#merging dataframes on 'Date' and adding suffixes for better readability
df_1 = pd.merge(df_BTC, df_ETH, on='Date', suffixes=('_BTC', '_ETH'))
df_2 = pd.merge(df_1, df_XRP, on='Date', suffixes=(None, '_XRP'))
df_merged = pd.merge(df_2, df_LTC, on='Date', suffixes=(None, '_LTC'))
df_merged.head(1)

Unnamed: 0,Date,Open_BTC,High_BTC,Low_BTC,Close_BTC,Adj Close_BTC,Volume_BTC,Open_ETH,High_ETH,Low_ETH,...,Low,Close,Adj Close,Volume,Open_LTC,High_LTC,Low_LTC,Close_LTC,Adj Close_LTC,Volume_LTC
0,2015-08-07,278.740997,280.391998,276.365997,279.584991,279.584991,42484800.0,2.83162,3.53661,2.52112,...,0.007989,0.008152,0.008152,363643.0,4.06334,4.22069,3.97027,4.20828,4.20828,4192810.0


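The suffix for XRP was not added. `merge` only applies suffixes to column names that exist in both dataframes, and after the first merge the left dataframe had no unsuffixed 'Open', 'High', etc. left to clash with the XRP columns. Let's fix it.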
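A minimal sketch on toy data that reproduces the missing suffix. The frame names and values are made up for illustration; the point is that pandas applies `suffixes` only to column names present in both inputs.

```python
import pandas as pd

# Toy stand-ins for three price tables (values are made up)
btc = pd.DataFrame({'Date': ['2015-08-07'], 'Open': [278.7]})
eth = pd.DataFrame({'Date': ['2015-08-07'], 'Open': [2.8]})
xrp = pd.DataFrame({'Date': ['2015-08-07'], 'Open': [0.008]})

# 'Open' exists in both frames, so both suffixes are applied
step_1 = pd.merge(btc, eth, on='Date', suffixes=('_BTC', '_ETH'))

# step_1 no longer has a column called 'Open', so nothing overlaps
# with xrp and the '_XRP' suffix is never used
step_2 = pd.merge(step_1, xrp, on='Date', suffixes=(None, '_XRP'))

print(list(step_2.columns))
# ['Date', 'Open_BTC', 'Open_ETH', 'Open']
```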

In [4]:
#creating a dictionary with old and new names
mapper = {'Open':'Open_XRP', 'High':'High_XRP', 'Low':'Low_XRP', 'Close':'Close_XRP', 'Adj Close':'Adj Close_XRP', 'Volume':'Volume_XRP'}
df = df_merged.rename(mapper, axis=1)
print(df.columns)
df.head()

Index(['Date', 'Open_BTC', 'High_BTC', 'Low_BTC', 'Close_BTC', 'Adj Close_BTC',
       'Volume_BTC', 'Open_ETH', 'High_ETH', 'Low_ETH', 'Close_ETH',
       'Adj Close_ETH', 'Volume_ETH', 'Open_XRP', 'High_XRP', 'Low_XRP',
       'Close_XRP', 'Adj Close_XRP', 'Volume_XRP', 'Open_LTC', 'High_LTC',
       'Low_LTC', 'Close_LTC', 'Adj Close_LTC', 'Volume_LTC'],
      dtype='object')


Unnamed: 0,Date,Open_BTC,High_BTC,Low_BTC,Close_BTC,Adj Close_BTC,Volume_BTC,Open_ETH,High_ETH,Low_ETH,...,Low_XRP,Close_XRP,Adj Close_XRP,Volume_XRP,Open_LTC,High_LTC,Low_LTC,Close_LTC,Adj Close_LTC,Volume_LTC
0,2015-08-07,278.740997,280.391998,276.365997,279.584991,279.584991,42484800.0,2.83162,3.53661,2.52112,...,0.007989,0.008152,0.008152,363643.0,4.06334,4.22069,3.97027,4.20828,4.20828,4192810.0
1,2015-08-08,279.742004,279.928009,260.709991,260.997009,260.997009,58533000.0,2.79376,2.79881,0.714725,...,0.008164,0.008476,0.008476,678295.0,4.22099,4.22364,3.83542,3.85475,3.85475,4917730.0
2,2015-08-09,261.115997,267.002991,260.467987,265.083008,265.083008,23789600.0,0.706136,0.87981,0.629191,...,0.008472,0.008808,0.008808,531969.0,3.84339,3.98426,3.81139,3.89859,3.89859,3064680.0
3,2015-08-10,265.477997,267.032013,262.596008,264.470001,264.470001,20979400.0,0.713989,0.729854,0.636546,...,0.008746,0.00875,0.00875,472973.0,3.9008,3.98013,3.89761,3.94888,3.94888,2239890.0
4,2015-08-11,264.34201,270.385986,264.093994,270.385986,270.385986,25433900.0,0.708087,1.13141,0.663235,...,0.008591,0.008591,0.008591,282461.0,3.94874,4.15955,3.94295,4.15955,4.15955,3426300.0
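Renaming after the merge works, but one possible alternative is to tag every column with its token before merging, so that the result does not depend on which names happen to overlap. The sketch below uses toy frames; the `frames` dictionary and its values are hypothetical and only stand in for the four downloaded files.

```python
from functools import reduce
import pandas as pd

# Toy stand-ins for the per-token frames (values are made up)
frames = {
    'BTC': pd.DataFrame({'Date': ['2015-08-07', '2015-08-08'], 'Close': [279.58, 261.00]}),
    'XRP': pd.DataFrame({'Date': ['2015-08-07', '2015-08-08'], 'Close': [0.008152, 0.008476]}),
    'LTC': pd.DataFrame({'Date': ['2015-08-07', '2015-08-08'], 'Close': [4.20828, 3.85475]}),
}

# Index by Date, then tag every remaining column with its token
tagged = [
    frame.set_index('Date').add_suffix(f'_{token}')
    for token, frame in frames.items()
]

# Inner-join all frames on the Date index in one pass
merged = reduce(lambda left, right: left.join(right, how='inner'), tagged).reset_index()

print(list(merged.columns))
# ['Date', 'Close_BTC', 'Close_XRP', 'Close_LTC']
```

With this approach no rename dictionary is needed, at the cost of one extra step per frame before the merge.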


Let's take a general overview of our dataframe.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1998 entries, 0 to 1997
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           1998 non-null   object 
 1   Open_BTC       1994 non-null   float64
 2   High_BTC       1994 non-null   float64
 3   Low_BTC        1994 non-null   float64
 4   Close_BTC      1994 non-null   float64
 5   Adj Close_BTC  1994 non-null   float64
 6   Volume_BTC     1994 non-null   float64
 7   Open_ETH       1994 non-null   float64
 8   High_ETH       1994 non-null   float64
 9   Low_ETH        1994 non-null   float64
 10  Close_ETH      1994 non-null   float64
 11  Adj Close_ETH  1994 non-null   float64
 12  Volume_ETH     1994 non-null   float64
 13  Open_XRP       1994 non-null   float64
 14  High_XRP       1994 non-null   float64
 15  Low_XRP        1994 non-null   float64
 16  Close_XRP      1994 non-null   float64
 17  Adj Close_XRP  1994 non-null   float64
 18  Volume_X

Okay, now we have the correct names for our columns. All price and volume columns are already stored as float64, so no numeric conversion is needed. The Date column, however, is stored as 'object' (plain strings): `parse_dates=True` only tries to parse the index, not the Date column, so we will need to convert it to datetime later.
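A short sketch of two ways to get real datetimes for the Date column, shown on a tiny in-memory CSV rather than the downloaded files. The CSV text and variable names are made up for illustration.

```python
import io
import pandas as pd

# Tiny CSV in the same layout as the downloaded files (values are made up)
csv_text = "Date,Close\n2015-08-08,260.99\n2015-08-07,279.58\n"

# parse_dates=True only tries to parse the index, so 'Date' stays a string column
as_text = pd.read_csv(io.StringIO(csv_text), parse_dates=True)

# Option 1: name the column explicitly while reading
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date']).sort_values(by='Date')

# Option 2: convert the column of an already loaded frame
converted = as_text.assign(Date=pd.to_datetime(as_text['Date']))

print(as_dates.dtypes)
print(converted.dtypes)
```

Either option could be applied to the merged dataframe here or at the start of the EDA notebook.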

In [6]:
#looking how many observations and features we have
df.shape

(1998, 25)

We have 1998 observations and 25 features.

Now let's see if we have any missing data.

In [7]:
df.isnull().values.any()

True

We have null values. Let's find out how many.

In [8]:
df.isnull().value_counts()

Date   Open_BTC  High_BTC  Low_BTC  Close_BTC  Adj Close_BTC  Volume_BTC  Open_ETH  High_ETH  Low_ETH  Close_ETH  Adj Close_ETH  Volume_ETH  Open_XRP  High_XRP  Low_XRP  Close_XRP  Adj Close_XRP  Volume_XRP  Open_LTC  High_LTC  Low_LTC  Close_LTC  Adj Close_LTC  Volume_LTC
False  False     False     False    False      False          False       False     False     False    False      False          False       False     False     False    False      False          False       False     False     False    False      False          False         1994
       True      True      True     True       True           True        True      True      True     True       True           True        True      True      True     True       True           True        True      True      True     True       True           True             4
dtype: int64
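The pattern table is wide and hard to scan. As a sketch of a more compact summary, the missing cells can be counted per column and the affected rows counted separately. The toy frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one row where every price column is missing (values are made up)
toy = pd.DataFrame({
    'Date': ['2015-08-07', '2015-08-08', '2015-08-09'],
    'Close_BTC': [279.58, np.nan, 265.08],
    'Close_ETH': [2.77, np.nan, 0.70],
})

# Missing cells in each column
nulls_per_column = toy.isnull().sum()

# Rows with at least one missing cell
rows_with_nulls = toy.isnull().any(axis=1).sum()

print(nulls_per_column)
print(rows_with_nulls)
```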

Four rows contain null values, and in each of them every column except Date is missing. That is about 0.2% of the 1998 rows, so we can drop them.

In [9]:
df = df.dropna(axis=0)
df.isnull().any()

Date             False
Open_BTC         False
High_BTC         False
Low_BTC          False
Close_BTC        False
Adj Close_BTC    False
Volume_BTC       False
Open_ETH         False
High_ETH         False
Low_ETH          False
Close_ETH        False
Adj Close_ETH    False
Volume_ETH       False
Open_XRP         False
High_XRP         False
Low_XRP          False
Close_XRP        False
Adj Close_XRP    False
Volume_XRP       False
Open_LTC         False
High_LTC         False
Low_LTC          False
Close_LTC        False
Adj Close_LTC    False
Volume_LTC       False
dtype: bool

In [10]:
#checking data shape once again, should get 1998-4=1994 observations
df.shape

(1994, 25)

No more null values. Now let's find out if we have duplicated observations.

In [11]:
df.duplicated().values.any()

False
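`df.duplicated()` only flags rows that are identical in every column. For a daily time series it may also be worth checking that no date appears twice and seeing which calendar days have no row, since rows were dropped earlier. A minimal sketch on made-up toy data:

```python
import pandas as pd

# Toy frame with one calendar day missing (values are made up)
toy = pd.DataFrame({
    'Date': ['2015-08-07', '2015-08-08', '2015-08-10'],
    'Close_BTC': [279.58, 261.00, 264.47],
})

dates = pd.to_datetime(toy['Date'])

# True if the same day appears in more than one row
has_repeated_dates = dates.duplicated().any()

# Days inside the covered range that have no row
expected_days = pd.date_range(dates.min(), dates.max(), freq='D')
missing_days = expected_days.difference(pd.DatetimeIndex(dates))

print(has_repeated_dates)
print(list(missing_days.strftime('%Y-%m-%d')))
```

Whether gaps need to be filled or can be ignored depends on the analysis planned for the next step.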

No duplicates. Our data is ready for the next step - Exploratory Data Analysis.

In [12]:
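#saving the cleaned data for the EDA notebook (an existing file is not overwritten)
datapath = 'D://Tutorials/SDST/My Projects/Capstone3/DW'
#makedirs also creates missing parent folders and does not fail if the folder exists
os.makedirs(datapath, exist_ok=True)
datapath_DW = os.path.join(datapath, 'Data_for_EDA.csv')
if not os.path.exists(datapath_DW):
    df.to_csv(datapath_DW, index=False)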
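As an optional sanity check, the saved file can be read back and compared with the dataframe in memory. The sketch below uses a temporary folder and a made-up toy frame instead of the project path, so it can be run anywhere.

```python
import os
import tempfile
import pandas as pd

# Toy frame standing in for the cleaned data (values are made up)
toy = pd.DataFrame({
    'Date': ['2015-08-07', '2015-08-08'],
    'Close_BTC': [279.584991, 260.997009],
})

# A temporary folder is used here instead of the project path
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'Data_for_EDA.csv')
    toy.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# Same shape and same column names suggest nothing was lost on the way
print(reloaded.shape == toy.shape)
print(list(reloaded.columns) == list(toy.columns))
```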