# CAPSTONE 3. Predicting Next Cryptocurrency Market Cycle Peak
## Data Wrangling

In this notebook we will perform data wrangling for our project. We will:<br>
<ol>1. Retrieve historical data for Bitcoin and nine other major cryptocurrencies:<br>
    <ol><i>1.1. Bitcoin (<b>BTC</b>)<br>
        1.2. Ethereum (<b>ETH</b>)<br>
        1.3. XRP (<b>XRP</b>)<br>
        1.4. Cardano (<b>ADA</b>)<br>
        1.5. Litecoin (<b>LTC</b>)<br>
        1.6. Bitcoin Cash (<b>BCH</b>)<br>
        1.7. Binance Coin (<b>BNB</b>)<br>
        1.8. Stellar (<b>XLM</b>)<br>
        1.9. EOS (<b>EOS</b>)<br>
        1.10. Tezos (<b>XTZ</b>)<br></i>
    </ol>
    2. Organize the data and make sure it is well defined and ready for the next step: Exploratory Data Analysis
</ol>

In [1]:
#importing all the necessary modules and libraries
import pandas as pd
import os
import glob
import seaborn as sns

In [3]:
#reading the dataset into a dataframe and sorting the data by date (newest first)
df = pd.read_csv('../datasets/consolidated_coin_data.csv', parse_dates=True, index_col='Date').sort_values(by='Date', ascending=False)

In [6]:
#checking what the df looks like
df.head()

Date,Currency,Open,High,Low,Close,Volume,Market Cap
2019-12-04,tezos,1.29,1.32,1.25,1.25,46048752,824588509
2019-12-04,binance-coin,15.35,15.69,15.01,15.28,237605471,2376597490
2019-12-04,bitcoin-sv,96.0,100.91,94.51,95.44,492295285,1724375560
2019-12-04,cardano,0.037906,0.038533,0.03655,0.037405,51692274,969802335
2019-12-04,ethereum,147.92,150.68,145.0,146.75,7865937094,15966157442


We parsed the dates and set them as the index of our dataframe, sorted from newest to oldest.
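The preview above includes `bitcoin-sv`, which is not one of the ten coins listed at the top, so it may be worth checking which currencies the file contains and what date range each one covers. The following is a minimal sketch on a toy frame that mimics the layout of `df` (dates in the index, a `Currency` column). The helper name `coverage_by_currency` and the toy values are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame with the same layout as df: Date index plus a Currency column.
# The rows are illustrative only.
toy = pd.DataFrame(
    {
        "Currency": ["tezos", "tezos", "ethereum", "bitcoin-sv"],
        "Close": ["1.25", "1.29", "146.75", "95.44"],
    },
    index=pd.to_datetime(["2019-12-04", "2019-12-03", "2019-12-04", "2019-12-04"]),
)
toy.index.name = "Date"


def coverage_by_currency(frame):
    """Return the row count, first date and last date for each currency."""
    tmp = frame.reset_index()  # move Date from the index into a regular column
    return tmp.groupby("Currency")["Date"].agg(["count", "min", "max"])


coverage = coverage_by_currency(toy)
print(coverage)
```

On the real data the same call would be `coverage_by_currency(df)`; any currency outside the target list could then be filtered out with `df[df['Currency'].isin(wanted)]`, where `wanted` is a list of the names as they actually appear in the file.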

In [12]:
#overall look at the data
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 28944 entries, 2019-12-04 to 2013-04-28
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Currency    28944 non-null  object
 1   Open        28944 non-null  object
 2   High        28944 non-null  object
 3   Low         28944 non-null  object
 4   Close       28944 non-null  object
 5   Volume      28944 non-null  object
 6   Market Cap  28944 non-null  object
dtypes: object(7)
memory usage: 1.8+ MB


Even though the dataset consists mostly of numbers, every column is stored with the `object` dtype rather than a numeric one. We will need to convert the numeric columns to floats later. The index is already a `DatetimeIndex`.
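One possible way to do that conversion is sketched below. It assumes the values are strings that may contain thousands separators (commas), which is a common reason for numeric columns to load as `object`; that cause is not confirmed for this dataset. The helper name `to_float_columns` and the sample rows are hypothetical.

```python
import pandas as pd

# Columns from df.info() that should hold numbers.
NUMERIC_COLS = ["Open", "High", "Low", "Close", "Volume", "Market Cap"]


def to_float_columns(frame, columns=NUMERIC_COLS):
    """Return a copy of frame with the given columns converted to float64."""
    out = frame.copy()
    for col in columns:
        # Assumption: values are strings that may contain commas as thousands separators.
        cleaned = out[col].astype(str).str.replace(",", "", regex=False)
        # errors="coerce" turns anything unparseable into NaN so it can be inspected later.
        out[col] = pd.to_numeric(cleaned, errors="coerce").astype("float64")
    return out


# Illustrative rows only; the comma formatting is an assumption.
sample = pd.DataFrame(
    {
        "Currency": ["tezos", "ethereum"],
        "Open": ["1.29", "147.92"],
        "High": ["1.32", "150.68"],
        "Low": ["1.25", "145.0"],
        "Close": ["1.25", "146.75"],
        "Volume": ["46,048,752", "7865937094"],
        "Market Cap": ["824,588,509", "15966157442"],
    }
)

converted = to_float_columns(sample)
print(converted.dtypes)
```

After converting the real dataframe, running `isnull()` again would show whether any value failed to parse.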

In [8]:
#looking how many observations and features we have
df.shape

(28944, 7)

Now let's see if we have any missing data.

In [10]:
df.isnull().values.any()

False

We don't have any missing values in the dataframe, which is great. Now let's find out if we have duplicated observations.

In [11]:
df.duplicated().values.any()

False

No duplicates. The dataset looks well organized so far.
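Note that `df.duplicated()` compares column values only and ignores the index, so two rows for the same coin on the same date with different prices would not be flagged. A stricter check is sketched below on a toy frame; the helper name `duplicated_date_currency` and the sample values are assumptions for illustration.

```python
import pandas as pd


def duplicated_date_currency(frame):
    """Return all rows that share a (Date, Currency) pair with another row."""
    tmp = frame.reset_index()  # bring Date into the columns so it can be compared
    mask = tmp.duplicated(subset=["Date", "Currency"], keep=False)
    return tmp[mask]


# Toy data: tezos appears twice on the same date with different closing prices.
toy = pd.DataFrame(
    {
        "Currency": ["tezos", "tezos", "ethereum"],
        "Close": ["1.25", "1.26", "146.75"],
    },
    index=pd.to_datetime(["2019-12-04", "2019-12-04", "2019-12-04"]),
)
toy.index.name = "Date"

dupes = duplicated_date_currency(toy)
print(dupes)
```

Here the plain `toy.duplicated()` reports nothing, while the `(Date, Currency)` check returns both tezos rows.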

In [14]:
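#saving the dataframe for the next step (Exploratory Data Analysis)
datapath = 'D://Tutorials/SDST/My Projects/Capstone3/DW'
#makedirs also creates missing parent folders; exist_ok avoids an error if the folder already exists
os.makedirs(datapath, exist_ok=True)
datapath_DW = os.path.join(datapath, 'Data_Wrangling.csv')
#write the file only if it does not exist yet; keep the index because it holds the dates
if not os.path.exists(datapath_DW):
    df.to_csv(datapath_DW)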