# CAPSTONE 3. Predicting Major Cryptocurrencies Prices
## Data Wrangling

In this notebook we will perform data wrangling for our project. We will:<br>
<ol>1. Retrieve historical data for four major cryptocurrencies:<br>
    <ol><i>1.1. Bitcoin (<b>BTC</b>)<br>
        1.2. Ethereum (<b>ETH</b>)<br>
        1.3. XRP (<b>XRP</b>)<br>
        1.4. Litecoin (<b>LTC</b>)</i><br>
    </ol>
    2. Organize it and make sure it's well defined and ready for the next step - Exploratory Data Analysis
</ol>

In [1]:
#importing all the necessary modules and libraries
import pandas as pd
import os
import glob
from functools import reduce
import datetime as dt

First, let's read all the data we downloaded from Yahoo Finance.

In [2]:
#creating one dataframe for each token
df_BTC = pd.read_csv('../datasets/BTC-USD.csv', parse_dates=True).sort_values(by='Date', ascending=False)
df_ETH = pd.read_csv('../datasets/ETH-USD.csv', parse_dates=True).sort_values(by='Date', ascending=False)
df_XRP = pd.read_csv('../datasets/XRP-USD.csv', parse_dates=True).sort_values(by='Date', ascending=False)
df_LTC = pd.read_csv('../datasets/LTC-USD.csv', parse_dates=True).sort_values(by='Date', ascending=False)

In [3]:
df_BTC.head(3)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
1998,2021-01-24,31794.328125,32938.765625,31106.685547,31786.878906,31786.878906,46807680000.0
1997,2021-01-23,,,,,,
1996,2021-01-22,,,,,,
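A side note on `parse_dates=True`: as far as I understand the pandas reader, that value only asks it to try parsing the index, so with the default integer index the 'Date' column is still read as plain strings (which is consistent with the later `pd.to_datetime` step). A minimal sketch of naming the column explicitly instead, using a small in-memory sample with hypothetical values rather than the real files:

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same column layout as the Yahoo Finance files.
sample_csv = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2021-01-23,32000.0,33000.0,31000.0,32100.0,32100.0,100.0\n"
    "2021-01-24,31794.3,32938.7,31106.6,31786.8,31786.8,200.0\n"
)

# Passing a list of column names makes pandas parse those columns while reading.
df_sample = pd.read_csv(sample_csv, parse_dates=["Date"]).sort_values(
    by="Date", ascending=False
)

print(df_sample["Date"].dtype)
```

With the column parsed at read time, sorting by 'Date' compares real timestamps instead of strings.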


Now let's add the token column to each dataframe.

In [4]:
dfs = [df_BTC, df_ETH, df_XRP, df_LTC]
coins = ['BTC', 'ETH', 'XRP', 'LTC']

In [5]:
for df, coin in zip(dfs, coins):
    df['Coin'] = coin

In [6]:
for df in dfs:
    print(df['Coin'][:1])

1998    BTC
Name: Coin, dtype: object
1997    ETH
Name: Coin, dtype: object
1998    XRP
Name: Coin, dtype: object
1998    LTC
Name: Coin, dtype: object


In [7]:
df_BTC.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'Coin'], dtype='object')

Let's move the 'Coin' column so that it comes right after the 'Date' column.

In [8]:
for df in dfs:
    # move 'Coin' to position 1, right after 'Date'
    col = df.pop("Coin")
    df.insert(1, "Coin", col)

In [9]:
for df in dfs:
    print(df.columns)

Index(['Date', 'Coin', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')
Index(['Date', 'Coin', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')
Index(['Date', 'Coin', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')
Index(['Date', 'Coin', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')
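Adding the column and placing it can also be done in one step, because `DataFrame.insert` accepts a scalar value and a target position. A small sketch on toy frames (the `frames` dictionary and its values are hypothetical, not the project data):

```python
import pandas as pd

# Hypothetical stand-ins for the per-coin dataframes.
frames = {
    "BTC": pd.DataFrame({"Date": ["2021-01-24"], "Open": [1.0]}),
    "ETH": pd.DataFrame({"Date": ["2021-01-24"], "Open": [2.0]}),
}

for coin, frame in frames.items():
    # Insert a constant 'Coin' column at position 1, right after 'Date'.
    frame.insert(1, "Coin", coin)

print(list(frames["BTC"].columns))
```

This avoids the separate pop-and-reinsert pass, at the cost of fixing the column position when the label is first added.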


Great. All columns are in the right spots.

We're still missing one important piece of data - each token's total market capitalization. Let's read more datasets from CoinGecko which have that piece.

In [10]:
df_BTC_cap = pd.read_csv(r'D:\Tutorials\DATASETS\GeckoCryptos\btc-usd-max.csv', parse_dates=True, encoding='utf-8')
df_ETH_cap = pd.read_csv(r'D:\Tutorials\DATASETS\GeckoCryptos\eth-usd-max.csv', parse_dates=True, encoding='utf-8')
df_XRP_cap = pd.read_csv(r'D:\Tutorials\DATASETS\GeckoCryptos\xrp-usd-max.csv', parse_dates=True, encoding='utf-8')
df_LTC_cap = pd.read_csv(r'D:\Tutorials\DATASETS\GeckoCryptos\ltc-usd-max.csv', parse_dates=True, encoding='utf-8')

In [11]:
caps = [df_BTC_cap, df_ETH_cap, df_XRP_cap, df_LTC_cap]

In [12]:
df_BTC_cap.head(10)

Unnamed: 0,snapped_at,price,market_cap,total_volume
0,2013-04-28 00:00:00 UTC,135.3,1500518000.0,0.0
1,2013-04-29 00:00:00 UTC,141.96,1575032000.0,0.0
2,2013-04-30 00:00:00 UTC,135.3,1501657000.0,0.0
3,2013-05-01 00:00:00 UTC,117.0,1298952000.0,0.0
4,2013-05-02 00:00:00 UTC,103.43,1148668000.0,0.0
5,2013-05-03 00:00:00 UTC,91.01,1011066000.0,0.0
6,2013-05-04 00:00:00 UTC,111.25,1236352000.0,0.0
7,2013-05-05 00:00:00 UTC,116.79,1298378000.0,0.0
8,2013-05-06 00:00:00 UTC,118.33,1315992000.0,0.0
9,2013-05-07 00:00:00 UTC,106.4,1183766000.0,0.0


In [13]:
df_ETH_cap.head(10)

Unnamed: 0,snapped_at,price,market_cap,total_volume
0,2015-08-07 00:00:00 UTC,2.83162,0.0,90622.0
1,2015-08-08 00:00:00 UTC,1.33075,80339480.0,368070.0
2,2015-08-10 00:00:00 UTC,0.687586,41556310.0,400464.1
3,2015-08-11 00:00:00 UTC,1.067379,64539010.0,1518998.0
4,2015-08-12 00:00:00 UTC,1.256613,76013260.0,2073893.0
5,2015-08-13 00:00:00 UTC,1.825395,110468800.0,4380143.0
6,2015-08-14 00:00:00 UTC,1.825975,110555300.0,4355618.0
7,2015-08-15 00:00:00 UTC,1.67095,101215200.0,2519633.0
8,2015-08-16 00:00:00 UTC,1.476607,89480940.0,3032658.0
9,2015-08-17 00:00:00 UTC,1.203871,87313390.0,1880092.0


In [14]:
df_XRP_cap.head(10)

Unnamed: 0,snapped_at,price,market_cap,total_volume
0,2013-08-04 00:00:00 UTC,0.005874,45921034.0,0.0
1,2013-08-05 00:00:00 UTC,0.005653,44191247.0,0.0
2,2013-08-06 00:00:00 UTC,0.004669,36500633.0,0.0
3,2013-08-07 00:00:00 UTC,0.004486,35071445.0,0.0
4,2013-08-08 00:00:00 UTC,0.004196,32800191.0,0.0
5,2013-08-09 00:00:00 UTC,0.004277,33440085.0,0.0
6,2013-08-10 00:00:00 UTC,0.004318,33760306.0,0.0
7,2013-08-11 00:00:00 UTC,0.004372,34180440.0,0.0
8,2013-08-12 00:00:00 UTC,0.004397,34374714.0,0.0
9,2013-08-13 00:00:00 UTC,0.004228,33050911.0,0.0


In [15]:
df_LTC_cap.head(10)

Unnamed: 0,snapped_at,price,market_cap,total_volume
0,2013-04-28 00:00:00 UTC,4.29983,73773387.0,0.0
1,2013-04-29 00:00:00 UTC,4.3594,74936909.0,0.0
2,2013-04-30 00:00:00 UTC,4.18295,72037636.0,0.0
3,2013-05-01 00:00:00 UTC,3.64914,62957992.0,0.0
4,2013-05-02 00:00:00 UTC,3.38879,58565340.0,0.0
5,2013-05-03 00:00:00 UTC,2.78957,48265782.0,0.0
6,2013-05-04 00:00:00 UTC,3.51708,60927537.0,0.0
7,2013-05-05 00:00:00 UTC,3.63013,62963530.0,0.0
8,2013-05-06 00:00:00 UTC,3.50733,60937067.0,0.0
9,2013-05-07 00:00:00 UTC,3.21463,55968734.0,0.0


In [16]:
for df in caps:
    df.rename({'snapped_at':'Date'}, axis=1, inplace=True)

In [17]:
df_BTC_cap.head(3)

Unnamed: 0,Date,price,market_cap,total_volume
0,2013-04-28 00:00:00 UTC,135.3,1500518000.0,0.0
1,2013-04-29 00:00:00 UTC,141.96,1575032000.0,0.0
2,2013-04-30 00:00:00 UTC,135.3,1501657000.0,0.0


Now we will add a 'Market_Cap' column for each coin to our original dataframes.

In [18]:
# NOTE: assigning a Series aligns on the row index label, not on 'Date',
# so each Market_Cap value comes from the CoinGecko row with the same index label.
# (A nested loop over dfs and caps would not work here: every dataframe
# would end up with the market cap of the last coin in caps.)
df_BTC['Market_Cap'] = df_BTC_cap['market_cap']
df_ETH['Market_Cap'] = df_ETH_cap['market_cap']
df_XRP['Market_Cap'] = df_XRP_cap['market_cap']
df_LTC['Market_Cap'] = df_LTC_cap['market_cap']

In [19]:
df_BTC.head()

Unnamed: 0,Date,Coin,Open,High,Low,Close,Adj Close,Volume,Market_Cap
1998,2021-01-24,BTC,31794.328125,32938.765625,31106.685547,31786.878906,31786.878906,46807680000.0,112970700000.0
1997,2021-01-23,BTC,,,,,,,114953300000.0
1996,2021-01-22,BTC,,,,,,,114961000000.0
1995,2021-01-21,BTC,,,,,,,115236600000.0
1994,2021-01-20,BTC,,,,,,,109624300000.0
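One caveat about the assignment above: pandas matches the two objects by row index label, not by date. The CoinGecko files start on different days than the Yahoo Finance files and can skip days (the ETH preview jumps from 2015-08-08 to 2015-08-10), so the same index label need not refer to the same calendar day in both sources. A hedged sketch of joining on the date instead, using toy frames: the 'Close' values are hypothetical, while the two market caps are copied from the ETH preview.

```python
import pandas as pd

# Hypothetical price rows in the Yahoo Finance layout, newest first.
prices = pd.DataFrame({
    "Date": ["2015-08-10", "2015-08-08"],
    "Close": [0.70, 1.30],  # made-up values for illustration
})

# CoinGecko layout after renaming 'snapped_at' to 'Date'.
caps_sample = pd.DataFrame({
    "Date": ["2015-08-08 00:00:00 UTC", "2015-08-10 00:00:00 UTC"],
    "market_cap": [80339480.0, 41556310.0],
})

# Bring both keys to plain calendar dates before joining.
prices["Date"] = pd.to_datetime(prices["Date"])
caps_sample["Date"] = pd.to_datetime(caps_sample["Date"].str[:10])

# A left join keeps every price row and attaches the cap for the same day.
merged = (
    prices.merge(caps_sample[["Date", "market_cap"]], on="Date", how="left")
    .rename(columns={"market_cap": "Market_Cap"})
)
print(merged)
```

With `how="left"`, days that have no CoinGecko snapshot end up with a missing 'Market_Cap' rather than a value borrowed from another date.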


And let's concatenate our dataframes.

In [20]:
df = pd.concat(dfs, axis=0, sort=False)

Now we will convert our 'Date' column to datetime.

In [21]:
df['Date'] = pd.to_datetime(df['Date'])

In [22]:
df.head()

Unnamed: 0,Date,Coin,Open,High,Low,Close,Adj Close,Volume,Market_Cap
1998,2021-01-24,BTC,31794.328125,32938.765625,31106.685547,31786.878906,31786.878906,46807680000.0,112970700000.0
1997,2021-01-23,BTC,,,,,,,114953300000.0
1996,2021-01-22,BTC,,,,,,,114961000000.0
1995,2021-01-21,BTC,,,,,,,115236600000.0
1994,2021-01-20,BTC,,,,,,,109624300000.0


Let's drop the 'Adj Close' column since we will not need it for our analysis.

In [23]:
df.drop('Adj Close', axis=1, inplace=True)

In [24]:
#checking how many observations and features we have
df.shape

(7995, 8)

We have 7995 observations and 8 columns. As we saw above, some values are missing.

In [25]:
df.isna().sum()

Date           0
Coin           0
Open          16
High          16
Low           16
Close         16
Volume        16
Market_Cap     3
dtype: int64
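Before dropping rows it can help to see which coin the gaps belong to. A minimal sketch of a per-coin count, on a toy frame with hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a few gaps.
sample = pd.DataFrame({
    "Coin": ["BTC", "BTC", "ETH", "ETH"],
    "Close": [1.0, np.nan, 2.0, 3.0],
    "Market_Cap": [10.0, 20.0, np.nan, np.nan],
})

# Flag missing cells, then sum the flags within each coin.
missing_by_coin = sample.drop(columns="Coin").isna().groupby(sample["Coin"]).sum()
print(missing_by_coin)
```

If the gaps cluster in one coin or one date range, dropping them may leave holes in that coin's time series, which is worth knowing before modelling.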

We don't have many missing values, so we will just drop the rows that contain them.

In [26]:
df.dropna(axis=0, inplace=True)
df.isnull().any()

Date          False
Coin          False
Open          False
High          False
Low           False
Close         False
Volume        False
Market_Cap    False
dtype: bool

Great. No more missing values. Let's take a look at our data shape once again.

In [27]:
df.shape

(7976, 8)

We only dropped 19 observations (7995 - 7976). Now let's check if we have any duplicates.

In [28]:
df.duplicated().any()

False
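`duplicated()` with no arguments only flags rows that are identical in every column. For daily price data a stricter check may be useful: whether any (Date, Coin) pair appears more than once. A sketch on a toy frame with hypothetical values:

```python
import pandas as pd

# Hypothetical sample: two BTC rows share a date but differ in 'Close'.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-24", "2021-01-24", "2021-01-24"]),
    "Coin": ["BTC", "BTC", "ETH"],
    "Close": [1.0, 1.5, 2.0],
})

# Whole-row comparison misses the repeated key because 'Close' differs.
full_row_dupes = sample.duplicated().any()

# Restricting the comparison to the key columns catches it.
key_dupes = sample.duplicated(subset=["Date", "Coin"]).any()

print(full_row_dupes, key_dupes)
```

The `subset` check treats one row per coin per day as the expected shape of the data, which is an assumption about this dataset rather than something the source files guarantee.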

In [29]:
df['Market_Cap']

1998    1.129707e+11
1993    1.093727e+11
1992    1.088453e+11
1991    1.081558e+11
1990    1.139870e+11
            ...     
4       5.856534e+07
3       6.295799e+07
2       7.203764e+07
1       7.493691e+07
0       7.377339e+07
Name: Market_Cap, Length: 7976, dtype: float64

No duplicates. Our data is ready for the next step - Exploratory Data Analysis.

In [30]:
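#saving the data
datapath = 'D://Tutorials/SDST/My Projects/Capstone3/DW'
#create the output folder (and any missing parent folders) if needed
os.makedirs(datapath, exist_ok=True)
datapath_DW = os.path.join(datapath, 'Data_for_EDA.csv')
#write the file only if it does not exist yet, so an earlier export is not overwritten
if not os.path.exists(datapath_DW):
    df.to_csv(datapath_DW, index=False)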