# **Data collection Notebook**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect data and save under outputs/dataset/collection

## Inputs

* Kaggle JSON file - authentication token

## Outputs

* Generate dataset under: outputs/dataset/collection/Bitcoin_Price_Data.csv

---

## Change working directory

Change the working directory from the current folder to its parent folder.

First, get the current directory with os.getcwd().

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/fifth-milestone-project-bitcoin/jupyter_notebooks'

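Make the parent of the current directory the new current directory.

In [None]:
# os.path.dirname needs the path whose parent we want
os.chdir(os.path.dirname(current_dir))
print("Working directory changed to the parent folder")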

Confirm the new current directory.

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/fifth-milestone-project-bitcoin'
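
Re-running the directory change moves up one more level each time. The sketch below shows one possible guard, assuming the notebooks folder is named jupyter_notebooks as in the output above; the helper name `project_root` is hypothetical and not part of this notebook.

```python
import os


def project_root(current_dir: str, notebooks_folder: str = "jupyter_notebooks") -> str:
    """Return the parent of `current_dir` only when it is the notebooks folder."""
    normalized = os.path.normpath(current_dir)
    if os.path.basename(normalized) == notebooks_folder:
        return os.path.dirname(normalized)
    # Already outside the notebooks folder: leave the path unchanged.
    return normalized


# Hypothetical usage; running it twice keeps the same directory:
# os.chdir(project_root(os.getcwd()))
```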

# Fetch the data from Kaggle

## Install the Kaggle package

In [10]:
%pip install kaggle

Note: you may need to restart the kernel to use updated packages.


Point Kaggle at the authentication token (kaggle.json) in the current directory and restrict the file's permissions

In [12]:
import os
os.environ['KAGGLE_CONFIG_DIR']=os.getcwd()
! chmod 600 kaggle.json
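
The same token setup can be done in plain Python, which also gives a clear error when kaggle.json is missing. This is a minimal sketch, not the notebook's own method; the function name `prepare_kaggle_token` is hypothetical.

```python
import os
import stat


def prepare_kaggle_token(config_dir: str) -> str:
    """Point the Kaggle client at `config_dir` and restrict kaggle.json to its owner."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Expected the Kaggle token at {token_path}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Owner read/write only, equivalent to `chmod 600 kaggle.json`.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path


# Hypothetical usage from the project root:
# prepare_kaggle_token(os.getcwd())
```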

Define the Kaggle dataset and destination folder, then download the dataset

In [13]:
KaggleDatasetPath = "abhishek14398/bitcoin-prediction-dataset-bullrun"
DestinationFolder = "/workspace/fifth-milestone-project-bitcoin/jupyter_notebooks/inputs/dataset/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading bitcoin-prediction-dataset-bullrun.zip to /workspace/fifth-milestone-project-bitcoin/jupyter_notebooks/inputs/dataset/raw
100%|███████████████████████████████████████| 72.6k/72.6k [00:00<00:00, 693kB/s]
100%|███████████████████████████████████████| 72.6k/72.6k [00:00<00:00, 691kB/s]


Unzip the file, then delete the zip and the kaggle.json file

In [14]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  /workspace/fifth-milestone-project-bitcoin/jupyter_notebooks/inputs/dataset/raw/bitcoin-prediction-dataset-bullrun.zip
  inflating: /workspace/fifth-milestone-project-bitcoin/jupyter_notebooks/inputs/dataset/raw/BTC_USD_Price_Prediction_Data.csv  
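
For environments without the unzip command, the extraction step could be sketched with the standard-library zipfile module. The helper name `unzip_and_remove` is hypothetical, and this version only handles the archives; kaggle.json would still need to be deleted separately.

```python
import zipfile
from pathlib import Path


def unzip_and_remove(folder: str) -> list:
    """Extract every .zip archive in `folder`, delete the archives, return extracted names."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive only after it was extracted without error.
        archive.unlink()
    return extracted


# Hypothetical usage with the folder defined earlier:
# unzip_and_remove(DestinationFolder)
```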


Load and inspect the downloaded dataset

In [27]:
import pandas as pd
df = pd.read_csv("inputs/dataset/raw/BTC_USD_Price_Prediction_Data.csv").drop(['Currency'], axis=1)
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Date | Closing Price (USD) | 24h Open (USD) | 24h High (USD) | 24h Low (USD) |
|---|---|---|---|---|---|---|
| 0 | 0 | 2014-03-14 | 124.65499 | 125.30466 | 125.75166 | 123.56349 |
| 1 | 1 | 2014-03-15 | 126.455 | 124.65499 | 126.7585 | 124.63383 |
| 2 | 2 | 2014-03-16 | 109.58483 | 126.455 | 126.66566 | 84.32833 |
| 3 | 3 | 2014-03-17 | 119.67466 | 109.58483 | 119.675 | 108.05816 |
| 4 | 4 | 2014-03-18 | 122.33866 | 119.67466 | 122.93633 | 119.00566 |
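
The "Unnamed" columns in the preview look like leftover index columns from earlier saves. One way to drop them along with Currency is sketched below; the CSV text (including the "BTC" value) is made up to mimic the layout and is not read from the real file.

```python
import io

import pandas as pd

# Illustrative CSV text that mimics the layout shown above; not the real file.
csv_text = (
    "Unnamed: 0.1,Unnamed: 0,Currency,Date,Closing Price (USD)\n"
    "0,0,BTC,2014-03-14,124.65499\n"
    "1,1,BTC,2014-03-15,126.455\n"
)

raw = pd.read_csv(io.StringIO(csv_text))

# Keep only columns whose name does not start with "Unnamed", then drop Currency.
cleaned = raw.loc[:, ~raw.columns.str.startswith("Unnamed")].drop(columns=["Currency"])
print(cleaned.columns.tolist())
```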


Dataframe summary

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2787 entries, 0 to 2786
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           2787 non-null   int64  
 1   Currency             2787 non-null   object 
 2   Date                 2787 non-null   object 
 3   Closing Price (USD)  2787 non-null   float64
 4   24h Open (USD)       2787 non-null   float64
 5   24h High (USD)       2787 non-null   float64
 6   24h Low (USD)        2787 non-null   float64
dtypes: float64(4), int64(1), object(2)
memory usage: 152.5+ KB


Check the Date column for duplicates

In [25]:
df[df.duplicated(subset=['Date'])]


Unnamed: 0.1,Unnamed: 0,Currency,Date,Closing Price (USD),24h Open (USD),24h High (USD),24h Low (USD)
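
An empty result means no date appears twice. For a daily series it may also be worth checking for missing days; the sketch below does both on made-up sample dates (not taken from the dataset), and the variable names are illustrative.

```python
import pandas as pd

# Illustrative data only: one duplicated day and one missing day.
sample = pd.DataFrame(
    {"Date": ["2014-03-14", "2014-03-15", "2014-03-15", "2014-03-17"]}
)

dates = pd.to_datetime(sample["Date"])

# Every row whose date occurs more than once.
duplicate_rows = sample[dates.duplicated(keep=False)]

# Calendar days between the first and last date that have no row.
full_range = pd.date_range(dates.min(), dates.max(), freq="D")
missing_days = full_range[~full_range.isin(dates)]

print(len(duplicate_rows), list(missing_days.strftime("%Y-%m-%d")))
```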


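Convert dates to numeric

In [3]:
import pandas as pd
df = pd.read_csv("inputs/dataset/raw/BTC_USD_Price_Prediction_Data.csv").drop(['Currency'], axis=1)
df['Date'] = pd.to_numeric(df['Date'], errors='coerce')

Check the data type of Date. Note that pd.to_numeric with errors='coerce' cannot parse date strings such as 2014-03-14 as numbers, so every value becomes NaN and the column is reported as float64.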

In [39]:
df['Date'].dtype

dtype('float64')
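
If the goal is a numeric date that keeps its information, parsing with pd.to_datetime first is one option. The sketch below uses made-up sample rows and hypothetical column names (`Date_ordinal`, `Days_since_start`); it is an alternative under those assumptions, not what this notebook ran.

```python
import pandas as pd

# Illustrative rows in the same date format as the dataset.
sample = pd.DataFrame({"Date": ["2014-03-14", "2014-03-15", "2014-03-16"]})

# Direct numeric conversion turns every date string into NaN.
all_nan = bool(pd.to_numeric(sample["Date"], errors="coerce").isna().all())

# Parse the strings as dates first, then derive integer features.
parsed = pd.to_datetime(sample["Date"], format="%Y-%m-%d")
sample["Date_ordinal"] = parsed.apply(lambda ts: ts.toordinal())
sample["Days_since_start"] = (parsed - parsed.min()).dt.days

print(all_nan)
print(sample["Days_since_start"].tolist())
```

Both derived columns are integers that increase by one per day, so either could be passed to an ML pipeline.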

Convert the price columns to integers (astype('int') truncates the decimal part)

In [4]:
import pandas as pd
df = pd.read_csv(f"inputs/dataset/raw/BTC_USD_Price_Prediction_Data.csv").drop(['Currency'], axis=1)
df['Closing Price (USD)'] = df['Closing Price (USD)'].astype('int')
df['24h Open (USD)'] = df['24h Open (USD)'].astype('int')
df['24h High (USD)'] = df['24h High (USD)'].astype('int')
df['24h Low (USD)'] = df['24h Low (USD)'].astype('int')
df.head()

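| Unnamed: 0.1 | Unnamed: 0 | Date | Closing Price (USD) | 24h Open (USD) | 24h High (USD) | 24h Low (USD) |
|---|---|---|---|---|---|---|
| 0 | 0 | 2014-03-14 | 124 | 125 | 125 | 123 |
| 1 | 1 | 2014-03-15 | 126 | 124 | 126 | 124 |
| 2 | 2 | 2014-03-16 | 109 | 126 | 126 | 84 |
| 3 | 3 | 2014-03-17 | 119 | 109 | 119 | 108 |
| 4 | 4 | 2014-03-18 | 122 | 119 | 122 | 119 |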


# Push file to repo

In [5]:
import os
import pandas as pd
df = pd.read_csv(f"inputs/dataset/raw/BTC_USD_Price_Prediction_Data.csv").drop(['Currency'], axis=1)

try:
    os.makedirs(name='outputs/dataset/collection/')
except Exception as e:
    print(e)

df.to_csv(f"/workspace/fifth-milestone-project-bitcoin/jupyter_notebooks/outputs/dataset/collection/Bitcoin_Price_Data.csv", index=False)

[Errno 17] File exists: 'outputs/dataset/collection/'
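
The "File exists" message appears because the folder was already created by an earlier run. A minimal sketch of a re-runnable version uses `exist_ok=True`; the helper name `ensure_output_path` is hypothetical.

```python
import os


def ensure_output_path(folder: str, filename: str) -> str:
    """Create `folder` if needed and return the full path for `filename`."""
    # exist_ok=True makes the call a no-op when the folder already exists.
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, filename)


# Hypothetical usage from the project root:
# df.to_csv(ensure_output_path("outputs/dataset/collection", "Bitcoin_Price_Data.csv"), index=False)
```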


---

# Conclusions

* Successfully installed the Kaggle package
* Successfully downloaded and unzipped the dataset
* Successfully loaded the dataset
* Realized that the most important column for answering the Business Requirements is the Date
* Checked the Date column for duplicates
* Converted dates to numeric for use in the ML pipeline

# Next step

Study the differences between the opening and closing prices to answer Business Requirement 1.