# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect data and save under outcome/dataset/collection

## Inputs

* Kaggle JSON file - authentication token

## Outputs

* Generate dataset: under outcome/dataset/collection/BTC_USD_Price_Prediction_Data.csv

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/fifth-milestone-project-bitcoin/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print()




Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/fifth-milestone-project-bitcoin'
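
Re-running the cell that calls `os.chdir()` moves one level further up each time. A minimal sketch of a guard against that is shown here. It assumes the notebooks live in a folder named `jupyter_notebooks`, as in the output above. The helper name `project_root_from` is hypothetical and not part of the notebook.

```python
import os


def project_root_from(cwd: str, notebooks_folder: str = "jupyter_notebooks") -> str:
    """Return the parent of `cwd` only when `cwd` is the notebooks folder.

    If `cwd` is already somewhere else (for example the project root),
    it is returned unchanged, so calling this twice is harmless.
    """
    cwd = os.path.normpath(cwd)
    if os.path.basename(cwd) == notebooks_folder:
        return os.path.dirname(cwd)
    return cwd


# Resolve the target folder from the path shown in the notebook output.
target = project_root_from(
    "/workspace/fifth-milestone-project-bitcoin/jupyter_notebooks"
)
print(target)

# Inside the notebook one could then run:
# os.chdir(project_root_from(os.getcwd()))
```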

---

# Section 1

Fetch data from Kaggle

In [9]:
! pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm
  Downloading tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting python-slugify
  Downloading python_slugify-7.0.0-py2.py3-none-any.whl (9.4 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73031 sha256=388b903e0ef1ac2b4f7063c91bf4edab554e7d490453984d7c7207613a7fc8

Use kaggle token

In [4]:
import os
os.environ['KAGGLE_CONFIG_DIR']=os.getcwd()
! chmod 600 kaggle.json
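
The token setup above can also be done in pure Python, which fails early with a clear message when `kaggle.json` is missing. This is a sketch under the assumption that the token sits in the folder passed in. The helper name `prepare_kaggle_token` is hypothetical.

```python
import os
import stat


def prepare_kaggle_token(config_dir: str) -> str:
    """Point the Kaggle CLI at `config_dir` and restrict the token's permissions.

    Raises FileNotFoundError when kaggle.json is not in `config_dir`.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    # Owner read/write only, the same effect as `chmod 600 kaggle.json`.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    return token_path


# In the notebook one could call: prepare_kaggle_token(os.getcwd())
```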

Define the Kaggle dataset and destination folder, then download it

In [12]:
KaggleDatasetPath = "abhishek14398/bitcoin-prediction-dataset-bullrun"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

/usr/bin/sh: 1: kaggle: not found
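
The `kaggle: not found` message means the shell could not find the `kaggle` executable on its `PATH`. One possible cause, assumed here rather than confirmed by the log, is that the package is not installed in the environment the shell is using. A small check before the download makes this easier to spot. The helper name `cli_available` is hypothetical.

```python
import shutil


def cli_available(command: str) -> bool:
    """Return True when `command` resolves to an executable on PATH."""
    return shutil.which(command) is not None


if not cli_available("kaggle"):
    print(
        "kaggle CLI not found on PATH; install it in the active "
        "environment (for example with pip) before downloading"
    )
```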


Unzip the file, then delete the zip and the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
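
Where the `unzip` command is not available, the same step can be sketched with the standard library. This version only handles the archives. Deleting `kaggle.json` would remain a separate `os.remove("kaggle.json")` call. The helper name `extract_and_remove_zips` is hypothetical.

```python
import glob
import os
import zipfile


def extract_and_remove_zips(folder: str) -> list:
    """Extract every .zip in `folder` into `folder`, then delete the archive.

    Returns the names of all extracted members.
    """
    extracted = []
    for zip_path in glob.glob(os.path.join(folder, "*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after a successful extraction.
        os.remove(zip_path)
    return extracted


# In the notebook one could call: extract_and_remove_zips(DestinationFolder)
```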

Load and inspect the downloaded dataset

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/BTC_USD_Price_Prediction_Data.csv")
df.head()

Dataframe summary

In [None]:
df.info()

Check for duplicated dates

In [None]:
df[df.duplicated(subset=['date'])]
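
To illustrate what the duplicate check returns, here is a small sketch on made-up data. The values and the `close` column are invented for illustration and do not come from the Kaggle dataset. With the default `keep='first'`, `duplicated()` flags every repeat of a date after its first occurrence.

```python
import pandas as pd

# Toy frame with one repeated date (illustrative values only).
sample = pd.DataFrame(
    {
        "date": ["2021-01-01", "2021-01-02", "2021-01-02"],
        "close": [100.0, 101.0, 101.5],
    }
)

# Rows whose date already appeared earlier in the frame.
duplicates = sample[sample.duplicated(subset=["date"])]
n_duplicates = len(duplicates)
print(duplicates)
```

An empty result on the real dataset would mean every date appears exactly once.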

Check the data types of Closing Price (USD), 24h Open (USD), 24h High (USD) and 24h Low (USD)

In [None]:
df[['Closing Price (USD)', '24h Open (USD)', '24h High (USD)', '24h Low (USD)']].dtypes
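
Beyond reading the dtypes by eye, the check can be automated. The sketch below runs on made-up data in which one price column is deliberately stored as text. The helper name `non_numeric_columns` is hypothetical, and the sample values are not taken from the dataset.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

price_columns = [
    "Closing Price (USD)",
    "24h Open (USD)",
    "24h High (USD)",
    "24h Low (USD)",
]


def non_numeric_columns(frame: pd.DataFrame, columns: list) -> list:
    """Return the names in `columns` whose dtype is not numeric."""
    return [name for name in columns if not is_numeric_dtype(frame[name])]


# Toy frame: "24h High (USD)" holds strings on purpose.
sample = pd.DataFrame(
    {
        "Closing Price (USD)": [100.0, 101.0],
        "24h Open (USD)": [99.0, 100.0],
        "24h High (USD)": ["102", "103"],
        "24h Low (USD)": [98.0, 99.5],
    }
)

bad = non_numeric_columns(sample, price_columns)
print(bad)
```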

---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down: no step may require going back to an earlier cell, for example to re-run it and refresh a variable's content.

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the otherwise empty try block is valid Python
except Exception as e:
  print(e)
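
A minimal sketch of how this cell could create the output folder and write the dataset in one step. The helper name `save_dataset` is hypothetical. The path in the usage comment is the one named in the Outputs section, and using it here is an assumption about the intended destination.

```python
import os

import pandas as pd


def save_dataset(frame: pd.DataFrame, folder: str, filename: str) -> str:
    """Create `folder` if needed and write `frame` there as CSV.

    Returns the path of the written file.
    """
    # exist_ok=True avoids an error when the folder is already present.
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    frame.to_csv(path, index=False)
    return path


# In the notebook one could call:
# save_dataset(df, "outcome/dataset/collection",
#              "BTC_USD_Price_Prediction_Data.csv")
```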
