# **TradeCare: Data Collection Notebook**

## Objectives
* Fetch historical Bitcoin OHLCV (Open, High, Low, Close, Volume) data from a GitHub-hosted repository that provides automated daily updates.
* Verify data loaded correctly (basic checks)
* Understand data structure and characteristics
* Document data source and live data collection strategy

## Inputs
* **Data Source:** GitHub Repository (automated updates)
* **URL:** https://raw.githubusercontent.com/mouadja02/bitcoin-hourly-ohclv-dataset/main/btc-hourly-price_2015_2025.csv
* **Asset:** BTC-USD
* **Timeframe:** 1 Hour
* **Period:** November 2014 - present

## Outputs
* DataFrame loaded in memory for exploration
* Data understanding documented
* Validated raw data saved as CSV checkpoint: `inputs/datasets/raw/bitcoin_raw.csv`
* Subsequent notebooks load from CSV


## Additional Comments
This GitHub dataset provides a **unique combination** rarely found in ML projects:

* **Fresh & Maintained:** An automated workflow fetches current data from the CryptoCompare API daily and stores backups on GitHub. The repository contains Bitcoin hourly price data from 2015 to present, with continuous updates
* **Simple:** Direct CSV access via a single URL
* **Free:** No API keys or costs
* **Reliable:** No API authentication, so no auth failures or per-key API quotas
* **Transparent:** Git history shows every change
* **Scalable:** Should work in production environments

**Data Pipeline Strategy:**
* This notebook fetches data from URL and validates it
* Validated data is saved as CSV checkpoint for fast iteration
* Subsequent notebooks load from CSV (no re-fetching needed)
* This provides: reliability, speed, and offline capability
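As a rough illustration of the checkpoint idea in the list above, the sketch below reads a local file when it exists and calls a fetch function only when it does not. The names `load_or_fetch` and `fetch` are hypothetical and not part of this project. The notebook itself always fetches and then saves, so treat this as one possible variant rather than the pipeline's actual code.

```python
from pathlib import Path
from typing import Callable


def load_or_fetch(checkpoint: Path, fetch: Callable[[], str]) -> str:
    """Return CSV text from the local checkpoint, fetching only if it is missing.

    `fetch` is a hypothetical callable that returns the raw CSV as text,
    for example a thin wrapper around an HTTP download.
    """
    if checkpoint.exists():
        # Fast path: no network access needed, works offline.
        return checkpoint.read_text(encoding="utf-8")

    text = fetch()
    # Create inputs/datasets/raw (or whatever the parent is) on first use.
    checkpoint.parent.mkdir(parents=True, exist_ok=True)
    checkpoint.write_text(text, encoding="utf-8")
    return text
```

In a later notebook, the saved path could simply be passed to `pd.read_csv`.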

---

## Change Working Directory

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
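One caveat: `os.chdir(os.path.dirname(current_dir))` moves up one level every time it runs, so re-running that cell can leave the notebook outside the project. A minimal sketch of a re-runnable alternative, assuming the project root is the folder that contains `src` (the helper name `find_project_root` is hypothetical):

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains '{marker}'")


# Possible usage in the notebook (safe to run more than once):
# import os
# os.chdir(find_project_root(Path.cwd()))
```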

---

# Fetch and Validate Data

## Import Validation Helper

In [None]:
# Import centralized validation function
import sys
sys.path.append('.')  # Add project root to path

from src.raw_data_validation import fetch_and_validate_data, get_data_info

## Fetch Validated Data

This function automatically:
- Fetches data from GitHub URL
- Validates column structure and names
- Validates string data safety (no injection attempts)
- Validates price ranges
- Validates data completeness
- Validates timestamps

If any validation fails, the notebook stops with a clear error message.
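The helper's source is not shown in this notebook. Purely as an illustration of the kind of checks listed above, the sketch below validates CSV text with the standard library. The column names in `EXPECTED_COLUMNS`, the function name `validate_ohlcv_csv`, and the specific rules are assumptions, not the contents of `src/raw_data_validation.py`. The string-safety check is omitted.

```python
import csv
import io
import math

# Hypothetical column names; the real dataset's headers may differ.
EXPECTED_COLUMNS = ["timestamp", "open", "high", "low", "close", "volume"]


def validate_ohlcv_csv(text):
    """Parse CSV text into numeric rows, raising ValueError on the first failed check."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected columns: {reader.fieldnames}")

    rows = []
    previous_ts = None
    for line_no, raw in enumerate(reader, start=2):  # data starts on line 2
        try:
            ts = int(raw["timestamp"])
            o, h, l, c, v = (float(raw[name]) for name in EXPECTED_COLUMNS[1:])
        except (TypeError, ValueError) as exc:
            raise ValueError(f"Line {line_no}: missing or non-numeric value") from exc

        # Completeness and range: finite numbers, positive prices, non-negative volume.
        if not all(math.isfinite(x) for x in (o, h, l, c, v)):
            raise ValueError(f"Line {line_no}: non-finite value")
        if min(o, h, l, c) <= 0 or v < 0:
            raise ValueError(f"Line {line_no}: price or volume out of range")

        # Internal consistency: low <= open/close <= high.
        if not (l <= min(o, c) and max(o, c) <= h):
            raise ValueError(f"Line {line_no}: OHLC values are inconsistent")

        # Timestamps must be strictly increasing.
        if previous_ts is not None and ts <= previous_ts:
            raise ValueError(f"Line {line_no}: timestamp not increasing")
        previous_ts = ts

        rows.append({"timestamp": ts, "open": o, "high": h,
                     "low": l, "close": c, "volume": v})

    if not rows:
        raise ValueError("No data rows found")
    return rows
```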

In [None]:
# Fetch and validate data in one call
df = fetch_and_validate_data()

---

# Data Summary

In [None]:
# Get data summary from helper function
import json
import pandas as pd
data_info = get_data_info(df)
print(json.dumps(data_info, indent=2))

## DataFrame Info

In [None]:
print(f"Data fetched on: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Total rows: {df.shape[0]:,}")
print(f"Total columns: {df.shape[1]}")
df.info()

## Display First Rows

In [None]:
df.head(10)

## Display Last Rows

In [None]:
df.tail(10)

## Statistical Description

In [None]:
df.describe()

---

# Data Quality Checks

## Check for Missing Values

In [None]:
print("Missing values per column:")
missing_values = df.isnull().sum()
print(missing_values)
print(f"\nTotal missing values: {missing_values.sum()}")

## Check for Duplicate Rows

In [None]:
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

if duplicates > 0:
    print("\nDuplicate rows:")
    print(df[df.duplicated(keep=False)])
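Missing values and duplicate rows do not reveal hours that are absent altogether. A small sketch of a gap check for hourly data, assuming the timestamps have already been parsed into `datetime` objects (the function name `find_hourly_gaps` is hypothetical):

```python
from datetime import datetime, timedelta

ONE_HOUR = timedelta(hours=1)


def find_hourly_gaps(timestamps):
    """Return (previous, next, missing_hours) for each gap larger than one hour."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        delta = nxt - prev
        if delta > ONE_HOUR:
            # Whole hours skipped between the two observed rows.
            missing = delta // ONE_HOUR - 1
            gaps.append((prev, nxt, missing))
    return gaps
```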

---


# Save Validated Data Checkpoint

## Create Directory Structure

In [None]:
# Create necessary directories
raw_data_dir = 'inputs/datasets/raw'
os.makedirs(raw_data_dir, exist_ok=True)
print(f"✓ Directory created/verified: {raw_data_dir}")

## Save Raw Data as CSV

In [None]:
# Save validated data
csv_path = f"{raw_data_dir}/bitcoin_raw.csv"
df.to_csv(csv_path, index=False)

# Confirm save
file_size_mb = os.path.getsize(csv_path) / (1024 * 1024)
print("✓ Data saved successfully")
print(f"  Location: {csv_path}")
print(f"  Rows: {len(df):,}")
print(f"  Size: {file_size_mb:.2f} MB")
print(f"  Timestamp: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}")

---

# Conclusion

## Summary

✓ **Data Collection Complete**

This notebook successfully:
1. Used centralized validation helper (`src/raw_data_validation.py`)
2. Fetched Bitcoin hourly OHLCV data from GitHub repository
3. Automatically validated data structure, safety, and integrity
4. Explored the validated dataset

**Security Measures Applied:**
- Column structure verification
- Character injection prevention (no dangerous symbols)
- Date/time format validation
- Price range sanity checks
- Data completeness validation
- Timestamp range verification

**Key Findings:**
- Data covers the period from November 2014 to present (November 2025)
- Hourly granularity provides sufficient detail for short-term predictions
- All security validations passed
- **Dataset is exceptionally clean:**
  - No missing values detected
  - No duplicate rows found
  - Automated data collection ensures consistency
  - Public API source reduces manual entry errors
  - Validation confirms data integrity

**Data Quality Notes:**
- This dataset benefits from automated collection via CryptoCompare API
- Programmatic data generation minimizes human input errors
- Continuous validation by repository maintainers ensures reliability
- However, a cleaning pipeline will still be implemented to:
  - Future-proof against potential data gaps
  - Demonstrate data preparation best practices
  - Handle edge cases in production deployment

**Data Pipeline Approach:**
- Validated raw data saved to: `inputs/datasets/raw/bitcoin_raw.csv`
- Subsequent notebooks will load from CSV (fast, reliable)
- The data cleaning notebook should need only minimal intervention
- Feature engineering will transform raw data to ML-ready format

**Next Steps:**
1. Proceed to Data Cleaning notebook (`2_DataCleaning.ipynb`)
2. Load raw CSV and confirm data quality
3. Prepare for feature engineering phase

