# **TradeCare: Data Collection Notebook**

## Objectives
* Fetch historical Bitcoin OHLCV (Open, High, Low, Close, Volume) data from a GitHub-hosted repository that provides automated daily updates.
* Verify data loaded correctly (basic checks)
* Understand data structure and characteristics
* Document data source and live data collection strategy

## Inputs
*  **Data Source:** GitHub Repository (automated updates)
*   **URL:** https://raw.githubusercontent.com/mouadja02/bitcoin-hourly-ohclv-dataset/main/btc-hourly-price_2015_2025.csv
*   **Asset:** BTC-USD
*   **Timeframe:** 1 Hour
*   **Period:** November 2014 - present

## Outputs
* DataFrame loaded in memory for exploration
* Data understanding documented
* Validated raw data saved as CSV checkpoint: `inputs/datasets/raw/bitcoin_raw.csv`
* Subsequent notebooks load from CSV


## Additional Comments
This GitHub dataset combines several properties that are convenient for an ML project:

* **Fresh & Maintained:** An automated workflow fetches current data from the CryptoCompare API daily and stores backups on GitHub. The repository holds hourly Bitcoin price data from November 2014 to the present (the file name says 2015, but the first row loaded here is dated 2014-11-15)
* **Simple**: Direct CSV access via a single URL
* **Free**: No API keys or costs  
* **Reliable**: No authentication step that can fail, and one download per run keeps request volume low  
* **Transparent**: Git history shows every change  
* **Scalable**: Should work in production environments  

**Data Pipeline Strategy:**
* This notebook fetches data from URL and validates it
* Validated data is saved as CSV checkpoint for fast iteration
* Subsequent notebooks load from CSV (no re-fetching needed)
* This provides: reliability, speed, and offline capability
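
The strategy above can be sketched as a small "load or fetch" helper. This is only an illustration: `load_or_fetch` and its `fetch` argument are hypothetical names, not part of the project's `src` package, and the checkpoint path is the one listed under Outputs.

```python
from pathlib import Path
from typing import Callable

import pandas as pd

# Checkpoint path taken from the Outputs section of this notebook
CHECKPOINT = Path("inputs/datasets/raw/bitcoin_raw.csv")


def load_or_fetch(
    fetch: Callable[[], pd.DataFrame], checkpoint: Path = CHECKPOINT
) -> pd.DataFrame:
    """Return the checkpointed data if present; otherwise fetch, save, and return it."""
    if checkpoint.exists():
        # Fast path: no network access needed
        return pd.read_csv(checkpoint)
    df = fetch()  # hypothetical callable that downloads and validates the data
    checkpoint.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(checkpoint, index=False)
    return df
```

A later notebook could call `load_or_fetch(fetch_and_validate_data)` and would only touch the network when the checkpoint file is missing.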

---

## Change Working Directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with `os.getcwd()`

In [29]:
import os
current_dir = os.getcwd()
current_dir

'/Users/ilianamarquez/Documents/vscode-projects'

We want to make the parent of the current directory the new current directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [30]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [31]:
current_dir = os.getcwd()
current_dir

'/Users/ilianamarquez/Documents'
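
Re-running the `os.chdir` cell moves the working directory up one more level each time, because `os.path.dirname` is applied to whatever the current directory happens to be. A minimal sketch of an idempotent alternative follows. It assumes the project root can be recognised by a `src` folder (the folder this notebook imports from); the function name `move_to_parent` and the `marker` parameter are hypothetical.

```python
import os
from pathlib import Path


def move_to_parent(marker: str = "src") -> Path:
    """Change to the parent directory unless the current one already holds `marker`."""
    cwd = Path.cwd()
    if not (cwd / marker).exists():
        # Not at the project root yet, so step up one level
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling `move_to_parent()` a second time leaves the directory unchanged once the root has been reached.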

---

# Fetch and Validate Data

## Import Validation Helper

In [32]:
# Import centralized validation function
import sys
sys.path.append('.')  # Add project root to path

from src.raw_data_validation import fetch_and_validate_data, get_data_info

## Fetch Validated Data

This function automatically:
- Fetches data from GitHub URL
- Validates column structure and names
- Validates string data safety (no injection attempts)
- Validates price ranges
- Validates data completeness
- Validates timestamps

If any validation fails, the notebook stops with a clear error message.

In [33]:
# Fetch and validate data in one call
df = fetch_and_validate_data()

------------------------------------------------------------
TradeCare Data Validation
------------------------------------------------------------
Fetching data from GitHub...
URL: https://raw.githubusercontent.com/mouadja02/bitcoin-hourly-ohclv-dataset/main/btc-hourly-price_2015_2025.csv
✓ Data fetched: 96,594 rows & 9 columns

Validating data structure...
✓ Column structure valid: 9 columns present
✓ Column names safe: only alphanumeric and underscores
Validating string data safety...
✓ String data validated: safe formats, no injection patterns
Validating price ranges...
✓ Price ranges valid: all prices between $0 and $500,000
Validating data completeness...
✓ Row count valid: 96,594 rows (>= 96,000)
Validating timestamps...
Timestamps valid: starts from 2014-11-15
------------------------------------------------------------
All validation checks passed!
Data ready: 96,594 rows from 2014-11-15 to 2025-11-21
------------------------------------------------------------
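
The source of `fetch_and_validate_data` is not shown in this notebook, so the following is only a sketch of how three of the logged checks (column structure, price range, row count) might look. The function `validate_raw` and its parameters are hypothetical; the column names and thresholds are taken from the output above.

```python
import re

import pandas as pd

EXPECTED_COLUMNS = [
    "TIME_UNIX", "DATE_STR", "HOUR_STR", "OPEN_PRICE", "HIGH_PRICE",
    "CLOSE_PRICE", "LOW_PRICE", "VOLUME_FROM", "VOLUME_TO",
]
PRICE_COLUMNS = ["OPEN_PRICE", "HIGH_PRICE", "CLOSE_PRICE", "LOW_PRICE"]


def validate_raw(
    df: pd.DataFrame, min_rows: int = 96_000, max_price: float = 500_000
) -> None:
    """Raise ValueError on the first failed check (hypothetical stand-in)."""
    # Column names: only letters, digits, and underscores
    if not all(re.fullmatch(r"[A-Za-z0-9_]+", str(name)) for name in df.columns):
        raise ValueError("Column names must be alphanumeric or underscores")
    # Column structure: exact names in the expected order
    if list(df.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected columns: {list(df.columns)}")
    # Price range: strictly positive and at most max_price (NaN fails the test)
    prices = df[PRICE_COLUMNS]
    if not ((prices > 0) & (prices <= max_price)).all().all():
        raise ValueError(f"Prices must lie in (0, {max_price}]")
    # Completeness: at least min_rows rows
    if len(df) < min_rows:
        raise ValueError(f"Expected at least {min_rows} rows, got {len(df)}")
```

Raising on the first failure mirrors the behaviour described above, where the notebook stops with an error message as soon as a validation fails.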


---

# Data Summary

In [34]:
# Get data summary from helper function
import json
import pandas as pd
data_info = get_data_info(df)
print(json.dumps(data_info, indent=2))

{
  "total_rows": 96594,
  "total_columns": 9,
  "columns": [
    "TIME_UNIX",
    "DATE_STR",
    "HOUR_STR",
    "OPEN_PRICE",
    "HIGH_PRICE",
    "CLOSE_PRICE",
    "LOW_PRICE",
    "VOLUME_FROM",
    "VOLUME_TO"
  ],
  "date_range": {
    "first": "2014-11-15",
    "last": "2025-11-21"
  },
  "price_range": {
    "min": 165.07,
    "max": 126098.78,
    "mean": 26596.72792088536
  },
  "memory_usage_mb": 11.330789566040039,
  "fetched_at": "2025-11-22T04:58:05.476776"
}


## DataFrame Info

In [35]:
print(f"Data fetched on: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Total rows: {df.shape[0]:,}")
print(f"Total columns: {df.shape[1]}")
df.info()

Data fetched on: 2025-11-22 04:58:05
Total rows: 96,594
Total columns: 9
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96594 entries, 0 to 96593
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   TIME_UNIX    96594 non-null  int64  
 1   DATE_STR     96594 non-null  object 
 2   HOUR_STR     96594 non-null  int64  
 3   OPEN_PRICE   96594 non-null  float64
 4   HIGH_PRICE   96594 non-null  float64
 5   CLOSE_PRICE  96594 non-null  float64
 6   LOW_PRICE    96594 non-null  float64
 7   VOLUME_FROM  96594 non-null  float64
 8   VOLUME_TO    96594 non-null  float64
dtypes: float64(6), int64(2), object(1)
memory usage: 6.6+ MB


## Display First Rows

In [36]:
df.head(10)

|   | TIME_UNIX | DATE_STR | HOUR_STR | OPEN_PRICE | HIGH_PRICE | CLOSE_PRICE | LOW_PRICE | VOLUME_FROM | VOLUME_TO |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1416031200 | 2014-11-15 | 6 | 395.88 | 398.12 | 396.15 | 394.43 | 459.6 | 182309.81 |
| 1 | 1416034800 | 2014-11-15 | 7 | 396.15 | 397.49 | 397.15 | 395.96 | 428.88 | 170256.62 |
| 2 | 1416038400 | 2014-11-15 | 8 | 397.15 | 399.99 | 399.9 | 396.91 | 445.96 | 178280.48 |
| 3 | 1416042000 | 2014-11-15 | 9 | 399.9 | 399.9 | 392.56 | 391.83 | 494.09 | 195473.98 |
| 4 | 1416045600 | 2014-11-15 | 10 | 392.56 | 393.1 | 391.83 | 390.03 | 437.84 | 171654.03 |
| 5 | 1416049200 | 2014-11-15 | 11 | 391.83 | 391.84 | 389.82 | 387.8 | 388.56 | 151586.32 |
| 6 | 1416052800 | 2014-11-15 | 12 | 389.82 | 392.35 | 390.5 | 389.79 | 344.07 | 134639.52 |
| 7 | 1416056400 | 2014-11-15 | 13 | 390.5 | 390.67 | 387.34 | 384.05 | 408.71 | 158055.33 |
| 8 | 1416060000 | 2014-11-15 | 14 | 387.34 | 388.4 | 376.47 | 375.68 | 640.98 | 244901.78 |
| 9 | 1416063600 | 2014-11-15 | 15 | 376.47 | 378.16 | 374.82 | 371.64 | 343.39 | 130235.69 |
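
`TIME_UNIX` is stored as an integer and `DATE_STR` as a plain string, so neither is a datetime yet. The sketch below converts the first two rows shown above and cross-checks them against `DATE_STR` and `HOUR_STR`. It assumes `TIME_UNIX` counts seconds since the Unix epoch in UTC, which is consistent with these rows; the column name `DATETIME_UTC` is hypothetical.

```python
import pandas as pd

# First two rows of the dataset, copied from the output above
sample = pd.DataFrame(
    {
        "TIME_UNIX": [1416031200, 1416034800],
        "DATE_STR": ["2014-11-15", "2014-11-15"],
        "HOUR_STR": [6, 7],
    }
)

# Interpret TIME_UNIX as seconds since the epoch, in UTC
sample["DATETIME_UTC"] = pd.to_datetime(sample["TIME_UNIX"], unit="s", utc=True)

# Cross-check the derived timestamp against the date and hour columns
dates_match = (sample["DATETIME_UTC"].dt.strftime("%Y-%m-%d") == sample["DATE_STR"]).all()
hours_match = (sample["DATETIME_UTC"].dt.hour == sample["HOUR_STR"]).all()
print(sample["DATETIME_UTC"].iloc[0], dates_match, hours_match)
```

Running the same cross-check on the full DataFrame would confirm whether the three time columns agree on every row.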


## Display Last Rows

In [37]:
df.tail(10)

|   | TIME_UNIX | DATE_STR | HOUR_STR | OPEN_PRICE | HIGH_PRICE | CLOSE_PRICE | LOW_PRICE | VOLUME_FROM | VOLUME_TO |
|---|---|---|---|---|---|---|---|---|---|
| 96584 | 1763733600 | 2025-11-21 | 14 | 84037.64 | 85421.61 | 84791.57 | 83482.03 | 5361.01 | 452335800.0 |
| 96585 | 1763737200 | 2025-11-21 | 15 | 84791.57 | 85026.86 | 82864.05 | 82686.83 | 6320.78 | 528694100.0 |
| 96586 | 1763740800 | 2025-11-21 | 16 | 82864.05 | 84838.39 | 84838.26 | 82263.6 | 5876.83 | 489278300.0 |
| 96587 | 1763744400 | 2025-11-21 | 17 | 84838.26 | 85490.76 | 84821.59 | 84521.37 | 4688.87 | 398968200.0 |
| 96588 | 1763748000 | 2025-11-21 | 18 | 84821.59 | 85017.87 | 84531.79 | 83254.74 | 5278.19 | 444251300.0 |
| 96589 | 1763751600 | 2025-11-21 | 19 | 84531.79 | 84970.47 | 84157.3 | 84036.42 | 3022.38 | 255551300.0 |
| 96590 | 1763755200 | 2025-11-21 | 20 | 84157.3 | 84599.93 | 84514.58 | 83414.26 | 4065.42 | 341607800.0 |
| 96591 | 1763758800 | 2025-11-21 | 21 | 84514.58 | 85365.3 | 85137.95 | 84314.72 | 2691.81 | 228409500.0 |
| 96592 | 1763762400 | 2025-11-21 | 22 | 85137.95 | 85347.16 | 84246.0 | 84221.6 | 1560.62 | 132355200.0 |
| 96593 | 1763766000 | 2025-11-21 | 23 | 84246.0 | 85301.04 | 85087.62 | 84078.74 | 1471.0 | 124688700.0 |


## Statistical Description

In [39]:
df.describe()

|   | TIME_UNIX | HOUR_STR | OPEN_PRICE | HIGH_PRICE | CLOSE_PRICE | LOW_PRICE | VOLUME_FROM | VOLUME_TO |
|---|---|---|---|---|---|---|---|---|
| count | 96594.0 | 96594.0 | 96594.0 | 96594.0 | 96594.0 | 96594.0 | 96594.0 | 96594.0 |
| mean | 1589899000.0 | 11.500559 | 26595.774902 | 26699.783872 | 26596.727921 | 26487.117726 | 2748.234 | 42921920.0 |
| std | 100383900.0 | 6.922061 | 31262.006884 | 31363.240106 | 31262.663865 | 31157.096089 | 41257.57 | 98258470.0 |
| min | 1416031000.0 | 0.0 | 165.07 | 177.93 | 165.07 | 158.44 | 0.1471 | 1468.16 |
| 25% | 1502965000.0 | 6.0 | 3398.765 | 3416.0775 | 3398.9 | 3389.9 | 791.1825 | 4583068.0 |
| 50% | 1589899000.0 | 12.0 | 10795.1 | 10858.29 | 10795.355 | 10734.725 | 1457.515 | 18336800.0 |
| 75% | 1676832000.0 | 18.0 | 41795.3675 | 41956.9825 | 41795.6525 | 41641.215 | 2701.238 | 49235020.0 |
| max | 1763766000.0 | 23.0 | 126098.78 | 126287.29 | 126098.78 | 125333.07 | 8410600.0 | 7859574000.0 |
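
The statistics above summarise each column separately, so they cannot show whether every candle is internally consistent (high at or above both open and close, low at or below both). A minimal sketch of such a row-level check follows; `ohlc_violations` is a hypothetical helper, and the third sample row is deliberately broken for illustration.

```python
import pandas as pd


def ohlc_violations(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose high/low do not bracket the open and close prices."""
    body = df[["OPEN_PRICE", "CLOSE_PRICE"]]
    high_too_low = df["HIGH_PRICE"] < body.max(axis=1)
    low_too_high = df["LOW_PRICE"] > body.min(axis=1)
    return df[high_too_low | low_too_high]


# Two rows copied from the head of the dataset, plus one deliberately broken row
sample = pd.DataFrame(
    {
        "OPEN_PRICE": [395.88, 396.15, 100.0],
        "HIGH_PRICE": [398.12, 397.49, 99.0],  # last HIGH is below its OPEN
        "CLOSE_PRICE": [396.15, 397.15, 101.0],
        "LOW_PRICE": [394.43, 395.96, 98.0],
    }
)
print(ohlc_violations(sample))
```

On the real data, `ohlc_violations(df)` returning an empty DataFrame would indicate that every hourly candle is consistent.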



---

# Data Quality Checks

## Check for missing values 

In [40]:
print("Missing values per column:")
missing_values = df.isnull().sum()
print(missing_values)
print(f"\nTotal missing values: {missing_values.sum()}")

Missing values per column:
TIME_UNIX      0
DATE_STR       0
HOUR_STR       0
OPEN_PRICE     0
HIGH_PRICE     0
CLOSE_PRICE    0
LOW_PRICE      0
VOLUME_FROM    0
VOLUME_TO      0
dtype: int64

Total missing values: 0


## Check for Duplicate Rows

In [41]:
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

if duplicates > 0:
    print("\nDuplicate rows:")
    print(df[df.duplicated(keep=False)])

Number of duplicate rows: 0
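
`df.duplicated()` compares whole rows, so it does not reveal repeated timestamps or missing hours. A quick arithmetic check is possible from the first and last `TIME_UNIX` values shown earlier: a gap-free hourly series between them would contain (1763766000 - 1416031200) / 3600 + 1 = 96,594 rows, which matches the row count, provided no timestamp repeats. The sketch below adds a helper that lists irregular steps; `hourly_gaps` is a hypothetical name.

```python
import pandas as pd


def hourly_gaps(time_unix: pd.Series) -> pd.DataFrame:
    """Return timestamps whose step from the previous one is not exactly 3600 s."""
    ordered = time_unix.sort_values().reset_index(drop=True)
    step = ordered.diff()
    # The first row has no predecessor, so its NaN step is ignored
    irregular = step.notna() & (step != 3600)
    return pd.DataFrame(
        {"TIME_UNIX": ordered[irregular], "STEP_SECONDS": step[irregular]}
    )


# Expected number of hourly rows between the first and last timestamps
first, last = 1416031200, 1763766000
expected_rows = (last - first) // 3600 + 1
print(expected_rows)  # 96594, the same as the row count reported above
```

Calling `hourly_gaps(df["TIME_UNIX"])` on the loaded data would list any duplicated timestamps (step of 0) or missing hours (step above 3600).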


---


# Save Validated Data Checkpoint

## Create Directory Structure

In [43]:
# Create necessary directories
raw_data_dir = 'inputs/datasets/raw'
os.makedirs(raw_data_dir, exist_ok=True)
print(f"✓ Directory created/verified: {raw_data_dir}")

✓ Directory created/verified: inputs/datasets/raw


## Save Raw Data as CSV

In [44]:
# Save validated data
csv_path = f"{raw_data_dir}/bitcoin_raw.csv"
df.to_csv(csv_path, index=False)

# Confirm save
file_size_mb = os.path.getsize(csv_path) / (1024 * 1024)
print("✓ Data saved successfully")
print(f"  Location: {csv_path}")
print(f"  Rows: {len(df):,}")
print(f"  Size: {file_size_mb:.2f} MB")
print(f"  Timestamp: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}")

✓ Data saved successfully
  Location: inputs/datasets/raw/bitcoin_raw.csv
  Rows: 96,594
  Size: 7.07 MB
  Timestamp: 2025-11-22 04:58:05
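
Since later notebooks rely on the checkpoint rather than the in-memory DataFrame, it can be worth confirming that the file round-trips. The sketch below reloads the CSV and compares it with the original; `verify_checkpoint` is a hypothetical helper, and `check_exact=False` is used on the assumption that float parsing may differ in the last digits.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def verify_checkpoint(df: pd.DataFrame, csv_path: str) -> pd.DataFrame:
    """Reload the saved CSV and raise AssertionError if it differs from `df`."""
    reloaded = pd.read_csv(csv_path)
    # Tolerate tiny float differences introduced by text serialisation
    assert_frame_equal(df.reset_index(drop=True), reloaded, check_exact=False)
    return reloaded
```

In this notebook the call would be `verify_checkpoint(df, csv_path)`, placed right after the save cell.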


---

# Conclusion

## Summary

✓ **Data Collection Complete**

This notebook successfully:
1. Used the centralized validation helper (`src/raw_data_validation.py`)
2. Fetched Bitcoin hourly OHLCV data from the GitHub repository
3. Automatically validated data structure, safety, and integrity
4. Explored the validated dataset
5. Saved the validated data as a CSV checkpoint (`inputs/datasets/raw/bitcoin_raw.csv`)

**Security Measures Applied:**
- Column structure verification
- Character injection prevention (no dangerous symbols)
- Date/time format validation
- Price range sanity checks
- Data completeness validation
- Timestamp range verification

**Key Findings:**
- Data covers the period from 2014-11-15 to 2025-11-21
- Hourly granularity should provide sufficient detail for short-term predictions
- All security validations passed
- **The dataset passed every quality check run in this notebook:**
  - No missing values detected
  - No duplicate rows found
  - Automated data collection favors consistency
  - A public API source avoids manual entry errors
  - Validation found no structure, range, or timestamp problems

**Data Quality Notes:**
- This dataset benefits from automated collection via the CryptoCompare API
- Programmatic data generation minimizes human input errors
- The repository's automated daily updates keep the data current, but upstream quality is not guaranteed, so every fetch is validated here
- A cleaning pipeline will still be implemented for:
  - Future-proofing against potential data gaps
  - Demonstrating data preparation best practices
  - Handling edge cases in production deployment

**Data Pipeline Approach:**
- Validated raw data saved to: `inputs/datasets/raw/bitcoin_raw.csv`
- Subsequent notebooks will load from CSV (fast, reliable)
- The data cleaning notebook should need only minimal intervention
- Feature engineering will transform raw data to ML-ready format

**Next Steps:**
1. Proceed to Data Cleaning notebook (`2_DataCleaning.ipynb`)
2. Load raw CSV and confirm data quality
3. Prepare for feature engineering phase

