# 01 - Data Download and Preprocessing

**Author:** Lucas Little  
**Course:** CSCA 5522: Data Mining Project  
**University:** University of Colorado - Boulder

This notebook handles the initial data acquisition and preprocessing for the cryptocurrency sentiment analysis project.

## Objectives
1. Download and load the Bitcoin tweets dataset from Kaggle
2. Download and load the Bitcoin historical price dataset from Kaggle
3. Perform initial data exploration and quality assessment
4. Clean and preprocess the datasets
5. Save processed data for subsequent analysis

In [1]:
# Core imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
import os
import re
import string
import json
from pathlib import Path
import zipfile
import shutil

warnings.filterwarnings('ignore')
plt.style.use('default')
sns.set_palette("husl")
np.random.seed(42)

print("Environment setup complete!")

Environment setup complete!


## 1. Data Directory Setup

In [2]:
# Create data directory structure
data_dir = Path('../data') # Relative to the notebook directory
raw_data_dir = data_dir / 'raw'
processed_data_dir = data_dir / 'processed'

# Create directories if they don't exist (parents=True also creates missing ancestors)
for directory in [data_dir, raw_data_dir, processed_data_dir]:
    directory.mkdir(parents=True, exist_ok=True)
    print(f"Directory created/verified: {directory}")

Directory created/verified: ../data
Directory created/verified: ../data/raw
Directory created/verified: ../data/processed


## 2. Download Data from Kaggle

In [3]:
# Install kagglehub into the environment of the running kernel
%pip install -q kagglehub

import kagglehub

# Download the latest version of each dataset; dataset_download returns the local cache path
tweet_path = kagglehub.dataset_download("alaix14/bitcoin-tweets-20160101-to-20190329")
price_path = kagglehub.dataset_download("mczielinski/bitcoin-historical-data")

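# Copy files to the data/raw directory
def copy_files(src_dir, dest_dir):
    """Recursively copy every file under src_dir into dest_dir.

    Subdirectories are flattened, so files that share a name overwrite each other.
    """
    for item in os.listdir(src_dir):
        s = os.path.join(src_dir, item)
        d = os.path.join(dest_dir, item)
        if os.path.isdir(s):
            # Recurse, but keep writing into the same flat destination
            copy_files(s, dest_dir)
        else:
            shutil.copy2(s, d)
            print(f"Copied {s} to {d}")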

copy_files(tweet_path, raw_data_dir)
copy_files(price_path, raw_data_dir)

print("✅ Datasets downloaded and copied to data/raw directory.")

Copied /Users/luke/.cache/kagglehub/datasets/alaix14/bitcoin-tweets-20160101-to-20190329/versions/2/tweets.csv to ../data/raw/tweets.csv
Copied /Users/luke/.cache/kagglehub/datasets/mczielinski/bitcoin-historical-data/versions/287/btcusd_1-min_data.csv to ../data/raw/btcusd_1-min_data.csv
✅ Datasets downloaded and copied to data/raw directory.
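
Because `copy_files` flattens subdirectories, two files with the same name in different cache folders would silently overwrite each other. The sketch below is a hypothetical variant (`copy_files_flat` is not part of the original notebook) that checks for name collisions before copying anything. It is demonstrated on throwaway directories rather than the Kaggle cache.

```python
import shutil
import tempfile
from collections import Counter
from pathlib import Path


def copy_files_flat(src_dir, dest_dir):
    """Copy every file under src_dir into dest_dir, flattening subfolders.

    Hypothetical variant of copy_files: it refuses to run when two source
    files share a name, because one would overwrite the other.
    """
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    files = sorted(p for p in src_dir.rglob('*') if p.is_file())

    # Check for collisions before copying anything.
    counts = Counter(p.name for p in files)
    clashes = sorted(name for name, n in counts.items() if n > 1)
    if clashes:
        raise FileExistsError(f"Duplicate file names under {src_dir}: {clashes}")

    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in files:
        target = dest_dir / src.name
        shutil.copy2(src, target)
        copied.append(target)
    return copied


# Demo on throwaway directories (no Kaggle download involved).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'src'
    (src / 'versions' / '2').mkdir(parents=True)
    (src / 'versions' / '2' / 'tweets.csv').write_text('id;text\n1;hello\n')
    copied = copy_files_flat(src, Path(tmp) / 'raw')
    print([p.name for p in copied])
```

For the two datasets used here the file names differ, so the original helper and this variant should produce the same result.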


## 3. Load Datasets

In [4]:
# Load real datasets from Kaggle
print("📊 Loading real datasets...")
try:
    # The tweets file is semicolon-delimited; read it in chunks and skip malformed rows
    tweet_chunks = pd.read_csv(
        raw_data_dir / 'tweets.csv',
        delimiter=';',
        on_bad_lines='skip',
        engine='python',
        chunksize=100000,
    )
    tweet_data = pd.concat(tweet_chunks, ignore_index=True)
    price_data = pd.read_csv(raw_data_dir / 'btcusd_1-min_data.csv')
    print("✅ Loaded real datasets from Kaggle")
except FileNotFoundError as e:
    print(f"⚠️ Error: {e}")
    print("Please make sure the Kaggle datasets have been downloaded to the 'data/raw' directory.")
    tweet_data, price_data = pd.DataFrame(), pd.DataFrame()

if not tweet_data.empty and not price_data.empty:
    print("\nDataset Summary:")
    print(f"Tweet data shape: {tweet_data.shape}")
    print(f"Price data shape: {price_data.shape}")

📊 Loading real datasets...
✅ Loaded real datasets from Kaggle

Dataset Summary:
Tweet data shape: (16890422, 9)
Price data shape: (7103245, 6)
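
The objectives list a quality assessment, while the loading cell reports only the shapes. A minimal sketch of what such a check could look like follows, assuming that row counts, duplicate rows, and missing values per column are the indicators of interest. The `quality_report` helper and the toy frame are hypothetical and stand in for `tweet_data`.

```python
import pandas as pd


def quality_report(df):
    """Summarize basic quality indicators for a DataFrame (hypothetical helper)."""
    return {
        'rows': len(df),
        # Rows that exactly repeat an earlier row
        'duplicate_rows': int(df.duplicated().sum()),
        # Number of missing values in each column
        'missing_by_column': {col: int(n) for col, n in df.isna().sum().items()},
    }


# Toy frame standing in for tweet_data; the values are made up for illustration.
toy = pd.DataFrame({
    'text': ['btc up', None, 'btc up'],
    'like_count': [1, 2, 1],
})
report = quality_report(toy)
print(report)
```

On the full tweet table, `df.duplicated()` compares every column of roughly 16.9 million rows, so restricting the check to a key column such as `tweet_id` may be more practical.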


## 4. Preprocess Datasets

In [6]:
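def preprocess_tweets(df):
    """Keep only the columns used downstream and drop tweets without text."""
    columns = ['timestamp', 'text', 'user_name', 'reply_count', 'like_count', 'retweet_count', 'tweet_id']
    # Select first, then drop: this returns a new frame instead of mutating the caller's DataFrame
    return df[columns].dropna(subset=['text'])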

def preprocess_prices(df):
    """Keep the timestamp and OHLCV columns and lowercase their names."""
    df = df[['timestamp', 'Open', 'High', 'Low', 'Close', 'Volume']]

    df = df.rename(columns={
        'Open': 'open',
        'High': 'high',
        'Low': 'low',
        'Close': 'close',
        'Volume': 'volume'
    })

    return df

if not tweet_data.empty and not price_data.empty:
    print("Preprocessing datasets...")
    tweet_data = preprocess_tweets(tweet_data)
    price_data = preprocess_prices(price_data)
    print("✅ Preprocessing complete!")
    
    print(f"Final tweet data shape: {tweet_data.shape}")
    print(f"Final price data shape: {price_data.shape}")

Preprocessing datasets...
✅ Preprocessing complete!
Final tweet data shape: (16889041, 7)
Final price data shape: (7103245, 6)
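
The preprocessing functions select and rename columns but leave the `timestamp` columns in whatever type `read_csv` produced. Steps that align tweets with prices will generally need both as datetimes. The sketch below shows one way to convert them, under two assumptions that the output above does not confirm: tweet timestamps are date strings, and price timestamps are Unix epoch seconds. Both helper names are hypothetical.

```python
import pandas as pd


def parse_tweet_timestamps(df, column='timestamp'):
    """Parse string timestamps to UTC datetimes and drop rows that fail to parse."""
    out = df.copy()
    out[column] = pd.to_datetime(out[column], errors='coerce', utc=True)
    return out.dropna(subset=[column])


def parse_price_timestamps(df, column='timestamp'):
    """Interpret integer timestamps as Unix epoch seconds (an assumption)."""
    out = df.copy()
    out[column] = pd.to_datetime(out[column], unit='s', utc=True)
    return out


# Toy rows standing in for the real tables; the values are made up.
tweets = pd.DataFrame({
    'timestamp': ['2019-03-29 12:00:00+00:00', 'not a date'],
    'text': ['btc up', 'btc down'],
})
prices = pd.DataFrame({'timestamp': [1325376000, 1325376060], 'close': [4.5, 4.6]})

print(parse_tweet_timestamps(tweets))
print(parse_price_timestamps(prices))
```

Converting both columns to UTC keeps them comparable. Dropping unparseable tweet rows changes the row count, so the final shapes would differ from those printed above whenever bad timestamps exist.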


## 5. Save Processed Data

In [7]:
if not tweet_data.empty and not price_data.empty:
    print("💾 Saving processed datasets...")
    
    # Save tweet data
    tweet_output_path = processed_data_dir / 'tweets_processed.csv'
    tweet_data.to_csv(tweet_output_path, index=False)
    print(f"✅ Saved processed tweets: {tweet_output_path}")
    
    # Save price data
    price_output_path = processed_data_dir / 'prices_processed.csv'
    price_data.to_csv(price_output_path, index=False)
    print(f"✅ Saved processed prices: {price_output_path}")
    
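    # Save metadata
    # Parse timestamps so min()/max() return datetimes; plain strings have no .isoformat()
    tweet_times = pd.to_datetime(tweet_data['timestamp'], errors='coerce', utc=True)
    metadata = {
        'processing_date': datetime.now().isoformat(),
        'tweet_count': len(tweet_data),
        'price_count': len(price_data),
        'date_range': {
            'start': tweet_times.min().isoformat(),
            'end': tweet_times.max().isoformat()
        }
    }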
    
    metadata_path = processed_data_dir / 'metadata.json'
    with open(metadata_path, 'w') as f:
        json.dump(metadata, f, indent=2)
    
    print(f"✅ Saved metadata: {metadata_path}")
    print("\n🎉 Data preprocessing complete!")
else:
    print("No data to save.")

💾 Saving processed datasets...
✅ Saved processed tweets: ../data/processed/tweets_processed.csv
✅ Saved processed prices: ../data/processed/prices_processed.csv
✅ Saved metadata: ../data/processed/metadata.json

🎉 Data preprocessing complete!
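
A quick round-trip check can confirm that a saved CSV matches the counts recorded in `metadata.json`. The sketch below is one possible check, not part of the original notebook. `row_count_matches` is a hypothetical helper, and the demo writes small stand-in files to a temporary directory instead of reading `../data/processed`.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def row_count_matches(csv_path, metadata_path, count_key):
    """Return True when the CSV row count equals the count stored in the metadata."""
    with open(metadata_path) as f:
        expected = json.load(f)[count_key]
    # Count rows chunk by chunk so a large file is never held in memory at once.
    actual = sum(len(chunk) for chunk in pd.read_csv(csv_path, chunksize=100_000))
    return actual == expected


# Demo with stand-in files; the rows are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'tweets_processed.csv'
    metadata_path = Path(tmp) / 'metadata.json'
    pd.DataFrame({'tweet_id': [1, 2, 3], 'text': ['a', 'b', 'c']}).to_csv(csv_path, index=False)
    metadata_path.write_text(json.dumps({'tweet_count': 3}))
    print(row_count_matches(csv_path, metadata_path, 'tweet_count'))
```

In the notebook the same call could be pointed at `tweet_output_path` with `'tweet_count'` and at `price_output_path` with `'price_count'`.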
