# 01 - Data Download and Preprocessing

**Author:** Lucas Little  
**Course:** CSCA 5522: Data Mining Project  
**University:** University of Colorado - Boulder

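This notebook handles the initial data acquisition and preprocessing for the cryptocurrency sentiment analysis project. When the Kaggle files are not present, it falls back to synthetic sample data, which is what the outputs shown here reflect.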

## Objectives
1. Download and load Bitcoin tweets dataset from Kaggle
2. Download and load Bitcoin historical price data
3. Perform initial data exploration and quality assessment
4. Clean and preprocess the datasets
5. Save processed data for subsequent analysis

In [3]:
# Core imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
import os
import re
import string
import json
from pathlib import Path

warnings.filterwarnings('ignore')

# Set style
plt.style.use('default')
sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

print("Environment setup complete!")
print(f"Current working directory: {os.getcwd()}")

Environment setup complete!
Current working directory: /Users/luke/Desktop/data mining project/notebook
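
The setup cell seeds NumPy's global random state, and each data generation function later re-seeds it. A possible alternative, sketched here as an assumption and not as part of the original notebook, is a generator local to each function, so results do not depend on the global state. The helper name `draw_followers` is hypothetical.

```python
import numpy as np

def draw_followers(n, seed=42):
    """Draw n follower counts from a generator local to this call (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # independent of the global np.random state
    return rng.exponential(1000, n).astype(int)

first = draw_followers(5)
np.random.seed(0)            # changing the global state...
second = draw_followers(5)   # ...does not affect the local generator
print(np.array_equal(first, second))  # prints True
```

With this pattern, a function stays reproducible even if other cells draw random numbers in between.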


## 1. Data Directory Setup

In [4]:
# Create data directory structure
data_dir = Path('data')
raw_data_dir = data_dir / 'raw'
processed_data_dir = data_dir / 'processed'

# Create directories if they don't exist
for directory in [data_dir, raw_data_dir, processed_data_dir]:
    directory.mkdir(parents=True, exist_ok=True)
    print(f"Directory created/verified: {directory}")

print("\nData directory structure:")
print("data/")
print("├── raw/          # Original datasets from Kaggle")
print("└── processed/    # Cleaned and processed datasets")

Directory created/verified: data
Directory created/verified: data/raw
Directory created/verified: data/processed

Data directory structure:
data/
├── raw/          # Original datasets from Kaggle
└── processed/    # Cleaned and processed datasets


## 2. Data Loading Functions

In [5]:
def create_sample_bitcoin_tweets(n_tweets=50000, start_date='2018-01-01', end_date='2019-03-29'):
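    """
    Create a synthetic Bitcoin tweets dataset for demonstration.

    Tweets are sampled uniformly from 30 fixed templates (10 positive,
    10 negative, 10 neutral) and spaced evenly between start_date and end_date.
    In production, replace with actual Kaggle dataset loading.
    """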
    np.random.seed(42)
    
    # Sample tweet templates with varying sentiment
    positive_tweets = [
        "Bitcoin is going to the moon! 🚀 #BTC #crypto #bullish",
        "HODL strong! Bitcoin will recover soon #diamondhands",
        "Bitcoin breaking resistance levels! Bullish! #BTC",
        "Buying the Bitcoin dip, great opportunity #crypto",
        "Institutional adoption driving Bitcoin up #bitcoin",
        "Bitcoin whale accumulation detected! Bullish signal",
        "Technical analysis shows Bitcoin reversal incoming",
        "Bitcoin hashrate hitting new highs! Network strong",
        "Major companies adding Bitcoin to balance sheet",
        "Bitcoin Lightning Network adoption growing fast"
    ]
    
    negative_tweets = [
        "Bearish on Bitcoin today, expecting a dip #crypto",
        "Selling my Bitcoin, market looks scary #bearish",
        "Crypto winter is here, Bitcoin falling hard",
        "Bitcoin correlation with stocks concerning #risk",
        "Regulatory news impacting Bitcoin price negatively",
        "Bitcoin energy consumption criticism mounting",
        "Market manipulation in Bitcoin obvious today",
        "Bitcoin transaction fees too high for adoption",
        "Tether concerns affecting Bitcoin confidence",
        "Bitcoin technical indicators showing weakness"
    ]
    
    neutral_tweets = [
        "Bitcoin volatility is insane today #crypto",
        "Bitcoin mining difficulty adjustment coming",
        "Watching Bitcoin price action closely today",
        "Bitcoin options expiry this Friday #derivatives",
        "Bitcoin dominance at interesting levels",
        "Bitcoin halving cycle analysis #crypto",
        "Bitcoin on-chain metrics looking neutral",
        "Bitcoin futures market showing mixed signals",
        "Bitcoin ETF news still pending approval",
        "Bitcoin developer activity remains strong"
    ]
    
    all_tweets = positive_tweets + negative_tweets + neutral_tweets
    
    # Generate timestamps
    start = pd.to_datetime(start_date)
    end = pd.to_datetime(end_date)
    timestamps = pd.date_range(start, end, periods=n_tweets)
    
    # Generate tweet data
    tweets = np.random.choice(all_tweets, n_tweets)
    
    # Add synthetic metadata (randomly drawn, not taken from real accounts)
    tweet_data = pd.DataFrame({
        'timestamp': timestamps,
        'text': tweets,
        'user_followers': np.random.exponential(1000, n_tweets).astype(int),
        'retweet_count': np.random.poisson(5, n_tweets),
        'like_count': np.random.poisson(10, n_tweets),
        'user_verified': np.random.choice([True, False], n_tweets, p=[0.1, 0.9]),
        'tweet_id': [f"tweet_{i}" for i in range(n_tweets)]
    })
    
    return tweet_data

def create_sample_bitcoin_prices(start_date='2018-01-01', end_date='2019-03-29', freq='1min'):
    """
    Create synthetic Bitcoin OHLCV price data for demonstration.

    Prices follow a Gaussian random walk in log space with a small upward drift.
    In production, replace with actual Kaggle dataset loading.
    """
    np.random.seed(42)
    
    # Generate timestamps
    timestamps = pd.date_range(start_date, end_date, freq=freq)
    n_periods = len(timestamps)
    
    # Simulate realistic Bitcoin price movements
    initial_price = 10000  # Starting price around $10,000
    
    # Generate returns with realistic volatility
    daily_vol = 0.04  # 4% daily volatility
    minute_vol = daily_vol / np.sqrt(24 * 60)  # Scale to minute volatility
    
    returns = np.random.normal(0, minute_vol, n_periods)
    
    # Add a small upward drift that grows linearly over time
    # (about 0.25 in total log return; no mean reversion is modeled)
    trend = np.linspace(0, 0.5, n_periods)
    returns += trend / n_periods
    
    # Calculate prices
    log_prices = np.log(initial_price) + np.cumsum(returns)
    prices = np.exp(log_prices)
    
    # Generate OHLCV data
    # Note: open and close share the same simulated price, so each bar has a zero-height body
    price_data = pd.DataFrame({
        'timestamp': timestamps,
        'open': prices,
        'high': prices * (1 + np.abs(np.random.normal(0, 0.002, n_periods))),
        'low': prices * (1 - np.abs(np.random.normal(0, 0.002, n_periods))),
        'close': prices,
        'volume': np.random.exponential(1000000, n_periods)
    })
    
    # Ensure high >= close >= low and high >= open >= low
    price_data['high'] = np.maximum(price_data['high'], 
                                   np.maximum(price_data['open'], price_data['close']))
    price_data['low'] = np.minimum(price_data['low'], 
                                  np.minimum(price_data['open'], price_data['close']))
    
    return price_data

print("Data generation functions defined!")

Data generation functions defined!
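
Two numeric choices in `create_sample_bitcoin_prices` can be checked in isolation: the square-root-of-time scaling from daily to minute volatility, and the total drift added by the `trend` term. The sketch below re-simulates returns with its own generator and seed (an assumption for illustration, not the notebook's random stream) and uses the 650,881 rows reported for the price data.

```python
import numpy as np

MINUTES_PER_DAY = 24 * 60
daily_vol = 0.04
minute_vol = daily_vol / np.sqrt(MINUTES_PER_DAY)  # square-root-of-time scaling

# Simulate independent minute log returns for an illustrative 400 days
rng = np.random.default_rng(0)
n_days = 400
minute_returns = rng.normal(0, minute_vol, n_days * MINUTES_PER_DAY)

# Summing the minute log returns within each day gives daily log returns
daily_returns = minute_returns.reshape(n_days, MINUTES_PER_DAY).sum(axis=1)
print(f"target daily vol:   {daily_vol:.4f}")
print(f"realized daily vol: {daily_returns.std(ddof=1):.4f}")

# Drift term: returns += linspace(0, 0.5, n) / n
# The mean of linspace(0, 0.5, n) is 0.25, so the total added log return is 0.25
n_periods = 650_881
total_drift = np.linspace(0, 0.5, n_periods).sum() / n_periods
print(f"total log drift: {total_drift:.4f} (price factor {np.exp(total_drift):.3f})")
```

The realized daily volatility should land close to the 4% target, and the drift alone multiplies the final price by roughly exp(0.25) relative to a driftless walk.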


## 3. Load Datasets

In [6]:
# Try to load real datasets, fall back to sample data
try:
    # Attempt to load real Kaggle datasets
    # Uncomment and modify these paths when you have the actual datasets
    
    # tweet_data = pd.read_csv(raw_data_dir / 'bitcoin_tweets.csv')
    # price_data = pd.read_csv(raw_data_dir / 'bitcoin_prices.csv')
    # print("✅ Loaded real datasets from Kaggle")
    
    # For now, create sample data
    raise FileNotFoundError("Using sample data for demonstration")
    
except FileNotFoundError:
    print("📊 Creating sample datasets for demonstration...")
    print("   (Replace with real Kaggle data loading in production)")
    
    # Create sample datasets
    tweet_data = create_sample_bitcoin_tweets(n_tweets=50000)
    price_data = create_sample_bitcoin_prices()
    
    print("✅ Sample datasets created successfully")

print("\nDataset Summary:")
print(f"Tweet data shape: {tweet_data.shape}")
print(f"Price data shape: {price_data.shape}")
print(f"Tweet date range: {tweet_data['timestamp'].min()} to {tweet_data['timestamp'].max()}")
print(f"Price date range: {price_data['timestamp'].min()} to {price_data['timestamp'].max()}")

📊 Creating sample datasets for demonstration...
   (Replace with real Kaggle data loading in production)
✅ Sample datasets created successfully

Dataset Summary:
Tweet data shape: (50000, 7)
Price data shape: (650881, 6)
Tweet date range: 2018-01-01 00:00:00 to 2019-03-29 00:00:00
Price date range: 2018-01-01 00:00:00 to 2019-03-29 00:00:00
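
Objectives 3 and 4 call for a quality assessment and cleaning, but no such step runs between loading and saving. The following is one possible sketch, not the author's method: `quality_report` and `clean_tweet_text` are hypothetical helpers, and the cleaning rules (lowercasing, dropping URLs and @mentions, keeping hashtag words) are assumptions that should be adapted to the real Kaggle data.

```python
import re
import pandas as pd

URL_RE = re.compile(r'https?://\S+')
MENTION_RE = re.compile(r'@\w+')

def clean_tweet_text(text):
    """Lowercase, drop URLs and @mentions, keep hashtag words, collapse whitespace."""
    text = URL_RE.sub(' ', text.lower())
    text = MENTION_RE.sub(' ', text)
    text = text.replace('#', ' ')
    return re.sub(r'\s+', ' ', text).strip()

def quality_report(df, key=None):
    """Count rows, missing values per column, and duplicate rows (or duplicate keys)."""
    return {
        'rows': len(df),
        'missing': df.isna().sum().to_dict(),
        'duplicates': int(df.duplicated(subset=key).sum()),
    }

# Small illustrative frame with one duplicate id and one missing text
sample = pd.DataFrame({
    'tweet_id': ['tweet_0', 'tweet_1', 'tweet_1'],
    'text': ['Buying the dip @alice https://example.com #BTC',
             'Bitcoin  volatility is insane today #crypto',
             None],
})
print(quality_report(sample, key='tweet_id'))

# Clean only the rows that actually have text
cleaned = sample['text'].dropna().map(clean_tweet_text)
print(cleaned.tolist())
```

On the notebook's data, the same calls would be `quality_report(tweet_data, key='tweet_id')` and `tweet_data['text'].map(clean_tweet_text)`.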


## 4. Save Processed Data

In [7]:
# Save processed datasets
print("💾 Saving processed datasets...")

# Save tweet data
tweet_output_path = processed_data_dir / 'tweets_processed.csv'
tweet_data.to_csv(tweet_output_path, index=False)
print(f"✅ Saved processed tweets: {tweet_output_path}")

# Save price data
price_output_path = processed_data_dir / 'prices_processed.csv'
price_data.to_csv(price_output_path, index=False)
print(f"✅ Saved processed prices: {price_output_path}")

# Save metadata
metadata = {
    'processing_date': datetime.now().isoformat(),
    'tweet_count': len(tweet_data),
    'price_count': len(price_data),
    'date_range': {
        'start': tweet_data['timestamp'].min().isoformat(),
        'end': tweet_data['timestamp'].max().isoformat()
    }
}

metadata_path = processed_data_dir / 'metadata.json'
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"✅ Saved metadata: {metadata_path}")
print("\n🎉 Data preprocessing complete!")

💾 Saving processed datasets...
✅ Saved processed tweets: data/processed/tweets_processed.csv
✅ Saved processed prices: data/processed/prices_processed.csv
✅ Saved metadata: data/processed/metadata.json

🎉 Data preprocessing complete!
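
Because CSV does not preserve dtypes, a later notebook has to parse the `timestamp` column again when reading these files back. The sketch below shows one way to do that; `load_processed` is a hypothetical helper, and the round trip uses tiny stand-in files in a temporary directory instead of the real outputs.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def load_processed(processed_dir):
    """Read the saved tweets, prices, and metadata, parsing timestamps as datetimes."""
    processed_dir = Path(processed_dir)
    tweets = pd.read_csv(processed_dir / 'tweets_processed.csv', parse_dates=['timestamp'])
    prices = pd.read_csv(processed_dir / 'prices_processed.csv', parse_dates=['timestamp'])
    with open(processed_dir / 'metadata.json') as f:
        metadata = json.load(f)
    return tweets, prices, metadata

# Round trip with small stand-in files that use the same file names
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    ts = pd.date_range('2018-01-01', periods=3, freq='1min')
    pd.DataFrame({'timestamp': ts, 'text': ['a', 'b', 'c']}).to_csv(
        tmp / 'tweets_processed.csv', index=False)
    pd.DataFrame({'timestamp': ts, 'close': [1.0, 2.0, 3.0]}).to_csv(
        tmp / 'prices_processed.csv', index=False)
    (tmp / 'metadata.json').write_text(json.dumps({'tweet_count': 3, 'price_count': 3}))
    tweets, prices, metadata = load_processed(tmp)

print(tweets['timestamp'].dtype)
print(metadata['tweet_count'] == len(tweets))
```

Comparing the counts in `metadata.json` with the loaded row counts gives a cheap check that the files were not truncated or overwritten.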
