# 03 - Data Alignment and Sampling

**Author:** Lucas Little  
**Course:** CSCA 5522: Data Mining Project  
**University:** University of Colorado - Boulder

This notebook finds the time period covered by both the cleaned tweet dataset and the cleaned price dataset, then draws 5 random one-week samples from that period and saves each sample to its own file.

## Objectives
1. Load the cleaned tweet and price data
2. Find the overlapping time period
3. Create 5 random one-week samples of the data
4. Save the sampled data to new files

In [14]:
# Core imports
import pandas as pd
from pathlib import Path
import random

print("Environment setup complete!")

Environment setup complete!


## 1. Load Cleaned Data

In [15]:
# Load cleaned data
data_dir = Path('../data')
processed_data_dir = data_dir / 'processed'

print("📊 Loading cleaned data...")
tweets_df = pd.read_csv(processed_data_dir / 'tweets_cleaned.csv')
prices_df = pd.read_csv(processed_data_dir / 'prices_cleaned.csv')

# Parse timestamps so that min/max and range filters compare datetimes, not strings
tweets_df['timestamp'] = pd.to_datetime(tweets_df['timestamp'])
prices_df['timestamp'] = pd.to_datetime(prices_df['timestamp'])

print(f"Loaded {len(tweets_df):,} cleaned tweets")
print(f"Loaded {len(prices_df):,} cleaned price records")

📊 Loading cleaned data...
Loaded 17,298,706 cleaned tweets
Loaded 7,103,245 cleaned price records


## 2. Find Overlapping Time Period

In [16]:
tweet_start = tweets_df['timestamp'].min()
tweet_end = tweets_df['timestamp'].max()
price_start = prices_df['timestamp'].min()
price_end = prices_df['timestamp'].max()

# The overlap runs from the later of the two start times to the earlier of the two end times
overlap_start = max(tweet_start, price_start)
overlap_end = min(tweet_end, price_end)

print(f"Tweet data range: {tweet_start} to {tweet_end}")
print(f"Price data range: {price_start} to {price_end}")
print(f"Overlapping data range: {overlap_start} to {overlap_end}")

Tweet data range: 2007-04-19 07:14:38 to 2019-11-23 15:45:57
Price data range: 2012-01-01 10:01:00 to 2025-07-05 00:45:00
Overlapping data range: 2012-01-01 10:01:00 to 2019-11-23 15:45:57
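
The overlap computed above is the later of the two start times paired with the earlier of the two end times. A minimal sketch of that logic as a reusable helper follows. `find_overlap` is a hypothetical name, not part of the notebook, and the guard against non-intersecting ranges is an addition that the cell above does not perform. Standard-library `datetime` values keep the block self-contained; `pd.Timestamp` values should behave the same way because they compare like datetimes.

```python
from datetime import datetime


def find_overlap(a_start, a_end, b_start, b_end):
    """Return (start, end) of the intersection of two time ranges.

    Hypothetical helper: raises ValueError when the ranges do not intersect.
    """
    start = max(a_start, b_start)  # later of the two starts
    end = min(a_end, b_end)        # earlier of the two ends
    if start > end:
        raise ValueError("the two time ranges do not overlap")
    return start, end


# Ranges copied from the output printed above
tweet_range = (datetime(2007, 4, 19, 7, 14, 38), datetime(2019, 11, 23, 15, 45, 57))
price_range = (datetime(2012, 1, 1, 10, 1, 0), datetime(2025, 7, 5, 0, 45, 0))

overlap = find_overlap(*tweet_range, *price_range)
print(overlap[0], "to", overlap[1])
# prints: 2012-01-01 10:01:00 to 2019-11-23 15:45:57
```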


## 3. Create Random Samples

In [17]:
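print("Creating random samples...")
n_samples = 5
sample_duration = pd.Timedelta(days=7)
sampled_tweets = []
sampled_prices = []

# Largest whole-day offset that still leaves room for a full window before overlap_end
max_offset_days = (overlap_end - overlap_start).days - sample_duration.days

# Note: random is not seeded, so each run draws different weeks, and windows may overlap
for i in range(n_samples):
    # Pick a window start a whole number of days after overlap_start
    random_start = overlap_start + pd.Timedelta(days=random.randint(0, max_offset_days))
    random_end = random_start + sample_duration
    
    # Keep rows inside the half-open window [random_start, random_end)
    tweet_sample = tweets_df[(tweets_df['timestamp'] >= random_start) & (tweets_df['timestamp'] < random_end)]
    price_sample = prices_df[(prices_df['timestamp'] >= random_start) & (prices_df['timestamp'] < random_end)]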
    
    sampled_tweets.append(tweet_sample)
    sampled_prices.append(price_sample)
    
    print(f"Sample {i+1}: {random_start} to {random_end} - {len(tweet_sample)} tweets, {len(price_sample)} prices")

print(f"✅ Created {n_samples} random samples.")

Creating random samples...
Sample 1: 2017-02-01 10:01:00 to 2017-02-08 10:01:00 - 2072 tweets, 10080 prices
Sample 2: 2012-11-18 10:01:00 to 2012-11-25 10:01:00 - 240 tweets, 10080 prices
Sample 3: 2017-05-08 10:01:00 to 2017-05-15 10:01:00 - 2820 tweets, 10080 prices
Sample 4: 2015-07-08 10:01:00 to 2015-07-15 10:01:00 - 2973 tweets, 10080 prices
Sample 5: 2018-06-29 10:01:00 to 2018-07-06 10:01:00 - 7663 tweets, 10080 prices
✅ Created 5 random samples.
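
The loop above draws from the global, unseeded `random` module, so rerunning the cell selects different weeks, and nothing prevents two windows from overlapping. The sketch below shows one possible alternative, assuming reproducible and mutually disjoint windows are wanted (the notebook does not state either requirement). `draw_week_starts`, `seed`, and `max_tries` are hypothetical names introduced for illustration.

```python
import random
from datetime import datetime, timedelta


def draw_week_starts(overlap_start, overlap_end, n_samples=5,
                     window_days=7, seed=42, max_tries=10_000):
    """Draw reproducible, non-overlapping window start times (hypothetical helper)."""
    rng = random.Random(seed)  # local generator; leaves the global random state alone
    max_offset = (overlap_end - overlap_start).days - window_days
    if max_offset < 0:
        raise ValueError("overlap period is shorter than one window")

    offsets = []
    for _ in range(max_tries):
        if len(offsets) == n_samples:
            break
        candidate = rng.randint(0, max_offset)
        # Half-open windows are disjoint when their starts are at least window_days apart
        if all(abs(candidate - chosen) >= window_days for chosen in offsets):
            offsets.append(candidate)

    if len(offsets) < n_samples:
        raise RuntimeError("could not place all windows without overlap")
    return [overlap_start + timedelta(days=d) for d in sorted(offsets)]


# Overlap bounds copied from the output of section 2
starts = draw_week_starts(datetime(2012, 1, 1, 10, 1, 0),
                          datetime(2019, 11, 23, 15, 45, 57))
for start in starts:
    print(start, "to", start + timedelta(days=7))
```

Each returned start can replace `random_start` in the loop above. Wrapping it in `pd.Timestamp(start)` keeps the comparison against the `timestamp` column explicit.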


## 4. Save Sampled Data

In [18]:
print("💾 Saving sampled data...")
sampled_dir = processed_data_dir / 'sampled'
# parents=True also creates any missing parent directories
sampled_dir.mkdir(parents=True, exist_ok=True)

for i, (tweets, prices) in enumerate(zip(sampled_tweets, sampled_prices)):
    tweet_output_path = sampled_dir / f'tweets_sample_{i+1}.csv'
    tweets.to_csv(tweet_output_path, index=False)
    
    price_output_path = sampled_dir / f'prices_sample_{i+1}.csv'
    prices.to_csv(price_output_path, index=False)

print(f"✅ Saved {n_samples} sampled datasets to {sampled_dir}")

💾 Saving sampled data...
✅ Saved 5 sampled datasets to ../data/processed/sampled
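
CSV files store timestamps as plain text, so a later notebook that reads these samples needs to parse the `timestamp` column again. The sketch below illustrates a save-and-reload round trip on a tiny made-up frame written to a temporary directory. `save_and_reload` is a hypothetical helper, and the `close` column is an assumption, since the actual price columns are not shown in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_and_reload(df, path):
    """Write df to CSV, then read it back with the timestamp column parsed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return pd.read_csv(path, parse_dates=['timestamp'])


# Made-up rows for illustration only; 'close' is an assumed column name
sample = pd.DataFrame({
    'timestamp': pd.to_datetime(['2017-02-01 10:01:00', '2017-02-01 10:02:00']),
    'close': [1.0, 2.0],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'processed' / 'sampled' / 'prices_sample_1.csv'
    reloaded = save_and_reload(sample, out_path)

# The reloaded column is a datetime dtype again rather than plain strings
print(reloaded.dtypes)
```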
