# 02 - Data Cleaning

**Author:** Lucas Little  
**Course:** CSCA 5522: Data Mining Project  
**University:** University of Colorado - Boulder

This notebook cleans the preprocessed tweet and price data: it drops rows whose timestamps cannot be parsed and normalizes the remaining timestamps to timezone-naive UTC.

## Objectives
1. Load the preprocessed tweet and price data
2. Identify and remove rows with invalid timestamp formats
3. Normalize the remaining timestamps to timezone-naive UTC
4. Save the cleaned data to new files
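
The "identify and remove" objective relies on pandas coercing unparseable values to `NaT`. Below is a minimal sketch of that idea on a small made-up frame. The sample values and the `text` column are illustrative assumptions, not taken from the project data. It also keeps the rejected rows around so they can be inspected before they are dropped.

```python
import pandas as pd

# Hypothetical sample: two valid timestamps, one malformed string, one missing value
sample = pd.DataFrame({
    "timestamp": ["2021-01-05 14:30:00", "not a date", "2021-01-06 09:00:00", None],
    "text": ["a", "b", "c", "d"],
})

# errors="coerce" replaces anything that cannot be parsed with NaT
parsed = pd.to_datetime(sample["timestamp"], errors="coerce")

# Rows flagged here are the ones a later dropna() would remove
invalid_mask = parsed.isna()
invalid_rows = sample[invalid_mask]

# Keep only rows with a usable timestamp
cleaned = sample.assign(timestamp=parsed).dropna(subset=["timestamp"])

print(f"Invalid rows: {len(invalid_rows)}, kept rows: {len(cleaned)}")
```

Note that values that were already missing also end up as `NaT`, so a "removed rows" count based on this approach covers both malformed and missing timestamps.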

In [10]:
# Core imports
import pandas as pd
from pathlib import Path

print("Environment setup complete!")

Environment setup complete!


## 1. Load Preprocessed Data

In [11]:
# Load preprocessed data
data_dir = Path('../data')
processed_data_dir = data_dir / 'processed'

print("📊 Loading preprocessed data...")
tweets_df = pd.read_csv(processed_data_dir / 'tweets_processed.csv')
prices_df = pd.read_csv(processed_data_dir / 'prices_processed.csv')

print(f"Loaded {len(tweets_df):,} tweets")
print(f"Loaded {len(prices_df):,} price records")

📊 Loading preprocessed data...
Loaded 17,298,735 tweets
Loaded 7,103,245 price records


## 2. Clean and Normalize Timestamps

In [12]:
print("Cleaning tweet timestamps...")
initial_rows = len(tweets_df)
# errors='coerce' turns unparseable values into NaT instead of raising
tweets_df['timestamp'] = pd.to_datetime(tweets_df['timestamp'], errors='coerce')
# Drop rows whose timestamp is NaT (unparseable or missing in the source file)
tweets_df.dropna(subset=['timestamp'], inplace=True)
final_rows = len(tweets_df)
print(f"Removed {initial_rows - final_rows:,} rows with invalid timestamps from tweets data.")

print("Cleaning price timestamps...")
initial_rows = len(prices_df)
# Same coerce-then-drop approach for the price data
prices_df['timestamp'] = pd.to_datetime(prices_df['timestamp'], errors='coerce')
prices_df.dropna(subset=['timestamp'], inplace=True)
final_rows = len(prices_df)
print(f"Removed {initial_rows - final_rows:,} rows with invalid timestamps from price data.")

print("Normalizing timezones...")
# Convert both to UTC first to preserve time relationships.
# Timezone-aware values are converted; timezone-naive values are
# assumed to already be in UTC and are only labeled as such.
if tweets_df['timestamp'].dt.tz is not None:
    tweets_df['timestamp'] = tweets_df['timestamp'].dt.tz_convert('UTC')
else:
    tweets_df['timestamp'] = tweets_df['timestamp'].dt.tz_localize('UTC')

if prices_df['timestamp'].dt.tz is not None:
    prices_df['timestamp'] = prices_df['timestamp'].dt.tz_convert('UTC')
else:
    prices_df['timestamp'] = prices_df['timestamp'].dt.tz_localize('UTC')

# Now remove timezone info from both (they're now synchronized)
tweets_df['timestamp'] = tweets_df['timestamp'].dt.tz_localize(None)
prices_df['timestamp'] = prices_df['timestamp'].dt.tz_localize(None)

print("Timestamps normalized to UTC (timezone-naive).")

Cleaning tweet timestamps...
Removed 29 rows with invalid timestamps from tweets data.
Cleaning price timestamps...
Removed 0 rows with invalid timestamps from price data.
Normalizing timezones...
Timestamps normalized to UTC (timezone-naive).
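
To illustrate what the normalization step does, here is a small sketch on made-up values. The helper name `to_naive_utc` and the sample timestamps are hypothetical, and the sketch assumes (as the cell above does) that timezone-naive values are already expressed in UTC.

```python
import pandas as pd

def to_naive_utc(ts: pd.Series) -> pd.Series:
    """Return the timestamps as timezone-naive values expressed in UTC."""
    if ts.dt.tz is not None:
        # Aware values: shift the wall-clock time to UTC
        ts = ts.dt.tz_convert("UTC")
    else:
        # Naive values: assumed to be UTC already, so only attach the label
        ts = ts.dt.tz_localize("UTC")
    # Drop the timezone label; the wall-clock values stay in UTC
    return ts.dt.tz_localize(None)

# Hypothetical samples: the same two instants, once with a -05:00 offset, once as naive UTC
aware = pd.to_datetime(pd.Series(["2021-01-05T09:30:00-05:00", "2021-01-05T16:00:00-05:00"]))
naive = pd.to_datetime(pd.Series(["2021-01-05 14:30:00", "2021-01-05 21:00:00"]))

aware_utc = to_naive_utc(aware)
naive_utc = to_naive_utc(naive)

print(aware_utc.tolist())
print(naive_utc.tolist())
```

After normalization both series hold the same timezone-naive values, which is what makes the tweet and price timestamps directly comparable in later steps.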


## 3. Save Cleaned Data

In [13]:
print("💾 Saving cleaned data...")
tweet_output_path = processed_data_dir / 'tweets_cleaned.csv'
tweets_df.to_csv(tweet_output_path, index=False)
print(f"✅ Saved cleaned tweets: {tweet_output_path}")

price_output_path = processed_data_dir / 'prices_cleaned.csv'
prices_df.to_csv(price_output_path, index=False)
print(f"✅ Saved cleaned prices: {price_output_path}")

💾 Saving cleaned data...
✅ Saved cleaned tweets: ../data/processed/tweets_cleaned.csv
✅ Saved cleaned prices: ../data/processed/prices_cleaned.csv
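
CSV files do not store column dtypes, so a notebook that reads the cleaned files has to parse the `timestamp` column again. The sketch below shows a round-trip check using an in-memory buffer in place of the real files. The sample frame and its `close` column are illustrative assumptions.

```python
import io

import pandas as pd

# Hypothetical cleaned frame standing in for the real data
cleaned = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-05 14:30:00", "2021-01-05 21:00:00"]),
    "close": [1.0, 2.0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that the CSV format does not preserve
reloaded = pd.read_csv(buffer, parse_dates=["timestamp"])

is_datetime = pd.api.types.is_datetime64_any_dtype(reloaded["timestamp"])
print(f"timestamp parsed as datetime: {is_datetime}")
```

The same `parse_dates=['timestamp']` argument could be passed when loading `tweets_cleaned.csv` and `prices_cleaned.csv` downstream.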
