# Week 1: Data Collection & Cleaning

This notebook demonstrates the data collection and cleaning process for the CityPulse project.

## Setup: Install Dependencies

Run the cell below first to ensure all required packages are installed.


In [1]:
# Verify required packages are installed
import sys
import os

# Add src to path
project_root = os.path.abspath('..')
sys.path.insert(0, os.path.join(project_root, 'src'))

print("Checking required packages...")
print(f"Python: {sys.executable}\n")

# Test imports
import importlib

# Map each display name to the module that must be importable
required_packages = {
    "requests": "requests",
    "pandas": "pandas",
    "numpy": "numpy",
    "vaderSentiment": "vaderSentiment.vaderSentiment",
    "textblob": "textblob",
    "plotly": "plotly",
    "dash": "dash",
    "dash-bootstrap-components": "dash_bootstrap_components",
    "geopy": "geopy",
    "tweepy": "tweepy",
    "dotenv": "dotenv",  # installed as python-dotenv
}

all_ok = True
for package_name, module_name in required_packages.items():
    try:
        importlib.import_module(module_name)
        print(f"✓ {package_name} - OK")
    except ImportError as e:
        print(f"✗ {package_name} - MISSING")
        print(f"  Error: {e}")
        all_ok = False

if all_ok:
    print("\n✓ All packages are installed! You can now run the data collection cells.")
else:
    print("\n⚠ Some packages are missing!")
    print("Please run this command in your terminal:")
    print(f"  cd '{project_root}' && source .venv/bin/activate && pip install -r requirements.txt")


Checking required packages...
Python: /Users/mukul/Desktop/chieac project/.venv/bin/python

✓ requests - OK
✓ pandas - OK
✓ numpy - OK
✓ vaderSentiment - OK
✓ textblob - OK
✓ plotly - OK
✓ dash - OK
✓ dash-bootstrap-components - OK
✓ geopy - OK
✓ tweepy - OK
✓ dotenv - OK

✓ All packages are installed! You can now run the data collection cells.


## 1. Data Collection

Run the data collection scripts to gather data from each source: 311 service requests, CTA ridership, and Twitter/X.

Each source has its own cell below; run them one at a time.


In [2]:
# Setup: Import path configuration
import sys
import os

# Ensure src is in path
project_root = os.path.abspath('..')
src_path = os.path.join(project_root, 'src')
if src_path not in sys.path:
    sys.path.insert(0, src_path)

# Verify basic imports
try:
    import requests
    import pandas as pd
    print("✓ Setup complete - ready for data collection")
except ImportError as e:
    print(f"✗ Import error: {e}")
    print("Please run the setup cell (cell 1) first!")
    raise


✓ Setup complete - ready for data collection


### 1.1 Collect 311 Service Requests Data


In [3]:
# Collect 311 Service Requests Data
print("="*60)
print("Collecting 311 Service Requests Data")
print("="*60)
print("Source: Chicago Data Portal - Transit-related complaints")
print("="*60)

try:
    from data_collection.collect_311_data import main as collect_311
    collect_311()
    print("\n✓ 311 data collection complete!")
except Exception as e:
    print(f"\n✗ Error collecting 311 data: {e}")
    print("\nNote: This might be due to:")
    print("- API query syntax issues")
    print("- Network connectivity")
    print("- API endpoint changes")
    print("\nYou can continue with other data sources.")


2025-12-08 10:22:52,819 - INFO - Starting 311 data collection
2025-12-08 10:22:52,819 - INFO - Attempting to fetch 311 data with keyword filter...
2025-12-08 10:22:52,819 - INFO - Fetching 311 data from 2025-09-09T00:00:00.000 to 2025-12-08T23:59:59.999
2025-12-08 10:22:52,819 - INFO - Fetching batch: offset=0, limit=5000


Collecting 311 Service Requests Data
Source: Chicago Data Portal - Transit-related complaints


2025-12-08 10:23:12,554 - ERROR - Error fetching data: 400 Client Error: Bad Request for url: https://data.cityofchicago.org/resource/v6vf-nfxy.json?%24limit=5000&%24offset=0&%24where=created_date+%3E%3D+%272025-09-09T00%3A00%3A00.000%27+AND+created_date+%3C%3D+%272025-12-08T23%3A59%3A59.999%27+AND+%28service_request_type+like+%27%25street%25%27+OR+service_request_type+like+%27%25light%25%27+OR+service_request_type+like+%27%25pothole%25%27+OR+service_request_type+like+%27%25traffic%25%27+OR+service_request_type+like+%27%25sidewalk%25%27+OR+service_request_type+like+%27%25alley%25%27%29&%24order=created_date+DESC
2025-12-08 10:23:12,560 - INFO - No data with keyword filter. Trying without service type filter...
2025-12-08 10:23:12,560 - INFO - Fetching 311 data from 2025-09-09T00:00:00.000 to 2025-12-08T23:59:59.999
2025-12-08 10:23:12,561 - INFO - Fetching batch: offset=0, limit=5000
2025-12-08 10:23:15,982 - INFO - Fetched 5000 records (total: 5000)
2025-12-08 10:23:16,488 - INFO - Fe


✓ 311 data collection complete!
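The 400 Bad Request in the log above comes from the keyword-filtered query, which filters on `service_request_type`. The column list printed in the Data Exploration section shows `sr_type` and no `service_request_type`, so a column-name mismatch is one plausible cause (an assumption, since the cleaning step may have renamed columns). Below is a minimal sketch of building the SoQL parameters against `sr_type`. The helper `build_311_params` is hypothetical and is not part of `collect_311_data`; the endpoint, keywords, and date range are copied from the log.

```python
from urllib.parse import urlencode

# Endpoint and keywords are taken from the failing request in the log above.
BASE_URL = "https://data.cityofchicago.org/resource/v6vf-nfxy.json"
KEYWORDS = ["street", "light", "pothole", "traffic", "sidewalk", "alley"]


def build_311_params(start, end, keywords=KEYWORDS, limit=5000, offset=0,
                     type_column="sr_type"):
    """Build SoQL query parameters for one page of 311 requests.

    type_column is an assumption: the failing request used
    service_request_type, while the cleaned data exposes sr_type.
    """
    # Case-insensitive match: upper-case both the column and the keyword.
    keyword_clause = " OR ".join(
        f"upper({type_column}) like '%{kw.upper()}%'" for kw in keywords
    )
    where = (
        f"created_date >= '{start}' AND created_date <= '{end}' "
        f"AND ({keyword_clause})"
    )
    return {
        "$limit": limit,
        "$offset": offset,
        "$where": where,
        "$order": "created_date DESC",
    }


params = build_311_params("2025-09-09T00:00:00.000", "2025-12-08T23:59:59.999")
print(f"{BASE_URL}?{urlencode(params)}")
```

The resulting dictionary can be passed as `params=` to `requests.get(BASE_URL, ...)`. Increasing `offset` by `limit` on each call gives the same batch pattern the log shows.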


### 1.2 Collect CTA Ridership Data


In [4]:
# Collect CTA Ridership Data (Bus & Train)
print("="*60)
print("Collecting CTA Ridership Data")
print("="*60)
print("Source: Chicago Data Portal - Bus and Train ridership")
print("="*60)

try:
    from data_collection.collect_cta_data import main as collect_cta
    collect_cta()
    print("\n✓ CTA data collection complete!")
except Exception as e:
    print(f"\n✗ Error collecting CTA data: {e}")
    print("\nNote: This might be due to:")
    print("- API endpoint changes")
    print("- Network connectivity")
    print("- Data availability for the date range")


2025-12-08 10:23:49,004 - INFO - Starting CTA ridership data collection
2025-12-08 10:23:49,004 - INFO - Fetching CTA bus ridership data
2025-12-08 10:23:49,005 - INFO - Fetching bus data: offset=0, limit=5000


Collecting CTA Ridership Data
Source: Chicago Data Portal - Bus and Train ridership


2025-12-08 10:23:49,501 - INFO - Fetched 2563 bus records (total: 2563)
2025-12-08 10:23:49,506 - INFO - Total bus records: 2563
2025-12-08 10:23:49,508 - INFO - Fetching CTA train ridership data
2025-12-08 10:23:49,508 - INFO - Trying train endpoint: https://data.cityofchicago.org/resource/5neh-572f.json
2025-12-08 10:23:49,509 - INFO - Fetching train data: offset=0, limit=5000
2025-12-08 10:23:49,974 - INFO - Fetched 3168 train records (total: 3168)
2025-12-08 10:23:49,975 - INFO - Successfully fetched train data from https://data.cityofchicago.org/resource/5neh-572f.json
2025-12-08 10:23:49,980 - INFO - Total train records: 3168
2025-12-08 10:23:49,984 - INFO - Combined ridership data: 5731 total records
2025-12-08 10:23:49,997 - INFO - Saved 5731 records to /Users/mukul/Desktop/chieac project/data/raw/cta_raw.csv
2025-12-08 10:23:49,998 - INFO - 
=== Data Summary ===
2025-12-08 10:23:49,998 - INFO - Total records: 5731
2025-12-08 10:23:49,998 - INFO - 
By mode:
2025-12-08 10:23:50,


✓ CTA data collection complete!
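The log reports 2563 bus records and 3168 train records combined into 5731 rows, followed by a "By mode" summary. Here is a minimal sketch of that kind of combine step, assuming each source is a DataFrame and a `mode` column marks where a row came from. The function `combine_ridership` and the sample columns are hypothetical, not taken from `collect_cta_data`.

```python
import pandas as pd


def combine_ridership(bus_df, train_df):
    """Stack bus and train records and tag each row with its mode."""
    bus = bus_df.assign(mode="bus")
    train = train_df.assign(mode="train")
    # ignore_index gives the combined frame a fresh 0..n-1 index.
    return pd.concat([bus, train], ignore_index=True, sort=False)


# Made-up rows for illustration only; the column names are assumptions.
bus_df = pd.DataFrame({"date": ["2025-01-01", "2025-01-02"], "rides": [100, 120]})
train_df = pd.DataFrame({"date": ["2025-01-01"], "rides": [300]})

combined = combine_ridership(bus_df, train_df)
print(f"Combined ridership data: {len(combined)} total records")
print(combined["mode"].value_counts())
```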


### 1.3 Collect Twitter/X Data (Real Data)


In [9]:
# Collect REAL Twitter/X Data using Twitter API v2
# Setup: Ensure src is in path
import sys
import os
project_root = os.path.abspath('..')
src_path = os.path.join(project_root, 'src')
if src_path not in sys.path:
    sys.path.insert(0, src_path)

print("="*60)
print("Collecting REAL Twitter/X Data")
print("="*60)
print("Source: Twitter API v2 - Chicago-related hashtags")
print("Using Bearer Token from .env file")
print("="*60)

try:
    from data_collection.collect_tweets_tweepy import main as collect_tweets_tweepy
    collect_tweets_tweepy()
    print("\n✓ Real Twitter data collected successfully!")
except Exception as e:
    print(f"\n⚠ Error collecting real Twitter data: {e}")
    print("\nTroubleshooting:")
    print("1. Check that .env file exists with TWITTER_BEARER_TOKEN")
    print("2. Verify your Bearer Token is correct")
    print("3. Check the Twitter API rate limits for your access tier")
    print("4. Rate-limit windows can take up to 15 minutes to reset - script will auto-retry")
    print("\nNote: The recent search endpoint only returns tweets from the last 7 days")


2025-12-08 11:41:44,827 - INFO - Starting Twitter data collection using Twitter API v2
2025-12-08 11:41:44,828 - INFO - Using Bearer Token authentication
2025-12-08 11:41:44,830 - INFO - Twitter API client initialized successfully
2025-12-08 11:41:44,831 - INFO - Searching for tweets: CTA (max: 500)


Collecting REAL Twitter/X Data
Source: Twitter API v2 - Chicago-related hashtags
Using Bearer Token from .env file


2025-12-08 11:41:45,207 - INFO - Collected 21/500 tweets for CTA (request 1)
2025-12-08 11:41:45,208 - INFO - No more pages available for CTA
2025-12-08 11:41:45,208 - INFO - Collected 21 tweets for CTA
2025-12-08 11:41:50,210 - INFO - Searching for tweets: ChicagoTransit (max: 500)
2025-12-08 11:56:45,601 - INFO - Collected 4/500 tweets for ChicagoTransit (request 1)
2025-12-08 11:56:45,603 - INFO - No more pages available for ChicagoTransit
2025-12-08 11:56:45,604 - INFO - Collected 4 tweets for ChicagoTransit
2025-12-08 11:56:45,612 - INFO - Processed 25 unique tweets
2025-12-08 11:56:45,618 - INFO - Saved 25 tweets to /Users/mukul/Desktop/chieac project/data/raw/tweets_raw.csv
2025-12-08 11:56:45,618 - INFO - 
=== Data Summary ===
2025-12-08 11:56:45,619 - INFO - Total tweets: 25
2025-12-08 11:56:45,620 - INFO - Date range: 2025-12-07T18:02:35+00:00 to 2025-12-08T17:12:33+00:00
2025-12-08 11:56:45,620 - INFO - Hashtags collected: CTA, ChicagoTransit



✓ Real Twitter data collected successfully!
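The roughly 15-minute gap in the log between the `ChicagoTransit` search starting (11:41:50) and its first result (11:56:45) is consistent with the client waiting for a rate-limit window to reset. The sketch below shows one way to get that behaviour with tweepy, using `wait_on_rate_limit=True` and `tweepy.Paginator`. The functions `build_query` and `collect_hashtag` are hypothetical, as are the query operators chosen here. How `collect_tweets_tweepy` does it is not shown in this notebook.

```python
def build_query(hashtag, lang="en", exclude_retweets=True):
    """Build a recent-search query string for one hashtag."""
    parts = [f"#{hashtag}"]
    if exclude_retweets:
        parts.append("-is:retweet")
    if lang:
        parts.append(f"lang:{lang}")
    return " ".join(parts)


def collect_hashtag(bearer_token, hashtag, max_tweets=500):
    """Page through recent-search results for one hashtag."""
    import tweepy  # imported lazily so build_query works without tweepy installed

    # wait_on_rate_limit=True makes the client sleep until the rate-limit
    # window resets instead of raising tweepy.TooManyRequests.
    client = tweepy.Client(bearer_token=bearer_token, wait_on_rate_limit=True)
    pages = tweepy.Paginator(
        client.search_recent_tweets,
        query=build_query(hashtag),
        tweet_fields=["created_at", "public_metrics"],
        max_results=100,  # per-request maximum for recent search
    )
    return [
        {"id": t.id, "text": t.text, "created_at": t.created_at}
        for t in pages.flatten(limit=max_tweets)
    ]


print(build_query("CTA"))
```

Calling `collect_hashtag(os.getenv("TWITTER_BEARER_TOKEN"), "CTA")` would make the network requests. Only `build_query` runs when the block is executed as written.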


## 2. Data Cleaning

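Clean and preprocess the collected datasets: normalize date columns to datetime, drop rows where every value is missing, and add a basic location structure (full geocoding requires an external API).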


In [6]:
# Run data cleaning
from data_cleaning.clean_data import main as clean_data
clean_data()


2025-12-08 10:45:33,597 - INFO - Starting data cleaning process
  df_311 = pd.read_csv(PROJECT_ROOT / "data" / "raw" / "311_raw.csv")
2025-12-08 10:45:33,816 - INFO - Loaded 311 data: 50000 records
2025-12-08 10:45:33,821 - INFO - Loaded CTA data: 5731 records
2025-12-08 10:45:33,822 - INFO - Loaded tweet data: 23 records
2025-12-08 10:45:33,822 - INFO - Cleaning 311 data
2025-12-08 10:45:33,842 - INFO - Normalized created_date to datetime format
2025-12-08 10:45:33,851 - INFO - Normalized closed_date to datetime format
2025-12-08 10:45:33,934 - INFO - Dropped 0 rows with all missing values
2025-12-08 10:45:33,940 - INFO - Location normalization: Basic structure added. Full geocoding requires external API.
2025-12-08 10:45:33,941 - INFO - Cleaned 311 data: 50000 records
2025-12-08 10:45:34,443 - INFO - Saved cleaned 311 data: 50000 records
2025-12-08 10:45:34,443 - INFO - Cleaning CTA ridership data
2025-12-08 10:45:34,445 - INFO - Normalized date to datetime format
  df_clean[col] = p
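The log lines above name two cleaning steps for the 311 data: normalizing `created_date` and `closed_date` to datetime, and dropping rows where every value is missing. Here is a minimal sketch of those two steps in pandas. The function `clean_311` is hypothetical and is not the code in `data_cleaning.clean_data`; the sample rows are made up.

```python
import pandas as pd


def clean_311(df):
    """Normalize date columns and drop rows that are entirely empty."""
    out = df.copy()
    for col in ("created_date", "closed_date"):
        if col in out.columns:
            # errors="coerce" turns unparseable values into NaT instead of raising.
            out[col] = pd.to_datetime(out[col], errors="coerce")
    before = len(out)
    out = out.dropna(how="all")
    print(f"Dropped {before - len(out)} rows with all missing values")
    return out


# Made-up rows for illustration: one valid, one with a bad date, one fully empty.
sample = pd.DataFrame({
    "sr_number": ["A1", "A2", None],
    "created_date": ["2025-12-08T10:22:52.000", "not a date", None],
    "closed_date": [None, None, None],
})
cleaned = clean_311(sample)
print(cleaned.dtypes)
```

Working on a copy leaves the raw frame untouched, so the raw and cleaned versions can be compared afterwards.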

## 3. Data Exploration

Explore the cleaned datasets to understand their structure.


In [7]:
import pandas as pd
import numpy as np
import os

# Load cleaned datasets (handle missing files gracefully)
print("=== Loading Cleaned Datasets ===\n")

# 311 Data
if os.path.exists('../data/cleaned/311_data.csv'):
    df_311 = pd.read_csv('../data/cleaned/311_data.csv')
    print("=== 311 Data Summary ===")
    print(f"Shape: {df_311.shape}")
    print(f"Columns: {list(df_311.columns)}")
    print(f"\nFirst few rows:")
    print(df_311.head())
    print()
else:
    print("⚠ 311_data.csv not found - run data collection and cleaning first")
    df_311 = None

# CTA Data
if os.path.exists('../data/cleaned/cta_ridership.csv'):
    df_cta = pd.read_csv('../data/cleaned/cta_ridership.csv')
    print("=== CTA Data Summary ===")
    print(f"Shape: {df_cta.shape}")
    print(f"Columns: {list(df_cta.columns)}")
    print(f"\nFirst few rows:")
    print(df_cta.head())
    print()
else:
    print("⚠ cta_ridership.csv not found - run data collection and cleaning first")
    df_cta = None

# Tweet Data
if os.path.exists('../data/cleaned/tweets.csv'):
    df_tweets = pd.read_csv('../data/cleaned/tweets.csv')
    print("=== Tweet Data Summary ===")
    print(f"Shape: {df_tweets.shape}")
    print(f"Columns: {list(df_tweets.columns)}")
    print(f"\nFirst few rows:")
    print(df_tweets.head())
else:
    print("⚠ tweets.csv not found - run data collection and cleaning first")
    df_tweets = None


  df_311 = pd.read_csv('../data/cleaned/311_data.csv')


=== Loading Cleaned Datasets ===

=== 311 Data Summary ===
Shape: (50000, 37)
Columns: ['sr_number', 'sr_type', 'sr_short_code', 'created_department', 'owner_department', 'status', 'origin', 'created_date', 'last_modified_date', 'closed_date', 'street_address', 'city', 'state', 'zip_code', 'street_number', 'street_direction', 'street_name', 'street_type', 'duplicate', 'legacy_record', 'created_hour', 'created_day_of_week', 'created_month', 'x_coordinate', 'y_coordinate', 'latitude', 'longitude', 'location', 'community_area', 'ward', 'electrical_district', 'electricity_grid', 'police_sector', 'police_district', 'police_beat', 'precinct', 'parent_sr_number']

First few rows:
       sr_number                       sr_type sr_short_code  \
0  SR25-02248186     311 INFORMATION ONLY CALL        311IOC   
1  SR25-02248185     311 INFORMATION ONLY CALL        311IOC   
2  SR25-02248184     311 INFORMATION ONLY CALL        311IOC   
3  SR25-02248183  Ice and Snow Removal Request           SDO  
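The indented `df_311 = pd.read_csv(...)` line at the top of the output looks like the tail of a Python warning whose header was cut off. If it was pandas' `DtypeWarning` about mixed-type columns (an assumption), declaring dtypes on load is one way to avoid it. The sketch below reads from an in-memory CSV so it runs without the project files. The rows are made up, and the choice of columns to pin is a guess.

```python
import io

import pandas as pd

# Stand-in for ../data/cleaned/311_data.csv; the rows are made up for illustration.
csv_text = (
    "sr_number,zip_code,ward\n"
    "SR25-1,60601,42\n"
    "SR25-2,,\n"
    "SR25-3,60614-1234,2\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    # Which columns to pin is an assumption; inspect the real warning to decide.
    dtype={"zip_code": "string", "ward": "Int64"},
    # Parse the file in one pass so each column's type is inferred once.
    low_memory=False,
)
print(df.dtypes)
```

Reading `zip_code` as a string keeps values such as ZIP+4 codes intact, and the nullable `Int64` dtype lets `ward` hold missing values without falling back to floats.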

## 4. Data Quality Checks

Check the cleaned datasets for missing values and confirm the date range each one covers.


In [8]:
# Check for missing values
print("=== Missing Values ===\n")

if df_311 is not None and not df_311.empty:
    print("311 Data:")
    print(df_311.isnull().sum())
    print()
else:
    print("311 Data: Not available\n")

if df_cta is not None and not df_cta.empty:
    print("CTA Data:")
    print(df_cta.isnull().sum())
    print()
else:
    print("CTA Data: Not available\n")

if df_tweets is not None and not df_tweets.empty:
    print("Tweet Data:")
    print(df_tweets.isnull().sum())
    print()
else:
    print("Tweet Data: Not available\n")

# Check date ranges
def print_date_range(label, df, date_col):
    """Print the min/max of date_col, or explain why it is unavailable."""
    if df is None:
        print(f"{label}: Not available")
    elif df.empty:
        print(f"{label}: Available but empty")
    elif date_col not in df.columns:
        print(f"{label}: Available but no date column found")
    else:
        print(f"{label}: {df[date_col].min()} to {df[date_col].max()}")

print("=== Date Ranges ===")
print_date_range("311 Data", df_311, 'created_date')
print_date_range("CTA Data", df_cta, 'date')
print_date_range("Tweet Data", df_tweets, 'date')


=== Missing Values ===

311 Data:
sr_number                  0
sr_type                    0
sr_short_code              0
created_department     21431
owner_department           0
status                     0
origin                     0
created_date               0
last_modified_date         0
closed_date            10035
street_address            39
city                    5734
state                   5734
zip_code                7181
street_number             60
street_direction          70
street_name               39
street_type              412
duplicate                  0
legacy_record              0
created_hour               0
created_day_of_week        0
created_month              0
x_coordinate              77
y_coordinate              77
latitude                  77
longitude                 77
location                  77
community_area            96
ward                     100
electrical_district     8671
electricity_grid        8685
police_sector             97
police_di
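Raw counts are easier to judge as a share of the rows. For example, `created_department` is missing in 21431 of the 50000 rows, about 42.9%. Here is a minimal sketch of a summary that reports both, sorted so the worst columns come first. The helper `missing_summary` is hypothetical and the demo frame is made up.

```python
import pandas as pd


def missing_summary(df):
    """Return columns with missing values, with counts and percentages, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up frame for illustration only.
demo = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": [1, 2, 3, 4],
    "c": [None, "x", "y", "z"],
})
print(missing_summary(demo))
```

Calling `missing_summary(df_311)` on the cleaned 311 data would give the same counts as `isnull().sum()` with a `percent` column next to them.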