
311 Data Notebook

Bonnie Wolfe edited this page Oct 14, 2025 · 1 revision

311 Data Notebook, Cleaning & Hosting Overview with Google Colab

Background

The 311 service request dataset is very large and challenging to host or query directly in the browser. To make this data more accessible and usable, the project processes, cleans, and splits it into manageable files, enabling users to work with the data efficiently.

Tools Used

  • Python & Jupyter Notebook – for data cleaning and transformation
  • Pandas – for processing large datasets efficiently
  • Google Colab – for running the data pipeline in-browser, processing datasets, and providing temporary access to cleaned files

Objectives

This project builds a reproducible pipeline that:

  • Downloads raw 311 Service Request data
  • Cleans the dataset according to standardized rules for consistency and quality
  • Splits the data by year, then by month, keeping each file to roughly 100 MB
  • Provides cleaned and split datasets via Colab for direct download by users (instead of publishing large datasets directly to GitHub)

Process

1. Data Acquisition

  • Downloaded 311 Service Request data from the city’s open data portal
  • Users can dynamically select the year they want to process and download in the notebook. (Refer to the notebook for the code snippet that maps each year to the corresponding CSV URL.)
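The notebook's year-to-URL mapping can be sketched roughly as below; the URLs here are placeholders, not the open data portal's real endpoints (refer to the notebook for the actual code):

```python
# Hypothetical mapping of years to CSV URLs; the real URLs live in the notebook.
CSV_URLS = {
    2023: "https://data.example.gov/311/2023.csv",  # placeholder URL
    2024: "https://data.example.gov/311/2024.csv",  # placeholder URL
}

def url_for_year(year: int) -> str:
    """Return the download URL for the user-selected year."""
    try:
        return CSV_URLS[year]
    except KeyError:
        raise ValueError(f"No dataset URL configured for {year}") from None
```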

2. Data Cleaning

High-level steps performed:

  • Removed duplicates
  • Handled missing values
  • Standardized date fields
  • Reviewed and simplified categorical variables
  • Dropped unnecessary columns
  • Converted text columns to lowercase
  • Cleaned and validated geographical data
  • Partitioned and saved cleaned dataset into monthly files
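In pandas, the cleaning rules above might look roughly like this. All column names (`created_date`, `request_type`, `latitude`, `longitude`) are illustrative assumptions; the actual schema and rules are defined in the notebook:

```python
import pandas as pd

def clean_311(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the high-level cleaning rules; column names are illustrative."""
    # Remove duplicate rows
    df = df.drop_duplicates()
    # Standardize date fields to datetime (invalid entries become NaT)
    if "created_date" in df.columns:
        df["created_date"] = pd.to_datetime(df["created_date"], errors="coerce")
    # Convert text columns to lowercase
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.lower()
    # Handle missing values: drop rows missing a hypothetical required field
    if "request_type" in df.columns:
        df = df.dropna(subset=["request_type"])
    # Validate geographical data: keep only coordinates in a plausible range
    if {"latitude", "longitude"}.issubset(df.columns):
        df = df[df["latitude"].between(-90, 90) & df["longitude"].between(-180, 180)]
    return df
```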

3. Data Splitting

  • Partitioned cleaned datasets by year, then by month
  • Organized files in a clear folder hierarchy for easy access

4. Notebook

  • A documented notebook automates:
    • Data download (with dynamic year selection)
    • Cleaning rules
    • Splitting logic
    • Saving outputs
  • Includes annotations explaining each step

How to Use Google Colab

Google Colab lets you run this project in your browser without installing anything on your computer.

Open the Colab Notebook

Steps

  1. Open the link above and sign in with your Google account.
  2. (Optional) Click the "Connect" button in the top right corner to start a runtime. Alternatively, running any cell or using "Run all" will automatically connect.
  3. Run the notebook cells:
    • To run all cells automatically, use "Runtime" → "Run all" from the top menu.
    • To run cells one by one, press Shift + Enter or click the play button next to each cell.
  4. The notebook will:
    • Download raw 311 data (you can select the year to process)
    • Apply cleaning rules
    • Split files by year and month
  5. Download the resulting files by opening the "Files" tab in the left sidebar, right-clicking the CSV files, and choosing "Download".

Note: Files exist only during your Colab session. Be sure to download anything you need before closing the session.

How to Use Locally

For instructions on cloning the repo, installing dependencies, and running the notebook on your machine, please see the project’s README.

Deliverables

  1. Annotated Jupyter Notebook with the full data pipeline
  2. Cleaned and partitioned datasets available via Colab runtime
  3. Cleaning Rules Documentation
