# Readymade Data Module Assignment
## NYC 311 Service Request Analysis

**Student Name:** Nick Pisarczyk

**U-M Unique Name:** npisar

**Research Question:** RQ1: Do different neighborhoods have distinctive patterns in their 311 service requests? Analyze the relative frequency of complaint types across neighborhoods. What patterns do you notice?

---

**BEFORE YOU START:**
1. Read the assignment instructions on Canvas carefully.
2. Make a copy of this notebook and work on your own copy.
3. Understand the dataset before you start cleaning and analyzing. NYC Open Data has a nice portal and a data dictionary for exploring their datasets.
4. NYC 311 is a **very large** dataset. Before fetching data from the portal or API, think about your research question first. Start with a small subset of the data, then increase its size as you get more comfortable with it. Generally, you do **not** need the entire dataset to answer your research question.
5. This notebook serves as a template. You can add more cells or make adjustments as you see fit. But make sure to keep all the sections mentioned in the assignment instructions. Also, format your notebook properly for better readability.

## Data Statement

Describe your data source here. Include:
- Where you obtained the data (URL or API endpoint)
```
I didn't want to download a bunch of data that wasn't necessary for the RQ I chose, so I started with the API calls. I could filter the columns of data I wanted to use, and I had a lot more granular control over the data in a way I was already pretty comfortable with. I downloaded data per borough.

Once I gathered the data from the API, I saved each borough's data to pandas df objects, put all the dfs in a dictionary to be able to access easily later, and saved each of those dfs to csv files. Then, I went forward and read each CSV data file instead of calling the API over and over.
```
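The "fetch once, then read the CSV" workflow described above can be sketched as a small cache helper. This is a minimal sketch, not the code used in this notebook: `load_or_fetch` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real API call so the example runs offline.

```python
import os
import tempfile

import pandas as pd


def load_or_fetch(path, fetch):
    """Read `path` if it exists; otherwise call `fetch()`, save the result, and return it."""
    if os.path.exists(path):
        return pd.read_csv(path, low_memory=False)
    df = fetch()
    df.to_csv(path, index=False)
    return df


# Hypothetical stand-in for the real API call, so the sketch runs without network access
def fake_fetch():
    return pd.DataFrame({"unique_key": [1, 2], "borough": ["BRONX", "BRONX"]})


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "bronx.csv")
    first = load_or_fetch(csv_path, fake_fetch)   # calls fake_fetch and writes the CSV
    second = load_or_fetch(csv_path, fake_fetch)  # reads the cached CSV instead
    print(len(first), len(second))
```

Passing the fetch step in as a function keeps the caching logic separate from the API details, so the same helper would work for each borough.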



<br><br><br>
- What subset you're analyzing (dates, geography, etc.)
```python
columns = [
    "unique_key",
    "created_date",
    "closed_date",
    "complaint_type",
    "descriptor",
    "descriptor_2",
    "borough"
]
```
```
This is the list of columns I used (see in context in Option B first cell how I filtered). I figured that to understand if 'different neighborhoods have distinctive patterns in their 311 service requests' and to 'Analyze the relative frequency of complaint types across neighborhoods', all I needed were the complaint type, some descriptors, the borough, and the basic unique key / created/closed date information. I tried to take only the necessary columns to keep the API calls short and filesizes small. I don't need most of the stuff available in the data!
```



<br><br><br>
- Any filters or sampling you applied
```python
results = client.get("erm2-nwe9", 
                    select=select_str,
                    where=f"created_date >= '2025-06-01' AND borough = '{borough}'",
                    limit=250000)
```
```
I filtered on created_date (only records created on or after 2025-06-01) and then by borough as I iterated through the list (see it in context in Option B's first cell). I set the limit to 250,000 rows per borough because that cap still lets Staten Island return its full set of roughly 100,000 records for this period. Staten Island naturally has fewer 311 datapoints because it has by far the smallest population of the five boroughs, so I thought this was a good balance: no ridiculous number of rows or crazy file sizes, but still plenty of data to analyze and generalize from.
``` 
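One way to keep the filter logic in a single place is to build the query arguments with a small helper. This is a minimal sketch: `build_query` is a hypothetical name, and the explicit `order` is an assumption that is not part of the original call. With an `order` in place, it is clearer which rows a capped request returns (here, the earliest ones in the window).

```python
from datetime import date


def build_query(borough, start, columns, limit=250_000):
    """Return keyword arguments for a sodapy `client.get` call (hypothetical helper)."""
    return {
        "select": ", ".join(columns),
        "where": f"created_date >= '{start.isoformat()}' AND borough = '{borough}'",
        "order": "created_date",  # assumption: not in the original query
        "limit": limit,
    }


params = build_query(
    "STATEN ISLAND",
    date(2025, 6, 1),
    ["unique_key", "created_date", "complaint_type", "borough"],
)
print(params["where"])

# With a configured client this would be passed along as:
# results = client.get("erm2-nwe9", **params)  # requires network access
```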


<br><br><br>
- File size/number of records
```
bronx.csv - 23.5 MB
250k rows

brooklyn.csv - 24 MB
250k rows

manhattan.csv - 24.5 MB
250k rows

queens.csv - 23.5 MB
250k rows

staten island.csv - 10.3 MB
99.7k rows
```
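Figures like these can be reproduced from the saved files rather than read off the file browser. A minimal sketch using only the standard library, assuming each CSV has a single header row; `csv_summary` is a hypothetical helper, and the temporary file below is a stand-in for the real borough files.

```python
import csv
import os
import tempfile


def csv_summary(path):
    """Return (size in MB, number of data rows) for a CSV file with one header row."""
    size_mb = os.path.getsize(path) / 1_000_000  # decimal megabytes
    with open(path, newline="", encoding="utf-8") as f:
        n_rows = max(sum(1 for _ in csv.reader(f)) - 1, 0)  # subtract the header
    return size_mb, n_rows


# Small stand-in file so the sketch runs without the real data
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["unique_key", "borough"])
        writer.writerows([[1, "BRONX"], [2, "BRONX"], [3, "QUEENS"]])
    size_mb, n_rows = csv_summary(path)
    print(f"{size_mb:.6f} MB, {n_rows} rows")
```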

## Assignment Details (Canvas)

https://umich.instructure.com/courses/825993/assignments/3016485

**Overview**
This assignment asks you to analyze NYC 311 service request data to understand patterns in urban neighborhoods. You will use Python to explore relationships between service requests and socioeconomic characteristics, practicing skills in data cleaning, exploratory analysis, visualization, and interpretation.

311 systems allow residents to report non-emergency issues like noise complaints, street conditions, and building problems. As Wang et al. (2017) demonstrated, these service request patterns can reveal distinctive “signatures” of urban neighborhoods that correlate with demographic and economic characteristics.

---

**Learning Objectives**<br>
By completing this assignment, you will:<br>
- Practice working with “readymade” administrative data from government sources
- Develop skills in data cleaning and exploratory data analysis
- Apply statistical methods to understand relationships in observational data
- Create effective visualizations to communicate data patterns
- Consider methodological limitations and ethical implications of using administrative data for research

---

**Data**

**NYC Open Data - 311 Service Requests**

You will use NYC’s publicly available 311 service request data.

Recommended approach: Explore the dataset using the NYC Open Data online portal and the data dictionary. The full dataset is very large (40+ million records), so filtering is essential. For instance, one year of data was around 2.6 GB. 

Based on your research question, identify an appropriate subset of the data (e.g., one year of data for specific boroughs or zip codes), and use the Socrata API or download it directly. You can find more information about the API and its documentation on the NYC Open Data website.

---

**Research Questions**
Choose ONE of the following research questions to investigate:

- Option 1: Neighborhood Service Request Signatures
<br>Do different neighborhoods have distinctive patterns in their 311 service requests? Analyze the relative frequency of complaint types across neighborhoods. What patterns do you notice?

- Option 2: Temporal Patterns in Service Requests
<br>How do 311 service requests vary by time of day, day of week, or season? Are there particular complaint types that show strong temporal patterns? What might explain these patterns?

- Option 3: Response Time Disparities
<br>Do response times for 311 requests differ across neighborhoods or complaint types? Calculate the time between request creation and closure, and examine whether there are systematic differences by location or complaint category.

- Option 4: Complaint Type Evolution
<br>How have the types and volumes of 311 requests changed over time (comparing multiple years)? What might explain these patterns?

---

**Assignment Requirements**<br>
Your submission should include two components. Both should be included in a single Jupyter Notebook.
<br>
1. Data Analysis<br>
Your notebook should include:

- Data loading and initial exploration: Load the data, explain how you retrieved it and why you chose the specific subset, examine its structure, and identify relevant variables
- Data cleaning: Handle missing values, filter to relevant records, create derived variables as needed
- Exploratory analysis: Use descriptive statistics and visualizations to understand patterns
- Focused analysis: Apply appropriate statistical methods to address your research question
- Visualizations: Create at least 3 meaningful visualizations (e.g., time series, bar charts, heatmaps, scatter plots)
- Documentation: Include markdown cells explaining your approach, interpreting results, and noting limitations
- Technical expectations: Use numpy/pandas for data manipulation, matplotlib/seaborn for visualization, and appropriate statistical libraries (scipy, statsmodels, etc.) as needed.

2. Written Summary<br>
Briefly report on the following. Your notebook should have a section for each bullet point.

- Research question and motivation: What is your research question? Why is this question interesting? What might we learn?
- Methods: Describe your data source, cleaning steps, and analytical approach
- Findings: Summarize key patterns and statistical results (include 2-3 key visualizations)
- Limitations: What are the methodological limitations? What can and cannot be concluded from this analysis?
- Ethical considerations: Reflect on the use of 311 data for research. Who is represented? Who might be excluded? What are potential concerns about using this data?
 
---

**Assignment Submission**

Please upload your Jupyter Notebook file to Canvas by the deadline: A single .ipynb file that includes all your work. 

Before submitting the file, ensure that your notebook is properly running on your machine. Ensure your notebook contains all your responses, plots, all programming parts of this assignment and is properly formatted with headings, explanations and code comments.

If you use Generative AI tools in any part of your assignment, you need to follow the AI policy in our syllabus and document your AI use at the end of your notebook.

## 1. Setup and Data Loading

In [2]:
# Import libraries
# You can import any libraries you may need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sodapy import Socrata

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

In [3]:
# save to csv helper function
def df_to_csv(d):
    for borough, df in d.items():
        df.to_csv(f"../data/readymade_module_data/{borough}.csv", index=False)

In [4]:
# helper f for print output
def spacer_top(i=50):
    print(f"{'='*i}")
    print(f"{'='*i}")

def spacer_bottom(i=50):
    print(f"{'='*i}")
    print(f"{'='*i}")
    print(f"\n\n\n\n")

### Option A: Load from downloaded CSV file

**Note**: If you are running the notebook with a Colab kernel, you **cannot** directly import the data from your own laptop. Please see the class repo README files for more details.

In [7]:
# Load your downloaded data

# df setup
# define the 5 NYC boroughs (neighborhoods)
boroughs = [
    "BRONX",
    "BROOKLYN",
    "MANHATTAN",
    "QUEENS",
    "STATEN ISLAND"
]

dfs = {}

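# add each to dict, parsing the date columns back into datetimes
# (without parse_dates, read_csv loads them as plain strings)
for borough in boroughs:
    dfs[borough.lower()] = pd.read_csv(f'../data/readymade_module_data/{borough.lower()}.csv',
                    parse_dates=['created_date', 'closed_date'],
                    low_memory=False)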
print(f"dfs keys is {dfs.keys()}")

dfs keys is dict_keys(['bronx', 'brooklyn', 'manhattan', 'queens', 'staten island'])


### Option B: Load from Socrata API (recommended for smaller datasets)

In [None]:
# """
# Should only have to do this once. Created this code to gather data with the columns I wanted, and then saved them to CSVs for easy access.
# """



# client = Socrata("data.cityofnewyork.us", None)
# client.timeout = 60

# # specify columns to reduce runtimes and get only pertinent data
# columns = [
#     "unique_key",
#     "created_date",
#     "closed_date",
#     "complaint_type",
#     "descriptor",
#     "descriptor_2",
#     "borough"
# ]
# select_str = ', '.join(columns)

# # df setup
# # define the 5 NYC boroughs (neighborhoods)
# boroughs = [
#     "BRONX",
#     "BROOKLYN",
#     "MANHATTAN",
#     "QUEENS",
#     "STATEN ISLAND"
# ]

# dfs = {}

# # find data for each borough and create 5 separate dfs
# for borough in boroughs:
#     print(f"Gathering data for borough: {borough}...")
#     results = client.get("erm2-nwe9", 
#                         select=select_str,
#                         where=f"created_date >= '2025-06-01' AND borough = '{borough}'",
#                         limit=250000)
#     dfs[borough.lower()] = pd.DataFrame.from_records(results)

#     print(f"Converting time columns...")
#     # Convert date columns
#     dfs[borough.lower()]['created_date'] = pd.to_datetime(dfs[borough.lower()]['created_date'])
#     dfs[borough.lower()]['closed_date'] = pd.to_datetime(dfs[borough.lower()]['closed_date'])

#     if borough != "STATEN ISLAND":
#         print(f"Complete! Next borough...\n\n")
#     else:
#         print(f"Complete for all boroughs!")
    
# print(f"dfs dict is:\n{dfs}")




Gathering data for borough: BRONX...
Converting time columns...
Complete! Next borough...


Gathering data for borough: BROOKLYN...
Converting time columns...
Complete! Next borough...


Gathering data for borough: MANHATTAN...
Converting time columns...
Complete! Next borough...


Gathering data for borough: QUEENS...
Converting time columns...
Complete! Next borough...


Gathering data for borough: STATEN ISLAND...
Converting time columns...
Complete for all boroughs!
dfs dict is:
{'bronx':        unique_key        created_date           complaint_type  \
0        67654136 2026-01-28 18:19:00            PAINT/PLASTER   
1        67654148 2026-01-28 11:54:43            PAINT/PLASTER   
2        65150823 2025-06-03 06:17:00              DOOR/WINDOW   
3        65150833 2025-06-03 16:08:40          FLOORING/STAIRS   
4        65152252 2025-06-03 20:31:15               WATER LEAK   
...           ...                 ...                      ...   
249995   65500182 2025-07-08 08:07:4

In [None]:
# """
# Like last cell, should only have to do this once. Created this code to save gathered data to CSVs for easy access, not needed once CSVs already on machine.
# """

# # save to CSV for easier access
# df_to_csv(dfs)

## 2. Data Description

You can describe the data in many ways. Here are some baseline requirements:
- Display basic information about the dataset (what are the relevant variables? What are their types? How many observations are there?)
- Conduct summary statistics of the relevant variables
- Check for missing values
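
The baseline checks listed above can be collected into one table per borough. This is a minimal sketch, assuming a dictionary of DataFrames shaped like `dfs` with `complaint_type` and `closed_date` columns; `describe_frames` is a hypothetical helper and the toy frames stand in for the real data.

```python
import pandas as pd


def describe_frames(frames):
    """One row per borough: record count, distinct complaint types, % of rows missing closed_date."""
    rows = []
    for name, frame in frames.items():
        rows.append({
            "borough": name,
            "n_records": len(frame),
            "n_complaint_types": frame["complaint_type"].nunique(),
            "pct_missing_closed": frame["closed_date"].isna().mean() * 100,
        })
    return pd.DataFrame(rows).set_index("borough")


# Toy stand-in for the real per-borough frames
toy = {
    "bronx": pd.DataFrame({
        "complaint_type": ["PLUMBING", "GENERAL", "PLUMBING"],
        "closed_date": ["2026-01-26", None, "2026-01-26"],
    }),
    "queens": pd.DataFrame({
        "complaint_type": ["Noise", "Noise"],
        "closed_date": ["2026-01-09", "2026-01-10"],
    }),
}
print(describe_frames(toy))
```

A single summary table like this makes it easier to compare boroughs side by side than scanning five separate `head()` outputs.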

In [None]:
# You can have as many cells as you want
for borough, df in dfs.items():
    spacer_top(75)
    
    print(f"{borough.upper()} DATA HEAD")
    print(df.head())
    
    spacer_bottom(75)

### NEED TO DESCRIBE DATA MORE!

BRONX DATA HEAD
  unique_key        created_date          complaint_type        descriptor  \
0   67517646 2026-01-16 15:35:05           PAINT/PLASTER      WINDOW/FRAME   
1   67353979 2026-01-01 22:57:00  Street Light Condition  Street Light Out   
2   67369613 2026-01-02 08:46:05    UNSANITARY CONDITION             PESTS   
3   67373383 2026-01-03 18:41:56                 GENERAL       COOKING GAS   
4   67373683 2026-01-03 18:41:56                PLUMBING        BASIN/SINK   

                  descriptor_2 borough         closed_date  
0     PEELING OR FLAKING PAINT   BRONX                 NaT  
1  Location Type: Intersection   BRONX 2026-01-06 09:12:00  
2                        OTHER   BRONX 2026-02-02 09:52:00  
3                     SHUT-OFF   BRONX 2026-01-26 09:32:02  
4      SINK DETACHED FROM WALL   BRONX 2026-01-26 09:32:02  





BROOKLYN DATA HEAD
  unique_key        created_date         closed_date  complaint_type  \
0   67392685 2026-01-05 23:57:47 2026-01-09 02:04:24 

## 3. Data Cleaning

Document your cleaning decisions and rationale here

In [6]:
# Example cleaning steps (customize based on your needs)

# Combine the per-borough DataFrames from `dfs` so the examples below
# have a single `df` to work with
df = pd.concat(dfs.values(), ignore_index=True)

# 1. Remove rows with missing essential data
# df_clean = df.dropna(subset=['created_date', 'complaint_type'])

# 2. Filter to specific time period if needed
# df_clean = df_clean[(df_clean['created_date'] >= '2023-01-01') &
#                     (df_clean['created_date'] < '2024-01-01')]

# 3. Create derived variables
# Example: Calculate response time
# df_clean['response_time_hours'] = (
#     (df_clean['closed_date'] - df_clean['created_date']).dt.total_seconds() / 3600
# )

# Example: Extract temporal features
# df_clean['hour'] = df_clean['created_date'].dt.hour
# df_clean['day_of_week'] = df_clean['created_date'].dt.dayofweek
# df_clean['month'] = df_clean['created_date'].dt.month

print(f"Original dataset: {len(df)} records")
# print(f"Cleaned dataset: {len(df_clean)} records")
# print(f"Removed: {len(df) - len(df_clean)} records ({((len(df) - len(df_clean))/len(df)*100):.1f}%)")
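
The head output in the Data Description section shows complaint types in mixed case (`PAINT/PLASTER` next to `Street Light Condition`), so labels that differ only in case or spacing could be counted as separate categories. A minimal sketch of one possible cleaning step, assuming upper-casing is an acceptable normalization; `clean_requests` is a hypothetical helper.

```python
import pandas as pd


def clean_requests(df):
    """Drop rows missing the key fields and normalize complaint_type labels."""
    out = df.dropna(subset=["complaint_type", "borough"]).copy()
    # Strip stray whitespace and unify case so identical labels group together
    out["complaint_type"] = out["complaint_type"].str.strip().str.upper()
    return out


# Toy stand-in for the real data
toy = pd.DataFrame({
    "complaint_type": ["Street Light Condition", "STREET LIGHT CONDITION ", None, "PLUMBING"],
    "borough": ["BRONX", "BRONX", "BRONX", "QUEENS"],
})
cleaned = clean_requests(toy)
print(cleaned["complaint_type"].value_counts())
```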

## 4. Exploratory Data Analysis

Add narrative about what you're exploring, why and what you've found

In [None]:
# Example: Most common complaint types
# complaint_counts = df_clean['complaint_type'].value_counts().head(15)
# plt.figure(figsize=(12, 6))
# complaint_counts.plot(kind='barh')
# plt.xlabel('Number of Requests')
# plt.ylabel('Complaint Type')
# plt.title('Top 15 Most Common 311 Complaint Types')
# plt.tight_layout()
# plt.show()

In [None]:
# Example: Temporal patterns
# requests_by_month = df_clean.groupby('month').size()
# plt.figure(figsize=(12, 6))
# requests_by_month.plot(kind='bar')
# plt.xlabel('Month')
# plt.ylabel('Number of Requests')
# plt.title('311 Requests by Month')
# plt.xticks(rotation=0)
# plt.tight_layout()
# plt.show()

## 5. Research Question Analysis

This is the core of your assignment. Document your analytical approach here. You can add any cells if you see fit.

In [None]:
# Your focused analysis goes here
# This will vary significantly based on your research question. You can organize your analysis as you like. You can have as many cells as you want.
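
For RQ1, relative frequency matters more than raw counts because the boroughs contribute different numbers of rows (Staten Island has far fewer than the capped 250k of the others). One way to get each complaint type's share within a borough is a row-normalized cross-tabulation. This is a minimal sketch on toy data, assuming a combined frame with `borough` and `complaint_type` columns; `complaint_shares` is a hypothetical helper.

```python
import pandas as pd


def complaint_shares(df, top_n=10):
    """Share of each borough's requests per complaint type, for the top_n types overall."""
    # Each row sums to 1 across ALL complaint types, so shares are comparable across boroughs
    shares = pd.crosstab(df["borough"], df["complaint_type"], normalize="index")
    top = df["complaint_type"].value_counts().head(top_n).index
    return shares.loc[:, list(top)]


# Toy stand-in for the real data
toy = pd.DataFrame({
    "borough": ["BRONX"] * 4 + ["MANHATTAN"] * 4,
    "complaint_type": ["HEAT", "HEAT", "NOISE", "PLUMBING",
                       "NOISE", "NOISE", "NOISE", "HEAT"],
})
print(complaint_shares(toy, top_n=2))
```

Normalizing before selecting the top types keeps each share relative to all of a borough's requests, not just the displayed columns.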

### Statistical Testing (if applicable)

In [None]:
# Example: Statistical tests
# from scipy import stats

# # t-test example
# group1 = df_clean.loc[df_clean['borough'] == 'MANHATTAN', 'response_time_hours'].dropna()
# group2 = df_clean.loc[df_clean['borough'] == 'BRONX', 'response_time_hours'].dropna()
# t_stat, p_value = stats.ttest_ind(group1, group2)
# print(f"t-statistic: {t_stat:.3f}")
# print(f"p-value: {p_value:.3f}")
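
The t-test example above compares a numeric variable between two groups, which fits response times better than RQ1. For a categorical-by-categorical question like complaint type versus borough, one option is a chi-square test of independence. This is a minimal sketch on toy data, not the analysis used in this notebook; `borough_complaint_test` is a hypothetical helper. With hundreds of thousands of rows almost any difference will be statistically significant, so the sketch also reports Cramér's V as an effect size.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def borough_complaint_test(df):
    """Chi-square test of independence between borough and complaint type, plus Cramér's V."""
    table = pd.crosstab(df["borough"], df["complaint_type"])
    chi2, p_value, dof, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    # Cramér's V ranges from 0 (no association) to 1 (perfect association)
    cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
    return chi2, p_value, dof, cramers_v


# Toy stand-in for the real data
toy = pd.DataFrame({
    "borough": ["BRONX"] * 40 + ["MANHATTAN"] * 40,
    "complaint_type": ["HEAT"] * 30 + ["NOISE"] * 10 + ["HEAT"] * 10 + ["NOISE"] * 30,
})
chi2, p_value, dof, v = borough_complaint_test(toy)
print(f"chi2={chi2:.2f}, p={p_value:.4g}, dof={dof}, Cramér's V={v:.2f}")
```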

## Key Visualizations

Create at least 3 polished visualizations that answer your research question

In [None]:
# Visualization 1
# Your code here

Interpretation of Visualization 1

In [None]:
# Visualization 2
# Your code here

Interpretation of Visualization 2

In [None]:
# Visualization 3
# Your code here

Interpretation of Visualization 3

## Written Summary

Summarize your key findings here. What patterns did you discover? What can you conclude?

### Research Question and Motivation

- Why is this question interesting? 
- What might we learn?

### Methods

Describe your data source, cleaning steps, and analytical approach

### Findings

Summarize key patterns and statistical results (refer to your key visualizations)

### Limitations

Discuss methodological limitations. What are the potential biases in 311 data? What alternative explanations exist for your findings? What can and cannot be concluded from this analysis?

### Ethical Considerations

Reflect on the ethical implications of using 311 data:
- Who is represented in this data? Who might be underrepresented?
- What are potential privacy concerns?
- How might this analysis be used or misused?
- What are the implications for equity and justice?

## AI Appendix (if applicable)

If you used AI during this assignment, explain:
1. what part of the work it was used for; 
2. what AI tools you used; 
3. the prompts you used; 
4. how you analyzed the AI work for accuracy; and, 
5. steps you took to rework and revise your final documents so that they were both factually accurate and reflected your own voice and style.


## Submission

Submit your assignment as an .ipynb file. Make sure to double-check it against the assignment instructions on Canvas.