# Austin Traffic Data Analysis: Part 1 - Fundamentals

**Exercise:** [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kks32-courses/ce311k/blob/main/notebooks/lectures/00_intro/01_austin_traffic_fundamentals.ipynb)
**Solution:** [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kks32-courses/ce311k/blob/main/notebooks/lectures/00_intro/01_austin_traffic_fundamentals_solution.ipynb)

Austin's real-time traffic incident dataset contains over 400,000 reports. During the 2020 pandemic, traffic patterns changed dramatically, but not in the ways you might expect.

We'll use computational methods to uncover these hidden patterns and learn what they tell us about urban infrastructure resilience.

## Variables and Data Discovery

### Variables: Storing Information

Variables are handles for values. Before creating any, download the dataset used throughout this notebook:

In [None]:
!wget https://github.com/kks32-courses/ce311k/raw/refs/heads/main/notebooks/lectures/00_intro/Real-Time_Traffic_Incident_Reports_20250805.csv

When we write an assignment such as `incident_count = 432444`, the expression on the right (432444) is evaluated and stored in the variable on the left (incident_count). This follows the mathematical notation: variable ← value

In engineering, we need variables to track:
- Measurements that change over time
- Parameters we want to adjust
- Results we need to reference later
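As a minimal sketch of assignment, the snippet below stores the incident count mentioned above and derives a second value from it. The name `years_covered` and its value are hypothetical, chosen only to have something to compute with.

```python
# Assignment: evaluate the right-hand side, then bind the result to the name on the left
incident_count = 432444

# Hypothetical value for illustration only, not taken from the dataset
years_covered = 8

# Variables can be reused in later expressions
average_per_year = incident_count / years_covered

# Reassignment replaces the stored value
incident_count = incident_count + 1

print(f"Incidents: {incident_count:,}")
print(f"Average per year: {average_per_year:,.1f}")
```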

### Data Types Define What We Can Do

Objects have types that determine valid operations:
- Can multiply two numbers
- Cannot add a number to text (Python raises a `TypeError`)
- Each type serves a specific purpose
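The sketch below shows one value of each basic type and how the type limits what you can do with it. The values come from the example incident used in this notebook, except `is_highway`, which is a hypothetical flag. Note that Python does allow an integer times a string (it repeats the text), so the sketch uses addition to show an invalid mix of number and text.

```python
incident_count = 432444          # int: whole number
latitude = 30.32358              # float: decimal number
issue_type = "Stalled Vehicle"   # str: text
is_highway = True                # bool: True or False (hypothetical flag)

# type() reports the type of each object
print(type(incident_count).__name__, type(latitude).__name__,
      type(issue_type).__name__, type(is_highway).__name__)

# Two numbers can be multiplied
doubled = incident_count * 2

# Adding a number to text is not a valid operation
try:
    result = incident_count + issue_type
except TypeError as err:
    result = None
    print(f"TypeError: {err}")

# Convert explicitly when text is what you want
label = str(incident_count) + " incidents"
print(label)
```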

In [None]:
# Integer: whole numbers for counts


# Float: decimals for coordinates  

# String: text for categories

# Boolean: True/False for conditions


In [None]:
# Operations with variables


### Lists: Storing Multiple Values

A list is an ordered collection of values. Lists can contain any type of data and can be modified.
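A short sketch of creating, indexing, and modifying lists follows. The particular years and incident labels are illustrative choices, not the notebook's required answers.

```python
# Years to analyze (hypothetical selection for illustration)
years = [2019, 2020, 2021]

# Lists can hold any type, here strings
issue_types = ["Stalled Vehicle", "Collision", "Traffic Hazard"]

# Indexing starts at 0; negative indices count from the end
first_year = years[0]
pandemic_year = years[1]
last_type = issue_types[-1]

# Lists are mutable: append adds an element, assignment replaces one
years.append(2022)
issue_types[1] = "Crash Urgent"

print(f"First year: {first_year}")
print(f"Pandemic year: {pandemic_year}")
print(f"Number of years: {len(years)}")
```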

In [None]:
# List of years to analyze


# List of incident types


# Access elements by index (starts at 0)


print(f"First year: {first_year}")
print(f"Pandemic year: {pandemic_year}")

In [None]:
# Lists can be modified

# Get length of list


### CSV: Comma Separated Values

A CSV file is like a spreadsheet stored as text:
- Each row = one traffic incident  
- Each column = one attribute (date, location, type)
- Commas separate the values

Example row:
```
03/06/2024 01:29:39 AM,Stalled Vehicle,30.32358,-97.705874,IH 35,AUSTIN PD
```
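To see how such a row maps onto named attributes, here is a sketch using Python's standard `csv` module. The header names are assumptions made for this example; the real file defines its own column names.

```python
import csv
import io

# Column names are assumptions for this sketch; the real file has its own header
header = "published_date,issue_reported,latitude,longitude,address,agency"
row = "03/06/2024 01:29:39 AM,Stalled Vehicle,30.32358,-97.705874,IH 35,AUSTIN PD"

# DictReader uses the first line as the header and yields one dict per row
reader = csv.DictReader(io.StringIO(header + "\n" + row))
incident = next(reader)

# Every CSV value arrives as text, so numbers must be converted explicitly
lat = float(incident["latitude"])
lon = float(incident["longitude"])

print(incident["issue_reported"], lat, lon)
```

Reading values by name rather than by position keeps the code readable when a file has many columns.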

### DataFrames: Smart Tables for Engineering Data

With 400,000+ incidents and 10+ attributes each, we need structure. A DataFrame is a 2D labeled data structure that:
- Handles different data types per column
- Provides powerful filtering and analysis methods
- Scales to millions of rows efficiently
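The following sketch builds a tiny DataFrame by hand so the basic exploration calls can be tried without the full file. The three rows are made up and the column names are assumptions.

```python
import pandas as pd

# Three made-up incidents for illustration; column names are assumptions
df_demo = pd.DataFrame({
    "issue_reported": ["Stalled Vehicle", "Collision", "Traffic Hazard"],
    "latitude": [30.32358, 30.26720, 0.0],
    "longitude": [-97.705874, -97.74310, 0.0],
})

# shape is a (rows, columns) tuple
num_incidents, num_attributes = df_demo.shape

print(f"Dataset: {num_incidents:,} incidents")
print(f"Attributes per incident: {num_attributes}")
print(list(df_demo.columns))   # column names
print(df_demo.dtypes)          # one data type per column
print(df_demo.head(2))         # first two rows
```

For the real file, `pd.read_csv` with the path of the downloaded CSV returns a DataFrame that supports the same calls.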

In [None]:

# Load Austin traffic incidents


In [None]:
# Basic exploration



print(f"Dataset: {num_incidents:,} incidents")
print(f"Attributes per incident: {num_attributes}")

In [None]:
# Display column names


In [None]:
# Show first few incidents


### Extracting Time Information

Traffic patterns vary by time. We need to extract:
- Year: for pandemic comparison
- Month: for seasonal patterns
- Hour: for daily patterns
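One way to extract these components is sketched below with `pd.to_datetime` and the `.dt` accessor. The timestamps are made up but follow the layout of the example row, and the column name `published_date` is an assumption.

```python
import pandas as pd

# Made-up timestamps in the same layout as the example row; column name is an assumption
df_time = pd.DataFrame({
    "published_date": [
        "03/06/2024 01:29:39 AM",
        "04/15/2020 05:45:00 PM",
        "10/31/2019 11:05:10 PM",
    ]
})

# Parse text into datetime values; an explicit format avoids month/day ambiguity
df_time["timestamp"] = pd.to_datetime(
    df_time["published_date"], format="%m/%d/%Y %I:%M:%S %p"
)

# The .dt accessor exposes the components of each datetime
df_time["year"] = df_time["timestamp"].dt.year
df_time["month"] = df_time["timestamp"].dt.month
df_time["hour"] = df_time["timestamp"].dt.hour

# value_counts tallies rows per distinct value; sort_index orders by year
incidents_per_year = df_time["year"].value_counts().sort_index()

print(df_time[["year", "month", "hour"]])
print(incidents_per_year)
```

The hour is reported on a 24-hour clock, so 05:45 PM becomes 17.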

In [None]:
# Convert string dates to datetime objects


# Extract components


In [None]:
# Count incidents per year


In [None]:
# Most common incident types


### Plotting All Incidents: What Could Go Wrong?

Let's visualize all traffic incidents on a map. Since we have latitude and longitude for each incident, this should be straightforward...

### Data Quality Issues Revealed

The plot shows:
1. Points at (0, 0) - missing coordinates
2. Points far from Austin - data entry errors
3. Extreme outliers at -100 longitude

This is real-world data: messy and requiring cleaning before analysis.

## Conditional Statements for Data Cleaning

### Boolean Values: Making Decisions

Booleans are True or False values used for decision-making. They result from comparisons and control program flow.

In [None]:
# Simple boolean examples


# Check data quality

print(f"Are coordinates missing? {is_missing_data}")

### Comparison Operators Create Booleans

```
==  Equal to
!=  Not equal to
<   Less than  
>   Greater than
<=  Less than or equal
>=  Greater than or equal
```

Combine with:
- `and` - Both must be True
- `or` - Either can be True
- `not` - Negates the condition
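A small sketch of these operators applied to a single coordinate pair follows. The coordinates are those of the example CSV row, and the range limits match the approximate Austin boundaries defined later in this notebook. Treating (0, 0) as "missing" is the convention this notebook uses for bad records.

```python
# Coordinates from the example CSV row
latitude = 30.32358
longitude = -97.705874

# Comparisons evaluate to True or False
is_north_of_30 = latitude > 30.0

# (0, 0) is treated here as the marker for missing coordinates
is_missing_data = (latitude == 0.0) and (longitude == 0.0)

# Chained comparisons test a range in one expression
lat_in_range = 30.0 <= latitude <= 30.6
lon_in_range = -98.0 <= longitude <= -97.0

# Combine conditions with and, or, not
is_in_austin = lat_in_range and lon_in_range
needs_review = is_missing_data or not is_in_austin

print(f"Are coordinates missing? {is_missing_data}")
print(f"Inside bounding box? {is_in_austin}")
print(f"Needs review? {needs_review}")
```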

### Creating Boolean Masks for Austin Area

In [None]:
# Define Austin's approximate boundaries
austin_lat_min = 30.0
austin_lat_max = 30.6
austin_lon_min = -98.0
austin_lon_max = -97.0

# Create boolean conditions for each coordinate

# Check how many are valid



print(f"Valid latitudes: {num_valid_lat:,}")
print(f"Valid longitudes: {num_valid_lon:,}")

### Boolean Indexing: Selecting Valid Data

We use boolean arrays to select rows that meet our criteria. `df[boolean_condition]` returns only the rows where the condition is True.
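The sketch below applies this idea to a three-row DataFrame of made-up coordinates, using the Austin boundaries defined above. The name `austin_demo` is hypothetical and stands in for the filtered result.

```python
import pandas as pd

# Made-up coordinates: one valid, one at (0, 0), one far from Austin
df_demo = pd.DataFrame({
    "Latitude": [30.32358, 0.0, 41.85],
    "Longitude": [-97.705874, 0.0, -87.65],
})

austin_lat_min, austin_lat_max = 30.0, 30.6
austin_lon_min, austin_lon_max = -98.0, -97.0

# Each comparison yields a boolean Series with one True/False per row
valid_lat = (df_demo["Latitude"] >= austin_lat_min) & (df_demo["Latitude"] <= austin_lat_max)
valid_lon = (df_demo["Longitude"] >= austin_lon_min) & (df_demo["Longitude"] <= austin_lon_max)

# Summing a boolean Series counts the True values
num_valid_lat = valid_lat.sum()
num_valid_lon = valid_lon.sum()

# Use & (not `and`) to combine element-wise, then index with the mask
austin_demo = df_demo[valid_lat & valid_lon]

print(f"Original: {len(df_demo):,} incidents")
print(f"Austin only: {len(austin_demo):,} incidents")
```

The parentheses around each comparison matter: `&` binds more tightly than `>=` and `<=`, so omitting them changes the meaning of the expression.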

In [None]:
# Combine conditions

# Filter to Austin-only data


# Compare before and after
print(f"Original: {len(df):,} incidents")
print(f"Austin only: {len(austin_data):,} incidents")
print(f"Removed: {len(df) - len(austin_data):,} bad records")

In [None]:
# Visualize cleaned data
import matplotlib.pyplot as plt  # no effect if already imported in an earlier cell

plt.figure(figsize=(8, 6))
plt.scatter(austin_data.Longitude, austin_data.Latitude, s=0.1, alpha=0.5)
plt.xlabel('Longitude')
plt.ylabel('Latitude')  
plt.title('Austin Traffic Incidents - After Cleaning')
plt.xlim(-97.9, -97.5)
plt.ylim(30.1, 30.5)
plt.show()

### Control Statements: Program Decision Making

Programs need to make decisions based on data:

```python
if condition:
    # Do this if True
elif other_condition:
    # Do this if first False but this True
else:
    # Do this if all False
```
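As a concrete instance of this structure, the sketch below classifies the change between two yearly counts. The counts and the 10% threshold are hypothetical and are not results from the dataset.

```python
# Hypothetical yearly counts for illustration only, not real dataset totals
incidents_2019 = 1200
incidents_2020 = 900

# Percent change relative to the earlier year
change_pct = (incidents_2020 - incidents_2019) / incidents_2019 * 100

# Only the first branch whose condition is True runs
if change_pct <= -10:
    impact = "significant decrease"
elif change_pct >= 10:
    impact = "significant increase"
else:
    impact = "little change"

print(f"Change from 2019 to 2020: {change_pct:.1f}% ({impact})")
```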

In [None]:
# Calculate yearly incidents

# Analyze pandemic impact


### Filtering by Incident Type

In [None]:
# Different incident types


# Apply filters

print(f"Collisions: {len(collisions):,}")
print(f"Crashes: {len(crashes):,}")
print(f"Hazards: {len(hazards):,}")

### Comparative Incident Maps

In [None]:
# Create figure with 3 subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot collisions
axes[0].scatter(collisions.Longitude, collisions.Latitude, s=0.05, alpha=0.5, c='red')
axes[0].set_title('Collisions')
axes[0].set_xlim(-97.9, -97.5)
axes[0].set_ylim(30.1, 30.5)
axes[0].set_xlabel('Longitude')
axes[0].set_ylabel('Latitude')

# Plot crashes  
axes[1].scatter(crashes.Longitude, crashes.Latitude, s=0.05, alpha=0.5, c='orange')
axes[1].set_title('Crash Urgent')
axes[1].set_xlim(-97.9, -97.5)
axes[1].set_ylim(30.1, 30.5)
axes[1].set_xlabel('Longitude')
axes[1].set_ylabel('Latitude')

# Plot hazards
axes[2].scatter(hazards.Longitude, hazards.Latitude, s=0.05, alpha=0.5, c='blue')
axes[2].set_title('Traffic Hazards')
axes[2].set_xlim(-97.9, -97.5)
axes[2].set_ylim(30.1, 30.5)
axes[2].set_xlabel('Longitude')
axes[2].set_ylabel('Latitude')

plt.tight_layout()
plt.show()

### Spatial Patterns Revealed

Different incident types show distinct geographic patterns:
- Collisions cluster at major intersections
- Crashes concentrate on highways
- Hazards spread throughout residential areas

## Loops for Automation

### The Problem with Manual Analysis

Analyzing 2019, 2020, and 2021 separately by hand means:
- Writing the filtering code three times
- Risking copy-paste errors
- Making it hard to extend the analysis to more years

Loops automate repetitive tasks: write the code once, run it many times.

### For Loop Structure

```python
for variable in collection:
    # Code block executes for each item
    # variable takes each value in turn
```
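Here is a minimal sketch of a loop that counts incidents per year and stores each result in a list. The sample of incident years is made up to keep the example self-contained.

```python
# Year of each incident in a tiny made-up sample
incident_years = [2019, 2020, 2019, 2021, 2020, 2019]
years = [2019, 2020, 2021]

yearly_counts = []  # results collected in the same order as `years`

for year in years:
    # list.count returns how many elements equal `year`
    count = incident_years.count(year)
    yearly_counts.append(count)
    print(f"{year}: {count} incidents")

print(yearly_counts)
```

Extending the analysis to another year only requires adding that year to `years`; the loop body stays unchanged.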

In [None]:
# Manual approach (tedious)





In [None]:
# Loop approach (scalable)


### Storing Results from Loops

In [None]:
# Using lists to store results

# Fix the x-axis


### Tuples: Immutable Sequences

Tuples are like lists but cannot be modified after creation. Use them when data shouldn't change.
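The sketch below creates a tuple, unpacks it, and shows what happens on an attempted modification. The coordinates are approximate values for central Austin used for illustration; they are not taken from the dataset.

```python
# Approximate coordinates of central Austin (illustrative, not from the dataset)
austin_center = (30.2672, -97.7431)

# Unpacking assigns each element to its own variable
lat, lon = austin_center

print(f"Austin center: {austin_center}")
print(f"Latitude: {lat}, Longitude: {lon}")

# Attempting to modify a tuple raises TypeError
try:
    austin_center[0] = 31.0
    modified = True
except TypeError:
    modified = False

print(f"Tuple modified? {modified}")
```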

In [None]:
# Tuple of Austin's coordinates (shouldn't change)



print(f"Austin center: {austin_center}")
print(f"Latitude: {lat}, Longitude: {lon}")

# Tuples are immutable
# austin_center[0] = 31.0  # This would cause an error

### Dictionaries: Structured Data Storage

Dictionaries store key-value pairs, which makes them well suited to organizing related data under descriptive names.

In [None]:
# Simple dictionary for one year
year_2020_stats = {
    'total': 45321,
    'crashes': 12543,
    'peak_hour': 17,
    'worst_month': 'October'
}

print(f"2020 total incidents: {year_2020_stats['total']}")
print(f"2020 peak hour: {year_2020_stats['peak_hour']}:00")
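Dictionaries can also hold other dictionaries, which gives one set of statistics per year. The sketch below builds such a structure in a loop; the `(year, hour)` records are hypothetical and the 16:00 to 19:00 "evening" window is an assumption made for the example.

```python
# Hypothetical (year, hour) pairs standing in for real incidents
records = [(2019, 8), (2019, 17), (2019, 17), (2020, 17), (2020, 9), (2021, 18)]

yearly_stats = {}  # maps year -> dictionary of statistics

for year, hour in records:
    # Create the inner dictionary the first time a year appears
    if year not in yearly_stats:
        yearly_stats[year] = {"total": 0, "evening": 0}
    yearly_stats[year]["total"] += 1
    # Assumed evening window: 16:00 through 19:59
    if 16 <= hour <= 19:
        yearly_stats[year]["evening"] += 1

# items() yields each key together with its value
for year, stats in yearly_stats.items():
    print(f"{year}: {stats['total']} total, {stats['evening']} in the evening")
```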

In [None]:
# Why dictionaries? Store multiple statistics per year


# Display results


### Nested Loops: Multiple Dimensions

Nested loops let us analyze combinations, such as each location across each year.
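A minimal sketch of a nested loop over locations and years follows. The incident pairs are made up, and "Lamar Blvd" is a placeholder location; in the notebook the top locations would come from the dataset itself.

```python
# Hypothetical (location, year) pairs; real locations would come from the dataset
incidents = [
    ("IH 35", 2019), ("IH 35", 2020), ("IH 35", 2020),
    ("Lamar Blvd", 2019), ("Lamar Blvd", 2021),
]
locations = ["IH 35", "Lamar Blvd"]
years = [2019, 2020, 2021]

counts = {}  # maps (location, year) -> number of incidents

# The inner loop runs in full for every pass of the outer loop
for location in locations:
    for year in years:
        counts[(location, year)] = incidents.count((location, year))
        print(f"{location} in {year}: {counts[(location, year)]}")
```

With 2 locations and 3 years the inner body runs 6 times, so the work grows with the product of the two collection sizes.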

In [None]:
# Find top incident locations


# Analyze each location across years


### Loop-based Hourly Analysis

In [None]:
# Compare hourly patterns across years


### Finding Peak Incident Types by Year

In [None]:
# Track incident types across years


## Engineering Summary

Through computational analysis of Austin's traffic data, we've discovered:
1. Data quality issues requiring filtering (invalid coordinates)
2. Pandemic impact on traffic patterns
3. Spatial distribution of different incident types
4. Temporal patterns revealing peak risk hours

These insights enable data-driven infrastructure planning and resource allocation.