# Data Exploration
This notebook explores the raw parking data received from the WSU Transportation Department.
## Objectives
- Load and examine all sheets in the Excel file
- Understand data structure, columns, and data types
- Identify gaps, missing values, and discrepancies
- Analyze temporal coverage and parking lot distribution
- Document findings for preprocessing phase

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
sns.set_style('whitegrid')
print("Libraries loaded successfully")

Libraries loaded successfully


## Load Excel File and Examine Sheets
First, identify all sheets in the Excel file and get a high-level overview.

In [10]:
# Load Excel file
excel_file = pd.ExcelFile('../data/raw/Data_For_Class_Project.xlsx')
# Get sheet names
sheet_names = excel_file.sheet_names
print(f"Total sheets: {len(sheet_names)}")
print(f"\nSheet names:")
for i, name in enumerate(sheet_names, 1):
    print(f"{i}. {name}")

Total sheets: 7

Sheet names:
1. AMP1
2. AMP2
3. AMP3
4. Tickets
5. LPR_Reads_FY25
6. LPR_Reads_FY24
7. LPR_Reads_FY23


## Examine Each Sheet
Load each sheet and examine its structure, dimensions, and first few rows.

In [11]:
# Dictionary to store all dataframes
sheets_data = {}
# Load each sheet and display basic info
for sheet_name in sheet_names:
    print(f"\n{'='*50}")
    print(f"Sheet: {sheet_name}")
    print(f"{'='*50}")
    df = pd.read_excel(excel_file, sheet_name=sheet_name)
    sheets_data[sheet_name] = df
    print(f"\nShape: {df.shape[0]:,} rows * {df.shape[1]} columns")
    print(f"\nColumns: {list(df.columns)}")
    print(f"\nData types:")
    print(df.dtypes)
    print(f"\nFirst 5 rows:")
    display(df.head())
    print(f"\nBasic statistics:")
    display(df.describe())


Sheet: AMP1

Shape: 590,249 rows * 3 columns

Columns: ['Zone', 'Start_Date', 'End_Date']

Data types:
Zone                  object
Start_Date    datetime64[ns]
End_Date      datetime64[ns]
dtype: object

First 5 rows:


Unnamed: 0,Zone,Start_Date,End_Date
0,Cougar Way on Street Meters,2020-08-10 07:51:00,2020-08-10 08:04:00
1,Thatuna Rd. on Street Meters,2020-08-10 08:14:00,2020-08-10 09:13:00
2,Library Garage,2020-08-10 08:20:00,2020-08-10 09:20:00
3,Green 1 PACCAR South,2020-08-10 08:31:00,2020-08-10 09:30:00
4,Library Garage,2020-08-10 09:20:00,2020-08-10 09:35:00



Basic statistics:


Unnamed: 0,Start_Date,End_Date
count,590249,590249
mean,2022-08-07 04:01:25.417984512,2022-08-07 06:14:17.761250048
min,2020-08-10 07:51:00,2020-08-10 08:04:00
25%,2022-02-22 17:07:00,2022-02-22 19:01:00
50%,2022-10-05 15:01:00,2022-10-05 17:00:00
75%,2023-02-24 15:06:00,2023-02-24 16:59:00
max,2023-06-30 21:08:00,2023-06-30 22:52:00



Sheet: AMP2

Shape: 962,257 rows * 3 columns

Columns: ['Zone', 'Start_Date', 'End_Date']

Data types:
Zone                  object
Start_Date    datetime64[ns]
End_Date      datetime64[ns]
dtype: object

First 5 rows:


Unnamed: 0,Zone,Start_Date,End_Date
0,Regents: AMP Marked Spot,2023-07-01 08:57:00,2023-07-01 09:25:00
1,Library Garage,2023-07-01 09:02:00,2023-07-01 09:47:00
2,CUE Garage,2023-07-01 11:19:00,2023-07-01 13:18:00
3,CUE Garage,2023-07-01 12:13:00,2023-07-01 15:13:00
4,Library Garage,2023-07-01 13:37:00,2023-07-01 14:37:00



Basic statistics:


Unnamed: 0,Start_Date,End_Date
count,962257,962257
mean,2024-07-01 19:53:28.531483648,2024-07-01 21:55:22.440492032
min,2023-07-01 08:57:00,2023-07-01 09:25:00
25%,2024-01-12 09:13:00,2024-01-12 11:13:00
50%,2024-07-15 13:19:00,2024-07-15 16:07:00
75%,2025-01-09 19:29:00,2025-01-09 21:09:00
max,2025-06-30 20:11:00,2025-06-30 21:41:00



Sheet: AMP3

Shape: 150,898 rows * 3 columns

Columns: ['Zone', 'Start_Date', 'End_Date']

Data types:
Zone                  object
Start_Date    datetime64[ns]
End_Date      datetime64[ns]
dtype: object

First 5 rows:


Unnamed: 0,Zone,Start_Date,End_Date
0,Ferdinand's Ice Cream Shoppe Parking,2025-07-01 07:00:00,2025-07-01 15:00:00
1,Cougar Way on Street Meters,2025-07-01 05:19:00,2025-07-01 07:18:00
2,Yellow 1 IPF Lot,2025-07-01 07:00:00,2025-07-01 17:59:00
3,Green 5 South Beasley,2025-07-01 07:00:00,2025-07-01 18:00:00
4,Student Rec Center,2025-07-01 06:00:00,2025-07-01 07:45:00



Basic statistics:


Unnamed: 0,Start_Date,End_Date
count,150898,150898
mean,2025-09-17 11:47:12.711102720,2025-09-17 14:08:04.354729216
min,2025-07-01 05:19:00,2025-07-01 07:10:00
25%,2025-08-27 18:32:15,2025-08-27 20:22:45
50%,2025-09-19 14:12:00,2025-09-19 16:15:00
75%,2025-10-12 10:56:30,2025-10-12 12:52:30
max,2025-10-31 14:57:00,2025-11-02 12:38:00



Sheet: Tickets

Shape: 192,630 rows * 2 columns

Columns: ['Issue Date / Time', 'Loc']

Data types:
Issue Date / Time    datetime64[ns]
Loc                          object
dtype: object

First 5 rows:


Unnamed: 0,Issue Date / Time,Loc
0,2018-07-02 08:13:00,075 VETERANS MALL
1,2018-07-02 08:17:00,075 VETERANS MALL
2,2018-07-02 08:26:00,025 SCOTT-COMAN
3,2018-07-02 08:51:00,079 S FAIRWAYTENNIS CRTS
4,2018-07-02 09:02:00,093 STEPHENSON N-W



Basic statistics:


Unnamed: 0,Issue Date / Time
count,192630
mean,2022-08-27 16:45:07.783522560
min,2018-07-02 08:13:00
25%,2021-01-12 12:05:00
50%,2022-11-29 15:39:00
75%,2024-07-08 12:15:15
max,2025-10-30 15:56:00



Sheet: LPR_Reads_FY25

Shape: 646,909 rows * 2 columns

Columns: ['Date_Time', 'LOT']

Data types:
Date_Time    datetime64[ns]
LOT                  object
dtype: object

First 5 rows:


Unnamed: 0,Date_Time,LOT
0,2024-07-03 13:59:06,LOT 193
1,2024-07-03 13:59:18,LOT 193
2,2024-07-05 11:13:51,LOT 193
3,2024-07-08 08:58:51,LOT 193
4,2024-07-08 08:59:00,LOT 193



Basic statistics:


Unnamed: 0,Date_Time
count,646909
mean,2024-12-22 13:54:10.223688192
min,2024-07-01 05:36:40
25%,2024-10-02 12:24:00
50%,2024-12-06 13:45:07
75%,2025-03-18 09:22:10
max,2025-06-30 21:58:49



Sheet: LPR_Reads_FY24

Shape: 612,428 rows * 2 columns

Columns: ['Date_Time', 'LOT']

Data types:
Date_Time    datetime64[ns]
LOT                  object
dtype: object

First 5 rows:


Unnamed: 0,Date_Time,LOT
0,2023-08-07 09:22:46,LOT 193
1,2023-08-15 08:14:59,LOT 193
2,2023-08-15 08:15:05,LOT 193
3,2023-08-15 08:15:05,LOT 193
4,2023-08-15 08:15:06,LOT 193



Basic statistics:


Unnamed: 0,Date_Time
count,612428
mean,2023-12-25 23:46:12.406287616
min,2023-07-01 11:10:46
25%,2023-10-09 07:57:35.500000
50%,2023-12-10 15:22:41.500000
75%,2024-03-18 10:59:38.249999872
max,2024-06-30 20:13:40



Sheet: LPR_Reads_FY23

Shape: 520,850 rows * 2 columns

Columns: ['Date_Time', 'LOT']

Data types:
Date_Time    datetime64[ns]
LOT                  object
dtype: object

First 5 rows:


Unnamed: 0,Date_Time,LOT
0,2022-07-11 09:11:53,LOT 193
1,2022-07-11 16:36:11,LOT 193
2,2022-07-11 16:36:24,LOT 193
3,2022-07-11 16:36:24,LOT 193
4,2022-07-11 16:36:25,LOT 193



Basic statistics:


Unnamed: 0,Date_Time
count,520850
mean,2023-01-21 08:59:04.638373888
min,2022-07-01 05:24:26
25%,2022-11-11 12:41:09.750000128
50%,2023-02-09 11:26:22
75%,2023-04-07 11:24:01.750000128
max,2023-06-30 21:53:16
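
The three AMP sheets share the same columns (`Zone`, `Start_Date`, `End_Date`) and cover consecutive date ranges, so one option is to work with them as a single table. The following is a minimal sketch, not part of the original analysis. `combine_sheets` and the `sheet` and `duration_min` columns are hypothetical names, and the toy rows copy the first row of AMP1 and AMP2 shown above.

```python
import pandas as pd


def combine_sheets(frames: dict) -> pd.DataFrame:
    """Stack same-schema sheets and record which sheet each row came from."""
    tagged = [df.assign(sheet=name) for name, df in frames.items()]
    return pd.concat(tagged, ignore_index=True)


# Toy stand-ins for sheets_data["AMP1"] and sheets_data["AMP2"] (one row each)
toy_sheets = {
    "AMP1": pd.DataFrame({
        "Zone": ["Cougar Way on Street Meters"],
        "Start_Date": pd.to_datetime(["2020-08-10 07:51:00"]),
        "End_Date": pd.to_datetime(["2020-08-10 08:04:00"]),
    }),
    "AMP2": pd.DataFrame({
        "Zone": ["Regents: AMP Marked Spot"],
        "Start_Date": pd.to_datetime(["2023-07-01 08:57:00"]),
        "End_Date": pd.to_datetime(["2023-07-01 09:25:00"]),
    }),
}

amp_all = combine_sheets(toy_sheets)

# Session length in minutes, derived from the two timestamp columns
amp_all["duration_min"] = (
    (amp_all["End_Date"] - amp_all["Start_Date"]).dt.total_seconds() / 60
)
print(amp_all[["sheet", "Zone", "duration_min"]])
```

With the real data, the same function could be called on `{name: sheets_data[name] for name in ["AMP1", "AMP2", "AMP3"]}`.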


## Total Data Points
Calculate the total number of data points (rows) across all sheets.

In [12]:
total_rows = sum(df.shape[0] for df in sheets_data.values())
print(f"Total data points across all sheets: {total_rows:,}")
# Show breakdown by sheet
print(f"\nBreakdown by sheet:")
for sheet_name, df in sheets_data.items():
    pct = (df.shape[0] / total_rows) * 100
    print(f"{sheet_name}: {df.shape[0]:,} rows ({pct:.1f}%)")

Total data points across all sheets: 3,676,221

Breakdown by sheet:
AMP1: 590,249 rows (16.1%)
AMP2: 962,257 rows (26.2%)
AMP3: 150,898 rows (4.1%)
Tickets: 192,630 rows (5.2%)
LPR_Reads_FY25: 646,909 rows (17.6%)
LPR_Reads_FY24: 612,428 rows (16.7%)
LPR_Reads_FY23: 520,850 rows (14.2%)


## Missing Values Analysis
Identify missing values and gaps in the data.

In [13]:
for sheet_name, df in sheets_data.items():
    print(f"\n{'='*50}")
    print(f"Missing Values - {sheet_name}")
    print(f"{'='*50}")
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    missing_df = pd.DataFrame({
        'Missing Count': missing,
        'Percentage': missing_pct
    })
    missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)
    if len(missing_df) > 0:
        display(missing_df)
    else:
        print("No missing values found")


Missing Values - AMP1
No missing values found

Missing Values - AMP2
No missing values found

Missing Values - AMP3
No missing values found

Missing Values - Tickets
No missing values found

Missing Values - LPR_Reads_FY25
No missing values found

Missing Values - LPR_Reads_FY24
No missing values found

Missing Values - LPR_Reads_FY23
No missing values found


## Temporal Coverage Analysis
Examine the date/time range of the data to understand temporal coverage.

In [14]:
# Identify date/time columns in each sheet
for sheet_name, df in sheets_data.items():
    print(f"\n{'='*50}")
    print(f"Temporal Analysis - {sheet_name}")
    print(f"{'='*50}")
    # Look for datetime columns
    datetime_cols = df.select_dtypes(include=['datetime64']).columns.tolist()
    # Also check for columns that might contain dates but weren't parsed
    potential_date_cols = [col for col in df.columns if any(word in col.lower() for word in ['date', 'time', 'timestamp'])]
    print(f"\nDatetime columns: {datetime_cols}")
    print(f"Potential date columns: {potential_date_cols}")
    # Analyze temporal coverage for each datetime column
    # dict.fromkeys drops repeats while keeping order, so a column found by
    # both checks above is reported only once
    for col in dict.fromkeys(datetime_cols + potential_date_cols):
        if col in df.columns:
            try:
                # Try to convert to datetime if not already
                if df[col].dtype != 'datetime64[ns]':
                    temp_series = pd.to_datetime(df[col], errors='coerce')
                else:
                    temp_series = df[col]
                if temp_series.notna().sum() > 0:
                    print(f"\n  Column: {col}")
                    print(f"    Start: {temp_series.min()}")
                    print(f"    End: {temp_series.max()}")
                    print(f"    Duration: {temp_series.max() - temp_series.min()}")
                    print(f"    Valid dates: {temp_series.notna().sum():,}")
            except (ValueError, TypeError):
                print(f"\n  Column: {col} - Could not parse as datetime")


Temporal Analysis - AMP1

Datetime columns: ['Start_Date', 'End_Date']
Potential date columns: ['Start_Date', 'End_Date']

  Column: Start_Date
    Start: 2020-08-10 07:51:00
    End: 2023-06-30 21:08:00
    Duration: 1054 days 13:17:00
    Valid dates: 590,249

  Column: End_Date
    Start: 2020-08-10 08:04:00
    End: 2023-06-30 22:52:00
    Duration: 1054 days 14:48:00
    Valid dates: 590,249

Temporal Analysis - AMP2

Datetime columns: ['Start_Date', 'End_Date']
Potential date columns: ['Start_Date', 'End_Date']

  Column: Start_Date
    Start: 2023-07-01 08:57:00
    End: 2025-06-30 20:11:00
    Duration: 730 days 11:14:00
    Valid dates: 962,257

  Column: End_Date
    Start: 2023-07-01 09:25:00
    End:
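
The minimum and maximum timestamps give the outer bounds of each sheet but say nothing about gaps inside that range or about sessions whose end precedes their start. The sketch below shows one way to check both. `missing_days` and `invalid_sessions` are hypothetical helpers, the toy rows are made up, and whether such cases occur in the real data is not known from the output above.

```python
import pandas as pd


def missing_days(timestamps: pd.Series) -> pd.DatetimeIndex:
    """Calendar days between the first and last timestamp with no records."""
    days = timestamps.dt.normalize()  # drop the time of day
    full_range = pd.date_range(days.min(), days.max(), freq="D")
    return full_range.difference(pd.DatetimeIndex(days.unique()))


def invalid_sessions(df: pd.DataFrame,
                     start: str = "Start_Date",
                     end: str = "End_Date") -> pd.DataFrame:
    """Rows whose end timestamp is earlier than the start timestamp."""
    return df[df[end] < df[start]]


# Toy data (made up) shaped like the AMP sheets
toy = pd.DataFrame({
    "Start_Date": pd.to_datetime(
        ["2025-07-01 07:00", "2025-07-01 09:00", "2025-07-04 08:00"]),
    "End_Date": pd.to_datetime(
        ["2025-07-01 15:00", "2025-07-01 08:30", "2025-07-04 10:00"]),
})

gaps = missing_days(toy["Start_Date"])
bad = invalid_sessions(toy)
print("Days without records:", list(gaps.strftime("%Y-%m-%d")))
print("Sessions ending before they start:", len(bad))
```

On the real sheets, days without records may simply be weekends, holidays, or breaks, so a gap is a prompt for inspection rather than proof of missing data.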

## Parking Lot Analysis
Identify unique parking lots and their characteristics.

In [15]:
# Look for parking lot identifier columns
for sheet_name, df in sheets_data.items():
    print(f"\n{'='*80}")
    print(f"Parking Lot Identifiers - {sheet_name}")
    print(f"{'='*80}")
    # Look for columns that might identify parking lots
    lot_cols = [col for col in df.columns if any(word in col.lower() for word in ['lot', 'location', 'facility', 'zone', 'area'])]
    print(f"\nPotential lot identifier columns: {lot_cols}")
    for col in lot_cols:
        if df[col].dtype == 'object' or df[col].dtype.name == 'category':
            unique_vals = df[col].nunique()
            print(f"\n  {col}: {unique_vals} unique values")
            if unique_vals <= 50:  # Only show if reasonable number
                print(f"  Values: {sorted(df[col].unique())}")
            print(f"  Value counts:")
            display(df[col].value_counts().head(10))


Parking Lot Identifiers - AMP1

Potential lot identifier columns: ['Zone']

  Zone: 59 unique values
  Value counts:


Zone
CUE Garage                              144573
Library Garage                          130773
Columbia Street Lot Top Bays             59948
Green 5 South Beasley                    37815
Wilson Road on Street Meters             34637
Student Rec Center                       31208
Cougar Way on Street Meters              23761
Thatuna Rd. on Street Meters             11215
Ferdinand's Ice Cream Shoppe Parking     11112
B St Hourly Lot                          10787
Name: count, dtype: int64


Parking Lot Identifiers - AMP2

Potential lot identifier columns: ['Zone']

  Zone: 57 unique values
  Value counts:


Zone
CUE Garage                                222133
Student Rec Center                        187643
Library Garage                            140161
Columbia Street Lot Top Bays               66585
Green 5 South Beasley                      56513
Wilson Road on Street Meters               46231
Cougar Way on Street Meters                22752
Ferdinand's Ice Cream Shoppe Parking       20385
Green 1 PACCAR South                       19117
Cougar Health Services: Patron Parking     17769
Name: count, dtype: int64


Parking Lot Identifiers - AMP3

Potential lot identifier columns: ['Zone']

  Zone: 55 unique values
  Value counts:


Zone
CUE Garage                              32563
Student Rec Center                      29013
Library Garage                          20267
Green 5 South Beasley                   10577
Columbia Street Lot Top Bays             8753
Wilson Road on Street Meters             8112
Cougar Way on Street Meters              4028
Green 2 KMac Lot                         3896
Green 1 PACCAR South                     3867
Ferdinand's Ice Cream Shoppe Parking     3781
Name: count, dtype: int64


Parking Lot Identifiers - Tickets

Potential lot identifier columns: []

Parking Lot Identifiers - LPR_Reads_FY25

Potential lot identifier columns: ['LOT']

  LOT: 180 unique values
  Value counts:


LOT
LOT 150    57102
LOT 124    37845
LOT 009    37820
LOT 026    35786
LOT 071    33140
LOT 146    29605
LOT 120    23266
LOT 104    23173
LOT 001    22596
LOT 160    21811
Name: count, dtype: int64


Parking Lot Identifiers - LPR_Reads_FY24

Potential lot identifier columns: ['LOT']

  LOT: 172 unique values
  Value counts:


LOT
LOT 150    65168
LOT 009    40476
LOT 071    35943
LOT 026    35670
LOT 146    29066
LOT 124    27444
LOT 001    23781
LOT 104    21235
LOT 047    18928
LOT 100    15502
Name: count, dtype: int64


Parking Lot Identifiers - LPR_Reads_FY23

Potential lot identifier columns: ['LOT']

  LOT: 175 unique values
  Value counts:


LOT
LOT 150    57093
LOT 071    51644
LOT 009    34678
LOT 026    32245
LOT 124    30191
LOT 146    29755
LOT 104    20342
LOT 038    14211
LOT 120    12325
LOT 047    11392
Name: count, dtype: int64
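
The keyword search above returns no lot column for the Tickets sheet, because its location column is named `Loc` and none of the search words matches that name. One alternative is to map each sheet to its lot column explicitly, as sketched below. `LOT_COLUMN`, `lot_column_for`, `top_lots`, and `lot_code` are hypothetical names. The idea that the leading three digits of `Loc` are a lot number comparable to the LPR `LOT nnn` codes is an assumption the source does not confirm.

```python
import pandas as pd

# Sheet-name prefix -> lot identifier column, taken from the column lists
# printed earlier in this notebook
LOT_COLUMN = {"AMP": "Zone", "Tickets": "Loc", "LPR": "LOT"}


def lot_column_for(sheet_name: str) -> str:
    """Return the lot identifier column for a sheet, based on its name."""
    for prefix, column in LOT_COLUMN.items():
        if sheet_name.startswith(prefix):
            return column
    raise KeyError(f"No lot column known for sheet {sheet_name!r}")


def top_lots(df: pd.DataFrame, sheet_name: str, n: int = 10) -> pd.Series:
    """Most frequent lot identifiers, after trimming stray whitespace."""
    column = lot_column_for(sheet_name)
    return df[column].astype(str).str.strip().value_counts().head(n)


def lot_code(values: pd.Series) -> pd.Series:
    """First three-digit run in each value (ASSUMED to be a lot number)."""
    return values.astype(str).str.extract(r"(\d{3})", expand=False)


# Toy rows copied from the first rows of the Tickets and LPR sheets above
tickets = pd.DataFrame(
    {"Loc": ["075 VETERANS MALL", "075 VETERANS MALL", "025 SCOTT-COMAN"]})
lpr = pd.DataFrame({"LOT": ["LOT 193"]})

counts = top_lots(tickets, "Tickets")
ticket_codes = lot_code(tickets["Loc"])
lpr_codes = lot_code(lpr["LOT"])
print(counts)
print(ticket_codes.tolist(), lpr_codes.tolist())
```

An explicit mapping is less flexible than a keyword search but fails loudly when a sheet is unknown, instead of silently returning an empty list.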

## Summary of Findings
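Key findings from the analysis above, and the issues they raise for preprocessing:
- The workbook has 7 sheets and 3,676,221 rows in total: three AMP session sheets (`Zone`, `Start_Date`, `End_Date`), one `Tickets` sheet (`Issue Date / Time`, `Loc`), and three LPR sheets (`Date_Time`, `LOT`).
- No sheet contains null values, and every date column was parsed as `datetime64[ns]`.
- The AMP sheets cover consecutive periods: AMP1 runs from 2020-08-10 to 2023-06-30, AMP2 from 2023-07-01 to 2025-06-30, and AMP3 starts on 2025-07-01 (last `Start_Date` on 2025-10-31).
- Tickets span 2018-07-02 to 2025-10-30, and each LPR sheet covers one July-to-June fiscal year (FY23 to FY25).
- Lot identifiers differ by source: AMP uses zone names (55 to 59 unique values per sheet), LPR uses `LOT nnn` codes (172 to 180 unique values per sheet), and Tickets uses `Loc`, which the keyword search above did not match.
- The first rows of `LPR_Reads_FY24` and `LPR_Reads_FY23` include reads with identical timestamps and lots, so duplicates should be checked during preprocessing.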

In [16]:
# Summary statistics
print("SUMMARY")
print("="*50)
print(f"Total sheets: {len(sheets_data)}")
print(f"Total rows: {total_rows:,}")

SUMMARY
Total sheets: 7
Total rows: 3,676,221
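
The two totals above could be extended into a per-sheet overview that is easier to carry into the preprocessing phase. The following is a minimal sketch under that assumption. `summarize` and its column names are hypothetical, and the toy rows copy the first rows of the Tickets and AMP3 sheets shown earlier.

```python
import pandas as pd


def summarize(sheets: dict) -> pd.DataFrame:
    """One row per sheet: size plus earliest and latest timestamp."""
    records = []
    for name, df in sheets.items():
        # Every column pandas parsed as a datetime
        dt_cols = [c for c in df.columns
                   if pd.api.types.is_datetime64_any_dtype(df[c])]
        records.append({
            "sheet": name,
            "rows": len(df),
            "columns": df.shape[1],
            "earliest": min((df[c].min() for c in dt_cols), default=pd.NaT),
            "latest": max((df[c].max() for c in dt_cols), default=pd.NaT),
        })
    return pd.DataFrame(records).set_index("sheet")


# Toy stand-in for sheets_data
toy_sheets = {
    "Tickets": pd.DataFrame({
        "Issue Date / Time": pd.to_datetime(
            ["2018-07-02 08:13:00", "2018-07-02 08:17:00"]),
        "Loc": ["075 VETERANS MALL", "075 VETERANS MALL"],
    }),
    "AMP3": pd.DataFrame({
        "Zone": ["Ferdinand's Ice Cream Shoppe Parking"],
        "Start_Date": pd.to_datetime(["2025-07-01 07:00:00"]),
        "End_Date": pd.to_datetime(["2025-07-01 15:00:00"]),
    }),
}

summary = summarize(toy_sheets)
print(summary)
```

Calling `summarize(sheets_data)` in this notebook would produce the same table for all seven sheets.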
