# Data Cleaning and Preparing SCATS Values

### Objective

#### Analysis
- Analyze the spreadsheet files.
- Understand and explain what each row and column represents.
- Understand the values and their units.
- Decide which columns to keep, and justify the columns to be disregarded.

In [None]:
import pandas as pd

# CSV raw datasets
SCATS_DATA_OCTOBER_2006_CSV_PATH = 'datasets/Scats Data October 2006.csv'
SCATS_SITE_LISTING_SPREADSHEET_VICROADS_CSV_PATH = 'datasets/SCATSSiteListingSpreadsheet_VicRoads.csv'
TRAFFIC_COUNT_LOCATIONS_WITH_LONG_LAT_CSV_PATH = 'datasets/Traffic_Count_Locations_with_LONG_LAT.csv'

# Read the CSV files; header=1 uses the second row of the SCATS file as the column names
df_scats_data_october = pd.read_csv(SCATS_DATA_OCTOBER_2006_CSV_PATH, header=1)
df_scats_site_listing = pd.read_csv(SCATS_SITE_LISTING_SPREADSHEET_VICROADS_CSV_PATH)
df_traffic_long_lat = pd.read_csv(TRAFFIC_COUNT_LOCATIONS_WITH_LONG_LAT_CSV_PATH)

In [None]:
# Outputting SCATS Data October 2006
df_scats_data_october


- Time-Based Data (V00, V01, V02, ..., V95): These columns hold the traffic counts for consecutive time intervals within a day. With 96 columns per day, each interval would cover 15 minutes. They are the core signal for understanding traffic patterns over time and predicting future traffic flow.
- Date (Start Time): This captures day-to-day variation in traffic. It is worth deriving the day of the week and a weekday/weekend flag from it, as traffic patterns often differ between them.

Location Information:

- SCATS Number and Location: These provide context on where the data was recorded, allowing the model to learn location-specific traffic patterns.
- Coordinates (Latitude and Longitude): If multiple locations are involved, geographical data can help in understanding spatial traffic patterns.

Internal Codes (if meaningful):
- HF VicRoads Internal, VR Internal Stat, VR Internal Loc: If these columns represent different types of traffic flow or control information, they could provide additional context to improve predictions.

In a machine learning model, you'll likely treat the time-based data as your primary features, with additional features such as date, day of the week, and location to provide context. You might also need to consider temporal relationships and how past traffic data influences future traffic predictions.
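As a rough illustration of the paragraph above, the wide layout (one column per interval) can be reshaped into a long time series before adding a lagged feature. This is only a sketch on made-up numbers: the sample values, the restriction to four interval columns, the names `long_df`, `Timestamp` and `Volume_Lag1`, and the 15-minute interval width are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical two-day sample for one site, with only 4 of the 96 interval columns.
wide = pd.DataFrame({
    "SCATS Number": [970, 970],
    "Date": pd.to_datetime(["2006-10-01", "2006-10-02"]),
    "V00": [10, 12],
    "V01": [8, 9],
    "V02": [7, 11],
    "V03": [5, 6],
})
interval_cols = ["V00", "V01", "V02", "V03"]

# Long format: one row per site, date and interval.
long_df = wide.melt(
    id_vars=["SCATS Number", "Date"],
    value_vars=interval_cols,
    var_name="Interval",
    value_name="Volume",
)

# "V02" -> 2 intervals after midnight; 15 minutes per interval is an assumption.
offset_minutes = long_df["Interval"].str[1:].astype(int) * 15
long_df["Timestamp"] = long_df["Date"] + pd.to_timedelta(offset_minutes, unit="min")
long_df = long_df.sort_values(["SCATS Number", "Timestamp"]).reset_index(drop=True)

# Lag feature: the previous observation at the same site.
# With the full 96 columns this would be the count 15 minutes earlier.
long_df["Volume_Lag1"] = long_df.groupby("SCATS Number")["Volume"].shift(1)

print(long_df[["Timestamp", "Volume", "Volume_Lag1"]])
```

The first row of each site has no predecessor, so its lag is `NaN` and would need to be dropped or imputed before training.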



In [None]:
# Outputting SCATSSiteListingSpreadsheet
df_scats_site_listing

Columns:
- Site Number: This likely represents the unique identifier for each SCATS (Sydney Coordinated Adaptive Traffic System) site.
- Location Description: Describes the physical location of the SCATS site, such as the intersection or road segment it monitors.
- Site Type: Indicates the type of site, which might be "Intersection" (INT) or other designations.
- Directory: Possibly refers to a reference directory or map that includes the SCATS site.
- Map Reference: A specific reference on a map (e.g., Melway) to locate the site geographically.

Rows:

Each row after the headers appears to correspond to a SCATS site with details about its location, type, and reference information.

This dataset provides geographical and descriptive information about SCATS sites, which could be useful in the traffic prediction model by linking traffic data to specific sites and their characteristics.
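One way to link the two tables is a left join on the site identifier. The sketch below uses small made-up frames; it assumes the key is called `SCATS Number` in the traffic data and `Site Number` in the site listing, as described above, and that both hold comparable integer IDs, which should be checked against the real files.

```python
import pandas as pd

# Hypothetical samples; IDs and descriptions are placeholders, not real data.
traffic = pd.DataFrame({"SCATS Number": [970, 2000, 9999], "V00": [10, 20, 30]})
sites = pd.DataFrame({
    "Site Number": [970, 2000],
    "Location Description": ["Site A (placeholder)", "Site B (placeholder)"],
    "Site Type": ["INT", "INT"],
})

# Left join keeps every traffic row; indicator=True adds a "_merge" column
# that shows which rows found a matching site.
merged = traffic.merge(
    sites,
    how="left",
    left_on="SCATS Number",
    right_on="Site Number",
    indicator=True,
)

# Traffic rows without a site listing entry are worth inspecting before modelling.
unmatched = merged.loc[merged["_merge"] == "left_only", "SCATS Number"].tolist()
print(unmatched)
```

A left join is used here so that traffic rows are never silently dropped when a site is missing from the listing.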

In [None]:
# Outputting Traffic_Count_Locations_with_LONG_LAT 
df_traffic_long_lat

The file "Traffic_Count_Locations_with_LONG_LAT.csv" contains the following columns:

- X and Y: These are the geographic coordinates (longitude and latitude, respectively) of the traffic count locations.
- FID and OBJECTID: These are identifiers that may refer to specific features or objects within a geographic or mapping system.
- TFM_ID: This is an identifier for the traffic flow monitoring site.
- TFM_DESC: Description of the traffic monitoring site, typically indicating the road or intersection being monitored.
- TFM_TYP_DE: Type of the traffic flow monitoring, which seems to describe the location type (e.g., "INTERSECTION").
- MOVEMENT_T: Describes the movement type being monitored (e.g., "All Moves" likely means all traffic directions).
- SITE_DESC: A more detailed description of the site, often including road names or specific intersections.
- ROAD_NBR: The road number for the monitored site.
- DECLARED_R: The declared road name or type, often indicating the official name or designation of the road.
- LOCAL_ROAD: Describes the local road connected to the declared road, giving further context to the location.
- DATA_SRC_C: Source of the data, likely indicating the method of data collection.
- DATA_SOURC: Further description of the data source, such as whether it was manually collected or obtained through a specific method.
- TIME_CATEG: Indicates the time category, such as how old the data is (e.g., "Greater than 10 Years").
- YEAR_SINCE: The number of years since the data was first collected.
- LAST_YEAR: The last year in which the data was updated.
- AADT_ALLVE: The Annual Average Daily Traffic (AADT) for all vehicles at the site.
- AADT_TRUCK: The AADT specifically for trucks at the site.
- PER_TRUCKS: The percentage of total traffic that consists of trucks.

Rows:

Each row in this file corresponds to a specific traffic count location. The data in each row provides detailed information about that location, including geographic coordinates, traffic flow details, and historical traffic data.

This file provides important context that can be combined with time-based traffic data to enhance the accuracy of your traffic flow prediction model. For instance, the geographic coordinates and AADT values can help in understanding spatial and volume-based traffic patterns.
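For the "columns to keep" objective, a possible starting point is to keep only the coordinates, an identifier, the description and the AADT figures, and to rename `X`/`Y` to explicit names. The selection below is an assumption for illustration, not a final decision, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical sample with a subset of the columns listed above.
locations = pd.DataFrame({
    "X": [145.0, 145.1],
    "Y": [-37.8, -37.9],
    "FID": [0, 1],
    "OBJECTID": [1, 2],
    "TFM_ID": [101, 102],
    "SITE_DESC": ["placeholder site 1", "placeholder site 2"],
    "AADT_ALLVE": [20000, 5000],
    "AADT_TRUCK": [1000, 500],
})

# Columns to keep (assumed), with X/Y renamed to make their meaning explicit.
keep = ["X", "Y", "TFM_ID", "SITE_DESC", "AADT_ALLVE", "AADT_TRUCK"]
slim = locations[keep].rename(columns={"X": "Longitude", "Y": "Latitude"})

print(slim.columns.tolist())
```

Here `FID` and `OBJECTID` are dropped on the assumption that they are mapping-system identifiers with no predictive value.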

#### Data Cleaning & Feature Engineering

In [None]:
df_scats_data_october

In [None]:
# Drop columns that are entirely empty, then replace the remaining missing values with 0
df_scats_data_october = df_scats_data_october.dropna(how='all', axis=1)
df_scats_data_october = df_scats_data_october.fillna(0)
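The effect of the two calls above can be seen on a toy frame. The column names and values here are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one fully empty column and one partially missing column.
toy = pd.DataFrame({
    "V00": [5.0, np.nan],
    "V01": [3.0, 4.0],
    "Empty": [np.nan, np.nan],
})

# how="all" with axis=1 removes only columns where every value is missing.
cleaned = toy.dropna(how="all", axis=1).fillna(0)

print(cleaned)
```

Note that filling with 0 makes a missing count indistinguishable from a genuine zero count, so it may be worth checking how many values were missing before relying on this.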

In [None]:
# Interval columns V00 ... V95 (zero-padded to two digits)
numerical_columns = [f'V{i:02d}' for i in range(96)]

# Convert the interval columns to numbers; values that cannot be parsed become NaN
df_scats_data_october[numerical_columns] = df_scats_data_october[numerical_columns].apply(pd.to_numeric, errors='coerce')

In [None]:
# Parse the Date column as day/month/two-digit year (dayfirst is redundant once format is given)
df_scats_data_october['Date'] = pd.to_datetime(df_scats_data_october['Date'], format='%d/%m/%y')

# Create the new columns
new_date_columns = pd.DataFrame({
    'Day_Of_Week': df_scats_data_october['Date'].dt.dayofweek,
    'Is_Weekend': df_scats_data_october['Date'].dt.dayofweek >= 5
})

# Concatenate the new columns with the original DataFrame
df_scats_data_october = pd.concat([df_scats_data_october, new_date_columns], axis=1)
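As a quick check of the derived columns, pandas numbers the days from Monday = 0 to Sunday = 6, so `dayofweek >= 5` flags Saturday and Sunday. The sketch below uses three sample dates from October 2006 in the same `%d/%m/%y` format; the frame name `features` is only for illustration.

```python
import pandas as pd

# 1 Oct 2006 was a Sunday, 2 Oct a Monday, 7 Oct a Saturday.
dates = pd.to_datetime(pd.Series(["1/10/06", "2/10/06", "7/10/06"]), format="%d/%m/%y")

features = pd.DataFrame({
    "Date": dates,
    "Day_Of_Week": dates.dt.dayofweek,      # Monday = 0 ... Sunday = 6
    "Is_Weekend": dates.dt.dayofweek >= 5,  # Saturday or Sunday
})

print(features)
```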

In [None]:
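# Note: despite the column name, this is the mean over all 96 intervals in the row
# (a per-day average count per interval), not a rolling 1-hour window.
df_scats_data_october['Rolling_1hr'] = df_scats_data_october[numerical_columns].mean(axis=1)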
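The `Rolling_1hr` column above averages all 96 intervals of a row. If a true rolling 1-hour mean is wanted, one option is a window of four intervals applied along the columns. This is a sketch on a made-up row with eight interval columns; the window of 4 assumes 15-minute intervals, and the name `rolling_1hr` is illustrative.

```python
import pandas as pd

# Hypothetical single day with only the first 8 interval columns.
cols = [f"V{i:02d}" for i in range(8)]
counts = pd.DataFrame([[10, 20, 30, 40, 50, 60, 70, 80]], columns=cols)

# rolling() works down the rows, so transpose, roll over 4 intervals, transpose back.
rolling_1hr = counts.T.rolling(window=4).mean().T

print(rolling_1hr)
```

The first three intervals have fewer than four observations behind them, so they come out as `NaN` unless `min_periods` is lowered.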

In [None]:
df_scats_data_october.columns


In [None]:
df_scats_data_october