
<hr>
<h2 style='color:darkblue; text-align:center;'><strong>Data Preprocessing Pipeline for Taxi Demand Prediction</strong></h2>

<h3 style='color:darkblue;'><strong>Introduction</strong></h3>
<p style='color:black; font-size:16px; text-align:justify;'>
Building on our <em style='color:darkgreen;'><strong>Exploratory Data Analysis (EDA)</strong></em>, we developed a <em style='color:darkgreen;'><strong>Data Preprocessing Pipeline</strong></em> that prepares the raw taxi trip data for predictive modeling. The pipeline covers data cleaning, feature extraction, and data segmentation, and each step addresses issues identified during the EDA. The result is a clean, reliable dataset enriched with meaningful attributes, on which more accurate and efficient demand forecasting models can be built.
</p>

<h3 style='color:darkblue;'><strong>Strategic Importance</strong></h3>
<p style='color:black; font-size:16px; text-align:justify;'>
The pipeline's strategic value lies in its ability to <em style='color:darkgreen;'><strong>streamline and optimize the taxi demand prediction process</strong></em>, which directly supports our overarching business goal. Cleaner, richer data lets us predict taxi demand with greater precision, which in turn supports better fleet management and customer satisfaction.
</p>
<h3 style='color:darkblue;'><strong>Technical Approach</strong></h3>
<p style='color:black; font-size:16px; text-align:justify;'>
Our technical approach focuses on <em style='color:darkgreen;'><strong>data integration, granularity control, feature engineering, and encapsulation in a callable API</strong></em>. The pipeline first joins the taxi trip data with external hourly weather data in BigQuery, matching each trip to the weather record for the same date and hour, so that the impact of weather on taxi demand can be modeled. Granularity control then fixes the unit of analysis: trips are grouped by pickup community area and by hour. Feature extraction, also written in SQL, derives time-based attributes (year, month, day, hour, weekday) together with sine and cosine transformations that capture cyclical patterns in demand.

The entire pipeline is an end-to-end preprocessing solution encapsulated in a single BigQuery stored procedure, which acts as a callable API: one CALL statement runs every step from the join to the export. Because each run rebuilds the intermediate and output tables, the procedure can be invoked again whenever new trips arrive, so the served production model can be fed up-to-date preprocessed data without manual steps.
</p>
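<p style='color:black; font-size:16px; text-align:justify;'>
The sine and cosine features described above can be illustrated with a minimal Python sketch. It is not part of the pipeline: the helper name <strong>cyclical_encode</strong> is hypothetical, and the sketch assumes that the period equals the number of distinct values (24 for hours, 7 for weekdays, 12 for months).
</p>

```python
import math


def cyclical_encode(value, period):
    """Map a periodic value (e.g. an hour in 0..23) onto the unit circle.

    Returns a (sin, cos) pair. `period` is assumed to be the number of
    distinct values in one cycle, e.g. 24 for hours.
    """
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


hour_0 = cyclical_encode(0, 24)
hour_1 = cyclical_encode(1, 24)
hour_23 = cyclical_encode(23, 24)

# On the circle, 23:00 is exactly as close to 00:00 as 01:00 is,
# which a raw integer hour (0 vs 23) cannot express.
print(math.dist(hour_0, hour_1), math.dist(hour_0, hour_23))
```

<p style='color:black; font-size:16px; text-align:justify;'>
Both coordinates are needed: the sine alone cannot distinguish, for example, 03:00 from 09:00, whereas the (sine, cosine) pair identifies each hour uniquely.
</p>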





<hr>


In [None]:
CREATE OR REPLACE PROCEDURE `mlops-363723.ChicagoTaxitrips.data_preprocessing_pipeline_chicago_taxi_trips`()
BEGIN

  -- Joining the table with weather_hourly data
  CREATE OR REPLACE TABLE mlops-363723.ChicagoTaxitrips.joined_table AS
  SELECT 
      t.*,
      w.*
  FROM 
      mlops-363723.ChicagoTaxitrips.taxi_trips_demand_1 AS t
  JOIN 
      mlops-363723.ChicagoTaxitrips.weather_hourly AS w
  ON 
      EXTRACT(DATE FROM t.trip_start_timestamp) = EXTRACT(DATE FROM w.time) AND 
      EXTRACT(HOUR FROM t.trip_start_timestamp) = EXTRACT(HOUR FROM w.time)
  WHERE 
      EXTRACT(YEAR FROM w.time) BETWEEN 2020 AND 2023;
    -- 1. Data Cleaning: drop invalid rows and cap outliers at the 1st and 99th percentiles
    CREATE OR REPLACE TABLE mlops-363723.ChicagoTaxitrips.cleaned_data AS
    WITH CappedData AS (
      SELECT *,
         CASE 
            WHEN trip_seconds/60 < PERCENTILE_CONT(trip_seconds/60, 0.01) OVER() THEN 
              PERCENTILE_CONT(trip_seconds/60, 0.01) OVER()
            WHEN trip_seconds/60 > PERCENTILE_CONT(trip_seconds/60, 0.99) OVER() THEN 
              PERCENTILE_CONT(trip_seconds/60, 0.99) OVER()
            ELSE trip_seconds/60
          END AS duration,
        CASE 
            WHEN trip_miles < PERCENTILE_CONT(trip_miles, 0.01) OVER() THEN
              PERCENTILE_CONT(trip_miles, 0.01) OVER()
            WHEN trip_miles > PERCENTILE_CONT(trip_miles, 0.99) OVER() THEN
              PERCENTILE_CONT(trip_miles, 0.99) OVER()
            ELSE trip_miles
        END AS capped_trip_miles,
        CASE 
            WHEN trip_total < PERCENTILE_CONT(trip_total, 0.01) OVER() THEN PERCENTILE_CONT(trip_total, 0.01) OVER()
            WHEN trip_total > PERCENTILE_CONT(trip_total, 0.99) OVER() THEN PERCENTILE_CONT(trip_total, 0.99) OVER()
            ELSE trip_total
        END AS capped_trip_total
      FROM mlops-363723.ChicagoTaxitrips.joined_table
      WHERE 
        pickup_longitude BETWEEN -87.9401 AND -87.5241
        AND pickup_latitude BETWEEN 41.6445 AND 42.0231
        AND trip_seconds > 0
        AND trip_miles > 0
        AND trip_total > 0
        AND pickup_community_area>0
    )
    SELECT * FROM CappedData;
    -- 2. Feature Extraction
    CREATE OR REPLACE TABLE mlops-363723.ChicagoTaxitrips.feature_extracted_data AS
    SELECT 
        *,
        CASE WHEN public_holiday = TRUE THEN 1 ELSE 0 END AS encoded_public_holiday,
        EXTRACT(YEAR FROM trip_start_timestamp) AS year,
        EXTRACT(MONTH FROM trip_start_timestamp) AS month,
        EXTRACT(DAY FROM trip_start_timestamp) AS day,
        EXTRACT(HOUR FROM trip_start_timestamp) AS hour,
        EXTRACT(DAYOFWEEK FROM trip_start_timestamp) - 1 AS weekday,  -- DAYOFWEEK is 1 (Sunday) to 7 (Saturday), so this gives Sunday = 0 through Saturday = 6
        DATE(trip_start_timestamp) AS trip_date,
        -- Cyclical encodings divide by the period (24 hours, 7 weekdays) so that the last value wraps around next to the first
        SIN(2 * 3.14159265359 * EXTRACT(HOUR FROM trip_start_timestamp) / 24.0) AS hour_sin,
        COS(2 * 3.14159265359 * EXTRACT(HOUR FROM trip_start_timestamp) / 24.0) AS hour_cos,
        SIN(2 * 3.14159265359 * (EXTRACT(DAYOFWEEK FROM trip_start_timestamp) - 1) / 7.0) AS day_sin,
        COS(2 * 3.14159265359 * (EXTRACT(DAYOFWEEK FROM trip_start_timestamp) - 1) / 7.0) AS day_cos,
        SIN(2 * 3.14159265359 * EXTRACT(MONTH FROM trip_start_timestamp) / 12.0) AS month_sin,
        COS(2 * 3.14159265359 * EXTRACT(MONTH FROM trip_start_timestamp) / 12.0) AS month_cos
    FROM 
        mlops-363723.ChicagoTaxitrips.cleaned_data;
    -- Sorted copy of the feature table (the aggregation below does not depend on row order)
    CREATE OR REPLACE TABLE mlops-363723.ChicagoTaxitrips.sorted_feature_extracted_data AS
    SELECT
        *
    FROM
        mlops-363723.ChicagoTaxitrips.feature_extracted_data
    ORDER BY
        pickup_community_area,
        year,
        month,
        day,
        hour;

    -- 3. Data Aggregation
    CREATE OR REPLACE TABLE mlops-363723.ChicagoTaxitrips.aggregated_data AS
    SELECT 
        pickup_community_area,
        year,
        month,
        hour,
        day,
        -- encoded_public_holiday AS public_holiday,
        COUNT(unique_key) AS demand,
        AVG(duration) AS duration,
        AVG(capped_trip_miles) AS trip_miles,
        AVG(capped_trip_total) AS trip_total,
        AVG(temperature_2m) AS temperature_2m,
        AVG(relativehumidity_2m) AS relativehumidity_2m,
        AVG(precipitation) AS precipitation,
        AVG(rain) AS rain,
        AVG(snowfall) AS snowfall,
        AVG(weathercode) AS weathercode,
        MAX(encoded_public_holiday) AS public_holiday,
        MAX(hour_sin) AS hour_sin,
        MAX(hour_cos) AS hour_cos,
        MAX(day_sin) AS day_sin,
        MAX(day_cos) AS day_cos,
        MAX(month_sin) AS month_sin,
        MAX(month_cos) AS month_cos
    FROM 
        mlops-363723.ChicagoTaxitrips.sorted_feature_extracted_data
    GROUP BY 
        pickup_community_area, year,
        month,
        hour,
        day;
        
    -- Create Training Set
    CREATE OR REPLACE TABLE `mlops-363723.ChicagoTaxitrips.training_data` AS
    SELECT *
    FROM `mlops-363723.ChicagoTaxitrips.aggregated_data`
    WHERE year IN (2020, 2021);

    -- Create Validation Set
    CREATE OR REPLACE TABLE `mlops-363723.ChicagoTaxitrips.validation_data` AS
    SELECT *
    FROM `mlops-363723.ChicagoTaxitrips.aggregated_data`
    WHERE year = 2022;

    -- Create Test Set
    CREATE OR REPLACE TABLE `mlops-363723.ChicagoTaxitrips.test_data` AS
    SELECT *
    FROM `mlops-363723.ChicagoTaxitrips.aggregated_data`
    WHERE year = 2023 AND month <= 4;
  
    -- Export each split to Cloud Storage as CSV
    EXPORT DATA OPTIONS(
      uri='gs://chicago_taxitrips/DATA_DIRECTORY/training_data/*.csv',
      format='CSV',
      overwrite=true
    ) AS
    SELECT * FROM `mlops-363723.ChicagoTaxitrips.training_data`;

    EXPORT DATA OPTIONS(
      uri='gs://chicago_taxitrips/DATA_DIRECTORY/validation_data/*.csv',
      format='CSV',
      overwrite=true
    ) AS
    SELECT * FROM `mlops-363723.ChicagoTaxitrips.validation_data`;

    EXPORT DATA OPTIONS(
      uri='gs://chicago_taxitrips/DATA_DIRECTORY/test_data/*.csv',
      format='CSV',
      overwrite=true
    ) AS
    SELECT * FROM `mlops-363723.ChicagoTaxitrips.test_data`;
END;



<span style='color:darkblue'><strong>Comment:</strong></span> <span style='color:black'>This SQL procedure encapsulates our comprehensive data preprocessing pipeline, tailored for the Chicago Taxi Trips dataset, with the following key steps:
<ul>
    <li><strong>Data Integration:</strong> Joining taxi trip data with hourly weather data to incorporate environmental factors into our analysis.</li>
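    <li><strong>Data Cleaning and Capping:</strong> Removing trips with non-positive duration, distance, total cost, or community area, or with pickup coordinates outside the specified longitude and latitude bounds, then capping trip duration, distance, and total cost at their 1st and 99th percentiles to limit the impact of outliers.</li>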
    <li><strong>Feature Extraction:</strong> Extracting and engineering features such as time-based variables and trigonometric transformations to capture cyclical patterns in taxi demand.</li>
    <li><strong>Data Segmentation:</strong> Sorting the feature-extracted data by community area, year, month, day, and hour so that it is organized consistently before aggregation.</li>
    <li><strong>Data Aggregation:</strong> Aggregating trips per community area and hour to obtain the demand count together with average trip and weather attributes.</li>
    <li><strong>Data Splitting for Model Training:</strong> Splitting the aggregated data chronologically into training (2020-2021), validation (2022), and test (January to April 2023) sets. Because each set covers a later period than the one before it, the models are trained on historical data, validated on more recent data, and tested on the most recent data, which reflects a realistic demand prediction scenario and avoids leakage from future observations.</li>
    <li><strong>Data Export:</strong> Exporting the processed data sets to separate CSV files, facilitating easy access and use in subsequent modeling phases.</li>
</ul>
This pipeline is designed to make our demand prediction models more reliable and accurate by grounding them in robust, well-prepared data. Splitting the data by year preserves the integrity of the evaluation: the models never learn from data that lies in the future relative to the period they are evaluated on, which is crucial for time-sensitive demand prediction.</span>
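
<span style='color:black'>The capping step listed above can be illustrated outside BigQuery with a minimal Python sketch. It is not the pipeline's code: `percentile_cont` and `cap_outliers` are hypothetical helpers, and the interpolation is written to mirror the linear interpolation that BigQuery documents for `PERCENTILE_CONT`.</span>

```python
def percentile_cont(values, q):
    """Linearly interpolated percentile of `values` for 0 <= q <= 1."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    position = q * (len(ordered) - 1)   # fractional index into the sorted list
    lower = int(position)               # index of the value just below
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def cap_outliers(values, lower_q=0.01, upper_q=0.99):
    """Clamp every value into the [lower_q, upper_q] percentile range."""
    low = percentile_cont(values, lower_q)
    high = percentile_cont(values, upper_q)
    return [min(max(v, low), high) for v in values]


# Hypothetical trip distances in miles; the last one is an extreme outlier.
trip_miles = [0.5, 1.2, 2.0, 3.1, 250.0]
print(cap_outliers(trip_miles, lower_q=0.25, upper_q=0.75))
```

<span style='color:black'>Capping keeps every row and only pulls extreme values back to the chosen percentile, so the demand counts are unaffected while the averaged trip attributes become less sensitive to outliers.</span>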


In [None]:
CALL `mlops-363723.ChicagoTaxitrips.data_preprocessing_pipeline_chicago_taxi_trips`();

<span style='color:darkblue'><strong>Comment:</strong></span> <span style='color:black'>This statement is the API call to our data preprocessing pipeline. Invoking the procedure `mlops-363723.ChicagoTaxitrips.data_preprocessing_pipeline_chicago_taxi_trips` executes every preprocessing step encapsulated within it with a single command: joining, cleaning, feature extraction, aggregation, splitting, and export. Once it completes, the data is ready for analysis and model training. Running the pipeline this way keeps data processing consistent and efficient for our demand prediction models.</span>
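
<span style='color:black'>The same call can also be issued from Python, for example from a scheduled job. The following is a minimal sketch under stated assumptions: `build_call_statement` and `run_pipeline` are hypothetical helpers, and `client` is assumed to be a `google.cloud.bigquery.Client` or any object exposing the same `query(...).result()` interface.</span>

```python
PROCEDURE_ID = (
    "mlops-363723.ChicagoTaxitrips."
    "data_preprocessing_pipeline_chicago_taxi_trips"
)


def build_call_statement(procedure_id):
    """Return the CALL statement for a fully qualified, argument-free procedure."""
    return f"CALL `{procedure_id}`();"


def run_pipeline(client, procedure_id=PROCEDURE_ID):
    """Run the preprocessing procedure and wait for it to finish.

    `client` is assumed to behave like google.cloud.bigquery.Client:
    query(sql) returns a job object whose result() waits for completion.
    """
    job = client.query(build_call_statement(procedure_id))
    job.result()  # wait here so callers only continue once the tables are rebuilt
    return job


# Example usage (assumes the google-cloud-bigquery package and valid credentials):
#   from google.cloud import bigquery
#   run_pipeline(bigquery.Client(project="mlops-363723"))
```

<span style='color:black'>Passing the client in as an argument keeps the sketch free of credentials and makes it easy to exercise with a stub before pointing it at the real project.</span>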



<hr>
<h3 style='color:darkblue; text-align:center;'><strong>Conclusion: Key Takeaways from Our Data Preprocessing Endeavor</strong></h3>
<p style='color:black; font-size:16px; text-align:justify;'>
This concludes the documentation of our data preprocessing pipeline for the Chicago Taxi Trips dataset. The key takeaways are:
<ul>
<li>The pipeline integrates, cleans, and transforms the raw taxi trip data, enriching it with relevant weather information and time-based features that capture the dynamics of taxi demand.</li>
<li>Its design prioritizes data integrity and relevance: invalid records are removed, outliers are capped, and the splits are chronological, so the processed data is suitable for predictive modeling.</li>
<li>This document is an illustrative guide; the transformations themselves run inside BigQuery, which is built to process and analyze large datasets efficiently.</li>
<li>The documentation serves both as a record of our methodological approach and as a reference for future improvements to our data processing strategies.</li>
</ul>
In short, this data preprocessing pipeline is a pivotal step in our workflow: it gives every subsequent stage of the machine learning process a solid, well-prepared foundation for developing accurate and reliable predictive models.
</p>
<hr>
