Seasonality as a growth lever for two-sided marketplaces

💹 Carving out predictability to GROW REVENUE in a ride-hailing business • May 2023

This repository details the steps of a quick data analysis on the matching and contrasting points in the daily and weekly seasonality of the different sides of a ride-hailing marketplace. The SQL code used in it is available here and on BigQuery.

Tags: product-analytics, sql, bigquery

Resources

Spreadsheet • Google Sheets
SQL Code • (on BigQuery)
SQL Code • (Github file)

Query structure

CTE   1 • Data collection: fetching data from the original table
CTE   2 • Data cleaning: (a) finding interquartile ranges (IQR) of trip_seconds
CTE   3 • Data cleaning: (i) converting from UTC to Chicago Time, (ii) Excluding outliers: duration (trip_seconds)
CTE   4 • Data cleaning: checking results from cleaning (i) + (ii)
CTE   5 • Data cleaning: (b) aggregating partially clean data, preparing to exclude extreme hours (esp. peaks)
CTE   6 • Data cleaning: (c) finding interquartile ranges (IQR) of trip_cnt, taxi_cnt
CTE   7 • Data cleaning: (iii) based on trip_cnt, taxi_cnt, remove extreme hours from pre-cleaned (i)+(ii) data
CTE   8 • Data cleaning: (c) aggregating final clean data
CTE   9 • Data cleaning: checking results from cleaning (iii)
CTE 10 • Data analysis: typical duration of trips, according to clean data
CTE 11 • Data analysis: hourly count of trips (demand) + (estimated) Hourly count of possible trips (supply)

Step 1 • Business question

The search in this analysis is for a quick insight into the following questions:

« Are there any significant differences between favorite drivers' schedules and passengers' needs? »
« If so, when do the most critical imbalances occur? »

Why would the company care about that?

The duty of a markteplace is to make supply meet demand and vice versa. In some cases, « when » it happens may be secondary. Take, for example, the purchase of a totally specific item found only abroad. In the absence of another option, the customer (eventually you 🙋) may be willing to wait a long time (even a few months). In some cases, on the other hand, timimg may be non-negotiable. Imagine waiting for a car to go to work, for instance. Five minutes probably make a difference, and anything over fifteen minutes has a high chance of entering the deal-breaker zone.

This is where seasonality plays a role in the ride-hailing business. In order to make customers happy, it is not enough, in this case, to acquire a sufficient global number of drivers to meet passangers' requests, and to acquire a sufficient number of passangers to guarantee the drivers' income: one side's availability has to take place at the same time as the need for the other side (± no more than a few minutes).

Becoming data-informed about each side's natural preferences equips the business to intentionally, thoughtfully, introduce incentives and create advantages so that one side is attracted to the organic schedule of the other side.

↑

Step 2 • Data collection

While ideal data for this analysis is actual product data from a ride-hailing app, such data is kept private by companies. The supply-demand dynamics of a city-hall regulated taxi market is therefore taken as an approximation.

Taxi operations data is available in BigQuery public datasets:

BigQuery.Taxi.Data.mp4

You will see that searching for « taxi » on Google Cloud Marketplace for datasets returns two entries _{(as of 2023-04-17)}:

Chicago Taxi Trips
NYC TLC Trips

Clicking on « view dataset » opens it on BigQuery, where it is possible to see different structures for the collection of these datasets.

The dataset « NYC TLC Trips » has separate tables for each year and line of business

The dataset « Chicago Taxi Trips », in turn, has the whole data collected in a single table

Inspecting the taxi_trips table schema of « Chicago Taxi Trips » reveals that the fields needed for a quick study are available in it:

unique_key
taxi_id
trip_start_timestamp
trip_seconds (duration)

Considering agility, only the « Chicago Taxi Trips » dataset has been chosen for the study. In this way, all necessary that can be retrived fetching just FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`. Data from the « NYC TLC Trips » dataset may be added for validation and further elaboration in the future.

Data is retrived from BigQuery public data in the first Common Table Expression (CTE) of the query, as follows:

CTE 1 • Data collection: fetching data from the original table

-------------------------------------------------------------------------------------------------------------------------------
-- CTE 1 • Data collection: fetching data from the original table
-------------------------------------------------------------------------------------------------------------------------------
WITH raw_data AS (
  SELECT
    unique_key               -- REQUIRED  STRING      Unique identifier for the trip.
    , taxi_id                -- REQUIRED  STRING      A unique identifier for the taxi.
    , trip_start_timestamp   -- NULLABLE  TIMESTAMP   When the trip started, rounded to nearest 15 minutes.
    , trip_seconds           -- NULLABLE  INTEGER     Duration of the trip in seconds.
  FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
  WHERE trip_seconds > 0
)

Calling CTE 1:

SELECT COUNT(*) AS record_cnt FROM raw_data

The other CTEs are gradually introduced below, in their respective step, and the full sql code is available here.

↑

Step 3 • Data cleaning

Data cleaning is performed in CTE-2 to CTE-9, comprising the following tasks:

       (i)     Convert from UTC to local (Chicago) Time
       (ii)    Exclude outliers: duration (trip_seconds)
       (iii) Exclude outliers: hours with extreme loads/scarcity of supply or demand

CTE 2 • Data cleaning: (a) finding interquartile ranges (IQR) of trip_seconds

-------------------------------------------------------------------------------------------------------------------------------
-- CTE 2 • Data cleaning: (a) finding interquartile ranges (IQR) of trip_seconds
-------------------------------------------------------------------------------------------------------------------------------
, data_cleaning_trip_seconds_iqr AS (
  SELECT
      APPROX_QUANTILES(trip_seconds, 4)[OFFSET(1)] AS trip_seconds_iqr_lower
    , APPROX_QUANTILES(trip_seconds, 4)[OFFSET(2)] AS trip_seconds_med
    , APPROX_QUANTILES(trip_seconds, 4)[OFFSET(3)] AS trip_seconds_iqr_upper
    , APPROX_QUANTILES(trip_seconds, 4)[OFFSET(3)] - APPROX_QUANTILES(trip_seconds, 4)[OFFSET(1)] AS trip_seconds_iqr
    FROM raw_data
)

Calling CTE 2:

SELECT * FROM data_cleaning_trip_seconds_iqr

CTE 3 • Data cleaning: (i) converting from UTC to Chicago Time, (ii) Excluding outliers: duration (trip_seconds)

-------------------------------------------------------------------------------------------------------------------------------
-- CTE 3 • Data cleaning: (i) converting from UTC to Chicago Time, (ii) Excluding outliers: duration (trip_seconds)
-------------------------------------------------------------------------------------------------------------------------------
, data_cleaned_from_duration_outliers AS (
    SELECT
    unique_key
    , taxi_id
    , DATETIME(trip_start_timestamp, 'America/Chicago') AS trip_start_local_datetime
    , trip_seconds
  FROM raw_data, data_cleaning_trip_seconds_iqr
  WHERE (trip_seconds BETWEEN trip_seconds_iqr_lower - 1.5 * trip_seconds_iqr
                          AND trip_seconds_iqr_upper + 1.5 * trip_seconds_iqr)
)

Calling CTE 3:

SELECT COUNT(*) FROM data_cleaned_from_duration_outliers

CTE 4 • Data cleaning: checking results from cleaning (i) + (ii)

-------------------------------------------------------------------------------------------------------------------------------
-- CTE 4 • Data cleaning: checking results from cleaning (i) + (ii)
-------------------------------------------------------------------------------------------------------------------------------
, data_cleaning_duration_outliers_results AS (
  SELECT
  'raw_data' AS cte
  , COUNT(*) record_cnt
  , APPROX_QUANTILES(trip_seconds, 4)[OFFSET(2)] med_trip_seconds
  , AVG(trip_seconds) avg_trip_seconds
  , MIN(trip_seconds) min_trip_seconds
  , MAX(trip_seconds) max_trip_seconds 
  , APPROX_QUANTILES(trip_seconds, 4)[OFFSET(1)] q1_trip_seconds
  , APPROX_QUANTILES(trip_seconds, 4)[OFFSET(3)] q3_trip_seconds
  , ( APPROX_QUANTILES(trip_seconds, 4)[OFFSET(3)] - APPROX_QUANTILES(trip_seconds, 4)[OFFSET(1)] ) iqr_trip_seconds
  FROM raw_data
  UNION ALL
  SELECT
  'data_cleaned_from_duration_outliers' AS cte
  , COUNT(*) record_cnt
  , APPROX_QUANTILES(trip_seconds, 4)[OFFSET(2)] med_trip_seconds
  , AVG(trip_seconds) avg_trip_seconds
  , MIN(trip_seconds) min_trip_seconds
  , MAX(trip_seconds) max_trip_seconds 
  , APPROX_QUANTILES(trip_seconds, 4)[OFFSET(1)] q1_trip_seconds
  , APPROX_QUANTILES(trip_seconds, 4)[OFFSET(3)] q3_trip_seconds
  , ( APPROX_QUANTILES(trip_seconds, 4)[OFFSET(3)] - APPROX_QUANTILES(trip_seconds, 4)[OFFSET(1)] ) iqr_trip_seconds
  FROM data_cleaned_from_duration_outliers
)

Calling CTE 4:

SELECT * FROM data_cleaning_duration_outliers_results

CTE 5 • Data cleaning: (b) aggregating partially clean data, preparing to exclude extreme hours (esp. peaks)

-------------------------------------------------------------------------------------------------------------------------------
-- CTE 5 • Data cleaning: (b) aggregating partially clean data, preparing to exclude extreme hours (esp. peaks)
-------------------------------------------------------------------------------------------------------------------------------
, data_cleaning_agg AS (
  SELECT
      DATETIME_TRUNC(trip_start_local_datetime, HOUR) AS trip_start_local_datehour
    , COUNT(DISTINCT unique_key) AS trip_cnt
    , COUNT(DISTINCT taxi_id) AS taxi_cnt
    FROM data_cleaned_from_duration_outliers
    GROUP BY trip_start_local_datehour
)

Calling CTE 5:

SELECT COUNT(*) FROM data_cleaning_agg

CTE 6 • Data cleaning: (c) finding interquartile ranges (IQR) of trip_cnt, taxi_cnt

-------------------------------------------------------------------------------------------------------------------------------
-- CTE 6 • Data cleaning: (c) finding interquartile ranges (IQR) of trip_cnt, taxi_cnt
-------------------------------------------------------------------------------------------------------------------------------
, data_cleaning_trips_taxis_iqr AS (
  SELECT
      APPROX_QUANTILES(trip_cnt, 4)[OFFSET(1)] AS trip_cnt_iqr_lower
    , APPROX_QUANTILES(trip_cnt, 4)[OFFSET(2)] AS trip_cnt_med
    , APPROX_QUANTILES(trip_cnt, 4)[OFFSET(3)] AS trip_cnt_iqr_upper
    , APPROX_QUANTILES(trip_cnt, 4)[OFFSET(3)] - APPROX_QUANTILES(trip_cnt, 4)[OFFSET(1)] AS trip_cnt_iqr
    , APPROX_QUANTILES(taxi_cnt, 4)[OFFSET(1)] AS taxi_cnt_iqr_lower
    , APPROX_QUANTILES(taxi_cnt, 4)[OFFSET(2)] AS taxi_cnt_med
    , APPROX_QUANTILES(taxi_cnt, 4)[OFFSET(3)] AS taxi_cnt_iqr_upper
    , APPROX_QUANTILES(taxi_cnt, 4)[OFFSET(3)] - APPROX_QUANTILES(taxi_cnt, 4)[OFFSET(1)] AS taxi_cnt_iqr
    FROM data_cleaning_agg
)

Calling CTE 6:

SELECT * FROM data_cleaning_trips_taxis_iqr

CTE 7 • Data cleaning: (iii) based on trip_cnt, taxi_cnt, remove extreme hours from pre-cleaned (i)+(ii) data

-------------------------------------------------------------------------------------------------------------------------------
-- CTE 7 • Data cleaning: (iii) based on trip_cnt, taxi_cnt, remove extreme hours from pre-cleaned (i)+(ii) data
-------------------------------------------------------------------------------------------------------------------------------
, clean_data AS (
    SELECT
    trip_start_local_datetime
    , unique_key
    , taxi_id
    , trip_seconds
  FROM data_cleaned_from_duration_outliers, data_cleaning_trips_taxis_iqr
  JOIN data_cleaning_agg
    ON data_cleaning_agg.trip_start_local_datehour = DATETIME_TRUNC(trip_start_local_datetime, HOUR)
  WHERE (trip_cnt BETWEEN trip_cnt_iqr_lower - 1.5 * trip_cnt_iqr
                      AND trip_cnt_iqr_upper + 1.5 * trip_cnt_iqr)
    AND (taxi_cnt BETWEEN taxi_cnt_iqr_lower - 1.5 * taxi_cnt_iqr
                      AND taxi_cnt_iqr_upper + 1.5 * taxi_cnt_iqr)
)

Calling CTE 7:

SELECT COUNT(*) FROM clean_data

CTE 8 • Data cleaning: (c) aggregating final clean data

-------------------------------------------------------------------------------------------------------------------------------
-- CTE 8 • Data cleaning: (c) aggregating final clean data
-------------------------------------------------------------------------------------------------------------------------------
, data_cleaning_agg_clean_data AS (
  SELECT
      DATETIME_TRUNC(trip_start_local_datetime, HOUR) AS trip_start_local_datehour
    , COUNT(DISTINCT unique_key) AS trip_cnt
    , COUNT(DISTINCT taxi_id) AS taxi_cnt
    FROM clean_data
    GROUP BY trip_start_local_datehour
)

Calling CTE 8:

SELECT COUNT(*) FROM data_cleaning_agg_clean_data

CTE 9 • Data cleaning: checking results from cleaning (iii)

-------------------------------------------------------------------------------------------------------------------------------
-- CTE 9 • Data cleaning: checking results from cleaning (iii)
-------------------------------------------------------------------------------------------------------------------------------
, data_cleaning_results AS (
  SELECT
  'data_cleaning_agg' AS cte
  , COUNT(*) record_cnt
  , APPROX_QUANTILES(trip_cnt, 4)[OFFSET(2)] med_trip_cnt
  , AVG(trip_cnt) avg_trip_cnt
  , MIN(trip_cnt) min_trip_cnt
  , MAX(trip_cnt) max_trip_cnt
  , APPROX_QUANTILES(trip_cnt, 4)[OFFSET(1)] q1_trip_cnt
  , APPROX_QUANTILES(trip_cnt, 4)[OFFSET(3)] q3_trip_cnt
  , ( APPROX_QUANTILES(trip_cnt, 4)[OFFSET(3)] - APPROX_QUANTILES(trip_cnt, 4)[OFFSET(1)] ) iqr_trip_cnt
  , APPROX_QUANTILES(taxi_cnt, 4)[OFFSET(2)] med_taxi_cnt
  , AVG(taxi_cnt) avg_taxi_cnt
  , MIN(taxi_cnt) min_taxi_cnt
  , MAX(taxi_cnt) max_taxi_cnt
  , APPROX_QUANTILES(taxi_cnt, 4)[OFFSET(1)] q1_taxi_cnt
  , APPROX_QUANTILES(taxi_cnt, 4)[OFFSET(3)] q3_taxi_cnt
  , ( APPROX_QUANTILES(taxi_cnt, 4)[OFFSET(3)] - APPROX_QUANTILES(taxi_cnt, 4)[OFFSET(1)] ) iqr_taxi_cnt
  FROM data_cleaning_agg
  UNION ALL
  SELECT
  'data_cleaning_agg_clean_data' AS cte
  , COUNT(*) record_cnt
  , APPROX_QUANTILES(trip_cnt, 4)[OFFSET(2)] med_trip_cnt
  , AVG(trip_cnt) avg_trip_cnt
  , MIN(trip_cnt) min_trip_cnt
  , MAX(trip_cnt) max_trip_cnt
  , APPROX_QUANTILES(trip_cnt, 4)[OFFSET(1)] q1_trip_cnt
  , APPROX_QUANTILES(trip_cnt, 4)[OFFSET(3)] q3_trip_cnt
  , ( APPROX_QUANTILES(trip_cnt, 4)[OFFSET(3)] - APPROX_QUANTILES(trip_cnt, 4)[OFFSET(1)] ) iqr_trip_cnt
  , APPROX_QUANTILES(taxi_cnt, 4)[OFFSET(2)] med_taxi_cnt
  , AVG(taxi_cnt) avg_taxi_cnt
  , MIN(taxi_cnt) min_taxi_cnt
  , MAX(taxi_cnt) max_taxi_cnt
  , APPROX_QUANTILES(taxi_cnt, 4)[OFFSET(1)] q1_taxi_cnt
  , APPROX_QUANTILES(taxi_cnt, 4)[OFFSET(3)] q3_taxi_cnt
  , ( APPROX_QUANTILES(taxi_cnt, 4)[OFFSET(3)] - APPROX_QUANTILES(taxi_cnt, 4)[OFFSET(1)] ) iqr_taxi_cnt
  FROM data_cleaning_agg_clean_data
)

Calling CTE 9:

SELECT * FROM data_cleaning_results

↑

Step 4 • Analysis

In the step of actual analysis, the goal is to generate a list of the hourly levels of supply and demand, with their corresponding time, and then pivot it into days of week by hours of the day, for a final picture of a typical weekly schedule of each side.

The hourly demand for this list is directly obtained from the clean dataset, by simply summing up the hourly count of the trips' unique_key.

Hourly values for the supply side, on the other hand, have to be estimated.

With the relevant Product Data available, the hourly supply could be defined as availability_of_drivers_in_minutes (total sum) multiplied by the typical_trip_minutes (duration). As the total time that drivers are in service is not available from our « Chicago Taxi Trips » dataset, an estimation for it is performed based on the trips taken.

The supply numbers that can be directly extracted from the dataset are unique taxi counts. Considering that a single taxi may be able to take multiple trips in an hour, a model is applied to estimate how many trips each taxi was expected to be able to perform. Taking into consideration a factor of 2/3 for the drivers' idle time, an estimated availability_of_drivers_in_minutes is obtained directly from trips taken, dividing this number by the drivers_idle_time.

⚠️ It is important to notice, though, that using the number of trips actually performed to estimate supply is expected to flatten out the difference between supply and demand that shall, in fact, occur. So, any potential gains found out in this study are expected to be observed at a higher intensity if product data is available foor the analysis. The framework of the analysis is independent of it, and fully applicable in a business scenario.

CTEs 10 e 11 combined perform the described approach, fetching an hourly list of 89k+ records from Dec 2012 to Apr 2023:

CTE 10 • Data analysis: typical duration of trips, according to clean data

-------------------------------------------------------------------------------------------------------------------------------
-- CTE 10 • Data analysis: typical duration of trips, according to clean data
-------------------------------------------------------------------------------------------------------------------------------
, typical_trip_seconds AS 
  (SELECT APPROX_QUANTILES(trip_seconds, 4)[OFFSET(1)] AS median_trip_seconds FROM clean_data)
)

CTE 11 • Data analysis: hourly count of trips (demand) + (estimated) Hourly count of possible trips (supply)

-------------------------------------------------------------------------------------------------------------------------------
-- CTE 11 • Data analysis: hourly count of trips (demand) + (estimated) Hourly count of possible trips (supply)
-------------------------------------------------------------------------------------------------------------------------------
-- Model
-- hourly_trips_supply: total #trips in 1hr that could have happened, based on drivers' availability and typical trip duration
-- hourly_trips_supply =
--   = estimated_number_of_taxis_available_in_the_hour × potential_number_of_trips_per_hour_per_driver
--
-- estimated_number_of_taxis_available_in_the_hour = 
--   = number_of_taxis_w_trips_in_the_hour ÷ drivers_typical_idle_time
-- 
-- potential_number_of_trips_per_hour_per_driver =
--   = 60 ÷ typical_trip_minutes
-------------------------------------------------------------------------------------------------------------------------------
-- Assumption
-- drivers_idle_time = 2/3
-- Ref.: https://www.uberpeople.net/threads/what-is-your-idle-time-and-idle-running-in-km-as-uber-driver.146607/
-------------------------------------------------------------------------------------------------------------------------------
-- Note on impact of Model & Assumption on findings
-- These are only applied in getting realistic absolute numbers, not impacting findings of the analysis [based on proportion]
-------------------------------------------------------------------------------------------------------------------------------
, hourly_supply_demand AS (
  SELECT
      DATETIME_TRUNC(trip_start_local_datetime, HOUR) AS trip_start_local_datehour
    , dayOfWeek(trip_start_local_datetime) AS trip_start_local_dayofweek
    , EXTRACT(HOUR FROM trip_start_local_datetime) AS trip_start_local_hour
    , (median_trip_seconds / 60.0) AS typical_trip_minutes
    , CAST(FLOOR(60 / (median_trip_seconds / 60.0)) AS INT64) AS potential_number_of_trips_per_hour_per_driver
    , COUNT(DISTINCT unique_key) AS hourly_trips_demand
    , CAST(FLOOR((COUNT(DISTINCT taxi_id)/(2/3)) * FLOOR(60/(AVG(median_trip_seconds)/60.0))) AS INT64) AS hourly_trips_supply
  FROM clean_data, typical_trip_seconds
  GROUP BY 1, 2, 3, 4, 5
)

Calling CTE 11:

SELECT * FROM hourly_supply_demand ORDER BY 1

The generated list is in a convenient format to be pivoted on a spreadsheet. Making use of the convenient feature of saving directly from the cloud console, the results were exported to Google Sheets:

On Google Sheets, generating a pivot table is a fast way to arrange the data in the wanted format: hours of day x days of week.

For supply:

For demand:

Showing the tables for supply and demand side-by-side enables the comparison of the typical schedules of passengers and drivers:

⚠️ As mentioned above, the difference between the organic schedules of supply and demand have been toned down here by the fact that demand has been included in the estimation of demand. It is though possible, from the chosen data and the designed framework, to detect opportunities of increasing the synchonicity between schedules of the different sides of the marketplace.

The darker red area close to the right bottom of the demand side indicates a possible trend of passengers' behavior on weekend nights not being organically met by the supply side.

Similarly, an added availability of drivers in the early mornings of week days is found to have no correspondent proportional increase from the demand side.

From this point, it is possible to further summarize data in a single piece, converting it into a visualization that emphasizes the most critical points, when the highest imbalance between supply and demand occurs. For convenience, the trigger to sign action may be defined as entry parameters, as follows:

Look at seasonality of Supply and seasonality of Demand
Calculate typical % ditributions
Scale % distribution of volumes to a [0,1] interval
Calculate the difference between the scaled Supply and Demand distributions
Highlight action items

1. Look at seasonality of Supply and seasonality of Demand

2. Calculate typical % ditributions

3. Scale % distribution of volumes to a [0,1] interval

4. Calculate the difference between the scaled Supply and Demand distributions

5. Highlight action items

↑

Step 5 • Synthesis

Being in charge of the operation of a marketplace is a challenging task, loaded with even greater levels of uncertainty than those natural to any business, as, in this case, in addition to the lack of control over the demand side, there is also no control over the supply side.

The study presented in this repository introduces a framework for applying the concept of seasonality in navigating and managing such unusual levels of uncertainty. Seasonality arises when repeated events occur following a defined temporal pattern, with a fixed frequency, drawing foreseeable cycles and, thus, introducing predictability into a system.

By carrying out seasonality studies, we can separate into conceptual 'buckets' of knowledge what we can anticipate about a system from its erractic part. When it comes to consumer markets, seasonality includes some pretty robust and consistently stable patterns of behavior, especially those related to nature's cycles. For thousands of years, we humankind have been mostly active during the day, resting during the night.

This stability in end-user seasonal patterns comes in handy for Strategy teams in the quest to design products and services to improve lives.

The figure below illustrates the application of seasonality studies to the ride-hailing business (approximate here by the Chicago Taxi market, for which public data is available). On the left, we see the typical weekly schedule of drivers (supply), on the right, the same for passengers (demand). In each image, the numbers on the vertical axis are the hours of the day, with the days of week in the columns. Color coding represents relative intensity, with green for low, white for medium, and red for high. So, the image on the left side, for instance, shows that the peak supply time, when there is the highest number of drivers on the streets, is from 1:00 pm to 2:00 pm on Fridays.

Combining both sides — supply and demand — into a single net surplus figure, and attaching to it a table summarizing estimated results of planned operations scenarios, provides a practical Action Board, such as the one below:

↑

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data-analysis.md

data-analysis.md

Seasonality as a growth lever for two-sided marketplaces

Contents

Resources

Query structure

Step 1 • Business question

Step 2 • Data collection

CTE 1 • Data collection: fetching data from the original table

Step 3 • Data cleaning

CTE 2 • Data cleaning: (a) finding interquartile ranges (IQR) of trip_seconds

CTE 3 • Data cleaning: (i) converting from UTC to Chicago Time, (ii) Excluding outliers: duration (trip_seconds)

CTE 4 • Data cleaning: checking results from cleaning (i) + (ii)

CTE 5 • Data cleaning: (b) aggregating partially clean data, preparing to exclude extreme hours (esp. peaks)

CTE 6 • Data cleaning: (c) finding interquartile ranges (IQR) of trip_cnt, taxi_cnt

CTE 7 • Data cleaning: (iii) based on trip_cnt, taxi_cnt, remove extreme hours from pre-cleaned (i)+(ii) data

CTE 8 • Data cleaning: (c) aggregating final clean data

CTE 9 • Data cleaning: checking results from cleaning (iii)

Step 4 • Analysis

CTE 10 • Data analysis: typical duration of trips, according to clean data

CTE 11 • Data analysis: hourly count of trips (demand) + (estimated) Hourly count of possible trips (supply)

1. Look at seasonality of Supply and seasonality of Demand

2. Calculate typical % ditributions

3. Scale % distribution of volumes to a [0,1] interval

4. Calculate the difference between the scaled Supply and Demand distributions

5. Highlight action items

Step 5 • Synthesis

Files

data-analysis.md

Latest commit

History

data-analysis.md

File metadata and controls

Seasonality as a growth lever for two-sided marketplaces

Contents

Resources

Query structure

Step 1 • Business question

Step 2 • Data collection

CTE 1 • Data collection: fetching data from the original table

Step 3 • Data cleaning

CTE 2 • Data cleaning: (a) finding interquartile ranges (IQR) of trip_seconds

CTE 3 • Data cleaning: (i) converting from UTC to Chicago Time, (ii) Excluding outliers: duration (trip_seconds)

CTE 4 • Data cleaning: checking results from cleaning (i) + (ii)

CTE 5 • Data cleaning: (b) aggregating partially clean data, preparing to exclude extreme hours (esp. peaks)

CTE 6 • Data cleaning: (c) finding interquartile ranges (IQR) of trip_cnt, taxi_cnt

CTE 7 • Data cleaning: (iii) based on trip_cnt, taxi_cnt, remove extreme hours from pre-cleaned (i)+(ii) data

CTE 8 • Data cleaning: (c) aggregating final clean data

CTE 9 • Data cleaning: checking results from cleaning (iii)

Step 4 • Analysis

CTE 10 • Data analysis: typical duration of trips, according to clean data

CTE 11 • Data analysis: hourly count of trips (demand) + (estimated) Hourly count of possible trips (supply)

1. Look at seasonality of Supply and seasonality of Demand

2. Calculate typical % ditributions

3. Scale % distribution of volumes to a [0,1] interval

4. Calculate the difference between the scaled Supply and Demand distributions

5. Highlight action items

Step 5 • Synthesis