# Bronze Data Exploration

Note: This notebook uses Azure Synapse Analytics with PySpark

The purpose of exploring the raw (bronze) data is to understand how to filter it when moving from the Bronze layer to the Silver layer.
Some simple data transformations may be required to help assess which records to filter out.
Note that all the data will be unioned together into one DataFrame.
This makes it easier to evaluate filtering rules.
A separate notebook / script will be developed to process individual files so that it can be used in an incremental ETL process.
The data dictionary for the yellow taxi cab data can be found [here](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf).

In [1]:
import pyspark.sql.functions as F
from pyspark.ml import Transformer, Pipeline

## Load the Data

The raw bronze data was concatenated into one DataFrame (loaded below as `union_df`).
Please review the Concat Bronze notebook to see how this was done.

In [None]:
%%pyspark
input_file_path = '<ADD YOUR FILE PATH HERE>'
union_df = spark.read.load(input_file_path, format='parquet')

## Simple Data Transformations

In [3]:
class ExtractDurationMin(Transformer):
  """
  Adds a column with the trip duration in minutes (dropoff minus pickup).
  """
  def __init__(self, pu_col="tpep_pickup_datetime", do_col="tpep_dropoff_datetime", output_col="trip_duration_min"):
    super().__init__()  # initialize the base Transformer (uid, param maps)
    self.pu_col = pu_col
    self.do_col = do_col
    self.output_col = output_col

  def _transform(self, df):
    # Cast both timestamps to epoch seconds, subtract, and convert to minutes
    pickup_sec = F.to_timestamp(F.col(self.pu_col)).cast("long")
    dropoff_sec = F.to_timestamp(F.col(self.do_col)).cast("long")
    return df.withColumn(self.output_col, (dropoff_sec - pickup_sec) / 60.0)

In [4]:
xform_pipe = Pipeline(stages=[ExtractDurationMin()])
xform_pipe_model = xform_pipe.fit(union_df)
union_df = xform_pipe_model.transform(union_df)
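
As a rough illustration of what `ExtractDurationMin` computes, the same arithmetic can be sketched in plain Python. The timestamps are made up for illustration, and `trip_duration_min` is a hypothetical helper that is not part of the notebook. The Spark version casts to whole epoch seconds first, so it drops sub-second precision.

```python
from datetime import datetime

def trip_duration_min(pickup: datetime, dropoff: datetime) -> float:
    """Duration in minutes, computed as (dropoff - pickup) in seconds / 60."""
    return (dropoff - pickup).total_seconds() / 60.0

# Hypothetical trips (not taken from the dataset)
good = trip_duration_min(datetime(2023, 1, 1, 0, 32, 10), datetime(2023, 1, 1, 0, 40, 40))
bad = trip_duration_min(datetime(2023, 1, 1, 0, 40, 40), datetime(2023, 1, 1, 0, 32, 10))

print(good)  # 8.5
print(bad)   # -8.5
```

A dropoff earlier than the pickup yields a negative duration, which is one reason the basic filtering step requires the trip duration to be greater than zero.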


In [10]:
union_df.count()

131119556

## Columns To Drop

Some columns can be dropped up front, as they do not contain valuable information.
Reducing the number of columns can help with efficiency.
They are:

1. `store_and_fwd_flag` -- this is an audit-based field and contains no useful information

In [5]:
union_df = union_df.drop("store_and_fwd_flag")

## Basic Filtering

There are a few basic filtering methods we can use to help weed out bad data.
Bad data can skew the statistical summaries and analysis that inform the filtering rules.
Removing it first gives a clearer picture for data exploration and a better sense of how to filter the numerical data.

In [6]:
# Keep only payment types 1 (credit card) and 2 (cash).
# The other codes (no charge, dispute, unknown, voided trip) are treated as invalid here.
union_df = union_df.where(F.col("payment_type").isin([1,2]))

In [7]:
# Remove negative values from fields that cannot be negative or must be greater than zero
union_df = union_df.where(
    (F.col("fare_amount") > 0)
    & (F.col("passenger_count") > 0)
    & (F.col("extra") >= 0)
    & (F.col("mta_tax") >= 0)
    & (F.col("improvement_surcharge") >= 0)
    & (F.col("tip_amount") >= 0)
    & (F.col("tolls_amount") >= 0)
    & (F.col("total_amount") > 0)
    & (F.col("congestion_surcharge") >= 0)
    & (F.col("airport_fee") >= 0)
    & (F.col("trip_duration_sec") > 0)
    & (F.col("trip_distance") > 0)    
    )

## Caching The Data

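Important step: cache the data!
Since we will be scanning the data multiple times, it's best to cache it for efficiency.
Note that `cache()` is lazy: the data is only materialized by the first action that runs afterwards, such as the `count()` that follows.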

In [8]:
union_df.cache()

DataFrame[VendorID: bigint, tpep_pickup_datetime: timestamp_ntz, tpep_dropoff_datetime: timestamp_ntz, passenger_count: double, trip_distance: double, RatecodeID: double, PULocationID: bigint, DOLocationID: bigint, payment_type: bigint, fare_amount: double, extra: double, mta_tax: double, tip_amount: double, tolls_amount: double, improvement_surcharge: double, total_amount: double, congestion_surcharge: double, airport_fee: double, pu_year_month: string, trip_duration_min: double, trip_duration_sec: bigint, pu_date: date]

In [11]:
union_df.count()

131119556

## Data Exploration

In [12]:
union_df.columns

['VendorID',
 'tpep_pickup_datetime',
 'tpep_dropoff_datetime',
 'passenger_count',
 'trip_distance',
 'RatecodeID',
 'PULocationID',
 'DOLocationID',
 'payment_type',
 'fare_amount',
 'extra',
 'mta_tax',
 'tip_amount',
 'tolls_amount',
 'improvement_surcharge',
 'total_amount',
 'congestion_surcharge',
 'airport_fee',
 'pu_year_month',
 'trip_duration_min',
 'trip_duration_sec',
 'pu_date']

### Numerical Data
Let's look at some of the columns of numerical data first.

In [13]:
num_col = ["passenger_count", "trip_distance", "fare_amount", "extra", "mta_tax", "improvement_surcharge", "tip_amount", "tolls_amount", "total_amount", "congestion_surcharge", "airport_fee", "trip_duration_sec", "trip_duration_min"]

In [14]:
union_df.select(num_col).summary().show()

+-------+------------------+-----------------+------------------+------------------+-------------------+---------------------+------------------+------------------+-----------------+--------------------+-------------------+------------------+--------------------+
|summary|   passenger_count|    trip_distance|       fare_amount|             extra|            mta_tax|improvement_surcharge|        tip_amount|      tolls_amount|     total_amount|congestion_surcharge|        airport_fee| trip_duration_sec|   trip_duration_min|
+-------+------------------+-----------------+------------------+------------------+-------------------+---------------------+------------------+------------------+-----------------+--------------------+-------------------+------------------+--------------------+
|  count|         131119556|        131119556|         131119556|         131119556|          131119556|            131119556|         131119556|         131119556|        131119556|           131119556|     

#### Passenger Count

In [15]:
union_df.groupBy("passenger_count").count().sort(F.asc("passenger_count")).show()

+---------------+--------+
|passenger_count|   count|
+---------------+--------+
|            1.0|99852540|
|            2.0|20155266|
|            3.0| 5083065|
|            4.0| 2587089|
|            5.0| 2082866|
|            6.0| 1358214|
|            7.0|     176|
|            8.0|     262|
|            9.0|      76|
|           96.0|       1|
|          112.0|       1|
+---------------+--------+



Passenger count should be between 1 and 6.

#### Trip Distance

The 75th percentile of trip distance is 3.4 miles.
We can see that the maximum is almost 400k miles, which is totally bogus.
Let's find some additional percentile information.

In [16]:
col = "trip_distance"
union_df.select(
    F.percentile_approx(col, 0.90).alias("pcnt_90"),
    F.percentile_approx(col, 0.95).alias("pcnt_95"),
    F.percentile_approx(col, 0.975).alias("pcnt_97.5"),
    F.percentile_approx(col, 0.99).alias("pcnt_99")
).show()

+-------+-------+---------+-------+
|pcnt_90|pcnt_95|pcnt_97.5|pcnt_99|
+-------+-------+---------+-------+
|   8.87|  14.76|    18.13|  20.16|
+-------+-------+---------+-------+
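
The same percentile query is repeated for several columns in this notebook. One way to avoid the repetition is a small helper around `DataFrame.approxQuantile`; this is only a sketch, `percentile_summary` and its label format are hypothetical names, and the approximate values it returns may differ slightly from those of `F.percentile_approx`.

```python
def percentile_summary(df, col, probs=(0.90, 0.95, 0.975, 0.99), rel_err=0.001):
    """Return {label: value} for the requested percentiles of one column.

    `df` is expected to be a Spark DataFrame; `approxQuantile` takes the
    column name, a list of probabilities, and a relative error.
    """
    values = df.approxQuantile(col, list(probs), rel_err)
    # Labels such as "p90" or "p97.5" are just a naming choice for this sketch
    return {f"p{p * 100:g}": v for p, v in zip(probs, values)}
```

In this notebook it could be called as `percentile_summary(union_df, "trip_distance")`, passing a longer `probs` tuple for the columns where the 99.9th or 99.99th percentile is of interest.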



Let's also examine the upper limit cutoff for outliers using the IQR method.

In [20]:
iqr = 3.4 - 1.1 # Q3 and Q1 values
upper_limit = iqr * 1.5 + 3.4
print(upper_limit)

6.85


Interesting to note that the upper limit cutoff from the IQR method is 6.85 miles, which is below the 90th percentile of 8.87 miles.
While longer trips appear less frequent, we can filter some data, do some rounding, and get some discrete counts to get a good feel for trip counts.
To be cautious, let's set a cutoff of 50 miles for trip distances since the 99th percentile was 20.16 miles.
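
For reference, the IQR calculation above can be wrapped in a small function that returns both fences. This is a minimal sketch; `iqr_fences` is a hypothetical helper name, and the conventional multiplier of 1.5 is used as the default.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Q1 and Q3 of trip_distance, as used in the cell above
lower, upper = iqr_fences(1.1, 3.4)
print(round(lower, 2), round(upper, 2))  # -2.35 6.85
```

The lower fence is negative, so it adds nothing beyond the existing requirement that trip distance be greater than zero.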

In [18]:
total_count = union_df.count()
(
    union_df
    .where(F.col("trip_distance") < 50)
    .select(F.round(F.col("trip_distance")).alias("trip_distance_rnd"))
    .groupBy("trip_distance_rnd")
    .count()
    .sort(F.asc("trip_distance_rnd"))
    .withColumn("pcnt_total", 100 * F.col("count") / total_count)
    .show(51)
)

+-----------------+--------+--------------------+
|trip_distance_rnd|   count|          pcnt_total|
+-----------------+--------+--------------------+
|              0.0| 4527314|   3.452813705378929|
|              1.0|47614672|  36.313936267447396|
|              2.0|32288034|  24.624880517441653|
|              3.0|14904499|   11.36710606311083|
|              4.0| 7349345|    5.60507160350665|
|              5.0| 4166014|  3.1772636569940795|
|              6.0| 2693454|    2.05419701085626|
|              7.0| 1977458|  1.5081335388292498|
|              8.0| 1738915|  1.3262056805622495|
|              9.0| 2026530|  1.5455589248639616|
|             10.0| 1838442|  1.4021112152027115|
|             11.0| 1410326|  1.0756030931038234|
|             12.0|  953343|  0.7270791856555707|
|             13.0|  535110|  0.4081084594276692|
|             14.0|  420667| 0.32082704733991013|
|             15.0|  445122|  0.3394779646752312|
|             16.0|  620828| 0.47348238427530975|


A good cutoff would be 30 miles.
Any trips above 30 miles would be less than 0.01% of total rides.
Remember, filtering data from the Bronze layer to the Silver layer is not meant to be project-specific filtering, but general filtering.
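
To check a candidate cutoff against binned counts like the ones above, the share of rows beyond it can be computed directly. The sketch below uses made-up counts, not the dataset, and `pct_above` is a hypothetical helper; in practice the pairs would come from collecting the grouped DataFrame.

```python
def pct_above(binned_counts, cutoff, total):
    """Percent of `total` rows whose bin value is greater than `cutoff`."""
    above = sum(count for value, count in binned_counts if value > cutoff)
    return 100.0 * above / total

# Hypothetical (rounded distance, count) pairs, not taken from the dataset
toy_bins = [(1.0, 600), (2.0, 300), (10.0, 90), (31.0, 7), (45.0, 3)]
toy_total = sum(count for _, count in toy_bins)

print(pct_above(toy_bins, 30, toy_total))  # 1.0
```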

#### Fare Amount

In [21]:
col = "fare_amount"
union_df.select(
    F.percentile_approx(col, 0.25).alias("pcnt_25"),
    F.percentile_approx(col, 0.50).alias("pcnt_50"),
    F.percentile_approx(col, 0.75).alias("pcnt_75"),
    F.percentile_approx(col, 0.90).alias("pcnt_90"),
    F.percentile_approx(col, 0.95).alias("pcnt_95"),
    F.percentile_approx(col, 0.975).alias("pcnt_97.5"),
    F.percentile_approx(col, 0.99).alias("pcnt_99"),
    F.percentile_approx(col, 0.999).alias("pcnt_999")
).show()

+-------+-------+-------+-------+-------+---------+-------+--------+
|pcnt_25|pcnt_50|pcnt_75|pcnt_90|pcnt_95|pcnt_97.5|pcnt_99|pcnt_999|
+-------+-------+-------+-------+-------+---------+-------+--------+
|    8.0|   11.5|   19.1|   35.5|   52.0|     70.0|   70.0|   127.5|
+-------+-------+-------+-------+-------+---------+-------+--------+



The 99th percentile is $70.00 and the 99.9th percentile is $127.50.
Let's use a cutoff of $70 for fare amount.

#### Extra

In [26]:
union_df.select("extra").groupBy("extra").count().sort(F.desc(F.col("count"))).withColumn("pcnt_total", 100*F.col("count")/union_df.count()).show()

+-----+--------+-------------------+
|extra|   count|         pcnt_total|
+-----+--------+-------------------+
|  0.0|52136826|  39.76281463308189|
|  2.5|25845612| 19.711485295145447|
|  1.0|22512382|  17.16935496639418|
|  0.5|11273392|  8.597796045007962|
|  3.5| 6854093|  5.227361355616549|
|  5.0| 5044984|  3.847621326600587|
|  3.0| 3692206| 2.8159079489256356|
|  6.0|  642143| 0.4897385406033559|
|  7.5|  625826| 0.4772941726556792|
| 3.75|  425518|0.32452672429732754|
| 4.25|  295806|0.22560021481463832|
|  4.5|  277876|0.21192567186545386|
| 9.25|  266317| 0.2031100532402657|
| 1.75|  265614|0.20257390133322295|
| 1.25|  262486|0.20018829227884208|
| 6.75|   91356|0.06967381738235905|
| 2.25|   88384|0.06740718371560074|
| 2.75|   79327|0.06049974726882083|
|10.25|   73022|0.05569115868574174|
| 7.75|   56222|0.04287842463407975|
+-----+--------+-------------------+
only showing top 20 rows



In [30]:
union_df.select("extra").groupBy("extra").count().sort(F.desc(F.col("extra"))).withColumn("pcnt_total", 100*F.col("count")/union_df.count()).show()

+-----+-----+--------------------+
|extra|count|          pcnt_total|
+-----+-----+--------------------+
|88.81|    1| 7.62662741170356E-7|
|87.56|    2|1.525325482340712E-6|
| 65.0|    1| 7.62662741170356E-7|
|41.07|    1| 7.62662741170356E-7|
| 36.1|    1| 7.62662741170356E-7|
| 33.5|    1| 7.62662741170356E-7|
| 30.5|    1| 7.62662741170356E-7|
| 25.5|    1| 7.62662741170356E-7|
| 24.5|    1| 7.62662741170356E-7|
| 23.5|    1| 7.62662741170356E-7|
| 20.0|    1| 7.62662741170356E-7|
|19.95|    1| 7.62662741170356E-7|
|16.25|    1| 7.62662741170356E-7|
|16.19|    1| 7.62662741170356E-7|
| 16.0|    1| 7.62662741170356E-7|
| 15.6|    1| 7.62662741170356E-7|
|15.25|    1| 7.62662741170356E-7|
| 14.8|    1| 7.62662741170356E-7|
|14.44|    1| 7.62662741170356E-7|
|14.35|    1| 7.62662741170356E-7|
+-----+-----+--------------------+
only showing top 20 rows



In [29]:
col = "extra"
union_df.select(
    F.percentile_approx(col, 0.25).alias("pcnt_25"),
    F.percentile_approx(col, 0.50).alias("pcnt_50"),
    F.percentile_approx(col, 0.75).alias("pcnt_75"),
    F.percentile_approx(col, 0.90).alias("pcnt_90"),
    F.percentile_approx(col, 0.95).alias("pcnt_95"),
    F.percentile_approx(col, 0.975).alias("pcnt_97.5"),
    F.percentile_approx(col, 0.99).alias("pcnt_99"),
    F.percentile_approx(col, 0.999).alias("pcnt_999"),
    F.percentile_approx(col, 0.9999).alias("pcnt_9999")
).show()

+-------+-------+-------+-------+-------+---------+-------+--------+---------+
|pcnt_25|pcnt_50|pcnt_75|pcnt_90|pcnt_95|pcnt_97.5|pcnt_99|pcnt_999|pcnt_9999|
+-------+-------+-------+-------+-------+---------+-------+--------+---------+
|    0.0|    1.0|    2.5|    3.5|    5.0|      5.0|   6.75|   10.25|    88.81|
+-------+-------+-------+-------+-------+---------+-------+--------+---------+



A good cutoff value would be $15: the 99.9th percentile is $10.25, and the values above about $14 each appear only once or twice.

#### MTA Tax

In [24]:
union_df.select("mta_tax").groupBy("mta_tax").count().sort(F.desc(F.col("count"))).withColumn("pcnt_total", 100*F.col("count")/union_df.count()).show()

+-------+---------+--------------------+
|mta_tax|    count|          pcnt_total|
+-------+---------+--------------------+
|    0.5|130243243|   99.33166872529678|
|    0.0|   866664|  0.6609723419136654|
|    0.8|     7722|0.005889281687317489|
|   2.54|      555|4.232778213495475...|
|   0.05|      336|2.562546810332396E-4|
|   0.85|      292|2.226975204217439...|
|    1.3|      250| 1.90665685292589E-4|
|    1.5|      119|9.075686619927237E-5|
|    3.3|       92|7.016497218767275E-5|
|   2.78|       84| 6.40636702583099E-5|
|    4.0|       72|5.491171736426563E-5|
|    0.3|       42|3.203183512915495E-5|
|   1.03|       19|1.449059208223676...|
|   1.05|       11|8.389290152873916E-6|
|   0.35|        6|4.575976447022136E-6|
|   3.25|        6|4.575976447022136E-6|
|    3.5|        5| 3.81331370585178E-6|
|    2.8|        4|3.050650964681424E-6|
|   2.64|        4|3.050650964681424E-6|
|    3.0|        3|2.287988223511068E-6|
+-------+---------+--------------------+
only showing top 20 rows

MTA Tax must be $0.00 or $0.50.

#### Improvement Surcharge

In [22]:
union_df.select("improvement_surcharge").groupBy("improvement_surcharge").count().sort(F.asc(F.col("improvement_surcharge"))).withColumn("pcnt_total", 100*F.col("count")/union_df.count()).show()

+---------------------+--------+------------------+
|improvement_surcharge|   count|        pcnt_total|
+---------------------+--------+------------------+
|                  0.0|  290217|0.2213376927542372|
|                  0.3|59726207| 45.55095275032811|
|                  1.0|71103132| 54.22770955691766|
+---------------------+--------+------------------+



Improvement surcharge is either $0.00, $0.30, or $1.00.

#### Tip Amount

In [31]:
col = "tip_amount"
union_df.select(
    F.percentile_approx(col, 0.25).alias("pcnt_25"),
    F.percentile_approx(col, 0.50).alias("pcnt_50"),
    F.percentile_approx(col, 0.75).alias("pcnt_75"),
    F.percentile_approx(col, 0.90).alias("pcnt_90"),
    F.percentile_approx(col, 0.95).alias("pcnt_95"),
    F.percentile_approx(col, 0.975).alias("pcnt_97.5"),
    F.percentile_approx(col, 0.99).alias("pcnt_99"),
    F.percentile_approx(col, 0.999).alias("pcnt_999"),
    F.percentile_approx(col, 0.9999).alias("pcnt_9999")
).show()

+-------+-------+-------+-------+-------+---------+-------+--------+---------+
|pcnt_25|pcnt_50|pcnt_75|pcnt_90|pcnt_95|pcnt_97.5|pcnt_99|pcnt_999|pcnt_9999|
+-------+-------+-------+-------+-------+---------+-------+--------+---------+
|    1.0|    2.5|   3.98|   6.64|  10.37|    13.95|  16.58|   26.87|  1400.16|
+-------+-------+-------+-------+-------+---------+-------+--------+---------+



In [40]:
union_df.select(F.round(F.col("tip_amount")).alias("tip_rnd")).where(F.col("tip_amount") > 500).distinct().count()

20

Tip amount should be >= $0.00.
It's hard to define an upper limit for a tip because there's no real hard limit to respect.
Some folks just might be very generous tippers.

#### Tolls Amount

In [32]:
col = "tolls_amount"
union_df.select(
    F.percentile_approx(col, 0.25).alias("pcnt_25"),
    F.percentile_approx(col, 0.50).alias("pcnt_50"),
    F.percentile_approx(col, 0.75).alias("pcnt_75"),
    F.percentile_approx(col, 0.90).alias("pcnt_90"),
    F.percentile_approx(col, 0.95).alias("pcnt_95"),
    F.percentile_approx(col, 0.975).alias("pcnt_97.5"),
    F.percentile_approx(col, 0.99).alias("pcnt_99"),
    F.percentile_approx(col, 0.999).alias("pcnt_999"),
    F.percentile_approx(col, 0.9999).alias("pcnt_9999")
).show()

+-------+-------+-------+-------+-------+---------+-------+--------+---------+
|pcnt_25|pcnt_50|pcnt_75|pcnt_90|pcnt_95|pcnt_97.5|pcnt_99|pcnt_999|pcnt_9999|
+-------+-------+-------+-------+-------+---------+-------+--------+---------+
|    0.0|    0.0|    0.0|    0.0|   6.55|     6.94|   6.94|   20.32|  1702.88|
+-------+-------+-------+-------+-------+---------+-------+--------+---------+



In [41]:
union_df.select(F.round(F.col("tolls_amount")).alias("tolls_rnd")).where(F.col("tolls_amount") > 20).distinct().count()

120

In [42]:
total_count = union_df.count()
(
    union_df
    .where(F.col("tolls_amount") > 20)
    .select(F.round(F.col("tolls_amount")).alias("tolls_rnd"))
    .groupBy("tolls_rnd")
    .count()
    .sort(F.asc("tolls_rnd"))
    .withColumn("pcnt_total", 100 * F.col("count") / total_count)
    .show(51)
)

+---------+-----+--------------------+
|tolls_rnd|count|          pcnt_total|
+---------+-----+--------------------+
|     20.0|23425| 0.01786537471191559|
|     21.0|27824|0.021220328110323986|
|     22.0|25342| 0.01932739918673916|
|     23.0|12750| 0.00972394994992204|
|     24.0| 8719|0.006649656440264334|
|     25.0| 8663|0.006606947326758794|
|     26.0| 5244|0.003999403414697347|
|     27.0| 6992|0.005332537886263129|
|     28.0| 4000|0.003050650964681424|
|     29.0| 3293|0.002511448406673982|
|     30.0| 3396|0.002590002669014529|
|     31.0| 1639|0.001250004232778...|
|     32.0| 1492|0.001137892809826...|
|     33.0|  839|6.398740398419287E-4|
|     34.0| 1486|0.001133316833379149|
|     35.0|  986| 7.51985462793971E-4|
|     36.0|  905|6.902097807591722E-4|
|     37.0|  550|4.194645076436958E-4|
|     38.0|  608|4.636989466315765E-4|
|     39.0|  394|3.004891200211202...|
|     40.0|  480|3.660781157617708...|
|     41.0|  368| 2.80659888750691E-4|
|     42.0|  208|1.586338

The toll amount upper limit should be $30.
There are very few trips with toll fees above $30.

#### Total Amount

Total amount will not be analyzed.
Instead, it's just more efficient to recalculate the total amount from the other validated cost fields.
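
A minimal sketch of that recalculation is shown below. The list of component columns is an assumption based on the cost fields analyzed in this notebook and should be confirmed against the data dictionary; `TOTAL_COMPONENTS` and `total_amount_expr` are hypothetical names.

```python
# Assumed components of total_amount; confirm against the data dictionary
TOTAL_COMPONENTS = [
    "fare_amount", "extra", "mta_tax", "improvement_surcharge",
    "tip_amount", "tolls_amount", "congestion_surcharge", "airport_fee",
]

def total_amount_expr(components=TOTAL_COMPONENTS):
    """Build a SQL expression string that sums the component columns."""
    return " + ".join(components)

print(total_amount_expr())
```

The resulting string could be passed to `F.expr`, for example `union_df.withColumn("total_amount_calc", F.expr(total_amount_expr()))`, where `total_amount_calc` is a hypothetical column name. If any component is null, the sum is null as well.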

#### Congestion Surcharge

In [34]:
union_df.select("congestion_surcharge").groupBy("congestion_surcharge").count().sort(F.asc(F.col("congestion_surcharge"))).withColumn("pcnt_total", 100*F.col("count")/union_df.count()).show()

+--------------------+---------+--------------------+
|congestion_surcharge|    count|          pcnt_total|
+--------------------+---------+--------------------+
|                 0.0|  9162281|    6.98773034283307|
|                 0.3|        1| 7.62662741170356E-7|
|                 0.5|       22|1.677858030574783...|
|                0.75|       72|5.491171736426563E-5|
|                 1.0|       41| 3.12691723879846E-5|
|                 1.8|        1| 7.62662741170356E-7|
|                2.25|        1| 7.62662741170356E-7|
|                 2.5|121957077|   93.01211864994418|
|                2.52|        1| 7.62662741170356E-7|
|                2.75|       59|  4.4997101729051E-5|
+--------------------+---------+--------------------+



Congestion fees should be either $0.00 or $2.50.

#### Airport Fee

In [33]:
union_df.select("airport_fee").groupBy("airport_fee").count().sort(F.asc(F.col("airport_fee"))).withColumn("pcnt_total", 100*F.col("count")/union_df.count()).show()

+-----------+---------+-------------------+
|airport_fee|    count|         pcnt_total|
+-----------+---------+-------------------+
|        0.0|120629616|  91.99971360488743|
|        0.5|        1|7.62662741170356E-7|
|        1.0|        1|7.62662741170356E-7|
|       1.25|  5071019| 3.8674772510669575|
|        1.7|        1|7.62662741170356E-7|
|       1.75|  5418918|  4.132806856057384|
+-----------+---------+-------------------+



Airport fees should be $0.00, $1.25, or $1.75.

#### Trip Duration (minutes)

In [45]:
col = "trip_duration_min"
union_df.select(
    F.percentile_approx(col, 0.25).alias("pcnt_25"),
    F.percentile_approx(col, 0.50).alias("pcnt_50"),
    F.percentile_approx(col, 0.75).alias("pcnt_75"),
    F.percentile_approx(col, 0.90).alias("pcnt_90"),
    F.percentile_approx(col, 0.95).alias("pcnt_95"),
    F.percentile_approx(col, 0.975).alias("pcnt_97.5"),
    F.percentile_approx(col, 0.99).alias("pcnt_99"),
    F.percentile_approx(col, 0.999).alias("pcnt_999"),
    F.percentile_approx(col, 0.9999).alias("pcnt_9999")
).show()

+-----------------+-------+------------------+------------------+------------------+---------+-----------------+-----------------+--------------------+
|          pcnt_25|pcnt_50|           pcnt_75|           pcnt_90|           pcnt_95|pcnt_97.5|          pcnt_99|         pcnt_999|           pcnt_9999|
+-----------------+-------+------------------+------------------+------------------+---------+-----------------+-----------------+--------------------+
|7.516666666666667|   12.3|19.983333333333334|31.416666666666668|41.916666666666664|    53.15|67.71666666666667|820.4166666666666|1.0322055183333334E7|
+-----------------+-------+------------------+------------------+------------------+---------+-----------------+-----------------+--------------------+



A good cutoff for trip duration could be 90 minutes.
Longer trips may happen, but are far outside the norm: the 99th percentile is about 68 minutes.
A 90-minute limit on trip duration should be more than generous.

## Conclusion

Each numeric column has been analyzed and general filter rules have been established.
The remaining columns that are categorical or date fields have predefined filtering rules based on the data dictionary.
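
As a rough summary, the rules established above could be collected into a single SQL filter expression. This is only a sketch: `RULES` and `silver_filter_expr` are hypothetical names, and treating each upper limit as inclusive is an assumption, since the analysis does not say whether the cutoffs are inclusive.

```python
# Numeric filter rules gathered from the sections above (bounds assumed inclusive)
RULES = [
    "payment_type IN (1, 2)",
    "passenger_count BETWEEN 1 AND 6",
    "trip_distance > 0 AND trip_distance <= 30",
    "fare_amount > 0 AND fare_amount <= 70",
    "extra >= 0 AND extra <= 15",
    "mta_tax IN (0.0, 0.5)",
    "improvement_surcharge IN (0.0, 0.3, 1.0)",
    "tip_amount >= 0",
    "tolls_amount >= 0 AND tolls_amount <= 30",
    "congestion_surcharge IN (0.0, 2.5)",
    "airport_fee IN (0.0, 1.25, 1.75)",
    "trip_duration_min > 0 AND trip_duration_min <= 90",
]

def silver_filter_expr(rules=RULES):
    """Join the individual rules into one parenthesized AND expression."""
    return " AND ".join(f"({rule})" for rule in rules)

print(silver_filter_expr())
```

Because `DataFrame.where` accepts a SQL expression string, the result could be applied as `union_df.where(silver_filter_expr())`. Keeping the rules in a list makes it easy to review or adjust them one at a time.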