<a href="https://www.kaggle.com/code/kamaljp/pyspark-uberny-etl-to-sparkctg?scriptVersionId=121545318" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.3.2.tar.gz (281.4 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.2-py2.py3-none-any.whl size=281824025 sha256=7859b385982ce46927f7ef29b55b628da17acfa6b51741ff66cc4b1df319efec
  Stored in directory: /root/.cache/pip/wheels/5a/54/9b/a89cac960efb57c4c35d41cc7c9f7b80daa21108bc376339b7
Successfully built pyspark
Installing collected packages: py4j, pyspark
  Attempting uninstall: py4j
    Found existing installation: py4j 0.10.9.7
  

### The ETL process:

1) Import the data into the Kaggle environment

2) Start the Spark session and read in the data

3) Run a basic analysis to understand the data

4) Transform the data to make it more useful

5) Write the data to a Parquet file

6) Write the data to a Postgres instance running in RDS
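
Step 6 does not appear in the cells that follow here, so the block below is only a minimal sketch of what a Postgres write could look like with Spark's `DataFrameWriter.jdbc`. The host name, database name, table name and credentials are hypothetical placeholders, and the sketch assumes the PostgreSQL JDBC driver jar is available to the Spark session.

```python
def postgres_jdbc_url(host, port, database):
    """Build a PostgreSQL JDBC URL from its parts."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def write_to_postgres(df, table, url, user, password, mode="append"):
    """Write a Spark DataFrame to Postgres with DataFrameWriter.jdbc.

    Assumption: the PostgreSQL JDBC driver jar is on Spark's classpath.
    """
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
    df.write.jdbc(url=url, table=table, mode=mode, properties=properties)


# Hypothetical RDS endpoint and database name, for illustration only.
example_url = postgres_jdbc_url("my-db.example.rds.amazonaws.com", 5432, "uber_nyc")
print(example_url)
```

With a real endpoint, it would probably make sense to send a small aggregated DataFrame through `write_to_postgres` rather than the full trip data.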

In [2]:
import os

from pyspark.sql import SparkSession
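# Import only the functions used below instead of a wildcard,
# which would shadow Python built-ins such as sum and round
from pyspark.sql.functions import col, date_format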

import warnings
warnings.filterwarnings('ignore')

### The datasets in the notebook

We can list the datasets available inside the environment with the following Linux command.

We can also use the Python `os` module to list the folders.

In [3]:
!ls /kaggle/input/

airbnb-ratings-dataset	  uber-nyc-forhire-vehicles-trip-data-2021
aws-honeypot-attack-data


In [4]:
# os.listdir returns a Python list of the names in the folder
print(os.listdir('/kaggle/input/'))
os.listdir('/kaggle/input/uber-nyc-forhire-vehicles-trip-data-2021/')

['aws-honeypot-attack-data', 'airbnb-ratings-dataset', 'uber-nyc-forhire-vehicles-trip-data-2021']


['fhvhv_tripdata_2021-02.parquet',
 'working_parquet_format.pdf',
 'data_dictionary_trip_records_hvfhs.pdf',
 'fhvhv_tripdata_2021-10.parquet',
 'fhvhv_tripdata_2021-05.parquet',
 'taxi_zone_lookup.csv',
 'fhvhv_tripdata_2021-06.parquet',
 'fhvhv_tripdata_2021-04.parquet',
 'fhvhv_tripdata_2021-09.parquet',
 'fhvhv_tripdata_2021-12.parquet',
 'taxi_zones',
 'fhvhv_tripdata_2021-08.parquet',
 'fhvhv_tripdata_2021-01.parquet',
 'fhvhv_tripdata_2021-03.parquet',
 'fhvhv_tripdata_2021-07.parquet',
 'nyc 2021-01-01 to 2021-12-31.csv',
 'fhvhv_tripdata_2021-11.parquet']
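
The listing mixes the monthly Parquet files with PDFs, CSVs and a folder. A small sketch for keeping only the Parquet files, sorted by name, could look like this; `list_parquet_files` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path


def list_parquet_files(folder):
    """Return the sorted names of the .parquet files directly inside folder."""
    return sorted(p.name for p in Path(folder).glob("*.parquet") if p.is_file())


input_dir = "/kaggle/input/uber-nyc-forhire-vehicles-trip-data-2021/"
if Path(input_dir).is_dir():  # only meaningful inside the Kaggle environment
    for name in list_parquet_files(input_dir):
        print(name)
```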

### We use the Parquet reader (`spark.read.parquet`) that the SparkSession exposes in PySpark.

In [5]:
spark = SparkSession. \
    builder. \
    appName('ETL_Rider'). \
    getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/03/09 09:31:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [6]:
uber_raw_pqt = spark.read.parquet('/kaggle/input/uber-nyc-forhire-vehicles-trip-data-2021/fhvhv_tripdata_2021-01.parquet')

                                                                                

In [7]:
uber_raw_pqt.show(2,truncate=False)



+-----------------+--------------------+--------------------+-------------------+-------------------+-------------------+-------------------+------------+------------+----------+---------+-------------------+-----+----+---------+--------------------+-----------+----+----------+-------------------+-----------------+------------------+----------------+--------------+
|hvfhs_license_num|dispatching_base_num|originating_base_num|request_datetime   |on_scene_datetime  |pickup_datetime    |dropoff_datetime   |PULocationID|DOLocationID|trip_miles|trip_time|base_passenger_fare|tolls|bcf |sales_tax|congestion_surcharge|airport_fee|tips|driver_pay|shared_request_flag|shared_match_flag|access_a_ride_flag|wav_request_flag|wav_match_flag|
+-----------------+--------------------+--------------------+-------------------+-------------------+-------------------+-------------------+------------+------------+----------+---------+-------------------+-----+----+---------+--------------------+-----------+--

                                                                                

In [8]:
# So far we have read a single file (January 2021), which holds about 11.9 million records
uber_raw_pqt.count()

11908468

### First we copy the Parquet files into a separate folder

### Then we read the entire folder at once, the Big Data way

In [9]:
!mkdir uber-nyc-forhire

In [10]:
!cp /kaggle/input/uber-nyc-forhire-vehicles-trip-data-2021/*.parquet \
    /kaggle/working/uber-nyc-forhire/

In [11]:
!ls /kaggle/working/uber-nyc-forhire/

fhvhv_tripdata_2021-01.parquet	fhvhv_tripdata_2021-07.parquet
fhvhv_tripdata_2021-02.parquet	fhvhv_tripdata_2021-08.parquet
fhvhv_tripdata_2021-03.parquet	fhvhv_tripdata_2021-09.parquet
fhvhv_tripdata_2021-04.parquet	fhvhv_tripdata_2021-10.parquet
fhvhv_tripdata_2021-05.parquet	fhvhv_tripdata_2021-11.parquet
fhvhv_tripdata_2021-06.parquet	fhvhv_tripdata_2021-12.parquet


In [12]:
uber_full_data = spark.read.parquet('/kaggle/working/uber-nyc-forhire/')
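
Copying the files is one option. Since Spark's file readers also accept glob patterns in paths, a possible alternative that does not use any of the limited working space is to read the monthly files in place. This is a sketch under that assumption; `monthly_files_pattern` and `read_all_months` are hypothetical helper names.

```python
def monthly_files_pattern(folder):
    """Glob pattern matching the monthly trip files inside folder."""
    return folder.rstrip("/") + "/fhvhv_tripdata_*.parquet"


def read_all_months(spark, folder):
    """Read every monthly Parquet file under folder into one DataFrame."""
    return spark.read.parquet(monthly_files_pattern(folder))


print(monthly_files_pattern("/kaggle/input/uber-nyc-forhire-vehicles-trip-data-2021/"))
```

In the notebook this would be called as `read_all_months(spark, '/kaggle/input/uber-nyc-forhire-vehicles-trip-data-2021/')`.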

### Now we are talking... 174 million rows in the Spark execution environment.

#### We can learn a couple of things with this massive dataset

1) Writing the data to the Spark metastore as a table

2) Transforming it using Spark SQL

3) Creating temporary views that Spark SQL can query

4) Writing the data to a database using a JDBC driver
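
Of the items above, a temporary view is the lightest option: it only gives the DataFrame a name that SQL can use during the current session and writes nothing to disk. The block below is a sketch of that idea; the view name `uber_full_view` and both helper names are hypothetical.

```python
def trips_per_license_query(view_name):
    """SQL that counts trips per hvfhs_license_num in the given view or table."""
    return (
        "SELECT hvfhs_license_num, COUNT(*) AS trips "
        f"FROM {view_name} "
        "GROUP BY hvfhs_license_num "
        "ORDER BY trips DESC"
    )


def query_through_temp_view(spark, df, view_name="uber_full_view"):
    """Register df as a session-scoped temp view, then query it with Spark SQL."""
    df.createOrReplaceTempView(view_name)
    return spark.sql(trips_per_license_query(view_name))


print(trips_per_license_query("uber_full_view"))
```

In the notebook, `query_through_temp_view(spark, uber_full_data).show()` would run the query against the full dataset without saving a table first.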

In [13]:
uber_full_data.count()

174596652

In [14]:
# Print the schema. Every column is declared nullable, which does not by itself mean it contains nulls
uber_full_data.printSchema()

root
 |-- hvfhs_license_num: string (nullable = true)
 |-- dispatching_base_num: string (nullable = true)
 |-- originating_base_num: string (nullable = true)
 |-- request_datetime: timestamp (nullable = true)
 |-- on_scene_datetime: timestamp (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- DOLocationID: long (nullable = true)
 |-- trip_miles: double (nullable = true)
 |-- trip_time: long (nullable = true)
 |-- base_passenger_fare: double (nullable = true)
 |-- tolls: double (nullable = true)
 |-- bcf: double (nullable = true)
 |-- sales_tax: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)
 |-- tips: double (nullable = true)
 |-- driver_pay: double (nullable = true)
 |-- shared_request_flag: string (nullable = true)
 |-- shared_match_flag: string (nullable = true)
 |-- access_a_ride_flag: string (nul
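
`nullable = true` only says that a column is allowed to hold nulls. To see how many nulls actually occur, one option is a single aggregate query over all columns. The sketch below builds such a query as a string; the view name `uber_full` and the helper `null_count_query` are hypothetical, and the query assumes the DataFrame has been registered as a temporary view first.

```python
def null_count_query(view_name, columns):
    """Build a Spark SQL query that counts the NULL values in each column."""
    parts = [
        f"SUM(CASE WHEN `{c}` IS NULL THEN 1 ELSE 0 END) AS `{c}`"
        for c in columns
    ]
    return f"SELECT {', '.join(parts)} FROM {view_name}"


print(null_count_query("uber_full", ["tips", "airport_fee"]))

# Inside the notebook, assuming the DataFrame is registered as a view first:
#   uber_full_data.createOrReplaceTempView("uber_full")
#   spark.sql(null_count_query("uber_full", uber_full_data.columns)).show()
```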

In [15]:
uber_full_data.groupBy('hvfhs_license_num').count().show()



+-----------------+---------+
|hvfhs_license_num|    count|
+-----------------+---------+
|           HV0004|   891819|
|           HV0005| 47575769|
|           HV0003|126129064|
+-----------------+---------+



                                                                                

#### It's better to use SQL for the transformations, since it is often more intuitive

For that we have to create a database and save the above DataFrame as a table inside the Spark metastore.

In [16]:
# Create the database. Keep in mind that the free working space is only 20 GB
spsql = spark.sql
spsql("CREATE DATABASE IF NOT EXISTS uber_nyc_db")
spsql("USE DATABASE uber_nyc_db")
spctg = spark.catalog

In [17]:
spctg.listDatabases()

[Database(name='default', description='default database', locationUri='file:/kaggle/working/spark-warehouse'),
 Database(name='uber_nyc_db', description='', locationUri='file:/kaggle/working/spark-warehouse/uber_nyc_db.db')]

In [18]:
#uber_hv0005_data.write.saveAsTable("uber_table")
spctg.listTables(dbName='uber_nyc_db')

[]

In [19]:
uber_hv0004_data = uber_full_data.filter(uber_full_data['hvfhs_license_num'] == 'HV0004')

In [20]:
uber_hv0004_data.count()

                                                                                

891819

In [21]:
spsql("""SELECT date_trunc('HOUR', '2015-03-05T09:32:05.359')""").show()

+-----------------------------------------+
|date_trunc(HOUR, 2015-03-05T09:32:05.359)|
+-----------------------------------------+
|                      2015-03-05 09:00:00|
+-----------------------------------------+



In [22]:
uber_full_data.select(date_format('pickup_datetime','dd-MM-yyyy'). \
                        alias('pu_date'),
                     date_format('request_datetime','dd-MM-yyyy'). \
                        alias('req_date'),
                    date_format('pickup_datetime','EEEE'). \
                        alias('pu_day'),
                    date_format('pickup_datetime','MMM'). \
                        alias('pu_month'),
                     col('tolls'),col('trip_time'),
                     col('tips'),col('base_passenger_fare'),
                     col('originating_base_num'),col('dispatching_base_num'),
                     col("hvfhs_license_num")).show(2,truncate=False)

+----------+----------+------+--------+-----+---------+----+-------------------+--------------------+--------------------+-----------------+
|pu_date   |req_date  |pu_day|pu_month|tolls|trip_time|tips|base_passenger_fare|originating_base_num|dispatching_base_num|hvfhs_license_num|
+----------+----------+------+--------+-----+---------+----+-------------------+--------------------+--------------------+-----------------+
|01-02-2021|31-01-2021|Monday|Feb     |0.0  |629      |0.0 |17.14              |B02764              |B02764              |HV0003           |
|01-02-2021|01-02-2021|Monday|Feb     |0.0  |998      |0.0 |32.11              |B02764              |B02764              |HV0003           |
+----------+----------+------+--------+-----+---------+----+-------------------+--------------------+--------------------+-----------------+
only showing top 2 rows



In [23]:
uber_transformed_data = uber_full_data.select(date_format('pickup_datetime','dd-MM-yyyy'). \
                        alias('pu_date'),
                     date_format('request_datetime','dd-MM-yyyy'). \
                        alias('req_date'),
                     date_format('pickup_datetime','EEEE'). \
                        alias('pu_day'),
                    date_format('pickup_datetime','MMM'). \
                        alias('pu_month'),
                     col('tolls'),col('trip_time'),
                     col('tips'),col('base_passenger_fare'),
                     col('originating_base_num'),col('dispatching_base_num'),
                     col("hvfhs_license_num"))

In [24]:
uber_transformed_data.filter((uber_transformed_data.pu_month == 'Jan') \
                             & (uber_transformed_data.originating_base_num == 'B02682')). \
                        show(2,truncate=False)



+----------+----------+------+--------+-----+---------+----+-------------------+--------------------+--------------------+-----------------+
|pu_date   |req_date  |pu_day|pu_month|tolls|trip_time|tips|base_passenger_fare|originating_base_num|dispatching_base_num|hvfhs_license_num|
+----------+----------+------+--------+-----+---------+----+-------------------+--------------------+--------------------+-----------------+
|01-01-2021|01-01-2021|Friday|Jan     |0.0  |923      |0.0 |22.28              |B02682              |B02682              |HV0003           |
|01-01-2021|01-01-2021|Friday|Jan     |0.0  |1382     |0.0 |18.36              |B02682              |B02682              |HV0003           |
+----------+----------+------+--------+-----+---------+----+-------------------+--------------------+--------------------+-----------------+
only showing top 2 rows



                                                                                

In [25]:
uber_jan_b02682 = uber_transformed_data.filter((uber_transformed_data.pu_month == 'Jan') \
                                                    & (uber_transformed_data.originating_base_num == 'B02682'))

In [26]:
uber_jan_b02682.count()

                                                                                

320804

In [27]:
uber_jan_b02682.groupBy("pu_date").count().show()



+----------+-----+
|   pu_date|count|
+----------+-----+
|05-01-2021| 9556|
|24-01-2021| 9935|
|31-01-2021|10290|
|27-01-2021|10632|
|18-01-2021| 8441|
|25-01-2021| 9553|
|02-01-2021| 8853|
|09-01-2021|11578|
|04-01-2021| 9164|
|06-01-2021| 9928|
|26-01-2021| 9823|
|16-01-2021|11340|
|12-01-2021|10094|
|14-01-2021|10279|
|13-01-2021| 9919|
|29-01-2021|13447|
|10-01-2021| 9592|
|20-01-2021|10101|
|22-01-2021|11754|
|23-01-2021|12132|
+----------+-----+
only showing top 20 rows



                                                                                

In [28]:
uber_jan_b02682.write.saveAsTable("Jan_b02682",mode='ignore')

                                                                                

In [29]:
### Let's rumble with SQL now
spsql("""SELECT req_date, 
      ROUND(SUM(tips),1) AS tip_sums, ROUND(SUM(trip_time),1) AS trip_time,
      ROUND(SUM(base_passenger_fare),1) AS total_fare
      FROM Jan_b02682
      GROUP BY req_date
      ORDER BY req_date""").show(10)

+----------+--------+---------+----------+
|  req_date|tip_sums|trip_time|total_fare|
+----------+--------+---------+----------+
|01-01-2021|  4073.4|  9215642|  196959.8|
|02-01-2021|  3786.0|  8264862|  158462.1|
|03-01-2021|  2813.6|  6663247|  132929.7|
|04-01-2021|  4121.2|  8643996|  158123.2|
|05-01-2021|  4132.7|  9097644|  162828.7|
|06-01-2021|  4459.9|  9482856|  170405.6|
|07-01-2021|  4720.8| 10124063|  177125.1|
|08-01-2021|  5320.1| 11254577|  197214.9|
|09-01-2021|  4787.9| 10512786|  190288.2|
|10-01-2021|  3797.0|  8427555|  159252.5|
+----------+--------+---------+----------+
only showing top 10 rows
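
The query above orders by `req_date`, which is a `dd-MM-yyyy` string, so the ordering is alphabetical rather than chronological. Within a single month the two agree, but this table also holds requests dated 31-12-2020, and as strings those sort after every January date. The plain-Python sketch below illustrates the effect with three sample values; a `yyyy-MM-dd` pattern in `date_format` would be one way to avoid it.

```python
from datetime import datetime

dates = ["01-01-2021", "02-01-2021", "31-12-2020"]

# Plain string sort, which is what ORDER BY does on a string column.
as_strings = sorted(dates)

# Sort by the parsed calendar date instead.
as_dates = sorted(dates, key=lambda d: datetime.strptime(d, "%d-%m-%Y"))

# Re-render as ISO yyyy-MM-dd, whose string order equals calendar order.
iso = sorted(datetime.strptime(d, "%d-%m-%Y").strftime("%Y-%m-%d") for d in dates)

print(as_strings)  # 31-12-2020 ends up last
print(as_dates)    # 31-12-2020 comes first
print(iso)
```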



In [30]:
uber_jan_b02682.write.parquet(path='/kaggle/working/jan_b02682',
                              partitionBy='req_date')

                                                                                

In [31]:
!ls /kaggle/working/jan_b02682

 _SUCCESS	       'req_date=11-01-2021'  'req_date=22-01-2021'
'req_date=01-01-2021'  'req_date=12-01-2021'  'req_date=23-01-2021'
'req_date=02-01-2021'  'req_date=13-01-2021'  'req_date=24-01-2021'
'req_date=03-01-2021'  'req_date=14-01-2021'  'req_date=25-01-2021'
'req_date=04-01-2021'  'req_date=15-01-2021'  'req_date=26-01-2021'
'req_date=05-01-2021'  'req_date=16-01-2021'  'req_date=27-01-2021'
'req_date=06-01-2021'  'req_date=17-01-2021'  'req_date=28-01-2021'
'req_date=07-01-2021'  'req_date=18-01-2021'  'req_date=29-01-2021'
'req_date=08-01-2021'  'req_date=19-01-2021'  'req_date=30-01-2021'
'req_date=09-01-2021'  'req_date=20-01-2021'  'req_date=31-01-2021'
'req_date=10-01-2021'  'req_date=21-01-2021'  'req_date=31-12-2020'


In [32]:
!ls /kaggle/working/jan_b02682/req_date\=01-01-2021

part-00012-9d369bc9-fbb3-4d6e-b66b-ec69523642b3.c000.snappy.parquet
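
`partitionBy` writes one folder per distinct `req_date` value, named `column=value`, and the partition value is kept in the folder name rather than inside the Parquet file. A small sketch of reading that naming convention back in plain Python follows; `parse_partition_dir` is a hypothetical helper.

```python
def parse_partition_dir(name):
    """Split a partition folder name such as 'req_date=01-01-2021' into (key, value)."""
    key, sep, value = name.partition("=")
    if not sep:
        raise ValueError(f"not a partition folder: {name!r}")
    return key, value


folders = ["req_date=01-01-2021", "req_date=31-12-2020", "_SUCCESS"]
# Marker files such as _SUCCESS carry no '=' and are skipped.
partitions = [parse_partition_dir(f) for f in folders if "=" in f]
print(partitions)
```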


In [None]:
# Writing this DataFrame out as a table threw an out-of-memory error here (crash!)
# On a local machine the OS may hang and require a restart

#uber_hv0004_data.write.saveAsTable("uber_hv04_table")

In [33]:
# First consolidate the data with Spark's groupBy, then save the result as a table
uber_full_data.groupBy(["hvfhs_license_num","dispatching_base_num",
                        "originating_base_num"]).count().show()



+-----------------+--------------------+--------------------+-------+
|hvfhs_license_num|dispatching_base_num|originating_base_num|  count|
+-----------------+--------------------+--------------------+-------+
|           HV0003|              B02875|              B00446|     22|
|           HV0003|              B02875|              B03153|    134|
|           HV0004|              B02800|              B02800|   5564|
|           HV0003|              B02764|              B00446|     11|
|           HV0003|              B02880|              B02880|1383662|
|           HV0003|              B02866|              B02866|4125161|
|           HV0003|              B02884|              B00457|      3|
|           HV0003|              B02835|              B02026|     21|
|           HV0003|              B02765|              B02729|     32|
|           HV0003|              B02872|                null|   2678|
|           HV0003|              B02836|              B02729|     34|
|           HV0003| 

                                                                                

In [34]:
uber_consolidate_count = uber_full_data.groupBy(["hvfhs_license_num","dispatching_base_num",
                        "originating_base_num"]).count()

In [36]:
uber_select_data = uber_full_data.select("hvfhs_license_num","dispatching_base_num",
                        "originating_base_num","tolls","trip_time","trip_miles",
                                         "driver_pay","base_passenger_fare")

In [38]:
uber_select_data.show(2,truncate=False)

+-----------------+--------------------+--------------------+-----+---------+----------+----------+-------------------+
|hvfhs_license_num|dispatching_base_num|originating_base_num|tolls|trip_time|trip_miles|driver_pay|base_passenger_fare|
+-----------------+--------------------+--------------------+-----+---------+----------+----------+-------------------+
|HV0003           |B02764              |B02764              |0.0  |629      |2.06      |9.79      |17.14              |
|HV0003           |B02764              |B02764              |0.0  |998      |3.15      |24.01     |32.11              |
+-----------------+--------------------+--------------------+-----+---------+----------+----------+-------------------+
only showing top 2 rows



In [39]:
hvfhs_license_aggregate = uber_select_data.groupby("hvfhs_license_num").sum()

In [41]:
hvfhs_license_aggregate.show(truncate=False)




+-----------------+--------------------+--------------+-------------------+-------------------+------------------------+
|hvfhs_license_num|sum(tolls)          |sum(trip_time)|sum(trip_miles)    |sum(driver_pay)    |sum(base_passenger_fare)|
+-----------------+--------------------+--------------+-------------------+-------------------+------------------------+
|HV0004           |153265.0799999978   |1095101799    |4736608.4          |8025853.720000012  |2.27800269900019E7      |
|HV0005           |4.5605888759988375E7|53269572548   |2.354133522799954E8|7.889760770456022E8|1.0597546318811505E9    |
|HV0003           |1.1894894449076162E8|137829874026  |6.043828214200091E8|2.288988531865339E9|2.7566374856481566E9    |
+-----------------+--------------------+--------------+-------------------+-------------------+------------------------+



                                                                                

In [42]:
disp_origin_aggregate = uber_select_data.groupby(["hvfhs_license_num",
                                                  "dispatching_base_num",
                                                  "originating_base_num"]).sum()

In [43]:
disp_origin_aggregate.show(2,truncate=False)



+-----------------+--------------------+--------------------+------------------+--------------+-----------------+---------------+------------------------+
|hvfhs_license_num|dispatching_base_num|originating_base_num|sum(tolls)        |sum(trip_time)|sum(trip_miles)  |sum(driver_pay)|sum(base_passenger_fare)|
+-----------------+--------------------+--------------------+------------------+--------------+-----------------+---------------+------------------------+
|HV0003           |B02875              |B00446              |34.26             |29175         |77.92999999999999|717.59         |645.2199999999999       |
|HV0003           |B02875              |B03153              |58.089999999999996|115706        |367.06           |2062.69        |1822.05                 |
+-----------------+--------------------+--------------------+------------------+--------------+-----------------+---------------+------------------------+
only showing top 2 rows



                                                                                

In [44]:
disp_origin_aggregate.count()

                                                                                

188

In [45]:
disp_origin_aggregate.write.saveAsTable('disp_orig_license',mode='ignore')

                                                                                

In [46]:
spctg.listTables(dbName='uber_nyc_db')

[Table(name='disp_orig_license', database='uber_nyc_db', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='jan_b02682', database='uber_nyc_db', description=None, tableType='MANAGED', isTemporary=False)]

### Now we can fire on all cylinders using Spark SQL

In [47]:
spsql("""SELECT * FROM disp_orig_license LIMIT 10""").show()

+-----------------+--------------------+--------------------+------------------+--------------+--------------------+--------------------+------------------------+
|hvfhs_license_num|dispatching_base_num|originating_base_num|        sum(tolls)|sum(trip_time)|     sum(trip_miles)|     sum(driver_pay)|sum(base_passenger_fare)|
+-----------------+--------------------+--------------------+------------------+--------------+--------------------+--------------------+------------------------+
|           HV0003|              B02875|              B00446|             34.26|         29175|   77.92999999999999|              717.59|       645.2199999999999|
|           HV0003|              B02875|              B03153|58.089999999999996|        115706|              367.06|             2062.69|                 1822.05|
|           HV0004|              B02800|              B02800|401.43999999999994|       6553330|            23975.48|                 0.0|      144181.17000000042|
|           HV0003|   

In [48]:
spsql("""SELECT * FROM disp_orig_license 
            WHERE dispatching_base_num = 'B02875' 
            LIMIT 10""").show()

+-----------------+--------------------+--------------------+------------------+--------------+------------------+--------------------+------------------------+
|hvfhs_license_num|dispatching_base_num|originating_base_num|        sum(tolls)|sum(trip_time)|   sum(trip_miles)|     sum(driver_pay)|sum(base_passenger_fare)|
+-----------------+--------------------+--------------------+------------------+--------------+------------------+--------------------+------------------------+
|           HV0003|              B02875|              B00446|             34.26|         29175| 77.92999999999999|              717.59|       645.2199999999999|
|           HV0003|              B02875|              B03153|58.089999999999996|        115706|            367.06|             2062.69|                 1822.05|
|           HV0003|              B02875|              B00457|               0.0|         23002|             70.76|  347.36000000000007|      410.64000000000004|
|           HV0003|              B

In [49]:
spsql("""SELECT dispatching_base_num, originating_base_num, `sum(tolls)`,
            `sum(trip_time)`
            FROM disp_orig_license 
            WHERE dispatching_base_num = 'B02875' 
            LIMIT 10""").show()

+--------------------+--------------------+------------------+--------------+
|dispatching_base_num|originating_base_num|        sum(tolls)|sum(trip_time)|
+--------------------+--------------------+------------------+--------------+
|              B02875|              B00446|             34.26|         29175|
|              B02875|              B03153|58.089999999999996|        115706|
|              B02875|              B00457|               0.0|         23002|
|              B02875|              B02875| 6555401.540002581|    8783422003|
|              B02875|              B02026|            286.29|        280291|
|              B02875|              B02729|1440.9400000000005|       2685804|
|              B02875|              B00887|             12.24|         19398|
|              B02875|                null|1146.6100000000004|       1893857|
|              B02875|              B02826|              6.55|          4393|
+--------------------+--------------------+------------------+--
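
The grouped `.sum()` names its result columns `sum(tolls)`, `sum(trip_time)` and so on, which is why the queries here need backticks. One way to avoid that is to rename the columns before saving the table. The sketch below assumes a `<column>_sum` naming convention; that convention and both helper names are hypothetical.

```python
import re


def clean_agg_name(name):
    """Turn 'sum(tolls)' into 'tolls_sum'; leave other names unchanged."""
    match = re.fullmatch(r"(\w+)\((\w+)\)", name)
    if match is None:
        return name
    func, column = match.groups()
    return f"{column}_{func}"


def with_clean_names(df):
    """Return df with aggregate column names rewritten by clean_agg_name."""
    return df.toDF(*[clean_agg_name(c) for c in df.columns])


print([clean_agg_name(c) for c in ["dispatching_base_num", "sum(tolls)", "sum(trip_time)"]])
```

After `with_clean_names(disp_origin_aggregate)`, a filter could be written as `WHERE tolls_sum > 58` without any quoting.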

In [51]:
spsql("""SELECT dispatching_base_num, originating_base_num, 
                ROUND(`sum(tolls)`,1) AS toll_sum,
                ROUND(`sum(trip_time)`,1) AS trip_sum
            FROM disp_orig_license 
            WHERE `sum(tolls)` > 58 AND `sum(trip_time)` > 10000 
            LIMIT 10""").show()

+--------------------+--------------------+---------+----------+
|dispatching_base_num|originating_base_num| toll_sum|  trip_sum|
+--------------------+--------------------+---------+----------+
|              B02875|              B03153|     58.1|    115706|
|              B02800|              B02800|    401.4|   6553330|
|              B02880|              B02880|1314992.8|1503616149|
|              B02866|              B02866|3636406.9|4458512780|
|              B02872|                null|   1698.3|   2541839|
|              B02882|              B02882|2471208.2|2867311515|
|              B02836|              B02836|1407971.0|1614030643|
|              B02764|                null|   3376.9|   4835600|
|              B02871|              B02871|2852879.8|3647028943|
|              B02867|              B02867|2019888.5|2361136337|
+--------------------+--------------------+---------+----------+



In [52]:
spsql("""SELECT dispatching_base_num, originating_base_num, 
                ROUND(`sum(tolls)`,1) AS toll_sum,
                ROUND(`sum(trip_time)`,1) AS trip_sum
            FROM disp_orig_license 
            WHERE `sum(tolls)` > 58 AND `sum(trip_time)` > 10000 
            ORDER BY toll_sum ASC
            LIMIT 10""").show()

+--------------------+--------------------+--------+--------+
|dispatching_base_num|originating_base_num|toll_sum|trip_sum|
+--------------------+--------------------+--------+--------+
|              B02875|              B03153|    58.1|  115706|
|              B02884|              B02729|    64.6|  115789|
|              B02889|                null|    68.7|   62969|
|              B02879|                null|    75.2|   91700|
|              B02395|              B02729|    77.7|  164586|
|              B02876|              B02729|    78.6|  124149|
|              B02512|                null|    79.3|  111325|
|              B02876|                null|    84.4|   60046|
|              B02617|                null|   119.7|  149206|
|              B02872|              B02026|   140.8|  259781|
+--------------------+--------------------+--------+--------+



In [53]:
!ls /kaggle/working/spark-warehouse/uber_nyc_db.db/

disp_orig_license  jan_b02682


In [55]:
!ls /kaggle/working/jan_b02682/

 _SUCCESS	       'req_date=11-01-2021'  'req_date=22-01-2021'
'req_date=01-01-2021'  'req_date=12-01-2021'  'req_date=23-01-2021'
'req_date=02-01-2021'  'req_date=13-01-2021'  'req_date=24-01-2021'
'req_date=03-01-2021'  'req_date=14-01-2021'  'req_date=25-01-2021'
'req_date=04-01-2021'  'req_date=15-01-2021'  'req_date=26-01-2021'
'req_date=05-01-2021'  'req_date=16-01-2021'  'req_date=27-01-2021'
'req_date=06-01-2021'  'req_date=17-01-2021'  'req_date=28-01-2021'
'req_date=07-01-2021'  'req_date=18-01-2021'  'req_date=29-01-2021'
'req_date=08-01-2021'  'req_date=19-01-2021'  'req_date=30-01-2021'
'req_date=09-01-2021'  'req_date=20-01-2021'  'req_date=31-01-2021'
'req_date=10-01-2021'  'req_date=21-01-2021'  'req_date=31-12-2020'


### A completely different way to read the Parquet files: register the folder as a catalog table

In [56]:
spctg.createTable("jan_uber_data",path='/kaggle/working/jan_b02682/',
                 source='parquet')

DataFrame[pu_date: string, pu_day: string, pu_month: string, tolls: double, trip_time: bigint, tips: double, base_passenger_fare: double, originating_base_num: string, dispatching_base_num: string, hvfhs_license_num: string, req_date: string]

In [57]:
spsql("SELECT * FROM jan_uber_data LIMIT 10").show(truncate=False)

+-------+------+--------+-----+---------+----+-------------------+--------------------+--------------------+-----------------+--------+
|pu_date|pu_day|pu_month|tolls|trip_time|tips|base_passenger_fare|originating_base_num|dispatching_base_num|hvfhs_license_num|req_date|
+-------+------+--------+-----+---------+----+-------------------+--------------------+--------------------+-----------------+--------+
+-------+------+--------+-----+---------+----+-------------------+--------------------+--------------------+-----------------+--------+



In [58]:
spsql("SHOW PARTITIONS jan_uber_data").show()

+---------+
|partition|
+---------+
+---------+



In [59]:
spctg.recoverPartitions("jan_uber_data")

In [60]:
spsql("SHOW PARTITIONS jan_uber_data").show()

+-------------------+
|          partition|
+-------------------+
|req_date=01-01-2021|
|req_date=02-01-2021|
|req_date=03-01-2021|
|req_date=04-01-2021|
|req_date=05-01-2021|
|req_date=06-01-2021|
|req_date=07-01-2021|
|req_date=08-01-2021|
|req_date=09-01-2021|
|req_date=10-01-2021|
|req_date=11-01-2021|
|req_date=12-01-2021|
|req_date=13-01-2021|
|req_date=14-01-2021|
|req_date=15-01-2021|
|req_date=16-01-2021|
|req_date=17-01-2021|
|req_date=18-01-2021|
|req_date=19-01-2021|
|req_date=20-01-2021|
+-------------------+
only showing top 20 rows



In [61]:
spsql("SELECT * FROM jan_uber_data LIMIT 10").show(truncate=False)

+----------+------+--------+-----+---------+-----+-------------------+--------------------+--------------------+-----------------+----------+
|pu_date   |pu_day|pu_month|tolls|trip_time|tips |base_passenger_fare|originating_base_num|dispatching_base_num|hvfhs_license_num|req_date  |
+----------+------+--------+-----+---------+-----+-------------------+--------------------+--------------------+-----------------+----------+
|29-01-2021|Friday|Jan     |0.0  |289      |0.0  |6.33               |B02682              |B02682              |HV0003           |29-01-2021|
|29-01-2021|Friday|Jan     |0.0  |498      |3.0  |10.32              |B02682              |B02682              |HV0003           |29-01-2021|
|29-01-2021|Friday|Jan     |0.0  |966      |0.0  |12.45              |B02682              |B02682              |HV0003           |29-01-2021|
|29-01-2021|Friday|Jan     |0.0  |1422     |0.0  |26.4               |B02682              |B02682              |HV0003           |29-01-2021|
|29-01
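
The empty result right after `createTable`, followed by rows once `recoverPartitions` has run, suggests that a table created over an existing partitioned folder starts with no partitions registered. The sketch below bundles the two catalog calls used above into one helper; `register_partitioned_parquet` is a hypothetical name.

```python
def register_partitioned_parquet(spark, table_name, path):
    """Expose an existing partitioned Parquet folder as a catalog table."""
    # Create the table definition that points at the folder.
    spark.catalog.createTable(table_name, path=path, source="parquet")
    # Scan the folder and register the partition directories found there.
    spark.catalog.recoverPartitions(table_name)
    # Return the table as a DataFrame for further queries.
    return spark.table(table_name)


# Inside the notebook this could be called as:
#   register_partitioned_parquet(spark, "jan_uber_data", "/kaggle/working/jan_b02682/")
```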

In [63]:
spsql("""SELECT req_date, COUNT(*) AS day_trips
      FROM jan_uber_data
      GROUP BY req_date
      ORDER BY day_trips DESC""").show(31,truncate=False)

+----------+---------+
|req_date  |day_trips|
+----------+---------+
|29-01-2021|13460    |
|23-01-2021|12137    |
|30-01-2021|12098    |
|15-01-2021|12040    |
|22-01-2021|11767    |
|09-01-2021|11576    |
|08-01-2021|11518    |
|16-01-2021|11355    |
|28-01-2021|11204    |
|27-01-2021|10633    |
|01-01-2021|10615    |
|21-01-2021|10361    |
|07-01-2021|10295    |
|14-01-2021|10285    |
|31-01-2021|10234    |
|20-01-2021|10097    |
|12-01-2021|10088    |
|06-01-2021|9935     |
|24-01-2021|9921     |
|13-01-2021|9917     |
|26-01-2021|9823     |
|17-01-2021|9721     |
|10-01-2021|9571     |
|25-01-2021|9547     |
|05-01-2021|9542     |
|19-01-2021|9539     |
|11-01-2021|9489     |
|04-01-2021|9171     |
|02-01-2021|8844     |
|18-01-2021|8440     |
|03-01-2021|7568     |
+----------+---------+
only showing top 31 rows

