## Let's start pondering 🌡️

Just to get you warmed up and familiar with this toy dataset, let’s try a few queries. 

First, let's initialize the Ponder connection to Snowflake.

In [1]:
import modin.pandas as pd
import ponder.snowflake
from ponder.utils.core import Teleporter

# Initialize snowflake connection here 
snowflake_con = ponder.snowflake.connect(
    user="PONDER",
    password="***REMOVED***",
    account="***REMOVED***",
    role="ACCOUNTADMIN",
    database="TAXI",
    schema="PUBLIC",
    warehouse="PONDER",
)
ponder.snowflake.init(snowflake_con, timeout=1200)

setting client_row_transfer_limit = 10000

Connected to
       ___               __
      / _ \___  ___  ___/ /__ ____
     / ___/ _ \/ _ \/ _  / -_) __/
    /_/___\___/_//_/\_,_/\__/_/
      / __/__ _____  _____ ____
     _\ \/ -_) __/ |/ / -_) __/
    /___/\__/_/  |___/\__/_/



Now that the connection is initialized, let's read the **YELLOW_TRIPDATA_O** table that already exists in your database. This dataset contains yellow taxi trip records from New York City.

In [2]:
df = pd.read_sql("YELLOW_TRIPDATA_O", con='auto')

With the table loaded, let's look at the first few rows.

In [3]:
df.head()

Unnamed: 0,VENDORID,TPEP_PICKUP_DATETIME,TPEP_DROPOFF_DATETIME,PASSENGER_COUNT,TRIP_DISTANCE,RATECODEID,STORE_AND_FWD_FLAG,PULOCATIONID,DOLOCATIONID,PAYMENT_TYPE,FARE_AMOUNT,EXTRA,MTA_TAX,TIP_AMOUNT,TOLLS_AMOUNT,IMPROVEMENT_SURCHARGE,TOTAL_AMOUNT,CONGESTION_SURCHARGE,AIRPORT_FEE
0,2,2021-06-26 00:44:16,2021-06-26 00:56:27,2,3,1,N,68,229,1,11,1,1,2,0,0,17,3,0.0
1,2,2021-06-26 00:35:58,2021-06-26 00:37:32,1,1,1,N,239,142,1,4,1,1,1,0,0,9,3,0.0
2,2,2021-06-26 00:35:15,2021-06-26 00:55:33,1,4,1,N,249,141,1,16,1,1,3,0,0,23,3,0.0
3,2,2021-06-26 00:41:57,2021-06-26 01:04:24,1,5,1,N,48,145,1,18,1,1,4,0,0,26,3,0.0
4,2,2021-06-26 00:22:59,2021-06-26 00:28:21,2,1,1,N,48,68,1,6,1,1,2,0,0,11,3,0.0


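Next, we look at the size of our dataset. Note that `df.size` returns the total number of cells (rows times columns), not the number of rows.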

In [4]:
df.size

190000000
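
To separate the row count from the cell count, `shape` is often more convenient than `size`. The sketch below uses plain pandas on a tiny made-up frame as a stand-in for the Ponder-backed DataFrame, assuming the same attributes behave the same way there. If the 19 columns shown by `df.head()` are all the columns in the table, 190000000 cells would correspond to 10,000,000 rows.

```python
import pandas as pd

# Tiny made-up frame; the column names are borrowed from the taxi table.
toy = pd.DataFrame({
    "PASSENGER_COUNT": [2, 1, 1, 3],
    "TRIP_DISTANCE": [3, 1, 4, 2],
    "PAYMENT_TYPE": [1, 1, 1, 2],
})

n_rows, n_cols = toy.shape   # (rows, columns)
n_cells = toy.size           # rows * columns

print(n_rows, n_cols, n_cells)  # 4 3 12
```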

We keep only the trips with more than two passengers.

In [6]:
df2 = df[df['PASSENGER_COUNT'] > 2]
df2.head()

Unnamed: 0,VENDORID,TPEP_PICKUP_DATETIME,TPEP_DROPOFF_DATETIME,PASSENGER_COUNT,TRIP_DISTANCE,RATECODEID,STORE_AND_FWD_FLAG,PULOCATIONID,DOLOCATIONID,PAYMENT_TYPE,FARE_AMOUNT,EXTRA,MTA_TAX,TIP_AMOUNT,TOLLS_AMOUNT,IMPROVEMENT_SURCHARGE,TOTAL_AMOUNT,CONGESTION_SURCHARGE,AIRPORT_FEE
8,2,2021-06-26 00:17:10,2021-06-26 00:27:17,3,2,1,N,234,148,1,9,1,1,1,0,0,14,3,0.0
11,1,2021-06-26 00:53:23,2021-06-26 00:57:06,3,1,1,N,236,141,1,5,3,1,1,0,0,10,3,0.0
17,2,2021-06-26 00:03:38,2021-06-26 00:05:49,3,1,1,N,68,68,1,5,1,1,2,0,0,10,3,0.0
43,2,2021-06-26 00:11:08,2021-06-26 00:23:33,3,2,1,N,186,237,1,11,1,1,4,0,0,18,3,0.0
50,2,2021-06-26 01:43:58,2021-06-26 02:00:50,3,4,1,N,43,79,2,15,1,1,0,0,0,19,3,0.0
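
The filter above is a boolean mask: the comparison produces one True/False value per row, and indexing with it keeps the rows marked True. Here is a small sketch in plain pandas on made-up data, assuming the Ponder-backed DataFrame follows the same semantics. It also shows why the index labels in the output (8, 11, 17, ...) are not consecutive: the surviving rows keep their original labels.

```python
import pandas as pd

# Made-up rows; only the column names come from the taxi table.
toy = pd.DataFrame({
    "PASSENGER_COUNT": [2, 1, 3, 4, 2],
    "PAYMENT_TYPE": [1, 1, 1, 2, 2],
})

mask = toy["PASSENGER_COUNT"] > 2   # one boolean per row
kept = toy[mask]                    # rows where the mask is True

print(mask.tolist())        # [False, False, True, True, False]
print(kept.index.tolist())  # [2, 3]
```

If consecutive labels are preferred, `kept.reset_index(drop=True)` renumbers the rows from zero.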


Finally, let's see which payment methods are most common among these trips.

In [7]:
# write your code here
df2.groupby('PAYMENT_TYPE').count()

PAYMENT_TYPE,VENDORID,TPEP_PICKUP_DATETIME,TPEP_DROPOFF_DATETIME,PASSENGER_COUNT,TRIP_DISTANCE,RATECODEID,STORE_AND_FWD_FLAG,PULOCATIONID,DOLOCATIONID,FARE_AMOUNT,EXTRA,MTA_TAX,TIP_AMOUNT,TOLLS_AMOUNT,IMPROVEMENT_SURCHARGE,TOTAL_AMOUNT,CONGESTION_SURCHARGE,AIRPORT_FEE
1,873748,873748,873748,873748,873748,873748,873748,873748,873748,873748,873748,873748,873748,873748,873748,873748,403559,59416
2,393758,393758,393758,393758,393758,393758,393758,393758,393758,393758,393758,393758,393758,393758,393758,393758,157433,17353
3,3565,3565,3565,3565,3565,3565,3565,3565,3565,3565,3565,3565,3565,3565,3565,3565,1720,222
4,1618,1618,1618,1618,1618,1618,1618,1618,1618,1618,1618,1618,1618,1618,1618,1618,981,208
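
`groupby(...).count()` reports the number of non-null values per column in each group, which is presumably why CONGESTION_SURCHARGE and AIRPORT_FEE show smaller numbers than the other columns. When only a ranking of payment types is needed, `value_counts()` on the single column gives one number per type, sorted from most to least frequent. The sketch below uses plain pandas on made-up data, assuming the same calls are available on the Ponder-backed DataFrame.

```python
import pandas as pd

# Made-up data with some missing AIRPORT_FEE values.
toy = pd.DataFrame({
    "PAYMENT_TYPE": [1, 1, 2, 1, 2, 3],
    "AIRPORT_FEE": [0.0, None, None, 0.0, None, None],
})

# Non-null values per column within each group.
per_column = toy.groupby("PAYMENT_TYPE").count()

# Rows per payment type, most common first.
ranked = toy["PAYMENT_TYPE"].value_counts()

print(per_column["AIRPORT_FEE"].tolist())  # [2, 0, 0]
print(ranked.index.tolist())               # [1, 2, 3]
print(ranked.tolist())                     # [3, 2, 1]
```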
