In this example, we work with a subset of the **"Taxi cab"** dataset, stored in a CSV file. The Taxi cab dataset contains yellow taxi trip data from New York City.

This dataset includes fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

This dataset contains 210035 rows and 20 columns.

### Initialize the connection
To use Ponder, we first need to initialize the connection between Ponder and Snowflake. Instructions for setting up this connection are available here (https://docs.ponder.io/getting_started/quickstart.html#step-3-connect-to-snowflake).

In [None]:
import modin.pandas as pd
import ponder.snowflake
from ponder.utils.core import Teleporter

snowflake_con = ponder.snowflake.connect(user=*****, password=*****, account=*****, role=*****, database=*****, schema=*****, warehouse=*****)

ponder.snowflake.init(snowflake_con, timeout=1200)

We first read the **"yellow_tripdata_2015-01.csv"** file using the **read_csv** command.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/yellow_tripdata_2015-01.csv", header=0)
df.head()

Next, we look at the columns of the dataset.

In [None]:
df.columns

Next, we drop the columns that we will not use in our analysis.

In [None]:
df_cleaned = df.drop(columns=['FARE_AMOUNT', 'STORE_AND_FWD_FLAG','RATECODEID','AIRPORT_FEE',' '])
df_cleaned.columns

Looking at the PAYMENT_TYPE column of our dataframe, we notice that payment types (e.g., credit card, cash) are encoded as numbers (e.g., 1, 2). We replace these codes with their corresponding payment type names to make the dataframe more readable. To do this: <br> - We change the data type of the 'PAYMENT_TYPE' column from integer to string. <br> - We then replace the numeric codes with their string equivalents.

In [None]:
df_cleaned['PAYMENT_TYPE'] = df_cleaned['PAYMENT_TYPE'].astype(str)
df_cleaned['PAYMENT_TYPE'] = df_cleaned['PAYMENT_TYPE'].replace(['1', '2', '3', '4', '5', '6'], ['credit card', 'cash', 'No charge', 'Dispute' ,'Unknown', 'voided trip'])
df_cleaned['PAYMENT_TYPE'].head()
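As a possible alternative to the two-step cast-and-replace above, the codes can be looked up in a dictionary in a single step. The sketch below uses plain pandas on a small made-up frame (the name `toy` and its values are assumptions for illustration), and it has not been checked against Ponder on Snowflake.

```python
import pandas as pd

# Toy stand-in for the PAYMENT_TYPE column (values are made up for illustration).
toy = pd.DataFrame({"PAYMENT_TYPE": [1, 2, 1, 4]})

# Code-to-label mapping taken from the replace() call above.
payment_labels = {
    1: "credit card",
    2: "cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
    6: "voided trip",
}

# Series.map looks each code up in the dictionary; codes missing from it become NaN.
toy["PAYMENT_TYPE"] = toy["PAYMENT_TYPE"].map(payment_labels)
print(toy["PAYMENT_TYPE"].tolist())
```

One difference worth keeping in mind: `replace` leaves unknown codes untouched, while `map` turns them into missing values.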

Next, to get a better sense of our data, we look at the dimensions and descriptive statistics of our dataframe.

In [None]:
df_cleaned.shape

In [None]:
df_cleaned.describe(include='all')

We use the isna command to count the missing values in each column.

In [None]:
df_cleaned.isna().sum()
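To illustrate what this expression computes, here is a minimal sketch in plain pandas on a made-up frame (the frame `toy` and its values are assumptions, not taken from the taxi data): `isna()` yields a boolean frame, and `sum()` counts the `True` entries per column.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing values (not real taxi data).
toy = pd.DataFrame({
    "PASSENGER_COUNT": [1, 2, np.nan, 1],
    "TRIP_DISTANCE": [0.5, np.nan, np.nan, 3.2],
})

# isna() marks each missing cell as True; sum() adds the True values column by column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```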

One of our main questions is how passengers ("PASSENGER_COUNT") are distributed across trips. To answer this question, we group by PASSENGER_COUNT and get the size of each group.

In [None]:
df_cleaned.groupby(['PASSENGER_COUNT']).size()
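The same counts can also be obtained with `value_counts`, which additionally sorts the groups from most to least frequent. The comparison below is a sketch in plain pandas on a made-up column (the frame `toy` is an assumption for illustration); it has not been run through Ponder.

```python
import pandas as pd

# Made-up passenger counts (not real taxi data).
toy = pd.DataFrame({"PASSENGER_COUNT": [1, 1, 2, 1, 5, 2]})

# groupby(...).size() returns one row per group, ordered by the group key.
by_group = toy.groupby("PASSENGER_COUNT").size()

# value_counts() returns the same counts, ordered by frequency (largest first).
by_count = toy["PASSENGER_COUNT"].value_counts()

print(by_group)
print(by_count)
```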

We notice that most of the trips have one or two passengers, so we focus on this subset of data. 

In [None]:
df2 = df_cleaned.loc[(df_cleaned['PASSENGER_COUNT'] >= 1) & (df_cleaned['PASSENGER_COUNT'] <= 2)]
df2
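The two comparisons in this filter can also be written with `Series.between`, which is inclusive on both ends by default. The following is a small sketch in plain pandas on made-up data (the frame `toy` is an assumption), meant only to show that both masks select the same rows.

```python
import pandas as pd

# Made-up trips (not real taxi data).
toy = pd.DataFrame({
    "PASSENGER_COUNT": [0, 1, 2, 3, 1],
    "TRIP_DISTANCE": [1.0, 2.5, 0.8, 4.1, 6.3],
})

# Explicit form, as used in the cell above.
mask_explicit = (toy["PASSENGER_COUNT"] >= 1) & (toy["PASSENGER_COUNT"] <= 2)

# Equivalent form: between(1, 2) includes both endpoints by default.
mask_between = toy["PASSENGER_COUNT"].between(1, 2)

subset = toy.loc[mask_between]
print(subset)
```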

Now that we have kept only the trips with one or two passengers, we want to find the longest and shortest trip distances in this subset of the data.

In [None]:
longest = df2.nlargest(1, 'TRIP_DISTANCE')
shortest = df2.nsmallest(1,'TRIP_DISTANCE')
print(longest, shortest)
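For readers unfamiliar with `nlargest` and `nsmallest`, the sketch below shows their behavior in plain pandas on a made-up frame (the frame `toy` and its values are assumptions): each call returns the full rows, not just the distance values.

```python
import pandas as pd

# Made-up trips (not real taxi data).
toy = pd.DataFrame({
    "TRIP_DISTANCE": [1.2, 0.0, 15.7, 3.4],
    "PASSENGER_COUNT": [1, 2, 1, 2],
})

# Each call returns a one-row dataframe holding the entire matching trip.
longest = toy.nlargest(1, "TRIP_DISTANCE")
shortest = toy.nsmallest(1, "TRIP_DISTANCE")

print(longest)
print(shortest)
```

A shortest distance of zero, as in this toy data, would usually deserve a second look, since it may indicate a recording error or a cancelled trip rather than a real ride.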

Finally, we would like to see the most common payment methods in these trips. 

In [None]:
df2.groupby(['PAYMENT_TYPE']).size()
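Because `groupby(...).size()` orders its result by group key, the most common payment type is not necessarily listed first. One way to rank the groups is to sort the counts in descending order, sketched here in plain pandas on made-up data (the frame `toy` is an assumption, and this has not been run through Ponder).

```python
import pandas as pd

# Made-up payment types (not real taxi data).
toy = pd.DataFrame({
    "PAYMENT_TYPE": ["cash", "credit card", "credit card", "Dispute", "cash", "credit card"],
})

# Count the trips per payment type, then put the most frequent type first.
counts = toy.groupby("PAYMENT_TYPE").size().sort_values(ascending=False)
print(counts)
```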