## Data discovery: Load and query Yellow Taxi data
> Download the dataset from [the official TLC Trip Record Data website](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

### This cell only shows how to document code
```python
# Path to the downloaded Parquet file
local_file = 'datasets/your-downloaded-from-TLC-taxis-file-here.parquet'

# Read the file and show the first rows
spark.read.parquet(local_file).show()
```

In [None]:
from pyspark.sql import SparkSession

### What is `master("local[N]")`?
The `--master` option (set in this notebook with `.master(...)` on the session builder) specifies the master URL: the URL of a distributed cluster, `local` to run locally with one thread, or `local[N]` to run locally with N threads.

<b>Source</b>: See the Spark [docs here](https://spark.apache.org/docs/latest/). See all [options here](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls)
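
To make the local master URL forms concrete, here is a minimal sketch in plain Python. `local_thread_count` is a hypothetical helper written only for illustration; it is not part of PySpark and it covers only the `local`, `local[N]`, and `local[*]` forms (the last one uses as many threads as there are logical cores).

```python
import os
import re
from typing import Optional


def local_thread_count(master: str) -> Optional[int]:
    """Return the number of worker threads implied by a local master URL.

    Hypothetical helper for illustration only; it is not part of PySpark.
    Returns None for anything that is not one of the handled local forms.
    """
    if master == "local":
        return 1  # plain "local" runs with a single thread
    match = re.fullmatch(r"local\[(\d+|\*)\]", master)
    if match is None:
        return None  # e.g. a cluster URL such as spark://host:7077
    value = match.group(1)
    if value == "*":
        # local[*] uses as many threads as there are logical cores
        return os.cpu_count()
    return int(value)


for url in ("local", "local[1]", "local[4]", "spark://host:7077"):
    print(url, "->", local_thread_count(url))
```

By this reading, the `local[1]` used in this notebook runs Spark with a single worker thread, which is enough for exploring a small sample of the data.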

In [None]:
# Create SparkSession
spark = SparkSession.builder\
             .master("local[1]")\
             .appName("spark-app-version-x")\
             .getOrCreate()

In [None]:
# Read taxi data
local_file = 'datasets/parquet/'
df = spark.read.parquet(local_file)

In [None]:
# A DataFrame is like a relational table in memory. Let's see the columns
df.printSchema()

In [None]:
# Query sample: select two columns and keep rows where total_amount > 1
df2 = df.select('VendorID','total_amount').where('total_amount > 1')

In [None]:
# Query sample (commented out): the same query, showing the first 5 rows
# df.select('VendorID','total_amount').where('total_amount > 1').show(n=5)

In [None]:
# Query sample, using Spark SQL: register the DataFrame as a temporary view
df.createOrReplaceTempView('tbl_raw_yellow_taxis')

In [None]:
# SQL statement: trips with total_amount > 1 and more than 2 passengers
spark.sql('select VendorID, tpep_pickup_datetime, passenger_count '
          'from tbl_raw_yellow_taxis '
          'where total_amount > 1 and passenger_count > 2').show(n=5)

In [None]:
# Stop the session
spark.stop()