## Import NYC Yellow Taxi data into Data Lake
First, let's load the NYC Yellow Taxi trips for the years 2014-2018 into a Spark DataFrame.


In [3]:
from azureml.opendatasets import NycTlcYellow
from dateutil import parser

# Date range to load (2014 through 2018).
start_date = parser.parse('2014-01-01')
end_date = parser.parse('2018-12-31')

# Fetch the trips from Azure Open Datasets as a Spark DataFrame.
nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_spark_dataframe()
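
If loading all five years in one call turns out to be too heavy for the Spark pool, one option is to split the date range into per-year windows and load them one at a time. The following is a minimal sketch under that assumption; `yearly_windows` is a hypothetical helper written for illustration and is not part of `azureml.opendatasets`.

```python
from datetime import datetime


def yearly_windows(start, end):
    """Split [start, end] into consecutive per-calendar-year (start, end) pairs.

    Hypothetical helper for illustration only.
    """
    if start > end:
        raise ValueError("start must not be after end")
    windows = []
    for year in range(start.year, end.year + 1):
        # Clamp each calendar year to the requested range.
        win_start = max(start, datetime(year, 1, 1))
        win_end = min(end, datetime(year, 12, 31))
        windows.append((win_start, win_end))
    return windows


windows = yearly_windows(datetime(2014, 1, 1), datetime(2018, 12, 31))
for win_start, win_end in windows:
    # In the notebook, each pair could be passed to
    # NycTlcYellow(start_date=win_start, end_date=win_end).
    print(win_start.date(), "->", win_end.date())
```

Each window could then be loaded and appended to the target table separately, which keeps the size of any single load smaller.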

## Create a Hive table
Save the DataFrame to the nyctaxitrips table in the default database. This table will be accessible by the SQL On Demand engine.


In [3]:
%%sql
DROP TABLE IF EXISTS nyctaxitrips

In [None]:
nyc_tlc_df.write.mode('overwrite').format('parquet').saveAsTable('nyctaxitrips')

Let's count the rows in the newly created table.

In [6]:
%%sql
SELECT COUNT(*) FROM nyctaxitrips
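
The same count can also be obtained from Python through the DataFrame API, which is handy when the result is needed in later cells. This is a small sketch; `count_rows` is a hypothetical helper name, and it assumes the `spark` session object that the notebook provides.

```python
def count_rows(spark, table_name):
    """Return the number of rows in a catalog table via the DataFrame API.

    `spark` is expected to be a SparkSession (predefined in the notebook).
    """
    return spark.table(table_name).count()
```

In the notebook this would be called as `count_rows(spark, 'nyctaxitrips')`.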

## Write data to the data lake
Save the data from the table to the data lake in Parquet format.


In [7]:
df = spark.sql('SELECT * FROM nyctaxitrips')
df.write.format('parquet').save('/nyc/')
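
To check that the export is complete, the Parquet files can be read back and their row count compared with the table. The sketch below assumes the `spark` session from the notebook and the `/nyc/` path used above; `parquet_matches_table` is a hypothetical helper name.

```python
def parquet_matches_table(spark, table_name, path):
    """Return True if the Parquet data at `path` has as many rows as the table."""
    table_rows = spark.table(table_name).count()
    file_rows = spark.read.parquet(path).count()
    return table_rows == file_rows
```

In the notebook this would be called as `parquet_matches_table(spark, 'nyctaxitrips', '/nyc/')`. It only compares row counts, so it does not verify that the contents are identical.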