# Exploring the NYC taxi data

In Project 2, you will work with the [NYC taxi trip data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). Every month, the City of New York publishes open data containing a record of every taxi ride taken in the city that month.

The function `get_taxi_data()` is provided for you in `utils.py` to easily download and read data for a particular month and type of taxi. You should use it in your project.

Open `utils.py` in VSCode, study it carefully, and try the example below. If you are not sure how it works, ask a tutor!

In [None]:
import pandas as pd

# Import the function get_taxi_data() from utils.py
from utils import get_taxi_data

In [None]:
# Example: get yellow taxi data for January 2022
cols_to_read = ['tpep_pickup_datetime',
                'tpep_dropoff_datetime',
                'passenger_count',
                'trip_distance',
                'fare_amount']

# Download the data and get the specified columns, save the file locally
df = get_taxi_data('2022', '01', 'yellow', columns=cols_to_read, save=True)
df.info()

In [None]:
# Now, get the data for only these 3 columns.
# The file is already saved from the previous command, so this should be faster!
cols_to_read = ['tpep_pickup_datetime',
                'tpep_dropoff_datetime',
                'trip_distance']

# We also don't need to save this as it's a subset of the file we already have.
df = get_taxi_data('2022', '01', 'yellow', columns=cols_to_read)
df.info()

In [None]:
# Now, I want the same data, but I need a new column 'total_amount' which is not in my current file.
cols_to_read = ['fare_amount',
                'total_amount']

# The function tries to get the columns from the existing data file,
# but can't find all of them ('total_amount' is missing), so it automatically re-downloads the data.
df = get_taxi_data('2022', '01', 'yellow', columns=cols_to_read)
df.info()

Now, choose another month and type of vehicle, use `get_taxi_data()` to obtain the data, and start exploring the dataset!
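
If you are not sure where to start, here is a minimal sketch of a first exploration. It uses a tiny hand-made dataframe with invented values as a stand-in for the output of `get_taxi_data()`, so the numbers mean nothing; the column names match the yellow taxi example above, and `duration_min` and `trips_per_day` are names chosen here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a dataframe returned by get_taxi_data();
# the values are invented for illustration only.
sample = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime(['2022-01-01 00:10:00',
                                            '2022-01-01 08:30:00',
                                            '2022-01-02 17:45:00']),
    'tpep_dropoff_datetime': pd.to_datetime(['2022-01-01 00:25:00',
                                             '2022-01-01 09:00:00',
                                             '2022-01-02 18:05:00']),
    'trip_distance': [2.5, 6.0, 3.2],
    'fare_amount': [11.0, 22.5, 14.0],
})

# Derive the trip duration in minutes from the two timestamps
trip_time = sample['tpep_dropoff_datetime'] - sample['tpep_pickup_datetime']
sample['duration_min'] = trip_time.dt.total_seconds() / 60

# Summary statistics for the numerical columns
print(sample[['trip_distance', 'fare_amount', 'duration_min']].describe())

# Number of trips per pickup day
trips_per_day = sample.groupby(sample['tpep_pickup_datetime'].dt.date).size()
print(trips_per_day)
```

With real data, the same steps should work on the dataframe returned by `get_taxi_data()`, provided you requested these columns.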

---

## Important tips about memory usage

Some of the data files are very large (several gigabytes!). Depending on how much RAM (memory) your computer has, you may not be able to read an entire data file at once into a single dataframe.

### Specify `columns`

The `columns` input argument lets you select which columns to include in your dataframe. You should always specify only the columns you need when you read data, to avoid loading unnecessary data into memory.
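
To get a feel for how much difference this makes, you can measure the memory footprint of a dataframe with `memory_usage()`. The sketch below uses a synthetic dataframe filled with ones rather than real taxi data, so only the ratio between the two figures is meaningful; `n_rows`, `full` and `subset` are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data (not real taxi data): four float columns of equal size
n_rows = 100_000
full = pd.DataFrame({
    'trip_distance': np.ones(n_rows),
    'fare_amount': np.ones(n_rows),
    'total_amount': np.ones(n_rows),
    'passenger_count': np.ones(n_rows),
})

# Keep only the two columns we actually need
subset = full[['trip_distance', 'fare_amount']]

# memory_usage() returns the size of each column in bytes
full_mb = full.memory_usage(deep=True).sum() / 1e6
subset_mb = subset.memory_usage(deep=True).sum() / 1e6

print(f'All 4 columns:  {full_mb:.2f} MB')
print(f'Only 2 columns: {subset_mb:.2f} MB')
```

Keeping half of the columns takes roughly half the memory here, because all columns have the same type. With real data the saving depends on the types of the columns you leave out.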

### Save your processed data into CSV files

To create your report, you will select specific parts of the data, and likely perform some cleaning and/or aggregation on them. You may wish to save your data into CSV files at intermediate steps of your processing, so that you can load them directly the next time you start your notebook, instead of redoing all the processing every time you restart Jupyter.
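
One possible way to do this with pandas is sketched below. The dataframe is a small invented stand-in for your processed data, and the file name `yellow_2022_01_cleaned.csv` is only an example; a temporary folder is used so the sketch runs anywhere, but in your project you would probably save next to your notebook.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for data obtained with get_taxi_data()
trips = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime(['2022-01-01 00:10:00',
                                            '2022-01-01 08:30:00',
                                            '2022-01-02 17:45:00']),
    'trip_distance': [2.5, 0.0, 3.2],
    'fare_amount': [11.0, 3.0, 14.0],
})

# Example cleaning step: keep only trips with a positive distance
cleaned = trips[trips['trip_distance'] > 0]

# Save the intermediate result (index=False avoids writing the row labels)
csv_path = os.path.join(tempfile.mkdtemp(), 'yellow_2022_01_cleaned.csv')
cleaned.to_csv(csv_path, index=False)

# Next session: load the saved file directly instead of re-processing.
# CSV files store dates as text, so ask pandas to parse them again.
reloaded = pd.read_csv(csv_path, parse_dates=['tpep_pickup_datetime'])
reloaded.info()
```

Note the `parse_dates` argument: without it, the datetime column would come back as plain strings.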

---