<a href="https://colab.research.google.com/github/omanofx/portfolio/blob/main/TLC_Trip_Record_Data/TLC_Trip_Record_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TLC Trip Record Data

Yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The trip data was not created by the TLC, and TLC makes no representations as to the accuracy of these data.

For-Hire Vehicle (“FHV”) trip records include fields capturing the dispatching base license number and the pick-up date, time, and taxi zone location ID (shape file below). These records are generated from the FHV Trip Record submissions made by bases. Note: The TLC publishes base trip record data as submitted by the bases, and we cannot guarantee or confirm their accuracy or completeness. Therefore, this may not represent the total amount of trips dispatched by all TLC-licensed bases. The TLC performs routine reviews of the records and takes enforcement actions when necessary to ensure, to the extent possible, complete and accurate information.

ATTENTION!

On 05/13/2022, we are making the following changes to trip record files:

    All files will be stored in the PARQUET format. Please see the ‘Working With PARQUET Format’ under the Data Dictionaries and MetaData section.
    Trip data will be published monthly (with two months delay) instead of bi-annually.
    HVFHV files will now include 17 more columns (please see High Volume FHV Trips Dictionary for details). Additional columns will be added to the old files as well. The earliest date to include additional columns: February 2019.
    Yellow trip data will now include 1 additional column (‘airport_fee’, please see Yellow Trips Dictionary for details). The additional column will be added to the old files as well. The earliest date to include the additional column: January 2011.


https://aws.amazon.com/marketplace/pp/prodview-okyonroqg5b2u?sr=0-2&ref_=beagle&applicationId=AWSMPContessa#overview

https://github.com/aws-samples/cloud-experiments/tree/master/experiments/notebooks/exploring-data

https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

In [1]:
!pip install pandas pyarrow



In [None]:
#!pip install pandas fastparquet

Example 1: Reading a Parquet file

If you have a Parquet file stored locally or in a cloud service (such as AWS S3), you can read it with pandas.
Reading a Parquet file with pyarrow:

In [None]:
import pandas as pd

# Read the Parquet file (replace the placeholder path with a real one)
df = pd.read_parquet('path_to_file.parquet', engine='pyarrow')

# Show the first rows
print(df.head())

Example 3: Reading and writing Parquet files from AWS S3

You can work directly with Parquet files stored in Amazon S3 by using the s3fs library to handle the S3 connection.
Installing s3fs:

In [2]:
!pip install s3fs

Collecting s3fs
  Downloading s3fs-2024.9.0-py3-none-any.whl.metadata (1.6 kB)
Collecting aiobotocore<3.0.0,>=2.5.4 (from s3fs)
  Downloading aiobotocore-2.15.0-py3-none-any.whl.metadata (23 kB)
Collecting fsspec==2024.9.0.* (from s3fs)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Collecting botocore<1.35.17,>=1.35.16 (from aiobotocore<3.0.0,>=2.5.4->s3fs)
  Downloading botocore-1.35.16-py3-none-any.whl.metadata (5.7 kB)
Collecting aioitertools<1.0.0,>=0.5.1 (from aiobotocore<3.0.0,>=2.5.4->s3fs)
  Downloading aioitertools-0.12.0-py3-none-any.whl.metadata (3.8 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from botocore<1.35.17,>=1.35.16->aiobotocore<3.0.0,>=2.5.4->s3fs)
  Downloading jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB)
Downloading s3fs-2024.9.0-py3-none-any.whl (29 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (179 kB)

In [3]:
!pip install --upgrade gcsfs

Collecting gcsfs
  Downloading gcsfs-2024.9.0.post1-py2.py3-none-any.whl.metadata (1.6 kB)
Downloading gcsfs-2024.9.0.post1-py2.py3-none-any.whl (34 kB)
Installing collected packages: gcsfs
  Attempting uninstall: gcsfs
    Found existing installation: gcsfs 2024.6.1
    Uninstalling gcsfs-2024.6.1:
      Successfully uninstalled gcsfs-2024.6.1
Successfully installed gcsfs-2024.9.0.post1


In [4]:
!pip install fsspec==2024.6.1

Collecting fsspec==2024.6.1
  Using cached fsspec-2024.6.1-py3-none-any.whl.metadata (11 kB)
Using cached fsspec-2024.6.1-py3-none-any.whl (177 kB)
Installing collected packages: fsspec
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2024.9.0
    Uninstalling fsspec-2024.9.0:
      Successfully uninstalled fsspec-2024.9.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.9.0.post1 requires fsspec==2024.9.0, but you have fsspec 2024.6.1 which is incompatible.
s3fs 2024.9.0 requires fsspec==2024.9.0.*, but you have fsspec 2024.6.1 which is incompatible.
Successfully installed fsspec-2024.6.1
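
The pip log above reports that both s3fs 2024.9.0 and gcsfs 2024.9.0.post1 expect fsspec 2024.9.0, so pinning fsspec to 2024.6.1 leaves the environment in a mismatched state. Before reading from S3, it may be worth checking which versions are actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(["fsspec", "s3fs", "gcsfs"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

If the fsspec version printed here does not match what s3fs and gcsfs require, reinstalling the three packages at matching releases is one way to resolve the conflict.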


Reading a Parquet file from S3:

In [None]:
import pandas as pd

# Read a Parquet file from S3 (replace the placeholder bucket, path, and credentials)
df = pd.read_parquet(
    's3://my-bucket/path_to_file.parquet',
    engine='pyarrow',
    storage_options={'key': 'my-access-key', 'secret': 'my-secret-key'},
)

print(df.head())
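
Hard-coding an access key and secret in a notebook makes them easy to leak. One alternative, sketched below, builds the `storage_options` dictionary from the standard AWS environment variables and falls back to anonymous access when none are set. The helper name `build_storage_options` is hypothetical; the `key`, `secret`, `token`, and `anon` entries are the options s3fs accepts, and whether anonymous access works depends on the bucket being public.

```python
import os


def build_storage_options(env=None):
    """Build s3fs storage options from environment variables.

    Falls back to anonymous access when no credentials are present.
    """
    env = os.environ if env is None else env
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        options = {"key": key, "secret": secret}
        token = env.get("AWS_SESSION_TOKEN")  # only set for temporary credentials
        if token:
            options["token"] = token
        return options
    return {"anon": True}


# Demonstrate both branches with explicit dictionaries instead of the real environment.
print(build_storage_options({"AWS_ACCESS_KEY_ID": "example-key",
                             "AWS_SECRET_ACCESS_KEY": "example-secret"}))
print(build_storage_options({}))
```

The result can then be passed as `storage_options=build_storage_options()` in the `pd.read_parquet` call.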

Advantages of the Parquet format

    Storage efficiency: Because it is columnar, Parquet compresses data very well, which reduces disk usage.
    Faster reads: You can load only the columns you need, saving time and memory.
    Compatibility: It works with many tools in the Big Data ecosystem (such as Spark, Hadoop, etc.).

In short, Parquet is well suited to working with large datasets, and it integrates easily with pandas and other data-processing environments in Python.