# Importing NYC TLC Files

The goal of this notebook is to generate the commands to download and export the NYC TLC data directly from the website.

To do this, we will use the Google Cloud (GC) platform.

Following Google Cloud best practices, since our data is less than 10 TB in size, we can perform this operation from the command prompt of a GC VM instance.

First, we will use the `wget` command to download the data into the VM instance. Then, with the `gsutil` command-line tool, we will export the data to Google Cloud Storage (GCS), where we can easily access it.

## Step 1: Import Dependencies

In [3]:
import datetime

## Step 2: Analyse the format of the links

Inspecting the data links on the webpage, we found the following structure:

`https://d37ci6vzurychx.cloudfront.net/trip-data/{category}_tripdata_{year}-{month}.parquet`

The file links follow a predefined structure with category, year, and month. We need to define these variables to build the links:

In [8]:
months = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]
category = ["yellow", "green", "fhv", "fhvhv"]

now = datetime.datetime.now()
begin = 2020  # first year to download
# Every year from `begin` through the current year, inclusive
years = list(range(begin, now.year + 1))
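
As a minimal sketch, the link structure can be filled in with a small helper. The names `URL_TEMPLATE` and `build_url` are hypothetical and not part of the original notebook; the template string is copied from the structure shown above.

```python
# Template copied from the link structure above.
URL_TEMPLATE = (
    "https://d37ci6vzurychx.cloudfront.net/trip-data/"
    "{category}_tripdata_{year}-{month}.parquet"
)


def build_url(category, year, month):
    """Fill the template; `month` may be an int (1-12) or a zero-padded string."""
    return URL_TEMPLATE.format(
        category=category,
        year=year,
        month=f"{int(month):02d}",  # always two digits, e.g. 1 -> "01"
    )


print(build_url("yellow", 2020, 1))
```

Accepting the month as either an integer or a string keeps the helper usable with the `months` list defined above.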

## Step 3: Download the files via Google VM Instance

Since we first have to download the data into the VM instance and then export it to GCS, we will process the data in batches, where each batch corresponds to a category.

To check for new data, we simply generate the links for every month (Jan to Dec) of the current year. Links for months that have not been published yet will fail, so only the files available so far are downloaded.
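
The availability check could also be done from Python before any command is generated. The sketch below sends a `HEAD` request with the standard library and treats an HTTP error as "not available". It assumes the server answers `HEAD` requests and returns an HTTP error status for unpublished months, which this notebook does not verify. The function name `link_is_available` and its `opener` parameter are hypothetical; `opener` only exists so the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def link_is_available(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if a HEAD request to `url` succeeds, False on an HTTP error."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError:
        # The server replied with an error status (assumed for missing months).
        return False
```

Network failures such as a lost connection are deliberately not caught, so they are not mistaken for a missing file.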

In [10]:
yellow_down = []
green_down = []
fhv_down = []
fhvhv_down = []

for y in years:
  for m in months:
    for t in category:
      if t == "yellow":
        yellow_down.append(f"wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{y}-{m}.parquet;")
      elif t == "green":
        green_down.append(f"wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_{y}-{m}.parquet;")
      elif t == "fhv":
        fhv_down.append(f"wget https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_{y}-{m}.parquet;")
      else:
        fhvhv_down.append(f"wget https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_{y}-{m}.parquet;")

Example:

In [15]:
for i in yellow_down[:3]:
  print(i)

wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2020-01.parquet;
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2020-02.parquet;
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2020-03.parquet;


The commands above can be pasted into the VM terminal to quickly download the data.
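
As an alternative sketch, the same per-category batches can be built with a dictionary instead of four separate lists, and each batch can be joined into one block of text for pasting. The names `BASE_URL`, `categories`, and `download_commands` are hypothetical; the variables are redefined here so the block runs on its own.

```python
import datetime

BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"
categories = ["yellow", "green", "fhv", "fhvhv"]
months = [f"{m:02d}" for m in range(1, 13)]
years = range(2020, datetime.datetime.now().year + 1)

# One list of wget commands per category, ordered by year and then month.
download_commands = {
    c: [f"wget {BASE_URL}/{c}_tripdata_{y}-{m}.parquet;" for y in years for m in months]
    for c in categories
}

# Join one batch into a single block of text that can be pasted at once.
yellow_batch = "\n".join(download_commands["yellow"])
print("\n".join(download_commands["yellow"][:3]))
```

Keying the batches by category means a new category only requires one more entry in `categories`.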

## Step 4: Export data to Google Cloud Storage

In the same way, we will generate the commands to export the data from the VM instance to a bucket in GCS.

We will create a folder structure as follows:

├── mobilab-tech-task-bucket

│   ├── yellow-taxi

│   ├── green-taxi

│   ├── fhv

│   ├── fhvhv
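
The folder layout above can be captured as a mapping from the category name used in the file names to the folder name in the bucket. This is only a sketch: `BUCKET`, `FOLDERS`, and `destination` are hypothetical names, and it assumes the folders already exist in the bucket.

```python
BUCKET = "mobilab-tech-task-bucket"

# Category name in the file names -> folder name in the bucket (from the tree above).
FOLDERS = {
    "yellow": "yellow-taxi",
    "green": "green-taxi",
    "fhv": "fhv",
    "fhvhv": "fhvhv",
}


def destination(category):
    """Return the gs:// URI of the folder that holds files for `category`."""
    return f"gs://{BUCKET}/{FOLDERS[category]}"


print(destination("yellow"))
```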

In [24]:
yellow_cs = []
green_cs = []
fhv_cs = []
fhvhv_cs = []

for y in years:
  for m in months:
    for t in category:
      if t == "yellow":
        yellow_cs.append(f"gsutil cp yellow_tripdata_{y}-{m}.parquet gs://mobilab-tech-task-bucket/yellow-taxi;")
      elif t == "green":
        green_cs.append(f"gsutil cp green_tripdata_{y}-{m}.parquet gs://mobilab-tech-task-bucket/green-taxi;")
      elif t == "fhv":
        fhv_cs.append(f"gsutil cp fhv_tripdata_{y}-{m}.parquet gs://mobilab-tech-task-bucket/fhv;")
      else:
        fhvhv_cs.append(f"gsutil cp fhvhv_tripdata_{y}-{m}.parquet gs://mobilab-tech-task-bucket/fhvhv;")

Example:

In [25]:
for i in yellow_cs[:3]:
  print(i)

gsutil cp yellow_tripdata_2020-01.parquet gs://mobilab-tech-task-bucket/yellow-taxi;
gsutil cp yellow_tripdata_2020-02.parquet gs://mobilab-tech-task-bucket/yellow-taxi;
gsutil cp yellow_tripdata_2020-03.parquet gs://mobilab-tech-task-bucket/yellow-taxi;


## Step 5: Delete Files from VM Instance

In [22]:
yellow_del = []
green_del = []
fhv_del = []
fhvhv_del = []

for y in years:
  for m in months:
    for t in category:
      if t == "yellow":
        yellow_del.append(f"rm yellow_tripdata_{y}-{m}.parquet;")
      elif t == "green":
        green_del.append(f"rm green_tripdata_{y}-{m}.parquet;")
      elif t == "fhv":
        fhv_del.append(f"rm fhv_tripdata_{y}-{m}.parquet;")
      else:
        fhvhv_del.append(f"rm fhvhv_tripdata_{y}-{m}.parquet;")


Example:

In [23]:
for i in yellow_del[:3]:
  print(i)

rm yellow_tripdata_2020-01.parquet;
rm yellow_tripdata_2020-02.parquet;
rm yellow_tripdata_2020-03.parquet;
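
The three steps (download, export, delete) could also be chained per file, so that a file is removed from the VM only after it has been copied to the bucket. The sketch below is one possible way to do this and is not part of the original workflow; `pipeline_command` and its parameters are hypothetical names. It relies on the shell's `&&` operator, which runs the next command only if the previous one succeeded.

```python
def pipeline_command(category, folder, year, month, bucket="mobilab-tech-task-bucket"):
    """Build one shell line that downloads, uploads, and then removes a single file."""
    filename = f"{category}_tripdata_{year}-{month}.parquet"
    url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{filename}"
    # `&&` runs the next command only if the previous one exited successfully.
    return f"wget {url} && gsutil cp {filename} gs://{bucket}/{folder} && rm {filename};"


print(pipeline_command("yellow", "yellow-taxi", 2020, "01"))
```

With this form, a failed download for an unpublished month stops that line early instead of attempting to copy or delete a file that does not exist.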
