
In [1]:
import pandas as pd
import numpy as np
from google.colab import files
from tqdm import tqdm
import requests 

## Generators 

Generator expressions resemble comprehensions, but use lazy evaluation. This means **elements are generated one at a time**, so they are never all in memory simultaneously. This is extremely helpful when operating at the limits of available memory.
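
To make the difference concrete, here is a minimal sketch comparing a list comprehension with the equivalent generator expression. The names `squares_list` and `squares_gen` are purely illustrative, and the exact sizes reported by `sys.getsizeof` depend on the Python build.

```python
import sys

n = 1_000_000

# List comprehension: all one million results exist in memory at once.
squares_list = [i * i for i in range(n)]

# Generator expression: nothing is computed until a value is requested.
squares_gen = (i * i for i in range(n))

# The list object is far larger than the generator object,
# whose size does not grow with n.
print(sys.getsizeof(squares_list))
print(sys.getsizeof(squares_gen))

# Both give the same total; the generator yields one element at a time.
total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)

# A generator can be consumed only once: a second pass yields nothing.
leftover = sum(squares_gen)
```

Note the last line: once a generator is exhausted it has to be recreated before it can be iterated again.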

In [2]:
year = 2013
month = 1
month = "0"+str(month) if month < 10 else str(month)  # always a string, as "{:s}" requires
file_location = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_{:d}-{:s}.csv".format(year, month)
file_location

'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-01.csv'

In [3]:
%%time
for i in tqdm(range(1,13),unit="csv_file"):
    month = "0"+str(i) if i < 10 else str(i)
    file_location = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_{:d}-{:s}.csv".format(2013, month)
    # get the response from the link
    req = requests.get(file_location)

    output_file_name = "yellow_tripdata_{:d}-{:s}.csv".format(2013, month)
    # save the response content as a CSV file
    with open(output_file_name, "wb") as csv_file:
        csv_file.write(req.content)
    del req # deleting the request once the csv file is downloaded, to clear up the RAM

100%|██████████| 12/12 [14:26<00:00, 72.21s/csv_file]

CPU times: user 1min 42s, sys: 1min 19s, total: 3min 2s
Wall time: 14min 26s





Because the data is too large to load into memory all at once, we will use generators, which rely on lazy evaluation: the files are processed one at a time, so the full dataset is never held in memory simultaneously.


The task is to count the long trips, meaning those that last more than 1200 seconds. Before that, we need to calculate each trip's duration and convert it to seconds.

First, we need to see the column names in the dataframe. Because the data volume is huge, we read one CSV file in chunks and inspect only the first chunk:

In [33]:
for chunk in pd.read_csv("yellow_tripdata_2013-11.csv", chunksize=100, parse_dates=[1,2]):
    print(chunk.head())
    print(chunk.info())
    print(chunk.columns)
    break

  vendor_id     pickup_datetime  ... tolls_amount  total_amount
0       CMT 2013-11-25 15:53:33  ...          0.0           8.5
1       CMT 2013-11-25 15:24:41  ...          0.0           9.0
2       CMT 2013-11-25 09:43:42  ...          0.0          17.5
3       CMT 2013-11-25 06:49:58  ...          0.0          17.4
4       CMT 2013-11-25 10:02:12  ...          0.0          14.5

[5 rows x 18 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   vendor_id           100 non-null    object        
 1   pickup_datetime     100 non-null    datetime64[ns]
 2   dropoff_datetime    100 non-null    datetime64[ns]
 3   passenger_count     100 non-null    int64         
 4   trip_distance       100 non-null    float64       
 5   pickup_longitude    100 non-null    float64       
 6   pickup_latitude     100 non-null    

The difference between dropoff_datetime and pickup_datetime gives the trip duration, which then needs to be converted to seconds:

In [34]:
(chunk["dropoff_datetime"] - chunk["pickup_datetime"]).dt.seconds

0      438
1      337
2     1155
3      864
4      903
      ... 
95    1111
96     597
97    1163
98     224
99     373
Length: 100, dtype: int64
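
One caveat worth keeping in mind: `.dt.seconds` returns only the seconds component of each timedelta (a value from 0 to 86399), while `.dt.total_seconds()` returns the full signed duration. The two agree for ordinary trips but differ for trips longer than a day or for records whose drop-off precedes the pickup. The sketch below uses a small hypothetical dataframe, not the taxi data, to show the difference.

```python
import pandas as pd

# Hypothetical toy data: a normal trip, a trip longer than one day,
# and a record whose drop-off precedes its pickup (a data error).
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2013-11-25 10:00:00", "2013-11-25 10:00:00", "2013-11-25 10:00:00"]
    ),
    "dropoff_datetime": pd.to_datetime(
        ["2013-11-25 10:10:00", "2013-11-26 10:10:00", "2013-11-25 09:55:00"]
    ),
})

delta = trips["dropoff_datetime"] - trips["pickup_datetime"]

# .dt.seconds: seconds component only; whole days are dropped and a
# negative duration is stored as -1 day plus a large seconds value.
seconds_component = delta.dt.seconds.tolist()

# .dt.total_seconds(): full signed duration as a float.
total_seconds = delta.dt.total_seconds().tolist()

print(seconds_component)  # [600, 600, 86100]
print(total_seconds)      # [600.0, 87000.0, -300.0]
```

With `.dt.seconds`, the erroneous third record would be counted as a trip of 86100 seconds, so whether to filter such rows first is a choice that depends on the data.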

In [44]:
def count_long_trips(df):
    '''
    Takes a dataframe of trips and returns a one-row dataframe with the
    number of long trips (duration above 1200 seconds) and the total
    number of trips. Adds a 'duration' column (in seconds) to the input.
    '''
    df['duration'] = (df['dropoff_datetime'] - df['pickup_datetime']).dt.seconds
    is_long_trip = df['duration'] > 1200
    # Series.sum() counts the True values without a slow Python-level loop
    result_dict = {"n_long_trips": [is_long_trip.sum()], "n_total_trips": [len(df)]}
    return pd.DataFrame(result_dict)

In [45]:
filenames = []
for i in range(1,13):
    month = "0"+str(i) if i < 10 else str(i)
    filenames.append("yellow_tripdata_2013-{:s}.csv".format(month))

filenames

['yellow_tripdata_2013-01.csv',
 'yellow_tripdata_2013-02.csv',
 'yellow_tripdata_2013-03.csv',
 'yellow_tripdata_2013-04.csv',
 'yellow_tripdata_2013-05.csv',
 'yellow_tripdata_2013-06.csv',
 'yellow_tripdata_2013-07.csv',
 'yellow_tripdata_2013-08.csv',
 'yellow_tripdata_2013-09.csv',
 'yellow_tripdata_2013-10.csv',
 'yellow_tripdata_2013-11.csv',
 'yellow_tripdata_2013-12.csv']
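
As an aside, the zero-padding of the month can also be handled by the format specification itself. This is a minimal sketch of an equivalent way to build the same list; it is an alternative, not what the cell above does.

```python
# "{:02d}" pads the month number with a leading zero to a width of two.
filenames = ["yellow_tripdata_2013-{:02d}.csv".format(month) for month in range(1, 13)]

print(filenames[0])
print(filenames[-1])
```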

In [46]:
%%time
dataframes = (pd.read_csv(fname, parse_dates=[1,2]) for fname in filenames)  ## generator
totals = (count_long_trips(df) for df in dataframes) ## generator for calculating the long trips
annual_totals = sum(totals)



CPU times: user 11min 11s, sys: 58.3 s, total: 12min 9s
Wall time: 13min 49s
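
The pipeline above still loads one whole monthly file at a time. If even a single file were too large for memory, the same lazy idea could be pushed one level down by reading each file in chunks. The following is a rough sketch under stated assumptions: the helpers `count_long_trips_chunk` and `iter_chunks` are hypothetical names, the tiny CSV written to a temporary directory stands in for a real monthly file, and the date columns are selected by name rather than by position.

```python
import os
import tempfile

import pandas as pd


def count_long_trips_chunk(chunk, threshold=1200):
    """Return (long trips, total trips) for one chunk without mutating it."""
    duration = (chunk["dropoff_datetime"] - chunk["pickup_datetime"]).dt.seconds
    return int((duration > threshold).sum()), len(chunk)


def iter_chunks(paths, chunksize):
    """Lazily yield dataframe chunks from each CSV file in turn."""
    for path in paths:
        yield from pd.read_csv(
            path,
            chunksize=chunksize,
            parse_dates=["pickup_datetime", "dropoff_datetime"],
        )


# Hypothetical stand-in for one monthly file: four trips, two over 1200 s.
sample = pd.DataFrame({
    "pickup_datetime": ["2013-01-01 10:00:00"] * 4,
    "dropoff_datetime": [
        "2013-01-01 10:05:00",  # 300 s
        "2013-01-01 10:30:00",  # 1800 s
        "2013-01-01 10:15:00",  # 900 s
        "2013-01-01 11:00:00",  # 3600 s
    ],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_trips.csv")
    sample.to_csv(path, index=False)

    n_long = n_total = 0
    # At most `chunksize` rows are parsed and held per iteration.
    for chunk in iter_chunks([path], chunksize=2):
        long_trips, total_trips = count_long_trips_chunk(chunk)
        n_long += long_trips
        n_total += total_trips

print(n_long, n_total)  # 2 4
```

The chunk size trades memory for overhead: larger chunks mean fewer iterations but a higher peak memory footprint.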


In [47]:
annual_totals

   n_long_trips  n_total_trips
0      26645689      173179759


In [48]:
annual_totals["n_long_trips"]/annual_totals["n_total_trips"]

0    0.153861
dtype: float64

Roughly 15% of the trips in 2013 were long trips, that is, longer than 1200 seconds.