# Chicago Taxi Fare Prediction - Data Acquisition

The Chicago Taxi dataset is available at https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew

The dataset is around 70GB in size and contains more than 193M records. It is also available as a public dataset on Google Cloud, where it can be queried through BigQuery.

The code below shows how to access the data through BigQuery.

Install the Google Cloud BigQuery client library (the API for connecting to BigQuery) and pandas-gbq (for loading query results into a pandas DataFrame)

In [1]:
#pip install --upgrade google-cloud-bigquery[pandas]
#pip install --user pandas-gbq -U

Import Required Libraries

In [2]:
from google.cloud import bigquery
import pandas as pd
import pandas_gbq

Set Environment variable for Google Credentials

In [3]:
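import os
# Point the Google client libraries at the service-account key in the working directory
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.join(os.getcwd(), "Key.json")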
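
If the key file is missing or misplaced, the problem typically only shows up later as an authentication error. A minimal sketch of an early check, assuming the service-account key is the `Key.json` file referenced above; the helper name `set_credentials` is hypothetical and not part of any Google library:

```python
import os
from pathlib import Path


def set_credentials(key_path):
    """Point GOOGLE_APPLICATION_CREDENTIALS at key_path, failing early if it is missing."""
    path = Path(key_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Service-account key not found: {path}")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(path)
    return path


# Usage (assumes Key.json sits in the current working directory):
# set_credentials("Key.json")
```

Resolving to an absolute path keeps the setting valid even if the notebook later changes its working directory.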

Create the BigQuery client

In [4]:
bg_client = bigquery.Client(project='bigquery-public-data')

Get the dataset and list all of its tables

In [5]:
# Build a reference to the public dataset, then fetch its metadata
# (DatasetReference replaces the deprecated Client.dataset() helper)
data_set_ref = bigquery.DatasetReference('bigquery-public-data', 'chicago_taxi_trips')
data_set = bg_client.get_dataset(data_set_ref)
for tab in bg_client.list_tables(data_set):
    print(tab.table_id)

taxi_trips


List all the columns and the corresponding details for the given table

In [6]:
tab = bg_client.get_table(data_set.table('taxi_trips'))
tab.schema

[SchemaField('unique_key', 'STRING', 'REQUIRED', 'Unique identifier for the trip.', ()),
 SchemaField('taxi_id', 'STRING', 'REQUIRED', 'A unique identifier for the taxi.', ()),
 SchemaField('trip_start_timestamp', 'TIMESTAMP', 'NULLABLE', 'When the trip started, rounded to the nearest 15 minutes.', ()),
 SchemaField('trip_end_timestamp', 'TIMESTAMP', 'NULLABLE', 'When the trip ended, rounded to the nearest 15 minutes.', ()),
 SchemaField('trip_seconds', 'INTEGER', 'NULLABLE', 'Time of the trip in seconds.', ()),
 SchemaField('trip_miles', 'FLOAT', 'NULLABLE', 'Distance of the trip in miles.', ()),
 SchemaField('pickup_census_tract', 'INTEGER', 'NULLABLE', 'The Census Tract where the trip began. For privacy, this Census Tract is not shown for some trips.', ()),
 SchemaField('dropoff_census_tract', 'INTEGER', 'NULLABLE', 'The Census Tract where the trip ended. For privacy, this Census Tract is not shown for some trips.', ()),
 SchemaField('pickup_community_area', 'INTEGER', 'NULLABLE', '
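
The raw `SchemaField` list is hard to scan. As a rough sketch, the fields can be flattened into plain dictionaries and shown as a table; this relies only on the `name`, `field_type`, `mode`, and `description` attributes of each field, and `schema_rows` is a hypothetical helper name:

```python
def schema_rows(schema):
    """Flatten SchemaField-like objects into plain dicts, one per column."""
    return [
        {
            "name": field.name,
            "type": field.field_type,
            "mode": field.mode,
            "description": field.description,
        }
        for field in schema
    ]


# Usage with the table object from the cell above (needs BigQuery access):
# pd.DataFrame(schema_rows(tab.schema))
```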

Get the total number of rows from the table

In [7]:
tab.num_rows

193151452
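
Combining this row count with the roughly 70GB quoted earlier gives a rough per-row size, which helps when deciding how many rows fit comfortably in memory. A back-of-the-envelope sketch, assuming 1 GB = 10^9 bytes and uniformly sized rows; the 70GB figure is approximate, the size stored in BigQuery may differ (`tab.num_bytes` reports it), and `estimate_sample_mb` is a hypothetical helper:

```python
TOTAL_BYTES = 70 * 10**9   # approximate dataset size quoted earlier (assumption: decimal GB)
TOTAL_ROWS = 193_151_452   # tab.num_rows from the cell above


def estimate_sample_mb(sample_rows, total_bytes=TOTAL_BYTES, total_rows=TOTAL_ROWS):
    """Estimate the size in MB of a sample, assuming every row has the average size."""
    bytes_per_row = total_bytes / total_rows
    return sample_rows * bytes_per_row / 10**6


print(f"{TOTAL_BYTES / TOTAL_ROWS:.0f} bytes per row on average")
print(f"{estimate_sample_mb(1_000_000):.0f} MB for a 1M-row sample of all columns")
```

Selecting only a few columns, as the query in the next step does, would bring the per-row size well below this average.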

Given that there are more than 193 million rows, let's create a sample query and load a subset of the data into a pandas DataFrame

In [8]:
QUERY = """
SELECT
    unique_key, taxi_id, trip_start_timestamp, trip_end_timestamp, trip_seconds, trip_miles
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
LIMIT 10
"""
# pandas_gbq.read_gbq replaces the deprecated pd.read_gbq wrapper
df = pandas_gbq.read_gbq(QUERY)
df

Downloading: 100%|██████████| 10/10 [00:00<00:00, 47.01rows/s]


| | unique_key | taxi_id | trip_start_timestamp | trip_end_timestamp | trip_seconds | trip_miles |
|---|---|---|---|---|---|---|
| 0 | af820d83a3ca0c0efd742073602eae8296511c64 | 31e64df976779ba4fa36cdff7e93b2e388e9d0dda8ec69... | 2014-05-21 14:45:00+00:00 | 2014-05-21 14:45:00+00:00 | 0 | 0.0 |
| 1 | 1a6ea96ac0622cc493429534f90d8b092ea603e3 | daf945fb8e25eb32a49824dda26e56441467cdb86f7d9d... | 2014-04-17 17:00:00+00:00 | 2014-04-17 17:00:00+00:00 | 0 | 0.0 |
| 2 | 136ec0dd2ba1b2d387f3df6a6703e0a3b1f0b0c9 | c069b62695a13c54d43a5208353bcfc999dc9548f71041... | 2014-06-21 01:45:00+00:00 | 2014-06-21 01:45:00+00:00 | 0 | 0.0 |
| 3 | 26f85de01e046e0d243665af019db87617ade589 | 249ef6f75a49feebb50f4bc68cf7ba703c4006498c63b6... | 2014-05-14 13:00:00+00:00 | 2014-05-14 13:00:00+00:00 | 0 | 0.0 |
| 4 | 004869f23f2662487ac1a12a0921710bb2e0a36a | 6bece81c8b02e5631185bb018734a4c6f31b1db05d14f6... | 2014-05-13 15:00:00+00:00 | 2014-05-13 15:00:00+00:00 | 0 | 0.0 |
| 5 | 912a2edda71d9d502a6a8680b2282baf9a6e7425 | f8d3b9a91df83387f39b14924f52dc76b879eb5c27ea76... | 2014-05-24 20:45:00+00:00 | 2014-05-24 20:45:00+00:00 | 0 | 0.0 |
| 6 | 836cf7729e1f73ddfc88ee5bdb9de13b58db22e0 | 2a01a33af1d2c0d4dfe3a39e46fc6dc8070d1ae4c532e6... | 2014-06-20 20:30:00+00:00 | 2014-06-20 20:30:00+00:00 | 60 | 0.0 |
| 7 | 2b88d16d33b0d931f7ee27a5e8135a07bd7a7990 | 4c8b6783201bdc422fd78043aceeea92a005af4c37bba7... | 2015-05-04 14:00:00+00:00 | 2015-05-04 14:00:00+00:00 | 420 | 0.0 |
| 8 | 0f51d320ba30e09b8b03af031acfa04d95dcad74 | e42e04fdf9a0e3051134941539806a37a413ddc9289287... | 2014-04-22 11:15:00+00:00 | 2014-04-22 11:15:00+00:00 | 0 | 0.0 |
| 9 | a046a61f9dfd07bc4766adccc1e79c8501f092f8 | 8001d19894ae43c347584201b713ceb2daf35eeb9bdc77... | 2014-04-17 11:30:00+00:00 | 2014-04-17 11:30:00+00:00 | 0 | 0.0 |
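
`LIMIT 10` returns an arbitrary set of rows rather than a random sample. One common pattern for a repeatable pseudo-random subset is to hash a key column and keep a fixed fraction of the hash values. A minimal sketch, assuming `unique_key` is a suitable hash input; `build_sample_query`, `one_in`, and `limit` are hypothetical names used for illustration:

```python
TABLE = "bigquery-public-data.chicago_taxi_trips.taxi_trips"
COLUMNS = ["unique_key", "taxi_id", "trip_start_timestamp",
           "trip_end_timestamp", "trip_seconds", "trip_miles"]


def build_sample_query(columns, table, one_in=1000, limit=None):
    """Build a query keeping roughly 1 row in `one_in`, chosen by hashing unique_key."""
    if one_in < 1:
        raise ValueError("one_in must be a positive integer")
    query = (
        f"SELECT {', '.join(columns)}\n"
        f"FROM `{table}`\n"
        # The same key always hashes to the same value, so the sample is repeatable
        f"WHERE MOD(FARM_FINGERPRINT(unique_key), {int(one_in)}) = 0"
    )
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    return query


print(build_sample_query(COLUMNS, TABLE, one_in=1000))

# Usage (needs credentials and a billing project):
# df = pandas_gbq.read_gbq(build_sample_query(COLUMNS, TABLE))
```

With `one_in=1000` this would keep on the order of 193 thousand rows. The filter still scans the selected columns in full, so it reduces the rows returned, not the bytes processed.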
