# Questions for Annual Datasets

### Why do this analysis?

The private hire travel industry is estimated to be worth $100 billion worldwide, with an estimated 15 million taxi drivers. As cities grow denser and transport infrastructure strains, demand for private hire vehicles (as opposed to personal car ownership) has increased. This is evident in the sharp rise of companies such as Uber and Lyft.

At the same time, climate change brought on by pollution is a growing issue, and vehicle emissions are one of the key culprits. The increased use of private hire vehicles has helped somewhat, as it is more efficient for a smaller number of dedicated vehicles to transport people than for everyone to own a car. However, more can be done.

A successful taxi driver is one who minimises the downtime between dropping off one passenger and picking up the next. This has the added benefit of reducing vehicle emissions, since less time is spent driving without a passenger.

With data gathered from taxis over the past few years, it may be possible to build a model that predicts when and where clients will be before they have even requested a taxi. If this is done effectively, taxi drivers will be able to minimise downtime between clients and thus increase profits and decrease vehicle emissions.


### Who is this analysis for?

Taxi drivers can be broadly split into three categories:

1) Freelance drivers who look for clients to hail them

2) Drivers who are on call and respond to requests from a dispatcher

3) Drivers who pick up passengers from a specific location (e.g. airports, train stations)

This analysis will mostly benefit the first two categories. The first group relies solely on being in the right place at the right time, while the second group could minimise downtime by being close to a potential client before being dispatched. Should this predictive model be successful, the third category would become redundant, as groups 1 and 2 should be able to cover those locations.

### What data do we have?

This project will use the New York City Taxi & Limousine Commission's (TLC) trip dataset. I picked this dataset as it is currently the most comprehensive, with data going back as far as 2009, and New York is a city that famously has a thriving private hire industry.

The dataset is approximately 180 GB. For this initial investigation, however, we will only take a sample from the 2009 data, as I have neither the time nor the resources to do a full analysis at this stage.
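One way such a sample could be drawn without loading a whole monthly file into memory is to read it in chunks and keep a random fraction of each chunk. The sketch below is only an illustration of that idea, not necessarily how the sample used in this notebook was produced; the function name `sample_csv` and its default parameters are assumptions.

```python
import pandas as pd


def sample_csv(path, frac=0.01, chunksize=100_000, seed=0):
    """Return a random sample of roughly `frac` of the rows of a large CSV.

    The file is read `chunksize` rows at a time, so the full file never
    has to fit in memory. `sample_csv` is a hypothetical helper name.
    """
    sampled_chunks = []
    for i, chunk in enumerate(pd.read_csv(path, chunksize=chunksize)):
        # A different but reproducible seed per chunk keeps the result
        # repeatable without sampling the same row positions every time.
        sampled_chunks.append(chunk.sample(frac=frac, random_state=seed + i))
    return pd.concat(sampled_chunks, ignore_index=True)
```

The result could then be written out with `DataFrame.to_csv(..., index=False)` to produce a file such as `data/yellow_tripdata_2009_sample.csv`. Sampling each chunk at the same rate gives an approximately uniform sample over the whole file.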

The data files are split by month and cover three different taxi types: the famous Yellow taxis; the Green taxis, which arrived in 2013 to cover the much-neglected 'north' of New York; and For-Hire Vehicles (FHV), which cover newer companies such as Uber and Lyft.

As we are only looking at the 2009 dataset, we will only be looking at Yellow taxis for the time being.

The Yellow taxi datasets have 18 data fields. They are: Vendor name, Pickup date & time, Dropoff date & time, Number of passengers, Travel distance, Pickup latitude/longitude, Rate code, Data acquisition information, Dropoff latitude/longitude, Payment type, Fare amount, Surcharge amount, Tax amount, Tip amount, Toll amount and Total amount.

### Initial questions

There are quite a few questions we can ask to get an idea of what the data is telling us.

1) What are the most common pickup times?

2) What are the most common pickup locations?

3) What are the most common dropoff locations?

4) What journeys generate the most revenue?

5) What are the fastest journeys?

6) Is there a correlation between revenue and location?

7) Is there a correlation between speed and location?
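Questions 5 and 7 depend on journey speed, which is not one of the 18 fields and has to be derived from the travel distance and the pickup and dropoff times. A minimal sketch of that derivation is shown below. The helper name `add_average_speed` is hypothetical, and the column names are passed in as arguments because only `Trip_Pickup_DateTime` is confirmed by this notebook; the dropoff and distance column names would need to be checked against the file.

```python
import pandas as pd


def add_average_speed(trips, pickup_col, dropoff_col, distance_col):
    """Return a copy of `trips` with `duration_hours` and `avg_speed` columns.

    `avg_speed` is in distance units per hour, whatever unit the distance
    column uses. Rows with a non-positive duration or distance are dropped,
    on the assumption that they are recording errors.
    """
    out = trips.copy()
    pickup = pd.to_datetime(out[pickup_col])
    dropoff = pd.to_datetime(out[dropoff_col])
    out["duration_hours"] = (dropoff - pickup).dt.total_seconds() / 3600

    # Filter first so the division below never divides by zero.
    valid = (out["duration_hours"] > 0) & (out[distance_col] > 0)
    out = out[valid].copy()
    out["avg_speed"] = out[distance_col] / out["duration_hours"]
    return out
```

Sorting the result by `avg_speed` would then give a first look at the fastest journeys, and the same frame could be grouped by pickup or dropoff coordinates to explore the location questions.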

In [62]:
import pandas as pd
import os
import numpy as np
import gc

# Load the 2009 Yellow taxi sample.
path = "data/yellow_tripdata_2009_sample.csv"
df = pd.read_csv(path)

# Extract the pickup minute and hour from each trip's pickup timestamp.
pickup_times = pd.DatetimeIndex(pd.to_datetime(df['Trip_Pickup_DateTime']))
df_time = pd.DataFrame({'minute': pickup_times.minute, 'hour': pickup_times.hour})

# Round each minute down to the start of its 10-minute bin (e.g. 27 -> 20).
df_time['minute'] = (10 * np.floor(df_time['minute'] / 10)).astype("int64")
df_time

|   | minute | hour |
|---|--------|------|
| 0 | 0 | 0 |
| 1 | 20 | 12 |
| 2 | 30 | 20 |
| 3 | 0 | 21 |
| 4 | 20 | 15 |
| 5 | 10 | 20 |
| 6 | 0 | 12 |
| 7 | 0 | 10 |
| 8 | 10 | 7 |
| 9 | 50 | 2 |
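With pickups binned into 10-minute slots, question 1 (the most common pickup times) comes down to counting rows per (`hour`, `minute`) pair. The sketch below shows one way this could be done. The function name `busiest_slots` and the `pickups` column are illustrative choices, and the function expects a frame shaped like `df_time` above.

```python
import pandas as pd


def busiest_slots(df_time, top_n=5):
    """Count pickups per (hour, 10-minute bin) and return the busiest slots.

    Expects integer `hour` and `minute` columns, as in `df_time`.
    """
    counts = (
        df_time.groupby(["hour", "minute"])
        .size()                          # number of pickups in each slot
        .reset_index(name="pickups")     # back to a flat DataFrame
        .sort_values("pickups", ascending=False)
    )
    return counts.head(top_n)
```

Calling `busiest_slots(df_time)` would list the five busiest slots in the sample. For a coarser view, `df_time["hour"].value_counts()` gives the number of pickups per hour.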
