## NYC Taxi data

The city releases data on taxis and for-hire vehicles on the Taxi and Limousine Commission (TLC) website. The data covers over 1.3 billion individual trips, reaches back as far as 2009, and is regularly updated.

We'll be working with a subset of this data: Yellow taxi trips to and from New York City airports between January and June 2016.

We have randomly sampled approximately 90,000 trips for our analysis, representing one-fiftieth of the trips for the six-month period.

Find the CSV data here: [nycTaxis](https://drive.google.com/file/d/17LWD9zCPGme69BiOkOIbsNzmYkZcnWLk/view?usp=sharing)

In this exercise we intend to do the following:

1. Sort the data by speed in ascending order.
2. Find the number of rides in a particular month, say January.
3. Find the most preferred drop-off airport.
4. Remove the rows with a speed of more than 100 mph.
5. Calculate the mean speed, distance, duration, and total charge of the cleaned data.

To work with this CSV data in NumPy, we import the NumPy library, read the file with Python's `csv` module, and convert the values in three steps:

list of lists (string) ----> list of lists (float) ----> NumPy ndarray

In [30]:
from csv import reader

import numpy as np

# Read the CSV file into a list of lists of strings.
with open('nyc_taxis.csv') as opened_file:
    taxi_list = list(reader(opened_file))

taxi_list = taxi_list[1:]  # Exclude the header row with the column names

# Convert every value from string to float.
converted_taxi_list = []
for row in taxi_list:
    converted_row = []
    for item in row:
        converted_row.append(float(item))
    converted_taxi_list.append(converted_row)

# taxi is now a two-dimensional ndarray (n-dimensional array).
taxi = np.array(converted_taxi_list)
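The string-to-float-to-ndarray pipeline can be tried on a tiny in-memory sample before pointing it at the real file. The sketch below is illustrative only: the three-column sample text and the names `sample_csv`, `sample`, and `sample_alt` are made up for this example, and `np.genfromtxt` is shown as one possible shortcut, not as the approach used in this exercise.

```python
import io
from csv import reader

import numpy as np

# Hypothetical three-column sample standing in for nyc_taxis.csv.
sample_csv = "pickup_month,trip_distance,trip_length\n1,21.0,2037\n2,16.29,1520\n"

# Step 1: list of lists of strings, with the header row dropped.
rows = list(reader(io.StringIO(sample_csv)))[1:]

# Step 2: list of lists of floats.
float_rows = [[float(item) for item in row] for row in rows]

# Step 3: two-dimensional ndarray.
sample = np.array(float_rows)
print(sample.shape)  # (2, 3)

# Possible shortcut: let NumPy parse the text and skip the header itself.
sample_alt = np.genfromtxt(io.StringIO(sample_csv), delimiter=",", skip_header=1)
print(np.array_equal(sample, sample_alt))  # True
```

The explicit loop makes each conversion step visible, which is useful while learning; the one-call version is shorter once the steps are understood.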

Let's also add an average speed column. Column 7 holds the trip distance in miles and column 8 the trip length in seconds.

In [31]:
trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600  # 3600 seconds in one hour
trip_mph = trip_distance_miles / trip_length_hours  # trip_mph is a 1-D ndarray; it is not yet a column of taxi
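A plain division like the one above produces `inf` or `nan` (with a runtime warning) for any trip whose recorded duration is zero. Whether the real data contains such trips is not stated here, so the following is only a sketch of one way to guard against them, using made-up values and hypothetical names (`distance_miles`, `length_seconds`, `mph`).

```python
import numpy as np

# Hypothetical distances (miles) and durations (seconds);
# the last trip has a recorded duration of zero.
distance_miles = np.array([10.0, 5.0, 2.0])
length_seconds = np.array([1800.0, 3600.0, 0.0])

length_hours = length_seconds / 3600

# Divide only where the duration is positive; the other entries keep
# the NaN they were initialised with.
mph = np.full(distance_miles.shape, np.nan)
np.divide(distance_miles, length_hours, out=mph, where=length_hours > 0)

print(mph)  # [20.  5. nan]
```

A comparison such as `mph < 100` evaluates to `False` for `nan`, so trips marked this way drop out of any later speed filter.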

In [32]:
print(trip_mph.shape)  # -----> (89560,)

trip_mph_2d = np.expand_dims(trip_mph, axis=1)

print(trip_mph_2d.shape)  # -----> (89560, 1)

taxi = np.concatenate([taxi, trip_mph_2d], axis=1)  # The average speed column (trip_mph_2d) is now column 15 of taxi.



(89560,)
(89560, 1)
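The shape change matters because `np.concatenate` needs every input to have the same number of dimensions. A minimal sketch on a made-up 2x2 array (the names `table`, `new_column`, `joined`, and `stacked` are hypothetical) shows the same step, along with `np.column_stack` as a possible alternative:

```python
import numpy as np

table = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
new_column = np.array([10.0, 20.0])  # shape (2,)

# Same approach as above: make the vector 2-D, then join along the column axis.
joined = np.concatenate([table, np.expand_dims(new_column, axis=1)], axis=1)

# Alternative: column_stack turns a 1-D input into a column on its own.
stacked = np.column_stack([table, new_column])

print(joined.shape)                     # (2, 3)
print(np.array_equal(joined, stacked))  # True
```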


Let's sort the taxi data according to the newly added speed column. 

In [33]:
to_sort = taxi[:,15]
sorted_indices = np.argsort(to_sort)
taxi_sorted = taxi[sorted_indices]
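`np.argsort` returns the row indices that would put the chosen column in ascending order; indexing the full array with those indices reorders whole rows together. A small sketch with made-up rows (the `trips` array and its two columns are assumptions for illustration):

```python
import numpy as np

# Hypothetical rows: [trip id, speed in mph].
trips = np.array([[1.0, 30.0],
                  [2.0, 12.0],
                  [3.0, 45.0]])

order = np.argsort(trips[:, 1])     # indices that sort the speed column ascending
slowest_first = trips[order]        # whole rows reordered together
fastest_first = trips[order[::-1]]  # reversed index array gives descending order

print(slowest_first[:, 0])  # [2. 1. 3.]
print(fastest_first[:, 0])  # [3. 1. 2.]
```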



Now let's count the number of trips in a particular month, say January.

In [34]:
pickup_month = taxi[: ,1]
january_bool = pickup_month == 1
january = pickup_month[january_bool]
january_rides = january.shape[0]
print(january_rides)

13481
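Since the boolean array already marks every January trip, the count can also be read from the mask directly without building the filtered array first. A minimal sketch with made-up month values (`months` and `january_mask` are hypothetical names):

```python
import numpy as np

months = np.array([1.0, 2.0, 1.0, 6.0, 1.0])  # hypothetical pickup months

january_mask = months == 1

# True counts as 1 and False as 0, so both calls count the matching rows.
print(np.count_nonzero(january_mask))  # 3
print(january_mask.sum())              # 3
```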


Now we'll find which airport is the most popular drop-off choice.

In [35]:
drop_airport = taxi[: , 6]
jfk =  taxi[drop_airport == 2 ,6]
jfk_count = jfk.shape[0]

laguardia =  taxi[drop_airport == 3 ,6]
laguardia_count = laguardia.shape[0]

newark =  taxi[drop_airport == 5 ,6]
newark_count = newark.shape[0]
print(jfk_count)
print(laguardia_count)
print(newark_count)

11832
16602
63


Clearly, LaGuardia is the favourite drop-off location.
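Counting each code with a separate filter works for three airports; `np.unique` with `return_counts=True` can tally every code in one pass. The sketch below uses made-up drop-off codes and assumes the same mapping as the code above (2 for JFK, 3 for LaGuardia, 5 for Newark); the `airport_names` dictionary is a hypothetical helper.

```python
import numpy as np

# Hypothetical drop-off codes; 2 = JFK, 3 = LaGuardia, 5 = Newark.
codes = np.array([3.0, 2.0, 3.0, 5.0, 3.0, 2.0])
airport_names = {2: "JFK", 3: "LaGuardia", 5: "Newark"}

# values holds the distinct codes in sorted order, counts how often each occurs.
values, counts = np.unique(codes, return_counts=True)
for value, count in zip(values, counts):
    print(airport_names.get(int(value), "other"), count)

# The code with the highest count is the most popular drop-off location.
most_common = values[np.argmax(counts)]
print(airport_names[int(most_common)])  # LaGuardia
```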

## Removing trips with speeds greater than 100 mph

In [36]:
speed = taxi[:, 15]
cleaned_taxi = taxi[speed < 100]  # keep only the trips slower than 100 mph
mean_distance = cleaned_taxi[: ,7].mean()
mean_length = cleaned_taxi[: ,8].mean()
mean_total_amount = cleaned_taxi[: ,13].mean()
mean_mph = cleaned_taxi[:,15].mean()
print(mean_distance)
print(mean_length)
print(mean_total_amount)
print(mean_mph)

12.666396599932893
2239.503657309026
48.98131853260262
23.353238774840836
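The four means can also be computed in a single call by selecting the columns of interest and averaging along axis 0. The following sketch uses a made-up three-column array (distance, duration, speed); the column positions here are assumptions for the example and differ from those in `taxi`.

```python
import numpy as np

# Hypothetical rows: [distance (miles), duration (seconds), speed (mph)].
trips = np.array([[10.0, 1800.0, 20.0],
                  [30.0, 600.0, 180.0],  # implausibly fast, should be dropped
                  [5.0, 900.0, 20.0]])

# Keep only the rows whose speed is below 100 mph.
cleaned = trips[trips[:, 2] < 100]

# mean(axis=0) averages down the rows, giving one mean per column.
column_means = cleaned.mean(axis=0)
print(column_means)

# A list of column indices selects only the columns of interest.
distance_and_speed_means = cleaned[:, [0, 2]].mean(axis=0)
print(distance_and_speed_means)
```

Applied to the real data, the same pattern would use the column indices from the code above (7, 8, 13, and 15).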
