Hi! I am new to data science. Please give me your feedback in the comments below. Thanks!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.metrics import classification_report, confusion_matrix, mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split


## Data description:
In this competition, Kaggle is challenging you to build a model that predicts the total ride duration of taxi trips in New York City. Your primary dataset is one released by the NYC Taxi and Limousine Commission, which includes pickup time, geo-coordinates, number of passengers, and several other variables. <br>
### Data fields
* id - a unique identifier for each trip
* vendor_id - a code indicating the provider associated with the trip record
* pickup_datetime - date and time when the meter was engaged
* dropoff_datetime - date and time when the meter was disengaged
* passenger_count - the number of passengers in the vehicle (driver entered value)
* pickup_longitude - the longitude where the meter was engaged
* pickup_latitude - the latitude where the meter was engaged
* dropoff_longitude - the longitude where the meter was disengaged
* dropoff_latitude - the latitude where the meter was disengaged
* store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
* trip_duration - duration of the trip in seconds


In [None]:
taxi_train = pd.read_csv('../input/train.csv')

In [None]:
taxi_test = pd.read_csv('../input/test.csv')

In [None]:
taxi_train.info()
print('\n')
taxi_test.info()

#### We need to change the type of pickup_datetime and dropoff_datetime from string to datetime

In [None]:
taxi_train['pick_date'] = pd.to_datetime(taxi_train['pickup_datetime'])

In [None]:
taxi_train['drop_date'] = pd.to_datetime(taxi_train['dropoff_datetime'])
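
As a quick sanity check on the conversion, here is a minimal sketch on a couple of made-up rows (not read from the competition files). Once both columns are datetimes, their difference in seconds can be compared with trip_duration. The `toy` frame and the `computed` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; the column names mirror train.csv.
toy = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"],
    "dropoff_datetime": ["2016-03-14 17:32:30", "2016-06-12 00:54:38"],
    "trip_duration": [455, 663],
})

# Parse the strings into datetime64 columns.
toy["pick_date"] = pd.to_datetime(toy["pickup_datetime"])
toy["drop_date"] = pd.to_datetime(toy["dropoff_datetime"])

# Subtracting two datetime columns gives a timedelta column.
toy["computed"] = (toy["drop_date"] - toy["pick_date"]).dt.total_seconds()

# Rows where the recorded duration disagrees with the timestamps.
mismatch = toy[toy["computed"] != toy["trip_duration"]]
print(len(mismatch))
```

On the real data, any rows that end up in `mismatch` would be worth inspecting before modeling.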

Then remove the original pickup and dropoff datetime columns, which are now redundant.

In [None]:
taxi_train.drop(['pickup_datetime', 'dropoff_datetime'], axis=1, inplace=True)

In [None]:
taxi_train.head()

From the longitude and latitude features, we can calculate the distance between the pickup and dropoff locations. Because the trips were recorded in New York, where streets largely follow a grid, the Manhattan distance is a natural choice. Note that the result is expressed in degrees, not kilometres.<br>
I don't use the geopy API to calculate the distance (which would be more accurate than the Manhattan distance on raw coordinates) because its values would be much larger than those of the other features, which may create noise.
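
If a distance in kilometres is wanted without geopy, one possible alternative (not what this notebook computes) is to sum a north-south leg and an east-west leg, each measured with the haversine formula. The sketch below assumes a spherical Earth and ignores the rotation of the street grid; `haversine_km` and `manhattan_km` are hypothetical helper names. The larger magnitude could be handled by scaling the feature afterwards.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def manhattan_km(lat1, lon1, lat2, lon2):
    """Sum of a north-south leg and an east-west leg, each in kilometres."""
    north_south = haversine_km(lat1, lon1, lat2, lon1)
    east_west = haversine_km(lat1, lon1, lat1, lon2)
    return north_south + east_west


# Two made-up points for illustration.
print(manhattan_km(40.75, -73.99, 40.76, -73.97))
```

Both helpers also accept NumPy arrays or pandas columns, so they can be applied to whole columns at once.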

In [None]:
# In scikit-learn 1.0 and later, DistanceMetric is importable from sklearn.metrics
from sklearn.metrics import DistanceMetric

In [None]:
dist = DistanceMetric.get_metric('manhattan')

In [None]:
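def manhattan_dist(x):
    # x is one row holding the pickup and dropoff coordinates
    pick_long = x['pickup_longitude']
    pick_lat = x['pickup_latitude']
    drop_long = x['dropoff_longitude']
    drop_lat = x['dropoff_latitude']
    V = [[pick_long, pick_lat],[drop_long,drop_lat]]
    # pairwise returns a 2x2 matrix; entry [0][1] is the distance between the two points
    return dist.pairwise(V)[0][1]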

In [None]:
taxi_train['distance'] = taxi_train[['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude']].apply(manhattan_dist, axis=1)
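
The row-wise apply calls the metric once per trip, which can be slow on a large table. Since the Manhattan distance between two points is just the sum of the absolute coordinate differences, the same feature can be sketched with column arithmetic. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up coordinates for illustration; column names mirror train.csv.
toy = pd.DataFrame({
    "pickup_longitude": [-73.98, -73.99],
    "pickup_latitude": [40.75, 40.73],
    "dropoff_longitude": [-73.96, -74.00],
    "dropoff_latitude": [40.77, 40.71],
})

# Manhattan (L1) distance in degrees: |delta longitude| + |delta latitude|.
toy["distance"] = (
    (toy["pickup_longitude"] - toy["dropoff_longitude"]).abs()
    + (toy["pickup_latitude"] - toy["dropoff_latitude"]).abs()
)
print(toy["distance"])
```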

The longitude and latitude columns are no longer needed, so drop all of them.

In [None]:
taxi_train.drop(['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude'], axis=1, inplace=True)

In [None]:
taxi_train.head()

In [None]:
plt.figure(figsize=(12,8))
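# Keep only numeric columns: id, store_and_fwd_flag and the datetime columns cannot be correlated
sns.heatmap(taxi_train.select_dtypes('number').corr()*100,annot=True)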

#### Rush Hours and weekend classification:
According to the New York Times:<br>
Commuters are getting up pretty early in the morning - and heading home pretty late in the day - to beat the increasingly long and crowded rush hours in the New York region. Although the hours from 7 to 9 A.M. and from 4 to 6 P.M. are still the busiest - and getting busier - early morning travel is growing.<br>
Based on this definition, we will create a new feature named "Rush_hours" derived from the pickup datetime. A "Weekend" feature can be created in the same way.
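
Both flags can be sketched with vectorized datetime accessors. The example uses made-up pickup times and the same inclusive hour bounds as the rush_hours function in this notebook. The "Weekend" column name comes from the text above; treating Saturday and Sunday as the weekend is an assumption.

```python
import pandas as pd

# Made-up pickup times for illustration.
toy = pd.DataFrame({
    "pick_date": pd.to_datetime([
        "2016-03-14 08:15:00",  # Monday morning
        "2016-03-14 13:00:00",  # Monday midday
        "2016-03-19 17:30:00",  # Saturday afternoon
    ])
})

hour = toy["pick_date"].dt.hour

# Hours 7-9 and 16-18, both ends included, as in rush_hours.
toy["Rush_hours"] = (hour.between(7, 9) | hour.between(16, 18)).astype(int)

# dayofweek counts Monday as 0, so Saturday and Sunday are 5 and 6.
toy["Weekend"] = (toy["pick_date"].dt.dayofweek >= 5).astype(int)
print(toy)
```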

In [None]:
def rush_hours(x):
    hour = x.hour
    # Hours 7-9 and 16-18 inclusive, i.e. pickups from 7:00 to 9:59 and from 16:00 to 18:59
    if (hour >= 7 and hour <= 9) or (hour >= 16 and hour <= 18):
        return 1
    else:
        return 0

In [None]:
dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}

In [None]:
taxi_train['Day of week'] = taxi_train['pick_date'].dt.dayofweek

In [None]:
taxi_train['Day of week'] = taxi_train['Day of week'].map(dmap)

In [None]:
taxi_train.head()

In [None]:
plt.figure(figsize=(12,8))
sns.barplot(x='Day of week', y='trip_duration', data=taxi_train)

The chart above shows the average trip_duration by day of week:
* Trips are longer on Tue, Fri, and Thu.
* Trips are shorter on Sun and Mon.
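
A bar plot of means can be pulled around by a few very long trips, so it may help to put the trip count and the median next to the mean. This is a minimal sketch on made-up durations; whether outliers matter in the real data is not shown here.

```python
import pandas as pd

# Made-up durations (seconds) for illustration; one extreme value on Tuesday.
toy = pd.DataFrame({
    "Day of week": ["Mon", "Mon", "Tue", "Tue", "Tue"],
    "trip_duration": [400, 600, 500, 700, 86000],
})

# Count, mean and median of trip_duration for each day.
summary = (
    toy.groupby("Day of week")["trip_duration"]
       .agg(["count", "mean", "median"])
)
print(summary)
```

In this toy table the single extreme trip lifts the Tuesday mean far above its median, while the Monday mean and median agree.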

In [None]:
taxi_train['Rush_hours'] = taxi_train['pick_date'].apply(rush_hours)

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(x='Rush_hours', y='trip_duration', data=taxi_train)

The average trip duration during rush hours is not very different from that at other times.

#### Federal and state holidays may impact taxi usage and trip duration
Let's create a new feature called "Holidays" that assigns a weight to each kind of holiday as follows:
* 0 - no holiday
* 1 - state holidays
* 2 - federal holidays (short)
* 3 - federal holidays (long), i.e. Christmas or New Year
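
Instead of typing the federal dates by hand, they can be looked up for whatever period the data covers with the holiday calendar that ships with pandas. This sketch covers federal holidays only (not state holidays or the long-holiday weight). The date range is an assumption for illustration and should be replaced with the span of pick_date.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Assumed date range for illustration; replace it with the span of pick_date.
start, end = "2016-01-01", "2016-06-30"

# Observed federal holidays in that range, as a DatetimeIndex of midnights.
fed_days = USFederalHolidayCalendar().holidays(start=start, end=end)

# Made-up pickup times for illustration.
picks = pd.Series(pd.to_datetime(["2016-01-18 09:30:00", "2016-01-19 09:30:00"]))

# normalize() drops the time of day so the dates can be matched.
is_federal = picks.dt.normalize().isin(fed_days)
print(is_federal.tolist())
```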

In [None]:
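# (day, month) pairs; these dates match the 2017 federal calendar, so check them against the year of the pickups
fed_holidays = ((16,1),(20,2),(29,5),(4,7),(4,9),(9,10),(10,11),(11,11),(23,11))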

In [None]:
state_holidays = ((12,2),(13,2),(9,10),(24,11))

In [None]:
def federal_holidays(x):
    day = x.day
    month = x.month
    if (day >=24 and month == 12) or (day <= 2 and month ==1):
        return 3
    elif (day, month) in fed_holidays:
        return 2
    elif (day, month) in state_holidays:
        return 1
    else:
        return 0

In [None]:
taxi_train['Holidays'] = taxi_train['pick_date'].apply(federal_holidays)

In [None]:
plt.figure(figsize=(10,8))
sns.barplot(x='Holidays', y='trip_duration', data=taxi_train,palette='rainbow')

### ==> Average trip duration is highest on state holidays and lowest on long holidays.

How does the average trip duration differ between traveling alone and traveling with company?

In [None]:
plt.figure(figsize=(10,8))
sns.barplot(x='passenger_count', y='trip_duration',hue='vendor_id', data=taxi_train,palette='rainbow')

#### From the chart above, we can observe:
* The maximum number of passengers is 9.
* For vendor 2, trips recorded with zero passengers last considerably longer on average than trips with passengers (this does not apply to vendor 1).
* In general, the more people in the taxi, the longer the trip takes.
* Groups of 7 or more people tend to take short trips.
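
Averages for rare passenger counts (for example 0 or 7 and above) may rest on very few trips, so it is worth checking the group sizes before reading much into them. Here is a minimal sketch on made-up rows; `MIN_TRIPS` is a hypothetical threshold.

```python
import pandas as pd

# Made-up rows for illustration; column names mirror train.csv.
toy = pd.DataFrame({
    "vendor_id": [1, 1, 2, 2, 2, 2],
    "passenger_count": [1, 1, 1, 1, 0, 7],
    "trip_duration": [500, 700, 600, 800, 1500, 20],
})

# Number of trips and mean duration for each vendor / passenger_count pair.
groups = (
    toy.groupby(["vendor_id", "passenger_count"])["trip_duration"]
       .agg(["count", "mean"])
       .reset_index()
)

# Groups backed by very few trips give unreliable averages.
MIN_TRIPS = 2  # hypothetical threshold; pick one suited to the real data
small_groups = groups[groups["count"] < MIN_TRIPS]
print(small_groups)
```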

Convert "Day of week" back to numbers so it can be used in calculations.

In [None]:
taxi_train['Day of week'] = taxi_train['pick_date'].dt.dayofweek

In [None]:
taxi_train.head()

#### Let's have a look at the correlations again:

In [None]:
plt.figure(figsize=(12,8))
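# Keep only numeric columns: id, store_and_fwd_flag and the datetime columns cannot be correlated
sns.heatmap(taxi_train.select_dtypes('number').corr()*100,annot=True)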
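
The heatmap can also be read as a ranked list by taking the trip_duration column of the correlation matrix and sorting it by absolute value. This is a minimal sketch on made-up rows; the numbers say nothing about the real data.

```python
import pandas as pd

# Made-up rows for illustration; column names follow this notebook.
toy = pd.DataFrame({
    "distance": [0.01, 0.02, 0.03, 0.05, 0.08],
    "Rush_hours": [0, 1, 0, 1, 0],
    "passenger_count": [1, 2, 1, 1, 3],
    "trip_duration": [300, 650, 800, 1500, 2100],
})

# Pearson correlation of every numeric column with the target,
# strongest relationship (by absolute value) first.
target_corr = (
    toy.select_dtypes("number")
       .corr()["trip_duration"]
       .drop("trip_duration")
       .sort_values(key=abs, ascending=False)
)
print(target_corr)
```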