In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/ny-taxi-data-2016/NY_Taxi_Data_2016.csv


#  Scenario
A taxi company in New York City (NYC) wants to optimize the number of cabs waiting at JFK International Airport. The airport is far from the main Manhattan business district, so taxis waiting there take a long time to become available for other areas. On the other hand, passengers arriving at the airport usually travel longer and therefore more lucrative routes. So how many cabs should the operator station at JFK?


![A service run by Via and Curb will give travelers in Manhattan the option of sharing a yellow cab with someone for a reduced fare.Credit...Joshua Bright for The New York Times](https://static01.nyt.com/images/2017/06/07/nyregion/07TAXI1/07TAXI1-superJumbo.jpg?quality=75&auto=webp)

To examine these questions, the company provides us with the data it collected in 2016 in the file NY_Taxi_Data_2016.csv.


The following tasks have to be solved:
* What is the overall percentage of taxi trips that start at the airport (JFK)? It is given by the following formula:
# $Share_{JFK} = \frac{Trips_{JFK}}{Trips_{Overall}}$
* Where are taxis taken in New York? 
* What is the percentage of taxi trips that start at the airport for each day of the week? On which day of the week is this share highest, and on which is it lowest?
# $Share_{Weekdays,JFK} = \frac{Trips_{Weekdays,JFK}}{Trips_{Weekdays,Overall}}$
* How much does each day of the week contribute to the total number of trips? Compute this for the trips from the airport as well as for the entire data set so that any differences can be identified.
# $Share_{Weekdays,Overall} = \frac{Trips_{Weekdays,Overall}}{Trips_{All Days,Overall}}$
# $Share_{Weekdays,JFK} = \frac{Trips_{Weekdays,JFK}}{Trips_{All Days,JFK}}$
* How much does each time of day (pickup hour) contribute to the total number of trips? Compute this for the trips from the airport as well as for the entire data set so that any differences can be identified.
# $Share_{Time,Overall} = \frac{Trips_{Time,Overall}}{Trips_{All Time,Overall}}$
# $Share_{Time,JFK} = \frac{Trips_{Time,JFK}}{Trips_{All Time,JFK}}$
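
The share formulas above can be sketched in pandas as follows. This is only a minimal illustration on a hand-made toy frame, not the real data set; the boolean column `is_jfk` is a hypothetical helper that marks airport pickups.

```python
import pandas as pd

# Toy data (made up for illustration): weekday of pickup and a hypothetical
# boolean flag `is_jfk` marking trips that start at the airport.
toy = pd.DataFrame({
    "pickup_weekday": [0, 0, 0, 0, 1, 1, 6, 6],
    "is_jfk": [True, False, False, False, True, True, False, False],
})

# Share of JFK trips among all trips of a given weekday:
# Trips_{Weekdays,JFK} / Trips_{Weekdays,Overall}
share_jfk_per_weekday = toy.groupby("pickup_weekday")["is_jfk"].mean()

# Contribution of each weekday to all trips:
# Trips_{Weekdays,Overall} / Trips_{All Days,Overall}
weekday_share_overall = (
    toy["pickup_weekday"].value_counts(normalize=True).sort_index()
)

# Contribution of each weekday to the JFK trips only:
# Trips_{Weekdays,JFK} / Trips_{All Days,JFK}
weekday_share_jfk = (
    toy.loc[toy["is_jfk"], "pickup_weekday"]
    .value_counts(normalize=True)
    .sort_index()
)

print(share_jfk_per_weekday)
print(weekday_share_overall)
print(weekday_share_jfk)
```

The same pattern should carry over to the time-of-day shares by grouping on `pickup_hour` instead of `pickup_weekday`.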


For reference, New York City lies roughly within the following bounding box:

| Variable     | Value    | Description                        |
|--------------|----------|------------------------------------|
| nyc_max_lat  | 40.9176  | Maximum latitude of New York City  |
| nyc_min_lat  | 40.5774  | Minimum latitude of New York City  |
| nyc_max_long | -73.7004 | Maximum longitude of New York City |
| nyc_min_long | -74.15   | Minimum longitude of New York City |


The data is structured as follows:

| Column Number | Column Name         | Data Type             | Description                                                     |
|---------------|---------------------|-----------------------|-----------------------------------------------------------------|
| 0             | 'pickup_weekday'    | categorical (ordinal) | Day of the week on which the trip started (0=Monday, 6=Sunday)  |
| 1             | 'pickup_hour'       | categorical (ordinal) | Hour in which the pickup started                                |
| 2             | 'pickup_longitude'  | numeric (float)       | Longitude at which pickup started                               |
| 3             | 'pickup_latitude'   | numeric (float)       | Latitude where the pickup started                               |
| 4             | 'dropoff_longitude' | numeric (float)       | Longitude at which trip ended                                   |
| 5             | 'dropoff_latitude'  | numeric (float)       | Latitude at which the trip ended                                |
| 6             | 'passenger_count'   | categorical (ordinal) | Number of passengers in the car.                                |
| 7             | 'trip_distance'     | numeric (float)       | Traveled distance in miles                                      |
| 8             | 'fare_amount'       | numeric (float)       | Amount that the taximeter calculates based on time and distance |
| 9             | 'tip_amount'        | numeric (float)       | Tip given when paying by card (0.00 when paying in cash)        |
| 10            | 'tolls_amount'      | numeric (float)       | Tolls incurred                                                  |
| 11            | 'payment_type'      | categorical (nominal) | Type of payment (1=credit card, 2=cash, 3=no fee, 4=dispute)    |


In order to determine which trips depart from JFK Airport, you are given the following coordinates:

| Variable     | Value    | Description                               |
|--------------|----------|-------------------------------------------|
| jfk_max_lat  | 40.66018 | Maximum pickup latitude of airport rides  |
| jfk_min_lat  | 40.62666 | Minimum pickup latitude of airport rides  |
| jfk_max_long | -73.76599 | Maximum pickup longitude of airport rides |
| jfk_min_long | -73.80822 | Minimum pickup longitude of airport rides |
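
One way to use these coordinates is a boolean mask over the pickup columns. The following is a minimal sketch on a handful of illustrative pickup coordinates rather than the full data set; the names `sample`, `is_jfk` and `share_jfk` are assumptions made for this example.

```python
import pandas as pd

# Bounding box for JFK pickups, taken from the table above
jfk_min_lat, jfk_max_lat = 40.62666, 40.66018
jfk_min_long, jfk_max_long = -73.80822, -73.76599

# A few illustrative pickup coordinates (not the full data set)
sample = pd.DataFrame({
    "pickup_longitude": [-73.78997, -73.986237, -73.874634, -73.952477, -73.988281],
    "pickup_latitude": [40.64666, 40.746513, 40.77396, 40.772064, 40.764488],
})

# A trip counts as an airport trip if its pickup lies inside the box
# (Series.between is inclusive on both ends by default)
is_jfk = (
    sample["pickup_latitude"].between(jfk_min_lat, jfk_max_lat)
    & sample["pickup_longitude"].between(jfk_min_long, jfk_max_long)
)

# Share_JFK = Trips_JFK / Trips_Overall; the mean of a boolean mask is that ratio
share_jfk = is_jfk.mean()
print(is_jfk.tolist(), share_jfk)
```

The same kind of mask, built from the New York City bounds, could be used to drop pickups that lie outside the city.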

# **Data Gathering and Cleaning**

In [2]:
# Import required libraries: pandas and numpy
import pandas as pd
import numpy as np

##### Remarks ####
# pd.read_csv offers a dtype parameter that lets you assign a data type to each column manually.
# This can significantly reduce the RAM needed to store the numbers. For example, by default
# whole numbers are read as int64, i.e. each number takes 64 bits in memory. The more bits,
# the larger the whole numbers that can be stored; for floating point numbers, the precision increases.
# However, our values lie in a range that does not need that much memory. If the numbers are read in as int32,
# for example, those columns need only half the RAM. The following dictionary reads the data set with little memory:
# read and check the data
col_dtypes = {'pickup_weekday': 'int16', 
              'pickup_hour': 'int16', 
              'pickup_longitude': 'float32', 
              'pickup_latitude': 'float32', 
              'dropoff_longitude': 'float32', 
              'dropoff_latitude': 'float32', 
              'passenger_count': 'int16', 
              'trip_distance': 'float32', 
              'fare_amount': 'float32', 
              'tip_amount': 'float32', 
              'tolls_amount': 'float32', 
              'payment_type': 'int16'}
# Read data
df = pd.read_csv('/kaggle/input/ny-taxi-data-2016/NY_Taxi_Data_2016.csv', dtype=col_dtypes)
display(df.head())

# Check the data types
display(df.dtypes)

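|   | pickup_weekday | pickup_hour | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count | trip_distance | fare_amount | tip_amount | tolls_amount | payment_type |
|---|----------------|-------------|------------------|-----------------|-------------------|------------------|-----------------|---------------|-------------|------------|--------------|--------------|
| 0 | 3              | 19          | -73.78997        | 40.64666        | -74.005051        | 40.748081        | 1               | 18.610001     | 52.0        | 10.0       | 5.54         | 1            |
| 1 | 5              | 3           | -73.986237       | 40.746513       | -73.996796        | 40.742504        | 1               | 0.99          | 5.0         | 1.0        | 0.0          | 1            |
| 2 | 4              | 20          | -73.874634       | 40.77396        | -73.959923        | 40.762802        | 3               | 9.25          | 26.5        | 8.34       | 5.54         | 1            |
| 3 | 5              | 2           | -73.952477       | 40.772064       | -73.949371        | 40.675156        | 1               | 9.2           | 28.0        | 0.0        | 0.0          | 2            |
| 4 | 4              | 21          | -73.988281       | 40.764488       | -73.996513        | 40.753239        | 1               | 0.9           | 5.0         | 1.26       | 0.0          | 1            |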


pickup_weekday         int16
pickup_hour            int16
pickup_longitude     float32
pickup_latitude      float32
dropoff_longitude    float32
dropoff_latitude     float32
passenger_count        int16
trip_distance        float32
fare_amount          float32
tip_amount           float32
tolls_amount         float32
payment_type           int16
dtype: object
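
The memory argument from the remarks can be checked with a small sketch. It uses a synthetic weekday column rather than the real file, and the variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Synthetic weekday column (values 0..6) stored as int64, the pandas default
weekday_int64 = pd.Series(np.arange(300_000) % 7, dtype="int64")

# The same values stored as int16, as in col_dtypes
weekday_int16 = weekday_int64.astype("int16")

# Memory used by the values only (index excluded), in bytes
bytes_int64 = weekday_int64.memory_usage(index=False)
bytes_int16 = weekday_int16.memory_usage(index=False)
print(bytes_int64, bytes_int16)
```

With 8 bytes per int64 value and 2 bytes per int16 value, the int16 column takes a quarter of the memory while holding the same numbers.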

In [3]:
# Convert the categorical columns to the pandas 'category' dtype
categorial_col = ['pickup_weekday','pickup_hour','passenger_count','payment_type']
df[categorial_col] = df[categorial_col].astype('category')

# Check the data type again   
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 12 columns):
 #   Column             Non-Null Count   Dtype   
---  ------             --------------   -----   
 0   pickup_weekday     300000 non-null  category
 1   pickup_hour        300000 non-null  category
 2   pickup_longitude   300000 non-null  float32 
 3   pickup_latitude    300000 non-null  float32 
 4   dropoff_longitude  300000 non-null  float32 
 5   dropoff_latitude   300000 non-null  float32 
 6   passenger_count    300000 non-null  category
 7   trip_distance      300000 non-null  float32 
 8   fare_amount        300000 non-null  float32 
 9   tip_amount         300000 non-null  float32 
 10  tolls_amount       300000 non-null  float32 
 11  payment_type       300000 non-null  category
dtypes: category(4), float32(8)
memory usage: 10.3 MB
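
The column description lists `pickup_weekday` as ordinal, whereas a plain `astype('category')` creates unordered categories. If the order matters later, for example for comparisons or sorting, an ordered categorical dtype is one option. The following is a sketch on a few sample values, not a step from the original notebook; `weekday_dtype` and `weekdays` are hypothetical names.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered categories 0 (Monday) .. 6 (Sunday)
weekday_dtype = CategoricalDtype(categories=list(range(7)), ordered=True)

# A few sample weekday values, converted to the ordered dtype
weekdays = pd.Series([3, 5, 4, 5, 4], dtype="int16").astype(weekday_dtype)

# Ordered categoricals support min/max and comparisons against a category
print(weekdays.cat.ordered)      # True
print(weekdays.min(), weekdays.max())
print((weekdays > 3).tolist())
```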


In [4]:
# Any missing values?
print(df.isna().sum())

pickup_weekday       0
pickup_hour          0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    0
dropoff_latitude     0
passenger_count      0
trip_distance        0
fare_amount          0
tip_amount           0
tolls_amount         0
payment_type         0
dtype: int64
