# Maven Taxi Challenge
Four years' worth of NYC taxi trips to clean, analyze, and visualize using Tableau. The submission will be a Tableau dashboard that answers the following questions about the trips:
- average number of trips
- average fare per trip
- average distance traveled
- change in trip volume
- days / times of the week that are the busiest
- popular pick-up and drop-off locations

### Data Review
I will start by importing the libraries I need to work with the taxi data.

In [7]:
# import libraries
import os
import glob
import pandas as pd

path = os.getcwd()

Merge the CSV files in the taxi_trips folder into one dataframe.

In [15]:
# use glob to collect every csv file in the taxi_trips folder
filepath = os.path.join(path, "taxi_trips")
csv_files = glob.glob(os.path.join(filepath, "*.csv"))

# read each csv file into its own dataframe
taxi_frames = [pd.read_csv(file, low_memory=False) for file in csv_files]

# concatenate the per-file dataframes into a single dataframe
taxi_data = pd.concat(taxi_frames)

taxi_data.head()

[pandas warning, truncated in this export] ... of pandas will change to not sort by default. To accept the future behavior, pass 'sort=False'.


Unnamed: 0,DOLocationID,PULocationID,RatecodeID,VendorID,congestion_surcharge,extra,fare_amount,improvement_surcharge,lpep_dropoff_datetime,lpep_pickup_datetime,mta_tax,passenger_count,payment_type,store_and_fwd_flag,tip_amount,tolls_amount,total_amount,trip_distance,trip_type
0,75,74,1.0,2.0,0.0,0.0,6.5,0.3,2020-01-01 06:52:54.000,2020-01-01 06:47:28.000,0.5,1.0,2.0,N,0.0,0.0,7.3,1.47,1.0
1,75,74,1.0,2.0,0.0,0.0,6.5,0.3,2020-01-01 13:30:43.000,2020-01-01 13:25:34.000,0.5,1.0,2.0,N,0.0,0.0,7.3,1.49,1.0
2,75,74,1.0,2.0,0.0,0.0,6.5,0.3,2020-01-01 14:26:25.000,2020-01-01 14:20:35.000,0.5,1.0,2.0,N,0.0,0.0,7.3,1.31,1.0
3,75,74,1.0,2.0,0.0,0.0,6.5,0.3,2020-01-02 07:03:03.000,2020-01-02 06:56:47.000,0.5,1.0,2.0,N,0.0,0.0,7.3,1.43,1.0
4,75,74,1.0,2.0,0.0,0.0,6.5,0.3,2020-01-02 09:41:02.000,2020-01-02 09:34:46.000,0.5,1.0,2.0,N,0.0,0.0,7.3,1.1,1.0


In [16]:
# count of rows and columns
taxi_data.shape

(28326071, 19)

#### Command Line Shortcuts
I used the command line to count the rows in each csv file and confirm that the total across all files matches the number of rows in the taxi_data dataframe.
- use ls to list files in the current directory
- use cd folder-name/ to change directories
- use wc -l < file-name.csv to count the number of lines in the file (keep in mind one line is the header)
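
The same row-count check can also be done from Python. The sketch below is a minimal example and not part of the original workflow: the helper name `count_csv_rows` is hypothetical, and it assumes every file has exactly one header line and no quoted fields that contain line breaks.

```python
import glob
import os


def count_csv_rows(folder):
    """Return the total number of data rows (headers excluded) across all csv files in folder."""
    total = 0
    for csv_path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        # read in binary mode so odd encodings cannot break the count
        with open(csv_path, "rb") as handle:
            line_count = sum(1 for _ in handle)
        # subtract the header line, guarding against empty files
        total += max(line_count - 1, 0)
    return total
```

Under those assumptions, `count_csv_rows(filepath)` should equal `taxi_data.shape[0]`. Unlike `wc -l`, which counts newline characters, this loop also counts a final line that has no trailing newline.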

In [17]:
taxi_data.dtypes

DOLocationID               int64
PULocationID               int64
RatecodeID               float64
VendorID                 float64
congestion_surcharge     float64
extra                    float64
fare_amount              float64
improvement_surcharge    float64
lpep_dropoff_datetime     object
lpep_pickup_datetime      object
mta_tax                  float64
passenger_count          float64
payment_type             float64
store_and_fwd_flag        object
tip_amount               float64
tolls_amount             float64
total_amount             float64
trip_distance            float64
trip_type                float64
dtype: object
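
Both datetime columns are stored as `object` (plain strings), so any day-of-week or hour-of-day analysis needs them parsed first. The following is a minimal sketch of one way to do that in pandas, assuming the timestamp format shown in the `head()` output; the small `sample` frame and the derived column names `pickup_day_name` and `pickup_hour` are hypothetical and used only for illustration.

```python
import pandas as pd

# tiny stand-in frame using two timestamps copied from the head() output
sample = pd.DataFrame({
    "lpep_pickup_datetime": ["2020-01-01 06:47:28.000", "2020-01-01 13:25:34.000"],
    "lpep_dropoff_datetime": ["2020-01-01 06:52:54.000", "2020-01-01 13:30:43.000"],
})

for column in ["lpep_pickup_datetime", "lpep_dropoff_datetime"]:
    # errors="coerce" turns unparseable strings into NaT instead of raising
    sample[column] = pd.to_datetime(sample[column], errors="coerce")

# derived fields for "busiest day / time" questions (hypothetical names)
sample["pickup_day_name"] = sample["lpep_pickup_datetime"].dt.day_name()
sample["pickup_hour"] = sample["lpep_pickup_datetime"].dt.hour
```

The same loop could be applied to `taxi_data` itself, although parsing this many rows may take a while.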

In [23]:
# missing values in each column
taxi_data.isnull().sum()

DOLocationID                    0
PULocationID                    0
RatecodeID                      0
VendorID                        0
congestion_surcharge     21059703
extra                           0
fare_amount                     0
improvement_surcharge           0
lpep_dropoff_datetime           0
lpep_pickup_datetime            0
mta_tax                         0
passenger_count                 0
payment_type                    0
store_and_fwd_flag              0
tip_amount                      0
tolls_amount                    0
total_amount                    0
trip_distance                   0
trip_type                     365
dtype: int64
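
Measured against the 28,326,071 rows in the dataframe, the 21,059,703 missing congestion_surcharge values are roughly 74% of the data, while the 365 missing trip_type values are a negligible share. A small helper can make that comparison explicit. This is only a sketch: `missing_summary` and its output column names are hypothetical.

```python
import pandas as pd


def missing_summary(frame):
    """Return missing-value counts and percentages for columns with at least one null."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        # share of rows that are null in each column, as a percentage
        "share_pct": (counts / len(frame) * 100).round(2),
    })
    # keep only columns that have gaps, largest gap first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)
```

Calling `missing_summary(taxi_data)` would list only the columns that contain nulls.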

### Data Cleaning
The raw data has some issues, so instructions and assumptions were provided for cleaning and preparing it.

In [24]:
# values in the "store and forward" column
taxi_data["store_and_fwd_flag"].unique()

array(['N'], dtype=object)

In [22]:
# Keep trips that were not sent via "store and forward"
taxi_data = taxi_data[taxi_data["store_and_fwd_flag"] == "N"]
taxi_data.head()

Unnamed: 0,DOLocationID,PULocationID,RatecodeID,VendorID,congestion_surcharge,extra,fare_amount,improvement_surcharge,lpep_dropoff_datetime,lpep_pickup_datetime,mta_tax,passenger_count,payment_type,store_and_fwd_flag,tip_amount,tolls_amount,total_amount,trip_distance,trip_type
0,75,74,1.0,2.0,0.0,0.0,6.5,0.3,2020-01-01 06:52:54.000,2020-01-01 06:47:28.000,0.5,1.0,2.0,N,0.0,0.0,7.3,1.47,1.0
1,75,74,1.0,2.0,0.0,0.0,6.5,0.3,2020-01-01 13:30:43.000,2020-01-01 13:25:34.000,0.5,1.0,2.0,N,0.0,0.0,7.3,1.49,1.0
2,75,74,1.0,2.0,0.0,0.0,6.5,0.3,2020-01-01 14:26:25.000,2020-01-01 14:20:35.000,0.5,1.0,2.0,N,0.0,0.0,7.3,1.31,1.0
3,75,74,1.0,2.0,0.0,0.0,6.5,0.3,2020-01-02 07:03:03.000,2020-01-02 06:56:47.000,0.5,1.0,2.0,N,0.0,0.0,7.3,1.43,1.0
4,75,74,1.0,2.0,0.0,0.0,6.5,0.3,2020-01-02 09:41:02.000,2020-01-02 09:34:46.000,0.5,1.0,2.0,N,0.0,0.0,7.3,1.1,1.0


In [31]:
taxi_data["payment_type"].unique()

array([2., 1.])

In [30]:
# keep trips where payment type is equal to 1 - Credit Card or 2 - Cash
taxi_data = taxi_data[taxi_data["payment_type"].isin([1.0, 2.0])]

In [32]:
taxi_data["RatecodeID"].unique()

array([ 1.,  5.,  2.,  3.,  4.,  6., 99.])

In [33]:
# keep trips where RatecodeID is equal to 1 - Standard Rate
taxi_data = taxi_data[taxi_data["RatecodeID"] == 1.0]
taxi_data.head()

Unnamed: 0,DOLocationID,PULocationID,RatecodeID,VendorID,congestion_surcharge,extra,fare_amount,improvement_surcharge,lpep_dropoff_datetime,lpep_pickup_datetime,mta_tax,passenger_count,payment_type,store_and_fwd_flag,tip_amount,tolls_amount,total_amount,trip_distance,trip_type
0,75,74,1.0,2.0,0.0,0.0,6.5,0.3,2020-01-01 06:52:54.000,2020-01-01 06:47:28.000,0.5,1.0,2.0,N,0.0,0.0,7.3,1.47,1.0
1,75,74,1.0,2.0,0.0,0.0,6.5,0.3,2020-01-01 13:30:43.000,2020-01-01 13:25:34.000,0.5,1.0,2.0,N,0.0,0.0,7.3,1.49,1.0
2,75,74,1.0,2.0,0.0,0.0,6.5,0.3,2020-01-01 14:26:25.000,2020-01-01 14:20:35.000,0.5,1.0,2.0,N,0.0,0.0,7.3,1.31,1.0
3,75,74,1.0,2.0,0.0,0.0,6.5,0.3,2020-01-02 07:03:03.000,2020-01-02 06:56:47.000,0.5,1.0,2.0,N,0.0,0.0,7.3,1.43,1.0
4,75,74,1.0,2.0,0.0,0.0,6.5,0.3,2020-01-02 09:41:02.000,2020-01-02 09:34:46.000,0.5,1.0,2.0,N,0.0,0.0,7.3,1.1,1.0


In [34]:
taxi_data.isnull().sum()

DOLocationID                    0
PULocationID                    0
RatecodeID                      0
VendorID                        0
congestion_surcharge     20359525
extra                           0
fare_amount                     0
improvement_surcharge           0
lpep_dropoff_datetime           0
lpep_pickup_datetime            0
mta_tax                         0
passenger_count                 0
payment_type                    0
store_and_fwd_flag              0
tip_amount                      0
tolls_amount                    0
total_amount                    0
trip_distance                   0
trip_type                     354
dtype: int64
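
The three filters used in this section can also be applied in a single step, which keeps the cleaning rules in one place and makes them easy to rerun on a freshly loaded dataframe. The sketch below assumes the column names shown in the dtypes output; the function name `apply_cleaning_rules` is hypothetical.

```python
import pandas as pd


def apply_cleaning_rules(frame):
    """Keep trips that satisfy all three cleaning rules from this section."""
    keep = (
        # not sent via "store and forward"
        (frame["store_and_fwd_flag"] == "N")
        # payment type 1 - Credit Card or 2 - Cash
        & frame["payment_type"].isin([1.0, 2.0])
        # RatecodeID 1 - Standard Rate
        & (frame["RatecodeID"] == 1.0)
    )
    # copy so later edits do not touch the unfiltered dataframe
    return frame.loc[keep].copy()
```

Because the rules are combined with `&`, the result should match applying the three filters one after another.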