
# Automatidata Project - New York Taxi Trip data analysis

You have just started as a data professional in a fictional data consulting firm, Automatidata. Their client, the New York City Taxi and Limousine Commission (New York City TLC), has hired the Automatidata team for its reputation in helping clients develop data-based solutions.

The team is still in the early stages of the project. Previously, you were asked to complete a project proposal by your supervisor, DeShawn Washington. You have received notice that your project proposal has been approved and that New York City TLC has given the Automatidata team access to their data. To get clear insights, New York City TLC's data must be analyzed, its key variables identified, and the dataset confirmed to be ready for analysis.




## Inspect and analyze data

In this activity, you will examine data provided and prepare it for analysis. This activity will help ensure the information is:

1. Ready to answer questions and yield insights
2. Ready for visualizations
3. Ready for future hypothesis testing and statistical methods

**The purpose** of this project is to investigate and understand the data provided.

**The goal** is to use a dataframe constructed within Python, perform a cursory inspection of the provided dataset, and inform team members of your findings.

*This activity has three parts:*

**Part 1:** Understand the situation
- Prepare to understand and organize the provided taxi cab dataset and information.

**Part 2:** Understand the data
- Create a pandas dataframe for data learning, future exploratory data analysis (EDA), and statistical activities.
- Compile summary information about the data to inform next steps.

**Part 3:** Understand the variables
- Use insights from your examination of the summary data to guide deeper investigation into specific variables.

Follow the instructions and answer the following questions to complete the activity. Then, you will complete an Executive Summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work.


## Understand the data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

df = pd.read_csv('/kaggle/input/yellow-taxi-trip-data-2017/C2_2017_Yellow_Taxi_Trip_Data.csv')

print("done")

In [None]:
df.head(10)

In [None]:
df.info()

Some columns have non-numeric dtypes, two of which hold datetime values. There are no null values.
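
The dtype and null-value observations can be checked programmatically rather than read off `df.info()` by eye. The following is a minimal sketch on a small made-up frame, not the real file: `sample` and the column name `pickup_datetime` are hypothetical stand-ins, and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the taxi data (values are invented)
sample = pd.DataFrame({
    "pickup_datetime": ["2017-03-25 08:55:43", "2017-04-11 14:53:28"],  # hypothetical column name
    "trip_distance": [3.34, 1.80],
    "passenger_count": [6, 1],
})

# Columns whose dtype is not numeric (for example, timestamps read from CSV as text)
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Null count per column; all zeros means there are no missing values
null_counts = sample.isna().sum()

# Parse the text timestamps into real datetime values for later analysis
sample["pickup_datetime"] = pd.to_datetime(sample["pickup_datetime"])

print(non_numeric_cols)
print(null_counts)
print(sample.dtypes)
```

Timestamps read from a CSV usually arrive as text, so converting them with `pd.to_datetime` is a common step before any time-based analysis.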

In [None]:
df.describe()

Regarding fare amount, the distribution is worth considering. The maximum fare amount (roughly $1,000) is far larger than the interquartile range (the 25th to 75th percentile) of values.

The negative fare amounts are also questionable, since a fare should not be below zero.

Regarding trip distance, most rides are between 1 and 3 miles, but the maximum is over 33 miles.
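
One way to make these observations concrete is to compute the interquartile range and flag values that fall far outside it. The sketch below uses the common 1.5 × IQR rule of thumb, which is an assumption here and not a method prescribed by this activity. The `fares` values are invented for illustration.

```python
import pandas as pd

# Invented fare values, including one negative fare and one extreme fare
fares = pd.Series([-5.0, 6.5, 9.0, 9.5, 12.0, 14.5, 20.0, 1000.0], name="fare_amount")

# Interquartile range: spread of the middle 50% of values
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Rule of thumb (an assumption): flag values more than 1.5 * IQR above Q3
upper_fence = q3 + 1.5 * iqr
high_outliers = fares[fares > upper_fence]

# Negative fares are suspicious regardless of the distribution
negative_fares = fares[fares < 0]

print("IQR:", iqr)
print("High outliers:", high_outliers.tolist())
print("Negative fares:", negative_fares.tolist())
```

The same pattern could be applied to `df['fare_amount']` or `df['trip_distance']` in the real data.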

## Understand the variables

Sort and interpret the data table for two variables: `trip_distance` and `total_amount`.



In [None]:
# Sort the data by trip distance from maximum to minimum value

df_sort = df.sort_values(by=['trip_distance'], ascending=False)
df_sort.head(10)

In [None]:
# Sort the data by total amount and print the top 20 values
total_amount_sorted = df.sort_values(
    ['total_amount'], ascending=False)['total_amount']
total_amount_sorted.head(20)

In [None]:
# Sort the data by total amount and print the bottom 20 values
total_amount_sorted.tail(20)

In [None]:
# How many of each payment type are represented in the data?
df['payment_type'].value_counts()

In [None]:
# What is the average tip for trips paid for with credit card?
avg_cc_tip = df[df['payment_type']==1]['tip_amount'].mean()
print('Avg. cc tip:', avg_cc_tip)

# What is the average tip for trips paid for with cash?
avg_cash_tip = df[df['payment_type']==2]['tip_amount'].mean()
print('Avg. cash tip:', avg_cash_tip)

In [None]:
# How many times is each vendor ID represented in the data?
df['VendorID'].value_counts()

In [None]:
# What is the mean total amount for each vendor?
df.groupby('VendorID')[['total_amount']].mean()

In [None]:
# Filter the data for credit card payments only
credit_card = df[df['payment_type']==1]

# Count trips by passenger count among credit card payments
credit_card['passenger_count'].value_counts()

In [None]:
# Calculate the average tip amount for each passenger count (credit card payments only)
credit_card.groupby('passenger_count')[['tip_amount']].mean()


### Understand the data - Investigate the variables

Sort and interpret the data table for two variables: `trip_distance` and `total_amount`.

**Answer the following three questions:**

**Question 1:** Sort your first variable (`trip_distance`) from maximum to minimum value. Do the values seem normal?

**Question 2:** Sort your second variable (`total_amount`). Are any values unusual?

**Question 3:** Are the resulting rows similar for both sorts? Why or why not?

------------------------

**Question 1:**  The values align with our earlier data discovery, where we noticed that the longest rides are approximately 33 miles.

**Question 2:** Yes, the first two values are significantly higher than the others.

**Question 3:** No. The most expensive rides are not necessarily the longest ones.
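
The answer to Question 3 can be checked by comparing which rows land at the top of each sort. This is a minimal sketch on an invented frame; `trips` and `top_n` are hypothetical names, and the values are made up so that the overlap is easy to follow.

```python
import pandas as pd

# Invented trips: some long and moderately priced, some short and very expensive
trips = pd.DataFrame({
    "trip_distance": [33.0, 30.0, 2.5, 0.0, 1.2, 18.0],
    "total_amount": [150.0, 110.0, 1200.0, 450.0, 8.5, 70.0],
})

top_n = 3

# Row labels of the top rows under each sort
top_by_distance = set(
    trips.sort_values("trip_distance", ascending=False).head(top_n).index
)
top_by_amount = set(
    trips.sort_values("total_amount", ascending=False).head(top_n).index
)

# Rows that appear near the top of both sorts
shared = top_by_distance & top_by_amount
print(f"{len(shared)} of the top {top_n} rows appear in both sorts: {sorted(shared)}")
```

A small overlap suggests that the largest totals come from something other than distance alone, which is worth investigating before modeling.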


## Conclusion 

Which two variables are most helpful for building a predictive model for the client, NYC TLC?

After looking at the dataset, the two variables most likely to help build a predictive model for taxi ride fares are `total_amount` and `trip_distance`, because together they describe the cost and length of a taxi ride.
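
As a possible next step, the relationship between these two variables could be quantified with a correlation coefficient. The sketch below uses invented values in a hypothetical `trips` frame; it does not report a result for the real dataset.

```python
import pandas as pd

# Invented trips where cost rises roughly in step with distance
trips = pd.DataFrame({
    "trip_distance": [0.5, 1.0, 2.0, 3.5, 8.0, 17.0],
    "total_amount": [5.0, 7.5, 11.0, 16.5, 31.0, 60.0],
})

# Pearson correlation (the pandas default) between distance and total cost
corr = trips["trip_distance"].corr(trips["total_amount"])
print("Correlation:", round(corr, 3))
```

On the real data, outliers such as the negative and very large amounts noted earlier would likely need handling before a coefficient like this is meaningful.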