# **Project Name** - NYC Yellow Taxi System



**Project Summary -** This project analyzes sales trends for NYC yellow taxis and segments customers using the transactional data recorded for each trip. By tracking how sales change over time and identifying distinct customer segments, it aims to provide insights for optimizing service delivery and marketing strategies.

The dataset includes various attributes such as VendorID, pickup and dropoff datetimes, passenger count, trip distance, geographical coordinates, rate code, store and forward flag, payment type, fare amount, additional charges (extra, MTA tax, tolls), tip amount, and total amount. These attributes provide comprehensive information about taxi trips and customer transactions.
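
As a rough illustration of how the datetime and amount attributes could feed the "sales over time" analysis, the sketch below parses the pickup and dropoff timestamps, derives a trip duration, and totals revenue per pickup hour. The column names `tpep_pickup_datetime` and `tpep_dropoff_datetime` are assumed from the usual yellow-taxi schema and are not confirmed by this notebook; the three rows are toy values, not real trips.

```python
import pandas as pd

# Toy rows standing in for the real trip file (values are made up).
# Column names are assumptions based on the usual yellow-taxi schema.
trips = pd.DataFrame({
    "tpep_pickup_datetime": ["2016-01-01 00:05:00", "2016-01-01 00:40:00", "2016-01-01 01:10:00"],
    "tpep_dropoff_datetime": ["2016-01-01 00:20:00", "2016-01-01 01:00:00", "2016-01-01 01:25:00"],
    "total_amount": [12.5, 20.0, 9.8],
})

# Convert the timestamp strings to datetime64 so they support arithmetic.
for col in ("tpep_pickup_datetime", "tpep_dropoff_datetime"):
    trips[col] = pd.to_datetime(trips[col])

# Trip duration in minutes.
trips["duration_min"] = (
    trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
).dt.total_seconds() / 60

# Number of trips and total revenue per pickup hour.
hourly = trips.groupby(trips["tpep_pickup_datetime"].dt.hour)["total_amount"].agg(["count", "sum"])
hourly.index.name = "pickup_hour"
print(hourly)
```

On the full dataset the same grouping could be applied to `df` once its datetime columns are parsed.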


In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Load Dataset
df=pd.read_csv('yellow_tripdata_2016-01 (2).csv')

In [None]:
# Dataset First Look
df.head()

In [None]:
# Dataset Rows & Columns count
df.shape

In [None]:
# Dataset Info
df.info()

In [None]:
# Dataset Duplicate Value Count
du=df.duplicated().value_counts()
du

In [None]:
# Missing Values/Null Values Count
miss=df.isnull().sum()
miss.sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(5,8))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values Heatmap")
plt.show()

In [None]:
# Dataset Columns
print(df.columns)

In [None]:
# Dataset Describe
df.describe()

In [None]:
# Check unique values for each variable.
df.nunique()

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(8, 6))
sns.barplot(data=df, x='VendorID', y='trip_distance', estimator=np.mean)
plt.title('Avg Trip Distance per Vendor')
plt.xlabel('Vendor ID')
plt.ylabel('Average Trip Distance')
plt.show()

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(8, 6))
sns.barplot(data=df, x='payment_type', y='fare_amount', estimator=np.sum)
plt.title('Total Fare Amount per Payment Type')
plt.xlabel('Payment Type')
plt.ylabel('Total Fare Amount')
plt.show()


#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(8, 6))
sns.barplot(data=df, x='passenger_count', y='tip_amount', estimator=np.mean)
plt.title('Average Tip Amount per Passenger Count')
plt.xlabel('Passenger Count')
plt.ylabel('Average Tip Amount')
plt.show()
# This code snippet creates a bar plot showing the average tip amount based on the passenger count.
# The x-axis represents the passenger count, and the y-axis represents the average tip amount.


##### 1. Why did you pick the specific chart?

I selected the bar plot of the average tip amount per passenger count because it allows for a comparison of tip amounts based on the number of passengers in a trip. This visualization helps in understanding whether there is any correlation between the number of passengers and the tip amount left by customers.
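
A bar plot only suggests a relationship visually. One possible way to put a number on it is the Pearson correlation coefficient between the two columns. The sketch below uses a small made-up sample (not real trip data) with the same column names as the notebook; on the real data the same call could be run on `df` directly.

```python
import pandas as pd

# Made-up sample with the notebook's column names (illustration only).
sample = pd.DataFrame({
    "passenger_count": [1, 1, 2, 2, 3, 4],
    "tip_amount": [2.0, 1.5, 2.5, 0.0, 3.0, 1.0],
})

# Pearson correlation: +1 is a perfect increasing linear relation,
# -1 a perfect decreasing one, values near 0 mean little linear relation.
pearson = sample["passenger_count"].corr(sample["tip_amount"])
print(f"Pearson r = {pearson:.3f}")
```

A coefficient close to zero would indicate that passenger count alone explains little of the variation in tips.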

##### 2. What is/are the insight(s) found from the chart?

The chart reveals the average tip amount given by customers for trips with different passenger counts. It provides insights into customer tipping behavior based on the number of passengers in a trip. This insight can help in understanding tipping trends and whether certain passenger counts tend to result in higher or lower tip amounts.
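
To read the actual numbers behind the bars, a grouped summary can be printed alongside the chart. The following is a minimal sketch on made-up values (not real trip data). It also reports the median and the number of trips per group, since a group with very few trips can produce a misleading average.

```python
import pandas as pd

# Made-up sample with the notebook's column names (illustration only).
sample = pd.DataFrame({
    "passenger_count": [1, 1, 2, 2, 3, 4],
    "tip_amount": [2.0, 1.5, 2.5, 0.0, 3.0, 1.0],
})

# Mean and median tip, plus trip count, for each passenger count.
tip_summary = (
    sample.groupby("passenger_count")["tip_amount"]
    .agg(["mean", "median", "count"])
)
print(tip_summary)
```

Replacing `sample` with `df` would give the same table for the full dataset.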

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, the insights gained from this chart can potentially create a positive business impact. Understanding tipping patterns by passenger count can help businesses refine their customer service strategies, for example by providing incentives or encouragement for larger groups to tip more generously.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.scatter(df['trip_distance'], df['tip_amount'],alpha=0.5)
plt.xlabel('Trip Distance')
plt.ylabel('Tip Amount')
plt.title('Scatter Plot: Trip Distance vs Tip Amount')
plt.show()

#### Chart - 5

In [None]:
# Chart - 5 visualization code
passenger_count_counts = df['passenger_count'].value_counts()    # Calculate the frequency of each passenger count
passenger_count_counts.plot(kind='bar')                          # Create a bar plot
plt.xlabel('Passenger Count')
plt.ylabel('Frequency')
plt.title('Distribution of Passenger Count')
plt.show()

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Pie chart for VendorID
vendor_counts = df['VendorID'].value_counts()
plt.figure(figsize=(8, 6))
plt.pie(vendor_counts, labels=vendor_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Taxi Trips by Vendor')
plt.axis('equal')
plt.show()


#### Chart - 7

In [None]:
# Chart - 7 visualization code
passenger_counts = df['passenger_count'].value_counts()
plt.figure(figsize=(8, 6))
plt.pie(passenger_counts, labels=passenger_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Taxi Trips by Number of Passengers')
plt.axis('equal')
plt.show()




#### Chart - 8

In [None]:
# Chart - 8 visualization code
ratecode_counts = df['RatecodeID'].value_counts()
ratecode_counts.plot(kind='bar')
plt.xlabel('Ratecode ID')
plt.ylabel('Frequency')
plt.title('Distribution of Ratecode IDs')
plt.show()

#### Chart - 9

In [None]:
# Chart - 9 visualization code
store_fwd_counts = df['store_and_fwd_flag'].value_counts()
store_fwd_counts.plot(kind='bar')
plt.xlabel('Store and Forward Flag')
plt.ylabel('Frequency')
plt.title('Distribution of Store and Forward Flags')
plt.show()


#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.hist(df['fare_amount'], bins=20, edgecolor='black')
plt.xlabel('Fare Amount')
plt.ylabel('Frequency')
plt.title('Histogram of Fare Amount')
plt.show()

#### Chart - 11

In [None]:
# Chart - 11 visualization code
plt.hist(df['tip_amount'], bins=20, edgecolor='black')
plt.xlabel('Tip Amount')
plt.ylabel('Frequency')
plt.title('Histogram of Tip Amount')
plt.show()



#### Chart - 12

In [None]:
# Chart - 12 visualization code
plt.scatter(df['pickup_longitude'], df['pickup_latitude'], alpha=0.5)
plt.xlabel('Pickup Longitude')
plt.ylabel('Pickup Latitude')
plt.title('Scatter Plot: Pickup Longitude vs Pickup Latitude')
plt.show()

#### Chart - 13

In [None]:
# Chart - 13 visualization code
plt.scatter(df['dropoff_longitude'], df['dropoff_latitude'], alpha=0.5)
plt.xlabel('Dropoff Longitude')
plt.ylabel('Dropoff Latitude')
plt.title('Scatter Plot: Dropoff Longitude vs Dropoff Latitude')
plt.show()


#### Chart - 14

In [None]:
# Pair Plot visualization code
# Selecting numerical columns for pair plot
numerical_columns = ['VendorID', 'passenger_count', 'trip_distance', 'pickup_longitude',
                     'pickup_latitude', 'RatecodeID', 'dropoff_longitude', 'dropoff_latitude',
                     'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount',
                     'tolls_amount', 'improvement_surcharge', 'total_amount']

# Creating pair plot on a random sample, since plotting every row of a large
# trip file is very slow (the sample size of 1,000 is an arbitrary choice)
pair_df = df[numerical_columns]
sns.pairplot(pair_df.sample(n=min(1000, len(pair_df)), random_state=42))
plt.show()
