# Project 1
## NYC Yellow Taxi Trip Analysis 🚕  
In this project, I explore the **NYC Yellow Taxi Trip dataset** from 2023.  
I focus on one numeric column, `trip_distance`, to calculate summary statistics and create a simple visualization.


In [None]:
import pandas as pd
url = "https://data.cityofnewyork.us/resource/4b4i-vvec.csv?$limit=10000"
taxi = pd.read_csv(url)
taxi.head()







Unnamed: 0,vendorid,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecodeid,store_and_fwd_flag,pulocationid,dolocationid,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,2,2023-01-01T00:32:10.000,2023-01-01T00:40:36.000,1.0,0.97,1.0,N,161,141,2,9.3,1.0,0.5,0.0,0.0,1.0,14.3,2.5,0.0
1,2,2023-01-01T00:55:08.000,2023-01-01T01:01:27.000,1.0,1.1,1.0,N,43,237,1,7.9,1.0,0.5,4.0,0.0,1.0,16.9,2.5,0.0
2,2,2023-01-01T00:25:04.000,2023-01-01T00:37:49.000,1.0,2.51,1.0,N,48,238,1,14.9,1.0,0.5,15.0,0.0,1.0,34.9,2.5,0.0
3,1,2023-01-01T00:03:48.000,2023-01-01T00:13:25.000,0.0,1.9,1.0,N,138,7,1,12.1,7.25,0.5,0.0,0.0,1.0,20.85,0.0,1.25
4,2,2023-01-01T00:10:29.000,2023-01-01T00:21:19.000,1.0,1.43,1.0,N,107,79,1,11.4,1.0,0.5,3.28,0.0,1.0,19.68,2.5,0.0


# Step 1: Select a numeric column   
I analyze the **trip distance** (in miles) for these 10,000 NYC taxi rides, so in this step I select that column and drop any missing (NaN) values.

In [None]:
col = "trip_distance"
data = taxi[col].dropna()
print(data)


0       0.97
1       1.10
2       2.51
3       1.90
4       1.43
        ... 
9995    1.97
9996    3.33
9997    4.70
9998    1.00
9999    1.80
Name: trip_distance, Length: 10000, dtype: float64
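The output above still has 10,000 rows, which suggests `dropna()` had nothing to remove in this sample. A minimal sketch of how that could be checked explicitly is shown below; the `toy` Series and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for taxi["trip_distance"]; values are invented
toy = pd.Series([0.97, None, 2.51, float("nan"), 1.43], name="trip_distance")

n_missing = int(toy.isna().sum())  # count missing entries before dropping
clean = toy.dropna()               # same call as in the cell above

print(f"{n_missing} missing, {len(clean)} of {len(toy)} values kept")
```

Running the same two lines on `taxi[col]` would report how many rows, if any, were dropped.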


# Compute Mean, Median, and Mode with pandas  


In [3]:
mean_taxi = data.mean()
median_taxi = data.median()
# mode() returns every tied value in ascending order; take the first (smallest)
mode_taxi = data.mode().iloc[0]

print("The mean trip distance is:", mean_taxi)
print("The median trip distance is:", median_taxi)
print("The mode trip distance is:", mode_taxi)

The mean trip distance is: 3.200456
The median trip distance is: 2.12
The mode trip distance is: 0.0
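As a rough cross-check of what these three statistics measure, the sketch below uses Python's standard-library `statistics` module on a small toy list. The values are invented for illustration (they only mimic the pattern above: a repeated 0.0 and one long trip), and the variable names are hypothetical.

```python
import statistics

# Toy trip distances in miles; invented values, not taken from the dataset
toy = [0.0, 0.0, 1.1, 1.9, 2.5, 12.0]

mean_toy = statistics.mean(toy)      # pulled upward by the single 12.0-mile trip
median_toy = statistics.median(toy)  # average of the two middle values
mode_toy = statistics.mode(toy)      # most frequent value

print("mean:", mean_toy)
print("median:", median_toy)
print("mode:", mode_toy)
```

In this toy list, as in the taxi sample, one long trip is enough to push the mean above the median.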


# The Hard Way: Compute mean, median, and mode manually

In [4]:
import csv
import urllib.request

url = "https://data.cityofnewyork.us/resource/4b4i-vvec.csv?$limit=10000"
# Download the CSV and decode each line so csv.DictReader can parse it
with urllib.request.urlopen(url) as response:
    lines = [l.decode('utf-8') for l in response.readlines()]
reader = csv.DictReader(lines)

trip_distance = []
for row in reader:
    try:
        trip_distance.append(float(row["trip_distance"]))
    except (KeyError, ValueError):
        # Skip rows where the column is missing or not numeric
        continue
print(f"Loaded {len(trip_distance)} numeric trip distances.")

# Compute mean
total = 0
for x in trip_distance:
    total += x
mean = total/len(trip_distance)
print("Mean for trip distance is:", mean)

# Compute median
sorted_data = sorted(trip_distance)
n = len(sorted_data)
if n % 2 == 0:
    median = (sorted_data[n//2 - 1] + sorted_data[n//2]) / 2
else:
    median = sorted_data[n//2]
print("Median for trip distance is:", median)

# Compute mode
# list.count rescans the whole list for every unique value, so this is slow on large data
value = set(sorted_data)
mode = max(value, key=sorted_data.count)
print("Mode for trip distance is:", mode)





Loaded 10000 numeric trip distances.
Mean for trip distance is: 3.200455999999992
Median for trip distance is: 2.12
Mode for trip distance is: 0.0
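The manual mode above calls `list.count` once per unique value, and when two values tie, the winner depends on set iteration order. One possible alternative, sketched below, counts every value in a single pass with a dictionary and breaks ties toward the smallest value, which is the value `data.mode()[0]` returns in pandas. The function name `manual_mode` and the toy input are hypothetical.

```python
def manual_mode(values):
    """Return the most frequent value; ties go to the smallest value."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1  # one pass over the data
    best = max(counts.values())           # raises ValueError on empty input
    return min(v for v, c in counts.items() if c == best)

# 0.0 and 1.1 both appear twice; the smaller one is returned
toy_mode = manual_mode([1.1, 0.0, 2.5, 0.0, 1.1])
print("Mode:", toy_mode)
```

Calling `manual_mode(trip_distance)` on the list loaded above should agree with the pandas result whenever the data are the same.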


# Data Visualization

In [None]:
data = taxi["trip_distance"].dropna().tolist()

# Bins are right-inclusive: up to 1 mile, (1, 2], (2, 3], (3, 5], (5, 10], and above 10
labels = ["0–1", "1–2", "2–3", "3–5", "5–10", "10+"]

counts = [0]*len(labels)
for x in data:
    if x <= 1:
        counts[0] += 1
    elif x <= 2:
        counts[1] += 1
    elif x <= 3:
        counts[2] += 1
    elif x <= 5:
        counts[3] += 1
    elif x <= 10:
        counts[4] += 1
    else:
        counts[5] += 1

max_count = max(counts)
scale = 50 / max_count  # the fullest bin gets a 50-character bar

print(f"Trip Distance Distribution (each ▇ ≈ {max_count / 50:.0f} trips)\n")
for label, count in zip(labels, counts):
    bar = "▇" * int(count * scale)
    print(f"{label:>5} mi | {bar} ({count})")


Trip Distance Distribution (each ▇ ≈ 58 trips)

  0–1 mi | ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ (1830)
  1–2 mi | ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ (2913)
  2–3 mi | ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ (1837)
  3–5 mi | ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ (1782)
 5–10 mi | ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ (1181)
  10+ mi | ▇▇▇▇▇▇▇ (457)
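The `if`/`elif` chain above hard-codes each bin edge. As a sketch of one way to drive the same binning from a list of edges, the example below uses `bisect_left` from the standard library. The function name `bin_counts`, the `edges` list, and the toy distances are assumptions made for illustration.

```python
from bisect import bisect_left

# Upper edges of the closed bins; anything above the last edge goes to "10+"
edges = [1, 2, 3, 5, 10]
labels = ["0–1", "1–2", "2–3", "3–5", "5–10", "10+"]

def bin_counts(values, edges):
    """Count values per right-inclusive bin, plus one open-ended final bin."""
    counts = [0] * (len(edges) + 1)
    for x in values:
        # bisect_left returns the index of the first edge that is >= x
        counts[bisect_left(edges, x)] += 1
    return counts

# Toy distances in miles; invented values, not taken from the dataset
toy = [0.0, 0.97, 1.0, 1.5, 2.51, 4.7, 10.0, 12.3]
toy_counts = bin_counts(toy, edges)
print(dict(zip(labels, toy_counts)))
```

With this version, changing the bins only requires editing `edges` and `labels`, and a value that sits exactly on an edge (such as 1.0 or 10.0) lands in the lower bin, as in the chain above.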
