
# *Exploration of Taxi Trip Fares in the San Francisco Bay Area*

In [1]:
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
taxi_data = Table.read_table("SF_taxi_data.csv")

### Relationship Between Fares and Distance

According to the SFMTA fare calculation table, the fare for a trip of $x$ miles satisfies <br>
$\text{Fare}(x) \geq 3.5 + 0.55 \times (5x - 1)$ <br>
We will compare this model with the actual relationship between fares and distance in the data.
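Expanding the right-hand side may make the bound easier to read, since it separates a fixed charge from a per-mile rate. The 2-mile trip is only an illustrative value, and reading $x$ as distance in miles is an assumption based on the `dist (miles)` column used in the next cell.

```latex
\begin{aligned}
3.5 + 0.55\,(5x - 1) &= 3.5 + 2.75x - 0.55 = 2.95 + 2.75x \\
x = 2:\quad 2.95 + 2.75 \times 2 &= 8.45
\end{aligned}
```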

In [3]:
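# Fare given by the SFMTA formula for each trip's distance
predicted_fare = 3.5 + 0.55 * (5 * taxi_data.column("dist (miles)") - 1)
actual_fare = taxi_data.column("fare ($)")
# Mean squared error between actual and predicted fares
mean_squared_error = sum((actual_fare - predicted_fare) ** 2) / len(actual_fare)
mean_squared_error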

51.760001810791501

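The mean squared error between the actual and predicted fares is about 51.76, which corresponds to a typical gap of roughly 7.19 dollars per trip (the square root of the error). This is far from 0, so the formula on its own is not an accurate indicator of the actual relationship between fares and distance. Because the formula is a lower bound, actual fares above the prediction are expected.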
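As a small-scale check of the same calculation, the sketch below applies the formula to the five sample trips that appear in the `taxi.head()` output later in this notebook. The helper name `sfmta_min_fare` is hypothetical, and five rows are far too few to say anything about the full dataset; the point is only to show the error in dollars and how many fares sit at or above the bound.

```python
import numpy as np

# Five sample trips copied from the taxi.head() output later in this notebook
dist_miles = np.array([1.980835, 2.402241, 0.479348, 2.122408, 1.03807])
actual_fare = np.array([13.2, 10.65, 9.0, 13.95, 7.35])


def sfmta_min_fare(miles):
    """Lower-bound fare from the formula above (hypothetical helper name)."""
    return 3.5 + 0.55 * (5 * miles - 1)


predicted_fare = sfmta_min_fare(dist_miles)
residual = actual_fare - predicted_fare      # positive: fare exceeds the bound

mse = np.mean(residual ** 2)                 # mean squared error
rmse = np.sqrt(mse)                          # same error expressed in dollars
share_above_bound = np.mean(residual >= 0)   # fraction of trips at or above the bound

print("MSE:", mse)
print("RMSE ($):", rmse)
print("Share of trips at or above the bound:", share_above_bound)
```

In these five rows every fare lies above the bound, which is what a lower-bound formula would allow.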

### Analysis of Trips Related to SFO

To separate trips that involve SFO from those that do not, we divide the `taxi_data` table into two tables: one for trips that depart from or arrive at SFO, and one for all other trips.

In [4]:
# Assume SFO is the zone (TAZ) with the most departures in the data
sfo_taz = taxi_data.group("deptaz").sort("count", descending=True).column("deptaz").item(0)
# A trip involves SFO if it departs from or arrives at that zone
is_sfo = (taxi_data.column("deptaz") == sfo_taz) | (taxi_data.column("arrtaz") == sfo_taz)
sfo_labeled_table = taxi_data.with_column("sfo", is_sfo)
sfo_trips = sfo_labeled_table.where("sfo", True).drop("sfo")
no_sfo_trips = sfo_labeled_table.where("sfo", False).drop("sfo")
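The split above comes down to an elementwise OR of two comparisons against a single zone id. The sketch below shows that step in plain NumPy, using the `deptaz` and `arrtaz` values from the five sample rows in the `taxi.head()` output later in this notebook. The zone id `10` is chosen purely for illustration and is not claimed to be SFO's actual zone.

```python
import numpy as np

# deptaz / arrtaz values from the five sample rows in the taxi.head() output
deptaz = np.array([38, 30, 10, 40, 45])
arrtaz = np.array([30, 94, 11, 10, 32])

zone = 10  # hypothetical zone of interest, for illustration only

# Comparing an array with a scalar broadcasts the scalar across every row
touches_zone = (deptaz == zone) | (arrtaz == zone)

# The mask and its negation partition the trips: each trip lands in exactly one group
in_group = np.flatnonzero(touches_zone)
out_group = np.flatnonzero(~touches_zone)

print(touches_zone)
print(in_group, out_group)
```

Because the two groups come from a mask and its negation, their row counts always add up to the size of the original table.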

Let us analyze the differences between trips that include SFO and trips that do not by looking at the distances and fares.

In [5]:
print("Mean distance for sfo trips : ", np.mean(sfo_trips.column("dist (miles)")))
print("Mean distance for non-sfo trips : ", np.mean(no_sfo_trips.column("dist (miles)")))
print("Mean fare for sfo trips : ", np.mean(sfo_trips.column("fare ($)")))
print("Mean fare for non-sfo trips : ", np.mean(no_sfo_trips.column("fare ($)")))

Mean distance for sfo trips :  13.6399973334
Mean distance for non-sfo trips :  2.14582874527
Mean fare for sfo trips :  49.0883790447
Mean fare for non-sfo trips :  12.3978968537


From these means, we can see that trips involving SFO are on average much longer than other trips (about 13.6 miles versus 2.1 miles), so their average fare is also higher (about 49 dollars versus 12 dollars). This makes sense because SFO is quite far from most hotspots in the Bay Area.
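One further way to read these means is as a rough cost per mile. The sketch below divides each mean fare by the matching mean distance, using the values printed above. This is a ratio of means, which is not the same as averaging fare per mile trip by trip, so treat it as a coarse comparison only.

```python
# Means copied from the printed output above
mean_dist_sfo, mean_fare_sfo = 13.6399973334, 49.0883790447
mean_dist_other, mean_fare_other = 2.14582874527, 12.3978968537

# Ratio of means: NOT the same as the mean of per-trip fare/distance ratios
per_mile_sfo = mean_fare_sfo / mean_dist_sfo
per_mile_other = mean_fare_other / mean_dist_other

print(f"SFO trips:     ${per_mile_sfo:.2f} per mile")
print(f"Non-SFO trips: ${per_mile_other:.2f} per mile")
```

By this measure SFO trips come out cheaper per mile (roughly 3.60 versus 5.78 dollars), which is what one would expect if a fixed initial charge is spread over a longer distance.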

### Linear Regression of Travel Distance vs. Extra Cost (PART 3)

In [6]:
import pandas as pd

In [8]:
taxi = taxi_data.to_df()
taxi.head()

| Unnamed: 0 | id | departure time | arrival time | fare ($) | num | dep lon | dep lat | arr lon | arr lat | deptaz | arrtaz | dist (miles) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 9/1/12 0:11 | 9/1/12 0:20 | 13.2 | 1 | -122.41354 | 37.802683 | -122.421277 | 37.785395 | 38 | 30 | 1.980835 |
| 1 | 1 | 9/1/12 0:23 | 9/1/12 0:31 | 10.65 | 1 | -122.4197 | 37.78609 | -122.435217 | 37.762177 | 30 | 94 | 2.402241 |
| 2 | 2 | 9/1/12 0:45 | 9/1/12 0:49 | 9.0 | 1 | -122.41512 | 37.774672 | -122.407657 | 37.782615 | 10 | 11 | 0.479348 |
| 3 | 3 | 9/1/12 0:41 | 9/1/12 0:54 | 13.95 | 2 | -122.419392 | 37.806622 | -122.415393 | 37.778115 | 40 | 10 | 2.122408 |
| 4 | 4 | 9/1/12 1:09 | 9/1/12 1:13 | 7.35 | 1 | -122.429722 | 37.79779 | -122.41806 | 37.789032 | 45 | 32 | 1.03807 |


In [6]:
# Analysis here

### Linear Regression of Travel Duration vs. Extra Cost

In [7]:
# Analysis here

### Linear Regression vs. K-Nearest Neighbors

In [None]:
# Analysis here