In [1]:
import pandas as pd
from pandas import Series, DataFrame

trip_distance = pd.read_csv('../data/taxi-distance.csv', header=None).squeeze()
passenger_count = pd.read_csv('../data/taxi-passenger-count.csv', header=None).squeeze()

df = DataFrame({'trip_distance': trip_distance,
                'passenger_count': passenger_count})

# Beyond 1

If we define outliers as the lowest 10% and highest 10% of values, how many are there? Why is (or isn't) this a good measure?

In [2]:
df[(df['trip_distance'] < df['trip_distance'].quantile(0.1)) | 
   (df['trip_distance'] > df['trip_distance'].quantile(0.9)) ]

| | trip_distance | passenger_count |
|---|---|---|
| 1 | 0.46 | 1 |
| 7 | 11.90 | 4 |
| 9 | 0.60 | 1 |
| 10 | 0.01 | 3 |
| 13 | 0.50 | 2 |
| ... | ... | ... |
| 9976 | 12.60 | 1 |
| 9978 | 0.38 | 1 |
| 9979 | 11.30 | 1 |
| 9980 | 9.13 | 1 |


The good news is that this measure is easy to understand. The bad news is that it flags a fixed share of the data (up to about 20%) no matter how the values are distributed. If we have many short trips (as we do here), we end up calling some of them outliers even though they're very close in value to non-outliers.
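The filter in cell 2 displays the matching rows, but the question asks for a count. One way to get it is to sum the boolean mask, since each `True` counts as 1. The following is a minimal sketch on made-up data; `toy`, `low`, `high`, and `n_outliers` are illustrative names, not part of the taxi data set.

```python
import pandas as pd

# Made-up distances 1.0 through 20.0, standing in for df['trip_distance'].
toy = pd.Series([float(x) for x in range(1, 21)])

# Cutoffs for the lowest 10% and highest 10% of values.
low = toy.quantile(0.1)
high = toy.quantile(0.9)

# Boolean mask: True where a value falls outside the two cutoffs.
is_outlier = (toy < low) | (toy > high)

# Summing a boolean Series counts the True entries.
n_outliers = int(is_outlier.sum())
print(n_outliers)
```

On the real data, the same pattern would apply to `df['trip_distance']` in place of `toy`.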

# Beyond 2

How many short, medium, and long trips were there for trips that had only one passenger? Note that data for passenger count and trip length are from the same data set, meaning that the indexes are the same.

If we're only interested in keeping the non-outlier values, then we could use the `scipy.stats.trimboth` function on our series. Its second argument is the proportion we want to cut from each end, i.e., from both the bottom and the top. Note that it returns a NumPy array rather than a pandas series.

In [3]:
from scipy.stats import trimboth
trimboth(df['trip_distance'], 0.1)

array([0.63, 0.63, 0.63, ..., 8.2 , 8.2 , 8.2 ])
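Because the array above carries no index, the trimmed distances can no longer be lined up with `passenger_count`. If the row labels matter, a pandas-only approach along these lines may help. This is a sketch, not SciPy's implementation: it assumes the number of values dropped from each end is `int(proportion * n)`, and `toy`, `proportion`, and `k` are illustrative names with made-up data.

```python
import pandas as pd

# Made-up distances; the default integer index stands in for trip row labels.
toy = pd.Series([0.1, 9.5, 1.2, 2.0, 0.4, 3.3, 15.0, 2.7, 1.8, 4.1])

proportion = 0.2
k = int(proportion * len(toy))  # number of values to drop from each end

# Sort by value, slice off k values at each end,
# then restore the original row order via the index.
trimmed = toy.sort_values().iloc[k:len(toy) - k].sort_index()
print(trimmed)
```

The surviving index labels can then be used to select the matching rows from the original data frame, for example with `df.loc[trimmed.index]`.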

# Beyond 3

The `scipy.stats.zscore` function centers and rescales (i.e., standardizes) our data set. The mean becomes 0 and the standard deviation becomes 1, so each value tells us how many standard deviations the original value lies above or below the mean. Find all of the distances for which the absolute value of the z-score is greater than 3.

In [4]:
from scipy.stats import zscore
df['trip_distance'][abs(zscore(df['trip_distance'])) > 3]

88      23.76
238     18.32
379     16.38
509     16.82
641     19.72
        ...  
9897    16.11
9899    17.48
9906    17.70
9955    15.49
9964    18.55
Name: trip_distance, Length: 306, dtype: float64
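To see what the z-score filter is doing, it can help to compute the scores by hand: subtract the mean, then divide by the standard deviation. The sketch below uses made-up data and the population standard deviation (`ddof=0`), which matches the default of `scipy.stats.zscore` as far as I know; `toy`, `z`, and `outliers` are illustrative names.

```python
import pandas as pd

# Made-up data: 19 ordinary distances (1.0 to 2.8) plus one very long trip.
toy = pd.Series([1.0 + 0.1 * i for i in range(19)] + [40.0])

# z-score: distance from the mean, measured in standard deviations.
# ddof=0 selects the population standard deviation.
z = (toy - toy.mean()) / toy.std(ddof=0)

# Keep only the values more than 3 standard deviations from the mean.
outliers = toy[z.abs() > 3]
print(outliers)
```

One thing this makes visible is that a single extreme value inflates both the mean and the standard deviation, so a threshold of 3 can only be exceeded when the data set is reasonably large.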