In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Beyond 1

Create a single data frame containing rides from both January 2019 and January 2020, with a column `year` indicating which year it came from. Use `groupby` to compare the average cost of a taxi in January of each of these two years.

In [2]:
jan_2019_filename = '../data/nyc_taxi_2019-01.csv'

jan_2019_df = pd.read_csv(jan_2019_filename,
                usecols=['passenger_count',
                         'trip_distance', 'total_amount'])
jan_2019_df['year'] = 2019

jan_2020_filename = '../data/nyc_taxi_2020-01.csv'

jan_2020_df = pd.read_csv(jan_2020_filename,
                usecols=['passenger_count',
                         'trip_distance', 'total_amount'])
jan_2020_df['year'] = 2020

df = pd.concat([jan_2019_df, jan_2020_df])

In [3]:
df.head()

Unnamed: 0,passenger_count,trip_distance,total_amount,year
0,1.0,1.5,9.95,2019
1,1.0,2.6,16.3,2019
2,3.0,0.0,5.8,2019
3,5.0,0.0,7.55,2019
4,5.0,0.0,55.55,2019


In [4]:
df.groupby('year')['total_amount'].mean()

year
2019    15.682222
2020    18.663149
Name: total_amount, dtype: float64

# Beyond 2

Now create a two-level grouping, first by year and then by `passenger_count`.

In [5]:
# Group by year, then by passenger_count
# Then take the mean of total_amount
# The result is a series with a multi-index
df.groupby(['year', 'passenger_count'])['total_amount'].mean()

year  passenger_count
2019  0.0                18.663658
      1.0                15.609601
      2.0                15.831294
      3.0                15.604015
      4.0                15.650307
      5.0                15.546940
      6.0                15.437892
      7.0                48.278421
      8.0                64.105517
      9.0                31.094444
2020  0.0                18.059724
      1.0                18.343110
      2.0                19.050504
      3.0                18.736862
      4.0                19.128092
      5.0                18.234443
      6.0                18.367962
      7.0                71.143103
      8.0                58.197059
      9.0                81.244211
Name: total_amount, dtype: float64
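
A multi-index series like the one above can be hard to compare across years at a glance. One possible way to line the years up side by side is `unstack`. The sketch below uses a tiny hand-made frame (`toy`, with invented values) rather than the taxi data, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for the taxi rides; values are invented.
toy = pd.DataFrame({
    'year':            [2019, 2019, 2019, 2020, 2020, 2020],
    'passenger_count': [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
    'total_amount':    [10.0, 20.0, 30.0, 12.0, 18.0, 22.0],
})

# Same two-level grouping as above: a Series with a (year, passenger_count) MultiIndex
means = toy.groupby(['year', 'passenger_count'])['total_amount'].mean()

# Move the outer level (year) into columns so the two years sit side by side
by_year = means.unstack('year')
print(by_year)
```

Each row is then one `passenger_count` value and each column one year, which makes the per-group change between years easier to read.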

# Beyond 3

Finally, the `corr` method allows us to see how strongly two columns correlate with one another. Use `corr` and then `sort_values` to find which have the highest correlation.

In [6]:
# df.corr() returns a square, symmetric matrix, so the passenger_count
# column alone shows how every other column correlates with it

# When we sort, we see very clearly that there's basically no correlation
# between passenger_count and any other column. (Except for itself, of 
# course...)  So there really doesn't seem to be any financial advantage
# for a driver to take a larger group.
df.corr().sort_values('passenger_count')

Unnamed: 0,passenger_count,trip_distance,total_amount,year
year,-0.021602,0.00114,0.007657,1.0
total_amount,-0.000136,0.004331,1.0,0.007657
trip_distance,0.008974,1.0,0.004331,0.00114
passenger_count,1.0,0.008974,-0.000136,-0.021602
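
As the comment in the cell notes, a single column of the correlation matrix is enough here. A minimal sketch of that idea, on a small invented frame (`toy` is hypothetical and not the taxi data), selects the `passenger_count` column, drops its trivial self-correlation, and sorts the rest:

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
toy = pd.DataFrame({
    'passenger_count': [1, 2, 1, 3, 2, 1],
    'trip_distance':   [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'total_amount':    [5.0, 9.0, 10.0, 15.0, 16.0, 21.0],
})

# Take the passenger_count column of the correlation matrix,
# drop the self-correlation (always 1.0), and sort what remains.
pc_corr = toy.corr()['passenger_count'].drop('passenger_count').sort_values()
print(pc_corr)
```

The result is a series indexed by the other column names, which is shorter to read than the full square matrix.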


The correlation coefficient between `trip_distance` and `total_amount` comes out unexpectedly low, at about 0.004.
The reason for this needs to be investigated.

In [7]:
df.describe()

Unnamed: 0,passenger_count,trip_distance,total_amount,year
count,14007360.0,14072800.0,14072800.0,14072800.0
mean,1.543659,2.859596,17.03894,2019.455
std,1.192295,56.16996,193.8733,0.497983
min,0.0,-30.62,-1242.3,2019.0
25%,1.0,0.93,9.36,2019.0
50%,1.0,1.58,12.8,2019.0
75%,2.0,2.88,18.3,2020.0
max,9.0,210240.1,623261.7,2020.0


- The minimum `trip_distance` is negative.
- The maximum `trip_distance` is about 210,000 miles.
- Both values should be treated as measurement or data-entry errors.
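
One way to count such implausible values is a boolean mask built with `Series.between`. The sketch below runs on a handful of `trip_distance` values that appear in this notebook's outputs. The 0 to 30 mile window is an assumed plausibility range for a city ride, not something the data itself defines.

```python
import pandas as pd

# A few trip_distance values that appear in this notebook's outputs
distances = pd.Series([1.5, 2.6, -0.88, 19.0, 700.7, 210240.07],
                      name='trip_distance')

# Assumed plausibility window for a city taxi ride, in miles (hypothetical cutoffs)
LOW, HIGH = 0, 30

# between() is inclusive on both ends by default
plausible = distances.between(LOW, HIGH)

print('implausible rows:', int((~plausible).sum()))
print(distances[~plausible])
```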

In [8]:
df['trip_distance'].min()

-30.62

In [9]:
df['trip_distance'].max()

210240.07

The distribution of `trip_distance` needs a closer look.
The 99th percentile of `trip_distance` is 19 miles.

In [10]:
df['trip_distance'].describe(percentiles=[0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])

count    1.407280e+07
mean     2.859596e+00
std      5.616996e+01
min     -3.062000e+01
1%       3.000000e-02
5%       4.600000e-01
10%      6.000000e-01
25%      9.300000e-01
50%      1.580000e+00
75%      2.880000e+00
90%      6.500000e+00
95%      1.066000e+01
99%      1.900000e+01
max      2.102401e+05
Name: trip_distance, dtype: float64

A fair number of rows have a negative `trip_distance`.

In [11]:
df[df['trip_distance'] < 0]

Unnamed: 0,passenger_count,trip_distance,total_amount,year
273001,1.0,-0.88,12.75,2020
277810,1.0,-0.88,11.55,2020
282887,1.0,-0.71,10.95,2020
787926,1.0,-9.29,34.45,2020
905236,1.0,-1.79,13.95,2020
...,...,...,...,...
6378826,,-11.37,38.86,2020
6378968,,-7.80,27.02,2020
6378978,,-12.73,47.26,2020
6379017,,-6.55,39.50,2020


There are 10 rides longer than 200 miles. In many of them, however, the fare is not consistent with the distance.

In [12]:
df[df['trip_distance'] > 200]

Unnamed: 0,passenger_count,trip_distance,total_amount,year
4286612,1.0,700.7,9.0,2019
4707513,2.0,211.36,56.56,2019
4881766,1.0,201.27,152.46,2019
6074021,1.0,831.8,11.76,2019
6770897,5.0,214.01,761.8,2019
565751,4.0,259.22,577.8,2020
1442784,1.0,211.7,620.3,2020
2154680,4.0,262.88,658.35,2020
3813213,2.0,241.64,617.3,2020
6398446,,210240.07,41.32,2020
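
A rough way to make the mismatch between fare and distance concrete is to compute the fare per mile for these rows. The sketch below re-enters the ten rows shown above by hand. The `per_mile` column and the 0.50 dollar threshold are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# The rows with trip_distance > 200 shown above (passenger_count and year omitted)
long_trips = pd.DataFrame({
    'trip_distance': [700.7, 211.36, 201.27, 831.8, 214.01,
                      259.22, 211.7, 262.88, 241.64, 210240.07],
    'total_amount':  [9.0, 56.56, 152.46, 11.76, 761.8,
                      577.8, 620.3, 658.35, 617.3, 41.32],
})

# Fare per mile as a rough consistency check
long_trips['per_mile'] = long_trips['total_amount'] / long_trips['trip_distance']

# Hypothetical threshold: under $0.50 per mile is treated as suspicious
suspicious = long_trips[long_trips['per_mile'] < 0.5]
print(suspicious)
```

Under this assumed threshold, four of the ten rows are flagged, including the 210,240-mile ride that cost 41.32 dollars.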


The correlation should be rechecked using only trips of in-city length.

In [13]:
df[(df['trip_distance'] <= 30) & (df['total_amount'] >= 0)].corr()

Unnamed: 0,passenger_count,trip_distance,total_amount,year
passenger_count,1.0,0.009516,-0.000136,-0.021492
trip_distance,0.009516,1.0,0.062103,0.012671
total_amount,-0.000136,0.062103,1.0,0.007819
year,-0.021492,0.012671,0.007819,1.0


Even so, the correlation between `trip_distance` and `total_amount` is still very low, at about 0.06.
So this time the distribution of `total_amount` needs to be checked.
The 99th percentile is about 74 dollars.

In [14]:
df['total_amount'].describe(percentiles=[0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])

count    1.407280e+07
mean     1.703894e+01
std      1.938733e+02
min     -1.242300e+03
1%       4.800000e+00
5%       6.300000e+00
10%      7.300000e+00
25%      9.360000e+00
50%      1.280000e+01
75%      1.830000e+01
90%      3.030000e+01
95%      4.677000e+01
99%      7.370000e+01
max      6.232617e+05
Name: total_amount, dtype: float64

So the correlation should be rechecked using only rides whose fare is 100 dollars or less.

In [21]:
df_trip = df[(df['trip_distance'] <= 30) & (df['trip_distance'] >= 0)]
df_total = df_trip[(df_trip['total_amount'] <= 100) & (df_trip['total_amount'] >= 0)]

print('Dataset size after filtering:', f'{len(df_total):,}')
print('Correlation coefficient:', round(df_total.corr().loc['trip_distance', 'total_amount'], 3))

Dataset size after filtering: 14,019,206
Correlation coefficient: 0.904


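**Conclusion**

- When trip distances and fares are restricted to ordinary in-city rides, the correlation between `trip_distance` and `total_amount` is very high, at about 0.9.
- Outside that range, data-entry errors and outliers are numerous and distort the correlation.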
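
To illustrate why a few bad rows can matter so much, the sketch below builds synthetic rides (not the taxi data) in which fare is roughly linear in distance, then appends two error-like rows. The first injected row copies the 210,240.07-mile ride seen earlier. The second pairs the maximum `total_amount` from `describe()` with an assumed 2-mile distance. The linear fare model and its coefficients are also assumptions made only for this demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic rides (not the taxi data): fare is roughly linear in distance
n = 10_000
distance = rng.uniform(0.5, 20, size=n)
fare = 3.0 + 2.5 * distance + rng.normal(0, 2, size=n)
rides = pd.DataFrame({'trip_distance': distance, 'total_amount': fare})

clean_r = rides['trip_distance'].corr(rides['total_amount'])

# Two error-like rows: a huge distance with an ordinary fare,
# and an ordinary (assumed) distance with a huge fare
bad = pd.DataFrame({'trip_distance': [210240.07, 2.0],
                    'total_amount':  [41.32, 623261.7]})
dirty = pd.concat([rides, bad], ignore_index=True)

dirty_r = dirty['trip_distance'].corr(dirty['total_amount'])

print(f'clean: {clean_r:.3f}, with outliers: {dirty_r:.3f}')
```

Pearson correlation is driven by squared deviations from the mean, so two extreme rows among ten thousand are enough to pull the coefficient from nearly 1 to nearly 0 in this synthetic setting.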