<font color = "#CC3D3D"><p>
# Topics
* [Apply](#Apply)
* [Manipulating Dates and Times](#Manipulating-Dates-and-Times)
* [Handling Missing Data](#Handling-Missing-Data)

## Apply

<font color = 'blue'>Apply: applies a function to each value in a Series

In [2]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
Series(range(1,5)).apply(np.log)

0    0.000000
1    0.693147
2    1.098612
3    1.386294
dtype: float64

In [9]:
# Each element of the Series is passed in turn to the lambda's parameter x
Series(range(1,5)).apply(lambda x : 1/x)

0    1.000000
1    0.500000
2    0.333333
3    0.250000
dtype: float64

In [11]:
# To pass extra parameters, supply them as a tuple through apply()'s args parameter
Series(range(1,5)).apply(lambda x,y : x+y, args=(3,))

0    4
1    5
2    6
3    7
dtype: int64
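
Extra parameters can also be passed by keyword, since `Series.apply` forwards additional keyword arguments to the function. A minimal sketch follows; the helper name `add` is hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical helper for illustration; not part of the original notebook.
def add(x, y):
    return x + y

s = pd.Series(range(1, 5))

# Positional form: extra values go in the args tuple.
by_args = s.apply(add, args=(3,))

# Keyword form: apply forwards unknown keyword arguments to the function.
by_kwargs = s.apply(add, y=3)

print(by_kwargs.tolist())  # [4, 5, 6, 7]
```

The keyword form can be easier to read when the function takes several extra parameters.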

In [26]:
#Sample Data
sample_df = pd.DataFrame({
        'id': [1,1,1,1,2,2,2],
        'site': ['a','b','c','a','a','b','b'],
        'pageview': np.arange(1,8),
        'dwelltime': np.arange(7.0, 0, -1),
    }, columns=['id','site','pageview','dwelltime'])
sample_df

   id site  pageview  dwelltime
0   1    a         1        7.0
1   1    b         2        6.0
2   1    c         3        5.0
3   1    a         4        4.0
4   2    a         5        3.0
5   2    b         6        2.0
6   2    b         7        1.0


In [28]:
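def normalize(x, lo, hi):
    # Min-max scaling: map x from the range [lo, hi] onto [0, 1]
    return (x - lo) / (hi - lo)

# Named lo/hi so that Python's built-in min() and max() are not shadowed
lo = sample_df['pageview'].min()
hi = sample_df['pageview'].max()
lo, hi

(1, 7)

In [29]:
# apply also works with user-defined functions
sample_df['pageview'].apply(normalize, args=(lo, hi))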

0    0.000000
1    0.166667
2    0.333333
3    0.500000
4    0.666667
5    0.833333
6    1.000000
Name: pageview, dtype: float64
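
For simple arithmetic like this, the same result can usually be computed without `apply` by operating on the whole Series at once. The sketch below rebuilds the `pageview` column so that it runs on its own; the names `via_apply` and `vectorized` are illustrative.

```python
import numpy as np
import pandas as pd

pageview = pd.Series(np.arange(1, 8), name='pageview')
lo, hi = pageview.min(), pageview.max()

# Element-by-element, as with apply.
via_apply = pageview.apply(lambda x: (x - lo) / (hi - lo))

# Vectorized: arithmetic on the Series applies to every element at once.
vectorized = (pageview - lo) / (hi - lo)

print(np.allclose(via_apply, vectorized))  # True
```

The vectorized form tends to be faster on large data because the loop runs inside NumPy rather than in Python.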

## Manipulating Dates and Times

In [30]:
# pd.date_range generates the dates in a given range, so they need not be typed one by one
t = Series(pd.date_range('2021-09-05', periods=7))
# Equivalent to: t = pd.to_datetime(Series(['2021-09-05','2021-09-06','2021-09-07','2021-09-08','2021-09-09','2021-09-10','2021-09-11']))
t

0   2021-09-05
1   2021-09-06
2   2021-09-07
3   2021-09-08
4   2021-09-09
5   2021-09-10
6   2021-09-11
dtype: datetime64[ns]
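
`pd.date_range` also accepts an end date or a step size in place of a plain count. A short sketch of both forms, with illustrative variable names:

```python
import pandas as pd

# The same seven days, given as an inclusive start/end pair.
by_end = pd.date_range(start='2021-09-05', end='2021-09-11')

# Three dates spaced one week apart, using the freq parameter.
weekly = pd.date_range('2021-09-05', periods=3, freq='7D')

print(len(by_end))  # 7
print(weekly.strftime('%Y-%m-%d').tolist())  # ['2021-09-05', '2021-09-12', '2021-09-19']
```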

In [31]:
# To get the date fields of a single element of the Series
t[0].year, t[0].month, t[0].day

(2021, 9, 5)

In [32]:
# To get the date fields of every element of the Series at once
print(t.dt.year)
print(t.dt.month)
print(t.dt.day)

0    2021
1    2021
2    2021
3    2021
4    2021
5    2021
6    2021
dtype: int64
0    9
1    9
2    9
3    9
4    9
5    9
6    9
dtype: int64
0     5
1     6
2     7
3     8
4     9
5    10
6    11
dtype: int64


In [33]:
list(zip(t.dt.year, t.dt.month, t.dt.day))

[(2021, 9, 5),
 (2021, 9, 6),
 (2021, 9, 7),
 (2021, 9, 8),
 (2021, 9, 9),
 (2021, 9, 10),
 (2021, 9, 11)]

<font color = 'blue'>Weekday

In [34]:
# weekday: Monday through Sunday are returned as 0 through 6
t.dt.weekday

0    6
1    0
2    1
3    2
4    3
5    4
6    5
dtype: int64

In [35]:
t.apply(lambda x : x.weekday())

0    6
1    0
2    1
3    2
4    3
5    4
6    5
dtype: int64

In [36]:
# Extract the name of the weekday with day_name()
t.dt.day_name()

0       Sunday
1       Monday
2      Tuesday
3    Wednesday
4     Thursday
5       Friday
6     Saturday
dtype: object

In [42]:
# Convert the weekday names to upper case
t.dt.day_name().str.upper()

0       SUNDAY
1       MONDAY
2      TUESDAY
3    WEDNESDAY
4     THURSDAY
5       FRIDAY
6     SATURDAY
dtype: object

In [38]:
t.dt.day_name().str.upper().str[:3]

0    SUN
1    MON
2    TUE
3    WED
4    THU
5    FRI
6    SAT
dtype: object

In [40]:
# Are the first three letters of the weekday name 'SAT'?
t.dt.day_name().str.upper().str[:3].str.contains('SAT')

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
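
The same question can be answered without string matching. Since `dt.weekday` encodes Monday through Sunday as 0 through 6, Saturday is 5 and the weekend is 5 or 6. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

t = pd.Series(pd.date_range('2021-09-05', periods=7))

# Saturday has weekday number 5 (Monday=0).
is_saturday = t.dt.weekday == 5

# Saturday (5) and Sunday (6) together make the weekend.
is_weekend = t.dt.weekday >= 5

print(is_saturday.tolist())  # [False, False, False, False, False, False, True]
print(is_weekend.tolist())   # [True, False, False, False, False, False, True]
```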

In [43]:
# Display the weekday in Korean
t.apply(lambda x : ('월','화','수','목','금','토','일')[x.weekday()]+'요일')

0    일요일
1    월요일
2    화요일
3    수요일
4    목요일
5    금요일
6    토요일
dtype: object
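
A lookup table combined with `Series.map` is one alternative to indexing a tuple inside a lambda. This is a sketch only; the dictionary name `korean_names` is an assumption introduced for illustration.

```python
import pandas as pd

t = pd.Series(pd.date_range('2021-09-05', periods=7))

# Lookup table from weekday number (Monday=0) to the Korean weekday name.
korean_names = dict(enumerate(
    ['월요일', '화요일', '수요일', '목요일', '금요일', '토요일', '일요일']
))

# map replaces each weekday number with its entry in the table.
korean = t.dt.weekday.map(korean_names)
print(korean.tolist())
```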

In [44]:
sample_df['date'] = Series(['2020-05-20','2020-05-21','2020-05-22','2020-05-23','2020-05-24','2020-05-25','2020-05-26'])
sample_df

   id site  pageview  dwelltime        date
0   1    a         1        7.0  2020-05-20
1   1    b         2        6.0  2020-05-21
2   1    c         3        5.0  2020-05-22
3   1    a         4        4.0  2020-05-23
4   2    a         5        3.0  2020-05-24
5   2    b         6        2.0  2020-05-25
6   2    b         7        1.0  2020-05-26


In [50]:
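# To get the weekday from a column that stores dates as strings, parse it first.
# pd.to_datetime is used because recent pandas versions reject the unit-less 'datetime64' dtype in astype.
sample_df['dayofweek'] = pd.to_datetime(sample_df.date).dt.day_name()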
sample_df

   id site  pageview  dwelltime        date  dayofweek
0   1    a         1        7.0  2020-05-20  Wednesday
1   1    b         2        6.0  2020-05-21   Thursday
2   1    c         3        5.0  2020-05-22     Friday
3   1    a         4        4.0  2020-05-23   Saturday
4   2    a         5        3.0  2020-05-24     Sunday
5   2    b         6        2.0  2020-05-25     Monday
6   2    b         7        1.0  2020-05-26    Tuesday
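
When the string column may contain malformed entries, `pd.to_datetime` can be given an explicit `format` and `errors='coerce'`. The sketch below uses a small made-up Series in which the third value is deliberately invalid.

```python
import pandas as pd

# Illustrative data; the last entry is intentionally not a date.
dates = pd.Series(['2020-05-20', '2020-05-21', 'not a date'])

# An explicit format documents the expected layout; errors='coerce'
# turns unparseable entries into NaT instead of raising an error.
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

print(parsed.isna().tolist())  # [False, False, True]
print(parsed.dropna().dt.day_name().tolist())  # ['Wednesday', 'Thursday']
```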


<font color = blue>Elapsed time

In [51]:
# As of today (2021-09-05), how many days have passed since I was born?
pd.to_datetime('2021-09-05') - pd.to_datetime('2001-08-01')

Timedelta('7340 days 00:00:00')

In [53]:
# Compute the number of days elapsed since 2001-08-12
edays = (t - pd.to_datetime('2001-08-12'))
edays

0   7329 days
1   7330 days
2   7331 days
3   7332 days
4   7333 days
5   7334 days
6   7335 days
dtype: timedelta64[ns]

In [54]:
edays.dt.days  # a Timedelta is the duration between two dates; .dt.days extracts the whole days as integers

0    7329
1    7330
2    7331
3    7332
4    7333
5    7334
6    7335
dtype: int64
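
Elapsed time can also be expressed in units other than whole days. A hedged sketch: dividing a timedelta Series by a `pd.Timedelta` yields a float count of that unit, and `dt.total_seconds()` returns seconds. The variable names are illustrative.

```python
import pandas as pd

t = pd.Series(pd.date_range('2021-09-05', periods=3))
edays = t - pd.to_datetime('2001-08-12')

# Whole days as integers.
days = edays.dt.days

# Dividing by a Timedelta gives a float number of that unit.
weeks = edays / pd.Timedelta(weeks=1)

# Total duration in seconds.
seconds = edays.dt.total_seconds()

print(days.tolist())   # [7329, 7330, 7331]
print(weeks.iloc[0])   # 1047.0
```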

In [55]:
sample_df['elapsed'] = (pd.to_datetime(sample_df.date) - pd.to_datetime('2000-01-01')).dt.days
sample_df

   id site  pageview  dwelltime        date  dayofweek  elapsed
0   1    a         1        7.0  2020-05-20  Wednesday     7445
1   1    b         2        6.0  2020-05-21   Thursday     7446
2   1    c         3        5.0  2020-05-22     Friday     7447
3   1    a         4        4.0  2020-05-23   Saturday     7448
4   2    a         5        3.0  2020-05-24     Sunday     7449
5   2    b         6        2.0  2020-05-25     Monday     7450
6   2    b         7        1.0  2020-05-26    Tuesday     7451


## Handling Missing Data

In [64]:
#Sample Data
df1 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': range(7)})
df2 = DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})
df = pd.merge(df1, df2, left_on='lkey', right_on='rkey', how='outer')
df

  lkey  data1 rkey  data2
0    b    0.0    b    1.0
1    b    1.0    b    1.0
2    b    6.0    b    1.0
3    a    2.0    a    0.0
4    a    4.0    a    0.0
5    a    5.0    a    0.0
6    c    3.0  NaN    NaN
7  NaN    NaN    d    2.0


<font color = 'blue'>Find missing values

In [65]:
# Find NaN values with isnull; the result holds True/False for each cell
pd.isnull(df)

    lkey  data1   rkey  data2
0  False  False  False  False
1  False  False  False  False
2  False  False  False  False
3  False  False  False  False
4  False  False  False  False
5  False  False  False  False
6  False  False   True   True
7   True   True  False  False


In [67]:
pd.notnull(df.data1)  # equivalent to df.data1.notnull()

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7    False
Name: data1, dtype: bool
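
The boolean mask from `isnull` can be summarized rather than read cell by cell. A minimal sketch, which builds the merged frame by hand (an assumption made so the block runs on its own):

```python
import numpy as np
import pandas as pd

# Hand-built copy of the merged frame shown above.
df = pd.DataFrame({
    'lkey': ['b', 'b', 'b', 'a', 'a', 'a', 'c', np.nan],
    'data1': [0, 1, 6, 2, 4, 5, 3, np.nan],
    'rkey': ['b', 'b', 'b', 'a', 'a', 'a', np.nan, 'd'],
    'data2': [1, 1, 1, 0, 0, 0, np.nan, 2],
})

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = df.isnull().sum()

# any(axis=1) flags rows that have at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column.tolist())       # [1, 1, 1, 1]
print(rows_with_missing.index.tolist())  # [6, 7]
```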

<font color = blue>Remove rows with missing values

In [68]:
df

  lkey  data1 rkey  data2
0    b    0.0    b    1.0
1    b    1.0    b    1.0
2    b    6.0    b    1.0
3    a    2.0    a    0.0
4    a    4.0    a    0.0
5    a    5.0    a    0.0
6    c    3.0  NaN    NaN
7  NaN    NaN    d    2.0


In [70]:
# Use dropna to delete rows that contain NaN values
# With how='any', a row is dropped if any one of its columns is NaN
df.dropna(how='any')

  lkey  data1 rkey  data2
0    b    0.0    b    1.0
1    b    1.0    b    1.0
2    b    6.0    b    1.0
3    a    2.0    a    0.0
4    a    4.0    a    0.0
5    a    5.0    a    0.0


In [71]:
# With how='all', a row is dropped only when every column is NaN
df.dropna(how='all')

  lkey  data1 rkey  data2
0    b    0.0    b    1.0
1    b    1.0    b    1.0
2    b    6.0    b    1.0
3    a    2.0    a    0.0
4    a    4.0    a    0.0
5    a    5.0    a    0.0
6    c    3.0  NaN    NaN
7  NaN    NaN    d    2.0


In [73]:
# Create a row in which every column is NaN
df.iloc[6] = [np.nan, np.nan, np.nan, np.nan]
df

  lkey  data1 rkey  data2
0    b    0.0    b    1.0
1    b    1.0    b    1.0
2    b    6.0    b    1.0
3    a    2.0    a    0.0
4    a    4.0    a    0.0
5    a    5.0    a    0.0
6  NaN    NaN  NaN    NaN
7  NaN    NaN    d    2.0


In [76]:
df.dropna(how='all').reset_index(drop=True)

  lkey  data1 rkey  data2
0    b    0.0    b    1.0
1    b    1.0    b    1.0
2    b    6.0    b    1.0
3    a    2.0    a    0.0
4    a    4.0    a    0.0
5    a    5.0    a    0.0
6  NaN    NaN    d    2.0
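
Between `how='any'` and `how='all'`, `dropna` offers finer control through its `subset` and `thresh` parameters. The sketch below rebuilds the frame by hand (an assumption, so that it runs on its own) with row 6 entirely NaN.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the frame after row 6 was set to NaN.
df = pd.DataFrame({
    'lkey': ['b', 'b', 'b', 'a', 'a', 'a', np.nan, np.nan],
    'data1': [0.0, 1.0, 6.0, 2.0, 4.0, 5.0, np.nan, np.nan],
    'rkey': ['b', 'b', 'b', 'a', 'a', 'a', np.nan, 'd'],
    'data2': [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, np.nan, 2.0],
})

# subset: only the listed columns are checked for NaN.
by_subset = df.dropna(subset=['lkey', 'data1'])

# thresh: keep rows that have at least this many non-NaN values.
by_thresh = df.dropna(thresh=2)

print(by_subset.index.tolist())  # [0, 1, 2, 3, 4, 5]
print(by_thresh.index.tolist())  # [0, 1, 2, 3, 4, 5, 7]
```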


<font color = blue>Replace Missing Values

In [78]:
# Fill NaN values with fillna
df.fillna(-1)

  lkey  data1 rkey  data2
0    b    0.0    b    1.0
1    b    1.0    b    1.0
2    b    6.0    b    1.0
3    a    2.0    a    0.0
4    a    4.0    a    0.0
5    a    5.0    a    0.0
6   -1   -1.0   -1   -1.0
7   -1   -1.0    d    2.0


In [79]:
# Passing a dictionary lets you set a different fill value for each column
df.fillna({'data1':1.5, 'data2':0.5, 'lkey':'Y', 'rkey':''})

  lkey  data1 rkey  data2
0    b    0.0    b    1.0
1    b    1.0    b    1.0
2    b    6.0    b    1.0
3    a    2.0    a    0.0
4    a    4.0    a    0.0
5    a    5.0    a    0.0
6    Y    1.5         0.5
7    Y    1.5    d    2.0
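
Fill values need not be constants. As a hedged sketch, a numeric column can be filled with its own mean, and a column can be forward-filled from the previous row with `ffill`. Only the two columns used are rebuilt here, by hand, so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Hand-built copies of two columns from the frame above.
df = pd.DataFrame({
    'data1': [0.0, 1.0, 6.0, 2.0, 4.0, 5.0, np.nan, np.nan],
    'rkey': ['b', 'b', 'b', 'a', 'a', 'a', np.nan, 'd'],
})

# mean() skips NaN by default, so this fills with the mean of the observed values.
data1_filled = df['data1'].fillna(df['data1'].mean())

# ffill copies the last valid value forward into each gap.
rkey_filled = df['rkey'].ffill()

print(data1_filled.tolist()[-2:])  # [3.0, 3.0]
print(rkey_filled.tolist()[-2:])   # ['a', 'd']
```

Whether a mean or a forward fill is appropriate depends on the data; neither is a default recommendation.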
