# Ch.10 Time Series
### 1. Fixed-frequency representation
### 2. Consistent intervals

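A key strength of pandas: it handles both irregular and fixed-frequency time series, and can convert between frequencies with resampling methods.

Commonly used for analyzing financial, economic, and server log data.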

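# 10.1 Date and Time Data Types and Tools
The Python standard library provides date, time, and calendar functionality.

datetime, time, and calendar modules: built-in modules specialized for dates and times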

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings(action='ignore')

In [2]:
from datetime import datetime

In [3]:
now = datetime.now()
now

datetime.datetime(2020, 4, 1, 12, 44, 28, 765151)

In [4]:
now.year, now.month, now.day

(2020, 4, 1)

datetime.timedelta: the time difference between two datetime objects

In [5]:
delta = datetime(2011, 1, 7) - datetime(2008,6,24,8,15)
delta

datetime.timedelta(days=926, seconds=56700)

In [6]:
delta.days

926

In [7]:
delta.seconds

56700
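
Note that `seconds` holds only the part of the difference left over after the whole days, not the entire span. A minimal sketch of getting the full duration with `total_seconds()`, reusing the same two dates as above:

```python
from datetime import datetime

delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)

# days and seconds are separate components of the same difference
print(delta.days, delta.seconds)   # 926 56700

# total_seconds() folds both components into a single number
total = delta.total_seconds()
print(total)                       # 80063100.0
```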

datetime.timedelta: adding or subtracting timedelta(n) produces a new, shifted datetime object.
timedelta(n): the first positional argument is in days.

In [8]:
from datetime import timedelta

In [9]:
start = datetime(2011,1,7)

In [10]:
start + timedelta(12)

datetime.datetime(2011, 1, 19, 0, 0)

In [11]:
start - 2*timedelta(12)
# 24 days earlier

datetime.datetime(2010, 12, 14, 0, 0)

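### datetime module data types
1. date : stores a calendar date (year, month, day).
2. time : stores a time of day as hours, minutes, seconds, and microseconds.
3. datetime : stores both the date and the time.
4. timedelta : represents the difference between two datetime values (as days, seconds, and microseconds).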
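
The four types can be seen side by side in a short sketch. The variable names (`d`, `t`, `dt`, `gap`) and the dates are illustrative only:

```python
from datetime import date, time, datetime, timedelta

d = date(2020, 4, 1)               # calendar date only
t = time(12, 44, 28)               # time of day only
dt = datetime.combine(d, t)        # date and time together
gap = dt - datetime(2020, 3, 31)   # difference between two datetimes

print(dt)    # 2020-04-01 12:44:28
print(gap)   # 1 day, 12:44:28
```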

# 10.1.1 Converting Strings to datetime

When external data is loaded with pandas, a column that holds dates is often stored as strings, so it has to be converted to datetime objects.
- datetime -> string : str(datetime_object)
- string -> datetime : datetime.strptime(date_string, format)

Format codes
- %Y : four-digit year
- %y : two-digit year
- %m : month [01, 12]
- %d : day [01, 31]
- %H : hour, 24-hour clock [00, 23]
- %I : hour, 12-hour clock [01, 12]
- %M : minute
- %S : second
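
The time-of-day codes are not used in the cells that follow, so here is a small sketch of them. The timestamp string is made up for illustration:

```python
from datetime import datetime

# %H is the 24-hour clock; %M and %S are minutes and seconds
dt = datetime.strptime('2018-03-27 22:45:10', '%Y-%m-%d %H:%M:%S')

# %I is the 12-hour clock, so the same moment is formatted as hour 10
twelve_hour = dt.strftime('%y/%m/%d %I:%M:%S')
print(twelve_hour)   # 18/03/27 10:45:10
```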

In [12]:
stamp = datetime(2018, 3, 27)
stamp

datetime.datetime(2018, 3, 27, 0, 0)

In [13]:
str(stamp)

'2018-03-27 00:00:00'

In [14]:
#str -> datetime
datetime.strptime('18-01-03', '%y-%m-%d')

datetime.datetime(2018, 1, 3, 0, 0)

In [15]:
#datetime -> str
stamp.strftime('%y-%m-%d')

'18-03-27'

In [16]:
stamp.strftime('%Y-%m-%d')

'2018-03-27'

In [17]:
stamp.strftime('%d-%m-%Y')

'27-03-2018'

In [18]:
value = '2018-03-27'
type(value)

str

In [19]:
datetime.strptime(value, '%Y-%m-%d')

datetime.datetime(2018, 3, 27, 0, 0)

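Writing out a fixed format string every time is tedious.  
It is more convenient to parse the date notations that most people would recognize.

A more flexible parser that copes with many formats:
- dateutil.parser.parse('date string', dayfirst=True/False)

Pros : useful for converting non-standardized date strings into datetime objects.  
Cons : it sometimes converts strings that do not represent a date into a datetime as well.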
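
The drawback can be seen with a bare number. In the sketch below, a fixed `default` is passed (chosen here only to make the result reproducible) so that the missing month and day come from it rather than from today's date:

```python
from datetime import datetime
from dateutil.parser import parse

# '42' is not really a date, but the parser accepts it as the year 2042;
# the month and day are filled in from `default`
guessed = parse('42', default=datetime(2000, 1, 1))
print(guessed)
```

When the input format is known in advance, `datetime.strptime` with an explicit format is the stricter choice.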

In [20]:
from dateutil.parser import parse

In [21]:
parse('2011/03/14')

datetime.datetime(2011, 3, 14, 0, 0)

In [22]:
parse('2013-2-23')

datetime.datetime(2013, 2, 23, 0, 0)

In [23]:
parse('Jan 31, 1997 10:45 pm')

datetime.datetime(1997, 1, 31, 22, 45)

In [24]:
parse('7/2/1998')
# intended as February 7th, but it is parsed month-first (July 2nd) by default

datetime.datetime(1998, 7, 2, 0, 0)

In [25]:
parse('7/2/1998', dayfirst=True)

datetime.datetime(1998, 2, 7, 0, 0)

In [26]:
notnorm_dt = ['2017/3/24 00:23', '13:00 23-4-2000', '15.3.2018 18:00']

for i in notnorm_dt:
    print(parse(i))

2017-03-24 00:23:00
2000-04-23 13:00:00
2018-03-15 18:00:00


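## The pandas to_datetime method
- pd.to_datetime() : parses many kinds of date notation; a list of strings comes back as a DatetimeIndex  
You can think of it as an index-producing version of dateutil.parser.parse().

A date-like column of a DataFrame can also be passed as the argument.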
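
A minimal sketch of the DataFrame case. The frame `df` and its column names are hypothetical:

```python
import pandas as pd

# Hypothetical frame whose 'date' column arrived as plain strings
df = pd.DataFrame({'date': ['2020-01-03', '2020-02-14', '2020-03-27'],
                   'value': [10, 20, 30]})

# Passing the column returns a Series of datetime64 values
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# The .dt accessor exposes datetime fields once the column is converted
print(df['date'].dt.month.tolist())   # [1, 2, 3]
```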

In [27]:
datestrs = ['7-6-2011 13:00', '8/6/2011']

In [28]:
pd.to_datetime(datestrs)

DatetimeIndex(['2011-07-06 13:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)

In [29]:
# Unknown dates (None) are handled too; they become NaT.
idx = pd.to_datetime(datestrs +[None])
idx

DatetimeIndex(['2011-07-06 13:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)

In [30]:
pd.isnull(idx)

array([False, False,  True])

# 10.2 Time Series Basics
The most basic kind of time series: a Series indexed by timestamps

In [31]:
dates = [datetime(2011,1,2), datetime(2011,1,5), datetime(2011,1,7),
        datetime(2011,1,8), datetime(2011,1,10), datetime(2011,1,12)]

In [32]:
ts = pd.Series(np.random.randn(6), index=dates)
ts

2011-01-02   -0.274074
2011-01-05    0.924186
2011-01-07    0.167108
2011-01-08   -1.720296
2011-01-10   -1.117970
2011-01-12   -0.946407
dtype: float64

In [33]:
ts[::2]

2011-01-02   -0.274074
2011-01-07    0.167108
2011-01-10   -1.117970
dtype: float64

In [34]:
ts + ts[::2]

2011-01-02   -0.548147
2011-01-05         NaN
2011-01-07    0.334216
2011-01-08         NaN
2011-01-10   -2.235940
2011-01-12         NaN
dtype: float64

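## Building a regular date index
- pd.date_range('date string', periods = n (number of dates to generate), freq=)

###### freq options
- 'D' : daily (default)
- 'W' : weekly; starts at the first Sunday on or after the given date and includes Sundays only
- 'M' : monthly; the last day of each month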

In [35]:
pd.date_range('5/1/2020', periods = 100, freq = 'W')

DatetimeIndex(['2020-05-03', '2020-05-10', '2020-05-17', '2020-05-24',
               '2020-05-31', '2020-06-07', '2020-06-14', '2020-06-21',
               '2020-06-28', '2020-07-05', '2020-07-12', '2020-07-19',
               '2020-07-26', '2020-08-02', '2020-08-09', '2020-08-16',
               '2020-08-23', '2020-08-30', '2020-09-06', '2020-09-13',
               '2020-09-20', '2020-09-27', '2020-10-04', '2020-10-11',
               '2020-10-18', '2020-10-25', '2020-11-01', '2020-11-08',
               '2020-11-15', '2020-11-22', '2020-11-29', '2020-12-06',
               '2020-12-13', '2020-12-20', '2020-12-27', '2021-01-03',
               '2021-01-10', '2021-01-17', '2021-01-24', '2021-01-31',
               '2021-02-07', '2021-02-14', '2021-02-21', '2021-02-28',
               '2021-03-07', '2021-03-14', '2021-03-21', '2021-03-28',
               '2021-04-04', '2021-04-11', '2021-04-18', '2021-04-25',
               '2021-05-02', '2021-05-09', '2021-05-16', '2021-05-23',
      

In [36]:
pd.date_range('5/1/2020', periods = 100, freq = 'W-TUE')
# starts at the first matching weekday on or after the given date and includes only that weekday

DatetimeIndex(['2020-05-05', '2020-05-12', '2020-05-19', '2020-05-26',
               '2020-06-02', '2020-06-09', '2020-06-16', '2020-06-23',
               '2020-06-30', '2020-07-07', '2020-07-14', '2020-07-21',
               '2020-07-28', '2020-08-04', '2020-08-11', '2020-08-18',
               '2020-08-25', '2020-09-01', '2020-09-08', '2020-09-15',
               '2020-09-22', '2020-09-29', '2020-10-06', '2020-10-13',
               '2020-10-20', '2020-10-27', '2020-11-03', '2020-11-10',
               '2020-11-17', '2020-11-24', '2020-12-01', '2020-12-08',
               '2020-12-15', '2020-12-22', '2020-12-29', '2021-01-05',
               '2021-01-12', '2021-01-19', '2021-01-26', '2021-02-02',
               '2021-02-09', '2021-02-16', '2021-02-23', '2021-03-02',
               '2021-03-09', '2021-03-16', '2021-03-23', '2021-03-30',
               '2021-04-06', '2021-04-13', '2021-04-20', '2021-04-27',
               '2021-05-04', '2021-05-11', '2021-05-18', '2021-05-25',
      

In [37]:
pd.date_range('5/1/2020', periods = 50, freq ='M')
# last day of each month

DatetimeIndex(['2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
               '2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31',
               '2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30',
               '2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31',
               '2021-09-30', '2021-10-31', '2021-11-30', '2021-12-31',
               '2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30',
               '2022-05-31', '2022-06-30', '2022-07-31', '2022-08-31',
               '2022-09-30', '2022-10-31', '2022-11-30', '2022-12-31',
               '2023-01-31', '2023-02-28', '2023-03-31', '2023-04-30',
               '2023-05-31', '2023-06-30', '2023-07-31', '2023-08-31',
               '2023-09-30', '2023-10-31', '2023-11-30', '2023-12-31',
               '2024-01-31', '2024-02-29', '2024-03-31', '2024-04-30',
               '2024-05-31', '2024-06-30'],
              dtype='datetime64[ns]', freq='M')

# 10.2.1 Indexing, Selection, Subsetting
A time series is a Series with a DatetimeIndex, so indexing works the same as for any other Series.

In [38]:
ts

2011-01-02   -0.274074
2011-01-05    0.924186
2011-01-07    0.167108
2011-01-08   -1.720296
2011-01-10   -1.117970
2011-01-12   -0.946407
dtype: float64

In [39]:
stmp = ts.index[2]

In [40]:
ts[stmp]

0.16710801051526247

In [41]:
ts['1/10/2011']

-1.117970171206535

In [42]:
ts['20110112']

-0.946406730697485

For a longer time series, passing just a year or a year-month string selects the matching slice.

In [43]:
longer_ts = pd.Series(np.random.randint(1000, size=600),
                    index = pd.date_range('1/1/2020', periods = 600))
longer_ts

2020-01-01    135
2020-01-02    117
2020-01-03    338
2020-01-04    690
2020-01-05     75
             ... 
2021-08-18    296
2021-08-19    606
2021-08-20    322
2021-08-21    796
2021-08-22    109
Freq: D, Length: 600, dtype: int32

In [44]:
print('start_time : {0}\nend_time : {1}'.format(longer_ts.index[0], longer_ts.index[-1]))

start_time : 2020-01-01 00:00:00
end_time : 2021-08-22 00:00:00


In [45]:
longer_ts.index

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10',
               ...
               '2021-08-13', '2021-08-14', '2021-08-15', '2021-08-16',
               '2021-08-17', '2021-08-18', '2021-08-19', '2021-08-20',
               '2021-08-21', '2021-08-22'],
              dtype='datetime64[ns]', length=600, freq='D')

In [46]:
longer_ts['2020']

2020-01-01    135
2020-01-02    117
2020-01-03    338
2020-01-04    690
2020-01-05     75
             ... 
2020-12-27    269
2020-12-28    378
2020-12-29    531
2020-12-30    101
2020-12-31    416
Freq: D, Length: 366, dtype: int32

In [47]:
longer_ts['2020-03'].head(10)

2020-03-01     59
2020-03-02    713
2020-03-03    867
2020-03-04    609
2020-03-05    728
2020-03-06    776
2020-03-07    332
2020-03-08     87
2020-03-09    566
2020-03-10    842
Freq: D, dtype: int32

In [48]:
# slicing works the same as for a Series
longer_ts[datetime(2021,1,7):].head(10)

2021-01-07    260
2021-01-08    422
2021-01-09    738
2021-01-10    125
2021-01-11     69
2021-01-12    868
2021-01-13    359
2021-01-14    209
2021-01-15     94
2021-01-16    587
Freq: D, dtype: int32

In [49]:
longer_ts['2020-01-03':'2020/10/5']

2020-01-03    338
2020-01-04    690
2020-01-05     75
2020-01-06    628
2020-01-07    441
             ... 
2020-10-01     93
2020-10-02    551
2020-10-03    677
2020-10-04    967
2020-10-05    342
Freq: D, Length: 277, dtype: int32

- truncate() : trims a time series to the range between two dates (before= and after=).

The same applies to a DataFrame; the selection is made on the index.

In [50]:
ts

2011-01-02   -0.274074
2011-01-05    0.924186
2011-01-07    0.167108
2011-01-08   -1.720296
2011-01-10   -1.117970
2011-01-12   -0.946407
dtype: float64

In [51]:
ts.truncate(before = '1/8/2011')
# drops the data before `before`,
# keeping everything from that date onward

2011-01-08   -1.720296
2011-01-10   -1.117970
2011-01-12   -0.946407
dtype: float64

In [52]:
ts.truncate(after = '1/8/2011')
# drops the data after `after`,
# keeping everything up to and including that date

2011-01-02   -0.274074
2011-01-05    0.924186
2011-01-07    0.167108
2011-01-08   -1.720296
dtype: float64

In [53]:
dates_pd = pd.date_range('1/1/2020', periods = 100, freq = 'W-WED')
dates_pd

DatetimeIndex(['2020-01-01', '2020-01-08', '2020-01-15', '2020-01-22',
               '2020-01-29', '2020-02-05', '2020-02-12', '2020-02-19',
               '2020-02-26', '2020-03-04', '2020-03-11', '2020-03-18',
               '2020-03-25', '2020-04-01', '2020-04-08', '2020-04-15',
               '2020-04-22', '2020-04-29', '2020-05-06', '2020-05-13',
               '2020-05-20', '2020-05-27', '2020-06-03', '2020-06-10',
               '2020-06-17', '2020-06-24', '2020-07-01', '2020-07-08',
               '2020-07-15', '2020-07-22', '2020-07-29', '2020-08-05',
               '2020-08-12', '2020-08-19', '2020-08-26', '2020-09-02',
               '2020-09-09', '2020-09-16', '2020-09-23', '2020-09-30',
               '2020-10-07', '2020-10-14', '2020-10-21', '2020-10-28',
               '2020-11-04', '2020-11-11', '2020-11-18', '2020-11-25',
               '2020-12-02', '2020-12-09', '2020-12-16', '2020-12-23',
               '2020-12-30', '2021-01-06', '2021-01-13', '2021-01-20',
      

In [54]:
long_df = pd.DataFrame(np.random.randint(100, size = 400).reshape(100,4),
                      index = dates_pd,
                      columns = ['colorado', 'texas', 'newyork', 'ohio'])

In [55]:
long_df.head(10)

            colorado  texas  newyork  ohio
2020-01-01        79     46        1    72
2020-01-08        56     10       33    49
2020-01-15         2      8       77    11
2020-01-22        41     35       21    69
2020-01-29        17      9       35    46
2020-02-05         0     38        8    82
2020-02-12        86     33       46    99
2020-02-19        78      0       25    81
2020-02-26        61     63       56    83
2020-03-04         6     20       97     7


In [56]:
# select rows by partial date string; .loc makes the row lookup explicit
long_df.loc['5-2020']

            colorado  texas  newyork  ohio
2020-05-06        99     57       47    67
2020-05-13        15     67        0    63
2020-05-20        25     66       42    16
2020-05-27        82     83       15    41


In [57]:
long_df.loc['2021'].head(10)

            colorado  texas  newyork  ohio
2021-01-06        68     59        7    26
2021-01-13        93     37       33    62
2021-01-20         8     66       23    95
2021-01-27        12     73       54    20
2021-02-03        57     50       11    77
2021-02-10        10     98       92    39
2021-02-17        26     68       27    14
2021-02-24        60     37        7    42
2021-03-03        54     86        3    12
2021-03-10        70     54       88    93
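
truncate() works on a DataFrame in the same way. A minimal sketch, using a small hypothetical frame (`frame` and `window` are illustrative names) and passing `before` and `after` together:

```python
import numpy as np
import pandas as pd

# Hypothetical weekly frame, similar in shape to long_df
idx = pd.date_range('1/1/2020', periods=10, freq='W-WED')
frame = pd.DataFrame(np.arange(20).reshape(10, 2),
                     index=idx, columns=['colorado', 'texas'])

# Keep only the rows from 2020-01-15 through 2020-02-05 (both ends inclusive)
window = frame.truncate(before='2020-01-15', after='2020-02-05')
print(window)
```

For a sorted index this selects the same rows as the label slice `frame.loc['2020-01-15':'2020-02-05']`.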


# 10.2.2 Time Series with Duplicate Indices
Several observations may fall on the same timestamp,  
that is, different data points are recorded under the same date.

In [58]:
dates_dup = pd.DatetimeIndex(['1/1/2020'] + ['1/2/2020']*3 + ['1/3/2020'])

In [59]:
dup_ts = pd.Series(np.arange(5), index=dates_dup)
dup_ts

2020-01-01    0
2020-01-02    1
2020-01-02    2
2020-01-02    3
2020-01-03    4
dtype: int32

In [60]:
dup_ts.index.is_unique

False

In [61]:
dup_ts.index.duplicated()

array([False, False,  True,  True, False])

In [62]:
dup_ts['1/2/2020']

2020-01-02    1
2020-01-02    2
2020-01-02    3
dtype: int32

In [63]:
dup_ts.groupby(level=0).mean()

2020-01-01    0
2020-01-02    2
2020-01-03    4
dtype: int32

In [64]:
dup_ts.groupby(dup_ts.index).count()

2020-01-01    1
2020-01-02    3
2020-01-03    1
dtype: int64
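
Besides aggregating with groupby, the duplicated() mask can be used to keep a single row per timestamp. A small sketch that rebuilds the same series and keeps the first value for each date (`first_only` is an illustrative name):

```python
import numpy as np
import pandas as pd

dates_dup = pd.DatetimeIndex(['1/1/2020'] + ['1/2/2020'] * 3 + ['1/3/2020'])
dup_ts = pd.Series(np.arange(5), index=dates_dup)

# duplicated() flags every repeat after the first occurrence,
# so negating the mask keeps the first value for each timestamp
first_only = dup_ts[~dup_ts.index.duplicated(keep='first')]
print(first_only.tolist())   # [0, 1, 4]
```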

# 10.3.1 Generating Date Ranges
- pd.date_range : generates dates that follow a regular rule, much like range or np.arange.
    + pd.date_range(start= ,end= , periods= , freq= , normalize=True/False)
    + creates a DatetimeIndex of length periods at the given frequency (freq)

In [65]:
index = pd.date_range('4/1/2020', '6/1/2020')
index

DatetimeIndex(['2020-04-01', '2020-04-02', '2020-04-03', '2020-04-04',
               '2020-04-05', '2020-04-06', '2020-04-07', '2020-04-08',
               '2020-04-09', '2020-04-10', '2020-04-11', '2020-04-12',
               '2020-04-13', '2020-04-14', '2020-04-15', '2020-04-16',
               '2020-04-17', '2020-04-18', '2020-04-19', '2020-04-20',
               '2020-04-21', '2020-04-22', '2020-04-23', '2020-04-24',
               '2020-04-25', '2020-04-26', '2020-04-27', '2020-04-28',
               '2020-04-29', '2020-04-30', '2020-05-01', '2020-05-02',
               '2020-05-03', '2020-05-04', '2020-05-05', '2020-05-06',
               '2020-05-07', '2020-05-08', '2020-05-09', '2020-05-10',
               '2020-05-11', '2020-05-12', '2020-05-13', '2020-05-14',
               '2020-05-15', '2020-05-16', '2020-05-17', '2020-05-18',
               '2020-05-19', '2020-05-20', '2020-05-21', '2020-05-22',
               '2020-05-23', '2020-05-24', '2020-05-25', '2020-05-26',
      

In [66]:
pd.date_range(start='1/1/2020', periods=20)

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10', '2020-01-11', '2020-01-12',
               '2020-01-13', '2020-01-14', '2020-01-15', '2020-01-16',
               '2020-01-17', '2020-01-18', '2020-01-19', '2020-01-20'],
              dtype='datetime64[ns]', freq='D')

In [67]:
pd.date_range(end = '1/1/2020', periods=20)

DatetimeIndex(['2019-12-13', '2019-12-14', '2019-12-15', '2019-12-16',
               '2019-12-17', '2019-12-18', '2019-12-19', '2019-12-20',
               '2019-12-21', '2019-12-22', '2019-12-23', '2019-12-24',
               '2019-12-25', '2019-12-26', '2019-12-27', '2019-12-28',
               '2019-12-29', '2019-12-30', '2019-12-31', '2020-01-01'],
              dtype='datetime64[ns]', freq='D')

In [68]:
pd.date_range('1/1/2020 12:23:54', periods = 20)

DatetimeIndex(['2020-01-01 12:23:54', '2020-01-02 12:23:54',
               '2020-01-03 12:23:54', '2020-01-04 12:23:54',
               '2020-01-05 12:23:54', '2020-01-06 12:23:54',
               '2020-01-07 12:23:54', '2020-01-08 12:23:54',
               '2020-01-09 12:23:54', '2020-01-10 12:23:54',
               '2020-01-11 12:23:54', '2020-01-12 12:23:54',
               '2020-01-13 12:23:54', '2020-01-14 12:23:54',
               '2020-01-15 12:23:54', '2020-01-16 12:23:54',
               '2020-01-17 12:23:54', '2020-01-18 12:23:54',
               '2020-01-19 12:23:54', '2020-01-20 12:23:54'],
              dtype='datetime64[ns]', freq='D')

In [69]:
# pass normalize=True to drop the time of day (timestamps are normalized to midnight)
pd.date_range('1/1/2020 12:23:54', '3/1/2020 23:23:08', normalize=True)

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10', '2020-01-11', '2020-01-12',
               '2020-01-13', '2020-01-14', '2020-01-15', '2020-01-16',
               '2020-01-17', '2020-01-18', '2020-01-19', '2020-01-20',
               '2020-01-21', '2020-01-22', '2020-01-23', '2020-01-24',
               '2020-01-25', '2020-01-26', '2020-01-27', '2020-01-28',
               '2020-01-29', '2020-01-30', '2020-01-31', '2020-02-01',
               '2020-02-02', '2020-02-03', '2020-02-04', '2020-02-05',
               '2020-02-06', '2020-02-07', '2020-02-08', '2020-02-09',
               '2020-02-10', '2020-02-11', '2020-02-12', '2020-02-13',
               '2020-02-14', '2020-02-15', '2020-02-16', '2020-02-17',
               '2020-02-18', '2020-02-19', '2020-02-20', '2020-02-21',
               '2020-02-22', '2020-02-23', '2020-02-24', '2020-02-25',
      

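# 10.3.2 Frequencies and Date Offsets
A frequency in pandas is a combination of a base frequency and a multiplier.
- Base frequencies : 'M' (monthly), 'H' (hourly)
- Date offsets : similar to timedelta in the datetime module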

In [70]:
from pandas.tseries.offsets import Hour, Minute

In [71]:
hour = Hour()

In [72]:
four_hour = Hour(4)
four_hour

<4 * Hours>

In [73]:
Hour(4) + Minute(20)

<260 * Minutes>

In [74]:
pd.date_range('1/1/2020', '1/3/2020 23:59:59', freq = '4h')

DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 04:00:00',
               '2020-01-01 08:00:00', '2020-01-01 12:00:00',
               '2020-01-01 16:00:00', '2020-01-01 20:00:00',
               '2020-01-02 00:00:00', '2020-01-02 04:00:00',
               '2020-01-02 08:00:00', '2020-01-02 12:00:00',
               '2020-01-02 16:00:00', '2020-01-02 20:00:00',
               '2020-01-03 00:00:00', '2020-01-03 04:00:00',
               '2020-01-03 08:00:00', '2020-01-03 12:00:00',
               '2020-01-03 16:00:00', '2020-01-03 20:00:00'],
              dtype='datetime64[ns]', freq='4H')

In [75]:
pd.date_range('1/1/2020', periods =20 , freq ='2h30min')

DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 02:30:00',
               '2020-01-01 05:00:00', '2020-01-01 07:30:00',
               '2020-01-01 10:00:00', '2020-01-01 12:30:00',
               '2020-01-01 15:00:00', '2020-01-01 17:30:00',
               '2020-01-01 20:00:00', '2020-01-01 22:30:00',
               '2020-01-02 01:00:00', '2020-01-02 03:30:00',
               '2020-01-02 06:00:00', '2020-01-02 08:30:00',
               '2020-01-02 11:00:00', '2020-01-02 13:30:00',
               '2020-01-02 16:00:00', '2020-01-02 18:30:00',
               '2020-01-02 21:00:00', '2020-01-02 23:30:00'],
              dtype='datetime64[ns]', freq='150T')

Week of month  
- freq = 'WOM-3FRI' : week of month, the third Friday of each month

In [76]:
rng = pd.date_range('1/1/2020', periods = 4, freq = 'WOM-2TUE')

In [77]:
rng

DatetimeIndex(['2020-01-14', '2020-02-11', '2020-03-10', '2020-04-14'], dtype='datetime64[ns]', freq='WOM-2TUE')

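# 10.3.3 Shifting Data
Shifting moves data backward or forward along the time axis.  
shift() is a method of both Series and DataFrame.  
By default it moves the data while leaving the timestamp index unchanged.  

It is commonly used to compute percent changes in one or more time series.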

In [78]:
np.random.seed(5)
ts_shift = pd.Series(np.random.randint(10, size=10), 
                    index = pd.date_range('1/1/2020', periods=10))

In [79]:
ts_shift

2020-01-01    3
2020-01-02    6
2020-01-03    6
2020-01-04    0
2020-01-05    9
2020-01-06    8
2020-01-07    4
2020-01-08    7
2020-01-09    0
2020-01-10    0
Freq: D, dtype: int32

In [80]:
ts_shift.shift(-2)
# lead: each value moves two positions earlier

2020-01-01    6.0
2020-01-02    0.0
2020-01-03    9.0
2020-01-04    8.0
2020-01-05    4.0
2020-01-06    7.0
2020-01-07    0.0
2020-01-08    0.0
2020-01-09    NaN
2020-01-10    NaN
Freq: D, dtype: float64

In [81]:
ts_shift.shift(3)
# lag: each value moves three positions later

2020-01-01    NaN
2020-01-02    NaN
2020-01-03    NaN
2020-01-04    3.0
2020-01-05    6.0
2020-01-06    6.0
2020-01-07    0.0
2020-01-08    9.0
2020-01-09    8.0
2020-01-10    4.0
Freq: D, dtype: float64

In [82]:
ts_shift

2020-01-01    3
2020-01-02    6
2020-01-03    6
2020-01-04    0
2020-01-05    9
2020-01-06    8
2020-01-07    4
2020-01-08    7
2020-01-09    0
2020-01-10    0
Freq: D, dtype: int32

In [83]:
ts_shift / ts_shift.shift(1) -1

2020-01-01         NaN
2020-01-02    1.000000
2020-01-03    0.000000
2020-01-04   -1.000000
2020-01-05         inf
2020-01-06   -0.111111
2020-01-07   -0.500000
2020-01-08    0.750000
2020-01-09   -1.000000
2020-01-10         NaN
Freq: D, dtype: float64
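
The shift-based ratio is what `pct_change()` computes. A minimal sketch on a hypothetical price series with no zeros (so no `inf` appears), comparing the two:

```python
import numpy as np
import pandas as pd

# Hypothetical prices with no zeros, so no division by zero occurs
prices = pd.Series([100.0, 110.0, 99.0, 99.0],
                   index=pd.date_range('1/1/2020', periods=4))

manual = prices / prices.shift(1) - 1   # shift-based percent change
builtin = prices.pct_change()           # pandas shortcut for the same ratio

print(np.allclose(manual, builtin, equal_nan=True))
```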

In [84]:
ts_shift.shift(3)

2020-01-01    NaN
2020-01-02    NaN
2020-01-03    NaN
2020-01-04    3.0
2020-01-05    6.0
2020-01-06    6.0
2020-01-07    0.0
2020-01-08    9.0
2020-01-09    8.0
2020-01-10    4.0
Freq: D, dtype: float64

In [85]:
# Passing freq shifts the timestamps themselves instead of the data.
# The series now starts at 01-04 and runs on through 01-11, 01-12, and 01-13.
ts_shift.shift(3, freq ='D')

2020-01-04    3
2020-01-05    6
2020-01-06    6
2020-01-07    0
2020-01-08    9
2020-01-09    8
2020-01-10    4
2020-01-11    7
2020-01-12    0
2020-01-13    0
Freq: D, dtype: int32

In [86]:
ts_shift.shift(3, freq='M')

2020-03-31    3
2020-03-31    6
2020-03-31    6
2020-03-31    0
2020-03-31    9
2020-03-31    8
2020-03-31    4
2020-03-31    7
2020-03-31    0
2020-03-31    0
Freq: D, dtype: int32

## Shifting dates by an offset
Date offsets can be applied to both datetime and Timestamp objects.

In [87]:
from pandas.tseries.offsets import Day, MonthEnd

In [88]:
now = datetime(2011,11,17)
now + 3*Day()

Timestamp('2011-11-20 00:00:00')

In [89]:
now + MonthEnd()

Timestamp('2011-11-30 00:00:00')

In [90]:
now + MonthEnd(2)

Timestamp('2011-12-31 00:00:00')

In [91]:
now + MonthEnd(-1)

Timestamp('2011-10-31 00:00:00')

In [92]:
MonthEnd().rollback(now)  # rolls back to the most recent month end (here, the last day of the previous month)

Timestamp('2011-10-31 00:00:00')

In [93]:
MonthEnd().rollforward(now)  # rolls forward to the next month end (here, the last day of the same month)

Timestamp('2011-11-30 00:00:00')

Using offsets together with groupby

In [94]:
ts_off = pd.Series(np.random.randint(20, size=20), 
                  index = pd.date_range('1/1/2020', periods=20, freq = '4d'))
ts_off

2020-01-01     7
2020-01-05    12
2020-01-09    15
2020-01-13    17
2020-01-17     7
2020-01-21    16
2020-01-25    12
2020-01-29    13
2020-02-02    11
2020-02-06     1
2020-02-10    15
2020-02-14    18
2020-02-18     9
2020-02-22    10
2020-02-26     9
2020-03-01     9
2020-03-05     1
2020-03-09    18
2020-03-13     7
2020-03-17    16
Freq: 4D, dtype: int32
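
One way to combine the two is to pass an offset's `rollforward` method as the groupby key. The sketch below uses a deterministic stand-in for the series above (`ts_demo`, filled with 0..19 instead of random values) so the result can be checked by hand:

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Deterministic stand-in for ts_off: 20 values, one every 4 days
ts_demo = pd.Series(np.arange(20),
                    index=pd.date_range('1/1/2020', periods=20, freq='4D'))

# rollforward maps each timestamp to the end of its month,
# and groupby uses that mapped date as the group key
monthly_mean = ts_demo.groupby(MonthEnd().rollforward).mean()
print(monthly_mean)
```

January holds the values 0..7, February 8..14, and March 15..19, so the three group means are 3.5, 11.0, and 17.0.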

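# 10.4 Time Zone Handling
Daylight saving time (DST)  
Local time in each country or region is expressed as an offset from UTC
- pytz : a third-party library that holds time zone information for the whole world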

In [95]:
import pytz

In [96]:
pytz.common_timezones[-5:]

['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']

In [97]:
pytz.timezone('US/Hawaii')

<DstTzInfo 'US/Hawaii' LMT-1 day, 13:29:00 STD>

# 10.4.1 Localization and Conversion
By default, time series in pandas are time zone naive.  
Without localization, regional differences in timekeeping (e.g. summer time) are not taken into account. 

In [98]:
ts

2011-01-02   -0.274074
2011-01-05    0.924186
2011-01-07    0.167108
2011-01-08   -1.720296
2011-01-10   -1.117970
2011-01-12   -0.946407
dtype: float64

In [100]:
pd.date_range('1/1/2020', periods = 20, freq='M', tz='UTC')

DatetimeIndex(['2020-01-31 00:00:00+00:00', '2020-02-29 00:00:00+00:00',
               '2020-03-31 00:00:00+00:00', '2020-04-30 00:00:00+00:00',
               '2020-05-31 00:00:00+00:00', '2020-06-30 00:00:00+00:00',
               '2020-07-31 00:00:00+00:00', '2020-08-31 00:00:00+00:00',
               '2020-09-30 00:00:00+00:00', '2020-10-31 00:00:00+00:00',
               '2020-11-30 00:00:00+00:00', '2020-12-31 00:00:00+00:00',
               '2021-01-31 00:00:00+00:00', '2021-02-28 00:00:00+00:00',
               '2021-03-31 00:00:00+00:00', '2021-04-30 00:00:00+00:00',
               '2021-05-31 00:00:00+00:00', '2021-06-30 00:00:00+00:00',
               '2021-07-31 00:00:00+00:00', '2021-08-31 00:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='M')

In [102]:
# tz_localize('UTC') attaches the UTC time zone to the naive index
ts_utc = ts.tz_localize('UTC')
ts_utc

2011-01-02 00:00:00+00:00   -0.274074
2011-01-05 00:00:00+00:00    0.924186
2011-01-07 00:00:00+00:00    0.167108
2011-01-08 00:00:00+00:00   -1.720296
2011-01-10 00:00:00+00:00   -1.117970
2011-01-12 00:00:00+00:00   -0.946407
dtype: float64

In [103]:
# once localized, tz_convert can change the series to another time zone
ts_utc.tz_convert('US/Eastern')

2011-01-01 19:00:00-05:00   -0.274074
2011-01-04 19:00:00-05:00    0.924186
2011-01-06 19:00:00-05:00    0.167108
2011-01-07 19:00:00-05:00   -1.720296
2011-01-09 19:00:00-05:00   -1.117970
2011-01-11 19:00:00-05:00   -0.946407
dtype: float64

# 10.4.2 Time Zone-Aware Timestamp Objects
timestamp : an object that marks one specific date and time, like a stamp

In [104]:
t_stamp = pd.Timestamp('2018-1-1 4:00')

In [111]:
t_stamp_utc = t_stamp.tz_localize('UTC')

In [112]:
t_stamp_utc

Timestamp('2018-01-01 04:00:00+0000', tz='UTC')

In [113]:
t_stamp_utc.tz_convert('US/Hawaii')

Timestamp('2017-12-31 18:00:00-1000', tz='US/Hawaii')

In [114]:
# the time zone can also be set when the Timestamp is created
t_stamp_moscow = pd.Timestamp('2018/1/1 16:23', tz='Europe/Moscow')
t_stamp_moscow

Timestamp('2018-01-01 16:23:00+0300', tz='Europe/Moscow')

In [116]:
t_stamp_utc.value  # the UTC timestamp value is stored as nanoseconds since the UNIX epoch (1970-01-01)

1514779200000000000

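# 10.4.3 Operations Between Different Time Zones
Timestamps are stored internally in UTC, so combining series from different time zones needs no manual conversion; the result is in UTC.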

In [117]:
ts.index = pd.date_range('1/1/2018 09:30', periods=6, freq='B')
# freq='B' : business days (weekdays)

In [118]:
ts

2018-01-01 09:30:00   -0.274074
2018-01-02 09:30:00    0.924186
2018-01-03 09:30:00    0.167108
2018-01-04 09:30:00   -1.720296
2018-01-05 09:30:00   -1.117970
2018-01-08 09:30:00   -0.946407
Freq: B, dtype: float64
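
A minimal sketch of such an operation. The series `ts_naive` and the choice of 'Europe/London' and 'Europe/Moscow' are illustrative assumptions:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('1/1/2018 09:30', periods=6, freq='B')
ts_naive = pd.Series(np.arange(6, dtype=float), index=idx)

ts_london = ts_naive.tz_localize('Europe/London')   # attach a time zone
ts_moscow = ts_london.tz_convert('Europe/Moscow')   # same instants, other zone

# Both indexes refer to the same instants, so every row aligns,
# and the combined index is reported in UTC
result = ts_london + ts_moscow
print(result.index.tz)
```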

# 10.5 Periods and Period Arithmetic
- pd.Period()

In [119]:
p = pd.Period(2018, freq = 'A-DEC') #2018/1/1 ~ 2018/12/31
p

Period('2018', 'A-DEC')

In [120]:
p+5

Period('2023', 'A-DEC')

In [121]:
p-2

Period('2016', 'A-DEC')

In [122]:
# if two periods share the same frequency (freq), their difference is the number of units between them
pd.Period('3/4/2028', freq = 'A-DEC') - p

<10 * YearEnds: month=12>

In [123]:
rng = pd.period_range('1/1/2020', '6/3/2020', freq='M')
rng

PeriodIndex(['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06'], dtype='period[M]', freq='M')

In [124]:
pd.Series(np.random.randn(6), index=rng)

2020-01    0.074531
2020-02    0.556535
2020-03    1.972580
2020-04   -0.241066
2020-05    0.363376
2020-06    1.074484
Freq: M, dtype: float64

In [125]:
values = ['200103', '200202', '200301']

In [126]:
index_v = pd.PeriodIndex(values, freq = 'Q-DEC')
index_v

PeriodIndex(['2003Q1', '2002Q1', '2001Q1'], dtype='period[Q-DEC]', freq='Q-DEC')

# 10.5.1 Period Frequency Conversion
asfreq() : converts a period to another frequency

In [127]:
p

Period('2018', 'A-DEC')

In [128]:
p.asfreq('M', how='start')

Period('2018-01', 'M')

In [129]:
p.asfreq('M', how='end')

Period('2018-12', 'M')

# 10.5.2 Quarterly Frequencies
Quarterly data : used in accounting, finance, and many other fields

In [131]:
pq = pd.Period('2022-04', freq = 'Q-JAN')
pq

Period('2023Q1', 'Q-JAN')

In [132]:
pq.asfreq('D', 'start')

Period('2022-02-01', 'D')

In [133]:
pd.Period('2018/1/30', freq='Q-AUG')

Period('2018Q2', 'Q-AUG')
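
With `Q-JAN` the fiscal year ends in January, so fiscal 2023 runs from February 2022 through January 2023 and its first quarter covers February to April 2022. A small sketch checking both ends of that quarter (`q_start` and `q_end` are illustrative names):

```python
import pandas as pd

pq = pd.Period('2022-04', freq='Q-JAN')

q_start = pq.asfreq('D', 'start')   # first day of the quarter
q_end = pq.asfreq('D', 'end')       # last day of the quarter

print(pq, q_start, q_end)   # 2023Q1 2022-02-01 2022-04-30
```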

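# 10.6 Resampling and Frequency Conversion
Resampling : the process of converting a time series from one frequency (freq) to another
- Series.resample / DataFrame.resample
    + Downsampling : higher frequency -> lower frequency (many points -> fewer points), e.g. daily -> monthly
    + Upsampling : lower frequency -> higher frequency
    + closed = 'right'/ 'left' : which end of each interval is closed (inclusive)
    + label = 'right'/ 'left' : which edge of each interval is used as the datetime label of the result
    + loffset : shifts the datetime labels of the result (this keyword was removed in pandas 2.0; shift the result's index instead)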

In [143]:
rng

PeriodIndex(['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06'], dtype='period[M]', freq='M')

In [144]:
rng_rs = pd.date_range('1/1/2018', periods =100, freq='D')

In [145]:
ts_rs = pd.Series(np.random.randint(100, size = len(rng_rs)), index = rng_rs)

In [146]:
ts_rs.head()

2018-01-01    30
2018-01-02    62
2018-01-03    11
2018-01-04    67
2018-01-05    65
Freq: D, dtype: int32

In [147]:
ts_rs.resample('M').mean()

2018-01-31    41.870968
2018-02-28    50.464286
2018-03-31    50.451613
2018-04-30    46.400000
Freq: M, dtype: float64

In [148]:
ts_rs['2018-01'].mean()

41.87096774193548

In [150]:
ts_rs.resample('M', kind='period').mean()
# kind='period' labels the result with periods instead of timestamps

2018-01    41.870968
2018-02    50.464286
2018-03    50.451613
2018-04    46.400000
Freq: M, dtype: float64

# 10.6.1 Downsampling
The target frequency defines the size of each chunk of the time series.  
Each interval is half-open: only one of its two ends is included.

In [151]:
rng_ds = pd.date_range('1/1/2018', periods=12, freq = 'T')

In [153]:
ts_ds = pd.Series(np.arange(12), index=rng_ds)

In [154]:
ts_ds

2018-01-01 00:00:00     0
2018-01-01 00:01:00     1
2018-01-01 00:02:00     2
2018-01-01 00:03:00     3
2018-01-01 00:04:00     4
2018-01-01 00:05:00     5
2018-01-01 00:06:00     6
2018-01-01 00:07:00     7
2018-01-01 00:08:00     8
2018-01-01 00:09:00     9
2018-01-01 00:10:00    10
2018-01-01 00:11:00    11
Freq: T, dtype: int32

In [156]:
ts_ds.resample('5min').sum()

2018-01-01 00:00:00    10
2018-01-01 00:05:00    35
2018-01-01 00:10:00    21
Freq: 5T, dtype: int32

In [157]:
ts_ds.resample('5min', closed='right').sum()

2017-12-31 23:55:00     0
2018-01-01 00:00:00    15
2018-01-01 00:05:00    40
2018-01-01 00:10:00    11
Freq: 5T, dtype: int32

In [159]:
ts_ds.resample('5min', closed='left', label='right').mean()

2018-01-01 00:05:00     2.0
2018-01-01 00:10:00     7.0
2018-01-01 00:15:00    10.5
Freq: 5T, dtype: float64

In [160]:
# shift the labels 30 seconds earlier; shift(freq=...) moves the index, not the values
ts_ds.resample('5min', closed='left', label='right').mean().shift(-30, freq='s')

2018-01-01 00:04:30     2.0
2018-01-01 00:09:30     7.0
2018-01-01 00:14:30    10.5
Freq: 5T, dtype: float64

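# 10.6.2 Upsampling and Interpolation
Lower frequency -> higher frequency, e.g. monthly -> daily

Upsampling introduces time points that were never observed, so no aggregation is needed. 

Instead, the missing values have to be filled in (interpolation).

Fill options
- fill method : 'ffill' (forward fill) or 'bfill' (backward fill)
- limit = integer : the maximum number of consecutive periods to fill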

In [161]:
df_us = pd.DataFrame(np.random.randn(2,4),
                    columns = ['Colorado', 'Texas', 'New York', 'Ohio'],
                    index = pd.date_range('1/1/2018', periods=2, freq ='W-WED'))
df_us

            Colorado      Texas   New York      Ohio
2018-01-03  0.460850  -0.052698  -2.899218  2.234424
2018-01-10  0.666703   1.338284   0.185105  2.611075


In [162]:
df_daily = df_us.resample('D')
df_daily

<pandas.core.resample.DatetimeIndexResampler object at 0x13041A50>

In [163]:
df_us.resample('D').ffill()

            Colorado      Texas   New York      Ohio
2018-01-03  0.460850  -0.052698  -2.899218  2.234424
2018-01-04  0.460850  -0.052698  -2.899218  2.234424
2018-01-05  0.460850  -0.052698  -2.899218  2.234424
2018-01-06  0.460850  -0.052698  -2.899218  2.234424
2018-01-07  0.460850  -0.052698  -2.899218  2.234424
2018-01-08  0.460850  -0.052698  -2.899218  2.234424
2018-01-09  0.460850  -0.052698  -2.899218  2.234424
2018-01-10  0.666703   1.338284   0.185105  2.611075


In [164]:
df_us.resample('D').bfill(limit=2)

            Colorado      Texas   New York      Ohio
2018-01-03  0.460850  -0.052698  -2.899218  2.234424
2018-01-04       NaN        NaN        NaN       NaN
2018-01-05       NaN        NaN        NaN       NaN
2018-01-06       NaN        NaN        NaN       NaN
2018-01-07       NaN        NaN        NaN       NaN
2018-01-08  0.666703   1.338284   0.185105  2.611075
2018-01-09  0.666703   1.338284   0.185105  2.611075
2018-01-10  0.666703   1.338284   0.185105  2.611075


In [165]:
df_us.resample('W-THU').ffill()

            Colorado      Texas   New York      Ohio
2018-01-04  0.460850  -0.052698  -2.899218  2.234424
2018-01-11  0.666703   1.338284   0.185105  2.611075


# 10.6.3 Resampling by Period
- resample('frequency alias') followed by an aggregation such as sum or mean

In [171]:
sr = pd.Series(np.random.randint(20, size=100),
              index = pd.date_range('1/1/2018', periods=100, freq='M'))
sr

2018-01-31     7
2018-02-28    15
2018-03-31    13
2018-04-30    17
2018-05-31     3
              ..
2025-12-31     0
2026-01-31     4
2026-02-28    14
2026-03-31     5
2026-04-30    13
Freq: M, Length: 100, dtype: int32

In [173]:
annual_frame = sr.resample('A-DEC').sum()
annual_frame

2018-12-31    172
2019-12-31    107
2020-12-31    135
2021-12-31    139
2022-12-31     97
2023-12-31    124
2024-12-31    131
2025-12-31    102
2026-12-31     36
Freq: A-DEC, dtype: int32

In [174]:
annual_frame.resample('Q-DEC').ffill()

2018-12-31    172
2019-03-31    172
2019-06-30    172
2019-09-30    172
2019-12-31    107
2020-03-31    107
2020-06-30    107
2020-09-30    107
2020-12-31    135
2021-03-31    135
2021-06-30    135
2021-09-30    135
2021-12-31    139
2022-03-31    139
2022-06-30    139
2022-09-30    139
2022-12-31     97
2023-03-31     97
2023-06-30     97
2023-09-30     97
2023-12-31    124
2024-03-31    124
2024-06-30    124
2024-09-30    124
2024-12-31    131
2025-03-31    131
2025-06-30    131
2025-09-30    131
2025-12-31    102
2026-03-31    102
2026-06-30    102
2026-09-30    102
2026-12-31     36
Freq: Q-DEC, dtype: int32

In [175]:
annual_frame.resample('Q-DEC', convention='end').ffill()

2018-12-31    172
2019-03-31    172
2019-06-30    172
2019-09-30    172
2019-12-31    107
2020-03-31    107
2020-06-30    107
2020-09-30    107
2020-12-31    135
2021-03-31    135
2021-06-30    135
2021-09-30    135
2021-12-31    139
2022-03-31    139
2022-06-30    139
2022-09-30    139
2022-12-31     97
2023-03-31     97
2023-06-30     97
2023-09-30     97
2023-12-31    124
2024-03-31    124
2024-06-30    124
2024-09-30    124
2024-12-31    131
2025-03-31    131
2025-06-30    131
2025-09-30    131
2025-12-31    102
2026-03-31    102
2026-06-30    102
2026-09-30    102
2026-12-31     36
Freq: Q-DEC, dtype: int32

# 10.7 Time Series Plots
pandas time series plots : date formatting on the axis is improved over plain matplotlib.
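
A minimal plotting sketch, assuming matplotlib is installed. The random-walk series `walk` is made up for illustration, and the `Agg` backend is selected only so the code also runs without a display:

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Hypothetical daily series: a random walk over 200 days
rng = np.random.default_rng(0)
walk = pd.Series(rng.standard_normal(200).cumsum(),
                 index=pd.date_range('1/1/2020', periods=200))

# Series.plot() draws on a matplotlib Axes and formats the date ticks itself
ax = walk.plot(title='Random walk')
```

In a notebook the `matplotlib.use('Agg')` line can be dropped so the figure is shown inline.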