# Pandas Summary
- http://pandas.pydata.org/pandas-docs/stable/10min.html
- Object Creation
- Viewing Data
- Selection
- Setting
- Missing Data
- Operations
- Merge
- Grouping
- Reshaping
- Time Series

In [1]:
import pandas as pd
import numpy as np

## Object Creation

### Create Series

In [2]:
s = pd.Series([1,3,5,np.nan, 6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

### Create DataFrame

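- date_range : generates a range of dates that can be used as an index.
- freq : the step between consecutive dates, in units from years down to seconds (Y-M-D H:MIN:S, i.e. year-month-day hour:minute:second)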
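
As a minimal sketch of how `periods` and `freq` interact, the following builds three small ranges from the same start date. The variable names are illustrative, and `'2D'` and `'min'` are only example frequency strings.

```python
import pandas as pd

# Six consecutive days; 'D' (calendar day) is the default frequency.
daily = pd.date_range('20130101', periods=6)

# A multiple in front of the alias changes the step: every second day.
every_other_day = pd.date_range('20130101', periods=3, freq='2D')

# Sub-daily steps work the same way: one-minute intervals.
minutely = pd.date_range('20130101', periods=3, freq='min')

print(daily[-1])            # 2013-01-06 00:00:00
print(every_other_day[-1])  # 2013-01-05 00:00:00
print(minutely[-1])         # 2013-01-01 00:02:00
```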

In [4]:
dates = pd.date_range('20130101', periods = 6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [7]:
df = pd.DataFrame(np.random.randint(-5,5, size = (6,4)), index = dates, columns = list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,0,3,-1,-3
2013-01-02,2,-4,1,-5
2013-01-03,3,-4,4,-4
2013-01-04,-5,4,-2,0
2013-01-05,-3,0,0,-2
2013-01-06,-5,3,-3,-5


In [9]:
df2 = pd.DataFrame({ 'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1, index = list(range(4)), dtype = 'float32'),
                    # the value 1 is stored as float32 at index 0 through 3
                    'D' : np.array([3] * 4, dtype = 'int32'), # [3] * 4 = [3,3,3,3]
                    'E' : pd.Categorical(['test', 'train', 'test', 'train']),
                    'F' : 'foo'})
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [10]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

### Viewing Data

- describe() shows a quick statistical summary
- Transposing

In [11]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-1.333333,0.333333,-0.166667,-3.166667
std,3.50238,3.614784,2.483277,1.94079
min,-5.0,-4.0,-3.0,-5.0
25%,-4.5,-3.0,-1.75,-4.75
50%,-1.5,1.5,-0.5,-3.5
75%,1.5,3.0,0.75,-2.25
max,3.0,4.0,4.0,0.0


In [12]:
df.T

Unnamed: 0,2013-01-01 00:00:00,2013-01-02 00:00:00,2013-01-03 00:00:00,2013-01-04 00:00:00,2013-01-05 00:00:00,2013-01-06 00:00:00
A,0,2,3,-5,-3,-5
B,3,-4,-4,4,0,3
C,-1,1,4,-2,0,-3
D,-3,-5,-4,0,-2,-5


### Selection

- Getting
- Selection by Label
- Selection by Position
- Boolean Indexing
- isin method

overview
```
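- loc : selects by label; returns rows by default
    - [row, col]

- at : a single value, by label
    - at[row, col] returns one value (at is faster in this case)
- iloc : selects by integer position
    - 0:3
    - 0, 1
- iat : a single value, by integer position

- [" "] : column
    - ["A"] reads the column if it exists; assigning to a name that does not exist adds it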
```
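
The accessors in the overview can be tried side by side on a tiny frame. This is a sketch on made-up data (`demo` is not the random `df` used in this notebook), intended only to show what type each accessor returns.

```python
import pandas as pd

# Small illustrative frame with a date index.
dates = pd.date_range('20130101', periods=3)
demo = pd.DataFrame({'A': [10, 20, 30], 'B': [1, 2, 3]}, index=dates)

row_by_label = demo.loc[dates[0]]        # Series: one row, selected by label
cell_by_label = demo.at[dates[0], 'A']   # scalar, selected by label
rows_by_pos = demo.iloc[0:2]             # DataFrame: first two rows by position
cell_by_pos = demo.iat[1, 1]             # scalar, selected by position

demo['C'] = demo['A'] + demo['B']        # assigning to a new name adds a column

print(cell_by_label, cell_by_pos)        # 10 2
print(list(demo.columns))                # ['A', 'B', 'C']
```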

#### Getting

In [13]:
# Print the data in column A
df['A']

2013-01-01    0
2013-01-02    2
2013-01-03    3
2013-01-04   -5
2013-01-05   -3
2013-01-06   -5
Freq: D, Name: A, dtype: int64

In [14]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,0,3,-1,-3
2013-01-02,2,-4,1,-5
2013-01-03,3,-4,4,-4


#### Selection by Label

In [17]:
df.loc[dates[0]]

A    0
B    3
C   -1
D   -3
Name: 2013-01-01 00:00:00, dtype: int64

In [18]:
# Print columns A and B for every row
df.loc[:, ['A','B']]

Unnamed: 0,A,B
2013-01-01,0,3
2013-01-02,2,-4
2013-01-03,3,-4
2013-01-04,-5,4
2013-01-05,-3,0
2013-01-06,-5,3


In [19]:
# Print columns A and B from January 2 through January 4 (both endpoints included)
df.loc['20130102':"20130104",['A','B']]

Unnamed: 0,A,B
2013-01-02,2,-4
2013-01-03,3,-4
2013-01-04,-5,4


In [20]:
# Print columns A and B for January 2
df.loc['20130102', ['A','B']]

A    2
B   -4
Name: 2013-01-02 00:00:00, dtype: int64

- Accessing a value with at is generally faster than with loc, although a single %%time run like the two below is too noisy to show the difference.
- at can return only a single value (prefer at when fetching one scalar).
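
One way to compare the two accessors more reliably than a single `%%time` cell is to repeat each lookup many times with the standard-library `timeit` module. This is only a sketch: the frame, the repeat count `n`, and the variable names are assumptions, and the absolute numbers depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Repeat each scalar lookup n times and report the average per lookup.
n = 2000
loc_seconds = timeit.timeit(lambda: frame.loc[dates[0], 'A'], number=n)
at_seconds = timeit.timeit(lambda: frame.at[dates[0], 'A'], number=n)

print(f'loc: {loc_seconds / n * 1e6:.1f} µs per lookup')
print(f'at:  {at_seconds / n * 1e6:.1f} µs per lookup')
```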

In [21]:
%%time
# For getting a scalar value
df.loc[dates[0], 'A']

CPU times: user 116 µs, sys: 0 ns, total: 116 µs
Wall time: 121 µs


0

In [22]:
%%time
# For getting fast access to a scalar
df.at[dates[0],'A']

CPU times: user 121 µs, sys: 1 µs, total: 122 µs
Wall time: 127 µs


0

#### Selection by Position

- iloc (integer location) : selects data by integer position, in order, without needing the index labels that loc uses

In [23]:
df.iloc[3]

A   -5
B    4
C   -2
D    0
Name: 2013-01-04 00:00:00, dtype: int64

In [24]:
# Select data by range
df.iloc[3:5, 0:2]

Unnamed: 0,A,B
2013-01-04,-5,4
2013-01-05,-3,0


In [25]:
# Select data by listing individual positions
df.iloc[[1,2,4], [0,2]]

Unnamed: 0,A,C
2013-01-02,2,1
2013-01-03,3,4
2013-01-05,-3,0


In [26]:
df.iloc[1:3, :]

Unnamed: 0,A,B,C,D
2013-01-02,2,-4,1,-5
2013-01-03,3,-4,4,-4


In [27]:
df.iloc[:,1:3]

Unnamed: 0,B,C
2013-01-01,3,-1
2013-01-02,-4,1
2013-01-03,-4,4
2013-01-04,4,-2
2013-01-05,0,0
2013-01-06,3,-3


- Accessing a single value with iat is faster than with iloc.
- iat can return only a single value.

In [29]:
%%time
df.iloc[1,1]

CPU times: user 167 µs, sys: 1 µs, total: 168 µs
Wall time: 172 µs


-4

In [30]:
%%time
df.iat[1,1]

CPU times: user 67 µs, sys: 1e+03 ns, total: 68 µs
Wall time: 73 µs


-4

#### Boolean Indexing

In [31]:
df

Unnamed: 0,A,B,C,D
2013-01-01,0,3,-1,-3
2013-01-02,2,-4,1,-5
2013-01-03,3,-4,4,-4
2013-01-04,-5,4,-2,0
2013-01-05,-3,0,0,-2
2013-01-06,-5,3,-3,-5


In [32]:
df.A > 0

2013-01-01    False
2013-01-02     True
2013-01-03     True
2013-01-04    False
2013-01-05    False
2013-01-06    False
Freq: D, Name: A, dtype: bool

In [33]:
df[df.A > 0]

Unnamed: 0,A,B,C,D
2013-01-02,2,-4,1,-5
2013-01-03,3,-4,4,-4


In [34]:
df > 0

Unnamed: 0,A,B,C,D
2013-01-01,False,True,False,False
2013-01-02,True,False,True,False
2013-01-03,True,False,True,False
2013-01-04,False,True,False,False
2013-01-05,False,False,False,False
2013-01-06,False,True,False,False


In [35]:
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,,3.0,,
2013-01-02,2.0,,1.0,
2013-01-03,3.0,,4.0,
2013-01-04,,4.0,,
2013-01-05,,,,
2013-01-06,,3.0,,


#### isin method

- copy() duplicates the DataFrame's data into newly allocated memory.
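
The difference between a copy and a plain assignment can be checked directly. A minimal sketch on made-up data (the names `original`, `alias`, and `snapshot` are illustrative):

```python
import pandas as pd

original = pd.DataFrame({'A': [1, 2, 3]})

alias = original            # same object, no new memory
snapshot = original.copy()  # independent copy of the data

snapshot.loc[0, 'A'] = 99   # changes only the copy

print(alias is original)     # True
print(original.loc[0, 'A'])  # 1
print(snapshot.loc[0, 'A'])  # 99
```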

In [37]:
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

Unnamed: 0,A,B,C,D,E
2013-01-01,0,3,-1,-3,one
2013-01-02,2,-4,1,-5,one
2013-01-03,3,-4,4,-4,two
2013-01-04,-5,4,-2,0,three
2013-01-05,-3,0,0,-2,four
2013-01-06,-5,3,-3,-5,three


In [38]:
# Build a boolean mask marking the rows where column E is 'two' or 'four'
df2['E'].isin(['two', 'four'])

2013-01-01    False
2013-01-02    False
2013-01-03     True
2013-01-04    False
2013-01-05     True
2013-01-06    False
Freq: D, Name: E, dtype: bool

In [39]:
df2[df2['E'].isin(['two','four'])]

Unnamed: 0,A,B,C,D,E
2013-01-03,3,-4,4,-4,two
2013-01-05,-3,0,0,-2,four


### Setting

In [40]:
s1 = pd.Series([1,2,3,4,5,6], index = pd.date_range('20130102', periods = 6))
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

When a Series is assigned to a column of df, the values are aligned on the index. Rows whose index label is missing from the Series get NaN, meaning no data.
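
The alignment rule can be seen on a smaller example. In this sketch (made-up data, illustrative names) the Series starts one day after the frame and ends one day past it:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3]}, index=pd.date_range('20130101', periods=3))

# The Series covers 2013-01-02 .. 2013-01-04; the frame covers 01-01 .. 01-03.
values = pd.Series([10, 20, 30], index=pd.date_range('20130102', periods=3))

frame['F'] = values  # matched by index label, not by position

print(frame['F'].tolist())  # [nan, 10.0, 20.0]
print(len(frame))           # 3: the 2013-01-04 value has no matching row
```

The column becomes float because NaN cannot be stored in an integer column.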

In [42]:
df['F'] = s1
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0,3,-1,-3,
2013-01-02,2,-4,1,-5,1.0
2013-01-03,3,-4,4,-4,2.0
2013-01-04,-5,4,-2,0,3.0
2013-01-05,-3,0,0,-2,4.0
2013-01-06,-5,3,-3,-5,5.0


In [44]:
# Setting values by label
# at can change the value at a specific position.
df.at[dates[0], 'A'] = 1
df

Unnamed: 0,A,B,C,D,F
2013-01-01,1,3,-1,-3,
2013-01-02,2,-4,1,-5,1.0
2013-01-03,3,-4,4,-4,2.0
2013-01-04,-5,4,-2,0,3.0
2013-01-05,-3,0,0,-2,4.0
2013-01-06,-5,3,-3,-5,5.0


In [45]:
# Setting by assigning with a numpy array
# Replace column D with an array that repeats 5 once for each row of df.
df.loc[:,'D'] = np.array([5] * len(df))
df

Unnamed: 0,A,B,C,D,F
2013-01-01,1,3,-1,5,
2013-01-02,2,-4,1,5,1.0
2013-01-03,3,-4,4,5,2.0
2013-01-04,-5,4,-2,5,3.0
2013-01-05,-3,0,0,5,4.0
2013-01-06,-5,3,-3,5,5.0


In [46]:
# Negate only the values greater than 0
df2 = df.copy()
df2[df2 > 0] = -df2
df2

Unnamed: 0,A,B,C,D,F
2013-01-01,-1,-3,-1,-5,
2013-01-02,-2,-4,-1,-5,-1.0
2013-01-03,-3,-4,-4,-5,-2.0
2013-01-04,-5,-4,-2,-5,-3.0
2013-01-05,-3,0,0,-5,-4.0
2013-01-06,-5,-3,-3,-5,-5.0


### Missing Data

In [47]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,1,3,-1,5,
2013-01-02,2,-4,1,5,1.0
2013-01-03,3,-4,4,5,2.0
2013-01-04,-5,4,-2,5,3.0
2013-01-05,-3,0,0,5,4.0
2013-01-06,-5,3,-3,5,5.0


In [48]:
# Reindex the data.
# The index parameter gives the row labels to keep,
# and the columns parameter gives the column labels.
df1 = df.reindex(index = dates[0:4], columns = list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = [1,2] # set E to 1 and 2 for dates[0] through dates[1]
df1

Unnamed: 0,A,B,C,D,F,E
2013-01-01,1,3,-1,5,,1.0
2013-01-02,2,-4,1,5,1.0,2.0
2013-01-03,3,-4,4,5,2.0,
2013-01-04,-5,4,-2,5,3.0,


In [53]:
# To drop any rows that have missing data
# dropna removes the rows that contain NaN.
df1.dropna(how = 'any')

Unnamed: 0,A,B,C,D,F,E
2013-01-02,2,-4,1,5,1.0,2.0


In [54]:
df1.loc[dates[0]] = np.nan
df1

Unnamed: 0,A,B,C,D,F,E
2013-01-01,,,,,,
2013-01-02,2.0,-4.0,1.0,5.0,1.0,2.0
2013-01-03,3.0,-4.0,4.0,5.0,2.0,
2013-01-04,-5.0,4.0,-2.0,5.0,3.0,


In [55]:
# With how = 'all', only rows in which every value is NaN are dropped.
df1.dropna(how = 'all')

Unnamed: 0,A,B,C,D,F,E
2013-01-02,2.0,-4.0,1.0,5.0,1.0,2.0
2013-01-03,3.0,-4.0,4.0,5.0,2.0,
2013-01-04,-5.0,4.0,-2.0,5.0,3.0,


In [57]:
df1['F'] = np.nan
df1

Unnamed: 0,A,B,C,D,F,E
2013-01-01,,,,,,
2013-01-02,2.0,-4.0,1.0,5.0,,2.0
2013-01-03,3.0,-4.0,4.0,5.0,,
2013-01-04,-5.0,4.0,-2.0,5.0,,


In [58]:
# axis selects whether rows or columns are dropped.
df1.dropna(how= 'all', axis = 1)

Unnamed: 0,A,B,C,D,E
2013-01-01,,,,,
2013-01-02,2.0,-4.0,1.0,5.0,2.0
2013-01-03,3.0,-4.0,4.0,5.0,
2013-01-04,-5.0,4.0,-2.0,5.0,


In [59]:
# Filling missing data
# Fill the NaN entries with the value 5.
df1.fillna(value = 5)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,5.0,5.0,5.0,5.0,5.0,5.0
2013-01-02,2.0,-4.0,1.0,5.0,5.0,2.0
2013-01-03,3.0,-4.0,4.0,5.0,5.0,5.0
2013-01-04,-5.0,4.0,-2.0,5.0,5.0,5.0
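
The `how` and `axis` options are easier to compare on a frame where the missing values are placed deliberately. A sketch on made-up data (the frame and variable names are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'B': [np.nan, np.nan, 6.0],
    'C': [np.nan, np.nan, np.nan],
})

no_empty_cols = frame.dropna(how='all', axis=1)  # drops C (entirely NaN)
no_empty_rows = frame.dropna(how='all')          # drops row 1 (entirely NaN)
complete_rows = no_empty_cols.dropna(how='any')  # keeps only row 2
filled = frame.fillna(value=0)                   # every NaN becomes 0

print(list(no_empty_cols.columns))  # ['A', 'B']
print(list(no_empty_rows.index))    # [0, 2]
print(list(complete_rows.index))    # [2]
```

None of these calls modifies `frame`; each returns a new DataFrame.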


### Operations

- Stats
- Histogramming
- String Methods

#### Stats

In [60]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,1,3,-1,5,
2013-01-02,2,-4,1,5,1.0
2013-01-03,3,-4,4,5,2.0
2013-01-04,-5,4,-2,5,3.0
2013-01-05,-3,0,0,5,4.0
2013-01-06,-5,3,-3,5,5.0


In [61]:
# Mean of each column
# NaN is excluded from the mean: F = (1 + 2 + 3 + 4 + 5) / 5, not / 6
df.mean()

A   -1.166667
B    0.333333
C   -0.166667
D    5.000000
F    3.000000
dtype: float64
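
The effect of skipping NaN can be checked on the F values alone. A small sketch (the Series `f` simply mirrors column F above):

```python
import numpy as np
import pandas as pd

f = pd.Series([np.nan, 1, 2, 3, 4, 5])

print(f.mean())              # 3.0: (1 + 2 + 3 + 4 + 5) / 5
print(f.mean(skipna=False))  # nan: the missing value propagates
print(f.fillna(0).mean())    # 2.5: 15 / 6, if the NaN were counted as 0
```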

In [62]:
# axis - 0: mean of each column, 1: mean of each row
print(df.mean(1))

2013-01-01    2.0
2013-01-02    1.0
2013-01-03    2.0
2013-01-04    1.0
2013-01-05    1.2
2013-01-06    1.0
Freq: D, dtype: float64


In [64]:
# shift
# Shifts the values along the index; positions left empty by the shift are filled with NaN.
# Negative shift values are allowed.

s = pd.Series([1,3,5,np.nan,6,8], index = dates)
print(s)
print(s.shift(2))
print(s.shift(-1))

2013-01-01    1.0
2013-01-02    3.0
2013-01-03    5.0
2013-01-04    NaN
2013-01-05    6.0
2013-01-06    8.0
Freq: D, dtype: float64
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
2013-01-01    3.0
2013-01-02    5.0
2013-01-03    NaN
2013-01-04    6.0
2013-01-05    8.0
2013-01-06    NaN
Freq: D, dtype: float64
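
A common use of `shift` is comparing each value with the previous one. The following is a sketch on made-up data, not part of the original notebook:

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 5.0, 6.0], index=pd.date_range('20130101', periods=4))

previous = s.shift(1)  # each value moves one row down; the first row becomes NaN
change = s - previous  # difference from the previous day

print(change.tolist())  # [nan, 2.0, 2.0, 1.0]
```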


In [65]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,1,3,-1,5,
2013-01-02,2,-4,1,5,1.0
2013-01-03,3,-4,4,5,2.0
2013-01-04,-5,4,-2,5,3.0
2013-01-05,-3,0,0,5,4.0
2013-01-06,-5,3,-3,5,5.0


In [66]:
# Subtract s from every column of df (the Series is broadcast across the columns).
# Subtracting NaN yields NaN
print(s)
df.sub(s, axis = 'index')

2013-01-01    1.0
2013-01-02    3.0
2013-01-03    5.0
2013-01-04    NaN
2013-01-05    6.0
2013-01-06    8.0
Freq: D, dtype: float64


Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,2.0,-2.0,4.0,
2013-01-02,-1.0,-7.0,-2.0,2.0,-2.0
2013-01-03,-2.0,-9.0,-1.0,0.0,-3.0
2013-01-04,,,,,
2013-01-05,-9.0,-6.0,-6.0,-1.0,-2.0
2013-01-06,-13.0,-5.0,-11.0,-3.0,-3.0
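
To make the broadcasting concrete, this sketch (made-up data, illustrative names) compares `sub(..., axis='index')` with an explicit column-by-column subtraction:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('20130101', periods=3)
frame = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]}, index=idx)
s = pd.Series([1.0, np.nan, 3.0], index=idx)

# axis='index' matches s against the row labels, so the value of s for a
# given date is subtracted from every column in that row.
result = frame.sub(s, axis='index')

# The same computation written out one column at a time.
manual = pd.DataFrame({col: frame[col] - s for col in frame.columns})

print(result['B'].tolist())   # [9.0, nan, 27.0]
print(result.equals(manual))  # True
```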


#### Histogramming

In [68]:
s = pd.Series(np.random.randint(0, 7, size = 10))
s

0    6
1    5
2    4
3    6
4    5
5    4
6    4
7    6
8    3
9    0
dtype: int64

In [69]:
# Groups identical values together and counts how many times each one occurs.
s.value_counts()

6    3
4    3
5    2
3    1
0    1
dtype: int64
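
Because `value_counts` sorts by count in descending order, the first index entry is the most frequent value. A sketch with fixed data (chosen so that there is no tie for first place):

```python
import pandas as pd

s = pd.Series([6, 5, 4, 6, 5, 4, 4, 6, 3, 0, 6])

counts = s.value_counts()  # index: distinct values, values: occurrences

print(counts.index[0])  # 6, the most frequent value
print(counts.loc[4])    # 3
print(counts.sum())     # 11, the number of non-missing values
```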

#### String Methods

In [70]:
# Values of any other type (not strings) become NaN.
s = pd.Series(['A','B','C','Aaba','Baca', np.nan, 'CABA', 'dog', 'cat', 1])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
9     NaN
dtype: object
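
The NaN at position 9 comes from the integer `1`, not from a missing value. A short sketch (made-up data) that separates the two cases with `isna()`:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Aaba', np.nan, 'dog', 1])

lowered = s.str.lower()

print(lowered.tolist())        # ['aaba', nan, 'dog', nan]
print(s.isna().tolist())       # [False, True, False, False]
print(lowered.isna().tolist()) # [False, True, False, True]
```

Only index 1 was missing in the input; index 3 became NaN because `.str` methods skip non-string elements.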

In [71]:
# Convert the strings to uppercase.
_.str.upper()

0       A
1       B
2       C
3    AABA
4    BACA
5     NaN
6    CABA
7     DOG
8     CAT
9     NaN
dtype: object

In [73]:
# Replace each string with its length.
_.str.len()

0    1.0
1    1.0
2    1.0
3    4.0
4    4.0
5    NaN
6    4.0
7    3.0
8    3.0
9    NaN
dtype: float64

### Merge

- concat

In [74]:
df = pd.DataFrame(np.random.randint(1, 10, size = (10, 4)))
df

Unnamed: 0,0,1,2,3
0,4,5,7,1
1,4,3,1,2
2,3,9,6,8
3,1,6,5,5
4,2,8,3,3
5,9,2,3,6
6,8,4,8,3
7,6,6,1,4
8,5,6,4,8
9,1,7,1,5


In [76]:
# Split df into a list of pieces.

pieces = [df[:3], df[3:7], df[7:]]
print(pieces[0], end = '\n\n')
print(pieces[1], end = '\n\n')
print(pieces[2])

   0  1  2  3
0  4  5  7  1
1  4  3  1  2
2  3  9  6  8

   0  1  2  3
3  1  6  5  5
4  2  8  3  3
5  9  2  3  6
6  8  4  8  3

   0  1  2  3
7  6  6  1  4
8  5  6  4  8
9  1  7  1  5


In [77]:
# concat joins the DataFrames in the list back together.
pd.concat(pieces)

Unnamed: 0,0,1,2,3
0,4,5,7,1
1,4,3,1,2
2,3,9,6,8
3,1,6,5,5
4,2,8,3,3
5,9,2,3,6
6,8,4,8,3
7,6,6,1,4
8,5,6,4,8
9,1,7,1,5
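
The round trip can be verified with `equals`. This sketch uses a deterministic frame instead of random numbers, and also shows the `ignore_index` option of `pd.concat`, which the notebook does not use:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(40).reshape(10, 4))

pieces = [frame[:3], frame[3:7], frame[7:]]
rebuilt = pd.concat(pieces)

print(rebuilt.equals(frame))  # True

# Concatenating in a different order keeps each row's original label...
shuffled = pd.concat([pieces[2], pieces[0]])
print(shuffled.index.tolist())    # [7, 8, 9, 0, 1, 2]

# ...unless ignore_index=True renumbers the rows from 0.
renumbered = pd.concat([pieces[2], pieces[0]], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5]
```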


### Reshaping

- Stack

In [78]:
# zip
list(zip([1,2,3],[4,5,6],[7,8,9]))

[(1, 4, 7), (2, 5, 8), (3, 6, 9)]

In [79]:
tuples = list(zip(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                  ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']))
tuples

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [83]:
# MultiIndex: an index can have several levels (here two: first and second)
index = pd.MultiIndex.from_tuples(tuples, names = ['first', 'second'])
index

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

In [84]:
df = pd.DataFrame(np.random.randint(10, size = (8,2)), index = index, columns = ['A', 'B'])
df

              A  B
first second
bar   one     0  8
      two     6  0
baz   one     6  8
      two     1  6
foo   one     3  5
      two     6  3
qux   one     6  8
      two     0  6


In [85]:
df2= df[:4]
df2

              A  B
first second
bar   one     0  8
      two     6  0
baz   one     6  8
      two     1  6


In [87]:
# Moves the column labels into the index as a new innermost level
stacked = df2.stack()
stacked

first  second   
bar    one     A    0
               B    8
       two     A    6
               B    0
baz    one     A    6
               B    8
       two     A    1
               B    6
dtype: int64

In [89]:
stacked.index

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two'], ['A', 'B']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second', None])

In [91]:
# Moves the last (innermost) index level back into the columns
stacked.unstack()

              A  B
first second
bar   one     0  8
      two     6  0
baz   one     6  8
      two     1  6


In [92]:
# unstack can be applied more than once
stacked.unstack().unstack()

         A       B
second one two one two
first
bar      0   6   8   0
baz      6   1   8   6


In [93]:
# The index level to unstack can be passed as an argument.
# second moves up into the columns
stacked.unstack(1)

second   one  two
first
bar   A    0    6
      B    8    0
baz   A    6    1
      B    8    6


In [94]:
# first moves up into the columns
stacked.unstack(0)

first     bar  baz
second
one    A  0.0  6.0
       B  8.0  8.0
two    A  6.0  1.0
       B  0.0  6.0


In [95]:
# A and B move up into the columns
stacked.unstack(2)

              A  B
first second
bar   one     0  8
      two     6  0
baz   one     6  8
      two     1  6
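
A level can also be selected by name instead of by number. The sketch below rebuilds the four-row frame shown above with fixed values and checks that `unstack(1)` and `unstack('second')` agree:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two')],
    names=['first', 'second'],
)
frame = pd.DataFrame({'A': [0, 6, 6, 1], 'B': [8, 0, 8, 6]}, index=index)

stacked = frame.stack()              # Series with three index levels
by_position = stacked.unstack(1)     # level 1 is 'second'
by_name = stacked.unstack('second')  # same level, selected by name

print(by_position.equals(by_name))       # True
print(by_name.loc[('bar', 'A'), 'two'])  # 6
```

Recent pandas versions may print a warning about the `stack` implementation; the result for data without missing values is the same.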


### Time Series

In [96]:
# Create 100 timestamps at one-minute intervals
# freq - http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases
rng = pd.date_range('2018-01-01', periods = 100, freq = 'Min')
rng

DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 00:01:00',
               '2018-01-01 00:02:00', '2018-01-01 00:03:00',
               '2018-01-01 00:04:00', '2018-01-01 00:05:00',
               '2018-01-01 00:06:00', '2018-01-01 00:07:00',
               '2018-01-01 00:08:00', '2018-01-01 00:09:00',
               '2018-01-01 00:10:00', '2018-01-01 00:11:00',
               '2018-01-01 00:12:00', '2018-01-01 00:13:00',
               '2018-01-01 00:14:00', '2018-01-01 00:15:00',
               '2018-01-01 00:16:00', '2018-01-01 00:17:00',
               '2018-01-01 00:18:00', '2018-01-01 00:19:00',
               '2018-01-01 00:20:00', '2018-01-01 00:21:00',
               '2018-01-01 00:22:00', '2018-01-01 00:23:00',
               '2018-01-01 00:24:00', '2018-01-01 00:25:00',
               '2018-01-01 00:26:00', '2018-01-01 00:27:00',
               '2018-01-01 00:28:00', '2018-01-01 00:29:00',
               '2018-01-01 00:30:00', '2018-01-01 00:31:00',
               '2018-01-

In [97]:
# A Series of 100 random integers from 0 to 499, indexed by the timestamps created above
ts = pd.Series(np.random.randint(0, 500, len(rng)), index = rng)
ts.tail()

2018-01-01 01:35:00     16
2018-01-01 01:36:00    249
2018-01-01 01:37:00    199
2018-01-01 01:38:00    284
2018-01-01 01:39:00    273
Freq: T, dtype: int64

In [99]:
# Group the data into 10-minute bins and sum each bin
ts.resample('10Min').sum()

2018-01-01 00:00:00    3104
2018-01-01 00:10:00    2886
2018-01-01 00:20:00    2318
2018-01-01 00:30:00    2666
2018-01-01 00:40:00    2845
2018-01-01 00:50:00    2153
2018-01-01 01:00:00    2904
2018-01-01 01:10:00    2280
2018-01-01 01:20:00    3100
2018-01-01 01:30:00    2171
Freq: 10T, dtype: int64
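
With deterministic values the bin sums can be checked by hand. This sketch uses 30 one-minute samples holding 0 to 29 and the lowercase `'min'` spelling of the minute alias; the names are illustrative:

```python
import pandas as pd

rng = pd.date_range('2018-01-01', periods=30, freq='min')
ts = pd.Series(range(30), index=rng)  # values 0..29, one per minute

sums = ts.resample('10min').sum()    # 0+...+9, 10+...+19, 20+...+29
means = ts.resample('10min').mean()  # any aggregation can follow resample

print(sums.tolist())   # [45, 145, 245]
print(means.tolist())  # [4.5, 14.5, 24.5]
```

Each bin is labeled with its start time, so the three results are indexed by 00:00, 00:10, and 00:20.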

In [101]:
# Time Zone
rng = pd.date_range('2018-01-01', periods = 5, freq = 'D')
rng

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05'],
              dtype='datetime64[ns]', freq='D')

In [102]:
ts = pd.Series(np.random.randint(0, 100, len(rng)), rng)
ts

2018-01-01     2
2018-01-02    98
2018-01-03    46
2018-01-04    51
2018-01-05    91
Freq: D, dtype: int64

In [104]:
# Attach a time zone to the index (UTC, the reference zone: +00:00)
ts_utc = ts.tz_localize('UTC')
ts_utc

2018-01-01 00:00:00+00:00     2
2018-01-02 00:00:00+00:00    98
2018-01-03 00:00:00+00:00    46
2018-01-04 00:00:00+00:00    51
2018-01-05 00:00:00+00:00    91
Freq: D, dtype: int64

In [105]:
# Convert to another time zone (US/Eastern : -05:00)
ts_utc.tz_convert('US/Eastern')

2017-12-31 19:00:00-05:00     2
2018-01-01 19:00:00-05:00    98
2018-01-02 19:00:00-05:00    46
2018-01-03 19:00:00-05:00    51
2018-01-04 19:00:00-05:00    91
Freq: D, dtype: int64
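
`tz_localize` and `tz_convert` do different things: the first attaches a zone to naive timestamps, the second re-expresses the same instants in another zone. A small sketch on made-up data:

```python
import pandas as pd

naive = pd.Series([2, 98], index=pd.date_range('2018-01-01', periods=2, freq='D'))

utc = naive.tz_localize('UTC')          # attach a zone; wall-clock time unchanged
eastern = utc.tz_convert('US/Eastern')  # same instants, shown in another zone

print(utc.index[0])      # 2018-01-01 00:00:00+00:00
print(eastern.index[0])  # 2017-12-31 19:00:00-05:00
print(utc.index[0] == eastern.index[0])  # True: both are the same moment
```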

In [107]:
# Time zone list
# Check how many time zones are available
from pytz import common_timezones, all_timezones
len(common_timezones), len(all_timezones)

(439, 592)

In [110]:
# Change a monthly date range from the last day of each month to the 1st
# by dropping the day component and then regenerating it
rng = pd.date_range('1/1/2018', periods = 5, freq = 'M')
rng

DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31'],
              dtype='datetime64[ns]', freq='M')

In [112]:
ts = pd.Series(np.random.randint(0,10,len(rng)), index = rng)
ts

2018-01-31    9
2018-02-28    4
2018-03-31    6
2018-04-30    8
2018-05-31    1
Freq: M, dtype: int64

In [114]:
# to_period drops the day component, leaving monthly periods
ps = ts.to_period()
ps

2018-01    9
2018-02    4
2018-03    6
2018-04    8
2018-05    1
Freq: M, dtype: int64

In [115]:
# to_timestamp restores the day component (the first day of each period)
ps.to_timestamp()

2018-01-01    9
2018-02-01    4
2018-03-01    6
2018-04-01    8
2018-05-01    1
Freq: MS, dtype: int64
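
The same round trip can be written without a month-end `date_range`. This sketch builds the month-end dates explicitly with `pd.to_datetime` and passes the period frequency `'M'` to `to_period`; the data and names are illustrative:

```python
import pandas as pd

month_ends = pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31'])
ts = pd.Series([9, 4, 6], index=month_ends)

ps = ts.to_period('M')      # index becomes monthly periods: 2018-01, ...
starts = ps.to_timestamp()  # default how='start': first day of each month

print(ps.index[0])      # 2018-01
print(starts.index[0])  # 2018-01-01 00:00:00
```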