<a href="https://colab.research.google.com/github/ne-choi/textbook/blob/main/10_minutes_to_pandas/Pandas_10%EB%B6%84_%EC%99%84%EC%84%B1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **10 Minutes to Pandas (Transcription)**
**Source: translated material from 데잇걸즈2**

**Table of Contents**
1. Object Creation
2. Viewing Data
3. Selection
4. Missing Data
5. Operation
6. Merge
7. Grouping
8. Reshaping
9. Time Series
10. Categoricals
11. Plotting
12. Getting Data In / Out
13. Gotchas


In [None]:
# Import packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## **1. Object Creation**
[- See: Intro to data structures section](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html)

pandas creates a [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) from a list of values and assigns an integer index by default.

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
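
A minimal sketch of what the constructor does with this list, assuming only the `s` defined above (the name `s_int` is a hypothetical addition for comparison): the default index is a `RangeIndex`, and the single `np.nan` entry is why the integers are stored as `float64`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# With no index argument, pandas assigns a RangeIndex starting at 0.
print(type(s.index).__name__)

# NaN is a floating-point value, so the whole Series is stored as float64.
print(s.dtype)

# Hypothetical comparison: the same values without NaN stay an integer Series.
s_int = pd.Series([1, 3, 5, 6, 8])
print(s_int.dtype)
```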

Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns.

In [None]:
dates = pd.date_range('20130101', periods = 6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [None]:
df = pd.DataFrame(np.random.randn(6, 4), index = dates, columns = list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.131805,0.824105,0.793722,-0.464042
2013-01-02,0.136948,1.036089,-0.199675,0.819016
2013-01-03,1.633209,-0.473571,0.466677,0.478959
2013-01-04,-0.861636,-0.049129,0.350207,-1.452074
2013-01-05,2.248059,-0.758196,0.842324,0.338938
2013-01-06,-0.743977,-0.786985,-1.142624,-0.451164
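
Because `np.random.randn` draws new numbers on every run, the values will differ each time the cell is executed. The following is a sketch of one way to make the frame reproducible, assuming NumPy's `default_rng` generator and an arbitrary seed of 0 (neither appears in the original; `df_seeded` and `df_again` are hypothetical names).

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)

# Hypothetical seed (not in the original); any fixed integer works.
rng = np.random.default_rng(0)
df_seeded = pd.DataFrame(rng.standard_normal((6, 4)),
                         index=dates, columns=list('ABCD'))

# Rebuilding with the same seed reproduces exactly the same values.
rng_again = np.random.default_rng(0)
df_again = pd.DataFrame(rng_again.standard_normal((6, 4)),
                        index=dates, columns=list('ABCD'))

print(df_seeded.equals(df_again))
```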


Create a DataFrame by passing a dict of objects that can be converted to a Series-like structure.

In [None]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index = list(range(4)), dtype = 'float32'),
                    'D': np.array([3] * 4, dtype = 'int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2

In [None]:
# The columns of the resulting DataFrame have different dtypes
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
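
One detail of the dict constructor worth seeing in isolation is that scalar values are repeated to match the length of the other columns. The sketch below uses a reduced, hypothetical frame named `small` (a subset of the `df2` columns) to show this.

```python
import numpy as np
import pandas as pd

# Hypothetical reduced version of df2: two scalars plus one 4-element Series.
small = pd.DataFrame({
    'A': 1.0,                                                   # scalar, repeated for every row
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),   # sets the length to 4
    'F': 'foo',                                                 # scalar string, also repeated
})

print(len(small))
print(small['A'].tolist())
print(small['C'].dtype)
```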

## **2. Viewing Data**
[- See: Basics section](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html)  

Here is how to view the top and bottom rows of the DataFrame.

In [None]:
# Passing a number in the parentheses returns that many rows; with no argument, the default of 5 rows is returned

df.tail(3)

Unnamed: 0,A,B,C,D
2013-01-04,-0.861636,-0.049129,0.350207,-1.452074
2013-01-05,2.248059,-0.758196,0.842324,0.338938
2013-01-06,-0.743977,-0.786985,-1.142624,-0.451164


In [None]:
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,-0.131805,0.824105,0.793722,-0.464042
2013-01-02,0.136948,1.036089,-0.199675,0.819016
2013-01-03,1.633209,-0.473571,0.466677,0.478959
2013-01-04,-0.861636,-0.049129,0.350207,-1.452074
2013-01-05,2.248059,-0.758196,0.842324,0.338938
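
As a quick check of the row counts, here is a small sketch that rebuilds a frame of the same shape (the random values will differ from the tables in this notebook) and confirms what `head()` and `tail(3)` return.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# With no argument, head() returns the first 5 rows.
print(len(df.head()))

# tail(3) returns the last 3 rows, still in their original order.
print(df.tail(3).index.tolist() == dates[-3:].tolist())
```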


Let's look at the index, the columns, and the underlying NumPy data.

In [None]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [None]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [None]:
df.values

array([[-0.13180529,  0.82410509,  0.79372215, -0.4640425 ],
       [ 0.1369476 ,  1.03608902, -0.19967461,  0.81901566],
       [ 1.6332091 , -0.47357082,  0.46667687,  0.47895853],
       [-0.86163576, -0.04912882,  0.35020662, -1.45207412],
       [ 2.24805935, -0.75819576,  0.84232426,  0.33893801],
       [-0.74397667, -0.786985  , -1.14262442, -0.45116445]])
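
The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for getting the underlying array. A minimal sketch, assuming a frame built the same way as `df` (the `mixed` frame is a hypothetical addition to show what happens with columns of different types):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# to_numpy() returns the data only; index and column labels are not included.
arr = df.to_numpy()
print(arr.shape)
print(np.array_equal(arr, df.values))

# Hypothetical mixed-type frame: the array falls back to a common dtype.
mixed = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})
print(mixed.to_numpy().dtype)
```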

In [None]:
df.describe() # Shows a quick statistical summary of the data

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.380133,-0.034614,0.185105,-0.121728
std,1.279545,0.79578,0.750975,0.831091
min,-0.861636,-0.786985,-1.142624,-1.452074
25%,-0.590934,-0.68704,-0.062204,-0.460823
50%,0.002571,-0.26135,0.408442,-0.056113
75%,1.259144,0.605797,0.711961,0.443953
max,2.248059,1.036089,0.842324,0.819016


In [None]:
df.T # Transpose the data

Unnamed: 0,2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-06
A,-0.131805,0.136948,1.633209,-0.861636,2.248059,-0.743977
B,0.824105,1.036089,-0.473571,-0.049129,-0.758196,-0.786985
C,0.793722,-0.199675,0.466677,0.350207,0.842324,-1.142624
D,-0.464042,0.819016,0.478959,-1.452074,0.338938,-0.451164


In [None]:
df.sort_index(axis = 1, ascending = False) # Sort by an axis (here, column labels in descending order)

Unnamed: 0,D,C,B,A
2013-01-01,-0.464042,0.793722,0.824105,-0.131805
2013-01-02,0.819016,-0.199675,1.036089,0.136948
2013-01-03,0.478959,0.466677,-0.473571,1.633209
2013-01-04,-1.452074,0.350207,-0.049129,-0.861636
2013-01-05,0.338938,0.842324,-0.758196,2.248059
2013-01-06,-0.451164,-1.142624,-0.786985,-0.743977


In [None]:
df.sort_values(by = 'B') # Sort by values

Unnamed: 0,A,B,C,D
2013-01-06,-0.743977,-0.786985,-1.142624,-0.451164
2013-01-05,2.248059,-0.758196,0.842324,0.338938
2013-01-03,1.633209,-0.473571,0.466677,0.478959
2013-01-04,-0.861636,-0.049129,0.350207,-1.452074
2013-01-01,-0.131805,0.824105,0.793722,-0.464042
2013-01-02,0.136948,1.036089,-0.199675,0.819016
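
The two sorting calls differ in what they look at: `sort_index` orders by labels, `sort_values` orders by the data in a column. Both return a new object and leave the original untouched. A small sketch with a hypothetical three-row frame (`toy`, not part of the original notebook):

```python
import pandas as pd

# Hypothetical toy frame with columns deliberately out of order.
toy = pd.DataFrame({'B': [0.8, 1.0, -0.5], 'A': [3, 1, 2]},
                   index=['x', 'y', 'z'])

# axis=1 sorts the column labels; the row order is unchanged.
by_labels = toy.sort_index(axis=1)
print(by_labels.columns.tolist())

# sort_values reorders the rows by the values in column 'B'.
by_values = toy.sort_values(by='B', ascending=False)
print(by_values.index.tolist())

# The original frame keeps its column order.
print(toy.columns.tolist())
```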


## **3. Selection**  
* Note: use the optimized pandas data access methods .at, .iat, .loc, and .iloc.  

[- See: Indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html), [MultiIndex / advanced indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)

### *** Getting**  
Selecting a single column yields a Series, equivalent to df.A.

In [None]:
df['A']

2013-01-01   -0.131805
2013-01-02    0.136948
2013-01-03    1.633209
2013-01-04   -0.861636
2013-01-05    2.248059
2013-01-06   -0.743977
Freq: D, Name: A, dtype: float64

In [None]:
# Slicing with [] selects rows
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,-0.131805,0.824105,0.793722,-0.464042
2013-01-02,0.136948,1.036089,-0.199675,0.819016
2013-01-03,1.633209,-0.473571,0.466677,0.478959


In [None]:
df['20130102':'20130104']

Unnamed: 0,A,B,C,D
2013-01-02,0.136948,1.036089,-0.199675,0.819016
2013-01-03,1.633209,-0.473571,0.466677,0.478959
2013-01-04,-0.861636,-0.049129,0.350207,-1.452074
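
The two slices behave differently at the end point: an integer slice stops before the end position, while a label slice includes the end label. A minimal sketch, assuming a frame built like `df` (random values will differ; `by_position` and `by_label` are hypothetical names):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# Integer slice: positions 0, 1, 2 (position 3 is excluded).
by_position = df[0:3]
print(by_position.index[-1])

# Label slice: 2013-01-02 through 2013-01-04, both endpoints included.
by_label = df['20130102':'20130104']
print(by_label.index[-1])
```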


### *** Selection by Label**  
[- See: Selection by label](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)

In [None]:
# Get a cross section using a label
df.loc[dates[0]]

A   -0.131805
B    0.824105
C    0.793722
D   -0.464042
Name: 2013-01-01 00:00:00, dtype: float64

In [None]:
# Select on multiple axes by label
df.loc[:, ['A', 'B']]

Unnamed: 0,A,B
2013-01-01,-0.131805,0.824105
2013-01-02,0.136948,1.036089
2013-01-03,1.633209,-0.473571
2013-01-04,-0.861636,-0.049129
2013-01-05,2.248059,-0.758196
2013-01-06,-0.743977,-0.786985


In [None]:
# Label slicing; both endpoints are included
df.loc['20130102':'20130104', ['A', 'B']]

Unnamed: 0,A,B
2013-01-02,0.136948,1.036089
2013-01-03,1.633209,-0.473571
2013-01-04,-0.861636,-0.049129


In [None]:
# Reduce the dimensions of the returned object
df.loc['20130102', ['A', 'B']]

A    0.136948
B    1.036089
Name: 2013-01-02 00:00:00, dtype: float64

In [None]:
# Get a scalar value
df.loc[dates[0], 'A']

-0.13180529275016625

In [None]:
# cf. A faster way to get a scalar value
df.at[dates[0], 'A']
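
Both accessors return the same scalar; `.at` is limited to a single row label and a single column label, which is what lets it skip the more general lookup logic of `.loc`. A short sketch, assuming a frame built like `df` (`via_loc` and `via_at` are hypothetical names):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# .loc accepts labels, lists, and slices; here it is given two scalar labels.
via_loc = df.loc[dates[0], 'A']

# .at accepts exactly one row label and one column label.
via_at = df.at[dates[0], 'A']

print(via_loc == via_at)
```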

### *** Selection by Position**  
[- See: Selection by position](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)

In [None]:
# Select via the position of the passed integer
df.iloc[3]

In [None]:
# By integer slices, which act like NumPy / Python slices
df.iloc[3:5, 0:2]

In [None]:
# By lists of integer positions, similar to the NumPy / Python style
df.iloc[[1, 2, 4], [0, 2]]

In [None]:
# For slicing rows explicitly
df.iloc[1:3, :]
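
To relate positions back to labels, the sketch below checks that each `.iloc` call picks the same data as the corresponding `.loc` call. It assumes a frame built like `df` (random values will differ), and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# Position 3 is the fourth row, i.e. the row labeled dates[3].
row_by_pos = df.iloc[3]
row_by_label = df.loc[dates[3]]
print(np.array_equal(row_by_pos.to_numpy(), row_by_label.to_numpy()))

# Positions [1, 2, 4] x [0, 2] correspond to those dates and columns 'A' and 'C'.
picked = df.iloc[[1, 2, 4], [0, 2]]
print(picked.columns.tolist())

# An integer slice excludes the end position: rows 1 and 2 only.
rows = df.iloc[1:3, :]
print(rows.shape)
```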