## Preprocessing
- Scope: structured (tabular) data; how to handle missing values and outliers

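### Missing values
- Representations: Null, NaN, None, "", NaT
- How to check: df.isnull().sum(), df.describe()
- How to handle
    + Remove: df.dropna(subset=[...], axis=0) (axis=0 drops rows, axis=1 drops columns)
    + Impute: df.fillna() or df.apply(), using a representative value or a value predicted by a machine-learning model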
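
One caveat on the representations listed above: pandas counts `None`, `NaN`, and `NaT` as missing, but `isnull()` treats an empty string `""` as an ordinary value. The following is a minimal sketch, using a hypothetical Series named `raw` that is not part of the notebook, of how empty strings could be converted to `NaN` before counting:

```python
import numpy as np
import pandas as pd

# Hypothetical sample (not from the notebook): real values mixed with
# several "missing" markers, including an empty string
raw = pd.Series(["A", "", None, np.nan, "E"], dtype=object)

# isnull() flags None and NaN, but not the empty string
n_missing_raw = raw.isnull().sum()

# Convert empty strings to NaN so they are counted as missing too
cleaned = raw.replace("", np.nan)
n_missing_cleaned = cleaned.isnull().sum()

print(n_missing_raw, n_missing_cleaned)
```

Whether an empty string should really count as missing depends on the dataset, so this conversion is a judgment call rather than a default step.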

In [4]:
import pandas as pd
import numpy as np

In [14]:
# Create the dataset (column names: 수치형 = numeric, 범주형 = categorical, 날짜 = date)
data = {
    "수치형": [1, None, 3, np.nan, 5],
    "범주형": [None, 'B', 'C', np.nan, 'E'],
    "날짜": pd.date_range(start='2021-01-01', periods=4).insert(3, pd.NaT)  # NaT marks a missing datetime
}

df_first = pd.DataFrame(data)
df_first

| | 수치형 | 범주형 | 날짜 |
|---|---|---|---|
| 0 | 1.0 | None | 2021-01-01 |
| 1 | NaN | B | 2021-01-02 |
| 2 | 3.0 | C | 2021-01-03 |
| 3 | NaN | NaN | NaT |
| 4 | 5.0 | E | 2021-01-04 |


#### Checking for missing values

In [15]:
df_first.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   수치형     3 non-null      float64       
 1   범주형     3 non-null      object        
 2   날짜      4 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 252.0+ bytes


In [17]:
df_first.isnull().sum()

수치형    2
범주형    2
날짜     1
dtype: int64
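
Beyond raw counts, the share of missing values per column and the affected rows are often useful when deciding between removal and imputation. A minimal sketch, assuming a frame `df` that mirrors two columns of `df_first` (the names `missing_ratio` and `rows_with_missing` are illustrative):

```python
import numpy as np
import pandas as pd

# Rebuild two columns of the notebook's frame so the block is self-contained
df = pd.DataFrame({
    "수치형": [1, None, 3, np.nan, 5],          # numeric
    "범주형": [None, "B", "C", np.nan, "E"],    # categorical
})

# Fraction of missing values per column (mean of the boolean mask)
missing_ratio = df.isnull().mean()

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_ratio)
print(rows_with_missing)
```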

### Handling missing values

In [19]:
df_first.dropna()

| | 수치형 | 범주형 | 날짜 |
|---|---|---|---|
| 2 | 3.0 | C | 2021-01-03 |
| 4 | 5.0 | E | 2021-01-04 |


In [20]:
df_first.dropna(subset=['범주형'])


| | 수치형 | 범주형 | 날짜 |
|---|---|---|---|
| 1 | NaN | B | 2021-01-02 |
| 2 | 3.0 | C | 2021-01-03 |
| 4 | 5.0 | E | 2021-01-04 |
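
`dropna()` has a few other options besides `subset`. The sketch below, assuming a frame `df` equivalent to `df_first`, illustrates `how`, `thresh`, and `axis`; the result variable names are illustrative only:

```python
import numpy as np
import pandas as pd

# Frame equivalent to the notebook's df_first
df = pd.DataFrame({
    "수치형": [1, None, 3, np.nan, 5],
    "범주형": [None, "B", "C", np.nan, "E"],
    "날짜": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", None, "2021-01-04"]
    ),
})

# Drop a row only when every column is missing (here: row 3)
drop_all_missing = df.dropna(how="all")

# Keep rows that have at least 2 non-missing values
keep_two_or_more = df.dropna(thresh=2)

# Drop columns (axis=1) that contain any missing value;
# every column has one here, so no columns remain
drop_columns = df.dropna(axis=1)

print(drop_all_missing.index.tolist())
print(keep_two_or_more.index.tolist())
print(drop_columns.shape)
```

On this small frame `how="all"` and `thresh=2` happen to keep the same rows; on real data they usually differ.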


### Imputing missing values

#### Filling with a representative value
- Mean, median, minimum, maximum

In [22]:
df_first.describe()    # summarizes numeric and datetime columns only; object columns are skipped

| | 수치형 | 날짜 |
|---|---|---|
| count | 3.0 | 4 |
| mean | 3.0 | 2021-01-02 12:00:00 |
| min | 1.0 | 2021-01-01 00:00:00 |
| 25% | 2.0 | 2021-01-01 18:00:00 |
| 50% | 3.0 | 2021-01-02 12:00:00 |
| 75% | 4.0 | 2021-01-03 06:00:00 |
| max | 5.0 | 2021-01-04 00:00:00 |
| std | 2.0 | NaN |


In [23]:
df_first['수치형'].fillna(df_first['수치형'].mean())    # the mean of the non-missing values is 3.0

0    1.0
1    3.0
2    3.0
3    3.0
4    5.0
Name: 수치형, dtype: float64
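
The other representative values listed above (median, minimum, maximum) can be applied the same way. A minimal sketch, assuming a Series `s` holding the same values as the `수치형` column:

```python
import numpy as np
import pandas as pd

# Same values as the notebook's numeric column
s = pd.Series([1, None, 3, np.nan, 5], name="수치형")

# Each statistic ignores missing values by default (skipna=True)
filled_median = s.fillna(s.median())
filled_min = s.fillna(s.min())
filled_max = s.fillna(s.max())

print(filled_median.tolist())
print(filled_min.tolist())
print(filled_max.tolist())
```

For this column the mean and the median are both 3.0, so they give the same result. On skewed data the median is usually less sensitive to outliers than the mean.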

In [26]:
df_first.describe(include='object')

| | 범주형 |
|---|---|
| count | 3 |
| unique | 3 |
| top | B |
| freq | 1 |


In [28]:
df_first['범주형'].count(), df_first['범주형'].unique()

(3, array([None, 'B', 'C', nan, 'E'], dtype=object))

In [29]:
df_first['범주형'].fillna(df_first['범주형'].mode()[0])    # most frequent value; all three tie here, so the first mode 'B' is used

0    B
1    B
2    C
3    B
4    E
Name: 범주형, dtype: object
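
The cells above fill one column at a time and leave the `날짜` column untouched. As a sketch only, the per-column fills can be combined by passing a dict to `fillna()`, and the missing date can be forward-filled from the previous row. Forward fill is an assumption made here for illustration; the notebook does not say how dates should be imputed.

```python
import numpy as np
import pandas as pd

# Frame equivalent to the notebook's df_first
df = pd.DataFrame({
    "수치형": [1, None, 3, np.nan, 5],
    "범주형": [None, "B", "C", np.nan, "E"],
    "날짜": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", None, "2021-01-04"]
    ),
})

# Per-column fill values: mean for the numeric column,
# first mode for the categorical column
fill_values = {
    "수치형": df["수치형"].mean(),
    "범주형": df["범주형"].mode()[0],
}
filled = df.fillna(fill_values)

# Assumption: carry the previous date forward into the missing slot
filled["날짜"] = filled["날짜"].ffill()

print(filled)
```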