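##### Data preprocessing
  - Process the training data before feeding it to a machine learning model.
  - Use visualization tools to understand what the actual data looks like.
  - The basic formulation of machine learning is y = f(x): the model learns a function f that maps the input x to the output y.

  - Data quality problems
    - The value ranges of the features can differ widely.
    - This affects training.
      - Some machine learning algorithms give more weight to features with larger values.
      - Fix this by putting the features on a common scale: either rescale each feature to the range 0 to 1 using its minimum and maximum, or standardize it to mean 0 and standard deviation 1.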
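
The two scaling options above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the `features` frame and its values are made up, and using the population standard deviation (`ddof=0`) is an arbitrary choice.

```python
import pandas as pd

# Hypothetical features on very different scales (illustrative values only).
features = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "income": [1000.0, 2000.0, 3000.0],
})

# Min-max scaling: map each column onto the range [0, 1].
min_max_scaled = (features - features.min()) / (features.max() - features.min())

# Standardization: shift each column to mean 0 and scale it to standard deviation 1.
# ddof=0 selects the population standard deviation (an assumption made for this sketch).
standardized = (features - features.mean()) / features.std(ddof=0)

print(min_max_scaled)
print(standardized)
```

After either transformation, `age` and `income` occupy comparable ranges, so neither column dominates purely because of its units.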

##### Missing values
  - Data that exists in reality but was not recorded.
  - Choose the handling strategy based on domain knowledge.

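##### Outliers
  - Outlier: an extremely large or small value.
  - Not the same thing as an ordinary difference in data distribution.
  - Caused by a data entry error or by a genuinely unusual event.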
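
One common way to flag candidate outliers is the 1.5 × IQR rule. The notes do not name a specific detection method, so the rule, the `ages` series, and the value 400 are all assumptions chosen for illustration. Whether a flagged value is an entry error or a real but unusual observation still has to be decided with domain knowledge.

```python
import pandas as pd

# Hypothetical ages; 400 stands in for a data entry error.
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 400])

# Interquartile range: the spread of the middle 50% of the data.
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR outside the quartiles are flagged as candidates.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = ages[(ages < lower_bound) | (ages > upper_bound)]

print(outliers)
```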

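##### Data handling strategies
  - Drop: delete the rows or columns that contain missing values.
  - Fill: replace them with the mean, mode, median, moving average, etc.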
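
Mode and moving-average fills can be sketched roughly as below. The `sexes` and `scores` series are hypothetical, and the trailing window of 3 is an arbitrary choice for this example.

```python
import numpy as np
import pandas as pd

# Mode fill: replace a missing category with the most frequent value.
sexes = pd.Series(["m", np.nan, "f", "m", "f", "m"])
mode_filled = sexes.fillna(sexes.mode()[0])

# Moving-average fill: replace a missing number with the mean of the
# non-missing values inside a trailing window that ends at that position.
scores = pd.Series([10.0, 12.0, np.nan, 16.0, 18.0])
rolling_mean = scores.rolling(window=3, min_periods=1).mean()
moving_avg_filled = scores.fillna(rolling_mean)

print(mode_filled)
print(moving_avg_filled)
```

A moving average only makes sense when row order carries meaning, for example in time-ordered data.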

In [25]:
import pandas as pd
import numpy as np

raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'],
            'age': [42, np.nan, 36, 24, 73],
            'sex': ['m', np.nan, 'f', 'm', 'f'],
            'preTestScore': [4, np.nan, np.nan, 2, 3],
            'postTestScore': [25, np.nan, np.nan, 62, 70]}

df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
1,,,,,,
2,Tina,Ali,36.0,f,,
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   first_name     4 non-null      object 
 1   last_name      4 non-null      object 
 2   age            4 non-null      float64
 3   sex            4 non-null      object 
 4   preTestScore   3 non-null      float64
 5   postTestScore  3 non-null      float64
dtypes: float64(3), object(3)
memory usage: 368.0+ bytes


In [27]:
df.isnull().mean()

first_name       0.2
last_name        0.2
age              0.2
sex              0.2
preTestScore     0.4
postTestScore    0.4
dtype: float64

##### Drop

In [28]:
df_no_missing =  df.dropna()
df_no_missing

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


In [29]:
df_cleaned = df.dropna(how='all')
df_cleaned

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
2,Tina,Ali,36.0,f,,
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


In [30]:
df['location'] = np.nan
df.dropna(how='all', axis=1)

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
1,,,,,,
2,Tina,Ali,36.0,f,,
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


In [31]:
df.dropna(thresh=1)  # keep rows that have at least one non-null value

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
2,Tina,Ali,36.0,f,,,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,


In [32]:
df.dropna(thresh=5)  # keep only rows that have at least five non-null values

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,


##### Filling
  - Replace missing values with a substitute value.
  - Use fillna.
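
A single fill value rarely suits every column, since a numeric placeholder in a text column is hard to interpret. As a hedged sketch, `fillna` also accepts a dict that maps column names to fill values. The `people` frame and the `"unknown"` placeholder are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
people = pd.DataFrame({
    "first_name": ["Jason", np.nan, "Tina"],
    "age": [42, np.nan, 36],
})

# Each key is a column name and each value is the fill value for that column.
filled = people.fillna({
    "first_name": "unknown",        # placeholder label for missing names
    "age": people["age"].mean(),    # column mean for missing ages
})

print(filled)
```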

In [33]:
df.fillna(0)

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,0.0
1,0,0,0.0,0,0.0,0.0,0.0
2,Tina,Ali,36.0,f,0.0,0.0,0.0
3,Jake,Milner,24.0,m,2.0,62.0,0.0
4,Amy,Cooze,73.0,f,3.0,70.0,0.0


In [34]:
# Fill missing values with the mean: compute the mean of the column and fill only that column
df['preTestScore'] = df['preTestScore'].fillna(df['preTestScore'].mean())
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
1,,,,,3.0,,
2,Tina,Ali,36.0,f,3.0,,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,


In [40]:
df['postTestScore'].median()

62.0

In [36]:
df['postTestScore'].fillna(df['postTestScore'].median())

0    25.0
1    62.0
2    62.0
3    62.0
4    70.0
Name: postTestScore, dtype: float64

In [41]:
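# Fill only the missing values with the mean of each sex group.
# Assigning the transform result directly would overwrite the recorded scores too.
group_mean = df.groupby('sex')['postTestScore'].transform('mean')
df['postTestScore'] = df['postTestScore'].fillna(group_mean)

In [42]:
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
1,,,,,3.0,,
2,Tina,Ali,36.0,f,3.0,70.0,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,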
