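##### Data preprocessing
  - Process the training data before feeding it to a machine learning model
  - Use visualization tools to understand what the actual data looks like
  - Basic machine learning formulation: y = f(x)

  - Data quality problems
    - Features whose value ranges differ greatly
    - This affects training
      - Some machine learning algorithms give more weight to features with larger values
      - Rescale the features (map each feature's min and max to the 0~1 range, or standardize it to zero mean and unit variance)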
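
The two rescaling options mentioned above can be sketched as follows. This is a minimal illustration on a made-up DataFrame (`df_scale` and its values are assumptions, not data from this notebook), using plain pandas arithmetic rather than any particular scaler class.

```python
import pandas as pd

# Hypothetical toy data: two features on very different scales
df_scale = pd.DataFrame({'age': [20, 30, 40, 50],
                         'income': [20000, 40000, 60000, 100000]})

# Min-max scaling: map each column to the [0, 1] range
min_max = (df_scale - df_scale.min()) / (df_scale.max() - df_scale.min())

# Standardization: zero mean and unit variance per column
# (ddof=0 uses the population standard deviation)
standardized = (df_scale - df_scale.mean()) / df_scale.std(ddof=0)

print(min_max)
print(standardized)
```

After either transformation both columns live on a comparable scale, so the larger raw values of `income` no longer dominate.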

##### Missing values
  - Data that actually exists but was not recorded
  - Choose a handling strategy based on domain knowledge

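##### Outliers
  - Outlier: an extremely large or small value
  - Not the same as an ordinary difference in data distribution
  - Caused by data entry errors or by genuinely unusual events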
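
One common way to flag extreme values is the 1.5 × IQR rule. The notes above do not name a specific detection method, so the rule, the `values` series and the 1.5 multiplier below are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical measurements; 95 mimics a data entry error
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] is flagged as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers)
```

Whether a flagged value is a typo or a real but unusual event still has to be judged with domain knowledge.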

##### Data handling strategies
  - Drop: delete the affected rows or columns
  - Fill: mean, mode, median, moving average, etc.
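
The fill options listed above could look like this on a single column. The series `s` is a made-up example, and the window size of 3 for the moving average is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries
s = pd.Series([1.0, np.nan, 3.0, 3.0, np.nan, 8.0])

filled_mean = s.fillna(s.mean())        # mean of the observed values
filled_median = s.fillna(s.median())    # median of the observed values
filled_mode = s.fillna(s.mode()[0])     # most frequent observed value

# Moving average: mean over a trailing window of 3 rows,
# computed from whatever non-missing values the window contains
rolling_mean = s.rolling(window=3, min_periods=1).mean()
filled_rolling = s.fillna(rolling_mean)

print(filled_rolling)
```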

In [25]:
import pandas as pd
import numpy as np

raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'],
            'age': [42, np.nan, 36, 24, 73],
            'sex': ['m', np.nan, 'f', 'm', 'f'],
            'preTestScore': [4, np.nan, np.nan, 2, 3],
            'postTestScore': [25, np.nan, np.nan, 62, 70]}

df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
1,,,,,,
2,Tina,Ali,36.0,f,,
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   first_name     4 non-null      object 
 1   last_name      4 non-null      object 
 2   age            4 non-null      float64
 3   sex            4 non-null      object 
 4   preTestScore   3 non-null      float64
 5   postTestScore  3 non-null      float64
dtypes: float64(3), object(3)
memory usage: 368.0+ bytes


In [27]:
df.isnull().mean()

first_name       0.2
last_name        0.2
age              0.2
sex              0.2
preTestScore     0.4
postTestScore    0.4
dtype: float64

##### Drop

In [28]:
df_no_missing =  df.dropna()
df_no_missing

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


In [29]:
df_cleaned = df.dropna(how='all')
df_cleaned

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
2,Tina,Ali,36.0,f,,
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


In [30]:
df['location'] = np.nan
df.dropna(how='all', axis=1)

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore
0,Jason,Miller,42.0,m,4.0,25.0
1,,,,,,
2,Tina,Ali,36.0,f,,
3,Jake,Milner,24.0,m,2.0,62.0
4,Amy,Cooze,73.0,f,3.0,70.0


In [31]:
df.dropna(thresh=1)  # keep rows that have at least one non-null value

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
2,Tina,Ali,36.0,f,,,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,


In [32]:
df.dropna(thresh=5)  # keep rows that have at least five non-null values

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,
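
To see what `thresh` counts, it can help to compute the number of non-null values per row directly. The sketch below uses a small made-up frame (`demo` is an assumption, not the notebook's `df`) and checks that `dropna(thresh=2)` matches a manual filter.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 has 2 values, row 1 has none, row 2 has 3
demo = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                     'b': [np.nan, np.nan, 6.0],
                     'c': ['x', np.nan, 'z']})

# Number of non-null values in each row
non_null_counts = demo.notnull().sum(axis=1)

# thresh=2 keeps rows with at least 2 non-null values
kept_by_thresh = demo.dropna(thresh=2)
kept_by_mask = demo[non_null_counts >= 2]

print(non_null_counts)
print(kept_by_thresh)
```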


##### Fill
  - Replace missing values with a substitute value
  - Use fillna

In [33]:
df.fillna(0)

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,0.0
1,0,0,0.0,0,0.0,0.0,0.0
2,Tina,Ali,36.0,f,0.0,0.0,0.0
3,Jake,Milner,24.0,m,2.0,62.0,0.0
4,Amy,Cooze,73.0,f,3.0,70.0,0.0


In [34]:
# Fill missing values with the mean: compute the column's mean and fill only that column
df['preTestScore'] = df['preTestScore'].fillna(df['preTestScore'].mean())
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,25.0,
1,,,,,3.0,,
2,Tina,Ali,36.0,f,3.0,,
3,Jake,Milner,24.0,m,2.0,62.0,
4,Amy,Cooze,73.0,f,3.0,70.0,


In [40]:
df['postTestScore'].median()

62.0

In [36]:
df['postTestScore'].fillna(df['postTestScore'].median())

0    25.0
1    62.0
2    62.0
3    62.0
4    70.0
Name: postTestScore, dtype: float64

In [41]:
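# Replace every postTestScore with the mean of its sex group
# (existing values are overwritten too, not only the missing ones)
df['postTestScore'] = df.groupby('sex')['postTestScore'].transform('mean')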

In [42]:
df

Unnamed: 0,first_name,last_name,age,sex,preTestScore,postTestScore,location
0,Jason,Miller,42.0,m,4.0,43.5,
1,,,,,3.0,,
2,Tina,Ali,36.0,f,3.0,70.0,
3,Jake,Milner,24.0,m,2.0,43.5,
4,Amy,Cooze,73.0,f,3.0,70.0,
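
If the intent is to fill only the missing scores with the group mean and keep the observed ones, one possible variant is to pass the transformed series to `fillna`. This is a sketch on a reduced copy of the data (`scores` is a name introduced here), not necessarily what the original cell meant to do.

```python
import numpy as np
import pandas as pd

# Reduced copy of the example data: only the columns needed here
scores = pd.DataFrame({'sex': ['m', np.nan, 'f', 'm', 'f'],
                       'postTestScore': [25, np.nan, np.nan, 62, 70]})

# Per-row mean of the row's sex group (NaN where sex itself is missing)
group_mean = scores.groupby('sex')['postTestScore'].transform('mean')

# Fill only the missing scores; observed scores stay untouched
scores['postTestScore'] = scores['postTestScore'].fillna(group_mean)

print(scores)
```

Row 1 stays missing because its `sex` is also missing, so it belongs to no group.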


##### Preprocessing categorical data
  - The get_dummies function provided by pandas
  - OneHotEncoder and LabelEncoder provided by scikit-learn
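
The cells in this section use `get_dummies`; for comparison, a minimal sketch of the two scikit-learn encoders on the same kind of color column might look like this. It assumes scikit-learn is installed, and `colors` is a name introduced for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = pd.DataFrame({'color': ['red', 'blue', 'blue']})

# LabelEncoder maps each category to an integer code;
# classes_ holds the categories in sorted order
label_encoder = LabelEncoder()
color_codes = label_encoder.fit_transform(colors['color'])

# OneHotEncoder expects 2-D input and returns a sparse matrix by default;
# toarray() converts it to a dense NumPy array
onehot_encoder = OneHotEncoder()
onehot = onehot_encoder.fit_transform(colors[['color']]).toarray()

print(label_encoder.classes_, color_codes)
print(onehot_encoder.categories_[0], onehot)
```

Unlike `get_dummies`, a fitted encoder remembers its categories, so the same mapping can later be applied to new data with `transform`.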

In [43]:
edges = pd.DataFrame({'source': [0, 1, 2], 'target': [2, 2, 3],
                      'weight': [3, 4, 5], 'color': ['red', 'blue', 'blue']})
edges

Unnamed: 0,source,target,weight,color
0,0,2,3,red
1,1,2,4,blue
2,2,3,5,blue


In [44]:
edges.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   source  3 non-null      int64 
 1   target  3 non-null      int64 
 2   weight  3 non-null      int64 
 3   color   3 non-null      object
dtypes: int64(3), object(1)
memory usage: 224.0+ bytes


In [45]:
pd.get_dummies(edges)

Unnamed: 0,source,target,weight,color_blue,color_red
0,0,2,3,0,1
1,1,2,4,1,0
2,2,3,5,1,0


In [46]:
pd.get_dummies(edges['color'])

Unnamed: 0,blue,red
0,0,1
1,1,0
2,1,0


In [47]:
pd.get_dummies(edges[['color']])

Unnamed: 0,color_blue,color_red
0,0,1
1,1,0
2,1,0


  - Drawbacks
    - The number of columns (features, attributes) increases.
    - There is more for the model to learn.
    - Resource usage increases and training slows down.
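
The column growth is easy to observe: one categorical column becomes as many columns as it has distinct values. The `cities` frame below is a made-up example.

```python
import pandas as pd

# Hypothetical column with 5 distinct values in 6 rows
cities = pd.DataFrame({'city': ['Seoul', 'Busan', 'Incheon',
                                'Daegu', 'Seoul', 'Daejeon']})

# dtype=int gives 0/1 columns regardless of the pandas version's default
encoded = pd.get_dummies(cities, dtype=int)

print(cities.shape[1], '->', encoded.shape[1])
```

With a high-cardinality column (for example thousands of distinct IDs) the encoded frame grows accordingly.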

##### Converting numeric data to categorical data
  - Map the numbers to category labels, then apply one-hot encoding.

In [49]:
# 3: "M"  4:"L"  5:"XL"
weight_dic = {3 : "M", 4:"L", 5:"XL"}
edges['weight_sign'] = edges['weight'].map(weight_dic)
edges

Unnamed: 0,source,target,weight,color,weight_sign
0,0,2,3,red,M
1,1,2,4,blue,L
2,2,3,5,blue,XL


In [50]:
edges = pd.get_dummies(edges)
edges

Unnamed: 0,source,target,weight,color_blue,color_red,weight_sign_L,weight_sign_M,weight_sign_XL
0,0,2,3,0,1,0,1,0
1,1,2,4,1,0,1,0,0
2,2,3,5,1,0,0,0,1


##### Converting to categorical data: binning
  - Binning: converting continuous data into categorical data

In [51]:
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}

df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df

Unnamed: 0,regiment,company,name,preTestScore,postTestScore
0,Nighthawks,1st,Miller,4,25
1,Nighthawks,1st,Jacobson,24,94
2,Nighthawks,2nd,Ali,31,57
3,Nighthawks,2nd,Milner,2,62
4,Dragoons,1st,Cooze,3,70
5,Dragoons,1st,Jacon,4,25
6,Dragoons,2nd,Ryaner,24,94
7,Dragoons,2nd,Sone,31,57
8,Scouts,1st,Sloan,2,62
9,Scouts,1st,Piger,3,70
10,Scouts,2nd,Riani,2,62
11,Scouts,2nd,Ali,3,70


In [54]:
df['postTestScore'].min(), df['postTestScore'].max()
# The scores lie between 0 and 100, so split them into bins of width 25
# pd.cut takes the bin edges (bins) and the bin names (labels)
bins = [0,25,50,75,100]
bins_labels = ['Low','Okay','Good','Great']
categories = pd.cut(df['postTestScore'], bins=bins, labels=bins_labels)
categories

0       Low
1     Great
2      Good
3      Good
4      Good
5       Low
6     Great
7      Good
8      Good
9      Good
10     Good
11     Good
Name: postTestScore, dtype: category
Categories (4, object): ['Low' < 'Okay' < 'Good' < 'Great']

In [None]:
# One-hot encode categories and merge the result into the original DataFrame
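
The exercise in the empty cell above could be approached as follows. This is one possible sketch on a reduced copy of the data; the `scores` frame and the `postTest` column prefix are assumptions made for the example.

```python
import pandas as pd

# Reduced copy of the example data
scores = pd.DataFrame({'name': ['Miller', 'Jacobson', 'Ali', 'Milner'],
                       'postTestScore': [25, 94, 57, 62]})

bins = [0, 25, 50, 75, 100]
bins_labels = ['Low', 'Okay', 'Good', 'Great']
categories = pd.cut(scores['postTestScore'], bins=bins, labels=bins_labels)

# One column per category; unused categories (here 'Okay') are all zeros
dummies = pd.get_dummies(categories, prefix='postTest', dtype=int)

# Attach the dummy columns to the original frame, aligned on the index
result = pd.concat([scores, dummies], axis=1)

print(result)
```

`pd.concat` with `axis=1` aligns on the index, so this works as long as `categories` was derived from the same frame without reordering.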