<img src=https://scikit-learn.org/stable/_static/ml_map.png>

# Table of Contents
1. Data Load
2. EDA (Exploratory Data Analysis)
    - Check statistical data distributions
    - Chart visualization
3. Feature Engineering: preprocessing and engineering
    - Type conversion (dates, categorical conversion (ABC -> 123))
    - Missing-value handling (deletion: dropna, imputation: fillna, model-based)
    - Binning (categorization: cut, qcut)
    - Encoding (label encoding, one-hot encoding, dummies)
    - Normalization (scaling: MinMaxScaler, StandardScaler, RobustScaler, log)
    - Outliers
4. Model Selection (pycaret)
    - Regression
    - Classification
    - Clustering (Clustering, PCA)
5. Model training and prediction (train_test_split & fit & predict)
6. Model validation and evaluation (Validation & Evaluation metrics)
7. Hyperparameter tuning (Hyper-parameter optimization)
8. Model saving and deployment (Model Save & Deployment)

In [6]:
import re
import numpy as np
import pandas as pd

from datetime import date, datetime, time, timedelta
from dateutil.relativedelta import relativedelta

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.rcParams['font.family']= 'Malgun Gothic'
plt.rcParams['axes.unicode_minus'] = False
# plt.rcParams['figure.figsize'] = [6.4, 4.8]

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.ensemble import AdaBoostRegressor, VotingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

import warnings
warnings.filterwarnings(action='ignore')

# Missing Values

In [293]:
df = pd.DataFrame({"name":['kim',np.nan,None,"allen","king"],"score": ["A",np.nan,"B",np.nan,"C"]})
df

Unnamed: 0,name,score
0,kim,A
1,,
2,,B
3,allen,
4,king,C


In [277]:
np.nan == np.nan

False

In [292]:
np.nan in [np.nan, np.nan]

True

In [291]:
type(np.nan)

float

In [269]:
None == None

True

In [139]:
None is None

True

In [275]:
np.nan is np.nan

True

In [141]:
None is np.nan

False
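
The comparisons above can look contradictory: `np.nan == np.nan` is `False`, yet `np.nan in [np.nan, np.nan]` is `True`. The likely explanation is that list membership checks identity (`is`) before equality, so the same `np.nan` object is found even though NaN never equals itself. The sketch below illustrates this and shows the checks that are usually safer; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Equality between NaNs is always False (IEEE 754 behaviour).
eq_result = (nan == nan)

# `in` on a list tests identity first, so the very same np.nan object is found.
in_result = nan in [nan]

# A NaN created separately is a different object, so `in` no longer finds it.
other_nan = float("nan")
in_other = other_nan in [nan]

# More reliable checks: pd.isna handles both NaN and None,
# while np.isnan only accepts numeric input.
checks = [pd.isna(nan), pd.isna(None), np.isnan(nan)]
```

In practice this means `==` and `in` should not be used to detect missing values; `pd.isna` (or `df.isna()`) is the safer choice.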

## Detection
```
df.isna()
df.isnull()
df.notna()
df.notnull()
```

In [142]:
df.isna()

Unnamed: 0,name,score
0,False,False
1,True,True
2,True,False
3,False,True
4,False,False


In [143]:
# Select rows that contain at least one missing value (each row listed once)
df[df.isna().any(axis=1)]

Unnamed: 0,name,score
1,,
2,,B
3,allen,


In [144]:
df.isna().sum()[df.isna().sum()>0].index.values

array(['name', 'score'], dtype=object)
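
As a complement to the cells above, a small summary of missing values per column can be handy during EDA. This is a minimal sketch that rebuilds the same example frame; the `summary` column names `count` and `ratio` are my own choice, not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["kim", np.nan, None, "allen", "king"],
    "score": ["A", np.nan, "B", np.nan, "C"],
})

# Number of missing values per column.
missing_count = df.isna().sum()

# Share of missing values per column (the mean of a boolean column is its True ratio).
missing_ratio = df.isna().mean()

# Combine both into one summary table.
summary = pd.DataFrame({"count": missing_count, "ratio": missing_ratio})

# Names of the columns that contain at least one missing value.
cols_with_missing = df.columns[df.isna().any()].tolist()
```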

## Deletion
```
DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False)
    how{‘any’, ‘all’}, default ‘any’
```

In [296]:
df = pd.DataFrame({"name":['kim',np.nan,None,"allen","king"],"score": ["A",np.nan,"B",np.nan,"C"]})
df

Unnamed: 0,name,score
0,kim,A
1,,
2,,B
3,allen,
4,king,C


In [298]:
df.dropna(axis=0)

Unnamed: 0,name,score
0,kim,A
4,king,C


In [299]:
df.dropna(axis=1)

0
1
2
3
4


In [294]:
df.dropna(axis=0, how='all')

Unnamed: 0,name,score
0,kim,A
2,,B
3,allen,
4,king,C
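
The `dropna` signature above also lists `subset` and `thresh`, which the cells do not exercise. The following sketch, on the same example frame, shows how they might be used; the result variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["kim", np.nan, None, "allen", "king"],
    "score": ["A", np.nan, "B", np.nan, "C"],
})

# Drop rows only when `name` is missing; `score` may still be NaN.
by_subset = df.dropna(subset=["name"])

# Keep rows that have at least 1 non-missing value
# (for this frame, the same rows as how='all').
by_thresh_1 = df.dropna(thresh=1)

# Keep rows that have at least 2 non-missing values
# (for this two-column frame, the same rows as how='any').
by_thresh_2 = df.dropna(thresh=2)
```

Note that `how` and `thresh` are alternatives; recent pandas versions raise an error if both are passed in the same call.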


## Imputation
```
DataFrame.fillna(value=None, *, method=None, axis=None, inplace=False, limit=None, downcast=None)
    method{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
```

In [162]:
df.fillna('■●★▲♠♥')

Unnamed: 0,name,score
0,kim,A
1,■●★▲♠♥,■●★▲♠♥
2,■●★▲♠♥,B
3,allen,■●★▲♠♥
4,king,C


In [155]:
df.fillna(method='bfill')

Unnamed: 0,name,score
0,kim,A
1,allen,B
2,allen,B
3,allen,C
4,king,C


In [149]:
df.fillna(method='ffill')

Unnamed: 0,name,score
0,kim,A
1,kim,A
2,kim,B
3,allen,B
4,king,C
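
In recent pandas releases (2.1 and later, as far as I am aware), `fillna(method=...)` is deprecated in favour of the dedicated `ffill()` and `bfill()` methods. The sketch below reproduces the two fills above with those methods and adds `limit`, which caps how many consecutive missing values are filled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["kim", np.nan, None, "allen", "king"],
    "score": ["A", np.nan, "B", np.nan, "C"],
})

# Forward fill: propagate the last valid value downwards.
forward = df.ffill()

# Backward fill: propagate the next valid value upwards.
backward = df.bfill()

# Fill at most one consecutive missing value per gap;
# the second NaN in `name` (row 2) stays missing.
limited = df.ffill(limit=1)
```

Both fills depend on row order, so they usually make sense only for ordered data such as time series.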


In [302]:
df = pd.DataFrame({"name":["smith",np.nan,"jones","allen","king"],"score": ["A","B",'B',"A","A"],"sal": [1000,1000,3000,np.nan,4000]})
# Fill values used in the next cell:
# name -> 'AAA', score -> 'F', sal -> 9999
df

Unnamed: 0,name,score,sal
0,smith,A,1000.0
1,,B,1000.0
2,jones,B,3000.0
3,allen,A,
4,king,A,4000.0


In [303]:
df.fillna({'name':'AAA', 'score':'F', 'sal':9999})

Unnamed: 0,name,score,sal
0,smith,A,1000.0
1,AAA,B,1000.0
2,jones,B,3000.0
3,allen,A,9999.0
4,king,A,4000.0


In [304]:
df['sal'].mean(), df['sal'].median(), df['sal'].mode().values[0]

(2250.0, 2000.0, 1000.0)

In [264]:
# Fill missing sal values with the mean of sal
dfcp = df.copy()
dfcp['sal'] = dfcp['sal'].fillna(dfcp['sal'].mean())
dfcp

Unnamed: 0,name,score,sal
0,smith,A,1000.0
1,,B,1000.0
2,jones,B,3000.0
3,allen,A,2250.0
4,king,A,4000.0


In [267]:
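# Group by score and fill each group's missing sal with that group's mean
# (uses df, which still contains the NaN, so the output below is reproducible)
df.groupby('score')['sal'].transform(lambda g: g.fillna(g.mean()))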

0    1000.0
1    1000.0
2    3000.0
3    2500.0
4    4000.0
Name: sal, dtype: float64

In [268]:
# Fill missing sal values with the mean sal of each score group
dfcp = df.copy()
mean_sal_by_score = dfcp.groupby('score')['sal'].transform('mean')
dfcp['sal'].fillna(  mean_sal_by_score  )

0    1000.0
1    1000.0
2    3000.0
3    2500.0
4    4000.0
Name: sal, dtype: float64
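
The two cells above return a filled Series but do not assign it back, so `dfcp` itself still contains the NaN. A minimal sketch of persisting the result follows. It uses the group median rather than the mean (a swapped-in choice, often preferred when salaries are skewed), and the `"unknown"` placeholder for `name` is a hypothetical value chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["smith", np.nan, "jones", "allen", "king"],
    "score": ["A", "B", "B", "A", "A"],
    "sal": [1000, 1000, 3000, np.nan, 4000],
})

dfcp = df.copy()

# Numeric column: per-group median, aligned to the original rows by transform.
group_median = dfcp.groupby("score")["sal"].transform("median")

# Assign the result back so the change is kept in the DataFrame.
dfcp["sal"] = dfcp["sal"].fillna(group_median)

# Text column: fill with a placeholder label (hypothetical value).
dfcp["name"] = dfcp["name"].fillna("unknown")

# Confirm that no missing values remain.
remaining = int(dfcp.isna().sum().sum())
```

For this data the group-A median and mean coincide (both 2500.0), so the filled value matches the outputs above.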

# Encoding

# Binning

# Normalization