# Finding Outliers with k-Nearest Neighbors
## -> Using k-nearest neighbors to identify the countries whose attributes appear most anomalous
---
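- It can be useful to identify outliers without making any assumptions about the relationships between variables.
- Unsupervised machine learning tools help identify observations that differ from the others in data that has no target (dependent) variable.
    - k-nearest neighbors (KNN) can be used for this.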

### 1. Prepare the libraries
- PyOD (Python Outlier Detection): a Python package for outlier detection
    - A toolkit that collects a range of supervised and unsupervised outlier detection techniques
    - https://pyod.readthedocs.io/en/latest/
- scikit-learn

In [5]:
pip install pyod

Successfully installed pyod-1.1.2
Note: you may need to restart the kernel to use updated packages.


In [6]:
import pandas as pd
from pyod.models.knn import KNN
from sklearn.preprocessing import StandardScaler

# Load the country-level COVID totals and index them by ISO country code
covidtotals = pd.read_csv('./data/covidtotals.csv')
covidtotals.set_index('iso_code', inplace=True)

In [8]:
covidtotals.columns

Index(['lastdate', 'location', 'total_cases', 'total_deaths', 'total_cases_pm',
       'total_deaths_pm', 'population', 'pop_density', 'median_age',
       'gdp_per_capita', 'hosp_beds'],
      dtype='object')

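- Standardize the data
    - Many machine learning tools, distance-based ones such as KNN in particular, need standardized data to work properly; otherwise the variables with the largest scales dominate the distances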

In [11]:
standardizer = StandardScaler()
analysisvars = ['location', 'total_cases_pm', 'total_deaths_pm', 'pop_density', 'median_age', 'gdp_per_capita']

covidanalysis = covidtotals.loc[:, analysisvars].dropna()
covidanalysisstand = standardizer.fit_transform(covidanalysis.iloc[:, 1:])
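
To show what the standardization step does to each column, here is a minimal sketch in plain NumPy. The toy array is hypothetical (it is not the covidtotals data), and the formula assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np

# Hypothetical toy data: two columns on very different scales
# (think cases per million vs. median age). Not the covidtotals data.
X = np.array([
    [5000.0, 30.0],
    [20000.0, 32.0],
    [300.0, 45.0],
    [1200.0, 19.0],
])

# z-score each column: subtract the column mean, then divide by the
# column standard deviation (population form, ddof=0).
X_stand = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_stand.mean(axis=0).round(6))  # each column mean is ~0
print(X_stand.std(axis=0).round(6))   # each column std is ~1
```

After this step a difference of one unit means "one standard deviation" in every column, so no single variable dominates the distance calculation.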

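### 2. Run the KNN model and generate anomaly scores
- Set the contamination parameter to 0.1 so that about 10% of the observations, those with the highest anomaly scores, are labeled as outliers; the value 0.1 itself is an arbitrary choice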

In [12]:
clf_name = 'KNN'
clf = KNN(contamination = 0.1)
clf.fit(covidanalysisstand)

KNN(algorithm='auto', contamination=0.1, leaf_size=30, method='largest',
  metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2,
  radius=1.0)
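
The settings printed above (n_neighbors=5, method='largest') suggest that each observation is scored by the distance to its fifth nearest neighbor. The following is a rough sketch of that idea in plain NumPy, not PyOD's actual implementation. The function name kth_neighbor_scores and the synthetic data are hypothetical, and the sketch assumes Euclidean distance with each point excluded from its own neighbor list.

```python
import numpy as np

def kth_neighbor_scores(X, k=5):
    """Score each row by the distance to its k-th nearest other row."""
    # Pairwise Euclidean distances between all rows.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Exclude each point's zero distance to itself.
    np.fill_diagonal(dists, np.inf)
    # After sorting each row, column k-1 holds the k-th nearest neighbor.
    return np.sort(dists, axis=1)[:, k - 1]

rng = np.random.default_rng(0)
# Hypothetical data: 50 points in a tight cluster plus one far-away point.
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), [[10.0, 10.0]]])

scores = kth_neighbor_scores(X, k=5)
print(scores.argmax())  # index of the most isolated point
```

A point that sits far from every cluster has a large distance even to its nearest neighbors, which is why it ends up with the highest score.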

In [13]:
y_pred = clf.labels_

In [14]:
y_scores = clf.decision_scores_
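
The labels and the scores are linked through the contamination setting. The sketch below assumes the cut-off is the (1 - contamination) percentile of the scores, which is how I understand PyOD derives its threshold; check this against the PyOD documentation before relying on it. The scores here are randomly generated for illustration and are not the notebook's output.

```python
import numpy as np

# Hypothetical anomaly scores for 20 observations (not the notebook's output).
rng = np.random.default_rng(42)
scores = rng.exponential(scale=1.0, size=20)

contamination = 0.1
# Cut-off at the (1 - contamination) percentile of the scores.
threshold = np.percentile(scores, 100 * (1 - contamination))
# 1 marks an outlier, 0 an inlier.
labels = (scores > threshold).astype(int)

print(labels.sum(), "of", len(labels), "flagged")
```

Because the threshold is a percentile of the scores themselves, the number of flagged observations is fixed by contamination and says nothing on its own about how extreme those observations are; the scores carry that information.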

- Summarize the model predictions
    - Inlier (outlier = 0), outlier (outlier = 1)

In [19]:
pred = pd.DataFrame(zip(y_pred, y_scores), columns = ['outlier', 'scores'], index = covidanalysis.index)
pred.head()

| iso_code | outlier | scores |
|---|---|---|
| AFG | 0 | 0.163522 |
| ALB | 0 | 0.435589 |
| DZA | 0 | 0.263764 |
| AGO | 0 | 0.212935 |
| ATG | 0 | 0.502185 |


In [20]:
pred.groupby(['outlier'])[['scores']].agg(['min', 'median', 'max'])

| outlier | scores (min) | scores (median) | scores (max) |
|---|---|---|---|
| 0 | 0.09952 | 0.375671 | 1.526514 |
| 1 | 1.551284 | 2.100912 | 11.976713 |


- Display the data for the outliers, sorted from the highest anomaly score down

In [23]:
covidanalysis.join(pred).loc[
    pred.outlier == 1,
    ['location', 'total_cases_pm', 'total_deaths_pm', 'scores']
].sort_values('scores', ascending=False)

| iso_code | location | total_cases_pm | total_deaths_pm | scores |
|---|---|---|---|---|
| SGP | Singapore | 5962.727 | 3.931 | 11.976713 |
| QAT | Qatar | 19753.146 | 13.19 | 7.994635 |
| BEL | Belgium | 5037.354 | 816.852 | 3.545274 |
| BHR | Bahrain | 6698.468 | 11.166 | 3.28442 |
| LUX | Luxembourg | 6418.776 | 175.726 | 2.461565 |
| ESP | Spain | 5120.952 | 580.197 | 2.190121 |
| KWT | Kuwait | 6332.42 | 49.642 | 2.126949 |
| GBR | United Kingdom | 4047.403 | 566.965 | 2.112294 |
| ITA | Italy | 3853.985 | 552.663 | 2.107751 |
| MDV | Maldives | 3280.041 | 9.25 | 2.094073 |
