【Course】<font color=#FF0000>Machine Learning (ML)</font><br>
【Instructor】[陳祥輝 (mail : HsiangHui.Chen@gmail.com)](mailto:HsiangHui.Chen@gmail.com)<br>
【facebook】[陳祥輝's Facebook page (friend requests welcome)](https://goo.gl/osivhx)<br>
【Related courses】[Course schedule for 陳祥輝 at the Soochow University extension school (東吳推廣數位資訊學苑)](https://www.ext.scu.edu.tw/courses_search.php?key=陳祥輝)<br>

In [None]:
# !pip install --upgrade seaborn

In [None]:
# -*- coding: utf-8 -*-
from platform import python_version
import os, time, glob
import pandas as pd
import numpy as np

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

print("【Date/time】{}".format(time.strftime("%Y/%m/%d %H:%M:%S")))
print("【Working directory】{}".format(os.getcwd()))
print("【Python】{}".format(python_version()))
print("【matplotlib】{}".format(mpl.__version__))
print("【seaborn】{}".format(sns.__version__))
print("【sklearn】{}".format(sklearn.__version__))

# %autosave 120

In [None]:
from matplotlib.font_manager import FontProperties  
winfont01 = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=12) 
winfont02 = FontProperties(fname=r"c:\windows\fonts\kaiu.ttf", size=12) 

# plt.rcParams['font.sans-serif'] = ['Microsoft JhengHei']  # set the font to Microsoft JhengHei
# plt.rcParams['axes.unicode_minus'] = False                # fix minus-sign rendering with CJK fonts

# macfont = FontProperties(fname="/Library/Fonts/Arial Unicode.ttf", size=10) 

### <font color='blue'>Common approaches to Anomaly Detection / Outlier Detection</font>
- Standard Deviation Method
- Interquartile Range Method
- Automatic Outlier Detection
    - DBSCAN (Density-Based Spatial Clustering of Applications with Noise) 
    - Isolation Forest

【Dataset】
- [House Price Dataset (housing.csv)](https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv)
- [House Price Dataset Description (housing.names)](https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.names)

### <font color='blue'>Standard Deviation Method</font>
##### Use this method when your data looks approximately normally distributed
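The rule can be sketched on synthetic data before applying it to a real column. This is a minimal illustration, not the course's own code: the column name `x`, the planted values, and the choice `k = 3` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 roughly normal values plus two planted extremes
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(loc=50, scale=5, size=200), [5.0, 120.0]])
df = pd.DataFrame({"x": values})

k = 3  # number of standard deviations; 2 and 3 are the usual choices
mean, std = df["x"].mean(), df["x"].std()
lower, upper = mean - k * std, mean + k * std

# Rows outside [lower, upper] are flagged; the rest are kept
outliers = df[(df["x"] < lower) | (df["x"] > upper)]
normal = df[df["x"].between(lower, upper)]  # inclusive on both ends

print(f"bounds: [{lower:.2f}, {upper:.2f}]")
print(outliers)
```

Note that the planted extremes also inflate the mean and standard deviation used to compute the bounds, which is one reason the method suits data that is already close to normal.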

In [None]:
fname = r'C:/Data/PyMLData/cars.csv'
cars = pd.read_csv(fname, sep=',', encoding='utf-8', engine='python')
print(cars.head(3))

In [None]:
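# Assuming a normal distribution, first compute the mean and standard deviation
# You choose how many standard deviations to allow: typically 2 or 3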
speed_mean = cars["speed"].mean()
speed_std = cars["speed"].std()

upper = speed_mean + speed_std * 2
lower = speed_mean - speed_std * 2

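# Option 1: boolean-mask indexing (somewhat more verbose to write)
# cars_outlier = cars[(cars["speed"] < lower) | (cars["speed"] > upper)] 
# Option 2: DataFrame.query, slightly more intuitive; the instructor prefers .query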
cars_outlier = cars.query(f"speed < {lower} or speed > {upper}")
cars_outlier

# After listing the outliers, judge for yourself whether to drop them

In [None]:
# The non-outlier ("normal") rows are worth a look too
cars_normal = cars.query(f"speed >= {lower} and speed <= {upper}")
print(cars_normal.shape)

### <font color='blue'>Interquartile Range (IQR) Method</font>
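##### Use this method when your data does not look normally distributed (which, in practice, is most of the time)
- Upper bound: $Q3 + 1.5 \times IQR$
- Lower bound: $Q1 - 1.5 \times IQR$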
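
The two bounds above can be worked through by hand on a small sample. The values below are made up for illustration, with one planted extreme (60); `n = 1.5` matches the multiplier in the formulas.

```python
import pandas as pd

# Hypothetical sample: a tight cluster plus one planted extreme value
s = pd.Series([4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 60], name="x")

q1, q3 = s.quantile(0.25), s.quantile(0.75)  # 8.5 and 13.5 with linear interpolation
iqr = q3 - q1                                # 5.0

n = 1.5
lower = q1 - n * iqr   # 8.5 - 7.5 = 1.0
upper = q3 + n * iqr   # 13.5 + 7.5 = 21.0

outliers = s[(s < lower) | (s > upper)]
print(lower, upper)
print(outliers.tolist())
```

Because quartiles barely move when a single extreme value is added, the bounds stay close to the bulk of the data, unlike bounds built from the mean and standard deviation.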

In [None]:
fname = r'C:/Data/PyMLData/cars.csv'
cars = pd.read_csv(fname, sep=',', encoding='utf-8', engine='python')
print(cars.head(3))

In [None]:
Q1, Q3 = cars["speed"].quantile(q=[0.25, 0.75])
IQR = Q3 - Q1

n = 1.5        # 1.5 or 3

upper = Q3 + n * IQR
lower = Q1 - n * IQR

# cars_outlier = cars[(cars["speed"] < lower) | (cars["speed"] > upper)]
cars_outlier = cars.query(f"speed < {lower} or speed > {upper}")
cars_outlier

In [None]:
cars_normal = cars.query(f"speed >= {lower} and speed <= {upper}")
print(cars_normal.shape)

### <font color='blue'>Automatic Outlier Detection</font>
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
    - from sklearn.cluster import DBSCAN
- Isolation Forest
    - from sklearn.ensemble import IsolationForest

【Dataset】
- <https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv>
- <https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.names>

### <font color=blue>DBSCAN (Density-Based Spatial Clustering of Applications with Noise)</font>
- <font color=red><b>See Unit03 [Unsupervised Learning] Cluster Analysis; skipped here for now</b></font>
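
For reference, a minimal sketch of how DBSCAN can double as an outlier detector: scikit-learn's `DBSCAN` assigns the label `-1` to points it treats as noise. The two synthetic clusters, the planted points, and the `eps` / `min_samples` values are assumptions chosen for this illustration, not tuned settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense clusters plus two planted far-away points
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
planted = np.array([[2.5, 2.5], [10.0, -5.0]])
X = np.vstack([cluster_a, cluster_b, planted])

# eps and min_samples are illustrative guesses; real data needs tuning
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

outlier_mask = labels == -1   # DBSCAN marks noise points with label -1
print("clusters found:", sorted(set(labels) - {-1}))
print("points flagged as noise:")
print(X[outlier_mask])
```

Features on very different scales usually need standardizing first, since `eps` is a single distance threshold applied across all dimensions.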

In [None]:
from sklearn.cluster import DBSCAN

In [None]:
fname = r'C:/Data/PyMLData/iris.csv'
iris = pd.read_csv(fname, sep=',', encoding='utf-8', engine='python')
print(iris.head(3))

### <font color=blue>Isolation Forest</font>
- [sklearn.ensemble.IsolationForest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html)

【Source】[Isolation forest](https://en.wikipedia.org/wiki/Isolation_forest) <https://en.wikipedia.org/wiki/Isolation_forest>

<table align=left>
    <tr>
        <td><img src='https://upload.wikimedia.org/wikipedia/commons/thumb/c/ce/Isolating_a_Non-Anomalous_Point.png/450px-Isolating_a_Non-Anomalous_Point.png' width=300 align=left></img></td>
        <td><img src='https://upload.wikimedia.org/wikipedia/commons/thumb/f/ff/Isolating_an_Anomalous_Point.png/450px-Isolating_an_Anomalous_Point.png' width=300 align=left></img></td>
    </tr>
    <tr align=center>
        <td align=center>a non-anomalous point</td>
        <td align=center>an anomalous point</td>
    </tr>    
</table>
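
The idea in the figures, that an anomalous point is isolated by fewer random splits than a typical point, can be tried on synthetic data. A minimal sketch, assuming made-up 2-D data with two planted extremes; `contamination=0.05` is an illustrative guess at the outlier fraction, not a recommended value.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 200 standard-normal points plus two planted extremes
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([inliers, planted])

# contamination is the assumed share of outliers; it sets the decision threshold
iso = IsolationForest(contamination=0.05, random_state=1)
pred = iso.fit_predict(X)        # +1 = inlier, -1 = outlier
scores = iso.score_samples(X)    # lower score = more anomalous

print("flagged:", int((pred == -1).sum()), "of", len(X))
print("scores of planted points:", scores[-2:])
```

Since `contamination` fixes roughly what fraction gets flagged, some ordinary points near the edge of the cloud are labeled `-1` as well; the scores give a finer-grained ranking.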

In [None]:
from sklearn.ensemble import IsolationForest

In [None]:
colnames = ['CRIM', 'ZN', 'INDUS', 'CHAS', 
            'NOX', 'RM', 'AGE', 'DIS', 'RAD', 
            'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
uri = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
data = pd.read_csv(uri, header=None, names=colnames)

X = data.iloc[:, :-1].values
y = data.iloc[:,  -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
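
One plausible use of the split above, suggested by the `LinearRegression` and `mean_absolute_error` imports at the top of the notebook, is to flag outliers in the training rows only and compare test error with and without them. The sketch below is an assumption about that workflow, run on synthetic data from `make_regression` rather than the housing data; `contamination=0.1` is an illustrative choice.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption: not the real dataset)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# Fit the detector on training features only, so the test set stays untouched
iso = IsolationForest(contamination=0.1, random_state=1)
keep = iso.fit_predict(X_train) != -1
X_clean, y_clean = X_train[keep], y_train[keep]

# Train the same model on both versions and evaluate on the same test set
for name, (Xt, yt) in {"all rows": (X_train, y_train),
                       "outliers removed": (X_clean, y_clean)}.items():
    model = LinearRegression().fit(Xt, yt)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: n={len(Xt)}, MAE={mae:.3f}")
```

Whether removing flagged rows helps depends on the data; on clean synthetic data like this the difference may be small or even slightly negative.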