
# 1. Machine Learning Basics

## 1) Introduction

• Supervised learning

• Unsupervised learning

### (2) Data Scaling

Features on overly large or small scales increase the risk of numerical instability and also make the loss function harder to optimize.

• Feature selection methods based on linear-model weights assume that the input features are on the same scale.
• The performance and convergence speed of gradient-descent-based models (e.g. neural networks) are significantly affected by poorly scaled data.
• Decision trees and random-forest-style algorithms are less sensitive to feature scale, because they use rule-based criteria.

• standard/z-score scaling

Standard/z-score scaling first shifts each feature to its center (mean) and then divides by its standard deviation. This method is suitable for most continuous features with an approximately Gaussian distribution.

$$\text{zscore}(x_{ij}^{'}) = \frac{x_{ij} - \mu_i}{\sigma_i}$$

• min-max scaling

The min-max scaling method scales data into the range [0, 1]. This method is suitable for data concentrated within a range and preserves zero values for sparse data. Min-max scaling is sensitive to outliers in the data, so try removing outliers or clipping the data into a range before scaling.

$$\text{minmax}(x_{ij}^{'}) = \frac{x_{ij} - \min_k x_{ik}}{\max_k x_{ik} - \min_k x_{ik}}$$

• max-abs scaling

The max-abs scaling method is similar to min-max scaling, but scales data into the range [-1, 1]. It does not shift/center the data and thus preserves the signs (positive/negative) of features. Like min-max scaling, max-abs scaling is sensitive to outliers.

$$\text{maxabs}(x_{ij}^{'}) = \frac{x_{ij}}{\max_k \vert x_{ik} \vert}$$

• robust scaling

The robust scaling method uses robust statistics (median, interquartile range) instead of the mean and standard deviation. The median and IQR are less sensitive to outliers. For features with large numbers of outliers, or whose distributions deviate largely from normal, robust scaling is recommended.

$$\text{robustscale}(x_{ij}^{'}) = \frac{x_{ij} - \text{median}_k\, x_{ik}}{Q_{0.75}(\mathbf{x}_i) - Q_{0.25}(\mathbf{x}_i)}$$
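All four scaling methods have ready-made implementations in scikit-learn's `sklearn.preprocessing` module (assumed available here); a minimal sketch on a toy matrix with one outlier-heavy feature:

```python
import numpy as np
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler,
)

# Toy matrix: rows are samples, columns are features.
# The last value of the second feature acts as an outlier.
X = np.array([[1.0, -100.0],
              [2.0,    0.0],
              [3.0,  100.0],
              [4.0,  900.0]])

for scaler in (StandardScaler(), MinMaxScaler(), MaxAbsScaler(), RobustScaler()):
    X_scaled = scaler.fit_transform(X)  # fit on X, then transform X
    print(type(scaler).__name__)
    print(X_scaled.round(3))
```

Note that `StandardScaler` leaves each column with mean 0 and standard deviation 1, while `RobustScaler` centers on the median, so the outlier distorts it far less.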

### (3) Evaluation Metrics for Classification

• Confusion matrix:

The confusion matrix summarizes the numbers of correctly and incorrectly classified samples, dividing the predictions into the following four categories:

|                        | True Positive       | True Negative       |                                     |
| ---------------------- | ------------------- | ------------------- | ----------------------------------- |
| **Predicted Positive** | True Positive (TP)  | False Positive (FP) | $$\textbf{PPV} = \frac{TP}{TP+FP}$$ |
| **Predicted Negative** | False Negative (FN) | True Negative (TN)  | $$\textbf{NPV} = \frac{TN}{TN+FN}$$ |
|                        | $$\textbf{Sensitivity} = \frac{TP}{TP+FN}$$ | $$\textbf{Specificity} = \frac{TN}{TN+FP}$$ | |
• Accuracy (0 ~ 1)

summarizes both positive and negative predictions, but is biased if the classes are imbalanced:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

• Recall/sensitivity (0 ~ 1)

summarizes how well the model recovers the positive samples:

$$\text{Recall/Sensitivity} = \frac{TP}{TP + FN}$$

• Precision/positive predictive value (PPV) (0 ~ 1)

summarizes what fraction of the samples predicted as positive are truly positive:

$$\text{Precision/Positive Predictive Value} = \frac{TP}{TP + FP}$$

• F1 score (0 ~ 1)

balances between positive predictive value (PPV) and true positive rate (TPR) and is more suitable for imbalanced dataset:

$$\text{F1 score} = 2 \frac{PPV \cdot TPR}{PPV + TPR}$$

• Matthews correlation coefficient (MCC) (-1 ~ 1)

another metric that balances between recall and precision:

$$\text{MCC} = \frac{TP \times TN - FP \times FN} {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
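All of the metrics above derive from the four confusion-matrix counts. A minimal sketch using `sklearn.metrics` (scikit-learn assumed available) on a small hand-made prediction vector:

```python
from sklearn.metrics import (
    confusion_matrix, accuracy_score, recall_score,
    precision_score, f1_score, matthews_corrcoef,
)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # 4 positives, 4 negatives
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # one FN, one FP

# confusion_matrix flattens row-major as tn, fp, fn, tp for labels {0, 1}.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                     # 3 1 1 3

print(accuracy_score(y_true, y_pred))     # (TP+TN)/total = 0.75
print(recall_score(y_true, y_pred))       # TP/(TP+FN)    = 0.75
print(precision_score(y_true, y_pred))    # TP/(TP+FP)    = 0.75
print(f1_score(y_true, y_pred))           # 0.75
print(matthews_corrcoef(y_true, y_pred))  # 8/sqrt(256)   = 0.5
```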

• ROC curve and Precision-Recall curve:

The area under the ROC curve (AUROC) or the average precision (AP) is a single value that summarizes the model's average performance across different cutoffs and is commonly used to report classification performance.
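Both summary values are computed from the model's continuous scores rather than its hard labels; a minimal sketch with scikit-learn (assumed available):

```python
from sklearn.metrics import roc_auc_score, average_precision_score

y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted scores for the positive class

# AUROC: fraction of (positive, negative) pairs ranked correctly.
print(roc_auc_score(y_true, y_score))            # 0.75

# AP: area under the precision-recall curve (step-wise interpolation).
print(average_precision_score(y_true, y_score))  # 0.8333...
```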

### (4) Cross-validation

• K-fold Cross-validation

In K-fold cross-validation, the dataset is evenly divided into $$k$$ parts (folds). In each round, the model is trained on the $$\frac{k-1}{k}$$ remaining parts and tested on the one held-out fold.

K-fold cross-validation guarantees no overlap between training and test samples; after $$k$$ rounds, every sample has served as a test sample exactly once. Finally, the model's average performance is obtained by averaging the metrics computed over the $$k$$ rounds.

• Leave-one-out Cross-validation

Leave-one-out cross-validation (LOOCV) is a particular case of leave-p-out cross-validation with p = 1.
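Both splitters are available in scikit-learn's `sklearn.model_selection` module (assumed available); a minimal sketch verifying the no-overlap property on toy data:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features

# 5-fold CV: each sample lands in exactly one test fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
test_indices = []
for train_idx, test_idx in kf.split(X):
    assert len(set(train_idx) & set(test_idx)) == 0  # train/test never overlap
    test_indices.extend(test_idx)
print(sorted(int(i) for i in test_indices))  # every sample tested exactly once

# LOOCV is K-fold with k = n: here, 10 rounds of a single test sample each.
print(sum(1 for _ in LeaveOneOut().split(X)))  # 10
```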

### (5) Common Data-Splitting Patterns in Machine Learning

• 10-fold cross-validation + an independent validation set
• First create an independent validation dataset: randomly split the dataset into two parts, a training/test set (e.g. 80%) and a validation set (e.g. 20%).
• Run cross-validation on the training/test set. You can use 5- or 10-fold cross-validation:
• randomly split the dataset into 10 parts;
• train on 9 and test on 1;
• repeat the training/testing process for all combinations of train-test splits.
• Evaluate the algorithms on both the training/test set and the validation set.
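The steps above can be sketched as follows (scikit-learn assumed; logistic regression on synthetic data stands in for any model and dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Step 1: hold out an independent validation set (20%).
X_dev, X_val, y_dev, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: 10-fold cross-validation on the remaining training/test set.
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_dev, y_dev, cv=10)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Step 3: refit on the full training/test set, evaluate on the held-out set.
model.fit(X_dev, y_dev)
print(f"Validation accuracy: {model.score(X_val, y_val):.3f}")
```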

## 2) Take a Break

Bayes

On December 23, 1763, Reverend R. Price, a beneficiary of the estate of Thomas Bayes (1701?-1761), read Bayes's posthumous paper "An Essay towards solving a Problem in the Doctrine of Chances" before the Royal Society. The paper presented Bayes' theorem, and that day is now taken as the theorem's birthday. Although Bayes' theorem has become one of the most classic results in probability and statistics, Bayes himself remains shrouded in mystery.

—— Adapted from *Machine Learning* (Zhihua Zhou)