### scikit-learn examples for some algorithms

#### Standardization

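Standardization is a common __data preprocessing__ method. It rescales each feature (each column) to have mean 0 and variance 1. Only the location and scale of a feature change, not the shape of its distribution, so the result is standard normal only if the feature was normally distributed to begin with.<br>
It is a useful preprocessing step for many machine learning algorithms, especially models that rely on distance or variance computations (such as __KNN, SVM, and PCA__); __linear regression__ also benefits when it is regularized or fitted by gradient descent.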
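To make the point about distance-based models concrete, here is a minimal sketch with made-up numbers. The two features (height in metres, income in dollars) and the sample values are illustrative assumptions, not taken from any dataset. Before scaling, the Euclidean distance is driven almost entirely by the income column; after scaling, both columns contribute on a comparable footing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: column 0 = height (m), column 1 = income (USD)
X = np.array([
    [1.60, 30000.0],
    [1.90, 30500.0],   # very different height, similar income
    [1.61, 60000.0],   # similar height, very different income
])

def euclidean(a, b):
    """Euclidean distance between two 1-D feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw distances from sample 0: the income column dominates
raw_d01 = euclidean(X[0], X[1])
raw_d02 = euclidean(X[0], X[2])

# Distances after standardizing each column
X_scaled = StandardScaler().fit_transform(X)
scaled_d01 = euclidean(X_scaled[0], X_scaled[1])
scaled_d02 = euclidean(X_scaled[0], X_scaled[2])

print("raw:   ", raw_d01, raw_d02)
print("scaled:", scaled_d01, scaled_d02)
```

On the raw data the two distances are roughly 500 and 30000, so the large height difference of sample 1 is effectively ignored. After standardization the two distances have the same order of magnitude.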


```python
from sklearn.preprocessing import StandardScaler

X = [[0, 0],
     [1, 1],
     [2, 2]]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)  # each column now has mean 0 and variance 1
```

Output:

```
[[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]
```
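The same numbers can be reproduced by hand, which may help clarify what `StandardScaler` computes. The sketch below applies z = (x - mean) / std per column with NumPy, using the population standard deviation (`ddof=0`, NumPy's default), and compares the result with scikit-learn. The variable names `X_manual` and `X_sklearn` are chosen here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0, 0],
              [1, 1],
              [2, 2]], dtype=float)

# Manual standardization: subtract the column mean, divide by the column std
mean = X.mean(axis=0)   # per-column mean
std = X.std(axis=0)     # per-column population std (ddof=0)
X_manual = (X - mean) / std

# The fitted scaler stores the same statistics in mean_ and scale_
scaler = StandardScaler().fit(X)
X_sklearn = scaler.transform(X)

print(np.allclose(X_manual, X_sklearn))  # True
print(scaler.mean_, scaler.scale_)
```

Because the fitted statistics live on the scaler object, the same `mean_` and `scale_` can later be applied to new data with `transform`.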


In a machine learning workflow, standardization usually comes right before model training:
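```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Load data (assumption: Iris is used here as a stand-in for your own X, y)
X, y = load_iris(return_X_y=True)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create the scaler
scaler = StandardScaler()

# Fit on the training set and transform it (fit + transform)
X_train_scaled = scaler.fit_transform(X_train)

# Only transform the test set (do not fit on it!)
X_test_scaled = scaler.transform(X_test)

# Train the model on the standardized data
model = KNeighborsClassifier()
model.fit(X_train_scaled, y_train)
```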
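One way to make the "fit on train, only transform on test" rule hard to get wrong is to wrap the scaler and the model in a scikit-learn pipeline. The sketch below is one possible arrangement, not the only one; it assumes the Iris dataset as a stand-in for your own `X` and `y`, and the `random_state` and `cv` values are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: Iris stands in for whatever X, y you actually have
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The pipeline fits the scaler on the data passed to fit() only,
# then reuses those statistics whenever it predicts or scores
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)

# In cross-validation the scaler is refitted inside each training fold,
# so the held-out fold never leaks into the scaling statistics
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

print("Test accuracy:", test_accuracy)
print("CV accuracy per fold:", cv_scores)
```

With a pipeline, scaling and prediction happen in one call, so there is no separate `X_test_scaled` variable to forget.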

#### Iris dataset machine learning exercise

```python
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the built-in Iris dataset
iris = datasets.load_iris()
print("datasets.iris have data size:", iris.data.shape)
print("datasets.iris have target size:", iris.target.shape)

print("Feature Names:")
print(iris.feature_names)
print("Target Names:")
print(iris.target_names)
print("Features:\n", iris.data[:2])     # features of the first 2 samples
print("Labels:\n", iris.target[:2])     # labels of the first 2 samples

# X, y = iris.data[:, :2], iris.target  # alternative: train on the first two features only
X, y = iris.data, iris.target

# Stratified 80/20 split so each class keeps its proportion in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=33, test_size=0.2, stratify=y
)

# Fit the scaler on the training set only, then apply it to both sets
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Train a 5-nearest-neighbors classifier and evaluate it on the test set
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
```

Output:

```
datasets.iris have data size: (150, 4)
datasets.iris have target size: (150,)
Feature Names:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target Names:
['setosa' 'versicolor' 'virginica']
Features:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]]
Labels:
 [0 0]
Accuracy: 0.9333333333333333
```
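As a follow-up, a trained model is usually applied to new measurements. The sketch below rebuilds the same scaler and KNN model and classifies a single hypothetical flower; the measurement values in `new_sample` are made up for illustration. The key point is that new data has to go through the same fitted `scaler` before prediction.

```python
import numpy as np
from sklearn import datasets, neighbors, preprocessing
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=33, test_size=0.2, stratify=iris.target
)

# Fit the scaler on the training data and train KNN on the scaled features
scaler = preprocessing.StandardScaler().fit(X_train)
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(scaler.transform(X_train), y_train)

# Hypothetical new flower: sepal length, sepal width, petal length, petal width (cm)
new_sample = np.array([[5.0, 3.4, 1.5, 0.2]])

# Scale with the already fitted scaler (transform only), then predict
predicted_label = knn.predict(scaler.transform(new_sample))[0]
predicted_name = iris.target_names[predicted_label]

print("Predicted class:", predicted_name)
```

Passing `new_sample` to `knn.predict` without scaling would compare raw centimetre values against standardized training points, which generally gives unreliable neighbors.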
