# sklearn
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
    
https://scikit-learn.org/stable/user_guide.html

https://www.cnblogs.com/lianyingteng/p/7811126.html

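A traditional machine learning task usually runs through these steps from start to finish: get the data -> preprocess the data -> train the model -> evaluate the model -> predict or classify. This article follows that workflow and, for each step, looks at the commonly used functions and how to call them. The goal is to get you started on your own learning task as quickly as possible.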

# sklearn.datasets
https://www.cnblogs.com/nolonely/p/6980160.html

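sklearn provides several kinds of datasets:

    Small bundled datasets (packaged datasets): sklearn.datasets.load_<name>
    Datasets downloadable online (downloaded datasets): sklearn.datasets.fetch_<name>
    Computer-generated datasets (generated datasets): sklearn.datasets.make_<name>
    Datasets in svmlight/libsvm format: sklearn.datasets.load_svmlight_file(...)
    Datasets downloaded from mldata.org: sklearn.datasets.fetch_mldata(...) (removed in scikit-learn 0.22; sklearn.datasets.fetch_openml(...) is the replacement)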


## Importing sklearn datasets

In [1]:
from sklearn import datasets

In [2]:
iris = datasets.load_iris() # load the dataset
X = iris.data # feature matrix
y = iris.target # sample labels

## Creating datasets
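
A minimal sketch of generating a dataset with one of the `make_<name>` functions, here `make_classification`. All sizes and the random seed are illustrative assumptions, not values from the original notes.

```python
from sklearn.datasets import make_classification

# Generate a synthetic binary classification problem.
# Every size below is an illustrative choice.
X_gen, y_gen = make_classification(
    n_samples=200,      # number of rows
    n_features=5,       # total number of columns
    n_informative=3,    # columns that actually carry class signal
    n_redundant=0,      # no linear combinations of informative columns
    n_classes=2,
    random_state=0,     # fixed seed for reproducibility
)
print(X_gen.shape, y_gen.shape)
```

The returned arrays can be used exactly like `iris.data` and `iris.target` above.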

# Data preprocessing

In [3]:
from sklearn import preprocessing

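## Feature scaling
To keep the scaling rule applied to the test data identical to the one learned on the training data, preprocessing provides several Scaler classes: fit the scaler on the training data once, then use it to transform both sets.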

In [None]:
train_data = [[0, 0], [0, 0], [1, 1], [1, 1]]
test_data = [[2, 2]]  # illustrative test sample, not from the original notes
# 1. Standardization based on mean and std
scaler = preprocessing.StandardScaler().fit(train_data)
scaler.transform(train_data)
scaler.transform(test_data)

# 2. Rescale each feature to a fixed range
scaler = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit(train_data)
scaler.transform(train_data)
scaler.transform(test_data)
# feature_range: the target range, passed as a (min, max) tuple

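## Normalization (normalize)
Normalization is an essential step whenever you want to compute the similarity between two samples. The idea is to compute the p-norm of each sample and then divide every element of that sample by the norm, so that every sample ends up with a norm of 1.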
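
The idea can be sketched with `preprocessing.normalize`, assuming p = 2 (the L2 norm). The sample values are illustrative and chosen so the norms are easy to check by hand.

```python
import numpy as np
from sklearn import preprocessing

# Illustrative samples, one per row.
samples = np.array([[3.0, 4.0], [1.0, 0.0]])

# Divide each row by its own L2 norm (p = 2).
normalized = preprocessing.normalize(samples, norm='l2')
print(normalized)

# Every row should now have unit L2 norm.
print(np.linalg.norm(normalized, axis=1))
```

Passing `norm='l1'` or `norm='max'` instead would divide by the corresponding norm.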

## One-hot encoding

In [5]:
data = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
encoder = preprocessing.OneHotEncoder().fit(data)
encoder.transform(data).toarray()

Note from the scikit-learn warning shown at this step: if you previously ran a LabelEncoder before the OneHotEncoder to convert the categories to integers, you can now use the OneHotEncoder directly.


array([[1., 0., 1., 0., 0., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0., 1., 0., 0.],
       [0., 1., 1., 0., 0., 0., 0., 1., 0.]])

# sklearn.model_selection

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)  # maximum number of columns to display, so columns are not elided with "…"
pd.set_option('expand_frame_repr', False)  # do not wrap the output when there are many columns

## train_test_split
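Accepts list, Series, and DataFrame inputs.

By default it shuffles first and then splits.

By default the label proportions after the split are not guaranteed to match those before it! Pass stratify to preserve the proportions.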

In [None]:
train_test_split(
    X, 
    y, 
    test_size=0.33, 
    random_state=42, 
    shuffle=True, 
    stratify=y
)

In [8]:
from sklearn.model_selection import train_test_split

In [18]:
x1 = list(range(10))
y1 = list(range(10, 20))
y2 = list(range(15, 25))

In [25]:
train_test_split(x1, test_size=0.2, random_state=141)

[[7, 0, 4, 2, 3, 8, 1, 6], [5, 9]]

In [24]:
train_test_split(y1, test_size=0.2, random_state=141)

[[17, 10, 14, 12, 13, 18, 11, 16], [15, 19]]

In [22]:
y2

[15, 16, 17, 18, 19, 20, 21, 22, 23, 24]

In [23]:
train_test_split(y2, test_size=0.2, random_state=141)

[[22, 15, 19, 17, 18, 23, 16, 21], [20, 24]]

In [15]:
X = np.random.randn(1000, 9)
X[:5]

array([[-0.68291117,  0.65708204, -0.07488594, -0.09784512, -0.12735506,
         1.46575136,  0.32055646, -0.61094336, -1.06768396],
       [ 1.48736379,  0.22372249,  0.28589577, -1.09234385,  0.43722026,
         0.41985508,  0.02995451, -1.62778856,  0.96690139],
       [-0.20708027,  0.00756499, -0.82674378, -0.09406711,  0.55807112,
         0.81590917,  0.76628058,  0.75641694, -1.02586569],
       [-2.57354852, -0.73305637, -0.90913825,  0.61868268,  0.32379164,
        -0.9969446 , -0.17100188, -0.58033699,  0.80879013],
       [ 0.78579847,  0.00399596, -0.50592852,  1.41868965, -1.35325206,
         1.01643121, -0.79306194, -3.8466615 , -2.55223184]])

In [16]:
y = [0]*900 + [1]*100

In [4]:
# shuffles by default
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=14)

In [5]:
type(y_train)

list

In [6]:
y = pd.Series(y)

In [7]:
y.value_counts()

0    900
1    100
dtype: int64

In [9]:
pd.Series(y_train).value_counts()

0    812
1     88
dtype: int64

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=14)
type(y_train)

pandas.core.series.Series

In [19]:
y = pd.DataFrame(y)
y.head()

   0
0  0
1  0
2  0
3  0
4  0


In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=14)
type(y_train)

pandas.core.frame.DataFrame

In [21]:
y_train.iloc[:, 0].value_counts()

0    812
1     88
Name: 0, dtype: int64

In [22]:
y_test.iloc[:, 0].value_counts()

0    88
1    12
Name: 0, dtype: int64

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=14, stratify=y)

In [24]:
y_train.iloc[:, 0].value_counts()

0    810
1     90
Name: 0, dtype: int64

In [23]:
y_test.iloc[:, 0].value_counts()

0    90
1    10
Name: 0, dtype: int64

## KFold, StratifiedKFold
K-Folds cross-validator

Provides train/test indices to split data in train/test sets. Split
dataset into k consecutive folds (without shuffling by default).

Each fold is then used once as a validation while the k - 1 remaining
folds form the training set.

In [2]:
from sklearn.model_selection import KFold, StratifiedKFold

In [None]:
KFold(n_splits=3, shuffle=False, random_state=None)

In [3]:
import numpy as np
from sklearn.model_selection import KFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
kf.get_n_splits(X)

2

In [4]:
kf

KFold(n_splits=2, random_state=None, shuffle=False)

In [5]:
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]
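
The cells above only exercise `KFold`. A minimal sketch of `StratifiedKFold` follows, assuming an illustrative imbalanced label vector (8 samples of class 0, 4 of class 1). Unlike `KFold`, its `split` method needs the labels so that each fold can keep the class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative data: the features are irrelevant here, only the labels matter.
X_skf = np.zeros((12, 2))
y_skf = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=4)
for train_index, test_index in skf.split(X_skf, y_skf):
    # Each test fold keeps the 2:1 class ratio of the full label vector.
    print("TEST:", test_index, "labels:", y_skf[test_index])
```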


## GridSearchCV
Exhaustive search over specified parameter values for an estimator.

Important members are fit, predict.

GridSearchCV implements a "fit" and a "score" method.
It also implements "predict", "predict_proba", "decision_function",
"transform" and "inverse_transform" if they are implemented in the
estimator used.

The parameters of the estimator used to apply these methods are optimized
by cross-validated grid-search over a parameter grid.

### The key question: on what basis does GridSearchCV pick the best model parameters?
Scoring parameter: Model-evaluation tools using cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV) rely on an internal scoring strategy. This is discussed in the section [The scoring parameter: defining model evaluation rules](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).

In [6]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

GridSearchCV(
    estimator,
    param_grid,
    scoring=None,
    n_jobs=None,
    iid='warn',
    refit=True,
    cv='warn',
    verbose=0,
    pre_dispatch='2*n_jobs',
    error_score='raise-deprecating',
    return_train_score=False,
)
* scoring : string, callable, list/tuple, dict or None, default: None
    A single string (see :ref:`scoring_parameter`) or a callable
    (see :ref:`scoring`) to evaluate the predictions on the test set.

    For evaluating multiple metrics, either give a list of (unique) strings
    or a dict with names as keys and callables as values.

    NOTE that when using custom scorers, each scorer should return a single
    value. Metric functions returning a list/array of values can be wrapped
    into multiple scorers that return one value each.

    See :ref:`multimetric_grid_search` for an example.

    If None, the estimator's score method is used.
* cv : int, cross-validation generator or an iterable, optional
    Determines the cross-validation splitting strategy.
    Possible inputs for cv are:

    - None, to use the default cross validation (3-fold in the version shown here; 5-fold since scikit-learn 0.22),
    - integer, to specify the number of folds in a `(Stratified)KFold`,
    - :term:`CV splitter`,
    - An iterable yielding (train, test) splits as arrays of indices.
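
To make the selection criterion concrete, here is a minimal sketch that assumes an `SVC` on the iris data with a hypothetical parameter grid. The best parameters are the combination with the highest mean cross-validated score under the chosen `scoring`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Hypothetical grid; the candidate values are illustrative only.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# scoring="accuracy" makes the selection criterion explicit;
# with scoring=None the estimator's own score method would be used.
grid_search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=5)
grid_search.fit(X_iris, y_iris)

print(grid_search.best_params_)  # combination with the highest mean CV score
print(grid_search.best_score_)   # mean cross-validated accuracy of that combination
```

The per-combination scores are available in `grid_search.cv_results_["mean_test_score"]`, which is what `best_score_` is taken from.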

## RandomizedSearchCV
https://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-optimization

While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favourable properties. RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:

    A budget can be chosen independent of the number of parameters and possible values.

    Adding parameters that do not influence the performance does not decrease efficiency.


In [8]:
from sklearn.model_selection import RandomizedSearchCV
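
A minimal sketch of the budget idea, assuming a `LogisticRegression` on the iris data with `C` drawn from a uniform distribution. The distribution, its range, and `n_iter=5` are illustrative assumptions.

```python
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X_iris, y_iris = load_iris(return_X_y=True)

# Sample C from a continuous uniform distribution on [0.1, 10.1].
param_distributions = {"C": uniform(loc=0.1, scale=10)}

# n_iter is the budget: the number of parameter settings sampled,
# independent of how many parameters or values are possible.
random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
random_search.fit(X_iris, y_iris)
print(random_search.best_params_, random_search.best_score_)
```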

# Defining models

## Linear regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
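
A minimal sketch of fitting `LinearRegression`, assuming illustrative noise-free data generated from y = 2x + 1 so the recovered coefficients are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: one feature, targets on the line y = 2x + 1.
X_lin = np.arange(10).reshape(-1, 1)
y_lin = 2 * X_lin.ravel() + 1

reg = LinearRegression().fit(X_lin, y_lin)
print(reg.coef_, reg.intercept_)   # slope and intercept learned from the data
print(reg.predict([[10]]))         # prediction for an unseen x
```

`LogisticRegression` follows the same `fit` / `predict` pattern, but expects class labels as targets.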

## Naive Bayes (NB)
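
The original notes leave this section empty. As a minimal sketch, assuming the Gaussian variant (`GaussianNB`) and the iris data used earlier, a naive Bayes classifier can be trained and scored like this. The split size and seed are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=0, stratify=y_iris
)

# GaussianNB models each feature as a per-class normal distribution.
nb = GaussianNB().fit(X_tr, y_tr)
print(nb.score(X_te, y_te))  # accuracy on the held-out samples
```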

# Saving models

## Saving as a pickle file

In [None]:
import pickle

# Save the model
with open('model.pickle', 'wb') as f:
    pickle.dump(model, f)

# Load the model
with open('model.pickle', 'rb') as f:
    model = pickle.load(f)
model.predict(X_test)

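## joblib, the method formerly bundled with sklearn

In [None]:
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib

# Save the model
joblib.dump(model, 'model.pickle')

# Load the model
model = joblib.load('model.pickle')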
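
The cells above assume an already fitted `model`. A self-contained round trip might look like the sketch below, where the estimator, the data, and the file name are all illustrative assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative model; any fitted estimator could be saved the same way.
X_demo = np.arange(6).reshape(-1, 1)
y_demo = 3 * X_demo.ravel()
fitted = LinearRegression().fit(X_demo, y_demo)

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original.
print(np.allclose(fitted.predict(X_demo), restored.predict(X_demo)))
```

As with pickle, only load files from sources you trust, because loading can execute arbitrary code.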

# sklearn.metrics

In [1]:
from sklearn.metrics.pairwise import pairwise_distances, paired_cosine_distances, cosine_distances, cosine_similarity

In [2]:
import numpy as np

In [3]:
data = np.arange(8).reshape([2, 4])
data

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

## Common distances

### Euclidean distance
$$\sqrt{\sum_i{(x_{1i}-x_{2i})^2}}$$

### Cosine distance
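The original notes give no formula here. Using the same symbols as the Euclidean distance above, and assuming the usual definition of cosine distance as one minus the cosine similarity, it can be written as:

$$1 - \frac{\sum_i{x_{1i}x_{2i}}}{\sqrt{\sum_i{x_{1i}^2}}\,\sqrt{\sum_i{x_{2i}^2}}}$$

For the two rows of `data` above, the dot product is 38 and the product of the norms is 42, so the distance is 1 - 38/42 ≈ 0.0952.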

## pairwise_distances(X, Y=None, metric='euclidean', n_jobs=None, **kwds)

In [None]:
Valid values for metric are:

- From scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2',
  'manhattan']. These metrics support sparse matrix inputs.

- From scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev',
  'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis',
  'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean',
  'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']
  See the documentation for scipy.spatial.distance for details on these
  metrics. These metrics do not support sparse matrix inputs.

In [4]:
pairwise_distances(data,  Y=None, metric='euclidean')

array([[0., 8.],
       [8., 0.]])

In [5]:
np.sqrt(np.sum(np.square(data[1] - data[0])))

8.0

In [6]:
pairwise_distances(data,  Y=None, metric='cosine')

array([[0.       , 0.0952381],
       [0.0952381, 0.       ]])

In [13]:
data[1].dot(data[0])/(np.sqrt(np.sum(np.square(data[1]))) * np.sqrt(np.sum(np.square(data[0]))))

0.9047619047619048

In [17]:
data[1].reshape([1, -1])

array([[4, 5, 6, 7]])

In [18]:
pairwise_distances(data,  Y=data[1].reshape([1, -1]), metric='cosine')

array([[0.0952381],
       [0.       ]])

In [19]:
pairwise_distances(data,  Y=data, metric='cosine')

array([[0.       , 0.0952381],
       [0.0952381, 0.       ]])

## cosine_distances
Equivalent to pairwise_distances(data,  Y=data, metric='cosine')

In [21]:
cosine_distances(data,  Y=data[1].reshape([1, -1]))

array([[0.0952381],
       [0.       ]])

## cosine_similarity
1 - cosine_distances

In [7]:
data[1].reshape([1, -1])

array([[4, 5, 6, 7]])

In [22]:
cosine_similarity(data,  Y=data[1].reshape([1, -1]))

array([[0.9047619],
       [1.       ]])

In [23]:
cosine_similarity(data,  Y=data)

array([[1.       , 0.9047619],
       [0.9047619, 1.       ]])
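
The relationships stated in the two sections above can be checked numerically. This is a small sketch that reuses the same `data` array and only calls functions already imported in this section.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    cosine_distances,
    cosine_similarity,
    pairwise_distances,
)

data = np.arange(8).reshape([2, 4])

sim = cosine_similarity(data)
dist = cosine_distances(data)

# cosine_similarity is 1 - cosine_distances.
print(np.allclose(sim, 1 - dist))

# cosine_distances matches pairwise_distances with metric='cosine'.
print(np.allclose(dist, pairwise_distances(data, metric='cosine')))
```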