20190316-2.7 Fancy Indexing (translation) #36

iMchxx · 2019-03-16T08:45:24Z

Fancy Indexing

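In the previous sections, we saw how to access and modify portions of arrays using simple indices (e.g., arr[0]), slices (e.g., arr[:5]), and Boolean masks (e.g., arr[arr > 0]). In this section, we look at another style of array indexing, known as fancy indexing. Fancy indexing is like the simple indexing we have already seen, but we pass arrays of indices in place of single scalars. This lets us quickly access and modify complicated subsets of an array's values.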

Exploring Fancy Indexing

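Fancy indexing is conceptually simple: it means passing an array of indices to access multiple array elements at once. For example, consider the following array: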

In [1]: import numpy as np
   ...: rand = np.random.RandomState(42)
   ...:
   ...: x = rand.randint(100, size=10)
   ...: print(x)
[51 92 14 71 60 20 82 86 74 74]

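Suppose we want to access three different elements. We could do it like this:

In [2]: [x[3], x[7], x[4]]
Out[2]: [71, 86, 60]

Alternatively, we can pass a single list or array of indices to obtain the same values: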

In [3]: ind = [3, 7, 4]
   ...: x[ind]
Out[3]: array([71, 86, 60])

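When using fancy indexing, the shape of the result reflects the shape of the index arrays rather than the shape of the array being indexed: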

In [4]: ind = np.array([[3, 7],
   ...:                 [4, 5]])
   ...: x[ind]
Out[4]:
array([[71, 86],
       [60, 20]])
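As a small sketch of the shape rule above, the snippet below indexes a stand-in array (its values are illustrative, not the random draw used in this section) with a one-dimensional and a two-dimensional index array. It also relies on the fact that fancy indexing returns a copy rather than a view, so editing the result does not touch the original array.

```python
import numpy as np

x = np.arange(10) * 10            # stand-in 1-D array with illustrative values

ind_1d = np.array([3, 7, 4])
ind_2d = np.array([[3, 7],
                   [4, 5]])

# The result takes the shape of the index array, not the shape of x
shape_1d = x[ind_1d].shape
shape_2d = x[ind_2d].shape
print(shape_1d, shape_2d)

# Fancy indexing returns a copy, so changing the result leaves x untouched
picked = x[ind_2d]
picked[0, 0] = -1
print(x[3])
```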

Fancy indexing also works in multiple dimensions. Consider the following array:

In [5]: X = np.arange(12).reshape((3, 4))
   ...: X
Out[5]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

As with standard indexing, the first index refers to the row and the second to the column:

In [6]: row = np.array([0, 1, 2])
   ...: col = np.array([2, 1, 3])
   ...: X[row, col]
Out[6]: array([ 2,  5, 11])

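Notice that the first value in the result is X[0, 2], the second is X[1, 1], and the third is X[2, 3]. The pairing of indices in fancy indexing follows all the broadcasting rules described in Computation on Arrays: Broadcasting. So, for example, if we combine a column vector and a row vector within the indices, we get a two-dimensional result: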

In [7]: X[row[:, np.newaxis], col]
Out[7]:
array([[ 2,  1,  3],
       [ 6,  5,  7],
       [10,  9, 11]])

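Here, each row value is matched with every entry of the column index vector, exactly as we saw in the broadcasting of arithmetic operations. For example: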

In [8]: row[:, np.newaxis] * col
Out[8]:
array([[0, 0, 0],
       [2, 1, 3],
       [4, 2, 6]])

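It is important to remember that with fancy indexing the return value reflects the broadcasted shape of the indices, rather than the shape of the array being indexed.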
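To make the broadcasting rule concrete, here is a minimal sketch that computes the broadcast shape of the two index arrays with np.broadcast and then rebuilds the same result with an explicit double loop. The loop is only for illustration; the names index_shape, result, and expected are introduced here and are not part of the original text.

```python
import numpy as np

X = np.arange(12).reshape((3, 4))
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

# Shape that a (3, 1) column of row indices and a (3,) row of column indices broadcast to
index_shape = np.broadcast(row[:, np.newaxis], col).shape

result = X[row[:, np.newaxis], col]

# Equivalent explicit pairing: expected[i, j] = X[row[i], col[j]]
expected = np.empty(index_shape, dtype=X.dtype)
for i in range(len(row)):
    for j in range(len(col)):
        expected[i, j] = X[row[i], col[j]]

print(index_shape)
print(np.array_equal(result, expected))
```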

Combined Indexing

For even more powerful operations, fancy indexing can be combined with the other indexing schemes we have seen:

In [9]: print(X)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

We can combine fancy and simple indices:

In [10]: X[2, [2, 0, 1]]
Out[10]: array([10,  8,  9])

We can also combine fancy indexing with slicing:

In [11]: X[1:, [2, 0, 1]]
Out[11]:
array([[ 6,  4,  5],
       [10,  8,  9]])

And we can combine fancy indexing with masking:

In [12]: mask = np.array([1, 0, 1, 0], dtype=bool)
    ...: X[row[:, np.newaxis], mask]
Out[12]:
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])

All of these indexing options combined lead to a very flexible set of operations for accessing and modifying array values.
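One way to reason about the mask case is that a Boolean mask used inside an index behaves like the integer positions of its True entries. The sketch below checks that equivalence on the same 3 × 4 array; true_cols, with_mask, and with_ints are names introduced only for this illustration.

```python
import numpy as np

X = np.arange(12).reshape((3, 4))
row = np.array([0, 1, 2])
mask = np.array([True, False, True, False])

# Integer positions of the True entries in the mask
true_cols = np.flatnonzero(mask)

with_mask = X[row[:, np.newaxis], mask]
with_ints = X[row[:, np.newaxis], true_cols]

print(true_cols)
print(np.array_equal(with_mask, with_ints))
```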

Example: Selecting Random Points

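One common use of fancy indexing is selecting subsets of rows from a matrix. For example, we might have an N × D matrix representing N points in D dimensions, such as the following points drawn from a two-dimensional normal distribution: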

In [13]: mean = [0, 0]
    ...: cov = [[1, 2],
    ...:        [2, 5]]
    ...: X = rand.multivariate_normal(mean, cov, 100)
    ...: X.shape
Out[13]: (100, 2)

Using the plotting tools we will discuss in [Introduction to Matplotlib](https://jakevdp.github.io/PythonDataScienceHandbook/04.00-introduction-to-matplotlib.html), we can visualize these points as a scatter plot:

In [14]: %matplotlib inline
    ...: import matplotlib.pyplot as plt
    ...: import seaborn; seaborn.set()  # for plot styling
    ...:
    ...: plt.scatter(X[:, 0], X[:, 1]);

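Let's use fancy indexing to select 20 random points. We first choose 20 random indices with no repeats, and then use these indices to select a portion of the original array: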

In [16]: indices = np.random.choice(X.shape[0], 20, replace=False)
    ...: indices
Out[16]:
array([93, 87,  8, 47, 88, 84, 44, 95, 23,  3, 32, 46, 97,  0, 22,  6, 35,
       28, 25, 42])
       
In [17]: selection = X[indices]  # fancy indexing here
    ...: selection.shape
Out[17]: (20, 2)

Now, to see which points were selected, let's over-plot large circles at the locations of the selected points:

In [18]: plt.scatter(X[:, 0], X[:, 1], alpha=0.3)
    ...: plt.scatter(selection[:, 0], selection[:, 1],
    ...:             facecolor='none', s=200);

In [19]: plt.show()

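This sort of strategy is often used to quickly partition datasets, as is often needed in train/test splitting for the validation of statistical models (see Hyperparameters and Model Validation), and in sampling approaches to answering statistical questions.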
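A train/test partition of this kind might be sketched as follows. The seed, the test-set size n_test, and the variable names are assumptions made for the illustration; the idea is simply to shuffle the row indices once and then use fancy indexing on each half of the shuffled list.

```python
import numpy as np

rng = np.random.RandomState(0)           # arbitrary seed for the sketch
X = rng.multivariate_normal([0, 0], [[1, 2], [2, 5]], 100)

# Shuffle all row indices once, then cut the shuffled list in two
shuffled = rng.permutation(X.shape[0])
n_test = 20                               # hypothetical test-set size
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

X_test = X[test_idx]                      # fancy indexing selects the rows
X_train = X[train_idx]

print(X_train.shape, X_test.shape)
```

Because both index sets come from one permutation, no row can land in both partitions.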

Modifying Values with Fancy Indexing

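Just as fancy indexing can be used to access parts of an array, it can also be used to modify parts of an array. For example, suppose we have an array of indices and we would like to set the corresponding items in an array to some value: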

In [20]: x = np.arange(10)
    ...: i = np.array([2, 1, 8, 4])
    ...: x[i] = 99
    ...: print(x)
[ 0 99 99  3 99  5  6  7 99  9]

We can use any assignment-type operator for this. For example:

In [21]: x[i] -= 10
    ...: print(x)
[ 0 89 89  3 89  5  6  7 89  9]

Notice, though, that repeated indices with these operations can cause some potentially unexpected results. Consider the following:

In [22]: x = np.zeros(10)
    ...: x[[0, 0]] = [4, 6]
    ...: print(x)
[6. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Where did the 4 go? This operation first assigns x[0] = 4, followed by x[0] = 6. The result, of course, is that x[0] contains the value 6.

Fair enough, but consider this operation:

In [23]: i = [2, 3, 3, 4, 4, 4]
    ...: x[i] += 1
    ...: x
Out[23]: array([6., 0., 1., 1., 1., 0., 0., 0., 0., 0.])

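You might expect x[3] to contain the value 2 and x[4] to contain the value 3, as this is how many times each index is repeated. Why is this not the case? Conceptually, it is because x[i] += 1 is shorthand for x[i] = x[i] + 1. The expression x[i] + 1 is evaluated once, and then the result is assigned to the indices in x. With this in mind, it is not the increment that happens multiple times but the assignment, which leads to the rather nonintuitive result.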
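The two conceptual steps can be written out explicitly. This sketch starts from a fresh array of zeros (an assumption made to keep the output easy to read) and uses a temporary named tmp, which is introduced here only for illustration.

```python
import numpy as np

x = np.zeros(10)
i = [2, 3, 3, 4, 4, 4]

# Step 1: evaluate the right-hand side once; every selected entry is 0 + 1
tmp = x[i] + 1

# Step 2: assign; repeated indices overwrite the same slot with the same value
x[i] = tmp

print(tmp)
print(x)
```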

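So what if you want the other behavior, where the operation is repeated? For this, you can use the at() method of ufuncs (available since NumPy 1.8) and do the following: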

In [24]: x = np.zeros(10)
    ...: np.add.at(x, i, 1)
    ...: print(x)
[0. 0. 1. 2. 3. 0. 0. 0. 0. 0.]

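The at() method does an in-place application of the given operator at the specified indices (here, i) with the specified value (here, 1). Another method that is similar in spirit is the reduceat() method of ufuncs, which you can read about in the NumPy documentation.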
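As a brief, illustrative sketch of both methods: at() works with other binary ufuncs such as np.maximum, and reduceat() reduces over contiguous segments that begin at the given start positions. The arrays and names below are made up for the example.

```python
import numpy as np

# at() with a different ufunc: keep the running maximum per index
peak = np.zeros(5)
np.maximum.at(peak, [1, 1, 3], [2.0, 5.0, 1.0])
print(peak)                       # [0. 5. 0. 1. 0.]

# reduceat(): sum over values[0:4], values[4:6], and values[6:]
values = np.arange(8)             # [0 1 2 3 4 5 6 7]
starts = [0, 4, 6]
segment_sums = np.add.reduceat(values, starts)
print(segment_sums)               # [ 6  9 13]
```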

Example: Binning Data

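You can use these ideas to efficiently bin data and create a histogram by hand. For example, imagine we have 100 values and would like to quickly find where they fall within an array of bins. We could compute it using ufunc.at like this: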

In [25]: np.random.seed(42)
    ...: x = np.random.randn(100)
    ...:
    ...: # compute a histogram by hand
    ...: bins = np.linspace(-5, 5, 20)
    ...: counts = np.zeros_like(bins)
    ...:
    ...: # find the appropriate bin for each x
    ...: i = np.searchsorted(bins, x)
    ...:
    ...: # add 1 to each of these bins
    ...: np.add.at(counts, i, 1)

The counts now reflect the number of points within each bin; in other words, a histogram:

In [26]: # plot the results as a step plot
    ...: plt.plot(bins, counts, drawstyle='steps');
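A quick sanity check on the hand-rolled counts is to compare them with np.histogram on the same data. The sketch below assumes that no value falls outside the range of bins or exactly on a bin edge, which holds for this sample; under that assumption the manual counts are the np.histogram counts shifted by one position, because np.searchsorted returns the index of the right edge of each value's bin.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.randn(100)
bins = np.linspace(-5, 5, 20)

# Manual binning, as in the text above
counts = np.zeros_like(bins)
np.add.at(counts, np.searchsorted(bins, x), 1)

hist, edges = np.histogram(x, bins)

# counts[k] holds the points in (bins[k-1], bins[k]], so counts[1:] lines up with hist
print(counts.sum())
print(np.array_equal(counts[1:], hist))
```

Values larger than the last bin edge would make np.searchsorted return an index past the end of counts, so real data may need clipping or an extra overflow bin.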

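Of course, it would be silly to have to do this each time you want to plot a histogram. This is why Matplotlib provides the plt.hist() routine, which does the same in a single line: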

In [28]: plt.hist(x, bins, histtype='step');

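This function creates a plot nearly identical to the one seen here. To compute the binning, Matplotlib uses the np.histogram function, which does a very similar computation to what we did before. Let's compare the two here: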

In [30]: print("NumPy routine:")
    ...: %timeit counts, edges = np.histogram(x, bins)
    ...:
    ...: print("Custom routine:")
    ...: %timeit np.add.at(counts, np.searchsorted(bins, x), 1)
NumPy routine:
23.5 µs ± 485 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Custom routine:
14.4 µs ± 761 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

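Our own one-line algorithm is noticeably faster than the optimized algorithm in NumPy (about 1.6 times faster in the run above)! How can this be? If you dig into the np.histogram source code (you can do this in IPython by typing np.histogram??), you will see that it is quite a bit more involved than the simple search-and-count we have done. This is because NumPy's algorithm is more flexible, and in particular is designed for better performance when the number of data points becomes large: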

In [31]: x = np.random.randn(1000000)
    ...: print("NumPy routine:")
    ...: %timeit counts, edges = np.histogram(x, bins)
    ...:
    ...: print("Custom routine:")
    ...: %timeit np.add.at(counts, np.searchsorted(bins, x), 1)
NumPy routine:
71 ms ± 613 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Custom routine:
115 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

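What this comparison shows is that algorithmic efficiency is almost never a simple question. An algorithm that is efficient for large datasets will not always be the best choice for small datasets, and vice versa (see [Big-O Notation](https://jakevdp.github.io/PythonDataScienceHandbook/02.08-sorting.html#Aside:-Big-O-Notation)). But the advantage of coding this algorithm yourself is that, with an understanding of these basic methods, you can use these building blocks to extend it to do some very interesting custom behaviors. The key to using `Python` efficiently in data-intensive applications is knowing about general convenience routines like `np.histogram` and when they are appropriate, but also knowing how to make use of lower-level functionality when you need more pointed behavior.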
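As one possible example of such an extension (not something described in the original text), the same building blocks can accumulate a per-point weight in each bin instead of a plain count. The weights array here is hypothetical, and the comparison against np.histogram with its weights argument is only a cross-check of the sketch.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.randn(100)
weights = rng.uniform(0.5, 1.5, size=100)    # hypothetical per-point weights
bins = np.linspace(-5, 5, 20)

# Add each point's weight (rather than 1) to the bin it falls in
totals = np.zeros_like(bins)
np.add.at(totals, np.searchsorted(bins, x), weights)

# Cross-check against NumPy's weighted histogram (shifted by one, as with plain counts)
reference, _ = np.histogram(x, bins, weights=weights)
print(np.allclose(totals[1:], reference))
```

Swapping np.add for another ufunc, such as np.maximum, would give a different per-bin statistic with the same indexing step.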

via

Fancy Indexing
