## 3-3 NumPy Data Basics

In [1]:
import numpy

numpy.__version__

'1.13.3'

In [3]:
import numpy as np

np.__version__

'1.13.3'

## Python List

+ The elements of a Python list may have different types
+ Python also ships an array type in the standard library `array` module

In [4]:
L = [i for i in range(10)]
L

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [7]:
# Read the 6th element of the list
L[5]

5

In [8]:
# A list that originally held ints can have one of its values changed to a string
L[5] = 'Machine Learning'
L

[0, 1, 2, 3, 4, 'Machine Learning', 6, 7, 8, 9]

## Using Python's built-in array type

In [11]:
import array
# The first argument 'i' is the type code for integers
arr = array.array('i', [i for i in range(10)])
arr

array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [14]:
arr[5]

5

In [15]:
arr[5] = 100

In [16]:
arr

array('i', [0, 1, 2, 3, 4, 100, 6, 7, 8, 9])

In [17]:
# Now, if we try to change one of the values to a string, the interpreter raises an exception
arr[5] = 'Machine Learning'

TypeError: an integer is required (got type str)

## NumPy's array

Multidimensional arrays

+ With 1 dimension, it is called a vector in mathematics
+ With 2 dimensions, it is called a matrix in mathematics

In [19]:
nparr = np.array([i for i in range(10)])

nparr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [21]:
nparr[5]

5

In [22]:
nparr[5] = 100
nparr

array([  0,   1,   2,   3,   4, 100,   6,   7,   8,   9])

In [23]:
# Note: this assigns to arr (the built-in array.array from In [11]), not to nparr
arr[5] = 'Machine Learning'

TypeError: an integer is required (got type str)

In [24]:
nparr.dtype

dtype('int64')

**If we assign a floating-point value to an element of an integer NumPy array, NumPy silently truncates it.**

In [25]:
nparr[5] = 3.14

In [26]:
# The element at index 5 became 3 rather than 3.14, because all elements of a NumPy array must have the same type
nparr

array([0, 1, 2, 3, 4, 3, 6, 7, 8, 9])
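
The truncation above happens because the array keeps its integer dtype. As a minimal sketch, assuming we actually want to keep the fractional part, we can convert the array with `astype(float)` first. The name `nparr_float` is ours and is used only for illustration.

```python
import numpy as np

nparr = np.arange(10)          # integer dtype
nparr[5] = 3.14                # stored as 3: the fractional part is dropped

# astype returns a new array with the requested dtype
nparr_float = nparr.astype(float)
nparr_float[5] = 3.14          # now stored as 3.14

print(nparr.dtype, nparr[5])
print(nparr_float.dtype, nparr_float[5])
```

Note that `astype` does not modify `nparr`; it returns a converted copy.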

**If the data types are already mixed when the NumPy array is created, the integers are converted to floating point.**

In [27]:
nparr2 = np.array([1, 2, 3.14])

In [28]:
nparr2

array([ 1.  ,  2.  ,  3.14])

In [29]:
nparr2.dtype

dtype('float64')
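
The upcast to `float64` is only the default. As a small sketch, assuming we would rather force an integer array, the `dtype` argument of `np.array` can be given explicitly; the fractional part is then dropped, just like in the assignment case. The names `mixed` and `forced_int` are illustrative.

```python
import numpy as np

mixed = np.array([1, 2, 3.14])                   # upcast to a float dtype
forced_int = np.array([1, 2, 3.14], dtype=int)   # fractional part dropped

print(mixed.dtype)
print(forced_int)
```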

## 3-4 Creating NumPy Arrays (and Matrices)

Create a NumPy array filled with zeros.

In [30]:
np.zeros(5)

array([ 0.,  0.,  0.,  0.,  0.])

In [32]:
# If no dtype is specified at creation, the default type is floating point
np.zeros(5).dtype

dtype('float64')

In [33]:
np.zeros((3,5),dtype=int)

array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])

Create a NumPy array filled with ones.

In [34]:
np.ones(10)

array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

In [37]:
np.ones((3,5))

array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

Create a matrix whose elements all equal the same value.

In [38]:
np.full((3,5),666)

array([[666, 666, 666, 666, 666],
       [666, 666, 666, 666, 666],
       [666, 666, 666, 666, 666]])

In [40]:
help(np.full)

Help on function full in module numpy.core.numeric:

full(shape, fill_value, dtype=None, order='C')
    Return a new array of given shape and type, filled with `fill_value`.
    
    Parameters
    ----------
    shape : int or sequence of ints
        Shape of the new array, e.g., ``(2, 3)`` or ``2``.
    fill_value : scalar
        Fill value.
    dtype : data-type, optional
        The desired data-type for the array  The default, `None`, means
         `np.array(fill_value).dtype`.
    order : {'C', 'F'}, optional
        Whether to store multidimensional data in C- or Fortran-contiguous
        (row- or column-wise) order in memory.
    
    Returns
    -------
    out : ndarray
        Array of `fill_value` with the given shape, dtype, and order.
    
    See Also
    --------
    zeros_like : Return an array of zeros with shape and type of input.
    ones_like : Return an array of ones with shape and type of input.
    empty_like : Return an empty array with shape and type of in

In [41]:
np.full(shape=(3,5),fill_value=666.0)

array([[ 666.,  666.,  666.,  666.,  666.],
       [ 666.,  666.,  666.,  666.,  666.],
       [ 666.,  666.,  666.,  666.,  666.]])
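
As the help text above says, when `dtype` is left as `None` the result takes its type from `fill_value`. The following sketch puts the three variants side by side; the variable names are ours.

```python
import numpy as np

ints = np.full((3, 5), 666)                    # integer fill value -> integer dtype
floats = np.full((3, 5), 666.0)                # float fill value -> float dtype
explicit = np.full((3, 5), 666, dtype=float)   # dtype given explicitly

print(ints.dtype, floats.dtype, explicit.dtype)
```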

## NumPy's arange

The step of NumPy's arange can be a decimal (non-integer) value.

In [42]:
# The resulting native Python list does not include the end point
[i for i in range(0,20,2)]

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

In [44]:
# With range, the step cannot be a decimal
[i for i in range(0,1,0.2)]

TypeError: 'float' object cannot be interpreted as an integer

In [45]:
np.arange(0, 1, 0.2)

array([ 0. ,  0.2,  0.4,  0.6,  0.8])
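
With a non-integer step, floating-point rounding can make the exact number of elements harder to reason about, so it is often easier to state the number of points directly. A minimal sketch, assuming we want the same five values as above, compares `np.arange` with `np.linspace`:

```python
import numpy as np

a = np.arange(0, 1, 0.2)      # start, stop (excluded), step
b = np.linspace(0, 0.8, 5)    # start, stop (included), number of points

print(a)
print(b)
print(np.allclose(a, b))
```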

### linspace

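+ Generates an arithmetic sequence (evenly spaced values); the third argument is the number of elements
+ "lin" stands for linear.
+ The generated data includes both the start value and the end value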

In [46]:

np.linspace(0,20,10)

array([  0.        ,   2.22222222,   4.44444444,   6.66666667,
         8.88888889,  11.11111111,  13.33333333,  15.55555556,
        17.77777778,  20.        ])

In [47]:
np.linspace(0,20,11)

array([  0.,   2.,   4.,   6.,   8.,  10.,  12.,  14.,  16.,  18.,  20.])
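
Two optional parameters of `np.linspace` are worth knowing in this context. The sketch below uses `retstep=True` to get the spacing back and `endpoint=False` to leave the stop value out; the variable names are illustrative.

```python
import numpy as np

# retstep=True also returns the spacing between consecutive points
points, step = np.linspace(0, 20, 11, retstep=True)

# endpoint=False leaves the stop value out, like range and arange do
half_open = np.linspace(0, 20, 10, endpoint=False)

print(step)
print(half_open)
```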

## The random module

In [49]:
# Generate one random integer in the range from 0 to 10; 10 itself is never returned
np.random.randint(0,10)

3

In [51]:
np.random.randint(0,1)

0

In [52]:
help(np.random.randint)

Help on built-in function randint:

randint(...) method of mtrand.RandomState instance
    randint(low, high=None, size=None, dtype='l')
    
    Return random integers from `low` (inclusive) to `high` (exclusive).
    
    Return random integers from the "discrete uniform" distribution of
    the specified dtype in the "half-open" interval [`low`, `high`). If
    `high` is None (the default), then results are from [0, `low`).
    
    Parameters
    ----------
    low : int
        Lowest (signed) integer to be drawn from the distribution (unless
        ``high=None``, in which case this parameter is one above the
        *highest* such integer).
    high : int, optional
        If provided, one above the largest (signed) integer to be drawn
        from the distribution (see above for behavior if ``high=None``).
    size : int or tuple of ints, optional
        Output shape.  If the given shape is, e.g., ``(m, n, k)``, then
        ``m * n * k`` samples are drawn.  Default is None, i

In [53]:
# The third argument is the size (shape) of the output
np.random.randint(0,10,10)

array([7, 3, 9, 0, 7, 1, 7, 6, 3, 2])

**The code below shows that the interval is closed on the left and open on the right.**

In [54]:
np.random.randint(0,1,10)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
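
The half-open interval can also be checked on a larger sample. This is only a sketch: the sample size of 1000 is an arbitrary choice, and the single-argument call relies on the `high=None` behaviour described in the help text above.

```python
import numpy as np

samples = np.random.randint(4, 8, size=1000)

# every sample lies in the half-open interval [4, 8)
print(samples.min() >= 4, samples.max() <= 7)

# with a single argument, values are drawn from [0, low)
small = np.random.randint(10, size=5)
print(small)
```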

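We can also set a random seed for random number generation. Without a seed, two identical calls return different results (In [55] and In [56]); once the seed is fixed, the result is reproducible.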

In [55]:
np.random.randint(4,8,size=(3,5))

array([[5, 5, 4, 6, 7],
       [5, 5, 7, 4, 6],
       [5, 6, 5, 4, 6]])

In [56]:
np.random.randint(4,8,size=(3,5))

array([[6, 6, 6, 4, 7],
       [6, 7, 4, 7, 7],
       [6, 7, 5, 4, 6]])

In [57]:
np.random.seed(666)
np.random.randint(4,8,size=(3,5))

array([[4, 6, 5, 6, 6],
       [6, 5, 6, 4, 5],
       [7, 6, 7, 4, 7]])
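
To see what the seed buys us, the following sketch resets the generator to the same seed twice and compares the two draws. The names `first` and `second` are ours.

```python
import numpy as np

np.random.seed(666)
first = np.random.randint(4, 8, size=(3, 5))

np.random.seed(666)            # reset the generator to the same state
second = np.random.randint(4, 8, size=(3, 5))

print(np.array_equal(first, second))
```

Seeding before each experiment is a simple way to make a notebook run repeatable while debugging.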

### Generating random floating-point numbers

In [61]:
# Generate one random float between 0 and 1
np.random.random()

0.9532313721123061

In [58]:
np.random.random(10)

array([ 0.28116849,  0.46284169,  0.23340091,  0.76706421,  0.81995656,
        0.39747625,  0.31644109,  0.15551206,  0.73460987,  0.73159555])

In [62]:
np.random.random((3,5))

array([[ 0.29097383,  0.84778197,  0.3497619 ,  0.92389692,  0.29489453],
       [ 0.52438061,  0.94253896,  0.07473949,  0.27646251,  0.4675855 ],
       [ 0.31581532,  0.39016259,  0.26832981,  0.75366384,  0.66673747]])
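
`np.random.random` draws from the interval [0.0, 1.0). A minimal sketch, assuming we want floats from some other range, stretches and shifts the result; the bounds `low` and `high` are illustrative values and do not come from the original notebook.

```python
import numpy as np

low, high = 5.0, 10.0   # illustrative bounds, not from the original notebook

# stretch the [0.0, 1.0) samples by (high - low) and shift them by low
scaled = low + (high - low) * np.random.random((3, 5))

print(scaled)
```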

## Generating normally distributed numbers with np.random.normal

In [63]:
np.random.normal()

0.06102403954600113

In [67]:
# Here 10 is the mean and 100 is the standard deviation
np.random.normal(10,100)

21.21217025535024

In [72]:
help(np.random.normal)

Help on built-in function normal:

normal(...) method of mtrand.RandomState instance
    normal(loc=0.0, scale=1.0, size=None)
    
    Draw random samples from a normal (Gaussian) distribution.
    
    The probability density function of the normal distribution, first
    derived by De Moivre and 200 years later by both Gauss and Laplace
    independently [2]_, is often called the bell curve because of
    its characteristic shape (see the example below).
    
    The normal distributions occurs often in nature.  For example, it
    describes the commonly occurring distribution of samples influenced
    by a large number of tiny, random disturbances, each with its own
    unique distribution [2]_.
    
    Parameters
    ----------
    loc : float or array_like of floats
        Mean ("centre") of the distribution.
    scale : float or array_like of floats
        Standard deviation (spread or "width") of the distribution.
    size : int or tuple of ints, optional
        Output shap

In [74]:
np.random.normal(0,1,(3,5))

array([[-1.68290077,  0.22918525, -1.75662522,  0.84463262,  0.27721986],
       [ 0.85290153,  0.1945996 ,  1.31063772,  1.5438436 , -0.52904802],
       [-0.6564723 , -0.2015057 , -0.70061583,  0.68713795, -0.02607576]])
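
A quick way to convince ourselves what `loc` and `scale` mean is to draw a large sample and look at its statistics. This is a sketch only: the seed and the sample size are arbitrary choices of ours.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the run repeatable
samples = np.random.normal(10, 100, size=100000)

print(samples.mean())  # should be close to loc = 10
print(samples.std())   # should be close to scale = 100
```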

**Learning by reading the documentation**

```
np.random.normal?
```

or

```
help(np.random.normal)
```

## 3-5 Basic Operations on NumPy Arrays (and Matrices)

In [81]:
x = np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [79]:
X = np.arange(15).reshape(3, 5)
X

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

### Basic attributes

In [82]:
x.ndim

1

In [83]:
X.ndim

2

In [84]:
x.shape

(10,)

In [85]:
X.shape

(3, 5)

In [86]:
x.size

10

In [87]:
X.size

15
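
The three attributes are related: `ndim` is the length of `shape`, and `size` is the product of the entries of `shape`. A short sketch that checks this on the matrix used above:

```python
import numpy as np

X = np.arange(15).reshape(3, 5)

print(X.ndim == len(X.shape))              # number of axes
print(X.size == X.shape[0] * X.shape[1])   # total number of elements
```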

### Accessing data in numpy.array

In [88]:
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [89]:
x[0]

0

In [90]:
x[-1]

9

In [91]:
X

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [92]:
X[0][0]

0

In [93]:
X[0,1]

1

In [94]:
X[2,2]

12

In [95]:
x[0:5]

array([0, 1, 2, 3, 4])

In [96]:
x[:5]

array([0, 1, 2, 3, 4])

In [97]:
# Take every second element to get a new array
x[::2]

array([0, 2, 4, 6, 8])

In [98]:
x[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [99]:
X

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

We want the first two rows and the first three columns of the matrix X.

In [101]:
X[:2,:3]

array([[0, 1, 2],
       [5, 6, 7]])

**Writing X[:2,:3] with two pairs of brackets, X[:2][:3], does not express "take the first two rows and the first three columns of X".**

In [100]:
X[:2][:3]

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

This is because X[:2][:3] is evaluated in the following steps.

In [103]:
X_step1 = X[:2]
X_step1

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

In [104]:
X_step1[:3]

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
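
The difference is easiest to see from the shapes. The sketch below also shows one way to make the chained form work, by slicing the columns explicitly in the second step; the variable names are ours.

```python
import numpy as np

X = np.arange(15).reshape(3, 5)

tuple_index = X[:2, :3]        # rows 0-1, columns 0-2
chained = X[:2][:3]            # rows 0-1, then "first three rows" of that result
chained_fixed = X[:2][:, :3]   # second step explicitly slices the columns

print(tuple_index.shape, chained.shape, chained_fixed.shape)
```

In practice the single-bracket form `X[:2, :3]` is the clearer choice.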

## 3-6 Merging and Splitting NumPy Arrays (and Matrices)

In [107]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])

In [108]:
x

array([1, 2, 3])

In [109]:
y

array([3, 2, 1])

concatenate (verb): to link together; to chain

Stacking vectors: the second vector is appended after the first.

In [110]:
np.concatenate([x, y])

array([1, 2, 3, 3, 2, 1])

For one-dimensional data, axis cannot be set to 1, because there is only one dimension in total; being able to choose axis = 0 or 1 would imply two dimensions.

In [121]:
np.concatenate([x, y], axis = 1)

AxisError: axis 1 is out of bounds for array of dimension 1

We can also concatenate 3 arrays.

In [112]:
z = np.array([666, 666, 666])

In [113]:
np.concatenate([x, y, z])

array([  1,   2,   3,   3,   2,   1, 666, 666, 666])

When stacking two-dimensional matrices, the direction matters.

Create a two-dimensional array.

In [114]:
A = np.array([[1, 2, 3],
              [4, 5, 6]])

In [115]:
np.concatenate([A, A])

array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])

In [116]:
help(np.concatenate)

Help on built-in function concatenate in module numpy.core.multiarray:

concatenate(...)
    concatenate((a1, a2, ...), axis=0)
    
    Join a sequence of arrays along an existing axis.
    
    Parameters
    ----------
    a1, a2, ... : sequence of array_like
        The arrays must have the same shape, except in the dimension
        corresponding to `axis` (the first, by default).
    axis : int, optional
        The axis along which the arrays will be joined.  Default is 0.
    
    Returns
    -------
    res : ndarray
        The concatenated array.
    
    See Also
    --------
    ma.concatenate : Concatenate function that preserves input masks.
    array_split : Split an array into multiple sub-arrays of equal or
                  near-equal size.
    split : Split array into a list of multiple sub-arrays of equal size.
    hsplit : Split array into multiple sub-arrays horizontally (column wise)
    vsplit : Split array into multiple sub-arrays vertically (row wise)
    dsp

The parameter axis = 0 stacks along the rows (row after row); axis = 0 is the default.

In [117]:
np.concatenate([A, A], axis = 0)

array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])

The parameter axis = 1 stacks along the columns (column after column).

In [118]:
np.concatenate([A, A], axis = 1)

array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])
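
One way to remember the `axis` argument is to watch which entry of `shape` grows. A small sketch with the same matrix `A`; the names `rows` and `cols` are ours.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

rows = np.concatenate([A, A], axis=0)   # grows along axis 0
cols = np.concatenate([A, A], axis=1)   # grows along axis 1

print(A.shape, rows.shape, cols.shape)
```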

If we do need to stack a matrix and a vector, we can use reshape to turn the vector into a matrix first.

In [123]:
np.concatenate([A, z.reshape(1,-1)])

array([[  1,   2,   3],
       [  4,   5,   6],
       [666, 666, 666]])

In [124]:
z.reshape(1,-1)

array([[666, 666, 666]])

In [125]:
A

array([[1, 2, 3],
       [4, 5, 6]])

In [126]:
A2 = np.concatenate([A, z.reshape(1,-1)])

In [127]:
A2

array([[  1,   2,   3],
       [  4,   5,   6],
       [666, 666, 666]])
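
In `reshape(1, -1)` the `-1` asks NumPy to work out that dimension from the number of elements. The sketch below shows the row and column variants, plus `np.newaxis` as an alternative way to add an axis; the variable names are illustrative.

```python
import numpy as np

z = np.array([666, 666, 666])

row = z.reshape(1, -1)       # -1 lets NumPy infer the length: shape (1, 3)
col = z.reshape(-1, 1)       # shape (3, 1)
row_alt = z[np.newaxis, :]   # same shape as row, via np.newaxis

print(row.shape, col.shape, row_alt.shape)
```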

The `h` in `hstack` stands for horizontal, and the `v` in `vstack` stands for vertical.

Next, a shortcut for concatenate.

In [128]:
# Here, two array objects with different numbers of dimensions can be stacked
np.vstack([A, z])

array([[  1,   2,   3],
       [  4,   5,   6],
       [666, 666, 666]])

In [130]:
B = np.full((2, 2), 100)

In [131]:
np.hstack([A, B])

array([[  1,   2,   3, 100, 100],
       [  4,   5,   6, 100, 100]])
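
For the arrays used here, the shortcuts give the same result as the longer `np.concatenate` calls. A sketch that checks this directly; the `_short` and `_long` names are ours.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
z = np.array([666, 666, 666])
B = np.full((2, 2), 100)

v_short = np.vstack([A, z])
v_long = np.concatenate([A, z.reshape(1, -1)], axis=0)

h_short = np.hstack([A, B])
h_long = np.concatenate([A, B], axis=1)

print(np.array_equal(v_short, v_long), np.array_equal(h_short, h_long))
```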

### Splitting

+ Splitting a one-dimensional vector
+ Splitting a two-dimensional matrix

In [132]:
x = np.arange(10)

In [133]:
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [134]:
# Split before index 3 and before index 7
np.split(x, [3, 7])

[array([0, 1, 2]), array([3, 4, 5, 6]), array([7, 8, 9])]

In [135]:
x1, x2, x3 = np.split(x, [3, 7])

In [136]:
x1

array([0, 1, 2])

In [137]:
x2

array([3, 4, 5, 6])

In [138]:
x3

array([7, 8, 9])

In [139]:
# Use a single split point
x1, x2 = np.split(x, [5])

In [140]:
x1

array([0, 1, 2, 3, 4])

In [141]:
x2

array([5, 6, 7, 8, 9])
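
Splitting and concatenating are inverse operations: k split points give k + 1 pieces, and joining the pieces restores the original array. The sketch below also passes a plain integer, which asks `np.split` for that many equal-sized pieces; the names `parts` and `halves` are ours.

```python
import numpy as np

x = np.arange(10)

parts = np.split(x, [3, 7])   # two split points -> three pieces
print(len(parts))
print(np.array_equal(np.concatenate(parts), x))

halves = np.split(x, 2)       # an integer means "this many equal pieces"
print(halves)
```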

#### Splitting a two-dimensional array (matrix)

In [142]:
A = np.arange(16).reshape(4,4)

In [143]:
A

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [144]:
A1, A2 = np.split(A, [2], axis = 0)

In [145]:
A1

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [146]:
A2

array([[ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [147]:
A1, A2 = np.split(A, [2], axis = 1)

In [148]:
A1

array([[ 0,  1],
       [ 4,  5],
       [ 8,  9],
       [12, 13]])

In [149]:
A2

array([[ 2,  3],
       [ 6,  7],
       [10, 11],
       [14, 15]])

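**Key point to remember:** the split always happens before the given index.

Shortcuts: `vsplit` splits into upper and lower parts, `hsplit` into left and right parts.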

In [150]:
upper, lower = np.vsplit(A, [1])

In [151]:
upper

array([[0, 1, 2, 3]])

In [152]:
lower

array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [153]:
left, right = np.hsplit(A, [3])

In [154]:
left

array([[ 0,  1,  2],
       [ 4,  5,  6],
       [ 8,  9, 10],
       [12, 13, 14]])

In [155]:
right

array([[ 3],
       [ 7],
       [11],
       [15]])
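
For a two-dimensional array, the shortcuts correspond to `np.split` with `axis=0` and `axis=1`. A sketch that checks the correspondence on the same matrix; the names ending in `2` are ours.

```python
import numpy as np

A = np.arange(16).reshape(4, 4)

upper, lower = np.vsplit(A, [1])
upper2, lower2 = np.split(A, [1], axis=0)   # same as vsplit

left, right = np.hsplit(A, [3])
left2, right2 = np.split(A, [3], axis=1)    # same as hsplit

print(np.array_equal(upper, upper2), np.array_equal(right, right2))
```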

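**Application**: we often need to split a matrix into features and labels, so a column-wise split with `np.hsplit` (a vertical cut between columns) is a very common operation.
The split index can also be negative.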

In [157]:
data = np.arange(16).reshape((4,4))

In [158]:
X, y = np.hsplit(data, [-1])

In [159]:
X

array([[ 0,  1,  2],
       [ 4,  5,  6],
       [ 8,  9, 10],
       [12, 13, 14]])

In [160]:
y

array([[ 3],
       [ 7],
       [11],
       [15]])
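
Note that `y` above is still two-dimensional, with shape (4, 1). As a minimal sketch, assuming a one-dimensional label vector is wanted, we can take the single column out, or get the same split with plain slicing. The names `y_flat`, `X_alt` and `y_alt` are ours.

```python
import numpy as np

data = np.arange(16).reshape((4, 4))

X, y = np.hsplit(data, [-1])
y_flat = y[:, 0]          # shape (4,) instead of (4, 1)

# the same split written with plain slicing
X_alt, y_alt = data[:, :-1], data[:, -1]

print(y.shape, y_flat.shape)
print(np.array_equal(X, X_alt), np.array_equal(y_flat, y_alt))
```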

3-7 Matrix Operations in NumPy

3-8 arg Operations in NumPy

3-9 Comparisons and Fancy Indexing in NumPy

3-10 Matplotlib Data Visualization Basics

3-11 Loading Data and Simple Data Exploration