Numpy, Pandas, Matplotlib Quick Start

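Numpy is the most important numerical computing library in the Python ecosystem. Many third-party data analysis libraries depend on it, and `ndarray` is its core data structure. Pandas is built around the DataFrame (similar to the data frame in R) and Series (a column of a DataFrame) data structures, and supports SQL-like data query methods. Matplotlib is a plotting toolkit with a Matlab-like syntax.
### Suggested reading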

* Python for Data Analysis, 2nd edition (Chinese translation): https://github.com/apachecn/pyda-2e-zh.git

![](https://upload-images.jianshu.io/upload_images/7178691-0d965cf51eb5af9e.png?imageMogr2/auto-orient/strip|imageView2/2/w/516)
### Contents

* [Numpy](#numpy)
* [Pandas](#pandas)
* [Matplotlib](#matplotlib)

<a id="numpy"></a>

### Numpy

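We will use the data structures provided by Numpy and Pandas, including `ndarray`, `Series`, and `DataFrame`. They behave much like Python's built-in `list`, but are more powerful.

`numpy` shows up almost everywhere numerical computing is done in Python. It provides high-performance vectors, matrices, and higher-dimensional data structures. Under the hood it is implemented in C and Fortran, so vectorized operations perform very well.

To use `numpy`, install it and import it: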

In [2]:
import numpy as np

#### Creating Numpy arrays

There are a number of ways to initialize new Numpy arrays, for example by

* converting from Python lists or tuples
* using functions that are dedicated to generating numpy arrays, such as `arange`, `linspace`, etc.
* reading data from files

##### From lists

For example, to create new vector and matrix arrays from Python lists we can use the `numpy.array` function:

In [121]:
# a vector: the argument to the array function is a Python list
v = np.array([1,2,3,4])
v

array([1, 2, 3, 4])

In [4]:
# a matrix: the argument to the array function is a nested Python list
M = np.array([[1, 2], [3, 4]])
M

array([[1, 2],
       [3, 4]])

The `v` and `M` objects are both of the type `ndarray` that the `numpy` module provides.

In [5]:
type(v), type(M)

(numpy.ndarray, numpy.ndarray)

The difference between the `v` and `M` arrays is only their shapes. We can get information about the shape of an array by using the `ndarray.shape` property.

In [6]:
v.shape

(4,)

In [7]:
M.shape

(2, 2)

The number of elements in the array is available through the `ndarray.size` property:

In [8]:
M.size

4

Equivalently, we could use the functions `numpy.shape` and `numpy.size`:

In [9]:
np.shape(M)

(2, 2)

In [10]:
np.size(M)

4

So far the `numpy.ndarray` looks a lot like a Python list (or nested list). Why not simply use Python lists for computations instead of creating a new array type? 

There are several reasons:

* Python lists are very general. They can contain any kind of object and are dynamically typed. They do not support mathematical operations such as matrix and dot multiplication, and implementing such operations for Python lists would not be very efficient because of the dynamic typing.
* Numpy arrays are **statically typed** and **homogeneous**. The type of the elements is determined when the array is created.
* Numpy arrays are memory efficient.
* Because of the static typing, mathematical functions such as multiplication and addition of `numpy` arrays can be implemented efficiently in a compiled language (C and Fortran are used).
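
The points in the list above can be checked directly. The following is a minimal sketch rather than a benchmark; the variable names are illustrative and do not appear elsewhere in this lesson.

```python
import numpy as np

# A Python list can mix element types freely.
mixed_list = [1, 2.5, "three"]

# An array has a single element type; mixing ints and floats upcasts to float64.
upcast = np.array([1, 2.5])
print(upcast.dtype)        # float64

# Each element occupies a fixed number of bytes, so the data buffer
# is a compact block: 1000 float64 values take 8000 bytes.
compact = np.zeros(1000, dtype=np.float64)
print(compact.itemsize)    # 8
print(compact.nbytes)      # 8000

# Arithmetic applies to the whole array in compiled code, with no Python loop.
doubled = upcast * 2
print(doubled)             # [2. 5.]
```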

Using the `dtype` (data type) property of an `ndarray`, we can see what type the data of an array has:

In [11]:
M.dtype

dtype('int32')

We get an error if we try to assign a value of the wrong type to an element in a numpy array:

In [12]:
#M[0,0] = 'hello'
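
As a hedged illustration of the statement above, the sketch below uses a fresh array `K` (a name introduced here, not part of the original notebook) so that `M` is left untouched. Assigning a string that cannot be parsed as an integer raises `ValueError`. Assigning a float to an integer array, by contrast, does not raise: the value is silently truncated.

```python
import numpy as np

K = np.array([[1, 2], [3, 4]])   # integer array, illustrative name

try:
    K[0, 0] = 'hello'            # cannot be converted to an integer
except ValueError as err:
    print("assignment failed:", err)

K[0, 0] = 2.7                    # no error: the fractional part is dropped
print(K[0, 0])                   # 2
```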

In [13]:
M[0,0] = 5

In [14]:
M

array([[5, 2],
       [3, 4]])

If we want, we can explicitly define the type of the array data when we create it, using the `dtype` keyword argument: 

In [15]:
N = np.array([[1, 2], [3, 4]], dtype=complex)
N

array([[1.+0.j, 2.+0.j],
       [3.+0.j, 4.+0.j]])

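Common types that can be used with `dtype` are: `int`, `float`, `complex`, `bool`, and `object` (arbitrary Python objects, such as strings).

We can also explicitly define the bit size of the data types, for example: `int64`, `int16`, `float128`, `complex128`. Note that `float128` is platform dependent and is not available in every NumPy build.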
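
A short sketch of what an explicit bit size means in practice. The array names are illustrative; `np.iinfo`, `np.finfo`, and `astype` are standard NumPy calls.

```python
import numpy as np

# An explicit bit width fixes both the storage size and the value range.
small = np.array([1, 2, 3], dtype=np.int16)
print(small.itemsize)                                    # 2 bytes per element
print(np.iinfo(np.int16).min, np.iinfo(np.int16).max)    # -32768 32767

# astype returns a converted copy; the original array is unchanged.
as_float = small.astype(np.float32)
print(as_float.dtype)                                    # float32
print(small.dtype)                                       # int16

# finfo reports the precision of a floating-point type:
# more bits means a smaller machine epsilon.
print(np.finfo(np.float32).eps < np.finfo(np.float16).eps)   # True
```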

#### Using array-generating functions

For larger arrays it is impractical to initialize the data manually using explicit Python lists. Instead we can use one of the many functions in `numpy` that generate arrays of different forms. Some of the more common are:

In [16]:
# create a range (the end value is not included)
x = np.arange(-1, 1, 0.1) # arguments: start, stop, step
x

array([-1.00000000e+00, -9.00000000e-01, -8.00000000e-01, -7.00000000e-01,
       -6.00000000e-01, -5.00000000e-01, -4.00000000e-01, -3.00000000e-01,
       -2.00000000e-01, -1.00000000e-01, -2.22044605e-16,  1.00000000e-01,
        2.00000000e-01,  3.00000000e-01,  4.00000000e-01,  5.00000000e-01,
        6.00000000e-01,  7.00000000e-01,  8.00000000e-01,  9.00000000e-01])

In [17]:
# dtype is determined automatically unless specified
x.dtype

dtype('float64')

In [18]:
# range of integers
y = np.arange(0, 10, 1) # arguments: start, stop, step
y

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [19]:
y.dtype

dtype('int32')

In [20]:
# specifying dtype as float
z = np.arange(0, 10, 1, dtype=float) # arguments: start, stop, step
z

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [21]:
z.dtype

dtype('float64')

In [22]:
# using linspace, both end points ARE included
np.linspace(0, 10, 11) # arguments: start, stop, N

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])
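
The `arange` output earlier in this section contains `-2.22044605e-16` where an exact zero would be expected, which is floating-point rounding from the non-integer step. When the number of points is known, `linspace` is often a convenient alternative. The sketch below reproduces the same grid with `endpoint=False`; the names `a` and `b` are illustrative.

```python
import numpy as np

# arange with a float step: the entry that should be 0 is only close to 0.
a = np.arange(-1, 1, 0.1)

# linspace is given the end points and the number of samples.
# endpoint=False excludes the stop value, mimicking arange.
b = np.linspace(-1, 1, 20, endpoint=False)

print(len(a), len(b))        # 20 20
print(np.allclose(a, b))     # True
```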

In [23]:
# similar to meshgrid in MATLAB
x, y = np.mgrid[0:5, 0:5] 

In [24]:
x

array([[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4]])

In [25]:
y

array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]])
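
For comparison, here is a small sketch showing how the `mgrid` result relates to `np.meshgrid`. The variable names are illustrative.

```python
import numpy as np

x_mg, y_mg = np.mgrid[0:5, 0:5]

# np.meshgrid defaults to Cartesian ('xy') ordering; indexing='ij'
# gives the matrix-style ordering that mgrid uses.
x_ij, y_ij = np.meshgrid(np.arange(5), np.arange(5), indexing='ij')

print(np.array_equal(x_mg, x_ij))   # True
print(np.array_equal(y_mg, y_ij))   # True
```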

In [26]:
# uniform random numbers in the half-open interval [0, 1)
np.random.rand(5,5)

array([[0.00583808, 0.1704263 , 0.87586323, 0.37684269, 0.48814357],
       [0.4720978 , 0.40856105, 0.67434085, 0.15701151, 0.02114673],
       [0.9492753 , 0.12680123, 0.71332553, 0.15193932, 0.16940657],
       [0.85610915, 0.8850134 , 0.68407961, 0.7921087 , 0.80875792],
       [0.75900125, 0.54961475, 0.36957449, 0.20550467, 0.73218426]])

In [27]:
# standard normal distributed random numbers
np.random.randn(5,5)

array([[-1.53891582,  1.25111509,  0.47199981, -1.71509568,  1.38216143],
       [-0.22338497, -2.18017841,  0.51780571, -0.87987174,  0.56963239],
       [ 0.69994533,  0.43528367,  1.08757965,  0.90929234,  0.39745858],
       [ 0.60235582,  0.77096185,  0.14220938,  1.1609974 , -1.12042253],
       [ 0.24495994, -2.13287238,  0.53191867, -0.14585603, -0.18011253]])
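
Random output like the above differs on every run. If reproducible numbers are wanted, one option is NumPy's `Generator` interface, sketched below. The seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # the seed value is arbitrary

u = rng.random((5, 5))            # uniform on [0, 1)
n = rng.standard_normal((5, 5))   # standard normal

# Re-creating the generator with the same seed reproduces the numbers.
rng_again = np.random.default_rng(seed=42)
print(np.array_equal(u, rng_again.random((5, 5))))   # True
```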

In [28]:
# diagonal matrix
np.diag([1,2,3])

array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])

In [29]:
# zeros
np.zeros((3,3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [30]:
# ones
np.ones((3,3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [31]:
# ones as int
np.ones((3,3), dtype=int)

array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]])

In [32]:
# three-dimensional
np.ones((3,3,3))

array([[[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]]])

In [33]:
# four-dimensional
np.ones((3,3,3,3))

array([[[[1., 1., 1.],
         [1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.],
         [1., 1., 1.]]],


       [[[1., 1., 1.],
         [1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.],
         [1., 1., 1.]]],


       [[[1., 1., 1.],
         [1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.],
         [1., 1., 1.]]]])

#### Indexing

We can index elements in an array using the square bracket and indices:

In [34]:
# v is a vector, and has only one dimension, taking one index
v[0]

1

In [35]:
# M is a matrix, or a 2 dimensional array, taking two indices 
M[1,1]

4

In [36]:
# If we omit an index of a multidimensional array, it returns the whole row (or, in general, an (N-1)-dimensional array)
M[1]

array([3, 4])

The same thing can be achieved by using `:` instead of an index:

In [37]:
M[1,:] # row 1

array([3, 4])

In [38]:
M[:,1] # column 1

array([2, 4])

We can assign new values to elements in an array using indexing:

In [39]:
M[0,0] = -1
M

array([[-1,  2],
       [ 3,  4]])

In [40]:
# also works for rows and columns
M[0,:] = 0
M[:,1] = -2

In [41]:
M

array([[ 0, -2],
       [ 3, -2]])

#### Index slicing

Index slicing is the technical name for the syntax `M[lower:upper:step]` to extract part of an array:

In [42]:
A = np.array([1,2,3,4,5])
A

array([1, 2, 3, 4, 5])

In [43]:
A[1:3]

array([2, 3])

Array slices are *mutable*: if they are assigned a new value the original array from which the slice was extracted is modified:

In [44]:
A[1:3] = [-2,-3]
A

array([ 1, -2, -3,  4,  5])
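
The reason assignment through a slice changes the original is that a slice is a view onto the same memory, not a copy. A minimal sketch, with illustrative names:

```python
import numpy as np

base = np.array([1, 2, 3, 4, 5])

view = base[1:3]          # a slice is a view onto the same memory
view[0] = 99
print(base)               # [ 1 99  3  4  5]

independent = base[1:3].copy()   # an explicit copy owns its data
independent[0] = -1
print(base)               # [ 1 99  3  4  5]  (unchanged by the copy)

print(np.shares_memory(base, view))          # True
print(np.shares_memory(base, independent))   # False
```

Use `.copy()` when a slice needs to be modified without touching the array it came from.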

We can omit any of the three parameters in `M[lower:upper:step]`:

In [45]:
A[::] # lower, upper, step all take the default values

array([ 1, -2, -3,  4,  5])

In [46]:
A[::2] # step is 2, lower and upper default to the beginning and end of the array

array([ 1, -3,  5])

In [47]:
A[:3] # first three elements

array([ 1, -2, -3])

In [48]:
A[3:] # elements from index 3

array([4, 5])

Negative indices count from the end of the array (positive indices from the beginning):

In [49]:
A = np.array([1,2,3,4,5])

In [50]:
A[-1] # the last element in the array

5

In [51]:
A[-3:] # the last three elements

array([3, 4, 5])

Index slicing works exactly the same way for multidimensional arrays:

In [52]:
A = np.array([[n+m*10 for n in range(5)] for m in range(5)])
A

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])

In [53]:
# a block from the original array
A[1:4, 1:4]

array([[11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]])

In [54]:
# strides
A[::2, ::2]

array([[ 0,  2,  4],
       [20, 22, 24],
       [40, 42, 44]])

#### Fancy indexing

Fancy indexing is the name for when an array or list is used in place of an index:

In [55]:
row_indices = [1, 2, 3]
A[row_indices]

array([[10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34]])

In [56]:
col_indices = [1, 2, 3]
A[row_indices, col_indices]

array([11, 22, 33])

In [57]:
# equivalent to
A[1,1], A[2,2], A[3,3]

(11, 22, 33)
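
Paired index lists select one element per pair, as shown above. To select the full block of rows and columns instead, one option is `np.ix_`. The sketch below rebuilds the same 5x5 array under the illustrative name `A5`.

```python
import numpy as np

A5 = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
rows = [1, 2, 3]
cols = [1, 2, 3]

# Paired index lists pick one element per (row, col) pair.
print(A5[rows, cols])            # [11 22 33]

# np.ix_ builds an open mesh, selecting every row/column combination.
block = A5[np.ix_(rows, cols)]
print(np.array_equal(block, A5[1:4, 1:4]))   # True

# Unlike a slice, a fancy-indexed result is a copy.
block[0, 0] = -1
print(A5[1, 1])                  # 11
```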

We can also use index *masks*: if the index mask is a Numpy array with data type `bool`, then an element is selected (True) or not (False) depending on the value of the index mask at the position of each element:

In [58]:
B = np.array([n for n in range(5)])
B

array([0, 1, 2, 3, 4])

In [59]:
row_mask = np.array([True, False, True, False, False])
B[row_mask]

array([0, 2])

In [60]:
# same thing
row_mask = np.array([1,0,1,0,0], dtype=bool)
B[row_mask]

array([0, 2])

This feature is very useful to conditionally select elements from an array, using for example comparison operators:

In [61]:
x = np.arange(0, 10, 0.5)
x

array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. ,
       6.5, 7. , 7.5, 8. , 8.5, 9. , 9.5])

In [62]:
# want values of x that are at least 5 and have no decimal component
mask = (x >= 5) & (x % 1 == 0)
mask

array([False, False, False, False, False, False, False, False, False,
       False,  True, False,  True, False,  True, False,  True, False,
        True, False])

In [63]:
x[mask]

array([5., 6., 7., 8., 9.])

In [64]:
x[x > 5]

array([5.5, 6. , 6.5, 7. , 7.5, 8. , 8.5, 9. , 9.5])
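
A few more mask operations, sketched on the same range of values (stored here as `xs`, an illustrative name): combining conditions with `|` and `~`, recovering indices with `np.where`, and assigning through a mask.

```python
import numpy as np

xs = np.arange(0, 10, 0.5)

# | is element-wise "or", ~ is element-wise "not"; the parentheses are
# required because these operators bind tighter than comparisons.
outside = (xs < 1) | (xs > 9)
print(xs[outside])          # [0.  0.5 9.5]
print(xs[~outside].size)    # 17

# np.where with one argument returns the indices where the mask is True.
print(np.where(outside)[0])     # [ 0  1 19]

# A mask can also be the target of an assignment.
clipped = xs.copy()
clipped[clipped > 5] = 5
print(clipped.max())        # 5.0
```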

#### Linear algebra

Vectorizing code is the key to writing efficient numerical code with Python/Numpy. That means formulating as much of a program as possible in terms of matrix and vector operations, such as matrix-matrix multiplication.

In [65]:
v1 = np.arange(0, 5)
v1

array([0, 1, 2, 3, 4])

In [66]:
v1 * 2

array([0, 2, 4, 6, 8])

In [67]:
v1 + 2

array([2, 3, 4, 5, 6])

In [68]:
A

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])

In [69]:
np.dot(A, A)

array([[ 300,  310,  320,  330,  340],
       [1300, 1360, 1420, 1480, 1540],
       [2300, 2410, 2520, 2630, 2740],
       [3300, 3460, 3620, 3780, 3940],
       [4300, 4510, 4720, 4930, 5140]])

In [70]:
np.dot(A, v1)

array([ 30, 130, 230, 330, 430])

In [71]:
np.dot(v1, v1)

30
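
To make the idea of vectorization concrete, the sketch below computes the same matrix-vector product twice: once with explicit Python loops and once with the `@` operator, which is equivalent to `np.dot` for these 1-D and 2-D arrays. The names `A5` and `v5` are illustrative stand-ins for `A` and `v1`.

```python
import numpy as np

A5 = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
v5 = np.arange(0, 5)

# Explicit Python loops: one multiply-add at a time.
looped = np.zeros(5, dtype=A5.dtype)
for i in range(5):
    for j in range(5):
        looped[i] += A5[i, j] * v5[j]

# Vectorized: the @ operator performs the same matrix-vector product.
vectorized = A5 @ v5

print(vectorized)                          # [ 30 130 230 330 430]
print(np.array_equal(looped, vectorized))  # True

# Note that * on arrays is element-wise, not a matrix product.
print(np.array_equal(A5 * A5, A5 @ A5))    # False
```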

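Alternatively, we can cast the array objects to the type `matrix`. This changes the behavior of the standard arithmetic operators `+, -, *` to use matrix algebra. Note that the NumPy documentation no longer recommends the `matrix` class, even for linear algebra; it is shown here for completeness. Matrix math is a large topic that we won't cover in depth here.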

In [72]:
M = np.matrix(A)
v = np.matrix(v1).T # make it a column vector

In [73]:
M

matrix([[ 0,  1,  2,  3,  4],
        [10, 11, 12, 13, 14],
        [20, 21, 22, 23, 24],
        [30, 31, 32, 33, 34],
        [40, 41, 42, 43, 44]])

In [74]:
v

matrix([[0],
        [1],
        [2],
        [3],
        [4]])

In [75]:
M*M

matrix([[ 300,  310,  320,  330,  340],
        [1300, 1360, 1420, 1480, 1540],
        [2300, 2410, 2520, 2630, 2740],
        [3300, 3460, 3620, 3780, 3940],
        [4300, 4510, 4720, 4930, 5140]])

In [76]:
M*v

matrix([[ 30],
        [130],
        [230],
        [330],
        [430]])

#### Data computations

In [77]:
np.mean(v1)

2.0

In [78]:
np.std(v1), np.var(v1)

(1.4142135623730951, 2.0)

In [79]:
v1.min()

0

In [80]:
v1.max()

4

In [81]:
np.sum(v1)

10
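
These reductions also accept an `axis` argument for multidimensional arrays, and `np.std` has a `ddof` parameter that switches between the population and sample formulas. A short sketch with illustrative names:

```python
import numpy as np

A5 = np.array([[n + m * 10 for n in range(5)] for m in range(5)])

print(A5.sum())            # 550, over all elements
print(A5.sum(axis=0))      # [100 105 110 115 120], one sum per column
print(A5.mean(axis=1))     # [ 2. 12. 22. 32. 42.], one mean per row

# np.std defaults to the population formula (ddof=0);
# ddof=1 gives the sample standard deviation.
v5 = np.arange(0, 5)
print(np.std(v5, ddof=1))  # about 1.58
```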

#### Iterating over array elements

In [82]:
for element in v1:
    print(element)

0
1
2
3
4


In [83]:
M = np.array([[1,2], [3,4]])
M

array([[1, 2],
       [3, 4]])

In [84]:
for row in M:
    print("row", row)    
    for element in row:
        print(element)

row [1 2]
1
2
row [3 4]
3
4


In [85]:
for row_idx, row in enumerate(M):
    print("row_idx", row_idx, "row", row)    
    for col_idx, element in enumerate(row):
        print("col_idx", col_idx, "element", element) 
        # modify the matrix M: square each element
        M[row_idx, col_idx] = element ** 2

row_idx 0 row [1 2]
col_idx 0 element 1
col_idx 1 element 2
row_idx 1 row [3 4]
col_idx 0 element 3
col_idx 1 element 4


In [86]:
# each element in M is now squared
M

array([[ 1,  4],
       [ 9, 16]])

In [87]:
# element-wise power without a loop; M was already squared above, so this yields the fourth powers of the original values
M ** 2

array([[  1,  16],
       [ 81, 256]], dtype=int32)
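
When both the index and the value are needed, `np.ndenumerate` is one alternative to nested `enumerate` calls. The sketch below uses a fresh array `P` (an illustrative name) and then does the squaring with a single in-place vectorized operation.

```python
import numpy as np

P = np.array([[1, 2], [3, 4]])   # illustrative name

# np.ndenumerate yields (index tuple, value) pairs for any dimensionality.
for (row_idx, col_idx), element in np.ndenumerate(P):
    print(row_idx, col_idx, element)

# For the squaring itself, an in-place vectorized operation avoids the loop.
P **= 2
print(P.tolist())                # [[1, 4], [9, 16]]
```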

<a id="pandas"></a>

### Pandas

#### What is Pandas?

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. 

#### Library features

* DataFrame object for data manipulation with integrated indexing
* Tools for reading and writing data between in-memory data structures and different file formats
* Data alignment and integrated handling of missing data
* Reshaping and pivoting of data sets
* Label-based slicing, fancy indexing, and subsetting of large data sets
* Data structure column insertion and deletion
* Group-by engine allowing split-apply-combine operations on data sets
* Data set merging and joining
* Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure
* Time-series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging

The library is highly optimized for performance, with critical code paths written in Cython or C.

#### Download data

I copied data from http://www.sccoos.org/data/autoshorestations/autoshorestations.php?study=Scripps%20Pier and pasted it into Excel, then saved it as a CSV file. Download [scripps_pier_20151110.csv](https://raw.githubusercontent.com/cuttlefishh/python-for-data-analysis/master/data/scripps_pier_20151110.csv) from GitHub and save it to a directory called `data` at the same level as `lessons`.

#### Install packages

Install pandas and matplotlib using conda if you haven't already. If you're not sure whether they are installed, you can type `conda list` at a terminal prompt.

```
conda install pandas
conda install matplotlib
```

#### Import modules

In [88]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#### Read data from CSV

In [89]:
data1 = pd.read_csv('../data/scripps_pier_20151110.csv', index_col=None, header=0)

If the file is not found at this path, `read_csv` raises `FileNotFoundError`; adjust the path to wherever you saved the CSV.

In [0]:
data1.head()

In [0]:
data2 = pd.read_csv('../data/scripps_pier_20151110.csv', index_col=0, header=0)

In [0]:
data2.head()

In [0]:
data2.describe()
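
The difference between `index_col=None` and `index_col=0` can be tried without the data file. The sketch below reads a small in-memory CSV: the column names are the ones used in this lesson, but the rows and their values are hypothetical, and the column order is an assumption.

```python
import io
import pandas as pd

# Hypothetical rows; only the column names are taken from this lesson.
csv_text = """Date,chl (ug/L),pres (dbar),sal (PSU),temp (C)
11/10/15 1:42,22.3,3.7,33.2,19.5
11/10/15 1:35,22.7,3.8,33.2,19.6
11/10/15 1:29,22.1,3.8,33.3,19.6
"""

# index_col=None: rows get a default integer index, Date stays a column.
d1 = pd.read_csv(io.StringIO(csv_text), index_col=None, header=0)
print(d1.index.tolist())        # [0, 1, 2]
print('Date' in d1.columns)     # True

# index_col=0: the first column (Date) becomes the row index.
d2 = pd.read_csv(io.StringIO(csv_text), index_col=0, header=0)
print(d2.index.name)            # Date
print(d2.shape)                 # (3, 4)
```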

#### Indexing in pandas

Pandas DataFrames provide two main indexers:

* `loc` works on labels in the index.
* `iloc` works on the positions in the index (so it only takes integers).
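
The distinction is easiest to see when the labels are integers that differ from the positions. A minimal sketch on a hypothetical frame (the values and the labels 10, 20, 30 are made up for illustration):

```python
import pandas as pd

# Hypothetical frame whose integer labels differ from the positions.
df = pd.DataFrame({'temp (C)': [19.5, 19.6, 19.7]}, index=[10, 20, 30])

print(df.loc[10, 'temp (C)'])    # 19.5, the row labelled 10
print(df.iloc[0, 0])             # 19.5, the row at position 0

# Label slices include the end label; position slices exclude the end.
print(len(df.loc[10:20]))        # 2
print(len(df.iloc[0:1]))         # 1
```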

#### With Date as the index column (data2)

In [0]:
data2.iloc[0]

In [0]:
data2['temp (C)'].head(10)

#### With no index column (data1)

In [0]:
data1.iloc[0]

In [0]:
data1.loc[0]

In [0]:
data2.loc['11/10/15 1:42']

In [0]:
data1['Date'].head()

In [0]:
data1.Date.head()

In [0]:
data1.iloc[:,0].head()

#### Convert date/time to timestamp object

In [0]:
time = pd.to_datetime(data1.iloc[:,0])
time.head()

In [0]:
type(time)

In [0]:
time.dtype
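
When the layout of the date strings is known, passing an explicit `format` to `pd.to_datetime` removes any day/month ambiguity. The sketch below assumes the month/day/two-digit-year layout suggested by labels such as `11/10/15 1:42`; the series `raw` is hypothetical.

```python
import pandas as pd

# Hypothetical strings in the month/day/two-digit-year style used above.
raw = pd.Series(['11/10/15 1:42', '11/10/15 1:35'])

# An explicit format avoids day/month ambiguity during parsing.
stamps = pd.to_datetime(raw, format='%m/%d/%y %H:%M')

print(stamps.iloc[0])            # 2015-11-10 01:42:00
print(stamps.dt.hour.tolist())   # [1, 1]
```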

<a id="matplotlib"></a>

### Matplotlib

#### Plot a single variable vs. time

In [0]:
fig = plt.figure()
plt.plot(time, data1['chl (ug/L)'])
plt.xlabel('Time')
plt.ylabel('Chlorophyll')
fig.savefig('scripps_pier_Chlorophyll.pdf')
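
The same plot can also be written with Matplotlib's object-oriented interface, which keeps explicit handles to the figure and axes. The sketch below is self-contained: the three data points are hypothetical, and the `Agg` backend line is only there so the code can run as a script without a display (it is not needed in a notebook).

```python
import matplotlib
matplotlib.use("Agg")            # off-screen backend, for running as a script
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values standing in for the pier data.
t = pd.to_datetime(['2015-11-10 01:29', '2015-11-10 01:35', '2015-11-10 01:42'])
chl = [22.1, 22.7, 22.3]

fig, ax = plt.subplots()
ax.plot(t, chl)
ax.set_xlabel('Time')
ax.set_ylabel('Chlorophyll')
fig.autofmt_xdate()              # slant the date labels so they do not overlap
```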

#### Plot each response variable in a loop

In [0]:
# rename the columns so they can be inserted as file names
data1.rename(columns={'chl (ug/L)': 'chlorophyll', 'pres (dbar)': 'pressure', 
                      'sal (PSU)': 'salinity', 'temp (C)': 'temperature'}, inplace=True)

In [0]:
# index is numerical starting from 0
data1.head()

In [0]:
data1.columns

In [0]:
for col in data1.columns:
    if col != 'Date':
        fig = plt.figure()
        plt.plot(time, data1[col])
        plt.xlabel('time')
        plt.ylabel(col)
        fig.savefig('scripps_pier_%s.pdf' % col)

#### Plot all response variables together

In [0]:
# index is the date/time as a string (object)
data2.head()

In [0]:
# convert the index to a datetime index
data2.index = pd.to_datetime(data2.index)

In [0]:
# now we see the index has changed to a standard datetime format
data2.head()

In [0]:
# closer look at the first item in the index
data2.index[0]

In [0]:
# timestamp has a method date() to get the date
data2.index[0].date()

In [0]:
# get all the rows with date 2015-11-10
data2.loc['2015-11-10']

In [0]:
# get all the rows with hour 22 (10pm-11pm)
data2[data2.index.hour == 22]
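
The datetime-index selections above can be tried on a small hypothetical frame. The sketch also groups by the hour component, which is one way to get a per-hour aggregate; all values and the name `frame` are made up for illustration.

```python
import pandas as pd

# Hypothetical readings around the 22:00 hour.
idx = pd.to_datetime(['2015-11-10 21:50', '2015-11-10 22:10',
                      '2015-11-10 22:40', '2015-11-11 00:05'])
frame = pd.DataFrame({'temp (C)': [19.0, 19.2, 19.4, 18.8]}, index=idx)

# Partial-string indexing: every row on the given date.
print(len(frame.loc['2015-11-10']))        # 3

# Boolean mask built from a component of the index.
print(len(frame[frame.index.hour == 22]))  # 2

# Grouping by the same component gives one aggregate per hour.
hourly = frame.groupby(frame.index.hour)['temp (C)'].mean()
print(round(hourly.loc[22], 1))            # 19.3
```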

In [0]:
# use the built-in plot() method of a pandas dataframe;
# it creates its own figure, so no separate plt.figure() call is needed
ax = data2.plot()
ax.legend(loc='best')
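
When the columns have different scales, drawing them all on one set of axes can flatten the smaller ones. One option is `subplots=True`, which gives each column its own axes. The sketch below uses a hypothetical two-column frame; the `Agg` backend line is only there so the code runs as a script without a display.

```python
import matplotlib
matplotlib.use("Agg")            # off-screen backend, for running as a script
import pandas as pd

# Hypothetical readings; the two columns have different scales.
idx = pd.to_datetime(['2015-11-10 01:29', '2015-11-10 01:35', '2015-11-10 01:42'])
frame = pd.DataFrame({'pressure': [3.8, 3.8, 3.7],
                      'salinity': [33.3, 33.2, 33.2]}, index=idx)

# subplots=True draws each column on its own axes and returns them as an array.
axes = frame.plot(subplots=True)
print(len(axes))                 # 2
```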

### P.S. About that name...

The name "Pandas" actually has nothing to do with the animal. It is derived from the term "panel data", an econometrics term for multidimensional structured data sets.

![pandas](http://wdy.h-cdn.co/assets/16/05/980x490/landscape-1454612525-baby-pandas.jpg)