# Tutorial_03: numpy and its companions

## 1. What is numpy


- numpy provides a MATLAB-like experience for **matrix** operations on the fly, for example:

```
x[:,1] # multi-axis slicing, different from a Python list
np.linspace(0,10,21)
np.arange(0,10, 0.5)
```
> why matrix?
>
> rows: instances, columns: features;
> a matrix is not inherently data, but data is inherently a matrix

- fast: an ndarray wraps a C array, i.e. a pointer to one contiguous, typed data buffer (unlike a Python list, which stores pointers to separate objects)
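
The contiguous-buffer claim can be inspected directly. The sketch below is only an illustration (the array `x` is made up here); it looks at the memory layout flags and strides of a small array.

```python
import numpy as np

# Illustrative array: 2 rows, 3 columns of float64 (8 bytes per element).
x = np.arange(6, dtype=np.float64).reshape(2, 3)

# The elements live in one contiguous, C-ordered (row-major) buffer.
print(x.flags['C_CONTIGUOUS'])  # True

# Bytes to step to the next row and to the next column.
print(x.strides)  # (24, 8)

# A transpose is just a view on the same buffer with swapped strides.
print(x.T.strides)               # (8, 24)
print(np.shares_memory(x, x.T))  # True
```

Because the layout is a plain typed buffer, operations such as `x + 1` can run as a C loop over memory instead of a Python-level loop over objects.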

In [17]:
# import  numpy
# dir(numpy)

## 2. data in numpy: ndarray

### 2.1 ndarray attributes: dtype, shape

In [6]:
import numpy as np
x=np.array([[1.1,2.2,3.3],[4.4,5.5,6.6]], dtype=np.float64)

# type(x)

# x.dtype
# int8 uint8 .... int64 float16 ... float64 (float128, bool, string)
# x.astype
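
As a small sketch of the `dtype` attribute and `astype`, using the same array as the cell above (the choice of `np.int32` as target type is just an example):

```python
import numpy as np

x = np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]], dtype=np.float64)

print(type(x))     # <class 'numpy.ndarray'>
print(x.dtype)     # float64
print(x.itemsize)  # 8 (bytes per element)

# astype returns a converted copy; float -> int drops the fractional part.
xi = x.astype(np.int32)
print(xi.dtype)    # int32
print(xi.tolist()) # [[1, 2, 3], [4, 5, 6]]

# The original array is unchanged.
print(x.dtype)     # float64
```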

In [24]:
import numpy as np
x=np.array([[1.1,2.2,3.3],[4.4,5.5,6.6]], dtype=np.float64)

# x.shape
# x.size # number of elements

# x.reshape((3,2))
# x.reshape((3,-1))
# x.flatten()

# np.concatenate([x, x], axis=1)
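
The shape-related calls in this cell can be sketched as follows; the variable names `a`, `b`, `f`, `wide`, and `tall` are chosen here for illustration only.

```python
import numpy as np

x = np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]], dtype=np.float64)

print(x.shape)  # (2, 3)
print(x.size)   # 6

# reshape keeps the element order; -1 lets numpy infer that dimension.
a = x.reshape((3, 2))
b = x.reshape((3, -1))
print(a.shape, b.shape)  # (3, 2) (3, 2)

# flatten returns a 1d copy.
f = x.flatten()
print(f.shape)  # (6,)

# axis=1 joins side by side (more columns); axis=0 stacks (more rows).
wide = np.concatenate([x, x], axis=1)
tall = np.concatenate([x, x], axis=0)
print(wide.shape)  # (2, 6)
print(tall.shape)  # (4, 3)
```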


### 2.2 How to initialize: initializers, loaders

In [41]:
# initializers

import numpy as np

# construct from a list or another sequence object
# x=np.array([[1.1,2.2,3.3],[4.4,5.5,6.6]], dtype=np.float64)
# x.tolist() # convert x back to a (nested) list; list(x) only unpacks the first axis
# y = np.array(x)

# constant constructors
# x=np.linspace(0,10,21)
# x=np.arange(0,10, 0.5)
# x=np.zeros((5,5))
x=np.ones((5,5))
# x=np.eye(5)

# random constructors
# x=np.random.randn(3,5)
# X = np.random.uniform(low=0., high=1., size=(5,5))
# np.random.randint
# np.random.shuffle
# np.random.normal
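
A short sketch of how these constructors behave. The seed value and the sizes are arbitrary choices for illustration; the random values themselves are not shown because they depend on the seed.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make reruns repeatable

# linspace takes a point count and includes the endpoint;
# arange takes a step and excludes the endpoint.
a = np.linspace(0, 10, 21)
b = np.arange(0, 10, 0.5)
print(len(a), a[-1])  # 21 10.0
print(len(b), b[-1])  # 20 9.5

# Constant constructors take the shape as a tuple.
print(np.zeros((2, 3)).shape)  # (2, 3)
print(np.eye(3).trace())       # 3.0

# Random constructors.
r = np.random.randn(3, 5)                             # standard normal, shape (3, 5)
u = np.random.uniform(low=0., high=1., size=(5, 5))   # uniform in [0, 1)
k = np.random.randint(0, 10, size=4)                  # integers in [0, 10)

# shuffle permutes the array in place and returns None.
v = np.arange(5)
np.random.shuffle(v)
print(sorted(v.tolist()))  # [0, 1, 2, 3, 4]
```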

In [1]:
# loaders

import numpy as np

# load from a text file
x=np.loadtxt("../data/winequality-red.csv", delimiter=";", skiprows=1)
np.savetxt("../data/test.csv", x, fmt='%.4f', delimiter=',', newline='\n')


# pickle serialization / deserialization
# import pickle
# with open("../data/test.pkl", "wb") as f:
#     pickle.dump(x, f)

# with open("../data/test.pkl", "rb") as f:
#     x=pickle.load(f)

# numpy binary (.npy) serialization / deserialization
# np.save("../data/test.npy", x)
# x = np.load("../data/test.npy")


# load from a MATLAB file
# scipy.io.loadmat 
# scipy.io.savemat

# load from an image file
# scipy.misc (its image readers were removed in newer SciPy releases)
# PIL

# load from an audio file
# scipy.io.wavfile

# print(x)

# loading other file types
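
The text and binary round trips can be sketched without the wine-quality file. The example below assumes a throwaway temporary directory and a small made-up array, so the file names are illustrative only.

```python
import os
import tempfile

import numpy as np

x = np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])

with tempfile.TemporaryDirectory() as tmp:
    # Text round trip: human-readable, precision limited by fmt.
    csv_path = os.path.join(tmp, "test.csv")
    np.savetxt(csv_path, x, fmt="%.4f", delimiter=",")
    x_txt = np.loadtxt(csv_path, delimiter=",")

    # Binary .npy round trip: keeps dtype and shape exactly.
    npy_path = os.path.join(tmp, "test.npy")
    np.save(npy_path, x)
    x_npy = np.load(npy_path)

print(np.allclose(x, x_txt))     # True
print(np.array_equal(x, x_npy))  # True
print(x_npy.dtype, x_npy.shape)  # float64 (2, 3)
```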

### 2.3 How to operate on arrays

- **indexing and slicing**
```
x[idx]
x[min:max:step]
```

In [16]:
import numpy as np
x=np.array([[1.1,2.2,3.3],[4.4,5.5,6.6],[7.7,8.8,9.9]], dtype=np.float64)
# x
# x[1]

# x[:, 1]

# x[1:,:]
# x[:,::2]
# x[1,-2:]

# x=np.array([1.,2.,6.,8.])
# (x[1:] + x[:-1])/2.0



array([ 5.5,  6.6])
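
For reference, a sketch of what each slice in this cell selects, using the same 3x3 array (the name `mid` is introduced here for illustration):

```python
import numpy as np

x = np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6], [7.7, 8.8, 9.9]], dtype=np.float64)

print(x[1])            # second row: [4.4 5.5 6.6]
print(x[:, 1])         # second column: [2.2 5.5 8.8]
print(x[1:, :].shape)  # rows 1 and 2: (2, 3)
print(x[:, ::2])       # every other column, i.e. columns 0 and 2
print(x[1, -2:])       # last two entries of row 1: [5.5 6.6]

# Shifted slices give neighbouring pairs, e.g. midpoints of adjacent values.
v = np.array([1., 2., 6., 8.])
mid = (v[1:] + v[:-1]) / 2.0
print(mid.tolist())    # [1.5, 4.0, 7.0]
```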

- **mask (boolean indexing)**

In [22]:
import numpy as np
x=np.array([[1.1,2.2,3.3],[4.4,5.5,6.6]], dtype=np.float64)
# x<3
# print(x<3)
# print(np.where(x<3))
# print(x[np.where(x<3)])

x[x<3]=0
print(x)

[[ 0.   0.   3.3]
 [ 4.4  5.5  6.6]]
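
The steps of boolean indexing can be sketched one at a time (the names `mask`, `rows`, and `cols` are introduced here for illustration):

```python
import numpy as np

x = np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]], dtype=np.float64)

# A comparison yields a boolean array of the same shape.
mask = x < 3
print(mask.tolist())  # [[True, True, False], [False, False, False]]

# np.where returns the row and column indices where the mask is True.
rows, cols = np.where(mask)
print(rows.tolist(), cols.tolist())  # [0, 0] [0, 1]

# Indexing with the mask selects the matching elements as a 1d array.
print(x[mask].tolist())  # [1.1, 2.2]

# Assigning through the mask modifies x in place.
x[mask] = 0
print(x.tolist())  # [[0.0, 0.0, 3.3], [4.4, 5.5, 6.6]]
```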


- **matrix operation**

In [13]:
import numpy as np
x=np.array([[1.1,2.2,3.3],[4.4,5.5,6.6]], dtype=np.float64)

y=x.T
z=x.dot(y)
z=np.matmul(x,y)
# print(z)
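
A sketch comparing the matrix product with the elementwise product. The `@` operator is not used in the cell above; it is shown here as the operator form of `np.matmul`.

```python
import numpy as np

x = np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]], dtype=np.float64)
y = x.T  # shape (3, 2)

# Three equivalent ways to write the matrix product for 2d arrays.
z_dot = x.dot(y)
z_matmul = np.matmul(x, y)
z_at = x @ y
print(z_dot.shape)  # (2, 2)
print(np.allclose(z_dot, z_matmul) and np.allclose(z_dot, z_at))  # True

# By contrast, * is elementwise and keeps the original shape.
sq = x * x
print(sq.shape)  # (2, 3)
```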

- **ufuncs and reductions**

In [15]:
import numpy as np
x=np.array([[1.1,2.2,3.3],[4.4,5.5,6.6]], dtype=np.float64)

# y=np.exp(x)
# y=np.log(x)
# y=x*x

# np.ceil(x)
# np.floor(x)
# np.max(x, axis=1)
# np.argmax(x, axis=1)
# np.mean(x)
# np.sum(x)
# np.cumsum(x)
# np.std(x) # np.sqrt(np.sum((x-np.mean(x))**2)/float(x.size))

# np.sort(x)
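
The difference between elementwise functions and reductions along an axis can be sketched as follows (variable names such as `row_max` and `manual_std` are illustrative):

```python
import numpy as np

x = np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]], dtype=np.float64)

# A ufunc works element by element and keeps the shape.
y = np.exp(x)
print(y.shape)  # (2, 3)

# axis=1 reduces across the columns: one result per row.
row_max = np.max(x, axis=1)
row_argmax = np.argmax(x, axis=1)
print(row_max.tolist())     # [3.3, 6.6]
print(row_argmax.tolist())  # [2, 2]

# axis=0 gives one result per column; no axis reduces over all elements.
col_sum = np.sum(x, axis=0)
total = np.sum(x)
print(col_sum.shape)  # (3,)

# np.std defaults to the population standard deviation (ddof=0).
manual_std = np.sqrt(np.sum((x - np.mean(x)) ** 2) / x.size)
print(np.isclose(np.std(x), manual_std))  # True
```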

- **linear algebra (np.linalg)**

In [44]:
import numpy as np
v=np.array([1.,3., 5.])
np.linalg.norm(v)
# np.linalg.eig(m)
# np.linalg.det(m)

5.9160797830996161
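
The commented `eig` and `det` calls need a square matrix `m`, which the cell does not define. The sketch below assumes a small diagonal matrix so that the results are easy to check by hand.

```python
import numpy as np

v = np.array([1., 3., 5.])
norm = np.linalg.norm(v)  # Euclidean norm: sqrt(1 + 9 + 25)
print(np.isclose(norm, np.sqrt(35.0)))  # True

# Hypothetical matrix chosen for illustration: diagonal entries 2 and 3.
m = np.array([[2., 0.], [0., 3.]])

det = np.linalg.det(m)
eigvals, eigvecs = np.linalg.eig(m)

print(np.isclose(det, 6.0))                     # True (product of the diagonal)
print(np.allclose(np.sort(eigvals), [2., 3.]))  # True
```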

## 3. Other tools related to numpy

![anaconda.jpg](images/anaconda.jpg)


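### 3.1 Introduction to scipy

numpy (basic data structures) > scipy (scientific computing) > scikit-learn (machine learning)

Depends on numpy; mainly provides numerical methods and utility functions commonly used in scientific computing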

![numpy](images/numpy.png)
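
As one example of such a numerical method, the sketch below uses `scipy.integrate.quad` for numerical integration. The choice of routine and integrand is ours, purely for illustration; it assumes scipy is installed.

```python
import numpy as np
from scipy import integrate

# Integrate sin(t) from 0 to pi; the exact answer is 2.
value, abs_error = integrate.quad(np.sin, 0.0, np.pi)

print(np.isclose(value, 2.0))  # True
print(abs_error < 1e-8)        # True (quad also returns an error estimate)
```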

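### 3.2 Introduction to scikit-learn

numpy (basic data structures) > scipy (scientific computing) > scikit-learn (machine learning)

Depends on scipy (and numpy); mainly provides machine-learning tools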

a uniform interface for all estimators:

- transformer
- classifier/regressor
- pipeline = make_pipeline(transformer1,...transformerN, Classifier) 

Common methods:
```
estimator.fit(X_train, y_train) # supervised learning
estimator.fit(X_train) # unsupervised learning

estimator.score(X_test, y_test) # supervised learning
estimator.predict(X_new) # supervised learning

estimator.transform(X) # unsupervised learning
```

<table>
<tr style="border:None; font-size:20px; padding:10px;"><th><code>model.predict</code></th><th><code>model.transform</code></th></tr>
<tr style="border:None; font-size:20px; padding:10px;"><td>Classification</td><td>Preprocessing</td></tr>
<tr style="border:None; font-size:20px; padding:10px;"><td>Regression</td><td>Dimensionality Reduction</td></tr>
<tr style="border:None; font-size:20px; padding:10px;"><td>Clustering</td><td>Feature Extraction</td></tr>
<tr style="border:None; font-size:20px; padding:10px;"><td>&nbsp;</td><td>Feature Selection</td></tr>

</table>
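
The uniform `fit` / `predict` / `score` / `transform` interface can be sketched on synthetic data. Everything specific here is an assumption for illustration: the two-blob toy dataset, the even/odd train/test split, and the choice of `StandardScaler` plus `LogisticRegression`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic toy data: two well-separated Gaussian blobs, 50 points each.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 2.0, rng.randn(50, 2) + 2.0])
y = np.array([0] * 50 + [1] * 50)

# Simple split: even rows for training, odd rows for testing.
X_train, y_train = X[::2], y[::2]
X_test, y_test = X[1::2], y[1::2]

# transformer + classifier chained into a single estimator.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

accuracy = pipeline.score(X_test, y_test)
predictions = pipeline.predict(X_test[:3])
print(accuracy > 0.9)     # True on this easy toy problem
print(predictions.shape)  # (3,)

# A transformer on its own: fit learns the statistics, transform applies them.
scaler = StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True
```

The pipeline exposes the same methods as its final step, so it can be used anywhere a single classifier would be.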


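### 3.3 Introduction to pandas

numpy (basic data structures) > pandas (good at analyzing, cleaning, and editing time series and tabular data)

Depends on numpy; pandas excels at data wrangling (cleaning, format conversion, merging data) and SQL-like manipulation (insert, delete, update, query)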

In [2]:
import pandas as pd
from pandas import Series, DataFrame

# A Series resembles a numpy 1d array and even works directly with np.exp(obj), but it can carry a custom name and behaves like a dict, e.g. obj['a']
series = Series({'a': 7, 'b':-5, 'c':3}) # initialize from a dict

data = DataFrame({'Name': ['A','B','C','D','E'], 'Price':[121,40,100,130,11]})
# df = DataFrame({"col1":{'a': 7, 'b':-5, 'c':3},"col2":{'a':5,'b':9,'c':1}}, columns=['col2','col1'], index=['a','b','c'])

# data

In [3]:
# quick summary statistics
# data.describe()

In [4]:
# print(data["Name"])
# print(data.iloc[1])

In [6]:
# data.values
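
Putting the commented lines together, a sketch of the basic Series and DataFrame operations on the same toy data. The price filter at the end is an extra step added here to illustrate the SQL-like querying mentioned above.

```python
import numpy as np
import pandas as pd

series = pd.Series({'a': 7, 'b': -5, 'c': 3})
print(series['a'])  # 7 (dict-like access by label)

# numpy ufuncs apply elementwise and keep the index labels.
exp_series = np.exp(series)
print(list(exp_series.index))  # ['a', 'b', 'c']

data = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E'],
                     'Price': [121, 40, 100, 130, 11]})

print(data['Name'].tolist())   # one column: ['A', 'B', 'C', 'D', 'E']
print(data.iloc[1]['Name'])    # row by position: B
print(data['Price'].mean())    # 80.4

# Boolean masks work as in numpy, similar to a SQL WHERE clause.
expensive = data[data['Price'] > 100]
print(expensive['Name'].tolist())  # ['A', 'D']

# .values exposes the underlying numpy array (object dtype for mixed columns).
print(data.values.shape)  # (5, 2)
```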