# Chapter 8: Working with Time Series

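This chapter depends on the following libraries:

- numpy: a tool for fast array operations
- scipy: a scientific computing library

For instructions on installing the Python libraries, see Appendix A of the printed book.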
____

## 8.1 Embedding

### Simple text recognition


In [1]:
import re
import numpy as np

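def process_text(text):
    """Replace punctuation marks with spaces."""
    dot_word = r'。|，|\.|\?|,|!|，|\(|\)|\(|\)| '
    return re.sub(dot_word, ' ', text)

def text2vec(text):
    """Convert a text into a vector of normalized word frequencies."""
    cleaned_text = process_text(text) # preprocess the input text
    text_vec = cleaned_text.split()
    vocab_list = list(set(text_vec)) # vocabulary (unique words; order is arbitrary)
    numer_vec = [0.]*len(vocab_list) # numeric vector of word counts
    for word in text_vec:
        # each time a word is seen, add 1 at its position in the numeric vector
        numer_vec[vocab_list.index(word)] += 1
    # normalize so that all values in the vector sum to 1
    return np.array(numer_vec) / np.sum(numer_vec)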

In [2]:
text = 'Without a doubt, a loving and friendly puppy or dog can put an instant smile on your face! When you adopt a dog from Atlanta Humane Society, you gain a wonderful canine companion. But most of all, when you adopt a rescue dog, you have the ability to bond with one of Atlanta’s forgotten and neglected animals.'

text2vec(text)

array([ 0.03508772,  0.01754386,  0.01754386,  0.01754386,  0.01754386,
        0.01754386,  0.01754386,  0.01754386,  0.01754386,  0.01754386,
        0.01754386,  0.01754386,  0.01754386,  0.01754386,  0.01754386,
        0.01754386,  0.01754386,  0.01754386,  0.01754386,  0.01754386,
        0.01754386,  0.01754386,  0.01754386,  0.01754386,  0.01754386,
        0.01754386,  0.01754386,  0.01754386,  0.01754386,  0.01754386,
        0.01754386,  0.07017544,  0.0877193 ,  0.01754386,  0.01754386,
        0.03508772,  0.03508772,  0.05263158,  0.01754386,  0.01754386,
        0.01754386,  0.01754386,  0.01754386,  0.01754386,  0.01754386])
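
Because `text2vec` builds its vocabulary from a `set`, the position of each word in the output can change from one run to the next. The following is a minimal sketch of one way to make the ordering reproducible, assuming a sorted vocabulary is acceptable. The function name `text2vec_sorted` and the simplified punctuation pattern are illustrative choices, not part of the original code.

```python
import re
from collections import Counter

import numpy as np


def text2vec_sorted(text):
    """Return (vocabulary, normalized word-frequency vector) for a text.

    Hypothetical variant of text2vec: the vocabulary is sorted so that
    each word keeps the same position on every run.
    """
    # Replace common punctuation with spaces, then split on whitespace.
    words = re.sub(r'[。，.?,!()]', ' ', text).split()
    counts = Counter(words)                 # word -> number of occurrences
    vocab = sorted(counts)                  # deterministic word order
    vec = np.array([counts[w] for w in vocab], dtype=float)
    return vocab, vec / vec.sum()           # entries sum to 1


vocab, vec = text2vec_sorted('a dog, a puppy and a dog!')
print(vocab)  # ['a', 'and', 'dog', 'puppy']
print(vec)    # frequencies 3/7, 1/7, 2/7, 1/7 in that order
```

Returning the vocabulary together with the vector makes it possible to tell which word each entry refers to.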

### Deep learning starts with understanding word meaning

Euclidean distance and cosine similarity:

In [3]:
import numpy as np
from scipy import spatial

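def euc_distance(v1, v2):
    """Measure how far apart two vectors are using Euclidean distance."""
    return np.linalg.norm(np.array(v1) - np.array(v2))

def cos_similar(v1, v2):
    """Measure how similar two vectors are using cosine similarity."""
    return 1 - spatial.distance.cosine(np.array(v1), np.array(v2))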

Compute the similarity of one-hot encoded word vectors:

In [4]:
puppy_vec = [1.0, 0.0] + [0.0] * 998
dog_vec = [0.0, 1.0] + [0.0] * 998
some_word_vec = [0.0] * 499 + [1.0] + [0.0] * 500

print(euc_distance(puppy_vec, dog_vec))
print(euc_distance(puppy_vec, some_word_vec))
print(cos_similar(puppy_vec, dog_vec))
print(cos_similar(puppy_vec, some_word_vec))

1.41421356237
1.41421356237
0.0
0.0
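
The output above shows the limitation of one-hot encoding: every pair of distinct words is equally far apart, so "puppy" is no closer to "dog" than to any other word. A dense embedding instead gives each word a short real-valued vector, so related words can lie close together. The sketch below illustrates the idea with made-up 3-dimensional vectors. The values and the word "car" are hypothetical, chosen only for illustration, and do not come from a trained model. `cos_sim` is a NumPy-only helper written for this example.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; the numbers are invented for
# illustration and are not the output of any trained model.
embeddings = {
    'puppy': np.array([0.9, 0.8, 0.1]),
    'dog':   np.array([0.8, 0.9, 0.2]),
    'car':   np.array([0.1, 0.2, 0.9]),
}


def cos_sim(v1, v2):
    """Cosine similarity: dot product divided by the product of the norms."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))


def euc_dist(v1, v2):
    """Euclidean distance between two vectors."""
    return np.linalg.norm(v1 - v2)


for other in ('dog', 'car'):
    print(other,
          euc_dist(embeddings['puppy'], embeddings[other]),
          cos_sim(embeddings['puppy'], embeddings[other]))
```

With these invented vectors, "puppy" has a smaller distance and a higher cosine similarity to "dog" than to "car", which is the kind of structure one-hot vectors cannot express.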


## 8.2 Models suited to sequences

### LSTM (Long Short-Term Memory)

In Keras, you can add an LSTM layer to a model as follows:

In [None]:
from keras.layers import LSTM
model.add(LSTM(32)) # `model` is an existing Sequential model; the output vector has length 32
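
To make the meaning of "output vector length 32" concrete, here is a minimal NumPy sketch of the standard LSTM cell equations. It is not Keras's implementation: the gate ordering, the weight shapes, the input dimension of 8, the sequence length of 5, and the random initialization are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM cell.

    Assumed shapes: W (4*units, input_dim), U (4*units, units), b (4*units,).
    """
    units = h_prev.shape[0]
    z = W @ x + U @ h_prev + b            # pre-activations for all four gates
    i = sigmoid(z[0 * units:1 * units])   # input gate
    f = sigmoid(z[1 * units:2 * units])   # forget gate
    g = np.tanh(z[2 * units:3 * units])   # candidate cell state
    o = sigmoid(z[3 * units:4 * units])   # output gate
    c = f * c_prev + i * g                # new cell state
    h = o * np.tanh(c)                    # new hidden state (the layer output)
    return h, c


units, input_dim, timesteps = 32, 8, 5    # illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * units, input_dim))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
sequence = rng.normal(size=(timesteps, input_dim))
for x in sequence:                        # read the sequence one step at a time
    h, c = lstm_step(x, h, c, W, U, b)

print(h.shape)  # (32,)
```

Whatever the length of the input sequence, the hidden state after the last step has 32 entries, one per unit.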

In a Caffe prototxt file, an LSTM layer is configured as follows:

    layer {
      name: "lstm" # LSTM layer
      type: "Lstm"
      bottom: "data"
      bottom: "xxx"
      top: "lstm"

      lstm_param {
        num_output: 32 # dimension of the LSTM output vector
        clipping_threshold: 0.1 # maximum gradient threshold, set to counter exploding gradients
        weight_filler {
          type: "gaussian" # how the weights are initialized
          std: 0.1
        }
        bias_filler {
          type: "constant" # how the biases are initialized
        }
      }
    }
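
The `clipping_threshold` setting caps the size of the gradient. As a rough illustration of the general idea, the sketch below shows one common form of gradient clipping, rescaling by L2 norm. This is a generic example and not Caffe's code: how the layer applies the threshold internally (by norm or element-wise) is not covered here, and the function name `clip_by_norm` is hypothetical.

```python
import numpy as np


def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm does not exceed threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        # Keep the direction, shrink the length to the threshold.
        return grad * (threshold / norm)
    return grad


grad = np.array([3.0, 4.0])                   # L2 norm is 5.0
clipped = clip_by_norm(grad, threshold=0.1)
print(np.linalg.norm(clipped))                # approximately 0.1
```

A gradient whose norm is already below the threshold passes through unchanged, so clipping only affects unusually large updates.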