# Feature extraction
- DictVectorizer
- feature hashing
- text feature extraction: bag of words
- image feature extraction: patches, connectivity graph

## 1.  DictVectorizer
Each dict is one sample.
- Pros: convenient, sparse output
- Cons: not particularly fast

In [3]:
from sklearn.feature_extraction import DictVectorizer

In [12]:
_X = [{'feature1': 'a'}, {'feature2': 2}]
vec = DictVectorizer()
X = vec.fit_transform(_X)
X.toarray()

array([[1., 0.],
       [0., 2.]])

In [13]:
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
vec.get_feature_names()

['feature1=a', 'feature2']

In [17]:
window = [{'word-2': 'I', 'word-1':'am', 'word+1': 'you'},
          {'word-2': 'You', 'word-1':'are', 'word+1': 'beach'},]
print(vec.fit_transform(window).toarray())
vec.get_feature_names()

[[0. 1. 1. 0. 1. 0.]
 [1. 0. 0. 1. 0. 1.]]


['word+1=beach',
 'word+1=you',
 'word-1=am',
 'word-1=are',
 'word-2=I',
 'word-2=You']

## 2. feature hashing
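Maps samples with a variable number of features to a fixed number of features.  
hash: x --> y  
- if the y values differ, the x values must differ
- if the y values are equal, the x values may still differ (a collision)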
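
The mapping above can be sketched with scikit-learn's `FeatureHasher`. The sample dicts and `n_features=8` are illustrative assumptions, not values from these notes; with so few columns, collisions are likely.

```python
from sklearn.feature_extraction import FeatureHasher

# Two samples with different feature names and different numbers of features
# (toy data, assumed for illustration).
samples = [
    {'dog': 1, 'cat': 2, 'elephant': 4},
    {'dog': 2, 'run': 5},
]

# n_features fixes the output width regardless of how many distinct
# feature names appear; 8 is deliberately tiny for display purposes.
hasher = FeatureHasher(n_features=8, input_type='dict')
hashed = hasher.transform(samples)  # stateless: no vocabulary is learned

print(hashed.shape)
print(hashed.toarray())
```

Because no vocabulary is stored, the hashed columns cannot be mapped back to feature names, which is the price paid for the fixed width.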

## 3. text feature extraction

#### 3.1 the bag of words representation of document
bag of words = bag of n-grams

- tokenizing
- counting
- normalizing  

doc = token frequencies; the positions at which words occur are ignored
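
The three steps (tokenizing, counting, normalizing) can be sketched by hand in plain Python. The helper `bag_of_words` is hypothetical, and the L1 normalization is one possible choice; the regex only mirrors `CountVectorizer`'s default token pattern (two or more word characters).

```python
import re
from collections import Counter

def bag_of_words(doc):
    """Tokenize, count, and L1-normalize one document (illustrative helper)."""
    # Tokenizing: lowercase, then keep runs of 2+ word characters
    tokens = re.findall(r'\b\w\w+\b', doc.lower())
    # Counting: token -> number of occurrences; positions are discarded
    counts = Counter(tokens)
    # Normalizing: divide by the total so the frequencies sum to 1
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

freqs = bag_of_words('I love you, and I hate you')
print(freqs)
```

The single-character token "I" is dropped by the pattern, which is also why it never shows up in the `CountVectorizer` vocabulary.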

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
# a doc is a string
# a corpus is a list of docs
#e.g., ['I love you, and I hate you',
#        'Hello, my beautiful']

In [52]:
X = ['I love you, and I hate you',
     'Hello, my beautiful you']
vec = CountVectorizer()
vec.fit_transform(X).toarray()

array([[1, 0, 1, 0, 1, 0, 2],
       [0, 1, 0, 1, 0, 1, 1]], dtype=int64)

In [25]:
vec.get_stop_words()

In [26]:
vec.get_feature_names()

['and', 'beautiful', 'hate', 'hello', 'love', 'my', 'you']

In [37]:
vec.vocabulary_['haha']

KeyError: 'haha'

In [38]:
print(vec.vocabulary_.get('haha'))
# dict.get avoids the KeyError

None


In [40]:
vec.transform(['as']).toarray()

array([[0, 0, 0, 0, 0, 0, 0]], dtype=int64)

In [42]:
vec.set_params(ngram_range=(1,2))

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [46]:
vec.fit_transform(X)
vec.get_feature_names()

['and',
 'and hate',
 'beautiful',
 'beautiful you',
 'hate',
 'hate you',
 'hello',
 'hello my',
 'love',
 'love you',
 'my',
 'my beautiful',
 'you',
 'you and']

In [47]:
# n-grams retain local positional information

#### Tf-idf term weighting
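- Some word features occur in most documents; they behave like a constant with very small variance, so they can be dropped.
- Another interpretation: these words have high cross-entropy with the target and low mutual information.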
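
As a small check of the weighting, the sketch below recomputes the idf values by hand, assuming scikit-learn's default smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` the number of documents containing term `t`. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['I love you, and I hate you',
        'Hello, my beautiful you']

counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term occurs
df = (counts > 0).sum(axis=0)

# Smoothed idf (the default, smooth_idf=True)
idf_manual = np.log((1 + n_docs) / (1 + df)) + 1

tfidf = TfidfVectorizer().fit(docs)
print(idf_manual)
print(np.allclose(idf_manual, tfidf.idf_))
```

A term that occurs in every document, such as "you" here, gets the minimum idf of 1, while rarer terms get larger weights.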

In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
vec.fit_transform(X)

<2x7 sparse matrix of type '<class 'numpy.float64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [54]:
vec.idf_

array([1.40546511, 1.40546511, 1.40546511, 1.40546511, 1.40546511,
       1.40546511, 1.        ])

In [56]:
vec.get_feature_names()

['and', 'beautiful', 'hate', 'hello', 'love', 'my', 'you']

In [61]:
# TODO: KMeans, dimensionality reduction over samples

In [62]:
# topic models: dimensionality reduction over features
# a topic is like a compound (a mixture of terms), analogous to a PCA principal component
# NMF, LDA
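
A minimal sketch of the topic-model idea using NMF on tf-idf features. The four-document corpus, `n_components=2`, and the other hyperparameters are assumptions chosen only to make the shapes visible; nothing here comes from the original notes.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumed for illustration)
docs = ['I love you, and I hate you',
        'Hello, my beautiful you',
        'love and hate',
        'hello beautiful']

tfidf = TfidfVectorizer()
doc_term = tfidf.fit_transform(docs)      # shape (n_docs, n_terms)

# Factorize doc_term ~= doc_topic @ topic_term with non-negative factors
nmf = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=500)
doc_topic = nmf.fit_transform(doc_term)   # shape (n_docs, n_topics)
topic_term = nmf.components_              # shape (n_topics, n_terms)

print(doc_topic.shape, topic_term.shape)
```

Each row of `topic_term` is one topic expressed as non-negative weights over the vocabulary, and each document is re-described by its weights over those topics, so the feature dimension shrinks from the vocabulary size to the number of topics.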

### Drawbacks of bag of words (uni-grams)
1. It cannot capture phrases or multi-word expressions
2. It carries no word-order information

Possible remedies: n-grams, character 2-grams

In [72]:
vec = CountVectorizer(analyzer='char_wb', ngram_range=(2,3))
vec.fit_transform(X)

<2x72 sparse matrix of type '<class 'numpy.int64'>'
	with 82 stored elements in Compressed Sparse Row format>

### HashingVectorizer

In [77]:
# HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

In [82]:
vec = HashingVectorizer(ngram_range=(2,3), analyzer='char_wb')
vec.fit_transform(X)

<2x1048576 sparse matrix of type '<class 'numpy.float64'>'
	with 82 stored elements in Compressed Sparse Row format>
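
The 1048576 columns above are the default `n_features=2**20`. A short sketch, assuming a smaller `n_features=2**10` purely for illustration, shows that the width is configurable and that the vectorizer is stateless, so `transform` works without a prior `fit`:

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ['I love you, and I hate you',
        'Hello, my beautiful you']

# A smaller hash space than the default; fewer columns means more collisions.
hv = HashingVectorizer(analyzer='char_wb', ngram_range=(2, 3),
                       n_features=2**10)

# No vocabulary is learned, so transform can be called directly.
hashed = hv.transform(docs)

print(hashed.shape)
print(hasattr(hv, 'vocabulary_'))
```

The trade-off is that there is no `vocabulary_` and no way to recover feature names from column indices.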

In [84]:
# NLTK

## Image feature extraction

- patch extraction
- connectivity graph
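
Both items can be sketched with `sklearn.feature_extraction.image`. The 4x4 toy image and the 2x2 patch size are assumptions chosen to keep the shapes easy to follow.

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d, grid_to_graph

# Toy 4x4 grayscale "image" with pixel values 0..15
image = np.arange(16, dtype=float).reshape(4, 4)

# Patch extraction: every overlapping 2x2 window, (4-2+1)**2 = 9 patches
patches = extract_patches_2d(image, (2, 2))
print(patches.shape)
print(patches[0])

# Connectivity graph: sparse adjacency matrix linking each pixel of a
# 4x4 grid to its grid neighbours, one row/column per pixel
graph = grid_to_graph(4, 4)
print(graph.shape)
```

Such a connectivity matrix is typically passed to structured clustering so that only neighbouring pixels can be merged.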