### Intro


### [Loading Features from dicts](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer)

* converts feature arrays represented as lists of dicts into the NumPy/SciPy representation used by scikit-learn estimators; string-valued features are one-hot encoded

In [1]:
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
vec.fit_transform(measurements).toarray()
# scikit-learn >= 1.0; older releases used vec.get_feature_names()
vec.get_feature_names_out().tolist()

['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']

In [2]:
# also useful for training sequence classifiers in NLP:
# each dict holds the features of a window around a word of interest

pos_window = [
    {
        'word-2': 'the',
        'pos-2': 'DT',
        'word-1': 'cat',
        'pos-1': 'NN',
        'word+1': 'on',
        'pos+1': 'PP',
    },
    # in a real application one would extract many such dictionaries
]

# vectorize into sparse 2D array
vec = DictVectorizer()
pos_vectorized = vec.fit_transform(pos_window)
print(pos_vectorized)               

pos_vectorized.toarray()
# scikit-learn >= 1.0; older releases used vec.get_feature_names()
print(vec.get_feature_names_out().tolist())

  (0, 0)	1.0
  (0, 1)	1.0
  (0, 2)	1.0
  (0, 3)	1.0
  (0, 4)	1.0
  (0, 5)	1.0
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']


### [Feature Hashing](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher) |
[Wikipedia](https://en.wikipedia.org/wiki/Feature_hashing) |
[demo](hashing_vs_dict_vectorizer.ipynb)

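* instead of building a hash table (vocabulary) of the features seen during training, as the vectorizers do,
* apply a hash function to each feature name to determine its column index in the sample matrix directly
* advantages: faster, lower memory use
* disadvantage: loss of inspectability; the hasher does not remember what the input features looked like
* collisions between unrelated features are possible; a signed 32-bit hash is used, and the sign of the hash sets the sign of the stored value, so collisions tend to cancel out rather than accumulate
* so the maximum number of features supported is 2^31 - 1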
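
A minimal sketch of the points above, assuming dict-style input. The sample dicts and `n_features=16` are illustrative choices, not values from the original notes.

```python
from sklearn.feature_extraction import FeatureHasher

# n_features fixes the number of output columns up front (illustrative value)
hasher = FeatureHasher(n_features=16, input_type='dict')

# made-up samples: feature name -> value
samples = [
    {'dog': 1, 'cat': 2, 'elephant': 4},
    {'dog': 2, 'run': 5},
]

# FeatureHasher is stateless: transform() needs no prior fit()
# and no vocabulary is stored on the object
X = hasher.transform(samples)
print(X.shape)
print(X.toarray())
```

The column count depends only on `n_features`, never on how many distinct feature names appear, which is where the memory saving comes from.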

In [6]:
# example: word-level NLP task
# features extracted from (token, part_of_speech) pairs
def token_features(token, part_of_speech):
    if token.isdigit():
        yield "numeric"
    else:
        yield "token={}".format(token.lower())
        yield "token,pos={},{}".format(token, part_of_speech)
    if token[0].isupper():
        yield "uppercase_initial"
    if token.isupper():
        yield "all_uppercase"
    yield "pos={}".format(part_of_speech)

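# usage sketch, left commented out because `corpus` and `pos_tagger`
# are not defined in this notebook
#from sklearn.feature_extraction import FeatureHasher
#raw_X = (token_features(tok, pos_tagger(tok)) for tok in corpus)
#hasher = FeatureHasher(input_type='string')
#X = hasher.transform(raw_X)
#X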
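
The commented-out lines cannot run as written because `corpus` and `pos_tagger` are never defined. A runnable sketch under stated assumptions: `tagged` is a hypothetical hand-tagged token list standing in for the tagger, and `simple_features` is a reduced stand-in for `token_features`.

```python
from sklearn.feature_extraction import FeatureHasher

# hypothetical hand-tagged tokens replacing corpus + pos_tagger
tagged = [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('42', 'CD')]

def simple_features(token, pos):
    # reduced stand-in for token_features: one string per feature
    yield "token={}".format(token.lower())
    yield "pos={}".format(pos)

# one iterable of feature strings per sample
raw_X = (simple_features(tok, pos) for tok, pos in tagged)

# n_features=2**10 is an illustrative choice
hasher = FeatureHasher(n_features=2**10, input_type='string')
X = hasher.transform(raw_X)
print(X.shape)
```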

### [Feature Extraction (Text)](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) |
[demo](topics_extraction_with_nmf_lda.ipynb)

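* capabilities: tokenizing (splitting strings into tokens and assigning each token an integer id), counting token occurrences per document, normalizing
* the resulting matrix is sparse; on real corpora typically more than 99% of entries are zero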
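
A small sketch of how the sparsity of a count matrix could be measured. The three documents are made up for illustration; a toy corpus like this is far denser than a real one, so the >99% figure only shows up at scale.

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up documents for illustration
docs = [
    'the cat sat on the mat',
    'the dog chased the cat',
    'stock prices fell sharply today',
]
X = CountVectorizer().fit_transform(docs)

n_rows, n_cols = X.shape
# nnz = number of explicitly stored (non-zero) entries
density = X.nnz / (n_rows * n_cols)
print("shape:", X.shape)
print("stored entries:", X.nnz)
print("sparsity: {:.0%}".format(1 - density))
```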


In [9]:
# vectorizer usage
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
    'Hello Hello Hello Hello Hello',
]
X = vectorizer.fit_transform(corpus) # tokenize the corpus
X   

<5x10 sparse matrix of type '<class 'numpy.int64'>'
	with 20 stored elements in Compressed Sparse Row format>

In [10]:
# default config: tokens are words of 2 or more alphanumeric characters; punctuation acts as a separator
analyze = vectorizer.build_analyzer()
analyze("This is a text document to analyze.") == (
    ['this', 'is', 'text', 'document', 'to', 'analyze'])

# scikit-learn >= 1.0; older releases used vectorizer.get_feature_names()
vectorizer.get_feature_names_out().tolist() == (
    ['and', 'document', 'first', 'hello', 'is', 'one',
     'second', 'the', 'third', 'this'])

X.toarray()

array([[0, 1, 1, 0, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 0, 1, 0, 0, 1, 0, 1],
       [0, 0, 0, 5, 0, 0, 0, 0, 0, 0]], dtype=int64)

In [11]:
# vocabulary_ maps each feature name to its column index
vectorizer.vocabulary_.get('document')

1

In [12]:
# words not seen in the training corpus are ignored in later calls to transform():
vectorizer.transform(['Something completely new.']).toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [14]:
# preserving local ordering info
bigram_vectorizer = CountVectorizer(
    ngram_range=(1, 2),
    token_pattern=r'\b\w+\b', 
    min_df=1)

analyze = bigram_vectorizer.build_analyzer()

analyze('Bi-grams are cool!') == (
    ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])

True

In [15]:
# vocab extracted by vectorizer
X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
X_2

array([[0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0, 5, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

In [16]:
# is "is this" in the corpus?
feature_index = bigram_vectorizer.vocabulary_.get('is this')
X_2[:, feature_index]

array([0, 0, 0, 1, 0], dtype=int64)

### [API (TFIDF: transformer)](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)

### [API (TFIDF: vectorizer)](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)

* re-weights raw term counts into floating-point tf-idf values suitable for use by a classifier, down-weighting terms that occur in many documents

In [17]:
# normalization done by a Transformer
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(smooth_idf=False)
transformer

TfidfTransformer(norm='l2', smooth_idf=False, sublinear_tf=False,
         use_idf=True)

In [18]:
# example
counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]

tfidf = transformer.fit_transform(counts)
tfidf, tfidf.toarray()

(<6x3 sparse matrix of type '<class 'numpy.float64'>'
 	with 9 stored elements in Compressed Sparse Row format>,
 array([[ 0.81940995,  0.        ,  0.57320793],
        [ 1.        ,  0.        ,  0.        ],
        [ 1.        ,  0.        ,  0.        ],
        [ 1.        ,  0.        ,  0.        ],
        [ 0.47330339,  0.88089948,  0.        ],
        [ 0.58149261,  0.        ,  0.81355169]]))
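
The numbers above can be reproduced by hand, which makes the weighting explicit. This sketch assumes the `smooth_idf=False` variant used in the cell, where idf(t) = ln(n / df(t)) + 1 and each row is then L2-normalized; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)      # number of documents containing each term
idf = np.log(n_docs / df) + 1.0    # smooth_idf=False variant
weighted = counts * idf            # tf * idf, before normalization
# L2-normalize each row (norm='l2')
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfTransformer(smooth_idf=False).fit_transform(counts).toarray()
print(np.allclose(manual, reference))
print(manual[0])
```

The first term occurs in every document, so its idf is exactly 1 and it carries the least weight relative to its raw count.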

In [19]:
# combines CountVectorizer+TfidfTransformer in single model
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1)
vectorizer.fit_transform(corpus)

<5x10 sparse matrix of type '<class 'numpy.float64'>'
	with 20 stored elements in Compressed Sparse Row format>

### Decoding Text Files

* text is made of characters; files hold bytes, i.e. characters encoded with some encoding
* to work with text files in Python, their bytes must be decoded to Unicode
* common encodings: ASCII, Latin-1, KOI8-R, UTF-8, UTF-16
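
A short sketch of what decoding means in practice, using plain `bytes.decode`. The byte strings are fragments of the German examples used in the next cell; the variable names are illustrative.

```python
utf8_bytes = b"gegr\xc3\xbc\xc3\x9ft"   # UTF-8 bytes
latin1_bytes = b"Ger\xfcche"             # Latin-1 bytes

# decoding with the right codec recovers the text
print(utf8_bytes.decode('utf-8'))
print(latin1_bytes.decode('latin-1'))

# the wrong codec either garbles the text silently ...
garbled = utf8_bytes.decode('latin-1')
print(repr(garbled))

# ... or fails outright
try:
    latin1_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    print("cannot decode as UTF-8:", exc.reason)
```

The vectorizers expose the same choice through their `encoding` and `decode_error` arguments when they are given bytes or files.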

In [20]:
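# chardet guesses each encoding; the decoded texts are vectorized and the learned vocab is printed.
# the guess can be wrong: the first text below is mis-detected, hence the garbled token in the output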
import chardet    
text1 = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut"
text2 = b"holdselig sind deine Ger\xfcche"
text3 = b"\xff\xfeA\x00u\x00f\x00 \x00F\x00l\x00\xfc\x00g\x00e\x00l\x00n\x00 \x00d\x00e\x00s\x00 \x00G\x00e\x00s\x00a\x00n\x00g\x00e\x00s\x00,\x00 \x00H\x00e\x00r\x00z\x00l\x00i\x00e\x00b\x00c\x00h\x00e\x00n\x00,\x00 \x00t\x00r\x00a\x00g\x00 \x00i\x00c\x00h\x00 \x00d\x00i\x00c\x00h\x00 \x00f\x00o\x00r\x00t\x00"
decoded = [x.decode(chardet.detect(x)['encoding'])
           for x in (text1, text2, text3)]        
v = CountVectorizer().fit(decoded).vocabulary_    
print(v)  # print the learned vocabulary once

{'flügeln': 4, 'gegrăźă': 6, 'mein': 12, 'sind': 16, 'herzliebchen': 9, 'dich': 3, 'fort': 5, 'sei': 15, 'holdselig': 10, 'gesanges': 8, 'deine': 1, 'sauerkraut': 14, 'ich': 11, 'auf': 0, 'mir': 13, 'des': 2, 'gerüche': 7, 'trag': 17}

### Bag of Words Limitations

* Bag-of-words with unigrams ignores word order, so it can't capture phrases or multi-word expressions.
* N-grams can help. Ex: word 2-grams (bigrams) are pairs of consecutive words; character n-grams also tolerate misspellings such as 'wprds'.

In [18]:
# using the 'char_wb' analyzer (character n-grams inside word boundaries)
ngram_vectorizer = CountVectorizer(
    analyzer='char_wb', 
    ngram_range=(2, 2), 
    min_df=1)
counts = ngram_vectorizer.fit_transform(
    ['words', 'wprds'])
# scikit-learn >= 1.0; older releases used get_feature_names()
ngram_vectorizer.get_feature_names_out().tolist() == (
    [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'])

counts.toarray().astype(int)

array([[1, 1, 1, 0, 1, 1, 1, 0],
       [1, 1, 0, 1, 1, 1, 0, 1]])

In [20]:
# comparing 'char_wb' (n-grams stay inside word boundaries) with 'char' (n-grams may span words)
ngram_vectorizer1 = CountVectorizer(
    analyzer='char_wb', 
    ngram_range=(5, 5), 
    min_df=1)

ngram_vectorizer1.fit_transform(
    ['jumpy fox'])

# scikit-learn >= 1.0; older releases used get_feature_names()
ngram_vectorizer1.get_feature_names_out().tolist() == (
    [' fox ', ' jump', 'jumpy', 'umpy '])


ngram_vectorizer2 = CountVectorizer(
    analyzer='char', 
    ngram_range=(5, 5), 
    min_df=1)

ngram_vectorizer2.fit_transform(
    ['jumpy fox'])
                      
# scikit-learn >= 1.0; older releases used get_feature_names()
ngram_vectorizer2.get_feature_names_out().tolist() == (
    ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox'])

True

### Vectorizing Large Text Corpus with Hashing
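
A minimal sketch of the idea behind this heading, assuming `HashingVectorizer` is the intended tool: it combines tokenization with the hashing trick, so no vocabulary has to be held in memory. The corpus is the small one used earlier and `n_features=2**10` is an illustrative choice (the default is 2**20).

```python
from sklearn.feature_extraction.text import HashingVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# stateless: transform() works without fit(), and no vocabulary_ is stored
hv = HashingVectorizer(n_features=2**10)
X = hv.transform(corpus)
print(X.shape)
```

The trade-off is the same as for feature hashing in general: columns can no longer be mapped back to the tokens that produced them.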

### Custom Vectorizing

In [21]:
# pass a custom tokenizer callable to the constructor
def my_tokenizer(s):
    return s.split()

vectorizer = CountVectorizer(tokenizer=my_tokenizer)
vectorizer.build_analyzer()(u"Some... punctuation!") == (
    ['some...', 'punctuation!'])

True

### Advanced Token Analysis using NLTK

[NLTK](http://www.nltk.org/)

In [23]:
# assumes the NLTK tokenizer and WordNet data have been fetched with nltk.download()
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer 
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

vect = CountVectorizer(tokenizer=LemmaTokenizer()) 
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<__main__.LemmaTokenizer object at 0x7fbb71eea588>,
        vocabulary=None)

### Image Extraction

[extract 2D image patches](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.image.extract_patches_2d.html#sklearn.feature_extraction.image.extract_patches_2d) |
[reconstruct_from_patches_2d](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.image.reconstruct_from_patches_2d.html#sklearn.feature_extraction.image.reconstruct_from_patches_2d)

In [21]:
# generate 4x4 pixel image, RGB format
import numpy as np
from sklearn.feature_extraction import image

one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
one_image[:, :, 0]  # R channel of a fake RGB picture

patches = image.extract_patches_2d(
    one_image, 
    (2, 2), 
    max_patches=2,
    random_state=0)

print(patches.shape)

patches[:, :, :, 0]

patches = image.extract_patches_2d(
    one_image, (2, 2))

print(patches.shape)
print(patches[4, :, :, 0])

(2, 2, 2, 3)
(9, 2, 2, 3)
[[15 18]
 [27 30]]


In [27]:
# attempt image reconstruction from patches
# by averaging on overlaps
reconstructed = image.reconstruct_from_patches_2d(
    patches, 
    (4, 4, 3))

np.testing.assert_array_equal(
    one_image, reconstructed)

five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
# patch_size is keyword-only in recent scikit-learn releases
patches = image.PatchExtractor(patch_size=(2, 2)).transform(five_images)
patches.shape

(45, 2, 2, 3)

In [None]:
# image connectivity graph
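
This cell was left empty. A minimal sketch of what it might contain, assuming the intent was the pixel-connectivity helpers in `sklearn.feature_extraction.image`; the 4x4 grayscale array is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction import image

# made-up 4x4 grayscale image
gray = np.arange(4 * 4, dtype=float).reshape(4, 4)

# img_to_graph: sparse graph over pixels, with edges between
# neighbouring pixels weighted by the gradient of the image
graph = image.img_to_graph(gray)
print(graph.shape)

# grid_to_graph: the same neighbour structure built from the grid
# shape alone, without looking at pixel values
connectivity = image.grid_to_graph(n_x=4, n_y=4)
print(connectivity.shape)
```

Both matrices have one row and one column per pixel, so a 4x4 image gives a 16x16 graph.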
