In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer

In [2]:
texts = ["I love love Programming.", " I love Math.", " I tolerate Biology."]

### Create Bag of Words

#### Count Vectors

Count Vectorizer documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer

class __sklearn.feature_extraction.text.CountVectorizer__(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

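CountVectorizer has built-in lowercasing and punctuation stripping. Stop word removal is also available, but it is off by default (stop_words=None). The default token_pattern keeps only tokens of two or more word characters, which is why "I" is missing from the vocabulary below.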

In [3]:
cvect = CountVectorizer()
text_cvect = cvect.fit_transform(texts)
cvect.vocabulary_

{'love': 1, 'programming': 3, 'math': 2, 'tolerate': 4, 'biology': 0}
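
To see which preprocessing step removes which token, the sketch below fits three vectorizers on the same texts: one with defaults, one with a relaxed token_pattern that also keeps one-character tokens, and one with the built-in English stop word list switched on. The relaxed pattern and the variable names are illustrative choices, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love love Programming.", " I love Math.", " I tolerate Biology."]

# Defaults: lowercase, default token_pattern (two or more word characters),
# no stop word list.
default_vocab = sorted(CountVectorizer().fit(texts).vocabulary_)

# Relaxed pattern (an illustrative choice) that also keeps tokens such as "i".
single_char_vocab = sorted(
    CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(texts).vocabulary_
)

# Built-in English stop word list, enabled explicitly.
stop_vocab = sorted(CountVectorizer(stop_words="english").fit(texts).vocabulary_)

print(default_vocab)
print(single_char_vocab)
print(stop_vocab)
```

Comparing the three lists shows that the missing "i" in the default vocabulary comes from the token pattern, not from stop word filtering.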

The result is stored in Compressed Sparse Row (CSR) format, which is memory-efficient because only the non-zero counts are kept.

In [4]:
text_cvect

<3x5 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>
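
As a rough illustration of what "6 stored elements" means, the sketch below prints the three arrays that a SciPy CSR matrix keeps internally. The variable name X is only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love love Programming.", " I love Math.", " I tolerate Biology."]
X = CountVectorizer().fit_transform(texts)

print(X.shape)    # (number of documents, number of vocabulary terms)
print(X.nnz)      # number of explicitly stored entries
print(X.data)     # the stored counts
print(X.indices)  # column index of each stored count
print(X.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]
```

Only the non-zero cells appear in data and indices, so memory grows with the number of tokens actually seen, not with documents times vocabulary size.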

Dense representation:

In [5]:
text_cvect.todense()

matrix([[0, 2, 0, 1, 0],
        [0, 1, 1, 0, 0],
        [1, 0, 0, 0, 1]], dtype=int64)
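
The dense matrix is easier to read with column labels. A minimal sketch, assuming the column order is recovered from vocabulary_ (which maps each term to its column index):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love love Programming.", " I love Math.", " I tolerate Biology."]
cvect = CountVectorizer()
X = cvect.fit_transform(texts)

# Sorting the terms by their index gives the column order of the matrix.
columns = sorted(cvect.vocabulary_, key=cvect.vocabulary_.get)
counts = pd.DataFrame(X.toarray(), columns=columns)
print(counts)
```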

Let's try binary=True, i.e. a presence/absence (one-hot style) representation in which every non-zero count becomes 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. It may be a good option for sentiment analysis.

In [6]:
cvect = CountVectorizer(binary=True)
text_cvect = cvect.fit_transform(texts)
text_cvect.todense()

matrix([[0, 1, 0, 1, 0],
        [0, 1, 1, 0, 0],
        [1, 0, 0, 0, 1]], dtype=int64)
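
One way to check what binary=True does is to compare it with the plain counts clipped to 0/1. This is only a sanity check, not how the library computes it internally.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love love Programming.", " I love Math.", " I tolerate Biology."]

counts = CountVectorizer().fit_transform(texts).toarray()
binary = CountVectorizer(binary=True).fit_transform(texts).toarray()

# binary=True should match "count > 0" for every cell.
same = np.array_equal(binary, (counts > 0).astype(counts.dtype))
print(same)
```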

Now let's use unigrams and bigrams.

In [7]:
cvect = CountVectorizer(ngram_range=(1, 2))
text_cvect = cvect.fit_transform(texts)
cvect.vocabulary_

{'love': 1,
 'programming': 6,
 'love love': 2,
 'love programming': 4,
 'math': 5,
 'love math': 3,
 'tolerate': 7,
 'biology': 0,
 'tolerate biology': 8}

In [8]:
text_cvect.todense()

matrix([[0, 2, 1, 0, 1, 0, 1, 0, 0],
        [0, 1, 0, 1, 0, 1, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 1, 1]], dtype=int64)
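
To see where the bigram features come from, the sketch below runs the vectorizer's analyzer on the first sentence and rebuilds the same bigrams by hand from consecutive unigram tokens. The sample sentence is the first entry of texts.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = "I love love Programming."

# build_analyzer() returns the callable that turns raw text into features.
tokens = CountVectorizer(ngram_range=(1, 2)).build_analyzer()(sentence)
print(tokens)

# The same bigrams, built manually from adjacent unigram tokens.
unigrams = CountVectorizer().build_analyzer()(sentence)
bigrams = [" ".join(pair) for pair in zip(unigrams, unigrams[1:])]
print(bigrams)
```

Bigrams are formed after tokenization, so "i love" never appears: the one-character token "i" has already been dropped.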

Sometimes models are built on character n-grams instead of words. This captures the most common word parts, so there is no need for stemming or lemmatization.
We can limit which tokens are kept with max_df and min_df (upper and lower bounds on document frequency) and with max_features (keep only the most frequent tokens).

In [10]:
cvect = CountVectorizer(ngram_range=(3, 4), analyzer="char", max_features=10)
#cvect = CountVectorizer(ngram_range=(3, 4), analyzer="char")
cvect.fit(texts)  # fit() returns the vectorizer itself; only the vocabulary is needed here
cvect.vocabulary_

{'i l': 3,
 ' lo': 1,
 'lov': 5,
 'ove': 7,
 've ': 9,
 'i lo': 4,
 ' lov': 2,
 'love': 6,
 'ove ': 8,
 ' i ': 0}
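
The document-frequency thresholds mentioned above can be tried on the word-level vocabulary. In this sketch min_df=2 is an absolute document count and max_df=0.5 is a proportion of documents. Both values are chosen only to make the effect visible on three short texts.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love love Programming.", " I love Math.", " I tolerate Biology."]

# Keep only terms that occur in at least 2 documents.
keep_frequent = CountVectorizer(min_df=2).fit(texts)

# Drop terms that occur in more than 50% of the documents.
drop_frequent = CountVectorizer(max_df=0.5).fit(texts)

print(keep_frequent.vocabulary_)
print(sorted(drop_frequent.vocabulary_))
```

"love" is the only term that appears in two documents, so it is the only survivor of min_df=2 and the only term removed by max_df=0.5.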

#### TFIDF vectors

TfidfVectorizer documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer

class __sklearn.feature_extraction.text.TfidfVectorizer__(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

In [11]:
tfidf = TfidfVectorizer(smooth_idf=False, use_idf=False, norm='l1')
text_tfidf = tfidf.fit_transform(texts)
tfidf.vocabulary_

{'love': 1, 'programming': 3, 'math': 2, 'tolerate': 4, 'biology': 0}

This is the result of the term frequency calculation alone. No IDF is applied, and with norm='l1' each row sums to 1.

In [12]:
text_tfidf.todense()

matrix([[0.        , 0.66666667, 0.        , 0.33333333, 0.        ],
        [0.        , 0.5       , 0.5       , 0.        , 0.        ],
        [0.5       , 0.        , 0.        , 0.        , 0.5       ]])
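
The values above can be reproduced by hand: with use_idf=False and norm='l1', each row is the raw count vector divided by its sum. A short check, with illustrative variable names:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["I love love Programming.", " I love Math.", " I tolerate Biology."]

counts = CountVectorizer().fit_transform(texts).toarray()

# Term frequency: each count divided by the total count of its document.
tf_manual = counts / counts.sum(axis=1, keepdims=True)

tf_sklearn = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(texts).toarray()
print(np.allclose(tf_manual, tf_sklearn))
```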

With IDF enabled, the word "love" gets a lower weight because it appears in two of the three documents.

In [13]:
tfidf = TfidfVectorizer(smooth_idf=False, use_idf=True, norm='l1')
text_tfidf = tfidf.fit_transform(texts)
text_tfidf.todense()

matrix([[0.        , 0.57254423, 0.        , 0.42745577, 0.        ],
        [0.        , 0.4010942 , 0.5989058 , 0.        , 0.        ],
        [0.5       , 0.        , 0.        , 0.        , 0.5       ]])
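
With smooth_idf=False, scikit-learn computes idf(t) = ln(n / df(t)) + 1, where n is the number of documents and df(t) is the number of documents containing t. The sketch below applies that formula to the raw counts and then L1-normalizes each row, which should reproduce the matrix above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["I love love Programming.", " I love Math.", " I tolerate Biology."]

counts = CountVectorizer().fit_transform(texts).toarray()
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)        # document frequency of each term
idf = np.log(n_docs / df) + 1.0      # unsmoothed IDF

weighted = counts * idf
tfidf_manual = weighted / weighted.sum(axis=1, keepdims=True)  # L1 normalization

tfidf_sklearn = TfidfVectorizer(
    smooth_idf=False, use_idf=True, norm="l1"
).fit_transform(texts).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

For "love", df = 2 and n = 3, so its IDF is ln(1.5) + 1, smaller than the ln(3) + 1 assigned to terms that occur in a single document.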

In [14]:
tfidf = TfidfVectorizer(ngram_range=(1, 2))
text_tfidf = tfidf.fit_transform(texts)
text_tfidf.todense()

matrix([[0.        , 0.65985664, 0.43381609, 0.        , 0.43381609,
         0.        , 0.43381609, 0.        , 0.        ],
        [0.        , 0.4736296 , 0.        , 0.62276601, 0.        ,
         0.62276601, 0.        , 0.        , 0.        ],
        [0.57735027, 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.57735027, 0.57735027]])
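
The last cell relies on the defaults, which include norm='l2'. A quick way to confirm this is to check that every row has unit Euclidean length. As a side effect, the dot product of two rows is then their cosine similarity. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["I love love Programming.", " I love Math.", " I tolerate Biology."]

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts).toarray()

row_norms = np.linalg.norm(X, axis=1)  # Euclidean length of each document vector
similarity = X @ X.T                   # cosine similarity, since rows are unit length

print(row_norms)
print(similarity)
```

The third document shares no features with the other two, so its similarity to them is zero.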

#### Hashing trick

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer

__sklearn.feature_extraction.text.HashingVectorizer__(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, dtype=<class 'numpy.float64'>)

This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.

This strategy has several advantages:

 * it uses very little memory and scales to large datasets, as there is no need to store a vocabulary dictionary in memory
 * it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
 * it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
 
There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):

 * there is no way to compute the inverse transform (from feature indices to string feature names), which can be a problem when trying to introspect which features are most important to a model.
 * there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
 * no IDF weighting as this would render the transformer stateful.
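
The "no state" point can be illustrated with a small sketch: two separate HashingVectorizer instances, never fitted, transform different batches, and the stacked result matches a single pass over all texts. The batch split and n_features=2 ** 10 are arbitrary choices for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

texts = ["I love love Programming.", " I love Math.", " I tolerate Biology."]

first = HashingVectorizer(n_features=2 ** 10)
second = HashingVectorizer(n_features=2 ** 10)

# No fit call: the column of a token depends only on its hash.
batch_a = first.transform(texts[:2])
batch_b = second.transform(texts[2:])
full = first.transform(texts)

stacked = np.vstack([batch_a.toarray(), batch_b.toarray()])
print(np.array_equal(stacked, full.toarray()))
```

Because nothing is learned from the data, batches can be processed independently, for example by parallel workers or as a stream.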

In [44]:
texts = ["I love love Programming.", " I love Math.", " I tolerate Biology."]

In [45]:
hashv = HashingVectorizer(n_features=10, norm="l1")
text_hashv = hashv.fit_transform(texts)
text_hashv.todense()

matrix([[ 0.        , -0.66666667,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.33333333,  0.        ,  0.        ],
        [ 0.        , -0.5       ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.5       ,  0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.5       ,  0.5       ]])
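
Two things stand out in the output above. The negative entries come from alternate_sign=True (the default), which gives each hashed feature a sign. The first two rows also both use column 7, although their second tokens differ ("programming" and "math"), which appears to be a collision caused by the very small n_features=10. The sketch below switches the sign off so that all entries are non-negative. It does not remove collisions.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

texts = ["I love love Programming.", " I love Math.", " I tolerate Biology."]

hashv = HashingVectorizer(n_features=10, norm="l1", alternate_sign=False)
X = hashv.transform(texts).toarray()

print(X)
print(X.sum(axis=1))  # each L1-normalized row sums to 1 when no signs are applied
```

In practice a much larger n_features is the usual way to make collisions rare.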

## Co-occurrence matrix

In [46]:
from collections import defaultdict

In [47]:
# simple sketch: count token pairs that occur within window_size positions of each other
def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # minimal preprocessing (a real tokenizer would be more robust)
        text = text.lower().split()
        text = [w.replace(".", "") for w in text]
        # iterate over the tokens of the sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            # tokens that follow the current one within the window
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                key = tuple(sorted([t, token]))
                d[key] += 1
    # convert the pair counts into a symmetric dataframe
    vocab = sorted(vocab)
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df

In [48]:
co_occurrence(texts, 1)

             biology  i  love  math  programming  tolerate
biology            0  0     0     0            0         1
i                  0  0     2     0            0         1
love               0  2     1     1            1         0
math               0  0     1     0            0         0
programming        0  0     1     0            0         0
tolerate           1  1     0     0            0         0
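
For comparison, a coarser document-level co-occurrence matrix can be derived from the bag-of-words representation. This is a different technique from the window-based function above: it counts how many documents contain both terms, regardless of distance. It also uses the default CountVectorizer tokenization, so the one-character token "i" is dropped. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love love Programming.", " I love Math.", " I tolerate Biology."]

cvect = CountVectorizer(binary=True)
X = cvect.fit_transform(texts)
terms = sorted(cvect.vocabulary_, key=cvect.vocabulary_.get)

# (X.T @ X)[a, b] is the number of documents containing both term a and term b.
co = (X.T @ X).toarray()
np.fill_diagonal(co, 0)  # ignore each term's co-occurrence with itself

doc_co = pd.DataFrame(co, index=terms, columns=terms)
print(doc_co)
```

The matrix product keeps everything sparse until the final toarray() call, which matters once the vocabulary is large.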
