# Feature extraction from text using scikit-learn

This is our toy document collection:

In [38]:
documents = [
    "This is the first document",
    "This is the second second document",
    "Document three is short",
    "Document four is boring",
    "Document five five five five five is where we stop"
]

### Using raw term counts

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

In [90]:
count_vect = CountVectorizer()
counts = count_vect.fit_transform(documents)

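The vocabulary maps each term to its column index in the document-term matrix. It may also be viewed as a list of features using `count_vect.get_feature_names_out()` (`count_vect.get_feature_names()` in older scikit-learn releases).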

In [94]:
count_vect.vocabulary_

{'boring': 0,
 'document': 1,
 'first': 2,
 'five': 3,
 'four': 4,
 'is': 5,
 'second': 6,
 'short': 7,
 'stop': 8,
 'the': 9,
 'this': 10,
 'three': 11,
 'we': 12,
 'where': 13}

Document-term matrix (one row per document, one column per vocabulary term)

In [92]:
print(counts.toarray())

[[0 1 1 0 0 1 0 0 0 1 1 0 0 0]
 [0 1 0 0 0 1 2 0 0 1 1 0 0 0]
 [0 1 0 0 0 1 0 1 0 0 0 1 0 0]
 [1 1 0 0 1 1 0 0 0 0 0 0 0 0]
 [0 1 0 5 0 1 0 0 1 0 0 0 1 1]]
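
To make the link between the vocabulary and the matrix concrete, the counts can be rebuilt by hand. The following is a minimal sketch in plain Python, not scikit-learn's implementation; the helper names `tokenize` and `count_row` are made up for illustration, and the tokenizer only approximates the `CountVectorizer` defaults (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

documents = [
    "This is the first document",
    "This is the second second document",
    "Document three is short",
    "Document four is boring",
    "Document five five five five five is where we stop",
]


def tokenize(text):
    # Hypothetical helper: lowercase, then keep tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# Terms are sorted alphabetically and numbered; the number is the column index.
all_terms = sorted({term for doc in documents for term in tokenize(doc)})
vocabulary = {term: index for index, term in enumerate(all_terms)}


def count_row(text):
    # Hypothetical helper: one row of the document-term matrix.
    row = [0] * len(vocabulary)
    for term, count in Counter(tokenize(text)).items():
        if term in vocabulary:  # terms outside the vocabulary are ignored
            row[vocabulary[term]] = count
    return row


matrix = [count_row(doc) for doc in documents]
for row in matrix:
    print(row)
```

Each row lines up with the vocabulary: in the last row, the value 5 sits in column 3, the index of `five`.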


### Using TF weighting

In [43]:
from sklearn.feature_extraction.text import TfidfTransformer

Here, we use l1 normalization, i.e., each document vector is scaled so that the sum of the absolute values of its elements is 1 (for non-negative counts, simply the sum of the elements). The default would be l2 normalization, i.e., the sum of squares of the vector elements is 1. See the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html).

In [44]:
tf_transformer = TfidfTransformer(norm='l1', use_idf=False)
counts_tf = tf_transformer.fit_transform(counts)

Document-term matrix

In [45]:
print(counts_tf.toarray())

[[ 0.          0.2         0.2         0.          0.          0.2         0.
   0.          0.          0.2         0.2         0.          0.          0.        ]
 [ 0.          0.16666667  0.          0.          0.          0.16666667
   0.33333333  0.          0.          0.16666667  0.16666667  0.          0.
   0.        ]
 [ 0.          0.25        0.          0.          0.          0.25        0.
   0.25        0.          0.          0.          0.25        0.          0.        ]
 [ 0.25        0.25        0.          0.          0.25        0.25        0.
   0.          0.          0.          0.          0.          0.          0.        ]
 [ 0.          0.1         0.          0.5         0.          0.1         0.
   0.          0.1         0.          0.          0.          0.1         0.1       ]]


### Using TF-IDF weighting

In [97]:
tfidf_transformer = TfidfTransformer(norm='l1', use_idf=True)
counts_tfidf = tfidf_transformer.fit_transform(counts)

IDF values

In [98]:
print(tfidf_transformer.idf_)

[ 2.09861229  1.          2.09861229  2.09861229  2.09861229  1.
  2.09861229  2.09861229  2.09861229  1.69314718  1.69314718  2.09861229
  2.09861229  2.09861229]
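
These values can be reproduced by hand. The sketch below assumes the transformer's default `smooth_idf=True` setting, under which the IDF of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The function name `smooth_idf` and the selection of terms are only for illustration.

```python
import math

n_docs = 5

# Document frequencies read off the toy collection:
# "document" occurs in all 5 documents, "the" in 2, "first" in 1.
doc_freq = {"document": 5, "the": 2, "first": 1}


def smooth_idf(df, n):
    # Smoothed IDF: acts as if one extra document contained every term,
    # which avoids division by zero for unseen terms.
    return math.log((1 + n) / (1 + df)) + 1


idf = {term: smooth_idf(df, n_docs) for term, df in doc_freq.items()}
for term, value in idf.items():
    print(term, round(value, 8))
```

A term that occurs in every document gets the minimum IDF of 1, and the rarer a term is, the larger its IDF.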


Document-term matrix

In [47]:
print(counts_tfidf.toarray())

[[ 0.          0.1336022   0.28037922  0.          0.          0.1336022
   0.          0.          0.          0.22620819  0.22620819  0.          0.
   0.        ]
 [ 0.          0.10434581  0.          0.          0.          0.10434581
   0.43796278  0.          0.          0.17667281  0.17667281  0.          0.
   0.        ]
 [ 0.          0.16136256  0.          0.          0.          0.16136256
   0.          0.33863744  0.          0.          0.          0.33863744
   0.          0.        ]
 [ 0.33863744  0.16136256  0.          0.          0.33863744  0.16136256
   0.          0.          0.          0.          0.          0.          0.
   0.        ]
 [ 0.          0.05322292  0.          0.55847135  0.          0.05322292
   0.          0.          0.11169427  0.          0.          0.
   0.11169427  0.11169427]]
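
As a sanity check, one row of this matrix can be recomputed by hand: multiply each raw count by the term's IDF, then apply l1 normalization. The sketch below does this for the first document in plain Python, assuming the default smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`; the dictionaries are filled in manually from the toy collection.

```python
import math

n_docs = 5

# Terms of "This is the first document": raw count and document frequency.
counts = {"document": 1, "first": 1, "is": 1, "the": 1, "this": 1}
doc_freq = {"document": 5, "first": 1, "is": 5, "the": 2, "this": 2}

# Step 1: weight each count by the smoothed IDF of its term.
weighted = {
    term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
    for term, count in counts.items()
}

# Step 2: l1 normalization, so that the weights of the document sum to 1.
total = sum(weighted.values())
tfidf = {term: value / total for term, value in weighted.items()}

for term, value in tfidf.items():
    print(term, round(value, 8))
```

The rare term `first` ends up with about twice the weight of `document` and `is`, although all five terms occur exactly once in the document.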


**IMPORTANT** To be able to reuse the vocabulary later, e.g., when applying the learned model to the test set, the vocabulary needs to be saved (i.e., dumped to a file). With TF-IDF weighting, the IDF values need to be saved as well. While saving the IDF values is possible, there is currently no support for loading them back from a file. Therefore, the workaround is to dump the entire `TfidfTransformer` model to a file.

Create a `data` folder before running the code below. We use `joblib`, a `pickle`-based library that handles objects holding large NumPy arrays efficiently.

In [48]:
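# joblib is a standalone package; the bundled copy (sklearn.externals.joblib)
# is no longer available in recent scikit-learn releases.
import joblib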

Dumping vocabulary

In [99]:
joblib.dump(count_vect.vocabulary_, "data/vocabulary.pkl") 

['data/vocabulary.pkl']

Dumping `TfidfTransformer`

In [77]:
joblib.dump(tfidf_transformer, "data/tfidf_transformer.pkl") 

['data/tfidf_transformer.pkl']

Testing new document by loading the saved vocabulary

In [78]:
new_docs = ["document second five the unseen"]

In [102]:
vocab = joblib.load("data/vocabulary.pkl")
# With a fixed vocabulary there is nothing to learn, so transform() suffices.
count_vect2 = CountVectorizer(vocabulary=vocab)
counts2 = count_vect2.transform(new_docs)

Notice that "unseen" is not in the vocabulary, so it is ignored.

In [103]:
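# get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn releases
print(list(count_vect2.get_feature_names_out()))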
print(counts2.toarray())

['boring', 'document', 'first', 'five', 'four', 'is', 'second', 'short', 'stop', 'the', 'this', 'three', 'we', 'where']
[[0 1 0 1 0 0 1 0 0 1 0 0 0 0]]


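Trying to get TF-IDF weights; this is not possible by simply loading the vocabulary. A newly fitted `TfidfTransformer` estimates its IDF values from the new document alone rather than from the original collection, so every term gets the same IDF and the result is plain TF weighting.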

In [104]:
tfidf_transformer2 = TfidfTransformer(norm='l1', use_idf=True)
counts_tfidf2 = tfidf_transformer2.fit_transform(counts2)

In [105]:
print(counts_tfidf2.toarray())

[[ 0.    0.25  0.    0.25  0.    0.    0.25  0.    0.    0.25  0.    0.    0.
   0.  ]]
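
The uniform 0.25 weights follow directly from the IDF formula. The sketch below assumes the default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and shows that a transformer fitted on a single document assigns an IDF of exactly 1 to every term that occurs in it; the function name is only for illustration.

```python
import math


def smooth_idf(df, n):
    # Smoothed IDF, assuming the default smooth_idf=True setting.
    return math.log((1 + n) / (1 + df)) + 1


# Fitted on the new document only: n = 1 and every observed term has df = 1.
idf_new = smooth_idf(df=1, n=1)

# Counts of the four known terms: document, five, second, the.
counts = [1, 1, 1, 1]
weighted = [count * idf_new for count in counts]
weights = [value / sum(weighted) for value in weighted]

print(idf_new)
print(weights)
```

Since all IDF values are equal, they cancel out in the normalization and no information about term rarity in the original collection is left.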


Let's now try with the saved `TfidfTransformer`

In [106]:
tfidf_transformer3 = joblib.load("data/tfidf_transformer.pkl")
counts_tfidf3 = tfidf_transformer3.transform(counts2)

In [107]:
print(counts_tfidf3.toarray())

[[ 0.          0.14513005  0.          0.30457171  0.          0.
   0.30457171  0.          0.          0.24572654  0.          0.          0.
   0.        ]]
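
An alternative to saving the vocabulary and the transformer separately is to chain both steps in a scikit-learn `Pipeline` and dump that single object. The following is a sketch of this variant, not what the cells above do; it writes to a temporary directory instead of the `data` folder, and the file name `tfidf_pipeline.pkl` is arbitrary.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

documents = [
    "This is the first document",
    "This is the second second document",
    "Document three is short",
    "Document four is boring",
    "Document five five five five five is where we stop",
]

# Counting and TF-IDF weighting as one estimator, fitted on the toy collection.
pipeline = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(norm="l1", use_idf=True),
)
pipeline.fit(documents)

# Round trip through a file: dump, then load into a new object.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tfidf_pipeline.pkl")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline carries both the vocabulary and the IDF values.
weights = restored.transform(["document second five the unseen"]).toarray()
print(weights)
```

Keeping both steps in one object removes the risk of pairing a vocabulary with IDF values that were learned from a different collection.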
