
# Text Feature Extraction

Converting text into numeric values for processing.

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

We start with a corpus of 4 documents. Each document is a simple statement.

In [0]:
corpus = ['This is the first document.',
          'This is the second document.',
          'Third document. Document number three',
          'Number four. To repeat, number four']

In [0]:
vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(corpus)

We see that our `bag_of_words` is held as a sparse matrix.

In [28]:
bag_of_words

<4x12 sparse matrix of type '<class 'numpy.int64'>'
	with 18 stored elements in Compressed Sparse Row format>

When the matrix is printed, each entry has the form `(document index, word index)  count`.
The first number in the pair identifies the document and the second identifies the word.
Every word has been assigned a unique index, and the final value is the number of occurrences of that word in the document.


In [45]:
print(bag_of_words)

  (0, 9)	1
  (0, 3)	1
  (0, 7)	1
  (0, 1)	1
  (0, 0)	1
  (1, 9)	1
  (1, 3)	1
  (1, 7)	1
  (1, 0)	1
  (1, 6)	1
  (2, 0)	2
  (2, 8)	1
  (2, 4)	1
  (2, 10)	1
  (3, 4)	2
  (3, 2)	2
  (3, 11)	1
  (3, 5)	1
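To make the counting step concrete, here is a minimal pure-Python sketch of a bag of words for the same corpus. It assumes the text is lowercased and that a token is any run of two or more word characters, which mirrors CountVectorizer's default settings; the names `tokenize`, `vocabulary` and `counts` are illustrative only, and this is not how scikit-learn is implemented internally.

```python
import re
from collections import Counter

corpus = ['This is the first document.',
          'This is the second document.',
          'Third document. Document number three',
          'Number four. To repeat, number four']

# Assumption: lowercase the text and keep tokens of two or more word
# characters, mirroring CountVectorizer's defaults.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    return TOKEN.findall(doc.lower())

# Assign each unique word an index in alphabetical order.
words = sorted({w for doc in corpus for w in tokenize(doc)})
vocabulary = {w: i for i, w in enumerate(words)}

# Build one row per document and one column per word.
counts = []
for doc in corpus:
    row = [0] * len(vocabulary)
    for word, n in Counter(tokenize(doc)).items():
        row[vocabulary[word]] = n
    counts.append(row)

for row in counts:
    print(row)
```

Each printed row is the dense form of one document. The sparse matrix above stores only the non-zero entries of these rows.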


We can identify which index has been assigned to a particular word.

In [30]:
vectorizer.vocabulary_.get('document')

0

In [22]:
vectorizer.vocabulary_

{'document': 0,
 'first': 1,
 'four': 2,
 'is': 3,
 'number': 4,
 'repeat': 5,
 'second': 6,
 'the': 7,
 'third': 8,
 'this': 9,
 'three': 10,
 'to': 11}
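Going the other way, from an index back to a word, only needs the dictionary inverted. A small sketch, with the vocabulary copied in so that it runs on its own (`index_to_word` is a name chosen here for illustration):

```python
# The vocabulary shown above, copied so the snippet is self-contained.
vocabulary = {'document': 0, 'first': 1, 'four': 2, 'is': 3,
              'number': 4, 'repeat': 5, 'second': 6, 'the': 7,
              'third': 8, 'this': 9, 'three': 10, 'to': 11}

# Swap keys and values to look words up by index.
index_to_word = {index: word for word, index in vocabulary.items()}

print(index_to_word[4])  # number
```

This makes entries such as `(3, 4)  2` in the sparse output easier to read: document 3 contains the word at index 4, "number", twice.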

We can place our data into a pandas DataFrame to make it easier to read.

In [31]:
import pandas as pd
print(pd.__version__)

0.24.2


In [33]:
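# On scikit-learn 1.0 or later, use get_feature_names_out(); get_feature_names() was removed in 1.2.
pd.DataFrame(bag_of_words.toarray(), columns=vectorizer.get_feature_names())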

   document  first  four  is  number  repeat  second  the  third  this  three  to
0         1      1     0   1       0       0       0    1      0     1      0   0
1         1      0     0   1       0       0       1    1      0     1      0   0
2         2      0     0   0       1       0       0    0      1     0      1   0
3         0      0     2   0       2       1       0    0      0     0      0   1


Now we can assign a "meaningfulness" value to each word by using the TfidfVectorizer.

A word's value increases with the number of times it is used within a document. It is then scaled down according to how widely the word is used across the entire corpus, so words that appear in many documents carry less weight.

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
bag_of_words = vectorizer.fit_transform(corpus)

In [39]:
print(bag_of_words)

  (0, 0)	0.3528554929793508
  (0, 1)	0.5528163151092931
  (0, 7)	0.43584673254990375
  (0, 3)	0.43584673254990375
  (0, 9)	0.43584673254990375
  (1, 6)	0.5528163151092931
  (1, 0)	0.3528554929793508
  (1, 7)	0.43584673254990375
  (1, 3)	0.43584673254990375
  (1, 9)	0.43584673254990375
  (2, 10)	0.4850008395708102
  (2, 4)	0.3823802326982809
  (2, 8)	0.4850008395708102
  (2, 0)	0.6191395067937654
  (3, 5)	0.3432724906138499
  (3, 11)	0.3432724906138499
  (3, 2)	0.6865449812276998
  (3, 4)	0.5412799489419371


In [41]:
vectorizer.vocabulary_.get('document')

0

In [42]:
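# On scikit-learn 1.0 or later, use get_feature_names_out(); get_feature_names() was removed in 1.2.
pd.DataFrame(bag_of_words.toarray(), columns=vectorizer.get_feature_names())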

   document     first      four        is    number    repeat    second       the     third      this     three        to
0  0.352855  0.552816  0.000000  0.435847  0.000000  0.000000  0.000000  0.435847  0.000000  0.435847  0.000000  0.000000
1  0.352855  0.000000  0.000000  0.435847  0.000000  0.000000  0.552816  0.435847  0.000000  0.435847  0.000000  0.000000
2  0.619140  0.000000  0.000000  0.000000  0.382380  0.000000  0.000000  0.000000  0.485001  0.000000  0.485001  0.000000
3  0.000000  0.000000  0.686545  0.000000  0.541280  0.343272  0.000000  0.000000  0.000000  0.000000  0.000000  0.343272
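To see where these numbers come from, the weights can be recomputed by hand. The sketch below assumes the TfidfVectorizer defaults: a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the word, followed by scaling each document's row to unit length (L2 norm). The helper `tokenize` and the variable names are illustrative, not part of scikit-learn.

```python
import math
import re
from collections import Counter

corpus = ['This is the first document.',
          'This is the second document.',
          'Third document. Document number three',
          'Number four. To repeat, number four']

def tokenize(doc):
    # Assumption: lowercase, tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

docs = [Counter(tokenize(d)) for d in corpus]
words = sorted(set().union(*docs))
n_docs = len(docs)

# Document frequency: the number of documents that contain each word.
df = {w: sum(1 for d in docs if w in d) for w in words}

# Smoothed inverse document frequency (assumed default settings).
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in words}

tfidf = []
for d in docs:
    # Term count times idf, then scale the row to unit length.
    raw = [d[w] * idf[w] for w in words]
    norm = math.sqrt(sum(v * v for v in raw))
    tfidf.append([v / norm for v in raw])

for row in tfidf:
    print([round(v, 6) for v in row])
```

"document" appears in three of the four documents, so its idf is low and it scores below "first" in document 0, even though both words occur there once.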


If you have a large corpus, a HashingVectorizer is often the better choice, because it does not need to hold a vocabulary in memory.

**Note:** n_features is the number of hash buckets to be created. If the number of hash buckets is less than the number of unique words (which is the point of this method), then multiple words are assigned to the same bucket. In our example we map 12 unique words to 8 hash buckets.

In [0]:
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=8)
feature_vector = vectorizer.fit_transform(corpus)

In [48]:
print(feature_vector)

  (0, 0)	-0.8944271909999159
  (0, 5)	0.4472135954999579
  (0, 6)	0.0
  (1, 0)	-0.5773502691896258
  (1, 3)	0.5773502691896258
  (1, 5)	0.5773502691896258
  (1, 6)	0.0
  (2, 0)	-0.7559289460184544
  (2, 3)	0.3779644730092272
  (2, 5)	0.3779644730092272
  (2, 7)	0.3779644730092272
  (3, 0)	0.31622776601683794
  (3, 3)	0.31622776601683794
  (3, 5)	0.6324555320336759
  (3, 7)	0.6324555320336759


**Note:** A disadvantage of hashing is that there is no way to get from a hashed feature back to the original word.
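The sketch below illustrates the idea of hashing words into a fixed number of buckets, and why the mapping cannot be reversed. It is a simplified illustration under stated assumptions: MD5 from the standard library stands in for the hash function, and plain counts are kept. HashingVectorizer itself uses a different hash (MurmurHash3), applies alternating signs, and normalizes each row by default, so the numbers will not match the output above. The names `bucket`, `hash_vector` and `buckets` are hypothetical.

```python
import hashlib
import re

corpus = ['This is the first document.',
          'This is the second document.',
          'Third document. Document number three',
          'Number four. To repeat, number four']

N_BUCKETS = 8

def tokenize(doc):
    # Assumption: lowercase, tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

def bucket(word, n_buckets=N_BUCKETS):
    # Assumption: MD5 is only a stand-in; scikit-learn uses a different hash.
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_buckets

def hash_vector(doc):
    # Count how many tokens of the document fall into each bucket.
    row = [0] * N_BUCKETS
    for word in tokenize(doc):
        row[bucket(word)] += 1
    return row

vectors = [hash_vector(doc) for doc in corpus]

# Group the words by bucket to make the collisions visible.
buckets = {}
for doc in corpus:
    for word in tokenize(doc):
        buckets.setdefault(bucket(word), set()).add(word)

for index in sorted(buckets):
    print(index, sorted(buckets[index]))
```

With 12 unique words and only 8 buckets, at least one bucket must hold more than one word. Given only a bucket index, we cannot tell which of those words produced it.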