# Extracting Features from Text
##### Using Bag of Words, TF-IDF Transformation

- First we tokenise the document into individual words
- Then we represent each word by a number
- The document can now be expressed in the form of a tensor/matrix
- How do we assign these numbers to each word? Is there a method for doing so?
- The numbers used to represent each word are called **Word Embeddings**, and there are three ways in which embeddings can be assigned:
  - **One-hot Encoded Embeddings**
  - **Frequency-based Embeddings**
  - **Prediction-based Embeddings** - deep learning frameworks use this technique, where the numerical representation of the text captures meaning and semantic relationships

**One-hot Encoded Embeddings**
- **Disadvantages:**
  - With a large vocabulary the feature vectors become enormous (e.g. with 10,000 distinct words in the corpus, every feature vector is 10,000 elements long)
  - Does not capture the order in which the words occur, so the contextual information is lost
  - Loses frequency information, i.e. we don't capture how many times a word appeared in the document
  - Does NOT capture any semantic information or relationships between words
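
To make these disadvantages concrete, here is a minimal sketch of one-hot encoding in plain Python. The example sentence and the helper name `one_hot` are made up for illustration, and the sorted ordering of the vocabulary is an arbitrary choice.

```python
# Minimal one-hot encoding sketch (illustrative; `one_hot` is a hypothetical helper)
sentence = "the cat sat on the mat"
tokens = sentence.split()

# Assign each distinct word an index; sorted order is an arbitrary choice here
vocabulary = {word: idx for idx, word in enumerate(sorted(set(tokens)))}

def one_hot(word, vocabulary):
    """Return a vector of vocabulary length with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[word]] = 1
    return vector

encoded = [one_hot(word, vocabulary) for word in tokens]

for word, vector in zip(tokens, encoded):
    print(f"{word:>4} -> {vector}")
```

Every vector is as long as the vocabulary, both occurrences of "the" get the identical vector (so the count is not recorded in the encoding of a word), and any two different words have vectors with no overlap, so nothing about their similarity is expressed.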

**Frequency-based Embeddings**
- Preserve the frequency of the words
- Are of two types:
  - **Count Vectors:**
    - Count how many times a word appears in a particular document, thus capturing the frequency of each word
    - **Disadvantages:**
      - With a large vocabulary in the corpus we get enormous feature vectors for each document; to mitigate this we can
        - Choose only the top N words by frequency
        - Hash words into buckets to get a fixed vocabulary size
      - Does not capture the order in which the words occur, so the contextual information is lost
      - Does not capture any semantic information or relationships between words
  - **TF-IDF (Term frequency-Inverse document frequency):** 
    - It is a scoring scheme that tracks how important a word is to a document relative to the corpus as a whole
    - It captures the word frequency in a document as well as across the entire corpus
    - Every word is given a score `x_i = tf(w_i) * idf(w_i)`, where tf is the frequency of the word in the document and idf is the inverse document frequency of that word across the whole corpus
    - tf up-weights words that occur frequently in the document; idf down-weights words that occur in many documents of the corpus, i.e. the common words (stop words)
    - Thus the tf-idf score for a word is high if the word is frequent in the document and rare in the corpus
    - **Advantages:**
      - Each feature vector has one entry per vocabulary term, so its length depends on the vocabulary size rather than on the size of the whole corpus
      - Frequency & relevance of each word are captured
    - **Disadvantages:**
      - Context is still not captured
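
The scoring rule above can be sketched in a few lines of plain Python. This is the textbook form, assuming `tf` is the share of a document's tokens and `idf = log(N / df)`; the toy documents and the function names are made up for illustration, and library implementations typically add smoothing and normalisation on top of this.

```python
import math

# Toy corpus, already tokenised (hypothetical example data)
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "far"],
]

def tf(word, doc):
    """Term frequency: share of the document's tokens that are `word`."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Inverse document frequency: log(N / number of documents containing `word`).

    Only defined for words that occur in at least one document.
    """
    df = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / df)

def tf_idf(word, doc, docs):
    """Score of `word` in `doc`, relative to the whole corpus `docs`."""
    return tf(word, doc) * idf(word, docs)

for word in ["the", "cat", "sat"]:
    print(word, round(tf_idf(word, docs[0], docs), 4))
```

A word such as "the", which appears in every document, gets `idf = log(1) = 0` and therefore a score of zero, while a word confined to a single document gets the largest idf.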

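Here we will look at these 3 vectorizers in sklearn:
- **CountVectorizer** - creates a basic frequency-based representation of words
- **TfidfVectorizer** - weights the word counts by their TF-IDF scores
- **HashingVectorizer** - hashes words into a fixed number of buckets

Each vectorizer has a `fit` method to which we pass a corpus of documents. For CountVectorizer and TfidfVectorizer this method generates a unique ID for every word in the corpus; HashingVectorizer is stateless and learns nothing during `fit`.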

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

#### Define a corpus of 4 documents with some repeated values

In [5]:
corpus = ['This is the first document.',
          'This is the second document.', 
          'Third document. Document number three', 
          'Number four. To repeat, number four']

#### Use CountVectorizer to convert a collection of text documents to a "bag of words"

In [6]:
vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(corpus)
bag_of_words

<4x12 sparse matrix of type '<class 'numpy.int64'>'
	with 18 stored elements in Compressed Sparse Row format>

#### View what the "bag" looks like

In [7]:
print(bag_of_words) # the below format is (documentID, wordID) frequency

  (0, 0)	1
  (0, 1)	1
  (0, 7)	1
  (0, 3)	1
  (0, 9)	1
  (1, 6)	1
  (1, 0)	1
  (1, 7)	1
  (1, 3)	1
  (1, 9)	1
  (2, 10)	1
  (2, 4)	1
  (2, 8)	1
  (2, 0)	2
  (3, 5)	1
  (3, 11)	1
  (3, 2)	2
  (3, 4)	2


#### Get the value to which a word is mapped

In [8]:
vectorizer.vocabulary_.get('document') # will give the ID of the word

0

In [9]:
vectorizer.vocabulary_ # IDs follow the alphabetical order of the words, not their frequency

{'this': 9,
 'is': 3,
 'the': 7,
 'first': 1,
 'document': 0,
 'second': 6,
 'third': 8,
 'number': 4,
 'three': 10,
 'four': 2,
 'to': 11,
 'repeat': 5}
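
The IDs in `vocabulary_` can be reproduced without scikit-learn, which shows where they come from. The sketch below assumes CountVectorizer's default settings (lowercasing and the default token pattern, which keeps tokens of two or more word characters); it is an illustration of the mapping, not the library's actual code.

```python
import re

corpus = ['This is the first document.',
          'This is the second document.',
          'Third document. Document number three',
          'Number four. To repeat, number four']

# CountVectorizer's default token pattern: tokens of 2+ word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

# Collect the distinct lowercased tokens across the whole corpus
tokens = {tok for doc in corpus for tok in token_pattern.findall(doc.lower())}

# IDs follow the sorted (alphabetical) order of the distinct tokens
vocabulary = {tok: idx for idx, tok in enumerate(sorted(tokens))}
print(vocabulary)
```

So 'document' gets ID 0 because it sorts first alphabetically, not because it is the most frequent word.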

In [10]:
import pandas as pd
print(pd.__version__)

0.23.4


In [11]:
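pd.DataFrame(bag_of_words.toarray(), columns=vectorizer.get_feature_names_out())
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
# below, the rows are the individual documents & the column values are the corresponding word frequencies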

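|   | document | first | four | is | number | repeat | second | the | third | this | three | to |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 2 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0 | 0 | 2 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |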
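
The notes earlier mention keeping only the top N words by frequency to limit the size of count vectors. In scikit-learn, `CountVectorizer(max_features=N)` does this; the plain-Python sketch below illustrates the idea on the same corpus, assuming the default tokenisation. N = 2 is chosen so that there are no ties in this small corpus.

```python
import re
from collections import Counter

corpus = ['This is the first document.',
          'This is the second document.',
          'Third document. Document number three',
          'Number four. To repeat, number four']

token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenised = [token_pattern.findall(doc.lower()) for doc in corpus]

# Total frequency of each term across the whole corpus
term_totals = Counter(tok for doc in tokenised for tok in doc)

# Keep only the N most frequent terms (N = 2 avoids ties in this corpus)
top_n = 2
kept_terms = sorted(term for term, _ in term_totals.most_common(top_n))
vocabulary = {term: idx for idx, term in enumerate(kept_terms)}

# Count vectors restricted to the reduced vocabulary
count_vectors = [[doc.count(term) for term in kept_terms] for doc in tokenised]

print(vocabulary)
print(count_vectors)
```

Each document is now described by two numbers instead of twelve, at the cost of discarding every word outside the reduced vocabulary.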


#### Extend bag of words with TF-IDF weights

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
bag_of_words = vectorizer.fit_transform(corpus)

print(bag_of_words) # the below format is (documentID, wordID) tf-idf score

  (0, 9)	0.43584673255
  (0, 3)	0.43584673255
  (0, 7)	0.43584673255
  (0, 1)	0.552816315109
  (0, 0)	0.352855492979
  (1, 9)	0.43584673255
  (1, 3)	0.43584673255
  (1, 7)	0.43584673255
  (1, 0)	0.352855492979
  (1, 6)	0.552816315109
  (2, 0)	0.619139506794
  (2, 8)	0.485000839571
  (2, 4)	0.382380232698
  (2, 10)	0.485000839571
  (3, 4)	0.541279948942
  (3, 2)	0.686544981228
  (3, 11)	0.343272490614
  (3, 5)	0.343272490614


In [10]:
vectorizer.vocabulary_.get('document')

0

In [11]:
pd.DataFrame(bag_of_words.toarray(), columns=vectorizer.get_feature_names_out())
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
# below, the rows are the individual documents & the column values are the corresponding tf-idf scores

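|   | document | first | four | is | number | repeat | second | the | third | this | three | to |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.352855 | 0.552816 | 0.0 | 0.435847 | 0.0 | 0.0 | 0.0 | 0.435847 | 0.0 | 0.435847 | 0.0 | 0.0 |
| 1 | 0.352855 | 0.0 | 0.0 | 0.435847 | 0.0 | 0.0 | 0.552816 | 0.435847 | 0.0 | 0.435847 | 0.0 | 0.0 |
| 2 | 0.61914 | 0.0 | 0.0 | 0.0 | 0.38238 | 0.0 | 0.0 | 0.0 | 0.485001 | 0.0 | 0.485001 | 0.0 |
| 3 | 0.0 | 0.0 | 0.686545 | 0.0 | 0.54128 | 0.343272 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.343272 |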
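
These scores are not a plain `tf * idf` product. With its default settings, TfidfVectorizer uses the raw count as tf, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 norm. The sketch below recomputes the scores by hand under those assumptions; the helper names `smoothed_idf` and `tfidf_row` are made up for illustration.

```python
import math
import re

corpus = ['This is the first document.',
          'This is the second document.',
          'Third document. Document number three',
          'Number four. To repeat, number four']

token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenised = [token_pattern.findall(doc.lower()) for doc in corpus]
n_docs = len(tokenised)

def smoothed_idf(term):
    """idf = ln((1 + n_docs) / (1 + df)) + 1, where df counts documents containing the term."""
    df = sum(1 for doc in tokenised if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """Raw count * smoothed idf for each term, then scale the row to unit L2 norm."""
    raw = {term: doc.count(term) * smoothed_idf(term) for term in set(doc)}
    norm = math.sqrt(sum(score ** 2 for score in raw.values()))
    return {term: score / norm for term, score in raw.items()}

first_doc = tfidf_row(tokenised[0])
for term in sorted(first_doc):
    print(f"{term:>8}: {first_doc[term]:.6f}")
```

For the first document this gives 0.552816 for 'first' (it occurs in only one document) and 0.352855 for 'document' (it occurs in three of the four), matching the first row of the table.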


#### View all the words and their corresponding values

In [12]:
vectorizer.vocabulary_

{'document': 0,
 'first': 1,
 'four': 2,
 'is': 3,
 'number': 4,
 'repeat': 5,
 'second': 6,
 'the': 7,
 'third': 8,
 'this': 9,
 'three': 10,
 'to': 11}

### Hashing Vectorizer
* One issue with CountVectorizer and TfidfVectorizer is that the number of features can become very large when the vocabulary is large
* The whole vocabulary is stored in memory, which may take up a lot of space
* With HashingVectorizer, one can limit the number of features to a chosen number **n**
* Each word is hashed to one of the n values
* There will be collisions, where different words are hashed to the same value
* In many instances, performance does not really suffer in spite of the collisions
* Thus we use HashingVectorizer when the feature set is very large

In [13]:
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=8) # n_features represents the number of hash buckets to be used
feature_vector = vectorizer.fit_transform(corpus)
print(feature_vector)

  (0, 0)	-0.894427191
  (0, 5)	0.4472135955
  (0, 6)	0.0
  (1, 0)	-0.57735026919
  (1, 3)	0.57735026919
  (1, 5)	0.57735026919
  (1, 6)	0.0
  (2, 0)	-0.755928946018
  (2, 3)	0.377964473009
  (2, 5)	0.377964473009
  (2, 7)	0.377964473009
  (3, 0)	0.316227766017
  (3, 3)	0.316227766017
  (3, 5)	0.632455532034
  (3, 7)	0.632455532034


#### One disadvantage of HashingVectorizer is that there is no way to get back the words from the hashed value
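
A minimal sketch of the hashing trick shows both the collisions and why the words cannot be recovered. It assumes the default tokenisation and uses `hashlib.md5` as a stand-in hash; scikit-learn uses MurmurHash3 instead, so the bucket indices here will not match the output above. The helper names `bucket` and `hashed_counts` are made up for illustration.

```python
import hashlib
import re

corpus = ['This is the first document.',
          'This is the second document.',
          'Third document. Document number three',
          'Number four. To repeat, number four']

token_pattern = re.compile(r"(?u)\b\w\w+\b")
n_features = 8  # number of hash buckets, as in the cell above

def bucket(token, n_features):
    """Map a token to a bucket index with a deterministic hash (md5 here, not MurmurHash3)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_features

def hashed_counts(doc, n_features):
    """Count tokens per bucket; no vocabulary is stored anywhere."""
    vector = [0] * n_features
    for token in token_pattern.findall(doc.lower()):
        vector[bucket(token, n_features)] += 1
    return vector

feature_vectors = [hashed_counts(doc, n_features) for doc in corpus]
for vector in feature_vectors:
    print(vector)

# Group distinct tokens by bucket: a bucket holding more than one token is a collision
buckets = {}
for doc in corpus:
    for token in token_pattern.findall(doc.lower()):
        buckets.setdefault(bucket(token, n_features), set()).add(token)
print(buckets)
```

With 12 distinct words and only 8 buckets, at least one bucket must hold several words, and since only bucket indices are kept, nothing records which word produced a given count. In current scikit-learn versions HashingVectorizer additionally alternates the sign of hashed features (`alternate_sign=True`) and L2-normalises each row by default, which is why the output above contains negative and fractional values rather than plain counts.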