# Feature Extraction from Text data

Data is often unstructured, that is, not organized into rows and columns. Text is one of the most common types of unstructured data, so features need to be extracted from it before feature-based models can be used.


In this tutorial, we illustrate feature extraction from text for classification purposes.

We will first import the `SMS spam` dataset, which contains phone messages, some of which are spam and some of which are not.

References

- More details can be found in Chapters 2 and 3 of `Natural Language Processing in Action` by `Hobson Lane, Cole Howard, Hannes Hapke`

* Import the csv file that contains the SMS spam data set.
* There are two class labels `ham` (not spam) and `spam`

In [1]:
import pandas as pd

In [2]:
sms_msgs = pd.read_csv('sms_spam.csv',names= ['class', 'sms'],header=1)
sms_msgs

|  | class | sms |
| ---- | ----- | --- |
| 0 | ham | K..give back my thanks. |
| 1 | ham | Am also doing in cbe only. But have to pay. |
| 2 | spam | complimentary 4 STAR Ibiza Holiday or £10,000 ... |
| 3 | spam | okmail: Dear Dave this is your final notice to... |
| 4 | ham | Aiya we discuss later lar... Pick u up at 4 is... |
| ... | ... | ... |
| 5553 | ham | You are a great role model. You are giving so ... |
| 5554 | ham | Awesome, I remember the last time we got someb... |
| 5555 | spam | If you don't, your prize will go to another cu... |
| 5556 | spam | SMS. ac JSco: Energy is high, but u may not kn... |


How many messages are spam and how many are not?

* The number of messages in each class is not equal.
* This is a class imbalance. Ideally it should be corrected, but for now let's not worry about it.

In [3]:
sms_msgs['class'].value_counts()

ham     4811
spam     747
Name: class, dtype: int64

## Ways to extract features from Text data. 

There are two main ways to extract features from text data:

#### `Bag of words` or `Term Frequency` or `TF`:
Vectors of word counts or frequencies. The `term frequency` (`TF`) of a word is the number of times the word occurs in a text. A `TF vector` is another name for a `bag of words`.

Below are three texts:

* "about the bird the bird bird bird bird"
* "you heard about the bird"
* "the bird is the word"

The `Term Frequency matrix` of the three texts is:

| about  | bird | heard | is | the | word | you |
| ------ | ---- | ----- | -- | --- | ---- | --- |
| 1      | 5    |  0    |  0 |  2  |  0   |   0 |
| 1      | 1    |  1    |  0 |  1  |  0   |   1 |
| 0      | 1    |  0    |  1 |  2  |  1   |   0 |

The counts of each of the seven words are the features.
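The term frequency matrix above can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes whitespace splitting is a good enough tokenizer for these toy texts; the variable names are illustrative only.

```python
from collections import Counter

texts = [
    "about the bird the bird bird bird bird",
    "you heard about the bird",
    "the bird is the word",
]

# Tokenize by whitespace (enough for these toy texts; real text needs more care).
tokenized = [text.split() for text in texts]

# The vocabulary is the sorted set of all distinct words across the texts.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per text, one column per vocabulary word.
tf_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    tf_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['about', 'bird', 'heard', 'is', 'the', 'word', 'you']
for row in tf_matrix:
    print(row)
```

Each printed row matches the corresponding row of the table.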

#### `TF-IDF vectors` or `term frequency times inverse document frequency`: 

This is a normalized version of `TF`. It is often preferred to a bag of words because texts do not all have the same length, so the raw `TF` counts should be normalized. More details follow later.

## Preprocessing Steps

Text contains not just words but also punctuation, abbreviations, mixed upper and lower case, and so on. Therefore, before feature extraction, we need to preprocess the text to clean it.

There are three main steps of text preprocessing:

#### Tokenization

`Tokenization` is a special case of text segmentation. Segmentation breaks up text into smaller chunks or segments, each of which has some meaning.

Tokenization focuses on segmenting text into tokens. Each token can be a word, a punctuation mark, etc., depending on the problem context. Tokenization often also involves:

- Lowercasing all the words in the text.
- Depending on the problem, either removing punctuation such as . , ! " or keeping it.
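As a rough illustration of these choices, the sketch below lowercases a text and splits it into tokens with a regular expression. The function name `simple_tokenize` and its `keep_punctuation` flag are made up for this example and are not part of any library.

```python
import re

def simple_tokenize(text, keep_punctuation=False):
    """Lowercase the text and split it into tokens (hypothetical helper)."""
    text = text.lower()
    if keep_punctuation:
        # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
        return re.findall(r"\w+|[^\w\s]", text)
    # Keep only runs of word characters, dropping punctuation.
    return re.findall(r"\w+", text)

example = "Harry went to the store, fast!"
print(simple_tokenize(example))
# ['harry', 'went', 'to', 'the', 'store', 'fast']
print(simple_tokenize(example, keep_punctuation=True))
# ['harry', 'went', 'to', 'the', 'store', ',', 'fast', '!']
```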





#### Filtering
- Removing stop words
- Removing any other unwanted words

Stop words are common words in a language that occur with high frequency but carry little meaning or information, for example:

- a, an, the, this, and, or, of, on

You have a choice of removing these stop words after tokenization.
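Filtering after tokenization might look like the sketch below. The stop-word set is just the small example list given above, and `remove_stop_words` is a hypothetical helper; real stop-word lists are much longer.

```python
# A tiny, hand-picked stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "this", "and", "or", "of", "on"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop-word set, in original order."""
    return [token for token in tokens if token not in stop_words]

tokens = ["the", "bird", "is", "the", "word"]
filtered = remove_stop_words(tokens)
print(filtered)  # ['bird', 'is', 'word']
```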

#### Normalization


- `case folding`: converting all letters to one specific case (lower or upper). This can also be done as part of tokenization.

- `stemming` or `lemmatization`:  

Lemmatization/Stemming is the process of converting a word to its base form. 

`flies` -> Lemmatization -> `fly`

`flying` -> Lemmatization -> `fly`

Without `Lemmatization/Stemming`, you will end up counting `flies` and `flying` as two different tokens/features.
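To make the idea concrete, here is a deliberately crude suffix-stripping sketch. It is not a real stemming algorithm, only two hand-written rules chosen so that the `flies` and `flying` examples work; libraries such as `nltk` provide proper stemmers and lemmatizers.

```python
def toy_stem(word):
    """Very crude suffix stripping; NOT a real stemming algorithm."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"   # flies -> fly
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]         # flying -> fly
    return word

for w in ["flies", "flying", "fly"]:
    print(w, "->", toy_stem(w))
```

After this step all three forms map to the single token `fly`, so they would be counted as one feature. Rules this simple fail on many words (for example, `running` would become `runn`).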

#### In this tutorial, we will skip `Normalization` other than lowercasing, because stemming and lemmatization require prerequisites from natural language processing.

## Extracting `Bag of Words` or `TF` using `CountVectorizer` from `scikit-learn`

More documentation here https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Let's first illustrate this using two texts.

Note that the `nltk` package also has a built-in tokenizer.

In [3]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

'''
This function allows to convert text documents to a matrix of token counts.
'''
vectorizer1 = CountVectorizer(lowercase=True, # Convert all characters to lowercase.
                             stop_words = 'english', # a built-in stop word list for English is used. all words in the stop_words are removed
                             min_df=1) # ignore words/tokens that have a document frequency strictly lower than the given threshold. 

'''
vectorizer works on a list of documents/Texts. Therefore we need to create a list that contains the texts
'''
sentence1 = """The Faster Harry got to the store, the faster Harry, the faster, would get home."""
sentence2 = """Harry went to the store fast and brought some fruit home"""

docs=[]
docs.append(sentence1)
docs.append(sentence2)
tf_vector = vectorizer1.fit_transform(docs)

'''
gives a list of the unique tokens (the vocabulary)
'''
print("vocabulary \n=============================================")
vocab = vectorizer1.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
print(vocab)

'''
This is the TF vector of the document
'''
print("\nThis is the TF vector of the document\n=============================================")
print(tf_vector.todense()) 

vocabulary 
['brought', 'fast', 'faster', 'fruit', 'got', 'harry', 'home', 'store', 'went']

This is the TF vector of the document
[[0 0 3 0 1 2 1 1 0]
 [1 1 0 1 0 1 1 1 1]]




In the above code, `tf_vector` is the term frequency matrix of the texts, with one row per text. It is a sparse matrix, and we can convert it to a dense matrix using `todense()`.

We can represent the term frequency matrix as a Pandas DataFrame (a table) to make it more readable. There are two rows (the number of documents) and nine columns (the number of tokens in the vocabulary).

In [23]:
import pandas as pd
df = pd.DataFrame(tf_vector.todense(),columns=vocab,index=['sentence1','sentence2'])
df

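|  | brought | fast | faster | fruit | got | harry | home | store | went |
| --------- | ------- | ---- | ------ | ----- | --- | ----- | ---- | ----- | ---- |
| sentence1 | 0 | 0 | 3 | 0 | 1 | 2 | 1 | 1 | 0 |
| sentence2 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |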


## `TFIDF` or `Term Frequency Times Inverse Document Frequency`

The idea is that the importance of a token in a specific text, relative to the other texts, should depend on

- the normalized frequency (`raw count / total no. of words`) of the token in that text
- the number of texts containing the token

The `TFIDF` or `term frequency times inverse document frequency` of a word/term in a text quantifies the importance of that word in the text relative to the rest of the texts.

- `IDF (Inverse Document Frequency)` of a word = ratio of the total number of documents to the number of documents containing the word. Usually the `logarithm` of the ratio is used.

- `TFIDF` or `term frequency times inverse document frequency` of a term/word in a document is simply the product of the normalized frequency of the term in the document and the `IDF` of the word.

For a given word/term `t` in a given document `d` from a list of texts `D`:
- `normalized TF(t,d) or bag of words (t,d) = (no. of times t occurs in d) / (total no. of words in d)`
- `IDF(t,D) = log(no. of docs in D / no. of docs containing t) + 1`
- `TFIDF(t,d) = normalized TF(t,d) * IDF(t,D)`

The effect of adding `1` to the IDF in the equation above is that terms whose logarithm is zero, i.e., terms that occur in all documents in a training set, will not be entirely ignored.
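The three formulas above can be sketched directly in plain Python. The example reuses the three "bird" texts from earlier and assumes whitespace tokenization, the natural logarithm, and that the term occurs in at least one document; the function name `tfidf` is illustrative.

```python
import math
from collections import Counter

docs = [
    "about the bird the bird bird bird bird".split(),
    "you heard about the bird".split(),
    "the bird is the word".split(),
]

def tfidf(term, doc, all_docs):
    """normalized TF(t,d) * IDF(t,D), following the formulas above."""
    tf = Counter(doc)[term] / len(doc)                    # count of t in d / words in d
    n_containing = sum(1 for d in all_docs if term in d)  # docs containing t (assumed > 0)
    idf = math.log(len(all_docs) / n_containing) + 1
    return tf * idf

bird_score = tfidf("bird", docs[0], docs)  # "bird" occurs in every document
word_score = tfidf("word", docs[2], docs)  # "word" occurs only in the third document
print(bird_score, word_score)
```

Here `bird` is in all three documents, so its logarithm is zero and its score is just its normalized frequency (5/8), kept alive by the added `1`. `word` appears in only one document, so its normalized frequency (1/5) is scaled up by an IDF of `log(3) + 1`.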



## Extracting `TFIDF` using `TfidfVectorizer` from `scikit-learn`

`TfidfVectorizer` calculates the TFIDF matrix in a slightly different manner.

- Instead of using normalized TF(t,d), it uses the raw count TF(t,d)
- Default, with option `smooth_idf=True`: IDF(t,D) = log( (1 + no. of docs)/(1 + no. of docs containing t)) + 1
- With option `smooth_idf=False`: IDF(t,D) = log(no. of docs/no. of docs containing t) + 1
- TFIDF(t,d) = TF(t,d) * IDF(t,D)
- scikit-learn reports the normalized TFIDF(t,d), i.e. $\mathrm{TFIDF}(t,d) / \sqrt{\sum_{w\in d} \mathrm{TFIDF}(w,d)^2}$ (the default `norm='l2'`)


Let's illustrate this using three simple texts.

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

print("\n---------------------------------------------------------------------------------\n")
'''
Three example texts
'''
test_docs =["apple boy cat to dog","apple egg","boy boy"]
vectorizer2 = TfidfVectorizer(lowercase=True, # Convert all characters to lowercase.
                             stop_words = 'english', # a built-in stop word list for English is used. all words in the stop_words are removed
                             min_df=1) # ignore words/tokens that have a document frequency strictly lower than the given threshold. 
model = vectorizer2.fit_transform(test_docs)
tokens = vectorizer2.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
print("Vocabulary of the documents\n")
print(tokens)


---------------------------------------------------------------------------------

Vocabulary of the documents

['apple', 'boy', 'cat', 'dog', 'egg']




In [29]:
print("Term Frequency times inverse document frequency as a Pandas DataFrame\n")
X = model.todense().round(2)
X = pd.DataFrame(X,columns=tokens,index=['doc1','doc2','doc3'])
X

Term Frequency times inverse document frequency as a Pandas DataFrame



|  | apple | boy | cat | dog | egg |
| ---- | ----- | ---- | ---- | ---- | --- |
| doc1 | 0.43 | 0.43 | 0.56 | 0.56 | 0.0 |
| doc2 | 0.61 | 0.0 | 0.0 | 0.0 | 0.8 |
| doc3 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [None]:
test_docs =["apple boy cat to dog","apple egg","boy boy"]

The term frequency matrix for these documents is

In [28]:
Y=pd.DataFrame(np.array([[1,1,1,1,0],
                      [1,0,0,0,1],
                      [0,2,0,0,0]]),columns=tokens,index=['doc1','doc2','doc3'])
Y

|  | apple | boy | cat | dog | egg |
| ---- | ----- | --- | --- | --- | --- |
| doc1 | 1 | 1 | 1 | 1 | 0 |
| doc2 | 1 | 0 | 0 | 0 | 1 |
| doc3 | 0 | 2 | 0 | 0 | 0 |


- The feature `apple` has a count of 1 in both doc1 and doc2. After `TFIDF` weighting and normalization, `apple` has more importance in doc2 than in doc1, because doc2 contains fewer terms.
- The features `apple`, `boy`, `cat`, and `dog` each have a count of 1 in doc1. However, `apple` and `boy` also appear in other texts (`doc2` and `doc3`), whereas `cat` and `dog` appear only in doc1. Therefore `cat` and `dog` are more important in doc1 when TFIDF is used.

The scikit-learn `TFIDF` vector (not normalized) for the first `doc` is

In [88]:
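import math
# each entry: count * (log((1 + no. of docs) / (1 + no. of docs containing the term)) + 1)
tmp1 = np.array([1 * (math.log((1+3)/(1+2)) + 1),  # apple: in 2 of 3 docs
                 1 * (math.log((1+3)/(1+2)) + 1),  # boy: in 2 of 3 docs
                 1 * (math.log((1+3)/(1+1)) + 1),  # cat: in 1 of 3 docs
                 1 * (math.log((1+3)/(1+1)) + 1),  # dog: in 1 of 3 docs
                 0])                               # egg: count 0 in doc1
tmp1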

array([1.28768207, 1.28768207, 1.69314718, 1.69314718, 0.        ])

The scikit-learn TFIDF matrix (normalized) for the first `doc` is

In [90]:
from numpy.linalg import norm
tmp1/norm(tmp1)

array([0.42804604, 0.42804604, 0.5628291 , 0.5628291 , 0.        ])
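The same calculation can be repeated for all three documents. The sketch below is a plain-Python re-implementation of the smoothed IDF and the length normalization described above, written only to check the table by hand; it is not how scikit-learn is implemented internally.

```python
import math

# Raw term counts from the table above (columns: apple, boy, cat, dog, egg).
counts = [
    [1, 1, 1, 1, 0],  # doc1
    [1, 0, 0, 0, 1],  # doc2
    [0, 2, 0, 0, 0],  # doc3
]
n_docs = len(counts)

# Document frequency: in how many documents each term appears.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]

# Smoothed IDF, following the smooth_idf=True formula.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

tfidf_rows = []
for row in counts:
    raw = [tf * w for tf, w in zip(row, idf)]     # TF(t,d) * IDF(t,D)
    length = math.sqrt(sum(v * v for v in raw))   # Euclidean length of the row
    tfidf_rows.append([round(v / length, 2) for v in raw])

for row in tfidf_rows:
    print(row)
# [0.43, 0.43, 0.56, 0.56, 0.0]
# [0.61, 0.0, 0.0, 0.0, 0.8]
# [0.0, 1.0, 0.0, 0.0, 0.0]
```

The rounded rows agree with the DataFrame produced by `TfidfVectorizer` for `doc1`, `doc2`, and `doc3`.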

### Other examples of using `TfidfVectorizer` to extract features

#### Using custom `stop_words`

In [51]:
vectorizer3 = TfidfVectorizer(lowercase=True, # Convert all characters to lowercase.
                             stop_words = ["all","in","the","is","and"], # custom words to be ignored.
                             min_df=2) # ignore terms that appear in fewer than 2 texts

#### Removing words that occur very frequently

If a word occurs in most of the texts, regardless of class label, it may not help differentiate between the two class labels. Such words can be removed using `max_df`.

In [54]:
vectorizer4 = TfidfVectorizer(lowercase=True, # Convert all characters to lowercase.
                             stop_words = ["all","in","the","is","and"], # custom words to be ignored.
                             max_df=0.85) # ignore terms that appear in more than 85% of the texts
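What the `min_df` and `max_df` options do can be sketched in plain Python. This is an illustration under simplifying assumptions (`min_df` is given as an absolute count and `max_df` as a proportion); the helper name `filter_by_document_frequency` is made up for this example.

```python
def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency is neither too low nor too high.

    Assumption for this sketch: min_df is an absolute count and
    max_df is a proportion of the documents.
    """
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    max_count = max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

docs = [
    ["the", "bird", "is", "the", "word"],
    ["you", "heard", "about", "the", "bird"],
    ["about", "the", "bird"],
]
kept = filter_by_document_frequency(docs, min_df=2, max_df=0.85)
print(kept)  # ['about']
```

`the` and `bird` occur in all three texts (more than 85%), and `is`, `word`, `you`, and `heard` occur in only one, so only `about` survives both thresholds.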

#### Custom tokenizer

Below is a function that attempts to keep all punctuation and special characters as separate tokens, and splits the remaining text into words on whitespace.

For this, you should have some knowledge of regular expressions.


In [57]:
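import re
def my_tokenizer(text):
    # put a space on each side of every non-word character (punctuation, special characters)
    text = re.sub(r"(\W)", r" \1 ", text)
    # split on whitespace (raw strings avoid invalid-escape warnings)
    words = re.split(r"\s+", text)
    # drop the empty strings that the split can leave at the start or end
    words = [w for w in words if w != ""]
    return words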



In [62]:
sentence1 = """The Faster Harry got to the store, the faster Harry, the faster, would get home."""
sentence2 = """Harry went to the store fast and brought some fruit home"""



docs=[]
docs.append(sentence1)
docs.append(sentence2)

vectorizer5 = TfidfVectorizer(lowercase=True,
                     tokenizer=my_tokenizer, # use the tokenizer function defined above
                     stop_words='english',
                     min_df = 1)
tf_vector5 = vectorizer5.fit_transform(docs)
df = pd.DataFrame(tf_vector5.todense(),columns=vectorizer5.get_feature_names_out(),index=['sentence1','sentence2'])
df



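|  | `,` | `.` | brought | fast | faster | fruit | got | harry | home | store | went |
| --------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| sentence1 | 0.625034 | 0.208345 | 0.0 | 0.0 | 0.625034 | 0.0 | 0.208345 | 0.296478 | 0.148239 | 0.148239 | 0.0 |
| sentence2 | 0.0 | 0.0 | 0.425677 | 0.425677 | 0.0 | 0.425677 | 0.0 | 0.302873 | 0.302873 | 0.302873 | 0.425677 |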


#### Limiting the number of features

In [53]:
vectorizer6 = TfidfVectorizer(lowercase=True, # Convert all characters to lowercase.
                              max_features = 10) # consider the top max_features ordered by term frequency across the texts

#### Word level – N-grams (unigrams and bigrams)

`N-grams`: a feature is a sequence of N consecutive words.

Sometimes `bi-grams` and `tri-grams` capture contextual information that `unigrams` alone miss.
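To see what word-level n-grams look like, the sketch below builds them by hand from a list of tokens. The helper `word_ngrams` is hypothetical and only mimics the idea of combining unigrams and bigrams.

```python
def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of consecutive tokens for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["harry", "went", "to", "the", "store"]
print(word_ngrams(tokens, 1, 2))
# ['harry', 'went', 'to', 'the', 'store', 'harry went', 'went to', 'to the', 'the store']
```

Each bigram such as `the store` becomes its own feature alongside the single words.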

In [63]:
vectorizer7 = TfidfVectorizer(lowercase=True, 
                              ngram_range = (1,2))

# ngram_range of (1, 1) means only unigrams,
# (1, 2) means unigrams and bigrams, and
# (2, 2) means only bigrams.

In [67]:
sentence1 = """The Faster Harry got to the store, the faster Harry, the faster, would get home."""
sentence2 = """Harry went to the store fast and brought some fruit home"""



docs=[]
docs.append(sentence1)
docs.append(sentence2)


tf_vector7 = vectorizer7.fit_transform(docs)
print("features are \n")
print(vectorizer7.get_feature_names_out().tolist())
df = pd.DataFrame(tf_vector7.todense(),columns=vectorizer7.get_feature_names_out(),index=['sentence1','sentence2'])
df


features are 

['and', 'and brought', 'brought', 'brought some', 'fast', 'fast and', 'faster', 'faster harry', 'faster would', 'fruit', 'fruit home', 'get', 'get home', 'got', 'got to', 'harry', 'harry got', 'harry the', 'harry went', 'home', 'some', 'some fruit', 'store', 'store fast', 'store the', 'the', 'the faster', 'the store', 'to', 'to the', 'went', 'went to', 'would', 'would get']




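|  | and | and brought | brought | brought some | fast | fast and | faster | faster harry | faster would | fruit | ... | store the | the | the faster | the store | to | to the | went | went to | would | would get |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sentence1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.448932 | 0.299288 | 0.149644 | 0.0 | ... | 0.149644 | 0.425892 | 0.448932 | 0.106473 | 0.106473 | 0.106473 | 0.0 | 0.0 | 0.149644 | 0.149644 |
| sentence2 | 0.238748 | 0.238748 | 0.238748 | 0.238748 | 0.238748 | 0.238748 | 0.0 | 0.0 | 0.0 | 0.238748 | ... | 0.0 | 0.169871 | 0.0 | 0.169871 | 0.169871 | 0.169871 | 0.238748 | 0.238748 | 0.0 | 0.0 |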


## Extracting features from SMS spam data

In [91]:
'''
Separate the texts & class labels
'''
sms_label = sms_msgs['class']
sms_text  = sms_msgs['sms']

'''
Split train and test samples
'''

from sklearn.model_selection import train_test_split
msg_train, msg_test, label_train, label_test = train_test_split(sms_text, sms_label, test_size=0.2)

print("Number of Training and Testing samples\n")
print(msg_train.shape)
print(msg_test.shape)
print(label_train.shape)
print(label_test.shape)



'''
First fit the tfidfvectorizer on the training set to extract features
'''
from sklearn.feature_extraction.text import TfidfVectorizer
train_texts=msg_train.to_list() # list that contain the texts from training set
vectorizer = TfidfVectorizer(lowercase=True, # Convert all characters to lowercase.
                             stop_words = 'english', # a built-in stop word list for English is used. all words in the stop_words are removed
                             min_df=1) 
sms_train_tfidf = vectorizer.fit_transform(train_texts)
sms_train_tfidf = sms_train_tfidf.todense()
print("Shape of Training data")
print(sms_train_tfidf.shape)


'''
Use the same  tfidfvectorizer to transform the testing set
'''

test_texts=msg_test.to_list()
sms_test_tfidf = vectorizer.transform(test_texts)
sms_test_tfidf = sms_test_tfidf.todense()
print("Shape of Testing data")
print(sms_test_tfidf.shape)

Number of Training and Testing samples

(4446,)
(1112,)
(4446,)
(1112,)
Shape of Training data
(4446, 7431)
Shape of Testing data
(1112, 7431)


In [79]:
print("Extracted features are")
print(vectorizer.get_feature_names_out().tolist())

Extracted features are




In [80]:
print("Training features Data Frame")
pd.DataFrame(sms_train_tfidf,columns=vectorizer.get_feature_names_out())

Training features Data Frame


Unnamed: 0,00,000,000pes,008704050406,0089,01223585236,0125698789,02,0207,02072069400,...,zeros,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,ãº1,éˆ
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4441,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4442,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4443,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4444,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Now the data is in the form of `samples x features`. Any predictive pipeline step, such as feature filtering, can now be applied.