# Feature Extraction Techniques

Some of the most popular feature extraction methods for text are:

- Bag-of-Words
- TF-IDF
- Count Vectorization

## Bag-of-Words Analysis

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

- A vocabulary of known words.
- A measure of the presence of known words.

It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
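As a minimal sketch of those two ingredients, the snippet below builds a vocabulary and a count vector per document. The two toy sentences are hypothetical and are not taken from the dataset used later; counting occurrences is only one possible "measure of presence".

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the tweets dataset).
documents = [
    "the cat sat on the mat",
    "the dog sat",
]

# 1. Vocabulary of known words: every distinct word in the corpus, sorted.
vocabulary = sorted({word for doc in documents for word in doc.split()})

# 2. Measure of presence: how many times each vocabulary word occurs in a document.
def count_vector(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [count_vector(doc) for doc in documents]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Each document becomes a fixed-length vector whose positions line up with the vocabulary, regardless of where the words appeared in the text.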

**Limitations of bag-of-words analysis**

- Vocabulary: The vocabulary requires careful design, in particular to manage its size, which affects the sparsity of the document representations.
- Sparsity: Sparse representations are harder to model, both for computational reasons (space and time complexity) and for information reasons: the model has to make use of very little information spread over a very large representational space.
- Meaning: Discarding word order ignores context, and in turn the meaning of words in the document (semantics). If modeled, context and meaning could let the model tell apart the same words arranged differently (“this is interesting” vs “is this interesting”), recognize synonyms (“old bike” vs “used bike”), and much more.
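The "Meaning" limitation can be checked directly with the examples quoted above. This is a small sketch using a hypothetical helper named `bag`; it is not part of the notebook's pipeline.

```python
from collections import Counter

def bag(text):
    """Return word counts for a text, ignoring word order."""
    return Counter(text.lower().split())

# Same words, different order: the two bags are identical.
same_bag = bag("this is interesting") == bag("is this interesting")

# Synonyms: "old" and "used" are unrelated features to the model.
synonyms_match = bag("old bike") == bag("used bike")

print(same_bag)
print(synonyms_match)
```

The statement and the question collapse into one representation, while the two near-synonymous phrases share only the word "bike".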

## Importing libraries

In [3]:
import pandas as pd
import numpy as np 

from tensorflow.keras.preprocessing.text import Tokenizer



## Reading data

In [4]:
tweets = pd.read_csv('C:\\Users\\nehal\\Music\\12.NLP\\Practise\\Datasets\\narendramodi_tweets.csv')
print(tweets.shape)
tweets.head()

(3220, 14)


| | id | retweets_count | favorite_count | created_at | text | lang | retweeted | followers_count | friends_count | hashtags_count | description | location | background_image_url | source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.263846e+17 | 1406.0 | 4903.0 | 2017-01-31 11:00:07 | The President's address wonderfully encapsulat... | en | False | 26809964.0 | 1641.0 | 1.0 | Prime Minister of India | India | http://pbs.twimg.com/profile_background_images... | Twitter Web Client |
| 1 | 8.263843e+17 | 907.0 | 2877.0 | 2017-01-31 10:59:12 | Rashtrapati Ji's address to both Houses of Par... | en | False | 26809964.0 | 1641.0 | 0.0 | Prime Minister of India | India | http://pbs.twimg.com/profile_background_images... | Twitter Web Client |
| 2 | 8.263827e+17 | 694.0 | 0.0 | 2017-01-31 10:52:33 | RT @PMOIndia: Empowering the marginalised. htt... | en | False | 26809964.0 | 1641.0 | 0.0 | Prime Minister of India | India | http://pbs.twimg.com/profile_background_images... | Twitter Web Client |
| 3 | 8.263826e+17 | 666.0 | 0.0 | 2017-01-31 10:52:22 | RT @PMOIndia: Commitment to welfare of farmers... | en | False | 26809964.0 | 1641.0 | 0.0 | Prime Minister of India | India | http://pbs.twimg.com/profile_background_images... | Twitter Web Client |
| 4 | 8.263826e+17 | 716.0 | 0.0 | 2017-01-31 10:52:16 | RT @PMOIndia: Improving the quality of life fo... | en | False | 26809964.0 | 1641.0 | 0.0 | Prime Minister of India | India | http://pbs.twimg.com/profile_background_images... | Twitter Web Client |


## Text Preprocessing

In [5]:
# convert to lower case and keep only letters, whitespace and full stops
# (regex=True is required in recent pandas, where str.replace defaults to literal matching)
docs=tweets.text.str.lower().str.replace(r'[^a-z\s.]','',regex=True)
docs[:5]

0    the presidents address wonderfully encapsulate...
1    rashtrapati jis address to both houses of parl...
2    rt pmoindia empowering the marginalised. https...
3    rt pmoindia commitment to welfare of farmers. ...
4    rt pmoindia improving the quality of life for ...
Name: text, dtype: object
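To see what this cleaning step does to a single string, here is a small sketch using the same pattern with the standard `re` module. The sample tweet and the helper name `clean` are hypothetical.

```python
import re

def clean(text):
    # Lowercase, then drop every character that is not a-z, whitespace or a full stop.
    return re.sub(r"[^a-z\s.]", "", text.lower())

# Hypothetical tweet-like string, not taken from the dataset.
sample = "RT @Example: Hello, World! It's 2017. https://t.co/abc"
cleaned = clean(sample)
print(cleaned)
```

Digits, apostrophes, `@` and URL punctuation are all removed, so "It's" becomes "its" and the link collapses into a single token such as `httpst.coabc`. Tokens of that shape appear later in the word counts.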

## Tokenization

In [6]:
# Split each tweet into words
docs_tokens=docs.str.split(' ')
docs_tokens[:5]

0    [the, presidents, address, wonderfully, encaps...
1    [rashtrapati, jis, address, to, both, houses, ...
2    [rt, pmoindia, empowering, the, marginalised.,...
3    [rt, pmoindia, commitment, to, welfare, of, fa...
4    [rt, pmoindia, improving, the, quality, of, li...
Name: text, dtype: object

In [7]:
#Putting all tokens into a list 
tokens_all=[]

for x in docs_tokens:
    tokens_all.extend(x)
print('No. of tokens in entire corpus:',len(tokens_all))

No. of tokens in entire corpus: 56862


### Bag-of-Words Analysis

In [8]:
bow=pd.Series(tokens_all).value_counts()
bow

                      4690
the                   2184
to                    1516
of                    1508
amp                   1480
                      ... 
ishafoundation           1
httpst.cosdjlboist       1
re                       1
flagd                    1
gowda.                   1
Length: 10026, dtype: int64
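The blank entry at the top of the counts is most likely the empty string: splitting on a single space produces an empty token wherever two spaces are adjacent, for example where characters were stripped during preprocessing. The sketch below illustrates the difference on a hypothetical string.

```python
from collections import Counter

# Hypothetical cleaned tweet; the double spaces mimic gaps left by removed characters.
text = "rt  example hello  world hello"

tokens_single_space = text.split(" ")  # adjacent spaces yield empty-string tokens
tokens_whitespace = text.split()       # splits on runs of whitespace, no empty tokens

print(Counter(tokens_single_space)[""])  # number of empty tokens
print(Counter(tokens_whitespace)[""])
```

In pandas, calling `str.split()` without an argument behaves like the second variant, which would keep empty tokens out of the counts.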

### TF-IDF

A problem with scoring raw word frequency is that highly frequent words start to dominate the document representation (they get larger scores), even though they may carry less “informational content” for the model than rarer, perhaps domain-specific, words.

One approach is to rescale the frequency of words by how often they appear in all documents, so that the scores for frequent words like “the” that are also frequent across all documents are penalized.

This approach to scoring is called Term Frequency – Inverse Document Frequency, or TF-IDF for short, where:

- Term Frequency: a score for how frequent the word is in the current document.
- Inverse Document Frequency: a score for how rare the word is across documents.

The scores act as weights: not all words are equally important or interesting.

The scores have the effect of highlighting words that are distinct (contain useful information) in a given document.
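A hand-computed sketch may make the two factors concrete. It assumes the plain textbook variant, with TF as the word count divided by document length and IDF as `log(N / df)`; the three toy sentences are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus (hypothetical).
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Textbook TF-IDF: (count / document length) * log(N / document frequency)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

scores = [tf_idf(tokens) for tokens in tokenized]

# Words of the first document, highest score first.
for word, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{word:>4}  {score:.3f}")
```

In the first document, "the" scores zero because it occurs in every document, while "mat" scores highest because it occurs in no other document.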

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [10]:
vectorizer=TfidfVectorizer()
vectorizer.fit_transform(docs)

<3220x8799 sparse matrix of type '<class 'numpy.float64'>'
	with 49746 stored elements in Compressed Sparse Row format>

In [13]:
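# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab=vectorizer.get_feature_names_out().tolist()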
vocab[:5]

['aabhar', 'aadhaar', 'aadhar', 'aajtak', 'aamirkhan']

In [16]:
pd.DataFrame(vectorizer.fit_transform(docs).toarray(),columns=vocab)

Unnamed: 0,aabhar,aadhaar,aadhar,aajtak,aamirkhan,aanandmayi,aap,aawas,aazadisaal,abdel,...,zaidi,zayed,zeal,zero,zhejiang,ziara,zimbabwe,zone,zones,zuma
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3215,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3216,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3218,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Count Vectorizer

CountVectorizer converts a collection of text documents into a matrix of word counts. Each row is one document, each column is one word of the vocabulary learned from the whole corpus, and the value of each cell is the count of that word in that document.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer=CountVectorizer()
vectorizer.fit_transform(docs)

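# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab=vectorizer.get_feature_names_out().tolist()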
vocab[:5]

pd.DataFrame(vectorizer.fit_transform(docs).toarray(),columns=vocab)

Unnamed: 0,aabhar,aadhaar,aadhar,aajtak,aamirkhan,aanandmayi,aap,aawas,aazadisaal,abdel,...,zaidi,zayed,zeal,zero,zhejiang,ziara,zimbabwe,zone,zones,zuma
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3215,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3216,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3217,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3218,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0

