## Feature Extraction

The **sklearn.feature_extraction module** can be used to extract features from text or images in a format supported by machine learning algorithms.

### Extracting features from categorical variables

- Many machine learning problems have categorical features.

- Categorical variables are commonly encoded using one-hot encoding, in which the
  explanatory variable (or feature) is represented by one binary feature for each of its
  possible values.

<img src="images/Dummy_Variables1.PNG"/>&nbsp;&nbsp;<img src="images/onek.PNG" />
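As a rough illustration of one-hot encoding, the sketch below encodes a single categorical variable by hand in plain Python. The `cities` data and all variable names are made up for this example; in practice scikit-learn's `DictVectorizer` or `OneHotEncoder` would normally do this work.

```python
# Hypothetical toy data: one categorical variable with four observations.
cities = ["New York", "San Francisco", "New York", "Chapel Hill"]

# One binary column per distinct value, in sorted order.
categories = sorted(set(cities))

# Each row has a 1 in the column of its own category and 0 everywhere else.
one_hot = [[1 if city == category else 0 for category in categories]
           for city in cities]

print(categories)
for row in one_hot:
    print(row)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.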

### Extracting features from text

- Many machine learning problems use text as an explanatory variable. 

<img src="images/ex2.PNG" />

The most common representation of text used in machine learning is:
    
 **The bag-of-words model**

- Text is represented as a matrix in which each row is an observation (a document) and
  each column is a unique word.

- Each element of the matrix is either a binary value that indicates whether the word is
  present or an integer that counts how many times the word appears.

A bag-of-words representation describes the occurrence of words within a document; word order is discarded.
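To make the matrix concrete, here is a minimal hand-built bag-of-words in plain Python. The two-sentence corpus and the naive whitespace tokenization are assumptions made for illustration only; they are not how scikit-learn tokenizes.

```python
from collections import Counter

# Hypothetical toy corpus.
docs = ["the dog chased the cat", "the cat sat"]

# Count word occurrences per document (naive whitespace tokenization).
counts = [Counter(doc.split()) for doc in docs]

# The vocabulary is every unique word in the corpus: one column each.
vocabulary = sorted(set(word for c in counts for word in c))

# Count matrix: rows are documents, columns are vocabulary words.
count_matrix = [[c[word] for word in vocabulary] for c in counts]

# Binary variant: 1 if the word is present, 0 otherwise.
binary_matrix = [[int(c[word] > 0) for word in vocabulary] for c in counts]

print(vocabulary)
print(count_matrix)
print(binary_matrix)
```

The two matrices differ only where a word repeats within a document, as "the" does in the first sentence.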

### CountVectorizer

`CountVectorizer` provides a simple way to produce a bag-of-words representation from a collection of text documents.

- A collection of documents is called a corpus.


- Tokenization is the process of splitting a document or string into tokens (words).


- By default, the CountVectorizer class lowercases the text and tokenizes it with a regular
  expression that extracts sequences of two or more word characters; whitespace and
  punctuation act only as separators.
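The default tokenization can be approximated with the standard `re` module. The sketch below applies the default `token_pattern`, `(?u)\b\w\w+\b`, after lowercasing; the `tokenize` helper is a hypothetical name and not part of scikit-learn.

```python
import re

# Default token_pattern of CountVectorizer: runs of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    """Lowercase the text, then extract tokens of two or more word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())

print(tokenize("Hey!!! I need a favor"))   # ['hey', 'need', 'favor']
print(tokenize("Did you go home?.."))      # ['did', 'you', 'go', 'home']
```

Single-character words such as "I" and "a" do not match the pattern, and punctuation never becomes part of a token.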

### CountVectorizer step by step

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [2]:
vect = CountVectorizer()

In [3]:
messages = ["Hey hey hey lets go get lunch today..!",
            "Did you go home?..",
            "Hey!!! I need a favor"]

Using the fit() method, CountVectorizer learns which tokens are used in our messages.

In [5]:
vect.fit(messages)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

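- Using the get_feature_names() method, we can see which features have been created
  from our messages (that is, which tokens CountVectorizer has learned). In scikit-learn 1.0
  and later the method is called get_feature_names_out(); get_feature_names() was removed in 1.2.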

In [6]:
print(vect.get_feature_names())

[u'did', u'favor', u'get', u'go', u'hey', u'home', u'lets', u'lunch', u'need', u'today', u'you']


In [None]:
#list(vect.get_stop_words())

In [7]:
for name in vect.get_feature_names():
    print(name, end=' ')

did favor get go hey home lets lunch need today you


#### A few things to note

 1. Everything is lowercased.
 
 2. Words shorter than two characters are not included (e.g. 'a' and 'I').
 
 3. Punctuation has been removed.
 
 4. There are no duplicates.
 
 5. Features are in alphabetical order.

Now transform() creates a matrix that represents our messages (a document-term matrix).

In [8]:
dtm = vect.transform(messages)

In [10]:
# transform() returns a sparse matrix; toarray() makes it dense for display
print(dtm.toarray())

[[0 0 1 1 3 0 1 1 0 1 0]
 [1 0 0 1 0 1 0 0 0 0 1]
 [0 1 0 0 1 0 0 0 1 0 0]]


In [11]:
df = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
print(df)

   did  favor  get  go  hey  home  lets  lunch  need  today  you
0    0      0    1   1    3     0     1      1     0      1    0
1    1      0    0   1    0     1     0      0     0      0    1
2    0      1    0   0    1     0     0      0     1      0    0


In [None]:
test = 'hello john..! how are you.@#.?'
# goal: 'hello john how are you'

In [None]:
import string
string.punctuation

In [None]:
# Keep every character that is not punctuation
d = []
for ch in test:
    if ch not in string.punctuation:
        d.append(ch)

print(''.join(d))

In [None]:
new_msg = ['Hey lets go get a drink tonight']
# Tokens not seen during fit() (here 'drink' and 'tonight') are ignored by transform()
new_dtm = vect.transform(new_msg)
print(new_dtm.toarray())

### TF-IDF
Term Frequency, Inverse Document Frequency

- It's a way to score the importance of words (or "terms") in a document based on how
  frequently they appear in that document and across multiple documents.
  

- If a word appears frequently in a document, it is important to that document and gets a high score.


- But if a word appears in many documents, it is not a unique identifier, so it gets a low
  score.

The TF-IDF score of a word w in a document is:
    
       tfidf(w) = tf(w) * idf(w)

where

- tf(w) = (number of times w appears in the document) / (total number of words in the document)

- idf(w) = log(number of documents / number of documents that contain w)
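The formula can be evaluated directly. The following is a minimal sketch in plain Python, assuming a natural logarithm, a made-up pre-tokenized corpus, and hypothetical helper names `tf`, `idf`, and `tfidf`; it also assumes the word occurs in at least one document, otherwise `idf` would divide by zero.

```python
import math

# Hypothetical toy corpus, already tokenized.
docs = [["hey", "lets", "go"],
        ["did", "you", "go", "home"],
        ["hey", "hey", "favor"]]

def tf(word, doc):
    """Share of the document's words that are `word`."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Log of (number of documents / number of documents containing `word`)."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "hey" fills 2/3 of the last document but occurs in 2 of the 3 documents.
print(round(tfidf("hey", docs[2], docs), 4))
# "favor" fills only 1/3 of it but occurs in no other document.
print(round(tfidf("favor", docs[2], docs), 4))
```

Although "hey" is twice as frequent as "favor" in the last document, "favor" scores higher because it is rarer across the corpus.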

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

In [10]:
messages = ["Hey  lets go get lunch today..!",
            "Did you go home?..",
            "Hey!!! hey heyI favor need a favor",
            "I want a favor"]

In [11]:
vect = TfidfVectorizer()
dtm = vect.fit_transform(messages)

In [12]:
df = pd.DataFrame(dtm.toarray(),columns=vect.get_feature_names())

In [13]:
df

        did     favor       get        go       hey      heyi      home      lets     lunch      need     today      want       you
0  0.000000  0.000000  0.436719  0.344315  0.344315  0.000000  0.000000  0.436719  0.436719  0.000000  0.436719  0.000000  0.000000
1  0.525473  0.000000  0.000000  0.414289  0.000000  0.000000  0.525473  0.000000  0.000000  0.000000  0.000000  0.000000  0.525473
2  0.000000  0.597147  0.000000  0.000000  0.597147  0.378703  0.000000  0.000000  0.000000  0.378703  0.000000  0.000000  0.000000
3  0.000000  0.619130  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.785288  0.000000
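The scores above do not follow the textbook formula exactly. With its default settings, `TfidfVectorizer` uses the raw count as tf, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit Euclidean length. The sketch below reproduces the third row by hand in plain Python under those defaults; the helper name `tfidf_row` is hypothetical.

```python
import math
import re
from collections import Counter

messages = ["Hey  lets go get lunch today..!",
            "Did you go home?..",
            "Hey!!! hey heyI favor need a favor",
            "I want a favor"]

# Tokenize the way the default vectorizer does: lowercase, 2+ word characters.
tokenized = [re.findall(r"(?u)\b\w\w+\b", m.lower()) for m in messages]
n_docs = len(tokenized)

# Document frequency: number of messages that contain each token.
doc_freq = Counter(t for tokens in tokenized for t in set(tokens))

def tfidf_row(tokens):
    counts = Counter(tokens)
    # Raw count times smoothed idf.
    weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
    # L2-normalize so the row has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row = tfidf_row(tokenized[2])
print({t: round(w, 6) for t, w in sorted(row.items())})
# {'favor': 0.597147, 'hey': 0.597147, 'heyi': 0.378703, 'need': 0.378703}
```

Because of the normalization, scores are comparable across messages of different lengths.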


### Stop word filtering

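One strategy is to remove stop words: very common words (such as "a", "the", or "you") that appear in most documents of a corpus and so carry little information. Passing stop_words='english' to CountVectorizer applies scikit-learn's built-in English stop word list.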

In [14]:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vect = CountVectorizer(stop_words='english')

messages = ["Hey hey hey lets go get lunch today..!",
            "Did you go home?..",
            "Hey!!! I need a favor"]

dtm = vect.fit_transform(messages)

df = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
print(df)

   did  favor  hey  home  lets  lunch  need  today
0    0      0    3     0     1      1     0      1
1    1      0    0     1     0      0     0      0
2    0      1    1     0     0      0     1      0
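The built-in list is fixed in advance. A corpus-specific alternative, sketched here in plain Python, drops any token that appears in more than a chosen share of the documents. The 0.5 threshold and the variable names are assumptions for illustration; `CountVectorizer`'s `max_df` parameter offers a comparable filter.

```python
import re
from collections import Counter

messages = ["Hey hey hey lets go get lunch today..!",
            "Did you go home?..",
            "Hey!!! I need a favor"]

tokenized = [re.findall(r"(?u)\b\w\w+\b", m.lower()) for m in messages]

# Document frequency: in how many messages does each token occur?
doc_freq = Counter(t for tokens in tokenized for t in set(tokens))

max_share = 0.5  # assumed threshold: drop tokens found in over half the messages
too_common = {t for t, n in doc_freq.items() if n / len(tokenized) > max_share}

filtered = [[t for t in tokens if t not in too_common] for tokens in tokenized]

print(sorted(too_common))
print(filtered)
```

On this tiny corpus the filter removes "go" and "hey", each of which occurs in two of the three messages.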


In [15]:
list(vect.get_stop_words())

['all',
 'six',
 'less',
 'being',
 'indeed',
 'over',
 'move',
 'anyway',
 'fifty',
 'four',
 'not',
 'own',
 'through',
 'yourselves',
 'go',
 'where',
 'mill',
 'only',
 'find',
 'before',
 'one',
 'whose',
 'system',
 'how',
 'somewhere',
 'with',
 'thick',
 'show',
 'had',
 'enough',
 'should',
 'to',
 'must',
 'whom',
 'seeming',
 'under',
 'ours',
 'has',
 'might',
 'thereafter',
 'latterly',
 'do',
 'them',
 'his',
 'around',
 'than',
 'get',
 'very',
 'de',
 'none',
 'cannot',
 'every',
 'whether',
 'they',
 'front',
 'during',
 'thus',
 'now',
 'him',
 'nor',
 'name',
 'several',
 'hereafter',
 'always',
 'who',
 'cry',
 'whither',
 'this',
 'someone',
 'either',
 'each',
 'become',
 'thereupon',
 'sometime',
 'side',
 'two',
 'therein',
 'twelve',
 'because',
 'often',
 'ten',
 'our',
 'eg',
 'some',
 'back',
 'up',
 'namely',
 'towards',
 'are',
 'further',
 'beyond',
 'ourselves',
 'yet',
 'out',
 'even',
 'will',
 'what',
 'still',
 'for',
 'bottom',
 'mine',
 'since',
 '