# Text-Representation-BagOfWords-TFIDF-NGrams


## Feature Extraction - Text to Numbers in NLP
- Also called Text Representation or Text Vectorization
- The vectorization should preserve the hidden (semantic) meaning of the text

## Techniques for Vectorization
1. One Hot Encoding - Not used in practice. Disadvantages: sparsity and overfitting, no fixed input size, OOV (out-of-vocabulary) problem, does not capture semantic meaning.
2. Bag of Words - Based on word frequency; performs well for text classification. Disadvantages: sparsity, OOV, word order is lost (so the meaning can change), and small changes are barely reflected ("I am going" and "I am not going" have very different meanings but nearly identical vectors).
3. N-grams - Capture meaning better than single words when n is 2 or 3, because some word order is kept. Disadvantages: slower, even more sparsity.
4. TF-IDF - Term Frequency and Inverse Document Frequency. Compute the two weights and multiply them together.
5. Custom Features
6. Word2Vec - Embeddings learned by a neural network (generally the best of these at capturing semantic meaning)
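
The two Bag of Words drawbacks listed above (lost word order, and negation changing only one count) can be seen in a small hand-rolled sketch. The `bag_of_words` helper, the fixed vocabulary, and the "dog bites man" sentences are assumptions made for illustration; this is not how `CountVectorizer` is implemented.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary, chosen only to cover the example sentences.
vocabulary = ["am", "bites", "dog", "going", "i", "man", "not"]

# Word order is lost: both sentences map to the same vector.
order_a = bag_of_words("dog bites man", vocabulary)
order_b = bag_of_words("man bites dog", vocabulary)
print(order_a == order_b)  # True

# A negation flips the meaning but changes only a single count.
plain = bag_of_words("I am going", vocabulary)
negated = bag_of_words("I am not going", vocabulary)
differing = [w for w, a, b in zip(vocabulary, plain, negated) if a != b]
print(differing)  # ['not']
```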

# Say that we have a dataset of 5000 reviews
- Corpus - All the text: every review in the dataset combined together
- Vocabulary - All the unique words in the corpus
- Document - Each individual review is one document
- Word - A single individual word
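
As a rough illustration of these four terms, the sketch below uses three made-up reviews in place of the 5000-review dataset. The review texts and the whitespace tokenization are assumptions, chosen only to keep the example short.

```python
# Three made-up reviews stand in for the 5000-review dataset.
documents = [
    "great phone great battery",
    "battery died quickly",
    "great value",
]

# Corpus: all documents combined into one body of text.
corpus = " ".join(documents)

# Words: every individual token in the corpus (simple whitespace split).
words = corpus.split()

# Vocabulary: the unique words of the corpus.
vocabulary = sorted(set(words))

print(len(documents))  # 3
print(len(words))      # 9
print(vocabulary)      # ['battery', 'died', 'great', 'phone', 'quickly', 'value']
```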

# Bag of Words

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np

In [6]:
text = ["I like reading book", "Which book are you reading, is this good book" ]

In [7]:
cv = CountVectorizer()
count = cv.fit_transform(text)

In [8]:
count.toarray()

array([[0, 1, 0, 0, 1, 1, 0, 0, 0],
       [1, 2, 1, 1, 0, 1, 1, 1, 1]], dtype=int64)

In [12]:
cv.get_feature_names_out()

array(['are', 'book', 'good', 'is', 'like', 'reading', 'this', 'which',
       'you'], dtype=object)

In [14]:
df = pd.DataFrame(count.toarray(), columns=cv.get_feature_names_out())

In [15]:
df

|   | are | book | good | is | like | reading | this | which | you |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 2 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |


In [26]:
val = cv.transform(["Reading a good book is good for the brain"]).toarray()
val
# Words that are not part of the fitted vocabulary are ignored

array([[0, 1, 2, 1, 0, 1, 0, 0, 0]], dtype=int64)

In [27]:
# Wrap the transformed row in a DataFrame with the same columns as df
df1 = pd.DataFrame(val, columns=cv.get_feature_names_out())

In [28]:
df1

|   | are | book | good | is | like | reading | this | which | you |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 1 | 0 | 1 | 0 | 0 | 0 |


In [29]:
pd.concat([df, df1])

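|   | are | book | good | is | like | reading | this | which | you |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 2 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| 0 | 0 | 1 | 2 | 1 | 0 | 1 | 0 | 0 | 0 |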


- Remember: Bag of Words is the N-grams model with the parameter ngram_range=(1, 1)

# N Grams

- ngram_range = (2,2) - every pair of consecutive words becomes a vocabulary entry
- ngram_range = (1,2) - both single words and pairs of consecutive words become vocabulary entries

In [40]:
# for the same example
cv = CountVectorizer(ngram_range=(2,2))
count = cv.fit_transform(text)

In [41]:
count.toarray()

array([[0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
       [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]], dtype=int64)

In [42]:
cv.get_feature_names_out()

array(['are you', 'book are', 'good book', 'is this', 'like reading',
       'reading book', 'reading is', 'this good', 'which book',
       'you reading'], dtype=object)

In [43]:
df = pd.DataFrame(count.toarray(), columns=cv.get_feature_names_out())

In [44]:
df

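|   | are you | book are | good book | is this | like reading | reading book | reading is | this good | which book | you reading |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |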
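
To see what ngram_range = (1,2) would add for the same two sentences, here is a plain-Python sketch. The `tokenize` and `ngrams` helpers are hypothetical; they only approximate `CountVectorizer`'s default behaviour (lowercasing, and keeping tokens of two or more word characters, which is why "I" is dropped).

```python
import re

def tokenize(document):
    # Lowercase and keep tokens of two or more word characters,
    # an approximation of CountVectorizer's default tokenization.
    return re.findall(r"\b\w\w+\b", document.lower())

def ngrams(tokens, ngram_range):
    # Build every run of n consecutive tokens for each n in the range.
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

text = ["I like reading book", "Which book are you reading, is this good book"]

# Bigrams only, matching ngram_range=(2,2) for the first sentence.
print(ngrams(tokenize(text[0]), (2, 2)))  # ['like reading', 'reading book']

# Unigrams plus bigrams: 9 single words + 10 word pairs.
vocab = sorted({g for doc in text for g in ngrams(tokenize(doc), (1, 2))})
print(len(vocab))  # 19
```

The vocabulary roughly doubles here, which is the extra sparsity that the list of techniques attributes to n-grams.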


# TF-IDF
Term Frequency(t,d) = freq. of term t in document d **/ total number of terms in document d**
- TF - acts like a probability

Inverse Document Frequency(t) = **log**(Total no. of documents in the corpus **/ No. of documents with term t in them**)
- IDF - Gives less weight to words that are common across documents
- We take the log to dampen the IDF so that its value is comparable to the value of TF
- Used in search engines for keyword retrieval
- Disadvantages: sparsity, OOV, high dimensionality, does not capture semantic meaning
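
The two formulas above can be sketched directly in plain Python. The natural logarithm and the pre-tokenized documents are assumptions, and `idf` as written fails with a division by zero for a term that appears in no document.

```python
import math

# The two example sentences, already lowercased and tokenized
# (the single-letter word "I" is left out).
docs = [
    ["like", "reading", "book"],
    ["which", "book", "are", "you", "reading", "is", "this", "good", "book"],
]

def tf(term, doc):
    # Frequency of the term divided by the total number of terms.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(total documents / documents containing the term).
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "book" occurs in every document, so its IDF (and TF-IDF) is zero.
print(tfidf("book", docs[1], docs))            # 0.0

# "good" occurs in only one document, so it gets a positive weight.
print(round(tfidf("good", docs[1], docs), 4))  # 0.077
```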

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [50]:
text

['I like reading book', 'Which book are you reading, is this good book']

In [48]:
tfidf = TfidfVectorizer()
tfidf.fit_transform(text).toarray()

array([[0.        , 0.50154891, 0.        , 0.        , 0.70490949,
        0.50154891, 0.        , 0.        , 0.        ],
       [0.342369  , 0.48719673, 0.342369  , 0.342369  , 0.        ,
        0.24359836, 0.342369  , 0.342369  , 0.342369  ]])

In [49]:
tfidf.fit_transform(text).toarray().shape

(2, 9)

In [53]:
tfidf.idf_

array([1.40546511, 1.        , 1.40546511, 1.40546511, 1.40546511,
       1.        , 1.40546511, 1.40546511, 1.40546511])

In [54]:
tfidf.get_feature_names_out()

array(['are', 'book', 'good', 'is', 'like', 'reading', 'this', 'which',
       'you'], dtype=object)

In [57]:
print(tfidf.idf_)
print(tfidf.get_feature_names_out())

[1.40546511 1.         1.40546511 1.40546511 1.40546511 1.
 1.40546511 1.40546511 1.40546511]
['are' 'book' 'good' 'is' 'like' 'reading' 'this' 'which' 'you']
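
The `idf_` values above (1 for "book" and "reading", not 0) do not match the plain log(N / df) formula. With its default settings, `TfidfVectorizer` uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, multiplies it by the raw term count, and then L2-normalizes each row. The sketch below reproduces the printed numbers under those assumptions; the helper names are hypothetical.

```python
import math

# The two example sentences, lowercased and tokenized as above.
docs = [
    ["like", "reading", "book"],
    ["which", "book", "are", "you", "reading", "is", "this", "good", "book"],
]
vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

def smooth_idf(term):
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    # Raw count times smoothed IDF, then scale the row to unit L2 norm.
    raw = [doc.count(term) * smooth_idf(term) for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

print(round(smooth_idf("like"), 8))  # 1.40546511
print(round(smooth_idf("book"), 8))  # 1.0
print([round(v, 8) for v in tfidf_row(docs[0])])
```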


# Custom Features
- **Hybrid Features** - Generally we combine custom features with one of the vectorization techniques above
- You create features based on your requirements or domain knowledge, such as the number of positive words in the document, the number of negative words, the ratio of positive to negative words, or the word count (positive reviews tend to be longer than negative ones)
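
A minimal sketch of such hand-crafted features follows. The tiny word lists, the `custom_features` helper, and the +1 in the ratio (to avoid dividing by zero) are all assumptions for illustration; a real project would use a proper sentiment lexicon.

```python
# Hypothetical mini lexicons, for illustration only.
POSITIVE_WORDS = {"good", "great", "love"}
NEGATIVE_WORDS = {"bad", "poor", "hate"}

def custom_features(review):
    tokens = review.lower().split()
    n_pos = sum(t in POSITIVE_WORDS for t in tokens)
    n_neg = sum(t in NEGATIVE_WORDS for t in tokens)
    return {
        "word_count": len(tokens),
        "positive_count": n_pos,
        "negative_count": n_neg,
        # +1 in the denominator avoids division by zero.
        "pos_neg_ratio": n_pos / (n_neg + 1),
    }

features = custom_features("great book but poor binding and great price")
print(features)
# {'word_count': 8, 'positive_count': 2, 'negative_count': 1, 'pos_neg_ratio': 1.0}
```

For hybrid features, these values could be appended as extra columns next to the Bag of Words or TF-IDF matrix.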