1. What is feature extraction from text?
    - Converting text into numbers.
    - This is also called `Text Representation` or `Text Vectorization`.

2. Why do we need it?
    - A model is only as good as the features selected from the documents.
    - Garbage `in`, garbage `out`: poor features give poor predictions.

3. Why is this difficult to do?
    - Image data converts to a vector easily because pixels are already numbers (for example, in a text detection task).
    - The same holds for speech data (sampling rate -->> array).
    - Text has no such built-in numeric form, so features cannot be created from it directly.

4. What is the core idea?
    - The numeric representation should capture the semantic meaning of the text, i.e. the meaning of words in context.

5. What are the core techniques?
    - OHE (One Hot Encoding)
    - BOW (Bag of Words)
    - N-grams
    - TF-IDF (Term Frequency-Inverse Document Frequency)
    - Custom features (like repetition of words, average word length, count of vowels and consonants)
    - Deep learning based `Word2Vec`, also known as **Embeddings**

### Common Terms :

1. **Corpus (plural: corpora):** A collection of texts of a similar type, for example movie reviews or social media posts.

2. **Vocabulary:** The set of unique terms used in a corpus.

3. **Document:** A single body of text; documents collectively form a corpus.

4. **Word:** An individual word (term) in a document.

### One Hot Encoding :

| Document | Text |
|---|---|
| D1 | people watch campusx |
| D2 | campusx watch campusx |
| D3 | people write campusx |
| D4 | campusx write comment |

---

- Corpus :
    - people watch campusx campusx watch campusx people write campusx campusx write comment

- Vocabulary :
    - people watch campusx write comment
    - here the size of the `Vocabulary` is `5`

---

| people | watch | campusx | write | comment |
| ---| --- | --- | --- | --- |



- D1 = “people watch campusx”
    - D1 = [[1,0,0,0,0], [0,1,0,0,0], [0,0,1,0,0]]
    - Shape of D1 is (3, 5)
- D2 = “campusx watch campusx”
    - D2 = [[0,0,1,0,0], [0,1,0,0,0], [0,0,1,0,0]]
    - Shape of D2 is (3, 5)
- D3 = “people write campusx”
    - D3 = [[1,0,0,0,0], [0,0,0,1,0], [0,0,1,0,0]]
    - Shape of D3 is (3, 5)
- D4 = “campusx write comment”
    - D4 = [[0,0,1,0,0], [0,0,0,1,0], [0,0,0,0,1]]
    - Shape of D4 is (3, 5)
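
The encoding above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper name `one_hot_encode` is an illustrative assumption, and the vocabulary order is copied from the table above.

```python
# Vocabulary in the same column order as the table above.
vocabulary = ["people", "watch", "campusx", "write", "comment"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot_encode(document):
    """Return one one-hot row per word, so the shape is (num_words, vocab_size)."""
    rows = []
    for word in document.split():
        row = [0] * len(vocabulary)
        row[word_to_index[word]] = 1  # raises KeyError for a word outside the vocabulary
        rows.append(row)
    return rows


d1 = one_hot_encode("people watch campusx")
print(d1)                   # [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]]
print(len(d1), len(d1[0]))  # 3 5
```

Note that a four-word document would give four rows, so the shape depends on the document length.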


#### Pros :
- Intuitive: easy to understand
- Easy to implement

#### Flaws :
- Sparsity: as the vocabulary grows, every word vector gets longer and is almost entirely zeros.
- This high dimensionality can also lead to overfitting.
- No fixed size: if D3 were “people write campusx watch” instead of the original, its shape would change to (4, 5), and most ML algorithms cannot train on inputs of varying shape.
- Out of vocabulary: even if the model trains, a new word that is not in the vocabulary cannot be encoded, so predictions on such text will be poor.
- No capturing of semantic meaning

For example, take the words `walk`, `run` and `bottle`. `walk` and `run` are related in meaning while `bottle` is not, yet their one-hot vectors are all equally far apart, so the encoding cannot express that relation.
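
A quick check of this point, as a sketch that assumes a three-word vocabulary of `walk`, `run` and `bottle`: every pair of one-hot vectors has the same distance and a zero dot product.

```python
import math

# One-hot vectors for an assumed three-word vocabulary.
walk, run, bottle = [1, 0, 0], [0, 1, 0], [0, 0, 1]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot(a, b):
    """Dot product; zero means the vectors share no direction."""
    return sum(x * y for x, y in zip(a, b))


# The related pair and the unrelated pair look identical to the model.
print(euclidean(walk, run) == euclidean(walk, bottle))  # True
print(dot(walk, run), dot(walk, bottle))                # 0 0
```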

### Bag of Words :
- Count how many times each vocabulary word occurs in a document
- Mostly used for text classification

| | text | output |
|---|---|---|
| D1 | people watch campusx | 1 |
| D2 | campusx watch campusx | 1 |
| D3 | people write comment | 0 |
| D4 | campusx write comment | 0 |

| | people | watch | campusx | write | comment |
|--- | ---| --- | --- | --- | --- |
| D1 | 1 | 1 | 1 | 0 | 0 |
| D2 | 0 | 1 | 2 | 0 | 0 |
| D3 | 1 | 0 | 0 | 1 | 1 |
| D4 | 0 | 0 | 1 | 1 | 1 |


- Core intuition :
    - the order of the words does not matter
    - context does not matter
    - some semantic relation is still captured: each whole document becomes one vector, and the angle/distance between two vectors indicates how similar the documents are
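
The count table and the angle-based comparison can be sketched in plain Python. This is an illustration, not the sklearn implementation; it assumes D3 is `people write comment` (which is what the counts in the table reflect), and the helper names are made up for this example.

```python
import math
from collections import Counter

docs = [
    "people watch campusx",   # D1
    "campusx watch campusx",  # D2
    "people write comment",   # D3 (assumed, see lead-in)
    "campusx write comment",  # D4
]
vocabulary = ["people", "watch", "campusx", "write", "comment"]


def bag_of_words(document):
    """Count each vocabulary word; words outside the vocabulary are dropped."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


vectors = [bag_of_words(d) for d in docs]
print(vectors[1])                                 # [0, 1, 2, 0, 0]
print(cosine_similarity(vectors[0], vectors[1]))  # D1 vs D2, about 0.77
print(cosine_similarity(vectors[0], vectors[3]))  # D1 vs D4, about 0.33
```

D1 and D2 share two words, so their vectors point in a more similar direction than D1 and D4, which share only one.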

In [None]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'text':['people watch campusx','campusx watch campusx','people write comment','campusx write comment'],'output':[1,1,0,0,]})
df

| | text | output |
|---|---|---|
| 0 | people watch campusx | 1 |
| 1 | campusx watch campusx | 1 |
| 2 | people write comment | 0 |
| 3 | campusx write comment | 0 |


`from sklearn.feature_extraction.text import CountVectorizer` :

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#countvectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
bow = cv.fit_transform(df['text'])

In [None]:
#print vocabulary
cv.vocabulary_

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}


This means the columns are indexed in alphabetical order: campusx, comment, people, watch, write

In [None]:
print(bow[0].toarray())
print(bow[1].toarray())

[[1 0 1 1 0]]
[[2 0 0 1 0]]


people watch campusx

[[1 0 1 1 0]]

---

campusx watch campusx

[[2 0 0 1 0]]


In [None]:
cv.transform(['campusx watch and write comment of campusx']).toarray()

array([[2, 1, 0, 1, 1]])

Out-of-vocabulary words (here `and` and `of`) are ignored and do not appear in the output.

- Advantages :
    - Simple and intuitive
    - Out-of-vocabulary words do not break anything: they are ignored, so every document still maps to a vector of the same length
    - A little of the semantic relationship is captured


- Disadvantages :
    - Sparsity: the number of non-zero values is very small compared to the number of zero values in the dataset.
    - The resulting high dimensionality can lead to overfitting
    - Out-of-vocabulary words are simply dropped, so their information is lost
    - Word order is not considered

### n-grams :
- Word order is partly preserved because each feature is a sequence of `n` consecutive words. In `CountVectorizer` this is controlled by `ngram_range`, a tuple `(min_n, max_n)` with default `(1, 1)`.

For example, take `n=2` (bigrams).

The vocabulary is:
`people watch`, `watch campusx`, `campusx watch`, `people write`, `write comment`, `campusx write`


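| | people watch | watch campusx | campusx watch | people write | write comment | campusx write |
|---|---|---|---|---|---|---|
| D1 | 1 | 1 | 0 | 0 | 0 | 0 |
| D2 | 0 | 1 | 1 | 0 | 0 | 0 |
| D3 | 0 | 0 | 0 | 1 | 1 | 0 |
| D4 | 0 | 0 | 0 | 0 | 1 | 1 |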
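
The bigram vocabulary and vectors above can be built by hand. The following is a sketch in plain Python; the helper names `ngrams` and `vectorize` are illustrative assumptions, and the vocabulary is kept in first-seen order to match the listing above (sklearn sorts it alphabetically instead).

```python
from collections import Counter

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]


def ngrams(document, n):
    """Return all runs of n consecutive words in the document."""
    words = document.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Build the bigram vocabulary in first-seen order.
vocabulary = []
for doc in docs:
    for gram in ngrams(doc, 2):
        if gram not in vocabulary:
            vocabulary.append(gram)


def vectorize(document):
    """Count each vocabulary bigram in the document."""
    counts = Counter(ngrams(document, 2))
    return [counts[gram] for gram in vocabulary]


print(vocabulary)
for doc in docs:
    print(vectorize(doc))
```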

The same can be done for any `n` with `CountVectorizer` by setting `ngram_range`:

In [None]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'text':['people watch campusx','campusx watch campusx','people write comment','campusx write comment'],'output':[1,1,0,0,]})

from sklearn.feature_extraction.text import CountVectorizer
# cv = CountVectorizer(ngram_range=(2, 2))
# cv = CountVectorizer(ngram_range=(1, 2))
cv = CountVectorizer(ngram_range=(2, 3))
n_gram = cv.fit_transform(df['text'])

#print vocabulary
print(cv.vocabulary_)

{'people watch': 4, 'watch campusx': 8, 'people watch campusx': 5, 'campusx watch': 0, 'campusx watch campusx': 1, 'people write': 6, 'write comment': 9, 'people write comment': 7, 'campusx write': 2, 'campusx write comment': 3}


- Advantages :
    - Captures semantic meaning better, because some word order is kept
    - Intuitive to understand and easy to implement

- Disadvantages :
    - The size of the vocabulary grows quickly as `n` increases
    - Time complexity grows with it, so the model takes longer to train and to predict
    - No solution for out-of-vocabulary words

### Tf-Idf (Term Frequency - Inverse Document Frequency):
- Each word is assigned a weight based on how often it occurs in the document and how often it occurs across the corpus
- $$
\text{TF}(t, d) = \frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of terms in document } d}
$$

- $$
\text{IDF}(t) = \log_e \left( \frac{\text{Total number of documents in the corpus}}{\text{Number of documents with term } t \text{ in them}} \right)
$$

- TF-IDF is then calculated as `TF * IDF`
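
The two formulas can be applied directly to the four example documents. This is a minimal sketch of the textbook definition given above, not of sklearn's variant; the function names are assumptions made for the example.

```python
import math

docs = [d.split() for d in [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]]


def tf(term, doc):
    """Occurrences of the term divided by the number of terms in the document."""
    return doc.count(term) / len(doc)


def idf(term):
    """Natural log of (number of documents / documents containing the term)."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / doc_freq)


def tf_idf(term, doc):
    return tf(term, doc) * idf(term)


print(tf("campusx", docs[1]))      # 2 of 3 words
print(idf("campusx"))              # ln(4/3): appears in 3 of 4 documents
print(idf("people"))               # ln(4/2): appears in 2 of 4 documents
print(tf_idf("campusx", docs[1]))
```

A term that appears in most documents, such as `campusx`, gets a lower IDF than a rarer term such as `people`.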



In [1]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'text':['people watch campusx','campusx watch campusx','people write comment','campusx write comment'],'output':[1,1,0,0,]})
df

| | text | output |
|---|---|---|
| 0 | people watch campusx | 1 |
| 1 | campusx watch campusx | 1 |
| 2 | people write comment | 0 |
| 3 | campusx write comment | 0 |


In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf.fit_transform(df['text']).toarray()

array([[0.49681612, 0.        , 0.61366674, 0.61366674, 0.        ],
       [0.8508161 , 0.        , 0.        , 0.52546357, 0.        ],
       [0.        , 0.57735027, 0.57735027, 0.        , 0.57735027],
       [0.49681612, 0.61366674, 0.        , 0.        , 0.61366674]])

In [3]:
print(tfidf.idf_)
print(tfidf.get_feature_names_out())

[1.22314355 1.51082562 1.51082562 1.51082562 1.51082562]
['campusx' 'comment' 'people' 'watch' 'write']


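Internally, sklearn's `TfidfVectorizer` uses `TfidfTransformer`, whose formula differs from the textbook one above. With `smooth_idf=True` (the default) it computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing term `t`. The documentation gives the reason for the added `1`: "The effect of adding “1” to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored." In addition, TF is the raw term count and each row is L2-normalized (`norm='l2'`, the default), which is why the values above do not equal a plain `TF * IDF`.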

refer :- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#tfidftransformer
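
To see where the numbers printed by `TfidfVectorizer` come from, the smoothed formula can be recomputed by hand. This is a sketch in plain Python under the default settings (`smooth_idf=True`, `norm='l2'`, raw counts as TF); the function names are assumptions, not sklearn APIs.

```python
import math

docs = [d.split() for d in [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]]
# Alphabetical order, matching get_feature_names_out() above.
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)


def smooth_idf(term):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1"""
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def tfidf_row(doc):
    """Raw count times smoothed idf, then L2-normalize the row."""
    raw = [doc.count(term) * smooth_idf(term) for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


print(vocab)
print([round(smooth_idf(term), 8) for term in vocab])  # compare with tfidf.idf_
print([round(value, 8) for value in tfidf_row(docs[0])])  # compare with the first output row
```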



Why do we take the log while calculating `idf`?
- tf : shows the importance of a term within a particular document
- idf : shows the importance of a term across the corpus
- both tf and idf should contribute to the product `tf * idf`
- tf varies from `0` to `1`, but without the log the ratio inside idf can range from `1` to a very large number for a rare word in a big corpus, **so `idf` would dominate `tf`**
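
A small numeric illustration of this scaling argument. The corpus size and document frequencies below are made-up values chosen only to show the difference in magnitude.

```python
import math

total_docs = 1_000_000  # hypothetical corpus size

# (document frequency, raw ratio N/df, log-scaled idf)
rows = [
    (doc_freq, total_docs / doc_freq, math.log(total_docs / doc_freq))
    for doc_freq in (10, 1_000, 100_000)
]

for doc_freq, ratio, log_idf in rows:
    print(doc_freq, ratio, round(log_idf, 2))
```

For the rarest term the raw ratio is 100000 while its log is about 11.5, which stays on a scale where a tf between 0 and 1 still influences the product.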

- Advantages :
    1. Primarily used in `Information retrieval`, such as search engines


- Disadvantages :
    1. Sparsity (with a large corpus the share of `0` values grows, which can lead to overfitting)
    2. Out-of-vocabulary words are still not handled
    3. A large vocabulary increases the dimension of the array
    4. Semantic relations are not captured

### Custom Features :
- number of positive words
- number of negative words
- ratio of `positive words` / `negative words`
- word counts based on domain knowledge
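
A sketch of how such hand-crafted features might be computed. The word lists are tiny hypothetical lexicons, and the `+ 1` in the ratio is an added assumption to avoid division by zero; neither comes from the notes above.

```python
# Hypothetical sentiment lexicons; a real project would use a domain-specific list.
POSITIVE_WORDS = {"good", "great", "love"}
NEGATIVE_WORDS = {"bad", "boring", "hate"}


def custom_features(text):
    """Return a dict of simple hand-crafted features for one document."""
    words = text.lower().split()
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    return {
        "word_count": len(words),
        "positive_count": positive,
        "negative_count": negative,
        # Assumption: add 1 to the denominator so texts with no negative words still work.
        "pos_neg_ratio": positive / (negative + 1),
        "avg_word_length": sum(len(word) for word in words) / len(words) if words else 0.0,
    }


features = custom_features("great plot but boring ending and bad music")
print(features)
```

These values can be appended as extra columns next to the BoW or TF-IDF features.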