### What is Feature Extraction from Text?
Feature extraction from text means turning text into numbers or patterns that a computer can understand. It helps models learn from words by using techniques like:

- **Bag of Words**: Counts how many times each word appears.
- **TF-IDF**: Gives higher weight to words that are frequent in one document but rare across the other documents.
- **Word Embeddings**: Maps words to dense numeric vectors so that words with similar meanings get similar vectors.
- **N-grams**: Uses sequences of adjacent words (pairs, triples, and so on) to capture some local context.

This process helps computers work with text data for tasks like prediction or classification.

### Why do we need it?
We need feature extraction from text because machine learning models cannot operate on words directly; they operate on numbers. By converting text into numerical features, we help the model pick up important information, like word meanings or relationships. This makes it possible to use text for tasks like sentiment analysis, spam detection, or any machine learning model that requires structured data. Without feature extraction, the text would just be an opaque sequence of characters to the machine.

### Why is it difficult?
Feature extraction from text is difficult because:

1. **Text is Unstructured**: Unlike tabular numeric data, text has no fixed schema or length, making it hard for computers to interpret.
   
2. **Complex Meaning**: Words can have multiple meanings, and context is important to understand them properly.

3. **Word Relationships**: Words don’t always appear in the same order, and their connections (like synonyms or phrases) are tricky for machines to grasp.

4. **High Dimensionality**: Text data can have thousands of unique words, creating very large feature sets, which are hard to manage.

These challenges make it complex to turn text into useful features for machine learning.

### What is the core idea?
The core idea of feature extraction from text is to transform raw text into a numerical format that a computer can understand and analyze. This process captures the essential information and relationships within the text while reducing its complexity, allowing machine learning models to effectively learn patterns and make predictions based on the text data. In simple terms, it’s about simplifying and organizing text so that computers can work with it.

### What are the techniques?
Here are the techniques for feature extraction from text:

1. **Bag of Words (BoW)**: Counts the frequency of each word in a document without considering the order.

2. **Term Frequency-Inverse Document Frequency (TF-IDF)**: Weighs the importance of words based on how often they appear in a document compared to all documents, highlighting unique terms.

3. **Word Embeddings**:
   - **Word2Vec**: Represents words as vectors in a continuous space, capturing semantic meanings and relationships.
   - **GloVe (Global Vectors for Word Representation)**: Similar to Word2Vec but focuses on global word-word co-occurrence statistics.

4. **N-grams**: Captures sequences of N words (e.g., bigrams for pairs of words) to preserve some context in the text.

5. **Count Vectorization**: The usual way BoW is implemented (for example, scikit-learn's `CountVectorizer`). It counts occurrences of words or n-grams to create feature vectors.

6. **One-Hot Encoding**: Converts each word into a binary vector where only one bit is "1" (indicating the presence of that word) and all other bits are "0." This technique creates a sparse representation of words.

7. **Part of Speech (POS) Tagging**: Identifies grammatical elements (like nouns, verbs, and adjectives) in the text to provide additional features.

8. **TextRank / RAKE**: Algorithms for extracting keywords from the text based on their importance and relevance.

9. **Latent Semantic Analysis (LSA)**: Uses singular value decomposition to reduce dimensionality and uncover hidden relationships between terms.

10. **Topic Modeling**: Techniques like Latent Dirichlet Allocation (LDA) identify topics within a collection of documents, providing a feature representation based on those topics.

These techniques help convert text into meaningful features that can be used for various natural language processing tasks.
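
One-hot encoding from the list above is easy to sketch in plain Python. This is a minimal illustration, not a library implementation: the five-word vocabulary is borrowed from the toy sentences used later in this notebook, and the `one_hot` helper is a name made up for this example.

```python
# Toy vocabulary (assumption: taken from the example sentences in this notebook).
vocabulary = sorted({"people", "watch", "campusx", "write", "comment"})
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a binary vector with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


sentence = "people watch campusx".split()
encoded = [one_hot(word) for word in sentence]

# Each word becomes one row; every row has exactly one 1.
for word, vector in zip(sentence, encoded):
    print(word, vector)
```

Each vector is as long as the vocabulary, so with a realistic vocabulary of tens of thousands of words almost every entry is zero. That is the sparsity the description refers to.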



# Common Terms
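1. **Corpus**: The full collection of documents being analyzed.
2. **Vocabulary**: The set of unique words in the corpus.
3. **Document**: A single text unit in the corpus, such as one review, email, or sentence.
4. **Word**: A single token within a document.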
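
The four terms can be made concrete with a small sketch. It assumes the same four toy sentences used in the cells below and splits on whitespace, which is a simplification of what real tokenizers do.

```python
# The corpus is the whole collection of texts.
corpus = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

# Each string in the corpus is one document.
documents = corpus

# Words come from a plain whitespace split here (a simplifying assumption).
words = [word for document in documents for word in document.split()]

# The vocabulary is the set of distinct words across the whole corpus.
vocabulary = sorted(set(words))

print(len(documents), "documents")
print(len(words), "words in total")
print(len(vocabulary), "unique words:", vocabulary)
```
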
<hr>

## Approaches

### Bag of Words

In [3]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'text': ['people watch campusx', 'campusx watch campusx', 'people write comment', 'campusx write comment'], 'output':[1,1,0,0]})
df

|   | text | output |
|---|------|--------|
| 0 | people watch campusx | 1 |
| 1 | campusx watch campusx | 1 |
| 2 | people write comment | 0 |
| 3 | campusx write comment | 0 |


In [5]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

In [15]:
bow = cv.fit_transform(df['text'])

In [16]:
# vocab
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}


In [17]:
print(bow[0].toarray())
print(bow[1].toarray())

[[1 0 1 1 0]]
[[2 0 0 1 0]]


In [18]:
cv.transform(['campusx watch and write comment']).toarray()

array([[1, 1, 0, 1, 1]])
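
To see what the vectorizer is doing in the cells above, the counting can be sketched in plain Python. This is a simplified stand-in, not scikit-learn's implementation: it splits on whitespace, whereas `CountVectorizer` by default also lowercases the text and drops single-character tokens. The `bag_of_words` helper is a name made up for this example.

```python
from collections import Counter

corpus = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

# Vocabulary in alphabetical order, matching the column order shown above.
vocabulary = sorted({word for doc in corpus for word in doc.split()})


def bag_of_words(text):
    """Count how often each vocabulary word occurs in text."""
    counts = Counter(text.split())
    # Counter returns 0 for missing keys; words outside the vocabulary are never looked up.
    return [counts[word] for word in vocabulary]


print(vocabulary)
print(bag_of_words("campusx watch campusx"))
print(bag_of_words("campusx watch and write comment"))
```

The last call mirrors the `transform` cell: "and" was never seen during fitting, so it has no column and is silently dropped.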

| **Advantages**                          | **Disadvantages**                                 |
|------------------------------------------|---------------------------------------------------|
| **Simplicity**: Easy to implement and understand. | **Ignores Context**: Loses word order and meaning relationships between words. |
| **Works well with small datasets**: Effective for basic tasks like text classification when the dataset is not too large. | **High Dimensionality**: For large vocabularies, it can create very large and sparse vectors, making computation and storage difficult. |
| **Useful for simple text representation**: Captures word frequencies effectively. | **Fails to handle synonyms**: Treats similar words (e.g., "happy" and "joyful") as different, missing the semantic meaning. |
| **Fast**: Computationally efficient compared to more complex models like embeddings. | **No semantics**: Ignores the meaning or context of the words, focusing only on their frequency. |
| **Flexible with basic preprocessing**: Can be combined with techniques like stopword removal or stemming to improve performance. | **Sensitive to irrelevant words**: Common or unimportant words can dominate the feature set if not removed properly. |
| **Works with most machine learning models**: Compatible with simple and advanced models like Naive Bayes, SVM, etc. | **Sparse Representation**: Most entries are zero, so the matrix wastes memory if stored densely and can slow down algorithms that do not exploit sparsity. |
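
The sparsity point in the table can be checked even on the toy corpus. The sketch below rebuilds the count matrix in plain Python (whitespace tokenization is an assumption) and measures how many cells are zero.

```python
corpus = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

vocabulary = sorted({word for doc in corpus for word in doc.split()})

# One row per document, one column per vocabulary word.
matrix = [[doc.split().count(word) for word in vocabulary] for doc in corpus]

total_cells = len(matrix) * len(vocabulary)
zero_cells = sum(row.count(0) for row in matrix)
sparsity = zero_cells / total_cells

print(f"{zero_cells} of {total_cells} cells are zero")
```

Even with only five vocabulary words, close to half of the cells are empty. With a real vocabulary each document uses only a tiny fraction of the columns, which is why `CountVectorizer` returns a sparse matrix rather than a dense array.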


### N-grams

In [19]:
df = pd.DataFrame({'text': ['people watch campusx', 'campusx watch campusx', 'people write comment', 'campusx write comment'], 'output':[1,1,0,0]})
df

|   | text | output |
|---|------|--------|
| 0 | people watch campusx | 1 |
| 1 | campusx watch campusx | 1 |
| 2 | people write comment | 0 |
| 3 | campusx write comment | 0 |


In [39]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1,2))

In [40]:
bow = cv.fit_transform(df['text'])

# vocab
print(cv.vocabulary_)

{'people': 4, 'watch': 7, 'campusx': 0, 'people watch': 5, 'watch campusx': 8, 'campusx watch': 1, 'write': 9, 'comment': 3, 'people write': 6, 'write comment': 10, 'campusx write': 2}


In [41]:
print(bow[0].toarray())
print(bow[1].toarray())

[[1 0 0 0 1 1 0 1 1 0 0]]
[[2 1 0 0 0 0 0 1 1 0 0]]
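
A short sketch of why bigrams help: two sentences with the same words in a different order look identical as unigrams but differ as bigrams. The `ngrams` helper and the second sentence are made up for this illustration, and tokenization is a plain whitespace split.

```python
def ngrams(text, n):
    """Return the n-word sequences in text, in order of appearance."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


a = "people watch campusx"
b = "campusx watch people"  # hypothetical sentence: same words, different order

# As unigrams the two sentences contain exactly the same items.
same_unigrams = sorted(ngrams(a, 1)) == sorted(ngrams(b, 1))
print("same unigrams:", same_unigrams)

# As bigrams they share nothing, so word order is partly preserved.
print(ngrams(a, 2))
print(ngrams(b, 2))
```

The cost is a larger vocabulary: in the cells above, `ngram_range=(1,2)` grew the feature count from 5 to 11 on the same four sentences.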


## TF-IDF

**TF-IDF (Term Frequency-Inverse Document Frequency)** is a technique used to determine the importance of a word in a document relative to a whole collection of documents (called the corpus). It helps highlight important words while reducing the impact of common words like "the" or "is."

### Formula:

1. **Term Frequency (TF)**: Measures how frequently a word $t$ appears in a document $d$.

   $$\text{TF}(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{number of words in } d}$$

2. **Inverse Document Frequency (IDF)**: Measures how rare a word is across the corpus. Rare words get a higher score.

   $$\text{IDF}(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing } t}\right)$$

3. **TF-IDF**: The final score combines both, giving importance to words that are frequent in a document but rare in the corpus.

   $$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$
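
The three formulas can be worked through by hand on the toy corpus. This is a minimal sketch under two assumptions: whitespace tokenization, and the natural logarithm (the formula does not fix a base). The helper names `tf`, `idf`, and `tf_idf` are chosen for this example.

```python
import math

corpus = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]
tokenized = [doc.split() for doc in corpus]


def tf(word, doc_words):
    """Share of the document's words that are this word."""
    return doc_words.count(word) / len(doc_words)


def idf(word):
    """Log of (total documents / documents containing the word)."""
    containing = sum(1 for doc_words in tokenized if word in doc_words)
    return math.log(len(tokenized) / containing)


def tf_idf(word, doc_words):
    return tf(word, doc_words) * idf(word)


doc = tokenized[1]  # "campusx watch campusx"
for word in ("campusx", "watch"):
    print(word, round(tf(word, doc), 3), round(idf(word), 3), round(tf_idf(word, doc), 3))
```

In this document "campusx" occurs twice and "watch" once, yet "watch" gets the higher TF-IDF score. "campusx" appears in three of the four documents, so its IDF is low enough to outweigh its higher term frequency.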

### Advantages:
- **Highlights important words**: Gives higher weight to words that are significant in a document but not common in the entire dataset.
- **Filters out common words**: Reduces the importance of frequently used words like "is," "the," etc.
- **Effective for text classification**: Works well for tasks like spam detection, document classification, and information retrieval.
  
### Disadvantages:
- **Ignores word context**: Only considers word frequency and doesn’t capture the meaning or position of the words.
- **Sparse vectors**: The result can be a large matrix with lots of zeros, which can be inefficient for large datasets.
- **Doesn’t handle synonyms**: Treats words with similar meanings (e.g., "good" and "great") as completely different.
- **No deep understanding**: Doesn't capture relationships between words, unlike more advanced techniques like word embeddings.

In short, TF-IDF helps identify important words in text but doesn't capture word meaning or relationships. It’s simple and useful but limited for more complex language tasks.

In [43]:
df

|   | text | output |
|---|------|--------|
| 0 | people watch campusx | 1 |
| 1 | campusx watch campusx | 1 |
| 2 | people write comment | 0 |
| 3 | campusx write comment | 0 |


In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf.fit_transform(df['text']).toarray()

array([[0.49681612, 0.        , 0.61366674, 0.61366674, 0.        ],
       [0.8508161 , 0.        , 0.        , 0.52546357, 0.        ],
       [0.        , 0.57735027, 0.57735027, 0.        , 0.57735027],
       [0.49681612, 0.61366674, 0.        , 0.        , 0.61366674]])

In [46]:
print(tfidf.idf_)
print(tfidf.get_feature_names_out())

[1.22314355 1.51082562 1.51082562 1.51082562 1.51082562]
['campusx' 'comment' 'people' 'watch' 'write']
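
The numbers printed above do not match the plain log(N / df) formula, because `TfidfVectorizer` with default settings uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, takes raw counts as TF, and then scales each row to unit L2 length. The sketch below reproduces those defaults in plain Python so the printed values can be checked by hand. Whitespace tokenization and the helper name `tfidf_row` are assumptions of this example.

```python
import math

corpus = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]
tokenized = [doc.split() for doc in corpus]
vocabulary = sorted({word for words in tokenized for word in words})
n_docs = len(tokenized)

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = []
for word in vocabulary:
    df = sum(1 for words in tokenized if word in words)
    idf.append(math.log((1 + n_docs) / (1 + df)) + 1)


def tfidf_row(words):
    """Raw count times IDF per vocabulary word, then L2-normalize the row."""
    raw = [words.count(word) * weight for word, weight in zip(vocabulary, idf)]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


print([round(value, 8) for value in idf])
print([round(value, 8) for value in tfidf_row(tokenized[0])])
print([round(value, 8) for value in tfidf_row(tokenized[1])])
```

"campusx" appears in three of the four documents, so its IDF is ln(5/4) + 1, about 1.2231, while every other word appears in two documents and gets ln(5/3) + 1, about 1.5108. These are the values in `tfidf.idf_`.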
