# Information Gain
**Information Gain (IG)** measures how useful a feature (such as a word) is for classifying data. It is commonly used for feature selection, decision tree construction, and building text models.
It quantifies how much the uncertainty (entropy) about an example's category decreases once we know the value of a feature (e.g., whether or not a word appears in a tweet).
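
To make the definition concrete, here is a minimal worked sketch on made-up data. It uses the usual binary-feature form, $IG(C, t) = H(C) - \left[P(t)\,H(C \mid t) + P(\neg t)\,H(C \mid \neg t)\right]$, where $H$ is Shannon entropy in bits. The eight labels and the word "great" are hypothetical and only serve to illustrate the arithmetic.

```python
from collections import Counter

import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())


# Hypothetical toy data: 8 tweets, 4 positive and 4 negative.
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
# Whether the (hypothetical) word "great" appears in each tweet.
has_word = [True, True, True, False, False, False, False, True]

# Split the labels by presence or absence of the word.
with_word = [y for y, w in zip(labels, has_word) if w]
without_word = [y for y, w in zip(labels, has_word) if not w]

# Entropy before knowing the feature, and weighted entropy after.
h_before = entropy(labels)
h_after = (len(with_word) / len(labels)) * entropy(with_word) \
    + (len(without_word) / len(labels)) * entropy(without_word)
info_gain = h_before - h_after

print(f"H(C)        = {h_before:.3f} bits")   # 1.000
print(f"H(C | word) = {h_after:.3f} bits")    # 0.811
print(f"IG          = {info_gain:.3f} bits")  # 0.189
```

Here the word separates the classes only partially (three of the four tweets containing it are positive), so knowing it removes about 0.19 of the original 1 bit of uncertainty. A word that appeared equally often in both classes would score 0.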

In [1]:
# 📌 This notebook assumes that corpus processing, tokenization, and BoW construction were already performed in the notebook:
# 👉 'feature-extraction/bag_of_words.ipynb'

# The variables used here (such as `BoW_tr`, `tr_txt`, `V1`, `dict_indices1`) were built there.
# To re-run the pipeline from scratch, check that file first.

> 🔗 **Note:** The corpus loading, tokenization, and Bag of Words construction are in
> [`bag_of_words.ipynb`](./feature-extraction/bag_of_words.ipynb)

In [2]:
from collections import Counter

import numpy as np


def entropy(sub_labels):
    """Shannon entropy (in bits) of a list of labels; 0 for an empty list."""
    n = len(sub_labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * np.log2(c / n) for c in Counter(sub_labels).values())


def compute_gain(BoW, labels, vocab):
    """Return {word: information gain} for every word in vocab."""
    n_total = BoW.shape[0]
    HC = entropy(labels)  # Entropy of the class distribution
    IGs = {}

    for term_idx, word in enumerate(vocab):
        docs = BoW[:, term_idx] > 0  # Documents where the word appears
        n_docs = ~docs  # Documents where the word does not appear
        docs_labels = [labels[i] for i in range(n_total) if docs[i]]
        n_docs_labels = [labels[i] for i in range(n_total) if n_docs[i]]

        # Conditional entropy: weighted average over both groups
        HC_t = (len(docs_labels) / n_total) * entropy(docs_labels) + \
               (len(n_docs_labels) / n_total) * entropy(n_docs_labels)
        IGs[word] = HC - HC_t

    return IGs

In [3]:
IG_scores = compute_gain(BoW_tr, tr_y, V1)
K = 50  # Number of top-scoring words to keep
top_words = dict(sorted(IG_scores.items(), key=lambda x: x[1], reverse=True)[:K])
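
Once words are ranked by IG, one possible next step is to keep only the matching columns of the Bag of Words matrix. The sketch below illustrates that idea on tiny stand-in data: `vocab`, `ig_scores`, and `bow` are hypothetical placeholders (the scores are invented, not computed from the corpus), and it assumes a dense NumPy matrix whose column order follows the vocabulary list.

```python
import numpy as np

# Hypothetical stand-ins for the notebook's vocabulary, scores, and BoW matrix.
vocab = ["good", "bad", "the", "movie"]
ig_scores = {"good": 0.31, "bad": 0.27, "the": 0.001, "movie": 0.02}
bow = np.array([[1, 0, 3, 1],
                [0, 2, 4, 0],
                [2, 0, 1, 1]])

k = 2
# Rank words by score, highest first, and keep the k best.
top = sorted(ig_scores, key=ig_scores.get, reverse=True)[:k]

# Map each selected word back to its column in the BoW matrix.
col_of = {word: j for j, word in enumerate(vocab)}
cols = [col_of[w] for w in top]

# Keep only the selected columns (documents stay as rows).
bow_reduced = bow[:, cols]

print(top)                # ['good', 'bad']
print(bow_reduced.shape)  # (3, 2)
```

The same column indices would need to be applied to any validation or test matrix so that all splits share one feature space.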