# Constraining label counts in 20 Newsgroups using LogisticRegression

<a target="_blank" href="https://colab.research.google.com/github/justinj-evans/predlp/blob/master/examples/multilabel_classification_imdb.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This example demonstrates how to impose a label-count constraint on a model's predictions using the Python package *predlp*.

In [1]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
from collections import Counter, OrderedDict
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

## Source Data

The dataset used in this tutorial is scikit-learn's '20newsgroups' text classification dataset. It is a multiclass dataset: each post belongs to exactly one of 20 newsgroups.

In [2]:
# load open-source dataset
newsgroups = fetch_20newsgroups(subset='test')

# create dataframe
data = pd.DataFrame({'text': newsgroups['data'], 'target': newsgroups['target']})

### Impose class imbalance

**What is a class imbalance?**
Class imbalance refers to a dataset in which the target classes are not uniformly distributed, often with a significant skew toward one or more classes. It makes classification harder because the skewed distribution affects sampling and statistical inference: the model sees fewer examples of the minority classes and tends to under-predict them.
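
As a rough illustration of how imbalance can be quantified, the sketch below computes each class's share and the ratio between the largest and smallest class. The label list is hypothetical; it is sized to resemble the three-class subset built later in this tutorial.

```python
from collections import Counter

# Hypothetical labels: two large classes and one smaller class
labels = [10] * 200 + [15] * 199 + [19] * 125

counts = Counter(labels)
total = sum(counts.values())

# Share of each class in the dataset
shares = {cls: n / total for cls, n in counts.items()}

# Imbalance ratio: size of the largest class divided by the smallest
imbalance_ratio = max(counts.values()) / min(counts.values())

print(shares)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio of 1.0 would mean perfectly balanced classes; larger values indicate a stronger skew.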

**Why impose a class imbalance?**
Here, we generate a realistic multiclass classification scenario to determine whether the package *predlp* can improve class representation by imposing label constraints.


In [3]:
# Create class imbalance
# Count the occurrences of each target class
class_counts = Counter(data['target'])

# Find the top two classes with the largest counts and the bottom class with the least count
top_two_classes = [item[0] for item in class_counts.most_common(2)]
bottom_class = min(class_counts, key=class_counts.get)

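# Combine the selected classes, sorted so their order matches the label order
# used later by scikit-learn (model.classes_ and classification_report)
selected_classes = sorted(top_two_classes + [bottom_class])

# Filter the dataframe to include only the selected classes
data_class_imbalanced = data[data['target'].isin(selected_classes)]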

# Keep class names for visualizing results
selected_class_names = [f"{[class_idx]}-{newsgroups['target_names'][class_idx]}" for class_idx in selected_classes]

## Generate Train/Test splits

**What is stratified sampling?**
Stratified sampling ensures that train and test splits preserve the proportion of key variables (e.g., target classes), reducing sampling variability and improving representativeness. It assumes the stratification variable is relevant, strata are well-defined and sufficiently large, and the data distribution reflects what the model will encounter during deployment.

**Why use stratified sampling?**
In an ideal modeling scenario, if your train/test split is 50:50 and you stratify by target classes, the model should predict class counts proportional to those it was trained on. For instance, if the training dataset contains 100 examples of "dogs" and 10 examples of "cats," the model, when tested on a dataset with an identical class makeup, should ideally predict the same proportions.
    
However, no model is perfect. Under class imbalance, a classifier tends to favor the majority classes, especially when a minority class is semantically similar to a majority class.
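
To make the "dogs" and "cats" example concrete, here is a minimal sketch on toy data (the lists `X` and `y` are made up for illustration). With a 50:50 split stratified on the labels, both halves should end up with the same class counts.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 200 "dog" and 20 "cat" examples (hypothetical)
y = ["dog"] * 200 + ["cat"] * 20
X = list(range(len(y)))

# 50:50 split, stratified on the labels
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

print(f"Train: {dict(Counter(y_tr))}")
print(f"Test:  {dict(Counter(y_te))}")
```

Each half should contain 100 "dog" and 10 "cat" examples, so a model that predicted perfectly on the test half would reproduce the training proportions.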

In [4]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    data_class_imbalanced['text'],
    data_class_imbalanced['target'],
    test_size=0.5,
    random_state=42,
    stratify=data_class_imbalanced['target'],
)

In [5]:
# Demonstrate that train/test datasets have near-identical class counts
print(f"Train Class Counts: {dict(OrderedDict(sorted(Counter(y_train).items())))}")
print(f"Test Class Counts: {dict(OrderedDict(sorted(Counter(y_test).items())))}")

Train Class Counts: {10: 199, 15: 199, 19: 126}
Test Class Counts: {10: 200, 15: 199, 19: 125}


## Modeling

In [6]:
# Vectorize the text
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Model: Logistic Regression with OvR Strategy
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
model.fit(X_train_vec, y_train)

# Predictions with Probabilities
y_pred_proba = model.predict_proba(X_test_vec)
y_pred = model.predict(X_test_vec)

## Predictions

Here we see that class [19]-talk.religion.misc drops from an actual class count of 125 (24% of records) in the test dataset to a predicted class count of 67 (13% of records).

In [7]:
test_label_counts = dict(Counter(y_test))
pred_label_counts = dict(OrderedDict(sorted(Counter(y_pred).items())))

print(f"Test Class Counts: {test_label_counts}")
print(f"Predicted Test Class Counts: {pred_label_counts}")

Test Class Counts: {10: 200, 15: 199, 19: 125}
Predicted Test Class Counts: {10: 223, 15: 234, 19: 67}


In [8]:
print(classification_report(list(y_test), list(y_pred), target_names=selected_class_names))

                             precision    recall  f1-score   support

      [10]-rec.sport.hockey       0.88      0.98      0.93       200
[15]-soc.religion.christian       0.82      0.96      0.89       199
    [19]-talk.religion.misc       0.96      0.51      0.67       125

                   accuracy                           0.86       524
                  macro avg       0.89      0.82      0.83       524
               weighted avg       0.88      0.86      0.85       524



While our classifier achieves high precision (0.96) for class [19]-talk.religion.misc, it misses nearly half of that class's actual records (recall of 0.51).
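
The relationship between these two numbers can be checked with a little arithmetic. The sketch below takes the predicted count (67) and the actual count (125) for class 19 from the outputs above and searches for the number of true positives that reproduces both rounded values in the report. The variable names are illustrative, and the true-positive count is inferred from the rounded figures rather than read from the model.

```python
predicted = 67   # records predicted as class 19 (from the predicted counts above)
actual = 125     # records that truly belong to class 19 (support in the report)

# precision = TP / predicted, recall = TP / actual
# Keep every TP value that matches the report after rounding to two decimals
candidates = [
    tp for tp in range(min(predicted, actual) + 1)
    if round(tp / predicted, 2) == 0.96 and round(tp / actual, 2) == 0.51
]
print(candidates)

true_positives = candidates[0]
false_positives = predicted - true_positives  # predicted as 19 but wrong
false_negatives = actual - true_positives     # class 19 records that were missed
print(false_positives, false_negatives)
```

Only one true-positive count fits both figures, which means very few of the class 19 predictions are wrong, while a large share of the true class 19 records are assigned to other classes.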

## Label constraints using the package *predlp*

Here, we use *predlp* to enforce predefined label constraints on our model's predictions. We pass in the model's predicted probabilities for each example and label (pred_probs) along with our label constraint (label_counts). The package then uses linear programming to maximize the model's cumulative confidence score subject to those constraints.
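
To give a feel for the kind of optimization described here, the following is a minimal sketch using `scipy.optimize.linprog`. It is not the *predlp* implementation: the helper `assign_with_label_counts` and the toy probabilities are hypothetical, and SciPy is assumed as the solver. Each example gets exactly one label, each class is used a fixed number of times, and the summed probability of the chosen labels is maximized.

```python
import numpy as np
from scipy.optimize import linprog


def assign_with_label_counts(pred_probs, label_counts):
    """Hypothetical helper (not part of predlp): choose one class per row so
    that class j is chosen exactly label_counts[j] times and the summed
    probability of the chosen classes is as large as possible."""
    n, k = pred_probs.shape

    # linprog minimizes, so negate the probabilities to maximize their sum
    c = -pred_probs.ravel()

    # Each example receives exactly one label: sum_j x[i, j] == 1
    one_label_each = np.kron(np.eye(n), np.ones((1, k)))
    # Each class is used a fixed number of times: sum_i x[i, j] == count_j
    per_class_totals = np.kron(np.ones((1, n)), np.eye(k))

    A_eq = np.vstack([one_label_each, per_class_totals])
    b_eq = np.concatenate([np.ones(n), np.asarray(label_counts, dtype=float)])

    result = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    if not result.success:
        raise ValueError(result.message)

    # Decode the (expected 0/1) assignment matrix into one label per row
    labels = result.x.reshape(n, k).argmax(axis=1)
    return labels, -result.fun


# Toy probabilities for 4 examples and 2 classes (made up for illustration)
pred_probs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.6, 0.4],
    [0.3, 0.7],
])

unconstrained = pred_probs.argmax(axis=1)
labels, score = assign_with_label_counts(pred_probs, label_counts=[2, 2])
print(unconstrained, labels, round(score, 2))
```

A plain argmax assigns three examples to class 0. Requiring two examples per class moves the least confident class 0 prediction (the third row) to class 1, which is the cheapest change in total confidence. Note that the counts must sum to the number of examples, otherwise the problem has no feasible solution.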

**Assumptions**  
Is it realistic to know the exact class distribution of new data (e.g., the test dataset) in advance? No. However, practitioners may estimate a plausible range of distributions based on prior experience or through statistical inference. These ranges can serve as constraints to guide model predictions. In this tutorial, we pass the true test-set counts to illustrate the mechanism.
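
When only a plausible range is known, the same idea can be written with lower and upper bounds instead of exact counts. The sketch below is an assumption-laden illustration built on `scipy.optimize.linprog`; the helper `assign_with_count_ranges` and its parameters `min_counts` and `max_counts` are hypothetical and are not part of the *predlp* API shown in this tutorial.

```python
import numpy as np
from scipy.optimize import linprog


def assign_with_count_ranges(pred_probs, min_counts, max_counts):
    """Hypothetical helper: choose one class per row so that the number of
    rows assigned to class j lies between min_counts[j] and max_counts[j]."""
    n, k = pred_probs.shape
    c = -pred_probs.ravel()  # negate because linprog minimizes

    # Each example receives exactly one label
    one_label_each = np.kron(np.eye(n), np.ones((1, k)))
    # Row j sums the assignments to class j over all examples
    per_class_totals = np.kron(np.ones((1, n)), np.eye(k))

    # totals <= max_counts   and   -totals <= -min_counts
    A_ub = np.vstack([per_class_totals, -per_class_totals])
    b_ub = np.concatenate([
        np.asarray(max_counts, dtype=float),
        -np.asarray(min_counts, dtype=float),
    ])

    result = linprog(
        c, A_ub=A_ub, b_ub=b_ub,
        A_eq=one_label_each, b_eq=np.ones(n),
        bounds=(0, 1), method="highs",
    )
    if not result.success:
        raise ValueError(result.message)
    return result.x.reshape(n, k).argmax(axis=1)


# Toy probabilities for 4 examples and 2 classes (made up for illustration)
pred_probs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.6, 0.4],
    [0.3, 0.7],
])

# Class 0 may receive 1 to 2 examples, class 1 may receive 2 to 3
range_labels = assign_with_count_ranges(
    pred_probs, min_counts=[1, 2], max_counts=[2, 3]
)
print(range_labels)
```

Wider ranges leave the model's own predictions mostly untouched, while narrower ranges behave more like the exact-count constraint used in the next cell.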

In [9]:
from predlp.solver import pred_prob_lp
pred_after_lp = pred_prob_lp(class_names=selected_classes, label_counts=test_label_counts, pred_probs=y_pred_proba)
pred_after_lp_label_counts = Counter(pred_after_lp)

Optimal Score: 318.29752711365853


In [10]:
# Demonstrate that the constraint was satisfied
print(f"Test Class Counts: {dict(test_label_counts)}")
print(f"Predicted Test Class Counts: {dict(pred_after_lp_label_counts)}")

Test Class Counts: {10: 200, 15: 199, 19: 125}
Predicted Test Class Counts: {10: 200, 15: 199, 19: 125}


In [11]:
print(classification_report(list(y_test), list(pred_after_lp), target_names=selected_class_names))

                             precision    recall  f1-score   support

      [10]-rec.sport.hockey       0.98      0.98      0.98       200
[15]-soc.religion.christian       0.91      0.91      0.91       199
    [19]-talk.religion.misc       0.85      0.85      0.85       125

                   accuracy                           0.92       524
                  macro avg       0.91      0.91      0.91       524
               weighted avg       0.92      0.92      0.92       524



Here, we see that *predlp* improved recall for our minority class from 0.51 to 0.85 and produced a higher f1-score for all three classes.