# Detecting Issues in a Text Dataset with Cleanlab

In this example, we will use Cleanlab to detect various issues in an intent classification dataset composed of (text) customer service requests at an online bank. The data is a subset of the [Banking77-OOS Dataset](https://arxiv.org/abs/2106.04564) containing 1000 customer service requests, which are classified into 7 categories based on their intent.

[`Cleanlab`](https://github.com/cleanlab/cleanlab) automatically identifies bad examples in our dataset, including mislabelled data, out-of-scope examples (outliers), or otherwise ambiguous examples.

In this example, we will
* use a pretrained transformer model to extract text embeddings from the customer service requests
* train a simple logistic regression model on the text embeddings to compute out-of-sample predicted probabilities
* run cleanlab's `Datalab` audit with these predictions and embeddings to identify problems such as label issues, outliers, and near duplicates in the dataset

## Setup

In [1]:
!pip install -qU "cleanlab[datalab]" scikit-learn transformers sentence-transformers datasets

In [2]:
import random

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

from cleanlab import Datalab


pd.set_option('display.max_colwidth', None)
SEED = 111
np.random.seed(SEED)
random.seed(SEED)

## Load and format the text dataset

In [4]:
from datasets import load_dataset

dataset = load_dataset('PolyAI/banking77', split='train')
data = pd.DataFrame(dataset[:1000])

In [5]:
data.head(3)

Unnamed: 0,text,label
0,I am still waiting on my card?,11
1,What can I do if my card still hasn't arrived after 2 weeks?,11
2,I have been waiting over a week. Is the card still coming?,11


In [6]:
raw_texts, labels = data['text'].values, data['label'].values
num_classes = len(set(labels))

print(f"Number of classes: {num_classes}")
print(f"Classes: {set(labels)}")

Number of classes: 7
Classes: {32, 34, 36, 11, 13, 46, 17}


In [7]:
# check an example
idx = 1
print(f"Example label: {labels[idx]}")
print(f"Example text: {raw_texts[idx]}")

Example label: 11
Example text: What can I do if my card still hasn't arrived after 2 weeks?


The data is stored as two numpy arrays:
* `raw_texts` stores the customer service request utterances in text format
* `labels` stores the intent categories (labels) for each example

Next, we will convert the text strings into vectors better suited as inputs for our ML models. We will use numeric representations from a pretrained transformer model as embeddings of our text. The `sentence_transformers` library offers simple methods to compute these embeddings for text data.

We will use the pretrained `electra-small-discriminator` model and run our data through the network to extract a vector embedding of each example.

In [8]:
transformer = SentenceTransformer('google/electra-small-discriminator')
text_embeddings = transformer.encode(raw_texts)



In [9]:
text_embeddings, text_embeddings.shape

(array([[ 2.09494591e-01,  7.98101872e-02, -3.06025207e-01, ...,
          1.15906417e-01, -2.59357214e-01,  1.80690259e-01],
        [ 6.37911409e-02,  1.91813529e-01, -1.88507065e-01, ...,
         -1.49904087e-01, -2.88031548e-01,  1.59516186e-01],
        [ 1.42000288e-01,  1.95943072e-01, -2.18498647e-01, ...,
          4.84950840e-04, -2.42234126e-01,  1.24536991e-01],
        ...,
        [-1.77763790e-01,  2.53192842e-01,  4.67635952e-02, ...,
          3.17841917e-02, -7.75132328e-02,  2.68665910e-01],
        [ 1.43944606e-01,  1.55030815e-02,  7.09013939e-02, ...,
         -4.07675654e-02, -2.75613487e-01,  4.77761924e-01],
        [ 4.94463652e-01,  1.94700360e-01,  9.85107049e-02, ...,
          3.13025236e-01, -1.22599499e-02, -3.45590383e-01]], dtype=float32),
 (1000, 256))
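
To give a rough idea of how a variable-length sentence becomes one fixed-length vector, here is a minimal sketch of masked mean pooling over token vectors. It assumes that `sentence_transformers` falls back to mean pooling when given a plain Hugging Face checkpoint such as this one; the helper `mean_pool` and the toy arrays are hypothetical and are not part of the library.

```python
import numpy as np

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors of each sentence, ignoring padded positions.

    Hypothetical helper for illustration only (not a sentence_transformers API).
    """
    mask = attention_mask[..., None].astype(token_vectors.dtype)  # (batch, tokens, 1)
    summed = (token_vectors * mask).sum(axis=1)                   # sum over real tokens
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                # avoid division by zero
    return summed / counts

# Toy batch: 2 sentences, up to 3 tokens each, 2-dimensional token vectors.
toy_tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
toy_mask = np.array([[1, 1, 0], [1, 1, 1]])

toy_embeddings = mean_pool(toy_tokens, toy_mask)
print(toy_embeddings.shape)  # one fixed-length vector per sentence
```

In the real pipeline the token vectors come from the transformer itself, and each sentence ends up as a 256-dimensional vector, as the shape above shows.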

## Define a classification model and compute out-of-sample predicted probabilities

A typical way to leverage pretrained networks for a particular classification task is to add a linear output layer and fine-tune the network parameters on the new data, but this process can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and train only the output layer, without having to rely on GPU(s). In this example, we take the latter approach by fitting a scikit-learn linear model on top of the extracted embeddings.

To identify label issues, cleanlab requires a probabilistic prediction from our model for each datapoint. However, these predictions will be overfit (and thus unreliable) for datapoints the model was previously trained on.

cleanlab is intended to be used only with **out-of-sample** predicted class probabilities, i.e., on datapoints held out from the model during training.

We will obtain out-of-sample predicted class probabilities for every example in our dataset using a logistic regression model with cross-validation. Make sure that the columns of our `pred_probs` are properly ordered with respect to the ordering of classes, which for Datalab is **lexicographically sorted by class name**.

In [10]:
model = LogisticRegression(max_iter=400)

pred_probs = cross_val_predict(
    model,
    text_embeddings,
    labels,
    method='predict_proba'
)
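
As a sanity check on the column ordering mentioned above, the following sketch runs the same cross-validation recipe on synthetic data whose label ids are deliberately non-contiguous. The toy features and labels are made up for illustration, and `cv=5` simply spells out scikit-learn's default. The point is that column `j` of the predicted probabilities corresponds to the `j`-th entry of the sorted unique labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-ins for the embeddings and labels (illustrative only).
toy_labels = np.repeat([32, 11, 46], 20)  # label ids in unsorted order
toy_features = rng.normal(size=(60, 4)) + toy_labels[:, None] / 10.0

# Each row is predicted by a model that never saw that row during training.
toy_pred_probs = cross_val_predict(
    LogisticRegression(max_iter=400),
    toy_features,
    toy_labels,
    cv=5,
    method='predict_proba',
)

# Columns follow the sorted class ids, regardless of the order in toy_labels.
sorted_classes = np.unique(toy_labels)
predicted_labels = sorted_classes[toy_pred_probs.argmax(axis=1)]

print(sorted_classes)
print(toy_pred_probs.shape)
```

If the labels were strings instead of integers, the same idea would apply with the class names sorted lexicographically.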

## Use cleanlab to find issues in our dataset

Given feature embeddings and the (out-of-sample) predicted class probabilities obtained from any model we have, cleanlab can quickly help us identify low-quality examples in our dataset.

We will use cleanlab's `Datalab` to find issues in our data. `Datalab` offers several ways of loading the data; we will simply wrap the training features and noisy labels in a dictionary.

In [11]:
data_dict = {'texts': raw_texts, 'labels': labels}

To audit our data, we just call `find_issues()`. We pass the predicted probabilities and the feature embeddings obtained above, although not all of this information is required, depending on which types of issues we are interested in. The more inputs we provide, the more types of issues `Datalab` can detect in our data. Using a better model to produce these inputs helps cleanlab estimate issues more accurately.

In [12]:
lab = Datalab(data_dict, label_name='labels')

lab.find_issues(
    pred_probs=pred_probs,
    features=text_embeddings,
)

Finding null issues ...
Finding label issues ...
Finding outlier issues ...
Finding near_duplicate issues ...
Finding non_iid issues ...
Finding class_imbalance issues ...
Finding underperforming_group issues ...

Audit complete. 62 issues found in the dataset.


After the audit is complete, we can review the findings using the `report()` method:

In [13]:
lab.report()

Dataset Information: num_examples: 1000, num_classes: 7

Here is a summary of various issues found in your data:

    issue_type  num_issues
       outlier          37
near_duplicate          14
         label          10
       non_iid           1

Learn about each issue: https://docs.cleanlab.ai/stable/cleanlab/datalab/guide/issue_type_description.html
See which examples in your dataset exhibit each issue via: `datalab.get_issues(<ISSUE_NAME>)`

Data indices corresponding to top examples of each issue are shown below.


---------------------- outlier issues ----------------------

About this issue:
	Examples that are very different from the rest of the dataset 
    (i.e. potentially out-of-distribution or rare/anomalous instances).
    

Number of examples with this issue: 37
Overall dataset quality in terms of this issue: 0.3671

Examples representing most severe instances of this issue:
     is_outlier_issue  outlier_score
791              True       0.024866
601              True 

### Label issues

We can see which examples are flagged as likely mislabeled and the label quality score for each example using the `get_issues()` method, specifying `label` to focus on label issues in the data.

In [14]:
label_issues = lab.get_issues('label')
label_issues.head()

Unnamed: 0,is_label_issue,label_score,given_label,predicted_label
0,False,0.903995,11,11
1,False,0.860838,11,11
2,False,0.659735,11,11
3,False,0.697566,11,11
4,False,0.437183,11,11


This returns a dataframe containing a label quality score for each example. These numeric scores lie between 0 and 1, where lower scores indicate examples more likely to be mislabeled. The dataframe also contains a boolean column specifying whether or not each example is identified to have a label issue (indicating it is likely mislabeled).
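
To build intuition for such a score, here is a minimal sketch of one simple scheme, often called self-confidence: the score of an example is the predicted probability that the model assigns to its given label. We assume this only as an illustration; the exact scoring that `Datalab` applies may differ, and all arrays below are made up.

```python
import numpy as np

# Toy out-of-sample predicted probabilities; columns follow the sorted class ids.
toy_classes = np.array([11, 32, 46])
toy_pred_probs = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
])
toy_given_labels = np.array([11, 32, 11])  # the third label looks suspicious

# Map each given label to its column, then read off the probability of that label.
columns = np.searchsorted(toy_classes, toy_given_labels)
toy_label_scores = toy_pred_probs[np.arange(len(toy_given_labels)), columns]
toy_predicted_labels = toy_classes[toy_pred_probs.argmax(axis=1)]

print(toy_label_scores)      # [0.9 0.7 0.1]
print(toy_predicted_labels)  # [11 32 46]
```

The third example gets the lowest score because the model puts most of its probability on class 46 rather than on the given label 11.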

We can get the subset of examples flagged with label issues, and also sort by label quality score to find the indices of the 10 most likely mislabeled examples in our dataset:

In [15]:
# examples flagged with label issues
identified_label_issues = label_issues[label_issues['is_label_issue']]
# positions of the 10 most likely mislabeled examples (lowest label quality scores)
lowest_quality_labels = label_issues['label_score'].argsort()[:10].to_numpy()

print(
    f"cleanlab found {len(identified_label_issues)} potential label errors in the dataset.\n"
    f"Here are indices of the top 10 most likely errors:\n{lowest_quality_labels}"
)

cleanlab found 10 potential label errors in the dataset.
Here are indices of the top 10 most likely errors:
[379 100 300 485 159 413 444  35  93 774]


In [16]:
data_with_suggested_labels = pd.DataFrame(
    {
        'text': raw_texts,
        'given_label': labels,
        'suggested_label': label_issues['predicted_label']
    }
)

data_with_suggested_labels.iloc[lowest_quality_labels]

Unnamed: 0,text,given_label,suggested_label
379,Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from?,32,11
100,can you share card tracking number?,11,36
300,"If I need to cash foreign transfers, how does that work?",32,46
485,Was I charged more than I should of been for a currency exchange?,17,34
159,Is there any way to see my card in the app?,13,11
413,I was charged the wing amount for an item.,17,34
444,Why is your exchange rate so bad?,17,32
35,Can I get an update on my replacement card?,11,13
93,WHAT IS THE SOLUTION OF THIS PROBLEM,11,32
774,Do you know when the withdrawal will show?,46,11


### Outlier issues

We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via `get_issues()`. We will sort the resulting dataframe by cleanlab's outlier quality score to see the most severe outliers in our dataset.

In [17]:
outlier_issues = lab.get_issues('outlier')
outlier_issues.sort_values('outlier_score').head()

Unnamed: 0,is_outlier_issue,outlier_score
791,True,0.024866
601,True,0.031162
863,True,0.060738
355,True,0.064199
157,True,0.065075
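
One way to think about such a score is in terms of distances in embedding space: an example that lies far from its nearest neighbors is atypical. The sketch below is only an illustration under that assumption. The function `knn_outlier_scores`, the choice `k=3`, and the exponential rescaling are hypothetical and are not claimed to match cleanlab's exact formula.

```python
import numpy as np

def knn_outlier_scores(features, k=3):
    """Score in [0, 1]; lower means farther from the k nearest neighbors.

    Hypothetical helper for illustration only.
    """
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # pairwise Euclidean distances
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbor
    knn_dists = np.sort(dists, axis=1)[:, :k]    # k smallest distances per row
    avg_dist = knn_dists.mean(axis=1)
    # Rescale so that large average distances map to scores near 0.
    return np.exp(-avg_dist / np.median(avg_dist))

rng = np.random.default_rng(0)
toy_cluster = rng.normal(size=(20, 2))           # typical examples
toy_far_point = np.array([[10.0, 10.0]])         # one atypical example
toy_features = np.vstack([toy_cluster, toy_far_point])

toy_scores = knn_outlier_scores(toy_features)
print(int(np.argmin(toy_scores)))  # 20, the index of the far-away point
```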


In [18]:
lowest_quality_outliers = outlier_issues['outlier_score'].argsort()[:10]

data.iloc[lowest_quality_outliers]

Unnamed: 0,text,label
791,withdrawal pending meaning?,46
601,$1 charge in transaction.,34
863,My atm withdraw is stillpending,46
355,explain the interbank exchange rate,32
157,"lost card found, want to put it back in app",13
93,WHAT IS THE SOLUTION OF THIS PROBLEM,11
357,how do you get exchange rates,32
804,When will my cash withdrawal show/,46
303,What are you exchange rates?,32
122,i havent got my card,11


Outliers in the dataset appear to be out-of-scope customer requests and other text that is not meaningful for intent classification. Carefully consider whether such outliers may detrimentally affect our data modeling, and consider removing them from the dataset if so.
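
If we do decide to drop flagged outliers, a boolean mask is enough. The sketch below uses small made-up DataFrames in place of `data` and `outlier_issues` (the texts, flags, and scores are invented), and it assumes both frames share the same row order.

```python
import pandas as pd

# Toy stand-ins for `data` and `outlier_issues` (values are invented).
toy_data = pd.DataFrame({
    'text': ['where is my card', 'asdf qwerty', 'card has not arrived yet'],
    'label': [11, 11, 11],
})
toy_outlier_issues = pd.DataFrame({
    'is_outlier_issue': [False, True, False],
    'outlier_score': [0.40, 0.07, 0.85],
})

# Keep only the rows that were not flagged, then renumber the index.
keep_mask = ~toy_outlier_issues['is_outlier_issue'].to_numpy()
toy_cleaned = toy_data[keep_mask].reset_index(drop=True)

print(toy_cleaned)
```

Any arrays derived from the data, such as embeddings or predicted probabilities, would need to be filtered with the same mask to stay aligned.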

### Near-duplicate issues

We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via `get_issues()`. We will sort the resulting DataFrame by cleanlab's near-duplicate quality score to see the text examples in our dataset that are most nearly duplicated.

In [20]:
duplicate_issues = lab.get_issues('near_duplicate')
duplicate_issues.sort_values('near_duplicate_score').head()

Unnamed: 0,is_near_duplicate_issue,near_duplicate_score,near_duplicate_sets,distance_to_nearest_neighbor
459,True,0.009548,[429],0.000566
429,True,0.009548,[459],0.000566
501,True,0.046048,"[412, 517]",0.002781
412,True,0.046048,[501],0.002781
698,True,0.054625,[607],0.003314


In [22]:
data.iloc[[459, 429]]

Unnamed: 0,text,label
459,I purchased something abroad and the incorrect exchange rate was applied.,17
429,I purchased something overseas and the incorrect exchange rate was applied.,17


In [23]:
data.iloc[[501, 412]]

Unnamed: 0,text,label
501,The exchange rate you are using is really bad.This can't be the official interbank exchange rate.,17
412,The exchange rate you are using is bad.This can't be the official interbank exchange rate.,17


In [24]:
data.iloc[[501, 412, 517]]

Unnamed: 0,text,label
501,The exchange rate you are using is really bad.This can't be the official interbank exchange rate.,17
412,The exchange rate you are using is bad.This can't be the official interbank exchange rate.,17
517,The exchange rate you are using is really bad. This can't possibly be the official interbank exchange rate.,17
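
The `distance_to_nearest_neighbor` column suggests a simple mental model: two examples are near duplicates when their embeddings almost coincide. The sketch below illustrates this with cosine distance on toy vectors. The helper name, the use of cosine distance, and the `threshold` value are assumptions made for illustration, not cleanlab's actual settings.

```python
import numpy as np

def nearest_neighbor_cosine_distance(features):
    """Return the index of each row's nearest neighbor and the distance to it.

    Hypothetical helper for illustration only.
    """
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    dist = 1.0 - normed @ normed.T      # cosine distance between all pairs
    np.fill_diagonal(dist, np.inf)      # exclude self-matches
    nearest = dist.argmin(axis=1)
    return nearest, dist[np.arange(len(features)), nearest]

toy_features = np.array([
    [1.0, 0.0],
    [0.999, 0.01],   # almost identical direction to row 0
    [0.0, 1.0],
    [-1.0, 0.2],
])

nearest, distances = nearest_neighbor_cosine_distance(toy_features)
threshold = 0.01                        # hypothetical cutoff
is_near_duplicate = distances < threshold

print(nearest[:2])         # rows 0 and 1 point at each other
print(is_near_duplicate)   # [ True  True False False]
```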


### Non-IID issues (data drift)

The overall non-IID (independent and identically distributed) score for the dataset corresponds to the `p-value` of a statistical test for whether the ordering of samples in the dataset appears related to the similarity between their feature values.

A low `p-value` strongly suggests that the dataset violates the IID assumption, which is a key assumption required for conclusions (models) produced from the dataset to generalize to a larger population.

In [25]:
p_value = lab.get_info('non_iid')['p-value']
p_value

0.0

Here, our dataset was flagged as non-IID because the rows happened to be sorted by class label in the original data. This may be benign if we remember to shuffle rows before model training and data splitting.
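
A minimal sketch of such a shuffle is shown below, using toy arrays in place of the real texts, labels, and embeddings. The key point is to apply one shared permutation to every array so that rows stay aligned. The variable names are illustrative.

```python
import numpy as np

SEED = 111
rng = np.random.default_rng(SEED)

# Toy stand-ins, sorted by class label like the original data.
toy_texts = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
toy_labels = np.array([11, 11, 13, 13, 17, 17])
toy_embeddings = np.arange(12, dtype=float).reshape(6, 2)

# Draw one permutation and reuse it for every array.
permutation = rng.permutation(len(toy_labels))
shuffled_texts = toy_texts[permutation]
shuffled_labels = toy_labels[permutation]
shuffled_embeddings = toy_embeddings[permutation]

print(shuffled_labels)
```

Shuffling each array independently would break the correspondence between a text, its label, and its embedding.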