In [None]:
import sys
!{sys.executable} -m pip install -U deepchecks[nlp]
# !{sys.executable} -m pip install -U deepchecks[nlp-properties]  # Optional - used to compute the more advanced properties

In [None]:
import pandas as pd
import numpy as np
from deepchecks.nlp.text_data import TextData

In this notebook we will go over:
1. Creating a TextData object and auto calculating properties
2. Running the built-in suites
3. Check spotlight - Embeddings drift and Under-Annotated Segments

# Setting up

## Load data 

In this tutorial we will use the tweet emotion dataset, containing tweets and metadata on the users who wrote them. </br>
Our goal will be to analyze a model that, given a tweet, classifies its emotion into one of four categories: 'happiness', 'anger', 'optimism' and 'sadness'.

In [None]:
from deepchecks.nlp.datasets.classification import tweet_emotion

train, test = tweet_emotion.load_data(data_format='DataFrame')
train.head(3)

We explicitly define our (sorted) classes so that all checks know what classes to expect. This is optional.

In [None]:
model_classes = ['anger', 'happiness', 'optimism', 'sadness']

## Create TextData Objects (A Deepchecks' Artifact)

Deepchecks' TextData object contains the text samples, labels and possibly also properties and metadata. </br>
It stores a cache to save time between repeated computations and contains functionality for input validation and sampling.

In [None]:
train = TextData(train.text, label=train['label'], task_type='text_classification',
                 metadata=train.drop(columns=['label', 'text']))
test = TextData(test.text, label=test['label'], task_type='text_classification',
                metadata=test.drop(columns=['label', 'text']))

## Calculating Properties

Some of Deepchecks' checks use properties of the text samples for various calculations. </br>
Deepchecks has a wide variety of such properties, some simple and some that rely on external models and are heavier to run. </br>
In order for Deepchecks' checks to be able to access the properties, they must be stored within the TextData object.
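
To make the idea of a "property" concrete, here is a minimal sketch that computes a few simple per-sample values with pandas. The helper name `simple_properties` and the column names are assumptions made for illustration, not Deepchecks' built-in property definitions. The point is only the shape: one row per text sample, one column per property.

```python
import pandas as pd

def simple_properties(texts):
    """Hypothetical helper: one row per sample, one column per property."""
    rows = []
    for text in texts:
        words = text.split()
        # Count characters that are neither alphanumeric nor whitespace.
        special = sum(not (c.isalnum() or c.isspace()) for c in text)
        rows.append({
            "Text Length": len(text),
            "Word Count": len(words),
            "Special Char Ratio": special / max(len(text), 1),
        })
    return pd.DataFrame(rows)

props = simple_properties(["I love this!", "so sad :("])
print(props)
```

A frame of this kind, with rows in the same order as the text samples, is what the `set_properties` call in the next cell receives from the CSV files.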

In [None]:
# Properties can either be calculated directly by Deepchecks or imported from other sources in an appropriate format

# from torch import device
# train.calculate_default_properties(include_long_calculation_properties=True, device=device('mps'))
# test.calculate_default_properties(include_long_calculation_properties=True,  device=device('mps'))

train.set_properties(pd.read_csv('train_properties.csv'), categorical_properties=['Language'])
test.set_properties(pd.read_csv('test_properties.csv'), categorical_properties=['Language'])

train.properties.head(2)

### Add some missing labels

In [None]:
test_copy = test.copy()

In [None]:
np.random.seed(42)
idx_to_fillna = np.random.choice(range(len(test)), int(len(test) * 0.05), replace=False)
test_copy._label = test_copy._label.astype(dtype=object)
test_copy._label[idx_to_fillna] = None

In [None]:
under_annotated_segment_idx = test_copy.properties[(test_copy.properties.Fluency < 0.4) & (test_copy.properties.Formality < 0.2)].index

In [None]:
under_annotated_segment_idx

In [None]:
np.random.seed(42)
idx_to_fillna = np.random.choice(under_annotated_segment_idx, int(len(under_annotated_segment_idx) * 0.4), replace=False)
test_copy._label[idx_to_fillna] = None
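
The cells above remove labels in two passes: 5% of all samples at random, then a further 40% inside one low-fluency, low-formality segment. The toy sketch below repeats that pattern on a synthetic array to show why the second pass makes one segment stand out. The sizes and the choice of segment are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
labels = np.array(["a"] * n, dtype=object)   # synthetic labels, all present
segment_idx = np.arange(100)                 # pretend the first 100 samples form the segment

# Pass 1: drop 5% of the labels uniformly at random.
drop_all = rng.choice(n, int(n * 0.05), replace=False)
labels[drop_all] = None

# Pass 2: drop a further 40% of the labels inside the segment only.
drop_seg = rng.choice(segment_idx, int(len(segment_idx) * 0.4), replace=False)
labels[drop_seg] = None

missing = np.array([label is None for label in labels])
print("overall missing ratio:", missing.mean())
print("segment missing ratio:", missing[segment_idx].mean())
```

The segment ends up with a far higher share of missing labels than the rest of the data, which is exactly the pattern the under-annotation check later in this notebook is meant to surface.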

# Running the deepchecks default suites

## Data Integrity

We will start by running a preliminary integrity check to validate the text formatting. </br>
It is recommended to do this step before model training, as it may reveal that additional data engineering is required. </br>

We'll do that using the data_integrity pre-built suite.

In [None]:
from deepchecks.nlp.suites import data_integrity

In [None]:
data_integrity().run(train)

In [None]:
data_integrity().run(test)

### Integrity #1: Unknown Tokens

First up (in the "Didn't Pass" tab) we see that the Unknown Tokens check has returned a problem.

Looking at the result, we can see that it assumed (by default) that we're going to use the bert-base-uncased tokenizer for our NLP model, and that if that's the case there are many words in the dataset that contain characters (such as emojis or Korean characters) that are unrecognized by the tokenizer. This is an important insight, as BERT tokenizers are very common.
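
As a rough illustration of what "unknown" means here, the sketch below flags words that contain characters outside a small known alphabet. This is a deliberate simplification and an assumption of this example: a real subword tokenizer such as bert-base-uncased decides this from its own vocabulary, not from a fixed character set.

```python
import string

# Toy stand-in for a tokenizer vocabulary: lowercase ASCII letters, digits and punctuation.
KNOWN_CHARS = set(string.ascii_lowercase + string.digits + string.punctuation)

def unknown_words(text):
    """Return the words that contain at least one character outside KNOWN_CHARS."""
    return [w for w in text.lower().split() if any(c not in KNOWN_CHARS for c in w)]

print(unknown_words("Feeling great today 😀 감사합니다 #blessed"))
```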

### Integrity #2: Text outliers

Looking at the Text Outlier check result (in the "Other" tab) we can derive several insights: </br>
1. Hashtags ('#...') are usually several words written together without spaces - we might consider splitting them before feeding the tweet to a model</br>
2. In some instances users deliberately misspell words, for example '!' instead of the letter 'l' or 'okayyyyyyyyyy'</br>
3. The majority of the data is in English but not all. If we want a classifier that is multilingual we should collect more data, otherwise we may consider dropping tweets in other languages from our dataset before training our model.
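
Splitting hashtags, as suggested in the first point, can be sketched with a regular expression. The helper `split_hashtag` is hypothetical and only handles CamelCase tags. All-lowercase tags come back as a single word and would need a dictionary-based splitter instead.

```python
import re

def split_hashtag(tag):
    """Split a CamelCase hashtag into words; all-lowercase tags stay as one word."""
    body = tag.lstrip("#")
    # Acronyms (e.g. "USA"), capitalized or lowercase words, and digit runs.
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", body)

print(split_hashtag("#NeverGiveUp"))
print(split_hashtag("#mondaymotivation"))
```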

### Integrity #3: Property-Label Correlation (Shortcut Learning)

The Property-Label Correlation check verifies the data does not contain any shortcuts the model can fixate on during the learning process. In our case we can see no indication that this problem exists in our dataset.</br>
For more information about shortcut learning see: https://towardsdatascience.com/shortcut-learning-how-and-why-models-cheat-1b37575a159

## Train Test Validation

The next suite serves to validate our split and compare the two datasets. This suite is useful when you have already decided on your train and test / validation splits, but before training the model itself.

In [None]:
from deepchecks.nlp.suites import train_test_validation

In [None]:
train_test_validation().run(train, test)

### Label Drift

We can see that we have some significant change in the distribution of the label - the label "optimism" is suddenly way more common in the test dataset, while other labels declined. This happened because we split on time, so the topics covered by the tweets in the test dataset may correspond to specific trends or events that happened later in time. Let's investigate!
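
To get a feel for what a shift in label distribution looks like numerically, here is a small sketch that compares class shares between two label lists using total variation distance. This is a simpler, swapped-in measure chosen for illustration, not the drift score Deepchecks reports, and the label lists are made up rather than taken from the tweet emotion data.

```python
from collections import Counter

def label_shares(labels):
    """Map each class to its share of the samples."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def total_variation(train_labels, test_labels):
    """Half the sum of absolute differences between class shares (0 = identical, 1 = disjoint)."""
    p, q = label_shares(train_labels), label_shares(test_labels)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))

# Made-up label lists in which "optimism" becomes much more common in the test split.
toy_train = ["anger"] * 40 + ["happiness"] * 30 + ["optimism"] * 10 + ["sadness"] * 20
toy_test = ["anger"] * 30 + ["happiness"] * 20 + ["optimism"] * 35 + ["sadness"] * 15
print(round(total_variation(toy_train, toy_test), 3))
```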

## Model Evaluation

### Loading pre-calculated model predictions

The suite below is designed to be run after a model was trained, and so requires model predictions, which can be supplied via the relevant arguments of the ``run`` function.
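
Predictions and probabilities are two views of the same model output. Assuming the probability columns follow the order of the class list (an assumption of this sketch; check how your own model orders its outputs), hard predictions can be derived with an argmax. The probability values below are made up.

```python
import numpy as np

model_classes = ['anger', 'happiness', 'optimism', 'sadness']

# Made-up probabilities: one row per sample, one column per class.
toy_probas = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.15, 0.60, 0.20],
    [0.20, 0.20, 0.10, 0.50],
])

# Pick the class with the highest probability in each row.
toy_preds = [model_classes[i] for i in toy_probas.argmax(axis=1)]
print(toy_preds)
```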

In [None]:
train_preds, test_preds = tweet_emotion.load_precalculated_predictions(pred_format='predictions', as_train_test=True)

train_probas, test_probas = tweet_emotion.load_precalculated_predictions(pred_format='probabilities', as_train_test=True)

In [None]:
from deepchecks.nlp.suites import model_evaluation

In [None]:
result = model_evaluation().run(train, test, train_predictions=train_preds, test_predictions=test_preds, 
                                train_probabilities=train_probas, test_probabilities=test_probas, model_classes=model_classes)
result

OK! We have many important issues being surfaced by this suite. Let's dive into the individual checks:

### Model Eval #1: Train Test Performance 

On the most superficial level, we can immediately see (in the "Didn't Pass" tab) that there has been significant degradation in the Recall on class "optimism". This follows from the severe label drift we saw after running the previous suite.
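
As a reminder of what per-class recall measures, here is a minimal sketch on made-up labels and predictions. The helper `per_class_recall` is written for this example and is not part of Deepchecks.

```python
def per_class_recall(y_true, y_pred):
    """Recall per class: correctly predicted samples of a class / all samples of that class."""
    recall = {}
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recall[cls] = sum(y_pred[i] == cls for i in idx) / len(idx)
    return recall

# Made-up data: the model misses most of the "optimism" samples.
y_true = ["optimism", "optimism", "optimism", "optimism", "anger", "anger"]
y_pred = ["optimism", "sadness", "anger", "sadness", "anger", "anger"]
print(per_class_recall(y_true, y_pred))
```

When a class becomes much more frequent in the test data, a low recall on it hurts overall performance far more than it did on the training data.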

### Model Eval #2: Segment Performance

The two segment performance checks, Property Segment Performance and Metadata Segment Performance, use the metadata columns of user-related information OR our calculated properties to try and **automatically** detect significant data segments on which our model performs badly.

In this case we can see that both checks have found issues in the test dataset:
1. The Property Segment Performance check has found that we're getting very poor results on low toxicity samples. That probably means that our model is using the toxicity of the text to infer the "anger" label, and is having a harder time with other, more benign text samples.
2. The Metadata Segment Performance check has found that we have trouble predicting correct results for new users from the Americas. That's 5% of our dataset, so we had better investigate that further.
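
The idea behind segment performance can be sketched by hand for a single property: split its range into segments and compare accuracy per segment. The real checks search for weak segments automatically across properties and metadata columns, whereas this example uses one hand-picked split on made-up data.

```python
import pandas as pd

# Made-up data: a property value per sample and whether the prediction was correct.
df = pd.DataFrame({
    "Toxicity": [0.05, 0.10, 0.15, 0.20, 0.60, 0.70, 0.80, 0.90],
    "correct":  [0,    0,    1,    0,    1,    1,    1,    0],
})

# Split the property range into two segments and compare accuracy per segment.
df["segment"] = pd.cut(df["Toxicity"], bins=[0.0, 0.5, 1.0], labels=["low", "high"])
segment_accuracy = df.groupby("segment", observed=True)["correct"].mean()
print(segment_accuracy)
```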

### Model Eval #3: Prediction Drift

We note that the Prediction Drift (here in the "Passed" tab) shows no issue. Given that we already know that there is significant Label Drift, this means we have Concept Drift - the labels corresponding to our samples have changed, while the model continues to predict the same labels.

## Running all checks in one suite

If you have a model that you already developed and you want to test all available checks at once, you can run the Full Suite.

In [None]:
# from deepchecks.nlp.suites import full_suite
# suite = full_suite()
# result = suite.run(train, test, train_predictions=train_preds, test_predictions=test_preds, 
#                    train_probabilities=train_probas, test_probabilities=test_probas, model_classes=model_classes)

# Running Individual Checks 

Checks can also be run individually. In this section, we'll show two of the more interesting checks and how you can run them stand-alone.

## Embeddings Drift

In [None]:
from deepchecks.nlp.datasets.classification.tweet_emotion import load_embeddings

In order to run the embedding drift check you must have text embeddings loaded into both datasets. In this example, we have the embeddings already pre-calculated:

In [None]:
train_embeddings, test_embeddings = load_embeddings()

In [None]:
train.set_embeddings(train_embeddings)
test.set_embeddings(test_embeddings)

You can also calculate the embeddings using deepchecks, either using an open-source sentence-transformer or using OpenAI's embedding API.

In [None]:
# train.calculate_default_embeddings()
# test.calculate_default_embeddings()

In [None]:
from deepchecks.nlp.checks import TextEmbeddingsDrift

check = TextEmbeddingsDrift()
res = check.run(train, test)
res

Here we can see some distinct segments that contain mostly samples from train or mostly samples from test. For example, if we look at the cluster in the top left corner we see it's full of inspirational quotes and sayings, and belongs mostly to the test dataset. That is the source of the drastic increase in optimistic labels!

There are some other noteworthy segments, such as the "tail" segment in the middle left that contains tweets about a terror attack in Bangladesh (and belongs solely to the test data), or a cluster on the bottom right that discusses a sports event that probably appears only in the training dataset.
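
One common way to put a number on this kind of drift is a domain classifier: train a model to tell train embeddings from test embeddings and look at how well it does on held-out samples. The sketch below uses a nearest-centroid classifier on synthetic vectors. Both the classifier and the data are assumptions made for this illustration and should not be read as the method `TextEmbeddingsDrift` uses internally.

```python
import numpy as np

def domain_classifier_accuracy(emb_a, emb_b, seed=0):
    """Held-out accuracy of a nearest-centroid classifier separating emb_a from emb_b."""
    rng = np.random.default_rng(seed)
    X = np.vstack([emb_a, emb_b])
    y = np.array([0] * len(emb_a) + [1] * len(emb_b))
    order = rng.permutation(len(X))
    half = len(X) // 2
    fit, held = order[:half], order[half:]
    # "Train": one centroid per dataset, computed on the fitting half only.
    c0 = X[fit][y[fit] == 0].mean(axis=0)
    c1 = X[fit][y[fit] == 1].mean(axis=0)
    # "Predict": assign each held-out point to the closer centroid.
    d0 = np.linalg.norm(X[held] - c0, axis=1)
    d1 = np.linalg.norm(X[held] - c1, axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == y[held]).mean())

rng = np.random.default_rng(1)
same = domain_classifier_accuracy(rng.normal(size=(400, 8)), rng.normal(size=(400, 8)))
shifted = domain_classifier_accuracy(rng.normal(size=(400, 8)), rng.normal(loc=3.0, size=(400, 8)))
print(f"no drift: {same:.2f}, drift: {shifted:.2f}")
```

An accuracy near 0.5 means the two sets are indistinguishable, while an accuracy near 1.0 means the classifier can separate them easily, which indicates drift.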

## Under Annotated Segments

Another noteworthy check is the Under Annotated Segments check, which explores our data and automatically identifies segments where the data is under-annotated, meaning that the ratio of missing labels is higher than in the rest of the data. We'll also add a condition to this check that will alert us if a significant under-annotated segment is found.

In [None]:
from deepchecks.nlp.checks import UnderAnnotatedPropertySegments
check = UnderAnnotatedPropertySegments(segment_minimum_size_ratio=0.1).add_condition_segments_relative_performance_greater_than()
check.run(test_copy)

For example, here the check detected that many annotations are missing for samples that are informal and not very fluent. Could it be that our annotators have a problem annotating these samples and prefer not to deal with them? If these samples are important for us, we may have to put special focus on annotating this segment.
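
The quantity behind this finding is the annotation ratio per segment, which can be computed by hand for a single hand-picked split. The data, the 0.2 threshold and the segment names below are made up for illustration. The check itself searches for such segments automatically.

```python
import pandas as pd

# Made-up properties and labels; None marks a missing annotation.
df = pd.DataFrame({
    "Formality": [0.1, 0.15, 0.1, 0.05, 0.8, 0.9, 0.7, 0.6],
    "label": ["anger", None, None, None, "sadness", "anger", None, "optimism"],
})

# Assign each sample to a segment based on one property.
df["segment"] = df["Formality"].lt(0.2).map({True: "informal", False: "formal"})

# Share of samples that have a label, per segment.
annotation_ratio = df["label"].notna().groupby(df["segment"]).mean()
print(annotation_ratio)
```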