# Find Challenging Cases in Any Hugging Face Dataset
- https://medium.com/@daniel-klitzke/find-issues-in-huggingface-datasets-and-explore-them-interactively-dd3503c0985f
- https://github.com/Renumics/sliceguard/blob/main/examples/hf_dataset_loading.ipynb

Sliceguard helps you quickly discover **problematic data segments**. It supports structured data as well as unstructured data such as images, text, or audio, and generates an **interactive report** with just a few lines of code:

```python
from sliceguard import SliceGuard

sg = SliceGuard()
issues = sg.find_issues(df, features=["image"])

sg.report()
```

First, install sliceguard including its embedding and AutoML capabilities.

In [None]:
%pip install --no-cache-dir sliceguard[all] cleanlab

Note: you may need to restart the kernel to use updated packages.


Import sliceguard together with a metric function suited to audio classification; accuracy is used here.

In [3]:
from sliceguard import SliceGuard
from sliceguard.data import from_huggingface
from sklearn.metrics import accuracy_score
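
`accuracy_score` follows scikit-learn's `metric(y_true, y_pred)` convention and, with default arguments, returns the fraction of predictions that match the labels. The pure-Python sketch below reproduces that quantity on a few made-up labels, to make explicit what number is being evaluated. The `accuracy` helper and the sample lists are illustrative assumptions, not part of sliceguard or the dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true label (higher is better)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty list")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and predictions, not taken from the dataset.
y_true = ["anger", "neutral", "sadness", "anger"]
y_pred = ["anger", "neutral", "anger", "anger"]

score = accuracy(y_true, y_pred)  # 3 of the 4 predictions are correct
print(score)
```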

Load an audio classification dataset. `from_huggingface` fetches `renumics/emodb` and returns it as a DataFrame.

In [4]:
df = from_huggingface("renumics/emodb")


In [5]:
df

| | age | gender | emotion | audio | split |
|---|---|---|---|---|---|
| 0 | 31.0 | male | happiness | ./sliceguard_tmp/15632b224c0349a18b50ca3d0d897... | train |
| 1 | 31.0 | male | neutral | ./sliceguard_tmp/6f896216f1dc480b8cf43b284995c... | train |
| 2 | 31.0 | male | anger | ./sliceguard_tmp/2c174149ce664493a461bc34ac3c8... | train |
| 3 | 31.0 | male | happiness | ./sliceguard_tmp/bdafbe22e6b54697acd6b1f359368... | train |
| 4 | 31.0 | male | neutral | ./sliceguard_tmp/85dbaf08ba2d44f99464878814cdd... | train |
| ... | ... | ... | ... | ... | ... |
| 530 | 31.0 | female | boredom | ./sliceguard_tmp/6f492e6d5f5c41fda76ef2851976f... | train |
| 531 | 31.0 | female | sadness | ./sliceguard_tmp/e643ec7275774547bfb18a6173391... | train |
| 532 | 31.0 | female | sadness | ./sliceguard_tmp/579ae503af8645c1a2e3a53cfe9f5... | train |
| 533 | 31.0 | female | anger | ./sliceguard_tmp/784bfe68a37447068e0e8a1df1de6... | train |


Detect challenging clusters using sliceguard.
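
Conceptually, a challenging cluster is a group of similar samples on which the metric falls clearly below its value on the whole dataset. The following is a minimal sketch of that idea only, not sliceguard's actual algorithm: the group ids stand in for clusters found on embeddings, and `flag_weak_groups`, `min_support`, and `min_drop` are names chosen here for illustration.

```python
from collections import defaultdict


def flag_weak_groups(group_ids, correct, min_support=3, min_drop=0.2):
    """Return (group, accuracy, drop) for groups scoring clearly below overall accuracy.

    group_ids: one group label per sample (a stand-in for cluster ids).
    correct:   one bool per sample, True if the prediction was right.
    """
    if len(group_ids) != len(correct) or not correct:
        raise ValueError("inputs must be non-empty and of equal length")

    overall = sum(correct) / len(correct)

    # Collect the per-sample outcomes of every group.
    outcomes = defaultdict(list)
    for group, is_correct in zip(group_ids, correct):
        outcomes[group].append(is_correct)

    flagged = []
    for group, values in outcomes.items():
        if len(values) < min_support:
            continue  # too few samples to judge this group
        group_accuracy = sum(values) / len(values)
        drop = overall - group_accuracy
        if drop >= min_drop:
            flagged.append((group, group_accuracy, drop))

    # Largest drop first.
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Hypothetical toy input: three groups with 4, 4, and 2 samples.
group_ids = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
correct = [True, True, True, True, False, True, False, False, False, False]

weak = flag_weak_groups(group_ids, correct)
print(weak)
```

On this toy input only group 1 is flagged: group 0 scores above the overall accuracy, and group 2 has too few samples to be judged.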

In [6]:
sg = SliceGuard()
sg.find_issues(df, features=["audio"], y="emotion", metric=accuracy_score)
sg.report()

Feature audio was inferred as referring to raw data. If this is not the case, please specify in feature_types!
Using default model for computing embeddings for feature audio.
Computing audio embeddings.


Embedding computation on cpu with batch size 1 and multiprocessing None.

Pre-reducing feature audio in mode automl.
Using op mix ratio 0.8.
Using num dimensions 32.
[flaml.automl.logger: 12-03 15:40:02] {1728} INFO - task = classification
[flaml.automl.logger: 12-03 15:40:02] {1739} INFO - Evaluation method: cv
[flaml.automl.logger: 12-03 15:40:02] {1838} INFO - Minimizing error metric: 1-roc_auc_ovr
[flaml.automl.logger: 12-03 15:40:02] {1955} INFO - List of ML learners in AutoML Run: ['xgboost']
[flaml.automl.logger: 12-03 15:40:02] {2258} INFO - iteration 0, current learner xgboost
[flaml.automl.logger: 12-03 15:40:02] {2393} INFO - Estimated sufficient time budget=2879s. Estimated necessary time budget=3s.
[flaml.automl.logger: 12-03 15:40:02] {2442} INFO -  at 0.3s,	estimator xgboost's best error=0.1282,	best estimator xgboost's best error=0.1282
[flaml.automl.logger: 12-03 15:40:02] {2258} INFO - iteration 1, current learner xgboost
[flaml.automl.logger: 12-03 15:40:02] {2442} INFO -  at 0.6s,	estimator xgboost's best error=0.1282,	best estimator xgbo

(      age  gender    emotion  \
 0    31.0    male  happiness   
 1    31.0    male    neutral   
 2    31.0    male      anger   
 3    31.0    male  happiness   
 4    31.0    male    neutral   
 ..    ...     ...        ...   
 530  31.0  female    boredom   
 531  31.0  female    sadness   
 532  31.0  female    sadness   
 533  31.0  female      anger   
 534  31.0  female      anger   
 
                                                  audio  split  \
 0    ./sliceguard_tmp/15632b224c0349a18b50ca3d0d897...  train   
 1    ./sliceguard_tmp/6f896216f1dc480b8cf43b284995c...  train   
 2    ./sliceguard_tmp/2c174149ce664493a461bc34ac3c8...  train   
 3    ./sliceguard_tmp/bdafbe22e6b54697acd6b1f359368...  train   
 4    ./sliceguard_tmp/85dbaf08ba2d44f99464878814cdd...  train   
 ..                                                 ...    ...   
 530  ./sliceguard_tmp/6f492e6d5f5c41fda76ef2851976f...  train   
 531  ./sliceguard_tmp/e643ec7275774547bfb18a6173391...  train   
 532  ./