Sensai is a toxic chat dataset consisting of live chats from Virtual YouTubers' live streams.
Download the dataset from the Hugging Face Hub or, alternatively, from Kaggle Datasets.
Join the #livechat-dataset channel on the holodata Discord server for discussions.
- Source: YouTube Live Chat events (all streams covered by Holodex, including Hololive, Nijisanji, 774inc, etc)
- Temporal Coverage: From 2021-01-15T05:15:33Z
- Update Frequency: At least once per month
Possible research applications include:

- Toxic Chat Classification
- Spam Detection
- Sentence Transformer for Live Chats
See public notebooks for ideas. A minimal embedding sketch for the sentence-transformer idea is shown below.
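For example, chat messages can be embedded with the sentence-transformers library. This is only a starting-point sketch; the model name is an assumption, not something prescribed by the dataset.

```python
from sentence_transformers import SentenceTransformer

# Any multilingual sentence-embedding model works here; this name is just an example
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

chats = ["草", "こんばんは！", "first time here, this stream is great"]
embeddings = model.encode(chats)  # ndarray of shape (3, embedding_dim)
print(embeddings.shape)
```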
The dataset is shipped as monthly files:

filename | summary | size |
---|---|---|
chats_flagged_%Y-%m.csv | Chats flagged as either deleted or banned by mods (3,100,000+) | ~ 400 MB |
chats_nonflag_%Y-%m.csv | Non-flagged chats (3,100,000+) | ~ 300 MB |
To keep the dataset balanced, the number of chats_nonflag rows is adjusted (randomly sampled) to match the number of chats_flagged rows.
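As an illustration only (not the maintainers' actual pipeline), this kind of balancing can be reproduced with pandas; the filenames below are hypothetical placeholders for a single monthly partition.

```python
import pandas as pd

# Hypothetical monthly partition and an unsampled pool of non-flagged chats
flagged = pd.read_csv("chats_flagged_2021-08.csv")
nonflag_pool = pd.read_csv("chats_nonflag_2021-08_full.csv")

# Randomly downsample the non-flagged pool to match the flagged count
nonflag = nonflag_pool.sample(n=len(flagged), random_state=0)
balanced = pd.concat([flagged, nonflag], ignore_index=True)
```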
Ban and deletion are equivalent to markChatItemsByAuthorAsDeletedAction and markChatItemAsDeletedAction, respectively.
Each chat record has the following columns:

column | type | description |
---|---|---|
body | string | chat message |
authorChannelId | string | anonymized author channel id |
channelId | string | source channel id |
label | string | {deleted, hidden, nonflagged} |
To load the Parquet files (e.g., in a Kaggle notebook) with pandas:

```python
import pandas as pd
from glob import glob

# Concatenate all monthly Parquet partitions into one DataFrame
df = pd.concat([pd.read_parquet(x) for x in glob('../input/sensai/*.parquet')], ignore_index=True)
```
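A quick sanity check after loading, using the `label` column described above. Treating both `deleted` and `hidden` as toxic is only one plausible convention, not a rule set by the dataset.

```python
# Inspect the class balance and derive a binary toxic flag
print(df['label'].value_counts())
df['toxic'] = (df['label'] != 'nonflagged').astype(int)
```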
Alternatively, the dataset can be loaded with the Hugging Face datasets library and used to fine-tune a classifier with Transformers (see https://huggingface.co/docs/datasets/loading_datasets.html):
```python
# $ pip3 install datasets transformers
from datasets import load_dataset, Features, ClassLabel, Value
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

# Load the dataset from the Hugging Face Hub with an explicit schema
dataset = load_dataset(
    "holodata/sensai",
    features=Features(
        {
            "body": Value("string"),
            "toxic": ClassLabel(num_classes=2, names=['0', '1']),
        }
    ),
)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["body"], padding="max_length", truncation=True)

# Tokenize a random 50,000-chat subset and rename the target column for the Trainer
tokenized_datasets = dataset['train'].shuffle().select(range(50000)).map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.rename_column("toxic", "label")
splitset = tokenized_datasets.train_test_split(0.2)

training_args = TrainingArguments("test_trainer")
trainer = Trainer(
    model=model, args=training_args, train_dataset=splitset['train'], eval_dataset=splitset['test']
)
trainer.train()
```
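After training, the fine-tuned checkpoint can be tried on a single message. This inference snippet is a generic Transformers usage sketch, not part of the original example; the sample text is arbitrary.

```python
import torch

# Score one chat message; class 1 corresponds to the flagged/toxic label
inputs = tokenizer("example chat message", return_tensors="pt", truncation=True, padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))
```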
A baseline model can also be trained with the Tangram CLI:

```bash
python3 ./examples/prepare_tangram_dataset.py
tangram train --file ./tangram_input.csv --target label
```
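The preparation step presumably writes a flat CSV with a text column and the `label` target. The sketch below is an assumption about what such a file could contain, not the documented behavior of prepare_tangram_dataset.py.

```python
import pandas as pd
from glob import glob

# Hypothetical preparation of tangram_input.csv: one text column plus the target
df = pd.concat([pd.read_parquet(x) for x in glob('../input/sensai/*.parquet')], ignore_index=True)
df[['body', 'label']].to_csv('./tangram_input.csv', index=False)
```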
authorChannelId values are anonymized with the SHA-1 hashing algorithm and a pinch of undisclosed salt.
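Conceptually, the anonymization is a salted hash like the one below. Because the salt is undisclosed, this snippet cannot reproduce the published IDs; it only illustrates the scheme.

```python
import hashlib

SECRET_SALT = b"not-the-real-salt"  # the actual salt is undisclosed

def anonymize(channel_id: str) -> str:
    # Salted SHA-1 digest of the raw channel id, hex-encoded
    return hashlib.sha1(SECRET_SALT + channel_id.encode("utf-8")).hexdigest()

print(anonymize("UCxxxxxxxxxxxxxxxxxxxxxx"))
```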
All custom emojis are replaced with a Unicode replacement character (U+FFFD).
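Depending on the task, you may want to count or strip these placeholders during preprocessing. The column name follows the schema above; whether to remove the markers is your own modeling choice.

```python
# Count the U+FFFD placeholders left by custom emojis, then strip them
emoji_slots = df['body'].str.count('\ufffd')
df['body_noemoji'] = df['body'].str.replace('\ufffd', '', regex=False)
```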
To cite this dataset:

```bibtex
@misc{sensai-dataset,
  author = {Yasuaki Uechi},
  title = {Sensai: Toxic Chat Dataset},
  year = {2021},
  month = {8},
  version = {31},
  url = {https://github.com/holodata/sensai-dataset}
}
```
- Code: MIT License
- Dataset: ODC Public Domain Dedication and Licence (PDDL)