TV and Movie dialogue corpus (#2058)
A huge collection (2781 texts total) of TV and Movie dialogues and
transcripts, either collected or scraped from transcript sites.
See datacard for more information:
https://huggingface.co/datasets/sedthh/tv_dialogue

They are in a standardized format ([speaker] speech) and the idea is
that these dialogues should help the Assistant to:
1) mimic the style of famous characters
2) generate new episodes of certain shows

NOTE: this is part 1, there will be a part 2 based entirely on
ForeverDreaming's transcripts
NOTE2: the notebook only contains the crawler for IMSDb; the parsers for
the other sources were omitted to save time, as their code quality was
low
sedthh committed Mar 16, 2023
1 parent 8005c1f commit 77401ae
Showing 5 changed files with 7,437 additions and 2 deletions.
5 changes: 3 additions & 2 deletions data/datasets/__init__.py
@@ -1,6 +1,7 @@
 TEXT_DATASETS = {
-    "gutenberg_english": "sedthh/gutenberg_english",
-    "gutenberg_multilang": "sedthh/gutenberg_multilang",
+    "gutenberg_english": "sedthh/gutenberg_english",  # Gutenberg eBooks in English
+    "gutenberg_multilang": "sedthh/gutenberg_multilang",  # Gutenberg eBooks in foreign languages
+    "tv_dialogue": "sedthh/tv_dialogue",  # TV and Movie dialogues and transcripts
 }
 
 INSTRUCTION_DATASETS = {
79 changes: 79 additions & 0 deletions data/datasets/tv_dialogue/README.md
@@ -0,0 +1,79 @@
---
dataset_info:
  features:
  - name: TEXT
    dtype: string
  - name: METADATA
    dtype: string
  - name: SOURCE
    dtype: string
  splits:
  - name: train
    num_bytes: 211728118
    num_examples: 2781
  download_size: 125187885
  dataset_size: 211728118
license: mit
task_categories:
- conversational
- text2text-generation
language:
- en
tags:
- OpenAssistant
- transcripts
- subtitles
- television
pretty_name: TV and Movie dialogue and transcript corpus
size_categories:
- 1K<n<10K
---

# Dataset Card for "tv_dialogue"

This dataset contains transcripts for famous movies and TV shows from multiple
sources.

An example dialogue would be:

```
[PERSON 1] Hello
[PERSON 2] Hello Person 2!
How's it going?
(they are both talking)
[PERSON 1] I like being an example
on Huggingface!
They are examples on Huggingface.
CUT OUT TO ANOTHER SCENE
We are somewhere else
[PERSON 1 (v.o)] I wonder where we are?
```

All dialogues were processed to follow this format. Each row is a single episode
or movie (**2781** rows total), following the
[OpenAssistant](https://open-assistant.io/) format. The METADATA column contains
additional information as a JSON string.
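
As a rough illustration of how the `[SPEAKER] speech` convention could be consumed, the sketch below splits a transcript into `(speaker, content)` pairs. The function name `split_lines` and the regular expression are assumptions made for this example and are not part of the dataset tooling. Lines without a leading tag are returned with `None` as the speaker, because the format alone does not say whether such a line is continued speech or a scene description.

```python
import re
from typing import List, Optional, Tuple

# A line that opens with "[SPEAKER]" followed by optional speech.
SPEAKER_LINE = re.compile(r"^\[([^\]]+)\]\s*(.*)$")


def split_lines(text: str) -> List[Tuple[Optional[str], str]]:
    """Return (speaker, content) pairs, one per non-empty line.

    speaker is None for lines without a leading [SPEAKER] tag; such lines
    may be continued speech or scene description, and this sketch does not
    try to tell them apart.
    """
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = SPEAKER_LINE.match(line)
        if match:
            pairs.append((match.group(1).strip(), match.group(2)))
        else:
            pairs.append((None, line))
    return pairs


example = """[PERSON 1] Hello
[PERSON 2] Hello Person 2!
How's it going?
(they are both talking)
[PERSON 1 (v.o)] I wonder where we are?"""

for speaker, content in split_lines(example):
    print(speaker, "|", content)
```

Qualifiers such as `(v.o)` stay inside the speaker string here; stripping them would be a further, separate normalisation step.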

## Dialogue only, with some information on the scene

| Show | Number of scripts | Via | Source |
| ------------------------- | ----------------- | ------------------------------------------------------------------------------------------------------- | -------------------- |
| Friends | 236 episodes | https://github.com/emorynlp/character-mining | friends/emorynlp |
| The Office | 186 episodes | https://www.kaggle.com/datasets/nasirkhalid24/the-office-us-complete-dialoguetranscript | office/nasirkhalid24 |
| Marvel Cinematic Universe | 18 movies | https://www.kaggle.com/datasets/pdunton/marvel-cinematic-universe-dialogue | marvel/pdunton |
| Doctor Who | 306 episodes | https://www.kaggle.com/datasets/jeanmidev/doctor-who | drwho/jeanmidev |
| Star Trek | 708 episodes | http://www.chakoteya.net/StarTrek/index.html based on https://github.com/GJBroughton/Star_Trek_Scripts/ | statrek/chakoteya |

## Actual transcripts with detailed information on the scenes

| Show | Number of scripts | Via | Source |
| ------------- | ----------------- | ----------------------------------- | ------------------- |
| Top Movies | 919 movies | https://imsdb.com/ | imsdb |
| Top Movies | 171 movies | https://www.dailyscript.com/ | dailyscript |
| Stargate SG-1 | 18 episodes | https://imsdb.com/ | imsdb |
| South Park | 129 episodes | https://imsdb.com/ | imsdb |
| Knight Rider | 80 episodes | http://www.knightriderarchives.com/ | knightriderarchives |
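
The following is a minimal sketch of selecting the scripts from one source and decoding their metadata. It assumes that the `SOURCE` column holds the identifiers shown in the Source column of the tables above, and that the Hugging Face `datasets` library is installed. The helper names `rows_from_source` and `load_rows` are hypothetical and chosen only for this example.

```python
import json
from typing import Any, Dict, Iterable, List


def rows_from_source(rows: Iterable[Dict[str, Any]], source: str) -> List[Dict[str, Any]]:
    """Keep rows whose SOURCE equals `source`, decoding METADATA from JSON."""
    selected = []
    for row in rows:
        if row["SOURCE"] != source:
            continue
        decoded = dict(row)  # copy, so the input row is left unchanged
        decoded["METADATA"] = json.loads(row["METADATA"])
        selected.append(decoded)
    return selected


def load_rows():
    """Download the train split from the Hugging Face Hub (network access needed)."""
    # Imported lazily so rows_from_source works without the library installed.
    from datasets import load_dataset

    return load_dataset("sedthh/tv_dialogue", split="train")
```

A call such as `rows_from_source(load_rows(), "friends/emorynlp")` would then return the matching rows with `METADATA` as a dictionary. The keys inside that dictionary are not documented here, so they are best inspected before being relied on.
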
