TV and Movie dialogue corpus (#2058)
A large collection (2781 texts in total) of TV and movie dialogues and transcripts, either collected or scraped from transcript sites. See the datacard for more information: https://huggingface.co/datasets/sedthh/tv_dialogue

They are in a standardized format ([speaker] speech). The idea is that these dialogues should help the Assistant to:
1) mimic the style of famous characters
2) generate new episodes of certain shows

NOTE: this is part 1; there will be a part 2 based entirely on ForeverDreaming's transcripts.
NOTE2: the notebook only contains the crawler for IMSDb; the parsers for the other sources were omitted to save time, as their code quality was low.
Showing 5 changed files with 7,437 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,79 @@ | ||
--- | ||
dataset_info: | ||
features: | ||
- name: TEXT | ||
dtype: string | ||
- name: METADATA | ||
dtype: string | ||
- name: SOURCE | ||
dtype: string | ||
splits: | ||
- name: train | ||
num_bytes: 211728118 | ||
num_examples: 2781 | ||
download_size: 125187885 | ||
dataset_size: 211728118 | ||
license: mit | ||
task_categories: | ||
- conversational | ||
- text2text-generation | ||
language: | ||
- en | ||
tags: | ||
- OpenAssistant | ||
- transcripts | ||
- subtitles | ||
- television | ||
pretty_name: TV and Movie dialogue and transcript corpus | ||
size_categories: | ||
- 1K<n<10K | ||
--- | ||
|
||
# Dataset Card for "tv_dialogue" | ||
|
||
This dataset contains transcripts for famous movies and TV shows from multiple | ||
sources. | ||
|
||
An example dialogue would be: | ||
|
||
``` | ||
[PERSON 1] Hello | ||
[PERSON 2] Hello Person 2! | ||
How's it going? | ||
(they are both talking) | ||
[PERSON 1] I like being an example | ||
on Huggingface! | ||
They are examples on Huggingface. | ||
CUT OUT TO ANOTHER SCENE | ||
We are somewhere else | ||
[PERSON 1 (v.o)] I wonder where we are? | ||
``` | ||
|
||
All dialogues were processed to follow this format. Each row is a single episode | ||
/ movie (**2781** rows total). Following the | ||
[OpenAssistant](https://open-assistant.io/) format, the METADATA column contains | ||
additional information as a JSON string. | ||
|
||
## Dialogue only, with some information on the scene | ||
|
||
| Show | Number of scripts | Via | Source | | ||
| ------------------------- | ----------------- | ------------------------------------------------------------------------------------------------------- | -------------------- | | ||
| Friends | 236 episodes | https://github.com/emorynlp/character-mining | friends/emorynlp | | ||
| The Office | 186 episodes | https://www.kaggle.com/datasets/nasirkhalid24/the-office-us-complete-dialoguetranscript | office/nasirkhalid24 | | ||
| Marvel Cinematic Universe | 18 movies | https://www.kaggle.com/datasets/pdunton/marvel-cinematic-universe-dialogue | marvel/pdunton | | ||
| Doctor Who | 306 episodes | https://www.kaggle.com/datasets/jeanmidev/doctor-who | drwho/jeanmidev | | ||
| Star Trek | 708 episodes | http://www.chakoteya.net/StarTrek/index.html based on https://github.com/GJBroughton/Star_Trek_Scripts/ | statrek/chakoteya | | ||
|
||
## Actual transcripts with detailed information on the scenes | ||
|
||
| Show | Number of scripts | Via | Source | | ||
| ------------- | ----------------- | ----------------------------------- | ------------------- | | ||
| Top Movies | 919 movies | https://imsdb.com/ | imsdb | | ||
| Top Movies | 171 movies | https://www.dailyscript.com/ | dailyscript | | ||
| Stargate SG-1 | 18 episodes | https://imsdb.com/ | imsdb | | ||
| South Park | 129 episodes | https://imsdb.com/ | imsdb | | ||
| Knight Rider | 80 episodes | http://www.knightriderarchives.com/ | knightriderarchives | |
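
The `[speaker] speech` format shown in the dataset card's example can be split into turns with a small parser. The following is a minimal sketch, not the author's own tooling: `TURN_RE` and `parse_dialogue` are hypothetical names, and the rule that a turn starts with a bracketed speaker at the beginning of a line is inferred from the example only. Because the format does not mark whether an unprefixed line continues the previous speech or describes the scene, such lines are returned with a speaker of `None` rather than guessed at.

```python
import re
from typing import List, Optional, Tuple

# Assumption: a turn is a line of the form "[SPEAKER] speech".
TURN_RE = re.compile(r"^\[(?P<speaker>[^\]]+)\]\s*(?P<speech>.*)$")


def parse_dialogue(text: str) -> List[Tuple[Optional[str], str]]:
    """Split a transcript into (speaker, line) pairs.

    Lines without a "[SPEAKER]" prefix get speaker None, since they may be
    either a continuation of the previous speech or a scene description.
    """
    turns: List[Tuple[Optional[str], str]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        match = TURN_RE.match(line)
        if match:
            turns.append((match.group("speaker"), match.group("speech")))
        else:
            turns.append((None, line))
    return turns


if __name__ == "__main__":
    sample = (
        "[PERSON 1] Hello\n"
        "(they are both talking)\n"
        "[PERSON 1 (v.o)] I wonder where we are?"
    )
    for speaker, line in parse_dialogue(sample):
        print(speaker, "|", line)
```

Keeping unattributed lines as separate entries preserves the original line order, so a caller can later decide whether to merge them into the preceding turn.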
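
Since the dataset card states that METADATA is stored as a JSON string, each row needs one decoding step before its fields can be used. The sketch below shows one way to do that and to filter rows by the SOURCE column. The helper names are hypothetical, the `"title"` key in the usage example is illustrative only (the card does not list the METADATA keys), and it is an assumption that SOURCE holds the identifiers from the "Source" column of the tables, such as `imsdb`. `load_tv_dialogue` relies on the Hugging Face `datasets` library and the repository id taken from the datacard URL.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def decode_row(row: Dict[str, str]) -> Dict[str, Any]:
    """Return a copy of a row with METADATA parsed from its JSON string."""
    decoded: Dict[str, Any] = dict(row)
    decoded["METADATA"] = json.loads(row["METADATA"])
    return decoded


def rows_from_source(
    rows: Iterable[Dict[str, str]], source: str
) -> Iterator[Dict[str, Any]]:
    """Yield decoded rows whose SOURCE column equals `source`."""
    for row in rows:
        if row["SOURCE"] == source:
            yield decode_row(row)


def load_tv_dialogue():
    """Download the train split from the Hugging Face Hub.

    Requires the third-party `datasets` package; it is imported lazily so
    the helpers above can be used without it.
    """
    from datasets import load_dataset

    return load_dataset("sedthh/tv_dialogue", split="train")


if __name__ == "__main__":
    # Hypothetical row; the "title" key is an illustrative assumption.
    example_rows = [
        {
            "TEXT": "[PERSON 1] Hello",
            "METADATA": '{"title": "Example"}',
            "SOURCE": "imsdb",
        }
    ]
    for row in rows_from_source(example_rows, "imsdb"):
        print(row["SOURCE"], row["METADATA"])
```

With the real data, `rows_from_source(load_tv_dialogue(), "imsdb")` would iterate over the downloaded split in the same way, provided the SOURCE values match the assumption above.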