# MIT Movie Corpus

The MIT Movie Corpus is a semantically tagged training and test corpus in BIO format. The eng corpus contains simple queries, and the trivia10k13 corpus contains more complex queries. This notebook processes the trivia10k13 files.

- https://groups.csail.mit.edu/sls/downloads/
- https://www.microsoft.com/en-us/research/publication/a-conversational-movie-search-system-based-on-conditional-random-fields/
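
To make the BIO format concrete, here is a minimal sketch of how BIO tags group tokens into labeled spans. The helper `bio_to_spans` and the example query are illustrative assumptions, not part of the corpus tooling; the labels `Actor` and `Genre` are taken from the tag set that appears later in this notebook.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) spans (hypothetical helper)."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_span = tag.startswith('B-') or (
            tag.startswith('I-') and tag[2:] != current_label
        )
        if starts_span:
            # A B- tag (or a stray I- tag with a different label) opens a new span.
            if current_tokens:
                spans.append((current_label, ' '.join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith('I-'):
            # An I- tag with the same label continues the current span.
            current_tokens.append(token)
        else:
            # An O tag closes any open span.
            if current_tokens:
                spans.append((current_label, ' '.join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        spans.append((current_label, ' '.join(current_tokens)))
    return spans


# Made-up query for illustration; not a row from the corpus.
tokens = ['show', 'me', 'a', 'tom', 'hanks', 'comedy']
tags = ['O', 'O', 'O', 'B-Actor', 'I-Actor', 'B-Genre']
spans = bio_to_spans(tokens, tags)
print(spans)
```

`B-` marks the first token of an entity, `I-` a continuation of the same entity, and `O` a token outside any entity.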

In [1]:
from pathlib import Path

data_dir = Path('../data')

hub_username = 'marcov'

corpus_name = 'NER_ENGLISH_MOVIE_COMPLEX'
corpus_url = 'https://groups.csail.mit.edu/sls/downloads/movie'
corpus_dir = data_dir / corpus_name
corpus_dir.mkdir(parents=True, exist_ok=True)

train_filename = 'trivia10k13train.bio'
test_filename = 'trivia10k13test.bio'

In [2]:
!wget -nc {corpus_url}/{train_filename} -P {corpus_dir}
!wget -nc {corpus_url}/{test_filename} -P {corpus_dir}

--2024-04-28 12:00:23--  https://groups.csail.mit.edu/sls/downloads/movie/trivia10k13train.bio
Resolving groups.csail.mit.edu (groups.csail.mit.edu)... 128.30.2.44
Connecting to groups.csail.mit.edu (groups.csail.mit.edu)|128.30.2.44|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1785558 (1.7M)
Saving to: ‘../data/NER_ENGLISH_MOVIE_COMPLEX/trivia10k13train.bio’


2024-04-28 12:00:24 (1.28 MB/s) - ‘../data/NER_ENGLISH_MOVIE_COMPLEX/trivia10k13train.bio’ saved [1785558/1785558]

--2024-04-28 12:00:25--  https://groups.csail.mit.edu/sls/downloads/movie/trivia10k13test.bio
Resolving groups.csail.mit.edu (groups.csail.mit.edu)... 128.30.2.44
Connecting to groups.csail.mit.edu (groups.csail.mit.edu)|128.30.2.44|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 438729 (428K)
Saving to: ‘../data/NER_ENGLISH_MOVIE_COMPLEX/trivia10k13test.bio’


2024-04-28 12:00:26 (822 KB/s) - ‘../data/NER_ENGLISH_MOVIE_COMPLEX/trivia10k13test.bio’ saved [43872

In [3]:
from ai_den.utils.datasets import read_conll_file
from datasets import DatasetDict

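# Each line in the MIT .bio files is '<tag>\t<token>', so the tag is column 0
# and the token is column 1.
column_format = {'text': 1, 'ner': 0}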

ds = DatasetDict({
    'train': read_conll_file(corpus_dir / train_filename, column_format),
    'test': read_conll_file(corpus_dir / test_filename, column_format),
})

ds

DatasetDict({
    train: Dataset({
        features: ['text', 'ner'],
        num_rows: 7816
    })
    test: Dataset({
        features: ['text', 'ner'],
        num_rows: 1953
    })
})
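
`read_conll_file` comes from a private package, so its implementation is not shown here. As a rough sketch of what such a reader might do, the hypothetical generator below parses a tag-first BIO file into `text` and `ner` lists. It assumes whitespace-separated columns and blank lines between sentences; the function name and the sample data are invented for illustration.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterator, List


def iter_bio_sentences(path: Path) -> Iterator[Dict[str, List[str]]]:
    """Yield one {'text': [...], 'ner': [...]} dict per sentence (hypothetical)."""
    tokens, tags = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                # Assumption: a blank line ends the current sentence.
                if tokens:
                    yield {'text': tokens, 'ner': tags}
                    tokens, tags = [], []
                continue
            # Assumption: the tag is the first column, the token the second.
            tag, token = line.split(maxsplit=1)
            tokens.append(token)
            tags.append(tag)
    # Flush the last sentence if the file does not end with a blank line.
    if tokens:
        yield {'text': tokens, 'ner': tags}


# Tiny made-up sample written to a temporary file for demonstration.
sample = 'O\tshow\nO\tme\nB-Actor\ttom\nI-Actor\thanks\n\nB-Genre\tcomedy\nO\tmovies\n'
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / 'sample.bio'
    sample_path.write_text(sample, encoding='utf-8')
    sentences = list(iter_bio_sentences(sample_path))

print(sentences)
```

A generator of this shape could presumably be handed to `datasets.Dataset.from_generator`, although whether `read_conll_file` works that way is an assumption.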

In [4]:
ds['train'].to_pandas()

,text,ner
0,"[steve, mcqueen, provided, a, thrilling, motor...","[B-Actor, I-Actor, O, O, B-Plot, I-Plot, I-Plo..."
1,"[liza, minnelli, and, joel, gray, won, oscars,...","[B-Actor, I-Actor, O, B-Actor, I-Actor, B-Awar..."
2,"[what, is, that, tom, hanks, and, julia, rober...","[O, O, O, B-Actor, I-Actor, O, B-Actor, I-Acto..."
3,"[what, is, the, movie, making, fun, of, macgyv...","[O, O, O, O, B-Plot, I-Plot, I-Plot, I-Plot, I..."
4,"[i, am, thinking, of, an, animated, film, base...","[O, O, O, O, O, B-Genre, O, B-Origin, I-Origin..."
...,...,...
7811,"[you, see, this, 1965, musical, masterpiece, r...","[O, O, O, B-Year, B-Genre, B-Opinion, O, O, O,..."
7812,"[young, traveler, allan, gray, discovers, evid...","[B-Plot, I-Plot, B-Character_Name, I-Character..."
7813,"[yul, bryner, recreated, his, broadway, role, ...","[B-Actor, I-Actor, B-Origin, I-Origin, I-Origi..."
7814,"[yul, brynner, won, an, oscar, for, his, role,...","[B-Actor, I-Actor, O, O, B-Award, O, O, O, O, ..."
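
Before publishing, it can help to check how often each entity label occurs. The sketch below counts entities by their `B-` tags; `count_entity_labels` is a hypothetical helper and the tag sequences are made up, not taken from the corpus.

```python
from collections import Counter
from typing import Iterable, List


def count_entity_labels(tag_sequences: Iterable[List[str]]) -> Counter:
    """Count entities per label, one count per B- tag (hypothetical helper)."""
    counts = Counter()
    for tags in tag_sequences:
        for tag in tags:
            if tag.startswith('B-'):
                # Strip the 'B-' prefix so 'B-Actor' is counted as 'Actor'.
                counts[tag[2:]] += 1
    return counts


# Made-up tag sequences for illustration.
example_tags = [
    ['B-Actor', 'I-Actor', 'O', 'B-Genre'],
    ['O', 'B-Actor', 'I-Actor', 'O', 'B-Year'],
]
label_counts = count_entity_labels(example_tags)
print(label_counts.most_common())
```

On the real data, the same function could be applied to `ds['train']['ner']`.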


In [5]:
ds.push_to_hub(
    repo_id=f'{hub_username}/{corpus_name}',
    private=True,
)

CommitInfo(commit_url='https://huggingface.co/datasets/marcov/NER_ENGLISH_MOVIE_COMPLEX/commit/6aecc7d5b94162671eee05af203a707fbb5901da', commit_message='Upload dataset', commit_description='', oid='6aecc7d5b94162671eee05af203a707fbb5901da', pr_url=None, pr_revision=None, pr_num=None)