# MIT Movie Corpus

The MIT Movie Corpus is a semantically tagged training and test corpus in BIO format. The `eng` corpus contains simple queries, and the `trivia10k13` corpus contains more complex queries.

- https://groups.csail.mit.edu/sls/downloads/
- https://www.microsoft.com/en-us/research/publication/a-conversational-movie-search-system-based-on-conditional-random-fields/
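
As a quick illustration of the BIO format, the sketch below groups tagged tokens into entity phrases. The helper name `bio_to_spans` is hypothetical and not part of the corpus tooling. It assumes tags of the form `B-LABEL`, `I-LABEL`, and `O`, and the two example queries are taken from the training data shown later in this notebook.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, phrase) pairs."""
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        continues = tag.startswith('I-') and tag[2:] == label
        if continues:
            words.append(token)
            continue
        # Any other tag closes the span that was open, if there was one.
        if words:
            spans.append((label, ' '.join(words)))
        if tag == 'O':
            label, words = None, []
        else:
            # 'B-X' opens a span; a stray 'I-X' is treated the same way.
            label, words = tag[2:], [token]
    if words:
        spans.append((label, ' '.join(words)))
    return spans


tokens = ['what', 'movies', 'star', 'bruce', 'willis']
tags = ['O', 'O', 'O', 'B-ACTOR', 'I-ACTOR']
print(bio_to_spans(tokens, tags))  # [('ACTOR', 'bruce willis')]
```

Treating a stray `I-` tag as the start of a new span is a leniency choice made for this sketch. A stricter reader might raise an error instead.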

In [1]:
from pathlib import Path

data_dir = Path('../data')

hub_username = 'marcov'

corpus_name = 'NER_ENGLISH_MOVIE_SIMPLE'
corpus_url = 'https://groups.csail.mit.edu/sls/downloads/movie'
corpus_dir = data_dir / corpus_name
corpus_dir.mkdir(parents=True, exist_ok=True)

train_filename = 'engtrain.bio'
test_filename = 'engtest.bio'

In [2]:
!wget -nc {corpus_url}/{train_filename} -P {corpus_dir}
!wget -nc {corpus_url}/{test_filename} -P {corpus_dir}

--2024-04-28 11:58:49--  https://groups.csail.mit.edu/sls/downloads/movie/engtrain.bio
Resolving groups.csail.mit.edu (groups.csail.mit.edu)... 128.30.2.44
Connecting to groups.csail.mit.edu (groups.csail.mit.edu)|128.30.2.44|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1013492 (990K)
Saving to: ‘../data/NER_ENGLISH_MOVIE_SIMPLE/engtrain.bio’


2024-04-28 11:58:52 (529 KB/s) - ‘../data/NER_ENGLISH_MOVIE_SIMPLE/engtrain.bio’ saved [1013492/1013492]

--2024-04-28 11:58:52--  https://groups.csail.mit.edu/sls/downloads/movie/engtest.bio
Resolving groups.csail.mit.edu (groups.csail.mit.edu)... 128.30.2.44
Connecting to groups.csail.mit.edu (groups.csail.mit.edu)|128.30.2.44|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 252636 (247K)
Saving to: ‘../data/NER_ENGLISH_MOVIE_SIMPLE/engtest.bio’


2024-04-28 11:58:53 (431 KB/s) - ‘../data/NER_ENGLISH_MOVIE_SIMPLE/engtest.bio’ saved [252636/252636]



In [3]:
from ai_den.utils.datasets import read_conll_file
from datasets import DatasetDict

column_format = {'text': 1, 'ner': 0}

ds = DatasetDict({
    'train': read_conll_file(corpus_dir / train_filename, column_format),
    'test': read_conll_file(corpus_dir / test_filename, column_format),
})

ds

DatasetDict({
    train: Dataset({
        features: ['text', 'ner'],
        num_rows: 9775
    })
    test: Dataset({
        features: ['text', 'ner'],
        num_rows: 2443
    })
})
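
The source of `read_conll_file` is not shown here, so the following is only a rough sketch of what a reader with this kind of `column_format` argument might do. It is not the actual `ai_den` implementation. It assumes whitespace-separated columns, one token per line, and a blank line between sentences. The function name `read_conll_sketch` and the second sample sentence are made up for illustration.

```python
import tempfile
from pathlib import Path


def read_conll_sketch(path, column_format):
    """Read a column-based file into a list of {column_name: [values]} rows."""
    rows = []
    sentence = {name: [] for name in column_format}
    needed = max(column_format.values()) + 1

    def flush():
        # Emit the sentence collected so far, if it has any tokens.
        if any(sentence.values()):
            rows.append({name: list(values) for name, values in sentence.items()})
            for values in sentence.values():
                values.clear()

    with open(path, encoding='utf-8') as f:
        for line in f:
            fields = line.split()
            if not fields:
                flush()  # a blank line ends the current sentence
            elif len(fields) >= needed:
                for name, index in column_format.items():
                    sentence[name].append(fields[index])
            # Lines with too few columns are skipped in this sketch.
    flush()  # the last sentence may not be followed by a blank line
    return rows


# Tag in column 0 and token in column 1, matching column_format above.
sample = (
    'O\twhat\nO\tmovies\nO\tstar\nB-ACTOR\tbruce\nI-ACTOR\twillis\n'
    '\n'
    'O\tshow\nO\tme\nB-TITLE\tjaws\n'  # made-up second sentence
)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / 'sample.bio'
    sample_path.write_text(sample, encoding='utf-8')
    rows = read_conll_sketch(sample_path, {'text': 1, 'ner': 0})

print(rows[0])
```

The resulting rows could then be wrapped with `datasets.Dataset.from_list(rows)` to get an object like the splits above, assuming a `datasets` version that provides `from_list`.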

In [4]:
ds['train'].to_pandas()

,text,ner
0,"[what, movies, star, bruce, willis]","[O, O, O, B-ACTOR, I-ACTOR]"
1,"[show, me, films, with, drew, barrymore, from,...","[O, O, O, O, B-ACTOR, I-ACTOR, O, O, B-YEAR]"
2,"[what, movies, starred, both, al, pacino, and,...","[O, O, O, O, B-ACTOR, I-ACTOR, O, B-ACTOR, I-A..."
3,"[find, me, all, of, the, movies, that, starred...","[O, O, O, O, O, O, O, O, B-ACTOR, I-ACTOR, O, ..."
4,"[find, me, a, movie, with, a, quote, about, ba...","[O, O, O, O, O, O, O, O, O, O, O]"
...,...,...
9770,"[what, did, people, say, about, shadow, of, th...","[O, O, O, B-REVIEW, I-REVIEW, B-TITLE, I-TITLE..."
9771,"[show, me, the, reviews, about, road, kill]","[O, O, O, B-REVIEW, O, B-TITLE, I-TITLE]"
9772,"[what, do, people, think, of, the, movie, alic...","[O, O, O, B-REVIEW, I-REVIEW, O, O, B-TITLE, I..."
9773,"[show, me, the, movie, with, sherlock, holmes,...","[O, O, O, O, O, B-CHARACTER, I-CHARACTER, O, O..."
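
To get a feel for the label inventory, one option is to count how many entities of each type occur. The sketch below counts `B-` tags, on the assumption that each entity starts with exactly one `B-` tag. The helper name `count_entity_types` is hypothetical, and the two tag sequences are copied from rows 0 and 9771 above.

```python
from collections import Counter


def count_entity_types(tag_sequences):
    """Count entities per label by counting their opening B- tags."""
    counts = Counter()
    for tags in tag_sequences:
        counts.update(tag[2:] for tag in tags if tag.startswith('B-'))
    return counts


example_tags = [
    ['O', 'O', 'O', 'B-ACTOR', 'I-ACTOR'],
    ['O', 'O', 'O', 'B-REVIEW', 'O', 'B-TITLE', 'I-TITLE'],
]
counts = count_entity_types(example_tags)
print(counts)
```

Running `count_entity_types(ds['train']['ner'])` should give the same kind of summary for the full training split, assuming the `ner` column yields one list of tag strings per row.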


In [5]:
ds.push_to_hub(
    repo_id=f'{hub_username}/{corpus_name}',
    private=True,
)

CommitInfo(commit_url='https://huggingface.co/datasets/marcov/NER_ENGLISH_MOVIE_SIMPLE/commit/613c5c6f02804334e5cc98ce39269cc50c4ad160', commit_message='Upload dataset', commit_description='', oid='613c5c6f02804334e5cc98ce39269cc50c4ad160', pr_url=None, pr_revision=None, pr_num=None)