## `expertise.dataset` module tutorial

Clone the repository from https://github.com/iesl/openreview-expertise, then install it in editable mode:

`pip install -e <openreview-expertise directory location>`

In [1]:
from expertise import dataset

dataset_dir = '/Users/michaelspector/projects/openreview/openreview-datasets/datasets/akbc19_dblp/'

In [2]:
'''
A Dataset object is initialized with a valid dataset directory.
A dataset directory has the following structure:

/example_dataset
    /submissions
        <paper1_id.jsonl>
        <paper2_id.jsonl>
        ...
    /archives
        <reviewer1_id.jsonl>
        <reviewer2_id.jsonl>
        ...
    /bids
        <paper1_id.jsonl>
        <paper2_id.jsonl>
    /extras
        # conference-specific files can go here
    metadata.json

'''

# Note: this rebinds the name `dataset` from the imported module to the Dataset instance.
dataset = dataset.Dataset(dataset_dir)
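
The directory layout described above can be sketched with the standard library alone. This is a minimal illustration, not part of `expertise.dataset`: the helper `make_example_dataset`, the file names, and the record fields (`id`, `content`, `title`, `abstract`) are assumptions made for the example, and the empty `metadata.json` is only a placeholder, since the tutorial does not show the expected schema.

```python
import json
import tempfile
from pathlib import Path


def make_example_dataset(root):
    """Create the directory skeleton from the tutorial under `root` (hypothetical helper)."""
    root = Path(root)
    for subdir in ("submissions", "archives", "bids", "extras"):
        (root / subdir).mkdir(parents=True, exist_ok=True)

    # Assumed record shape: one JSON object per line (JSON Lines).
    # The field names below are illustrative, not the library's documented schema.
    paper = {"id": "paper1", "content": {"title": "An example title",
                                         "abstract": "An example abstract."}}
    with open(root / "submissions" / "paper1.jsonl", "w") as f:
        f.write(json.dumps(paper) + "\n")

    # A reviewer archive holds one line per previously written paper.
    with open(root / "archives" / "reviewer1.jsonl", "w") as f:
        for i in range(2):
            record = {"id": f"archive_paper{i}",
                      "content": {"title": f"Earlier paper {i}"}}
            f.write(json.dumps(record) + "\n")

    # Placeholder only: the real metadata contents are not shown in the tutorial.
    (root / "metadata.json").write_text(json.dumps({}))
    return root


example_root = make_example_dataset(Path(tempfile.mkdtemp()) / "example_dataset")
layout = sorted(p.relative_to(example_root).as_posix() for p in example_root.rglob("*"))
print("\n".join(layout))
```

Whether `dataset.Dataset` accepts a directory built this way depends on the record schema it expects, which should be checked against the repository.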

In [3]:
'''
The Dataset object reports summary statistics about itself, such as reviewer and submission counts:
'''

dataset.get_stats()

{'reviewer_count': 67,
 'submission_count': 54,
 'archive_counts': {'~Aishwarya_Kamath1': {'arx': 4, 'bid': 52},
  '~Alan_Ritter1': {'arx': 43, 'bid': 9},
  '~Alexander_Spangher1': {'arx': 0, 'bid': 35},
  '~Amrita_Saha1': {'arx': 17, 'bid': 10},
  '~Anca_Dumitrache1': {'arx': 15, 'bid': 10},
  '~Andreas_Vlachos1': {'arx': 67, 'bid': 54},
  '~Andrew_McCallum1': {'arx': 148, 'bid': 2},
  '~Anna_Lisa_Gentile1': {'arx': 29, 'bid': 7},
  '~Ashwin_Ittoo1': {'arx': 1, 'bid': 2},
  '~Bhuwan_Dhingra1': {'arx': 16, 'bid': 18},
  '~Bishan_Yang1': {'arx': 14, 'bid': 0},
  '~Caiming_Xiong1': {'arx': 115, 'bid': 18},
  '~Camilo_John_Thorne1': {'arx': 21, 'bid': 0},
  '~Chen_Liang1': {'arx': 63, 'bid': 54},
  '~Chris_Welty1': {'arx': 15, 'bid': 12},
  '~Danqi_Chen1': {'arx': 18, 'bid': 0},
  '~Dirk_Weissenborn1': {'arx': 30, 'bid': 53},
  '~Emma_Strubell1': {'arx': 27, 'bid': 6},
  '~Federica_Cena1': {'arx': 37, 'bid': 7},
  '~Gabriel_Stanovsky1': {'arx': 10, 'bid': 76},
  '~Gerhard_Weikum1': {'arx'
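
A dictionary shaped like the output above lends itself to quick sanity checks. The sketch below works on a hand-copied excerpt of that output rather than calling the library, and it assumes that `arx` counts a reviewer's archived papers and `bid` counts their bids; the tutorial does not define these keys explicitly.

```python
# Excerpt copied from the get_stats() output shown above.
stats = {
    'reviewer_count': 67,
    'submission_count': 54,
    'archive_counts': {
        '~Aishwarya_Kamath1': {'arx': 4, 'bid': 52},
        '~Alan_Ritter1': {'arx': 43, 'bid': 9},
        '~Alexander_Spangher1': {'arx': 0, 'bid': 35},
        '~Bishan_Yang1': {'arx': 14, 'bid': 0},
        '~Gabriel_Stanovsky1': {'arx': 10, 'bid': 76},
    },
}

counts = stats['archive_counts']

# Reviewers with no archived papers (assumed meaning of 'arx' == 0).
no_archive = [rid for rid, c in counts.items() if c['arx'] == 0]

# Reviewers who placed no bids (assumed meaning of 'bid' == 0).
no_bids = [rid for rid, c in counts.items() if c['bid'] == 0]

# Reviewers with more bids than there are submissions, which may be worth inspecting.
more_bids_than_submissions = [
    rid for rid, c in counts.items() if c['bid'] > stats['submission_count']
]

print(no_archive)
print(no_bids)
print(more_bids_than_submissions)
```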

In [4]:
'''
A Dataset object's `archives`, `submissions`, and `bids` methods return generators
that stream (id, records) tuples. With return_batches=False (the default), `records`
is a single dict; with return_batches=True, it is a list of dicts.
'''

archives_as_dicts = dataset.archives(
    fields=[
        'title',
        'abstract',
        'fulltext',
        'keywords',
        'subject_areas'
    ],
    return_batches=False  # default
)

print(next(archives_as_dicts))
print()

archives_as_lists = dataset.archives(
    fields=[
        'title',
        'abstract',
        'fulltext',
        'keywords',
        'subject_areas'
    ],
    return_batches=True
)

print(next(archives_as_lists))

('~Svitlana_Volkova1', {'title': 'Online Bayesian Models for Personal Analytics in Social Media', 'abstract': 'Latent author attribute prediction in social media provides a novel set of conditions for the construction of supervised classification models. With individual authors as training and test instances, their associated content ("features") are made available incrementally over time, as they converse over discussion forums. We propose various approaches to handling this dynamic data, from traditional batch training and testing, to incremental bootstrapping, and then active learning via crowdsourcing. Our underlying model relies on an intuitive application of Bayes rule, which should be easy to adopt by the community, thus allowing for a general shift towards online modeling for social media.'})

('~Svitlana_Volkova1', [{'title': 'Online Bayesian Models for Personal Analytics in Social Media', 'abstract': 'Latent author attribute prediction in social media provides a novel set of 
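
The relationship between the two output shapes above can be illustrated without the library. The sketch below uses a stand-in generator (`fake_record_stream`, with made-up ids and titles) and a hypothetical helper `batch_by_id` that groups a flat (id, dict) stream into (id, list of dicts) batches. It assumes records sharing an id arrive contiguously; it is not a description of how `expertise.dataset` implements `return_batches`.

```python
from itertools import groupby
from operator import itemgetter


def fake_record_stream():
    """Stand-in for a generator that yields one (id, record) tuple per record."""
    yield ("~Reviewer_A1", {"title": "Paper one"})
    yield ("~Reviewer_A1", {"title": "Paper two"})
    yield ("~Reviewer_B1", {"title": "Paper three"})


def batch_by_id(stream):
    """Group consecutive (id, record) tuples into (id, [records]) batches."""
    for record_id, group in groupby(stream, key=itemgetter(0)):
        yield record_id, [record for _, record in group]


batches = list(batch_by_id(fake_record_stream()))
for record_id, records in batches:
    print(record_id, len(records))
```

Because both forms are generators, records are produced lazily, so a consumer can stop early (as `next(...)` does above) without loading the whole dataset.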

In [5]:
'''
Convenience options include `progressbar` (a label for the progress bar) and
`partition_id` / `num_partitions`, which select one of `num_partitions` slices
of the records and can be useful for parallelization.
'''

submission_generator = dataset.submissions(
    fields=['title'],
    return_batches=True,
    progressbar="every other submission record",
    partition_id=0,
    num_partitions=2
)
every_other_submission = [submission_id for submission_id, submission in submission_generator]

archive_generator = dataset.archives(
    fields=['title'],
    return_batches=True,
    progressbar="every 3rd reviewer's archive",
    partition_id=0,
    num_partitions=3
)
every_third_reviewer = [reviewer_id for reviewer_id, archive in archive_generator]

bid_generator = dataset.bids(
    return_batches=True,
    progressbar="every 4th bid, keyed on submission id",
    partition_id=0,
    num_partitions=4
)
every_fourth_paper_bidcount = [(submission_id, len(bids)) for submission_id, bids in bid_generator]

every other submission record:  96%|█████████▋| 26/27.0 [00:00<00:00, 623.40it/s]
every 3rd reviewer's archive: 23it [00:00, 175.64it/s]                                        
every 4th bid, keyed on submission id: 14it [00:00, 55.00it/s]                          
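
One plausible reading of `partition_id` and `num_partitions`, suggested by labels such as "every 3rd reviewer's archive", is a strided split in which partition `i` takes items `i`, `i + n`, `i + 2n`, and so on. The sketch below implements that idea with a hypothetical helper `take_partition`; it is an assumption about the behavior, not the library's code. The counts it produces are at least consistent with the output above: 23 of 67 reviewers for three partitions, and 14 of 54 submissions for four.

```python
from itertools import islice


def take_partition(items, partition_id, num_partitions):
    """Yield every `num_partitions`-th item, starting at index `partition_id`."""
    if not 0 <= partition_id < num_partitions:
        raise ValueError("partition_id must be in [0, num_partitions)")
    return islice(items, partition_id, None, num_partitions)


# Placeholder ids; 67 matches the reviewer_count reported earlier.
reviewers = [f"reviewer_{i}" for i in range(67)]
parts = [list(take_partition(reviewers, i, 3)) for i in range(3)]
print([len(p) for p in parts])

# Placeholder ids; 54 matches the submission_count reported earlier.
submissions = [f"paper_{i}" for i in range(54)]
first_of_four = list(take_partition(submissions, 0, 4))
print(len(first_of_four))
```

Under this scheme the partitions are disjoint and together cover every item, so each worker in a parallel job can be handed its own `partition_id` and process its slice independently.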
