# OS Parser: API example

This notebook gives a quick demo of parsing an individual syllabus with the Open Syllabus parser API and then working with the JSON metadata returned by the model.

In [72]:
import boto3
import requests
import json

First, we'll add a helper that takes a document as a raw byte string, sends the API request to the model, and then parses the JSON that comes back into a Python dictionary.

In this example, we'll directly invoke a [SageMaker](https://aws.amazon.com/sagemaker/) endpoint that's serving the parser model. In production, depending on how access to the model is granted, this might instead involve sending an ordinary HTTP POST request to a URL like `https://parser.opensyllabus.org/api/v1`, using a library like `requests`. The output of the model will be exactly the same in either case.

In [73]:
def parse_syllabus(data: bytes):
    """Given the raw bytes for a document (HTML, PDF, DOCX, etc), invoke the OS
    parser SageMaker endpoint and parse the JSON response.
    """
    client = boto3.client('sagemaker-runtime', region_name='us-east-1')
    
    # Invoke the parser API.
    res = client.invoke_endpoint(EndpointName='os-parser-v1', Body=data)
    
    # Parse the JSON that comes back from the model.
    return json.loads(res['Body'].read())
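For the HTTP access path mentioned above, a minimal sketch might look like the following. It uses the standard library's `urllib.request` instead of `requests` so that it has no third-party dependencies. The function names, the `Content-Type` header, and the idea that the endpoint accepts the raw document bytes as the POST body are all assumptions made for illustration, not documented behavior of the parser API. The URL is supplied by the caller.

```python
import json
import urllib.request


def build_parse_request(data: bytes, url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the raw document bytes.

    Hypothetical helper: the header and body format are assumptions.
    """
    return urllib.request.Request(
        url,
        data=data,
        method='POST',
        headers={'Content-Type': 'application/octet-stream'},
    )


def parse_syllabus_http(data: bytes, url: str, timeout: float = 60.0) -> dict:
    """Send the document to an HTTP parser endpoint and decode the JSON reply."""
    req = build_parse_request(data, url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())
```

Splitting request construction from sending keeps the part that depends on the (assumed) wire format in one small, inspectable function.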

## Example: MIT 9.520

As an example syllabus, let's use the course webpage for [MIT 9.520, "Statistical Learning Theory and Applications"](https://www.mit.edu/~9.520/fall19/). For now, let's just download this directly from the web. (We could also load documents from the local filesystem, query them from a data warehouse, etc.)

In [37]:
res = requests.get('https://www.mit.edu/~9.520/fall19/', timeout=30)
res.raise_for_status()  # Stop here on an HTTP error instead of parsing an error page.
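For the local-filesystem alternative mentioned above, a small sketch: since `parse_syllabus` takes raw bytes, the file should be read in binary mode without decoding. The helper name `load_document` is hypothetical.

```python
from pathlib import Path


def load_document(path: str) -> bytes:
    """Read a syllabus file (HTML, PDF, DOCX, etc.) from disk as raw bytes."""
    return Path(path).read_bytes()
```

The returned bytes can be passed to the parser in the same way as the downloaded content.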

And then pass the raw HTML data through the parser:

In [38]:
data = parse_syllabus(res.content)

This will return a nested Python dictionary that contains all of the metadata extracted from the document in a single bundle. This is organized under a set of top-level keys:

In [39]:
list(data.keys())

['md5',
 'doc_type',
 'text',
 'syllabus_probability',
 'field',
 'language',
 'institution',
 'date',
 'urls',
 'extracted_sections',
 'citations']

### `syllabus_probability`

Some of these are very simple, just single values. E.g., `syllabus_probability` is just the probability that the document is a syllabus, as predicted by the top-level document classifier:

In [40]:
data['syllabus_probability']

0.9876453612201228
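One common use for this score is to filter out non-syllabus documents before further processing. The sketch below assumes a cutoff of 0.5, which is an arbitrary illustrative value and not a threshold recommended by the parser. The helper name is hypothetical.

```python
def is_probable_syllabus(doc: dict, threshold: float = 0.5) -> bool:
    """Return True if the classifier score meets the (illustrative) threshold.

    A missing score is treated as 0.0, i.e. "not a syllabus".
    """
    return doc.get('syllabus_probability', 0.0) >= threshold
```

In practice the threshold would be tuned against a labeled sample, trading precision against recall.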

### `institution`

Other keys contain more complex metadata. For example, `institution` is a nested object that contains metadata about the college or university where the course was taught. (This is inferred from a combination of signals in the document and the URL it was scraped from.)

Under the hood, we link documents against a database of ~22,000 institutions derived from IPEDS, ROR (previously GRID), and Wikidata. By following links to the external identifiers, it's possible to get access to a wide range of standardized metadata about the school. E.g., from Wikidata:

https://www.wikidata.org/wiki/Q49108

Or GRID:

https://grid.ac/institutes/grid.116068.8

In [16]:
data['institution']

{'_id': 18108,
 'ror_id': '042nb2s44',
 'grid_id': 'grid.116068.8',
 'wikidata_id': 'Q49108',
 'unitid': 166683,
 'city': 'Cambridge',
 'name': 'Massachusetts Institute of Technology',
 'lat': 42.35982,
 'lng': -71.09211,
 'url': 'http://web.mit.edu/',
 'country_code': 'US',
 'country': 'United States',
 'state_code': 'US-MA',
 'state': 'Massachusetts',
 'enrollment': 12321,
 'two_year': False,
 'four_year': True,
 'graduate': True,
 'research': True}
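To follow the external identifiers programmatically, the IDs can be turned into URLs. The Wikidata and GRID patterns below are taken from the links shown above. The `https://ror.org/<id>` pattern is an assumption based on ROR's public URL convention, and the helper name is hypothetical.

```python
def institution_links(inst: dict) -> dict:
    """Build external URLs for whichever identifiers are present."""
    links = {}
    if inst.get('wikidata_id'):
        links['wikidata'] = f"https://www.wikidata.org/wiki/{inst['wikidata_id']}"
    if inst.get('grid_id'):
        links['grid'] = f"https://grid.ac/institutes/{inst['grid_id']}"
    if inst.get('ror_id'):
        # Assumed URL pattern for ROR records.
        links['ror'] = f"https://ror.org/{inst['ror_id']}"
    return links
```

Identifiers that are missing or empty are skipped, so the function is safe to call on partially linked institutions.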

### `date`

Similarly, `date` contains the year and term (semester) in which the course was taught.

In [17]:
data['date']

{'term': 'fall', 'year': 2019}
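For display, the two values can be combined into a short label. This sketch assumes that either key might be missing or null for some documents, which is a guess and not something this example demonstrates. The helper name is hypothetical.

```python
def format_term(date: dict) -> str:
    """Render a date object like {'term': 'fall', 'year': 2019} as 'Fall 2019'."""
    term = date.get('term')
    year = date.get('year')
    parts = [term.capitalize() if term else None, str(year) if year else None]
    # Drop whichever parts are missing and join the rest with a space.
    return ' '.join(p for p in parts if p)
```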

### `field`

And, `field` contains the output from the field classifier.

(This taxonomy is a rolled-up version of the Department of Education [CIP codes](https://nces.ed.gov/ipeds/cipcode/browse.aspx?y=55). The `cip_codes` key in the Open Syllabus data contains the list of CIP codes that were combined to form the Open Syllabus field.)

In [41]:
data['field']

{'_id': 45, 'cip_codes': ['27'], 'name': 'Mathematics'}

### `extracted_sections`

In many ways, the core of the parser output is `extracted_sections`. This is the output from the top-level document segmentation model, which takes the raw document and splits it into a set of 21 standardized entity types:

In [43]:
list(data['extracted_sections'].keys())

['title',
 'code',
 'section',
 'date',
 'class_days',
 'class_time',
 'class_location',
 'instructor',
 'instructor_phone',
 'office_hours_days',
 'office_location',
 'office_hours_times',
 'credits',
 'description',
 'learning_outcomes',
 'citations',
 'required_reading',
 'grading_rubric',
 'assessment_strategy',
 'topic_outline',
 'assignment_schedule']

In [64]:
def print_json(data: dict):
    print(json.dumps(data, indent=2))

Under each of these keys is a list of spans of that type that were identified in the document. In many cases, for fields like `title` or `code`, there will be just a single occurrence of the entity in the document. E.g., for `title`, we get `Statistical Learning Theory and Applications`. In addition to the raw text span from the document in the `text` field, each span also includes:

- `mean_proba` - The average of the probabilities assigned by the model to the start and end tokens in the span. The closer to 1, the more confident the model was.
- `ci1` - The character offset of the first character in the span.
- `ci2` - The character offset of the last character in the span.
- `ti1` - The offset of the first token of the span in the tokenized document.
- `ti2` - The offset of the last token of the span in the tokenized document.

In [65]:
print_json(data['extracted_sections']['title'])

[
  {
    "text": "Statistical Learning Theory and Applications",
    "mean_proba": 0.96484375,
    "ci1": 18,
    "ci2": 61,
    "ti1": 9,
    "ti2": 13
  }
]


And, for `code`, we get `9.520/6.860`:

In [66]:
print_json(data['extracted_sections']['code'])

[
  {
    "text": "9.520/6.860",
    "mean_proba": 0.8173828125,
    "ci1": 5,
    "ci2": 15,
    "ti1": 0,
    "ti2": 7
  }
]
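The offsets in the two spans above suggest that `ci2` is inclusive: the title is 44 characters long and runs from 18 to 61, and the code is 11 characters long and runs from 5 to 15. Assuming the character offsets index into the top-level `text` string (an assumption, not something this notebook confirms), a span can be recovered with a slice that ends at `ci2 + 1`. Both helper names are hypothetical.

```python
def span_text(text: str, span: dict) -> str:
    """Slice a span out of the document text, treating ci2 as inclusive."""
    return text[span['ci1']:span['ci2'] + 1]


def span_is_consistent(text: str, span: dict) -> bool:
    """Check that the offsets reproduce the span's own `text` value."""
    return span_text(text, span) == span['text']
```

A call such as `span_is_consistent(data['text'], data['extracted_sections']['title'][0])` would be a quick way to test the assumption about which string the offsets refer to.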


Other sections can be much longer. E.g., the course description can sometimes run to multiple paragraphs, as it does in this case:

In [67]:
print_json(data['extracted_sections']['description'][0])

{
  "text": "The course covers foundations and recent advances of machine learning from the point of view of statistical learning and regularization theory.\n      \nUnderstanding intelligence and how to replicate it in machines is\narguably one of the greatest problems in science. Learning, its\nprinciples and computational implementations, is at the very core of\nintelligence. During the last decade, for the first time, we have been\nable to develop artificial intelligence systems that begin to solve\ncomplex tasks, until recently the exclusive domain of biological\norganisms, such as computer vision, speech recognition or natural\nlanguage understanding: cameras recognize faces, smart phones\nunderstand voice commands, smart speakers/assistants answer questions\nand cars can see and avoid obstacles. The machine learning algorithms\nthat are at the roots of these success stories are trained with\nexamples rather than programmed to solve a task.     \n\nThe content is roughly divided 

And, some section types will often include multiple matching spans. E.g., here we get 17 `citations` spans:

In [47]:
len(data['extracted_sections']['citations'])

17
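When a section contains many spans, the `mean_proba` score on each one can be used to keep only the more confident matches. The cutoff of 0.9 below is an arbitrary illustrative value, and the helper name is hypothetical.

```python
def confident_texts(spans: list, min_proba: float = 0.9) -> list:
    """Return the text of each span whose mean_proba is at least min_proba."""
    return [span['text'] for span in spans if span['mean_proba'] >= min_proba]
```

For example, `confident_texts(data['extracted_sections']['citations'])` would return only the citation strings the segmentation model was most sure about.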

In [68]:
print_json(data['extracted_sections']['citations'])

[
  {
    "text": "L. Rosasco and T. Poggio,  Machine Learning: a Regularization Approach, MIT-9.520 Lectures Notes , Manuscript, Dec. 2017",
    "mean_proba": 0.87548828125,
    "ci1": 6850,
    "ci2": 6969,
    "ti1": 1151,
    "ti2": 1181
  },
  {
    "text": "S. Shalev-Shwartz and S. Ben-David.  Understanding Machine Learning: From Theory to Algorithms.  Cambridge University Press, 2014.",
    "mean_proba": 0.9599609375,
    "ci1": 7031,
    "ci2": 7160,
    "ti1": 1188,
    "ti2": 1217
  },
  {
    "text": "T. Hastie, R. Tibshirani and J. Friedman.  The Elements of Statistical Learning . 2nd Ed., Springer, 2009.",
    "mean_proba": 0.9580078125,
    "ci1": 7169,
    "ci2": 7274,
    "ti1": 1218,
    "ti2": 1247
  },
  {
    "text": "I. Steinwart and A. Christmann.  Support Vector Machines.  Springer, 2008.",
    "mean_proba": 0.9619140625,
    "ci1": 7282,
    "ci2": 7355,
    "ti1": 1248,
    "ti2": 1265
  },
  {
    "text": "O. Bousquet, S. Boucheron and G. Lugosi.  Introduction

## Parsed citations

The raw spans extracted from the document can be sufficient for simple fields like `title` or `code` (or for free-text fields like `description`, as input to general text-analysis models). In some cases, though, the raw document sections contain additional substructure that can be usefully parsed into structured data.

We currently do this for the `citation` strings, which need to be further parsed to extract titles, authors, publishers, editors, and ISBNs, so that we can link them against authoritative records in bibliographic databases. For example, here's the raw text string for one of the extracted citations, as it appeared in the original document:

In [54]:
data['citations'][1]['doc_span']['text']

'S. Shalev-Shwartz and S. Ben-David.  Understanding Machine Learning: From Theory to Algorithms.  Cambridge University Press, 2014.'

And, here's the parsed metadata from the citation parser, which splits the citation into its component parts:

In [69]:
print_json(data['citations'][1]['parsed_citation'])

{
  "title": [
    {
      "text": "Understanding Machine Learning",
      "mean_proba": 0.998046875,
      "ci1": 37,
      "ci2": 66,
      "ti1": 15,
      "ti2": 17
    }
  ],
  "subtitle": [
    {
      "text": "From Theory to Algorithms",
      "mean_proba": 0.99267578125,
      "ci1": 69,
      "ci2": 93,
      "ti1": 19,
      "ti2": 22
    }
  ],
  "author": [
    {
      "text": "S. Shalev-Shwartz",
      "mean_proba": 0.998046875,
      "ci1": 0,
      "ci2": 16,
      "ti1": 0,
      "ti2": 7
    },
    {
      "text": "S. Ben-David",
      "mean_proba": 0.998046875,
      "ci1": 22,
      "ci2": 33,
      "ti1": 9,
      "ti2": 13
    }
  ],
  "editor": [],
  "publisher": [
    {
      "text": "Cambridge University Press",
      "mean_proba": 0.994140625,
      "ci1": 97,
      "ci2": 122,
      "ti1": 24,
      "ti2": 26
    }
  ],
  "isbn": []
}
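In the output above, the character offsets appear to be relative to the citation string instead of the full document (the first author starts at `ci1` 0). The sketch below flattens the parsed output into plain lists of strings and checks that reading of the offsets against an abridged copy of the example. Both helper names are hypothetical, and the sample keeps only the `text`, `ci1`, and `ci2` values shown above.

```python
def flatten_citation(parsed: dict) -> dict:
    """Map each field name to the list of span texts found for it."""
    return {field: [span['text'] for span in spans]
            for field, spans in parsed.items()}


def spans_match_source(raw: str, parsed: dict) -> bool:
    """Check that every span's offsets (ci2 inclusive) reproduce its text."""
    return all(
        raw[span['ci1']:span['ci2'] + 1] == span['text']
        for spans in parsed.values()
        for span in spans
    )


# Abridged copy of the example citation and its parsed spans.
raw = ('S. Shalev-Shwartz and S. Ben-David.  Understanding Machine Learning: '
       'From Theory to Algorithms.  Cambridge University Press, 2014.')
parsed = {
    'title': [{'text': 'Understanding Machine Learning', 'ci1': 37, 'ci2': 66}],
    'subtitle': [{'text': 'From Theory to Algorithms', 'ci1': 69, 'ci2': 93}],
    'author': [
        {'text': 'S. Shalev-Shwartz', 'ci1': 0, 'ci2': 16},
        {'text': 'S. Ben-David', 'ci1': 22, 'ci2': 33},
    ],
    'editor': [],
    'publisher': [{'text': 'Cambridge University Press', 'ci1': 97, 'ci2': 122}],
    'isbn': [],
}

fields = flatten_citation(parsed)
offsets_ok = spans_match_source(raw, parsed)
```

Fields with no matches, like `editor` and `isbn` here, simply come back as empty lists.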


And finally, here's the canonical bibliographic record that was linked to the citation, which contains standardized metadata (good for display in public-facing products), a richer set of third-party identifiers, etc.

In [70]:
print_json(data['citations'][1]['catalog_record'])

{
  "_id": 360789702129,
  "work_cluster_size": 2,
  "sources": {
    "crossref": [
      "10.1017/cbo9781107298019"
    ],
    "loc": [
      "2014001779"
    ]
  },
  "title": "Understanding Machine Learning",
  "subtitle": "From Foundations to Algorithms",
  "authors": [
    {
      "forenames": "Shai",
      "keyname": "Shalev-Shwartz"
    },
    {
      "forenames": "Shai",
      "keyname": "Ben-David"
    }
  ],
  "publisher": "Cambridge University Press",
  "year": 2009,
  "dois": [
    "10.1017/cbo9781107298019"
  ],
  "isbns": [
    "9781107057135",
    "1107057132",
    "9781107298019"
  ],
  "issns": null,
  "urls": [
    "http://dx.doi.org/10.1017/cbo9781107298019"
  ],
  "publication_type": "book",
  "open_access": null,
  "article": {
    "venue": null,
    "volume": null,
    "issue": null,
    "page_start": null,
    "page_end": null,
    "abstract": null
  }
}
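As an illustration of using the standardized metadata for display, the sketch below builds a one-line reference string from a catalog record. The output format and the helper name are arbitrary choices for this example, and the code assumes that any of the fields it reads may be missing or null.

```python
def format_record(record: dict) -> str:
    """Build a one-line display string from a catalog record."""
    authors = ', '.join(
        f"{a['forenames']} {a['keyname']}" for a in (record.get('authors') or [])
    )
    title = record.get('title') or ''
    if record.get('subtitle'):
        title = f"{title}: {record['subtitle']}"
    # Publisher and year share one segment, e.g. "Publisher, 2009".
    imprint = ', '.join(
        str(part) for part in (record.get('publisher'), record.get('year')) if part
    )
    # Each non-empty segment is terminated with a period.
    return ' '.join(f'{part}.' for part in (authors, title, imprint) if part)
```

Calling `format_record(data['citations'][1]['catalog_record'])` on the record above would give a string built from the linked catalog values, which can differ from the wording in the original document.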
