<a href="https://colab.research.google.com/github/lucasgneccoh/BDSS_Dauphine/blob/main/notebooks/students/BDSS_TD6_JSON.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bases de données semi-structurées - TD 6 - JSON, JSONSchema and JSONPath

Main teacher: **Dario COLAZZO**

Teaching Assistant: **Lucas GNECCO**

Special thanks to **Beatrice NAPOLITANO**

Université Paris Dauphine - PSL

# Introduction

Welcome!

In this notebook we will use the `json` and `jsonschema` Python libraries to load, save and validate JSON files and objects. We will also cover some JSONPath using the `jsonpath-ng` library.

Here is some useful documentation:

https://json-schema.org/learn/getting-started-step-by-step.html#starting

# Examples

## `json` library examples
Here we show how to read and write JSON files using the `json` library.

The example is taken from:

https://json-schema.org/learn/getting-started-step-by-step.html#properties-deeper

In [None]:
'''
    Create a JSON object and write it in a file
'''

import json

# Take the json example in Python format (a dictionary)
# In Python, the equivalent of a JSON object is a dict
# A JSON array is a list
json_object = {
                "productId": 1,
                "productName": "An ice sculpture",
                "price": 12.50,
                "tags": [ "cold", "ice" ],
                "dimensions": {
                    "length": 7.0,
                    "width": 12.0,
                    "height": 9.5
                    },
                "warehouseLocation": {
                    "latitude": -78.75,
                    "longitude": 20.4
                    }
                }


In [None]:
'''
    First let's print the object as a string in JSON format
    Then read a JSON object from a string
    Use functions json.loads and json.dumps

'''
s = json.dumps(json_object)
print("JSON object as a string:\n", type(s), s)

json_object = None
json_object = json.loads(s)
print("JSON object read from a string:\n", type(json_object), json_object)


In [None]:
'''
    Let's write this to a file called myjson.json
    Use function json.dump
'''

filename = "myjson.json"
with open(filename, "w") as f:
    # The with block closes the file automatically
    json.dump(json_object, f)

In [None]:
'''
    Now see the content of the file as it is
'''
!cat myjson.json

In [None]:
'''
    Let's read it into a Python object to see what it gives
    Use function json.load
'''
json_object = None
with open(filename, "r") as f:
    json_object = json.load(f)


print(f"The json_object variable is of type: {type(json_object)}")
print()
print(json_object)
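
As a small complement to the cells above, here is a sketch of two `json.dumps` options that make the output easier to read (`indent` and `sort_keys`), followed by a check that a dump/load round trip gives back an equal dictionary. The `product` dictionary is a shortened copy of the example object, used only for illustration.

```python
import json

# Shortened copy of the example object, used only for illustration
product = {
    "productId": 1,
    "productName": "An ice sculpture",
    "tags": ["cold", "ice"],
    "dimensions": {"length": 7.0, "width": 12.0, "height": 9.5},
}

# indent pretty-prints the string; sort_keys orders the keys alphabetically
pretty = json.dumps(product, indent=2, sort_keys=True)
print(pretty)

# A round trip through a string gives back an equal Python object
restored = json.loads(pretty)
print(type(restored), type(restored["tags"]), restored == product)

# A tuple is written as a JSON array, so it comes back as a list
print(json.loads(json.dumps({"pair": (1, 2)})))
```

Note that the round trip is only guaranteed to be lossless for types that JSON knows about (dict, list, str, numbers, booleans and `None`).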

## `jsonschema` library
This library validates JSON objects against a JSON Schema.

Again we follow the example given in

https://json-schema.org/learn/getting-started-step-by-step.html#properties-deeper


`jsonschema` documentation available at

https://python-jsonschema.readthedocs.io/en/stable/



In [None]:
%%capture
!pip install jsonschema
from jsonschema import Draft7Validator, RefResolver

In [None]:
schemaProducts =  {
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "example.com.product.schema.json",
  "title": "Product",
  "description": "A product from Acme's catalog",
  "type": "object",
  "properties": {
    "productId": {
      "description": "The unique identifier for a product",
      "type": "integer"
    },
    "productName": {
      "description": "Name of the product",
      "type": "string"
    },
    "price": {
      "description": "The price of the product",
      "type": "number",
      "exclusiveMinimum": 0
    },
    "tags": {
      "description": "Tags for the product",
      "type": "array",
      "items": {
        "type": "string"
      },
      "minItems": 1,
      "uniqueItems": True
    },
    "dimensions": {
      "type": "object",
      "properties": {
        "length": {
          "type": "number"
        },
        "width": {
          "type": "number"
        },
        "height": {
          "type": "number"
        }
      },
      "required": [ "length", "width", "height" ]
    },
    "warehouseLocation": {
      "description": "Coordinates of the warehouse where the product is located.",
      "$ref": "example.com.geographical-location.schema.json"
    }
  },
  "required": [ "productId", "productName", "price" ]
}

schemaGeo = {
  "$id": "example.com.geographical-location.schema.json",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Longitude and Latitude",
  "description": "A geographical coordinate on a planet (most commonly Earth).",
  "required": [ "latitude", "longitude" ],
  "type": "object",
  "properties": {
    "latitude": {
      "type": "number",
      "minimum": -90,
      "maximum": 90
    },
    "longitude": {
      "type": "number",
      "minimum": -180,
      "maximum": 180
    }
  }
}

# Save both schemas to disk; the with blocks close the files once written
with open("schemaProducts.json", "w") as f:
    json.dump(schemaProducts, f)
with open("schemaGeo.json", "w") as f:
    json.dump(schemaGeo, f)

In [None]:
'''
    Let's see if the example json object we worked with follows the schema or not
'''
schemaProducts, schemaGeo, json_object = None, None, None
with open("myjson.json") as f:
    json_object = json.load(f)
with open("schemaProducts.json") as f:
    schemaProducts = json.load(f)
with open("schemaGeo.json") as f:
    schemaGeo = json.load(f)


schema_store = {
    schemaProducts['$id'] : schemaProducts,
    schemaGeo['$id'] : schemaGeo,
}

# Create a resolver to work with local files and avoid fetching URLs
resolver = RefResolver.from_schema(schemaProducts, store=schema_store)

validator = Draft7Validator(schemaProducts, resolver=resolver)

try:
    validator.validate(json_object)
    print("File is valid!")
except Exception as e:
    print("File was not validated correctly. Here is the caught exception\n")
    print(e)

In [None]:
'''
    Let's see what happens when we do not follow the schema
'''

json_object['warehouseLocation']['latitude'] = 120
json_object['productId'] = 'Hello'

try:
    validator.validate(json_object)
    print("File is valid!")
except Exception as e:
    print("File was not validated correctly. Here is the caught exception\n")
    print(e)
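
`validator.validate` raises a single exception, so the cell above reports only one of the two problems introduced. A minimal sketch of listing every violation with `iter_errors` follows. It uses a reduced, self-contained schema (no `$ref`) so that it can run on its own; the names `small_schema`, `small_validator` and `bad_product` are made up for this illustration.

```python
from jsonschema import Draft7Validator

# Reduced schema for illustration: two of the constraints used above
small_schema = {
    "type": "object",
    "properties": {
        "productId": {"type": "integer"},
        "warehouseLocation": {
            "type": "object",
            "properties": {
                "latitude": {"type": "number", "minimum": -90, "maximum": 90}
            },
        },
    },
    "required": ["productId"],
}

# Same two mistakes as in the cell above
bad_product = {"productId": "Hello", "warehouseLocation": {"latitude": 120}}

small_validator = Draft7Validator(small_schema)

# iter_errors yields every ValidationError instead of raising one of them
errors = list(small_validator.iter_errors(bad_product))
for error in errors:
    # absolute_path is the location of the offending value inside the instance
    location = "/".join(str(part) for part in error.absolute_path)
    print(f"{location}: {error.message}")
```

The same loop should work with the `validator` built earlier, since `iter_errors` is a method of every validator class in `jsonschema`.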

# Exercise 1: Write a schema

To begin with, we are going to create a schema for a very simple situation we already know well... films!

Take the original DTD for the FILMS database and make it a JSON schema. 

**You can change some things if you want to make it more complex or realistic, as long as the Schema is well written and makes sense.**

***NOTE: Be careful, there might be some structural changes needed***

Here is the DTD

```
    <!ELEMENT FILMS (FILM+, ARTISTE+)>
    <!ELEMENT FILM (TITRE, GENRE, PAYS, MES, ROLES, RESUME?)>
    <!ELEMENT TITRE (#PCDATA)>
    <!ATTLIST FILM Annee CDATA #REQUIRED>
    <!ELEMENT GENRE (#PCDATA)>
    <!ELEMENT PAYS (#PCDATA)>
    <!ELEMENT MES (#PCDATA)>
    <!ATTLIST MES id_mes IDREF #IMPLIED>
    <!ELEMENT ROLES (ROLE*)>
    <!ELEMENT ROLE (PRENOM, NOM, INTITULE)>
    <!ELEMENT PRENOM (#PCDATA)>
    <!ELEMENT NOM (#PCDATA)>
    <!ELEMENT INTITULE (#PCDATA)>
    <!ELEMENT RESUME (#PCDATA)>
    <!ELEMENT ARTISTE (ACTNOM, ACTPNOM, ANNEENAISS)>
    <!ATTLIST ARTISTE id_art ID #REQUIRED>
    <!ELEMENT ACTNOM (#PCDATA)>
    <!ELEMENT ACTPNOM (#PCDATA)>
    <!ELEMENT ANNEENAISS (#PCDATA)>
```
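
One possible way to carry a few DTD constructs over to JSON Schema is sketched below. It is only an illustration under assumptions, not the expected solution: `FILM+` becomes an array with `minItems` set to 1, the optional `RESUME?` is simply left out of `required`, and the `Annee` attribute becomes an ordinary property. The property names (`films`, `annee`, `titre`, `resume`) and the choice of `integer` for the year are assumptions.

```python
# Illustrative fragment only: covers FILM+, TITRE, RESUME? and the Annee attribute
film_schema_sketch = {
    "type": "object",
    "properties": {
        "films": {
            "type": "array",
            "minItems": 1,  # FILM+ : at least one film
            "items": {
                "type": "object",
                "properties": {
                    "annee": {"type": "integer"},  # attribute Annee, #REQUIRED
                    "titre": {"type": "string"},   # <!ELEMENT TITRE (#PCDATA)>
                    "resume": {"type": "string"},  # RESUME? : optional
                },
                # resume is absent from this list because it is optional
                "required": ["annee", "titre"],
            },
        }
    },
    "required": ["films"],
}
```

The remaining declarations (roles, artists, the `id_mes` reference) are left for you to design.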

# Exercise 2: XML to JSON

Transform the original XML Films database into a JSON file (or files) following your schema.

Use your favorite tool: basic Python, SAX, DOM...
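
As a starting point, here is a minimal sketch using `xml.etree.ElementTree` from the standard library. It parses a tiny made-up and abbreviated XML fragment (not taken from `films.xml`) shaped like the DTD, and builds a dictionary from it. The output key names, the helper `film_to_dict` and the conversion of `Annee` to an integer are assumptions to adapt to your own schema.

```python
import json
import xml.etree.ElementTree as ET

# Made-up, abbreviated fragment shaped like the DTD, only for illustration
xml_text = """
<FILMS>
  <FILM Annee="2000">
    <TITRE>Example title</TITRE>
    <GENRE>Example genre</GENRE>
    <ROLES>
      <ROLE><PRENOM>Jane</PRENOM><NOM>Doe</NOM><INTITULE>Lead</INTITULE></ROLE>
    </ROLES>
  </FILM>
</FILMS>
""".strip()


def film_to_dict(film):
    """Convert one FILM element into a dictionary (hypothetical layout)."""
    result = {
        "annee": int(film.get("Annee")),  # XML attribute becomes a property
        "titre": film.findtext("TITRE"),
        "genre": film.findtext("GENRE"),
        "roles": [
            {child.tag.lower(): child.text for child in role}
            for role in film.findall("ROLES/ROLE")
        ],
    }
    # RESUME is optional in the DTD: add the key only when the element exists
    resume = film.findtext("RESUME")
    if resume is not None:
        result["resume"] = resume
    return result


root = ET.fromstring(xml_text)
films = {"films": [film_to_dict(f) for f in root.findall("FILM")]}
print(json.dumps(films, indent=2))
```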

In [None]:
# Download the file
!wget "https://raw.githubusercontent.com/lucasgneccoh/BDSS_Dauphine/main/data/films.xml"

# Exercise 3: Validate the data

Use `jsonschema` to validate the FILMS dataset you just created against the JSON schema you wrote in Exercise 1.

# Exercise 4: JSONPath to query the dataset

We are going to redo some of the queries we have worked on but using this new tool, **JSONPath**.
The implementation for Python we are going to use is called `jsonpath-ng`.


`jsonpath-ng` PyPI page: https://pypi.org/project/jsonpath-ng/


Some documentation about **JSONPath** comparing it to **XPath**: https://goessner.net/articles/JsonPath/


This documentation explains well how JSONPath works, but be careful: it describes a Java implementation, so some details may differ: https://github.com/json-path/JsonPath


This blog also has some documentation and examples using Python and the library `jsonpath-ng`:
https://blogboard.io/blog/knowledge/jsonpath-python/

In [None]:
!pip install jsonpath-ng
from jsonpath_ng.ext import parse

## Examples

In [None]:
'''
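    Get all the titles
    FILMS is assumed to be the Python object loaded from the JSON file built in Exercise 2;
    adapt the path below to the structure of your own schema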
'''
jsonpath_expr = parse('$.arrFilms[:].TITRE.title')

result = [match.value for match in jsonpath_expr.find(FILMS)]

print("Get all the titles")

print("\n" + "-"*30 + "\n")
print(result)
print("\n" + "-"*30 + "\n")


## Queries

Now try to do the original queries in TD1 on the FILMS dataset using **JSONPath**.

If you find it too hard to write a query in pure JSONPath, you can mix JSONPath and plain Python by post-processing the results of a JSONPath query. Remember, however, that such a solution is not directly portable to other JSONPath implementations, especially in other languages.

In [None]:
query_title = "1. Get all the titles"

jsonpath_expr = parse('Your JSONPath query goes here')

result = [match.value for match in jsonpath_expr.find(FILMS)]


print(query_title)
print("\n" + "-"*30 + "\n")
print(result)
print("\n" + "-"*30 + "\n")

In [None]:
query_title = "2. Get all the titles of films from 1980"

jsonpath_expr = parse('Your JSONPath query goes here')

result = [match.value for match in jsonpath_expr.find(FILMS)]

print(query_title)
print("\n" + "-"*30 + "\n")
print(result)
print("\n" + "-"*30 + "\n")

In [None]:
query_title = "3. Alien abstract"


jsonpath_expr = parse('Your JSONPath query goes here')

result = [match.value for match in jsonpath_expr.find(FILMS)]


print(query_title)
print("\n" + "-"*30 + "\n")
print(result)
print("\n" + "-"*30 + "\n")

In [None]:
query_title = "4. Films with Bruce Willis"


jsonpath_expr = parse('Your JSONPath query goes here')

result = [match.value for match in jsonpath_expr.find(FILMS)]


print(query_title)
print("\n" + "-"*30 + "\n")
print(result)
print("\n" + "-"*30 + "\n")

In [None]:
query_title = "5. Films with a RESUME"

jsonpath_expr = parse('Your JSONPath query goes here')

result = [match.value for match in jsonpath_expr.find(FILMS)]


print(query_title)
print("\n" + "-"*30 + "\n")
print(result)
print("\n" + "-"*30 + "\n")

In [None]:
query_title = "6. Films without a RESUME"

# This one might be a little harder than before, as negations are not implemented in this library (AFAIK)

jsonpath_expr = parse('Your JSONPath query goes here')

result = [match.value for match in jsonpath_expr.find(FILMS)]


print(query_title)
print("\n" + "-"*30 + "\n")
print(result)
print("\n" + "-"*30 + "\n")

In [None]:
query_title = "7. Films older than 30 years"
# Maybe you can use the datetime library

jsonpath_expr = parse('Your JSONPath query goes here')

result = [match.value for match in jsonpath_expr.find(FILMS)]


print(query_title)
print("\n" + "-"*30 + "\n")
print(result)
print("\n" + "-"*30 + "\n")

In [None]:
query_title = "8. Role of Harvey Keitel in Reservoir Dogs"


jsonpath_expr = parse('Your JSONPath query goes here')

result = [match.value for match in jsonpath_expr.find(FILMS)]


print(query_title)
print("\n" + "-"*30 + "\n")
print(result)
print("\n" + "-"*30 + "\n")

In [None]:
query_title = "9. Last film in database"


jsonpath_expr = parse('Your JSONPath query goes here')

result = [match.value for match in jsonpath_expr.find(FILMS)]


print(query_title)
print("\n" + "-"*30 + "\n")
print(result)
print("\n" + "-"*30 + "\n")

In [None]:
query_title = "10. Film preceding The Shining"

jsonpath_expr = parse('Your JSONPath query goes here')

result = [match.value for match in jsonpath_expr.find(FILMS)]


print(query_title)
print("\n" + "-"*30 + "\n")
print(result)
print("\n" + "-"*30 + "\n")

In [None]:
query_title = "11. Director of Vertigo"

jsonpath_expr = parse('Your JSONPath query goes here')

result = [match.value for match in jsonpath_expr.find(FILMS)]


print(query_title)
print("\n" + "-"*30 + "\n")
print(result)
print("\n" + "-"*30 + "\n")

In [None]:
query_title = "12. Titles with an S"


jsonpath_expr = parse('Your JSONPath query goes here')

result = [match.value for match in jsonpath_expr.find(FILMS)]


print(query_title)
print("\n" + "-"*30 + "\n")
print(result)
print("\n" + "-"*30 + "\n")

In [None]:
query_title = "13. Nodes with three descendants"


jsonpath_expr = parse('Your JSONPath query goes here')

result = [match.value for match in jsonpath_expr.find(FILMS)]


print(query_title)
print("\n" + "-"*30 + "\n")
print(result)
print("\n" + "-"*30 + "\n")

In [None]:
query_title = "14. Nodes whose name (tag) contains TU"

# This is again not easy to do (AFAIK). Try something, even if it is not perfect

