<a href="https://colab.research.google.com/github/lucasgneccoh/BDSS_Dauphine/blob/main/notebooks/students/BDSS_TD7_PostgreSQLJSON.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bases de données semi-structurées - TD 7 - PostgreSQL and JSON

Main teacher: **Dario COLAZZO**

Teaching Assistant: **Lucas GNECCO**

Special thanks to **Beatrice NAPOLITANO**

Université Paris Dauphine - PSL

# Introduction

Welcome!

In this notebook we will practice SQL on data stored in JSON format. To do so we will rely on PostgreSQL, which can store and query such data natively.

PostgreSQL has many built-in functions for handling JSON objects, and its syntax lets us work with JSON data in a familiar SQL-like way.

For more documentation, please visit the official site:

https://www.postgresql.org/docs/9.3/functions-json.html


***NOTE: This notebook was designed to be executed in Google Colab. Instructions below install PostgreSQL and use other tricks that were only tested in this environment***


## Working with JSON in PostgreSQL

In this notebook we will use very simple datasets where each row contains a JSON object. We are by no means limited to that: a table can mix "normal" columns and JSON columns.

When dealing with JSON objects in PostgreSQL we have to use a special notation.

Imagine that the column *info* contains a JSON object in each row. If one row is {"foo": 1, "bar": [5, 6]}, then we can access the different fields using the operators -> and ->>. The -> operator returns the field as a JSON value, while ->> returns it as text.

For example *info* -> 'foo' would give 1, and *info* -> 'bar' would give the JSON array [5, 6].
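
To make the difference between the two operators concrete, here is a minimal Python sketch that imitates them on a single JSON document. The helper names `arrow` and `arrow_text`, and the extra `baz` field, are made up for this illustration; they are not part of PostgreSQL or of this notebook.

```python
import json

# One hypothetical row of the "info" column, stored as JSON text
info = '{"foo": 1, "bar": [5, 6], "baz": "hello"}'

def arrow(doc, key):
    """Imitate `doc -> key`: return the field as JSON text (None plays the role of SQL NULL)."""
    obj = json.loads(doc)
    if key not in obj:
        return None
    return json.dumps(obj[key])

def arrow_text(doc, key):
    """Imitate `doc ->> key`: return the field as plain text (None plays the role of SQL NULL)."""
    value = json.loads(doc).get(key)
    if value is None:
        return None
    # Strings lose their JSON quotes; other values keep their JSON spelling
    return value if isinstance(value, str) else json.dumps(value)

print(arrow(info, "bar"))       # still a JSON array
print(arrow(info, "baz"))       # JSON string, quotes included
print(arrow_text(info, "baz"))  # plain text, no quotes
```

The practical consequence is that `->>` is usually what you want before a `CAST` or a comparison with a text literal, while `->` is what you want when you need to keep navigating inside the JSON value.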


For more details and examples, I suggest the following resources:

https://www.postgresqltutorial.com/postgresql-json/

https://www.postgresql.org/docs/current/functions-json.html

# Database setup

## Install PostgreSQL


In [None]:
# install
!apt install postgresql postgresql-contrib &>log
!service postgresql start
!sudo -u postgres psql -c "CREATE USER root WITH SUPERUSER"
# set connection
%load_ext sql
%config SqlMagic.feedback=False 
%config SqlMagic.autopandas=True
%sql postgresql+psycopg2://@/postgres

## Create tables and insert data
First we read the data from its original XML format and transform it into JSON-like Python dictionaries.
Then we insert it into the PostgreSQL tables.

In [None]:
from lxml import etree
import re
from xml.dom.minidom import parse
import copy


!wget "https://raw.githubusercontent.com/lucasgneccoh/BDSS_Dauphine/main/data/films.xml"
path = "films.xml"

dom = parse(path)
filmTextElems = ["TITRE", "GENRE", "PAYS", "RESUME"]
artistTextElems = ["ACTNOM", "ACTPNOM", "ANNEENAISS"]
roleTextElements = ["NOM", "PRENOM", "INTITULE"]

def getText(node):
    try:
        return node.childNodes[0].data
    except Exception as e:  
        print(f'Problems getText with node {node.tagName}')
        raise e

def getAttributes(node):
    res = {}
    if node.hasAttributes():
        for k, v in node.attributes.items():
            res[k] = v
    return res

def getTextElements(node, elements):
    res = {}
    for elem in elements:
        for t in node.getElementsByTagName(elem):
            if t.hasChildNodes():
                res[elem] = getText(t)
            
    return res

# Get films
films = []
for f in dom.getElementsByTagName("FILM"):
    film = getTextElements(f, filmTextElems)
    film.update(getAttributes(f))

    # Read MES
    for m in f.getElementsByTagName('MES'):
        film["MES"] = m.getAttribute('id_mes')
    
    # Read ROLES
    roles = []
    for r in f.getElementsByTagName('ROLE'):
        roles.append(getTextElements(r, roleTextElements))
    
    film.update({'ROLES':  copy.deepcopy(roles)})

    # Replace the plain TITRE string with a nested object;
    # "lang" and "note" are placeholder values
    film["TITRE"] = {
        "title": film["TITRE"],
        "lang": "@fr",
        "note": "Lorem ipsum"
    }

    films.append(film)


# Get artists
artists = []
for a in dom.getElementsByTagName("ARTISTE"):
    artist = getTextElements(a, artistTextElems)
    artist.update(getAttributes(a))
    artists.append(artist)



FILMS = {'arrArtistes':artists , 'arrFilms':films}
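
To see what the conversion above produces, here is a small self-contained sketch that applies the same idea to an inline XML snippet. The snippet, its attribute names (`annee`, `id`) and its values are invented for the illustration and are not taken from `films.xml`; the variable names are prefixed with `sample_` so they do not overwrite the notebook's own variables.

```python
import json
from xml.dom.minidom import parseString

# Hypothetical XML fragment shaped like one FILM element
sample_xml = """<FILMS>
  <FILM annee="2000" id="f1">
    <TITRE>Example title</TITRE>
    <GENRE>Drame</GENRE>
    <ROLE><PRENOM>Jane</PRENOM><NOM>Doe</NOM><INTITULE>Lead</INTITULE></ROLE>
  </FILM>
</FILMS>"""

def text_of(node, tag):
    """Return the text of the first <tag> descendant, or None if it is absent or empty."""
    elems = node.getElementsByTagName(tag)
    if elems and elems[0].hasChildNodes():
        return elems[0].childNodes[0].data
    return None

sample_dom = parseString(sample_xml)
film_node = sample_dom.getElementsByTagName("FILM")[0]

# XML attributes become top-level keys, child elements become fields
sample_film = dict(film_node.attributes.items())
sample_film["TITRE"] = text_of(film_node, "TITRE")
sample_film["GENRE"] = text_of(film_node, "GENRE")
sample_film["ROLES"] = [
    {tag: text_of(role, tag) for tag in ("NOM", "PRENOM", "INTITULE")}
    for role in film_node.getElementsByTagName("ROLE")
]

print(json.dumps(sample_film, indent=2))
```

Note that every value read from XML is a string, including years, which is why numeric comparisons later need an explicit cast.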

In [None]:
%%sql

DROP TABLE IF EXISTS artistsSQL;
DROP TABLE IF EXISTS filmsSQL; 

CREATE TABLE filmsSQL (
	id serial NOT NULL PRIMARY KEY,
	data json NOT NULL
);
CREATE TABLE artistsSQL (
	id serial NOT NULL PRIMARY KEY,
	data json NOT NULL
);

In [None]:
# Be careful with the ' character: inside an SQL string literal it must be doubled ('')
a = "retrouve l'un de ses"
b = re.sub("\'","''", a)
print(b)
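
The same escaping step can be wrapped in a small helper that builds the whole `INSERT` statement, which makes it easy to inspect the SQL before running it. This is only a sketch: the function name `to_insert_statement` is hypothetical, and it uses `str.replace` where the notebook uses `re.sub`; both double the single quotes.

```python
import json

def to_insert_statement(table, record):
    """Serialize `record` to JSON and embed it in an INSERT, doubling single quotes."""
    json_string = json.dumps(record).replace("'", "''")
    return f"INSERT INTO {table} (data) VALUES('{json_string}')"

# Sample record containing an apostrophe
sample = {"RESUME": "retrouve l'un de ses"}
stmt = to_insert_statement("filmsSQL", sample)
print(stmt)
```

Building SQL by string concatenation is acceptable here because we control the data; with untrusted input, parameterized queries are the safer choice.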

In [None]:
%%capture
import json
import re


for f in FILMS["arrArtistes"]:
    json_string = re.sub("\'","''", json.dumps(f))
    cmd = f'''INSERT INTO artistsSQL (data) VALUES('{json_string}')'''
    %sql $cmd;

for f in FILMS["arrFilms"]:
    json_string = re.sub("\'","''", json.dumps(f))
    cmd = f'''INSERT INTO filmsSQL (data) VALUES('{json_string}')'''
    %sql $cmd;

## Make simple SELECT statements to see if the data is right

Notice how we can access the fields of the JSON object stored in each row of our table.

In [None]:
%%sql
SELECT *
FROM artistsSQL
LIMIT 3
;

In [None]:
%%sql
SELECT data -> 'ACTNOM' as nom,
        data -> 'ACTPNOM' as prenom,
        data -> 'ANNEENAISS' as anneN
FROM artistsSQL
WHERE CAST(data ->> 'ANNEENAISS' as INTEGER) > 1950
LIMIT 5
;
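
The query above uses `->>` inside the `CAST` because the birth year was read from XML as a string: `->` would return it as a quoted JSON string, while `->>` returns plain text that can be converted to an integer. The following Python sketch mirrors that filter on a few invented rows (the names and years are placeholders, not data from `films.xml`).

```python
import json

# Hypothetical rows of the "data" column, as JSON text
sample_rows = [
    '{"ACTNOM": "Doe", "ACTPNOM": "Jane", "ANNEENAISS": "1962"}',
    '{"ACTNOM": "Roe", "ACTPNOM": "Richard", "ANNEENAISS": "1931"}',
    '{"ACTNOM": "Poe", "ACTPNOM": "Pat"}',  # birth year missing
]

selected = []
for raw in sample_rows:
    doc = json.loads(raw)
    year_text = doc.get("ANNEENAISS")  # plays the role of data ->> 'ANNEENAISS'
    # A missing year behaves like SQL NULL: the comparison is not true, so the row is dropped
    if year_text is not None and int(year_text) > 1950:
        selected.append((doc["ACTNOM"], doc["ACTPNOM"], year_text))

print(selected)
```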

# Exercises

Now that we have our data in PostgreSQL we can do a lot of things!

We can do almost everything we know in standard SQL if we are able to create the right tables from our JSON data.

On top of that, PostgreSQL has many functions for dealing with JSON objects that make this approach much easier and more powerful.
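
One common way to "create the right tables" is to expand a JSON array into one row per element, which is what the PostgreSQL function `json_array_elements` does. The Python sketch below shows the same idea on invented film documents, using the `TITRE` and `ROLES` layout built in the setup code; the titles and names are placeholders.

```python
import json

# Hypothetical film documents with a nested ROLES array
sample_film_rows = [
    '{"TITRE": {"title": "Film A"}, "ROLES": ['
    '{"NOM": "Doe", "PRENOM": "Jane", "INTITULE": "Lead"}, '
    '{"NOM": "Roe", "PRENOM": "Richard", "INTITULE": "Villain"}]}',
    '{"TITRE": {"title": "Film B"}, "ROLES": []}',
]

# One output row per (film, role) pair; a film with no roles produces no rows
flat_roles = [
    (doc["TITRE"]["title"], role["PRENOM"], role["NOM"], role["INTITULE"])
    for doc in map(json.loads, sample_film_rows)
    for role in doc["ROLES"]
]

for row in flat_roles:
    print(row)
```

Once the roles are flattened this way, ordinary SQL tools such as `WHERE`, `GROUP BY` and `JOIN` apply to them directly.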

Here are some examples and documentation that can be helpful:


https://www.postgresqltutorial.com/postgresql-json/

https://www.postgresql.org/docs/current/functions-json.html



## Exercise 1: Simple queries we already know

Write queries 2, 4 and 8. They should not be too hard.

Query 2: Films released in 1980

Query 4: Films with Bruce Willis in them

Query 8: Role of Harvey Keitel in Reservoir Dogs


If you want something a bit more challenging, try to make your queries more general. For example, in Query 4, what if I want to look for some other artist?

In [None]:
%%sql
-- Query 2: Films released in 1980

SELECT *
FROM filmsSQL
LIMIT 3
;


In [None]:
%%sql
-- Query 4: Films with Bruce Willis in them

SELECT *
FROM filmsSQL
LIMIT 3
;


In [None]:
%%sql
-- Query 8: Role of Harvey Keitel in Reservoir Dogs

SELECT *
FROM filmsSQL
LIMIT 3
;


## Exercise 2: More complex queries we may have discussed

Let's use the nice SQL syntax to JOIN the two tables we have (films and artists)
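
Before writing the SQL, it can help to picture what an inner join between the two tables computes. The sketch below does it in plain Python on invented records. The film key `MES` comes from the setup code; the artist key `id` is an assumption, so check the real field names with a `SELECT` on `artistsSQL` first.

```python
# Hypothetical artist and film documents (already parsed from JSON)
sample_artists = [
    {"id": "a1", "ACTNOM": "Doe", "ACTPNOM": "Jane"},
    {"id": "a2", "ACTNOM": "Roe", "ACTPNOM": "Richard"},
]
sample_films = [
    {"TITRE": {"title": "Film A"}, "MES": "a1"},
    {"TITRE": {"title": "Film B"}, "MES": "a9"},  # director not in the artists table
]

# Index artists by their key, then keep only films whose director is found
artist_by_id = {artist["id"]: artist for artist in sample_artists}
joined = [
    (film["TITRE"]["title"], artist_by_id[film["MES"]]["ACTPNOM"], artist_by_id[film["MES"]]["ACTNOM"])
    for film in sample_films
    if film.get("MES") in artist_by_id
]

print(joined)
```

As with an SQL `INNER JOIN`, a film whose director has no matching artist row does not appear in the result.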

### Exercise 2.1: For every movie, show its title, year and the information about the director

In [None]:
%%sql
-- Remember the basic JOIN syntax
--   SELECT table1.column1, table2.column2...
--   FROM table1
--   INNER JOIN table2
--   ON table1.common_field = table2.common_field

SELECT *
FROM filmsSQL
LIMIT 3
;


### Exercise 2.2: For each artist, count their appearances in films (as an actor, not as a director)

In [None]:
%%sql

SELECT *
FROM filmsSQL
LIMIT 3
;


### Exercise 2.3: For each artist, compute the average year of the films in which they have participated

In [None]:
%%sql

SELECT *
FROM filmsSQL
LIMIT 3
;


### Exercise 2.4: For each artist and each participation in a film, compute the age the artist was when they participated in the film

Filter out missing (NULL) values!

In [None]:
%%sql

SELECT *
FROM filmsSQL
LIMIT 3
;
