# Welcome to the first part of the workshop!

- I am: Lukas Erhard
    - Email: lukas.erhard@sowi.uni-stuttgart.de

## Contents

### First session

1. What is SQLAlchemy? pandas-oriented work with databases
2. What does ORM mean, and how can it be used?
3. Creating a database with SQLAlchemy

### Second session

1. Introduction to our dataset and the disambiguation problems it poses - Lukas Erhard (30 min)
2. Description of the author disambiguation - Lukas Erhard (30 min)
3. Description of the institution disambiguation - Christian Koss (30 min)

# What is SQLAlchemy, and why use it?

SQLAlchemy is a Python package for querying relational databases that lets you

1. stay independent of the SQL 'dialect'. SQLAlchemy currently supports, among others, the following database backends:
    - SQLite
    - PostgreSQL
    - MySQL/MariaDB
    - Oracle
    - MS-SQL

2. manage queries and their results as Python objects
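
To make the dialect independence concrete, here is a minimal sketch that compiles one Python expression for three backends without connecting to any database. The lightweight `authors` table and the `fullname` label are assumptions made up for this illustration.

```python
from sqlalchemy import String, column, select, table
from sqlalchemy.dialects import mysql, postgresql, sqlite

# Hypothetical lightweight table description (no database connection needed)
authors = table(
    "authors",
    column("lastname", String),
    column("firstname", String),
)

# One Python expression: "lastname, firstname" as a single string column
fullname = (authors.c.lastname + ", " + authors.c.firstname).label("fullname")
stmt = select(fullname).limit(10)

# Compile the same statement for several backends
dialects = {
    "sqlite": sqlite.dialect(),
    "postgresql": postgresql.dialect(),
    "mysql": mysql.dialect(),
}
compiled = {name: str(stmt.compile(dialect=d)) for name, d in dialects.items()}

for name, sql in compiled.items():
    print(f"-- {name}")
    print(sql)
    print()
```

The printed statements differ in details such as the string concatenation operator and the parameter placeholders, while the Python code stays the same.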

## Key concepts

### ORM

- SQLAlchemy is an **O**bject **R**elational **M**apper.
- The goal of every ORM is to provide an abstraction over the database layer, so that you write Python code instead of SQL:
    - write (nicer / more dynamic) queries in Python
    - have string escaping handled for you
    - work with Python objects instead of raw SQL data
- Other ORMs in Python include [Django ORM](https://docs.djangoproject.com/en/3.2/topics/db/models/), [Peewee](http://docs.peewee-orm.com/en/latest/), [Pony](https://ponyorm.org/), and [Tortoise](https://tortoise-orm.readthedocs.io/en/latest/index.html).

SQLAlchemy pursues this goal through two different concepts, which makes the package look rather confusing at first glance (but allows a great deal of freedom when working with databases).

- SQLAlchemy therefore has two different APIs:

    1. [SQLAlchemy Core](https://docs.sqlalchemy.org/en/14/core/)
        - less abstraction
        - less preparatory work (no need to define 'models')
        - code is less 'pretty'
        - fewer features
    2. [SQLAlchemy ORM](https://docs.sqlalchemy.org/en/14/orm/)
        - higher abstraction from the database layer
        - more preparatory work required
        - a lot of freedom in designing Python objects
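
The difference between the two APIs can be sketched on a throwaway in-memory SQLite database. The table layout, the `Author` class and the sample rows are assumptions for illustration only; the sketch uses the `select()` style available from SQLAlchemy 1.4 on.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy import create_engine, insert, select
from sqlalchemy.orm import Session, declarative_base

engine = create_engine("sqlite:///:memory:")

# --- Core: describe the table and work with plain rows ---
metadata = MetaData()
authors_table = Table(
    "authors",
    metadata,
    Column("pk_authors", Integer, primary_key=True),
    Column("lastname", String),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(
        insert(authors_table),
        [{"lastname": "Luhmann"}, {"lastname": "Erhard"}],
    )
    core_rows = conn.execute(
        select(authors_table.c.lastname).order_by(authors_table.c.pk_authors)
    ).all()

# --- ORM: map a class (a 'model') onto the same table, work with objects ---
Base = declarative_base(metadata=metadata)


class Author(Base):
    __table__ = authors_table


with Session(engine) as session:
    orm_authors = (
        session.execute(select(Author).order_by(Author.pk_authors)).scalars().all()
    )
    orm_names = [author.lastname for author in orm_authors]

print([tuple(row) for row in core_rows])  # rows behave like tuples
print(orm_names)                          # attributes of Author objects
```

Core needs no model class, while the ORM variant requires the extra `Author` definition but returns objects that can carry their own methods and relationships.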

### Engine

- In SQLAlchemy jargon, the engine is the connection object to the database.

In [None]:
# DO NOT RUN (will fail)!

# Example: 'create_engine'

from sqlalchemy import create_engine

try:
    # SQLite DB engine
    create_engine("sqlite:///mydatabase.db")

    # OracleDB engine
    create_engine(
        "oracle+cx_oracle://username:password@host:port/?service_name=myservice"
    )

    # PostgreSQL engine
    create_engine("postgresql+psycopg2://username:password@host/database")
except Exception:
    # the URLs above are placeholders, so failures are expected and ignored
    pass
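
A runnable counterpart to the placeholder URLs above, as a minimal sketch that assumes an in-memory SQLite database. `create_engine` itself does not open a connection; that only happens when the engine is first used.

```python
from sqlalchemy import create_engine, text

# Creating the engine is lazy: no connection has been opened yet
engine = create_engine("sqlite:///:memory:")
print(engine.dialect.name)

# The first real connection is established here
with engine.connect() as conn:
    value = conn.execute(text("SELECT 1")).scalar()

print(value)
```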

### Session

- A session corresponds to a 'realized' connection to the database.

In [None]:
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

engine = create_engine("sqlite:///:memory:")

In [None]:
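# Option 1
from sqlalchemy import text

session = Session(bind=engine)
# wrap raw SQL in text(); plain strings are no longer accepted in SQLAlchemy 2.0
result = session.execute(text("SELECT 1"))
print(result.fetchall())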

[(1,)]


In [None]:
# Option 2
from sqlalchemy import text

with Session(engine) as session:
    result = session.execute(text("SELECT 2"))
    print(result.fetchall())

[(2,)]


# Example: a query from pandas

In [None]:
# Create a connection to the KB DB
from src.connect import create_wos_engine

engine = create_wos_engine()

In [None]:
import pandas as pd

query = """
SELECT pk_items
       , fk_sources
       , ut_eid
       , article_title
       , doctype
       , d_author_cnt
FROM WOS_B_2020.items
FETCH FIRST 10 ROWS ONLY
"""

pd.read_sql(query, engine)

Unnamed: 0,pk_items,fk_sources,ut_eid,article_title,doctype,d_author_cnt
0,12868002,139222,000075342700002,Influence of residential fungal contamination on peripheral blood lymphocyte populations in children,Article,5
1,12868122,145724,A1990DK73800001,TRANSESOPHAGEAL ECHOCARDIOGRAPHY,Review,7
2,12868325,103331,A1995QE40800018,"EXPRESSION OF LACZ FROM THE HTRA, NIRB AND GROE PROMOTERS IN A SALMONELLA VACCINE STRAIN - INFLUENCE OF GROWTH IN MAMMALIAN-...",Article,6
3,12868421,102737,A1993KM96600011,MAO-A AND MAO-B INHIBITORS SELECTIVELY ALTER XENOPUS MUCUS-INDUCED BEHAVIORS OF SNAKES,Article,3
4,12868536,89029,A1991FL10600004,RECURRENT AND DENOVO RENAL-DISEASE AFTER KIDNEY-TRANSPLANTATION WITH OR WITHOUT CYCLOSPORINE-A,Article,4
5,12868673,81306,000406038400005,The ADRON-RM Instrument Onboard the ExoMars Rover,Article,18
6,12868831,89029,000233933400001,Use of EPO in critically ill patients with acute renal failure requiring renal replacement therapy,Article,3
7,12868962,41904,A1987G182200046,REPETITIVE REGION OF CALPASTATIN IS A FUNCTIONAL UNIT OF THE PROTEINASE-INHIBITOR,Article,6
8,12869037,70462,A1991FF88400007,"EFFECTS OF NA0344, A NEW SMOOTH-MUSCLE RELAXANT, ON THE ACTIN MYOSIN ATP INTERACTION AND MYOSIN LIGHT CHAIN PHOSPHORYLATION ...",Article,6
9,12869166,40174,000447150200001,Antidepressant-Like Effects of Low- and High-Molecular Weight FGF-2 on Chronic Unpredictable Mild Stress Mice,Article,6


# The same query with SQLAlchemy Core

In [None]:
from sqlalchemy import MetaData

# the MetaData object represents all tables of a database user (schema)
# note: `bind=` is deprecated in SQLAlchemy 1.4 and no longer available in 2.0
meta = MetaData(bind=engine, schema="wos_b_2020")

In [None]:
from sqlalchemy import Table

# autoload_with: detect all columns / data types and map them automatically
table_items = Table("items", meta, autoload_with=engine)

print(table_items.c)

ImmutableColumnCollection(items.pk_items, items.fk_issues, items.fk_sources, items.ut_eid, items.t9_sgr, items.doi, items.pii, items.article_title, items.article_title_en, items.firstpage, items.lastpage, items.page_cnt, items.pubyear, items.pubtype, items.doctype, items.d_author_cnt, items.d_ref_cnt, items.d_source_ref_cnt, items.d_country_cnt, items.d_inst_full_cnt, items.etal, items.d_orga1_cnt, items.prepublication_item, items.rp_problem)
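
Reflection can be tried without access to the KB DB. The following sketch assumes an in-memory SQLite database with a small, made-up `items` table that is created with raw SQL and then reflected.

```python
from sqlalchemy import MetaData, Table, create_engine, text

engine = create_engine("sqlite:///:memory:")

# Create a small table with plain SQL (hypothetical stand-in for the real one)
with engine.begin() as conn:
    conn.execute(
        text(
            "CREATE TABLE items ("
            "pk_items INTEGER PRIMARY KEY, "
            "article_title VARCHAR(200), "
            "pubyear INTEGER)"
        )
    )

# Reflect the table: columns and types are read from the database itself
meta = MetaData()
items = Table("items", meta, autoload_with=engine)

column_names = [col.name for col in items.c]
for col in items.c:
    print(col.name, col.type)
```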


- For comparison, here is the query from before once more:

```sql
SELECT pk_items
       , fk_sources
       , ut_eid
       , article_title
       , doctype
       , d_author_cnt
FROM WOS_B_2020.items
FETCH FIRST 10 ROWS ONLY
```

In [None]:
from sqlalchemy.orm import Query

query = (
    Query(table_items)
    .with_entities(
        table_items.c.pk_items,
        table_items.c.fk_sources,
        table_items.c.ut_eid,
        table_items.c.article_title,
        table_items.c.doctype,
        table_items.c.d_author_cnt,
    )
    .limit(10)
)

pd.read_sql(query.statement, engine)

Unnamed: 0,pk_items,fk_sources,ut_eid,article_title,doctype,d_author_cnt
0,12868002.0,139222.0,000075342700002,Influence of residential fungal contamination on peripheral blood lymphocyte populations in children,Article,5.0
1,12868122.0,145724.0,A1990DK73800001,TRANSESOPHAGEAL ECHOCARDIOGRAPHY,Review,7.0
2,12868325.0,103331.0,A1995QE40800018,"EXPRESSION OF LACZ FROM THE HTRA, NIRB AND GROE PROMOTERS IN A SALMONELLA VACCINE STRAIN - INFLUENCE OF GROWTH IN MAMMALIAN-...",Article,6.0
3,12868421.0,102737.0,A1993KM96600011,MAO-A AND MAO-B INHIBITORS SELECTIVELY ALTER XENOPUS MUCUS-INDUCED BEHAVIORS OF SNAKES,Article,3.0
4,12868536.0,89029.0,A1991FL10600004,RECURRENT AND DENOVO RENAL-DISEASE AFTER KIDNEY-TRANSPLANTATION WITH OR WITHOUT CYCLOSPORINE-A,Article,4.0
5,12868673.0,81306.0,000406038400005,The ADRON-RM Instrument Onboard the ExoMars Rover,Article,18.0
6,12868831.0,89029.0,000233933400001,Use of EPO in critically ill patients with acute renal failure requiring renal replacement therapy,Article,3.0
7,12868962.0,41904.0,A1987G182200046,REPETITIVE REGION OF CALPASTATIN IS A FUNCTIONAL UNIT OF THE PROTEINASE-INHIBITOR,Article,6.0
8,12869037.0,70462.0,A1991FF88400007,"EFFECTS OF NA0344, A NEW SMOOTH-MUSCLE RELAXANT, ON THE ACTIN MYOSIN ATP INTERACTION AND MYOSIN LIGHT CHAIN PHOSPHORYLATION ...",Article,6.0
9,12869166.0,40174.0,000447150200001,Antidepressant-Like Effects of Low- and High-Molecular Weight FGF-2 on Chronic Unpredictable Mild Stress Mice,Article,6.0
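
To see which SQL such a query object produces, the statement can be printed without a database connection. This is a sketch under assumptions: the table is declared by hand with only three of its columns instead of being reflected, and the `select()` variant shown for comparison uses the SQLAlchemy 1.4+ calling style.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, select
from sqlalchemy.orm import Query

# Hand-written stand-in for the reflected table (subset of columns)
meta = MetaData(schema="wos_b_2020")
table_items = Table(
    "items",
    meta,
    Column("pk_items", Integer, primary_key=True),
    Column("article_title", String),
    Column("doctype", String),
)

# Query-based construction, as in the cell above
query = (
    Query(table_items)
    .with_entities(table_items.c.pk_items, table_items.c.article_title)
    .limit(10)
)
sql_from_query = str(query.statement)

# Equivalent construction with select()
stmt = select(table_items.c.pk_items, table_items.c.article_title).limit(10)
sql_from_select = str(stmt)

print(sql_from_query)
print(sql_from_select)
```

Printing with `str()` uses a generic dialect; the SQL actually sent to the database is rendered by the dialect of the engine in use.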


## A query with SQLAlchemy Core

```sql
SELECT *
FROM WOS_B_2020.authors
WHERE firstname = 'Niklas'
    AND lastname = 'Luhmann'
```

In [None]:
table_authors = Table("authors", meta, autoload_with=engine)

query = Query(table_authors).filter_by(firstname="Niklas", lastname="Luhmann")

pd.read_sql(query.statement, engine)

Unnamed: 0,pk_authors,author_id,fullname,lastname,firstname,middlename,author_group,role,orcid_id,orcid_id_tr,r_id,r_id_tr
0,8495791.0,,"Luhmann, Niklas",Luhmann,Niklas,,,researcher_id,0000-0003-1108-058X,,,
1,31991735.0,,"Luhmann, N",Luhmann,Niklas,,,author,,,,
2,27758888.0,,"Luhmann, Niklas",Luhmann,Niklas,,,researcher_id,0000-0002-3912-0769,,,


In [None]:
# queries can be extended

extended_query = query.filter_by(role="author")

pd.read_sql(extended_query.statement, engine)

Unnamed: 0,pk_authors,author_id,fullname,lastname,firstname,middlename,author_group,role,orcid_id,orcid_id_tr,r_id,r_id_tr
0,31991735.0,,"Luhmann, N",Luhmann,Niklas,,,author,,,,
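
Extending a query returns a new object and leaves the original untouched. The sketch below shows this on a hand-declared `authors` table (an assumption; only four columns are included) and spells the conditions out with `filter()` and explicit column comparisons.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.orm import Query

meta = MetaData()
table_authors = Table(
    "authors",
    meta,
    Column("pk_authors", Integer, primary_key=True),
    Column("firstname", String),
    Column("lastname", String),
    Column("role", String),
)

base_query = Query(table_authors).filter(
    table_authors.c.firstname == "Niklas",
    table_authors.c.lastname == "Luhmann",
)

# Adding a condition creates a new query; base_query keeps its two conditions
extended_query = base_query.filter(table_authors.c.role == "author")

base_sql = str(base_query.statement)
extended_sql = str(extended_query.statement)

print(base_sql)
print()
print(extended_sql)
```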


# What is a join?

- A join combines several tables on the basis of shared identifiers (keys).
- Analogous operations are often called `merge` (as in pandas or Stata).
- Documentation on the different kinds of joins can be found [here](https://www.w3schools.com/sql/sql_join.asp), and an illustrative tutorial [here](https://www.sqlservertutorial.net/sql-server-basics/sql-server-joins/).

In [None]:
df_person = pd.DataFrame(
    [
        {"person_id": 1, "name": "John", "gender": "male"},
        {"person_id": 2, "name": "Jane", "gender": "female"},
    ]
)
df_location = pd.DataFrame(
    [
        {"person_id": 1, "country": "USA", "last_online": 2017},
        {"person_id": 1, "country": "CAN", "last_online": 2020},
        {"person_id": 2, "country": "UK", "last_online": 2019},
    ]
)

In [None]:
df_person

Unnamed: 0,person_id,name,gender
0,1,John,male
1,2,Jane,female


In [None]:
df_location

Unnamed: 0,person_id,country,last_online
0,1,USA,2017
1,1,CAN,2020
2,2,UK,2019


In [None]:
pd.merge(df_person, df_location, how="inner", on="person_id")

Unnamed: 0,person_id,name,gender,country,last_online
0,1,John,male,USA,2017
1,1,John,male,CAN,2020
2,2,Jane,female,UK,2019
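
The join type only matters once some keys have no match. As a small sketch, the example is extended by a hypothetical third person ("Alex") without a location entry, and an inner join is compared with a left join.

```python
import pandas as pd

df_person = pd.DataFrame(
    [
        {"person_id": 1, "name": "John"},
        {"person_id": 2, "name": "Jane"},
        {"person_id": 3, "name": "Alex"},  # hypothetical: no location entry
    ]
)
df_location = pd.DataFrame(
    [
        {"person_id": 1, "country": "USA"},
        {"person_id": 1, "country": "CAN"},
        {"person_id": 2, "country": "UK"},
    ]
)

# inner join: only persons with at least one matching location
inner = pd.merge(df_person, df_location, how="inner", on="person_id")

# left join: every person is kept; missing locations become NaN
left = pd.merge(
    df_person, df_location, how="left", on="person_id", indicator=True
)

print(inner)
print(left)
```

The `_merge` column added by `indicator=True` marks for each row whether it was found in both tables or only in the left one.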


### Relationships in relational databases

- 1:1 relationship
- 1:n relationship
- n:m relationship

<img src="resources/one_to_one_relation.jpg" alt="Drawing" style="width:300px"/>

<img src="resources/one_to_many_relation.jpg" alt="Drawing" style="width:300px"/>

<img src="resources/many_to_many_relation.jpg" alt="Drawing" style="width:300px"/>
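
As a sketch of how such relationships appear in a schema, the following code declares a 1:n and an n:m relationship with foreign keys on an in-memory SQLite database. All table and column names are simplified assumptions, loosely modeled on the naming used in this notebook.

```python
from sqlalchemy import Column, ForeignKey, Integer, MetaData, String, Table
from sqlalchemy import create_engine, inspect

engine = create_engine("sqlite:///:memory:")
meta = MetaData()

sources = Table(
    "sources",
    meta,
    Column("pk_sources", Integer, primary_key=True),
    Column("title", String),
)

# 1:n: every item points to one source, a source can have many items
items = Table(
    "items",
    meta,
    Column("pk_items", Integer, primary_key=True),
    Column("fk_sources", ForeignKey("sources.pk_sources")),
    Column("article_title", String),
)

authors = Table(
    "authors",
    meta,
    Column("pk_authors", Integer, primary_key=True),
    Column("lastname", String),
)

# n:m: an association table links many items to many authors
items_authors = Table(
    "items_authors",
    meta,
    Column("fk_items", ForeignKey("items.pk_items"), primary_key=True),
    Column("fk_authors", ForeignKey("authors.pk_authors"), primary_key=True),
)

meta.create_all(engine)

# Read the foreign keys back from the database
inspector = inspect(engine)
foreign_keys = {
    name: sorted(fk["referred_table"] for fk in inspector.get_foreign_keys(name))
    for name in ("items", "items_authors")
}
print(foreign_keys)
```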

## A join with SQLAlchemy

```sql
SELECT DISTINCT pk_items
    , pubyear
    , doi
    , doctype 
    , article_title
FROM wos_b_2020.authors 
JOIN wos_b_2020.items_authors_institutions 
    ON wos_b_2020.authors.pk_authors = wos_b_2020.items_authors_institutions.fk_authors 
JOIN wos_b_2020.items 
    ON wos_b_2020.items_authors_institutions.fk_items = wos_b_2020.items.pk_items 
WHERE wos_b_2020.authors.firstname = 'Niklas' 
    AND wos_b_2020.authors.lastname = 'Luhmann'
    AND wos_b_2020.authors.role = 'author'
ORDER BY wos_b_2020.items.pubyear ASC
```

In [None]:
table_itauinst = Table("items_authors_institutions", meta, autoload_with=engine)

items = (
    extended_query.join(
        table_itauinst, table_authors.c.pk_authors == table_itauinst.c.fk_authors
    )
    .join(table_items, table_itauinst.c.fk_items == table_items.c.pk_items)
    .with_entities(
        table_items.c.pk_items,
        table_items.c.pubyear,
        table_items.c.doi,
        table_items.c.doctype,
        table_items.c.article_title,
    )
    .distinct()
    .order_by(table_items.c.pubyear.asc())
)

In [None]:
pd.read_sql(items.statement, engine)

Unnamed: 0,pk_items,pubyear,doi,doctype,article_title
0,186493000.0,2013,10.1016/j.drugpo.2012.08.005,Article,An urgent need to scale-up injecting drug harm reduction services in Tanzania: Prevalence of blood-borne viruses among drug ...
1,5231207000.0,2014,10.1016/j.drugpo.2014.01.007,Article,"Hepatitis C among people who inject drugs in Tbilisi, Georgia: An urgent need for prevention and treatment"
2,340623400.0,2015,10.1016/j.drugpo.2015.07.016,Article,Access to hepatitis C treatment for people who inject drugs in low and middle income settings: Evidence from 5 countries in ...
3,326657400.0,2016,10.1016/j.drugpo.2016.02.010,Article,"Prevalence and risk factors associated with HIV and tuberculosis in people who use drugs in Abidjan, Ivory Coast"
4,241868000.0,2017,10.1186/s12879-017-2767-0,Article,Survey of programmatic experiences and challenges in delivery of hepatitis B and C testing in low- and middle-income countries
5,322384400.0,2017,10.1063/1.4989775,Article,Effect of oxygen plasma on nanomechanical silicon nitride resonators
6,15481430000.0,2017,10.1007/s11577-017-0430-9,Editorial Material,Action Theory and System Theory
7,15545490000.0,2017,10.5771/0038-6073-2017-1-5,Article,The inner differentiation of society: stratification and functional differentiation
8,94881330.0,2018,,Meeting Abstract,Modelling the Impact of Prevention and Treatment Interventions on HIV and Hepatitis C Virus Transmission Among PWID in Nairobi
9,248403600.0,2018,10.1016/j.drugpo.2017.11.014,Article,Harm reduction-based and peer-supported hepatitis C treatment for people who inject drugs in Georgia
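
The same join pattern can be tried without access to the KB DB. This is a minimal sketch on an in-memory SQLite database; the simplified tables and all sample rows are made up for illustration, and the statement is built with `select()` in the SQLAlchemy 1.4+ style.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy import create_engine, insert, select

engine = create_engine("sqlite:///:memory:")
meta = MetaData()

authors = Table(
    "authors", meta,
    Column("pk_authors", Integer, primary_key=True),
    Column("lastname", String),
)
items = Table(
    "items", meta,
    Column("pk_items", Integer, primary_key=True),
    Column("pubyear", Integer),
    Column("article_title", String),
)
items_authors = Table(
    "items_authors", meta,
    Column("fk_items", Integer),
    Column("fk_authors", Integer),
)
meta.create_all(engine)

with engine.begin() as conn:
    conn.execute(insert(authors), [
        {"pk_authors": 1, "lastname": "Luhmann"},
        {"pk_authors": 2, "lastname": "Doe"},
    ])
    conn.execute(insert(items), [
        {"pk_items": 10, "pubyear": 2018, "article_title": "Second paper"},
        {"pk_items": 11, "pubyear": 2017, "article_title": "First paper"},
        {"pk_items": 12, "pubyear": 2019, "article_title": "Unrelated paper"},
    ])
    conn.execute(insert(items_authors), [
        {"fk_items": 10, "fk_authors": 1},
        {"fk_items": 11, "fk_authors": 1},
        {"fk_items": 11, "fk_authors": 1},  # duplicate link, removed by DISTINCT
        {"fk_items": 12, "fk_authors": 2},
    ])

# authors -> items_authors -> items, filtered, de-duplicated and sorted
stmt = (
    select(items.c.pk_items, items.c.pubyear, items.c.article_title)
    .select_from(authors)
    .join(items_authors, authors.c.pk_authors == items_authors.c.fk_authors)
    .join(items, items_authors.c.fk_items == items.c.pk_items)
    .where(authors.c.lastname == "Luhmann")
    .distinct()
    .order_by(items.c.pubyear.asc())
)

with engine.connect() as conn:
    rows = [tuple(row) for row in conn.execute(stmt)]

print(rows)
```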
