<div style="text-align:center;">
  <img src="images/molssi_main_horizontal.png" style="display: block; margin: 0 auto; max-height:200px;">
</div>

# Building a Database with SQLModel

<strong>Author(s):</strong> Jessica A. Nash, The Molecular Sciences Software Institute

<div class="alert alert-block alert-info"> 
<h2>Overview</h2>

<strong>Questions:</strong>

* What is an ORM and why is it useful in database management?
* What are common Python packages for ORMs?
* What are the benefits of using ORMs?
* How do I create a database using SQLModel?

<strong>Objectives:</strong>

* Create a database table structure using SQLModel.

</div>

In the previous notebook, we learned about SQL queries and how to build a SQLite database by building queries and executing them using Python.
When we took this approach, we had to write the SQL ourselves, which can be time-consuming and error-prone, and also varies depending on the database system we are using.
In this notebook, we'll learn about a Python library called SQLModel that can be used to build databases by writing Python classes. The classes are then used to create the database schema and interact with the database without writing SQL queries.

SQLModel is an ORM, or "object-relational mapping" library.
An ORM converts data between the tables of a relational database and the objects of an object-oriented programming language.
This allows us to interact with a database using Python objects, which can make our code more readable and maintainable.

The "classic" ORM in Python is [SQLAlchemy](https://www.sqlalchemy.org/), which is a very powerful and flexible library.
Many of MolSSI's own projects such as QCArchive or SEAMM use SQLAlchemy for their databases.
SQLModel is a more "modern" ORM that is built on top of SQLAlchemy, and is designed to be easier to use and more Pythonic.
SQLModel combines SQLAlchemy with data validation through [Pydantic](https://docs.pydantic.dev/latest/) and is easier to plug into apps like APIs using FastAPI.

This notebook starts with a convenience function for removing the database we will create in it, in case you run the notebook more than once.

In [None]:
import os

def remove_db():
    """Convenience function to remove database file for notebook."""
    if os.path.exists("sqlmodel_database.db"):
        os.remove("sqlmodel_database.db")

remove_db()

## Defining a Table using SQLModel

When using an ORM, you define tables using Python classes and inheritance.
Briefly, inheritance is a way to define a new class that inherits attributes and methods from an existing class.
The new class starts with the behavior of its base class and is extended through additional code.
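
As a quick illustration of inheritance, here is a minimal sketch in plain Python. The `Molecule` and `Solvent` classes are hypothetical and are not part of SQLModel or of the database we build in this notebook.

```python
class Molecule:
    """Hypothetical base class holding data shared by all molecules."""

    def __init__(self, name, formula):
        self.name = name
        self.formula = formula

    def describe(self):
        return f"{self.name} ({self.formula})"


class Solvent(Molecule):
    """Inherits everything from Molecule and adds a boiling point."""

    def __init__(self, name, formula, boiling_point_c):
        # Reuse the base class initializer, then extend it.
        super().__init__(name, formula)
        self.boiling_point_c = boiling_point_c

    def describe(self):
        # Extend the inherited method instead of rewriting it.
        return f"{super().describe()}, boils at {self.boiling_point_c} C"


water = Solvent("water", "H2O", 100)
print(water.describe())
print(isinstance(water, Molecule))
```

`Solvent` only defines what is new. The `name` and `formula` attributes come from `Molecule`, in the same way that our table classes will pick up their database behavior from `SQLModel`.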

In the case of SQLModel, we can create a SQL table by inheriting from the class `SQLModel` and setting the `table` argument to `True`.
Defining the columns of the table is very similar to defining a model using Pydantic.
Because SQLModel builds on SQLAlchemy and uses Pydantic for validation, it is often quicker to prototype with than SQLAlchemy alone.
Each column is given a name by adding a class attribute to your class. 
The name of the column will correspond to the name of the attribute in the class.
The name of the SQL table will correspond to the name of the class, but in all lower case.

When using SQLModel, the attributes (columns) must be accompanied by type hints.
This will allow SQLModel to map the Python type of the variable to the SQL type.

Finally, any constraints on a column, such as a primary key or uniqueness,
are added with `Field` from SQLModel.

In the cell below, we have defined columns for the `Article` table.
The class defines a SQL table with columns named `doi` (type `str`, primary key), `title` (type `str`), `publication_year` (type `int`), and `abstract` (optional, type `str`, default `None`).

In [None]:
from typing import Optional

from sqlmodel import Field, SQLModel, Session, create_engine

class Article(SQLModel, table=True):
    
    doi: str = Field(primary_key=True)
    title: str
    publication_year: int
    abstract: Optional[str] = Field(default=None)

After defining a single table, we can use SQLModel to create and connect to a database.

In [None]:
sqlite_file_name = "sqlmodel_database.db"
sqlite_url = f"sqlite:///{sqlite_file_name}"

engine = create_engine(sqlite_url, echo=True)

For the purposes of this tutorial, we're using the `echo=True` option to the SQL engine. 
This was set in the cell above.
When you work with objects in SQLModel, the library is building and executing the SQL queries under the hood.
The library will construct queries of the appropriate SQL dialect depending on what type of SQL database you are connecting to.
Because we are connected to a SQLite database, `echo=True` will show us SQLite queries. If we were connected to a different type of SQL database, we would see queries written in that database's dialect.
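
To see how the generated SQL can depend on the dialect, the sketch below compiles the same `CREATE TABLE` statement for SQLite and for PostgreSQL without connecting to either database. It uses SQLAlchemy directly, and the `demo_article` table and `demo_metadata` object are hypothetical names kept separate from the tables in this notebook.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.dialects import postgresql, sqlite
from sqlalchemy.schema import CreateTable

# A separate MetaData object, so this demo table is never added to our database.
demo_metadata = MetaData()

demo_table = Table(
    "demo_article",
    demo_metadata,
    Column("id", Integer, primary_key=True),
    Column("title", String, nullable=False),
)

# Compile the same table definition against two different dialects.
sqlite_ddl = str(CreateTable(demo_table).compile(dialect=sqlite.dialect()))
postgres_ddl = str(CreateTable(demo_table).compile(dialect=postgresql.dialect()))

print("SQLite:", sqlite_ddl)
print("PostgreSQL:", postgres_ddl)
```

Comparing the two printed statements shows where the dialects differ, for example in how an auto-incrementing integer primary key is declared.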

Our database does not contain any tables yet. To add tables to it, we use `SQLModel.metadata.create_all` and pass in our database engine.
SQLModel will automatically create tables for any classes we've defined using `SQLModel` and the `table=True` argument.

In [None]:
SQLModel.metadata.create_all(engine)

In the cell output above, you will be able to see the SQL query that SQLModel constructed to build our table:

```sql
CREATE TABLE article (
	doi VARCHAR NOT NULL, 
	title VARCHAR NOT NULL, 
	publication_year INTEGER NOT NULL, 
	abstract VARCHAR, 
	PRIMARY KEY (doi)
)
```

Instead of us writing the SQL statement as we did in the last notebook, the ORM translates our object to SQL code and executes it for us.

If you view your database file now, you will see a database with a single `article` table.
SQLModel (using SQLAlchemy underneath) has mapped our Python class to a SQL statement for creating a table.
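
One way to check which tables a SQLite database contains is to query its built-in `sqlite_master` table with Python's `sqlite3` module. The sketch below is a minimal example: `list_tables` is a hypothetical helper, and it is demonstrated on a throwaway in-memory database rather than on the notebook's file.

```python
import sqlite3


def list_tables(connection):
    """Return the names of all tables in a SQLite database."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Throwaway in-memory database with a simplified stand-in for our table.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE article (doi VARCHAR NOT NULL, title VARCHAR NOT NULL, PRIMARY KEY (doi))"
)

tables = list_tables(connection)
print(tables)
connection.close()
```

To inspect the database from this notebook instead, you could pass `sqlite3.connect("sqlmodel_database.db")` to the same helper.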

<div class="alert alert-block alert-warning">

## Exercise
Create classes for the `keyword` table and the `author` table.

The `author` table should have: `id` (integer, primary key), `first_name`, `last_name`, `affiliation` (optional). See the note below about the `id` column.

The `keyword` table should have `id` (integer, primary key) and `keyword`.

For the `id` columns on both tables, use `None | int` or `Optional[int]` as the type so that the `id` autoincrements without the user supplying one.
This allows someone to omit the ID and have the table assign the next value automatically.
</div>


In [None]:
# Complete exercise here



In [None]:
SQLModel.metadata.create_all(engine)

### Associative Tables

Finally, we define our associative tables. Each column uses `foreign_key` in its `Field` to reference a column in one of our other tables, and we create a composite primary key by marking both columns as primary keys.

In [None]:
class ArticleKeyword(SQLModel, table=True):
    __table_args__ = {"extend_existing": True} # This lets us run the Jupyter notebook cell multiple times without error
    
    article_doi: str = Field(foreign_key="article.doi", primary_key=True)
    keyword_id: int = Field(foreign_key="keyword.id", primary_key=True)
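
A composite primary key means that each pair of values must be unique, while either value on its own may repeat. The sketch below demonstrates this with Python's `sqlite3` module on an in-memory database. The table is a simplified stand-in that omits the foreign keys, and the DOI is a made-up placeholder.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    """
    CREATE TABLE articlekeyword (
        article_doi VARCHAR NOT NULL,
        keyword_id INTEGER NOT NULL,
        PRIMARY KEY (article_doi, keyword_id)
    )
    """
)

insert = "INSERT INTO articlekeyword (article_doi, keyword_id) VALUES (?, ?)"

# The same article can be linked to several different keywords.
connection.execute(insert, ("10.0000/example", 1))
connection.execute(insert, ("10.0000/example", 2))

# Repeating an existing (article, keyword) pair violates the primary key.
try:
    connection.execute(insert, ("10.0000/example", 1))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = connection.execute("SELECT COUNT(*) FROM articlekeyword").fetchone()[0]
print(duplicate_rejected, row_count)
connection.close()
```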


<div class="alert alert-block alert-warning">

## Exercise

Define the `ArticleAuthor` associative table.

</div>

The cell below adds our new tables to the database.

In [None]:
SQLModel.metadata.create_all(engine)

## Adding Data to the Database

Now that we've defined our database schema, we can add data to the database.
We'll use the paper we've retrieved from ChemRxiv to demonstrate.

In [None]:
import json


datafile = "data/one_paper.json"

with open(datafile) as f:
    search_data = json.load(f)

paper_data = search_data["itemHits"][0]["item"]
print(paper_data)

To add data to the database, we create instances of our objects.
We first create a database session using `Session` with our engine. 
We add a row to our `article` table by using our `Article` class and passing in the required arguments (the columns we defined in the class).

```python
Article(doi=paper_data["doi"], title=paper_data["title"], publication_year=2024, abstract=paper_data["abstract"])
```

Then, we can add the row to the database session using `session.add()`.
Finally, we commit the changes to the database using `session.commit()`.

We follow the same process for adding each keyword associated with the paper to the database.
In the example below, we are normalizing the keywords by converting all of them to lowercase before adding them to the database.


In [None]:
with Session(engine) as session:

    first_article = Article(doi=paper_data["doi"], title=paper_data["title"], publication_year=2024, abstract=paper_data["abstract"])
    session.add(first_article)
    session.commit()

    # Add every keyword to the session, then commit them together.
    for keyword in paper_data["keywords"]:
        keyword_obj = Keyword(keyword=keyword.lower())
        session.add(keyword_obj)
    session.commit()

Although defining these models has made some things about connecting to the database and defining tables simpler, we're still missing an easy way to add and access data across our associative tables.
Luckily, ORMs have built-in functionality to make this easier.
We'll have to redefine our models for this, so we'll move on to the next notebook.