<div style="text-align:center;">
  <img src="images/molssi_main_horizontal.png" style="display: block; margin: 0 auto; max-height:200px;">
</div>

# Building a Database with SQLModel

<strong>Author(s):</strong> Jessica A. Nash, The Molecular Sciences Software Institute

<div class="alert alert-block alert-info"> 
<h2>Overview</h2>

<strong>Questions:</strong>


<strong>Objectives:</strong>

</div>



In [1]:
import os 

from typing import Optional, List

from sqlmodel import Field, SQLModel, Session, create_engine

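The first cell imports what this notebook needs: `os` for working with the database file, type hints from `typing`, and the SQLModel pieces used to define tables and connect to a database.
The next cell defines a helper that deletes any existing database file so the notebook starts from a clean state.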

In [2]:
def remove_db():
    """Convenience function to remove database file for notebook."""
    if os.path.exists("sqlmodel_database.db"):
        os.remove("sqlmodel_database.db")

remove_db()

## Defining a Table using SQLModel

When using an ORM, you define tables as Python classes.
Each class inherits from a base class that provides the ORM behavior.

In the case of SQLModel, we create a SQL table by inheriting from the class `SQLModel` and setting the `table` argument to `True`.
Defining the columns of the table is very similar to defining a model using Pydantic.
SQLModel is built on SQLAlchemy, a standard Python ORM, and uses Pydantic for validation, which makes it a bit quicker to prototype with than SQLAlchemy alone.
Each column is defined by adding a class attribute to your class.
The name of the column corresponds to the name of the attribute in the class.
By default, the name of the SQL table is the name of the class in all lower case.

The attributes must be accompanied by type-hinting.
This will allow SQLModel to map the Python type of the variable to the SQL type.

Finally, any constraints on a column, such as a primary key or uniqueness, 
are added by using SQLModel's `Field` function.

In the cell below, we have defined columns for the `Article` table.

In [3]:
class Article(SQLModel, table=True):
    
    doi: str = Field(primary_key=True)
    title: str
    publication_year: int
    abstract: Optional[str] = Field(default=None)

After defining a single table, we can use SQLModel to create and connect to a database.

In [4]:
sqlite_file_name = "sqlmodel_database.db"
sqlite_url = f"sqlite:///{sqlite_file_name}"

engine = create_engine(sqlite_url, echo=True)

For the purposes of this tutorial, we pass the `echo=True` option when creating the SQL engine in the cell above.
When you work with objects in SQLModel, the library builds and executes the SQL queries under the hood.
It constructs queries in the appropriate SQL dialect for the type of database you are connecting to.
With `echo=True`, every SQL statement sent to the database is logged. Because we are connected to a SQLite database, we see SQLite queries; with a different type of SQL database, we would see queries in that database's dialect.

At this point the engine is configured, but the database contains no tables. To actually add tables to our database, we use `SQLModel.metadata.create_all` and pass in our database engine.
SQLModel will automatically create tables for any classes we've defined using `SQLModel` and the `table=True` argument.

In [5]:
SQLModel.metadata.create_all(engine)

2024-08-29 10:59:01,111 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-08-29 10:59:01,111 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("article")
2024-08-29 10:59:01,112 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-29 10:59:01,113 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("article")
2024-08-29 10:59:01,114 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-29 10:59:01,116 INFO sqlalchemy.engine.Engine 
CREATE TABLE article (
	doi VARCHAR NOT NULL, 
	title VARCHAR NOT NULL, 
	publication_year INTEGER NOT NULL, 
	abstract VARCHAR, 
	PRIMARY KEY (doi)
)


2024-08-29 10:59:01,116 INFO sqlalchemy.engine.Engine [no key 0.00066s] ()
2024-08-29 10:59:01,134 INFO sqlalchemy.engine.Engine COMMIT


In the cell output above, you will be able to see the SQL query that SQLModel constructed to build our table, namely

```sql
CREATE TABLE article (
	doi VARCHAR NOT NULL, 
	title VARCHAR NOT NULL, 
	publication_year INTEGER NOT NULL, 
	abstract VARCHAR, 
	PRIMARY KEY (doi)
)
```

Instead of us writing the SQL statement as we did in the last notebook, the ORM translates our object to SQL code and executes it for us.

If you view your database file now, you will see a database with a single `article` table. 
SQLModel (using SQLAlchemy underneath) has mapped our Python class to a SQL statement for creating a table.

<div class="alert alert-block alert-warning">

## Exercise
Create classes for the `keyword` table and the `author` table.

The `author` table should have: `id` (integer, primary key), `first_name`, `last_name`, `affiliation` (optional). See the note below for the author ID.

The `keyword` table should have `id` (integer, primary key) and `keyword`.

For the `id` columns on both tables, use `None | int` as the type. 
This allows someone to omit the ID when creating an object and have the table autoincrement it.
</div>


In [6]:
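class Author(SQLModel, table=True):

    # default=None lets us omit the id so the database can assign it
    id: None | int = Field(default=None, primary_key=True)
    first_name: str
    last_name: str
    affiliation: Optional[str] = Field(default=None)

class Keyword(SQLModel, table=True):

    id: None | int = Field(default=None, primary_key=True)
    # Each keyword is stored once and indexed for fast lookup
    keyword: str = Field(unique=True, index=True)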

In [7]:
SQLModel.metadata.create_all(engine)

2024-08-29 10:59:05,323 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-08-29 10:59:05,326 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("article")
2024-08-29 10:59:05,326 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-29 10:59:05,328 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("author")
2024-08-29 10:59:05,329 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-29 10:59:05,331 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("author")
2024-08-29 10:59:05,331 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-29 10:59:05,332 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("keyword")
2024-08-29 10:59:05,333 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-29 10:59:05,334 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("keyword")
2024-08-29 10:59:05,334 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-29 10:59:05,335 INFO sqlalchemy.engine.Engine 
CREATE TABLE author (
	id INTEGER NOT NULL, 
	first_name VARCHAR NOT NULL, 
	last_name VARCHAR NOT NU

### Associative Tables

Finally, for our associative tables, we pass `foreign_key` to each column's `Field` to reference the corresponding column in our other tables. We create a composite primary key by marking both columns as primary keys.

In [8]:
class ArticleKeyword(SQLModel, table=True):
    __table_args__ = {"extend_existing": True} # This lets us run the Jupyter notebook cell multiple times without error
    
    article_doi: str = Field(foreign_key="article.doi", primary_key=True)
    # keyword.id is an integer column, so the foreign key is an int
    keyword_id: int = Field(foreign_key="keyword.id", primary_key=True)

class ArticleAuthor(SQLModel, table=True):
    __table_args__ = {"extend_existing": True}

    article_doi: str = Field(foreign_key="article.doi", primary_key=True)
    # author.id is an integer column, so the foreign key is an int
    author_id: int = Field(foreign_key="author.id", primary_key=True)

In [9]:
SQLModel.metadata.create_all(engine)

2024-08-29 10:59:11,618 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-08-29 10:59:11,620 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("article")
2024-08-29 10:59:11,620 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-29 10:59:11,621 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("author")
2024-08-29 10:59:11,622 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-29 10:59:11,623 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("keyword")
2024-08-29 10:59:11,624 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-29 10:59:11,624 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("articlekeyword")
2024-08-29 10:59:11,625 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-29 10:59:11,626 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("articlekeyword")
2024-08-29 10:59:11,627 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-29 10:59:11,630 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("articleauthor")
2024-08-29 10:59:11,632 INFO sqlalchemy.engine.

## Adding Data to the Database

In [10]:
import json


datafile = "data/one_paper.json"

with open(datafile) as f:
    search_data = json.load(f)

paper_data = search_data["itemHits"][0]["item"]
print(paper_data)

{'id': '66cc884620ac769e5fe4275b', 'doi': '10.26434/chemrxiv-2024-jjdsq', 'vor': None, 'title': 'Latin American Natural Product Database (LANaPDB): an update ', 'abstract': 'Natural product (NP) databases are crucial tools in computer-aided drug design (CADD). Over the last decade, there has been a worldwide effort to assemble information regarding natural products (NPs) isolated and characterized in certain geographical regions. In 2023, it was published LANaPDB, to our knowledge, it is the first attempt to gather and standardize all the NP databases of Latin America. Herein, we present and analyze in detail the contents of an updated version of LANaPDB, which includes 619 newly added compounds from Colombia, Costa Rica, and Mexico. The present version of LANaPDB has a total of 13,578 compounds, coming from ten databases of seven Latin American countries. A chemoinformatic characterization of LANAPDB was carried out, which includes the structural classification of the compounds, calcu

To add data to the database, we create instances of our model classes, add them to a `Session`, and commit. 
We'll start by adding the article, then add each of its keywords.

In [11]:
with Session(engine) as session:

    first_article = Article(doi=paper_data["doi"], title=paper_data["title"], publication_year=2024, abstract=paper_data["abstract"])
    session.add(first_article)
    session.commit()

    for keyword in paper_data["keywords"]:
        keyword_obj = Keyword(keyword=keyword.lower())
        session.add(keyword_obj)
        session.commit()

2024-08-29 10:59:18,689 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-08-29 10:59:18,693 INFO sqlalchemy.engine.Engine INSERT INTO article (doi, title, publication_year, abstract) VALUES (?, ?, ?, ?)
2024-08-29 10:59:18,694 INFO sqlalchemy.engine.Engine [generated in 0.00138s] ('10.26434/chemrxiv-2024-jjdsq', 'Latin American Natural Product Database (LANaPDB): an update ', 2024, 'Natural product (NP) databases are crucial tools in computer-aided drug design (CADD). Over the last decade, there has been a worldwide effort to ass ... (1120 characters truncated) ... American natural product collection LANaPDB is publicly available and can be downloaded at https://github.com/alexgoga21/LANaPDB-version-2/tree/main.')
2024-08-29 10:59:18,696 INFO sqlalchemy.engine.Engine COMMIT
2024-08-29 10:59:18,709 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-08-29 10:59:18,710 INFO sqlalchemy.engine.Engine INSERT INTO keyword (keyword) VALUES (?)
2024-08-29 10:59:18,711 INFO sqlalchemy.engine

Although defining these models has simplified connecting to the database and defining tables, we're still missing an easy way to add and access data across our associative tables.
Luckily, ORMs have functionality built in to make this easier. 
We'll have to redefine our models for this, so we'll move on to the next notebook.