<div style="text-align:center;">
  <img src="images/molssi_main_horizontal.png" style="display: block; margin: 0 auto; max-height:200px;">
</div>

# Building a Database with SQLModel

<strong>Author(s):</strong> Jessica A. Nash, The Molecular Sciences Software Institute

<div class="alert alert-block alert-info"> 
<h2>Overview</h2>

<strong>Questions:</strong>


<strong>Objectives:</strong>

</div>



In [1]:
import os 

from typing import Optional, List

from sqlmodel import Field, SQLModel, Session, Relationship, create_engine

In [2]:
def remove_db():
    """Convenience function to remove database file for notebook."""
    if os.path.exists("sqlmodel_database.db"):
        os.remove("sqlmodel_database.db")

remove_db()

## Defining a Table using SQLModel

When using an ORM, you define each table as a Python class.
The class inherits from a base class that defines how the object behaves.

In the case of SQLModel, we can create a SQL table by inheriting from the class `SQLModel` and setting the `table` argument to `True`.
Defining the columns of the table is very similar to defining a model using [Pydantic](https://docs.pydantic.dev/).
Each column is created by adding a class attribute to your class.
The name of the column will correspond to the name of the attribute in the class.
The name of the SQL table will correspond to the name of the class, but in all lower case.

Each attribute must have a type hint.
SQLModel uses the type hint to map the Python type of the attribute to a SQL column type.

Finally, any constraints on a column, such as a primary key or uniqueness,
are added by passing arguments to SQLModel's `Field` function.

In the cell below, we have defined columns for the `Article` table.


In [3]:
class Article(SQLModel, table=True):
    
    doi: str = Field(primary_key=True)
    title: str
    publication_year: int
    abstract: Optional[str] = Field(default=None)

After defining a single table, we can use SQLModel to create and connect to a database.

In [4]:
sqlite_file_name = "sqlmodel_database.db"
sqlite_url = f"sqlite:///{sqlite_file_name}"

engine = create_engine(sqlite_url, echo=True)

For the purposes of this tutorial, we set `echo=True` when creating the engine in the cell above.
With this option, the engine logs every SQL statement it executes.
When you work with objects in SQLModel, the library builds and executes the SQL queries under the hood.
It constructs queries in the appropriate SQL dialect for the type of database you are connecting to.

Our engine is now configured, but the database does not contain any tables yet. To actually add tables to our database, we use `SQLModel.metadata.create_all` and pass in our database engine.

In [5]:
SQLModel.metadata.create_all(engine)

2024-08-28 04:46:10,513 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-08-28 04:46:10,514 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("article")
2024-08-28 04:46:10,515 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,516 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("article")
2024-08-28 04:46:10,516 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,518 INFO sqlalchemy.engine.Engine 
CREATE TABLE article (
	doi VARCHAR NOT NULL, 
	title VARCHAR NOT NULL, 
	publication_year INTEGER NOT NULL, 
	abstract VARCHAR, 
	PRIMARY KEY (doi)
)


2024-08-28 04:46:10,518 INFO sqlalchemy.engine.Engine [no key 0.00082s] ()
2024-08-28 04:46:10,528 INFO sqlalchemy.engine.Engine COMMIT


In the cell output above, you can see the SQL statement that SQLModel constructed to build our table, namely

```sql
CREATE TABLE article (
	doi VARCHAR NOT NULL, 
	title VARCHAR NOT NULL, 
	publication_year INTEGER NOT NULL, 
	abstract VARCHAR, 
	PRIMARY KEY (doi)
)
```

If you view your database file now, you will see a database with a single `article` table.
SQLModel (using SQLAlchemy underneath) has mapped our Python class to a SQL statement for creating a table.

<div class="alert alert-block alert-warning">

## Exercise
Create classes for the `keyword` table and the `author` table.

The `author` table should have `id` (integer), `first_name`, `last_name`, and `affiliation` (optional).

The `keyword` table should have `id` and `keyword`.

For the `id` columns, use `None | int` as the type.
This allows someone to omit the ID and have the table autoincrement.
</div>


In [6]:
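class Author(SQLModel, table=True):

    # default=None lets the ID be omitted so the database can assign it
    id: None | int = Field(default=None, primary_key=True)
    first_name: str
    last_name: str
    affiliation: Optional[str] = Field(default=None)

class Keyword(SQLModel, table=True):

    id: None | int = Field(default=None, primary_key=True)
    # unique=True prevents duplicate keywords; index=True speeds up lookups by keyword
    keyword: str = Field(unique=True, index=True)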

In [7]:
SQLModel.metadata.create_all(engine)

2024-08-28 04:46:10,580 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-08-28 04:46:10,581 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("article")
2024-08-28 04:46:10,582 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,586 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("author")
2024-08-28 04:46:10,587 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,588 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("author")
2024-08-28 04:46:10,589 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,593 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("keyword")
2024-08-28 04:46:10,596 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,597 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("keyword")
2024-08-28 04:46:10,597 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,599 INFO sqlalchemy.engine.Engine 
CREATE TABLE author (
	id INTEGER NOT NULL, 
	first_name VARCHAR NOT NULL, 
	last_name VARCHAR NOT NU

### Associative Tables

Finally, for our associative tables, we pass `foreign_key` to `Field` so that each column references the primary key of one of our other tables.

In [8]:
class ArticleKeyword(SQLModel, table=True):
    __table_args__ = {"extend_existing": True} # This lets us run the Jupyter notebook cell multiple times without error
    
    article_doi: str = Field(foreign_key="article.doi", primary_key=True)
    # keyword.id is an integer, so the foreign key column must be an integer too
    keyword_id: int = Field(foreign_key="keyword.id", primary_key=True)

class ArticleAuthor(SQLModel, table=True):
    __table_args__ = {"extend_existing": True}

    article_doi: str = Field(foreign_key="article.doi", primary_key=True)
    # author.id is an integer, so the foreign key column must be an integer too
    author_id: int = Field(foreign_key="author.id", primary_key=True)

In [9]:
SQLModel.metadata.create_all(engine)

2024-08-28 04:46:10,660 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-08-28 04:46:10,661 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("article")
2024-08-28 04:46:10,662 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,663 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("author")
2024-08-28 04:46:10,663 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,664 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("keyword")
2024-08-28 04:46:10,665 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,665 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("articlekeyword")
2024-08-28 04:46:10,666 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,666 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("articlekeyword")
2024-08-28 04:46:10,667 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,667 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("articleauthor")
2024-08-28 04:46:10,668 INFO sqlalchemy.engine.

## Adding and Retrieving Data



In [14]:
import os

from typing import Optional, List

from sqlmodel import Field, SQLModel, Session, Relationship, create_engine

# remove the database file if it exists
if os.path.exists("sqlmodel_database.db"):
    os.remove("sqlmodel_database.db")


# Define the associative tables first - we will use these in relationships for our main tables.
# The `link_model` argument of `Relationship` takes the class itself, so these classes must exist
# before the main tables are defined. The related models can be referenced by name as strings.
# This cell will be broken up and explained, especially the relationship parts.
class ArticleKeyword(SQLModel, table=True):
    __table_args__ = {"extend_existing": True} # This lets us run the Jupyter notebook cell multiple times without error
    
    article_doi: str = Field(foreign_key="article.doi", primary_key=True)
    # keyword.id is an integer, so the foreign key column must be an integer too
    keyword_id: int = Field(foreign_key="keyword.id", primary_key=True)

class ArticleAuthor(SQLModel, table=True):
    __table_args__ = {"extend_existing": True}

    article_doi: str = Field(foreign_key="article.doi", primary_key=True)
    # author.id is an integer, so the foreign key column must be an integer too
    author_id: int = Field(foreign_key="author.id", primary_key=True)

class Article(SQLModel, table=True):
    __table_args__ = {"extend_existing": True}

    doi: str = Field(primary_key=True)
    title: str
    publication_year: int
    abstract: Optional[str] = Field(default=None)

    keywords: List["Keyword"] = Relationship(back_populates="articles", link_model=ArticleKeyword)
    authors: List["Author"] = Relationship(back_populates="articles", link_model=ArticleAuthor)

class Author(SQLModel, table=True):
    __table_args__ = {"extend_existing": True}

    id: None | int = Field(default=None, primary_key=True)
    first_name: str
    last_name: str
    affiliation: Optional[str] = Field(default=None)

    articles: List["Article"] = Relationship(back_populates="authors", link_model=ArticleAuthor)


class Keyword(SQLModel, table=True):
    __table_args__ = {"extend_existing": True}

    id: None | int = Field(default=None, primary_key=True)
    keyword: str = Field(unique=True, index=True)

    articles: List["Article"] = Relationship(back_populates="keywords", link_model=ArticleKeyword)





In [15]:
sqlite_file_name = "sqlmodel_database.db"
sqlite_url = f"sqlite:///{sqlite_file_name}"

engine = create_engine(sqlite_url, echo=True)

In [16]:
SQLModel.metadata.create_all(engine)

2024-08-28 04:46:10,913 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-08-28 04:46:10,915 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("article")
2024-08-28 04:46:10,915 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,916 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("article")
2024-08-28 04:46:10,917 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,918 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("author")
2024-08-28 04:46:10,918 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,919 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("author")
2024-08-28 04:46:10,919 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,920 INFO sqlalchemy.engine.Engine PRAGMA main.table_info("keyword")
2024-08-28 04:46:10,921 INFO sqlalchemy.engine.Engine [raw sql] ()
2024-08-28 04:46:10,922 INFO sqlalchemy.engine.Engine PRAGMA temp.table_info("keyword")
2024-08-28 04:46:10,922 INFO sqlalchemy.engine.Engine [raw sql] ()

OperationalError: (sqlite3.OperationalError) index ix_keyword_keyword already exists
[SQL: CREATE UNIQUE INDEX ix_keyword_keyword ON keyword (keyword)]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

In [None]:
import datetime

import requests

# Most recent theoretical chemistry paper on ChemRxiv
paper = requests.get("https://chemrxiv.org/engage/chemrxiv/public-api/v1/items?categoryIds=605c72ef153207001f6470ce&limit=1")
paper.raise_for_status()  # Stop early if the request failed
print(paper.json()["itemHits"])

In [None]:
import json
import os

os.makedirs("data", exist_ok=True)  # Make sure the output folder exists
with open("data/one_paper.json", "w") as json_file:
    json.dump(paper.json(), json_file, indent=4)

In [None]:
# The first (and only) hit holds the metadata for the paper
item = paper.json()["itemHits"][0]["item"]

recent_paper = {
    "doi": item["doi"],
    "title": item["title"],
    # Assumes the most recent paper was published in the current year
    "publication_year": datetime.datetime.now().year,
    "abstract": item["abstract"],
    "keywords": item["keywords"],
    "authors": item["authors"],
}

print(recent_paper)

In [None]:
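keyword_objs = []
for keyword in recent_paper["keywords"]:
    # Store keywords in lower case so that differences in capitalization do not create duplicates
    keyword_obj = Keyword(keyword=keyword.lower())
    keyword_objs.append(keyword_obj)

author_objs = []
for author in recent_paper["authors"]:
    # Use the first listed institution, or None if the author has none
    affiliation = author["institutions"][0]["name"] if author["institutions"] else None
    author_obj = Author(first_name=author["firstName"], last_name=author["lastName"], affiliation=affiliation)
    author_objs.append(author_obj)

# Replace the raw API data with the model objects
recent_paper["keywords"] = keyword_objs
recent_paper["authors"] = author_objs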

In [None]:
recent_paper["keywords"]

In [None]:
with Session(engine) as session:
    # Add the article
    article = Article(**recent_paper)
    session.add(article)
    session.commit()

In [None]:
# Show how to query here.



Let's pull 50 more papers from ChemRxiv and add them to our database.

In [None]:
import datetime
import json
import os

import requests
from sqlmodel import select, Session

# Get the most recent theoretical chemistry papers on ChemRxiv
papers = requests.get("https://chemrxiv.org/engage/chemrxiv/public-api/v1/items?categoryIds=605c72ef153207001f6470ce&limit=50&skip=1")
papers.raise_for_status()  # Stop early if the request failed

os.makedirs("data", exist_ok=True)  # Make sure the output folder exists
with open("data/fifty_papers.json", "w") as json_file:
    json.dump(papers.json(), json_file, indent=4)

for paper in papers.json()["itemHits"]:
    recent_paper = {
        "doi": paper["item"]["doi"],
        "title": paper["item"]["title"],
        "publication_year": datetime.datetime.now().year,
        "abstract": paper["item"]["abstract"],
        "keywords": paper["item"]["keywords"],
        "authors": paper["item"]["authors"]
    }

    keyword_objs = []
    with Session(engine) as session:
        for keyword in recent_paper["keywords"]:
            # Check if keyword already exists
            normalized_keyword = keyword.lower()
            existing_keyword = session.exec(select(Keyword).where(Keyword.keyword == normalized_keyword)).first()
            if existing_keyword:
                keyword_objs.append(existing_keyword)
            else:
                keyword_obj = Keyword(keyword=normalized_keyword)
                session.add(keyword_obj)
                session.commit()  # Commit to get the keyword ID
                session.refresh(keyword_obj)  # Refresh to load the keyword ID
                keyword_objs.append(keyword_obj)

    author_objs = []
    with Session(engine) as session:
        for author in recent_paper["authors"]:
            affiliation = author["institutions"][0]["name"] if author["institutions"] else None

            # Check if author already exists
            existing_author = session.exec(
                select(Author).where(
                    Author.first_name == author["firstName"],
                    Author.last_name == author["lastName"],
                    Author.affiliation == affiliation
                )
            ).first()
            
            if existing_author:
                author_objs.append(existing_author)
            else:
                author_obj = Author(
                    first_name=author["firstName"],
                    last_name=author["lastName"],
                    affiliation=affiliation
                )
                session.add(author_obj)
                session.commit()  # Commit to get the author ID
                session.refresh(author_obj)  # Refresh to load the author ID
                author_objs.append(author_obj)

    recent_paper["keywords"] = keyword_objs
    recent_paper["authors"] = author_objs

    try:
        with Session(engine) as session:
            # Add the article
            article = Article(**recent_paper)
            session.add(article)
            session.commit()
    except Exception as e:
        print(f"Adding article {recent_paper['doi']} failed with error: {e}")


In [None]:
keyword_to_search = "machine learning"

with Session(engine) as session:
    # Get the specific Keyword object by keyword
    keyword = session.exec(select(Keyword).where(Keyword.keyword == keyword_to_search)).first()
    
    if keyword:
        print(f"Found keyword: {keyword.keyword}")
        # Access related articles directly
        for article in keyword.articles:
            print(f"DOI: {article.doi}, Title: {article.title}, Year: {article.publication_year}")
    else:
        print(f"No articles found for keyword: {keyword_to_search}")


In [None]:
keyword.articles