<div style="text-align:center;">
  <img src="images/molssi_main_horizontal.png" style="display: block; margin: 0 auto; max-height:200px;">
</div>

# Building a Database with SQL and SQLite

<strong>Author(s):</strong> Jessica A. Nash, The Molecular Sciences Software Institute

<div class="alert alert-block alert-info"> 
<h2>Overview</h2>

<strong>Questions:</strong>

* How do we create a database using SQL commands?

* What are SQL commands, and how do we use them to define and manipulate tables?

* How do we interact with a SQLite database using Python?

<strong>Objectives:</strong>

* Understand how to create and initialize a SQLite database.

* Learn the basic SQL commands for creating tables and inserting data.

* Use Python to interact with SQLite databases.

</div>

## Introduction to SQLite

Relational databases are typically managed and queried using SQL (Structured Query Language).
Common SQL database systems include MySQL, PostgreSQL, and SQLite.
Each of these has its own SQL dialect, though they share many common features.

In this lesson, we will learn SQL commands using SQLite.

SQLite is a lightweight, self-contained SQL database engine that is widely used for local databases in applications. 
It doesn't require a separate server to run, which makes it ideal for use in smaller applications or prototyping.

In this lesson, we'll create a database using SQLite, define our tables, and insert some initial data. 
Python has [built-in support as part of the Python Standard Library to interact with SQLite databases](https://docs.python.org/3/library/sqlite3.html), 
so we'll use the `sqlite3` module to interact with our database. 
At the end of this lesson, we'll also show how you could have performed the same tasks using the SQLite command line interface.

<div class="alert alert-block alert-success">

<h3>Tip: Viewing Database Files</h3>

If you would like to be able to view the database file that is created, we recommend installing [this plugin for VS Code](https://marketplace.visualstudio.com/items?itemName=qwtel.sqlite-viewer).

</div>

## Using SQLite from Python

To use SQLite from Python, we need to import the `sqlite3` module.
We then create a database connection by giving the path to the database file we would like to use.
As stated earlier, SQLite is a serverless database, so the database file is the database itself.

In [25]:
import sqlite3

# Connect to a database (or create one if it doesn't exist)
connection = sqlite3.connect('research_articles.db')

print("Opened database successfully")

Opened database successfully


If you are watching your file system, you will now see that a file has been created in the current directory with the name `research_articles.db`.
If you now view your database file using the SQLite plugin for VS Code, you will see that the database is empty.
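If you prefer to check this from Python rather than from a viewer, the following minimal sketch lists the tables in a new database by querying SQLite's built-in `sqlite_master` catalog. It uses a separate in-memory database (the name `demo_connection` is only for illustration) so that it does not touch `research_articles.db`.

```python
import sqlite3

# An in-memory database behaves like a brand-new database file,
# but nothing is written to disk.
demo_connection = sqlite3.connect(":memory:")

# sqlite_master is SQLite's built-in catalog of tables, indexes, and views.
rows = demo_connection.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

table_names = [row[0] for row in rows]
print(table_names)  # an empty list: no tables have been created yet

demo_connection.close()
```

The same query run against the `connection` object from the cell above should also return an empty list at this point in the lesson.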

After creating the connection, we create a cursor object to interact with the database.

In [26]:

# Create a cursor object using the connection
cursor = connection.cursor()

# Print a confirmation message
print("Cursor created.")


Cursor created.


We will use this cursor to add data to our database.

Currently, we have a database, but the database has no tables or structure.
To add data to our database, we first need to define its structure by defining the database's tables.
As a reminder, we decided to have three tables: `authors`, `articles`, and `article_authors`.

To define these, we will have to write SQL commands.
An SQL command is generally composed of several key parts: keywords, identifiers, operators, and the data itself. 
Keywords are specific terms like `SELECT`, `CREATE`, `INSERT`, and `DELETE`, each referring to a possible action in the database. 

For example, to create a table, you use the `CREATE TABLE` keyword followed by your chosen table name and a list of columns with their names, types, and any conditions.

```sql
CREATE TABLE table_name (
    column1_name column1_type conditions,
    column2_name column2_type conditions,
    ...
);
```

For example, to create the `DOI` column in our `articles` table, the syntax would be something like

```sql
CREATE TABLE IF NOT EXISTS articles (
    DOI TEXT PRIMARY KEY
);
```

This says to create a table called `articles` (unless one already exists) and to add a column called `DOI`
that has the data type `TEXT`, and that this column should be the primary key.
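To see what the primary key buys us, here is a minimal sketch that creates a one-column `articles` table in a throwaway in-memory database and tries to insert the same DOI twice. The DOI value is a made-up placeholder, and `demo_connection` is a name used only for this illustration.

```python
import sqlite3

demo_connection = sqlite3.connect(":memory:")
demo_connection.execute(
    "CREATE TABLE IF NOT EXISTS articles (DOI TEXT PRIMARY KEY)"
)

# The DOI below is a made-up placeholder, not a real article.
demo_connection.execute(
    "INSERT INTO articles (DOI) VALUES (?)", ("10.0000/example",)
)

try:
    # A second row with the same primary key value violates uniqueness.
    demo_connection.execute(
        "INSERT INTO articles (DOI) VALUES (?)", ("10.0000/example",)
    )
    duplicate_rejected = False
except sqlite3.IntegrityError as error:
    duplicate_rejected = True
    print("Duplicate rejected:", error)

row_count = demo_connection.execute(
    "SELECT COUNT(*) FROM articles"
).fetchone()[0]
demo_connection.close()
```

The second insert raises `sqlite3.IntegrityError`, so the table still holds a single row.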

The next cell contains the full SQL command for creating the `articles` table.
We first define the SQL command as a multiline string in Python.
Next, we use the cursor object to execute the SQL command.
Finally, we commit the changes to the database.

In [27]:
# SQL command to create the articles table
article_table_command = """
CREATE TABLE IF NOT EXISTS articles (
    DOI TEXT PRIMARY KEY,
    article_title TEXT NOT NULL,
    publication_year INTEGER NOT NULL,
    abstract TEXT
);
"""

# Execute the SQL commands
cursor.execute(article_table_command)

# Commit the changes
connection.commit()


If we now examine our database file using the SQLite plugin for VS Code, we will see that the `articles` table has been created.
The primary key (the `DOI` column) is shown with a key icon next to it.
The viewer shows us that we don't have any data in the database yet.

<div style="text-align:center;">
  <img src="images/article_table.png" style="display: block; margin: 0 auto; max-height:400px;">
</div>
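The table structure can also be inspected without a viewer. The sketch below, which assumes a throwaway in-memory copy of the `articles` table, uses SQLite's `PRAGMA table_info` to list each column with its type and whether it is part of the primary key.

```python
import sqlite3

demo_connection = sqlite3.connect(":memory:")
demo_connection.execute("""
CREATE TABLE IF NOT EXISTS articles (
    DOI TEXT PRIMARY KEY,
    article_title TEXT NOT NULL,
    publication_year INTEGER NOT NULL,
    abstract TEXT
);
""")

# Each row returned is: (cid, name, type, notnull, dflt_value, pk)
columns = demo_connection.execute("PRAGMA table_info(articles)").fetchall()
for cid, name, col_type, notnull, default, pk in columns:
    print(f"{name:<18} {col_type:<8} notnull={notnull} pk={pk}")

column_names = [column[1] for column in columns]
primary_key_columns = [column[1] for column in columns if column[5]]
demo_connection.close()
```

Running the same `PRAGMA` through the lesson's `cursor` should describe the table stored in `research_articles.db`.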

## Database Transactions

What we just did in the previous cell is an example of a database **transaction**.
A transaction is a single unit of work that ensures that the change we wanted to make (in this case, creating a table) either happens completely or doesn't happen at all if something goes wrong.

In the cell above, we finalize our work by calling the `commit` method.
In SQLite, every statement that changes the database, including `CREATE TABLE`, runs inside a transaction.
Calling `commit` finalizes any open transaction, making all changes permanent and visible to other users or processes interacting with the database.
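The "all or nothing" behavior is easiest to see with `rollback`, the counterpart of `commit`. The following is a minimal sketch in an in-memory database with a hypothetical `notes` table; it assumes the default settings of Python's `sqlite3` module, under which a transaction is opened implicitly before statements such as `INSERT`.

```python
import sqlite3

demo_connection = sqlite3.connect(":memory:")
demo_connection.execute("CREATE TABLE notes (body TEXT NOT NULL)")
demo_connection.commit()


def count_notes(connection):
    """Return the number of rows currently visible in the notes table."""
    return connection.execute("SELECT COUNT(*) FROM notes").fetchone()[0]


# This insert is undone because we roll back instead of committing.
demo_connection.execute("INSERT INTO notes (body) VALUES (?)", ("discard me",))
demo_connection.rollback()
count_after_rollback = count_notes(demo_connection)

# This insert is made permanent by the commit.
demo_connection.execute("INSERT INTO notes (body) VALUES (?)", ("keep me",))
demo_connection.commit()
count_after_commit = count_notes(demo_connection)

print(count_after_rollback, count_after_commit)
demo_connection.close()
```

Only the committed row survives: the count is 0 after the rollback and 1 after the commit.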

<div class="alert alert-block alert-warning">

## Exercise

Add the SQL commands to create the `authors` table in the cell below. 

Use the following columns:
* `author_id` (INTEGER, PRIMARY KEY)
* `first_name` (TEXT, NOT NULL)
* `last_name` (TEXT, NOT NULL)
* `affiliation` (TEXT, OPTIONAL)

</div>

In [28]:
# SQL command to create the authors table
author_table_command = """
CREATE TABLE IF NOT EXISTS authors (
    author_id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    affiliation TEXT
);
"""

# Execute the SQL commands
cursor.execute(author_table_command)

# Commit the changes
connection.commit()


To add the `article_authors` table, we need to use the concept of a **foreign key**, a column whose values refer to the primary key of another table.
This table will also have what is called a **composite primary key**.
A composite primary key is a primary key that consists of more than one column.


In [29]:
article_author_table_command = """
CREATE TABLE IF NOT EXISTS article_authors (
    article_id TEXT,
    author_id INTEGER,
    PRIMARY KEY (article_id, author_id),
    FOREIGN KEY (article_id) REFERENCES articles(DOI),
    FOREIGN KEY (author_id) REFERENCES authors(author_id)
);
"""

# Execute the SQL commands
cursor.execute(article_author_table_command)

# Commit the changes
connection.commit()
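One detail worth knowing is that SQLite only enforces foreign keys when `PRAGMA foreign_keys = ON` has been run on the connection. The sketch below uses simplified copies of the three tables in an in-memory database, with made-up placeholder rows, to show both the composite primary key and the foreign key rejecting bad rows. The helper `is_rejected` is hypothetical and exists only for this illustration.

```python
import sqlite3

demo_connection = sqlite3.connect(":memory:")
# Foreign key enforcement is off by default in SQLite; switch it on.
demo_connection.execute("PRAGMA foreign_keys = ON")

demo_connection.executescript("""
CREATE TABLE articles (DOI TEXT PRIMARY KEY);
CREATE TABLE authors (author_id INTEGER PRIMARY KEY, last_name TEXT NOT NULL);
CREATE TABLE article_authors (
    article_id TEXT,
    author_id INTEGER,
    PRIMARY KEY (article_id, author_id),
    FOREIGN KEY (article_id) REFERENCES articles(DOI),
    FOREIGN KEY (author_id) REFERENCES authors(author_id)
);
""")

# Placeholder data: one article, one author, and a link between them.
demo_connection.execute("INSERT INTO articles (DOI) VALUES (?)", ("10.0000/example",))
demo_connection.execute(
    "INSERT INTO authors (author_id, last_name) VALUES (?, ?)", (1, "Doe")
)
demo_connection.execute(
    "INSERT INTO article_authors VALUES (?, ?)", ("10.0000/example", 1)
)


def is_rejected(article_id, author_id):
    """Try to insert a link row; report whether SQLite refused it."""
    try:
        demo_connection.execute(
            "INSERT INTO article_authors VALUES (?, ?)", (article_id, author_id)
        )
        return False
    except sqlite3.IntegrityError:
        return True


# Same (article_id, author_id) pair again: violates the composite primary key.
duplicate_pair_rejected = is_rejected("10.0000/example", 1)
# Author 999 does not exist: violates the foreign key.
unknown_author_rejected = is_rejected("10.0000/example", 999)

print(duplicate_pair_rejected, unknown_author_rejected)
demo_connection.close()
```

Without the `PRAGMA` line, the insert that refers to the nonexistent author would be accepted silently.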

Next, we'll make the keyword tables discussed in the exercise for notebook 1.

In [30]:
keyword_table_command = """
CREATE TABLE IF NOT EXISTS keywords (
    keyword_id INTEGER PRIMARY KEY,
    keyword TEXT NOT NULL
);
"""

keyword_article_table_command = """
CREATE TABLE IF NOT EXISTS article_keywords (
    article_id TEXT,
    keyword_id INTEGER,
    PRIMARY KEY (article_id, keyword_id),
    FOREIGN KEY (article_id) REFERENCES articles(DOI),
    FOREIGN KEY (keyword_id) REFERENCES keywords(keyword_id)
);
"""

# Execute the SQL commands
cursor.execute(keyword_table_command)
cursor.execute(keyword_article_table_command)

# Commit the changes
connection.commit()

## Inserting Data

Let's insert some data into our database.
To get some data, we'll use the REST API for ChemRxiv.

In the cell below, we pull the most recent article in the "Theoretical and Computational Chemistry" category from ChemRxiv.
Note that this will be different every time the notebook is run.

In [33]:
import requests

import datetime

# Most recent theoretical chemistry paper on ChemRxiv
paper = requests.get("https://chemrxiv.org/engage/chemrxiv/public-api/v1/items?categoryIds=605c72ef153207001f6470ce&limit=1")
print(paper.json()["itemHits"])

[{'item': {'id': '66cba705a4e53c48769eb2a8', 'doi': '10.26434/chemrxiv-2024-hjzcr', 'vor': None, 'title': 'Modeling the ionization efficiency of small molecules in positive electrospray ionization using molecular dynamics simulations', 'abstract': '    Technological advancements in liquid chromatography (LC) electrospray ionization (ESI) high resolution mass spectrometry (HRMS) have made it an increasingly popular analytical technique in non-targeted analysis (NTA) of environmental and biological samples. One critical limitation of current methods in NTA is the lack of available analytical standards for many of the compounds detected in biological and environmental samples. Computational approaches can provide estimates of concentrations by modeling the ionization efficiency of a compound expressed as the relative response factor (RRF). In this paper, we explore the application of molecular dynamics (MD) in the development of a predictive model for RRF.  We obtained measurements of RRF

We can gather data needed for our database from the JSON response.
Note that ChemRxiv has a lot more information than we are going to store in our database.
In fact, they probably have their own database with this information!
Some choices we've made here are for simplicity and to illustrate the concepts we are learning.
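Because the response changes from run to run, it can help to isolate the field extraction in a small function and try it on a hand-written dictionary first. The sketch below is one way to do that; `extract_article_fields` and `sample_response` are hypothetical names, the sample values are placeholders, and only the keys visible in the output above (`itemHits`, `item`, `doi`, `title`, `abstract`) are assumed.

```python
def extract_article_fields(response_json):
    """Return the fields we store for the first hit, or None if there are no hits."""
    hits = response_json.get("itemHits", [])
    if not hits:
        return None
    item = hits[0]["item"]
    return {
        "DOI": item["doi"],
        "title": item["title"],
        # The abstract column is optional in our table, so tolerate a missing key.
        "abstract": item.get("abstract"),
    }


# A hand-written stand-in with the same shape as the API response.
sample_response = {
    "itemHits": [
        {
            "item": {
                "doi": "10.0000/example",
                "title": "A placeholder title",
                "abstract": "A placeholder abstract.",
            }
        }
    ]
}

fields = extract_article_fields(sample_response)
no_hits = extract_article_fields({"itemHits": []})
print(fields)
print(no_hits)
```

With the live data, the same function could be called as `extract_article_fields(paper.json())`.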

In [40]:
# Pull the article record out of the response once, then pick the fields we need
item = paper.json()["itemHits"][0]["item"]

recent_paper = {
    "DOI": item["doi"],
    "title": item["title"],
    # get the current year - making some assumptions here :)
    "year": datetime.datetime.now().year,
    "abstract": item["abstract"],
    "keywords": item["keywords"],
    "authors": item["authors"],
}

print(recent_paper)

{'DOI': '10.26434/chemrxiv-2024-hjzcr', 'title': 'Modeling the ionization efficiency of small molecules in positive electrospray ionization using molecular dynamics simulations', 'year': 2024, 'abstract': '    Technological advancements in liquid chromatography (LC) electrospray ionization (ESI) high resolution mass spectrometry (HRMS) have made it an increasingly popular analytical technique in non-targeted analysis (NTA) of environmental and biological samples. One critical limitation of current methods in NTA is the lack of available analytical standards for many of the compounds detected in biological and environmental samples. Computational approaches can provide estimates of concentrations by modeling the ionization efficiency of a compound expressed as the relative response factor (RRF). In this paper, we explore the application of molecular dynamics (MD) in the development of a predictive model for RRF.  We obtained measurements of RRF for 48 compounds with LC - quadrupole time

To insert data into our database, we use the `INSERT INTO` SQL command.
We leave question marks `?` in the SQL command as placeholders where we want to insert data,
then pass the values to be inserted as a tuple in the second argument to the `execute` method.

Passing values this way, rather than building the SQL string ourselves, means they are treated strictly as data and never as SQL, which protects against SQL injection attacks.

In [39]:
# Insert the data
insert_article_command = """
INSERT INTO articles (DOI, article_title, publication_year, abstract)
VALUES (?, ?, ?, ?)
"""

cursor.execute(insert_article_command, (recent_paper["DOI"], recent_paper["title"], recent_paper["year"], recent_paper["abstract"]))

# Commit the changes
connection.commit()
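To illustrate why the placeholders matter, the sketch below inserts a title that looks like an injection attempt into a simplified in-memory `articles` table and then reads it back with `SELECT`. The DOI and title are made-up values, and `hostile_title` is a name chosen only for this example.

```python
import sqlite3

demo_connection = sqlite3.connect(":memory:")
demo_connection.execute(
    "CREATE TABLE articles (DOI TEXT PRIMARY KEY, article_title TEXT NOT NULL)"
)

# A string that would be dangerous if pasted directly into an SQL command.
hostile_title = "x'); DROP TABLE articles; --"

# With placeholders, the string is passed as a value, never parsed as SQL.
demo_connection.execute(
    "INSERT INTO articles (DOI, article_title) VALUES (?, ?)",
    ("10.0000/example", hostile_title),
)
demo_connection.commit()

# Read the row back; placeholders work the same way in SELECT statements.
stored_title = demo_connection.execute(
    "SELECT article_title FROM articles WHERE DOI = ?", ("10.0000/example",)
).fetchone()[0]

print(stored_title == hostile_title)
demo_connection.close()
```

The table is still there and the title is stored exactly as given. A similar `SELECT` through the lesson's `cursor` can be used to confirm that the ChemRxiv article was inserted.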


Processing our keywords and authors is a bit more complicated.
We need to insert the keywords into the `keywords` table and the authors into the `authors` table, 
then make our associations.
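For keywords alone, the raw SQL approach might look roughly like the sketch below. It is an illustration under several assumptions: keywords are taken to be plain strings, the tables are simplified in-memory copies without foreign keys, the DOIs are placeholders, and the helpers `get_or_create_keyword_id` and `link_keywords` are hypothetical.

```python
import sqlite3

demo_connection = sqlite3.connect(":memory:")
demo_connection.executescript("""
CREATE TABLE keywords (keyword_id INTEGER PRIMARY KEY, keyword TEXT NOT NULL);
CREATE TABLE article_keywords (
    article_id TEXT,
    keyword_id INTEGER,
    PRIMARY KEY (article_id, keyword_id)
);
""")


def get_or_create_keyword_id(connection, keyword):
    """Look up a keyword's id, inserting the keyword first if it is new."""
    row = connection.execute(
        "SELECT keyword_id FROM keywords WHERE keyword = ?", (keyword,)
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = connection.execute(
        "INSERT INTO keywords (keyword) VALUES (?)", (keyword,)
    )
    # For an INTEGER PRIMARY KEY column, lastrowid is the newly assigned id.
    return cursor.lastrowid


def link_keywords(connection, doi, keywords):
    """Associate each keyword with the article identified by doi."""
    for keyword in keywords:
        keyword_id = get_or_create_keyword_id(connection, keyword)
        connection.execute(
            "INSERT OR IGNORE INTO article_keywords (article_id, keyword_id) "
            "VALUES (?, ?)",
            (doi, keyword_id),
        )
    connection.commit()


link_keywords(demo_connection, "10.0000/example-a", ["molecular dynamics", "mass spectrometry"])
link_keywords(demo_connection, "10.0000/example-b", ["molecular dynamics"])

keyword_count = demo_connection.execute("SELECT COUNT(*) FROM keywords").fetchone()[0]
link_count = demo_connection.execute("SELECT COUNT(*) FROM article_keywords").fetchone()[0]
print(keyword_count, link_count)
demo_connection.close()
```

The shared keyword is stored once and linked to both articles. Authors would need the same look-up, insert, and link steps again.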

This is quite cumbersome when using raw SQL, so we will avoid it for this tutorial.
In the next section, we will use an object-relational mapping (ORM) library called SQLModel to simplify this process.