# 1. Basic data storage

All programs process data in one form or another, and many need to save and retrieve that data from one invocation to the next. Python offers many ways to store data:
- Flat files (e.g. CSV, JSON, XML, etc.)
- RDBMS (e.g. PostgreSQL, MySQL, etc.)
- NoSQL DB (e.g. MongoDB, Neo4j, etc.)

In this tutorial, we first use flat files and then an RDBMS (SQLite) to store and retrieve data. Finally, we use `SQLAlchemy`, an object-relational mapper (ORM) that lets you map rows in a table to Python objects. `SQLite` is a lightweight RDBMS that stores its data in a single file and needs no complex installation.

## 1.1 Working with flat files 

### Advantages of Flat Files
- Easy to use and implement
- Human-readable, which makes them easy to view, edit, and transfer
- Supported (import and export) by many languages and frameworks


### Disadvantages of Flat Files
- Performance: large files are difficult to view, examine, and edit
- No explicit relations between the parts of the data (tables); relations have to be written into the business logic
- No schema or enforced data structure: users need to understand not only the structure of the data but also the programming tools needed to access it
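
The "no schema" point can be made concrete with a small sketch. It assumes a hypothetical two-column CSV held in memory (the `price` column is not part of this tutorial's data set) and uses only the standard library `csv` module. Every value comes back as a string, so the type conversion has to live in the program.

```python
import csv
import io

# Hypothetical sample data; the tutorial's real file has different columns.
raw = io.StringIO("title,price\nFoundation,7.99\nIt,9.50\n")

rows = list(csv.DictReader(raw))
# A CSV file carries no type information: every field is read as str.
print(type(rows[0]["price"]))

# The "schema" has to be re-applied by the program on every read.
typed_rows = [{"title": r["title"], "price": float(r["price"])} for r in rows]
total = sum(r["price"] for r in typed_rows)
print(f"total price: {total:.2f}")
```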


In the example below, we use `pandas` to read the CSV file and `treelib` to print the hierarchy.

In [39]:
import pandas as pd
from treelib import Node, Tree
import sqlite3
from sqlite3 import Error
# imports the Column, Integer, String, ForeignKey, and Table classes from SQLAlchemy, 
# which are used to help define the model attributes.
from sqlalchemy import Column, Integer, String, ForeignKey, Table
# imports the relationship() and backref objects, which are used to create the 
# relationships between objects.
from sqlalchemy.orm import relationship, backref
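# imports the declarative_base function, which creates the base class that connects
# the database engine to the SQLAlchemy functionality of the models.
# (Since SQLAlchemy 1.4 it lives in sqlalchemy.orm; the old sqlalchemy.ext.declarative
# location is deprecated.)
from sqlalchemy.orm import declarative_base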

In [9]:
def get_data(file_path):
    return pd.read_csv(file_path)

def get_books_by_publisher(data,ascending=True):
    """Return the number of books by each publisher as a pandas series"""
    return data.groupby("publisher").size().sort_values(ascending=ascending)

def get_authors_by_publisher(data, ascending=True):
    """Returns the number of authors by each publisher as a pandas series"""
    return (
        data.assign(name=data.first_name.str.cat(data.last_name, sep=" "))
        .groupby("publisher")
        .nunique()
        .loc[:, "name"]
        .sort_values(ascending=ascending)
    )

def add_new_book(data, author_name, book_title, publisher_name):
    """Return the data with the new book added, unless it already exists"""
    first_name, _, last_name = author_name.partition(" ")
    # Does the book exist?
    if any(
        (data.first_name == first_name)
        & (data.last_name == last_name)
        & (data.title == book_title)
        & (data.publisher == publisher_name)
    ):
        return data
    # Add the new book. DataFrame.append was deprecated in pandas 1.4 and
    # removed in pandas 2.0, so build a one-row frame and concatenate it.
    new_row = pd.DataFrame(
        [
            {
                "first_name": first_name,
                "last_name": last_name,
                "title": book_title,
                "publisher": publisher_name,
            }
        ]
    )
    return pd.concat([data, new_row], ignore_index=True)

def output_author_hierarchy(data):
    """Output the data as a hierarchy list of authors"""
    authors = data.assign(
        name=data.first_name.str.cat(data.last_name, sep=" ")
    )
    authors_tree = Tree()
    authors_tree.create_node("Authors", "authors")
    for author, books in authors.groupby("name"):
        authors_tree.create_node(author, author, parent="authors")
        for book, publishers in books.groupby("title")["publisher"]:
            book_id = f"{author}:{book}"
            authors_tree.create_node(book, book_id, parent=author)
            for publisher in publishers:
                authors_tree.create_node(publisher, parent=book_id)

    # Output the hierarchical authors data
    authors_tree.show()

The cells below call the functions defined above.

In [4]:
root_path="../../../data"
file_path=f"{root_path}/author_book_publisher.csv"

data = get_data(file_path)
data.head()

Unnamed: 0,first_name,last_name,title,publisher
0,Isaac,Asimov,Foundation,Random House
1,Pearl,Buck,The Good Earth,Random House
2,Pearl,Buck,The Good Earth,Simon & Schuster
3,Tom,Clancy,The Hunt For Red October,Berkley
4,Tom,Clancy,Patriot Games,Simon & Schuster


In [5]:
# Get the number of books printed by each publisher
books_by_publisher = get_books_by_publisher(data, ascending=False)
for publisher, total_books in books_by_publisher.items():
    print(f"Publisher: {publisher}, total books: {total_books}")
print()

Publisher: Random House, total books: 4
Publisher: Simon & Schuster, total books: 4
Publisher: Berkley, total books: 2
Publisher: Penguin Random House, total books: 2



In [6]:
# Get the number of authors each publisher publishes
authors_by_publisher = get_authors_by_publisher(data, ascending=False)
for publisher, total_authors in authors_by_publisher.items():
    print(f"Publisher: {publisher}, total authors: {total_authors}")
print()

Publisher: Simon & Schuster, total authors: 4
Publisher: Random House, total authors: 3
Publisher: Berkley, total authors: 2
Publisher: Penguin Random House, total authors: 1



In [10]:
# Output hierarchical authors data
output_author_hierarchy(data)

Authors
├── Alex Michaelides
│   └── The Silent Patient
│       └── Simon & Schuster
├── Carol Shaben
│   └── Into The Abyss
│       └── Simon & Schuster
├── Isaac Asimov
│   └── Foundation
│       └── Random House
├── John Le Carre
│   └── Tinker, Tailor, Soldier, Spy: A George Smiley Novel
│       └── Berkley
├── Pearl Buck
│   └── The Good Earth
│       ├── Random House
│       └── Simon & Schuster
├── Stephen King
│   ├── Dead Zone
│   │   └── Random House
│   ├── It
│   │   ├── Penguin Random House
│   │   └── Random House
│   └── The Shining
│       └── Penguin Random House
└── Tom Clancy
    ├── Patriot Games
    │   └── Simon & Schuster
    └── The Hunt For Red October
        └── Berkley



In [11]:
# Add a new book to the data structure
data = add_new_book(
    data,
    author_name="Pengfei Liu",
    book_title="The real world",
    publisher_name="Random House",
)



In [12]:
# Output the updated hierarchical authors data
output_author_hierarchy(data)

Authors
├── Alex Michaelides
│   └── The Silent Patient
│       └── Simon & Schuster
├── Carol Shaben
│   └── Into The Abyss
│       └── Simon & Schuster
├── Isaac Asimov
│   └── Foundation
│       └── Random House
├── John Le Carre
│   └── Tinker, Tailor, Soldier, Spy: A George Smiley Novel
│       └── Berkley
├── Pearl Buck
│   └── The Good Earth
│       ├── Random House
│       └── Simon & Schuster
├── Pengfei Liu
│   └── The real world
│       └── Random House
├── Stephen King
│   ├── Dead Zone
│   │   └── Random House
│   ├── It
│   │   ├── Penguin Random House
│   │   └── Random House
│   └── The Shining
│       └── Penguin Random House
└── Tom Clancy
    ├── Patriot Games
    │   └── Simon & Schuster
    └── The Hunt For Red October
        └── Berkley




## 1.2 Store data in a RDBMS

As mentioned before, data in a flat file has no enforced structure. Loading the `author_book_publisher.csv` data into a single table would work, but we would lose most of the advantages of an RDBMS. So the first step is data normalization, which usually takes three steps:
- convert the data to first normal form (1NF)
- convert the result of step 1 to second normal form (2NF)
- convert the result of step 2 to third normal form (3NF)

We will not show the details of how to normalize the tables here.
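
Without going through the formal normal forms, the outcome can be sketched in plain Python. The snippet below takes the five rows shown in the `data.head()` preview and splits them into author, book, and publisher collections plus two sets of link pairs. The surrogate ids it generates are illustrative only, not the ids the database will assign.

```python
# Rows copied from the CSV preview: (first_name, last_name, title, publisher)
flat_rows = [
    ("Isaac", "Asimov", "Foundation", "Random House"),
    ("Pearl", "Buck", "The Good Earth", "Random House"),
    ("Pearl", "Buck", "The Good Earth", "Simon & Schuster"),
    ("Tom", "Clancy", "The Hunt For Red October", "Berkley"),
    ("Tom", "Clancy", "Patriot Games", "Simon & Schuster"),
]

authors, books, publishers = {}, {}, {}  # natural key -> surrogate id
book_publisher, author_publisher = set(), set()  # link (association) pairs

for first, last, title, publisher in flat_rows:
    # setdefault stores the new id only the first time a key is seen
    author_id = authors.setdefault((first, last), len(authors) + 1)
    book_id = books.setdefault((title, author_id), len(books) + 1)
    publisher_id = publishers.setdefault(publisher, len(publishers) + 1)
    book_publisher.add((book_id, publisher_id))
    author_publisher.add((author_id, publisher_id))

print(len(authors), "authors,", len(books), "books,", len(publishers), "publishers")
```

Each author, book, and publisher is now stored once, and the repeated combinations from the flat file survive only as pairs of ids.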

In the sections below, we will:
1. Create a data model for the CSV file
2. Create a database structure (tables with relations) in an RDBMS (SQLite)
3. Populate the database

The `SQLite database` offers a full-featured relational database management system (RDBMS) that works with a single file to maintain all the database functionality.

### 1.2.1 Create data model

Entities:
- book
- author
- publisher

Relations between entities:

The figure below is an **entity-relationship diagram (ERD)** created with `DataGrip`.

![book_publisher_ERD.PNG](../../../images/book_publisher_ERD.PNG)

There is a `one-to-many relation` between authors and books: an author can have multiple books. (Here we assume that a book has only one author; otherwise it would be a many-to-many relation.)

There are two `many-to-many relations`: (author, publisher) and (book, publisher). One author can work with many publishers, and one publisher can work with many authors. Similarly, one book can be published by many publishers, and one publisher can publish many books.

### 1.2.2 Create db structure

Python has native support for SQLite through the `sqlite3` module of its standard library, so we can create a SQLite database directly from Python. For more info, please visit [sqlite-python](https://www.sqlitetutorial.net/sqlite-python/).


We use Python to connect to SQLite here. You can also run `sqlite3 <db-path>` to open an interactive CLI. Once you have the shell, you can run any SQL statement, as with any other RDBMS.

```sqlite
# formatting output
.header on
.mode column
.timer on

# list existing tables
SELECT name FROM sqlite_schema
WHERE type='table'
ORDER BY name;

# to exit shell
.quit

```
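
The same table listing can be done from Python. This is a minimal sketch on a throwaway in-memory database; the `demo` table and the `demo_conn` name are made up for the illustration. It queries `sqlite_master`, the older name of the schema table, because `sqlite_schema` is only recognised by newer SQLite releases.

```python
import sqlite3

# An in-memory database, so the sketch leaves no file behind.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY)")  # hypothetical table

# sqlite_master is the long-standing name of the schema table;
# sqlite_schema is an alias available in newer SQLite releases.
tables = [
    name
    for (name,) in demo_conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    )
]
print(tables)
demo_conn.close()
```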

In [18]:
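def create_connection(db_file_path: str):
    """Create a connection to a SQLite database.

    If db_file_path is ":memory:", the database lives in memory only.
    """
    conn = None
    try:
        conn = sqlite3.connect(db_file_path)
        # sqlite3.version is the version of the Python sqlite3 module, not of
        # the SQLite library (see sqlite3.sqlite_version); it is deprecated
        # since Python 3.12.
        print(sqlite3.version)
    except Error as e:
        print(e)
    return conn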

In [19]:
# create a new database
db_path=f"{root_path}/author_book_publisher.db"
conn=create_connection(db_path)

2.6.0


In [21]:
# add an entity table author in the database
conn.execute('''
CREATE TABLE IF NOT EXISTS author (
    author_id INTEGER NOT NULL PRIMARY KEY,
    first_name VARCHAR,
    last_name VARCHAR
);
''')

<sqlite3.Cursor at 0x7f6e9bb353c0>

In [22]:
# add an entity table book in the database
# author and book have a one-to-many relation, so the primary key of author
# is a foreign key in book
conn.execute('''
CREATE TABLE IF NOT EXISTS book (
    book_id INTEGER NOT NULL PRIMARY KEY,
    author_id INTEGER REFERENCES author,
    title VARCHAR
);
''')

<sqlite3.Cursor at 0x7f6e9bb34fc0>

In [23]:
# add an entity table publisher in the database
conn.execute('''
CREATE TABLE IF NOT EXISTS publisher (
    publisher_id INTEGER NOT NULL PRIMARY KEY,
    name VARCHAR
);
''')

<sqlite3.Cursor at 0x7f6e9bb35540>

We have created the three tables that represent the three entities. The author table has a one-to-many relation with the book table, so the primary key of author is a foreign key in the book table.
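
One caveat worth sketching: in a default build, SQLite does not enforce `REFERENCES` constraints unless foreign key support is switched on for the connection with `PRAGMA foreign_keys = ON`. The example below uses a throwaway in-memory database with the same author and book definitions; `fk_conn` and the inserted values are illustrative.

```python
import sqlite3

fk_conn = sqlite3.connect(":memory:")
# Foreign key enforcement is a per-connection setting in SQLite.
fk_conn.execute("PRAGMA foreign_keys = ON")
fk_conn.execute(
    "CREATE TABLE author (author_id INTEGER NOT NULL PRIMARY KEY, "
    "first_name VARCHAR, last_name VARCHAR)"
)
fk_conn.execute(
    "CREATE TABLE book (book_id INTEGER NOT NULL PRIMARY KEY, "
    "author_id INTEGER REFERENCES author, title VARCHAR)"
)

try:
    # There is no author with id 999, so this insert violates the constraint.
    fk_conn.execute("INSERT INTO book (author_id, title) VALUES (999, 'Orphan')")
    rejected = False
except sqlite3.IntegrityError as exc:
    rejected = True
    print("rejected:", exc)
finally:
    fk_conn.rollback()
    fk_conn.close()
```

Without the pragma, the same insert would be accepted and the book would point at an author that does not exist.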

To express many-to-many relationships, we need to create an **association table**. As we have two many-to-many relations, we need two association tables:
- author_publisher
- book_publisher



In [26]:
# add an association table author_publisher in the database
conn.execute('''
CREATE TABLE IF NOT EXISTS author_publisher (
    author_id INTEGER REFERENCES author,
    publisher_id INTEGER REFERENCES publisher
);
''')

<sqlite3.Cursor at 0x7f6e9bb371c0>

In [27]:
# add an association table book_publisher in the database
conn.execute('''
CREATE TABLE IF NOT EXISTS book_publisher (
    book_id INTEGER REFERENCES book,
    publisher_id INTEGER REFERENCES publisher
);
''')

<sqlite3.Cursor at 0x7f6e9bb351c0>
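
To show how an association table is used once it holds data, here is a minimal sketch on a separate in-memory database. It recreates the author, publisher, and author_publisher tables, inserts a few sample rows with explicit ids (chosen for the illustration), and joins through the association table to list the publishers of one author.

```python
import sqlite3

m2m_conn = sqlite3.connect(":memory:")
m2m_conn.executescript(
    """
    CREATE TABLE author (author_id INTEGER NOT NULL PRIMARY KEY,
                         first_name VARCHAR, last_name VARCHAR);
    CREATE TABLE publisher (publisher_id INTEGER NOT NULL PRIMARY KEY,
                            name VARCHAR);
    CREATE TABLE author_publisher (
        author_id INTEGER REFERENCES author,
        publisher_id INTEGER REFERENCES publisher
    );
    INSERT INTO author VALUES (1, 'Pearl', 'Buck');
    INSERT INTO publisher VALUES (1, 'Random House'), (2, 'Simon & Schuster');
    -- one row per (author, publisher) pair
    INSERT INTO author_publisher VALUES (1, 1), (1, 2);
    """
)

# Walk author -> author_publisher -> publisher to resolve the relationship.
publisher_names = [
    name
    for (name,) in m2m_conn.execute(
        """
        SELECT p.name
        FROM author AS a
        JOIN author_publisher AS ap ON ap.author_id = a.author_id
        JOIN publisher AS p ON p.publisher_id = ap.publisher_id
        WHERE a.last_name = ?
        ORDER BY p.name
        """,
        ("Buck",),
    )
]
print(publisher_names)
m2m_conn.close()
```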

### 1.2.3 Populating the DB

To populate a table, we can use the `INSERT INTO` command. Below is an example.

In [24]:
# insert data to tables
conn.execute('''
INSERT INTO author
    (first_name, last_name)
VALUES ('Paul', 'Mendez');
''')

<sqlite3.Cursor at 0x7f6e9bb35140>

In [28]:
# insert data to tables
conn.execute('''
INSERT INTO author
    (first_name, last_name)
VALUES ('Stephen', 'King');
''')

<sqlite3.Cursor at 0x7f6e9bb37140>

You may notice that, after the insert, the change is not yet persisted in the database file. Instead, SQLite creates a temporary journal file (`.db-journal`) that belongs to the open transaction. Unless you run `commit`, no changes will be made persistent.

In [29]:
conn.execute('''
commit;
''')

<sqlite3.Cursor at 0x7f6e9bb35b40>

After running commit, you can see that the temporary file has been deleted and the data has been added to the database.
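
The same insert-and-commit step can also be sketched with `?` placeholders and the connection's context manager, which commits when the block succeeds and rolls back when it raises. The example runs against a throwaway in-memory database; `ins_conn` and `new_authors` are names made up for the illustration.

```python
import sqlite3

ins_conn = sqlite3.connect(":memory:")
ins_conn.execute(
    "CREATE TABLE author (author_id INTEGER NOT NULL PRIMARY KEY, "
    "first_name VARCHAR, last_name VARCHAR)"
)

new_authors = [("Paul", "Mendez"), ("Stephen", "King")]  # sample data

# "with connection" commits on success and rolls back on an exception.
with ins_conn:
    # Placeholders keep the values out of the SQL text (no quoting issues).
    ins_conn.executemany(
        "INSERT INTO author (first_name, last_name) VALUES (?, ?)",
        new_authors,
    )

author_count = ins_conn.execute("SELECT COUNT(*) FROM author").fetchone()[0]
print(author_count)
ins_conn.close()
```

Note that the context manager only manages the transaction; it does not close the connection.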

### 1.2.4 Modify data
If you want to modify the values of an existing row, you can use the `UPDATE` command. Below is an example.
Like `INSERT`, an `UPDATE` runs inside a transaction, so we need to run `commit` to make the change permanent.

In [30]:
conn.execute('''
UPDATE author
SET first_name = 'Richard', last_name = 'Bachman'
WHERE first_name = 'Stephen' AND last_name = 'King';
''')

<sqlite3.Cursor at 0x7f6e9bb36540>

In [31]:
conn.execute('''
commit;
''')

<sqlite3.Cursor at 0x7f6e9bb35c40>
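
Because an update stays pending until it is committed, it can also be discarded. The sketch below, on a separate in-memory database with illustrative names, runs the same `UPDATE` with placeholders, reads how many rows it touched from `rowcount`, and then calls `rollback()` instead of `commit()`.

```python
import sqlite3

upd_conn = sqlite3.connect(":memory:")
upd_conn.execute("CREATE TABLE author (first_name VARCHAR, last_name VARCHAR)")
upd_conn.execute("INSERT INTO author VALUES ('Stephen', 'King')")
upd_conn.commit()

cursor = upd_conn.execute(
    "UPDATE author SET first_name = ?, last_name = ? "
    "WHERE first_name = ? AND last_name = ?",
    ("Richard", "Bachman", "Stephen", "King"),
)
changed = cursor.rowcount  # number of rows the UPDATE modified
print("rows changed:", changed)

# Discard the pending change instead of committing it.
upd_conn.rollback()
names_after_rollback = upd_conn.execute(
    "SELECT first_name, last_name FROM author"
).fetchall()
print(names_after_rollback)
upd_conn.close()
```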

### 1.2.5 Deleting data

To delete data, we can use the `DELETE` command. Note that the `execute` method runs only one statement at a time, so you cannot append `commit;` to the delete command in the same call. Also, a `DELETE` that matches no rows is not an error, so you can run the deletion as many times as you want without getting an error.

In [36]:
conn.execute('''
DELETE FROM author
WHERE first_name = 'Paul'
AND last_name = 'Mendez';
''')

<sqlite3.Cursor at 0x7f6e9bb40040>

In [37]:
conn.execute('''
commit;
''')

<sqlite3.Cursor at 0x7f6e9bb36c40>
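
To check whether a deletion actually removed anything, the cursor's `rowcount` can be inspected. A minimal sketch on a throwaway in-memory database (the names are illustrative): the first `DELETE` removes one row, and the second matches nothing yet raises no error.

```python
import sqlite3

del_conn = sqlite3.connect(":memory:")
del_conn.execute("CREATE TABLE author (first_name VARCHAR, last_name VARCHAR)")
del_conn.execute("INSERT INTO author VALUES ('Paul', 'Mendez')")

delete_sql = "DELETE FROM author WHERE first_name = ? AND last_name = ?"
first = del_conn.execute(delete_sql, ("Paul", "Mendez")).rowcount
second = del_conn.execute(delete_sql, ("Paul", "Mendez")).rowcount
del_conn.commit()

# The second DELETE matches nothing, which is not an error.
print(first, second)
del_conn.close()
```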

## 1.3 Working With SQLAlchemy and Python Objects

**SQLAlchemy** is a powerful database access tool kit for Python, with its `object-relational mapper (ORM)` being one of its most famous components.

As we know, the data model of an RDBMS does not always match the data model of object-oriented programming.
This problem is known as [object-relational impedance mismatch](https://en.wikipedia.org/wiki/Object-relational_impedance_mismatch).

The ORM provided by SQLAlchemy sits between the database and your Python program and transforms the data flow between the database engine and Python objects. SQLAlchemy allows you to think in terms of objects and still retain the powerful features of a database engine.
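
To see what an ORM automates, here is a sketch of the mapping done by hand with only the standard library. The `AuthorRecord` dataclass is hypothetical and not part of the model defined later; the database is a throwaway in-memory one.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class AuthorRecord:  # hypothetical class, not part of the tutorial's model
    author_id: int
    first_name: str
    last_name: str


map_conn = sqlite3.connect(":memory:")
map_conn.execute(
    "CREATE TABLE author (author_id INTEGER NOT NULL PRIMARY KEY, "
    "first_name VARCHAR, last_name VARCHAR)"
)
map_conn.execute(
    "INSERT INTO author (first_name, last_name) VALUES ('Isaac', 'Asimov')"
)

# The driver returns plain tuples; turning them into objects is manual work.
rows = map_conn.execute(
    "SELECT author_id, first_name, last_name FROM author"
).fetchall()
author_objects = [AuthorRecord(*row) for row in rows]
print(author_objects[0].last_name)
map_conn.close()
```

Writing this conversion for every table, and the reverse conversion for every insert and update, is the repetitive work that an ORM takes over.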


### 1.3.1 The Model

The SQLAlchemy model is a Python class defining the data mapping between the Python objects returned as a result of a database query and the underlying database tables.

In the `entity-relationship diagram`, every box is represented by a table (a Python class) in the model. The arrows are the relationships between the tables.

The tables in the model are Python classes inheriting from a SQLAlchemy `Base` class. The `Base` class provides the interface between instances of the model and the database table.

In [None]:
# creates the Base class, which is what all models inherit from and how they 
# get SQLAlchemy ORM functionality.
Base = declarative_base()

In [None]:
# create the author_publisher association table model.
author_publisher = Table(
    # table name
    "author_publisher",
    # Base.metadata provides the connection between the SQLAlchemy functionality and the database engine.
    Base.metadata,
    # Column description: name, type and, for a foreign key, a ForeignKey reference.
    # The reference creates a dependency between two Column fields in different tables.
    # A ForeignKey is how you make SQLAlchemy aware of the relationships between tables.
    # The code below defines author_id as a foreign key that refers to the primary key of the author table.
    Column("author_id", Integer, ForeignKey("author.author_id")),
    Column("publisher_id", Integer, ForeignKey("publisher.publisher_id")),
)

In [None]:
# create the book_publisher association table model.
book_publisher = Table(
    "book_publisher",
    Base.metadata,
    Column("book_id", Integer, ForeignKey("book.book_id")),
    Column("publisher_id", Integer, ForeignKey("publisher.publisher_id")),
)

In [None]:
# define the Author model class, mapped to the author database table.
class Author(Base):
    __tablename__ = "author"
    author_id = Column(Integer, primary_key=True)
    first_name = Column(String)
    last_name = Column(String)
    # One-to-many relation
    # A ForeignKey defines the existence of the relationship between tables, but not
    # the collection of books an author can have.
    # The code below defines a parent-child collection. The plural name of the books attribute
    # (a convention, not a requirement) indicates that it is a collection.
    # The first parameter is the class name Book (not the table name book): the class to
    # which the books attribute is related. The relationship informs SQLAlchemy that there is a relationship
    # between the **Author and Book classes**. SQLAlchemy finds the relationship through the ForeignKey in the
    # Book class definition (the author_id column).
    # The backref parameter creates an author attribute on each Book instance. This attribute refers to
    # the parent Author that the Book instance is related to.
    books = relationship("Book", backref=backref("author"))

    # Many-to-many relation
    # The first parameter, "Publisher", informs SQLAlchemy what the related class is.
    # "secondary" tells SQLAlchemy that the relationship to the Publisher class goes through a secondary table,
    # the author_publisher association table. SQLAlchemy then finds the publisher_id ForeignKey
    # defined in the author_publisher association table.
    # back_populates tells SQLAlchemy that there is a complementary collection
    # in the Publisher class called authors.
    publishers = relationship(
        "Publisher", secondary=author_publisher, back_populates="authors"
    )

In [None]:
# define the Book model class, mapped to the book database table.
class Book(Base):
    __tablename__ = "book"
    book_id = Column(Integer, primary_key=True)
    author_id = Column(Integer, ForeignKey("author.author_id"))
    title = Column(String)
    publishers = relationship(
        "Publisher", secondary=book_publisher, back_populates="books"
    )

In [None]:
# define the Publisher model class, mapped to the publisher database table.
class Publisher(Base):
    __tablename__ = "publisher"
    publisher_id = Column(Integer, primary_key=True)
    name = Column(String)
    authors = relationship(
        "Author", secondary=author_publisher, back_populates="publishers"
    )
    books = relationship(
        "Book", secondary=book_publisher, back_populates="publishers"
    )
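
A possible way to exercise the model is sketched below. It is self-contained, so it repeats a trimmed copy of the classes (the Book to Publisher relationship is left out for brevity) and makes a few assumptions that are not in the text above: SQLAlchemy 1.4 or later, an in-memory SQLite database, and the `create_engine`, `sessionmaker`, and legacy `session.query` APIs as the way to talk to it.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, Table, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

author_publisher = Table(
    "author_publisher",
    Base.metadata,
    Column("author_id", Integer, ForeignKey("author.author_id")),
    Column("publisher_id", Integer, ForeignKey("publisher.publisher_id")),
)


class Author(Base):
    __tablename__ = "author"
    author_id = Column(Integer, primary_key=True)
    first_name = Column(String)
    last_name = Column(String)
    books = relationship("Book", backref="author")
    publishers = relationship(
        "Publisher", secondary=author_publisher, back_populates="authors"
    )


class Book(Base):
    __tablename__ = "book"
    book_id = Column(Integer, primary_key=True)
    author_id = Column(Integer, ForeignKey("author.author_id"))
    title = Column(String)


class Publisher(Base):
    __tablename__ = "publisher"
    publisher_id = Column(Integer, primary_key=True)
    name = Column(String)
    authors = relationship(
        "Author", secondary=author_publisher, back_populates="publishers"
    )


# In-memory SQLite database; the tables are created from the model metadata.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

with Session() as session:
    author = Author(first_name="Isaac", last_name="Asimov")
    author.books.append(Book(title="Foundation"))
    author.publishers.append(Publisher(name="Random House"))
    session.add(author)  # the related Book and Publisher are saved by cascade
    session.commit()

    found = session.query(Author).filter_by(last_name="Asimov").one()
    titles = [book.title for book in found.books]
    publisher_names = [publisher.name for publisher in found.publishers]
    book_author = found.books[0].author.first_name  # attribute added by backref

print(titles, publisher_names, book_author)
```

No SQL appears in the usage part: the rows in the author, book, publisher, and author_publisher tables are written and read through the Python objects and their relationship attributes.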