# 1. Basic data storage

All programs process data in one form or another, and many need to save and retrieve that data from one invocation to the next. Python offers several ways to store data:
- Flat files (e.g. CSV, JSON, XML, etc.)
- RDBMS (e.g. PostgreSQL, MySQL, etc.)
- NoSQL DB (e.g. MongoDB, Neo4j, etc.)

In this tutorial, we first use flat files and then an RDBMS (`SQLite`) to store and retrieve data. Finally, we use `SQLAlchemy`, an ORM that lets you map rows in a table to Python objects. `SQLite` is a lightweight RDBMS that stores data in a single file and needs no complex installation.
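
As a rough illustration of the single-file point, the sketch below uses the standard-library `sqlite3` module to create a database file, write one row, and read it back. The `book` table, its columns, and the `example.db` file name are made up for this example and are not part of the tutorial's data.

```python
import os
import sqlite3
import tempfile


def sqlite_round_trip(db_path):
    """Create a tiny table in db_path, insert one row, and read it back."""
    conn = sqlite3.connect(db_path)  # creates the file if it does not exist
    try:
        # Hypothetical table, used only for this illustration
        conn.execute("CREATE TABLE IF NOT EXISTS book (title TEXT, publisher TEXT)")
        conn.execute("INSERT INTO book VALUES (?, ?)", ("Book A", "Publisher X"))
        conn.commit()
        return conn.execute("SELECT title, publisher FROM book").fetchall()
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp_dir:
    db_file = os.path.join(tmp_dir, "example.db")
    rows = sqlite_round_trip(db_file)
    # The whole database lives in this one file
    db_file_exists = os.path.isfile(db_file)

print(rows, db_file_exists)
```

No server process or separate installation is involved here: `sqlite3` ships with Python, and the database is an ordinary file on disk.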

## 1.1 Working with flat files 

### Advantages of Flat Files
- Easy to use and implement
- Human-readable, which makes them easy to view, edit, and transfer
- Supported by many languages and frameworks (import, export)


### Disadvantages of Flat Files
- Performance: large files are difficult to view, examine, and edit
- No explicit relations between parts of the data (tables). Relations need to be written into the business logic.
- No schema or enforced data structure. Users need to understand not only the structure of the data but also the programming tools necessary for accessing it.
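
The last two points can be sketched with the standard-library `csv` module. The two in-memory "files", their column names, and their values below are hypothetical and serve only to show where the extra work lands.

```python
import csv
import io

# Hypothetical flat-file contents; names and values are illustrative only
authors_csv = "author_id,name\n1,Author One\n2,Author Two\n"
books_csv = "title,author_id,pages\nBook A,1,100\nBook B,1,250\n"

authors = list(csv.DictReader(io.StringIO(authors_csv)))
books = list(csv.DictReader(io.StringIO(books_csv)))

# No schema: every value is read as a string, so types are converted by hand
total_pages = sum(int(book["pages"]) for book in books)

# No explicit relation: the author/book link is rebuilt in application code
names_by_id = {author["author_id"]: author["name"] for author in authors}
books_with_author = [
    (book["title"], names_by_id[book["author_id"]]) for book in books
]

print(total_pages)
print(books_with_author)
```

Nothing in the files themselves says that `pages` is a number or that `author_id` refers to a row in the other file. Both facts exist only in the code that reads them.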


In the example below, we use pandas to read the CSV file.

In [None]:
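from importlib import resources  # used by main() to locate the CSV file

import pandas as pd
from treelib import Tree  # third-party package; Tree is used in output_author_hierarchy()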

In [None]:
def get_data(file_path):
    return pd.read_csv(file_path)

def get_books_by_publisher(data, ascending=True):
    """Return the number of books by each publisher as a pandas series"""
    return data.groupby("publisher").size().sort_values(ascending=ascending)

def get_authors_by_publisher(data, ascending=True):
    """Returns the number of authors by each publisher as a pandas series"""
    return (
        data.assign(name=data.first_name.str.cat(data.last_name, sep=" "))
        .groupby("publisher")
        .nunique()
        .loc[:, "name"]
        .sort_values(ascending=ascending)
    )

def add_new_book(data, author_name, book_title, publisher_name):
    """Adds a new book to the system"""
    # Does the book exist?
    first_name, _, last_name = author_name.partition(" ")
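    book_exists = (
        (data.first_name == first_name)
        & (data.last_name == last_name)
        & (data.title == book_title)
        & (data.publisher == publisher_name)
    ).any()
    if book_exists:
        return data
    # Add the new book. DataFrame.append was removed in pandas 2.0,
    # so build a one-row frame and concatenate it instead.
    new_book = pd.DataFrame(
        [
            {
                "first_name": first_name,
                "last_name": last_name,
                "title": book_title,
                "publisher": publisher_name,
            }
        ]
    )
    return pd.concat([data, new_book], ignore_index=True)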

def output_author_hierarchy(data):
    """Output the data as a hierarchy list of authors"""
    authors = data.assign(
        name=data.first_name.str.cat(data.last_name, sep=" ")
    )
    authors_tree = Tree()
    authors_tree.create_node("Authors", "authors")
    for author, books in authors.groupby("name"):
        authors_tree.create_node(author, author, parent="authors")
        for book, publishers in books.groupby("title")["publisher"]:
            book_id = f"{author}:{book}"
            authors_tree.create_node(book, book_id, parent=author)
            for publisher in publishers:
                authors_tree.create_node(publisher, parent=book_id)

    # Output the hierarchical authors data
    authors_tree.show()
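
To experiment with the two aggregations without the CSV file at hand, the sketch below builds a small in-memory table with the same four columns the functions expect (`first_name`, `last_name`, `title`, `publisher`). The rows are invented sample data, and the grouping expressions are written inline so the cell runs on its own.

```python
import io

import pandas as pd

# Invented sample rows using the column names the functions above rely on
sample_csv = """first_name,last_name,title,publisher
Ann,Author,Book A,Publisher X
Ann,Author,Book B,Publisher Y
Bob,Writer,Book C,Publisher X
Bob,Writer,Book D,Publisher X
"""
data = pd.read_csv(io.StringIO(sample_csv))

# Number of books per publisher, largest first
books_by_publisher = (
    data.groupby("publisher").size().sort_values(ascending=False)
)

# Number of distinct authors per publisher, largest first
authors_by_publisher = (
    data.assign(name=data.first_name.str.cat(data.last_name, sep=" "))
    .groupby("publisher")["name"]
    .nunique()
    .sort_values(ascending=False)
)

print(books_by_publisher.to_dict())
print(authors_by_publisher.to_dict())
```

With this sample, "Publisher X" has three books by two distinct authors and "Publisher Y" has one book by one author.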

The main function below calls the functions defined above.

In [1]:
def main():
    """The main entry point of the program"""
    # Get the resources for the program
    with resources.path(
        "project.data", "author_book_publisher.csv"
    ) as filepath:
        data = get_data(filepath)

    # Get the number of books printed by each publisher
    books_by_publisher = get_books_by_publisher(data, ascending=False)
    for publisher, total_books in books_by_publisher.items():
        print(f"Publisher: {publisher}, total books: {total_books}")
    print()

    # Get the number of authors each publisher publishes
    authors_by_publisher = get_authors_by_publisher(data, ascending=False)
    for publisher, total_authors in authors_by_publisher.items():
        print(f"Publisher: {publisher}, total authors: {total_authors}")
    print()

    # Output hierarchical authors data
    output_author_hierarchy(data)

    # Add a new book to the data structure
    data = add_new_book(
        data,
        author_name="Stephen King",
        book_title="The Stand",
        publisher_name="Random House",
    )

    # Output the updated hierarchical authors data
    output_author_hierarchy(data)