# Big(ger) Data Notes

In this class, we have been working with small data sets; the largest have been between 100 MB and 1.5 GB.  These data sets can be stored in CSV files, loaded into memory in their entirety, and then manipulated in **pandas**.  Data sets that you encounter in practice will often be much larger, and you'll have to work with alternative technology to handle them.

## Relational Databases

Relational databases are the most popular way to store rectangular (i.e. structured) data.  A relational database organizes data into tables with rows and columns, where each row represents a record and each column represents an attribute.  Tables can be linked to one another through shared key columns, which makes it easy to access and manipulate related information efficiently.

Relational databases are the standard database backend for many applications such as general web applications and position management systems.

Common examples of relational databases are:
1. Postgres
2. MySQL
3. Microsoft SQL Server
4. Oracle
5. SQLite
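
To make the idea of tables with defined relationships concrete, here is a minimal sketch using Python's built-in `sqlite3` module.  The `accounts` and `positions` tables, their columns, and the rows are all made up for illustration; they loosely mimic a position management system and are not taken from any real application.

```python
import sqlite3

# In-memory database; all table names, column names, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        owner      TEXT NOT NULL
    );
    CREATE TABLE positions (
        position_id INTEGER PRIMARY KEY,
        account_id  INTEGER NOT NULL REFERENCES accounts(account_id),
        ticker      TEXT NOT NULL,
        quantity    INTEGER NOT NULL
    );
""")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
conn.executemany(
    "INSERT INTO positions (account_id, ticker, quantity) VALUES (?, ?, ?)",
    [(1, "SPY", 100), (1, "QQQ", 50), (2, "SPY", 25)],
)

# Join the two tables on the shared key and aggregate per owner.
shares_by_owner = conn.execute("""
    SELECT a.owner, SUM(p.quantity) AS total_shares
    FROM accounts AS a
    JOIN positions AS p ON p.account_id = a.account_id
    GROUP BY a.owner
    ORDER BY a.owner
""").fetchall()
conn.close()

print(shares_by_owner)  # [('Alice', 150), ('Bob', 25)]
```

The `account_id` column is what relates the two tables: each position row points at exactly one account row, and the `JOIN` follows that link.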

## Columnar Databases and Data Formats

Traditional relational database management systems, such as Postgres and MySQL, are row oriented.  This means that the rows of data are stored close together in memory.  This is ideal for transactional (OLTP) databases: think banking or credit card transactions.  Columnar databases are column oriented, meaning that the columns of data are stored close together in memory.  This is ideal for analytical querying: think data science, machine learning, time-series analysis.
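
The difference between the two layouts can be sketched in plain Python.  This is only an analogy, assuming a tiny made-up table of trades: a list of tuples stands in for row-oriented storage and a dict of lists stands in for column-oriented storage.  Real database engines lay out bytes far more carefully than this.

```python
# Row-oriented: all fields of one record sit together.
# The trade ids, tickers, and prices are made-up illustration values.
rows = [
    ("t1", "AAA", 10.0),
    ("t2", "BBB", 20.0),
    ("t3", "AAA", 30.0),
]

# Column-oriented: all values of one attribute sit together.
columns = {
    "trade_id": [r[0] for r in rows],
    "ticker": [r[1] for r in rows],
    "price": [r[2] for r in rows],
}

# Transactional access: fetch one whole record.
# This is a single lookup in the row layout.
record = rows[1]

# Analytical access: aggregate one attribute over every record.
# The row layout has to visit each record and pick out one field...
mean_from_rows = sum(r[2] for r in rows) / len(rows)
# ...while the column layout reads a single list and ignores the other columns.
mean_from_columns = sum(columns["price"]) / len(columns["price"])

print(record)             # ('t2', 'BBB', 20.0)
print(mean_from_rows)     # 20.0
print(mean_from_columns)  # 20.0
```

Both layouts give the same answer; what changes is how much unrelated data has to be touched to get it.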

1. Examples of columnar databases:
    - DuckDB
    - AWS Redshift
    - kdb+
2. Examples of columnar data storage formats:
    - Apache Parquet
    - Apache ORC
3. Apache Arrow
    - A project supported by the Apache Software Foundation.
    - Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs.
    - Polars is built on the Arrow memory model.  Apache Spark can exchange data in Arrow format, and Apache Parquet is a separate on-disk format that Arrow libraries can read and write.

## SQLite

SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine.  It runs in-process and is serverless, zero-configuration, and transactional.

Unlike most other SQL databases, SQLite does not have a separate server process. SQLite reads and writes directly to ordinary disk files. A complete SQL database with multiple tables, indices, triggers, and views is contained in a single disk file.
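
As a small illustration of the single-file, serverless design, the sketch below uses Python's built-in `sqlite3` module.  The file name `example.db` and the `trades` table, index, and view are hypothetical names chosen for this example.

```python
import os
import sqlite3
import tempfile

# Hypothetical file name inside a throwaway directory.
tmp_dir = tempfile.mkdtemp()
db_path = os.path.join(tmp_dir, "example.db")

# Connecting creates the file; no server is started or contacted.
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE trades (ticker TEXT, quantity INTEGER)")
conn.execute("CREATE INDEX idx_trades_ticker ON trades (ticker)")
conn.execute(
    "CREATE VIEW big_trades AS SELECT * FROM trades WHERE quantity >= 100"
)
conn.executemany("INSERT INTO trades VALUES (?, ?)", [("SPY", 100), ("QQQ", 10)])
conn.commit()
conn.close()

# Reopen the same file: the table, index, view, and rows all live in it.
conn = sqlite3.connect(db_path)
objects = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()
big = conn.execute("SELECT ticker, quantity FROM big_trades").fetchall()
conn.close()

print(os.path.isfile(db_path))  # True
print(objects)  # [('index', 'idx_trades_ticker'), ('table', 'trades'), ('view', 'big_trades')]
print(big)      # [('SPY', 100)]
```

Copying `example.db` to another machine would copy the whole database, schema and data included.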

The SQLite project estimates that there are over 1 trillion (1e12) SQLite databases in active use.

## Polars

Polars is a `DataFrame` library written in Rust; it has a Python API via the **polars** package.  Polars is relatively new and is gaining ground on **pandas** due to its ability to handle larger data and its expressive syntax.  Polars uses the Apache Arrow data model.

## DuckDB

DuckDB is an in-process OLAP database that is written in C++.  It uses a columnar data model.  You can think of it as the SQLite for data analytics.

## Spark

Apache Spark is an open-source unified analytics engine for large-scale data processing. It has its architectural foundation in the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines.  It has an associated machine learning library called MLlib.
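
The RDD idea can be imitated on a single machine in plain Python.  The sketch below is a toy model and not Spark's API: the helper functions `parallelize`, `map_partitions`, and `reduce_partitions` are hypothetical, and the "cluster" is just a list of tuples in one process.  It is meant only to show read-only partitions being transformed into new partitions and then combined.

```python
from functools import reduce


def parallelize(items, num_partitions):
    """Split items round-robin into immutable partitions (tuples)."""
    return [tuple(items[i::num_partitions]) for i in range(num_partitions)]


def map_partitions(partitions, fn):
    """Apply fn to every item, returning new partitions; inputs are not modified."""
    return [tuple(fn(x) for x in part) for part in partitions]


def reduce_partitions(partitions, fn):
    """Reduce within each partition, then combine the partial results."""
    partials = [reduce(fn, part) for part in partitions if part]
    return reduce(fn, partials)


data = list(range(1, 11))
partitions = parallelize(data, 3)  # each tuple stands in for one machine's share
squared = map_partitions(partitions, lambda x: x * x)
total = reduce_partitions(squared, lambda a, b: a + b)

print(partitions)  # [(1, 4, 7, 10), (2, 5, 8), (3, 6, 9)]
print(total)       # 385
```

In a real cluster, each partition would live on a different machine and the per-partition work would run in parallel; only the small partial results would need to travel over the network for the final combine.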