# Storing Data

## Data Structures



[Slides: Data Structures](https://drive.google.com/open?id=1unuzodcSQ4De9SQJXoj0yAdWGvZ1IPBw&usp=drive_fs)

### Structured Data
**Structured data** is **highly organized** and easy to search. It follows a rigid format, like a spreadsheet with predefined columns, where each column holds one data type such as text, date, or decimal. This fixed format makes it easy to define relationships between records, which is why structured data suits **relational databases**.

- **Example**: SQL (Structured Query Language) is used to query structured data.
- **Prevalence**: By common estimates, only about 20% of data is structured.

#### Employee Table

A structured data example is Spotflix's employee table:

- Each row represents an employee.
- Each column contains specific information (e.g., team, role).
- The index serves as a unique identifier.
- Logical values (e.g., true/false) exist for certain attributes.

### Relational Database
Structured data can be connected across multiple tables to form a relational database.

- **Example**: An employee table can relate to an office table through a shared column.
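
The shared-column idea can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch, not Spotflix's actual schema: the table layouts, the `office` key, and all sample values are assumptions made up for illustration.

```python
import sqlite3

# In-memory database; every name and value below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offices (office TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        first_name TEXT,
        office TEXT
    );
    INSERT INTO offices VALUES ('North', 'City A'), ('South', 'City B');
    INSERT INTO employees VALUES (1, 'Ada', 'North'), (2, 'Grace', 'South'), (3, 'Alan', 'North');
""")

# The shared `office` column links each employee row to one office row.
rows = conn.execute("""
    SELECT e.first_name, o.city
    FROM employees AS e
    JOIN offices AS o ON e.office = o.office
    ORDER BY e.employee_id
""").fetchall()

print(rows)
```

Because the office details live in one place, changing a city in `offices` is reflected for every employee in that office without touching the `employees` table.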

### Semi-Structured Data
Semi-structured data is more **flexible** than structured data but still maintains some level of organization.

- It allows for varying numbers of attributes within records.
- Records can be grouped to form relationships, though less straightforwardly than with structured data.
- **Stored in**: NoSQL databases, typically in formats such as JSON, XML, or YAML.

#### Favorite Artists JSON File
Example of a JSON file storing users' favorite artists:

``` json
{
  "user_id": 123,
  "first_name": "John",
  "last_name": "Doe",
  "favorite_artists": ["Artist A", "Artist B", "Artist C"]
}
```

* Users have varying numbers of favorite artists.
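* A single relational table with fixed columns cannot hold a variable-length list like this directly; it would typically be modeled with a separate table holding one row per user-artist pair.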
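
To make the varying-attributes point concrete, here is a small sketch using Python's standard `json` module. The second user record and the `"(unknown)"` default are assumptions added for illustration; only the first record mirrors the example above.

```python
import json

# Two records with different shapes: the second has one artist and no last name.
raw = """
[
  {"user_id": 123, "first_name": "John", "last_name": "Doe",
   "favorite_artists": ["Artist A", "Artist B", "Artist C"]},
  {"user_id": 456, "first_name": "Jane",
   "favorite_artists": ["Artist D"]}
]
"""

users = json.loads(raw)

# Each record carries its own set of keys, so optional fields need a default.
for user in users:
    last_name = user.get("last_name", "(unknown)")
    print(user["user_id"], last_name, len(user["favorite_artists"]))

# Number of favorite artists per user, keyed by user_id.
artist_counts = {u["user_id"]: len(u["favorite_artists"]) for u in users}
```

Code that reads semi-structured data has to tolerate missing keys, which is the price paid for the flexibility.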

### Unstructured Data
Unstructured data **does not follow a fixed model** and cannot be stored in traditional rows and columns.

* Examples: Text, audio, images, videos.
* Storage: Typically stored in data lakes, but can also appear in data warehouses or databases.
* Challenges: Hard to search and organize.
* Solution: Machine learning and AI help extract value from unstructured data.

#### Spotflix's Unstructured Data
Spotflix deals with various forms of unstructured data, including:

* Lyrics
* Songs
* Pictures (album covers, artist profiles)
* Videos (music videos)

##### Adding Some Structure
To improve searchability, Spotflix could:

* Use **machine learning** to analyze song spectrums, BPM, chord progressions, and genres.
* Ask artists to provide metadata (e.g., genre, tags), making data **semi-structured**.


### Summary
* **Structured data**: Rigid format, easily searchable, used in relational databases.
* **Semi-structured data**: Some organization, flexible attributes, stored in NoSQL formats.
* **Unstructured data**: No predefined model, harder to manage but valuable with AI/ML.

## SQL Databases

### SQL Overview
SQL (Structured Query Language) is the standard language for querying **Relational Database Management Systems (RDBMS)**. A single query can access many records at once and filter, group, and aggregate them. SQL is widely used by **data engineers** to create and maintain databases and by **data scientists** to query them.


### SQL for Data Engineers
Data engineers use SQL to **create and manage databases**. Below is an example of creating an `employees` table:





``` sql
CREATE TABLE employees (
    employee_id INT,
    first_name VARCHAR(255),
    last_name VARCHAR(255),
    role VARCHAR(255),
    team VARCHAR(255),
    full_time BOOLEAN,
    office VARCHAR(255)
);
```

- `INT` stores whole numbers (e.g., `employee_id`).
- `VARCHAR(255)` stores text with a max length of 255 characters (e.g., `first_name`, `last_name`).
- `BOOLEAN` holds logical values (true or false); some implementations store them as `1` and `0`.
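
The table definition can be exercised end to end with Python's built-in `sqlite3` module. This is a sketch under assumptions: SQLite stands in for whatever RDBMS is actually used, and the two sample employees are invented. It creates the table, adds rows, and then changes one of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same definition as above. SQLite maps the declared types onto its own
# storage classes, so BOOLEAN values are stored and returned as 0 or 1.
conn.execute("""
    CREATE TABLE employees (
        employee_id INT,
        first_name VARCHAR(255),
        last_name VARCHAR(255),
        role VARCHAR(255),
        team VARCHAR(255),
        full_time BOOLEAN,
        office VARCHAR(255)
    )
""")

# Hypothetical rows, inserted with placeholders instead of string formatting.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "Ada", "Smith", "Data Engineer", "Platform", True, "North"),
        (2, "Alan", "Jones", "Designer", "Product", False, "South"),
    ],
)

# UPDATE modifies the existing rows that match the WHERE clause.
conn.execute("UPDATE employees SET full_time = 1 WHERE employee_id = ?", (2,))
conn.commit()

rows = conn.execute(
    "SELECT employee_id, first_name, full_time FROM employees ORDER BY employee_id"
).fetchall()
print(rows)
```

Other database systems may enforce the declared lengths and types more strictly than SQLite does, so treat this as a way to experiment rather than a reference for type behavior.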

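Data engineers also use other SQL statements, such as `INSERT` and `UPDATE`, to add and modify records in tables.

### SQL for Data Scientists
Data scientists use SQL to **query** databases. The example below returns the first and last names of employees whose role contains "Data":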

``` sql
SELECT first_name, last_name
FROM employees
WHERE role LIKE '%Data%';
```
* The `%` wildcard allows `"Data"` to appear **anywhere** in the role title.
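
The query can be run against a few invented rows to see which roles the pattern matches. This sketch assumes SQLite via Python's `sqlite3` module; the trimmed-down table and the sample employees are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, last_name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ada", "Smith", "Data Engineer"),          # "Data" at the start
        ("Alan", "Jones", "Senior Data Scientist"),  # "Data" in the middle
        ("Grace", "Lee", "Product Designer"),        # no "Data" at all
    ],
)

# %Data% matches any role that contains "Data", whatever comes before or after.
matches = conn.execute("""
    SELECT first_name, last_name
    FROM employees
    WHERE role LIKE '%Data%'
    ORDER BY first_name
""").fetchall()

print(matches)
```

Whether `LIKE` is case-sensitive depends on the implementation; in SQLite it ignores case for ASCII characters by default.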


### Database Schema
Databases consist of multiple related tables. The **schema** defines how they are connected.

### Spotflix Database Example
Spotflix organizes its music data using multiple related tables:

#### Albums Table
- Stores album details (`album_id`, `artist_id`, `title`, etc.).

#### Artists Table
- Stores artist details (`artist_id`, `name`, `biography`).

#### Linking Albums and Artists
- The `artist_id` connects the **artists** table to the **albums** table.

#### Songs Table
- Stores song details (`song_id`, `album_id`, `title`, etc.).

#### Linking Albums and Songs
- The `album_id` connects the **albums** table to the **songs** table.

#### Playlists Table
- Stores playlist details (`playlist_id`, `user_id`, `song_id`, etc.).

#### Linking Playlists and Songs
- The `song_id` connects the **playlists** table to the **songs** table.

These links between tables are what make the database **relational**, and they allow efficient **data retrieval and organization**.
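
As a rough illustration of how the shared ID columns tie the tables together, the sketch below builds a cut-down version of the artists, albums, and songs tables in SQLite. The column lists are shortened, the sample rows are invented, and the `REFERENCES` constraints are an assumption about how the links might be declared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE artists (artist_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE albums (
        album_id INTEGER PRIMARY KEY,
        artist_id INTEGER REFERENCES artists(artist_id),
        title TEXT
    );
    CREATE TABLE songs (
        song_id INTEGER PRIMARY KEY,
        album_id INTEGER REFERENCES albums(album_id),
        title TEXT
    );
    INSERT INTO artists VALUES (1, 'Artist A');
    INSERT INTO albums VALUES (10, 1, 'Album X');
    INSERT INTO songs VALUES (100, 10, 'Song One'), (101, 10, 'Song Two');
""")

# Follow song -> album -> artist through the shared ID columns.
rows = conn.execute("""
    SELECT ar.name, al.title, s.title
    FROM songs AS s
    JOIN albums AS al ON s.album_id = al.album_id
    JOIN artists AS ar ON al.artist_id = ar.artist_id
    ORDER BY s.song_id
""").fetchall()
print(rows)

# A song pointing at a non-existent album violates the declared relationship.
try:
    conn.execute("INSERT INTO songs VALUES (102, 999, 'Orphan Song')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("orphan rejected:", rejected)
```

Declaring the relationships in the schema lets the database itself refuse rows that would break a link, rather than leaving that check to application code.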


### SQL Implementations
Several SQL implementations exist, such as **MySQL, PostgreSQL, SQLite, and SQL Server**. Their dialects differ slightly, but core SQL carries over from one to another, much like switching between keyboard layouts.

### Summary
- **SQL** is the **standard language** for relational databases.
- **Data engineers** create and manage tables.
- **Data scientists** query and analyze data.
- **Database schemas** define table relationships.

## Data Warehouses & Data Lakes

### Data Pipeline Overview  
A data pipeline **collects and processes data from different sources**, storing it in either a data lake or a data warehouse.

### Data Lakes vs. Data Warehouses  

#### Data Lake  
Stores all collected **raw data** in its original form, whether **structured, semi-structured, or unstructured**.  

**Characteristics:**  
- Can handle **petabytes** of data.  
- **No enforced structure**, making it cost-effective but harder to analyze.  
- Used for **real-time analytics** on big data.  
- Requires a **data catalog** to avoid becoming a **data swamp**.  

#### Data Warehouse  
Stores **specific, structured data** optimized for analytics.  

**Characteristics:**  
- More **costly** to manipulate due to enforced structure.  
- Used for **read-only, ad-hoc queries** like aggregation and summarization.  
- Supports **business decision-making** with structured analytics.
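
The kind of read-only aggregation query a warehouse serves might look like the sketch below. SQLite is only a stand-in here (it is not a data warehouse), and the `plays` table with its columns and values is a hypothetical example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (user_id INTEGER, genre TEXT, minutes REAL)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?)",
    [
        (1, "rock", 30.0),
        (2, "rock", 45.0),
        (1, "jazz", 20.0),
        (3, "jazz", 10.0),
        (3, "pop", 5.0),
    ],
)

# Summarize listening per genre: a typical ad-hoc, read-only analytical query.
summary = conn.execute("""
    SELECT genre, COUNT(*) AS play_count, SUM(minutes) AS total_minutes
    FROM plays
    GROUP BY genre
    ORDER BY total_minutes DESC
""").fetchall()

print(summary)
```

The query reads and summarizes rows without modifying them, which is the access pattern warehouses are optimized for.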


### Data Catalog for Data Lakes  
A **data catalog** is a metadata management tool that helps organizations **organize, track, and manage** data assets within a data ecosystem. It is essential for **data lakes**, which enforce no structure of their own.  

- Tracks **data sources, usage, ownership, and update frequency**.  
- Supports **data governance** (availability, usability, integrity, security).  
- Ensures **reproducibility** of analyses.  
- Prevents the **data lake** from turning into a **data swamp**.  
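
What a catalog tracks can be sketched as a small in-memory registry. Real catalogs are dedicated tools; the `CatalogEntry` and `DataCatalog` classes, their fields, and the dataset names below are all hypothetical and only mirror the bullet points above.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CatalogEntry:
    """Metadata describing one dataset in the lake (fields are illustrative)."""
    name: str
    source: str
    owner: str
    update_frequency: str
    last_updated: date
    used_by: list[str] = field(default_factory=list)


class DataCatalog:
    """Toy registry mapping dataset names to their metadata."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def find_by_owner(self, owner: str) -> list[str]:
        return sorted(e.name for e in self._entries.values() if e.owner == owner)

    def undocumented(self, lake_files: list[str]) -> list[str]:
        """Datasets present in the lake but missing from the catalog."""
        return sorted(f for f in lake_files if f not in self._entries)


catalog = DataCatalog()
catalog.register(CatalogEntry(
    "raw/lyrics", "artist uploads", "content-team", "daily", date(2024, 1, 15), ["search"]
))
catalog.register(CatalogEntry(
    "raw/play_events", "mobile app", "data-eng", "hourly", date(2024, 1, 15)
))

owned = catalog.find_by_owner("data-eng")
missing = catalog.undocumented(["raw/lyrics", "raw/play_events", "raw/tmp_export"])
print(owned, missing)
```

Datasets that show up in `missing` are the ones nobody can explain later, which is how a lake starts drifting toward a swamp.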


### Database vs. Data Warehouse  
- **Database**: A **broad term** for any organized collection of data stored on a computer.  
- **Data Warehouse**: A **specific type of database** designed for structured, analytical queries.  

### Summary  
- **Data lakes** store raw data of any kind (structured, semi-structured, or unstructured) and require a **data catalog**.  
- **Data warehouses** store structured data for analytics.  
- **Data catalogs** ensure scalability and prevent reliance on **tribal knowledge**.  
- **Databases** are a general category, while **data warehouses** are a specialized type of database.