# Storing Data

## Data Structures



[Slides: Data Structures](https://drive.google.com/open?id=1unuzodcSQ4De9SQJXoj0yAdWGvZ1IPBw&usp=drive_fs)

### Structured Data
**Structured data** is **highly organized** and easy to search. It follows a rigid format, like a spreadsheet with predefined columns. Each column contains specific data types such as text, date, or decimal values. This structured format allows for easy relationships between data, making it suitable for **relational databases**.

- **Example**: SQL (Structured Query Language) is used to query structured data.
- **Prevalence**: An estimated 20% of all data is structured.

#### Employee Table

A structured data example is Spotflix's employee table:

- Each row represents an employee.
- Each column contains specific information (e.g., team, role).
- The index serves as a unique identifier.
- Logical values (e.g., true/false) exist for certain attributes.

### Relational Database
Structured data can be connected across multiple tables to form a relational database.

- **Example**: An employee table can relate to an office table through a shared column.
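
The shared-column idea can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not Spotflix's actual schema: the `offices` table, the `office_id` column, and the sample rows are all assumptions made for the example.

```python
import sqlite3

# In-memory database; table names, column names, and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offices (office_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        first_name  TEXT,
        office_id   INTEGER REFERENCES offices(office_id)  -- shared column
    );
    INSERT INTO offices VALUES (1, 'Berlin'), (2, 'Lisbon');
    INSERT INTO employees VALUES (10, 'Ada', 1), (11, 'Linus', 2), (12, 'Grace', 1);
""")

# The shared office_id column lets a query combine rows from both tables.
employee_cities = conn.execute("""
    SELECT e.first_name, o.city
    FROM employees AS e
    JOIN offices AS o ON e.office_id = o.office_id
    ORDER BY e.employee_id
""").fetchall()
```

Each employee row stores only an `office_id`; the office details live once in the `offices` table and are looked up through the join.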

### Semi-Structured Data
Semi-structured data is more **flexible** than structured data but still maintains some level of organization.

- It allows for varying numbers of attributes within records.
- It can be grouped to form relationships, though less straightforward than structured data.
- **Stored in**: NoSQL databases, typically in formats such as JSON, XML, or YAML.

#### Favorite Artists JSON File
Example of a JSON file storing users' favorite artists:

``` json
{
  "user_id": 123,
  "first_name": "John",
  "last_name": "Doe",
  "favorite_artists": ["Artist A", "Artist B", "Artist C"]
}
```

* Users have varying numbers of favorite artists.
* A single relational table with fixed columns cannot hold this directly; it would need a separate table linking users to artists.
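
As a small sketch of that flexibility, the snippet below parses two records with Python's `json` module. The second user (`124`, "Jane Roe") is a made-up record added only to show a different list length.

```python
import json

# Two hypothetical user records with different numbers of favorite artists.
raw = """
[
  {"user_id": 123, "first_name": "John", "last_name": "Doe",
   "favorite_artists": ["Artist A", "Artist B", "Artist C"]},
  {"user_id": 124, "first_name": "Jane", "last_name": "Roe",
   "favorite_artists": ["Artist D"]}
]
"""
users = json.loads(raw)

# Each record carries its own list, so lengths can differ freely.
artist_counts = {u["user_id"]: len(u["favorite_artists"]) for u in users}
```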

### Unstructured Data
Unstructured data **does not follow a fixed model** and cannot be stored in traditional rows and columns.

* Examples: Text, audio, images, videos.
* Storage: Typically stored in data lakes, but can also appear in data warehouses or databases.
* Challenges: Hard to search and organize.
* Solution: Machine learning and AI help extract value from unstructured data.

#### Spotflix's Unstructured Data
Spotflix deals with various forms of unstructured data, including:

* Lyrics
* Songs
* Pictures (album covers, artist profiles)
* Videos (music videos)

##### Adding Some Structure
To improve searchability, Spotflix could:

* Use **machine learning** to analyze song spectrums, BPM, chord progressions, and genres.
* Ask artists to provide metadata (e.g., genre, tags), making data **semi-structured**.


### Summary
* **Structured data**: Rigid format, easily searchable, used in relational databases.
* **Semi-structured data**: Some organization, flexible attributes, stored in NoSQL formats.
* **Unstructured data**: No predefined model, harder to manage but valuable with AI/ML.

## SQL Databases

### SQL Overview
SQL (Structured Query Language) is the standard language for querying **Relational Database Management Systems (RDBMS)**. It allows for accessing multiple records at once, filtering, grouping, and aggregating data. SQL is widely used by **data engineers** to create and maintain databases and by **data scientists** to query them.


### SQL for Data Engineers
Data engineers use SQL to **create and manage databases**. Below is an example of creating an `employees` table:





``` sql
CREATE TABLE employees (
    employee_id INT,
    first_name VARCHAR(255),
    last_name VARCHAR(255),
    role VARCHAR(255),
    team VARCHAR(255),
    full_time BOOLEAN,
    office VARCHAR(255)
);
```

- `INT` stores whole numbers (e.g., `employee_id`).
- `VARCHAR(255)` stores text with a max length of 255 characters (e.g., `first_name`, `last_name`).
- `BOOLEAN` holds logical values (`0` for false, `1` for true).

Data engineers also use SQL statements such as `INSERT` and `UPDATE` to add and modify records in tables.
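
A minimal sketch of inserting and updating a record, run through Python's `sqlite3` module so it can be executed without a database server. The employee values are invented for illustration, and other SQL implementations may treat the column types slightly differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same columns as the CREATE TABLE statement above.
conn.execute("""
    CREATE TABLE employees (
        employee_id INT, first_name VARCHAR(255), last_name VARCHAR(255),
        role VARCHAR(255), team VARCHAR(255), full_time BOOLEAN, office VARCHAR(255)
    )
""")

# INSERT adds a record; the ? placeholders keep values out of the SQL string.
conn.execute(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, "Julian", "Rossi", "Data Engineer", "Platform", 1, "Rome"),
)

# UPDATE modifies an existing record, here after a hypothetical office move.
conn.execute("UPDATE employees SET office = ? WHERE employee_id = ?", ("Milan", 1))

office = conn.execute(
    "SELECT office FROM employees WHERE employee_id = 1"
).fetchone()[0]
```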

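### SQL for Data Scientists
Data scientists use SQL to **query** databases. The example below selects the names of all employees whose role contains "Data":

``` sql
SELECT first_name, last_name
FROM employees
WHERE role LIKE '%Data%';
```
* The `%` wildcard allows `"Data"` to appear **anywhere** in the role title.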
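
To see the wildcard in action, here is a small sketch using Python's `sqlite3` module with three made-up employees. Note that SQLite's `LIKE` is case-insensitive for ASCII characters by default, which may differ from other implementations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, last_name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "Silva", "Data Engineer"),         # "Data" at the start
        ("Ben", "Okoro", "Senior Data Scientist"),  # "Data" in the middle
        ("Cy", "Lee", "Product Manager"),           # no "Data" at all
    ],
)

matches = conn.execute(
    "SELECT first_name, last_name FROM employees "
    "WHERE role LIKE '%Data%' ORDER BY first_name"
).fetchall()
```

Both roles containing "Data" match, regardless of where the word appears in the title.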


### Database Schema
Databases consist of multiple related tables. The **schema** defines how they are connected.

### Spotflix Database Example
Spotflix organizes its music data using multiple related tables:

#### Albums Table
- Stores album details (`album_id`, `artist_id`, `title`, etc.).

#### Artists Table
- Stores artist details (`artist_id`, `name`, `biography`).

#### Linking Albums and Artists
- The `artist_id` connects the **artists** table to the **albums** table.

#### Songs Table
- Stores song details (`song_id`, `album_id`, `title`, etc.).

#### Linking Albums and Songs
- The `album_id` connects the **albums** table to the **songs** table.

#### Playlists Table
- Stores playlist details (`playlist_id`, `user_id`, `song_id`, etc.).

#### Linking Playlists and Songs
- The `song_id` connects the **playlists** table to the **songs** table.

These relationships between tables are what make the database **relational**, allowing efficient **data retrieval and organization**.
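
The links described above can be sketched as follows with Python's `sqlite3` module. The tables are trimmed to a few columns and the rows are placeholders; only the `artist_id` and `album_id` links come from the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artists (artist_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE albums  (album_id INTEGER PRIMARY KEY,
                          artist_id INTEGER REFERENCES artists(artist_id),
                          title TEXT);
    CREATE TABLE songs   (song_id INTEGER PRIMARY KEY,
                          album_id INTEGER REFERENCES albums(album_id),
                          title TEXT);
    INSERT INTO artists VALUES (1, 'Artist A');
    INSERT INTO albums  VALUES (10, 1, 'First Album');
    INSERT INTO songs   VALUES (100, 10, 'Opening Track'), (101, 10, 'Closing Track');
""")

# Follow song -> album -> artist through the shared id columns.
rows = conn.execute("""
    SELECT s.title, al.title, ar.name
    FROM songs AS s
    JOIN albums  AS al ON s.album_id = al.album_id
    JOIN artists AS ar ON al.artist_id = ar.artist_id
    ORDER BY s.song_id
""").fetchall()
```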


### SQL Implementations
Different SQL implementations exist, such as **MySQL, PostgreSQL, SQLite, and SQL Server**. Their dialects differ slightly, but the core syntax carries over, much like switching between keyboard layouts.

### Summary
- **SQL** is the **standard language** for relational databases.
- **Data engineers** create and manage tables.
- **Data scientists** query and analyze data.
- **Database schemas** define table relationships.

## Data Warehouses & Data Lakes

### Data Pipeline Overview  
A data pipeline **collects and processes data from different sources**, storing it in either a data lake or a data warehouse.

### Data Lakes vs. Data Warehouses  

#### Data Lake  
Stores all collected **raw data** in its original form, whether **structured, semi-structured, or unstructured**.  

**Characteristics:**  
- Can handle **petabytes** of data.  
- **No enforced structure**, making it cost-effective but harder to analyze.  
- Used for **real-time analytics** on big data.  
- Requires a **data catalog** to avoid becoming a **data swamp**.  

#### Data Warehouse  
Stores **specific, structured data** optimized for analytics.  

**Characteristics:**  
- More **costly** to manipulate due to enforced structure.  
- Used for **read-only, ad-hoc queries** like aggregation and summarization.  
- Supports **business decision-making** with structured analytics.


### Data Catalog for Data Lakes  
A **data catalog** is a metadata management tool that helps organizations **organize, track, and manage** data assets within a data ecosystem. It is essential for managing **data lakes**, which have no enforced structure.

- Tracks **data sources, usage, ownership, and update frequency**.  
- Supports **data governance** (availability, usability, integrity, security).  
- Ensures **reproducibility** of analyses.  
- Prevents the **data lake** from turning into a **data swamp**.  


### Database vs. Data Warehouse  
- **Database**: A **broad term** for any organized collection of data stored on a computer.  
- **Data Warehouse**: A **specific type of database** designed for structured, analytical queries.  

### Summary  
- **Data lakes** store raw data of any structure and require a **data catalog**.
- **Data warehouses** store structured data for analytics.  
- **Data catalogs** ensure scalability and prevent reliance on **tribal knowledge**.  
- **Databases** are a general category, while **data warehouses** are a specialized type of database.

# Moving & Processing Data

## Processing Data
Data processing is the final step in the data engineering workflow, where raw data is transformed into meaningful information.


### Data Pipeline Overview
- Moving data to a **data lake**  
- Splitting data into **different tables**  
- Removing **corrupted tracks**  

These are all examples of **data processing**.

### What is Data Processing?
**Data processing** is the conversion of raw data into meaningful information.

### Why Process Data?
- **Optimizing Storage & Costs**:  
  - Unnecessary data can be removed after a feature rollout.  
  - Storing and processing data costs money, so optimization is essential.  
  - **Compression** reduces storage needs: uncompressed data can be **10x larger** than compressed.

- **Transforming Data for Usability**:  
  - Data may need conversion to a different type for easier use.  
  - Example: **File size vs. sound quality tradeoff** in music files.
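
A rough sketch of how compression shrinks data, using Python's `zlib` module. The log lines are synthetic and highly repetitive (an assumption chosen to make the effect visible), so the ratio here says nothing about real workloads. Unlike the music-file tradeoff, this kind of compression is lossless.

```python
import zlib

# Synthetic, highly repetitive log lines; real data usually compresses less.
raw = ("user=123 action=play song_id=42\n" * 1000).encode("utf-8")

compressed = zlib.compress(raw)
ratio = len(raw) / len(compressed)  # how many times larger the raw data is

# Lossless: decompressing restores the exact original bytes.
restored = zlib.decompress(compressed)
```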

### Data Processing at Spotflix
- **Music Uploads**:  
  - Artists upload high-quality **.wav** or **.flac** files.
  - Streaming these large files would **incur high network costs**.
  - Spotflix converts them to **.ogg format**, which is smaller but slightly lower in quality.

- **Metadata Extraction**:  
  - Music files contain **artist names, genres, etc.**  
  - This metadata is processed and stored in a **database** for easy access.  

- **Structuring Employee Data**:  
  - Data is formatted to match a specific schema (e.g., separating first and last names).  
  - **Logical classification** is applied (e.g., part-time vs. full-time employees).  
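
The employee-data step can be sketched in a few lines of Python. The input field names (`name`, `hours_per_week`) and the 35-hour threshold are assumptions for illustration; the notes only say that names are split and employees are classified as part-time or full-time.

```python
def structure_employee(record):
    """Split a full name and derive a full_time flag from weekly hours."""
    # Everything after the first space is treated as the last name.
    first_name, _, last_name = record["name"].partition(" ")
    return {
        "first_name": first_name,
        "last_name": last_name,
        # Assumed rule for illustration: 35 or more hours counts as full-time.
        "full_time": record["hours_per_week"] >= 35,
    }

raw_records = [
    {"name": "John Doe", "hours_per_week": 40},
    {"name": "Mary Ann Smith", "hours_per_week": 20},
]
structured = [structure_employee(r) for r in raw_records]
```

Splitting on the first space is a simplification; real names would need a more careful rule.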

### Benefits of Data Processing
- **Easier access for analysts**:  
  - Processed data is structured and **ready for analysis**.  
- **Increased productivity**:  
  - Automating data preparation saves time for data scientists.  
- **Improved data organization**:  
  - Processing ensures data fits into well-defined schemas.  


### How Data Engineers Process Data
Data engineers perform essential **data cleaning, manipulation, and structuring** tasks, such as:
- **Handling corrupted data**:  
  - Rejecting corrupt song files.  
  - Deciding how to handle missing metadata (e.g., leaving blank, rejecting file, or assigning a default genre).  

- **Structuring databases**:  
  - Ensuring data is stored in a **well-organized relational database**.  
  - Creating **views** to combine related data for easy querying.  

- **Optimizing database performance**:  
  - Using **indexing** to speed up data retrieval.  
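
Views and indexes can be sketched with Python's `sqlite3` module. The view name `album_overview`, the index name `idx_albums_artist_id`, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artists (artist_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE albums  (album_id INTEGER PRIMARY KEY, artist_id INTEGER, title TEXT);
    INSERT INTO artists VALUES (1, 'Artist A'), (2, 'Artist B');
    INSERT INTO albums  VALUES (10, 1, 'First Album'), (11, 2, 'Second Album');

    -- A view saves the join definition so it can be queried like a table.
    CREATE VIEW album_overview AS
        SELECT al.title AS album_title, ar.name AS artist_name
        FROM albums AS al
        JOIN artists AS ar ON al.artist_id = ar.artist_id;

    -- An index on the join column lets lookups avoid scanning the whole table.
    CREATE INDEX idx_albums_artist_id ON albums(artist_id);
""")

overview = conn.execute(
    "SELECT album_title, artist_name FROM album_overview ORDER BY album_title"
).fetchall()

# PRAGMA index_list returns one row per index; column 1 holds the index name.
index_names = [row[1] for row in conn.execute("PRAGMA index_list('albums')")]
```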


### Tools for Data Processing
- There are many **data processing tools**, but they are out of scope for this course.  

### Apache Spark
- **Apache Spark** is a powerful data processing tool.  
- Courses on **DataCamp** cover its usage.

## Scheduling Data

### Scheduling
Scheduling is the glue of a data engineering system. It organizes how different tasks work together, running them in a specific order and resolving dependencies correctly. It can apply to tasks such as updating tables and databases.
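
One way to picture dependency resolution is a topological sort, sketched here with `graphlib` from the Python standard library (3.9+). The task names are hypothetical, and real schedulers do considerably more than compute an order.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
dependencies = {
    "extract_songs": set(),
    "convert_to_ogg": {"extract_songs"},
    "extract_metadata": {"extract_songs"},
    "update_song_table": {"convert_to_ogg", "extract_metadata"},
}

# static_order() yields every task after all of its prerequisites.
run_order = list(TopologicalSorter(dependencies).static_order())
```

The two middle tasks do not depend on each other, so either may come first; only the constraints in `dependencies` are guaranteed.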


### Manual, Time, and Sensor Scheduling
There are different ways to schedule tasks:
- **Manual Scheduling**: Tasks are run manually by an employee (e.g., updating a table when an employee moves).
- **Time Scheduling**: Tasks are set to execute at specific times, such as updating a database every morning at 6 AM.
- **Sensor Scheduling**: Tasks execute when a specific condition is met, like updating a department table only if a new employee is added.

**Manual** scheduling depends on human intervention, whereas **automated** scheduling (time- or sensor-based) runs tasks at specific times or when specific conditions are met, which reduces the need for human oversight. However, sensor scheduling requires continuous monitoring to detect the condition, which can demand more resources.
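
The difference between the two automated triggers can be sketched as two small predicate functions. Both function names and the employee-count check are illustrative assumptions; a real scheduler would call checks like these repeatedly.

```python
from datetime import datetime, time

def time_trigger(now, run_at=time(6, 0)):
    """Time scheduling: fire when the clock reaches the configured time."""
    return (now.hour, now.minute) == (run_at.hour, run_at.minute)

def sensor_trigger(previous_count, current_count):
    """Sensor scheduling: fire only when a new employee row has appeared."""
    return current_count > previous_count

fires_at_six = time_trigger(datetime(2024, 1, 15, 6, 0))
fires_at_noon = time_trigger(datetime(2024, 1, 15, 12, 0))
fires_on_new_hire = sensor_trigger(previous_count=41, current_count=42)
```

The sensor check has to be evaluated over and over to notice a change, which is where the extra resource cost comes from.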

### Batches and Streams
- **Batch Processing**: Data is processed in groups at specific intervals, often overnight. It's more cost-effective as it can be scheduled during off-peak times.
  - Example: Updating the employee database every morning at 6:00 AM.
  
- **Stream Processing**: Data is processed immediately as it is received, making it suitable for real-time applications.
  - Example: Updating a user profile as soon as they sign up.
  
Both batch and stream processing have their use cases. Stream processing is often treated as synonymous with real-time processing, for example in fraud detection, where each transaction must be checked as it arrives.
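
A toy sketch of the contrast, using a list for the batch case and a generator for the stream case. The uppercase transformation stands in for any real processing step, and the sign-up names are made up.

```python
def process_batch(events):
    """Batch: wait for the whole group, then process it in one pass."""
    return [event.upper() for event in events]

def process_stream(events):
    """Stream: handle each event as soon as it arrives, one at a time."""
    for event in events:
        yield event.upper()

signups = ["alice", "bob", "carol"]

# Batch result exists only after every event has been collected.
batch_result = process_batch(signups)

# Stream result for the first event is ready before later events are read.
stream = process_stream(iter(signups))
first_streamed = next(stream)
```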

### Scheduling Tools
Some popular tools for scheduling data processing tasks include:
- Apache Airflow
- Luigi

### Summary
Scheduling helps organize and automate tasks in a data pipeline. Tasks can be scheduled manually, by time, or based on sensors, and data can be processed in batches or streams depending on the use case. Tools like Apache Airflow and Luigi help with automating the scheduling of tasks.













## Recap  

So far, I explored the critical role of data processing in data engineering, focusing on how raw data is transformed into meaningful information. This transformation is essential for optimizing storage, reducing processing costs, and improving network efficiency. Key takeaways include:  

- **Data Processing**: Converts raw data into usable information by filtering out unnecessary data, optimizing costs, and improving usability.  
- **Data Compression**: For example, converting music files from high-quality formats like `WAV` or `FLAC` to `.ogg` reduces network costs while maintaining acceptable sound quality.
- **Data Organization**: Structuring and organizing data for easy access and analysis, such as extracting metadata from music files or fitting employee data into table schemas.  
- **Automation**: Streamlining data preparation tasks to enhance productivity, allowing data scientists to focus on insights.  
- **Data Engineers' Responsibilities**: Data manipulation, cleaning, structuring databases, creating views for efficient access, and optimizing performance.  

Additionally, I connected data processing with the **ETL (Extract, Transform, Load) framework**, understanding its role in data pipelines:  

- **Extraction**: Retrieving data from a source.  
- **Transformation**: Modifying and processing data.  
- **Loading**: Storing the data in a database or other storage solution.  
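
The three ETL steps can be sketched as three small functions. This is a minimal illustration using only the standard library: the CSV source, its column names, and the `employees` target table are assumptions, not part of the course material.

```python
import csv
import io
import sqlite3

def extract(csv_text):
    """Extract: read rows from a source (here an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: split the full name and turn the yes/no flag into a boolean."""
    out = []
    for row in rows:
        first, _, last = row["name"].partition(" ")
        out.append((first, last, row["full_time"] == "yes"))
    return out

def load(rows, conn):
    """Load: store the transformed rows in a database table."""
    conn.execute(
        "CREATE TABLE employees (first_name TEXT, last_name TEXT, full_time BOOLEAN)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)

source = "name,full_time\nJohn Doe,yes\nJane Roe,no\n"
conn = sqlite3.connect(":memory:")
load(transform(extract(source)), conn)

# SQLite stores Python booleans as the integers 1 and 0.
loaded = conn.execute(
    "SELECT first_name, last_name, full_time FROM employees ORDER BY rowid"
).fetchall()
```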


## Parallel Computing

Parallel computing is the foundation of most modern data processing tools. It is crucial for managing memory efficiently and increasing processing power. Big data tools break down processing tasks into smaller subtasks, which are then distributed across multiple computers.  

### Understanding Parallel Computing  
#### T-Shirt Folding Analogy  
Imagine running a music merchandise shop that needs to fold 1,000 t-shirts:  
- A **senior sales assistant** folds **100 shirts in 15 minutes**.  
- A **junior sales assistant** folds **100 shirts in 30 minutes**.  

If only one assistant can work at a time, choosing the senior assistant is the fastest option. However, if the batch is split into **four groups of 250 shirts**, having four junior assistants working in parallel is faster:  
- They finish in **1 hour and 15 minutes**,  
- Compared to **2 hours and 30 minutes** if the senior assistant worked alone.  

### Benefits and Risks of Parallel Computing  
**Benefits:**  
- **Increased Processing Power:** Multiple processing units work simultaneously.  
- **Optimized Memory Usage:** Instead of loading all data into one computer's memory, data is partitioned and processed across multiple computers, reducing memory footprint.  

**Risks:**  
- **Data Transfer Costs:** Moving data between computers has a cost.  
- **Task Coordination Overhead:** Splitting tasks into subtasks and merging results requires communication, which adds time.  

#### T-Shirt Analogy Revisited  
If distributing t-shirts among four assistants takes **10 minutes**, and collecting folded shirts takes **5 minutes**, the total time increases to **1 hour and 30 minutes**, instead of the expected **1 hour and 15 minutes**.  
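
The arithmetic behind the analogy can be checked with a short function. The function name and parameters are invented for this sketch; the numbers come from the example above.

```python
def folding_time(shirts, minutes_per_100, workers=1, overhead_minutes=0):
    """Minutes to fold `shirts` split evenly across `workers`, plus overhead."""
    shirts_per_worker = shirts / workers
    return shirts_per_worker / 100 * minutes_per_100 + overhead_minutes

# One senior assistant: 1,000 shirts at 15 minutes per 100.
senior_alone = folding_time(1000, minutes_per_100=15)

# Four junior assistants: 250 shirts each at 30 minutes per 100.
four_juniors = folding_time(1000, minutes_per_100=30, workers=4)

# Same four juniors, plus 10 minutes to distribute and 5 minutes to collect.
with_overhead = folding_time(1000, 30, workers=4, overhead_minutes=10 + 5)
```

The results are 150, 75, and 90 minutes: the parallel setup still wins, but the coordination overhead eats part of the gain.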

### Parallel Computing at Spotflix  
Spotflix uses **parallel computing** to convert songs from lossless formats to `.ogg`, reducing memory load on a single computer and leveraging extra processing power for conversion scripts.  
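
The split, process, and merge pattern can be sketched with a worker pool from `concurrent.futures`. Here `convert_to_ogg` is a hypothetical stand-in that only renames the file, and the file names are made up. Threads are used to keep the sketch simple; in Python, real CPU-heavy conversion would more likely run in separate processes or on separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

def convert_to_ogg(filename):
    """Hypothetical stand-in for a real conversion script: only renames the file."""
    stem = filename.rsplit(".", 1)[0]
    return stem + ".ogg"

uploads = ["intro.wav", "verse.flac", "chorus.wav", "outro.flac"]

# Hand the files to four workers; map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    converted = list(pool.map(convert_to_ogg, uploads))
```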

### Summary  
Parallel computing enhances processing efficiency but comes with coordination costs. At Spotflix, it enables efficient audio conversion while optimizing memory and computation.  


## Cloud Computing for Data Processing


Cloud computing offers a more flexible and cost-efficient approach to data processing compared to on-premises data centers. Key advantages include:
- **Resource Optimization**: Companies only rent servers when needed, avoiding waste during quieter times.
- **Cost Efficiency**: Renting servers in the cloud is often cheaper than buying and maintaining on-premises infrastructure.
- **Global Availability**: Cloud servers can be placed closer to users, reducing latency and improving user experience.

## Cloud Computing for Data Storage
Cloud computing can improve database reliability, but it also raises concerns for some data:
- **Disaster Recovery**: Cloud computing allows companies to replicate data across geographical locations, reducing risk.
- **Sensitive Data**: The cloud can pose risks for confidential data, with concerns around external hosting and government surveillance.

## Cloud Providers
The three main cloud service providers are:
1. **Amazon Web Services (AWS)**
2. **Microsoft Azure**
3. **Google Cloud**

### Key Services:
- **File Storage**:
  - AWS S3
  - Azure Blob Storage
  - Google Cloud Storage
- **Computation**:
  - AWS EC2
  - Azure Virtual Machines
  - Google Compute Engine
- **Databases**:
  - AWS RDS
  - Azure SQL Database
  - Google Cloud SQL

## Cloud Computing at Spotflix
Spotflix uses AWS for various services:
- **S3**: To store album covers
- **EC2**: To process songs
- **RDS**: To store employee information

## Multicloud
Multicloud allows companies to use services from different providers. Benefits include:
- **Reduced Vendor Reliance**: Avoid being locked into a single provider.
- **Cost Optimization**: Leverage the best pricing and features.
- **Disaster Mitigation**: Using multiple providers can reduce the impact of outages (e.g., AWS outage in 2017).

However, multicloud introduces challenges:
- **Compatibility Issues**: Some services from different providers may not work well together.
- **Security & Governance**: Managing multiple cloud environments can complicate security and compliance.

## Summary
Cloud computing offers cost-effective and reliable solutions for data processing and storage. Spotflix uses AWS services for storage, processing, and databases. Multicloud strategies offer advantages in cost and disaster recovery, but also require careful management of compatibility and security.
