# Flipkart/Amazon Case Study on Data Storage, Compliance, and Workflows

## Overview
These notes cover system design for large e-commerce platforms such as Flipkart and Amazon, focusing on their architecture and workflows. Topics include data storage tiers, compliance, data transfer between tiers, database choice, pagination, and sharding.

### System Flow
1. **Main Page**: User lands on the platform.
2. **Search Bar**: Allows users to search for products.
3. **Product Pages**: Displays the details of selected products.
4. **Order Process**: Users can place orders and proceed to payment.
5. **Logistics**: Tracks the movement of items from warehouses to users' homes.
6. **Previous Orders**: Users can access their past orders categorized by time frames.

## Search and Product Order Workflow
### Workflow Diagram (Schematic)

```mermaid
flowchart TD
    A[Main Page] --> B[Search Bar]
    B --> C[Product Search]
    C --> D[Product Page]
    D --> E[Order Process]
    E --> F[Payment Interface]
    F --> G[Logistics & Tracking]
    G --> H[User Receives Product]
    H --> I[Previous Orders]
```

### Question: How to deal with this scenario?
#### Explanation:
These workflows can be served by **multi-tier databases**: Elasticsearch for fast search queries and an RDBMS for transactional processing such as orders and payments.

### Data Storage Strategy
#### Definition and Explanation of Hot, Warm, and Cold Storage

| Type of Storage | Characteristics | Use Cases |
|------------------|-----------------|-----------|
| **Hot Storage** | - High-speed access<br>- Expensive<br>- Strong consistency<br>- Transactional | Frequent reads/writes (e.g., current orders) |
| **Warm Storage** | - Primarily read-heavy<br>- Less expensive<br>- Slightly slower | Less frequent reads (e.g., data from the last 6 months) |
| **Cold Storage** | - Infrequent access<br>- Very cost-effective<br>- Very slow access<br>- Primarily for compliance/audit | Old and infrequently accessed data (e.g., data older than 2 years) |

#### Example:
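A user might frequently access orders from the last three months (hot), occasionally check orders from three to six months ago (warm), and rarely look at anything older (cold). The exact cut-offs are illustrative and vary by system; the table above, for instance, places the cold boundary at two years.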
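
The tiering rule in this example can be sketched as a small function. This is a minimal illustration, not a production policy: the function name `storage_tier` and the day counts (90 and 180 days, approximating three and six months) are assumptions chosen to mirror the example.

```python
from datetime import date, timedelta

# Illustrative thresholds mirroring the example; real systems tune these.
HOT_MAX_AGE = timedelta(days=90)    # roughly the last three months
WARM_MAX_AGE = timedelta(days=180)  # roughly three to six months


def storage_tier(order_date: date, today: date) -> str:
    """Return 'hot', 'warm', or 'cold' based on the age of an order."""
    age = today - order_date
    if age <= HOT_MAX_AGE:
        return "hot"
    if age <= WARM_MAX_AGE:
        return "warm"
    return "cold"
```

A scheduled job could call a function like this for each order to decide which records are due for migration to the next tier.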

### Question: What is Compliance? What is Audit? 
#### Explanation:
- **Compliance** refers to adhering to laws, regulations, and guidelines relevant to a business or industry. It ensures data is handled and stored correctly.
- **Audit** is a systematic examination of an organization's records and systems, often carried out to verify compliance.

#### Regulation Examples (by jurisdiction):
- **GDPR**: European Union regulation on data protection and privacy.
- **HIPAA**: US law providing data privacy and security provisions for safeguarding medical information.
- **CCPA**: California Consumer Privacy Act that enhances privacy rights for residents of California.

### Question: What is the problem if you delete data once a compliance retention period (say, 7 years) ends?
#### Explanation:
Large companies often keep data even after the mandatory retention period ends, because historical data can provide valuable insights. For example, an analysis of post-COVID-19 trends carried out in 2034 would depend on having access to data from 2020. Privacy regulations such as GDPR also limit how long personal data may be kept, so long-term retention generally needs to rely on anonymized or aggregated data.

## Flow Chart of Amazon System Design
### Explanation of System: Users, Orders, Payments, Logistics, Databases
```mermaid
flowchart TD
    A[Users]
    B[Orders]
    C[Payments]
    D[Logistics]
    E[Databases]

    A --> B
    B --> C
    C --> D
    B --> E
    C --> E
    D --> E
```

### Question: Which is better MySQL/PostgreSQL?
- **MySQL**: Well known for fast reads and simple queries, with low operational overhead.
- **PostgreSQL**: Often preferred when a workload needs advanced features such as richer indexing and more complex queries.

#### Comparison Table

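| Aspect | MySQL | PostgreSQL |
|--------|-------|------------|
| License | Open-source | Open-source |
| Query features | Basic queries | More advanced queries |
| Indexing complex types | Limited (JSON via generated columns, or multi-valued indexes in 8.0.17+) | Can index arrays, JSON, binary |
| Index types | Basic indexing | Advanced indexing |
| Multi-column indexes | Up to 16 columns | Up to 32 columns |
| Partial indexes | Not available | Available |
| Expression indexes | Functional key parts (8.0.13+) | Available |
| Replication | Asynchronous by default; semi-synchronous optional | Asynchronous by default; synchronous optional |
| Performance profile | Often faster for simple read-heavy workloads, less overhead | More overhead on simple workloads, but more advanced options |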

### Example Code Snippet: PostgreSQL Table Definition
```sql
CREATE TABLE payments (
   payment_id SERIAL PRIMARY KEY,
   order_id INT NOT NULL,
   amount DECIMAL(10, 2),
   status VARCHAR(50),
   created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
   FOREIGN KEY (order_id) REFERENCES orders(order_id)
);
```

### Question: If some orders became less frequent reads, can we transfer to warm storage?
- **Explanation**: Yes, but not by copying rows directly from hot to warm storage. Use dumper scripts that convert the normalized RDBMS tables into a denormalized format. The output can be staged temporarily in Amazon S3 before it is loaded into warm storage.
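
A dumper script of this kind might look roughly like the sketch below. It is only an illustration under several assumptions: `sqlite3` stands in for the hot-tier RDBMS so the code runs anywhere, the `orders` columns (`user_id`, `placed_at`) and the function name `dump_orders_before` are hypothetical, each order is assumed to have exactly one payment, and the upload to S3 is left out.

```python
import json
import sqlite3

# sqlite3 stands in for the hot-tier RDBMS; the schema is a simplified assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, placed_at TEXT);
    CREATE TABLE payments (payment_id INTEGER PRIMARY KEY, order_id INTEGER,
                           amount REAL, status TEXT);
    INSERT INTO orders VALUES (1, 42, '2024-01-05'), (2, 42, '2024-06-20');
    INSERT INTO payments VALUES (10, 1, 499.00, 'PAID'), (11, 2, 120.50, 'PAID');
""")


def dump_orders_before(conn, cutoff):
    """Join the normalized tables and return one self-contained document per order."""
    rows = conn.execute(
        """
        SELECT o.order_id, o.user_id, o.placed_at, p.payment_id, p.amount, p.status
        FROM orders AS o
        JOIN payments AS p ON p.order_id = o.order_id
        WHERE o.placed_at < ?
        ORDER BY o.order_id
        """,
        (cutoff,),
    )
    return [
        {
            "order_id": order_id,
            "user_id": user_id,
            "placed_at": placed_at,
            "payment": {"payment_id": payment_id, "amount": amount, "status": status},
        }
        for order_id, user_id, placed_at, payment_id, amount, status in rows
    ]


docs = dump_orders_before(conn, "2024-04-01")
# One JSON document per line: the payload a real script would stage in S3.
jsonl = "\n".join(json.dumps(doc) for doc in docs)
```

Embedding the payment inside each order document means the warm store can answer "show this old order" with a single lookup instead of a join, which suits its read-heavy role.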

### Data Deletion Policy
#### Definition: 
A policy implemented based on triggers that ensures obsolete data is deleted, reducing storage costs.

## Migration Process from Hot to Warm Storage
### Flow Chart of Data Transfer to Warm Storage
```mermaid
flowchart TD
    A[Hot Storage PostgreSQL] --> B[Dumper Scripts]
    B --> C[S3 Storage]
    C --> D[Load to Warm Storage MongoDB]
```

### Question: How to transfer data from warm storage to cold storage?
#### Workflow Diagram
```mermaid
flowchart TD
    A[Warm Storage MongoDB/ElasticSearch] --> B[Simpler Dumper Scripts]
    B --> C[S3 Storage]
    C --> D[Apache Hudi]
```

### Explanation of S3 and Apache Hudi
- **S3**: Amazon Simple Storage Service is a scalable object storage service for storing and retrieving any amount of data.
- **Apache Hudi**: An open-source data lake framework that manages large datasets on distributed storage such as S3, adding record-level upserts, deletes, and incremental processing.

### Question: If the analytics/audit team needed the old data, can they directly query from Apache Hudi?
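- **Answer**: No. Hudi manages how the data is laid out in S3 but is not itself a query engine. Queries run through an engine such as PrestoDB or Amazon Athena, which reads the Hudi-managed data in S3 and is designed for interactive analytics.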

### Scenario: Follow/Following Relationship
- **Graph Database Usage**: Users are represented as nodes in a graph database, connected based on their follow relationships. 

### Problems with Pagination in Graph DB
Graph databases are not optimized for pagination: listing a user's connections requires traversing the edges from that user's node, which can be slow and resource-intensive for users with very many connections.

### SQL Example for Pagination and Follower Queries
```sql
-- Followers of B: each row means "source follows destination"
SELECT * FROM relationships WHERE destination = 'B';

-- Users that B follows
SELECT * FROM relationships WHERE source = 'B';
```
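
The queries above return every matching row at once. One common way to paginate them in a relational store is keyset (cursor) pagination, sketched below. The sketch assumes an auto-incrementing `id` column on `relationships`, which the original schema does not show, uses `sqlite3` as a stand-in database, and the helper name `followers_page` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relationships ("
    "id INTEGER PRIMARY KEY, source TEXT NOT NULL, destination TEXT NOT NULL)"
)
# Five users follow B: each row reads "source follows destination".
conn.executemany(
    "INSERT INTO relationships (source, destination) VALUES (?, ?)",
    [(name, "B") for name in ["A", "C", "D", "E", "F"]],
)


def followers_page(conn, user, after_id=0, page_size=2):
    """Return one page of followers plus the cursor for the next page."""
    rows = conn.execute(
        "SELECT id, source FROM relationships "
        "WHERE destination = ? AND id > ? ORDER BY id LIMIT ?",
        (user, after_id, page_size),
    ).fetchall()
    # The last id seen becomes the cursor; None signals there is nothing more.
    next_cursor = rows[-1][0] if rows else None
    return [source for _, source in rows], next_cursor


page1, cursor = followers_page(conn, "B")
page2, cursor = followers_page(conn, "B", after_id=cursor)
```

Filtering on `id > ?` lets the database seek straight to the next page through an index, so later pages cost about the same as the first, unlike `OFFSET`, which has to skip over all earlier rows.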

## Question: Cost trade-offs of Sharding
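- Sharding increases cost through more storage and management overhead (for example routing, rebalancing, and cross-shard queries), but it can reduce query time because each shard holds and scans only a fraction of the data.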

### Optimized Sharding Strategies
1. **Hash-Based Sharding**: Distributes data evenly by hashing the shard key, which helps avoid hotspots.
2. **Range-Based Sharding**: Divides data into contiguous key ranges, which makes range queries efficient but can create hotspots if access is skewed toward one range.
3. **Hybrid Sharding**: Combines hashing and ranges for better distribution.
4. **Graph-Based Partitioning**: Clusters closely related users together.
5. **Dynamic Sharding**: Adjusts shard distribution based on load.
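
The first two strategies can be sketched as routing functions that map a key to a shard number. This is a minimal illustration: the shard count, the range boundaries, and the function names `hash_shard` and `range_shard` are all assumptions.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # hypothetical shard count


def hash_shard(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash-based sharding: a stable hash spreads keys across shards."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Range-based sharding: each boundary is the exclusive upper bound of a shard.
# Order IDs below 1_000_000 go to shard 0, below 2_000_000 to shard 1, and so on.
RANGE_BOUNDARIES = [1_000_000, 2_000_000, 3_000_000]


def range_shard(order_id: int) -> int:
    """Range-based sharding: binary-search the boundaries for the owning shard."""
    return bisect.bisect_right(RANGE_BOUNDARIES, order_id)
```

A cryptographic hash is used here instead of Python's built-in `hash()` because the built-in is randomized per process for strings, which would route the same user to different shards after a restart.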

## Total Flowchart
```mermaid
flowchart TD
    A[Main Page] --> B[Search Bar]
    B --> C[Product Search]
    C --> D[Product Page]
    D --> E[Order Process]
    E --> F[Payment Interface]
    F --> G[Logistics & Tracking]
    G --> H[User Receives Product]
    H --> I[Previous Orders]

    subgraph Data_Storage_Strategy
        D1[Hot Storage PostgreSQL]
        D2[Warm Storage MongoDB]
        D3[Cold Storage S3/Apache Hudi]
    end

    A --> D1
    D1 -->|Transfer using Dumper Scripts| D2
    D2 -->|Transfer to Cold Storage| D3

    subgraph Compliance_Audit
        CA1[Compliance Regulations]
        CA2[Audit Processes]
        CA3[Historical Data Importance]
    end

    D1 --> CA1
    D2 --> CA2
    CA2 --> CA3

    subgraph Sharding_Strategies
        S1[Hash-Based Sharding]
        S2[Range-Based Sharding]
        S3[Hybrid Sharding]
        S4[Graph-Based Partitioning]
        S5[Dynamic Sharding]
    end

    H --> S1
    H --> S2
    H --> S3
    H --> S4
    H --> S5

    subgraph Query_Pagination
        Q1[Follower Queries in SQL]
        Q2[Pagination Challenges in Graph DB]
    end

    H --> Q1
    H --> Q2
``` 

This flowchart presents an overarching view of the system design considerations and processes discussed throughout these notes, spanning user interactions, data management strategies, compliance, and querying approaches.

## Conclusion
Working through these systems shows how an e-commerce platform can manage data retention, retrieval, and processing efficiently: tier data by access frequency, retain what compliance and analysis require, and choose databases and sharding strategies to match the query patterns.