## <center> Distributed File Systems </center>

#### <center> Linh B. Ngo </center>
#### <center> CPSC 3620 </center>

<center> How do we arrange read/write access for processes running on computers that are part of a computing cluster? </center>

**Networked File System**
- Allow transparent access to files stored on a remote disk (Palmetto's `/home` and `/software`)

**Clustered File System**
- Allow transparent access to files stored on a large set of disks, which could be distributed across multiple computers (Palmetto's `/scratch2` and `/scratch3`)

**Parallel File System**
- Enable parallel access to files (Palmetto's `/scratch1`)

**Networked File System**

<img src="pictures/distributed-file-systems/sun.jpg" height="42">

*Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., & Lyon, B. (1985, June). Design and implementation of the Sun network filesystem. In Proceedings of the Summer USENIX conference (pp. 119-130)*

- Sun Network Filesystem Protocol (NFS)
- Current version: NFSv4.2
- Design Goals
  - Machine and operating system independence
  - Crash recovery
  - Transparent access
  - UNIX semantics maintained on client
  - Reasonable performance (target 80% as fast as local disk)

NFS Design:
- NFS Protocol
- Server side
- Client side    

NFS Protocol:
- Remote Procedure Call mechanism
- Stateless protocol
- Transport independence (the original implementation used UDP/IP)
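
To make the "stateless protocol" point concrete, here is a minimal sketch, not the actual NFS wire format. It assumes a toy in-memory server; the names `ReadRequest`, `StatelessServer`, and the handle `"fh-1"` are hypothetical. The idea illustrated is that every request carries everything the server needs (file handle, offset, count), so a client can simply resend a request after a server restart.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ReadRequest:
    """A self-describing request: nothing depends on earlier calls."""
    file_handle: str   # hypothetical opaque identifier for the file
    offset: int        # absolute byte offset, supplied by the client
    count: int         # number of bytes requested


class StatelessServer:
    """Toy server that stores file contents but no per-client state."""

    def __init__(self, files: Dict[str, bytes]):
        self._files = files  # file_handle -> contents

    def read(self, req: ReadRequest) -> bytes:
        # No open-file table and no remembered file position:
        # the request alone determines the answer.
        data = self._files[req.file_handle]
        return data[req.offset:req.offset + req.count]


stored = {"fh-1": b"hello distributed world"}

server = StatelessServer(stored)
req = ReadRequest("fh-1", offset=6, count=11)
first = server.read(req)

# Simulate a crash and restart: a fresh server object over the same stored data.
server = StatelessServer(stored)
retry = server.read(req)   # the client resends the identical request

print(first, retry)
```

Because the server remembers nothing between calls, crash recovery on the client side reduces to retrying the request.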

NFS Server:
- Must commit modifications to stable storage before returning results
- Generation number in the inode and filesystem id in the superblock (used to build file handles)
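
The role of the generation number can be sketched as follows. This is an illustrative model under stated assumptions, not the Sun implementation: `FileHandle`, `InodeTable`, and `StaleHandle` are hypothetical names, and the handle is assumed to combine filesystem id, inode number, and generation number. It shows how a server might reject a handle that refers to an inode number that has since been reused for a different file.

```python
from typing import Dict, NamedTuple


class FileHandle(NamedTuple):
    fsid: int        # filesystem id (kept in the superblock)
    inode: int       # inode number within that filesystem
    generation: int  # generation number stored in the inode


class StaleHandle(Exception):
    """Raised when a handle refers to an inode that has since been reused."""


class InodeTable:
    """Toy inode table for one filesystem (hypothetical, for illustration)."""

    def __init__(self, fsid: int):
        self.fsid = fsid
        self._generation: Dict[int, int] = {}  # inode -> current generation

    def allocate(self, inode: int) -> FileHandle:
        # Every (re)use of an inode number bumps its generation.
        self._generation[inode] = self._generation.get(inode, 0) + 1
        return FileHandle(self.fsid, inode, self._generation[inode])

    def validate(self, fh: FileHandle) -> None:
        current = self._generation.get(fh.inode)
        if fh.fsid != self.fsid or current != fh.generation:
            raise StaleHandle(fh)


table = InodeTable(fsid=7)
old = table.allocate(inode=42)   # a client obtains a handle
new = table.allocate(inode=42)   # file deleted, inode 42 reused for a new file

table.validate(new)              # accepted: generation matches
try:
    table.validate(old)
    old_is_stale = False
except StaleHandle:
    old_is_stale = True

print(old, new, old_is_stale)
```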

NFS Client:
- Additional virtual file system (VFS) interface in the client's kernel
- Attach remote file system via `mount`

<img src="pictures/distributed-file-systems/SunNFS.png">


**Clustered File System**

- Additional middleware layers, such that the tasks of a file system server can be distributed among a cluster of computers
- Example: The Zettabyte File System by Sun Microsystems

*Bonwick, Jeff, Matt Ahrens, Val Henson, Mark Maybee, and Mark Shellenbaum. "The zettabyte file system." In Proc. of the 2nd Usenix Conference on File and Storage Technologies, vol. 215. 2003.*

Design Principles:
- Simple administration: simplify and automate administration of storage to a much greater degree
- Pooled storage: decouple file systems from physical storage, with allocation done on the pooled storage side rather than the file system side
- Dynamic file system size
- Always consistent on-disk data
- Immense capacity
- Error detection and correction

**More fine-grained distinctions**
- Data distribution
    - DFS often stores entire files on a single node (can have multiple nodes)
    - PFS distributes contents of a file across multiple nodes
- Symmetry (not entirely true)
    - DFS often runs on architecture where the storage is colocated with the application
    - PFS often runs on architecture where the storage is physically separate from the compute system

**More fine-grained distinctions**
- Fault-tolerance
    - DFS takes on fault-tolerance responsibilities
    - PFS runs on enterprise shared storage (no built-in fault tolerance; relies on hardware quality)
- Workloads
    - DFS is geared for loosely coupled, distributed applications (think data-intensive/big data)
    - PFS targets HPC applications that require highly coordinated I/O accesses with massive bandwidth requirements

**Parallel file access mechanisms**
- Shared-file (N-to-1): A single file is created, and all application tasks write to that file (usually to disjoint regions)
    - Increased usability: only one file is needed
    - Can create lock contention and reduce performance
- File-per-process (N-to-N): Each application task creates a separate file, and writes only to that file. 
    - Avoids lock contention
    - Can create a massive number of small files
    - Does not support application restart on a different number of tasks
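
The two access patterns can be compared with a small sketch. It is a simplification: the "tasks" are simulated by a sequential loop rather than real parallel processes, and `NUM_TASKS`, `CHUNK`, and the file names are assumptions chosen for illustration. On a real parallel file system the same offset arithmetic would typically be done through an I/O library such as MPI-IO.

```python
import os
import tempfile

NUM_TASKS = 4   # assumed number of application tasks
CHUNK = 8       # bytes written by each task (assumed equal for every task)


def payload(rank: int) -> bytes:
    """Data for one task: CHUNK copies of a letter identifying the rank."""
    return bytes([ord("A") + rank]) * CHUNK


workdir = tempfile.mkdtemp()

# Shared-file (N-to-1): every task writes its own disjoint region of one file.
shared_path = os.path.join(workdir, "shared.dat")
open(shared_path, "wb").close()            # create the single shared file
for rank in range(NUM_TASKS):              # stands in for NUM_TASKS parallel tasks
    with open(shared_path, "r+b") as f:
        f.seek(rank * CHUNK)               # region owned by this task
        f.write(payload(rank))

# File-per-process (N-to-N): every task creates and writes its own file.
for rank in range(NUM_TASKS):
    with open(os.path.join(workdir, f"task-{rank}.dat"), "wb") as f:
        f.write(payload(rank))

with open(shared_path, "rb") as f:
    shared_bytes = f.read()
per_task_files = sorted(n for n in os.listdir(workdir) if n.startswith("task-"))

print(len(shared_bytes), per_task_files)
```

Note how the file-per-process layout bakes the task count into the set of files, which is why restarting with a different number of tasks is awkward, while the shared file only depends on offsets.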

**Data Distribution in Parallel File Systems**
- Original File: Sequence of Bytes
- The sequence of bytes is converted into a sequence of offsets (each offset can cover multiple bytes)
- Offsets are mapped to objects
    - not necessarily ordered mapping
    - reversible to allow clients to contact specific PFS server for specific data content
- Objects are distributed across PFS servers
    - Information about where the objects are is stored at the metadata server
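
One possible byte-to-object mapping is sketched below. It assumes a simple round-robin mapping of fixed-size stripe units to objects, which is only one choice (the mapping need not be ordered); `STRIPE_SIZE`, `NUM_OBJECTS`, and the server names are hypothetical. The inverse function shows what "reversible" means: given an object and a position inside it, the original file offset can be recovered.

```python
STRIPE_SIZE = 4     # bytes covered by one offset unit (assumed)
NUM_OBJECTS = 3     # objects the file is mapped to (assumed)


def locate(byte_offset: int):
    """Map a file byte offset to (object index, offset inside that object)."""
    stripe = byte_offset // STRIPE_SIZE            # which stripe unit
    obj = stripe % NUM_OBJECTS                     # round robin over objects
    row = stripe // NUM_OBJECTS                    # how many units precede it in obj
    return obj, row * STRIPE_SIZE + byte_offset % STRIPE_SIZE


def to_file_offset(obj: int, obj_offset: int) -> int:
    """Inverse mapping: recover the file byte offset from an object position."""
    row = obj_offset // STRIPE_SIZE
    stripe = row * NUM_OBJECTS + obj
    return stripe * STRIPE_SIZE + obj_offset % STRIPE_SIZE


# Hypothetical metadata-server table: object index -> PFS server holding it.
object_to_server = {0: "pfs-a", 1: "pfs-b", 2: "pfs-c"}

obj, obj_offset = locate(13)
server = object_to_server[obj]
print(obj, obj_offset, server)
```

With this table, a client that wants byte 13 computes the object locally and contacts only the server that holds it.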

**Object Placement**
- Round robin is a reasonable default solution
- Works consistently on most systems
- Default placement policy in GPFS, Lustre, and PVFS
- Potential scalability issues as the number of file servers and the file size grow; possible mitigations:
    - Two dimensional distribution
    - Limit number of servers per file
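
Round-robin placement, with an optional cap on the number of servers per file, might look like the sketch below. The function `place_objects` and its parameters (`max_servers_per_file`, `first`) are hypothetical and do not correspond to any particular file system's API; two-dimensional distribution is not shown.

```python
def place_objects(num_objects, servers, max_servers_per_file=None, first=0):
    """Assign objects to servers round robin, starting at index `first`.

    If max_servers_per_file is given, only that many servers are used
    for this file, regardless of how many servers exist.
    """
    width = len(servers)
    if max_servers_per_file is not None:
        width = min(max_servers_per_file, len(servers))
    # Servers chosen for this file, wrapping around the server list.
    chosen = [servers[(first + i) % len(servers)] for i in range(width)]
    return [chosen[i % width] for i in range(num_objects)]


servers = ["s0", "s1", "s2", "s3", "s4", "s5"]

wide = place_objects(8, servers)                                   # all servers
narrow = place_objects(8, servers, max_servers_per_file=3, first=4)

print(wide)
print(narrow)
```

Varying `first` per file spreads different files over different subsets of servers even when each individual file is limited to a few of them.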

**Block-based PFS**
- Objects are created as fixed-width blocks
- File growth requires more blocks
- Blocks distributed over storage nodes
- Suffers from block allocation issues and lock manager overhead
- Example: GPFS

**Object-based PFS**
- Objects are created as variable-length regions of the file
- A file has a constant number of objects
- File growth increases the size of the object(s)
- Space allocation is managed locally on a per-object basis
- Potential issue with workload imbalance
- Example: Lustre, PVFS
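
The difference in how the two designs react to file growth can be illustrated with a small calculation. This is a toy model under assumed parameters (`BLOCK_SIZE`, `NUM_OBJECTS`, round-robin striping), not a description of how GPFS or Lustre allocate space internally.

```python
import math

BLOCK_SIZE = 4      # bytes per fixed-width block or stripe unit (assumed)
NUM_OBJECTS = 3     # constant number of objects per file (assumed)


def block_layout(file_size: int) -> int:
    """Block-based: the number of fixed-width blocks grows with the file."""
    return math.ceil(file_size / BLOCK_SIZE)


def object_layout(file_size: int):
    """Object-based: the object count is fixed; each object's size grows.

    Stripe units are dealt round robin over the objects.
    """
    sizes = [0] * NUM_OBJECTS
    full, rest = divmod(file_size, BLOCK_SIZE)
    for unit in range(full):
        sizes[unit % NUM_OBJECTS] += BLOCK_SIZE
    if rest:                                   # trailing partial unit
        sizes[full % NUM_OBJECTS] += rest
    return sizes


for size in (10, 100):
    print(size, block_layout(size), object_layout(size))
```

In the block-based case the block count keeps rising (more allocation decisions), while in the object-based case the same three objects simply become larger, and uneven object sizes hint at where imbalance can appear.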