---
title: File Systems
notebook: Distributed Systems
layout: note
date: 2020-10-22
tags: ...
---

DFS Summary

  • key design issues for distributed file systems
    • effective client caching for performance similar to local file system
    • maintaining consistency of files copied in multiple client caches
    • recovery from server/client failure
    • high throughput for reading/writing files
    • scalability

File Systems

  • filesystem vs file system:

    • filesystem: set of files held in a storage device/partition
    • file system: software component providing access to files
  • file system: provides

    • convenient programming interface for persistent/disk storage
    • name space (hierarchical/flat)
    • sharing of data between users
      • access control
      • file-locking
  • file access in the early days of networking

    • FTP, Telnet
    • lacked transparency: the user needed to know the server's IP address and its directory structure
  • distributed file system: emulates non-distributed file system for client programs running on multiple remote computers

    • transparency: local computing, with a remote file system
  • file service: allow programs to store and access remote files as they do local ones

    • access files from any computer on intranet
    • hosts providing such services can be optimised for multiple disk drives, and can supply file services for other services (web, email)
    • facilitates backup and archiving
  • files: data + attributes

    • data: sequence of bytes that can be accessed/modified
    • attributes: timestamps, file type, owner, access control lists
  • files have a name.

  • directory: file containing list of other files

File System Layers

  • Directory: relate file names to IDs
  • File: relate file IDs to particular files
  • Access control: check permissions for requested operations
  • File access: read/write file data/attributes
  • Block: access/allocate disk blocks
  • Device: disk IO and buffering
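
To make the layering concrete, here is a minimal sketch (all names are hypothetical and the disk is simulated by an in-memory dict) of a read request passing down the layers, from name lookup through access control to block retrieval:

```python
BLOCK_SIZE = 8  # tiny block size so the example exercises the block layer

class ToyFS:
    def __init__(self):
        self.names = {}   # directory layer: file name -> file ID
        self.attrs = {}   # file layer: file ID -> attributes
        self.blocks = {}  # block layer: (file ID, block no) -> bytes

    def create(self, name, owner, data):
        fid = len(self.names)
        self.names[name] = fid
        self.attrs[fid] = {"owner": owner}
        for i in range(0, len(data), BLOCK_SIZE):  # allocate disk blocks
            self.blocks[(fid, i // BLOCK_SIZE)] = data[i:i + BLOCK_SIZE]
        return fid

    def read(self, name, user, offset, n):
        fid = self.names[name]                # directory: name -> file ID
        if self.attrs[fid]["owner"] != user:  # access control: check perms
            raise PermissionError(name)
        data = b"".join(self.blocks[k]        # block: gather the file's blocks
                        for k in sorted(self.blocks) if k[0] == fid)
        return data[offset:offset + n]        # file access: read a range

fs = ToyFS()
fs.create("notes.txt", "alice", b"hello distributed world")
print(fs.read("notes.txt", "alice", 6, 11))  # b'distributed'
```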

UNIX file system operations

  • open
  • create
  • close
  • read
  • write
  • lseek: move read/write pointer to new position in the file
  • link: add new name for file
  • stat: get file attributes
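
For reference, these operations are exposed almost directly by Python's os module, so a short demonstration (not part of the original notes) can exercise them:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
fd = os.open(path, os.O_CREAT | os.O_RDWR)  # create + open
os.write(fd, b"hello world")
os.lseek(fd, 6, os.SEEK_SET)                # move read/write pointer
print(os.read(fd, 5))                       # b'world'
os.close(fd)

os.link(path, path + ".alias")              # add a new name for the file
print(os.stat(path).st_size)                # file attributes: size is 11
```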

Distributed File System Requirements

  • e.g. Hadoop

Transparency

  • Access: clients unaware of distribution of files
    • uniform API for accessing local and remote files
  • Location: clients see a uniform file name space
    • names of files should be consistent regardless of where the file is physically stored
  • Mobility: client programs/admin services don't need to change when the files are physically moved
  • Performance: client programs should perform satisfactorily while the load varies in specified range
  • Scaling: service can be expanded by incremental growth

Concurrent file updates

  • multiple clients' updates should not interfere with each other
  • it should be possible to manage concurrency-control policies
  • solutions
    • file, record-level locking
    • isolation

File replication

  • multiple copies of files over several servers: reduced latency, scalability, fault tolerance
  • requires other copies are updated when one copy is modified
  • full replication is challenging to implement
  • solutions
    • caching all or part of a file gives many of the benefits of replication

Heterogeneity

  • client and server should be able to operate on various hardware/OS
  • i.e. the design must be compatible with the file systems of different OSes
  • use protocols to standardise operations with server

Fault tolerance

  • transient communication problems shouldn't result in file corruption
  • the system should continue functioning as far as possible when clients or the server fail
  • faults need to be detected, reported, and corrected
  • invocation semantics: can be
    • at-most-once
    • at-least-once: simpler, but requires idempotent operations
  • servers can be stateless such that there is no recovery required if a server goes down
  • solutions
    • redundant copies of data, redundant hardware, backups, transaction logs
    • stateless servers
    • idempotent operations

Consistency

  • multiple, concurrent access to file should see consistent representation of the file
    • complete, current, correct
  • differences in file's location/update latencies shouldn't make the file look different at different times
  • file metadata should be consistent on all clients
  • particular concern when data is replicated
  • solutions:
    • timestamps and ownership info

Security

  • file systems need to be protected against
    • unauthorised access
    • data corruption
    • loss and other threats
  • solutions:
    • access control: ownership, permissions
    • client requests should be authenticated
    • data transfer should be encrypted

Efficiency

  • comparable to conventional file systems

File Service Architecture

[Figure: file system architecture]

  • abstract architecture based on NFS

Flat file service

  • implements operations on contents of files
  • a UFID (Unique File Identifier) is passed to the flat file service to identify the file to operate on
    • unique across all files in the distributed system
  • the flat file service generates a new UFID for each file it creates

Directory service

  • maps between text file names and UFIDs
  • creates directories, and manages files within them
  • client of flat file service, as directory files are stored there

Client module

  • integrates directory service and flat file service to provide API expected by client applications
  • client maintains a list of available file servers
  • caching to improve performance

Flat file service interface

  • UNIX interface requires the filesystem to maintain state (in the file pointer), which is manipulated during read/write
  • flat file service differs from UNIX interface for fault tolerance
    • repeatable operations: except for Create, operations are idempotent, permitting at-least-once RPC semantics
    • stateless server: flat file service doesn't need to maintain state. Can be restarted after failure and resume operation without need for clients/server to restore any state
    • files can be accessed immediately, cf. UNIX where they must first be opened

RPC Calls

  • Read(UFID, i, n) -> Data: read up to $n$ items from position $i$ in the file
  • Write(UFID, i, Data): write data starting at position $i$ in the file
  • Create() -> UFID: creates a new file of length 0, and returns its UFID
  • Delete(UFID): remove file from file store
  • GetAttributes(UFID) -> Attr
  • SetAttributes(UFID, Attr)
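
A minimal in-memory sketch of this interface (naming and storage details are my own; a real service would sit behind RPC). Because Read and Write take an explicit position rather than relying on a file pointer, the server holds no per-client state, and repeating a call has the same effect as making it once:

```python
import uuid

class FlatFileService:
    def __init__(self):
        self.files = {}  # UFID -> file contents
        self.attrs = {}  # UFID -> attributes

    def create(self):
        ufid = uuid.uuid4().hex      # stand-in for a real UFID scheme
        self.files[ufid] = bytearray()
        self.attrs[ufid] = {"length": 0}
        return ufid

    def read(self, ufid, i, n):      # idempotent: position is explicit
        return bytes(self.files[ufid][i:i + n])

    def write(self, ufid, i, data):  # idempotent: repeats are harmless
        f = self.files[ufid]
        f[i:i + len(data)] = data
        self.attrs[ufid]["length"] = len(f)

    def delete(self, ufid):          # idempotent: second call is a no-op
        self.files.pop(ufid, None)
        self.attrs.pop(ufid, None)

    def get_attributes(self, ufid):
        return dict(self.attrs[ufid])

    def set_attributes(self, ufid, attr):
        self.attrs[ufid].update(attr)
```

Only Create is non-idempotent: retrying it under at-least-once semantics can leave an extra, unnamed file behind, which matches the exception noted above.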

Flat file service access control

  • service needs to
    • authenticate RPC caller
    • prevent illegal operations: e.g. check that UFIDs are valid, enforce access privileges
  • the server cannot store access-control state: that would make it stateful, breaking the recovery and idempotency properties above
  • options:
    • access check made whenever file name is converted to UFID, and results encoded as a capability returned to client for submission to flat file server
    • user ID can be submitted for every request, with access checks performed by flat file server for each file operation
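
A sketch of the first option (the encoding and the shared-key arrangement are assumptions): the service performing the name-to-UFID translation signs the UFID and access rights with a key shared with the flat file server, which can then verify each request without storing any per-client state:

```python
import hashlib
import hmac

SECRET = b"shared-server-key"  # known to the servers, never to clients

def make_capability(ufid, rights):
    """Issued after an access check when a name is translated to a UFID."""
    msg = f"{ufid}:{rights}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{ufid}:{rights}:{sig}"

def check_capability(capability, ufid, op):
    """Verified by the flat file server on every operation."""
    cap_ufid, rights, sig = capability.rsplit(":", 2)
    msg = f"{cap_ufid}:{rights}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(sig, expected)
            and cap_ufid == ufid and op in rights)

cap = make_capability("file-42", "rw")
print(check_capability(cap, "file-42", "r"))  # True
print(check_capability(cap, "file-99", "r"))  # False: wrong file
```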

Directory service interface

  • primary purpose: translation from file name to UFID
  • directory server maintains directory files, which store this mapping
  • directory files themselves are stored in the flat file server: the directory server is a client of the flat file server
  • in a hierarchical file system, the client function can make requests for UFIDs corresponding to each component of the pathname

Interface

  • Lookup(Dir, Name) -> UFID: return UFID for file name in given directory
  • AddName(Dir, Name, UFID): add file name with UFID to directory
  • UnName(Dir, Name): remove file name from directory
  • GetNames(Dir, Pattern) -> Names: return all names matching a pattern
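
A sketch of this interface (directory contents are kept as plain dicts here for brevity; a real directory service would serialise them into directory files held by the flat file service). lookup_path shows the per-component resolution mentioned above:

```python
import fnmatch

class DirectoryService:
    def __init__(self):
        self.dirs = {"root": {}}  # directory UFID -> {name: UFID}

    def lookup(self, dir_ufid, name):  # name -> UFID translation
        return self.dirs[dir_ufid][name]

    def add_name(self, dir_ufid, name, ufid):
        self.dirs[dir_ufid][name] = ufid

    def un_name(self, dir_ufid, name):
        self.dirs[dir_ufid].pop(name, None)

    def get_names(self, dir_ufid, pattern):
        return [n for n in self.dirs[dir_ufid] if fnmatch.fnmatch(n, pattern)]

def lookup_path(ds, path):
    """Resolve /a/b/c one component at a time, as the client module does."""
    ufid = "root"
    for component in path.strip("/").split("/"):
        ufid = ds.lookup(ufid, component)
    return ufid

ds = DirectoryService()
ds.dirs["dir-home"] = {}                   # a subdirectory's entries
ds.add_name("root", "home", "dir-home")
ds.add_name("dir-home", "notes.txt", "file-42")
print(lookup_path(ds, "/home/notes.txt"))  # file-42
```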

File Group

  • file group: collection of files on a given server
  • server may hold several file groups, and file groups can be moved between servers
  • files cannot change file group
  • permits file service to be implemented across several servers
  • files given UFIDs that ensure uniqueness across different servers
    • e.g. concatenate the server's IP address with the date the file was created
    • permits files in a group (i.e. files with common file group id) to be relocated to a different server without conflicting with files already on that server
  • mapping of UFIDs to servers can be cached at client module
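
A sketch of one such UFID scheme (the exact field layout is an assumption): the group identifier combines the creating server's IP address with a creation date, and the file number only needs to be unique within the group:

```python
import ipaddress
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class UFID:
    group_id: tuple   # (creating server's IP, creation date): globally unique
    file_number: int  # unique within the file group

def new_file_group(server_ip):
    # IP + date identifies the group forever, even if it later moves servers
    return (int(ipaddress.ip_address(server_ip)), int(time.time()))

group = new_file_group("192.0.2.17")  # documentation-range address
ufid = UFID(group_id=group, file_number=1)
print(ufid)
```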

Sun Network File System

  • uses architecture described above
  • there are many implementations of NFS, all following the NFS protocols: a set of RPCs that provide the means for clients to perform operations on the remote file store
  • NFS client makes requests to NFS server to access files

[Figure: Sun NFS]

Virtual File System

  • Virtual file system (VFS): used by UNIX to provide transparent access to any number of different file systems, combining remote and local file systems into a single filesystem
    • maintains a VFS structure for each filesystem in use
    • maintains a v-node for each open file, which records whether the file is local or remote
      • if local, the v-node contains a reference to an i-node on the UNIX file system
      • if remote, the v-node contains a reference to the file's NFS file handle: a combination of filesystem identifier, i-node number, and other identifying info
  • NFS integrated in the same way
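
A simplified sketch of the dispatch (structures heavily reduced; the two read functions are stubs standing in for the local file system and the NFS client): each v-node carries either a local i-node reference or an NFS file handle, and operations route accordingly:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NFSFileHandle:
    filesystem_id: int
    inode_number: int

def local_read(inode, offset, n):  # stub for the local UNIX file system
    return f"local read, i-node {inode}"

def nfs_read(fh, offset, n):       # stub for an NFS READ RPC
    return f"NFS READ, fh=({fh.filesystem_id}, {fh.inode_number})"

@dataclass
class VNode:
    inode: Optional[int] = None              # set for local files
    handle: Optional[NFSFileHandle] = None   # set for remote files

    def read(self, offset, n):
        if self.inode is not None:
            return local_read(self.inode, offset, n)
        return nfs_read(self.handle, offset, n)

print(VNode(inode=7).read(0, 100))
print(VNode(handle=NFSFileHandle(3, 7)).read(0, 100))
```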

Client Integration

  • emulates UNIX file system primitives closely
  • NFS client is integrated within the kernel, meaning:
    • user programs can access files using UNIX system calls without recompiling or reloading
    • one client module serves all user-level processes (with a shared cache)
    • encryption key used to authenticate user IDs passed to the server can be retained in the kernel: this prevents impersonation by user-level clients
  • client transfers blocks of files from server host to local host and caches them
    • shares the same buffer cache used for local IO
    • as several hosts may access the same remote file concurrently, caching poses a consistency problem

Server Interface

  • NFS server interface integrates directory and file ops in a single service
  • create(filename, targetDirectoryHandle): creates a file and inserts its name into a directory in a single operation
  • primitives resemble UNIX filesystem primitives

Mount Service

  • mount service: process running at user level of an NFS server computer, allowing clients to mount subtrees of remote filesystems
  • each server maintains a file (/etc/exports) describing the parts of the local filesystems available for remote mounting
  • clients use a modified UNIX mount command to request mounting of a remote filesystem, communicating with the server's mount service via an RPC mount protocol
  • remote filesystems can be hard/soft-mounted in a client's file system:
    • hard-mounted filesystem: blocks on each access until the access is complete
      • i.e. a user-level process accessing a file is suspended until the request is completed
      • if the remote host is unavailable, the NFS client module continues to retry until the request is satisfied
    • soft-mounted filesystem: retries a few times before returning an error to the calling process

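A sketch of the difference in retry behaviour (the retry count, delay, and transport are assumptions): a hard mount retries indefinitely, while a soft mount gives up and returns an error to the caller:

```python
import time

def send_rpc(request):
    raise TimeoutError  # stand-in for an unreachable server

def remote_access(request, hard, retries=3, delay=0.1):
    attempt = 0
    while True:
        try:
            return send_rpc(request)
        except TimeoutError:
            attempt += 1
            if not hard and attempt >= retries:
                # soft mount: return an error to the calling process
                raise OSError("I/O error: soft mount gave up")
            time.sleep(delay)  # hard mount: block and retry until satisfied

try:
    remote_access("READ", hard=False)
except OSError as e:
    print(e)  # with hard=True this call would block indefinitely instead
```
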
[Figure: mount service]

Server caching

  • conventional UNIX systems: pages read from disk are retained in the main-memory buffer cache and evicted when buffer space is needed; accesses to the cache do not require disk access
    • read-ahead: anticipates read accesses, fetching the pages following those recently read
    • delayed-write/write-back: optimises writes to disk by only writing pages when they are both modified and evicted
      • UNIX sync flushes modified pages every 30s
    • this works for a conventional filesystem on a single host because there is only one cache and file accesses cannot bypass it
  • use of cache at server for client reads introduces no problems
  • use of cache for writes requires special care: client needs to be confident writes are persistent if server crashes
  • options: cache policies used by the server
    • Write-through: data written to cache and directly to disk
      • client can be sure data is stored persistently as soon as the reply is received
      • increases disk I/O and latency for write
      • operation completes when the data has been written to disk
      • poor when server receives large number of write requests for the same data: causes performance bottleneck
      • saves network bandwidth
    • Commit: data is written to cache and is only written to disk when a commit operation is received for the data
      • the reply to the commit is sent once the data has been written to disk
      • client can be sure data is persistently stored only when a reply to a commit has been received
      • uses more network bandwidth
      • may lead to uncommitted data being lost
      • receives full benefit of cache
      • standard NFS clients use this mode, issuing commit whenever a file open for writing is closed
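
A sketch contrasting the two policies (class and method names are mine; the disk is a dict): write-through makes data persistent before the write reply, while commit mode buffers dirty data until the client's commit arrives:

```python
class ServerCache:
    def __init__(self, write_through):
        self.write_through = write_through
        self.cache = {}  # (UFID, offset) -> data
        self.disk = {}   # simulated persistent store
        self.dirty = set()

    def write(self, ufid, offset, data):
        self.cache[(ufid, offset)] = data
        if self.write_through:
            self.disk[(ufid, offset)] = data  # persistent before the reply
        else:
            self.dirty.add((ufid, offset))    # fast reply, data not yet safe

    def commit(self, ufid):
        for key in [k for k in self.dirty if k[0] == ufid]:
            self.disk[key] = self.cache[key]  # persistent once commit replies
            self.dirty.discard(key)

server = ServerCache(write_through=False)
server.write("file-42", 0, b"draft")
print(server.disk)    # {}: a crash here would lose the uncommitted data
server.commit("file-42")
print(server.disk)    # {('file-42', 0): b'draft'}
```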

Client Caching

  • NFS Client caches data reads, writes, attributes and directory operations to reduce network IO

  • caching at the client: problem for cache consistency, as different caches on multiple clients, and the server

  • reading and writing are both problems: a write on another client between two reads can leave the second read returning stale data

  • NFS clients are responsible for polling the server for currency of cached data (i.e. are there any updates?)

  • a timestamp method is used to validate cached blocks before use

  • each item in the cache is tagged with 2 timestamps:

    • $T_c$: time when a cache block was last validated by the client
    • $T_m$: time when a block was last modified
  • cache block is valid at time $T$ if

    • $T-T_c < t$ where $t$ is a freshness interval, or
    • $T_{m, client}= T_{m,server}$: i.e. the data hasn't been modified at the server since the cache entry was made
  • value of $t$ is a trade off between consistency and efficiency

  • small value for $t$ leads to close approximation of one-copy consistency, but costs greater network IO

  • in Sun Solaris clients $t$ is set adaptively (3-30s) depending on file update frequency

    • for directories it's 30-60s, as there is a lower risk of concurrent updates
  • a validity check is made on each access to a cache block (see the sketch after this list)

    • the first condition requires no network IO; if it holds, the second condition need not be checked
  • the server keeps $T_{m,server}$ as part of each file's attributes

    • if the first condition is found to be false, the client contacts the server to retrieve $T_{m,server}$ from the file's attributes
    • if it matches $T_{m,client}$ then the cache block is valid, and the client sets $T_c$ to the current time
    • otherwise, the cache block is invalid, and the client must request a new copy
  • traffic can be reduced by applying new values of $T_{m,server}$ to all relevant cache blocks and piggy-backing attribute values with results of every operation on the file

  • write-back is used for writes, where modified files are flushed when a file is closed or when a sync operation takes place in the VFS.

    • a special-purpose daemon performs this asynchronously
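
A sketch of the validity check above (timestamps are modelled as plain floats, and the server round trips are passed in as functions):

```python
import time

FRESHNESS_INTERVAL = 3.0  # t: set adaptively, 3-30s on Sun Solaris clients

class CacheEntry:
    def __init__(self, data, t_m):
        self.data = data
        self.t_c = time.time()  # when the client last validated this block
        self.t_m = t_m          # last-modified time as known to the client

def read_block(entry, fetch_tm_server, fetch_block):
    now = time.time()
    if now - entry.t_c < FRESHNESS_INTERVAL:
        return entry.data           # first condition: no network IO needed
    t_m_server = fetch_tm_server()  # one round trip for file attributes
    if t_m_server == entry.t_m:
        entry.t_c = now             # second condition: block still valid
        return entry.data
    entry.data = fetch_block()      # invalid: request a new copy
    entry.t_m = t_m_server
    entry.t_c = now
    return entry.data
```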

NFS Summary

  • $\checkmark$ access transparency: applications are usually unaware files are remote
  • $\times$ location transparency: not enforced; no global namespace as different clients can mount filesystems at different points
  • $\times$ mobility transparency: if server changes, the client must be updated
  • $\thicksim$ scalability: good, but could be better. The system can grow by adding servers as needed, but bottlenecks arise when many processes access a single file.
  • $\times$ file replication: not supported for updates. Additional services can be added to do this
  • $\checkmark$ Hardware/OS heterogeneity: NFS implemented on most OS and hardware platforms
  • $\checkmark$ fault tolerance: acceptable. NFS is stateless with idempotent operations, and offers options for handling failures
  • $\checkmark$ consistency: tunable, but not recommended for close synchronisation between processes
  • $\checkmark$ security: Kerberos is integrated with NFS. Secure RPC also an option
  • $\checkmark$ efficiency: acceptable, can be tuned.

Cloud Storage

Dropbox

  • 100+ million users, 1 billion files/day
  • design
    • small client with minimal resource usage
    • works with a low-capacity/disrupted network
    • backend scales
  • read/write ratio of 1:1, cf. the typical 100:1 for Twitter/Facebook
  • traffic only when changes occur
  • uses compression to reduce traffic