# Storing and Accessing Data on `gm2gpvm` machines
    
There are several options for storing and accessing data on `gm2gpvm` machines and grid worker nodes. This document goes over the options and how to copy data between different storage areas. 

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Home-area" data-toc-modified-id="Home-area-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Home area</a></span></li><li><span><a href="#/gm2/app-and-/gm2/data" data-toc-modified-id="/gm2/app-and-/gm2/data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span><code>/gm2/app</code> and <code>/gm2/data</code></a></span></li><li><span><a href="#Areas-on-/pnfs" data-toc-modified-id="Areas-on-/pnfs-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Areas on <code>/pnfs</code></a></span><ul class="toc-item"><li><span><a href="#Scratch-area-on-/pnfs/scratch/users" data-toc-modified-id="Scratch-area-on-/pnfs/scratch/users-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Scratch area on <code>/pnfs/scratch/users</code></a></span></li><li><span><a href="#Persistent-area-in-/pnfs/persistent" data-toc-modified-id="Persistent-area-in-/pnfs/persistent-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Persistent area in <code>/pnfs/persistent</code></a></span></li><li><span><a href="#Tape-backed-areas-in-/pnfs" data-toc-modified-id="Tape-backed-areas-in-/pnfs-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Tape backed areas in <code>/pnfs</code></a></span><ul class="toc-item"><li><span><a href="#Checking-if-a-file-is-in-the-cache" data-toc-modified-id="Checking-if-a-file-is-in-the-cache-3.3.1"><span class="toc-item-num">3.3.1&nbsp;&nbsp;</span>Checking if a file is in the cache</a></span></li></ul></li></ul></li></ul></div>

## Home area

When you log into one of the `gm2gpvm` machines you land in your home area (for example, `/nashome/l/lyon`). The details are:

* You have a 5 GB quota (I believe you can ask for an increase to 10 GB)
* Files here are backed up every night
* Files here are not accessible to grid worker nodes

You can check your quota with `quota -s`. For example,

```bash
cd ~ ; quota -s
```

```
quota: error while getting quota from pnfs-stken:/pnfs/fs/usr/GM2 for lyon (id 5049): Connection refused
Disk quotas for user lyon (uid 5049): 
     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
if-nas-0.fnal.gov:/gm2/app
                  6004G       0   6349G          11398k       0       0        
blue2:/fermigrid-data
                 12702M       0    200G           57960       0       0        
blue3.fnal.gov:/gm2/data
                 15974G       0  21504G            938k       0       0        
blue2:/fermigrid-fermiapp
                  3739G       0   4608G          37153k       0       0        
homesrv01.fnal.gov:/home
                   833M       0   5120M          18139k       0       0        
```


Only look at the last entry (`/home`). In this example it shows 833 MB used against the 5120 MB (5 GB) limit.
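If you only want the `/home` numbers, the `quota -s` output can be filtered. The following is a minimal sketch, not an official tool: `home_quota` is a hypothetical helper name, and it assumes the two-line layout seen in the output above (a filesystem line ending in `:/home`, followed by a line of numbers).

```shell
# Print used space and limit for the /home entry of `quota -s`-style
# output read from stdin. Assumes the filesystem name and its numbers
# are on two consecutive lines, as in the example output.
home_quota() {
    awk '/:\/home[ \t]*$/ { getline; print "used=" $1, "limit=" $3 }'
}

# Typical use on a gm2gpvm machine; 2>/dev/null hides errors from
# unrelated mounts:
#   quota -s 2>/dev/null | home_quota
```

If `quota` prints an entry on a single line on your machine, the filter above will not match and would need adjusting.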

I tend not to use my home area for very much. If you have important text or source files, you may want to keep them here so that they're backed up. 

## `/gm2/app` and `/gm2/data`

`/gm2/app` is the main area for building your code. `/gm2/data` is similar but meant for data. Both areas are shared by everyone. The details are:

* No quota, but check remaining space with `df`, for example, `cd /gm2/app ; df -h .` 
* Please use good manners and share these areas
* Files here are *not* backed up
* Files here are *not* accessible to grid worker nodes

Because files here are not accessible to grid worker nodes, use these areas for building and testing code interactively, not for files that grid jobs need to read or write.
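Since there is no quota on these areas, it is worth checking the remaining space before a large build or copy. Here is a small sketch that wraps the `df` check mentioned in the list above; `area_free` is a hypothetical helper name, and the column positions assume the POSIX output format selected by `-P`.

```shell
# Report free space for a shared area; defaults to /gm2/app.
# -P keeps each filesystem on one line so the column numbers are stable.
area_free() {
    df -hP "${1:-/gm2/app}" | awk 'NR == 2 { print $4 " free (" $5 " used) on " $6 }'
}

# Typical use:
#   area_free /gm2/app
#   area_free /gm2/data
```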

## Areas on `/pnfs`

`/pnfs` is connected to the *dCache* disk cache system, which is mainly meant for accessing files on tape but serves other purposes as well. `/pnfs` is not a real filesystem; rather, it is a presentation of the dCache database with a filesystem-like interface. On our interactive nodes (e.g. `gm2gpvm01.fnal.gov`), `/pnfs` is mounted via NFS, which allows actual file operations. Currently this system is rather fragile, and care must be taken when attempting large-scale access. Alternatives will be discussed below. There is general dCache monitoring information at http://fndca.fnal.gov.

Files in `/pnfs` and its sub-directories are accessible by grid worker nodes, so this is where the output of simulation, reconstruction, and analysis jobs should be stored. 

### Scratch area on `/pnfs/scratch/users`

`/pnfs/scratch/users` is a very large area for users to store files temporarily and is shared by all experiments. As users store new files into the scratch area, existing files are deleted according to a *Least Recently Used* (LRU) policy. That is, the files that have gone longest without being accessed are deleted first. The typical lifetime of a file is on the order of a month. From the dCache monitoring page, choose *File Lifetime Plots* and then search for *Public scratch pools*, or go there directly with http://fndca.fnal.gov/dcache/lifetime//PublicScratchPools.jpg . An example is shown below.

![File lifetime plot](http://fndca.fnal.gov/dcache/lifetime//PublicScratchPools.jpg)

This plot shows the distribution of file ages in the scratch cache. You can get an idea of how long your files will last. Note that when files are deleted, the directory structures remain (but they may be empty). 

Data files you produce should reside in the scratch area until you have validated them. Once you are sure that these files are valuable, you can move them either to the *persistent* area or to tape (see below). Be sure to check your files before their lifetime expires. 
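To decide which files to validate first, it can help to list the ones that have been sitting in scratch the longest, and to spot directories that have already been emptied. The sketch below uses plain `find`; the helper names `aging_files` and `empty_dirs` are hypothetical, and the default of 20 days is an arbitrary choice. Note the assumption that modification time is a usable stand-in: the LRU policy is based on last access, so treat the result as a rough guide only. Given that the NFS mount is fragile, run this on modest directory trees rather than very large ones.

```shell
# List regular files under DIR last modified more than DAYS days ago
# (DAYS defaults to 20). Modification time only approximates the
# last-access time that the LRU policy uses.
aging_files() {
    find "$1" -type f -mtime +"${2:-20}" -print
}

# List directories under DIR that are now empty, for example because
# their files have already been removed from scratch.
empty_dirs() {
    find "$1" -type d -empty -print
}

# Typical use, with the path to your own scratch directory:
#   aging_files /path/to/your/scratch/dir 20
#   empty_dirs  /path/to/your/scratch/dir
```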

Files in this area are not backed up (of course). While many experiments share the scratch area, it is many petabytes in size, making it difficult for one experiment or person to quickly turn over the cache (i.e. replace all files with their own). That being said, do not delay in using or checking files in scratch, in case the lifetime starts to shorten due to large-scale activity. 

As mentioned above, files in `/pnfs/scratch/users` are accessible by grid worker nodes. Files needed to run jobs, including code tar files and static input files, should be placed here for access by jobs. The originals of such files can live on `/gm2/app` or `/gm2/data` and be copied to `/pnfs/scratch/users` before starting the jobs that need them. 
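The staging step described above could be sketched as follows. This is only an illustration: `stage_tarball` is a hypothetical helper, not an existing tool, and both paths are placeholders that you supply. It uses plain `tar` and `cp` through the NFS mount, which should be reasonable for a single tar file; as noted earlier, the NFS mount is fragile, so this is not meant for large-scale copies.

```shell
# Bundle SRC_DIR into <name>.tar.gz and place it in DEST_DIR.
#   SRC_DIR  : e.g. a code area under /gm2/app
#   DEST_DIR : e.g. your directory under /pnfs/scratch/users
stage_tarball() {
    src_dir=$1
    dest_dir=$2
    name=$(basename "$src_dir").tar.gz

    mkdir -p "$dest_dir" || return 1
    # Build the tar file in a local temporary directory first, then copy
    # it over in one operation to keep activity on the destination small.
    tmp=$(mktemp -d) || return 1
    tar -czf "$tmp/$name" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" \
        && cp "$tmp/$name" "$dest_dir/$name"
    rc=$?
    rm -rf "$tmp"
    return $rc
}

# Typical use, with your own paths:
#   stage_tarball /path/under/gm2/app/mycode /path/to/your/scratch/dir
```

Packing many small files into one tar file means the job reads a single file from `/pnfs` and unpacks it on the worker node.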

### Persistent area in `/pnfs/persistent`

The word *persistent* is meant as the opposite of *scratch*. Files in `/pnfs/persistent` are never automatically deleted. Instead, this space is managed by the experiment and its users. At the time of writing this document, we have 90 TB of total space in this area. Files in this area are **not** backed up to tape (despite the use of *persistent*). Users in the collaboration need to share this space and use good manners. 

Unlike `/pnfs/scratch/users`, which is divided up by individual users, `/pnfs/persistent` should be divided up by *topic* and files here should be useful to more than one person. 

The amount of space used by each directory can be found from the main dCache monitoring page (http://fndca.fnal.gov) by choosing *Space usage in dCache Analysis pools by Storage Group* and then selecting *GM2*. You can go directly to this page with http://fndca.fnal.gov/cgi-bin/du_cgi.py?key=GM2. Similarly, you can see who is using up space by choosing *Space usage in dCache Analysis pools by Storage Group and User* and then selecting *GM2*. You can also get there directly with http://fndca.fnal.gov/cgi-bin/space_usage_by_user_cgi.py?key=GM2. 

These files are accessible by grid worker nodes, but grid jobs should not store files here. Rather, store files in scratch and move them here if they are deemed valuable (see below for moving commands). 

### Tape backed areas in `/pnfs`

Any directory in `/pnfs` that is not `/pnfs/scratch` and not `/pnfs/persistent` is part of the *tape backed read/write pool*. This area serves as a disk cache in front of tapes. It works much like the scratch area, in that files are removed from the cache according to an LRU policy, but these files are backed up to tape. The removed files do not disappear from `/pnfs`: their entries remain, but the data is no longer in the cache. When such a file is accessed, a tape mount is requested and the file is copied from tape back into the cache, so that first access may take a very long time. There are strategies to pre-stage files so that they will be in the cache when you need them.

Files must be stored here very carefully, as any file written to such an area will be stored on tape. Tape is an expensive resource and must be treated with care. That being said, valuable files that would take a long time to regenerate should go on tape. Moving files to tape will be discussed below. If you plan to move a large amount of data to tape (e.g. over 1 TB), contact the offline coordinators first. 



#### Checking if a file is in the cache