# Introduction to Python for hydrologists: sys, os, shutil, and subprocess
These four modules are part of the standard Python library and provide very useful functionality for working with your operating system and files.  This notebook explores these modules and demonstrates some of their functionality.  Online documentation is at [sys](https://docs.python.org/3/library/sys.html "sys doc"), [os](https://docs.python.org/3/library/os.html "os doc"), [shutil](https://docs.python.org/3/library/shutil.html "shutil doc"), and [subprocess](https://docs.python.org/3/library/subprocess.html "subprocess doc").

Important things to cover:
* sys: path
* os: path, chdir, getcwd, listdir
* shutil: copy, copytree, rmtree
* subprocess: check_call, check_output

## Sys Module

System-specific parameters and functions.

The following cells simply print some of the sys methods and attributes that you might find useful.

In [1]:
import sys
import os
import shutil
import subprocess
import traceback
import zipfile

In [2]:
print('sys.argv: ', sys.argv)

sys.argv:  ['/Users/aleaf/anaconda3/envs/pyclass/lib/python3.7/site-packages/ipykernel_launcher.py', '-f', '/Users/aleaf/Library/Jupyter/runtime/kernel-f553ed43-ff3b-43e8-8df3-80c040705556.json']


In [3]:
print('sys.byteorder: ', sys.byteorder)

sys.byteorder:  little


In [4]:
print('sys.copyright: ', sys.copyright)

sys.copyright:  Copyright (c) 2001-2019 Python Software Foundation.
All Rights Reserved.

Copyright (c) 2000 BeOpen.com.
All Rights Reserved.

Copyright (c) 1995-2001 Corporation for National Research Initiatives.
All Rights Reserved.

Copyright (c) 1991-1995 Stichting Mathematisch Centrum, Amsterdam.
All Rights Reserved.


In [5]:
print('sys.float_info: ', sys.float_info)

sys.float_info:  sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)


In [6]:
print('The size of an integer is ', sys.getsizeof(1), ' bytes.')
print('The size of a float is ', sys.getsizeof(1.0), ' bytes.')
print('The size of the string "Goldschlager" is ', sys.getsizeof('Goldschlager'), ' bytes.')

The size of an integer is  28  bytes.
The size of a float is  24  bytes.
The size of the string "Goldschlager" is  61  bytes.


In [7]:
try:
    print(sys.getwindowsversion())
except AttributeError:
    # sys.getwindowsversion() only exists on Windows
    print('Why are you against windows?')

Why are you against windows?


In [8]:
print(sys.prefix)

/Users/aleaf/anaconda3/envs/pyclass


In [9]:
print(sys.version_info)

sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)


In [10]:
sys.platform

'darwin'

## sys.path

If you haven't seen `sys.path` mentioned in a Python script already, you will soon.  `sys.path` is a list of directories that Python searches for modules and packages.  If, for some reason, you want to use a Python package that is not installed in the main Python folder, you can add the directory containing your module to `sys.path`.

In [11]:
print(sys.path)

# Or more elegantly
for pth in sys.path:
    print(pth)

['/Users/aleaf/Documents/GitHub/python-usgs-training/notebooks/part1_python_intro', '/Users/aleaf/anaconda3/envs/pyclass/lib/python37.zip', '/Users/aleaf/anaconda3/envs/pyclass/lib/python3.7', '/Users/aleaf/anaconda3/envs/pyclass/lib/python3.7/lib-dynload', '', '/Users/aleaf/anaconda3/envs/pyclass/lib/python3.7/site-packages', '/Users/aleaf/Documents/GitHub/flopy', '/Users/aleaf/Documents/GitHub/modflow-export', '/Users/aleaf/Documents/GitHub/gisutils', '/Users/aleaf/anaconda3/envs/pyclass/lib/python3.7/site-packages/IPython/extensions', '/Users/aleaf/.ipython']
/Users/aleaf/Documents/GitHub/python-usgs-training/notebooks/part1_python_intro
/Users/aleaf/anaconda3/envs/pyclass/lib/python37.zip
/Users/aleaf/anaconda3/envs/pyclass/lib/python3.7
/Users/aleaf/anaconda3/envs/pyclass/lib/python3.7/lib-dynload

/Users/aleaf/anaconda3/envs/pyclass/lib/python3.7/site-packages
/Users/aleaf/Documents/GitHub/flopy
/Users/aleaf/Documents/GitHub/modflow-export
/Users/aleaf/Documents/GitHub/gisutils
/

A common way that we add a folder to sys.path is as follows:

    pathtomymodule = os.path.join('..')
    if pathtomymodule not in sys.path:
        sys.path.append(pathtomymodule)

This will allow us to import any modules or packages that are up one directory from the current working directory.  Keep this in mind as we use this throughout the class exercises.
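
As a minimal sketch of the pattern above, the following cell builds a throwaway folder, adds it to `sys.path`, and imports a module from it.  The module name `my_hydro_tools` and its `cfs_to_cms` function are made up for this illustration; they are not part of the class materials.

```python
import importlib
import os
import sys
import tempfile

# create a throwaway folder holding a tiny module (hypothetical name)
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, 'my_hydro_tools.py'), 'w') as f:
    f.write('def cfs_to_cms(q_cfs):\n    return q_cfs * 0.3048 ** 3\n')

# add the folder to sys.path only if it is not already there
if module_dir not in sys.path:
    sys.path.append(module_dir)

importlib.invalidate_caches()  # make sure the newly written file is noticed
import my_hydro_tools

# convert 100 cubic feet per second to cubic meters per second
print(my_hydro_tools.cfs_to_cms(100.0))
```

Without the `sys.path.append` line, the import would fail with a `ModuleNotFoundError`, because the temporary folder is not one of the places Python searches by default.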

## os Module
Module for providing portable operating system functionality.

In [12]:
print('os.name: ', os.name)

os.name:  posix


In [13]:
#environment variables stored in a dictionary
print('os.environ: ', os.environ)
print('\n')

#or we can look at them in a nicer format
for k, v in os.environ.items():
    print('{0} : {1}'.format(k, v))

os.environ:  environ({'PROJ_LIB': '/Users/aleaf/anaconda3/envs/pyclass/share/proj', 'TERM_PROGRAM': 'Apple_Terminal', 'SSL_CERT_FILE': '/Users/aleaf/cert.pem', 'TERM': 'xterm-color', 'SHELL': '/bin/bash', 'TMPDIR': '/var/folders/4x/bmhyjcdn3mgfdvkk_jgz6bsr0028s1/T/', 'CONDA_SHLVL': '1', 'Apple_PubSub_Socket_Render': '/private/tmp/com.apple.launchd.Qp2MX2UUrU/Render', 'CONDA_PROMPT_MODIFIER': '(pyclass) ', 'TERM_PROGRAM_VERSION': '421.2', 'OLDPWD': '/Users/aleaf/Documents/GitHub/flopy', 'TERM_SESSION_ID': 'B2950181-E15C-44FA-A301-A92BDD220BA9', 'USER': 'aleaf', 'CONDA_EXE': '/Users/aleaf/anaconda3/bin/conda', 'SSH_AUTH_SOCK': '/private/tmp/com.apple.launchd.zWXIuiIJhq/Listeners', '_CE_CONDA': '', 'CPL_ZIP_ENCODING': 'UTF-8', 'PATH': '/Users/aleaf/anaconda3/envs/pyclass/bin:/Users/aleaf/anaconda3/condabin:/Library/Frameworks/Python.framework/Versions/3.6/bin:/Users/aleaf/anaconda3/bin:/Users/aleaf/anaconda3/bin:/Library/Frameworks/Python.framework/Versions/3.6/bin:/Users/aleaf/anaconda3/

In [14]:
cwd = os.getcwd()
print(cwd)

/Users/aleaf/Documents/GitHub/python-usgs-training/notebooks/part1_python_intro


In [15]:
#list all the entries in the specified directory. 
mylistofitems = os.listdir(os.getcwd())
for thingy in mylistofitems:
    if os.path.isdir(thingy):
        print('directory: ', thingy)
    else:
        print('file: ', thingy)

directory:  extracted_data
file:  09_sys-os.ipynb
file:  02_functions.ipynb
file:  TheisExercise.pdf
file:  06_numpy.ipynb
file:  .DS_Store
file:  08_namespace.ipynb
file:  Pandas_weather_timeseries_Wunderground.ipynb
directory:  images
file:  Untitled.ipynb
file:  Pandas_NWIS.ipynb
file:  04_objects.ipynb
file:  Pandas_ColoradoRiver-FFT.ipynb
file:  gis_vector_msn_crime.ipynb
file:  05_files.ipynb
file:  TheisExercise.tex
file:  03_scripts.ipynb
file:  mtsthelens.pdf
directory:  .ipynb_checkpoints
file:  Matplotlib_StHelens.ipynb
file:  gis_raster_mt_rainier_glaciers.ipynb
file:  xarray_mt_rainier_precip.ipynb
file:  Pandas_ColoradoRiver.ipynb
directory:  data
file:  tmp
file:  01_basics.ipynb
file:  junk.zip
file:  LeesFerryOnePlot.pdf


In [16]:
# Example of changing the working directory
old_wd = os.getcwd()

# Go up one directory
os.chdir('..')
cwd = os.getcwd()
print('Now in: ', cwd)

# Change back to original
os.chdir(old_wd)
cwd = os.getcwd()
print('Switched back to: ', cwd)

Now in:  /Users/aleaf/Documents/GitHub/python-usgs-training/notebooks
Switched back to:  /Users/aleaf/Documents/GitHub/python-usgs-training/notebooks/part1_python_intro
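
If an error occurs between the two `os.chdir` calls, the notebook is left sitting in the wrong folder.  One possible way around this is to wrap the change in a context manager.  The helper name `working_directory` below is hypothetical, not something provided by `os`; treat it as a sketch.

```python
import contextlib
import os
import tempfile

@contextlib.contextmanager
def working_directory(path):
    """Temporarily change the working directory (hypothetical helper)."""
    old_wd = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # always switch back, even if the code inside raised an error
        os.chdir(old_wd)

start = os.getcwd()
with working_directory(tempfile.gettempdir()):
    print('Now in: ', os.getcwd())
print('Switched back to: ', os.getcwd())
```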


## Glob
The glob library provides handy shorthand for listing files using patterns and wildcard (*) characters.

https://en.wikipedia.org/wiki/Glob_(programming)

**Note!** Sorting of the files returned by `glob` is platform-dependent. In general, if your code depends on a specific ordering of a list, it is best to explicitly sort it yourself using `sorted()` or `.sort()`, instead of depending on the behavior of an imported module.  
https://arstechnica.com/information-technology/2019/10/chemists-discover-cross-platform-python-scripts-not-so-cross-platform/

In [17]:
import glob

In [18]:
# list all of the Jupyter notebooks in the current working directory
glob.glob('*.ipynb')

['09_sys-os.ipynb',
 '02_functions.ipynb',
 '06_numpy.ipynb',
 '08_namespace.ipynb',
 'Pandas_weather_timeseries_Wunderground.ipynb',
 'Untitled.ipynb',
 'Pandas_NWIS.ipynb',
 '04_objects.ipynb',
 'Pandas_ColoradoRiver-FFT.ipynb',
 'gis_vector_msn_crime.ipynb',
 '05_files.ipynb',
 '03_scripts.ipynb',
 'Matplotlib_StHelens.ipynb',
 'gis_raster_mt_rainier_glaciers.ipynb',
 'xarray_mt_rainier_precip.ipynb',
 'Pandas_ColoradoRiver.ipynb',
 '01_basics.ipynb']

In [19]:
sorted(glob.glob('*.ipynb'))

['01_basics.ipynb',
 '02_functions.ipynb',
 '03_scripts.ipynb',
 '04_objects.ipynb',
 '05_files.ipynb',
 '06_numpy.ipynb',
 '08_namespace.ipynb',
 '09_sys-os.ipynb',
 'Matplotlib_StHelens.ipynb',
 'Pandas_ColoradoRiver-FFT.ipynb',
 'Pandas_ColoradoRiver.ipynb',
 'Pandas_NWIS.ipynb',
 'Pandas_weather_timeseries_Wunderground.ipynb',
 'Untitled.ipynb',
 'gis_raster_mt_rainier_glaciers.ipynb',
 'gis_vector_msn_crime.ipynb',
 'xarray_mt_rainier_precip.ipynb']

## os.path

os.path is a very widely used submodule of os.  In fact we use it in almost all of the class notebooks and scripts to deal with file system paths.  Some common os.path functions are:

    os.path.join()
    os.path.abspath()
    os.path.exists()
    os.path.isdir()
    os.path.normpath()
    os.path.split()
    os.path.splitext()
    
A common attribute of os.path is:

    os.path.sep
    
Experiment with these functions to gain a better understanding of what they do.
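
As a starting point for experimenting, here is a short sketch that runs each of the functions listed above on one path.  The path is only used as a string here, so the cell runs whether or not the file exists on your machine.

```python
import os

# build a path in a platform-independent way
pth = os.path.join('data', 'fileio', 'netcdf_data.zip')

# split into folder and file name, then into base name and extension
folder, fname = os.path.split(pth)
base, ext = os.path.splitext(fname)
print(folder, fname, base, ext)

# absolute path, built relative to the current working directory
print(os.path.abspath(pth))

# normpath collapses redundant pieces such as '..'
print(os.path.normpath('data/../data/fileio'))

# these two actually look at the file system
print(os.path.exists(pth), os.path.isdir(folder))

# the separator is '/' on mac/unix and '\\' on Windows
print(repr(os.path.sep))
```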

## os.walk

os.walk() is a great way to recursively generate all the file names and folders in a directory.  The following shows how it can be used to identify large directories.

In [20]:
pth = os.path.join('..')
for root, dirs, files in os.walk(pth):
    mbytes = sum(os.path.getsize(os.path.join(root, name)) for name in files) / 1.e6
    print('{:<50} --> {:10.2f} megabytes.'.format(root, mbytes))

..                                                 -->       0.01 megabytes.
../part1_python_intro                              -->       7.35 megabytes.
../part1_python_intro/extracted_data               -->       0.00 megabytes.
../part1_python_intro/images                       -->       0.01 megabytes.
../part1_python_intro/.ipynb_checkpoints           -->       6.70 megabytes.
../part1_python_intro/data                         -->       1.50 megabytes.
../part1_python_intro/data/04_numpy                -->       2.71 megabytes.
../part1_python_intro/data/fileio                  -->       0.03 megabytes.
../part1_python_intro/data/fileio/netcdf_data      -->       0.00 megabytes.
../part1_python_intro/data/fileio/netcdf_data/zipped -->       0.00 megabytes.
../part1_python_intro/data/fileio/netcdf_data/zipped/zipped_1991 -->       0.00 megabytes.
../part1_python_intro/data/fileio/netcdf_data/zipped/zipped_1996 -->       0.00 megabytes.
../part1_python_intro/data/fileio/netcdf_data/
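
Note that the loop above only counts the files sitting directly in each folder.  If you want the total for a folder including everything beneath it, one possible approach is to add up the sizes over the whole walk.  The function name `tree_size` is made up for this sketch, and the demonstration uses a small temporary folder rather than the class data.

```python
import os
import tempfile

def tree_size(top):
    """Total size in bytes of all files under top, including subfolders."""
    total = 0
    for root, dirs, files in os.walk(top):
        for name in files:
            fpath = os.path.join(root, name)
            if os.path.isfile(fpath):  # skip broken symbolic links
                total += os.path.getsize(fpath)
    return total

# small self-contained demonstration in a temporary folder
demo = tempfile.mkdtemp()
os.makedirs(os.path.join(demo, 'sub'))
with open(os.path.join(demo, 'a.txt'), 'w') as f:
    f.write('x' * 10)
with open(os.path.join(demo, 'sub', 'b.txt'), 'w') as f:
    f.write('y' * 5)

print('{:.6f} megabytes'.format(tree_size(demo) / 1.e6))
```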

## shutil Module
shutil is a high-level file management module for copying, moving, and deleting files and directories.

The functions from shutil that you may find useful are:

    shutil.copy()
    shutil.copytree()
    shutil.move()
    shutil.rmtree()  #obviously, you need to be careful with this one!
    
Give these guys a shot and see what they do.  Remember, you can always get help by typing:

    help(shutil.copy)


In [21]:
#try them here.  Be careful!
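
If you would rather not experiment on real files, one safe way to try these functions is inside a temporary folder.  The sketch below does that; the folder and file names are invented for the example.

```python
import os
import shutil
import tempfile

# work inside a temporary folder so nothing important can be deleted
sandbox = tempfile.mkdtemp()
src_dir = os.path.join(sandbox, 'model')
os.makedirs(src_dir)
with open(os.path.join(src_dir, 'input.txt'), 'w') as f:
    f.write('some model input\n')

# copy a single file
shutil.copy(os.path.join(src_dir, 'input.txt'),
            os.path.join(sandbox, 'input_backup.txt'))

# copy a whole folder; by default the destination must not already exist
shutil.copytree(src_dir, os.path.join(sandbox, 'model_copy'))

# move (or rename) a file
shutil.move(os.path.join(sandbox, 'input_backup.txt'),
            os.path.join(sandbox, 'renamed.txt'))
print(sorted(os.listdir(sandbox)))

# delete a folder and everything in it; there is no undo
shutil.rmtree(os.path.join(sandbox, 'model_copy'))
print(sorted(os.listdir(sandbox)))
```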


## subprocess Module

The subprocess module offers a way to execute system commands.  This is how we will run MODFLOW, for example, but you can also use it to run any operating system command that you can type at the command line.

subprocess.Popen() is the primary underlying function for running system commands; however, it is recommended that you use subprocess.check_output and subprocess.check_call instead.  Both of these functions use Popen.

Take a look at the following help descriptions for check_output and check_call.

Note that on Windows, you often have to specify "shell=True" in order to run commands that are built into the command shell (such as `dir`).
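
Before looking at the help text, here is a small sketch of `check_output` that should behave the same on Windows, mac, and unix.  It runs the Python interpreter itself (`sys.executable`) as the external program, which is a choice made for this example so that the command is guaranteed to exist.

```python
import subprocess
import sys

# sys.executable is the python running this notebook, so the command
# exists on every platform; no shell is needed when passing a list
output = subprocess.check_output(
    [sys.executable, '-c', 'print("hello from a subprocess")'])

# check_output returns bytes; decode to get a string
text = output.decode().strip()
print(text)
```

Passing the command as a list, one item per argument, avoids any quoting problems with spaces in file names.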

In [22]:
help(subprocess.check_output)

Help on function check_output in module subprocess:

check_output(*popenargs, timeout=None, **kwargs)
    Run command with arguments and return its output.
    
    If the exit code was non-zero it raises a CalledProcessError.  The
    CalledProcessError object will have the return code in the returncode
    attribute and output in the output attribute.
    
    The arguments are the same as for the Popen constructor.  Example:
    
    >>> check_output(["ls", "-l", "/dev/null"])
    b'crw-rw-rw- 1 root root 1, 3 Oct 18  2007 /dev/null\n'
    
    The stdout argument is not allowed as it is used internally.
    To capture standard error in the result, use stderr=STDOUT.
    
    >>> check_output(["/bin/sh", "-c",
    ...               "ls -l non_existent_file ; exit 0"],
    ...              stderr=STDOUT)
    b'ls: non_existent_file: No such file or directory\n'
    
    There is an additional optional argument, "input", allowing you to
    pass a string to the subprocess's stdin.  If

In [23]:
help(subprocess.check_call)

Help on function check_call in module subprocess:

check_call(*popenargs, **kwargs)
    Run command with arguments.  Wait for command to complete.  If
    the exit code was zero then return, otherwise raise
    CalledProcessError.  The CalledProcessError object will have the
    return code in the returncode attribute.
    
    The arguments are the same as for the call function.  Example:
    
    check_call(["ls", "-l"])



In [24]:
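# if on mac/unix
# note: with shell=True and a list, only 'ls' is run ('-l' goes to the shell); drop shell=True to get 'ls -l'
subprocess.check_output(['ls', '-l'], shell=True)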

b'01_basics.ipynb\n02_functions.ipynb\n03_scripts.ipynb\n04_objects.ipynb\n05_files.ipynb\n06_numpy.ipynb\n08_namespace.ipynb\n09_sys-os.ipynb\nLeesFerryOnePlot.pdf\nMatplotlib_StHelens.ipynb\nPandas_ColoradoRiver-FFT.ipynb\nPandas_ColoradoRiver.ipynb\nPandas_NWIS.ipynb\nPandas_weather_timeseries_Wunderground.ipynb\nTheisExercise.pdf\nTheisExercise.tex\nUntitled.ipynb\ndata\nextracted_data\ngis_raster_mt_rainier_glaciers.ipynb\ngis_vector_msn_crime.ipynb\nimages\njunk.zip\nmtsthelens.pdf\ntmp\nxarray_mt_rainier_precip.ipynb\n'

In [25]:
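# if on Windows
try:
    output = subprocess.check_output(['dir'], shell=True)
    print(output)
except subprocess.CalledProcessError:
    # 'dir' is not a command on mac/unix, so the shell returns an error
    print('Why are you against windows?')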

Why are you against windows?


In [26]:
# What is going on here?
try:
    subprocess.check_call(['dir'], shell=True)
except Exception as e:
    traceback.print_exc()

Traceback (most recent call last):
  File "<ipython-input-26-aa18cf219b2e>", line 3, in <module>
    subprocess.check_call(['dir'], shell=True)
  File "/Users/aleaf/anaconda3/envs/pyclass/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['dir']' returned non-zero exit status 127.
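
The traceback shows `check_call` raising `CalledProcessError` because the command returned a non-zero exit status (a POSIX shell conventionally reports 127 when it cannot find a command).  As a sketch of how you might handle this in your own scripts, the cell below runs a command that deliberately fails and reads the status from the exception.  The failing command is invented for the example.

```python
import subprocess
import sys

# a command that deliberately exits with status 3
cmd = [sys.executable, '-c', 'import sys; sys.exit(3)']

try:
    subprocess.check_call(cmd)
except subprocess.CalledProcessError as e:
    # the exception carries the exit status and the command that was run
    returncode = e.returncode
    print('command failed with exit status', returncode)
```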


## Zipfiles

#### zip up one of the files in data/

In [27]:
with zipfile.ZipFile('junk.zip', 'w') as dest:
    dest.write('data/430429089230301.dat')

#### now extract it

In [28]:
with zipfile.ZipFile('junk.zip') as src:
    src.extract('data/430429089230301.dat', path='data/extracted_data')
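
The same round trip can be tried without touching the class data.  The following sketch writes a small file in a temporary folder, zips it, lists the archive contents, and extracts it again.  The file name `gage.dat` is made up for the illustration.

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
datafile = os.path.join(workdir, 'gage.dat')  # hypothetical file name
with open(datafile, 'w') as f:
    f.write('1.0 2.0 3.0\n')

zip_path = os.path.join(workdir, 'junk.zip')
# arcname sets the name stored inside the archive,
# so the temporary folder path is not included
with zipfile.ZipFile(zip_path, 'w', compression=zipfile.ZIP_DEFLATED) as dest:
    dest.write(datafile, arcname='gage.dat')

with zipfile.ZipFile(zip_path) as src:
    names = src.namelist()  # list the contents without extracting
    src.extractall(os.path.join(workdir, 'extracted'))

print(names)
```

Without `compression=zipfile.ZIP_DEFLATED`, `ZipFile` stores files uncompressed, which is the default.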

## Testing Your Skills with a truly awful example:

#### the problem:
Pretend that the file `data/fileio/netcdf_data.zip` contains some climate data that we downloaded. If you open `data/fileio/netcdf_data.zip`, you'll see that within a subfolder `zipped` are a bunch of additional subfolders, each for a different year. Within each subfolder is another zipfile. Within each of these zipfiles is yet another subfolder, inside of which is the actual data file we want (`prcp.nc`). 

#### the goal:
To extract all of these `prcp.nc` files into a single folder, after renaming them with their respective years (obtained from their enclosing folders or zip files). e.g.  
```
prcp_1980.nc
prcp_1981.nc
...
```
This will allow us to open them together as a dataset in `xarray` (more on that later). Does this sound awful? I'm not making this up. This is the kind of structure you get when downloading tiles of climate data with the [Daymet Tile Selection Tool](https://daymet.ornl.gov/gridded/).

#### hint:
you might find these functions helpful:
```
glob.glob
os.path.isdir
os.makedirs
zipfile.ZipFile
os.path.split
os.path.splitext
os.path.join
shutil.move
os.rename
os.rmdir
```

### solution

First, extract the master zipfile

In [29]:
with zipfile.ZipFile('data/fileio/netcdf_data.zip') as src:
    src.extractall('data/fileio/')

Make a list of the zipfiles

In [30]:
zipfiles = sorted(glob.glob('data/fileio/netcdf_data/zipped/*/*.zip'))
zipfiles[:5]

['data/fileio/netcdf_data/zipped/zipped_1980/12270_1980.zip',
 'data/fileio/netcdf_data/zipped/zipped_1981/12270_1981.zip',
 'data/fileio/netcdf_data/zipped/zipped_1982/12270_1982.zip',
 'data/fileio/netcdf_data/zipped/zipped_1983/12270_1983.zip',
 'data/fileio/netcdf_data/zipped/zipped_1984/12270_1984.zip']

In [31]:
# declare a destination path
dest_path = 'extracted_data'
variable = 'prcp'

for f in zipfiles:
    with zipfile.ZipFile(f) as src:
        # get the path to the source file and the year
        _, fname = os.path.split(f)
        name = os.path.splitext(fname)[0].replace('.tar', '')
        srcfile = '{}/{}.nc'.format(name, variable)
        year = name.split('_')[1]

        # where we want the extracted .nc file to end up
        destfile = os.path.join(dest_path, '{}_{}.nc'.format(variable, year))

        # extract srcfile into dest_path
        # unfortunately this extracts the whole path, not just the file
        src.extract(srcfile, dest_path)
        # move the file up from its subfolder to dest_path
        shutil.move(os.path.join(dest_path, srcfile), dest_path)
        # rename to include year
        os.rename(os.path.join(dest_path, '{}.nc'.format(variable)),
                  destfile)
        # trash subfolders that were extracted
        os.rmdir(os.path.join(dest_path, name))
        print('{}/{} --> {}'.format(f, srcfile, destfile))

data/fileio/netcdf_data/zipped/zipped_1980/12270_1980.zip/12270_1980/prcp.nc --> extracted_data/prcp_1980.nc
data/fileio/netcdf_data/zipped/zipped_1981/12270_1981.zip/12270_1981/prcp.nc --> extracted_data/prcp_1981.nc
data/fileio/netcdf_data/zipped/zipped_1982/12270_1982.zip/12270_1982/prcp.nc --> extracted_data/prcp_1982.nc
data/fileio/netcdf_data/zipped/zipped_1983/12270_1983.zip/12270_1983/prcp.nc --> extracted_data/prcp_1983.nc
data/fileio/netcdf_data/zipped/zipped_1984/12270_1984.zip/12270_1984/prcp.nc --> extracted_data/prcp_1984.nc
data/fileio/netcdf_data/zipped/zipped_1985/12270_1985.zip/12270_1985/prcp.nc --> extracted_data/prcp_1985.nc
data/fileio/netcdf_data/zipped/zipped_1986/12270_1986.zip/12270_1986/prcp.nc --> extracted_data/prcp_1986.nc
data/fileio/netcdf_data/zipped/zipped_1987/12270_1987.zip/12270_1987/prcp.nc --> extracted_data/prcp_1987.nc
data/fileio/netcdf_data/zipped/zipped_1988/12270_1988.zip/12270_1988/prcp.nc --> extracted_data/prcp_1988.nc
data/fileio/netcdf_
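
The extract, move, rename, and rmdir steps in the solution are needed because `extract` recreates the folder structure stored in the archive.  A possible alternative, sketched below, is to open the archive member as a file object and copy it straight to its final name.  The stand-in archive is built on the fly so the cell is self-contained; its contents are placeholder text, not real NetCDF data.

```python
import os
import shutil
import tempfile
import zipfile

workdir = tempfile.mkdtemp()

# build a small stand-in archive with the same layout: <name>/prcp.nc
zip_path = os.path.join(workdir, '12270_1980.zip')
with zipfile.ZipFile(zip_path, 'w') as z:
    z.writestr('12270_1980/prcp.nc', 'not really netcdf')

dest_path = os.path.join(workdir, 'extracted_data')
os.makedirs(dest_path)
destfile = os.path.join(dest_path, 'prcp_1980.nc')

# open the archive member as a file object and copy its bytes
# straight to the final name; no subfolders are created
with zipfile.ZipFile(zip_path) as src:
    with src.open('12270_1980/prcp.nc') as member, open(destfile, 'wb') as out:
        shutil.copyfileobj(member, out)

print(os.listdir(dest_path))
```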

## Bonus -- Determining the location of an executable

There are often times that you run an executable that is nested somewhere deep within your system path.  It can often be a good idea to know exactly where that executable is located.  This might one day keep you from accidentally using an older version of an executable, such as MODFLOW.

In [32]:
# Define two functions to help determine 'which' program you are using
def is_exe(fpath):
    """
    Return True if fpath is an executable, otherwise return False
    """
    return os.path.isfile(fpath) and os.access(fpath, os.X_OK)

def which(program):
    """
    Locate the program and return its full path.  Return
    None if the program cannot be located.
    """
    fpath, fname = os.path.split(program)
    if fpath:
        if is_exe(program):
            return program
    else:
        # test for exe in current working directory
        if is_exe(program):
            return program
        # test for exe in path statement
        for path in os.environ["PATH"].split(os.pathsep):
            path = path.strip('"')
            exe_file = os.path.join(path, program)
            if is_exe(exe_file):
                return exe_file
    return None

In [33]:
which('MODFLOW-NWT_64.exe')
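
The standard library also ships a similar lookup, `shutil.which` (available since Python 3.3), which returns the full path to a program or `None` if it cannot be found.  The sketch below looks up the Python interpreter in the folder it lives in, simply because that program is certain to exist; the second program name is deliberately nonsense.

```python
import os
import shutil
import sys

# look up the interpreter by name in the folder it actually lives in
exe_dir, exe_name = os.path.split(sys.executable)
found = shutil.which(exe_name, path=exe_dir)
print(found)

# a program that cannot be found returns None
missing = shutil.which('surely-not-a-real-program-name')
print(missing)
```

Leaving out the `path` argument makes `shutil.which` search the folders in the `PATH` environment variable, much like the function defined above.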