# Creating a File Report for `CMIP5` Data

First, import the `FileReport` class from the `PyDatSci.FileReport` module.

In [1]:
from PyDatSci.FileReport import FileReport

Set the root directory to scan. All of its subdirectories are scanned as well.

In [2]:
rootDir = '/work/ch0636/eddy/pool/sims/cmip5'

Now create a `FileReport` object from `rootDir` and start the reporting process. This can take some time if the directories contain a large number of files.

In [3]:
report = FileReport(rootDir)
report.report()

Scanning: /work/ch0636/eddy/pool/sims/cmip5/IPCC-AR5-SLR-DATA                      finished scan
Found 12034 files
Scanning for Symlinks...
Progress: |██████████████████████████████████████████████████| 100.0% Complete
Checking valid files...
Progress: |██████████████████████████████████████████████████| 100.0% Complete
Creating suffix overview...
Progress: |██████████████████████████████████████████████████| 100.0% Complete
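
For readers curious what a recursive scan of this kind involves, here is a minimal sketch using only the standard library. It is not the `PyDatSci` implementation; the helper `list_files` and the temporary demo directory are hypothetical names introduced for illustration.

```python
import os
import tempfile


def list_files(root_dir):
    """Return the paths of all files below root_dir, including subdirectories.

    Hypothetical helper for illustration; not part of PyDatSci.
    """
    paths = []
    # os.walk visits root_dir and every directory beneath it.
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths


# Small demo on a throwaway directory tree (made-up file names).
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.nc", os.path.join("sub", "b.sh")):
        open(os.path.join(tmp, rel), "w").close()
    found = list_files(tmp)
    relative = sorted(os.path.relpath(p, tmp) for p in found)

print(f"Found {len(found)} files")
```

On a large tree such as a CMIP5 archive, a walk like this touches every directory once, which is why the scan can take a while.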


After the scan has finished, we write the contents of the `FileReport` object to disk.

In [9]:
report.write('/work/ch0636/eddy/pool/sims/cmip5/file-report-cmip5.fr')

Now, we can print the report to get an overview:

In [5]:
print(report)


FileReport of /work/ch0636/eddy/pool/sims/cmip5

 Number of Files               : 12034
 Number of valid Files         : 12031
 Number of Symlinks            : 0
 Number of missing Symlinks    : 3
 
 Number of Suffixes            : 11
               .py             : 59
               .txt            : 16
               .p              : 9
               .png            : 31
               .status         : 2
               .csv#           : 2
               .gr             : 5
               .nc             : 11902
               .sh             : 2
               .csv            : 2
               no suffix       : 1



The report groups the files by suffix and also checks for broken symbolic links. The file list for a suffix can be looked up by indexing the report with that suffix. For example, to get a list of all shell scripts:

In [6]:
print(report['.sh'])

['/work/ch0636/eddy/pool/sims/cmip5/download/wget-20180731121952.sh', '/work/ch0636/eddy/pool/sims/cmip5/download/wget-20180730165342.sh']
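
The grouping by suffix can be sketched with `os.path.splitext` and a dictionary of lists. This is only an illustration under the assumption that the suffix is the final extension of the file name; the function `group_by_suffix` and the example paths are made up and not taken from `PyDatSci`.

```python
import os
from collections import defaultdict


def group_by_suffix(paths):
    """Map each file suffix to the list of paths carrying it.

    Hypothetical helper for illustration; not part of PyDatSci.
    """
    groups = defaultdict(list)
    for path in paths:
        # splitext returns ('name', '.ext'); the extension is '' if absent.
        suffix = os.path.splitext(path)[1] or "no suffix"
        groups[suffix].append(path)
    return dict(groups)


# Made-up example paths.
paths = ["download/wget-a.sh", "download/wget-b.sh", "data/tas.nc", "README"]
groups = group_by_suffix(paths)
print(groups[".sh"])
```

Looking up `groups['.sh']` then mirrors the `report['.sh']` call shown above.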


The overview shows 3 broken symbolic links somewhere in `rootDir`. To see which links are broken, access the report's internal file lists, for example:

In [7]:
print(report.missing)

['/work/ch0636/eddy/pool/sims/cmip5/GFDL-CM3/fx/areacella_fx_GFDL-CM3_historical_r0i0p0.nc', '/work/ch0636/eddy/pool/sims/cmip5/GFDL-CM3/fx/sftlf_fx_GFDL-CM3_historical_r0i0p0.nc', '/work/ch0636/eddy/pool/sims/cmip5/GFDL-CM3/fx/orog_fx_GFDL-CM3_historical_r0i0p0.nc']
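
One common way to detect a broken symbolic link with the standard library is to check that a path is a link whose target does not exist. The sketch below assumes a POSIX-like system where creating symlinks is permitted; `is_broken_symlink` is a hypothetical helper, and whether `FileReport` uses the same test is not confirmed here.

```python
import os
import tempfile


def is_broken_symlink(path):
    """Return True if path is a symlink whose target does not exist.

    Hypothetical helper for illustration; not part of PyDatSci.
    """
    # islink inspects the link itself; exists follows it to the target.
    return os.path.islink(path) and not os.path.exists(path)


# Demo with one regular file, one working link and one dangling link.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "target.nc")
    open(target, "w").close()
    os.symlink(target, os.path.join(tmp, "good.nc"))
    os.symlink(os.path.join(tmp, "does-not-exist.nc"),
               os.path.join(tmp, "broken.nc"))
    results = {
        name: is_broken_symlink(os.path.join(tmp, name))
        for name in ("target.nc", "good.nc", "broken.nc")
    }

print(results)
```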


The broken links are excluded from the valid file list automatically: 12034 files minus 3 broken links leaves 12031 valid files. You can confirm this by printing the length of the valid file list in the report:

In [8]:
print(len(report.valid))

12031
