cmor crashing #6

Closed
pfuhe1 opened this issue Mar 19, 2014 · 13 comments

@pfuhe1

pfuhe1 commented Mar 19, 2014

I'm having an issue with cmor crashing intermittently. I am processing a number of files in a row, and cmor will produce a few files (even up to 100), then throw this error in the cmor_axis routine:

*** glibc detected *** /apps/python/2.7.3/bin/python: double free or corruption (!prev): 0x0000000004aa17d0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x760e6)[0x7f9fc7e940e6]
/lib64/libc.so.6(+0x78c13)[0x7f9fc7e96c13]
/home/599/pfu599/pythonlibs/lib/python2.7/site-packages/cmor/_cmor.so(cmor_axis+0x5a2)[0x7f9f90fcb222]
/home/599/pfu599/pythonlibs/lib/python2.7/site-packages/cmor/_cmor.so(+0xa805)[0x7f9f90faa805]
/apps/python/2.7.3/bin/python(PyEval_EvalFrameEx+0x5600)[0x4a7580]
/apps/python/2.7.3/bin/python(PyEval_EvalCodeEx+0x877)[0x4a9067]
/apps/python/2.7.3/bin/python(PyEval_EvalFrameEx+0x5294)[0x4a7214]
/apps/python/2.7.3/bin/python(PyEval_EvalFrameEx+0x6575)[0x4a84f5]

I've also seen this similar error:
*** glibc detected *** /apps/python/2.7.3/bin/python: free(): invalid pointer: 0x00000000246ee160 ***
and this:
*** glibc detected *** /apps/python/2.7.3/bin/python: corrupted double-linked list: 0x0000000003286ec0 ***

If I run the script to produce the same file again, generally it will succeed but then crash again when trying to produce a different file.

I am using the most recent version of cmor (2.9.1), with Python 2.7.3 and netCDF 4.3.0. I've tried building cmor with both the Intel and GNU compilers.

@doutriaux1
Collaborator

@pfuhe1 Could you please attach or send me a sample script that reproduces this?

@doutriaux1
Collaborator

@pfuhe1 Can you compile with debug enabled so that the trace tells us where the core dump happens? Or run it via gdb.

@pfuhe1
Author

pfuhe1 commented Apr 4, 2014

@doutriaux1 I have been having trouble reproducing this crash reliably, so will do a bit more testing myself before sending you a script.

I am also unsure whether I have compiled in debug mode correctly. I set the environment variable CFLAGS = '-g' when I compiled, but this doesn't seem to change the trace that is output when it crashes. Do I have to specify some other debug options or set them another way?

@pfuhe1
Author

pfuhe1 commented Jul 24, 2014

I have come back to this issue again, and have produced a simple script that uses cmor to write random data to a file. It then loops, writing over the file many times.

It seems to have a memory leak and crashes after a while from running out of memory. I'm wondering if this is due to the same issue as above.

I don't think I can attach the files here, so I'm sending you the script and some example output by email.

@doutriaux1
Collaborator

Thank you so much for doing this. Can you please post the script here? It will help with debugging.

@pfuhe1
Author

pfuhe1 commented Jul 25, 2014

# This is a dummy version of the ACCESS Post Processor.
# Peter Uhe
# 24 July 2014
#
import numpy as np
import datetime
import cmor

#
#main function to post-process files
#
def app(opts):

    #
    # Set up the CMOR stuff.
    #
    print 'cmor setup'
    cmor.setup(inpath=opts['table_path'], 
            netcdf_file_action=cmor.CMOR_REPLACE_3, 
            set_verbosity=cmor.CMOR_NORMAL, 
            exit_control=cmor.CMOR_NORMAL,
            logfile=opts['logfile'], create_subdirectories=1)

    #
    # Define the dataset.
    #
    cmor.dataset(outpath=opts['outpath'],
            experiment_id=opts['experiment_id'], 
            institution=opts['institution'], 
            source=opts['source'],
            calendar=opts['calendar'], 
            realization=opts['realization'],
            contact=opts['contact'],
            history=opts['history'],
            comment=opts['comment'],
            references=opts['references'],
            model_id=opts['model_id'],
            forcing=opts['forcing'],
            initialization_method=opts['initialization_method'],
            physics_version=opts['physics_version'],
            institute_id=opts['institution_id'],
            parent_experiment_id=opts['parent_experiment_id'],
            branch_time=opts['branch_time'],
            parent_experiment_rip=opts['parent_experiment_rip'])

    #
    # Load the CMIP tables into memory.
    #
    tables=[]
    tables.append(cmor.load_table('CMIP5_grids'))
    tables.append(cmor.load_table(opts['cmip_table']))

    #manually create time axis for monthly data
    min_tvals=[]
    max_tvals=[]
    cmor_tName='time'
    tvals=[]
    axis_ids=[]
    for year in range(opts['tstart'],opts['tend']+1):
        for mon in range(1,13):
            tvals.append(datetime.date(year,mon,15).toordinal()-1)
    # set up time values and bounds
    for i,ordinaldate in enumerate(tvals):
        model_date  = datetime.date.fromordinal(int(ordinaldate)+1)
        #min bound is first day of month
        model_date=model_date.replace(day=1)
        min_tvals.append(model_date.toordinal()-1)
        #max_bound is first day of next month
        # integer division: the year only advances when the month is December
        tyr=model_date.year+model_date.month//12
        tmon=model_date.month%12+1                              
        model_date=model_date.replace(year=tyr,month=tmon)
        max_tvals.append(model_date.toordinal()-1)
        #correct date to middle of month
        mid=(max_tvals[i]-min_tvals[i])/2.
        tvals[i]=min_tvals[i]+mid
    tval_bounds = np.column_stack((min_tvals, max_tvals))
    #set up cmor time axis:
    cmor.set_table(tables[1])
    time_axis_id = cmor.axis(table_entry=cmor_tName,
        units='days since 0001-01-01', length=len(tvals),
        coord_vals=tvals[:], cell_bounds=tval_bounds[:],
        interval=None)
    axis_ids.append(time_axis_id)

    #
    # Define the CMOR variable.
    #
    cmor.set_table(tables[1])
    in_missing = float(1.e20)
    print 'defining cmor variable'
    variable_id = cmor.variable(table_entry=opts['vcmip'], units=opts['in_units'], \
    axis_ids=axis_ids, type='f', missing_value=in_missing)

    #
    # Write the data 
    #
    data_vals=np.array(np.random.rand(len(tvals)),dtype=np.float32)
    try:
        print 'writing...'
        cmor.write(variable_id, data_vals[:], ntimes_passed=np.shape(data_vals)[0]) #assuming time is the first dimension
    except Exception as e:
        # keep the original error text so the failure can be diagnosed
        raise Exception("ERROR writing data: %s" % e)
    #
    # Close the CMOR file.
    #
    try:
        path = cmor.close(variable_id, file_name=True)
    except:
        raise Exception("ERROR closing cmor file!")

    return path



if __name__ == "__main__":
#   from pympler import tracker
    import resource
#   from guppy import hpy

# Example dictionary containing metadata used by the post-processor
    opts={'initialization_method': 1, 'calculation': '', 'vin': ['temp_global_ave'], 'branch_time': 109207.0, 'vcmip': 'thetaoga', 'positive': '', 'tend': 1852, 'tstart': 1850, 'realization': 1, 'forcing': 'GHG, Oz, SA, Sl, Vl, BC, OC, (GHG = CO2, N2O, CH4, CFC11, CFC12, CFC113, HCFC22, HFC125, HFC134a)', 'infile': '/g/data1/p66/ACCESSDIR/har599/ACCESS/output/hg2-r11Mhd/history///ocn/ocean_scalar.nc-*', 'model_id': 'ACCESS-test', 'parent_experiment_id': 'piControl', 'cmip_table': 'CMIP5_Omon', 'in_units': 'K', 'version_number': 'v20130710', 'notes': 'branch date is 300-01-01', 'physics_version': 1, 'axes_modifier': 'dropX', 'experiment_id': 'historical', 'parent_experiment_rip': 'r1i1p1'}

    opts['source']='ACCESS-test 2011. \
        Atmosphere: AGCM v1.0 (N96 grid-point, 1.875 degrees EW x approx 1.25 degree NS, 38 levels); '+\
        'ocean: NOAA/GFDL MOM4p1 (nominal 1.0 degree EW x 1.0 degrees NS, tripolar north of 65N, '+\
         'equatorial refinement to 1/3 degree from 10S to 10 N, cosine dependent NS south of 25S, 50 levels); '+\
        'sea ice: CICE4.1 (nominal 1.0 degree EW x 1.0 degrees NS, tripolar north of 65N, '+\
        'equatorial refinement to 1/3 degree from 10S to 10 N, cosine dependent NS south of 25S); '+\
        'land: MOSES2 (1.875 degree EW x approx. 1.25 degree NS, 4 levels'
    opts['logfile']=None
    opts['institution']='CSIRO-BOM'
    opts['institution_id']='CSIRO-BOM'
    opts['calendar']='proleptic_gregorian'
    opts['contact']='dummy'
    opts['history']='dummy'
    opts['references']='dummy'
    opts['comment']='dummy'
    opts['outpath']='/short/p66/pfu599'
    opts['table_path']='/g/data1/p66/pfu599/post_processor/branches/APP1-0/cmip5-cmor-tables/Tables'

# Memory profiler setup
#   tr = tracker.SummaryTracker()
#   tr.print_diff()              
#   hp=hpy()
#   new=hp.heap()

# Loop over many times rewriting the same file. 
    for i in range(10):
        print i
        print app(opts)

# Memory profiling 
#       tr.print_diff()
#       old=new
#       new=hp.heap()
#       diff=new-old
#       print diff
#       print diff.byrcs[0].byid
        print 'Memory usage: %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
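The time-axis loop in the script above can also be written as a small standalone function. This is only a sketch in Python 3 under the same assumptions as the script (monthly data, bounds on the first day of each month, values in days since 0001-01-01); the name `monthly_time_axis` is illustrative and not part of cmor.

```python
import datetime


def monthly_time_axis(tstart, tend):
    """Return (mids, bounds) in days since 0001-01-01 for every month of
    the years tstart..tend inclusive."""
    mids, bounds = [], []
    for year in range(tstart, tend + 1):
        for month in range(1, 13):
            lower = datetime.date(year, month, 1)
            # First day of the following month; December rolls into January.
            upper = datetime.date(year + month // 12, month % 12 + 1, 1)
            # date(1, 1, 1).toordinal() == 1, hence the "- 1".
            low = lower.toordinal() - 1
            high = upper.toordinal() - 1
            bounds.append((low, high))
            mids.append(low + (high - low) / 2.0)
    return mids, bounds


mids, bounds = monthly_time_axis(1850, 1852)
print(len(mids), bounds[0], mids[0])
```

The returned lists correspond to `tvals` and the two columns of `tval_bounds` in the script.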

@pfuhe1
Author

pfuhe1 commented Jul 25, 2014

That's it. Sorry about the length of the script.

@doutriaux1
Collaborator

perfect! Thx!

@pfuhe1
Author

pfuhe1 commented Jul 25, 2014

You need to change the lines setting opts['outpath'] and opts['table_path'] for your machine.

Note I am running cmor 2.9.1 with python 2.7.6 and numpy 1.8.0.

I have also just run the script on an old machine I still have access to, which has cmor 2.8.3 installed along with python 2.6 and numpy 1.3.0, and there the memory usage does not increase.

@MartinDix

I use the same system for which Peter reported the memory leak.

More or less by accident I found that it's due to building with a particular copy of the uuid library that was on the machine. Using a new version built from source fixes the leak.

Now testing whether this fixes the intermittent crashes from the full processing.

@doutriaux1
Collaborator

@MartinDix This is great news! Please let us know. I will tweak configure to make sure we use the correct uuid version.

@MartinDix

This wasn't the real problem.

A lucky observation showed that the crashes occurred when writing a 4D file after a 3D file. This allowed creating an example small enough to run in TotalView, which then pointed to the line

  free(cmor_axes[cmor_naxes].wrapping);

at the end of cmor_axis in cmor_axes.c, https://github.com/PCMDI/cmor/blob/master/Src/cmor_axes.c#L1343

The wrapping pointer is only allocated for longitude axes. cmor_axes is an external variable, so it keeps the value of the freed pointer between calls. If the first file has dimensions (T,Y,X), cmor_axes[2].wrapping gets allocated and freed. If the next file then has dimensions (T,Z,Y,X), the cmor_axis call for Y (now axis 2) still sees the old non-null value in cmor_axes[2].wrapping and tries to free it again.

Sometimes this gives the double free error that Peter originally reported. Other times it gives other more or less obscure crashes.
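The sequence described above can be reproduced with a toy model. This is only an illustrative Python sketch, not the cmor source: `ToyHeap`, `define_axis` and `define_file` are hypothetical names, and the toy heap raises an error on a double free, whereas the real C `free` corrupts the heap and may only crash later.

```python
class ToyHeap(object):
    """Stand-in for malloc/free that reports double frees."""

    def __init__(self):
        self.live = set()
        self.next_addr = 0x1000

    def malloc(self):
        addr = self.next_addr
        self.next_addr += 0x10
        self.live.add(addr)
        return addr

    def free(self, addr):
        if addr is None:  # free(NULL) is a no-op in C
            return
        if addr not in self.live:
            raise RuntimeError("double free of 0x%x" % addr)
        self.live.remove(addr)


heap = ToyHeap()
# Module-level table, like the external cmor_axes array: it outlives each file.
axes = [{"wrapping": None} for _ in range(4)]


def define_axis(index, is_longitude):
    if is_longitude:
        axes[index]["wrapping"] = heap.malloc()
    # Freed, but the stale address is left in the slot, as in the report.
    heap.free(axes[index]["wrapping"])


def define_file(dims):
    for index, name in enumerate(dims):
        define_axis(index, is_longitude=(name == "X"))


define_file(["T", "Y", "X"])  # slot 2 is X: allocated, then freed
try:
    define_file(["T", "Z", "Y", "X"])  # slot 2 is now Y, still holds the old address
    outcome = "no error"
except RuntimeError as exc:
    outcome = str(exc)
print(outcome)
```

In this model the second file fails as soon as axis 2 is defined, because the slot still holds the address freed while defining the first file.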

I've created an example script https://gist.github.com/MartinDix/6b2624d620da79c4e9f9

Adding a print

  printf("Wrapping %d %p\n", cmor_naxes,   cmor_axes[cmor_naxes].wrapping);

before the free then gives

% python cmor_testscript.py 
Wrapping 0 (nil)
Wrapping 1 (nil)
Wrapping 2 0x3270b50
writing...
/short/p66/mrd599/CMIP5/output/CMOR-test/CMOR-test/historical/mon/atmos/ts/r1i1p1//ts_Amon_CMOR-test_historical_r1i1p1_185001-185012.nc
Wrapping 0 (nil)
Wrapping 1 (nil)
Wrapping 2 0x3270b50
Wrapping 3 0x32b1a70
writing...
*** glibc detected *** python: corrupted double-linked list: 0x0000000003270b40 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75e76)[0x2b2719f81e76]
/lib64/libc.so.6(+0x79caa)[0x2b2719f85caa]
/lib64/libc.so.6(__libc_malloc+0x71)[0x2b2719f866b1]
/apps/netcdf/4.3.2/lib/libnetcdf.so.7(+0x6a7ce)[0x2b274af017ce]

In this case it's crashed at some point after the actual free, but just where it crashes seems to depend on array sizes, netcdf library versions etc.

I think the fix is to add

  cmor_axes[cmor_naxes].wrapping = NULL;

after the free. This seems to have fixed things here.
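The effect of resetting the pointer can be shown with a small toy model. This is a hedged Python sketch rather than the actual C change: `toy_malloc`, `toy_free` and `define_axis` are hypothetical names, and the toy free raises on a double free instead of corrupting memory.

```python
import itertools

live = set()  # addresses currently allocated in the toy model
_addresses = itertools.count(0x1000, 0x10)
# Persistent table, playing the role of the external cmor_axes array.
axes = [{"wrapping": None} for _ in range(4)]


def toy_malloc():
    addr = next(_addresses)
    live.add(addr)
    return addr


def toy_free(addr):
    if addr is None:  # free(NULL) is a no-op in C
        return
    if addr not in live:
        raise RuntimeError("double free of 0x%x" % addr)
    live.remove(addr)


def define_axis(index, is_longitude):
    slot = axes[index]
    if is_longitude:
        slot["wrapping"] = toy_malloc()
    toy_free(slot["wrapping"])
    slot["wrapping"] = None  # the proposed fix: forget the freed address


# A (T,Y,X) file followed by a (T,Z,Y,X) file, as in the failing case.
for dims in (["T", "Y", "X"], ["T", "Z", "Y", "X"]):
    for index, name in enumerate(dims):
        define_axis(index, is_longitude=(name == "X"))

all_cleared = all(slot["wrapping"] is None for slot in axes)
print(all_cleared, len(live))
```

With the reset in place, both files are processed without an error and no slot carries a stale address into the next file.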

@doutriaux1
Collaborator

wow! Nice catch! Will fix and add your script to the test suite! Thanks!


5 participants