EM Data Model #14

Open
stufisher opened this issue Aug 23, 2017 · 11 comments

Comments

@stufisher

Following up from antolinos/em-model#1, here is the latest EM Data Model that I have put together from Alex's input, DLS Scisoft & EM staff, and EPN EM people:

EM Model

@stufisher
Author

@antolinos I got some clarification on the per-movie nominaldefocus we were discussing. A value is recorded with each movie, but it is a total guess; the real value is determined by the CTF correction. So it is debatable whether we should store it. (It is apparently captured in the xml file.)

@antolinos
Collaborator

Hi @stufisher,

Thanks. We are starting with Scipion and the ISPyB monitors, gathering all the metadata that will be pushed into ISPyB later on.
My feeling today is that some parameters will need to be stored per movie.
As soon as we have a clear and clean data flow we will share it with you.

@olofsvensson
Collaborator

Hi @stufisher and @antolinos,

I have been examining the files produced by our CryoEM, and after discussion with Isai here's a suggested list of meta-data we would like to start uploading to ISPyB after each movie acquisition (i.e. before motion correction):

Common meta-data to all movies:

  • Path to the directory of the movies
  • Paths to the GridSquare JPG, MRC and XML files
  • NumberOffractions (i.e. number of frames per movie)
  • Pixelsize
  • Counting or Super resolution mode

Individual movie meta-data:

  • Filename of the movie file
  • Identifier of the foilhole
  • Date and time of acquisition
  • Sequential index of movie
  • Paths to movie meta-data JPG, MRC and XML files
  • Dose per movie

This list will probably be extended in the future, however, for now it should get us going.

After a quick inspection of the suggested data model I found that these parameters can be stored without any modification:

  • Path to the directory of the movies
  • Paths to the GridSquare JPG, MRC and XML files (via DataCollectionFileAttachment)
  • NumberOffractions (i.e. number of frames per movie)
  • Pixelsize

However, I don't see how these parameters can be fitted:

  • Counting or Super resolution mode
  • Filename of the movie file
  • Identifier of the foilhole
  • Date and time of acquisition
  • Sequential index of movie
  • Paths to movie meta-data JPG, MRC and XML files
  • Dose per movie

Maybe we need to add a new specific "movie" table?

This is just a start of discussion and not a list of requirements written in stone...

@stufisher
Author

stufisher commented Oct 9, 2017

I'm trying to avoid a movie table if at all possible, as it sends us down the same hole as the Image table for MX: we should really avoid saving full paths to jpg, mrc, and xml files. The Image table does not scale well at all, which is why we have abandoned it at DLS (we can assume that long term EM will scale like MX has, so we should think carefully about this now!). We should be able to construct the per-movie jpg, mrc, and xml file names from other variables, as we do for images in MX (DC.fileprefix, DC.imagedirectory, GridImageMap.some sequential number).

Dose per movie is an interesting one. We know the total dose of the whole experiment; can't we just divide through, or is each movie really unique? Do people actually care, given that this will be calculated properly in MotionCorr afterwards?

Sequential index of movie is stored in GridImageMap, and I will add a timestamp in there too, as this is required here as well.

@olofsvensson
Collaborator

Hi @stufisher,
I agree that we should think carefully about this now so that we don't end up in the same situation as with the Image table. The situation, though, is not quite the same:

The file name from a SR data collection can be found via a template and an image number. This is not true for a Cryo-EM movie file name: FoilHole_19150795_Data_19148847_19148848_20170619_2101-0344.mrc. For each movie many parts of the filename change:

  • "FoilHole_": This prefix is always the same for a whole grid square
  • "19150795": This seems to me to be the identifier of the foilhole, as this number is the same for four (or more) consecutive movies.
  • "19148847_19148848": These numbers seem to identify the foilhole, as these numbers are repeated for movies taken in different foilholes.
  • "20170619_2101": Date and time of the acquisition
  • "0344": a sequential index which is increased by one for each new movie.

We can find the date, time and the sequential index from the GridImageMap, but where will we be able to find the other foilhole identifiers? You can argue that we don't need them since we have a unique sequential index, however, this is not true for the corresponding mrc, jpg and xml files:

  • FoilHole_19150795_Data_19148847_19148848_20170619_2101.jpg
  • FoilHole_19150795_Data_19148847_19148848_20170619_2101.mrc
  • FoilHole_19150795_Data_19148847_19148848_20170619_2101.xml
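As a rough illustration of the naming scheme described above, the movie file name can be split into its parts with a regular expression. This is only a sketch: the pattern is inferred from the single example file name in this thread, and the field names (`foilhole`, `id1`, `id2`, `index`) are hypothetical labels, not columns of any existing table.

```python
import re
from dataclasses import dataclass

# Pattern inferred from the one example name quoted in this thread;
# other acquisitions may not follow it.
MOVIE_NAME = re.compile(
    r"^FoilHole_(?P<foilhole>\d+)_Data_(?P<id1>\d+)_(?P<id2>\d+)"
    r"_(?P<date>\d{8})_(?P<time>\d{4})-(?P<index>\d+)\.mrc$"
)


@dataclass(frozen=True)
class MovieName:
    foilhole: int  # first number after "FoilHole_"
    id1: int       # first number after "_Data_"
    id2: int       # second number after "_Data_"
    date: str      # YYYYMMDD
    time: str      # HHMM
    index: int     # sequential movie index


def parse_movie_name(name: str) -> MovieName:
    """Split a movie file name into its components, or raise ValueError."""
    m = MOVIE_NAME.match(name)
    if m is None:
        raise ValueError(f"unrecognised movie file name: {name!r}")
    return MovieName(
        foilhole=int(m["foilhole"]),
        id1=int(m["id1"]),
        id2=int(m["id2"]),
        date=m["date"],
        time=m["time"],
        index=int(m["index"]),
    )


parsed = parse_movie_name(
    "FoilHole_19150795_Data_19148847_19148848_20170619_2101-0344.mrc"
)
print(parsed)
```

Only the date, time and sequential index have an obvious home in GridImageMap; the three remaining numbers are the ones that still need somewhere to live.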

Current data rates from one Cryo-EM instrument are (I guess) about 10-20 movies / minute, while current image rates from one SR data collection are > 1000 images / minute. So the question is whether the data rate from Cryo-EMs is going to increase significantly in the not-too-distant future.

@stufisher
Author

stufisher commented Oct 10, 2017

We could add some other identifier fields to GridImageMap and store the corresponding numbers. These fields would then have a fixed size and be more scalable than a varchar(255).

i.e.

GridImageMap
identifier1 int
identifier2 int

I really want to keep these generic too, and not EM specific if possible.
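A minimal sketch of what the two generic integer columns could look like, using an in-memory SQLite database purely for illustration. Everything except `identifier1` and `identifier2` is an assumption: the other column names, which file-name number goes into which identifier, and the second row (index 345) are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE GridImageMap (
        gridImageMapId   INTEGER PRIMARY KEY,  -- hypothetical key column
        dataCollectionId INTEGER NOT NULL,     -- hypothetical link to the collection
        imageNumber      INTEGER NOT NULL,     -- sequential index of the movie
        identifier1      INTEGER,              -- generic fixed-size identifier
        identifier2      INTEGER               -- generic fixed-size identifier
    )
    """
)

# Illustrative rows: two consecutive movies sharing the same identifiers.
rows = [
    (1, 344, 19150795, 19148847),
    (1, 345, 19150795, 19148847),
]
conn.executemany(
    "INSERT INTO GridImageMap"
    " (dataCollectionId, imageNumber, identifier1, identifier2)"
    " VALUES (?, ?, ?, ?)",
    rows,
)

# All movies that share identifier1, in acquisition order.
same_hole = conn.execute(
    "SELECT imageNumber FROM GridImageMap"
    " WHERE identifier1 = ? ORDER BY imageNumber",
    (19150795,),
).fetchall()
print(same_hole)
```

Note that the example file name carries three numbers besides the date and index, so two identifier columns may not be enough; that is left open here.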

Why the FEI/Gatan? software can't write sane file names is beyond me...

I think we should try to assume nothing. When ISPyB was designed 10 years ago, we didn't expect MX to collect ~1000s of images a second. 1-3 kfps detectors already exist for EM (we make one).

@stufisher
Author

stufisher commented Oct 11, 2017

Following on from yesterday's discussion, I have now added a movie table and deprecated GridImageMap:
em_ispyb_model

@antolinos
Collaborator

Thanks. @olofsvensson and I are still working on webservices and even if most likely this will change I wanted to keep you updated and get your feedback:

So, this is the preliminary structure for the Movie table:
image

We propose to rename movieFullPath to moviePath and add some extra fields.

@antolinos
Collaborator

antolinos commented Oct 23, 2017

Hi @stufisher,

This is how it looks now:
image

Please have a look as there are some changes due to:

  • We wanted to store a value that was not in your schema
  • We did not include fields because we thought that they don't belong to that table or we don't know how to get them (yet)

In both cases we are not sure, so some discussion about that would be appreciated.

There are still a few parameters that belong to the data collection:

  • voltage
  • sphericalAberration
  • amplitudeContrast
  • magnification
  • scannedPixelSize

We are thinking about adding a new specialized table called EMDataCollection to hold these values. It would avoid increasing the number of columns on DataCollection and would make ISPyB more scalable.

@stufisher
Author

  • We did not include fields because we thought that they don't belong to that table or we don't know how to get them (yet)

Please specify these explicitly, the two last points are quite different from each other!

You have undone a lot of my work here. You have renamed a lot of the columns, and I don't understand why. Why not work from our existing schema rather than starting from scratch?

I had conceded and added a table (movie) to store a single varchar(255) per movie; now you have added another 4 columns of the same dimensions to a table that, as we discussed, is going to be heavily populated and may grow exponentially over time. Can we not determine the xml path from the movieFullPath? I had chosen movieFullPath as the name to be consistent with the other tables in ISPyB.
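For the file names quoted earlier in this thread, deriving the xml path from movieFullPath looks feasible. A sketch, assuming the companion jpg/mrc/xml files sit in the same directory as the movie and differ only by dropping the trailing `-NNNN` index; the directory `/data/example` and the helper name `companion_paths` are hypothetical.

```python
import re
from pathlib import PurePosixPath


def companion_paths(movie_full_path: str) -> dict:
    """Derive the per-movie jpg, mrc and xml paths from the movie path.

    Assumes the companions live next to the movie and share its stem
    minus the trailing "-NNNN" sequential index.
    """
    movie = PurePosixPath(movie_full_path)
    stem = re.sub(r"-\d+$", "", movie.stem)  # strip the sequential index
    return {
        ext: str(movie.with_name(f"{stem}.{ext}"))
        for ext in ("jpg", "mrc", "xml")
    }


paths = companion_paths(
    "/data/example/"
    "FoilHole_19150795_Data_19148847_19148848_20170619_2101-0344.mrc"
)
for ext, path in paths.items():
    print(ext, path)
```

If that assumption holds for every acquisition, only one path per movie needs to be stored and the others can be computed on demand.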

As I had previously described:

  • voltage = wavelength => energy of radiation
  • sphericalAberration is a fixed property of the microscope and should be stored in BeamlineSetup as CS (what it's referred to as in EM terms).
  • amplitudeContrast is a function of the CTF (= it is determined by the correction)
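On the first point, voltage and wavelength carry the same information because the electron wavelength follows from the accelerating voltage via the standard relativistic relation. A small sketch of that conversion; the function name is hypothetical and the 200 and 300 kV values are just illustrative inputs, not values taken from this thread.

```python
import math

# CODATA constants in SI units
H = 6.62607015e-34    # Planck constant, J s
M0 = 9.1093837015e-31  # electron rest mass, kg
E = 1.602176634e-19    # elementary charge, C
C = 299792458.0        # speed of light, m/s


def electron_wavelength_m(voltage_v: float) -> float:
    """Relativistic de Broglie wavelength (metres) for a voltage in volts."""
    energy = E * voltage_v  # kinetic energy in joules
    momentum_sq = 2.0 * M0 * energy * (1.0 + energy / (2.0 * M0 * C**2))
    return H / math.sqrt(momentum_sq)


for kv in (200, 300):
    wavelength_pm = electron_wavelength_m(kv * 1e3) * 1e12
    print(f"{kv} kV -> {wavelength_pm:.3f} pm")
```

At 300 kV this gives roughly 1.97 pm, so storing either quantity lets the other be recovered.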

Please can you provide a diff from my schema?

Movie

I'm not sure why you have micrograph or micrograph snapshot in Movie. Does a micrograph even exist at this point? A movie is, as it says, a series of frames; is a micrograph not constructed from these via another process => MotionCorrection? (at least the one people will look at)
dosePerImage = dosePerFrame in MotionCorrection (= duplication?)

MotionCorrection

You have added a log file; please remove it. MotionCorrection links to AutoProcProgram, which links to AutoProcProgramAttachment, where logs should be stored.
timestamp should be removed; it is catered for by AutoProcProgram as well.
Do we really need another varchar(255) for the dose-corrected micrograph?

CTF

You have added a log file; please remove it. CTF links to AutoProcProgram, which links to AutoProcProgramAttachment, where logs should be stored.
timestamp should be removed; it is catered for by AutoProcProgram as well.

I'm not sure why you have removed amplitudeContrast from CTF. It is not a function of the movie/datacollection, it is determined by the CTF correction.

Why have both spectraImage and spectraImageThumbnail? We don't need both. Why rename from fftTheoretical? (It's a [fast] Fourier transform of the micrograph plus the theoretical one from the CTF function.)

MotionCorrectionDrift

This is the data that makes up the driftPlotFullPath. You will probably show the driftPlotFullPath in EXI; I want access to the raw data. Kevin tells me this data is available to Scipion somewhere.

Lots of other columns have changed name, and I don't understand why...

@stufisher
Author

Can we try and pick up converging on this model?
