diff --git a/.github/workflows/qiita-ci.yml b/.github/workflows/qiita-ci.yml index bbc6c25f5..6c3b06be1 100644 --- a/.github/workflows/qiita-ci.yml +++ b/.github/workflows/qiita-ci.yml @@ -154,6 +154,8 @@ jobs: echo "5. Setting up qiita" conda activate qiita + # adapt environment_script for private qiita plugins from travis to github actions. + sed 's#export PATH="/home/travis/miniconda3/bin:$PATH"; source #source /home/runner/.profile; conda #' -i qiita_db/support_files/patches/54.sql qiita-env make --no-load-ontologies qiita-test-install qiita plugins update diff --git a/CHANGELOG.md b/CHANGELOG.md index 1756c7238..24ad9a78a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -13,7 +13,7 @@ Deployed on January 8th, 2024 * Workflow definitions can now use sample or preparation information columns/values to differentiate between them. * Updated the Adapter and host filtering plugin (qp-fastp-minimap2) to v2023.12 addressing a bug in adapter filtering; [more information](https://qiita.ucsd.edu/static/doc/html/processingdata/qp-fastp-minimap2.html). * Other fixes: [3334](https://github.com/qiita-spots/qiita/pull/3334), [3338](https://github.com/qiita-spots/qiita/pull/3338). Thank you @sjanssen2. -* The internal Sequence Processing Pipeline is now using the human pan-genome reference, together with the GRCh38 genome + PhiX and CHM13 genome for human host filtering. +* The internal Sequence Processing Pipeline is now using the human pan-genome reference, together with the GRCh38 genome + PhiX and T2T-CHM13v2.0 genome for human host filtering. Version 2023.10 diff --git a/INSTALL.md b/INSTALL.md index f23e85f38..89b63cabb 100644 --- a/INSTALL.md +++ b/INSTALL.md @@ -162,9 +162,9 @@ Navigate to the cloned directory and ensure your conda environment is active: cd qiita source activate qiita ``` -If you are using Ubuntu or a Windows Subsystem for Linux (WSL), you will need to ensure that you have a C++ compiler and that development libraries and include files for PostgreSQL are available. Type `cc` into your system to ensure that it doesn't result in `program not found`. The following commands will install a C++ compiler and `libpq-dev`: +If you are using Ubuntu or a Windows Subsystem for Linux (WSL), you will need to ensure that you have a C++ compiler and that development libraries and include files for PostgreSQL are available. Type `cc` into your system to ensure that it doesn't result in `program not found`. If you use the the GNU Compiler Collection, make sure to have `gcc` and `g++` available. The following commands will install a C++ compiler and `libpq-dev`: ```bash -sudo apt install gcc # alternatively, you can install clang instead +sudo apt install gcc g++ # alternatively, you can install clang instead sudo apt-get install libpq-dev ``` Install Qiita (this occurs through setuptools' `setup.py` file in the qiita directory): @@ -178,7 +178,7 @@ At this point, Qiita will be installed and the system will start. However, you will need to install plugins in order to process any kind of data. For a list of available plugins, visit the [Qiita Spots](https://github.com/qiita-spots) github organization. Each of the plugins have their own installation instructions, we -suggest looking at each individual .travis.yml file to see detailed installation +suggest looking at each individual .github/workflows/qiita-plugin-ci.yml file to see detailed installation instructions. 
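As a side note on the compiler requirement in the INSTALL.md hunk above: before running `setup.py` it can save time to confirm that both a C and a C++ compiler are actually on the `PATH`. This is an illustrative sketch only, not part of the patch; `clang++` is listed here as the usual counterpart of the `clang` alternative the apt comment mentions.

```python
from shutil import which

# Compilers named in the INSTALL.md snippet above; clang/clang++ cover the
# "alternatively, you can install clang instead" case.
checks = {'C compiler': ('cc', 'gcc', 'clang'),
          'C++ compiler': ('g++', 'clang++')}

for label, names in checks.items():
    found = [n for n in names if which(n)]
    if found:
        print(f'{label}: found {found[0]} at {which(found[0])}')
    else:
        print(f'{label}: none of {names} found; install one before running setup.py')
```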
Note that the most common plugins are: - [qtp-biom](https://github.com/qiita-spots/qtp-biom) - [qtp-sequencing](https://github.com/qiita-spots/qtp-sequencing) @@ -224,15 +224,15 @@ export REDBIOM_HOST=http://my_host.com:7379 ## Configure NGINX and supervisor -(NGINX)[https://www.nginx.com/] is not a requirement for Qiita development but it's highly recommended for deploys as this will allow us -to have multiple workers. Note that we are already installing (NGINX)[https://www.nginx.com/] within the Qiita conda environment; also, -that Qiita comes with an example (NGINX)[https://www.nginx.com/] config file: `qiita_pet/nginx_example.conf`, which is used in the Travis builds. +[NGINX](https://www.nginx.com/) is not a requirement for Qiita development but it's highly recommended for deploys as this will allow us +to have multiple workers. Note that we are already installing [NGINX](https://www.nginx.com/) within the Qiita conda environment; also, +that Qiita comes with an example [NGINX](https://www.nginx.com/) config file: `qiita_pet/nginx_example.conf`, which is used in the Travis builds. -Now, (supervisor)[https://github.com/Supervisor/supervisor] will allow us to start all the workers we want based on its configuration file; and we -need that both the (NGINX)[https://www.nginx.com/] and (supervisor)[https://github.com/Supervisor/supervisor] config files to match. For our Travis +Now, [supervisor](https://github.com/Supervisor/supervisor) will allow us to start all the workers we want based on its configuration file; and we +need that both the [NGINX](https://www.nginx.com/) and [supervisor](https://github.com/Supervisor/supervisor) config files to match. For our Travis testing we are creating 3 workers: 21174 for master and 21175-6 as a regular workers. -If you are using (NGINX)[https://www.nginx.com/] via conda, you are going to need to create the NGINX folder within the environment; thus run: +If you are using [NGINX](https://www.nginx.com/) via conda, you are going to need to create the NGINX folder within the environment; thus run: ```bash mkdir -p ${CONDA_PREFIX}/var/run/nginx/ @@ -256,7 +256,7 @@ Start the qiita server: qiita pet webserver start ``` -If all the above commands executed correctly, you should be able to access Qiita by going in your browser to https://localhost:21174 if you are not using NGINX, or https://localhost:8383 if you are using NGINX, to login use `test@foo.bar` and `password` as the credentials. (In the future, we will have a *single user mode* that will allow you to use a local Qiita server without logging in. You can track progress on this on issue [#920](https://github.com/biocore/qiita/issues/920).) +If all the above commands executed correctly, you should be able to access Qiita by going in your browser to https://localhost:21174 if you are not using NGINX, or https://localhost:8383 if you are using NGINX, to login use `test@foo.bar` and `password` as the credentials. (Login as `admin@foo.bar` with `password` to see admin functionality. In the future, we will have a *single user mode* that will allow you to use a local Qiita server without logging in. You can track progress on this on issue [#920](https://github.com/biocore/qiita/issues/920).) 
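Once supervisor and (optionally) NGINX are running, a quick way to confirm the workers are answering is to hit the ports mentioned above. The following is an illustrative check only (not from this patch); it assumes the ports from the example configs and that the development server uses a self-signed certificate, which is why verification is disabled for this probe.

```python
import ssl
from urllib.request import urlopen
from urllib.error import URLError

# Ports from the text above: 21174 without NGINX, 8383 behind the example
# NGINX config. Verification is turned off because dev certs are usually
# self-signed; do not do this against a production deployment.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

for url in ('https://localhost:21174/', 'https://localhost:8383/'):
    try:
        with urlopen(url, context=ctx, timeout=5) as resp:
            print(f'{url} -> HTTP {resp.status}')
    except (URLError, OSError) as exc:
        print(f'{url} -> not reachable ({exc})')
```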
diff --git a/notebooks/resource-allocation/generate-allocation-summary-arrays.py b/notebooks/resource-allocation/generate-allocation-summary-arrays.py
new file mode 100644
index 000000000..8bca68742
--- /dev/null
+++ b/notebooks/resource-allocation/generate-allocation-summary-arrays.py
@@ -0,0 +1,239 @@
+from qiita_core.util import MaxRSS_helper
+from qiita_db.software import Software
+import datetime
+from io import StringIO
+from subprocess import check_output
+import pandas as pd
+from os.path import join
+
+# This is an example script to collect the data we need from SLURM; the plan
+# is that in the near future we will clean these up, add them to Qiita's main
+# code and then have cronjobs to run them.
+
+# at the time of writing we have:
+# qp-spades spades
+# (*) qp-woltka Woltka v0.1.4
+# qp-woltka SynDNA Woltka
+# qp-woltka Calculate Cell Counts
+# (*) qp-meta Sortmerna v2.1b
+# (*) qp-fastp-minimap2 Adapter and host filtering v2023.12
+# ... and the admin plugin
+# (*) qp-klp
+# Here we are only going to create summaries for (*)
+
+
+sacct = ['sacct', '-p',
+         '--format=JobName,JobID,ElapsedRaw,MaxRSS,ReqMem', '-j']
+# for the non admin jobs, we will use jobs from the last six months
+six_months = datetime.date.today() - datetime.timedelta(weeks=6*4)
+
+print('The current "software - commands" that use job-arrays are:')
+for s in Software.iter():
+    if 'ENVIRONMENT="' in s.environment_script:
+        for c in s.commands:
+            print(f"{s.name} - {c.name}")
+
+# 1. Command: woltka
+
+fn = join('/panfs', 'qiita', 'jobs_woltka.tsv.gz')
+print(f"Generating the summary for the woltka jobs: {fn}.")
+
+cmds = [c for s in Software.iter(False)
+        if 'woltka' in s.name for c in s.commands]
+jobs = [j for c in cmds for j in c.processing_jobs if j.status == 'success' and
+        j.heartbeat.date() > six_months and j.input_artifacts]
+
+data = []
+for j in jobs:
+    size = sum([fp['fp_size'] for fp in j.input_artifacts[0].filepaths])
+    jid, mjid = j.external_id.strip().split()
+    rvals = StringIO(check_output(sacct + [jid]).decode('ascii'))
+    _d = pd.read_csv(rvals, sep='|')
+    jmem = _d.MaxRSS.apply(lambda x: x if type(x) is not str
+                           else MaxRSS_helper(x)).max()
+    jwt = _d.ElapsedRaw.max()
+
+    rvals = StringIO(check_output(sacct + [mjid]).decode('ascii'))
+    _d = pd.read_csv(rvals, sep='|')
+    mmem = _d.MaxRSS.apply(lambda x: x if type(x) is not str
+                           else MaxRSS_helper(x)).max()
+    mwt = _d.ElapsedRaw.max()
+
+    data.append({
+        'jid': j.id, 'sjid': jid, 'mem': jmem, 'wt': jwt, 'type': 'main',
+        'db': j.parameters.values['Database'].split('/')[-1]})
+    data.append(
+        {'jid': j.id, 'sjid': mjid, 'mem': mmem, 'wt': mwt, 'type': 'merge',
+         'db': j.parameters.values['Database'].split('/')[-1]})
+df = pd.DataFrame(data)
+df.to_csv(fn, sep='\t', index=False)
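This block, and each per-command block below, follows the same pattern: call `sacct -p`, read the pipe-delimited output with pandas, normalize `MaxRSS`, and keep the peak memory and wall time. As a self-contained illustration of that pattern (not part of the patch), the sketch below runs on fabricated `sacct` output; `maxrss_to_bytes` mirrors, but is not, the `MaxRSS_helper` added to `qiita_core/util.py` further down in this diff.

```python
from io import StringIO
import pandas as pd

# Hypothetical `sacct -p` output, trimmed to the columns the script requests.
fake_sacct = """JobName|JobID|ElapsedRaw|MaxRSS|ReqMem
woltka_1234|1234|5400|12.5G|16G
woltka_1234.batch|1234.batch|5400|980M|16G
"""


def maxrss_to_bytes(value):
    # Same idea as MaxRSS_helper: strip a K/M/G suffix and scale to bytes.
    scale = {'K': 1e3, 'M': 1e6, 'G': 1e9}
    value = str(value)
    if value and value[-1] in scale:
        return float(value[:-1]) * scale[value[-1]]
    return float(value)


df = pd.read_csv(StringIO(fake_sacct), sep='|')
print('peak MaxRSS (bytes):', df.MaxRSS.apply(maxrss_to_bytes).max())
print('wall time (s):', df.ElapsedRaw.max())
```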
+
+# 2. qp-meta Sortmerna
+
+fn = join('/panfs', 'qiita', 'jobs_sortmerna.tsv.gz')
+print(f"Generating the summary for the Sortmerna jobs: {fn}.")
+
+# for Sortmerna we will only use jobs from the last 6 months
+cmds = [c for s in Software.iter(False)
+        if 'meta' in s.name.lower() for c in s.commands]
+jobs = [j for c in cmds if 'sortmerna' in c.name.lower()
+        for j in c.processing_jobs if j.status == 'success' and
+        j.heartbeat.date() > six_months and j.input_artifacts]
+
+data = []
+for j in jobs:
+    size = sum([fp['fp_size'] for fp in j.input_artifacts[0].filepaths])
+    jid, mjid = j.external_id.strip().split()
+    rvals = StringIO(check_output(sacct + [jid]).decode('ascii'))
+    _d = pd.read_csv(rvals, sep='|')
+    jmem = _d.MaxRSS.apply(lambda x: x if type(x) is not str
+                           else MaxRSS_helper(x)).max()
+    jwt = _d.ElapsedRaw.max()
+
+    rvals = StringIO(check_output(sacct + [mjid]).decode('ascii'))
+    _d = pd.read_csv(rvals, sep='|')
+    mmem = _d.MaxRSS.apply(lambda x: x if type(x) is not str
+                           else MaxRSS_helper(x)).max()
+    mwt = _d.ElapsedRaw.max()
+
+    data.append({
+        'jid': j.id, 'sjid': jid, 'mem': jmem, 'wt': jwt, 'type': 'main'})
+    data.append(
+        {'jid': j.id, 'sjid': mjid, 'mem': mmem, 'wt': mwt, 'type': 'merge'})
+df = pd.DataFrame(data)
+df.to_csv(fn, sep='\t', index=False)
+
+
+# 3. Adapter and host filtering. Note that there is a new version deployed in
+# Jan 2024 so the current results will not be the most accurate
+
+fn = join('/panfs', 'qiita', 'jobs_adapter_host.tsv.gz')
+print(f"Generating the summary for the adapter/host filtering jobs: {fn}.")
+
+# for adapter/host filtering we will only use jobs from the last 6 months
+cmds = [c for s in Software.iter(False)
+        if 'minimap2' in s.name.lower() for c in s.commands]
+jobs = [j for c in cmds for j in c.processing_jobs if j.status == 'success' and
+        j.heartbeat.date() > six_months and j.input_artifacts]
+
+data = []
+for j in jobs:
+    size = sum([fp['fp_size'] for fp in j.input_artifacts[0].filepaths])
+    jid, mjid = j.external_id.strip().split()
+    rvals = StringIO(check_output(sacct + [jid]).decode('ascii'))
+    _d = pd.read_csv(rvals, sep='|')
+    jmem = _d.MaxRSS.apply(lambda x: x if type(x) is not str
+                           else MaxRSS_helper(x)).max()
+    jwt = _d.ElapsedRaw.max()
+
+    rvals = StringIO(check_output(sacct + [mjid]).decode('ascii'))
+    _d = pd.read_csv(rvals, sep='|')
+    mmem = _d.MaxRSS.apply(lambda x: x if type(x) is not str
+                           else MaxRSS_helper(x)).max()
+    mwt = _d.ElapsedRaw.max()
+
+    data.append({
+        'jid': j.id, 'sjid': jid, 'mem': jmem, 'wt': jwt, 'type': 'main'})
+    data.append(
+        {'jid': j.id, 'sjid': mjid, 'mem': mmem, 'wt': mwt, 'type': 'merge'})
+df = pd.DataFrame(data)
+df.to_csv(fn, sep='\t', index=False)
+
+
+# 4. The SPP!
+ +fn = join('/panfs', 'qiita', 'jobs_spp.tsv.gz') +print(f"Generating the summary for the SPP jobs: {fn}.") + +# for the SPP we will look at jobs from the last year +year = datetime.date.today() - datetime.timedelta(days=365) +cmds = [c for s in Software.iter(False) + if s.name == 'qp-klp' for c in s.commands] +jobs = [j for c in cmds for j in c.processing_jobs if j.status == 'success' and + j.heartbeat.date() > year] + +# for the SPP we need to find the jobs that were actually run, this means +# looping throught the existing slurm jobs and finding them +max_inter = 2000 + +data = [] +for job in jobs: + jei = int(job.external_id) + rvals = StringIO( + check_output(sacct + [str(jei)]).decode('ascii')) + _d = pd.read_csv(rvals, sep='|') + mem = _d.MaxRSS.apply( + lambda x: x if type(x) is not str else MaxRSS_helper(x)).max() + wt = _d.ElapsedRaw.max() + # the current "easy" way to determine if amplicon or other is to check + # the file extension of the filename + stype = 'other' + if job.parameters.values['sample_sheet']['filename'].endswith('.txt'): + stype = 'amplicon' + rid = job.parameters.values['run_identifier'] + data.append( + {'jid': job.id, 'sjid': jei, 'mem': mem, 'stype': stype, 'wt': wt, + 'type': 'main', 'rid': rid, 'name': _d.JobName[0]}) + + # let's look for the convert job + for jid in range(jei + 1, jei + max_inter): + rvals = StringIO(check_output(sacct + [str(jid)]).decode('ascii')) + _d = pd.read_csv(rvals, sep='|') + if [1 for x in _d.JobName.values if x.startswith(job.id)]: + cjid = int(_d.JobID[0]) + mem = _d.MaxRSS.apply( + lambda x: x if type(x) is not str else MaxRSS_helper(x)).max() + wt = _d.ElapsedRaw.max() + + data.append( + {'jid': job.id, 'sjid': cjid, 'mem': mem, 'stype': stype, + 'wt': wt, 'type': 'convert', 'rid': rid, + 'name': _d.JobName[0]}) + + # now let's look for the next step, if amplicon that's fastqc but + # if other that's qc/nuqc + for jid in range(cjid + 1, cjid + max_inter): + rvals = StringIO( + check_output(sacct + [str(jid)]).decode('ascii')) + _d = pd.read_csv(rvals, sep='|') + if [1 for x in _d.JobName.values if x.startswith(job.id)]: + qc_jid = _d.JobIDRaw.apply( + lambda x: int(x.split('.')[0])).max() + qcmem = _d.MaxRSS.apply( + lambda x: x if type(x) is not str + else MaxRSS_helper(x)).max() + qcwt = _d.ElapsedRaw.max() + + if stype == 'amplicon': + data.append( + {'jid': job.id, 'sjid': qc_jid, 'mem': qcmem, + 'stype': stype, 'wt': qcwt, 'type': 'fastqc', + 'rid': rid, 'name': _d.JobName[0]}) + else: + data.append( + {'jid': job.id, 'sjid': qc_jid, 'mem': qcmem, + 'stype': stype, 'wt': qcwt, 'type': 'qc', + 'rid': rid, 'name': _d.JobName[0]}) + for jid in range(qc_jid + 1, qc_jid + max_inter): + rvals = StringIO(check_output( + sacct + [str(jid)]).decode('ascii')) + _d = pd.read_csv(rvals, sep='|') + if [1 for x in _d.JobName.values if x.startswith( + job.id)]: + fqc_jid = _d.JobIDRaw.apply( + lambda x: int(x.split('.')[0])).max() + fqcmem = _d.MaxRSS.apply( + lambda x: x if type(x) is not str + else MaxRSS_helper(x)).max() + fqcwt = _d.ElapsedRaw.max() + data.append( + {'jid': job.id, 'sjid': fqc_jid, + 'mem': fqcmem, 'stype': stype, + 'wt': fqcwt, 'type': 'fastqc', + 'rid': rid, 'name': _d.JobName[0]}) + break + break + break + +df = pd.DataFrame(data) +df.to_csv(fn, sep='\t', index=False) diff --git a/notebooks/resource-allocation/generate-allocation-summary.py b/notebooks/resource-allocation/generate-allocation-summary.py index e081a5d12..7c8634293 100644 --- a/notebooks/resource-allocation/generate-allocation-summary.py +++ 
b/notebooks/resource-allocation/generate-allocation-summary.py @@ -5,6 +5,7 @@ from json import loads from os.path import join +from qiita_core.util import MaxRSS_helper from qiita_db.exceptions import QiitaDBUnknownIDError from qiita_db.processing_job import ProcessingJob from qiita_db.software import Software @@ -117,19 +118,8 @@ print('Make sure that only 0/K/M exist', set( df.MaxRSS.apply(lambda x: str(x)[-1]))) - -def _helper(x): - if x[-1] == 'K': - y = float(x[:-1]) * 1000 - elif x[-1] == 'M': - y = float(x[:-1]) * 1000000 - else: - y = float(x) - return y - - # Generating new columns -df['MaxRSSRaw'] = df.MaxRSS.apply(lambda x: _helper(str(x))) +df['MaxRSSRaw'] = df.MaxRSS.apply(lambda x: MaxRSS_helper(str(x))) df['ElapsedRawTime'] = df.ElapsedRaw.apply( lambda x: timedelta(seconds=float(x))) diff --git a/qiita_core/tests/test_util.py b/qiita_core/tests/test_util.py index a3fef6942..dd33e902c 100644 --- a/qiita_core/tests/test_util.py +++ b/qiita_core/tests/test_util.py @@ -10,7 +10,7 @@ from qiita_core.util import ( qiita_test_checker, execute_as_transaction, get_qiita_version, - is_test_environment, get_release_info) + is_test_environment, get_release_info, MaxRSS_helper) from qiita_db.meta_util import ( generate_biom_and_metadata_release, generate_plugin_releases) import qiita_db as qdb @@ -82,6 +82,20 @@ def test_get_release_info(self): self.assertEqual(biom_metadata_release, ('', '', '')) self.assertNotEqual(archive_release, ('', '', '')) + def test_MaxRSS_helper(self): + tests = [ + ('6', 6.0), + ('6K', 6000), + ('6M', 6000000), + ('6G', 6000000000), + ('6.9', 6.9), + ('6.9K', 6900), + ('6.9M', 6900000), + ('6.9G', 6900000000), + ] + for x, y in tests: + self.assertEqual(MaxRSS_helper(x), y) + if __name__ == '__main__': main() diff --git a/qiita_core/util.py b/qiita_core/util.py index b3f6b4142..9692f210b 100644 --- a/qiita_core/util.py +++ b/qiita_core/util.py @@ -151,3 +151,15 @@ def get_release_info(study_status='public'): archive_release = ((md5sum, filepath, timestamp)) return (biom_metadata_release, archive_release) + + +def MaxRSS_helper(x): + if x[-1] == 'K': + y = float(x[:-1]) * 1000 + elif x[-1] == 'M': + y = float(x[:-1]) * 1000000 + elif x[-1] == 'G': + y = float(x[:-1]) * 1000000000 + else: + y = float(x) + return y diff --git a/qiita_db/artifact.py b/qiita_db/artifact.py index 2604ee6ef..e4ac92a34 100644 --- a/qiita_db/artifact.py +++ b/qiita_db/artifact.py @@ -1277,10 +1277,8 @@ def _add_edge(edges, src, dest): qdb.sql_connection.TRN.add(sql, [self.id]) sql_edges = qdb.sql_connection.TRN.execute_fetchindex() - lineage = nx.DiGraph() - edges = set() - nodes = {} - if sql_edges: + # helper function to reduce code duplication + def _helper(sql_edges, edges, nodes): for jid, pid, cid in sql_edges: if jid not in nodes: nodes[jid] = ('job', @@ -1291,9 +1289,29 @@ def _add_edge(edges, src, dest): nodes[cid] = ('artifact', qdb.artifact.Artifact(cid)) edges.add((nodes[pid], nodes[jid])) edges.add((nodes[jid], nodes[cid])) + + lineage = nx.DiGraph() + edges = set() + nodes = dict() + extra_edges = set() + extra_nodes = dict() + if sql_edges: + _helper(sql_edges, edges, nodes) else: nodes[self.id] = ('artifact', self) lineage.add_node(nodes[self.id]) + # if this is an Analysis we need to check if there are extra + # edges/nodes as there is a chance that there are connecions + # between them + if self.analysis is not None: + roots = [a for a in self.analysis.artifacts + if not a.parents and a != self] + for r in roots: + # add the root to the options then their children + 
extra_nodes[r.id] = ('artifact', r) + qdb.sql_connection.TRN.add(sql, [r.id]) + sql_edges = qdb.sql_connection.TRN.execute_fetchindex() + _helper(sql_edges, extra_edges, extra_nodes) # The code above returns all the jobs that have been successfully # executed. We need to add all the jobs that are in all the other @@ -1329,8 +1347,10 @@ def _add_edge(edges, src, dest): # need to check both the input_artifacts and the # pending properties for in_art in n_obj.input_artifacts: - _add_edge(edges, nodes[in_art.id], - nodes[n_obj.id]) + iid = in_art.id + if iid not in nodes and iid in extra_nodes: + nodes[iid] = extra_nodes[iid] + _add_edge(edges, nodes[iid], nodes[n_obj.id]) pending = n_obj.pending for pred_id in pending: diff --git a/qiita_db/metadata_template/prep_template.py b/qiita_db/metadata_template/prep_template.py index f39aaacb7..d05493d3f 100644 --- a/qiita_db/metadata_template/prep_template.py +++ b/qiita_db/metadata_template/prep_template.py @@ -793,6 +793,7 @@ def _get_node_info(workflow, node): def _get_predecessors(workflow, node): # recursive method to get predecessors of a given node pred = [] + for pnode in workflow.graph.predecessors(node): pred = _get_predecessors(workflow, pnode) cxns = {x[0]: x[2] @@ -864,7 +865,8 @@ def _get_predecessors(workflow, node): if wk_params['sample']: df = ST(self.study_id).to_dataframe(samples=list(self)) for k, v in wk_params['sample'].items(): - if k not in df.columns or v not in df[k].unique(): + if k not in df.columns or (v != '*' and v not in + df[k].unique()): reqs_satisfied = False else: total_conditions_satisfied += 1 @@ -872,7 +874,8 @@ def _get_predecessors(workflow, node): if wk_params['prep']: df = self.to_dataframe() for k, v in wk_params['prep'].items(): - if k not in df.columns or v not in df[k].unique(): + if k not in df.columns or (v != '*' and v not in + df[k].unique()): reqs_satisfied = False else: total_conditions_satisfied += 1 @@ -890,117 +893,112 @@ def _get_predecessors(workflow, node): # let's just keep one, let's give it preference to the one with the # most total_conditions_satisfied - workflows = sorted(workflows, key=lambda x: x[0], reverse=True)[:1] + _, wk = sorted(workflows, key=lambda x: x[0], reverse=True)[0] missing_artifacts = dict() - for _, wk in workflows: - missing_artifacts[wk] = dict() - for node, degree in wk.graph.out_degree(): - if degree != 0: - continue - mscheme = _get_node_info(wk, node) - if mscheme not in merging_schemes: - missing_artifacts[wk][mscheme] = node - if not missing_artifacts[wk]: - del missing_artifacts[wk] + for node, degree in wk.graph.out_degree(): + if degree != 0: + continue + mscheme = _get_node_info(wk, node) + if mscheme not in merging_schemes: + missing_artifacts[mscheme] = node if not missing_artifacts: # raises option b. raise ValueError('This preparation is complete') # 3. 
- for wk, wk_data in missing_artifacts.items(): - previous_jobs = dict() - for ma, node in wk_data.items(): - predecessors = _get_predecessors(wk, node) - predecessors.reverse() - cmds_to_create = [] - init_artifacts = None - for i, (pnode, cnode, cxns) in enumerate(predecessors): - cdp = cnode.default_parameter - cdp_cmd = cdp.command - params = cdp.values.copy() - - icxns = {y: x for x, y in cxns.items()} - reqp = {x: icxns[y[1][0]] - for x, y in cdp_cmd.required_parameters.items()} - cmds_to_create.append([cdp_cmd, params, reqp]) - - info = _get_node_info(wk, pnode) - if info in merging_schemes: - if set(merging_schemes[info]) >= set(cxns): - init_artifacts = merging_schemes[info] - break - if init_artifacts is None: - pdp = pnode.default_parameter - pdp_cmd = pdp.command - params = pdp.values.copy() - # verifying that the workflow.artifact_type is included - # in the command input types or raise an error - wkartifact_type = wk.artifact_type - reqp = dict() - for x, y in pdp_cmd.required_parameters.items(): - if wkartifact_type not in y[1]: - raise ValueError(f'{wkartifact_type} is not part ' - 'of this preparation and cannot ' - 'be applied') - reqp[x] = wkartifact_type - - cmds_to_create.append([pdp_cmd, params, reqp]) - - if starting_job is not None: - init_artifacts = { - wkartifact_type: f'{starting_job.id}:'} - else: - init_artifacts = {wkartifact_type: self.artifact.id} - - cmds_to_create.reverse() - current_job = None - loop_starting_job = starting_job - for i, (cmd, params, rp) in enumerate(cmds_to_create): - if loop_starting_job is not None: - previous_job = loop_starting_job - loop_starting_job = None - else: - previous_job = current_job - if previous_job is None: - req_params = dict() - for iname, dname in rp.items(): - if dname not in init_artifacts: - msg = (f'Missing Artifact type: "{dname}" in ' - 'this preparation; this might be due ' - 'to missing steps or not having the ' - 'correct raw data.') - # raises option c. 
- raise ValueError(msg) - req_params[iname] = init_artifacts[dname] - else: - req_params = dict() - connections = dict() - for iname, dname in rp.items(): - req_params[iname] = f'{previous_job.id}{dname}' - connections[dname] = iname - params.update(req_params) - job_params = qdb.software.Parameters.load( - cmd, values_dict=params) - - if params in previous_jobs.values(): - for x, y in previous_jobs.items(): - if params == y: - current_job = x + previous_jobs = dict() + for ma, node in missing_artifacts.items(): + predecessors = _get_predecessors(wk, node) + predecessors.reverse() + cmds_to_create = [] + init_artifacts = None + for i, (pnode, cnode, cxns) in enumerate(predecessors): + cdp = cnode.default_parameter + cdp_cmd = cdp.command + params = cdp.values.copy() + + icxns = {y: x for x, y in cxns.items()} + reqp = {x: icxns[y[1][0]] + for x, y in cdp_cmd.required_parameters.items()} + cmds_to_create.append([cdp_cmd, params, reqp]) + + info = _get_node_info(wk, pnode) + if info in merging_schemes: + if set(merging_schemes[info]) >= set(cxns): + init_artifacts = merging_schemes[info] + break + if init_artifacts is None: + pdp = pnode.default_parameter + pdp_cmd = pdp.command + params = pdp.values.copy() + # verifying that the workflow.artifact_type is included + # in the command input types or raise an error + wkartifact_type = wk.artifact_type + reqp = dict() + for x, y in pdp_cmd.required_parameters.items(): + if wkartifact_type not in y[1]: + raise ValueError(f'{wkartifact_type} is not part ' + 'of this preparation and cannot ' + 'be applied') + reqp[x] = wkartifact_type + + cmds_to_create.append([pdp_cmd, params, reqp]) + + if starting_job is not None: + init_artifacts = { + wkartifact_type: f'{starting_job.id}:'} + else: + init_artifacts = {wkartifact_type: self.artifact.id} + + cmds_to_create.reverse() + current_job = None + loop_starting_job = starting_job + for i, (cmd, params, rp) in enumerate(cmds_to_create): + if loop_starting_job is not None: + previous_job = loop_starting_job + loop_starting_job = None + else: + previous_job = current_job + if previous_job is None: + req_params = dict() + for iname, dname in rp.items(): + if dname not in init_artifacts: + msg = (f'Missing Artifact type: "{dname}" in ' + 'this preparation; this might be due ' + 'to missing steps or not having the ' + 'correct raw data.') + # raises option c. 
+ raise ValueError(msg) + req_params[iname] = init_artifacts[dname] + else: + req_params = dict() + connections = dict() + for iname, dname in rp.items(): + req_params[iname] = f'{previous_job.id}{dname}' + connections[dname] = iname + params.update(req_params) + job_params = qdb.software.Parameters.load( + cmd, values_dict=params) + + if params in previous_jobs.values(): + for x, y in previous_jobs.items(): + if params == y: + current_job = x + else: + if workflow is None: + PW = qdb.processing_job.ProcessingWorkflow + workflow = PW.from_scratch(user, job_params) + current_job = [ + j for j in workflow.graph.nodes()][0] else: - if workflow is None: - PW = qdb.processing_job.ProcessingWorkflow - workflow = PW.from_scratch(user, job_params) - current_job = [ - j for j in workflow.graph.nodes()][0] + if previous_job is None: + current_job = workflow.add( + job_params, req_params=req_params) else: - if previous_job is None: - current_job = workflow.add( - job_params, req_params=req_params) - else: - current_job = workflow.add( - job_params, req_params=req_params, - connections={previous_job: connections}) - previous_jobs[current_job] = params + current_job = workflow.add( + job_params, req_params=req_params, + connections={previous_job: connections}) + previous_jobs[current_job] = params return workflow diff --git a/qiita_db/processing_job.py b/qiita_db/processing_job.py index 9c9aa1186..a1f7e5baa 100644 --- a/qiita_db/processing_job.py +++ b/qiita_db/processing_job.py @@ -1020,7 +1020,7 @@ def submit(self, parent_job_id=None, dependent_jobs_list=None): # names to know if it should be executed differently and the # plugin should let Qiita know that a specific command should be ran # as job array or not - cnames_to_skip = {'Calculate Cell Counts'} + cnames_to_skip = {'Calculate Cell Counts', 'Calculate RNA Copy Counts'} if 'ENVIRONMENT' in plugin_env_script and cname not in cnames_to_skip: # the job has to be in running state so the plugin can change its` # status @@ -2392,6 +2392,20 @@ def add(self, dflt_params, connections=None, req_params=None, with qdb.sql_connection.TRN: self._raise_if_not_in_construction() + # checking that the new number of artifacts is not above + # max_artifacts_in_workflow + current_artifacts = sum( + [len(j.command.outputs) for j in self.graph.nodes()]) + to_add_artifacts = len(dflt_params.command.outputs) + total_artifacts = current_artifacts + to_add_artifacts + max_artifacts = qdb.util.max_artifacts_in_workflow() + if total_artifacts > max_artifacts: + raise ValueError( + "Cannot add new job because it will create more " + f"artifacts (current: {current_artifacts} + new: " + f"{to_add_artifacts} = {total_artifacts}) that what is " + f"allowed in a single workflow ({max_artifacts})") + if connections: # The new Job depends on previous jobs in the workflow req_params = req_params if req_params else {} diff --git a/qiita_db/study.py b/qiita_db/study.py index 367b846a1..3c4989f3b 100644 --- a/qiita_db/study.py +++ b/qiita_db/study.py @@ -1175,17 +1175,37 @@ def has_access(self, user, no_public=False): Whether user has access to study or not """ with qdb.sql_connection.TRN: - # if admin or superuser, just return true + # return True if the user is one of the admins if user.level in {'superuser', 'admin'}: return True - if no_public: - study_set = user.user_studies | user.shared_studies - else: - study_set = user.user_studies | user.shared_studies | \ - self.get_by_status('public') + # if no_public is False then just check if the study is public + # and return True + 
if not no_public and self.status == 'public': + return True + + # let's check if the study belongs to this user or has been + # shared with them + sql = """SELECT EXISTS ( + SELECT study_id + FROM qiita.study + JOIN qiita.study_portal USING (study_id) + JOIN qiita.portal_type USING (portal_type_id) + WHERE email = %s AND portal = %s AND study_id = %s + UNION + SELECT study_id + FROM qiita.study_users + JOIN qiita.study_portal USING (study_id) + JOIN qiita.portal_type USING (portal_type_id) + WHERE email = %s AND portal = %s AND study_id = %s + ) + """ + qdb.sql_connection.TRN.add( + sql, [user.email, qiita_config.portal, self.id, + user.email, qiita_config.portal, self.id]) + result = qdb.sql_connection.TRN.execute_fetchlast() - return self in study_set + return result def can_edit(self, user): """Returns whether the given user can edit the study diff --git a/qiita_db/support_files/patches/90.sql b/qiita_db/support_files/patches/90.sql new file mode 100644 index 000000000..a0b5d58c9 --- /dev/null +++ b/qiita_db/support_files/patches/90.sql @@ -0,0 +1,6 @@ +-- Jan 9, 2024 +-- add control of max artifacts in analysis to the settings +-- using 35 as default considering that a core div creates ~17 so allowing +-- for 2 of those + 1 +ALTER TABLE settings + ADD COLUMN IF NOT EXISTS max_artifacts_in_workflow INT DEFAULT 35; diff --git a/qiita_db/test/test_artifact.py b/qiita_db/test/test_artifact.py index b9ba78149..5cd425e23 100644 --- a/qiita_db/test/test_artifact.py +++ b/qiita_db/test/test_artifact.py @@ -1406,6 +1406,42 @@ def test_has_human(self): self.assertTrue(artifact.has_human) + def test_descendants_with_jobs(self): + # let's tests that we can connect two artifacts with different root + # in the same analysis + # 1. make sure there are 3 nodes + a = qdb.artifact.Artifact(8) + self.assertEqual(len(a.descendants_with_jobs.nodes), 3) + self.assertEqual(len(a.analysis.artifacts), 2) + # 2. add a new root and make sure we see it + c = qdb.artifact.Artifact.create( + self.filepaths_root, "BIOM", analysis=a.analysis, + data_type="16S") + self.assertEqual(len(a.analysis.artifacts), 3) + # 3. 
add jobs conencting the new artifact to the other root + # a -> job -> b + # c + # job1 connects b & c + # job2 connects a & c + cmd = qdb.software.Command.create( + qdb.software.Software(1), + "CommandWithMultipleInputs", "", { + 'input_b': ['artifact:["BIOM"]', None], + 'input_c': ['artifact:["BIOM"]', None]}, {'out': 'BIOM'}) + params = qdb.software.Parameters.load( + cmd, values_dict={'input_b': a.children[0].id, 'input_c': c.id}) + job1 = qdb.processing_job.ProcessingJob.create( + qdb.user.User('test@foo.bar'), params) + params = qdb.software.Parameters.load( + cmd, values_dict={'input_b': a.id, 'input_c': c.id}) + job2 = qdb.processing_job.ProcessingJob.create( + qdb.user.User('test@foo.bar'), params) + + jobs = [j[1] for e in a.descendants_with_jobs.edges + for j in e if j[0] == 'job'] + self.assertIn(job1, jobs) + self.assertIn(job2, jobs) + @qiita_test_checker() class ArtifactArchiveTests(TestCase): diff --git a/qiita_db/test/test_processing_job.py b/qiita_db/test/test_processing_job.py index b3b8be2c6..4940e039c 100644 --- a/qiita_db/test/test_processing_job.py +++ b/qiita_db/test/test_processing_job.py @@ -1205,6 +1205,18 @@ def test_add_error(self): qdb.exceptions.QiitaDBOperationNotPermittedError): qdb.processing_job.ProcessingWorkflow(1).add({}, None) + # test that the qdb.util.max_artifacts_in_workflow + with qdb.sql_connection.TRN: + qdb.sql_connection.perform_as_transaction( + "UPDATE settings set max_artifacts_in_workflow = 1") + with self.assertRaisesRegex( + ValueError, "Cannot add new job because it will create " + "more artifacts "): + qdb.processing_job.ProcessingWorkflow(2).add( + qdb.software.DefaultParameters(1), + req_params={'input_data': 1}, force=True) + qdb.sql_connection.TRN.rollback() + def test_remove(self): exp_command = qdb.software.Command(1) json_str = ( diff --git a/qiita_db/test/test_util.py b/qiita_db/test/test_util.py index 112cb3e6b..33766f3e4 100644 --- a/qiita_db/test/test_util.py +++ b/qiita_db/test/test_util.py @@ -45,6 +45,11 @@ def test_max_preparation_samples(self): obs = qdb.util.max_preparation_samples() self.assertEqual(obs, 800) + def test_max_artifacts_in_workflow(self): + """Test that we get the correct max_artifacts_in_workflow""" + obs = qdb.util.max_artifacts_in_workflow() + self.assertEqual(obs, 35) + def test_filepath_id_to_object_id(self): # filepaths 1, 2 belongs to artifact 1 self.assertEqual(qdb.util.filepath_id_to_object_id(1), 1) diff --git a/qiita_db/util.py b/qiita_db/util.py index df7153bb4..76d27e90b 100644 --- a/qiita_db/util.py +++ b/qiita_db/util.py @@ -416,6 +416,20 @@ def max_preparation_samples(): return qdb.sql_connection.TRN.execute_fetchlast() +def max_artifacts_in_workflow(): + r"""Returns the max number of artifacts allowed in a single workflow + + Returns + ------- + int + The max number of artifacts allowed in a single workflow + """ + with qdb.sql_connection.TRN: + qdb.sql_connection.TRN.add( + "SELECT max_artifacts_in_workflow FROM settings") + return qdb.sql_connection.TRN.execute_fetchlast() + + def compute_checksum(path): r"""Returns the checksum of the file pointed by path diff --git a/qiita_pet/handlers/analysis_handlers/base_handlers.py b/qiita_pet/handlers/analysis_handlers/base_handlers.py index bc3de16d0..bd9c208e2 100644 --- a/qiita_pet/handlers/analysis_handlers/base_handlers.py +++ b/qiita_pet/handlers/analysis_handlers/base_handlers.py @@ -196,7 +196,9 @@ def analyisis_graph_handler_get_request(analysis_id, user): # This should never happen, but worth having a useful message raise 
ValueError('More than one workflow in a single analysis') - return {'edges': edges, 'nodes': nodes, 'workflow': wf_id, + # the list(set()) is to remove any duplicated nodes + return {'edges': list(set(edges)), 'nodes': list(set(nodes)), + 'workflow': wf_id, 'artifacts_being_deleted': artifacts_being_deleted} diff --git a/qiita_pet/handlers/api_proxy/studies.py b/qiita_pet/handlers/api_proxy/studies.py index 431583c57..547c7621f 100644 --- a/qiita_pet/handlers/api_proxy/studies.py +++ b/qiita_pet/handlers/api_proxy/studies.py @@ -11,6 +11,8 @@ from qiita_core.exceptions import IncompetentQiitaDeveloperError from qiita_core.util import execute_as_transaction from qiita_core.qiita_settings import r_client +from qiita_db.artifact import Artifact +from qiita_db.sql_connection import TRN from qiita_db.user import User from qiita_db.study import Study from qiita_db.exceptions import QiitaDBColumnError, QiitaDBLookupError @@ -114,8 +116,8 @@ def study_get_req(study_id, user_id): study_info['has_access_to_raw_data'] = study.has_access( User(user_id), True) or study.public_raw_download - study_info['show_biom_download_button'] = 'BIOM' in [ - a.artifact_type for a in study.artifacts()] + study_info['show_biom_download_button'] = len( + study.artifacts(artifact_type='BIOM')) != 0 study_info['show_raw_download_button'] = any([ True for pt in study.prep_templates() if pt.artifact is not None]) @@ -201,53 +203,71 @@ def study_prep_get_req(study_id, user_id): access_error = check_access(study_id, user_id) if access_error: return access_error - # Can only pass ids over API, so need to instantiate object + study = Study(int(study_id)) - prep_info = defaultdict(list) + prep_info = {dtype: [] for dtype in study.data_types} editable = study.can_edit(User(user_id)) - for dtype in study.data_types: - dtype_infos = list() - for prep in study.prep_templates(dtype): - if prep.status != 'public' and not editable: + with TRN: + sql = """SELECT prep_template_id, pt.name as name, data_type, + artifact_id, + creation_timestamp, modification_timestamp, visibility, + (SELECT COUNT(sample_id) + FROM qiita.prep_template_sample + WHERE prep_template_id = spt.prep_template_id) + as total_samples, + (SELECT COUNT(sample_id) + FROM qiita.prep_template_sample + WHERE prep_template_id = spt.prep_template_id + AND ebi_experiment_accession != '') + as ebi_experiment + FROM qiita.study_prep_template spt + LEFT JOIN qiita.prep_template pt USING (prep_template_id) + LEFT JOIN qiita.data_type USING (data_type_id) + LEFT JOIN qiita.artifact USING (artifact_id) + LEFT JOIN qiita.visibility USING (visibility_id) + WHERE study_id = %s + GROUP BY prep_template_id, pt.name, data_type, artifact_id, + creation_timestamp, modification_timestamp, + visibility + ORDER BY creation_timestamp""" + + TRN.add(sql, [study_id]) + for row in TRN.execute_fetchindex(): + row = dict(row) + if row['visibility'] != 'public' and not editable: continue - start_artifact = prep.artifact + # for those preps that have no artifact + if row['visibility'] is None: + row['visibility'] = 'sandbox' + info = { - 'name': prep.name, - 'id': prep.id, - 'status': prep.status, - 'total_samples': len(prep), - 'creation_timestamp': prep.creation_timestamp, - 'modification_timestamp': prep.modification_timestamp + 'name': row['name'], + 'id': row['prep_template_id'], + 'status': row['visibility'], + 'total_samples': row['total_samples'], + 'creation_timestamp': row['creation_timestamp'], + 'modification_timestamp': row['modification_timestamp'], + 'start_artifact': None, + 
'start_artifact_id': None, + 'youngest_artifact': None, + 'num_artifact_children': 0, + 'youngest_artifact_name': None, + 'youngest_artifact_type': None, + 'ebi_experiment': row['ebi_experiment'] } - if start_artifact is not None: - youngest_artifact = prep.artifact.youngest_artifact + if row['artifact_id'] is not None: + start_artifact = Artifact(row['artifact_id']) + youngest_artifact = start_artifact.youngest_artifact info['start_artifact'] = start_artifact.artifact_type - info['start_artifact_id'] = start_artifact.id + info['start_artifact_id'] = row['artifact_id'] info['num_artifact_children'] = len(start_artifact.children) info['youngest_artifact_name'] = youngest_artifact.name info['youngest_artifact_type'] = \ youngest_artifact.artifact_type info['youngest_artifact'] = '%s - %s' % ( youngest_artifact.name, youngest_artifact.artifact_type) - info['ebi_experiment'] = len( - [v for _, v in prep.ebi_experiment_accessions.items() - if v is not None]) - else: - info['start_artifact'] = None - info['start_artifact_id'] = None - info['youngest_artifact'] = None - info['num_artifact_children'] = 0 - info['youngest_artifact_name'] = None - info['youngest_artifact_type'] = None - info['ebi_experiment'] = 0 - - dtype_infos.append(info) - - # default sort is in ascending order of creation timestamp - sorted_info = sorted(dtype_infos, - key=lambda k: k['creation_timestamp'], - reverse=False) - prep_info[dtype] = sorted_info + + prep_info[row['data_type']].append(info) return {'status': 'success', 'message': '', diff --git a/qiita_pet/handlers/api_proxy/tests/test_studies.py b/qiita_pet/handlers/api_proxy/tests/test_studies.py index ec938099f..4c574ecfd 100644 --- a/qiita_pet/handlers/api_proxy/tests/test_studies.py +++ b/qiita_pet/handlers/api_proxy/tests/test_studies.py @@ -229,8 +229,7 @@ def test_study_prep_get_req_failed_EBI(self): # actual test obs = study_prep_get_req(study.id, user_email) - temp_info = defaultdict(list) - temp_info['16S'] = [ + temp_info = {'16S': [ {"status": 'sandbox', 'name': 'Prep information %d' % pt.id, 'start_artifact': None, 'youngest_artifact': None, @@ -241,7 +240,7 @@ def test_study_prep_get_req_failed_EBI(self): 'num_artifact_children': 0, 'youngest_artifact_name': None, 'youngest_artifact_type': None, - 'total_samples': 3}] + 'total_samples': 3}]} exp = { 'info': temp_info, diff --git a/qiita_pet/handlers/study_handlers/prep_template.py b/qiita_pet/handlers/study_handlers/prep_template.py index e02d2c477..167f981bd 100644 --- a/qiita_pet/handlers/study_handlers/prep_template.py +++ b/qiita_pet/handlers/study_handlers/prep_template.py @@ -5,12 +5,13 @@ # # The full license is in the file LICENSE, distributed with this software. 
# ----------------------------------------------------------------------------- -from os.path import join +from os.path import join, relpath from tornado.web import authenticated from tornado.escape import url_escape import pandas as pd +from qiita_core.qiita_settings import qiita_config from qiita_pet.handlers.util import to_int from qiita_pet.handlers.base_handlers import BaseHandler from qiita_db.util import (get_files_from_uploads_folders, get_mountpoint, @@ -79,6 +80,14 @@ def get(self): fp = res['creation_job'].parameters.values['sample_sheet'] res['creation_job_filename'] = fp['filename'] res['creation_job_filename_body'] = fp['body'] + summary = None + if res['creation_job'].status == 'success': + if res['creation_job'].outputs: + # [0] is the id, [1] is the filepath + _file = res['creation_job'].outputs[ + 'output'].html_summary_fp[1] + summary = relpath(_file, qiita_config.base_data_dir) + res['creation_job_artifact_summary'] = summary self.render('study_ajax/prep_summary.html', **res) diff --git a/qiita_pet/templates/study_ajax/prep_summary.html b/qiita_pet/templates/study_ajax/prep_summary.html index a943a8fdb..f7eeaf1ad 100644 --- a/qiita_pet/templates/study_ajax/prep_summary.html +++ b/qiita_pet/templates/study_ajax/prep_summary.html @@ -435,6 +435,9 @@

{{name}} - ID {{prep_id}} ({{data_type}}) {% if user_level in ('admin', 'wet-lab admin') and creation_job is not None %} SampleSheet + {% if creation_job_artifact_summary is not None %} + Creation Job Output + {% end %} {% end %} Edit name Prep info diff --git a/qiita_ware/commands.py b/qiita_ware/commands.py index 83efcf5a4..249864ebe 100644 --- a/qiita_ware/commands.py +++ b/qiita_ware/commands.py @@ -215,40 +215,64 @@ def submit_EBI(artifact_id, action, send, test=False, test_size=False): LogEntry.create( 'Runtime', 'The submission: %d is larger than allowed (%d), will ' 'try to fix: %d' % (artifact_id, max_size, total_size)) - # transform current metadata to dataframe for easier curation - rows = {k: dict(v) for k, v in ebi_submission.samples.items()} - df = pd.DataFrame.from_dict(rows, orient='index') - # remove unique columns and same value in all columns - nunique = df.apply(pd.Series.nunique) - nsamples = len(df.index) - cols_to_drop = set( - nunique[(nunique == 1) | (nunique == nsamples)].index) - # maximize deletion by removing also columns that are almost all the - # same or almost all unique - cols_to_drop = set( - nunique[(nunique <= int(nsamples * .01)) | - (nunique >= int(nsamples * .5))].index) - cols_to_drop = cols_to_drop - {'taxon_id', 'scientific_name', - 'description', 'country', - 'collection_date'} - all_samples = ebi_submission.sample_template.ebi_sample_accessions - samples = [k for k in ebi_submission.samples if all_samples[k] is None] - if samples: - ebi_submission.write_xml_file( - ebi_submission.generate_sample_xml(samples, cols_to_drop), - ebi_submission.sample_xml_fp) + def _reduce_metadata(low=0.01, high=0.5): + # helper function to + # transform current metadata to dataframe for easier curation + rows = {k: dict(v) for k, v in ebi_submission.samples.items()} + df = pd.DataFrame.from_dict(rows, orient='index') + # remove unique columns and same value in all columns + nunique = df.apply(pd.Series.nunique) + nsamples = len(df.index) + cols_to_drop = set( + nunique[(nunique == 1) | (nunique == nsamples)].index) + # maximize deletion by removing also columns that are almost all + # the same or almost all unique + cols_to_drop = set( + nunique[(nunique <= int(nsamples * low)) | + (nunique >= int(nsamples * high))].index) + cols_to_drop = cols_to_drop - {'taxon_id', 'scientific_name', + 'description', 'country', + 'collection_date'} + all_samples = ebi_submission.sample_template.ebi_sample_accessions + + if action == 'ADD': + samples = [k for k in ebi_submission.samples + if all_samples[k] is None] + else: + samples = [k for k in ebi_submission.samples + if all_samples[k] is not None] + if samples: + ebi_submission.write_xml_file( + ebi_submission.generate_sample_xml(samples, cols_to_drop), + ebi_submission.sample_xml_fp) + + # let's try with the default pameters + _reduce_metadata() # now let's recalculate the size to make sure it's fine new_total_size = sum([stat(tr).st_size for tr in to_review if tr is not None]) LogEntry.create( - 'Runtime', 'The submission: %d after cleaning is %d and was %d' % ( + 'Runtime', + 'The submission: %d after defaul cleaning is %d and was %d' % ( artifact_id, total_size, new_total_size)) if new_total_size > max_size: - raise ComputeError( - 'Even after cleaning the submission: %d is too large. 
Before ' - 'cleaning: %d, after: %d' % ( + LogEntry.create( + 'Runtime', 'Submission %d still too big, will try more ' + 'stringent parameters' % (artifact_id)) + + _reduce_metadata(0.05, 0.4) + new_total_size = sum([stat(tr).st_size + for tr in to_review if tr is not None]) + LogEntry.create( + 'Runtime', + 'The submission: %d after defaul cleaning is %d and was %d' % ( artifact_id, total_size, new_total_size)) + if new_total_size > max_size: + raise ComputeError( + 'Even after cleaning the submission: %d is too large. ' + 'Before cleaning: %d, after: %d' % ( + artifact_id, total_size, new_total_size)) st_acc, sa_acc, bio_acc, ex_acc, run_acc = None, None, None, None, None if send:
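For readers unfamiliar with the column-dropping strategy that `_reduce_metadata` applies above, here is a standalone sketch of the same `nunique`-threshold idea on toy data. The column names and values are invented for illustration; the real helper additionally regenerates the sample XML and, as in the patch, never drops the EBI-required columns.

```python
import pandas as pd

# Toy sample metadata: one near-constant column, one per-sample-unique column,
# and one informative column that should survive the reduction.
df = pd.DataFrame({
    'collection_device': ['swab'] * 100,
    'sample_barcode': [f'BC{i:04d}' for i in range(100)],
    'host_age': [20 + (i % 5) for i in range(100)],
})


def reduce_metadata(df, low=0.01, high=0.5,
                    keep=('taxon_id', 'scientific_name', 'description',
                          'country', 'collection_date')):
    # Drop columns that are (almost) constant or (almost) unique per sample,
    # keeping the protected columns regardless of their cardinality.
    nunique = df.apply(pd.Series.nunique)
    nsamples = len(df.index)
    to_drop = set(nunique[(nunique <= int(nsamples * low)) |
                          (nunique >= int(nsamples * high))].index) - set(keep)
    return df.drop(columns=sorted(to_drop))


print(reduce_metadata(df).columns.tolist())  # only 'host_age' survives
```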