Add ribcounts to sample metadata for dashboard deployment #21

Merged 3 commits on Sep 25, 2023

10 changes: 10 additions & 0 deletions dashboard/prepare-dashboard-data.py
@@ -274,6 +274,7 @@ def count_dups(hvr_fname):

# sample -> {metadata}
sample_metadata = defaultdict(dict)

for project in projects:
    with open("%s/bioprojects/%s/metadata/metadata.tsv" % (
            ROOT_DIR, project)) as inf:
@@ -289,6 +290,15 @@ def count_dups(hvr_fname):
            sample_metadata[sample]["reads"] = \
                project_sample_reads[project][sample]

            rc_fname = "ribocounts/%s.ribocounts.txt" % sample
            try:
                with open(rc_fname, 'r') as file:
                    content = file.read().strip()
                    ribocount = int(content)
Member

nit: I'd write this as int(file.read().strip()); I don't think the intermediate content variable adds anything in this case.
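
A minimal sketch of that suggestion, assuming the ribocounts file holds a single integer. The helper name `read_ribocount` and the temporary file are hypothetical, used only to make the example runnable on its own:

```python
import os
import tempfile


def read_ribocount(rc_fname):
    # Hypothetical helper: parse the file contents directly,
    # without an intermediate `content` variable.
    with open(rc_fname, 'r') as file:
        return int(file.read().strip())


# Illustrative file standing in for ribocounts/<sample>.ribocounts.txt
with tempfile.TemporaryDirectory() as tmpdir:
    example_path = os.path.join(tmpdir, "example.ribocounts.txt")
    with open(example_path, "w") as outf:
        outf.write("12345\n")
    example_count = read_ribocount(example_path)
```

The behavior is the same as the two-line version; only the intermediate name goes away.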

                    sample_metadata[sample]["ribocounts"] = ribocount
            except FileNotFoundError:
                continue
Member

I'd like pass here better than continue. If someone later adds code below that's supposed to be within this loop then we wouldn't want to skip it in cases where there was no ribocounts file.
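
The difference can be sketched as follows. This is an illustration rather than the pipeline's code: `load_count`, the `counts` dict, and the `later_step` key are hypothetical stand-ins for the ribocounts files and for code someone might add to the loop later.

```python
def load_count(sample, counts):
    # Hypothetical stand-in for reading ribocounts/<sample>.ribocounts.txt
    if sample not in counts:
        raise FileNotFoundError(sample)
    return counts[sample]


def annotate_with_continue(samples, counts):
    metadata = {}
    for sample in samples:
        metadata[sample] = {}
        try:
            metadata[sample]["ribocounts"] = load_count(sample, counts)
        except FileNotFoundError:
            continue  # jumps to the next sample, skipping the line below
        metadata[sample]["later_step"] = True  # hypothetical later addition
    return metadata


def annotate_with_pass(samples, counts):
    metadata = {}
    for sample in samples:
        metadata[sample] = {}
        try:
            metadata[sample]["ribocounts"] = load_count(sample, counts)
        except FileNotFoundError:
            pass  # ignore the missing file and carry on with this sample
        metadata[sample]["later_step"] = True  # hypothetical later addition
    return metadata


counts = {"s1": 10}  # "s2" has no ribocounts file
skipped = annotate_with_continue(["s1", "s2"], counts)
kept = annotate_with_pass(["s1", "s2"], counts)
```

With `continue`, the sample without a ribocounts file never reaches the later step; with `pass`, it does.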


for taxid in observed_taxids:
for project in projects:
for sample in project_sample_reads[project]:
10 changes: 10 additions & 0 deletions dashboard/prepare-dashboard-data.sh
@@ -20,6 +20,7 @@ cd $ROOT_DIR/dashboard
mkdir -p allmatches/
mkdir -p hvreads/
mkdir -p hvrfull/
mkdir -p ribocounts/

if [ ! -e names.dmp ] ; then
    wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2022-12-01.zip
@@ -48,6 +49,15 @@ for study in $(aws s3 ls $S3_DIR | awk '{print $NF}'); do
    done
done | xargs -I {} -P 32 aws s3 cp {} hvreads/

for study in $(aws s3 ls $S3_DIR | awk '{print $NF}'); do
    for rc in $(aws s3 ls $S3_DIR${study}ribocounts/ | \
                    awk '{print $NF}'); do
        if [ ! -s ribocounts/$rc ]; then
            echo $S3_DIR${study}ribocounts/$rc
        fi
    done
done | xargs -I {} -P 32 aws s3 cp {} ribocounts/

$MGS_PIPELINE_DIR/dashboard/prepare-dashboard-data.py $ROOT_DIR $MGS_PIPELINE_DIR

echo "Now check in data.js and the json files and check out on prod"
3 changes: 2 additions & 1 deletion run.py
@@ -264,7 +264,8 @@ def calculate_average_read_length(file_path):
 ribodetector_cmd = [
     "ribodetector_cpu",
     "--ensure", "rrna",
-    "--threads", "24"
+    "--threads", "24",
+    "--chunk_size", "256"
Member

nit: in the future, it's better practice to put unrelated changes in their own PRs.

]
ribodetector_cmd.extend(["--len", str(avg_length)])