[WORK IN PROGRESS] - preparsing job recognises uploaded archive better #492

Open
wants to merge 6 commits into
base: master
56 changes: 44 additions & 12 deletions app/workers/preparsing.rb
@@ -1,5 +1,7 @@
# frozen_string_literal: true
require 'zip'
require 'zlib'
require 'rubygems/package'
require 'digest'

class Preparsing
@@ -10,11 +12,40 @@ class Preparsing
def perform(genotype_id)
genotype = Genotype.find(genotype_id)

logger.info "Starting preparse"
biggest = ''
biggest_size = 0
begin
Zip::File.open(genotype.genotype.path) do |zipfile|
logger.info "Starting preparse on #{genotype.genotype.path}"
# First, we need to find out which archive or flat text our uploaded file is!
# We use the Unix `file` utility for that
#
# There are two possible outcomes - file is a collection of files (tar, tar.gz, zip)
# or file is a single file (ASCII, gz)
filetype = `file #{genotype.genotype.path}`
Member

I saw some .docx files as well; not sure whether that was before or after extraction.

Member

Haha, yeah, some folks aren't great at uploading the right file types. I'd say we should reject anything that's not ASCII post-unzip :)

Member

That's true. I also saw a few post-genotype analyses from Promethease.
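
One way to act on the "reject anything that's not ASCII post-unzip" idea, as a rough sketch only: `ascii_text?` is a hypothetical helper that is not part of this PR, and the 4096-byte sample size is an arbitrary assumption. It tolerates a leading UTF-8 BOM, since some uploads start with one.

```ruby
# Hypothetical helper (not in the PR): returns true when the first
# `sample_size` bytes of the file are tab, LF, CR, or printable ASCII.
def ascii_text?(path, sample_size: 4096)
  sample = File.binread(path, sample_size)
  return false if sample.nil?

  # Allow a leading UTF-8 byte order mark before the actual content.
  sample = sample.delete_prefix("\xEF\xBB\xBF".b)
  return false if sample.empty?

  sample.each_byte.all? do |byte|
    byte == 9 || byte == 10 || byte == 13 || (32..126).cover?(byte)
  end
end
```

A .docx or other binary upload fails this check on its first non-printable byte, so it could be rejected right after extraction. Files containing legitimate non-ASCII characters would also be rejected, which may or may not be what we want.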

case filetype
when /ASCII text/
logger.info 'File is flat text'
reader = File.method('open')
is_collection = false
when /gzip compressed data, was/
reader = Zlib::GzipReader.method('open')
logger.info 'file is gz'
is_collection = false
when /gzip compressed data, last modified/
reader = ->(zipfile){ Gem::Package::TarReader.new(Zlib::GzipReader.open(zipfile)) }
Collaborator

Space missing to the left of {.
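
For reference, a small sketch of the same reader with the spacing fixed. `build_tar_gz_reader` is a hypothetical wrapper used only to make the snippet self-contained; the lambda body is the one from the diff.

```ruby
require 'zlib'
require 'rubygems/package'

# Hypothetical wrapper (not in the PR) returning the tar.gz reader lambda,
# written with a space between the parameter list and the opening brace.
def build_tar_gz_reader
  ->(path) { Gem::Package::TarReader.new(Zlib::GzipReader.open(path)) }
end
```

Calling `build_tar_gz_reader.call(path)` returns a `Gem::Package::TarReader` whose entries can be walked with `each`. Note that the lambda as written does not accept a block, so a block handed to `call` would not be run by it.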

is_collection = true
when /POSIX tar archive/
logger.info 'File is tar'
reader = Gem::Package::TarReader.method('new')
is_collection = true
when /Zip archive data/
logger.info 'File is zip'
reader = Zip::File.method('open')
is_collection = true
end

if is_collection
# Find the biggest file in the archive
biggest = ''
biggest_size = 0
reader.call genotype.genotype.path do |zipfile|
# find the biggest file, since that's going to be the genotyping
zipfile.each do |entry|
if entry.size > biggest_size
@@ -23,18 +54,19 @@ def perform(genotype_id)
end
end

zipfile.extract(biggest,"#{Rails.root}/tmp/#{genotype.fs_filename}.csv")
system("mv #{Rails.root}/tmp/#{genotype.fs_filename}.csv #{Rails.root}/public/data/#{genotype.fs_filename}")
logger.info "copied file"
zipfile.extract(biggest, Rails.root.join('tmp', "#{genotype.fs_filename}.csv"))
system("mv #{Rails.root.join('tmp', "#{genotype.fs_filename}.csv")} \
#{Rails.root.join('public', 'data',genotype.fs_filename)}")
Collaborator

Space missing after comma.

logger.info 'Copied file'
end

rescue
logger.info "nothing to unzip, seems to be a text-file in the first place"
else
system("cp #{genotype.genotype.path} \
Collaborator

Please use Rails.root.join('path', 'to') instead.
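
A possible shape for this, sketched under assumptions: `publish_upload` is a hypothetical helper, and `root` stands in for `Rails.root` (a `Pathname` in Rails). Swapping the shell `cp` for `FileUtils.cp` is my own substitution rather than something the comment asks for; it avoids interpolating the upload path into a shell command.

```ruby
require 'fileutils'
require 'pathname'

# Hypothetical helper (not in the PR). `root` plays the role of Rails.root.
def publish_upload(source_path, root, fs_filename)
  # Build the destination from path segments instead of string interpolation.
  destination = root.join('public', 'data', fs_filename)
  FileUtils.mkdir_p(destination.dirname)
  # Copy without going through a shell, so spaces in paths are harmless.
  FileUtils.cp(source_path, destination)
  destination
end
```

In the worker this might be called as `publish_upload(genotype.genotype.path, Rails.root, genotype.fs_filename)`.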

#{Rails.root.join('public', 'data', genotype.fs_filename)}")
end

# now that they are unzipped, check if they're actually proper files
file_is_ok = false
fh = File.open(genotype.genotype.path)
Member Author

How did this ever work? As written, it parses the original uploaded file, which can be anything before extraction.
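
To illustrate the intended behaviour, a minimal sketch: `first_line_of` is a hypothetical helper that would be pointed at the extracted copy rather than the upload. It uses Ruby's `BOM|UTF-8` open mode to drop a leading byte order mark, which is an alternative to the `sub("\uFEFF", "")` call in the diff, not what the PR does.

```ruby
# Hypothetical helper (not in the PR): reads the first line of the file at
# `path`, letting Ruby strip a leading UTF-8 byte order mark on open.
def first_line_of(path)
  File.open(path, 'r:BOM|UTF-8') { |fh| fh.readline }
end
```

In the worker, `path` would be `Rails.root.join('public', 'data', genotype.fs_filename)`, so the check always sees the extracted text file. `readline` raises `EOFError` on an empty file, which the caller would need to handle.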

fh = File.open Rails.root.join('public', 'data', genotype.fs_filename)
l = fh.readline()
# some files, for some reason, start with the UTF-BOM-marker
l = l.sub("\uFEFF","")