
export/import group data via json file #4632

Merged
merged 42 commits into master from copy-data on Oct 24, 2018

Conversation

robguthrie
Member

No description provided.

@gdpelican
Contributor

@robguthrie Does this need additional work to be merged? It looks like the existing code doesn't really do anything other than memberships...

@robguthrie
Member Author

robguthrie commented May 7, 2018 via email

@gdpelican
Contributor

Would be nice to get this merged in, @robguthrie ; anything I can do to support?

@robguthrie
Member Author

Hi @gdpelican. This version does not gobble memory when exporting large groups: we stream data straight into the file, with one JSON record per line. The Ruby process stays around 100 MB rather than growing to 8 GB.

We may want to think about how to use the .import method, maybe by grouping records by table, then importing after dropping/ignoring existing records.

Anyway, I'm looking to complete this over the next day or so.

@robguthrie
Member Author

robguthrie commented May 31, 2018

@gdpelican the export and import of our biggest groups can happen in 5 minutes or so each now, without crazy memory usage.

I'm thinking about the burndown list now:

  • Consider who can export a group and its subgroups. If a group admin cannot access a subgroup, do they get to export it?
  • Self-service export: create a background job and upload the file somewhere. I'd like to reuse our own attachments system for this somehow.
  • Excluded fields. We're going to exclude some tokens and keys, right? Which ones should we exclude?
  • Attachment files. Need to consider what to do here. It could be that this is handled on import rather than export, i.e. download attachments from the previous server on import and update their location.

@robguthrie
Member Author

OK, so we're now excluding users' confidential fields.

So I think we now enable users to export JSON for any group they belong to.
They select the parent group, then get told they'll be emailed when the file is available. They then get an email with a link to their JSON for the group. It will contain the top-level group and all the subgroups they belong to in a single JSON file.

I'm thinking that we create an attachment owned by the user with no attachable, and add the exported json to it, then add a delete delayed job scheduled for a week later.

@robguthrie
Member Author

@gdpelican this is in the final stages now. Just going to finish the email styling. CR appreciated.

@@ -37,6 +37,11 @@ def upload_photo
respond_with_resource
end

def export
service.export(group: load_and_authorize(:group), actor: current_user)
Contributor

Pedantic, but you should be able to do

service.export(group: load_resource, actor: current_user)

because it should be the service's job to authorize whether you can export or not.

Member Author

Kind of related. One thing that might be relevant is anonymous polls.
A data export would allow people to see who voted what.

If it were not for anonymous polls then I'd say that anyone who is a member should be able to export the group data.

If there are anonymous polls, I don't know if anyone should be able to download the data. It's a weird situation. What are your thoughts?

Contributor

If a poll's anonymous, we should anonymize the export data.

Here's where I kinda wish we had serializers for these things, because it would mean we could make tweaks like this a bit more easily, rather than throwing scopes on the exportable_relations.

Instead of that, though (I reckon it's a PITA), I think I'd prefer filtering out anonymous polls over including them:

# exportable_relations.rb
has_many :exportable_polls, -> { where(anonymous: false) }, source: :polls
# group_export_service
RELATIONS = [
  ...
  'exportable_polls' # (and not 'polls')
]

Contributor

(NB that we don't use the 'discussion_polls' relation at the moment, because we're ensuring that a poll has the group_id set correctly if the discussion_id is set. A bit stateful-y, but it's worked so far.)

@@ -48,7 +48,7 @@ class FormalGroup < Group
belongs_to :default_group_cover

has_many :subgroups,
-> { where(archived_at: nil).order(:name) },
Contributor

I agree with this, but wonder if we need to check the /g/:key/subgroups query to make sure it's still returning the same thing?

end

def documents
Document.none
Contributor

Hmmmm I wonder if we want to move the has_many :documents line from FormalGroup to Group instead of doing this.

Member Author

done

@@ -89,6 +89,6 @@ def discussion_readers
private

def set_volume
self.volume = user.default_membership_volume if group.is_formal_group?
self.volume = user.default_membership_volume if id.nil? && group.is_formal_group?
Contributor

Oh hmm what is this change needed for?

I think self.new_record? is more idiomatic.

Member Author

I've been working to remove all unnecessary N+1 queries. Reading volume triggers a query in a couple of places (checking default membership volume and group volume), and it also pollutes the export because it returns a value that isn't what the column actually contains.

So that's why I've moved to using a method that clearly says it's giving a computed value for volume rather than overloading the simple accessor. I like it more and it hasn't really been a problem to change.

Contributor

Oh, I was referring to the id.nil? addition.

@@ -33,10 +33,6 @@ def discussion_reader_id
object.id
end

def discussion_reader_volume
Contributor

This gets used in discussion_model.coffee, I wonder if we need to account for that?

Contributor

Also, I wonder whether the volume changes are strictly necessary for group export.

Contributor

A flag that this needs to be addressed in some way before merge; we can't be referencing discussionReaderVolume anywhere on the client if this method is taken out

def perform(group, actor)
groups = actor.groups.where(id: group.all_groups)
filename = GroupExportService.export_filename_for(group)
GroupExportService.export(groups, filename)
Contributor

I would have expected

filename = GroupExportService.export(group)

and then call group.all_groups from within the service

Member Author

The plan was that anyone in a group can export the data. Meaning they should only be able to export data for groups they belong to. So I want to pass in only groups the user belongs to.

Contributor

Ah, then better to pass in the actor here and do that line within the service. I think the filename method is out of place here; there's no reason to have two static methods on GroupExportService like this, since they're so intertwined.

filename = GroupExportService.export(group, actor)
document = Document.create author: actor, file: File.open(filename, 'r'), title: filename
UserMailer.group_export_ready(actor, group, document).deliver

I also wonder about creating an event for this so that we can easily track it and send it to other places in the future if we want (like, text me when it's done, or send me a push notification with a link to the thing, etc.)

GroupExported.publish!(group, actor, document)

This also maintains our current distance from our ideal of 'events are the only way to send emails within the app'


$scope.openGroupExportModal = ->
ModalService.open 'ConfirmModal', confirm: ->
submit: -> Records.groups.export($scope.group.id)
Contributor

We tend towards actions on the model, so

# group_model.coffee
export: =>
  @remote.post(@id, 'export')
submit: $scope.group.export

Member Author

done, thanks

@@ -12,6 +12,11 @@
sign_in user
end

describe 'export' do
it 'kicks off a group export job'
Contributor

Would be good to write these, even if it's just a check for 200 and 403. Maybe check to ensure Document.count is incremented too.

end

def self.import(filename)
tables = File.open(filename, 'r').map { |line| JSON.parse(line)['table'] }.uniq
Contributor

gdpelican commented Jun 10, 2018

There's some stuff that smells a bit here, but I haven't got a better solution off the top of my head, given that (I'm assuming) we don't want to load the whole file into memory. I think JSON-parsing each line of the file N+1 times is going to hurt us, and I wonder if there's a way to avoid it.

I wouldn't be opposed to, say, iterating through each line of the file, parsing it, and then:

  • if the table matches the current table, klass.new and append it to an array of records
  • if the table doesn't match, klass.import, flush the existing array, and start a new one with the new table

Then we only iterate through the file once, run JSON.parse once per line, and still maintain support for weirdo input like a stray group in the middle of a big run of discussions or whatnot.

Contributor

PS: I believe import will silently fail for imports with ids that already exist, which would be the same behaviour as this exhibits.

Member Author

import totally fails if some of the ids already exist, so the check is necessary.

Thanks for helping to try and improve this, but I think I'd like to call it good enough for now. It's the best solution (fastest export for large groups by miles) after quite a few attempts, and I'd like to move on.

end

def self.export_filename_for(group)
"tmp/#{DateTime.now.strftime("%Y-%m-%d_%H-%M-%S")}_#{group.name.parameterize}.json"
Contributor

A note that the group.name.parameterize will fail for guest groups, which we should be able to export just fine.
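One possible guard for that case, sketched in plain Ruby (this is a hypothetical variant, not the merged code: `slugify` is a minimal stand-in for ActiveSupport's String#parameterize, and falling back to the group's key when the name is blank is an assumption):

```ruby
# Build an export filename that survives groups with no name (e.g. guest
# groups) by falling back to the group's key before slugging.
def export_filename_for(group)
  label = group.name.to_s.strip.empty? ? group.key : group.name
  "tmp/#{Time.now.strftime('%Y-%m-%d_%H-%M-%S')}_#{slugify(label)}.json"
end

# Minimal stand-in for ActiveSupport's String#parameterize: lowercase,
# collapse non-alphanumeric runs to '-', trim leading/trailing dashes.
def slugify(text)
  text.downcase.gsub(/[^a-z0-9]+/, '-').gsub(/\A-+|-+\z/, '')
end
```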

@robguthrie
Member Author

@gdpelican my show stopper issue with this is anonymous polls. What do you think we should do?

Some options that come to mind:

  • Do nothing and have a backdoor where users can see who said what
  • Mangle data on anonymous polls
  • Don't include anonymous polls in export

Any ideas?

@robguthrie robguthrie merged commit dc27aef into master Oct 24, 2018
@robguthrie robguthrie deleted the copy-data branch October 24, 2018 01:09