Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: keep cached user stats [BD-38] #361

Merged
merged 1 commit into from
Jan 25, 2022
Merged

Conversation

xitij2000
Copy link
Contributor

@xitij2000 xitij2000 commented Dec 10, 2021

In a previous PR (https://github.com/edx/cs_comments_service/pull/352) we introduced a new API that would generate user stats for a course on-the-fly. The query would return data about all the users in a course who'd participated in discussions. In some cases, this could be in hundreds of thousands or millions of users, and result in dozens if not hundreds of MBs of data.

This data also needed to be sorted, and paginated. Due to the nature of the query, paginating this data would not reduce the performance implications of the query, since it'd still need to query all the course data to build the stats. That said, it would reduce the final amount of data involved.

This PR implements a different mechanism for the same. It will keep track of user counts across courses, by incrementing the counts each time a user creates a post, comment etc. It will also track deletes, and when content is marked/unmarked as abusive. The first time such an increment/decrement happens it will auto-backfill the data to keep counts accurate. However, other than that it will not automatically backfill data for users yet.

@openedx-webhooks
Copy link

openedx-webhooks commented Dec 10, 2021

Thanks for the pull request, @xitij2000! I've created BLENDED-1044 to keep track of it in Jira. More details are on the BD-38 project page.

When this pull request is ready, tag your edX technical lead.

@openedx-webhooks openedx-webhooks added blended PR is managed through 2U's blended developmnt program waiting on author PR author needs to resolve review requests, answer questions, fix tests, etc. labels Dec 10, 2021
api/users.rb Outdated
@@ -11,11 +11,54 @@
end
end

get "#{APIPREFIX}/users/:course_id/stats2" do |course_id|
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a temporary name to allow testing both endpoints. This will replace stats.

Comment on lines 430 to 460
# Sort the map entries using the default sort
expected_result = expected_result.values.sort_by { |val| [val["threads"], val["responses"], val["replies"]] }.reverse

get "/api/v1/users/#{course_id}/stats"
get "/api/v1/users/#{course_id}/stats2"
expect(last_response.status).to eq(200)
res = parse(last_response.body)
expect(res).to eq expected_result
expect(res["user_stats"]).to eq expected_result
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new API now returns a list of objects sorted by activity but other than that the results should be the same as before.

I'll add more tests for checking that adding comments, removing comments, etc keep the counts updated.

@xitij2000
Copy link
Contributor Author

@mtyaka Could you do a review of this PR?

api/users.rb Outdated
Comment on lines 44 to 48
data = paginated_stats.to_a.map { |user_stats| {
:username => user_stats["username"]
}.merge(
user_stats["course_stats"].first.except("_id", "course_id")
) }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flatten the nesting, and remove the ids from each document since the id is unneeded, and the course_id is common.

api/users.rb Outdated

paginated_stats = User
.where("course_stats.course_id" => course_id)
.only(:username, :'course_stats.$')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only return the username field, and the course_stats entry that matches the course_id condition.

Comment on lines 210 to 240
data = Content.collection.aggregate(
[
# Match all content in the course
{ "$match" => { :course_id => course_id, :author_id => user.external_id } },
# Keep a count of flags for each entry
{
"$set" => {
# Just using $ne with null will return true if the field is absent
# So we first fall all absent fields to null, then check if it's null,
# that way we match for the absence of the field or value = null
:is_reply => { "$ne" => [{ "$ifNull" => ["$parent_id", nil] }, nil] }
}
},
{
"$group" => {
# Here we're grouping items by the type (comment or thread), and the user, and whether the comment is a reply.
# For threads is_reply will always be false.
:_id => { :type => "$_type", :is_reply => "$is_reply" },
# This will just count each group, so we get a breakdown of how many comments and threads a user has created.
:count => { "$sum" => 1 },
# These two will sum up the active and inactive reports in each category
# i.e. reported threads, reported comments, reported replies
# The way this works is (starting from inside out), we take the size of the abuse_flaggers list, we compare
# it to 0 using $cmp. If it's greater than zero then $cmp results in 1 otherwise 0.
# So we're summing up 1 for each abuse_flagger array that has entries, and 0 for the rest. This gives us a
# count of
:active_flags => { "$sum" => { "$cmp" => [{ "$size" => "$abuse_flaggers" }, 0] } },
:inactive_flags => { "$sum" => { "$cmp" => [{ "$size" => "$historical_abuse_flaggers" }, 0] } },
}
}
])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use the same query, but scoped to a single user, so it's a much smaller set of data to go through.

Copy link
Contributor

@mtyaka mtyaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good @xitij2000! I posted a few comments, but they are mostly nits. The only larger concern is that it looks like we're treating empty arrays as falsey in some places, but empty arrays are truthy in ruby.

I didn't do any manual testnig. Do we have a sandbox where I could test it out?

api/users.rb Outdated
per_page = (params["per_page"] || DEFAULT_PER_PAGE).to_i
per_page = DEFAULT_PER_PAGE if per_page <= 0

# There are two sorts available, activity t sor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is incomplete.

lib/helpers.rb Outdated Show resolved Hide resolved
lib/helpers.rb Outdated Show resolved Hide resolved
lib/helpers.rb Outdated Show resolved Hide resolved
api/users.rb Outdated
.paginate(:page => page, :per_page => per_page)
total_count = paginated_stats.total_entries

data = paginated_stats.to_a.map { |user_stats| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Not sure if there's a edx style guide for ruby code, but the common way to format blocks in ruby is to use do |x| ... end for multi-line blocks and { |x| ... } for single line blocks.

models/comment.rb Outdated Show resolved Hide resolved
stats.inactive_flags = inactive_flags
stats.save
stats
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: no newline at end of file

models/user.rb Outdated
def mark_as_read(thread)
reconnect_mongo_primary
read_state = read_states.find_or_create_by(course_id: thread.course_id)
read_state.last_read_times[thread.id.to_s] = Time.now.utc
read_state.save
end

def stats_for_course(course_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be used anywhere. Can we remove it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I think the place where it was used got refactored out, will remove.

spec/api/user_spec.rb Show resolved Hide resolved
spec/api/user_spec.rb Show resolved Hide resolved
Copy link
Contributor

@mtyaka mtyaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks great @xitij2000! The three unless have to be changed to ifs, which I think is the reason why tests are failing.

Do we have a sandbox where I could test this manually?

lib/helpers.rb Outdated Show resolved Hide resolved
lib/helpers.rb Outdated Show resolved Hide resolved
lib/helpers.rb Outdated Show resolved Hide resolved
lib/helpers.rb Outdated Show resolved Hide resolved
Copy link
Contributor

@mtyaka mtyaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xitij2000 I found a subtle bug in the flag_as_abuse and un_flag_as_abuse methods.

However those don't explain the failing tests related to this method. The errors look really baffling, I've been trying to figure out why they happen without luck :/

lib/helpers.rb Outdated Show resolved Hide resolved
@mtyaka
Copy link
Contributor

mtyaka commented Dec 24, 2021

However those don't explain the failing tests related to this method. The errors look really baffling, I've been trying to figure out why they happen without luck :/

Actually the first two failures might be related to that, but the "handles removing flags" error makes no sense to me.

@xitij2000
Copy link
Contributor Author

Actually the first two failures might be related to that, but the "handles removing flags" error makes no sense to me.

I tried isolating changes since the last passing tests and it seems that the errors are due to a change from after_save to after_create. Not sure why, but looking into it.

@xitij2000
Copy link
Contributor Author

I tried isolating changes since the last passing tests and it seems that the errors are due to a change from after_save to after_create. Not sure why, but looking into it.

OK, I figured it out. When using after_save, the stats are updated after the flags are added in the test data, however since the flags are added without triggering the update mechanism they are not updated when you switch to after_create.

I've now updated the tests so that we run an initial update of course stats, and then perform the rest of the tests. This fixes teh issues.

@xitij2000 xitij2000 marked this pull request as ready for review December 27, 2021 08:53
@openedx-webhooks openedx-webhooks added needs triage and removed waiting on author PR author needs to resolve review requests, answer questions, fix tests, etc. labels Dec 27, 2021
@mtyaka
Copy link
Contributor

mtyaka commented Dec 27, 2021

When using after_save, the stats are updated after the flags are added in the test data, however since the flags are added without triggering the update mechanism they are not updated when you switch to after_create.

Ah, that makes sense.

Code looks good to me 👍

Copy link
Contributor

@felipetrz felipetrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented on a couple potential performance concerns still left, but otherwise this is looking a lot better than the previous solution.

models/user.rb Outdated Show resolved Hide resolved
models/user.rb Show resolved Hide resolved
@mtyaka
Copy link
Contributor

mtyaka commented Dec 30, 2021

Nice work @xitij2000! 👍

  • I tested this on the https://discussions.sandbox.opencraft.hosting/ sandbox. I made a few posts, added a few replies, and flaged/unflagged a few other users' posts. I verified that the stats API endpoint returned correct results after each change.
  • I read through the code
  • I checked for accessibility issues N/A
  • Includes documentation -- includes inline comments.

In a previous approach course stats were being generated on the fly for counts of user posts, responses and flags, however this would result in overly large amounts of data in the case of larger courses. This commit instead maintains these counts in the database so that this data can be fetched using a simpler, paginated and sortable query.
@xitij2000 xitij2000 merged commit 8103fc5 into master Jan 25, 2022
@xitij2000 xitij2000 deleted the kshitij/cached-user-stats branch January 25, 2022 12:45
@openedx-webhooks
Copy link

@xitij2000 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blended PR is managed through 2U's blended developmnt program merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants