adapt CSV parsing to new data model #315

Closed
mfenner opened this Issue May 31, 2015 · 8 comments

@mfenner
Member

mfenner commented May 31, 2015

For each work, the reports in CSV format collect the total counts for every source. For three sources (mendeley, counter, pmc) we need additional information (readers, html, and pdf counts, respectively). Before Lagotto 4.0 this was done using a to_csv method in the respective model. With Lagotto 4.0 and the switch to storing all source data in MySQL instead of CouchDB, this functionality is broken.
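For illustration, the broken per-model export could be sketched roughly like this (a minimal sketch with hypothetical names and hash-based data shapes; the real models read their counts from MySQL via ActiveRecord):

```ruby
require "csv"

# Minimal sketch of the pre-4.0 per-source CSV export (hypothetical names
# and data shapes). Per the issue description, three sources need an extra
# column: readers (mendeley), html (counter) and pdf (pmc).
EXTRA_COLUMNS = {
  "mendeley" => %w[readers],
  "counter"  => %w[html],
  "pmc"      => %w[pdf]
}.freeze

def to_csv(source_name, works)
  extras = EXTRA_COLUMNS.fetch(source_name, [])
  CSV.generate do |csv|
    csv << ["pid", "total", *extras]
    works.each do |work|
      csv << [work[:pid], work[:total], *extras.map { |c| work[c.to_sym] }]
    end
  end
end
```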

@mfenner
Member

mfenner commented Jun 23, 2015

Relevant code:

@zdennis
Contributor

zdennis commented Jun 24, 2015

@mfenner, here are some observations / assumptions I made today while exploring the code...

1: All source models (except Mendeley) inherit their to_csv functionality from the base Source model. Given that all source data moved from CouchDB to MySQL in the 4.0 release, I believe this is why all source CSV reports are broken, and that fixing Source#to_csv will resolve that particular issue (except for Mendeley).

2: Three individual source CSV reports – Mendeley, Counter, Pmc – need additional information that the other source CSV reports do not require. My assumption here is that once the generic source CSV reports are working again (i.e. Source#to_csv), these three reports can be updated to include the new fields: readers, html, pdf.

3: Do you want the new columns (readers, html, pdf) to be pushed up and included in the aggregate ALM report? If so, I'll plan on updating the code that generates the ALM report to handle that.

Let me know if I'm heading down the right track for the intended goals of this issue or if I'm missing anything. Thanks!
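The inheritance structure in point 1 could look roughly like this (a minimal sketch with hypothetical names, not the actual Lagotto classes): the base Source provides the generic CSV, and a subclass appends its extra columns by overriding two small hooks.

```ruby
require "csv"

# Hypothetical sketch of Source#to_csv and a Mendeley override.
class Source
  # Subclasses override these two hooks to append extra columns.
  def csv_columns
    %w[pid total]
  end

  def csv_row(work)
    [work[:pid], work[:total]]
  end

  def to_csv(works)
    CSV.generate do |csv|
      csv << csv_columns
      works.each { |work| csv << csv_row(work) }
    end
  end
end

class Mendeley < Source
  def csv_columns
    super + %w[readers]
  end

  def csv_row(work)
    super + [work[:readers]]
  end
end
```

With this shape, fixing the base method repairs every generic source report at once, and only Mendeley (and, per point 2, Counter and Pmc) needs its own overrides.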

@mfenner
Member

mfenner commented Jun 25, 2015

1: Yes, fixing the CSV generation should be straightforward. Two other sources (pmc, counter) also had a to_csv, but I have removed it. I linked the revisions above that still contain to_csv.

2: Yes.

3: Yes, the aggregate report used to have separate columns for html, pdf, etc. where it mattered. I forgot to include a link to a sample report: http://figshare.com/articles/Cumulative_PLOS_ALM_Report_February_2014/1189396 (BTW, Figshare is similar to Zenodo in functionality).

The only other comment I would make is that I have always found the code used for report generation a bit obscure, e.g. the giant SQL call in Report.from_sql (https://github.com/articlemetrics/lagotto/blob/master/app/models/report.rb#L33-L41). I think covering this in tests works fine, but some of the database calls take rather a long time, and there was, for example, an issue with timeouts of CouchDB calls (no longer relevant). In the calls to external APIs I was able to break large queries into chunks of 200 or 1000 that are then handled by Sidekiq in parallel (the latter, for example, in https://github.com/articlemetrics/lagotto/blob/master/app/models/imports/crossref_import.rb). I'm not sure how much refactoring would be required to do something similar with the CSV report.
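The chunking approach described for the external API calls amounts to something like this (a sketch with a hypothetical method name; the block stands in for enqueuing one Sidekiq job per batch, which is not shown here):

```ruby
# Split a large list of work IDs into fixed-size batches and hand each
# batch to the block, which in Lagotto would enqueue a Sidekiq job so the
# batches are processed in parallel by workers.
def enqueue_in_batches(ids, batch_size: 1000)
  ids.each_slice(batch_size).map { |batch| yield(batch) }
end
```

Applying the same idea to the CSV report would mean generating per-batch partial results in background jobs and concatenating them, rather than building the whole report in one long-running call.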

@mfenner mfenner added the data-export label Jun 25, 2015

@zdennis
Contributor

zdennis commented Jun 25, 2015

> The only other comment I would make is that I have always found the code used for report generation a bit obscure, e.g. the giant SQL call in Report.from_sql (https://github.com/articlemetrics/lagotto/blob/master/app/models/report.rb#L33-L41)

Yeah, I can see that. I think there's an intermingling of report-related concepts in the Report model. It took me a little while to mentally separate out that the class-level reports generated for the CSV aren't related to the kinds of reports that the Report model represents, and that the Report class has other unrelated service-like methods (all of the send_xxx methods for kicking off mailers).

As I work on getting the CSV reports back in place, I'm going to put some thought into how to refactor and further separate those concepts.
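One possible direction for that separation, sketched with hypothetical names (this is my illustration, not a proposal from the thread): pull CSV generation out into small report objects with a shared interface, leaving the Report ActiveRecord model with only its persistence and send_xxx mailer duties.

```ruby
# Hypothetical sketch: CSV generation as a plain Ruby report object,
# decoupled from the Report ActiveRecord model.
class SourceCsvReport
  def initialize(rows)
    @rows = rows
  end

  def headers
    %w[pid total]
  end

  # Yields the report line by line, so large reports can be streamed to a
  # file or HTTP response instead of being built in memory.
  def each_line
    return enum_for(:each_line) unless block_given?
    yield headers.join(",")
    @rows.each { |row| yield "#{row[:pid]},#{row[:total]}" }
  end
end
```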

@mfenner
Member

mfenner commented Jun 25, 2015

Great.

@zdennis
Contributor

zdennis commented Jun 25, 2015

@mfenner, I was looking at the reports.json from 691d4a6 to make sure I had an accurate understanding of what those views were doing. When generating a format-specific CSV source report, do the event's date and count sums correspond to the year/month and sums in the Month model?

@mfenner
Member

mfenner commented Jun 25, 2015

Yes, the Month model replaced this information from CouchDB.

We actually have several different reports (see https://github.com/articlemetrics/lagotto/blob/master/lib/tasks/report.rake). The only one we ever made public, and the most important one, is the summary report, generated by the combined_stats rake task.

Part of this is historic. Before there was Ruby code, these reports were generated by a combination of Perl, R and some manual merging. The intermediate reports are useful, but strictly speaking no longer needed for the combined_stats report, as we can now get all the required info from the retrieval_statuses table (after adding columns for readers, html and pdf).
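A rough sketch of what reading directly from retrieval_statuses could look like once the readers/html/pdf columns exist (hypothetical data shapes and column mapping, not the actual rake task): pivot each work's per-source rows into a single summary row.

```ruby
# Hypothetical sketch: build combined_stats-style summary rows in one pass
# over retrieval_statuses rows, without intermediate per-source reports.
def combined_stats(rows)
  rows.group_by { |row| row[:pid] }.map do |pid, per_source|
    by_source = per_source.to_h { |row| [row[:source], row] }
    {
      pid: pid,
      mendeley_readers: by_source.dig("mendeley", :readers).to_i,
      counter_html: by_source.dig("counter", :html).to_i,
      counter_pdf: by_source.dig("counter", :pdf).to_i
    }
  end
end
```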

@zdennis
Contributor

zdennis commented Jun 30, 2015

The core work on this is almost complete (see PR #351). I'm nearing completion on updating the report:combined_stats and report:combined_private_stats rake tasks. That will conclude the move of all report generation to data from MySQL.

I haven't looked at the scheduling code yet, but I'm anticipating it will either not need to change at all or will change only minimally.
