Unusual spike in 23.02 literature data #2904

Closed
eric-czech opened this issue Mar 10, 2023 · 5 comments

Labels
Data Relates to Open Targets data team Literature Relates to EPMC literature pipeline Platform Issues related to Open Targets Platform

Comments

@eric-czech

When looking at the data in gs://open-targets-data-releases/23.02/output/etl/parquet/literature/matches, we noticed some coarse patterns over time in the included publications that look like large departures from older versions. For example, this shows the number of pmids by year in that dataset for 23.02 vs 22.06:

[Screenshot: number of pmids by year, 23.02 vs 22.06]

Has anything changed drastically in the underlying corpus that might make this expected? To be clear, I don't think this is definitely a problem. It does seem to merit a little digging, though. I'm not aware of any reason to expect a spike like that in the mid-1970s.

Code
from pyspark.sql import functions as F

cts_22_06 = (
    spark.read.parquet('gs://open-targets-data-releases/22.06/output/literature-etl/parquet/matches')
    .groupby('year')
    .agg(F.count_distinct('pmid').alias('n_pmids'))
    .toPandas()
)

ax = (
    cts_22_06
    .set_index('year')['n_pmids']
    .sort_index()
    .plot(figsize=(16, 4), style='.-')
)
ax.set_yscale('log')
ax.set_title('Num pmids by year in OT 22.06 literature data');

cts_23_02 = (
    spark.read.parquet('gs://open-targets-data-releases/23.02/output/etl/parquet/literature/matches')
    .groupby('year')
    .agg(F.count_distinct('pmid').alias('n_pmids'))
    .toPandas()
)

ax = (
    cts_23_02
    .set_index('year')['n_pmids']
    .sort_index()
    .plot(figsize=(16, 4), style='.-')
)
ax.set_yscale('log')
ax.set_title('Num pmids by year in OT 23.02 literature data');
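
Beyond eyeballing the two plots, one way to localize the spike is to rank years by how much their count changed between releases. The following is only a minimal sketch: `compare_year_counts` is a hypothetical helper, not part of any Open Targets tooling, and the toy numbers are invented for illustration rather than taken from either release.

```python
import math


def compare_year_counts(old, new, top=5):
    """Rank years by the absolute log10 ratio of new to old counts.

    `old` and `new` map year -> number of distinct pmids. Only years that
    appear in both mappings with positive counts are compared.
    """
    rows = []
    for year in sorted(set(old) & set(new)):
        if old[year] > 0 and new[year] > 0:
            ratio = new[year] / old[year]
            rows.append((year, old[year], new[year], ratio, math.log10(ratio)))
    # Largest multiplicative change first, in either direction.
    rows.sort(key=lambda r: abs(r[4]), reverse=True)
    return rows[:top]


# Toy counts, invented for illustration only.
old = {1974: 1000, 1975: 1100, 1976: 1200}
new = {1974: 1050, 1975: 5500, 1976: 1260}

for year, n_old, n_new, ratio, log_ratio in compare_year_counts(old, new, top=1):
    print(year, n_old, n_new, round(ratio, 2))
```

With the pandas frames above, the inputs could presumably be built with something like `dict(zip(cts_22_06['year'], cts_22_06['n_pmids']))`, after dropping rows where `year` is null.
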
@d0choa
Contributor

d0choa commented Mar 10, 2023

We might need to loop in @tsantosh7 and the rest of the EPMC team.

We know about the inclusion of preprints and patents, but I don't think they would be enough to justify such spikes.

@tsantosh7
Collaborator

Interesting!! I will inform our help-desk to see if they have come across this pattern.

@prashantuniyal02 prashantuniyal02 added Platform Issues related to Open Targets Platform Literature Relates to EPMC literature pipeline Data Relates to Open Targets data team labels Mar 13, 2023
@DSuveges

DSuveges commented Jun 8, 2023

A number of issues were identified in the most recent release of literature data. Over the past month, most of them have been addressed. However, this peculiar pattern in the number of publications over the years remained:
[Image: distribution]

@tsantosh7 from EuroPMC is reviewing the potential underlying causes. Although the pattern looks strong, the plot uses a log scale, so the discrepancy between the numbers of publications cannot be too big:

[Image]

The number of publications in the upcoming (23.06) release shows a healthy increase:

release   pub count (million)
22_06     13.4
23_02     12.62
23_06     14.71

(The drop of publications in the 23.02 release was due to failing publication processing jobs.)
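
To put those totals in relative terms, here is a small sketch of the release-to-release percentage change. It uses only the three figures reported above; `pct_change` is a hypothetical helper name.

```python
# Totals in millions of publications, as reported in this comment.
totals = {"22_06": 13.4, "23_02": 12.62, "23_06": 14.71}


def pct_change(before, after):
    """Percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


releases = list(totals)
changes = {
    f"{a}->{b}": round(pct_change(totals[a], totals[b]), 1)
    for a, b in zip(releases, releases[1:])
}
print(changes)
```

This works out to a drop of roughly 5.8% from 22.06 to 23.02, followed by a rise of roughly 16.6% from 23.02 to 23.06.
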

I'm closing this ticket for now, as I'm not sure anything is actually wrong. @tsantosh7 will comment if the EuroPMC database team figures something out.

@DSuveges DSuveges closed this as completed Jun 8, 2023
@eric-czech
Author

> The number of publications in the upcoming (23.06) release shows a healthy increase

Awesome, thanks for the update @DSuveges!

@DSuveges

DSuveges commented Jun 9, 2023

@eric-czech you're very welcome! Please keep letting us know in case you identify any concerning issues on the platform!
