Data visualisation #40

Open
TS404 opened this issue Apr 7, 2020 · 16 comments

Comments

@TS404
Collaborator

TS404 commented Apr 7, 2020

I'm a big fan of a good visualisation. I'm going to start thinking about some possible visualisations using [R] and Shinyapps. Any assistance and ideas welcomed on possible individual or combined visualisations for:

  • Topics, themes, findings
  • Citations
  • Authors
  • Changes over time
  • Others?

Some existing examples:

Obviously networks and multidimensional scaling projections could be useful. Also probably circos plots between themes?

@petermr
Owner

petermr commented Apr 8, 2020

Very keen on this.
If you can make this appeal to a citizen audience that would be great. Citations and Authors may be seen as niche academic subjects whereas themes in everyday discourse (respirator, social distance) are likely to engage people.

@TS404
Collaborator Author

TS404 commented Apr 14, 2020

It's possible to do simple static diagrams via [R] packages like igraph.

A major bonus, however, might be interactive/responsive graphics. I've tested out the networkD3 and chorddiag packages (both of which are based on D3.js run via the R2D3 package). See

These should at least be sufficient to adapt for grouping and displaying sets of articles based on coauthors, citations or topics.

Ideally, I would eventually love to use the bundle variant of a chord diagram (example1 or example2, tutorial).

Initial tests of network for some covid authors:

image

@petermr
Owner

petermr commented Apr 14, 2020 via email

@TS404
Collaborator Author

TS404 commented Apr 15, 2020

I've now added the visualisation code to wikiPackageTesting.R in the #Visualisations section

Data options:

  • Easiest: I should be able to read the HTML tables (e.g. full.dataTables.html). A matrix of article vs (author / topic / citing article) in any tabular format (csv, tsv, whatever) should be sufficient to import.
  • Most size-efficient: a table of edges (start, end, weight for each) and a table of nodes (name and properties for each) per MisLinks and MisNodes here.
  • Ideal: Store all info in wikidata, where I can then pull via SPARQL e.g. all publications with a main subject (P921) of covid-19 (Q84263196), SARS-CoV-2 (Q82069695), Coronavirus (Q290805) etc along with their other topics, authors, citations, etc. e.g:
# Scholarly articles (Q13442814) whose main subject (P921) is one of the listed topics
SELECT DISTINCT ?work ?workLabel ?pdate ?topic ?topicLabel ?author1 ?author1Label ?citing_work WHERE {
  VALUES ?topics { wd:Q82069695 wd:Q84263196 wd:Q81068910 }
  ?work wdt:P31 wd:Q13442814;
    wdt:P921 ?topics.
  OPTIONAL { ?work wdt:P577 ?pdate. }          # publication date
  OPTIONAL { ?work wdt:P921 ?topic. }          # every main subject of the work
  OPTIONAL { ?work wdt:P50  ?author1. }        # authors that have a Wikidata item
  OPTIONAL { ?citing_work wdt:P2860 ?work. }   # works that cite this one
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?pdate ?work ?workLabel ?topic ?topicLabel ?author1 ?author1Label ?citing_work
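
For the "most size-efficient" option, here is a rough sketch of how an article-vs-author listing could be turned into an edge table (start, end, weight) and a node table. It is written in Python purely for illustration; the input dictionary, the function name `build_edge_and_node_tables` and the `papers` node property are all assumptions, not anything that exists in the repository.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: each article mapped to its list of authors.
articles = {
    "article1": ["Author A", "Author B", "Author C"],
    "article2": ["Author A", "Author B"],
    "article3": ["Author C", "Author D"],
}

def build_edge_and_node_tables(article_authors):
    """Return (nodes, edges) for a co-author network.

    nodes: list of dicts holding an author name and a paper count.
    edges: list of (start, end, weight) tuples, where weight is the
           number of articles the two authors share.
    """
    paper_counts = Counter()
    pair_counts = Counter()
    for authors in article_authors.values():
        unique = sorted(set(authors))
        paper_counts.update(unique)
        # Each unordered pair of co-authors on one article adds 1 to the weight.
        pair_counts.update(combinations(unique, 2))
    nodes = [{"name": n, "papers": c} for n, c in sorted(paper_counts.items())]
    edges = [(a, b, w) for (a, b), w in sorted(pair_counts.items())]
    return nodes, edges

nodes, edges = build_edge_and_node_tables(articles)
```

The same shape should work for topics or citing articles in place of authors, since only the pairing within each article matters.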

@petermr
Owner

petermr commented Apr 15, 2020

Thanks @TS404

I've now added the visualisation code to wikiPackageTesting.R in the #Visualisations section

Well done. Can you give some screen shots?

Data options:

Easiest: I should be able to read the HTML tables (e.g. full.dataTables.html). A matrix of article vs (author / topic / citing article) in any tabular format (csv, tsv, whatever) should be sufficient to import.

That should be possible. Note there are usually many entries in a facet-cell. If you are just looking at bibliographic data we may manage things. There are multiple authors per article. How do we manage that?

And what is a "citing" article? We don't have, and won't have, a citation graph.

Most size-efficient: a table of edges (start, end, weight for each) and a table of nodes (name and properties for each) per MisLinks and MisNodes here.

I don't understand where these edges come from, or what a MisLink or MisNode is.

Ideal: Store all info in wikidata, where I can then pull via SPARQL e.g. all publications with a main subject (P921) of covid-19 (Q84263196), SARS-CoV-2 (Q82069695), Coronavirus (Q290805) etc along with their other topics, authors, citations, etc. e.g:

That would be great. Presumably a question of getting this accepted by Wikidata-ns, but DanielM put millions of bibliographic references into Wikidata.

(HEY! we should be adding QIDs for publications. That would be great!)

Note also that I have not got pointers back to Biorxiv working properly.

Are they putting preprints into Wikidata?

@deadlyvices
Collaborator

When I worked at AZ the New Opportunities group did a visualization where they examined the author list, and then ranked the authors by first author, last author and secondary contributor count in papers. It was a triangular plot, as you'd use for a three component phase diagram. I think they call it a 'ternary plot':
image
So we could do that for a particular topic search and then get to identify the key opinion leaders.

@petermr
Owner

petermr commented Apr 15, 2020

Note that we haven't got a simple approach to bibliography. We can do JATS from EPMC. JATS is not always much fun as there can be authorstrings (i.e. all authors run together) and disambiguation problems (no ORCIDs).
What is the driver for this? I suspect academics will use it but who else?

petermr closed this as completed Apr 15, 2020
petermr reopened this Apr 15, 2020
@deadlyvices
Collaborator

I think it's useful to know who is helping to lead an area of investigation. I've been playing around and have been able to generate the percentages of publications for each author as first author, last author and other. Spotfire doesn't do ternary plots, so I generated a parallel coordinate plot:
image
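
A minimal sketch of how the first/last/other percentages could be computed from ordered author lists, in Python for illustration only. The function name `author_role_shares` and the sample data are hypothetical, and treating the sole author of a single-author paper as "first" is an assumption. Each author's three shares sum to 1, which is the form a ternary or parallel coordinate plot needs.

```python
from collections import defaultdict

def author_role_shares(author_lists):
    """Return, per author, the fraction of papers as first, last or other author."""
    counts = defaultdict(lambda: {"first": 0, "last": 0, "other": 0})
    for authors in author_lists:
        for position, name in enumerate(authors):
            if position == 0:
                role = "first"  # assumption: a sole author counts as first
            elif position == len(authors) - 1:
                role = "last"
            else:
                role = "other"
            counts[name][role] += 1
    shares = {}
    for name, by_role in counts.items():
        total = sum(by_role.values())
        shares[name] = {role: n / total for role, n in by_role.items()}
    return shares

# Hypothetical author lists, in byline order.
papers = [["A", "B", "C"], ["B", "A"], ["A"]]
shares = author_role_shares(papers)
```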

@deadlyvices
Collaborator

So it's possible to generate the input.

@TS404
Collaborator Author

TS404 commented Apr 16, 2020

Data format and storage

The facet cell listing multiple items is fine (essentially I'll aim to turn it into a nested list in [R]). Similarly, ideally there should be a column listing all the authors of a publication (disambiguating to QIDs will be the greatest challenge), but plaintext strings are fine as a backup.
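
As a rough illustration of the nested-list idea (in Python here rather than [R]), a multi-item cell could be split like this. The `;` separator, the column names and the helper `split_cell` are assumptions; the actual delimiter used in full.dataTables.html may differ.

```python
def split_cell(cell, sep=";"):
    """Split one multi-valued table cell into a clean list of items."""
    return [item.strip() for item in cell.split(sep) if item.strip()]

# Hypothetical rows, one per article, with multi-valued cells.
rows = [
    {"article": "article1",
     "authors": "Author A; Author B",
     "topics": "respirator; social distance"},
    {"article": "article2", "authors": "Author C", "topics": ""},
]

# Nested structure: article -> column -> list of items.
nested = {
    row["article"]: {
        "authors": split_cell(row["authors"]),
        "topics": split_cell(row["topics"]),
    }
    for row in rows
}
```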

I've checked over at Wikidata's Wikiproject COVID-19 and it seems there are already a few hundred preprints listed in wikidata, so it shouldn't be too controversial to add all the covid-relevant ones (and eventually others).

Visualisations

Visualisations focusing on topics and publications have the clearest immediate public value, showing where the main research threads are heading.

Visualisations focusing on authors can demonstrate which authors are collaborative (and which are in silos) and in what roles, and can help researchers identify people to watch or contact for collaboration. I like the idea of separating first/middle/last if possible (like this query).

I've done a bit more stress-testing of the code for networks of different sizes (e.g. see Anthony Fauci's co-author network below). Next step, I'll start tweaking it to make the nodes=publications and the links=topic_similarity.

image
Anthony Fauci's co-author network: larger circles and thicker lines indicate people he has co-authored more with. For interactive version, see WDNetworkVis.nb.html
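
For the nodes=publications, links=topic_similarity variant, one possible sketch is below (Python, for illustration). The thread does not say how topic similarity would be measured; Jaccard overlap of topic sets is an assumption swapped in here, and the function names and sample data are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two topic collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_links(pub_topics, threshold=0.0):
    """Return (pub1, pub2, score) for every pair scoring above threshold."""
    links = []
    for (p1, t1), (p2, t2) in combinations(sorted(pub_topics.items()), 2):
        score = jaccard(t1, t2)
        if score > threshold:
            links.append((p1, p2, score))
    return links

# Hypothetical publications and their topics.
pubs = {
    "p1": ["covid-19", "respirator"],
    "p2": ["covid-19", "social distance"],
    "p3": ["vaccine"],
}
links = similarity_links(pubs)
```

Raising `threshold` would thin the graph out, which may matter because comparing every pair grows quadratically with the number of publications.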

@TS404
Collaborator Author

TS404 commented Apr 19, 2020

Ok, so I've managed to get the concomitant co-topic graph working reasonably robustly!

In order to make it interactive, I've built a simple shiny app. It works fine locally, but the version on shinyapps.io seems to still be having problems (I've left a query on Stack Overflow).

Once I've managed to get it properly working online, next steps for the visualisation:

  1. Take the topics graph from biorxiv700/full.dataTables.html as the input rather than only wikidata
  2. Present chord diagram as well
  3. Improve the click actions
    a. select node to list publications on that topic?
    b. select multiple nodes to subset?
    c. easy navigation to wikidata/publication/scholia
    d. loading time indicator? (larger wikidata queries can take >30)

image
Local instance of TS404/topicnetwork.
image
Same data visualised as chord diagram (not yet included in TS404/topicnetwork).
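
Chord diagrams are generally drawn from a square matrix of pairwise counts, so the co-topic data could be reshaped roughly as follows. This is a Python sketch under assumptions: the function name `cooccurrence_matrix` and the sample data are hypothetical, and it is not the code used in TS404/topicnetwork.

```python
from itertools import combinations

def cooccurrence_matrix(pub_topics):
    """Return (labels, matrix), where matrix[i][j] counts the publications
    tagged with both labels[i] and labels[j]. The diagonal stays 0."""
    labels = sorted({t for topics in pub_topics.values() for t in topics})
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for topics in pub_topics.values():
        for a, b in combinations(sorted(set(topics)), 2):
            # Keep the matrix symmetric: co-occurrence has no direction.
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return labels, matrix

# Hypothetical publications and their topics.
pubs = {
    "p1": ["covid-19", "respirator"],
    "p2": ["covid-19", "respirator", "vaccine"],
    "p3": ["vaccine"],
}
labels, matrix = cooccurrence_matrix(pubs)
```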

@petermr
Owner

petermr commented Apr 19, 2020 via email

@petermr
Owner

petermr commented Apr 19, 2020 via email

@TS404
Collaborator Author

TS404 commented Apr 20, 2020

The CJ Yetman comment fixed it! Try https://ts404.shinyapps.io/topicnetwork now! I'll have to test why the fix works to avoid re-introducing the problem later, but v. useful for now.

@TS404
Collaborator Author

TS404 commented Apr 27, 2020

Updates to https://ts404.shinyapps.io/topicnetwork now enable it to report back the list of publications that are about a set of subjects. The subjects are currently picked via checkboxes, but eventually I'd like the selection to be based on clicking the nodes.
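
The lookup behind that list could be sketched as below (Python, for illustration only; this is not the app's code). Whether the app requires a publication to match all selected subjects or just any of them is not stated, so the `match_all` switch is an assumption, as are the function name and sample data.

```python
def publications_about(pub_topics, selected, match_all=True):
    """Return publications tagged with the selected subjects.

    match_all=True keeps publications covering every selected subject;
    match_all=False keeps publications covering at least one of them.
    """
    wanted = set(selected)
    if not wanted:
        return []
    hits = []
    for pub, topics in sorted(pub_topics.items()):
        have = set(topics)
        matched = wanted <= have if match_all else bool(wanted & have)
        if matched:
            hits.append(pub)
    return hits

# Hypothetical publications and their subjects.
pubs = {
    "p1": ["covid-19", "respirator"],
    "p2": ["covid-19", "respirator", "vaccine"],
    "p3": ["vaccine"],
}
both = publications_about(pubs, ["covid-19", "respirator"])
either = publications_about(pubs, ["respirator", "vaccine"], match_all=False)
```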

@petermr
Owner

petermr commented Apr 27, 2020 via email
