Face Mask knowledge extraction #30

Open
petermr opened this issue Mar 30, 2020 · 2 comments


@petermr

petermr commented Mar 30, 2020

Face masks

@axiomsofchoice has suggested we take "face mask" (do they work?) as our first project; he is also physically making 1000 of them in Makespace. So here are some thoughts on the workflow and tasks.

search/scrape

We should concentrate on freely readable legal sources. This includes anything pointed to by Unpaywall, readable without login, but NOT SciHub or icanhazpdf. Here's a rough list in rough order:

### EuropePMC
The first place to go and default for getpapers. XML and PDF. Nothing needed.
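
For anyone who wants to see what such a query looks like outside getpapers, here is a minimal sketch that only builds a search URL for the Europe PMC REST service; it makes no request. The helper name `epmc_search_url` and the choice to restrict to open access with `OPEN_ACCESS:y` are my assumptions, not necessarily what getpapers does internally.

```python
from urllib.parse import urlencode

# Base URL of the Europe PMC REST search service.
EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def epmc_search_url(query, page_size=25, open_access_only=True):
    """Build a Europe PMC search URL (hypothetical helper, no request is sent)."""
    if open_access_only:
        # Assumption: restrict results to open-access records.
        query = f"({query}) AND OPEN_ACCESS:y"
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EPMC_SEARCH}?{urlencode(params)}"


url = epmc_search_url('"face mask"')
print(url)
```

The resulting URL can be pasted into a browser or fetched with any HTTP client to inspect the JSON hit list.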

biorxiv and medrxiv

Fulltext in HTML but no API. @petermr has built a prototype scraper, but there's a cascade of lazyload and landing pages to navigate. HTML -> XML and PDF.
Scraper needs updating.

Royal Society

Fulltext in PDF.
Needs a lazy scraper.

Theses

Fulltext in PDF. Very valuable as they are additional. BUT multiple sites with arcane landing pages and logins.
Aggregated by CORE (UK, may need login - not happy about that), HAL (FR), DARE (NL).
Andy Jackson may have better knowledge.

Redalyc

Mexico, but usually in EN. May need lazy loader.

## Metadata

Metadata from all these sources should be converted to JATS. Most HTML <meta> can be JATS-ised - I have written equivalencers.
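
To make the idea concrete, here is a rough sketch of mapping HTML `<meta>` tags onto JATS element names. It is not the existing equivalencer code: the mapping table, the class `MetaCollector` and the function `meta_to_jats` are hypothetical, the sample values are placeholders, and the output is a flat `<front>` element rather than schema-valid JATS.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical mapping: Highwire-style meta names -> JATS element names.
META_TO_JATS = {
    "citation_title": "article-title",
    "citation_journal_title": "journal-title",
    "citation_doi": "article-id",
    "citation_author": "string-name",
}


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> tags we know how to map."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if a.get("name") in META_TO_JATS and a.get("content"):
                self.pairs.append((a["name"], a["content"]))


def meta_to_jats(html):
    """Return a flat <front> element holding the mapped metadata."""
    collector = MetaCollector()
    collector.feed(html)
    front = ET.Element("front")
    for name, content in collector.pairs:
        el = ET.SubElement(front, META_TO_JATS[name])
        el.text = content
        if name == "citation_doi":
            el.set("pub-id-type", "doi")
    return front


# Placeholder landing-page fragment (made-up values).
sample = """<html><head>
<meta name="citation_title" content="Example title">
<meta name="citation_author" content="A. Author">
<meta name="citation_author" content="B. Author">
<meta name="citation_doi" content="10.0000/example">
</head></html>"""

front = meta_to_jats(sample)
print(ET.tostring(front, encoding="unicode"))
```

A real converter would nest these under `article-meta`, `title-group` and `contrib-group`; the flat form just shows the name-to-name equivalence.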

TO DO coordinate the metadata extraction.

Body

EuropePMC provides XML which is already catered for.

PDF

PDF needs converting to text, ideally HTML. Full conversion includes formatting, styles and weights, which are important in scientific documents. Most pdf-to-html tools produce flat ASCII text, which is highly usable but not perfect. Many tools do not recognize sections.

  • TO DO check PDF conversion for RoyalSoc
  • TO DO check PDF for Redalyc?
  • TO DO check PDF for biorxiv

Html

biorxiv and medrxiv need converting to XHTML (I may already have done this).

Dictionaries

TO DO create relevant dictionaries for face masks. @axiomsofchoice to create some wordlists

Indexing

  • TO DO Will SOLR index XML or do we need to flatten?
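
In case we do need to flatten, here is a minimal sketch of what I mean: pull a few named fields out of the XML as plain strings, giving a flat dict that could be serialised (e.g. as JSON) for an indexer. The function name `flatten`, the field names and the sample document are all made up for illustration; nothing here talks to SOLR.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text, fields):
    """Map field name -> plain text for each ElementTree path in `fields`."""
    root = ET.fromstring(xml_text)
    doc = {}
    for field, path in fields.items():
        # Join all text fragments under each match and normalise whitespace.
        values = [" ".join(" ".join(el.itertext()).split())
                  for el in root.findall(path)]
        if values:
            doc[field] = values if len(values) > 1 else values[0]
    return doc


# Made-up JATS-like fragment.
sample = (
    "<article><front><article-title>Example <italic>title</italic>"
    "</article-title></front><body>"
    "<sec><title>Methods</title><p>Some text.</p></sec>"
    "<sec><title>Results</title><p>More text.</p></sec>"
    "</body></article>"
)

doc = flatten(sample, {
    "title": ".//article-title",
    "section_title": ".//sec/title",
    "body": ".//body",
})
print(doc)
```

Joining fragments with a space is crude (it would split something like a subscripted formula), but it keeps words from running together across element boundaries.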
@anjackson

I'm also thinking of that request that came in looking for evidence of mask effectiveness for typical medical procedures, e.g. should we try to build up the co-occurrence matrix for these two dictionaries?

  1. Medical procedures
  2. Terms relating to surgical masks and safety ('surgical mask', 'n95', 'droplets', 'viruses', 'bacteria' etc.)
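
A co-occurrence matrix over the two dictionaries above could be sketched roughly like this: for every document, record each (procedure, mask term) pair where both appear, and count documents per pair. The function name `cooccurrence`, the procedure terms and the three toy sentences are made up for illustration, and matching is naive case-insensitive substring search rather than proper tokenisation.

```python
from collections import Counter
from itertools import product


def cooccurrence(docs, dict_a, dict_b):
    """Count documents in which a term from dict_a and one from dict_b both occur."""
    counts = Counter()
    for text in docs:
        lower = text.lower()
        hits_a = [t for t in dict_a if t in lower]
        hits_b = [t for t in dict_b if t in lower]
        # Every (a, b) pair present in this document gets one count.
        counts.update(product(hits_a, hits_b))
    return counts


# Toy dictionaries and sentences, for illustration only.
procedures = ["intubation", "dental extraction"]
mask_terms = ["surgical mask", "n95", "droplets"]
docs = [
    "N95 respirators were worn during intubation.",
    "Droplets were measured during intubation with a surgical mask.",
    "Dental extraction without any mask.",
]

counts = cooccurrence(docs, procedures, mask_terms)
for (proc, term), n in sorted(counts.items()):
    print(proc, term, n)
```

Pairs that never co-occur simply have no entry, so the matrix stays sparse.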

@petermr

petermr commented Apr 1, 2020 via email
