Face Mask knowledge extraction #30

petermr · 2020-03-30T12:10:57Z

Face masks

@axiomsofchoice has suggested we take "face masks: do they work?" as our first project; he is also physically making 1000 of them in Makespace. So here are some thoughts on the workflow and tasks.

search/scrape

We should concentrate on freely readable, legal sources. This includes anything pointed to by Unpaywall, readable without login, but NOT SciHub or icanhazpdf. Here's a rough list in rough order:

EuropePMC

The first place to go, and the default for getpapers. Provides XML and PDF. No extra scraper needed.
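As a rough illustration of what a EuropePMC query looks like underneath, here is a minimal sketch that only builds a search URL for the public Europe PMC REST endpoint and does not send any request. The function name `build_search_url` and its defaults are assumptions made for this example; it is not how getpapers itself is implemented.

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint.
EUROPEPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(term, open_access_only=True, page_size=25):
    """Build a Europe PMC search URL for `term` without sending a request.

    `build_search_url` is a hypothetical helper for this sketch.
    """
    query = f'"{term}"'
    if open_access_only:
        # OPEN_ACCESS:y restricts results to the open-access subset.
        query += " AND OPEN_ACCESS:y"
    params = {
        "query": query,
        "format": "json",
        "resultType": "lite",
        "pageSize": page_size,
    }
    return f"{EUROPEPMC_SEARCH}?{urlencode(params)}"


url = build_search_url("face mask")
print(url)
```

Restricting to the open-access subset is one possible way to stay within the "freely readable legal sources" rule above; whether it is the right filter for this project is a judgment call.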

biorxiv and medrxiv

Fulltext in HTML but no API. @petermr has built a prototype scraper, but there's a cascade of lazy loading and landing pages to navigate. HTML -> XML and PDF.
Scraper needs updating.

Royal Society

Fulltext in PDF.
Needs a scraper that can handle lazy loading.

Theses

Fulltext in PDF. Very valuable as they are additional to the journal literature. BUT they are spread over multiple sites with arcane landing pages and logins.
Aggregated by CORE (UK; may need login, which I'm not happy about), HAL (FR) and DARE (NL).
Andy Jackson may have better knowledge.

Redalyc

Mexico, but usually in EN. May need lazy loader.

Metadata

All sources should be converted to JATS. Most HTML <meta> tags can be JATS-ised; I have written equivalencers.
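To make the idea of JATS-ising HTML `<meta>` concrete, here is a minimal sketch using only the Python standard library. It assumes the landing page carries Highwire-style `citation_*` meta tags, and the three mappings shown are illustrative choices for this example, not the existing equivalencers. The sample HTML, including the DOI, is a placeholder.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if d.get("name") and d.get("content"):
                self.pairs.append((d["name"].lower(), d["content"]))


def meta_to_jats(html):
    """Map a few citation_* meta tags onto a skeletal JATS <article>.

    The mapping is an assumption for this sketch; elements are appended in
    the order the meta tags appear, not in JATS schema order.
    """
    collector = MetaCollector()
    collector.feed(html)
    article = ET.Element("article")
    front = ET.SubElement(article, "front")
    meta = ET.SubElement(front, "article-meta")
    contribs = None
    for name, content in collector.pairs:
        if name == "citation_title":
            group = ET.SubElement(meta, "title-group")
            ET.SubElement(group, "article-title").text = content
        elif name == "citation_doi":
            el = ET.SubElement(meta, "article-id", {"pub-id-type": "doi"})
            el.text = content
        elif name == "citation_author":
            if contribs is None:
                contribs = ET.SubElement(meta, "contrib-group")
            contrib = ET.SubElement(contribs, "contrib", {"contrib-type": "author"})
            ET.SubElement(contrib, "string-name").text = content
    return article


# Placeholder input, not a real article.
sample = """<html><head>
<meta name="citation_title" content="An example preprint title">
<meta name="citation_author" content="A. Author">
<meta name="citation_author" content="B. Author">
<meta name="citation_doi" content="10.1101/0000.00.00.000000">
</head><body></body></html>"""
jats = meta_to_jats(sample)
print(ET.tostring(jats, encoding="unicode"))
```

A fuller version would need to reorder the children to satisfy the JATS schema and cover more fields (dates, journal or server name, licence).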

TO DO coordinate the metadata extraction.

Body

EuropePMC provides XML, which is already catered for.

PDF

PDF needs converting to text, ideally HTML. Full conversion preserves formatting, styles and font weights, which are important in scientific documents. Most PDF-to-HTML tools produce flat ASCII text, which is highly usable but not perfect. Many tools do not recognize sections.

TO DO check PDF conversion for RoyalSoc
TO DO check PDF for Redalyc?
TO DO check PDF for biorxiv
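Since many converters lose section structure, one fallback is to recover sections from the flat text afterwards. The following is a minimal heuristic sketch, assuming headings sit on their own line and come from a small fixed vocabulary; the heading list and the function name `split_sections` are assumptions for this example, and real papers will need a richer rule set.

```python
import re

# Assumed vocabulary of common section headings; real papers vary widely.
HEADING_RE = re.compile(
    r"^\s*(?:\d+\.?\s+)?"
    r"(abstract|introduction|background|methods|materials and methods|"
    r"results|discussion|conclusions?|references)\s*$",
    re.IGNORECASE,
)


def split_sections(text):
    """Split flat text into (heading, body) pairs.

    Text before the first recognised heading is stored under "front".
    """
    sections = []
    heading, lines = "front", []
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # Close the section collected so far and start a new one.
            sections.append((heading, "\n".join(lines).strip()))
            heading, lines = match.group(1).lower(), []
        else:
            lines.append(line)
    sections.append((heading, "\n".join(lines).strip()))
    return sections


# Placeholder text standing in for converter output.
flat = """Example Title
Abstract
Masks are studied.
1. Introduction
Some background text.
Methods
We did things.
"""
sections = split_sections(flat)
for name, body in sections:
    print(name, "->", body)
```

This only works when the converter keeps headings on separate lines; where font weight or size survives conversion, that would be a more reliable signal than a word list.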

Html

biorxiv and medrxiv need converting to XHTML (I may already have done this).

Dictionaries

TO DO create relevant dictionaries for face masks. @axiomsofchoice to create some wordlists
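As a starting point for turning a wordlist into a dictionary, here is a minimal sketch that writes one XML entry per unique term. The `<dictionary>`/`<entry>` layout and the `term`/`name` attributes are assumptions for this example and may need adjusting to whatever format the search tools actually expect; the sample terms are taken from this thread.

```python
import xml.etree.ElementTree as ET


def wordlist_to_dictionary(title, words):
    """Build a <dictionary> element with one <entry> per unique term.

    Element and attribute names are assumptions made for this sketch.
    """
    root = ET.Element("dictionary", {"title": title})
    seen = set()
    for word in words:
        # Normalise case and whitespace so near-duplicates collapse.
        term = " ".join(word.lower().split())
        if term and term not in seen:
            seen.add(term)
            ET.SubElement(root, "entry", {"term": term, "name": word.strip()})
    return root


masks = wordlist_to_dictionary(
    "facemask",
    ["surgical mask", "N95", "droplets", "viruses", "bacteria", "n95 "],
)
print(ET.tostring(masks, encoding="unicode"))
```

The lower-cased `term` is what would be matched against text, while `name` keeps the spelling as supplied in the wordlist.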

Indexing

TO DO Will Solr index the XML directly, or do we need to flatten it?
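If flattening turns out to be needed, it could look roughly like the sketch below, which reduces a JATS-like article to one flat record of plain-text fields. The field names (`title`, `abstract`, `body`) and the function `flatten_article` are assumptions for this example, not an existing schema, and the sample XML is a placeholder.

```python
import json
import xml.etree.ElementTree as ET


def flatten_article(xml_text):
    """Reduce a JATS-like article to a flat dict of plain-text fields."""
    root = ET.fromstring(xml_text)

    def text_of(path):
        el = root.find(path)
        if el is None:
            return ""
        # Join all text nodes with spaces, then collapse whitespace.
        # Caveat: inline markup inside a word (e.g. subscripts) gets split.
        return " ".join(" ".join(el.itertext()).split())

    return {
        "title": text_of(".//article-title"),
        "abstract": text_of(".//abstract"),
        "body": text_of(".//body"),
    }


# Placeholder article, not real content.
sample = """<article><front><article-meta>
<title-group><article-title>Example <italic>title</italic></article-title></title-group>
<abstract><p>Short abstract.</p></abstract>
</article-meta></front>
<body><sec><title>Methods</title><p>We tested masks.</p></sec></body></article>"""
doc = flatten_article(sample)
print(json.dumps(doc, indent=2))
```

Flattening throws away section boundaries, so if section-level search matters it may be better to emit one record per `<sec>` instead of one per article.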


anjackson · 2020-04-01T21:06:37Z

I'm also thinking of that request that came in looking for evidence of mask effectiveness for typical medical procedures. E.g. should we try to build up the co-occurrence matrix for these two dictionaries?

1. Medical procedures <https://en.wikipedia.org/wiki/Medical_procedure>
2. Terms relating to surgical masks and safety ('surgical mask', 'n95', 'droplets', 'viruses', 'bacteria' etc.)
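A co-occurrence matrix for two dictionaries can be sketched as below: for each document, find which terms from each dictionary occur and count every pair. This is a minimal version using naive lower-cased substring matching; the procedure terms and sample sentences are made up for illustration, and the function name `cooccurrence` is an assumption for this example.

```python
from collections import Counter
from itertools import product


def cooccurrence(docs, dict_a, dict_b):
    """Count documents in which a term from dict_a and one from dict_b both occur."""
    counts = Counter()
    for text in docs:
        lowered = text.lower()
        # Naive substring matching; a real version should respect word boundaries.
        found_a = [t for t in dict_a if t in lowered]
        found_b = [t for t in dict_b if t in lowered]
        for pair in product(found_a, found_b):
            counts[pair] += 1
    return counts


# Illustrative terms and sentences, not real data.
procedures = ["intubation", "surgery"]
mask_terms = ["surgical mask", "n95", "droplets"]
docs = [
    "Staff wore an N95 during intubation to limit droplets.",
    "A surgical mask was worn throughout surgery.",
    "Droplets were measured in the lab.",
]
matrix = cooccurrence(docs, procedures, mask_terms)
for (proc, term), n in sorted(matrix.items()):
    print(proc, term, n)
```

Counting per document is the coarsest option; counting per sentence or per section would give tighter evidence that a mask term really relates to a given procedure.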

petermr · 2020-04-01T21:13:28Z

That's a lovely Wikipedia page you've found. I'll show how to make the dictionary. The surgical mask dictionary is less clear. We can probably hoover up some terms from Wikipedia pages.

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
