# Face masks
@axiomsofchoice has suggested we take "face masks" - do they work? - as our first project; he is also physically making them (1000) in Makespace. So here are my thoughts on the workflow and tasks.
## Search/scrape
We should concentrate on freely readable, legal sources: this includes anything pointed to by Unpaywall that is readable without login, but NOT SciHub or icanhazpdf. Here's a rough list in rough priority order:
### EuropePMC
The first place to go, and the default for `getpapers`. Provides XML and PDF. Nothing extra needed.
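For reference, EuropePMC also exposes the REST search API that `getpapers` wraps; a minimal sketch of building a query URL (the query string here is illustrative):

```python
from urllib.parse import urlencode

# Base of the EuropePMC REST search service.
EUPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def eupmc_search_url(query, page_size=25):
    """Build a EuropePMC search URL returning JSON results.

    The query follows EuropePMC's search syntax, e.g.
    '"face mask" AND OPEN_ACCESS:y'.
    """
    params = {"query": query, "format": "json", "pageSize": page_size}
    return EUPMC_SEARCH + "?" + urlencode(params)

url = eupmc_search_url('"face mask" AND OPEN_ACCESS:y')
```

Equivalently, something like `getpapers -q '"face mask"' -o masks -x -p` fetches XML and PDF in one go.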
### biorxiv and medrxiv
Fulltext is in HTML but there is no API. @petermr has built a prototype scraper, but there is a cascade of lazyload and landing pages to navigate. HTML -> XML and PDF. The scraper needs updating.
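A minimal sketch of the landing-page step, assuming (as observed by inspection, and subject to change) that biorxiv/medrxiv landing pages link the full text at a `.full` URL and the PDF at `.full.pdf`; the real pages also lazy-load, so a headless browser may still be needed:

```python
import re
from urllib.parse import urljoin

# Landing pages link full text at ".full" and the PDF at ".full.pdf"
# (pattern observed by inspection; may change without notice).
FULLTEXT_RE = re.compile(r'href="([^"]+\.full(?:\.pdf)?)"')

def fulltext_links(landing_html, base_url):
    """Extract candidate full-text links from a landing page's HTML."""
    return [urljoin(base_url, m) for m in FULLTEXT_RE.findall(landing_html)]
```

The DOI path in any example is, of course, hypothetical.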
### Royal Society
Fulltext in PDF. Needs a lazy scraper.
### Theses
Fulltext in PDF. Very valuable as they are additional material, BUT spread across multiple sites with arcane landing pages and logins.
Aggregated by CORE (UK; may need login - not happy about that), HAL (FR), and DARE (NL).
Andy Jackson may have better knowledge.
### Redalyc
Based in Mexico, but content is usually in EN. May need a lazy loader.
## Metadata
The systems should all be converted to create JATS. Most HTML `<meta>` tags can be JATS-ised - I have written equivalencers.
**TO DO** coordinate the metadata extraction.
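As an illustration of the equivalencing, a sketch mapping a few common Highwire-style `<meta>` names to JATS elements (the subset is tiny, and placing `<contrib>` directly under `<article-meta>` rather than in a `<contrib-group>` is a simplification):

```python
import xml.etree.ElementTree as ET

def metas_to_jats(metas):
    """Build a minimal JATS <article-meta> from (name, content) pairs
    taken from HTML <meta name="..." content="..."> tags."""
    am = ET.Element("article-meta")
    for name, content in metas:
        if name == "citation_doi":
            el = ET.SubElement(am, "article-id", {"pub-id-type": "doi"})
            el.text = content
        elif name == "citation_title":
            tg = ET.SubElement(am, "title-group")
            ET.SubElement(tg, "article-title").text = content
        elif name == "citation_author":
            contrib = ET.SubElement(am, "contrib", {"contrib-type": "author"})
            ET.SubElement(contrib, "string-name").text = content
    return am
```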
## Body
EuropePMC provides XML, which is already catered for.
### PDF
PDF needs converting to text, ideally HTML. Full conversion includes formatting, styles, and font weights, which are important in scientific documents. Most pdf-to-html tools produce flat ASCII text, which is highly usable but not perfect. Many tools do not recognize sections.
**TO DO** check PDF conversion for RoyalSoc
**TO DO** check PDF for Redalyc?
**TO DO** check PDF for biorxiv
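Since most converters emit flat text without section structure, a heuristic fallback is to split on conventional heading words; a sketch (the heading list is illustrative, not exhaustive):

```python
import re

# Common top-level section headings in scientific papers; flat output
# from pdf-to-text tools loses styling, so we fall back on heading words.
SECTION_RE = re.compile(
    r"^\s*(\d+\.?\s+)?(abstract|introduction|methods?|materials and methods|"
    r"results|discussion|conclusions?|references)\s*$",
    re.IGNORECASE,
)

def split_sections(flat_text):
    """Split flat text into {heading: body} using heuristic headings."""
    sections, current = {"front": []}, "front"
    for line in flat_text.splitlines():
        m = SECTION_RE.match(line)
        if m:
            current = m.group(2).lower()
            sections[current] = []
        else:
            sections[current].append(line)
    return {k: "\n".join(v).strip() for k, v in sections.items()}
```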
### HTML
biorxiv and medrxiv need converting to XHTML (I may already have done this).
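A minimal stdlib sketch of the HTML-to-XHTML step, self-closing void elements so the output is well-formed XML (it ignores comments, doctypes, and attribute-escaping corner cases):

```python
from html.parser import HTMLParser

# HTML "void" elements that must be self-closed in XHTML.
VOID = {"br", "hr", "img", "meta", "link", "input"}

class XHTMLWriter(HTMLParser):
    """Re-serialize tag soup as well-formed XHTML (minimal sketch)."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
    def handle_starttag(self, tag, attrs):
        a = "".join(f' {k}="{v}"' for k, v in attrs if v is not None)
        self.out.append(f"<{tag}{a}/>" if tag in VOID else f"<{tag}{a}>")
    def handle_endtag(self, tag):
        if tag not in VOID:
            self.out.append(f"</{tag}>")
    def handle_data(self, data):
        # Re-escape characters that are special in XML.
        self.out.append(data.replace("&", "&amp;").replace("<", "&lt;"))

def to_xhtml(html):
    w = XHTMLWriter()
    w.feed(html)
    return "".join(w.out)
```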
## Dictionaries
**TO DO** create relevant dictionaries for face masks. @axiomsofchoice to create some wordlists.
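A wordlist can be turned into dictionary XML mechanically; a sketch producing roughly the shape ContentMine dictionaries take (check the exact schema against the ami tooling before relying on it):

```python
import xml.etree.ElementTree as ET

def wordlist_to_dictionary(title, terms):
    """Turn a plain wordlist into a ContentMine-style XML dictionary.
    The element/attribute names are approximate, not a fixed schema."""
    root = ET.Element("dictionary", {"title": title})
    for term in terms:
        ET.SubElement(root, "entry", {"term": term, "name": term})
    return ET.tostring(root, encoding="unicode")

xml = wordlist_to_dictionary("facemasks", ["surgical mask", "n95", "droplets"])
```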
## Indexing
**TO DO** Will SOLR index XML or do we need to flatten?
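One answer to the flattening question: extract the fields we want from the JATS and index them as a flat document, e.g. by POSTing JSON to Solr's `update/json/docs` handler. A sketch of the flattening (field names are illustrative, not a fixed schema; markup nested inside paragraphs is ignored):

```python
import xml.etree.ElementTree as ET

def flatten_for_solr(jats_xml, doc_id):
    """Flatten a JATS article into a flat dict Solr can ingest."""
    root = ET.fromstring(jats_xml)
    title = root.findtext(".//article-title", default="")
    # Naive: takes only the direct text of each <p>, dropping inline markup.
    paras = [el.text.strip() for el in root.iter("p") if el.text and el.text.strip()]
    return {"id": doc_id, "title_t": title, "body_t": " ".join(paras)}
```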
I'm also thinking of that request that came in looking for evidence of mask effectiveness for typical medical procedures. e.g. should we try to build up the co-occurrence matrix for these two dictionaries?

1. Medical procedures <https://en.wikipedia.org/wiki/Medical_procedure>
2. Terms relating to surgical masks and safety ('surgical mask', 'n95', 'droplets', 'viruses', 'bacteria' etc.)

That's a lovely Wikipedia page you've found. I'll show how to make the dictionary. The surgical mask is less clear. We can probably hoover some terms from Wikipedia pages.
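The co-occurrence idea can be sketched directly: for two term lists, count how many documents mention both members of each pair (matching here is naive regex; a real run would use the ami dictionaries):

```python
import re
from collections import Counter

def cooccurrence(docs, terms_a, terms_b):
    """Count, for each (a, b) term pair, how many documents mention both.

    Matching is naive case-insensitive word-boundary search.
    """
    counts = Counter()
    patterns = {t: re.compile(r"\b" + re.escape(t) + r"\b", re.I)
                for t in set(terms_a) | set(terms_b)}
    for doc in docs:
        hits_a = [a for a in terms_a if patterns[a].search(doc)]
        hits_b = [b for b in terms_b if patterns[b].search(doc)]
        for a in hits_a:
            for b in hits_b:
                counts[(a, b)] += 1
    return counts
```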