# PublicDatasets

## 1. Introduction

### 1.1 Info

**Current Pipeline:**

* First pick targets
* Find appropriate input
* Run Downloader
* Read PDFs
* Query for Relations
* Save Results

#### 1.1.1 Target Venues

* MIDL 2021
  * Proceedings to use: https://proceedings.mlr.press/v143/

In [1]:
venues = [
    #'MIDL 2021',
    'CHIL 2021',
    #'ML4H 2020',
    #'ML4H 2021' 
    ]

In [2]:
special_separator = '___'

#### 1.1.2 Target Datasets

* [Data Science Bowl 2017](https://www.kaggle.com/c/data-science-bowl-2017)
  * See here for literature: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=%22data+science+bowl%22+%2B+lung&btnG=

### 1.2 Imports

In [3]:
import os,sys
import scrapy
import pdfminer
import pandas as pd
import numpy as np

In [4]:
import warnings
warnings.filterwarnings('ignore')

import logging
logging.getLogger("scrapy").setLevel(logging.CRITICAL)
logging.getLogger("pdfminer").setLevel(logging.CRITICAL)

### 1.3 Misc. Variables

In [5]:
pdfs=[]
texts=[]

#dois=[]
titles=[]
paper_venues=[]

## 2. Input

### 2.1 PDF Sources

In [6]:
pdf_urls = [
            # 'https://proceedings.mlr.press/v143/', #MIDL 2021
            'https://proceedings.mlr.press/v174/', #CHIL 2021
            #'https://proceedings.mlr.press/v136/', #ML4H 2020
            # 'https://proceedings.mlr.press/v158/', #ML4H 2021
        ]

venue_labels = {
            # 'https://proceedings.mlr.press/v143/':'MIDL 2021',
            'https://proceedings.mlr.press/v174/':'CHIL 2021',
            #'https://proceedings.mlr.press/v136/':'ML4H 2020',
            # 'https://proceedings.mlr.press/v158/':'ML4H 2021'
}


# To Do: move these under `ArticleScraper/ArticleScraper/spiders`

## 3. Crawler/Downloader

Make directory for pdfs if it does not exist

In [7]:
!mkdir data/pdfs

In [8]:
# scrapy code goes here
from scrapy.crawler import CrawlerProcess
from scrapy.http import Request

class MLRPRessCrawler(scrapy.Spider):
    """
    Spider for crawling PDFs from `proceedings.mlr.press`
    """
    name = "MLR_Press"

    def start_requests(self):
        urls = pdf_urls
        for url in urls:
            yield scrapy.Request(
                url=url, 
                meta={
                    "venue" : venue_labels[url]
                },
                callback=self.parse
            )

    def parse(self, response):
        for article in response.xpath('/html/body/main/div/div[*]'):
            try:
                yield Request(
                    url=article.xpath('p[3]/a[2]/@href').get(),
                    meta={
                        "title": article.xpath('p[1]/text()').get(),
                        "venue" : response.request.meta['venue']
                        },
                    callback=self.save_pdf
                )
            except Exception as e:
                print(e)

    def save_pdf(self, response):
        try:
            title = response.meta['title']+".PDF"
            venue = response.meta['venue']
            self.logger.info('Saving PDF %s', title)
            pdf_file = os.path.join('data','pdfs',venue+special_separator+title)
            with open(pdf_file, 'wb') as f:
                f.write(response.body)
            pdfs.append(pdf_file)
            titles.append(response.meta['title'])
            paper_venues.append(response.meta['venue'])
        except Exception as e:
            print(e)
            
crawler = CrawlerProcess({})
crawler.crawl(MLRPRessCrawler)
crawler.start()
# DOIs are unavailable for now
# pdf-files

2022-12-11 19:50:48 [scrapy.utils.log] INFO: Scrapy 2.6.3 started (bot: scrapybot)
2022-12-11 19:50:48 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.13, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.8.0, Python 3.10.5 (v3.10.5:f377153967, Jun  6 2022, 12:36:10) [Clang 13.0.0 (clang-1300.0.29.30)], pyOpenSSL 22.1.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 38.0.1, Platform macOS-13.0.1-x86_64-i386-64bit
2022-12-11 19:50:48 [scrapy.crawler] INFO: Overridden settings:
{}
2022-12-11 19:50:48 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-12-11 19:50:48 [scrapy.extensions.telnet] INFO: Telnet Password: 3fe1828c4c413139
2022-12-11 19:50:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-12-11 19:50:48 [scrapy.middleware] INFO: Enabled downloader middlewares:

Request url must be str, got NoneType


2022-12-11 19:50:49 [MLR_Press] INFO: Saving PDF Conference on Health, Inference, and Learning (CHIL) 2022.PDF
2022-12-11 19:50:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proceedings.mlr.press/v174/raghu22a/raghu22a.pdf> (referer: https://proceedings.mlr.press/v174/)
2022-12-11 19:50:49 [MLR_Press] INFO: Saving PDF Data Augmentation for Electrocardiograms.PDF
2022-12-11 19:50:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proceedings.mlr.press/v174/rahimian22a/rahimian22a.pdf> (referer: https://proceedings.mlr.press/v174/)
2022-12-11 19:50:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proceedings.mlr.press/v174/roy22a/roy22a.pdf> (referer: https://proceedings.mlr.press/v174/)
2022-12-11 19:50:49 [MLR_Press] INFO: Saving PDF Practical Challenges in Differentially-Private Federated Survival Analysis of Medical Data.PDF
2022-12-11 19:50:49 [MLR_Press] INFO: Saving PDF Disability prediction in multiple sclerosis using performance outcome measures and d

## 4. PDF-Reader

Make directory for texts if it does not exist

In [9]:
!mkdir data/texts

In [10]:
# !ls data/pdfs

In [11]:
from pdfminer.high_level import extract_text

# i=0
for directory, subdirlist, filelist in os.walk('data/pdfs/'):
    # print(directory)
    for pdf in filelist:
        title = pdf[:-4]
        text_file = os.path.join('data','texts',title+'.txt')
        try:
            with open(text_file, 'w') as f:
                text_contents = extract_text(os.path.join(directory,pdf))
                f.write(text_contents)
                texts.append(text_file)
        except Exception as e:
            print(e)
            # print((i,text_file))
            # i+=1
# SEE ALSO these functions more layout-gnostic processing
# from pdfminer.high_level import extract_text_to_fp
# from pdfminer.layout import LAParams

## 5. Saving Results

Combine venues (constant), titles , mention_matches and keyword_matches into final output.

In [12]:
merged_dict = {}

In [13]:

#UNUSED
#merged_dict['doi']=n/a
merged_dict['Venue']=paper_venues
merged_dict['Title']=titles

# TODO EXTRACT VENUE FROM TITLES using special_separator

# [titles.split(special_separator) for title in titles]


In [14]:
match_dataset = pd.DataFrame(
    merged_dict
)

In [15]:
# match_dataset= match_dataset.assign(Venue=venues[0])
# match_dataset= match_dataset.assign(DOI='n/a')

In [16]:
L = len(match_dataset)
match_dataset =  match_dataset.assign(included=np.ones(L).astype(bool))
#Remove conference summary
# match_dataset = match_dataset.drop(0,axis=0)
match_dataset.loc[0, ['included']] = False

In [17]:
match_dataset

Unnamed: 0,Venue,Title,included
0,CHIL 2021,"Conference on Health, Inference, and Learning ...",False
1,CHIL 2021,Data Augmentation for Electrocardiograms,True
2,CHIL 2021,Practical Challenges in Differentially-Private...,True
3,CHIL 2021,Disability prediction in multiple sclerosis us...,True
4,CHIL 2021,MedMCQA: A Large-scale Multi-Subject Multi-Cho...,True
5,CHIL 2021,Improving the Fairness of Chest X-ray Classifiers,True
6,CHIL 2021,Lead-agnostic Self-supervised Learning for Loc...,True
7,CHIL 2021,Learning Unsupervised Representations for ICU ...,True
8,CHIL 2021,Context-Sensitive Spelling Correction of Clini...,True
9,CHIL 2021,Unifying Heterogeneous Electronic Health Recor...,True


In [18]:
match_dataset.to_csv('data/ResearchPapers.csv')

In [19]:
#WARNING: Preface.pdf