# PublicDatasets

## 1. Introduction

### 1.1 Info

**Current Pipeline:**

* First pick targets
* Find appropriate input
* Run Downloader
* Read PDFs
* Query for Relations
* Save Results

#### 1.1.1 Target Venues

* MIDL 2021
  * Proceedings to use: https://proceedings.mlr.press/v143/

#### 1.1.2 Target Datasets

* [Data Science Bowl 2017](https://www.kaggle.com/c/data-science-bowl-2017)
  * See here for literature: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=%22data+science+bowl%22+%2B+lung&btnG=

### 1.2 Imports

In [1]:
import os,sys
import scrapy

## 2. Input

### 2.1 PDF Sources

In [2]:
pdf_urls = [
            'https://proceedings.mlr.press/v143/',
        ]

# To Do: move these under `ArticleScraper/ArticleScraper/spiders`

### 2.2 Explicit Mentions 

In [3]:
mentions = [
    "2017 Data Science Bowl",
    "Kaggle Data Science Bowl 2017",
    "Data Science Bowl 2017",
    "KDSB17",
    "DSB",
]

### 2.3 Related Keywords

In [4]:
keywords = [
    "lung cancer",
    "nodule",
    "competition",
    "kaggle dataset",
    "deep learning"
]

## 3. Downloader

Makedir for output if it does not exist

In [5]:
!mkdir data/pdfs

mkdir: data/pdfs: File exists


In [6]:
# scrapy code goes here
from scrapy.crawler import CrawlerProcess
from scrapy.http import Request

class MIDL21Spider(scrapy.Spider):
    name = "MIDL21_Proceedings"

    def start_requests(self):
        urls = pdf_urls
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for article in response.xpath('/html/body/main/div/div[*]'):
            try:
                yield Request(
                    url=article.xpath('p[3]/a[2]/@href').get(),
                    meta={
                        "title": article.xpath('p[1]/text()').get()
                        },
                    callback=self.save_pdf
                )
            except Exception as e:
                print(e)

    def save_pdf(self, response):
        try:
            title = response.meta['title']+".PDF"
            self.logger.info('Saving PDF %s', title)
            with open(os.path.join('data','pdfs',title), 'wb') as f:
                f.write(response.body)
        except Exception as e:
            print(e)
            
crawler = CrawlerProcess({})
crawler.crawl(MIDL21Spider)
crawler.start()
# DOIs are unavailable for now
# pdf-files

2022-10-05 13:52:40 [scrapy.utils.log] INFO: Scrapy 2.6.3 started (bot: scrapybot)
2022-10-05 13:52:40 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.4, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.8.0, Python 3.10.5 (v3.10.5:f377153967, Jun  6 2022, 12:36:10) [Clang 13.0.0 (clang-1300.0.29.30)], pyOpenSSL 22.1.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 38.0.1, Platform macOS-12.6-arm64-arm-64bit
2022-10-05 13:52:40 [scrapy.crawler] INFO: Overridden settings:
{}
2022-10-05 13:52:40 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-10-05 13:52:40 [scrapy.extensions.telnet] INFO: Telnet Password: da370bd18ee6c04d
2022-10-05 13:52:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-10-05 13:52:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['sc

Request url must be str, got NoneType


2022-10-05 13:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proceedings.mlr.press/v143/olivier21a/olivier21a.pdf> (referer: https://proceedings.mlr.press/v143/)
2022-10-05 13:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proceedings.mlr.press/v143/mouches21a/mouches21a.pdf> (referer: https://proceedings.mlr.press/v143/)
2022-10-05 13:52:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proceedings.mlr.press/v143/muhamedrahimov21a/muhamedrahimov21a.pdf> (referer: https://proceedings.mlr.press/v143/)
2022-10-05 13:52:40 [MIDL21_Proceedings] INFO: Saving PDF Balanced sampling for an object detection problem - application to fetal anatomies detection.PDF
2022-10-05 13:52:40 [MIDL21_Proceedings] INFO: Saving PDF Unifying Brain Age Prediction and Age-Conditioned Template Generation with a Deterministic Autoencoder.PDF
2022-10-05 13:52:40 [MIDL21_Proceedings] INFO: Saving PDF Learning Interclass Relations for Intravenous Contrast Phase Classification in C

## 4. PDF-Reader

## 5. Relation Querier

## 6. Saving Results