# PublicDatasets

## 1. Introduction

### 1.1 Info

**Current Pipeline:**

* First pick targets
* Find appropriate input
* Run Downloader
* Read PDFs
* Query for Relations
* Save Results

#### 1.1.1 Target Venues

* MIDL 2021
  * Proceedings to use: https://proceedings.mlr.press/v143/

In [None]:
venues = ['MIDL 2021']

#### 1.1.2 Target Datasets

* [Data Science Bowl 2017](https://www.kaggle.com/c/data-science-bowl-2017)
  * See here for literature: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=%22data+science+bowl%22+%2B+lung&btnG=

### 1.2 Imports

In [2]:
import os,sys
import scrapy
import pdfminer
import pandas as pd

In [3]:
import warnings
warnings.filterwarnings('ignore')

import logging
logging.getLogger("scrapy").setLevel(logging.CRITICAL)
logging.getLogger("pdfminer").setLevel(logging.CRITICAL)

### 1.3 Misc. Variables

In [4]:
pdfs=[]
texts=[]

dois=[]
titles=[]

## 2. Input

### 2.1 PDF Sources

In [5]:
pdf_urls = [
            'https://proceedings.mlr.press/v143/',
        ]

# To Do: move these under `ArticleScraper/ArticleScraper/spiders`

### 2.2 Explicit Mentions 

In [6]:
mentions = [
    "2017 Data Science Bowl",
    "Kaggle Data Science Bowl 2017",
    "Data Science Bowl 2017",
    "KDSB17",
    "DSB",
]

### 2.3 Related Keywords

In [7]:
keywords = [
    "lung cancer",
    "nodule",
    "competition",
    "kaggle dataset",
    "deep learning"
]

## 3. Crawler/Downloader

Make directory for pdfs if it does not exist

In [8]:
!mkdir data/pdfs

mkdir: data/pdfs: File exists


In [9]:
# scrapy code goes here
from scrapy.crawler import CrawlerProcess
from scrapy.http import Request

class MIDL21Spider(scrapy.Spider):
    name = "MIDL21_Proceedings"

    def start_requests(self):
        urls = pdf_urls
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for article in response.xpath('/html/body/main/div/div[*]'):
            try:
                yield Request(
                    url=article.xpath('p[3]/a[2]/@href').get(),
                    meta={
                        "title": article.xpath('p[1]/text()').get()
                        },
                    callback=self.save_pdf
                )
            except Exception as e:
                print(e)

    def save_pdf(self, response):
        try:
            title = response.meta['title']+".PDF"
            self.logger.info('Saving PDF %s', title)
            pdf_file = os.path.join('data','pdfs',title)
            with open(pdf_file, 'wb') as f:
                f.write(response.body)
            pdfs.append(pdf_file)
            titles.append(response.meta['title'])
        except Exception as e:
            print(e)
            
crawler = CrawlerProcess({})
crawler.crawl(MIDL21Spider)
crawler.start()
# DOIs are unavailable for now
# pdf-files

2022-10-05 15:46:47 [scrapy.utils.log] INFO: Scrapy 2.6.3 started (bot: scrapybot)
2022-10-05 15:46:47 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.4, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.8.0, Python 3.10.5 (v3.10.5:f377153967, Jun  6 2022, 12:36:10) [Clang 13.0.0 (clang-1300.0.29.30)], pyOpenSSL 22.1.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 38.0.1, Platform macOS-12.6-arm64-arm-64bit
2022-10-05 15:46:47 [scrapy.crawler] INFO: Overridden settings:
{}
2022-10-05 15:46:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-10-05 15:46:47 [scrapy.extensions.telnet] INFO: Telnet Password: 8032a98e31ee1384
2022-10-05 15:46:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-10-05 15:46:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
['sc

Request url must be str, got NoneType


2022-10-05 15:46:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proceedings.mlr.press/v143/maheshwari21a/maheshwari21a.pdf> (referer: https://proceedings.mlr.press/v143/)
2022-10-05 15:46:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proceedings.mlr.press/v143/mouches21a/mouches21a.pdf> (referer: https://proceedings.mlr.press/v143/)
2022-10-05 15:46:49 [MIDL21_Proceedings] INFO: Saving PDF Distill DSM: Computationally efficient method for segmentation of medical imaging volumes.PDF
2022-10-05 15:46:49 [MIDL21_Proceedings] INFO: Saving PDF Unifying Brain Age Prediction and Age-Conditioned Template Generation with a Deterministic Autoencoder.PDF
2022-10-05 15:46:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://proceedings.mlr.press/v143/neimark21a/neimark21a.pdf> (referer: https://proceedings.mlr.press/v143/)
2022-10-05 15:46:50 [MIDL21_Proceedings] INFO: Saving PDF “Train one, Classify one, Teach one” - Cross-surgery transfer learning for surgical step re

## 4. PDF-Reader

Make directory for texts if it does not exist

In [10]:
!mkdir data/texts

mkdir: data/texts: File exists


In [11]:
# !ls data/pdfs

In [12]:
from pdfminer.high_level import extract_text

# i=0
for directory, subdirlist, filelist in os.walk('data/pdfs/'):
    # print(directory)
    for pdf in filelist:
        title = pdf[:-4]
        text_file = os.path.join('data','texts',title+'.txt')
        with open(text_file, 'w') as f:
            text_contents = extract_text(os.path.join(directory,pdf))
            f.write(text_contents)
            texts.append(text_file)
            # print((i,text_file))
            # i+=1
# SEE ALSO these functions more layout-gnostic processing
# from pdfminer.high_level import extract_text_to_fp
# from pdfminer.layout import LAParams

## 5. Relation Querier

In [None]:
preview_offset = 10

In [13]:
mention_matches = {name:[] for name in mentions}

In [14]:
for name in mentions:
    for text_file in texts:
        with open(text_file, 'r') as f:
            contents = f.read()
            #Only check for 1-for-1 correspondence
            #AND DON'T FORGET TO LOWER CASE WHEN COMPARING!
            contents = contents.lower()
            low_name=name.lower()
            if contents.find(low_name) != -1:
                mention_matches[name].append(1)
                idx=contents.find(low_name)
                print(("Found", name))
                print(contents[idx-preview_offset:idx+preview_offset])
            else:
                mention_matches[name].append(0)

In [15]:
keyword_matches = {keyword:[] for keyword in keywords}

In [16]:
for keyword in keywords:
    for text_file in texts:
        with open(text_file, 'r') as f:
            contents = f.read()
            #Only check for 1-for-1 correspondence
            #AND DON'T FORGET TO LOWER CASE WHEN COMPARING!
            contents = contents.lower()
            low_key=keyword.lower()
            if contents.find(low_key) != -1:
                keyword_matches[keyword].append(1)
            else:
                keyword_matches[keyword].append(0)

In [17]:
# mention_matches

## 6. Saving Results

Combine venues (constant), titles, dois (blank), mention_matches and keyword_matches into final output.

In [18]:
merged_dict = {}

merged_dict['Title']=titles

#UNUSED
#merged_dict['doi']=n/a
#merged_dict['venue']=n/a

for name in mention_matches:
    merged_dict[name]=mention_matches[name]
for keyword in keyword_matches:
    merged_dict[keyword]=keyword_matches[keyword]

In [21]:
match_dataset = pd.DataFrame(
    merged_dict
)

**Post-Processing for DOI and Venue**

In [22]:
match_dataset= match_dataset.assign(Venue=venues[0])
match_dataset= match_dataset.assign(DOI='n/a')

In [23]:
match_dataset

Unnamed: 0,Title,2017 Data Science Bowl,Kaggle Data Science Bowl 2017,Data Science Bowl 2017,KDSB17,DSB,lung cancer,nodule,competition,kaggle dataset,deep learning,Venue,DOI
0,Distill DSM: Computationally efficient method ...,0,0,0,0,0,0,0,1,0,1,MIDL 2021,
1,Unifying Brain Age Prediction and Age-Conditio...,0,0,0,0,0,0,0,0,0,1,MIDL 2021,
2,"“Train one, Classify one, Teach one” - Cross-s...",0,0,0,0,0,0,0,0,0,1,MIDL 2021,
3,Predicting COVID-19 Lung Infiltrate Progressio...,0,0,0,0,0,0,0,0,0,1,MIDL 2021,
4,Learning Interclass Relations for Intravenous ...,0,0,0,0,0,0,0,0,0,1,MIDL 2021,
5,Embedding-based Instance Segmentation in Micro...,0,0,0,0,0,1,0,0,0,1,MIDL 2021,
6,Benefits of Linear Conditioning for Segmentati...,0,0,0,0,0,1,0,1,0,1,MIDL 2021,
7,Feature-based image registration in structured...,0,0,0,0,0,0,1,0,0,1,MIDL 2021,
8,Partial transfusion: on the expressive influen...,0,0,0,0,0,0,0,0,0,1,MIDL 2021,
9,Feedback Graph Attention Convolutional Network...,0,0,0,0,0,0,0,0,0,1,MIDL 2021,


In [None]:
match_dataset.to_csv('ResearchPapers.csv')

In [None]:
#WARNING: Preview.PDF