# Dataset Organization Procedure
This notebook shows how we downloaded and processed all data from Creative Commons (CC) licensed images
to create the artificial scientific forgery dataset.

Each section of this notebook presents a different source that we used to create the dataset.

---
Author: João Phillipe Cardenuto

email: phillipe.cardenuto@ic.unicamp.br\
August, 2022
---

# 1- Organizing data from BBBC038 


link: https://bbbc.broadinstitute.org/BBBC038

License CC0 - public domain

Since we are interested in data with cell masks, we use only the [stage1_train.zip](https://data.broadinstitute.org/bbbc/BBBC038/stage1_train.zip) data from this dataset.

In [14]:
# download stage1_train.zip file
! wget https://data.broadinstitute.org/bbbc/BBBC038/stage1_train.zip

--2020-10-14 20:06:11--  https://data.broadinstitute.org/bbbc/BBBC038/stage1_train.zip
Resolving data.broadinstitute.org (data.broadinstitute.org)... 69.173.92.29
Connecting to data.broadinstitute.org (data.broadinstitute.org)|69.173.92.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 82923446 (79M) [application/zip]
Saving to: ‘stage1_train.zip.1’


2020-10-14 20:06:57 (1.76 MB/s) - ‘stage1_train.zip.1’ saved [82923446/82923446]



In [None]:
# Move the archive into the BBBC038 folder and extract its content
! mkdir -p BBBC038 && mv stage1_train.zip BBBC038 && cd BBBC038 && unzip -q stage1_train.zip

After downloading the data, we noticed that the object mask of each image is split across multiple files.

We therefore use the following cells to join them into a single mask per image.

In [8]:
# import cell
from glob import glob
import shutil
import cv2
import pandas as pd
import os

In [9]:
# getting all mask
masks = glob("BBBC038/*/masks/")

In [10]:
masks[0]

'BBBC038/58406ed8ef944831c413c3424dc2b07e59aef13eb1ff16acbb3402b38b5de0bd/masks/'

In [12]:
# create joined mask for each image
for m in masks:
    # keep only files, so a rerun does not pick up the joined_masks folder
    joining_masks = [f for f in glob(f"{m}/*") if os.path.isfile(f)]
    result_mask = cv2.imread(joining_masks[0],cv2.IMREAD_GRAYSCALE)
    for jmask in joining_masks[1:]:
        aux = cv2.imread(jmask,cv2.IMREAD_GRAYSCALE)
        # pixel-wise OR: the union of all object masks
        result_mask = cv2.bitwise_or(result_mask, aux)
    
    # creating save path (m ends with "/", so mask_id is the folder name "masks")
    mask_id = os.path.basename(os.path.dirname(m))
    save_path = os.path.dirname(m) + "/joined_masks"
    os.makedirs(save_path,exist_ok=True)
    
    # save result mask
    save_name  = f"{save_path}/{mask_id}.png"
    cv2.imwrite(save_name,result_mask)
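
As a toy illustration of what the loop above computes, the sketch below merges per-object binary masks with a pixel-wise OR. It uses small synthetic NumPy arrays instead of the BBBC038 files, and `join_masks` is a hypothetical helper name that is not part of the notebook.

```python
import numpy as np

def join_masks(masks):
    """Pixel-wise union of equally sized uint8 masks (hypothetical helper)."""
    joined = np.zeros_like(masks[0])
    for mask in masks:
        # A pixel is foreground in the result if it is foreground in any mask
        joined = np.bitwise_or(joined, mask)
    return joined

# Two synthetic 3x3 masks, each marking a different object with 255
mask_a = np.array([[255, 0, 0], [0, 0, 0], [0, 0, 0]], dtype=np.uint8)
mask_b = np.array([[0, 0, 0], [0, 255, 0], [0, 0, 255]], dtype=np.uint8)

joined = join_masks([mask_a, mask_b])
print(joined)
```

Because the masks only contain 0 and 255, the OR behaves like a set union of the object regions.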



In [11]:
# create a DataFrame with information about each figure
bbbc038_data = pd.DataFrame(columns=['ID','DatasetRef','RealDataPath','License', 'Class','ObjectMaskPath','Link'])

In [47]:
# Getting all files from folders
bbbc038_data = pd.DataFrame()
bbbc038_data_files = glob("BBBC038/*")
bbbc038_data_files = [b for b in bbbc038_data_files if os.path.isdir(b)]

for index, file in enumerate(bbbc038_data_files):
    id_name = os.path.basename(file)
    data = {}
    data['ID'] = "%06d" %(index+1)
    data['DatasetRef'] ='BBBC038'
    data['RealDataPath'] = f"{file}/images/{id_name}.png"
    data['License'] = 'CC0'
    data['Class'] = 'Biological'
    data['ObjectMaskPath'] = f"{file}/masks/joined_masks/masks.png"
    data['Link'] = 'https://data.broadinstitute.org/bbbc/BBBC038/stage1_train.zip'
    
    bbbc038_data = bbbc038_data.append(data,ignore_index=True)
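
`DataFrame.append` was removed in pandas 2.0, so the cell above only runs on older pandas versions. A minimal sketch of an equivalent approach, assuming a recent pandas, is to collect one dict per image and build the frame once. The folder names below are placeholders, not real BBBC038 IDs.

```python
import pandas as pd

# Placeholder folder names standing in for the BBBC038 image directories
folders = ["BBBC038/aaa", "BBBC038/bbb"]

records = []
for index, folder in enumerate(folders):
    id_name = folder.split("/")[-1]
    records.append({
        "ID": "%06d" % (index + 1),
        "DatasetRef": "BBBC038",
        "RealDataPath": f"{folder}/images/{id_name}.png",
        "License": "CC0",
        "Class": "Biological",
        "ObjectMaskPath": f"{folder}/masks/joined_masks/masks.png",
        "Link": "https://data.broadinstitute.org/bbbc/BBBC038/stage1_train.zip",
    })

# Building the frame once avoids the repeated copies made by row-wise appends
table = pd.DataFrame.from_records(records)
print(table[["ID", "RealDataPath"]])
```

With this approach the column order follows the dict key order rather than the alphabetical order seen in the displayed tables.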

In [48]:
bbbc038_data

Unnamed: 0,Class,DatasetRef,ID,License,Link,ObjectMaskPath,RealDataPath
0,Biological,BBBC038,000001,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/58406ed8ef944831c413c3424dc2b07e59aef1...,BBBC038/58406ed8ef944831c413c3424dc2b07e59aef1...
1,Biological,BBBC038,000002,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/b82548ab19466b461614e6055aaf49fbc24c03...,BBBC038/b82548ab19466b461614e6055aaf49fbc24c03...
2,Biological,BBBC038,000003,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/dabfee30b46d23569c63fa7253ef10b2407fbe...,BBBC038/dabfee30b46d23569c63fa7253ef10b2407fbe...
3,Biological,BBBC038,000004,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/f113626a04125d97b27f21b45a0ce9a686d73d...,BBBC038/f113626a04125d97b27f21b45a0ce9a686d73d...
4,Biological,BBBC038,000005,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/ad9d305cbf193d4250743ead466bdaefe91083...,BBBC038/ad9d305cbf193d4250743ead466bdaefe91083...
...,...,...,...,...,...,...,...
665,Biological,BBBC038,000666,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/40946065f7e4b6038599fbfd419f2a67e7635b...,BBBC038/40946065f7e4b6038599fbfd419f2a67e7635b...
666,Biological,BBBC038,000667,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/ac8169a0debed11560f3f0e246c05ea82d03c6...,BBBC038/ac8169a0debed11560f3f0e246c05ea82d03c6...
667,Biological,BBBC038,000668,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/a7f767ca9770b160f234780e172aeb35a50830...,BBBC038/a7f767ca9770b160f234780e172aeb35a50830...
668,Biological,BBBC038,000669,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/a31deaf0ac279d5f34fb2eca80cc2abce6ef30...,BBBC038/a31deaf0ac279d5f34fb2eca80cc2abce6ef30...


Now we manually remove the cases whose object masks are not precise enough to produce a realistic forgery.


In [41]:
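# copy every image into one folder, named by its ID, so it can be sorted manually
os.makedirs("manually_organizing_bbbc038",exist_ok=True)
for index, row in bbbc038_data.iterrows():
    src = row['RealDataPath']
    dest = f"manually_organizing_bbbc038/{row['ID']}.png"
    shutil.copy(src,dest)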

In [49]:
# the copied images are expected to be sorted by hand into remove/, gray/ and color/
remove_files = glob('manually_organizing_bbbc038/remove/*')
gray_images =  glob('manually_organizing_bbbc038/gray/*')
color_images =  glob('manually_organizing_bbbc038/color/*')

In [50]:
bbbc038_data_org = bbbc038_data.copy()

In [51]:
# remove images that are not adequate
remove_rows = [ os.path.basename(remove)[:-4] for remove in remove_files ]
bbbc038_data_after_remove = bbbc038_data_org[~bbbc038_data_org.ID.isin(remove_rows)]

In [52]:
# insert specific subset tag (useful later during the forgery procedure)
remaining_data = pd.DataFrame(columns=['ID','subset_tag'])
remaining_images = gray_images + color_images


for file in remaining_images:
    data = {}
    data['ID'] = os.path.basename(file)[:-4]
    if '/gray/' in file:
        data['subset_tag'] ='gray'
    else:
        data['subset_tag'] ='color'
    remaining_data = remaining_data.append(data,ignore_index=True)
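
The ID and the subset tag are both read off the file path: the file stem is the six-digit ID and the parent folder is the tag. A small sketch with `pathlib`, using made-up paths and a hypothetical helper `id_and_tag`:

```python
from pathlib import PurePosixPath

def id_and_tag(path):
    """Return (ID, subset_tag) for a manually sorted image path (hypothetical helper)."""
    p = PurePosixPath(path)
    # stem drops the extension whatever its length; parent.name is 'gray' or 'color'
    return p.stem, p.parent.name

# Made-up example paths following the folder layout used above
examples = [
    "manually_organizing_bbbc038/gray/000001.png",
    "manually_organizing_bbbc038/color/000666.png",
]
pairs = [id_and_tag(e) for e in examples]
print(pairs)
```

Using the stem instead of slicing off the last four characters also works for extensions that are not three letters long.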


In [53]:
bbbc038_data_after_remove

Unnamed: 0,Class,DatasetRef,ID,License,Link,ObjectMaskPath,RealDataPath
0,Biological,BBBC038,000001,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/58406ed8ef944831c413c3424dc2b07e59aef1...,BBBC038/58406ed8ef944831c413c3424dc2b07e59aef1...
1,Biological,BBBC038,000002,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/b82548ab19466b461614e6055aaf49fbc24c03...,BBBC038/b82548ab19466b461614e6055aaf49fbc24c03...
2,Biological,BBBC038,000003,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/dabfee30b46d23569c63fa7253ef10b2407fbe...,BBBC038/dabfee30b46d23569c63fa7253ef10b2407fbe...
3,Biological,BBBC038,000004,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/f113626a04125d97b27f21b45a0ce9a686d73d...,BBBC038/f113626a04125d97b27f21b45a0ce9a686d73d...
4,Biological,BBBC038,000005,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/ad9d305cbf193d4250743ead466bdaefe91083...,BBBC038/ad9d305cbf193d4250743ead466bdaefe91083...
...,...,...,...,...,...,...,...
665,Biological,BBBC038,000666,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/40946065f7e4b6038599fbfd419f2a67e7635b...,BBBC038/40946065f7e4b6038599fbfd419f2a67e7635b...
666,Biological,BBBC038,000667,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/ac8169a0debed11560f3f0e246c05ea82d03c6...,BBBC038/ac8169a0debed11560f3f0e246c05ea82d03c6...
667,Biological,BBBC038,000668,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/a7f767ca9770b160f234780e172aeb35a50830...,BBBC038/a7f767ca9770b160f234780e172aeb35a50830...
668,Biological,BBBC038,000669,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/a31deaf0ac279d5f34fb2eca80cc2abce6ef30...,BBBC038/a31deaf0ac279d5f34fb2eca80cc2abce6ef30...


In [54]:
final_bbbc038 = bbbc038_data_after_remove.merge(remaining_data)

In [55]:
final_bbbc038

Unnamed: 0,Class,DatasetRef,ID,License,Link,ObjectMaskPath,RealDataPath,subset_tag
0,Biological,BBBC038,000001,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/58406ed8ef944831c413c3424dc2b07e59aef1...,BBBC038/58406ed8ef944831c413c3424dc2b07e59aef1...,gray
1,Biological,BBBC038,000002,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/b82548ab19466b461614e6055aaf49fbc24c03...,BBBC038/b82548ab19466b461614e6055aaf49fbc24c03...,gray
2,Biological,BBBC038,000003,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/dabfee30b46d23569c63fa7253ef10b2407fbe...,BBBC038/dabfee30b46d23569c63fa7253ef10b2407fbe...,gray
3,Biological,BBBC038,000004,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/f113626a04125d97b27f21b45a0ce9a686d73d...,BBBC038/f113626a04125d97b27f21b45a0ce9a686d73d...,gray
4,Biological,BBBC038,000005,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/ad9d305cbf193d4250743ead466bdaefe91083...,BBBC038/ad9d305cbf193d4250743ead466bdaefe91083...,gray
...,...,...,...,...,...,...,...,...
547,Biological,BBBC038,000666,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/40946065f7e4b6038599fbfd419f2a67e7635b...,BBBC038/40946065f7e4b6038599fbfd419f2a67e7635b...,color
548,Biological,BBBC038,000667,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/ac8169a0debed11560f3f0e246c05ea82d03c6...,BBBC038/ac8169a0debed11560f3f0e246c05ea82d03c6...,gray
549,Biological,BBBC038,000668,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/a7f767ca9770b160f234780e172aeb35a50830...,BBBC038/a7f767ca9770b160f234780e172aeb35a50830...,color
550,Biological,BBBC038,000669,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/a31deaf0ac279d5f34fb2eca80cc2abce6ef30...,BBBC038/a31deaf0ac279d5f34fb2eca80cc2abce6ef30...,color


In [56]:
final_bbbc038.to_csv("organized_bbbc038.csv")
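
`to_csv` writes the row index as an unnamed first column by default, which is why the CSV files are later read back with `.drop(columns=['Unnamed: 0'])`. The sketch below shows the round trip on a toy frame, using an in-memory buffer instead of a file. Passing `index=False` is an alternative that this notebook does not use.

```python
import io
import pandas as pd

toy = pd.DataFrame({"ID": ["000001", "000002"], "subset_tag": ["gray", "color"]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index, dtype={"ID": str})

# Alternative: skip the index so no extra column appears on reload
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index, dtype={"ID": str})

print(list(reloaded.columns))
print(list(reloaded_clean.columns))
```

Reading `ID` with `dtype=str` keeps the zero padding, which would otherwise be lost when the column is parsed as an integer.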

# 2- Images from PMC repository

This data was downloaded entirely from PMC, selected based on its Creative Commons license information.

In [57]:
pmc_data = pd.read_csv('pmc_ccby_images.csv').drop(columns=['Unnamed: 0'])
pmc_data

Unnamed: 0,figname,Accession ID,File,Article Citation,Last Updated (YYYY-MM-DD HH:MM:SS),PMID,License
0,PMC5030415_CRIGM2016-3058407.001.jpg,PMC5030415,oa_package/42/6d/PMC5030415.tar.gz,Case Rep Gastrointest Med. 2016 Sep 7; 2016:30...,2017-05-18 11:28:52,27668102.0,CC BY
1,PMC5031835_CRIOT2016-4709753.001.jpg,PMC5031835,oa_package/92/8f/PMC5031835.tar.gz,Case Rep Otolaryngol. 2016 Sep 8; 2016:4709753,2017-05-22 12:40:56,27672465.0,CC BY
2,PMC5031835_CRIOT2016-4709753.002.jpg,PMC5031835,oa_package/92/8f/PMC5031835.tar.gz,Case Rep Otolaryngol. 2016 Sep 8; 2016:4709753,2017-05-22 12:40:56,27672465.0,CC BY
3,PMC5031870_CRIPA2016-4182026.005.jpg,PMC5031870,oa_package/78/f2/PMC5031870.tar.gz,Case Rep Pathol. 2016 Sep 8; 2016:4182026,2017-05-22 12:41:09,27672467.0,CC BY
4,PMC5031924_cureus-0008-000000000696-i04.jpg,PMC5031924,oa_package/c0/76/PMC5031924.tar.gz,Cureus.; 8(7):e696,2016-09-28 18:28:34,27672528.0,CC BY
...,...,...,...,...,...,...,...
377,PMC4836196_40248_2016_50_Fig5_HTML.jpg,PMC4836196,oa_package/60/42/PMC4836196.tar.gz,Multidiscip Respir Med. 2016 Apr 19; 11:16,2016-04-21 18:48:08,27096087.0,CC BY
378,PMC4836689_pone.0152554.g001.jpg,PMC4836689,oa_package/1f/88/PMC4836689.tar.gz,PLoS One. 2016 Apr 19; 11(4):e0152554,2019-02-18 13:56:10,27092557.0,CC BY
379,PMC4940527_CRICC2016-7379829.003.jpg,PMC4940527,oa_package/ad/b1/PMC4940527.tar.gz,Case Rep Crit Care. 2016 Jun 28; 2016:7379829,2017-05-15 12:30:04,27433359.0,CC BY
380,PMC5037059_cureus-0008-000000000751-i01.jpg,PMC5037059,oa_package/f9/e8/PMC5037059.tar.gz,Cureus.; 8(8):e751,2016-12-20 19:12:44,27688988.0,CC BY


**Downloading figures into PMC folder**

In [132]:
# install wget -- pip install wget
import wget

In [139]:
def download_figure(figname):
    uid = figname[:figname.find("_")]
    figname = figname[figname.find("_")+1:]
    
    url = f'https://www.ncbi.nlm.nih.gov/pmc/articles/{uid}/bin/{figname}'
    out_path = f'PMC/{uid}'
    os.makedirs(out_path,exist_ok=True)

    wget.download(url,f"{out_path}/{figname}")
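
The `figname` column packs the accession ID and the original file name into one string, separated by the first underscore. The sketch below shows the split and the resulting URL, using `str.partition` so that later underscores in the file name are preserved. `figure_url` is a hypothetical helper, and the URL pattern is the one used in `download_figure`.

```python
def figure_url(figname):
    """Split 'PMCxxxx_filename' and build the PMC figure URL (hypothetical helper)."""
    # partition splits at the first underscore only
    uid, _, filename = figname.partition("_")
    url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/{uid}/bin/{filename}"
    return uid, filename, url

uid, filename, url = figure_url("PMC4836196_40248_2016_50_Fig5_HTML.jpg")
print(uid)
print(filename)
print(url)
```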

In [140]:
for index, row in pmc_data.iterrows():
    figname = row['figname']
    download_figure(figname)

**Creating organized csv file**

In [58]:
pmc_organized = pd.DataFrame()

for index, row in pmc_data.iterrows():
    
    figname = row['figname']
    uid = figname[:figname.find("_")]
    figname = figname[figname.find("_")+1:]
    url = f'https://www.ncbi.nlm.nih.gov/pmc/articles/{uid}/bin/{figname}'
    
    data = {}
    data['ID'] = "%06d" %(index+1)
    data['DatasetRef'] ='PMC'
    data['RealDataPath'] = f'PMC/{uid}/{figname}'
    data['License'] = 'CC-BY'
    data['Class'] = 'Biological'
    data['ObjectMaskPath'] = None
    data['Link'] = url
    data['subset_tag'] = 'pmc'
    
    pmc_organized = pmc_organized.append(data, ignore_index=True)

In [59]:
pmc_organized

Unnamed: 0,Class,DatasetRef,ID,License,Link,ObjectMaskPath,RealDataPath,subset_tag
0,Biological,PMC,000001,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,,PMC/PMC5030415/CRIGM2016-3058407.001.jpg,pmc
1,Biological,PMC,000002,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,,PMC/PMC5031835/CRIOT2016-4709753.001.jpg,pmc
2,Biological,PMC,000003,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,,PMC/PMC5031835/CRIOT2016-4709753.002.jpg,pmc
3,Biological,PMC,000004,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,,PMC/PMC5031870/CRIPA2016-4182026.005.jpg,pmc
4,Biological,PMC,000005,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,,PMC/PMC5031924/cureus-0008-000000000696-i04.jpg,pmc
...,...,...,...,...,...,...,...,...
377,Biological,PMC,000378,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,,PMC/PMC4836196/40248_2016_50_Fig5_HTML.jpg,pmc
378,Biological,PMC,000379,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,,PMC/PMC4836689/pone.0152554.g001.jpg,pmc
379,Biological,PMC,000380,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,,PMC/PMC4940527/CRICC2016-7379829.003.jpg,pmc
380,Biological,PMC,000381,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,,PMC/PMC5037059/cureus-0008-000000000751-i01.jpg,pmc


In [60]:
pmc_organized.to_csv("organized_pmc.csv")

# 3 - Western Blot Images from PMC repository

This is a collection of CC-BY PMC data from western blots (or other gel electrophoresis analyses).
For this dataset, we downloaded all figures, used an in-house algorithm to extract all figure panels, and finally collected only the panels that were western blots.

In [61]:
wblot = pd.read_csv('wblots/western_bloats_ccby.csv').drop(columns=['Unnamed: 0'])
wblot

Unnamed: 0,figname,Accession ID,File,Article Citation,Last Updated (YYYY-MM-DD HH:MM:SS),PMID,License
0,001_PMC2592807_ar2523-4.jpg,PMC2592807,oa_package/7a/c7/PMC2592807.tar.gz,Arthritis Res Ther. 2008 Sep 30; 10(5):R119,2011-01-11 22:46:40,18826638.0,CC BY
1,001_PMC503385_1471-2180-4-30-1.jpg,PMC503385,oa_package/a6/cb/PMC503385.tar.gz,BMC Microbiol. 2004 Jul 26; 4:30,2013-03-20 10:24:43,15274747.0,CC BY
2,001_PMC515284_1471-2091-5-13-6.jpg,PMC515284,oa_package/75/9e/PMC515284.tar.gz,BMC Biochem. 2004 Aug 13; 5:13,2013-03-20 10:26:08,15310391.0,CC BY
3,001_PMC515370_pbio.0020304.g004.jpg,PMC515370,oa_package/53/8c/PMC515370.tar.gz,PLoS Biol. 2004 Oct 7; 2(10):e304,2016-12-14 10:06:23,15361932.0,CC BY
4,001_PMC2927423_pone.0012315.s003.tif,PMC2927423,oa_package/d3/fb/PMC2927423.tar.gz,PLoS One. 2010 Aug 24; 5(8):e12315,2018-03-13 02:15:35,20808762.0,CC BY
...,...,...,...,...,...,...,...
1004,002_PMC3053380_pone.0017717.g004.jpg,PMC3053380,oa_package/a4/34/PMC3053380.tar.gz,PLoS One. 2011 Mar 10; 6(3):e17717,2018-03-28 22:16:27,21423701.0,CC BY
1005,002_PMC509306_pbio.0020246.g006.jpg,PMC509306,oa_package/fc/b8/PMC509306.tar.gz,PLoS Biol. 2004 Aug 17; 2(8):e246,2016-12-06 16:44:05,15314660.0,CC BY
1006,002_PMC516775_1471-2229-4-15-3.jpg,PMC516775,oa_package/e8/3a/PMC516775.tar.gz,BMC Plant Biol. 2004 Aug 26; 4:15,2013-03-20 10:26:45,15331019.0,CC BY
1007,077_PMC3431335_pgen.1002918.s011.tif,PMC3431335,oa_package/e3/5c/PMC3431335.tar.gz,PLoS Genet. 2012 Aug 30; 8(8):e1002918,2016-11-04 11:29:08,22952452.0,CC BY


In [62]:
all_wblots = glob("wblots/*/*/*")

In [85]:
wblots_organized = pd.DataFrame()

for index, image in enumerate(all_wblots):
    
    # file names look like "<prefix>_<PMC ID>_<original file name>"
    # split on the first two underscores only, so underscores in the original name are kept
    _, uid, figname = os.path.basename(image).split("_", 2)
    url = f'https://www.ncbi.nlm.nih.gov/pmc/articles/{uid}/bin/{figname}'
    
    data = {}
    data['ID'] = "%06d" %(index+1)
    data['DatasetRef'] ='PMC'
    data['RealDataPath'] = image
    data['License'] = 'CC-BY'
    data['Class'] = 'WesternBlot'
    data['ObjectMaskPath'] = None
    data['Link'] = url
    data['subset_tag'] = 'wblot'
    
    wblots_organized = wblots_organized.append(data, ignore_index=True)

In [86]:
wblots_organized

Unnamed: 0,Class,DatasetRef,ID,License,Link,ObjectMaskPath,RealDataPath,subset_tag
0,WesternBlot,PMC,000001,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,,wblots/vertical/v3/002_PMC3930619_ppat.1003951...,wblot
1,WesternBlot,PMC,000002,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,,wblots/vertical/v3/003_PMC3741172_pone.0072297...,wblot
2,WesternBlot,PMC,000003,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,,wblots/vertical/v3/018_PMC2691480_pone.0005932...,wblot
3,WesternBlot,PMC,000004,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,,wblots/vertical/v3/002_PMC515284_1471-2091-5-1...,wblot
4,WesternBlot,PMC,000005,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,,wblots/vertical/v3/001_PMC2246075_MBD2007-6737...,wblot
...,...,...,...,...,...,...,...,...
1004,WesternBlot,PMC,001005,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,,wblots/horizontal/h3/012_PMC2734988_pone.00069...,wblot
1005,WesternBlot,PMC,001006,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,,wblots/horizontal/h3/007_PMC3741305_pone.00704...,wblot
1006,WesternBlot,PMC,001007,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,,wblots/horizontal/h3/029_PMC3642079_ppat.10033...,wblot
1007,WesternBlot,PMC,001008,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,,wblots/horizontal/h3/005_PMC4117596_pone.01039...,wblot


In [87]:
wblots_organized.to_csv("organized_wblots.csv")

# 4 - BBBC019v2 Cell migration

This dataset is used for forgeries built from cropped copies of an image that partially overlap each other.

**Download data**

In [None]:
%%bash
mkdir -p BBBC019
cd BBBC019/
wget -N -q "https://data.broadinstitute.org/bbbc/BBBC019/TScratch.zip"
wget -N -q "https://data.broadinstitute.org/bbbc/BBBC019/Melanoma.zip"
wget -N -q "https://data.broadinstitute.org/bbbc/BBBC019/Init.zip"
wget -N -q "https://data.broadinstitute.org/bbbc/BBBC019/SN15.zip"
wget -N -q "https://data.broadinstitute.org/bbbc/BBBC019/Scatter.zip"
wget -N -q "https://data.broadinstitute.org/bbbc/BBBC019/Microfluidic.zip"
wget -N -q "https://data.broadinstitute.org/bbbc/BBBC019/HEK293.zip"
wget -N -q "https://data.broadinstitute.org/bbbc/BBBC019/MDCK.zip"
unzip -q TScratch.zip
unzip -q Melanoma.zip
unzip -q Init.zip
unzip -q SN15.zip
unzip -q Scatter.zip
unzip -q Microfluidic.zip
unzip -q HEK293.zip
unzip -q MDCK.zip

In [65]:
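# skip .ini files with an explicit suffix check; the glob class [!.ini] would also
# drop any file whose name ends in '.', 'i' or 'n'
bbbc019_images = [f for f in glob('BBBC019/*/images/*') if not f.endswith('.ini')]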
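
In glob syntax, a bracket expression such as `[!.ini]` is a character class: it matches one character that is not `.`, `i`, or `n`, rather than excluding the `.ini` suffix. A short sketch with `fnmatch`, which implements the pattern rules that `glob` relies on, shows the difference from an explicit suffix check. The file names are made up for illustration.

```python
from fnmatch import fnmatch

# Made-up file names for illustration
names = ["SN90_C_2_075_dic.tif", "desktop.ini", "plate_map.bin"]

# Bracket pattern: rejects any name whose last character is '.', 'i', or 'n'
by_pattern = [n for n in names if fnmatch(n, "*[!.ini]")]

# Explicit suffix check: rejects only names ending in '.ini'
by_suffix = [n for n in names if not n.lower().endswith(".ini")]

print(by_pattern)
print(by_suffix)
```

For the image extensions that appear in this dataset's tables (`.tif`, `.jpg`) both filters give the same result, so the difference only matters for other file names.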

In [66]:
bbbc019_data = pd.DataFrame()
for index, image in enumerate(bbbc019_images):
    
    dataset_name = image.split('/')[1]
    url = f'https://data.broadinstitute.org/bbbc/BBBC019/{dataset_name}'
    
    data = {}
    data['ID'] = "%06d" %(index+1)
    data['DatasetRef'] ='BBBC019'
    data['RealDataPath'] = image
    data['License'] = 'CC-BY'
    data['Class'] = 'Biological'
    data['ObjectMaskPath'] = None
    data['Link'] = url
    data['subset_tag'] = 'overlap'
    
    bbbc019_data = bbbc019_data.append(data, ignore_index=True)

In [67]:
bbbc019_data

Unnamed: 0,Class,DatasetRef,ID,License,Link,ObjectMaskPath,RealDataPath,subset_tag
0,Biological,BBBC019,000001,CC-BY,https://data.broadinstitute.org/bbbc/BBBC019/M...,,BBBC019/Microfluidic/images/SN90_C_2_075_dic.tif,overlap
1,Biological,BBBC019,000002,CC-BY,https://data.broadinstitute.org/bbbc/BBBC019/M...,,BBBC019/Microfluidic/images/SN90_C_7_000_dic.tif,overlap
2,Biological,BBBC019,000003,CC-BY,https://data.broadinstitute.org/bbbc/BBBC019/M...,,BBBC019/Microfluidic/images/SN90_C_7_075_dic.tif,overlap
3,Biological,BBBC019,000004,CC-BY,https://data.broadinstitute.org/bbbc/BBBC019/M...,,BBBC019/Microfluidic/images/SN90_C_5_075_dic.tif,overlap
4,Biological,BBBC019,000005,CC-BY,https://data.broadinstitute.org/bbbc/BBBC019/M...,,BBBC019/Microfluidic/images/SN90_C_6_000_dic.tif,overlap
...,...,...,...,...,...,...,...,...
160,Biological,BBBC019,000161,CC-BY,https://data.broadinstitute.org/bbbc/BBBC019/T...,,BBBC019/TScratch/images/PC2br0.jpg,overlap
161,Biological,BBBC019,000162,CC-BY,https://data.broadinstitute.org/bbbc/BBBC019/T...,,BBBC019/TScratch/images/starvebr24.jpg,overlap
162,Biological,BBBC019,000163,CC-BY,https://data.broadinstitute.org/bbbc/BBBC019/T...,,BBBC019/TScratch/images/starvecl0.jpg,overlap
163,Biological,BBBC019,000164,CC-BY,https://data.broadinstitute.org/bbbc/BBBC019/T...,,BBBC019/TScratch/images/PC2ar24.jpg,overlap


In [68]:
bbbc019_data.to_csv("organized_bbbc019.csv")

# 5 - BBBC039

This dataset has an object map for each image, which is useful for more realistic forgeries.

Its license is also CC0.

**Download Data**

In [200]:
%%bash
mkdir -p BBBC039 && cd BBBC039
wget -N -q https://data.broadinstitute.org/bbbc/BBBC039/images.zip
wget -N -q https://data.broadinstitute.org/bbbc/BBBC039/masks.zip
unzip -q images.zip
unzip -q masks.zip


Since this data is 16-bit, we convert it to 8-bit.

We also crop each image into 4 non-overlapping areas
to increase the number of examples that we have.

In [69]:
# we use Pillow and NumPy for this:
# pip install pillow
# pip install numpy
from PIL import Image
import numpy as np


In [70]:
def convert16bits_to_8bits_and_crop(image):
    # Min-max normalize the 16-bit image to the 8-bit range [0, 255]
    result = np.array(Image.open(image)).astype(np.float64)
    value_range = result.max() - result.min()
    if value_range > 0:
        result = (result - result.min()) / value_range
    else:
        # constant image: avoid a division by zero
        result = np.zeros_like(result)
    result = (255 * result).astype(np.uint8)

    # Load the matching mask and keep only its first channel
    mask_path = image.replace("images", 'masks').replace(".tif", '.png')
    mask = np.array(Image.open(mask_path))[:, :, 0]

    # Crop image and mask into four non-overlapping quadrants and save them
    for array, src_path, folder, cvt_folder in [(result, image, "images", "cvt_images"),
                                                (mask, mask_path, "masks", "cvt_masks")]:
        h, w = array.shape[:2]
        quadrants = [
            array[0:h//2, 0:w//2],   # part1: top-left
            array[h//2:, 0:w//2],    # part2: bottom-left
            array[h//2:, w//2:],     # part3: bottom-right
            array[0:h//2, w//2:],    # part4: top-right
        ]

        cvt_path = os.path.dirname(src_path.replace(folder, cvt_folder))
        os.makedirs(cvt_path, exist_ok=True)

        # saving cropped quadrants as <name>_part1.png ... <name>_part4.png
        stem = os.path.basename(src_path)[:-4]
        for part, quadrant in enumerate(quadrants, start=1):
            Image.fromarray(quadrant).save(f'{cvt_path}/{stem}_part{part}.png')
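
To make the two steps of the function above concrete, here is a toy run on a synthetic 4x4 16-bit array: min-max scaling to 8 bits, then slicing into the same four quadrants. The array values are invented for illustration.

```python
import numpy as np

# Synthetic 16-bit "image" with values between 1000 and 5000
image16 = np.array([[1000, 2000, 3000, 5000]] * 4, dtype=np.uint16)

# Min-max scaling to [0, 255]; astype truncates the fractional part
scaled = (image16 - image16.min()) / (image16.max() - image16.min())
image8 = (255 * scaled).astype(np.uint8)

# Four non-overlapping quadrants, in the same order as part1..part4
h, w = image8.shape[:2]
quadrants = [
    image8[0:h // 2, 0:w // 2],  # part1: top-left
    image8[h // 2:, 0:w // 2],   # part2: bottom-left
    image8[h // 2:, w // 2:],    # part3: bottom-right
    image8[0:h // 2, w // 2:],   # part4: top-right
]
print(image8[0])
print([q.shape for q in quadrants])
```

Note that the scaling is relative to each image's own minimum and maximum, so absolute intensities are not comparable across converted images.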

In [415]:
all_bbc039_images = glob('BBBC039/images/*')

In [557]:
for image in all_bbc039_images:
    convert16bits_to_8bits_and_crop(image)

**Organizing images in csv**

In [71]:
all_croped_bbc039_images = glob('BBBC039/cvt_images/*')

In [72]:
bbbc039_data = pd.DataFrame()
for index, image in enumerate(all_croped_bbc039_images):
    
    dataset_name = image.split('/')[1]
    url = 'https://data.broadinstitute.org/bbbc/BBBC039/images.zip'
    
    data = {}
    data['ID'] = "%06d" %(index+1)
    data['DatasetRef'] ='BBBC039'
    data['RealDataPath'] = image
    data['License'] = 'CC0'
    data['Class'] = 'Biological'
    data['ObjectMaskPath'] = image.replace("cvt_images",'cvt_masks')
    data['Link'] = url
    data['subset_tag'] = 'obj_map'
    
    bbbc039_data = bbbc039_data.append(data, ignore_index=True)

In [73]:
bbbc039_data

Unnamed: 0,Class,DatasetRef,ID,License,Link,ObjectMaskPath,RealDataPath,subset_tag
0,Biological,BBBC039,000001,CC0,https://data.broadinstitute.org/bbbc/BBBC039/i...,BBBC039/cvt_masks/IXMtest_P09_s4_w11E382363-8C...,BBBC039/cvt_images/IXMtest_P09_s4_w11E382363-8...,obj_map
1,Biological,BBBC039,000002,CC0,https://data.broadinstitute.org/bbbc/BBBC039/i...,BBBC039/cvt_masks/IXMtest_P19_s6_w10EBAD15B-28...,BBBC039/cvt_images/IXMtest_P19_s6_w10EBAD15B-2...,obj_map
2,Biological,BBBC039,000003,CC0,https://data.broadinstitute.org/bbbc/BBBC039/i...,BBBC039/cvt_masks/IXMtest_B24_s9_w18C4FE0DD-12...,BBBC039/cvt_images/IXMtest_B24_s9_w18C4FE0DD-1...,obj_map
3,Biological,BBBC039,000004,CC0,https://data.broadinstitute.org/bbbc/BBBC039/i...,BBBC039/cvt_masks/IXMtest_E19_s2_w1752F162C-2C...,BBBC039/cvt_images/IXMtest_E19_s2_w1752F162C-2...,obj_map
4,Biological,BBBC039,000005,CC0,https://data.broadinstitute.org/bbbc/BBBC039/i...,BBBC039/cvt_masks/IXMtest_H16_s4_w16207B133-B3...,BBBC039/cvt_images/IXMtest_H16_s4_w16207B133-B...,obj_map
...,...,...,...,...,...,...,...,...
795,Biological,BBBC039,000796,CC0,https://data.broadinstitute.org/bbbc/BBBC039/i...,BBBC039/cvt_masks/IXMtest_F08_s1_w144C3056F-C4...,BBBC039/cvt_images/IXMtest_F08_s1_w144C3056F-C...,obj_map
796,Biological,BBBC039,000797,CC0,https://data.broadinstitute.org/bbbc/BBBC039/i...,BBBC039/cvt_masks/IXMtest_P01_s3_w1A7DC2612-9C...,BBBC039/cvt_images/IXMtest_P01_s3_w1A7DC2612-9...,obj_map
797,Biological,BBBC039,000798,CC0,https://data.broadinstitute.org/bbbc/BBBC039/i...,BBBC039/cvt_masks/IXMtest_B23_s1_w152C21D3B-75...,BBBC039/cvt_images/IXMtest_B23_s1_w152C21D3B-7...,obj_map
798,Biological,BBBC039,000799,CC0,https://data.broadinstitute.org/bbbc/BBBC039/i...,BBBC039/cvt_masks/IXMtest_A20_s4_w153DE191F-B1...,BBBC039/cvt_images/IXMtest_A20_s4_w153DE191F-B...,obj_map


In [74]:
bbbc039_data.to_csv("organized_bbbc039.csv")

# 6 - TNBC

This dataset has images with object maps, which will be used for copy-move forgeries of objects.

This dataset is licensed under CC-BY 4.0.


**Downloading the data**

In [581]:
%%bash
mkdir -p TNBC && cd TNBC
wget -N -q https://zenodo.org/record/1175282/files/TNBC_NucleiSegmentation.zip
unzip -q TNBC_NucleiSegmentation.zip

In [75]:
tnbc_images = glob('TNBC/TNBC_NucleiSegmentation/Slide_*/*')

In [76]:
tnbc_data = pd.DataFrame()
for index, image in enumerate(tnbc_images):
    
    dataset_name = image.split('/')[1]
    url = "https://zenodo.org/record/1175282/files/TNBC_NucleiSegmentation.zip"
    
    data = {}
    data['ID'] = "%06d" %(index+1)
    data['DatasetRef'] ='TNBC'
    data['RealDataPath'] = image
    data['License'] = 'CC-BY'
    data['Class'] = 'Biological'
    data['ObjectMaskPath'] = image.replace('Slide','GT')
    data['Link'] = url
    data['subset_tag'] = 'obj_map'
    
    tnbc_data = tnbc_data.append(data, ignore_index=True)
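
The cell above assumes that each image in a `Slide_XX` folder has its ground-truth mask under the same file name in the matching `GT_XX` folder, which is why a plain string replacement is enough. A sketch of that mapping on one path from this dataset follows. `mask_path_for` is a hypothetical helper that swaps only the folder name.

```python
from pathlib import PurePosixPath

def mask_path_for(image_path):
    """Map .../Slide_XX/name.png to .../GT_XX/name.png (hypothetical helper)."""
    p = PurePosixPath(image_path)
    # Replace the prefix of the parent folder only, leaving the file name untouched
    gt_folder = p.parent.name.replace("Slide_", "GT_", 1)
    return str(p.parent.parent / gt_folder / p.name)

image = "TNBC/TNBC_NucleiSegmentation/Slide_04/04_4.png"
mask = mask_path_for(image)
print(mask)
```

For paths like this one the result is identical to `image.replace('Slide','GT')`. The helper only differs if the word "Slide" ever appears elsewhere in a path.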

In [77]:
tnbc_data

Unnamed: 0,Class,DatasetRef,ID,License,Link,ObjectMaskPath,RealDataPath,subset_tag
0,Biological,TNBC,1,CC-BY,https://zenodo.org/record/1175282/files/TNBC_N...,TNBC/TNBC_NucleiSegmentation/GT_04/04_4.png,TNBC/TNBC_NucleiSegmentation/Slide_04/04_4.png,obj_map
1,Biological,TNBC,2,CC-BY,https://zenodo.org/record/1175282/files/TNBC_N...,TNBC/TNBC_NucleiSegmentation/GT_04/04_6.png,TNBC/TNBC_NucleiSegmentation/Slide_04/04_6.png,obj_map
2,Biological,TNBC,3,CC-BY,https://zenodo.org/record/1175282/files/TNBC_N...,TNBC/TNBC_NucleiSegmentation/GT_04/04_5.png,TNBC/TNBC_NucleiSegmentation/Slide_04/04_5.png,obj_map
3,Biological,TNBC,4,CC-BY,https://zenodo.org/record/1175282/files/TNBC_N...,TNBC/TNBC_NucleiSegmentation/GT_04/04_2.png,TNBC/TNBC_NucleiSegmentation/Slide_04/04_2.png,obj_map
4,Biological,TNBC,5,CC-BY,https://zenodo.org/record/1175282/files/TNBC_N...,TNBC/TNBC_NucleiSegmentation/GT_04/04_1.png,TNBC/TNBC_NucleiSegmentation/Slide_04/04_1.png,obj_map
5,Biological,TNBC,6,CC-BY,https://zenodo.org/record/1175282/files/TNBC_N...,TNBC/TNBC_NucleiSegmentation/GT_04/04_3.png,TNBC/TNBC_NucleiSegmentation/Slide_04/04_3.png,obj_map
6,Biological,TNBC,7,CC-BY,https://zenodo.org/record/1175282/files/TNBC_N...,TNBC/TNBC_NucleiSegmentation/GT_04/04_7.png,TNBC/TNBC_NucleiSegmentation/Slide_04/04_7.png,obj_map
7,Biological,TNBC,8,CC-BY,https://zenodo.org/record/1175282/files/TNBC_N...,TNBC/TNBC_NucleiSegmentation/GT_04/04_8.png,TNBC/TNBC_NucleiSegmentation/Slide_04/04_8.png,obj_map
8,Biological,TNBC,9,CC-BY,https://zenodo.org/record/1175282/files/TNBC_N...,TNBC/TNBC_NucleiSegmentation/GT_02/02_3.png,TNBC/TNBC_NucleiSegmentation/Slide_02/02_3.png,obj_map
9,Biological,TNBC,10,CC-BY,https://zenodo.org/record/1175282/files/TNBC_N...,TNBC/TNBC_NucleiSegmentation/GT_02/02_1.png,TNBC/TNBC_NucleiSegmentation/Slide_02/02_1.png,obj_map


In [78]:
tnbc_data.to_csv("organized_tnbc.csv")

-----

# Join all data into one dataset

We use shutil to copy all data to one place.

In [2]:
import os
import shutil
import pandas as pd

In [88]:
bbbc038 = pd.read_csv('organized_bbbc038.csv').drop(columns=['Unnamed: 0','ID'])
bbbc039 = pd.read_csv('organized_bbbc039.csv').drop(columns=['Unnamed: 0','ID'])
bbbc019 = pd.read_csv('organized_bbbc019.csv').drop(columns=['Unnamed: 0','ID'])
pmc = pd.read_csv('organized_pmc.csv').drop(columns=['Unnamed: 0','ID'])
tnbc = pd.read_csv('organized_tnbc.csv').drop(columns=['Unnamed: 0','ID'])
wblots =  pd.read_csv('organized_wblots.csv').drop(columns=['Unnamed: 0','ID'])

In [89]:
dataset = pd.concat([bbbc038,bbbc039,bbbc019,pmc,tnbc,wblots])

In [90]:
dataset = dataset.reset_index(drop=True)
dataset['dataGTPath'] = ''
dataset['dataPath'] = ''

In [91]:
dataset

Unnamed: 0,Class,DatasetRef,License,Link,ObjectMaskPath,RealDataPath,subset_tag,dataGTPath,dataPath
0,Biological,BBBC038,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/58406ed8ef944831c413c3424dc2b07e59aef1...,BBBC038/58406ed8ef944831c413c3424dc2b07e59aef1...,gray,,
1,Biological,BBBC038,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/b82548ab19466b461614e6055aaf49fbc24c03...,BBBC038/b82548ab19466b461614e6055aaf49fbc24c03...,gray,,
2,Biological,BBBC038,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/dabfee30b46d23569c63fa7253ef10b2407fbe...,BBBC038/dabfee30b46d23569c63fa7253ef10b2407fbe...,gray,,
3,Biological,BBBC038,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/f113626a04125d97b27f21b45a0ce9a686d73d...,BBBC038/f113626a04125d97b27f21b45a0ce9a686d73d...,gray,,
4,Biological,BBBC038,CC0,https://data.broadinstitute.org/bbbc/BBBC038/s...,BBBC038/ad9d305cbf193d4250743ead466bdaefe91083...,BBBC038/ad9d305cbf193d4250743ead466bdaefe91083...,gray,,
...,...,...,...,...,...,...,...,...,...
2953,WesternBlot,PMC,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...,,wblots/horizontal/h3/012_PMC2734988_pone.00069...,wblot,,
2954,WesternBlot,PMC,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,,wblots/horizontal/h3/007_PMC3741305_pone.00704...,wblot,,
2955,WesternBlot,PMC,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,,wblots/horizontal/h3/029_PMC3642079_ppat.10033...,wblot,,
2956,WesternBlot,PMC,CC-BY,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,,wblots/horizontal/h3/005_PMC4117596_pone.01039...,wblot,,


Copy all data into a single folder

In [92]:
os.makedirs("srcDataset", exist_ok=True)

In [93]:
for index, row in dataset.iterrows():
    src_name =  row['RealDataPath']
    real_name = os.path.basename(src_name)
    ext = real_name[real_name.rfind("."):]
    # write through dataset.at: assigning to `row` may only change a copy
    data_path = f"srcDataset/{(index+1):05d}{ext}"
    dataset.at[index, 'dataPath'] = data_path
    
    shutil.copy(src_name, data_path)
    
    if isinstance(row['ObjectMaskPath'], str):
        # masks are copied under a fixed "_gt.png" name
        gt_path = f"srcDataset/{(index+1):05d}_gt.png"
        dataset.at[index, 'dataGTPath'] = gt_path
        
        shutil.copy(row['ObjectMaskPath'], gt_path)
    else:
        dataset.at[index, 'dataGTPath'] = None
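
The loop renames every file to a five-digit running index, keeping the original extension for images and writing masks as `_gt.png`. The sketch below shows just the naming rule, without copying any files. `destination_names` is a hypothetical helper, and the second example path is made up.

```python
import os

def destination_names(index, real_path, mask_path=None, root="srcDataset"):
    """Return (dataPath, dataGTPath) for a zero-based row index (hypothetical helper)."""
    # splitext keeps the leading dot, e.g. '.jpg'
    ext = os.path.splitext(real_path)[1]
    data_path = f"{root}/{index + 1:05d}{ext}"
    # Rows without an object mask get no ground-truth path
    gt_path = f"{root}/{index + 1:05d}_gt.png" if isinstance(mask_path, str) else None
    return data_path, gt_path

with_mask = destination_names(
    0,
    "TNBC/TNBC_NucleiSegmentation/Slide_04/04_4.png",
    "TNBC/TNBC_NucleiSegmentation/GT_04/04_4.png",
)
without_mask = destination_names(2953, "wblots/horizontal/h3/example.jpg")
print(with_mask)
print(without_mask)
```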

In [94]:
dataset.to_csv('datasetSrc.csv')