# MOTHER-DB project

Image segmentation of ovarian follicles, part of https://mother-db.org/

**Program 2, Split Folders_V5_2.ipynb, Version 5_2,  March 12, 2025:**<BR>
Partition sub-images into Train, Test and Validate folders.

## About this project
This project developed and uses AI/ML techniques to segment histology images from the ovaries of nonhuman primates. Specifically, this suite of programs attempts to identify the following six follicle types: 
 1. Primordial
 1. Transitional Primordial
 1. Primary
 1. Transitional Primary
 1. Secondary
 1. Multilayer
 
The follicle type definitions are based on the recommendations of the NICHD-Sponsored Ovarian Nomenclature 
Workshop committee for primates.
 1. Yano Maher JC, Zelinski MB, Oktay KH, Duncan FE, Segars JH, Lujan ME, Lou H, Yun B, Hanfling SN, Schwartz LE, 
Laronda MM, Halvorson LM, O'Neill KE, Gomez-Lobo V. Fertil Steril. 2024 Nov 14:S0015-0282(24)02394-X. 
doi: 10.1016/j.fertnstert.2024.11.016. Epub ahead of print. PMID: 39549739. https://pubmed.ncbi.nlm.nih.gov/39549739/  

## About this program module

This program takes as input a set of sub-images of follicles, organized in folders by follicle type, and creates 
Train, Test and Validation sets. Each set is in its own folder, and within those folders are subfolders for each of
the follicle types.

### Key notes:
 * Images in an augmented set (rotations, shifts, etc.) all go to the same train/test/validate folder. 
 * The size of an augmented set is defined by the parameter `groups`.
 * Note that this program does not use the parameters.py file.
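
The group-wise assignment described in the notes above can be sketched as follows. This is a minimal illustration, not the notebook's own function: the helper name `assign_groups`, the `seed` argument and the demo file names are hypothetical. It assumes that sorting the file names places all members of an augmented set next to each other, and it also assigns a trailing partial group if one exists.

```python
import numpy as np

def assign_groups(file_names, groups, ratios, seed=None):
    """Assign each consecutive block of `groups` sorted file names to one split."""
    rng = np.random.default_rng(seed)
    splits = list(ratios)                     # e.g. ['train', 'validate', 'test']
    probs = [ratios[s] for s in splits]
    names = sorted(file_names)
    assignment = {}
    for start in range(0, len(names), groups):
        # one random draw per augmented set, shared by every file in the set
        split = str(rng.choice(splits, p=probs))
        for name in names[start:start + groups]:
            assignment[name] = split
    return assignment

# Hypothetical file names: 3 source sub-images with 4 augmented variants each.
demo_files = [f"img{i}_aug{j}.png" for i in range(3) for j in range(4)]
demo = assign_groups(demo_files, groups=4,
                     ratios={"train": 0.75, "validate": 0.05, "test": 0.20},
                     seed=0)
for i in range(3):
    # each set of variants maps to exactly one split name
    print(f"img{i}", {demo[f"img{i}_aug{j}.png"] for j in range(4)})
```

Drawing once per set, rather than once per file, is what keeps rotated or shifted copies of the same follicle from appearing in both the training and the test data.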
    
### Changes
* Dec 15, 2023: Now renames the output folders for each follicle type (in the Test, Train and Validate folders) with leading numbers. E.g., the "Primordial" folder becomes "2_Primordial" and "Negative" becomes "1_Negative".
* Feb 2, 2024: Now numbers the output folders for each follicle type starting at "0", so "Negative" becomes "0_Negative".

### Problems
The program may run into the Windows 11 limit of ~260 characters for the combined path and file name, with filenames like<br>
`"C:\Users\jsluka\OneDrive - Indiana University\Desktop\Work\Watanabe ovary 2021\MOTHER\FADS May 2023\Program 1\Train Images\Transitional Primordial\14736_UN_050a.ome_Transitional Primordial_x1273_y4355_w150offset42_horizontal_angle0.png"`
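
One way to spot this problem before copying is to measure the absolute length of each destination path. The sketch below is illustrative only: the helper name `paths_over_limit` and the demo paths are hypothetical, and the threshold of 260 characters is taken from the approximate limit mentioned above.

```python
import os

def paths_over_limit(paths, limit=260):
    """Return the absolute paths whose length reaches or exceeds `limit` characters."""
    absolute = (os.path.abspath(p) for p in paths)
    return [p for p in absolute if len(p) >= limit]

# Hypothetical destination paths: one short, one deliberately very long.
short_name = os.path.join("train", "Primordial", "a.png")
long_name = os.path.join("train", "Transitional Primordial", "x" * 300 + ".png")

flagged = paths_over_limit([short_name, long_name])
print(len(flagged), "path(s) at or over the limit")
```

Running such a check on the planned destination paths gives an early warning, so that a shorter `DEST_PATH` can be chosen before a long copy job starts.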

### To do:
 * Write the split information (fraction in train, test, ...) to the project log file.
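
A possible shape for this to-do item is sketched below. It is only an assumption about how the logging could look: the function `log_split_summary`, the log file name `project_log.txt`, the line format and the counts are all hypothetical and are not part of the existing project.

```python
import datetime
import os
import tempfile

def log_split_summary(log_path, counts):
    """Append one timestamped line per split with its file count and fraction."""
    total = sum(counts.values())
    stamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a", encoding="utf-8") as log:
        for split, n in counts.items():
            fraction = n / total if total else 0.0
            log.write(f"{stamp}\tsplit={split}\tfiles={n}\tfraction={fraction:.3f}\n")

# Illustrative counts only; the log file lives in a throwaway directory.
demo_log = os.path.join(tempfile.mkdtemp(), "project_log.txt")
log_split_summary(demo_log, {"train": 1200, "test": 320, "validate": 80})

with open(demo_log, encoding="utf-8") as log:
    log_lines = log.read().splitlines()
print("\n".join(log_lines))
```

Opening the file in append mode keeps earlier runs in the log, which matches the idea of a single project-wide log file.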

## About the input data

The input data consists of the sub-image sets created by Train Image Generation.ipynb (Program 1). 
The sub-images are in a folder, e.g., ...\Program 1\Train Images_2025-03-12_18-33-13\ (`SOURCE_PATH`), and within 
that folder the program expects one folder for each follicle type being processed (see the follicle type list above).
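
Before splitting, it can help to confirm that the expected follicle-type folders are present. The following is a minimal sketch under stated assumptions: the helper `check_source_folders` and the `EXPECTED_TYPES` list (copied from the six follicle types listed earlier) are hypothetical, and a temporary directory stands in for `SOURCE_PATH`.

```python
import os
import tempfile

# The six follicle types named in this document.
EXPECTED_TYPES = ["Primordial", "Transitional Primordial", "Primary",
                  "Transitional Primary", "Secondary", "Multilayer"]

def check_source_folders(source_path, expected=EXPECTED_TYPES):
    """Return (missing, unexpected) follicle-type folder names under source_path."""
    present = {d for d in os.listdir(source_path)
               if os.path.isdir(os.path.join(source_path, d))}
    missing = [t for t in expected if t not in present]
    unexpected = sorted(present - set(expected))
    return missing, unexpected

# Demo on a throwaway directory standing in for SOURCE_PATH.
demo_source = tempfile.mkdtemp()
for folder in ("Primordial", "Primary", "Negative"):
    os.makedirs(os.path.join(demo_source, folder))

missing, unexpected = check_source_folders(demo_source)
print("missing:", missing)
print("unexpected:", unexpected)
```

An "unexpected" folder is not necessarily an error. A "Negative" folder, for instance, is a legitimate class in this notebook's `outputDirs` mapping even though it is not one of the six follicle types.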

## About the output data

This program creates a new folder (`DEST_PATH`), and within that folder will be Train, Test and Validate folders.
Within each of those folders will be subfolders for each follicle type. The other major output is the .html version of
this Jupyter notebook, which will contain the parameters used and the final counts.
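
The resulting layout can be inspected by counting the files in each split and follicle-type subfolder. The sketch below is illustrative: the helper `count_outputs` is hypothetical, and the demo builds a small throwaway tree in place of a real `DEST_PATH`.

```python
import os
import tempfile

def count_outputs(dest_path, splits=("train", "test", "validate")):
    """Count the files in every <split>/<follicle type> folder under dest_path."""
    counts = {}
    for split in splits:
        split_dir = os.path.join(dest_path, split)
        if not os.path.isdir(split_dir):
            continue  # this split folder was not created
        for cls in sorted(os.listdir(split_dir)):
            cls_dir = os.path.join(split_dir, cls)
            if os.path.isdir(cls_dir):
                counts[(split, cls)] = len(os.listdir(cls_dir))
    return counts

# Build a tiny stand-in for DEST_PATH with empty placeholder files.
demo_dest = tempfile.mkdtemp()
for split, n_files in (("train", 3), ("test", 1), ("validate", 0)):
    cls_dir = os.path.join(demo_dest, split, "1_Primordial")
    os.makedirs(cls_dir)
    for k in range(n_files):
        open(os.path.join(cls_dir, f"img_{k}.png"), "w").close()

demo_counts = count_outputs(demo_dest)
for (split, cls), n in demo_counts.items():
    print(f"{split:10s}{cls:15s}{n}")
```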

## About the MOTHER project and MOTHER-DB

The Multispecies Ovary Tissue Histology Electronic Repository (MOTHER) provides public access to digitized microscopic images of ovary tissues along with information that ensures image integrity and quality. Currently, there is no electronic repository of ovary histology slides that preserves these valuable research collections for future generations. MOTHER is a web-accessible, open resource for scientists, educators, and the public to stimulate collaboration and scientific research. Educators may use the slide images in a range of courses from reproductive biology to teaching computerized image analysis.

Biology is increasingly dependent upon quantitative data analysis, and MOTHER should inspire computational thinking in biology broadly, while developing specific skills in microscopy, computer programming, and data and image analysis.

## License For Use

This work is licensed under CC BY-NC-SA 4.0. To view a copy of this license, 
visit https://creativecommons.org/licenses/by-nc-sa/4.0/

## Funding

MOTHER-DB and this project were funded by:
 * Grant “CIBR Multispecies Ovary Tissue Histology Electronic Repository (MOTHER)” from the National Science Foundation (NSF DBI-2054061, 2021 – 2024). 
 * Indiana University, Faculty Assistance in Data Science (FADS) Project
 * Arizona State University

## Contributors
Many people have contributed to this project:

 * Code development
   * James Sluka, Indiana University
   * Karen Watanabe, Arizona State University
   * Riley Israels, Arizona State University
   * Parth Ravindra Rao, Indiana University 
   * Param Nagda, Indiana University
   * Colette Lund, Arizona State University

 * Training data creation
   * Mary Zelinski, Oregon National Primate Research Center
   * Karen Watanabe, Arizona State University
   * Numerous Arizona State undergraduate and graduate students

## Imports, Settings and Parameters

**SOURCE_PATH** is where the input sub-images folders are.<br>
**DEST_PATH** is where to write the files to. This code will create a new top-level folder (DEST_PATH) and its subdirectories.<br><br>
**groups** is how many files are in an augmented set (the original subimage plus the rotations, shifts, ...), usually 12 (was 16).<br><br>
**train_ratio** is the fraction of files that go into the train data set.<br>
**validate_ratio** is the fraction of files that go into the validation data set (often just zero).<br>
**test_ratio** is the fraction of files that go into the test data set.<br>

**outputDirs** maps the input follicle type directory names to the output directory names.
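
Because the three ratios are used as probabilities when each augmented set is assigned, they should be non-negative and sum to 1 (NumPy's `choice` rejects probabilities that do not sum to 1). A minimal sketch of an early check follows. The function name `check_ratios` and the tolerance are assumptions and are not part of the notebook.

```python
import math

def check_ratios(train_ratio, validate_ratio, test_ratio):
    """Raise ValueError unless the ratios are non-negative and sum to 1."""
    ratios = (train_ratio, validate_ratio, test_ratio)
    if any(r < 0 for r in ratios):
        raise ValueError(f"ratios must be non-negative, got {ratios}")
    if not math.isclose(sum(ratios), 1.0, abs_tol=1e-8):
        raise ValueError(f"ratios must sum to 1, got a sum of {sum(ratios)}")
    return ratios

# The values used in this notebook pass the check.
checked = check_ratios(0.75, 0.05, 0.20)
print("ratios OK:", checked)
```

Calling such a check right after the parameters are set reports a mistyped ratio immediately, rather than after the output folders have been created.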

In [1]:
import os
import numpy as np
import shutil
from numpy.random import choice
import time
import datetime
from ipylab import JupyterFrontEnd

#SOURCE_PATH = os.path.join('../../Data/Intermediate/Primordial 24-10-29__15_30_38/Train Images/') # top level directory of folders for each follicle type from Program 1
#DEST_PATH   = os.path.join('../../Data/Training/Primordial/v10')   #  write to this folder

#SOURCE_PATH = os.path.join('../Program 1/Train Images_2025-03-12_18-33-13/') # earlier input set, kept for reference
SOURCE_PATH = os.path.join('../Program 1/Train Images_2025-04-09_13-08-20/') # top level directory of folders for each follicle type from Program 1

outfolder = os.path.join('./Training_data_'+datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))
print('\nWill write the output to the folder:',outfolder)
DEST_PATH   = os.path.join(outfolder)   #  write to this folder

groups = 16   # groups is how many files are in an augmented set (the original subimage plus the rotations, shifts, 
              # reflections, ...). Usually 16

train_ratio    = 0.75  # the fraction of files that go into the train data set
validate_ratio = 0.05  # the fraction of files that go into the validation data set
test_ratio     = 0.20  # the fraction of files that go into the testing data set


Will write the output to the folder: ./Training_data_2025-04-10_15-32-08


### Subroutines and functions

In [2]:
# Deciding train, test or validate for each individual group
def split_my_data(SOURCE_PATH, DEST_PATH, groups, train_ratio, validate_ratio, test_ratio):
    
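    # each sub-folder of SOURCE_PATH is one class (follicle type); ignore stray files
    classes = [d for d in os.listdir(SOURCE_PATH) if os.path.isdir(os.path.join(SOURCE_PATH, d))]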
    if not os.path.exists(DEST_PATH):
        os.makedirs(DEST_PATH)
        print('Created output directory:',DEST_PATH)
    
    # collect some statistics
    trainCount = 0
    testCount  = 0
    validateCount = 0
    
    # Remove output folders left over from a previous run
    print('Removing old output folders...')
#   for filename in os.listdir(DEST_PATH):
    for filename in ('test','train','validate'):
        file_path = os.path.join(DEST_PATH, filename)
        print('\tdelete:',file_path)
        try:
            if os.path.isfile(file_path) or os.path.islink(file_path):
                os.unlink(file_path)
            elif os.path.isdir(file_path):
                shutil.rmtree(file_path)
        except Exception as e:
            print('Failed to delete %s. Reason: %s' % (file_path, e))

    print('\nStarting main (this may take a while) ...')
    for i in classes:
        # create the per-class output folders (no error if they already exist)
        for split in ('train', 'validate', 'test'):
            os.makedirs(os.path.join(DEST_PATH, split, i), exist_ok=True)
        source = os.path.join(SOURCE_PATH,i)
        allFileNames = sorted(os.listdir(source))
        #print('\nDoing: %-25s  source dir: %s' % (i,source))
        print('\nDoing:',i)
        print('\tsource dir:',source)
        print('\tfile index (max = ',len(allFileNames),') = ',sep='',end='')


        # step through complete augmented sets; the "+ 1" keeps the final complete set
        for row in range(0, len(allFileNames) - groups + 1, groups):
            if row % 5000 < groups:
                print(row,end=' ')
                
            # decide between train, validate and test with the requested probabilities
            loc = choice(['train', 'validate', 'test'], p=[train_ratio, validate_ratio, test_ratio])

            tempFileNames = allFileNames[row:row+groups]

            if loc == 'train':
                train_FileNames = [source+'/'+ name for name in tempFileNames]
                for name in train_FileNames:
                    shutil.copy(name, DEST_PATH +'/train/' + i)
                    trainCount += 1
            elif loc == 'validate':
                validate_FileNames = [source+'/'+ name for name in tempFileNames]
                for name in validate_FileNames:
                    shutil.copy(name, DEST_PATH +'/validate/' + i)
                    validateCount += 1
            elif loc == 'test':
                test_FileNames = [source+'/' + name for name in tempFileNames]
                for name in test_FileNames:
                    shutil.copy(name, DEST_PATH +'/test/' + i)
                    testCount += 1
            else:
                print("Oops")
                
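    totalCount = testCount + trainCount + validateCount
    if totalCount == 0:
        # avoid a division by zero in the summary below
        print('\nNo files were copied; check SOURCE_PATH and groups.')
        return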
    print('\n\n\nSummary:')
    print('Requested: train=%5.1f%%, test=%5.1f%%, validate=%5.1f%%' % \
            (100*train_ratio,100*test_ratio,100*validate_ratio))
    print('Total= %i, train= %i (%4.1f%%), test= %i (%4.1f%%), validate= %i (%4.1f%%)' % \
            (totalCount,trainCount,trainCount/totalCount*100.,testCount,testCount/totalCount*100.,validateCount,validateCount/totalCount*100.))
    print("\nsplit_my_data, Done!\n") 


### Rename the output directories, e.g., "Primordial" --> "1_Primordial"

In [3]:
outputDirs = {               "Negative":"0_Negative",
                           "Primordial":"1_Primordial",
              "Transitional Primordial":"2_Transitional Primordial",
                              "Primary":"3_Primary",
                 "Transitional Primary":"4_Transitional Primary",
                            "Secondary":"5_Secondary",
                           "Multilayer":"6_Multilayer",
                               "Antral":"7_Antral",
                       "Atretic Antral":"8_Atretic Antral" }
print(outputDirs)

def renameDirs(DEST_PATH):
    for folder in ("train","test","validate"):
        folderPath = os.path.join(DEST_PATH,folder)
        print('\nRenaming folders in:',folder,'\t',folderPath)
        subfolders = os.listdir(folderPath)
        print('\tsubfolders list:',subfolders)
        for subsubfolder in subfolders:
            print('\t\tdoing subsubfolder:',subsubfolder)
            if os.path.isdir(os.path.join(folderPath,subsubfolder)):
                dirRoot = os.path.basename(subsubfolder)
                if dirRoot in outputDirs:
                    os.rename(os.path.join(folderPath,subsubfolder),os.path.join(folderPath,outputDirs[dirRoot]))
                else:
                    print('\t\t\tnot found in "outputDirs" (not a follicle type?):',dirRoot)
            else:
                print('\t\t\tnot a folder')
    return()

{'Negative': '0_Negative', 'Primordial': '1_Primordial', 'Transitional Primordial': '2_Transitional Primordial', 'Primary': '3_Primary', 'Transitional Primary': '4_Transitional Primary', 'Secondary': '5_Secondary', 'Multilayer': '6_Multilayer', 'Antral': '7_Antral', 'Atretic Antral': '8_Atretic Antral'}


### Main Code 

In [4]:
split_my_data(SOURCE_PATH, DEST_PATH, groups, train_ratio, validate_ratio, test_ratio)
renameDirs(DEST_PATH)

Created output directory: ./Training_data_2025-04-10_15-32-08
Removing old output folders...
	delete: ./Training_data_2025-04-10_15-32-08\test
	delete: ./Training_data_2025-04-10_15-32-08\train
	delete: ./Training_data_2025-04-10_15-32-08\validate

Starting main (this may take a while) ...

Doing: Multilayer
	source dir: ../Program 1/Train Images_2025-04-09_13-08-20/Multilayer
	file index (max = 2752) = 0 
Doing: Negative
	source dir: ../Program 1/Train Images_2025-04-09_13-08-20/Negative
	file index (max = 497536) = 0 5008 10000 15008 20000 25008 30000 35008 40000 45008 50000 55008 60000 65008 70000 75008 80000 85008 90000 95008 100000 105008 110000 115008 120000 125008 130000 135008 140000 145008 150000 155008 160000 165008 170000 175008 180000 185008 190000 195008 200000 205008 210000 215008 220000 225008 230000 235008 240000 245008 250000 255008 260000 265008 270000 275008 280000 285008 290000 295008 300000 305008 310000 315008 320000 325008 330000 335008 340000 345008 350000 35500

()

## Convert Notebook to HTML and save

Programmatically save the notebook, convert it to HTML, and rename the .html file with a timestamp.

Make sure the __NOTEBOOK_NAME__ and __NOTEBOOK_HTML_NAME__ are properly defined.
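
The timestamped file name can also be derived from the notebook name without hard-coding the length of the extension. The helper below is a hypothetical sketch, not the notebook's own code. It uses `os.path.splitext` and accepts an optional fixed time so that the result is reproducible.

```python
import datetime
import os

def timestamped_html_name(notebook_name, now=None):
    """Build '<notebook stem>_<timestamp>.html' from a notebook file name."""
    now = now or datetime.datetime.now()
    stem, _ext = os.path.splitext(notebook_name)
    return f"{stem}_{now.strftime('%Y-%m-%d_%H-%M-%S')}.html"

fixed_time = datetime.datetime(2025, 4, 10, 18, 57, 24)
html_name = timestamped_html_name("Split Folders_V5_2.ipynb", fixed_time)
print(html_name)  # → Split Folders_V5_2_2025-04-10_18-57-24.html
```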

In [5]:
#Needed for the next command to work, for some reason
time.sleep(1)

#Programmatically save the notebook

APP = JupyterFrontEnd() #Needed to save the notebook programmatically later, do not change.
APP.commands.execute("docmanager:save")

#Convert the notebook to html
NOTEBOOK_NAME = "Split Folders_V5_2.ipynb" #The exact name of this notebook, including the file extension.
print('NOTEBOOK_NAME:',NOTEBOOK_NAME)
!jupyter nbconvert --to html "$NOTEBOOK_NAME"

#Rename the .html file with a timestamp
NOTEBOOK_HTML_NAME = "Split Folders_V5_2_" + datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + ".html"
print('NOTEBOOK_HTML_NAME:',NOTEBOOK_HTML_NAME)
shutil.move(NOTEBOOK_NAME[:-6] + ".html",NOTEBOOK_HTML_NAME)


NOTEBOOK_NAME: Split Folders_V5_2.ipynb
NOTEBOOK_HTML_NAME: Split Folders_V5_2_2025-04-10_18-57-24.html


[NbConvertApp] Converting notebook Split Folders_V5_2.ipynb to html
[NbConvertApp] Writing 619704 bytes to Split Folders_V5_2.html


'Split Folders_V5_2_2025-04-10_18-57-24.html'