# File Sorting

This notebook sorts files into directories based on their "base names". The base name is the portion of the file name before the first "-" delimiter. The script groups files that share the same base name, considers only `.json` and `.html` files, and ignores files with specific compound extensions such as `.change.history.html`, `.json.html`, `.ttl.html`, and `.xml.html`.
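
As a quick illustration of the base-name rule described above, here is a minimal sketch. The helper `base_name_of`, the constant `IGNORED_SUFFIXES`, and the sample file names are hypothetical and used only for this example; the notebook's own grouping function is defined further down.

```python
# Hypothetical helper for illustration only; not part of the notebook's code.
IGNORED_SUFFIXES = ('.change.history.html', '.json.html', '.ttl.html', '.xml.html')

def base_name_of(filename, delimiter='-'):
    """Return the text before the first delimiter, or None if the file is skipped."""
    if filename.endswith(IGNORED_SUFFIXES):
        return None  # compound extension: ignored
    if not filename.endswith(('.json', '.html')):
        return None  # only .json and .html files are considered
    if delimiter not in filename:
        return None  # no delimiter, so there is no base name to group on
    return filename.split(delimiter, 1)[0]

# Made-up file names, chosen to exercise each branch
for name in ['Location-Example.html', 'Location-Example.json.html', 'index.html']:
    print(name, '->', base_name_of(name))
```

Passing a tuple to `str.endswith` checks all suffixes in one call, which keeps the ignore rule in a single place.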

In [1]:
import os
import io, threading, time, re, json
from collections import defaultdict
import shutil

### Group Files by Base Name

In [2]:
def group_files_by_base_name(directory_path, delimiter='-'):
    """
    Group files in the directory by their base name (portion before a delimiter),
    ignoring specific compound extensions.
    
    Args:
        directory_path (str): Path to the directory containing files.
        delimiter (str): The delimiter to split the file name on (default is '-').
    
    Returns:
        dict: A dictionary where keys are base names and values are lists of files 
             that share the same base name.
    
    Example:
        For files like:
        - doc-v1.html (will be included)
        - doc-v2.json (will be included)
        - doc-v3.change.history.html (will be ignored)
        - doc-v4.json.html (will be ignored)
    """
    grouped_files = defaultdict(list)
    
    # List of compound extensions to ignore
    ignored_extensions = [
        '.change.history.html',
        '.json.html',
        '.ttl.html',
        '.xml.html'
    ]
    
    # Iterate through the files in the directory
    for filename in os.listdir(directory_path):
        # Check if file should be ignored based on compound extensions
        should_ignore = any(filename.endswith(ext) for ext in ignored_extensions)
        
        if not should_ignore:
            # Process only .json or .html files that aren't in the ignore list
            if filename.endswith('.json') or filename.endswith('.html'):
                if delimiter in filename:
                    # Get the base name (before the first delimiter)
                    base_name = filename.split(delimiter)[0]
                    # Append the file to the group corresponding to its base name
                    grouped_files[base_name].append(filename)
    
    return grouped_files

Define the path to the full PlanNet IG directory:

In [3]:
directory_path = '/Users/amathur/Documents/ONCLAIVE/onclaive-aanchalwip/PlanNet/site'

In [4]:
grouped_files = group_files_by_base_name(directory_path)

In [5]:
for base_name, files in grouped_files.items():
    print(f"Base name: {base_name} (Total files: {len(files)})")

Base name: Location (Total files: 18)
Base name: StructureDefinition (Total files: 114)
Base name: SearchParameter (Total files: 102)
Base name: OrganizationAffiliation (Total files: 14)
Base name: ValueSet (Total files: 48)
Base name: CodeSystem (Total files: 28)
Base name: HealthcareService (Total files: 20)
Base name: Endpoint (Total files: 2)
Base name: Organization (Total files: 22)
Base name: PractitionerRole (Total files: 12)
Base name: usage (Total files: 1)
Base name: InsurancePlan (Total files: 4)
Base name: CapabilityStatement (Total files: 2)
Base name: Practitioner (Total files: 6)
Base name: ImplementationGuide (Total files: 1)
Base name: plan (Total files: 1)


With the base names identified, the `copy_files_to_folders` function creates a folder for each base name that has more than one file and copies that group's files into it.

In [6]:
def copy_files_to_folders(directory_path, grouped_files):
    """
    Copy files to folders if the base name group has more than 1 file.
    The originals stay in place; the removal step at the end is commented out.
    
    Args:
        directory_path (str): Path to the directory containing files.
        grouped_files (dict): Dictionary of grouped files by base name.
    """
    for base_name, files in grouped_files.items():
        if len(files) >= 2:  # Only process groups with more than 1 file
            # Create a folder for the base name in the same directory
            base_folder = os.path.join(directory_path, base_name)
            if not os.path.exists(base_folder):
                os.makedirs(base_folder)  # Create the folder if it doesn't exist
                # Report only folders that were actually created on this run
                print(f"Created folder: {base_folder}")
            
            # Copy each file in the group to the new folder
            for file in files:
                source_file = os.path.join(directory_path, file)
                destination_file = os.path.join(base_folder, file)
                shutil.copy(source_file, destination_file)  # Copy the file
                # print(f"Copied {file} to {base_folder}")
                
                # Remove the file from the original directory
                # os.remove(source_file)
                # print(f"Removed {file} from original directory")
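
Before copying anything, it can help to preview what would happen. The following is a minimal dry-run sketch, not part of the original notebook: `plan_copies`, its `min_files` parameter, and the example group contents are all hypothetical. It applies the same "more than one file" rule but only builds the list of source and destination paths.

```python
import os

def plan_copies(directory_path, grouped_files, min_files=2):
    """Return (source, destination) path pairs without touching the filesystem."""
    plan = []
    for base_name, files in grouped_files.items():
        if len(files) < min_files:
            continue  # groups below the threshold get no folder
        base_folder = os.path.join(directory_path, base_name)
        for file in files:
            plan.append((os.path.join(directory_path, file),
                         os.path.join(base_folder, file)))
    return plan

# Made-up groups: one with two files, one with a single file
example_groups = {
    'Endpoint': ['Endpoint-a.html', 'Endpoint-a.json'],
    'usage': ['usage-stats.html'],
}
for src, dst in plan_copies('site', example_groups):
    print(src, '->', dst)
```

Because the function only computes paths, it can be run against the real `grouped_files` dictionary to review the planned layout before any copy is made.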

In [7]:
copy_files_to_folders(directory_path, grouped_files)

Created folder: /Users/amathur/Documents/ONCLAIVE/onclaive-aanchalwip/PlanNet/site/Location
Created folder: /Users/amathur/Documents/ONCLAIVE/onclaive-aanchalwip/PlanNet/site/StructureDefinition
Created folder: /Users/amathur/Documents/ONCLAIVE/onclaive-aanchalwip/PlanNet/site/SearchParameter
Created folder: /Users/amathur/Documents/ONCLAIVE/onclaive-aanchalwip/PlanNet/site/OrganizationAffiliation
Created folder: /Users/amathur/Documents/ONCLAIVE/onclaive-aanchalwip/PlanNet/site/ValueSet
Created folder: /Users/amathur/Documents/ONCLAIVE/onclaive-aanchalwip/PlanNet/site/CodeSystem
Created folder: /Users/amathur/Documents/ONCLAIVE/onclaive-aanchalwip/PlanNet/site/HealthcareService
Created folder: /Users/amathur/Documents/ONCLAIVE/onclaive-aanchalwip/PlanNet/site/Endpoint
Created folder: /Users/amathur/Documents/ONCLAIVE/onclaive-aanchalwip/PlanNet/site/Organization
Created folder: /Users/amathur/Documents/ONCLAIVE/onclaive-aanchalwip/PlanNet/site/PractitionerRole
Created folder: /Users/a

14 folders created
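
A folder count like the one above can be checked programmatically. This is a small sketch under stated assumptions: `count_subfolders` is a hypothetical helper, and it counts every immediate subdirectory, including any that existed before the notebook ran. The demo uses a temporary directory rather than the real site path.

```python
import os
import tempfile

def count_subfolders(directory_path):
    """Count the immediate subdirectories of directory_path."""
    with os.scandir(directory_path) as entries:
        return sum(1 for entry in entries if entry.is_dir())

# Demo on a throwaway directory with two folders and one plain file
with tempfile.TemporaryDirectory() as tmp:
    for name in ('Location', 'ValueSet'):
        os.makedirs(os.path.join(tmp, name))
    open(os.path.join(tmp, 'index.html'), 'w').close()
    demo_count = count_subfolders(tmp)
    print(f"{demo_count} folders found")
```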