*Note: this notebook uses Python functions, so it should be run on Serverless notebook compute (preferred) or on a Standard compute cluster.*

# Copying Files from a Volume to Another Volume
***

## Notebook Setup
***

Derive the schema name from the current user. It is used to build the `target_volume` path dynamically.  

In [0]:
schema_use = spark.sql("SELECT REPLACE(SPLIT(current_user(), '@')[0], '.', '_')").collect()[0][0]
schema_use

In [0]:
source_volume = "/Volumes/fhir_workshop/synthea/synthetic_files_raw/output/fhir/"
target_volume = f"/Volumes/fhir_workshop/{schema_use}/landing/"
target_volume
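
The `schema_use` query above keeps the part of the user's email before the `@` and replaces dots with underscores. As a minimal sketch, the same transformation in plain Python looks like this. The helper name `schema_from_email` and the email address are made up for illustration and are not part of the notebook.

```python
def schema_from_email(email: str) -> str:
    """Mirror REPLACE(SPLIT(email, '@')[0], '.', '_') in plain Python."""
    local_part = email.split("@")[0]      # text before the first '@'
    return local_part.replace(".", "_")   # dots become underscores


# Hypothetical address, for illustration only.
example_schema = schema_from_email("jane.doe@example.com")
print(example_schema)  # jane_doe
print(f"/Volumes/fhir_workshop/{example_schema}/landing/")
```

Each user therefore lands in a schema named after their own login, which keeps the target volumes separate.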

## Using Shell Commands to Interact with Volumes 
***

One of the best parts about using Volumes to reference your cloud storage is that they provide a POSIX-style way of interacting with the storage account, abstracting away the complexities of each cloud provider's various APIs.  Volumes also allow the use of standard shell commands such as `ls`, `cp`, `mv`, and `mkdir` as though the files were on local storage.  This makes interacting with files in Databricks straightforward.  

When connected to Serverless Notebook compute or a Standard or Dedicated cluster (i.e. not a SQL warehouse), we're able to use the `%sh` magic command to directly execute shell commands.  

In [0]:
%sh

ls -alt /Volumes/fhir_workshop/synthea/synthetic_files_raw/output/fhir/ | head -n 10;

We can even use `cat` to inspect the contents of one of the files directly in the notebook.  

In [0]:
%sh 

cat /Volumes/fhir_workshop/synthea/synthetic_files_raw/output/fhir/DBPriorAuthExample.json

Shell context is preserved within a code chunk, but not between code chunks.  This means we can separate multiple shell commands in the same code chunk with semicolons, including using `cd` to change into the volume path!  
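
The behavior can be illustrated outside of a notebook. The sketch below is only an analogy: it assumes each `%sh` chunk behaves like a separate shell invocation, as described above, and it uses a temporary directory rather than a Volume path. The helper `run_shell` is a hypothetical name introduced for this example.

```python
import os
import shlex
import subprocess
import tempfile


def run_shell(script: str) -> str:
    """Run a script in a fresh shell process and return its trimmed stdout."""
    result = subprocess.run(
        ["sh", "-c", script], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_dir = os.path.realpath(tmp_dir)  # resolve symlinks so paths compare equal
    # `cd` and `pwd` share one invocation, so the directory change is visible.
    same_invocation = run_shell(f"cd {shlex.quote(tmp_dir)}; pwd -P")
    # A second invocation starts over from the parent process's directory.
    next_invocation = run_shell("pwd -P")

print(same_invocation == tmp_dir)  # True
print(next_invocation == tmp_dir)  # False
```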

Let's compare this FHIR bundle to another bundle, but in a more programmatic way.  

In [0]:
%sh 

cd /Volumes/fhir_workshop/synthea/synthetic_files_raw/output/fhir/;
ls A*.json | head -n 1;
ls A*.json | head -n 1 | awk '{print $1}' | xargs cat;

The motivation for this course comes from the complexity of these FHIR bundles.  Two FHIR JSON bundles will rarely, if ever, have exactly the same structure, and therefore the same schema.  While great for transmitting data between organizations, they are not well suited to much else.  You need an easy way to parse them, and that's what the rest of this course is all about.  
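
To make the structural differences concrete, here is a minimal sketch that counts the resource types in two bundles and reports which types appear in only one of them. The two bundles are tiny made-up examples, not files from the source volume, and `resource_type_counts` is a hypothetical helper. It assumes the standard FHIR Bundle layout, where each item in `entry` holds a `resource` with a `resourceType`.

```python
import json
from collections import Counter


def resource_type_counts(bundle: dict) -> Counter:
    """Count the resourceType of every entry in a FHIR Bundle dictionary."""
    return Counter(
        entry.get("resource", {}).get("resourceType", "unknown")
        for entry in bundle.get("entry", [])
    )


# Two tiny made-up bundles standing in for real files.
bundle_a = json.loads("""
{"resourceType": "Bundle", "entry": [
  {"resource": {"resourceType": "Patient"}},
  {"resource": {"resourceType": "Encounter"}},
  {"resource": {"resourceType": "Encounter"}}
]}
""")
bundle_b = json.loads("""
{"resourceType": "Bundle", "entry": [
  {"resource": {"resourceType": "Patient"}},
  {"resource": {"resourceType": "Claim"}}
]}
""")

counts_a = resource_type_counts(bundle_a)
counts_b = resource_type_counts(bundle_b)

# Resource types that appear in one bundle but not the other.
only_in_a = sorted(set(counts_a) - set(counts_b))
only_in_b = sorted(set(counts_b) - set(counts_a))
print(only_in_a)  # ['Encounter']
print(only_in_b)  # ['Claim']
```

For a real file, the bundle dictionary could be loaded with `json.load` on a file opened from the volume path.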

## Define a Function to Copy Files with a File Pattern
***

During the first iteration of this course, the source volume contained nearly 100K FHIR JSON bundles, which would have taken too long to copy over for everyone in the course.  Therefore an array of file patterns was used to copy over approximately 1,000 bundles from the source volume to the target volume.  The below function also makes use of standard Python libraries that interact with local file storage.  You may be used to using these already on your laptop or Linux VMs.  With Volumes you may use these same functions without the need to learn each cloud's storage APIs.  

In [0]:
import shutil
import glob
import os

def copy_files(source_volume, target_volume, file_pattern):
    # Ensure the source and target volume paths end with a slash
    if not source_volume.endswith('/'):
        source_volume += '/'

    if not target_volume.endswith('/'):
        target_volume += '/'

    # Default to matching every file when no pattern is given
    if file_pattern is None:
        file_pattern = '*'

    # Use glob to locate files based on the file pattern
    files = glob.glob(os.path.join(source_volume, file_pattern))

    # Copy each file to the target volume, replacing any existing copy
    for file in files:
        target_file = os.path.join(target_volume, os.path.basename(file))
        if os.path.exists(target_file):
            os.remove(target_file)
        shutil.copy2(file, target_file)

    return f"Copied {len(files)} files."

Uncomment the first line of the code chunk below and comment out the last line to use file patterns based on the start of the FHIR bundle names.  If the source volume contains fewer than 2K files, it's fine to copy everything with a file pattern of `"*"`.  

In [0]:
# file_patterns = ["Aa*.json", "Ab*.json", "Ad*.json", "Af*.json", "Ag*.json", "Ah*.json", "Ai*.json", "Aj*.json", "Ak*.json"]
file_patterns = ["*"]
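
Before copying, it can help to check how many files each pattern would match. The sketch below is one possible dry run, not part of the original notebook: `count_matches` is a hypothetical helper, and the demonstration uses a temporary directory with made-up file names instead of the source volume.

```python
import glob
import os
import tempfile


def count_matches(source_dir: str, patterns: list[str]) -> dict[str, int]:
    """Return how many regular files in source_dir match each glob pattern."""
    counts = {}
    for pattern in patterns:
        matches = glob.glob(os.path.join(source_dir, pattern))
        counts[pattern] = sum(1 for path in matches if os.path.isfile(path))
    return counts


# Demonstration on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["Aaron.json", "Abby.json", "Abel.json", "Zed.json"]:
        open(os.path.join(demo_dir, name), "w").close()
    demo_counts = count_matches(demo_dir, ["Aa*.json", "Ab*.json", "*"])

print(demo_counts)  # {'Aa*.json': 1, 'Ab*.json': 2, '*': 4}
```

In the notebook, the same helper could be called as `count_matches(source_volume, file_patterns)` to decide whether `"*"` is small enough to copy in full.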

In [0]:
for file_pattern in file_patterns:
    print(copy_files(source_volume, target_volume, file_pattern))

In [0]:
# files_to_remove = dbutils.fs.ls(target_volume)
# for file in files_to_remove:
#     if file.name.startswith("Al") and file.name.endswith(".json"):
#         dbutils.fs.rm(file.path)