# Introduction

This Notebook demonstrates how to work with files on a virtual machine (VM) and in Google buckets for common tasks, including how to explore, copy, and move files on a VM, in Google buckets, or between a VM and Google buckets.

Although these tasks can be done in the terminal, this tutorial demonstrates how to perform them in a Notebook by using four widely used signs:  
- exclamation mark (!), which runs a shell command from a code cell, 
- dollar sign ($), which refers to a variable, 

and two built-in magic signs: 
- percent sign (%), followed by one line of code 
- double percent sign (%%), followed by several lines of code



**List all magic commands**

The cell below lists every available magic command. For the purpose of this tutorial, we will only demonstrate a few of them.

In [None]:
%lsmagic

# How to get main Workspace environment (env) variables

**Method 1:** Using the os module

In [None]:
import os

In [None]:
google_project=os.environ["GOOGLE_PROJECT"]
google_project

In [None]:
my_bucket = os.getenv('WORKSPACE_BUCKET')
my_bucket

In [None]:
DATASET=os.environ["WORKSPACE_CDR"]
DATASET

**Method 2:** Using %env

In [None]:
%env GOOGLE_PROJECT

In [None]:
%env WORKSPACE_BUCKET

In [None]:
%env WORKSPACE_CDR

**We can directly use these variables** by using $

In [None]:
!echo $WORKSPACE_BUCKET
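
When a Notebook may run somewhere these variables are not defined, os.environ[...] raises a KeyError, whereas os.getenv returns None. The sketch below is a minimal illustration under that assumption; the helper name read_workspace_env is hypothetical and not part of the Workbench.

```python
import os

# Variable names used in this tutorial; other environments may not define them.
WORKSPACE_VARS = ("GOOGLE_PROJECT", "WORKSPACE_BUCKET", "WORKSPACE_CDR")

def read_workspace_env(names=WORKSPACE_VARS):
    """Return a dict of environment variables, with None for any that are not set."""
    return {name: os.getenv(name) for name in names}

workspace_env = read_workspace_env()
missing = [name for name, value in workspace_env.items() if value is None]
if missing:
    print("Not set in this environment:", ", ".join(missing))
```

Checking for missing values once, near the top of the Notebook, tends to give a clearer message than a KeyError raised later in an unrelated cell.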

# How to run commands using ! and %%bash

**Examples**

When we open a new terminal on a VM, the default directory is our home directory (/home/jupyter), which is different from the current working directory of a Notebook (/home/jupyter/workspaces/your_workspace_name/). We can print the home directory by using ! or %%bash

In [None]:
!echo $HOME

In [None]:
%%bash
echo $HOME

**List contents under the home directory**

In [None]:
!ls $HOME

In [None]:
%%bash 
ls $HOME
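
The same command can also be run from plain Python, which helps when the output is needed as a Python value rather than printed text. This is a minimal sketch using the standard subprocess module; it is an alternative to ! and %%bash, not something the tutorial itself relies on.

```python
import os
import subprocess

# Run a shell command and capture its output as text.
# shell=True lets the shell expand $HOME, as it does for ! and %%bash.
result = subprocess.run(
    "echo $HOME", shell=True, capture_output=True, text=True, check=True
)
home_from_shell = result.stdout.strip()
print(home_from_shell)
```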

## How to work with folders/files on a VM

**Let's query a few rows from the person table and save them to the current working directory on the VM**

In [None]:
import pandas as pd
query = f"""
SELECT DISTINCT *
FROM `{DATASET}.person`
LIMIT 5
"""
df = pd.read_gbq(query, dialect="standard")
df.shape

**Save this data frame as a csv file to the current working directory**

In [None]:
df.to_csv('test1.csv')

**What is current working directory?**

If we click 'Open' in the 'File' menu, we will see that the current working directory is /workspaces/your_workspace_name, which is the same as $HOME/workspaces/your_workspace_name
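
The current working directory can also be read directly from Python. The short sketch below uses the standard pathlib module and shows where a relative file name such as test1.csv would end up.

```python
from pathlib import Path

cwd = Path.cwd()                 # directory the Notebook kernel is running in
csv_path = cwd / "test1.csv"     # relative file names resolve against cwd

print("Current working directory:", cwd)
print("test1.csv resolves to:", csv_path)
```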

Since ! and %%bash are interchangeable, we will only use one of them for the following examples. 

**We can check the files in the current Workspace or current working directory by running the cell below**

In [None]:
!ls $HOME/workspaces

In [None]:
!ls $HOME/workspaces/bestpracticeforaoudatascience/**

**Check test1.csv in the current working directory**

In [None]:
!ls test1.csv

**How to create a folder on a VM**

Let's create a folder called 'test' under the current working directory and copy test1.csv from the current working directory to this folder

In [None]:
!mkdir test

In [None]:
!cp test1.csv test/test1.csv

Check test1.csv under the test/ folder

In [None]:
!ls test/**
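
Running mkdir a second time reports an error because the folder already exists. As a hedged alternative, the same create-and-copy steps can be sketched in Python with pathlib and shutil, where exist_ok=True makes the cell safe to rerun. The sketch works in a temporary directory with a stand-in file so that it does not touch real data.

```python
import shutil
import tempfile
from pathlib import Path

# Work in a throwaway directory so the sketch does not touch real files.
work_dir = Path(tempfile.mkdtemp())
source = work_dir / "test1.csv"
source.write_text("person_id\n1\n")   # stand-in for the real test1.csv

target_dir = work_dir / "test"
target_dir.mkdir(exist_ok=True)       # like mkdir, but no error if it already exists
copied = Path(shutil.copy(source, target_dir / "test1.csv"))  # like cp

print(sorted(p.name for p in target_dir.iterdir()))
```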

# How to work with files in Google buckets

To access files stored in Google buckets, we use the command 'gsutil', a Python application for accessing cloud storage from the command line. gsutil supports a wide variety of bucket and object management tasks, including:

- Listing buckets and objects (ls), 
- Moving or renaming objects (mv), 
- Copying objects (cp), 
- Deleting objects (rm)

Please note that not all commands may be supported in the Workbench.

For more details, please read: https://cloud.google.com/storage/docs/gsutil

**Check what is available in the current Google bucket**

In [None]:
!gsutil ls {my_bucket}

Or we can check using this line below

In [None]:
!gsutil ls $WORKSPACE_BUCKET

**All Notebooks are saved in the notebooks folder of the Google bucket**

In [None]:
!gsutil ls {my_bucket}/notebooks

**How to create a folder named 'data' in the Google bucket and copy test1.csv from the VM to the bucket**

Please note that we don't have to create the folder first: the Workbench does not support a mkdir command for buckets, and the 'data' folder appears as soon as a file is copied to a path that contains it.

In [None]:
!gsutil cp test1.csv {my_bucket}/data/test1.csv    

**Check test1.csv in the bucket/data**

In [None]:
!gsutil ls {my_bucket}/data/test1.csv

**Check all files under bucket/data**

In [None]:
!gsutil ls {my_bucket}/data/**
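
In a standard bucket, a folder is generally just a shared prefix in object names rather than something created separately. The pure Python sketch below illustrates that idea with made-up object names; it does not call gsutil, and list_prefix is a hypothetical helper.

```python
# Hypothetical object names, as they might appear in a bucket.
object_names = [
    "notebooks/intro.ipynb",
    "data/test1.csv",
    "data/test/test1.csv",
]

def list_prefix(names, prefix):
    """Return the names that start with the given 'folder' prefix."""
    return [name for name in names if name.startswith(prefix)]

# Everything "inside" data/ is simply every name that begins with data/.
print(list_prefix(object_names, "data/"))
# → ['data/test1.csv', 'data/test/test1.csv']
```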

**How to copy a folder with files from the VM to the Google bucket**

In [None]:
!gsutil cp -r test {my_bucket}/data/test

In [None]:
!gsutil ls {my_bucket}/data/test

## How to create a path environment variable

In some cases it is convenient to store a file path in a variable and reuse it across commands.

**Example 1**

For this example, let's copy a file from a shared genomic analysis Workspace to the current working directory on the VM. The option -u $GOOGLE_PROJECT, which names the project to bill for the request, is needed to access such Workspaces, but it is not necessary when we are working with our own Google buckets.

In [None]:
%env file_path=gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/ancestry/ancestry_preds.tsv
!gsutil -u $GOOGLE_PROJECT cp $file_path .

**Check files on the VM**

In [None]:
!ls ancestry_preds.tsv

**Or we can copy this file directly from the shared genomic Workspace to the Google bucket/data**

In [None]:
!gsutil -u $GOOGLE_PROJECT cp $file_path $WORKSPACE_BUCKET/data/ancestry_preds2.tsv

Check this file in {my_bucket}

In [None]:
!gsutil ls {my_bucket}/data/ancestry_preds2.tsv

**Copy this file from the VM to {my_bucket}/data**

In [None]:
!gsutil cp ancestry_preds.tsv {my_bucket}/data/ancestry_preds.tsv
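
When the same copy is needed from Python code, the command can be assembled as an argument list. The sketch below only builds and prints the command; build_gsutil_cp is a hypothetical helper, and the only gsutil syntax it assumes is the -u and cp usage already shown in the cells above.

```python
import os

def build_gsutil_cp(source, destination, billing_project=None):
    """Build the argument list for a gsutil cp call (does not run it)."""
    command = ["gsutil"]
    if billing_project:
        # -u names the project to bill, as in the cells above.
        command += ["-u", billing_project]
    return command + ["cp", source, destination]

command = build_gsutil_cp(
    "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/ancestry/ancestry_preds.tsv",
    ".",
    billing_project=os.getenv("GOOGLE_PROJECT"),  # None outside the Workbench
)
print(" ".join(command))
```

In the Workbench, the list could then be passed to subprocess.run(command, check=True) to perform the copy.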

**Example 1.2**

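Alternatively, we can store the path in a Python variable and copy the file to the current VM/PD. Because the line also uses {file_path}, the shell variable is written as `$$GOOGLE_PROJECT` so that a literal $GOOGLE_PROJECT is passed to the shell.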

In [None]:
file_path='gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/aux/ancestry/ancestry_preds.tsv'
!gsutil -u $$GOOGLE_PROJECT cp {file_path} .

In [None]:
!ls *.tsv

**Example 2**

For this example, let's copy a file from the VM to the Google bucket using a path variable.

In [None]:
!ls /home/jupyter/workspaces

In [None]:
%env file_path2=/home/jupyter/workspaces/bestpracticeforaoudatascience/test/test1.csv

In [None]:
!echo $file_path2

In [None]:
!gsutil cp $file_path2 $WORKSPACE_BUCKET/data/test2.csv

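**Be careful, the following line will not work** - Feel free to test

The most likely reason is that the line mixes $file_path2, which exists only as an environment variable, with {my_bucket}, which is a Python variable. IPython tries to resolve both in Python, fails on file_path2, and then passes the line to the shell unchanged, so {my_bucket} is never replaced. Writing `$$file_path2`, as in Example 1.2, should avoid this.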

In [None]:
!gsutil cp $file_path2 {my_bucket}/data/test3.csv

In [None]:
!gsutil ls {my_bucket}/data/test3.csv
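
One way to avoid mixing $ and {} in a single line is to resolve the environment variable in Python first, so that the command uses Python variables only. The sketch below assumes the variable names from this example; the bucket name gs://example-bucket is a placeholder that is used only when WORKSPACE_BUCKET is not set.

```python
import os

# Stand-ins so the sketch runs anywhere; in the Workbench these are already set.
os.environ.setdefault(
    "file_path2",
    "/home/jupyter/workspaces/bestpracticeforaoudatascience/test/test1.csv",
)
my_bucket = os.getenv("WORKSPACE_BUCKET", "gs://example-bucket")  # placeholder name

# Resolve the environment variable in Python, so only {} substitution is needed later.
local_path = os.path.expandvars("$file_path2")
destination = f"{my_bucket}/data/test3.csv"
print(local_path, "->", destination)
```

In a Notebook cell, the copy could then be written as !gsutil cp {local_path} {destination}.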

**Check the files in the Google bucket**

In [None]:
!gsutil ls {my_bucket}/data/**

**Display a list of files stored in your Google bucket, including subfolders**

In [None]:
!gsutil ls -r {my_bucket}

# How to show a file head

In [None]:
!head test1.csv
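
If the first lines are needed as Python values, a small helper can mimic head. This is a minimal sketch; the function name head and the temporary sample file are illustrative stand-ins for test1.csv.

```python
import tempfile
from itertools import islice
from pathlib import Path

def head(path, n=10):
    """Return the first n lines of a text file, like the head command."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]

# Demonstrate on a small temporary file standing in for test1.csv.
sample = Path(tempfile.mkdtemp()) / "sample.csv"
sample.write_text("\n".join(f"row{i}" for i in range(20)) + "\n")
first_lines = head(sample, n=3)
print(first_lines)
```

Because islice stops after n lines, the helper does not read the rest of a large file.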

# How to rename a file

**Rename a file on the VM**

In [None]:
!mv test1.csv test11.csv

In [None]:
!ls test11.csv
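
A Python equivalent of the rename on the VM, sketched with pathlib in a temporary directory so that no real file is affected:

```python
import tempfile
from pathlib import Path

work_dir = Path(tempfile.mkdtemp())
old_path = work_dir / "test1.csv"
old_path.write_text("person_id\n1\n")   # stand-in file

new_path = work_dir / "test11.csv"
old_path.rename(new_path)               # like mv test1.csv test11.csv

print(new_path.exists(), old_path.exists())
```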

**Rename a file in the Google bucket**

In [None]:
!gsutil ls {my_bucket}/data

In [None]:
!gsutil mv {my_bucket}/data/test1.csv {my_bucket}/data/test11.csv

# How to delete a file

**Delete a file on the VM**

In [None]:
!rm test11.csv 
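
The same deletion can be sketched in Python. The example below works on a temporary stand-in file and assumes Python 3.8 or later for the missing_ok argument.

```python
import tempfile
from pathlib import Path

target = Path(tempfile.mkdtemp()) / "test11.csv"
target.write_text("person_id\n1\n")   # stand-in file

target.unlink()                 # like rm; raises FileNotFoundError if the file is absent
target.unlink(missing_ok=True)  # like rm -f; safe to repeat
print(target.exists())
```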

**Delete a file in the Google bucket**

In [None]:
!gsutil rm {my_bucket}/data/test11.csv 