## Storage introduction

Cloud Storage is a globally unified, scalable, and highly durable object storage service. Any kind of data or file can be stored as an object inside a bucket.

Key features:
* Manage storage costs and performance with Object Lifecycle Management (OLM)
* Global locations with low latency
* Storage classes for any workload:
  * Standard - Optimized for performance and high frequency access.
  * Nearline - Fast, highly durable, for data accessed less than once a month.
  * Coldline - Fast, highly durable, for data accessed less than once a quarter.
  * Archive - Most cost-effective, for data accessed less than once a year.
  * etc
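
As a rough illustration of the access-frequency guidance in the list above, the sketch below maps an expected number of reads per year to a storage class. The function name `suggest_storage_class` and the cut-offs of 1, 4, and 12 accesses per year are assumptions taken from the descriptions above. A real decision would also weigh pricing and retrieval costs.

```python
def suggest_storage_class(accesses_per_year: float) -> str:
    """Map an expected access frequency to a storage class name."""
    if accesses_per_year < 1:    # less than once a year
        return "ARCHIVE"
    if accesses_per_year < 4:    # less than once a quarter
        return "COLDLINE"
    if accesses_per_year < 12:   # less than once a month
        return "NEARLINE"
    return "STANDARD"            # frequently accessed data


for rate in (0.5, 2, 6, 365):
    print(rate, "->", suggest_storage_class(rate))
```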

Here is a sample use case of Cloud Storage for machine learning:
![Machine learning with storage](https://cloudx-bricks-prod-bucket.storage.googleapis.com/159d98095e3d589068d6267b0861564b7a0bf2aca5c81208989c64662811b517.svg)


Let's start with the command line, then move on to Python code, to show common uses of Cloud Storage.

In [0]:
# let's first auth user
from google.colab import auth
auth.authenticate_user()

In [16]:
# let's set our project
! gcloud config set project cloudtutorial-279003

Updated property [core/project].


In [17]:
# let's list the regions that we could use for a bucket
! gcloud compute regions list

NAME                     CPUS  DISKS_GB  ADDRESSES  RESERVED_ADDRESSES  STATUS  TURNDOWN_DATE
asia-east1               0/8   0/2048    0/4        0/8                 UP
asia-east2               0/8   0/2048    0/4        0/8                 UP
asia-northeast1          0/8   0/2048    0/4        0/8                 UP
asia-northeast2          0/8   0/2048    0/4        0/8                 UP
asia-northeast3          0/8   0/2048    0/4        0/8                 UP
asia-south1              0/8   0/2048    0/4        0/8                 UP
asia-southeast1          0/8   0/2048    0/4        0/8                 UP
australia-southeast1     0/8   0/2048    0/4        0/8                 UP
europe-north1            0/8   0/2048    0/4        0/8                 UP
europe-west1             0/8   0/2048    0/4        0/8                 UP
europe-west2             0/8   0/2048    0/4        0/8                 UP
europe-west3             0/8   0/2048    0/4        0/8                 UP
europe

In [18]:
# let's list our buckets
# from the command line, we can just use `gsutil`
! gsutil ls

gs://dataflow_tutorial_bucket/
gs://new_bucket_for_test/


In [23]:
# let's create a bucket
! gsutil mb gs://new_bucket_for_test

Creating gs://new_bucket_for_test/...


In [24]:
# check
! gsutil ls

gs://dataflow_tutorial_bucket/
gs://new_bucket_for_test/
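
`gsutil` addresses buckets and objects with `gs://` URIs, whereas client libraries usually take the bucket name and the object name separately. The sketch below splits a URI into those two parts. The helper name `parse_gs_uri` is hypothetical and is not part of any Google library.

```python
from typing import Tuple


def parse_gs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into (bucket, object_name)."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    # Everything up to the first "/" is the bucket, the rest is the object name.
    bucket, _, object_name = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {uri!r}")
    return bucket, object_name


print(parse_gs_uri("gs://dataflow_tutorial_bucket/data_flow_inputs/sample.txt"))
print(parse_gs_uri("gs://new_bucket_for_test/"))
```

A URI that names only a bucket yields an empty object name.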


In [25]:
# let's copy files into the new bucket
# -r copies all files and directories recursively from one bucket to the other.
# -p preserves the ACLs of the copied objects.
! gsutil cp -r -p gs://dataflow_tutorial_bucket/* gs://new_bucket_for_test/

Copying gs://dataflow_tutorial_bucket/data_flow_inputs/<Bucket: dataflow_tutorial_bucket> [Content-Type=data_flow_inputs/<Bucket: dataflow_tutorial_bucket>]...
Copying gs://dataflow_tutorial_bucket/data_flow_inputs/sample.txt [Content-Type=data_flow_inputs/sample.txt]...
Copying gs://dataflow_tutorial_bucket/data_flow_output/outputs-00000-of-00003 [Content-Type=text/plain]...
Copying gs://dataflow_tutorial_bucket/data_flow_output/outputs-00001-of-00003 [Content-Type=text/plain]...
\ [4 files][  1.9 KiB/  1.9 KiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://dataflow_tutorial_bucket/data_flow_output/outputs-00002-of-00003 [Content-Type=text/plain]...
Copying gs://dataflow_tutorial_bucket/spark_code/training_spark.py [Conten

In [29]:
# let's check the lifecycle configuration of the bucket
! gsutil lifecycle get gs://new_bucket_for_test

gs://new_bucket_for_test/ has no lifecycle configuration.


In [36]:
# we haven't set a lifecycle configuration on the bucket yet, so let's define one in a JSON file:
# this rule deletes objects once they are 7 days old.
%%writefile filecycle.json

{
"rule":
[
{
"action": {"type": "Delete"},
"condition": {"age": 7}
}
]
}

Overwriting filecycle.json
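
The lifecycle file can also be generated from Python instead of being typed by hand, which helps when several rules are combined. The sketch below builds the same `rule` structure as the JSON above. The helper names are hypothetical, and the `SetStorageClass` rule is an extra illustration that the original file does not use.

```python
import json


def delete_after(days: int) -> dict:
    """Rule that deletes objects older than `days` days."""
    return {"action": {"type": "Delete"}, "condition": {"age": days}}


def move_to_class_after(storage_class: str, days: int) -> dict:
    """Rule that changes the storage class of objects older than `days` days."""
    return {
        "action": {"type": "SetStorageClass", "storageClass": storage_class},
        "condition": {"age": days},
    }


def lifecycle_config(*rules: dict) -> str:
    """Serialize rules into the JSON layout used by the lifecycle file."""
    return json.dumps({"rule": list(rules)}, indent=2)


config = lifecycle_config(move_to_class_after("NEARLINE", 30), delete_after(365))
print(config)
```

Calling `lifecycle_config(delete_after(7))` produces a document equivalent to the one written above.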


In [34]:
! gsutil lifecycle set filecycle.json gs://new_bucket_for_test

Setting lifecycle configuration on gs://new_bucket_for_test/...


In [35]:
# let's check again
! gsutil lifecycle get gs://new_bucket_for_test

{"rule": [{"action": {"type": "Delete"}, "condition": {"age": 7}}]}


In [37]:
# let's get detailed information about the bucket
# -L prints the long (detailed) listing
# -b lists the bucket itself rather than its contents
! gsutil list -L -b gs://new_bucket_for_test

gs://new_bucket_for_test/ :
	Storage class:			STANDARD
	Location type:			multi-region
	Location constraint:		US
	Versioning enabled:		None
	Logging configuration:		None
	Website configuration:		None
	CORS configuration: 		None
	Lifecycle configuration:	Present
	Requester Pays enabled:		None
	Labels:				None
	Default KMS key:		None
	Time created:			Thu, 04 Jun 2020 04:43:08 GMT
	Time updated:			Thu, 04 Jun 2020 05:01:05 GMT
	Metageneration:			2
	Bucket Policy Only enabled:	False
	ACL:				
	  [
	    {
	      "entity": "project-owners-227224402169",
	      "projectTeam": {
	        "projectNumber": "227224402169",
	        "team": "owners"
	      },
	      "role": "OWNER"
	    },
	    {
	      "entity": "project-editors-227224402169",
	      "projectTeam": {
	        "projectNumber": "227224402169",
	        "team": "editors"
	      },
	      "role": "OWNER"
	    },
	    {
	      "entity": "project-viewers-227224402169",
	      "projectTeam": {
	        "projectNumber": "227224402169",
	 

In [39]:
# let's upload a file into the bucket
import os

with open("test.txt", 'w') as f:
  f.write("try with gsutil upload file logic")

# when uploading, there is no need to create a folder first:
# the folder-like prefix is simply part of the object name.
! gsutil cp test.txt gs://new_bucket_for_test/upload_file/test.txt

Copying file://test.txt [Content-Type=text/plain]...
/ [1 files][   33.0 B/   33.0 B]                                                
Operation completed over 1 objects/33.0 B.                                       


In [40]:
# let's check the newly created folder: the file has been uploaded into the bucket
! gsutil ls  gs://new_bucket_for_test/upload_file/ 

gs://new_bucket_for_test/upload_file/test.txt
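
A bucket has a flat namespace: `upload_file/test.txt` is a single object name, and the folder view is derived from the `/` delimiter. The pure-Python sketch below imitates that derivation on a list of names, without contacting Cloud Storage. The function name `list_level` is hypothetical.

```python
from typing import Iterable, List


def list_level(object_names: Iterable[str], prefix: str = "", delimiter: str = "/") -> List[str]:
    """Return the entries directly under `prefix`, the way a folder view would."""
    entries = set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        # Anything past the first delimiter collapses into one "folder" entry.
        entries.add(prefix + head + sep)
    return sorted(entries)


names = [
    "data_flow_inputs/sample.txt",
    "data_flow_output/outputs-00000-of-00003",
    "upload_file/test.txt",
]
print(list_level(names))
print(list_level(names, prefix="upload_file/"))
```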


In [41]:
# let's move the file to another bucket
! gsutil move gs://new_bucket_for_test/upload_file/test.txt gs://dataflow_tutorial_bucket/upload_file/

# the file has been moved from one bucket to the other.
! gsutil ls gs://dataflow_tutorial_bucket/upload_file/

Copying gs://new_bucket_for_test/upload_file/test.txt [Content-Type=text/plain]...
/ [0 files][    0.0 B/   33.0 B]
/ [1 files][   33.0 B/   33.0 B]
Removing gs://new_bucket_for_test/upload_file/test.txt...

Operation completed over 1 objects/33.0 B.                                       
gs://dataflow_tutorial_bucket/upload_file/test.txt
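
The output above shows that a move is a copy followed by a removal of the source. With the Python client introduced in the next section, the same pattern could be sketched roughly as follows. The helper name `move_blob` is hypothetical, and it expects bucket objects such as those returned by `client.bucket(...)`.

```python
def move_blob(source_bucket, blob_name, destination_bucket, new_name=None):
    """Copy a blob to another bucket, then delete the original."""
    source_blob = source_bucket.blob(blob_name)
    new_blob = source_bucket.copy_blob(
        source_blob, destination_bucket, new_name=new_name or blob_name
    )
    # Delete the source only after the copy call has returned without raising.
    source_blob.delete()
    return new_blob


# Example call, assuming an authenticated `client` (not executed here):
# move_blob(client.bucket("new_bucket_for_test"),
#           "upload_file/test.txt",
#           client.bucket("dataflow_tutorial_bucket"))
```

If the copy raises an exception, the source object is left untouched.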


In [44]:
# let's remove the bucket and everything in it
! gsutil rm -r  gs://new_bucket_for_test

Removing gs://new_bucket_for_test/data_flow_inputs/<Bucket: dataflow_tutorial_bucket>#1591245805764858...
Removing gs://new_bucket_for_test/data_flow_inputs/sample.txt#1591245806105696...
Removing gs://new_bucket_for_test/data_flow_output/outputs-00000-of-00003#1591245806503658...
Removing gs://new_bucket_for_test/data_flow_output/outputs-00001-of-00003#1591245807002360...
/ [4 objects]                                                                   
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m rm ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Removing gs://new_bucket_for_test/data_flow_output/outputs-00002-of-00003#1591245807286814...
Removing gs://new_bucket_for_test/spark_code/training_spark.py#1591245807726618...
/ [6 objects]                                                                   
Operation completed over 6 object

## Cloud storage with Python

We have used the command line for common bucket operations. Now let's manage buckets with the Python client, which is what application code typically uses to interact with Cloud Storage.

Let's start.

In [0]:
# first we have to install the client
! pip install google-cloud-storage --quiet

In [0]:
# init client
from google.cloud import storage

project_id = "cloudtutorial-279003"
# config with project_id
client = storage.Client(project_id)

In [0]:
# let's create a bucket
from google.cloud.storage import Bucket

bucket = client.create_bucket('bucket_with_python')
assert isinstance(bucket, Bucket)
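
Bucket names share one global namespace and must follow naming rules, so it can be worth checking a name locally before calling `create_bucket`. The sketch below covers only a simplified subset of those rules (length, allowed characters, and the reserved `goog` prefix), as an assumption about the most common failures. The function name `is_plausible_bucket_name` is hypothetical.

```python
import re

# 3 to 63 characters: lowercase letters, digits, dots, dashes, underscores,
# starting and ending with a letter or digit.
_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def is_plausible_bucket_name(name: str) -> bool:
    """Return True if `name` passes a simplified bucket naming check."""
    if not _NAME_RE.fullmatch(name):
        return False
    # Names starting with "goog" or containing "google" are reserved.
    if name.startswith("goog") or "google" in name:
        return False
    return True


for candidate in ("bucket_with_python", "Bucket_With_Python", "ab"):
    print(candidate, is_plausible_bucket_name(candidate))
```

A name that passes this check can still be rejected by the service, for example when another project already owns it.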

In [52]:
# there are many attributes and methods that we can use on a bucket
print(dir(bucket))

['COLDLINE_STORAGE_CLASS', 'DUAL_REGION_LOCATION_TYPE', 'DURABLE_REDUCED_AVAILABILITY_LEGACY_STORAGE_CLASS', 'MULTI_REGIONAL_LEGACY_STORAGE_CLASS', 'MULTI_REGION_LOCATION_TYPE', 'NEARLINE_STORAGE_CLASS', 'REGIONAL_LEGACY_STORAGE_CLASS', 'REGION_LOCATION_TYPE', 'STANDARD_STORAGE_CLASS', '_LOCATION_TYPES', '_MAX_OBJECTS_FOR_ITERATION', '_STORAGE_CLASSES', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_acl', '_changes', '_client', '_default_object_acl', '_encryption_headers', '_label_removals', '_location', '_patch_property', '_properties', '_query_params', '_require_client', '_set_properties', '_user_project', 'acl', 'add_lifecycle_delete_rule', 'add_lifecycle_set_storage_class_rule', 'blob', '
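
Two of the methods in the listing above, `add_lifecycle_delete_rule` and `add_lifecycle_set_storage_class_rule`, correspond to the lifecycle configuration that was set earlier with `gsutil lifecycle set`. A minimal sketch of using them follows. The wrapper name `apply_retention` and its parameters are assumptions made for illustration.

```python
def apply_retention(bucket, delete_after_days, nearline_after_days=None):
    """Stage lifecycle rules on a bucket object and send them to the service."""
    if nearline_after_days is not None:
        # Optionally move objects to Nearline before they are deleted.
        bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=nearline_after_days)
    bucket.add_lifecycle_delete_rule(age=delete_after_days)
    # The rules are only staged locally until patch() sends the change.
    bucket.patch()


# Example call, assuming `bucket` was created as above (not executed here):
# apply_retention(bucket, delete_after_days=7)
```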

In [54]:
# let's upload a file into the bucket
blob = bucket.blob('test.txt')
try:
  blob.upload_from_filename('test.txt')
  print('file uploaded')
except Exception as e:
  print("Upload failed with error:", e)

file uploaded
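
Uploading a single file generalizes to a whole directory by walking the local tree and deriving one object name per file. The sketch below shows one possible way to do that. The helper names `plan_uploads` and `upload_directory` are hypothetical, and `bucket` is expected to be a bucket object like the one created above.

```python
import os


def plan_uploads(local_dir, prefix=""):
    """Yield (local_path, object_name) pairs for every file under local_dir."""
    for root, _dirs, files in os.walk(local_dir):
        for filename in sorted(files):
            local_path = os.path.join(root, filename)
            relative = os.path.relpath(local_path, local_dir)
            # Object names always use "/" regardless of the local OS separator.
            yield local_path, prefix + relative.replace(os.sep, "/")


def upload_directory(bucket, local_dir, prefix=""):
    """Upload every file under local_dir and return the object names used."""
    uploaded = []
    for local_path, object_name in plan_uploads(local_dir, prefix):
        bucket.blob(object_name).upload_from_filename(local_path)
        uploaded.append(object_name)
    return uploaded
```

Keeping the planning step separate makes it possible to inspect the object names before anything is sent.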


In [55]:
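# let's download the file to the local disk
try:
  blob.download_to_filename('new_test.txt')
  print("download file list: ", os.listdir('.'))
except Exception as e:
  # report the failure instead of silently ignoring it
  print("Download failed with error:", e)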

download file list:  ['.config', 'adc.json', 'new_test.txt', 'test.txt', 'filecycle.json', 'sample_data']


In [57]:
# we could also download the file content into memory for later use
try:
  # newer client versions name this method `download_as_bytes`
  file_content = blob.download_as_string()
  # the return value is a bytes object: objects are stored as raw bytes,
  # so call .decode() on it when text is needed.
  print("Get file content:", file_content)
except Exception as e:
  print("Download failed with error:", e)

Get file content: b'try with gsutil upload file logic'
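
Because the downloaded content is a bytes object, structured data has to be encoded before upload and decoded after download. The sketch below shows one possible round trip through JSON. The helper names `to_payload` and `from_payload` are hypothetical, and the record is made-up sample data.

```python
import json


def to_payload(record: dict) -> bytes:
    """Encode a dictionary as UTF-8 JSON bytes, ready to be uploaded."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def from_payload(data: bytes) -> dict:
    """Decode bytes downloaded from a blob back into a dictionary."""
    return json.loads(data.decode("utf-8"))


payload = to_payload({"name": "test.txt", "size": 33})
print(type(payload).__name__, payload)
print(from_payload(payload))
```

Such a payload could be passed to `blob.upload_from_string(payload, content_type="application/json")`, and the bytes returned by a download could be fed to `from_payload`.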


In [59]:
# once the bucket exists, we can fetch the bucket object by name
bucket = client.get_bucket('bucket_with_python')

# list the files in the bucket
# the return value is an iterator
blobs = bucket.list_blobs()
print("file list:", list(blobs))

file list: [<Blob: bucket_with_python, test.txt, 1591249755726832>]


In [60]:
# let's list the buckets we have
print("bucket list: ", list(client.list_buckets()))

bucket list:  [<Bucket: bucket_with_python>, <Bucket: dataflow_tutorial_bucket>]


In [63]:
# let's copy a file from one bucket into another
source_bucket = client.bucket('dataflow_tutorial_bucket')
des_bucket = client.bucket('bucket_with_python')

blob_file = source_bucket.blob('upload_file/test.txt')

# we can rename the destination file with `new_name`
new_blob = source_bucket.copy_blob(blob_file, des_bucket, new_name='upload_file/new_rename.txt')

print(list(des_bucket.list_blobs()))

[<Blob: bucket_with_python, test.txt, 1591249755726832>, <Blob: bucket_with_python, upload_file/new_rename.txt, 1591250710306730>, <Blob: bucket_with_python, upload_file/test.txt, 1591250647254400>]


In [64]:
# let's delete the file
des_bucket.delete_blob('upload_file/new_rename.txt')

# list the bucket again to confirm that the file has been removed.
print(list(des_bucket.list_blobs()))

[<Blob: bucket_with_python, test.txt, 1591249755726832>, <Blob: bucket_with_python, upload_file/test.txt, 1591250647254400>]


In [67]:
# we can check whether a file exists in the bucket:
# `get_blob` returns None if the file doesn't exist
print(des_bucket.get_blob('test.txt'))

<Blob: bucket_with_python, test.txt, 1591249755726832>


In [70]:
# let's delete the bucket
try:
  # a bucket that still contains files cannot be deleted directly (error: not empty);
  # force=True deletes the files first, then deletes the bucket.
  des_bucket.delete(force=True)
  print("bucket has been deleted")
  print("current bucket list:", list(client.list_buckets()))
except Exception as e:
  raise RuntimeError(f"Failed to delete bucket: {e}") from e

bucket has been deleted
current bucket list: [<Bucket: dataflow_tutorial_bucket>]
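
According to the client documentation, `delete(force=True)` refuses to run when the bucket holds more than 256 objects, so a larger bucket has to be emptied explicitly first. A minimal sketch of that approach follows. The helper name `empty_and_delete` is hypothetical, and for very large buckets a lifecycle rule may be a better fit than deleting objects one by one.

```python
def empty_and_delete(bucket):
    """Delete every object in the bucket, then the bucket itself."""
    # Materialize the listing first so deletions do not interfere with paging.
    blobs = list(bucket.list_blobs())
    for blob in blobs:
        blob.delete()
    # With no objects left, a plain delete() is enough.
    bucket.delete()
    return len(blobs)


# Example call, assuming an authenticated `client` (not executed here):
# empty_and_delete(client.bucket("bucket_with_python"))
```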


## Final words

Cloud Storage offers many more useful functions than the common use cases covered here. If you are curious about the full set, see the [API reference](https://googleapis.dev/python/storage/latest/buckets.html).