# Cloud Storage - Reading and Writing Files

You can read and write files (e.g. CSV or JSON files) in Google Cloud Storage and, as with BigQuery, use the resulting data with Python data analysis libraries.

This notebook reads from a shared, read-only dump of request logs, and also works with Cloud Storage buckets and files within your own project.

Related Links:

* [Cloud Storage](https://cloud.google.com/storage/)
* Python [Pandas](http://pandas.pydata.org/) for data analysis

----

NOTE:

* If you're new to notebooks, or want to explore additional samples, see the full [list](..) of notebooks.

# Python APIs for working with Storage

Using the Python APIs in the `PyGCP` library, you can work directly with all the buckets and items that are contained in, or shared with, your cloud project. The following `gcp.storage` import makes the Storage functionality available to this notebook.

In [1]:
import gcp
import gcp.storage as storage

## Buckets

Buckets are the top-level containers in cloud storage, and they contain items, each associated with a key.

In [2]:
shared_bucket = storage.Bucket('cloud-datalab')

A bucket object can then be used to enumerate the items it contains.

In [3]:
items = shared_bucket.items()
keys = map(lambda item: item.key, items)
keys

[u'/',
 u'assets/',
 u'assets/DataLab128px.png',
 u'assets/DataLab512px.png',
 u'assets/Logo.ico',
 u'assets/LogoIcon.png',
 u'assets/LogoLarge.png',
 u'assets/LogoOAuth.png',
 u'content/',
 u'content/Hello World.ipynb',
 u'content/datalab/',
 u'content/datalab/Readme.ipynb',
 u'content/datalab/docs/',
 u'content/datalab/docs/An Introduction to Notebooks.ipynb',
 u'content/datalab/docs/BigQuery - API Reference.ipynb',
 u'content/datalab/docs/BigQuery - Basics.ipynb',
 u'content/datalab/docs/BigQuery - Composing Queries.ipynb',
 u'content/datalab/docs/BigQuery - Data Transforms with SQL.ipynb',
 u'content/datalab/docs/BigQuery - Extracting Data.ipynb',
 u'content/datalab/docs/BigQuery - Inserting Data.ipynb',
 u'content/datalab/docs/BigQuery - JavaScript UDFs.ipynb',
 u'content/datalab/docs/BigQuery - Parameterized Queries.ipynb',
 u'content/datalab/docs/BigQuery - Sample - Genomics.ipynb',
 u'content/datalab/docs/BigQuery - Sample - GitHub Timeline.ipynb',
 u'content/datalab/docs/BigQu

Items within a bucket can also be filtered, for example by key prefix, to enumerate only the matching ones.

In [4]:
items = shared_bucket.items(prefix = 'ipython/intro/', delimiter = '/')
keys = map(lambda item: item.key, items)
keys

[]
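The `prefix` and `delimiter` arguments follow the usual Cloud Storage listing idea: keys are flat strings, and a delimiter lets a listing behave like a directory view. The following is a minimal standalone sketch of that idea over a plain list of keys. It is not the `gcp.storage` implementation; `list_keys` is a hypothetical helper written for illustration, and whether `items()` also exposes the collapsed "folder" prefixes is not shown in this notebook.

```python
def list_keys(keys, prefix='', delimiter=None):
    """Model a directory-style listing over flat keys (hypothetical helper).

    Returns (items, prefixes): keys directly "under" the prefix, and the
    collapsed sub-folder prefixes found below it.
    """
    items, prefixes = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Everything below the next delimiter collapses into one prefix.
            folder = prefix + rest.split(delimiter, 1)[0] + delimiter
            if folder not in prefixes:
                prefixes.append(folder)
        else:
            items.append(key)
    return items, prefixes


# A few keys copied from the listing above.
sample_keys = [
    'assets/',
    'assets/Logo.ico',
    'assets/LogoIcon.png',
    'content/',
    'content/Hello World.ipynb',
    'content/datalab/',
    'content/datalab/Readme.ipynb',
]

content_items, content_folders = list_keys(sample_keys, prefix='content/', delimiter='/')
no_items, no_folders = list_keys(sample_keys, prefix='ipython/intro/', delimiter='/')
```

With these sample keys, a prefix that no key starts with yields empty results, which is consistent with the empty list returned by the cell above.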

## Items

Items are individual objects in a bucket. Each item has associated metadata, and can be read from or written to.

In [5]:
sample_logs = shared_bucket.item('sampledata/requestlogs/logs_sample.csv')
'The item with key "%s" is %d bytes' % (sample_logs.key, sample_logs.metadata().size)

'The item with key "sampledata/requestlogs/logs_sample.csv" is 3949 bytes'

### Reading Item Contents

In [6]:
log_content = sample_logs.read_from()

In [7]:
print log_content[:198] + '...'

1402815600.003772,122,200,GET,Interact3
1402815600.428897,144,200,GET,Interact3
1402815600.536486,48,200,GET,Interact3
1402815600.652760,28,405,GET,Interact2
1402815600.670100,103,200,GET,Interact3
...
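Since the content comes back as a single string, it can be inspected with the standard library alone. The sketch below is a standalone illustration that uses the five lines printed above in place of `log_content`, and assumes the third field is the HTTP status code (the column names are only given later in this notebook).

```python
import csv
import io
from collections import Counter

# Stand-in for log_content: the five lines printed above.
sample_content = u"""1402815600.003772,122,200,GET,Interact3
1402815600.428897,144,200,GET,Interact3
1402815600.536486,48,200,GET,Interact3
1402815600.652760,28,405,GET,Interact2
1402815600.670100,103,200,GET,Interact3
"""

# Parse the CSV text into rows of string fields.
rows = list(csv.reader(io.StringIO(sample_content)))

# Count requests per status code (assumed to be the third field).
status_counts = Counter(row[2] for row in rows)
print(status_counts)
```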


### Writing Item Contents

The following copies the content that was just read from the shared read-only bucket and writes it to an item within a bucket in your own project. Specifically, it creates an item within the notebooks bucket, which is also where all of your notebooks are stored.

In [8]:
print 'Creating an item and writing %d bytes into it...' % len(log_content)

Creating an item and writing 3949 bytes into it...


In [9]:
# The notebooks bucket is named using the project id followed by the "-datalab" suffix.
notebooks_bucket_name = gcp.Context.default().project_id + '-datalab'
notebooks_bucket = storage.Bucket(notebooks_bucket_name)
if not notebooks_bucket.exists():
    storage.Buckets().create(notebooks_bucket_name)

new_item = notebooks_bucket.item('sample_logs.csv')
new_item.write_to(log_content, 'text/plain')

In [10]:
[item.metadata().size
 for item in notebooks_bucket.items(delimiter = '/')
 if item.key.endswith('.csv')]

[3949]

### Deleting Items

Let's delete the item we just created when writing the sample log content.

In [11]:
new_item.delete()

False

# Integrating with Pandas

Once you've read in some data from cloud storage, you can easily load it into a Python Pandas dataframe to further query and/or reshape the data.

In [12]:
import pandas as pd
import numpy as np
import StringIO

In [13]:
buffer = StringIO.StringIO(log_content)
df = pd.read_csv(buffer, header=None, names=['timestamp','latency','status','method','endpoint'],
                 parse_dates=[0], index_col=0)
df[df.status >= 400]

timestamp,latency,status,method,endpoint
1402816000.0,28,405,GET,Interact2
1402816000.0,121,405,GET,Interact2
1402816000.0,124,405,GET,Interact2
1402816000.0,27,405,GET,Interact2
1402816000.0,112,404,GET,Other
1402816000.0,123,405,GET,Interact2
1402816000.0,192,404,GET,Other
1402816000.0,29,405,GET,Interact2
1402816000.0,112,400,GET,Other
1402816000.0,125,401,POST,Create
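In the output above, the `timestamp` index is still displayed as raw epoch seconds. A minimal sketch, assuming the first column holds Unix epoch seconds, is to convert it explicitly with `pd.to_datetime(..., unit='s')` before setting it as the index. The sample text below stands in for `log_content`, and `io.StringIO` is used here in place of Python 2's `StringIO.StringIO`.

```python
import io

import pandas as pd

# Stand-in for log_content: the first five lines of the sample log.
sample_content = u"""1402815600.003772,122,200,GET,Interact3
1402815600.428897,144,200,GET,Interact3
1402815600.536486,48,200,GET,Interact3
1402815600.652760,28,405,GET,Interact2
1402815600.670100,103,200,GET,Interact3
"""

df = pd.read_csv(io.StringIO(sample_content), header=None,
                 names=['timestamp', 'latency', 'status', 'method', 'endpoint'])

# Interpret the float values as seconds since the Unix epoch (UTC).
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
df = df.set_index('timestamp')

# Same filter as above: requests that returned a client or server error.
errors = df[df.status >= 400]
print(errors)
```

With a datetime index, time-based operations such as slicing by time range or resampling become available on the dataframe.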
