# Caserta: GCP Test Part 1

### Author: Jeff Norton

To show the steps for the first several parts of the Google Cloud Platform Test, I created a Python 3 notebook on a Linux VM running on my personal computer.

## Steps of the Test

### Overview:
1.	Pull current data for all cryptocurrencies using the [CoinMarketCap API](https://coinmarketcap.com/api/)
2.	Save this data as a CSV file
3.	Upload the CSV to a Google Cloud Storage Bucket
4.	Move cryptocurrency data from GCS bucket to BigQuery
5.	Create a Google Datalab notebook instance
6.	Execute these five queries inside the notebook:
	1.	How many coins have a USD price greater than 8000 USD?
	2.	What is the total market cap of the top 100 cryptocurrencies (in USD)?
	3.	Which coins have an available supply less than 5M USD?
	4.	Which 5 coins have seen the greatest percentage growth in the last week?
	5.	How many ticker symbols contain the letter "X"?
7.	Download the notebook as a .ipynb file by choosing the correct option under "Notebook" on the top left
8.	Email deliverables back to recruiter

### Deliverables:
1.	Google DataLab notebook (.ipynb) containing data ingestion script and SQL queries including answers to above questions
2.	Public URL of Google Cloud Storage bucket containing CoinMarketCap data

*Note - The deliverables will include this notebook, the Google Datalab notebook and the public URL.*

### Extra Credits
1.	Explain thinking at each step
2.	Include Markdown text
3.	Log steps and errors
4.	Push to GitHub Repository
5.	Publicly accessible link to Google Storage bucket
6.	Use Google Cloud APIs instead of web console whenever possible

## In this Notebook

In this notebook, we execute steps 1 through 3. Step 3 is done with a standalone script run from the command line.

# Step 1: Pull Data for Cryptocurrencies

Note: this API is rate-limited. Clients may make no more than 30 requests per minute, and access is temporarily denied if the limit is exceeded.
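One way to respect such a limit is a small rolling-window limiter. The sketch below is only an illustration, not the code this notebook ran: the `RateLimiter` class and its parameters are hypothetical names, and the default of 29 calls per 60 seconds is an assumption chosen to stay just under the stated limit. The clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` calls in any rolling window of `period` seconds."""

    def __init__(self, max_calls=29, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of the calls still inside the window

    def _prune(self, now):
        # Drop timestamps that have aged out of the rolling window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()

    def wait(self):
        """Block until another call is allowed, then record that call."""
        now = self.clock()
        self._prune(now)
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self._prune(now)
        self.calls.append(now)
```

With something like this in place, calling `limiter.wait()` immediately before each request would keep the loop under the limit without tracking counters by hand.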

We pull the data 100 entries at a time, paging through all entries. Each entry is a nested dictionary, which we flatten into a single row and append to the `row_data` list.

First, set up logging. To make the application more robust, we log each step and check every response for errors.

In [1]:
import logging
import sys

FORMAT = '%(asctime)-15s %(levelname)-6s %(message)s'
DATE_FORMAT = '%b %d %H:%M:%S'
logger = logging.getLogger(__name__)
formatter = logging.Formatter(fmt=FORMAT, datefmt=DATE_FORMAT)
# Clear the handlers out
for h in reversed(logger.handlers):
    logger.removeHandler(h)
# Add the one handler that is desired.
handler = logging.StreamHandler()
handler.setFormatter(formatter)
handler.stream = sys.stdout
logger.addHandler(handler)
logger.setLevel(logging.INFO)

Define the flatten routine for flattening a nested dictionary to a flat dictionary.

In [2]:
import collections.abc

# Flatten any nested dictionary into a single-level dictionary
def flatten(d):
    items = []
    for k, v in d.items():
        # In our case there are no key collisions, so we keep the original keys.
        # MutableMapping lives in collections.abc (the collections alias was removed in Python 3.10).
        if isinstance(v, collections.abc.MutableMapping):
            items.extend(flatten(v).items())
        else:
            items.append((k, v))
    return dict(items)
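The routine above relies on the nested keys never colliding. If they did, a later value would silently overwrite an earlier one. As a hedged sketch of one alternative, the hypothetical `flatten_prefixed` below joins parent and child keys with a separator. The sample entry is made up for illustration: its nested `quotes` / `USD` shape is an assumption about the API response, and the two values are taken from the Bitcoin row shown later in this notebook.

```python
from collections.abc import MutableMapping


def flatten_prefixed(d, parent_key='', sep='_'):
    """Flatten a nested dict, joining nested keys with `sep` to avoid collisions."""
    items = {}
    for k, v in d.items():
        new_key = '{}{}{}'.format(parent_key, sep, k) if parent_key else str(k)
        if isinstance(v, MutableMapping):
            # Recurse, carrying the path so far as the prefix.
            items.update(flatten_prefixed(v, new_key, sep))
        else:
            items[new_key] = v
    return items


# Illustrative entry only; the nesting shown here is an assumption.
entry = {
    'id': 1,
    'name': 'Bitcoin',
    'quotes': {'USD': {'price': 6807.89, 'volume_24h': 3584920000.0}},
}
flat = flatten_prefixed(entry)
```

Here the nested price ends up under the key `quotes_USD_price`. The trade-off is longer column names, which is why the simpler version is reasonable when collisions are known not to occur.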

In [3]:
# Use the requests library
import requests, json, time
from pandas import DataFrame

# Start time
starttime = time.time()
logger.info('Starting download of cryptocurrences at {}'.format(starttime))

# Get all the information paged (noting that we have to minimize calls!).  The interface
# supports maximum extraction of 100 rows per page.
base_url = ('https://api.coinmarketcap.com/v2/ticker/?start=', '&limit=100&sort=id')
row_data = []

calls = 0
last = False
start = 1
while not last:
    logger.info('Accessing items starting at {}'.format(start))
    url = '{}{}{}'.format(base_url[0], start, base_url[1])
    curtime = time.time()
    
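    # Thresholding - keep it to fewer than 30 calls per minute
    if curtime - starttime >= 60:
        # More than a minute has passed, so start a new counting window
        calls = 0
        starttime = curtime
    if calls >= 29:
        # 29 calls were made within the current minute, so pause before the next one
        logger.info('Thresholding enacted on API calls for {} seconds'.format(60))
        time.sleep(60)
        logger.info('Thresholding finished')
        calls = 0
        starttime = time.time()
    calls += 1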
    
    # Make the call for a page
    response = requests.get(url)
    
    # Check the status code
    if response.status_code != 200:
        logger.error(response.status_code)
        break  # As far as I know, the only failure occurs when the request is blocked for hitting the endpoint too often
    else:
        # Convert the response to usable json
        data_block = json.loads(response.content.decode('utf-8'))
        # are we on the last page?
        if len(data_block['data']) < 100:
            last = True
        # For each entry, flatten it and add it to the rows.
        for k in sorted(data_block['data'].keys()):
            row_data.append(flatten(data_block['data'][k]))
        
    start += 100
      
# Create the dataframe of all ticker data.
ticker_df = DataFrame(row_data)

Jul 08 16:34:05 INFO   Starting download of cryptocurrences at 1531082045.0777428
Jul 08 16:34:05 INFO   Accessing items starting at 1
Jul 08 16:34:05 INFO   Accessing items starting at 101
Jul 08 16:34:05 INFO   Accessing items starting at 201
Jul 08 16:34:05 INFO   Accessing items starting at 301
Jul 08 16:34:05 INFO   Accessing items starting at 401
Jul 08 16:34:05 INFO   Accessing items starting at 501
Jul 08 16:34:06 INFO   Accessing items starting at 601
Jul 08 16:34:06 INFO   Accessing items starting at 701
Jul 08 16:34:06 INFO   Accessing items starting at 801
Jul 08 16:34:06 INFO   Accessing items starting at 901
Jul 08 16:34:06 INFO   Accessing items starting at 1001
Jul 08 16:34:07 INFO   Accessing items starting at 1101
Jul 08 16:34:07 INFO   Accessing items starting at 1201
Jul 08 16:34:07 INFO   Accessing items starting at 1301
Jul 08 16:34:07 INFO   Accessing items starting at 1401
Jul 08 16:34:07 INFO   Accessing items starting at 1501
Jul 08 16:34:07 INFO   Accessing i
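The paging loop can also be separated from the HTTP call, which makes it easier to add exception handling and to test. The following is a minimal sketch under stated assumptions, not the code that produced the log above: `fetch_all` and the `fetch_page(start, limit)` callable are hypothetical names, and `fetch_page` is assumed to return a dictionary of entries for one page or raise on failure.

```python
import logging

logger = logging.getLogger(__name__)


def fetch_all(fetch_page, page_size=100):
    """Collect the entries from every page.

    `fetch_page(start, limit)` is a caller-supplied (hypothetical) function
    that returns a dict of entries for one page, or raises on failure.
    """
    rows = []
    start = 1
    while True:
        try:
            page = fetch_page(start, page_size)
        except Exception:
            # Log the traceback and keep whatever was collected so far.
            logger.exception('Request failed for items starting at %d', start)
            break
        rows.extend(page[key] for key in sorted(page))
        if len(page) < page_size:
            break  # a short page means this was the last page
        start += page_size
    return rows
```

In this notebook, `fetch_page` would wrap the `requests.get` call and the JSON decoding, raising when the status code is not 200.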

Print the number of rows in the dataframe, then display the first five rows as a quick inspection.

In [4]:
from IPython.display import display

print('Number of rows (cryptocurrencies): {}'.format(ticker_df.shape[0]))

display(ticker_df.head(5))

Number of rows (cryptocurrencies): 1619


| | circulating_supply | id | last_updated | market_cap | max_supply | name | percent_change_1h | percent_change_24h | percent_change_7d | price | rank | symbol | total_supply | volume_24h | website_slug |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17138580.0 | 1 | 1531082000.0 | 116677500000.0 | 21000000.0 | Bitcoin | 0.21 | 2.95 | 7.11 | 6807.89 | 1 | BTC | 17138580.0 | 3584920000.0 | bitcoin |
| 1 | 31112960.0 | 10 | 1531082000.0 | 178664.0 | | Freicoin | 0.12 | | 23.23 | 0.005742 | 1196 | FRC | 100000000.0 | 159.831 | freicoin |
| 2 | | 101 | 1531082000.0 | | | KlondikeCoin | 0.12 | 21.02 | 3.57 | 0.01108 | 1562 | KDC | | 26.9692 | klondikecoin |
| 3 | 74107900.0 | 103 | 1531082000.0 | 85111.0 | | RedCoin | 0.12 | -7.57 | -4.44 | 0.001148 | 1264 | RED | 74107900.0 | 97.6741 | redcoin |
| 4 | 10497540000.0 | 109 | 1531082000.0 | 414120700.0 | 21000000000.0 | DigiByte | 0.71 | 17.84 | 56.29 | 0.039449 | 35 | DGB | 10497540000.0 | 17366000.0 | digibyte |


# Step 2: Save Data as CSV File

Save the dataframe as a CSV file locally.

In [5]:
file_name = './cryptocurrency_prices.csv'
ticker_df.to_csv(file_name, sep=',', encoding='utf-8', index=False)
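Before uploading, it can be worth confirming that the file on disk matches the dataframe. The sketch below uses only the standard library; `summarize_csv` is a hypothetical helper, and it assumes the file has a single header row, as `to_csv` writes by default.

```python
import csv


def summarize_csv(path):
    """Return (row_count, column_names) for a CSV file with one header row."""
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        header = next(reader, [])          # first line holds the column names
        row_count = sum(1 for _ in reader)  # remaining lines are data rows
    return row_count, header
```

For example, `summarize_csv(file_name)` should report the same row count as `ticker_df.shape[0]` and the same column names as `ticker_df.columns`.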

# Step 3: Upload the CSV to a GCP Bucket

Use the `upload_blob.py` utility to upload the file to the bucket. The script must first be authenticated through gcloud, so, after making sure gcloud is installed, these two things are done first:

```> pip install --upgrade google-api-python-client```

Then, enable API authentication to get application default credentials.

```> gcloud beta auth application-default login```

In theory, in a Jupyter notebook, you can run scripts using magics:

```%run -i utils/upload_blob.py jrnorton_caserta_test_cryptocurrencies cryptocurrency_prices.csv  cryptocurrency_prices.csv```

but then the script depends on the notebook's environment, which has a problem with the gcloud libraries.

So I ran it from the Linux command line instead:

```python3 utils/upload_blob.py jrnorton_caserta_test_cryptocurrencies cryptocurrency_prices.csv cryptocurrency_prices.csv```

which worked.

Note, however, that I could not find where to set ACLs from the script; I may have seen a comment that this cannot be done programmatically. Instead, I marked the object in the bucket as public through the online interface, which also provides the public URL. The public URL is [https://storage.googleapis.com/jrnorton_caserta_test_cryptocurrencies/cryptocurrency_prices.csv](https://storage.googleapis.com/jrnorton_caserta_test_cryptocurrencies/cryptocurrency_prices.csv)
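As a possible alternative to the web console, the `google-cloud-storage` client library exposes `Blob.make_public()`. The sketch below is not the `upload_blob.py` script used here; `upload_public` is a hypothetical helper, and it assumes the `google-cloud-storage` package is installed, application default credentials are configured, and the bucket uses fine-grained (ACL-based) access rather than uniform bucket-level access.

```python
def upload_public(bucket_name, source_path, blob_name, client=None):
    """Upload a local file to a bucket, make it publicly readable, and return its URL."""
    if client is None:
        # Imported lazily so the function can be exercised with a stand-in client.
        from google.cloud import storage
        client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(source_path)
    # Grants read access to all users; this fails if the bucket
    # enforces uniform bucket-level access.
    blob.make_public()
    return blob.public_url
```

A call such as `upload_public('jrnorton_caserta_test_cryptocurrencies', 'cryptocurrency_prices.csv', 'cryptocurrency_prices.csv')` would then cover both the upload and the public-link extra credit from the API rather than the console.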


**Reference**
https://stackoverflow.com/questions/40683702/upload-csv-file-to-google-cloud-storage-using-python

# From here...

The data is now in GCP.  So from here, it makes sense to execute all calls in Datalab (in the cloud).  See the second notebook.