# Stream data to GCS
___

This notebook shows how to use satin tooling and `requests` to stream large files from the Rio Tinto SFT server into Google Cloud Storage (GCS) buckets.

In [1]:
# import packages
import os
import requests
import json
from tqdm import tqdm

import sys
sys.path.append('/home/jovyan/rose/satin/')

from satin.utils.gcs_io.streamupload import requests_to_gs

### Set up access to use requests 
___ 
This uses your username and password for the Rio Tinto SFT site to obtain an access token, which is then sent in the `Authorization` header of every request.

In [2]:
payload = {
    'username': 'YOUR-USERNAME', 
    'password': 'YOUR-PASSWORD', 
    'grant_type': 'password'
}

# If the server uses a self-signed certificate, requests with verify=False
# emit SSL warnings; this call silences them
requests.packages.urllib3.disable_warnings()

# Authentication
r = requests.post('https://sft.riotinto.com/api/v1/token', data=payload, verify=False)
access_token = r.json()['access_token']
header = {'Authorization': f'Bearer {access_token}'}
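
To keep credentials out of the notebook itself, the payload can be built from environment variables instead of string literals. The following is a minimal sketch; the variable names `SFT_USERNAME` and `SFT_PASSWORD` and the helper `payload_from_env` are hypothetical and not part of satin or the SFT API.

```python
import os

def payload_from_env(user_var="SFT_USERNAME", password_var="SFT_PASSWORD"):
    """Build the token-request payload from environment variables.

    The variable names are illustrative assumptions; use whatever your
    environment defines.
    """
    missing = [v for v in (user_var, password_var) if not os.environ.get(v)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "username": os.environ[user_var],
        "password": os.environ[password_var],
        "grant_type": "password",
    }
```

The returned dictionary has the same keys as `payload` above, so it can be passed to `requests.post(..., data=payload)` unchanged.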

### Browse files within the SFT Rio Tinto server 
___

To download files with `requests`, you need to know where they live on the server. Below, we show how to browse the folders and files on the Rio Tinto server. Once you find the folders and files you need, note their IDs manually; the download URLs are built from them.

First, list all the folders available on the server:

In [3]:
r = requests.get('https://sft.riotinto.com/api/v1/folders', headers=header, allow_redirects=True)
dict_ = json.loads(r.text)

for item in dict_['items']:
    print(item)
    print('')

{'id': '614788840', 'parentId': '0', 'name': '', 'lastContentChangeTime': '2021-05-11T07:30:55', 'folderType': 'Root', 'path': '/', 'isShared': False, 'permission': {'canListSubfolders': True, 'canAddSubfolders': False, 'canChangeSettings': False, 'canDelete': False, 'canListFiles': False, 'canReadFiles': False, 'canWriteFiles': False, 'canDeleteFiles': False, 'canShare': False}, 'subfolderCount': 1, 'totalFileCount': 0, 'sharedWithUsersCount': 0, 'sharedWithGroupsCount': 0}

{'id': '720707876', 'parentId': '614788840', 'name': 'RTX AMR Data Transfers', 'lastContentChangeTime': '2021-04-13T09:54:57', 'folderType': 'Normal', 'path': '/RTX AMR Data Transfers', 'isShared': True, 'permission': {'canListSubfolders': True, 'canAddSubfolders': False, 'canChangeSettings': False, 'canDelete': False, 'canListFiles': False, 'canReadFiles': False, 'canWriteFiles': False, 'canDeleteFiles': False, 'canShare': False}, 'subfolderCount': 1, 'totalFileCount': 0, 'sharedWithUsersCount': 0, 'sharedWithGro

By logging into the SFT site itself, we can see where the files we need are located. As an example, let's find the URL for the `/RTX AMR Data Transfers/DescartesLab/Rosemont_LWIR_Emissivity_V2.zip` file. From the server site, we know that it is in the `/RTX AMR Data Transfers/DescartesLab` folder. By manually inspecting the items in the folder listing, we see that the `id` of the folder with this path is `'722722105'`. Now we can change the request URL to list the files in this folder:

In [4]:
r = requests.get('https://sft.riotinto.com/api/v1/folders/722722105/files', headers=header, allow_redirects=True)
dict_ = json.loads(r.text)

for item in dict_['items']:
    print(item)
    print('')

{'path': '/RTX AMR Data Transfers/DescartesLab/Christmas_001-053_EMISS_Mosaic_V2.zip', 'uploadStamp': '2021-05-03T17:04:26.187', 'isNew': False, 'name': 'Christmas_001-053_EMISS_Mosaic_V2.zip', 'size': 1776261280, 'id': '764579580'}

{'path': '/RTX AMR Data Transfers/DescartesLab/Christmas_001-053_EMISS_Mosaic_V2_01.zip', 'uploadStamp': '2021-05-04T18:25:44.567', 'isNew': False, 'name': 'Christmas_001-053_EMISS_Mosaic_V2_01.zip', 'size': 1888011872, 'id': '764715030'}

{'path': '/RTX AMR Data Transfers/DescartesLab/Rosemont_LWIR_Emissivity_V2.zip', 'uploadStamp': '2021-04-16T14:59:18.897', 'isNew': False, 'name': 'Rosemont_LWIR_Emissivity_V2.zip', 'size': 294403204, 'id': '759976791'}
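
Reading IDs off the printed listings by eye is error-prone once a folder holds many entries. As a sketch, a small helper can look up the `id` by `path` instead. `find_id` is a hypothetical name, and the `items` list here is trimmed by hand from the listing printed above; in the notebook you would pass `dict_['items']`.

```python
def find_id(items, path):
    """Return the 'id' of the single listing entry whose 'path' equals `path`."""
    matches = [item["id"] for item in items if item.get("path") == path]
    if len(matches) != 1:
        raise LookupError(f"expected one entry for {path!r}, found {len(matches)}")
    return matches[0]

# Two entries copied (and trimmed) from the file listing above.
items = [
    {"path": "/RTX AMR Data Transfers/DescartesLab/Christmas_001-053_EMISS_Mosaic_V2.zip",
     "id": "764579580"},
    {"path": "/RTX AMR Data Transfers/DescartesLab/Rosemont_LWIR_Emissivity_V2.zip",
     "id": "759976791"},
]

file_id = find_id(items, "/RTX AMR Data Transfers/DescartesLab/Rosemont_LWIR_Emissivity_V2.zip")
print(file_id)  # 759976791
```

The same helper works on the folder listing, since folder entries also carry `path` and `id` keys.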



The file of interest appears above, with path `/RTX AMR Data Transfers/DescartesLab/Rosemont_LWIR_Emissivity_V2.zip`. Inspecting its entry shows that the file has id `'759976791'`. With the folder ID and the file ID, we can construct the list of URLs to download. In this case, we're downloading just this one file.

In [5]:
urls = [
    'https://sft.riotinto.com/api/v1/folders/722722105/files/759976791/download'
]
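
The download URL follows the pattern `folders/<folder id>/files/<file id>/download`, as the entry above shows. When several files are needed, a small helper avoids typing each URL by hand. This is only a sketch: `download_url` and `BASE_URL` are hypothetical names, and the pattern is inferred from the single URL above.

```python
BASE_URL = "https://sft.riotinto.com/api/v1"

def download_url(folder_id, file_id, base_url=BASE_URL):
    """Build a download URL from a folder ID and a file ID."""
    return f"{base_url}/folders/{folder_id}/files/{file_id}/download"

# Reproduces the single-entry list defined above.
urls = [download_url("722722105", "759976791")]
```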

To stream the data into a Google bucket, we need to mount the bucket, then stream the above URL chunk by chunk into its new location. You can mount the bucket with `gcsfuse` on a Google Cloud VM. Create a folder called `mounted_bucket` in the same directory as this notebook, then mount the base bucket you want to stream to onto that folder.
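
If the mount did not succeed, writes to `mounted_bucket` land on the VM's local disk without any error. A quick guard, sketched below with the standard library's `os.path.ismount`, can catch that before a large download starts. `require_mount` is a hypothetical helper, and the check assumes a POSIX system where the `gcsfuse` mount is visible as a separate mount point.

```python
import os

def require_mount(path):
    """Return `path` if it is a mount point, otherwise raise.

    Assumes POSIX semantics for os.path.ismount.
    """
    if not os.path.ismount(path):
        raise RuntimeError(f"{path!r} is not a mount point; mount the bucket first")
    return path
```

Calling `base_folder = require_mount('mounted_bucket')` before streaming would then stop the notebook early if the bucket is not mounted.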

Or, if you'd just like to stream the files into workbench, create a folder to stream to locally. In this case, it's not necessary to mount a bucket.

In [6]:
base_folder = 'local_folder'
os.makedirs(base_folder, exist_ok=True)

# Name the local copy after the file being downloaded (the V2 archive)
file_path = f'{base_folder}/Rosemont_LWIR_Emissivity_V2.zip'

Finally, stream the files from the urls into the desired folder, either to the mounted bucket or to the local folder!

In [8]:
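for url in urls:
    # Request a new access token for each download
    r = requests.post('https://sft.riotinto.com/api/v1/token', data=payload, verify=True)
    r.raise_for_status()
    access_token = r.json()['access_token']
    header = {'Authorization': f'Bearer {access_token}'}
    
    # Stream file to folder
    print('url: ', url)
    r = requests.get(url, headers=header, allow_redirects=True, stream=True)
    r.raise_for_status()
    
    # Note: every url is written to the same file_path; use one path per url
    # when downloading several files. The with-block closes the file even if
    # the download fails midway.
    with open(file_path, 'wb') as of:
        for chunk in tqdm(r.iter_content(chunk_size=8*1024*1024)):
            of.write(chunk)

    print('Complete.')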

url:  https://sft.riotinto.com/api/v1/folders/722722105/files/759976791/download


36it [00:17,  2.10it/s]

Complete.
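
The progress bar above shows 36 iterations, which matches the file's listed `size` of 294403204 bytes divided into 8 MiB chunks and rounded up. The sketch below computes that count so it could be passed to `tqdm` as `total`, and counts the bytes written so the result can be compared with the listed size. `expected_chunks` and `write_chunks` are hypothetical helpers, and `iter_content` is not guaranteed to yield full-size chunks in every case, so treat the count as an estimate.

```python
import math
import os
import tempfile

CHUNK_SIZE = 8 * 1024 * 1024  # same chunk size as the download loop above

def expected_chunks(size_bytes, chunk_size=CHUNK_SIZE):
    """Number of chunks needed to cover `size_bytes`, rounding up."""
    return math.ceil(size_bytes / chunk_size)

def write_chunks(chunks, path):
    """Write an iterable of byte chunks to `path` and return the bytes written."""
    written = 0
    with open(path, "wb") as out:
        for chunk in chunks:
            out.write(chunk)
            written += len(chunk)
    return written

print(expected_chunks(294403204))  # 36

# Small offline demonstration with in-memory chunks instead of a real response.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    written = write_chunks([b"abc", b"de"], demo_path)
    demo_size = os.path.getsize(demo_path)
```

In the notebook, `write_chunks(tqdm(r.iter_content(chunk_size=CHUNK_SIZE), total=expected_chunks(size)), file_path)` would return the byte count, where `size` is the value from the file listing; a mismatch between the two suggests a truncated download.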



