# NoSQL - Not only SQL

Snowflake supports the following kinds of data:

- Structured data (AKA "SQL data") — such as rows and columns in a table — follows a strict tabular schema.

- Semi-structured data — such as a JSON file or an XML file — has a flexible schema.

- Unstructured data — such as a document, image, or audio file — has no inherent schema.


## Unstructured data

Snowflake supports the following actions on unstructured data:

- Securely access data files located in cloud storage.
- Share file access URLs with collaborators and partners.
- Load file access URLs and other file metadata into Snowflake tables.
- Process unstructured data.
- Load unstructured data with Document AI.


### URL Types

#### Scoped URL

- Permits temporary access to a staged file without granting privileges to the stage.
- The URL expires when the persisted query result period ends (that is, when the results cache expires, currently 24 hours).
- Give scoped access to data files to specific roles in the same account.
- Provide access to the files with a view that retrieves scoped URLs. 
- Only roles that have privileges on the view can access the files. 
- Snowflake records information in the query history about who uses a scoped URL to access a file, and when.
- Ideal for use in custom applications, for providing unstructured data to other accounts through a share, or for downloading and analysis of unstructured data in Snowsight.
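The view-based pattern described above can be sketched as follows. This is a minimal sketch, not a definitive setup: it assumes the `my_images_stage` stage created later in this notebook, and the view name `my_images_scoped_v` and the role `image_reader` are hypothetical names chosen for illustration.

```sql
-- Hypothetical view: one scoped URL per file listed in the stage's directory table.
CREATE OR REPLACE SECURE VIEW sf_cert_prep.public.my_images_scoped_v AS
SELECT
    relative_path,
    BUILD_SCOPED_FILE_URL(@sf_cert_prep.public.my_images_stage, relative_path) AS scoped_url
FROM DIRECTORY(@sf_cert_prep.public.my_images_stage);

-- Hypothetical role: it gets access to the view only, not to the stage itself.
CREATE ROLE IF NOT EXISTS image_reader;
GRANT USAGE ON DATABASE sf_cert_prep TO ROLE image_reader;
GRANT USAGE ON SCHEMA sf_cert_prep.public TO ROLE image_reader;
GRANT SELECT ON VIEW sf_cert_prep.public.my_images_scoped_v TO ROLE image_reader;
```

A user with the `image_reader` role can then select from the view to obtain scoped URLs, without holding any privilege on the stage.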

#### File URL - Stage File URL 

- URL that identifies the database, schema, stage, and file path to a set of files. 
- Does not expire.
- A role that has sufficient privileges on the stage can access the files.
- Permanent URL to a file on a stage. 
- To download or access a file, users send the file URL in a GET request to the REST API endpoint along with the authorization token. 
- Ideal for custom applications that require access to unstructured data files.


#### Pre-signed URL

- Simple HTTPS URL used to access a file via a web browser.
- A file is temporarily accessible to users via this URL using a pre-signed access token.
- The expiration time for the access token is configurable.
- Used to download or access files without authenticating into Snowflake or passing an authorization token.
- Pre-signed URLs are open; any user or application can directly access or download the files. 
- Ideal for business intelligence applications or reporting tools that need to display the unstructured file contents.


In [None]:
%%sql -r dataframe_1

CREATE OR REPLACE STAGE sf_cert_prep.public.my_images_stage
DIRECTORY = (ENABLE = TRUE)
ENCRYPTION  = (TYPE = 'SNOWFLAKE_SSE')
  ;

CREATE OR REPLACE STAGE sf_cert_prep.public.my_images_stage2
DIRECTORY = (ENABLE = TRUE)
ENCRYPTION  = (TYPE = 'SNOWFLAKE_SSE')
  ;

-- Upload some images to the stage manually before continuing.
-- Automating the upload was not possible here because trial accounts don't allow external integrations.
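One way to upload the images from a local machine is the `PUT` command, run from SnowSQL or another client that supports it (it is not available in worksheets). The sketch below assumes the images live in a hypothetical local folder `/tmp/images`; adjust the path to your environment.

```sql
-- Hypothetical local path; AUTO_COMPRESS = FALSE keeps the images as plain .jpg files
-- instead of gzip-compressing them during upload.
PUT file:///tmp/images/*.jpg @sf_cert_prep.public.my_images_stage
    AUTO_COMPRESS = FALSE;

-- Synchronize the directory table so the newly uploaded files show up in it.
ALTER STAGE sf_cert_prep.public.my_images_stage REFRESH;
```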


In [None]:
%%sql -r dataframe_13
alter stage sf_cert_prep.public.my_images_stage refresh;

In [None]:
%%sql -r dataframe_3
ALTER STAGE sf_cert_prep.public.my_images_stage
SET DIRECTORY = (ENABLE = TRUE, AUTO_REFRESH = TRUE);

In [None]:
%%sql -r dataframe_7
SELECT *
FROM DIRECTORY(@sf_cert_prep.public.my_images_stage);

## Server-side encryption for unstructured data access

### Types of encryption for internal stages

    [ ENCRYPTION = (   TYPE = 'SNOWFLAKE_FULL' | TYPE = 'SNOWFLAKE_SSE' ) ]

#### SNOWFLAKE_FULL

- Client-side ***and*** server-side encryption. 
- The files are encrypted by a client when it uploads them to the internal stage using PUT.
- Snowflake uses a 128-bit encryption key by default.
- All files are also automatically encrypted using AES-256 strong encryption on the server side.

#### SNOWFLAKE_SSE

- Server-side encryption only. 
- The files are encrypted when they arrive on the stage by the cloud service where your Snowflake account is hosted.


In short, to make unstructured files on an internal stage readable by third-party tools (for instance, to open an image downloaded from the stage on your PC), create the stage with server-side encryption only (`SNOWFLAKE_SSE`).
Otherwise, staged files are client-side encrypted by default.
The encryption keys are owned by Snowflake, so client-side encrypted files are unreadable by users and external tools that access them through pre-signed, file, or scoped URLs.

### Types of encryption for external stages

#### AWS S3

    [ ENCRYPTION = ( [ TYPE = 'AWS_CSE' ] MASTER_KEY = '<string>'
                       | TYPE = 'AWS_SSE_S3'
                       | TYPE = 'AWS_SSE_KMS' [ KMS_KEY_ID = '<string>' ]
                       | TYPE = 'NONE' ) ]


#### Google Cloud Storage


    [ ENCRYPTION = (   TYPE = 'GCS_SSE_KMS' [ KMS_KEY_ID = '<string>' ]
                       | TYPE = 'NONE' ) ]

#### Azure Blob


    [ ENCRYPTION = (   TYPE = 'AZURE_CSE' MASTER_KEY = '<string>'
                       | TYPE = 'NONE' ) ]

In [None]:
%%sql -r dataframe_5
list @sf_cert_prep.public.my_images_stage;

In [None]:
%%sql -r dataframe_6

SELECT *
FROM DIRECTORY(@sf_cert_prep.public.my_images_stage)  -- exposes RELATIVE_PATH, SIZE, LAST_MODIFIED, FILE_URL, etc.
ORDER BY RELATIVE_PATH;


## Directory tables

A directory table is an implicit object layered on a stage (not a separate database object).

- Both external and internal stages support directory tables

- Query a directory table to list all the unstructured files on a stage, including each file's size, last modified timestamp, and Snowflake ***File URL***.

- You can join a directory table with a Snowflake table that contains additional data and metadata about unstructured files.

- Construct a file processing pipeline. You can use a directory table with the Snowpark API or external functions to create a file processing pipeline.
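As an illustration of joining a directory table with a regular table, here is a minimal sketch. The table `image_labels` and its columns are hypothetical and exist only for this example; the stage is the one used throughout this notebook.

```sql
-- Hypothetical table holding extra metadata per file, keyed by the file's relative path.
CREATE OR REPLACE TABLE sf_cert_prep.public.image_labels (
    relative_path STRING,
    label         STRING
);

-- Combine the file metadata from the directory table with the labels.
WITH files AS (
    SELECT relative_path, size, last_modified, file_url
    FROM DIRECTORY(@sf_cert_prep.public.my_images_stage)
)
SELECT
    f.relative_path,
    f.size,
    f.last_modified,
    f.file_url,
    l.label
FROM files AS f
LEFT JOIN sf_cert_prep.public.image_labels AS l
    ON l.relative_path = f.relative_path
ORDER BY f.relative_path;
```

The `LEFT JOIN` keeps files that have no label yet, which makes it easy to spot files that still need processing.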


### Automated Refresh

#### Internal Stage: Preview feature only for AWS accounts

A refresh synchronizes the directory table metadata with the files in the stage:

- New files in the path are added to the table metadata.

- Changes to files in the path are updated in the table metadata.

- Files no longer in the path are removed from the table metadata.

    CREATE STAGE my_int_stage
      DIRECTORY = (
        ENABLE = TRUE
        AUTO_REFRESH = TRUE
      );

#### External Stage:

For external stages, the cloud storage service must also be set up to send event notifications.

In general, the steps are the same for all providers:

1. Create Storage Integration:

    CREATE OR REPLACE STORAGE INTEGRATION s3_int
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = S3
      ENABLED = TRUE
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<acct>:role/snowflake-role'
      STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/data/');

    
2. Create an external stage with a directory table:
    
    CREATE OR REPLACE STAGE my_ext_stage
      URL = 's3://my-bucket/data/'
      STORAGE_INTEGRATION = s3_int
      DIRECTORY = (
           ENABLE = TRUE
           AUTO_REFRESH = TRUE
      );

3. Set up event notifications in the cloud provider:
- S3: SQS event notification
- GCS: Cloud Pub/Sub
- Azure: Event Grid
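For S3, the values needed on the AWS side can usually be read back from Snowflake once steps 1 and 2 are done. The sketch below reuses the `s3_int` and `my_ext_stage` names from the examples above; the property names mentioned in the comments are the ones I would expect to see in the output, so check them against the linked documentation.

```sql
-- Look for STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID;
-- they go into the trust policy of the IAM role referenced by the integration.
DESC INTEGRATION s3_int;

-- Look for DIRECTORY_NOTIFICATION_CHANNEL: the ARN of the SQS queue
-- that the S3 bucket's event notification should target.
DESC STAGE my_ext_stage;
```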


##### S3:

![alt](https://docs.snowflake.com/en/_images/storage-integration-s3.png)

https://docs.snowflake.com/en/user-guide/data-load-dirtables-auto-s3

##### GCS:

![alt](https://docs.snowflake.com/en/_images/storage-integration-gcs.png)

https://docs.snowflake.com/en/user-guide/data-load-dirtables-auto-gcs

##### Azure Blob:


https://docs.snowflake.com/en/user-guide/data-load-dirtables-auto-azure



#### Workaround for "Automated Refresh"

As a workaround, you can set up a task that runs periodically to refresh the stage's directory table.


    CREATE OR REPLACE TASK daily_refresh
      WAREHOUSE = compute_wh
      SCHEDULE = 'USING CRON 30 12 * * * UTC'  -- once a day at 12:30 UTC
    AS
      ALTER STAGE my_ext_stage REFRESH;
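Note that a newly created task is suspended, so the `daily_refresh` task above does nothing until it is resumed. A minimal sketch of enabling and checking it, assuming the current role is allowed to operate the task:

```sql
-- Tasks are created in a suspended state; resume the task to activate the schedule.
ALTER TASK daily_refresh RESUME;

-- Optionally trigger a single run right away instead of waiting for 12:30 UTC.
EXECUTE TASK daily_refresh;

-- Inspect the most recent runs and their state.
SELECT name, state, scheduled_time, completed_time, error_message
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'DAILY_REFRESH'))
ORDER BY scheduled_time DESC
LIMIT 5;
```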




### BUILD_STAGE_FILE_URL - SQL Function

Generates a Snowflake file URL to a staged file using the stage name and relative file path as inputs.

#### Arguments:
- stage_name: 
    - Must be enclosed in single quotes if the stage name contains spaces.

- relative_file_path:
    - Path and filename of the file relative to its location in the stage


#### Returns     

    https://<account_identifier>/api/files/<db_name>/<schema_name>/<stage_name>/<relative_path>



In [None]:
%%sql -r dataframe_9
-- Build a permanent file URL (the file's relative path within the stage must be known)
SELECT BUILD_STAGE_FILE_URL('@sf_cert_prep.public.my_images_stage', '1.jpg') AS file_url;

### BUILD_SCOPED_FILE_URL  - SQL Function

Generates a scoped, encoded URL that permits access to a specified file for a limited period of time.

#### Arguments:
- stage_name: 
    - Must be enclosed in single quotes if the stage name contains spaces.

- relative_file_path:
    - Path and filename of the file relative to its location in the stage

- use_privatelink_host_for_business_critical:
    - Default: True
    - Specifies whether to add privatelink to the URL for Business Critical accounts


#### Returns     

    https://<account_identifier>/api/files/<query_id>/<encoded_file_path>


In [None]:
%%sql -r dataframe_10
SELECT BUILD_SCOPED_FILE_URL('@sf_cert_prep.public.my_images_stage', '1.jpg') AS scoped_url;

### GET_PRESIGNED_URL  - SQL Function

Generates a pre-signed URL to a file on a stage using the stage name and relative file path as inputs.

#### Arguments:
- stage_name: 
    - Must be enclosed in single quotes if the stage name contains spaces.

- relative_file_path:
    - Path and filename of the file relative to its location in the stage

- expiration_time:
    - Default: 3600 (60 minutes)
    - Length of time (in seconds) after which the short term access token expires


#### Returns     

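A pre-signed HTTPS URL that points directly to the file in the stage's cloud storage. The exact format depends on the cloud provider that hosts the stage; unlike file and scoped URLs, it is not a Snowflake `/api/files/` URL.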


In [None]:
-- 10-minute presigned link
ALTER SESSION SET USE_CACHED_RESULT=FALSE;
USE "SF_CERT_PREP"."PUBLIC";
SELECT GET_PRESIGNED_URL('@sf_cert_prep.public.my_images_stage', '1.jpg', 600) AS presigned_url;
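To avoid hard-coding a file name, the same function can be applied to every row of the stage's directory table. This is a sketch under the assumption that the stage contains `.jpg` files; the 600-second expiration matches the cell above.

```sql
-- One 10-minute pre-signed URL per JPEG file currently listed in the directory table.
SELECT
    relative_path,
    GET_PRESIGNED_URL(@sf_cert_prep.public.my_images_stage, relative_path, 600) AS presigned_url
FROM DIRECTORY(@sf_cert_prep.public.my_images_stage)
WHERE relative_path ILIKE '%.jpg'
ORDER BY relative_path;
```

Because pre-signed URLs are open to anyone who holds them, a short expiration time is usually the safer choice.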

### Case Open

GET_PRESIGNED_URL does not work properly in notebooks running on containers; it works normally in the warehouse environment.

Case with Snowflake: 01259013

In [None]:
-- Retrieves the URL for an external or internal named stage, using the stage name as the input

SELECT GET_STAGE_LOCATION(@"SF_CERT_PREP"."PUBLIC".MY_IMAGES_STAGE2);