## Cabinet Sandbox - YouTube Thumbnails Example
#### Description
Cabinet Sandbox provides an example of the Cabinet system at work. In this example (youtube.py), a dataset of thumbnail images and their associated metadata is uploaded to Cabinet. The other functions in the cabinet_sdk library are then demonstrated using the uploaded data.
#### The Data
A directory tree of YouTube video thumbnails and a CSV file with the associated metadata.

Link to dataset: https://www.kaggle.com/datasets/praneshmukhopadhyay/youtube-thumbnail-dataset 

In [94]:
%load_ext autoreload 
%autoreload 2 

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [95]:
from typing import List
from PIL import Image 
import io
import os
from IPython import display
from base64 import b64decode, b64encode
import pdb
import cabinet_sdk as c

In [96]:
# set environment 
os.environ['ENV'] = 'testing'
ENV = os.getenv("ENV")

Test that the cabinet_sdk library is installed and working


In [97]:
c.check_health()

SDK live
 {'status': 200}


##### Process Data into a Format That Can Be Uploaded to Cabinet
Note: these steps are specific to this dataset (YouTube thumbnails).



1. Turn the CSV metadata into a list of dicts (List[dict])


In [98]:
import csv
from itertools import islice

desired_num_entries = 2

def create_metadata_list() -> list:
    """Read the first desired_num_entries data rows of the CSV into metadata dicts."""
    metadata = []
    with open('mini_metadata.csv', newline='') as csvfile:
        reader = csv.reader(csvfile)
        next(reader, None)  # skip the header row
        for entry_list in islice(reader, desired_num_entries):
            # if an unquoted title was split on its commas, rejoin it into a single title
            title = ','.join(entry_list[3:])
            metadata.append({
                'blob_type': 'youtube',
                'photo_id': entry_list[0],
                'channel': entry_list[1],
                'category': entry_list[2],
                'title': title,
            })
    return metadata

metadatas = create_metadata_list()
print(metadatas)


[{'blob_type': 'youtube', 'photo_id': 'OkmNXy7er84', 'channel': '3Blue1Brown', 'category': 'Science', 'title': 'The hardest problem on the hardest test'}, {'blob_type': 'youtube', 'photo_id': 'r6sGWTCMz2k', 'channel': '3Blue1Brown', 'category': 'Science', 'title': 'But what is a Fourier series? From heat flow to drawing with circles | DE4'}]
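Titles that contain commas are the awkward case in this step. As a minimal sketch, assuming the CSV quotes any field that contains a comma and has a header row, `csv.DictReader` can parse each row straight into a dict. The header names, the second sample row, and the helper name `rows_to_metadata` are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Hypothetical sample with four columns like the metadata above;
# the header names and the second row are invented for illustration.
SAMPLE_CSV = '''photo_id,channel,category,title
OkmNXy7er84,3Blue1Brown,Science,The hardest problem on the hardest test
abc123,ExampleChannel,Science,"Circles, waves, and sums"
'''

def rows_to_metadata(csvfile, blob_type='youtube', limit=None):
    """Turn CSV rows into metadata dicts, tagging each with blob_type."""
    metadata = []
    for n, row in enumerate(csv.DictReader(csvfile)):
        if limit is not None and n >= limit:
            break
        metadata.append({'blob_type': blob_type, **row})
    return metadata

sample_metadata = rows_to_metadata(io.StringIO(SAMPLE_CSV))
print(sample_metadata[1]['title'])  # Circles, waves, and sums
```

With a real file, the same function could be called as `rows_to_metadata(open('mini_metadata.csv', newline=''), limit=2)`, provided the header names match the metadata fields.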


2. Use the metadata to generate a list of file paths to the corresponding thumbnails

In [99]:
def create_paths_list(md) -> list:
    paths = []
    for entry in md:
        channel = entry['channel']
        photo_id = entry['photo_id']  # avoid shadowing the built-in id()
        paths.append(f'images/{channel}/{photo_id}.jpg')
    return paths
    
paths = create_paths_list(metadatas) 
print(paths)

['images/3Blue1Brown/OkmNXy7er84.jpg', 'images/3Blue1Brown/r6sGWTCMz2k.jpg']
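The generated paths are only as good as the metadata they come from, so it can help to check them before uploading. The sketch below is an optional extra, not part of cabinet_sdk; the helper name `split_existing` and the second example path are hypothetical.

```python
from pathlib import Path

def split_existing(paths, root='.'):
    """Return (found, missing) lists of paths, checked relative to root."""
    found, missing = [], []
    for p in paths:
        target = found if (Path(root) / p).is_file() else missing
        target.append(p)
    return found, missing

# The second path is a made-up example of a thumbnail that does not exist.
example_paths = [
    'images/3Blue1Brown/OkmNXy7er84.jpg',
    'images/no_such_channel/missing.jpg',
]
found, missing = split_existing(example_paths)
print('missing:', missing)
```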


3. Create a list of tuples with t[0]=metadata:dict, t[1]=img_file_path:str

In [100]:
def create_upload_tuples(metadatas: list, paths: list) -> list:
    # pair each metadata dict with the path of its thumbnail
    return list(zip(metadatas, paths))

blob_info:List[tuple] = create_upload_tuples(metadatas,paths)
print(blob_info)

[({'blob_type': 'youtube', 'photo_id': 'OkmNXy7er84', 'channel': '3Blue1Brown', 'category': 'Science', 'title': 'The hardest problem on the hardest test'}, 'images/3Blue1Brown/OkmNXy7er84.jpg'), ({'blob_type': 'youtube', 'photo_id': 'r6sGWTCMz2k', 'channel': '3Blue1Brown', 'category': 'Science', 'title': 'But what is a Fourier series? From heat flow to drawing with circles | DE4'}, 'images/3Blue1Brown/r6sGWTCMz2k.jpg')]


### Cabinet at Work

BLOB_TYPES - provides the name and metadata fields of every blob_type in your Cabinet

In [101]:
print(ENV)
print(c.list_blob_types())

testing
{'fruit': ['entry_id', 'blob_type', 'fruit_name', 'fruit_color', 'blob_hash'], 'chess': ['entry_id', 'blob_type'], 'youtube': ['entry_id', 'blob_type', 'blob_hash', 'photo_id', 'channel', 'category', 'title']}


FIELDS - lists the metadata fields of the specified blob_type

In [102]:
print(c.list_schema('youtube'))

['entry_id', 'blob_type', 'blob_hash', 'photo_id', 'channel', 'category', 'title']
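The schema list can be used to check a metadata dict before uploading it. This is a rough sketch, not an SDK feature: it assumes that `entry_id` and `blob_hash` are filled in by Cabinet rather than supplied by the caller, which the source does not confirm, and `check_metadata` is a hypothetical helper.

```python
# Assumption: these fields are generated by Cabinet, not provided by the caller.
GENERATED_FIELDS = {'entry_id', 'blob_hash'}

def check_metadata(metadata: dict, schema: list) -> dict:
    """Report schema fields missing from metadata and fields not in the schema."""
    expected = set(schema) - GENERATED_FIELDS
    provided = set(metadata)
    return {
        'missing': sorted(expected - provided),
        'unexpected': sorted(provided - expected),
    }

youtube_schema = ['entry_id', 'blob_type', 'blob_hash', 'photo_id',
                  'channel', 'category', 'title']
entry = {'blob_type': 'youtube', 'photo_id': 'OkmNXy7er84',
         'channel': '3Blue1Brown', 'category': 'Science'}
print(check_metadata(entry, youtube_schema))
# {'missing': ['title'], 'unexpected': []}
```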


#### Upload Blobs
Iterate through the list of tuples and add each blob and its metadata to Cabinet using the Cabinet UPLOAD function.

In [103]:
def upload_bulk_data(blob_info:List[tuple]):
    for metadata, path in blob_info:
        try:
            print(c.upload(metadata, path, ['testing']))
        except Exception as e:
            # report the failure and move on to the next blob
            print(e.args[0] if e.args else e)

upload_bulk_data(blob_info) 

BlobDuplication: blob already saved in requested location
BlobDuplication: blob already saved in requested location
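Printing each result works for a couple of blobs; for larger batches it may be more convenient to collect the outcomes. The sketch below takes the upload function as a parameter so it can be run without a Cabinet instance. `upload_all` and `fake_upload` are hypothetical names, and the stand-in uploader only imitates the duplicate error seen above.

```python
from typing import Callable, List, Tuple

def upload_all(blob_info: List[Tuple[dict, str]],
               upload: Callable[[dict, str], object]):
    """Call upload(metadata, path) for each pair; return (successes, failures)."""
    successes, failures = [], []
    for metadata, path in blob_info:
        try:
            successes.append((path, upload(metadata, path)))
        except Exception as exc:
            # keep going so one bad blob does not stop the batch
            failures.append((path, str(exc)))
    return successes, failures

# Stand-in for the real uploader, used only to make this sketch runnable.
def fake_upload(metadata, path):
    if path.endswith('dup.jpg'):
        raise ValueError('blob already saved in requested location')
    return {'entry_id': 1}

ok, failed = upload_all(
    [({'title': 'a'}, 'a.jpg'), ({'title': 'b'}, 'dup.jpg')],
    fake_upload,
)
print(len(ok), len(failed))  # 1 1
```

In this notebook the real call could be wrapped as `upload_all(blob_info, lambda md, p: c.upload(md, p, ['testing']))`.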


SEARCH - search for entries that match specified metadata values. You can use any number of valid metadata fields to search. 

In [104]:
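preview = 2

matching_entries = c.search('youtube',{'category':'Science'})
# search returns a dict of columns (field name -> list of values),
# so len(matching_entries) counts metadata fields, not entries
print('Number of metadata fields returned: ', len(matching_entries))
print(f'Details for {preview} entries')
for key in matching_entries:
    print(key, matching_entries[key][:preview])

Number of metadata fields returned:  7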
Details for 2 entries
channel ['3Blue1Brown', '3Blue1Brown']
blob_type ['youtube', 'youtube']
blob_hash ['07e34467fe1418d8d80c9fa6efbf7868a98148b8ed3918a3be606ca070b51dda', 'ced97f146e9aa3db62131c537fedab78bfa9355698e53b746da1158a70fde9cc']
category ['Science', 'Science']
entry_id [144, 145]
photo_id ['OkmNXy7er84', 'r6sGWTCMz2k']
title ['The hardest problem on the hardest test', 'But what is a Fourier series? From heat flow to drawing with circles | DE4']
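The search result is column-oriented: each field maps to a list with one value per entry. When it is easier to work entry by entry, the columns can be transposed into one dict per entry. A minimal sketch, using a shortened copy of the output above; `columns_to_rows` is a hypothetical helper, not part of cabinet_sdk.

```python
def columns_to_rows(columns: dict) -> list:
    """Convert {'field': [v0, v1, ...]} into [{'field': v0}, {'field': v1}, ...]."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]

# Shortened copy of the search output shown above.
results = {
    'entry_id': [144, 145],
    'photo_id': ['OkmNXy7er84', 'r6sGWTCMz2k'],
    'category': ['Science', 'Science'],
}
rows = columns_to_rows(results)
print(len(rows))            # 2
print(rows[0]['photo_id'])  # OkmNXy7er84
```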


If no metadata values are provided, all entries for the specified blob_type are returned.

In [105]:
matching_entries = c.search('youtube')
num_entries = len(matching_entries['entry_id'])
print(f'There are {num_entries} youtube entries')


There are 4 youtube entries


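UPDATE - performs a soft update: instead of changing an entry in place, a new metadata entry containing the changes is added and the original entry is kept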

In [106]:
# Pre-update: search all entries in youtube table matching specified title
resp = c.search('youtube',{'title':'The hardest problem on the hardest test'})
print(resp)
entry_id_ex = resp['entry_id'][0]

{'channel': ['3Blue1Brown', '3Blue1Brown', '3Blue1Brown'], 'blob_type': ['youtube', 'youtube', 'youtube'], 'blob_hash': ['07e34467fe1418d8d80c9fa6efbf7868a98148b8ed3918a3be606ca070b51dda', '07e34467fe1418d8d80c9fa6efbf7868a98148b8ed3918a3be606ca070b51dda', '07e34467fe1418d8d80c9fa6efbf7868a98148b8ed3918a3be606ca070b51dda'], 'category': ['Science', 'MATH', 'MATH'], 'entry_id': [144, 146, 147], 'photo_id': ['OkmNXy7er84', 'OkmNXy7er84', 'OkmNXy7er84'], 'title': ['The hardest problem on the hardest test', 'The hardest problem on the hardest test', 'The hardest problem on the hardest test']}


In [107]:
# Update: a new metadata entry containing the specified changes is added to the youtube table 
id_of_update = c.update('youtube',entry_id_ex,{'category':'MATH'})
print(id_of_update)

{'entry_id': 148}


In [108]:
# Post-update: same search shows new updated entry has been added 
print(c.search('youtube',{'title':'The hardest problem on the hardest test'}))

{'channel': ['3Blue1Brown', '3Blue1Brown', '3Blue1Brown', '3Blue1Brown'], 'blob_type': ['youtube', 'youtube', 'youtube', 'youtube'], 'blob_hash': ['07e34467fe1418d8d80c9fa6efbf7868a98148b8ed3918a3be606ca070b51dda', '07e34467fe1418d8d80c9fa6efbf7868a98148b8ed3918a3be606ca070b51dda', '07e34467fe1418d8d80c9fa6efbf7868a98148b8ed3918a3be606ca070b51dda', '07e34467fe1418d8d80c9fa6efbf7868a98148b8ed3918a3be606ca070b51dda'], 'category': ['Science', 'MATH', 'MATH', 'MATH'], 'entry_id': [144, 146, 147, 148], 'photo_id': ['OkmNXy7er84', 'OkmNXy7er84', 'OkmNXy7er84', 'OkmNXy7er84'], 'title': ['The hardest problem on the hardest test', 'The hardest problem on the hardest test', 'The hardest problem on the hardest test', 'The hardest problem on the hardest test']}
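Because earlier entries stay in place, the same search returns every version. One way to pick the most recent one is to take the entry with the largest `entry_id`. This is a sketch under the assumption that `entry_id` increases with each new entry, which matches the outputs above but is not confirmed by the SDK; `latest_entry` is a hypothetical helper.

```python
def latest_entry(columns: dict) -> dict:
    """Return the entry with the highest entry_id from column-oriented results."""
    ids = columns['entry_id']
    if not ids:
        raise ValueError('no entries to choose from')
    newest = max(range(len(ids)), key=ids.__getitem__)
    return {field: values[newest] for field, values in columns.items()}

# Shortened copy of the post-update search output.
versions = {
    'entry_id': [144, 146, 147, 148],
    'category': ['Science', 'MATH', 'MATH', 'MATH'],
    'photo_id': ['OkmNXy7er84'] * 4,
}
print(latest_entry(versions))
# {'entry_id': 148, 'category': 'MATH', 'photo_id': 'OkmNXy7er84'}
```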


RETRIEVE - returns the paths to the locations where the requested blob is saved; the blob itself is stored there as raw bytes

In [109]:
blob_urls = c.retrieve('youtube', entry_id_ex)
print('Paths to locations where blob is saved', blob_urls)
path1 = blob_urls[0]

Paths to locations where blob is saved ['blobs/youtube/07e34467fe1418d8d80c9fa6efbf7868a98148b8ed3918a3be606ca070b51dda']
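The file name in the returned path is the same 64-character hex string as the entry's `blob_hash`. Assuming this is the SHA-256 digest of the blob's bytes (a guess based on its length, not something the source states), a retrieved file could be checked against its hash as sketched below. Both helper names are hypothetical.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def matches_hash(path: str, expected_hash: str) -> bool:
    """True if the file's SHA-256 digest equals expected_hash (case-insensitive)."""
    return sha256_of_file(path) == expected_hash.lower()
```

In this notebook the check would look like `matches_hash(path1, resp['blob_hash'][0])`; a `False` result would mean either a corrupted blob or that the SHA-256 assumption does not hold.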


### Post Processing
The blob must be further processed by the user to regain its original format.

In [110]:
# opens image in new window
Image.open(path1).show()

Open the file in the notebook

In [None]:
from IPython import display

# read the raw bytes and render them inline;
# the base64 encode/decode round trip is not needed
with open(path1, 'rb') as f:
    im_binary = f.read()
display.Image(data=im_binary)