This notebook establishes a reliable method for transferring batches of data to and from Google Drive and for storing metadata in an accessible format.

John Marangola
11/2/2021

We begin by sketching out how the data should be stored clearly and efficiently.

To standardize on a simple convention, we define an enum for the pieces on the chess board.

In [1]:
import pandas as pd
import numpy as np
from enum import Enum

class ChessPiece(Enum):
    PAWN = 1
    ROOK = 2
    KNIGHT = 3
    KING = 4
    QUEEN = 5
    BISHOP = 6
    EMPTY = 7
     
piece = ChessPiece.PAWN
if piece is ChessPiece.PAWN:
    print("This is a pawn!")
if piece != ChessPiece.KNIGHT:
    print("Not a knight!")




This is a pawn!
Not a knight!


To avoid having to remember ambiguous conventions such as T/F for the color of a square or a piece (which takes time to remember and makes debugging hard), we use a similar standard enum for colors (of pieces and squares).

In [9]:
class Color(Enum):
    ORANGE = 1
    BLUE = 2
    BLACK = 3
    WHITE = 4

piece_1_color = Color.ORANGE
piece_2_color = Color.BLUE
# Put the conditional inside print() so that both branches are printed
print("pieces are opponents" if piece_1_color != piece_2_color else "pieces are allies")

pieces are opponents


Next we need a clear convention for labelling positions on the board. If you are unfamiliar with chess, take a look at this image, which explains so-called "algebraic" notation visually:


In [8]:
import urllib.request
from PIL import Image

# The URL points to a PNG rendering of the SVG, so save it with a .png extension.
urllib.request.urlretrieve(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/SCD_algebraic_notation.svg/1200px-SCD_algebraic_notation.svg.png",
    "SCD_algebraic_notation.png",
)

img = Image.open("SCD_algebraic_notation.png")
img.show()

For simplicity, we define positions as "LN", where L is the letter and N is the number associated with the position, e.g.:

In [7]:
position1 = "e2"
position1_alt = "E2"
position1_alt = position1_alt.lower()
print(f"automatic case conversion works: {position1 == position1_alt}")

position2 = "g1"
print(f"position2 equals position1: {position2 == position1}")

automatic case conversion works: True
position2 equals position1: False
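
As a minimal sketch of how this convention could be enforced, the hypothetical helper `normalize_position` below (it is not part of the original notebook) lower-cases its input and rejects anything that is not a letter a-h followed by a number 1-8:

```python
import re

# Assumed convention: exactly one file letter (a-h) followed by one rank digit (1-8).
_POSITION_RE = re.compile(r"[a-h][1-8]")

def normalize_position(position):
    """Return the lower-case form of a position such as 'E2', or raise ValueError."""
    normalized = position.strip().lower()
    if _POSITION_RE.fullmatch(normalized) is None:
        raise ValueError(f"not a valid board position: {position!r}")
    return normalized

print(normalize_position("E2"))  # e2
```

With a check like this, a reversed form such as "2e" fails loudly instead of being stored silently.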


This appears to be robust. Since the convention in chess is always <letter><number>, we do not need to handle reversed forms such as 2e. Now let's move on to storing the metadata for a single piece. We decided that the following metadata fields should be recorded for each image:
    1. Piece type
    2. Piece color (or lack of)
    3. Position
    4. Color of tile

We can therefore define a function that receives these fields as parameters:

In [10]:

# (Skip type validation for now)
def print_metadata(piece_type, piece_color, position, tile_color):
    print(piece_type.name)
    print(f"piece color: {piece_color.name}")
    print(f"position: {position.lower()}")
    print(f"tile color: {tile_color.name}")

piece_color = Color.ORANGE
piece_type = ChessPiece.ROOK
position = "E5"
tile_color = Color.BLACK

print_metadata(piece_type, piece_color, position, tile_color)


ROOK
piece color: ORANGE
position: e5
tile color: BLACK
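
The cell above skips type validation. One possible way to add it, sketched here under the assumption that a small record type is acceptable, is a frozen dataclass that checks its fields on construction. The name `ImageMetadata` is hypothetical, and the two enums are repeated only so that the block runs on its own:

```python
from dataclasses import dataclass
from enum import Enum

class ChessPiece(Enum):
    PAWN = 1
    ROOK = 2
    KNIGHT = 3
    KING = 4
    QUEEN = 5
    BISHOP = 6
    EMPTY = 7

class Color(Enum):
    ORANGE = 1
    BLUE = 2
    BLACK = 3
    WHITE = 4

@dataclass(frozen=True)
class ImageMetadata:
    """Hypothetical record type that validates field types when it is created."""
    piece_type: ChessPiece
    piece_color: Color
    position: str
    tile_color: Color

    def __post_init__(self):
        if not isinstance(self.piece_type, ChessPiece):
            raise TypeError("piece_type must be a ChessPiece")
        if not isinstance(self.piece_color, Color) or not isinstance(self.tile_color, Color):
            raise TypeError("piece_color and tile_color must be Color members")
        if not isinstance(self.position, str):
            raise TypeError("position must be a string")
        # A frozen dataclass needs object.__setattr__ to normalize a field.
        object.__setattr__(self, "position", self.position.lower())

record = ImageMetadata(ChessPiece.ROOK, Color.ORANGE, "E5", Color.BLACK)
print(record.position)  # e5
```

Passing a plain string such as "ROOK" for `piece_type` raises a `TypeError` here instead of failing later on `.name`.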


As long as the fields are enum members, we can never have a piece other than {ROOK, KING, QUEEN, KNIGHT, ..., BISHOP} or a color outside the allowed set. Everything is therefore in a consistent format when saved, and we save space by writing integers to the CSV instead of strings. For instance:

In [11]:
demo_color = Color.BLACK
# "write" operation:
print(demo_color.value)
# Recover the demo color from the integer it was written as:
print(Color(3))


3
Color.BLACK
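
To make the integer round trip concrete, here is a minimal sketch that writes one record to CSV text as integers and reads it back, using only the standard library. The column names in `FIELDS` and the helpers `encode_row` and `decode_row` are assumptions for illustration, not part of the notebook:

```python
import csv
import io
from enum import Enum

class ChessPiece(Enum):
    PAWN = 1
    ROOK = 2
    KNIGHT = 3
    KING = 4
    QUEEN = 5
    BISHOP = 6
    EMPTY = 7

class Color(Enum):
    ORANGE = 1
    BLUE = 2
    BLACK = 3
    WHITE = 4

FIELDS = ["piece_type", "piece_color", "position", "tile_color"]  # assumed column names

def encode_row(piece_type, piece_color, position, tile_color):
    """Turn enum members into the integers that get written to the CSV."""
    return {
        "piece_type": piece_type.value,
        "piece_color": piece_color.value,
        "position": position.lower(),
        "tile_color": tile_color.value,
    }

def decode_row(row):
    """Rebuild the enum members from the strings read back from the CSV."""
    return (
        ChessPiece(int(row["piece_type"])),
        Color(int(row["piece_color"])),
        row["position"],
        Color(int(row["tile_color"])),
    )

buffer = io.StringIO()  # stands in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(encode_row(ChessPiece.ROOK, Color.ORANGE, "E5", Color.BLACK))

buffer.seek(0)
decoded = [decode_row(row) for row in csv.DictReader(buffer)]
print(decoded[0])
```

An integer outside the enum, for example `Color(9)`, raises a `ValueError` on read, so corrupted rows are caught when the file is loaded.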


Since we need to store the metadata for many images, let's use pandas to organize it efficiently. Here's a simple dataframe with random metadata:

In [12]:
import pandas as pd
import random

# The metadata table can be rescaled to any number of images
number_images = 10

df_temp = pd.DataFrame(
    {
        "Piece Type": [ChessPiece(random.randint(1, 7)).name for i in range(number_images)],
        "Piece Color": [Color(random.choice([1, 2])).name for i in range(number_images)],
        "Position" : [random.choice(list("abcdefgh")) + str(random.randint(1, 8)) for i in range(number_images)],
        "Tile Color" : [Color(random.randint(3, 4)).name for i in range(number_images)]
    }
)
df_temp

,Piece Type,Piece Color,Position,Tile Color
0,KNIGHT,BLUE,a5,WHITE
1,QUEEN,ORANGE,a6,BLACK
2,PAWN,ORANGE,c4,BLACK
3,BISHOP,ORANGE,a4,WHITE
4,KNIGHT,BLUE,c6,BLACK
5,ROOK,ORANGE,h1,WHITE
6,PAWN,BLUE,h1,BLACK
7,ROOK,ORANGE,d1,WHITE
8,KNIGHT,BLUE,a2,WHITE
9,KNIGHT,ORANGE,c8,BLACK


Note that this generator can pair an EMPTY square with a piece color. The data above is purely random, a quick-and-dirty way to get an idea of how the table would look.
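
If consistent sample data is wanted, one option is sketched below. It assumes that an EMPTY square should carry no piece color (`None` is an assumed placeholder), and that BLACK/WHITE tiles follow the standard board layout in which a1 is a dark square. The helpers `tile_color_for` and `random_row` are hypothetical:

```python
import random
from enum import Enum

class ChessPiece(Enum):
    PAWN = 1
    ROOK = 2
    KNIGHT = 3
    KING = 4
    QUEEN = 5
    BISHOP = 6
    EMPTY = 7

class Color(Enum):
    ORANGE = 1
    BLUE = 2
    BLACK = 3
    WHITE = 4

def tile_color_for(position):
    """Derive the tile color from a position, assuming a1 is a dark (BLACK) square."""
    file_index = "abcdefgh".index(position[0]) + 1
    rank = int(position[1])
    return Color.BLACK if (file_index + rank) % 2 == 0 else Color.WHITE

def random_row(rng):
    """Build one random but self-consistent metadata row."""
    piece = rng.choice(list(ChessPiece))
    position = rng.choice("abcdefgh") + str(rng.randint(1, 8))
    # An EMPTY square gets no piece color; None is an assumed placeholder.
    piece_color = None if piece is ChessPiece.EMPTY else rng.choice([Color.ORANGE, Color.BLUE])
    return {
        "Piece Type": piece.name,
        "Piece Color": None if piece_color is None else piece_color.name,
        "Position": position,
        "Tile Color": tile_color_for(position).name,
    }

rng = random.Random(0)  # seeded so the sample is reproducible
rows = [random_row(rng) for _ in range(10)]
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)`.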

Moving on, we should probably add the camera pose as well:

In [13]:
# For now, let's keep it basic
class Camera(Enum):
    BIRDSEYE = 1
    ANGLED = 2
    
df_temp = pd.DataFrame(
    {
        "Piece Type": [ChessPiece(random.randint(1, 7)).name for i in range(number_images)],
        "Piece Color": [Color(random.choice([1, 2])).name for i in range(number_images)],
        "Position" : [random.choice(list("abcdefgh")) + str(random.randint(1, 8)) for i in range(number_images)],
        "Tile Color" : [Color(random.randint(3, 4)).name for i in range(number_images)],
        "Camera": [Camera(random.randint(1, 2)).name for i in range(number_images)]
    }
)
df_temp

,Piece Type,Piece Color,Position,Tile Color,Camera
0,EMPTY,ORANGE,f6,BLACK,BIRDSEYE
1,KING,ORANGE,c1,BLACK,BIRDSEYE
2,KNIGHT,ORANGE,d8,BLACK,ANGLED
3,BISHOP,BLUE,h3,WHITE,ANGLED
4,QUEEN,ORANGE,g4,WHITE,ANGLED
5,ROOK,BLUE,h8,WHITE,BIRDSEYE
6,EMPTY,BLUE,d5,WHITE,BIRDSEYE
7,EMPTY,ORANGE,f2,BLACK,ANGLED
8,KING,BLUE,c2,WHITE,BIRDSEYE
9,QUEEN,ORANGE,e4,BLACK,ANGLED


Looking good. Now let's move on to uploading the metadata to Google Drive.

In [14]:
pip install pydrive

Note: you may need to restart the kernel to use updated packages.


Perform first-time manual developer authentication using OAuth and PyDrive. Make sure localhost:8080/ and localhost:8090/ are enabled as redirect URIs for the OAuth client.

In [16]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

gauth = GoogleAuth()
gauth.LocalWebserverAuth() # Creates local webserver and auto handles authentication.
drive = GoogleDrive(gauth)

file1 = drive.CreateFile({'title': 'Hello6.txt'})  # Create GoogleDriveFile instance with title 'Hello6.txt'.
file1.SetContentString('this is a test!') # Set content of the file from given string.
file1.Upload()

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?client_id=1022961328214-bl4hn8614idt5sdup9996pk8rirkjf33.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&access_type=offline&response_type=code

Authentication successful.


It worked!

Now let's wrap the upload in a reusable function:

In [17]:
def create_file(filename, content=None):
    file1 = drive.CreateFile({'title': filename})  # Create a GoogleDriveFile instance titled <filename>.
    if content is not None:
        file1.SetContentString(content) 
    file1.Upload()
    return file1

# Test function by making an empty test file
create_file("tester.txt")
    


GoogleDriveFile({'title': 'tester.txt', 'kind': 'drive#file', 'id': '1ZASyVcRP7kxjvDDSfC3tqFzvCgZ-C-K6', 'etag': '"MTYzNjA2NDY0NzE1NQ"', 'selfLink': 'https://www.googleapis.com/drive/v2/files/1ZASyVcRP7kxjvDDSfC3tqFzvCgZ-C-K6', 'webContentLink': 'https://drive.google.com/uc?id=1ZASyVcRP7kxjvDDSfC3tqFzvCgZ-C-K6&export=download', 'alternateLink': 'https://drive.google.com/file/d/1ZASyVcRP7kxjvDDSfC3tqFzvCgZ-C-K6/view?usp=drivesdk', 'embedLink': 'https://drive.google.com/file/d/1ZASyVcRP7kxjvDDSfC3tqFzvCgZ-C-K6/preview?usp=drivesdk', 'iconLink': 'https://drive-thirdparty.googleusercontent.com/16/type/text/plain', 'mimeType': 'text/plain', 'labels': {'starred': False, 'hidden': False, 'trashed': False, 'restricted': False, 'viewed': True}, 'copyRequiresWriterPermission': False, 'createdDate': '2021-11-04T22:24:07.155Z', 'modifiedDate': '2021-11-04T22:24:07.155Z', 'modifiedByMeDate': '2021-11-04T22:24:07.155Z', 'lastViewedByMeDate': '2021-11-04T22:24:07.155Z', 'markedViewedByMeDate': '1970-

Get a list of the files in the root directory and their respective ids:

In [18]:
file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for file1 in file_list:
  print('title: %s, id: %s' % (file1['title'], file1['id']))

title: tester.txt, id: 1ZASyVcRP7kxjvDDSfC3tqFzvCgZ-C-K6


In [19]:
""" 
Get the id of a file in the root directory of the drive from its name

Returns: id iff exactly one file with that name is found, else None
"""
def get_id(name):
    file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
    ids = []
    for file1 in file_list:
        if file1["title"] == name:
            ids.append(file1["id"])
    # Only return an id when the name is unambiguous
    if len(ids) == 1:
        return ids[0]
    return None

print(get_id("tester.txt"))
print(get_id("Hello2.txt"))

1ZASyVcRP7kxjvDDSfC3tqFzvCgZ-C-K6
None
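
`get_id` lists every file in the root directory and filters by title in Python. As an alternative sketch, the title filter can be moved into the search string itself, since the Drive API v2 query language used above also accepts a `title = '...'` clause. The helper `build_title_query` is hypothetical, and its escaping rule (a backslash before backslashes and single quotes) is an assumption that should be checked against the Drive documentation:

```python
def build_title_query(title, parent_id="root"):
    """Build a Drive API v2 search string for files named `title` under `parent_id`."""
    # Assumed escaping rule: backslashes and single quotes are backslash-escaped.
    escaped = title.replace("\\", "\\\\").replace("'", "\\'")
    return f"title = '{escaped}' and '{parent_id}' in parents and trashed = false"

query = build_title_query("tester.txt")
print(query)

# Assumed usage with the `drive` object from the cells above:
# matches = drive.ListFile({"q": query}).GetList()
```

This keeps the amount of data transferred small when the root directory contains many files.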


Now, given the filename of the metadata, we can download it:

In [20]:
""" 
Download a file from Google Drive by name if it exists, else do nothing

Return: True -> successful, False -> file not found (or name not unique)
"""
def download(filename):
    _id = get_id(filename)
    if _id is None: 
        return False
    temp = drive.CreateFile({'id':_id})
    temp.GetContentFile(filename)
    return True
    
download("tester.txt")

True

This clearly works. Now let's try it for our CSV.

First, let's upload the CSV to a folder called test_dataset, laid out like this:

--test_dataset
    |__ BIRDSEYE
        |__ metadata.csv

In [21]:
# demo: add nested folders
base = drive.CreateFile({'title':"base", 'mimeType':"application/vnd.google-apps.folder"})
base.Upload()
print(base['id'])
# Create "sub" inside "base" by listing base's id as its parent
sub = drive.CreateFile({
    'title': "sub",
    'parents': [{'id': base['id']}],
    'mimeType': "application/vnd.google-apps.folder",
})
sub.Upload()
    

1HB9Ns5e2XLOpnHESNZQrsN9q0e3JQDxu


In [22]:
"""
Create a folder named <name> in the root of the drive

Return: id of the new folder, or False if an item with that name already exists in root
"""
def create_root_folder(name):
    for file in drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList():
        if file['title'] == name:
            return False
    root_folder = drive.CreateFile({'title':name, 'mimeType':"application/vnd.google-apps.folder"})
    root_folder.Upload()
    return root_folder['id']
create_root_folder("root")



'1qCcT3j8J_k2JicRx80GrYi-1TN5afMaG'

In [242]:
"""
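Add sub_folder inside root_dir, provided root_dir names exactly one item in the root of the drive

Returns: id of the new subfolder (for directory chaining), or False if root_dir is missing or duplicated, or if sub_folder already exists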
"""
def upload_subfolder(root_dir, sub_folder):
    id_temp = get_id(root_dir)
    if id_temp is None: 
        return False  # i.e. duplicate folders or folder not found
    # check to make sure sub-directory does not exist yet:
    for file in drive.ListFile({'q': f"'{id_temp}' in parents and trashed=false"}).GetList():
        if file['title'] == sub_folder:
            return False
    sub_dir = drive.CreateFile({'title':sub_folder,"parents":[{'id':id_temp}],'mimeType':"application/vnd.google-apps.folder"})
    sub_dir.Upload()
    return sub_dir['id']


'1G4U5DhFkKiJM8dpRROc_5QfjXmN5uuYE'

Putting it all together, we can create a function that takes the file id of any directory and creates a subdirectory named sub_dir inside it:

In [24]:
"""
Create a subfolder <sub_dir> inside the parent folder with id parent_id

Return: id of the new subfolder, or False if it already exists
"""
def add_sub_directory(parent_id, sub_dir):
    # check to make sure sub-directory does not exist yet:
    for file in drive.ListFile({'q': f"'{parent_id}' in parents and trashed=false"}).GetList():
        if file['title'] == sub_dir:
            return False
    # Use a separate name so the sub_dir argument is not shadowed by the file object
    folder = drive.CreateFile({'title':sub_dir,"parents":[{'id':parent_id}],'mimeType':"application/vnd.google-apps.folder"})
    folder.Upload()
    return folder['id']
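
The chaining idea can be sketched without touching Drive at all. The hypothetical `ensure_folder_path` below walks a path such as test_dataset/BIRDSEYE and creates only the folders that are missing, returning the existing id rather than False when a folder is already there. The `find_child` and `create_child` callables, and the in-memory dictionary standing in for Drive, are assumptions made so that the example runs without credentials:

```python
import itertools

def ensure_folder_path(path, find_child, create_child, root_id="root"):
    """Walk `path` (e.g. 'a/b'), creating missing folders; return the last folder's id."""
    parent_id = root_id
    for name in filter(None, path.split("/")):
        child_id = find_child(parent_id, name)
        if child_id is None:
            child_id = create_child(parent_id, name)
        parent_id = child_id
    return parent_id

# In-memory stand-in for Drive: maps (parent_id, title) -> folder id.
fake_drive = {}
_next_id = itertools.count(1)

def find_child(parent_id, name):
    """Return the id of the child folder, or None if it does not exist."""
    return fake_drive.get((parent_id, name))

def create_child(parent_id, name):
    """Create the child folder and return its new id."""
    fake_drive[(parent_id, name)] = f"id{next(_next_id)}"
    return fake_drive[(parent_id, name)]

first = ensure_folder_path("test_dataset/BIRDSEYE", find_child, create_child)
second = ensure_folder_path("test_dataset/BIRDSEYE", find_child, create_child)
print(first, second, len(fake_drive))
```

Calling it twice yields the same id and creates no duplicates. With real Drive access, the two callables could wrap `drive.ListFile` and `drive.CreateFile` in the same way as the functions above.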


Let's try to create the directory tree shown above now:

HttpError: <HttpError 404 when requesting https://www.googleapis.com/drive/v2/files?q=%27False%27+in+parents+and+trashed%3Dfalse&maxResults=1000&alt=json returned "File not found:". Details: "File not found: ">
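
The request URL in this error is worth decoding. As the sketch below shows, its `q` parameter reads `'False' in parents and trashed=false`. This suggests (an interpretation, since the failing cell is not shown) that one of the helpers returned False, for example because the folder already existed, and that this value was then used as a parent id. The guard `require_id` is a hypothetical helper for catching this earlier:

```python
from urllib.parse import urlparse, parse_qs

# URL copied from the HttpError message above.
failing_url = (
    "https://www.googleapis.com/drive/v2/files"
    "?q=%27False%27+in+parents+and+trashed%3Dfalse&maxResults=1000&alt=json"
)
query = parse_qs(urlparse(failing_url).query)["q"][0]
print(query)  # 'False' in parents and trashed=false

def require_id(value, what="folder"):
    """Hypothetical guard: reject False, None, or '' before using a value as a Drive id."""
    if not isinstance(value, str) or not value:
        raise ValueError(f"expected a Drive id for {what}, got {value!r}")
    return value
```

Wrapping a parent id in `require_id(...)` before passing it to `add_sub_directory` would turn the 404 into an immediate, readable `ValueError`.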