# Tech Talent Dashboard Data Upload

Run this notebook after the data preprocessing notebook has completed successfully.  It uploads the preprocessed files in the `db` folder to their corresponding Google Sheets files.

Tableau Public can refresh automatically from Google Sheets data sources by default.  As the final loading step, we'll connect to the Google Sheets API, find and open the correct file, delete its contents, then insert our new data into the empty file.

*Note: Tableau Public data connections to Google Sheets use the file's unique ID rather than its filename.  Uploading a new copy of a file generates a new ID for that file, which breaks the connection between the dashboard and the data.  This is why the existing files are overwritten in place.*

### This step is not reversible.  Before proceeding with the steps in this notebook, download the existing files in the `mcdc-tech-talent` folder on Google Drive as a zip file and keep it as a backup.

# Google API

We'll use the previously created Google Drive functions to find the correct files to overwrite, and also import a package called `gspread` to perform the Google Sheets operations (this package just makes our lives easier).

These initial steps create a connection object to the [Google Drive API](https://developers.google.com/drive/api/v3/about-sdk) called `service`.  A valid authentication key from Google must be saved as `key.json` in the project's root directory.  This project uses [service accounts](https://cloud.google.com/iam/docs/understanding-service-accounts) to authenticate to both the Google [Drive](https://developers.google.com/drive/api/v3/about-sdk) and [Sheets](https://developers.google.com/sheets/api/reference/rest) APIs.
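Because every later cell depends on `key.json`, it can help to check the file before building any service objects. The following is a minimal sketch, not part of the original notebook: `check_service_account_key` and `REQUIRED_FIELDS` are hypothetical names, and the check assumes the standard JSON layout of a Google service account key (a `type` of `service_account` plus `client_email` and `private_key` fields).

```python
import json
from pathlib import Path

# Fields this sketch expects in a service account key file (an assumption
# based on the standard key layout, not something the notebook checks).
REQUIRED_FIELDS = {"type", "client_email", "private_key"}


def check_service_account_key(path="key.json"):
    """Return the key's client_email, or raise if the file looks wrong."""
    key_path = Path(path)
    if not key_path.is_file():
        raise FileNotFoundError("No key file found at {}".format(key_path.resolve()))

    with key_path.open(encoding="utf-8") as fh:
        info = json.load(fh)

    missing = REQUIRED_FIELDS - info.keys()
    if missing:
        raise ValueError("{} is missing fields: {}".format(key_path, sorted(missing)))
    if info["type"] != "service_account":
        raise ValueError("{} has type {!r}, expected 'service_account'".format(
            key_path, info["type"]))

    return info["client_email"]
```

Calling `check_service_account_key('key.json')` returns the service account's email address. Drive files are generally only visible to a service account once they have been shared with that address, so it is a useful value to have on hand when a file lookup comes back empty.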

In [1]:
import os
from pathlib import Path

import gspread
import pandas as pd
from google.oauth2 import service_account
from googleapiclient.discovery import build


In [2]:
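# Service account key file, expected in the project's root directory
CLIENT_SECRET_FILE = 'key.json'
SCOPES = ['https://www.googleapis.com/auth/drive',
          'https://www.googleapis.com/auth/spreadsheets']

creds = service_account.Credentials.from_service_account_file(CLIENT_SECRET_FILE, scopes=SCOPES)

# Folder names and local paths, relative to the notebook's working directory
DATA_DIR = 'data'
DB_DIR = 'db'

DB_PATH = Path() / DB_DIR
DATA_PATH = Path() / DATA_DIR

# Connection object for the Google Drive API
service = build('drive', 'v3', credentials=creds)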

In [3]:
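def create_service(api_key, api_name, api_version, scope):
    """Create a Google API service object from a service account key

    :param api_key: A string, path to the service account key file
    :param api_name: A string, name of the API (e.g. 'drive')
    :param api_version: A string, version of the API (e.g. 'v3')
    :param scope: A list of strings, the OAuth scopes to request

    :return service: A service object, or None if the connection failed
    """

    creds = service_account.Credentials.from_service_account_file(api_key, scopes=scope)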

    try:
        service = build(api_name, api_version, credentials=creds)
        print(api_name, 'service created successfully')
        return service
    except Exception as e:
        print('Unable to connect.')
        print(e)
        return None


In [4]:
def get_file_id(service, file_name, mime_type=None, parent_id=None):
    """Return the ID of a Google Drive file

    :param service: A Google Drive API service object
    :param file_name: A string, the name of the file
    :param mime_type: A string, optional MIME type of file to search for
    :param parent_id: A string, optional id of a parent folder to search in

    :return file_id: A string, file ID of the first found result
    """

    file_id = None

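    # Build the search query using the documented Drive search syntax;
    # each optional clause starts with ' and ' so clauses never run together
    query = "name = '{}' and trashed = false".format(file_name)

    if parent_id:
        query += " and '{}' in parents".format(parent_id)

    if mime_type:
        query += " and mimeType = '{}'".format(mime_type)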

    try:
        results = service.files().list(
            q=query,
            fields='files(name, id)').execute()

        if not results['files']:
            print('No file named {} found'.format(file_name))
            return None

        if len(results['files']) > 1:
            print('Multiple files found, retrieving first from list')

        file_id = results['files'][0]['id']

    except Exception as e:
        print('An error occurred: {}'.format(e))

    return file_id
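The query string passed to `files().list()` is easiest to reason about when it is built separately from the API call. The sketch below is illustrative only: `build_drive_query` and `escape_drive_value` are hypothetical helpers that are not part of the notebook. It assumes the documented Drive search syntax (`name = '...'`, `'<folder id>' in parents`, `mimeType = '...'`, `trashed = false`) and escapes backslashes and single quotes, which would otherwise break a query for a file name such as `it's`.

```python
def escape_drive_value(value):
    """Escape backslashes and single quotes for use inside a Drive query string."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def build_drive_query(file_name, mime_type=None, parent_id=None):
    """Build a `q` string from the same inputs that get_file_id accepts."""
    clauses = [
        "name = '{}'".format(escape_drive_value(file_name)),
        "trashed = false",
    ]
    if parent_id:
        clauses.append("'{}' in parents".format(escape_drive_value(parent_id)))
    if mime_type:
        clauses.append("mimeType = '{}'".format(escape_drive_value(mime_type)))
    # Joining with ' and ' guarantees whitespace between every clause
    return " and ".join(clauses)


print(build_drive_query("db", mime_type="application/vnd.google-apps.folder"))
# prints: name = 'db' and trashed = false and mimeType = 'application/vnd.google-apps.folder'
```

Collecting the clauses in a list and joining them once avoids having to track spacing by hand as optional filters are appended.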


# Load Google Sheets with final data 

In [5]:
def upload_db_data(secret, db_path):
    """Load processed data files to remote Google Sheets

    :param secret: A string, path to the service account key file
    :param db_path: A Path, local folder containing the processed .xlsx files
    """

    scope = ['https://www.googleapis.com/auth/drive']
    drive_service = create_service(secret, 'drive', 'v3', scope)
    sheet_service = gspread.service_account(secret)

    # Find the remote folder that holds the Google Sheets files
    db_folder_id = get_file_id(drive_service,
                               file_name=DB_DIR,
                               mime_type='application/vnd.google-apps.folder')

    files = [f.name for f in os.scandir(db_path) if f.name.endswith('.xlsx')]

    for file in files:
        print("Loading {} to Google Sheets".format(file))
        try:
            df = pd.read_excel(db_path / file)

            # The Google Sheets file shares the local file's name, minus '.xlsx'
            file_id = get_file_id(drive_service,
                                  Path(file).stem,
                                  parent_id=db_folder_id)

            # Open by ID, clear the existing contents, then write header + rows
            sh = sheet_service.open_by_key(file_id)
            sh.values_clear("'Sheet1'!A:AAA")
            sh.sheet1.update([df.columns.values.tolist()] + df.fillna('').values.tolist())
        except Exception as e:
            print('Upload failed.')
            print(e)
            return None
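The upload loop derives each Google Sheets file name by dropping the `.xlsx` extension from the local file name. Since overwriting is not reversible, a dry run that only lists this mapping can be a useful check before uploading. The sketch below is one possible way to do that; `sheet_names_by_file` is a hypothetical helper, and the demonstration uses a throwaway folder of empty placeholder files rather than the real `db` folder.

```python
import tempfile
from pathlib import Path


def sheet_names_by_file(db_path):
    """Map each .xlsx file in db_path to the Drive file name it would overwrite."""
    # Path.stem drops only the final suffix, so 'a.b.xlsx' becomes 'a.b'
    return {p.name: p.stem for p in sorted(Path(db_path).glob("*.xlsx"))}


# Demonstrate on a temporary folder; non-.xlsx files are ignored.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("bls_msa.detailed.xlsx", "ipeds.awards.xlsx", "notes.txt"):
        (Path(tmp) / name).touch()
    print(sheet_names_by_file(tmp))
# prints: {'bls_msa.detailed.xlsx': 'bls_msa.detailed', 'ipeds.awards.xlsx': 'ipeds.awards'}
```

Running `sheet_names_by_file(DB_PATH)` against the real folder would show which remote files the upload is about to look up, without touching Google Sheets.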

*Occasionally, this step may fail with an HTTP 400 or similar error.  If so, rerun the cell that calls `upload_db_data`.*
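Rerunning by hand can also be automated with a small retry wrapper. The following is a generic sketch rather than part of the notebook: `retry` is a hypothetical helper, and it only helps if the wrapped call raises on failure. `upload_db_data` as written catches exceptions and returns `None`, so it would need to re-raise (or the wrapper would need to be applied to the per-file upload step) for the retry to take effect.

```python
import time


def retry(func, attempts=3, delay=2.0, backoff=2.0, exceptions=(Exception,)):
    """Call func() until it succeeds or attempts run out, sleeping between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            print("Attempt {} failed ({}); retrying in {:.1f}s".format(
                attempt, exc, delay))
            time.sleep(delay)
            delay *= backoff  # wait longer after each failure
```

A call might look like `retry(lambda: upload_one_file(path))`, where `upload_one_file` stands in for whatever function performs a single upload and raises when the API returns an error. Passing a narrower `exceptions` tuple keeps the wrapper from retrying on errors that a rerun cannot fix, such as a missing key file.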

In [6]:
upload_db_data(CLIENT_SECRET_FILE, DB_PATH)

drive service created successfully
Loading bls_msa.detailed.xlsx to Google Sheets
Loading bls_msa.major.xlsx to Google Sheets
Loading bls_msa.total.xlsx to Google Sheets
Loading bls_national.detailed.xlsx to Google Sheets
Loading bls_national.major.xlsx to Google Sheets
Loading bls_national.total.xlsx to Google Sheets
Loading ipeds.awards.xlsx to Google Sheets
Loading ipeds.tech_soc_awards.xlsx to Google Sheets
