[View in Colaboratory](https://colab.research.google.com/github/hlecuanda/jupyter-notebooks-of-all-kinds/blob/master/DriveAPI.ipynb)

# Data extraction from Chase Bank Statements in Google Drive

## Import auxiliary Libraries and software dependencies 

This section checks whether the required libraries are installed in the current notebook environment. If an import raises an exception (for example, because a library is not available), the `except` block installs the libraries and retries the import.

The main auxiliary library is [gspread](http://gspread.readthedocs.io/en/latest/), together with the OAuth2 client library and the discovery-based Google API Python client.


In [0]:
try:
    from oauth2client.service_account import ServiceAccountCredentials
    import gspread
except ImportError:
    # Install the missing dependencies (including oauth2client, which the
    # imports rely on), then retry the imports.
    !pip install oauth2client google-auth google-auth-httplib2 google-api-python-client gspread
    from oauth2client.service_account import ServiceAccountCredentials
    import gspread

# These imports are needed whether or not an install was required.
import re
import httplib2
from googleapiclient import discovery
from google.colab import files
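
The availability check can also be done without triggering the imports themselves. The following is a minimal sketch, not part of the original notebook, that uses the standard library's `importlib.util.find_spec` to list which top-level modules cannot be found. The helper name `missing_modules` and the list of module names are illustrative assumptions. Note that import names can differ from pip package names (for example, `googleapiclient` is installed by `google-api-python-client`).

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level import names used by this notebook (assumed list; adjust as needed).
wanted = ["oauth2client", "gspread", "httplib2", "googleapiclient"]
print("Missing modules:", missing_modules(wanted))
```

If the printed list is empty, the install step can be skipped.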


## Authorization procedure

This is a modified OAuth2 procedure, used when [Domain Wide Delegation](https://developers.google.com/identity/protocols/OAuth2ServiceAccount?hl=en_US#delegatingauthority) has been applied. [The procedure is taken from this Stack Overflow question](https://stackoverflow.com/questions/49374112/google-service-account-cant-impersonate-gsuite-user).

The authorization flow has been simplified using a service account instead of the usual auth flow. This requires:

*   A service account from the project's [API console's Credentials Section](https://console.cloud.google.com/apis/credentials?project=mezaops&organizationId=1063501829322). This will furnish a key file that you need to upload to this notebook.
*   The client defined for the service account above has to be authorized for the required scopes via the [Domain's Admin Console](https://admin.google.com/lecuanda.com/AdminHome?chromeless=1#OGX:ManageOauthClients).


The `try ... except` block checks whether the authorization file containing the service account key has been loaded into the notebook. If it has not been loaded, it will open a dialog for the user to upload the authorization file.

If you don't have the authorization file, please request it from your domain Admin. 

Finally, the `credentials.create_delegated()` method from the OAuth2 client allows you to impersonate the user passed as the parameter to this method.




In [0]:
SERVICE_ACCOUNT_FILE = 'MezaOps-9483d786f5ef.json'

SCOPES = ['https://www.googleapis.com/auth/drive',
          'https://www.googleapis.com/auth/drive.file',
          'https://www.googleapis.com/auth/drive.metadata.readonly']

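try:
    # Only checks that the key file is already present and readable.
    with open(SERVICE_ACCOUNT_FILE, "r"):
        pass
except IOError:
    # The uploaded file must be named as in SERVICE_ACCOUNT_FILE.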
    uploaded = files.upload()

    for fn in uploaded.keys():
        print('User uploaded file "{name}" with length {length} bytes'.format(
                name=fn, length=len(uploaded[fn])))


credentials = ServiceAccountCredentials.from_json_keyfile_name(
        SERVICE_ACCOUNT_FILE, scopes=SCOPES)


delegated_credentials = credentials.create_delegated('hector@lecuanda.com')  
delegated_http = delegated_credentials.authorize(httplib2.Http())
drive  = discovery.build('drive', 'v3', http=delegated_http)
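
Before building credentials, it can help to confirm that the uploaded file really looks like a service account key. The sketch below is an illustration only, under the assumption that a service account key file is a JSON object with at least the `type`, `client_email`, and `private_key` fields and that `type` is `"service_account"`. The helper name `key_file_problem` and the example path are hypothetical.

```python
import json
import os

# Fields assumed to be present in a service account key file.
REQUIRED_KEYS = ("type", "client_email", "private_key")

def key_file_problem(path):
    """Return None if `path` looks like a service account key, else a reason."""
    if not os.path.exists(path):
        return "file not found"
    try:
        with open(path, "r") as f:
            info = json.load(f)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return "not valid JSON"
    if not isinstance(info, dict):
        return "not a JSON object"
    missing = [key for key in REQUIRED_KEYS if key not in info]
    if missing:
        return "missing fields: {}".format(", ".join(missing))
    if info["type"] != "service_account":
        return "type is not 'service_account'"
    return None

# Hypothetical path, used only to show the call.
print(key_file_problem("service-account-key.json"))
```

A check like this only inspects the file's shape; whether the key is authorized for the requested scopes is still decided by Google when the first API call is made.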


## Report Parameters

Fill in this form with the Google Drive URL key of the folder that contains the PDF statements to extract data from. It will update the `docs` variable to contain all documents to be processed.

In [0]:
#@title Directory id { run: "auto", vertical-output: true, output-height: 100, form-width: "50%", display-mode: "form" }
dir_id = "1kghdjPxiHRvQY2sQa1koli2csSGqni8_" #@param {type:"string"}
drivequery = ("'{}' in parents and "
              "mimeType = 'application/vnd.google-apps.document'").format(dir_id)
docs = drive.files().list(q=drivequery).execute()

print("{} documents selected".format(len(docs['files'])))
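
The Drive API returns `files().list()` results in pages (100 files per page by default in v3), so a folder with many statements may need more than one request. The following is a minimal sketch of following `nextPageToken` until all pages have been read. The helper names `build_folder_query` and `list_all_files` are illustrative and not part of the original notebook; `service` is assumed to be a Drive v3 service object like `drive` above.

```python
def build_folder_query(folder_id,
                       mime_type='application/vnd.google-apps.document'):
    """Build a Drive `q` string matching files of one MIME type in a folder."""
    return "'{}' in parents and mimeType = '{}'".format(folder_id, mime_type)

def list_all_files(service, query):
    """Collect every matching file, following nextPageToken until exhausted."""
    found = []
    page_token = None
    while True:
        response = service.files().list(
            q=query,
            fields="nextPageToken, files(id, name)",
            pageToken=page_token,
        ).execute()
        found.extend(response.get('files', []))
        page_token = response.get('nextPageToken')
        if page_token is None:
            break
    return found
```

With these helpers, `list_all_files(drive, build_folder_query(dir_id))` would return a plain list of file entries, which can be passed to the extraction routine in place of `docs['files']`.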


## Aux Routines

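Here we define two helper routines: one derives the statement's year and month from the file name, and the other exports each document as text through the Drive API and extracts the deposit lines.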

In [0]:

def get_yymm(instring):
    ''' Gets the year and month for the statement being processed from the
        file name. We need this because the Chase statement format omits
        the year in the transaction details. Assumes the file name starts
        with YYYYMM.'''
    yy=instring[0:4]
    mm=instring[4:6]
    return "{}-{}".format(yy,mm)

def get_text_from_statements(doclist):
    ''' Iterates through the Drive files, exports the text, and filters it
        through a compiled regular expression, appending all data items to
        a list that gets returned. Also corrects the date on each item by
        adding the year extracted from the file name.'''
    deposits=[]
    # Raw string so that \S and \s are not treated as string escapes.
    regex=re.compile(r'^\S{5}\sDeposit\s\S+$')
    
    for doc in doclist:
        yymm=get_yymm(doc['name'])
        curr_statement=drive.files().export(fileId=doc['id'],mimeType='text/plain').execute()
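        # splitlines() handles both '\r\n' and '\n' line endings.
        for line in curr_statement.decode().splitlines():
            if regex.match(line):
                # Lines start with the date as MM/DD; keep only the day and
                # take the year and month from the file name.
                dy=line[3:5]
                rest=line.split(' ')[2]
                deposits.append("{}-{},{}".format(yymm,dy,rest))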
                        
    return deposits
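
The parsing logic of the two routines can be tried without any Drive access. The sketch below repeats `get_yymm` and applies the same regular expression to a few sample lines. The sample file name and statement lines are invented for illustration and are not taken from a real statement; `parse_deposit_line` is a hypothetical helper name.

```python
import re

# Same pattern as in get_text_from_statements, written as a raw string.
DEPOSIT_RE = re.compile(r'^\S{5}\sDeposit\s\S+$')

def get_yymm(instring):
    """Return 'YYYY-MM' from a name that starts with YYYYMM."""
    return "{}-{}".format(instring[0:4], instring[4:6])

def parse_deposit_line(line, yymm):
    """Return 'YYYY-MM-DD,amount' for a deposit line, or None otherwise."""
    if not DEPOSIT_RE.match(line):
        return None
    day = line[3:5]               # the line starts with MM/DD
    amount = line.split(' ')[2]   # third space-separated field
    return "{}-{},{}".format(yymm, day, amount)

# Hypothetical inputs, for illustration only.
yymm = get_yymm("201803 Statement")   # → "2018-03"
sample_lines = [
    "03/15 Deposit 1250.00",
    "03/16 Card Purchase 12.34",
    "Beginning Balance $100.00",
]
parsed = [parse_deposit_line(line, yymm) for line in sample_lines]
print(parsed)   # → ['2018-03-15,1250.00', None, None]
```

Only lines made of exactly three space-separated fields (a five-character date, the word `Deposit`, and an amount) match the pattern, so any other transaction type is skipped.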

This call extracts the selected data and stores it in the global variable `data`.

In [0]:
data=get_text_from_statements(docs['files'])    

## Dump data on a spreadsheet

This code block creates a Google Spreadsheet, selects a range and then dumps each value from the `data` variable into the newly created spreadsheet

### Todo:


1.   Verify if a spreadsheet already exists
2.   Report formatting and formula addition



In [0]:
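# Note: get_application_default() is a static method, so this call uses the
# environment's application default credentials rather than the delegated
# service account credentials created above.
gc = gspread.authorize(credentials.get_application_default())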

sh = gc.create('Deposits')

# Use the first worksheet of the spreadsheet we just created, instead of
# looking it up again by title (titles are not unique in Drive).
worksheet = sh.sheet1

if worksheet.row_count < len(data):
    extra_rows=len(data)-worksheet.row_count
    worksheet.add_rows(extra_rows+20)

cell_list = worksheet.range("A1:A{}".format(len(data)))

# Sheet rows are 1-based while list indices are 0-based.
for cell in cell_list:
    cell.value = data[cell.row - 1]

worksheet.update_cells(cell_list)
# Go to https://sheets.google.com to see your new spreadsheet.
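
The mapping between sheet rows and list positions is easy to get wrong, because sheet rows start at 1 and Python list indices start at 0. The following sketch shows the mapping on its own, using a small stand-in class instead of real gspread cells. `FakeCell` and `fill_column` are hypothetical names, and the values are invented sample data; the only assumption about gspread cells is that they expose `row` and `value` attributes, as used in the cell above.

```python
class FakeCell:
    """Stand-in for a spreadsheet cell: only the attributes used here."""
    def __init__(self, row, col, value=''):
        self.row = row
        self.col = col
        self.value = value

def fill_column(cell_list, values):
    """Assign values[i] to the cell in row i + 1 (sheet rows are 1-based)."""
    for cell in cell_list:
        cell.value = values[cell.row - 1]
    return cell_list

# Invented sample data in the 'YYYY-MM-DD,amount' format produced earlier.
values = ["2018-03-15,1250.00", "2018-03-20,300.00"]
cells = [FakeCell(row, 1) for row in range(1, len(values) + 1)]
fill_column(cells, values)
print([cell.value for cell in cells])
```

Indexing with `cell.row` directly would skip the first value and fail on the last row, which is why the offset of one is needed.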