## Processing Data

Before loading the data into the table, I need to process it. These are the columns in the sheet, based on the questions from the form.

|Column Name|
|-----------|
|Timestamp|
|Email Address|
|First Name|
|Last Name|
|Why you want to learn Python?|
|Current Status|
|If experienced, what is your current role?|

I would like the target data in the following form so that I can load it into databases such as DynamoDB or MongoDB.

|Column Name|Data Type|
|-----------|---------|
|email_id|string|
|first_name|string|
|last_name|string|
|forms|dict|

**forms** will contain a dict with the sheet id, the sheet title, and the time the user last updated the response.

|Column Name|Data Type|
|-----------|---------|
|id|string|
|title|string|
|submitted_ts|string or timestamp|
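
To make the target shape concrete, here is a minimal sketch that maps one raw sheet row onto the record described above, without pandas. The helper name `to_target_record`, the sheet id, the title, and the sample values are all hypothetical and only illustrate the layout.

```python
def to_target_record(row, sheet_id, sheet_title):
    """Map one raw sheet row (a dict keyed by form column names) to the target record."""
    return {
        'email_id': row['Email Address'],
        'first_name': row['First Name'],
        'last_name': row['Last Name'],
        'forms': {
            'id': sheet_id,
            'title': sheet_title,
            'submitted_ts': row['Timestamp'],
        },
    }


# Hypothetical sample row; real rows come from the Google Sheet.
sample_row = {
    'Timestamp': '1/5/2021 9:03:07',
    'Email Address': 'jane@example.com',
    'First Name': 'Jane',
    'Last Name': 'Doe',
}
record = to_target_record(sample_row, 'sheet-123', 'Python Course Signup')
print(record)
```

The three survey-answer columns are left out on purpose, since the target table does not include them.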


In [None]:
import pickle
import os.path
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

In [None]:
def get_credentials():
    SCOPES = ['https://www.googleapis.com/auth/spreadsheets.readonly']
    creds = None
    # The file token.pickle stores the user's access and refresh tokens, and is
    # created automatically when the authorization flow completes for the first
    # time.
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            creds = pickle.load(token)

    # If there are no (valid) credentials available, let the user log in.
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', SCOPES)
            creds = flow.run_local_server(port=0)
        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(creds, token)
            
    return creds

In [None]:
def get_sheet_name_and_id(service, spreadsheetId):
    sheet = service.spreadsheets()
    sheet_metadata = sheet.get(spreadsheetId=spreadsheetId).execute()
    return {
        'id': spreadsheetId,
        'title': sheet_metadata['properties']['title']
    }

In [None]:
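def get_sheet_data(service, spreadsheet_id, spreadsheet_range):
    sheet = service.spreadsheets()
    sheet_values = sheet.values()
    sheet_details = sheet_values.get(spreadsheetId=spreadsheet_id,
                                     range=spreadsheet_range).execute()
    # 'values' is absent when the range is empty, so default to an empty list.
    values = sheet_details.get('values', [])
    if not values:
        return [], []
    # The first row holds the column names; the remaining rows are responses.
    return values[0], values[1:]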

In [None]:
SPREADSHEET_ID = '1lgyVuw6nVyRnmKtCPbXF4kYcop5HMJ8H3eeNsArAlVk'

In [None]:
RANGE_NAME = 'Form Responses 1!A1:G'

In [None]:
creds = get_credentials()

In [None]:
service = build('sheets', 'v4', credentials=creds)

In [None]:
sheet_metadata = get_sheet_name_and_id(service, SPREADSHEET_ID)

In [None]:
sheet_metadata

In [None]:
sheet_columns, sheet_rows = get_sheet_data(service, SPREADSHEET_ID, RANGE_NAME)

In [None]:
for column in sheet_columns:
    print(column)

In [None]:
for row in sheet_rows[:3]:
    print(row)

In [None]:
import pandas as pd

sheet_df = pd.DataFrame(sheet_rows, columns=sheet_columns)
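
The Sheets API leaves out trailing empty cells, so a response row can be shorter than the header row when the last questions are left blank. The sketch below pads such rows to the header width before they are handed to pandas. `pad_rows` is a hypothetical helper, not part of the notebook or of any library, and the sample rows are made up.

```python
def pad_rows(rows, width, fill=''):
    """Return copies of rows, right-padded with `fill` up to `width` cells."""
    return [row + [fill] * (width - len(row)) for row in rows]


# Hypothetical header and rows; the second row is missing its trailing cells.
columns = ['Timestamp', 'Email Address', 'First Name', 'Last Name']
rows = [
    ['1/5/2021 9:03:07', 'jane@example.com', 'Jane', 'Doe'],
    ['1/6/2021 10:15:00', 'sam@example.com'],
]
padded = pad_rows(rows, len(columns))
print(padded[1])
```

The padded rows can then be passed to `pd.DataFrame(padded, columns=columns)` so that every row matches the header length.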

In [None]:
sheet_df.columns[4:]

In [None]:
sheet_df = sheet_df.drop(sheet_df.columns[4:], axis=1)

In [None]:
import json

In [None]:
sheet_df['forms'] = sheet_df.apply(
    lambda rec: {
        'id': sheet_metadata['id'],
        'title': sheet_metadata['title'],
        'submitted_ts': rec['Timestamp']
    },
    axis=1
)

In [None]:
sheet_df['forms']

In [None]:
sheet_df = sheet_df.drop('Timestamp', axis=1)

In [None]:
sheet_df.columns = ['email_id', 'first_name', 'last_name', 'forms']

In [None]:
sheet_df

In [None]:
emails_list = sheet_df.to_dict('records')
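
If the same person submits the form more than once, the list of records will hold several entries with the same `email_id`. Assuming the target store should keep one record per email, a sketch like the following keeps only the most recent submission. The function `keep_latest`, the timestamp format, and the sample records are assumptions; the format in particular depends on the sheet's locale.

```python
from datetime import datetime

# Assumed timestamp layout (month/day/year); adjust it to the sheet's locale.
TS_FORMAT = '%m/%d/%Y %H:%M:%S'


def keep_latest(records):
    """Keep one record per email_id, choosing the most recent submitted_ts."""
    latest = {}
    for rec in records:
        ts = datetime.strptime(rec['forms']['submitted_ts'], TS_FORMAT)
        key = rec['email_id']
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, rec)
    return [rec for _, rec in latest.values()]


# Hypothetical records in the same shape as the output of to_dict('records').
records = [
    {'email_id': 'jane@example.com', 'forms': {'submitted_ts': '1/5/2021 9:03:07'}},
    {'email_id': 'jane@example.com', 'forms': {'submitted_ts': '1/7/2021 8:00:00'}},
    {'email_id': 'sam@example.com', 'forms': {'submitted_ts': '1/6/2021 10:15:00'}},
]
deduped = keep_latest(records)
print(len(deduped))
```

Parsing the timestamps first matters because comparing them as plain strings would not sort dates correctly.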

In [None]:
sheet_df[sheet_df.email_id.str.startswith('anil')]

In [None]:
emails_list[:3]
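
Because each record is a plain dict of strings with one nested dict, it serializes to JSON without a custom encoder, which is a convenient intermediate form before loading into a document store. A minimal sketch, using a hypothetical record in place of the real data:

```python
import json

# Hypothetical record in the same shape as the entries of emails_list.
sample_records = [
    {
        'email_id': 'jane@example.com',
        'first_name': 'Jane',
        'last_name': 'Doe',
        'forms': {
            'id': 'sheet-123',
            'title': 'Python Course Signup',
            'submitted_ts': '1/5/2021 9:03:07',
        },
    },
]

# Serialize to JSON text and read it back to confirm nothing is lost.
payload = json.dumps(sample_records, indent=2)
restored = json.loads(payload)
print(payload)
```

If `submitted_ts` were later stored as a real timestamp instead of a string, it would need converting to a string again before `json.dumps` could handle it.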