<a href="https://colab.research.google.com/github/duncandewhurst/kingfisher_notebooks/blob/main/setup_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

Install Kingfisher Colab and required packages:

In [None]:
%%shell

pip install --upgrade 'ocdskingfishercolab<0.4' pandas psycopg2-binary > pip.log

Import functions:

In [None]:
from ocdskingfishercolab import (
    list_source_ids,
    list_collections,
    set_spreadsheet_name,
    save_dataframe_to_sheet,
    set_search_path)

Load the [ipython-sql](https://pypi.org/project/ipython-sql/) and [data_table](https://colab.research.google.com/notebooks/data_table.ipynb) extensions and configure ipython-sql:

In [None]:
%load_ext sql
%load_ext google.colab.data_table
%config SqlMagic.autopandas = True  # Return Pandas DataFrames instead of regular result sets
%config SqlMagic.displaycon = False  # Don't show connection string after execute
%config SqlMagic.feedback = False  # Don't print number of rows affected by DML

Enter credentials and connect to database:

> **Helpdesk analysts:** See [CRM-6335](https://crm.open-contracting.org/issues/6335).

In [None]:
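import getpass
from urllib.parse import quote

print('Enter your Kingfisher credentials')
user = input('Username:')
password = getpass.getpass('Password:')

# Percent-encode the credentials so that characters such as '@' or '/' don't break the URL
connection_string = 'postgresql://' + quote(user, safe='') + ':' + quote(password, safe='') + '@postgres-readonly.kingfisher.open-contracting.org/ocdskingfisherprocess?sslmode=require'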

%sql $connection_string
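
The connection string is a URL, so characters such as `@` or `/` in a username or password would be read as delimiters unless they are percent-encoded. A minimal sketch of that encoding, assuming a hypothetical helper `build_connection_string` and made-up credentials (neither is part of Kingfisher Colab):

```python
from urllib.parse import quote

HOST = 'postgres-readonly.kingfisher.open-contracting.org'
DATABASE = 'ocdskingfisherprocess'


def build_connection_string(user, password):
    """Hypothetical helper: return a PostgreSQL URL with percent-encoded credentials."""
    return 'postgresql://{}:{}@{}/{}?sslmode=require'.format(
        quote(user, safe=''),      # safe='' also encodes '/'
        quote(password, safe=''),
        HOST,
        DATABASE,
    )


# Made-up credentials: the '@' and '/' in the password are encoded as %40 and %2F
example = build_connection_string('analyst', 'p@ss/word')
print(example)
```

The encoded string can be passed to `%sql` in the same way as `connection_string` above.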

Generate a list of schemas and their selected collections:

In [None]:
%%capture collections

import pandas as pd

# Get a list of schemas that contain the `selected_collections` table

list_schemas = """

SELECT
	schemaname
FROM
	pg_tables
WHERE
	tablename = 'selected_collections'

"""

schemas = %sql {list_schemas}

# Get the selected collections from each schema and store the results in a DataFrame

template = """

SELECT
  '{schema}' as schema_name,
  array_agg(id) as collections
FROM
  {schema}.selected_collections

  """

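frames = []

for schema in schemas['schemaname'].to_list():

  statement = template.format(schema = schema)

  # `%sql` may return None if the query fails, e.g. if the schema is not accessible
  result = %sql {statement}
  if result is not None:
    frames.append(result)

# `DataFrame.append` was removed in pandas 2.0, so combine the results with `pd.concat`
collections_list = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()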


Log errors:

In [None]:
# Some schemas listed in `pg_tables` (and `information_schema.views`) are not accessible. Log those errors and warn the user.

if len(collections.stdout) > 0:
  print('`selected_collections` is not accessible for some schemas. See collections.log for details')
  %store collections.stdout > collections.log

## Choose source, collection and schema

Run this cell and use the 'Filter' button to find the `source_id`:

In [None]:
list_source_ids()

Update the `source_id`, run this cell, and use the 'Filter' button to find the `collection_id`(s):

In [None]:
source_id = 'uk_contracts_finder'

list_collections(source_id)

Update the `collection_ids`, run this cell, and use the 'Filter' button to find the schema:

In [None]:
collection_ids = [1883, 1164]  # list of collection_ids
collection_ids = tuple(collection_ids)  # convert list to tuple for use in sql queries

collections_list = collections_list.astype({'collections': str})

# Match whole IDs only, so that a shorter ID does not match inside a longer one
pattern = r'\b(?:' + '|'.join(str(collection_id) for collection_id in collection_ids) + r')\b'

collections_list[collections_list['collections'].str.contains(pattern)]
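
The filter above searches the stringified `collections` column for any of the chosen IDs. A minimal sketch, using only the standard library and made-up rows, of how plain substring matching compares with matching on word boundaries (the schema names and the string format of the arrays are assumptions for illustration):

```python
import re

collection_ids = (1883, 1164)

# Made-up rows: each string mimics a stringified array of collection IDs
rows = {
    'schema_a': '[1883, 1885]',
    'schema_b': '[11883, 2000]',
    'schema_c': '[1164]',
}

alternatives = '|'.join(str(collection_id) for collection_id in collection_ids)

# Plain alternation matches an ID anywhere, including inside a longer number
substring_pattern = re.compile(alternatives)
# Word boundaries restrict the match to whole numbers
whole_id_pattern = re.compile(r'\b(?:' + alternatives + r')\b')

substring_matches = [name for name, ids in rows.items() if substring_pattern.search(ids)]
whole_id_matches = [name for name, ids in rows.items() if whole_id_pattern.search(ids)]

print(substring_matches)  # includes schema_b, because '11883' contains '1883'
print(whole_id_matches)   # schema_b is excluded
```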

Update the `schema_name` and run the cell:

In [None]:
schema_name = 'view_data_collection_1883_1885'

set_search_path(schema_name)