Goal 1: Make sure the tracking sheet and QC sheet are in sync with each other
* Both list the same datasets

Goal 2: Make sure the tracking sheet and QC sheet are in sync with the backoffice
* Flag datasets on the backoffice that don't correspond to anything in the tracking or QC sheets

Goal 3: Make sure we have layers for all datasets we have metadata for
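
All three goals come down to comparing collections of dataset ids. A minimal sketch of that comparison, assuming the ids are already loaded into plain Python iterables; the function name `compare_ids` and the sample ids are hypothetical and not taken from the sheets:

```python
def compare_ids(left, right):
    """Return the ids found only in `left`, only in `right`, and in both."""
    left, right = set(left), set(right)
    return {
        "only_left": sorted(left - right),
        "only_right": sorted(right - left),
        "both": sorted(left & right),
    }

# Hypothetical ids, for illustration only
tracking_ids = ["bio.001", "for.005", "soc.064"]
qc_ids = ["bio.001", "soc.064", "wat.002"]

result = compare_ids(tracking_ids, qc_ids)
print("only in tracking:", result["only_left"])
print("only in qc:", result["only_right"])
```

Two sources are in sync when both "only" lists are empty.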

In [215]:
import pandas as pd
import requests as req
import os

pd.options.display.max_columns = 500
pd.options.display.max_rows = 500

Pull down layer info from backoffice

In [301]:
url = "https://api.resourcewatch.org/v1/layer"

# page[size] tells the API the maximum number of results to send back.
# 1000 is well above the number of layers currently on the RW API.
payload = { "application":"rw", "page[size]": 1000}

# Request all layers, and group their ids by the dataset they belong to
res = req.get(url, params=payload)
configs = {}
count = 0
for ds in res.json()["data"]:
    try:
        if ds['attributes']['dataset'] in configs:
            configs[ds['attributes']['dataset']].append(ds['id'])
        else:
            configs[ds['attributes']['dataset']] = [ds['id']]
        count += 1
    except (KeyError, TypeError):
        # Skip layers that do not reference a dataset
        pass
layer_ids_by_ds = pd.DataFrame.from_dict(configs, orient='index')
print('Total number of layers in api:', count)
print('Total number of unique datasets in layers in api:', len(configs))
print()

print(layer_ids_by_ds.shape[0], 'datasets w/ layers on the api')
print(layer_ids_by_ds.head(1))
print()

configs = {}
for ds in res.json()["data"]:
    try:
        configs[ds['attributes']['dataset']] = ds['attributes']['layerConfig']['body']['layers'][0]['options']['sql']
    except (KeyError, IndexError, TypeError):
        # Skip layers whose layerConfig carries no SQL statement
        pass
    
api_sql = pd.DataFrame.from_dict(configs, orient='index')
api_sql.columns = ['SQL']
print(api_sql.shape[0],'layers with SQL on the api')
print(api_sql.head(1))
print()

Total number of layers in api: 449
Total number of unique datasets in layers in api: 260

260 datasets w/ layers on the api
                                                                        0   \
8ab2606d-1b1c-4b71-ab29-c6e0a687e9fd  255cf15f-39c1-46d2-ba82-59540133e4d1   

                                        1     2     3     4     5     6   \
8ab2606d-1b1c-4b71-ab29-c6e0a687e9fd  None  None  None  None  None  None   

                                        7     8     9     10    11    12  \
8ab2606d-1b1c-4b71-ab29-c6e0a687e9fd  None  None  None  None  None  None   

                                        13    14    15  
8ab2606d-1b1c-4b71-ab29-c6e0a687e9fd  None  None  None  

194 layers with SQL on the api
                                                                 SQL
8ab2606d-1b1c-4b71-ab29-c6e0a687e9fd  SELECT * FROM global_mangroves
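
The grouping of layer ids by dataset done in the cell above can also be written with `collections.defaultdict`. This is only a sketch: the sample records are hypothetical and merely mimic the `id` / `attributes.dataset` shape the cell reads from the response, and `group_layers_by_dataset` is an illustrative name.

```python
from collections import defaultdict

# Hypothetical records shaped like entries of the "data" list in the layer response
sample_layers = [
    {"id": "layer-a", "attributes": {"dataset": "ds-1"}},
    {"id": "layer-b", "attributes": {"dataset": "ds-1"}},
    {"id": "layer-c", "attributes": {"dataset": "ds-2"}},
    {"id": "layer-d", "attributes": {}},  # no dataset key, so it is skipped
]

def group_layers_by_dataset(layers):
    """Map each dataset id to the list of layer ids that reference it."""
    grouped = defaultdict(list)
    for layer in layers:
        dataset_id = layer.get("attributes", {}).get("dataset")
        if dataset_id is not None:
            grouped[dataset_id].append(layer["id"])
    return dict(grouped)

layers_by_ds = group_layers_by_dataset(sample_layers)
print(len(layers_by_ds), "datasets with layers")
```

Using `.get` instead of a try/except makes it explicit which missing key causes a record to be skipped.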



Pull down dataset ids from backoffice

In [248]:
url = "https://api.resourcewatch.org/v1/dataset"

# page[size] tells the API the maximum number of results to send back
# There are currently between 200 and 300 datasets on the RW API
payload = { "application":"rw", "page[size]": 1000}

# Request all datasets, and extract the data from the response
res = req.get(url, params=payload)
# Keep only the ids, as the index of a frame with no columns; later joins attach columns to it
api_datasets = pd.DataFrame(pd.DataFrame(res.json()["data"])['id']).set_index('id')
print(api_datasets.shape[0], 'datasets on the api')
print(api_datasets.head(1))
print()

291 datasets on the api
Empty DataFrame
Columns: []
Index: [098b33df-6871-4e53-a5ff-b56a7d989f9a]



Fetching data from Google Spreadsheets

In [302]:
# Tracking Sheet
!curl "https://docs.google.com/spreadsheets/d/1viPOGYIk6RGu7YMoM3BHNVbkWaCZ0JFBOMSNncWvHYk/export?format=tsv" > tracking_sheet.tsv
tracking_sheet = pd.read_csv("tracking_sheet.tsv", sep="\t", index_col=[0])
os.remove("tracking_sheet.tsv")

# Keep only the tracking rows that have an API_ID, and index them by it
ids_on_tracking = pd.notnull(tracking_sheet["API_ID"])
tracking = tracking_sheet.loc[ids_on_tracking]
tracking = tracking.reset_index().set_index("API_ID")

# QC Ready Metadata
!curl "https://docs.google.com/spreadsheets/d/1UkABgMlBIinJjITa6WepFAL-8VBkulS0LCbKojRXjVY/export?format=tsv" > metadata_sheet.tsv
metadata_sheet = pd.read_csv("metadata_sheet.tsv", sep="\t", index_col=[0])
os.remove("metadata_sheet.tsv")

# Keep only the metadata rows that have a final id
ids_on_metadata = pd.notnull(metadata_sheet["final_ids"])
metadata = metadata_sheet.loc[ids_on_metadata]

# Index by final_ids so the metadata lines up with the tracking sheet's API_ID index
metadata = metadata.reset_index().set_index("final_ids")

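
The load-filter-reindex pattern used for both sheets can be tried without network access. A minimal sketch, assuming a tab-separated export with an `API_ID` column; the in-memory TSV text and its values are hypothetical stand-ins for the downloaded file:

```python
import io
import pandas as pd

# Hypothetical TSV content standing in for the exported sheet
tsv_text = "row\tWRI_ID\tAPI_ID\n0\tbio.001\tid-1\n1\tfor.005\t\n2\tsoc.064\tid-3\n"

sheet = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=[0])

# Keep rows that have an API_ID, then index the result by it
with_ids = sheet.loc[sheet["API_ID"].notnull()]
with_ids = with_ids.reset_index().set_index("API_ID")
print(with_ids["WRI_ID"].to_dict())
```

The row with an empty `API_ID` is read as NaN and dropped by the filter, which is what happens to sheet rows that have not yet been assigned an id.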


In [251]:
tracking.head(1)

API_ID,WRI_ID,Perfect Dataset? C = extra QC needed,Published on RW,Public Title,Technical Title,Subtitle,Theme_1,Theme_2,Theme_3,Format,Time Series,Subscription (Filter options),End of Year Priority,signals check,Fancy Datasets & Visualization Needs,Shared API - Do Not Touch These!,From WRI Platform (we need viz to get/replicate them),Problem Solving,Metadata Completed,Download from Source,table_name on server,"Server Location (and account wri-rw, insights, or wri-01)",Uploaded to S3,Download Data (S3),Dataset Processed for Upload,Distribution Restriction,Data Upload Responsibility,Alias defined on Backoffice,Layer Definition/Description/Name,Editable Widget (Chart/Map),Tags,Category
5b5a21ac-0835-43fb-86b9-64b93d472e10,bio.001,X,X,Endangered Species Sites,Alliance for Zero Extinction Sites (AZE),AZE,Biodiversity,,,Vector,,,,,,,,,X,http://www.biodiversitya-z.org/content/allianc...,bio_001_aze_endangered_species_sites,Carto - wri-rw,X,https://wri-public-data.s3.amazonaws.com/resou...,omitted spreadsheet forward from AZE along wit...,,Peter,,X,Map,,BIO


Compare backoffice and Google Spreadsheet data

In [93]:
metadata.head(1)

final_ids,Unique ID,Learn More Link,Download from Source,Download Data (S3),Distribution Restriction,Shared API - Do Not Touch These!,Public Title,Technical Title,Subtitle,Source Organizations,...,Original Data Name 1,Original Data Link 1,Original Data Name 2,Original Data Link 2,Original Data Name 3,Original Data Link 3,Original Data Name 4,Original Data Link 4,Unnamed: 37,API_ID
5b5a21ac-0835-43fb-86b9-64b93d472e10,bio.001,http://www.biodiversitya-z.org/content/allianc...,http://www.biodiversitya-z.org/content/allianc...,,"X emailed that it is okay, but affiliated with...",,Endangered Species Sites,Alliance for Zero Extinction Sites (AZE),AZE,Alliance for Zero Extinction (AZE),...,,,,,,,,,,5b5a21ac-0835-43fb-86b9-64b93d472e10


Ensure metadata and tracking sheets agree

In [303]:
m = pd.DataFrame(metadata['Unique ID'])
t = pd.DataFrame(tracking['WRI_ID'])
print('m')
print(m.head(1))
print(m.shape)
print()
print('t')
print(t.head(1))
print(t.shape)
print()

print('These should both be empty if the metadata and tracking sheets all share the same final_ids')
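# Compare the WRI ids listed in each sheet (avoid shadowing the built-in `id`)
t_not_in_m = [wri_id for wri_id in t['WRI_ID'] if wri_id not in m['Unique ID'].values]
m_not_in_t = [wri_id for wri_id in m['Unique ID'] if wri_id not in t['WRI_ID'].values]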
print('t not in m:', t_not_in_m)
print()
print('m not in t:', m_not_in_t)
print()

mgmt = pd.merge(m, t, left_index=True, right_index=True)
print('merged t to m')
print(mgmt.head(1))
print(mgmt.shape)
print()
mgtm = pd.merge(t, m, left_index=True, right_index=True)
print('merged m to t')
print(mgtm.head(1))
print(mgtm.shape)
print()

jtm = t.join(m, how='left')
print('joined m to t')
print(jtm.head(1))
print(jtm.shape)
print()
jmt = m.join(t, how='left')
print('joined t to m')
print(jmt.head(1))
print(jmt.shape)
print()

print('api response')
print(layer_ids_by_ds.head(1))
print(layer_ids_by_ds.shape)
print()

m
                                     Unique ID
final_ids                                     
5b5a21ac-0835-43fb-86b9-64b93d472e10   bio.001
(220, 1)

t
                                       WRI_ID
API_ID                                       
5b5a21ac-0835-43fb-86b9-64b93d472e10  bio.001
(220, 1)

These should both be empty if the metadata and tracking sheets all share the same final_ids
t not in m: []

m not in t: []

merged t to m
                                     Unique ID   WRI_ID
5b5a21ac-0835-43fb-86b9-64b93d472e10   bio.001  bio.001
(220, 2)

merged m to t
                                       WRI_ID Unique ID
5b5a21ac-0835-43fb-86b9-64b93d472e10  bio.001   bio.001
(220, 2)

joined m to t
                                       WRI_ID Unique ID
API_ID                                                 
5b5a21ac-0835-43fb-86b9-64b93d472e10  bio.001   bio.001
(220, 2)

joined t to m
                                     Unique ID   WRI_ID
final_ids                              

In [307]:
print('These data sets have a final id in the tracking sheet, but not the metadata sheet')
jtm[pd.isnull(jtm['Unique ID'])]['WRI_ID']

These data sets have a final id in the tracking sheet, but not the metadata sheet


Series([], Name: WRI_ID, dtype: object)
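
The same agreement check can be done in one pass with an outer merge and pandas' `indicator=True` option, which adds a `_merge` column saying which side each row came from. This is a sketch on hypothetical stand-in frames (`t_demo`, `m_demo`), not on the real sheets:

```python
import pandas as pd

# Hypothetical stand-ins for the tracking and metadata frames, indexed by API id
t_demo = pd.DataFrame({"WRI_ID": ["bio.001", "for.005"]}, index=["id-1", "id-2"])
m_demo = pd.DataFrame({"Unique ID": ["bio.001", "soc.064"]}, index=["id-1", "id-3"])

# An outer merge keeps every id; `_merge` records which side each row came from
both = pd.merge(t_demo, m_demo, left_index=True, right_index=True,
                how="outer", indicator=True)

only_tracking = both[both["_merge"] == "left_only"].index.tolist()
only_metadata = both[both["_merge"] == "right_only"].index.tolist()
print("only in tracking:", only_tracking)
print("only in metadata:", only_metadata)
```

When the two sheets agree, both lists are empty and every row is marked `both`.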

Compare between metadata/tracking and backoffice

In [304]:
datasets_missing_layers = [ix for ix in api_datasets.index if ix not in layer_ids_by_ds.index]
layers_missing_datasets = [ix for ix in layer_ids_by_ds.index if ix not in api_datasets.index]
print('Number datasets missing layers:', len(datasets_missing_layers))
print('Number layers missing datasets:', len(layers_missing_datasets))
print()

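# reindex (rather than .loc) tolerates ids that are absent from the metadata sheet
# and fills their rows with NaN
dml_ids = metadata.reindex(datasets_missing_layers)['Unique ID']
dml_with_mdata_ids = dml_ids.dropna()
lmd_ids = metadata.reindex(layers_missing_datasets)['Unique ID']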

print('datasets on api without layers:', dml_ids.shape[0])
for ds in dml_ids.index:
    print('https://api.resourcewatch.org/v1/dataset/{}'.format(ds))
    
print('datasets on api without layers that have metadata:', dml_with_mdata_ids.shape[0])
for ds in dml_with_mdata_ids.index:
    print('https://api.resourcewatch.org/v1/dataset/{}'.format(ds))

print('layers on api without datasets:', lmd_ids.shape[0])
for ly in lmd_ids.index:
    print('https://api.resourcewatch.org/v1/layer/{}'.format(layer_ids_by_ds.loc[ly,0]))

Number datasets missing layers: 67
Number layers missing datasets: 36

datasets on api without layers: 67
https://api.resourcewatch.org/v1/dataset/2b569ae2-9452-44f8-9c81-3c3afb6c3c25
https://api.resourcewatch.org/v1/dataset/bb80312e-b514-48ad-9252-336408603591
https://api.resourcewatch.org/v1/dataset/5edefab9-c707-447e-96f9-6115149e3a87
https://api.resourcewatch.org/v1/dataset/86777822-d995-49cd-b9c3-d4ea4f82c0a3
https://api.resourcewatch.org/v1/dataset/ac750026-e66d-4403-b607-efcf01189d21
https://api.resourcewatch.org/v1/dataset/ec8359f3-4b35-4212-9482-3e34c55a3ef5
https://api.resourcewatch.org/v1/dataset/b9ae853a-8e80-4c99-8bc8-30973b029710
https://api.resourcewatch.org/v1/dataset/bd3ad33a-886b-4456-a032-119b9ac064de
https://api.resourcewatch.org/v1/dataset/90a76816-cb42-49e0-a736-9c3d82cb0d59
https://api.resourcewatch.org/v1/dataset/b4dbb3a5-654f-4f36-aa32-c28f7406d6f4
https://api.resourcewatch.org/v1/dataset/3b7c3e8f-f7c0-466f-8f7c-01d73afc0988
https://api.resourcewatch.org/v1/dat
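
The two list comprehensions in the cell above compute index differences, which pandas exposes directly as `Index.difference`. A minimal sketch on hypothetical ids, which also shows how `reindex` handles ids that have no metadata row:

```python
import pandas as pd

# Hypothetical ids standing in for api_datasets.index and layer_ids_by_ds.index
dataset_index = pd.Index(["ds-0", "ds-1", "ds-2"])
layer_dataset_index = pd.Index(["ds-2", "ds-4"])

# Index.difference returns the labels present in one index but not the other
missing_layers = dataset_index.difference(layer_dataset_index)
missing_datasets = layer_dataset_index.difference(dataset_index)

# reindex tolerates labels absent from the frame and fills them with NaN
meta_demo = pd.DataFrame({"Unique ID": ["soc.064"]}, index=["ds-1"])
looked_up = meta_demo.reindex(missing_layers)["Unique ID"]
with_metadata = looked_up.dropna().index.tolist()
print(list(missing_layers), list(missing_datasets), with_metadata)
```

Here `ds-0` has no metadata row, so it survives in `looked_up` as NaN and is removed by `dropna()`.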

In [305]:
# https://stackoverflow.com/questions/22676081/pandas-the-difference-between-join-and-merge
# A left join keeps every row of the calling frame and attaches matching columns from the other frame

dft = pd.DataFrame(tracking['WRI_ID'])
dfm = pd.DataFrame(metadata['Unique ID'])

table_joined_t = layer_ids_by_ds.join(dft, how='left')
table_joined_m = layer_ids_by_ds.join(dfm, how='left')

#table_merged = layer_ids_by_ds.merge(df, left_index=True, right_index=True)
#pd_table_merged = pd.merge(layer_ids_by_ds, df, left_index=True, right_index=True)

print(table_joined_t.head(1))
print(table_joined_m.head(1))

#print(table_merged.head(1))
#print(pd_table_merged.head(1))

# Layers whose dataset id has no match in the tracking or metadata sheet
layers_without_wriid = layer_ids_by_ds[pd.isnull(table_joined_t['WRI_ID'])]
layers_without_uniqueid = layer_ids_by_ds[pd.isnull(table_joined_m['Unique ID'])]

print()
print('These data sets appear in layers, but are not in the tracking sheet:', layers_without_wriid.shape[0])
print('They may be old data sets that were deleted but whose layers remain, or they may be test data sets')
print()
for ds in layers_without_wriid.index:
    print('https://api.resourcewatch.org/v1/dataset/{}'.format(ds))
    
print()
print('These data sets appear in layers, but are not in the metadata sheet:', layers_without_uniqueid.shape[0])
print('They may be old data sets that were deleted but whose layers remain, or they may be test data sets')
print()
for ds in layers_without_uniqueid.index:
    print('https://api.resourcewatch.org/v1/dataset/{}'.format(ds))

                                                                         0  \
8ab2606d-1b1c-4b71-ab29-c6e0a687e9fd  255cf15f-39c1-46d2-ba82-59540133e4d1   

                                         1     2     3     4     5     6  \
8ab2606d-1b1c-4b71-ab29-c6e0a687e9fd  None  None  None  None  None  None   

                                         7     8     9    10    11    12  \
8ab2606d-1b1c-4b71-ab29-c6e0a687e9fd  None  None  None  None  None  None   

                                        13    14    15   WRI_ID  
8ab2606d-1b1c-4b71-ab29-c6e0a687e9fd  None  None  None  for.005  
                                                                         0  \
8ab2606d-1b1c-4b71-ab29-c6e0a687e9fd  255cf15f-39c1-46d2-ba82-59540133e4d1   

                                         1     2     3     4     5     6  \
8ab2606d-1b1c-4b71-ab29-c6e0a687e9fd  None  None  None  None  None  None   

                                         7     8     9    10    11    12  \
8ab2606d-1b1c-4b71-
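
The left-join-then-null-check used in the cell above can be seen on a small example. This is a sketch with hypothetical frames: `left_demo` plays the role of the API side and `sheet_demo` the tracking sheet.

```python
import pandas as pd

# Hypothetical frames: `left_demo` plays the API side, `sheet_demo` the tracking sheet
left_demo = pd.DataFrame({"layer": ["layer-a", "layer-b"]}, index=["ds-1", "ds-2"])
sheet_demo = pd.DataFrame({"WRI_ID": ["for.005"]}, index=["ds-1"])

# how="left" keeps every row of left_demo; unmatched rows get NaN in WRI_ID
joined = left_demo.join(sheet_demo, how="left")
untracked = joined[joined["WRI_ID"].isnull()].index.tolist()
print("not in the sheet:", untracked)
```

Because the join preserves the left frame's index, a null in the attached column identifies exactly the ids that the sheet does not know about.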

In [306]:
dft = pd.DataFrame(tracking['WRI_ID'])
dfm = pd.DataFrame(metadata['Unique ID'])

table_joined_t = api_datasets.join(dft, how='left')
table_joined_m = api_datasets.join(dfm, how='left')

print(table_joined_t.head(1))
print(table_joined_m.head(1))

datasets_without_wriid = api_datasets[pd.isnull(table_joined_t['WRI_ID'])]
datasets_without_uniqueid = api_datasets[pd.isnull(table_joined_m['Unique ID'])]

print()
print('These data sets appear in backoffice, but are not in the tracking sheet:', datasets_without_wriid.shape[0])
print()
for ds in datasets_without_wriid.index:
    print('https://api.resourcewatch.org/v1/dataset/{}'.format(ds))
    
print()
print('These data sets appear in backoffice, but are not in the metadata sheet:', datasets_without_uniqueid.shape[0])
print()
for ds in datasets_without_uniqueid.index:
    print('https://api.resourcewatch.org/v1/dataset/{}'.format(ds))

                                       WRI_ID
id                                           
098b33df-6871-4e53-a5ff-b56a7d989f9a  soc.064
                                     Unique ID
id                                            
098b33df-6871-4e53-a5ff-b56a7d989f9a   soc.064

These data sets appear in backoffice, but are not in the tracking sheet: 73

https://api.resourcewatch.org/v1/dataset/0a59f415-ee0b-4d19-96f7-c7304c152e1b
https://api.resourcewatch.org/v1/dataset/12510410-1eb3-4af0-844f-8a05be50b1c1
https://api.resourcewatch.org/v1/dataset/42de3f98-ba1c-4572-a227-2e18d45239a5
https://api.resourcewatch.org/v1/dataset/d8a45b34-4cc0-42f4-957d-e13b37e9182e
https://api.resourcewatch.org/v1/dataset/2b569ae2-9452-44f8-9c81-3c3afb6c3c25
https://api.resourcewatch.org/v1/dataset/9e9a5c50-b825-4f12-838f-1650943c2be1
https://api.resourcewatch.org/v1/dataset/bb80312e-b514-48ad-9252-336408603591
https://api.resourcewatch.org/v1/dataset/5edefab9-c707-447e-96f9-6115149e3a87
https://api.resourc