# Feature creation pipeline to operationalize data quality

# Aspirational TODO list
- How many different versions of the standard are we using?
- Build a flag for invalid reporting-org indications, add to features dict
- Are there any other description contents to calculate ("Wow, there's a lot of txn info here!")? Do this on the server side

### Initial Setup

In [1]:
from __future__ import division,print_function

from multiprocessing import Pool, cpu_count
import numpy as np
import os
import sys
import json
import re
import time
import pickle
import requests
import six
from datetime import datetime
from collections import Counter
from operator import add

import pandas as pd
import pymongo

from pprint import PrettyPrinter
p=PrettyPrinter()

In [2]:
pool = Pool(cpu_count())

In [3]:
conn=pymongo.MongoClient('mongodb', 27017)

conn.database_names()

['admin', 'iati', 'local']

In [4]:
db = conn.iati

activities=db.activities
metadata=db.metadata

db.collection_names()

['cleaned_orgs_full',
 'transactions',
 'quality',
 'organizations_metadata',
 'organizations',
 'activities',
 'activities_metadata']

In [5]:
print(activities.count())

764159


In [6]:
#Load a test record into memory so we have it on hand for later
activity=activities.find_one({'iati-identifier':'XM-DAC-41114-PROJECT-00047321'})

In [7]:
activity['reporting-org']

{'@ref': 'XM-DAC-41114',
 '@type': '40',
 'narrative': 'United Nations Development Programme'}

## Define activity-level feature creation functions

We're going to establish three different types of features:
- Completeness (does the activity have the data needed to be useful)
- Compliance (does the numeric data in the activity appear to be naturally generated)
- Utility (is the activity likely to be practically useful to a human reader)

### Completeness features

In [8]:
%%writefile feature_creation.py

from __future__ import division  #keep the ratio calculations below as true division on Python 2

import pickle
import six
import sys

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Overwriting feature_creation.py


In [9]:
%%writefile -a feature_creation.py

#Simple check for missing data

def check_missing_fields(activity,field,features):
    """Generic function for checking whether or not a given field has data"""
    try:
        if activity[field]:
            features['missing_'+field]=0
        else:
            features['missing_'+field]=1
    except KeyError:
        features['missing_'+field]=1

Appending to feature_creation.py
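
As a quick illustration of how the flag behaves, here is a minimal sketch run against a made-up activity (the `sample` dict is hypothetical, not a real IATI record). The function is repeated so the snippet runs on its own. Note that a field that is present but empty counts as missing, just like an absent one.

```python
def check_missing_fields(activity, field, features):
    """Copy of the notebook function, repeated so this example is self-contained"""
    try:
        if activity[field]:
            features['missing_' + field] = 0
        else:
            features['missing_' + field] = 1
    except KeyError:
        features['missing_' + field] = 1

# Hypothetical activity: has a title, an empty description, and no budget key
sample = {'title': {'narrative': 'Water project'}, 'description': ''}

features = {}
for field in ['title', 'description', 'budget']:
    check_missing_fields(sample, field, features)

print(features)
```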


In [10]:
%%writefile -a feature_creation.py

def get_val(val):
    """
    Many IATI schema complex types are simple content (no embedded tags) with attributes.
    xmltodict represents these as string values if no attributes are present, or as dicts with
    the content as the '#text' member otherwise.
    """
    if isinstance(val, dict):
        return val.get('#text')
    return val

Appending to feature_creation.py
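
To make the two shapes concrete, here is a small sketch with hand-written values (both inputs are made up for illustration). The function is repeated so the snippet runs standalone.

```python
def get_val(val):
    """Copy of the notebook function, repeated so this example is self-contained"""
    if isinstance(val, dict):
        return val.get('#text')
    return val

# No attributes on the element: the content arrives as a plain string
plain = '1500'

# Attributes present: the content moves under the '#text' key
with_attrs = {'@currency': 'USD', '@value-date': '2015-01-01', '#text': '1500'}

print(get_val(plain), get_val(with_attrs))
```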


### Compliance features
These are based on [Benford's law](https://en.wikipedia.org/wiki/Benford%27s_law), which describes how often each leading digit appears in naturally generated data.  Datasets that don't comply with Benford's law are more likely to be made up (for example, by an analyst guessing at an approximate value rather than calculating it) or even fraudulent.  We aren't making any claims about the veracity of any activity.  We're simply trying to give the user a sense of which numbers are more likely to have a real-world meaning, since this is what ultimately makes a dataset useful for understanding an aid intervention.
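
For reference, Benford's law puts the expected share of leading digit d at log10(1 + 1/d). The short sketch below just tabulates those values. It is an illustration and not part of the pipeline, and the name `expected` is ours.

```python
import math

# Expected share of each leading digit 1-9 under Benford's law
expected = {d: math.log10(1 + 1.0 / d) for d in range(1, 10)}

for digit in sorted(expected):
    print(digit, round(expected[digit], 3))
```

The digit 1 is expected to lead roughly 30% of the time, with the share falling steadily for higher digits.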

In [11]:
%%writefile -a feature_creation.py

#Benford's Law Analysis

def calculate_digit_distribution(list_of_numbers):
    """Returns the distribution of first digits (0-9) given a list of numbers (type agnostic)"""
    counts=[0]*10
    for num in list_of_numbers:
        try:
            num=filter(six.text_type.isdigit, six.text_type(num))
            digit=int(next(iter(num)))
            counts[digit]+=1
        except StopIteration:
            pass
    total=sum(counts)
    if total:
        return [round(i/total,4) for i in counts]
    return counts

def benfords_law(activity,field,features):
    """
    Tests whether or not a given field on an activity obeys Benford's law

    For our purposes, a distribution complies with Benford's law if the digit 1
    accounts for more than 30% of the first digits.  This is a simplistic definition,
    but it'll work for the prototype and we can make it more nuanced later

    """
    distribution=None
    compliance=None
    values=activity.get(field)
    if values:
        #Make sure types are consistent
        if type(values)!=list:
            values=[values]
        raw_vals=[]
        for value in values:
            contents=get_val(value.get('value'))
            if contents:
                raw_vals.append(contents)
        if raw_vals:
            distribution=calculate_digit_distribution(raw_vals)
            if distribution[1]>0.3:
                compliance=1
            else:
                compliance=0
    features['benford_distribution_'+field]=distribution
    features['benford_compliance_'+field]=compliance

Appending to feature_creation.py
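
One possible direction for a more nuanced check, sketched here as an assumption rather than anything the pipeline does: score the whole first-significant-digit distribution by its mean absolute deviation from the Benford expectation, instead of looking only at the digit 1. The helper names `first_significant_digit` and `benford_mad` are hypothetical, and no cutoff is proposed. Lower scores simply mean a closer fit.

```python
import math

# Expected Benford shares for leading digits 1..9
BENFORD = [math.log10(1 + 1.0 / d) for d in range(1, 10)]

def first_significant_digit(value):
    """Return the first non-zero digit in the text form of value, or None"""
    for ch in str(value):
        if ch.isdigit() and ch != '0':
            return int(ch)
    return None

def benford_mad(values):
    """Mean absolute deviation between observed and expected leading-digit shares"""
    counts = [0] * 9
    for value in values:
        digit = first_significant_digit(value)
        if digit is not None:
            counts[digit - 1] += 1
    total = sum(counts)
    if not total:
        return None
    return sum(abs(c / float(total) - e) for c, e in zip(counts, BENFORD)) / 9

# Powers of two are a classic Benford-like sequence; a constant is not
benford_like = [2 ** n for n in range(1, 101)]
constant = [5] * 100
print(benford_mad(benford_like), benford_mad(constant))
```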


### Utility Features

Here, we're trying to understand how practically useful a record is likely to be to a lay user.  Useful records are likely to be manageable in size and contain a diverse mix of information.  If a record consists entirely of one type of data (usually long lists of transactions or budget items), it's liable to obscure important content about its programs in the same way that meaningful data is overwhelmed by useless packets sent during a DDoS attack.

We also examine the similarity of the title and description of each activity, since these are typically among the easiest fields for end-users to understand and they're among the most complete.  When these fields are too similar, it may indicate lazy data entry (copy/pasting from one field to the other).  When they're too dissimilar, it may indicate a lack of programmatic alignment.
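
To build intuition for the similarity score, here is a simplified, dependency-free sketch. It computes cosine similarity over raw word counts, which is a stand-in for the TF-IDF weighting used in the pipeline, so the exact numbers will differ. The function names and sample strings are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens"""
    return re.findall(r'[a-z0-9]+', text.lower())

def count_cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings"""
    vec_a, vec_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if not norm_a or not norm_b:
        return 0.0
    return dot / (norm_a * norm_b)

title = 'Rural water supply'
copied = 'Rural water supply'
unrelated = 'Teacher training workshops'
related = 'Improving water supply for rural communities in three districts'

print(count_cosine(title, copied))     # identical text scores at (or within rounding of) 1
print(count_cosine(title, unrelated))  # no shared words scores 0
print(count_cosine(title, related))    # partial overlap lands in between
```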

In [12]:
%%writefile -a feature_creation.py

#Extract and compare similarity of title and description

def find_text_in_dict(field):
    """Structure of certain fields varies, so we need to unwrap the nested dicts when this happens"""
    if type(field)==dict:
        try:
            field=field['narrative']
            if type(field)==dict:
                field=field['text']
        except KeyError:
            field=field['#text']
            if type(field)==dict:
                field=field['text']
    return field

def find_text(field):
    """
    If there is more than one dict in a list, extract and concatenate all text content
    """
    if type(field)==list:
        output=''
        for i in field:
            try:
                output+=find_text_in_dict(i)
            except TypeError:
                pass
    else:
        output=find_text_in_dict(field)
    return output

def field_similarity(field1,field2):
    """Calculate cosine similarity between two TFIDF vectorized strings"""
    #Default min_df=1 keeps every term; newer scikit-learn releases reject an integer min_df below 1
    tfidf_vectorizer=TfidfVectorizer()
    tfidf_matrix=tfidf_vectorizer.fit_transform([field1,field2])
    cs=cosine_similarity(tfidf_matrix[0],tfidf_matrix[1])
    return cs[0][0]

def compare_title_description(activity,features):
    """
    Cosine similarity between title and description

    Very high values (close to 1) indicate that title and description are redundant
    Very low values (close to 0) indicate that title and description have nothing to do with each other
    """
    try:
        title=find_text(activity['title'])
        description=find_text(activity['description'])
        cs=field_similarity(title,description)
        features['title_description_similarity']=cs
    #KeyError - fields don't exist
    #ValueError - fields only contain stopwords
    #AttributeError - most likely the extracted text isn't a string (e.g. None or a list), so it can't be lowercased
    except (KeyError,AttributeError,ValueError):
        features['title_description_similarity']=None

Appending to feature_creation.py


In [13]:
%%writefile -a feature_creation.py

def get_relative_sizes(activity,exist_fields,features):
    """
    Use the serialized storage space for different fields as a proxy for their relative size within a given activity
    """
    sizes={ field:0 for field in exist_fields}
    #Included in here is the features themselves, but those will be small relative to everything else
    total=sys.getsizeof(pickle.dumps(activity))
    for field in sizes:
        try:
            field_size=sys.getsizeof(pickle.dumps(activity[field]))
            sizes[field]=field_size/total
        except KeyError:
            pass
    features[u'total_size']=total
    features[u'relative_sizes']=sizes


Appending to feature_creation.py
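
To see the proxy in action, here is a small standalone sketch on a made-up activity (the `activity` dict and its URLs are hypothetical). It uses `len()` of the pickled bytes instead of `sys.getsizeof`, which to our understanding adds a fixed per-object overhead on top of the byte count, so the shares will differ slightly from the pipeline's.

```python
import pickle

# Hypothetical activity dominated by a long list of document links
activity = {
    'title': 'Water project',
    'document-link': [{'@url': 'http://example.org/doc%d.pdf' % i} for i in range(50)],
}

total = len(pickle.dumps(activity))

# Share of the serialized record taken up by each field
shares = {field: len(pickle.dumps(value)) / float(total)
          for field, value in activity.items()}

for field in sorted(shares):
    print(field, round(shares[field], 3))
```

A record like this one, where a single field takes up most of the bytes, is the kind of lopsided activity the utility features are meant to surface.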


### Other features

There are certain features, like the IATI organization ID, that we'll want to have easily accessible later on, so we'll pull them out and add them to our feature vector.  Other incomplete feature implementations are included below as well.

In [14]:
%%writefile -a feature_creation.py

def find_reporting_org(field):
    """
    Extract IATI ID of reporting org

    The field may be a single dict carrying an '@ref' attribute or a list of dicts.
    When it is a list, we read '@ref' from the second entry.

    This methodology is NOT perfect, but it gets about 210 distinct orgs in the dataset
    We'll need to make it more responsive to the varying IATI versions, but it's close enough
    for now.
    """
    output=None
    if type(field)==list:
        #Guard against short lists before indexing the second entry
        if len(field)>1 and '@ref' in field[1]:
            output=field[1]['@ref']
    elif type(field)==dict:
        if '@ref' in field:
            output=field['@ref']
    return output

Appending to feature_creation.py


In [15]:
%%writefile -a feature_creation.py

#Testing whether URLs are valid - not currently in pipeline

import requests

def test_links(list_of_links):
    """
    Find the fraction of links in a list that seem to work
    (for example, pass in activity['document-link'])

    The process of checking these can be tedious, and requires a lot of bandwidth,
    so it's probably best to implement this in the cloud if we want to scale across all activities
    """
    if not list_of_links:
        return None
    valid_links=0
    for link in list_of_links:
        try:
            response=requests.get(link['@url'],timeout=10)
            if response.status_code==200:
                valid_links+=1
        #Links without a URL, or that fail to connect, count as invalid
        except (KeyError,requests.RequestException):
            pass
    return valid_links/float(len(list_of_links))


Appending to feature_creation.py


In [16]:
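# Initial implementation of activity date validation - INCOMPLETE
from collections import OrderedDict
#Build from a list of pairs so the key order is preserved on every Python version
date_dict=OrderedDict([('start-planned',None),('start-actual',None),('end-planned',None),('end-actual',None)])
dates=activity.get('activity-date',[])
#A single date comes through as a dict rather than a list
if isinstance(dates,dict):
    dates=[dates]
for date in dates:
    try:
        date_dict[date['@type']]=date['@iso-date']
    except KeyError:
        pass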
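
One way this validation might be finished, sketched under assumptions: once the dates are collected by type, flag any start date that falls after its matching end date. The helper `check_date_order` and the sample dates are hypothetical, and the sketch assumes `@iso-date` values are plain `YYYY-MM-DD` strings.

```python
from datetime import datetime

def check_date_order(date_dict):
    """Return the (start, end) key pairs whose start date falls after the end date"""
    problems = []
    for start_key, end_key in [('start-planned', 'end-planned'),
                               ('start-actual', 'end-actual')]:
        start, end = date_dict.get(start_key), date_dict.get(end_key)
        # Only compare when both dates are present
        if start and end:
            if datetime.strptime(start, '%Y-%m-%d') > datetime.strptime(end, '%Y-%m-%d'):
                problems.append((start_key, end_key))
    return problems

# Made-up dates: the planned end precedes the planned start
sample_dates = {'start-planned': '2015-01-01', 'end-planned': '2014-12-31',
                'start-actual': '2015-02-01', 'end-actual': None}
print(check_date_order(sample_dates))
```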

### Generate Features

Having defined the functions needed to generate each feature, we can combine them to create a master pipeline function that we can run iteratively on the whole dataset. 

In [17]:
%%writefile -a feature_creation.py

#Define master feature pipeline function

#lists of fields to check for existence
exist_fields=['description','budget','title','result','transaction','document-link','participating-org','activity-date','@default-currency','@xml:lang']
benford_fields=['budget','transaction']

def generate_features(activity):
    """Generate all features"""

    #Start a fresh feature dict keyed by the activity's _id
    features={
        '_id': activity['_id']
    }

    #Find sizes of stuff
    get_relative_sizes(activity,exist_fields, features) #do this early on so it's less affected by the creation of other data

    #Look for missing data
    for field in exist_fields:
        check_missing_fields(activity,field, features)

    #Data quality validations
    compare_title_description(activity, features)

    #Benford's Law
    for field in benford_fields:
        benfords_law(activity,field, features)

    #Extract certain pieces of text to make it easier to parse later
    features['organization']=find_reporting_org(activity['reporting-org'])
    return features

Appending to feature_creation.py


In [18]:
from feature_creation import generate_features

In [19]:
#Test feature generation on one record
features=generate_features(activity)
p.pprint(features)

{'_id': ObjectId('5a475d066f8487020fe4c488'),
 'benford_compliance_budget': None,
 'benford_compliance_transaction': None,
 'benford_distribution_budget': None,
 'benford_distribution_transaction': None,
 'missing_@default-currency': 0,
 'missing_@xml:lang': 0,
 'missing_activity-date': 0,
 'missing_budget': 1,
 'missing_description': 0,
 'missing_document-link': 0,
 'missing_participating-org': 0,
 'missing_result': 0,
 'missing_title': 0,
 'missing_transaction': 1,
 'organization': 'XM-DAC-41114',
 'relative_sizes': {'@default-currency': 0.0025742906709944596,
                    '@xml:lang': 0.002518327830320667,
                    'activity-date': 0.01203201074486541,
                    'budget': 0,
                    'description': 0.011360456656779898,
                    'document-link': 0.7812972186468186,
                    'participating-org': 0.038222620180200347,
                    'result': 0.05848116850411327,
                    'title': 0.006435726677486149,
      

In [20]:
#GENERATE ALL FEATURES for all activities!
db.drop_collection('quality')
activities_count=activities.count()
print(datetime.now(), 'Started processing')

qual_recs=[]
num_finished = 0

for qual_rec in pool.imap_unordered(generate_features, activities.find()):
    num_finished += 1
    qual_recs.append(qual_rec)

    if num_finished % 25000 == 0:
        print(datetime.now(), 'Processed', num_finished, 'of', activities_count)
        db.quality.insert_many(qual_recs)
        qual_recs=[]

if len(qual_recs) > 0:
    db.quality.insert_many(qual_recs)
    qual_recs = []

print(datetime.now(), 'Finished processing')

2017-12-31 20:57:18.307344 Started processing
2017-12-31 20:57:30.159424 Processed 25000 of 764159
2017-12-31 20:57:42.124127 Processed 50000 of 764159
2017-12-31 20:57:55.413257 Processed 75000 of 764159
2017-12-31 20:58:05.719675 Processed 100000 of 764159
2017-12-31 20:58:18.005187 Processed 125000 of 764159
2017-12-31 20:58:29.841263 Processed 150000 of 764159
2017-12-31 20:58:40.415187 Processed 175000 of 764159
2017-12-31 20:58:51.978640 Processed 200000 of 764159
2017-12-31 20:59:05.232527 Processed 225000 of 764159
2017-12-31 20:59:16.283863 Processed 250000 of 764159
2017-12-31 20:59:26.932167 Processed 275000 of 764159
2017-12-31 20:59:40.795594 Processed 300000 of 764159
2017-12-31 20:59:51.690017 Processed 325000 of 764159
2017-12-31 21:00:02.105478 Processed 350000 of 764159
2017-12-31 21:00:14.123345 Processed 375000 of 764159
2017-12-31 21:00:25.766887 Processed 400000 of 764159
2017-12-31 21:00:38.807642 Processed 425000 of 764159
2017-12-31 21:00:50.504149 Processed 45
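
The accumulate-then-flush pattern in the loop above can also be factored into a small generator. This is only a sketch: `in_batches` is a hypothetical helper, not something the notebook defines, and the commented usage line assumes the same `pool`, `activities` and `db` objects as above.

```python
def in_batches(iterable, size):
    """Yield lists of up to `size` items, including a final partial batch"""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Possible usage with the objects defined earlier in this notebook:
# for batch in in_batches(pool.imap_unordered(generate_features, activities.find()), 25000):
#     db.quality.insert_many(batch)

batch_sizes = [len(batch) for batch in in_batches(range(7), 3)]
print(batch_sizes)
```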

In [21]:
#Sanity check - how many records have a non-null title-description similarity
db.quality.find({ 'title_description_similarity' : {'$ne' : None} }).count()

571058

In [22]:
db.quality.create_index('organization')

'organization_1'

### Exploration of activity-level results

In [23]:
from feature_creation import exist_fields, benford_fields

In [24]:
#EDA to figure out how much data is missing
grp={'_id': None}
for field in exist_fields:
    grp['missing_'+field]={'$sum': '$missing_'+field}
    grp['average_'+field]={'$avg': '$relative_sizes.'+field}
for field in benford_fields:
    grp['benford_'+field]={'$sum': '$benford_compliance_'+field}
grp['total']={'$sum': 1}

# Run the agg pipeline
results=next(db.quality.aggregate([{'$group': grp}]))

# Print the results
print('MISSING DATA')
p.pprint({k:v for k,v in results.items() if k.startswith('missing_')})
print("")
print("RELATIVE SIZES OF KEY FIELDS")
p.pprint({k:v for k,v in results.items() if k.startswith('average_')})
print("")
print("BENFORD COMPLIANCE (AMONG ACTIVITIES WITH DATA)")
p.pprint({k:v for k,v in results.items() if k.startswith('benford_')})
print("")
print("TOTAL ACTIVITIES")
print(results['total'])

MISSING DATA
{'missing_@default-currency': 58679,
 'missing_@xml:lang': 59805,
 'missing_activity-date': 8661,
 'missing_budget': 453185,
 'missing_description': 43550,
 'missing_document-link': 604302,
 'missing_participating-org': 1627,
 'missing_result': 711899,
 'missing_title': 2771,
 'missing_transaction': 34192}

RELATIVE SIZES OF KEY FIELDS
{'average_@default-currency': 0.008136710128163815,
 'average_@xml:lang': 0.008078909761280085,
 'average_activity-date': 0.042858221831122534,
 'average_budget': 0.0308277609843497,
 'average_description': 0.06799742091908322,
 'average_document-link': 0.02518787885247538,
 'average_participating-org': 0.0800301649129149,
 'average_result': 0.01177915102503437,
 'average_title': 0.024315709276819773,
 'average_transaction': 0.30208155163699385}

BENFORD COMPLIANCE (AMONG ACTIVITIES WITH DATA)
{'benford_budget': 74449, 'benford_transaction': 277481}

TOTAL ACTIVITIES
764159


In [25]:
#Number of records by language
pipeline=[{"$group": {"_id": "$@xml:lang", "count": {"$sum": 1}}}]
list(activities.aggregate(pipeline))

[{'_id': '', 'count': 54},
 {'_id': 'es', 'count': 1912},
 {'_id': 'fr', 'count': 1862},
 {'_id': 'FR', 'count': 14702},
 {'_id': None, 'count': 59751},
 {'_id': 'en', 'count': 671725},
 {'_id': 'EN', 'count': 943},
 {'_id': 'nl', 'count': 7103},
 {'_id': 'NL', 'count': 1},
 {'_id': 'pt', 'count': 174},
 {'_id': 'de', 'count': 5932}]
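
The counts above are split by letter case ('fr' vs 'FR', 'en' vs 'EN') and by two flavors of missing value (`''` and `None`). A possible cleanup, sketched here on the figures copied from that output, is to lowercase the codes and fold empty values together. The variable names are ours.

```python
from collections import Counter

# Counts copied from the aggregation output above
raw_counts = [('', 54), ('es', 1912), ('fr', 1862), ('FR', 14702), (None, 59751),
              ('en', 671725), ('EN', 943), ('nl', 7103), ('NL', 1), ('pt', 174),
              ('de', 5932)]

normalized = Counter()
for lang, count in raw_counts:
    # Treat '' and None alike, and ignore letter case
    key = lang.lower() if lang else None
    normalized[key] += count

for lang, count in normalized.most_common():
    print(lang, count)
```

The combined missing bucket comes to 59805, which lines up with the `missing_@xml:lang` count reported earlier.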

In [26]:
#Number of unique organizations
len(list(db.quality.aggregate([{"$group": {"_id": "$organization", "count": {"$sum": 1}}}])))

403

In [27]:
db.quality.distinct('organization')

[None,
 '',
 '21-PK-WWF',
 '21020',
 '21032',
 '21033',
 '21033-1.0744',
 '21033-1.0792',
 '21033-1.0931-2',
 '21033-1.0931-3',
 '21033-1.0941',
 '21033-1.1053',
 '21033-1.2001',
 '21033-1.2008',
 '41111',
 '41119',
 '41120',
 '41122',
 '41AAA',
 '44000',
 '46002',
 '46004',
 '47045',
 '47122',
 '47134',
 '47135',
 'AF-MOE-1212',
 'AU-5',
 'BD-NAB-1301',
 'BE-BCE_KBO-0264814354',
 'BE-GTCF-630789842',
 'BJ-IFU-32000700033415',
 'CA-3',
 'CA-CRA-89980-1815-RR0001',
 'CA-CRA_ARC-119304848',
 'CH-4',
 'CH-FDJP-106064950',
 'DAC-1601',
 'DE-1',
 'DE-AG-VR7795',
 'DK-1',
 'DK-CVR-88136411',
 'ES-DIR3-E04585801',
 'FI-3',
 'FI-PRO-1498487-2',
 'GB-10',
 'GB-3',
 'GB-6',
 'GB-7',
 'GB-9',
 'GB-CC-1098893',
 'GB-CHC-000391',
 'GB-CHC-1000717',
 'GB-CHC-1001349',
 'GB-CHC-1001698',
 'GB-CHC-1017255',
 'GB-CHC-1029161',
 'GB-CHC-1038785',
 'GB-CHC-1038860',
 'GB-CHC-1043843',
 'GB-CHC-1045348',
 'GB-CHC-1046001',
 'GB-CHC-1047501',
 'GB-CHC-1050327',
 'GB-CHC-1053389',
 'GB-CHC-1055436',
 'GB-CH

In [28]:
#Use this to lookup how a particular org shows up
test_name='XM-DAC-41114'
test_q=db.quality.find_one({'organization':test_name})

if test_q is not None:
    test=activities.find_one({'_id': test_q['_id']})
    print(test['reporting-org'])
else:
    print('Failed to find', test_name)

{'@ref': 'XM-DAC-41114', '@type': '40', 'narrative': 'United Nations Development Programme'}


## Aggregating activity features to the organization level

Now that we've generated features for each activity, we need to aggregate them together to display them to the end user.  We'll focus on aggregating results at the organization level, but leave the option open to recalculate the aggregation for any of the other fields (country, theme, etc.)

In [29]:
#Load pickles into memory for quick lookups later
with open("lookup_by_ref.pickle",'rb') as f:
    lookup=pickle.load(f)

with open("lookup_by_name.pickle",'rb') as f:
    lookup_name=pickle.load(f)

In [30]:
def get_benford_distribution(orgid, field):
    """Calculate Benford distribution for an organization across all its activity records"""
    qdata=db.quality.find({'organization':orgid})
    count=0
    running_distribution=[0,0,0,0,0,0,0,0,0,0]
    for qrec in qdata:
        dist_field_name = 'benford_distribution_'+field
       
        if dist_field_name not in qrec:
            continue

        dist_field = qrec[dist_field_name]
       
        if type(dist_field) != list:
            continue

        try:
            running_distribution=np.add(running_distribution, dist_field)
            count+=1
        except TypeError:
            #Skip distributions that can't be added element-wise
            continue
    if count>0:
        final_distribution=[100*i/count for i in running_distribution]
    else:
        final_distribution=running_distribution
    final_dict={str(i):final_distribution[i] for i in range(0,10)}
    return final_dict
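
To illustrate what this averaging does, here is a dependency-free sketch on two made-up per-activity distributions. The helper name `average_distributions` and the sample lists are hypothetical. Each activity's distribution carries equal weight, regardless of how many values went into it.

```python
def average_distributions(distributions):
    """Element-wise mean of equal-length distributions, expressed as percentages"""
    count = len(distributions)
    if not count:
        return [0] * 10
    return [100 * sum(column) / float(count) for column in zip(*distributions)]

# Activity A: every value starts with 1
dist_a = [0, 1.0, 0, 0, 0, 0, 0, 0, 0, 0]
# Activity B: values split evenly between leading 1s and 2s
dist_b = [0, 0.5, 0.5, 0, 0, 0, 0, 0, 0, 0]

averaged = average_distributions([dist_a, dist_b])
print({str(digit): share for digit, share in enumerate(averaged)})
```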

In [31]:
#Extract aggregated results for a given grouping (we'll start at the org level for now)

exist_fields=['description','budget','title','result','transaction','document-link','participating-org','activity-date','@default-currency','@xml:lang']
size_exclude_fields=['activity-date','@default-currency','@xml:lang']
benford_fields=['budget','transaction']

def query_orgs(group_field, group_val=None):
    """
    Collect aggregated results across our feature space for a given field.
    Note that most of these aggregations are simple averages across all activities
    for this prototype.

    This implementation isn't particularly fast because it makes repeated calls to the mongo collection.
    There are certainly opportunities for optimization by gathering a subset of results
    into memory and aggregating from there.

    """
    pipeline=[]
    # Do we have a single org filter?
    if group_val:
        pipeline.append({"$match": {group_field: group_val}}) #Filter for a given subset of records
    # get the fields, grouping by org ID (or whatever is passed in)
    pipeline.append({'$group': {'_id': '$'+group_field,
                                    #CONTACT INFO
                                    'contact-info': {'$first': '$contact-info'},
                                    #TOTAL
                                    'total': {'$sum': 1},
                                    #SIMILARITY
                                    'title_description_similarity': {'$avg': '$title_description_similarity'},
                                    #DOC SIZE
                                    'avg': {'$avg': '$total_size'},
                                    'min': {'$min': '$total_size'},
                                    'max': {'$max': '$total_size'},
                                    #Doesn't exist in Mongo 3.0.4 (which is what I'm using)
                                    #'sd': {'$stdDevPop': '$total_size'},
                                   }})
    #MISSING DATA
    for field in exist_fields:
        pipeline[-1]['$group']['missing_'+field]={'$avg': '$missing_'+field}

        #FIELD SIZE DATA
        if field not in size_exclude_fields:
            pipeline[-1]['$group']['size_'+field]={'$avg': '$relative_sizes.'+field}

    #BENFORD COMPLIANCE (AMONG ACTIVITIES WITH DATA)
    for field in benford_fields:
        pipeline[-1]['$group']['benford_'+field]={'$avg': '$benford_compliance_'+field}

    # Run the pipeline and yield results as they are available
    for results in db.quality.aggregate(pipeline):
        orgid=results['_id']
        if not orgid: # discard blank and null
            continue
        # figure out the name
        orgnames=lookup[orgid]
        if orgnames and isinstance(orgnames, set):
            orgname=next(iter(orgnames))
        elif orgnames and isinstance(orgnames, six.string_types):
            orgname=orgnames
        else:
            continue
        # Collect Missing Data
        exist_results={}
        size_results={}
        for field in exist_fields:
            exist_results[field]=results['missing_'+field]
            #FIELD SIZE DATA
            if field not in size_exclude_fields:
                size_results[field]=results['size_'+field]
        # Collect Benford Fields
        benford_results={}
        for field in benford_fields:
            benford_results[field]=results['benford_'+field]
            benford_results[field+'_distribution']=get_benford_distribution(orgid,field)

        #Output result as JSON for easy parsing/storing in Mongo
        yield {'organization_id': orgid,
                'organization_name': orgname,
                'contact_info': results['contact-info'] or {},
                'doc_size': {
                    'avg': results['avg'],
                    'min': results['min'],
                    'max': results['max']
                },
                'records': results['total'],
                'missing_data':exist_results,
                'title_description_similarity':results['title_description_similarity'],
                'relative_size':size_results,
                'benford_compliance':benford_results
               }
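
The heart of `query_orgs` is the `$group` stage it assembles. The sketch below rebuilds a trimmed-down version of that stage as a plain dict so its shape is easy to inspect (the helper `build_group_stage` is hypothetical and does not touch Mongo). Because each `missing_` flag is 0 or 1, `$avg` gives the share of an organization's activities that lack the field. `$avg` also ignores non-numeric values such as null, which is why the Benford figure is averaged only over activities with data.

```python
from pprint import pprint

def build_group_stage(group_field, exist_fields, benford_fields):
    """Assemble a $group stage that averages the per-activity flags for each group"""
    group = {'_id': '$' + group_field, 'total': {'$sum': 1}}
    for field in exist_fields:
        # Average of 0/1 flags = fraction of activities missing the field
        group['missing_' + field] = {'$avg': '$missing_' + field}
    for field in benford_fields:
        # Null compliance values are skipped by $avg
        group['benford_' + field] = {'$avg': '$benford_compliance_' + field}
    return {'$group': group}

stage = build_group_stage('organization', ['title', 'budget'], ['budget'])
pprint(stage)
```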

In [32]:
len(list(db.quality.aggregate([{'$group': {'_id': '$organization'}}])))

403

In [33]:
#Test aggregation function
output=query_orgs('organization', 'GB-COH-04154075')
p.pprint(next(iter(output)))

{'benford_compliance': {'budget': 0.2,
                        'budget_distribution': {'0': 0.0,
                                                '1': 16.666,
                                                '2': 10.0,
                                                '3': 10.0,
                                                '4': 10.0,
                                                '5': 10.0,
                                                '6': 20.0,
                                                '7': 10.0,
                                                '8': 13.334,
                                                '9': 0.0},
                        'transaction': 0.3333333333333333,
                        'transaction_distribution': {'0': 0.0,
                                                     '1': 25.505000000000006,
                                                     '2': 31.183333333333334,
                                                     '3': 7.2516666666666678,
            

In [34]:
db.drop_collection('cleaned_orgs_full')
orgs=db.cleaned_orgs_full

batch=[]
for results in query_orgs('organization'):
    batch.append(results)
    if len(batch)==100:
        print(datetime.now(), "- inserting {} orgs".format(len(batch)))
        orgs.insert_many(batch)
        print('successful insert')
        batch=[]
if batch:
    print(datetime.now(), "- inserting {} orgs".format(len(batch)))
    orgs.insert_many(batch)

2017-12-31 21:03:30.347651 - inserting 100 orgs
successful insert
2017-12-31 21:03:37.836134 - inserting 100 orgs
successful insert
2017-12-31 21:03:42.607544 - inserting 100 orgs
successful insert
2017-12-31 21:03:48.040927 - inserting 70 orgs


In [35]:
orgs.create_index('organization_id')

'organization_id_1'

In [36]:
#Verify that this matches the manual test from above
p.pprint(orgs.find_one({'organization_id':'GB-COH-04154075'}))
orgs.count()

{'_id': ObjectId('5a4950ae6f8487050240e1e1'),
 'benford_compliance': {'budget': 0.2,
                        'budget_distribution': {'0': 0.0,
                                                '1': 16.666,
                                                '2': 10.0,
                                                '3': 10.0,
                                                '4': 10.0,
                                                '5': 10.0,
                                                '6': 20.0,
                                                '7': 10.0,
                                                '8': 13.334,
                                                '9': 0.0},
                        'transaction': 0.3333333333333333,
                        'transaction_distribution': {'0': 0.0,
                                                     '1': 25.505000000000006,
                                                     '2': 31.183333333333334,
                                            

370