# Analysis of PHLASK data in early 2024

This notebook analyzes the PHLASK data as it existed in early 2024. It uses the
PHLASK Firebase DB to examine the schema of the data and which fields are actually present.

**The goal of this analysis is to begin designing an official schema for the data, which we can then
use to go through the data and normalize or remove invalid resources.**

You can find the Firebase databases and dashboard [here](https://console.firebase.google.com/u/1/project/phlask-web-map/overview).

## Getting Started Hints

- If you are new to Firebase, check out [this link](https://www.geeksforgeeks.org/firebase-introduction/).
- I found [this tutorial](https://www.freecodecamp.org/news/how-to-get-started-with-firebase-using-python/) to be useful for learning how to connect to Firebase using Python.

## Table of Contents:

- Getting started - Taking a look at the data
- Analyzing specific aspects of PHLASK data
  - Analysis of "Hours" Data in PHLASK
  - Analysis of address fields
- Coming up with a schema for the data

# Getting started - Taking a look at the data

First, we need to install the required dependencies and set up some initial configuration to access the Firebase DB.

**DO NOT skip this step! Without it, none of the code will run!**

We start by installing some Python dependencies with pip.


In [1]:
!pip install firebase-admin # used for accessing Firebase DB
!pip install cerberus # used for validating the schema of the data



Next, we will need to configure our environment to properly access the Firebase DB. This will use a Firebase Cert that
you must request from a PHLASK admin. Ask in the #phlask_data channel for a `firebase_cert.json` file, and then place it
in the same directory as this notebook. Then, run the following code to configure your environment.

In [1]:
import os
cert_path = os.path.abspath("firebase_cert.json")
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = cert_path
print("Set the cert path to " + cert_path)

Set the cert path to /Users/aaronvontell/Projects/phlask-data-handlers/data_analysis/firebase_cert.json
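
Note that the cell above succeeds even when `firebase_cert.json` is missing, because `os.path.abspath` does not check that the file exists. The problem then only shows up later, when the Firebase client tries to load the credentials. A minimal sketch of an early check, assuming the cert sits next to the notebook; `resolve_cert_path` is a hypothetical helper, not part of the original notebook:

```python
import os

def resolve_cert_path(filename: str = "firebase_cert.json") -> str:
    """Return the absolute path to the cert file, or raise if it does not exist."""
    cert_path = os.path.abspath(filename)
    if not os.path.isfile(cert_path):
        raise FileNotFoundError(
            f"{cert_path} not found; request the cert in #phlask_data first."
        )
    return cert_path

# Usage (hypothetical): fail fast before touching the environment.
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = resolve_cert_path()
```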


Now, we will start up the client to access the DB and take a look at the data in a specific database. The `DB_URL` constant
defined here is the main DB we look at, and it is reused in the following sections. You can find the whole list of databases [here](https://console.firebase.google.com/u/1/project/phlask-web-map/database/phlask-web-map/data).

In [2]:
DB_URL = 'https://phlask-web-map-beta-water-live.firebaseio.com/'

from firebase_admin import initialize_app, db
default_app = initialize_app()
ref = db.reference(url=DB_URL)
all_entries = [e for e in ref.get() if e is not None]
print(f"Loaded PHLASK DB reference with {len(all_entries)} resources")

# If you get an error about the Firebase app already existing, restart your notebook kernel.

Loaded PHLASK DB reference with 274 resources
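
The list comprehension above relies on `ref.get()` returning a list. As far as I understand, the Realtime Database returns a list (with `None` in the gaps) when the keys look like sequential integers, and a dict otherwise. A small sketch that handles both shapes; `as_entry_list` is a hypothetical helper name:

```python
def as_entry_list(snapshot):
    """Normalize a Realtime Database snapshot into a list of non-empty entries."""
    if snapshot is None:
        return []
    if isinstance(snapshot, dict):
        # Keyed snapshot: keep only the values, dropping the keys.
        values = snapshot.values()
    else:
        # Array-like snapshot: may contain None where an index is unused.
        values = snapshot
    return [entry for entry in values if entry is not None]

# Usage (hypothetical): all_entries = as_entry_list(ref.get())
```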


Now let's look at some of the data from these resources.

In [3]:
all_entries[5]

{'access': 'Public',
 'address': '2501 Walnut St.',
 'city': 'Philadelphia',
 'description': 'Drinking fountain along Schuykill Banks path near Locust St. entrance',
 'filtration': 'No',
 'gp_id': 'ChIJ1QhLoknGxokRY1BxIaMBmEY',
 'handicap': 'Unsure',
 'hours': [{'close': {'day': 1, 'time': '0000'},
   'open': {'day': 0, 'time': '0600'}},
  {'close': {'day': 2, 'time': '0000'}, 'open': {'day': 1, 'time': '0600'}},
  {'close': {'day': 3, 'time': '0000'}, 'open': {'day': 2, 'time': '0600'}},
  {'close': {'day': 4, 'time': '0000'}, 'open': {'day': 3, 'time': '0600'}},
  {'close': {'day': 5, 'time': '0000'}, 'open': {'day': 4, 'time': '0600'}},
  {'close': {'day': 6, 'time': '0000'}, 'open': {'day': 5, 'time': '0600'}},
  {'close': {'day': 0, 'time': '0000'}, 'open': {'day': 6, 'time': '0600'}}],
 'images': ['https://i.imgur.com/TwWKydJ.jpg'],
 'lat': 39.952195,
 'lon': -75.180653,
 'norms_rules': '',
 'organization': 'Schuylkill Banks',
 'permanently_closed': False,
 'phone': '(215) 309-55

In [4]:
all_entries[150]

{'access': 'Public',
 'address': '',
 'city': 'Philadelphia',
 'description': 'Dawn - Dusk',
 'filtration': '',
 'gp_id': 'ChIJBb3JfPK4xokRajvM6KUr75c',
 'handicap': '',
 'lat': 40.0290751,
 'lon': -75.211313,
 'norms_rules': '',
 'organization': 'Kendrick Playground & Recreation Center',
 'permanently_closed': False,
 'quality': '5-7 Missing - Needs Work',
 'service': 'Self-serve',
 'statement': '',
 'tap_type': 'Drinking Fountain',
 'tapnum': 151,
 'vessel': '',
 'zip_code': 19128}
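
The two resources above differ both in the keys they carry (the second has no `hours`, `images`, or `phone`) and in how they mark missing values (empty strings). A rough way to get an overview is to count, per field, how many entries carry a usable value. This is a minimal sketch; `field_coverage` is a hypothetical helper and the sample entries are made up for illustration:

```python
from collections import Counter

def field_coverage(entries):
    """Count, per field, how many entries carry a non-empty value."""
    coverage = Counter()
    for entry in entries:
        for key, value in entry.items():
            # Treat None, empty strings, and whitespace-only strings as missing.
            if value is None:
                continue
            if isinstance(value, str) and not value.strip():
                continue
            coverage[key] += 1
    return coverage

# Made-up sample entries, only to illustrate the shape of the result.
sample = [
    {"address": "2501 Walnut St.", "filtration": "No",
     "hours": [{"open": {"day": 0, "time": "0600"}}]},
    {"address": "", "filtration": "", "tapnum": 151},
]
coverage = field_coverage(sample)
for field, count in sorted(coverage.items()):
    print(f"{field}: {count}/{len(sample)}")
```

Running the same function over `all_entries` should show which fields are reliable enough to mark as required in a schema.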

# Analyzing specific aspects of PHLASK data

## Analysis of "Hours" Data in PHLASK

Let's start with an analysis of the hour format in these databases. We want to answer the following questions:
- How many entries have hours included?
- What is the format of the hours?

In [5]:
print(f"There are {len(all_entries)} resources in this PHLASK db.")

There are 274 resources in this PHLASK db.


In [6]:
# Now, let's see how many entries have hours
entries_with_hours = [entry for entry in all_entries if entry.get('hours') is not None]
print(f"There are {len(entries_with_hours)} entries with hours ({len(entries_with_hours)*100/len(all_entries):.2f}%).")

There are 187 entries with hours (68.25%).


Now, let's take a look at a few of these hour entries.

In [7]:
entries_with_hours[0].get('hours')

[{'close': {'day': 0, 'time': '2100'}, 'open': {'day': 0, 'time': '0700'}},
 {'close': {'day': 1, 'time': '2200'}, 'open': {'day': 1, 'time': '0700'}},
 {'close': {'day': 2, 'time': '2200'}, 'open': {'day': 2, 'time': '0700'}},
 {'close': {'day': 3, 'time': '2200'}, 'open': {'day': 3, 'time': '0700'}},
 {'close': {'day': 4, 'time': '2200'}, 'open': {'day': 4, 'time': '0700'}},
 {'close': {'day': 5, 'time': '2200'}, 'open': {'day': 5, 'time': '0700'}},
 {'close': {'day': 6, 'time': '2100'}, 'open': {'day': 6, 'time': '0700'}}]

In [8]:
entries_with_hours[35].get('hours')

[{'close': {'day': 1, 'time': '2130'}, 'open': {'day': 1, 'time': '1300'}},
 {'close': {'day': 2, 'time': '2130'}, 'open': {'day': 2, 'time': '1300'}},
 {'close': {'day': 3, 'time': '2130'}, 'open': {'day': 3, 'time': '1300'}},
 {'close': {'day': 4, 'time': '2130'}, 'open': {'day': 4, 'time': '1300'}},
 {'close': {'day': 5, 'time': '2130'}, 'open': {'day': 5, 'time': '1300'}}]

In [9]:
entries_with_hours[150].get('hours')

[{'close': {'day': 0, 'time': '1930'}, 'open': {'day': 0, 'time': '0600'}},
 {'close': {'day': 1, 'time': '1930'}, 'open': {'day': 1, 'time': '0530'}},
 {'close': {'day': 2, 'time': '1930'}, 'open': {'day': 2, 'time': '0530'}},
 {'close': {'day': 3, 'time': '1930'}, 'open': {'day': 3, 'time': '0530'}},
 {'close': {'day': 4, 'time': '1930'}, 'open': {'day': 4, 'time': '0530'}},
 {'close': {'day': 5, 'time': '1930'}, 'open': {'day': 5, 'time': '0530'}},
 {'close': {'day': 6, 'time': '1930'}, 'open': {'day': 6, 'time': '0600'}}]

We can see that these hours entries can differ, but they share a common structure. Let's do an analysis to see how many resources follow this format.

In [10]:
from collections import Counter

def validate_hours_format(resources: list):
    """
    This function does the following:
    - Counts the number of hour entries each resource has
    - Counts the number of resources that follow the format of 'close' and 'open' keys with a day and time
    - Prints the results
    :param resources: The list of resources. Must all have hours (so make sure to pre-filter!)
    :return: None
    """
    
    # Count the distribution of hours
    hour_counts = map(lambda r: len(r.get('hours')), resources)
    hour_distribution = Counter(hour_counts)
    print("Distribution of number of days in a resource: " + str(hour_distribution))
    
    bad_entries = []
    for resource in resources:
        days = resource.get('hours')
        for day in days:
            close_time = day.get('close')
            open_time = day.get('open')
            if close_time is None or open_time is None:
                bad_entries.append(resource)
                break
            validate_time = lambda t: 0 <= t.get('day') <= 6 and 0 <= int(t.get('time')) <= 2400
        
            try:
                if not validate_time(close_time) or not validate_time(open_time):
                    bad_entries.append(resource)
                    break
            except Exception:  # malformed or missing day/time values
                bad_entries.append(resource)
                break
            
    print(f"There are {len(bad_entries)} bad resources ({len(bad_entries)*100/len(resources):.2f}%) out of {len(resources)} resources.")
    
validate_hours_format(entries_with_hours)

Distribution of number of days in a resource: Counter({7: 112, 6: 34, 5: 29, 1: 8, 2: 2, 3: 2})
There are 6 bad resources (3.21%) out of 187 resources.


### Some observations from this analysis

- There are some resources that have no hours attached at all
- If they do have hours, they are organized as lists of days, but this ranges from 1 to 7 day entries
- Some of the entries are missing close times, and only have an open time
- All close/open entries have a day (0-6), which represents the day of the week (assuming Sunday = 0?)
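
One caveat on the check above: `0 <= int(time) <= 2400` accepts strings such as `'1275'`, which are in range numerically but are not valid clock times. A stricter sketch follows, assuming times are four-digit 24-hour `HHMM` strings and that `'2400'` should still be accepted as end of day. `is_valid_hhmm` and `is_valid_slot` are hypothetical names:

```python
import re

HHMM = re.compile(r"[0-9]{4}")

def is_valid_hhmm(value) -> bool:
    """True for 'HHMM' strings from '0000' to '2359', plus '2400' (assumed end of day)."""
    if not isinstance(value, str) or not HHMM.fullmatch(value):
        return False
    if value == "2400":
        return True
    hours, minutes = int(value[:2]), int(value[2:])
    return hours <= 23 and minutes <= 59

def is_valid_slot(slot) -> bool:
    """True when a slot has both 'open' and 'close', each with a day 0-6 and a valid time."""
    if not isinstance(slot, dict):
        return False
    for key in ("open", "close"):
        part = slot.get(key)
        if not isinstance(part, dict):
            return False
        day = part.get("day")
        if not isinstance(day, int) or isinstance(day, bool) or not 0 <= day <= 6:
            return False
        if not is_valid_hhmm(part.get("time")):
            return False
    return True
```

Because this check is stricter, it may flag more resources than the six reported above.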

## Analysis of address fields

Within the data, we can see that there are a few fields related to addresses:

- `address` - A street address for the resource
- `city` - The city of the resource
- `zip_code` - The zip code of the resource
- `lat` - The latitude of the resource
- `lon` - The longitude of the resource
- `organization` - The organization that owns the resource
- `gp_id` - The Google Places ID of the resource (you can do a reverse lookup [here](https://developers.google.com/maps/documentation/javascript/examples/geocoding-place-id))

Let's start by seeing how many of these resources have these fields.

In [11]:
ref = db.reference(url=DB_URL)  # reuse the constant defined when the client was set up
all_entries = [e for e in ref.get() if e is not None]
print(f"There are {len(all_entries)} resources in this PHLASK db.")

entries_with_address = [entry for entry in all_entries if entry.get('address') is not None and entry.get('address').strip() != ""]
entries_with_city = [entry for entry in all_entries if entry.get('city') is not None and entry.get('city').strip() != ""]
entries_with_zip_code = [entry for entry in all_entries if entry.get('zip_code') is not None and str(entry.get('zip_code')) != ""]
entries_with_lat = [entry for entry in all_entries if entry.get('lat') is not None and entry.get('lat') != 0]
entries_with_lon = [entry for entry in all_entries if entry.get('lon') is not None and entry.get('lon') != 0]
entries_with_organization = [entry for entry in all_entries if entry.get('organization') is not None and entry.get('organization').strip() != ""]
entries_with_gp_id = [entry for entry in all_entries if entry.get('gp_id') is not None and entry.get('gp_id').strip() != ""]

print(f"There are {len(entries_with_address)} entries with addresses ({len(entries_with_address)*100/len(all_entries):.2f}%).")
print(f"There are {len(entries_with_city)} entries with cities ({len(entries_with_city)*100/len(all_entries):.2f}%).")
print(f"There are {len(entries_with_zip_code)} entries with zip codes ({len(entries_with_zip_code)*100/len(all_entries):.2f}%).")
print(f"There are {len(entries_with_lat)} entries with latitudes ({len(entries_with_lat)*100/len(all_entries):.2f}%).")
print(f"There are {len(entries_with_lon)} entries with longitudes ({len(entries_with_lon)*100/len(all_entries):.2f}%).")
print(f"There are {len(entries_with_organization)} entries with organizations ({len(entries_with_organization)*100/len(all_entries):.2f}%).")
print(f"There are {len(entries_with_gp_id)} entries with Google Places IDs ({len(entries_with_gp_id)*100/len(all_entries):.2f}%).")

There are 274 resources in this PHLASK db.
There are 189 entries with addresses (68.98%).
There are 248 entries with cities (90.51%).
There are 232 entries with zip codes (84.67%).
There are 274 entries with latitudes (100.00%).
There are 274 entries with longitudes (100.00%).
There are 246 entries with organizations (89.78%).
There are 257 entries with Google Places IDs (93.80%).


We can also take a quick look at what these different entries look like.

In [12]:
[e.get('address') for e in entries_with_address]

['1020 Lombard St.',
 'Market St. between 10th & 12th Sts.',
 '1500 Chestnut St.',
 '16th St. & JFK Blvd.',
 '1901 Vine St.',
 '2501 Walnut St.',
 '23 S Christopher Columbus Blvd',
 '2955 Market St.',
 '3601 Walnut St.',
 '1500 Spring Garden St',
 '1500 Spring Garden St',
 '1500 Spring Garden St',
 '1700 S. Broad Street, Unit 201',
 '555 S. 43rd Street',
 '1900 N. 20th Street',
 '321 W. Girard Avenue',
 '131 E. Chelten Avenue',
 '2840 W. Dauphin Street',
 '705 S. 5th St.',
 '1710 E Passyunk Ave',
 '1602 Spruce St.',
 '131 Old Lancaster Road',
 '321 University Avenue',
 '4400 Haverford Avenue',
 '500 S. Broad Street',
 '23 S Christopher Columbus Blvd',
 '2106 S Christopher Columbus Blvd',
 '2110 S Christopher Columbus Blvd',
 '2206 S Christopher Columbus Blvd',
 '23 S Christopher Columbus Blvd',
 '2300 S Christopher Columbus Blvd',
 '29 Snyder Ave',
 '1 Mifflin St',
 '2000 S Swanson St',
 '51 N 12th St',
 '1500 N 50th St',
 '6666 Ridge Ave',
 '3900 Lancaster Ave',
 '4221-29 Market St',


In [13]:
[e.get('city') for e in entries_with_city]

['Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Bala Cynwyd',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Ardmore',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'Philadelphia',
 'P

In [14]:
[e.get('zip_code') for e in entries_with_zip_code]

[19145,
 19104,
 19120,
 19123,
 19144,
 19132,
 19147,
 19148,
 19103,
 19004,
 10194,
 '19104',
 19146,
 '19107',
 '19131',
 19143,
 19106,
 '19003',
 19135,
 19135,
 19122,
 19124,
 19120,
 19120,
 19128,
 19121,
 19121,
 19121,
 19121,
 19122,
 19121,
 19121,
 19123,
 19146,
 19131,
 19131,
 19131,
 19104,
 19104,
 19143,
 19143,
 19104,
 19134,
 19135,
 19135,
 19124,
 19134,
 19134,
 19134,
 19134,
 19140,
 19140,
 19120,
 19138,
 19126,
 19145,
 19143,
 19143,
 19143,
 19151,
 19151,
 19135,
 19138,
 19138,
 19124,
 19124,
 '19144',
 19142,
 19149,
 19125,
 19002,
 19140,
 19143,
 19138,
 19111,
 19143,
 19111,
 19148,
 19147,
 19124,
 19131,
 19131,
 19132,
 19154,
 19104,
 19130,
 19143,
 19140,
 19145,
 19150,
 19125,
 19103,
 19111,
 19130,
 19145,
 19106,
 19145,
 19115,
 19147,
 19114,
 19136,
 19140,
 19132,
 19136,
 19111,
 19153,
 19154,
 19131,
 19128,
 19115,
 19123,
 19103,
 19107,
 19146,
 19149,
 19152,
 19129,
 19104,
 19114,
 19134,
 19118,
 19151,
 19133,
 19123

In [15]:
[e.get('lat') for e in entries_with_lat]

[39.943787,
 39.9525,
 39.950861,
 39.954101,
 39.959269,
 39.952195,
 39.947466,
 39.95578,
 39.953382,
 39.9622221,
 39.962293,
 39.9623835,
 39.929181,
 39.948857,
 39.983632,
 39.970143,
 40.037962,
 39.990727,
 39.94005,
 39.928485,
 39.947317,
 40.001739,
 39.947333,
 39.96225,
 39.944297,
 39.902329176510136,
 39.91981266450526,
 39.91885839345153,
 39.917115817628755,
 39.91563704822793,
 39.91596279208233,
 39.923320372467465,
 39.92411526481998,
 39.92254648240829,
 39.95362880647623,
 39.97776846261271,
 40.03993018,
 39.96164004,
 39.95820598,
 39.95813417,
 39.97380358,
 39.96042156,
 39.93146051,
 39.95414737,
 39.93124389,
 40.01449636,
 39.986008408122665,
 39.97676301,
 39.98726035,
 40.04955653,
 40.05905883,
 40.08059935,
 39.953902,
 40.007543,
 39.992697,
 39.967240055434665,
 39.9804902975695,
 39.97919520469946,
 39.971695729140244,
 39.96960035510774,
 40.027686,
 40.02779,
 39.973224,
 40.023873,
 40.021384,
 40.021378,
 40.034027,
 39.987278,
 39.987266,
 39.9

In [16]:
[e.get('lon') for e in entries_with_lon]

[-75.159048,
 -75.158056,
 -75.165866,
 -75.166818,
 -75.170716,
 -75.180653,
 -75.183711,
 -75.181968,
 -75.195194,
 -75.1643174,
 -75.1641745,
 -75.1635925,
 -75.169272,
 -75.208927,
 -75.167033,
 -75.14186,
 -75.173098,
 -75.179025,
 -75.150955,
 -75.165337,
 -75.168257,
 -75.239045,
 -75.198113,
 -75.21102,
 -75.165706,
 -75.1485013961792,
 -75.143228305418,
 -75.14283707270079,
 -75.14127848082069,
 -75.13942835053376,
 -75.13871427120705,
 -75.14519768090526,
 -75.14611676864189,
 -75.14690828504962,
 -75.15868786689634,
 -75.22195019461691,
 -75.22466851,
 -75.20043271,
 -75.20749114,
 -75.24818842,
 -75.25908428,
 -75.143069,
 -75.1623555,
 -75.23283566,
 -75.18872768,
 -75.07590834,
 -75.20183763523539,
 -75.21274045,
 -75.17470372,
 -75.16003691,
 -74.99839391,
 -74.98926787,
 -75.142927,
 -75.290701,
 -75.194197,
 -75.18336724332909,
 -75.19882121259286,
 -75.19633446198884,
 -75.19019653264229,
 -75.18733461599174,
 -75.049327,
 -75.049119,
 -75.145755,
 -75.080084,
 -75.12

In [17]:
[e.get('organization') for e in entries_with_organization]

['Seger Recreation Center',
 'Jefferson Station',
 'Staples',
 'Suburban Station',
 'Parkway Central Library',
 'Schuylkill Banks',
 'Southeast Pollution Control Plant',
 '30th Street Station',
 'University of Pennsylvania Book Store',
 'Harrisburg University Philadelphia Location',
 'Health Center #2',
 'Health Center #3',
 'Health Center #5',
 'Health Center #6',
 'Health Center #9',
 'Health Center #12 (Strawberry Mansion)',
 'Plenty Cafe - Queen Village',
 'Plenty Cafe - East Passyunk',
 'Plenty Cafe - Rittenhouse',
 'Bala Cynwyd Library',
 "Medical Examiner's Office",
 'Health Center #4',
 'Health Administration Building',
 'Southeast Pollution Control Plant',
 "Lowe's",
 'Raymour & Flannigan',
 'Ikea',
 'Southeast Pollution Control Plant',
 'Best Buy',
 'Acme',
 'Target',
 "Marshall's",
 'Reading Terminal Market',
 "Lowe's - West Philly",
 'Police 5th District & Fire Ladder 30',
 'Police 16th District',
 'Fire Engine 5 / Ladder 6 & L&I West District Office & Fuel Site 225',
 'Cob

In [18]:
[e.get('gp_id') for e in entries_with_gp_id]

['ChIJocPgsybGxokR3QyYHHmQ118',
 'ChIJSVdgvSnGxokRWJqLCbC4xzQ',
 'ChIJCSet3i_GxokRecQeyoGrAl0',
 'ChIJ9Sdt-zHGxokRad5acsk-ifo',
 'ChIJ_x-wODPGxokREDE-Lq4X9dE',
 'ChIJ1QhLoknGxokRY1BxIaMBmEY',
 'ChIJK-2pcVrPxokRsUB6CwKx3lk',
 'ChIJKxHG307GxokRP23dRyxLy64',
 'ChIJP1z1qVDGxokRvUMnTTzBxdM',
 'ChIJcZR_YNLHxokRDNZxCexFgYk',
 'ChIJ87U4TdLHxokRoghtMn8iA2o',
 'ChIJ87U4TdLHxokRoghtMn8iA2o',
 'ChIJvw4u5v7JxokRVlUweG4Qhx0',
 'ChIJubVbRAXGxokR6Cu3BPxfIw8',
 'ChIJVSBvHTvGxokRz0C0_VaUxyM',
 'ChIJB0P3_1DHxokRepHAZTGyjfY',
 'ChIJmSU991jGxokRGMSyRlN5gLA',
 'ChIJVaMsViPGxokRoYdepfdi_5U',
 'ChIJK-2pcVrPxokRsUB6CwKx3lk',
 'ChIJO95SFrLIxokRv38358hdaaI',
 'ChIJ3x7cPLLIxokRERcNvsd1j9I',
 'ChIJXatc2LLIxokRh-hZFBX8EWM',
 'ChIJK-2pcVrPxokRsUB6CwKx3lk',
 'ChIJY74fK7PIxokRfcHUUcHx5UE',
 'ChIJY_vkQq7IxokRYLyz37r5W8A',
 'ChIJiVZ4I6nIxokR2uGzPMAI5uA',
 'ChIJuWWRaDfGxokR0mYy349nqGU',
 'ChIJCQH7WCnGxokRAWsd3AfQj80',
 'ChIJ3dwlnhfHxokRGAtTKSmcm5Q',
 'ChIJ07J1fue4xokRQbv_61o6Yhk',
 'ChIJh7WdcVXGxokRaAJQjqmiF7A',
 'ChIJb0
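
The output above contains repeated IDs (for example, `ChIJK-2pcVrPxokRsUB6CwKx3lk` appears three times in the visible portion), so several resources may point at the same place. A minimal sketch for listing such values; `duplicate_values` is a hypothetical helper, and the sample IDs are copied from the output above:

```python
from collections import Counter

def duplicate_values(entries, field):
    """Return {value: count} for non-empty values of `field` shared by 2+ entries."""
    counts = Counter(
        entry.get(field) for entry in entries if entry.get(field) not in (None, "")
    )
    return {value: n for value, n in counts.items() if n > 1}

# Small sample built from IDs that appear in the output above.
sample = [
    {"gp_id": "ChIJK-2pcVrPxokRsUB6CwKx3lk"},
    {"gp_id": "ChIJ87U4TdLHxokRoghtMn8iA2o"},
    {"gp_id": "ChIJ87U4TdLHxokRoghtMn8iA2o"},
    {"gp_id": "ChIJK-2pcVrPxokRsUB6CwKx3lk"},
    {"gp_id": "ChIJvw4u5v7JxokRVlUweG4Qhx0"},
    {"gp_id": ""},
]
duplicates = duplicate_values(sample, "gp_id")
```

Shared IDs are not necessarily errors: several taps in one building (such as the three `1500 Spring Garden St` entries) could legitimately share a place.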

### Some observations from this analysis

- All of the resources seem to have valid latitude and longitude entries (at least from a data perspective)
- Zip codes are inconsistent: some values look implausible for the Philadelphia area (for example, `10194`), some are empty strings, and the type is a mix of numbers and strings
- Most of the resources have Google Places IDs, which could be helpful with data sanitization
- Sometimes a resource will have an address, and sometimes it will have an organization
- Some addresses are not real street addresses but approximate locations, such as "Market St. between 10th & 12th Sts."
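
As one possible cleanup for the zip code inconsistencies noted above, the sketch below converts every value to a five-digit string and rejects anything else. It assumes that valid PHLASK zip codes start with `19` (the Philadelphia region), which would need confirming before use. `normalize_zip` is a hypothetical helper:

```python
import re

def normalize_zip(value, expected_prefix="19"):
    """Return the zip code as a 5-digit string, or None if it looks invalid."""
    if value is None or isinstance(value, bool):
        return None
    text = str(value).strip()
    if not re.fullmatch(r"[0-9]{5}", text):
        return None
    # Assumption: zip codes outside the expected prefix are data-entry errors.
    if not text.startswith(expected_prefix):
        return None
    return text
```

Storing the result as a string matches the `'type': 'string'` rule used for `zip_code` in the schema below.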

# Coming up with a schema for the data

This section is a WIP. After analyzing the data, we can come up with an official schema and validate the resources against it.

In [19]:
from cerberus import Validator

schema = {
    'access': {'type': 'string', 'allowed': ['Restricted', 'Public']},
    'address': {'type': 'string'},
    'city': {'type': 'string', 'allowed': ['Philadelphia']},
    'zip_code': {'type': 'string'}, # Needs more validation
    'description': {'type': 'string'},
    'filtration': {'type': 'string', 'allowed': ['Unsure', 'No']},
    'gp_id': {'type': 'string'},
    'handicap': {'type': 'string', 'allowed': ['Unsure']},
    'images': {'type': 'list', 'schema': {'type': 'string'}},
    'lat': {'type': 'float'},
    'lon': {'type': 'float'},
    'norms_rules': {'type': 'string'},
    'organization': {'type': 'string'},
    'permanently_closed': {'type': 'boolean'},
    'phone': {'type': 'string'}, # Add stricter validation here
    'quality': {'type': 'string'}, # Add stricter validation here
    'service': {'type': 'string', 'allowed': ['Self-serve']},
    'statement': {'type': 'string'},
    'status': {'type': 'string', 'allowed': ['OPERATIONAL']},
    'tap_type': {'type': 'string', 'allowed': ['Drinking fountain and water dispenser', 'Drinking Fountain']},
    'tapnum': {'type': 'number'},
    'vessel': {'type': 'string', 'allowed': ['No']},
}

v = Validator(schema, require_all=True)
v.validate(all_entries[5])
v.errors

{'hours': ['unknown field']}