
Special groups aggregate metrics #569

Closed
dirkdk opened this issue Aug 27, 2020 · 23 comments · Fixed by #587 or #607
Labels: Metrics (Related to the Metrics API and related topics), privacy (Implications around privacy for the attention of the OMF Privacy Committee)
Milestone: 1.1.0

Comments

@dirkdk
Contributor

dirkdk commented Aug 27, 2020

Is your feature request related to a problem? Please describe.

Cities often request data on how many lower income users we reach with our vehicle service, and how many trips such users take. MDS does not currently support user segmentation. We would oppose attaching any user data to the Trips endpoint as it would involve information close to Personally Identifiable Information (PII) and would make it fairly easy to identify individuals by trip route and user segment. Instead, we propose the solution of providing aggregate trip data by user segment.

Describe the solution you'd like

We propose that Providers provide aggregated data on trips by special user segments, using the new Metrics API. With aggregation, it should be impossible to trace this data back to individuals. This does mean we need to set meaningful minimums for certain metrics so that the aggregated data preserves k-anonymity.

The new Metrics API specifies the parameters `name`, `since`, `interval`, and `dimensions`, which we will assume these metrics support.
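As an illustration, a query using those parameters might be assembled as below. The parameter names follow the proposal text; the values and the exact payload shape are assumptions, not the finalized Metrics API.

```python
import json

# Hypothetical Metrics API query for one of the proposed special-group
# metrics. Parameter names (name, since, interval, dimensions) come from
# the proposal; values and structure are illustrative assumptions only.
query = {
    "name": "trips.average_duration",
    "since": "2020-08-01T00:00:00Z",  # start of the reporting window
    "interval": "P1D",                # one-day buckets (ISO 8601 duration)
    "dimensions": ["special_group"],  # break results out per group
}

print(json.dumps(query, indent=2))
```

The `dimensions` entry is what would distinguish a per-group request from an overall-aggregate request.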

Proposed metrics for special groups:

| Metric | Description |
| --- | --- |
| `special_groups` | Array of names of special groups as served by the Provider. This list will be fairly static. Example values are `low_income` or `students`. |
| `active_users[special_group]` | Total number of users in the given group with at least 1 trip in the interval |
| `trips[special_group]` | Count of trips by users of the given special group during the interval |
| `trips.average_duration[special_group]` | Average duration in seconds of trips by users of the given special group during the interval |
| `trips.median_duration[special_group]` | Median duration in seconds of trips by users of the given special group during the interval |
| `trips.std_duration[special_group]` | Standard deviation of duration in seconds of trips by users of the given special group during the interval |
| `trips.average_distance[special_group]` | Average distance in meters of trips by users of the given special group during the interval |
| `trips.median_distance[special_group]` | Median distance in meters of trips by users of the given special group during the interval |
| `trips.std_distance[special_group]` | Standard deviation of distance in meters of trips by users of the given special group during the interval |
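The per-group aggregates and the k-anonymity suppression can be sketched in Python. This is illustrative only, not provider code: the trip records are made up, and the k floor of 5 is a placeholder (the thread discusses 5 and 10 as candidate values).

```python
import statistics

# Example trip records: (user_id, special_group, duration_s, distance_m).
# All values are fabricated for illustration.
trips = [
    ("u1", "low_income", 300, 900),
    ("u2", "low_income", 420, 1200),
    ("s1", "students", 540, 1800),
    ("s2", "students", 600, 2000),
    ("s3", "students", 600, 2000),
    ("s4", "students", 600, 2000),
    ("s5", "students", 660, 2200),
]

K_MIN = 5  # assumed k-anonymity floor: suppress groups with fewer distinct users

def aggregate(group):
    rows = [t for t in trips if t[1] == group]
    users = {t[0] for t in rows}
    if len(users) < K_MIN:
        return None  # too few distinct users: suppress the whole slice
    durations = [t[2] for t in rows]
    distances = [t[3] for t in rows]
    return {
        "active_users": len(users),
        "trips": len(rows),
        "trips.average_duration": statistics.mean(durations),
        "trips.median_duration": statistics.median(durations),
        "trips.std_duration": statistics.pstdev(durations),
        "trips.average_distance": statistics.mean(distances),
        "trips.median_distance": statistics.median(distances),
        "trips.std_distance": statistics.pstdev(distances),
    }

# Only 2 distinct low_income users, so that slice is suppressed entirely.
print(aggregate("low_income"))  # -> None
print(aggregate("students"))    # -> full aggregate dict (5 distinct users)
```

Suppressing the entire slice, rather than individual measures, is what prevents small groups from being singled out.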

Overall aggregate statistics

For overall usage we can do what is listed below. Please note that this data should be derivable from MDS trip data, with the exception of active_users.

| Metric | Description |
| --- | --- |
| `active_users` | Total number of all users with at least 1 trip in the interval |
| `trips` | Count of trips by all users during the interval |
| `trips.average_duration` | Average duration in seconds of trips by all users during the interval |
| `trips.median_duration` | Median duration in seconds of trips by all users during the interval |
| `trips.std_duration` | Standard deviation of duration in seconds of trips by all users during the interval |
| `trips.average_distance` | Average distance in meters of trips by all users during the interval |
| `trips.median_distance` | Median distance in meters of trips by all users during the interval |
| `trips.std_distance` | Standard deviation of distance in meters of trips by all users during the interval |

Is this a breaking change

A breaking change would require consumers or implementors of the API to modify their code for it to continue to function (ex: renaming of a required field or the change in data type of an existing field). A non-breaking change would allow existing code to continue to function (ex: addition of an optional field or the creation of a new optional endpoint).

  • No, not breaking

Impacted Spec

For which spec is this feature being requested?

  • Metrics, but only served by Providers; only Providers will have the raw data

Describe alternatives you've considered

Alternatives would be to add user segments to individual trips in the Trips API. We oppose this method as it would make user identification extremely easy. We currently send this data to cities via manually compiled Excel sheets, and it would be better to have an official API.

@schnuerle schnuerle added this to the 1.1.0 milestone Aug 27, 2020
@schnuerle
Member

schnuerle commented Aug 28, 2020

Some notes from WG meeting on Aug 27:

  • This would allow reporting on data beyond what MDS provides, such as usage of low-income/student plans. This would be data that only the Provider would be able to calculate, but it avoids privacy issues by not sharing through MDS the underlying data used to calculate these metrics.
  • These special groups could be defined in the spec, with the option of other special groups definable between a city and a provider.
  • Timeframe and geography would need to be further defined. These should not be so granular that individuals could be identified. Could use Geography API to define areas and start, end, traversal. Time could be day, week, month.
  • k-anonymity values would be key (eg, 5, 10), where a minimum value would have to be returned or else no data is shown.
  • There is likely a common list of these "special group" metrics across cities, but there would need to be room for flexibility to support additional metrics. Having a standard would provide that flexibility.

@schnuerle schnuerle added the Metrics Related to the Metrics API and related topics label Aug 28, 2020
@johnclary
Contributor

johnclary commented Aug 28, 2020

@schnuerle can we have a Privacy label on this one, please?

Re:

> We would oppose attaching any user data to the Trips endpoint as it would involve information close to Personally Identifiable Information (PII) and would make it fairly easy to identify individuals by trip route and user segment. Instead, we propose the solution of providing aggregate trip data by user segment.
>
> Timeframe and geography would need to be further defined. These should not be so granular that individuals could be identified. Could use Geography API to define areas and start, end, traversal. Time could be day, week, month.

Yes—there are significant privacy concerns with adding any additional PII to MDS. In my opinion, this is something that feels beyond the scope of what MDS should ever support. Can we not leave it up to agencies to use their own data—presumably collected responsibly (e.g. through surveys or the US Census)—to conduct this kind of analysis?

If providers are already holding this kind of information: why? can you stop?

@schnuerle schnuerle added the privacy Implications around privacy for the attention of the OMF Privacy Committee label Aug 28, 2020
@joshuaandrewjohnson1

joshuaandrewjohnson1 commented Aug 28, 2020

@johnclary the original intention with this idea was to ensure that no personally identifiable information of specific user groups would flow through MDS, nor would their individual trip data, but rather aggregated information (such as overall number of low-income program users or overall number of trips), which would be calculated by the provider before sharing with a city.

This is data that almost every city already asks for, and is mostly facilitated via custom reports, so there would be benefit to both providers and cities in standardizing.

@schnuerle
Member

Maybe a clarification from the call that could be added to the main description here: the 'low income' group from providers means any riders using an equity plan, eg Bird Access, Spin Access, Bolt Forward, etc. So this is known by the providers per rider as part of operations and billing for discounts.

Note the k-anonymity part which says if a query returns too few results based on filters/time/geography, no data is returned for that slice.

@johnclary
Contributor

johnclary commented Aug 28, 2020

If I'm reading these materials correctly, it is being proposed that this API make it possible to report on the average duration and distance of trips taken in an arbitrary geography by riders who are unbanked.

What's missing is a specific explanation of why an agency would want to collect this data and what meaningful action they would expect to take as a result of it.

Assuming we all agree that the "special groups" being discussed here are people who have a disproportionately high risk of harm if their PII were to be leaked, who generally have fewer mobility choices, and who would almost definitely prefer that service providers did not hold this kind of data about them—this warrants an extremely well-defined use-case.

Regardless of whether or not providers are currently collecting this data, do we as agencies really want to reinforce that practice based on the case that there might be some useful information for an as-yet unproven technocratic purpose?

@johnclary
Contributor

What really bothers me about this proposal is that the development of MDS continues to default to collecting as much data as possible on the premise that it will be theoretically useful at some point. There remain very few stories to tell of cities doing right by their publics as a result of this casting of an extremely wide collection net.

@marie-x
Collaborator

marie-x commented Aug 28, 2020

> What really bothers me about this proposal is that the development of MDS continues to default to collecting as much data as possible on the premise that it will be theoretically useful at some point. There remain very few stories to tell of cities doing right by their publics as a result of this casting of an extremely wide collection net.

That's not how I read Dirk's proposal. It is from an employee of a mobility provider. Spin's assertion is that many cities have specific purposes for this data, they are already making these requests of mobility providers, and that this would be an (optional!) way to standardize the requests that are already being made.

Also I am not sure I agree with the characterization that MDS "defaults to collecting as much data as possible". The Metrics API is clearly less data, explicitly de-identified. The OMF privacy committee is also publishing standards for privacy protection including data deletion principles.

I agree that some city representatives should join Dirk/Spin to describe the use cases and public-policy aims for this information.

@schnuerle
Member

schnuerle commented Aug 31, 2020

For the sake of documenting what cities are currently asking for in terms of monthly reporting, I went through every city permit/policy application I could find from our Cities Using MDS list and summarized very broadly the relevant, documented asks.

These are mostly in the terms of monthly reports, and can include MDS derived data (most cities are asking for aggregated MDS data in addition to the API feeds).

I'm sure there are more examples, but I could not find them online. If anyone has more cities please add them in the comments. The mobility service providers like Spin and Bird may have a more comprehensive list of the things cities are requiring, and could provide an aggregated (not broken down by city) list.

List Notes

I put special user groups in bold, and equity/zone area info in italics.

Note some cities are asking for aggregated demographics.

Almost all cities also ask for complaints, safety, collision, injury, and vandalism numbers, and these seem like something missing in the proposals so far.

Outside of the scope of this are system alerts, pricing plans, and hours of operation, which gets back to a brief conversation we had in the last Provider WG call around a kind of 'Policy for Providers' API idea.

City List

Austin

  1. Complaints: number, nature, time to remedy
  2. Collision: number, severity, location, time

Denver

  1. App: downloads, active users, repeat users
  2. Trips: total, day of week, time, distance
  3. Origin/Destination
  4. Opportunity areas: in and out
  5. Parking: transit/bus stop compliance
  6. Theft/Vandalism
  7. Maintenance Reports
  8. Complaints
  9. Special groups: low income, students
  10. Collision
  11. Payment Method

Detroit

  1. Utilization rates
  2. Special groups: Membership volumes
  3. Trips: volumes, day of week, time of day, origins, destinations, distance
  4. Parking: compliance rates
  5. Theft/vandalism
  6. Maintenance reports
  7. Complaints
  8. Collision

Kelowna, Canada

  1. Trip
  2. Parking
  3. Incidents
  4. Maintenance
  5. Collision
  6. Survey: 3-20 questions about riders

Long Beach

  1. Riders: total unique number
  2. Parking zone: counts, percent, incentives
  3. Trip: distance
  4. Devices: serviced, lost, stolen, missing, impounded, make/model/count
  5. Revenue: membership, penalty, per minute, per ride
  6. Injuries
  7. Legal actions
  8. Complaints: resolutions
  9. Community Outreach
  10. Special Events
  11. Maintenance
  12. Geographic distribution

Louisville

  1. Rides
  2. Vehicles
  3. Parking: performance, preferred, designated
  4. Operator Staff Levels
  5. Complaints
  6. Vandalism
  7. Collision: injury, fatality
  8. Gender
  9. Age: 8 buckets
  10. Equity Distribution

San Francisco

  1. System alerts
  2. Pricing plans
  3. Hours of operations
  4. Complaints: response times, wait times, nature
  5. Demographics
  6. Injury
  7. Low income participation (communities of concern, low income, cash options)
  8. Distribution stats
  9. Parking and rebalancing
  10. Energy sources
  11. Outreach activities
  12. Counts, trips, length, revenue
  13. Training

San Jose

  1. Complaints
  2. Parking
  3. Collisions: location, # involved, severity, response
  4. Maintenance

Santa Monica

  1. Collisions
  2. Injuries
  3. Operations
  4. Complaints: responses
  5. Maintenance
  6. Education/outreach
  7. Deployment zones: percent targets

Washington DC

  1. Trips: total, per vehicle, average, distance
  2. Equity areas: start/end
  3. Violations: response time
  4. Equity programs
  5. Safety
  6. Education
  7. Parking: locations, violations, incentives
  8. Idle Time: per vehicle
  9. Vehicles: days in service, distance (mean, median, StdDev), decommissioned
  10. Repairs: lights, seats, brakes, gears, locks, frame, other
  11. Complaints: time, vehicle, location, severity, collision
  12. Special users: total, low income (signups, rides, count, minutes, miles)

Cities with low income participation count requirements:
Denver, Detroit, DC, San Francisco. Also Portland, Chicago but I don't see it explicitly in their policy.

Cities with equity distribution count requirements:
DC, Santa Monica, Louisville, Long Beach, Denver. Many more, just not called out in monthly reports area of their policy

@johnclary
Contributor

to be clear, I support the metrics API and really appreciate @dirkdk's work on it. but are there member agencies pushing for these "special groups" metrics as they're currently being proposed? would like to hear from them.

@dirkdk
Contributor Author

dirkdk commented Sep 11, 2020

Cities generally have equity as a key goal and basic mandate, working to ensure that all its residents have the opportunity to be on equal footing. We as an operator have this same goal, which is the reason we established our Access program, providing low-income pricing and a means to use the service without a smartphone or credit card. We track these metrics internally to gauge the effectiveness of our Access program, and inform decisions on any needed adjustments to the program. Cities essentially do the same thing, working to ensure any pilots or programs they deliver are working to achieve equitable outcomes, and evaluating operators to that end.

Pretty much all cities where we operate require the metrics included in this proposal as part of our data-sharing terms, although the data is generally shared in aggregated form via custom spreadsheet or PowerPoint reports. As stated in the original issue and on the working group call when this was discussed, Spin feels strongly about the need to protect the privacy of users, and believes this proposal strikes a good balance between sharing data with cities and maintaining privacy.

Use cases should be driving changes to MDS, and we do see a clear use case here for sharing aggregated data on equity programs. Maybe a solution could be that this issue be solely focused on that equity program data, and the other groups or categories of data mentioned in the replies to this issue should be tabled until a clear and meaningful use case is presented. It would also be good to hear another operator's perspective such as @bhandzo from Bird.

@dirkdk and @joshuaandrewjohnson1, Spin

@alexdemisch
Collaborator

alexdemisch commented Sep 23, 2020

Updating the list of metrics from San Francisco's scooter program that are reported monthly via an excel file and calling out the special groups:

  • Unique users
  • Collisions
  • Safety Training Log
  • Complaints
  • Outreach activities
  • Non-revenue device VMT by type (i.e., how much VMT due to trips generated by redistributing, recharging, and other operational activities)
    Special Groups
    • Low Income Plan
      • Active memberships per month
      • Unique users
      • Trips on low-income plan
    • Adaptive Device Pilot
      • Unique users
      • Total trips

There is a more exhaustive description (although this list may not exactly be what is reported every month).

It's great that our permittees offer low-income plans, but the goal of knowing how many members and trips are taken on that plan is to get a sense of how well those low-income plans are actually being used. Similarly, we have a requirement for an adaptive scooter pilot, and we'd like to understand how those devices are being utilized relative to the rest of the program. These aggregate metrics are not the only methods of evaluating their respective programs, but they are certainly helpful.

@sharades

sharades commented Sep 24, 2020

@johnclary this is information that cities require to regulate. DC had been receiving this in quarterly reports for fleet increases and then with a newer regulation in weekly reports as operators requested fleet increases. Cities are asking for this in a non-standard way, so adding it to MDS would in theory make things easier for the operators, too, as @dirkdk noted in the initial PR.

Our equity plans are both geography-based and user-group-based, the same as in several other cities. For geography-based equity, we require deployment in specific areas, and this can already be monitored through MDS. For user-group-based equity, we require that companies have low-income customer plans that give free unlimited trips to those at 200% or less of the federal poverty level. Without some level of reporting on usage, we only know that the plans are offered, but not whether there is any uptake (which can speak to the success of the providers' marketing of and support for said plans). There are distinctions between user-based programs and geographic programs: you use both to enable usage by groups that might generally have lower access. However, we don't necessarily expect that the low-income plan usage will be concentrated in the geographic equity areas.

Per @alexdemisch knowing the level of program usage would allow for better regulation. We might want to require that a certain percent of trips come from low-income plans as a condition of operating (e.g. operators must demonstrate that at least 1% of all trips were from low-income plan users). Being able to see overall usage levels would be critical for tracking that.
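The percentage check described above is simple arithmetic; here is a minimal sketch, where the function name and the 1% default are hypothetical (the 1% figure is the example threshold from the comment, not an actual regulation):

```python
def meets_equity_threshold(low_income_trips: int, total_trips: int,
                           threshold: float = 0.01) -> bool:
    """Return True if low-income-plan trips meet the required share
    of all trips. The 1% default mirrors the hypothetical example."""
    if total_trips == 0:
        return False  # no trips at all cannot satisfy the requirement
    return low_income_trips / total_trips >= threshold

print(meets_equity_threshold(150, 10_000))  # 1.5% -> True
print(meets_equity_threshold(50, 10_000))   # 0.5% -> False
```

Only the two aggregate counts are needed for this check, which is the point of the proposal: no trip-level data is required.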

The information about the origins and destinations and waypoint movements of low-income plan users is not something that DC would like to have. At a high level, we’d like to know a little bit more about the characteristics of the geography where the trip is occurring. Being able to query an API, with a minimum K-anonymity value, would be a good scenario.

@PlannerOnTheGo

@nicklucius

The City of Chicago has requirements that mobility providers offer low-income and equity programs to ensure that new mobility options are available for all of our residents. We have a program for the City's own system (Divvy for Everyone) as well as program requirements as a condition of permitting for other public mobility providers. We also engage in evidence-based policy evaluation using the mobility data we collect (e.g. ridehail congestion study and scooter pilot evaluation).

A core government function is to enroll individuals in programs and evaluate the policies that govern those programs to check and improve their effectiveness. Collecting data related to low-income and equity mobility programs is necessary in order to measure trends and determine whether the policies and requirements are having their intended effect, and whether changes need to be made. Our experience with the study and evaluation linked above is that pre-aggregated data limits the insights that can be generated in a study, and therefore limits the effectiveness of policy actions.

The proposal in this issue would help MDS provide a better value when it comes to measuring the effectiveness of low income/equity policies, but would not allow for the level of analysis presented in the linked documents. Trip-level data would be the best way to measure and study the reach and equity of mobility services as well as the effect of policies.

@schnuerle
Member

We reviewed this issue as part of the second OMF Working Group Steering Committee release Checkpoint. Both WGSCs had some feedback and I'm documenting it here for discussion.

How can cities trust aggregated (non-MDS-derived special-groups) data? It might not be possible, but I wanted to ask for ideas since it is a concern.

@schnuerle schnuerle pinned this issue Oct 2, 2020
@johnclary
Contributor

johnclary commented Oct 2, 2020

Appreciate the cities chiming in. @nicklucius, re:

> Our experience with the study and evaluation linked above is that pre-aggregated data limits the insights that can be generated in a study, and therefore limits the effectiveness of policy actions...Trip-level data would be the best way to measure and study the reach and equity of mobility services as well as the effect of policies.

Can you expand on why a pre-calculated geographic aggregate is not sufficient for your purposes? Do you see some trade off in terms of the value of whatever level of analysis you're trying to conduct versus the risks involved with collecting and storing personal data?

I don't know how to measure this kind of trade off, but it seems to me that the most sound approach to mobility data collection for planning purposes is to start with the bare minimum (e.g. aggregates at some geo resolution) and see how far that gets you. If you're not certain about the insights you'll derive from raw data, even less certain about the policy decisions that follow, and even less certain about the impacts of those policies, it becomes increasingly harder to justify the privacy risks involved with collecting raw data in case it's useful.

@nicklucius

nicklucius commented Oct 2, 2020

@johnclary Sure, I'm happy to. These reports are good examples of what I'm talking about: our ridehail congestion study and scooter pilot evaluation. We have aggregated, privacy-protected datasets for the underlying data made available to the public on our data portal here and here. If you tried to replicate the analysis in the reports using the pre-aggregated data, you would not be able to calculate many of the metrics or recreate many of the maps. That is because the pre-aggregated data removes the granularity necessary to run the queries and analytics that produced the findings and allowed us to fully evaluate the programs, recommend policies, and share it all with the public.

We only want to collect the data that is necessary for our purposes and will produce important insights. We are always monitoring our mobility data collection standards, and we do refrain from collecting what we do not need, and have stopped collecting data we previously collected once we realized it was not producing a needed benefit. For example, see how our scooter data collection rules changed from 2019 to 2020.

@schnuerle
Member

Note that for 1.1.0 we have merged with #582 the new Geography API to the 'dev' branch. Please update this pull request with the latest code, resolve any conflicts, and make references to the Geography API where appropriate, e.g. with UUIDs.

We will be discussing Special Groups at this week's Working Group meeting, so if available please come prepared to talk about your latest updates and ideas.

This was referenced Oct 12, 2020
@schnuerle
Member

Would love to see a PR for this before our Thursday Working Group call this week, so we can discuss it on the call @dirkdk @joshuaandrewjohnson1. I can help if needed.

I've made a feature branch called 'feature-metrics' to start pulling all the Metrics related work together and do PRs against.

Also see comment on #486.

@dirkdk
Contributor Author

dirkdk commented Oct 13, 2020

ok I will work on that

@schnuerle schnuerle linked a pull request Oct 16, 2020 that will close this issue
@schnuerle
Member

The new 'feature-metrics' is ready and has #486 and #487 incorporated into it. Per the WG call, I will incorporate the proposed ideas here into that branch, then report back.

Here are the relevant meeting notes from the call last Thursday.

Special Groups

  • See previous doc about this
  • From Spin: providers are already sending this data and more to cities in spreadsheets, in different formats and methods and report formats
  • This data will not be added to the raw trips endpoint
  • Will be incorporated as part of the core metrics
  • We should call out in the spec that application of this data should be tied to specific use cases.
  • K-anonymity values should be across all of Metrics to limit re-identification risks.
  • OMF will establish a minimum in the spec. Maybe different values for different metrics? Spatial/temporal differences between cities can affect values
  • Spin, E&A, and OMF can work on proposed values.
  • No field will be in the spec to pass in a k-value, but maybe the k-value can be returned in the payload so consumers know what it is for that request
  • Low income only for this release. Adaptive vehicles, students, unbanked, subscribers, access methods, etc are other possible future groups, but may have so few trips as to not appear in most requests due to k-value limits.
  • Spin + San Francisco can come up with example use cases for adaptive vehicles in the future, showing how and why that's measured and how the data could be used. Could be part of a webinar or blog post with more details.

@schnuerle
Member

I believe I've captured all of the metrics mentioned here in the Metrics branch, so please review.

There is a new dimension and filter for special_group_type which should meet all of your requirements too. Please review.

There is also the start of a data redaction section which talks about k-values (which I've set as 10 across the board for now) and needs some thought behind it (what should the value be, how should it be calculated, how can it differ across Metrics?). The k-value also comes back in the query response.
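A minimal sketch of that redaction behavior, assuming k = 10 as stated, with the k-value echoed in the query response. The `special_group_type` name comes from the comment above; the other field names (`count`, `k_value`, `rows`) are assumptions, not the merged spec:

```python
K_VALUE = 10  # redaction threshold set across the board for now

def redact(rows, k=K_VALUE):
    """Null out measure values for any row whose count falls below k,
    and echo the k used so consumers know the threshold applied."""
    out = []
    for row in rows:
        if row["count"] < k:
            # Suppress the measures but keep the dimension so consumers
            # can tell a redacted slice from an absent one.
            row = {**row, "count": None, "average_duration": None}
        out.append(row)
    return {"k_value": k, "rows": out}

resp = redact([
    {"special_group_type": "low_income", "count": 42, "average_duration": 410},
    {"special_group_type": "adaptive", "count": 3, "average_duration": 520},
])
print(resp["k_value"])           # -> 10
print(resp["rows"][1]["count"])  # -> None (3 < 10, slice redacted)
```

Returning the k-value in the payload matches the working group note that no k-value field is accepted on input, but consumers should be able to see what threshold was applied.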

@schnuerle
Member

We will be aligning this back to the original proposal intent and pull it out of Metrics. Look for a PR soon.

@schnuerle schnuerle linked a pull request Dec 10, 2020 that will close this issue
@schnuerle schnuerle linked a pull request Dec 11, 2020 that will close this issue
@schnuerle
Member

This now has a solution for 1.1.0, as a beta feature serving a relatively simple static CSV file, with PR #607. In the next release I think we should gather feedback and ideas on how to expand this, either in a more dynamic API way and/or with more fields/options, to align more closely with the original issue description and meet more existing provider/agency monthly-report use cases.

@schnuerle schnuerle unpinned this issue Jan 27, 2021