Step 0: Problem Understanding – What is the problem? Who does it impact and how much? How is it being solved today and what are some of the gaps?
Step 1: Goals – What are the goals of the project? How will we know if our project is successful?
Step 2: Actions – What actions or interventions will this work inform?
Step 3: Data – What data do you have access to internally? What data do you need? What can you augment from external and/or public sources?
Step 4: Analysis – What analysis needs to be done? Does it involve description, detection, prediction, optimization, or behavior change? How will the analysis be validated?

#### Step 0: Understand the problem

- What is the scope?
- What is the problem?
- Who is impacted by the problem?
- How many are impacted by the problem?
- How much are they impacted by the problem?
- Why is the problem a priority now?
- How has the org tried tackling the problem up until now?
- Who are the internal and external stakeholders?
    

#### Step 1: Define the goal(s)

- Concrete, not abstract (e.g. "increase HS graduation rates", not "help students").
- Possibly conflicting goals:
    - Efficiency (e.g. help the largest number of people in need with limited resources)
    - Effectiveness (e.g. maximize the total improvement in outcomes for the people you help)
    - Equity (e.g. allocate resources across groups to achieve equitable outcomes)
- Define these goals explicitly during the scoping process, and also attempt to prioritize them at this stage.

- What are the constraints?
    - Budget?
    - People?
    - Time?
    - Other resources?


#### Step 2: What actions/interventions are you informing?
- It’s generally a good strategy to first focus on informing existing actions instead of starting with completely new actions that the organization isn’t familiar with implementing. 
- Breaking down actions
    - Create a list of actions an organization is taking or may take to achieve its goal.
    - Through what channels can an action be taken?
        - Email, text, phone call?
    - Ethical implications?

#### Step 3: What data do you have and what data do you need?

- What data do we have access to, inside and outside the organization, that is relevant to solving the problem, and what data sources may we need to get access to?

- For each data source available inside the organization, note:
    - Granularity or detail: Some data may be collected about individual students, some may be collected about schools, and some may be collected about neighborhoods or school districts.
    - Frequency: How often the data is updated.
    - History: How far back the data goes.
    - Identifiers: Unique identifiers that allow you to link to other data sources, such as Social Security Numbers, insurance numbers, student ID numbers, or addresses. 
    - Owner: The organization, department, group, or persons who control access to the data.
    - Storage: How the data is stored (databases, spreadsheets, PDFs, data stores, or hard copy).
    - Ethics: Ethical issues that may be associated with using the data sources (consent, security protocols, etc.).

##### Matching the Data to the Actions

- It’s important to match the granularity, frequency, and time horizon of the actions to the granularity, frequency, and time horizon of the data you have.
- Public vs. private data and feasibility.
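
One way to make this matching concrete is to write down the attributes of each data source and each action and compare them. The following is a minimal sketch, not a prescribed method: the class names, field names, and the three granularity levels are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical granularity levels, ordered from finest to coarsest.
GRANULARITY_ORDER = ["individual", "school", "district"]


@dataclass
class DataSource:
    name: str
    granularity: str      # one of GRANULARITY_ORDER
    update_days: int      # days between updates (frequency)
    history_years: float  # how far back the data goes


@dataclass
class Action:
    name: str
    granularity: str      # level at which the action is taken
    decision_days: int    # how often the action is decided
    horizon_years: float  # history needed to observe the outcome


def mismatches(source: DataSource, action: Action) -> List[str]:
    """Return reasons the data source cannot support the action (empty if none)."""
    problems = []
    src_level = GRANULARITY_ORDER.index(source.granularity)
    act_level = GRANULARITY_ORDER.index(action.granularity)
    if src_level > act_level:
        problems.append("data is coarser than the action")
    if source.update_days > action.decision_days:
        problems.append("data is updated less often than decisions are made")
    if source.history_years < action.horizon_years:
        problems.append("data history is shorter than the outcome horizon")
    return problems
```

For example, school-level data updated once a year cannot by itself inform a quarterly, student-level outreach decision; `mismatches` would list both the granularity and the frequency problem.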

#### Step 4: What analysis needs to be done?

Once we have the goals, actions, and the available data identified, the final step in the scoping process is to determine the analyses we will do to inform the identified actions, using the data we have, to achieve our goals. It is worth reiterating that analysis is not the goal of a data science project, and it’s advisable to leave this piece aside until the goals, actions, and available data are clearly defined. 

Analyses can use methods and tools from different areas: computer science, machine learning, data science, statistics, and social sciences. One way to think about the performed analysis is to break it down into five types:

Description: Primarily focused on understanding events and behaviors that have happened in the past. Typically all projects contain a descriptive analysis component for gaining an understanding of the data. However, it is rare that a descriptive analysis is the sole analysis component of a project.

E.g.:

    Exploring data around students that haven’t graduated on time in the past, potentially revealing “types” of students that suffer from the adverse outcome when compared to others with respect to highly correlated factors.

Detection: Less focused on the past and more focused on ongoing events. Detection tasks often involve detecting events and anomalies that are currently happening.

E.g. 

    Using text data from legislative bills to identify their topic areas 
    Detecting fraudulent credit card transactions 

Prediction: Focused on the future and predicting future behaviors and events.

E.g. 

    Predicting the likelihood of a student graduating high school on time at the time they enter 9th grade
    Predicting the likelihood of a legislative bill passing into law at the time it’s introduced to the legislature

Optimization: Focused on taking the outputs of other types of analysis and using them to allocate resources or make decisions.

E.g. 

    Identifying where to locate ambulances such that their coverage is maximized
    Given a list of students ranked by their probability of not graduating on time, identifying the subset of students for intervention that would have the most impact

Behavior Change: Focused on causing a change in behaviors of people, organizations, neighborhoods. Typically uses methods from causal inference and behavioral economics.

E.g. 

    Given a student is at risk of not graduating on time, and a set of possible interventions, identifying the intervention that would maximize their likelihood of graduating on time

There are of course many more types of analyses but we’ll keep the focus on these five here. 

For each analysis that we do in the project, we should look to answer the following questions:

    What type of analysis needs to be done and what’s the purpose? 

Is this a descriptive analysis, a predictive model, or a detection or behavior change task? Often, the project involves several of the types of analysis we described above, each designed to inform specific actions and achieve specific goals.

    How will the analysis inform the actions?

Each analysis should inform one or more of the identified actions. Some analyses could inform multiple actions and sometimes one action could be informed by multiple analyses.

    How will the analysis be validated? What validation can be done using existing, historical data? 

It’s important to think about how we can validate each analysis using the available historical data. We should choose the validation setup to reflect the deployment scenario (for example, how a user would use the model) as closely as possible. Typically, for prediction tasks, we would create multiple training and validation sets, choose a metric that accurately reflects how we will use the model, and observe how our models perform over time. One critical thing to think about is the baseline(s) to which we compare our models. If the organization is performing this task today in some way, a good baseline is the performance of the “existing system”. Additional good baselines include simple heuristics (based on expert knowledge or prior research). We want our new analysis to be “better enough” than existing or simpler approaches to make it worth deploying the more expensive-to-build-and-maintain data science system.
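
As an illustration, one common way to build several training and validation sets that mimic deployment is to split by time, so each model is trained only on data that would have been available before the period it is evaluated on. This is a sketch under that assumption; the function names, the `time_key` field, and the `min_gain` threshold are hypothetical.

```python
def temporal_splits(records, time_key, first_validation_time):
    """Yield (cutoff, train, validation) triples.

    Training data is strictly earlier than the cutoff; validation data
    falls exactly in the cutoff period, as it would in deployment.
    """
    cutoffs = sorted(
        {r[time_key] for r in records if r[time_key] >= first_validation_time}
    )
    for cutoff in cutoffs:
        train = [r for r in records if r[time_key] < cutoff]
        validation = [r for r in records if r[time_key] == cutoff]
        yield cutoff, train, validation


def better_enough(model_metric, baseline_metric, min_gain):
    """True if the model beats the baseline by at least min_gain.

    min_gain is a judgment call made with the organization, not a constant.
    """
    return model_metric - baseline_metric >= min_gain
```

The metric for each split would be computed for both the candidate model and the baseline (the existing system or a simple heuristic), and `better_enough` applied per split to see whether the gain holds up over time.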

    What are the ethical issues associated with the analysis? 

Are there any equity implications that we are concerned about? How would errors in our analysis impact individuals and society?

We’ll demonstrate this with a few examples. 

##### Reducing Lead Poisoning Rates

Goal: Reduce the number of children who will get lead poisoning in the future due to lead hazards in their current residence 

Actions: Proactive home inspections constrained by limited inspection resources.

Analysis: We need to answer the key question: “Which homes should we prioritize for proactive inspections?” In other words, we want to identify the homes of kids who are at risk of lead poisoning in the future. 

Because we want to intervene before a child is exposed to lead, we would predict the likelihood of every kid under the age of 2 being exposed to lead in their homes. We could use our predictions to rank homes based on the risk that a child will be exposed to lead in them and prioritize them for inspection. If we consider the five types of analyses we mentioned above, this would be a prediction task.

Validation: Let’s say the number of homes the agency can inspect in a month is 100. Given this, we can evaluate the model by the fraction of the 100 highest-ranked homes each month in which a child is actually exposed to lead, a metric called Precision at K (or P@K), here with K = 100.
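
Precision at K can be computed directly from predicted scores and observed outcomes. The sketch below assumes binary labels (1 if the home had a lead hazard, 0 otherwise) and uses a hypothetical function name; the data in the usage note is made up.

```python
def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scored items whose label is 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and the same length")
    # Rank items from highest to lowest predicted risk.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)
```

With four homes scored `[0.9, 0.8, 0.3, 0.1]` and outcomes `[1, 0, 1, 0]`, `precision_at_k(scores, labels, 2)` is 0.5: one of the two highest-ranked homes had a hazard. In the lead example, K would be set to the monthly inspection capacity.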

In this example, the causal link between the action (lead inspection and removal) and the goal (protecting children from lead exposure) is well established. That might not be the case in every project. Let’s take the example of timely high school graduation:

##### Improving Graduation Rates

Goal: Increase the percentage of high school students who graduate on time, while reducing the disparity in graduation rates across racial groups 

Actions: Provide additional support to students who are identified as at-risk of not graduating on time

Analyses:  

    As with the lead hazards project, we have to perform a predictive analysis; predicting which students are least likely to graduate on time. We can use these predictions to prioritize a subset of students for additional outreach and support. 
    Once the risk is predicted for each student, selecting the subset of students for intervention might not be as straightforward as prioritizing students with the highest predicted risk. Given the nature of the interventions and the metric you are interested in improving, the best students for intervention could change. For instance, if the goal is to improve the average graduation rate with minimum effort, prioritizing students with a more middling risk score could prove to be more effective than prioritizing those who are very likely not to graduate on time. An optimization analysis could help identify the optimal set of students to be prioritized for intervention. 
    Even if we select the “optimal” set of students for intervention, there could be multiple additional support types available (e.g., after-school tutoring, counseling, help with transportation, financial help). Identifying the most effective intervention(s) for individual students could require a causal inference analysis. 
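
The difference between prioritizing by risk and prioritizing by expected impact can be sketched as follows. This assumes we somehow have, for each student, an estimate of their risk with and without the intervention; obtaining the "with intervention" estimate is itself the causal inference problem described above, so the `risk_if_helped` field and all numbers in the usage note are hypothetical.

```python
def select_by_risk(students, budget):
    """Pick the students with the highest predicted risk."""
    ranked = sorted(students, key=lambda s: s["risk"], reverse=True)
    return [s["id"] for s in ranked[:budget]]


def select_by_expected_gain(students, budget):
    """Pick the students whose risk is expected to drop the most if helped."""
    ranked = sorted(
        students,
        key=lambda s: s["risk"] - s["risk_if_helped"],  # assumed effect estimate
        reverse=True,
    )
    return [s["id"] for s in ranked[:budget]]


students = [
    {"id": "A", "risk": 0.95, "risk_if_helped": 0.90},  # very high risk, small effect
    {"id": "B", "risk": 0.60, "risk_if_helped": 0.30},  # middling risk, large effect
    {"id": "C", "risk": 0.50, "risk_if_helped": 0.25},
    {"id": "D", "risk": 0.10, "risk_if_helped": 0.05},
]
```

With a budget of two, `select_by_risk` picks A and B, while `select_by_expected_gain` picks B and C, illustrating how students with middling risk scores can be the better targets when the goal is the average graduation rate.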

#### Ethical Issues

As a reminder, consideration of ethical issues should be embedded in every phase of the scoping and execution of a project, rather than thought of as a discrete “step” in that process. Often the initial conversation around ethics in the scoping phase will focus on the ethical and societal values we want to embed in the system. Note that this conversation is not specific to applications of data science or AI, but rather about the values that broadly need to be embedded in the decision-making processes that affect people’s lives. As such, these decisions cannot be made unilaterally by the data scientists or technologists tasked with building the system. While they have an important role in the discussion, it is important for it to be inclusive of all the relevant stakeholders: policymakers, action-takers, system developers, data owners, and the community being affected by the system will all have important perspectives on these issues.

As you work through the scoping process (and the project itself), being explicit about how your work reflects those values can help act as a guiding principle in the course of the many decisions that need to be made to see a project through. To highlight their central nature, we’ve consolidated many of them here, posing some motivating questions to help you explore these issues. This list is far from exhaustive, but rather provides a starting point for that conversation: 

##### Privacy, Confidentiality, and Security

Concerns about privacy and data security are common in applications of data science to socially impactful problems, particularly where high-stakes decisions are involved such as healthcare, criminal justice, and education. In some cases, there may be governing legal requirements about how data can be used and steps that must be in place to protect it, but ethical considerations may extend beyond what is legally required. Particularly important here is how people might feel about the data about them that is being used, as well as their expectations about how publicly available that data is. Questions to ask here include:

    What are the privacy considerations (legal as well as ethical)?
    How is the privacy of the individuals in the data being protected?
    What about confidentiality?
    What are the security considerations and protections? Who has access to which parts of the data? For what purposes? What is the security audit process?

##### Transparency

Transparency considerations for the building and deployment of data science systems can involve many different stakeholders: internal actors and decision-makers, individuals whose data is being used, individuals who will be affected by the decisions the system informs, and the public at large. As you consider the following questions, keep the perspectives of each of these stakeholders in mind:

    Which stakeholders should know about the project?
    Do the people whose data you’re using know that you’re using it?
    How will the actions you’re taking based on this analysis affect people? What are the costs or benefits of this action for the people affected?
    Do the people you’re prioritizing know if and why they’re being prioritized?
    What recourse do people have to challenge or change a decision informed by your analysis that has impacted them?

##### Discrimination, Equity, and Fairness

Understanding and improving the fairness of machine learning systems has been the subject of much recent writing and research, both in scholarly works and the popular press. During the scoping process, it is important to understand the kinds of disparities that could result from your project and how they might impact people, accounting for the perspectives of people impacted by your analysis and any historical and ongoing discrimination that might affect them.

For instance, in the context of screening child welfare hotline calls, disparities in false negatives (not following up a report with an investigation when you actually should have) would mean more harm to children in one community relative to another. By contrast, disparities in false positives (making an unnecessary investigation) could result in over-policing some communities, resulting in broader societal harms. Two tools from our work that may be helpful here are the Fairness Tree, which we use to help facilitate these conversations, and Aequitas, which helps audit model outputs for disparities. Some of the key questions to ask with regard to discrimination and fairness are:

    Which specific groups could be impacted unequally by your analysis, and how should you account for this?
    From the perspective of each of these groups, how do they define fairness in this context? How well aligned are these definitions?
    Over what time horizon are you trying to improve equity in this work?
    How will you detect biases or inequities in your system?
    What disparities exist in the current decision-making process?
    If inequities do exist, how will you approach reducing them in your system or mitigating their impact in downstream decisions and interventions?
    Are there broader sources of inequities, either historical or ongoing, that affect the outcomes you are trying to improve? How should your system take these into account?
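
The false negative and false positive disparities described in the child welfare example can be measured by computing error rates separately for each group. The sketch below is plain Python and is not the Aequitas API; the function name and the group labels in the usage note are hypothetical, and labels are assumed to be binary (1 = investigation was warranted or recommended).

```python
from collections import defaultdict


def error_rates_by_group(groups, y_true, y_pred):
    """Return {group: {"fnr": ..., "fpr": ...}} for binary labels and predictions.

    A rate is None when the group has no positives (FNR) or no negatives (FPR).
    """
    counts = defaultdict(lambda: {"fn": 0, "pos": 0, "fp": 0, "neg": 0})
    for group, truth, pred in zip(groups, y_true, y_pred):
        c = counts[group]
        if truth == 1:
            c["pos"] += 1
            if pred == 0:
                c["fn"] += 1  # missed a case that needed follow-up
        else:
            c["neg"] += 1
            if pred == 1:
                c["fp"] += 1  # flagged a case unnecessarily
    return {
        group: {
            "fnr": c["fn"] / c["pos"] if c["pos"] else None,
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
        }
        for group, c in counts.items()
    }
```

Comparing the resulting rates across groups (for example, as ratios against a reference group) shows which kind of error falls disproportionately on which community; which disparity matters most depends on the intervention, as discussed above.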

##### Accountability

Accountability should be considered with respect to the entire process, including both the technical decisions made while working with the data and the decisions made about how the system will be used in practice and described to the public. Ensuring there is clarity upfront about who is accountable and responsible for each aspect of the system, particularly with regards to any potential ethical aspects and considerations, can help reduce both the risks of issues arising (by setting boundaries and making oversight somebody’s explicit responsibility) and potential harms if these issues do arise (by having contingency plans in place). For example, our group has worked on several projects seeking to reduce the risk of jail incarceration by providing assistive interventions like mental health supports or social services. However, the same model predicting risk of future arrest could also be put to use in more concerning, punitive ways that our partner agreements need to carefully define and guard against. Some questions to ask around accountability include:

    If sensitive data is used for the project, who is responsible for keeping it safe? What contingency plans are in place in case there is a leak and who will be accountable?
    What limitations are in place on how your analysis will be used? Who bears responsibility if it’s used for harmful or unintended purposes?
    How will you monitor for unintended consequences (for instance, if students predicted at risk for low performance internalize this prediction)? What processes will be in place to reduce any potential harm from them?
    How will you determine if the system is increasing disparities over time? Who is responsible for monitoring this and making decisions about how to respond?
    Who bears responsibility for determining what to disclose about the system to the public and how to describe it? If people object to it, who is accountable and what should be done?
    What recourse do people impacted by the new system have if it makes mistakes about them or is making recommendations based on inaccurate data? Who is responsible for responding to these issues and making any necessary improvements to the underlying model or analysis?

##### Social License

The concept of social license is related to ideas of transparency and accountability, but with a different framing that can provide a helpful thought exercise. Regardless of your actual plans around transparency described above or how many people might learn about the details of the project in practice, here you want to think about how people might respond to your project if it did receive widespread coverage. Would it be uniformly cast in a positive light or a negative one, or to what extent would different people react differently to it? A few questions to help you consider social license for your project include:

    If the entire population of the country finds out about your project, will they be ok with it? Why?
    If it was on the front page of the newspaper, what would the headline be? Would it be positive or negative?
    Even if you believe the population might support your project overall, are there any groups who might object? Who are they and what concerns would they have?

##### Other Ethical Considerations

Finally, it’s important to keep in mind that the list here is only a starting point and there may be other ethical considerations relevant to your specific context. Are there additional legal, policy, or organizational requirements not covered above? Do you need to gather informed consent for individuals potentially affected by this work? Are there specific oversight, reporting, or ethical review procedures? As is the case with the other considerations above, it can be helpful here to think about the project through the perspectives of different stakeholders and explore whether any considerations or concerns might arise that don’t fit well into the other categories here.

#### Additional Considerations

##### Evaluation

Once your analysis is complete and you have validated it against available data, it is important to evaluate it in the field to make sure that it helps achieve your goals. Field trials are a complicated topic and can be fairly specific to the project and its goals, so we will not go into them in detail here. However, your evaluation should apply the analysis to the actions it was proposed to inform and should be measured against your pre-defined goals.

Such an evaluation often requires the buy-in of people inside and outside the organization and a significant commitment of the organization’s time and resources. Even before embarking on a new project, the organization should be certain that they can commit the necessary time and resources and should engage anyone who will be involved in the evaluation and application of the analysis. These conversations and commitments should happen at the beginning of the scoping process to ensure the project will be successful.

The evaluation may show that the analysis is or is not successful at achieving the organization’s goals. The evaluation should not be seen as an opportunity to prove that the analysis is successful, but an opportunity to test its effectiveness and perhaps improve upon it iteratively by re-evaluating other aspects of the project’s scope, such as the analysis itself, the data used for the analysis, or the actions to which the analysis is applied. 

##### Deployment

If the analysis proves effective in the evaluation, then you will want to deploy it as a new system on the organization’s infrastructure so that it can update regularly and be incorporated into its operations. Among the most important considerations when deploying a new system is whether or not it was built on the organization’s infrastructure. If the system was built on the organization’s infrastructure, it will often (though not always) be easier to deploy, because it should already be consistent with the organization’s technical infrastructure. Data scientists working within the organization will usually have access to the organization’s internal infrastructure and should work closely with production engineers to make sure their system integrates with it. Volunteers and other external data scientists may also be able to access the organization’s infrastructure through volunteer or business agreements, making for a more seamless deployment process. 

However, often data science systems are built outside of the organization’s infrastructure, possibly because of security or privacy issues that limit access. When this happens, deployment may be more complicated. The organization’s internal data and engineering professionals who work with and maintain their technical infrastructure should be incorporated into the scoping process early to ensure that the system built by partner data scientists fits the organization’s infrastructure and can be integrated and deployed fairly easily. The project should be a priority for the data and technical professionals in the organization who are necessary for deployment, and there should be time held in their schedules to devote to this project, just like any other. If the organization’s own data and technical professionals are not involved in the project and do not have time budgeted for the project, then it will be doomed to fail. The same data and technical professionals who help deploy the new system may need to maintain it in production, and this should be figured into their schedule and workload. 

Data science systems also require computing resources for deployment. A good understanding of the organization’s computing capacity should be achieved before starting the project. In deployment, data science systems will use memory and take time when they are updated. This should be accounted for within the organization’s infrastructure, and internal data and technical professionals should make sure that regular updates of data science systems do not conflict or compete for resources with other important technical processes.

Data science systems also rely on regular and consistent updates to underlying data sources. Data and technical professionals in the organization should make sure that regular updates of the new system are consistent with updates of the underlying data. If the data is old, delayed, or otherwise inconsistent with updates to the new system, then the value of that system will be reduced and may even deteriorate over time. 

Finally, it will be important to consider who in the organization will use the output of the system and how they will access it. The key to this is thinking about how the end-user will actually use the information. Will the end-user be looking at individual cases one by one, or will they be creating a list to prioritize? If the end-user is considering cases one-by-one, then a visual user interface that provides them the output alongside other information about the case may be helpful. If the end-user is instead creating a prioritized list (of places to inspect, for example), then it may be better to deliver them a list of the top cases, perhaps in a visual user interface or even as an email or printable document! These decisions should take into account the end-user’s current operational processes, their access to technology in their work, and their literacy (digital and otherwise), among other considerations. Training will also be a key part of the deployment, both for the technical maintenance of the system and the end-users who will use it to make decisions. 

##### Monitoring

Once the data science system is deployed and integrated into the organization’s operations, it will need to be continuously monitored to ensure that it continues to perform well and provide information that improves decision-making. The performance of data science systems can deteriorate over time for a variety of reasons, including changes to the underlying data or data quality, changes in the organization’s policies or operations, or even changes in real-world trends. To ensure the information that the system provides remains useful, organizations should have a plan and commit resources to monitor the system’s performance over time as a regular part of their operations. 

First, you should create a plan for how the system will be monitored, including what data and metrics should be used to assess its performance over time. This data will likely be future updates to the same data that was used to develop the original analysis and may use some of the same metrics and diagnostics that were used to validate it. When monitoring performance, you should be careful to ensure that data used to assess the system does not leak into the data used to update it. This could lead to overestimating the system’s performance and masking any decay. 

The organization will also need to commit technical staff to monitor the system’s performance and manage it if it breaks or if performance deteriorates. This requires understanding not only the system but also the data that is used to update and evaluate it. Technical staff may need to be trained to understand, evaluate, and update the system, and some redundancy is recommended so that the organization has the skill depth to maintain the system, even if a staff member leaves.

The organization should also plan to run additional field evaluations (like the one described above) at regular intervals to ensure that the system is still providing information that is actionable and improves operations. This requires the commitment of not only data professionals but also end-users to the evaluation of the system. The frequency of these evaluations may vary with the organization’s capacity, how often the data and the system are updated, and the frequency of changes in the organization’s operations and the overarching policy environment. However, it is important that the organization plans to devote resources to the continued monitoring and evaluation of the system. After all, a system that provides poor information could be worse than no system at all!

Finally, you should plan what you will do if the system does show signs of decay. This plan should consider what the sources of decay may be, including changes to input or evaluation data, changes to the organization’s policies or operations, or changes to real-world trends. Responses will differ depending on the sources of performance decay. 
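
A simple starting point for spotting decay is to compare the recent average of the monitored metric against its value at deployment. This is only a sketch: the function name, the three-period window, and the 0.05 tolerance are hypothetical defaults that would need to be chosen with the organization, and the check says nothing about the source of the decay.

```python
def flag_decay(metric_history, reference, tolerance=0.05, window=3):
    """Return True if the mean of the last `window` metric values has
    dropped more than `tolerance` below the reference value.

    metric_history: metric values per monitoring period, oldest first
                    (e.g. monthly precision at K on newly observed outcomes).
    reference:      the metric value measured when the system was validated.
    """
    if len(metric_history) < window:
        return False  # not enough periods observed yet to judge
    recent = metric_history[-window:]
    return sum(recent) / window < reference - tolerance
```

A flag from a check like this would trigger the investigation described above: whether the input data changed, the organization’s policies or operations changed, or real-world trends shifted.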

You should also consider what the organization will do with the system if they change their infrastructure, data, policies, or operations. The system should be evaluated under the new changes and it should be determined whether it still provides useful and actionable information, or if it will need to be updated or retired. This is important because changes to the input data or the infrastructure that support the system could impact its performance, while changes to policy or operations may mean the system no longer provides information relevant to those new processes.

http://www.datasciencepublicpolicy.org/wp-content/uploads/2021/09/Project-Scoping-Worksheet-Blank-10-2021.pdf

http://www.dssgfellowship.org/2016/04/28/introducing-the-data-maturity-framework/