| author | title | semester | footer | license |
|---|---|---|---|---|
| Christian Kaestner and Claire Le Goues | MLiP: Data Quality | Spring 2024 | Machine Learning in Production/AI Engineering • Christian Kaestner & Claire Le Goues, Carnegie Mellon University • Spring 2024 | Creative Commons Attribution 4.0 International (CC BY 4.0) |
One week from today, here
Questions based on shared scenario, apply concepts
Past midterms online, similar style
All lectures and readings in scope, focus on concepts with opportunity to practice (e.g., recitations, homeworks, in-class exercises)
Closed book, but 6 sheets of notes (sorry, no ChatGPT)
Required reading:
- Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021, May). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proc. Conference on Human Factors in Computing Systems (pp. 1-15).
Recommended reading:
- Schelter, S., et al. Automating large-scale data quality verification. Proceedings of the VLDB Endowment, 11(12), pp.1781-1794.
- Distinguish precision and accuracy; understand the "better models vs. more data" tradeoff
- Use schema languages to enforce data schemas
- Design and implement automated quality assurance steps that check data schema conformance and distributions
- Devise infrastructure for detecting data drift and schema violations
- Consider data quality as part of a system; design an organization that values data quality
(often delayed, hard-to-fix consequences)
Image source: https://monkeylearn.com/blog/data-cleaning-python
Poor data quality leads to poor models
Often not detectable in offline evaluation - Q. why not?
Causes problems in production - now difficult to correct
Detection almost always delayed! Expensive rework. Difficult to detect in offline evaluation.
Sambasivan, N., et al. (2021, May). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proc. CHI (pp. 1-15).
Data cleaning and repairing account for about 60% of the work of data scientists.
Own experience?
Quote: Gil Press. “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says.” Forbes Magazine, 2016.
Goal: Train an ML model to predict future sales; make decisions about what to (re)stock/when/how many...
Manually entered
Generated through actions in IT systems
Logging information, traces of user interactions
Sensor data
Crowdsourced
(Diagram: data comes from many sources of different reliability and quality)
Product Database:
ID | Name | Weight | Description | Size | Vendor |
---|---|---|---|---|---|
... | ... | ... | ... | ... | ... |
Stock:
ProductID | Location | Quantity |
---|---|---|
... | ... | ... |
Sales history:
UserID | ProductId | DateTime | Quantity | Price |
---|---|---|---|---|
... | ... | ... | ... | ... |
Recommended Reading: Gitelman, Lisa, Virginia Jackson, Daniel Rosenberg, Travis D. Williams, Kevin R. Brine, Mary Poovey, Matthew Stanley et al. "Data bite man: The work of sustaining a long-term study." In "Raw Data" Is an Oxymoron, (2013), MIT Press: 147-166.
Accuracy: The data was recorded correctly.
Completeness: All relevant data was recorded.
Uniqueness: Each entry is recorded only once.
Consistency: The data agrees with itself.
Timeliness: The data is kept up to date.
Unreliable sensors or data entry
Wrong results and computations, crashes
Duplicate data, near-duplicate data
Out of order data
Data format invalid
Examples in inventory system?
System objective changes over time
Software components are upgraded or replaced
Prediction models change
Quality of supplied data changes
User behavior changes
Assumptions about the environment no longer hold
Examples in inventory system?
Users react to model output; causes data shift (more later)
Users try to game/deceive the model
Examples in inventory system?
Accuracy: Reported values (on average) represent the real value
Precision: Repeated measurements yield the same result
Accurate, but imprecise: Q. How to deal with this issue?
Inaccurate, but precise: ?
(CC-BY-4.0 by Arbeck)
More data -> better models (up to a point, diminishing returns)
Noisy data (imprecise) -> less confident models, more data needed
- some ML techniques are more or less robust to noise (more on robustness in a later lecture)
Inaccurate data -> misleading models, biased models
-> Need the "right" data
-> Invest in data quality, not just quantity
Ensuring basic consistency about shape and types
Problems with this data?
Define the expected format of data
- expected fields and their types
- expected ranges for values
- constraints among values (within and across sources)
Data can be automatically checked against schema
Protects against change; explicit interface between components
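As a minimal sketch, such checks can run automatically on every incoming record; here using the Python jsonschema library with a hypothetical schema for the sales-history table (field names and bounds are assumptions):

```python
# Minimal sketch: validate incoming records against an explicit schema.
# The jsonschema library is one option; field names here are assumptions.
from jsonschema import validate, ValidationError

sale_schema = {
    "type": "object",
    "properties": {
        "user_id":    {"type": "integer", "minimum": 1},
        "product_id": {"type": "integer", "minimum": 1},
        "quantity":   {"type": "integer", "minimum": 1},
        "price":      {"type": "number",  "minimum": 0},
    },
    "required": ["user_id", "product_id", "quantity", "price"],
}

record = {"user_id": 5, "product_id": 42, "quantity": -3, "price": 2.99}
try:
    validate(instance=record, schema=sale_schema)
except ValidationError as e:
    print(f"Rejected record: {e.message}")  # quantity -3 is below the minimum of 1
```

Rejecting or quarantining non-conforming records at the interface keeps schema violations from silently propagating into the ML pipeline.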
- Illegal attribute values:
bdate=30.13.70
- Violated attribute dependencies:
age=22, bdate=12.02.70
- Uniqueness violation:
(name="John Smith", SSN="123456"), (name="Peter Miller", SSN="123456")
- Referential integrity violation:
emp=(name="John Smith", deptno=127)
if department 127 not defined
Further readings: Rahm, Erhard, and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23.4 (2000): 3-13.
-- Employees: emp_no is the primary key (unique, non-null)
CREATE TABLE employees (
    emp_no INT NOT NULL,
    birth_date DATE NOT NULL,
    name VARCHAR(30) NOT NULL,
    PRIMARY KEY (emp_no));
-- Departments: dept_name must also be unique
CREATE TABLE departments (
    dept_no CHAR(4) NOT NULL,
    dept_name VARCHAR(40) NOT NULL,
    PRIMARY KEY (dept_no), UNIQUE KEY (dept_name));
-- Managers: foreign keys enforce referential integrity
CREATE TABLE dept_manager (
    dept_no CHAR(4) NOT NULL,
    emp_no INT NOT NULL,
    FOREIGN KEY (emp_no) REFERENCES employees (emp_no),
    FOREIGN KEY (dept_no) REFERENCES departments (dept_no),
    PRIMARY KEY (emp_no, dept_no));
Image source: https://www.kdnuggets.com/2021/05/nosql-know-it-all-compendium.html
- CSV files
- Key-value stores (JSON, XML, NoSQL databases)
- Message brokers
- REST API calls
- R/Pandas Dataframes
2022-10-06T01:31:18,230550,GET /rate/narc+2002=4
2022-10-06T01:31:19,332644,GET /rate/i+am+love+2009=4
{"user_id":5,"age":26,"occupation":"scientist","gender":"M"}
Q. Benefits? Drawbacks?
{ "type": "record",
"namespace": "com.example",
"name": "Customer",
"fields": [{
"name": "first_name",
"type": "string",
"doc": "First Name of Customer"
},
{
"name": "age",
"type": "int",
"doc": "Age at the time of registration"
}
]
}
Schema specification in JSON format
Serialization and deserialization with automated checking (sketch below)
Native support in Kafka
Benefits
- Serialization in space efficient format
- APIs for most languages (ORM-like)
- Versioning constraints on schemas
Drawbacks
- Reading/writing overhead
- Binary data format, extra tools needed for reading
- Requires external schema and maintenance
- Learning overhead
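A minimal sketch of schema-checked (de)serialization, assuming the fastavro library (the official avro package works similarly):

```python
# Sketch: Avro serialization enforces the schema on write and read.
# fastavro is an assumption; the values here are made up.
import io
import fastavro

schema = fastavro.parse_schema({
    "type": "record", "namespace": "com.example", "name": "Customer",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

buf = io.BytesIO()
# Writing validates each record; e.g., age="old" (wrong type) raises an
# error instead of being silently stored.
fastavro.writer(buf, schema, [{"first_name": "Ada", "age": 36}], validator=True)

buf.seek(0)
for record in fastavro.reader(buf):  # the schema travels with the data
    print(record)
```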
Notes: Further readings, e.g., https://medium.com/@stephane.maarek/introduction-to-schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321, https://www.confluent.io/blog/avro-kafka-data/, https://avro.apache.org/docs/current/
Examples
- Avro
- XML Schema
- Protobuf
- Thrift
- Parquet
- ORC
Product Database:
ID | Name | Weight | Description | Size | Vendor |
---|---|---|---|---|---|
... | ... | ... | ... | ... | ... |
Stock:
ProductID | Location | Quantity |
---|---|---|
... | ... | ... |
Sales history:
UserID | ProductId | DateTime | Quantity | Price |
---|---|---|---|---|
... | ... | ... | ... | ... |
Basic structure and type definition of data
Well supported in databases and many tools
Very low bar of data quality
Application- and domain-specific data issues
Problems with the data beyond schema problems?
- Missing values:
phone=9999-999999
- Misspellings:
city=Pittsburg
- Misfielded values:
city=USA
- Duplicate records:
name=John Smith, name=J. Smith
- Wrong reference:
emp=(name="John Smith", deptno=127)
if department 127 defined but wrong
Q. How can we detect and fix these problems?
Further readings: Rahm, Erhard, and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23.4 (2000): 3-13.
Data analysis / Error detection
- Usually focused on specific kind of problems, e.g., duplication, typos, missing values, distribution shift
- Detection in input data vs detection in later stages (more context)
Error repair
- Repair data vs repair rules, one at a time or holistic
- Data transformation or mapping
- Automated vs human guided
Illegal values: min, max, variance, deviations, cardinality
Misspelling: sorting + manual inspection, dictionary lookup
Missing values: null values, default values
Duplication: sorting, edit distance, normalization
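As a rough sketch, near-duplicates can be flagged with a string-similarity measure (here the standard library's difflib; real pipelines normalize values first and use blocking to scale, and the threshold is an arbitrary assumption):

```python
# Sketch: flag near-duplicate records via string similarity.
# The 0.7 threshold is an assumption to tune per dataset.
from difflib import SequenceMatcher

names = ["John Smith", "J. Smith", "Peter Miller"]
for i, a in enumerate(names):
    for b in names[i + 1:]:
        sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if sim > 0.7:
            print(f"possible duplicate: {a!r} ~ {b!r} (similarity {sim:.2f})")
```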
Can we (automatically) detect instance-level problems? Which problems are domain-specific?
expect_column_values_to_be_between(
column="passenger_count",
min_value=1,
max_value=6
)
Supports schema validation and custom instance-level checks.
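A sketch of how such a check might run end to end, using the classic pandas-based Great Expectations API (the entry point differs across versions; the column values are made up):

```python
# Sketch with the classic pandas-based Great Expectations API
# (version-dependent; newer releases restructure the entry point).
import pandas as pd
import great_expectations as ge

df = ge.from_pandas(pd.DataFrame({"passenger_count": [1, 3, 0, 7]}))
result = df.expect_column_values_to_be_between(
    column="passenger_count", min_value=1, max_value=6
)
print(result.success)                             # False: 0 and 7 violate it
print(result.result["partial_unexpected_list"])   # the offending values
```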
Invariants on data that must hold
Typically about relationships of multiple attributes or data sources, e.g.:
- ZIP code and city name should correspond
- User ID should refer to existing user
- SSN should be unique
- For two people in the same state, the person with the lower income should not have the higher tax rate (checked in the sketch below)
Classic integrity constraints in databases or conditional constraints
Rules can be used to reject data or repair it
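A minimal sketch of checking the tax-rate rule above as a denial constraint over a dataframe (column names and values are hypothetical):

```python
# Sketch: detect violations of a denial constraint with pandas.
# Rule: within a state, lower income must not pair with a higher tax rate.
import pandas as pd

df = pd.DataFrame({
    "state":    ["PA", "PA", "OH"],
    "income":   [50_000, 90_000, 40_000],
    "tax_rate": [0.25, 0.20, 0.15],
})

# Compare every pair of rows within a state (quadratic, fine for a sketch;
# real tools prune the pair space to scale).
pairs = df.merge(df, on="state", suffixes=("_a", "_b"))
violations = pairs[
    (pairs.income_a < pairs.income_b) & (pairs.tax_rate_a > pairs.tax_rate_b)
]
print(violations)  # violating pairs -> reject, flag, or repair
```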
Image source: Theo Rekatsinas, Ihab Ilyas, and Chris Ré, “HoloClean - Weakly Supervised Data Repairing.” Blog, 2017.
- User provides rules as integrity constraints (e.g., "two entries with the same name can't have different cities")
- Detect violations of the rules in the data; also detect statistical outliers
- Automatically generate repair candidates (with probabilities)
Image source: Theo Rekatsinas, Ihab Ilyas, and Chris Ré, “HoloClean - Weakly Supervised Data Repairing.” Blog, 2017.
Rules directly taken from external databases
- e.g. zip code directory
Given clean data,
- several algorithms that find functional relationships ($X \Rightarrow Y$) among columns
- algorithms that find conditional relationships (if $Z$ then $X \Rightarrow Y$)
- algorithms that find denial constraints ($X$ and $Y$ cannot co-occur in a row)
Given mostly clean data (probabilistic view),
- algorithms to find likely rules (e.g., association rule mining)
- outlier and anomaly detection
Given labeled dirty data or user feedback,
- supervised and active learning to learn and revise rules
- supervised learning to learn repairs (e.g., spell checking)
Further reading: Ilyas, Ihab F., and Xu Chu. Data cleaning. Morgan & Claypool, 2019.
- Sale 1: Bread, Milk
- Sale 2: Bread, Diaper, Beer, Eggs
- Sale 3: Milk, Diaper, Beer, Coke
- Sale 4: Bread, Milk, Diaper, Beer
- Sale 5: Bread, Milk, Diaper, Coke
Rules
- {Diaper, Beer} -> Milk (40% support, 66% confidence)
- Milk -> {Diaper, Beer} (40% support, 50% confidence)
- {Diaper, Beer} -> Bread (40% support, 66% confidence)
(also useful tool for exploratory data analysis)
Further readings: Standard algorithms and many variations, see Wikipedia
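A minimal sketch of mining these rules, assuming the mlxtend library (any Apriori/FP-Growth implementation works similarly):

```python
# Sketch: mine association rules from the five sales above with mlxtend.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

sales = [
    ["Bread", "Milk"],
    ["Bread", "Diaper", "Beer", "Eggs"],
    ["Milk", "Diaper", "Beer", "Coke"],
    ["Bread", "Milk", "Diaper", "Beer"],
    ["Bread", "Milk", "Diaper", "Coke"],
]
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(sales), columns=te.columns_)

frequent = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```

High-confidence rules become candidate invariants; records that violate them are candidates for review or repair.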
Why does my model begin to perform poorly over time?
Gama et al., A survey on concept drift adaptation. ACM Computing Surveys Vol. 46, Issue 4 (2014)
Concept drift (or concept shift)
- properties to predict change over time (e.g., what is credit card fraud)
- model has not learned the relevant concepts
- over time: different expected outputs for same inputs
Data drift (or covariate shift, virtual drift, distribution shift, or population drift)
- characteristics of input data changes (e.g., customers with face masks)
- input data differs from training data
- over time: predictions less confident, further from training data
Upstream data changes
- external changes in data pipeline (e.g., format changes in weather service)
- model interprets input data incorrectly
- over time: abrupt changes due to faulty inputs
How do we fix these drifts?
Concept drift and data drift are separate concepts
In practice and in the literature, they are not always clearly distinguished
Colloquially, "drift" encompasses all forms of model degradation and environment change
Define the term for your target audience
From the first lecture and syllabus:
Within groups, we expect that you are honest about your contribution to the group's work. [...] This also applies to in-class discussions, where indicating working with others who did not participate in the discussion is considered an academic honesty violation.
What kind of drift might be expected?
As a group, tagging members, write plausible examples in #lecture:
- Concept Drift:
- Data Drift:
- Upstream data changes:
How to detect concept drift in production?
Model degradations observed with telemetry
Telemetry indicates different outputs over time for similar inputs
Differences in influential features and feature importance over time
Relabeling training data changes labels
Interpretable ML models indicate rules that no longer fit
(many papers on this topic, typically on statistical detection)
How to detect data drift in production?
Model degradations observed with telemetry
Distance between input distribution and training distribution increases
Average confidence of model predictions declines
Relabeling of training data retains stable labels
- Compare distributions over time (e.g., t-test; see the sketch below)
- Detect both sudden jumps and gradual changes
- Distributions can be manually specified or learned (see invariant detection)
Plot distributions of features (histograms, density plots, kernel density estimation)
- Identify which features drift
Define distance function between inputs and identify distance to closest training data (e.g., energy distance, see also kNN)
Anomaly detection and "out of distribution" detection
Compare distribution of output labels
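A minimal sketch of one such check, comparing a feature's training and production distributions with a two-sample Kolmogorov-Smirnov test (scipy; the data and threshold are made up):

```python
# Sketch: flag drift when a feature's production distribution departs
# from its training distribution. Data and the alpha are assumptions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=3.0, scale=1.0, size=10_000)  # training data
prod_feature = rng.normal(loc=3.5, scale=1.0, size=2_000)    # recent production data

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"possible data drift (KS={stat:.3f}, p={p_value:.2g})")
```

In practice, such a test would run per feature over a sliding window, with alert thresholds tuned to avoid noisy false alarms.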
Image source: https://rpubs.com/ablythe/520912
Image source and further readings: Detect data drift (preview) on models deployed to Azure Kubernetes Service (AKS)
Regularly retrain model on recent data
- Use evaluation in production to detect decaying model performance
Involve humans when increasing inconsistencies detected
- Monitoring thresholds, automation
Monitoring, monitoring, monitoring!
What kind of monitoring for previously listed drift in Inventory scenario?
"Everyone wants to do the model work, not the data work"
Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021, May). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1-15).
Data flows across components, e.g., from the user interface into a database, to a crowd-sourced labeling team, and into the ML pipeline
Documentation at the interfaces is important
Humans interacting with the system
- Entering data, labeling data
- Observed with sensors/telemetry
- Incentives, power structures, recognition
Organizational practices
- Value, attention, and resources given to data quality
Teams rarely document expectations of data quantity or quality
Data quality tests are rare, but some teams adopt defensive monitoring
- Local tests about assumed structure and distribution of data
- Identify drift early and reach out to producing teams
Several ideas for documenting distributions, including Datasheets and Dataset Nutrition Label
- Mostly focused on static datasets, describing origin, considerations, labeling procedure, and distributions; Example
🗎 Gebru, Timnit, et al. "Datasheets for datasets." Communications of the ACM 64, no. 12 (2021).
🗎 Nahar, Nadia, et al. “Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process.” In Proc. ICSE, 2022.
Notes: Source: https://storage.googleapis.com/openimages/open_images_extended_miap/Open%20Images%20Extended%20-%20MIAP%20-%20Data%20Card.pdf
Physical world brittleness
- Idealized data, ignoring realities and change of real-world data
- Static data, one-time learning mindset, no planning for evolution
Inadequate domain expertise
- Not understanding data and its context
- Involving experts only late for trouble shooting
Conflicting reward systems
- Missing incentives for data quality
- Not recognizing the importance of data quality; dismissed as a technicality
- Missing data literacy with partners
Poor (cross-org.) documentation
- Conflicts at team/organization boundary
- Undetected drift
Sambasivan, N., et al. (2021). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proc. Conference on Human Factors in Computing Systems.
- Interacting with physical world brittleness
- Inadequate domain expertise
- Conflicting reward systems
- Poor (cross-organizational) documentation
“Raw data” is an oxymoron
- Data from many sources, often inaccurate, imprecise, inconsistent, incomplete, ... -- many different forms of data quality problems
- Many mechanisms for enforcing consistency and cleaning
- Data schema ensures format consistency
- Data quality rules ensure invariants across data points
- Concept and data drift are key challenges -- monitor
- Data quality is a system-level concern
- Data quality at the interface between components
- Documentation and monitoring often poor
- Involves organizational structures, incentives, ethics, ...
- Schelter, S., Lange, D., Schmidt, P., Celikel, M., Biessmann, F. and Grafberger, A., 2018. Automating large-scale data quality verification. Proceedings of the VLDB Endowment, 11(12), pp.1781-1794.
- Polyzotis, Neoklis, Martin Zinkevich, Sudip Roy, Eric Breck, and Steven Whang. "Data validation for machine learning." Proceedings of Machine Learning and Systems 1 (2019): 334-347.
- Polyzotis, Neoklis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. 2017. “Data Management Challenges in Production Machine Learning.” In Proceedings of the 2017 ACM International Conference on Management of Data, 1723–26. ACM.
- Theo Rekatsinas, Ihab Ilyas, and Chris Ré, “HoloClean - Weakly Supervised Data Repairing.” Blog, 2017.
- Ilyas, Ihab F., and Xu Chu. Data cleaning. Morgan & Claypool, 2019.
- Moreno-Torres, Jose G., Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V. Chawla, and Francisco Herrera. "A unifying view on dataset shift in classification." Pattern recognition 45, no. 1 (2012): 521-530.
- Vogelsang, Andreas, and Markus Borg. "Requirements Engineering for Machine Learning: Perspectives from Data Scientists." In Proc. of the 6th International Workshop on Artificial Intelligence for Requirements Engineering (AIRE), 2019.
- Humbatova, Nargiz, Gunel Jahangirova, Gabriele Bavota, Vincenzo Riccio, Andrea Stocco, and Paolo Tonella. "Taxonomy of real faults in deep learning systems." In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp. 1110-1121. 2020.