### Good Data Criteria
- Covers important cases (good coverage of various $x$)
- Is defined consistently (the definition of labels $y$ is unambiguous and there is no noise in $p(y|x)$)
- Has timely feedback from production data (distribution covers data drift and concept drift)
- Is sized appropriately

### Labeling Ambiguities

Labeling ambiguities can be a source of noise in financial market data. Consider the problem of predicting the price/yield spread between two bonds from the same issuer that are identical except for their coupons. That spread can differ from one issuer to the next simply because different issuers' bonds are priced by different market participants, even when all other attributes, such as maturity, call dates, or even perceived issuer creditworthiness, are the same. That should give us pause before asserting that machine learning can be used in such cases without actually training such a model.

One can argue that sentiment analysis is ambiguous too, since labeling an article as positive, negative, or neutral is subjective.

Eliminating labeling ambiguities is sometimes more effective at improving data quality and algo effectiveness than obtaining more data.
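
One way to find ambiguous labels before collecting more data is to have several labelers label the same examples and flag the ones they disagree on. The sketch below assumes three labelers per article; the function `agreement`, the article names, and the labels are all hypothetical and only illustrate the idea.

```python
from collections import Counter

def agreement(labels):
    """Fraction of labelers who chose the most common label for one example."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical sentiment labels from three labelers per article
labels_by_article = {
    "article_1": ["positive", "positive", "positive"],
    "article_2": ["positive", "neutral", "negative"],
    "article_3": ["neutral", "neutral", "negative"],
}

scores = {name: agreement(labels) for name, labels in labels_by_article.items()}

# Articles without full agreement are candidates for a clearer labeling rule
ambiguous = [name for name, score in scores.items() if score < 1.0]
print(ambiguous)
```

Articles that land in `ambiguous` are the ones where tightening the label definition is likely to pay off first.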

### Major types of data problems

The following screenshot is from Week 3 of [Introduction to Machine Learning in Production](https://www.coursera.org/learn/introduction-to-machine-learning-in-production?specialization=machine-learning-engineering-for-production-mlops). It is said that experience, intuition, and algos/models that work on problems in one quadrant are more easily transferred to other problems in the same quadrant.

![image.png](attachment:image.png)

### Obtaining Data

From Week 3 of [Introduction to Machine Learning in Production](https://www.coursera.org/learn/introduction-to-machine-learning-in-production?specialization=machine-learning-engineering-for-production-mlops)

![image.png](attachment:image.png)

### Data Pipeline

Although the data pipeline can be messy in the PoC phase, you still need to take extensive notes so that you can replicate the same pipeline in prod. Retaining metadata helps with this, as does keeping track of data provenance (where the data comes from) and lineage (the sequence of data processing steps).
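
As a rough illustration of what tracking provenance and lineage could look like in a PoC, the sketch below logs the data source and every processing step together with its parameters and a fingerprint of the output. The `LineageLog` class, the step functions, and the file name `vendor_feed.csv` are hypothetical, not part of any particular tool.

```python
import hashlib
import json
from dataclasses import dataclass, field

@dataclass
class LineageLog:
    """Hypothetical helper: records the data source and each processing step."""
    source: str                                # provenance: where the raw data is from
    steps: list = field(default_factory=list)  # lineage: ordered processing steps

    def apply(self, name, func, data, **params):
        """Run one step and record its name, parameters, and an output fingerprint."""
        result = func(data, **params)
        payload = json.dumps(result, sort_keys=True).encode()
        digest = hashlib.sha256(payload).hexdigest()[:12]
        self.steps.append({"step": name, "params": params, "output_sha256": digest})
        return result

def drop_missing(rows):
    """Remove rows with no price."""
    return [r for r in rows if r["price"] is not None]

def clip_price(rows, upper):
    """Cap prices at `upper`."""
    return [{**r, "price": min(r["price"], upper)} for r in rows]

raw = [
    {"id": 1, "price": 101.5},
    {"id": 2, "price": None},
    {"id": 3, "price": 250.0},
]

log = LineageLog(source="vendor_feed.csv")  # hypothetical source name
clean = log.apply("drop_missing", drop_missing, raw)
clean = log.apply("clip_price", clip_price, clean, upper=200.0)

# The log can be saved next to the dataset and replayed when building the prod pipeline
print(json.dumps({"source": log.source, "steps": log.steps}, indent=2))
```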

### Things to watch out for when preparing data

- **Leakage**: Information about labels sneaks into features
- **Sample bias**: Test inputs and deployment inputs have different distributions
- **Nonstationarity** or **Domain Shift**: When the thing you are modeling changes over time (note that these terms are used loosely)
    - **Covariate Shift** or **Data Shift**: input distribution $P(X)$ changes over time
    - **Concept Shift**: correct output $P(y|X)$ for given input changes over time
    - **Label Shift**: label distribution $P(y)$ changes over time
  
  These can be the norm rather than the exception in financial markets. One solution is to recalibrate the model in an online fashion and/or to put more weight on the most recent data, with the weights controlled by a half-life parameter.
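
A minimal sketch of half-life weighting, assuming exponential decay: an observation of age $a_i$ gets weight $w_i = 0.5^{a_i / h}$, where $h$ is the half-life, so a sample that is $h$ periods old counts half as much as the newest one. The function name `half_life_weights` and the sample values are illustrative assumptions.

```python
import numpy as np

def half_life_weights(ages, half_life):
    """Exponential-decay weights: weight halves every `half_life` units of age."""
    ages = np.asarray(ages, dtype=float)
    return 0.5 ** (ages / half_life)

# Ages in days; the newest observation has age 0
ages = np.array([0, 30, 60, 90])
weights = half_life_weights(ages, half_life=30)
print(weights)

# Example use: a recency-weighted average of some observed values
values = np.array([1.0, 2.0, 3.0, 4.0])
weighted_mean = np.average(values, weights=weights)
print(weighted_mean)
```

The same weights can typically be passed as per-sample weights when refitting a model, which combines naturally with periodic recalibration.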

## References
- [Introduction to Machine Learning in Production](https://www.coursera.org/learn/introduction-to-machine-learning-in-production?specialization=machine-learning-engineering-for-production-mlops)