<a href="https://colab.research.google.com/github/m4rCs1l/Data-Wrangling-and-Data-Quality/blob/main/Chapter3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Fitness** (the appropriateness of data for use in a particular context or to answer a particular question)

The documentation
in your data diary is irreplaceable. The conclusion that you
reach about a dataset being representative or valid will, in most cases,
be informed by everything from your own reading and research to
conversations with experts to additional datasets you’ve located. But
without good documentation of who said what or how you came across
the information, any attempt to repeat or confirm your previous work
will almost certainly fail.

## Assessing Data Fit

Perhaps one of the most common misconceptions about data wrangling is
that it is a predominantly quantitative process, that is, that data wrangling is
mostly about working with numbers, formulas, and code. In fact,
irrespective of the type of data you’re dealing with—it could be anything
from temperature readings to social media posts—the core work of data
wrangling involves making judgment calls: from whether your data
accurately represents the phenomenon you’re investigating, to what to do
about missing data points and whether you have enough data to generate
any real insight at all. That first concept—the extent to which a given
dataset accurately represents the phenomenon you’re investigating—is
broadly what I mean by its fit, and assessing your dataset’s fitness for
purpose is much more about applying informed judgment than it is about
applying mathematical formulas.

When you begin the
process of trying to answer a question with data, it’s not enough to know
just what is in the dataset; you need to know about the processes and
mechanisms used to collect it.

Because of this, over time the
scientific community has developed three key metrics for determining the
appropriateness or fit of a dataset for answering a given question: **validity**,
**reliability**, and **representativeness**.

## Validity

Validity describes the extent to which something measures
what it is supposed to.

In our room temperature example, this would mean
ensuring that the type of thermometer you’ve chosen will actually measure
the air temperature rather than something else. For example, while
traditional liquid-in-glass thermometers will probably capture air
temperature well, infrared thermometers will tend to capture the
temperature of whatever surface they’re pointed at.

Unsurprisingly, things only get more involved when we’re not collecting
data about common physical phenomena. **Construct validity** describes the
extent to which your data measurements effectively capture the (usually
abstract) construct, or idea, you’re trying to understand.

In data analysis, this process of selecting measures is known as
**operationalizing a construct**, and it inevitably requires choosing among—
and balancing—proxies for the idea or concept you are trying to
understand. These proxies—like graduation rates, test scores,
extracurricular activities, and so on—are things about which you can collect
data that you are choosing to use to represent an **abstract concept** (“best”
school) that cannot be measured directly. Good-quality data, to say the
least, must have good **construct validity** with respect to your question,
otherwise your data wrangling results will be meaningless.

In short, you need to define the abstract concept, operationalize it by choosing proxies (this determines construct validity), and then select measures that describe each proxy thoroughly enough (this determines content validity).


**Content validity** has to do with ***how complete*** your data is for a given
proxy measurement.

***Example: poor construct validity***

Suppose a researcher develops a questionnaire to measure the stress levels of office workers. The researcher decides to **use questions that assess only the amount of time spent at work as the main indicator of stress**. But stress is a multifaceted construct that includes not only working hours but also sleep quality, physical and mental health, personal relationships, support from colleagues, and more. If we try to validate the questionnaire by comparing its results with other established ways of measuring stress (e.g., physiological measures such as cortisol levels) or with expert assessments, we should expect to see many differences.

***Example: poor content validity***

Take the same study. The researcher wants to measure "physical health" but includes only body mass index (BMI) as data. Many more factors affect physical health, so the data is incomplete for that proxy.
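One rough way to make a content-validity check concrete is to write down the facets you think a proxy needs and compare them with the columns a dataset actually provides. The sketch below assumes a hypothetical facet list for "physical health" and hypothetical column names; neither comes from the source, and the right facet list is a judgment call that needs subject-matter input.

```python
# Hypothetical facets for a "physical health" proxy; illustrative only.
EXPECTED_FACETS = {
    "body_mass_index",
    "resting_heart_rate",
    "blood_pressure",
    "sleep_hours",
    "weekly_exercise_minutes",
}


def facet_coverage(columns, expected=EXPECTED_FACETS):
    """Return (covered, missing) facets for the columns a dataset provides."""
    available = set(columns)
    covered = expected & available
    missing = expected - available
    return covered, missing


# Hypothetical dataset that records only BMI, as in the example above.
dataset_columns = ["participant_id", "body_mass_index"]
covered, missing = facet_coverage(dataset_columns)
print(f"Covered {len(covered)} of {len(EXPECTED_FACETS)} facets")
print("Missing:", sorted(missing))
```

A long "missing" list does not prove the data is unusable, but it is a prompt to narrow the question or look for additional data.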

## Reliability

Within a dataset, the reliability of a given measure describes its accuracy
and stability. Together, these help us assess whether the same measure taken
twice in the same circumstances will give us the same (or at least very
similar) results.

With abstract concepts and real-world data, determining the reliability of a
data measure is especially tricky, because it is never really possible to
collect the data more than once—whether because the cost is prohibitive,
the circumstances can’t be replicated, or both. In those cases, we typically
estimate reliability by comparing one similar group to another, using either
previously or newly collected data.
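When two rounds of the same measurement are available, a simple agreement summary can give a first impression of reliability. This is a minimal sketch with made-up readings and an assumed tolerance of 0.5 units; the function name `repeat_agreement` is hypothetical, and a real assessment would normally use a method suited to the kind of data involved.

```python
from statistics import mean


def repeat_agreement(first, second, tolerance):
    """Summarize how closely two rounds of the same measurement agree."""
    if len(first) != len(second):
        raise ValueError("both rounds need the same number of readings")
    diffs = [abs(a - b) for a, b in zip(first, second)]
    return {
        "mean_abs_diff": mean(diffs),
        "max_abs_diff": max(diffs),
        # Fraction of paired readings that differ by no more than the tolerance.
        "share_within_tolerance": sum(d <= tolerance for d in diffs) / len(diffs),
    }


# Made-up temperature readings taken twice under the same conditions.
round_one = [21.0, 21.5, 22.0, 20.5]
round_two = [21.0, 21.0, 22.5, 23.5]
summary = repeat_agreement(round_one, round_two, tolerance=0.5)
print(summary)
```

A single large disagreement, like the last pair here, is worth investigating on its own, since it may point to a specific problem with one instrument or one collection step.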

***Example: unreliable data***

A manufacturing plant uses sensors to monitor the temperature of its machinery, to ensure optimal operating conditions and prevent overheating. The readings contain frequent errors caused by environmental interference: a sensor is placed near a door, so the machinery's actual temperature is higher than the measured one.

## Representativeness

The key value proposition for data-driven systems is that they allow us to
generate insights—or even predictions—about people and phenomena that
are too massive or too complex for humans to reason about effectively. Whether those insights are an accurate portrait of
a particular population or situation, however, depends directly on the
representativeness of the data being used.

Whether a dataset is sufficiently representative depends on a few things, the
most significant of which goes back to the “for whom?” question.

Anytime you’re working with a subset or sample in this way, it’s crucial to
make sure that it is representative of the broader population to which you
plan to apply your findings. While proper sampling methodology is beyond
the scope of this book, the basic idea is that in order for your insights to
accurately generalize to a particular community of people, the data sample
you use must proportionally reflect that community’s makeup. That means
that you need to invest the time and resources to understand a number of
things about that community as a whole before you can even know if your
sample is representative.

As you can see, ensuring representativeness demands that we carefully
consider which characteristics of a population are relevant to our data
wrangling question and that we seek out enough additional information to
ensure that our dataset proportionally represents those characteristics.
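The idea of a sample "proportionally reflecting" a population can be sketched as a comparison of group shares. The population shares and sample below are invented for illustration, and `share_gaps` is a hypothetical helper; choosing which characteristics to compare is still the part that needs research.

```python
from collections import Counter


def share_gaps(sample_labels, population_shares):
    """Return sample share minus population share for each group."""
    counts = Counter(sample_labels)
    total = len(sample_labels)
    return {
        group: counts.get(group, 0) / total - expected
        for group, expected in population_shares.items()
    }


# Invented reference shares for the broader population.
population_shares = {"urban": 0.5, "suburban": 0.25, "rural": 0.25}
# Invented sample of ten respondents with no rural members at all.
sample = ["urban"] * 8 + ["suburban"] * 2

gaps = share_gaps(sample, population_shares)
for group, gap in gaps.items():
    # Positive means overrepresented, negative means underrepresented.
    print(f"{group}: {gap:+.2f}")
```

Large positive or negative gaps suggest that findings from the sample should not be generalized to the whole population without further work.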

For example, data about search engine trends, social
media activity, public transit usage, or smartphone ownership
are all extremely unlikely to be representative of the broader population,
since they are inevitably influenced by things like internet access and
income level ***(in the same way, it would be badly misleading to infer the "percentage of different languages in the world" from "languages in the USA")***. This means that some communities are overrepresented in these
datasets while others are (sometimes severely) underrepresented. The result
is systems that don’t generalize, like facial recognition systems that cannot
“see” Black faces.

If you are faced with nonrepresentative data, what do you do? At the very
least, you will need to revise (and clearly communicate) your “for whom?”
assessment to reflect whatever population it does represent; this is the only
community for whom your data wrangling insights will be valid.

# **Data Integrity**

Data integrity is
about whether the data you have can support the analyses you’ll need to
perform in order to answer that question.

In general, a high-integrity dataset will, to one degree or another, be:


**Necessary, but not sufficient**

*   Of known provenance
*   Well-annotated

**Important**

*   Timely
*   Complete
*   High volume
*   Multivariate
*   Atomic

**Achievable**

*   Consistent
*   Clear
*   Dimensionally structured

## Necessary but not sufficient

### Of known provenance

Using a dataset
collected by others requires putting a significant amount of trust in them.

This is why knowing the provenance of a dataset is so important: if you
don’t know who compiled the data, the methods that they used, and/or the
purpose for which they collected it, you will have a very hard time judging
whether it is fit for your data wrangling purpose, or how to correctly
interpret it.

You should try to
find out enough about the data collectors' professional backgrounds, their motivations for
collecting the data (is it legally mandated, for example?), and the methods
they employed so that you have some sense of which measures you’ll want
to corroborate versus those that might be okay to take at face value.

### Well-annotated

A well-annotated dataset has enough surrounding information, or metadata,
to make interpretation possible. This will include everything from high-level
explanations of the data collection methodology to the “data
dictionaries” that describe each data measure right down to its units.

For
example, imagine trying to interpret a budget without knowing if the figures
provided refer to dollars, thousands of dollars, or millions of dollars—it’s
clearly impossible.
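A data dictionary can be as simple as a mapping from each column to a description and a unit. The sketch below uses a hypothetical dictionary and hypothetical column names to show two things: spotting columns with no documentation, and refusing to interpret a budget figure whose unit is unknown.

```python
# Hypothetical data dictionary: one entry per column, including its unit.
data_dictionary = {
    "department": {"description": "Name of the department", "unit": None},
    "budget": {"description": "Approved annual budget", "unit": "thousands of USD"},
}


def undocumented_columns(columns, dictionary):
    """List columns that have no entry in the data dictionary."""
    return [name for name in columns if name not in dictionary]


def to_dollars(value, unit):
    """Convert a budget figure to dollars based on its documented unit."""
    multipliers = {"USD": 1, "thousands of USD": 1_000, "millions of USD": 1_000_000}
    if unit not in multipliers:
        raise ValueError(f"cannot interpret value without a known unit: {unit!r}")
    return value * multipliers[unit]


columns = ["department", "budget", "adjusted"]
print(undocumented_columns(columns, data_dictionary))       # ['adjusted']
print(to_dollars(250, data_dictionary["budget"]["unit"]))   # 250000
```

Raising an error for an unknown unit is a deliberate choice in this sketch: guessing the unit would hide exactly the ambiguity that good annotation is meant to remove.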

## Important

### Timely

How up to date is the data that you’re using? Unless you’re studying a
historical period, ensuring that your data is recent enough to meaningfully
describe the current state of the world is important—though how old is “too
old” will depend on both the phenomenon you’re exploring and the
frequency with which data about it is collected and released. Unless it’s about a field you’re already familiar with, assessing whether
your data is timely will likely require some research with experts as well as
the data publisher.
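Once you have decided how old is "too old" for your question, the check itself is simple. This sketch uses invented record dates, a fixed reference date so that the result is reproducible, and an assumed 90-day threshold; the threshold is the part that depends on the phenomenon and on how often the data is released.

```python
from datetime import date


def data_age_days(record_dates, as_of):
    """Days between the newest record and the reference date."""
    return (as_of - max(record_dates)).days


# Invented record dates; in practice these come from the dataset itself.
record_dates = [date(2023, 1, 15), date(2023, 6, 30), date(2023, 3, 1)]
as_of = date(2023, 12, 31)  # fixed reference date for reproducibility
max_age_days = 90           # assumed threshold, not a general rule

age = data_age_days(record_dates, as_of)
print(age, "days old;", "timely" if age <= max_age_days else "possibly stale")
```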

### Complete

Does the dataset contain all of the data values it should? Can we still
generate useful data insights when parts of the data are incomplete? Addressing this question means first answering two others. First, why is the
data missing? Second, do you need that data measure in order to perform a
specific analysis needed for your data wrangling process? Whatever the reason,
discovering why some part of the data is missing is essential in order to
know how to proceed. Even truncated data may not be a problem if what we have
available covers a sufficient time period for our purposes, but it is still
useful to learn the true number of records and date range of the data for
context.

In other words, while having complete data is always preferable, once you
know why that data is missing, you may be able to proceed with your data
wrangling process, regardless. But you should always find out—and be sure
to document what you learn!
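Before asking why data is missing, it helps to know how much is missing and where. The following sketch counts empty values per field and reports the record count and date range for context. The records and field names are invented, and treating `None` and empty strings as "missing" is an assumption that will not fit every dataset.

```python
def missing_by_field(records, fields):
    """Count records where each field is absent, None, or an empty string."""
    return {
        field: sum(1 for row in records if row.get(field) in (None, ""))
        for field in fields
    }


# Invented records; ISO-formatted date strings sort chronologically.
records = [
    {"date": "2021-01-04", "amount": 120.0, "region": "north"},
    {"date": "2021-01-05", "amount": None, "region": "south"},
    {"date": "2021-01-06", "amount": 95.5, "region": ""},
    {"date": "2021-01-07", "amount": None},
]
fields = ["date", "amount", "region"]

missing = missing_by_field(records, fields)
dates = [row["date"] for row in records if row.get("date")]
print(len(records), "records from", min(dates), "to", max(dates))
print(missing)
```

The counts only tell you what is missing; the reasons still have to come from the documentation or the data publisher, and belong in your data diary.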

### High volume

How many data points are “enough”? At minimum, a dataset will need to
have sufficient records to support the type of analysis needed to answer
your particular question:

*   If what you need is a count, you need all of the records.
*   If your question is about general or generalizable patterns or trends, what counts as “enough” is a little less clear: the answer depends largely on specifying the question correctly, which in most cases requires being very specific.

One of the trickiest parts of assessing data “completeness,” however, is that
accounting for factors that may influence the trend or pattern you’re
investigating is difficult without knowing the subject area pretty well
already.

The answer, as it is so often, is (human) experts (plus Google Scholar).

### Multivariate