## Overview
1. Assess Data Quality
2. Handle Missing Values
3. Transform Features
4. Reshape (structure, granularity)

Quality checking (and fixing) and transformation are often cyclical: the quality checks point us toward transformations we need to make, and when we check the transformed columns to confirm that our data are ready for analysis, we may discover they need further cleaning.

### Different Expectations for Quality

Data from a scientific experiment or study are typically clean, well-documented, and have a simple structure. These data are organized to be broadly shared so that others can build on or reproduce the findings. They are typically ready for analysis after little to no wrangling.

Data from government surveys often come with very detailed codebooks and metadata describing how the data are collected and formatted, and these datasets are also typically ready for exploration and analysis right out of the “box”.

Administrative data can be clean, but without inside knowledge of the source we may need to extensively check their quality. Also, since we often use these data for a purpose other than why they were collected in the first place, we may need to transform features or combine data tables.

Informally collected data, such as data scraped from the Web, can be quite messy and tend to come with little documentation. For example, texts, tweets, blogs, Wikipedia tables, etc. usually require formatting and cleaning to transform them into information ready for analysis.



## 9.1 Quality Check
Quality checks can reveal issues in the data that need to be addressed before proceeding with analysis.

### Four Vantage Points

#### 1. Scope
Do the data match your understanding of the population?
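A scope check can be sketched in pandas by comparing what the data contain against what we expect the population to be. The column names, the expected year, and the expected set of states below are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical data; the expected period and states are assumptions.
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-05", "2021-06-30", "2022-02-14"]),
    "state": ["CA", "NV", "CA"],
})
expected_states = {"CA", "NV", "OR"}

# Records that fall outside the period we believe the data cover.
out_of_period = df[df["date"].dt.year != 2021]

# Groups we expected but did not find, and groups we did not expect.
missing_states = expected_states - set(df["state"])
unexpected_states = set(df["state"]) - expected_states

print(out_of_period)
print(missing_states, unexpected_states)
```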

#### 2. Measurement


Consider what might be a _**reasonable value**_ for a feature:
- based on common knowledge of plausible ranges
- based on documentation that tells us the expected values for a feature

Ensure that the **_data type_** and units match our expectations:
- a price should be a number
- weight values should be recorded in the expected units, for example pounds rather than kilograms
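The measurement checks above might look like the following in pandas. This is a minimal sketch: the column names and the plausible price range of 0 to 500 are assumptions, not values from any real dataset.

```python
import pandas as pd

# Hypothetical data; column names and the plausible range are assumptions.
df = pd.DataFrame({
    "price": [12.5, 8.0, -3.0, 950.0],
    "weight_lb": ["150", "172", "68", "181"],  # numbers stored as text
})

# Data type check: which features are actually numeric?
is_numeric = {col: pd.api.types.is_numeric_dtype(df[col]) for col in df.columns}

# Convert text to numbers; entries that cannot be parsed become NaN.
weight = pd.to_numeric(df["weight_lb"], errors="coerce")

# Range check: flag prices outside the assumed plausible range.
bad_price = df[~df["price"].between(0, 500)]

print(is_numeric)
print(bad_price)
```

A unit mismatch (pounds versus kilograms) usually cannot be detected from the type alone, so a look at the distribution, such as `weight.describe()`, is a reasonable follow-up.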

#### 3. Related Features
At times two features have built-in conditions on their values that we can cross-check for internal consistency.

A cross-tabulation of `field1` and `field2` can confirm that the constraint is met:

`display_df(pd.crosstab(df[field1], df[field2]), rows=12)`
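As a self-contained sketch of this cross-check, the example below uses plain `print` because `display_df` is a helper that is not defined in these notes. The features and the constraint (an open record should not have a close date) are hypothetical.

```python
import pandas as pd

# Hypothetical data; the constraint is that open records have no close date.
df = pd.DataFrame({
    "status": ["closed", "closed", "open", "open", "closed"],
    "close_date": ["present", "present", "missing", "present", "missing"],
})

# Rows are values of status, columns are values of close_date.
table = pd.crosstab(df["status"], df["close_date"])
print(table)

# Any count in this cell is a record that violates the assumed constraint.
violations = table.loc["open", "present"]
print(violations)
```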

#### 4. Analysis
Even when data pass the previous quality checks, problems can arise with their usefulness:
- all but a handful of values for a feature are identical
- too many values are missing, especially if there is a discernible pattern in the missing values
- a feature has so many bad/corrupted values that we might question the accuracy of even those values that fall in the appropriate range
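The first two usefulness problems can be screened for with a few summaries. The sketch below uses made-up data and column names; how large a share counts as "too many" is a judgment call that the code does not settle.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "region": ["west"] * 9 + ["east"],
    "income": [52.0, np.nan, 61.0, np.nan, np.nan, 48.0, 75.0, np.nan, 58.0, 66.0],
})

# Share of records taken by the most common value of a feature.
top_share = df["region"].value_counts(normalize=True).iloc[0]

# Fraction of missing values in each feature.
missing_frac = df.isna().mean()

# Pattern check: does the rate of missing income differ across regions?
missing_by_region = df["income"].isna().groupby(df["region"]).mean()

print(top_share)
print(missing_frac)
print(missing_by_region)
```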

### Fixing or Not
When the checks above uncover problems, we need to decide what to do next.

Whatever approach you take, you will want to study the possible impact of the changes that you make on your analysis. For example, try to determine whether the records with corrupted values are similar to each other, and different from the rest of the data.
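One way to study whether corrupted records differ from the rest is to flag them and compare the two groups on other features. In this sketch the data, the column names, and the rule that a negative age marks a corrupted record are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data; a negative age is the assumed sign of corruption.
df = pd.DataFrame({
    "age": [34, -1, 51, -1, 29, 46],
    "source": ["web", "paper", "web", "paper", "web", "web"],
    "score": [70, 55, 72, 53, 69, 74],
})

# Label each record instead of dropping anything.
df["age_status"] = np.where(df["age"] < 0, "corrupted", "ok")

# Compare the two groups on the other features.
source_table = pd.crosstab(df["age_status"], df["source"])
mean_score = df.groupby("age_status")["score"].mean()

print(source_table)
print(mean_score)
```

If the flagged records cluster in one group, as they do here by construction, dropping them would change which part of the population the analysis describes.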

##### 1. Leave it as is
- You have discovered a characteristic of your data that informs how you do your analysis but otherwise does not need correcting.

- The problem is relatively minor and most likely will not impact your analysis.

- Alternatively, replace corrupted values with `NaN` so that they are treated as missing.
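Replacing corrupted values with `NaN` can be sketched with `Series.where`, which keeps values where a condition holds and fills the rest with `NaN`. The plausible age range of 0 to 120 and the column names are assumptions.

```python
import pandas as pd

# Hypothetical data; -1 and 999 stand in for corrupted entries.
df = pd.DataFrame({"age": [34, -1, 51, 999, 29]})

# Keep values inside the assumed plausible range; everything else becomes NaN.
# Writing to a new column leaves the original values available for review.
df["age_clean"] = df["age"].where(df["age"].between(0, 120))

print(df)
```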

##### 2. Modify Individual Values
When you correct a value, create a new feature that holds the modified value and preserve the original feature.
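A minimal sketch of this practice follows. The `state` column and the correction mapping are hypothetical; the point is only that the corrected values go into a new column.

```python
import pandas as pd

# Hypothetical data with inconsistent spellings.
df = pd.DataFrame({"state": ["CA", "Calif.", "NY", "ca", "N.Y."]})

# Assumed corrections, made up for illustration.
fixes = {"Calif.": "CA", "ca": "CA", "N.Y.": "NY"}

# The original column is untouched; corrected values go in a new feature.
df["state_fixed"] = df["state"].replace(fixes)

print(df)
```

Keeping both columns makes it easy to count how many values changed, for example with `(df["state"] != df["state_fixed"]).sum()`.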

##### 3. Remove a Column
If many values in a feature have problems, then consider eliminating that feature entirely. Alternatively, rather than excluding the feature, there may be a transformation that lets you keep it while reducing the level of detail recorded.
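One transformation that reduces the level of detail is binning a numeric feature into coarse categories with `pd.cut`. The bin edges and labels here are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data: exact ages that we do not fully trust.
df = pd.DataFrame({"age": [3, 17, 25, 42, 67, 88]})

# right=False makes each bin include its left edge: [0, 18), [18, 65), [65, 120).
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 18, 65, 120],
    labels=["under 18", "18 to 64", "65 and over"],
    right=False,
)

print(df)
```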

##### 4. Drop Records
DON'T:
- drop a large number of observations from a dataset without good reason
- choose exclusion criteria that simply correspond to dropping the records with corrupted values

DO:
- Try to scale back your investigation to a particular **subgroup** of the data that is clearly defined by some criteria.
- Even if an unusual value is in fact correct, you still might decide to exclude the record from your analysis because it’s so different from the rest of your data and you do not want it to overly influence your analysis.
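Both suggestions can be sketched briefly: restrict the data with an explicit criterion, then compare a summary with and without the unusual record to see how much it influences the result. The data, the cutoff year, and the price threshold are assumptions.

```python
import pandas as pd

# Hypothetical data; 9500 is an unusual value that may still be correct.
df = pd.DataFrame({
    "year": [2004, 2005, 2010, 2012, 2015],
    "price": [210, 225, 310, 9500, 340],
})

# Subgroup defined by a clear criterion, not by which rows look bad.
recent = df.query("year >= 2010")

# Measure the influence of the unusual record before deciding to exclude it.
mean_all = recent["price"].mean()
mean_trimmed = recent.loc[recent["price"] < 5000, "price"].mean()

print(mean_all, mean_trimmed)
```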

## 9.2 Missing Values