# Data Cleaning + Screening Steps

---

## What is Data Cleaning?

* Lets you confirm that your data meets the assumptions of your analysis and that errors and outliers have been dealt with
* Each type of analysis will have different types of data screening

## 🚨 The Big Important Rule 🚨

> For __hypothesis testing__ we traditionally use a `p < 0.05` (less than) criterion because we're looking for a statistically significant relationship

> But for __data screening__ we use a much more stringent criterion of `p < 0.001` (also less than) because we want to make sure a value is astronomically wild before we remove it

## Data __Cleaning__ Steps for each Hypothesis

> ⚠️ It is paramount that these steps are performed _in the order outlined below_:

1. Strip the data set down to __Only Relevant Variables__
2. Check and fix data for __Accuracy__
3. Check and fix data for __Missing__ values
4. Check and fix data for __Outliers__

---

### 1. Select Relevant Variables Only

There's no need to waste time cleaning or scrutinizing data that isn't relevant to the hypothesis you're testing.

In addition, for some of the following steps you will need to compare a value against the entire range of values for a single variable, or against the entire set of values across a single observation.

In these circumstances, focusing only on the variables and observations that are relevant to the hypothesis is essential. Including values from variables or observations that are not relevant would change the results and undermine the validity of your findings against the hypothesis.

#### Example:

Take a look at the following data set and coding:


| Coding Legend        |  |
|---------------|-------------------------------------------------------------------------------|
| **id**        | Unique identifier for each individual observation                              |
| __gender__    | `1` = Male, `2` = Female                                                      |
| __education__ | `1` = High School, `2` = Undergraduate Degree, `3` = Graduate Degree          |
| __race__      | `1` = White, `2` = Black, `3` = Hispanic, `4` = Pacific Islander, `8` = Other |
| __income__    | `1` = \$0-18k, `2` = \$18k-36k, `3` = \$36k-75k, `4` = \$75k-100k, `5` = \$100k+ |

| id | gender | education | race | income |
|----|--------|-----------|------|--------|
| 1  | 1      | 1         | 1    | 1      |
| 2  | 2      | 2         | 2    | 1      |
| 3  | 1      | 1         | 4    | 2      |
| 4  | 2      | 3         | 8    | 3      |
| 5  | 1      | 3         | 1    | 3      |

If you were to hypothesize a positive correlation between `education` and `income`, you would automatically remove `id`, `gender`, and `race` from your data set. And you would perform the remaining Data Cleaning steps with a data set that looked like this:

| education | income |
|-----------|--------|
| 1         | 1      |
| 2         | 1      |
| 1         | 2      |
| 3         | 3      |
| 3         | 3      |

> #### ⚠️  It is suggested that you duplicate your data first, then strip irrelevant columns.
> #### This allows you to preserve the original data set.
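
As a minimal sketch of this step, the snippet below duplicates the example data and keeps only `education` and `income`. It uses plain Python lists of dictionaries; the names `raw_rows` and `select_variables` are hypothetical and chosen only for illustration. A data-frame library would work just as well.

```python
import copy

# Example rows mirroring the table above.
raw_rows = [
    {"id": 1, "gender": 1, "education": 1, "race": 1, "income": 1},
    {"id": 2, "gender": 2, "education": 2, "race": 2, "income": 1},
    {"id": 3, "gender": 1, "education": 1, "race": 4, "income": 2},
    {"id": 4, "gender": 2, "education": 3, "race": 8, "income": 3},
    {"id": 5, "gender": 1, "education": 3, "race": 1, "income": 3},
]


def select_variables(rows, keep):
    """Return new rows containing only the columns named in `keep`."""
    # Duplicate first so the original data set is never modified.
    working = copy.deepcopy(rows)
    return [{name: row[name] for name in keep} for row in working]


cleaned = select_variables(raw_rows, ["education", "income"])
```

Because the function works on a copy, `raw_rows` still holds every original column and can be reused for a different hypothesis.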


---

### 2. Identify & Fix Inaccurate Data

#### Why is it important?

Most of the time the data is, at some point between its origin and you, modified by a human. This makes some of the data prone to human error. Therefore, we must _always_ __assume the data has mistakes__ and __thoroughly check for accuracy__.

If we do not check our data for accuracy our results may be skewed, misinterpreted, or lacking accurate context.

A lot of times you'll find a simple format conversion is needed. Other times you will identify inexplicable values. Here are some of the more common discrepancies in accuracy you may find in your data sets:

* __Numbers:__
    * `Nineteen` vs `19`
    * Float numbers `1.000000` instead of Integers `1`
    * Price fields need to be standardized `$10k` vs `10000.00`
    * Temperature recorded in Celsius but needed in Fahrenheit
* __Date/Time:__
    * Hour field needs to be adjusted to GMT based on recorded time zone
    * `M/D/YY` needs to be converted into `YYYY-MM-DD`
* __Strings__
    * Email field may need to be made all lowercase, checked for valid format, and/or authorized domain(s)
    * Name field may contain prefix like "Mr." or "Dr."
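
Several of the conversions listed above can be sketched with the Python standard library. Every function name here is hypothetical, and the rules are assumptions made for illustration: a `k` suffix means thousands, two-digit years follow Python's default century rule, the email check is a loose pattern rather than full validation, and only a handful of name prefixes are recognized. Spelled-out numbers such as `Nineteen` and time zone adjustments are not covered.

```python
import re
from datetime import datetime


def parse_number(text):
    """Turn '1.000000' into the integer 1; keep non-integral values as floats."""
    value = float(text)
    return int(value) if value.is_integer() else value


def parse_price(text):
    """Standardize prices such as '$10k' or '10000.00' to a float."""
    cleaned = text.strip().lower().replace("$", "").replace(",", "")
    multiplier = 1.0
    if cleaned.endswith("k"):  # assumption: 'k' means thousands
        multiplier = 1000.0
        cleaned = cleaned[:-1]
    return float(cleaned) * multiplier


def celsius_to_fahrenheit(celsius):
    """Convert a Celsius reading to Fahrenheit."""
    return celsius * 9 / 5 + 32


def to_iso_date(text):
    """Convert 'M/D/YY' to 'YYYY-MM-DD'."""
    return datetime.strptime(text, "%m/%d/%y").strftime("%Y-%m-%d")


# Loose shape check only: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize_email(text):
    """Lowercase an email; return None if it does not look like an address."""
    email = text.strip().lower()
    return email if EMAIL_PATTERN.match(email) else None


# Assumed prefix list; extend it to match your own data.
NAME_PREFIX = re.compile(r"^(mr|mrs|ms|dr)\.?\s+", re.IGNORECASE)


def strip_name_prefix(text):
    """Remove a leading 'Mr.', 'Mrs.', 'Ms.', or 'Dr.' from a name."""
    return NAME_PREFIX.sub("", text.strip())
```

For example, `parse_price("$10k")` and `parse_price("10000.00")` both yield `10000.0`, so the two formats become directly comparable.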

#### How do you handle inaccurate data?

👉 Either __fix__ or __delete__ the `value` (do <u>not</u> delete the entire observation or entire variable).
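
A minimal sketch of the "delete the value, not the observation" rule, assuming the valid codes come from the coding legend shown earlier. The names `VALID_CODES` and `blank_invalid_values` are hypothetical, and using `None` to represent a deleted value is an assumption of this example.

```python
# Allowed codes per variable, taken from the example coding legend.
VALID_CODES = {
    "gender": {1, 2},
    "education": {1, 2, 3},
    "race": {1, 2, 3, 4, 8},
    "income": {1, 2, 3, 4, 5},
}


def blank_invalid_values(rows, valid_codes):
    """Replace out-of-range values with None while keeping every row and column."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for column, allowed in valid_codes.items():
            if column in new_row and new_row[column] not in allowed:
                new_row[column] = None  # delete the value only
        cleaned.append(new_row)
    return cleaned


rows = [
    {"id": 1, "gender": 1, "education": 1, "race": 1, "income": 1},
    {"id": 2, "gender": 2, "education": 2, "race": 7, "income": 9},
]
result = blank_invalid_values(rows, VALID_CODES)
```

The second observation keeps its valid `gender` and `education` values; only the impossible `race` and `income` codes are blanked, leaving them to be handled as missing data in the next step.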



---

### 3. Identify & Fix Missing Data
* a
* b
* c

---

### 4. Identify & Fix Outliers
* a
* b
* c

---

## Data __Screening__ Steps for each Hypothesis
### Verify Assumptions
* Additivity
* Normality
* Linearity
* Homogeneity
* Homoscedasticity

#### Additivity
* a
* b
* c

#### Normality
* a
* b
* c

#### Linearity
* a
* b
* c

#### Homogeneity
* a
* b
* c

#### Homoscedasticity
* a
* b
* c