# Notes on data cleaning

## Intro

Before quantitative data can be used for statistical analysis and modeling, it usually has to be preprocessed in a way that makes the results more meaningful, reduces noise in the model, and generally brings some clarity to the data.  This process is often called "cleaning" the data, which is a blurry term: it is rarely clear what the cleaning process actually involves.  Sometimes it refers to standardization, sometimes to replacing empty values with markers, but the general idea is that the data is in a more ready state for analysis and modeling once the process is done.  These notes aim to bring a bit of focus to the process of cleaning data.

## Quickstart cheatsheet

| Transformation | Formula | Use When | Notes |
|--|--|--|--|
| standardization (z-score) | $\frac{X-\mu}{\sigma}$ | You are comparing values with different units or scales | Assumes a well-defined mean $\mu$ and standard deviation $\sigma$, so it is most meaningful when samples are large or presumably normally distributed |
| standard score from a sample (t-statistic) | $\frac{X-\bar{X}}{s}$ | You only have a sample of the population, so $\mu$ and $\sigma$ are replaced by the sample mean $\bar{X}$ and sample standard deviation $s$ | This is what you usually use; it is sometimes called a "standard score" |
| min-max scaling | $\frac{X-X_{\min}}{X_{\max}-X_{\min}}$ | Your algorithm is sensitive to the magnitude of the values | Neural networks often train faster with data in the 0-1 range |
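
The three formulas in the table can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names and the sample values are made up for this example, and the sample version assumes the usual `ddof=1` estimate of the standard deviation.

```python
import numpy as np

def z_score(x, mu, sigma):
    """Standardize with a known population mean `mu` and standard deviation `sigma`."""
    return (np.asarray(x, dtype=float) - mu) / sigma

def sample_standard_score(x):
    """Standardize with the sample mean and the sample standard deviation (ddof=1)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def min_max_scale(x):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1.

    Assumes the values are not all identical (otherwise this divides by zero).
    """
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

values = np.array([2.0, 4.0, 6.0, 8.0])  # made-up sample, for illustration only
scores = sample_standard_score(values)   # mean 0, sample standard deviation 1
scaled = min_max_scale(values)           # smallest value -> 0, largest -> 1
```

After either standardization the values are unitless, which is what makes columns with different units comparable; min-max scaling instead pins the range to 0-1 and keeps the relative spacing of the values.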



## Standardization and Normalization

Many machine learning algorithms work best on data that lies within a certain scale and conforms to certain properties.  Before we can use such an algorithm effectively, we must first transform the data so that it satisfies the algorithm's assumptions.  This general process is often called, confusingly and sometimes interchangeably, "standardization" or "normalization".  To bring some clarity to this issue, let's first distinguish two different purposes of data cleaning:

* Transformations for analysis: transforming two different data sets by some means so that we can compare them in some way
* Transformations for algorithms: transforming a single data set by some means for the purpose of using a certain algorithm
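
As a rough illustration of the second case, the sketch below shows how a distance-based method can be dominated by whichever feature has the largest magnitude. The three rows (income in dollars, age in years) are invented for this example, and `min_max_scale_columns` and `distance` are hypothetical helpers, not part of any library.

```python
import numpy as np

# Made-up rows: [income in dollars, age in years]
X = np.array([
    [30_000.0, 25.0],
    [31_000.0, 60.0],
    [90_000.0, 26.0],
])

def min_max_scale_columns(X):
    """Rescale every column independently to the 0-1 range."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

def distance(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# On the raw data the income column dominates: rows 0 and 1 look close
# even though the ages differ by 35 years.
raw_01 = distance(X[0], X[1])
raw_02 = distance(X[0], X[2])

# After scaling, the age gap and the income gap carry comparable weight.
X_scaled = min_max_scale_columns(X)
scaled_01 = distance(X_scaled[0], X_scaled[1])
scaled_02 = distance(X_scaled[0], X_scaled[2])
```

On the raw values `raw_02` is far larger than `raw_01`, while the two scaled distances are nearly equal. Whether that is the behavior you want depends on the algorithm and the problem.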

### Transformations for analysis

Sometimes we find ourselves with data that we would like to compare, but a direct comparison doesn't really make sense.  For example, suppose that in a conversation about shoe size and height someone mentions having large shoes for their height, or vice versa.  Shoe size and height are measured in different units, so the claim can only be checked once both are expressed on a common scale.
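
One way to make the shoe-size claim concrete is to express both measurements as z-scores and compare those. The sketch below assumes hypothetical population means and standard deviations; the numbers are invented for illustration and are not real anthropometric data.

```python
# Hypothetical population parameters, chosen only for illustration
HEIGHT_MEAN_CM, HEIGHT_STD_CM = 170.0, 10.0
SHOE_MEAN_EU, SHOE_STD_EU = 42.0, 2.0

def z_score(value, mean, std):
    """Number of standard deviations that `value` lies above the mean."""
    return (value - mean) / std

# A made-up person: 175 cm tall, EU shoe size 46
height_z = z_score(175.0, HEIGHT_MEAN_CM, HEIGHT_STD_CM)
shoe_z = z_score(46.0, SHOE_MEAN_EU, SHOE_STD_EU)

# Both values are now unitless, so they can be compared directly
has_large_shoes_for_height = shoe_z > height_z
```

Under these assumed parameters the person is half a standard deviation taller than average but two standard deviations above the average shoe size, so "large shoes for their height" is a fair description. Comparing z-scores this way is a simplification, since it ignores how strongly the two quantities are correlated.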

Consider the California housing dataset from scikit-learn:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_california_housing

# Load the California housing data and wrap the feature matrix in a DataFrame
data = fetch_california_housing()
df = pd.DataFrame(data['data'], columns=data['feature_names'])

# Summary statistics (count, mean, std, min, quartiles, max) for every feature
print(df.describe())

       MedInc  HouseAge  AveRooms  AveBe