# Agenda

1. What is data cleaning? Why do we need to clean our data?
2. `NaN`/`NA`
    - What are they? Good and bad
    - What types of values are they, and how does this complicate things (or not)?
    - `dropna`
    - Selective dropping of `NaN`
    - `fillna` to replace values with a scalar
    - `fillna` with a method call
3. Removing bad values
4. `interpolate` -- replacing `NaN` values with reasonable fakes
5. Finding and removing outliers
6. Regularizing string data with strip/lower/replace
7. Finding and removing duplicate data

# What is data cleaning?

When we read in data from somewhere, it will almost always have flaws. Common causes include:

- human error
- computers going down
- sensors that are wrong (or down)
- people refusing to answer questions

If you read in a data set, either it will have problems, or someone who prepared the data set already did the hard work of cleaning it for you.



# NaN / NA

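If data is missing, then Pandas usually represents it as `NaN`, short for "not a number." You might have also seen it written as `nan`. They are identical, and both are defined in NumPy. (The `np.NaN` spelling was removed in NumPy 2.0, so the cells below that use it need an older NumPy; `np.nan` works everywhere.)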

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
np.NaN

nan

In [3]:
np.nan is np.NaN

True

In [4]:
type(np.nan)

float

In [5]:
np.nan + 5

nan

In [6]:
np.nan * 100

nan

In [7]:
a = np.array([97, 85, 92, 98, 89])
a.mean()  

92.2

In [8]:
a = np.array([97, 85, np.nan, 98, 89])
a.mean()  

nan

In [9]:
a.dtype

dtype('float64')

In [10]:
a

array([97., 85., nan, 98., 89.])
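If you want NumPy itself to skip the `NaN` values, one option is sketched here. It assumes the same five-element array as above, and uses `np.isnan` and `np.nanmean`, which are standard NumPy functions; the variable names are just for illustration.

```python
import numpy as np

a = np.array([97, 85, np.nan, 98, 89])

# Count how many values are missing
missing_count = int(np.isnan(a).sum())
print(missing_count)          # 1

# nanmean computes the mean of the non-NaN values only
mean_without_nan = np.nanmean(a)
print(mean_without_nan)       # 92.25

# The same result, done by hand with a boolean mask
masked_mean = a[~np.isnan(a)].mean()
print(masked_mean)            # 92.25
```

The mask version is more typing, but it makes explicit that the `NaN` element is being removed before the mean is calculated.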

In [11]:
s = Series([97, 85, 92, 98, 89])
s

0    97
1    85
2    92
3    98
4    89
dtype: int64

In [12]:
s.mean()

92.2

In [13]:
s = Series([97, 85, np.nan, 98, 89])
s

0    97.0
1    85.0
2     NaN
3    98.0
4    89.0
dtype: float64

In [14]:
s.mean()  # by default, Pandas ignores NaN in a series when we perform calculations on it!

92.25

In [16]:
s.mean(skipna=False)  # if you want the NumPy behavior, you can get it like this

nan

In [17]:
np.nan == np.nan  # is NaN equal to itself?

False

In [21]:
# you cannot filter out NaN in this way! NaN is never equal to anything, so every comparison is True
s != np.nan

0    True
1    True
2    True
3    True
4    True
dtype: bool
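Since comparing with `==` or `!=` doesn't work, we need a dedicated test for missing values. Here is a minimal sketch using the `isna` and `notna` methods on a series, assuming the same series `s` as above; `mask` and `present` are names I made up for this example.

```python
import numpy as np
from pandas import Series

s = Series([97, 85, np.nan, 98, 89])

# isna returns True wherever the value is missing
mask = s.isna()
print(mask.tolist())       # [False, False, True, False, False]

# notna is the opposite, so it works as a filter for the real values
present = s[s.notna()]
print(present.tolist())    # [97.0, 85.0, 98.0, 89.0]

# For a single scalar, np.isnan does the same job
print(np.isnan(np.nan))    # True
```

Notice that `present` keeps the original index labels (0, 1, 3, 4), so the gap where the `NaN` used to be is still visible.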

# NaN vs NA

If I have a series of strings, and one of the values in that series is `NaN`, what will be the dtype of that series? The answer: `object`, but that's true for all string series, because Pandas has traditionally used `object` by default for strings and for anything else it doesn't know what to do with.

What if you have a bunch of integers, and one of the values is `NaN`? Because `NaN` is a float, Pandas will turn the whole series into floats.

This is annoying! We want to be able to say that a series contains ints and the occasional `NaN`. But we cannot.
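Here is a small sketch of how missing values affect dtypes. The series names are made up for illustration. I'm not claiming an output for the string series, because its dtype depends on which version of Pandas is installed.

```python
import numpy as np
import pandas as pd

ints = pd.Series([10, 20, 30])
with_gap = pd.Series([10, 20, np.nan])
words = pd.Series(["a", "b", np.nan])

print(ints.dtype)        # int64
print(with_gap.dtype)    # float64, because NaN is a float

# Missing values created by an operation cause the same promotion:
# index 3 doesn't exist in ints, so reindex fills it with NaN
stretched = ints.reindex([0, 1, 2, 3])
print(stretched.dtype)   # float64

# The string series keeps its strings, plus one missing value
print(words.dtype)
print(words.isna().sum())   # 1
```

The `reindex` case is worth remembering: you can start with clean integers and still end up with floats, just because an operation introduced a `NaN`.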

If we try to turn a series of ints + `NaN` into ints, we'll get an error!

In [None]:
s.astype(int)  # get a 
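If you want to see the failure without stopping your notebook, you can catch the exception. This is a sketch that assumes the same series `s` as above. It catches `ValueError`, which is the general family this error belongs to; the exact exception class and message vary across Pandas versions, so I just print whatever comes back. The names `failed` and `cleaned` are made up for this example.

```python
import numpy as np
from pandas import Series

s = Series([97, 85, np.nan, 98, 89])

failed = False
try:
    s.astype(int)
except ValueError as e:
    # NaN has no integer equivalent, so the conversion is refused
    failed = True
    print(type(e).__name__, e)

# One workaround (it loses a row): remove the NaN first, then convert
cleaned = s.dropna().astype(int)
print(cleaned.tolist())    # [97, 85, 98, 89]
```

Dropping the missing row is only acceptable if you can afford to lose it.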