# 7 - Data Cleaning and Preparation

In this chapter we discuss tools for handling missing data,
duplicate data, string manipulation, and some other analytical
data transformations. The next chapter is then focused on 
combining and rearranging datasets in various ways.

## 7.1 Handling Missing Data

pandas offers several ways of handling missing data. Some of
the built-in statistical methods of pandas objects already
exclude missing data by default, for example, but we may
be interested in handling missing data in different ways:

First, it is important we remember that we can set which 
values are considered NA when importing a dataset with 
`read_csv()` or other read functions by using the `na_values`
parameter.
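
As a minimal sketch of the `na_values` parameter, the example below reads a small in-memory CSV. The CSV text, the column names, and the sentinels `missing` and `-999` are made up for illustration; they are not from any real dataset.

```python
import io

import pandas as pd

# Hypothetical CSV text: "missing" and -999 are invented sentinel values.
csv_text = "id,score,city\n1,10,Lisbon\n2,-999,missing\n3,7,Porto\n"

# na_values adds extra markers to the default set of strings treated as NA.
frame = pd.read_csv(io.StringIO(csv_text), na_values=["missing", "-999"])

# Count the missing values found in each column.
print(frame.isna().sum())
```

Both sentinels in the second row should be read as NA, so `score` and `city` each report one missing value.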

Second, if the DataFrame is already loaded, we can convert 
placeholder values to NA with methods such as `replace()` or `map()`.
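
A short sketch of this second approach, assuming a hypothetical DataFrame in which `-999` and `"unknown"` stand in for missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical data: -999 and "unknown" are invented placeholder values.
raw = pd.DataFrame({"score": [10, -999, 7],
                    "city": ["Lisbon", "unknown", "Porto"]})

# replace() swaps every listed placeholder for NaN across the whole frame.
cleaned = raw.replace([-999, "unknown"], np.nan)

# map() applies a function element by element to a single column.
city = raw["city"].map(lambda value: np.nan if value == "unknown" else value)

print(cleaned)
```

`replace()` is convenient when the same placeholders appear in many columns, while `map()` gives finer control over one column at a time.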

### Filtering Out Missing Data

There are different ways of filtering missing data, depending
on whether we want to drop rows or columns, and on how much
missing data a row or column must contain before it is dropped.

Although we could use boolean indexing with the `notna()` method,
`dropna()` lets us control all of these options.


In [1]:
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64
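
For comparison, here is a sketch of the boolean-indexing alternative with `notna()`, using the same Series as above. The variable name `kept` is ours.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

# notna() returns a boolean mask; indexing with it keeps the non-missing entries.
kept = data[data.notna()]

# Both approaches select the same values with the same index labels.
print(kept.equals(data.dropna()))
```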

Note that `dropna()` returns a new object by default and leaves
the original untouched. To modify the original object instead, we
pass `inplace=True` or assign the result back to the same name.
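
A small sketch of the difference, reusing the Series from the first example (the name `trimmed` is ours):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Default behaviour: a new Series is returned and `data` keeps its NaN values.
trimmed = data.dropna()
print(len(data), len(trimmed))

# With inplace=True the missing values are removed from `data` itself.
data.dropna(inplace=True)
print(len(data))
```

Assigning the result back (`data = data.dropna()`) has the same effect on the name `data` and is often easier to read in a chain of operations.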

To present some of the different ways to drop NA values:

In [2]:
data = pd.DataFrame([[1., 6.5, 3.],
                     [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


`dropna()` drops rows that have any missing value by default:

In [3]:
data.dropna() 

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing `how='all'` drops only rows **that have all values
missing**:

In [4]:
data.dropna(how='all') 

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


We can **drop columns** instead with the `axis` parameter:

In [6]:
data[4] = np.nan
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [7]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


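To keep only the rows that have at least a certain number of
**non-missing** values, we use the `thresh` parameter. For example,
`thresh=2` keeps rows with two or more non-NA values and drops the rest: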

In [10]:
df = pd.DataFrame(np.random.standard_normal((7,3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
df

Unnamed: 0,0,1,2
0,0.254007,,
1,-0.249208,,
2,0.762543,,-0.648712
3,-0.321158,,1.165636
4,-0.749398,0.915997,-0.410406
5,-3.215473,0.348234,0.527618
6,0.609746,1.506953,-0.541533


In [11]:
df.dropna()

Unnamed: 0,0,1,2
4,-0.749398,0.915997,-0.410406
5,-3.215473,0.348234,0.527618
6,0.609746,1.506953,-0.541533


In [12]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.762543,,-0.648712
3,-0.321158,,1.165636
4,-0.749398,0.915997,-0.410406
5,-3.215473,0.348234,0.527618
6,0.609746,1.506953,-0.541533
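
To make the meaning of `thresh` concrete, the sketch below reproduces `dropna(thresh=2)` by counting the non-missing values in each row. It uses a small DataFrame with fixed values (our own, not the random data above) so that the result is reproducible.

```python
import numpy as np
import pandas as pd

# Fixed illustrative values instead of random numbers.
df = pd.DataFrame({0: [1.0, 2.0, 3.0, 4.0],
                   1: [np.nan, np.nan, np.nan, 0.5],
                   2: [np.nan, 1.5, 2.5, 3.5]})

# Number of non-missing values in each row.
non_missing = df.notna().sum(axis=1)

# Keeping rows with at least two non-missing values by hand...
by_mask = df[non_missing >= 2]

# ...matches what thresh=2 does.
by_thresh = df.dropna(thresh=2)
print(by_mask.equals(by_thresh))
```

Only the first row, which has a single non-missing value, is dropped.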


### Filling In Missing Data

Rather than discarding missing data, we may
want to fill it with some value, such as a
constant, the mean of that column, or the median.
`fillna()` will do that for us:

In [13]:
df.fillna(0) 

Unnamed: 0,0,1,2
0,0.254007,0.0,0.0
1,-0.249208,0.0,0.0
2,0.762543,0.0,-0.648712
3,-0.321158,0.0,1.165636
4,-0.749398,0.915997,-0.410406
5,-3.215473,0.348234,0.527618
6,0.609746,1.506953,-0.541533


To **use different fill values for different
columns**, we can pass a dictionary that maps
column labels to fill values:

In [14]:
df.fillna({1:0, 2:2})

Unnamed: 0,0,1,2
0,0.254007,0.0,2.0
1,-0.249208,0.0,2.0
2,0.762543,0.0,-0.648712
3,-0.321158,0.0,1.165636
4,-0.749398,0.915997,-0.410406
5,-3.215473,0.348234,0.527618
6,0.609746,1.506953,-0.541533


We can also **fill forwards** or **fill backwards**
with the `ffill()` and `bfill()` methods:

In [15]:
df = pd.DataFrame(np.random.standard_normal((6,3)))
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan
df

Unnamed: 0,0,1,2
0,0.918578,-0.215545,1.107566
1,0.588271,-0.202194,1.014218
2,0.696783,,0.077171
3,-1.907408,,-0.298665
4,-0.770874,,
5,-0.503009,,


In [17]:
df.ffill()

Unnamed: 0,0,1,2
0,0.918578,-0.215545,1.107566
1,0.588271,-0.202194,1.014218
2,0.696783,-0.202194,0.077171
3,-1.907408,-0.202194,-0.298665
4,-0.770874,-0.202194,-0.298665
5,-0.503009,-0.202194,-0.298665


In [18]:
df.ffill(limit=2)

Unnamed: 0,0,1,2
0,0.918578,-0.215545,1.107566
1,0.588271,-0.202194,1.014218
2,0.696783,-0.202194,0.077171
3,-1.907408,-0.202194,-0.298665
4,-0.770874,,-0.298665
5,-0.503009,,-0.298665
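
The examples above only fill forwards. As a sketch of the backward direction, `bfill()` propagates the next valid value back into the gaps. The Series below is our own small example.

```python
import numpy as np
import pandas as pd

series = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Each gap takes the next valid value that follows it.
filled = series.bfill()
print(filled.tolist())  # [1.0, 1.0, 4.0, 4.0, 4.0]

# limit=1 fills at most one consecutive NaN before each valid value.
filled_limited = series.bfill(limit=1)
print(filled_limited.tolist())  # [1.0, 1.0, nan, 4.0, 4.0]
```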


With `fillna()` we may also fill each column with
its own mean or median:

In [19]:
df.fillna(df.mean())

Unnamed: 0,0,1,2
0,0.918578,-0.215545,1.107566
1,0.588271,-0.202194,1.014218
2,0.696783,-0.20887,0.077171
3,-1.907408,-0.20887,-0.298665
4,-0.770874,-0.20887,0.475072
5,-0.503009,-0.20887,0.475072
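
The median works the same way. A minimal sketch, assuming a small DataFrame with made-up columns `a` and `b`, where column `a` contains an outlier:

```python
import numpy as np
import pandas as pd

# Illustrative values; the 100.0 in column "a" acts as an outlier.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 100.0],
                   "b": [np.nan, 2.0, 4.0, 6.0]})

# df.median() returns one value per column; fillna() aligns them by column label.
filled = df.fillna(df.median())
print(filled)
```

Here the gap in `a` is filled with 3.0 and the gap in `b` with 4.0. The median can be a better choice than the mean when a column has outliers, since the mean of `a` would be pulled towards 100.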


## 7.2 Data Transformation