# Chapter 7. Data Cleaning and Preparation

In this chapter, the author reviews tools for handling missing data, duplicate data, and string manipulation, along with a few other common data preparation practices.

## 7.1 Handling missing data

Missing data is ignored by default when computing summary and descriptive statistics on a DataFrame or Series.
For numeric data, pandas uses the floating-point value `NaN` ("not a number") to represent missing values.
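
As a small illustration of the default skipping behavior, the sketch below compares `mean()` with and without `skipna`. The Series `s` is a made-up example and is not part of the chapter's data.

```python
import numpy as np
import pandas as pd

# Hypothetical three-element Series with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

mean_default = s.mean()              # NaN is skipped: (1.0 + 3.0) / 2
mean_strict = s.mean(skipna=False)   # NaN propagates into the result
count_non_na = s.count()             # counts only non-missing values

print(mean_default, mean_strict, count_non_na)
```

Passing `skipna=False` is one way to make a missing value visible in a summary instead of silently ignoring it.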

In [27]:
import pandas as pd
import numpy as np

np.random.seed(0)

In [28]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [29]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool
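
Python's built-in `None` is also treated as missing. The sketch below, using a made-up Series, checks this with `isna()` (an alias of `isnull()`) and its complement `notna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical Series mixing None and np.nan as missing markers.
s = pd.Series(['aardvark', None, np.nan, 'avocado'])

mask = s.isna()               # same result as s.isnull()
n_present = s.notna().sum()   # number of non-missing entries

print(mask.tolist())  # [False, True, True, False]
print(n_present)      # 2
```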

### Filtering out missing data

The `dropna()` method removes `NaN` values from a Series.

In [30]:
from numpy import nan as NA

data = pd.Series([1, NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [31]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64
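
For a Series, `dropna()` gives the same result as boolean indexing with `notnull()`. A minimal sketch that rebuilds the Series from the cell above:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

dropped = data.dropna()
filtered = data[data.notnull()]  # keep only positions that are not missing

print(dropped.equals(filtered))  # True
```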

On a DataFrame, `dropna()` by default removes any row that contains at least one `NaN` value.

In [32]:
data = pd.DataFrame([[1., 6.5, 3.],
                     [1., NA, NA],
                     [NA, NA, NA],
                     [NA, 6.5, 3.]])
data

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [33]:
cleaned = data.dropna()
cleaned

     0    1    2
0  1.0  6.5  3.0


Passing `how='all'` drops only the rows in which *every* value is `NaN`.

In [34]:
data.dropna(how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


To drop columns instead of rows, pass `axis=1`. Combined with `how='all'`, only columns that consist entirely of `NaN` are removed.

In [35]:
data[4] = NA
data

     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN


In [36]:
data.dropna(axis=1, how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


To keep only rows that have at least a certain number of non-missing values, pass the `thresh` argument. For example, `thresh=2` keeps rows with two or more non-`NaN` values.

In [37]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

          0         1         2
0  1.764052       NaN       NaN
1  2.240893       NaN       NaN
2  0.950088       NaN -0.103219
3  0.410599       NaN  1.454274
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990


In [38]:
df.dropna()

          0         1         2
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990


In [39]:
df.dropna(thresh=2)

          0         1         2
2  0.950088       NaN -0.103219
3  0.410599       NaN  1.454274
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990
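
Because `thresh` counts non-missing values rather than missing ones, it can help to compare it against an explicit count. The following sketch uses a small made-up frame (the column names `a`, `b`, `c` are illustrative only) and checks that `dropna(thresh=2)` matches a manual filter.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows have 1, 2, and 3 non-missing values respectively.
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [np.nan, np.nan, 5.0],
    'c': [np.nan, 4.0, 6.0],
})

non_na_per_row = df.notna().sum(axis=1)  # non-missing values in each row

kept = df.dropna(thresh=2)               # keep rows with >= 2 non-missing values
manual = df[non_na_per_row >= 2]         # the same filter written out by hand

print(kept.index.tolist())  # [1, 2]
print(kept.equals(manual))  # True
```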


### Filling in missing data

In most cases, `fillna()` is the method to use for filling in missing data. Calling it with a constant replaces every missing value with that constant.

In [40]:
df.fillna(0)

          0         1         2
0  1.764052  0.000000  0.000000
1  2.240893  0.000000  0.000000
2  0.950088  0.000000 -0.103219
3  0.410599  0.000000  1.454274
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990


Passing a dictionary fills each column with its own value; the keys are column labels.

In [41]:
df.fillna({1: 0.5, 2: 0})

          0         1         2
0  1.764052  0.500000  0.000000
1  2.240893  0.500000  0.000000
2  0.950088  0.500000 -0.103219
3  0.410599  0.500000  1.454274
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990


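By default, `fillna()` returns a new object. Passing `inplace=True` modifies the existing object instead.

In [42]: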
df.fillna(0, inplace=True)
df

          0         1         2
0  1.764052  0.000000  0.000000
1  2.240893  0.000000  0.000000
2  0.950088  0.000000 -0.103219
3  0.410599  0.000000  1.454274
4  0.761038  0.121675  0.443863
5  0.333674  1.494079 -0.205158
6  0.313068 -0.854096 -2.552990
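
The fill value does not have to be a literal constant. One common choice is a statistic computed from the data itself, such as the mean. A minimal sketch with a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing values.
s = pd.Series([1.0, np.nan, 3.5, np.nan, 7.0])

# mean() skips NaN, so the fill value is (1.0 + 3.5 + 7.0) / 3.
filled = s.fillna(s.mean())

print(filled)
```

Whether the mean is an appropriate replacement depends on the data. It is shown here only as an example of passing a computed value.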


The same interpolation methods available for reindexing, such as forward fill, can be used with `fillna()`. The `limit` argument caps the number of consecutive missing values that are filled.

In [43]:
df = pd.DataFrame(np.random.rand(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

          0         1         2
0  0.264556  0.774234  0.456150
1  0.568434  0.018790  0.617635
2  0.612096       NaN  0.943748
3  0.681820       NaN  0.437032
4  0.697631       NaN       NaN
5  0.670638       NaN       NaN


In [44]:
df.fillna(method='ffill')  # 'ffill' = 'forward fill'

          0         1         2
0  0.264556  0.774234  0.456150
1  0.568434  0.018790  0.617635
2  0.612096  0.018790  0.943748
3  0.681820  0.018790  0.437032
4  0.697631  0.018790  0.437032
5  0.670638  0.018790  0.437032


In [45]:
df.fillna(method='ffill', limit=2)

          0         1         2
0  0.264556  0.774234  0.456150
1  0.568434  0.018790  0.617635
2  0.612096  0.018790  0.943748
3  0.681820  0.018790  0.437032
4  0.697631       NaN  0.437032
5  0.670638       NaN  0.437032
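
Recent pandas releases appear to deprecate the `method` keyword of `fillna()`; treat that as an assumption and check it against your installed version. The dedicated `ffill()` and `bfill()` methods express the same operations directly. The sketch below uses a made-up frame with hypothetical column names `x` and `y`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with runs of missing values.
df = pd.DataFrame({
    'x': [1.0, np.nan, np.nan, np.nan],
    'y': [np.nan, 2.0, np.nan, 4.0],
})

filled = df.ffill()           # carry the last valid value forward
limited = df.ffill(limit=2)   # fill at most 2 consecutive missing values
back = df.bfill()             # carry the next valid value backward

print(filled)
print(limited)
print(back)
```

Forward fill cannot fill a leading `NaN` (as in column `y`) because there is no earlier value to propagate. Backward fill has the same limitation for trailing `NaN` values.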


## 7.2 Data transformation