# 7 - Data Cleaning and Preparation

In this chapter we discuss tools for handling missing data,
duplicate data, string manipulation, and some other analytical
data transformations. The next chapter focuses on
combining and rearranging datasets in various ways.

## 7.1 Handling Missing Data

Pandas offers many ways of handling missing data. Some of
the built-in statistical methods of pandas objects already
exclude missing data by default, for example, but we may
want to handle missing data in other ways:

First, it is important we remember that we can set which 
values are considered NA when importing a dataset with 
`read_csv()` or other read functions by using the `na_values`
parameter.
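
As a minimal sketch of the `na_values` idea, the snippet below reads a small in-memory CSV. The column names and the sentinels `"missing"` and `"-999"` are made up for illustration; they are not taken from any dataset used in this chapter.

```python
import io

import pandas as pd

# Hypothetical CSV content; "missing" and "-999" are made-up sentinels.
csv_text = "id,score,city\n1,10,Lisbon\n2,-999,missing\n3,30,Porto\n"

# na_values lists extra markers to parse as NA, on top of the
# defaults that pandas already recognizes (such as empty fields).
df = pd.read_csv(io.StringIO(csv_text), na_values=["missing", "-999"])

print(df)
print(df.isna().sum())  # number of NA values per column
```

A dictionary can be passed instead of a list when each column needs its own set of sentinels.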

Second, if the DataFrame is already loaded, we can treat it 
with functions such as `replace()` or `map()`.

### Filtering Out Missing Data

There are different ways of filtering out missing data, depending
on whether we want to drop rows or columns and on how much missing
data we tolerate before dropping.

Although we could use boolean indexing with the `notna()` method,
`dropna()` lets us control all of the options mentioned above.


In [3]:
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64
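
For comparison, here is a short sketch of the boolean-indexing alternative mentioned earlier. It rebuilds the same Series and checks that filtering with `notna()` gives the same result as `dropna()`; the variable name `filtered` is only illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Keep only the entries where notna() is True.
filtered = data[data.notna()]

# For a Series, this matches what dropna() returns.
print(filtered.equals(data.dropna()))
```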

Note that `dropna()` returns a new object by default.
To modify the original object instead, we pass
`inplace=True`.

The following examples show some of the different ways to drop NA values:

In [2]:
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan], [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


`dropna()` drops rows that have any missing value by default:

In [3]:
data.dropna() 

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing `how='all'` drops only rows **that have all values
missing**:

In [4]:
data.dropna(how='all') 

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


We can **drop columns** instead with the `axis` parameter:

In [6]:
data[4] = np.nan
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [7]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


To drop rows only when they fall below a **threshold** of
non-missing values, we use the `thresh` parameter, which sets the
minimum number of non-NA values a row needs in order to be kept:

In [10]:
df = pd.DataFrame(np.random.standard_normal((7,3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
df

Unnamed: 0,0,1,2
0,0.254007,,
1,-0.249208,,
2,0.762543,,-0.648712
3,-0.321158,,1.165636
4,-0.749398,0.915997,-0.410406
5,-3.215473,0.348234,0.527618
6,0.609746,1.506953,-0.541533


In [11]:
df.dropna()

Unnamed: 0,0,1,2
4,-0.749398,0.915997,-0.410406
5,-3.215473,0.348234,0.527618
6,0.609746,1.506953,-0.541533


In [12]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.762543,,-0.648712
3,-0.321158,,1.165636
4,-0.749398,0.915997,-0.410406
5,-3.215473,0.348234,0.527618
6,0.609746,1.506953,-0.541533
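
Because the example above uses random numbers, the following sketch repeats the idea on a small hand-written frame (the column names `a`, `b`, `c` are arbitrary). It counts the non-NA values per row first, so it is easier to see which rows survive `thresh=2`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan, np.nan],
        "b": [1.0, np.nan, np.nan, 4.0],
        "c": [1.0, np.nan, np.nan, 6.0],
    }
)

# Number of non-NA values in each row: 3, 1, 0, 2.
non_na_per_row = frame.notna().sum(axis=1)

# thresh=2 keeps rows with at least two non-NA values (rows 0 and 3).
kept = frame.dropna(thresh=2)

print(non_na_per_row)
print(kept)
```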


### Filling In Missing Data

Rather than discarding missing data, we may
want to fill it with some value, such as a
constant, the mean of that column, or the median.
`fillna()` will do that for us:

In [13]:
df.fillna(0) 

Unnamed: 0,0,1,2
0,0.254007,0.0,0.0
1,-0.249208,0.0,0.0
2,0.762543,0.0,-0.648712
3,-0.321158,0.0,1.165636
4,-0.749398,0.915997,-0.410406
5,-3.215473,0.348234,0.527618
6,0.609746,1.506953,-0.541533


To **use different fill values for different
columns**, we can pass a dictionary keyed by
column label to the method:

In [14]:
df.fillna({1:0, 2:2})

Unnamed: 0,0,1,2
0,0.254007,0.0,2.0
1,-0.249208,0.0,2.0
2,0.762543,0.0,-0.648712
3,-0.321158,0.0,1.165636
4,-0.749398,0.915997,-0.410406
5,-3.215473,0.348234,0.527618
6,0.609746,1.506953,-0.541533


We can also **fill forwards** or **fill backwards**
with the `ffill()` and `bfill()` methods. The `limit`
parameter caps how many consecutive missing values are filled:

In [15]:
df = pd.DataFrame(np.random.standard_normal((6,3)))
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan
df

Unnamed: 0,0,1,2
0,0.918578,-0.215545,1.107566
1,0.588271,-0.202194,1.014218
2,0.696783,,0.077171
3,-1.907408,,-0.298665
4,-0.770874,,
5,-0.503009,,


In [17]:
df.ffill()

Unnamed: 0,0,1,2
0,0.918578,-0.215545,1.107566
1,0.588271,-0.202194,1.014218
2,0.696783,-0.202194,0.077171
3,-1.907408,-0.202194,-0.298665
4,-0.770874,-0.202194,-0.298665
5,-0.503009,-0.202194,-0.298665


In [18]:
df.ffill(limit=2)

Unnamed: 0,0,1,2
0,0.918578,-0.215545,1.107566
1,0.588271,-0.202194,1.014218
2,0.696783,-0.202194,0.077171
3,-1.907408,-0.202194,-0.298665
4,-0.770874,,-0.298665
5,-0.503009,,-0.298665
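
The cells above only show `ffill()`. As a complementary sketch, the snippet below applies `bfill()` to a small hand-written Series (the values are arbitrary), with and without `limit`.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1.0, np.nan, np.nan, np.nan, 5.0])

# bfill() propagates the next valid value backwards over each gap.
filled = s.bfill()

# limit=1 fills at most one consecutive NA per gap, the one that sits
# directly before the valid value.
limited = s.bfill(limit=1)

print(filled)
print(limited)
```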


With `fillna()` we may also fill with the mean
or median of a column:

In [19]:
df.fillna(df.mean())

Unnamed: 0,0,1,2
0,0.918578,-0.215545,1.107566
1,0.588271,-0.202194,1.014218
2,0.696783,-0.20887,0.077171
3,-1.907408,-0.20887,-0.298665
4,-0.770874,-0.20887,0.475072
5,-0.503009,-0.20887,0.475072
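
The median works the same way as the mean. The sketch below uses a small hand-written frame (the columns `x` and `y` are made up) so that the fill values are easy to verify by hand.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"x": [1.0, 2.0, np.nan, 100.0], "y": [10.0, np.nan, np.nan, 30.0]}
)

# median() skips NA values by default and returns one value per column.
medians = frame.median()

# Passing that Series to fillna() fills each column with its own median.
filled = frame.fillna(medians)

print(medians)
print(filled)
```

The median can be a safer choice than the mean when a column has outliers, such as the `100.0` in column `x` here.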


## 7.2 Data Transformation

Aside from dealing with missing data, filtering, cleaning, and
transforming are also essential parts of the data wrangling job:

### Removing Duplicates

Consider the following example of a DataFrame that contains duplicates:

In [4]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 'k2': [1, 1, 2, 3, 3, 4, 4,]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


The `duplicated()` method returns a boolean Series indicating
whether each row is a duplicate of an earlier row:

In [5]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

The `drop_duplicates()` method returns a DataFrame with only
the rows indicated as `False` by `duplicated()`:

In [6]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
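
The relationship between the two methods can be sketched by inverting the boolean mask by hand. The name `manual` is just an illustrative variable.

```python
import pandas as pd

data = pd.DataFrame(
    {"k1": ["one", "two"] * 3 + ["two"], "k2": [1, 1, 2, 3, 3, 4, 4]}
)

# Invert the duplicate mask to keep only the first occurrence of each row.
manual = data[~data.duplicated()]

# This selects the same rows as drop_duplicates() with default settings.
print(manual.equals(data.drop_duplicates()))
```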


These methods by default consider all columns, but 
suppose we want to restrict the duplicate checking 
and dropping to only a subset of columns. We do that
with the `subset` parameter:

In [7]:
data['v1'] = range(7)
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [8]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6    False
dtype: bool

In [9]:
data.duplicated(subset=['k1'])

0    False
1    False
2     True
3     True
4     True
5     True
6     True
dtype: bool

By default, `drop_duplicates()` keeps the first values
it encounters. Passing `keep='last'` will keep the last
ones instead.

In [10]:
data.drop_duplicates(subset=['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
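
Besides `'first'` and `'last'`, the `keep` parameter also accepts `False`, which treats every member of a duplicated group as a duplicate. A brief sketch on the same data:

```python
import pandas as pd

data = pd.DataFrame(
    {
        "k1": ["one", "two"] * 3 + ["two"],
        "k2": [1, 1, 2, 3, 3, 4, 4],
        "v1": range(7),
    }
)

# keep=False marks all rows of a duplicated group, not just the repeats.
mask = data.duplicated(subset=["k1", "k2"], keep=False)

# Rows 5 and 6 share ("two", 4), so both are dropped.
unique_only = data.drop_duplicates(subset=["k1", "k2"], keep=False)

print(mask)
print(unique_only)
```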


### Transforming Data Using a Function or Mapping

Frequently we'll want to do some transformation depending
on the values present in the current array. Consider this
hypothetical data collected about kinds of meat:

In [11]:
data = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon", "pastrami", "corned beef", "bacon", "pastrami", "honey ham", "nova lox"], "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,pastrami,6.0
4,corned beef,7.5
5,bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose we want to add a new column indicating the 
type of animal the meat came from. We can create a 
dict to map each meat to an animal:

In [12]:
meat_to_animal = {
  'bacon' : 'pig',
  'pulled pork': 'pig',
  'pastrami' : 'cow',
  'corned beef' : 'cow',
  'honey ham' : 'pig',
  'nova lox' : 'salmon'
}

The `map()` method accepts a function or a dictionary-like
object and applies it to each value of a Series:

In [13]:
data['animal'] = data['food'].map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
5,bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


Passing a function that returns the value from the dict
would also have done the trick. Let's try it with a lambda
function:

In [15]:
data['food'].map(lambda x: meat_to_animal[x])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
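
The two approaches differ when a value has no entry in the mapping. The sketch below uses a shortened mapping and a made-up value, `"tofu"`, to show the difference; `"unknown"` is an arbitrary fallback chosen for illustration.

```python
import pandas as pd

meat_to_animal = {"bacon": "pig", "pastrami": "cow"}
# "tofu" is a made-up value with no entry in the mapping.
food = pd.Series(["bacon", "pastrami", "tofu"])

# Mapping with the dict leaves unmatched values as NaN.
by_dict = food.map(meat_to_animal)

# Indexing the dict inside a lambda raises KeyError for unmatched values.
try:
    food.map(lambda x: meat_to_animal[x])
    raised = False
except KeyError:
    raised = True

# dict.get with a default avoids the error and sets an explicit fallback.
by_get = food.map(lambda x: meat_to_animal.get(x, "unknown"))

print(by_dict)
print(raised)
print(by_get)
```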

### Replacing Values

While `map()` can be seen as a way to replace values,
`replace()` offers a simpler and more flexible way to 
do so. Consider the Series:

In [16]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

The `-999` values might be a sentinel for missing data.
To replace them with a value that pandas understands as 
NA, we can use `replace()`:

In [17]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

We can also **replace multiple values at once**
by passing a list of values as the first argument:

In [18]:
values_to_replace = [-999, -1000]
data.replace(values_to_replace, np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

**Replacing different values with different
replacements** can be done by passing a list
to the second argument (of equal length to the
first list) or by passing a dict as argument:

In [19]:
data.replace({-999: 0, -1000: np.nan})

0    1.0
1    0.0
2    2.0
3    0.0
4    NaN
5    3.0
dtype: float64
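
The list-based form mentioned above is not shown in the cells, so here is a minimal sketch of it on the same Series. The replacements are paired with the values to replace by position.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

# Values are paired by position: -999 -> NaN, -1000 -> 0.
replaced = data.replace([-999, -1000], [np.nan, 0])

print(replaced)
```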

### Renaming Axis Indexes

We can either create new objects with different labels
or modify the labels in place.
To modify them in place, we can apply `index.map()`
and assign the result back to the index of the DataFrame:

In [21]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [23]:
data.index = data.index.map(lambda x: x[:4].upper())
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


To return a new object without modifying the original, 
we use the `rename()` method:

In [25]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


`rename()` can also be used with a dictionary-like object
to rename only a subset of the axis labels:

In [28]:
data.rename(index={'OHIO': 'Indiana'}, columns={'three': 'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
Indiana,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11
