# CHAPTER 7. Data Cleaning and Preparation

In [2]:
import numpy as np
import pandas as pd
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 10
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

---

A significant share of the time spent on data analysis and modeling goes to data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up $80\%$ or more of an analyst’s time. In this chapter, we focus on:
* missing data
* duplicate data
* string manipulation
* some other analytical data transformations

---

## 7.1 Handling Missing Data

For numeric data, pandas uses the floating-point value ```NaN``` **(Not a Number)** to represent missing data.

In [None]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

In [None]:
string_data.isnull()

The built-in Python ```None``` value is also treated as NA in object arrays.

In [None]:
string_data[0] = None
string_data.isnull()

### 7.1.1 Filtering Out Missing Data

In [3]:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [None]:
data.dropna()

In [None]:
data[data.notnull()]

For DataFrame objects, ```dropna``` by default drops any row containing a missing value:

In [None]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
data

In [None]:
cleaned = data.dropna()
cleaned

Passing ```how='all'``` will only drop rows that are **all NA**:

In [None]:
data.dropna(how='all') 

To drop columns in the same way, pass ```axis=1```:

In [None]:
data[4] = NA
data

In [None]:
data.dropna(axis=1, how='all')

A related way to filter out DataFrame rows tends to concern time series data. 

Suppose you want to keep only rows containing at least a certain number of non-missing observations. You can indicate this with the ```thresh``` argument:

In [None]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

In [None]:
df.dropna()

In [None]:
df.dropna(thresh=2)   # keep only the rows with at least 2 non-NA values
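
Because ```thresh``` counts non-missing values rather than missing ones, it can help to compare it against a per-row count. The following is a minimal sketch on a hypothetical four-row frame (the name `frame` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data used only for illustration.
frame = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, np.nan, 3.0, 5.0],
    "c": [7.0, np.nan, np.nan, 6.0],
})

# Number of non-NA values in each row.
non_na_per_row = frame.notna().sum(axis=1)

# thresh=2 keeps the rows whose non-NA count is at least 2.
kept = frame.dropna(thresh=2)

print(non_na_per_row.tolist())
print(kept.index.tolist())
```

Rows 0 and 3 survive because they hold two and three observed values respectively; rows 1 and 2 hold fewer than two.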

### 7.1.2 Filling In Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to **fill in the “holes”**. The ```fillna``` method is the workhorse function to use.

In [None]:
df

In [None]:
df.fillna(0)

Calling fillna with a ```dict```, you can use a different fill value *for each column*:

In [None]:
df.fillna({1: 0.5, 2: 0})

In [None]:
_ = df.fillna(0, inplace=True)   # modify the existing object in-place
df

**Interpolation methods: ```'ffill'```**

In [None]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:5, 1] = NA
df.iloc[4:, 2] = NA
df

In [None]:
df.ffill()   # forward fill; fillna(method='ffill') is deprecated since pandas 2.1

In [None]:
df.ffill(limit=2)   # fill at most two consecutive NAs

In [None]:
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())   # pass the mean or median value of a Series

**Summary of ```fillna``` function arguments**

* ```value``` Scalar value or dict-like object to use to fill missing values
* ```method``` Interpolation method, ```'ffill'``` or ```'bfill'```; default ```None```. Deprecated since pandas 2.1 in favor of the ```ffill``` and ```bfill``` methods
* ```axis``` Axis to fill on; default ```axis=0```
* ```inplace``` Modify the calling object without producing a copy
* ```limit``` For forward and backward filling, maximum number of consecutive periods to fill



---

## 7.2 Data Transformation

### 7.2.1 Removing Duplicates

In [4]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4


The DataFrame method ```duplicated``` returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row). Relatedly, the ```drop_duplicates``` method returns a DataFrame containing only the rows for which ```duplicated``` is False.

In [5]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [6]:
data.drop_duplicates()   

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4


In [7]:
data['v1'] = range(7)
data

    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6


Suppose we had an additional column of values and wanted to filter duplicates ***only based on the 'k1' column***:

In [8]:
data.drop_duplicates(['k1'])   

    k1  k2  v1
0  one   1   0
1  two   1   1


```duplicated``` and ```drop_duplicates``` by default keep the first observed value combination. Passing ```keep='last'``` will **return the last one**. 

In [9]:
data.drop_duplicates(['k1', 'k2'], keep='last')

    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
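
Besides ```'first'``` and ```'last'```, the ```keep``` argument also accepts ```False```, which marks every member of a duplicated group. A minimal sketch on a hypothetical frame named `pairs`:

```python
import pandas as pd

# Hypothetical data: rows 1 and 2 are identical.
pairs = pd.DataFrame({"k1": ["one", "two", "two", "two"],
                      "k2": [1, 4, 4, 3]})

# keep=False flags all rows that have a twin, not just the later ones.
all_dupes = pairs.duplicated(keep=False)

# With keep=False, drop_duplicates removes every row of a duplicated group.
unique_only = pairs.drop_duplicates(keep=False)

print(all_dupes.tolist())
print(unique_only)
```

This is useful when a duplicated key signals a data-quality problem and neither copy should be trusted.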


### 7.2.2 Transforming Data Using a Function or Mapping

In [10]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0


Suppose you wanted to add a column indicating the type of animal that each food came from. 

Let’s write down a mapping of each distinct meat type to the kind of animal. 

In [11]:
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}

In [12]:
lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

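The ```map``` method on a Series accepts a function or dict-like object containing a mapping. Because some of the foods are capitalized and the keys of ```meat_to_animal``` are not, we map the lowercased values: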

In [None]:
data['animal'] = lowercased.map(meat_to_animal)
data

In [None]:
data['food'].map(lambda x: meat_to_animal[x.lower()])
# map applies the function to each element of the Series.
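
The dict and the lambda forms differ when a value has no entry in the mapping. The sketch below uses a hypothetical Series `foods` containing a value ("tofu") that is deliberately absent from a shortened mapping:

```python
import pandas as pd

# Hypothetical inputs; "tofu" has no entry in the mapping.
foods = pd.Series(["bacon", "Pastrami", "tofu"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow"}

# Mapping with a dict yields NaN for keys that are not found.
mapped = foods.str.lower().map(meat_to_animal)

# A lambda that indexes the dict directly would raise KeyError for "tofu";
# dict.get with a default avoids that.
safe = foods.map(lambda x: meat_to_animal.get(x.lower(), "unknown"))

print(mapped.tolist())
print(safe.tolist())
```

Passing the dict is therefore the more forgiving choice when the mapping may be incomplete.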

### 7.2.3 Replacing Values

In [None]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

The **-999** values might be sentinel values for missing data. To replace these with NA values that pandas understands, we can use ```replace```, producing a new Series.

In [None]:
data.replace(-999, np.nan)

In [None]:
# replace multiple values at once
data.replace([-999, -1000], np.nan)

To use a different replacement for each value, pass a list of substitutes:

In [None]:
data.replace([-999, -1000], [np.nan, 0])

In [None]:
data.replace({-999: np.nan, -1000: 0})

### 7.2.4 Renaming Axis Indexes

In [13]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11


In [14]:
transform = lambda x: x[:4].upper()   # only keep the first four characters
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

You can assign to ```index```, modifying the DataFrame in-place:

In [15]:
data.index = data.index.map(transform)
data

      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11


If you want to create a transformed version of a dataset without modifying the original, a useful method is ```rename```.

In [16]:
data.rename(index=str.title, columns=str.upper)

      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11


```rename``` can be used in conjunction with a dict-like object providing new values for a subset of the axis labels:

In [17]:
data.rename(index={'OHIO': 'INDIANA'}, 
            columns={'three': 'peekaboo'})  

         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11


In [None]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)   # change the original data
data

### 7.2.5 Discretization and Binning

Continuous data is often discretized or otherwise separated into “bins” for analysis.

In [18]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. 

To do so, use the ```cut``` function in pandas.

In [19]:
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special **Categorical** object. 

The output you see describes the bins computed by ```pandas.cut```. You can treat it like an array of strings indicating the bin name. 

It contains a ```categories``` array specifying the distinct category names along with a labeling for the ```ages``` data in the ```codes``` attribute:

In [20]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [21]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')

In [None]:
pd.Series(cats).value_counts()   # number of values in each bin (top-level pd.value_counts is deprecated)

Consistent with mathematical notation for intervals, a parenthesis means that the side is open, while the square bracket means it is closed (inclusive). You can change which side is closed by passing ```right=False```.

In [None]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
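
The effect of ```right``` is easiest to see on a value that sits exactly on a bin edge. A minimal sketch, using a shortened hypothetical edge list `edges` and the single value 25:

```python
import pandas as pd

# Shortened, hypothetical bin edges for illustration.
edges = [18, 25, 35]

# Default right=True: bins are (18, 25] and (25, 35], so 25 falls in the first.
right_closed = pd.cut([25], edges)

# right=False: bins are [18, 25) and [25, 35), so 25 falls in the second.
left_closed = pd.cut([25], edges, right=False)

print(right_closed.codes, left_closed.codes)
```

The ```codes``` attribute gives the position of the bin each value landed in, which makes the shift from the first bin to the second explicit.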

You can also pass your own bin names by passing a list or array to the ```labels``` option:

In [None]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

If you pass an integer number of bins to ```cut``` instead of explicit bin edges, it will compute **equal-length bins** based on the minimum and maximum values in the data.

In [22]:
data = np.random.rand(20)
pd.cut(data, 4, precision=2)   # The precision=2 option limits the decimal precision to two digits.

[(0.73, 0.96], (0.25, 0.49], (0.0074, 0.25], (0.0074, 0.25], (0.49, 0.73], ..., (0.49, 0.73], (0.73, 0.96], (0.73, 0.96], (0.73, 0.96], (0.49, 0.73]]
Length: 20
Categories (4, interval[float64, right]): [(0.0074, 0.25] < (0.25, 0.49] < (0.49, 0.73] < (0.73, 0.96]]

A closely related function, ```qcut```, bins the data based on sample quantiles. 

Depending on the distribution of the data, using ```cut``` will not usually result in each bin having the same number of data points. Since ```qcut``` uses sample quantiles instead, by definition you will obtain ***roughly equal-size bins***.

In [None]:
data = np.random.randn(1000)  # Normally distributed
cats = pd.qcut(data, 4)  # Cut into quartiles
cats

In [None]:
pd.Series(cats).value_counts()   # each quartile bin holds 250 observations

As with ```cut```, you can pass your own quantiles (numbers between 0 and 1, inclusive):

In [None]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

In [None]:
pd.Series(pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])).value_counts()
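
The bin sizes implied by a set of quantiles can be checked directly. The following sketch assumes a hypothetical sample of 1000 normally distributed values drawn with NumPy's ```default_rng``` (not the seed used elsewhere in this chapter):

```python
import numpy as np
import pandas as pd

# Hypothetical sample; continuous data, so ties are not expected.
rng = np.random.default_rng(0)
sample = rng.standard_normal(1000)

# Four quantile-based bins of equal size.
quartile_counts = pd.Series(pd.qcut(sample, 4)).value_counts().sort_index()

# Custom quantiles: 10% / 40% / 40% / 10% of the observations.
custom = pd.qcut(sample, [0, 0.1, 0.5, 0.9, 1.0])
custom_counts = pd.Series(custom).value_counts().sort_index()

print(quartile_counts.tolist())
print(custom_counts.tolist())
```

Calling ```sort_index``` orders the counts by bin rather than by frequency, so they line up with the quantiles that were passed in.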

### 7.2.6 Detecting and Filtering Outliers

In [26]:
np.random.seed(12345)
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

               0          1          2          3
count    1000.0     1000.0     1000.0     1000.0
mean  -0.067684   0.067924   0.025598  -0.002298
std    0.998035   0.992106   1.006835   0.996794
min   -3.428254  -3.548824  -3.184377  -3.745356
25%    -0.77489  -0.591841  -0.641675  -0.644144
50%   -0.116401   0.101143   0.002073  -0.013611
75%    0.616366   0.780282   0.680391   0.654328
max    3.366626   2.653656   3.260383   3.927528


Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:

In [27]:
col = data[2]
col[np.abs(col) > 3]

5      3.248944
102    3.176873
324    3.260383
499   -3.056990
586   -3.184377
Name: 2, dtype: float64

In [29]:
data[(np.abs(data) > 3).any(axis=1)]   # select all rows having a value above 3 or below -3

            0         1         2         3
5   -0.539741  0.476985  3.248944 -1.021228
97  -0.774363  0.552936  0.106061  3.927528
102 -0.655054 -0.565230  3.176873  0.959533
305 -2.315555  0.457246 -0.025907 -3.399312
324  0.050188  1.951312  3.260383  0.963301
..        ...       ...       ...       ...
499 -0.293333 -0.242459 -3.056990  1.918403
523 -3.428254 -0.296336 -0.439938 -0.867165
586  0.275144  1.179227 -3.184377  1.369891
808 -0.362528 -3.548824  1.553205 -2.186301


In [None]:
data[np.abs(data) > 3] = np.sign(data) * 3   # cap values outside the interval -3 to 3
data.describe()

In [None]:
np.sign(data).head()  # np.sign(data) produces 1 and -1 values based on
                      # whether the values in data are positive or negative
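
The same capping can also be expressed with the ```clip``` method. This is a sketch of an alternative, not the approach used above; the frame `frame` is a hypothetical stand-in generated with a different random generator:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 1000 x 4 frame of normal draws.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((1000, 4)))

# Sign-based capping on a copy, so the original stays untouched.
capped_sign = frame.copy()
capped_sign[np.abs(capped_sign) > 3] = np.sign(capped_sign) * 3

# clip returns a new frame with values limited to [-3, 3].
capped_clip = frame.clip(lower=-3, upper=3)

print(capped_clip.abs().max().max())
print(np.allclose(capped_sign.to_numpy(), capped_clip.to_numpy()))
```

Unlike the boolean assignment, ```clip``` does not modify the original frame, which can be convenient when the uncapped values are still needed.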

### 7.2.7 Permutation and Random Sampling

**Permuting (randomly reordering)** a ```Series``` or the rows in a ```DataFrame``` can be done using the ```numpy.random.permutation``` function.

In [30]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
sampler = np.random.permutation(5)
sampler

array([1, 0, 2, 3, 4])

In [31]:
df

    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19


The ```sampler``` array can then be used in ```iloc```-based indexing or the equivalent ```take``` function:

In [32]:
df.take(sampler)   

    0   1   2   3
1   4   5   6   7
0   0   1   2   3
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19


In [33]:
df.iloc[sampler,:]

    0   1   2   3
1   4   5   6   7
0   0   1   2   3
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19


To select a random subset without replacement, you can use the ```sample``` method on ```Series``` and ```DataFrame```.

In [34]:
df.sample(n=3)

    0   1   2   3
1   4   5   6   7
3  12  13  14  15
4  16  17  18  19


To generate a sample with replacement (to allow repeat choices), pass ```replace=True``` to sample:

In [None]:
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws
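
Because ```sample``` draws at random, repeated calls generally return different rows. A minimal sketch, assuming a hypothetical ten-row frame, of making the draw reproducible with the ```random_state``` argument:

```python
import pandas as pd

# Hypothetical frame used only for illustration.
frame = pd.DataFrame({"value": range(10)})

# The same random_state gives the same rows on every call.
first = frame.sample(n=3, random_state=0)
second = frame.sample(n=3, random_state=0)

# With replacement, n may exceed the number of rows.
oversampled = frame.sample(n=20, replace=True, random_state=0)

print(first.equals(second))
print(len(oversampled))
```

Without ```replace=True```, asking for more rows than the frame contains raises an error.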

### 7.2.8 Computing Indicator/Dummy Variables

Let's consider converting a categorical variable into a “dummy” or “indicator” matrix.

For example, if a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing all 1s and 0s. ```pandas``` has a ```get_dummies``` function for doing this. 

In [37]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': np.arange(6)})
df

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5


In [38]:
pd.get_dummies(df['key'])

       a      b      c
0  False   True  False
1  False   True  False
2   True  False  False
3  False  False   True
4   True  False  False
5  False   True  False


In some cases, you may want to add a prefix to the columns in the indicator DataFrame:

In [39]:
dummies = pd.get_dummies(df['key'], prefix='key')
dummies

   key_a  key_b  key_c
0  False   True  False
1  False   True  False
2   True  False  False
3  False  False   True
4   True  False  False
5  False   True  False


In [None]:
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
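
The indicator columns shown in this section are booleans, which is what recent pandas versions return by default. If 0/1 integers are needed, for example as input to a model, the ```dtype``` argument of ```get_dummies``` can request them. A minimal sketch, assuming the same ```key``` values as above in a hypothetical Series `keys`:

```python
import pandas as pd

# Same category values as the 'key' column above.
keys = pd.Series(["b", "b", "a", "c", "a", "b"])

# dtype=int produces 0/1 integers instead of booleans.
int_dummies = pd.get_dummies(keys, prefix="key", dtype=int)

print(int_dummies)
```

Each row contains exactly one 1, in the column matching its category.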

Suppose now that a row in a DataFrame belongs to **multiple categories**, as with the ```genres``` column of the following movies dataset.

In [None]:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('data/movielens/movies.dat', sep='::', encoding='ISO-8859-1',
                       header=None, names=mnames,
                       engine='python')   # a multi-character sep needs the Python parser
print(movies.shape)
movies[:10]

First, we extract the list of **unique genres** in the dataset:

In [None]:
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
    
genres = pd.unique(pd.Series(all_genres))   # wrap the list; pd.unique on a plain list is deprecated

In [None]:
genres

* We start with a DataFrame of all zeros.
* Next, iterate through each movie and set entries in each row of dummies to 1. 

In [40]:
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)

In [None]:
gen = movies.genres[0]
gen.split('|')

In [None]:
?pd.Index.get_indexer

In [None]:
dummies.columns.get_indexer(gen.split('|'))

The ```enumerate``` object yields pairs containing:
* a count (from start, which defaults to zero); 
* a value yielded by the iterable argument.

In [None]:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

You can combine ```dummies``` with ```movies```:

In [None]:
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]

In [None]:
movies_windic.head()
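
For delimiter-separated categories like these, pandas also offers the string method ```Series.str.get_dummies```, which builds the indicator columns in one step. The sketch below is an alternative to the loop above, shown on a hypothetical three-row stand-in (`toy`) rather than the actual movies file:

```python
import pandas as pd

# Hypothetical stand-in for the movies table.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Comedy|Drama", "Drama", "Action|Comedy"],
})

# Split each string on '|' and create one 0/1 column per distinct genre.
indicators = toy["genres"].str.get_dummies("|")

# Attach the indicator columns with a prefix, as in the loop-based version.
toy_windic = toy.join(indicators.add_prefix("Genre_"))

print(toy_windic)
```

The resulting columns are sorted alphabetically by genre name, so their order may differ from the order produced by ```pd.unique```.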

A useful recipe for statistical applications is to combine ```get_dummies``` with a discretization function like ```cut```:

In [None]:
np.random.seed(12345)
values = np.random.rand(10)
values

In [None]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))

---

# END