# Chapter 07: Data Cleaning and Preparation

# Initial Setup

In [102]:
import pandas as pd
import numpy as np

## 7.1 Handling Missing Data

- pandas uses the floating-point value **`NaN`** (Not a Number) as a *sentinel value* to represent missing data

- The built-in Python **`None`** value is also treated as an NA (not available) value

- In statistics applications, missing data may either be data that does not exist or data that exists but was not observed or collected. When cleaning up data for analysis, it is often important to do **analysis on the missing data** itself to identify data collection problems or potential biases in the data caused by missing data.
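A minimal sketch of what "analysis on the missing data itself" can look like: counting how many values are missing in each column and what share of the column that is. The `survey` DataFrame and its column names are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names are made up for this illustration
survey = pd.DataFrame({
    'age': [25, np.nan, 31, 40],
    'income': [np.nan, np.nan, 52000, 61000],
    'group': ['a', 'b', None, 'c'],
})

# Number of missing values in each column
missing_count = survey.isnull().sum()

# Share of missing values in each column (mean of a boolean column)
missing_share = survey.isnull().mean()

print(missing_count)
print(missing_share)
```

A column with a much higher share of missing values than the others is often a hint that something went wrong when that field was collected.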

> ***Example - Detect NaN and NA value in a series***

In [103]:
s = pd.Series([1, 2, np.nan, 4, 7])
s[4] = None
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

In [104]:
s.isnull()

0    False
1    False
2     True
3    False
4     True
dtype: bool

- NA handling methods

| Method | Description |
|----------|-------------|
| dropna | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate |
| fillna | Fill in missing data with some value or using an interpolation method such as `ffill` or `bfill`|
| isnull | Return boolean values indicating which values are missing or NA |
| notnull | Negation of `isnull`|

### Filtering Out Missing Data

> ***Example - Using `dropna()` and another equivalent approach using `notnull()`***

In [105]:
s = pd.Series([1, np.nan, 3, 6, np.nan, 9])

In [106]:
s.dropna()

0    1.0
2    3.0
3    6.0
5    9.0
dtype: float64

In [107]:
s[s.notnull()]

0    1.0
2    3.0
3    6.0
5    9.0
dtype: float64

### **`Warning:`** By default, calling `dropna()` on a DataFrame drops any row containing a missing value

> ***Example - Using `dropna()` on DataFrame***

In [108]:
df = pd.DataFrame([
    [1,2,3],
    [4,np.nan,5],
    [6,7,8],
    [9, 10, np.nan],
    [np.nan,np.nan,np.nan],
    [2,3,0]
])
df

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,,5.0
2,6.0,7.0,8.0
3,9.0,10.0,
4,,,
5,2.0,3.0,0.0


In [109]:
df.dropna() # Drop any row that contains one or more NA value(s)

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
2,6.0,7.0,8.0
5,2.0,3.0,0.0


In [110]:
df.dropna(how='all') # Only drop rows in which all values are NA

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,,5.0
2,6.0,7.0,8.0
3,9.0,10.0,
5,2.0,3.0,0.0


In [111]:
df.dropna(axis=1) # Drop any column that contains one or more NA value(s)

0
1
2
3
4
5
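The "varying thresholds" mentioned for `dropna` can be sketched with the `thresh` argument, which keeps a row (or column) only if it has at least that many non-NA values. The example below reuses the same data as above; the variable names `kept_rows` and `kept_columns` are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1, 2, 3],
    [4, np.nan, 5],
    [6, 7, 8],
    [9, 10, np.nan],
    [np.nan, np.nan, np.nan],
    [2, 3, 0],
])

# Keep only the rows that have at least 2 non-NA values
kept_rows = df.dropna(thresh=2)

# Keep only the columns that have at least 5 non-NA values
kept_columns = df.dropna(axis=1, thresh=5)

print(kept_rows)
print(kept_columns)
```

Here only row 4 (all NA) is dropped in the first call, and only column 0 (five non-NA values) survives the second.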


### Filling In Missing Data

- `fillna()` function arguments

| Argument | Description |
|----------|-------------|
| value | Scalar value or dict-like object to use to fill missing values |
| method | Fill method to use, e.g. `'ffill'` (forward fill) or `'bfill'` (backward fill) |
| axis | Axis to fill; default axis is 0|
| inplace | Modify the calling object without producing a copy |
| limit | For forward and backward filling, maximum number of consecutive periods to fill |

> ***Example - Using `fillna()`***

In [112]:
df = pd.DataFrame([
    [1,2,3],
    [4,np.nan,5],
    [6,7,8],
    [9, 10, np.nan],
    [np.nan,np.nan,np.nan],
    [2,3,0]
])
df

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,,5.0
2,6.0,7.0,8.0
3,9.0,10.0,
4,,,
5,2.0,3.0,0.0


In [113]:
df.fillna(0) # Fill NA values with 0

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,0.0,5.0
2,6.0,7.0,8.0
3,9.0,10.0,0.0
4,0.0,0.0,0.0
5,2.0,3.0,0.0


In [114]:
df.fillna({
    0: 'A',
    1: 'B',
    2: 'C'
}) # Fill NA values in each column with a specific value

Unnamed: 0,0,1,2
0,1,2,3
1,4,B,5
2,6,7,8
3,9,10,C
4,A,B,C
5,2,3,0


In [115]:
df.fillna(method='ffill') # Forward fill: propagate the last valid value down each column

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,2.0,5.0
2,6.0,7.0,8.0
3,9.0,10.0,8.0
4,9.0,10.0,8.0
5,2.0,3.0,0.0


In [116]:
df.fillna(df.mean()) # Fill NA values in each column with that column's mean

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,5.5,5.0
2,6.0,7.0,8.0
3,9.0,10.0,4.0
4,4.4,5.5,4.0
5,2.0,3.0,0.0
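The `limit` argument from the table above can be sketched as follows. This sketch uses the `ffill()` method rather than `fillna(method='ffill')`, since recent pandas releases deprecate the `method` argument of `fillna()`; the Series `s` is made up for the illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill every gap with the last valid value
filled_all = s.ffill()

# Forward fill at most 1 consecutive missing value per gap
filled_limited = s.ffill(limit=1)

print(filled_all)
print(filled_limited)
```

With `limit=1`, only the first `NaN` after the value `1.0` is filled; the remaining two stay missing.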


## 7.2 Data Transformation

### Removing Duplicates

> ***Example - Using `duplicated()` & `drop_duplicates()`***

In [117]:
df = pd.DataFrame(list('ABCDACEF'))
df

Unnamed: 0,0
0,A
1,B
2,C
3,D
4,A
5,C
6,E
7,F


In [118]:
df.duplicated() # duplicated() returns a boolean Series indicating whether each row duplicates an earlier row

0    False
1    False
2    False
3    False
4     True
5     True
6    False
7    False
dtype: bool

In [119]:
df.drop_duplicates() # Keep the first occurrence and drop other duplicates

Unnamed: 0,0
0,A
1,B
2,C
3,D
6,E
7,F
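Both methods compare whole rows by default. As a small sketch, the `subset` argument restricts the comparison to some columns and `keep='last'` keeps the last occurrence instead of the first. The `orders` DataFrame below is hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration
orders = pd.DataFrame({
    'customer': ['A', 'B', 'A', 'C', 'B'],
    'amount': [10, 20, 30, 40, 20],
})

# Whole-row comparison: only row 4 repeats an earlier row (row 1)
full_duplicates = orders.duplicated()

# Compare on 'customer' only and keep the last row seen for each customer
last_per_customer = orders.drop_duplicates(subset=['customer'], keep='last')

print(full_duplicates)
print(last_per_customer)
```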


### Transforming Data Using a Function or Mapping

> ***Example - Using `map()`***

In [120]:
products = pd.DataFrame(
    [
        ['Pepsi', 1.1],
        ['Tea', 1.2],
        ['Jeans', 20],
        ['Jacket', 30],
        ['Raisin', 5]
    ],
    columns=['Product', 'Price']
)
products

Unnamed: 0,Product,Price
0,Pepsi,1.1
1,Tea,1.2
2,Jeans,20.0
3,Jacket,30.0
4,Raisin,5.0


In [121]:
productToCategory = {
    'Pepsi': 'Drink',
    'Tea': 'Drink',
    'Jeans': 'Clothing',
    'Jacket': 'Clothing',
    'Raisin': 'Food'
}

In [122]:
products['Category'] = products['Product'].map(productToCategory)
products

Unnamed: 0,Product,Price,Category
0,Pepsi,1.1,Drink
1,Tea,1.2,Drink
2,Jeans,20.0,Clothing
3,Jacket,30.0,Clothing
4,Raisin,5.0,Food
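One detail worth knowing when mapping with a dict: values that have no matching key become `NaN`. The sketch below uses a made-up product (`'Socks'`) that is absent from the mapping and fills the gap with a fallback label; the names are illustrative only.

```python
import pandas as pd

product_names = pd.Series(['Pepsi', 'Tea', 'Socks'])

# 'Socks' is deliberately missing from the mapping
product_to_category = {
    'Pepsi': 'Drink',
    'Tea': 'Drink',
}

# Unmapped values come back as NaN
category = product_names.map(product_to_category)

# Replace the unmapped entries with a fallback label
category_filled = category.fillna('Unknown')

print(category)
print(category_filled)
```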


> ***Example - Using a lambda function***

In [123]:
discountPrice = lambda oldPrice: oldPrice * 0.8
products['Discount Price'] = products['Price'].apply(discountPrice)
products

Unnamed: 0,Product,Price,Category,Discount Price
0,Pepsi,1.1,Drink,0.88
1,Tea,1.2,Drink,0.96
2,Jeans,20.0,Clothing,16.0
3,Jacket,30.0,Clothing,24.0
4,Raisin,5.0,Food,4.0


### Replacing Values

> ***Example - Using `replace()`***

In [124]:
s = pd.Series([1, -99, 2, 3, -100, 4])
s

0      1
1    -99
2      2
3      3
4   -100
5      4
dtype: int64

In [125]:
s.replace([-99,-100], np.nan) # replace list of values in series with NaN

0    1.0
1    NaN
2    2.0
3    3.0
4    NaN
5    4.0
dtype: float64

In [126]:
s.replace({
    -99: 'Placeholder',
    -100: 'Placeholder',
}) # replace values in series using a dictionary for mapping

0              1
1    Placeholder
2              2
3              3
4    Placeholder
5              4
dtype: object

### Discretization & Binning

- Continuous data is often discretized or otherwise separated into "bins" for analysis
- The function `cut()` returns a special **Categorical object**, which can be treated as an array of strings indicating the bin names.

> ***Example - Using `cut()`***

In [127]:
employeeAges = [22,23,24,40,39,26,27,21,36]
bins= [20, 25, 30, 35, 40]
pd.cut(employeeAges, bins)

[(20, 25], (20, 25], (20, 25], (35, 40], (35, 40], (25, 30], (25, 30], (20, 25], (35, 40]]
Categories (4, interval[int64]): [(20, 25] < (25, 30] < (30, 35] < (35, 40]]

In [128]:
pd.cut(employeeAges, bins).codes # each code is the index (in `categories`) of the bin that the corresponding value falls into

array([0, 0, 0, 3, 3, 1, 1, 0, 3], dtype=int8)

In [129]:
pd.cut(employeeAges, bins).categories

IntervalIndex([(20, 25], (25, 30], (30, 35], (35, 40]],
              closed='right',
              dtype='interval[int64]')

In [130]:
pd.value_counts(pd.cut(employeeAges, bins)) 

(20, 25]    4
(35, 40]    3
(25, 30]    2
(30, 35]    0
dtype: int64

In [131]:
pd.value_counts(
    pd.cut(employeeAges, 4)
)
# Split the range between the min & max values in the data into 4 equal-width bins
# Equal-width bins are NOT guaranteed to hold the same number of elements

(20.981, 25.75]    4
(35.25, 40.0]      3
(25.75, 30.5]      2
(30.5, 35.25]      0
dtype: int64

In [132]:
pd.value_counts(
    pd.qcut(employeeAges, 4)
)
# Split the data into 4 bins based on sample quantiles
# Each bin is guaranteed to hold ALMOST the same number of elements

(20.999, 23.0]    3
(36.0, 40.0]      2
(26.0, 36.0]      2
(23.0, 26.0]      2
dtype: int64

### Detecting & Filtering Outliers (skipped)

### Permutation & Random Sampling (skipped)

### Computing Indicator/Dummy Variables

- If a column in a DataFrame has **k** distinct values, you can derive a matrix or DataFrame with **k** columns containing all 1s and 0s
- The function `pandas.get_dummies()` does this

> ***Example - Using `get_dummies()`***

In [133]:
df = pd.DataFrame({
    'key': list('aabbbc'),
    'data': range(6)
})
df

Unnamed: 0,key,data
0,a,0
1,a,1
2,b,2
3,b,3
4,b,4
5,c,5


In [134]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,1,0,0
1,1,0,0
2,0,1,0
3,0,1,0
4,0,1,0
5,0,0,1
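A short sketch of a common follow-up step: adding a prefix to the indicator columns with the `prefix` argument and joining them back to the remaining data. Depending on the pandas version, the indicator columns may be stored as integers or as booleans.

```python
import pandas as pd

df = pd.DataFrame({
    'key': list('aabbbc'),
    'data': range(6)
})

# Indicator columns are named key_a, key_b, key_c
dummies = pd.get_dummies(df['key'], prefix='key')

# Join the indicator columns to the remaining data
combined = df[['data']].join(dummies)

print(combined)
```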


> ***Example - A use case for indicator/dummy variables***

In [135]:
dataMoviesColumnNames = ['id', 'title', 'genres']
dataMovies = pd.read_table(
    r'https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/movielens/movies.dat', 
    sep='::', 
    header=None,
    names=dataMoviesColumnNames,
    engine='python'
)
dataMovies

Unnamed: 0,id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama


In [136]:
# Extract the list of unique genres from the `genres` column
allGenres = []

for dataGenres in dataMovies.genres:
    allGenres.extend(dataGenres.split('|'))

allGenres = pd.unique(allGenres)

In [137]:
# Initial setup for the indicator
dummies = pd.DataFrame(
    (
        np.zeros(
            (len(dataMovies), len(allGenres))
        )
    ), # generate a Zero Matrix
    columns=allGenres
)
dummies

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3878,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3879,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3880,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3881,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [138]:
# Compute indicator
for i, genres in enumerate(dataMovies.genres):
    indices = dummies.columns.get_indexer(genres.split('|'))
    dummies.iloc[i, indices] = 1
dummies

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3878,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3879,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3880,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3881,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [139]:
# Combine indicator with the original data
dataMovies.join(dummies.add_prefix('Genre_')).iloc[0]

id                                             1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Genre_Fantasy                                  0
Genre_Romance                                  0
Genre_Drama                                    0
Genre_Action                                   0
Genre_Crime                                    0
Genre_Thriller                                 0
Genre_Horror                                   0
Genre_Sci-Fi                                   0
Genre_Documentary                              0
Genre_War                                      0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Film-Noir                                0
Genre_Western                                  0
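As an alternative sketch (not the approach used in the cells above), `Series.str.get_dummies()` splits each string on a separator and builds the indicator columns in one call. The small `movies` DataFrame below copies the first three rows shown earlier so that the example runs without downloading the dataset.

```python
import pandas as pd

# First three rows of the movies data shown above
movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': [
        "Animation|Children's|Comedy",
        "Adventure|Children's|Fantasy",
        'Comedy|Romance',
    ],
})

# Split on '|' and build one 0/1 column per genre
genre_dummies = movies['genres'].str.get_dummies('|')

# Combine the indicator columns with the original data
result = movies.join(genre_dummies.add_prefix('Genre_'))

print(result.iloc[0])
```

For large datasets either approach may be slow; this variant mainly saves the explicit loop.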

## 7.3 String Manipulation

### String Object Methods

- Python built-in string methods

| Method | Description |
|----------|-------------|
|count | Return the number of non-overlapping occurrences of substring in the string.|
|endswith | Returns True if string ends with suffix.|
|startswith | Returns True if string starts with prefix.|
|join| Use string as delimiter for concatenating a sequence of other strings.|
|index| Return position of first character in substring if found in the string; raises ValueError if not found.|
|find| Return position of first character of first occurrence of substring in the string; like index, but returns -1 if not found.|
|rfind| Return position of first character of last occurrence of substring in the string; returns -1 if not found.|
|replace| Replace occurrences of string with another string.|
|strip,<br>rstrip,<br>lstrip | Trim whitespace, including newlines, from both sides, from the right side, or from the left side, respectively.|
|split| Break string into list of substrings using passed delimiter.|
|lower| Convert alphabet characters to lowercase.|
|upper| Convert alphabet characters to uppercase.|
|casefold| Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form.|
| ljust,<br>rjust| Left justify or right justify, respectively; pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width.|
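A few of the methods from the table, sketched on a small made-up string:

```python
val = 'a,b,  guido'

# split() breaks the string on the delimiter; strip() trims surrounding whitespace
pieces = [piece.strip() for piece in val.split(',')]

# join() uses the string it is called on as the delimiter
joined = '::'.join(pieces)

# find() returns -1 when the substring is absent, while index() raises ValueError
missing_position = val.find(':')
try:
    val.index(':')
    index_raised = False
except ValueError:
    index_raised = True

# count() returns the number of occurrences of the substring
comma_count = val.count(',')

# ljust() pads the right side up to a minimum width
padded = 'ab'.ljust(5, '.')

print(pieces, joined, missing_position, index_raised, comma_count, padded)
```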

### Regular Expression

- Regular expression methods

| Method | Description |
|----------|-------------|
| findall | Return all non-overlapping matching patterns in a string as a list |
| finditer | Like `findall`, but return an iterator |
| match | Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, return a match object, otherwise `None`|
| search | Scan string for a match to pattern, returning a match object if one is found. The match can be anywhere in the string as opposed to only at the beginning.|
| split | Break string into pieces at each occurrence of pattern |
| sub,<br>subn | Replace occurrences of pattern in string with a replacement expression (`subn` also returns the number of replacements made); use symbols \1, \2, ... to refer to match group elements in the replacement string|
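The methods in the table can be sketched with Python's built-in `re` module. The sample text and the simplified email pattern below are assumptions made for this illustration; the pattern is not meant to validate real-world addresses.

```python
import re

text = 'Dave dave@example.com, Steve steve@example.org'

# Simplified email pattern with 3 groups: user name, domain name, domain suffix
pattern = re.compile(
    r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})',
    flags=re.IGNORECASE
)

# findall() returns a list of tuples when the pattern has groups
all_matches = pattern.findall(text)

# search() finds the first match anywhere in the string
first_match = pattern.search(text)

# match() only matches at the start of the string, so it returns None here
start_match = pattern.match(text)

# sub() can refer to the groups with \1, \2, ...
replaced = pattern.sub(r'\1 at \2', text)

# split() breaks a string at each occurrence of the pattern
words = re.split(r'\s+', 'foo    bar\t baz')

print(all_matches)
print(first_match.group(0))
print(start_match)
print(replaced)
print(words)
```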

### Vectorized String Functions in **pandas**

- Partial listing of vectorized string methods

| Method | Description|
|--------|------------|
| cat| Concatenate strings element-wise with optional delimiter|
|contains| Return boolean array indicating whether each string contains pattern/regex|
|count| Count occurrences of pattern|
|extract| Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group|
|endswith| Equivalent to x.endswith(pattern) for each element|
|startswith| Equivalent to x.startswith(pattern) for each element|
|findall| Compute list of all occurrences of pattern/regex for each string|
|get| Index into each element (retrieve i-th element)|
|isalnum| Equivalent to built-in str.isalnum|
|isalpha| Equivalent to built-in str.isalpha|
|isdecimal| Equivalent to built-in str.isdecimal|
|isdigit| Equivalent to built-in str.isdigit|
|islower| Equivalent to built-in str.islower|
|isnumeric| Equivalent to built-in str.isnumeric|
|isupper| Equivalent to built-in str.isupper|
|join| Join strings in each element of the Series with passed separator|
|len| Compute length of each string|
|lower,<br> upper| Convert cases; equivalent to x.lower() or x.upper() for each element|
|match| Use re.match with the passed regular expression on each element, returning a boolean indicating whether it matches|
|pad| Add whitespace to left, right, or both sides of strings|
|center| Equivalent to pad(side='both')|
|repeat| Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string)|
|replace| Replace occurrences of pattern/regex with some other string|
|slice| Slice each string in the Series|
|split| Split strings on delimiter or regular expression|
|strip| Trim whitespace from both sides, including newlines|
|rstrip| Trim whitespace on right side|
|lstrip| Trim whitespace on left side|
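A brief sketch of a few of these methods, accessed through the `.str` attribute of a Series. The vectorized methods skip over missing values instead of raising an error. The sample Series and the simplified email pattern are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

# Made-up data containing one missing value
data = pd.Series({
    'Dave': 'dave@example.com',
    'Steve': 'steve@example.org',
    'Wes': np.nan,
})

# contains(): na=False reports missing values as False
is_org = data.str.contains('example.org', regex=False, na=False)

# upper(): missing values stay missing
upper = data.str.upper()

# extract(): one column per regex group, all NA for a missing value
parts = data.str.extract(r'([a-z0-9._%+-]+)@([a-z0-9.-]+)\.([a-z]{2,4})')

print(is_org)
print(upper)
print(parts)
```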