## 7.2. Data Cleaning and Processing with Pandas

### 7.2.1. Missing values

Real-world data often contains missing values, which pandas typically represents as `NaN` in numeric data.


- Checking for Missing Values

In [None]:
import numpy as np
import pandas as pd

In [None]:
float_data = pd.Series([1.2, -3.5, np.nan, 0, np.nan])
float_data

In [None]:
float_data.isna()

In [None]:
float_data.isna().sum()
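The same check extends to a DataFrame, where `isna().sum()` counts missing values per column. The following is a minimal sketch; the frame and its column names (`height`, `weight`) are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; the column names are made up for illustration.
frame = pd.DataFrame({
    "height": [1.2, np.nan, 3.4, np.nan],
    "weight": [5.0, 6.0, np.nan, 8.0],
})

missing_per_column = frame.isna().sum()         # number of NaN values in each column
missing_fraction = frame.isna().mean()          # share of NaN values in each column
total_missing = int(frame.isna().sum().sum())   # number of NaN values in the whole frame

print(missing_per_column)
print(missing_fraction)
print(total_missing)
```

Taking the mean of the boolean mask works because `True` counts as 1 and `False` as 0.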

- Removing Missing Values


In [None]:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data

In [None]:
data.dropna()

In [None]:
# or equivalently
data[data.notna()]

- By default, `dropna` on a DataFrame drops every row that contains at least one missing value.


In [None]:
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data

In [None]:
data.dropna()

- What if you want to exclude only rows where all values are missing?


In [None]:
data.dropna(how = "all")

- What if you want to keep only rows with at least a certain number of non-missing values? The `thresh` argument sets that minimum.

In [None]:
df = pd.DataFrame(np.random.standard_normal((7, 3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
df

In [None]:
df.dropna()

In [None]:
df.dropna(thresh=2)
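Because `thresh` counts non-missing values rather than missing ones, it can be easy to misread. The sketch below uses a small hypothetical frame (`rows`) to show which rows survive, and writes out the same filter with an explicit count.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 has 3 valid values, row 1 has 1, row 2 has 0.
rows = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 2.0, np.nan],
    "c": [3.0, np.nan, np.nan],
})

at_least_two = rows.dropna(thresh=2)   # keeps rows with >= 2 non-missing values
at_least_one = rows.dropna(thresh=1)   # keeps rows with >= 1 non-missing value

# The thresh=2 filter written with an explicit count of non-missing values
same_as_thresh = rows[rows.notna().sum(axis="columns") >= 2]

print(at_least_two)
print(at_least_one)
```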

- What if you want to remove missing values by column instead of by row?


In [None]:
df.dropna(axis = "columns")

- Filling Missing Values
    - You can use the `fillna` method to fill missing values with a specified value. It returns a new object and leaves the original unchanged.


In [None]:
df

In [None]:
df.fillna(0)

In [None]:
# you can fill missing values with different values for each column
df.fillna({1: 0.5, 2: 0})

In [None]:
df

In [None]:
# fill backward: each missing value takes the next valid value in its column
df.bfill()  # `ffill` fills forward using the previous valid value

In [None]:
# you can limit the number of consecutive missing values that get filled
df.bfill(limit=2)

In [None]:
df.ffill()

In [None]:
# fill with the mean of each column
df.fillna(df.mean())
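The direction of `ffill` and `bfill` is easier to see on a small deterministic Series than on random data. This is a minimal sketch; the Series `s` is a made-up example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()                   # propagate the previous valid value downward
backward = s.bfill()                  # propagate the next valid value upward
backward_limited = s.bfill(limit=1)   # fill at most one consecutive gap per run

print(pd.DataFrame({
    "original": s,
    "ffill": forward,
    "bfill": backward,
    "bfill(limit=1)": backward_limited,
}))
```

The last element stays missing under `bfill` because no later value exists to copy from, and `s` itself is not modified by any of these calls.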

### 7.2.2. Value replacement

- Filling in missing values can be considered a value replacement operation.
- The `replace` function allows you to easily replace values.

In [None]:
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df

In [None]:
df.replace(0, 5) # value to value

In [None]:
df.replace([0, 1, 2, 3], 4) # list to value

In [None]:
df.replace([0, 1, 2, 3], [4, 3, 2, 1]) # list to list
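`replace` also accepts a dictionary, which can be easier to read than two parallel lists. The sketch below reuses the frame from above under the name `frame`; the replacement values are arbitrary.

```python
import pandas as pd

frame = pd.DataFrame({"A": [0, 1, 2, 3, 4],
                      "B": [5, 6, 7, 8, 9],
                      "C": ["a", "b", "c", "d", "e"]})

# Dictionary of old value -> new value, applied to every column
mapped = frame.replace({0: 100, 9: -1})

# Nested dictionary: restrict the replacement to column "A" only
only_in_a = frame.replace({"A": {4: 40}})

print(mapped)
print(only_in_a)
```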

### 7.2.3. Outlier detection

- Identifying outliers or anomalies is important in data analysis.
- Let's find the values whose absolute value exceeds 3 in the example dataset below.

In [None]:
data = pd.DataFrame(np.random.standard_normal((1000, 4)))
data

In [None]:
data.describe()

In [None]:
# Find outliers in the third column
col = data[2]
col

In [None]:
col.abs() > 3

In [None]:
col[col.abs() > 3]

In [None]:
data.abs() > 3

In [None]:
(data.abs() > 3).any(axis = "columns")

In [None]:
# Find rows that contain at least one outlier
data[(data.abs() > 3).any(axis="columns")]

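- Cap the outliers: replace each value whose absolute value exceeds 3 with -3 or 3, keeping its sign.

In [None]:
# np.sign(data) is -1 or 1, so multiplying by 3 gives the capped value with the original sign
data[data.abs() > 3] = np.sign(data) * 3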

In [None]:
data.describe()
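As an alternative to boolean-mask assignment, the `clip` method caps values at given bounds and returns a new frame. The sketch below is not the approach used above; the seed and the variable names `capped`, `n_outliers`, and `n_changed` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # arbitrary seed so the example is repeatable
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Cap every value to the interval [-3, 3]; the original frame is left untouched
capped = data.clip(lower=-3, upper=3)

n_outliers = int((data.abs() > 3).sum().sum())   # values outside [-3, 3]
n_changed = int((capped != data).sum().sum())    # values altered by clip

print(n_outliers, n_changed)
print(capped.describe())
```

Only the values outside the bounds are altered, so the two counts agree.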

### 7.2.4. Randomization



*   Randomization plays a very important role in machine learning.
    *   Cross-validation
    *   Bootstrap / Subsampling
*   `numpy.random.permutation` returns a random ordering of integer positions, which you can pass to `iloc` or `take` to reorder the rows or columns of a dataset.



In [None]:
v = np.arange(5 * 7).reshape((5,7))
df = pd.DataFrame(v)
df

*   Row permutation (default)

In [None]:
sampler = np.random.permutation(5)
sampler

In [None]:
df.iloc[sampler]

In [None]:
# or equivalently
df.take(sampler)

*   Column permutation

In [None]:
column_sampler = np.random.permutation(7)
column_sampler

In [None]:
df.iloc[:, column_sampler]

In [None]:
# or equivalently
df.take(column_sampler, axis="columns")

*   Subsampling

In [None]:
df

In [None]:
df.sample(n=3)

*   Bootstrapping (sampling with replacement)

In [None]:
df.sample(n = 5, replace = True)
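Random draws differ on every run unless a seed is fixed. The sketch below shows one way to make the shuffling repeatable and to use a permutation for a simple train/test split, as one might before cross-validation. The seeds and the hold-out size `n_test` are hypothetical values chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 7).reshape((5, 7)))

rng = np.random.default_rng(42)     # arbitrary seed for repeatable shuffling
order = rng.permutation(len(df))    # shuffled row positions

n_test = 2                          # hypothetical hold-out size
test = df.iloc[order[:n_test]]      # first n_test shuffled rows
train = df.iloc[order[n_test:]]     # remaining rows

# DataFrame.sample accepts random_state, so repeated calls give the same draw
first = df.sample(n=3, random_state=0)
second = df.sample(n=3, random_state=0)

print(train)
print(test)
```

Splitting one permutation guarantees that the two parts do not overlap and together cover every row.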

### 7.2.5. Categorizing and Grouping

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.DataFrame({"score": np.random.standard_normal(20)})
df

|   | score |
|---|---|
| 0 | 1.106356 |
| 1 | 1.185026 |
| 2 | 0.679801 |
| 3 | -0.372005 |
| 4 | 0.537941 |
| 5 | 0.412478 |
| 6 | -0.165952 |
| 7 | -1.922481 |
| 8 | -1.173217 |
| 9 | -0.52753 |


*   Creating a categorical variable

In [None]:
def categorize(x):
    if x < -0.5:
        return "low"
    elif x > 0.5:
        return "high"
    else:
        return "medium"

In [None]:
df["category"] = df["score"].apply(categorize)
df

|   | score | category |
|---|---|---|
| 0 | 1.106356 | high |
| 1 | 1.185026 | high |
| 2 | 0.679801 | high |
| 3 | -0.372005 | medium |
| 4 | 0.537941 | high |
| 5 | 0.412478 | medium |
| 6 | -0.165952 | medium |
| 7 | -1.922481 | low |
| 8 | -1.173217 | low |
| 9 | -0.52753 | low |
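Applying a Python function row by row is clear but can be slow on large data. A vectorized alternative, sketched here with `numpy.select`, applies the same thresholds as `categorize`. The sample `scores` are made up and include the boundary values -0.5 and 0.5 to check that they fall in "medium".

```python
import numpy as np
import pandas as pd

def categorize(x):
    # Same rule as in the text: below -0.5 is low, above 0.5 is high
    if x < -0.5:
        return "low"
    elif x > 0.5:
        return "high"
    else:
        return "medium"

# Made-up scores, including the boundary values -0.5 and 0.5
scores = pd.Series([1.2, -0.5, 0.5, -0.7, 0.0], name="score")

# Conditions are checked in order; anything matching none gets the default
labels = np.select(
    [(scores < -0.5).to_numpy(), (scores > 0.5).to_numpy()],
    ["low", "high"],
    default="medium",
)

result = pd.DataFrame({"score": scores, "category": labels})
print(result)
```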


*   `groupby` groups rows by column values and applies a function to each group.

In [None]:
df.groupby("category")["score"].mean()

| category | score |
|---|---|
| high | 0.971577 |
| low | -1.056356 |
| medium | -0.008609 |


In [None]:
df.groupby("category")["score"].std()

| category | score |
|---|---|
| high | 0.313097 |
| low | 0.513824 |
| medium | 0.324529 |
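Several statistics can be computed in a single pass by giving `agg` a list of function names. The sketch below uses a small hand-made frame rather than the random data above, so the values are easy to verify by eye.

```python
import pandas as pd

# Hand-made example data, not the random scores used above
df = pd.DataFrame({
    "score": [1.0, 2.0, -1.0, -3.0, 0.0],
    "category": ["high", "high", "low", "low", "medium"],
})

# One row per group, one column per statistic
summary = df.groupby("category")["score"].agg(["mean", "std", "count"])
print(summary)
```

A group with a single observation, such as "medium" here, has an undefined sample standard deviation, so its `std` entry is `NaN`.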
