### Notes on Ch 7: Data Cleaning and Preparation

When working with data for analysis and modeling, a significant amount of time is spent on loading, cleaning, and transforming it. The pandas library offers tools that make these tasks easier and more efficient.

#### Handling Missing Data

<b>NaN as Sentinel Value:</b>

For numerical data, pandas uses the floating-point value `NaN` (Not a Number) to represent missing values.
It acts as a sentinel: a special value whose presence marks an entry as missing.

In [2]:
import pandas as pd
import numpy as np

float_data = pd.Series([1.2,-3.5,np.nan,0])
float_data

0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

The `isna()` method helps identify missing values in a dataset.
It returns a Boolean series, marking True where values are missing.

In [2]:
float_data.isna()

0    False
1    False
2     True
3    False
dtype: bool

The built-in Python `None` value is also treated as NA:

In [3]:
string_data = pd.Series(["aardvark", np.nan, None, "avocado"])
string_data.isna()

0    False
1     True
2     True
3    False
dtype: bool
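
As a minimal sketch of the point above (the variable names here are illustrative, not from the original notes), `None` is converted to `NaN` when a Series is built with a float dtype, and summing the Boolean result of `isna()` counts the missing entries:

```python
import numpy as np
import pandas as pd

# None is converted to NaN when the Series is built with a float dtype
float_with_none = pd.Series([1, 2, None], dtype="float64")

# isna() returns booleans, so summing them counts the missing entries
n_missing = int(float_with_none.isna().sum())

print(float_with_none)
print(n_missing)
```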

##### Filtering Out Missing Data

<b>Filtering Missing Values in a Series:</b>

You can filter out missing values from a Series using the `dropna()` method:

In [4]:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Drop missing values
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

You can achieve the same result using boolean indexing:

In [5]:
data[data.notna()]

0    1.0
2    3.5
4    7.0
dtype: float64
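
The equivalence of the two approaches can be checked directly. The sketch below reuses the same Series; `renumbered` is an illustrative name, and `reset_index(drop=True)` is shown only as one possible way to get a fresh 0-based index, since both approaches keep the original labels:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

dropped = data.dropna()
masked = data[data.notna()]

# equals() checks that both results have the same labels and values
same = dropped.equals(masked)
print(same)

# The surviving rows keep their original labels (0, 2, 4);
# reset_index(drop=True) renumbers them from 0 if that is preferred
renumbered = dropped.reset_index(drop=True)
print(renumbered)
```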

<b>Filtering Missing Values in a DataFrame:</b>

For DataFrames, there are different ways to remove missing data. The `dropna()` method, by default, drops any row containing a missing value:

In [3]:
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                        [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

print(data)
print("")
print(data.dropna())
print("")
# Passing how="all" will drop only rows that are all NA:
print(data.dropna(how="all"))

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

     0    1    2
0  1.0  6.5  3.0

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


To drop columns instead of rows, pass `axis="columns"`. Combined with `how="all"`, only columns that are entirely NA are dropped:

In [4]:
# Adding a column that is entirely missing
data[4] = np.nan

# Drop only the columns in which every value is missing
data.dropna(axis="columns", how="all")

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
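
To make the role of `how` concrete, here is a small sketch that rebuilds the same DataFrame and compares `how="all"` with the default `how="any"` along the column axis. The result names are illustrative. Because row 2 is entirely NA, every column contains at least one missing value, so the default drops all of them:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data[4] = np.nan  # a column that is entirely NA

# how="all": drop only columns where every value is NA
only_empty_dropped = data.dropna(axis="columns", how="all")

# default how="any": drop every column containing at least one NA
any_na_dropped = data.dropna(axis="columns")

print(list(only_empty_dropped.columns))
print(any_na_dropped.shape)
```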


<b>Handling Missing Data Based on a Threshold:</b>

You can also keep only rows that have at least a given number of non-missing values by using the `thresh` argument:

In [7]:
df = pd.DataFrame(np.random.standard_normal((7, 3)))

# Introducing missing values
df.iloc[:4, 1] = np.nan

print(df)
print("")
print(df.dropna())
print("")
print(df.dropna(thresh=2)) # This will keep rows with at least 2 non-missing values.



          0         1         2
0 -1.292988       NaN  0.950515
1  0.710997       NaN -0.316265
2  0.963413       NaN -1.531206
3  0.965523       NaN  2.023745
4  0.169033 -1.603368 -0.326718
5 -0.614326  1.457363 -1.173855
6  0.845100  2.039895 -0.572499

          0         1         2
4  0.169033 -1.603368 -0.326718
5 -0.614326  1.457363 -1.173855
6  0.845100  2.039895 -0.572499

          0         1         2
0 -1.292988       NaN  0.950515
1  0.710997       NaN -0.316265
2  0.963413       NaN -1.531206
3  0.965523       NaN  2.023745
4  0.169033 -1.603368 -0.326718
5 -0.614326  1.457363 -1.173855
6  0.845100  2.039895 -0.572499
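
In the random example above every row has at least two non-missing values, so `thresh=2` removes nothing. The following sketch uses a small hand-made DataFrame (hypothetical column names `a`, `b`, `c`) where the threshold actually changes the result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, 4.0],
})

# Count the non-missing values in each row
non_missing = df.notna().sum(axis="columns")

# Keep rows with at least 2 non-missing values
at_least_two = df.dropna(thresh=2)

# Keep rows with at least 3 non-missing values (here: complete rows only)
all_three = df.dropna(thresh=3)

print(non_missing.tolist())
print(list(at_least_two.index))
print(list(all_three.index))
```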


##### Filling In Missing Data

Instead of filtering out missing data, you might want to fill in the missing values. The `fillna()` method is useful for this purpose.

In [8]:
# Filling in missing data with a constant
df.fillna(0)

          0         1         2
0 -1.292988  0.000000  0.950515
1  0.710997  0.000000 -0.316265
2  0.963413  0.000000 -1.531206
3  0.965523  0.000000  2.023745
4  0.169033 -1.603368 -0.326718
5 -0.614326  1.457363 -1.173855
6  0.845100  2.039895 -0.572499


You can also fill missing values with different values for each column using a dictionary:

In [9]:
df.fillna({1: 0.5, 2: 0}) # Fill column 1 with 0.5 and column 2 with 0

          0         1         2
0 -1.292988  0.500000  0.950515
1  0.710997  0.500000 -0.316265
2  0.963413  0.500000 -1.531206
3  0.965523  0.500000  2.023745
4  0.169033 -1.603368 -0.326718
5 -0.614326  1.457363 -1.173855
6  0.845100  2.039895 -0.572499
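
Two details of dictionary-based filling are easy to miss: columns without an entry in the dictionary are left untouched, and `fillna()` returns a new object rather than modifying the original. A minimal sketch with hypothetical column names `x`, `y`, and `z`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [np.nan, 1.0],
    "y": [np.nan, 2.0],
    "z": [np.nan, 3.0],
})

# Only columns named in the dict are filled; "z" has no entry
filled = df.fillna({"x": 0.5, "y": 0})

# The original DataFrame still holds its three missing values
remaining_in_original = int(df.isna().sum().sum())

print(filled)
print(remaining_in_original)
```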


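Missing values can also be filled by propagating neighboring values, using forward fill (`ffill()`) or backward fill (`bfill()`). These methods replace the older `fillna(method=...)` spelling, which is deprecated as of pandas 2.1:

In [11]:
df = pd.DataFrame(np.random.standard_normal((7, 3)))

# Introducing missing values
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan

print(df)
print("")
print(df.ffill()) # Forward fill: copy the last valid value downward
print("")
print(df.bfill()) # Backward fill: the trailing NaNs have no later value to copy, so they remain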

          0         1         2
0 -1.634233  0.364611 -0.148496
1 -0.205130 -1.054656  0.505615
2  1.960879       NaN  0.129398
3  0.013551       NaN -0.403024
4 -1.927891       NaN       NaN
5 -1.330447       NaN       NaN
6  0.269430       NaN       NaN

          0         1         2
0 -1.634233  0.364611 -0.148496
1 -0.205130 -1.054656  0.505615
2  1.960879 -1.054656  0.129398
3  0.013551 -1.054656 -0.403024
4 -1.927891 -1.054656 -0.403024
5 -1.330447 -1.054656 -0.403024
6  0.269430 -1.054656 -0.403024

          0         1         2
0 -1.634233  0.364611 -0.148496
1 -0.205130 -1.054656  0.505615
2  1.960879       NaN  0.129398
3  0.013551       NaN -0.403024
4 -1.927891       NaN       NaN
5 -1.330447       NaN       NaN
6  0.269430       NaN       NaN
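
The difference between the two directions, and the optional `limit` argument that caps how many consecutive values are filled, can be seen on a short Series. This is an illustrative sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, np.nan])

# Forward fill: each gap takes the last valid value before it
forward = s.ffill()

# limit=2 fills at most two consecutive missing values per gap
limited = s.ffill(limit=2)

# Backward fill: each gap takes the next valid value after it;
# the final NaN has nothing after it and stays missing
backward = s.bfill()

print(forward.tolist())
print(limited.tolist())
print(backward.tolist())
```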


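The `fillna()` method can also be used for simple imputation, such as filling each column's missing values with that column's mean. Here `data` is the earlier DataFrame; column 4 is entirely NA, so its mean is also `NaN` and the column stays unfilled: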

In [12]:
# Filling missing values with the mean
data.fillna(data.mean())

     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  6.5  3.0 NaN
2  1.0  6.5  3.0 NaN
3  1.0  6.5  3.0 NaN
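
Mean imputation works column by column because `fillna()` aligns the Series of means with the column labels. The sketch below uses a small hypothetical DataFrame to show this, including a column with no valid values, which stays unfilled:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 4.0, 8.0],
    "empty": [np.nan, np.nan, np.nan],
})

# mean() skips NaN by default; an all-NaN column has a NaN mean
col_means = data.mean()

# Each column's gaps are filled with that column's own mean
imputed = data.fillna(col_means)

print(col_means)
print(imputed)
```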
