### Notes on Ch 7: Data Cleaning and Preparation

Much of the time spent on data analysis and modeling goes into loading, cleaning, and transforming data. pandas, a Python library, offers tools that make these tasks easier and more efficient.

#### Handling Missing Data

<b>NaN as Sentinel Value:</b>

For numerical data, pandas uses the floating-point value `NaN` (Not a Number) to represent missing values.
It acts as a sentinel: a special value whose presence marks an entry as missing.

In [1]:
import pandas as pd
import numpy as np

float_data = pd.Series([1.2,-3.5,np.nan,0])
float_data

0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

The `isna()` method identifies missing values in a Series or DataFrame.
On a Series, it returns a Boolean Series that is `True` wherever a value is missing.

In [2]:
float_data.isna()

0    False
1    False
2     True
3    False
dtype: bool
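Because the result of `isna()` is Boolean, it can be summed or averaged to count missing values. The snippet below is a small sketch of that idea using the same `float_data` values; the names `n_missing` and `frac_missing` are only illustrative.

```python
import numpy as np
import pandas as pd

float_data = pd.Series([1.2, -3.5, np.nan, 0])

# True counts as 1 and False as 0, so sum() gives the number of missing values
n_missing = int(float_data.isna().sum())

# mean() of the Boolean mask gives the fraction of missing values
frac_missing = float(float_data.isna().mean())

print(n_missing)     # 1
print(frac_missing)  # 0.25
```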

The built-in Python `None` value is also treated as NA:

In [3]:
string_data = pd.Series(["aardvark", np.nan, None, "avocado"])
string_data.isna()

0    False
1     True
2     True
3    False
dtype: bool
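As a quick check of how `None` behaves in numeric data, the sketch below builds a float Series that contains `None`. The variable names are illustrative; the point is that `None` is stored as `NaN` and that `notna()` is the negation of `isna()`.

```python
import pandas as pd

# None in an otherwise numeric list is stored as NaN, so the dtype stays float64
mixed = pd.Series([1.0, None, 3.0])

na_mask = mixed.isna()        # True where the value is missing
present_mask = mixed.notna()  # element-wise negation of isna()

print(mixed.dtype)            # float64
print(na_mask.tolist())       # [False, True, False]
print(present_mask.tolist())  # [True, False, True]
```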

#### Filtering Out Missing Data

<b>Filtering Missing Values in a Series:</b>

You can filter out missing values from a Series using the `dropna()` method:

In [4]:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Drop missing values
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

You can achieve the same result using boolean indexing:

In [5]:
data[data.notna()]

0    1.0
2    3.5
4    7.0
dtype: float64
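The following sketch checks that the two approaches give the same result and that neither changes the original Series: both return a new object and keep the original index labels. The names `dropped` and `filtered` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

dropped = data.dropna()        # method-based filtering
filtered = data[data.notna()]  # Boolean-mask filtering

# Both results hold the same values under the same index labels
same = dropped.equals(filtered)

print(same)                    # True
print(dropped.index.tolist())  # [0, 2, 4]
print(len(data))               # 5, the original Series is unchanged
```

If consecutive labels are needed afterwards, `dropped.reset_index(drop=True)` renumbers the result from 0.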

<b>Filtering Missing Values in a DataFrame:</b>

For DataFrames, there are different ways to remove missing data. By default, the `dropna()` method drops any row containing a missing value:

In [9]:
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

print(data)
print("")
print(data.dropna())
print("")
# Passing how="all" will drop only rows that are all NA:
print(data.dropna(how="all"))

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

     0    1    2
0  1.0  6.5  3.0

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
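`dropna()` accepts two further parameters that these notes do not use above: `axis="columns"` applies the same logic to columns, and `thresh` keeps only rows with at least that many non-missing values. The sketch below reuses the same values; the all-NA column `4` is a hypothetical addition made only for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                      [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
frame[4] = np.nan  # hypothetical extra column that is entirely missing

# Drop only the columns in which every value is NA
cols_dropped = frame.dropna(axis="columns", how="all")

# Keep only the rows that have at least two non-NA values
at_least_two = frame.dropna(thresh=2)

print(cols_dropped.columns.tolist())  # [0, 1, 2]
print(at_least_two.index.tolist())    # [0, 3]
```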
