## Handling Missing Values

1. **`df.isna()`:**
   - Purpose: Returns a Boolean DataFrame with the same shape as `df`, where `True` marks a missing value (`NaN`, `None`, or `NaT`) and `False` marks a value that is present.
   - Usage: `df.isna()`

2. **`df.dropna()`:**
   - Purpose: Removes rows or columns that contain missing values (NaN) and returns a new DataFrame. By default, a row is dropped if any of its values is missing, and `df` itself is left unchanged.
   - Usage:
     - Drop rows with missing values: `df.dropna()`
     - Drop columns with missing values: `df.dropna(axis=1)`

3. **`df.fillna()`:**
   - Purpose: Replaces missing values in `df` with values you specify and returns a new DataFrame.
   - Usage: `df.fillna(value)`, where `value` is a scalar applied to every missing cell, or a dict that maps column names to fill values.

4. **`df.isnull()`:**
   - Purpose: An alias for `df.isna()`. It returns the same Boolean DataFrame, with `True` wherever a value is missing.
   - Usage: `df.isnull()`

5. **`df.notnull()`:**
   - Purpose: The complement of `df.isnull()`. It returns a Boolean DataFrame with the same shape as `df`, where `True` marks a value that is present. `df.notna()` is an equivalent alias.
   - Usage: `df.notnull()`
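
As a quick orientation before working with the real dataset, the methods above can be sketched on a tiny frame. The `toy` DataFrame and its column names are hypothetical and exist only for illustration; they are not part of the NBA data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the NBA dataset): two missing cells.
toy = pd.DataFrame({
    "points": [10.0, np.nan, 7.0],
    "assists": [3.0, 4.0, np.nan],
    "games": [5, 6, 7],
})

mask = toy.isna()               # Boolean frame with the same shape as toy
rows_kept = toy.dropna()        # drops rows 1 and 2, each has one gap
cols_kept = toy.dropna(axis=1)  # drops "points" and "assists"
filled = toy.fillna(0)          # every missing cell becomes 0

print(mask.shape, rows_kept.shape, cols_kept.shape)
print(filled)
```

None of these calls modifies `toy`; each returns a new object, so the result has to be assigned if you want to keep it.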

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.float_format', lambda x: '%.6f' % x)  # Display floats with 6 decimal places for readability

# Read the data
df = pd.read_csv("data/nba.csv")  # Load the NBA dataset into a DataFrame

df.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


**`df.isna()`**

`.isna()` was introduced later to align pandas with the naming conventions used in R, making the transition easier for users coming from that ecosystem.

In [2]:
df_na = df.isna() # Check for missing values in the DataFrame

df_na.head(4)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False
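
A Boolean mask is rarely inspected cell by cell. Because `True` counts as 1, the mask can be summed or averaged to summarize missingness. The sketch below uses a hypothetical `sample` frame with placeholder values rather than `data/nba.csv`, so it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the NBA data, with placeholder values.
sample = pd.DataFrame({
    "Name": ["P1", "P2", "P3", "P4"],
    "College": ["C1", np.nan, "C3", np.nan],
    "Salary": [100.0, np.nan, 300.0, 400.0],
})

missing_per_column = sample.isna().sum()     # count of missing cells per column
missing_share = sample.isna().mean()         # fraction of missing cells per column
rows_with_gaps = sample.isna().any(axis=1)   # True for rows with at least one gap

print(missing_per_column)
print(missing_share)
print(sample[rows_with_gaps])
```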


**`df.isnull()`**

`.isnull()` is the original method name in pandas. It behaves identically to `.isna()`.

In [3]:
df_null = df.isnull() # Check for null values in the DataFrame, same as isna()

df_null.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,True,False
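
To confirm that the two names produce the same result, the masks can be compared directly. This is a minimal check on a hypothetical two-row frame, not on the NBA data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a single missing value.
sample = pd.DataFrame({"Age": [25.0, np.nan], "Weight": [180.0, 235.0]})

# DataFrame.equals compares shape, dtypes, and every element.
same_mask = sample.isnull().equals(sample.isna())
print(same_mask)

# The top-level functions mirror the methods and also accept scalars.
scalar_checks = (pd.isnull(np.nan), pd.isna(np.nan))
print(scalar_checks)
```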


**`df.notnull()`**

In [4]:
df_not_null = df.notnull() # Check for non-null values in the DataFrame

df_not_null.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,True,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,False
3,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,False,True
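
A common use of `notnull()` is as a row filter: index the DataFrame with the Boolean Series for one column to keep only the rows where that column has a value. The `sample` frame below is a hypothetical stand-in for the NBA data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one player has no recorded salary.
sample = pd.DataFrame({
    "Name": ["P1", "P2", "P3"],
    "Salary": [100.0, np.nan, 300.0],
})

has_salary = sample["Salary"].notnull()  # Boolean Series, one entry per row
paid = sample[has_salary]                # keeps only rows where the mask is True

# notnull() is the element-wise negation of isnull().
is_complement = sample.notnull().equals(~sample.isnull())

print(paid)
print(is_complement)
```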


**`df.dropna()`**

In [5]:
# Drop rows where the 'College' value is missing
df_col_drp = df.dropna(subset=["College"], axis=0, how="all")

df_col_drp.head(5) # Rows with index 4 and 5 are dropped because their 'College' value is NaN

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
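
With a single column in `subset`, `how="all"` and `how="any"` behave the same, since there is only one value to test per row. The difference shows up with several columns. The sketch below compares the `how`, `subset`, and `thresh` options on a hypothetical four-row frame.

```python
import numpy as np
import pandas as pd

# Hypothetical data with different missingness patterns per row.
sample = pd.DataFrame({
    "Name": ["P1", "P2", "P3", "P4"],
    "College": ["C1", np.nan, np.nan, "C4"],
    "Salary": [100.0, 200.0, np.nan, np.nan],
})

cols = ["College", "Salary"]

any_gap = sample.dropna()                          # default how="any": keeps row 0
subset_any = sample.dropna(subset=cols, how="any") # drop if either column is missing
subset_all = sample.dropna(subset=cols, how="all") # drop only if both are missing
at_least_two = sample.dropna(thresh=2)             # keep rows with >= 2 present values

print(subset_any.index.tolist())
print(subset_all.index.tolist())
print(at_least_two.index.tolist())
```

The surviving rows keep their original index labels, which is why the output above jumps from 3 to 6. Call `reset_index(drop=True)` on the result if a contiguous index is needed.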


**`df.fillna()`**

In [6]:
# Fill missing values
df["Salary_filled"] = df["Salary"].fillna(df["Salary"].mean()) # Replace NaN values in 'Salary' with the column mean

df.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Salary_filled
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,4842684.105381
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0,5000000.0
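
The column mean is only one possible fill value. As a sketch of two alternatives, the example below fills several columns at once with a dict, and fills salaries with a per-team mean. The `sample` frame, the `"Unknown"` label, and the choice of median and group mean are illustrative assumptions, not a recommendation from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a text column and a numeric column.
sample = pd.DataFrame({
    "Team": ["T1", "T1", "T2", "T2"],
    "College": ["C1", np.nan, "C3", np.nan],
    "Salary": [100.0, np.nan, 300.0, 500.0],
})

# A dict applies a different fill value to each named column.
filled = sample.fillna({
    "College": "Unknown",
    "Salary": sample["Salary"].median(),  # median of 100, 300, 500
})

# Fill with the mean salary of the player's own team instead.
team_mean = sample.groupby("Team")["Salary"].transform("mean")
by_team = sample["Salary"].fillna(team_mean)

print(filled)
print(by_team)
```

The median is less sensitive to a few very large salaries than the mean, and a group-wise fill keeps the imputed value closer to comparable rows.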


Next chapter: [Data Cleaning and Transformations](6.DataCleaningTransformations.ipynb)