# Identify Missing Data
- Values can either be missing from the original dataset or become missing due to data manipulation
- In pandas, missing values are typically represented as NaN (Not a Number) or None
- Missing data can:
    - Hint at data collection errors
    - Indicate improper conversion/manipulation
    - Cause errors or misinterpretation of the data
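
As a small illustration of how manipulation can introduce missing values, the sketch below converts text to numbers with `pd.to_numeric` and `errors='coerce'`. The Series `raw` is made up for this example and is not part of the car financing data.

```python
import pandas as pd

# Hypothetical raw values; 'n/a' cannot be parsed as a number
raw = pd.Series(['202.93', '200.10', 'n/a'])

# errors='coerce' replaces unparseable entries with NaN instead of raising
converted = pd.to_numeric(raw, errors='coerce')

# Count how many values became missing during the conversion
print(converted.isna().sum())
```

A nonzero count after a conversion like this is a cue to inspect the original values before continuing.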

### Load the data

In [7]:
import pandas as pd

# Create a DataFrame by loading the data from an Excel file
filename = 'Ex_Files_Python_for_Data_Vis/Exercise Files/Pandas/data/car_financing.xlsx'
df = pd.read_excel(filename)

In [8]:
# Check the first few rows of data
df.head()

Unnamed: 0,Month,Starting Balance,Repayment,Interest Paid,Principal Paid,New Balance,term,interest_rate,car_type
0,1,34689.96,687.23,202.93,484.3,34205.66,60,0.0702,Toyota Sienna
1,2,34205.66,687.23,200.1,487.13,33718.53,60,0.0702,Toyota Sienna
2,3,33718.53,687.23,197.25,489.98,33228.55,60,0.0702,Toyota Sienna
3,4,33228.55,687.23,194.38,492.85,32735.7,60,0.0702,Toyota Sienna
4,5,32735.7,687.23,191.5,495.73,32239.97,60,0.0702,Toyota Sienna


In [9]:
# Create and apply filters to the data frame
car_filter = df['car_type'] == 'Toyota Sienna'
interest_filter = df['interest_rate'] == 0.0702
df = df.loc[car_filter & interest_filter, :]

### Find missing values

In [12]:
# Discover null values
#   - In this case, there is one null value in the "Interest Paid" column
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 60 entries, 0 to 59
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Month             60 non-null     int64  
 1   Starting Balance  60 non-null     float64
 2   Repayment         60 non-null     float64
 3   Interest Paid     59 non-null     float64
 4   Principal Paid    60 non-null     float64
 5   New Balance       60 non-null     float64
 6   term              60 non-null     int64  
 7   interest_rate     60 non-null     float64
 8   car_type          60 non-null     object 
dtypes: float64(6), int64(2), object(1)
memory usage: 4.7+ KB


There are two common methods for indicating where values in the DataFrame are missing:
1. isna
2. isnull

They are functionally identical: isnull is an alias of isna.
- When called on a column, both return a pandas Series of True/False values
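
A quick way to confirm the equivalence is to compare both results on a small, made-up Series (the values below are illustrative and are not read from the Excel file):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
s = pd.Series([202.93, np.nan, 197.25], name='Interest Paid')

# Both methods flag the same positions as missing
same = s.isna().equals(s.isnull())
print(same)
```

Since the two are interchangeable, picking one and using it consistently keeps the notebook easier to read.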

In [16]:
# Try the method on the "Interest Paid" column
df['Interest Paid'].isna().head()

0    False
1    False
2    False
3    False
4    False
Name: Interest Paid, dtype: bool

In [17]:
# Create a filter for the missing values
interest_missing = df['Interest Paid'].isna()

In [18]:
# Look at the row that contains the NaN value
df.loc[interest_missing, :]

Unnamed: 0,Month,Starting Balance,Repayment,Interest Paid,Principal Paid,New Balance,term,interest_rate,car_type
35,36,15940.06,687.23,,593.99,15346.07,60,0.0702,Toyota Sienna


In [19]:
# Note: the ~ operator negates a Boolean filter
# In this case, it returns every row except the one that contains the NaN value
df.loc[~interest_missing, :]

Unnamed: 0,Month,Starting Balance,Repayment,Interest Paid,Principal Paid,New Balance,term,interest_rate,car_type
0,1,34689.96,687.23,202.93,484.3,34205.66,60,0.0702,Toyota Sienna
1,2,34205.66,687.23,200.1,487.13,33718.53,60,0.0702,Toyota Sienna
2,3,33718.53,687.23,197.25,489.98,33228.55,60,0.0702,Toyota Sienna
3,4,33228.55,687.23,194.38,492.85,32735.7,60,0.0702,Toyota Sienna
4,5,32735.7,687.23,191.5,495.73,32239.97,60,0.0702,Toyota Sienna
5,6,32239.97,687.23,188.6,498.63,31741.34,60,0.0702,Toyota Sienna
6,7,31741.34,687.23,185.68,501.55,31239.79,60,0.0702,Toyota Sienna
7,8,31239.79,687.23,182.75,504.48,30735.31,60,0.0702,Toyota Sienna
8,9,30735.31,687.23,179.8,507.43,30227.88,60,0.0702,Toyota Sienna
9,10,30227.88,687.23,176.83,510.4,29717.48,60,0.0702,Toyota Sienna
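
The rows without missing values can also be selected with `notna`, which returns the opposite of `isna`. The sketch below uses a small hypothetical DataFrame (`df_demo`, not the car financing data) to show that inverting an `isna` filter with `~` and filtering with `notna` keep the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing "Interest Paid" value
df_demo = pd.DataFrame({
    'Month': [1, 2, 3],
    'Interest Paid': [202.93, np.nan, 197.25],
})

# Option 1: build the "missing" filter and invert it with ~
missing = df_demo['Interest Paid'].isna()
kept_inverted = df_demo.loc[~missing, :]

# Option 2: ask directly for the values that are present
kept_notna = df_demo.loc[df_demo['Interest Paid'].notna(), :]

print(kept_inverted.equals(kept_notna))
```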


In [20]:
# Count the number of missing values
#   - This works because True counts as 1 and False as 0 when summed
df['Interest Paid'].isna().sum()

np.int64(1)
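
The same idea extends to a whole DataFrame: calling `isna().sum()` on the frame gives one count per column. The example below is a sketch on a hypothetical three-row frame (`df_demo`), not on the data loaded above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a single missing value
df_demo = pd.DataFrame({
    'Month': [1, 2, 3],
    'Interest Paid': [202.93, np.nan, 197.25],
    'Principal Paid': [484.30, 487.13, 489.98],
})

# One missing-value count per column
missing_per_column = df_demo.isna().sum()

# Total number of missing values in the frame
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```

Checking every column at once can surface missing values in columns that were not under suspicion.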