In [45]:
import pandas as pd
import numpy as np


In [112]:
df = pd.DataFrame({
    'ID': range(1,31),
    'value':[n if n % 5 != 0 else np.nan for n in range(1,31)],
    'numbers': [n if n % 4 != 0 else '?' for n in range(1,31)],
    'text': ['txt' if n % 3 != 0 else 'No Value!' for n in range(1,31)],
})

df.head()

Unnamed: 0,ID,value,numbers,text
0,1,1.0,1,txt
1,2,2.0,2,txt
2,3,3.0,3,No Value!
3,4,4.0,?,txt
4,5,,5,txt


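We know that the *numbers* and *text* columns also contain missing values (`?` and `No Value!`), but these will not be counted by the `df.isna()` method, nor by its alias `df.isnull()`. Only the real NaN entries in *value* are reported.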

In [113]:
# Finding the total of null values
df.isna().sum()

ID         0
value      6
numbers    0
text       0
dtype: int64

In [114]:
df.isnull().sum()

ID         0
value      6
numbers    0
text       0
dtype: int64
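
The counts above only cover real NaN entries. As a minimal sketch, assuming we already know which placeholder tokens the data uses (here `?` and `No Value!`; the names `placeholders`, `cleaned` and `hidden_counts` are illustrative), the disguised values can be masked to NaN so that `isna()` counts them too:

```python
import numpy as np
import pandas as pd

# Same construction as the example frame above
df = pd.DataFrame({
    'ID': range(1, 31),
    'value': [n if n % 5 != 0 else np.nan for n in range(1, 31)],
    'numbers': [n if n % 4 != 0 else '?' for n in range(1, 31)],
    'text': ['txt' if n % 3 != 0 else 'No Value!' for n in range(1, 31)],
})

# Assumption: the placeholder tokens are already known for this data
placeholders = ['?', 'No Value!']

# mask() replaces every cell where the condition is True with NaN
cleaned = df.mask(df.isin(placeholders))

# Count missing values per column, now including the disguised ones
hidden_counts = cleaned.isna().sum()
print(hidden_counts)
```

The rest of this section deals with the harder case, where the placeholder tokens are not known in advance.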

### Common Missing Data

For missing data that Pandas already recognizes as NaN, just filter the dataset with a boolean mask.

In [115]:
# Keep only the rows where `value` is NaN
df[ df.value.isna() ]

Unnamed: 0,ID,value,numbers,text
4,5,,5,txt
9,10,,10,txt
14,15,,15,No Value!
19,20,,?,txt
24,25,,25,txt
29,30,,30,No Value!
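
When a dataset has many columns, checking them one by one gets tedious. A possible sketch, using a reduced two-column version of the example frame (the name `rows_with_nan` is illustrative), selects every row that has a NaN in any column:

```python
import numpy as np
import pandas as pd

# Reduced version of the example frame: only the columns needed here
df = pd.DataFrame({
    'ID': range(1, 31),
    'value': [n if n % 5 != 0 else np.nan for n in range(1, 31)],
})

# isna() gives a boolean frame; any(axis=1) is True for rows with at least one NaN
rows_with_nan = df[df.isna().any(axis=1)]
print(rows_with_nan['ID'].tolist())
```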


In [146]:
df.dtypes

ID           int64
value      float64
numbers     object
text        object
dtype: object

### Numbers with "weird" missing data

For the other columns, a good first step is to check the data types. If we know which type a column should have, for example that *numbers* should hold integers, but Pandas reports `object`, then the column probably contains a disguised NA or some other unexpected value that we should check.

In [162]:
# Plant a number stored as text with a leading space (row 3, column `numbers`)
df.iloc[3,2] = ' 3'

In [163]:
# Check types
df.dtypes

ID           int64
value      float64
numbers     object
text        object
dtype: object

In [164]:
# Find the non-numeric data: cast to string, then keep rows that are not all digits
df[ ~df.numbers.astype(str).str.isdigit() ]

Unnamed: 0,ID,value,numbers,text
3,4,4.0,3,txt
7,8,8.0,?,txt
11,12,12.0,?,No Value!
15,16,16.0,?,txt
19,20,,?,txt
23,24,24.0,?,No Value!
27,28,28.0,?,txt
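
Note that `str.isdigit()` also flags legitimate numbers such as negatives and decimals, because `-` and `.` are not digits. As an alternative sketch, on a small hypothetical sample rather than the frame above, `pd.to_numeric` with `errors='coerce'` turns unparseable entries into NaN, which can then be compared against the original values:

```python
import pandas as pd

# Hypothetical sample: integers mixed with placeholders, a negative and a decimal
numbers = pd.Series([1, 2, 3, '?', 5, -6, 7.5, '?'])

# isdigit() approach: flags '?' but also -6 and 7.5
flagged_by_isdigit = ~numbers.astype(str).str.isdigit()

# errors='coerce' turns anything that cannot be parsed as a number into NaN
coerced = pd.to_numeric(numbers, errors='coerce')

# Flag entries that were present originally but failed the conversion
not_numeric = coerced.isna() & numbers.notna()

print(numbers[flagged_by_isdigit].tolist())
print(numbers[not_numeric].tolist())
```

Which check fits better depends on whether the column is expected to hold only non-negative integers or any numeric value.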


### Text Data

Now, what to do with textual data when we don't know where the NAs are?
Many times, the **NaN** values will come as *NA*, *?* or *-*, but they can be many other things, so the idea is to use Regular Expressions to try to find them. Once you find the pattern, it becomes easier to identify the others.

Let's imagine we don't know that the NA value for the column *text* is "No Value!". We will check for several common patterns with RegEx.


In [118]:
import re

In [145]:
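# Using regex to find the NA value: match the token "NA" or any of the symbols * ? ! # -
# (inside a character class the symbols are listed directly, without | separators)
df[ df.text.apply(lambda x: re.search(r'NA|[*?!#-]', x) is not None) ]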

Unnamed: 0,ID,value,numbers,text
2,3,3.0,3,No Value!
5,6,6.0,6,No Value!
8,9,9.0,9,No Value!
11,12,12.0,?,No Value!
14,15,,15,No Value!
17,18,18.0,18,No Value!
20,21,21.0,21,No Value!
23,24,24.0,?,No Value!
26,27,27.0,27,No Value!
29,30,,30,No Value!
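
Once the regex has matched some rows, it helps to see which distinct strings were caught and how often. A minimal sketch, rebuilding only the *text* column and using the vectorized `str.contains` (the names `pattern` and `suspicious` are illustrative):

```python
import pandas as pd

# Same construction as the `text` column of the example frame
text = pd.Series(['txt' if n % 3 != 0 else 'No Value!' for n in range(1, 31)])

# The token "NA" or any of the symbols * ? ! # -
pattern = r'NA|[*?!#-]'

# Keep only the entries that match the pattern
suspicious = text[text.str.contains(pattern, regex=True)]

# Frequency of each distinct suspicious string
print(suspicious.value_counts())
```

A short list of distinct values is usually enough to tell real placeholders such as "No Value!" apart from ordinary text that happens to contain one of the symbols.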
