In this section, we learn how to handle missing data in Python. We will be using the 'Pima Indians Diabetes' dataset as an example to walk us through some of the important techniques involving treatment of missing data. 

In [1]:
import numpy as np
import pandas as pd

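from sklearn.impute import SimpleImputer as Imputer # sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement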

The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:

1. Number of times pregnant.
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
3. Diastolic blood pressure (mm Hg).
4. Triceps skinfold thickness (mm).
5. 2-Hour serum insulin (mu U/ml).
6. Body mass index (weight in kg/(height in m)^2).
7. Diabetes pedigree function.
8. Age (years).
9. Class variable (0 or 1).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.

The dataset is known to contain missing observations that are recorded as zero in some columns. We can corroborate this from the definition of those columns and the domain knowledge that a zero is invalid for those measures; for example, a body mass index or blood pressure of zero is not physiologically possible. In the summary statistics further down, several columns have a minimum value of zero, and on those columns the zero indicates an invalid or missing value.

In [2]:
dataset = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv', header=None)
print(type(dataset))
dataset.head(10)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,0,1,2,3,4,5,6,7,8
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


Now let's add column names and look at the summary level information about the dataset:

In [3]:
dataset.columns = ["Pregnant Times", "Pgc", "Diastolic Bld Pressure", "Tst", "Serum Insulin", "BMI", "Diabetes Pedigree Function", "Age", "y"]
print(dataset.head(5), "\n")
print(dataset.shape, "\n")
print(dataset.info())
dataset.describe()

   Pregnant Times  Pgc  Diastolic Bld Pressure  Tst  Serum Insulin   BMI  \
0               6  148                      72   35              0  33.6   
1               1   85                      66   29              0  26.6   
2               8  183                      64    0              0  23.3   
3               1   89                      66   23             94  28.1   
4               0  137                      40   35            168  43.1   

   Diabetes Pedigree Function  Age  y  
0                       0.627   50  1  
1                       0.351   31  0  
2                       0.672   32  1  
3                       0.167   21  0  
4                       2.288   33  1   

(768, 9) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnant Times                768 non-null int64
Pgc                           768 non-null int64
Diastolic Bld Pressure        768 non-null int64
Tst                           768 non-n

Unnamed: 0,Pregnant Times,Pgc,Diastolic Bld Pressure,Tst,Serum Insulin,BMI,Diabetes Pedigree Function,Age,y
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


Our first task is to identify the columns that contain zeros. To do this, we compare the columns of the 'DataFrame' with zero, which yields a table of Booleans, and then count the True values in each column. Running the code below shows that 'Diabetes Pedigree Function' and 'Age' contain no zeros at all. Zeros in 'Pregnant Times' (no pregnancies) and 'y' (the negative class) are legitimate values, so only the zeros in 'Pgc', 'Diastolic Bld Pressure', 'Tst', 'Serum Insulin', and 'BMI' stand in for missing measurements, and some of those columns contain a lot of them.

In [4]:
column_names=dataset.columns.to_list()
zero_table=(dataset[column_names] == 0).sum() # dataset[column_names]==0 is a table of True/False
print(type(zero_table))
zero_table

<class 'pandas.core.series.Series'>


Pregnant Times                111
Pgc                             5
Diastolic Bld Pressure         35
Tst                           227
Serum Insulin                 374
BMI                            11
Diabetes Pedigree Function      0
Age                             0
y                             500
dtype: int64
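
The Boolean-mask counting used above can also report the share of zeros per column, which is often easier to judge than a raw count. The following is a minimal sketch on a small made-up frame; the column names and values are hypothetical and are not taken from the diabetes data.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    "glucose": [148, 0, 183, 89],
    "insulin": [0, 0, 0, 94],
    "age": [50, 31, 32, 21],
})

zero_mask = toy == 0           # DataFrame of True/False with the same shape as toy
zero_counts = zero_mask.sum()  # True counts as 1, so this is the number of zeros per column
zero_share = zero_mask.mean()  # the mean of Booleans is the fraction of zeros per column

print(zero_counts)
print(zero_share)
```

Because the mask has the same shape as the data, the same pattern works for any sentinel value, not only zero.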

Let's replace the zeros that stand in for missing measurements with numpy's NaN. A zero is a legitimate value for 'Pregnant Times' (no pregnancies) and for the class variable 'y' (the negative class), so we restrict the replacement to the five measurement columns where a zero is physically impossible. As a quality control check, we then count the number of NaN values in each column; the counts match the zero counts of those five columns:

In [5]:
zero_as_missing = ["Pgc", "Diastolic Bld Pressure", "Tst", "Serum Insulin", "BMI"]
dataset[zero_as_missing] = dataset[zero_as_missing].replace(0, np.nan) # np.nan, because the np.NaN alias was removed in NumPy 2.0
print(dataset.isnull().sum())
dataset.head(5)

Pregnant Times                  0
Pgc                             5
Diastolic Bld Pressure         35
Tst                           227
Serum Insulin                 374
BMI                            11
Diabetes Pedigree Function      0
Age                             0
y                               0
dtype: int64


Unnamed: 0,Pregnant Times,Pgc,Diastolic Bld Pressure,Tst,Serum Insulin,BMI,Diabetes Pedigree Function,Age,y
0,6,148.0,72.0,35.0,,33.6,0.627,50,1
1,1,85.0,66.0,29.0,,26.6,0.351,31,0
2,8,183.0,64.0,,,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1


There are different ways of treating missing values. We can either drop every row that contains a missing value, or impute the missing entries with a value chosen by some criterion. Let's examine the impact of dropping first. Once we drop the rows with null values, the row count falls drastically, from 768 to 392:

In [6]:
df1=dataset.dropna()
print(df1.shape) # rows and columns left after dropping every row with a NaN
print(dataset.shape) # rows and columns in the full dataset

(392, 9)
(768, 9)


The other way is to impute the missing values, and here the methods are very flexible. For example, the scikit-learn library provides the 'SimpleImputer' class (called 'Imputer' in 'sklearn.preprocessing' before scikit-learn 0.22) for replacing missing values. It lets you specify which value counts as missing (it can be something other than NaN) and the strategy used to replace it (such as mean, median, or most frequent value). By default it returns a plain numpy array rather than a pandas 'DataFrame', so we wrap the result back into a 'DataFrame' afterwards. The example below replaces missing values with the mean of each column and then prints the total number of NaN values left in the transformed matrix:

In [7]:
from sklearn.impute import SimpleImputer # replaces sklearn.preprocessing.Imputer

df2=dataset.copy()
values = df2.values # 2-D numpy array holding the data
imputer = SimpleImputer(strategy='mean') # other options include 'median' and 'most_frequent'
transformed_values = imputer.fit_transform(values) # fill missing values with the column means
print(np.isnan(transformed_values).sum()) # total number of NaN values left in the array
dataset2=pd.DataFrame(transformed_values, columns=df2.columns)
dataset2.head(10)

0




Unnamed: 0,Pregnant Times,Pgc,Diastolic Bld Pressure,Tst,Serum Insulin,BMI,Diabetes Pedigree Function,Age,y
0,6.0,148.0,72.0,35.0,155.548223,33.6,0.627,50.0,1.0
1,1.0,85.0,66.0,29.0,155.548223,26.6,0.351,31.0,0.0
2,8.0,183.0,64.0,29.15342,155.548223,23.3,0.672,32.0,1.0
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0,0.0
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0,1.0
5,5.0,116.0,74.0,29.15342,155.548223,25.6,0.201,30.0,0.0
6,3.0,78.0,50.0,32.0,88.0,31.0,0.248,26.0,1.0
7,10.0,115.0,72.405184,29.15342,155.548223,35.3,0.134,29.0,0.0
8,2.0,197.0,70.0,45.0,543.0,30.5,0.158,53.0,1.0
9,8.0,125.0,96.0,29.15342,155.548223,32.457464,0.232,54.0,1.0
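
Mean imputation can also be done without leaving pandas. The sketch below is an alternative to the scikit-learn class used above, not a description of how that class works internally; it relies on 'DataFrame.fillna' accepting a Series of per-column fill values. The toy frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    "bmi": [33.6, np.nan, 23.3, 28.1],
    "insulin": [np.nan, 94.0, 168.0, np.nan],
})

# toy.mean() and toy.median() skip NaN by default and return one value per column;
# fillna then matches those values to the columns by name.
filled_mean = toy.fillna(toy.mean())
filled_median = toy.fillna(toy.median())

print(filled_mean)
print(filled_median)
```

This keeps the column names and index intact, so no conversion back to a 'DataFrame' is needed.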


So far we have only scratched the surface of missing values. A data scientist needs a deeper understanding of how missing values are represented, so we now look at this topic in more depth. In computer science, a number of schemes have been developed to indicate missing values. The way 'pandas' handles missing values is constrained by its reliance on the 'numpy' package, which has no built-in notion of NA values for non-floating-point data types. Two null markers are in common use in Python: None and np.nan (spelled np.NaN in older code; that alias was removed in NumPy 2.0). We will now discuss them in greater detail.

Simply put, None is the more general-purpose marker. Because None is a Python object, it cannot be stored in an arbitrary 'numpy' array, only in arrays with data type 'object'. In comparison, numpy's NaN is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation. Compare the results below to see how differently numpy casts the two arrays:

In [8]:
a1=np.array([1,None,3])
a2=np.array([1,np.nan,3]) # np.nan, because the np.NaN alias was removed in NumPy 2.0
print(a1.dtype)
print(a2.dtype)

object
float64
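
One practical consequence of the two representations is that they have to be detected differently. The sketch below is an illustration under the assumption that numpy and pandas are installed; the variable names are made up. It shows that NaN never compares equal to itself, that 'np.isnan' works on float arrays but not on object arrays, and that 'pd.isnull' handles both markers.

```python
import numpy as np
import pandas as pd

# IEEE 754: NaN is not equal to anything, including itself.
nan_equals_itself = np.nan == np.nan

# np.isnan works element-wise on floating-point arrays.
float_mask = np.isnan(np.array([1.0, np.nan, 3.0]))

# On an object array holding None, np.isnan raises a TypeError.
obj = np.array([1, None, 3], dtype=object)
try:
    np.isnan(obj)
    isnan_on_object_ok = True
except TypeError:
    isnan_on_object_ok = False

# pd.isnull recognizes both None and NaN.
object_mask = pd.isnull(obj)

print(nan_equals_itself, float_mask, isnan_on_object_ok, object_mask)
```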


The use of Python objects in an array also means that aggregations such as sum() or min() across an array containing None raise an error. This reflects the fact that addition between an integer and None is undefined. NaN is different: it is a bit like a data virus that infects any other object it touches. Regardless of the operation, the result of arithmetic with NaN is another NaN.

In [9]:
try:
    a1.sum()
except TypeError: # int + None is undefined
    print('Type Error')
print(a2.sum()) # NaN propagates through the sum

Type Error
nan


Keep in mind that NaN is specific to floating-point values; there is no equivalent NaN value for integers, strings, or other data types. The 'numpy' package does provide some special aggregations that ignore missing values:

In [10]:
np.nansum(a2), np.nanmin(a2)

(4.0, 1.0)
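
As a small sketch of both points, the example below uses two more NaN-aware aggregations, 'np.nanmean' and 'np.nanmax', and then shows what happens when NaN is assigned into an integer array. The variable names are hypothetical.

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])
nan_mean = np.nanmean(a)  # mean of the non-NaN entries
nan_max = np.nanmax(a)    # maximum of the non-NaN entries
print(nan_mean, nan_max)

# An integer array has no slot for NaN, so the assignment is rejected.
ints = np.array([1, 2, 3])
try:
    ints[0] = np.nan
    assignment_failed = False
except ValueError as err:
    assignment_failed = True
    print("ValueError:", err)
```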

When it comes to the 'pandas' package, None and NaN both have their places, and the package converts between them where appropriate. Consider the example below. Notice that if we set a value in an integer Series to None, pandas stores it as NaN and automatically upcasts the Series to a floating-point type to accommodate the NA value:

In [11]:
x= pd.Series(range(3), dtype=int)
print(x)
x[0]=None
x

0    0
1    1
2    2
dtype: int32


0    NaN
1    1.0
2    2.0
dtype: float64
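
If the upcast to float is unwanted, pandas also offers a nullable integer extension dtype. The sketch below assumes pandas 1.0 or later, where the dtype is requested with the string "Int64" (capital I) and missing entries are shown as <NA>.

```python
import pandas as pd

# "Int64" is the nullable integer extension dtype, distinct from numpy's int64.
x = pd.Series(range(3), dtype="Int64")
x[0] = None      # stored as pd.NA; the Series keeps its integer dtype

print(x)
print(x.dtype)
print(x.sum())   # missing entries are skipped by default
```

The trade-off is that the result is no longer backed by a plain numpy integer array, which may matter for code that expects one.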

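The next example mixes both null markers in a single Series of data type 'object'. Note that isnull() flags NaN and None alike, and dropna() removes both: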

In [12]:
data = pd.Series([1,np.nan, 'string', None, 98.7])
print(data)
print(type(data[1]), type(data[3]))
data.isnull()

0         1
1       NaN
2    string
3      None
4      98.7
dtype: object
<class 'float'> <class 'NoneType'>


0    False
1     True
2    False
3     True
4    False
dtype: bool

In [13]:
data.dropna()

0         1
2    string
4      98.7
dtype: object
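
On a 'DataFrame', dropna() works row-wise by default and takes a few parameters that control how aggressive it is. The following minimal sketch uses a made-up frame to contrast the default with the 'subset' and 'thresh' parameters; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 8.0],
    "c": [9.0, 10.0, 11.0, 12.0],
})

complete_rows = df.dropna()             # keep rows with no NaN in any column
known_a = df.dropna(subset=["a"])       # only look at column "a" when deciding
at_least_two = df.dropna(thresh=2)      # keep rows with at least 2 non-null values

print(complete_rows.index.tolist())
print(known_a.index.tolist())
print(at_least_two.index.tolist())
```

Restricting the check with 'subset' or relaxing it with 'thresh' can preserve many more rows than the default when only a few columns matter.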

References:
   - https://machinelearningmastery.com/handle-missing-data-python/
   - https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv
   - https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html