Missing data occurs when no value is stored for some observations of a variable. Missing values are common in real-world datasets. In this recipe, we will quantify and visualize missing data in a selection of variables from the [KDD-CUP-98](https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1998+Data) dataset, available in the UCI Machine Learning Repository.

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

====================================================================================================

To download the data, visit this [website](https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup98-mld/epsilon_mirror/).

Click 'cup98lrn.zip' to begin the download. Unzip the file and save 'cup98LRN.txt' to the parent directory of this repo (../cup98LRN.txt).
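
Before loading the data, it can help to confirm that the file is where the notebook expects it. The following is a minimal sketch, assuming the file was saved as `../cup98LRN.txt` relative to the notebook's working directory; the variable name `data_path` is introduced here for illustration only.

```python
from pathlib import Path

# assumed location: the parent of the current working directory
data_path = Path('..') / 'cup98LRN.txt'

# report whether the file is present, without raising an error
if data_path.exists():
    print(f'Found {data_path}')
else:
    print(f'{data_path} not found; check where the file was unzipped')
```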

====================================================================================================

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

# display all the columns of the dataframe, without truncation
pd.set_option('display.max_columns', None)

In [2]:
# we will load only the following variables for the recipe
cols = [
    'AGE',
    'NUMCHLD',
    'INCOME',
    'WEALTH1',
    'MBCRAFT',
    'MBGARDEN',
    'MBBOOKS',
    'MBCOLECT',
    'MAGFAML',
    'MAGFEM',
    'MAGMALE',
]

# load the dataset from the parent directory, where the file was saved
data = pd.read_csv('../cup98LRN.txt', usecols=cols)

# let's inspect the first 5 rows
data.head()

In [None]:
# we can quantify the total number of missing values using
# the isnull() method plus the sum() method on the dataframe

data.isnull().sum()
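
To see what this call returns without downloading the dataset, here is a small sketch on a made-up dataframe. The dataframe `toy` and its column names are hypothetical and are not taken from KDD-CUP-98.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, used only to illustrate the method
toy = pd.DataFrame({
    'age': [25, np.nan, 40, np.nan],
    'income': [3.0, 5.0, np.nan, 1.0],
    'city': ['A', 'B', 'C', 'D'],
})

# isnull() returns a boolean dataframe (True where a value is missing);
# sum() then counts the True values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Here `age` has two missing values, `income` has one, and `city` has none.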

In [None]:
# alternatively, we can use the mean() method after isnull()
# to obtain the fraction of missing values in each variable

data.isnull().mean()
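
Note that `mean()` on a boolean dataframe returns a fraction between 0 and 1, so it needs to be multiplied by 100 to read as a percentage. The sketch below combines counts and percentages in one table, sorted from the most to the least incomplete variable. It uses a made-up dataframe; `toy` and `summary` are illustrative names, not part of the original recipe.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, used only to illustrate the method
toy = pd.DataFrame({
    'age': [25, np.nan, 40, np.nan],
    'income': [3.0, 5.0, np.nan, 1.0],
    'city': ['A', 'B', 'C', 'D'],
})

# count and percentage of missing values per variable, in one table
summary = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'pct_missing': toy.isnull().mean() * 100,
}).sort_values('pct_missing', ascending=False)

print(summary)
```

Sorting the table this way makes it easy to spot the variables that are most affected when the dataset has many columns.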

In [None]:
# we can also plot the percentage of missing data per variable with
# pandas plot.bar(), and add labels with matplotlib methods;
# mean() returns a fraction, so we multiply it by 100 first

data.isnull().mean().mul(100).plot.bar(figsize=(12, 6))
plt.ylabel('Percentage of missing values')
plt.xlabel('Variables')
plt.title('Quantifying missing data')