# Requirements

In [1]:
import numpy as np
import pandas as pd

Pandas 1.0 adds new data types that improve the representation of missing values in pandas `Series` and `DataFrame` objects. The data file used here has four columns holding integer, floating-point, categorical, and string data, respectively. Each column has one missing value.

# Representing missing values

When a file is read without type specifications for the columns, pandas by default converts integer columns that contain missing values to floating point, because the missing value is stored as `NaN`, which is itself a float. That is usually not what we want.

In [2]:
data1 = pd.read_csv('data/missing_values.csv')

In [3]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   int_data       9 non-null      float64
 1   float_data     9 non-null      float64
 2   category_data  9 non-null      object 
 3   string_data    9 non-null      object 
dtypes: float64(2), object(2)
memory usage: 452.0+ bytes


Both numerical columns are `float64`, and the other two columns have dtype `object`.
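The upcast can be reproduced without the data file. The following is a minimal sketch using two small, made-up `Series` (the values are illustrative, not taken from `data/missing_values.csv`): the same integers keep an integer dtype when complete and become `float64` as soon as one `NaN` is present.

```python
import numpy as np
import pandas as pd

# Illustrative values only; not read from the data file.
complete = pd.Series([3, 5, 7])
with_gap = pd.Series([3, np.nan, 7])

# Without a missing value the data stays integer.
complete_is_int = pd.api.types.is_integer_dtype(complete.dtype)

# NaN is a float, so a default NumPy-backed column is upcast to float64.
gap_dtype = str(with_gap.dtype)

print(complete.dtype, gap_dtype)
```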

When the column types are specified explicitly, the integer data is not converted to floating point. No dtype is given for `float_data`, so it keeps the default `float64`.

In [4]:
data2 = pd.read_csv('data/missing_values.csv',
                    dtype={'int_data': pd.Int32Dtype(),
                           'category_data': pd.CategoricalDtype(),
                           'string_data': pd.StringDtype(),
                          })

In [5]:
data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   int_data       9 non-null      Int32   
 1   float_data     9 non-null      float64 
 2   category_data  9 non-null      category
 3   string_data    9 non-null      string  
dtypes: Int32(1), category(1), float64(1), string(1)
memory usage: 368.0 bytes
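If the column types are not known up front, `DataFrame.convert_dtypes()` (also introduced in pandas 1.0) can be applied after reading. The sketch below uses a small in-memory CSV as a hypothetical stand-in for the data file. Note that the inferred integer type is `Int64`, not the `Int32` requested explicitly above, and that categorical columns are not inferred this way.

```python
import io
import pandas as pd

# Hypothetical sample standing in for data/missing_values.csv.
csv_text = "int_data,string_data\n3,str1\n,str2\n7,\n"

raw = pd.read_csv(io.StringIO(csv_text))

# Let pandas pick nullable extension dtypes where the values allow it.
converted = raw.convert_dtypes()

raw_int_dtype = str(raw["int_data"].dtype)
converted_int_dtype = str(converted["int_data"].dtype)

print(raw.dtypes)
print(converted.dtypes)
```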


In [14]:
data1

Unnamed: 0,int_data,float_data,category_data,string_data
0,3.0,3.7,A,str1
1,5.0,5.3,A,str1_str1
2,7.0,7.5,B,str2_str1
3,17.0,3.5,A,str1_str2
4,13.0,5.7,A,str2
5,23.0,7.1,B,str2_str2
6,,5.5,A,str3
7,29.0,,B,str3_str1
8,31.0,3.3,,str2_str3
9,37.0,7.7,B,


In [6]:
data2

Unnamed: 0,int_data,float_data,category_data,string_data
0,3,3.7,A,str1
1,5,5.3,A,str1_str1
2,7,7.5,B,str2_str1
3,17,3.5,A,str1_str2
4,13,5.7,A,str2
5,23,7.1,B,str2_str2
6,,5.5,A,str3
7,29,,B,str3_str1
8,31,3.3,,str2_str3
9,37,7.7,B,


Each column contains exactly one missing value, in rows 6 through 9 respectively.
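Which marker represents the missing value depends on the dtype. As a small sketch on made-up values (not the data file): the nullable integer and string types introduced in pandas 1.0 use `pd.NA`, while a plain `float64` column still uses `NaN`.

```python
import numpy as np
import pandas as pd

# Illustrative two-element series, one per dtype.
ints = pd.Series([3, None], dtype="Int32")
floats = pd.Series([3.7, None], dtype="float64")
strings = pd.Series(["str1", None], dtype="string")

int_missing = ints[1]        # pd.NA for the nullable integer dtype
float_missing = floats[1]    # NaN for a NumPy-backed float column
string_missing = strings[1]  # missing marker of the string dtype

print(repr(int_missing), repr(float_missing), repr(string_missing))
```

Regardless of the marker, `isna()` and `isnull()` report all of them as missing.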

# Computing with missing values

For numerical reductions, such as a column sum, missing values are skipped by default.

In [7]:
data2.int_data.sum()

165

In [8]:
data2.float_data.sum()

49.3

In [9]:
data2.describe()

Unnamed: 0,int_data,float_data
count,9.0,9.0
mean,18.333333,5.477778
std,12.328828,1.715938
min,3.0,3.3
25%,7.0,3.7
50%,17.0,5.5
75%,29.0,7.1
max,37.0,7.7
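Skipping is only the default. The reductions accept a `skipna` parameter, and as a sketch on made-up values, passing `skipna=False` makes a missing value propagate into the result instead of being ignored.

```python
import numpy as np
import pandas as pd

# Illustrative values only; not read from the data file.
ints = pd.Series([3, 5, None], dtype="Int32")
floats = pd.Series([3.7, 5.3, np.nan])

int_total = ints.sum()                     # missing value skipped (default)
int_strict = ints.sum(skipna=False)        # missing value propagates
float_strict = floats.sum(skipna=False)    # missing value propagates as NaN

print(int_total, int_strict, float_strict)
```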


Missing values are also ignored for categorical and string data.

In [10]:
data2[['category_data', 'string_data']].describe()

Unnamed: 0,category_data,string_data
count,9,9
unique,2,9
top,A,str1
freq,5,1


In [12]:
data2[['category_data', 'string_data']] \
    .groupby('category_data', observed=False) \
    .count()

category_data,string_data
A,5
B,3
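The row with a missing category does not appear in the grouped result at all. To make such rows visible when counting, one option is `Series.value_counts(dropna=False)`. The sketch below uses a made-up categorical series rather than the data file.

```python
import pandas as pd

# Illustrative categorical data with one missing entry.
categories = pd.Series(["A", "A", None, "B"], dtype="category")

# Default: the missing entry is left out of the counts.
without_missing = categories.value_counts()

# dropna=False adds a separate count for the missing entry.
with_missing = categories.value_counts(dropna=False)

n_missing = int(categories.isna().sum())

print(without_missing)
print(with_missing)
```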


# Selecting rows with missing data

In [15]:
data1[data1.isnull().any(axis=1)]

Unnamed: 0,int_data,float_data,category_data,string_data
6,,5.5,A,str3
7,29.0,,B,str3_str1
8,31.0,3.3,,str2_str3
9,37.0,7.7,B,


In [13]:
data2[data2.isnull().any(axis=1)]

Unnamed: 0,int_data,float_data,category_data,string_data
6,,5.5,A,str3
7,29,,B,str3_str1
8,31,3.3,,str2_str3
9,37,7.7,B,
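The same boolean mask supports a few related queries. As a minimal sketch on a small, made-up frame (the column names mirror the data file, the values do not): counting missing values per column, selecting the incomplete rows, and keeping only the complete rows with `dropna()`.

```python
import numpy as np
import pandas as pd

# Illustrative frame; values are not taken from the data file.
frame = pd.DataFrame({
    "int_data": pd.array([3, None, 7], dtype="Int32"),
    "float_data": [3.7, 5.3, np.nan],
})

# Number of missing values in each column.
missing_per_column = frame.isna().sum()

# Rows with at least one missing value, as in the cells above.
incomplete = frame[frame.isna().any(axis=1)]

# Rows without any missing value.
complete = frame.dropna()

print(missing_per_column)
print(incomplete)
print(complete)
```

`isna()` is an alias of `isnull()`, so either spelling can be used in the mask.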
