<center><h1>Chapter 7 Missing Data</h1></center>

In [1]:
import numpy as np
import pandas as pd

## 1. Statistics and deletion of missing values
### 1. Statistics of missing information

For missing data, you can use `isna` or `isnull` (the two are aliases) to check whether each cell is missing. Combined with `mean`, this gives the proportion of missing values in each column:

In [2]:
df = pd.read_csv('../data/learn_pandas.csv', usecols = ['Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer'])
df.isna().head()

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,True,False,False
4,False,False,False,False,False,False


In [3]:
df.isna().mean() # proportion of missing values in each column

Grade       0.000
Name        0.000
Gender      0.000
Height      0.085
Weight      0.055
Transfer    0.060
dtype: float64

If you want to see rows where a column is missing or not missing, you can use `isna` or `notna` on `Series` for Boolean indexing. For example, to see rows where height is missing:

In [4]:
df[df.Height.isna()].head()

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
3,Sophomore,Xiaojuan Sun,Female,,41.0,N
12,Senior,Peng You,Female,,48.0,
26,Junior,Yanli You,Female,,48.0,N
36,Freshman,Xiaojuan Qin,Male,,79.0,Y
60,Freshman,Yanpeng Lv,Male,,65.0,N


If you want to select rows in which all, at least one, or none of several columns are missing, combine `isna` or `notna` with `any` or `all`. For example, apply these three cases to the height, weight, and transfer-status columns:

In [5]:
sub_set = df[['Height', 'Weight', 'Transfer']]
df[sub_set.isna().all(axis=1)] # all three are missing

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
102,Junior,Chengli Zhao,Male,,,


In [6]:
df[sub_set.isna().any(axis=1)].head() # at least one is missing

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
3,Sophomore,Xiaojuan Sun,Female,,41.0,N
9,Junior,Juan Xu,Female,164.8,,N
12,Senior,Peng You,Female,,48.0,
21,Senior,Xiaopeng Shen,Male,166.0,62.0,
26,Junior,Yanli You,Female,,48.0,N


In [7]:
df[sub_set.notna().all(axis=1)].head() # none are missing

Unnamed: 0,Grade,Name,Gender,Height,Weight,Transfer
0,Freshman,Gaopeng Yang,Female,158.9,46.0,N
1,Freshman,Changqiang You,Male,166.5,70.0,N
2,Senior,Mei Sun,Male,188.9,89.0,N
4,Sophomore,Gaojuan You,Male,174.0,74.0,N
5,Freshman,Xiaoli Qian,Female,158.0,51.0,N


### 2. Deletion of missing information

In data processing, it is often necessary to delete row samples or column features based on the size, proportion or other characteristics of missing values. `pandas` provides the `dropna` function to perform this operation.

The main parameters of `dropna` are the axis `axis` (default 0, meaning rows are dropped), the deletion rule `how`, the threshold `thresh` (the minimum number of non-missing values a row or column must have in order to be kept; those with fewer are dropped), and `subset`, which restricts the check to the listed labels. `how` takes one of two values: `any` or `all`.

For example, drop rows in which at least one of height and weight is missing:

In [8]:
res = df.dropna(how = 'any', subset = ['Height', 'Weight'])
res.shape

(174, 6)

For example, to drop columns with more than 15 missing values:

In [9]:
res = df.dropna(axis=1, thresh=df.shape[0]-15) # Height is dropped
res.head()

Unnamed: 0,Grade,Name,Gender,Weight,Transfer
0,Freshman,Gaopeng Yang,Female,46.0,N
1,Freshman,Changqiang You,Male,70.0,N
2,Senior,Mei Sun,Male,89.0,N
3,Sophomore,Xiaojuan Sun,Female,41.0,N
4,Sophomore,Gaojuan You,Male,74.0,N


Of course, it is also possible to do without `dropna`. For example, the two operations mentioned above can also be done using Boolean indexing:

In [10]:
res = df.loc[df[['Height', 'Weight']].notna().all(axis=1)]
res.shape

(174, 6)

In [11]:
res = df.loc[:, ~(df.isna().sum()>15)]
res.head()

Unnamed: 0,Grade,Name,Gender,Weight,Transfer
0,Freshman,Gaopeng Yang,Female,46.0,N
1,Freshman,Changqiang You,Male,70.0,N
2,Senior,Mei Sun,Male,89.0,N
3,Sophomore,Xiaojuan Sun,Female,41.0,N
4,Sophomore,Gaojuan You,Male,74.0,N
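
As a self-contained check of the equivalences above, the sketch below uses a small hypothetical table (the `toy` frame and its values are invented for illustration and are not taken from `learn_pandas.csv`) and compares each `dropna` call with its Boolean-indexing counterpart.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 4 rows, column 'h' has 3 missing values, 'w' has 1
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'h': [np.nan, np.nan, 170.0, np.nan],
    'w': [50.0, np.nan, 60.0, 70.0],
})

# Rows: drop any row where 'h' or 'w' is missing
rows_dropna = toy.dropna(how='any', subset=['h', 'w'])
rows_mask = toy.loc[toy[['h', 'w']].notna().all(axis=1)]

# Columns: drop columns with more than 2 missing values,
# i.e. keep columns with at least len(toy) - 2 non-missing values
cols_dropna = toy.dropna(axis=1, thresh=toy.shape[0] - 2)
cols_mask = toy.loc[:, ~(toy.isna().sum() > 2)]

print(rows_dropna.equals(rows_mask))  # both keep only row 2
print(cols_dropna.equals(cols_mask))  # both drop column 'h'
```

Passing `axis` by keyword, as done here, keeps the calls valid on pandas versions where positional arguments to `dropna`, `any` and `all` are no longer accepted.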


## 2. Filling and interpolation of missing values
### 1. Filling with fillna

There are three commonly used parameters in `fillna`: `value`, `method` and `limit`. `value` is the fill value, which can be a scalar or a dictionary mapping index labels to fill values. `method` is the fill method: `ffill` fills with the previous valid element and `bfill` fills with the next one. `limit` is the maximum number of consecutive missing values to fill.

Let's construct a simple `Series` to illustrate the usage:

In [12]:
s = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan], list('aaabcd'))
s

a    NaN
a    1.0
a    NaN
b    NaN
c    2.0
d    NaN
dtype: float64

In [13]:
s.fillna(method='ffill') # fill forward with the previous value

a    NaN
a    1.0
a    1.0
b    1.0
c    2.0
d    2.0
dtype: float64

In [14]:
s.fillna(method='ffill', limit=1) # fill at most one value in each run of consecutive missing values

a    NaN
a    1.0
a    1.0
b    NaN
c    2.0
d    2.0
dtype: float64
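
In recent pandas releases the `method` argument of `fillna` is deprecated in favour of the dedicated `ffill` and `bfill` methods. As a minimal sketch, assuming a pandas version in which `Series.ffill` and `Series.bfill` accept `limit`, the fills above can be written as:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan], list('aaabcd'))

forward = s.ffill()            # same result as fillna(method='ffill')
forward_1 = s.ffill(limit=1)   # fill at most one value per run of NaN
backward = s.bfill()           # fill with the next valid value instead

print(forward.tolist())
print(forward_1.tolist())
print(backward.tolist())
```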

In [15]:
s.fillna(s.mean()) # value is a scalar

a    1.5
a    1.0
a    1.5
b    1.5
c    2.0
d    1.5
dtype: float64

In [16]:
s.fillna({'a': 100, 'd': 200}) # fill values mapped by index label

a    100.0
a      1.0
a    100.0
b      NaN
c      2.0
d    200.0
dtype: float64

Sometimes a more reasonable fill requires grouping first and then filling within each group. For example, fill missing heights with the mean height of the corresponding grade:

In [17]:
df.groupby('Grade')['Height'].transform(lambda x: x.fillna(x.mean())).head()

0    158.900000
1    166.500000
2    188.900000
3    163.075862
4    174.000000
Name: Height, dtype: float64
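
To see the group-wise fill in isolation, here is a small sketch on hypothetical data (the `toy` frame is invented for illustration). Each missing height is replaced by the mean of its own grade, not by the overall mean.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Grade': ['Freshman', 'Freshman', 'Freshman', 'Senior', 'Senior', 'Senior'],
    'Height': [160.0, np.nan, 170.0, 180.0, 190.0, np.nan],
})

# transform keeps the original row order and index
filled = toy.groupby('Grade')['Height'].transform(lambda x: x.fillna(x.mean()))

print(filled.tolist())  # [160.0, 165.0, 170.0, 180.0, 190.0, 185.0]
```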

#### 【Practice】
Fill missing values in a sequence according to the following rule: if a missing value appears alone, fill it with the mean of its two neighbours; if missing values appear consecutively, leave them unfilled. That is, the sequence `[1, NaN, 3, NaN, NaN]` becomes `[1, 2, 3, NaN, NaN]`. Please implement this with the `fillna` function. (Hint: use the `limit` parameter.)
#### 【END】
### 2. Interpolation function

The [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html#pandas.Series.interpolate) of the `interpolate` function lists many interpolation methods, including a large number taken from `scipy`. Since many of them involve relatively complex mathematics, only three common and simple cases are discussed here: linear interpolation, nearest-neighbor interpolation and index interpolation.

For `interpolate`, in addition to the interpolation method (the default is `linear`), there are two common parameters related to those of `fillna`: `limit_direction`, which controls the direction, and `limit`, which controls the maximum number of consecutive missing values to interpolate. The default direction is `forward`, which is similar to `ffill` in the `method` parameter of `fillna`. For backward or bidirectional limited interpolation, set it to `backward` or `both`.

In [18]:
s = pd.Series([np.nan, np.nan, 1, np.nan, np.nan, np.nan, 2, np.nan, np.nan])
s.values

array([nan, nan,  1., nan, nan, nan,  2., nan, nan])

For example, under the default linear method, perform `backward` and bidirectional limited interpolation, filling at most 1 consecutive missing value in each direction:

In [19]:
res = s.interpolate(limit_direction='backward', limit=1)
res.values

array([ nan, 1.  , 1.  ,  nan,  nan, 1.75, 2.  ,  nan,  nan])

In [20]:
res = s.interpolate(limit_direction='both', limit=1)
res.values

array([ nan, 1.  , 1.  , 1.25,  nan, 1.75, 2.  , 2.  ,  nan])

The second common interpolation is nearest neighbor interpolation, where the missing value is the same as the nearest non-missing value:

In [21]:
s.interpolate('nearest').values

array([nan, nan,  1.,  1.,  1.,  2.,  2., nan, nan])

Finally, let's introduce index interpolation, which is linear interpolation based on the index size. For example, construct unequally spaced indexes for demonstration:

In [22]:
s = pd.Series([0,np.nan,10],index=[0,1,10])
s

0      0.0
1      NaN
10    10.0
dtype: float64

In [23]:
s.interpolate() # default linear interpolation, equivalent to taking the midpoint

0      0.0
1      5.0
10    10.0
dtype: float64

In [24]:
s.interpolate(method='index') # linear interpolation based on the index values

0      0.0
1      1.0
10    10.0
dtype: float64
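
The value `1.0` follows from the usual two-point formula $y = y_0 + (y_1 - y_0)\,\frac{x - x_0}{x_1 - x_0}$ with the index playing the role of $x$. The sketch below checks this by hand and against `np.interp`, which is used here only as an independent reference and not as a claim about how `pandas` implements the method.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, np.nan, 10], index=[0, 1, 10])

# Two-point linear formula using the index as the x-coordinate
x0, x1, x = 0, 10, 1
y0, y1 = 0.0, 10.0
by_hand = y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Independent reference: interpolate at the missing index from the known points
known = s.dropna()
by_numpy = np.interp(x, known.index, known.values)

# Label-based lookup of the interpolated value at index 1
by_pandas = s.interpolate(method='index').loc[1]

print(by_hand, by_numpy, by_pandas)
```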

At the same time, this method can also be used for timestamp indexes. Other topics about time series will be discussed in Chapter 10. Here is a simple example:

In [25]:
s = pd.Series([0,np.nan,10], index=pd.to_datetime(['20200101', '20200102', '20200111']))
s

2020-01-01     0.0
2020-01-02     NaN
2020-01-11    10.0
dtype: float64

In [26]:
s.interpolate()

2020-01-01     0.0
2020-01-02     5.0
2020-01-11    10.0
dtype: float64

In [27]:
s.interpolate(method='index')

2020-01-01     0.0
2020-01-02     1.0
2020-01-11    10.0
dtype: float64

#### 【NOTE】Notes on polynomial and spline interpolation
If the `polynomial` method is selected in `interpolate`, `pandas` internally calls `scipy.interpolate.interp1d(*,*,kind=order)`, which in turn calls `make_interp_spline`. It is therefore spline interpolation, not polynomial-fitting interpolation in the style of `polyfit` in `numpy`. When the `spline` method is selected, `pandas` calls `scipy.interpolate.UnivariateSpline` rather than ordinary spline interpolation. The documentation for this part is confusing, and the design of this parameter is questionable. When using either method, take care to choose the one that matches your actual needs.
#### 【END】
## 3. Nullable type
### 1. Missing symbol and its defects

In `python`, missing values are represented by `None`, which is equal only to itself:

In [28]:
None == None

True

In [29]:
None == False

False

In [30]:
None == []

False

In [31]:
None == ''

False

In `numpy`, `np.nan` is used to represent missing values. Besides being unequal to every other element, it also returns `False` when compared with itself:

In [32]:
np.nan == np.nan

False

In [33]:
np.nan == None

False

In [34]:
np.nan == False

False

Note that although positions holding `np.nan` return `False` in element-wise comparisons of sequences or tables, the `equals` function, which tests whether two sequences or two tables are identical, treats positions where both are missing as equal and can therefore return `True`:

In [35]:
s1 = pd.Series([1, np.nan])
s2 = pd.Series([1, 2])
s3 = pd.Series([1, np.nan])
s1 == 1

0     True
1    False
dtype: bool

In [36]:
s1.equals(s2)

False

In [37]:
s1.equals(s3)

True
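
One practical consequence is that `(s1 == s3).all()` is not a reliable equality test when missing values are present. A minimal sketch of the difference:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, np.nan])
s3 = pd.Series([1, np.nan])

# Element-wise comparison: NaN == NaN is False, so the check fails
elementwise_equal = bool((s1 == s3).all())

# equals treats NaN in the same position as equal
equals_result = s1.equals(s3)

print(elementwise_equal)  # False
print(equals_result)      # True
```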

In time series objects, `pandas` uses `pd.NaT` to denote missing values; it plays the same role as `np.nan` (time series objects and their construction will be discussed in Chapter 10):

In [38]:
pd.to_timedelta(['30s', np.nan]) # NaT in a Timedelta

TimedeltaIndex(['0 days 00:00:30', NaT], dtype='timedelta64[ns]', freq=None)

In [39]:
pd.to_datetime(['20200101', np.nan]) # NaT in a Datetime

DatetimeIndex(['2020-01-01', 'NaT'], dtype='datetime64[ns]', freq=None)

So why introduce `pd.NaT` to represent missing values in time objects? What goes wrong if they are still stored as `np.nan`? `pandas` has the `object` type, which is a mixed type: if elements of several types are stored in one `Series`, its type becomes `object`. For example, a sequence that stores both an integer and a string:

In [40]:
pd.Series([1, 'two'])

0      1
1    two
dtype: object

The need for `NaT` comes from the fact that `np.nan` is itself a floating-point value. If floats and time types were stored together without a dedicated built-in missing value, the sequence would become an ambiguous `object` type, which is clearly undesirable.

In [41]:
type(np.nan)

float
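
A minimal sketch of what `NaT` provides: when a timestamp is combined with `np.nan`, `pandas` stores the missing entry as `NaT` and the result keeps a datetime type rather than falling back to `object`. The exact resolution of the dtype may vary between pandas versions, so the check below uses `pd.api.types.is_datetime64_any_dtype` instead of comparing against a fixed dtype string.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.to_datetime(['20200101', np.nan]))

print(pd.api.types.is_datetime64_any_dtype(s.dtype))  # True
print(s.isna().tolist())                              # [False, True]
print(s.iloc[1])                                      # NaT
```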

At the same time, due to the floating-point nature of `np.nan`, if there is a missing value in a `Series` of integers, its type will be converted to `float64`; and if there is a missing value in a Boolean series, its type will be converted to `object` instead of `bool`:

In [42]:
pd.Series([1, np.nan]).dtype

dtype('float64')

In [43]:
pd.Series([True, False, np.nan]).dtype

dtype('O')

Therefore, starting from version `1.0.0`, `pandas` introduced a new missing value `pd.NA` and three `Nullable` sequence types to address these defects: `Int`, `boolean` and `string`.

### 2. Properties of Nullable Types

As the name suggests, `Nullable` means that the type can hold missing values, so the sequence type is not affected by them. For example, missing values stored in the three `Nullable` types above are converted to the `pandas` built-in `pd.NA`:

In [44]:
pd.Series([np.nan, 1], dtype = 'Int64') # note the capital "I"

0    <NA>
1       1
dtype: Int64

In [45]:
pd.Series([np.nan, True], dtype = 'boolean')

0    <NA>
1    True
dtype: boolean

In [46]:
pd.Series([np.nan, 'my_str'], dtype = 'string')

0      <NA>
1    my_str
dtype: string

In sequences of `Int`, the returned results will be of type `Nullable` whenever possible:

In [47]:
pd.Series([np.nan, 0], dtype = 'Int64') + 1

0    <NA>
1       1
dtype: Int64

In [48]:
pd.Series([np.nan, 0], dtype = 'Int64') == 0

0    <NA>
1    True
dtype: boolean

In [49]:
pd.Series([np.nan, 0], dtype = 'Int64') * 0.5 # the result can only be floating point

0    NaN
1    0.0
dtype: float64

For a sequence of type `boolean`, there are two main differences in behavior from a `bool` sequence:

First, a `bool`-valued sequence containing missing values cannot be used as an indexer, whereas a `boolean` sequence can, with missing values treated as `False`:

In [50]:
s = pd.Series(['a', 'b'])
s_bool = pd.Series([True, np.nan])
s_boolean = pd.Series([True, np.nan]).astype('boolean')
# s[s_bool] # Error
s[s_boolean]

0    a
dtype: object

Second, in logical operations the `bool` type always returns `False` at a missing position, whereas `boolean` returns a value according to whether the operation has a unique result. What does a unique result mean? For example, `True | pd.NA` returns `True` whatever the missing value is; the result of `False | pd.NA` depends on the missing value, so `pd.NA` is returned; and `False & pd.NA` returns `False` whatever the missing value is.

In [51]:
s_boolean & True

0    True
1    <NA>
dtype: boolean

In [52]:
s_boolean | True

0    True
1    True
dtype: boolean

In [53]:
~s_boolean # negation likewise cannot determine a unique result at the missing position

0    False
1     <NA>
dtype: boolean
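
The rule can also be checked directly on scalars. The sketch below evaluates `|` and `&` between each Boolean value and `pd.NA`; the result is `pd.NA` only when the outcome would depend on the unknown value.

```python
import pandas as pd

results = {
    'True | NA': True | pd.NA,     # True: already decided by the left operand
    'False | NA': False | pd.NA,   # <NA>: depends on the missing value
    'True & NA': True & pd.NA,     # <NA>: depends on the missing value
    'False & NA': False & pd.NA,   # False: already decided by the left operand
}

for expr, value in results.items():
    print(expr, '->', value)
```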

The specific properties of the `string` type will be discussed in the next chapter on text data.

Generally, when processing actual data, you can convert it to the `Nullable` type through `convert_dtypes` after reading the data set:

In [54]:
df = pd.read_csv('../data/learn_pandas.csv')
df = df.convert_dtypes()
df.dtypes

School          string
Grade           string
Name            string
Gender          string
Height         float64
Weight           Int64
Transfer        string
Test_Number      Int64
Test_Date       string
Time_Record     string
dtype: object
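
To see the inference rules without the data file, here is a small sketch on a hypothetical frame (the column names and values are invented). Integer-valued floats with missing entries become `Int64`, a fractional column stays a floating-point type (which one depends on the pandas version), and an `object` column of Booleans becomes `boolean`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'n_items': [1.0, np.nan, 3.0],   # integer-valued floats
    'ratio': [0.5, np.nan, 1.5],     # fractional floats
    'flag': [True, None, False],     # object column holding Booleans
})

converted = toy.convert_dtypes()

print(toy.dtypes)
print(converted.dtypes)
```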

### 3. Calculation and grouping of missing data

When calling the additive and multiplicative functions `sum` and `prod`, missing values are effectively treated as 0 and 1 respectively, so they do not change the result:

In [55]:
s = pd.Series([2,3,np.nan,4,5])
s.sum()

14.0

In [56]:
s.prod()

120.0
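
This behaviour is only the default. As a short sketch, `skipna=False` propagates the missing value, and `min_count` requires a minimum number of non-missing values before a result is returned:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, np.nan, 4, 5])

print(s.sum())                    # 14.0, NaN is skipped
print(s.sum(skipna=False))        # nan, NaN propagates
print(s.sum(min_count=5))         # nan, only 4 non-missing values are available
print(pd.Series([np.nan]).sum())  # 0.0, the sum of no values
```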

When using cumulative functions, missing values are skipped automatically:

In [57]:
s.cumsum()

0     2.0
1     5.0
2     NaN
3     9.0
4    14.0
dtype: float64

For scalar operations, every result is missing except in the two cases `np.nan ** 0` and `1 ** np.nan`, which have definite values (`pd.NA` behaves the same way). In comparisons, `np.nan` returns `False` (apart from `!=`, which returns `True`), while `pd.NA` returns `pd.NA`:

In [58]:
np.nan == 0

False

In [59]:
pd.NA == 0

<NA>

In [60]:
np.nan > 0

False

In [61]:
pd.NA > 0

<NA>

In [62]:
np.nan + 1

nan

In [63]:
np.log(np.nan)

nan

In [64]:
np.add(np.nan, 1)

nan

In [65]:
np.nan ** 0

1.0

In [66]:
pd.NA ** 0

1

In [67]:
1 ** np.nan

1.0

In [68]:
1 ** pd.NA

1

Note also that although `diff` and `pct_change` serve similar purposes, they handle missing values differently. With the former, every result that involves a missing value is itself missing; with the latter, a missing position is given a change rate of 0%:

In [69]:
s.diff()

0    NaN
1    1.0
2    NaN
3    NaN
4    1.0
dtype: float64

In [70]:
s.pct_change()

0         NaN
1    0.500000
2    0.000000
3    0.333333
4    0.250000
dtype: float64
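
The 0% entry arises because `pct_change` forward-fills missing values by default before computing the ratio. In recent pandas releases this implicit filling is deprecated, so the output above may differ or emit a warning depending on the installed version. A minimal sketch that makes the fill explicit, assuming `fill_method=None` is accepted by the installed version:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, np.nan, 4, 5])

# Explicit forward fill, then a plain percentage change with no implicit filling
explicit = s.ffill().pct_change(fill_method=None)

# Without any filling, every change that touches the NaN is itself missing
no_fill = s.pct_change(fill_method=None)

print(explicit.round(6).tolist())
print(no_fill.round(6).tolist())
```

With the explicit fill the result matches the table above; without it, `pct_change` behaves like `diff` at the missing positions.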

For some functions, missing can be treated as a category. For example, in `groupby, get_dummies`, you can set the corresponding parameters to add missing categories:

In [71]:
df_nan = pd.DataFrame({'category':['a','a','b',np.nan,np.nan], 'value':[1,3,5,7,9]})
df_nan

Unnamed: 0,category,value
0,a,1
1,a,3
2,b,5
3,,7
4,,9


In [72]:
df_nan.groupby('category', dropna=False)['value'].mean() # requires pandas 1.1.0 or later

category
a      2
b      5
NaN    8
Name: value, dtype: int64

In [73]:
pd.get_dummies(df_nan.category, dummy_na=True)

Unnamed: 0,a,b,NaN
0,1,0,0
1,1,0,0
2,0,1,0
3,0,0,1
4,0,0,1
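
`value_counts` follows the same pattern: missing values are excluded unless `dropna=False` is passed. A short sketch on the same kind of data:

```python
import numpy as np
import pandas as pd

category = pd.Series(['a', 'a', 'b', np.nan, np.nan])

default_counts = category.value_counts()              # NaN is excluded
with_missing = category.value_counts(dropna=False)    # NaN counted as a category

print(default_counts.sum())   # 3
print(with_missing.sum())     # 5
```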


## 4. Exercises
### Ex1: Correlation test between missing values and categories
In data processing, columns with too many missing values are often deleted, unless the missingness is strongly correlated with the label. Below is a data set for a binary classification problem, where `X_1, X_2` are feature variables and `y` is a binary classification label.

In [74]:
df = pd.read_csv('../data/missing_chi.csv')
df.head()

Unnamed: 0,X_1,X_2,y
0,,,0
1,,,0
2,,,0
3,43.0,,0
4,,,0


In [75]:
df.isna().mean()

X_1    0.855
X_2    0.894
y      0.000
dtype: float64

In [76]:
df.y.value_counts(normalize=True)

0    0.918
1    0.082
Name: y, dtype: float64

In fact, the presence or absence of a missing value is sometimes a feature in itself, and it may be related to whether the label is positive or negative. In statistics, the chi-square test can be used to decide whether the two are correlated. The samples fall into four cases: positive with the feature missing, negative with the feature missing, positive with the feature present, and negative with the feature present, with sample counts $n_{11}, n_{10}, n_{01}, n_{00}$ respectively. If missingness and label are unrelated, the number of positive examples among the samples with the feature missing should be close to the total number of samples with the feature missing $\times$ the overall proportion of positive examples, that is:

$$E_{11} = n_{11} \approx (n_{11}+n_{10})\times\frac{n_{11}+n_{01}}{n_{11}+n_{10}+n_{01}+n_{00}} = F_{11}$$

The same is true for the other three cases. Now let the actual value and theoretical value be recorded as $E_{ij}, F_{ij}$ respectively. Then we hope that the following statistic is as small as possible, which means that the actual value is close to the theoretical value of the unrelated case:

$$S = \sum_{i\in \{0,1\}}\sum_{j\in \{0,1\}} \frac{(E_{ij}-F_{ij})^2}{F_{ij}}$$

It can be shown that the above statistic approximately follows the chi-square distribution with $1$ degree of freedom, that is, $S\overset{\cdot}{\sim} \chi^2(1)$. The correlation can therefore be judged by computing the probability $P(\chi^2(1)>S)$. It is generally accepted that when this probability is less than $0.05$, the missingness is correlated with the label; in other words, the theoretical values under independence differ substantially from the actual values.

The probability mentioned above is the $p$ value of the $2\times2$ contingency table test problem in statistics, which can be obtained through `scipy.stats.chi2.sf(S, 1)`. Please test the `X_1, X_2` columns respectively according to the above materials.
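
To make the formulas concrete, the sketch below evaluates $S$ for a hypothetical $2\times2$ table. The counts are invented and unrelated to `missing_chi.csv`, so this is not a solution to the exercise. For one degree of freedom the tail probability has the closed form $P(\chi^2(1)>S)=\operatorname{erfc}(\sqrt{S/2})$, which is used here only to avoid a `scipy` dependency; `scipy.stats.chi2.sf(S, 1)` should give the same value.

```python
import math
import numpy as np

# Hypothetical observed counts E[i][j]:
# rows: feature missing, feature present; columns: label positive, label negative
E = np.array([[30.0, 70.0],    # n_11, n_10
              [10.0, 90.0]])   # n_01, n_00

n = E.sum()
row_totals = E.sum(axis=1, keepdims=True)   # missing / not-missing totals
col_totals = E.sum(axis=0, keepdims=True)   # positive / negative totals

# Theoretical counts under independence: F_ij = row_total_i * col_total_j / n
F = row_totals * col_totals / n

S = ((E - F) ** 2 / F).sum()
p_value = math.erfc(math.sqrt(S / 2))   # P(chi2(1) > S)

print(F)
print(round(S, 4), p_value)
```

For these made-up counts the theoretical table is `[[20, 80], [20, 80]]` and $S = 12.5$, so the $p$ value falls well below $0.05$.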

### Ex2: Solve classification problems with regression models

`KNN` is a supervised learning model that can solve both regression and classification problems. For categorical variables, the `KNN` classification model can be used to interpolate missing values. The idea is to measure the distance between the features of the sample with the missing value and those of all other samples. Given the model parameter `n_neighbors=n`, the most frequent category among the $n$ nearest samples is used as the predicted category for the missing value. As shown in the figure below, the unknown category is predicted to be yellow:

<img src="../source/_static/ch7_ex.png" width="25%">

The feature data of the colored dots above are as follows:

In [77]:
df = pd.read_excel('../data/color.xlsx')
df.head(3)

Unnamed: 0,X1,X2,Color
0,-2.5,2.8,Blue
1,-1.5,1.8,Blue
2,-0.8,2.8,Blue


Given that the sample point to be predicted is $X_1=0.8, X_2=-0.2$, the predicted category can be obtained as follows:

In [78]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(df.iloc[:,:2], df.Color)
clf.predict([[0.8, -0.2]])

array(['Yellow'], dtype=object)

1. For regression problems a numeric value is needed, so the prediction is the average of the target values of the nearest $n$ samples. Please convert the classification problem above into a regression problem, and reproduce the behaviour of `KNeighborsClassifier` above using only `KNeighborsRegressor`.
2. Please use the method from question 1 to interpolate missing values for the `Employment` variable in the `audit` dataset.

In [79]:
df = pd.read_csv('../data/audit.csv')
df.head(3)

Unnamed: 0,ID,Age,Employment,Marital,Income,Gender,Hours
0,1004641,38,Private,Unmarried,81838.0,Female,72
1,1010229,35,Private,Absent,72099.0,Male,30
2,1024587,32,Private,Divorced,154676.74,Male,40
