# Handling Missing Data

There are two kinds of missing data:
- None
- np.nan (NaN)

## 1. None

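None is built into Python, and its type is a Python object (a NumPy array that contains it gets dtype object). Therefore None cannot take part in any arithmetic: an expression such as None + 1 raises TypeError.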
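
As a minimal sketch of the point above, the following shows the error raised when None is used in arithmetic, and the dtype NumPy picks for an array that contains it. The variable names are illustrative only.

```python
import numpy as np

# Arithmetic on None fails because NoneType defines no numeric operators.
try:
    None + 1
    none_add_error = None
except TypeError as exc:
    none_add_error = exc
print("None + 1 raised:", type(none_add_error).__name__)

# An array holding None cannot use a numeric dtype, so NumPy falls back to object.
arr = np.array([1, 2, None])
print("dtype:", arr.dtype)

# Reductions over such an array hit the same TypeError.
try:
    arr.sum()
    sum_error = None
except TypeError as exc:
    sum_error = exc
print("arr.sum() raised:", type(sum_error).__name__)
```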

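Operations on the object dtype are much slower than operations on an int dtype.  
To time a sum over different dtypes (replace xxx with the dtype under test):  
%timeit np.arange(1e5,dtype=xxx).sum()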
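
Outside IPython, the same comparison can be sketched with the standard-library `timeit` module. This is one possible setup, not the notebook's own measurement: the array is built as int64 and then converted with `astype(object)`, and the repeat count of 100 is an arbitrary choice. Absolute timings depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np

n = 100_000
int_arr = np.arange(n, dtype=np.int64)
obj_arr = int_arr.astype(object)  # same values, stored as Python int objects

# Time 100 sums of each array; the object version loops over Python objects.
t_int = timeit.timeit(int_arr.sum, number=100)
t_obj = timeit.timeit(obj_arr.sum, number=100)

print(f"int64 : {t_int:.4f} s for 100 sums")
print(f"object: {t_obj:.4f} s for 100 sums")
```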

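## 2. np.nan (NaN)

np.nan is a floating-point value, so it can take part in arithmetic, but the result is always NaN.

However, the np.nan*() functions (for example np.nansum and np.nanmean) can compute over data that contains NaN by skipping it; for a sum this is equivalent to treating NaN as 0.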
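
A short sketch of the difference between the regular reductions and the NaN-aware ones, using a small hand-written array (the values are made up for illustration):

```python
import numpy as np

arr = np.array([1.0, np.nan, 3.0])

plain_total = arr.sum()       # NaN propagates through ordinary arithmetic
skip_total = np.nansum(arr)   # NaN is skipped, same as counting it as 0
skip_mean = np.nanmean(arr)   # mean of the two non-NaN values only

print(type(np.nan))           # np.nan is an ordinary Python float
print(plain_total, skip_total, skip_mean)
```

Note that `np.nanmean` divides by the number of non-NaN entries, so it is not the same as replacing NaN with 0 and then averaging.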

## 3. None and NaN in pandas

### 1) pandas treats both None and np.nan as np.nan

Create a DataFrame

In [1]:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame

In [2]:
df1 = DataFrame(data=np.random.randint(0,100,size=(3,3)),columns=list('ABC'))
display(df1)

,A,B,C
0,31,12,81
1,6,75,33
2,38,97,94


In [3]:
df1.loc[0,'B'] = None
df1

,A,B,C
0,31,,81
1,6,75.0,33
2,38,97.0,94


In [4]:
df1['C'] = np.nan
df1

,A,B,C
0,31,,
1,6,75.0,
2,38,97.0,


In [5]:
df1.loc[0,'B']

nan

DataFrame data can be modified through its row and column labels, as in the .loc assignments above.
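
The conversion can be checked directly. The sketch below assumes float columns from the start, which avoids the dtype change seen above when a missing value is written into an integer column; the frame and its values are illustrative.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None and np.nan both end up as NaN.
s = pd.Series([1, None, np.nan])
print(s.dtype)         # float64
print(s.isna().sum())  # 2

# Assigning None to a cell of a float column also stores NaN.
df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
df.loc[0, "B"] = None
cell = df.loc[0, "B"]
print(cell)
```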

### 2) Operations on None and np.nan in pandas

- ``isnull()``
- ``notnull()``
- ``dropna()``: filter out missing data
- ``fillna()``: fill in missing data

(1) Detection functions
- ``isnull()``
- ``notnull()``

In [6]:
df1

,A,B,C
0,31,,
1,6,75.0,
2,38,97.0,


In [7]:
df1.isnull()

,A,B,C
0,False,True,True
1,False,False,True
2,False,False,True


In [8]:
df1.notnull()

,A,B,C
0,True,False,False
1,True,True,False
2,True,True,False
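
The boolean frames returned by ``isnull()`` and ``notnull()`` are usually combined with a reduction or used as a mask. A small sketch, rebuilding a frame with the same values as df1 above (written out by hand rather than generated randomly):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [31.0, 6.0, 38.0],
    "B": [np.nan, 75.0, 97.0],
    "C": [np.nan, np.nan, np.nan],
})

# Count missing values per column (True counts as 1).
missing_per_column = df.isnull().sum()

# Select rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

# Select rows where column B is present.
rows_with_b = df[df["B"].notnull()]

print(missing_per_column)
print(rows_with_b)
```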


(2) Filtering function
- ``dropna()``

In [9]:
# Add a new row with index label 3
df1.loc[3] = [12,15,18]
df1

,A,B,C
0,31,,
1,6,75.0,
2,38,97.0,
3,12,15.0,18.0


In [10]:
# By default, every row that contains a missing value is dropped
df1.dropna()

,A,B,C
3,12,15.0,18.0


In [11]:
# Use axis to change direction: axis=1 drops columns that contain missing values
df1.dropna(axis=1)

,A
0,31
1,6
2,38
3,12


In [12]:
df1.loc[3,'C'] = np.nan

In [15]:
df1.dropna?

Signature: df1.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Docstring:
Remove missing values.

See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
    Determine if rows or columns which contain missing values are
    removed.

    * 0, or 'index' : Drop rows which contain missing values.
    * 1, or 'columns' : Drop columns which contain missing value.

    .. deprecated:: 0.23.0: Pass tuple or list to drop on multiple
    axes.
how : {'any', 'all'}, default 'any'
    Determine if row or column is removed from DataFrame, when we have
    at least 

In [16]:
# inplace defaults to False; setting it to True modifies the original object
df1.dropna(axis=1,how='all',inplace=True)

In [17]:
df1

,A,B
0,31,
1,6,75.0
2,38,97.0
3,12,15.0


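You can choose whether rows or columns are filtered (rows by default).

You can also choose the filtering mode: how='any' (the default) drops a row or column that contains any missing value, while how='all' drops it only when every value is missing.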
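
The options from the signature shown earlier can be compared side by side. This is a sketch on a hand-built frame with the same shape as df1 after In [12]; `thresh` and `subset` come from that signature but are not exercised in the cells above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [31.0, 6.0, 38.0, 12.0],
    "B": [np.nan, 75.0, 97.0, 15.0],
    "C": [np.nan, np.nan, np.nan, np.nan],
})

# Default: drop rows with any missing value. Column C is all NaN, so nothing survives.
rows_any = df.dropna()

# Drop only the columns in which every value is missing (removes C).
cols_all = df.dropna(axis=1, how="all")

# Keep rows that have at least 2 non-missing values.
rows_thresh = df.dropna(thresh=2)

# Only look at column B when deciding which rows to drop.
rows_subset = df.dropna(subset=["B"])

print(cols_all)
print(rows_thresh)
```

None of these calls changes `df` itself, because `inplace` is left at its default of False.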

(3) Fill function (Series/DataFrame)
- ``fillna()``

In [18]:
df1

,A,B
0,31,
1,6,75.0
2,38,97.0
3,12,15.0


In [19]:
# The value argument fills every missing entry with the given value
df1.fillna(value=100)

,A,B
0,31,100.0
1,6,75.0
2,38,97.0
3,12,15.0


In [20]:
df1

,A,B
0,31,
1,6,75.0
2,38,97.0
3,12,15.0


In [22]:
df1.fillna?

Signature: df1.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
Docstring:
Fill NA/NaN values using the specified method

Parameters
----------
value : scalar, dict, Series, or DataFrame
    Value to use to fill holes (e.g. 0), alternately a
    dict/Series/DataFrame of values specifying which value to use for
    each index (for a Series) or column (for a DataFrame). (values not
    in the dict/Series/DataFrame will not be filled). This value cannot
    be a list.
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
    Method to use for filling holes in reindexed Series
    pad / 

In [24]:
df1.fillna(method='ffill',limit=1,axis=1)

,A,B
0,31.0,31.0
1,6.0,75.0
2,38.0,97.0
3,12.0,15.0


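You can choose between forward fill ('ffill') and backward fill ('bfill').

For a DataFrame you also choose the axis along which to fill. Remember, for a DataFrame:

- axis=0: index/rows (values are carried down or up within each column)
- axis=1: columns (values are carried left or right within each row)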
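
The fill strategies can be sketched together on a small hand-built frame. Recent pandas releases deprecate the `method=` argument of `fillna` in favor of the dedicated `ffill()` and `bfill()` methods, so the sketch uses those; the per-column mean fill is an added illustration, not something the cells above do.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [31.0, 6.0, 38.0],
    "B": [np.nan, 75.0, 97.0],
})

# Constant fill: every missing entry becomes 100.
filled_const = df.fillna(value=100)

# Per-column fill with a dict: here the mean of column B.
filled_by_col = df.fillna(value={"B": df["B"].mean()})

# Forward fill along axis=1: take the value from the column to the left.
filled_across = df.ffill(axis=1)

# Backward fill along axis=0: take the value from the next row down.
filled_up = df.bfill(axis=0)

print(filled_across)
print(filled_up)
```

A forward fill along axis=0 would leave row 0 of column B untouched, since there is no earlier row to copy from.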

============================================

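Exercise 7:

1. Briefly describe the difference between None and NaN.

2. Suppose 张三 (Zhang San) and 李四 (Li Si) take a mock exam, but 张三, having suddenly figured out the meaning of life, skips the English exam, so his score is recorded as None. Create a DataFrame that reflects this and name it ddd3.

3. The teacher decides to fill in 张三's English score with his math score. How can this be done?
    What about filling 张三's English score with 李四's English score?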

============================================

In [25]:
data = np.random.randint(0,100,size=(3,3))
columns = ['语文','数学','英语']
index = ['张三','李四','王二小']
score = DataFrame(data=data, index=index,columns=columns)
score

,语文,数学,英语
张三,68,97,89
李四,49,6,73
王二小,51,38,0


In [26]:
score.loc['张三','英语'] = np.nan

In [27]:
score

,语文,数学,英语
张三,68,97,
李四,49,6,73.0
王二小,51,38,0.0


In [28]:
score.fillna(method='ffill',axis=1)

,语文,数学,英语
张三,68.0,97.0,97.0
李四,49.0,6.0,73.0
王二小,51.0,38.0,0.0


In [29]:
score.fillna(method='bfill',axis=0)

,语文,数学,英语
张三,68,97,73.0
李四,49,6,73.0
王二小,51,38,0.0
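
The two fills above work on the whole frame, so they would also fill any other missing cell. As an alternative sketch, a single cell can be filled by direct assignment with `.loc`. The scores below are copied by hand from the sample output above rather than generated randomly, and the names `by_math` and `by_lisi` are illustrative.

```python
import numpy as np
import pandas as pd

score = pd.DataFrame(
    {
        "语文": [68.0, 49.0, 51.0],
        "数学": [97.0, 6.0, 38.0],
        "英语": [np.nan, 73.0, 0.0],
    },
    index=["张三", "李四", "王二小"],
)

# Fill 张三's English score with his own math score.
by_math = score.copy()
by_math.loc["张三", "英语"] = by_math.loc["张三", "数学"]

# Fill 张三's English score with 李四's English score.
by_lisi = score.copy()
by_lisi.loc["张三", "英语"] = by_lisi.loc["李四", "英语"]

print(by_math)
print(by_lisi)
```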
