# Handling Missing Data

There are two kinds of missing data:
- None
- np.nan (NaN)

In [1]:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame

In [2]:
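# 'En' is listed in columns but has no data, so it is created filled with NaN ('数学' means math)
df = DataFrame({'Python': np.random.randint(0, 150, size=6),
                '数学': np.random.randint(0, 150, size=6),
                'Java': np.random.randint(0, 150, size=6)},
               index=list('ABCDEF'), columns=['Python', '数学', 'Java', 'En'])
df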

Unnamed: 0,Python,数学,Java,En
A,50,19,40,
B,98,106,82,
C,41,111,25,
D,129,9,100,
E,113,145,100,
F,27,109,40,


In [8]:
# [3,1]
df.values

array([[50, 19, 40, nan],
       [98, 106, 82, nan],
       [41, 111, 25, nan],
       [129, 9, 100, nan],
       [113, 145, 100, nan],
       [27, 109, 40, nan]], dtype=object)

In [9]:
# Assign through .loc; chained indexing (df['数学']['D'] = None) raises SettingWithCopyWarning
df.loc['D', '数学'] = None


In [10]:
df

Unnamed: 0,Python,数学,Java,En
A,50,19.0,40,
B,98,106.0,82,
C,41,111.0,25,
D,129,,100,
E,113,145.0,100,
F,27,109.0,40,


In [12]:
# The Python and Java columns contain no missing data
df.isnull().any()

Python    False
数学         True
Java      False
En         True
dtype: bool

## 1. None

None is built into Python, and an array or column that holds it has dtype object. Consequently, None cannot take part in any arithmetic.

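Operations on object arrays are much slower than operations on int arrays.  
Compare the time needed to sum arrays of different dtypes, substituting each dtype for xxx:  
%timeit np.arange(1e5,dtype=xxx).sum()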
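Outside a notebook, the same comparison can be sketched with the standard `timeit` module. The helper name `time_sum` and the repeat counts are assumptions made for illustration, and the absolute timings depend on the machine.

```python
import timeit

import numpy as np


def time_sum(dtype, n=100_000, repeat=5):
    """Return the best time (seconds) for 10 sums of an n-element array of the given dtype."""
    arr = np.arange(n).astype(dtype)
    return min(timeit.repeat(arr.sum, number=10, repeat=repeat))


for dtype in (object, int):
    print(f"{np.dtype(dtype).name:>8}: {time_sum(dtype):.6f} s")
```

The object array stores ordinary Python integers, so every addition goes through the interpreter instead of a compiled loop.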

In [13]:
10 + None

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

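## 2. np.nan (NaN)

np.nan is a floating-point value, so it can take part in arithmetic. The result, however, is always NaN.

The np.nan*() family of functions (for example np.nansum) skips NaN values when aggregating; for a sum this is equivalent to treating NaN as 0.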
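As a small illustration of the NaN-aware functions, the sketch below compares an ordinary aggregation with `np.nansum`, `np.nanmean`, and `np.nanmax` on a made-up three-element array.

```python
import numpy as np

arr = np.array([1.0, np.nan, 3.0])

print(arr.sum())        # nan: ordinary aggregations propagate NaN
print(np.nansum(arr))   # 4.0: NaN is skipped, which for a sum equals treating it as 0
print(np.nanmean(arr))  # 2.0: the mean of the two non-NaN values
print(np.nanmax(arr))   # 3.0
```

Note that `np.nanmean` divides by the number of non-NaN values, so "NaN counts as 0" is only an accurate description for sums.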

In [14]:
10 + np.nan

nan

In [16]:
# DataFrame and Series methods skip NaN by default, so the computation still works
df.sum()

Python    458.0
数学        490.0
Java      387.0
En          0.0
dtype: float64

## 3. None and NaN in pandas

### 1) pandas treats both None and np.nan as np.nan

Create a DataFrame

Modify the DataFrame's data using its row and column labels
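A minimal sketch of both steps, using a hypothetical `scores` frame whose column names and values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical scores; None and np.nan are mixed on purpose.
scores = pd.DataFrame(
    {"math": [90.0, None, 75.0], "english": [np.nan, 60.0, 88.0]},
    index=["A", "B", "C"],
)
print(scores.dtypes)    # both columns are float64: None was stored as NaN
print(scores.isnull())  # None and np.nan are both reported as missing

# Modify a single cell by row label and column label.
scores.loc["C", "math"] = np.nan
print(scores)
```

Assigning through `.loc[row, column]` in one step avoids the chained-indexing pattern that triggers `SettingWithCopyWarning`.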

### 2) Operations on None and np.nan in pandas

- ``isnull()``
- ``notnull()``
- ``dropna()``: filter out missing data
- ``fillna()``: fill in missing data

(1) Detection functions
- ``isnull()``
- ``notnull()``
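The two detection functions return boolean masks, which can be counted or used for selection. A short sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, None], index=list("abcd"))

missing = s.isnull()   # True where the value is missing
present = s.notnull()  # the element-wise opposite of isnull()

print(missing.sum())   # number of missing values
print(s[present])      # keep only the values that are present
```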

(2) Filtering function
- ``dropna()``

In [20]:
df.loc['B', 'En'] = 100
df.loc['C', 'En'] = 78
df


Unnamed: 0,Python,数学,Java,En
A,50,19.0,40,
B,98,106.0,82,100.0
C,41,111.0,25,78.0
D,129,,100,
E,113,145.0,100,
F,27,109.0,40,


In [21]:
df.dropna()

Unnamed: 0,Python,数学,Java,En
B,98,106.0,82,100
C,41,111.0,25,78


In [24]:
df.dropna(axis = 1,how = 'all')

Unnamed: 0,Python,数学,Java,En
A,50,19.0,40,
B,98,106.0,82,100.0
C,41,111.0,25,78.0
D,129,,100,
E,113,145.0,100,
F,27,109.0,40,


In [25]:
df.loc['B','En'] = np.nan
df.loc['C','En'] = np.nan
df

Unnamed: 0,Python,数学,Java,En
A,50,19.0,40,
B,98,106.0,82,
C,41,111.0,25,
D,129,,100,
E,113,145.0,100,
F,27,109.0,40,


In [26]:
df.dropna(axis = 1,how = 'all')

Unnamed: 0,Python,数学,Java
A,50,19.0,40
B,98,106.0,82
C,41,111.0,25
D,129,,100
E,113,145.0,100
F,27,109.0,40


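You can choose whether to drop rows or columns (rows by default).

You can also choose the filtering rule with how: the default how='any' drops a row or column that contains any missing value, while how='all' drops it only when every value is missing.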
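The options can be compared side by side on a small frame. The frame below is hypothetical, and the `thresh` call is an extra `dropna()` parameter not used elsewhere in these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row "b" is partly missing, column "z" is entirely missing.
frame = pd.DataFrame(
    {"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, 6.0], "z": [np.nan, np.nan, np.nan]},
    index=list("abc"),
)

rows_any = frame.dropna()                   # axis=0, how='any': every row has a NaN in "z"
cols_any = frame.dropna(axis=1)             # drops columns "x" and "z"
cols_all = frame.dropna(axis=1, how="all")  # drops only the all-NaN column "z"
rows_min2 = frame.dropna(thresh=2)          # keeps rows with at least 2 non-missing values

print(rows_any, cols_any, cols_all, rows_min2, sep="\n\n")
```

`dropna()` returns a new object each time, so `frame` itself is left unchanged.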

(3) Filling function for Series/DataFrame
- ``fillna()``

In [28]:
df.fillna(value=20)

Unnamed: 0,Python,数学,Java,En
A,50,19.0,40,20
B,98,106.0,82,20
C,41,111.0,25,20
D,129,20.0,100,20
E,113,145.0,100,20
F,27,109.0,40,20


You can also choose forward fill or backward fill.

In [29]:
df

Unnamed: 0,Python,数学,Java,En
A,50,19.0,40,
B,98,106.0,82,
C,41,111.0,25,
D,129,,100,
E,113,145.0,100,
F,27,109.0,40,


In [30]:
# method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
#     Method to use for filling holes in reindexed Series
#     pad / ffill: propagate last valid observation forward to next valid
#     backfill / bfill: use NEXT valid observation to fill gap
# Note: pandas 2.1+ deprecates fillna(method=...); df.ffill(axis=1) is the equivalent call.
df.fillna(method='pad', axis=1)

Unnamed: 0,Python,数学,Java,En
A,50.0,19.0,40.0,40.0
B,98.0,106.0,82.0,82.0
C,41.0,111.0,25.0,25.0
D,129.0,129.0,100.0,100.0
E,113.0,145.0,100.0,100.0
F,27.0,109.0,40.0,40.0


In [32]:
df2 = df[['Python','En','数学','Java']]
df2

Unnamed: 0,Python,En,数学,Java
A,50,,19.0,40
B,98,,106.0,82
C,41,,111.0,25
D,129,,,100
E,113,,145.0,100
F,27,,109.0,40


In [34]:
df2.fillna(method='bfill',axis =1 )

Unnamed: 0,Python,En,数学,Java
A,50.0,19.0,19.0,40.0
B,98.0,106.0,106.0,82.0
C,41.0,111.0,111.0,25.0
D,129.0,100.0,100.0,100.0
E,113.0,145.0,145.0,100.0
F,27.0,109.0,109.0,40.0


In [35]:
df2.fillna(method='bfill',axis =0 )

Unnamed: 0,Python,En,数学,Java
A,50,,19.0,40
B,98,,106.0,82
C,41,,111.0,25
D,129,,145.0,100
E,113,,145.0,100
F,27,,109.0,40


In [36]:
df

Unnamed: 0,Python,数学,Java,En
A,50,19.0,40,
B,98,106.0,82,
C,41,111.0,25,
D,129,,100,
E,113,145.0,100,
F,27,109.0,40,


In [38]:
# axis=1: aggregate across the columns
# Aggregation reduces each row to a single value
df.mean(axis = 1)

A     36.333333
B     95.333333
C     59.000000
D    114.500000
E    119.333333
F     58.666667
dtype: float64

In [39]:
# Fill En with each row's mean over the other columns, in one vectorized assignment
# (this replaces a row-by-row loop that used chained indexing and raised SettingWithCopyWarning)
df['En'] = df[['Python', '数学', 'Java']].mean(axis=1)


In [45]:
df.fillna(0).astype(int)

Unnamed: 0,Python,数学,Java,En
A,50,19,40,36
B,98,106,82,95
C,41,111,25,59
D,129,0,100,114
E,113,145,100,119
F,27,109,40,58


In [None]:
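# How to fill missing data depends on the data set; different data call for different rules.
# Example: heights and weights were recorded, but some weights are missing.

# Height and weight are roughly linearly related: f(x) = w*x + b
# x (height)  y (weight)
# 180         80
# 170         70
# 160         65
# 180         78
# 170         70
# 160         60
# 160         65
# 170         70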


In [None]:
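# Simplest approach: with 50,000 records of which only 50 have missing values, just drop them (dropna)
# With fillna, think about whether the fill value is reasonable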

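For a DataFrame you also have to choose the axis to fill along. Remember that for a DataFrame:

- axis=0: index/rows (values propagate down each column)
- axis=1: columns (values propagate across each row)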
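The effect of the axis can be seen on a small frame. The sketch below uses a hypothetical 3x3 `grid` and `DataFrame.ffill()`, the forward-fill method.

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(
    [[1.0, np.nan, 3.0],
     [np.nan, 5.0, np.nan],
     [7.0, 8.0, 9.0]],
    index=list("abc"), columns=list("xyz"),
)

down = grid.ffill(axis=0)    # propagate values down each column
across = grid.ffill(axis=1)  # propagate values left to right along each row

print(down)
print(across)
```

A gap with no earlier value along the chosen axis stays NaN, as with `down.loc["a", "y"]` and `across.loc["b", "x"]`.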

============================================

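Exercise 7:

1. Briefly describe the difference between None and NaN.

2. Suppose Zhang San and Li Si sit a mock exam, but Zhang San, having suddenly figured out the meaning of life, skips the English exam, so his score is recorded as None. Create a DataFrame for this situation and name it ddd3.

3. The teacher decides to fill in Zhang San's English score with his math score. How would you do that?
    How would you fill it with Li Si's English score instead?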

============================================