## Handling Missing Data

> The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing.
To make matters even more complicated, different data sources may indicate missing data in different ways. We'll refer to missing data in general as *null*, *NaN*, or *NA* values.

### ``None``: Pythonic missing data
> The first sentinel value used by Pandas is ``None``, a Python singleton object that is often used for missing data in Python code.
Because it is a Python object, ``None`` cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type ``'object'`` (i.e., arrays of Python objects). As a result, aggregations such as `sum()` or `min()` raise an error when they encounter a ``None`` value:

In [1]:
import numpy as np
import pandas as pd

vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

In [3]:
# vals1.sum()  # left commented out: raises TypeError because None cannot be added to an integer
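
The failure described above can be reproduced safely by catching the exception. This is a minimal sketch; the names `vals_obj` and `error_type` are illustrative and do not appear in the original notebook.

```python
import numpy as np

# An array containing None falls back to dtype=object
vals_obj = np.array([1, None, 3, 4])
print(vals_obj.dtype)

# Aggregations on object arrays run at the Python level and fail on int + None
try:
    vals_obj.sum()
    error_type = None
except TypeError as exc:
    error_type = type(exc).__name__
    print("sum() failed:", exc)
```

Catching the exception keeps the cell runnable while still showing why ``None`` is a poor fit for numerical arrays.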

### ``NaN``: Missing numerical data

> The other missing data representation, ``NaN`` (acronym for *Not a Number*), is different: it is a special floating-point value. For an array containing ``NaN``, such as `vals2` below, NumPy chooses a native floating-point type, so the array supports fast operations pushed into compiled code. You should be aware that ``NaN`` is a bit like a data virus: it infects any other object it touches. Regardless of the operation, the result of arithmetic with ``NaN`` will be another ``NaN``. NumPy does provide some special aggregations that will ignore these missing values, such as `np.nansum`, `np.nanmin`, and `np.nanmax`. Keep in mind that ``NaN`` is specifically a floating-point value; there is no equivalent NaN value for integers, strings, or other types.

In [4]:
1 + np.nan
0 * np.nan

nan

In [6]:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2

array([ 1., nan,  3.,  4.])

In [7]:
vals2.sum(), vals2.min(), vals2.max()  # regular aggregations propagate NaN

(nan, nan, nan)

In [8]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)  # NaN-aware aggregations ignore missing values

(8.0, 1.0, 4.0)
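
One more property of ``NaN`` is worth a short sketch: it never compares equal to anything, including itself, so missing entries should be located with `np.isnan` rather than `==`. The variable names below are illustrative.

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

# NaN is a float, so the whole array is stored as float64
print(vals.dtype)

# NaN never compares equal to anything, including itself
nan_equals_itself = (np.nan == np.nan)

# Use np.isnan to build a Boolean mask of the missing entries
mask = np.isnan(vals)
clean_total = vals[~mask].sum()
print(nan_equals_itself, mask, clean_total)
```

Summing only the unmasked entries gives the same total as `np.nansum`.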

## Operating on Null Values

> Pandas treats ``None`` and ``NaN`` as essentially interchangeable for indicating missing or null values.
To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures.
They are:

- `isnull()`: generates a Boolean mask indicating missing values
- `notnull()`: the opposite of `isnull()`
- `dropna()`: returns a filtered version of the data with missing values removed
- `fillna()`: returns a copy of the data with missing values filled or imputed
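
To make the interchangeability of ``None`` and ``NaN`` concrete, here is a small sketch (with illustrative variable names) showing that Pandas converts ``None`` to ``NaN`` in a numeric Series and detects both markers the same way.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN and the dtype becomes float64
numeric = pd.Series([1, np.nan, 2, None])
print(numeric)

# Both markers are reported as null
null_mask = numeric.isnull()
print(null_mask)
```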

In [2]:
import numpy as np
import pandas as pd

### isnull() - Detecting null values

In [3]:
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

### notnull() - Detecting null values

In [4]:
data = pd.Series([1, np.nan, 'hello', None])
data[data.notnull()]

0        1
2    hello
dtype: object

### dropna() - Dropping null values

In [5]:
data = pd.Series([1, np.nan, 'hello', None])
data.dropna()

0        1
2    hello
dtype: object

### fillna() - Filling null values

In [6]:
data = pd.Series([1, np.nan, 'hello', None])
data.ffill()    # forward-fill; fillna(method='ffill') is deprecated in recent pandas
data.bfill()    # back-fill
data.fillna(0)  # fill NA entries with a single value (shown below)

0        1
1        0
2    hello
3        0
dtype: object

## DataFrame

> We cannot drop single values from a ``DataFrame``; we can only drop full rows or full columns.
Depending on the application, you might want one or the other, so ``dropna()`` gives a number of options for a ``DataFrame``. By default, ``dropna()`` will drop all rows in which *any* null value is present. Alternatively, you can drop NA values along a different axis; ``axis=1`` drops all columns containing a null value:

In [16]:
import pandas as pd
import numpy as np

# For a DataFrame
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]],columns=list('ABC'))
df

     A    B  C
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [17]:
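df.dropna()                # drop rows containing any null value (default)
df.dropna(axis='rows')     # same as the default: drop rows
df.dropna(axis='columns')  # drop columns containing any null value (shown below)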

   C
0  2
1  5
2  6


In [18]:
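df.dropna(axis='rows', thresh=3)      # keep only rows with at least 3 non-null values
df.dropna(axis='columns', how='all')  # drop a column only if all of its values are null
df.dropna(axis='columns', how='any')  # drop a column if any of its values is null (shown below)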

   C
0  2
1  5
2  6
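
Because the small frame above has no all-null column, `how='all'` and `thresh` are easier to see on a slightly extended copy. The sketch below adds a hypothetical column `D` that contains no data; `df_demo` and the result names are illustrative.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame([[1,      np.nan, 2],
                        [2,      3,      5],
                        [np.nan, 4,      6]], columns=list('ABC'))
df_demo['D'] = np.nan  # hypothetical column with no data at all

# how='all' drops only columns in which every value is null
only_all_null_dropped = df_demo.dropna(axis='columns', how='all')

# thresh=3 keeps only rows with at least 3 non-null values
rows_with_3_values = df_demo.dropna(axis='rows', thresh=3)

print(only_all_null_dropped)
print(rows_with_3_values)
```

Only column `D` is removed by `how='all'`, and only the middle row has three non-null values, so it is the only row that survives `thresh=3`.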


In [19]:
df.fillna(50)

      A     B  C
0   1.0  50.0  2
1   2.0   3.0  5
2  50.0   4.0  6


In [20]:
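df.ffill()         # forward-fill down each column; fillna(method='ffill') is deprecated in recent pandas
df.bfill()         # back-fill up each column
df.ffill(axis=1)   # forward-fill across the columns of each row
df.ffill(limit=2)  # forward-fill at most 2 consecutive nulls (shown below)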

     A    B  C
0  1.0  NaN  2
1  2.0  3.0  5
2  2.0  4.0  6
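
The effect of `limit` is clearer on a run of consecutive missing values. The following sketch uses a made-up Series `s`; it is an illustration of the parameter, not part of the original notebook.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward-fill at most one consecutive missing value
filled_limit_1 = s.ffill(limit=1)

# Back-fill with no limit
filled_back = s.bfill()

print(filled_limit_1)
print(filled_back)
```

With `limit=1`, only the first gap after the value 1.0 is filled and the remaining two stay missing.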


## Example

In [21]:
import pandas as pd
data=pd.read_csv("input/pd-fifa.csv", encoding = 'gb2312')
data.head()

Unnamed: 0.1,Unnamed: 0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96,33,28,26,6,11,15,14,8,㈤226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,㈤127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94,27,24,33,9,9,15,15,11,㈤228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,...,68,15,21,13,90,85,87,88,94,㈤138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,...,88,68,58,51,15,13,5,10,13,㈤196.4M


In [22]:
#sorting the missing values in rows in descending order
data.isnull().sum(axis=1).sort_values(ascending=False)

568     31
1120    31
1349    28
504     28
702     28
        ..
649      1
648      1
646      1
645      1
1821     1
Length: 1822, dtype: int64

In [23]:
#checking if there are any missing values in rows
data.isnull().any(axis=1)

0       True
1       True
2       True
3       True
4       True
        ... 
1817    True
1818    True
1819    True
1820    True
1821    True
Length: 1822, dtype: bool

In [24]:
print("Before deleting the rows ",data.shape[0])

Before deleting the rows  1822


In [25]:
data.shape

(1822, 89)

In [26]:
print("Before deleting the rows ",data.shape[0])
data=data[data.isnull().sum(axis=1)<=50]
print("After removing the rows having more than 50 missing values ",data.shape[0])

Before deleting the rows  1822
After removing the rows having more than 50 missing values  1822
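
Since no row in this dataset has more than 50 missing values, the filter above removes nothing. To see the same pattern actually drop rows, here is a sketch on hypothetical toy data with an illustrative threshold; it also shows an equivalent formulation with `dropna(thresh=...)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the FIFA table
toy = pd.DataFrame({'a': [1, np.nan, np.nan],
                    'b': [2, 5, np.nan],
                    'c': [3, 6, np.nan]})

max_missing = 1  # illustrative threshold

# Count missing values per row, then keep rows at or below the threshold
missing_per_row = toy.isnull().sum(axis=1)
kept = toy[missing_per_row <= max_missing]

# Equivalent formulation: require a minimum number of non-null values
kept_thresh = toy.dropna(thresh=toy.shape[1] - max_missing)

print(kept)
```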


In [27]:
#checking for the missing values in columns

data.isnull().sum()

Unnamed: 0          0
ID                  0
Name                0
Age                 0
Photo               0
                 ... 
GKHandling          0
GKKicking           0
GKPositioning       0
GKReflexes          0
Release Clause    122
Length: 89, dtype: int64

In [28]:
pd.set_option("display.max_rows", 89)  # use the full option name; the short "max_rows" can be ambiguous
data.isnull().sum()

Unnamed: 0                     0
ID                             0
Name                           0
Age                            0
Photo                          0
Nationality                    0
Flag                           0
Overall                        0
Potential                      0
Club                          14
Club Logo                      0
Value                          0
Wage                           0
Special                        0
Preferred Foot                 0
International Reputation       0
Weak Foot                      0
Skill Moves                    0
Work Rate                      0
Body Type                      0
Real Face                      0
Position                       0
Jersey Number                  0
Joined                       121
Loaned From                 1715
Contract Valid Until          14
Height                         0
Weight                         0
LS                           180
ST                           180
RS        

In [29]:
x=data.isnull().sum()
y=(data.isnull().sum()/data.shape[0])*100
z={'Number of missing values':x,'Percentage of missing values':y}
df=pd.DataFrame(z,columns=['Number of missing values','Percentage of missing values'])
df.sort_values(by='Percentage of missing values',ascending=False)

             Number of missing values  Percentage of missing values
Loaned From                      1715                     94.127333
LWB                               180                      9.879254
LCM                               180                      9.879254
RS                                180                      9.879254
LW                                180                      9.879254
LF                                180                      9.879254
CF                                180                      9.879254
RF                                180                      9.879254
RW                                180                      9.879254
LAM                               180                      9.879254
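
The summary above can be wrapped in a small reusable helper. This is a sketch under the assumption that you want the same two columns; `missing_summary` is a hypothetical name, and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return per-column missing counts and percentages, largest first."""
    counts = frame.isnull().sum()
    percents = frame.isnull().mean() * 100
    summary = pd.DataFrame({'Number of missing values': counts,
                            'Percentage of missing values': percents})
    return summary.sort_values(by='Percentage of missing values',
                               ascending=False)

# Hypothetical toy data
toy = pd.DataFrame({'x': [1, np.nan, np.nan, 4],
                    'y': [1, 2, 3, np.nan],
                    'z': [1, 2, 3, 4]})
summary = missing_summary(toy)
print(summary)
```

Taking the mean of the Boolean mask gives the fraction of missing values directly, which avoids dividing by `shape[0]` by hand.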


In [30]:
data=data.drop(['Loaned From'],axis=1)

In [None]:
print("Let's check the columns after removing Loaned From column",data.columns)

In [31]:
data.dtypes[data.isnull().any()]

Club                    object
Joined                  object
Contract Valid Until    object
LS                      object
ST                      object
RS                      object
LW                      object
LF                      object
CF                      object
RF                      object
RW                      object
LAM                     object
CAM                     object
RAM                     object
LM                      object
LCM                     object
CM                      object
RCM                     object
RM                      object
LWB                     object
LDM                     object
CDM                     object
RDM                     object
RWB                     object
LB                      object
LCB                     object
CB                      object
RCB                     object
RB                      object
Release Clause          object
dtype: object

In [32]:
# A player with a missing jersey number does not have a jersey number, so imputing
# the mean, median, or mode would be illogical. Impute the missing value as 'NA' instead.
data['Jersey Number'] = data['Jersey Number'].fillna('NA')

In [33]:
data['Club']=data['Club'].fillna(data['Club'].mode()[0])
data['Position']=data['Position'].fillna(data['Position'].mode()[0])
data['Joined']=data['Joined'].fillna(data['Joined'].mode()[0])
data['Contract Valid Until']=data['Contract Valid Until'].fillna(data['Contract Valid Until'].mode()[0])
data['Release Clause']=data['Release Clause'].fillna(data['Release Clause'].mode()[0])
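
The `mode()[0]` idiom used above deserves a brief illustration. In this sketch, with a made-up `clubs` Series, `mode()` returns a Series because several values can tie for most frequent, and `[0]` picks the first of them.

```python
import numpy as np
import pandas as pd

clubs = pd.Series(['Juventus', 'FC Barcelona', np.nan, 'Juventus', None])

# mode() ignores missing values and returns a Series; [0] takes the first entry
most_common = clubs.mode()[0]
clubs_filled = clubs.fillna(most_common)

print(most_common)
print(clubs_filled)
```

Mode imputation suits categorical columns, but it inflates the share of the most common category, so it is best reserved for columns with few missing values.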


In [34]:
#business logic: fill the missing values in these columns with 0
position_cols = ['RB', 'RCB', 'CB', 'LCB', 'LB', 'RWB', 'RDM', 'CDM', 'LDM',
                 'LWB', 'RM', 'RCM', 'CM', 'LCM', 'LM', 'RAM', 'CAM', 'LAM',
                 'RW', 'RF', 'CF', 'LF', 'LW', 'RS', 'ST', 'LS']
data[position_cols] = data[position_cols].fillna(0)

If you need to impute every missing value with 0, as in the case above, you can write a single command instead: `data.fillna(0, inplace=True)`
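
When different columns need different fill values, `fillna()` also accepts a dictionary that maps column names to values. The sketch below uses a hypothetical toy frame with made-up column names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'rating': [80.0, np.nan, 75.0],
                    'club': ['Juventus', None, 'FC Barcelona']})

# A dict maps each column name to its own fill value
filled = toy.fillna({'rating': 0, 'club': 'NA'})

print(filled)
```

Without `inplace=True`, `fillna()` returns a new frame and leaves `toy` unchanged.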

In [35]:
data.isnull().sum().sum()

0

## Handling missing values in dataframes

Missing data occurs when no information is provided for one or more items or for a whole unit. It is a significant problem in real-world scenarios.

Missing data is also referred to as NA (Not Available) values in pandas. Many datasets simply arrive with missing data, either because the data exists but was not collected or because it never existed.

For example, some users being surveyed may choose not to share their income, and others may choose not to share their address. In this way, values go missing from many datasets.

* None: None is a Python singleton object that is often used for missing data in Python code.
* NaN: NaN (an acronym for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.

Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful functions for detecting, removing, and replacing null values in a Pandas DataFrame:

* isnull()
* notnull()
* dropna()
* fillna()
* replace()
* interpolate()
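
The last two functions in the list, `replace()` and `interpolate()`, can be sketched as follows. The Series and the sentinel value -999 are hypothetical, chosen only to illustrate the calls.

```python
import numpy as np
import pandas as pd

scores = pd.Series([10.0, np.nan, 30.0, -999.0, 50.0])

# replace(): turn a sentinel value (here -999, a hypothetical marker) into NaN
cleaned = scores.replace(-999.0, np.nan)

# interpolate(): fill gaps linearly between the neighbouring known values
interpolated = cleaned.interpolate()

print(interpolated)
```

Linear interpolation is a reasonable default for ordered, evenly spaced data; for unordered records, a statistic such as the mean or mode is usually a better fit.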

In [36]:
# importing libraries
import pandas as pd
import numpy as np
# creating dataframe
d = {'First Score': [100, 90, np.nan, 95],
     'Second Score': [30, 45, 56, np.nan],
     'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(d)
df

   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2          NaN          56.0         80.0
3         95.0           NaN         98.0


In [37]:
df.isnull().sum()

First Score     1
Second Score    1
Third Score     1
dtype: int64

In [38]:
df.isnull().sum(axis = 1)

0    1
1    0
2    1
3    1
dtype: int64

In [39]:
df.fillna(df.mean())

   First Score  Second Score  Third Score
0        100.0     30.000000    72.666667
1         90.0     45.000000    40.000000
2         95.0     56.000000    80.000000
3         95.0     43.666667    98.000000
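
Passing `df.mean()` to `fillna()` works because the result is a Series indexed by column name, so each column is filled with its own mean. A short sketch with illustrative names shows this, along with the same pattern using the median.

```python
import numpy as np
import pandas as pd

df_scores = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                          'Second Score': [30, 45, 56, np.nan]})

# mean() skips NaN by default, so each column's mean uses only known values
col_means = df_scores.mean()
filled_mean = df_scores.fillna(col_means)

# The same pattern works with other statistics, such as the median
filled_median = df_scores.fillna(df_scores.median())

print(filled_mean)
print(filled_median)
```

The median is less sensitive to outliers than the mean, which can matter when a column is skewed.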


In [40]:
# importing required libraries
import pandas as pd
import numpy as np
d = {"col1": [2019, 2019, 2020],
     "col2": [350, 365, 1],
     "col3": [np.nan, 365, None]}

df = pd.DataFrame(d)
df

   col1  col2   col3
0  2019   350    NaN
1  2019   365  365.0
2  2020     1    NaN


In [41]:
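df.isnull().sum()        # Solution 1: missing values per column
df.isna().sum()          # Solution 2: isna() is an alias of isnull()
df.isna().any()          # Solution 3: whether each column has any missing value
df.isna().sum(axis = 1)  # Solution 4: missing values per row (shown below)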

0    1
1    0
2    1
dtype: int64

In [42]:
# total number of missing values in the dataframe
df.isnull().sum().sum()

2

In [43]:
# rowwise missing values
df.isnull().sum(axis=1)

0    1
1    0
2    1
dtype: int64

In [44]:
# per column: True if the column contains any missing value
df.isnull().any()

col1    False
col2    False
col3     True
dtype: bool

