#### <!--NAVIGATION-->
< [Operating on Data in Pandas](03.03-Operations-in-Pandas.ipynb) | [Contents](Index.ipynb) | [Hierarchical Indexing](03.05-Hierarchical-Indexing.ipynb) >

<a href="https://colab.research.google.com/github/wangyingsm/Python-Data-Science-Handbook/blob/master/notebooks/03.04-Missing-Values.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>


# Handling Missing Data

The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing. To make matters even more complicated, different data sources may indicate missing data in different ways. Throughout this section, we'll refer to missing data in general as *null*, *NaN*, or *NA* values.

## Trade-Offs in Missing Data Conventions

A number of schemes have been developed to indicate the presence of missing data in a table or DataFrame. Generally, they revolve around one of two strategies:
- Using a *mask* that globally indicates missing values: the mask might be an entirely separate Boolean array, or it might appropriate one bit in the data representation to flag a null value.
- Choosing a *sentinel value* that indicates a missing entry: the sentinel could be a data-specific convention, such as marking a missing integer with -9999 or some other rare value, or a more global convention such as NaN (Not a Number), a special value that is part of the IEEE floating-point specification.

None of these approaches is without trade-offs. A separate mask array requires allocating an additional Boolean array, which adds overhead in both storage and computation. A sentinel value reduces the range of valid values that can be represented, and may require extra (often non-optimized) logic in CPU and GPU arithmetic. Common special values like NaN are not available for all data types.
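
As a rough illustration of the two strategies, the sketch below computes the same mean once with a separate Boolean mask (via NumPy's masked arrays) and once with a NaN sentinel. The `readings` array and the `-9999` marker are made up for this example.

```python
import numpy as np

# Hypothetical data: -9999.0 is a made-up marker for "missing".
readings = np.array([12.5, -9999.0, 14.0, 13.5])

# Mask approach: a separate Boolean array records which entries are missing.
masked = np.ma.masked_array(readings, mask=(readings == -9999.0))
mask_mean = masked.mean()

# Sentinel approach: store NaN in place of the marker and use a NaN-aware reduction.
with_nan = np.where(readings == -9999.0, np.nan, readings)
nan_mean = np.nanmean(with_nan)

print(mask_mean, nan_mean)
```

Both routes skip the missing entry. The masked array carries an extra Boolean array alongside the data, while the NaN version only works because the data is floating point.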

With these constraints in mind, Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point ``NaN`` value and the Python ``None`` object. This choice has some side effects, as we will see, but in practice ends up being a good compromise in most cases of interest.

### ``None``: Pythonic missing data

The first sentinel value used by Pandas is ``None``, a Python singleton object that is often used for missing data in Python code. Because it is a Python object, ``None`` cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type ``'object'`` (i.e., arrays of Python objects):

In [1]:
import numpy as np
import pandas as pd

vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

This ``dtype=object`` means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects. While this kind of object array is useful for some purposes, any operations on the data will be done at the Python level, with much more overhead than the typically fast operations seen for arrays with native types:

In [2]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

dtype = object
42 ms ± 338 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
586 µs ± 12.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



The use of Python objects in an array also means that if you perform aggregations like ``sum()`` or ``min()`` across an array with a ``None`` value, you will generally get an error:

In [4]:
#vals1.sum()  # left commented out: raises TypeError (unsupported operand type(s) for +: 'int' and 'NoneType')
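
To see the error without interrupting a notebook run, the call can be wrapped in `try`/`except`. This is a minimal sketch; the `vals` and `outcome` names are only for illustration.

```python
import numpy as np

vals = np.array([1, None, 3, 4])  # dtype=object because of the None

try:
    vals.sum()
    outcome = "no error"
except TypeError as exc:
    # Adding an int and None is undefined at the Python level.
    outcome = f"TypeError: {exc}"

print(outcome)
```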

### ``NaN``: Missing numerical data

The other missing data representation, ``NaN`` (acronym for *Not a Number*), is different: it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation. As the first cell below shows, NumPy chooses a native floating-point type for an array containing ``NaN``, which means that, unlike the object array from before, this array supports fast operations pushed into compiled code. You should be aware that ``NaN`` is a bit like a data virus: it infects any other object it touches. Regardless of the operation, the result of arithmetic with ``NaN`` will be another ``NaN``:

In [5]:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype

dtype('float64')

In [6]:
1 + np.nan

nan

In [7]:
0 *  np.nan

nan

In [8]:
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

NumPy does provide some special aggregations that will ignore these missing values, such as ``np.nansum``, ``np.nanmin``, and ``np.nanmax``, used in the cell below. Keep in mind that ``NaN`` is specifically a floating-point value; there is no equivalent NaN value for integers, strings, or other types.

In [9]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2) 

(8.0, 1.0, 4.0)
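
The floating-point nature of ``NaN`` can be checked directly. The sketch below, with illustrative variable names, shows that an integer array cannot hold ``NaN``, and that ``NaN`` is located with `np.isnan` rather than with `==`, because it never compares equal to itself.

```python
import numpy as np

# NaN is a float: an explicitly integer array cannot represent it.
try:
    np.array([1, np.nan, 3], dtype=int)
    int_result = "ok"
except ValueError:
    int_result = "ValueError"

# NaN never compares equal to itself, so use np.isnan to find it.
vals = np.array([1, np.nan, 3, 4])
nan_equals_itself = bool(np.nan == np.nan)
nan_positions = np.isnan(vals)

print(int_result, nan_equals_itself, nan_positions)
```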

### NaN and None in Pandas

``NaN`` and ``None`` both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:

In [10]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

For types that don't have an available sentinel value, Pandas automatically type-casts when NA values are present. For example, if we set a value in an integer array to a null value (``None`` here; ``np.nan`` behaves the same way), the array will automatically be upcast to a floating-point type to accommodate the NA:

In [11]:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int64

In [12]:
x[0] = None
x

0    NaN
1    1.0
dtype: float64

Notice that in addition to casting the integer array to floating point, Pandas automatically converts the ``None`` to a ``NaN`` value.

While this type of magic may feel a bit hackish compared to the more unified approach to NA values in domain-specific languages like R, the Pandas sentinel/casting approach works quite well in practice and in my experience only rarely causes issues.

The following table lists the upcasting conventions in Pandas when NA values are introduced:

| Typeclass    | Conversion when storing NAs | NA sentinel value      |
|--------------|-----------------------------|------------------------|
| ``floating`` | No change                   | ``np.nan``             |
| ``object``   | No change                   | ``None`` or ``np.nan`` |
| ``integer``  | Cast to ``float64``         | ``np.nan``             |
| ``boolean``  | Cast to ``object``          | ``None`` or ``np.nan`` |

Keep in mind that in Pandas, string data is typically stored with an ``object`` dtype.
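
These conventions can be checked with a short sketch. Here the NA is introduced by reindexing to a label that does not exist; the `new_index` list and the variable names are chosen only for this illustration.

```python
import pandas as pd

# Label 2 is absent from each Series, so reindexing introduces an NA there.
new_index = [0, 1, 2]

ints = pd.Series([1, 2], dtype="int64").reindex(new_index)
floats = pd.Series([1.0, 2.0], dtype="float64").reindex(new_index)
bools = pd.Series([True, False], dtype="bool").reindex(new_index)
objs = pd.Series(["a", "b"], dtype="object").reindex(new_index)

for name, s in [("integer", ints), ("floating", floats),
                ("boolean", bools), ("object", objs)]:
    print(f"{name:>8} -> {s.dtype}")
```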

## Operating on Null Values

As we have seen, Pandas treats ``None`` and ``NaN`` as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures. They are:

- `isnull()`: generates a Boolean mask indicating missing values
- `notnull()`: the opposite of `isnull()`
- `dropna()`: returns a filtered version of the data, with missing values removed
- `fillna()`: returns a copy of the data with missing values filled or imputed
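
As a quick orientation, here are the four methods applied to one small Series. This is a minimal sketch with made-up values; the sections that follow look at each method in more detail.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, None])

missing_mask = s.isnull()    # True where a value is missing
present_mask = s.notnull()   # the inverse mask
dropped = s.dropna()         # only the non-missing entries
filled = s.fillna(0.0)       # missing entries replaced by 0.0

print(missing_mask.tolist())
print(dropped.tolist())
print(filled.tolist())
```

None of these calls modifies `s`; each returns a new object.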

### Detecting null values

Pandas data structures have two useful methods for detecting null data: ``isnull()`` and ``notnull()``. Either one will return a Boolean mask over the data. For example:

In [10]:
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [15]:
data[data.notnull()]

0        1
2    hello
dtype: object

In [1]:
import pandas as pd
import numpy as np

data_df = pd.DataFrame([[np.nan, 2, np.nan, 0], 
                      [3, 4, 5, 1], 
                      [np.nan, np.nan, np.nan, 5], 
                      [np.nan, 3, np.nan, 4]], 
                      columns=list('ABCD'))
data_df

Unnamed: 0,A,B,C,D
0,,2.0,,0
1,3.0,4.0,5.0,1
2,,,,5
3,,3.0,,4

In [2]:
data_df.fillna(50)
# data_df.fillna({'A':20, 'B':50, 'C':40, 'D':60, 'E':70})
# data_df.fillna({'A':20, 'B':50, 'C':40, 'D':60, 'E':70}, inplace=True)

Unnamed: 0,A,B,C,D
0,50.0,2.0,50.0,0
1,3.0,4.0,5.0,1
2,50.0,50.0,50.0,5
3,50.0,3.0,50.0,4

In [3]:
data_rand = pd.DataFrame(np.random.randn(5, 4))
data_rand.iloc[2:, 2]=np.nan
data_rand.iloc[:2, 3]=np.nan
data_rand


In [None]:
data_rand.ffill()  # forward-fill; replaces fillna(method='ffill'), which recent pandas versions deprecate
# data_rand.bfill()
# data_rand.ffill(limit=2)

### Dropping null values

In addition to the masking used before, there are the convenience methods ``dropna()`` (which removes NA values) and ``fillna()`` (which fills in NA values). For a ``Series``, the result is straightforward:

In [11]:
# For a Series
data.dropna()

0        1
2    hello
dtype: object

We cannot drop single values from a ``DataFrame``; we can only drop full rows or full columns. Depending on the application, you might want one or the other, so ``dropna()`` gives a number of options for a ``DataFrame``. By default, ``dropna()`` will drop all rows in which *any* null value is present. Alternatively, you can drop NA values along a different axis: ``axis=1`` (equivalently, ``axis='columns'``) drops all columns containing a null value. The two cells below show both cases:

In [16]:
# For a DataFrame
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df.dropna()

Unnamed: 0,0,1,2
1,2.0,3.0,5


In [17]:
df.dropna(axis='columns')

Unnamed: 0,2
0,2
1,5
2,6


But this drops some good data as well; you might rather be interested in dropping rows or columns with *all* NA values, or a majority of NA values. This can be specified through the ``how`` or ``thresh`` parameters, which allow fine control of the number of nulls to allow through: ``thresh`` sets the minimum number of non-null values a row or column needs in order to be kept.

The default is ``how='any'``, such that any row or column (depending on the ``axis`` keyword) containing a null value will be dropped. You can also specify ``how='all'``, which will only drop rows/columns that are *all* null values:

In [18]:
df[3] = np.nan
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


In [19]:
df.dropna(axis='columns', how='all')

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [21]:
# keep only rows that have at least 3 non-null values
df.dropna(axis='rows', thresh=3) 

Unnamed: 0,0,1,2,3
1,2.0,3.0,5,
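
Because ``thresh`` counts non-null values, a rule such as "drop columns that are mostly empty" has to be converted into a count first. The sketch below shows one way to do that; the `frame` data and the 50% cutoff in `min_fraction` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

# Hypothetical cutoff: keep columns that are at least half filled.
min_fraction = 0.5
min_non_null = int(np.ceil(min_fraction * len(frame)))

# thresh is the minimum number of non-null values a column needs to survive.
kept = frame.dropna(axis="columns", thresh=min_non_null)
print(list(kept.columns))
```

Column `b` has only one non-null value out of four, so it is the only one dropped.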


### Filling null values

Sometimes, instead of dropping NA values, you'd rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values. You could do this in-place using the ``isnull()`` method as a mask, but because it is such a common operation Pandas provides the ``fillna()`` method, which returns a copy of the array with the null values replaced.

In [22]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [24]:
data.fillna(0)  # fill NA entries with a single value

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
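
For comparison, here is a sketch of the manual route mentioned above, using ``isnull()`` as a mask, next to the equivalent ``fillna()`` call. The names `manual` and `auto` are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, None, 3], index=list("abcde"))

# Manual route: build the Boolean mask, then assign into a copy.
manual = s.copy()
manual[manual.isnull()] = 0

# Convenience route: fillna returns a new Series and leaves `s` untouched.
auto = s.fillna(0)

print(manual.equals(auto))
```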

In [23]:
# forward-fill: propagate the previous valid value
data.ffill()  # same result as fillna(method='ffill'), which recent pandas versions deprecate

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

In [26]:
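# back-fill: propagate the next valid value backward
data.bfill()  # same result as fillna(method='bfill'), which recent pandas versions deprecate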

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

In [25]:
# forward-fill along each row (axis=1), i.e. across the columns
df.ffill(axis=1)  # same result as fillna(method='ffill', axis=1)

Unnamed: 0,0,1,2,3
0,1.0,1.0,2.0,2.0
1,2.0,3.0,5.0,5.0
2,,4.0,6.0,6.0


## Example

In [5]:
import pandas as pd
data=pd.read_csv("input/fifa.csv", encoding = 'gb2312')
data.head()

Unnamed: 0.1,Unnamed: 0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96,33,28,26,6,11,15,14,8,㈤226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95,28,31,23,7,11,15,14,11,㈤127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94,27,24,33,9,9,15,15,11,㈤228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,...,68,15,21,13,90,85,87,88,94,㈤138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,...,88,68,58,51,15,13,5,10,13,㈤196.4M


In [6]:
#sorting the missing values in rows in descending order
data.isnull().sum(axis=1).sort_values(ascending=False)

568     31
1120    31
1349    28
504     28
702     28
        ..
649      1
648      1
646      1
645      1
1821     1
Length: 1822, dtype: int64

In [7]:
#checking if there are any missing values in rows
data.isnull().any(axis=1)

0       True
1       True
2       True
3       True
4       True
        ... 
1817    True
1818    True
1819    True
1820    True
1821    True
Length: 1822, dtype: bool

In [10]:
print("Before deleting the rows ",data.shape[0])

Before deleting the rows  1822


In [11]:
data.shape

(1822, 89)

In [12]:
print("Before deleting the rows ",data.shape[0])
data=data[data.isnull().sum(axis=1)<=50]
print("After removing the rows having more than 50 missing values ",data.shape[0])

Before deleting the rows  1822
After removing the rows having more than 50 missing values  1822
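
The same row filter can also be written with ``dropna(thresh=...)``, since "at most *k* missing values" is the same as "at least *n_columns - k* non-null values". The sketch below checks this on made-up data; `toy` and `max_missing` are hypothetical stand-ins for the FIFA table and the limit of 50.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, np.nan, 5.0],
    "c": [3.0, np.nan, np.nan],
    "d": [4.0, 1.0, 7.0],
})

max_missing = 2  # hypothetical limit, playing the role of 50 above

# Boolean-mask version, as in the cell above.
by_mask = toy[toy.isnull().sum(axis=1) <= max_missing]

# Equivalent dropna version: require enough non-null values per row.
by_thresh = toy.dropna(thresh=toy.shape[1] - max_missing)

print(by_mask.index.tolist(), by_thresh.index.tolist())
```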


In [13]:
#checking for the missing values in columns

data.isnull().sum()

Unnamed: 0          0
ID                  0
Name                0
Age                 0
Photo               0
                 ... 
GKHandling          0
GKKicking           0
GKPositioning       0
GKReflexes          0
Release Clause    122
Length: 89, dtype: int64

In [14]:
pd.set_option("display.max_rows",89)  # full option name; the short form "max_rows" can be ambiguous in newer pandas
data.isnull().sum()

Unnamed: 0                     0
ID                             0
Name                           0
Age                            0
Photo                          0
Nationality                    0
Flag                           0
Overall                        0
Potential                      0
Club                          14
Club Logo                      0
Value                          0
Wage                           0
Special                        0
Preferred Foot                 0
International Reputation       0
Weak Foot                      0
Skill Moves                    0
Work Rate                      0
Body Type                      0
Real Face                      0
Position                       0
Jersey Number                  0
Joined                       121
Loaned From                 1715
Contract Valid Until          14
Height                         0
Weight                         0
LS                           180
ST                           180
RS        

In [15]:
x=data.isnull().sum()
y=(data.isnull().sum()/data.shape[0])*100
z={'Number of missing values':x,'Percentage of missing values':y}
df=pd.DataFrame(z,columns=['Number of missing values','Percentage of missing values'])
df.sort_values(by='Percentage of missing values',ascending=False)

Unnamed: 0,Number of missing values,Percentage of missing values
Loaned From,1715,94.127333
LWB,180,9.879254
LCM,180,9.879254
RS,180,9.879254
LW,180,9.879254
LF,180,9.879254
CF,180,9.879254
RF,180,9.879254
RW,180,9.879254
LAM,180,9.879254


In [16]:
data=data.drop(['Loaned From'],axis=1)

In [17]:
print("Let's check the columns after removing Loaned From column",data.columns)

Let's check the columns after removing Loaned From column Index(['Unnamed: 0', 'ID', 'Name', 'Age', 'Photo', 'Nationality', 'Flag',
       'Overall', 'Potential', 'Club', 'Club Logo', 'Value', 'Wage', 'Special',
       'Preferred Foot', 'International Reputation', 'Weak Foot',
       'Skill Moves', 'Work Rate', 'Body Type', 'Real Face', 'Position',
       'Jersey Number', 'Joined', 'Contract Valid Until', 'Height', 'Weight',
       'LS', 'ST', 'RS', 'LW', 'LF', 'CF', 'RF', 'RW', 'LAM', 'CAM', 'RAM',
       'LM', 'LCM', 'CM', 'RCM', 'RM', 'LWB', 'LDM', 'CDM', 'RDM', 'RWB', 'LB',
       'LCB', 'CB', 'RCB', 'RB', 'Crossing', 'Finishing', 'HeadingAccuracy',
       'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy',
       'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Agility',
       'Reactions', 'Balance', 'ShotPower', 'Jumping', 'Stamina', 'Strength',
       'LongShots', 'Aggression', 'Interceptions', 'Positioning', 'Vision',
       'Penalties', 'Composure', 'M

In [18]:
data.dtypes[data.isnull().any()]

Club                    object
Joined                  object
Contract Valid Until    object
LS                      object
ST                      object
RS                      object
LW                      object
LF                      object
CF                      object
RF                      object
RW                      object
LAM                     object
CAM                     object
RAM                     object
LM                      object
LCM                     object
CM                      object
RCM                     object
RM                      object
LWB                     object
LDM                     object
CDM                     object
RDM                     object
RWB                     object
LB                      object
LCB                     object
CB                      object
RCB                     object
RB                      object
Release Clause          object
dtype: object

In [19]:
# Players with a missing jersey number do not have one, so imputing it with the
# mean, median, or mode would make no sense. Fill the missing values with the string 'NA' instead.
data['Jersey Number'] = data['Jersey Number'].fillna('NA')

In [20]:
data['Club']=data['Club'].fillna(data['Club'].mode()[0])
data['Position']=data['Position'].fillna(data['Position'].mode()[0])
data['Joined']=data['Joined'].fillna(data['Joined'].mode()[0])
data['Contract Valid Until']=data['Contract Valid Until'].fillna(data['Contract Valid Until'].mode()[0])
data['Release Clause']=data['Release Clause'].fillna(data['Release Clause'].mode()[0])
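
The repeated mode-imputation lines can be expressed as a loop over column names. This is a sketch on a made-up two-column table; note that when several values tie for most frequent, ``mode()`` returns them sorted and `[0]` simply picks the first.

```python
import pandas as pd

toy = pd.DataFrame({
    "Club": ["A", "B", "A", None],
    "Position": ["ST", None, "GK", "ST"],
})

for column in ["Club", "Position"]:
    # mode() ignores missing values by default and returns a Series.
    most_common = toy[column].mode()[0]
    toy[column] = toy[column].fillna(most_common)

print(toy)
```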


In [21]:
#business logic
position_columns = ['RB', 'RCB', 'CB', 'LCB', 'LB', 'RWB', 'RDM', 'CDM', 'LDM', 'LWB',
                    'RM', 'RCM', 'CM', 'LCM', 'LM', 'RAM', 'CAM', 'LAM',
                    'RW', 'RF', 'CF', 'LF', 'LW', 'RS', 'ST', 'LS']
# fill the missing values in every position column with 0 in one assignment
data[position_columns] = data[position_columns].fillna(0)

If every missing value should be imputed with 0, as in the case above, you can apply the fill to the whole DataFrame in a single command: `data.fillna(0, inplace=True)`
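
When different columns need different fill values, ``fillna()`` also accepts a dict that maps column names to values. The sketch below contrasts the two forms; the `toy` table and the choice of the median for `wage` are assumptions for this example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "rating": [80.0, np.nan, 75.0],
    "wage": [100.0, 300.0, np.nan],
})

# One value for every column.
all_zero = toy.fillna(0)

# A dict gives each column its own fill value.
per_column = toy.fillna({"rating": 0, "wage": toy["wage"].median()})

print(all_zero)
print(per_column)
```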

In [22]:
data.isnull().sum().sum()

0

## Handling missing values in dataframes

Missing data occurs when no information is provided for one or more items, or for a whole unit. It is a very common problem in real-life scenarios.

In Pandas, missing data is also referred to as NA (Not Available) values. Many datasets simply arrive with missing data, either because the data exists but was not collected or because it never existed.

For example, some users being surveyed may choose not to share their income, and others may choose not to share their address; in this way, values end up missing from the dataset.

* None: ``None`` is a Python singleton object that is often used for missing data in Python code.
* NaN: ``NaN`` (an acronym for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.

Pandas treats ``None`` and ``NaN`` as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful functions for detecting, removing, and replacing null values in a Pandas DataFrame:

* isnull()
* notnull()
* dropna()
* fillna()
* replace()
* interpolate()
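
The last two entries in this list, ``replace()`` and ``interpolate()``, are not demonstrated elsewhere in this section, so here is a minimal sketch. The `-9999.0` marker and the values are made up for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, -9999.0, 5.0])

# replace(): turn a made-up sentinel (-9999.0) into a proper NaN.
cleaned = s.replace(-9999.0, np.nan)

# interpolate(): by default, linear interpolation between neighbouring valid values.
smoothed = cleaned.interpolate()

print(cleaned.tolist())
print(smoothed.tolist())
```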

In [19]:
# importing libraries
import pandas as pd
import numpy as np
# creating dataframe
d = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score':[np.nan, 40, 80, 98]}
df = pd.DataFrame(d)
df

Unnamed: 0,First Score,Second Score,Third Score
0,100.0,30.0,
1,90.0,45.0,40.0
2,,56.0,80.0
3,95.0,,98.0


In [20]:
df.isnull().sum()

First Score     1
Second Score    1
Third Score     1
dtype: int64

In [21]:
df.isnull().sum(axis = 1)

0    1
1    0
2    1
3    1
dtype: int64

In [22]:
df.fillna(df.mean())

Unnamed: 0,First Score,Second Score,Third Score
0,100.0,30.0,72.666667
1,90.0,45.0,40.0
2,95.0,56.0,80.0
3,95.0,43.666667,98.0


In [23]:
# importing required libraries
import pandas as pd
import numpy as np
d = {"col1": [2019, 2019, 2020],
     "col2": [350, 365, 1],
     "col3": [np.nan, 365, None]}

df = pd.DataFrame(d)
df

Unnamed: 0,col1,col2,col3
0,2019,350,
1,2019,365,365.0
2,2020,1,


In [25]:
df.isnull().sum() # Solution 1
df.isna().sum()   # Solution 2
df.isna().any()  # Solution 3
df.isna().sum(axis = 1) # Solution 4

0    1
1    0
2    1
dtype: int64

In [26]:
# total number of missing values in the dataframe
df.isnull().sum().sum()

2

In [27]:
# rowwise missing values
df.isnull().sum(axis=1)

0    1
1    0
2    1
dtype: int64

In [28]:
# returns boolean object 
df.isnull().any()

col1    False
col2    False
col3     True
dtype: bool

