Note 1: All descriptive statistics on pandas objects exclude missing data by default. <br>
Note 2: pandas uses NaN to represent missing numeric data, while Python uses the None object to represent null. Use the `is` operator to check for None.
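
A minimal sketch of both notes, assuming a small hand-made Series (the variable names are illustrative only). It shows the `is` check for None, why `==` is not a reliable test for NaN, and how a descriptive statistic skips NaN unless told otherwise:

```python
import numpy as np
import pandas as pd

value = None
is_none = value is None                 # identity check, the idiomatic test for None

nan_equals_itself = np.nan == np.nan    # False: NaN never compares equal, even to itself
nan_detected = pd.isnull(np.nan)        # True: use isnull() to detect NaN instead

s = pd.Series([1.0, np.nan, 3.0])
mean_skipping_nan = s.mean()            # NaN is excluded by default: (1 + 3) / 2
mean_with_nan = s.mean(skipna=False)    # with skipna=False the NaN propagates

print(is_none, nan_equals_itself, nan_detected)
print(mean_skipping_nan, mean_with_nan)
```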

### isnull()
1. isnull() is an element-wise method, not an aggregation function: it returns a boolean for each value it checks. <br>
2. You can assign the result of isnull() to a new variable; the resulting Series has dtype bool. 
#### Reason: the Series.dtype attribute returns the data type of the underlying data, and the result of isnull() holds only True/False values.<br>
3. isnull() treats both NaN and None as missing values. <br>
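
As a quick illustration of points 2 and 3, the sketch below builds a Series containing both None and NaN (the data is made up for this example) and checks that isnull() flags both, returning a boolean mask that can be summed to count missing values:

```python
import numpy as np
import pandas as pd

mixed = pd.Series(['aardvark', None, np.nan, 'avocado'])

mask = mixed.isnull()              # one boolean per element
missing_count = int(mask.sum())    # True counts as 1, so this counts missing values

print(mask)
print(missing_count)
```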

### notnull()
1. The negation of isnull(); everything else works the same way.
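
One common use of notnull() is as a boolean mask for selecting the non-missing values. A short sketch, assuming the same kind of string Series used in these notes:

```python
import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

# notnull() is the element-wise negation of isnull()
same_as_negation = (string_data.notnull() == ~string_data.isnull()).all()

# Boolean indexing keeps only the rows where the mask is True
present = string_data[string_data.notnull()]

print(same_as_negation)
print(present)
```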

In [24]:
import pandas as pd
import numpy as np
from numpy import nan as na

In [2]:
string_data = pd.Series(['aardvark','artichoke',np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [11]:
tst = string_data.isnull()
tst

0    False
1    False
2     True
3    False
dtype: bool

### dropna()

1. Drops missing values.<br>
2. dropna() does not change the original dataset; it returns a new object.<br>
3. If you want to keep a clean dataset without any missing values, there are two options: <br>
   a: assign the output of dropna() to another variable<br>
   b: pass [inplace=True] to dropna()<br>
4. By default, dropna() drops any row containing a missing value. To drop only rows where all values are missing, pass [how='all'].<br>
5. Use [axis=0] to drop rows or [axis=1] to drop columns.<br>
6. To keep only rows with at least a certain number of non-missing observations, use the [thresh] argument.
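
The two options in point 3 can be sketched as follows. The Series here is a throwaway example; note that with inplace=True the call returns None and modifies the object itself:

```python
import numpy as np
import pandas as pd

original = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Option a: assign the result; the original keeps its missing values
cleaned = original.dropna()
missing_left_in_original = int(original.isnull().sum())

# Option b: modify in place; the return value is None
result = original.dropna(inplace=True)

print(cleaned.tolist(), missing_left_in_original)
print(result, original.tolist())
```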

In [25]:
data = pd.Series([1,na,3.5, na, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [26]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [23]:
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [16]:
tst = data.dropna()
tst

0    1.0
2    3.5
4    7.0
dtype: float64

In [38]:
data = pd.DataFrame([[1,6.5,3],[1,na,na],[na,na,na],[na,6.5,3]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [29]:
data.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [30]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [39]:
data[3]=na  # data[X] = value adds (or overwrites) column X, where X is the column label
data

Unnamed: 0,0,1,2,3
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [44]:
test = pd.Series([na,na,na,na])
data = pd.concat([data, test.to_frame().T], ignore_index=True)  # Add a new row to a DataFrame; this differs from adding a new column
                                      # DataFrame.append was removed in pandas 2.0 (and never had an [inplace] argument), so use pd.concat
data

Unnamed: 0,0,1,2,3
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,
4,,,,


In [45]:
data.dropna(how='all',axis=1)

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0
4,,,


In [50]:
# Test thresh argument

df=pd.DataFrame(np.random.randn(7,3))  # 7 rows, 3 columns
df.iloc[:4,1]=na   # :4 selects positions up to, but not including, 4
df.iloc[:2,2]=na
df

Unnamed: 0,0,1,2
0,0.635769,,
1,-0.400061,,
2,0.007553,,0.660702
3,-1.612578,,-0.057297
4,0.218704,0.789599,1.705617
5,0.769265,-2.071546,-0.200002
6,-1.106361,0.707866,0.363928


In [52]:
df.dropna(thresh=2) # keep rows with at least 2 non-missing values; with 3 columns, this drops the rows that have two missing values

Unnamed: 0,0,1,2
2,0.007553,,0.660702
3,-1.612578,,-0.057297
4,0.218704,0.789599,1.705617
5,0.769265,-2.071546,-0.200002
6,-1.106361,0.707866,0.363928


In [55]:
df.dropna(thresh=2,axis=1)  # thresh also applies to columns: every column here has at least 2 non-missing values, so none is dropped

Unnamed: 0,0,1,2
0,0.635769,,
1,-0.400061,,
2,0.007553,,0.660702
3,-1.612578,,-0.057297
4,0.218704,0.789599,1.705617
5,0.769265,-2.071546,-0.200002
6,-1.106361,0.707866,0.363928
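
To see thresh acting on columns, at least one column needs fewer non-missing values than the threshold. A small sketch with made-up data, in which column 1 has only one non-missing value:

```python
import numpy as np
import pandas as pd

df_cols = pd.DataFrame(np.ones((7, 3)))   # 7 rows, 3 columns, all 1.0
df_cols.iloc[:6, 1] = np.nan              # column 1 keeps only 1 non-missing value
df_cols.iloc[:2, 2] = np.nan              # column 2 keeps 5 non-missing values

# Keep only columns that have at least 2 non-missing values
kept = df_cols.dropna(thresh=2, axis=1)

print(kept.columns.tolist())
```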


## Filling In Missing Data

### fillna( )

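1. Calling fillna() with a dict lets you use a different fill value for each column. <br>
2. fillna() does not change the original DataFrame; it returns a new object. Use the [inplace] argument to modify the existing object.<br>
3. Interpolation (fill) methods and options: <br>
      ffill -- forward fill: propagate the last valid value forward<br>
      bfill -- backward fill: use the next valid value to fill backward<br>
      limit -- maximum number of consecutive values to fill<br>
4. Use [axis] to choose whether to fill along rows or along columns.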
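
Points 2 and 4 are sketched below on a tiny two-column frame (the column names 'a' and 'b' are made up for this example). Filling returns a new object, and axis=1 fills along each row from left to right instead of down each column:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, np.nan]})

# Returns a new object; `small` itself still contains its NaNs
filled = small.fillna({'a': 0.0, 'b': -1.0})
missing_in_small = int(small.isnull().sum().sum())

# Forward fill along each row: 'b' takes the value of 'a' when 'b' is missing
across = small.ffill(axis=1)

print(filled)
print(missing_in_small)
print(across)
```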


In [57]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.635769,0.0,0.0
1,-0.400061,0.0,0.0
2,0.007553,0.0,0.660702
3,-1.612578,0.0,-0.057297
4,0.218704,0.789599,1.705617
5,0.769265,-2.071546,-0.200002
6,-1.106361,0.707866,0.363928


In [59]:
df.fillna({1:0.5,2:0})

Unnamed: 0,0,1,2
0,0.635769,0.5,0.0
1,-0.400061,0.5,0.0
2,0.007553,0.5,0.660702
3,-1.612578,0.5,-0.057297
4,0.218704,0.789599,1.705617
5,0.769265,-2.071546,-0.200002
6,-1.106361,0.707866,0.363928


In [61]:
df = pd.DataFrame(np.random.randn(6,3))
df.iloc[2:,1]=na
df.iloc[4:,2]=na
df

Unnamed: 0,0,1,2
0,-1.494141,-0.141704,-1.566211
1,-0.531914,-0.109354,-0.875158
2,1.679723,,-1.617093
3,0.317212,,0.412313
4,0.962284,,
5,1.222318,,


In [62]:
df.ffill(limit=1)  # same as fillna(method='ffill', limit=1), which is deprecated in recent pandas versions

Unnamed: 0,0,1,2
0,-1.494141,-0.141704,-1.566211
1,-0.531914,-0.109354,-0.875158
2,1.679723,-0.109354,-1.617093
3,0.317212,,0.412313
4,0.962284,,0.412313
5,1.222318,,


In [63]:
df.bfill() # same as fillna(method='bfill'); the NaNs run to the end of each column, so there is no later valid value to fill back from and nothing changes

Unnamed: 0,0,1,2
0,-1.494141,-0.141704,-1.566211
1,-0.531914,-0.109354,-0.875158
2,1.679723,,-1.617093
3,0.317212,,0.412313
4,0.962284,,
5,1.222318,,
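
For contrast, here is a sketch where backward fill does have something to work with. The Series is made up for this example: NaNs that are followed by a valid value get filled, while the trailing NaN stays as it is:

```python
import numpy as np
import pandas as pd

gaps = pd.Series([np.nan, 2.0, np.nan, np.nan, 5.0, np.nan])

back_filled = gaps.bfill()          # each NaN takes the next valid value
back_limited = gaps.bfill(limit=1)  # fill at most 1 consecutive NaN per gap

print(back_filled.tolist())
print(back_limited.tolist())
```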
