In [2]:
import pandas as pd
import numpy as np

**Creating the Dataset**

In [18]:
# np.nan marks a missing value; detect it with isna/isnull or notna/notnull
df = pd.DataFrame(data={'Col01' : [0, np.nan, 2, 3, 4, np.nan, np.nan],
                        'Col02' : [100, 500, 200, 300, 400, 600, 800]})

In [19]:
df

Unnamed: 0,Col01,Col02
0,0.0,100
1,,500
2,2.0,200
3,3.0,300
4,4.0,400
5,,600
6,,800


In [20]:
df.isnull().sum()

Col01    3
Col02    0
dtype: int64
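The detection helpers named in the dataset comment can be compared side by side. This is a minimal sketch that rebuilds the same DataFrame; the variable names `missing_per_column`, `present_per_column`, and `missing_share` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col01': [0, np.nan, 2, 3, 4, np.nan, np.nan],
                   'Col02': [100, 500, 200, 300, 400, 600, 800]})

# isna() and isnull() are aliases: both return True where a value is missing.
missing_per_column = df.isna().sum()

# notna() (alias notnull()) is the inverse mask.
present_per_column = df.notna().sum()

# The mean of a boolean mask gives the fraction of missing values per column.
missing_share = df.isna().mean()

print(missing_per_column)
print(present_per_column)
print(missing_share)
```

The fraction is often easier to reason about than the raw count when columns have different lengths after filtering.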

### With `method='pad'`, `fillna` fills each `NaN` with the last valid value before it

In [21]:
df.fillna(method='pad', limit=1)

Unnamed: 0,Col01,Col02
0,0.0,100
1,0.0,500
2,2.0,200
3,3.0,300
4,4.0,400
5,4.0,600
6,,800


* Here the first `NaN` after `4.0` is filled with the value before it, i.e. `4.0`. Because `limit=1`, the second consecutive `NaN` (row 6) stays empty.

### The `limit` argument sets the maximum number of consecutive `NaN` values to fill

In [22]:
df.fillna(method='pad', limit=2)

Unnamed: 0,Col01,Col02
0,0.0,100
1,0.0,500
2,2.0,200
3,3.0,300
4,4.0,400
5,4.0,600
6,4.0,800


* With `limit=2`, both consecutive `NaN` values after `4.0` are filled with `4.0`.
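The effect of `limit` can also be checked on a single Series. The sketch below uses the dedicated `ffill` method rather than `fillna(method='pad')`, since the `method` argument of `fillna` is deprecated in recent pandas releases (2.1 and later); the names `filled_one` and `filled_two` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, np.nan, 2, 3, 4, np.nan, np.nan], name='Col01')

# Fill at most one consecutive NaN with the last valid value.
filled_one = s.ffill(limit=1)

# Allow up to two consecutive NaNs to be filled.
filled_two = s.ffill(limit=2)

# Count how many NaNs survive each call.
print(filled_one.isna().sum(), filled_two.isna().sum())
```

With `limit=1` the last row remains `NaN`, because it is the second missing value in a row; with `limit=2` nothing is left unfilled.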

### Backfilling `NaN` with `bfill`

In [23]:
df.fillna(method = 'bfill')

Unnamed: 0,Col01,Col02
0,0.0,100
1,2.0,500
2,2.0,200
3,3.0,300
4,4.0,400
5,,600
6,,800


In [24]:
df.fillna(method = 'ffill')

Unnamed: 0,Col01,Col02
0,0.0,100
1,0.0,500
2,2.0,200
3,3.0,300
4,4.0,400
5,4.0,600
6,4.0,800


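* `bfill` fills each `NaN` backward, using the next valid value after it, so trailing `NaN` values with nothing after them stay empty. `ffill` fills forward, using the last valid value before the `NaN`. If we do not specify `limit`, every consecutive `NaN` is filled.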
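Since each direction can leave gaps at one end of the column, the two fills are sometimes chained. This is a minimal sketch of that idea, not something the original notebook does; `back` and `both` are illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, np.nan, 2, 3, 4, np.nan, np.nan], name='Col01')

# Backward fill: row 1 takes the next valid value (2.0);
# rows 5 and 6 have no later value, so they stay NaN.
back = s.bfill()

# Chaining a forward fill afterwards closes the trailing gap with 4.0.
both = s.bfill().ffill()

print(back.tolist())
print(both.tolist())
```

The order matters: `bfill().ffill()` gives row 1 the value `2.0`, whereas `ffill().bfill()` would give it `0.0`.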

In [25]:
df.dropna(axis=0)

Unnamed: 0,Col01,Col02
0,0.0,100
2,2.0,200
3,3.0,300
4,4.0,400


In [26]:
df.dropna(axis=1)

Unnamed: 0,Col02
0,100
1,500
2,200
3,300
4,400
5,600
6,800


### Keep only columns which have at least 90% non-NaNs

In [27]:
df.dropna(thresh=int(df.shape[0] * .9), axis=1)

Unnamed: 0,Col02
0,100
1,500
2,200
3,300
4,400
5,600
6,800


The parameter `thresh=N` requires that a column has at least `N` non-`NaN` values to survive. Think of it as the minimum amount of valid data you require in each column.

In [28]:
df.shape[0] * .9

6.3

* After `int()` truncates `6.3` to `6`, `Col01` needs at least 6 non-`NaN` values to survive. It has only 4, so it is dropped.
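Because `int()` rounds 6.3 down to 6, a column with 6 valid values out of 7 (about 86%) would still pass the check. The sketch below shows one way to enforce the 90% rule strictly with `math.ceil`; the extra column `Col03` and the names `loose` and `strict` are hypothetical, added only to make the difference visible.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col01': [0, np.nan, 2, 3, 4, np.nan, np.nan],
                   'Col02': [100, 500, 200, 300, 400, 600, 800],
                   # Hypothetical column: 6 of 7 values present (about 86%).
                   'Col03': [1, 2, 3, 4, 5, 6, np.nan]})

# Truncating gives a threshold of 6, so Col03 survives.
loose = df.dropna(thresh=int(len(df) * 0.9), axis=1)

# Rounding up gives a threshold of 7, so only complete columns survive here.
strict = df.dropna(thresh=math.ceil(len(df) * 0.9), axis=1)

print(list(loose.columns))
print(list(strict.columns))
```

Which rounding is appropriate depends on whether "at least 90%" is meant as a hard requirement or a rough guideline.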

### Fill with the mean

In [29]:
df['Col01'].fillna(df['Col01'].mean())

0    0.00
1    2.25
2    2.00
3    3.00
4    4.00
5    2.25
6    2.25
Name: Col01, dtype: float64
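The same idea extends to the whole DataFrame: passing the Series returned by `df.mean()` to `fillna` fills each column with its own mean. This is a sketch under that assumption; the median variant is included only as a common alternative when outliers are a concern.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col01': [0, np.nan, 2, 3, 4, np.nan, np.nan],
                   'Col02': [100, 500, 200, 300, 400, 600, 800]})

# df.mean() is indexed by column name, so each column gets its own fill value.
filled_mean = df.fillna(df.mean())

# The median is less sensitive to extreme values than the mean.
filled_median = df.fillna(df.median())

print(filled_mean)
print(filled_median)
```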

### Interpolation

In [30]:
df['Col01'].interpolate()

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    4.0
6    4.0
Name: Col01, dtype: float64
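In the output above, rows 5 and 6 are not really interpolated: there is no later value to interpolate toward, so the default settings repeat the last valid value. If that is not wanted, `limit_area='inside'` restricts filling to gaps that are surrounded by valid values. A minimal sketch, with illustrative names `default` and `inside`:

```python
import numpy as np
import pandas as pd

s = pd.Series([0, np.nan, 2, 3, 4, np.nan, np.nan], name='Col01')

# Default linear interpolation: trailing NaNs are filled with the last value.
default = s.interpolate()

# Only fill NaNs that lie between two valid values; trailing NaNs stay.
inside = s.interpolate(limit_area='inside')

print(default.tolist())
print(inside.tolist())
```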

### Replace

In [31]:
df.replace(np.nan, 0)

Unnamed: 0,Col01,Col02
0,0.0,100
1,0.0,500
2,2.0,200
3,3.0,300
4,4.0,400
5,0.0,600
6,0.0,800


### inf and -inf

In [32]:
df = pd.DataFrame(data = {'Col01': [np.nan, -np.inf, 2, 3, 4, np.inf, np.nan],
                          'Col02': [1, -np.inf, 2, 3, 4, 6, 8]})
df

Unnamed: 0,Col01,Col02
0,,1.0
1,-inf,-inf
2,2.0,2.0
3,3.0,3.0
4,4.0,4.0
5,inf,6.0
6,,8.0


In [33]:
df.isna().sum()

Col01    2
Col02    0
dtype: int64

In [34]:
df.sum()

Col01    NaN
Col02   -inf
dtype: float64

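* If you want `inf` and `-inf` to be treated as “NA” in computations, you can set `pandas.options.mode.use_inf_as_na = True`. Note that this option is deprecated as of pandas 2.1, so it may emit a warning or be unavailable in newer versions.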

In [35]:
pd.options.mode.use_inf_as_na = True

In [36]:
df.isna().sum()

Col01    4
Col02    1
dtype: int64

In [37]:
df.sum()

Col01     9.0
Col02    24.0
dtype: float64
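The same counts and sums can be reproduced without changing a global option by converting the infinities to `NaN` explicitly. This is a sketch of that alternative using `replace`; the names `cleaned`, `na_counts`, and `totals` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col01': [np.nan, -np.inf, 2, 3, 4, np.inf, np.nan],
                   'Col02': [1, -np.inf, 2, 3, 4, 6, 8]})

# Turn both infinities into NaN so the usual missing-data tools apply.
cleaned = df.replace([np.inf, -np.inf], np.nan)

na_counts = cleaned.isna().sum()   # infinities are now counted as missing
totals = cleaned.sum()             # NaN is skipped by default (skipna=True)

print(na_counts)
print(totals)
```

Unlike the global option, this leaves the original `df` untouched and affects only the DataFrame it is applied to.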