In [24]:
import pandas as pd
import numpy as np
pd.options.display.width = 1000


Integer columns cannot hold NaN, because NaN is a floating-point value. So whenever a list of integers contains a NaN, the data type of the whole column is upcast to float.

In [25]:
df = pd.DataFrame({
    "A": [1, 2, 3, np.nan],
    "B": [2.4, 6.2, 5.1, np.nan],
    "C": ["foo","zoo","bar", np.nan]
})

print(df)

     A    B    C
0  1.0  2.4  foo
1  2.0  6.2  zoo
2  3.0  5.1  bar
3  NaN  NaN  NaN
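The upcast described above can be confirmed by inspecting the dtype directly. This is a minimal sketch; the variable names `without_nan` and `with_nan` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same integer values, with and without a missing entry.
without_nan = pd.Series([1, 2, 3])
with_nan = pd.Series([1, 2, 3, np.nan])

print(without_nan.dtype)  # an integer dtype
print(with_nan.dtype)     # float64, because NaN is a float value
```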


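To overcome this issue, pandas offers a nullable integer data type, and Pandas 1.0 introduced a dedicated missing value representation for it, <NA>. We need to explicitly declare the data type as pd.Int64Dtype(). How the missing entry is printed depends on the version: Pandas 1.0 and later show <NA>, while earlier versions show NaN.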

In [26]:
df = pd.DataFrame({
    "A": [1, 2, 3, np.nan],
    "B": [2.4, 6.2, 5.1, np.nan],
    "C": ["foo","zoo","bar", np.nan]
})

df["A"] = df["A"].astype(pd.Int64Dtype())

# column A is now Int64, so its values print as integers instead of floats
print(df)

     A    B    C
0    1  2.4  foo
1    2  6.2  zoo
2    3  5.1  bar
3  NaN  NaN  NaN
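As a small sketch of an alternative spelling, the nullable type can also be requested when the Series is created, using the string alias "Int64" (capital I) for pd.Int64Dtype(). Here None stands in for the missing value; the variable name `s` is only illustrative.

```python
import pandas as pd

# "Int64" (capital I) is the string alias for pd.Int64Dtype().
s = pd.Series([1, 2, 3, None], dtype="Int64")

print(s.dtype)         # Int64
print(s.isna().sum())  # 1 missing value; the other entries stay integers
```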


The isna and notna functions return a new DataFrame of True/False values: isna marks missing cells with True, and notna marks non-missing cells with True.

In [27]:
df = pd.DataFrame({
    "A": [1, 2, 3, np.nan, 7],
    "B": [2.4, np.nan, 5.1, np.nan, 2.6],
    "C": [np.nan, "foo","zoo","bar", np.nan],
    "D": [11.5, np.nan, 6.2, 21.1, 8.7]
})

print(df.isna())
print("Number of NA in each column")
print(df.isna().sum())  # same as print(df.isna().sum(axis=0))

print("Total number of NA cell")
print(df.isna().sum().sum())

       A      B      C      D
0  False  False   True  False
1  False   True  False   True
2  False  False  False  False
3   True   True  False  False
4  False  False   True  False
Number of NA in each column
A    1
B    2
C    2
D    1
dtype: int64
Total number of NA cell
6
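The cell above only uses isna. As a hedged sketch on a smaller two-column frame, notna gives the complementary mask, and passing axis=1 to sum counts missing cells per row instead of per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, np.nan, 7],
    "B": [2.4, np.nan, 5.1, np.nan, 2.6],
})

# notna() is the element-wise negation of isna().
non_missing_per_column = df.notna().sum()

# Summing along axis=1 counts missing cells per row instead of per column.
missing_per_row = df.isna().sum(axis=1)

print(non_missing_per_column)
print(missing_per_row)
```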


The dropna function removes rows or columns that contain missing values.

In [28]:
df = pd.DataFrame({
    "A": [1, 2, 3, np.nan, 7],
    "B": [2.4, np.nan, 5.1, np.nan, 2.6],
    "C": [np.nan, "foo","zoo","bar", np.nan],
    "D": [11.5, np.nan, 6.2, 21.1, 8.7]
})
print(df)
print("\ndrop row with any NA")
print(df.dropna())  # drop rows with any NA

print("\ndrop column with any NA")
print(df.dropna(axis=1))  # drop columns with any NA

print("\ndrop row if all cell is NA")
print(df.dropna(how='all'))  # drop rows only if all values are NA


     A    B    C     D
0  1.0  2.4  NaN  11.5
1  2.0  NaN  foo   NaN
2  3.0  5.1  zoo   6.2
3  NaN  NaN  bar  21.1
4  7.0  2.6  NaN   8.7

drop row with any NA
     A    B    C    D
2  3.0  5.1  zoo  6.2

drop column with any NA
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

drop row if all cell is NA
     A    B    C     D
0  1.0  2.4  NaN  11.5
1  2.0  NaN  foo   NaN
2  3.0  5.1  zoo   6.2
3  NaN  NaN  bar  21.1
4  7.0  2.6  NaN   8.7
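In the output above, how='all' leaves the frame unchanged because no row is entirely missing. The following sketch uses a smaller, made-up frame that does contain such a row, and also shows the subset parameter, which restricts the check to the listed columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, 3],
    "B": [2.4, np.nan, np.nan],
})

# Row 1 is entirely missing, so how="all" removes it and keeps the rest.
all_missing_dropped = df.dropna(how="all")
print(all_missing_dropped)

# subset restricts the check to the listed columns: only NA in "B" matters here.
b_required = df.dropna(subset=["B"])
print(b_required)
```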


**Drop thresh**

In [29]:
df = pd.DataFrame({
    "A": [1, 2, 3, np.nan, 7],
    "B": [2.4, np.nan, 5.1, np.nan, 2.6],
    "C": [np.nan, "foo","zoo","bar", np.nan],
    "D": [11.5, np.nan, 6.2, 21.1, 8.7],
    "E": [1, 2, 3, 4, 5]
})

print(df)
print("\n\n")
# Drop rows that have less than 4 non-missing values
print(df.dropna(thresh=4))

     A    B    C     D  E
0  1.0  2.4  NaN  11.5  1
1  2.0  NaN  foo   NaN  2
2  3.0  5.1  zoo   6.2  3
3  NaN  NaN  bar  21.1  4
4  7.0  2.6  NaN   8.7  5



     A    B    C     D  E
0  1.0  2.4  NaN  11.5  1
2  3.0  5.1  zoo   6.2  3
4  7.0  2.6  NaN   8.7  5
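One way to read thresh is as a minimum count of non-missing values per row. As a sketch, the same rows can be selected by counting non-missing cells explicitly; the names `kept_by_thresh` and `kept_by_mask` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, np.nan, 7],
    "B": [2.4, np.nan, 5.1, np.nan, 2.6],
    "C": [np.nan, "foo", "zoo", "bar", np.nan],
    "D": [11.5, np.nan, 6.2, 21.1, 8.7],
    "E": [1, 2, 3, 4, 5],
})

# thresh=4 keeps rows with at least 4 non-missing values.
kept_by_thresh = df.dropna(thresh=4)

# The same selection, written as an explicit count of non-missing cells per row.
non_missing = df.notna().sum(axis=1)
kept_by_mask = df[non_missing >= 4]

print(kept_by_thresh.equals(kept_by_mask))
```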


The fillna function is only the mechanical part of handling missing values. To reiterate, determining the replacement value depends on the characteristics of the data and on domain knowledge.
For instance, if our data consists of daily stock prices, filling a missing value with the previous or next value is usually more appropriate than using the average.
This is a commonly used approach in time-series data. The fillna function performs this operation with the method parameter: **method="bfill"** fills backward from the next valid value, and **method="ffill"** fills forward from the previous valid value.

In [35]:
df = pd.DataFrame({
    "A": [1, 2, 3, np.nan, 7],
    "B": [2.4, np.nan, 5.1, np.nan, 2.6],
    "C": [np.nan, "foo","zoo","bar", np.nan],
    "D": [11.5, np.nan, 6.2, 21.1, 8.7],
    "E": [1, 2, 3, 4, 5]
})

print(df["A"])
print("\n\nColumn A after fill NA with mean")
print(df["A"].fillna(value = df["A"].mean()))


0    1.0
1    2.0
2    3.0
3    NaN
4    7.0
Name: A, dtype: float64


Column A after fill NA with mean
0    1.00
1    2.00
2    3.00
3    3.25
4    7.00
Name: A, dtype: float64


In [36]:
print("\nFilling backward")
print(df["A"].fillna(method="bfill"))

print("\nFilling forward")
print(df["A"].fillna(method="ffill"))


Filling backward
0    1.0
1    2.0
2    3.0
3    7.0
4    7.0
Name: A, dtype: float64

Filling forward
0    1.0
1    2.0
2    3.0
3    3.0
4    7.0
Name: A, dtype: float64
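Recent pandas releases (2.1 and later) deprecate the method parameter of fillna in favor of the dedicated bfill and ffill methods. A minimal sketch of the same two fills written that way, on a standalone Series with the values of column A:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, 7])

# Backward fill: the gap takes the next valid value (7.0).
backward = s.bfill()

# Forward fill: the gap takes the previous valid value (3.0).
forward = s.ffill()

print(backward)
print(forward)
```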


In [38]:
df = pd.DataFrame({
    "A": [1, 2, np.nan, np.nan, 8]
})

print("Without the limit parameter")
print(df.fillna(method="bfill"))

print("\nWith the limit parameter")
print(df.fillna(method="bfill", limit=1))

Without the limit parameter
     A
0  1.0
1  2.0
2  8.0
3  8.0
4  8.0

With the limit parameter
     A
0  1.0
1  2.0
2  NaN
3  8.0
4  8.0
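The limit parameter counts from the valid value that supplies the fill, so the direction decides which end of a gap is filled. The sketch below contrasts backward and forward filling with limit=1 on the same values, using the bfill and ffill methods.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, np.nan, 8])

# Backward fill with limit=1: only the gap entry next to 8 is filled.
backward_limited = s.bfill(limit=1)

# Forward fill with limit=1: only the gap entry next to 2 is filled.
forward_limited = s.ffill(limit=1)

print(backward_limited)
print(forward_limited)
```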


Exercise: We have the following DataFrame with two columns and ten rows. We want to fill the missing values in the Measurement column with their previous values, but we’re only allowed to fill two consecutive values with this method. The remaining ones need to be filled with the average value of the column.

In [43]:
df = pd.DataFrame({
    "Date": pd.date_range(start="2021-10-01", periods=10),
    "Measurement": [16, 13, 14, 12, np.nan, np.nan, np.nan, 8, 7, 5]
})

print(df)

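# Forward-fill at most two consecutive missing values.
# ffill(limit=2) is equivalent to fillna(method="ffill", limit=2); the result is
# assigned back because in-place calls on a selected column may not update df
# in newer pandas versions.
df["Measurement"] = df["Measurement"].ffill(limit=2)
# Fill whatever is still missing with the mean of the (already forward-filled) column.
df["Measurement"] = df["Measurement"].fillna(value=df["Measurement"].mean())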
print(df)


        Date  Measurement
0 2021-10-01         16.0
1 2021-10-02         13.0
2 2021-10-03         14.0
3 2021-10-04         12.0
4 2021-10-05          NaN
5 2021-10-06          NaN
6 2021-10-07          NaN
7 2021-10-08          8.0
8 2021-10-09          7.0
9 2021-10-10          5.0
        Date  Measurement
0 2021-10-01         16.0
1 2021-10-02         13.0
2 2021-10-03         14.0
3 2021-10-04         12.0
4 2021-10-05         12.0
5 2021-10-06         12.0
6 2021-10-07         11.0
7 2021-10-08          8.0
8 2021-10-09          7.0
9 2021-10-10          5.0
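Note that in this solution the mean is computed after the forward fill, so the two filled values of 12.0 are included in it. The sketch below compares that with a mean taken from the originally observed values only; the second variant is an assumption about what "average value of the column" could also mean, not part of the exercise.

```python
import numpy as np
import pandas as pd

measurement = pd.Series([16, 13, 14, 12, np.nan, np.nan, np.nan, 8, 7, 5])

# Step 1 is the same in both variants: forward-fill at most two values.
forward_filled = measurement.ffill(limit=2)

# Variant used in the exercise: the mean includes the forward-filled values.
mean_after_fill = forward_filled.mean()

# Alternative (an assumption, not part of the exercise): use only the
# originally observed values for the mean.
mean_observed_only = measurement.mean()

print(forward_filled.fillna(mean_after_fill).iloc[6])
print(forward_filled.fillna(mean_observed_only).iloc[6])
```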
