
## Handling Missing Data
Missing data is common in most data analysis work, and pandas provides dedicated methods for detecting, dropping, and filling missing values. <br>Let's learn some convenient methods for dealing with **missing data in pandas**:<br>

* isnull(), isna(), notnull(), notna(), dropna(), fillna()

In [None]:
import numpy as np
import pandas as pd

In [None]:
data_dic = {'A':[1,2,None,4,np.nan],
            'B':[np.nan,np.nan,np.nan,np.nan,np.nan],
            'C':[11,12,13,14,15],
            'D':[16,np.nan,18,19,20]}

In [None]:
df = pd.DataFrame(data_dic)
df

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0



**isnull(), isna(), notnull() -- Check for missing data in the dataset!**

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       3 non-null      float64
 1   B       0 non-null      float64
 2   C       5 non-null      int64  
 3   D       4 non-null      float64
dtypes: float64(3), int64(1)
memory usage: 288.0 bytes


In [None]:
df.isnull()

Unnamed: 0,A,B,C,D
0,False,True,False,False
1,False,True,False,True
2,True,True,False,False
3,False,True,False,False
4,True,True,False,False


In [None]:
df.isnull().sum()  # number of null values in each column

Unnamed: 0,0
A,2
B,5
C,0
D,1


In [None]:
df.isnull().sum().sum()  # a second sum() adds the per-column counts into one total

8

In [None]:
df['A'].isnull()  # boolean Series: True where column A is null

Unnamed: 0,A
0,False
1,False
2,True
3,False
4,True


In [None]:
df['A'].isnull().sum()

2

In [None]:
df.isna()  # isna() is an alias of isnull(); both return the same result

Unnamed: 0,A,B,C,D
0,False,True,False,False
1,False,True,False,True
2,True,True,False,False
3,False,True,False,False
4,True,True,False,False


In [None]:
df.isna().sum().sum()

8

In [None]:
df.loc[1].isnull().sum()  # number of null values in the row labeled 1

2
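The per-column totals and the single-row check above generalize. As a minimal sketch (it rebuilds the same sample frame so it runs on its own; the names `nulls_per_row` and `fraction_missing` are only illustrative), passing `axis=1` to `sum()` counts nulls for every row at once, and calling `mean()` on the boolean mask gives the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

# Same sample data as the notebook's df
df = pd.DataFrame({'A': [1, 2, None, 4, np.nan],
                   'B': [np.nan] * 5,
                   'C': [11, 12, 13, 14, 15],
                   'D': [16, np.nan, 18, 19, 20]})

# axis=1 sums across the columns, giving one count per row
nulls_per_row = df.isnull().sum(axis=1)

# True counts as 1 and False as 0, so the mean is the share of nulls per column
fraction_missing = df.isnull().mean()

print(nulls_per_row)
print(fraction_missing)
```

The fraction is often easier to read than a raw count when columns have many rows.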

In [None]:
df.notnull()

Unnamed: 0,A,B,C,D
0,True,False,True,True
1,True,False,True,False
2,False,False,True,True
3,True,False,True,True
4,False,False,True,True


In [None]:
df.shape  # returns (rows, columns)

(5, 4)

In [None]:
df.notnull().sum()


Unnamed: 0,0
A,3
B,0
C,5
D,4


In [None]:
df.notna().sum()

Unnamed: 0,0
A,3
B,0
C,5
D,4


In [None]:
df.notnull().sum().sum()

12
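As a quick cross-check, `count()` reports the number of non-null values per column, so it should agree with `notnull().sum()`, and the null and non-null totals should add up to the number of cells. A small sketch on the same sample data (rebuilt here so the cell is self-contained):

```python
import numpy as np
import pandas as pd

# Same sample data as the notebook's df
df = pd.DataFrame({'A': [1, 2, None, 4, np.nan],
                   'B': [np.nan] * 5,
                   'C': [11, 12, 13, 14, 15],
                   'D': [16, np.nan, 18, 19, 20]})

non_null_per_column = df.count()           # non-null values in each column
total_non_null = df.notnull().sum().sum()  # non-null values in the whole frame
total_null = df.isnull().sum().sum()       # null values in the whole frame

print(non_null_per_column)
print(total_non_null + total_null == df.size)  # df.size is rows * columns
```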

In [None]:
df

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0


In [None]:
# Sum of column "A"; NaN values are skipped, so they contribute 0
df['A'].sum()

7.0

&#9758; NaN values are also ignored by mean(): the average is taken over the non-null values only.

In [None]:
df['A'].mean()  # average of the 3 non-null values in column A

2.3333333333333335

In [None]:
df.loc[3].mean()  # row-wise average for the row labeled 3 (NaN in B is skipped)

12.333333333333334
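Skipping NaN is the default, not the only option. The sketch below (same sample data, rebuilt so it runs on its own) shows two parameters that change it: `skipna=False` makes the result NaN as soon as any value is missing, and `min_count=1` makes `sum()` return NaN instead of 0 when a column has no valid values at all:

```python
import numpy as np
import pandas as pd

# Same sample data as the notebook's df
df = pd.DataFrame({'A': [1, 2, None, 4, np.nan],
                   'B': [np.nan] * 5,
                   'C': [11, 12, 13, 14, 15],
                   'D': [16, np.nan, 18, 19, 20]})

default_sum = df['A'].sum()                 # NaN skipped
strict_sum = df['A'].sum(skipna=False)      # NaN propagates
strict_mean = df['A'].mean(skipna=False)    # NaN propagates

empty_sum = df['B'].sum()                   # all-NaN column sums to 0 by default
empty_sum_checked = df['B'].sum(min_count=1)  # require at least 1 valid value

print(default_sum, strict_sum, strict_mean, empty_sum, empty_sum_checked)
```

A sum of 0 for an all-NaN column can hide the fact that there was no data, which is the case `min_count` is meant to expose.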

**dropna(), fillna() -- Cleaning / filling the missing data**<br>

*   Use dropna() to drop the rows (or columns) that contain null values.
*   Use fillna() to replace null values with a value of your choice.



In [None]:
df

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0


In [None]:
df.dropna(axis=0)  # axis=0 drops every row containing a null; column B is all NaN, so no rows remain

Unnamed: 0,A,B,C,D


In [None]:
df

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0


In [None]:
df.dropna(axis=1)  # axis=1 drops every column containing a null; only C remains

Unnamed: 0,C
0,11
1,12
2,13
3,14
4,15


In [None]:
df

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0


`thresh` is the minimum number of non-null values a row or column needs in order to be kept.<br>
thresh : int, optional<br>
With axis=1, thresh = 3 means it will drop any column that has fewer than 3 non-NaN values.<br>




In [None]:
df.dropna(axis=1,thresh=5)  # drop any column with fewer than 5 non-null values

Unnamed: 0,C
0,11
1,12
2,13
3,14
4,15


In [None]:
df.dropna(axis=1,thresh=3)

Unnamed: 0,A,C,D
0,1.0,11,16.0
1,2.0,12,
2,,13,18.0
3,4.0,14,19.0
4,,15,20.0
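Besides `axis` and `thresh`, `dropna()` accepts `how` and `subset`, which give finer control over what is dropped. The following is a small sketch on the same sample data (rebuilt so the cell is self-contained; the result names are illustrative):

```python
import numpy as np
import pandas as pd

# Same sample data as the notebook's df
df = pd.DataFrame({'A': [1, 2, None, 4, np.nan],
                   'B': [np.nan] * 5,
                   'C': [11, 12, 13, 14, 15],
                   'D': [16, np.nan, 18, 19, 20]})

# how='all' drops a column only if every value in it is null (here: B)
no_empty_columns = df.dropna(axis=1, how='all')

# subset restricts the null check to the listed columns when dropping rows
rows_with_a = df.dropna(subset=['A'])

# thresh also works on rows: keep rows with at least 3 non-null values
rows_mostly_complete = df.dropna(thresh=3)

print(no_empty_columns.columns.tolist())
print(rows_with_a.index.tolist())
print(rows_mostly_complete.index.tolist())
```

None of these calls modify `df`; each returns a new DataFrame.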


We can use fillna() to fill in the missing values.<br>
Pass inplace=True to change the DataFrame itself; without it, fillna() returns a new object and leaves the original unchanged.

In [None]:
df.fillna(value=2)  # replace every null value with 2

Unnamed: 0,A,B,C,D
0,1.0,2.0,11,16.0
1,2.0,2.0,12,2.0
2,2.0,2.0,13,18.0
3,4.0,2.0,14,19.0
4,2.0,2.0,15,20.0


In [None]:
df

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0


In [None]:
df['A'].fillna(value=df['A'].mean())  # replace nulls in column A with the column mean

Unnamed: 0,A
0,1.0
1,2.0
2,2.333333
3,4.0
4,2.333333
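When different columns need different fill values, `fillna()` also accepts a dictionary that maps column names to values. A minimal sketch on the same sample data (rebuilt so it runs on its own; filling A with its mean and D with its median is just an example choice):

```python
import numpy as np
import pandas as pd

# Same sample data as the notebook's df
df = pd.DataFrame({'A': [1, 2, None, 4, np.nan],
                   'B': [np.nan] * 5,
                   'C': [11, 12, 13, 14, 15],
                   'D': [16, np.nan, 18, 19, 20]})

# Columns not listed in the dictionary (here B) are left untouched
filled = df.fillna({'A': df['A'].mean(), 'D': df['D'].median()})

print(filled)
```

Because `inplace` is not set, `df` still contains its original nulls after this call.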


ffill (forward fill) -> each null value is replaced with the previous (upper) value in its column<br>
bfill (backward fill) -> each null value is replaced with the next (lower) value in its column



In [None]:
df

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0


In [None]:
df.ffill()  # forward fill: each null takes the previous value in its column (fillna(method='ffill') is deprecated in recent pandas)


Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,16.0
2,2.0,,13,18.0
3,4.0,,14,19.0
4,4.0,,15,20.0


In [None]:
df

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0


In [None]:
df.bfill()  # backward fill: each null takes the next value in its column (fillna(method='bfill') is deprecated in recent pandas)


Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,18.0
2,4.0,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0
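Forward and backward fill each leave gaps at one end: a forward fill cannot fill a leading null, and a backward fill cannot fill a trailing one, as the last row of column A shows. The sketch below chains the two and also shows the `limit` parameter, which caps how many consecutive nulls get filled. The series `gappy` is a made-up example, not part of the notebook's data:

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, np.nan, 4, np.nan], name='A')  # same values as column A
gappy = pd.Series([1.0, np.nan, np.nan, 4.0])       # hypothetical series with a gap of 2

# bfill leaves the trailing NaN; a following ffill closes it
both_ways = a.bfill().ffill()

# limit=1 fills at most one consecutive NaN per gap
limited = gappy.ffill(limit=1)

print(both_ways.tolist())
print(limited.tolist())
```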


The SimpleImputer class in scikit-learn handles missing data by replacing missing values with a specified constant or with a statistic (mean, median, or most frequent value) computed from each column. It is a simple and efficient way to impute missing values.

In [None]:
from sklearn.impute import SimpleImputer

In [41]:
df2 = df.copy()
df2

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0


In [42]:
imputer = SimpleImputer(strategy='constant',fill_value=-1)
df2['A'] = imputer.fit_transform(df2[['A']])  # every null value in column A is filled with -1
df2

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,-1.0,,13,18.0
3,4.0,,14,19.0
4,-1.0,,15,20.0


In [45]:
df2 = df.copy()
imputer = SimpleImputer(strategy='constant',fill_value=-1)
df2['B'] = imputer.fit_transform(df2[['B']])  # every null value in column B is filled with -1
df2

Unnamed: 0,A,B,C,D
0,1.0,-1.0,11,16.0
1,2.0,-1.0,12,
2,,-1.0,13,18.0
3,4.0,-1.0,14,19.0
4,,-1.0,15,20.0


In [46]:
df2 = df.copy()
imputer = SimpleImputer(strategy='mean')
df2['A'] = imputer.fit_transform(df2[['A']])  # every null value in column A is filled with the mean of A
df2

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,2.333333,,13,18.0
3,4.0,,14,19.0
4,2.333333,,15,20.0


In [47]:
df2 = df.copy()
imputer = SimpleImputer(strategy='mean')
df2 = imputer.fit_transform(df2)  # fills nulls with each column's mean; returns a NumPy array, and the all-NaN column B is dropped
df2



array([[ 1.        , 11.        , 16.        ],
       [ 2.        , 12.        , 18.25      ],
       [ 2.33333333, 13.        , 18.        ],
       [ 4.        , 14.        , 19.        ],
       [ 2.33333333, 15.        , 20.        ]])
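The array above has only three columns and no labels: the result is a plain NumPy array, and column B, which has no observed values to average, is left out. One way to get a labeled DataFrame back is sketched below. It assumes the imputer's default behavior of discarding columns with no observed values, and the name `kept_cols` is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same sample data as the notebook's df
df = pd.DataFrame({'A': [1, 2, None, 4, np.nan],
                   'B': [np.nan] * 5,
                   'C': [11, 12, 13, 14, 15],
                   'D': [16, np.nan, 18, 19, 20]})

# Columns with at least one observed value are the ones the imputer keeps
kept_cols = df.columns[df.notna().any()]

imputer = SimpleImputer(strategy='mean')
imputed = pd.DataFrame(imputer.fit_transform(df),
                       columns=kept_cols, index=df.index)

print(imputer.statistics_)  # per-column fill values learned during fit
print(imputed)
```

Note that every column of the result is float, including C, because the imputer works on a single numeric array.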

In [49]:
df2 = df.copy()
imputer = SimpleImputer(strategy='median')
df2['A'] = imputer.fit_transform(df2[['A']])  # every null value in column A is filled with the median of A
df2

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,2.0,,13,18.0
3,4.0,,14,19.0
4,2.0,,15,20.0


In [51]:
df2 = df.copy()
imputer = SimpleImputer(strategy='most_frequent')
df2['D'] = imputer.fit_transform(df2[['D']])  # nulls in D are filled with the most frequent value; all values are unique here, so the smallest (16.0) is used
df2

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,16.0
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0


The KNNImputer class in scikit-learn handles missing values by imputing them with the k-nearest neighbors algorithm. Each missing value is filled in from the values of the rows that are nearest to the row containing it.

In [52]:
from sklearn.impute import KNNImputer

In [53]:
df2 = df.copy()
df2

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0


In [54]:
knn_imputer = KNNImputer(n_neighbors=2, weights='uniform')  # each null becomes the unweighted average of its 2 nearest neighbors
A1 = knn_imputer.fit_transform(df2)
A1

array([[ 1., 11., 16.],
       [ 2., 12., 17.],
       [ 3., 13., 18.],
       [ 4., 14., 19.],
       [ 3., 15., 20.]])

In [55]:
df2 = pd.DataFrame(A1,columns=['A','C','D'])
df2

Unnamed: 0,A,C,D
0,1.0,11.0,16.0
1,2.0,12.0,17.0
2,3.0,13.0,18.0
3,4.0,14.0,19.0
4,3.0,15.0,20.0
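To see where a value such as the 17.0 in row 1 of column D comes from, it can help to look at the distances directly. The sketch below uses scikit-learn's `nan_euclidean_distances`, which compares rows using only the coordinates present in both. It works on columns A, C, and D only, leaving out the all-NaN column B by hand, and the variable names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Same sample data as the notebook's df, without the all-NaN column B
df = pd.DataFrame({'A': [1, 2, None, 4, np.nan],
                   'C': [11, 12, 13, 14, 15],
                   'D': [16, np.nan, 18, 19, 20]})
X = df.to_numpy(dtype=float)

# Distances from row 1 (the row missing D) to every row
dist_from_row1 = nan_euclidean_distances(X[[1]], X)[0]
print(dist_from_row1)

imputed = KNNImputer(n_neighbors=2, weights='uniform').fit_transform(X)

# Rows 0 and 2 are the closest rows that have a D value,
# so the filled value is the average of their D values
manual = (X[0, 2] + X[2, 2]) / 2
print(imputed[1, 2], manual)
```

With `weights='uniform'` the two neighbors count equally; `weights='distance'` would instead weight closer neighbors more heavily.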


In [56]:
# fill with your own chosen value; inplace=True modifies df itself
df.fillna(0, inplace=True)
df

Unnamed: 0,A,B,C,D
0,1.0,0.0,11,16.0
1,2.0,0.0,12,0.0
2,0.0,0.0,13,18.0
3,4.0,0.0,14,19.0
4,0.0,0.0,15,20.0
