**HANDLING MISSING DATA USING PANDAS**

“Pandas and Data Cleaning” is a beginner-friendly course designed to teach you how to clean, organize, and prepare raw data for analysis using the powerful Python library, Pandas. You will learn essential techniques like handling missing values, removing duplicates, parsing dates, renaming columns, filtering data, and transforming messy datasets into structured, usable formats. By the end of this course, you’ll have the practical skills to clean real-world data and make it ready for analysis or machine learning.

The first step in data cleaning is to find the missing data.

Effects of missing values:

1. Biased Analysis
Missing values can distort statistics such as the mean, median, or correlation.

2. Reduction in Model Accuracy
Many machine learning algorithms cannot handle missing values.

3. Loss of Information
Dropping rows with missing values can discard a large share of the data when missing values are frequent.

4. Incorrect Trends or Results
Missing values may hide important patterns or cause misleading results.

5. Impeded Feature Engineering
Missing values make it hard to create derived features or transformations.

Processes that Missing Values Hamper:

1. Exploratory Data Analysis (EDA): Statistics like averages and correlations become unreliable.

2. Data Visualization: Plots may look misleading or incomplete.

3. Machine Learning Pipelines: Most models fail if NaN values exist in the dataset.

4. Normalization and Scaling: Operations like min-max scaling or standardization cannot be done on missing values.

5. Time Series Analysis: Missing timestamps or values disrupt trends, seasonality detection, and forecasting.



Note: Feature engineering means turning raw data into meaningful features (input variables) that improve machine learning models.


In [2]:
import numpy as np 
import pandas as pd

**PANDAS UTILITY FUNCTIONS**

Just like NumPy, pandas has a few functions to identify and detect null values.

In [3]:
pd.isnull(np.nan)

#np.nan (Not a Number) is a special floating-point value in NumPy that represents a missing or undefined value.


True

In [4]:
pd.isnull(None)



True

In [5]:
pd.isna(np.nan)

True

In [6]:
pd.isna(None)

True

The opposite functions, pd.notnull() and pd.notna(), exist as well.

In [7]:
pd.notnull(None)

False

In [8]:
pd.notnull(np.nan)

False

In [9]:
pd.notnull(5)

True

These functions also work with Series and DataFrames.

In [10]:
pd.isnull(pd.Series([1, np.nan, 7]))

0    False
1     True
2    False
dtype: bool

In [11]:
pd.notnull(pd.Series([1, np.nan, 7]))

0     True
1    False
2     True
dtype: bool

In [12]:
pd.isnull(pd.DataFrame({
    "Column A" : [1, np.nan, 7],
    "Column B" : [np.nan, 2, 3],
    "Column C" : [np.nan, 2, np.nan]
}))

# Here we create the DataFrame from a dictionary for readability.
# A list of lists also works, but it is less readable than a dictionary.

   Column A  Column B  Column C
0     False      True      True
1      True     False     False
2     False     False      True


In [13]:
data = [
    [1, np.nan, np.nan],  
    [np.nan, 2, 2],        
    [7, 3, np.nan]        
]

df = pd.DataFrame(data, columns=["Column A", "Column B", "Column C"])
#Passing the data and the column names as the arguments when creating a data frame
print(df)

   Column A  Column B  Column C
0       1.0       NaN       NaN
1       NaN       2.0       2.0
2       7.0       3.0       NaN


**Pandas Operation with missing values**

Pandas manages missing values more gracefully than NumPy. In NumPy, np.nan behaves like a virus: it is a floating-point value, so any arithmetic or aggregation that touches it returns nan, and the missing value spreads through the results.
Pandas aggregations skip NaN values by default instead.
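The contrast can be sketched side by side. This is a minimal illustration; the variable names are made up for the example, while np.nansum and the skipna parameter are standard NumPy and pandas features.

```python
import numpy as np
import pandas as pd

values = [1.0, np.nan, 3.0]

# NumPy: a single NaN propagates through the aggregation, so the result is NaN.
numpy_total = np.array(values).sum()

# NumPy's NaN-aware variant has to be requested explicitly.
numpy_nan_total = np.nansum(values)

# pandas: NaN is skipped by default (skipna=True).
pandas_total = pd.Series(values).sum()

# The NumPy-like propagating behaviour is still available on request.
pandas_strict_total = pd.Series(values).sum(skipna=False)

print(numpy_total, numpy_nan_total, pandas_total, pandas_strict_total)
```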

In [14]:
pd.Series([1, np.nan ,7]).count()

np.int64(2)

In [15]:
pd.Series([1, np.nan, 3]).sum()

np.float64(4.0)

In [16]:
pd.Series([2, 4, np.nan]).mean()

np.float64(3.0)

In all of the above use cases, pandas ignored the np.nan values completely.

**Filtering missing data**

As we did with NumPy, we can combine boolean selection with pd.isnull() to filter out NaN and null values.

• pd.isnull() returns a mask showing missing values.

• Using that mask inside df[...] filters the rows accordingly.

• This combination is very helpful for data cleaning tasks such as finding rows with missing data and removing or filling missing data.


What is a mask in pandas (or in general)?

• A mask is basically a boolean array or Series of the same shape as your data that contains only True or False values.

• Each True or False corresponds to whether a condition is met for that particular element or row.

• It acts like a filter or “mask” that you can use to select or hide certain data.
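As a small sketch of how a mask selects rows of a DataFrame (the frame and variable names here are illustrative, not part of the notebook's data):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "Column A": [1, np.nan, 7],
    "Column B": [np.nan, 2, 3],
})

# Row mask: True where "Column A" is missing.
missing_a = frame["Column A"].isnull()

# Rows where "Column A" is missing.
rows_missing_a = frame[missing_a]

# Rows with at least one missing value in any column.
rows_with_any_null = frame[frame.isnull().any(axis=1)]

# Rows with no missing values at all (~ inverts the mask).
complete_rows = frame[~frame.isnull().any(axis=1)]

print(rows_missing_a)
print(rows_with_any_null)
print(complete_rows)
```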

In [17]:
s = pd.Series([1,2,3, np.nan, np.nan, 4])

In [18]:
pd.notnull(s)
# This is the boolean mask creation step


0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [19]:
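pd.notnull(s).count()
# count() counts non-null entries; the mask contains only True/False, so this is simply len(s).
# Use pd.notnull(s).sum() to count the non-null values of s instead.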

np.int64(6)

In [20]:
# Boolean masking (boolean indexing)
s[pd.notnull(s)]

# pd.notnull(s) first builds the boolean mask,
# then only the rows where the mask is True are kept.

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

Why do we call this boolean masking?

• You are using a boolean list (mask) as an index to “mask out” or filter elements of the Series.

• Only elements that satisfy the condition (notnull) remain.

In [21]:
# notnull() and isnull() are also methods of Series and DataFrame, so we can call them this way as well.
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [22]:
s.sum()

np.float64(10.0)

In [23]:
s.notnull()

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [24]:
s[s.notnull()]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

**Dropping null values**

Boolean selection with notnull() is a little verbose and repetitive. For this common task pandas offers a shorter, DRY alternative: the dropna() method.

In [25]:
s.dropna()

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

In [26]:
s.head()

# dropna() returns a new Series and leaves s unchanged, so the NaN values are still present here.

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
dtype: float64
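To keep the cleaned result, the return value of dropna() has to be stored. A minimal sketch, using a hypothetical Series named values rather than the notebook's s:

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3, np.nan, np.nan, 4])

# dropna() returns a new Series; the original keeps its NaN entries.
cleaned = values.dropna()
original_length = len(values)
cleaned_length = len(cleaned)

# Reassigning the name is the usual way to make the change stick.
values = values.dropna()

print(original_length, cleaned_length, len(values))
```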

Dropping null values on a DataFrame

In [27]:
df = pd.DataFrame({
    "Column A" : [1, np.nan, 7, 8],
    "Column B" : [np.nan, 2, 3, 9],
    "Column C" : [np.nan, 2, np.nan, 6],
    "Column D" : [1 , 2, 3 , np.nan]
    
})

In [28]:
df

   Column A  Column B  Column C  Column D
0       1.0       NaN       NaN       1.0
1       NaN       2.0       2.0       2.0
2       7.0       3.0       NaN       3.0
3       8.0       9.0       6.0       NaN


In [29]:
df.isnull()

   Column A  Column B  Column C  Column D
0     False      True      True     False
1      True     False     False     False
2     False     False      True     False
3     False     False     False      True


In [30]:
df.shape

(4, 4)

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Column A  3 non-null      float64
 1   Column B  3 non-null      float64
 2   Column C  2 non-null      float64
 3   Column D  3 non-null      float64
dtypes: float64(4)
memory usage: 256.0 bytes


In [32]:
df.sum()

Column A    16.0
Column B    14.0
Column C     8.0
Column D     6.0
dtype: float64

In [33]:
df.isnull().sum()

Column A    1
Column B    1
Column C    2
Column D    1
dtype: int64

In [34]:
df.dropna()
# Drops every row that contains at least one null value.
# Every row of df has a null somewhere, so the result is empty.

Empty DataFrame
Columns: [Column A, Column B, Column C, Column D]
Index: []


In [35]:
df.dropna(axis=1)
# axis=1 can also be written as axis="columns".
# Use it to drop columns, rather than rows, that contain null values.

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]


By default, dropna() removes any row (or column) that contains at least one null value, which can be too extreme depending on the case. You can control this behavior with the how parameter: "any" (the default) drops the row or column if at least one of its values is null, whereas "all" drops it only if all of its values are null.

In [36]:
df2 = pd.DataFrame({
    "Column A" : [1 , np.nan, 5],
    "Column B" : [2, 3, np.nan],
    "Column C" : [np.nan, 6, 7]
 })

In [37]:
df2

   Column A  Column B  Column C
0       1.0       2.0       NaN
1       NaN       3.0       6.0
2       5.0       NaN       7.0


By default, dropna() uses axis=0, which means it drops rows.
If we pass axis=1, it drops columns instead.

In [38]:
df.dropna(how='any')
# All rows are deleted because each one contains at least one null value.

Empty DataFrame
Columns: [Column A, Column B, Column C, Column D]
Index: []


In [39]:
df

   Column A  Column B  Column C  Column D
0       1.0       NaN       NaN       1.0
1       NaN       2.0       2.0       2.0
2       7.0       3.0       NaN       3.0
3       8.0       9.0       6.0       NaN


In [40]:
df.dropna(how='all')

   Column A  Column B  Column C  Column D
0       1.0       NaN       NaN       1.0
1       NaN       2.0       2.0       2.0
2       7.0       3.0       NaN       3.0
3       8.0       9.0       6.0       NaN
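Since no row of df is entirely null, how="all" leaves it unchanged. The effect is easier to see on a frame that does contain an all-null row and an all-null column. The frame below is a made-up example for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "Column A": [1, np.nan, np.nan],
    "Column B": [2, np.nan, 5],
    "Column C": [np.nan, np.nan, np.nan],
})

# Only row 1 is null in every column, so it is the only row dropped.
rows_kept = frame.dropna(how="all")

# Only "Column C" is null in every row, so it is the only column dropped.
cols_kept = frame.dropna(how="all", axis=1)

print(rows_kept)
print(cols_kept)
```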


We can also use the thresh parameter to set the minimum number of non-null values a row or column must have in order to be kept.

In [41]:
df.dropna(thresh=3)
# Keep only rows with at least thresh non-null values (3 here); row 0 has only 2, so it is dropped.

   Column A  Column B  Column C  Column D
1       NaN       2.0       2.0       2.0
2       7.0       3.0       NaN       3.0
3       8.0       9.0       6.0       NaN


In [42]:
df.dropna(thresh = 3, axis = 1)
# axis=1 is the same as axis="columns"; Column C has only 2 non-null values, so it is dropped.

   Column A  Column B  Column D
0       1.0       NaN       1.0
1       NaN       2.0       2.0
2       7.0       3.0       3.0
3       8.0       9.0       NaN
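One way to check your understanding of thresh is to reproduce it by counting non-null values per row and filtering with a mask. This is a sketch of an equivalent formulation for illustration, not how pandas implements it internally:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "Column A": [1, np.nan, 7, 8],
    "Column B": [np.nan, 2, 3, 9],
    "Column C": [np.nan, 2, np.nan, 6],
    "Column D": [1, 2, 3, np.nan],
})

# Number of non-null values in each row.
non_null_per_row = frame.notnull().sum(axis=1)

# thresh=3 keeps rows that have at least 3 non-null values ...
by_thresh = frame.dropna(thresh=3)

# ... which matches filtering with an explicit boolean mask.
by_mask = frame[non_null_per_row >= 3]

print(non_null_per_row)
print(by_thresh.equals(by_mask))
```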


**FILLING THE NULL VALUES**

Sometimes, instead of dropping null values, we use filling techniques (e.g., fillna() in pandas) to replace missing values with meaningful substitutes such as a constant, mean, median, mode, or forward/backward fill. This approach is useful because dropping rows or columns with null values can lead to loss of important data and reduce the dataset size, which might affect model performance. Filling nulls ensures that the data remains consistent and usable for further steps in the data science pipeline, such as feature engineering, model training, and visualization. It facilitates smooth processing by preventing errors that arise from missing values and allows algorithms to work effectively without interruptions.



**Common ways to fill the missing values**

Here are the common ways to fill missing values in pandas:

• Fill with a constant value

df.fillna(0)

• Fill with the column mean

df.fillna(df.mean(numeric_only=True))

• Fill with the column median

df.fillna(df.median(numeric_only=True))

• Fill with the column mode (most frequent value)

df.fillna(df.mode().iloc[0])

• Forward fill (propagate previous value)

df.ffill()

• Backward fill (propagate next value)

df.bfill()

• Fill with interpolation (estimate values)

df.interpolate()
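The median and mode variants are easiest to follow column by column. The sketch below uses a small made-up frame with one numeric and one categorical column; the column names are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "score": [10.0, np.nan, 30.0, 100.0],
    "grade": ["A", "B", None, "B"],
})

filled = frame.copy()

# Median is less sensitive to outliers (such as 100.0) than the mean.
filled["score"] = frame["score"].fillna(frame["score"].median())

# mode() returns a Series because ties are possible; take the first entry.
filled["grade"] = frame["grade"].fillna(frame["grade"].mode().iloc[0])

print(filled)
```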

1. Filling constant values

In [43]:
s.fillna(0)

0    1.0
1    2.0
2    3.0
3    0.0
4    0.0
5    4.0
dtype: float64

2. Filling the mean values


In [44]:
s.fillna(s.mean())

0    1.0
1    2.0
2    3.0
3    2.5
4    2.5
5    4.0
dtype: float64

In [45]:
df.fillna({"Column A": 0, "Column B" : 2, "Column C" : df["Column C"].mean()})
#We can also fill the respective values we need per respective columns

Unnamed: 0,Column A,Column B,Column C,Column D
0,1.0,2.0,4.0,1.0
1,0.0,2.0,2.0,2.0
2,7.0,3.0,4.0,3.0
3,8.0,9.0,6.0,


3. Filling nulls with contiguous (close) values

Filling nulls with contiguous (close) values means replacing missing values using nearby values from the dataset, instead of using a fixed value (like 0) or statistics (like mean or median).

Why Use This?

• In time-series data, the previous or next value often provides a logical guess for the missing value.

• Interpolation is useful when you assume the data changes gradually and smoothly over time.
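A short sketch of the difference on a daily time series. The dates and readings are invented for the example, and interpolate() is used with its default linear method:

```python
import numpy as np
import pandas as pd

readings = pd.Series(
    [10.0, np.nan, np.nan, 16.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# Forward fill repeats the last known reading until a new one arrives.
carried_forward = readings.ffill()

# Linear interpolation spreads the change evenly across the gap.
interpolated = readings.interpolate()

print(carried_forward)
print(interpolated)
```

Forward fill suits values that stay constant until they change (for example a price list), while interpolation suits quantities that drift gradually (for example temperature).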

a. Forward Fill (ffill)

In [46]:
df.fillna(method='ffill')

# Leaves a NaN when the first value of a column is missing, since there is no earlier value to copy

   Column A  Column B  Column C  Column D
0       1.0       NaN       NaN       1.0
1       1.0       2.0       2.0       2.0
2       7.0       3.0       2.0       3.0
3       8.0       9.0       6.0       3.0


b. Backward Fill (bfill)

In [47]:
df.fillna(method="bfill")
# Leaves a NaN when the last value of a column is missing, since there is no later value to copy

   Column A  Column B  Column C  Column D
0       1.0       2.0       2.0       1.0
1       7.0       2.0       2.0       2.0
2       7.0       3.0       6.0       3.0
3       8.0       9.0       6.0       NaN


Therefore, filling nulls with contiguous values often leaves null values at the extremes.

In [48]:
pd.Series([np.nan, 2, 5]).fillna(method ="ffill")

0    NaN
1    2.0
2    5.0
dtype: float64

In [49]:
pd.Series([np.nan, 2, np.nan]).fillna(method="bfill")

0    2.0
1    2.0
2    NaN
dtype: float64
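A common workaround for the leftover values at the extremes is to chain the two directions: fill forward first, then fill backward to cover any leading gap. The sketch below uses the ffill() and bfill() methods directly, and the sample Series is made up for illustration:

```python
import numpy as np
import pandas as pd

values = pd.Series([np.nan, 2, np.nan, 5, np.nan])

# Forward fill covers the middle and trailing gaps but not the leading one.
forward_only = values.ffill()

# A following backward fill covers the remaining leading gap.
filled = values.ffill().bfill()

print(forward_only)
print(filled)
```

Whether copying a later value backwards is acceptable depends on the data; for time series it leaks future information into the past, so treat this as a convenience rather than a default.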

Forward and backward fill work on DataFrames as well, along either axis.

In [50]:
df.fillna(method="ffill",axis = 0)

   Column A  Column B  Column C  Column D
0       1.0       NaN       NaN       1.0
1       1.0       2.0       2.0       2.0
2       7.0       3.0       2.0       3.0
3       8.0       9.0       6.0       3.0


In [51]:
df.fillna(method="ffill",axis = 1)
# With axis=1, values are filled from left to right along each row

   Column A  Column B  Column C  Column D
0       1.0       1.0       1.0       1.0
1       NaN       2.0       2.0       2.0
2       7.0       3.0       3.0       3.0
3       8.0       9.0       6.0       6.0


In [52]:
df.fillna(method="bfill",axis=0)

   Column A  Column B  Column C  Column D
0       1.0       2.0       2.0       1.0
1       7.0       2.0       2.0       2.0
2       7.0       3.0       6.0       3.0
3       8.0       9.0       6.0       NaN


In [53]:
df.fillna(method="bfill",axis=1)

# With axis=1, values are filled from right to left along each row

   Column A  Column B  Column C  Column D
0       1.0       1.0       1.0       1.0
1       2.0       2.0       2.0       2.0
2       7.0       3.0       3.0       3.0
3       8.0       9.0       6.0       NaN


**IMPORTANT**

Passing method="ffill" or method="bfill" to fillna() is deprecated and will be removed in a future version of pandas. Use the dedicated ffill() and bfill() methods instead, as shown below.

In [54]:
pd.Series([1,np.nan,3]).ffill()



0    1.0
1    1.0
2    3.0
dtype: float64

In [55]:
pd.Series([2,3,np.nan]).bfill()



0    2.0
1    3.0
2    NaN
dtype: float64

In [56]:
df.ffill()



   Column A  Column B  Column C  Column D
0       1.0       NaN       NaN       1.0
1       1.0       2.0       2.0       2.0
2       7.0       3.0       2.0       3.0
3       8.0       9.0       6.0       3.0


In [57]:
df.bfill()



   Column A  Column B  Column C  Column D
0       1.0       2.0       2.0       1.0
1       7.0       2.0       2.0       2.0
2       7.0       3.0       6.0       3.0
3       8.0       9.0       6.0       NaN


In [58]:
df.ffill(axis=1)



   Column A  Column B  Column C  Column D
0       1.0       1.0       1.0       1.0
1       NaN       2.0       2.0       2.0
2       7.0       3.0       3.0       3.0
3       8.0       9.0       6.0       6.0


In [59]:
df.bfill(axis=1)

   Column A  Column B  Column C  Column D
0       1.0       1.0       1.0       1.0
1       2.0       2.0       2.0       2.0
2       7.0       3.0       3.0       3.0
3       8.0       9.0       6.0       NaN


**Checking if there are NAs**


1. Checking the length

In [60]:
# If there are missing values, s.dropna() will have fewer values than s.
# To compare the two we use len() rather than count(), because count() already excludes NaN
# and would return the same number for both.

s.dropna().count()

np.int64(4)

In [61]:
len(s)

6

In [62]:
s.count()

np.int64(4)

In [63]:
print(s)

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64


In [64]:
missing_values = len(s.dropna()) != len(s)
missing_values

True

A more Pythonic solution: **any**

The any() and all() methods check whether at least one value in a Series is True or whether all values are True, respectively. They work the same way as Python's built-in any() and all().

Combined with isnull(), they tell us whether a column contains any null values or consists entirely of null values.

In [65]:
pd.Series([True, False, True]).any()
# True, because at least one of the values is True

np.True_

In [66]:
pd.Series([True, False, True]).all()
# False, because not all of the values are True

np.False_

In [67]:
pd.Series([True, True, True]).all()
# True, because all of the values are True

np.True_

In [68]:
pd.Series([True,np.nan, True]).isnull()

# isnull() returns True wherever there is a null value

0    False
1     True
2    False
dtype: bool

In [69]:
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [70]:
pd.Series([1, np.nan]).isnull().any()

# Note the parentheses: isnull and any are methods, so each one must be called, as in .isnull().any()

np.True_

In [71]:
pd.Series([1,2]).isnull().any()

np.False_

In [72]:
# The same check works on our Series s, and on a DataFrame it is applied column by column
s.isnull().any()

np.True_
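On a DataFrame, isnull().any() reports one result per column, and a second any() collapses that to a single answer for the whole frame. A minimal sketch with a made-up frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "Column A": [1, np.nan, 7],
    "Column B": [4, 5, 6],
})

# One boolean per column: does this column contain any null?
nulls_per_column = frame.isnull().any()

# A single boolean for the whole frame.
frame_has_nulls = frame.isnull().any().any()

# Names of the columns that need attention.
columns_with_nulls = nulls_per_column[nulls_per_column].index.tolist()

print(nulls_per_column)
print(frame_has_nulls, columns_with_nulls)
```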

**CLEANING NULL VALUES**

Handling null values is a crucial step in data cleaning because missing data can affect analysis, visualizations, and model performance. Nulls are not always represented as np.nan, None, or NaN; they can also appear as placeholders like empty strings (""), special characters ("-" or "?"), or text like "N/A", "null", or "missing". These inconsistencies make detecting nulls more challenging.

To properly clean null values, we first standardize all placeholders by converting them into recognized null values (np.nan). This can be done using pandas’ na_values parameter when reading files or by replacing values manually. Once identified, we can decide whether to drop these rows/columns (if they are sparse and unimportant) or fill them with meaningful values (such as mean, median, or forward/backward fill).

A robust null handling strategy ensures data integrity, prevents errors in computations, and keeps downstream tasks (like feature engineering and machine learning) accurate and reliable.
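Both standardization routes mentioned above can be sketched briefly. The placeholder list, column names, and CSV text below are assumptions made up for the example; in practice the list depends on how your data source marks missing entries.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical markers that this data source uses for "missing".
placeholders = ["?", "-", "N/A", "missing", ""]

# Route 1: the data is already loaded, so replace the markers with np.nan.
raw = pd.DataFrame({
    "city": ["Paris", "?", "N/A", "Lima"],
    "temp": ["21", "-", "19", "missing"],
})
cleaned = raw.replace(placeholders, np.nan)

# Route 2: declare the markers while reading, so they arrive as NaN.
csv_text = "city,temp\nParis,21\n?,-\nLima,missing\n"
from_csv = pd.read_csv(io.StringIO(csv_text), na_values=placeholders)

print(cleaned.isnull().sum())
print(from_csv)
```

Declaring the markers at read time has the side benefit that a numeric column such as temp is parsed as numbers instead of text.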

In [73]:
df = pd.DataFrame({
    "Sex" : ["M" , "F", "F", "D", "?"],
    "Age" : [29, 30, 24, 290, 25],
})

df

   Sex  Age
0    M   29
1    F   30
2    F   24
3    D  290
4    ?   25


The DataFrame above does not have any missing values, but some of its values are invalid: an age of 290 is not plausible, and the Sex column contains "D" and "?", which do not make sense.

**FINDING UNIQUE VALUES**


The first step in cleaning invalid values is to notice them, then identify them, and finally handle them appropriately (remove or replace them). For categorical data such as sex, which in this dataset should only take the values "M" or "F", we usually start by analyzing the variety of values present. For that we use the unique() method.

In [74]:
df["Sex"].unique()

array(['M', 'F', 'D', '?'], dtype=object)
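Once the set of valid categories is known, the offending rows can be located with a mask. The sketch below assumes, as this example does, that only "M" and "F" are valid codes; the names people and valid_codes are hypothetical:

```python
import pandas as pd

people = pd.DataFrame({
    "Sex": ["M", "F", "F", "D", "?"],
    "Age": [29, 30, 24, 290, 25],
})

# Assumed list of valid codes for this dataset.
valid_codes = ["M", "F"]

# True for rows whose Sex value is NOT one of the valid codes.
invalid_mask = ~people["Sex"].isin(valid_codes)

invalid_rows = people[invalid_mask]
print(invalid_rows)
```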

Here, the column is expected to contain only "M" and "F", so "D" and "?" are invalid entries, most likely typos. Suppose we contact the source of the data and learn that "D" was indeed a typo for "F" (female). Now that we know it was a mistake, we have to replace that value.

In [78]:
df["Sex"].value_counts()
# Note the parentheses: without them, pandas returns the bound method instead of the counts.

Sex
F    2
M    1
D    1
?    1
Name: count, dtype: int64

In [77]:
df["Sex"].replace("D","F")

# replace() takes two arguments: the value to be replaced and the value that replaces it.
# It returns a new Series; df itself is not modified.

0    M
1    F
2    F
3    F
4    ?
Name: Sex, dtype: object

You can provide multiple replacements using the replace() function in Pandas, and there is no fixed limit on how many replacements you can pass.

In [79]:
df["Sex"].replace({"D": "F", "N": "M"})

# Multiple replacements must be passed as a dictionary that maps each old value to its new value.
# A set such as {"D", "F", "N", "M"} has no key-value pairs, so it does not describe any replacement.

0    M
1    F
2    F
3    F
4    ?
Name: Sex, dtype: object

If we have many columns to replace then we can do that at a DataFrame level.

In [82]:
df.replace(
    {
        "Sex" : {
            "D" : "F",
            "N" : "M"
                },
        "Age" : {
            290 : 29
        }
    })

# Nested dictionaries map each column name to its own {old value: new value} replacements

   Sex  Age
0    M   29
1    F   30
2    F   24
3    F   29
4    ?   25


In the previous example, we assumed that 290 had an extra 0 and replaced it with 29. But what if we want to fix every age above a certain limit in the same way, without listing each value?

In [83]:
df[df["Age"] > 100]

   Sex  Age
3    D  290


In [84]:
too_high = df["Age"] > 100
df.loc[too_high, "Age"] = df.loc[too_high, "Age"] // 10
# Select the Age values above 100 and overwrite them with the value divided by 10
# (integer division keeps the column as integers)

In [85]:
df 


   Sex  Age
0    M   29
1    F   30
2    F   24
3    D   29
4    ?   25
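Note that the Sex column in the output above still shows "D": the earlier replace() calls returned new objects and were never assigned back. A minimal sketch of persisting the fix, using a hypothetical copy of the data named people and, as an assumption, treating "?" as a genuinely unknown value:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "Sex": ["M", "F", "F", "D", "?"],
    "Age": [29, 30, 24, 29, 25],
})

# Assign the result back to the column so the change is kept.
# "D" is the confirmed typo for "F"; "?" is unknown, so it becomes NaN.
people["Sex"] = people["Sex"].replace({"D": "F", "?": np.nan})

print(people)
print(people["Sex"].isnull().sum())
```

Converting the unknown marker to NaN means the detection and filling tools from the earlier sections apply to it as well.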
