## How to Handle Missing Data in a Pandas DataFrame

In [1]:
import pandas as pd
import numpy as np

In [2]:
url = 'https://raw.githubusercontent.com/nyangweso-rodgers/Data_Analytics/main/Datasets/Online_Retail.csv'
online_retial_data = pd.read_csv(url, encoding='unicode_escape')

### Shape of DataFrame

In [3]:
online_retial_data.shape

(541909, 8)

### Preview the DataFrame

In [4]:
online_retial_data.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 8:26,3.39,17850.0,United Kingdom


### How to Find Out Missing Values
* We check for missing values with the __isnull()__ method (an alias of __isna()__), which returns a DataFrame of boolean values indicating whether each field is null. Chaining the __sum()__ method then counts the missing values in each column.

In [5]:
# check for missing values
online_retial_data.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

In [6]:
# percentage of missing values in each column
online_retial_data.isnull().sum() * 100/online_retial_data.shape[0]

InvoiceNo       0.000000
StockCode       0.000000
Description     0.268311
Quantity        0.000000
InvoiceDate     0.000000
UnitPrice       0.000000
CustomerID     24.926694
Country         0.000000
dtype: float64
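
The count and the percentage can also be combined into a single summary table. The snippet below is a minimal sketch on a small, hypothetical `toy` DataFrame (not the retail data); the names `toy` and `missing_summary` are assumptions made for illustration. It relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the retail data
toy = pd.DataFrame({
    "Description": ["lantern", None, "holder", "hanger"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
    "Country": ["United Kingdom"] * 4,
})

# One row per column: number of missing values and their share in percent
missing_summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": toy.isna().mean() * 100,
})
print(missing_summary)
```

Applied to the retail data, the same pattern should reproduce the two outputs above side by side.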

### Extract Rows with Null values

In [7]:
data_records_with_null_values = online_retial_data[online_retial_data.isna().any(axis=1)]
data_records_with_null_values.to_csv("online_retail_data_with_null_values.csv", index=False)
# shape of the subset of rows that contain at least one null value
data_records_with_null_values.shape

(135080, 8)
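
Sometimes only the rows missing one particular column are of interest. As a rough sketch, again on a hypothetical `toy` DataFrame with made-up values, the same boolean-mask approach can be restricted to a single column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing Description, one missing CustomerID
toy = pd.DataFrame({
    "Description": [None, "lantern", "holder", "hanger"],
    "CustomerID": [17850.0, 13047.0, np.nan, 12583.0],
})

# Rows where one specific column is missing
missing_customer = toy[toy["CustomerID"].isna()]

# Rows where any column is missing (same pattern as the cell above)
missing_any = toy[toy.isna().any(axis=1)]

print(missing_customer.shape, missing_any.shape)
```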

### Dropping Rows with Null Values

In [8]:
# drop every row that contains at least one NaN value
# dropna() defaults to axis=0 (rows) and how='any', so no arguments are needed
updated_online_retail_data = online_retial_data.dropna()
# check the new shape
updated_online_retail_data.shape

(406829, 8)
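
Calling `dropna()` with no arguments discards a row if any column is missing. If only some columns matter, the `subset` parameter limits the check to them. The following is a small sketch on a hypothetical `toy` DataFrame, not the retail data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 0 lacks a Description, row 2 lacks a CustomerID
toy = pd.DataFrame({
    "Description": [None, "lantern", "holder", "hanger"],
    "CustomerID": [17850.0, 13047.0, np.nan, 12583.0],
})

# Default behaviour: drop a row if any column is missing
dropped_any = toy.dropna()

# Only drop rows whose Description is missing; a missing CustomerID is kept
dropped_subset = toy.dropna(subset=["Description"])

print(dropped_any.shape, dropped_subset.shape)
```

Whether to drop on every column or only a subset depends on which fields the later analysis actually needs.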

### Save the New DataFrame, without the Missing Values

In [9]:
# save the new dataframe to a csv file
# index=False keeps the DataFrame index from being written as an extra column
updated_online_retail_data.to_csv("online_retail_data_without_null_values.csv", index=False)
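
A quick sanity check after saving is to read the file back and confirm that no missing values remain. The sketch below uses an in-memory `io.StringIO` buffer in place of a file on disk and a hypothetical `clean` DataFrame, so both are assumptions for illustration only:

```python
import io

import pandas as pd

# Hypothetical cleaned frame with no missing values
clean = pd.DataFrame({
    "Description": ["lantern", "holder"],
    "CustomerID": [17850.0, 13047.0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# Rewind and read the CSV text back into a new DataFrame
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Total number of missing values across the whole reloaded frame
print(reloaded.isna().sum().sum())
```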