# Missing Data

## Import data and libraries

In [1]:
import os
import pandas as pd

In [3]:
# Get the current working directory (normally the folder the notebook runs in)
os_path = os.getcwd()
# Build the path to the dataset file in a platform-independent way
dataset_path = os.path.join(os_path, 'datasets', 'Online_Retail.csv')
retail_data = pd.read_csv(dataset_path, encoding='ISO-8859-1')
retail_data

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/10 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/10 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/10 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/10 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/10 8:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,12/9/11 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/11 12:50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/11 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/11 12:50,4.15,12680.0,France


## Methods to find missing values

### `isna()`

The `isna()` method in pandas detects missing values in a DataFrame or Series. It returns a boolean DataFrame or Series of the same shape, where each element is `True` if the corresponding value is missing (`NaN`, `None` or `NaT`) and `False` otherwise.

#### Syntax of `isna()`

```python
df_data.isna()
se_data.isna()
```
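
As a minimal sketch of what the returned mask looks like, the following uses a small made-up table (the `toy` DataFrame and its values are hypothetical, not part of the retail dataset). Note that an empty string is not treated as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the retail dataset
toy = pd.DataFrame({
    "item": ["lantern", None, ""],
    "price": [3.39, np.nan, 0.85],
})

mask = toy.isna()                 # boolean DataFrame with the same shape as toy
rows_with_gap = mask.any(axis=1)  # True for rows with at least one missing value

print(mask)
print(rows_with_gap.tolist())
```

The same mask can be used to select the incomplete rows, for example `toy[rows_with_gap]`.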

#### Get missing data

In [6]:
missing_data = retail_data.isna()
missing_data

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...
541904,False,False,False,False,False,False,False,False
541905,False,False,False,False,False,False,False,False
541906,False,False,False,False,False,False,False,False
541907,False,False,False,False,False,False,False,False


#### Count the missing data

In [7]:
missing_data_count = missing_data.sum()
missing_data_count

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
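
Raw counts are easier to judge when expressed as a share of the rows. The sketch below, on a hypothetical `toy` table standing in for `retail_data`, relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for retail_data
toy = pd.DataFrame({
    "Description": ["lantern", None, "napkins", "apron"],
    "CustomerID": [17850.0, np.nan, np.nan, 12680.0],
    "Quantity": [6, 8, 12, 6],
})

missing_count = toy.isna().sum()              # missing values per column
missing_share = toy.isna().mean() * 100       # percentage of rows missing per column
total_missing = int(toy.isna().sum().sum())   # missing values in the whole table

print(missing_share)
print(total_missing)
```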

## Removing missing data

If you decide to remove the rows that contain missing data, you can do so with `dropna()`.

### `dropna()`

This method deletes the rows and/or columns containing `null` or `NaN` values.

- `axis=0`: Delete the rows (default mode).

- `axis=1`: Delete the columns.

- `how='any'`: Delete any row or column that contains at least one `null` value (default).

- `how='all'`: Delete the row or column where all the values are null.

- `thresh=N`: Keep only the rows or columns that have at least `N` non-`null` values; the rest are deleted.

#### Remove all the rows with missing values
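
The options can be compared side by side on a small example. This is only a sketch on a hypothetical `toy` table; in pandas, `thresh` counts the non-missing values a row or column needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 2 is entirely missing, row 3 is complete
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

any_rows = toy.dropna()                  # how='any' (default): keeps only complete rows
all_rows = toy.dropna(how="all")         # drops only rows where every value is missing
thresh_rows = toy.dropna(thresh=2)       # keeps rows with at least 2 non-missing values
cols = toy.dropna(axis=1, thresh=3)      # keeps columns with at least 3 non-missing values

print(any_rows.index.tolist())
print(all_rows.index.tolist())
print(thresh_rows.index.tolist())
print(list(cols.columns))
```

None of these calls modifies `toy`; each returns a new DataFrame.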

If you compare the number of rows in this table with the original table, you will notice the difference.

In [8]:
removed_missing_rows = retail_data.dropna()
removed_missing_rows

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/10 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/10 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/10 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/10 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/10 8:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,12/9/11 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/11 12:50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/11 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/11 12:50,4.15,12680.0,France


#### Remove all the columns containing missing values

In this table the columns containing missing values are **Description** and **CustomerID**, so those 2 columns are deleted.

In [9]:
removed_missing_cols = retail_data.dropna(axis=1)
removed_missing_cols

Unnamed: 0,InvoiceNo,StockCode,Quantity,InvoiceDate,UnitPrice,Country
0,536365,85123A,6,12/1/10 8:26,2.55,United Kingdom
1,536365,71053,6,12/1/10 8:26,3.39,United Kingdom
2,536365,84406B,8,12/1/10 8:26,2.75,United Kingdom
3,536365,84029G,6,12/1/10 8:26,3.39,United Kingdom
4,536365,84029E,6,12/1/10 8:26,3.39,United Kingdom
...,...,...,...,...,...,...
541904,581587,22613,12,12/9/11 12:50,0.85,France
541905,581587,22899,6,12/9/11 12:50,2.10,France
541906,581587,23254,4,12/9/11 12:50,4.15,France
541907,581587,23255,4,12/9/11 12:50,4.15,France


## Fill the missing data

Instead of deleting the rows or columns with missing data, sometimes it's better to fill those cells with a placeholder value that can be used in our computations.

### `fillna()`

You can fill those `null` values with a specific value using the method `fillna()`.

- `method='ffill'`: Propagate the last known value forwards.

- `method='bfill'`: Propagate the next known value backwards.

- `value`: Assign any constant value like a number (like `0`) or a string (like `'NA'`).
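
The three strategies can be sketched on a short, made-up Series (the `prices` values are hypothetical). Recent pandas releases deprecate the `method=` argument of `fillna()`, so this sketch uses the equivalent `ffill()` and `bfill()` methods instead.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with gaps in the middle and at the end
prices = pd.Series([2.55, np.nan, np.nan, 4.15, np.nan])

forward = prices.ffill()      # copies the previous known value forwards
backward = prices.bfill()     # copies the next known value backwards
constant = prices.fillna(0)   # replaces every gap with a constant

print(forward.tolist())
print(backward.tolist())
print(constant.tolist())
```

A backward fill leaves the last gap untouched because no later value exists to copy from; a forward fill would do the same with a gap at the start.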

#### Fill the `null`s with a constant value

Fill the `null` values with `0`.

In [10]:
fill_missing_const = retail_data.fillna(0)
fill_missing_const

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/10 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/10 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/10 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/10 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/10 8:26,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,12/9/11 12:50,0.85,12680.0,France
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,12/9/11 12:50,2.10,12680.0,France
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,12/9/11 12:50,4.15,12680.0,France
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,12/9/11 12:50,4.15,12680.0,France


You can check that there are no more null values:

In [11]:
fill_missing_const.isna().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

#### Fill the `null`s with a **statistic value**

Sometimes it is more convenient to fill the missing values with the `mean`, `median` or `mode`.

- **Mean**: Useful when the data is roughly normally distributed.

- **Median**: Better for asymmetric distributions or data with many outliers.

- **Mode**: For categorical data.

In [17]:
mean_unitPrice = retail_data['UnitPrice'].mean()
fill_missing_mean = retail_data['UnitPrice'].fillna(mean_unitPrice)
print('Filling missing values with the mean:\n', fill_missing_mean)
print(30*'-')
print('Null values:', fill_missing_mean.isna().sum())

Filling missing values with the mean:
 0         2.55
1         3.39
2         2.75
3         3.39
4         3.39
          ... 
541904    0.85
541905    2.10
541906    4.15
541907    4.15
541908    4.95
Name: UnitPrice, Length: 541909, dtype: float64
------------------------------
Null values: 0
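
According to the counts shown earlier, **UnitPrice** has no missing values in this dataset, so the cell above leaves the column unchanged. The median and mode variants could be sketched as follows on a hypothetical `toy` table that does contain gaps (the values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric gap and one categorical gap
toy = pd.DataFrame({
    "UnitPrice": [2.55, 3.39, np.nan, 100.0],
    "Country": ["France", None, "France", "Spain"],
})

median_price = toy["UnitPrice"].median()       # not pulled up by the 100.0 outlier
mode_country = toy["Country"].mode().iloc[0]   # mode() returns a Series; take its first value

filled = toy.assign(
    UnitPrice=toy["UnitPrice"].fillna(median_price),
    Country=toy["Country"].fillna(mode_country),
)

print(filled)
print("Null values:", int(filled.isna().sum().sum()))
```

`mode()` can return several values when there is a tie, which is why the sketch picks the first one explicitly.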
