# ArbitraryImputer
This notebook shows the functionality of the ArbitraryImputer class. This transformer fills null values with a value set by the user.

In this notebook, two public example datasets are used to demonstrate the ArbitraryImputer and to verify that data types are preserved by it.

The logic of the `downcast_dtypes` function can be viewed in the `data_type_casting.py` file.

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing

import sys

sys.path.append("../..")

In [2]:
# import tubular
from tubular.imputers import ArbitraryImputer

In [3]:
def downcast_dtypes(df):
    # Downcast each int64/float64 column to the smallest dtype whose range
    # contains the column's min and max; all other columns are left as is.
    # Note: the check looks at the range only, so float downcasts can lose
    # precision, and the columns of the input dataframe are overwritten.
    for col in df.columns:
        if df[col].dtype == "int64":
            c_min = df[col].min()
            c_max = df[col].max()
            # Inclusive bounds, so values equal to a dtype's limits still fit
            if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)
            # Otherwise the column stays int64
        elif df[col].dtype == "float64":
            c_min = df[col].min()
            c_max = df[col].max()
            if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                df[col] = df[col].astype(np.float16)
            elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
            # Otherwise the column stays float64
    return df
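
As a point of comparison, pandas ships a built-in downcasting helper, `pd.to_numeric(..., downcast=...)`. The sketch below applies it to a toy frame (the column names and values are made up for illustration). Unlike the function above, its float downcast stops at float32 rather than float16.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not part of the notebook's datasets
toy = pd.DataFrame(
    {
        "small_int": np.array([1, 2, 3], dtype="int64"),
        "ratio": np.array([0.5, 1.5, 2.5], dtype="float64"),
    }
)

# Ask pandas for the smallest integer / float dtype that can hold the values
toy["small_int"] = pd.to_numeric(toy["small_int"], downcast="integer")
toy["ratio"] = pd.to_numeric(toy["ratio"], downcast="float")

print(toy.dtypes)
```

Here `small_int` ends up as int8 and `ratio` as float32.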

## Example 1: sklearn California housing dataset

### Load California housing dataset from sklearn

In [4]:
cali = fetch_california_housing()
cali_df = pd.DataFrame(cali["data"], columns=cali["feature_names"])
cali_df["AveOccup"] = cali_df["AveOccup"].sample(frac=0.99, random_state=1)
cali_df["HouseAge"] = cali_df["HouseAge"].sample(frac=0.95, random_state=2)
cali_df["Population"] = cali_df["Population"].sample(frac=0.995, random_state=3)
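
The three assignments above introduce missing values through index alignment: `sample(frac=...)` returns a subset of the rows, and assigning that subset back to the column leaves NaN in every row that was not sampled. A minimal sketch on a made-up ten-row frame (the column name `x` is hypothetical):

```python
import pandas as pd

# Hypothetical ten-row frame used only to show the mechanism
df = pd.DataFrame({"x": [float(i) for i in range(1, 11)]})

# Keep 80% of the rows; the assignment aligns on the index,
# so the 2 rows that were not sampled become NaN
df["x"] = df["x"].sample(frac=0.8, random_state=1)

n_missing = df["x"].isnull().sum()
print(n_missing)
```

With `frac=0.8` on ten rows, eight values survive and two become null.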

In [5]:
cali_df.shape

(20640, 8)

In [6]:
cali_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    19608 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20537 non-null  float64
 5   AveOccup    20434 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
dtypes: float64(8)
memory usage: 1.3 MB


In [7]:
cali_df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


In [8]:
cali_df.isnull().sum()

MedInc           0
HouseAge      1032
AveRooms         0
AveBedrms        0
Population     103
AveOccup       206
Latitude         0
Longitude        0
dtype: int64

In [9]:
# Pass the dataframe to downcast_dtypes function
cali_df = downcast_dtypes(cali_df)

cali_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float16
 1   HouseAge    19608 non-null  float16
 2   AveRooms    20640 non-null  float16
 3   AveBedrms   20640 non-null  float16
 4   Population  20537 non-null  float16
 5   AveOccup    20434 non-null  float16
 6   Latitude    20640 non-null  float16
 7   Longitude   20640 non-null  float16
dtypes: float16(8)
memory usage: 322.6 KB
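
Note that the downcast is chosen from each column's range alone, and float16 carries only about three significant decimal digits. The sketch below takes two values from the `head()` output above and shows how they are rounded when stored as float16.

```python
import numpy as np

# Values taken from the first rows of the California housing data
population = np.float16(2401.0)
latitude = np.float16(37.88)

print(float(population))         # nearest representable float16 value
print(float(latitude))
print(np.finfo(np.float16).max)  # largest finite float16 value
```

2401.0 is stored as 2400.0 and 37.88 as 37.875, so the range check passing does not mean the values are preserved exactly.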


## Simple usage

### Initialising ArbitraryImputer

The user must specify the value to impute with. This value is used to fill nulls in every column passed to the transformer, so the user must take care not to mix columns of different dtypes.
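
To illustrate why mixing dtypes matters, the sketch below uses plain pandas `fillna` as a stand-in for the imputer (an assumption made for illustration; it is not the ArbitraryImputer implementation) on a made-up frame with one numeric and one text column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing a float column and a text (object) column
toy = pd.DataFrame(
    {
        "age": pd.Series([41.0, np.nan, 52.0], dtype="float64"),
        "city": pd.Series(["a", None, "c"], dtype=object),
    }
)

# One numeric fill value applied to both columns
filled = toy.fillna(-1)

print(filled["age"].tolist())
print(filled["city"].tolist())
```

The numeric column is filled cleanly, while the text column ends up holding a number alongside strings, which is rarely what is wanted.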

In [10]:
imp_1 = ArbitraryImputer(
    columns=["HouseAge", "AveOccup", "Population"],
    impute_value=-1,
    copy=True,
    verbose=True,
)

BaseTransformer.__init__() called


### ArbitraryImputer fit
The ArbitraryImputer does not need to be fitted, as the user sets the impute value when initialising the object.

### ArbitraryImputer transform
Several columns were specified when creating `imp_1`, so all of them will be imputed when the transform method is run.

In [11]:
cali_df_2 = imp_1.transform(cali_df)

BaseTransformer.transform() called


In [12]:
cali_df_2[["HouseAge", "AveOccup", "Population"]].isnull().sum()

HouseAge      0
AveOccup      0
Population    0
dtype: int64

In [13]:
(cali_df_2[["HouseAge", "AveOccup", "Population"]] == -1).sum()

HouseAge      1032
AveOccup       206
Population     103
dtype: int64
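
Counting occurrences of -1 matches the number of imputed nulls only if -1 did not already occur in the column. A minimal sketch of that sanity check, using pandas `fillna` as a stand-in for the transformer and a made-up column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values
house_age = pd.Series([41.0, np.nan, 52.0, np.nan], name="HouseAge")
sentinel = -1

# Check that the sentinel is not already a genuine value in the data
sentinel_already_present = bool((house_age == sentinel).any())

nulls_before = int(house_age.isnull().sum())
filled = house_age.fillna(sentinel)
sentinel_count = int((filled == sentinel).sum())

print(sentinel_already_present, nulls_before, sentinel_count)
```

When the sentinel is absent beforehand, the two counts agree, which is the comparison the cell above relies on.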

In [14]:
cali_df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float16
 1   HouseAge    20640 non-null  float16
 2   AveRooms    20640 non-null  float16
 3   AveBedrms   20640 non-null  float16
 4   Population  20640 non-null  float16
 5   AveOccup    20640 non-null  float16
 6   Latitude    20640 non-null  float16
 7   Longitude   20640 non-null  float16
dtypes: float16(8)
memory usage: 322.6 KB
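
The dtype check above can also be done programmatically by comparing `dtypes` before and after imputation. The sketch uses pandas `fillna` as a stand-in for the transformer and a made-up frame (float32 and int8 were picked arbitrarily), so it only demonstrates the comparison pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical downcast frame with one column containing a null
before = pd.DataFrame(
    {
        "HouseAge": np.array([41.0, np.nan, 52.0], dtype="float32"),
        "flag": np.array([0, 1, 0], dtype="int8"),
    }
)

after = before.copy()
after["HouseAge"] = after["HouseAge"].fillna(-1)

# True only if every column kept its dtype
dtypes_preserved = bool((before.dtypes == after.dtypes).all())
print(dtypes_preserved)
```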


## Example 2: sklearn breast cancer dataset

In [15]:
# load sklearn breast cancer dataset
from sklearn.datasets import load_breast_cancer

# loading dataset
data = load_breast_cancer()

# creating pandas dataframe
breast_cancer_df = pd.DataFrame(data.data, columns=data.feature_names)
# Taking only first 10 columns
breast_cancer_df = breast_cancer_df.iloc[:, :10]

# add target variable
breast_cancer_df["target"] = data.target

# Adding missing values
breast_cancer_df["mean radius"] = breast_cancer_df["mean radius"].sample(
    frac=0.99, random_state=1
)
breast_cancer_df["mean texture"] = breast_cancer_df["mean texture"].sample(
    frac=0.95, random_state=2
)
breast_cancer_df["mean perimeter"] = breast_cancer_df["mean perimeter"].sample(
    frac=0.995, random_state=3
)

In [16]:
breast_cancer_df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | 0 |


In [17]:
breast_cancer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   mean radius             563 non-null    float64
 1   mean texture            541 non-null    float64
 2   mean perimeter          566 non-null    float64
 3   mean area               569 non-null    float64
 4   mean smoothness         569 non-null    float64
 5   mean compactness        569 non-null    float64
 6   mean concavity          569 non-null    float64
 7   mean concave points     569 non-null    float64
 8   mean symmetry           569 non-null    float64
 9   mean fractal dimension  569 non-null    float64
 10  target                  569 non-null    int64  
dtypes: float64(10), int64(1)
memory usage: 49.0 KB


In [18]:
breast_cancer_df.isnull().sum()

mean radius                6
mean texture              28
mean perimeter             3
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
target                     0
dtype: int64

In [19]:
# Downcast the dataframe
breast_cancer_df = downcast_dtypes(breast_cancer_df)

breast_cancer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   mean radius             563 non-null    float16
 1   mean texture            541 non-null    float16
 2   mean perimeter          566 non-null    float16
 3   mean area               569 non-null    float16
 4   mean smoothness         569 non-null    float16
 5   mean compactness        569 non-null    float16
 6   mean concavity          569 non-null    float16
 7   mean concave points     569 non-null    float16
 8   mean symmetry           569 non-null    float16
 9   mean fractal dimension  569 non-null    float16
 10  target                  569 non-null    int8   
dtypes: float16(10), int8(1)
memory usage: 11.8 KB
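
The drop in reported memory usage follows from the per-value sizes of the dtypes. A quick back-of-the-envelope check, ignoring the small fixed overhead of the index:

```python
import numpy as np

n_rows = 569

# Before: 10 float64 columns and 1 int64 column
before_bytes = n_rows * (
    10 * np.dtype("float64").itemsize + np.dtype("int64").itemsize
)

# After: 10 float16 columns and 1 int8 column
after_bytes = n_rows * (
    10 * np.dtype("float16").itemsize + np.dtype("int8").itemsize
)

print(before_bytes, after_bytes)
```

This gives 50072 bytes before and 11949 bytes after, in line with the roughly 49 KB and 11.8 KB reported by `info()`.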


In [20]:
# Initialize ArbitraryImputer
imp_2 = ArbitraryImputer(
    columns=["mean radius", "mean texture", "mean perimeter"],
    impute_value=-1,
    copy=True,
    verbose=True,
)

BaseTransformer.__init__() called


In [21]:
# ArbitraryImputer transform
breast_cancer_df_2 = imp_2.transform(breast_cancer_df)

BaseTransformer.transform() called


In [22]:
breast_cancer_df_2[["mean radius", "mean texture", "mean perimeter"]].isnull().sum()

mean radius       0
mean texture      0
mean perimeter    0
dtype: int64

In [23]:
(breast_cancer_df_2[["mean radius", "mean texture", "mean perimeter"]] == -1).sum()

mean radius        6
mean texture      28
mean perimeter     3
dtype: int64

In [24]:
breast_cancer_df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   mean radius             569 non-null    float16
 1   mean texture            569 non-null    float16
 2   mean perimeter          569 non-null    float16
 3   mean area               569 non-null    float16
 4   mean smoothness         569 non-null    float16
 5   mean compactness        569 non-null    float16
 6   mean concavity          569 non-null    float16
 7   mean concave points     569 non-null    float16
 8   mean symmetry           569 non-null    float16
 9   mean fractal dimension  569 non-null    float16
 10  target                  569 non-null    int8   
dtypes: float16(10), int8(1)
memory usage: 11.8 KB
