
# **Missing Indicator**
In machine learning, a missing indicator is a binary feature (0 or 1) that records whether a value in a dataset was missing before imputation. It is a feature engineering technique for handling missing data: the imputed value fills the gap, while the indicator preserves the fact that the gap existed.

## **Why Use a Missing Indicator?**
Sometimes the fact that a value is missing carries predictive information. For example:

  1. A missing blood test value might mean the test was never ordered because the patient wasn't considered high-risk.

  2. A missing attendance record for a student might indicate that the student was absent.

Adding a missing indicator lets the model learn from the pattern of missingness, which imputation alone would erase.
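
As a minimal sketch of the idea, assuming a small made-up DataFrame (the `glucose` column and its values are hypothetical and not part of the dataset used below), the indicator is just a 0/1 column recorded before the missing values are filled in:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; np.nan marks a missing measurement.
toy = pd.DataFrame({"glucose": [5.1, np.nan, 6.3, np.nan]})

# Record missingness first: 1 where the value was absent, 0 otherwise.
toy["glucose_NA"] = toy["glucose"].isna().astype(int)

# Then impute, here with the column mean (which ignores NaN by default).
toy["glucose"] = toy["glucose"].fillna(toy["glucose"].mean())

print(toy)
```

The order matters: once the column has been imputed, `isna()` can no longer tell which rows were originally missing.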

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.impute import MissingIndicator, SimpleImputer

In [6]:
df=pd.read_csv('training.csv',usecols=['Age','Fare','Survived'])

In [7]:
df.sample(5)

Unnamed: 0,Survived,Age,Fare
555,0,62.0,26.55
375,1,,82.1708
562,0,28.0,13.5
351,0,,35.0
581,1,39.0,110.8833


In [8]:
X=df.drop(columns=['Survived'])
y=df['Survived']

In [9]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=2)

In [10]:
X_train.sample(5)

Unnamed: 0,Age,Fare
40,40.0,9.475
559,36.0,17.4
569,32.0,7.8542
554,22.0,7.775
464,,8.05


In [12]:
# Baseline: imputation only (SimpleImputer defaults to strategy='mean')
si=SimpleImputer()
X_train_trf=si.fit_transform(X_train)
X_test_trf=si.transform(X_test)

In [14]:
from sklearn.linear_model import LogisticRegression
clf=LogisticRegression()
clf.fit(X_train_trf,y_train)
y_pred=clf.predict(X_test_trf)

from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.6145251396648045

In [15]:
# Now using MissingIndicator: fit learns which columns contain missing values
mi=MissingIndicator()
mi.fit(X_train)


In [16]:
mi.features_

array([0])
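
`features_` holds the indices of the columns for which an indicator will be produced. With the default `features='missing-only'`, only columns that contained missing values during `fit` are included, so `array([0])` here means that only the first column (`Age`) had missing values in the training split. A small sketch on a made-up array (the values are illustrative only) contrasts this with `features='all'`:

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Made-up data: column 0 has a missing value, column 1 is complete.
X_toy = np.array([[np.nan, 1.0],
                  [30.0, 2.0],
                  [25.0, 3.0]])

# Default: flag only the columns that had missing values at fit time.
mi_default = MissingIndicator()  # features='missing-only'
mask_default = mi_default.fit_transform(X_toy)

# features='all': one indicator column per input column.
mi_all = MissingIndicator(features="all")
mask_all = mi_all.fit_transform(X_toy)

print(mi_default.features_, mask_default.shape)
print(mi_all.features_, mask_all.shape)
```

With the default setting, `transform` raises an error if a column that was complete during `fit` later contains missing values (`error_on_new=True`), which is worth keeping in mind when the test split may differ from the training split.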

In [17]:
X_train_missing=mi.transform(X_train)
X_test_missing=mi.transform(X_test)

In [18]:
X_train_missing

array([[False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       ...
(output truncated; one boolean per training row, True where Age is missing)

In [19]:
X_test_missing

array([[False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [ True],
       ...
(output truncated; one boolean per test row, True where Age is missing)

In [20]:
X_train

Unnamed: 0,Age,Fare
30,40.0,27.7208
10,4.0,16.7000
873,47.0,9.0000
182,9.0,31.3875
876,20.0,9.8458
...,...,...
534,30.0,8.6625
584,,8.7125
493,71.0,49.5042
527,,221.7792


In [21]:
X_train['Age_NA']=X_train_missing

In [22]:
X_train

Unnamed: 0,Age,Fare,Age_NA
30,40.0,27.7208,False
10,4.0,16.7000,False
873,47.0,9.0000,False
182,9.0,31.3875,False
876,20.0,9.8458,False
...,...,...,...
534,30.0,8.6625,False
584,,8.7125,True
493,71.0,49.5042,False
527,,221.7792,True


In [23]:
X_test['Age_NA']=X_test_missing

In [24]:
# Fit the imputer on the training data only, then reuse its statistics on the test data
X_train_trf2=si.fit_transform(X_train)
X_test_trf2=si.transform(X_test)

clf.fit(X_train_trf2,y_train)
y_pred=clf.predict(X_test_trf2)

accuracy_score(y_test,y_pred)

0.6312849162011173
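
The imputed values and the indicator column can also be produced in a single step instead of being assembled by hand. The following is a sketch, not the notebook's method: it assumes `make_union` from `sklearn.pipeline` and uses a tiny made-up array in place of the `Age` and `Fare` columns, so it does not reproduce the accuracy above.

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.pipeline import make_union

# Made-up stand-in for the Age/Fare columns (values are illustrative).
X_train_toy = np.array([[22.0, 7.25],
                        [np.nan, 8.05],
                        [38.0, 71.28],
                        [26.0, 7.92]])
X_test_toy = np.array([[np.nan, 13.0],
                       [35.0, 53.1]])

# Imputed columns first, then one indicator column per feature with NaNs.
union = make_union(SimpleImputer(strategy="mean"), MissingIndicator())

# Fit on the training split only; reuse the learned mean on the test split.
X_train_out = union.fit_transform(X_train_toy)
X_test_out = union.transform(X_test_toy)

print(X_test_out)
```

Because both transformers are fitted together on the training split, the test rows are imputed with the training mean and the indicator is computed on the raw values, before any imputation takes place.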

In [25]:
# Same idea without the MissingIndicator class: SimpleImputer(add_indicator=True) appends the indicator columns itself
df3=pd.read_csv('training.csv',usecols=['Age','Fare','Survived'])

In [26]:
X=df3.drop(columns=['Survived'])
y=df3['Survived']

In [27]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=2)

In [28]:
si=SimpleImputer(add_indicator=True)
X_train=si.fit_transform(X_train)
X_test=si.transform(X_test)

In [29]:
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

accuracy_score(y_test,y_pred)

0.6312849162011173
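
Imputation with `add_indicator=True` and the classifier can be chained so that the imputer is always fitted on the training split alone. This is a hedged sketch using `make_pipeline`; the synthetic data is an assumption standing in for `training.csv`, so the printed score is unrelated to the accuracies reported above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for illustration, not the Titanic file).
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 2))
y_syn = (X_syn[:, 1] > 0).astype(int)
X_syn[rng.random(200) < 0.2, 0] = np.nan  # knock out roughly 20% of column 0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=2
)

# The imputer learns its mean, and which columns get an indicator, from X_tr only.
model = make_pipeline(SimpleImputer(add_indicator=True), LogisticRegression())
model.fit(X_tr, y_tr)

print(model.score(X_te, y_te))
```

Calling `model.predict` or `model.score` on new data applies the already fitted imputer before the classifier, so there is no separate `transform` call to get wrong.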