In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Imputation**

Datasets may have missing values, and this can cause problems for many machine learning algorithms.

As such, it is good practice to identify and replace missing values for each column in your input data prior to modeling your prediction task. This is called missing data imputation, or imputing for short.

**Imputation replaces missing data with substitute values so that most of the data and information in the dataset is retained. Simply removing rows or columns with missing values is often not feasible: it can shrink the dataset to a large extent, which not only risks biasing the dataset but can also lead to incorrect analysis.**

![imp1](https://miro.medium.com/max/1400/1*PovaJ2Ka7PdlJinqjHAwDQ.png)

#### **Note**

* *Missing values must be marked as NaN; they can then be replaced with a statistic (such as the mean or median) calculated from the observed values in the same column.*
* *Load the CSV file so that missing values are marked as NaN, then report the number and percentage of missing values in each column.*
* *Impute missing values with statistics as a data preparation step, both when evaluating models and when fitting a final model to make predictions on new data.*
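
The second note can be sketched on a tiny, made-up CSV. The column names, the values, and the `?` placeholder below are illustrative assumptions and do not come from the competition data.

```python
import io
import pandas as pd

# Hypothetical CSV text: empty fields and "?" both stand for missing values.
csv_text = """age,income,city
25,50000,Paris
,62000,?
31,,Berlin
47,58000,
"""

# Empty fields are read as NaN by default; na_values adds "?" as an extra NaN marker.
toy = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

# Number and percentage of missing values per column.
report = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})
print(report)
```

The same two expressions, `isnull().sum()` and `isnull().mean() * 100`, work unchanged on any DataFrame.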

## **Why Is It Needed?**

We use imputation because missing data can cause the following issues:

1. **Incompatible with most of the Python libraries used in Machine Learning:-** Yes, you read it right. Many estimators in ML libraries (the most common being scikit-learn) have no provision for handling missing data automatically and raise errors when they encounter it.
2. **Distortion in Dataset:-** A huge amount of missing data can distort the distribution of a variable, i.e. it can increase or decrease the share of a particular category in the dataset.
3. **Affects the Final Model:-** Missing data can bias the dataset and lead the model to a faulty analysis.
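
The first point can be illustrated with a minimal sketch. The toy arrays are made up for this example; the only claim is that scikit-learn's `LinearRegression` refuses input that contains NaN.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy feature matrix with one missing entry (illustrative values only).
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
y = np.array([1.0, 2.0, 3.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates its input and rejects NaN values.
    fit_failed = True
    print("fit failed:", err)
```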

Importantly, “we want to restore the complete dataset”. This applies mostly when we do not want to lose any data because all of it is important, or when the dataset is not very big and removing part of it can have a significant impact on the final model.

## **Types**

![imp2](https://1.bp.blogspot.com/-gF6uDeQi8ck/X0KeUfSFrgI/AAAAAAAAKUQ/PwsXhRPAjDsiWQlwnQn7085QAkU6IJJ_wCLcBGAsYHQ/s1357/data%2Bvariables.PNG)

# **Top 6 Techniques with Code**

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
data=pd.read_csv('/kaggle/input/tabular-playground-series-jun-2022/data.csv',index_col='row_id')
data

# **Null Analysis**

In [None]:
data.info()

## **Number of Missing Values in Each of the Columns**

In [None]:
n_null=data.isnull().sum()
n_null

## **Visualize**

In [None]:
import missingno as msno
msno.matrix(data.sample(500),color=(1, 0.38, 0.27))

**The white bars represent Null Values**

## **Columns with NULL Values**

In [None]:
null=data.columns[data.isnull().any()]
null

## **Distribution of Features**

In [None]:
plt.rcParams["figure.figsize"] = (25,20)
# set the axes background before the axes are created so that it applies to this figure
plt.rcParams.update({'axes.facecolor':'lightgreen'})

fig, ax = plt.subplots(9,9, facecolor='red')

# adds title to figure
fig.text(0.35,1,'Distribution of Features',{'size': 24})

# row_id is the index, so data.columns holds only the feature columns
for axis, col in zip(ax.ravel(), data.columns):
    axis.hist(data[col].dropna(), bins=100) #plots histogram of the observed values
    axis.set_title(col, #adds a title to the subplot
                   {'size': '14', 'weight': 'bold'})

# hides the subplots left over when there are fewer columns than grid cells
for axis in ax.ravel()[len(data.columns):]:
    axis.set_visible(False)

plt.show()

## **Correlation-Heatmap**

In [None]:
import missingno as msno
msno.heatmap(data,cmap="RdYlGn")

## **Feature-Wise NAN Value Distribution**

In [None]:
import seaborn as sns
plt.rcParams.update({'axes.facecolor':'black'})
plot = sns.histplot(data=n_null, bins=10, stat="percent")
plot.set_xlabel('The number of NaN values')

## **Dendrogram**

In [None]:
import missingno as msno
msno.dendrogram(data, figsize=(20,15), fontsize=12);

# **Top 6 Popular Imputers**

***These are some popular techniques. There are many other third-party imputers (and more keep coming up)***

## **1. Constant Imputer:**

You may replace the missing values with a constant. The constant can be numeric or a string.

![fillna](https://www.w3resource.com/w3r_images/pandas-series-fillna-image-2.svg)

Note that imputation is not always required: sometimes you can let the algorithm handle the missing data. Some algorithms factor in the missing values and learn how best to treat them based on the training loss reduction (e.g. XGBoost). Others handle them natively and let you switch that behaviour off (e.g. LightGBM with use_missing=false). However, many algorithms throw an error complaining about the missing values (e.g. scikit-learn's LinearRegression). In that case, you need to handle the missing data and clean it before feeding it to the algorithm.

In [None]:
# Fill with Numeric
data_num=data.fillna(0)
data_num.isnull().sum().sum()

In [None]:
# Fill with String
data_str=data.fillna('kaggle')
data_str.isnull().sum().sum()

## **2. Numerical : Mean/Median Imputer:**

This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within each column separately and independently from the others. It can only be used with numeric data.

![mimp](https://miro.medium.com/max/1400/1*MiJ_HpTbZECYjjF1qepNNQ.png)

#### **Pros:**
* Easy and fast.
* Works well with small numerical datasets.

#### **Cons:**
* Doesn’t factor in the correlations between features. It only works at the column level.
* Will give poor results on encoded categorical features (do NOT use it on categorical features).
* Not very accurate.
* Doesn’t account for the uncertainty in the imputations.
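
A small sketch of how the two strategies differ when a column contains an outlier. The arrays are invented for illustration; the imputer is fitted on the training rows only and then reused on new rows, which matches the note above about preparing data for both evaluation and final predictions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training data: the first column has an outlier (100.0).
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [100.0, 20.0]])
# New rows to be imputed with the statistics learned from X_train.
X_new = np.array([[np.nan, 15.0], [5.0, np.nan]])

mean_imp = SimpleImputer(strategy="mean").fit(X_train)
median_imp = SimpleImputer(strategy="median").fit(X_train)

# Per-column fill values learned from the non-missing training entries.
print(mean_imp.statistics_)    # [26.5 20. ]
print(median_imp.statistics_)  # [ 2.5 20. ]

# The median is far less affected by the outlier than the mean.
X_new_filled = median_imp.transform(X_new)
print(X_new_filled)
```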

In [None]:
# Impute with Mean
from sklearn.impute import SimpleImputer

imp=SimpleImputer(strategy='mean')

data_mean=imp.fit_transform(data)
data_mean

## **3. Numerical & Categorical : Most Frequent Imputer:**

Most Frequent is another statistical strategy to impute missing values and YES!! It works with categorical features (strings or numerical representations) by replacing missing data with the most frequent values within each column.

![mimp](https://miro.medium.com/max/1208/1*bgzrL1JLxno2igi4M20tgA.png)

#### **Pros:**
* Works well with categorical features.

#### **Cons:**
* It also doesn’t factor in the correlations between features.
* It can introduce bias in the data.
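
The competition data is purely numeric, so here is a minimal sketch on a made-up array that mixes a string column and a numeric column, to show the strategy working on categorical values. The column names `colour` and `size` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy object array: first column is categorical, second is numeric.
X = np.array(
    [["red", 1], ["blue", 2], [np.nan, 2], ["red", np.nan]],
    dtype=object,
)

imp = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imp.fit_transform(X), columns=["colour", "size"])

# The missing colour becomes "red" and the missing size becomes 2,
# the most frequent observed value in each column.
print(filled)
```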


In [None]:
# Impute with Most Frequent (also for Categorical Features)
from sklearn.impute import SimpleImputer

imp=SimpleImputer(strategy='most_frequent')

data_mode=imp.fit_transform(data)
data_mode

## **4. k-NN Imputer:**

k-nearest neighbours (k-NN) is a simple algorithm commonly used for classification. It uses ‘feature similarity’ to predict the values of new data points: a new point is assigned a value based on how closely it resembles the points in the training set. This is useful for predicting missing values: find the k closest neighbours of the observation with missing data, then impute the missing entries from the non-missing values in that neighbourhood.

![kimp](https://miro.medium.com/max/1280/1*b9BXv0uAkbSAn8MJIa4-_Q.gif)

#### **Pros:**
* Can be much more accurate than the mean, median or most frequent imputation methods (It depends on the dataset).

#### **Cons:**
* Computationally expensive. KNN works by storing the whole training dataset in memory.
* K-NN is quite sensitive to outliers in the data (unlike SVM)

In [None]:
# from sklearn.impute import KNNImputer

# knn = KNNImputer(n_neighbors=5)

# data_knn=knn.fit_transform(data)

# data_knn    

**kNN imputers are computationally very resource-intensive. Here, running one caused this notebook to crash and restart, so the cell is commented out. However, you can still use the code as a reference.**
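
To see what the imputer does without the cost of the full dataset, here is a sketch on a four-row toy array (the values are illustrative, not taken from the competition data). Each missing entry is replaced by the mean of that feature over the 2 nearest rows that have it, with distances computed on the coordinates both rows share.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data with two missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

knn = KNNImputer(n_neighbors=2)
X_knn = knn.fit_transform(X)

# Row 0, column 2 -> mean of 3.0 and 5.0 = 4.0 (rows 1 and 2 are nearest)
# Row 2, column 0 -> mean of 3.0 and 8.0 = 5.5 (rows 1 and 3 are nearest)
print(X_knn)
```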

## **5. Indicator Imputer:**

Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing.

![indi](https://www.researchgate.net/profile/Antonio-Pereira-Barata-2/publication/338597970/figure/fig1/AS:849726220029952@1579601926055/Imputation-through-missing-indicator-The-table-to-the-left-represents-a-4-rows-slice-of.png)
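
A minimal sketch of the idea on a made-up three-column frame. With `add_indicator=True`, the output holds the imputed features first, followed by one 0/1 flag column per feature that had missing values during `fit`. The `_was_missing` suffix is a naming choice made for this example, not something scikit-learn produces.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame: columns "a" and "c" have missing values, "b" does not.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, 8.0, 10.0],
})

imp = SimpleImputer(strategy="median", add_indicator=True)
out = imp.fit_transform(toy)

# indicator_.features_ lists the positions of the columns that had missing values.
flag_names = [f"{toy.columns[i]}_was_missing" for i in imp.indicator_.features_]
result = pd.DataFrame(out, columns=list(toy.columns) + flag_names)
print(result)
```

Restoring the column names this way makes it easier to tell the imputed features apart from the indicator columns later on.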

In [None]:
from sklearn.impute import SimpleImputer

imp=SimpleImputer(strategy='median',add_indicator=True)

data_ind=imp.fit_transform(data)
data_ind=pd.DataFrame(data_ind)
data_ind

## **6. Iterative Imputer:**

Useful only when working with multivariate data, the IterativeImputer in scikit-learn utilizes the data available in other features in order to estimate the missing values being imputed. 

***It does so through an iterated round-robin fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned***

![ii](https://amueller.github.io/COMS4995-s18/slides/aml-08-021218-imputation-feature-selection/images/img_4.png)

In [None]:
# # This estimator is still experimental for now: the predictions and the API might change without any deprecation cycle

# from sklearn.experimental import enable_iterative_imputer
# from sklearn.impute import IterativeImputer
# from sklearn.linear_model import LinearRegression

# imp=IterativeImputer(estimator=LinearRegression(),missing_values=np.nan)

# data_ii=imp.fit_transform(data)
# data_ii

This was taking too long to run on the full dataset, so the cell is commented out.
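
The round-robin procedure can still be tried on a toy array, where it runs instantly. In this made-up example the second column is roughly twice the first, so a linear regressor has a relationship to exploit; the values, `max_iter=10` and `random_state=0` are illustrative choices.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Toy data: one missing value in each column.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [np.nan, 10.0],
])

imp = IterativeImputer(estimator=LinearRegression(), max_iter=10, random_state=0)
X_filled = imp.fit_transform(X)

# Observed entries are kept as they are; only the NaN entries are estimated
# from the other column.
print(X_filled)
```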

# Suggestions:-
* Kaggle - https://www.kaggle.com/pythonkumar
* GitHub - https://github.com/KumarPython
* Twitter - https://twitter.com/KumarPython
* LinkedIn - https://www.linkedin.com/in/kumarpython/

#  **Submission**

In [None]:
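sub=pd.read_csv('../input/tabular-playground-series-jun-2022/sample_submission.csv')

# Each 'row-col' entry is assumed to look like '<row_id>-<column name>', e.g. '0-F_1_14'
split=sub['row-col'].str.split(pat="-",n=1,expand=True)

row=split.iloc[:,0].astype('int64')
# Map each column name to its position in data. The imputed features occupy the
# first columns of data_ind in the same order, followed by the indicator columns.
col=split.iloc[:,1].map(data.columns.get_loc).astype('int64')

# Look up the imputed value for every (row, column) pair at once
val=data_ind.to_numpy()[row.to_numpy(),col.to_numpy()]
sub['value']=val

sub.to_csv('submission.csv',index=False)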