<a href="https://colab.research.google.com/github/mioyn/AdvDataProg/blob/main/theory/Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imputation

Many real-world datasets contain missing values, often encoded as blanks, NaNs, or other placeholders. A basic strategy for using incomplete datasets is to discard entire rows and/or columns that contain missing values. However, this comes at the price of losing data that may be valuable even though it is incomplete. A better strategy is often to impute the missing values, that is, to infer them from the known part of the data.
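To make the cost of discarding data concrete, here is a minimal sketch using pandas `dropna`. The DataFrame `df_demo` and its column names are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: gaps in two of the three columns
df_demo = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "salary": [50000, 62000, np.nan, 58000],
    "city": ["Oslo", "Bergen", "Oslo", "Tromso"],
})

# Keep only the rows that have no missing values
rows_dropped = df_demo.dropna()

# Keep only the columns that have no missing values
cols_dropped = df_demo.dropna(axis=1)

print(rows_dropped.shape)  # half of the rows are gone
print(cols_dropped.shape)  # only the "city" column survives
```

With just two missing cells, dropping rows discards half of the rows, and dropping columns discards both numeric columns.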

The simplest approach is to set each missing value to a fixed value (zero, the column mean, the median, etc.) or to copy a neighboring value.

```
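# Fill each gap with the previous valid value (forward fill), then fill any
# gaps left at the start of a column with the next valid value (backward fill).
# fillna(method=...) is deprecated in recent pandas; use ffill()/bfill() instead.
df[required_cols] = df[required_cols].ffill().bfill()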
```
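The fixed-value strategies mentioned above (zero, mean, median) can be sketched with `fillna` as well. The Series `s` below is made-up data, not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry
s = pd.Series([10.0, np.nan, 20.0, 60.0])

filled_zero = s.fillna(0)             # replace NaN with a constant
filled_mean = s.fillna(s.mean())      # mean of the observed values (NaN is skipped)
filled_median = s.fillna(s.median())  # median of the observed values

print(filled_zero.tolist())
print(filled_mean.tolist())
print(filled_median.tolist())
```

The median is usually less sensitive than the mean to extreme values such as the 60 here, which is one reason to prefer it for skewed columns.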
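Another option is scikit-learn's `SimpleImputer` class, which learns the fill value for each column (here, the mean) in `fit` and applies it in `transform`: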


```
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
```

Output:

```
[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]
```
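`SimpleImputer` also supports other strategies, including `'median'` and `'constant'`. The following is a small sketch on made-up data (`X_train` is hypothetical), not a recommendation of one strategy over another.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data with one missing value in the first column
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 100.0]])

# Median imputation: the fill value is the median of each column's observed values
median_imp = SimpleImputer(strategy="median")
X_median = median_imp.fit_transform(X_train)

# Constant imputation: every missing value becomes fill_value
zero_imp = SimpleImputer(strategy="constant", fill_value=0)
X_zero = zero_imp.fit_transform(X_train)

print(median_imp.statistics_)  # learned per-column medians
print(X_median)
print(X_zero)
```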



# Outlier Detection & Removal
### Interquartile Range (IQR) Method
```
# Function to remove outliers using the IQR rule
def remove_outliers_iqr(df, column):
    # Compute Q1 (25th percentile) and Q3 (75th percentile)
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)

    # Calculate the interquartile range
    IQR = Q3 - Q1

    # Define lower and upper bounds; values outside them are considered outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Keep only the rows within the bounds (the input DataFrame is not modified)
    df_cleaned = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    return df_cleaned

# Remove outliers from the 'salary' column
df = remove_outliers_iqr(df, 'salary')
```
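As a worked illustration of the same rule, the sketch below flags outliers in a new column instead of dropping them, which makes it easy to inspect what would be removed. The `salaries` data is hypothetical.

```python
import pandas as pd

# Hypothetical salaries (in thousands) with one extreme value
salaries = pd.DataFrame({"salary": [30, 32, 35, 36, 38, 40, 500]})

q1 = salaries["salary"].quantile(0.25)
q3 = salaries["salary"].quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# between() is inclusive on both ends by default, matching the >= / <= test above
salaries["is_outlier"] = ~salaries["salary"].between(lower, upper)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, bounds=({lower}, {upper})")
print(salaries[salaries["is_outlier"]])
```

Here Q1 is 33.5 and Q3 is 39.0, so the bounds are 25.25 and 47.25, and only the value 500 is flagged.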
# Feature Scaling
# Encoding
# Feature Engineering