# 5 Essential Machine Learning Techniques to Master Your Data Preprocessing

https://medium.com/p/e888f6d220e1

This blog explores five crucial preprocessing techniques that every data scientist must master: handling missing data, scaling and normalization, encoding categorical data, feature engineering, and dealing with imbalanced data. 

## 1. Handling Missing Data

If improperly handled, missing data can lead to biased model predictions, misleading insights, or even training failures. 

Types of Missing Data

1. Missing Completely at Random (MCAR): The probability that a value is missing is unrelated to any observed or unobserved data. In this case, removing the affected rows is unlikely to introduce bias, because the missingness is random.
2. Missing at Random (MAR): The missingness of a data point depends on other observed variables but not on the missing value itself. This is common in surveys or demographic datasets, where missing income data might be related to education level.
3. Missing Not at Random (MNAR): The missingness is related to the unobserved value itself. For example, high-income people might be less likely to disclose their earnings, which can bias the dataset if not handled carefully.
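To make the difference concrete, here is a minimal sketch that simulates MCAR and MNAR missingness on synthetic income data and compares the mean of the observed values with the true mean. The distribution, sample size, and missingness probabilities are illustrative assumptions, not taken from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical incomes drawn from a right-skewed (log-normal) distribution
income = rng.lognormal(mean=10.5, sigma=0.5, size=10_000)

# MCAR: every value has the same 30% chance of being missing
mcar_mask = rng.random(income.size) < 0.3

# MNAR: incomes above the median are three times as likely to be missing (assumed rates)
p_missing = np.where(income > np.median(income), 0.45, 0.15)
mnar_mask = rng.random(income.size) < p_missing

true_mean = income.mean()
mcar_mean = income[~mcar_mask].mean()
mnar_mean = income[~mnar_mask].mean()

print(f"True mean:       {true_mean:,.0f}")
print(f"Observed (MCAR): {mcar_mean:,.0f}")
print(f"Observed (MNAR): {mnar_mean:,.0f}")
```

Under these assumptions, the MCAR mean stays close to the true mean, while the MNAR mean is pulled downward because high incomes are under-represented among the observed values.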

**Strategy 1:** Listwise Deletion (Removing Missing Data)

The simplest way to handle missing data is to remove rows containing missing values. While this works for datasets with few missing entries, it is far less practical when missing data is frequent, because it discards valuable information. (In pandas, this basically means calling `.dropna()`.)

Best Practice. Use listwise deletion cautiously. It’s only suitable when missing data is MCAR or removing rows won’t significantly impact the dataset’s integrity.
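As a minimal sketch of listwise deletion with pandas, the example below first measures how much data is missing per column and then drops incomplete rows. The DataFrame name `df_missing` and its values are illustrative.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    'Price': [200000, 150000, np.nan, 130000, 250000],
    'Bedrooms': [3, 2, 4, np.nan, 3],
    'SquareFeet': [2000, 1600, 2400, 1800, np.nan],
})

# Share of missing values per column, to judge how much data deletion would cost
missing_share = df_missing.isna().mean()

# Listwise deletion: keep only the rows that have no missing values
df_complete = df_missing.dropna()

print(missing_share)
print(f"Rows before: {len(df_missing)}, rows after: {len(df_complete)}")
```

Each column is only 20% missing, yet deletion removes three of the five rows because the gaps fall in different rows. This is why the strategy should be used cautiously.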

**Strategy 2:** Imputation Methods (Fill In Missing Data)

If removing data is not an option, impute (i.e., fill in) missing values using statistical measures (e.g., the mean, median, or mode). Imputation keeps every row, so the model can use all available information.
Mean imputation works well when the data is symmetrically distributed but can introduce bias in skewed distributions.

In [7]:
import pandas as pd
from sklearn.impute import SimpleImputer

pd.set_option("display.precision", 2)
# Simulated dataset with missing values
data = {'Price': [200000, 150000, None, 130000, 250000],
        'Bedrooms': [3, 2, 4, None, 3],
        'SquareFeet': [2000, 1600, 2400, 1800, None]}
df = pd.DataFrame(data)
print("Original DataFrame with Missing Values:")
print(df)
# Impute missing values using the mean for numerical data
imputer_mean = SimpleImputer(strategy='mean')
df_imputed_mean = pd.DataFrame(imputer_mean.fit_transform(df), columns=df.columns)
print("\nDataFrame after Mean Imputation:")
print(df_imputed_mean)

Original DataFrame with Missing Values:
      Price  Bedrooms  SquareFeet
0  200000.0       3.0      2000.0
1  150000.0       2.0      1600.0
2       NaN       4.0      2400.0
3  130000.0       NaN      1800.0
4  250000.0       3.0         NaN

DataFrame after Mean Imputation:
      Price  Bedrooms  SquareFeet
0  200000.0       3.0      2000.0
1  150000.0       2.0      1600.0
2  182500.0       4.0      2400.0
3  130000.0       3.0      1800.0
4  250000.0       3.0      1950.0


Best Practice. Mean imputation is effective for numerical data with minimal skew. For skewed data, consider using the median instead.

**Before choosing between the two, it is worth checking the distribution of each variable, for example with a histogram.**
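The following sketch compares mean and median imputation on a skewed column. The prices, including the single very expensive house, are made up for illustration. The skewness statistic is used here as a quick numeric stand-in for looking at a histogram.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical skewed column: one very expensive house pulls the mean upward
df_skewed = pd.DataFrame({'Price': [130000, 150000, 200000, 250000, 2000000, np.nan]})

print(f"Skewness of observed prices: {df_skewed['Price'].skew():.2f}")

mean_filled = SimpleImputer(strategy='mean').fit_transform(df_skewed)
median_filled = SimpleImputer(strategy='median').fit_transform(df_skewed)

print(f"Mean-imputed value:   {mean_filled[-1, 0]:,.0f}")
print(f"Median-imputed value: {median_filled[-1, 0]:,.0f}")
```

The mean-imputed value (546,000) is higher than four of the five observed prices, while the median-imputed value (200,000) stays in the typical range.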

**Strategy 3:** Mode Imputation for Categorical Data

We cannot impute categorical data (e.g., gender or country) using the mean. Instead, we use the mode: we fill in missing values with the most frequently occurring category.

Best Practice. Mode imputation works well for categorical data, especially for features like gender or country, where most frequent values are meaningful.

In [8]:
import numpy as np

# Simulated dataset with categorical missing values
data_cat = {'Name': ['John', 'Emily', 'Michael', None, 'Jessica'],
            'Country': ['USA', 'UK', None, 'USA', 'Canada']}
df_cat = pd.DataFrame(data_cat)
print("\nOriginal Categorical DataFrame with Missing Values:")
print(df_cat)

# Replace None with np.nan for proper handling of missing values
df_cat.replace({None: np.nan}, inplace=True)

# Impute missing values using the mode (most frequent value) for categorical data
imputer_mode = SimpleImputer(strategy='most_frequent')
df_cat_imputed = pd.DataFrame(imputer_mode.fit_transform(df_cat), columns=df_cat.columns)

print("\nDataFrame after Mode Imputation:")
print(df_cat_imputed)


Original Categorical DataFrame with Missing Values:
      Name Country
0     John     USA
1    Emily      UK
2  Michael    None
3     None     USA
4  Jessica  Canada

DataFrame after Mode Imputation:
      Name Country
0     John     USA
1    Emily      UK
2  Michael     USA
3    Emily     USA
4  Jessica  Canada


**Strategy 4:** Advanced Techniques — Multivariate Imputation by Chained Equations (MICE)

Simple imputation methods like mean or mode can introduce bias, particularly in complex datasets. We can use Multivariate Imputation by Chained Equations (MICE) for such cases. This technique predicts missing values based on the relationships between multiple features.

Best Practice: Use MICE for datasets with complex interdependencies between features, mainly when simple imputations might introduce bias.

Note: this method cannot be used directly with categorical variables.

In [11]:

from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

# Use Iterative Imputer (MICE) for advanced imputation
mice_imputer = IterativeImputer(max_iter=10, random_state=0)
df_mice_imputed = pd.DataFrame(mice_imputer.fit_transform(df), columns=df.columns)
print("\nDataFrame after MICE Imputation:")
print(df_mice_imputed)



DataFrame after MICE Imputation:
       Price  Bedrooms  SquareFeet
0  200000.00      3.00     2000.00
1  150000.00      2.00     1600.00
2  254096.39      4.00     2400.00
3  130000.00      5.97     1800.00
4  250000.00      3.00     2305.46


## 2. Scaling and Normalization

Why Scaling and Normalization Are Essential

Many machine learning algorithms, especially those involving distance-based metrics (e.g., k-nearest neighbors or support vector machines) or gradient-based optimizers (e.g., logistic regression and neural networks), assume that features are on a similar scale.

Theory: What’s the Difference Between Scaling and Normalization?

Normalization generally refers to rescaling the data to fall within a specific range, typically [0, 1]. It is often used when the data doesn’t follow a Gaussian distribution.

Scaling (aka standardization) refers to adjusting the distribution of values to have a mean of 0 and a standard deviation of 1. It is usually applied to data that follows a Gaussian (i.e., normal) distribution; it is commonly used in algorithms that rely on gradient descent (e.g., logistic regression or neural networks).
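Both definitions reduce to simple formulas: min-max normalization computes x' = (x - min) / (max - min), and standardization computes z = (x - mean) / std. The sketch below applies them by hand with NumPy to a small illustrative price column and checks the results against scikit-learn. Note that `StandardScaler` uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[200000.0], [150000.0], [180000.0], [130000.0], [250000.0]])

# Min-max normalization: x' = (x - min) / (max - min), mapped onto [0, 1]
x_minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Z-score standardization: z = (x - mean) / std, using the population std (ddof=0)
x_zscore = (x - x.mean(axis=0)) / x.std(axis=0)

# Both should agree with the scikit-learn transformers
assert np.allclose(x_minmax, MinMaxScaler().fit_transform(x))
assert np.allclose(x_zscore, StandardScaler().fit_transform(x))

print(x_minmax.ravel().round(2))
print(x_zscore.ravel().round(2))
```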

### Min-Max Normalization

In [14]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Simulated dataset
data = {'Price': [200000, 150000, 180000, 130000, 250000],
        'Bedrooms': [3, 2, 4, 3, 3],
        'SquareFeet': [2000, 1600, 2400, 1800, 2200]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Apply Min-Max normalization to the dataset
df_minmax = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print("\nDataFrame after Min-Max Normalization:")
print(df_minmax)

Original DataFrame:
    Price  Bedrooms  SquareFeet
0  200000         3        2000
1  150000         2        1600
2  180000         4        2400
3  130000         3        1800
4  250000         3        2200

DataFrame after Min-Max Normalization:
   Price  Bedrooms  SquareFeet
0   0.58       0.5        0.50
1   0.17       0.0        0.00
2   0.42       1.0        1.00
3   0.00       0.5        0.25
4   1.00       0.5        0.75


This method is useful when the bounds of the values are well defined, for example percentages that always lie between 0 and 100. By default, `MinMaxScaler` takes its limits from the data it is fitted on, but they can be set explicitly:

In [None]:
# Define the dataset (training data)
df_train = pd.DataFrame({
    'percentage_score': [55, 70, 85, 90]  # Min value is 55, max value is 90 in training data
})
limits = [[0], [100]]
# Initialize the MinMaxScaler with an output range of (0, 1)
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(limits)  # Fit the scaler using the known min and max (0 and 100)

# Normalize the training data
df_train['percentage_normalized'] = scaler.transform(df_train[['percentage_score']].values)

# Example test set that includes a value outside the training set's range
df_test = pd.DataFrame({
    'percentage_score': [50, 100, 110]  # Includes a value (50) lower than seen in training and (110) above 100
})

# Normalize the test data using the fixed scaler
df_test['percentage_normalized'] = scaler.transform(df_test[['percentage_score']].values)

print("Training Data Normalized:")
print(df_train)
print("\nTest Data Normalized:")
print(df_test)

### Z-Score Standardization (Standard Scaling)

In [15]:
from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler
scaler_std = StandardScaler()
# Apply Z-Score standardization
df_std = pd.DataFrame(scaler_std.fit_transform(df), columns=df.columns)
print("\nDataFrame after Z-Score Standardization:")
print(df_std)


DataFrame after Z-Score Standardization:
   Price  Bedrooms  SquareFeet
0   0.43      0.00        0.00
1  -0.77     -1.58       -1.41
2  -0.05      1.58        1.41
3  -1.25      0.00       -0.71
4   1.63      0.00        0.71


Min-Max Normalization vs. Z-Score Standardization

Min-Max Normalization is best when:

- Your data does not follow a normal distribution.
- Your model makes assumptions about the range of the data (e.g., neural networks with activation functions like sigmoid or tanh, which work best with inputs in a bounded range).
- The feature has clear upper and lower limits, and you want to preserve the relative spacing of the values within that range.

On the other hand, Z-Score Standardization is best to use when:

- Your data follows a Gaussian (i.e., normal) distribution.
- You use models like logistic regression, support vector machines, or neural networks, which typically train better on standardized inputs.
- You need features to be centered around zero, which can prevent issues like slow convergence in gradient descent.

Potential Pitfalls and Best Practices

Outliers.

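If your data contains significant outliers, both methods are affected. Z-score standardization relies on the mean and standard deviation, which outliers distort, and min-max normalization is even more sensitive because a single extreme value sets the range. Consider removing outliers before scaling or applying robust scaling techniques, which use the median and interquartile range instead.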
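As a rough illustration of robust scaling, the sketch below compares scikit-learn's `StandardScaler` and `RobustScaler` on made-up prices with one extreme outlier. It measures how much spread the five typical values keep after each transform.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical prices with one extreme outlier in the last row
prices = np.array([[130000.0], [150000.0], [180000.0],
                   [200000.0], [250000.0], [5000000.0]])

z_scaled = StandardScaler().fit_transform(prices)
robust_scaled = RobustScaler().fit_transform(prices)  # centers on the median, scales by the IQR

# Spread of the five typical houses (outlier excluded) after each transform
z_spread = z_scaled[:5].max() - z_scaled[:5].min()
robust_spread = robust_scaled[:5].max() - robust_scaled[:5].min()

print(f"Spread of typical values, z-score: {z_spread:.3f}")
print(f"Spread of typical values, robust:  {robust_spread:.3f}")
```

With the z-score, the outlier inflates the standard deviation and the typical values are squeezed into a narrow band. The robust scaler keeps them well separated.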

Data Leakage.

Always fit your scaler on the training data only, and then apply that fitted scaler to the test set. This prevents data leakage, where information from the test set influences the training process.
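A minimal sketch of this workflow on synthetic data follows. The feature matrix `X_demo` and the split sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features (assumed)

X_tr, X_te = train_test_split(X_demo, test_size=0.25, random_state=42)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # statistics come from the training split only
X_te_scaled = scaler_demo.transform(X_te)      # reuse them; never call fit on the test split

print("Train means after scaling:", X_tr_scaled.mean(axis=0).round(3))
print("Test means after scaling: ", X_te_scaled.mean(axis=0).round(3))
```

The test means are close to zero but not exactly zero. That is expected, because the test set was scaled with the training statistics rather than its own.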

## 3. Encoding Categorical Data

### Label Encoding

Label encoding assigns a unique integer to each category. While this method is simple, it’s mostly suited for ordinal variables (where the categories have an inherent order). It can introduce unintended ordinal relationships for nominal variables (where the categories are unordered).

In [17]:
import pandas as pd

# Define the dataset
df = pd.DataFrame({
    'education': ['High School', "Bachelor's", "Master's", 'PhD', "Bachelor's", "Master's"]
})

# Ordinal mapping: explicit order from lowest to highest
education_order = ['High School', "Bachelor's", "Master's", 'PhD']

# scikit-learn's LabelEncoder assigns codes in sorted (alphabetical) order, so an
# ordered pandas Categorical is used here to encode the explicit ordinal relationship
df['education'] = pd.Categorical(df['education'], categories=education_order, ordered=True)
df['education_encoded'] = df['education'].cat.codes

print(df)

     education  education_encoded
0  High School                  0
1   Bachelor's                  1
2     Master's                  2
3          PhD                  3
4   Bachelor's                  1
5     Master's                  2


### One-Hot Encoding

In [18]:
# Simulated dataset with a nominal categorical variable
data = {'Animal': ['Dog', 'Cat', 'Rabbit', 'Dog', 'Rabbit']}
df_cat = pd.DataFrame(data)
print("\nOriginal Categorical DataFrame:")
print(df_cat)

# Perform one-hot encoding
df_one_hot = pd.get_dummies(df_cat, columns=['Animal'], prefix='Animal')
# Cast the indicator columns to integers for display; the boolean dtype
# works just as well for typical flags and indexing
df_one_hot = df_one_hot.astype(int)
print("\nDataFrame after One-Hot Encoding:")
print(df_one_hot)


Original Categorical DataFrame:
   Animal
0     Dog
1     Cat
2  Rabbit
3     Dog
4  Rabbit

DataFrame after One-Hot Encoding:
   Animal_Cat  Animal_Dog  Animal_Rabbit
0           0           1              0
1           1           0              0
2           0           0              1
3           0           1              0
4           0           0              1


### Advanced Encoding Technique: Target Encoding

Target encoding replaces each category with the mean of the target variable for that category.

When to Use:

- Use target encoding when a categorical variable has many unique categories (e.g., zip codes, product IDs, or usernames).
- Because each category maps to a single number, the model can capture category-level patterns without dramatically increasing the number of features.

Caution: Target encoding can lead to overfitting if not done carefully, because the encoded value is computed from the same target the model is trying to predict, so the model can memorize the relationship between the category and the target. To mitigate this, techniques like cross-validation or regularization should be applied.

Best Practice. Always apply cross-validation when using Target Encoding to avoid overfitting. A common strategy is to calculate the target encoding on the training data and apply it to the validation/test set to ensure there is no data leakage.

In [5]:
import pandas as pd

# Simulated dataset with neighborhood and house prices
data = {'Neighborhood': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B'],
        'Price': [200000, 150000, 250000, 300000, 160000, 310000, 220000, 170000]}
df_te = pd.DataFrame(data)
print("\nOriginal DataFrame:")
print(df_te)

# Calculate the mean price per neighborhood
mean_target = df_te.groupby('Neighborhood')['Price'].mean()

# Map the mean target encoding back to the original DataFrame
df_te['Neighborhood_encoded'] = df_te['Neighborhood'].map(mean_target)
print("\nDataFrame after Target Encoding:")
print(df_te)


Original DataFrame:
  Neighborhood   Price
0            A  200000
1            B  150000
2            A  250000
3            C  300000
4            B  160000
5            C  310000
6            A  220000
7            B  170000

DataFrame after Target Encoding:
  Neighborhood   Price  Neighborhood_encoded
0            A  200000         223333.333333
1            B  150000         160000.000000
2            A  250000         223333.333333
3            C  300000         305000.000000
4            B  160000         160000.000000
5            C  310000         305000.000000
6            A  220000         223333.333333
7            B  170000         160000.000000
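The encoding above is computed from every row, including each row's own price. One common way to apply the cross-validation advice is out-of-fold encoding: each row is encoded with category means computed from the other folds only. The sketch below is one possible version. The fold count, the column name `Neighborhood_oof`, and the fallback to the mean of the fitting rows are illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import KFold

df_oof = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B'],
    'Price': [200000, 150000, 250000, 300000, 160000, 310000, 220000, 170000],
})
df_oof['Neighborhood_oof'] = float('nan')

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df_oof):
    fit_rows = df_oof.iloc[fit_idx]
    # Category means computed only from the rows outside the current fold
    fold_means = fit_rows.groupby('Neighborhood')['Price'].mean()
    encoded = df_oof['Neighborhood'].iloc[enc_idx].map(fold_means)
    # Categories unseen in the fitting rows fall back to the mean of those rows
    encoded = encoded.fillna(fit_rows['Price'].mean())
    df_oof.loc[encoded.index, 'Neighborhood_oof'] = encoded

print(df_oof)
```

No row's encoded value depends on its own target, which limits how much the model can memorize through the encoded feature.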


## 4. Feature Engineering

The Importance of Feature Engineering

Feature engineering is often considered the heart of machine learning, where domain knowledge and creativity intersect to transform raw data into meaningful inputs that improve model performance.

### Polynomial Features

In [2]:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Simulated dataset
data = {'Price': [200000, 150000, 180000, 130000, 250000],
        'Bedrooms': [3, 2, 4, 3, 3],
        'SquareFeet': [2000, 1600, 2400, 1800, 2200]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Initialize PolynomialFeatures for degree 2
poly = PolynomialFeatures(degree=2, include_bias=False)

# Apply polynomial transformation
df_poly = pd.DataFrame(poly.fit_transform(df[['Bedrooms', 'SquareFeet']]),
                       columns=poly.get_feature_names_out(['Bedrooms', 'SquareFeet']))
print("\nDataFrame after Polynomial Feature Generation:")
print(df_poly)

Original DataFrame:
    Price  Bedrooms  SquareFeet
0  200000         3        2000
1  150000         2        1600
2  180000         4        2400
3  130000         3        1800
4  250000         3        2200

DataFrame after Polynomial Feature Generation:
   Bedrooms  SquareFeet  Bedrooms^2  Bedrooms SquareFeet  SquareFeet^2
0       3.0      2000.0         9.0               6000.0     4000000.0
1       2.0      1600.0         4.0               3200.0     2560000.0
2       4.0      2400.0        16.0               9600.0     5760000.0
3       3.0      1800.0         9.0               5400.0     3240000.0
4       3.0      2200.0         9.0               6600.0     4840000.0


### Log Transformations

In [3]:
import numpy as np

# Apply log transformation to 'Price'
df['Log_Price'] = np.log1p(df['Price'])  # log(1 + x), which avoids log(0)
print("\nDataFrame after Log Transformation of 'Price':")
print(df[['Price', 'Log_Price']])


DataFrame after Log Transformation of 'Price':
    Price  Log_Price
0  200000  12.206078
1  150000  11.918397
2  180000  12.100718
3  130000  11.775297
4  250000  12.429220
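Log transformations are typically applied to right-skewed features. The five prices above are too few to show the effect, so this sketch uses synthetic log-normal prices (an assumption, not data from this post) and compares skewness before and after the transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed prices (log-normal), not taken from the dataset above
skewed_prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.8, size=5_000))
log_prices = np.log1p(skewed_prices)  # log(1 + x), safe when x == 0

print(f"Skewness before: {skewed_prices.skew():.2f}")
print(f"Skewness after:  {log_prices.skew():.2f}")
```

The transformed values are close to symmetric, which tends to suit linear models and distance-based methods better than the long-tailed original.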


### Binning

Binning is dividing continuous variables into intervals (i.e., bins). It is useful when you want to simplify the data or create meaningful groups. This technique constrains a feature’s range or makes the data more interpretable by converting it into categories (e.g., low, medium, and high).

In [4]:
# Define bin edges for price categories
bins = [0, 150000, 200000, np.inf]
labels = ['Low', 'Medium', 'High']

# Apply binning
df['Price_Binned'] = pd.cut(df['Price'], bins=bins, labels=labels)
print("\nDataFrame after Binning 'Price':")
print(df[['Price', 'Price_Binned']])


DataFrame after Binning 'Price':
    Price Price_Binned
0  200000       Medium
1  150000          Low
2  180000       Medium
3  130000          Low
4  250000         High


### Handling High Cardinality with Feature Hashing

Another common problem is when a feature has many unique categories (i.e., high cardinality), such as zip codes, product IDs, or user IDs. Using traditional one-hot encoding in such scenarios can drastically increase the dimensionality of the dataset, leading to memory inefficiencies and longer computation times. We can use feature hashing (i.e., the hashing trick) to reduce dimensionality and preserve essential data patterns.

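Feature hashing transforms categories into integers using a hash function and assigns them to a fixed number of “buckets” (i.e., columns). This method avoids creating thousands or even millions of one-hot encoded columns. The trade-off is that different categories can land in the same bucket (a collision), which becomes more likely as the number of buckets shrinks.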

In [6]:
from sklearn.feature_extraction import FeatureHasher

# Simulated high-cardinality data
data = {'ProductID': [['P001'], ['P002'], ['P003'], ['P004'], ['P005']]}
df_hash = pd.DataFrame(data)
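# Feature hashing with 4 output features (buckets); each sample is a list of strings
# With the default alternate_sign=True, hashed entries can be -1 as well as +1
hasher = FeatureHasher(n_features=4, input_type='string')
hashed_features = hasher.fit_transform(df_hash['ProductID'])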
# Convert the hashed result back to a DataFrame
df_hashed = pd.DataFrame(hashed_features.toarray(), columns=[f'Bucket_{i}' for i in range(1, 5)])
print("\nHashed Features (Feature Hashing for High Cardinality):")
print(df_hashed)


Hashed Features (Feature Hashing for High Cardinality):
   Bucket_1  Bucket_2  Bucket_3  Bucket_4
0      -1.0       0.0       0.0       0.0
1       0.0       0.0      -1.0       0.0
2       0.0       0.0       0.0       1.0
3      -1.0       0.0       0.0       0.0
4       0.0       1.0       0.0       0.0


## 5. Dealing with Imbalanced Data

Imbalanced data is when one class or label significantly outnumbers the other(s) in a dataset. For example, in fraud detection, the number of fraudulent transactions is usually much smaller than that of non-fraudulent ones. Left untreated, this imbalance can lead to a model that performs well on the majority class but poorly on the minority class.

Mathematics and Theory Behind Imbalanced Data

When dealing with an imbalanced dataset, there are a few standard metrics and definitions to keep in mind:

Imbalance Ratio:
The imbalance ratio quantifies the degree of imbalance between the majority and minority classes. For a binary classification problem:

IR = (number of majority-class samples) / (number of minority-class samples)

Other metrics to review: accuracy, precision, recall, and F1-score.

ROC Curve and AUC:
The ROC curve plots the true positive rate (recall) against the false positive rate. The AUC (Area Under the Curve) is a standard metric used for imbalanced datasets, evaluating the model’s ability to distinguish between classes irrespective of their distribution.
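To see why accuracy alone is not enough on imbalanced data, this sketch computes the imbalance ratio and the metrics above for a trivial model that always predicts the majority class. The 95/5 label split is a made-up example.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical labels: 95 negatives and 5 positives
y_true = np.array([0] * 95 + [1] * 5)

counts = np.bincount(y_true)
imbalance_ratio = counts.max() / counts.min()
print(f"Imbalance ratio: {imbalance_ratio:.0f}")

# A trivial model that always predicts the majority class
y_majority = np.zeros_like(y_true)
acc = accuracy_score(y_true, y_majority)
prec = precision_score(y_true, y_majority, zero_division=0)
rec = recall_score(y_true, y_majority)
f1 = f1_score(y_true, y_majority, zero_division=0)

# AUC needs scores rather than hard labels; a constant score cannot rank the classes
auc = roc_auc_score(y_true, np.zeros(len(y_true)))

print(f"Accuracy: {acc:.2f}, Precision: {prec:.2f}, Recall: {rec:.2f}, "
      f"F1: {f1:.2f}, ROC AUC: {auc:.2f}")
```

The model reaches 95% accuracy without finding a single positive case. Recall and F1 are 0, and the AUC of 0.5 shows that it cannot separate the classes.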

### Technique 1: Class Weighting

Class weighting is a standard method used with models that allow you to assign a higher weight to the minority class. By doing so, the model treats errors in the minority class as more costly, encouraging the model to learn from minority examples.

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification

# Simulate an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=2,
                           n_clusters_per_class=1, weights=[0.9], flip_y=0, random_state=42)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Logistic Regression with class weighting
clf = LogisticRegression(class_weight='balanced', random_state=42)

# Fit the model
clf.fit(X_train, y_train)

# Predict on the test data
y_pred = clf.predict(X_test)

# Evaluate the model
print("Classification Report with Class Weighting:")
print(classification_report(y_test, y_pred))

Classification Report with Class Weighting:
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       178
           1       0.92      1.00      0.96        22

    accuracy                           0.99       200
   macro avg       0.96      0.99      0.98       200
weighted avg       0.99      0.99      0.99       200
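For reference, scikit-learn's `class_weight='balanced'` option weights each class by `n_samples / (n_classes * count_per_class)`. The sketch below reproduces that formula on hypothetical 9:1 labels and checks it against `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with a 9:1 imbalance
y_demo = np.array([0] * 900 + [1] * 100)
classes = np.unique(y_demo)

# 'balanced' weights follow n_samples / (n_classes * count_per_class)
manual_weights = len(y_demo) / (len(classes) * np.bincount(y_demo))
sklearn_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)

print(dict(zip(classes.tolist(), manual_weights.round(3).tolist())))
```

Here the minority class receives a weight of 5.0 and the majority class about 0.556, so an error on a minority example costs nine times as much.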



### Technique 2: Random Oversampling

Random oversampling involves duplicating instances of the minority class to balance the dataset. It is a simple and effective method, but it can lead to overfitting if the model starts to memorize repeated instances.

In [10]:
from imblearn.over_sampling import RandomOverSampler
import numpy as np

# Initialize RandomOverSampler
ros = RandomOverSampler(random_state=42)
# Apply oversampling to the training data
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
# Check the distribution after oversampling
print("Class distribution after Random Oversampling:", np.bincount(y_resampled))

Class distribution after Random Oversampling: [723 723]


### Technique 3: Random Undersampling

Random undersampling involves removing instances of the majority class to balance the dataset. This method can lead to a loss of valuable data from the majority class, but it helps reduce the training time and memory consumption, especially for large datasets.

This method is appropriate when the majority class significantly outnumbers the minority class.

In [15]:
from imblearn.under_sampling import RandomUnderSampler

# Initialize RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
# Apply undersampling to the training data
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
# Check the distribution after undersampling
print("Class distribution after Random Undersampling:", np.bincount(y_resampled))

Class distribution after Random Undersampling: [77 77]


### Technique 4: Synthetic Minority Over-sampling Technique (SMOTE)

SMOTE is an advanced oversampling technique that creates synthetic instances of the minority class by interpolating between existing examples and their nearest minority-class neighbors. Because the new samples are varied rather than exact duplicates, SMOTE balances the dataset with a lower risk of overfitting than random oversampling.

In [16]:
from imblearn.over_sampling import SMOTE

# Initialize SMOTE
smote = SMOTE(random_state=42)
# Apply SMOTE to the training data
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
# Check the distribution after SMOTE
print("Class distribution after SMOTE:", np.bincount(y_resampled))

Class distribution after SMOTE: [723 723]
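The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not imbalanced-learn's implementation. The function name `smote_sketch`, its parameters, and the random minority points are all hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbors, because each point is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    base = rng.integers(0, len(X_minority), size=n_new)   # random minority samples
    choice = rng.integers(1, k + 1, size=n_new)           # skip column 0 (the point itself)
    picked = neighbor_idx[base, choice]                   # one of the k nearest neighbors
    gap = rng.random((n_new, 1))                          # interpolation factor in [0, 1)
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

rng = np.random.default_rng(42)
X_min = rng.normal(size=(20, 2))  # hypothetical minority-class points
X_synthetic = smote_sketch(X_min, n_new=30)
print(X_synthetic.shape)
```

Each synthetic point lies on the line segment between a minority sample and one of its neighbors, so the new points stay inside the region already covered by the minority class.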
