## **Data Preprocessing for Machine Learning**

**Data Cleaning is Not Data Preprocessing!**

It's worthwhile to know that data cleaning is different from data preprocessing. In the machine learning implementation pipeline, data cleaning comes first. For emphasis, let's throw more light on data cleaning.

Data cleaning is the process of identifying and correcting errors, inconsistencies and inaccuracies in raw data. The tasks you will most likely encounter in data cleaning include:

-   Handling missing values.
-   Removing duplicate rows or columns.
-   Correcting inconsistencies (like typos and inconsistent letter case).
-   Handling outliers.
-   Handling corrupt strings (unicode problems).
-   Handling irrelevant data, and more that you have encountered or are likely to encounter in your journey of working with data.
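
To make a few of these tasks concrete, here is a minimal sketch using pandas. The toy dataset and its column names (`city`, `age`, `income`) are made up for illustration, and filling with the median is just one reasonable choice among several.

```python
import pandas as pd

# Hypothetical toy dataset showing some of the problems listed above.
raw = pd.DataFrame({
    "city": ["Lagos", "lagos ", "Abuja", "Abuja", None],
    "age": [25, 25, 40, 40, 31],
    "income": [1200.0, 1200.0, None, None, 900.0],
})

cleaned = raw.copy()

# Correct inconsistencies: trim stray whitespace and unify letter case.
cleaned["city"] = cleaned["city"].str.strip().str.title()

# Handle missing values: fill numeric gaps with the column median,
# and drop rows where the city is unknown.
cleaned["income"] = cleaned["income"].fillna(cleaned["income"].median())
cleaned = cleaned.dropna(subset=["city"])

# Remove duplicate rows (they only become visible after the case fix).
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

print(cleaned)
```

Notice the order: the two "Lagos" rows only count as duplicates once the casing and whitespace are fixed, so the inconsistency fix comes before deduplication.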


<hr>

**Other Conflicting Terms with "Data Preprocessing"**

It is important to note that as you progress in your learning, practice and research, you will come across some big data lingo that might throw you into confusion, wondering what it all means. Some of these terms include **data mining, data wrangling, data preparation, data transformation, data harmonization, data refinement, data shaping, data manipulation, data manicuring, data validation**, etc.

In your spare time, do well to learn their meanings, and their similarities to and differences from other related terms.

For now, two of those terms, data wrangling and data mining, are differentiated from data preprocessing in a simple table to get us going before we move into data preprocessing proper.


| **Feature** | **Data Wrangling** | **Data Preprocessing** | **Data Mining** |
|-------------|--------------------|------------------------|-----------------|
| **Purpose** | Preparing and structuring raw data | Transforming data into a format suitable for ML | Extracting insights and patterns |
| **Focus** | Cleaning, transforming, and integrating data | Encoding, scaling, and feature selection | Finding hidden relationships in data (the complete ML pipeline) |
| **Techniques** | Data cleaning, merging, reshaping | Normalization, encoding, feature extraction | Clustering, classification, regression, anomaly detection |
| **End Goal** | Organized, structured data | Model-ready dataset | Actionable insights for decision-making |


<hr>

**Data Preprocessing**

In the Nigerian setting, when you see people cooking many varieties of dishes with big sets of pots, the next thing that comes to mind is "What is the celebration?" That is, they are most likely cooking for a party or an event. It is the same with data preprocessing: when you see a dataset being prepared, a model is about to be built. So in a layman's definition, we can say that data preprocessing is the preparing and cooking of your dataset for ML model building. I think this should be clear enough.

If simplicity feels like a scam to you, let's subscribe to a little bit of complexity in our definition.

Data preprocessing is the step in the machine learning pipeline where raw data is transformed into a structured, clean, and suitable format for model training. It ensures that data is compatible with machine learning algorithms by handling inconsistencies and applying scaling, encoding, and feature selection.
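
As a rough illustration of that definition, the sketch below scales the numeric columns and encodes the categorical one in a single step using scikit-learn's `ColumnTransformer`. The dataset and its column names are hypothetical, and this is only one of several ways to wire these steps together.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data: two numeric features and one categorical feature.
df = pd.DataFrame({
    "age": [25, 40, 31, 52],
    "income": [1200.0, 3500.0, 900.0, 2800.0],
    "city": ["Lagos", "Abuja", "Lagos", "Kano"],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Scaling: give each numeric column mean 0 and standard deviation 1.
        ("scale", StandardScaler(), ["age", "income"]),
        # Encoding: turn each city into its own 0/1 column.
        ("encode", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0.0,  # always return a dense NumPy array
)

X = preprocessor.fit_transform(df)
print(X.shape)  # 4 rows; 2 scaled columns plus 3 one-hot columns
```

The result `X` is purely numeric, which is the "model-ready" format most ML algorithms expect.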


<hr>

**Why Should We Preprocess Data Before Building The ML Model?**

**NOTE:** _I will introduce some new terms here. Don't worry if you don't understand them yet, but it is good that you are aware of their existence. I will put them in bold for emphasis._

Let's break it down. The effectiveness of our model is at the mercy of how well our features are engineered, and **feature engineering** is at the mercy of data preprocessing. Data preprocessing is crucial in machine learning because raw data is often incomplete, inconsistent, and noisy. Without proper preprocessing, **ML algorithms** may learn incorrect patterns, leading to poor performance.


So let's answer the question now.
1.  Improves Data Quality: Even after cleaning, **features** might still contain errors, inconsistencies and missing values that can affect the performance of the **model**. Preprocessing resolves these and prepares the dataset for **feature selection**.
2.  Enhances Model Performance: Most ML algorithms learn better when the features are **scaled, normalized** and structured properly. Proper preprocessing ensures that features contribute meaningfully to **predictions.**
3.  Helps Prevent **Overfitting, Underfitting, High Variance** and **Bias**: Handling **imbalanced datasets** prevents the model from being biased towards the dominant **class**, and removing noisy or irrelevant features reduces the risk of overfitting.
4.  Ensures Algorithm Compatibility: Many ML algorithms require numeric inputs, so categorical variables must first be converted through **encoding**. Some algorithms, like KNN and neural networks, are sensitive to scale and require normalization.
5.  Improves Model Accuracy and Efficiency: Well-preprocessed data helps models learn better patterns, leading to improved **accuracy** and **generalization**. It also reduces **computational complexity** by eliminating **irrelevant features**.
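
To see why scale matters for a distance-based algorithm like KNN, here is a small sketch with NumPy. The numbers are invented, and the helper `nearest_neighbor` is a hypothetical function written just for this illustration (it is not a library API). It finds the closest row to the first row using plain Euclidean distance, before and after standardization.

```python
import numpy as np

# Hypothetical people described by (age in years, income).
X = np.array([
    [25, 50_000],  # row 0: the person we want a neighbour for
    [26, 51_000],  # row 1: almost the same age, income differs by 1,000
    [60, 50_100],  # row 2: 35 years older, income differs by only 100
    [45, 20_000],
    [35, 90_000],
], dtype=float)

def nearest_neighbor(data, query_idx=0):
    """Return the index of the row closest to data[query_idx] (Euclidean)."""
    dists = np.linalg.norm(data - data[query_idx], axis=1)
    dists[query_idx] = np.inf  # a row is not its own neighbour
    return int(np.argmin(dists))

# Standardization: subtract each column's mean, divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

raw_nn = nearest_neighbor(X)
scaled_nn = nearest_neighbor(X_scaled)
print("nearest on raw data:   ", raw_nn)
print("nearest on scaled data:", scaled_nn)
```

On the raw data, income dominates the distance simply because its numbers are bigger, so the 60-year-old (row 2) comes out as the nearest neighbour. After standardization both features carry comparable weight, and the 26-year-old (row 1) becomes the nearest.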

<hr>

**Technical Terms Used in Data Preprocessing**

Who doesn't admire a professional who sounds technical and at the same time reasonable when delivering talks in their field? I do admire them, I don't know about you... So don't be overwhelmed when you see the next table, I just made life easier for you. When you know and understand this stuff, you are like 70% of the way to being a professional ML hypeman, trust me!

You really don't have to know all of them. We will obviously not talk about most of them, and some you may never use in your ML practice. But keep the table handy as a cheatsheet.

| Term | Meaning | Implementation (Python) | Use Case |
|------|---------|-------------------------|----------|
| **Missing Value Imputation** | Strategies for filling in missing data | `df.fillna()` or `KNNImputer()` (from `sklearn`) | Ensures complete datasets for analysis or model training |
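| **Mean/Median/Mode Imputation** | Replacing missing values with the average, median, or mode | `df.fillna(df.mean(numeric_only=True))`, `df.fillna(df.median(numeric_only=True))`, `df.fillna(df.mode().iloc[0])` | Quick and simple imputation (mean/median for numeric columns, mode for categorical ones) |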
| **KNN Imputation** | Using similar data points to fill in missing values | `KNNImputer()` (from `sklearn`) | Used when data is missing at random and can be inferred |
| **Regression Imputation**    | Predicting missing values using a regression model            | Custom implementation using `LinearRegression()`             | Suitable for datasets with relationships between features     |
| **Multiple Imputation**      | Creating multiple plausible versions of data and combining them | `IterativeImputer()` (from `sklearn.impute`; experimental, needs `from sklearn.experimental import enable_iterative_imputer`; use `sample_posterior=True` with different seeds to get several versions) | Better for uncertain data with missing values                 |
| **Outlier Detection**        | Identifying and handling extreme values                        | `zscore()` (from `scipy.stats`), IQR rule                    | Prevents extreme values from distorting the model             |
| **Z-score Method**           | Identifying outliers based on standard deviation               | `zscore(df['col'])` (from `scipy.stats`)                     | Used for identifying outliers in normally distributed data    |
| **IQR Method**               | Identifying outliers based on interquartile range              | `IQR = df['col'].quantile(0.75) - df['col'].quantile(0.25)`  | Detects outliers in skewed or non-normal distributions        |
| **Box Plots**                | Visual representation of data distribution with outliers      | `sns.boxplot(x='col', data=df)`                              | Helps visually identify outliers                             |
| **Winsorizing/Clipping**     | Limiting extreme values to a specified threshold              | `winsorize()` (from `scipy.stats.mstats`) or `df['col'].clip(lower, upper)` | Reduces the impact of outliers on model performance           |
| **Data Deduplication**       | Removing duplicate records                                    | `df.drop_duplicates()`                                       | Prevents duplicate data from affecting analysis or models     |
| **Noise Removal**            | Reducing random errors or noise in data                       | `df['col'].rolling(window).mean()` or `scipy.signal.savgol_filter()` | Improves signal-to-noise ratio in datasets                   |
| **Smoothing Techniques**     | Applying filters to reduce random noise in time-series data   | `df['col'].rolling(window).mean()` (moving average), `savgol_filter()` (Savitzky-Golay, from `scipy.signal`) | Helps in smoothing time-series data or signals                |
| **Data Type Conversion**     | Changing data types for consistency                           | `df['col'] = df['col'].astype(int)`                          | Ensures consistency in feature formats (e.g., string to int) |
| **Data Transformation**      | Changing the format or structure of the data                  | `np.log()`, `boxcox()` (from `scipy.stats`)                  | Used for normalizing data, stabilizing variance, or reducing skew |
| **Normalization**            | Scaling data to a specific range (usually 0 to 1)             | `MinMaxScaler()` (from `sklearn`)                            | Common for algorithms requiring bounded features (e.g., KNN) |
| **Min-Max Scaling**          | Scaling data between 0 and 1                                  | `MinMaxScaler().fit_transform(df)`                           | Ensures all features are in the same range                    |
| **Standardization**          | Scaling data to have a mean of 0 and standard deviation of 1 | `StandardScaler()` (from `sklearn`)                          | Helpful for scale-sensitive algorithms such as SVM and regularized linear models |
| **Robust Scaling**           | Scaling using median and IQR for robustness against outliers | `RobustScaler()` (from `sklearn`)                            | Handles datasets with many outliers                          |
| **Feature Scaling**          | A broader term covering normalization and standardization    | Combination of `MinMaxScaler()`, `StandardScaler()`          | Ensures uniform scaling of features                           |
| **Encoding Categorical Variables** | Converting categorical data into numeric format           | `pd.get_dummies()`, `LabelEncoder()`                         | Required for machine learning models to process categorical data |
| **One-Hot Encoding**         | Creating binary columns for each category                     | `pd.get_dummies()`                                           | Used when categorical variables have no ordinal relationship  |
| **Label Encoding**           | Assigning unique integers to categories                       | `LabelEncoder().fit_transform(df['col'])`                    | Meant for encoding target labels; the integers follow sorted (alphabetical) order, not any natural order |
| **Ordinal Encoding**         | Assigning integers to categories based on order               | `df['col'].map({category: index})` or `OrdinalEncoder(categories=[...])` (from `sklearn`) | Suitable for ordinal data where order matters                |
| **Target Encoding**          | Replacing categories with mean of the target variable         | `category_encoders.TargetEncoder()` (third-party `category_encoders` package) | Used in modeling when categories correlate with target variable |
| **Discretization/Binning**   | Converting continuous variables into discrete bins            | `pd.cut()` or `KBinsDiscretizer()`                           | Used for reducing model complexity or creating categorical features |
| **Log Transformation**       | Applying a log function to reduce skewness                    | `np.log(df['col'])`                                          | Used for positively skewed data or stabilizing variance       |
| **Power Transformation**     | Box-Cox or Yeo-Johnson transformations for normalizing data  | `PowerTransformer()` (from `sklearn`)                        | Used for stabilizing variance and making data more Gaussian  |
| **Polynomial Features**      | Creating new features by raising existing features to higher powers | `PolynomialFeatures()` (from `sklearn`)                      | Used for capturing non-linear relationships                   |
| **Interaction Features**     | Creating new features by combining existing features         | `df['new_feature'] = df['feature1'] * df['feature2']`        | Used to capture interactions between features                 |
| **Data Augmentation**        | Generating new data points based on existing data            | `ImageDataGenerator()` (from `Keras`, for images)            | Common in image and text processing tasks                    |
| **Data Integration**         | Merging data from multiple sources                           | `pd.merge()` or `pd.concat()`                               | Combines multiple datasets into a single source of truth     |
| **Feature Selection**        | Choosing relevant features for model training                | `SelectKBest()`, `RFE()` (from `sklearn`)                   | Reduces dimensionality and avoids overfitting                |
| **Filter Methods**           | Selecting features based on statistical measures             | `SelectKBest()` (from `sklearn`)                            | Selects features with the highest statistical relevance       |
| **Wrapper Methods**          | Using a model to evaluate feature subsets                     | `RFE()` (Recursive Feature Elimination)                      | Evaluates feature sets by recursively fitting a model        |
| **Embedded Methods**         | Built-in feature selection during model training            | `Lasso`, `DecisionTreeClassifier()`                          | Selects features while training models (e.g., L1 regularization) |
| **Dimensionality Reduction** | Reducing the number of features while retaining key information | `PCA()`, `LinearDiscriminantAnalysis()`, `TSNE()` (from `sklearn`) | Reduces data complexity and overfitting risks                 |
| **PCA (Principal Component Analysis)** | Finding principal components that capture data variance | `PCA()` (from `sklearn`)                                     | Used for feature extraction and noise reduction              |
| **LDA (Linear Discriminant Analysis)** | Maximizing separability between classes                   | `LinearDiscriminantAnalysis()` (from `sklearn.discriminant_analysis`) | Used for supervised dimensionality reduction in classification problems |
| **t-SNE (t-Distributed Stochastic Neighbor Embedding)** | Reducing dimensionality while preserving structure | `TSNE()` (from `sklearn`)                                   | Used for visualizing high-dimensional data                   |
| **Sampling**                 | Selecting a subset of data for processing                    | `train_test_split()`, `resample()`                          | Used when working with large datasets or for cross-validation |
| **Data Serialization**       | Converting data to a suitable format for storage or transmission | `pickle`, `json`, `df.to_parquet()`                         | Ensures data can be stored and loaded efficiently             |
| **JSON**                     | A lightweight data-interchange format                        | `json.load()` / `json.dump()`                               | Common format for storing and transmitting data              |
| **XML**                      | A markup language for representing structured data           | `xml.etree.ElementTree`                                     | Used for structured data storage and exchange                |
| **Pickle**                   | Python-specific format for serializing objects               | `pickle.dump()` / `pickle.load()`                           | Saves Python objects for future use or distributed computing  |
| **Parquet**                  | A columnar storage format                                    | `df.to_parquet()` / `pd.read_parquet()` (uses `pyarrow`)    | Efficient storage format, commonly used in big data          |
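
To give a feel for how a couple of the rows above look in practice, here is a short sketch with pandas covering the IQR method with clipping and ordinal encoding through an explicit mapping. The data, the common 1.5 × IQR fence, and the `size_order` mapping are illustrative assumptions, not fixed rules.

```python
import pandas as pd

# --- IQR method and clipping -------------------------------------------
# Hypothetical numeric column with one obviously extreme value.
s = pd.Series([10, 12, 11, 13, 12, 11, 95], name="col")

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# A common convention: anything beyond 1.5 * IQR from the quartiles
# is treated as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (s < lower) | (s > upper)

# Clipping keeps the row but pulls the extreme value back to the fence.
clipped = s.clip(lower=lower, upper=upper)

print(s[is_outlier].tolist())
print(clipped.tolist())

# --- Ordinal encoding with an explicit mapping --------------------------
# The order small < medium < large is spelled out by hand, so the
# integers reflect the real ranking of the categories.
sizes = pd.Series(["small", "large", "medium", "small"])
size_order = {"small": 0, "medium": 1, "large": 2}
encoded = sizes.map(size_order)

print(encoded.tolist())
```

Clipping is a gentler alternative to dropping outlier rows: you keep every record, but extreme values can no longer dominate the feature.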
