## **Data Preprocessing for Machine Learning**

**Data Cleaning is Not Data Preprocessing!**

It's worthwhile to know that data cleaning is different from data preprocessing. In the machine learning implementation pipeline, data cleaning comes first. As a matter of emphasis, let's throw more light on data cleaning.

Data cleaning is the process of identifying and correcting errors, inconsistencies and inaccuracies in raw data. The tasks you will most likely encounter in data cleaning include:

-   Handling missing values.
-   Removing duplicate rows or columns.
-   Correcting inconsistencies (like typos and inconsistent letter case).
-   Handling outliers.
-   Handling corrupt strings (unicode problems).
-   Handling irrelevant data, and more that you might have encountered or are likely to encounter in your journey of working with data.
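
To make a few of the tasks above concrete, here is a minimal sketch using pandas. The toy dataset, its column names (`city`, `age`, `income`) and the `clean` helper are all made up for illustration, and the choices inside it (median fill, clipping at 1.5 × IQR) are just one reasonable option among several, not the only way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with a missing value, a duplicate row,
# inconsistent casing/whitespace and one extreme income value.
raw = pd.DataFrame({
    "city": ["Lagos", "lagos ", "Abuja", "Abuja", "Kano", "Kano"],
    "age": [25.0, 31.0, np.nan, 40.0, 29.0, 29.0],
    "income": [120.0, 150.0, 130.0, 9000.0, 140.0, 140.0],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few common cleaning steps (illustrative helper, not a standard API)."""
    out = df.copy()
    # Correct inconsistencies: trim whitespace and unify letter case.
    out["city"] = out["city"].str.strip().str.title()
    # Remove duplicate rows.
    out = out.drop_duplicates().reset_index(drop=True)
    # Handle missing values: fill with the column median.
    out["age"] = out["age"].fillna(out["age"].median())
    # Handle outliers: clip values to the 1.5 * IQR fences.
    q1, q3 = out["income"].quantile([0.25, 0.75])
    iqr = q3 - q1
    out["income"] = out["income"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out


cleaned = clean(raw)
print(cleaned)
```

Whether you fill, drop or clip depends on your data and your problem, so treat each step here as a starting point and not a rule.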


<hr>

**Other Conflicting Terms with "Data Preprocessing"**

It is important to note that as you progress in your learning, practice and research, you will come across some big data lingo that might throw you into confusion, leaving you wondering what it all means. Some of these terms include **data mining, data wrangling, data preparation, data transformation, data harmonization, data refinement, data shaping, data manipulation, data manicuring, data validation**, etc.

In your spare time, do well to learn their meanings, and how they are similar to and different from other related terms.

For now, two of those terms, data wrangling and data mining, are compared with data preprocessing in a simple table to get us going before we move into data preprocessing proper.


| **Feature** | **Data Wrangling** | **Data Preprocessing** | **Data Mining** |
|-------------|--------------------|------------------------|-----------------|
| **Purpose** | Preparing and structuring raw data | Transforming data into a format suitable for ML | Extracting insights and patterns |
| **Focus** | Cleaning, transforming, and integrating data | Encoding, scaling, and feature selection | Finding hidden relationships in data (the complete ML pipeline) |
| **Techniques** | Data cleaning, merging, reshaping | Normalization, encoding, feature extraction | Clustering, classification, regression, anomaly detection |
| **End Goal** | Organized, structured data | Model-ready dataset | Actionable insights for decision-making |


<hr>

**Data Preprocessing**

In the Nigerian setting, when you see people cooking many varieties of dishes with big sets of pots, the next thing that comes to mind is "What is the celebration?" That is, they are most likely cooking for a party or an event. It is the same thing when it comes to data preprocessing. So, in a layman's definition, we can say that data preprocessing is the preparing and cooking of your dataset for ML model building. I think this should be clear enough.

If simplicity feels like a scam to you, let's subscribe to a little bit of complexity in our definition.

Data preprocessing is the step in the machine learning pipeline where raw data is transformed into a structured, clean, and suitable format for model training. It ensures that data is compatible with machine learning algorithms through handling inconsistencies, scaling, encoding, and feature selection.
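
As a rough illustration of this definition, the sketch below scales numeric columns and encodes a categorical column with scikit-learn. The DataFrame and its column names are hypothetical, and the choice of `MinMaxScaler` plus `OneHotEncoder` is only one possible combination; your dataset may call for different transformers.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical, already-cleaned dataset (made up for illustration).
df = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "income": [100.0, 200.0, 300.0, 500.0],
    "city": ["Lagos", "Abuja", "Lagos", "Kano"],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Scaling: squeeze each numeric column into the range [0, 1].
        ("scale", MinMaxScaler(), ["age", "income"]),
        # Encoding: turn each city into its own 0/1 column.
        ("encode", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

# X is now fully numeric: 2 scaled columns + 3 one-hot city columns.
X = preprocessor.fit_transform(df)
print(X.shape)
```

The result is a purely numeric array that a model can train on, which is what "model-ready" means in practice.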


<hr>

**Why Should We Preprocess Data Before Building The ML Model?**

**NOTE:** _I will introduce some new terms here. Don't worry if you don't understand them yet, but it is good that you are aware of their existence. I have put them in bold for emphasis._

Let's break it down. The effectiveness of our model is at the mercy of how well our features are engineered, and **feature engineering** is at the mercy of data preprocessing. Data preprocessing is crucial in machine learning because raw data is often incomplete, inconsistent, and noisy. Without proper preprocessing, **ML algorithms** may learn incorrect patterns, leading to poor performance.


So let's answer the question now.

1.  Improves Data Quality: Even after cleaning, **features** might still contain errors, inconsistencies and missing values that can affect the performance of the **model**. Preprocessing deals with these and prepares the dataset for **feature selection**.
2.  Enhances Model Performance: Most ML algorithms learn better when the features are **scaled, normalized** and structured properly. So proper preprocessing ensures that features contribute meaningfully to **predictions.**
3.  Reduces the Risk of **Overfitting, Underfitting, High Variance** and **Bias**: For example, handling **imbalanced datasets** keeps the model from being biased towards the dominant **class**.
4.  Ensures Algorithm Compatibility: Many ML algorithms require numeric inputs, so categorical variables must first be converted to numbers (**encoding**). Some algorithms, like KNN and neural networks, are sensitive to scale and therefore require normalization.
5.  Improves Model Accuracy and Efficiency: Well-preprocessed data helps models learn better patterns, leading to improved **accuracy** and **generalization**. It also reduces **computational complexity** by eliminating **irrelevant features**.
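
To see why scale matters for distance-based algorithms like KNN, here is a small NumPy sketch. The numbers, the column meanings (age and income) and the helper functions `min_max_scale` and `nearest_to_first` are all made up for illustration. It simply checks which point is closest to the first row, before and after min-max scaling.

```python
import numpy as np

# Hypothetical data: columns are [age in years, income].
X = np.array([
    [25.0, 50_000.0],  # row 0: the query point
    [60.0, 51_000.0],  # row 1: very different age, similar income
    [27.0, 54_000.0],  # row 2: similar age, somewhat different income
    [40.0, 90_000.0],  # row 3: different on both
])


def min_max_scale(a: np.ndarray) -> np.ndarray:
    """Scale every column to the range [0, 1]."""
    lo, hi = a.min(axis=0), a.max(axis=0)
    return (a - lo) / (hi - lo)


def nearest_to_first(a: np.ndarray) -> int:
    """Return the row index of the point closest (Euclidean) to row 0."""
    distances = np.linalg.norm(a[1:] - a[0], axis=1)
    return int(np.argmin(distances)) + 1


# Unscaled: income dominates the distance, so the 60-year-old looks "nearest".
print(nearest_to_first(X))
# Scaled: both features count, so the 27-year-old becomes the nearest.
print(nearest_to_first(min_max_scale(X)))
```

On the raw data the income column swamps the age column just because its numbers are bigger. After scaling, both features get a fair say in the distance.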

<hr>

**Technical Terms Used in Data Preprocessing**

Who doesn't admire a professional who sounds technical and at the same time reasonable when delivering talks in their field? I do admire them, I don't know about you... So don't be overwhelmed by the next table, I just made life easier for you. When you know and understand this stuff, you are like 70% of the way to being a professional ML hypeman, trust me!

You really don't have to know all of them. We will obviously not talk about most of them, and some you may never use in your ML practice. But keep the table handy as a cheatsheet.

| **Term** | **Meaning** | **Implementation (Python)** |
|----------|-------------|-----------------------------|
| **Missing Value Imputation** | Strategies for filling in missing data | `df.fillna()` or `KNNImputer()` (from `sklearn`) |
| **Mean/Median/Mode Imputation** | Replacing missing values with the average, median or mode | `df.fillna(df.mean())`, `df.fillna(df.median())`, `df.fillna(df.mode().iloc[0])` |
| **KNN Imputation** | Using similar data points to fill in missing values | `KNNImputer()` (from `sklearn`) |
| **Regression Imputation** | Predicting missing values using a regression model | Custom implementation using `LinearRegression()` |
| **Multiple Imputation** | Creating multiple plausible versions of data and combining them | `IterativeImputer()` (from `sklearn`; experimental, so first run `from sklearn.experimental import enable_iterative_imputer`) |
| **Outlier Detection** | Identifying and handling extreme values | `zscore()` (from `scipy.stats`), IQR method |
| **Z-score Method** | Identifying outliers based on standard deviation | `zscore(df['col'])` |
| **IQR Method** | Identifying outliers based on interquartile range | `IQR = df['col'].quantile(0.75) - df['col'].quantile(0.25)` |
| **Box Plots** | Visual representation of data distribution with outliers | `sns.boxplot(x='col', data=df)` |
| **Winsorizing/Clipping** | Limiting extreme values to a specified threshold | `winsorize()` (from `scipy.stats.mstats`), `df['col'].clip()` |
| **Data Deduplication** | Removing duplicate records | `df.drop_duplicates()` |
| **Noise Removal** | Reducing random errors or noise in data | `df['col'].rolling(window).mean()` or `scipy.signal.savgol_filter()` |
| **Smoothing techniques** | Applying filters to reduce random noise in time-series data | Moving average via `df['col'].rolling(window).mean()`, Savitzky-Golay via `savgol_filter()` (from `scipy.signal`) |
| **Data Type Conversion** | Changing data types for consistency | `df['col'] = df['col'].astype(int)` |
| **Data Transformation** | Changing the format or structure of the data | `np.log()`, `boxcox()` (from `scipy.stats`) |
| **Normalization** | Scaling data to a specific range (usually 0 to 1) | `MinMaxScaler()` (from `sklearn`) |
| **Min-Max Scaling** | Scaling data between 0 and 1 | `MinMaxScaler().fit_transform(df)` |
|