## **Data Processing for a Machine Learning Model**

Data processing is the process of cleaning, transforming, and preparing raw data for an ML model. 
Proper data processing can improve model accuracy, efficiency, and generalization. It typically includes the following steps:
- Handling missing values
- Removing duplicates
- Fixing inconsistent formatting (date formats, capitalization, number formatting)
- Feature encoding (most ML models work only on numerical data, so categorical values must be converted to numbers; for example, a column of `female` and `male` becomes 0 and 1)
- Feature scaling
- Dimensionality reduction
- Handling imbalanced data
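
A few of the steps above can be sketched with pandas. This is a minimal illustration, not a full pipeline: the toy table, its column names (`Gender`, `Age`), and the choice of mean imputation are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
raw = pd.DataFrame({
    "Gender": ["female", "Male", "male", "Male", "female"],
    "Age": [25.0, 32.0, np.nan, 32.0, 41.0],
})

# Fix inconsistent capitalization so "Male" and "male" become one category.
raw["Gender"] = raw["Gender"].str.lower()

# Remove exact duplicate rows (two identical "male, 32.0" rows exist).
clean = raw.drop_duplicates().reset_index(drop=True)

# Handle the missing age by filling it with the column mean (one common choice).
clean["Age"] = clean["Age"].fillna(clean["Age"].mean())

# Feature encoding: female -> 0, male -> 1.
clean["Gender"] = clean["Gender"].map({"female": 0, "male": 1})

print(clean)
```

Mean imputation and a hand-written mapping are the simplest options; other strategies (median imputation, one-hot encoding) may suit a real dataset better.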


In [1]:
## Importing Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
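## Creating a DataFrame (an in-memory table) from the CSV file
dataset = pd.read_csv('./Data.csv')
X = dataset.iloc[:, :-1].values  # features: every column except the last, as a NumPy array
y = dataset.iloc[:, -1].values   # target: the last column, as a NumPy array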



numpy.ndarray

# Feature Scaling

#### Example: Why Feature Scaling is Needed
#### Scenario: Predicting House Prices

Suppose you have a dataset with two features:

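- House Size (in square feet) → Range: 500 to 5000
- Number of Bedrooms → Range: 1 to 5

If you use a distance-based model like k-NN, the large numerical range of "House Size" (500–5000) will dominate the smaller range of "Number of Bedrooms" (1–5) in every distance calculation. The model will effectively give more weight to "House Size," even if "Number of Bedrooms" is equally important. Models trained with gradient descent or with regularization, such as regularized Linear Regression, are also sensitive to feature scale. Plain least-squares Linear Regression is not: its coefficients absorb the scale, so its predictions stay the same.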

#### Effect Without Feature Scaling
The model might treat "House Size" as much more important just because it has larger values, leading to biased predictions.
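
To make this concrete, here is a small sketch of the Euclidean distances a k-NN model would compute on unscaled data. The three houses are hypothetical values chosen only to illustrate the effect.

```python
import numpy as np

# Hypothetical houses: [size in square feet, number of bedrooms]
a = np.array([1500.0, 1.0])
b = np.array([1500.0, 5.0])  # same size as a, but 4 more bedrooms
c = np.array([1600.0, 1.0])  # same bedrooms as a, but 100 sq ft larger

dist_ab = np.linalg.norm(a - b)  # driven only by the bedroom difference
dist_ac = np.linalg.norm(a - c)  # driven only by the size difference

print(dist_ab, dist_ac)  # 4.0 100.0
```

House `b` looks much closer to `a` than `c` does, even though `b` differs by the full bedroom range while `c` differs by only a small fraction of the size range.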

#### Applying Feature Scaling
If we apply Min-Max Scaling:

- House Size (500 to 5000) → Scaled to 0 to 1
- Number of Bedrooms (1 to 5) → Scaled to 0 to 1

Now both features are on the same scale, so neither dominates just because of its units. This often improves the model’s learning ability and prediction accuracy.
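
The sketch below applies Min-Max Scaling with the ranges from this scenario and then measures Euclidean distances between houses. The three houses are hypothetical, and the minimum and maximum are taken from the stated ranges rather than computed from data.

```python
import numpy as np

# Hypothetical houses: [size in square feet, number of bedrooms]
houses = np.array([
    [1500.0, 1.0],
    [1500.0, 5.0],  # same size as the first house, 4 more bedrooms
    [1600.0, 1.0],  # same bedrooms as the first house, 100 sq ft larger
])

# Feature ranges from the scenario: size 500-5000, bedrooms 1-5.
mins = np.array([500.0, 1.0])
maxs = np.array([5000.0, 5.0])

# Min-Max Scaling maps each feature to the interval [0, 1].
scaled = (houses - mins) / (maxs - mins)

dist_bedrooms = np.linalg.norm(scaled[0] - scaled[1])  # full bedroom range apart
dist_size = np.linalg.norm(scaled[0] - scaled[2])      # 100 / 4500 of the size range apart

print(dist_bedrooms, dist_size)
```

After scaling, a difference spanning the whole bedroom range counts as a distance of about 1, while a 100 sq ft difference counts as roughly 0.02, so distances reflect each feature's relative variation instead of its units.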

## Common techniques
#### 1. Normalization (min-max scaling)
      x_new = (x - x_min) / (x_max - x_min)
#### 2. Standardization (z-score normalization)
      x_new = (x - mean) / standard_deviation
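
Both formulas can be written in a few lines of NumPy. This is a minimal sketch: the helper names `min_max_scale` and `standardize` are hypothetical, the sample values are made up, and the code assumes no column is constant (a constant column would cause a division by zero).

```python
import numpy as np

def min_max_scale(x):
    """Normalization: rescale values to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return (x - x_min) / (x_max - x_min)

def standardize(x):
    """Standardization: rescale values to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    # np.std defaults to the population standard deviation (ddof=0).
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Hypothetical house sizes in square feet.
sizes = np.array([500.0, 1500.0, 3000.0, 5000.0])

normalized = min_max_scale(sizes)
standardized = standardize(sizes)

print(normalized)
print(standardized)
```

In practice, scikit-learn's `MinMaxScaler` and `StandardScaler` implement the same two transformations and also remember the statistics learned from the training data so they can be reused on test data.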
