## **Data Processing for a Machine Learning Model**

Data processing is the process of cleaning, transforming, and preparing raw data for an ML model. 
Proper data processing improves model accuracy, efficiency, and generalization. It includes several steps:
- Handling missing values
- Removing duplicates
- Fixing inconsistent formatting (date formats, capitalization, number formatting)
- Feature encoding (most ML models work only on numerical data, so categorical values must be converted into numbers, e.g. a column of female/male values into 0 and 1)
- Feature scaling
- Dimensionality reduction
- Handling imbalanced data
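
Duplicate removal and format cleanup are listed above but not demonstrated later in this notebook, so here is a minimal sketch of those two steps. The table `raw` and its column names are made up for illustration and are not the `Data.csv` used below.

```python
import pandas as pd

# Hypothetical toy table, invented for this example only
raw = pd.DataFrame({
    "Country": ["France", "france ", "Spain", "Spain"],
    "Joined": ["2021-01-05", "2021-01-05", "2021-02-05", "2021-02-05"],
    "Salary": ["72,000", "72,000", "48,000", "48,000"],
})

clean = raw.copy()
# Normalize capitalization and strip stray whitespace
clean["Country"] = clean["Country"].str.strip().str.title()
# Parse date strings into a proper datetime column
clean["Joined"] = pd.to_datetime(clean["Joined"], format="%Y-%m-%d")
# Remove thousands separators and convert to numbers
clean["Salary"] = clean["Salary"].str.replace(",", "", regex=False).astype(float)
# Drop rows that are exact duplicates once the formatting is consistent
clean = clean.drop_duplicates().reset_index(drop=True)

print(clean)
```

Formatting is fixed before `drop_duplicates` on purpose: "France" and "france " only count as duplicates once they are written the same way.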


In [2]:
## Importing Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [3]:
## Creating a DataFrame (an in-memory table) from the CSV file
dataset = pd.read_csv('./Data.csv')
# X: feature matrix (every column except the last); y: target (the last column)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values



In [4]:
### Handling missing values
## Counting the null values in each column
print(dataset.isnull().sum())

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64


In [5]:
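from sklearn.impute import SimpleImputer
# Replace each missing value (NaN) with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# Columns 1 and 2 (Age, Salary) are numeric; column 0 (Country) is text and is skipped
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
X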

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
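
To see where the two non-integer values in the output come from, the mean imputation can be reproduced by hand with NumPy. This sketch assumes the two non-integer entries above (the salary in row 4 and the age in row 6, counting from 0) are the ones that were missing; the arrays `age` and `salary` are copied from that output with those entries put back as NaN.

```python
import numpy as np

# Age and Salary as printed above, with the two imputed entries restored to NaN
# (assumption: these are the values that were originally missing)
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan, 58000, 52000,
                   79000, 83000, 67000], dtype=float)

# np.nanmean averages only the entries that are not NaN
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)

# Replace only the NaN positions with the column mean
age_filled = np.where(np.isnan(age), age_mean, age)
salary_filled = np.where(np.isnan(salary), salary_mean, salary)

print(age_mean, salary_mean)
```

The two means are the averages of the nine known values in each column, which is what `SimpleImputer` with `strategy="mean"` fills in.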

## Encoding Categorical data

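Machine learning models cannot be trained directly on string values, and the Country column above contains strings. We also should not simply map the countries to 1, 2, 3, because the model would read those numbers as ordered (3 > 2 > 1) and treat one country as outranking another, which biases it. That is why we use one-hot encoding, which represents each category as a binary vector containing a single 1:

| Country | France | Spain | Germany |
|---|---|---|---|
| France | 1 | 0 | 0 |
| Spain | 0 | 1 | 0 |
| Germany | 0 | 0 | 1 |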
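
As a rough illustration of what one-hot encoding does, the binary vectors can be built by hand with NumPy. This is only a sketch of the idea, not how `OneHotEncoder` is implemented; the `countries` array is a made-up sample.

```python
import numpy as np

# Hypothetical sample of the Country column
countries = np.array(["France", "Spain", "Germany", "Spain"])

# np.unique returns the sorted categories and, for each row, the index
# of its category within that sorted list
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot)
```

Note that the categories come out in sorted order (France, Germany, Spain). With its default settings, `OneHotEncoder` also orders categories this way, which is why the encoded output further down has Germany in the second column and Spain in the third.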


In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[("encoder",OneHotEncoder(),[0])],remainder="passthrough")
X = np.array(ct.fit_transform(X))

## The .fit method of the models we train later expects a NumPy array, which is why the result is cast with np.array

print(X)

## Convert the dependent variable y into 0 and 1 using LabelEncoder

from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()
y = le.fit_transform(y)

print(y)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
[0 1 0 0 1 1 0 1 0 1]


In [7]:
## Splitting the dataset into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=1)

print(X_train)
print(y_train)
print(X_test) 
print(y_test)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]
[0 1 0 0 1 1 0 1]
[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
[0 1]


# Feature Scaling

#### Example: Why Feature Scaling is Needed
#### Scenario: Predicting House Prices

Suppose you have a dataset with two features:

- House Size (in square feet) → Range: 500 to 5000
- Number of Bedrooms → Range: 1 to 5

If you use a distance-based model such as k-NN, or a model trained with gradient descent or regularization (for example, regularized Linear Regression), the large numerical range of "House Size" (500–5000) will dominate the smaller range of "Number of Bedrooms" (1–5). The model will effectively give more weight to "House Size," even if "Number of Bedrooms" is equally important.

**Effect Without Feature Scaling**

The model might treat "House Size" as much more important just because it has larger values, leading to biased predictions.

**Applying Feature Scaling**

If we apply Min-Max Scaling:

- House Size (500 to 5000) → Scaled to 0 to 1
- Number of Bedrooms (1 to 5) → Scaled to 0 to 1

Now both features are on the same scale, so neither dominates just because of its units, which can improve the model’s learning and prediction accuracy.
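
The effect on a distance-based model such as k-NN can be sketched with three made-up houses. The house values below are hypothetical; the feature ranges are the ones quoted in the scenario, and `min_max` is a small helper written for this example.

```python
import numpy as np

# Hypothetical houses: [size in square feet, number of bedrooms]
a = np.array([1500.0, 1.0])
b = np.array([1520.0, 5.0])   # similar size, very different bedroom count
c = np.array([1900.0, 1.0])   # same bedroom count, larger size

def min_max(x, lo, hi):
    """Scale x to [0, 1] given the per-feature minimum and maximum."""
    return (x - lo) / (hi - lo)

# Feature ranges from the scenario: size 500 to 5000, bedrooms 1 to 5
lo = np.array([500.0, 1.0])
hi = np.array([5000.0, 5.0])

# Euclidean distances from house a, before and after scaling
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)
scaled_ab = np.linalg.norm(min_max(a, lo, hi) - min_max(b, lo, hi))
scaled_ac = np.linalg.norm(min_max(a, lo, hi) - min_max(c, lo, hi))

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
```

On the raw values, house `b` looks like the nearest neighbour of `a` because a four-bedroom difference is tiny next to a 400 square foot difference. After scaling, the bedroom difference spans the whole 0 to 1 range and `c` becomes the nearer house.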

## Common techniques
#### 1. Normalization (min-max scaling)
      New_x = (X - X_min) / (X_max - X_min)
#### 2. Standardization (z-score normalization)
      New_x = (X - X_mean) / X_std
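
Both formulas are short enough to write out directly in NumPy. The sketch below uses a made-up feature column `x` and checks the hand-written versions against scikit-learn's `MinMaxScaler` and `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature column
x = np.array([[500.0], [1500.0], [3000.0], [5000.0]])

# Min-max scaling by hand: maps the smallest value to 0 and the largest to 1
x_minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Standardization by hand; np.std defaults to the population standard
# deviation (ddof=0), which is also what StandardScaler uses
x_std = (x - x.mean(axis=0)) / x.std(axis=0)

# The hand-written results agree with the scikit-learn scalers
assert np.allclose(x_minmax, MinMaxScaler().fit_transform(x))
assert np.allclose(x_std, StandardScaler().fit_transform(x))

print(x_minmax.ravel())
print(x_std.ravel())
```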

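Feature scaling must be done after splitting the data. The scaler's statistics (for standardization, the mean and standard deviation) should be computed from the training set only and then reused to transform the test set. Fitting the scaler on the full dataset would leak information about the test set into training.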

In [8]:
X_test

array([[0.0, 1.0, 0.0, 30.0, 54000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

In [9]:
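from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Scale only Age and Salary (columns 3 onward); the one-hot columns 0-2 stay as 0/1
# Fit on the training set only, then reuse the same mean and std for the test set
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])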
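
A quick way to confirm that the scaler learned its statistics from the training set alone is to inspect them. The sketch below copies the Age and Salary columns from the train/test split printed earlier into standalone arrays (`train`, `test`) so it runs on its own; the variable names are chosen for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary columns of the train/test split printed earlier
train = np.array([[38.77777777777778, 52000.0],
                  [40.0, 63777.77777777778],
                  [44.0, 72000.0],
                  [38.0, 61000.0],
                  [27.0, 48000.0],
                  [48.0, 79000.0],
                  [50.0, 83000.0],
                  [35.0, 58000.0]])
test = np.array([[30.0, 54000.0],
                 [37.0, 67000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean and std from train only
test_scaled = scaler.transform(test)         # reuses the training statistics

# The learned mean is the training-set mean, not the mean of all ten rows
print(scaler.mean_, train.mean(axis=0))
# Training columns are centered at 0; test columns generally are not
print(train_scaled.mean(axis=0), test_scaled.mean(axis=0))
```

The test columns not being centered at exactly 0 is expected: the test set is transformed with the training statistics, just as unseen data would be.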