# Data Preprocessing Template

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [8]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive', force_remount=True)

# Load the CSV file into a pandas DataFrame
file_path = '/content/drive/My Drive/AI-projects-lab/ml-datasets/preprocessing_sample.csv'
dataset = pd.read_csv(file_path)



Mounted at /content/drive


In [57]:
print(type(dataset))
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

print('\nX value')
print(X)
print('\nY value:')
print(y)

print('\nNull records count:\n{0}'.format(dataset.isna().sum()))

<class 'pandas.core.frame.DataFrame'>

X value
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]

Y value:
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']

Null records count:
Country      0
Age          1
Salary       1
Purchased    0
dtype: int64


## Taking care of missing data

**SimpleImputer**:

SimpleImputer is a class in scikit-learn that provides basic strategies for imputing (i.e., filling in) missing values in a dataset.

1) missing_values=np.nan:

This parameter specifies what value the imputer should consider as "missing."
np.nan (from the numpy library) represents missing or undefined values in a dataset.
By setting missing_values=np.nan, you are telling the imputer to look for NaN values in your dataset to impute.

2) strategy='mean':

This parameter specifies the imputation strategy, i.e., how the missing values should be filled.
strategy='mean' means that the imputer will calculate the mean of the non-missing values in each column and replace any missing values (NaN) with this mean.
The 'mean' and 'median' strategies only work on numeric columns, so they cannot be applied to a string column such as Country.

Other strategies you could use include:

- 'median': Replace missing values with the median of the column.
- 'most_frequent': Replace missing values with the most frequent value in the column.
- 'constant': Replace missing values with a constant value (specified by the fill_value parameter).
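
As a rough illustration of how the strategies differ, the sketch below applies each one to the same toy column. The data and the names `col` and `filled` are made up for this example and are not part of the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing value at row index 3 (illustrative data only)
col = np.array([[1.0], [2.0], [2.0], [np.nan], [7.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Keep only the value that was written into the missing slot
    filled[strategy] = float(imputer.fit_transform(col)[3, 0])

# 'constant' uses fill_value to decide what to insert
constant_imputer = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0.0)
filled["constant"] = float(constant_imputer.fit_transform(col)[3, 0])

# mean of 1, 2, 2, 7 is 3.0; median and most frequent value are both 2.0
print(filled)
```

On skewed columns the median is usually less sensitive to extreme values than the mean, which is one reason to try more than one strategy.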

In [58]:
# This cell fails on purpose: the 'mean' strategy cannot handle the non-numeric Country column

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fitting on all of X (including the string column) raises the ValueError shown below
imputer.fit(X)
X = imputer.transform(X)

print(X)


ValueError: Cannot use mean strategy with non-numeric data:
could not convert string to float: 'France'

In [59]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit only on the numeric columns (Age and Salary, indices 1 and 2)
imputer.fit(X[:, 1:3])

# Replace the missing values in those columns with the column means
X[:, 1:3] = imputer.transform(X[:, 1:3])

print(X)



[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
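
The two filled-in values can be checked by hand. The sketch below copies the Age and Salary columns from the printed dataset and averages the non-missing entries with `np.nanmean`; the variable names `age` and `salary` are chosen only for this check.

```python
import numpy as np

# Age and Salary copied from the printed dataset; nan marks the missing entries
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

# np.nanmean ignores nan values, which is what strategy='mean' computes per column
age_mean = np.nanmean(age)        # 349 / 9
salary_mean = np.nanmean(salary)  # 574000 / 9

print(age_mean)
print(salary_mean)
```

These are the values 38.777... and 63777.777... that appear in the imputed rows for Spain and Germany.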


## Encoding categorical data

**ColumnTransformer:**
The ColumnTransformer is a class in scikit-learn that allows you to apply different transformations to different columns of your dataset.
It takes a list of transformers and applies them to specified columns.

1) transformers=[...]:

This parameter specifies a list of transformations to be applied. Each item in the list is a tuple containing three elements:
A name for the transformer: 'encoder' in this case.
The transformer object: OneHotEncoder() here.
The column(s) to apply the transformer to: [0], which refers to the first column of your dataset.

2) OneHotEncoder():

This is a transformer that converts categorical values into a numeric format that ML algorithms can work with, without implying any order among the categories.
One-hot encoding turns a categorical column with n unique values into n binary columns, where each column corresponds to one of the unique values in the original column.
For example, if the first column contains categories like ['red', 'blue', 'green'], OneHotEncoder will create three new columns, one for each category.

3) remainder='passthrough':

This parameter tells ColumnTransformer what to do with the columns that are not explicitly transformed by the transformers list.
'passthrough' means that all other columns should be left unchanged and passed through in the final output.
Other options are 'drop' (the default), which discards the other columns, or another transformer to apply to them.

4) X = np.array(ct.fit_transform(X)):

fit_transform(X) applies the transformation(s) defined in the ColumnTransformer to the data X.
This line transforms the specified columns in X according to the provided transformers (in this case, one-hot encoding the first column) and returns the transformed array.

The result is then converted to a NumPy array using np.array().
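
The pieces described above can be tried on a few rows in isolation. This is a minimal sketch using three rows in the same layout as X; the names `X_demo`, `ct_demo`, `encoded` and `categories` are introduced here for illustration only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows in the same layout as X: Country, Age, Salary
X_demo = np.array([['France', 44.0, 72000.0],
                   ['Spain', 27.0, 48000.0],
                   ['Germany', 30.0, 54000.0]], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough'
)
encoded = np.array(ct_demo.fit_transform(X_demo))

# The encoder sorts its categories, so the dummy columns are France, Germany, Spain
categories = ct_demo.named_transformers_['encoder'].categories_[0]
print(categories)

# Encoded columns come first, followed by the passed-through Age and Salary
print(encoded)
```

Note that the one-hot columns are placed before the passthrough columns, which is why Country ends up as the first three columns of the result.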

In [60]:
# Encoding the independent variables

# In this case the Country column (index 0) is the categorical independent variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(), [0])
    ],
    remainder='passthrough'
)
transformed_X = ct.fit_transform(X)
print('Transformed X:\n{0}'.format(transformed_X))
X = np.array(transformed_X)
print('\nEncoded X:\n{0}'.format(X))


# Encoding the dependent variable

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

Transformed X:
[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]

Encoded X:
[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
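
The cell above does not print the encoded y. As a small sketch of what LabelEncoder does with the Purchased column, the labels below are copied from the printed y values; `y_demo` and `le_demo` are names used only in this example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Purchased column copied from the printed y values
y_demo = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

le_demo = LabelEncoder()
y_encoded = le_demo.fit_transform(y_demo)

# classes_ is sorted, so 'No' maps to 0 and 'Yes' maps to 1
print(le_demo.classes_)
print(y_encoded)

# inverse_transform recovers the original labels from the integer codes
recovered = le_demo.inverse_transform(y_encoded[:3])
print(recovered)
```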


## Splitting the dataset into the Training set and Test set

In [61]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print('Record count for each split :\nX_train:{x_train}, X_test:{x_test}, Y_train:{y_train}, Y_test:{y_test}'.format(
    x_train=len(X_train), x_test=len(X_test), y_train=len(y_train), y_test=len(y_test)))

Record count for each split :
X_train:8, X_test:2, Y_train:8, Y_test:2
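
To see what the two arguments in the call above control, the following sketch splits ten dummy rows twice with the same settings. The arrays `X_demo` and `y_demo` are placeholders and do not come from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten dummy rows standing in for X and y (illustrative data only)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# test_size=0.2 holds out 20% of the rows (2 of 10) for testing
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)

# The same random_state reproduces the same shuffle, so both test sets match
same_split = np.array_equal(split_a[1], split_b[1])
print(same_split)
```

Fixing random_state makes the notebook reproducible: rerunning the cell yields the same train and test rows.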


## Feature Scaling

**StandardScaler:**

1) from sklearn.preprocessing import StandardScaler:

This imports the StandardScaler class from the sklearn.preprocessing module. StandardScaler is used to standardize the features in your dataset.

2) sc = StandardScaler():

This line creates an instance of the StandardScaler class, named sc.
StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so that the feature has zero mean and unit variance.

3) X_train = sc.fit_transform(X_train):

The fit_transform method does two things:

**fit(X_train):** It calculates the mean and standard deviation of each feature in the X_train dataset. This is the "fitting" process.

**transform(X_train):** It applies the transformation to the X_train dataset, using the mean and standard deviation calculated during the fit step.

After this, all features in X_train have a mean of 0 and a standard deviation of 1.
Standardization puts the features on a comparable scale, so that a feature with a large numeric range (such as Salary) does not dominate one with a small range (such as Age). This is particularly important for algorithms that rely on distance calculations (e.g., k-nearest neighbors, SVM) or gradient descent optimization (e.g., linear regression, logistic regression, neural networks).

4) X_test = sc.transform(X_test):

This method applies the same transformation to the X_test dataset that was applied to X_train, using the mean and standard deviation calculated from X_train.

Important: fit_transform is only applied to the training data (X_train) to avoid data leakage. We do not "fit" the scaler on X_test because the test data should be completely unseen by the model during training. Instead, we simply use the parameters learned from the training data to transform the test data.
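
The difference between fit_transform and transform can be seen on a single illustrative feature. In this sketch the arrays and the names `sc_demo` and `manual` are made up for the example; `mean_` and `scale_` are the attributes where StandardScaler stores the fitted statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative single-feature data (not from the dataset)
X_train_demo = np.array([[10.0], [20.0], [30.0]])
X_test_demo = np.array([[40.0]])

sc_demo = StandardScaler()
X_train_scaled = sc_demo.fit_transform(X_train_demo)  # learns mean_ and scale_ from the training rows
X_test_scaled = sc_demo.transform(X_test_demo)        # reuses them, learns nothing new

print(sc_demo.mean_, sc_demo.scale_)

# transform is (x - mean_) / scale_ computed with the training statistics
manual = (X_test_demo - sc_demo.mean_) / sc_demo.scale_
print(np.allclose(manual, X_test_scaled))
```

Because the test row is scaled with the training mean and standard deviation, its scaled value can fall well outside the range seen in the training set, which is expected.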



**Impact of Data Leakage:**

If the scaler were fit on the full dataset (or on X_test) rather than on X_train alone, the consequences would be:

**Overfitting:**

The test data would influence the mean and standard deviation calculations. The model might then appear to perform exceptionally well on X_test during validation, giving you an overly optimistic view of its performance.


**Poor Generalization:**

Because the test data "leaked" into the training process, the model might not generalize well to truly unseen data (like a new test set or data in production).


**Biased Model Evaluation:**

The evaluation metrics (e.g., accuracy, RMSE) calculated on the test set might be misleading because the model had indirect access to test data characteristics during training.
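
A small sketch of how leakage shows up in the scaler itself: fitting on training rows only versus fitting on training and test rows together gives different statistics. The data is illustrative, and `clean` and `leaky` are names chosen for this comparison.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: the test row is an outlier relative to the training rows
X_train_demo = np.array([[10.0], [20.0], [30.0]])
X_test_demo = np.array([[100.0]])

# Correct: statistics come from the training rows only
clean = StandardScaler().fit(X_train_demo)

# Leaky: the test row is included when computing the statistics
leaky = StandardScaler().fit(np.vstack([X_train_demo, X_test_demo]))

print('train-only mean:', clean.mean_[0])  # mean of 10, 20, 30
print('leaky mean:     ', leaky.mean_[0])  # mean of 10, 20, 30, 100
```

In the leaky case the training rows are scaled with a mean that already reflects the test row, so information about the test set has entered the training pipeline.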

In [72]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

print(X_train)
print(X_test)

[[-1.          2.64575131 -0.77459667  0.26306757  0.12381479]
 [ 1.         -0.37796447 -0.77459667 -0.25350148  0.46175632]
 [-1.         -0.37796447  1.29099445 -1.97539832 -1.53093341]
 [-1.         -0.37796447  1.29099445  0.05261351 -1.11141978]
 [ 1.         -0.37796447 -0.77459667  1.64058505  1.7202972 ]
 [-1.         -0.37796447  1.29099445 -0.0813118  -0.16751412]
 [ 1.         -0.37796447 -0.77459667  0.95182631  0.98614835]
 [ 1.         -0.37796447 -0.77459667 -0.59788085 -0.48214934]]
[[-1.          2.64575131 -0.77459667 -1.45882927 -0.90166297]
 [-1.          2.64575131 -0.77459667  1.98496442  2.13981082]]


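### Example:

X_train = [

    [3, 2000, 400000],  # 3 bedrooms, 2000 sqft, $400,000
    [4, 2500, 500000],  # 4 bedrooms, 2500 sqft, $500,000
    [2, 1500, 300000]   # 2 bedrooms, 1500 sqft, $300,000
]


Mean: [3, 2000, 400000]


Standard Deviation: [0.8165, 408.25, 81649.66]

StandardScaler uses the population standard deviation (dividing by n). The sample standard deviation (dividing by n - 1) would be [1, 500, 100000], which would give scaled values of exactly 0, 1 and -1.


X_train_scaled = [

    [0, 0, 0],                    # (3-3)/0.8165, (2000-2000)/408.25, (400000-400000)/81649.66
    [1.2247, 1.2247, 1.2247],     # (4-3)/0.8165, (2500-2000)/408.25, (500000-400000)/81649.66
    [-1.2247, -1.2247, -1.2247]   # (2-3)/0.8165, (1500-2000)/408.25, (300000-400000)/81649.66
]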
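
The house example can be checked against StandardScaler directly. This sketch prints both the population standard deviation that StandardScaler stores in `scale_` and, for comparison, the sample standard deviation from `np.std(..., ddof=1)`. The names `X_train_demo` and `sc_demo` are used only here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same three houses as in the example: bedrooms, sqft, price
X_train_demo = np.array([
    [3, 2000, 400000],
    [4, 2500, 500000],
    [2, 1500, 300000],
], dtype=float)

sc_demo = StandardScaler()
X_scaled = sc_demo.fit_transform(X_train_demo)

sample_std = np.std(X_train_demo, axis=0, ddof=1)  # divides by n - 1

print(sc_demo.mean_)   # per-column mean
print(sc_demo.scale_)  # per-column population standard deviation (divides by n)
print(sample_std)      # sample standard deviation, for comparison
print(X_scaled)
```

With only three rows the two definitions differ noticeably; on larger datasets the gap between dividing by n and by n - 1 becomes negligible.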