
**Explore and implement diverse data transformation techniques
(Z-score, Min-Max, Mean normalization, Max Absolute, Robust scaling) in Python,
understanding their impact on data distribution for effective preprocessing.**

---



**Data Normalization:** Data normalization is a common practice in machine learning that transforms numeric columns to a common scale. Feature values can differ from one another by several orders of magnitude, and without rescaling, the features with larger values tend to dominate the learning process.
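
To illustrate how a large-valued feature can dominate, here is a minimal sketch using two hypothetical passengers. The `age` and `fare` numbers and the assumed feature ranges are made up for illustration and are not taken from the dataset. It measures how much of the squared Euclidean distance comes from the fare, before and after rescaling.

```python
import numpy as np

# Two hypothetical passengers described by (age, fare); values are illustrative only.
a = np.array([22.0, 10.0])
b = np.array([60.0, 110.0])

# Share of the squared Euclidean distance contributed by fare on the raw scale.
raw_contrib = (a - b) ** 2
raw_share_fare = raw_contrib[1] / raw_contrib.sum()

# Divide each feature by an assumed range (age: 80, fare: 512) and repeat.
ranges = np.array([80.0, 512.0])
scaled_contrib = ((a - b) / ranges) ** 2
scaled_share_fare = scaled_contrib[1] / scaled_contrib.sum()

print(f"fare share of squared distance, raw:    {raw_share_fare:.3f}")
print(f"fare share of squared distance, scaled: {scaled_share_fare:.3f}")
```

On the raw scale the fare difference accounts for most of the distance, even though the age difference spans nearly half of the assumed age range. After rescaling, the age difference carries most of the weight.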

In [None]:
### Titanic Dataset

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import style

In [None]:
df = pd.read_csv("/content/train..csv")

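Fill NaN values with 0. Note that `fillna` returns a new DataFrame; `df` itself is unchanged unless the result is assigned back.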

In [None]:
df.fillna(0)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Fare zscore
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,0,S,-0.502445
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0.786845
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,0,S,-0.488854
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,0.420730
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,0,S,-0.486337
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,0,S,-0.386671
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,-0.044381
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,0.0,1,2,W./C. 6607,23.4500,0,S,-0.176263
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,-0.044381


In [None]:
df.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Fare zscore
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,-0.502445
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0.786845


**Z-Score**

z = (X – μ) / σ

where,

z = Z-Score,

X = The value of the element,

μ = The population mean, and

σ = The population standard deviation


*   Z-scores make it easier to compare values from different datasets because they remove the original units of measurement.
*   A z-score indicates how far a data point is from the mean in terms of standard deviations, providing a measure of the data point's relative position within the distribution.
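
The formula above can be sketched directly in pandas. This is a minimal example on a few illustrative fare values, not the full Titanic column. It uses `ddof=0` so that the standard deviation is the population one from the formula; pandas defaults to `ddof=1`, whereas `scipy.stats.zscore` defaults to `ddof=0`.

```python
import pandas as pd

# A few illustrative fares; not the full Titanic column.
fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05], name="Fare")

mu = fare.mean()            # mean of the values
sigma = fare.std(ddof=0)    # population standard deviation (divide by N)

z = (fare - mu) / sigma     # z = (X - mu) / sigma, applied element-wise

print(z.round(4))
print("mean of z:", round(z.mean(), 10))
print("std of z: ", round(z.std(ddof=0), 10))
```

After the transformation the column has mean 0 and population standard deviation 1, up to floating-point error.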



In [None]:
import scipy.stats as stats
df['Fare zscore'] = stats.zscore(df['Fare'])

In [None]:
df.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Fare zscore
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,-0.502445
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0.786845


**Min-Max**

The min-max approach (often called normalization) rescales a feature to the fixed range [0, 1] by subtracting the feature's minimum value and then dividing by its range (maximum minus minimum). We can apply min-max scaling in Pandas using the .min() and .max() methods.

In [None]:
df_minmax = df.copy()

# Apply (x - min) / (max - min) to every float column.
for column in df_minmax.select_dtypes(include='float').columns:
    col_min, col_max = df_minmax[column].min(), df_minmax[column].max()
    df_minmax[column] = (df_minmax[column] - col_min) / (col_max - col_min)

print(df_minmax.select_dtypes(include='float').head(2))


        Age      Fare  Fare zscore
0  0.271174  0.014151     0.014151
1  0.472229  0.139136     0.139136
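
One edge case the loop above does not handle is a constant column: its range is zero, so the division yields NaN. The sketch below wraps the same formula in a hypothetical helper, `min_max_scale`, that returns zeros in that case. Returning zeros is an assumption made for illustration; other conventions are possible. The toy DataFrame is also made up.

```python
import pandas as pd

def min_max_scale(col: pd.Series) -> pd.Series:
    """Rescale a numeric column to [0, 1]; hypothetical helper for illustration."""
    col_range = col.max() - col.min()
    if col_range == 0:
        # Constant column: zero range, so return zeros instead of dividing 0 by 0.
        return pd.Series(0.0, index=col.index, name=col.name)
    return (col - col.min()) / col_range

# Toy data; "Const" is a deliberately constant column.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Const": [5.0, 5.0, 5.0, 5.0],
})

scaled = toy.apply(min_max_scale)   # applies the helper column by column
print(scaled)
```

The smallest `Age` maps to 0, the largest to 1, and the constant column stays at 0 throughout.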


In [None]:
#import matplotlib.pyplot as plt
#df_minmax.select_dtypes(include='float64').plot(kind = 'bar')


**Maximum absolute scaling**

Maximum absolute scaling rescales each feature to the range [-1, 1] by dividing every observation by the feature's maximum absolute value. We can apply maximum absolute scaling in Pandas using the .max() and .abs() methods, as shown below.

In [None]:
df_max_scaled = df.copy()

# Divide every float column by its largest absolute value.
for column in df_max_scaled.select_dtypes(include='float').columns:
    df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()

display(df_max_scaled.select_dtypes(include='float').head())


Unnamed: 0,Age,Fare,Fare zscore
0,0.275,0.014151,-0.051974
1,0.475,0.139136,0.081394
2,0.325,0.015469,-0.050569
3,0.4375,0.103644,0.043522
4,0.4375,0.015713,-0.050308
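
The Titanic float columns are mostly non-negative, so the [-1, 1] range is easier to see on a toy column with negative values. The values below are made up for illustration.

```python
import pandas as pd

# Toy column containing negative values; illustrative only.
values = pd.Series([-4.0, -1.0, 0.0, 2.0, 3.0])

# Divide by the largest magnitude in the column (4.0 here).
scaled = values / values.abs().max()

print(scaled.tolist())   # [-1.0, -0.25, 0.0, 0.5, 0.75]
```

Signs and zeros are preserved, and the data is not shifted: only the value with the largest magnitude reaches -1 or 1.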


**Mean Normalization**

Mean normalization is another feature-scaling technique. It subtracts the feature's mean from every value and then divides by the feature's range (maximum minus minimum), so the rescaled feature is centered at 0. Dividing by the standard deviation instead of the range gives the z-score described earlier.

In [None]:
df_mn = df.copy()

# Apply (x - mean) / (max - min) to every float column.
for column in df_mn.select_dtypes(include='float').columns:
    col_range = df_mn[column].max() - df_mn[column].min()
    df_mn[column] = (df_mn[column] - df_mn[column].mean()) / col_range

display(df_mn.select_dtypes(include='float').head())

Unnamed: 0,Age,Fare,Fare zscore
0,-0.096747,-0.048707,-0.048707
1,0.104309,0.076277,0.076277
2,-0.046483,-0.04739,-0.04739
3,0.066611,0.040786,0.040786
4,0.066611,-0.047146,-0.047146
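
As a quick sanity check of the transformation, the sketch below applies it to a handful of illustrative ages (not the full column) and confirms that the result has mean 0 and spans a total width of 1.

```python
import pandas as pd

# A few illustrative ages; not the full Titanic column.
age = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0], name="Age")

# (x - mean) / (max - min)
mean_norm = (age - age.mean()) / (age.max() - age.min())

print(mean_norm.round(4).tolist())
print("mean:", round(mean_norm.mean(), 10))
print("max - min:", round(mean_norm.max() - mean_norm.min(), 10))
```

Unlike min-max scaling, the values are not pinned to [0, 1]; they fall somewhere inside [-1, 1], depending on where the mean sits relative to the minimum and maximum.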


**Robust scaling**

Robust scaling rescales each feature by subtracting the median and then dividing by the interquartile range (IQR). The IQR is the difference between the third and the first quartile and spans the central 50% of the data. Because the median and the IQR are less sensitive to extreme values than the mean and the range, this method is less affected by outliers.

In [None]:
df_robust = df.copy()

# Apply (x - median) / IQR to every float column.
for column in df_robust.select_dtypes(include='float').columns:
    iqr = df_robust[column].quantile(0.75) - df_robust[column].quantile(0.25)
    df_robust[column] = (df_robust[column] - df_robust[column].median()) / iqr

display(df_robust.select_dtypes(include='float').head())

Unnamed: 0,Age,Fare,Fare zscore
0,-0.335664,-0.312011,-0.312011
1,0.559441,2.461242,2.461242
2,-0.111888,-0.282777,-0.282777
3,0.391608,1.673732,1.673732
4,0.391608,-0.277363,-0.277363
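
To see how robust scaling behaves around outliers, the sketch below compares it with min-max scaling on a toy column with one extreme value. The numbers are made up for illustration.

```python
import pandas as pd

# Toy fares with a single extreme value; illustrative only.
fare = pd.Series([7.0, 8.0, 9.0, 10.0, 500.0])

# Robust scaling: (x - median) / IQR
median = fare.median()
iqr = fare.quantile(0.75) - fare.quantile(0.25)
robust = (fare - median) / iqr

# Min-max scaling for comparison: (x - min) / (max - min)
min_max = (fare - fare.min()) / (fare.max() - fare.min())

print(pd.DataFrame({"fare": fare, "robust": robust, "min_max": min_max}))
```

With min-max scaling, the outlier squeezes the four typical fares into a very narrow band near 0. With robust scaling they remain spread between -1 and 0.5, and the outlier simply shows up as a very large value.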
