In [None]:
# What are standardization and normalization in EDA?



# Standardization and Normalization in EDA: A Brief Overview
# Standardization and normalization are two common preprocessing techniques used alongside Exploratory Data Analysis (EDA). Both rescale numeric features so that columns measured in different units can be compared and analyzed on a common footing.

# Standardization
# Goal: Scales data to have a mean of 0 and a standard deviation of 1.

# Formula:

# z = (x - mean) / standard_deviation
# where:

# z is the standardized value
# x is the original value
# mean is the mean of the dataset
# standard_deviation is the standard deviation of the dataset
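# As a worked illustration of the formula above, here is a minimal sketch using only the Python standard library. The sample values are made up, and using the population standard deviation (statistics.pstdev, which divides by n) is an assumption chosen to match what scikit-learn's StandardScaler computes.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0]  # made-up sample

mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population standard deviation (divides by n)

# Apply z = (x - mean) / standard_deviation to every value
z_scores = [(x - mean) / std for x in values]

print(z_scores)
# The standardized values should have mean ~0 and standard deviation ~1
print(statistics.fmean(z_scores), statistics.pstdev(z_scores))
```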
# Use Cases:

# When data is on different scales (e.g., age and income).
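# For algorithms that are sensitive to feature scale, such as distance-based methods (k-nearest neighbors, k-means), PCA, and models trained with gradient descent or regularization. Standardization does not change the shape of a distribution, so it does not make data normal.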
# Normalization
# Goal: Scales data to a specific range, typically between 0 and 1.

# Formula:

# z = (x - min) / (max - min)
# where:

# z is the normalized value
# x is the original value
# min is the minimum value in the dataset
# max is the maximum value in the dataset
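# The min-max formula above can be sketched the same way. The sample values are made up, and the guard against a constant column is an added assumption, since the formula divides by zero when max equals min.

```python
values = [10.0, 20.0, 30.0, 40.0]  # made-up sample

lo, hi = min(values), max(values)
if hi == lo:
    # A constant column has no range to rescale
    raise ValueError("min-max scaling is undefined when max == min")

# Apply z = (x - min) / (max - min) to every value
normalized = [(x - lo) / (hi - lo) for x in values]

print(normalized)
```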
# Use Cases:

# When data is on a bounded scale (e.g., percentages).
# For algorithms that expect or benefit from inputs in a bounded range (e.g., pixel intensities fed to a neural network).
# Choosing Between Standardization and Normalization:

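# Standardization is often preferred when features are roughly bell-shaped or contain outliers, because the scale is set by the mean and standard deviation rather than by the two most extreme values.
# Normalization is useful when you need values in a fixed range. Because the minimum and maximum define the scale, a single outlier can compress the remaining values into a narrow band.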
# Example:

# import pandas as pd
# from sklearn.preprocessing import StandardScaler, MinMaxScaler

# # Sample data
# data = {'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]}
# df = pd.DataFrame(data)

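# # Standardization: each column gets mean 0 and standard deviation 1
# standard_scaler = StandardScaler()
# df_standardized = pd.DataFrame(standard_scaler.fit_transform(df), columns=df.columns)

# # Normalization: each column is rescaled to the range [0, 1]
# min_max_scaler = MinMaxScaler()
# df_normalized = pd.DataFrame(min_max_scaler.fit_transform(df), columns=df.columns)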

# By understanding the differences between standardization and normalization, you can choose the appropriate technique to preprocess your data effectively in EDA.







# Do both perform the same operation?



# No, standardization and normalization do not perform the same operation.

# While both techniques are used to scale data, they achieve different results:

# Standardization: Shifts and rescales data to have a mean of 0 and a standard deviation of 1. The output range is unbounded.
# Normalization: Shifts and rescales data to a specific range (typically 0 to 1). This ensures all values are within a certain range, but the scale is set by the minimum and maximum, so outliers can compress the remaining values.
# Both are linear transformations, so within a single feature each preserves the ordering and relative spacing of values. Which technique to use depends on the specific requirements of your analysis. For example, if you're working with scale-sensitive methods such as k-nearest neighbors, PCA, or regularized regression, standardization is often preferred. If you need all values within a fixed range (for example, pixel intensities fed to a neural network), normalization might be more suitable.







# Elaborate: does normalization distort relative distances between values?



# When you normalize data, you're essentially rescaling it to fit within a specific range, typically 0 to 1. This can be useful in situations where you need all values to be on a comparable scale, such as when using algorithms that require input values to be within a certain range.

# Within a single feature, normalization does not distort relative distances. The formula is a linear map: it subtracts the minimum and divides every value by the same constant (max - min), so the ordering of values and the ratios between gaps stay the same. What changes is the absolute spacing. If one value is an extreme outlier, (max - min) becomes large and all the other values are squeezed into a small part of the 0 to 1 range.

# Standardization is also a linear map, so it preserves relative spacing in the same way. It divides by the standard deviation instead of the range, which is usually less dominated by a single extreme value, and it places no bound on the output.

# Where distances do change is across features. Each column is divided by its own constant, so the Euclidean distance between two rows differs before and after scaling under either technique. That is usually the intent: it stops a feature with large units from dominating the distance. Therefore, the choice between standardization and normalization depends on your specific use case and the properties of your data. If you need all values within a specific range, normalization might be appropriate. If the data contains outliers or the algorithm only needs features on a comparable scale, standardization is often a better choice.
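
# One way to check how each transform affects relative spacing is to compare the ratio of neighbouring gaps before and after scaling. The sketch below uses made-up values with one large outlier; min_max, z_score, and gap_ratio are hypothetical helpers written from the two formulas earlier in this note, not library functions.

```python
import statistics

values = [1.0, 2.0, 4.0, 100.0]  # made-up sample with one large outlier

def min_max(xs):
    # Normalization: (x - min) / (max - min)
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    # Standardization: (x - mean) / population standard deviation
    mean, std = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]

def gap_ratio(xs):
    # Gap between the 2nd and 3rd values relative to the gap between the 1st and 2nd
    return (xs[2] - xs[1]) / (xs[1] - xs[0])

print(gap_ratio(values), gap_ratio(min_max(values)), gap_ratio(z_score(values)))
print(min_max(values))
```

# Both transforms are linear within a feature, so the gap ratio should come out the same in all three cases (up to floating-point rounding), while the outlier pushes the first three normalized values close to 0.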







# Standardization does not scale data to 0 to 1;
# it just standardizes the data to have mean = 0 and standard deviation = 1?



# You're absolutely right. Standardization does not scale data to a specific range like 0 to 1. Instead, it transforms the data to have a mean of 0 and a standard deviation of 1.

# This transformation is useful when:

# Data is on different scales: Standardization ensures that all features are on a comparable scale, making it easier to compare and analyze them.
# Algorithms are sensitive to feature scale: Distance-based methods, PCA, and regularized or gradient-trained models such as ridge and logistic regression tend to behave better when features share a common scale. Standardization does not make data normally distributed; it only shifts and rescales it.
# Here's a brief recap of the key differences between standardization and normalization:

# Standardization:

# Scales data to have a mean of 0 and a standard deviation of 1.
# Preserves the shape of the distribution and the relative spacing of values; the output range is unbounded.
# Useful when data is on different scales or algorithms are sensitive to feature scale.
# Normalization:

# Scales data to a specific range (typically 0 to 1).
# Also preserves relative spacing within a feature, but the scale is set by the minimum and maximum, so outliers compress the remaining values.
# Useful when data needs to be on a specific scale or for algorithms that require input values within a certain range.
# By understanding these differences, you can choose the appropriate technique to preprocess your data based on your specific needs.

In [1]:
import seaborn as sns

In [5]:
df = sns.load_dataset('taxis')

In [6]:
df.head()

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.6,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.0,0.0,9.3,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan
3,2019-03-10 01:23:59,2019-03-10 01:49:51,1,7.7,27.0,6.15,0.0,36.95,yellow,credit card,Hudson Sq,Yorkville West,Manhattan,Manhattan
4,2019-03-30 13:27:42,2019-03-30 13:37:14,3,2.16,9.0,1.1,0.0,13.4,yellow,credit card,Midtown East,Yorkville West,Manhattan,Manhattan


In [7]:
from sklearn.preprocessing import MinMaxScaler

In [8]:
min_max = MinMaxScaler()

In [9]:

min_max.fit(df[['distance','fare','tip']])

In [10]:
min_max.transform(df[['distance','fare','tip']])

array([[0.04359673, 0.04026846, 0.06475904],
       [0.02152589, 0.02684564, 0.        ],
       [0.0373297 , 0.04362416, 0.07108434],
       ...,
       [0.11280654, 0.10067114, 0.        ],
       [0.03051771, 0.03355705, 0.        ],
       [0.10490463, 0.09395973, 0.10120482]])
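
# To make the fit/transform split above concrete, here is a minimal pure-Python sketch. TinyMinMax is a hypothetical stand-in written for illustration, not scikit-learn's implementation, and the rows are made up. fit only learns each column's minimum and maximum; transform only applies them.

```python
class TinyMinMax:
    """Hypothetical illustration of the fit/transform split; not scikit-learn's code."""

    def fit(self, rows):
        # Learn the per-column minimum and maximum from the data
        columns = list(zip(*rows))
        self.mins = [min(col) for col in columns]
        self.maxs = [max(col) for col in columns]
        return self

    def transform(self, rows):
        # Apply the stored minimum and maximum; nothing is re-learned here
        return [
            [(x - lo) / (hi - lo) for x, lo, hi in zip(row, self.mins, self.maxs)]
            for row in rows
        ]


train = [[1.0, 10.0], [2.0, 20.0], [5.0, 50.0]]  # made-up rows used for fitting
new = [[6.0, 30.0]]  # made-up unseen row

scaler = TinyMinMax().fit(train)
print(scaler.transform(train))
print(scaler.transform(new))
```

# The unseen row maps to 1.25 in the first column because 6.0 exceeds the fitted maximum of 5.0. As far as I know, MinMaxScaler behaves the same way unless its clip option is enabled, which is why the scaler is normally fitted once and then reused for new data.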

In [11]:
# fit_transform fits and transforms in one call, so the output matches In [10]
min_max.fit_transform(df[['distance','fare','tip']])

array([[0.04359673, 0.04026846, 0.06475904],
       [0.02152589, 0.02684564, 0.        ],
       [0.0373297 , 0.04362416, 0.07108434],
       ...,
       [0.11280654, 0.10067114, 0.        ],
       [0.03051771, 0.03355705, 0.        ],
       [0.10490463, 0.09395973, 0.10120482]])