#  Name -> Deven Chhajed
# Roll No-> 32
# Batch -> B1 (CSE)
# Prn -> 1032210789
# Data Transformation and Smoothening


# Data Transformation:
Data transformation is the process of converting data from one format or structure to another to prepare it for analysis or modeling. This involves cleaning, scaling, encoding, and other operations to make the data suitable for specific tasks. It's a crucial step in data analysis and can significantly impact results.

# Importance of the Data Transformation:
The importance of data transformation lies in its ability to enhance the quality, suitability, and usability of data for various analytical and modeling tasks.
Here are some key reasons why data transformation is crucial:

# Enhancing the Quality of Data:

**Identifying and Managing Outliers:** Transformations such as z-score standardization make outliers easier to recognize and handle. Left unaddressed, outliers can distort results and reduce the effectiveness of models.

# Improved Model Effectiveness:
**Normalization/Standardization:** Scaling numerical features puts all variables on a comparable scale, so that features with large magnitudes do not dominate distance-based or gradient-based machine learning algorithms. This often improves model performance.

**Encoding Categorical Data:** Transforming categorical variables into numerical representations makes them usable by the many machine learning algorithms that accept only numeric input.
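
As an illustration of encoding, the following minimal sketch one-hot encodes a small categorical feature with NumPy. The `colors` values are made up for this example, and one-hot encoding is only one of several possible encodings.

```python
import numpy as np

# Hypothetical categorical feature (not part of the notebook's data)
colors = np.array(["red", "green", "blue", "green"])

# np.unique returns the sorted categories and, for each entry,
# the index of its category in that sorted list
categories, codes = np.unique(colors, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(one_hot)
```

Each row of `one_hot` contains a single 1 in the column of its category, so the columns follow the order of `categories`.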

# Enhanced Understandability:
**Streamlining:** Data transformation can simplify intricate datasets, making them easier to understand in visualizations and reports.



# Importing the Libraries

**from sklearn.preprocessing import MinMaxScaler:** This imports the MinMaxScaler class from the scikit-learn (sklearn) library. MinMaxScaler linearly rescales numerical data into a given range, by default between 0 and 1. This scaling is commonly used in machine learning so that features with different scales do not dominate the learning process.

**import numpy as np:** By including 'import numpy as np' in Python, you gain access to NumPy, a vital numerical computing library. NumPy provides robust support for arrays, matrices, and mathematical functions, making it a cornerstone for scientific and mathematical computations.

**import pandas as pd**: Pandas, a Python library for data manipulation and analysis, offers essential data structures such as DataFrames and Series to facilitate efficient data handling and analysis.

**from scipy.stats import zscore:**  This function is used for calculating z-scores, which measure how many standard deviations a data point is away from the mean in a dataset. Z-scores are helpful for standardizing and identifying outliers in numerical data, making it easier to compare and analyze data with different scales and distributions.

**import matplotlib.pyplot as plt:** matplotlib.pyplot simplifies the creation of diverse plots like lines, bars, scatter plots, and histograms, supporting static, animated, or interactive data visualizations.

**import math:** By importing the 'math' module in Python, you can tap into a comprehensive collection of mathematical functions and constants. These include trigonometric functions, logarithms, and fundamental mathematical constants like pi and e, all of which are essential for a wide array of mathematical calculations and operations.

In [5]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd
from scipy.stats import zscore
import matplotlib.pyplot as plt
import math

# Data Transformation Techniques:
# **1. Normalization / Standardization:**
**Z_Score Normalization:**
Z-score normalization, also known as standard score normalization, is a specific form of standardization used to transform a dataset such that it has a mean (average) of 0 and a standard deviation of 1. This technique is particularly useful when you want to compare and analyze data points in terms of their deviation from the mean, regardless of the original units or scales of the data.

Here's the formula for z-score normalization:

* Z-Score (x_z) = (x - μ) / σ

Where:

* x is an individual data point in the dataset.

* μ (mu) is the mean (average) of the dataset.

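* σ (sigma) is the standard deviation of the dataset. The code in this notebook uses the population standard deviation, which is the default for both NumPy's `std()` and `scipy.stats.zscore`.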

**Min Max Scaling:**
Min-Max scaling, also known as min-max normalization, is a feature scaling technique that rescales the values of a feature to a specific range, typically between 0 and 1. This puts all features on the same scale, making them comparable and preventing features with larger values from dominating the learning process in machine learning algorithms.

Here's the formula for Min-Max scaling:

* Min-Max Scaled Value (x_scaled) = (x - min(x)) / (max(x) - min(x))

Where:

* x is an individual data point in the feature.

* min(x) is the minimum value in the feature.

* max(x) is the maximum value in the feature.
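
The formula can be extended to any target range. The sketch below is one possible vectorized version; the helper name `min_max_scale` and its `new_min` and `new_max` parameters are assumptions made for this example, and returning `new_min` for a constant feature is just one way to avoid division by zero.

```python
import numpy as np

def min_max_scale(x, new_min=0.0, new_max=1.0):
    """Rescale x linearly so its minimum maps to new_min and its maximum to new_max."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # Constant feature: the formula would divide by zero,
        # so map every value to new_min (an arbitrary choice)
        return np.full_like(x, new_min)
    unit = (x - x.min()) / span          # the formula above, range [0, 1]
    return unit * (new_max - new_min) + new_min

print(min_max_scale([11, 37, 92]))             # default range 0 to 1
print(min_max_scale([11, 37, 92], -1.0, 1.0))  # range -1 to 1
```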

**Decimal Scaling:**

Decimal scaling, also known as decimal normalization, is a data preprocessing technique that scales the values of a feature into the range between -1 and 1 by dividing every value by a power of 10. Unlike Min-Max scaling, which uses the minimum and maximum of the feature, decimal scaling only moves the decimal point, and the power of 10 is determined by the largest absolute value in the feature.

Here's the formula for decimal scaling:

* Decimal Scaled Value (x_scaled) = x / 10^n

Where:

* x is an individual data point in the feature.
* n is the smallest integer such that max(|x_scaled|) < 1.
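
A minimal sketch of decimal scaling, assuming n is chosen as the smallest non-negative integer that brings every absolute value below 1. The helper name `decimal_scale` and the sample values are illustrative only.

```python
import numpy as np

def decimal_scale(values):
    """Divide by the smallest power of 10 that makes every |value| < 1."""
    values = np.asarray(values, dtype=float)
    max_abs = np.abs(values).max()
    n = 0
    # Increase n until the largest absolute value drops below 1
    while max_abs / 10**n >= 1:
        n += 1
    return values / 10**n, n

scaled, n = decimal_scale([37, 82, 15, 92, 11])
print(n)       # power of 10 used
print(scaled)  # values after moving the decimal point n places
```

For these sample values the largest absolute value is 92, so the data is divided by 10^2.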







# Normalization

In [6]:
l=[37, 82, 15, 48, 71, 29, 92, 40, 63, 27, 88, 11, 76, 51, 44, 68, 25, 59]
a=np.array(l)
b=a.reshape(-1,1)

# Min Max Scaling

In [7]:
min_max_scaled_data=[]
for i in a:
	min_max_scaled_data.append((i-a.min())/(a.max()-a.min()))
min_max_scaled_data

[0.32098765432098764,
 0.8765432098765432,
 0.04938271604938271,
 0.4567901234567901,
 0.7407407407407407,
 0.2222222222222222,
 1.0,
 0.35802469135802467,
 0.6419753086419753,
 0.19753086419753085,
 0.9506172839506173,
 0.0,
 0.8024691358024691,
 0.49382716049382713,
 0.4074074074074074,
 0.7037037037037037,
 0.1728395061728395,
 0.5925925925925926]

# Min Max Scaling using Sklearn Library

The MinMaxScaler class from scikit-learn (sklearn) scales numerical features to a specific range, by default between 0 and 1. It expects a 2-D array, which is why the data was reshaped into the single-column array `b` above.

In [8]:
scaler=MinMaxScaler()

In [9]:
scaled_data=scaler.fit_transform(b)
print(scaled_data)

[[0.32098765]
 [0.87654321]
 [0.04938272]
 [0.45679012]
 [0.74074074]
 [0.22222222]
 [1.        ]
 [0.35802469]
 [0.64197531]
 [0.19753086]
 [0.95061728]
 [0.        ]
 [0.80246914]
 [0.49382716]
 [0.40740741]
 [0.7037037 ]
 [0.17283951]
 [0.59259259]]


# Z_Score Calculation

In [10]:
z_score=[]
for i in a:
	z_score.append((i-a.mean())/a.std())
z_score

[-0.5970216145873387,
 1.2629303385501398,
 -1.5063314583434393,
 -0.14236669270928842,
 0.8082754166720895,
 -0.9276797395895571,
 1.6762529948029128,
 -0.47302481771150684,
 0.4776172916698711,
 -1.0103442708401118,
 1.5109239323018036,
 -1.6716605208445485,
 1.014936744798476,
 -0.018369895833456513,
 -0.3076957552103976,
 0.6842786197962576,
 -1.0930088020906663,
 0.31228822916876187]

# Calculation of Z_Score Using Scipy Library

In [11]:
z_score_1=zscore(l)
z_score_1

array([-0.59702161,  1.26293034, -1.50633146, -0.14236669,  0.80827542,
       -0.92767974,  1.67625299, -0.47302482,  0.47761729, -1.01034427,
        1.51092393, -1.67166052,  1.01493674, -0.0183699 , -0.30769576,
        0.68427862, -1.0930088 ,  0.31228823])
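
Z-scores like the ones above are often used to flag outliers. The sketch below assumes a cut-off of 3 standard deviations, which is a common convention rather than a fixed rule, and uses made-up `readings` that contain one obviously extreme value.

```python
import numpy as np

# Hypothetical readings with one extreme value (95)
readings = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95])

# Same z-score definition as above (population standard deviation)
z = (readings - readings.mean()) / readings.std()

threshold = 3  # assumed cut-off; tune it for the data at hand
outliers = readings[np.abs(z) > threshold]
print(outliers)
```

Note that in very small samples no z-score can reach 3, so a lower threshold or a different rule may be needed there.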

## Data Smoothening:
Data smoothing is a technique used to reduce noise in a dataset by applying mathematical methods. It involves creating a smoother version of the data to reveal underlying patterns or trends while removing random fluctuations. Common methods include moving averages, exponential smoothing, and filters, each with its own way of reducing data noise. Data smoothing is useful in various fields, such as signal processing and time series analysis, to improve data interpretation and analysis.
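
Two of the methods named above, the moving average and exponential smoothing, can be sketched in a few lines. The function names, the sample `series`, the window of 3 and `alpha = 0.5` are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical series (not part of the notebook's data)
series = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

def moving_average(x, window):
    """Average each run of `window` consecutive values."""
    weights = np.ones(window) / window
    # mode="valid" keeps only positions where the window fits completely
    return np.convolve(x, weights, mode="valid")

def exponential_smoothing(x, alpha):
    """s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    smoothed = [x[0]]
    for value in x[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return np.array(smoothed)

ma = moving_average(series, 3)
es = exponential_smoothing(series, 0.5)
print(ma)
print(es)
```

A larger window or a smaller `alpha` gives a smoother result but reacts more slowly to real changes in the data.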

## Data Smoothening Importance:
* Noise Reduction: Reduces random fluctuations in data.

* Visualization: Enhances data visualization and interpretation.

* Analysis: Stabilizes statistical analyses and machine learning.

* Forecasting: Improves prediction accuracy in time series data.

* Control Systems: Essential for stabilizing control systems.

* Signal Processing: Filters out unwanted noise in signals.

* Market Analysis: Aids trend identification and decision-making.

* Data Quality: Dampens the effect of outliers.

* Sensor Data: Ensures accurate readings in IoT and sensors.

## Data Smoothening Techniques:

**1. Equal Width Binning:** Equal width binning is a data preprocessing technique that involves dividing continuous data into a specified number of equal-width intervals or bins.


**2. Binning by Mean:** Binning by mean (smoothing by bin means) first sorts the data and partitions it into bins, then replaces every value in a bin with the mean (average) of that bin. Each bin is therefore represented by a single value, its mean.

**3. Equal Frequency Binning:** Equal frequency binning, also known as equi-depth or quantile binning, is a data preprocessing technique used to discretize continuous numerical data into a set of bins or intervals such that each bin contains approximately the same number of data points. This technique is particularly useful when you want to ensure that each bin represents an equal portion of the dataset, making it suitable for handling skewed data distributions.


**4. Custom Binning:** Custom binning, also known as manual or user-defined binning, is a data preprocessing technique where you define bins or intervals based on your domain knowledge, specific requirements, or insights about the data. Unlike other binning techniques that use automated rules, custom binning allows you to group data points into bins according to your expertise and understanding of the data.

**5. Binning by Boundary:** Binning by boundary (smoothing by bin boundaries) first sorts the data and partitions it into bins. The minimum and maximum values of each bin are its boundaries, and every value in the bin is replaced by whichever boundary is closer to it.

**6. Binning by Median:** Binning by median (smoothing by bin medians) first sorts the data and partitions it into bins, then replaces every value in a bin with the median of that bin, which is the middle value of the sorted bin.
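
Equal width binning and custom binning can both be sketched with `np.digitize`. The number of bins, the `custom_edges` and the `custom_labels` below are hypothetical choices made only for this example.

```python
import numpy as np

d = [0, 4, 12, 16, 16, 18, 24, 26, 28]  # sample values

# Equal width: split the range [min, max] into n_bins intervals of equal width
n_bins = 3
edges = np.linspace(min(d), max(d), n_bins + 1)
# Only the interior edges are passed, so the maximum falls into the last bin
equal_width_index = np.digitize(d, edges[1:-1])

# Custom: edges picked by hand (here without any real domain meaning)
custom_edges = [5, 17, 25]
custom_labels = ["very low", "low", "medium", "high"]
custom_index = np.digitize(d, custom_edges)
custom_bins = [custom_labels[i] for i in custom_index]

print(edges)
print(equal_width_index)
print(custom_bins)
```

With equal widths the bins can end up holding different numbers of values, which is the main difference from equal frequency binning.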




In [12]:
d = [0, 4, 12, 16, 16, 18, 24, 26, 28]
d

[0, 4, 12, 16, 16, 18, 24, 26, 28]

In [13]:
d.sort()
d

[0, 4, 12, 16, 16, 18, 24, 26, 28]

# Binning


# Binning by Equal Frequency:

In [14]:
r=max(d)-min(d)
print('Range of list is: ',r)

Range of list is:  28


In [15]:
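# Range divided by the number of values, rounded down, gives 3 here.
# This value is used below both as the number of bins and as the bin size,
# which works because the list has 9 = 3 x 3 values.
b=r/len(d)
bins=math.floor(b)
bins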

3

In [16]:
bin1=np.zeros((bins,bins))
bin1

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [17]:
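# Fill the bins row by row with consecutive sorted values.
# A separate index name is used so the NumPy array `a` from the earlier cells is not overwritten.
idx=0
for i in range(0,bins):
    for j in range(0, bins):
        bin1[i, j] = d[idx]
        idx += 1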

In [18]:
bin1

array([[ 0.,  4., 12.],
       [16., 16., 18.],
       [24., 26., 28.]])
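
The cells above rely on the list having exactly 3 x 3 values. As a more general sketch, `np.array_split` can divide sorted data into a chosen number of bins even when the length is not evenly divisible; the `values` list and `n_bins` below are made up for this example.

```python
import numpy as np

# Hypothetical data with 7 values, which cannot be split into 3 equal bins
values = sorted([5, 1, 9, 3, 7, 2, 8])
n_bins = 3

# array_split allows uneven splits; the earlier bins get the extra values
groups = [chunk.tolist() for chunk in np.array_split(values, n_bins)]
print(groups)
```

The bin sizes differ by at most one value, which matches the "approximately the same number of data points" idea of equal frequency binning.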

# Binning by Mean:

In [None]:
bin2=np.zeros((bins,bins))
bin2

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [None]:
for i in range(0,len(d),bins):
    # Mean of the values in the current bin
    mean=sum(d[i:i+bins])/bins
    k=i//bins
    # Replace every value in the bin with the bin mean
    bin2[k,:]=mean
bin2

array([[ 5.33333333,  5.33333333,  5.33333333],
       [16.66666667, 16.66666667, 16.66666667],
       [26.        , 26.        , 26.        ]])

# Binning by Bin Boundary

In [20]:
bin3=np.zeros((bins,bins))
bin3

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [21]:
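for i in range (0,len(d),bins):
    k=i//bins
    lower=d[i]          # smallest value in the (sorted) bin
    upper=d[i+bins-1]   # largest value in the (sorted) bin
    for j in range (bins):
        # Replace each value with the closer boundary (ties go to the upper one)
        if (d[i+j]-lower) < (upper-d[i+j]):
            bin3[k,j]=lower
        else:
            bin3[k,j]=upper
bin3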

array([[ 0.,  0., 12.],
       [16., 16., 18.],
       [24., 28., 28.]])
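
Binning by median is described earlier but not implemented in the notebook. A minimal sketch, assuming the same sorted list and a bin size of 3 as in the cells above (the names `bin_size`, `binned` and `bin4` are illustrative):

```python
import numpy as np

d = [0, 4, 12, 16, 16, 18, 24, 26, 28]  # already sorted
bin_size = 3

# One row per bin; this requires len(d) to be a multiple of bin_size
binned = np.array(d, dtype=float).reshape(-1, bin_size)

# Median of each bin, kept as a column so it can be repeated across the row
medians = np.median(binned, axis=1, keepdims=True)

# Replace every value in a bin with that bin's median
bin4 = np.repeat(medians, bin_size, axis=1)
print(bin4)
```

Compared with bin means, medians are less affected by a single extreme value inside a bin.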