# Feature Engineering-3

#### Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its application.

Min-Max scaling is a preprocessing technique that rescales each numeric feature to a fixed range, usually [0, 1]. For every value x in a feature, it computes x' = (x - min) / (max - min), where min and max are that feature's smallest and largest values. Putting all features on the same scale helps algorithms that depend on distances or gradient magnitudes, such as k-nearest neighbours, k-means, and neural networks. Because the result depends only on the minimum and maximum, a single outlier can compress the remaining values into a narrow band.

In [1]:
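# Example: scale price and rating to [0, 1], column by column
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
data = pd.DataFrame({'price': [200, 300, 150, 250, 180], 'rating': [4.5, 3.8, 4.2, 3.9, 4.7]})
scaler = MinMaxScaler()  # default feature_range=(0, 1)
scaled = scaler.fit_transform(data)  # NumPy array, columns in the same order as data
print(scaled)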

[[0.33333333 0.77777778]
 [1.         0.        ]
 [0.         0.44444444]
 [0.66666667 0.11111111]
 [0.2        1.        ]]
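
The scaled values above can be reproduced by applying the formula directly to each column. The following is a minimal sketch with NumPy; the helper name `min_max_scale` is hypothetical, and it assumes no column is constant (a constant column would give a zero range).

```python
import numpy as np

def min_max_scale(x):
    """Scale each column of x to [0, 1] (hypothetical helper; assumes no constant column)."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)  # per-column minimum
    col_max = x.max(axis=0)  # per-column maximum
    return (x - col_min) / (col_max - col_min)

# Same price and rating values as in the cell above
data = np.array([[200, 4.5], [300, 3.8], [150, 4.2], [250, 3.9], [180, 4.7]])
manual = min_max_scale(data)
print(manual)
```

For price, the minimum is 150 and the range is 150, so 200 maps to 50 / 150, which is about 0.333 and matches the first row of the `MinMaxScaler` output.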


#### Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling? Provide an example to illustrate its application.

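The Unit Vector technique (also called vector normalization) rescales each sample, that is, each row of the dataset, so that its norm equals 1. With the Euclidean (L2) norm, every value in a row is divided by the square root of the sum of squares of that row. Min-Max scaling, by contrast, works column by column and maps each feature to a fixed range. Unit Vector scaling is useful when the direction of a sample's feature vector matters more than its magnitude, for example when samples are compared by cosine similarity.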

In [2]:
# Example: rescale every row (sample) to unit Euclidean length
import pandas as pd
from sklearn.preprocessing import Normalizer
data = pd.DataFrame({'price': [200, 300, 150, 250, 180], 'rating': [4.5, 3.8, 4.2, 3.9, 4.7]})
normalizer = Normalizer()  # default norm='l2', applied row by row
scaled = normalizer.fit_transform(data)
print(scaled)

[[0.99974697 0.02249431]
 [0.99991979 0.01266565]
 [0.99960823 0.02798903]
 [0.99987834 0.0155981 ]
 [0.99965928 0.02610221]]
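
To make the row-wise behaviour explicit, the same result can be computed by dividing each row by its own L2 norm. This is a minimal NumPy sketch for illustration, not a replacement for `Normalizer`.

```python
import numpy as np

# Same price and rating values as in the cell above
data = np.array([[200, 4.5], [300, 3.8], [150, 4.2], [250, 3.9], [180, 4.7]])

# L2 norm of each row (sample), kept as a column so it broadcasts across the features
row_norms = np.linalg.norm(data, axis=1, keepdims=True)
unit_rows = data / row_norms

print(unit_rows)
print(np.linalg.norm(unit_rows, axis=1))  # length of every row after scaling
```

Because price is far larger than rating, every row points almost entirely along the price axis, which is why the first column of the output is close to 1.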


#### Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an example to illustrate its application.

PCA (Principal Component Analysis) is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while preserving as much variance as possible. It does this by finding the principal components: orthogonal linear combinations of the original features, ordered by how much of the data's variance each one explains. Keeping only the first few components reduces the number of dimensions.

In [3]:
# Example: project the 2-D data onto its principal components
import pandas as pd
from sklearn.decomposition import PCA
data = pd.DataFrame({'price': [200, 300, 150, 250, 180], 'rating': [4.5, 3.8, 4.2, 3.9, 4.7]})
pca_reduced = PCA(n_components=1)  # keep only the first principal component
pca_full = PCA()                   # keep all components (2 here)
reduced = pca_reduced.fit_transform(data)
full = pca_full.fit_transform(data)
print("Reduced to 1 component:\n", reduced)
print("All components kept:\n", full)

Reduced to 1 component:
 [[-16.00114271]
 [ 84.00104663]
 [-65.99917148]
 [ 34.00113115]
 [-36.00186358]]
All components kept:
 [[-1.60011427e+01 -2.04528839e-01]
 [ 8.40010466e+01  2.37880848e-02]
 [-6.59991715e+01  3.31305469e-01]
 [ 3.40011311e+01  1.59626842e-01]
 [-3.60018636e+01 -3.10191556e-01]]
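
To show what PCA computes, the first component can also be obtained from the eigenvectors of the covariance matrix. This is a minimal NumPy sketch for illustration; scikit-learn uses an SVD-based solver internally, and the sign of each component is arbitrary, so the projections may differ from the output above by a factor of -1.

```python
import numpy as np

# Same price and rating values as in the cell above
data = np.array([[200, 4.5], [300, 3.8], [150, 4.2], [250, 3.9], [180, 4.7]])

centered = data - data.mean(axis=0)        # PCA works on mean-centered data
cov = np.cov(centered, rowvar=False)       # sample covariance matrix (divides by n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]          # reorder so the largest variance comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

projected = centered @ eigvecs[:, :1]      # scores on the first principal component
explained_ratio = eigvals / eigvals.sum()  # share of total variance per component
print(projected.ravel())
print(explained_ratio)
```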


#### Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature Extraction? Provide an example to illustrate this concept.

PCA is itself a feature extraction technique: instead of selecting a subset of the original columns, it builds new features (the principal components) as linear combinations of all of them. The components that capture the most variance are kept as the extracted features, and the rest are discarded. Keeping a small number of components reduces the dimensionality of the dataset while retaining most of its variance.

In [4]:
# Example: inspect the directions and the variance captured by each component
import pandas as pd
from sklearn.decomposition import PCA
data = pd.DataFrame({'price': [200, 300, 150, 250, 180], 'rating': [4.5, 3.8, 4.2, 3.9, 4.7]})
pca_reduced = PCA(n_components=1)  # extract a single new feature
pca_full = PCA()                   # keep all components for comparison
pca_reduced.fit(data)
pca_full.fit(data)
print("Directions (1 component):", pca_reduced.components_)
print("Explained variance ratio (1 component):", pca_reduced.explained_variance_ratio_)
print("Directions (all components):", pca_full.components_)
print("Explained variance ratio (all components):", pca_full.explained_variance_ratio_)

Directions (1 component): [[ 0.99998888 -0.00471675]]
Explained variance ratio (1 component): [0.99998061]
Directions (all components): [[ 0.99998888 -0.00471675]
 [-0.00471675 -0.99998888]]
Explained variance ratio (all components): [9.99980606e-01 1.93944305e-05]
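
In the cell above, price accounts for nearly all of the variance simply because its values are about two orders of magnitude larger than the ratings. A common variant, which is not part of the original example, is to standardize the features before PCA so that each contributes on an equal footing. The sketch below assumes that approach; the column name `pc1` is made up for illustration.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'price': [200, 300, 150, 250, 180],
                     'rating': [4.5, 3.8, 4.2, 3.9, 4.7]})

# Give every column zero mean and unit variance before extracting components
standardized = StandardScaler().fit_transform(data)

pca = PCA(n_components=1)
pc1 = pca.fit_transform(standardized)

# The extracted feature replaces the two original columns
extracted = pd.DataFrame(pc1, columns=['pc1'], index=data.index)
print(extracted)
print("Loadings:", pca.components_)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

With two standardized features, both loadings have the same absolute value, so the extracted feature weighs price and rating equally rather than tracking price alone.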


#### Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to preprocess the data.

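For the recommendation system, price, rating, and delivery time are measured in different units and ranges, so a distance or similarity computed on the raw values would be dominated by whichever feature has the largest numbers. To apply Min-Max scaling, I would fit the scaler on the training data to learn each feature's minimum and maximum, transform every feature to [0, 1] with x' = (x - min) / (max - min), and reuse the same fitted scaler on new data. Each feature then contributes on a comparable scale, and no single feature dominates because of its magnitude.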
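
A minimal sketch of that workflow follows. The data values and the column name `delivery_time_min` are hypothetical, since the question does not provide a dataset.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training data; values and column names are made up for illustration
train = pd.DataFrame({
    'price': [120, 450, 300, 80, 220],
    'rating': [4.1, 3.5, 4.8, 3.9, 4.4],
    'delivery_time_min': [25, 60, 40, 15, 35],
})
# Hypothetical new records that arrive after the scaler has been fitted
new_orders = pd.DataFrame({
    'price': [200, 500],
    'rating': [4.0, 4.9],
    'delivery_time_min': [30, 20],
})

scaler = MinMaxScaler()  # default feature_range=(0, 1)
# Learn min and max from the training data only, then reuse them for new data
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns)
new_scaled = pd.DataFrame(scaler.transform(new_orders), columns=new_orders.columns)

print(train_scaled)
print(new_scaled)
```

New values outside the training range map outside [0, 1]; here a price of 500 exceeds the training maximum of 450. If that is undesirable, `MinMaxScaler` accepts `clip=True` to clamp transformed values to the feature range.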

#### Q6. You are working on a project to build a model to predict stock prices. The dataset contains many features, such as company financial data and market trends. Explain how you would use PCA to reduce the dimensionality of the dataset.

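For the stock price prediction project, I would first standardize the features, because financial and market variables come in very different units and PCA is sensitive to scale. I would then fit PCA on the training period only, inspect the cumulative explained variance, and keep the smallest number of components that reaches a chosen threshold such as 95%. The model is trained on these components instead of the original features. This removes redundancy among correlated features, simplifies the model, and reduces the risk of overfitting, at the cost of components that are harder to interpret than the original variables.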
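
The sketch below illustrates this on synthetic data, since no stock dataset is given. The three hidden factors, the 20 features, and the 400/100 split are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for stock features: 3 hidden factors drive 20 noisy, correlated columns
factors = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
features = factors @ mixing + 0.1 * rng.normal(size=(500, 20))

# Chronological split: fit on the earlier rows only, so no later data leaks into the transform
train, test = features[:400], features[400:]

# A float n_components keeps the fewest components whose variance reaches that fraction
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
train_reduced = reducer.fit_transform(train)
test_reduced = reducer.transform(test)

pca = reducer.named_steps['pca']
print("Components kept:", pca.n_components_)
print("Cumulative variance:", np.cumsum(pca.explained_variance_ratio_))
```

Putting the scaler and PCA in one pipeline keeps the two steps fitted on the same training rows, so the test period is transformed with parameters learned only from the past.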

#### Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the values to a range of -1 to 1.

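To map [1, 5, 10, 15, 20] to the range [-1, 1], first scale the values to [0, 1] with the usual Min-Max formula, then stretch the result by 2 and shift it by -1. The cell below does this manually and then checks the result against `MinMaxScaler(feature_range=(-1, 1))`: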

In [5]:
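import numpy as np
from sklearn.preprocessing import MinMaxScaler
data = np.array([1, 5, 10, 15, 20])
data_min = np.min(data)
data_max = np.max(data)
# Scale to [0, 1], then stretch by 2 and shift by -1 to land in [-1, 1]
manual = -1 + 2 * (data - data_min) / (data_max - data_min)
# MinMaxScaler expects a 2-D array, hence the reshape to a single column
scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(data.reshape(-1, 1))
print("Manual Calculation:", manual)
print("Auto Calculation:", scaled)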

Manual Calculation: [-1.         -0.57894737 -0.05263158  0.47368421  1.        ]
Auto Calculation: [[-1.        ]
 [-0.57894737]
 [-0.05263158]
 [ 0.47368421]
 [ 1.        ]]
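
The manual calculation above is a special case of Min-Max scaling to an arbitrary range $[a, b]$. The symbols $a$ and $b$ are introduced here for the target range and do not appear in the code:

$$
x' = a + \frac{(x - x_{\min})(b - a)}{x_{\max} - x_{\min}}
$$

With $a = -1$, $b = 1$, $x_{\min} = 1$ and $x_{\max} = 20$, this reduces to $x' = -1 + \frac{2(x - 1)}{19}$:

$$
\begin{aligned}
x = 1:&\quad -1 + \tfrac{0}{19} = -1 \\
x = 5:&\quad -1 + \tfrac{8}{19} \approx -0.5789 \\
x = 10:&\quad -1 + \tfrac{18}{19} \approx -0.0526 \\
x = 15:&\quad -1 + \tfrac{28}{19} \approx 0.4737 \\
x = 20:&\quad -1 + \tfrac{38}{19} = 1
\end{aligned}
$$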


#### Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform Feature Extraction using PCA. How many principal components would you choose to retain, and why?

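There is no fixed answer: the number of principal components to retain depends on how much of the total variance must be preserved. A common approach is to compute the cumulative explained variance ratio and keep the smallest number of components that reaches a chosen threshold, such as 95% or 99%. Because the five features are measured in different units, they would normally be standardized before PCA. Note that the toy dataset below has only three samples, so at most two components can carry any variance; the result of 2 illustrates the procedure and should not be read as a general answer for these features.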

In [15]:
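# Example: pick the number of components from the cumulative explained variance
import numpy as np
from sklearn.decomposition import PCA
# Columns: height, weight, age, gender (0/1), blood pressure
data = np.array([[170, 65, 30, 0, 120], [160, 55, 25, 1, 130], [155, 55, 22, 0, 110]])
pca = PCA()
pca.fit(data)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("Variance explained by each component", pca.explained_variance_ratio_)
print("Cumulative Variance", cumulative)
threshold = 0.95
# argmax returns the first index where the condition is True
num_comp = np.argmax(cumulative >= threshold) + 1
print("Number of principal components to retain:", num_comp)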

Variance explained by each component [6.15052684e-01 3.84947316e-01 4.53459435e-31]
Cumulative Variance [0.61505268 1.         1.        ]
Number of principal components to retain: 2
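
With more samples, the same procedure gives a more meaningful answer. The sketch below uses synthetic data, because the question supplies only feature names; the distributions, the correlations between features, and the helper name `components_for_variance` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def components_for_variance(x, threshold=0.95):
    """Return the smallest k whose cumulative explained variance reaches threshold
    (hypothetical helper; standardizes the columns first)."""
    standardized = StandardScaler().fit_transform(x)
    ratios = PCA().fit(standardized).explained_variance_ratio_
    cumulative = np.cumsum(ratios)
    # searchsorted returns the first index where cumulative >= threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(ratios)), cumulative

# Synthetic data for [height, weight, age, gender, blood pressure]; all values are made up
rng = np.random.default_rng(42)
n = 200
gender = rng.integers(0, 2, size=n)                        # 0/1 encoding
height = 162 + 12 * gender + rng.normal(0, 6, size=n)      # cm
weight = 0.9 * (height - 100) + rng.normal(0, 8, size=n)   # kg
age = rng.uniform(20, 70, size=n)                          # years
blood_pressure = 100 + 0.4 * age + 0.2 * weight + rng.normal(0, 8, size=n)
x = np.column_stack([height, weight, age, gender, blood_pressure])

k, cumulative = components_for_variance(x)
print("Cumulative variance:", cumulative)
print("Components to retain at 95%:", k)
```

The number returned depends on how strongly the features are correlated in the actual data: highly correlated features need fewer components, while nearly independent ones may need almost all five.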
