# Week 13 Feature Engineering Assignment 2

### Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its application.

Min-Max scaling is a normalization technique used in data preprocessing. It linearly rescales each feature to a fixed range, usually 0 to 1, so that the smallest observed value maps to 0 and the largest to 1.
The formula for Min-Max scaling is:

`scaled_value = (value - min) / (max - min)`

Here `value` is the original value of the data point, and `min` and `max` are the minimum and maximum values of the feature, respectively.

This transformation puts all features on the same scale, so that no feature dominates the model simply because it is measured in larger units. It is particularly useful for algorithms that rely on distance measures, such as K-Nearest Neighbors (KNN), where the distance between data points determines similarity.

Here is an example to illustrate the application of Min-Max scaling:

Suppose we have a dataset containing the heights of a group of individuals in centimeters. The dataset has the following values:

160, 168, 155, 175, 180, 158

To apply Min-Max scaling to this dataset, we first calculate the minimum and maximum values:

min = 155 <br>
max = 180

Next, we apply the formula to each value in the dataset:

scaled_value = (value - min) / (max - min)

For example, the first value of 160 would be transformed as:

scaled_value = (160 - 155) / (180 - 155) = 5 / 25 = 0.200

After applying the formula to all the values in the dataset, we obtain the following scaled dataset:

0.200, 0.520, 0.000, 0.800, 1.000, 0.120

Now all the values in the dataset are scaled between 0 and 1, making it easier to compare and analyze the data.
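The arithmetic in this example can be checked with a few lines of plain Python. This is only a minimal sketch: the helper name `min_max_scale` is made up for illustration and is not part of any library.

```python
# Heights in centimeters, taken from the example above.
heights = [160, 168, 155, 175, 180, 158]


def min_max_scale(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined when all values are equal")
    return [(v - lo) / (hi - lo) for v in values]


scaled_heights = min_max_scale(heights)
print([round(s, 3) for s in scaled_heights])
# prints [0.2, 0.52, 0.0, 0.8, 1.0, 0.12]
```

The guard against `hi == lo` is there because a constant feature would otherwise cause a division by zero.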


### Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling? Provide an example to illustrate its application.

The Unit Vector technique is a type of feature scaling used in data preprocessing that rescales each data point to a vector of unit length. Unlike Min-Max scaling, which rescales each feature (column) independently, it rescales each data point (row): the direction of the vector is preserved and only its magnitude changes.

In the Unit Vector technique, each data point is divided by its Euclidean length, which is calculated as the square root of the sum of squares of all the feature values of that data point. This results in a vector with a length of 1.

The formula for Unit Vector scaling is:

`unit_vector = value / sqrt(sum(value^2))`


Here is an example to illustrate the application of Unit Vector scaling:

Suppose we have a dataset containing the heights and weights of a group of individuals. The dataset has the following values:

Height (cm): 160, 168, 155, 175, 180, 158 <br>
Weight (kg): 55, 60, 50, 70, 75, 48

To apply Unit Vector scaling to this dataset, we first calculate the Euclidean length of each data point using the formula:

Euclidean length = sqrt(height^2 + weight^2)

For example, the Euclidean length of the first data point (160 cm, 55 kg) would be:

sqrt(160^2 + 55^2) = sqrt(28625) ≈ 169.19

Next, we divide each feature value by the Euclidean length of its corresponding data point to obtain the scaled value:

Height (cm) Scaled: [0.946, 0.942, 0.952, 0.928, 0.923, 0.957] <br>
Weight (kg) Scaled: [0.325, 0.336, 0.307, 0.371, 0.385, 0.291]

Each data point now has a Euclidean length of 1, so its direction (here, the ratio of weight to height) is preserved while its magnitude is discarded.

In comparison to Min-Max scaling, Unit Vector scaling is more useful when only the direction of a data point matters, such as in text classification or natural language processing, where documents are often compared by cosine similarity.
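The row-wise normalization described above can be sketched in plain Python as follows. The helper name `to_unit_vector` is hypothetical and used only for illustration; the data is the height and weight example from this answer.

```python
import math

# (height in cm, weight in kg) pairs from the example above.
people = [(160, 55), (168, 60), (155, 50), (175, 70), (180, 75), (158, 48)]


def to_unit_vector(row):
    """Divide every component of a row by the row's Euclidean (L2) length."""
    length = math.sqrt(sum(x * x for x in row))
    if length == 0:
        raise ValueError("a zero vector has no direction and cannot be normalized")
    return tuple(x / length for x in row)


unit_rows = [to_unit_vector(p) for p in people]
for original, unit in zip(people, unit_rows):
    print(original, "->", tuple(round(u, 3) for u in unit))
```

Note that the scaling is applied per data point (per row), not per feature, which is the key difference from Min-Max scaling.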

### Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an example to illustrate its application.

Principal Component Analysis (PCA) is a technique used for dimensionality reduction in data analysis. It is a statistical method that aims to transform the original dataset into a lower-dimensional space while retaining as much of the original information as possible.

The main idea behind PCA is to identify a set of new variables, called principal components, that explain the maximum amount of variance in the original data. These principal components are linear combinations of the original variables and are orthogonal to each other. The first principal component captures the direction of the maximum variance in the data, while the subsequent principal components capture the remaining variance in decreasing order.

PCA can be used for various applications such as exploratory data analysis, data visualization, and feature extraction for machine learning algorithms. It is particularly useful for high-dimensional datasets, including those where the number of features exceeds the number of observations, and for datasets with strong correlations among the features.
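The properties described above (orthogonal components, variance captured in decreasing order) can be checked numerically. The following is a minimal sketch using NumPy's SVD on made-up data; the variables `x1`, `x2`, and `x3` are hypothetical and chosen so that two of the three features are strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 3 features; x2 is strongly correlated with x1.
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Center the data, then take the SVD. The rows of Vt are the principal
# directions, and the singular values come back sorted in descending order.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_variance = S**2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

print("explained variance ratio:", np.round(explained_ratio, 3))
print("components orthonormal:", np.allclose(Vt @ Vt.T, np.eye(3)))
```

Because `x1` and `x2` carry almost the same information, the last component should explain only a small share of the variance, which is what makes it a candidate for dropping.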

![image.png](attachment:138e59d5-d909-4174-b91e-6184a99b008f.png)

In [2]:

import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd

# load the iris dataset
iris = sns.load_dataset('iris')

# Print original dataset
print('Original Dataset :\n',iris.head())
print('\n=====================================================================\n')

# extract the feature data
X = iris.drop(labels=['species'],axis=1)
Y = iris[['species']]

# Scaling the X data using StandardScaler
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X),columns=X.columns)


# apply PCA with two principal components
pca = PCA(n_components=2)
X_pca = pd.DataFrame(pca.fit_transform(X_scaled),columns=['PC1','PC2'])

# print the variance ratio explained by each principal component
print("Variance Ratio:", pca.explained_variance_ratio_)

# print the transformed data
print("\nTransformed X Data:\n", X_pca.head())

Original Dataset :
    sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


Variance Ratio: [0.72962445 0.22850762]

Transformed X Data:
         PC1       PC2
0 -2.264703  0.480027
1 -2.080961 -0.674134
2 -2.364229 -0.341908
3 -2.299384 -0.597395
4 -2.389842  0.646835


### Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature Extraction? Provide an example to illustrate this concept.

PCA is a dimensionality reduction technique that can be used for feature extraction in machine learning. Feature extraction transforms high-dimensional data into a lower-dimensional space while retaining the essential information. Its main goals are to reduce computational cost, remove redundancy among correlated features, and, in many cases, help models generalize better.

PCA performs feature extraction by identifying the principal components that capture the most variance in the data. These components are new features, each a linear combination of the original ones, and they can be used as input to machine learning models. This reduces the dimensionality of the data, although the extracted features are usually harder to interpret than the original variables.
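To make "retaining the essential information" concrete, the sketch below extracts `k` features with PCA, maps them back to the original space, and measures how much was lost. It uses NumPy only and made-up data (five observed features generated from two hidden factors); all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 5 observed features driven by 2 hidden factors plus noise.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

mean = X.mean(axis=0)
Xc = X - mean
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
W = Vt[:k].T            # (5, k) matrix of component loadings
Z = Xc @ W              # extracted features: one row per sample, k columns
X_hat = Z @ W.T + mean  # reconstruction in the original 5-feature space

retained = (S[:k] ** 2).sum() / (S**2).sum()
lost = ((X - X_hat) ** 2).sum() / (Xc**2).sum()

print("features before -> after:", X.shape[1], "->", Z.shape[1])
print("variance retained:", round(float(retained), 4))
print("relative reconstruction error:", round(float(lost), 4))
```

The fraction of variance retained and the relative squared reconstruction error add up to 1, so the explained variance ratio directly measures how much information the extracted features keep.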

In [4]:
import pandas as pd
df = pd.read_csv('credit_card.csv')


In [5]:
X = df.drop(labels=['ID','default.payment.next.month'],axis=1)
Y = df[['default.payment.next.month']]

# Apply StandardScaler to X data
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X),columns=X.columns)

In [6]:
from sklearn.decomposition import PCA
pca = PCA(0.90)
X_pca = pca.fit_transform(X_scaled)

print("Variance Ratio:", pca.explained_variance_ratio_)
print('\nTop 5 rows of transformed PCA data :\n',X_pca[0:5])

Variance Ratio: [0.2726381  0.17083159 0.06470214 0.0614085  0.04365895 0.04106621
 0.0395654  0.03780221 0.03696921 0.03629678 0.03249152 0.03053234
 0.02844441 0.02378875]

Top 5 rows of transformed PCA data :
 [[-1.89909907e+00 -9.44774158e-01 -3.78827719e-01 -6.42702921e-01
  -1.88104778e-01 -1.78313758e+00  9.26399579e-01 -4.94659330e-01
   2.18778010e-01 -2.53505194e-01  1.57143211e-01  2.60297894e-01
   3.22504400e+00 -1.05813095e+00]
 [-7.76113222e-01 -2.14828323e+00  1.19900181e+00 -5.46322707e-01
  -1.96155844e-01 -1.63797540e+00  5.26471317e-01  1.67534850e-01
   1.88464873e-01 -2.07972882e-01  3.17870073e-01  1.52841847e-01
  -9.19548388e-01  6.73918945e-02]
 [-8.58630440e-01 -1.11138452e+00  5.17438774e-01 -5.15649758e-01
  -3.17224467e-01 -1.58338123e+00  8.06793278e-01 -3.11173015e-02
   1.04432648e-02 -2.12800466e-01  3.02196607e-01  5.86651682e-02
  -3.13827898e-01  5.07557776e-01]
 [-2.07704551e-01 -8.48011952e-01 -7.24511917e-01  1.79784940e-01
  -3.19777296e-01 -1.8

### Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to preprocess the data.

In [8]:
import pandas as pd
dct = {
    'food_item':['pizza','burger','pasta','noodles'],
    'price':[500,100,150,120],
    'delivery_time':[30,15,10,8]
}

df = pd.DataFrame(dct)

X_num = df[['price','delivery_time']]

# Apply Min-Max scaling to the numerical features
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_num),
                        columns=X_num.columns)
X_scaled

   price  delivery_time
0  1.000       1.000000
1  0.000       0.318182
2  0.125       0.090909
3  0.050       0.000000
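One way to see why this scaling matters for a recommender is to compare item similarity before and after scaling. The sketch below assumes, purely for illustration, that the recommender compares items by Euclidean distance (the source does not say which similarity measure is used). The helper names `min_max_columns` and `nearest` are hypothetical; the data is copied from the cell above.

```python
import math

# Items and features copied from the cell above: (price, delivery_time).
items = {
    "pizza": (500, 30),
    "burger": (100, 15),
    "pasta": (150, 10),
    "noodles": (120, 8),
}


def min_max_columns(rows):
    """Min-max scale each column (feature) of equal-length rows to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def nearest(name, vectors):
    """Return the other item with the smallest Euclidean distance to `name`."""
    others = (n for n in vectors if n != name)
    return min(others, key=lambda n: math.dist(vectors[name], vectors[n]))


scaled = dict(zip(items, min_max_columns(list(items.values()))))

print("nearest to burger (raw):   ", nearest("burger", items))
print("nearest to burger (scaled):", nearest("burger", scaled))
# prints noodles for the raw data and pasta for the scaled data
```

On the raw data the price column, with its much larger numeric range, dominates the distance. After scaling, price and delivery time contribute on equal footing and the nearest item changes.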


### Q6. You are working on a project to build a model to predict stock prices. The dataset contains many features, such as company financial data and market trends. Explain how you would use PCA to reduce the dimensionality of the dataset.

PCA (Principal Component Analysis) is a dimensionality reduction technique that is commonly used to reduce the number of features in a dataset while retaining most of the information. In the context of building a model to predict stock prices, we could use PCA to compress many correlated inputs, such as financial data and market trend indicators, into a few uncorrelated components that capture most of the variance in the data. Note that PCA is unsupervised: it ranks directions by how much the features vary, not by how well they predict the stock price.

Here is a step-by-step approach to using PCA for this purpose:

1. Standardize the data: The first step is to standardize each feature by subtracting its mean and dividing by its standard deviation. This puts all features on the same scale, so that features with large numeric ranges do not dominate the principal components.

2. Compute the covariance matrix: Next, we compute the covariance matrix of the standardized data. The covariance matrix represents the relationships between the different features in the dataset.

3. Compute the eigenvectors and eigenvalues: We then calculate the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent the directions in which the data varies the most, while the eigenvalues represent the magnitude of the variation.

4. Select the principal components: We then select the top k eigenvectors with the highest eigenvalues. These eigenvectors define the principal components, which are the directions that capture the most variance in the dataset.

5. Project the data onto the principal components: Finally, we project the standardized data onto the selected principal components to obtain a new, reduced-dimensional dataset. This dataset can then be used as input to a machine learning algorithm to predict stock prices.
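The five steps above can be sketched directly in NumPy. This is an illustration under stated assumptions, not the method scikit-learn uses internally: the data is a made-up stand-in for stock features (six columns driven by two hidden factors, placed on very different scales), and `k = 2` is chosen by hand.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for stock features: 6 columns driven by 2 hidden factors.
factors = rng.normal(size=(500, 2))
X = factors @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
X = X * np.array([1, 10, 100, 1, 10, 100])  # columns on very different scales

# 1. Standardize each feature to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Covariance matrix of the standardized data.
C = np.cov(Z, rowvar=False)

# 3. Eigen-decomposition. eigh is meant for symmetric matrices and returns
#    eigenvalues in ascending order, so reorder them to descending.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep the k eigenvectors with the largest eigenvalues.
k = 2
W = eigvecs[:, :k]

# 5. Project the standardized data onto the selected components.
X_reduced = Z @ W

print("explained variance ratio:", np.round(eigvals[:k] / eigvals.sum(), 3))
print("reduced shape:", X_reduced.shape)
```

The projected columns are uncorrelated, and the variance of each one equals the corresponding eigenvalue, which is why eigenvalues can be read as "variance explained".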

In [10]:
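from sklearn.decomposition import PCA

# Note: X is not redefined since the Q4 cells, so this is presumably still the
# unscaled credit card feature matrix. Its large-valued columns dominate the
# components below; passing X_scaled instead would follow step 1 above.
pca = PCA(0.95)  # keep enough components to explain 95% of the variance
X_pca = pca.fit_transform(X)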

print("Variance Ratio:", pca.explained_variance_ratio_)
print('\nTop 5 rows of transformed PCA data :\n',X_pca[0:5])

Variance Ratio: [0.60943217 0.2948671  0.03047399 0.01690108]

Top 5 rows of transformed PCA data :
 [[-166511.13375728  -75548.89621184    -502.02823199    -213.24550923]
 [-114249.88815065    9769.79163234    2705.06781695   -1642.8356634 ]
 [ -98455.33902342  -33482.29158299   -2707.88885993   -5177.93052329]
 [ -71253.75133668  -95235.22193433  -10077.1060786     4343.90131599]
 [-114857.59963712  -68740.03110412   24575.9756348    34180.30740549]]


### Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the values to a range of -1 to 1.

In [13]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame([1,5,10,15,20],columns=['A'])

scaler = MinMaxScaler(feature_range=(-1,1))
scaled = scaler.fit_transform(df)
print(scaled)


[[-1.        ]
 [-0.57894737]
 [-0.05263158]
 [ 0.47368421]
 [ 1.        ]]
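The same result can be reproduced by hand with the general form of the Min-Max formula for a target range `[new_min, new_max]`: `scaled = new_min + (value - min) * (new_max - new_min) / (max - min)`. The sketch below is plain Python, and the function name `min_max_scale_to_range` is made up for illustration.

```python
def min_max_scale_to_range(values, new_min, new_max):
    """Linearly map values so that min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined when all values are equal")
    span = new_max - new_min
    return [new_min + (v - lo) * span / (hi - lo) for v in values]


scaled = min_max_scale_to_range([1, 5, 10, 15, 20], -1, 1)
print([round(s, 8) for s in scaled])
# prints [-1.0, -0.57894737, -0.05263158, 0.47368421, 1.0]
```

For example, the value 5 maps to -1 + (5 - 1) * 2 / 19 ≈ -0.579, which agrees with the `MinMaxScaler` output above.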


### Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform Feature Extraction using PCA. How many principal components would you choose to retain, and why? 

The number of principal components to retain in PCA depends on the level of variance we want to preserve in the dataset. In general, we want to retain enough principal components to explain a significant portion of the total variance in the data, while also keeping the number of features as small as possible.

To determine how many principal components to retain for the given dataset containing the features height, weight, age, gender, and blood pressure, we would perform the following steps:

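1. Standardize the data: We would first standardize the data by subtracting the mean and dividing by the standard deviation. This matters here because the features are measured in different units (for example, centimeters, kilograms, and years), and without it the features with the largest numeric spread would dominate the components. Gender, being categorical, must first be encoded numerically.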

2. Compute the covariance matrix: Next, we would compute the covariance matrix of the standardized data. The covariance matrix represents the relationships between the different features in the dataset.

3. Compute the eigenvectors and eigenvalues: We would then calculate the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent the directions in which the data varies the most, while the eigenvalues represent the magnitude of the variation.

4. Select the principal components: We would then select the top k eigenvectors with the highest eigenvalues. These eigenvectors define the principal components, which are the directions that capture the most variance in the dataset.

5. Evaluate the explained variance: Finally, we would evaluate the amount of variance explained by each principal component and choose the number of principal components that preserve a significant portion of the total variance in the data.


A common rule of thumb is to retain enough principal components to explain at least 80% of the total variance in the data. However, the exact number of principal components to retain may depend on the specific dataset and the problem we are trying to solve.

In summary, we would need to perform PCA on the given dataset to determine the optimal number of principal components to retain based on the amount of variance we want to preserve.
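Once the explained variance ratios are known, picking the number of components for a given threshold is a short cumulative sum. The sketch below is plain Python; the function name `components_for_variance` is hypothetical, and the ratios are copied from the credit card PCA output under Q4. It assumes the ratios are already sorted in descending order, as `explained_variance_ratio_` is.

```python
from itertools import accumulate


def components_for_variance(ratios, threshold):
    """Smallest k whose first k explained-variance ratios sum to at least `threshold`."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold:
            return k
    raise ValueError("the given ratios never reach the requested threshold")


# Ratios copied from the credit card PCA output under Q4.
credit_card_ratios = [
    0.2726381, 0.17083159, 0.06470214, 0.0614085, 0.04365895, 0.04106621,
    0.0395654, 0.03780221, 0.03696921, 0.03629678, 0.03249152, 0.03053234,
    0.02844441, 0.02378875,
]

print(components_for_variance(credit_card_ratios, 0.80))  # prints 10
print(components_for_variance(credit_card_ratios, 0.90))  # prints 14
```

Passing a float such as `PCA(0.90)` to scikit-learn, as done in the Q4 cell, applies the same kind of threshold automatically.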

In [14]:
import numpy as np
import pandas as pd

# Set the seed for reproducibility
np.random.seed(678)

# Generate random data for each feature
height = np.random.normal(loc=170, scale=10, size=10000)
weight = np.random.normal(loc=70, scale=10, size=10000)
age = np.random.randint(18, 65, size=10000)
gender = np.random.choice(['Male', 'Female'], size=10000)
blood_pressure = np.random.normal(loc=120, scale=10, size=10000)

# Combine the data into a Pandas DataFrame
data = pd.DataFrame({'Height': height, 
                     'Weight': weight, 
                     'Age': age, 
                     'Gender': gender, 
                     'Blood Pressure': blood_pressure})


In [15]:
cat_cols = list(data.columns[data.dtypes == 'object'])
num_cols = list(data.columns[data.dtypes != 'object'])
# Print Categorical and Numeric Variables
print('Categorical Variables : ',cat_cols)
print('Numerical Variables   : ',num_cols)

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
le = LabelEncoder()
data[cat_cols[0]] = le.fit_transform(data[cat_cols[0]])  # encode Gender as integers (Female=0, Male=1)
data.head()


scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data),columns=data.columns)
data_scaled.head()

from sklearn.decomposition import PCA
pca = PCA(n_components=3)

X_pca = pd.DataFrame(pca.fit_transform(data_scaled),columns=['PC1','PC2','PC3'])
# print the variance ratio explained by each principal component
print("Variance Ratio:", pca.explained_variance_ratio_)
print('\nTop 5 rows of transformed PCA data :\n',X_pca.head())

Categorical Variables :  ['Gender']
Numerical Variables   :  ['Height', 'Weight', 'Age', 'Blood Pressure']
Variance Ratio: [0.20491989 0.20113817 0.20024048]

Top 5 rows of transformed PCA data :
         PC1       PC2       PC3
0  0.179922 -1.464986  0.664593
1 -0.536445 -1.423220  1.533594
2  2.149224  0.738165 -0.709998
3  1.316837 -1.776677  1.784522
4  0.597166 -0.539852  0.356913

Because the synthetic features were generated independently of one another, there is no shared variance for PCA to compress: each component explains roughly 20% of the variance, and the three retained components together explain only about 60.6%. Reaching the 80% threshold on this data would require four of the five components, so PCA offers little reduction here. On real data, where features such as height, weight, age, and blood pressure are typically correlated, fewer components would usually suffice.
