In [None]:
#PCA - Principal Component Analysis, an unsupervised ML algorithm

In [1]:
#Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its application.

'''Min-Max scaling, also known as normalization, is a data preprocessing technique that linearly rescales a numeric feature to a fixed range, typically 0 to 1.
It is especially useful when the input features have very different scales and you want to bring them onto a common scale.

Formula:
X_scaled = (X - X_min) / (X_max - X_min)

Example:
Original ages: [20, 30, 40, 50, 60]
X_min = 20
X_max = 60
Scaled ages: [(20-20)/(60-20), (30-20)/(60-20), (40-20)/(60-20), (50-20)/(60-20), (60-20)/(60-20)]
Scaled ages: [0, 0.25, 0.5, 0.75, 1]

'''
# Example: scale the four numeric iris features to the range [0, 1]
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import pandas as pd

feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
min_max_scaler = MinMaxScaler()
df = sns.load_dataset('iris')
# fit_transform returns a NumPy array, so wrap it back into a DataFrame with the same column names
normalized_data = pd.DataFrame(min_max_scaler.fit_transform(df[feature_cols]), columns=feature_cols)
print(df.head())
print('______________________________________________________________')
print(normalized_data.head())

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
______________________________________________________________
   sepal_length  sepal_width  petal_length  petal_width
0      0.222222     0.625000      0.067797     0.041667
1      0.166667     0.416667      0.067797     0.041667
2      0.111111     0.500000      0.050847     0.041667
3      0.083333     0.458333      0.084746     0.041667
4      0.194444     0.666667      0.067797     0.041667
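
As a cross-check of the formula in the Q1 answer, here is a minimal sketch that applies X_scaled = (X - X_min) / (X_max - X_min) directly with NumPy to the ages example. The helper name `min_max_scale` is made up for this illustration and is not part of scikit-learn.

```python
import numpy as np

def min_max_scale(values):
    """Linearly rescale a 1-D sequence so its minimum maps to 0 and its maximum to 1."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()
    if v_max == v_min:
        # A constant feature has no range, so the formula would divide by zero.
        raise ValueError("cannot min-max scale a constant feature")
    return (values - v_min) / (v_max - v_min)

ages = [20, 30, 40, 50, 60]
scaled_ages = min_max_scale(ages)
print(scaled_ages.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

The evenly spaced ages map to evenly spaced values between 0 and 1, which is what a linear rescaling should produce.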


In [2]:
#Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling? Provide an example to illustrate its application.
'''The Unit Vector technique, also known as vector normalization or scaling to unit length, rescales each data point (each row, treated as a vector of its
feature values) so that its length (norm) is 1. Unlike Min-Max scaling, which works column by column and maps each feature into a fixed range, Unit Vector
scaling works row by row: it keeps the direction of each data point and discards its magnitude.

Formula (using the Euclidean, or L2, norm):
X_scaled = X / ||X||

Example:
Height: [170, 180, 160]
Weight: [70, 80, 90]

Magnitude of each [height, weight] data point:
||[170, 70]|| = sqrt(170^2 + 70^2) = sqrt(33800) = 183.8478
||[180, 80]|| = sqrt(180^2 + 80^2) = sqrt(38800) = 196.9772
||[160, 90]|| = sqrt(160^2 + 90^2) = sqrt(33700) = 183.5756

Now, we can divide each data point by its respective magnitude:

Height_scaled: [170 / 183.8478, 180 / 196.9772, 160 / 183.5756] = [0.9247, 0.9138, 0.8716]
Weight_scaled: [70 / 183.8478, 80 / 196.9772, 90 / 183.5756] = [0.3807, 0.4061, 0.4903]
'''
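import pandas as pd
from sklearn.preprocessing import normalize

# normalize() defaults to norm='l2' and axis=1, so each row (flower) is scaled to unit length.
# df is the iris DataFrame loaded in the Q1 cell above.
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
norm = pd.DataFrame(normalize(df[feature_cols]), columns=feature_cols)
print(norm)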

     sepal_length  sepal_width  petal_length  petal_width
0        0.803773     0.551609      0.220644     0.031521
1        0.828133     0.507020      0.236609     0.033801
2        0.805333     0.548312      0.222752     0.034269
3        0.800030     0.539151      0.260879     0.034784
4        0.790965     0.569495      0.221470     0.031639
..            ...          ...           ...          ...
145      0.721557     0.323085      0.560015     0.247699
146      0.729654     0.289545      0.579090     0.220054
147      0.716539     0.330710      0.573231     0.220474
148      0.674671     0.369981      0.587616     0.250281
149      0.690259     0.350979      0.596665     0.210588

[150 rows x 4 columns]
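
The height and weight example from the Q2 answer can be checked by hand with NumPy. This is a small sketch, assuming the Euclidean (L2) norm and treating each person as one [height, weight] row.

```python
import numpy as np

# One row per person: [height, weight]
X = np.array([[170.0, 70.0],
              [180.0, 80.0],
              [160.0, 90.0]])

# Euclidean length of every row; keepdims=True keeps shape (3, 1) so the division broadcasts row-wise
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / norms

print(np.round(norms.ravel(), 4))       # row magnitudes
print(np.round(X_unit, 4))              # unit-length rows
print(np.linalg.norm(X_unit, axis=1))   # every scaled row has length 1 (up to rounding)
```

After scaling, only the ratio of height to weight distinguishes the rows, which is the sense in which this technique keeps direction and drops magnitude.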


In [None]:
#Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an example to illustrate its application.
'''PCA (Principal Component Analysis) is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while retaining the most
important information. It achieves this by identifying the principal components, which are new uncorrelated variables that capture the maximum variance in the data.

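Geometrically, PCA finds a new set of orthogonal axes through the center of the data. The first axis (the first principal component) is the line along which the
projected data points have the largest variance; each following axis is orthogonal to the earlier ones and captures the largest remaining variance. Projecting
the data onto only the first few of these axes is what reduces the dimensionality.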

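Original dataset:
Height: [170, 180, 160]
Weight: [70, 80, 90]
Age: [25, 30, 35]

Standardize the data (subtract the mean, divide by the sample standard deviation):
Standardized height: [0, 1, -1]
Standardized weight: [-1, 0, 1]
Standardized age: [-1, 0, 1]

Compute the covariance matrix (rows and columns ordered height, weight, age):
Covariance matrix:
[[ 1.0, -0.5, -0.5],
[-0.5,  1.0,  1.0],
[-0.5,  1.0,  1.0]]

Compute the eigenvectors and eigenvalues (values rounded; the sign of each eigenvector is arbitrary):
Eigenvalues: [2.3660, 0.6340, 0]

Eigenvectors (one per row, in the same order as the eigenvalues):
[[-0.4597, 0.6280, 0.6280],
[ 0.8881, 0.3251, 0.3251],
[ 0, 0.7071, -0.7071]]

Select the principal components:
The third eigenvalue is 0 because weight and age are perfectly correlated in this sample, so the data has only two independent directions. The first two
eigenvectors explain about 78.9% and 21.1% of the variance, so keeping them loses no information.
Selected eigenvectors:
[[-0.4597, 0.6280, 0.6280],
[ 0.8881, 0.3251, 0.3251]]

Project the data (standardized data multiplied by the selected eigenvectors; one row per person, columns PC1 and PC2):
Projected data:
[[-1.2559, -0.6501],
[-0.4597, 0.8881],
[ 1.7156, -0.2380]]'''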
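
The steps in the Q3 answer (standardize, covariance matrix, eigendecomposition, projection) can be sketched with plain NumPy on the same three-person dataset. This is an illustrative sketch, not the notebook's original method; it assumes the sample standard deviation (ddof=1) for standardization, and the signs of the resulting components may be flipped depending on the linear algebra backend.

```python
import numpy as np

# One row per person: [height, weight, age]
X = np.array([[170.0, 70.0, 25.0],
              [180.0, 80.0, 30.0],
              [160.0, 90.0, 35.0]])

# 1. Standardize each column: zero mean, unit sample standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Covariance matrix of the standardized features (columns are variables)
cov = np.cov(Z, rowvar=False)

# 3. Eigendecomposition; eigh suits symmetric matrices and returns ascending eigenvalues
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # re-sort in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep the top two components and project the data onto them
W = eigvecs[:, :2]
scores = Z @ W

print(np.round(cov, 4))
print(np.round(eigvals, 4))                 # approximately [2.366, 0.634, 0]
print(np.round(eigvals / eigvals.sum(), 4)) # explained variance ratio per component
print(np.round(scores, 4))
```

The variance of each column of `scores` equals the corresponding eigenvalue, which is why eigenvalues are read as "variance explained".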

In [None]:
#Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature Extraction? Provide an example to illustrate this concept.
'''PCA (Principal Component Analysis) can be used for feature extraction, which is a process of creating new features from the existing ones. Feature extraction aims to 
transform the original features into a smaller set of features that still capture the essential information.

In the context of PCA, feature extraction involves selecting the top-k principal components and using them as the new features. These principal components are the linear 
combinations of the original features that explain the most variance in the data.


Original dataset:
Age: [25, 30, 35, 40]
Height: [170, 180, 160, 175]
Weight: [70, 80, 90, 75]
Income: [$50,000, $60,000, $70,000, $80,000]

Standardize the data (subtract the mean, divide by the population standard deviation):
Standardized age: [-1.3416, -0.4472, 0.4472, 1.3416]
Standardized height: [-0.1690, 1.1832, -1.5213, 0.5071]
Standardized weight: [-1.1832, 0.1690, 1.5213, -0.5071]
Standardized income: [-1.3416, -0.4472, 0.4472, 1.3416]

Compute the covariance matrix (sample covariance, dividing by n - 1 = 3; rows and columns ordered age, height, weight, income):
Covariance matrix:
[[ 1.3333, -0.1008,  0.5040,  1.3333],
[-0.1008,  1.3333, -0.7238, -0.1008],
[ 0.5040, -0.7238,  1.3333,  0.5040],
[ 1.3333, -0.1008,  0.5040,  1.3333]]

Compute the eigenvectors and eigenvalues (approximate values; the sign of an eigenvector is arbitrary):
Eigenvalues: [3.087, 1.722, 0.524, 0]
Share of total variance: [57.9%, 32.3%, 9.8%, 0%]

Eigenvector for the largest eigenvalue:
[0.604, -0.256, 0.453, 0.604]

Select the principal components:
Age and income are perfectly correlated in this sample, so one eigenvalue is 0 and the data really has only three independent directions. To illustrate
extracting a single new feature, select the first eigenvector, which explains about 58% of the variance (the first two together explain about 90%).
Selected eigenvector:
[0.604, -0.256, 0.453, 0.604]

Project the data (each standardized row multiplied by the selected eigenvector):
Projected data:
[-2.113, -0.767, 1.619, 1.261]
After applying PCA, we have extracted a new feature that is a linear combination of the original features. This new feature lies along the direction of
maximum variance in the data and can be used for further analysis or modeling. By keeping fewer principal components than there are original features, we
reduce the dimensionality of the dataset while preserving as much of the variance as possible.
'''
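
To make the feature-extraction idea in the Q4 answer concrete, here is a hedged sketch using scikit-learn on the same four-person dataset. It keeps the top two principal components as new features and prints their loadings, that is, the weights of the original features in each new feature. Keeping two components is an assumption made for illustration; the right number depends on how much variance you want to retain.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# One row per person: [age, height, weight, income]
X = np.array([
    [25, 170, 70, 50_000],
    [30, 180, 80, 60_000],
    [35, 160, 90, 70_000],
    [40, 175, 75, 80_000],
], dtype=float)

# Standardize so that income (tens of thousands) does not dominate age or height
X_std = StandardScaler().fit_transform(X)

# Extract two new features: the projections onto the top two principal components
pca = PCA(n_components=2)
features = pca.fit_transform(X_std)

loadings = pca.components_             # shape (2, 4): one row of feature weights per component
ratio = pca.explained_variance_ratio_  # share of total variance captured by each component

print(np.round(loadings, 3))
print(np.round(ratio, 3))
print(np.round(features, 3))
```

Because age and income are perfectly correlated here, they receive identical weights in each retained component, so PCA effectively merges them into one direction.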


In [None]:
#Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset contains features such as price, rating, and delivery time.
#Explain how you would use Min-Max scaling to preprocess the data.

'''First, I would examine the dataset and the features it contains, such as price, rating, and delivery time, and look at the range and distribution of values for
each feature. For each feature, I would identify the minimum and maximum values present in the dataset; these define the range from which the values are scaled.
Then I would apply the Min-Max formula to each feature separately: X_scaled = (X - X_min) / (X_max - X_min)
This is repeated for every relevant numeric feature (price, rating, and delivery time), each with its own minimum and maximum.
The scaled dataset can then be used in the recommendation system. Because all features are now on a comparable scale, a feature with large raw values no longer
dominates distance or similarity calculations over a feature with small raw values.

Min-Max scaling transforms the feature values to a common range (0 to 1). It preserves the relative spacing between data points within each feature while removing
the influence of the original units. This makes the features directly comparable and lets the recommendation system weigh price, rating, and delivery time on
an equal footing.
'''
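
The procedure described in the Q5 answer might look like the sketch below. The restaurant rows are invented purely for illustration (price, rating, and delivery time in minutes are assumed units), and fitting the scaler on existing data before transforming a new item is one common way to apply it, not the only one.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data, one row per restaurant: [price, rating, delivery_time_minutes]
train = np.array([
    [12.0, 4.5, 30.0],
    [25.0, 3.8, 45.0],
    [ 8.0, 4.9, 20.0],
    [40.0, 4.1, 60.0],
])
new_item = np.array([[20.0, 4.0, 35.0]])

scaler = MinMaxScaler()                       # default feature_range is (0, 1)
train_scaled = scaler.fit_transform(train)    # learns per-column min and max, then scales
new_scaled = scaler.transform(new_item)       # reuses the learned min and max

print(scaler.data_min_, scaler.data_max_)
print(np.round(train_scaled, 4))
print(np.round(new_scaled, 4))
```

Reusing the fitted scaler keeps new items on the same scale as the existing ones; a new value outside the range seen during fitting would land outside 0 to 1.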


In [None]:
#Q6. You are working on a project to build a model to predict stock prices. The dataset contains many features, such as company financial data and market trends. Explain how
#you would use PCA to reduce the dimensionality of the dataset.
'''
1. Understand the dataset: review the features it contains, such as company financial data and market trends, and assess the number of samples and the number of
features available.
2. Standardize the dataset. Before applying PCA, transform each feature to have zero mean and unit variance, so that features with larger scales do not dominate
the analysis.
3. Calculate the covariance matrix of the standardized dataset. It describes how each pair of features varies together, that is, the variance and covariance
structure of the data.
4. Perform eigendecomposition on the covariance matrix to obtain the eigenvectors and eigenvalues. The eigenvectors are the principal components, and each eigenvalue
is the amount of variance explained by its component. Sort the eigenvectors in descending order of their eigenvalues.
5. Decide on the number of principal components, k. This depends on the desired level of dimensionality reduction and the amount of variance to retain.
Typically, the top-k eigenvectors that explain the most variance are chosen; they span the reduced-dimensional space.
6. Multiply the standardized dataset by the selected eigenvectors to obtain the transformed data. This projection maps the original dataset onto the new set of
uncorrelated features (principal components).
7. Use the reduced-dimensional dataset in the stock price prediction model. PCA combines correlated, redundant features into a smaller set of uncorrelated
components, which simplifies the dataset and concentrates on the directions that carry the most variance.
'''
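
The steps in the Q6 answer can be sketched with a scikit-learn pipeline. No real stock data is available here, so the sketch generates a synthetic matrix in which 10 observed features are driven by 3 hidden factors; the sizes, the noise level, and the 95% variance threshold are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for financial features: 10 correlated columns built from 3 hidden factors
rng = np.random.default_rng(0)
n_samples, n_latent, n_features = 200, 3, 10
latent = rng.normal(size=(n_samples, n_latent))
mixing = rng.normal(size=(n_latent, n_features))
X = latent @ mixing + 0.1 * rng.normal(size=(n_samples, n_features))

# Standardize, then keep the fewest components that explain at least 95% of the variance.
# A float n_components between 0 and 1 is interpreted by PCA as a variance threshold.
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = reducer.fit_transform(X)

pca = reducer.named_steps["pca"]
print(X.shape, "->", X_reduced.shape)
print(np.round(np.cumsum(pca.explained_variance_ratio_), 4))
```

For a real forecasting model, the pipeline would normally be fitted on the training period only and then applied to later data, so that information from the future does not leak into the components.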

In [3]:
#Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the values to a range of -1 to 1.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
min_max = MinMaxScaler(feature_range=(-1, 1))
data = [1, 5, 10, 15, 20]
df = pd.DataFrame(data)
scaled_data = min_max.fit_transform(df)
print(scaled_data)

[[-1.        ]
 [-0.57894737]
 [-0.05263158]
 [ 0.47368421]
 [ 1.        ]]
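
The Q7 output can be reproduced by hand with the general form of the Min-Max formula, X_scaled = low + (X - X_min) * (high - low) / (X_max - X_min). The helper `min_max_to_range` below is a hypothetical name used only for this check.

```python
import numpy as np

def min_max_to_range(values, low, high):
    """Linearly map values so that the minimum becomes `low` and the maximum becomes `high`."""
    values = np.asarray(values, dtype=float)
    unit = (values - values.min()) / (values.max() - values.min())  # first scale to [0, 1]
    return low + unit * (high - low)                                # then stretch and shift

data = [1, 5, 10, 15, 20]
scaled = min_max_to_range(data, -1.0, 1.0)
print(np.round(scaled, 8))
```

For example, the value 5 maps to -1 + (5 - 1) * 2 / 19, which is about -0.5789.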


In [4]:
#Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform Feature Extraction using PCA. How many principal components 
#would you choose to retain, and why?

'''
When performing feature extraction using PCA on a dataset with features like height, weight, age, gender, and blood pressure, the number of principal components to retain
depends on the desired level of dimensionality reduction and the amount of variance we want to preserve in the dataset.
To determine the number of principal components to retain, we typically look at the cumulative explained variance ratio: the proportion of the total variance in the
dataset that is explained by the first k principal components together. First, standardize the features and calculate their covariance matrix. Then find the eigenvalues
and eigenvectors of the covariance matrix and sort the eigenvalues in descending order. Compute the explained variance ratio for each principal component by dividing its
eigenvalue by the sum of all eigenvalues, and calculate the cumulative sum of these ratios.
Choose the smallest number of principal components that explains a sufficient portion of the total variance in the dataset. A common rule of thumb is to retain enough
components to reach a cumulative explained variance of around 80% to 95%. However, the specific threshold may vary depending on the application and specific requirements.
Gender is categorical, so it must be encoded numerically before it can enter PCA; the code below leaves it out and applies PCA to the four numeric features. For that
sample data, the first two components explain about 90% and 10% of the variance, so two components are enough to pass a 95% threshold.'''

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'height': [170, 165, 180, 155, 175],
    'weight': [65, 60, 75, 50, 70],
    'age': [30, 25, 35, 28, 32],
    'gender': ['M', 'F', 'M', 'F', 'M'],
    'blood_pressure': [120, 110, 130, 115, 125]
})
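# PCA needs numeric input, so the categorical 'gender' column is left out here
features = ['height', 'weight', 'age', 'blood_pressure']
X = data[features]
# Standardize so that every feature has zero mean and unit variance before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit PCA with all components first, then inspect how much variance each one explains
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_explained_variance = np.cumsum(explained_variance_ratio)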

for i, explained_var in enumerate(explained_variance_ratio):
    print(f"Explained Variance of PC{i+1}: {explained_var}")
print(f"Cumulative Explained Variance: {cumulative_explained_variance[-1]}")

# Smallest number of components whose cumulative explained variance reaches 95%
num_components = np.sum(cumulative_explained_variance < 0.95) + 1
print(f"Number of Components to Retain: {num_components}")

X_reduced = X_pca[:, :num_components]

print("Reduced Feature Matrix:")
print(X_reduced)


Explained Variance of PC1: 0.901370694173991
Explained Variance of PC2: 0.09825111829674554
Explained Variance of PC3: 0.00037818752926356786
Explained Variance of PC4: 2.5918199527373645e-32
Cumulative Explained Variance: 1.0000000000000002
Number of Components to Retain: 2
Reduced Feature Matrix:
[[ 0.11631388 -0.11586161]
 [-1.90521341 -0.97856135]
 [ 2.71941058  0.16753009]
 [-2.2757703   0.9813193 ]
 [ 1.34525925 -0.05442644]]
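
The Q8 question lists gender as a feature, while the cell above drops it. One possible way to include it, shown here only as a sketch, is to encode it as a 0/1 column before standardizing. The column name `gender_is_male` is made up for this example, and whether PCA on a binary column is appropriate depends on the application.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

height = [170, 165, 180, 155, 175]
weight = [65, 60, 75, 50, 70]
age = [30, 25, 35, 28, 32]
gender = ['M', 'F', 'M', 'F', 'M']
blood_pressure = [120, 110, 130, 115, 125]

# Assumption: encode gender as a single binary column (1.0 for 'M', 0.0 for 'F')
gender_is_male = [1.0 if g == 'M' else 0.0 for g in gender]

X = np.column_stack([height, weight, age, gender_is_male, blood_pressure]).astype(float)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Index of the first cumulative value >= 0.95, plus one, gives the number of components to keep
num_components = int(np.searchsorted(cumulative, 0.95) + 1)

print(np.round(cumulative, 4))
print("Components to retain:", num_components)
```

With only five samples, at most four components can carry any variance, so the retained count is bounded by four regardless of the threshold.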
