In [None]:
#Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its application.

Min-Max scaling is a data preprocessing technique used to rescale numerical features in a dataset to a specific range, typically between 0 and 1. It's done to ensure that all features have the same scale, which can be important for machine learning algorithms that are sensitive to the magnitude of input data. Here's a simple explanation of how it works:

Find the minimum (min) and maximum (max) values of the feature you want to scale.

For each data point in the feature, subtract the minimum value and then divide by the range (max - min).

Mathematically, the formula for Min-Max scaling is:

Scaled Value = (Original Value - Min Value) / (Max Value - Min Value)


Here's an example to illustrate Min-Max scaling:

Suppose you have a dataset of ages ranging from 20 to 60 years, and you want to scale this feature to a range between 0 and 1.

Minimum age (min): 20 years
Maximum age (max): 60 years
Now, let's say you have an age value of 40 years that you want to scale:
Scaled Age = (40 - 20) / (60 - 20) = 20 / 40 = 0.5


So, the scaled value for an age of 40 years would be 0.5 after Min-Max scaling. 
This ensures that all the age values in your dataset are transformed to a common scale between 0 and 1, 
making it easier for machine learning algorithms to work with this feature, especially when you have multiple features with different scales.
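
As a minimal sketch of the same calculation in NumPy, assuming a handful of hypothetical ages within the 20 to 60 range (the `ages` array is made up for illustration):

```python
import numpy as np

# Hypothetical ages spanning the 20-60 range used in the example above
ages = np.array([20, 30, 40, 50, 60], dtype=float)

# Min-Max scaling: (x - min) / (max - min)
scaled_ages = (ages - ages.min()) / (ages.max() - ages.min())

print(scaled_ages)
```

The age of 40 maps to 0.5, matching the hand calculation, while 20 and 60 map to 0 and 1.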


In [1]:
#Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling? Provide an example to illustrate its application.
"""Unit Vector scaling, also known as Normalization, is another data preprocessing technique used to scale numerical features in a dataset. 
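   It differs from Min-Max scaling in that it rescales each data point (a vector of feature values) so that its magnitude (norm, typically the Euclidean L2 norm) equals 1, whereas Min-Max scaling rescales each feature to a fixed range using that feature's minimum and maximum.
   This technique is useful when you want to retain the direction or angle information between data points but standardize their magnitudes.
   Scaled Vector = Original Vector / Norm of the Vector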
"""
import numpy as np

# Sample vector
vector = np.array([3, 4])

# Calculate the norm
norm = np.linalg.norm(vector)

# Scale the vector
unit_vector = vector / norm

print(unit_vector)


[0.6 0.8]
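
To make the contrast with Min-Max scaling concrete, here is a small sketch using scikit-learn's `normalize` and `MinMaxScaler`. The matrix `X` is a hypothetical example, not data from the question: each row is one data point and each column is one feature.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

# Hypothetical samples: each row is one data point with two features
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [1.0, 0.0]])

# Unit vector scaling works per row: every sample ends up with L2 norm 1
X_unit = normalize(X, norm="l2")

# Min-Max scaling works per column: every feature ends up in [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print(X_unit)
print(np.linalg.norm(X_unit, axis=1))
print(X_minmax)
```

The rows [3, 4] and [6, 8] point in the same direction, so they map to the same unit vector. Min-Max scaling keeps them distinct because it only looks at each column's minimum and maximum.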


In [2]:
#Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an example to illustrate its application.
"""
Principal Component Analysis (PCA) is a widely used technique in statistics and data analysis for dimensionality reduction and feature extraction.
It's used to simplify complex datasets by transforming them into a lower-dimensional representation while preserving as much of the original data's variance as possible.
PCA accomplishes this by finding new orthogonal axes called principal components, which are linear combinations of the original features.

Data Centering: PCA begins by centering the data, which means subtracting the mean from each feature. This ensures that the data is centered around the origin.

Covariance Matrix Calculation: PCA calculates the covariance matrix of the centered data. The covariance matrix describes how features in the dataset vary together.

Eigenvalue and Eigenvector Calculation: PCA then calculates the eigenvalues and eigenvectors of the covariance matrix. These eigenvectors represent the principal components, and the eigenvalues represent the variance explained by each principal component.

Sorting Principal Components: The eigenvectors are sorted in descending order of their corresponding eigenvalues. The principal components are ranked by the amount of variance they explain, with the first principal component explaining the most variance, the second explaining the second most, and so on.

Dimensionality Reduction: To reduce dimensionality, you can select a subset of the top-ranked principal components that collectively explain most of the variance in the data. These selected principal components form a new basis for the data.
"""

from sklearn.decomposition import PCA
import numpy as np

# Sample data
data = np.array([[160, 60], [165, 65], [155, 55], [175, 70]])

# Create a PCA object
pca = PCA(n_components=1)

# Fit and transform the data to reduce it to 1 dimension
reduced_data = pca.fit_transform(data)

print(reduced_data)


[[ -4.49962554]
 [  2.50184501]
 [-11.50109608]
 [ 13.49887661]]
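
The steps listed in the answer (centering, covariance matrix, eigen-decomposition, sorting, projection) can also be sketched by hand in NumPy. This is an illustrative reimplementation on the same sample data, not how scikit-learn computes PCA internally. The sign of a principal component is arbitrary, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

data = np.array([[160, 60], [165, 65], [155, 55], [175, 70]], dtype=float)

# 1. Center the data by subtracting each feature's mean
centered = data - data.mean(axis=0)

# 2. Covariance matrix (rowvar=False: columns are features)
cov = np.cov(centered, rowvar=False)

# 3. Eigen-decomposition; eigh handles symmetric matrices and
#    returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort components by descending eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 5. Project the centered data onto the first principal component
manual_reduced = centered @ eigenvectors[:, :1]

# Compare with scikit-learn, ignoring the arbitrary sign
sklearn_reduced = PCA(n_components=1).fit_transform(data)
print(np.allclose(np.abs(manual_reduced), np.abs(sklearn_reduced)))
```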


In [3]:
#Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature Extraction? Provide an example to illustrate this concept.

"""Principal Component Analysis (PCA) can be thought of as a feature extraction technique, and its primary goal is to find new, uncorrelated features called principal components that are linear combinations of the original features. The relationship between PCA and feature extraction is that PCA extracts these new features while reducing the dimensionality of the data. Here's how PCA can be used for feature extraction:

Dimensionality Reduction: PCA is often used to reduce the dimensionality of a dataset while retaining as much information as possible. It does this by finding the principal components, which are combinations of the original features that capture the most variance in the data. These principal components are used as the new, extracted features.

Variance Retention: When you apply PCA for feature extraction, you typically select a subset of the top-ranked principal components that collectively explain a high percentage of the variance in the data. By retaining these components, you effectively reduce the number of features in your dataset while minimizing information loss.

New Feature Space: The selected principal components form a new feature space. This space can be used for various purposes such as visualization, data analysis, or feeding into machine learning algorithms."""
from sklearn.decomposition import PCA
import numpy as np

# Sample health data with 4 features (blood pressure, cholesterol, BMI, heart rate)
data = np.array([[120, 200, 25, 70],
                 [130, 220, 27, 75],
                 [115, 190, 24, 68],
                 [140, 240, 29, 80]])

# Create a PCA object, retaining 2 principal components
pca = PCA(n_components=2)

# Fit and transform the data to extract 2 features
extracted_features = pca.fit_transform(data)

print(extracted_features)



[[-14.40108261   0.21522015]
 [  8.59843568   0.06636213]
 [-25.79530703  -0.19908638]
 [ 31.59795396  -0.0824959 ]]
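
To see how much information the extracted features carry and how they relate to the original ones, the fitted PCA object can be inspected. The sketch below reuses the same sample data and reads two standard scikit-learn attributes, `explained_variance_ratio_` and `components_`.

```python
from sklearn.decomposition import PCA
import numpy as np

# Same sample health data: blood pressure, cholesterol, BMI, heart rate
data = np.array([[120, 200, 25, 70],
                 [130, 220, 27, 75],
                 [115, 190, 24, 68],
                 [140, 240, 29, 80]])

pca = PCA(n_components=2).fit(data)

# Fraction of the total variance captured by each extracted feature
print(pca.explained_variance_ratio_)

# Loadings: each row is one principal component, expressed as weights
# on the original features (blood pressure, cholesterol, BMI, heart rate)
print(pca.components_)
```

On this small sample the first component accounts for nearly all of the variance, which is consistent with the very small values in the second column of the output above.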


In [5]:
#Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset
#contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to preprocess the data. 
"""To preprocess the data for building a recommendation system for a food delivery service using Min-Max scaling, follow these steps:

Understand Your Data: Start by understanding the dataset and the specific features you have. In your case, you mentioned that the dataset contains features like price, rating, and delivery time. Ensure you know the range and distribution of values for each of these features.

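Import Libraries: Import the necessary Python libraries, including NumPy and scikit-learn, for data preprocessing.

Fit and Transform: Fit a MinMaxScaler on the numeric features and transform them, so that price, rating, and delivery time each fall in the range 0 to 1 and no single feature dominates similarity or distance calculations because of its units."""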
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Sample data with three features: price, rating, delivery time
data = np.array([[10, 4.5, 30],
                 [30, 3.8, 45],
                 [20, 4.2, 20],
                 [15, 4.0, 60]])

# Create a Min-Max scaler
scaler = MinMaxScaler()

# Fit the scaler and transform the data
scaled_data = scaler.fit_transform(data)

print(scaled_data)


[[0.         1.         0.25      ]
 [1.         0.         0.625     ]
 [0.5        0.57142857 0.        ]
 [0.25       0.28571429 1.        ]]
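
In a recommendation system, new items arrive after the scaler has been fitted. A minimal sketch, assuming a hypothetical new restaurant (`new_item` is made up for illustration), is to fit the scaler once on the existing data and reuse it with `transform`:

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Existing records: price, rating, delivery time (same sample data as above)
train = np.array([[10, 4.5, 30],
                  [30, 3.8, 45],
                  [20, 4.2, 20],
                  [15, 4.0, 60]])

# Hypothetical new restaurant that was not part of the fitted data
new_item = np.array([[25, 4.9, 40]])

scaler = MinMaxScaler().fit(train)       # learn min and max from existing data only
scaled_new = scaler.transform(new_item)  # reuse those min and max values

print(scaled_new)
```

The new rating of 4.9 is above the fitted maximum of 4.5, so its scaled value is greater than 1. Values outside the fitted range are not squeezed into [0, 1] unless clipping is requested, for example with the scaler's `clip` option.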


In [None]:
#Q6. You are working on a project to build a model to predict stock prices. The dataset contains many
#features, such as company financial data and market trends. Explain how you would use PCA to reduce the
# dimensionality of the dataset.

"""Step 1: Data Preparation
Before applying PCA, it's essential to prepare your data:

Data Cleaning: Ensure that your dataset is clean, free of missing values, and outliers are handled appropriately.

Feature Scaling: Normalize or standardize your data if necessary, so that all features have a similar scale. PCA is sensitive to the scale of the features.

Step 2: Applying PCA

Once your data is prepared, follow these steps to apply PCA:

Center the Data: Subtract the mean from each feature. This centers the data around the origin.

Calculate the Covariance Matrix: Compute the covariance matrix of the centered data. The covariance matrix provides information about how different features are related.

Compute Eigenvectors and Eigenvalues: Calculate the eigenvectors and eigenvalues of the covariance matrix. These eigenvectors represent the principal components, and the eigenvalues represent the variance explained by each principal component.

Select Principal Components: Sort the eigenvectors in descending order of their corresponding eigenvalues. The principal components with the highest eigenvalues explain the most variance in the data. You can choose how many principal components to retain based on how much variance you want to preserve. For instance, to retain 90% of the variance, divide each eigenvalue by the sum of all eigenvalues and keep the top-ranked eigenvectors until the cumulative fraction reaches 0.90.

Transform the Data: Project your original data onto the selected principal components. This forms a new dataset with reduced dimensionality. The transformed data can be used for modeling.

Step 3: Modeling and Evaluation

After reducing the dimensionality of your dataset using PCA, you can proceed with building your stock price prediction model using the transformed data. It's essential to:

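Split your data into training and testing sets. For time-ordered stock data, split chronologically, and fit the scaler and PCA on the training set only so that no information from the test period leaks into the transformation.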

Apply a regression or time-series forecasting model to the transformed data. Common models for stock price prediction include linear regression, ARIMA, or machine learning algorithms like random forests or neural networks.

Evaluate your model's performance using appropriate metrics, such as mean squared error (MSE) or root mean squared error (RMSE), on the testing dataset.

Step 4: Interpretation

PCA can help with interpretation by highlighting the dominant directions of variation (principal components) in your data. Each component mixes several original features, so you can analyze the loadings of the original features on these principal components to gain insights into which aspects of the financial and market data contribute most to each component."""
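
The workflow above can be sketched with a scikit-learn pipeline. Everything about the data here is an assumption: `X` is a synthetic stand-in with 10 features generated from 3 hidden factors, not real stock data, and the 90% threshold is only an example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a stock dataset: 200 time-ordered rows, 10 features
# driven by 3 hidden factors plus noise. Purely illustrative, not real data.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = factors @ mixing + 0.1 * rng.normal(size=(200, 10))

# Chronological split: first 80% for training, last 20% for testing
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]

# Standardize, then keep enough components to explain 90% of the variance
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.90))
X_train_reduced = reducer.fit_transform(X_train)  # fit on training data only
X_test_reduced = reducer.transform(X_test)        # reuse the fitted transform

pca = reducer.named_steps["pca"]
print(pca.n_components_)
print(pca.explained_variance_ratio_.cumsum())
```

Passing a float between 0 and 1 as `n_components` asks scikit-learn to choose the number of components needed to reach that fraction of explained variance, which replaces the manual cumulative-sum step.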


In [6]:
#Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the
# values to a range of -1 to 1.
import numpy as np

# Define the dataset
data = np.array([1, 5, 10, 15, 20])

# Define the new range
new_min = -1
new_max = 1

# Calculate the min and max of the original data
min_value = np.min(data)
max_value = np.max(data)

# Apply Min-Max scaling
scaled_data = ((data - min_value) / (max_value - min_value)) * (new_max - new_min) + new_min

print(scaled_data)



[-1.         -0.57894737 -0.05263158  0.47368421  1.        ]
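
The same result can be obtained with scikit-learn's `MinMaxScaler` by setting `feature_range=(-1, 1)`. This is an equivalent sketch rather than a replacement for the manual formula. The scaler expects a 2D array, so the values are reshaped into a single column first.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([1, 5, 10, 15, 20], dtype=float)

# MinMaxScaler expects a 2D array of shape (n_samples, n_features)
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(data.reshape(-1, 1)).ravel()

print(scaled)

# inverse_transform maps the scaled values back to the original units
restored = scaler.inverse_transform(scaled.reshape(-1, 1)).ravel()
print(restored)
```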


In [7]:
#Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform
# Feature Extraction using PCA. How many principal components would you choose to retain, and why?

"""Approach: fit PCA with all components, compute the cumulative explained variance, and retain the smallest number of
components whose cumulative explained variance reaches a chosen threshold (95% here). On the sample data below,
2 components explain about 99.9% of the variance, so 2 components are retained."""

from sklearn.decomposition import PCA
import numpy as np

# Sample dataset with features: height, weight, age, gender, blood pressure
data = np.array([[170, 70, 30, 1, 120],
                 [160, 65, 25, 0, 130],
                 [180, 80, 35, 1, 110],
                 [155, 52, 28, 0, 125]])

# Create a PCA object
pca = PCA()

# Fit the PCA model to your data
pca.fit(data)

# Calculate the explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_

# Calculate cumulative explained variance
cumulative_variance = np.cumsum(explained_variance_ratio)

# Set a threshold, e.g., 95% variance
threshold = 0.95

# Determine the number of principal components to retain
num_components_to_retain = np.argmax(cumulative_variance >= threshold) + 1

print("Explained Variance Ratios:", explained_variance_ratio)
print("Cumulative Explained Variance:", cumulative_variance)
print("Number of Principal Components to Retain:", num_components_to_retain)


Explained Variance Ratios: [9.19360488e-01 7.94897720e-02 1.14973970e-03 4.52647078e-35]
Cumulative Explained Variance: [0.91936049 0.99885026 1.         1.        ]
Number of Principal Components to Retain: 2
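
The features in this dataset use different units, and gender is a 0/1 indicator, so the unscaled result above is dominated by the features with the largest numeric spread. A hedged variant is to standardize the features before PCA and repeat the selection. The sketch below reuses the same sample data. With only four samples the result is illustrative at best, and the retained count may differ from the unscaled run.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# Same sample data: height, weight, age, gender, blood pressure
data = np.array([[170, 70, 30, 1, 120],
                 [160, 65, 25, 0, 130],
                 [180, 80, 35, 1, 110],
                 [155, 52, 28, 0, 125]])

# Put every feature on unit variance so that large-valued features
# such as height do not dominate the components
standardized = StandardScaler().fit_transform(data)

pca = PCA().fit(standardized)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative)
print(n_keep)
```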
