## Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its application.

**Min-Max scaling** is a data preprocessing technique used to transform numeric features in a dataset to a specific range, typically between 0 and 1. It linearly scales the values of each feature, preserving the relative relationships between data points while ensuring that all values fall within the specified range.

The formula for Min-Max scaling is:

$$X_{sc} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

where $X_{min}$ and $X_{max}$ are the minimum and maximum values of the feature.
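
As a minimal sketch, the formula can be applied by hand with NumPy, one column at a time. The helper name `min_max_scale` and its `feature_range` argument are illustrative and not part of any library; the last line of the helper assumes the usual generalization from [0, 1] to an arbitrary range [a, b].

```python
import numpy as np

def min_max_scale(X, feature_range=(0.0, 1.0)):
    """Scale each column of X to feature_range (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # (X - Xmin) / (Xmax - Xmin) maps each column onto [0, 1];
    # this assumes no column is constant (Xmax != Xmin)
    unit = (X - x_min) / (x_max - x_min)
    a, b = feature_range
    # Stretch and shift [0, 1] onto [a, b]
    return unit * (b - a) + a

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaled = min_max_scale(data)
print(scaled)
```

Each column is scaled with its own minimum and maximum, which is why both columns end up with the same scaled values even though their original ranges differ.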

In [1]:
from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
print(scaler.fit(data))
MinMaxScaler()




In [2]:
print(scaler.data_max_)

[ 1. 18.]


In [3]:
print(scaler.transform(data))

[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]


In [4]:
print(scaler.transform([[2, 2]]))

[[1.5 0. ]]
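
The 1.5 in the output above appears because `transform` reuses the minimum and maximum learned during `fit`, so an input outside the fitted range lands outside [0, 1]. As a hedged sketch, relatively recent scikit-learn releases accept a `clip` argument on `MinMaxScaler` that limits transformed values to the feature range; whether clipping is desirable depends on the application.

```python
from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

# Default behaviour: values outside the fitted range are extrapolated
plain_scaler = MinMaxScaler().fit(data)
unclipped = plain_scaler.transform([[2, 2]])

# clip=True limits transformed values to the feature range
clipped_scaler = MinMaxScaler(clip=True).fit(data)
clipped = clipped_scaler.transform([[2, 2]])

print(unclipped)
print(clipped)
```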


In [5]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

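# Two example features, each reshaped to a single column of shape (n_samples, 1)
size = np.array([800, 1200, 1500, 2000, 2500]).reshape(-1, 1)
bedrooms = np.array([2, 3, 4, 5]).reshape(-1, 1)

scaler = MinMaxScaler()

# fit_transform refits the scaler on each call, so each feature is scaled
# with its own minimum and maximum
size_scaled = scaler.fit_transform(size)
bedrooms_scaled = scaler.fit_transform(bedrooms)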

print("Size (Scaled):", size_scaled)
print("Bedrooms (Scaled):", bedrooms_scaled)


Size (Scaled): [[0.        ]
 [0.23529412]
 [0.41176471]
 [0.70588235]
 [1.        ]]
Bedrooms (Scaled): [[0.        ]
 [0.33333333]
 [0.66666667]
 [1.        ]]


## Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling? Provide an example to illustrate its application.

**Unit Vector scaling**, also known as **Normalization**, is a feature scaling technique that rescales each sample (the vector of feature values for one data point) to unit length, that is, a Euclidean norm (magnitude) of 1. This technique is often used when the direction of, or the angle between, data points is more important than their absolute values.

The formula for Unit Vector scaling of a feature vector **X** is as follows:

$$X_{normalized} = \frac{X}{\lVert X \rVert}$$

where $\lVert X \rVert$ is the Euclidean (L2) norm of **X**.

##### Here's how Unit Vector scaling differs from Min-Max scaling:

1. **Scale Range:**
   - Min-Max scaling scales the features to a specific range (e.g., between 0 and 1), ensuring that all values fall within this range.
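   - Unit Vector scaling rescales each sample vector to a magnitude of 1, so every sample lies on the unit circle (in two dimensions) or, more generally, on the unit hypersphere. Individual components end up in [-1, 1], but that is a side effect of the unit norm rather than a range chosen by the user.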

2. **Magnitude vs. Direction:**
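   - Min-Max scaling works feature by feature (column by column): it preserves the relative spacing of the values within each feature, but because every feature is shifted and scaled separately, the magnitude and direction of each sample vector generally change.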
   - Unit Vector scaling preserves the direction of the features but normalizes their magnitude to 1. It's particularly useful when the direction of the data vectors matters more than their absolute values.

**Example: Unit Vector Scaling**

In [6]:
from sklearn.preprocessing import normalize

# Original dataset (samples as rows, features as columns)
data = np.array([[3, 0, -4, 6],
                 [0, 2, -3, 4]])

# Normalize the dataset to unit vectors using scikit-learn's normalize function
normalized_data = normalize(data, norm='l2', axis=1)

print("Original Data:")
print(data)
print("\nNormalized Data (Unit Vectors):")
print(normalized_data)


Original Data:
[[ 3  0 -4  6]
 [ 0  2 -3  4]]

Normalized Data (Unit Vectors):
[[ 0.38411064  0.         -0.51214752  0.76822128]
 [ 0.          0.37139068 -0.55708601  0.74278135]]
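
The same result can be reproduced by hand, which makes the row-wise nature of the operation explicit. This is a minimal NumPy sketch using the example values above; the variable names are illustrative.

```python
import numpy as np

data = np.array([[3, 0, -4, 6],
                 [0, 2, -3, 4]], dtype=float)

# Euclidean (L2) norm of each row, kept as a column for broadcasting
row_norms = np.linalg.norm(data, axis=1, keepdims=True)

# Divide every row by its own norm
unit_rows = data / row_norms

print(unit_rows)
print(np.linalg.norm(unit_rows, axis=1))  # length of each row after scaling
```

Every row now has length 1, while the ratios between the values inside a row are unchanged.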


## Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an example to illustrate its application.

**Principal Component Analysis (PCA)** is a dimensionality reduction technique widely used in data analysis and machine learning. It is used to transform high-dimensional data into a lower-dimensional form while preserving as much of the variance or information in the data as possible. PCA accomplishes this by identifying and capturing the underlying structure or patterns in the data.

Here's a step-by-step overview of how PCA works:

1. **Data Centering:** PCA begins by centering the data, which means subtracting the mean (average) value of each feature from the data points. Centering ensures that the principal components describe variation around the mean rather than the offset of the data from the origin.

2. **Covariance Matrix:** PCA computes the covariance matrix of the centered data. The covariance matrix describes how different features in the data vary together and provides insights into their relationships and dependencies.

3. **Eigenvalue Decomposition:** PCA performs an eigenvalue decomposition of the covariance matrix (or, equivalently, a singular value decomposition of the centered data matrix). This decomposition yields eigenvalues and eigenvectors.

4. **Principal Components:** The eigenvectors represent the directions (axes) in the original feature space along which the data varies the most. These eigenvectors are called principal components. The corresponding eigenvalues indicate the variance of the data along those directions.

5. **Dimensionality Reduction:** PCA sorts the eigenvalues in descending order and selects the top k eigenvectors (principal components) that capture the most variance in the data. By choosing a smaller number of principal components (k), PCA effectively reduces the dimensionality of the dataset from the original number of features to k features.

6. **Projection:** The original data is projected onto the new lower-dimensional space defined by the selected principal components. This projection results in a new dataset with reduced dimensions.
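
The six steps above can be sketched directly in NumPy. This is an illustrative implementation under stated assumptions, not the way any particular library does it: the data values and the choice `k = 2` are made up, and the signs of the resulting components may differ from those returned by a library such as scikit-learn.

```python
import numpy as np

# Small illustrative dataset (values are made up for this sketch)
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 1.5],
              [2.3, 2.7, 0.9]])
k = 2  # number of principal components to keep

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features (columns)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition; eigh is suited to symmetric matrices
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4-5. Sort by descending eigenvalue and keep the top k eigenvectors
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order[:k]]

# 6. Project the centered data onto the selected components
X_projected = X_centered @ components

print(eigenvalues)
print(X_projected)
```

The variance of each projected column equals the corresponding eigenvalue, which is what "the eigenvalues indicate the variance along those directions" means in practice.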

PCA is commonly used for various purposes, including:

- **Dimensionality Reduction:** Reducing the dimensionality of high-dimensional datasets, which can lead to more efficient computation and visualization.

- **Noise Reduction:** Discarding low-variance components, which often carry mostly noise, can improve the performance of machine learning algorithms.

- **Data Compression:** Storing or transmitting data more efficiently by representing it in a lower-dimensional form.

- **Visualization:** Reducing the dimensionality of data for visualization purposes, allowing for easier exploration and understanding.
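
The compression use case can be illustrated with scikit-learn's `inverse_transform`, which maps the reduced representation back to the original feature space. This is a sketch on synthetic data (generated here, not taken from the source) in which the third feature is almost a combination of the first two, so very little information is lost.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 3 features, where the third is nearly the sum
# of the first two (made up for this sketch)
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
third = base[:, :1] + base[:, 1:] + 0.01 * rng.normal(size=(100, 1))
X = np.hstack([base, third])

pca = PCA(n_components=2)
X_compressed = pca.fit_transform(X)               # 100 x 2 representation
X_restored = pca.inverse_transform(X_compressed)  # back to 100 x 3

# Mean squared reconstruction error stays small because little variance is lost
reconstruction_error = np.mean((X - X_restored) ** 2)
print(X_compressed.shape, X_restored.shape)
print(reconstruction_error)
```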


In [7]:
import numpy as np
from sklearn.decomposition import PCA

# Sample data (3 features)
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9],
                 [10, 11, 12]])

# Create a PCA instance with 2 components
pca = PCA(n_components=2)

# Fit and transform the data to the first 2 principal components
transformed_data = pca.fit_transform(data)

# Print the transformed data
print("Original Data:")
print(data)
print("\nTransformed Data (2 Principal Components):")
print(transformed_data)


Original Data:
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]

Transformed Data (2 Principal Components):
[[ 7.79422863e+00  4.41704682e-16]
 [ 2.59807621e+00 -1.20464913e-16]
 [-2.59807621e+00  1.20464913e-16]
 [-7.79422863e+00  3.61394740e-16]]
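
The second column of the transformed data above is on the order of 1e-16, which is numerical noise: in this sample every row differs from the previous one by the same amount in each feature, so all points lie on a single straight line. A quick way to check this is to inspect `explained_variance_ratio_`, as in the following sketch.

```python
import numpy as np
from sklearn.decomposition import PCA

data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9],
                 [10, 11, 12]])

pca = PCA(n_components=2).fit(data)

# Share of the total variance captured by each retained component
ratios = pca.explained_variance_ratio_
print(ratios)
```

Here the first component accounts for essentially all of the variance, so a single component would already describe this particular dataset.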


## Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature Extraction? Provide an example to illustrate this concept.

**Principal Component Analysis (PCA)** and **Feature Extraction** are closely related concepts in the context of dimensionality reduction and data analysis. PCA can be used as a feature extraction technique, allowing you to transform a dataset into a set of new features (principal components) that capture the most important information while reducing dimensionality.

Here's the relationship between PCA and Feature Extraction and an example to illustrate it:

**Relationship between PCA and Feature Extraction:**

- **PCA as Feature Extraction:** PCA is a dimensionality reduction technique, but it can also be used as a feature extraction method. Instead of using all original features, you can use PCA to extract a smaller set of new features (principal components) that are linear combinations of the original features.

- **Preservation of Information:** PCA selects the principal components in such a way that they capture the maximum variance in the data. These principal components retain the most important information while discarding less important information. In this sense, PCA performs feature extraction by creating a compact representation of the data.

- **Dimensionality Reduction:** The extracted principal components are typically fewer in number than the original features, which results in dimensionality reduction. This can lead to more efficient computation and potentially improve the performance of machine learning algorithms.

**Example: Using PCA for Feature Extraction**

In [8]:
from sklearn.decomposition import PCA

# Sample data with three features (3D data)
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9],
                 [10, 11, 12]])

# Create a PCA instance with 2 components
pca = PCA(n_components=2)

# Fit and transform the data to extract two principal components (features)
extracted_features = pca.fit_transform(data)

# Print the original data and the extracted features
print("Original Data (3D):")
print(data)
print("\nExtracted Features (2D):")
print(extracted_features)

Original Data (3D):
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]

Extracted Features (2D):
[[ 7.79422863e+00  4.41704682e-16]
 [ 2.59807621e+00 -1.20464913e-16]
 [-2.59807621e+00  1.20464913e-16]
 [-7.79422863e+00  3.61394740e-16]]
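
To make the "linear combinations of the original features" point concrete, the sketch below inspects `components_`, whose rows hold the weights of each principal component, and rebuilds the extracted features by hand. The data values are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data with three features
X = np.array([[2.0, 1.0, 0.5],
              [3.0, 4.0, 1.0],
              [5.0, 3.0, 2.5],
              [6.0, 7.0, 2.0],
              [8.0, 6.0, 4.0]])

pca = PCA(n_components=2)
features = pca.fit_transform(X)

# Each row of components_ holds the weights of one principal component
print(pca.components_)

# Rebuild the extracted features by hand: center, then apply the weights
manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(features, manual))
```

Each extracted feature is therefore a weighted sum of the centered original features, with the weights given by one row of `components_`.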


## Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to preprocess the data.


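- **Understand the data:** Price, rating, and delivery time are measured in different units and ranges, so a feature with large values (such as price) can dominate distance or similarity calculations.
- **Apply Min-Max scaling:** For each feature, find its minimum and maximum and rescale every value with (X - Xmin) / (Xmax - Xmin).
- **Normalization range [0, 1]:** After scaling, the smallest value of each feature maps to 0 and the largest to 1.
- **Repeat for each feature:** Scale price, rating, and delivery time independently, each with its own minimum and maximum.
- **Use the preprocessed data:** Feed the scaled features into the recommendation model so that all three contribute on a comparable scale.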

In [9]:
from sklearn.preprocessing import MinMaxScaler

# Sample dataset with features: price, rating, and delivery time
data = np.array([[10.0, 4.5, 30],
                 [20.0, 3.8, 45],
                 [15.0, 4.0, 40],
                 [25.0, 4.9, 35]])

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform the data using Min-Max scaling
scaled_data = scaler.fit_transform(data)

# Convert the original data to a DataFrame
original_df = pd.DataFrame(data, columns=['Price', 'Rating', 'Delivery Time'])

# Convert the scaled data to a DataFrame
scaled_df = pd.DataFrame(scaled_data, columns=['Scaled Price', 'Scaled Rating', 'Scaled Delivery Time'])

concatenated_df = pd.concat([original_df, scaled_df], axis=1)

# Print the concatenated DataFrame
print("Concatenated DataFrames:")
print(concatenated_df)

Concatenated DataFrames:
   Price  Rating  Delivery Time  Scaled Price  Scaled Rating  \
0   10.0     4.5           30.0      0.000000       0.636364   
1   20.0     3.8           45.0      0.666667       0.000000   
2   15.0     4.0           40.0      0.333333       0.181818   
3   25.0     4.9           35.0      1.000000       1.000000   

   Scaled Delivery Time  
0              0.000000  
1              1.000000  
2              0.666667  
3              0.333333  
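
In a running system, new restaurants arrive after the scaler has been fitted. One common pattern, sketched here under the assumption that the four rows above act as the reference data, is to fit once and then reuse the stored minimum and maximum with `transform`. The `new_item` values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same sample values as above: price, rating, delivery time
train = np.array([[10.0, 4.5, 30],
                  [20.0, 3.8, 45],
                  [15.0, 4.0, 40],
                  [25.0, 4.9, 35]])

scaler = MinMaxScaler().fit(train)  # learn per-feature min and max once

# Hypothetical new restaurant, scaled with the stored min and max
new_item = np.array([[18.0, 4.2, 38]])
new_scaled = scaler.transform(new_item)

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(new_scaled)

print(new_scaled)
print(restored)
```

Refitting the scaler on every new batch would change the meaning of the scaled values, so reusing one fitted scaler keeps old and new items comparable.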


## Q6. You are working on a project to build a model to predict stock prices. The dataset contains many features, such as company financial data and market trends. Explain how you would use PCA to reduce the dimensionality of the dataset.


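- **Data preprocessing:** Handle missing values and drop non-numeric columns that PCA cannot use (for example identifiers or dates).
- **Feature selection and engineering:** Keep the financial and market features that are relevant to the prediction target.
- **Standardization or normalization:** Scale the features (for example to zero mean and unit variance), because PCA is sensitive to feature scale.
- **Applying PCA:** Fit PCA on the scaled features.
- **Selecting principal components:** Keep the top k components, for example enough to reach a chosen share of the explained variance.
- **Reduced-dimension dataset:** Project the data onto the selected components.
- **Model building:** Train the prediction model on the reduced features.
- **Model evaluation and fine-tuning:** Evaluate on held-out data and adjust the number of components or the model as needed.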

In [10]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv('ADANIPORTS.csv')
df.head()

Unnamed: 0,Date,Symbol,Series,Prev Close,Open,High,Low,Last,Close,VWAP,Volume,Turnover,Trades,Deliverable Volume,%Deliverble
0,2007-11-27,MUNDRAPORT,EQ,440.0,770.0,1050.0,770.0,959.0,962.9,984.72,27294366,2687719000000000.0,,9859619,0.3612
1,2007-11-28,MUNDRAPORT,EQ,962.9,984.0,990.0,874.0,885.0,893.9,941.38,4581338,431276500000000.0,,1453278,0.3172
2,2007-11-29,MUNDRAPORT,EQ,893.9,909.0,914.75,841.0,887.0,884.2,888.09,5124121,455065800000000.0,,1069678,0.2088
3,2007-11-30,MUNDRAPORT,EQ,884.2,890.0,958.0,890.0,929.0,921.55,929.17,4609762,428325700000000.0,,1260913,0.2735
4,2007-12-03,MUNDRAPORT,EQ,921.55,939.75,995.0,922.0,980.0,969.3,965.65,2977470,287520000000000.0,,816123,0.2741


In [11]:
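# Separate the target variable from the features.
# Note: this example uses 'Turnover' as the target; to predict price, a price
# column such as 'Close' would be the target instead.
X = df.drop(columns=['Turnover','Trades','Symbol','Date','Series'])
y = df['Turnover']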


In [12]:
# Standardize the features (mean=0, variance=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

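# Apply PCA for dimensionality reduction.
# Note: X has 10 columns after the drop above, so n_components=10 keeps every
# component and does not reduce dimensionality; pick a smaller value to do so.
pca = PCA(n_components=10)  # Choose the number of components you want to retain
X_pca = pca.fit_transform(X_scaled)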

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# Train a regression model (e.g., Linear Regression) on the reduced-dimension data
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

Mean Squared Error: 7.348374519676787e+27
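
A variation worth considering is to split the data first and wrap scaling, PCA, and the model in a pipeline, so that the scaler and PCA are fitted on the training rows only. The sketch below uses synthetic data rather than the stock file above, and the 95% variance threshold is an assumed example value, not a recommendation from the source.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature table (not the stock data above):
# 10 features built from 3 hidden factors plus a little noise
rng = np.random.default_rng(42)
factors = rng.normal(size=(200, 3))
X = factors @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))
y = factors @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Split first so the scaler and PCA only ever see the training rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# A float n_components keeps enough components to explain 95% of the variance
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression())
model.fit(X_train, y_train)

n_kept = model.named_steps['pca'].n_components_
test_r2 = model.score(X_test, y_test)
print("Components kept:", n_kept)
print("Test R^2:", test_r2)
```

Because the ten synthetic features are driven by three hidden factors, PCA keeps far fewer than ten components here, which is the kind of reduction the question asks about.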


## Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the values to a range of 0 to 1.

In [13]:
data = np.array([1, 5, 10, 15, 20]).reshape(-1, 1)
scaler = MinMaxScaler()

# Fit and transform the data using Min-Max scaling
scaled_data = scaler.fit_transform(data)

print(scaled_data)

[[0.        ]
 [0.21052632]
 [0.47368421]
 [0.73684211]
 [1.        ]]
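
The same values follow directly from the Min-Max formula with a minimum of 1 and a maximum of 20, as this worked calculation shows:

```latex
\begin{aligned}
x' &= \frac{x - 1}{20 - 1} = \frac{x - 1}{19} \\
1 &\mapsto \frac{0}{19} = 0 \\
5 &\mapsto \frac{4}{19} \approx 0.2105 \\
10 &\mapsto \frac{9}{19} \approx 0.4737 \\
15 &\mapsto \frac{14}{19} \approx 0.7368 \\
20 &\mapsto \frac{19}{19} = 1
\end{aligned}
```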


## Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform Feature Extraction using PCA. How many principal components would you choose to retain, and why?

The number of principal components (PCs) to retain in a PCA-based feature extraction process depends on several factors, including the specific goals of your analysis and the amount of variance you want to preserve. To determine how many principal components to retain, you can consider the cumulative explained variance and the trade-off between dimensionality reduction and information loss.

Here's a general process to decide how many principal components to retain for feature extraction:

1. **Compute Explained Variance:** After applying PCA to your dataset, you'll have access to the explained variance for each principal component. Explained variance indicates how much of the original variance in the data is captured by each PC.

2. **Plot Explained Variance:** Plot the cumulative explained variance as a function of the number of principal components. A related plot, the "scree plot", shows the variance explained by each individual component instead.

3. **Select a Threshold:** Decide on a threshold for the amount of variance you want to preserve. This threshold could be a specific percentage of the total variance (e.g., 95% or 99%) or a specific number of components (e.g., retain the top k components).

4. **Determine the Number of Components:** Choose the number of principal components that allows you to meet or exceed your chosen threshold. This number is the one you'll retain for feature extraction.

5. **Consider Practicality:** Take practical constraints into account, such as the computational complexity of your analysis and the specific requirements of your downstream tasks. Sometimes, a balance between dimensionality reduction and information loss is necessary.

6. **Validate Your Choice:** You can also perform cross-validation or other model evaluation techniques to assess the impact of dimensionality reduction on the performance of your machine learning models. This can help you confirm that the chosen number of components is appropriate for your specific task.

The choice of how many principal components to retain would depend on your objectives. Here are a few considerations:

- If you want to reduce dimensionality significantly and are primarily interested in capturing the most prominent patterns in the data, you might choose to retain a relatively small number of components (e.g., 2 or 3).
  
- If you want to balance dimensionality reduction with preserving a substantial amount of information, you might aim for a higher percentage of explained variance (e.g., 95% or more).

- The inclusion of the "gender" feature, which is categorical, may impact your choice. You might consider encoding it as binary values (e.g., 0 for male and 1 for female) before applying PCA.

Ultimately, the number of principal components to retain should align with your specific goals and the characteristics of your dataset. It may require experimentation and iterative analysis to find the most suitable balance between dimensionality reduction and information retention for your particular use case.

In [14]:
data = {
    'height': [160, 170, 155, 175, 180],
    'weight': [60, 70, 65, 80, 85],
    'age': [30, 40, 35, 45, 50],
    'gender': [0, 1, 0, 1, 1],  # Example encoding: 0 for male, 1 for female
    'blood_pressure': [120, 130, 110, 140, 150]
}

# Convert the dataset to a DataFrame
df = pd.DataFrame(data)

# Select the features for PCA (the binary 'gender' column is excluded here)
X = df.drop(columns=['gender'])

# Standardize the features (mean=0, variance=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Calculate the cumulative explained variance
explained_variance_ratio = np.cumsum(pca.explained_variance_ratio_)

# Determine the number of components to retain (e.g., retain 95% of variance)
n_components_to_retain = np.argmax(explained_variance_ratio >= 0.95) + 1

# Print the explained variance and the chosen number of components
print("Explained Variance Ratio:")
print(explained_variance_ratio)
print("\nNumber of Components to Retain (95% of Variance):", n_components_to_retain)


Explained Variance Ratio:
[0.95388929 0.99576736 0.99923813 1.        ]

Number of Components to Retain (95% of Variance): 1
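
scikit-learn can also apply the variance threshold itself: passing a float between 0 and 1 as `n_components` asks PCA to keep just enough components to exceed that share of the variance. The sketch below reuses the example values above (without the `gender` column) and compares this shortcut with the manual cumulative-sum rule; the 0.95 threshold is the same assumed example value.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same example values as above: height, weight, age, blood pressure
X = np.column_stack([
    [160, 170, 155, 175, 180],   # height
    [60, 70, 65, 80, 85],        # weight
    [30, 40, 35, 45, 50],        # age
    [120, 130, 110, 140, 150],   # blood pressure
]).astype(float)

X_scaled = StandardScaler().fit_transform(X)

# Manual rule: first component count whose cumulative ratio reaches 95%
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
manual_k = int(np.argmax(cumulative >= 0.95) + 1)

# Shortcut: a float n_components lets PCA pick the count itself
pca_95 = PCA(n_components=0.95).fit(X_scaled)

print(manual_k, pca_95.n_components_)
```

With only five samples, the result mainly illustrates the mechanics; on a real dataset the chosen number should still be checked against downstream model performance.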
