FEATURE ENGINEERING

# 1. What is a parameter?
A parameter is a value that is used to control or influence the behavior of a system, function, or process. The meaning of "parameter" depends on the context:
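# In Mathematics
A parameter is a constant that selects one member of a family of functions, such as m and b in y = mx + b.
# In Programming
A parameter is a variable named in a function definition that receives the argument passed when the function is called.
# In Science and Engineering
A parameter is a measurable quantity that characterizes a system, such as temperature or pressure.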

# 2. What is correlation? What does negative correlation mean?
Correlation is a statistical measure of how two variables move together. The correlation coefficient ranges from -1 to +1:

Positive Correlation (0 to +1): When one variable increases, the other tends to increase too (e.g., height and weight).
Negative Correlation (-1 to 0): When one variable increases, the other tends to decrease (e.g., the more time spent exercising, the lower the body fat percentage).
Zero Correlation (0): No linear relationship between the variables.

A negative correlation means that as one variable goes up, the other goes down. For example, if the price of a product increases and its sales decrease, they have a negative correlation.

# 3. Define Machine Learning. What are the main components in Machine Learning?
Machine Learning is a branch of artificial intelligence (AI) that enables computers to learn patterns from data and make predictions or decisions without being explicitly programmed. It involves training models using data and improving their performance over time.

# 4. How does loss value help in determining whether the model is good or not?
The loss value helps determine how well a machine learning model is performing by measuring the difference between the model’s predictions and the actual target values. It serves as a key indicator of model accuracy and effectiveness.

How Loss Value Helps in Model Evaluation

Lower Loss = Better Model
A low loss value means the model’s predictions are close to the actual values, indicating good performance.
A high loss value means the model is making large errors, suggesting poor performance.

Loss Function Choice Matters
Different ML problems use different loss functions (e.g., Mean Squared Error for regression, Cross-Entropy for classification).
The right choice ensures the model learns effectively.

Training vs. Validation Loss
If both training and validation loss are low, the model is performing well.
If training loss is low but validation loss is high, the model is overfitting (memorizing training data instead of generalizing).
If both training and validation loss are high, the model is underfitting (not learning enough patterns).

Tracking Loss for Model Improvement
Loss is minimized using optimization techniques like gradient descent.
Loss trends over epochs (full passes over the training data) show whether the model is improving or whether adjustments are needed.
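
The training vs. validation comparison above can be sketched with polynomial fits of increasing flexibility. This is a minimal illustration, not a recipe: the noisy sine data, the 30/10 split, and the degrees 1, 3, and 12 are all assumptions chosen for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): noisy samples of a sine curve
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 40)

# Hold out the last 10 points as a validation set
x_train, y_train = x[:30], y[:30]
x_val, y_val = x[30:], y[30:]


def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))


train_loss, val_loss = {}, {}
for degree in (1, 3, 12):
    # Least-squares polynomial fit on the training data only
    model = Polynomial.fit(x_train, y_train, degree)
    train_loss[degree] = mse(y_train, model(x_train))
    val_loss[degree] = mse(y_val, model(x_val))
    print(f"degree={degree:2d}  train={train_loss[degree]:.4f}  val={val_loss[degree]:.4f}")
```

Training loss can only fall as the degree grows, because each polynomial contains the lower-degree ones. A validation loss that stays high or rises while training loss keeps falling is the overfitting signal to watch for.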

# 5. What are continuous and categorical variables?
Continuous Variables
A continuous variable is a numeric variable that can take an infinite number of values within a given range. These values can be measured and often include decimals or fractions.

Examples:
Height (e.g., 5.8 ft, 6.2 ft)

Categorical Variables
A categorical variable represents distinct groups or categories. These values do not have a numerical meaning and cannot be measured on a continuous scale.

Types of Categorical Variables:
Nominal: No inherent order (e.g., colors: Red, Blue, Green; Gender: Male, Female)
Ordinal: Categories with a meaningful order (e.g., ratings: Low, Medium, High)

# 6. How do we handle categorical variables in Machine Learning? What are the common techniques?
Since most machine learning models work with numerical data, categorical variables need to be converted into numerical format before training. Below are common techniques used:
 
One-Hot Encoding (OHE)
Converts each category into a binary (0/1) vector.
Works well for nominal variables (unordered categories).
Can lead to high-dimensionality if there are too many unique categories.

Label Encoding
Assigns a unique integer to each category.
Works well for target labels and for ordinal variables, provided the assigned integers follow the real category order.
Can introduce misleading numerical relationships for nominal categories.

Ordinal Encoding
Similar to label encoding but used specifically for ordinal data where category order matters.

Frequency Encoding
Replaces each category with the frequency of its occurrence in the dataset.
Keeps the feature count low and captures how common each category is, but categories with equal frequency become indistinguishable.
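
The techniques above can be sketched with pandas and scikit-learn. The toy DataFrame, its column names, and the Small < Medium < Large ordering are hypothetical values chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data for illustration
df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue", "Red", "Blue"],
    "size": ["Small", "Large", "Medium", "Small", "Large", "Medium"],
})

# One-hot encoding: one 0/1 column per color
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Ordinal encoding: pass the category order explicitly
size_encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
df["size_encoded"] = size_encoder.fit_transform(df[["size"]]).ravel()

# Frequency encoding: replace each color with its relative frequency
df["color_freq"] = df["color"].map(df["color"].value_counts(normalize=True))

print(one_hot)
print(df)
```

Passing the order to OrdinalEncoder explicitly avoids relying on alphabetical ordering, which would place Large before Small.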

# 7. What do you mean by training and testing a dataset?
In Machine Learning, a dataset is typically split into two parts: training data and testing data to evaluate how well a model performs on unseen data.

1. Training Dataset
The training dataset is the portion of the data used to train the model.
The model learns patterns, relationships, and features from this data.
The process involves adjusting the model’s parameters using techniques like gradient descent to minimize errors.
🔹 Example: If you’re building a spam email classifier, the training dataset will contain labeled emails (Spam or Not Spam) that the model uses to learn.

2. Testing Dataset
The testing dataset is the portion of the data used to evaluate the model's performance after training.
This helps check if the model can generalize to new, unseen data.
The model does not learn from the test data; it is used only for evaluation.
🔹 Example: After training the spam classifier, we test it on new emails (not seen during training) to check how well it predicts spam.

# 8. What is sklearn.preprocessing?
sklearn.preprocessing is a module in Scikit-Learn that provides tools for transforming raw data into a suitable format for machine learning models. It helps improve model performance by scaling, normalizing, encoding, and transforming data.

# 9. What is a Test set?
A test set is a portion of a dataset that is used to evaluate the performance of a trained machine learning model. It helps determine how well the model can generalize to new, unseen data.

# 10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?

1. Approaching a Machine Learning Problem (the train/test split is Step 4)
Step 1: Define the Problem
Step 2: Collect & Explore Data
Step 3: Preprocess the Data
Step 4: Split Data into Training & Testing Sets
Step 5: Choose a Model
Step 6: Train the Model
Step 7: Evaluate the Model
Step 8: Tune Hyperparameters
Step 9: Deploy the Model


In [2]:
# 10
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Step 1: Load and Explore Data
data = {'Study_Hours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Exam_Score': [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]}
df = pd.DataFrame(data)
print(df.describe())  # Check basic statistics

# Step 2: Split Data into Training and Testing Sets
X = df[['Study_Hours']]  # Features
y = df['Exam_Score']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Data Preprocessing (Scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4: Train the Model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Step 5: Make Predictions
y_pred = model.predict(X_test_scaled)

# Step 6: Evaluate the Model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Step 7: Predict for a New Value
new_data = pd.DataFrame({'Study_Hours': [7]})  # Example: 7 hours of study; same column name as the training data, which avoids a feature-name warning
new_data_scaled = scaler.transform(new_data)
predicted_score = model.predict(new_data_scaled)
print(f"Predicted Exam Score for 7 hours of study: {predicted_score[0]}")

       Study_Hours  Exam_Score
count     10.00000   10.000000
mean       5.50000   72.500000
std        3.02765   15.138252
min        1.00000   50.000000
25%        3.25000   61.250000
50%        5.50000   72.500000
75%        7.75000   83.750000
max       10.00000   95.000000
Mean Squared Error: 0.0
Predicted Exam Score for 7 hours of study: 80.0




# 11. Why do we have to perform EDA before fitting a model to the data?
Exploratory Data Analysis (EDA) is a crucial step in machine learning that helps understand the dataset before applying any model. Skipping EDA can lead to poor model performance, biases, and misleading results.
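
A minimal sketch of a first EDA pass with pandas is shown below. The dataset is hypothetical: it reuses the Study_Hours and Exam_Score column names but adds a missing value and an outlier on purpose, and the 1.5 * IQR rule is one common convention rather than the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with a missing value and an outlier
df = pd.DataFrame({
    "Study_Hours": [1, 2, 3, 4, 5, 6, 7, 40],
    "Exam_Score": [50, 55, np.nan, 65, 70, 75, 80, 85],
})

print(df.dtypes)        # column types
print(df.isna().sum())  # missing values per column
print(df.describe())    # summary statistics

# Flag outliers with the 1.5 * IQR rule
q1, q3 = df["Study_Hours"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["Study_Hours"] < q1 - 1.5 * iqr) | (df["Study_Hours"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)
```

Checks like these surface missing values and extreme points before they distort a fitted model.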

# 12. What is correlation?
Correlation measures the relationship between two variables and indicates how one variable changes in relation to another. It helps identify patterns in data and is crucial in feature selection for machine learning.

# 13. What does negative correlation mean?
Negative correlation means that when one variable increases, the other variable decreases (and vice versa). It indicates an inverse relationship between two variables.

# 14. How can you find correlation between variables in Python?
In Python, we can compute the correlation between variables with Pandas and visualize it with a Seaborn heatmap. The most common method is Pearson’s correlation coefficient, but there are also rank-based methods such as Spearman and Kendall.

In [1]:
#14 
import pandas as pd

# Sample dataset
data = {
    'Study_Hours': [1, 2, 3, 4, 5, 6, 7, 8],
    'Exam_Score': [50, 55, 60, 70, 75, 80, 85, 90],
    'TV_Hours': [8, 7, 6, 5, 4, 3, 2, 1]
}

df = pd.DataFrame(data)

# Compute correlation matrix
correlation_matrix = df.corr()
print(correlation_matrix)

             Study_Hours  Exam_Score  TV_Hours
Study_Hours      1.00000     0.99544  -1.00000
Exam_Score       0.99544     1.00000  -0.99544
TV_Hours        -1.00000    -0.99544   1.00000
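
To compare methods, the sketch below uses the `method` argument of `DataFrame.corr`, which accepts "pearson", "spearman", or "kendall". The x and y values are hypothetical, picked so the relationship is monotonic but not linear.

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
df["y"] = df["x"] ** 3

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")   # below 1: the relationship is not linear
print(f"Spearman: {spearman:.3f}")  # rank-based, so a monotonic relationship scores 1
```

Spearman works on ranks, so it is often the safer choice when the relationship is monotonic but curved.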


# 15. What is causation? Explain difference between correlation and causation with an example.
What is Causation?
Causation means that one event directly causes another event to happen. If variable A changes and this directly leads to a change in variable B, we say there is a cause-and-effect relationship between them.
For example, if you increase the number of hours you study, your exam score improves. Here, studying more directly causes better performance.

What is Correlation?
Correlation means that two variables are related, but one does not necessarily cause the other to change. They may move together due to coincidence or because of another hidden factor (confounding variable).
For example, ice cream sales and drowning incidents both increase in the summer. This does not mean that eating ice cream causes drowning. Instead, hot weather is the real reason why both happen at the same time.

Key Differences Between Correlation and Causation
Correlation shows that two variables change together but does not prove one causes the other.
Causation means that a change in one variable directly leads to a change in another.
Correlation can be due to a third hidden factor (confounder), while causation requires solid proof.
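
The ice cream example can be simulated to show how a confounder creates correlation without causation. Everything here is synthetic and assumed for illustration: the temperature range, the coefficients, and the noise levels are made up, and the residual-based partial correlation is just one simple way to control for a third variable.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Synthetic data (assumed for illustration): temperature drives both variables
temperature = rng.uniform(10, 35, n)
ice_cream_sales = 5.0 * temperature + rng.normal(0, 10, n)
drownings = 0.3 * temperature + rng.normal(0, 1, n)

# Raw correlation is strong even though neither variable affects the other
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]


def residuals(target, predictor):
    """Remove the linear effect of predictor from target."""
    slope, intercept = np.polyfit(predictor, target, 1)
    return target - (slope * predictor + intercept)


# Partial correlation: correlate what is left after controlling for temperature
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"Raw correlation:     {raw_corr:.2f}")
print(f"Partial correlation: {partial_corr:.2f}")
```

Once temperature is accounted for, the remaining correlation should be close to zero, which is what one would expect if hot weather explains both trends.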

# 16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
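An optimizer is an algorithm that updates a model's parameters (weights and biases) during training in order to minimize the loss function.

Types of Optimizers in Machine Learning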
1. Gradient Descent (GD)
Gradient Descent is the most basic optimization algorithm. It updates model parameters by moving in the direction of the negative gradient of the loss function.

Types of Gradient Descent:
Batch Gradient Descent (BGD) – Updates weights after computing the gradient on the entire dataset.
Stochastic Gradient Descent (SGD) – Updates weights after each training example, making it faster but noisier.
Mini-Batch Gradient Descent – Updates weights after a small batch of training samples, balancing speed and stability.
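
The three variants differ only in how many samples feed each update, which the NumPy sketch below makes explicit. The function name `fit_line`, the learning rate, the epoch count, and the noise-free data are assumptions for illustration, not part of any library.

```python
import numpy as np


def fit_line(x, y, batch_size, lr=0.1, epochs=1000, seed=0):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = (w * x[idx] + b) - y[idx]
            # Gradients of the batch MSE with respect to w and b
            grad_w = 2.0 * np.mean(error * x[idx])
            grad_b = 2.0 * np.mean(error)
            # Step against the gradient
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Hypothetical noise-free data on the line y = 2x + 1
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0

for name, batch_size in [("batch", len(x)), ("stochastic", 1), ("mini-batch", 5)]:
    w, b = fit_line(x, y, batch_size)
    print(f"{name:>10}: w={w:.3f}, b={b:.3f}")
```

A batch size equal to the dataset gives batch gradient descent, a batch size of 1 gives SGD, and anything in between is mini-batch.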

Example of SGD in Python:
from tensorflow.keras.optimizers import SGD
optimizer = SGD(learning_rate=0.01)

2. Momentum Optimizer
Momentum adds a velocity term to the gradient, helping the model accelerate in the right direction and avoid getting stuck in local minima.

Example:
from tensorflow.keras.optimizers import SGD

optimizer = SGD(learning_rate=0.01, momentum=0.9)
✔ Helps with faster convergence and reduces oscillations.

3. Adagrad (Adaptive Gradient Algorithm)
Adagrad adapts the learning rate for each parameter: parameters that receive frequent, large gradients get smaller updates, while rarely updated parameters get larger ones. It is useful for sparse data.

Example:
from tensorflow.keras.optimizers import Adagrad

optimizer = Adagrad(learning_rate=0.01)
✔ Works well in text processing (NLP) but can slow down over time.

4. RMSprop (Root Mean Square Propagation)
RMSprop adjusts the learning rate using a moving average of past squared gradients, preventing it from becoming too small.

Example:
from tensorflow.keras.optimizers import RMSprop

optimizer = RMSprop(learning_rate=0.01)
✔ Works well for recurrent neural networks (RNNs) and prevents the learning rate from decaying too much.

5. Adam (Adaptive Moment Estimation)
Adam combines Momentum and RMSprop, adjusting the learning rate for each parameter based on past gradients.

Example:
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=0.001)
✔ Most popular optimizer for deep learning due to its fast convergence and stability.

6. AdamW (Adam with Weight Decay)
AdamW is a modified version of Adam that applies weight decay separately from the gradient update, which helps reduce overfitting.

Example:
from tensorflow.keras.optimizers import AdamW
optimizer = AdamW(learning_rate=0.001, weight_decay=0.01)

# 17. What is sklearn.linear_model ?
sklearn.linear_model is a module in Scikit-Learn that provides various linear models for regression and classification tasks. These models assume a linear relationship between input features and the target variable.

# 18. What does model.fit() do? What arguments must be given?
What does model.fit() do?
The .fit() method trains a machine learning model by adjusting its internal parameters (weights & biases) using the provided training data. It learns patterns from input features (X) and their corresponding target values (y) to make predictions on new data.

Arguments for model.fit()
The required arguments depend on the type of model used. In general:

1. For Supervised Learning (Regression & Classification)
model.fit(X, y)
X → Feature matrix (independent variables)

y → Target variable (dependent variable)

✅ Used in models like Linear Regression, Logistic Regression, Decision Trees, etc.

Example (Linear Regression):
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4], [5]]  # Features
y = [10, 20, 30, 40, 50]       # Target

model = LinearRegression()
model.fit(X, y)  # Model learns the relationship between X and y

2. For Unsupervised Learning (Clustering, Dimensionality Reduction)
model.fit(X)
X → Only features are needed (no y) since there are no predefined labels.
✅ Used in models like K-Means, PCA, DBSCAN, etc.

Example (K-Means Clustering):
from sklearn.cluster import KMeans

X = [[1, 2], [3, 4], [5, 6], [7, 8]]  # Only features

model = KMeans(n_clusters=2)
model.fit(X)  # Finds cluster centers

# 19. What does model.predict() do? What arguments must be given?
What does model.predict() do?
The .predict() method makes predictions using a trained machine learning model. After the model has learned patterns from training data using .fit(), .predict() is used to estimate outputs for new, unseen data.

Arguments for model.predict()
The required argument is:
X_new → New feature values to predict on.
Must have the same number of features as the training data.
Shape: (num_samples, num_features).
predictions = model.predict(X_new)

Example 1: Predicting with Linear Regression


In [4]:
# 19
from sklearn.linear_model import LinearRegression

# Training data
X_train = [[1], [2], [3], [4], [5]]
y_train = [10, 20, 30, 40, 50]

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict for new values
X_new = [[6], [7], [8]]  # New feature values
predictions = model.predict(X_new)

print(predictions)  # Output: [60, 70, 80] (based on learned relationship)

[60. 70. 80.]


# 20.  What are continuous and categorical variables?
Continuous Variables
🔹 Definition: Continuous variables are numerical values that can take any value within a range. They are measured rather than counted and may include decimal values.
🔹 Examples:
Height (e.g., 170.5 cm)

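Categorical Variables
🔹 Definition: Categorical variables represent groups or categories. They do not have a numerical meaning but rather label different groups.
🔹 Types of Categorical Variables:
Nominal: No inherent order (e.g., colors: Red, Blue, Green)
Ordinal: Categories with a meaningful order (e.g., ratings: Low, Medium, High)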

# 21. What is feature scaling? How does it help in Machine Learning?
What is Feature Scaling?
Feature scaling is the process of normalizing or standardizing numerical features so that they have the same scale. This prevents certain features from dominating others due to their larger range of values.
Why is Feature Scaling Important?
✔ Prevents Bias in Models → Large values can dominate smaller ones.
✔ Speeds Up Training → Helps gradient descent converge faster.
✔ Improves Model Accuracy → Works better with distance-based algorithms.
✔ Required for Some ML Algorithms → Especially models that are sensitive to feature magnitude, such as k-nearest neighbors, support vector machines, and models trained with gradient descent.

# 22. How do we perform scaling in Python?
Feature scaling can be done using scikit-learn’s preprocessing module. Below are the three main techniques:


In [None]:
# 22
#  Min-Max Scaling (Normalization)
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Sample data
data = np.array([[100], [200], [300], [400], [500]])

# Apply Min-Max Scaling
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)

print(scaled_data)  # Values scaled between 0 and 1

[[0.  ]
 [0.25]
 [0.5 ]
 [0.75]
 [1.  ]]


In [6]:
#22
# Standardization (Z-Score Scaling)
from sklearn.preprocessing import StandardScaler

# Apply Standard Scaling
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print(scaled_data)  # Mean = 0, Standard Deviation = 1



[[-1.41421356]
 [-0.70710678]
 [ 0.        ]
 [ 0.70710678]
 [ 1.41421356]]


In [7]:
#22
#Robust Scaling (Resistant to Outliers)
from sklearn.preprocessing import RobustScaler

# Apply Robust Scaling
scaler = RobustScaler()
scaled_data = scaler.fit_transform(data)

print(scaled_data)  # Less affected by outliers


[[-1. ]
 [-0.5]
 [ 0. ]
 [ 0.5]
 [ 1. ]]
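
As a cross-check, the three scalers can be reproduced by hand with NumPy on the same five values. This sketch assumes each scaler's default settings: StandardScaler divides by the population standard deviation, and RobustScaler centers on the median and divides by the 25th to 75th percentile range.

```python
import numpy as np

data = np.array([100.0, 200.0, 300.0, 400.0, 500.0])

# Min-max scaling: (x - min) / (max - min)
min_max = (data - data.min()) / (data.max() - data.min())

# Standardization: (x - mean) / std, using the population standard deviation
z_score = (data - data.mean()) / data.std()

# Robust scaling: (x - median) / IQR
q1, q3 = np.percentile(data, [25, 75])
robust = (data - np.median(data)) / (q3 - q1)

print(min_max)
print(z_score)
print(robust)
```

The printed values should match the scikit-learn outputs above, which makes the formula behind each scaler easier to see.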


# 23. What is sklearn.preprocessing?
sklearn.preprocessing in Machine Learning
The sklearn.preprocessing module in scikit-learn provides various functions to transform raw data into a suitable format for machine learning models.

# 24. How do we split data for model fitting (training and testing) in Python?
In Machine Learning, we split data into training and testing sets to evaluate model performance.
Using train_test_split() in Scikit-Learn
The train_test_split() function from sklearn.model_selection is commonly used to randomly split data into training and testing sets.
train_test_split(X, y, test_size=0.2, random_state=42)



In [8]:
# 24
from sklearn.model_selection import train_test_split
import numpy as np

# Sample Data (X: Features, y: Target)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])  # Features
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])  # Target labels

# Split data (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display Results
print("X_train:", X_train.ravel())  
print("X_test:", X_test.ravel())    
print("y_train:", y_train)  
print("y_test:", y_test)  

X_train: [ 6  1  8  3 10  5  4  7]
X_test: [9 2]
y_train: [1 0 1 0 1 0 1 0]
y_test: [0 1]
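
A plain random split does not guarantee that both classes appear in the test set in the same proportion as in the full data. One option, sketched here on the same toy arrays, is the `stratify` parameter of `train_test_split`; whether stratification is appropriate depends on the problem.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("y_train counts:", np.bincount(y_train))
print("y_test counts:", np.bincount(y_test))
```

With five samples per class and a 20% test size, each split keeps the 50/50 class balance.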


# 25. Explain data encoding
Data encoding is the process of converting categorical variables into numerical values so that machine learning models can process them.