**Feature Engineering**

Question 1: What is a parameter?

Answer: In feature engineering, a parameter typically refers to a configurable value or setting used when transforming raw data into features suitable for machine learning models (for example, the number of bins used to discretize a column). These parameters influence how features are created or processed, but unlike model parameters such as weights, they are chosen by the practitioner rather than learned from the data.

Question 2: What is correlation? What does negative correlation mean?

Answer: Correlation refers to the statistical relationship between two features. It is usually measured with a coefficient between -1 and +1 to understand how one feature varies with another, particularly in relation to the target variable.

A negative correlation means that as one variable increases, the other decreases — they move in opposite directions.

Question 3: Define Machine Learning. What are the main components in Machine Learning?

Answer: Machine Learning is a branch of artificial intelligence (AI) that enables computers to learn from data and make decisions or predictions without being explicitly programmed for every scenario.

> Main Components of Machine Learning: To build and use machine learning systems effectively, several core components are involved:

1. Data
- The foundation of machine learning.
- Comes in the form of:
    - Features: Input variables (e.g., age, temperature, text).
    - Labels/Targets: Desired output (used in supervised learning).
- Types: Structured (tables), unstructured (images, audio, text).

2. Model
- A mathematical representation that maps inputs (features) to outputs (predictions).
- Examples:
    - Linear regression, decision trees, neural networks, etc.

3. Algorithm
- The procedure used to train the model using data.
- For most models, it adjusts model parameters to minimize error.
- Examples:
    - Gradient descent (for optimization),
    - k-Nearest Neighbors (for classification),
    - ID3 (for decision trees).

4. Training
- The process of feeding data into a model so it can learn patterns.
- Involves:
    - Optimizing weights/parameters.
    - Minimizing a loss function.

5. Evaluation
- Measuring the model’s performance using metrics on unseen (test or validation) data.
- Common metrics:
    - Accuracy, Precision, Recall, F1-score (classification)
    - MSE, RMSE, MAE (regression)

6. Prediction/Inference
- Once trained, the model is used to make predictions or decisions on new, unseen data.

7. Feedback Loop (Optional)
- In some systems (like online recommendation engines), model predictions are updated continuously using new data (a concept in online learning).

Question 4: How does loss value help in determining whether the model is good or not?

Answer: The loss value is a quantitative measure of how far a machine learning model’s predictions are from the actual values (ground truth). It plays a central role in both training, where the optimizer tries to minimize it, and evaluation. A lower loss typically means a better model, but it must be interpreted in context: compare training loss with validation loss to detect overfitting, and check it alongside other metrics like accuracy or F1-score.
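As a minimal sketch of the idea, the snippet below computes mean squared error (one common regression loss) for two sets of predictions. The helper name `mse` and the numbers are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]        # ground truth (toy values)
close_preds = [2.5, 5.0, 7.5]   # predictions with small errors
far_preds = [0.0, 9.0, 3.0]     # predictions with large errors

loss_close = mse(y_true, close_preds)
loss_far = mse(y_true, far_preds)

# The predictions that are closer to the truth get the lower loss.
print("Loss (close predictions):", loss_close)
print("Loss (far predictions):", loss_far)
```

The absolute value of a loss depends on the loss function and the scale of the target, so it is most useful for comparing models on the same data.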

Question 5: What are continuous and categorical variables?

Answer:
1. Continuous Variables: A continuous variable is a numerical variable that can take any value within a range — including decimals.

    ✅ Characteristics:
    - Infinite possible values within a range.
    - Values are measurable (not counted).
    - You can perform arithmetic operations (add, subtract, average, etc.).

2. Categorical Variables: A categorical variable (also called a qualitative variable) represents categories or groups. These values are labels and not numerical (even if they use numbers).

    ✅ Characteristics:
    - Values fall into a fixed number of groups.
    - Usually not suitable for arithmetic operations.
    - Can be ordinal (ordered) or nominal (unordered)
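A small sketch of how these variable types can be represented in pandas. The column names and values are hypothetical, chosen only to show one continuous, one nominal, and one ordinal column.

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [172.4, 165.0, 180.9],            # continuous
    "color": ["red", "blue", "red"],               # nominal (no order)
    "shirt_size": ["small", "large", "medium"],    # ordinal (ordered)
})

# Mark the nominal column as an unordered category.
df["color"] = df["color"].astype("category")

# Mark the ordinal column as an ordered category with an explicit order.
df["shirt_size"] = pd.Categorical(
    df["shirt_size"], categories=["small", "medium", "large"], ordered=True
)

numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="category").columns.tolist()

print("Continuous columns:", numeric_cols)
print("Categorical columns:", categorical_cols)

# Ordered categories support comparisons such as min/max; unordered ones do not.
smallest = df["shirt_size"].min()
print("Smallest size:", smallest)
```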

Question 6: How do we handle categorical variables in Machine Learning? What are the common techniques?

Answer: Most machine learning models require numerical input, but categorical variables are non-numeric labels. So, before feeding them into a model, you need to convert them into numeric form using encoding techniques.

Common Techniques to Handle Categorical Variables:
1. Label Encoding: Assigns a unique integer to each category.
2. One-Hot Encoding: Creates binary columns for each category.
3. Ordinal Encoding: Similar to label encoding, but used when category order matters.
4. Binary Encoding: Combines label encoding and binary representation.
5. Target Encoding (Mean Encoding): Replaces categories with the mean of the target variable for that category.
6. Frequency / Count Encoding: Replaces each category with its frequency or count in the dataset.
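Three of the techniques above can be sketched as follows. This assumes a toy DataFrame with hypothetical `color` and `size` columns. It uses `pandas.get_dummies` for one-hot encoding, scikit-learn's `OrdinalEncoder` for ordinal encoding, and a `value_counts` lookup for count encoding.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "color": ["red", "blue", "green", "red"],
    "size": ["small", "large", "medium", "small"],
})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: integers that follow an explicit category order.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()

# Count encoding: replace each category with how often it appears.
df["color_count"] = df["color"].map(df["color"].value_counts())

print(one_hot)
print(df)
```

One-hot encoding suits nominal categories with few distinct values. For ordered categories, passing the order explicitly avoids relying on alphabetical sorting.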

Question 7: What do you mean by training and testing a dataset?

Answer: In machine learning, the dataset is typically split into separate parts to build and evaluate a model:
1. Training Dataset: The training dataset is the portion of data used to train the machine learning model — that is, to teach the model the patterns in the data.

2. Testing Dataset: The testing dataset is a separate portion of data used after training to evaluate how well the model performs on unseen data.

Question 8: What is sklearn.preprocessing?

Answer: sklearn.preprocessing is a module in Scikit-learn (sklearn) that provides tools for data preprocessing and feature transformation. These tools help prepare your data before feeding it into a machine learning model.

Question 9: What is a Test set?

Answer: A test set is a subset of data used to evaluate the performance of a machine learning model after it has been trained.

Question 10: How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?

Answer: The most common way is to use train_test_split from scikit-learn.

      from sklearn.model_selection import train_test_split
      from sklearn.datasets import load_iris
      import pandas as pd

      # Load the Iris dataset into a feature DataFrame and a target Series
      data = load_iris()
      X = pd.DataFrame(data.data, columns=data.feature_names)
      y = pd.Series(data.target)

      # Hold out 20% of the rows for testing; random_state makes the split reproducible
      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.2, random_state=42
      )

      print("Training set size:", X_train.shape)
      print("Test set size:", X_test.shape)

Output:

      Training set size: (120, 4)
      Test set size: (30, 4)


✅ General Approach to a Machine Learning Problem: Here's a high-level workflow I usually follow:

Step 1: Understand the problem
- What is the task? Classification, regression, clustering?
- What are the evaluation metrics? Accuracy, F1-score, RMSE, etc.?

Step 2: Collect and explore the data
- Load the dataset.
- Analyze features, check for missing values, data types.
- Visualize distributions and relationships.

Step 3: Preprocess the data
- Handle missing values.
- Encode categorical variables.
- Normalize or scale features if needed.

Step 4: Split the data
- Divide into training and testing sets.
- Sometimes also create a validation set or use cross-validation.

Step 5: Select and train models
- Choose algorithms appropriate for the problem.
- Train models on the training set.

Step 6: Evaluate the models
- Use the test set (or validation set) to check model performance.
- Tune hyperparameters if needed.

Step 7: Interpret and validate
- Understand the model’s decisions.
- Check for biases or errors.

Step 8: Deploy or present results
- Depending on your goal, deploy the model or communicate findings.
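Steps 3 to 6 of the workflow above can be sketched on the Iris dataset as follows. The choice of `StandardScaler` plus `LogisticRegression` is only an illustrative assumption, not a recommendation for every problem.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 2: load the data
X, y = load_iris(return_X_y=True)

# Step 4: split, keeping class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 3 and 5: preprocessing and model combined in one pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Step 6a: cross-validation on the training set only
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("Mean CV accuracy:", cv_scores.mean())

# Step 6b: final check on the held-out test set
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print("Test accuracy:", test_accuracy)
```

Wrapping the scaler and the model in a pipeline means the scaler is refit on each cross-validation training fold, which keeps validation data out of the preprocessing step.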

Question 11: Why do we have to perform EDA before fitting a model to the data?

Answer: Exploratory Data Analysis (EDA) is a crucial step before fitting a machine learning model because it helps you understand your data deeply. Here’s why it’s important:

1. Understand the Data Distribution
    - EDA helps you see how your features and target variables are distributed.
    - You can spot skewness, outliers, or unusual patterns that may affect model performance.

2. Identify Missing or Erroneous Data
    - Missing values or incorrect entries can cause errors or bias in your model.
    - EDA lets you detect these issues early so you can decide how to handle them (impute, drop, etc.).

3. Detect Relationships and Correlations
    - Understanding how features relate to each other and the target can guide feature selection.
    - Strongly correlated or redundant features might be dropped or combined.

4. Select Appropriate Features
    - EDA helps to spot which variables might be more important.
    - You can engineer new features or transform existing ones based on insights.

5. Choose the Right Model and Preprocessing
    - Some algorithms work better with certain data distributions or types.
    - For example, if a feature is heavily skewed, you might apply a log or power transformation before training.

6. Prevent Garbage In, Garbage Out
    - If your data has problems that you don’t address, the model’s predictions won’t be reliable.
    - EDA ensures you feed clean, meaningful data to your model.
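A few typical first EDA checks, sketched with pandas on the Iris dataset. Which checks matter will depend on the data at hand, so treat this as a starting point.

```python
from sklearn.datasets import load_iris

# Load Iris as a DataFrame that contains the features plus a 'target' column.
df = load_iris(as_frame=True).frame

print(df.shape)    # number of rows and columns
print(df.dtypes)   # data type of each column

# Summary statistics (mean, std, quartiles) for each numeric column.
summary = df.describe()
print(summary)

# Missing values per column.
missing = df.isna().sum()
print(missing)

# Skewness of each feature; values far from 0 suggest a skewed distribution.
skewness = df.drop(columns="target").skew()
print(skewness)

# Class balance of the target.
class_counts = df["target"].value_counts()
print(class_counts)
```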

Question 12: What is correlation?

Answer: Answer already given in question 2

Question 13: What does negative correlation mean?

Answer: Answer already given in question 2

Question 14: How can you find correlation between variables in Python?

Answer: Finding correlations between variables in Python is straightforward with libraries like Pandas and Seaborn. Here are some common ways:
1. Using Pandas .corr(): If you have a DataFrame, calling .corr() on it returns the correlation matrix (Pearson by default).
2. Correlation with a target variable: Select the target's column from the correlation matrix to see how each feature correlates with the target.
3. Visualizing correlation with a heatmap: Pass the correlation matrix to Seaborn's heatmap() to make strong relationships easier to spot.
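The three approaches can be sketched as follows. The DataFrame holds made-up toy numbers with hypothetical column names. The heatmap lines are left as comments because they need Seaborn and Matplotlib installed.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "hours_gaming": [5, 4, 3, 2, 1],
    "exam_score": [52, 58, 65, 70, 78],
})

# 1. Full correlation matrix (Pearson by default).
corr_matrix = df.corr()
print(corr_matrix)

# 2. Correlation of each feature with the target, sorted ascending.
target_corr = corr_matrix["exam_score"].drop("exam_score").sort_values()
print(target_corr)

# 3. Optional heatmap (requires seaborn and matplotlib):
# import seaborn as sns
# import matplotlib.pyplot as plt
# sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
# plt.show()
```

Here `hours_gaming` falls as `hours_studied` rises, so their correlation is negative, while `hours_studied` and `exam_score` are positively correlated.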

Question 15: What is causation? Explain difference between correlation and causation with an example.

Answer: Causation means that one event directly causes another. In other words, a change in variable A produces a change in variable B.

Difference between Correlation and Causation

- Correlation: It is a statistical relationship or association between two variables. For example:

    - Ice cream sales and drowning incidents both increase during the summer. They’re correlated (both go up), but ice cream sales don’t cause drownings. Instead, a lurking variable (hot weather) causes both.


- Causation: One variable directly affects or causes changes in another. For example:

    - Smoking causes an increase in the risk of lung cancer. This is supported by experiments and studies showing a direct causal link.
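The ice cream example can be illustrated with a small simulation. All numbers below are synthetic and the coefficients are arbitrary assumptions. Temperature drives both variables, and once its linear effect is removed from each, the remaining correlation is close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Lurking variable: temperature drives both quantities (synthetic data).
temperature = rng.uniform(0, 35, size=n)
ice_cream_sales = 10 * temperature + rng.normal(0, 20, size=n)
drownings = 0.2 * temperature + rng.normal(0, 1, size=n)

# The two variables are strongly correlated...
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(y, x):
    """Remove the linear effect of x from y using a least-squares line."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# ...but after controlling for temperature, little correlation remains.
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print("Correlation:", round(raw_corr, 2))
print("Correlation after controlling for temperature:", round(partial_corr, 2))
```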

Question 16: What is an Optimizer? What are different types of optimizers? Explain each with an example.

Answer: An optimizer is an algorithm or method used to adjust the parameters (weights) of a model to minimize the loss function during training.

- The loss function measures how well the model is doing.
- The optimizer updates the model’s parameters step-by-step to reduce this loss.
- It helps the model “learn” from data by tweaking weights to improve predictions.

Common Types of Optimizers
1. Gradient Descent (GD)
- The simplest optimizer.
- Calculates the gradient of the loss function with respect to each parameter using the entire training dataset.
- Updates weights in the opposite direction of the gradient to minimize loss.
- Example:
        
        weights = weights - learning_rate * gradient
    - Pros: Simple, effective.
    - Cons: Can be slow for large datasets since it uses all data every step.
2. Stochastic Gradient Descent (SGD)
- Instead of using the whole dataset, updates weights using a single training example at a time.
- Makes updates more frequent but noisier.
- Example:
      
      for each training example:    
          weights = weights - learning_rate * gradient(example)

    - Pros: Faster updates, can escape shallow local minima.
    - Cons: Noisier updates, less stable convergence.
3. Mini-batch Gradient Descent
- A middle ground: updates weights using small batches (subsets) of data.
- Balances noise and speed.
4. Momentum
- Enhances SGD by adding a “momentum” term that accumulates past gradients to smooth updates.
- Helps accelerate learning and avoid oscillations.
- Update rule:

      velocity = momentum * velocity - learning_rate * gradient
      weights = weights + velocity
5. Adam (Adaptive Moment Estimation)
- One of the most popular optimizers today.
- Combines ideas from Momentum and RMSProp.
- Keeps track of an exponentially decaying average of past gradients and squared gradients.
- Adapts learning rates for each parameter individually.
- Example usage in Keras:

      from tensorflow.keras.optimizers import Adam

      # Assumes `model` is an already-built Keras model (e.g. a Sequential network)
      optimizer = Adam(learning_rate=0.001)
      model.compile(optimizer=optimizer, loss='categorical_crossentropy')
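The first four optimizers differ mainly in how much data each update uses and whether a velocity term is kept. The sketch below fits a one-feature linear regression on synthetic, noise-free data with NumPy. The function names, learning rate, and epoch count are illustrative assumptions, and Adam is not covered here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 0.5  # synthetic target: true weight 3.0, true bias 0.5

def gradient(w, b, X_batch, y_batch):
    """Gradient of mean squared error for predictions X_batch @ w + b."""
    error = X_batch @ w + b - y_batch
    grad_w = 2 * X_batch.T @ error / len(y_batch)
    grad_b = 2 * error.mean()
    return grad_w, grad_b

def train(batch_size, momentum=0.0, learning_rate=0.1, epochs=200):
    """batch_size=len(X) is batch GD, 1 is SGD, anything between is mini-batch."""
    w, b = np.zeros(1), 0.0
    v_w, v_b = np.zeros(1), 0.0
    for _ in range(epochs):
        order = rng.permutation(len(X))  # shuffle the data once per epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            grad_w, grad_b = gradient(w, b, X[idx], y[idx])
            # With momentum=0 this reduces to the plain gradient step.
            v_w = momentum * v_w - learning_rate * grad_w
            v_b = momentum * v_b - learning_rate * grad_b
            w, b = w + v_w, b + v_b
    return float(w[0]), float(b)

gd = train(batch_size=len(X))                     # Gradient Descent
sgd = train(batch_size=1)                         # Stochastic Gradient Descent
mini_batch = train(batch_size=32)                 # Mini-batch Gradient Descent
with_momentum = train(batch_size=32, momentum=0.9)  # Mini-batch with Momentum

for name, (w, b) in [("GD", gd), ("SGD", sgd),
                     ("Mini-batch", mini_batch), ("Momentum", with_momentum)]:
    print(f"{name}: w={w:.3f}, b={b:.3f}")
```

Because the data is noise-free, all four variants should end up near the true weight and bias. On real, noisy data the SGD and mini-batch estimates would fluctuate around the optimum.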

Question 17: What is sklearn.linear_model?

Answer: sklearn.linear_model is a module in the Scikit-learn library that provides various linear models for regression and classification tasks in machine learning.

✅ What is Scikit-learn?
- Scikit-learn (sklearn) is one of the most widely used machine learning libraries in Python.
- It provides simple and efficient tools for data analysis and modeling.
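As a short sketch, here are two estimators from `sklearn.linear_model`: `LinearRegression` for a regression task and `LogisticRegression` for a classification task. The data is a made-up toy example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

# Regression: the target is an exact linear function of X (toy data).
y_reg = 2.0 * X[:, 0] + 1.0
reg = LinearRegression().fit(X, y_reg)
print("Slope:", reg.coef_[0], "Intercept:", reg.intercept_)

# Classification: small X values belong to class 0, large ones to class 1.
y_cls = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y_cls)
cls_predictions = clf.predict([[0.5], [4.5]])
print("Predicted classes:", cls_predictions)
```

The module also contains regularized variants such as `Ridge` and `Lasso`, which follow the same fit/predict interface.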

Question 18: What does model.fit() do? What arguments must be given?

Answer: The .fit() method in scikit-learn is used to train (fit) a machine learning model on your data. For basic supervised scikit-learn models, only X and y are typically required; unsupervised estimators and transformers (such as scalers) need only X.

Required Arguments

      model.fit(X, y)

X: Your features/input variables
  - Typically a 2D array or DataFrame (n_samples × n_features)
  - Example: size, age, color, etc.

y: Your target/output variable
  - A 1D array or Series of labels or values.
  - Can be:
    - Continuous values (for regression)
    - Class labels (for classification)

Question 19: What does model.predict() do? What arguments must be given?

Answer: Once your model has been trained using .fit(), the .predict() method makes predictions on new, unseen data (features only, not labels).

Required Arguments

      model.predict(X)


X: A 2D array or DataFrame of input features, with the same number and order of features as the training data. The number of rows can differ.

Shape: (n_samples, n_features)

You must not include the target (y) here — just the features.
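A minimal sketch of `.predict()` on the Iris dataset. `KNeighborsClassifier` is an arbitrary choice of model for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Pass features only; the result has one prediction per input row.
predictions = model.predict(X_test)
print(predictions.shape)

# A single sample must still be 2D: shape (1, n_features).
single_prediction = model.predict(X_test[:1])
print(single_prediction)
```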

Question 20: What are continuous and categorical variables?

Answer: Answer already given in question 5.

Question 21: What is feature scaling? How does it help in Machine Learning?

Answer: Feature scaling is the process of transforming the features (input variables) in your dataset so that they are on a similar scale (range of values).

This is important because many machine learning algorithms perform better or converge faster when features are scaled properly.

1. Improves Model Performance
    - Algorithms like SVM, KNN, Logistic Regression, Neural Networks, and Gradient Descent-based models are sensitive to feature scales.
2. Faster Convergence
    - Feature scaling helps gradient descent converge faster during training.
3. Equal Feature Importance
    - Prevents features with larger ranges from dominating learning.
4. Required for Distance-Based Models
    - Models like KNN and K-Means rely on calculating distances, which get distorted if scales differ. PCA is similarly affected because it relies on feature variances.

Question 22: How do we perform scaling in Python?

Answer:

🔹 Step 1: Import the Scaler
- Scikit-learn provides several scaling classes.

🔹 Step 2: Choose a Scaling Technique
    
  1. StandardScaler (Z-score normalization)
      - Transforms data so that it has mean = 0 and standard deviation = 1.
      - Best for: Most ML models, especially if the data is normally distributed.
  2. MinMaxScaler (Normalization)
      - Transforms features to a fixed range, [0, 1] by default.
      - Best for: Algorithms that need bounded input, like neural networks.
  3. RobustScaler (for outliers)
      - Uses median and interquartile range (IQR).
      - Best for: Datasets with outliers.

🔹 Step 3: Fit and Transform the Data
- fit() computes the scaling parameters (mean, std, etc.).
- transform() applies the scaling to the data.
- fit_transform() does both in one step.

🔹 Step 4: Apply Scaling Correctly (Train/Test Split)
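- You should fit the scaler on the training set only, then use it to transform both the training and test sets. Fitting on the full dataset would leak information from the test set into training.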
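The four steps can be sketched with `StandardScaler` as follows. The feature matrix is synthetic, with two columns on very different scales. `MinMaxScaler` or `RobustScaler` could be swapped in the same way.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (think age and income).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(40, 10, size=100),
    rng.normal(50_000, 15_000, size=100),
])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()

# Fit on the training data only, then transform it.
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training mean and std for the test data; do not refit.
X_test_scaled = scaler.transform(X_test)

print("Train means:", X_train_scaled.mean(axis=0).round(6))
print("Train stds:", X_train_scaled.std(axis=0).round(6))
```

The scaled test set will generally not have exactly zero mean and unit standard deviation, because it is transformed with statistics learned from the training set.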

Question 23: What is sklearn.preprocessing?

Answer: Answer already given in question 8

Question 24: How do we split data for model fitting (training and testing) in Python?

Answer: Answer already given in question 10

Question 25: Explain data encoding?

Answer: Data encoding is the process of converting categorical (non-numeric) data into a numerical format so that it can be used in machine learning models. Most ML models can't handle text or categories directly — they need numbers to perform calculations.