# Feature Engineering

## Q1. What is a parameter?

In programming and mathematics, a parameter is a named variable passed into a function, method, or subroutine. It acts as a placeholder for the actual values (called arguments) that will be supplied when the function is called. Parameters define the type of input a function expects and allow functions to be reusable with different data.

For example, in the Python function `def calculate_sum(a, b):`, `a` and `b` are parameters. When you call `calculate_sum(5, 10)`, `5` and `10` are the arguments passed to those parameters.
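
The distinction between parameters and arguments can be sketched in a few lines. This is a minimal illustration built on the `calculate_sum` function named above; the variable names `total` and `same_total` are made up for the example.

```python
def calculate_sum(a, b):
    """Return the sum of the two parameters a and b."""
    return a + b


# 5 and 10 are the arguments bound to the parameters a and b.
total = calculate_sum(5, 10)
print(total)  # 15

# Keyword arguments name the target parameter explicitly.
same_total = calculate_sum(a=5, b=10)
```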





---



## Q2. What is correlation?

In statistics, correlation refers to the statistical relationship between two or more variables. It describes the extent to which two variables tend to move together. When two variables are correlated, it means that as one variable changes, the other variable tends to change in a predictable way.

There are different types of correlation:

1. Positive Correlation: As one variable increases, the other variable also tends to increase (e.g., height and weight).

2. Negative Correlation: As one variable increases, the other variable tends to decrease (e.g., hours spent studying and the number of errors made on a test).

3. No Correlation: There is no consistent relationship between the variables.


Correlation is often quantified by a correlation coefficient, such as Pearson's r, which ranges from -1 to +1. A value of +1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no linear correlation.
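
As a rough sketch of how Pearson's r is computed, the snippet below implements the usual definition (covariance divided by the product of the standard deviations) and checks it against `numpy.corrcoef`. The height and weight values are hypothetical, and `pearson_r` is an illustrative helper name.

```python
import numpy as np

# Hypothetical paired measurements, made up for illustration.
height_cm = np.array([150.0, 160.0, 165.0, 172.0, 180.0])
weight_kg = np.array([52.0, 58.0, 63.0, 70.0, 79.0])


def pearson_r(x, y):
    """Pearson's r: covariance divided by the product of standard deviations."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )


r_manual = pearson_r(height_cm, weight_kg)
r_numpy = np.corrcoef(height_cm, weight_kg)[0, 1]  # same quantity via NumPy
print(round(r_manual, 3), round(r_numpy, 3))
```

Both values should agree, and for this toy data they are close to +1 because weight rises almost linearly with height.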





---



## Q3. Define Machine Learning. What are the main components in Machine Learning?

**Machine Learning (ML)** is a subfield of Artificial Intelligence (AI) that enables a computer system to **learn patterns from data** and make predictions/decisions **without being explicitly programmed** for every rule.

### Main components of ML
1. **Data**: input examples (features) and sometimes labels (targets).
2. **Features (X)**: variables used for prediction.
3. **Labels/Target (y)**: output variable (only in supervised learning).
4. **Model / Algorithm**: mathematical function that maps input → output (e.g., Linear Regression, Decision Tree).
5. **Loss/Objective function**: measures model error (e.g., MSE, Log loss).
6. **Optimizer / Learning algorithm**: updates model parameters to minimize loss (e.g., Gradient Descent).
7. **Training**: process of learning parameters from training data.
8. **Evaluation/Validation**: checking performance using test/validation set.
9. **Hyperparameters**: external settings (e.g., learning rate, depth).



---



## Q4. How does loss value help in determining whether the model is good or not?

A **loss value** tells how far a model's prediction is from the actual output.

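- **Lower loss** ⇒ predictions are closer to the actual values ⇒ the model fits that data better.
- **Higher loss** ⇒ the model is making larger errors.
- Loss values are only comparable when they use the same loss function on the same data.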

### Important notes
- We compare the loss on the **training set** with the loss on the **validation/test set**.
- If the **training loss is low but the validation loss is high** ⇒ **overfitting**.
- If **both losses are high** ⇒ **underfitting**, or poor features/model.

Example:
- Regression uses **MSE / MAE**
- Classification uses **Log Loss / Cross Entropy**
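
The loss functions named above can be written out directly. The sketch below uses NumPy with made-up targets and predictions; `mse`, `mae`, and `log_loss` are illustrative helpers following the standard formulas, not any library's implementation.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(p_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


# Regression: hypothetical targets with a close and a poor set of predictions.
y_reg = np.array([3.0, 5.0, 7.0])
good_reg = np.array([2.5, 5.0, 7.5])
bad_reg = np.array([0.0, 9.0, 3.0])
print(mse(y_reg, good_reg), mse(y_reg, bad_reg))
print(mae(y_reg, good_reg), mae(y_reg, bad_reg))

# Classification: predicted probabilities of class 1.
y_cls = np.array([1, 0, 1])
mostly_right = np.array([0.9, 0.1, 0.8])
mostly_wrong = np.array([0.2, 0.9, 0.3])
print(log_loss(y_cls, mostly_right), log_loss(y_cls, mostly_wrong))
```

In each pair, the predictions that sit closer to the true values produce the smaller loss.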



---



## Q5. What are continuous and categorical variables?

### Continuous variables
Numeric values that can take any value within a range.
Examples:
- height, weight, temperature, income

### Categorical variables
Represent categories/labels.
Examples:
- gender (Male/Female)
- color (Red/Blue/Green)
- city (Kolkata/Delhi/Mumbai)

Categorical types:
- **Nominal**: no order (e.g., city)
- **Ordinal**: ordered (e.g., low < medium < high)



---



## Q6. How do we handle categorical variables in ML? What are the common techniques?

ML models require numeric input, so categorical variables must be encoded.

### Common techniques
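1. **Label Encoding**
   - Assigns an arbitrary integer to each category (e.g., A, B, C → 0, 1, 2).
   - In Scikit-learn, `LabelEncoder` is intended for the **target (y)**; on nominal features the integers imply an order that does not exist.
2. **One-Hot Encoding**
   - Creates one binary column per category.
   - Best for **nominal categories**.
3. **Ordinal Encoding**
   - Assigns ranks that follow the category order (e.g., low → 0, medium → 1, high → 2).
   - Good for **ordinal categories**.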
4. **Target Encoding**
   - Replace category with mean of target for that category.
   - Useful when many categories (high cardinality).
5. **Frequency/Count Encoding**
   - Replace category with its frequency.



---



## Q7. What do you mean by training and testing a dataset?

- **Training dataset**: used to **fit/learn** the model parameters.
- **Testing dataset**: used to **evaluate** the trained model performance on unseen data.

Goal:
- Ensure the model **generalizes** (works well on new/unseen data).



---



## Q8. What is sklearn.preprocessing?

`sklearn.preprocessing` is a module in **Scikit-learn** that provides tools for:

- **Scaling**: StandardScaler, MinMaxScaler, RobustScaler
- **Normalization**: Normalizer
- **Encoding**: OneHotEncoder, LabelEncoder, OrdinalEncoder
- **Binarization** and other transformations

It helps in preparing data before training ML models.



---



## Q9. What is a Test set?

A **test set** is a portion of the dataset kept separate from training and used **only for final evaluation**.

Why?
- To estimate real-world performance
- To detect overfitting

Common split ratios:
- 70/30, 80/20, 75/25



---



## Q10. How do we split data for model fitting (training and testing) in Python?

We use `train_test_split()` from `sklearn.model_selection`.

Example:
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```



---



## Q11. How do you approach a Machine Learning problem?

Typical ML workflow:
1. **Problem understanding** (classification/regression/clustering)
2. **Collect data**
3. **Data cleaning** (missing values, outliers)
4. **EDA (Exploratory Data Analysis)**
5. **Feature engineering**
6. **Split data** into train/test (and validation)
7. **Model selection**
8. **Train model**
9. **Evaluate** using metrics
10. **Hyperparameter tuning**
11. **Deploy / monitoring**
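
Steps 5 to 9 of this workflow can be sketched as a single Scikit-learn pipeline. This is one possible arrangement, not the only one: the data is a hypothetical toy table, the names `preprocess` and `workflow` are illustrative, and `LogisticRegression` is just an example model choice.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, made up for illustration.
data = pd.DataFrame({
    "Age": [22, 25, 47, 52, 46, 56, 30, 41],
    "Salary": [25000, 32000, 58000, 60000, 52000, 62000, 30000, 45000],
    "City": ["Kolkata", "Delhi", "Kolkata", "Mumbai",
             "Delhi", "Mumbai", "Kolkata", "Delhi"],
    "Purchased": [0, 1, 0, 1, 1, 1, 0, 0],
})
X = data[["Age", "Salary", "City"]]
y = data["Purchased"]

# Split first, so preprocessing statistics come from the training data only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale numeric columns and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["City"]),
])

workflow = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])

workflow.fit(X_train, y_train)
test_accuracy = workflow.score(X_test, y_test)
print(test_accuracy)
```

Keeping preprocessing inside the pipeline means the scaler and encoder are fitted on the training split and only applied to the test split.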



---



## Q12. Why do we have to perform EDA before fitting a model?

EDA helps:
- Understand data distribution
- Detect missing values/outliers
- Identify relationships between variables
- Choose correct preprocessing methods
- Improve feature selection/engineering

Without EDA:
- Data-quality problems can go unnoticed and reduce model accuracy.



---



## Q13. What is correlation?

**Correlation** measures the strength and direction of the linear relationship between two variables.

- Range: **-1 to +1**
- +1: perfect positive correlation
- 0: no linear correlation
- -1: perfect negative correlation



---



## Q14. What does negative correlation mean?

**Negative correlation** means:
- when one variable increases, the other decreases.

Example:
- as price increases, demand decreases (often).



---



## Q15. How can you find correlation between variables in Python?

Using Pandas:
```python
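import pandas as pd

# df is assumed to be an existing DataFrame.
# numeric_only=True (pandas 1.5+) skips text columns such as city names.
df.corr(numeric_only=True)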
```

To visualize:
- `seaborn.heatmap(df.corr())` (if seaborn is available)
- or a matplotlib-based heatmap



---



## Q16. What is causation? Explain difference between correlation and causation with an example.

### Causation
One variable **directly affects** the other.

### Correlation
Variables change together but may not have a cause-effect relation.

Example:
- Ice cream sales and drowning cases may be correlated (both increase in summer).
- But ice cream does **not** cause drowning → **correlation ≠ causation**.
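
The ice cream example can be simulated to show how a shared cause produces correlation without causation. In this sketch every number is made up: temperature drives both series, and neither series influences the other. The helper `residuals` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily temperature (the common cause) and two effects of it.
temperature = rng.uniform(10, 35, size=500)
ice_cream_sales = 20 * temperature + rng.normal(0, 40, size=500)
drownings = 0.3 * temperature + rng.normal(0, 1, size=500)

# The two effects are strongly correlated with each other.
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]


def residuals(y, x):
    """Remove the best-fit linear effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# After removing the effect of temperature, little correlation remains.
adjusted_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(round(raw_corr, 2), round(adjusted_corr, 2))
```

The raw correlation is high, while the correlation after adjusting for temperature is close to zero, which is what we expect when a third variable explains the relationship.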



---



## Q17. What is an optimizer? What are different types of optimizers? Explain each with example.

An **optimizer** updates model parameters to minimize the loss function.

### Common optimizers
1. **Gradient Descent (Batch GD)**
   - uses all data at once
2. **Stochastic Gradient Descent (SGD)**
   - updates using one sample
3. **Mini-batch Gradient Descent**
   - updates using small batches
4. **Adam**
   - adaptive learning rate + momentum
5. **RMSprop**
   - adaptive learning rate
6. **Adagrad**
   - adapts learning rate for each parameter

In Scikit-learn, the optimizer is internal to the estimator: `LinearRegression` solves the least-squares problem directly, while `SGDRegressor` and `SGDClassifier` use stochastic gradient descent.
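
The first three optimizers differ only in how many samples feed each update, which a small NumPy sketch can show. It fits a line to synthetic data; the learning rate, epoch counts, and the helper names `gradient` and `train` are assumptions chosen for illustration. Adam, RMSprop, and Adagrad additionally rescale the step per parameter and are not covered here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: true slope 3, true intercept 2, small noise.
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=200)


def gradient(w, b, X_batch, y_batch):
    """Gradient of the mean squared error for the line y = w * x + b."""
    error = w * X_batch[:, 0] + b - y_batch
    return 2 * np.mean(error * X_batch[:, 0]), 2 * np.mean(error)


def train(batch_size, epochs, lr=0.1):
    """Gradient descent where batch_size controls the variant."""
    w, b = 0.0, 0.0
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad_w, grad_b = gradient(w, b, X[idx], y[idx])
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


w_batch, b_batch = train(batch_size=200, epochs=200)  # batch GD: all data per update
w_sgd, b_sgd = train(batch_size=1, epochs=20)         # SGD: one sample per update
w_mini, b_mini = train(batch_size=32, epochs=50)      # mini-batch GD

print(w_batch, b_batch)
print(w_sgd, b_sgd)
print(w_mini, b_mini)
```

All three variants should land near slope 3 and intercept 2; SGD tends to be the noisiest because each update is based on a single sample.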



---



## Q18. What is sklearn.linear_model?

`sklearn.linear_model` is a Scikit-learn module that provides **linear models** such as:

- LinearRegression
- Ridge, Lasso, ElasticNet
- LogisticRegression
- SGDRegressor / SGDClassifier
- Perceptron

Used for regression and classification.



---



## Q19. What does model.fit() do? What arguments must be given?

`model.fit()` trains the model on data.

Basic syntax:
```python
model.fit(X_train, y_train)
```

### Arguments
- `X_train`: feature matrix (2D)
- `y_train`: target vector (1D)

Some estimators accept extra arguments (for example `sample_weight`), but `X` and `y` are required in supervised learning. Unsupervised estimators such as `KMeans` take only `X`.



---



## Q20. What does model.predict() do? What arguments must be given?

`model.predict()` generates predictions using a trained model.

Syntax:
```python
y_pred = model.predict(X_test)
```

### Argument
- `X_test`: feature matrix for which predictions are needed.
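
A minimal sketch of `fit()` followed by `predict()`, assuming Scikit-learn is installed. The tiny dataset is made up so that the target follows y = 2x + 1 exactly, which makes the learned parameters easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Feature matrix must be 2D: 4 samples, 1 feature.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
# Target vector is 1D; here y = 2x + 1.
y_train = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()
model.fit(X_train, y_train)  # learns coef_ and intercept_ from the data
print(model.coef_, model.intercept_)

# predict() takes a 2D feature matrix with the same number of columns.
X_new = np.array([[5.0], [6.0]])
y_pred = model.predict(X_new)
print(y_pred)
```

After fitting, the learned slope and intercept are stored on the model as `coef_` and `intercept_`, and `predict()` returns one value per row of its input.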



---



## Q21. What are continuous and categorical variables?


Continuous = numeric measurable values (e.g., salary, height).  
Categorical = label-based values (e.g., city, gender).



---



## Q22. What is feature scaling? How does it help in ML?

**Feature scaling** converts features to similar ranges.

Why needed?
- Many models depend on distance or gradient:
  - KNN, KMeans, SVM, Logistic Regression, Neural Networks
- Improves convergence speed and stability

Common scaling:
- Standardization (mean=0, std=1)
- Min-Max scaling (0 to 1)
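
Both scaling methods reduce to one-line formulas, sketched here in NumPy on the salary values from the sample dataset used later in these notes. The variable names are illustrative. `np.std` with its default settings uses the population standard deviation, which is also what `StandardScaler` uses.

```python
import numpy as np

salary = np.array([25000.0, 32000.0, 58000.0, 60000.0, 52000.0, 62000.0])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (salary - salary.mean()) / salary.std()

# Min-Max scaling: map the minimum to 0 and the maximum to 1.
min_max = (salary - salary.min()) / (salary.max() - salary.min())

print(standardized.round(2))
print(min_max.round(2))
```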



---



## Q23. How do we perform scaling in Python?

Using Scikit-learn scalers:
```python
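from sklearn.preprocessing import StandardScaler

# X is assumed to be an existing numeric feature matrix (2D array or DataFrame).
scaler = StandardScaler()
# fit_transform learns each column's mean and std, then standardizes X.
X_scaled = scaler.fit_transform(X)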
```



---



## Q24. What is sklearn.preprocessing?


`sklearn.preprocessing` contains preprocessing tools like:
- StandardScaler, MinMaxScaler
- OneHotEncoder, LabelEncoder



---



## Q25. Explain data encoding.

**Data encoding** converts categorical text features into numeric form.

Examples:
- Label Encoding: A,B,C → 0,1,2
- OneHot Encoding: Color=Red/Blue → [1,0] / [0,1]

Encoding is required because most ML algorithms cannot directly work with strings.



---



```python
# ✅ Practical Examples in Python

import numpy as np
import pandas as pd

# Sample dataset
df = pd.DataFrame({
    "Age": [22, 25, 47, 52, 46, 56],
    "Salary": [25000, 32000, 58000, 60000, 52000, 62000],
    "City": ["Kolkata", "Delhi", "Kolkata", "Mumbai", "Delhi", "Mumbai"],
    "Purchased": [0, 1, 0, 1, 1, 1]
})

df
```


```python
# Train-Test Split example

from sklearn.model_selection import train_test_split

X = df[["Age", "Salary"]]
y = df["Purchased"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

X_train, X_test, y_train, y_test
```


```python
# Feature scaling example

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled[:3], X_test_scaled[:3]
```


```python
# Encoding categorical variables example (OneHotEncoding)

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
city_encoded = encoder.fit_transform(df[["City"]])

encoded_df = pd.DataFrame(city_encoded, columns=encoder.get_feature_names_out(["City"]))
encoded_df
```


```python
# model.fit() and model.predict() example (Logistic Regression)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
accuracy, y_pred
```
