<a href="https://colab.research.google.com/github/kanchandhole/Data-Scientist/blob/main/18th_march_feature_eng_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Q1.** What is the Filter method in feature selection, and how does it work?

**Ans:**

---

## **Filter Method in Feature Selection**

### **Definition**

The **Filter method** is a **feature selection technique** that evaluates the importance of features **independently of any machine learning model**.
It selects or removes features based on **statistical measures** or intrinsic properties of the data.

---

### **How It Works**

1. Calculate a **relevance score** for each feature using statistical tests or metrics.
2. Rank features based on the score.
3. Select the top features or remove irrelevant ones before model training.
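
The three steps above can be sketched as follows. This is a minimal illustration, assuming scikit-learn and a synthetic dataset from `make_classification`; the data, the column names `f0` to `f9`, and the choice `k=3` are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (assumption for illustration): 10 features, 3 informative.
X_arr, y = make_classification(
    n_samples=500, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

# Step 1: score every feature against the target (ANOVA F-test here).
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)

# Step 2: rank features by score.
ranking = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)

# Step 3: keep the top-k features before any model is trained.
top_features = list(X.columns[selector.get_support()])
X_selected = selector.transform(X)

print(ranking.round(2))
print("Kept:", top_features)
```

No predictive model is fitted at any point; only the score function touches the target.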

---

### **Common Metrics Used**

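* **Correlation coefficient** (Pearson, Spearman): for numerical features
* **Chi-square test**: for categorical (or non-negative count) features against a categorical target
* **ANOVA F-test**: for numerical features against a categorical target (compares feature means across classes)
* **Mutual information**: measures any statistical dependency, including non-linear ones, between a feature and the target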

---

### **Example**

Suppose you have a dataset to predict customer churn:

* Features: `Age`, `Salary`, `Account Type`, `Region`
* Target: `Churn`

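**Step 1:** Calculate each feature's correlation with the target (categorical features such as `Account Type` and `Region` must be encoded first, or scored with a chi-square test instead).

**Step 2:** Keep the features with high correlation (say `Age` and `Salary`) and drop those with low correlation (say `Region`).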
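
A rough sketch of this correlation-based ranking, using synthetic data. The churn rule, the extra `Noise` column, and the 0.1 cutoff are all assumptions made up for the illustration; only numeric features are scored here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
age = rng.integers(18, 70, size=n)
salary = rng.normal(50_000, 15_000, size=n)
noise = rng.normal(size=n)  # numeric feature unrelated to churn

# Hypothetical rule: younger, lower-paid customers churn more often.
logit = -0.06 * (age - 44) - 0.00005 * (salary - 50_000)
churn = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

df = pd.DataFrame({"Age": age, "Salary": salary, "Noise": noise, "Churn": churn})

# Step 1: absolute Pearson correlation of each feature with the target.
corr = (
    df.drop(columns="Churn")
    .corrwith(df["Churn"])
    .abs()
    .sort_values(ascending=False)
)

# Step 2: keep features above a threshold (0.1 is an arbitrary cutoff).
selected = corr[corr > 0.1].index.tolist()

print(corr.round(3))
print("Selected:", selected)
```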

---

### **Advantages**

* Fast and simple
* Works well with **high-dimensional datasets**
* Can reduce overfitting by discarding irrelevant features

### **Disadvantages**

* Ignores **feature interactions**
* May select redundant features

---

### **Conclusion**

The **Filter method** is a **pre-processing step** that quickly identifies relevant features using statistical measures, independent of any machine learning algorithm.


**Q2.** How does the Wrapper method differ from the Filter method in feature selection?

**Ans:**

---

## **Wrapper Method vs Filter Method in Feature Selection**

Feature selection techniques can be broadly categorized into **Filter**, **Wrapper**, and **Embedded** methods. Here we compare **Wrapper** and **Filter** methods.

---

### **1. Filter Method**

* **Model-agnostic:** Selects features **independently of any machine learning model**.
* **Based on statistical measures**: correlation, chi-square, ANOVA, mutual information, etc.
* **Advantages:** Fast, simple, works well for high-dimensional data.
* **Disadvantages:** Ignores interactions between features; may include redundant features.

---

### **2. Wrapper Method**

* **Model-based:** Uses a **machine learning model** to evaluate subsets of features.
* **How it works:**

  1. Select a subset of features
  2. Train the model on this subset
  3. Evaluate model performance (e.g., accuracy, F1-score)
  4. Repeat for different subsets to find the best-performing set
* **Search strategies:** Forward selection, backward elimination, recursive feature elimination (RFE)
* **Advantages:** Accounts for **feature interactions** and model performance
* **Disadvantages:** Computationally expensive, especially for large datasets
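
One way to sketch the wrapper loop is scikit-learn's `SequentialFeatureSelector`, which performs greedy forward selection with cross-validation. The synthetic dataset, the logistic regression model, and the target of three features are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption): 8 features, 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)

model = LogisticRegression(max_iter=1000)

# Forward selection: start empty, then repeatedly add the feature that
# gives the best cross-validated accuracy until 3 features are chosen.
sfs = SequentialFeatureSelector(
    model, n_features_to_select=3, direction="forward",
    scoring="accuracy", cv=5,
)
sfs.fit(X, y)

selected_idx = sfs.get_support(indices=True)
X_subset = sfs.transform(X)
score = cross_val_score(model, X_subset, y, cv=5).mean()

print("Selected column indices:", selected_idx)
print("CV accuracy on selected subset:", round(score, 3))
```

Every candidate feature triggers a full cross-validated model fit, which is where the computational cost comes from.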

---

### **Comparison Table**

| Aspect               | Filter Method           | Wrapper Method                      |
| -------------------- | ----------------------- | ----------------------------------- |
| Model Dependency     | Independent             | Model-based                         |
| Computation          | Fast                    | Slow (computationally expensive)    |
| Feature Interactions | Ignored                 | Considered                          |
| Accuracy             | May be lower            | Usually higher (better performance) |
| Example              | Correlation, Chi-square | Recursive Feature Elimination (RFE) |

---

### **Example**

* **Filter:** Select features with high correlation with target.
* **Wrapper:** Use RFE with a decision tree to repeatedly refit the model and drop the least important feature until the desired number remains.

---

### **Conclusion**

* **Filter methods** are faster and simpler, suitable for preprocessing.
* **Wrapper methods** are often more accurate and account for interactions but are computationally expensive.


**Q3.** What are some common techniques used in Embedded feature selection methods?

**Ans:**

---

## **Embedded Feature Selection Methods**

### **Definition**

Embedded methods perform **feature selection during the model training process**.
Unlike **Filter methods** (independent of model) or **Wrapper methods** (iterative search), embedded methods **integrate feature selection into the learning algorithm itself**.

---

### **Common Techniques**

1. **Lasso Regression (L1 Regularization)**

   * Adds a penalty proportional to the **absolute value of coefficients**.
   * Encourages **sparse coefficients**, automatically shrinking some feature weights to zero.
   * Features with zero coefficients are **effectively removed**.

2. **Ridge Regression (L2 Regularization)**

   * Adds a penalty proportional to the **square of coefficients**.
   * Doesn't shrink coefficients to exactly zero, so it does not remove features by itself, but it reduces the effect of less important features.

3. **Elastic Net**

   * Combines **L1 and L2 penalties**.
   * Helps in selecting features when features are correlated.

4. **Tree-based Models**

   * Algorithms like **Decision Trees, Random Forests, Gradient Boosting** naturally perform feature selection.
   * Use **feature importance scores** to identify key features.
   * Example: `model.feature_importances_` in scikit-learn.

5. **Regularized Logistic Regression**

   * Logistic regression with **L1 regularization** drives some coefficients to zero and so selects features for classification tasks; **L2 regularization** only shrinks them.
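
The difference between L1 and L2 penalties can be seen in a short sketch. It assumes synthetic data from `make_regression` and an arbitrary penalty strength `alpha=2.0`; features are standardized first because both penalties are scale-sensitive.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 20 features, only 4 carry signal.
X, y = make_regression(
    n_samples=400, n_features=20, n_informative=4,
    noise=10.0, random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=2.0).fit(X_scaled, y)
ridge = Ridge(alpha=2.0).fit(X_scaled, y)

# L1 sets some coefficients to exactly zero; L2 only shrinks them.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Zero coefficients (Lasso):", n_zero_lasso)
print("Zero coefficients (Ridge):", n_zero_ridge)

# SelectFromModel keeps the features whose Lasso coefficient is non-zero.
selector = SelectFromModel(Lasso(alpha=2.0)).fit(X_scaled, y)
X_selected = selector.transform(X_scaled)
print("Features kept:", X_selected.shape[1])
```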

---

### **Advantages**

* Performs feature selection **during model training**
* Accounts for **feature interactions**
* Less computationally expensive than wrapper methods

### **Disadvantages**

* Model-dependent
* Selected features may vary with **different algorithms or hyperparameters**

---

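### **Example Workflow**

The code cell after the conclusion applies Lasso to the California housing dataset and keeps the features with non-zero coefficients.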

---

### **Conclusion**

Embedded methods combine **feature selection and model training**, making them **efficient and accurate**, especially for high-dimensional datasets.


```python
from sklearn.linear_model import Lasso
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load California housing dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

# Lasso regression for embedded feature selection
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Select features with non-zero coefficients
selected_features = X.columns[lasso.coef_ != 0]
print("Selected features:", selected_features)
```

```
Selected features: Index(['MedInc', 'HouseAge', 'Population', 'AveOccup', 'Latitude',
       'Longitude'],
      dtype='object')
```


**Q5.** In which situations would you prefer using the Filter method over the Wrapper method for feature
selection?

**Ans:**

## **When to Prefer Filter Method Over Wrapper Method**

The **Filter method** and **Wrapper method** are both used for feature selection, but they have different strengths. You would prefer **Filter methods** in the following situations:

---

### **1. High-Dimensional Datasets**

* When the dataset has **thousands or millions of features** (e.g., text data, gene expression data)
* Filter methods are **computationally efficient** and fast
* Wrapper methods would be **too slow**, as they evaluate subsets of features with a model

---

### **2. Need for Model-Agnostic Selection**

* Filter methods are **independent of the machine learning algorithm**
* Useful if you want to **preprocess data** before trying multiple models
* Wrapper methods are **model-dependent** (feature selection changes with different algorithms)

---

### **3. Quick Feature Screening**

* When you need a **preliminary selection** to remove irrelevant features quickly
* Helps reduce noise and improve model training speed
* Example: Selecting features based on correlation with the target

---

### **4. Avoiding Overfitting in Small Datasets**

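* Filter methods score features without fitting the final model, so they are **less prone to overfitting** the selection to the training set (the scores should still be computed on training data only)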
* Wrapper methods can overfit, especially when the dataset is **small**
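
Filter scores are still estimated from data, so with few samples it matters where they are computed. The sketch below uses pure-noise features (an assumption chosen to make the effect visible) and compares scoring on all rows before cross-validation against refitting the filter inside each training fold with a pipeline.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2000))  # pure noise features
y = np.repeat([0, 1], 25)        # labels unrelated to X

# Leaky: filter scores use ALL rows, then the result is cross-validated.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
).mean()

# Safer: the filter is refit on the training part of each fold.
pipe = make_pipeline(
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000),
)
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print("Selection outside CV:", round(leaky_score, 2))
print("Selection inside CV: ", round(honest_score, 2))
```

Because the labels carry no signal, a score well above chance in the first variant reflects selection bias rather than predictive power.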

---

### **5. Simple and Interpretable Criteria**

* Filter methods use **statistical measures** (correlation, chi-square, ANOVA, mutual information)
* Easy to **understand and justify** feature selection

---

### **Example**

* **Filter Method:** Removing features with low correlation with the target before model training
* **Wrapper Method:** Using Recursive Feature Elimination (RFE) with a classifier to find the best subset

**Scenario:**

* Text classification with 10,000 words → Filter method to reduce features based on chi-square scores
* Gene expression dataset with 20,000 genes → Filter method for quick screening

---

### **Conclusion**

Use the **Filter method** when you need **speed, scalability, model-agnostic selection, or preliminary feature screening**, especially for **high-dimensional datasets**.

**Q6.** In a telecom company, you are working on a project to develop a predictive model for customer churn.
You are unsure of which features to include in the model because the dataset contains several different
ones. Describe how you would choose the most pertinent attributes for the model using the Filter Method.

**Ans:**


---

## **Using the Filter Method for Feature Selection in a Telecom Churn Project**

When predicting **customer churn**, you often have a dataset with many features such as:

* Customer demographics (age, gender, location)
* Usage metrics (call duration, data usage, number of complaints)
* Billing info (monthly charges, payment method)

To select the **most relevant features**, the **Filter method** can be applied as follows:

---

### **Step 1: Understand the Data**

* Identify feature types:

  * **Numerical:** Age, monthly charges, call duration
  * **Categorical:** Gender, contract type, payment method
* Identify the **target variable**: `Churn` (Yes/No)

---

### **Step 2: Choose Statistical Measures**

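* **Numerical Features vs Target:**

  * Use a **correlation coefficient**: Pearson for linear relationships (with a binary target this is the point-biserial correlation) or Spearman for monotonic ones
  * The **ANOVA F-test** is an alternative that compares a feature's mean between churners and non-churners
  * Features strongly associated with churn are likely relevant
* **Categorical Features vs Target:**

  * Use the **Chi-square test** (or mutual information)
  * Example: Test if contract type or payment method is associated with churn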

---

### **Step 3: Calculate Scores**

* Compute correlation or ANOVA F scores for numeric features
* Compute chi-square scores for (encoded) categorical features
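
For the numeric features, a minimal sketch might look like this. The six toy rows mirror the sample data used in the chi-square cell later in this answer and are far too few for real conclusions.

```python
import pandas as pd
from sklearn.feature_selection import f_classif

# Toy rows for illustration only.
data = pd.DataFrame({
    "MonthlyCharges": [70, 80, 60, 90, 85, 75],
    "Tenure": [1, 24, 12, 3, 36, 18],
    "Churn": ["Yes", "No", "No", "Yes", "No", "No"],
})
X_num = data[["MonthlyCharges", "Tenure"]]
y = data["Churn"].map({"No": 0, "Yes": 1})

# ANOVA F-test: does the feature mean differ between churn classes?
f_scores, p_values = f_classif(X_num, y)
anova = pd.DataFrame(
    {"Feature": X_num.columns, "F": f_scores, "p_value": p_values}
).sort_values("F", ascending=False)

# Point-biserial correlation: Pearson correlation with the 0/1 target.
corr = X_num.corrwith(y).abs()

print(anova)
print(corr.round(3))
```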


---

### **Step 4: Select Top Features**

* Rank features by score
* Keep features with **high scores** and remove low-scoring features
* Example:

  * Keep: `MonthlyCharges`, `Contract_TwoYear`, `Tenure`
  * Remove: `Gender`, `City` (if low relevance)

---

### **Step 5: Validate Selection**

* Check if selected features improve model performance
* Filter methods are **fast**, but you can later combine with **Wrapper or Embedded methods** for fine-tuning

---

### **Advantages of Filter Method in This Case**

1. Works **independently of the model**, fast for many features
2. Reduces noise and irrelevant data
3. Lowers the risk of overfitting in small datasets

---

### **Conclusion**

Using the **Filter method**, you can efficiently identify **pertinent attributes** for the churn model by applying **correlation, chi-square, or ANOVA** tests, ranking features, and selecting the most relevant ones for training.


```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# --------------------------
# Step 1: Create sample dataset
# --------------------------
data = pd.DataFrame({
    'Contract': ['Month-to-month', 'Two year', 'One year', 'Month-to-month', 'Two year', 'One year'],
    'PaymentMethod': ['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card', 'Electronic check', 'Bank transfer'],
    'MonthlyCharges': [70, 80, 60, 90, 85, 75],
    'Tenure': [1, 24, 12, 3, 36, 18],
    'Churn': ['Yes', 'No', 'No', 'Yes', 'No', 'No']
})

# --------------------------
# Step 2: Prepare features and target
# --------------------------
# One-hot encode categorical variables (first level of each is dropped)
X_cat = pd.get_dummies(data[['Contract', 'PaymentMethod']], drop_first=True)
y = data['Churn'].map({'No': 0, 'Yes': 1})  # Encode target

# --------------------------
# Step 3: Apply Chi-square test
# --------------------------
chi_selector = SelectKBest(chi2, k='all')
chi_selector.fit(X_cat, y)

# --------------------------
# Step 4: View scores
# --------------------------
scores = pd.DataFrame({'Feature': X_cat.columns, 'Score': chi_selector.scores_})
scores = scores.sort_values(by='Score', ascending=False)
print(scores)
```

```
                          Feature  Score
2       PaymentMethod_Credit card   2.00
0               Contract_One year   1.00
1               Contract_Two year   1.00
4      PaymentMethod_Mailed check   0.50
3  PaymentMethod_Electronic check   0.25
```


**Q7.** You are working on a project to predict the outcome of a soccer match. You have a large dataset with
many features, including player statistics and team rankings. Explain how you would use the Embedded
method to select the most relevant features for the model.

**Ans:**

---

## **Using Embedded Feature Selection for Soccer Match Outcome Prediction**

Suppose you are predicting the outcome of a soccer match (Win/Loss/Draw) using a dataset that contains:

* Player statistics (goals scored, assists, passes completed)
* Team rankings (league position, recent form)
* Other match-related features (home/away, weather, referee)

To select the **most relevant features**, you can use **Embedded methods**, which perform feature selection **during model training**.

---

### **Step 1: Choose a Model with Built-in Feature Selection**

Embedded methods rely on models that **penalize or rank features** automatically:

* **Tree-based models:** Decision Trees, Random Forests, Gradient Boosting
* **Regularized models:** Logistic Regression with L1 (Lasso) or Elastic Net

---

### **Step 2: Train the Model**

* Fit the chosen model to your data:

```python
from sklearn.ensemble import RandomForestClassifier

# X = feature matrix, y = target (Win/Loss/Draw)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X, y)
```

---

### **Step 3: Evaluate Feature Importance**

* Most embedded methods provide **feature importance scores**:

  * **Random Forest / Gradient Boosting:** `model.feature_importances_`
  * **Lasso / Elastic Net:** Coefficients shrunk to (or near) zero → unimportant, provided the features are on a comparable scale
* Rank features based on importance.

```python
import pandas as pd

feature_importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print(feature_importances)
```

---

### **Step 4: Select Top Features**

* Keep the most important features for model training (e.g., top 10 to 20)
* Remove low-importance features that contribute little or add noise
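
One way to apply this cut-off is scikit-learn's `SelectFromModel`, sketched here. The synthetic three-class data, the `stat_*` column names, and the choice of keeping five features are assumptions standing in for real match data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for match data (assumption): 15 features, 3 outcomes.
X_arr, y = make_classification(
    n_samples=600, n_features=15, n_informative=5,
    n_redundant=0, n_classes=3, random_state=42,
)
X = pd.DataFrame(X_arr, columns=[f"stat_{i}" for i in range(15)])

rf = RandomForestClassifier(n_estimators=100, random_state=42)

# threshold=-np.inf with max_features=5 keeps exactly the 5 features
# with the highest importance scores.
selector = SelectFromModel(rf, threshold=-np.inf, max_features=5).fit(X, y)

kept = X.columns[selector.get_support()].tolist()
X_top = selector.transform(X)

print("Kept features:", kept)
print("Reduced shape:", X_top.shape)
```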

---

### **Advantages of Embedded Method**

1. **Accounts for feature interactions** during model training
2. **Efficient** because feature selection happens alongside model learning
3. Reduces overfitting by removing irrelevant or redundant features
4. Often provides **better predictive performance** than simple filter methods

---

### **Step 5 (Optional): Combine with Hyperparameter Tuning**

* You can **tune the model** (number of trees, L1 penalty, etc.) and **reevaluate feature importance**
* Ensures that selected features are robust and contribute meaningfully to predictions
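
A possible sketch of tuning the selector and the model together, so that feature importance is re-evaluated for each setting. The synthetic data, the downstream logistic regression, and the grid values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for match data (assumption).
X, y = make_classification(
    n_samples=600, n_features=15, n_informative=5,
    n_redundant=0, n_classes=3, random_state=0,
)

pipe = Pipeline([
    # Embedded step: rank features with a random forest, keep the top ones.
    ("select", SelectFromModel(
        RandomForestClassifier(random_state=0), threshold=-np.inf)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune the number of trees and how many features are kept.
param_grid = {
    "select__estimator__n_estimators": [50, 100],
    "select__max_features": [3, 5, 10],
}
grid = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
grid.fit(X, y)

n_kept = int(grid.best_estimator_.named_steps["select"].get_support().sum())
print("Best settings:", grid.best_params_)
print("Features kept:", n_kept)
```

Placing the selector inside the pipeline means it is refit on each training fold, so the cross-validated score is not inflated by selection done on the full data.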

---

### **Conclusion**

Using **Embedded feature selection**, you can efficiently identify **the most relevant player stats, team metrics, and match features** while training a predictive model. This improves **accuracy, reduces complexity**, and makes the model more interpretable.



**Q8.** You are working on a project to predict the price of a house based on its features, such as size, location,
and age. You have a limited number of features, and you want to ensure that you select the most important
ones for the model. Explain how you would use the Wrapper method to select the best set of features for the
predictor.

**Ans:**



---

## **Using the Wrapper Method for House Price Prediction**

Suppose you are predicting house prices using features such as:

* **Size** (square feet, number of bedrooms)
* **Location** (city, neighborhood)
* **Age of the house**
* **Other amenities** (garage, garden, pool)

You want to **select the best combination of features** to maximize predictive performance. The **Wrapper method** is ideal because it **evaluates feature subsets using the actual model**.

---

### **Step 1: Choose a Predictive Model**

* Select a model appropriate for regression, e.g.:

  * Linear Regression
  * Random Forest Regressor
  * Gradient Boosting Regressor

---

### **Step 2: Define a Search Strategy**

Wrapper methods explore **different subsets of features** and evaluate the model performance for each subset. Common strategies:

1. **Forward Selection**

   * Start with no features
   * Add one feature at a time that improves model performance
   * Repeat until no further improvement
2. **Backward Elimination**

   * Start with all features
   * Remove the least important feature iteratively
   * Stop when performance decreases
3. **Recursive Feature Elimination (RFE)**

   * Fit the model, rank features by importance
   * Remove the least important feature(s)
   * Repeat until desired number of features is left

---

### **Step 3: Evaluate Feature Subsets**

* For each subset, train the model and evaluate performance metrics on held-out data (e.g., with cross-validation):

  * **R² score**
  * **Mean Squared Error (MSE)**

* Select the subset that **maximizes R² (or minimizes MSE)** with as few features as possible.
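
With only a handful of features, every subset can be evaluated directly. The sketch below does an exhaustive search scored by cross-validated R². It assumes synthetic data from `make_regression`; the column names are arbitrary labels attached to synthetic columns, not real housing measurements.

```python
from itertools import combinations

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption): 5 features, 3 of them informative.
X_arr, y = make_regression(
    n_samples=200, n_features=5, n_informative=3,
    noise=15.0, random_state=0,
)
names = ["Size", "Bedrooms", "Age", "Garage", "Pool"]  # illustrative labels
X = pd.DataFrame(X_arr, columns=names)

# Evaluate every non-empty subset (2**5 - 1 = 31 candidates).
results = []
for k in range(1, len(names) + 1):
    for subset in combinations(names, k):
        r2 = cross_val_score(
            LinearRegression(), X[list(subset)], y, cv=5, scoring="r2"
        ).mean()
        results.append((r2, subset))

best_score, best_subset = max(results)
print("Best subset:", best_subset)
print("Mean CV R^2:", round(best_score, 3))
```

The number of subsets doubles with each added feature, so this brute-force form is practical only for small feature sets; forward selection, backward elimination, or RFE trade completeness for speed.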

---



### **Advantages of Wrapper Method**

1. **Considers feature interactions**
2. **Optimizes for model performance**
3. **More accurate feature selection** than simple filter methods

---

### **Disadvantages**

* Computationally expensive if **many features** are present
* May overfit on small datasets

---

### **Conclusion**

The **Wrapper method** is suitable when the number of features is limited and the goal is to find the **best-performing combination of features** for predicting house prices.



```python
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
import pandas as pd

# Sample dataset
X = pd.DataFrame({
    'Size': [1200, 1500, 800, 900, 1300],
    'Bedrooms': [3, 4, 2, 2, 3],
    'Age': [10, 5, 20, 15, 8],
    'Garage': [1, 1, 0, 0, 1]
})
y = [300000, 400000, 200000, 220000, 320000]

# Initialize model
model = LinearRegression()

# Recursive Feature Elimination
rfe = RFE(model, n_features_to_select=2)
rfe.fit(X, y)

# Selected features
selected_features = X.columns[rfe.support_]
print("Selected features:", selected_features)
```

```
Selected features: Index(['Bedrooms', 'Garage'], dtype='object')
```
