# Feature Engineering-2

#### Q1. What is the Filter method in feature selection, and how does it work?

The Filter method in feature selection evaluates the intrinsic statistical properties of features without involving a specific machine learning model. It works by ranking or scoring features with statistical measures such as correlation or mutual information with the target variable, or the variance of the feature itself. Features are kept or dropped based on these scores alone. Because each feature is typically scored on its own, the method does not account for interactions between features or for how a particular model would use them.

Example:
Suppose we have a dataset of students with features 'age', 'study_hours', and 'grade'. We can use the correlation coefficient between each feature and the 'grade' as a filter. Features with higher correlation coefficients are deemed more relevant.

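In [1]:
import pandas as pd
data = pd.DataFrame({'age': [18, 20, 22, 19, 21], 'study_hours': [5, 6, 4, 7, 5], 'grade': [85, 90, 75, 95, 80]})
# Correlation of each feature with the target; drop the target's correlation with itself
cor = data.corr()['grade'].drop('grade')
# Keep features whose correlation with 'grade' exceeds 0.7
selected = cor[cor > 0.7].index.tolist()
print('Selected Features are', selected)

Selected Features are ['study_hours']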
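
The same idea can be expressed with scikit-learn's built-in univariate selectors. The following is a minimal sketch, assuming the same toy student data and an F-test (`f_regression`) as the scoring function; the choice of `k=1` is an assumption made only for illustration.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

data = pd.DataFrame({
    'age': [18, 20, 22, 19, 21],
    'study_hours': [5, 6, 4, 7, 5],
    'grade': [85, 90, 75, 95, 80],
})
X = data.drop(columns='grade')
y = data['grade']

# Score every feature against the target with a univariate F-test
# and keep only the single best-scoring one (k=1 is an illustrative choice).
selector = SelectKBest(score_func=f_regression, k=1)
selector.fit(X, y)

scores = pd.Series(selector.scores_, index=X.columns)
top_features = X.columns[selector.get_support()].tolist()
print(scores.sort_values(ascending=False))
print('Top feature:', top_features)
```

No predictive model is trained here: the selector only computes one statistic per feature, which is what makes filter methods cheap.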


#### Q2. How does the Wrapper method differ from the Filter method in feature selection?

The Wrapper method involves using a machine learning model to evaluate the performance of different subsets of features. It works by iteratively training and evaluating the model on different feature combinations, which allows it to capture feature interactions and their impact on model performance. Unlike the Filter method, the Wrapper method considers how features affect the model's predictive performance.

Example:
Consider the same student dataset. We can use a wrapper method like Recursive Feature Elimination (RFE) with a linear regression model to iteratively select the best subset of features.

In [2]:
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression as lr
data = pd.DataFrame({'age': [18, 20, 22, 19, 21], 'study_hours': [5, 6, 4, 7, 5], 'grade': [85, 90, 75, 95, 80]})
x = data.drop('grade', axis=1)
y = data['grade']
# Repeatedly fit the model and drop the weakest feature until one remains
rfe = RFE(lr(), n_features_to_select=1)
rfe.fit(x, y)
# support_ is a boolean mask over the columns of x
selected = x.columns[rfe.support_].tolist()
print('Selected Features are:', selected)

Selected Features are: ['study_hours']
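
RFE ranks features by the model's coefficients. A wrapper can also score candidate subsets directly by cross-validated performance. The sketch below uses scikit-learn's `SequentialFeatureSelector` for forward selection; the synthetic dataset and the choice of two features are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption): 6 features, of which only 2 drive the target.
X, y = make_regression(n_samples=200, n_features=6, n_informative=2,
                       noise=5.0, random_state=0)

# Forward selection: start from an empty set and, at each step, add the
# feature that gives the best 5-fold cross-validated score.
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction='forward',
    cv=5,
)
sfs.fit(X, y)

selected_idx = sfs.get_support(indices=True)
print('Selected column indices:', selected_idx)
```

Each step refits the model once per remaining candidate feature and per fold, which is why wrapper methods cost far more than filter methods.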


#### Q3. What are some common techniques used in Embedded feature selection methods?

Embedded feature selection methods involve integrating feature selection into the model training process itself. Common techniques include:
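* L1 Regularization (Lasso): Penalizes the absolute size of the coefficients, which drives the coefficients of uninformative features to exactly zero and so yields a sparse model.
* Decision Tree-based methods: Decision trees, and ensembles such as Random Forest or Gradient Boosting, compute feature importance scores as a by-product of training.
* Recursive Feature Elimination (RFE): Iteratively removes the least important features according to the model's coefficients or importances. Because it refits the model on shrinking feature subsets, it is more often classified as a wrapper method.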

Example:
Suppose we have a dataset of houses with features 'size', 'location', and 'age'. We can use LASSO regression to automatically select relevant features.

In [3]:
import pandas as pd
from sklearn.linear_model import Lasso
data = pd.DataFrame({
    'size': [1500, 1800, 1200, 2000, 1600],
    'location': ['urban', 'suburban', 'urban', 'rural', 'suburban'],
    'age': [10, 5, 20, 2, 15],
    'price': [200000, 220000, 180000, 250000, 210000]
})
# One-hot encode the categorical 'location' column
data = pd.get_dummies(data, columns=['location'], drop_first=True)
x = data.drop('price', axis=1)
y = data['price']
lasso = Lasso(alpha=0.01)
lasso.fit(x, y)
print('Selected Features are:')
# Pair coefficients with the columns of x (data still contains the target 'price')
for feature, coef in zip(x.columns, lasso.coef_):
    if coef != 0:
        print(feature)

Selected Features are:
size
age
location_suburban
location_urban
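
With such a small `alpha` and only five rows, Lasso has little reason to zero anything out. The sketch below shows the sparsity effect more clearly. It assumes a synthetic dataset, standardized features, and a larger penalty (`alpha=5.0`), all chosen for illustration rather than taken from the house example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): 8 features, only 3 of them informative.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)

# Standardize first so the L1 penalty treats every feature on the same scale.
model = make_pipeline(StandardScaler(), Lasso(alpha=5.0))
model.fit(X, y)

coefs = model.named_steps['lasso'].coef_
kept = np.flatnonzero(coefs)  # indices of features with non-zero coefficients
print('Coefficients:', np.round(coefs, 2))
print('Kept feature indices:', kept)
```

Raising `alpha` removes more features and lowering it keeps more, so in practice the value is usually tuned by cross-validation.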


#### Q4. What are some drawbacks of using the Filter method for feature selection?

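* The Filter method scores each feature on its own, so it misses feature interactions: a feature that is only useful in combination with another can be discarded.
* It does not handle redundancy well: several highly correlated features can all score high and all be kept, even though they carry the same information.
* It is model-agnostic, so the selected features are not tailored to the learning algorithm and may not be the best subset for the model that is eventually trained.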
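
The interaction problem can be made concrete with a small, hypothetical example: a target that is the XOR of two binary features. Each feature alone is uncorrelated with the target, so a correlation filter would discard both, even though together they determine the target exactly.

```python
import pandas as pd

# Hypothetical toy data: y is the XOR of two balanced binary features.
df = pd.DataFrame({
    'x1': [0, 0, 1, 1] * 25,
    'x2': [0, 1, 0, 1] * 25,
})
df['y'] = df['x1'] ^ df['x2']

# Univariate filter scores: absolute correlation of each feature with y.
univariate = df[['x1', 'x2']].corrwith(df['y']).abs()
print(univariate)

# The two features combined reproduce the target exactly.
df['x1_xor_x2'] = (df['x1'] != df['x2']).astype(int)
joint = df['x1_xor_x2'].corr(df['y'])
print('Correlation of the combined feature with y:', joint)
```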

#### Q5. In which situations would you prefer using the Filter method over the Wrapper method for feature selection?

The Filter method is preferred when computational efficiency is important and when we want a quick initial assessment of feature importance without involving complex model training. It can also be useful for data exploration and preliminary insights.
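
A rough count shows why efficiency matters. The sketch below compares the number of scores a filter computes with the number of subsets an exhaustive wrapper search would have to evaluate. Exhaustive search is the worst case, and practical wrappers such as RFE or forward selection fit far fewer models. The helper names are hypothetical.

```python
def filter_scores(n_features: int) -> int:
    """A univariate filter computes one score per feature and fits no model."""
    return n_features


def exhaustive_subsets(n_features: int) -> int:
    """Non-empty feature subsets an exhaustive wrapper search would evaluate."""
    return 2 ** n_features - 1


for n in (10, 20, 30):
    print(f"{n} features: {filter_scores(n)} filter scores, "
          f"{exhaustive_subsets(n)} candidate subsets")
```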

#### Q6. In a telecom company, you are working on a project to develop a predictive model for customer churn. You are unsure of which features to include in the model because the dataset contains several different ones. Describe how you would choose the most pertinent attributes for the model using the Filter Method.

In this case, we could use the Filter method by calculating the correlation or mutual information between each feature and the target variable (churn). Features with a higher absolute correlation or a higher mutual information score are more likely to be relevant. We might also compute each feature's variance and drop near-constant features, since they carry little information. Based on these scores, we can rank the features and select the top ones as candidates for the model.

In [4]:
import pandas as pd
data = pd.DataFrame({
    'age': [25, 30, 40, 22, 28],
    'contract_length': [12, 24, 6, 18, 36],
    'total_usage': [100, 150, 80, 200, 120],
    'churn': [0, 1, 0, 1, 0]
})
# Absolute correlation of each feature with the target; exclude the target itself
cor = data.corr()['churn'].drop('churn').abs()
selected = cor[cor > 0.7].index.tolist()
print('Selected Features are', selected)

Selected Features are ['total_usage']
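
The variance and mutual information steps mentioned above can be sketched as follows. The churn data here is simulated: the column names, the constant `country_code` column, and the rule that generates `churn` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

rng = np.random.default_rng(0)
n = 500

# Hypothetical customer data; 'country_code' is constant and so uninformative.
X = pd.DataFrame({
    'total_usage': rng.normal(130, 40, n),
    'contract_length': rng.integers(1, 37, n).astype(float),
    'country_code': np.ones(n),
})
# In this toy setup churn depends on usage only (plus noise).
y = (X['total_usage'] + rng.normal(0, 20, n) > 150).astype(int)

# Step 1: drop zero-variance features.
vt = VarianceThreshold(threshold=0.0)
vt.fit(X)
kept_cols = X.columns[vt.get_support()]

# Step 2: rank the remaining features by mutual information with churn.
mi = mutual_info_classif(X[kept_cols], y, random_state=0)
ranking = pd.Series(mi, index=kept_cols).sort_values(ascending=False)
print(ranking)
```

Unlike Pearson correlation, mutual information can also pick up non-linear relationships between a feature and the churn label.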


#### Q7. You are working on a project to predict the outcome of a soccer match. You have a large dataset with many features, including player statistics and team rankings. Explain how you would use the Embedded method to select the most relevant features for the model.

In the Embedded method, we can incorporate feature selection during the model training process. For soccer match prediction, we could use a machine learning algorithm like Random Forest or Gradient Boosting, which inherently provide feature importance scores. By training these models on our dataset, we can extract the feature importance scores and identify which player statistics, team rankings, or other features contribute the most to predicting match outcomes. Features with higher importance scores are more relevant for the model.

In [5]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
data = pd.DataFrame({
    'player_stats': [10, 8, 5, 7, 9],
    'team_ranking': [3, 1, 5, 2, 4],
    'outcome': ['win', 'win', 'loss', 'win', 'loss']
})
x = data.drop('outcome', axis=1)
y = data['outcome']
# Fix the seed so the importance scores are reproducible
rf = RandomForestClassifier(random_state=42)
rf.fit(x, y)
importances = rf.feature_importances_
print('Selected Features are:')
# A non-zero importance only means the feature was used in at least one split;
# on a larger dataset we would rank by importance and keep the top features
for feature, importance in zip(x.columns, importances):
    if importance != 0:
        print(feature)

Selected Features are:
player_stats
team_ranking
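
With only two features, both end up selected. On a wider dataset, an importance threshold makes the selection meaningful. The sketch below uses scikit-learn's `SelectFromModel` with a median threshold; the synthetic dataset stands in for real match data and is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for match data (an assumption): 8 features, 3 informative.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Train the forest and keep features whose importance is at least the median.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0),
    threshold='median',
)
selector.fit(X, y)

importances = selector.estimator_.feature_importances_
mask = selector.get_support()
X_reduced = selector.transform(X)
print('Importances:', importances.round(3))
print('Kept features:', mask.nonzero()[0])
```

The threshold is a design choice: `'median'` keeps roughly half of the features, while a numeric value or `'mean'` gives a different cut.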


#### Q8. You are working on a project to predict the price of a house based on its features, such as size, location, and age. You have a limited number of features, and you want to ensure that you select the most important ones for the model. Explain how you would use the Wrapper method to select the best set of features for the predictor.

In the Wrapper method, we can employ techniques like Recursive Feature Elimination (RFE) or forward/backward selection. With RFE or backward selection, we start from the full feature set, train the predictive model, remove the least important feature, and repeat. Forward selection instead starts from an empty set and adds the feature that improves the model most. After each iteration, we assess the model's performance using cross-validation or a separate validation set, and we stop once a stopping criterion is met, such as reaching a target number of features or seeing no further improvement. This way, the Wrapper method helps us find the subset of features that maximizes the model's predictive performance.

In [6]:
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression as lr
data = pd.DataFrame({
    'size': [1500, 1800, 1200, 2000, 1600],
    'location': ['urban', 'suburban', 'urban', 'rural', 'suburban'],
    'age': [10, 5, 20, 2, 15],
    'price': [200000, 220000, 180000, 250000, 210000]
})
data = pd.get_dummies(data, columns=['location'], drop_first=True)
x = data.drop('price', axis=1)
y = data['price']
rfe = RFE(lr(), n_features_to_select=2)
rfe.fit(x, y)
# Report the names of the retained columns rather than the transformed array
selected = x.columns[rfe.support_].tolist()
# The features are unscaled, so RFE ranks them by raw coefficient magnitude;
# standardizing them first would make the ranking comparable across features
print('Selected Features are:', selected)

Selected Features are: ['location_suburban', 'location_urban']
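
The cross-validated stopping criterion described above is available in scikit-learn as `RFECV`, which chooses the number of features instead of requiring it up front. This is a minimal sketch on synthetic data (an assumption). Scaling the whole matrix before cross-validation is a simplification that a real project would move inside the validation loop.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for house data (an assumption): 6 features, 3 informative.
X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       noise=10.0, random_state=0)

# Put features on a common scale so coefficient magnitudes are comparable.
X_scaled = StandardScaler().fit_transform(X)

# Eliminate one feature per step and pick the subset size with the best
# 5-fold cross-validated R^2.
rfecv = RFECV(LinearRegression(), step=1, cv=5, scoring='r2')
rfecv.fit(X_scaled, y)

print('Number of features chosen:', rfecv.n_features_)
print('Feature mask:', rfecv.support_)
```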
