# Comprehensive Feature Engineering Tutorial

This notebook demonstrates various feature engineering techniques using a sample dataset. We'll go through each step of the process, explaining the concepts and showing their effects on the data.

In [None]:
# First, let's import the necessary libraries. 
# pandas is used for data manipulation and analysis.
# numpy is used for numerical operations.
# sklearn provides machine learning tools.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA

## Creating the Sample Dataset

Let's create a sample dataset to work with. This dataset represents information about individuals, including their age, income, education level, car ownership, and credit score.

In [None]:
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, 50, 55, 60],
    'income': [30000, 45000, 50000, 60000, 70000, 80000, 85000, 90000],
    'education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master', 'High School', 'PhD'],
    'has_car': [True, False, True, True, False, True, False, True],
    'credit_score': [650, 700, np.nan, 800, 750, np.nan, 600, 850]
})

print("Original Data:")
print(data)

## 1. Feature Creation

Feature creation derives new features from existing ones. Here we create 'income_per_age' by dividing income by age. The ratio expresses income relative to age, so it can distinguish someone who reaches a given income early from someone who reaches it later.

In [None]:
data['income_per_age'] = data['income'] / data['age']
print("\n1. Feature Creation")
print(data['income_per_age'])
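
A ratio feature breaks down when the denominator is zero. The sketch below shows one way to guard against that. The `people` frame and its zero-age row are hypothetical and exist only to show the edge case; they are not part of the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the row with age 0 is included only to show the edge case.
people = pd.DataFrame({
    "age": [25, 40, 0],
    "income": [30000, 60000, 50000],
})

# Treat a zero denominator as missing so the ratio becomes NaN instead of inf.
safe_age = people["age"].replace(0, np.nan)
people["income_per_age"] = people["income"] / safe_age

print(people)
```

The NaN produced here can then be handled by whichever imputation strategy the rest of the workflow uses.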

## 2. Feature Transformation

Feature transformation changes the scale or distribution of a feature. Here we apply a natural-log transformation to 'income'. This is useful for right-skewed data, or when the effect of income is expected to be multiplicative rather than additive. Note that np.log requires strictly positive values.

In [None]:
data['log_income'] = np.log(data['income'])
print("\n2. Feature Transformation")
print(data['log_income'])
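
To check whether a log transform helps with skew, one option is to compare the sample skewness before and after. The sketch below uses a hypothetical income series with one very large value (not the tutorial's data) and also shows `np.log1p`, which is a common choice when zeros can occur.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed incomes with one very large value.
income = pd.Series([20000, 25000, 30000, 35000, 40000, 60000, 250000])
log_income = np.log(income)

# pandas' skew() returns the sample skewness; values nearer 0 are more symmetric.
raw_skew = income.skew()
log_skew = log_income.skew()
print("skew before:", round(raw_skew, 2))
print("skew after: ", round(log_skew, 2))

# log1p(x) = log(1 + x), so a zero maps to 0.0 instead of -inf.
print(np.log1p(0.0))
```

The transform is invertible with `np.exp` (or `np.expm1` for `np.log1p`), so predictions made on the log scale can be mapped back to the original units.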

## 3. Handling Categorical Variables

Most machine learning models require numerical input. One-hot encoding converts a categorical variable into one binary column per category, so that no artificial ordering is imposed on the categories.

In [None]:
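# sparse_output=False returns a dense array (scikit-learn >= 1.2; older releases call this parameter `sparse`).
encoder = OneHotEncoder(sparse_output=False)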
education_encoded = encoder.fit_transform(data[['education']])
education_df = pd.DataFrame(education_encoded, columns=encoder.get_feature_names_out(['education']))
data = pd.concat([data, education_df], axis=1)
print("\n3. Handling Categorical Variables")
print(data[encoder.get_feature_names_out(['education'])])

## 4. Feature Scaling

Feature scaling puts features on a comparable range so that no feature dominates only because of its units. StandardScaler subtracts each feature's mean and divides by its standard deviation, giving the feature zero mean and unit variance.

In [None]:
scaler = StandardScaler()
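# fit_transform returns a 2-D array of shape (n, 1); ravel() flattens it into a single column.
data['scaled_age'] = scaler.fit_transform(data[['age']]).ravel()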
print("\n4. Feature Scaling")
print(data['scaled_age'])
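
The standardization can be reproduced by hand as z = (x - mean) / std. The sketch below is a minimal check, assuming the same age values as the sample dataset, that the manual formula matches StandardScaler when the population standard deviation (`ddof=0`) is used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

age = np.array([25, 30, 35, 40, 45, 50, 55, 60], dtype=float)

# z = (x - mean) / std, using the population standard deviation (ddof=0).
manual = (age - age.mean()) / age.std(ddof=0)

# StandardScaler expects a 2-D input, hence the reshape; ravel() flattens the result.
scaled = StandardScaler().fit_transform(age.reshape(-1, 1)).ravel()

print(np.allclose(manual, scaled))  # True
```

Note that pandas' `Series.std()` defaults to `ddof=1`, so a manual calculation with pandas defaults would differ slightly from StandardScaler.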

## 5. Handling Missing Values

Missing values can be problematic for many machine learning algorithms. Imputation is the process of replacing missing data with substituted values. Here, we're using mean imputation, which replaces missing values with the mean of the column.

In [None]:
imputer = SimpleImputer(strategy='mean')
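# fit_transform returns a 2-D array of shape (n, 1); ravel() flattens it into a single column.
data['credit_score_imputed'] = imputer.fit_transform(data[['credit_score']]).ravel()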
print("\n5. Handling Missing Values")
print(data['credit_score_imputed'])
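
Mean imputation hides which rows were originally missing. One option, sketched below with the same credit scores as the sample dataset, is to ask SimpleImputer for an extra indicator column via `add_indicator=True`. Whether that flag is useful depends on the model, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
from sklearn.impute import SimpleImputer

credit_score = np.array(
    [[650], [700], [np.nan], [800], [750], [np.nan], [600], [850]]
)

# add_indicator=True appends a 0/1 column marking the originally missing rows.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
result = imputer.fit_transform(credit_score)

imputed = result[:, 0]      # missing entries replaced by the column mean
was_missing = result[:, 1]  # 1.0 where the original value was NaN

# The learned fill value: the mean of the six observed scores, 4350 / 6 = 725.
print(imputer.statistics_)
print(was_missing)
```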

## 6. Feature Selection

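Feature selection picks a subset of relevant features for model construction. SelectKBest keeps the k features with the highest scores. Here the score function is f_regression, which computes an F-statistic for the linear relationship between each feature and the target in a regression task. The target used below, 'income_per_age', is itself derived from age and income, so the result only illustrates the API.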

In [None]:
X = data[['age', 'income', 'credit_score_imputed']]
y = data['income_per_age']
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()].tolist()
print("\n6. Feature Selection")
print("Selected features:", selected_features)
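
The list of selected names does not show how close the decision was. The sketch below rebuilds the same three features (with the two missing credit scores already filled with the mean, 725) and tabulates the F-score and p-value per feature. The `report` frame is a hypothetical helper for display only.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

X = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50, 55, 60],
    "income": [30000, 45000, 50000, 60000, 70000, 80000, 85000, 90000],
    "credit_score_imputed": [650, 700, 725, 800, 750, 725, 600, 850],
})
y = X["income"] / X["age"]  # same target as in the cell above

selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# One row per feature: univariate F-score, its p-value, and whether it was kept.
report = pd.DataFrame({
    "feature": X.columns,
    "f_score": selector.scores_,
    "p_value": selector.pvalues_,
    "selected": selector.get_support(),
}).sort_values("f_score", ascending=False)
print(report)
```

With only eight rows the p-values are unstable, so the table is best read as a demonstration of the attributes rather than as evidence about the features.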

## 7. Dimensionality Reduction

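Dimensionality reduction reduces the number of features while retaining as much information as possible. Principal Component Analysis (PCA) projects the data onto orthogonal directions (principal components) ordered by how much variance they capture. PCA is sensitive to feature scale, so features are usually standardized first; the cell below applies it to the unscaled values for simplicity.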

In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("\n7. Dimensionality Reduction")
print("PCA components shape:", X_pca.shape)
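
To see how much scale matters for PCA, one can compare the explained variance ratio with and without standardization. The sketch below rebuilds the same three features (credit scores already mean-imputed) and fits PCA both ways. The variable names `pca_raw` and `pca_scaled` are introduced here for illustration.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50, 55, 60],
    "income": [30000, 45000, 50000, 60000, 70000, 80000, 85000, 90000],
    "credit_score_imputed": [650, 700, 725, 800, 750, 725, 600, 850],
})

# PCA on raw values: income has by far the largest variance, so it dominates.
pca_raw = PCA(n_components=2).fit(X)

# PCA after standardization: each feature contributes on an equal footing.
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=2).fit(X_scaled)

print("raw:   ", pca_raw.explained_variance_ratio_)
print("scaled:", pca_scaled.explained_variance_ratio_)
```

On the raw data the first component captures almost all of the variance simply because income is measured in much larger units than age or credit score.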

## Final Processed Data

Let's take a look at our final processed dataset.

In [None]:
print("\nFinal Processed Data:")
print(data)

## Conclusion

This final dataset includes all our original features, plus:
- The created feature (income_per_age)
- The transformed feature (log_income)
- One-hot encoded education levels
- Scaled age
- Imputed credit score

We've also performed feature selection and dimensionality reduction. Their outputs (X_selected and X_pca) are stored separately from the DataFrame and can be used to decide which features to include in the final model.

Remember, the choice of which feature engineering techniques to use depends on your specific dataset and the requirements of your machine learning model.
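
Once the techniques have been chosen, they can be combined into a single preprocessing object. The sketch below is one possible arrangement, not the only one: it assumes scikit-learn's `ColumnTransformer` and `Pipeline`, applied to the same sample data, with imputation and scaling for the numeric columns and one-hot encoding for 'education'. The step names ("impute", "scale", "num", "cat") are arbitrary labels chosen here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50, 55, 60],
    "income": [30000, 45000, 50000, 60000, 70000, 80000, 85000, 90000],
    "education": ["High School", "Bachelor", "Master", "PhD",
                  "Bachelor", "Master", "High School", "PhD"],
    "credit_score": [650, 700, np.nan, 800, 750, np.nan, 600, 850],
})

# Numeric columns: fill missing values with the mean, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric, ["age", "income", "credit_score"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["education"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(data)
print(features.shape)  # (8, 7): 3 numeric columns + 4 education categories
```

Fitting the transformer on training data only and then calling `transform` on held-out data keeps statistics such as means and category lists from leaking across the split.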