# Feature Selection & Feature Engineering - Day 5

This notebook covers fundamental methods used when developing ML algorithms, focusing on how to prepare data for an ML model (feature selection and feature engineering).

**Topics:**


1.   Feature Selection
2.   Feature Engineering
3.   Preparing Data



**Goals:**


1.   Understand how to select important features for an ML algorithm
2.   Understand how to clean data for an ML algorithm
3.   Become familiar with common methods for preparing data to train an ML model



## Import Packages

Importing packages is the first thing we do at the beginning of any script.



1.   **Pandas:** Working with datasets. Arguably the most widely used data-science Python package.
2.   **NumPy:** Scientific computing package for working with vectors & matrices.
3.   **Matplotlib:** Tool for dataset visualizations.
4.   **scikit-learn:** Open-source ML algorithms.

In [1]:
from sklearn.model_selection import train_test_split
import pandas as pd, numpy as np, matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer

## Feature Selection & Feature Engineering

In this section we will cover how to prepare a dataset for an ML model.

Note: it is called data "science" for a reason. Each use case can benefit from different methods and implementations, and it is the data scientist's job to experiment with these options and choose the best one.

Here we will be going over common methods utilized for feature selection & engineering.

#### **Feature Selection**

Feature selection is the process of selecting and removing features from the dataset, so that the ML model can learn in an optimal manner.

**Read-in Dataset**

In [1]:
# Read-in heart csv, show head
## TODO: YOUR CODE HERE ##
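
As a rough illustration of this step, the sketch below reads a small in-memory CSV instead of the real file. The column names and values are made up for the example (only `output` matches the target column used later in this notebook). With the real dataset you would pass its path or URL to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the heart CSV; values are illustrative only
csv_text = """age,sex,chol,output
63,1,233,1
37,1,250,1
41,0,204,0
56,1,236,0
"""

# pd.read_csv accepts a path, a URL, or any file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df.head())
```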

**Split data into X & y. (features & target)**

In [2]:
# Split into features/target
## TODO: YOUR CODE HERE ##
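
One possible way to do the split, shown on a toy frame. The names `toy_df`, `X_toy`, `y_toy`, and `features_toy` are hypothetical, and the target column name `output` is assumed from the pipeline section later in this notebook.

```python
import pandas as pd

# Toy data; values are illustrative only
toy_df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "output": [1, 1, 0, 0],
})

target_col = "output"  # assumed target column name

X_toy = toy_df.drop(columns=[target_col])  # features: every column except the target
y_toy = toy_df[target_col]                 # target: a single Series
features_toy = list(X_toy.columns)         # feature names, handy for later steps
```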

**Examine null values in each feature**

In [3]:
# Check for null values
## TODO: YOUR CODE HERE ##
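
A minimal sketch of a null check on a toy frame (the data is made up). `isnull().sum()` counts the nulls in each column, and `isnull().mean()` gives the same information as a fraction of rows, which is convenient when comparing against a percentage threshold.

```python
import numpy as np
import pandas as pd

# Toy data with some missing entries
X_toy = pd.DataFrame({
    "age": [63, np.nan, 41, 56],
    "sex": ["M", "F", None, None],
})

null_counts = X_toy.isnull().sum()      # number of nulls per column
null_fraction = X_toy.isnull().mean()   # share of nulls per column (0 to 1)
print(null_counts)
print(null_fraction)
```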

**Examine Cardinality.** 

This is an important concept: how many unique values does a particular feature contain?

This count is called the feature's *cardinality*.

In [4]:
# Check cardinality
def show_cardinality(frame):
    """
    This function calculates the cardinality
    of each feature and prints it
    
    Args:
        - 'frame': pd.DataFrame
    """
    ## TODO: YOUR CODE HERE ##

show_cardinality(X)
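
For reference, here is one way cardinality could be computed. This is a sketch, not the intended solution: `cardinality_table` is a hypothetical helper that returns a table rather than printing, and the toy columns are invented.

```python
import pandas as pd

def cardinality_table(frame):
    """Return unique-value counts and ratios per column (hypothetical helper)."""
    counts = frame.nunique()        # distinct non-null values per column
    ratios = counts / len(frame)    # distinct values as a share of the row count
    return pd.DataFrame({"unique": counts, "ratio": ratios})

X_toy = pd.DataFrame({
    "patient_id": [101, 102, 103, 104],  # every value is unique
    "sex": ["M", "F", "M", "M"],         # two unique values
    "site": ["A", "A", "A", "A"],        # a single unique value
})
print(cardinality_table(X_toy))
```

Expressing cardinality as a ratio makes it comparable across datasets of different sizes.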

**Perform Feature Selection**

Now we will define a function that removes any feature for which at least one of the following is true (cardinality is read here as the number of unique values divided by the number of rows):


*   Cardinality above 85% (extremely high amount of unique values)
*   Cardinality below 0.1% (basically every value is the same)
*   Null value count above 60% (most values are null)



In [5]:
def feature_selection(data, features):
    """
    This function drops features with very high or very low cardinality,
    or with too many null values

    Params:
      -- 'data': pd.DataFrame
      -- 'features': list[str]
      
    Returns:
      -- list of selected features 
    """

    ## TODO: YOUR CODE HERE ##

    
# Calling functions - do not edit!
good_features = feature_selection(X, features)
print('Removed Features: ', [i for i in features if i not in good_features])
print(good_features)

#### **Feature Engineering**

Feature engineering is the process of cleaning and transforming the selected features so that the model can use them.

We want to scale & normalize numerical data, and to numerically encode categorical features so that they can be scaled & normalized as well.

**Common steps (in order):**


1.   Impute null values
2.   Numerically encode categorical features
3.   Scale & normalize numeric features (they are all numbers at this point)





**1.   Impute Null Values**





*   For numerical data we will replace null values with the mean of the feature
*   For categorical data we will replace the null values with the most frequent value

These are very basic ways to impute a feature. Much more complex methods exist.
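
A small sketch of both strategies using scikit-learn's `SimpleImputer` (imported at the top of this notebook). The two toy series are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

ages = pd.Series([50.0, np.nan, 60.0, 70.0])
colors = pd.Series(["red", "blue", np.nan, "red"], dtype=object)

# SimpleImputer expects 2-D input, hence the reshape to a single column
mean_imputer = SimpleImputer(strategy="mean")
ages_filled = mean_imputer.fit_transform(ages.values.reshape(-1, 1)).ravel()

mode_imputer = SimpleImputer(strategy="most_frequent")
colors_filled = mode_imputer.fit_transform(colors.values.reshape(-1, 1)).ravel()

print(ages_filled)    # the missing age becomes the mean of 50, 60, 70
print(colors_filled)  # the missing color becomes "red"
```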



In [6]:
# Impute a series
## TODO: YOUR CODE HERE ##



**2.   Encode Categorical Data**





*   Each unique value is assigned an integer code. With scikit-learn's `LabelEncoder` (imported above), the codes run from 0 through the number of unique values minus 1
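
A short sketch of label encoding on made-up values. Note that scikit-learn's `LabelEncoder` sorts the unique values and numbers them starting from 0.

```python
from sklearn.preprocessing import LabelEncoder

chest_pain = ["typical", "atypical", "none", "typical"]  # illustrative values

encoder = LabelEncoder()
codes = encoder.fit_transform(chest_pain)

# classes_ holds the sorted unique values; each value's code is its index there
print(list(encoder.classes_))
print(list(codes))

# The mapping can be reversed when the original labels are needed again
print(list(encoder.inverse_transform(codes)))
```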



In [7]:
# Step 1: Impute categorical feature
## TODO: YOUR CODE HERE ##

# Step 2: Encode the imputed feature
## TODO: YOUR CODE HERE ##

# Step 3: Print
## TODO: YOUR CODE HERE ##




**3.   Scale & Normalize Data**





*   We standardize a feature so that its mean is 0 and its standard deviation is 1. The transformed value is known as a 'Z-Score'
*   Each new value tells us how many standard deviations the original value was from the mean
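
To make the Z-Score concrete, the sketch below computes it by hand and with `StandardScaler` on a made-up array, and the two results agree. `StandardScaler` uses the population standard deviation, which is also NumPy's default for `.std()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # mean 5, std 2

# Manual z-score: subtract the mean, divide by the standard deviation
manual = (values - values.mean()) / values.std()

# StandardScaler does the same per column; it expects 2-D input
scaled = StandardScaler().fit_transform(values.reshape(-1, 1)).ravel()

print(manual)
print(scaled)
```

For instance, the value 9 sits two standard deviations above the mean, so its z-score is 2.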





In [8]:
# Scale a numerical feature
## TODO: YOUR CODE HERE ##

**Perform Feature Engineering**

Now we will define a function that will:


1.   Impute numerical values with the mean of that feature
2.   Impute categorical features with the most frequent value
3.   Encode categorical features to be numeric
4.   Scale & normalize the remaining data
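
The four steps above might be combined roughly as in the sketch below. It is one possible reading rather than the intended solution: `clean_frame_sketch` is a hypothetical name, and it assumes that every column is scaled once the categorical ones have been encoded.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

def clean_frame_sketch(frame, features):
    """Impute, encode, and scale the given columns of one DataFrame."""
    data = frame[features].copy()
    for col in features:
        if pd.api.types.is_numeric_dtype(data[col]):
            # numeric: fill nulls with the column mean
            column = data[col].to_numpy(dtype=float).reshape(-1, 1)
            data[col] = SimpleImputer(strategy="mean").fit_transform(column).ravel()
        else:
            # categorical: fill nulls with the most frequent value, then encode
            column = data[col].to_numpy(dtype=object).reshape(-1, 1)
            filled = SimpleImputer(strategy="most_frequent").fit_transform(column).ravel()
            data[col] = LabelEncoder().fit_transform(filled)
    # every column is numeric now, so scale them all
    scaled = StandardScaler().fit_transform(data[features])
    return pd.DataFrame(scaled, columns=features, index=data.index)

toy = pd.DataFrame({
    "age": [50.0, np.nan, 60.0, 70.0],
    "sex": np.array(["M", "F", np.nan, "M"], dtype=object),
})
cleaned = clean_frame_sketch(toy, ["age", "sex"])
print(cleaned)
```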



In [9]:
def clean_dataset(frame, features):
    """
    This function performs feature engineering on a dataframe

    1. Imputing
    2. Encoding
    3. Scaling

    Params:
      -- 'frame': pd.DataFrame
      -- 'features': list[str]
      
    Returns:
      -- 'data': pd.DataFrame
    """

    ## TODO: YOUR CODE HERE ##

X_clean = clean_dataset(frame=X, features=good_features)

In [10]:
print('ORIGINAL')
print()
X.head()

In [11]:
print('CLEANED')
print()
X_clean.head()

## Prepare Dataset For Training

This section walks through setting up a pipeline for your dataset, so that the code can be reused on any tabular dataset.

**Steps:**


1.   Create function to read-in data
2.   Create function to split data into X/y
3.   Create function for feature selection (one from above)
4.   Use scikit-learn's 'train_test_split' to split the data into train/test sets before anything is fitted, which helps avoid "data-leakage"
5.   Create function(s) for feature engineering

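**Important** - We need to perform feature engineering on the train & test sets separately, but with the same Imputer, Encoder, & Scaler. These objects are fitted on the training set only and then applied to both sets, so that no information from the test set influences the transformation.

To do this we will create a dictionary that maps each feature name to the list of fitted objects used for it.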
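
A minimal sketch of the idea with a single scaler and made-up numbers. The feature name `some_feature` and the dictionary `fitted` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

scaler = StandardScaler()
scaler.fit(train)                      # statistics come from the training set only

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuses the training mean and std

# feature name -> list of fitted objects, in the order they should be applied
fitted = {"some_feature": [scaler]}

print(scaler.mean_)   # mean learned from the training rows
print(test_scaled)    # test value expressed in training-set standard deviations
```

Fitting a second scaler on the test set would instead map the single test value to 0, hiding how far it lies from the training distribution.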

**Steps 1-4:**

In [14]:
## read in dataset
def read_dataframe(path):
    ## TODO: YOUR CODE HERE ##
    pass
    
## split into X/y
def split_X_y(dataframe, target_col):
    ## TODO: YOUR CODE HERE ##
    pass



# Step 1: Read-in heart dataset
df = read_dataframe('https://raw.githubusercontent.com/j0sephsasson/Pepsi-Training-Course/main/datasets/heart.csv')

# Step 2: call split_X_y function
X, y = split_X_y(df, 'output')

# Step 3: Feature Selection on X
good_features = feature_selection(X, list(X.columns))
X = X[good_features].copy()

## Step 4: train_test_split (sklearn)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
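
As an optional variation, the sketch below shows two `train_test_split` parameters that are often useful: `random_state`, which makes the shuffle repeatable, and `stratify`, which keeps the class balance similar in both parts. The toy data and the value 42 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X_toy = pd.DataFrame({"age": np.arange(20), "chol": np.arange(20) * 10})
y_toy = pd.Series([0, 1] * 10, name="output")

# random_state fixes the shuffle so reruns give the same split;
# stratify keeps the share of each class the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
```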

**Step 5:**

In [28]:
# Feature Engineering - Part 1

def create_and_fit_objects(data, features):
    """
    This function fits the feature engineering objects to the training set
    
    Args:
        -- 'data': pd.DataFrame
        -- 'features': list[str]
        
    Returns:
        -- 'd': dictionary
    """
    
    d = {}
    
    for col in features:
            
        # if we have category use most frequent
        if data[col].dtypes == 'O':
            imputer = SimpleImputer(strategy='most_frequent')
            imputer.fit(data[col].values.reshape(-1, 1))
            d[col] = [imputer]
            
        # if we have number use mean
        elif data[col].dtypes == 'int64' or data[col].dtypes == 'float64':
            imputer = SimpleImputer(strategy='mean')
            imputer.fit(data[col].values.reshape(-1, 1))
            d[col] = [imputer]
            
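    ## encode categorical features (fit on the imputed values) ##
    for col in features:
        if data[col].dtypes == 'O':
            # impute first so the encoder only learns real categories
            imputed = d[col][0].transform(data[col].values.reshape(-1, 1)).ravel()
            enc = LabelEncoder()
            enc.fit(imputed)
            d[col].append(enc)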
        
    ## scale numerical features ##
    for col in features:
        if data[col].dtypes == 'int64' or data[col].dtypes == 'float64':
            scaler = StandardScaler()
            scaler.fit(data[col].values.reshape(-1, 1))
            d[col].append(scaler)
            
    return d

In [29]:
# Feature Engineering - Part 2

def clean_dataset(frame, features, object_dict):
    """
    This function performs feature engineering on a dataframe

    1. Imputing
    2. Encoding
    3. Scaling

    Params:
      -- 'frame': pd.DataFrame
      -- 'features': list[str]
      -- 'object_dict': dictionary
      
    Returns:
      -- 'data': pd.DataFrame
    """

    data = frame.copy()

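    # Apply each fitted object in order (imputer first, then encoder or scaler)
    for col in features:
        for transformer in object_dict[col]:
            values = data[col].values
            # LabelEncoder expects 1-D input; imputers and scalers expect 2-D
            if not isinstance(transformer, LabelEncoder):
                values = values.reshape(-1, 1)
            data[col] = np.ravel(transformer.transform(values))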

    return data

In [30]:
d = create_and_fit_objects(X_train, good_features)

X_train_clean = clean_dataset(X_train, good_features, d)
X_test_clean = clean_dataset(X_test, good_features, d)

In [15]:
X_train_clean.head()

In [16]:
X_test_clean.head()

In [17]:
y_train.head()

In [18]:
y_test.head()

## Conclusion

**Question 1**

What is cardinality?



1.   The number of null values in a feature
2.   The number of unique values in a feature
3.   The length of a feature



**Question 2**

Is high cardinality good or bad?



1.   Good thing
2.   Bad thing



**Question 3**

Which of the following is a common method for dealing with null values?



1.   Imputing
2.   Scaling
3.   Encoding



**Question 4**

Which of the following is a common method for normalizing numerical data?



1.   Using a Z-Score formula
2.   Removing numbers over a certain limit
3.   Adding 3 to each number



**Question 5 - Bonus!**

What is the output shape of multiplying two matrices (A * B), where:

Shape A = (2,2)

Shape B = (2,6)



1.   (2,4)
2.   (4,2)
3.   (2,6)

