# Feature Selection & Feature Engineering

This notebook covers fundamental methods used to develop ML algorithms, focusing on how to prepare data for an ML model (feature selection and feature engineering).

**Topics:**


1.   Feature Selection
2.   Feature Engineering
3.   Preparing Data



**Goals:**


1.   Understand how to select important features for an ML algorithm
2.   Understand how to clean data for an ML algorithm
3.   Become familiar with common methods for preparing data to train an ML model



## Import Packages

Importing packages is the first thing we do at the beginning of any script.



1.   **Pandas:** Working with datasets. Arguably the most widely used data-science Python package.
2.   **NumPy:** Scientific computing package for working with vectors & matrices.
3.   **Matplotlib:** Tool for dataset visualizations.
4.   **scikit-learn:** Open-source ML algorithms.

In [41]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

## Feature Selection & Feature Engineering

In this section we will cover how to prepare a dataset for an ML model.

Note - it is called data "science" for a reason. Each use-case can benefit from different methods and implementations; it is the job of the data scientist to experiment with these options and choose the best one.

Here we will be going over common methods utilized for feature selection & engineering.

#### **Feature Selection**

Feature selection is the process of deciding which features to keep and which to remove from the dataset, so that the ML model can learn effectively.

**Read-in Dataset**

In [44]:
data = pd.read_csv('https://raw.githubusercontent.com/j0sephsasson/Pepsi-Training-Course/main/datasets/heart.csv')
data.head()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


**Split data into X & y. (features & target)**

In [45]:
# Split into features/target
all_columns = list(data.columns)
target = all_columns.pop()
features = all_columns 

print(target)
print(features)

X = data[features]
y = data[target]

output
['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh', 'exng', 'oldpeak', 'slp', 'caa', 'thall']


**Examine null values in each feature**

In [10]:
# Check for null values
X.isnull().sum()

age         0
sex         0
cp          0
trtbps      0
chol        0
fbs         0
restecg     0
thalachh    0
exng        0
oldpeak     0
slp         0
caa         0
thall       0
dtype: int64

**Examine Cardinality.**

*Cardinality* is an important concept: it is the number of unique values in a particular feature.

In [11]:
# Check cardinality
def show_cardinality(frame):
  for col in frame.columns:
    cardinality = len(frame[col].unique())
    print(col)
    print(cardinality)
    print()

show_cardinality(X)

age
41

sex
2

cp
4

trtbps
49

chol
152

fbs
2

restecg
3

thalachh
91

exng
2

oldpeak
40

slp
3

caa
5

thall
4
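As a more compact view of the same idea, the sketch below uses pandas' `DataFrame.nunique()` and also expresses cardinality as a fraction of the row count. The `toy` frame and its values are made up for illustration and are not the heart dataset.

```python
import pandas as pd

# Small made-up frame (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "sex": [0, 1, 1, 0, 1, 0],
    "cp": [3, 2, 1, 1, 0, 3],
    "chol": [233, 250, 204, 236, 354, 192],
})

# Number of unique values per column
cardinality = toy.nunique()

# Cardinality relative to the number of rows (1.0 means every value is unique)
unique_ratio = cardinality / len(toy)

summary = pd.DataFrame({"cardinality": cardinality, "unique_ratio": unique_ratio})
print(summary)
```

Note that `nunique()` ignores null values by default, while `len(frame[col].unique())` counts a null as one more value.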



**Perform Feature Selection**

Now we will define a function that removes any feature for which at least one of the following is true:

*   Cardinality above 85% of the number of rows (almost every value is unique)
*   Cardinality below 0.1% of the number of rows (basically every value is the same)
*   Null value count above 60% of the number of rows (most values are null)



In [46]:
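def feature_selection(data, features):
    """
    This function drops features with very high cardinality,
    very low cardinality, or mostly null values

    Params:
      -- 'data': pd.DataFrame
      -- 'features': list[str]
      
    Returns:
      -- list of selected features 
    """

    # Thresholds expressed as row counts: the percentile of the row
    # positions 0..len(data)-1 is the position at that share of the rows
    row_positions = range(0, len(data))
    max_cardinality = np.percentile(row_positions, 85)
    min_cardinality = np.percentile(row_positions, 0.1)
    max_null_count = np.percentile(row_positions, 60)

    good_features = []

    for f in features:
      cardinality = len(data[f].unique())
      null_count = data[f].isnull().sum()

      if cardinality > max_cardinality:
        continue
      elif cardinality < min_cardinality:
        continue
      elif null_count > max_null_count:
        continue
      else:
        good_features.append(f)

    return good_features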

good_features = feature_selection(X, features)
print('Removed Features: ', [i for i in features if i not in good_features])
print(good_features)

Removed Features:  []
['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh', 'exng', 'oldpeak', 'slp', 'caa', 'thall']
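No feature is removed from the heart dataset, so the three rules are easier to see on a frame built to trigger each of them. The sketch below is a simplified variant, not the notebook's function: `select_by_ratio`, its parameter names, and the `toy` frame are hypothetical, and the thresholds are written as plain fractions of the row count.

```python
import numpy as np
import pandas as pd

def select_by_ratio(frame, max_unique=0.85, min_unique=0.001, max_null=0.60):
    """Keep columns whose unique-value ratio and null ratio are within bounds."""
    n_rows = len(frame)
    keep = []
    for col in frame.columns:
        unique_ratio = frame[col].nunique() / n_rows  # nulls are not counted
        null_ratio = frame[col].isnull().mean()       # share of null rows
        if unique_ratio > max_unique:
            continue  # almost every value is unique (e.g. an ID column)
        if unique_ratio < min_unique:
            continue  # almost every value is the same
        if null_ratio > max_null:
            continue  # most values are missing
        keep.append(col)
    return keep

n = 2000
toy = pd.DataFrame({
    "row_id": np.arange(n),                             # unique per row
    "constant": np.ones(n),                             # a single value
    "mostly_null": [np.nan] * 1400 + [1.0, 2.0] * 300,  # 70% null
    "age_group": np.arange(n) % 5,                      # 5 distinct values
})

kept = select_by_ratio(toy)
print(kept)
```

Only `age_group` survives here: `row_id` fails the high-cardinality rule, `constant` the low-cardinality rule, and `mostly_null` the null-count rule.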


#### **Feature Engineering**

Feature engineering is the process of cleaning selected features for the model.

We want to scale & normalize numerical data, as well as numerically encode categorical features so they can be normalized & scaled.

**Common steps (in order):**


1.   Impute null values
2.   Numerically encode categorical features
3.   Scale & normalize numeric features (they are all numbers at this point)





**1.   Impute Null Values**





*   For numerical data we will replace null values with the mean of the feature
*   For categorical data we will replace the null values with the most frequent value

These are very basic ways to impute a feature. Much more complex methods exist.



In [25]:
# initialize a SimpleImputer with strategy 'mean'
imputer = SimpleImputer(strategy='mean') # used with numerical features

test = pd.Series([1,2,2,4,5,6, np.nan, np.nan, np.nan, 7,8,9])
test_imputed = imputer.fit_transform(test.values.reshape(-1, 1))

print('BEFORE:')
print(test.values)

print()

print('AFTER:')
print(test_imputed)

BEFORE:
[ 1.  2.  2.  4.  5.  6. nan nan nan  7.  8.  9.]

AFTER:
[[1.        ]
 [2.        ]
 [2.        ]
 [4.        ]
 [5.        ]
 [6.        ]
 [4.88888889]
 [4.88888889]
 [4.88888889]
 [7.        ]
 [8.        ]
 [9.        ]]
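One small step beyond the mean is the median, which is less sensitive to extreme values. The sketch below compares the two `SimpleImputer` strategies on the same values as the cell above; it is only an illustration of the `strategy` parameter, not a recommendation for this dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same values as the example above, as a single-column 2D array
values = np.array(
    [1, 2, 2, 4, 5, 6, np.nan, np.nan, np.nan, 7, 8, 9], dtype=float
).reshape(-1, 1)

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

mean_filled = mean_imputer.fit_transform(values).ravel()
median_filled = median_imputer.fit_transform(values).ravel()

# 'statistics_' holds the fill value learned for each column
print("mean fill value:  ", mean_imputer.statistics_[0])
print("median fill value:", median_imputer.statistics_[0])
```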




**2.   Encode Categorical Data**





*   Each unique value is assigned an integer from 0 through the number of unique values minus 1 (`LabelEncoder` numbers the values in sorted order)



In [29]:
# initialize a LabelEncoder
encoder = LabelEncoder()

# initialize imputer with strategy 'most_frequent'
imputer = SimpleImputer(strategy='most_frequent') # for categorical features

test = pd.Series(['yessir','hello', 'goodbye', 'hello', 'pancakes',
                  np.nan, np.nan, np.nan])

# impute null vals
test_imputed = imputer.fit_transform(test.values.reshape(-1, 1))

# encode the imputed, categorical data
test_encoded = encoder.fit_transform(test_imputed.ravel())

print('BEFORE:')
print(test.values)

print()

print('AFTER:')
print(test_encoded)

BEFORE:
['yessir' 'hello' 'goodbye' 'hello' 'pancakes' nan nan nan]

AFTER:
[3 1 0 1 2 1 1 1]
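To see which integer each category received, the fitted encoder can be inspected. The sketch below rebuilds the mapping from the encoder's `classes_` attribute and reverses the encoding with `inverse_transform`; the `mapping` and `decoded` names are just illustrative.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["yessir", "hello", "goodbye", "hello", "pancakes"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, and each class's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Because the codes follow alphabetical order, they carry no real ranking between categories, which is worth keeping in mind once the encoded column is scaled.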





**3.   Scale & Normalize Data**





*   Standardizes a feature so that its mean is 0 and its standard deviation is 1. Each scaled value is known as a 'Z-Score'
*   Each new value tells us how many standard deviations the original value was from the mean





In [31]:
# initialize scaler
scaler = StandardScaler()

# turn encoded data back into pandas format
series = pd.Series(test_encoded)

# scale data
series = scaler.fit_transform(series.values.reshape(-1, 1))

print('ORIGINAL:')
print(test_encoded)

print()

print('SCALED:')
print(series)

ORIGINAL:
[3 1 0 1 2 1 1 1]

SCALED:
[[ 2.11057941]
 [-0.30151134]
 [-1.50755672]
 [-0.30151134]
 [ 0.90453403]
 [-0.30151134]
 [-0.30151134]
 [-0.30151134]]
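The scaled values can be reproduced by hand with the z-score formula, z = (x - mean) / std. The sketch below does this in plain NumPy, using the population standard deviation (NumPy's default, `ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# The encoded values from the example above
encoded = np.array([3, 1, 0, 1, 2, 1, 1, 1], dtype=float)

mean = encoded.mean()  # 1.25
std = encoded.std()    # population standard deviation (ddof=0)

# z-score: how many standard deviations each value is from the mean
z = (encoded - mean) / std
print(z)
```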


**Perform Feature Engineering**

Now we will define a function that will:


1.   Impute numerical values with the mean of that feature
2.   Impute categorical features with the most frequent value
3.   Encode categorical features to be numeric
4.   Scale & normalize the remaining data



In [47]:
def clean_dataset(frame, features):
  """
    This function performs feature engineering on a dataframe

    1. Imputing
    2. Encoding
    3. Scaling

    Params:
      -- 'frame': pd.DataFrame
      -- 'features': list[str]
      
    Returns:
      -- 'data': pd.DataFrame containing only the cleaned 'features' columns
    """

  # Keep only the selected features; copy so the input frame is not modified
  data = frame[features].copy()

  ## impute null values ##
  for col in features:

    # if we have category use most frequent
    if data[col].dtypes == 'O':
        imputer = SimpleImputer(strategy='most_frequent')
        data[col] = imputer.fit_transform(data[col].values.reshape(-1, 1)).ravel()

    # if we have number use mean
    elif data[col].dtypes == 'int64' or data[col].dtypes == 'float64':
        imputer = SimpleImputer(strategy='mean')
        data[col] = imputer.fit_transform(data[col].values.reshape(-1, 1)).ravel()

  ## encode categorical features ##
  for col in features:
    if data[col].dtypes == 'O':
      data[col] = LabelEncoder().fit_transform(data[col])

  ## scale numerical features (encoded categories are integers by now) ##
  for col in features:
    if data[col].dtypes == 'int64' or data[col].dtypes == 'float64':
      data[col] = StandardScaler().fit_transform(data[col].values.reshape(-1, 1)).ravel()

  return data

X_clean = clean_dataset(frame=X, features=good_features)

In [48]:
print('ORIGINAL')
print()
X.head()

ORIGINAL



Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [49]:
print('CLEANED')
print()
X_clean.head()

CLEANED



Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall
0,0.952197,0.681005,1.973123,0.763956,-0.256334,2.394438,-1.005832,0.015443,-0.696631,1.087338,-2.274579,-0.714429,-2.148873
1,-1.915313,0.681005,1.002577,-0.092738,0.072199,-0.417635,0.898962,1.633471,-0.696631,2.122573,-2.274579,-0.714429,-0.512922
2,-1.474158,-1.468418,0.032031,-0.092738,-0.816773,-0.417635,-1.005832,0.977514,-0.696631,0.310912,0.976352,-0.714429,-0.512922
3,0.180175,0.681005,0.032031,-0.663867,-0.198357,-0.417635,0.898962,1.239897,-0.696631,-0.206705,0.976352,-0.714429,-0.512922
4,0.290464,-1.468418,-0.938515,-0.663867,2.08205,-0.417635,0.898962,0.583939,1.435481,-0.379244,0.976352,-0.714429,-0.512922


## Prepare Dataset For Training

This section walks through setting up a pipeline for your dataset, so that the code is reproducible across any tabular dataset.

**Steps:**


1.   Create function to read-in data
2.   Create function to split data into X/y
3.   Use scikit-learn's 'train_test_split' to split data for train/test and ensure no "data-leakage"
4.   Create function for feature selection
5.   Create function for feature engineering



**Steps 1-3:**

In [50]:
## read in dataset
def read_dataframe(path):
  return pd.read_csv(path)

## split into X/y
def split_X_y(dataframe, target_col):
  columns = list(dataframe.columns)
  columns.remove(target_col)

  X = dataframe[columns]
  y = dataframe[target_col]

  return X, y

df = read_dataframe('https://raw.githubusercontent.com/j0sephsasson/Pepsi-Training-Course/main/datasets/heart.csv')

X, y = split_X_y(df, 'output')

## split into X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
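Without a fixed seed, `train_test_split` produces a different split on every run. A minimal sketch of a repeatable split follows, assuming a seed value of 0 (any integer works) and toy data in place of the heart dataset; `stratify` additionally keeps the class balance the same in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 rows with a balanced binary target
X_toy = pd.DataFrame({"feature": np.arange(20)})
y_toy = pd.Series([0, 1] * 10, name="output")

# random_state fixes the shuffle; stratify keeps the 50/50 class balance
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Running the same call again gives exactly the same rows
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(list(X_test_a.index) == list(X_test_b.index))
print(y_test_a.value_counts().to_dict())
```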

**Steps 4-5:**

In [51]:
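# Select features using only the training split, so test rows do not influence the choice
good_features = feature_selection(X_train, list(X_train.columns))

# Note: clean_dataset fits new imputers and scalers on whatever frame it receives,
# so each split below is imputed and scaled with its own statistics
X_train_clean = clean_dataset(X_train, good_features)
X_test_clean = clean_dataset(X_test, good_features)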
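A common way to guard against data leakage is to learn the imputation and scaling statistics from the training split only, then reuse them on the test split. The sketch below shows one way to do that with scikit-learn's `Pipeline`. It is an alternative to `clean_dataset`, not the notebook's method: the `train` and `test` frames are made up, and only numeric columns are handled.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numeric splits (hypothetical values)
train = pd.DataFrame({"chol": [200.0, 220.0, np.nan, 260.0],
                      "age": [40.0, 50.0, 60.0, 70.0]})
test = pd.DataFrame({"chol": [240.0, np.nan],
                     "age": [55.0, 45.0]})

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# fit_transform learns the means and standard deviations from the training data
train_clean = pd.DataFrame(preprocess.fit_transform(train),
                           columns=train.columns, index=train.index)

# transform reuses those training statistics; nothing is learned from the test data
test_clean = pd.DataFrame(preprocess.transform(test),
                          columns=test.columns, index=test.index)

print(train_clean)
print(test_clean)
```

With this approach a test value equal to the training mean is scaled to 0, regardless of what the other test rows contain.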

In [52]:
X_train_clean.head()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall
239,-2.171717,0.696177,-0.905478,-0.321525,0.649847,-0.424839,-1.026137,0.301775,1.371478,-0.91021,0.991814,-0.739105,1.127469
264,-0.060606,0.696177,-0.905478,-1.257572,-0.787381,-0.424839,-1.026137,-1.817584,1.371478,-0.91021,-0.619052,0.232979,-0.526152
146,-1.171717,-1.436416,1.008283,-0.789549,-0.106589,-0.424839,0.913903,-0.007298,-0.72914,-0.653385,-0.619052,0.232979,-0.526152
193,0.606061,0.696177,-0.905478,0.790032,0.649847,-0.424839,-1.026137,-0.316371,1.371478,1.48683,-0.619052,1.205062,1.127469
22,-1.393939,0.696177,-0.905478,0.497517,-0.409163,-0.424839,0.913903,1.273148,-0.72914,-0.91021,0.991814,-0.739105,-0.526152


In [53]:
X_test_clean.head()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall
144,2.404003,-1.608799,0.984892,0.412803,-0.940763,-0.388514,2.628928,-1.505418,-0.57104,0.137828,-0.779584,-0.619079,-0.463831
122,-1.361622,-1.608799,0.984892,-1.056088,0.583534,-0.388514,-0.93473,0.865494,1.75119,-0.846032,0.918796,-0.619079,-0.463831
9,0.359807,0.621582,0.984892,0.937407,-1.563363,-0.388514,0.847099,0.950169,-0.57104,0.585038,0.918796,-0.619079,-0.463831
179,0.359807,0.621582,-1.086777,0.937407,0.755285,-0.388514,-0.93473,-1.674768,1.75119,-0.309381,-0.779584,0.401565,-2.035704
267,-0.500908,0.621582,0.984892,-0.741326,-1.971273,-0.388514,-0.93473,-1.082041,-0.57104,-0.130497,0.918796,2.442852,-0.463831


In [55]:
y_train.head()

239    0
264    0
146    1
193    0
22     1
Name: output, dtype: int64

In [56]:
y_test.head()

144    1
122    1
9      1
179    0
267    0
Name: output, dtype: int64

## Conclusion

**Question 1**

What is cardinality?



1.   The number of null values in a feature
2.   The number of unique values in a feature
3.   The length of a feature



**Question 2**

Is high cardinality good or bad?



1.   Good thing
2.   Bad thing



**Question 3**

Which of the following is a common method for dealing with null values?



1.   Imputing
2.   Scaling
3.   Encoding



**Question 4**

Which of the following is a common method for normalizing numerical data?



1.   Using a Z-Score formula
2.   Removing numbers over a certain limit
3.   Adding 3 to each number



**Question 5 - Bonus!**

What is the output shape of multiplying two matrices (A * B), where:

Shape A = (2,2)

Shape B = (2,6)



1.   (2,4)
2.   (4,2)
3.   (2,6)

