[Home](../../README.md)

### Feature Engineering

This Jupyter Notebook is a selection of data engineering processes you can apply to your data before model training to maximise the performance of your machine learning model. For this demonstration we will engineer new or improved features for the diabetes data you previously wrangled.

#### Feature Engineering Process
- Deriving new variables from existing ones
    - Encoding categorical features
    - Calculating new features from existing features
- Combining features/feature interactions
- Identifying the most relevant features for the model
- Transforming Features
  - [Dividing Data into categories](https://web.ma.utexas.edu/users/mks/statmistakes/dividingcontinuousintocategories.html)
  - Mathematical transformations (for example logarithmic transformations). Logarithmic transformations are a powerful tool in the world of statistical analysis. They are often used to transform data that exhibit skewness or other irregularities, making it easier to analyze, visualize, and interpret the results.
- Creating domain-specific features that incorporate knowledge from the specific domain to capture important characteristics of the data.
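
The logarithmic transformation listed above can be sketched as follows. This is a minimal illustration, not part of the diabetes workflow: the values are hypothetical, and `np.log1p` is just one common choice of transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values; not taken from the diabetes data set
values = pd.Series([1, 2, 3, 5, 8, 120, 950], name="example")

# log1p computes log(1 + x), which stays defined when a value is 0
log_values = np.log1p(values)

print(values.skew())      # skewness before the transformation
print(log_values.skew())  # skewness after the transformation
```

For this sample the skewness drops after the transformation, which is the effect the bullet describes.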

#### Load the required dependencies

In [5]:
# Import frameworks
import pandas as pd

####  Store the data as a local variable

The data frame is a Pandas object that structures your tabular data into an appropriate format. It loads the complete data in memory so it is now ready for preprocessing.

In [4]:
data_frame = pd.read_csv("2.2.1.wrangled_data_NEW.csv")

####  Deriving new variables from existing ones

##### Encoding categorical variables

Data Encoding converts textual data into numerical format, so that it can be used as input for algorithms to process. The reason for encoding is that most machine learning algorithms work with numbers and not with text or categorical variables.

To encode the 'SEX' column you will assign a numeric value to each gender. Because the data set only provides two values, we will use -1 and 1.

In [6]:
data_frame['SEX'] = data_frame['SEX'].apply(lambda gender: -1 if gender.lower() == 'male' else 1 if gender.lower() == 'female' else None)
print(data_frame['SEX'].head())

0    1
1    1
2   -1
3   -1
4   -1
Name: SEX, dtype: int64
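
As an alternative sketch of the same encoding, a dictionary passed to `Series.map` keeps the mapping in one place and tolerates missing values. The sample frame and the names `sex_codes` and `SEX_encoded` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample; the real values come from the wrangled CSV
sample = pd.DataFrame({"SEX": ["Female", "female", "MALE", "Male", None]})

# Assumed mapping, mirroring the -1 / 1 scheme used above
sex_codes = {"male": -1, "female": 1}

# .str.lower() passes missing values through, and .map() turns any
# value that is not in the dictionary into NaN
sample["SEX_encoded"] = sample["SEX"].str.lower().map(sex_codes)

print(sample)
```

Unlike the `lambda` version, this does not fail on a missing value, although the resulting column becomes floating point when a NaN is present.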


##### Calculating Age

In the context of medical diagnosis of a lifestyle disease, a person's date of birth has limited influence on the target. However, their age is highly relevant, so we will convert the two dates (date of birth and date of test) into an age. You could consider further encoding this into age brackets.

In [11]:
# Convert the 'DoB' and 'DoT' columns to datetime
data_frame['DoB'] = pd.to_datetime(data_frame['DoB'], format='%d/%m/%Y')
data_frame['DoT'] = pd.to_datetime(data_frame['DoT'], format='%d/%m/%Y')

# Convert datetime to float (seconds since the Unix epoch)
data_frame['DoB_float'] = data_frame['DoB'].astype(int) / 10**9
data_frame['DoT_float'] = data_frame['DoT'].astype(int) / 10**9

# Calculate the year difference
data_frame['Age'] = ((data_frame['DoT'] - data_frame['DoB']).dt.days / 365.25).round()

# Print the result
print(data_frame[['DoB', 'DoT', 'DoB_float', 'DoT_float', 'Age']])

           DoB        DoT     DoB_float     DoT_float   Age
2   1981-03-11 2024-03-08  3.531168e+08  1.709856e+09  43.0
3   2002-05-15 2024-05-06  1.021421e+09  1.714954e+09  22.0
4   2001-06-21 2024-07-20  9.930816e+08  1.721434e+09  23.0
5   1998-06-10 2024-07-20  8.974368e+08  1.721434e+09  26.0
6   1991-01-22 2024-01-18  6.645024e+08  1.705536e+09  33.0
..         ...        ...           ...           ...   ...
428 1967-02-27 2024-02-17 -8.976960e+07  1.708128e+09  57.0
429 1969-01-21 2024-01-13 -2.980800e+07  1.705104e+09  55.0
431 1955-04-06 2024-04-12 -4.651776e+08  1.712880e+09  69.0
435 1987-06-13 2024-06-09  5.505408e+08  1.717891e+09  37.0
437 1989-08-19 2024-08-04  6.194880e+08  1.722730e+09  35.0

[234 rows x 5 columns]
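
The age brackets mentioned above could be sketched with `pd.cut`. The bracket edges, the labels and the `AgeBracket` column name are illustrative assumptions, not values taken from the data set.

```python
import pandas as pd

# Hypothetical ages; bracket edges and labels are illustrative assumptions
ages = pd.DataFrame({"Age": [22, 26, 43, 57, 69]})

bins = [0, 30, 45, 60, 120]
labels = ["<30", "30-44", "45-59", "60+"]

# right=False makes each bracket include its lower edge, e.g. [30, 45)
ages["AgeBracket"] = pd.cut(ages["Age"], bins=bins, labels=labels, right=False)

print(ages)
```

The result is a categorical column, which would itself need encoding before most models can use it.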


#### Combining features/feature interactions

While individual features can be powerful predictors, their interactions often carry even more information. Feature interaction engineering is the process of creating new features that represent the interaction between two or more features.

In this case, some domain knowledge and data analysis have informed you that BMI and age are risk multipliers (the greater the age and the greater the BMI, the greater the risk). From this we can calculate a risk value based on the feature interaction.

In [12]:
# Calculate the year difference and round to an integer
data_frame['Age'] = ((data_frame['DoT'] - data_frame['DoB']).dt.days / 365.25).round().astype(int)

# Create the 'Risk' column
data_frame['Risk'] = data_frame['BMI'] * data_frame['Age']

# Calculate the percentage of the maximum risk
data_frame['Risk%'] = (data_frame['Risk'] / data_frame['Risk'].max()).round(2)

# Print the result
print(data_frame[['Age', 'BMI', 'Risk%']])

     Age       BMI  Risk%
2     43  0.189655   0.14
3     22  0.193103   0.07
4     23  0.200000   0.08
5     26  0.200000   0.09
6     33  0.203448   0.12
..   ...       ...    ...
428   57  0.796552   0.80
429   55  0.813793   0.78
431   69  0.827586   1.00
435   37  0.872414   0.57
437   35  0.975862   0.60

[234 rows x 3 columns]
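
To check the arithmetic, the sketch below recomputes `Risk%` for three rows copied from the printed output. It assumes row 431 holds the largest risk in the full data set, which its `Risk%` of 1.00 suggests.

```python
import pandas as pd

# Three rows copied from the printed output above (index 2, 431 and 437)
rows = pd.DataFrame(
    {"Age": [43, 69, 35], "BMI": [0.189655, 0.827586, 0.975862]},
    index=[2, 431, 437],
)

# Interaction feature: product of the two risk multipliers
rows["Risk"] = rows["BMI"] * rows["Age"]

# Share of the largest risk; assumes row 431 is the maximum of the full data
rows["Risk%"] = (rows["Risk"] / rows["Risk"].max()).round(2)

print(rows)
```

Under that assumption the three rows reproduce the 0.14, 1.00 and 0.60 shown in the output.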


#### Transforming Features

Filtering is like applying the WHERE clause in a database query. It is widely used and helps when you need to work on a specific subset of your data. For our use case, we will filter the data to include only rows where 'SEX' is male, which is now encoded as -1. No method call is needed for this; conditional indexing is enough.

In this case, some domain knowledge and data analysis have informed you that there is 'bimodality' in the data: males and females follow different trends.

In [13]:
# Filter the data to -1 only
data_frame = data_frame[data_frame['SEX'] == -1]

# Print the result
print(data_frame[['Age', 'SEX', 'Target']])

     Age  SEX  Target
2     43   -1    90.0
3     22   -1   101.0
4     23   -1    85.0
5     26   -1    51.0
6     33   -1    72.0
..   ...  ...     ...
428   57   -1   270.0
429   55   -1   258.0
431   69   -1   237.0
435   37   -1   259.0
437   35   -1   346.0

[234 rows x 3 columns]
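
A minimal sketch of the same conditional indexing on a hypothetical sample is shown below. The explicit `.copy()` is a suggestion rather than part of the original cell: adding columns to a filtered frame can raise a `SettingWithCopyWarning` in some pandas versions, and copying the subset avoids that.

```python
import pandas as pd

# Hypothetical encoded sample (-1 = male, 1 = female, as encoded earlier)
sample = pd.DataFrame({"SEX": [1, -1, -1, 1], "Target": [151.0, 90.0, 101.0, 75.0]})

# A boolean mask plays the role of the WHERE clause
is_male = sample["SEX"] == -1

# .copy() makes the subset independent of the original frame, so new
# columns added to it never touch (or warn about) the original
males = sample.loc[is_male].copy()
males["TargetScaled"] = males["Target"] / males["Target"].max()

print(males)
```

Note that the filtered frame keeps the original row labels, which is why the printed outputs in this notebook do not start at index 0.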


#### Creating Domain-Specific Features

Domain knowledge is about understanding the domain or subject area of the data. In this case the domain is 'health', and more specifically 'epidemiology', which is the study of how often diseases occur in different groups of people and why.

The '1st Degree Relatives' column (`FDR`) is a domain-specific feature, as it records the number of family members in the individual's direct bloodline who have developed type 2 adult-onset diabetes. The relevant domain knowledge is that a family history of disease in first-degree relatives is a major risk factor, especially for premature events.

First we will convert the FDR value to a risk percentage. Because the risk can never be 0 (will never happen) or 100% (will definitely happen), we will scale the result between 0.15 and 0.85.

In [14]:
# Calculate the family history risk
data_frame['FHRisk'] = (data_frame['FDR'] / data_frame['FDR'].max())

# Scale the result between 0.15 and 0.85
min_val = 0.15
max_val = 0.85
data_frame['FHRisk'] = (((data_frame['FHRisk'] - data_frame['FHRisk'].min()) / (data_frame['FHRisk'].max() - data_frame['FHRisk'].min())) * (max_val - min_val) + min_val).round(2)

# Print the result
print(data_frame[['Age', 'FDR', 'FHRisk']])

     Age  FDR  FHRisk
2     43    2    0.62
3     22    2    0.62
4     23    2    0.62
5     26    2    0.62
6     33    0    0.15
..   ...  ...     ...
428   57    1    0.38
429   55    2    0.62
431   69    2    0.62
435   37    0    0.15
437   35    3    0.85

[234 rows x 3 columns]
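
The scaling step is a min-max rescale into a custom range, `(x - min) / (max - min) * (high - low) + low`. Since it is applied twice in this notebook, it could be wrapped in a small helper. `scale_to_range` is a hypothetical name, and the sample values assume `FDR` ranges from 0 to 3, as the output above suggests.

```python
import pandas as pd

def scale_to_range(series: pd.Series, low: float = 0.15, high: float = 0.85) -> pd.Series:
    """Min-max scale a series into [low, high] (hypothetical helper)."""
    span = series.max() - series.min()
    if span == 0:
        # All values identical: no spread to scale, so return the midpoint
        return pd.Series((low + high) / 2, index=series.index)
    return (series - series.min()) / span * (high - low) + low

fdr = pd.Series([2, 0, 1, 3], name="FDR")
print(scale_to_range(fdr).round(2).tolist())  # [0.62, 0.15, 0.38, 0.85]
```

Because min-max scaling is unaffected by first dividing by a positive constant, applying the helper directly to `FDR` gives the same result as dividing by the maximum beforehand.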


Then, to make it even more meaningful, we will combine it with the `Risk%` feature we engineered from the `Age` and `BMI` features to create a combined risk 'interaction feature' that captures real-world relationships between the features.

Again we will scale the result between 0.15 and 0.85.

- Two risk factors are combined to show the overall risk of diabetes

In [15]:
data_frame['CombRisk'] = (data_frame['FHRisk'] * data_frame['Risk%']).round(2)

min_val = 0.15
max_val = 0.85
data_frame['CombRisk'] = (((data_frame['CombRisk'] - data_frame['CombRisk'].min()) / (data_frame['CombRisk'].max() - data_frame['CombRisk'].min())) * (max_val - min_val) + min_val).round(2)

# Print the result
print(data_frame[['Age', 'Risk%', 'FHRisk', 'CombRisk']])

     Age  Risk%  FHRisk  CombRisk
2     43   0.14    0.62      0.24
3     22   0.07    0.62      0.18
4     23   0.08    0.62      0.20
5     26   0.09    0.62      0.21
6     33   0.12    0.15      0.16
..   ...    ...     ...       ...
428   57   0.80    0.38      0.48
429   55   0.78    0.62      0.69
431   69   1.00    0.62      0.85
435   37   0.57    0.15      0.24
437   35   0.60    0.85      0.72

[234 rows x 4 columns]


#### Save the wrangled and engineered data to CSV

In [16]:
data_frame.to_csv('../2.3.Model_Training/2.3.1.model_ready_data_new.csv', index=False)