# Feature Engineering

Feature engineering is one of the most important steps in any machine learning project. It involves selecting the columns relevant for modeling and creating new features from the existing data that give the model more information.

In [1]:
# Import necessary packages
import pandas as pd
import datetime
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

Feature engineering assumes that the data has already been cleaned and that the data types of the columns have been corrected. In other words, the data should be preprocessed first.

In [2]:
# Load the data
df = pd.read_csv('./../../data/preprocessed.csv', parse_dates=True)
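In `pd.read_csv`, `parse_dates=True` only attempts to parse the index as dates, so with a default integer index the date columns are still loaded as text and have to be converted afterwards. A minimal sketch on an in-memory CSV (the two rows are made up for illustration and are not taken from the real file) contrasts that call with passing the column names explicitly:

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the preprocessed file
csv_text = (
    "id,created_at,launched_at\n"
    "1,2015-01-01 10:00:00,2015-01-05 12:00:00\n"
    "2,2015-02-10 08:30:00,2015-02-12 09:00:00\n"
)

# parse_dates=True: only the index is considered, so the columns stay as text
as_text = pd.read_csv(io.StringIO(csv_text), parse_dates=True)

# A list of column names: those columns are parsed while loading
as_dates = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["created_at", "launched_at"]
)

text_is_datetime = pd.api.types.is_datetime64_any_dtype(as_text["created_at"])
dates_is_datetime = pd.api.types.is_datetime64_any_dtype(as_dates["created_at"])
print(text_is_datetime, dates_is_datetime)
```

Either route works; converting the columns in a separate step keeps the load call simple, while listing the columns in `parse_dates` does the conversion in one place.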

## Feature Creation

A model performs better when it is given an appropriate amount of information in a form it can use directly. Models struggle to pick up indirect relationships, so new features should be created in a way that turns indirect information into direct information. For example, raw date columns carry little direct information, but the gap between two dates can be an important signal for the model.

For the data set under consideration, the following new features can be created to provide more direct and relevant information to the model:

|New feature column name|Description|
|:--|:--|
|created_to_launch|Gap in days between the creation and the launch of the project|
|launch_to_deadline|Gap in days between the launch and the deadline of the project|
|launch_to_state_change|Gap in days between the launch and the change in state of the project|
|is_mon_tue|Whether the project was launched on a Monday or Tuesday. Projects launched on these weekdays were observed to have a higher chance of success|

In [3]:
# Correct the data type of date variables

# Select the date variables
date_variables = ['created_at', 'launched_at', 'state_changed_at', 'deadline']

# Change the data type
df[date_variables] = df[date_variables].apply(pd.to_datetime)

In [4]:
# Creation of new columns from the date columns available in the data
df['created_to_launch'] = df['launched_at'] - df['created_at']
df['launch_to_deadline'] = df['deadline'] - df['launched_at']
df['launch_to_state_change'] = df['state_changed_at'] - df['launched_at']

# Changing the gap data to days
df['created_to_launch'] = df['created_to_launch'] / datetime.timedelta(days=1)
df['launch_to_deadline'] = df['launch_to_deadline'] / datetime.timedelta(days=1)
df['launch_to_state_change'] = df['launch_to_state_change'] / datetime.timedelta(days=1)
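Dividing a timedelta column by a one-day `timedelta` gives the gap as fractional days. The sketch below, on two made-up launch/deadline pairs, shows this next to an equivalent spelling and next to `.dt.days`, which keeps only whole days:

```python
import datetime

import pandas as pd

# Hypothetical launch/deadline pairs for illustration
toy = pd.DataFrame({
    "launched_at": pd.to_datetime(["2015-01-01 00:00", "2015-01-01 12:00"]),
    "deadline": pd.to_datetime(["2015-01-31 00:00", "2015-01-03 00:00"]),
})

gap = toy["deadline"] - toy["launched_at"]      # timedelta Series

gap_days = gap / datetime.timedelta(days=1)      # fractional days as floats
gap_days_alt = gap.dt.total_seconds() / 86400    # same values, different spelling
whole_days = gap.dt.days                         # integer days, fraction discarded

print(gap_days.tolist(), whole_days.tolist())
```

Keeping the fractional part preserves differences between projects whose gaps are shorter than a day.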

In [5]:
# Check whether the project is launched on Monday or Tuesday
df['is_mon_tue'] = df['launched_at'].apply(lambda x: 1 if x.weekday() in [0, 1] else 0)
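The same flag can be computed without a row-wise `apply`. The following sketch uses four made-up dates to compare the row-wise version from the cell above with a vectorized one based on the `.dt.weekday` accessor, where 0 is Monday and 1 is Tuesday:

```python
import pandas as pd

# 2024-01-01 is a Monday, so these four dates are Mon, Tue, Wed, Sun
launched_at = pd.Series(
    pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-07"])
)

# Row-wise version
flag_apply = launched_at.apply(lambda x: 1 if x.weekday() in [0, 1] else 0)

# Vectorized version
flag_vectorized = launched_at.dt.weekday.isin([0, 1]).astype(int)

print(flag_vectorized.tolist())
```

Both versions give the same result; the vectorized form is usually faster on large frames.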

In [6]:
# Get the shape of data
print(f"The shape of engineered data is: {df.shape[0]} rows and {df.shape[1]} columns")

The shape of engineered data is: 14862 rows and 75 columns


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14862 entries, 0 to 14861
Data columns (total 75 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   index                   14862 non-null  int64         
 1   id                      14862 non-null  int64         
 2   name                    14862 non-null  object        
 3   blurb                   14862 non-null  object        
 4   goal                    14862 non-null  float64       
 5   pledged                 14862 non-null  float64       
 6   state                   14862 non-null  int64         
 7   slug                    14862 non-null  object        
 8   disable_communication   14862 non-null  int64         
 9   currency_trailing_code  14862 non-null  int64         
 10  deadline                14862 non-null  datetime64[ns]
 11  state_changed_at        14862 non-null  datetime64[ns]
 12  created_at              14862 non-null  datetime64[ns]
 ...

## Feature Selection

Now that all the required columns have been created or processed, the data set contains 75 columns. This many columns can increase processing time and expose the model to the curse of dimensionality, so only the relevant columns should be kept and the unnecessary ones dropped.
Features with very low variance are removed as well. The cutoff used here is a variance of 0.01. A feature below this cutoff is almost constant across the data set, so it gives the model very little to learn from and can make it harder for some algorithms to converge.

Columns that will be dropped are:
1. *index*
2. *id*
3. *name*
4. *blurb*
5. *pledged*
6. *slug*
7. *deadline*
8. *state_changed_at*
9. *created_at*
10. *launched_at*
11. *usd_pledged*
12. *spotlight*
13. *source_url*
14. *name_len*
15. *blurb_len*

In [8]:
# Keep only relevant columns
drop_columns = ['index', 'id', 'name', 'blurb', 'pledged', 'slug', 'deadline', 'state_changed_at', 'created_at', 'launched_at',
                'usd_pledged', 'spotlight', 'source_url', 'name_len', 'blurb_len']
df = df.drop(columns=drop_columns)

In [9]:
# Remove all low-variance features

# Declare the selector
selector = VarianceThreshold(threshold=0.01) # Keep only features whose variance is above 0.01

# Use selector to subset the data
data = selector.fit_transform(df)

The following features remain after dropping the low-variance columns:

In [10]:
# Get the list of features that are selected
selector.get_feature_names_out().tolist()

['goal',
 'state',
 'currency_trailing_code',
 'staff_pick',
 'backers_count',
 'static_usd_rate',
 'name_len_clean',
 'blurb_len_clean',
 'country_AU',
 'country_CA',
 'country_DE',
 'country_FR',
 'country_GB',
 'country_IT',
 'country_NL',
 'country_US',
 'currency_CAD',
 'currency_EUR',
 'currency_GBP',
 'currency_USD',
 'category_Experimental',
 'category_Festivals',
 'category_Flight',
 'category_Gadgets',
 'category_Hardware',
 'category_Immersive',
 'category_Makerspaces',
 'category_Musical',
 'category_Plays',
 'category_Robots',
 'category_Software',
 'category_Sound',
 'category_Wearables',
 'category_Web',
 'created_to_launch',
 'launch_to_deadline',
 'launch_to_state_change',
 'is_mon_tue']

In [11]:
# Store the result
result = pd.DataFrame(data, columns=selector.get_feature_names_out().tolist())

## Standardizing the Data

One of the last preprocessing steps is standardizing the data. With this technique, each selected feature is rescaled so that its values have a mean of 0 and a standard deviation of 1.

```{note}
Only numerical features should be scaled.
```
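Standardization replaces each value `x` with `(x - mean) / std`, computed per column. As a rough illustration on made-up values (not taken from the data set), the sketch below checks that `StandardScaler` matches this manual calculation:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values for two of the numerical columns
toy = pd.DataFrame({
    "goal": [1_000.0, 5_000.0, 20_000.0, 50_000.0],
    "backers_count": [3.0, 40.0, 120.0, 7.0],
})

toy_scaler = StandardScaler()
scaled = toy_scaler.fit_transform(toy)

# Manual z-score; ddof=0 matches the population standard deviation
# that StandardScaler uses
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Note that pandas defaults to the sample standard deviation (`ddof=1`), so the manual version needs `ddof=0` to agree with the scaler.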

In [12]:
# Declare the scaler object
scaler = StandardScaler()

# Select features to scale
features_to_scale = ['goal', 'backers_count', 'name_len_clean', 'blurb_len_clean']

# Use the scaler object to scale the data
data = scaler.fit_transform(result[features_to_scale])

In [13]:
# Store the results appropriately
scaled_result = pd.DataFrame(data, columns=features_to_scale)
final_result = result.copy()
final_result[features_to_scale] = scaled_result[features_to_scale]
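Assigning one DataFrame's columns into another aligns rows by index label. That works here because both frames carry the default 0 to n-1 index. The sketch below, on a made-up frame with a different index, shows what would happen otherwise and one way to guard against it:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical frame whose index does not start at 0
frame = pd.DataFrame(
    {"goal": [100.0, 200.0, 300.0], "is_mon_tue": [1, 0, 1]},
    index=[10, 11, 12],
)
cols = ["goal"]

scaled_values = StandardScaler().fit_transform(frame[cols])

# A new DataFrame gets a fresh 0..n-1 index, so aligned assignment
# finds no matching labels and fills the column with NaN
misaligned = frame.copy()
misaligned[cols] = pd.DataFrame(scaled_values, columns=cols)

# Reusing the original index keeps every row in place
aligned = frame.copy()
aligned[cols] = pd.DataFrame(scaled_values, columns=cols, index=frame.index)

print(misaligned["goal"].isna().all(), aligned["goal"].notna().all())
```

Passing `index=` when wrapping the scaled array is a cheap safeguard if the frame is ever filtered or re-indexed before scaling.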

In [14]:
# Save the result to a file
final_result.to_csv('./../../data/engineered_data.csv', index=False)
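Whether the index is written to the file affects which columns appear when the file is loaded again. A small sketch with an in-memory buffer and made-up values illustrates the difference between the default behaviour of `to_csv` and `index=False`:

```python
import io

import pandas as pd

toy = pd.DataFrame({"goal": [0.5, -0.5], "is_mon_tue": [1, 0]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())
print(reloaded_without_index.columns.tolist())
```

An extra index column written this way would have to be dropped again in a later step, much like the *index* column removed during feature selection.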