# Advanced Feature Engineering
In this exercise we will see how feature engineering can improve a model's performance.
We will work with a Kaggle dataset of Kickstarter projects. If you don't know Kickstarter (shame on you!), your first assignment is to visit the <a href='https://www.kickstarter.com/'>Kickstarter</a> website and find a cool project.
Each record represents one project and some basic information about it. 

The dataset can be found <a href='https://www.kaggle.com/kemical/kickstarter-projects#'>here</a>.

In this exercise we will try to predict whether a project will be a success or not (binary classification).

Have fun :)

``` ~Lior Hirsch ```

```First, make sure all the following libraries are installed on your computer.```

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce
import itertools
from currency_converter import CurrencyConverter
import xgboost as xgb

In [2]:
random_state = 1
np.random.seed(random_state)

## Loading and cleaning the data
```The following code performs a simple cleaning of the dataset. In the original dataset the target column, state, takes several values. Instead, we will use a new binary column, output, derived from the state column.```

```Your first assignment is to create a train-test split that suits our dataset.```

#### Questions

```How can you test whether the train-test-split is good?```

```Explain your train-test split proposal. Why does it fit the dataset?```

In [3]:
def split(data):
    # FILL HERE
    # Chronological split: sort by launch date and keep the latest 30% as the test set
    return train_test_split(data.set_index("launched").sort_index(), test_size = 0.3, shuffle=False)
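
One possible way to approach the first question, as a rough sketch on toy data: with a chronological split, no training launch date should come after the earliest test launch date, and the success rate on both sides can be compared. The helper name `check_time_split` and the toy frame are assumptions made for illustration; they are not part of the exercise.

```python
import pandas as pd

def check_time_split(train, test, date_col="launched", target_col="output"):
    """Summarise a chronological split (hypothetical helper, not part of the exercise)."""
    return {
        # True when the latest training row is not later than the earliest test row
        "no_time_overlap": train[date_col].max() <= test[date_col].min(),
        # Share of positive labels on each side; a large gap hints at drift over time
        "train_positive_rate": train[target_col].mean(),
        "test_positive_rate": test[target_col].mean(),
    }

# Toy data standing in for the Kickstarter frame
toy = pd.DataFrame({
    "launched": pd.date_range("2017-01-01", periods=10, freq="D"),
    "output": [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
})
toy_train, toy_test = toy.iloc[:7], toy.iloc[7:]
report = check_time_split(toy_train, toy_test)
print(report)
```

The same helper could be called on `ks_train` and `ks_test` after `load_clean_split_datasets()`, since both frames keep the `launched` and `output` columns.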

In [4]:
def load_clean_split_datasets():
    ks = pd.read_csv('ks-projects-201801.csv',
                 parse_dates=['deadline', 'launched'])
    # Drop live projects
    ks = ks[ks.state != 'live']

    # Add outcome column, "successful" == 1, others are 0
    ks['output'] = (ks['state'] == 'successful').astype(int)

    # Drop pledged columns
    ks = ks.drop(columns = ['pledged', 'backers', 'usd pledged', 'usd_pledged_real', 'usd_goal_real', 'state'])
    
    ks_train, ks_test = split(ks)
    ks_train = ks_train.reset_index()
    ks_test = ks_test.reset_index()
    
    return ks_train, ks_test

## Baseline model
```The following code builds a baseline model. Our data contains categorical columns. Therefore we need to encode them (Don't worry, we will learn about different encoders). For now we will use a basic encoding method called LabelEncoder. Read about this encoder. ```

```Pay attention to the helper methods which will be used in this exercise.```

In [5]:
def get_xy_by_columns(df_train, df_test, columns):
    # Use the frames passed as arguments rather than the global ks_train / ks_test
    x_train = df_train[columns].copy()
    y_train = df_train['output'].copy()

    x_test = df_test[columns].copy()
    y_test = df_test['output'].copy()
    
    return x_train, x_test, y_train, y_test

In [6]:
def fit_evaluate(x_train, x_val, y_train, y_val):
    cls = xgb.XGBClassifier(n_jobs = -1, n_estimators=50, max_depth = 5, random_state=random_state)
    cls.fit(x_train, y_train)
    
    preds = cls.predict_proba(x_train)
    print(f"AUC of ROC on train : {np.round(roc_auc_score(y_train, preds[:,1]), 4)}")
    
    preds = cls.predict_proba(x_val)
    print(f"AUC of ROC on validation : {np.round(roc_auc_score(y_val, preds[:,1]), 4)}")
    
    return cls

In [7]:
ks_train, ks_test = load_clean_split_datasets()
relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country']
x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)

```Before fitting a model to our data, we need to encode the categorical data as numeric values (Why?). LabelEncoder is a simple encoding method. Read about this encoder and use it.```

```You might find categories in the test set which do not exist in the train set. Think about how to fix this problem.```

In [8]:
# FILL HERE
cat_features = ['category', 'main_category', 'currency', 'country']
other_category = 'other'

for curr_cat in cat_features:
    encoder = LabelEncoder()
    x_train[curr_cat] = encoder.fit_transform(x_train[curr_cat])
    
    x_test[curr_cat] = x_test[curr_cat].map(lambda s: other_category if s not in encoder.classes_ else s)
    encoder_classes = encoder.classes_.tolist()
    encoder_classes.append(other_category)
    encoder.classes_ = encoder_classes
    x_test[curr_cat] = encoder.transform(x_test[curr_cat])
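
As an alternative sketch of the same idea (not the approach used in the cell above), pandas alone can assign integer codes learned from the training column and mark unseen test values with -1. The toy series below are made up for illustration.

```python
import pandas as pd

train_col = pd.Series(["USD", "GBP", "EUR", "USD"])
test_col = pd.Series(["EUR", "JPY", "USD"])  # "JPY" never appears in train

# Learn the (sorted) category list from the training column only
categories = pd.Categorical(train_col).categories

# Codes follow the training categories; values outside them become -1
train_codes = pd.Categorical(train_col, categories=categories).codes
test_codes = pd.Categorical(test_col, categories=categories).codes

print(list(categories))
print(list(train_codes))
print(list(test_codes))
```

Whether -1 or a dedicated "other" level is the better placeholder depends on the model; for tree-based models either is usually just another split value.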

In [9]:
fit_evaluate(x_train, x_test, y_train, y_test)

AUC of ROC on train : 0.7282
AUC of ROC on validation : 0.7199


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=50, n_jobs=-1, num_parallel_tree=1, random_state=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

## Date encoding
```First, we'll start with a basic date encoding for the launch date and the deadline date. For each date, create three new columns - the hour, day and the month of the date. At the end of this encoding the data should contain six new columns - launched_hour, launched_day, launched_month, deadline_hour, deadline_day, deadline_month ```

#### Questions
```Why don't we create a column for the year?```

In [10]:
def encode_launch_dt(df):
    # FILL HERE
    return df.assign(launched_hour=df.launched.dt.hour,
                     launched_day=df.launched.dt.day,
                     launched_month=df.launched.dt.month)

def encode_deadline_dt(df):
    # FILL HERE
    return df.assign(deadline_hour=df.deadline.dt.hour,
                     deadline_day=df.deadline.dt.day,
                     deadline_month=df.deadline.dt.month)
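
The two functions above differ only in the column name, so they could be folded into one. A minimal sketch, assuming a hypothetical helper called `encode_dt` and a one-row toy frame:

```python
import pandas as pd

def encode_dt(df, col):
    """Add <col>_hour, <col>_day and <col>_month columns (hypothetical helper)."""
    parts = {
        f"{col}_{part}": getattr(df[col].dt, part)
        for part in ("hour", "day", "month")
    }
    return df.assign(**parts)

toy = pd.DataFrame({
    "launched": pd.to_datetime(["2017-03-05 14:30:00"]),
    "deadline": pd.to_datetime(["2017-04-04 00:00:00"]),
})
toy = encode_dt(encode_dt(toy, "launched"), "deadline")
print(toy.iloc[0].to_dict())
```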

In [11]:
ks_train, ks_test = load_clean_split_datasets()

ks_train = encode_launch_dt(ks_train)
ks_train = encode_deadline_dt(ks_train)

ks_test = encode_launch_dt(ks_test)
ks_test = encode_deadline_dt(ks_test)

In [12]:
relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country',
                     'launched_hour', 'launched_day', 'launched_month',
                     'deadline_hour', 'deadline_day', 'deadline_month']

x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)

```Use the label encoder you used in the previous section to encode the categorical columns.```

In [13]:
# FILL HERE
cat_features = ['category', 'main_category', 'currency', 'country']
other_category = 'other'

for curr_cat in cat_features:
    encoder = LabelEncoder()
    x_train[curr_cat] = encoder.fit_transform(x_train[curr_cat])
    
    x_test[curr_cat] = x_test[curr_cat].map(lambda s: other_category if s not in encoder.classes_ else s)
    encoder_classes = encoder.classes_.tolist()
    encoder_classes.append(other_category)
    encoder.classes_ = encoder_classes
    x_test[curr_cat] = encoder.transform(x_test[curr_cat])

In [14]:
cls = fit_evaluate(x_train, x_test, y_train, y_test)

AUC of ROC on train : 0.7409
AUC of ROC on validation : 0.7331


## Categorical Encoding
```Next, we will learn about different categorical encodings. Read how each encoding method works and try to understand when each one should be used. Be ready to discuss this with your tutor.```

```You can start by reading the following blog-posts:```

https://wrosinski.github.io/fe_categorical_encoding/

https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/

### Count encoding

In [15]:
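def count_encoding(x_train, x_val, col):
    # FILL HERE
    # Count each category on the training set only (named `counts` to avoid shadowing the `ce` module)
    counts = (x_train.groupby(col)[col].agg("count").rename(f'{col}_count')).reset_index()
    # Left merges keep every row; categories unseen in train get NaN in x_val
    x_train = x_train.merge(counts, on = col, how='left')
    x_val = x_val.merge(counts, on = col, how='left')
    
    return x_train, x_val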
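
To see what count encoding produces, here is a small sketch on made-up data. It uses `value_counts` and `map` instead of a merge, which is an alternative formulation rather than the code above.

```python
import pandas as pd

toy_train = pd.DataFrame({"country": ["US", "US", "GB", "US", "DE"]})
toy_val = pd.DataFrame({"country": ["GB", "FR", "US"]})  # "FR" is unseen in train

# Frequencies are learned from the training column only
counts = toy_train["country"].value_counts()

# map() keeps the original index; unseen categories become NaN and are filled with 0 here
toy_train["country_count"] = toy_train["country"].map(counts)
toy_val["country_count"] = toy_val["country"].map(counts).fillna(0).astype(int)

print(toy_train)
print(toy_val)
```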

In [16]:
ks_train, ks_test = load_clean_split_datasets()

ks_train = encode_launch_dt(ks_train)
ks_train = encode_deadline_dt(ks_train)

ks_test = encode_launch_dt(ks_test)
ks_test = encode_deadline_dt(ks_test)

relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country',
                     'launched_hour', 'launched_day', 'launched_month',
                     'deadline_hour', 'deadline_day', 'deadline_month']

x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)

In [17]:
cat_features = ['category', 'main_category', 'currency', 'country']

for curr_cat in cat_features:
    x_train, x_test = count_encoding(x_train, x_test, curr_cat)
    
x_test = x_test.fillna(0)

x_train = x_train.drop(columns=cat_features)
x_test = x_test.drop(columns=cat_features)

In [18]:
fit_evaluate(x_train, x_test, y_train, y_test)

AUC of ROC on train : 0.742
AUC of ROC on validation : 0.7331



### Target encoder
```Target encoding can be used with/without smoothing. We will implement both types. We'll start without smoothing. Read about the difference between them.```

In [19]:
# df: data, by: categorical column, on: target column, m: smoothing hyper-parameter (m = 0 means no smoothing).
def calc_smooth_mean(df, by, on, m = 0):
    # FILL HERE
    
    # Compute the global mean
    mean = df[on].mean()

    # Compute the number of values and the mean of each group
    agg = df.groupby(by)[on].agg(['count', 'mean'])
    counts = agg['count']
    means = agg['mean']

    # Compute the "smoothed" means
    smooth = (counts * means + m * mean) / (counts + m)

    # Return the smoothed mean of each category (a Series indexed by category value)
    return smooth
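
A small worked example of the smoothing formula used in `calc_smooth_mean`, on made-up data. The helper name `smoothed` is hypothetical. Category "B" has a single row, so its raw mean is unreliable; with `m = 10` it is pulled most of the way toward the global mean.

```python
import pandas as pd

toy = pd.DataFrame({
    "category": ["A"] * 4 + ["B"],
    "output":   [1, 1, 0, 0, 1],
})

global_mean = toy["output"].mean()  # 3 / 5 = 0.6
stats = toy.groupby("category")["output"].agg(["count", "mean"])

def smoothed(stats, global_mean, m):
    # Same formula as calc_smooth_mean: (n * category_mean + m * global_mean) / (n + m)
    return (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

no_smoothing = smoothed(stats, global_mean, m=0)     # A: 0.5, B: 1.0
with_smoothing = smoothed(stats, global_mean, m=10)  # A: 8/14, B: 7/11

print(no_smoothing)
print(with_smoothing)
```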

In [20]:
ks_train, ks_test = load_clean_split_datasets()

ks_train = encode_launch_dt(ks_train)
ks_train = encode_deadline_dt(ks_train)

ks_test = encode_launch_dt(ks_test)
ks_test = encode_deadline_dt(ks_test)

relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country',
                     'launched_hour', 'launched_day', 'launched_month',
                     'deadline_hour', 'deadline_day', 'deadline_month']

In [21]:
# FILL HERE
for curr_c in cat_features:
    new_col_name = curr_c + "_target"
    c_to_target = pd.DataFrame(calc_smooth_mean(ks_train, curr_c, 'output', m = 0), 
                               columns = [new_col_name])
    ks_train = ks_train.join(c_to_target, on=curr_c)
    ks_test = ks_test.join(c_to_target, on=curr_c)
    relevant_columns.append(new_col_name)
    
x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)
x_test = x_test.fillna(0)

x_train = x_train.drop(columns=cat_features)
x_test = x_test.drop(columns=cat_features)

In [22]:
fit_evaluate(x_train, x_test, y_train, y_test)

AUC of ROC on train : 0.7439
AUC of ROC on validation : 0.7325



In [23]:
ks_train, ks_test = load_clean_split_datasets()

ks_train = encode_launch_dt(ks_train)
ks_train = encode_deadline_dt(ks_train)

ks_test = encode_launch_dt(ks_test)
ks_test = encode_deadline_dt(ks_test)

relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country',
                     'launched_hour', 'launched_day', 'launched_month',
                     'deadline_hour', 'deadline_day', 'deadline_month']

In [24]:
# FILL HERE
for curr_c in cat_features:
    new_col_name = curr_c + "_target"
    c_to_target = pd.DataFrame(calc_smooth_mean(ks_train, curr_c, 'output', m = 10), 
                               columns = [new_col_name])
    ks_train = ks_train.join(c_to_target, on=curr_c)
    ks_test = ks_test.join(c_to_target, on=curr_c)
    relevant_columns.append(new_col_name)
    
x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)
x_test = x_test.fillna(0)

x_train = x_train.drop(columns=cat_features)
x_test = x_test.drop(columns=cat_features)

In [25]:
fit_evaluate(x_train, x_test, y_train, y_test)

AUC of ROC on train : 0.7446
AUC of ROC on validation : 0.7335



### Catboost encoding

In [26]:
# Catboost encoding
ks_train, ks_test = load_clean_split_datasets()

ks_train = encode_launch_dt(ks_train)
ks_train = encode_deadline_dt(ks_train)

ks_test = encode_launch_dt(ks_test)
ks_test = encode_deadline_dt(ks_test)

relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country',
                     'launched_hour', 'launched_day', 'launched_month',
                     'deadline_hour', 'deadline_day', 'deadline_month']
x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)

cat_features = ['category', 'main_category', 'currency', 'country']

In [27]:
# FILL HERE
target_enc = ce.CatBoostEncoder(cols=cat_features)
target_enc.fit(x_train[cat_features], y_train)

x_train = x_train.join(target_enc.transform(x_train[cat_features]).add_suffix('_cb'))
x_test = x_test.join(target_enc.transform(x_test[cat_features]).add_suffix('_cb'))

x_train.drop(columns=cat_features, inplace=True)
x_test.drop(columns=cat_features, inplace=True)
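
The idea behind CatBoost-style encoding is an "ordered" target statistic: each row is encoded using only the targets of earlier rows in the same category, blended with a prior. The sketch below is a simplified illustration on made-up data and is not necessarily identical to what `ce.CatBoostEncoder` does internally; the prior weight `a = 1` is an assumed value.

```python
import pandas as pd

toy = pd.DataFrame({
    "category": ["A", "A", "B", "A", "B"],
    "output":   [1,   0,   1,   1,   0],
})

prior = toy["output"].mean()  # global success rate, 0.6 here
a = 1                         # weight of the prior (assumed value)

grouped = toy.groupby("category")["output"]
# Sum and count of the target over *earlier* rows of the same category only
prev_sum = grouped.cumsum() - toy["output"]
prev_count = grouped.cumcount()

toy["category_ordered_te"] = (prev_sum + prior * a) / (prev_count + a)
print(toy)
```

Because a row never sees its own target, this kind of statistic leaks less label information into the training features than a plain per-category mean.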



In [28]:
fit_evaluate(x_train, x_test, y_train, y_test)

AUC of ROC on train : 0.7441
AUC of ROC on validation : 0.7376



## Feature Generation
```Creating new features from the raw data is a powerful way to improve your model performance.```

```In the following section we will generate the features and evaluate their impact at the end.```

In [29]:
ks_train, ks_test = load_clean_split_datasets()

ks_train = encode_launch_dt(ks_train)
ks_train = encode_deadline_dt(ks_train)

ks_test = encode_launch_dt(ks_test)
ks_test = encode_deadline_dt(ks_test)

relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country', 'launched', 'deadline',
                     'launched_hour', 'launched_day', 'launched_month',
                     'deadline_hour', 'deadline_day', 'deadline_month']
x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)

cat_features = ['category', 'main_category', 'currency', 'country']

### Interactions
```One of the easiest ways to create new features is by combining categorical variables. Create all the combinations of any two categorical features. Don't forget to encode the new categorical features.```

#### Questions
```When is this not a good idea, and how can the problem be solved?```

In [30]:
# FILL HERE - create all the combinations
for curr_pair in itertools.combinations(cat_features, 2):
    combination_name = f'{curr_pair[0]}_{curr_pair[1]}'
    interactions = x_train[curr_pair[0]] + "_" + x_train[curr_pair[1]]
    x_train = x_train.assign(**{combination_name:interactions})
    
    interactions = x_test[curr_pair[0]] + "_" + x_test[curr_pair[1]]    
    x_test = x_test.assign(**{combination_name:interactions})
    cat_features.append(combination_name)
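
Combining two columns multiplies the number of possible levels, so it can help to check how many rows each combined level has. The sketch below uses made-up data; pooling rare levels into a single "rare" bucket is one possible remedy, and the threshold of 2 is an assumed value.

```python
import pandas as pd

toy = pd.DataFrame({
    "main_category": ["Games", "Games", "Music", "Music", "Film", "Film"],
    "country":       ["US",    "GB",    "US",    "DE",    "US",   "US"],
})

pair = toy["main_category"] + "_" + toy["country"]

# Rows per combined level: values near 1 mean most levels are seen only once
levels = pair.nunique()
rows_per_level = len(pair) / levels

# Pool levels that occur fewer than 2 times into one bucket
counts = pair.value_counts()
rare = counts[counts < 2].index
pair_pooled = pair.where(~pair.isin(rare), "rare")

print(levels, rows_per_level)
print(pair_pooled.tolist())
```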

In [31]:
# FILL HERE - encode the categorical columns
target_enc = ce.CatBoostEncoder(cols=cat_features)
target_enc.fit(x_train[cat_features], y_train)

x_train = x_train.join(target_enc.transform(x_train[cat_features]).add_suffix('_cb'))
x_test = x_test.join(target_enc.transform(x_test[cat_features]).add_suffix('_cb'))



### Domain knowledge 
```In this section we will create new features based on the original ones with a pinch of imagination and creativity. Generating features from domain knowledge is one of the most successful ways to improve our model.```

```Don't forget to add the new features to both x_train and x_test.```

```Create a feature that contains the goal in USD.```

In [32]:
# FILL HERE
c = CurrencyConverter()
train_goal_usd = x_train.apply(lambda x: c.convert(x.goal, x.currency, 'USD'), axis = 1)
x_train['goal_usd'] = train_goal_usd

test_goal_usd = x_test.apply(lambda x: c.convert(x.goal, x.currency, 'USD'), axis = 1)
x_test['goal_usd'] = test_goal_usd
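
Calling a converter once per row can be slow on a few hundred thousand rows. Since conversion is a multiplication by a rate, one option is to look up the rate once per distinct currency and broadcast it. In the sketch below, `rate_to_usd` is a hypothetical stand-in for something like `c.convert(1, currency, 'USD')`, and the rates are placeholders for illustration only, not real exchange rates.

```python
import pandas as pd

# Placeholder rates for illustration only; NOT real exchange rates
ILLUSTRATIVE_RATES = {"USD": 1.0, "GBP": 1.25, "EUR": 1.1}

def rate_to_usd(currency):
    """Hypothetical stand-in for a single currency lookup."""
    return ILLUSTRATIVE_RATES[currency]

toy = pd.DataFrame({
    "goal": [1000.0, 2000.0, 500.0, 100.0],
    "currency": ["USD", "GBP", "EUR", "GBP"],
})

# Look up each distinct currency once, then broadcast the rate with map()
rates = {cur: rate_to_usd(cur) for cur in toy["currency"].unique()}
toy["goal_usd"] = toy["goal"] * toy["currency"].map(rates)
print(toy)
```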

```Count the number of projects launched in the preceding week for each record.```

In [33]:
# FILL HERE
def add_7_days_counter(df):
    launched = pd.Series(df.index, index=df.launched, name="count_7_days").sort_index()
    count_7_days = launched.rolling('7d').count() - 1
    count_7_days.index = launched.values
    count_7_days = count_7_days.reindex(df.index)
    return df.join(count_7_days)

x_train = add_7_days_counter(x_train).fillna(0)
x_test = add_7_days_counter(x_test).fillna(0)
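
The core of the counter is a time-based rolling window. A minimal sketch on five made-up launch dates shows how the window behaves; each window ends at the current launch and includes it, which is why 1 is subtracted.

```python
import pandas as pd

launched = pd.to_datetime([
    "2017-01-01", "2017-01-03", "2017-01-05", "2017-01-09", "2017-01-20",
])

# A time-based rolling window needs a sorted DatetimeIndex
ones = pd.Series(1, index=launched).sort_index()

# Count launches in the 7 days up to and including each launch, minus the launch itself
prev_week = ones.rolling("7D").count() - 1
print(prev_week)
```

Note that when this is applied separately to the train and test frames, the first week of the test set cannot see launches from the end of the training period.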

```Count the days each project was online```

In [34]:
# FILL HERE
online_days_train = (x_train.deadline - x_train.launched).dt.total_seconds() / (60 * 60* 24)
online_days_test = (x_test.deadline - x_test.launched).dt.total_seconds() / (60 * 60* 24)

x_train['online_days'] = online_days_train
x_test['online_days'] = online_days_test


```Calculate the goal per day```

In [35]:
# FILL HERE
goal_per_day_train = (x_train.goal / online_days_train)
goal_per_day_test = (x_test.goal / online_days_test)

x_train['goal_per_day'] = goal_per_day_train
x_test['goal_per_day'] = goal_per_day_test


```Calculate the goal per day in USD```

In [36]:
# FILL HERE
goal_usd_per_day_train = x_train.goal_usd / online_days_train
goal_usd_per_day_test = x_test.goal_usd / online_days_test

x_train['goal_usd_per_day'] = goal_usd_per_day_train
x_test['goal_usd_per_day'] = goal_usd_per_day_test

```Calculate the time since the last project launched in the same category```

In [37]:
# FILL HERE
def time_since_last_project(series):
    # Return the time in whole days (.dt.days), not hours
    return series.diff().dt.days

df = x_train[['category', 'launched']].sort_values('launched')
timedeltas = df.groupby('category').transform(time_since_last_project)
timedeltas = timedeltas.fillna(timedeltas.median()).reindex(x_train.index)
x_train['timedeltas'] = timedeltas

df = x_test[['category', 'launched']].sort_values('launched')
timedeltas = df.groupby('category').transform(time_since_last_project)
timedeltas = timedeltas.fillna(timedeltas.median()).reindex(x_test.index)
x_test['timedeltas'] = timedeltas
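
A compact sketch of the same per-category gap on made-up data, using `groupby(...).diff()` directly. The first project of each category has no predecessor, so its gap is NaN and still needs a fill strategy such as the median used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "category": ["Art", "Tech", "Art", "Tech", "Art"],
    "launched": pd.to_datetime([
        "2017-01-01", "2017-01-02", "2017-01-04", "2017-01-10", "2017-01-05",
    ]),
})

# Sort by time so diff() compares each project with the previous one in its category
ordered = toy.sort_values("launched")
gap_days = ordered.groupby("category")["launched"].diff().dt.days

# Assignment aligns on the index, so the original row order is preserved
toy["days_since_last"] = gap_days
print(toy)
```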

### Transforming numerical features
```Numerical features can be transformed with mathematical transformations such as log and sqrt. Create two more features: log(goal_usd) and sqrt(goal_usd).```

#### Questions
```Why are these transformations useful?```

```In which cases/models should we use these transformations?```


In [38]:
# FILL HERE
sqrt_goal_usd = np.sqrt(x_train.goal_usd)
log_goal_usd = np.log(x_train.goal_usd)
x_train['sqrt_goal_usd'] = sqrt_goal_usd
x_train['log_goal_usd'] = log_goal_usd

sqrt_goal_usd = np.sqrt(x_test.goal_usd)
log_goal_usd = np.log(x_test.goal_usd)
x_test['sqrt_goal_usd'] = sqrt_goal_usd
x_test['log_goal_usd'] = log_goal_usd
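
A short sketch on made-up goal values shows how the log compresses a heavily skewed range. It also shows `np.log1p`, which computes log(1 + x) and stays finite at zero; whether the real data contains zero goals is not checked here, so treat that as a precaution rather than a requirement.

```python
import numpy as np
import pandas as pd

goal_usd = pd.Series([100.0, 1_000.0, 10_000.0, 1_000_000.0])

log_goal = np.log(goal_usd)      # requires strictly positive values
log1p_goal = np.log1p(goal_usd)  # log(1 + x); also defined at x = 0

# The raw values span four orders of magnitude; after the log the spread is far smaller
raw_ratio = goal_usd.max() / goal_usd.min()
log_ratio = log_goal.max() / log_goal.min()
print(raw_ratio, log_ratio)
```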

In [39]:
features_to_remove = cat_features + ['deadline', 'launched']
cls = fit_evaluate(x_train.drop(columns=features_to_remove), x_test.drop(columns=features_to_remove), y_train, y_test)

AUC of ROC on train : 0.7676
AUC of ROC on validation : 0.7433


```Great! Now create another five unique and creative features that will impress your tutor and further improve the model's validation AUC.```

In [40]:
# FILL HERE


In [41]:
# FILL HERE


In [42]:
# FILL HERE


In [43]:
# FILL HERE


In [44]:
# FILL HERE
