# Advanced Feature Engineering
In this exercise we will learn about the importance of feature engineering to improve our model's performance.
We will be working on a kaggle dataset of kickstarter projects (if you don't know Kickstarter (shame on you!), your first assignment is to visit the <a href='https://www.kickstarter.com/'>Kickstarter</a> website and find a cool project).
Each record represents one project and some basic information about it. 

The dataset can be found <a href='https://www.kaggle.com/kemical/kickstarter-projects#'> HERE </a>

In this exercise we will try to predict whether a project will be a success or not (binary classification).

Have fun :)

``` ~Lior Hirsch ```

```First, make sure all the following libraries are installed on you computer.```

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce
import itertools
from currency_converter import CurrencyConverter
import xgboost as xgb

In [3]:
random_state = 1
np.random.seed(random_state)

## Loading and cleaning the data
```The following code is a simple cleaning of the dataset. In the original dataset there are several states (which is the target column). Instead, we will use a new generated binary column based on the state column - output.```

```Your first assignment is to create a train-test-split that will fit to our dataset.```

#### Questions

```How can you test whether the train-test-split is good?```

```Explain your train-test-split proposal and why does it fit the dataset?```

In [4]:
def split(data):
    return train_test_split(data.set_index("launched").sort_index(), test_size = 0.2, shuffle=False)

In [5]:
def load_clean_split_datasets():
    ks = pd.read_csv('ks-projects-201801.csv',
                 parse_dates=['deadline', 'launched'])
    # Drop live projects
    ks = ks[ks.state != 'live']

    # Add outcome column, "successful" == 1, others are 0
    ks['output'] = (ks['state'] == 'successful').astype(int)

    # Drop pledged columns
    ks = ks.drop(columns = ['pledged', 'backers', 'usd pledged', 'usd_pledged_real', 'usd_goal_real', 'state'])
    
    ks_train, ks_test = split(ks)
    ks_train = ks_train.reset_index()
    ks_test = ks_test.reset_index()
    
    return ks_train, ks_test

## Baseline model
```The following code builds a baseline model. Our data contains categorical columns. Therefore we need to encode them (Don't worry, we will learn about different encoders). For now we will use a basic encoding method called LabelEncoder. Read about this encoder. ```

```Pay attention to the helper methods which will be used in this exercise.```

In [6]:
def get_xy_by_columns(df_train, df_test, columns):
    x_train = df_train[columns].copy()
    y_train = df_train['output'].copy()

    x_test = df_test[columns].copy()
    y_test = df_test['output'].copy()
    
    return x_train, x_test, y_train, y_test

In [7]:
def fit_evaluate(x_train, x_val, y_train, y_val):
    cls = xgb.XGBClassifier(n_jobs = -1, n_estimators=50, max_depth = 5, random_state=random_state)
    cls.fit(x_train, y_train)
    
    preds = cls.predict_proba(x_train)
    print(f"AUC of ROC on train : {np.round(roc_auc_score(y_train, preds[:,1]), 4)}")
    
    preds = cls.predict_proba(x_val)
    print(f"AUC of ROC on validation : {np.round(roc_auc_score(y_val, preds[:,1]), 4)}")
    
    return cls

In [8]:
ks_train, ks_test = load_clean_split_datasets()
relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country']
x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)

```Before fitting a model to our data, we need to encode the categorical data to a numeric/float types (Why?). LabelEncoder is a simple encoding method. Read about this encoder and use it.```

```You might find categories in the test set which are not exist in the train set. Think how to fix this problem.```

In [12]:
for col in x_train.columns:
    print(col)
    if col == 'goal':
        continue
    encoder = LabelEncoder()
    print(x_train[col].shape)
    x_train[col] = encoder.fit_transform(x_train[col])
    
    print(encoder.classes_)
    x_test[col] = x_test[col].map(lambda s: 'unknown' if s not in encoder.classes_ else s)
    
    encoder_classes = encoder.classes_.tolist()
    encoder_classes.append('unknown')
    encoder.classes_ = encoder_classes
    
    x_test[col] = encoder.transform(x_test[col])

category
(300689,)
[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157]
main_category
(300689,)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
currency
(300689,)
[0 1 2 3 4 5 6 7 8 9]
goal
country
(300689,)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18]


In [11]:
print(x_test)

       category  main_category  currency      goal  country
0           109              2         4    4700.0        5
1            89             10         9    2950.0       18
2            12              9         9     500.0       18
3            97              0         9    7000.0       18
4            41              7         9   15000.0       18
...         ...            ...       ...       ...      ...
75168        68             10         0     500.0        1
75169       150              9         9  100000.0       18
75170       135              8         9    1000.0       18
75171        25              1         5     100.0        9
75172       135              8         1    5000.0        3

[75173 rows x 5 columns]


In [13]:
fit_evaluate(x_train, x_test, y_train, y_test)



AUC of ROC on train : 0.7301
AUC of ROC on validation : 0.6763


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=50, n_jobs=-1, num_parallel_tree=1,
              objective='binary:logistic', random_state=1, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=True,
              validate_parameters=1, verbosity=None)

## Date encoding
```First, we'll start with a basic date encoding for the launch date and the deadline date. For each date, create three new columns - the hour, day and the month of the date. At the end of this encoding the data should contain six new columns - launched_hour, launched_day, launched_month, deadline_hour, deadline_day, deadline_month ```

#### Questions
```Why we won't create a column for the year?```

In [22]:
def encode_launch_dt(df):
    df['launched_hour'] = df.launched.dt.hour
    df['launched_day'] = df.launched.dt.day
    df['launched_month'] = df.launched.dt.month
    return df

def encode_deadline_dt(df):
    df['deadline_hour'] = df.deadline.dt.hour
    df['deadline_day'] = df.deadline.dt.day
    df['deadline_month'] = df.deadline.dt.month
    return df

In [23]:
ks_train, ks_test = load_clean_split_datasets()

ks_train = encode_launch_dt(ks_train)
ks_train = encode_deadline_dt(ks_train)

ks_test = encode_launch_dt(ks_test)
ks_test = encode_deadline_dt(ks_test)

In [24]:
relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country',
                     'launched_hour', 'launched_day', 'launched_month',
                     'deadline_hour', 'deadline_day', 'deadline_month']

x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)

```Use the label encoder you used in the previous section to encode the categorical columns.```

In [23]:
for col in x_train.columns:
    print(col)
    if col == 'goal':
        continue
    encoder = LabelEncoder()
    x_train[col] = encoder.fit_transform(x_train[curr_cat])
    
    x_test[col] = x_test[col].map(lambda s: 'unknown' if s not in encoder.classes_ else s)
    
    encoder_classes = encoder.classes_.tolist()
    encoder_classes.append('unknown')
    encoder.classes_ = encoder_classes
    
    x_test[col] = encoder.transform(x_test[col])

category
main_category
currency
goal
country
launched_hour
launched_day
launched_month
deadline_hour
deadline_day
deadline_month


In [26]:
cls = fit_evaluate(x_train, x_test, y_train, y_test)



AUC of ROC on train : 0.6607
AUC of ROC on validation : 0.6388


## Categorical Encoding
```Next, we will learn about different categorical encodings. Read how each encoding method works and try to understand when we should use each one of them. Be ready to discuss this with your tutor```

```You can start by reading the following blog-posts:```

https://wrosinski.github.io/fe_categorical_encoding/

https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/

### Count encoding

In [25]:
def count_encoding(x_train, x_val, col):
    ce = (x_train.groupby(col)[col].agg("count").rename(f'{col}_count')).reset_index()
    
    x_train = x_train.merge(ce, on = col, how='left')
    x_val = x_val.merge(ce, on = col, how='left')
    
    return x_train, x_val

In [73]:
def count_encoding(x_train, x_val, col):
    x_train[col] = x_train[col].astype('object').map(x_train[col].value_counts())
    x_val[col] = x_val[col].astype('object').map(x_val[col].value_counts())
        
    x_train[col] = x_train[col].astype(np.float32)
    x_val[col] = x_val[col].astype(np.float32)

    return x_train, x_val

In [79]:
ks_train, ks_test = load_clean_split_datasets()

ks_train = encode_launch_dt(ks_train)
ks_train = encode_deadline_dt(ks_train)

ks_test = encode_launch_dt(ks_test)
ks_test = encode_deadline_dt(ks_test)

relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country',
                     'launched_hour', 'launched_day', 'launched_month',
                     'deadline_hour', 'deadline_day', 'deadline_month']

x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)

In [80]:
cat_features = ['category', 'main_category', 'currency', 'country']

for curr_cat in cat_features:
    x_train, x_test = count_encoding(x_train, x_test, curr_cat)

In [81]:
x_train.columns

x_train = x_train.drop(columns=cat_features)
x_test = x_test.drop(columns=cat_features)

In [82]:
x_test = x_test.fillna(0)

In [83]:
fit_evaluate(x_train, x_test, y_train, y_test)



AUC of ROC on train : 0.7435
AUC of ROC on validation : 0.7304


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=50, n_jobs=-1, num_parallel_tree=1,
              objective='binary:logistic', random_state=1, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=True,
              validate_parameters=1, verbosity=None)

### Target encoder
```Target encoding can be used with/without smoothing. We will implement both types. We'll start without smoothing. Read about the difference between them.```

In [18]:
# df - data, by - categorical column, on- target column, m - smoothing hyper-parameter - 0 will be without smoothing.
def calc_smooth_mean(df, by, on, m = 0):
    # Compute the global mean
    mean = df[on].mean()

    # Compute the number of values and the mean of each group
    agg = df.groupby(by)[on].agg(['count', 'mean'])
    counts = agg['count']
    means = agg['mean']

    # Compute the "smoothed" means
    smooth = (counts * means + m * mean) / (counts + m)

    # Replace each value by the according smoothed mean
    return smooth

In [19]:
def target_encoder(m_smoothing):
    ks_train, ks_test = load_clean_split_datasets()

    ks_train = encode_launch_dt(ks_train)
    ks_train = encode_deadline_dt(ks_train)

    ks_test = encode_launch_dt(ks_test)
    ks_test = encode_deadline_dt(ks_test)

    relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country',
                         'launched_hour', 'launched_day', 'launched_month',
                         'deadline_hour', 'deadline_day', 'deadline_month']
    
    for curr_c in cat_features:
        new_col_name = curr_c + "_target"
        c_to_target = pd.DataFrame(calc_smooth_mean(ks_train, curr_c, 'output', m = m_smoothing), 
                                   columns = [new_col_name])
        ks_train = ks_train.join(c_to_target, on=curr_c)
        ks_test = ks_test.join(c_to_target, on=curr_c)
        relevant_columns.append(new_col_name)
    
    x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)
    x_test = x_test.fillna(0)

    x_train = x_train.drop(columns=cat_features)
    x_test = x_test.drop(columns=cat_features)
    
    fit_evaluate(x_train, x_test, y_train, y_test)

In [108]:
target_encoder(m_smoothing=0)



AUC of ROC on train : 0.7463
AUC of ROC on validation : 0.7308


In [109]:
target_encoder(m_smoothing=1)



AUC of ROC on train : 0.7456
AUC of ROC on validation : 0.7302


In [110]:
target_encoder(m_smoothing=5)



AUC of ROC on train : 0.7452
AUC of ROC on validation : 0.7294


In [111]:
target_encoder(m_smoothing=10)



AUC of ROC on train : 0.7459
AUC of ROC on validation : 0.7296


### Catboost encoding

In [26]:
# Catboost encoding
ks_train, ks_test = load_clean_split_datasets()

ks_train = encode_launch_dt(ks_train)
ks_train = encode_deadline_dt(ks_train)

ks_test = encode_launch_dt(ks_test)
ks_test = encode_deadline_dt(ks_test)

relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country',
                     'launched_hour', 'launched_day', 'launched_month',
                     'deadline_hour', 'deadline_day', 'deadline_month']
x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)

In [28]:
cat_features = ['category', 'main_category', 'currency', 'country']

target_enc = ce.CatBoostEncoder(cols=cat_features)
target_enc.fit(x_train[cat_features], y_train)

x_train = x_train.join(target_enc.transform(x_train[cat_features]).add_suffix('_cb'))
x_test = x_test.join(target_enc.transform(x_test[cat_features]).add_suffix('_cb'))

x_train.drop(columns=cat_features, inplace=True)
x_test.drop(columns=cat_features, inplace=True)

  elif pd.api.types.is_categorical(cols):


In [29]:
fit_evaluate(x_train, x_test, y_train, y_test)



AUC of ROC on train : 0.7456
AUC of ROC on validation : 0.7351


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=50, n_jobs=-1, num_parallel_tree=1,
              objective='binary:logistic', random_state=1, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=True,
              validate_parameters=1, verbosity=None)

## Feature Generation
```Creating new features from the raw data is a powerful way to improve your model performance.```

```In the following section we will generate the features and evaluate their impact in the end.```

In [None]:
ks_train, ks_test = load_clean_split_datasets()

ks_train = encode_launch_dt(ks_train)
ks_train = encode_deadline_dt(ks_train)

ks_test = encode_launch_dt(ks_test)
ks_test = encode_deadline_dt(ks_test)

relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country', 'launched', 'deadline',
                     'launched_hour', 'launched_day', 'launched_month',
                     'deadline_hour', 'deadline_day', 'deadline_month']
x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)

cat_features = ['category', 'main_category', 'currency', 'country']

### Interactions
```One of the easiest ways to create new features is by combining categorical variables. Create all the combinations of any two categoricals features. Don't forget to encode the new categorical features.```

#### Questions
```When is this not a good idea to and how it can be solved?```

In [49]:
print(cat_features)

['launched_hour', 'launched_day', 'launched_month', 'deadline_hour', 'deadline_day', 'deadline_month', 'category_main_category', 'category_currency', 'category_country', 'main_category_currency', 'main_category_country', 'currency_country']


In [52]:
print(ks_train.columns)

Index(['launched', 'ID', 'name', 'category', 'main_category', 'currency',
       'deadline', 'goal', 'country', 'output', 'category_main_category',
       'category_currency', 'category_country', 'main_category_currency',
       'main_category_country', 'currency_country'],
      dtype='object')


In [67]:
ks_train, ks_test = load_clean_split_datasets()

cat_features = ['category', 'main_category', 'currency', 'country']

# ks_train = encode_launch_dt(ks_train)
# ks_train = encode_deadline_dt(ks_train)

# ks_test = encode_launch_dt(ks_test)
# ks_test = encode_deadline_dt(ks_test)

In [68]:
comb_features = []

for curr_pair in itertools.combinations(cat_features, 2):
    combination_name = f'{curr_pair[0]}_{curr_pair[1]}'
    interactions = ks_train[curr_pair[0]] + "_" + ks_train[curr_pair[1]]
    ks_train = ks_train.assign(**{combination_name:interactions})
    
    interactions = ks_test[curr_pair[0]] + "_" + ks_test[curr_pair[1]]    
    ks_test = ks_test.assign(**{combination_name:interactions})
    
    comb_features.append(combination_name)
    
x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, comb_features)

In [56]:
relevant_columns = []
cat_features = ['category', 'main_category', 'currency', 'country', 'category_main_category', 
                'category_currency', 'category_country', 'main_category_currency', 
                'main_category_country', 'currency_country']

for curr_c in cat_features:
    new_col_name = curr_c + "_target"
    c_to_target = pd.DataFrame(calc_smooth_mean(ks_train, curr_c, 'output', m = 0), 
                               columns = [new_col_name])
    ks_train = ks_train.join(c_to_target, on=curr_c)
    ks_test = ks_test.join(c_to_target, on=curr_c)
    relevant_columns.append(new_col_name)
    
x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)
x_test = x_test.fillna(0)

fit_evaluate(x_train, x_test, y_train, y_test)



AUC of ROC on train : 0.6916
AUC of ROC on validation : 0.6709


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=50, n_jobs=-1, num_parallel_tree=1,
              objective='binary:logistic', random_state=1, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=True,
              validate_parameters=1, verbosity=None)

In [59]:
x_train.columns

Index(['category', 'main_category', 'currency', 'country', 'category_cb',
       'main_category_cb', 'currency_cb', 'country_cb'],
      dtype='object')

In [69]:
target_enc = ce.CatBoostEncoder(cols=comb_features)
target_enc.fit(x_train[comb_features], y_train)

x_train = target_enc.transform(x_train[comb_features]).add_suffix('_cb')
x_test = target_enc.transform(x_test[comb_features]).add_suffix('_cb')

  elif pd.api.types.is_categorical(cols):


In [70]:
x_train.columns

Index(['category_main_category_cb', 'category_currency_cb',
       'category_country_cb', 'main_category_currency_cb',
       'main_category_country_cb', 'currency_country_cb'],
      dtype='object')

In [72]:
print(x_train.shape)

(300689, 6)


In [71]:
fit_evaluate(x_train, x_test, y_train, y_test)



AUC of ROC on train : 0.691
AUC of ROC on validation : 0.6768


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=50, n_jobs=-1, num_parallel_tree=1,
              objective='binary:logistic', random_state=1, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=True,
              validate_parameters=1, verbosity=None)

### Domain knowledge 
```In this section we will create new features based on the original ones with a pinch of imagination and creativity. The domain knowledge features generation is one of the most successful ways to improve our model. ```

```Don't forget to add the new features both to x_train, x_test```

```Create a feature that contains the goal in USD currency```

In [9]:
ks_train, ks_test = load_clean_split_datasets()

In [10]:
c = CurrencyConverter()

train_goal_usd = ks_train.apply(lambda x: c.convert(x.goal, x.currency, 'USD'), axis = 1)
ks_train['goal_usd'] = train_goal_usd

test_goal_usd = ks_test.apply(lambda x: c.convert(x.goal, x.currency, 'USD'), axis = 1)
ks_test['goal_usd'] = test_goal_usd

```Count the number of projects launched in the preceeding week for each record.```

In [11]:
def add_7_days_counter(df):
    launched = pd.Series(df.index, index=df.launched, name="count_7_days").sort_index()
    count_7_days = launched.rolling('7d').count() - 1
    count_7_days.index = launched.values
    count_7_days = count_7_days.reindex(df.index)
    return df.join(count_7_days)

ks_train = add_7_days_counter(ks_train).fillna(0)
ks_test = add_7_days_counter(ks_test).fillna(0)

```Count the days each project was online```

In [12]:
online_days_train = (ks_train.deadline - ks_train.launched).dt.total_seconds() / (60 * 60* 24)
online_days_test = (ks_test.deadline - ks_test.launched).dt.total_seconds() / (60 * 60* 24)

ks_train['online_days'] = online_days_train
ks_test['online_days'] = online_days_test

```Calculate the goal per day```

In [13]:
goal_per_day_train = (ks_train.goal / online_days_train)
goal_per_day_test = (ks_test.goal / online_days_test)

ks_train['goal_per_day'] = goal_per_day_train
ks_test['goal_per_day'] = goal_per_day_test

```Calculate the goal per day in USD```

In [14]:
goal_usd_per_day_train = ks_train.goal_usd / online_days_train
goal_usd_per_day_test = ks_test.goal_usd / online_days_test

ks_train['goal_usd_per_day'] = goal_usd_per_day_train
ks_test['goal_usd_per_day'] = goal_usd_per_day_test

```Calculate the time since the last launch project in the same category```

In [15]:
def time_since_last_project(series):
    # Return the time in hours
    return series.diff().dt.days

df = ks_train[['category', 'launched']].sort_values('launched')
timedeltas = df.groupby('category').transform(time_since_last_project)
timedeltas = timedeltas.fillna(timedeltas.median()).reindex(ks_train.index)
ks_train['timedeltas'] = timedeltas

df = ks_test[['category', 'launched']].sort_values('launched')
timedeltas = df.groupby('category').transform(time_since_last_project)
timedeltas = timedeltas.fillna(timedeltas.median()).reindex(ks_test.index)
ks_test['timedeltas'] = timedeltas

In [16]:
print(ks_test.columns)

Index(['launched', 'ID', 'name', 'category', 'main_category', 'currency',
       'deadline', 'goal', 'country', 'output', 'goal_usd', 'count_7_days',
       'online_days', 'goal_per_day', 'goal_usd_per_day', 'timedeltas'],
      dtype='object')


In [20]:
created_features = ['launched', 'name', 'category', 'main_category', 'currency',
       'deadline', 'goal', 'country', 'output', 'goal_usd', 'count_7_days',
       'online_days', 'goal_per_day', 'goal_usd_per_day', 'timedeltas']

x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, created_features)

for col in x_train.columns:
    if x_train[col].dtype == 'object':
        new_col_name = col + "_target"
        c_to_target = pd.DataFrame(calc_smooth_mean(x_train, col, 'output', m = 0), columns = [new_col_name])
        
        x_train = x_train.join(c_to_target, on=col)
        x_test = x_test.join(c_to_target, on=col)

        x_train = x_train.drop(columns=[col])
        x_test = x_test.drop(columns=[col])

x_train = x_train.drop(columns=['launched', 'deadline', 'output', 'name_target'])
x_test = x_test.drop(columns=['launched', 'deadline', 'output', 'name_target'])

fit_evaluate(x_train, x_test, y_train, y_test)



AUC of ROC on train : 0.763
AUC of ROC on validation : 0.7405


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=50, n_jobs=-1, num_parallel_tree=1,
              objective='binary:logistic', random_state=1, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=True,
              validate_parameters=1, verbosity=None)

In [21]:
print(x_train.dtypes)

goal                    float64
goal_usd                float64
count_7_days            float64
online_days             float64
goal_per_day            float64
goal_usd_per_day        float64
timedeltas              float64
category_target         float64
main_category_target    float64
currency_target         float64
country_target          float64
dtype: object


### Transforming numerical features
```Numerical features can be transformed with mathematical transformation like log, sqrt etc. Create another two features - log(goal_usd), sqrt(goal_usd)```

#### Questions
```Why are those transformation useful? ```

```In which cases/ models we should use this transformation?```


In [22]:
x_train['sqrt_goal_usd'] = np.sqrt(x_train.goal_usd)
x_train['log_goal_usd'] = np.log(x_train.goal_usd)
x_train = x_train.drop(columns=['goal_usd'])

x_test['sqrt_goal_usd'] = np.sqrt(x_test.goal_usd)
x_test['log_goal_usd'] = np.log(x_test.goal_usd)
x_test = x_test.drop(columns=['goal_usd'])

In [23]:
cls = fit_evaluate(x_train, x_test, y_train, y_test)



AUC of ROC on train : 0.763
AUC of ROC on validation : 0.7405


```Great! Now create another five unique and creative features that will make your tutor impressed and improve the model's validation AUC further. ```

In [24]:
# FILL HERE
print(x_train.dtypes)
x_train.describe()

goal                    float64
count_7_days            float64
online_days             float64
goal_per_day            float64
goal_usd_per_day        float64
timedeltas              float64
category_target         float64
main_category_target    float64
currency_target         float64
country_target          float64
sqrt_goal_usd           float64
log_goal_usd            float64
dtype: object


Unnamed: 0,goal,count_7_days,online_days,goal_per_day,goal_usd_per_day,timedeltas,category_target,main_category_target,currency_target,country_target,sqrt_goal_usd,log_goal_usd
count,300689.0,300689.0,300689.0,300689.0,300689.0,300689.0,300689.0,300689.0,300689.0,300689.0,300689.0,300689.0
mean,46367.54,1148.418765,34.336245,2359.769,2328.624,1.036895,0.354855,0.354855,0.354855,0.354855,108.563596,8.618998
std,1137464.0,529.89976,73.724752,531581.5,531775.4,64.988081,0.140423,0.10083,0.044398,0.0579,181.880981,1.689463
min,0.01,0.0,0.035498,0.0002832798,0.0002832798,0.0,0.060156,0.19772,0.188437,0.027661,0.1,-4.60517
25%,2000.0,820.0,29.100972,66.83388,66.87477,0.0,0.263831,0.299421,0.372411,0.376282,44.72136,7.600902
50%,5000.0,1073.0,29.732558,171.0775,171.3067,0.0,0.341752,0.336858,0.372411,0.376282,70.710678,8.517193
75%,15000.0,1478.0,38.014028,501.3495,499.5274,0.0,0.436132,0.408456,0.372411,0.376282,122.474487,9.615805
max,100000000.0,4012.0,16738.958333,290303100.0,290303100.0,14498.0,0.75,0.629836,0.372411,0.376282,11746.411741,18.742606


In [25]:
# FILL HERE
x_train['sqrt_country_target'] = np.sqrt(x_train.country_target)
x_train['log_country_target'] = np.log(x_train.country_target)
x_train = x_train.drop(columns=['country_target'])

x_test['sqrt_country_target'] = np.sqrt(x_test.country_target)
x_test['log_country_target'] = np.log(x_test.country_target)
x_test = x_test.drop(columns=['country_target'])

In [26]:
cls = fit_evaluate(x_train, x_test, y_train, y_test)



AUC of ROC on train : 0.763
AUC of ROC on validation : 0.7405


In [27]:
# FILL HERE
x_train['sqrt_goal'] = np.sqrt(x_train.goal)
x_train['log_goal'] = np.log(x_train.goal)
x_train = x_train.drop(columns=['goal'])

x_test['sqrt_goal'] = np.sqrt(x_test.goal)
x_test['log_goal'] = np.log(x_test.goal)
x_test = x_test.drop(columns=['goal'])

cls = fit_evaluate(x_train, x_test, y_train, y_test)



AUC of ROC on train : 0.763
AUC of ROC on validation : 0.7405


In [28]:
# FILL HERE
x_train.columns

Index(['count_7_days', 'online_days', 'goal_per_day', 'goal_usd_per_day',
       'timedeltas', 'category_target', 'main_category_target',
       'currency_target', 'sqrt_goal_usd', 'log_goal_usd',
       'sqrt_country_target', 'log_country_target', 'sqrt_goal', 'log_goal'],
      dtype='object')