# Advanced Feature Engineering
In this exercise we will learn how feature engineering can improve a model's performance.
We will work with a Kaggle dataset of Kickstarter projects. If you don't know Kickstarter (shame on you!), your first assignment is to visit the <a href='https://www.kickstarter.com/'>Kickstarter</a> website and find a cool project.
Each record represents one project and some basic information about it. 

The dataset can be found <a href='https://www.kaggle.com/kemical/kickstarter-projects#'>here</a>.

In this exercise we will try to predict whether a project will be a success or not (binary classification).

Have fun :)

``` ~Lior Hirsch ```

```First, make sure all the following libraries are installed on your computer.```

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce
import itertools
from currency_converter import CurrencyConverter
import xgboost as xgb

In [None]:
random_state = 1
np.random.seed(random_state)

## Loading and cleaning the data
```The following code performs a simple cleaning of the dataset. The original dataset has several values in the state column (the target column). Instead, we will use a newly generated binary column, output, which is derived from the state column.```

```Your first assignment is to create a train-test-split that fits our dataset.```

#### Questions

```How can you test whether the train-test-split is good?```

```Explain your train-test-split proposal. Why does it fit the dataset?```

In [None]:
def split(data):
#     FILL HERE

In [None]:
def load_clean_split_datasets():
    ks = pd.read_csv('ks-projects-201801.csv',
                 parse_dates=['deadline', 'launched'])
    # Drop live projects
    ks = ks[ks.state != 'live']

    # Add outcome column, "successful" == 1, others are 0
    ks['output'] = (ks['state'] == 'successful').astype(int)

    # Drop the columns that reveal the outcome (pledged amounts, backers, state) and usd_goal_real
    ks = ks.drop(columns = ['pledged', 'backers', 'usd pledged', 'usd_pledged_real', 'usd_goal_real', 'state'])
    
    ks_train, ks_test = split(ks)
    ks_train = ks_train.reset_index()
    ks_test = ks_test.reset_index()
    
    return ks_train, ks_test

## Baseline model
```The following code builds a baseline model. Our data contains categorical columns, so we need to encode them (don't worry, we will learn about different encoders). For now we will use a basic encoding method called LabelEncoder. Read about this encoder.```

```Pay attention to the helper methods which will be used in this exercise.```

In [None]:
def get_xy_by_columns(df_train, df_test, columns):
    # Use the function arguments rather than the global ks_train / ks_test
    x_train = df_train[columns].copy()
    y_train = df_train['output'].copy()

    x_test = df_test[columns].copy()
    y_test = df_test['output'].copy()
    
    return x_train, x_test, y_train, y_test

In [None]:
def fit_evaluate(x_train, x_val, y_train, y_val):
    cls = xgb.XGBClassifier(n_jobs = -1, n_estimators=50, max_depth = 5, random_state=random_state)
    cls.fit(x_train, y_train)
    
    preds = cls.predict_proba(x_train)
    print(f"AUC of ROC on train : {np.round(roc_auc_score(y_train, preds[:,1]), 4)}")
    
    preds = cls.predict_proba(x_val)
    print(f"AUC of ROC on validation : {np.round(roc_auc_score(y_val, preds[:,1]), 4)}")
    
    return cls

In [None]:
ks_train, ks_test = load_clean_split_datasets()
relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country']
x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)

```Before fitting a model to our data, we need to encode the categorical data as numeric types (Why?). LabelEncoder is a simple encoding method. Read about this encoder and use it.```

```You might find categories in the test set which do not exist in the train set. Think about how to fix this problem.```
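As a hint for the unseen-category problem, here is a minimal sketch on toy data. It is one possible approach, not the expected solution: the placeholder label `__unknown__` and the toy series are assumptions made only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical): "JPY" appears only in the test set
toy_train = pd.Series(["USD", "EUR", "USD", "GBP"])
toy_test = pd.Series(["EUR", "JPY"])

# Fit on the train values plus a placeholder label reserved for unseen categories
encoder = LabelEncoder()
encoder.fit(pd.concat([toy_train, pd.Series(["__unknown__"])]))

# Replace every test value that was not seen in train with the placeholder
toy_test_safe = toy_test.where(toy_test.isin(toy_train.unique()), "__unknown__")

train_codes = encoder.transform(toy_train)
test_codes = encoder.transform(toy_test_safe)
```

LabelEncoder assigns codes by sorted class order, so the numbers carry no meaning beyond identity. Tree-based models can often cope with that; linear models usually cannot.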

In [None]:
# FILL HERE

In [None]:
fit_evaluate(x_train, x_test, y_train, y_test)

## Date encoding
```First, we'll start with a basic date encoding for the launch date and the deadline date. For each date, create three new columns: the hour, the day and the month of the date. At the end of this encoding the data should contain six new columns: launched_hour, launched_day, launched_month, deadline_hour, deadline_day, deadline_month.```
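The pandas `.dt` accessor exposes the parts of a datetime column. A minimal sketch on a toy frame (the timestamps are made up for illustration):

```python
import pandas as pd

# Toy frame with a parsed datetime column, as produced by parse_dates in read_csv
toy = pd.DataFrame({
    "launched": pd.to_datetime(["2015-08-11 12:12:28", "2017-09-02 04:43:57"]),
})

# Each date part becomes its own integer column
toy["launched_hour"] = toy["launched"].dt.hour
toy["launched_day"] = toy["launched"].dt.day
toy["launched_month"] = toy["launched"].dt.month
```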

#### Questions
```Why won't we create a column for the year?```

In [None]:
def encode_launch_dt(df):
    # FILL HERE

def encode_deadline_dt(df):
    # FILL HERE

In [None]:
ks_train, ks_test = load_clean_split_datasets()

ks_train = encode_launch_dt(ks_train)
ks_train = encode_deadline_dt(ks_train)

ks_test = encode_launch_dt(ks_test)
ks_test = encode_deadline_dt(ks_test)

In [None]:
relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country',
                     'launched_hour', 'launched_day', 'launched_month',
                     'deadline_hour', 'deadline_day', 'deadline_month']

x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)

```Use the label encoder you used in the previous section to encode the categorical columns.```

In [None]:
# FILL HERE

In [None]:
cls = fit_evaluate(x_train, x_test, y_train, y_test)

## Categorical Encoding
```Next, we will learn about different categorical encodings. Read how each encoding method works and try to understand when each one should be used. Be ready to discuss this with your tutor.```

```You can start by reading the following blog-posts:```

https://wrosinski.github.io/fe_categorical_encoding/

https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/

### Count encoding
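Count encoding replaces each category with the number of times it appears in the train set. A minimal sketch on toy data follows; the column name `country_count` is an assumption for illustration, and categories unseen in train come out as NaN.

```python
import pandas as pd

# Toy data (hypothetical): "FR" appears only in the validation set
toy_train = pd.DataFrame({"country": ["US", "US", "GB", "US", "DE"]})
toy_val = pd.DataFrame({"country": ["GB", "FR"]})

# Frequencies are learned from the train set only
counts = toy_train["country"].value_counts()

# Map each category to its train frequency; unseen categories map to NaN
toy_train["country_count"] = toy_train["country"].map(counts)
toy_val["country_count"] = toy_val["country"].map(counts)
```

Learning the counts from the train set alone keeps information from the validation set out of the features.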

In [None]:
def count_encoding(x_train, x_val, col):
    # FILL HERE

In [None]:
ks_train, ks_test = load_clean_split_datasets()

ks_train = encode_launch_dt(ks_train)
ks_train = encode_deadline_dt(ks_train)

ks_test = encode_launch_dt(ks_test)
ks_test = encode_deadline_dt(ks_test)

relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country',
                     'launched_hour', 'launched_day', 'launched_month',
                     'deadline_hour', 'deadline_day', 'deadline_month']

x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)

In [None]:
cat_features = ['category', 'main_category', 'currency', 'country']

for curr_cat in cat_features:
    x_train, x_test = count_encoding(x_train, x_test, curr_cat)
    
x_test = x_test.fillna(0)

x_train = x_train.drop(columns=cat_features)
x_test = x_test.drop(columns=cat_features)

In [None]:
fit_evaluate(x_train, x_test, y_train, y_test)

### Target encoder
```Target encoding can be used with/without smoothing. We will implement both types. We'll start without smoothing. Read about the difference between them.```
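To make the difference concrete, here is a sketch of one common smoothing formula, `(n * category_mean + m * global_mean) / (n + m)`, where `n` is the category size. The function name `smooth_mean_sketch` and the toy data are assumptions for illustration. Other smoothing schemes exist.

```python
import pandas as pd

# Toy data (hypothetical): category "b" has a single row, so its raw mean is unreliable
toy = pd.DataFrame({
    "category": ["a", "a", "a", "b"],
    "output": [1, 1, 0, 0],
})

def smooth_mean_sketch(df, by, on, m=0):
    """Return the per-category target mean, pulled toward the global mean by weight m."""
    prior = df[on].mean()  # global target mean
    stats = df.groupby(by)[on].agg(["count", "mean"])
    return (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)

no_smoothing = smooth_mean_sketch(toy, "category", "output", m=0)
smoothed = smooth_mean_sketch(toy, "category", "output", m=2)
```

With `m = 0` the result is the plain per-category mean. A larger `m` moves small categories closer to the global mean.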

In [None]:
# df - data, by - categorical column, on - target column, m - smoothing hyperparameter (0 means no smoothing).
def calc_smooth_mean(df, by, on, m = 0):
    # FILL HERE

In [None]:
# Target encoding without smoothing
ks_train, ks_test = load_clean_split_datasets()

ks_train = encode_launch_dt(ks_train)
ks_train = encode_deadline_dt(ks_train)

ks_test = encode_launch_dt(ks_test)
ks_test = encode_deadline_dt(ks_test)

relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country',
                     'launched_hour', 'launched_day', 'launched_month',
                     'deadline_hour', 'deadline_day', 'deadline_month']

In [None]:
# FILL HERE

In [None]:
fit_evaluate(x_train, x_test, y_train, y_test)

In [None]:
# Target encoding with smoothing
ks_train, ks_test = load_clean_split_datasets()

ks_train = encode_launch_dt(ks_train)
ks_train = encode_deadline_dt(ks_train)

ks_test = encode_launch_dt(ks_test)
ks_test = encode_deadline_dt(ks_test)

relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country',
                     'launched_hour', 'launched_day', 'launched_month',
                     'deadline_hour', 'deadline_day', 'deadline_month']

In [None]:
# FILL HERE

In [None]:
fit_evaluate(x_train, x_test, y_train, y_test)

### Catboost encoding
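CatBoost-style encoding is an ordered variant of target encoding: each row is encoded using only the targets of earlier rows in the same category, which limits target leakage. The sketch below is a simplified hand-rolled version on toy data. The smoothing weight `a` and the column name `category_cb` are assumptions for illustration. The `category_encoders` package imported above provides a `CatBoostEncoder` class for real use.

```python
import pandas as pd

# Toy data (hypothetical); row order matters for this encoding
toy = pd.DataFrame({
    "category": ["a", "b", "a", "a", "b"],
    "output": [1, 0, 0, 1, 1],
})

prior = toy["output"].mean()  # global target mean
a = 1.0                       # hypothetical smoothing weight

grouped = toy.groupby("category")["output"]
prev_sum = grouped.cumsum() - toy["output"]  # target sum over earlier rows of the same category
prev_count = grouped.cumcount()              # number of earlier rows of the same category

# The first row of each category has no history and falls back to the prior
toy["category_cb"] = (prev_sum + a * prior) / (prev_count + a)
```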

In [None]:
# Catboost encoding
ks_train, ks_test = load_clean_split_datasets()

ks_train = encode_launch_dt(ks_train)
ks_train = encode_deadline_dt(ks_train)

ks_test = encode_launch_dt(ks_test)
ks_test = encode_deadline_dt(ks_test)

relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country',
                     'launched_hour', 'launched_day', 'launched_month',
                     'deadline_hour', 'deadline_day', 'deadline_month']
x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)

In [None]:
# FILL HERE

In [None]:
fit_evaluate(x_train, x_test, y_train, y_test)

## Feature Generation
```Creating new features from the raw data is a powerful way to improve your model performance.```

```In the following section we will generate the features and evaluate their impact in the end.```

In [None]:
ks_train, ks_test = load_clean_split_datasets()

ks_train = encode_launch_dt(ks_train)
ks_train = encode_deadline_dt(ks_train)

ks_test = encode_launch_dt(ks_test)
ks_test = encode_deadline_dt(ks_test)

relevant_columns = ['category', 'main_category', 'currency', 'goal', 'country', 'launched', 'deadline',
                     'launched_hour', 'launched_day', 'launched_month',
                     'deadline_hour', 'deadline_day', 'deadline_month']
x_train, x_test, y_train, y_test = get_xy_by_columns(ks_train, ks_test, relevant_columns)

cat_features = ['category', 'main_category', 'currency', 'country']

### Interactions
```One of the easiest ways to create new features is by combining categorical variables. Create all the combinations of any two categorical features. Don't forget to encode the new categorical features.```
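A minimal sketch of pairwise interactions on a toy frame, using `itertools.combinations`. The naming scheme `colA_colB` and the `_` separator are assumptions for illustration.

```python
import itertools
import pandas as pd

# Toy frame (hypothetical) with three categorical columns
toy = pd.DataFrame({
    "main_category": ["Music", "Games"],
    "currency": ["USD", "GBP"],
    "country": ["US", "GB"],
})
cat_cols = ["main_category", "currency", "country"]

# Every unordered pair of categorical columns becomes one new string column
for col_a, col_b in itertools.combinations(cat_cols, 2):
    toy[f"{col_a}_{col_b}"] = toy[col_a] + "_" + toy[col_b]
```

The new columns are still strings, so they need the same encoding step as the original categorical columns.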

#### Questions
```When is this not a good idea, and how can the problem be solved?```

In [None]:
# FILL HERE - create all the combinations

In [None]:
# FILL HERE - encode the categorical columns

### Domain knowledge 
```In this section we will create new features based on the original ones with a pinch of imagination and creativity. Generating features from domain knowledge is one of the most effective ways to improve our model.```

```Don't forget to add the new features to both x_train and x_test.```

```Create a feature that contains the goal in USD currency```

In [None]:
# FILL HERE

```For each record, count the number of projects launched in the preceding week.```
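One way to approach this is a time-based rolling window in pandas. The sketch below runs on a toy frame with distinct timestamps. The column name `launched_prev_week` is an assumption for illustration.

```python
import pandas as pd

# Toy frame (hypothetical), deliberately not in chronological order
toy = pd.DataFrame({
    "launched": pd.to_datetime(["2020-01-05", "2020-01-01", "2020-01-20", "2020-01-03"]),
})

# Row labels indexed by launch time, sorted chronologically (rolling needs a sorted index)
by_time = pd.Series(toy.index, index=toy["launched"]).sort_index()

# The 7-day window ends at the current row and includes it, so subtract one
counts = by_time.rolling("7D").count() - 1

# Put the counts back in the original row order
counts.index = by_time.values
toy["launched_prev_week"] = counts.reindex(toy.index).astype(int)
```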

In [None]:
# FILL HERE

```Count the days each project was online```
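Subtracting two datetime columns gives a timedelta column, and `.dt.days` extracts the whole days. A minimal sketch on toy data (the column name `days_online` is an assumption):

```python
import pandas as pd

# Toy frame (hypothetical) with parsed launch and deadline timestamps
toy = pd.DataFrame({
    "launched": pd.to_datetime(["2016-01-01 10:00:00", "2016-03-10 00:00:00"]),
    "deadline": pd.to_datetime(["2016-01-31", "2016-04-09"]),
})

# .dt.days keeps only whole days, so 29 days and 14 hours becomes 29
toy["days_online"] = (toy["deadline"] - toy["launched"]).dt.days
```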

In [None]:
# FILL HERE

```Calculate the goal per day```

In [None]:
# FILL HERE

```Calculate the goal per day in USD```

In [None]:
# FILL HERE

```Calculate the time since the previous project was launched in the same category.```
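A possible approach is to sort by launch time and take the difference within each category. The sketch below uses toy data. The unit (hours) and the column name are assumptions for illustration.

```python
import pandas as pd

# Toy frame (hypothetical), not in chronological order
toy = pd.DataFrame({
    "category": ["Music", "Games", "Music", "Games"],
    "launched": pd.to_datetime(["2016-01-10", "2016-01-02", "2016-01-01", "2016-01-05"]),
})

# Sort chronologically, then diff within each category
ordered = toy.sort_values("launched")
gap = ordered.groupby("category")["launched"].diff()

# Assignment aligns on the original row labels; the first project per category gets NaN
toy["hours_since_last_in_category"] = gap.dt.total_seconds() / 3600
```

The first project in each category has no predecessor, so its value is missing and has to be filled or left for the model to handle.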

In [None]:
# FILL HERE

### Transforming numerical features
```Numerical features can be transformed with mathematical transformations such as log, sqrt etc. Create another two features: log(goal_usd) and sqrt(goal_usd).```

#### Questions
```Why are these transformations useful?```

```In which cases/models should we use these transformations?```
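A minimal sketch of both transformations on toy values, assuming `goal_usd` is strictly positive (a zero goal would need something like `np.log1p` instead). The new column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy goals (hypothetical) spanning several orders of magnitude
toy = pd.DataFrame({"goal_usd": [100.0, 10000.0, 1000000.0]})

# Both transformations compress the long right tail of the distribution
toy["goal_usd_log"] = np.log(toy["goal_usd"])
toy["goal_usd_sqrt"] = np.sqrt(toy["goal_usd"])
```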


In [None]:
# FILL HERE

In [None]:
# FILL HERE - Remove unnecessary features

In [None]:
cls = fit_evaluate(x_train, x_test, y_train, y_test)

```Great! Now create another five unique and creative features that will impress your tutor and further improve the model's validation AUC.```

In [None]:
# FILL HERE


In [None]:
# FILL HERE


In [None]:
# FILL HERE


In [None]:
# FILL HERE


In [None]:
# FILL HERE
