

# Business Cases with Data Science 

## Case 3: Prediction of Bookings Cancellation

Just Feature Selection

# Feature Selection


![alt text](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2019/11/How-to-Choose-Feature-Selection-Methods-For-Machine-Learning.png)


## Methodology:

Split the dataset into numerical and categorical input features and their target variable ("IsCanceled"). The notation is as follows: X (input data) = data; y (target value) = target.

Apply the following steps:
 
1. **Numerical:**

1.1. RFE and Logistic Regression

1.2. Lasso Regression

2. **Categorical**

2.1. Chi-Squared

3. **Other Approaches**

3.1. Variance Threshold
3.2. Businesswise / After Data Exploration


**Open questions:**

- Normalize the data before doing feature selection?
- Split the dataset into training and test sets? (In Lab 2 we did this only to find the right number of features, not within the RFE itself.)
- Can we use Lasso regression for feature selection?
- Is VarianceThreshold inappropriate for supervised learning?
- How to encode the categorical variables?
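
One common way to handle the first two questions (a sketch, not necessarily what the lab intended) is to put scaling and RFE into a single scikit-learn `Pipeline` and fit it on the training split only, so nothing is learned from the test rows. The synthetic data, the names `X_demo`/`y_demo`, and the choice of 3 features are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical booking features (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Scaling and selection are fitted on the training split only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", RFE(estimator=LogisticRegression(max_iter=1000),
                   n_features_to_select=3)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

demo_mask = pipe.named_steps["select"].support_  # boolean mask of kept features
demo_score = pipe.score(X_te, y_te)              # accuracy on the held-out split
print(demo_mask, demo_score)
```

Scaling matters here because RFE ranks features by the size of the logistic regression coefficients, and those depend on the units of each column.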

In [44]:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import category_encoders as ce
import collections
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
import networkx as nx
import plotly.offline as po 
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder

In [45]:
df = pd.read_csv("data.csv")

## Paste step dealing with missing values

For explainability: drop all rows with NaN values (Children: 4; Country: 24).

In [46]:
df.dropna(inplace=True)
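
A small sketch of how the per-column NaN counts quoted above could be checked before dropping rows. The miniature frame `demo` is made up; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the values are invented for illustration.
demo = pd.DataFrame({
    "IsCanceled": [0, 1, 0, 1],
    "Children": [0.0, np.nan, 1.0, 0.0],
    "Country": ["PRT", "GBR", None, "PRT"],
})

# Number of missing values per column
missing_per_column = demo.isna().sum()

# Drop incomplete rows and report how many were lost
rows_before = len(demo)
demo_clean = demo.dropna()
rows_dropped = rows_before - len(demo_clean)
print(missing_per_column)
print("Rows dropped:", rows_dropped)
```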

In [47]:
# 1. Step: divide dataset into data and target (cancellation:"IsCanceled")
target = df.iloc[:,0]
data = df.iloc[:,1:]

In [48]:
# 2. Step: divide data into categorical and numerical input
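# Numerical columns, selected via the public pandas API
data_num = data.select_dtypes(include="number")

# Everything else is treated as categorical; drop() keeps the original column order
data_cat = data.drop(columns=data_num.columns)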

In [49]:
# 3. Step: split each into Training- and Testset
# Numerical Data
X_train_num, X_test_num, y_train_num, y_test_num = train_test_split(data_num,target, test_size = 0.3, random_state = 1)

# Categorical Data
X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(data_cat,target, test_size = 0.3, random_state = 1)

### **Based on the Pearson and Spearman correlations, no feature can be discarded, because the variables are in general only weakly correlated with each other. (see main file)**

## 1. Numerical Features

### 1.1. RFE with logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
rfe = RFE(estimator = model, n_features_to_select = 9)
X_rfe = rfe.fit_transform(X = data_num, y = target)
model.fit(X=X_rfe, y = target)

In [None]:
rfe.support_

In [None]:
rfe.ranking_

In [None]:
selected_features = pd.Series(rfe.support_, index = data_num.columns)
selected_features

**Figure out the optimal number of features (Source: Practical Lesson 2 Machine Learning)**

In [None]:
# Candidate numbers of features to keep
nof_list = np.arange(1, 13)
high_score = 0
# Variable to store the optimum number of features
nof = 0
score_list = []

# Split once; random_state is fixed, so the split is the same for every candidate
X_train, X_test, y_train, y_test = train_test_split(data_num, target, test_size=0.3, random_state=0)

for n in nof_list:
    model = LogisticRegression()
    # n_features_to_select must be passed by keyword in recent scikit-learn versions
    rfe = RFE(estimator=model, n_features_to_select=int(n))
    X_train_rfe = rfe.fit_transform(X_train, y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe, y_train)

    # Accuracy on the held-out split with the selected features
    score = model.score(X_test_rfe, y_test)
    score_list.append(score)

    if score > high_score:
        high_score = score
        nof = n
print("Optimum number of features: %d" % nof)
print("Score with %d features: %f" % (nof, high_score))
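
As an alternative to the manual loop, scikit-learn's `RFECV` picks the number of features by cross-validation instead of a single train/test split. This is a sketch on synthetic data; the 5 folds and the accuracy scoring are assumptions, not settings taken from the lab.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for data_num / target (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

# Recursive elimination, scored with 5-fold cross-validated accuracy
rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,
    cv=5,
    scoring="accuracy",
)
rfecv.fit(X_demo, y_demo)

n_selected = rfecv.n_features_  # number of features with the best CV score
print("Selected features:", n_selected)
print("Mask:", rfecv.support_)
```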

### Conclusion 1.1 - RFE + LogisticRegression: 

#### Discard the following features:

1. LeadTime 
2. ArrivalDateYear
3. ArrivalDateWeekNumber 
4. ArrivalDateDayOfMonth
5. Children
6. Babies
7. DaysInWaitingList


### 1.2. Lasso Regression

In [None]:
reg = LassoCV()
reg.fit(X=data_num, y=target)
print("Best alpha: %f" % reg.alpha_)

In [None]:
coef = pd.Series(reg.coef_, index=data_num.columns)
coef.sort_values()

Plot the importance (Source: Practical Lesson 2 Machine Learning)

In [None]:
def plot_importance(coef,name):
    imp_coef = coef.sort_values()
    plt.figure(figsize=(8,10))
    imp_coef.plot(kind = "barh")
    plt.title("Feature importance using " + name + " Model")
    plt.show()
plot_importance(coef, "lasso")

### Conclusion 1.2 - Lasso Regression: 

#### Discard the following features:

1. Adults                         
2. StaysInWeekendNights           
3. Babies                        
4. Children
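
The discard list can also be read directly from the coefficients: features whose Lasso coefficient is exactly zero are the ones dropped. The sketch below uses synthetic regression data and hypothetical column names `f0`..`f5`. It standardizes the inputs first, because the L1 penalty depends on the scale of each feature; whether to do that for the booking data is one of the open questions above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data with 2 informative features out of 6 (assumption).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, n_informative=2, noise=5.0, random_state=0
)
cols = [f"f{i}" for i in range(6)]  # hypothetical feature names

# Put all features on the same scale before applying the L1 penalty
X_scaled = StandardScaler().fit_transform(X_demo)

reg_demo = LassoCV(cv=5, random_state=0).fit(X_scaled, y_demo)
coef_demo = pd.Series(reg_demo.coef_, index=cols)

kept = coef_demo[coef_demo != 0].index.tolist()     # non-zero coefficient
dropped = coef_demo[coef_demo == 0].index.tolist()  # shrunk exactly to zero
print("Keep:", kept)
print("Discard:", dropped)
```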

## 2. Categorical Features


#### Encode Categorical Data (Source: https://machinelearningmastery.com/feature-selection-with-categorical-data/)

In [52]:
# prepare input data
def prepare_inputs(X_train_cat, X_test_cat):
    oe = OrdinalEncoder()
    oe.fit(X_train_cat)
    X_train_cat_enc = oe.transform(X_train_cat)
    X_test_cat_enc = oe.transform(X_test_cat)
    return X_train_cat_enc, X_test_cat_enc
 
# prepare target
def prepare_targets(y_train_cat, y_test_cat):
    le = LabelEncoder()
    le.fit(y_train_cat)
    y_train_cat_enc = le.transform(y_train_cat)
    y_test_cat_enc = le.transform(y_test_cat)
    return y_train_cat_enc, y_test_cat_enc

In [53]:
"""# prepare input data
X_train_cat_enc, X_test_cat_enc = prepare_inputs(X_train_cat, X_test_cat)
# prepare output data
y_train_cat_enc, y_test_cat_enc = prepare_targets(y_train_cat, y_test_cat)"""


### 2.1. Chi-Squared / Select K Best  

In [39]:
"""# feature selection
def select_features(X_train_cat, y_train_cat, X_test_cat):
    fs = SelectKBest(score_func=chi2, k='all')
    fs.fit(X_train_cat, y_train_cat)
    X_train_fs = fs.transform(X_train_cat)
    X_test_fs = fs.transform(X_test_cat)
    return X_train_fs, X_test_fs, fs"""

In [40]:
"""X_train_fs, X_test_fs, fs = select_features(X_train_cat_enc, y_train_cat_enc, X_test_cat_enc)"""
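
The encode-then-score flow of this section, as a runnable sketch on toy data. The frame `toy` and its values are invented; only the column names resemble the dataset. Note that scikit-learn's `chi2` requires non-negative inputs, which ordinal codes satisfy, but the score then depends on the arbitrary integer assigned to each category, so the ranking should be read with care.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import OrdinalEncoder

# Toy data (assumption): "DepositType" follows the target exactly, "Meal" does not.
toy = pd.DataFrame({
    "DepositType": ["No Deposit", "Non Refund"] * 20,
    "Meal": ["BB", "BB", "HB", "HB"] * 10,
})
toy_target = pd.Series([0, 1] * 20)

# Encode categories as non-negative integer codes
enc = OrdinalEncoder()
toy_enc = enc.fit_transform(toy)

# Score every feature against the target with the chi-squared statistic
fs_demo = SelectKBest(score_func=chi2, k="all").fit(toy_enc, toy_target)
chi2_scores = pd.Series(fs_demo.scores_, index=toy.columns).sort_values(ascending=False)
print(chi2_scores)
```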

In [38]:
"""for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
plt.bar(range(len(fs.scores_)), fs.scores_)
plt.show()"""

## 3. Other Approaches

### 3.1 VarianceThreshold

In [None]:
selector = VarianceThreshold()
selector.fit_transform(data_num)
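
`fit_transform` returns a plain array without column names. A sketch of how the surviving column names could be recovered with `get_support()`, using a made-up frame `toy_num`. `VarianceThreshold` never looks at the target, so with the default threshold it only removes constant columns.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Invented values; "Babies" is constant and therefore has zero variance.
toy_num = pd.DataFrame({
    "Adults": [1, 2, 2, 3],
    "Babies": [0, 0, 0, 0],
    "LeadTime": [10, 45, 7, 120],
})

vt = VarianceThreshold()  # default threshold=0.0 removes only constant features
vt.fit(toy_num)

# Map the boolean support mask back to column names
kept_cols = toy_num.columns[vt.get_support()].tolist()
print(kept_cols)
```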

### 3.2 Businesswise / After Data Exploration

# Feature Engineering:

- 