# Data Preparation (DP) homework, part I: Feature processing

Welcome to the first part of Data Preparation homework!

In this notebook we're going to continue to work with the data about Bank Telemarketing.

We want to understand how to prepare the data so that it is ready for actual model building.

# Some code to mount Google Drive to the notebook

It is not necessary to get data exactly this way.

You could just upload it to the `sample_data` folder or use something like `wget` followed by `unzip`.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
! ls -alh '/content/drive/My Drive/Data Science Basic/03 Exploratory data analysis/DP/Homework/data'

total 6.0M
-rw------- 1 root root 5.6M Jul 16 10:33 bank-additional-full.csv
-rw------- 1 root root 5.4K Jul 16 10:33 bank-additional-names.txt
-rw------- 1 root root  51K Jul 16 11:01 FEATURE_PROCESSING.ipynb
-rw------- 1 root root 310K Jun 25  2019 PIPELINES.ipynb


# All necessary imports

It is a good practice to have all imports in one place.

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import os
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.impute import MissingIndicator

from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler

from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import QuantileTransformer

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder


from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import ClusterCentroids
from imblearn.under_sampling import NearMiss

from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
%matplotlib inline

In [None]:
sns.set(font_scale=2)

# A few introductory words

In a nutshell: preprocessing is important.

We could stop there, but that would be like answering the question "What is your name?" with "Whatever I was named".

Generally speaking, data preprocessing can mean quite different sets of activities.

For images, it can mean scaling, cropping, applying various masks, rotations, shifts, and so on.

For text, it means lemmatization, stemming, extracting suffixes and affixes, and applying various regular expressions.

For tabular data, the list is long: filling gaps in cells, normalizing features, applying various linear and non-linear transformations, quantization, binarization, and much more.

In this notebook we will, again, not dive deep into the pros and cons of the techniques presented here (although, of course, everything is relative).

Instead, let's go over some of the techniques for which implementations are contained in scikit-learn and imbalanced-learn, in order to have some general idea.

# Read the data

Before you go through the cells, you must download the data.

It can be found [at this link](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing).

The link is also stored in a file in [this GitHub repository](https://github.com/IooHooI/DATA_PREPARATION/blob/master/data/data_description.json).

In [None]:
data_root_path = '/content/drive/My Drive/Data Science Basic/03 Exploratory data analysis/DP/Homework/data'

In [None]:
data = pd.read_csv(os.path.join(data_root_path, 'bank-additional-full.csv'), sep=';')

In [None]:
data.head().T

In [None]:
data.info()

# Add some gaps in the data

Let's pretend there are gaps in the data.

To do this, take all the features except the target and scatter a handful of NaNs across them:

In [None]:
columns_with_gaps = data.columns[:-1]

In [None]:
columns_with_gaps

For each feature, the share of missing values will be drawn from the range 0 to 30%:

In [None]:
minimum = 0
maximum = 0.3

Create a dictionary that maps each feature to its share of gaps; we will use it to iterate over the features and insert gaps into them:

In [None]:
columns_with_gaps_dict = dict(
    zip(
        columns_with_gaps,
        np.random.uniform(
            minimum,
            maximum,
            len(columns_with_gaps)
        )
    )
)

In [None]:
columns_with_gaps_dict

Now, in order not to spoil the source data, take a copy of it and fill it with gaps:

In [None]:
data_with_gaps_v1 = data.copy()

In [None]:
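for column in columns_with_gaps:
    if columns_with_gaps_dict[column] > 0:
        gaps_count = int(len(data_with_gaps_v1) * columns_with_gaps_dict[column])
        # Row positions are drawn with replacement, so duplicates are possible.
        row_positions = np.random.randint(
            0,
            len(data_with_gaps_v1),
            gaps_count
        )
        # Assign the whole column at once: chained assignment such as
        # `df[column].iloc[...] = value` may write to a temporary copy.
        is_gap = np.zeros(len(data_with_gaps_v1), dtype=bool)
        is_gap[row_positions] = True
        data_with_gaps_v1[column] = data_with_gaps_v1[column].mask(is_gap)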

In [None]:
# `null_counts` was renamed to `show_counts` in newer pandas versions.
data_with_gaps_v1.info(verbose=True, show_counts=True)
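
One detail of the loop above is worth noting: `np.random.randint` draws row positions with replacement, so some positions repeat and the actual share of gaps ends up a bit lower than the requested one. Below is a minimal sketch of an alternative that hits the requested count exactly. The toy frame and the names `rng`, `toy` and `share` are illustrative assumptions, not part of the homework.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator, so the result is reproducible

# Toy stand-in for one numerical column of the real dataset.
toy = pd.DataFrame({"age": np.arange(1000, dtype=float)})
share = 0.2  # requested share of gaps

gaps_count = int(len(toy) * share)

# replace=False guarantees that every drawn row position is unique.
row_positions = rng.choice(len(toy), size=gaps_count, replace=False)
toy.iloc[row_positions, toy.columns.get_loc("age")] = np.nan

actual_share = toy["age"].isna().mean()
print(gaps_count, actual_share)
```

Whether the exact count matters depends on the experiment; for this notebook an approximate share is good enough.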

# Imputers

Missing values are, of course, entirely expected.

If your task has no gaps, then either you are dreaming or you are working on a task from a parallel world where there is no suffering and everyone is happy.

However, if you are still in the real world, then (in one form or another) you will need to handle gaps in the data.

In general, there are several options:
- drop the rows / columns that contain gaps;
- fill in the gaps with default values (for example, 0 or -1 for numerical features or "unknown" for categorical ones);
- fill in the gaps with statistics;
- select a subspace without gaps, build a [k-d tree](http://scikit-learn.org/0.19/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree) on it, take the top N nearest objects, and compute statistics over them (mean or median for numerical features, mode for categorical ones);
- etc.
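
The options listed above can be sketched on a toy frame. This is only an illustration under a few assumptions: the frame `toy` and its values are made up, and the nearest-neighbour option uses `sklearn.impute.KNNImputer` (available in scikit-learn 0.22 and later), which, as far as I know, measures NaN-aware Euclidean distances directly rather than through a k-d tree.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with one gap in each column (values are made up).
toy = pd.DataFrame({
    "age": [25.0, 27.0, np.nan, 52.0, 55.0],
    "campaign": [1.0, 2.0, 1.0, np.nan, 8.0],
})

# Option 1: drop every row that has at least one gap.
dropped = toy.dropna()

# Option 2: fill gaps with a default value.
filled_default = toy.fillna(-1)

# Option 3: fill gaps with a per-column statistic (here: the median).
filled_median = toy.fillna(toy.median())

# Option 4: replace each gap with the average over the 2 nearest rows.
knn_imputer = KNNImputer(n_neighbors=2)
filled_knn = pd.DataFrame(knn_imputer.fit_transform(toy), columns=toy.columns)

print(filled_knn)
```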

By the way, an entire [book](http://books.sernam.ru/book_stan.php) has been written on this subject.

We will not dwell on this topic here in detail.

Let's consider only a few of the simplest approaches that you may encounter in practice.

## sklearn.impute.SimpleImputer

Here we use a fairly easy-to-use class that provides several options, namely:
- fill in the blanks with average values;
- fill in the blanks with medians;
- fill in the blanks with modes (most frequent values);
- fill in the blanks with constants.
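
To make the difference between these four strategies concrete, here is a small sketch on a single toy column with one gap. The array `column` and the dictionary `filled` are illustrative names; the real notebook works on the bank dataset instead.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One feature, five rows, one gap at row index 3.
column = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Keep only the value that was written into the gap.
    filled[strategy] = imputer.fit_transform(column)[3, 0]

constant_imputer = SimpleImputer(strategy="constant", fill_value=-1)
filled["constant"] = constant_imputer.fit_transform(column)[3, 0]

print(filled)
```

The outlier (10.0) pulls the mean up to 3.75, while the median and the mode both stay at 2.0, which is one reason the median is often preferred for skewed features.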

But first, recall from `EDA.ipynb` that the data contains two types of features: numerical and categorical. We apply different strategies to each subset of features:

In [None]:
numerical_features = [
    'age',
    'campaign',
    'cons.conf.idx',
    'cons.price.idx',
    'duration',
    'emp.var.rate',
    'euribor3m',
    'nr.employed',
    'pdays',
    'previous'
]

In [None]:
categorial_features = [
    'contact',
    'day_of_week',
    'default',
    'education',
    'housing',
    'job',
    'loan',
    'marital',
    'month',
    'poutcome'
]

In [None]:
# Note: the `verbose` argument is not passed because recent
# scikit-learn versions no longer accept it in SimpleImputer.
mean_imputer = SimpleImputer(
    missing_values=np.nan,
    strategy='mean'
)

In [None]:
median_imputer = SimpleImputer(
    missing_values=np.nan,
    strategy='median'
)

In [None]:
most_frequent_imputer = SimpleImputer(
    missing_values=np.nan,
    strategy='most_frequent'
)

In [None]:
constant_imputer = SimpleImputer(
    missing_values=np.nan,
    strategy='constant',
    fill_value='unknown'
)

In [None]:
numericals_with_mean_imputed = pd.DataFrame(
    mean_imputer.fit_transform(data_with_gaps_v1[numerical_features]),
    columns=numerical_features
)

In [None]:
numericals_with_mean_imputed.describe().T

In [None]:
numericals_with_mean_imputed.head().T

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_mean_imputed)
plt.show()

In [None]:
numericals_with_median_imputed = pd.DataFrame(
    median_imputer.fit_transform(data_with_gaps_v1[numerical_features]),
    columns=numerical_features
)

In [None]:
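# DataFrame.append was removed in pandas 2.0, so the extra row is added with pd.concat.
pd.concat([
    numericals_with_median_imputed.describe(),
    data_with_gaps_v1[numerical_features].median().rename("median").to_frame().T
]).T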

In [None]:
numericals_with_median_imputed.head().T

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_median_imputed)
plt.show()

In [None]:
numericals_with_most_frequent_imputed = pd.DataFrame(
    most_frequent_imputer.fit_transform(data_with_gaps_v1[numerical_features]),
    columns=numerical_features
)

In [None]:
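# DataFrame.append was removed in pandas 2.0, so the extra row is added with pd.concat.
pd.concat([
    numericals_with_most_frequent_imputed.describe(),
    data_with_gaps_v1[numerical_features].mode().loc[0].rename("most_frequent").to_frame().T
]).T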

In [None]:
numericals_with_most_frequent_imputed.head().T

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_most_frequent_imputed)
plt.show()

In [None]:
categorials_with_constant_imputed = pd.DataFrame(
    constant_imputer.fit_transform(data_with_gaps_v1[categorial_features]),
    columns=categorial_features
)

In [None]:
categorials_with_constant_imputed[categorial_features[0]].value_counts()

In [None]:
categorials_with_constant_imputed[categorial_features[1]].value_counts()

In [None]:
categorials_with_constant_imputed[categorial_features[2]].value_counts()

In [None]:
categorials_with_constant_imputed.head()

# Scalers

Scaling features is a routine step.

It is especially useful for scale-sensitive algorithms (for example, distance-based ones such as k-nearest neighbours).

Here we look at some of the options.
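
As a rough guide to what the four scalers in this section compute with their default settings, the sketch below reproduces each of them by hand with NumPy and compares the result with scikit-learn. The toy array `x` and the dictionary names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler
)

# One feature with an obvious outlier.
x = np.array([[-4.0], [0.0], [1.0], [2.0], [3.0], [50.0]])

q25, q50, q75 = np.percentile(x, [25, 50, 75])

by_hand = {
    "MaxAbsScaler": x / np.abs(x).max(),                  # divide by the largest |x|
    "MinMaxScaler": (x - x.min()) / (x.max() - x.min()),  # map to [0, 1]
    "RobustScaler": (x - q50) / (q75 - q25),              # centre on median, scale by IQR
    "StandardScaler": (x - x.mean()) / x.std(),           # zero mean, unit (population) std
}

scalers = {
    "MaxAbsScaler": MaxAbsScaler(),
    "MinMaxScaler": MinMaxScaler(),
    "RobustScaler": RobustScaler(),
    "StandardScaler": StandardScaler(),
}

matches = {
    name: bool(np.allclose(scaler.fit_transform(x), by_hand[name]))
    for name, scaler in scalers.items()
}
print(matches)
```

Only `RobustScaler` relies on statistics (median and interquartile range) that the outlier barely affects, which is worth keeping in mind when reading the boxplots.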

## sklearn.preprocessing.MaxAbsScaler

In [None]:
max_abs_scaler = MaxAbsScaler()

In [None]:
numericals_with_mean_imputed_max_abs_scaled = pd.DataFrame(
    max_abs_scaler.fit_transform(numericals_with_mean_imputed),
    columns=numerical_features
)

In [None]:
numericals_with_mean_imputed_max_abs_scaled.describe().T

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_mean_imputed_max_abs_scaled)
plt.show()

In [None]:
numericals_with_median_imputed_max_abs_scaled = pd.DataFrame(
    max_abs_scaler.fit_transform(numericals_with_median_imputed),
    columns=numerical_features
)

In [None]:
numericals_with_median_imputed_max_abs_scaled.describe().T

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_median_imputed_max_abs_scaled)
plt.show()

In [None]:
numericals_with_most_frequent_imputed_max_abs_scaled = pd.DataFrame(
    max_abs_scaler.fit_transform(numericals_with_most_frequent_imputed),
    columns=numerical_features
)

In [None]:
numericals_with_most_frequent_imputed_max_abs_scaled.describe().T

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_most_frequent_imputed_max_abs_scaled)
plt.show()

## sklearn.preprocessing.MinMaxScaler

In [None]:
min_max_scaler = MinMaxScaler()

In [None]:
numericals_with_mean_imputed_min_max_scaled = pd.DataFrame(
    min_max_scaler.fit_transform(numericals_with_mean_imputed),
    columns=numerical_features
)

In [None]:
numericals_with_mean_imputed_min_max_scaled.describe().T

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_mean_imputed_min_max_scaled)
plt.show()

In [None]:
numericals_with_median_imputed_min_max_scaled = pd.DataFrame(
    min_max_scaler.fit_transform(numericals_with_median_imputed),
    columns=numerical_features
)

In [None]:
numericals_with_median_imputed_min_max_scaled.describe().T

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_median_imputed_min_max_scaled)
plt.show()

In [None]:
numericals_with_most_frequent_imputed_min_max_scaled = pd.DataFrame(
    min_max_scaler.fit_transform(numericals_with_most_frequent_imputed),
    columns=numerical_features
)

In [None]:
numericals_with_most_frequent_imputed_min_max_scaled.describe().T

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_most_frequent_imputed_min_max_scaled)
plt.show()

## sklearn.preprocessing.RobustScaler

In [None]:
robust_scaler = RobustScaler()

In [None]:
numericals_with_mean_imputed_robust_scaled = pd.DataFrame(
    robust_scaler.fit_transform(numericals_with_mean_imputed),
    columns=numerical_features
)

In [None]:
numericals_with_mean_imputed_robust_scaled.describe().T

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_mean_imputed_robust_scaled)
plt.show()

In [None]:
numericals_with_median_imputed_robust_scaled = pd.DataFrame(
    robust_scaler.fit_transform(numericals_with_median_imputed),
    columns=numerical_features
)

In [None]:
numericals_with_median_imputed_robust_scaled.describe().T

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_median_imputed_robust_scaled)
plt.show()

In [None]:
numericals_with_most_frequent_imputed_robust_scaled = pd.DataFrame(
    robust_scaler.fit_transform(numericals_with_most_frequent_imputed),
    columns=numerical_features
)

In [None]:
numericals_with_most_frequent_imputed_robust_scaled.describe().T

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_most_frequent_imputed_robust_scaled)
plt.show()

## sklearn.preprocessing.StandardScaler

In [None]:
standard_scaler = StandardScaler()

In [None]:
numericals_with_mean_imputed_standard_scaled = pd.DataFrame(
    standard_scaler.fit_transform(numericals_with_mean_imputed),
    columns=numerical_features
)

In [None]:
numericals_with_mean_imputed_standard_scaled.describe().T

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_mean_imputed_standard_scaled)
plt.show()

In [None]:
numericals_with_median_imputed_standard_scaled = pd.DataFrame(
    standard_scaler.fit_transform(numericals_with_median_imputed),
    columns=numerical_features
)

In [None]:
numericals_with_median_imputed_standard_scaled.describe().T

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_median_imputed_standard_scaled)
plt.show()

In [None]:
numericals_with_most_frequent_imputed_standard_scaled = pd.DataFrame(
    standard_scaler.fit_transform(numericals_with_most_frequent_imputed),
    columns=numerical_features
)

In [None]:
numericals_with_most_frequent_imputed_standard_scaled.describe().T

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_most_frequent_imputed_standard_scaled)
plt.show()

# Transformers

## sklearn.preprocessing.FunctionTransformer

In [None]:
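def all_but_last_column(X):
    # Convert to a NumPy array first: FunctionTransformer may pass a DataFrame
    # through unchanged, and a DataFrame does not support `X[:, :-1]` indexing.
    return np.asarray(X)[:, :-1]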

In [None]:
function_transformer = FunctionTransformer(all_but_last_column)

In [None]:
numericals_with_median_imputed_standard_scaled_without_last_column = pd.DataFrame(
    function_transformer.fit_transform(numericals_with_median_imputed_standard_scaled),
    columns=numerical_features[:-1]
)

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_median_imputed_standard_scaled_without_last_column)
plt.show()
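
`FunctionTransformer` wraps any stateless function, so dropping a column is only one possible use. A minimal sketch of another common case, a logarithmic transform with its inverse, is shown below; the array `durations` is a made-up stand-in for a right-skewed feature.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p(x) = log(1 + x) is defined at x = 0; expm1 is its exact inverse.
log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

durations = np.array([[0.0], [9.0], [99.0], [999.0]])

logged = log_transformer.fit_transform(durations)
restored = log_transformer.inverse_transform(logged)

print(logged.ravel())
print(restored.ravel())
```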

## sklearn.preprocessing.PowerTransformer

In [None]:
power_transformer = PowerTransformer()

In [None]:
numericals_with_median_imputed_standard_scaled_power_transformed = pd.DataFrame(
    power_transformer.fit_transform(numericals_with_median_imputed_standard_scaled),
    columns=numerical_features
)

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_median_imputed_standard_scaled_power_transformed)
plt.show()

## sklearn.preprocessing.QuantileTransformer

In [None]:
quantile_transformer = QuantileTransformer(output_distribution='normal')

In [None]:
numericals_with_median_imputed_standard_scaled_quantile_transformed = pd.DataFrame(
    quantile_transformer.fit_transform(numericals_with_median_imputed_standard_scaled),
    columns=numerical_features
)

In [None]:
plt.figure(figsize=(35, 15))
sns.boxplot(data=numericals_with_median_imputed_standard_scaled_quantile_transformed)
plt.show()
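
To get a feel for what `PowerTransformer` and `QuantileTransformer` do, it can help to track one number, the sample skewness, before and after the transformation. The sketch below does this on synthetic log-normal data; the sample, the seed and the `n_quantiles` value are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
# Strongly right-skewed synthetic feature.
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(2000, 1))

power = PowerTransformer()  # Yeo-Johnson by default, output is standardized
quantile = QuantileTransformer(
    output_distribution="normal", n_quantiles=1000, random_state=0
)

skewness = pd.Series({
    "original": pd.Series(skewed.ravel()).skew(),
    "power": pd.Series(power.fit_transform(skewed).ravel()).skew(),
    "quantile": pd.Series(quantile.fit_transform(skewed).ravel()).skew(),
})
print(skewness.round(2))
```

Both transformers should bring the skewness much closer to zero than in the original sample, but they do not remove outliers from the underlying data.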

# Encoders

## sklearn.preprocessing.LabelEncoder

In [None]:
label_encoder = LabelEncoder()

In [None]:
y_encoded = label_encoder.fit_transform(data.y)

In [None]:
label_encoder.classes_

In [None]:
y_encoded

In [None]:
y_decoded = label_encoder.inverse_transform(y_encoded)

In [None]:
y_decoded

## sklearn.preprocessing.LabelBinarizer

In [None]:
label_binarizer = LabelBinarizer()

In [None]:
y_binarized = label_binarizer.fit_transform(data.y)

In [None]:
label_binarizer.classes_

In [None]:
label_binarizer.neg_label

In [None]:
label_binarizer.pos_label

In [None]:
label_binarizer.y_type_

In [None]:
y_binarized

In [None]:
y_original = label_binarizer.inverse_transform(y_binarized)

In [None]:
y_original
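
The difference between the two label encoders is easiest to see on a tiny target. In this sketch (the array `y_toy` and the `toy_` prefixes are illustrative), `LabelEncoder` returns a 1-D array of integer codes, while `LabelBinarizer` returns a 2-D array, which for a binary target has a single column.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

y_toy = np.array(["no", "yes", "no", "no"])

toy_label_encoder = LabelEncoder()
toy_encoded = toy_label_encoder.fit_transform(y_toy)      # shape (4,)

toy_label_binarizer = LabelBinarizer()
toy_binarized = toy_label_binarizer.fit_transform(y_toy)  # shape (4, 1)

print(toy_label_encoder.classes_)  # classes are stored in sorted order
print(toy_encoded, toy_encoded.shape)
print(toy_binarized.ravel(), toy_binarized.shape)

# Both encoders can map the codes back to the original labels.
toy_decoded = toy_label_encoder.inverse_transform(toy_encoded)
toy_restored = toy_label_binarizer.inverse_transform(toy_binarized)
```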

## sklearn.preprocessing.OneHotEncoder

In [None]:
one_hot_encoder = OneHotEncoder()

In [None]:
categorials_with_constant_imputed.head().T

In [None]:
categorials_with_constant_imputed_one_hot_encoded = one_hot_encoder.fit_transform(categorials_with_constant_imputed)

In [None]:
categorials_with_constant_imputed_one_hot_encoded.shape

In [None]:
type(categorials_with_constant_imputed_one_hot_encoded)

In [None]:
categorials_with_constant_imputed_one_hot_encoded.todense()

In [None]:
categorials_with_constant_imputed_one_hot_encoded = pd.DataFrame(
    categorials_with_constant_imputed_one_hot_encoded.todense(),
    # get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn
    columns=one_hot_encoder.get_feature_names_out()
)

In [None]:
categorials_with_constant_imputed_one_hot_encoded.head().T

In [None]:
one_hot_encoder.categories_

In [None]:
# get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn
one_hot_encoder.get_feature_names_out()

In [None]:
categorials_with_constant_imputed_one_hot_decoded = one_hot_encoder.inverse_transform(categorials_with_constant_imputed_one_hot_encoded)

In [None]:
categorials_with_constant_imputed_one_hot_decoded

## sklearn.preprocessing.OrdinalEncoder

In [None]:
ordinal_encoder = OrdinalEncoder()

In [None]:
categorials_with_constant_imputed_ordinal_encoded = ordinal_encoder.fit_transform(categorials_with_constant_imputed)

In [None]:
categorials_with_constant_imputed_ordinal_encoded

In [None]:
categorials_with_constant_imputed_ordinal_encoded.shape

In [None]:
ordinal_encoder.categories_

In [None]:
categorials_with_constant_imputed_ordinal_decoded = ordinal_encoder.inverse_transform(categorials_with_constant_imputed_ordinal_encoded)

In [None]:
categorials_with_constant_imputed_ordinal_decoded
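
A short side-by-side sketch of the two feature encoders on a toy `day_of_week` column (the frame `toy` and the chosen weekday order are illustrative assumptions). By default `OrdinalEncoder` sorts categories alphabetically, so if the order of the codes matters, it has to be passed explicitly through `categories`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

toy = pd.DataFrame({"day_of_week": ["mon", "tue", "wed", "thu", "fri", "mon"]})

# One binary column per category; the result is sparse, so densify it for display.
one_hot = OneHotEncoder().fit_transform(toy).toarray()

# Default: alphabetical order (fri=0, mon=1, thu=2, tue=3, wed=4).
ordinal_default = OrdinalEncoder().fit_transform(toy)

# Explicit order: mon=0, tue=1, wed=2, thu=3, fri=4.
ordinal_calendar = OrdinalEncoder(
    categories=[["mon", "tue", "wed", "thu", "fri"]]
).fit_transform(toy)

print(one_hot.shape)
print(ordinal_default.ravel())
print(ordinal_calendar.ravel())
```

Whether integer codes are appropriate at all depends on the model: tree-based models usually tolerate them, while linear and distance-based models tend to read them as magnitudes.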

# Target balancers

## Under-sampling

### imblearn.under_sampling.RandomUnderSampler

In [None]:
random_undersampler = RandomUnderSampler(random_state=0)

In [None]:
X_resampled, y_resampled = random_undersampler.fit_resample(
    data[numerical_features + categorial_features],
    data.y
)

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(x=y_resampled)
plt.show()

### imblearn.under_sampling.ClusterCentroids

In [None]:
cluster_centroids_undersampler = ClusterCentroids(random_state=0)

In [None]:
X_resampled, y_resampled = cluster_centroids_undersampler.fit_resample(
    data[numerical_features],
    data.y
)

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(x=y_resampled)
plt.show()

### imblearn.under_sampling.NearMiss

In [None]:
# NearMiss is deterministic; recent imbalanced-learn versions do not accept `random_state`.
near_miss_undersampler = NearMiss()

In [None]:
X_resampled, y_resampled = near_miss_undersampler.fit_resample(
    data[numerical_features],
    data.y
)

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(x=y_resampled)
plt.show()

## Over-sampling

### imblearn.over_sampling.RandomOverSampler

In [None]:
random_oversampler = RandomOverSampler(random_state=0)

In [None]:
X_resampled, y_resampled = random_oversampler.fit_resample(
    data[numerical_features + categorial_features],
    data.y
)

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(x=y_resampled)
plt.show()

### imblearn.over_sampling.SMOTE

In [None]:
smote_oversampler = SMOTE(random_state=0)

In [None]:
X_resampled, y_resampled = smote_oversampler.fit_resample(
    data[numerical_features],
    data.y
)

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(x=y_resampled)
plt.show()

### imblearn.over_sampling.ADASYN

In [None]:
adasyn_oversampler = ADASYN(random_state=0)

In [None]:
X_resampled, y_resampled = adasyn_oversampler.fit_resample(
    data[numerical_features],
    data.y
)

In [None]:
plt.figure(figsize=(20, 10))
sns.countplot(x=y_resampled)
plt.show()
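
To show what the two simplest balancers above do in principle, here is a NumPy-only sketch of random under-sampling and random over-sampling. It is not the imbalanced-learn implementation, only an illustration of the idea on a made-up target with a 90/10 class ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 90 "no" rows and 10 "yes" rows.
y = np.array(["no"] * 90 + ["yes"] * 10)
X = np.arange(len(y)).reshape(-1, 1)

majority_idx = np.flatnonzero(y == "no")
minority_idx = np.flatnonzero(y == "yes")

# Random under-sampling: keep a random majority subset of minority size.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept_majority, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

# Random over-sampling: repeat random minority rows until the classes match.
extra_minority = rng.choice(
    minority_idx, size=len(majority_idx) - len(minority_idx), replace=True
)
over_idx = np.concatenate([majority_idx, minority_idx, extra_minority])
X_over, y_over = X[over_idx], y[over_idx]

print(np.unique(y_under, return_counts=True))
print(np.unique(y_over, return_counts=True))
```

`SMOTE` and `ADASYN` differ from this sketch in that they synthesize new minority points by interpolating between neighbours instead of repeating existing rows, which is why they are applied only to the numerical features above.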

# A bit about the mess that was created so far

And now let's look at that heap of variables that was created:

In [None]:
list(filter(lambda x: 'categorial' in x or 'numerical' in x, dir()))
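
One possible way to keep such a heap under control, short of full pipelines, is to store the results in a dictionary keyed by the combination of steps. The sketch below assumes a toy frame and only two imputers and two scalers; all names in it are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 58.0],
    "campaign": [1.0, 3.0, np.nan, 2.0],
})

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
}
scalers = {
    "min_max": MinMaxScaler(),
    "standard": StandardScaler(),
}

# One entry per (imputer, scaler) pair instead of one variable per pair.
prepared = {}
for imputer_name, imputer in imputers.items():
    imputed = imputer.fit_transform(toy)
    for scaler_name, scaler in scalers.items():
        prepared[(imputer_name, scaler_name)] = pd.DataFrame(
            scaler.fit_transform(imputed), columns=toy.columns
        )

print(list(prepared))
```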

# Conclusion

So, we looked at:
- how scikit-learn can fill in the gaps using different strategies;
- how to normalize data;
- how to use transformers to transform features, filter out unnecessary ones, etc.;
- how to convert the target variable from text labels to a numeric encoding and back;
- how to convert categorical features from text values to a one-hot (or ordinal) encoding and back;
- how to balance classes through:
    - under-sampling;
    - over-sampling.

There are several points that have remained uncovered, namely:
- transformation of a continuous variable into a categorical one (quantization);
- work with outliers;
- pipelines;
- hyperparameters search;
- pipelines + hyperparameters search;
- multiple metrics calculation during hyperparameters search.

It makes no sense to draw any conclusions here about the quality of the preprocessing, since we have not yet tested it with any algorithm.

On the other hand, during the various transformations it was already quite clear that some features do not seem to improve (skewness persists, outliers do not disappear, etc.).

Such features deserve a closer look; quantization may be worth trying there (as, for example, in the case of the `pdays` feature, where the value 999 means that the client was not previously contacted).

In general, of course, it would be much better to try some pre-processing options in combination with some algorithm in order to see which transformations change the result in which direction.
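
The `pdays` idea mentioned above can be sketched as follows. This is only one possible treatment: the toy series, the bin edges and the label names are hypothetical and would need to be chosen from the real distribution.

```python
import pandas as pd

# Toy stand-in for the `pdays` column; 999 marks clients not contacted before.
pdays = pd.Series([3, 6, 999, 12, 999, 0, 27])

# A separate flag keeps the "was contacted before" information explicit.
was_contacted = (pdays != 999).astype(int)

# Bin only the real day counts; 999 becomes NaN and falls outside every bin.
binned = pd.cut(
    pdays.where(pdays != 999),
    bins=[-1, 7, 14, 30],            # hypothetical edges: (-1, 7], (7, 14], (14, 30]
    labels=["0-7", "8-14", "15-30"],
)

# Give the special value its own category instead of leaving it as a gap.
binned = binned.cat.add_categories("not_contacted").fillna("not_contacted")

print(pd.DataFrame({"pdays": pdays, "was_contacted": was_contacted, "binned": binned}))
```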