# New Offer Success

A company has prepared a special new offer for its customers. It has a dataset containing information about its clients and how they reacted to previous offers.

## Objectives

Your goal is to support the current offer by providing a list of customers who are likely to accept it. Senior management would also like to know why the company should send the new offer to these customers.

## Dataset: business meaning of the columns

### Target feature
- `accepted` - flag indicating whether the customer accepted the offer

### Input features
- `offer_class` - previous offer class, e.g. `High`, `Medium`, `Premium`
- `name` - hashed customer name
- `gender` - customer gender
- `age` - age in years
- `phone_calls` - number of phone conversations with client during last quarter
- `emails` - number of emails sent to client during last 6 months
- `customer_code` - customer code
- `salary` - customer estimated salary
- `offer_code` - previous offer code
- `customer_type` - type of customer
- `number` - serial number of customer device
- `offer_value` - previous offers total value
- `estimated_expenses` - estimated expenses

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
# conda install pyarrow -c conda-forge
import pyarrow
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Load the data

In [31]:
df = pd.read_parquet('client_database.parquet', engine='pyarrow')

In [32]:
oryg_shape = df.shape
oryg_shape

(1309, 15)

- The dataset has 1309 rows (observations) and 15 columns (features).

In [33]:
df.head(6).T

Unnamed: 0,0,1,2,3,4,5
offer_class,Medium,Medium,Medium,Medium,Medium,Medium
accepted,yes,yes,no,no,no,yes
name,C7CBB5C5613449B,CFD09C0248BB417,A2A0DC541977473,9068458EB70D427,46F0CD19CF71429,060A000A1260427
gender,female,male,female,male,female,male
age,29,,,30,25,48
phone_calls,0,1,1,1,1,0
emails,0,2,3,2,2,0
customer_code,24160,113781,113781,113781,113781,19952
salary,21133.8,15155,15155,15155,15155,2655
offer_code,4AB,61A,DB4,9B6,191,62F


- There is a categorical feature `center` in the dataset which is not mentioned in the dataset description.

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 1308
Data columns (total 15 columns):
offer_class           1289 non-null object
accepted              1289 non-null object
name                  1289 non-null object
gender                1289 non-null object
age                   877 non-null float64
phone_calls           1286 non-null float64
emails                1287 non-null float64
customer_code         1265 non-null object
salary                1281 non-null float64
offer_code            1287 non-null object
customer_type         1287 non-null object
number                1280 non-null object
offer_value           1277 non-null float64
estimated_expenses    1286 non-null float64
center                1306 non-null object
dtypes: float64(6), object(9)
memory usage: 163.6+ KB


- There are 2 data types: `float64` and `object` (strings).
- There are missing values in the dataset; `age` has the most (only 877 of 1309 values are present).

In [35]:
print(df.groupby('accepted').size())

accepted
no     800
yes    489
dtype: int64


- The dataset is slightly imbalanced: about 62% `no` vs. 38% `yes`.
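The class shares behind this observation can be computed directly. The sketch below is only an illustration: it rebuilds a toy `accepted` column from the counts printed above (800 `no`, 489 `yes`) rather than reading the real dataset, and the names `class_share` and `imbalance_ratio` are made up for this example.

```python
import pandas as pd

# Toy target column reproducing the class counts reported above
# (800 "no", 489 "yes"); this is not the real dataset.
accepted = pd.Series(["no"] * 800 + ["yes"] * 489, name="accepted")

# Relative class frequencies; normalize=True returns shares instead of counts.
class_share = accepted.value_counts(normalize=True)

# Ratio of the majority class to the minority class;
# values close to 1 mean a balanced target.
imbalance_ratio = class_share.max() / class_share.min()

print(class_share.round(3))
print("Majority/minority ratio: {:.2f}".format(imbalance_ratio))
```

A ratio well below 2 is usually mild enough that plain accuracy is still informative, although checking precision and recall for the `yes` class remains a good idea.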

### Numerical features

In [36]:
df_num = df.select_dtypes(include = ['float64'])
df_num.head()

Unnamed: 0,age,phone_calls,emails,salary,offer_value,estimated_expenses
0,29.0,0.0,0.0,21133.75,57.426571,4692.0
1,,1.0,2.0,15155.0,141.639912,3164.0
2,,1.0,3.0,15155.0,154.82113,1852.0
3,30.0,1.0,2.0,15155.0,106.256196,3753.0
4,25.0,1.0,2.0,15155.0,139.237147,2410.0


In [37]:
df_num.describe()

Unnamed: 0,age,phone_calls,emails,salary,offer_value,estimated_expenses
count,877.0,1286.0,1287.0,1281.0,1277.0,1286.0
mean,33.511973,0.497667,0.881896,3297.296011,128.693732,4576.9479
std,12.247058,1.04136,1.02232,5034.240427,57.677807,1909.458459
min,18.0,0.0,0.0,0.0,50.022619,1257.0
25%,24.0,0.0,0.0,789.58,82.632085,2934.75
50%,30.0,0.0,1.0,1445.42,118.645478,4544.0
75%,41.0,1.0,1.0,3127.5,162.482961,6225.75
max,80.0,8.0,10.0,51232.92,368.668534,7891.0


In [38]:
df_num[df_num.phone_calls == 0].count()

age                   617
phone_calls           877
emails                877
salary                873
offer_value           868
estimated_expenses    876
dtype: int64

In [39]:
df_num[df_num.salary == 0].count()

age                    8
phone_calls           17
emails                17
salary                17
offer_value           17
estimated_expenses    17
dtype: int64

In [40]:
df_num[df_num.offer_value == 0].count()

age                   0
phone_calls           0
emails                0
salary                0
offer_value           0
estimated_expenses    0
dtype: int64

In [41]:
n = len(df_num)
for column in df_num:
    n_unique = len(df_num[column].unique())
    print('Feature: {}'.format(column))
    print('Unique values count: {}'.format(n_unique))
    print('Unique values ratio: {}'.format(n_unique / n))
    print('Missing values ratio: {}'.format(df_num[column].isnull().sum() / n))
    print('Median: {}'.format(df_num[column].median()))
    print('------------------------------------')

    print()

Feature: age
Unique values count: 73
Unique values ratio: 0.05576776165011459
Missing values ratio: 0.3300229182582124
Median: 30.0
------------------------------------

Feature: phone_calls
Unique values count: 8
Unique values ratio: 0.006111535523300229
Missing values ratio: 0.01757066462948816
Median: 0.0
------------------------------------

Feature: emails
Unique values count: 11
Unique values ratio: 0.008403361344537815
Missing values ratio: 0.01680672268907563
Median: 1.0
------------------------------------

Feature: salary
Unique values count: 282
Unique values ratio: 0.21543162719633308
Missing values ratio: 0.0213903743315508
Median: 1445.42
------------------------------------

Feature: offer_value
Unique values count: 1278
Unique values ratio: 0.9763177998472116
Missing values ratio: 0.024446142093200916
Median: 118.6454777
------------------------------------

Feature: estimated_expenses
Unique values count: 1186
Unique values ratio: 0.906035141329259
Missing values ratio

- There are 6 numerical features.
- Their value ranges differ by orders of magnitude, so they need to be rescaled.
- There are no children in the dataset (min. age is 18). 25% of customers are 24 years old or younger, the next 25% are 25-30, the next 25% are up to 41, and the oldest 25% are 42-80.
- At most 25% of customers received 2-8 phone calls; the rest received 0 or 1 (877 of 1286 received none).
- At least 75% of customers received 0 or 1 email; at most 25% received 2-10 emails.
- The min. salary is 0 (17 customers), the median is 1445.42 and the max. is 51232.92.
- Previous offer values range between 50 and 369 with a median of 119.
- Estimated expenses range between 1257 and 7891 with a median of 4544.
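Quartiles only give bounds on shares such as "customers with 2 or more phone calls"; the exact share has to be counted. A minimal sketch on hypothetical call counts (the `calls` values below are invented, not taken from the dataset):

```python
import pandas as pd

# Hypothetical call counts for ten customers (not the real column).
calls = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 2, 8])

# The 75th percentile is 1, which only tells us that at most 25%
# of the customers received more than one call.
q75 = calls.quantile(0.75)

# The exact share is obtained by counting: a boolean mask averaged
# over all rows gives the fraction of True values.
share_two_or_more = (calls >= 2).mean()

print("75th percentile:", q75)
print("Share with 2+ calls:", share_two_or_more)
```

Applied to the real `phone_calls` and `emails` columns, the same boolean-mask pattern would replace the quartile-based estimates with exact percentages.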

### Categorical Features

In [42]:
#list(set(df.dtypes.tolist()))
#df_cat = df.select_dtypes(include = ['O'])

In [43]:
df_cat = df[list(set(df.columns) - set(df_num.columns))]
df_cat.head()

Unnamed: 0,offer_code,center,gender,accepted,offer_class,customer_code,number,customer_type,name
0,4AB,A,female,yes,Medium,24160,9E9FA,S,C7CBB5C5613449B
1,61A,A,male,yes,Medium,113781,1E53D,S,CFD09C0248BB417
2,DB4,A,female,no,Medium,113781,1.36E+06,S,A2A0DC541977473
3,9B6,B,male,no,Medium,113781,F6529,S,9068458EB70D427
4,191,A,female,no,Medium,113781,E2FDF,S,46F0CD19CF71429


In [44]:
#feature_unique_values = {}
n = len(df_cat)
for column in df_cat:
    n_unique = len(df_cat[column].unique())
    #feature_unique_values[column] = n_unique
    print('---------------------')
    print('Feature name: {}'.format(column))
    print('Unique values count (including NaN): {}'.format(n_unique))
    if n_unique <= 5:
        print('Most frequent values:')
        print(df_cat.groupby(column).size())
    print('Variability ratio: {}'.format(n_unique / n))
    print('Values count (not null): {}'.format(df_cat[column].notnull().sum()))
    print('Values count (null): {}'.format(df_cat[column].isnull().sum()))
    print('Missing values ratio: {}'.format(df_cat[column].isnull().sum() / n))
    print()

---------------------
Feature name: offer_code
Unique values count (including NaN): 1102
Variability ratio: 0.8418640183346066
Values count (not null): 1287
Values count (null): 22
Missing values ratio: 0.01680672268907563

---------------------
Feature name: center
Unique values count (including NaN): 3
Most frequent values:
center
A    603
B    703
dtype: int64
Variability ratio: 0.002291825821237586
Values count (not null): 1306
Values count (null): 3
Missing values ratio: 0.002291825821237586

---------------------
Feature name: gender
Unique values count (including NaN): 3
Most frequent values:
gender
female    458
male      831
dtype: int64
Variability ratio: 0.002291825821237586
Values count (not null): 1289
Values count (null): 20
Missing values ratio: 0.015278838808250574

---------------------
Feature name: accepted
Unique values count (including NaN): 3
Most frequent values:
accepted
no     800
yes    489
dtype: int64
Variability ratio: 0.002291825821237586
Values count (not

- I will remove `number`, `name`, `offer_code` and `customer_code` from the feature list, as they are ID-like columns.
- Open question: should `customer_code` be kept instead, and if so, how should its 900 different values be encoded?
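One possible answer to the encoding question, should `customer_code` ever be kept, is frequency encoding: each code is replaced by how often it occurs, which yields a single numeric column instead of hundreds of dummies. This is only a sketch of one option on toy data; the notebook itself drops the column, and the `customer_code_freq` column name is an assumption made for this example.

```python
import pandas as pd

# Toy frame with a repeated code and one missing value (not the real data).
toy = pd.DataFrame(
    {"customer_code": ["113781", "113781", "113781", "24160", "19952", None]}
)

# Relative frequency of each code among the non-missing rows.
freq = toy["customer_code"].value_counts(normalize=True)

# Map every code to its frequency; missing codes have no entry in `freq`,
# so they become NaN and are set to 0.0 here (an arbitrary choice).
toy["customer_code_freq"] = toy["customer_code"].map(freq).fillna(0.0)

print(toy)
```

Frequency encoding loses the identity of the code, so two different codes with the same count become indistinguishable; whether that is acceptable depends on what the code actually represents.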

In [45]:
df.drop(columns=['number', 'name', 'offer_code', 'customer_code'], inplace=True)

### Missing values

In [46]:
def extract_columns_with_nans(df):
    columns_with_nans = df.isnull().sum()[df.isnull().sum() > 0]
    dic_nan = {'nans_count': columns_with_nans.values, 'percent': np.round(columns_with_nans.values * 100 / df.shape[0], 2)}
    df_nan = pd.DataFrame(data=dic_nan, index=columns_with_nans.index)
    df_nan.sort_values(by='percent', ascending=False, inplace=True)
    return df_nan

In [47]:
# Missing values by column
df_nan = extract_columns_with_nans(df)
df_nan

Unnamed: 0,nans_count,percent
age,432,33.0
offer_value,32,2.44
salary,28,2.14
phone_calls,23,1.76
estimated_expenses,23,1.76
emails,22,1.68
customer_type,22,1.68
offer_class,20,1.53
accepted,20,1.53
gender,20,1.53


In [48]:
# Missing values by row
missing_data_by_row = df.isnull().sum(axis=1)
missing_data_by_row.describe()

count    1309.000000
mean        0.492743
std         1.281836
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max        10.000000
dtype: float64

- Most rows are complete or nearly complete: the median is 0 and the 75th percentile is 1 missing value per row, with a maximum of 10.
- `age`, `offer_value`, `salary`, `phone_calls`, `estimated_expenses`, `emails` - NaNs will be filled with the median.
- `customer_code` - the column has 44 NaNs, but it was dropped above together with the other ID-like columns, so no rows need to be removed because of it.
- I will remove the rows with NaNs in `center` (3) and `customer_type` (22); `offer_code` was dropped above.
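The plan above (median imputation for numerical columns, row removal for the remaining categorical gaps) can be sketched on a toy frame. The values below are invented for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in numerical and categorical columns (not the real data).
toy = pd.DataFrame({
    "age": [29.0, np.nan, 25.0, 48.0],
    "salary": [2655.0, 15155.0, np.nan, 21133.75],
    "center": ["A", "B", None, "A"],
    "customer_type": ["S", "S", "C", None],
})

num_cols = ["age", "salary"]

# Medians are computed per column and only over the numerical columns.
medians = toy[num_cols].median()

cleaned = toy.copy()
# fillna with a Series fills each column with its own median.
cleaned[num_cols] = cleaned[num_cols].fillna(medians)

# Drop the rows that still miss a categorical value.
cleaned = cleaned.dropna(subset=["center", "customer_type"])

print(cleaned)
```

Computing the medians on the numerical subset avoids asking pandas for the median of string columns, which recent pandas versions refuse to do.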

In [49]:
# Replace NaNs in the numerical features with the feature median.
# Medians are computed on the numerical columns only, because
# df.median() fails on string columns in recent pandas versions.
num_cols = ['age', 'offer_value', 'salary', 'phone_calls', 'estimated_expenses', 'emails']
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

In [50]:
#df.loc[df['customer_code'].isna()]
#df.loc[df['column_name'].isin(some_values)]

In [51]:
df_nan = extract_columns_with_nans(df)
df_nan

Unnamed: 0,nans_count,percent
customer_type,22,1.68
offer_class,20,1.53
accepted,20,1.53
gender,20,1.53
center,3,0.23


In [52]:
# Abandon rows where there is NaN in the center or customer_type columns
# (offer_code was already dropped above)
df = df.dropna(subset=['center', 'customer_type'])

In [53]:
df_nan = extract_columns_with_nans(df)
df_nan

Unnamed: 0,nans_count,percent


In [54]:
clean_shape = df.shape
clean_shape

(1284, 11)

In [55]:
# Initial & complete dataset comparison
num_removed_cols = oryg_shape[1] - clean_shape[1]
num_removed_cols_ratio = np.around(num_removed_cols * 100 / oryg_shape[1], 2)
num_removed_rows = oryg_shape[0] - clean_shape[0]
num_removed_rows_ratio = np.around(num_removed_rows * 100 / oryg_shape[0], 2)
print('Number (percent) of removed features: {} ({}%)'.format(num_removed_cols, num_removed_cols_ratio))
print('Number (percent) of removed observations (with NaNs): {} ({}%)'.format(num_removed_rows, num_removed_rows_ratio))

Number (percent) of removed features: 4 (26.67%)
Number (percent) of removed observations (with NaNs): 25 (1.91%)


In [56]:
df.head()

Unnamed: 0,offer_class,accepted,gender,age,phone_calls,emails,salary,customer_type,offer_value,estimated_expenses,center
0,Medium,yes,female,29.0,0.0,0.0,21133.75,S,57.426571,4692.0,A
1,Medium,yes,male,30.0,1.0,2.0,15155.0,S,141.639912,3164.0,A
2,Medium,no,female,30.0,1.0,3.0,15155.0,S,154.82113,1852.0,A
3,Medium,no,male,30.0,1.0,2.0,15155.0,S,106.256196,3753.0,B
4,Medium,no,female,25.0,1.0,2.0,15155.0,S,139.237147,2410.0,A


In [57]:
# Gather all data cleaning operations in one function
def clean_data(df):

    # Drop unneeded (ID-like) columns; this returns a new frame,
    # so the caller's DataFrame is not modified
    df = df.drop(columns=['number', 'name', 'offer_code', 'customer_code'])

    # Replace NaNs in the numerical features with the feature median
    num_cols = ['age', 'offer_value', 'salary', 'phone_calls', 'estimated_expenses', 'emails']
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())

    # Abandon rows where there is NaN in the center or customer_type columns
    df = df.dropna(subset=['center', 'customer_type'])

    return df

In [58]:
# Test clean_data function
shape1 = df.shape
df2 = pd.read_parquet('client_database.parquet', engine='pyarrow')
df2 = clean_data(df2)
shape2 = df2.shape

print(shape1, shape2)

(1284, 11) (1284, 11)


In [59]:
df2.head()

Unnamed: 0,offer_class,accepted,gender,age,phone_calls,emails,salary,customer_type,offer_value,estimated_expenses,center
0,Medium,yes,female,29.0,0.0,0.0,21133.75,S,57.426571,4692.0,A
1,Medium,yes,male,30.0,1.0,2.0,15155.0,S,141.639912,3164.0,A
2,Medium,no,female,30.0,1.0,3.0,15155.0,S,154.82113,1852.0,A
3,Medium,no,male,30.0,1.0,2.0,15155.0,S,106.256196,3753.0,B
4,Medium,no,female,25.0,1.0,2.0,15155.0,S,139.237147,2410.0,A


In [60]:
# Save clean dataset
filename = 'client_database_clean.parquet'
df.to_parquet(filename)

In [61]:
# Test
df3 = pd.read_parquet(filename, engine='pyarrow')
df3.shape

(1284, 11)

In [62]:
df3.head()

Unnamed: 0,offer_class,accepted,gender,age,phone_calls,emails,salary,customer_type,offer_value,estimated_expenses,center
0,Medium,yes,female,29.0,0.0,0.0,21133.75,S,57.426571,4692.0,A
1,Medium,yes,male,30.0,1.0,2.0,15155.0,S,141.639912,3164.0,A
2,Medium,no,female,30.0,1.0,3.0,15155.0,S,154.82113,1852.0,A
3,Medium,no,male,30.0,1.0,2.0,15155.0,S,106.256196,3753.0,B
4,Medium,no,female,25.0,1.0,2.0,15155.0,S,139.237147,2410.0,A


### Encode categorical features

In [None]:
df_cat = df[list(set(df.columns) - set(df_num.columns))]
df_cat.head()

In [None]:
cat_binary = []
cat_multi = []
for feature in df_cat:
    if df_cat[feature].nunique() > 2:
        cat_multi.append(feature)
    else:
        cat_binary.append(feature)

In [None]:
cat_binary

In [None]:
cat_multi

In [None]:
# Encode binary features
for column in cat_binary:
    print(df_cat[column].value_counts())

In [None]:
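# Assign the mapped column back instead of using inplace=True on a column,
# which does not reliably modify the DataFrame in recent pandas versions
df['center'] = df['center'].map({'A': 0, 'B': 1})
df['gender'] = df['gender'].map({'female': 0, 'male': 1})
df['accepted'] = df['accepted'].map({'no': 0, 'yes': 1})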

In [None]:
df_cat = df[list(set(df.columns) - set(df_num.columns))]
df_cat.head()

In [None]:
# Encode multi-class features
for column in cat_multi:
    print(df_cat[column].value_counts())

In [None]:
# One-hot-encoding
df = pd.get_dummies(df, columns=cat_multi)

In [None]:
# Get new features in df_cat
df_cat = df[list(set(df.columns) - set(df_num.columns))]
df_cat.head()

In [None]:
# Re-order columns in the dataset for better (human) visibility
columns = df.columns
columns

In [None]:
ordered_columns = ['accepted', 'gender', 'age', 'phone_calls', 'emails', 
                   'salary', 'offer_value', 'estimated_expenses', 'center', 
                   'customer_type_C', 'customer_type_Q', 'customer_type_S', 
                   'offer_class_High', 'offer_class_Medium', 'offer_class_Premium']

df = df[ordered_columns]
df.head()
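The two encoding steps used in this section (0/1 mapping for binary features, one-hot encoding for multi-class features) can be sketched together on toy data. The rows below are invented; `dtype=int` is passed so the dummy columns are 0/1 integers rather than booleans.

```python
import pandas as pd

# Toy frame with two binary features and one multi-class feature.
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "accepted": ["yes", "no", "yes"],
    "customer_type": ["S", "C", "Q"],
})

encoded = toy.copy()

# Binary features: explicit 0/1 mapping.
encoded["gender"] = encoded["gender"].map({"female": 0, "male": 1})
encoded["accepted"] = encoded["accepted"].map({"no": 0, "yes": 1})

# Multi-class features: one column per category.
encoded = pd.get_dummies(encoded, columns=["customer_type"], dtype=int)

print(encoded)
```

An explicit mapping dictionary also documents which category became 1, which helps later when reading model coefficients or correlations.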

### Correlations

In [None]:
corr = df.corr()
corr

In [None]:
# Correlation matrix
fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(111)
cax = ax.matshow(df.corr(), vmin=-1, vmax=1, interpolation='none') 
fig.colorbar(cax)
plt.show();

In [None]:
c = corr['accepted']
print(c.sort_values(ascending=False))

In [None]:
# Remove columns not correlated enough with the target feature
c = c[c.abs() < 0.01]
columns_to_drop = c.sort_values(ascending=False)
columns_to_drop

In [None]:
df.drop(columns=list(columns_to_drop.index), inplace=True)

In [None]:
def correlated_columns(df, min_corr_level=0.95):
    corr_matrix = df.corr().abs()
    # Keep only the upper triangle (np.bool was removed from NumPy, so use the built-in bool)
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    to_drop = [column for column in upper.columns if any(upper[column] > min_corr_level)]
    return to_drop

In [None]:
columns_to_drop = correlated_columns(df, min_corr_level=0.75)
columns_to_drop

In [None]:
# Let's see to which other feature customer_type_S is correlated
print(corr['customer_type_S'].sort_values(ascending=False))

In [None]:
df.drop(columns=columns_to_drop, inplace=True)

In [None]:
df.head()

In [None]:
df.shape
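The upper-triangle filter used in this section can be checked on a small example where the answer is known in advance. The function name `highly_correlated` and the toy columns `a`, `b`, `c` are made up for this sketch; column `b` is built as roughly twice `a`, so it should be the only column flagged.

```python
import numpy as np
import pandas as pd

def highly_correlated(frame, threshold=0.75):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Boolean mask selecting the strict upper triangle, so each pair is seen once
    # and the diagonal (always 1.0) is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],  # roughly 2 * a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],  # weakly related to a and b
})

print(highly_correlated(toy))
```

Because only the upper triangle is inspected, the later column of each correlated pair is the one reported, so the column order decides which feature survives.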

### Rescaling

Rescaling helps the optimization algorithms at the core of many machine learning methods, such as gradient descent, to converge faster. It also matters for algorithms that weight their inputs, such as regression and neural networks, and for algorithms that use distance measures, such as k-Nearest Neighbors.
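As a rough illustration of the two rescaling methods tried in this section, the sketch below applies min-max scaling and robust scaling by hand with NumPy. The five `salary` values are the min, quartiles and max from the `describe()` output above, used here as a stand-in for the real column; the formulas correspond to the default behaviour of scikit-learn's `MinMaxScaler` and `RobustScaler`.

```python
import numpy as np

# Stand-in for the salary column: min, quartiles and max from describe().
salary = np.array([0.0, 789.58, 1445.42, 3127.5, 51232.92])

# Min-max scaling: map the smallest value to 0 and the largest to 1.
minmax = (salary - salary.min()) / (salary.max() - salary.min())

# Robust scaling: center on the median and divide by the interquartile range,
# so a single extreme value does not compress all the others.
q1, median, q3 = np.percentile(salary, [25, 50, 75])
robust = (salary - median) / (q3 - q1)

print(minmax.round(3))
print(robust.round(3))
```

With min-max scaling the large maximum pushes most values close to 0, while robust scaling keeps the bulk of the data spread out and leaves the outlier far from the rest.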

In [None]:
# Rescale data between 0 and 1
# separate array into input and output components
#array = dataframe.values
#X = array[:,0:8]
#Y = array[:,8]
#scaler = MinMaxScaler(feature_range=(0, 1))
#rescaledX = scaler.fit_transform(X)

In [None]:
array = df_num.values

In [None]:
array

In [None]:
#scaler = MinMaxScaler(feature_range=(0, 1))
#rescaledX = scaler.fit_transform(array)

In [None]:
df_num

In [None]:
#scaler = MinMaxScaler(feature_range=(0, 1))
#df['age'] = scaler.fit_transform('age')

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

col = df['age'].values.reshape(-1, 1)

robust_scaled_df = scaler.fit_transform(col)
robust_scaled_df = pd.DataFrame(robust_scaled_df, columns=['age'])
robust_scaled_df

In [None]:
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Compare the distribution of `age` before and after each scaling method
age = df[['age']]
robust_age = RobustScaler().fit_transform(age).ravel()
minmax_age = MinMaxScaler(feature_range=(0, 1)).fit_transform(age).ravel()

fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(9, 5))
ax1.set_title('Before Scaling')
sns.kdeplot(x=age['age'], ax=ax1)
ax2.set_title('After Robust Scaling')
sns.kdeplot(x=robust_age, ax=ax2)
ax3.set_title('After Min-Max Scaling')
sns.kdeplot(x=minmax_age, ax=ax3)
plt.show()

### Numerical data transformations

In [None]:
# Ensure there are no missing values before transforming data
df_num = df.select_dtypes(include = ['float64'])
df_num.head()

In [None]:
n = len(df_num)
for column in df_num:
    # Distribution
    fig = plt.figure(figsize=(18, 8))
    fig.suptitle(column)
    ax1 = fig.add_subplot(121)
    ax1.title.set_text('Distribution Plot')
    # distplot is deprecated in seaborn; histplot with kde=True is the replacement
    sns.histplot(df_num[column], bins=50, kde=True, stat='density', alpha=0.4, ax=ax1)

    # QQ-plot
    ax2 = fig.add_subplot(122)
    stats.probplot(df_num[column], plot=plt, dist='norm')
    plt.show();

In [None]:
df_num['offer_value']

In [None]:
# Box-Cox requires strictly positive values, so it only works for age, offer_value
# and estimated_expenses (phone_calls, emails and salary contain zeros)
def normalize_num_features(df, columns=None):
    columns = list(columns) if columns is not None else []
    df_num_normalized = pd.DataFrame()
    n = len(df)

    df_num = df[columns]

    for column in df_num:

        fig = plt.figure(figsize=(18, 8))

        # Distribution
        ax1 = fig.add_subplot(221)
        ax1.title.set_text('Distribution Plot')
        sns.histplot(df_num[column], bins=50, kde=True, stat='density', alpha=0.4, ax=ax1)

        # QQ-plot
        ax2 = fig.add_subplot(222)
        stats.probplot(df_num[column], plot=ax2, dist='norm')

        # Box-Cox Transformation
        # http://www.kmdatascience.com/2017/07/box-cox-transformations-in-python.html
        # stats.boxcox expects a 1-D array and returns (transformed values, fitted lambda)
        dft, _ = stats.boxcox(df_num[column].values)
        df_num_normalized[column] = pd.Series(dft)

        ax3 = fig.add_subplot(223)
        ax3.title.set_text('Normalized Distribution Plot')
        sns.histplot(dft, bins=50, kde=True, stat='density', alpha=0.4, ax=ax3)

        # QQ-plot of the transformed values
        ax4 = fig.add_subplot(224)
        stats.probplot(df_num_normalized[column], plot=ax4, dist='norm')

        plt.tight_layout()
        plt.show()

        n_unique = len(df_num[column].unique())
        print('Feature: {}'.format(column))
        print('Unique values count: {}'.format(n_unique))
        print('Unique values ratio: {}'.format(n_unique / n))
        print('Missing values ratio: {}'.format(df_num[column].isnull().sum() / n))
        print('Median: {}'.format(df_num[column].median()))
        print('------------------------------------')

        print()

    return df_num_normalized

In [None]:
df_num_normalized = normalize_num_features(df_num, columns = ['age', 'offer_value', 'estimated_expenses'])
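To make clear what the Box-Cox step computes, here is a minimal NumPy sketch of the transformation for a fixed, hand-picked lambda. It is not a replacement for `stats.boxcox`, which additionally estimates lambda by maximum likelihood when none is given; the function name `boxcox_transform` and the sample ages are assumptions made for this example.

```python
import numpy as np

def boxcox_transform(x, lmbda):
    """Box-Cox transform for a fixed lambda: log(x) if lambda == 0, else (x**lambda - 1) / lambda."""
    x = np.asarray(x, dtype=float)
    # The transform is only defined for strictly positive inputs,
    # which is why columns containing zeros cannot be used.
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1.0) / lmbda

# Sample ages: min, quartiles and max from the describe() output above.
ages = np.array([18.0, 24.0, 30.0, 41.0, 80.0])

log_ages = boxcox_transform(ages, 0)       # lambda = 0 is the log transform
sqrt_like = boxcox_transform(ages, 0.5)    # lambda = 0.5 behaves like a square root

print(log_ages.round(3))
print(sqrt_like.round(3))
```

Passing a column with zeros, such as `salary` or `phone_calls`, raises the `ValueError`, mirroring the restriction noted in the comment above the function.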

In [None]:
# Box-Cox requires values > 0, so it only works for age, offer_value and estimated_expenses
# (phone_calls, emails and salary contain zeros)
df_num_normalized = pd.DataFrame()
n = len(df_num)

df_num2 = pd.DataFrame()
df_num2[['age', 'offer_value', 'estimated_expenses']] = df_num[['age', 'offer_value', 'estimated_expenses']]

for column in df_num2:
    
    # Distribution
    fig = plt.figure(figsize=(18, 8))

    ax1 = fig.add_subplot(221)
    ax1.title.set_text('Distribution Plot')
    sns.histplot(df_num[column], bins=50, kde=True, stat='density', alpha=0.4, ax=ax1)

    # QQ-plot
    ax2 = fig.add_subplot(222)
    stats.probplot(df_num[column], plot=plt, dist='norm')

    # Box-Cox Transformation
    # http://www.kmdatascience.com/2017/07/box-cox-transformations-in-python.html
    transform = df_num[column].values  # stats.boxcox expects a 1-D array of positive values
    dft = stats.boxcox(transform)[0]
    df_num_normalized[column] = pd.Series(dft.ravel())
    
    ax3 = fig.add_subplot(223)
    ax3.title.set_text('Normalized Distribution Plot')
    sns.histplot(dft, bins=50, kde=True, stat='density', alpha=0.4, ax=ax3)

    ax4 = fig.add_subplot(224)
    ax4.title.set_text('')
    stats.probplot(df_num_normalized[column], plot=plt, dist='norm')

    plt.tight_layout()
    plt.show();
    
    n_unique = len(df_num[column].unique())
    print('Feature: {}'.format(column))
    print('Unique values count: {}'.format(n_unique))
    print('Unique values ratio: {}'.format(n_unique / n))
    print('Missing values ratio: {}'.format(df_num[column].isnull().sum() / n))
    print('Median: {}'.format(df_num[column].median()))
    print('------------------------------------')
    
    print()

In [None]:
df3 = pd.DataFrame()
#df3['age'] = dft.tolist()
df3

In [None]:
df3['age'] = pd.Series(dft.ravel())
df3['age']

In [None]:
column = 'age'
df_boxcox = pd.DataFrame()

#http://www.kmdatascience.com/2017/07/box-cox-transformations-in-python.html

# Distribution
fig = plt.figure(figsize=(18, 8))
#fig.suptitle(column)

ax1 = fig.add_subplot(221)
ax1.title.set_text('Distribution Plot')
sns.histplot(df_num[column], bins=50, kde=True, stat='density', alpha=0.4, ax=ax1)

# QQ-plot
ax2 = fig.add_subplot(222)
stats.probplot(df_num[column], plot=plt, dist='norm')
#ax2.margins(0.5)

# Box-Cox Transformation
ax3 = fig.add_subplot(223)
ax3.title.set_text('Normalized Distribution Plot')
transform = df_num[column].values  # stats.boxcox expects a 1-D array of positive values
dft = stats.boxcox(transform)[0]
sns.histplot(dft, bins=50, kde=True, stat='density', alpha=0.4, ax=ax3)

ax4 = fig.add_subplot(224)
df_boxcox[column] = pd.Series(dft.ravel())
ax4.title.set_text('')
stats.probplot(df_boxcox[column], plot=plt, dist='norm')

plt.tight_layout()
plt.show()

In [None]:

df_num.plot(kind='density', subplots=True, layout=(2,8), sharex=False, legend=False, fontsize=1)
plt.show();

In [None]:
# box and whisker plots
df_num.plot(kind='box', subplots=True, sharex=False, sharey=False, fontsize=1)
plt.show()

- The attributes have quite different spreads and scales. This suggests standardizing the data before modeling, so that all features are centered and on a comparable scale.
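Standardization in the sense used above means subtracting each feature's mean and dividing by its standard deviation. A minimal NumPy sketch on a few made-up rows of `age` and `salary` (scikit-learn's `StandardScaler` applies the same formula, using the population standard deviation):

```python
import numpy as np

# A few hypothetical rows with two features on very different scales: age, salary.
features = np.array([
    [29.0, 21133.75],
    [30.0, 15155.0],
    [25.0, 2655.0],
    [48.0, 789.58],
])

# Column-wise mean and population standard deviation (ddof=0).
mean = features.mean(axis=0)
std = features.std(axis=0)

# z-score: every column ends up with mean 0 and standard deviation 1.
standardized = (features - mean) / std

print(standardized.round(3))
```

In a modeling pipeline the mean and standard deviation should be computed on the training split only and then reused for the validation split, to avoid leaking information.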

In [None]:
# box and whisker plots
df_num_normalized.plot(kind='box', subplots=True, sharex=False, sharey=False, fontsize=1)
plt.show()

In [None]:
# Standardize data

In [None]:
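# Split-out validation dataset
from sklearn.model_selection import train_test_split

# Use the encoded DataFrame: every column except the target is an input feature
X = df.drop(columns=['accepted']).values.astype(float)
Y = df['accepted'].values
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y,
    test_size=validation_size, random_state=seed)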
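Since the target is imbalanced, it may be worth keeping the class shares equal in the training and validation sets. scikit-learn's `train_test_split` supports this through its `stratify` parameter; the NumPy sketch below only illustrates the idea on a toy target with the class counts from this dataset, and the helper name `stratified_holdout` is made up.

```python
import numpy as np

def stratified_holdout(y, test_size=0.2, seed=7):
    """Return (train_idx, test_idx) so that each class keeps roughly its share in the test set."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    test_idx = []
    for label in np.unique(y):
        # Shuffle the row positions of this class and take the first test_size share.
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        n_test = int(round(len(idx) * test_size))
        test_idx.extend(idx[:n_test])
    test_idx = np.sort(np.array(test_idx))
    # Everything not selected for the test set goes to the training set.
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx

# Toy target with the class counts reported earlier (800 "no" = 0, 489 "yes" = 1).
y = np.array([0] * 800 + [1] * 489)
train_idx, test_idx = stratified_holdout(y)

print(len(train_idx), len(test_idx))
print(y[train_idx].mean().round(3), y[test_idx].mean().round(3))
```

With stratification, the share of accepted offers in the validation set stays close to the overall 38%, so validation metrics are less sensitive to the random seed.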

In [None]:
### Correlations!!!