## Telecom Churn Case Study

### Business Problem
In the telecom industry, customers can choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the industry experiences an average annual churn rate of 15-25%. Because it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has become even more important than customer acquisition.

For many incumbent operators, ___retaining highly profitable customers is the number one business goal___.

To reduce customer churn, __telecom companies need to predict which customers are at high risk of churn__.

In this project, we will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn and identify the main indicators of churn.

__Importing Required Libraries__

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings as w
w.filterwarnings('ignore')
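# use the full option names; the short forms can be ambiguous in recent pandas releases
pd.set_option('display.max_columns',500)
pd.set_option('display.max_rows',500)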

In [None]:
data=pd.read_csv('telecom_churn_data.csv')
data.head()

In [None]:
data.shape

In [None]:
data.info()

As checked above, there are 214 numeric columns and 12 non-numeric columns

In [None]:
# look at data statistics
data.describe(include='all')

#### In churn prediction, we assume that there are three phases of the customer lifecycle:

The ‘good’ phase [Month 6 & 7]<br>
The ‘action’ phase [Month 8]<br>
The ‘churn’ phase [Month 9]<br><br>
In this case, since we are working over a four-month window, the first two months are the ‘good’ phase, the third month is the ‘action’ phase, while the fourth month is the ‘churn’ phase.
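
One way to make this split concrete is to map the month suffix of each column name to its phase. The sketch below is illustrative only: the helper `phase_of` and the sample column list are hypothetical, and it assumes every monthly column ends in `_6`, `_7`, `_8` or `_9`.

```python
# Hypothetical helper: map a column's month suffix to its lifecycle phase.
PHASE_BY_MONTH = {"6": "good", "7": "good", "8": "action", "9": "churn"}


def phase_of(column_name):
    """Return the phase for names like 'arpu_6', or None without a month suffix."""
    prefix, sep, suffix = column_name.rpartition("_")
    if not sep:
        return None
    return PHASE_BY_MONTH.get(suffix)


# Sample names chosen for illustration only.
sample_columns = ["arpu_6", "arpu_7", "total_og_mou_8", "vol_2g_mb_9", "aon"]
phases = {name: phase_of(name) for name in sample_columns}
print(phases)
```

Columns without a month suffix, such as `aon`, are not tied to any phase and map to `None`.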

### Data Preparation

Let us create some utility functions

In [None]:
# Method for Checking missing values percentages
def checkMissingPercent(dataset, cutoff):
    missing = round(100*(dataset.isnull().sum()/dataset.shape[0]))
    return missing.loc[missing>cutoff]

In [None]:
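# Method for imputing the monthly variants (_6 to _9) of the given columns with 0
def imputeData(df, col_list):
    for i in [x + y for y in ['_6','_7','_8','_9'] for x in col_list]:
        # assign back instead of an inplace fill on a column, which may not update df
        df[i] = df[i].fillna(0)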

__Handling missing values__

In [None]:
mod_data=data.copy()

In [None]:
# Since mobile no has all unique values and represents a particular customer, it can be dropped from the dataset.
# Similarly, circle_id has all same values(109), it also can be dropped.
mod_data.drop(['mobile_number', 'circle_id'], axis=1, inplace=True)

In [None]:
# look at missing value ratio in each column
checkMissingPercent(mod_data, 0)

As checked above, many columns contain missing values, and some of them have more than 70% of their values missing. We will not delete those columns outright. Let us first check whether these values are null because there were no transactions or for some other reason.

In [None]:
# getting all columns with about 75% missing values, then viewing them where the June data recharge date is missing
cols = checkMissingPercent(mod_data, 74).index

mod_data.loc[mod_data.date_of_last_rech_data_6.isna(),cols].head()

As checked above, all of these columns are null wherever the date of the last data recharge is missing. This is valid: no recharge was made, so we can replace these null values with 0.
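
The same check can be written as an explicit test rather than read off the first few rows. This is a minimal sketch on made-up data, assuming the recharge value columns should be null on exactly the rows where the recharge date is null; the frame `toy` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up): recharge fields are null exactly when no recharge date exists.
toy = pd.DataFrame({
    "date_of_last_rech_data_6": ["6/21/2014", None, "6/4/2014", None],
    "total_rech_data_6": [1.0, np.nan, 2.0, np.nan],
    "av_rech_amt_data_6": [252.0, np.nan, 98.0, np.nan],
})

no_recharge = toy["date_of_last_rech_data_6"].isna()
value_cols = ["total_rech_data_6", "av_rech_amt_data_6"]

# True when each value column is null on exactly the rows with no recharge date.
nulls_match = all(bool((toy[c].isna() == no_recharge).all()) for c in value_cols)

# Only then is it reasonable to encode "no recharge" as 0.
if nulls_match:
    toy[value_cols] = toy[value_cols].fillna(0)

print(nulls_match)
print(toy)
```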

In [None]:
# imputing with 0 all columns that have more than 50% missing values, except those containing dates
cols = list(filter(lambda x : not x.startswith('date') , checkMissingPercent(mod_data, 50).index))

mod_data[cols]=mod_data[cols].apply(lambda x: x.fillna(0))
mod_data[cols].head()

Checking the percentage of missing values again

In [None]:
checkMissingPercent(mod_data, 0)

Let us have a look at the non-numeric columns

In [None]:
obj=mod_data.select_dtypes(include='object')
for i in obj.columns:
    print(i,'', obj[i].nunique(),'', obj[i].isna().sum()) 

We have already used the date columns to decide how to fill the missing values. Beyond that, these date columns seem irrelevant to our analysis, so we will drop them.

In [None]:
mod_data = mod_data.drop(obj.columns, axis=1)

Again checking for the missing values

In [None]:
checkMissingPercent(mod_data, 0)

In [None]:
cols=list(checkMissingPercent(mod_data, 0).index)
mod_data[cols].describe()

As checked above, all these columns have a minimum value of 0. Since the share of missing values is low (around 4-5%), the gaps may be due to technical or human error, so it is better to fill them with the median rather than 0.
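
A small illustration of why the median is a reasonable fill value when a column is skewed. The series below is made up, and the name `mou_example` is hypothetical. The median is not pulled upwards by a single very large value the way the mean is.

```python
import numpy as np
import pandas as pd

# Made-up usage column with two missing values and one very large value.
usage = pd.Series([0.0, 12.5, np.nan, 40.0, 7.5, np.nan, 900.0], name="mou_example")

median_value = usage.median()   # NaN entries are skipped by default
mean_value = usage.mean()       # shown only for comparison

filled = usage.fillna(median_value)
print(median_value, mean_value)
print(filled.tolist())
```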

In [None]:
# filling the columns above with median
mod_data[cols]=mod_data[cols].apply(lambda x: x.fillna(x.median()))
mod_data[cols].head()

Checking whether the missing value imputation was successful

In [None]:
all(mod_data.isna().sum()==0)

In [None]:
# removing duplicate rows
mod_data.drop_duplicates(inplace=True)
mod_data.shape

A few column names are not consistent with the others. Let us make them consistent.
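
The renaming could also be written as a rule instead of a hand-typed mapping. The following is a minimal sketch, assuming that the inconsistent names use the three-letter month prefixes `jun`, `jul`, `aug` and `sep`; the helper `normalise_name` is hypothetical.

```python
import re

# Assumed mapping from month prefix to the numeric suffix used elsewhere.
MONTH_SUFFIX = {"jun": "6", "jul": "7", "aug": "8", "sep": "9"}


def normalise_name(name):
    """Turn 'aug_vbc_3g' into 'vbc_3g_8'; leave other names unchanged."""
    match = re.fullmatch(r"(jun|jul|aug|sep)_(.+)", name)
    if match is None:
        return name
    month, rest = match.groups()
    return f"{rest}_{MONTH_SUFFIX[month]}"


columns = ["jun_vbc_3g", "jul_vbc_3g", "aug_vbc_3g", "arpu_6", "aon"]
renamed = [normalise_name(c) for c in columns]
print(renamed)
```

With a DataFrame, the same function could be passed as `df.rename(columns=normalise_name)`.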

In [None]:
print(list(filter(lambda x: x[-1].isalpha(), mod_data.columns)))
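# a September column (sep_vbc_3g), if present, belongs to the churn month, so it gets the _9 suffix; missing keys are ignored
mod_data.rename(columns={'aug_vbc_3g':'vbc_3g_8', 'jul_vbc_3g':'vbc_3g_7', 'jun_vbc_3g':'vbc_3g_6', 'sep_vbc_3g':'vbc_3g_9'}, inplace=True)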
mod_data.head()

__Keeping only high-value customers, based on the average total recharge amount over the good months__
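
As a rough sketch of the filter on made-up numbers: total spend per month is taken as the call recharge amount plus the number of data recharges times their average amount, the two good months are averaged, and customers above the 70th percentile are kept. The frame `toy` and its values are invented; only the column names follow the dataset.

```python
import pandas as pd

# Made-up customers; column names follow the dataset's convention.
toy = pd.DataFrame({
    "total_rech_amt_6":   [100, 500, 50, 900, 300, 0, 250, 1200, 80, 400],
    "total_rech_amt_7":   [120, 450, 60, 800, 350, 0, 200, 1000, 90, 500],
    "total_rech_data_6":  [0, 2, 0, 1, 0, 0, 1, 3, 0, 0],
    "av_rech_amt_data_6": [0, 100, 0, 250, 0, 0, 50, 150, 0, 0],
    "total_rech_data_7":  [0, 1, 0, 1, 0, 0, 0, 2, 0, 0],
    "av_rech_amt_data_7": [0, 100, 0, 250, 0, 0, 0, 150, 0, 0],
})

# Monthly spend = call recharges + (number of data recharges * average data recharge).
spend_6 = toy.total_rech_amt_6 + toy.total_rech_data_6 * toy.av_rech_amt_data_6
spend_7 = toy.total_rech_amt_7 + toy.total_rech_data_7 * toy.av_rech_amt_data_7
toy["av_rech_amt_6_7"] = (spend_6 + spend_7) / 2

# Keep customers strictly above the 70th percentile of average spend.
cutoff = toy["av_rech_amt_6_7"].quantile(0.7)
high_value = toy[toy["av_rech_amt_6_7"] > cutoff].copy()
print(cutoff, len(high_value))
```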

In [None]:
mod_data['av_rech_amt_6_7']=((mod_data.av_rech_amt_data_6 * mod_data.total_rech_data_6 + mod_data.total_rech_amt_6)+
                             (mod_data.av_rech_amt_data_7 * mod_data.total_rech_data_7 + mod_data.total_rech_amt_7)) / 2

# mod_data.drop(['av_rech_amt_data_6','total_rech_data_6','total_rech_amt_6','av_rech_amt_data_7',
#                'total_rech_data_7','total_rech_amt_7'], axis=1, inplace=True)


# .copy() so that later column assignments do not act on a slice of mod_data
high_value_cust = mod_data[mod_data.av_rech_amt_6_7>mod_data.av_rech_amt_6_7.quantile(0.7)].copy()
len(high_value_cust)

In [None]:
high_value_cust.shape

**Tagging the churned customers (churn=1, else 0) based on the fourth month as follows: those who have not made any calls (either incoming or outgoing) AND have not used mobile internet even once in the churn phase. The attributes used to tag churners are:**

1. total_ic_mou_9
2. total_og_mou_9
3. vol_2g_mb_9
4. vol_3g_mb_9
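
The tagging rule can be sketched on a toy frame as follows. The values are made up. Testing each attribute against zero gives the same result as testing the sum when all usage values are non-negative, and it states the "no calls AND no internet" condition directly.

```python
import pandas as pd

# Made-up churn-phase usage for four customers.
toy = pd.DataFrame({
    "total_ic_mou_9": [0.0, 10.2, 0.0, 0.0],
    "total_og_mou_9": [0.0, 5.0, 0.0, 3.1],
    "vol_2g_mb_9":    [0.0, 0.0, 12.0, 0.0],
    "vol_3g_mb_9":    [0.0, 0.0, 0.0, 0.0],
})

usage_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]

# churn = 1 only when every usage measure in the churn month is exactly zero.
toy["churn"] = (toy[usage_cols] == 0).all(axis=1).astype(int)
print(toy["churn"].tolist())
```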

In [None]:
high_value_cust['churn'] = (high_value_cust.total_ic_mou_9+high_value_cust.total_og_mou_9 + high_value_cust.vol_3g_mb_9 + high_value_cust.vol_2g_mb_9).apply(lambda x: 1 if x==0 else 0)
high_value_cust.head()

In [None]:
high_value_cust.churn.value_counts()

In [None]:
# compute the rate from the data instead of hard-coding the counts
print('churn rate:', round(high_value_cust.churn.mean()*100,2), '%')

Our dataset has a high class imbalance. We will take care of it while building the models.
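
One common way to handle the imbalance is to weight each class inversely to its frequency. The sketch below reproduces the heuristic that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * np.bincount(y))`, on made-up labels. The 90/10 split is an assumption for illustration, not the ratio in this dataset.

```python
import numpy as np

# Made-up labels: 90 non-churners (0) and 10 churners (1).
y = np.array([0] * 90 + [1] * 10)

counts = np.bincount(y)                      # samples per class
weights = len(y) / (len(counts) * counts)    # n_samples / (n_classes * count)
class_weight = {cls: float(w) for cls, w in enumerate(weights)}
print(class_weight)
```

The rarer class receives the larger weight, so errors on churners cost more during training.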

Removing all the attributes corresponding to the churn phase (all attributes with the ‘_9’ suffix in their names).

In [None]:
# match the full '_9' suffix rather than any name that happens to end in the digit 9
high_value_cust.drop(list(filter(lambda x: x.endswith('_9'),high_value_cust.columns)), axis=1, inplace=True)
high_value_cust.head()

In [None]:
high_value_cust.shape

As per the data dictionary, columns starting with fb and night flag the Facebook and night pack schemes respectively, so they are categorical (yes/no) columns, as is the churn column. We will convert them to object type, which helps during EDA.

In [None]:
cols=list(filter(lambda x: x.startswith('fb') or x.startswith('night'), high_value_cust.columns))
cols

In [None]:
cols.append('churn')
high_value_cust[cols]=high_value_cust[cols].astype('object')

### Exploratory Data Analysis

Let us create some utility functions

In [None]:
# Method to average or subtract 2 columns to form a new column. It can also combine every pair of columns
# whose names share a prefix and end with the given patterns.
    # col_a_end_str - column name or end pattern for column A
    # col_b_end_str - column name or end pattern for column B
    # avg_or_diff - 'avg' for the average of the 2 columns, 'diff' for column B minus column A
    # new_name_end_str - end pattern to give to the new column/s
    # dataframe - a dataframe
    # complete_column_name_given - True if complete column names are passed instead of end patterns

def addOrSubColumns(col_a_end_str, col_b_end_str, avg_or_diff, new_name_end_str, dataframe, complete_column_name_given=False):
    li=[]
    if complete_column_name_given:
        new_name= col_a_end_str+'_'+col_b_end_str+'_'+new_name_end_str
        if avg_or_diff=='diff':
            dataframe[new_name]= (dataframe[col_b_end_str] - dataframe[col_a_end_str])
        else:
            dataframe[new_name]= (dataframe[col_b_end_str] + dataframe[col_a_end_str])/2
        print(new_name)
        li+=[col_a_end_str,col_b_end_str]

    else:
        s=set(filter( lambda x: x[-len(col_a_end_str):]==col_a_end_str, dataframe.columns))
        s1=set(filter( lambda x:  x[-len(col_b_end_str):]==col_b_end_str, dataframe.columns))
        
        for i in list(s):
            k=i[:-len(col_a_end_str)]
            a=k+col_a_end_str
            b=k+col_b_end_str
            if  b in s1:
                if avg_or_diff=='diff':
                    dataframe[k+new_name_end_str]= (dataframe[b] - dataframe[a])
                else:
                    dataframe[k+new_name_end_str]= (dataframe[b] + dataframe[a])/2
                li+=[a,b]
                s.remove(a); s1.remove(b)
        
    return dataframe.drop(li, axis=1)

In [None]:
# ---- Univariate Analysis ---- #
def univariate(dataset,col):
    #col = dataset.columns
    plt.figure(figsize=(12, 6))
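    if dataset[col].dtypes != 'object':
        # distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it
        sns.histplot(dataset[col], kde=True)
        print(dataset[col].describe())
    else:
        sns.countplot(x=dataset[col])
        print(dataset[col].value_counts())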
    plt.title( 'Frequency Plot of ' + str(col) , loc='left', fontsize=12, fontweight=0, color='Blue')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()

In [None]:
# ---- Bivariate Analysis ---- #
def bivariate(dataset, col1, col2):
    plt.figure(figsize=(12, 6))
    if (dataset[col1].dtypes == 'object' and dataset[col2].dtypes != 'object'):
        sns.boxplot(x = col1, y = col2, data = dataset)
        plt.xlabel(col1)
        plt.ylabel(col2)
    elif (dataset[col1].dtypes != 'object' and dataset[col2].dtypes == 'object'):
        sns.boxplot(x = col2, y = col1, data = dataset)
        plt.xlabel(col2)
        plt.ylabel(col1)
    plt.title( 'Box Plot of ' + str(col1)+ ' vs '+ str(col2) , loc='left', fontsize=12, fontweight=0, color='Blue')
    plt.show()
    
#         if dataset[col2].nunique()>10:
#                 g.set_xticklabels(g.get_xticklabels(),rotation=90)

In [None]:
# ---- Bivariate Analysis with Churn as one of the column ---- #
def bivariate_churn(dataset,col):
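    if dataset[col].dtypes != 'object':
        # pass the dataframe explicitly and put churn on the x-axis
        plt.figure(figsize=(12, 6))
        sns.boxplot(x='churn', y=col, data=dataset)
        plt.show()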

In [None]:
# method to cap outliers
def capingOutliers(dataframe, quantile, columns, cap=False):
    for i in columns:
        upper = dataframe[i].quantile(quantile)
        print('outliers in',i, ':', (dataframe[i]>upper).sum())
        if cap:
            # assign the clipped column back; chained assignment may not update the dataframe
            dataframe[i] = dataframe[i].clip(upper=upper)

In [None]:
def plot_vs_Churn(dataset,col):
    # per month churn vs Non-Churn
    fig, ax = plt.subplots(figsize=(7,4))

    # use the dataset passed in; the first three matches are assumed to be the June, July and August columns
    colList=list(dataset.filter(regex=(col)).columns)
    colList = colList[:3]
    # cast to float so that columns stored as object (such as fb_user) can be averaged
    means = dataset[colList].astype(float).groupby(dataset['churn'].astype(int)).mean()
    ax.plot(['Jun','Jul','Aug'], means.T.values)
    
    ## Add legend
    plt.legend(['Non-Churn', 'Churn'])
    
    # Add titles
    plt.title( str(col) +" V/S Month", loc='left', fontsize=12, fontweight=0, color='orange')
    plt.xlabel("Month")
    plt.ylabel(col)
    plt.show()
    
    # Numeric stats for per month churn vs Non-Churn
    return means

In [None]:
plot_vs_Churn(high_value_cust,'total_ic_mou')

__Observation__
1. Total incoming minutes of usage drop at a faster pace for the churners from June to July.
2. For non-churners the graph is almost constant.

In [None]:
plot_vs_Churn(high_value_cust,'total_og_mou')

__Observation__
1. Total outgoing minutes of usage drop significantly for the churners from June to July. Churners also had noticeably higher outgoing usage than non-churners in June.
2. For non-churners the graph remains constant.

In [None]:
plot_vs_Churn(high_value_cust,'fb_user')

__Observation__
1. As observed, the share of fb users dropped for the churners from June to July.
2. For non-churners the graph is almost constant.

In [None]:
plot_vs_Churn(high_value_cust,'total_rech_amt')

__Observation__
1. Total recharge amount drops significantly for the churners from June to July. Churners were also spending noticeably more on recharges than non-churners in June.
2. For non-churners the graph is almost constant.

In [None]:
plot_vs_Churn(high_value_cust,'max_rech_amt')

__Observation__
1. Maximum recharge amount drops for the churners from June to July, and then drops steeply to August.
2. For non-churners the graph is almost constant.

In [None]:
plot_vs_Churn(high_value_cust,'arpu')

__Observation__
1. Average Revenue Per User (ARPU) drops at a faster pace for the churners from June to July. The ARPU from churners was noticeably higher than from non-churners in June.
2. For non-churners the graph is almost constant.

In [None]:
plot_vs_Churn(high_value_cust,'night_pck_user')

__Observation__
1. The share of night pack users drops significantly for the churners from June to July.
2. For non-churners the graph is fairly constant.

In [None]:
#After analysis we do not need these columns as we have got a derived column av_rech_amt_6_7
mod_data.drop(['av_rech_amt_data_6','total_rech_data_6','total_rech_amt_6','av_rech_amt_data_7',
               'total_rech_data_7','total_rech_amt_7'], axis=1, inplace=True)

In [None]:
# -- do some analysis

In [None]:
univariate(high_value_cust,'aon')

In [None]:
univariate(high_value_cust,'av_rech_amt_6_7')

In [None]:
bivariate(high_value_cust,'av_rech_amt_6_7','churn')

__Outlier Treatment__

In [None]:
round(high_value_cust.describe(percentiles=[0.25,0.5,0.75,0.9,0.95,0.99]),2)

In [None]:
# box plots to analyse the outliers

In [None]:
# call the function to check and cap the outliers
cols=list(high_value_cust.select_dtypes(exclude='object').columns) # columns in which to cap outliers
capingOutliers(high_value_cust, 0.95, cols, True)
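
Capping at a quantile amounts to clipping every value above that quantile down to it. A minimal sketch on a made-up series, not the notebook's data:

```python
import pandas as pd

# Made-up values with one extreme observation.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float)

cap_value = values.quantile(0.95)                # upper bound taken from the data
n_outliers = int((values > cap_value).sum())     # how many values exceed it
capped = values.clip(upper=cap_value)            # values above the bound are set to it

print(n_outliers)
print(capped.tolist())
```

Values at or below the bound are left untouched, so the ordering of customers is preserved except among the capped ones.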

In [None]:
# -- do some analysis

In [None]:
# aggregating the columns of good months
high_value_cust=addOrSubColumns('6','7','avg','6_7',high_value_cust)

In [None]:
# -- do some analysis

In [None]:
# getting average recharge amount for action month
high_value_cust['av_rech_amt_8']=(high_value_cust.av_rech_amt_data_8 * high_value_cust.total_rech_data_8 + 
                                  high_value_cust.total_rech_amt_8)

high_value_cust.drop(['av_rech_amt_data_8','total_rech_data_8','total_rech_amt_8'], axis=1, inplace=True)

In [None]:
# -- do some analysis

In [None]:
# difference of the columns between action month and average of good month
high_value_cust=addOrSubColumns('6_7','8','diff','diff',high_value_cust)
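
To illustrate what the two aggregation steps produce for a single metric, the sketch below derives the good-phase average and the action-phase difference for `arpu` by hand. The values are made up, and the frame `toy` stands in for the real data.

```python
import pandas as pd

# Made-up ARPU for two customers across June, July and August.
toy = pd.DataFrame({
    "arpu_6": [300.0, 500.0],
    "arpu_7": [340.0, 420.0],
    "arpu_8": [330.0, 120.0],
})

# Average of the good months, then action month minus that average.
toy["arpu_6_7"] = (toy["arpu_6"] + toy["arpu_7"]) / 2
toy["arpu_diff"] = toy["arpu_8"] - toy["arpu_6_7"]
print(toy[["arpu_6_7", "arpu_diff"]])
```

A strongly negative difference, as for the second customer here, means usage fell sharply in the action month relative to the good phase.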

In [None]:
# -- do some analysis

Removing the columns having more than 85% of values as a single value (highly skewed columns)
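
A minimal sketch of this rule on a toy frame: for each column, take the share of its most frequent value and drop the column when that share exceeds the threshold, while always keeping the target. The frame and its column names are invented for illustration.

```python
import pandas as pd

# Made-up frame: one near-constant column, one varied column, and the target.
toy = pd.DataFrame({
    "mostly_zero": [0] * 9 + [5],
    "varied": list(range(10)),
    "churn": [0] * 9 + [1],
})

threshold = 0.85

# Share of the single most frequent value in each column.
dominant_share = {
    col: toy[col].value_counts(normalize=True).iloc[0] for col in toy.columns
}

# Drop near-constant predictors, but never the target.
to_drop = [c for c, share in dominant_share.items() if share > threshold and c != "churn"]
reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```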

In [None]:
li=[]
for i in high_value_cust.columns:
    # never drop the target, even though it is dominated by the non-churn class
    if i != 'churn' and high_value_cust[i].value_counts().max()/len(high_value_cust) >0.85:
        li.append(i)

high_value_cust.drop(li, axis=1, inplace=True)

In [None]:
high_value_cust.shape

### Model Building

In [None]:
# import required libraries
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# from sklearn.pipeline import FeatureUnion
# from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
# from imblearn.metrics import sensitivity_specificity_support
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC

In [None]:
high_value_cust.columns

In [None]:
# divide data into train and test
X = high_value_cust.drop("churn", axis = 1)
# churn was cast to object for EDA; scikit-learn classifiers need a numeric target
y = high_value_cust.churn.astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
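
With an imbalanced target, a plain random split can leave the train and test sets with different churn rates. A possible refinement, shown here on made-up arrays rather than the notebook's data, is to pass `stratify` to `train_test_split` so that both parts keep roughly the same class ratio. The `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 10% positives.
toy_X = np.arange(200).reshape(100, 2)
toy_y = np.array([0] * 90 + [1] * 10)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=42
)

# Positive rate in each part stays close to the overall 10%.
print(toy_y_train.mean(), toy_y_test.mean())
```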

In [None]:
# print shapes of train and test sets
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
# apply pca to train data
pca = Pipeline([('scaler', MinMaxScaler()), ('pca', PCA())])

In [None]:
# fit_transform fits the pipeline and projects the training data in one step
churn_pca = pca.fit_transform(X_train)

In [None]:
# extract pca model from pipeline
pca = pca.named_steps['pca']

# look at explained variance of PCA components
print(pd.Series(np.round(pca.explained_variance_ratio_.cumsum(), 4)*100))

In [None]:
# plot feature variance
features = range(pca.n_components_)
cumulative_variance = np.round(np.cumsum(pca.explained_variance_ratio_)*100, decimals=4)
plt.figure(figsize=(175/20,100/20)) # 100 elements on y-axis; 175 elements on x-axis; 20 is normalising factor
plt.plot(cumulative_variance)
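
Instead of reading the number of components off the plot, it can be derived from the cumulative explained variance. The sketch below uses synthetic data and a 90% target, both of which are assumptions for illustration; the `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: 10 observed features driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
toy_X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

toy_scaled = MinMaxScaler().fit_transform(toy_X)
toy_pca = PCA().fit(toy_scaled)

cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the target.
target = 0.90
n_components = int(np.searchsorted(cumulative, target) + 1)
print(n_components, cumulative[n_components - 1])
```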

In [None]:
# create pipeline
PCA_VARS = 20
steps = [('scaler', MinMaxScaler()),
         ("pca", PCA(n_components=PCA_VARS)),
         ("logistic", LogisticRegression(class_weight='balanced'))
        ]
pipeline = Pipeline(steps)

In [None]:
# fit model
pipeline.fit(X_train, y_train)

# check score on train data
pipeline.score(X_train, y_train)

In [None]:
# predict churn on test data
y_pred = pipeline.predict(X_test)

# create confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# # check sensitivity and specificity
# sensitivity, specificity, _ = sensitivity_specificity_support(y_test, y_pred, average='binary')
# print("Sensitivity: \t", round(sensitivity, 2), "\n", "Specificity: \t", round(specificity, 2), sep='')

# check area under curve
y_pred_prob = pipeline.predict_proba(X_test)[:, 1]
print("AUC:    \t", round(roc_auc_score(y_test, y_pred_prob),2))
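
Sensitivity and specificity can also be read directly from the confusion matrix without an extra library. This is a minimal sketch on made-up labels; `toy_true` and `toy_pred` are hypothetical stand-ins for `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions (1 = churn).
toy_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
toy_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

sensitivity = tp / (tp + fn)   # share of churners that were caught
specificity = tn / (tn + fp)   # share of non-churners correctly left alone
print(round(sensitivity, 2), round(specificity, 2))
```

For churn, sensitivity is usually the figure to watch, since a missed churner is typically costlier than a retention offer sent to a loyal customer.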

In [None]:
# PCA
pca = PCA()

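# logistic regression - the class weight is used to handle class imbalance - it adjusts the cost function
# liblinear is used because the default solver does not support the l1 penalty in the grid
logistic = LogisticRegression(class_weight={0:0.1, 1: 0.9}, solver='liblinear')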

# create pipeline
steps = [("scaler", MinMaxScaler()), 
         ("pca", pca),
         ("logistic", logistic)
        ]

# compile pipeline
pca_logistic = Pipeline(steps)

# hyperparameter space
params = {'pca__n_components': [20, 30], 'logistic__C': [0.1, 0.5, 1, 2, 3, 4, 5, 10], 'logistic__penalty': ['l1', 'l2']}

# create 5 folds
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 4)

# create gridsearch object
model = GridSearchCV(estimator=pca_logistic, cv=folds, param_grid=params, scoring='roc_auc', n_jobs=-1, verbose=1)

In [None]:
# fit model
model.fit(X_train, y_train)

In [None]:
# cross validation results
pd.DataFrame(model.cv_results_)

In [None]:
# print best hyperparameters
print("Best AUC: ", model.best_score_)
print("Best hyperparameters: ", model.best_params_)

In [None]:
# predict churn on test data
y_pred = model.predict(X_test)

# create confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# # check sensitivity and specificity
# sensitivity, specificity, _ = sensitivity_specificity_support(y_test, y_pred, average='binary')
# print("Sensitivity: \t", round(sensitivity, 2), "\n", "Specificity: \t", round(specificity, 2), sep='')

# check area under curve
y_pred_prob = model.predict_proba(X_test)[:, 1]
print("AUC:    \t", round(roc_auc_score(y_test, y_pred_prob),2))

In [None]:
# random forest - the class weight is used to handle class imbalance - it adjusts the cost function
forest = RandomForestClassifier(class_weight={0:0.1, 1: 0.9}, n_jobs = -1)

# hyperparameter space
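# 'sqrt' is what 'auto' meant for classifiers; recent scikit-learn releases no longer accept 'auto'
params = {"criterion": ['gini', 'entropy'], "max_features": ['sqrt', 0.4]}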

# create 5 folds
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 4)

# create gridsearch object
model = GridSearchCV(estimator=forest, cv=folds, param_grid=params, scoring='roc_auc', n_jobs=-1, verbose=1)

In [None]:
# fit model
model.fit(X_train, y_train)

In [None]:
# print best hyperparameters
print("Best AUC: ", model.best_score_)
print("Best hyperparameters: ", model.best_params_)

In [None]:
# predict churn on test data
y_pred = model.predict(X_test)

# create confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# # check sensitivity and specificity
# sensitivity, specificity, _ = sensitivity_specificity_support(y_test, y_pred, average='binary')
# print("Sensitivity: \t", round(sensitivity, 2), "\n", "Specificity: \t", round(specificity, 2), sep='')

# check area under curve
y_pred_prob = model.predict_proba(X_test)[:, 1]
print("AUC:    \t", round(roc_auc_score(y_test, y_pred_prob),2))

In [None]:
# run a random forest model on train data
max_features = int(round(np.sqrt(X_train.shape[1])))    # number of variables to consider to split each node
print(max_features)

rf_model = RandomForestClassifier(n_estimators=100, max_features=max_features, class_weight={0:0.1, 1: 0.9}, oob_score=True, random_state=4, verbose=1)

In [None]:
# fit model
rf_model.fit(X_train, y_train)

In [None]:
# OOB score
rf_model.oob_score_

In [None]:
# predict churn on test data
y_pred = rf_model.predict(X_test)

# create confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# # check sensitivity and specificity
# sensitivity, specificity, _ = sensitivity_specificity_support(y_test, y_pred, average='binary')
# print("Sensitivity: \t", round(sensitivity, 2), "\n", "Specificity: \t", round(specificity, 2), sep='')

# check area under curve
y_pred_prob = rf_model.predict_proba(X_test)[:, 1]
print("AUC:    \t", round(roc_auc_score(y_test, y_pred_prob),2))

In [None]:
# predictors
features = high_value_cust.drop('churn', axis=1).columns

# feature_importance
importance = rf_model.feature_importances_

# create dataframe
feature_importance = pd.DataFrame({'variables': features, 'importance_percentage': importance*100})
feature_importance = feature_importance[['variables', 'importance_percentage']]

# sort features
feature_importance = feature_importance.sort_values('importance_percentage', ascending=False).reset_index(drop=True)
print("Sum of importance=", feature_importance.importance_percentage.sum())
feature_importance

In [None]:
# extract top 'n' features
top_n = 10
top_features = feature_importance.variables[0:top_n]

In [None]:
# plot feature correlation
import seaborn as sns
plt.rcParams["figure.figsize"] =(15,10)
# mycmap = sns.diverging_palette(199, 359, s=99, center="light", as_cmap=True)
sns.heatmap(data=X_train[top_features].corr(), center=0.0, annot=True)

In [None]:
X_train = X_train[top_features]
X_test = X_test[top_features]

In [None]:
# logistic regression
steps = [('scaler', MinMaxScaler()), 
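         # liblinear is used because the default solver does not support the l1 penalty in the grid
         ("logistic", LogisticRegression(class_weight={0:0.1, 1:0.9}, solver='liblinear'))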
        ]

# compile pipeline
logistic = Pipeline(steps)

# hyperparameter space
params = {'logistic__C': [0.1, 0.5, 1, 2, 3, 4, 5, 10], 'logistic__penalty': ['l1', 'l2']}

# create 5 folds
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 4)

# create gridsearch object
model = GridSearchCV(estimator=logistic, cv=folds, param_grid=params, scoring='roc_auc', n_jobs=-1, verbose=1)

In [None]:
# fit model
model.fit(X_train, y_train)

In [None]:
# print best hyperparameters
print("Best AUC: ", model.best_score_)
print("Best hyperparameters: ", model.best_params_)

In [None]:
# predict churn on test data
y_pred = model.predict(X_test)

# create confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# # check sensitivity and specificity
# sensitivity, specificity, _ = sensitivity_specificity_support(y_test, y_pred, average='binary')
# print("Sensitivity: \t", round(sensitivity, 2), "\n", "Specificity: \t", round(specificity, 2), sep='')

# check area under curve
y_pred_prob = model.predict_proba(X_test)[:, 1]
print("AUC:    \t", round(roc_auc_score(y_test, y_pred_prob),2))