The main idea of this notebook is to set up a validation framework for our model. We are going to split the whole dataset into three parts: a train, a validation and a test set. 

The test set will serve as a proxy for using the model once it is completely finished, fine-tuned, etc. In other words, it stands in for using our model in production and lets us assess the 'real world' quality of the model. 

We are going to use a 60%-20%-20% split of the whole dataset. I already know that (stratified) cross-validation exists, but as this course is intended as a 'first course' I will follow it as is. No one said that you should always use cross-validation. Some people even argue that it is not that necessary, depending on how much data you have! 

In terms of assessing model performance, the idea of stratified k-fold CV makes complete sense. I'm not going to say it is ideal, but I would 'shake' less knowing that I used stratified k-fold CV. ;)
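
For reference, here is a minimal sketch of what stratified k-fold CV does. It uses a made-up target vector rather than the churn data, and the 27/73 class mix, `n_splits=5` and the seed are assumptions chosen only for illustration. Each validation fold should keep roughly the same share of positives as the whole sample.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up binary target: 27 positives out of 100 (illustrative only)
y = np.array([1] * 27 + [0] * 73)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature matrix

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Share of positives inside each validation fold
fold_rates = [float(y[val_idx].mean()) for _, val_idx in skf.split(X, y)]
print(fold_rates)
```

With a plain (non-stratified) split the per-fold share of positives can drift further from the overall share, which is why the stratified variant is attractive for imbalanced targets.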

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv("./data/data-preparation.csv")
df.head()

Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,...,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
0,7590-vhveg,female,0,yes,no,1,no,no_phone_service,dsl,no,...,no,no,no,no,month-to-month,yes,electronic_check,29.85,29.85,0
1,5575-gnvde,male,0,no,no,34,yes,no,dsl,yes,...,yes,no,no,no,one_year,no,mailed_check,56.95,1889.5,0
2,3668-qpybk,male,0,no,no,2,yes,no,dsl,yes,...,no,no,no,no,month-to-month,yes,mailed_check,53.85,108.15,1
3,7795-cfocw,male,0,no,no,45,no,no_phone_service,dsl,yes,...,yes,yes,no,no,one_year,no,bank_transfer_(automatic),42.3,1840.75,0
4,9237-hqitu,female,0,no,no,2,yes,no,fiber_optic,no,...,no,no,no,no,month-to-month,yes,electronic_check,70.7,151.65,1


In [3]:
#train_test_split?

In [4]:
#train_test_split splits a dataset into only two parts, so first we are going to do an 80%-20% split

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

In [5]:
#Now, to make the validation set the same size as the test set (a full 60-20-20 split),
#we need to see how much 20% is relative to the remaining 80%: 20%/80% = 25%. So we split df_full_train into 75%-25%,
#where 75% goes to the train set and 25% goes to the validation set:

df_train, df_val = train_test_split(df_full_train, test_size = 0.25, random_state=1)

In [6]:
len(df_train), len(df_val), len(df_test)

(4225, 1409, 1409)
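
The arithmetic behind `test_size = 0.25` can be checked with a few lines of plain Python. This is only a sketch: the total row count is taken from the sizes printed above, and the rounding rule (the held-out part is rounded up) is an assumption that happens to reproduce those sizes.

```python
import math

n_total = 4225 + 1409 + 1409   # rows in the dataset, from the output above
test_frac = 0.2                # share of the whole dataset kept for testing
val_frac_of_total = 0.2        # share of the whole dataset wanted for validation

# Validation share expressed relative to what is left after removing the test set
val_frac_of_full_train = val_frac_of_total / (1 - test_frac)

# Assumption: the held-out part is rounded up, which matches the counts above
n_test = math.ceil(test_frac * n_total)
n_full_train = n_total - n_test
n_val = math.ceil(val_frac_of_full_train * n_full_train)
n_train = n_full_train - n_val

print(val_frac_of_full_train, (n_train, n_val, n_test))
```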

In [7]:
df_train

Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,...,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
3897,8015-ihcgw,female,0,yes,yes,72,yes,yes,fiber_optic,yes,...,yes,yes,yes,yes,two_year,yes,electronic_check,115.50,8425.15,0
1980,1960-uycnn,male,0,no,no,10,yes,yes,fiber_optic,no,...,yes,no,no,yes,month-to-month,yes,electronic_check,95.25,1021.55,0
6302,9250-wypll,female,0,no,no,5,yes,yes,fiber_optic,no,...,no,no,no,no,month-to-month,no,electronic_check,75.55,413.65,1
727,6786-obwqr,female,0,yes,yes,5,yes,no,fiber_optic,no,...,no,no,yes,no,month-to-month,yes,electronic_check,80.85,356.10,0
5104,1328-euzhc,female,0,yes,no,18,yes,no,no,no_internet_service,...,no_internet_service,no_internet_service,no_internet_service,no_internet_service,two_year,no,mailed_check,20.10,370.50,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3774,1309-xgfsn,male,1,yes,yes,52,yes,yes,dsl,no,...,yes,no,yes,yes,one_year,yes,electronic_check,80.85,4079.55,0
6108,4819-hjpiw,male,0,no,no,18,no,no_phone_service,dsl,no,...,no,no,no,no,month-to-month,no,mailed_check,25.15,476.80,0
1530,3703-vavcl,male,0,yes,yes,2,yes,no,fiber_optic,no,...,yes,yes,no,yes,month-to-month,no,credit_card_(automatic),90.00,190.05,1
3701,3812-lrzir,female,0,yes,yes,27,yes,yes,no,no_internet_service,...,no_internet_service,no_internet_service,no_internet_service,no_internet_service,two_year,no,electronic_check,24.50,761.95,0


In [8]:
#The instructor says it is not necessary to replace the shuffled indices with ordered 
#ones. Indeed, skipping it won't harm the model or any process in any way, but it seems convenient to adopt it.

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [9]:
df_train.head()

Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,...,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
0,8015-ihcgw,female,0,yes,yes,72,yes,yes,fiber_optic,yes,...,yes,yes,yes,yes,two_year,yes,electronic_check,115.5,8425.15,0
1,1960-uycnn,male,0,no,no,10,yes,yes,fiber_optic,no,...,yes,no,no,yes,month-to-month,yes,electronic_check,95.25,1021.55,0
2,9250-wypll,female,0,no,no,5,yes,yes,fiber_optic,no,...,no,no,no,no,month-to-month,no,electronic_check,75.55,413.65,1
3,6786-obwqr,female,0,yes,yes,5,yes,no,fiber_optic,no,...,no,no,yes,no,month-to-month,yes,electronic_check,80.85,356.1,0
4,1328-euzhc,female,0,yes,no,18,yes,no,no,no_internet_service,...,no_internet_service,no_internet_service,no_internet_service,no_internet_service,two_year,no,mailed_check,20.1,370.5,0


In [10]:
#Finally, we will extract the target variables from each of those datasets. 

y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

#and delete them from the dataframes, since ML libraries expect the feature matrix and the target variable to be separate!
del df_train['churn']
del df_val['churn']
del df_test['churn']

At this point it would make sense to start another notebook for the EDA, but I think it would be much messier to export these as CSVs and work my way from there. I will stick with the same notebook for the EDA too!

# EDA (Exploratory Data Analysis)

In this part we will do a basic EDA: look at the global churn rate, assess the categorical and numerical features, and start our analysis from there. We use ```df_full_train``` as the 'population' to assess because we expect ```df_test``` to be something that might surprise us: once the model is in production, we can't guarantee what kind of distribution we will see. But if we do well on the data we have, I think we shouldn't expect drastic changes in the future distribution that awaits us. 

In [11]:
df_full_train = df_full_train.reset_index(drop=True)

In [12]:
df_full_train.head()

Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,...,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
0,5442-pptjy,male,0,yes,yes,12,yes,no,no,no_internet_service,...,no_internet_service,no_internet_service,no_internet_service,no_internet_service,two_year,no,mailed_check,19.7,258.35,0
1,6261-rcvns,female,0,no,no,42,yes,no,dsl,yes,...,yes,yes,no,yes,one_year,no,credit_card_(automatic),73.9,3160.55,1
2,2176-osjuv,male,0,yes,no,71,yes,yes,dsl,yes,...,no,yes,no,no,two_year,no,bank_transfer_(automatic),65.15,4681.75,0
3,6161-erdgd,male,0,yes,yes,71,yes,yes,dsl,yes,...,yes,yes,yes,yes,one_year,no,electronic_check,85.45,6300.85,0
4,2364-ufrom,male,0,no,no,30,yes,no,dsl,yes,...,no,yes,yes,no,one_year,no,electronic_check,70.4,2044.75,0


In [13]:
#Check for missing values:
df_full_train.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [14]:
#Check the churn rate:
round(df_full_train.churn.mean(),3)

0.27

The churn rate is the share of customers that actually churned. If this dataset is a proxy for the 'population' of one telco company, then the churn rate is relatively high (~27%) for a month!

Now, let's identify numerical and categorical features.

In [15]:
df_full_train.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges        float64
churn                 int64
dtype: object

In [16]:
numerical = ['tenure', 'monthlycharges', 'totalcharges']
categorical = [
    'gender',
    'seniorcitizen',
    'partner',
    'dependents',
    'phoneservice',
    'multiplelines',
    'internetservice',
    'onlinesecurity',
    'onlinebackup',
    'deviceprotection',
    'techsupport',
    'streamingtv',
    'streamingmovies',
    'contract',
    'paperlessbilling',
    'paymentmethod',
]
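
The two lists above were typed by hand after reading the dtypes. As a rough sketch of how a first guess could be automated, the snippet below uses a toy dataframe (not the churn data) and a hypothetical `MAX_CATEGORIES` threshold to treat low-cardinality numeric columns such as `seniorcitizen` as categorical. Both the toy values and the threshold are assumptions, and the result should still be reviewed by eye.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy data with the same kinds of columns as the notebook (values are illustrative)
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female"],
    "seniorcitizen": [0, 1, 0, 0],
    "tenure": [1, 34, 2, 45],
    "monthlycharges": [29.85, 56.95, 53.85, 42.3],
})

MAX_CATEGORIES = 2  # hypothetical cut-off for "numeric but really categorical"

# Non-numeric columns, plus numeric columns with very few distinct values
categorical_guess = [
    c for c in toy.columns
    if not is_numeric_dtype(toy[c]) or toy[c].nunique() <= MAX_CATEGORIES
]
numerical_guess = [c for c in toy.columns if c not in categorical_guess]

print(categorical_guess, numerical_guess)
```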

In [17]:
#Number of unique categories for all categorical features:
df_full_train[categorical].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

## On feature importance

I've seen people do all sorts of things to assess feature importance. They simply train a Decision Tree or Random Forest model out of the blue and see which variables it finds important. That seems like a plausible starting point, and I've used it in my projects too!

Since ML is largely an engineering discipline, I think this is where the 'art of engineering' lies. 

Today I learned new heuristics that are intuitive and simple in nature, yet give good grounding to build on! 

Speaking of categorical variables and churn rate, one way to assess their importance is to look at the churn rate within each category. Let's use 'gender' as our first example.

In [18]:
churn_male = df_full_train[df_full_train.gender == 'male'].churn.mean()
churn_male

0.2632135306553911

So, we take the subset of male clients and, within that subset, compute the share of males who churned. We see that it is almost the same as our 'global churn rate'. 

Let's see what happens with female clients:

In [19]:
churn_female = df_full_train[df_full_train.gender == 'female'].churn.mean()
churn_female

0.27682403433476394

It would be too tedious to do this by hand for each categorical variable. Another thing to consider is how we are going to compare it to the 'global churn rate', which in a nutshell is the average churn rate for the 'population'. 

The first, natural idea that comes to mind is taking the difference between the sub-category churn rate and the average churn rate. Not only does that make assessing importance much easier, since a single number describes it, but we can also write a small 'for loop' to make it work for every category! 

The second idea that comes to mind when 'comparing' two things is division, and thus we will also assess what is called a 'risk ratio'.

They both give us the same information, but I find it a bit easier to work with ratios. 

Remember these are all heuristics!

If you are working with differences as a heuristic then:

1) if ```global_churn - churn_of_subcategory << 0``` then ```subcategory``` is more likely to churn, because a difference significantly less than zero means that ```churn_of_subcategory``` is bigger than ```global_churn```,

2) if ```global_churn - churn_of_subcategory >> 0``` then ```subcategory``` is less likely to churn, because a difference significantly greater than zero means that ```churn_of_subcategory``` is smaller than ```global_churn```.

Note that we are not talking here about 'marginal differences' like those for the female and male categories, but when we see a 'bump' of, say, more than 5%, there might be some signal in that! 

It doesn't mean that one should prefer 1) over 2) when choosing features, or vice versa. The important point is that there is a significant difference, be it positive or negative. One has to assess the feature as a whole.

Here is the example of ```partner``` category:

In [20]:
churn_partner = df_full_train[df_full_train.partner == 'yes'].churn.mean()
churn_partner

0.20503330866025166

In [21]:
churn_no_partner = df_full_train[df_full_train.partner == 'no'].churn.mean()
churn_no_partner

0.3298090040927694

Let's assess the differences:

In [22]:
global_churn = df_full_train.churn.mean()

global_churn - churn_partner, global_churn - churn_no_partner

(0.06493474245795922, -0.05984095297455855)

Looking at this as a whole, it could indicate that this categorical feature might be important for predicting churn. In some sense, whether a customer has a partner or not could have an effect on the churn rate!

You can imagine it as a machine with levers: some levers won't affect the working of the machine at all no matter how you pull them, while others change its behaviour depending on how much you pull them, or in which direction!
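
The difference heuristic can be written down as a tiny helper. This is only a sketch: `flag_by_difference` is a hypothetical name, the 0.05 threshold is the rough 'bump' size mentioned above rather than a rule, and the rates are the rounded values printed earlier in the notebook.

```python
def flag_by_difference(global_rate, group_rate, threshold=0.05):
    """Label a group by how far its churn rate sits from the global rate.

    The 0.05 threshold is an assumed cut-off, not a rule.
    """
    diff = global_rate - group_rate
    if diff < -threshold:
        return "more likely to churn"
    if diff > threshold:
        return "less likely to churn"
    return "about average"


global_rate = 0.27  # rounded global churn rate from the notebook
group_rates = {
    ("gender", "female"): 0.276824,
    ("gender", "male"): 0.263214,
    ("partner", "no"): 0.329809,
    ("partner", "yes"): 0.205033,
}

labels = {key: flag_by_difference(global_rate, rate) for key, rate in group_rates.items()}
for key, label in labels.items():
    print(key, label)
```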

The same thing applies to assessing importance via the 'risk ratio', which compares ```churn_of_subcategory``` with ```global_churn``` by dividing the first by the second. If the result is equal to 1 or pretty close to it, the subcategory churns at about the same rate as the whole population. If the result is larger than 1, we have a higher risk of churn for that subcategory, and conversely, if the result is lower than 1, that subcategory has a lower risk of churn. 

Again, same as for the differences heuristic, results much less than 1 and/or much more than 1 mean that "the churn rate might depend on which category a customer falls into". 

In [23]:
churn_partner/global_churn, churn_no_partner/global_churn

(0.7594724924338315, 1.2216593879412643)

One final thing to mention is how to interpret these numbers. 

Dividing two numbers gives you (almost) a way to state one as a percentage of the other.

e.g. 20/100 = 0.2, 15/30 = 0.5, 453/453 = 1

When you multiply 0.2, 0.5 or 1 by 100, the result gives you the percentage. 

People tend to loosely say 20% when the result of the division is 0.2, 50% when it is 0.5 and 100% when it is 1. 

---

Oh, but now we have a rate divided by a rate! Almost the same thing applies, but now the result of the division is interpreted like this:

1) if the result is less than 1, then solving for x in ```1-x=result``` tells you by how much the subcategory's rate is lower than the global rate, relative to the global rate,

2) if the result is more than 1, then solving for x in ```1+x=result``` tells you by how much the subcategory's rate is higher than the global rate, relative to the global rate,

3) if the result is 1 then ```x=0```. ;)

Again, once you have x you have to multiply it by 100 to state the ```percentage difference between two rates```, but people do it in their head, and after a while you will get used to it too.
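
As a worked example of that rule, here are the two risk ratios for ```partner``` from cell 23 turned into percentage differences. `percent_difference` is a hypothetical helper name used only for this sketch.

```python
def percent_difference(risk_ratio):
    """Signed percentage by which a group's rate differs from the global rate."""
    return (risk_ratio - 1) * 100


risk_partner = 0.7594724924338315      # churn_partner / global_churn
risk_no_partner = 1.2216593879412643   # churn_no_partner / global_churn

pct_partner = percent_difference(risk_partner)
pct_no_partner = percent_difference(risk_no_partner)

print(round(pct_partner, 1), round(pct_no_partner, 1))
```

So customers with a partner churn about 24% less often than average, and customers without a partner about 22% more often.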


In [24]:
from IPython.display import display

In [25]:
for c in categorical:
    print(c)
    df_group = df_full_train.groupby(c).churn.agg(['mean', 'count'])
    df_group['diff'] = global_churn - df_group['mean']
    df_group['risk'] = df_group['mean'] / global_churn
    display(df_group)
    print()
    print()

gender


gender,mean,count,diff,risk
female,0.276824,2796,-0.006856,1.025396
male,0.263214,2838,0.006755,0.97498




seniorcitizen


seniorcitizen,mean,count,diff,risk
0,0.24227,4722,0.027698,0.897403
1,0.413377,912,-0.143409,1.531208




partner


partner,mean,count,diff,risk
no,0.329809,2932,-0.059841,1.221659
yes,0.205033,2702,0.064935,0.759472




dependents


dependents,mean,count,diff,risk
no,0.31376,3968,-0.043792,1.162212
yes,0.165666,1666,0.104302,0.613651




phoneservice


phoneservice,mean,count,diff,risk
no,0.241316,547,0.028652,0.89387
yes,0.273049,5087,-0.003081,1.011412




multiplelines


multiplelines,mean,count,diff,risk
no,0.257407,2700,0.012561,0.953474
no_phone_service,0.241316,547,0.028652,0.89387
yes,0.290742,2387,-0.020773,1.076948




internetservice


internetservice,mean,count,diff,risk
dsl,0.192347,1934,0.077621,0.712482
fiber_optic,0.425171,2479,-0.155203,1.574895
no,0.077805,1221,0.192163,0.288201




onlinesecurity


onlinesecurity,mean,count,diff,risk
no,0.420921,2801,-0.150953,1.559152
no_internet_service,0.077805,1221,0.192163,0.288201
yes,0.153226,1612,0.116742,0.56757




onlinebackup


onlinebackup,mean,count,diff,risk
no,0.404323,2498,-0.134355,1.497672
no_internet_service,0.077805,1221,0.192163,0.288201
yes,0.217232,1915,0.052736,0.80466




deviceprotection


deviceprotection,mean,count,diff,risk
no,0.395875,2473,-0.125907,1.466379
no_internet_service,0.077805,1221,0.192163,0.288201
yes,0.230412,1940,0.039556,0.85348




techsupport


techsupport,mean,count,diff,risk
no,0.418914,2781,-0.148946,1.551717
no_internet_service,0.077805,1221,0.192163,0.288201
yes,0.159926,1632,0.110042,0.59239




streamingtv


streamingtv,mean,count,diff,risk
no,0.342832,2246,-0.072864,1.269897
no_internet_service,0.077805,1221,0.192163,0.288201
yes,0.302723,2167,-0.032755,1.121328




streamingmovies


streamingmovies,mean,count,diff,risk
no,0.338906,2213,-0.068938,1.255358
no_internet_service,0.077805,1221,0.192163,0.288201
yes,0.307273,2200,-0.037305,1.138182




contract


contract,mean,count,diff,risk
month-to-month,0.431701,3104,-0.161733,1.599082
one_year,0.120573,1186,0.149395,0.446621
two_year,0.028274,1344,0.241694,0.10473




paperlessbilling


paperlessbilling,mean,count,diff,risk
no,0.172071,2313,0.097897,0.637375
yes,0.338151,3321,-0.068183,1.25256




paymentmethod


paymentmethod,mean,count,diff,risk
bank_transfer_(automatic),0.168171,1219,0.101797,0.622928
credit_card_(automatic),0.164339,1217,0.10563,0.608733
electronic_check,0.45589,1893,-0.185922,1.688682
mailed_check,0.19387,1305,0.076098,0.718121






Well, you can see that there are a lot of potential features, and some of them look more important based on risk ratio or difference, but which of them are more significant than the others? 
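
One crude way to put the per-feature tables side by side is to stack them into a single table and look at how far apart the risk ratios are within each feature. The sketch below does this on a toy dataframe, not the churn data. The 'risk spread' (max minus min risk per feature) is a hypothetical summary introduced only for this illustration, not something from the course.

```python
import pandas as pd

# Toy data: 'contract' separates churners well, 'gender' much less so
toy = pd.DataFrame({
    "contract": ["monthly"] * 4 + ["yearly"] * 4,
    "gender": ["f", "m", "f", "m", "f", "m", "f", "m"],
    "churn": [1, 1, 1, 0, 0, 0, 0, 0],
})
global_rate = toy["churn"].mean()

parts = []
for col in ["contract", "gender"]:
    group = toy.groupby(col)["churn"].agg(["mean", "count"])
    group["risk"] = group["mean"] / global_rate
    group["feature"] = col
    parts.append(group.rename_axis("category").reset_index())

# One long table with a row per (feature, category)
summary = pd.concat(parts, ignore_index=True)

# Hypothetical summary: distance between the riskiest and safest category
spread = (
    summary.groupby("feature")["risk"]
    .agg(lambda r: r.max() - r.min())
    .sort_values(ascending=False)
)
print(spread)
```

This still ignores how many customers fall into each category, which is one reason to look for a measure such as mutual information.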

## Mutual Information

---


The high-level idea of mutual information is: a measure of the mutual dependence between two variables, or the "amount of information" obtained about one random variable by observing the other random variable. 

My personal taste is to dive into such a concept more deeply and understand every bit of the formula, but I've already done my first course in Information Theory, and I think there is nothing special about the formula itself. Somebody had to find a 'model' (function) that matches the idea of increasing/decreasing information, and the ideal one turned out to be the formula for entropy.  


I'm not going to spit out the definition of information, but what I think is necessary for understanding this concept can be summed up in the next question.

Which will give you more information, and from which book will you learn more: 

a) the book with random letters scattered all over the place with some words appearing out of randomness now and then, or

b) the book with more order (ex. Fight Club by Chuck Palahniuk) where you, ideally speaking, understand each word, punctuation etc.

I think the second one will give you more information and, theoretically speaking, more information will give you better knowledge than random stuff. Look at the static on a TV. Random noise. I never learned anything from that. But as you start tuning your TV to get a better picture, the noise starts to decrease, the picture starts to appear, and at some point you get what is showing on that channel. At some cut-off you understand what is being shown, but to understand all the details you should, ideally, tune to a perfect picture!

That is, loosely speaking, the idea of entropy and information, since the two are related in that kind of 'equivalence relation' sense. 

The more you decrease the entropy ("disorder"), the more information you gain. The more information you have, the more knowledge you obtain.

Draw 30 random points, and then draw 30 points arranged so that they look like a circle. Give both to somebody and ask what they see. You 'rearranged' those points so that they look like a circle. 
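
To put a number on 'disorder', here is a minimal sketch of Shannon entropy for a discrete distribution, measured in bits. The coin probabilities are made-up examples, and `entropy_bits` is a name chosen for this sketch.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair_coin = entropy_bits([0.5, 0.5])    # most 'disordered' two-outcome case
biased_coin = entropy_bits([0.9, 0.1])  # more predictable, so lower entropy
certain = entropy_bits([1.0])           # no uncertainty at all

print(fair_coin, round(biased_coin, 3), certain)
```

The more predictable the outcome, the lower the entropy, which is the 'tuning the TV' picture in numeric form.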

What about 'mutual information'? Think of it as reading a book to understand something. Say you have to understand Tolkien's world. Which will give you more information (knowledge) about it:

a) Reading The Fellowship of The Ring, or

b) Reading Harry Potter?

I am not going to say that b) will give you NO mutual information/knowledge whatsoever. You will see in the next code snippets that some features give some amount of mutual information even if it is negligible! It may not be tightly related, but some stuff from Harry Potter will be applicable to Tolkien's world! What I can guarantee you is that ignoring the Harry Potter books will make you a better fan of Tolkien. ;)

Nobody said that you shouldn't test mutual information between other pairs of variables, but we started with the idea of testing each variable against the target variable and ranking them to see which ones are the most important. 

Later on, we will introduce SHAP values!

In [26]:
from sklearn.metrics import mutual_info_score

In [27]:
def calculate_mi(series):
    return mutual_info_score(series, df_full_train.churn)

df_mi = df_full_train[categorical].apply(calculate_mi) #For each column apply calculate_mi function
df_mi = df_mi.sort_values(ascending=False).to_frame(name='MI')#Sort it by descending order


display(df_mi.head())
display(df_mi.tail())

Unnamed: 0,MI
contract,0.09832
onlinesecurity,0.063085
techsupport,0.061032
internetservice,0.055868
onlinebackup,0.046923


Unnamed: 0,MI
partner,0.009968
seniorcitizen,0.00941
multiplelines,0.000857
phoneservice,0.000229
gender,0.000117


## Decision Tree

As I've mentioned, some people use Decision Trees or Random Forests to check whether their intuition about feature engineering is right, sometimes while doing feature selection and sometimes out of the blue. It never hurts to try different methods and then pick the features that keep showing up across those methods. 

Let's see what a plain Decision Tree (without hyperparameter tuning) can say about the importance of features. Decision Trees rank splits by information gain (or by Gini impurity!), and in that sense they are closely related to mutual information. They are also relatively fast to train. Also, as with mutual information, we use it on the 'full train' dataset, against the churn column of that same dataset. We are not trying to leak anything! 

In [28]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

RANDOM_STATE = 42

dt = DecisionTreeClassifier(random_state = RANDOM_STATE, criterion = "entropy")

In [29]:
#For modeling using scikit-learn, all the variables should be numeric, so we will have to change the labels.
dt_full = df_full_train[categorical]
le = LabelEncoder()
le_df = dt_full.apply(le.fit_transform)
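
As a side note, the same kind of integer coding can also be done for all columns at once. The sketch below uses scikit-learn's `OrdinalEncoder` on a toy dataframe (not the churn data). It is offered as an alternative under that assumption, not as what the course does. Either way, the integer codes follow alphabetical order and carry no real meaning.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical columns (values are illustrative)
toy = pd.DataFrame({
    "contract": ["month-to-month", "two_year", "one_year", "month-to-month"],
    "partner": ["yes", "no", "no", "yes"],
})

encoder = OrdinalEncoder()
encoded = pd.DataFrame(encoder.fit_transform(toy), columns=toy.columns)

# The encoder remembers the category order it used for each column
print(encoder.categories_)
print(encoded)
```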

In [30]:
dt.fit(le_df, df_full_train.churn)

In [31]:
#Obtaining feature importances
tree_features = pd.DataFrame(data = dt.feature_importances_,
                             index = dt_full.columns,
                             columns = ['Importance'])

In [32]:
tree_features.sort_values(by = 'Importance', ascending = False).head()

Unnamed: 0,Importance
contract,0.219455
paymentmethod,0.10833
partner,0.065569
multiplelines,0.063916
gender,0.063827


In [33]:
df_mi.head()

Unnamed: 0,MI
contract,0.09832
onlinesecurity,0.063085
techsupport,0.061032
internetservice,0.055868
onlinebackup,0.046923


Well, it seems that ```contract``` is indeed our best feature, but most of the rest (```partner```, ```multiplelines```, ```gender```) are ones that MI ranked among the "worst ones". I suspect that we might be overfitting here and should tune the Decision Tree, so let's give it a try using GridSearchCV!

In [34]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

#Defining hyperparameters for tuning a decision tree
tree_params = {'max_depth': range(2, 11), 
               'min_samples_leaf' : range(2,11)}

#Defining 5-fold stratified cross-validation scheme
skf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = RANDOM_STATE)

In [35]:
#Using GridSearchCV to try every combination in tree_params and pick the one with the best cross-validated accuracy
tree_grid = GridSearchCV(estimator = dt, scoring = 'accuracy',  param_grid = tree_params, cv = skf, n_jobs = -1)

In [36]:
%%time
tree_grid.fit(le_df, df_full_train.churn)

CPU times: user 564 ms, sys: 199 ms, total: 764 ms
Wall time: 3.43 s
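
The notebook does not print which hyperparameters won. As a sketch of how a fitted grid search can be inspected, here is a self-contained example on synthetic data from `make_classification`. The data, the small grid and the seeds are assumptions for illustration, so the numbers it prints say nothing about the churn model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features and the target
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4], "min_samples_leaf": [2, 5]},
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)

# Best combination and its mean cross-validated accuracy
print(grid.best_params_, round(grid.best_score_, 3))
```

On the notebook's own object the same attributes would be `tree_grid.best_params_` and `tree_grid.best_score_`.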


In [40]:
#Saving all the features
tree_features = pd.DataFrame(data = tree_grid.best_estimator_.feature_importances_,
                             index = dt_full.columns,
                             columns = ['Importance according to DTs'])

#Showing the top feature importances
tree_features.sort_values(by = "Importance according to DTs", ascending = False).head()

Unnamed: 0,Importance according to DTs
contract,0.687786
onlinesecurity,0.117954
internetservice,0.091018
streamingmovies,0.0424
onlinebackup,0.027831


In [43]:
df_mi.head()

Unnamed: 0,MI
contract,0.09832
onlinesecurity,0.063085
techsupport,0.061032
internetservice,0.055868
onlinebackup,0.046923


It seems that the tuned tree with the best score captured pretty much what the MI score was capturing. It is debatable whether ```techsupport``` is actually more crucial than ```streamingmovies```. In my opinion, where I live, the choice of movies and TV series offered by ISPs is pretty poor, and many people I know don't watch TV channels as they used to. Having stable Internet and cheap streaming platforms offering top-notch movies and series is the way to go these days. Tech support is also a crucial factor when choosing an ISP, but since we only have historical data from past months, we can't get the full picture of whether it is faulty at all, or was during the months of data collection.
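
The agreement between the two rankings can be checked mechanically. The lists below are copied from the two top-5 tables above, and the comparison itself is just a small sketch.

```python
mi_top5 = ["contract", "onlinesecurity", "techsupport", "internetservice", "onlinebackup"]
tree_top5 = ["contract", "onlinesecurity", "internetservice", "streamingmovies", "onlinebackup"]

# Features both methods put in their top five, in MI order
shared = [f for f in mi_top5 if f in tree_top5]
only_mi = [f for f in mi_top5 if f not in tree_top5]
only_tree = [f for f in tree_top5 if f not in mi_top5]

print(shared, only_mi, only_tree)
```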