# How to extract insights from data


The job of a data scientist is typically to extract insights from the data and, based on the insights, come up with ideas to improve the product.
The standard approach is:

1.	Collect a dataset including your target variable (label) and variables that you think might be related

2.	Build a model trying to predict the label

3.	Look into the model and figure out how each variable impacts the output

4.	Based on that, come up with product recommendations (aka the famous actionable insights you see in pretty much any DS job posting)

## Models and insights

The most effective ways to extract insights from a model are:

1.	Build a logistic or linear regression for, respectively, binary and continuous outputs, and look at the coefficients

2.	Build a decision tree and look at its structure

3.	Build any model and look at the model partial dependence plots

4.	Build RuleFit and look at the dummy features it created

Obviously, model insights are meaningful only if the model is predicting well. If a model’s predictive power is very bad, then looking at its structure is totally meaningless. However, checking and optimizing model performance is beyond the scope of this section.
We’ll now look at each of those techniques in detail.


# Data

Let’s assume you work in the marketing department and your product manager has asked you to get back to her with some project ideas on how to improve the email click-through rate. That is, the company has been sending marketing emails and wants to increase the percentage of people who click on the company link inside the email.
You have a dataset like the one below. You can also download it from here.


*	email_id : the ID of the email that was sent. It is unique per email
*	email_text : two different versions of the email have been sent: one has “long text” (i.e. 4 paragraphs) and one has “short text” (just 2 paragraphs)
*	email_version : some emails were “personalized” (i.e. they opened with the name of the user receiving the email, such as “Hi John,”), while some emails were “generic” (they opened with just “Hi,”)
*	hour : the local time at which the email was sent
*	weekday : the weekday on which the email was sent
*	user_country : the country where the user receiving the email is based. It comes from the user’s IP address when they created the account
*	user_past_purchases : how many items the user receiving the email bought in the past
*	clicked : whether the user clicked on the link inside the email. This is our label and, most importantly, the goal of the project is to increase it


## Regressions and Coefficients


We will focus here on logistic regression given that the label we are trying to predict (“clicked”) is binary. However, the overall approach if you were dealing with a linear regression would be similar. After all, a logistic regression can be seen as a linear method with a particular link function (logit) to constrain the output between 0 and 1, so that it can be used for binary classification problems.


In [1]:
import pandas
import statsmodels.api as sm
pandas.set_option('display.max_columns', 10)
pandas.set_option('display.width', 350)

In [2]:
#Read from google drive. This is the same dataset described in the previous section
data = pandas.read_csv('https://drive.google.com/uc?export=download&id=1PXjbqSMu__d_ppEv92i_Gnx3kKgfvhFk')


In [3]:
data.head()

   email_id   email_text email_version  hour    weekday user_country  user_past_purchases  clicked
0         8  short_email        generic     9   Thursday            US                    3        0
1        33   long_email   personalized     6     Monday            US                    0        0
2        46  short_email        generic    14    Tuesday            US                    3        0
3        49   long_email   personalized    11   Thursday            US                   10        0
4        65  short_email        generic     8  Wednesday            UK                    3        0


In [4]:
#Before building the regression, we need to know which ones are the reference levels for the categorical variables
#only keep categorical variables
data_categorical = data.select_dtypes(['object']).astype("category") 
#find reference level, i.e. the first level
print(data_categorical.apply(lambda x: x.cat.categories[0]))



email_text       long_email
email_version       generic
weekday              Friday
user_country             ES
dtype: object




In [5]:
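#make dummy variables from categorical ones. Using one-hot encoding and drop_first=True
#dtype=int keeps the dummies numeric; recent pandas versions default to bool, which statsmodels can reject when mixed with numeric columns
data = pandas.get_dummies(data, drop_first=True, dtype=int)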
  
#add intercept
data['intercept'] = 1
#drop the label
train_cols = data.drop('clicked', axis=1)

In [6]:
#Build Logistic Regression
logit = sm.Logit(data['clicked'], train_cols)
output = logit.fit()



Optimization terminated successfully.
         Current function value: 0.092770
         Iterations 9


In [8]:
output_table = pandas.DataFrame(dict(coefficients = output.params, SE = output.bse, z = output.tvalues, p_values = output.pvalues))

In [9]:
#get coefficients and pvalues
print(output_table)
                           

                            coefficients            SE          z       p_values
email_id                   -3.848609e-08  7.780379e-08  -0.494656   6.208432e-01
hour                        1.670684e-02  5.005879e-03   3.337445   8.455247e-04
user_past_purchases         1.878107e-01  5.725787e-03  32.800855  5.725039e-236
email_text_short_email      2.793085e-01  4.530477e-02   6.165101   7.043829e-10
email_version_personalized  6.387251e-01  4.691461e-02  13.614631   3.277989e-42
weekday_Monday              5.410326e-01  9.341014e-02   5.792011   6.954864e-09
weekday_Saturday            2.828638e-01  9.777629e-02   2.892969   3.816190e-03
weekday_Sunday              1.836278e-01  1.001194e-01   1.834088   6.664099e-02
weekday_Thursday            6.254040e-01  9.233999e-02   6.772839   1.262790e-11
weekday_Tuesday             6.162222e-01  9.237223e-02   6.671077   2.539336e-11
weekday_Wednesday           7.554637e-01  9.084515e-02   8.315950   9.102053e-17
user_country_FR            -

In [10]:
#only keep significant variables and order results by coefficient value
print(output_table.loc[output_table['p_values'] < 0.05].sort_values("coefficients", ascending=False))
 


                            coefficients        SE          z       p_values
user_country_UK                 1.155255  0.122060   9.464618   2.946372e-21
user_country_US                 1.141360  0.115963   9.842487   7.386228e-23
weekday_Wednesday               0.755464  0.090845   8.315950   9.102053e-17
email_version_personalized      0.638725  0.046915  13.614631   3.277989e-42
weekday_Thursday                0.625404  0.092340   6.772839   1.262790e-11
weekday_Tuesday                 0.616222  0.092372   6.671077   2.539336e-11
weekday_Monday                  0.541033  0.093410   5.792011   6.954864e-09
weekday_Saturday                0.282864  0.097776   2.892969   3.816190e-03
email_text_short_email          0.279308  0.045305   6.165101   7.043829e-10
user_past_purchases             0.187811  0.005726  32.800855  5.725039e-236
hour                            0.016707  0.005006   3.337445   8.455247e-04
intercept                      -6.880922  0.156067 -44.089646   0.000000e+00
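
Logistic regression coefficients live on the log-odds scale, which is hard to explain to a product audience. A common way to make them more readable is to exponentiate them, which gives the multiplicative change in the odds of clicking. The sketch below does this for a few coefficients copied by hand from the table above; the dictionary name and the selection of variables are just for illustration.

```python
import math

# Coefficients copied from the regression table above (log-odds scale)
coefficients = {
    "user_country_UK": 1.155255,
    "weekday_Wednesday": 0.755464,
    "email_version_personalized": 0.638725,
    "email_text_short_email": 0.279308,
    "user_past_purchases": 0.187811,
}

# exp(coefficient) is the factor by which the odds of clicking are multiplied
# when a dummy goes from 0 to 1 (or a numeric variable increases by 1),
# holding everything else fixed
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.2f}")
```

Keep in mind that each ratio is relative to the reference level found earlier (for instance Friday for weekday and generic for email_version).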

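For some continuous variables, such as hour, the coefficient will either not be significant or, if it is, be misleading: a coefficient > 0 says that the larger the value the better, which does not have to hold over the whole range. To solve this, you should manually create segments (i.e. indicator variables) before building the model. One segment could be night time, one morning to noon, etc.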
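
A minimal sketch of that segmentation step, assuming hour is an integer between 1 and 24. The toy data, the bin edges and the segment names are hypothetical choices for illustration, not values derived from the dataset.

```python
import pandas

# Toy frame standing in for the real data, which has an integer 'hour' column
toy = pandas.DataFrame({"hour": [2, 9, 11, 14, 20, 23]})

# Hypothetical segment boundaries: (0, 6], (6, 12], (12, 18], (18, 24]
toy["hour_segment"] = pandas.cut(
    toy["hour"],
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"],
)

# One indicator column per segment; drop_first makes 'night' the reference level
hour_dummies = pandas.get_dummies(
    toy["hour_segment"], prefix="hour", drop_first=True, dtype=int
)
print(hour_dummies)
```

The indicator columns would then replace the raw hour column in the regression, so each segment gets its own coefficient relative to the reference segment.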

○	More importantly, note the tiny coefficient for email_id compared to the other ones. A tiny coefficient does not by itself mean that the variable is irrelevant; relevance is judged by the p-value (0.62 here, so email_id happens to be non-significant, but not because of the coefficient size). The tiny coefficient simply reflects the fact that the scale of email_id is way larger than that of the other variables. Among the other variables, the largest value is 24, for hour. The max value of email_id is 100K! So the low coefficient is meant to balance the different scale, otherwise email_id would entirely drive the regression output.

One more thing: comparing the scale of the intercept with the coefficients multiplied by the possible values of their variables is useful to get a sense of how much you can move the output.

■	For example, if I send emails on Wednesday, that term contributes about 0.76 (i.e. the 0.755 coefficient times the value of the variable, which would be 1). That is pretty high relative to the -6.88 intercept, so opportunities for meaningful improvements are there. Imagine my intercept were -1000 and the Wednesday coefficient were the same. Then optimizing the day would be almost irrelevant from a practical standpoint.
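
To see what this comparison means in terms of click probability, the sketch below pushes the linear score through the inverse of the logit link. It assumes a reference profile in which every other variable is at 0 or at its reference level, so that only the intercept and the Wednesday coefficient matter; that profile is an illustrative assumption, not a real user segment. The -1000 intercept is the hypothetical scenario mentioned above.

```python
import math

def sigmoid(score):
    """Numerically stable inverse of the logit link."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    exp_score = math.exp(score)
    return exp_score / (1.0 + exp_score)

INTERCEPT = -6.880922   # from the regression table
WEDNESDAY = 0.755464    # from the regression table

# Predicted click probability for the reference profile, then on Wednesday
p_reference = sigmoid(INTERCEPT)
p_wednesday = sigmoid(INTERCEPT + WEDNESDAY)
print(f"reference: {p_reference:.4%}, Wednesday: {p_wednesday:.4%}")

# Hypothetical scenario: same Wednesday effect, but an intercept of -1000
p_tiny_reference = sigmoid(-1000.0)
p_tiny_wednesday = sigmoid(-1000.0 + WEDNESDAY)
print(f"with intercept -1000: {p_tiny_reference} vs {p_tiny_wednesday}")
```

With the actual intercept the Wednesday term roughly doubles a small probability, while with the hypothetical -1000 intercept both probabilities are indistinguishable from zero.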

✓ The absolute value of a coefficient is often used to quickly estimate variable importance. However, that depends on the variable scale more than anything else. You could normalize variables, so they are all on the same scale. But that’s rarely a good idea if your goal is presenting to product people. It is hard to get a product manager excited by saying: “If we increase variable X by one standard deviation, we could achieve this and that”.
After normalization, a change of 1 in the standardized variable z is equivalent to a change of one standard deviation in the original variable, i.e. x - x0 = sigma.
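
The relationship between the raw and the standardized coefficient can be written out as a short derivation. The symbols $\mu$ (mean of $x$), $\sigma$ (standard deviation of $x$), $\beta_x$ and $\beta_z$ are notation introduced here for illustration.

```latex
z = \frac{x - \mu}{\sigma}
\quad\Longrightarrow\quad
\beta_0 + \beta_x x = (\beta_0 + \beta_x \mu) + (\sigma \beta_x)\, z ,
\qquad\text{so}\qquad
\beta_z = \sigma \beta_x
```

In other words, standardizing multiplies the coefficient by the standard deviation of the variable and moves the rest into the intercept; the fitted probabilities stay the same, only the unit in which the effect is expressed changes.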

## Decision Trees


Unlike regressions, Decision Trees (DT) are very good when the relationship between variables and outcome is non-linear. The best possible example is the behavior of the variable “hour” in the email click data set. While regressions would fail to find the non-linearity, trees would easily identify that the probability of a click is high in the morning/early afternoon and low outside this segment. Also, trees are really good at showing how variables interact with each other, by automatically creating segments that include multiple variables. We will see all this in detail.

Perhaps even more importantly, trees can be very useful in practice when it comes to building metrics, which is one of the most important tasks of a data scientist. The vast majority of metrics are based on the idea of finding one hard threshold that separates “good” from “bad” and then trying to increase the percentage of users falling in the good bucket. For instance:

●	Early FB growth metric: users with at least X friends in Y days

●	Engagement: users performing at least X actions per day

●	Response rate: proportion of questions with at least 1 answer within X hours

●	Conversion rate: proportion of users who convert within X time since their first visit

And so on. It is hard to think of a single metric that doesn’t use the template above. And to find those X and Y thresholds in the metrics above, trees can be very useful. Going back to the email click dataset, we know by now that the higher the number of purchases, the more engaged the customer is (fairly obviously). But is there a threshold based on which we can define customers as “power users” vs “non-power users”? If so, we could then create a metric like:

●	power user: user with > X purchases

And then build a team whose goal is to increase the percentage of power users within the customer base. There is hardly anything more effective than this approach. And really, the famous FB growth metric of 7 friends in 10 days is built on exactly this idea.
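
As a minimal sketch of how a tree can suggest such a threshold, the snippet below fits a one-split tree (a "stump") on a single variable and reads back the cut point. The data is synthetic and built by hand so that users with more than 3 purchases click more often; it is not the email dataset, and the variable names are placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 10 users for each purchase count from 0 to 7.
# Users with <= 3 purchases click 1 time out of 10, the others 6 out of 10.
purchases, clicked = [], []
for n_purchases in range(8):
    n_clicks = 1 if n_purchases <= 3 else 6
    for user in range(10):
        purchases.append(n_purchases)
        clicked.append(1 if user < n_clicks else 0)

X = np.array(purchases).reshape(-1, 1)
y = np.array(clicked)

# A tree of depth 1 can only make one split, so it returns the single
# threshold that best separates clickers from non-clickers
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y)

threshold = stump.tree_.threshold[0]
print("power user: user with >", threshold, "purchases")
```

On real data the threshold is a starting point for discussion rather than a final answer; it is worth checking that it is stable across time periods before building a team goal on it.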

Let’s now build the tree and see in practice how to use its output for insights using the same email dataset.


## Build Decision Trees


In [13]:
pip install graphviz

Collecting graphviz
  Downloading https://files.pythonhosted.org/packages/83/cc/c62100906d30f95d46451c15eb407da7db201e30f42008f3643945910373/graphviz-0.14-py2.py3-none-any.whl
Installing collected packages: graphviz
Successfully installed graphviz-0.14
Note: you may need to restart the kernel to use updated packages.


In [1]:
import pandas 
import graphviz
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from graphviz import Source
  
#Read from google drive. Always the same dataset.
data = pandas.read_csv('https://drive.google.com/uc?export=download&id=1PXjbqSMu__d_ppEv92i_Gnx3kKgfvhFk')
  
#prepare the data for the model by creating dummy vars and removing the label
data_dummy = pandas.get_dummies(data, drop_first=True)
train_cols = data_dummy.drop('clicked', axis=1)
  
#build the tree
tree=DecisionTreeClassifier(
    #set max tree depth at 4. Bigger than that it just becomes too messy
    max_depth=4,
    #reweight the classes given that they are unbalanced. With "balanced", both classes carry the same total weight, which makes it easier to look at the tree output
    class_weight="balanced",
    #only split if it's worthwhile. The default value of 0 means always split whenever impurity decreases at all, which might lead to irrelevant splits
    min_impurity_decrease = 0.001
    )
tree.fit(train_cols,data_dummy['clicked'])
  



DecisionTreeClassifier(class_weight='balanced', criterion='gini', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.001, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [2]:
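#visualize it
#export_graphviz only writes the .dot file; rendering it needs the Graphviz "dot" executable,
#which "pip install graphviz" does not install (that package is only the Python wrapper)
export_graphviz(tree, out_file="tree.dot", feature_names=list(train_cols.columns), proportion=True, rotate=True)
s = Source.from_file("tree.dot")
s.view()

ExecutableNotFound: failed to execute ['dot', '-Tpdf', '-O', 'tree.dot'], make sure the Graphviz executables are on your systems' PATH

This error means the Graphviz system package itself is missing. Install it with your operating system's package manager, make sure "dot" is on the PATH, and re-run the cell to render the tree discussed below.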
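
If the Graphviz executables are not available, one possible workaround is scikit-learn's export_text, which prints the same splits as plain text and needs nothing beyond scikit-learn itself. The sketch below uses a tiny hand-made frame rather than the email dataset, so the names toy and toy_tree and the values in it are purely illustrative.

```python
import pandas
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny hand-made stand-in for the dummy-encoded email data
toy = pandas.DataFrame({
    "user_past_purchases":        [0, 0, 1, 2, 3, 4, 5, 6, 8, 10],
    "email_version_personalized": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "clicked":                    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
features = toy.drop("clicked", axis=1)

toy_tree = DecisionTreeClassifier(max_depth=2, class_weight="balanced", random_state=0)
toy_tree.fit(features, toy["clicked"])

# Text rendering of the fitted tree: one line per split and per leaf
tree_text = export_text(toy_tree, feature_names=list(features.columns))
print(tree_text)
```

The text form shows the split rules and predicted classes, but not the sample proportions and Gini values that the graphical rendering displays in each node.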

Note that a tree can split on the same feature twice in a row.

The way to interpret the output is:

●	Each block is a tree node. The nodes all the way to the right are called leaves and the final model classification depends on the leaf where an event ends up

●	Within a node you have 4 values:
1.	The split. This is the split that leads to the two nodes to the right
2.	The Gini index of that node. It represents the purity of the node. With two classes, 0.5 means random guess, so it is the worst possible value. 0 means perfect classification, so it is the best possible value. Look for nodes with values as close as possible to 0
3.	Samples: the proportion of events in that node. The higher the better: it means the node is very important because it captures many people
4.	Value: proportion of class 0 and class 1 events. The sum of those two values is always 1. Because the tree was fit with class_weight="balanced", these are proportions after reweighting the classes, not raw click rates. Similarly to Gini, it gives an idea of how pure the node is. Ideally, you want one of the two values to be close to 1 and the other to be close to 0. That’s when Gini will be small. If the first of those two values is higher than 0.5, the node is labeled as class 0. Otherwise it is a class 1 node

●	So our starting point is the first node to the left. There we have 100% of samples (obviously, we haven’t even started splitting), the proportion between classes is a perfect 50/50 (the classes were balanced via class_weight when building the model), and, therefore, the Gini is 0.5, as bad as it can possibly be. From this node, the first split is on user_past_purchases <= 3.5. The way you read that is:
1.	If a user has <= 3.5 purchases, follow the True arrow. So you end up in the node up and right. That node has a Gini of 0.44 (so we improved), 52.7% of events (meaning 52.7% of people in our dataset have <= 3.5 purchases), and within that node the class-balanced proportions are 67.4% no click versus 32.6% click. So this is a class 0 node. That split helped us identify a segment with a lower proportion of clicks compared to the starting point of 50/50.
2.	If you go right and down, you are following the False arrow. So you end up in a node representing people with > 3.5 purchases (i.e. it is not true that they have <= 3.5 purchases). In this case Gini is 0.474, samples is 47.3% and the class 0/class 1 proportions are, respectively, 38.6% and 61.4%. So here we found a segment with a significantly higher percentage of people who click

●	Let’s now move one more step to the right. Let’s consider the node up/right, the one with 0.44, 52.7% and .674/.326 values inside. This is the starting point for the new split, which is user_past_purchases <= 0.5.
1.	As before, up means true, down means false. So if we go up, we find users who have <= 3.5 purchases AND <= 0.5 purchases. Since I am splitting on the same variable twice, this is the same as simply saying <= 0.5 purchases, i.e. zero purchases. We only have 13.9% of total users there and only 0.7% of them click! This is a really interesting node because it is so pure. It almost achieves perfect classification, as you can see from the very low Gini.
2.	If we go down, we find users who have <= 3.5 purchases AND > 0.5 purchases. Basically, between 1 and 3 purchases. We have 38.9% of total users there and almost 40% of them click. The percentage of users who click is higher than in the parent node. This means that by removing users with 0 purchases, we managed to find a better segment for our label

●	As you keep going right this way, you get to the leaves, which are the final classification of the tree. For instance, let’s take the leaf all the way up/right. Those are users with:
1.	<= 3.5 purchases AND
2.	More than 0.5 purchases (i.e. false that they have <= 0.5 purchases) AND
3.	email_version_personalized <= 0.5 (meaning the email is not personalized) AND
4.	user_country_FR <= 0.5 (meaning the user is not from France)
These users represent 17.5% of total users. Out of those users, 68.9% don’t click and the remaining 31.1% click. So this is a class 0 leaf. If an event ends up there, we predict that the user will not click.
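
The Gini values quoted in the walkthrough can be checked by hand from the class proportions of each node, using the standard definition of Gini impurity as 1 minus the sum of the squared class proportions. The helper name gini below is just for illustration.

```python
def gini(class_proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions)

# Class proportions as read from the nodes described above
root = gini([0.5, 0.5])               # starting node, perfectly mixed
low_purchases = gini([0.674, 0.326])  # user_past_purchases <= 3.5
high_purchases = gini([0.386, 0.614]) # user_past_purchases > 3.5

print(round(root, 3), round(low_purchases, 3), round(high_purchases, 3))
```

Both child nodes come out below the 0.5 of the starting node, which is what "we improved" means in the description of the first split.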


## Product Insights

●	By far the most important insight from a tree is given by the first split. This model is telling us that the most important segmentation is whether or not a user bought more than 3 times in the past. Increasing the proportion of users with more than 3 purchases would be a great company-wide yearly goal.
We don’t have the timestamps of the purchases here, but if we did, we could check whether the tree also splits on them and create a metric like: percentage of users with more than 3 purchases within X time

●	If a user has zero purchases, the tree doesn’t split on any other variable. That’s a leaf. This means that everything else becomes irrelevant if a user has never bought. Changing the time of day, weekday, subject, etc. makes no difference there. To make these users click, the change will need to be much more dramatic than just changing the email template or when to send it.
The next step should be crafting a totally different email with a different message just for these users. These are the hard users to win, but they are also where the most value is. They already came to the site and gave their email address, so they have some sort of intent. But they never bought anything. Understanding why that happened could unlock a lot of value, and it is far easier to get these people to buy than to acquire new users and then try to make them convert

●	Country UK/US and email_personalized = TRUE always lead to a higher proportion of clicks, in both R and Python. R also splits on weekday with a clear weekend/weekday split, i.e. one side is Friday/Saturday/Sunday and the other side is the other days. Note that R can split on multiple levels at the same time, while Python will look at each dummy variable independently. In practical terms, this means that Python will need larger trees to extract that information. For instance, separating those 3 days would require going down 3 levels: first splitting on, say, Friday, then Saturday, and then Sunday.
Also, a split on multiple levels has more power than a split on just one level, i.e. it can separate the classes better. So always expect categorical variables with many levels to look more important in R than in Python

●	There is no split on any other variable besides those above. This is because we built rather small trees, so the tree only focused on macro-information. Besides the fact that it is really hard to visualize large trees, splits in large trees are not that informative either.
If I had a split at the bottom on the variable hour, this would be conditional on all the previous splits, like purchases < X AND email_personalized = Y AND purchases > Z AND country = J, etc. So it wouldn’t tell me in absolute terms when the best time to send an email is, but only for that specific segment. And given how specific that segment would be, there would be few events in that node, so overall it would not be particularly important
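
Turning the first split into a metric can be sketched as follows. The frame below is a small hand-made example, not the email dataset, and the names PURCHASE_THRESHOLD and segment are hypothetical; the threshold of 3 mirrors the user_past_purchases <= 3.5 split discussed above.

```python
import pandas

# Hand-made example with the two columns the metric needs
toy = pandas.DataFrame({
    "user_past_purchases": [0, 0, 1, 2, 3, 4, 5, 7, 9, 12],
    "clicked":             [0, 0, 0, 0, 1, 0, 1, 1, 0, 1],
})

# Threshold suggested by the first split of the tree (<= 3.5 vs > 3.5)
PURCHASE_THRESHOLD = 3
is_power_user = toy["user_past_purchases"] > PURCHASE_THRESHOLD
toy["segment"] = is_power_user.map({True: "power", False: "non-power"})

# The metric itself: share of users in the "good" bucket
share_power_users = is_power_user.mean()

# Sanity check that the bucket actually relates to the label
click_rate_by_segment = toy.groupby("segment")["clicked"].mean()

print("share of power users:", share_power_users)
print(click_rate_by_segment)
```

Tracking the share of power users over time is what the team goal would be measured against, while the click rate by segment is the check that the bucket still separates engaged from non-engaged users.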
