# 1.  Introduction to Predictive Analytics in Python

In this chapter, you will learn how to build a logistic regression model with meaningful variables. You will also learn how to use this model to make predictions and how to present it and its performance to business stakeholders.


## Building Logistic Regression Models

Learn the basics of logistic regression: how can you predict a binary target with continuous variables, and how should you interpret this model and use it to make predictions for new examples?

## Exploring the base table

Before diving into model building, it is important to understand the data you are working with. In this exercise, you will learn how to obtain the population size, number of targets and target incidence from a given basetable.

In [1]:
import pandas as pd
import numpy as np

basetable = pd.read_csv('C:/Users/15027/Documents/GitHub_JK/IntroPredictiveAnalyticsPython/basetable.csv')

In [2]:
basetable.head()

Unnamed: 0,target,gender_F,income_high,income_low,country_USA,country_India,country_UK,age,time_since_last_gift,time_since_first_gift,max_gift,min_gift,mean_gift,number_gift
0,0,1,0,1,0,1,0,65,530,2265,166,87,116.0,7
1,0,1,0,0,0,1,0,71,715,715,90,90,90.0,1
2,0,1,0,0,0,1,0,28,150,1806,125,74,96.0,9
3,0,1,0,1,1,0,0,52,725,2274,117,97,104.25,4
4,0,1,1,0,1,0,0,82,805,805,80,80,80.0,1


In [3]:
# Assign the number of rows in the basetable to the variable 'population_size'.
population_size  = len(basetable)

# Print the population size.
print(population_size)

# Assign the number of targets to the variable 'targets_count'.
targets_count = sum(basetable['target'])

# Print the number of targets.
print(targets_count)

# Print the target incidence.
print(targets_count / population_size)

25000
1187
0.04748


In [4]:
# Count and print the number of females.
print(sum(basetable['gender_F'] == 1))

# Count and print the number of males.
print(sum(basetable['gender_F'] == 0))

# Count and print the number of females.
# print(sum(basetable['gender'] == 'F'))

# Count and print the number of males.
# print(sum(basetable['gender'] == 'M'))

12579
12421


## Building a Logistic Regression Model

You can build a logistic regression model using the linear_model module from sklearn. First, you create a logistic regression model by instantiating the LogisticRegression class:

logreg = linear_model.LogisticRegression()

Next, you need to feed data to the logistic regression model so that it can be fit. X contains the predictive variables, whereas y contains the target.

X = basetable[["predictor_1","predictor_2","predictor_3"]]

y = basetable[["target"]]

logreg.fit(X, y.values.ravel())

In this exercise you will build your first predictive model using three predictors.

In [5]:
# Import linear_model from sklearn.
from sklearn import linear_model

# Create a dataframe X that only contains the candidate predictors age, gender_F and time_since_last_gift.
X = basetable[['age', 'gender_F', 'time_since_last_gift']]

# Create a dataframe y that contains the target.
y = basetable[['target']]

# Create a logistic regression model logreg and fit it to the data.
logreg = linear_model.LogisticRegression()
logreg.fit(X, y.values.ravel())

LogisticRegression()

## Showing the coefficients and intercept

Once the logistic regression model is ready, it can be interesting to have a look at the coefficients to check whether the model makes sense.

Given a fitted logistic regression model logreg, you can retrieve the coefficients using the attribute coef_. The coefficients appear in the same order as the variables were fed to the model. The intercept can be retrieved using the attribute intercept_.

The logistic regression model that you built in the previous exercises has been added and fitted for you in logreg.

In [6]:
# Construct a logistic regression model that predicts the target using age, gender_F and time_since_last_gift
predictors = ["age","gender_F","time_since_last_gift"]
X = basetable[predictors]
y = basetable[["target"]]
logreg = linear_model.LogisticRegression()
logreg.fit(X, y.values.ravel())

# Assign the coefficients to coef (a 2D array; row 0 holds one coefficient per predictor)
coef = logreg.coef_
for p,c in zip(predictors,list(coef[0])):
    print(p + '\t' + str(c))
    
# Assign the intercept to the variable intercept
intercept = logreg.intercept_
print(intercept)

age	0.007801469599056383
gender_F	0.10964341264647998
time_since_last_gift	-0.001287260703994978
[-2.59072469]
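
One way to check whether these coefficients make sense is to convert them to odds ratios: in a logistic regression, increasing a predictor by one unit multiplies the odds of being a target by exp(coefficient). The sketch below copies the coefficients printed above by hand; the names `coefficients` and `odds_ratios` are illustrative and not part of the original notebook.

```python
import math

# Coefficients copied from the output above (model fitted on age, gender_F, time_since_last_gift).
coefficients = {
    "age": 0.007801469599056383,
    "gender_F": 0.10964341264647998,
    "time_since_last_gift": -0.001287260703994978,
}

# exp(coefficient) is the factor by which the odds of being a target change
# when the predictor increases by one unit and the other predictors stay fixed.
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.4f} ({direction} the odds of being a target)")
```

Read this way, older donors and women are slightly more likely to be targets, while a longer time since the last gift lowers the odds, which matches the signs of the coefficients.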


## Making Predictions with a Logistic Regression Model

Once your model is ready, you can use it to make predictions for a campaign. It is important to always use the latest information to make predictions.

In this exercise you will, given a fitted logistic regression model, learn how to make predictions for a new, updated basetable.

The logistic regression model that you built in the previous exercises has been added and fitted for you in logreg.

In [7]:
current_data = pd.read_csv('C:/Users/15027/Documents/GitHub_JK/IntroPredictiveAnalyticsPython/current_data.csv')

current_data.head()

Unnamed: 0,gender_F,age,time_since_last_gift
0,0,87,702
1,0,21,1773
2,0,28,782
3,0,82,1121
4,0,60,1137


In [8]:
# Fit a logistic regression model
from sklearn import linear_model
X = basetable[["age","gender_F","time_since_last_gift"]]
y = basetable[["target"]]
logreg = linear_model.LogisticRegression()
logreg.fit(X, y.values.ravel())

# Create a dataframe new_data from current_data that has only the relevant predictors 
new_data = current_data[['age','gender_F','time_since_last_gift']]

# Make a prediction for each observation in new_data and assign it to predictions
predictions = logreg.predict_proba(new_data)
print(predictions[0:5])

[[0.94351589 0.05648411]
 [0.99106857 0.00893143]
 [0.96703924 0.03296076]
 [0.96751723 0.03248277]
 [0.97304475 0.02695525]]


In [12]:
new_data.head()

Unnamed: 0,age,gender_F,time_since_last_gift
0,87,0,702
1,21,0,1773
2,28,0,782
3,82,0,1121
4,60,0,1137


In [10]:
# Create a pandas dataframe from the numpy array, reusing the index of new_data so that the rows line up
dataset = pd.DataFrame({'Non_Target_Prob': predictions[:, 0], 'Target_Prob': predictions[:, 1]}, index=new_data.index)
# Add the two probability columns to the predictors
df_all = new_data.join(dataset)

Note:  The second value in each row of the predictions array above is the probability that the observation is a target.
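
To see where these probabilities come from, the first prediction can be reproduced by hand. The sketch below plugs the coefficients and intercept printed earlier into the logistic function for the first row of current_data (age 87, gender_F 0, time_since_last_gift 702). The function name `predict_probability` and the dictionaries are illustrative, not part of the original notebook.

```python
import math

# Coefficients and intercept copied from the fitted model's output earlier in the notebook.
coefficients = {
    "age": 0.007801469599056383,
    "gender_F": 0.10964341264647998,
    "time_since_last_gift": -0.001287260703994978,
}
intercept = -2.59072469

def predict_probability(observation):
    """Apply the logistic function to intercept + sum(coefficient * value)."""
    logit = intercept + sum(coefficients[name] * observation[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-logit))

# First row of current_data.
first_donor = {"age": 87, "gender_F": 0, "time_since_last_gift": 702}
probability = predict_probability(first_donor)
print(round(probability, 4))
```

The result should agree, up to rounding of the copied intercept, with the second value in the first row of the predict_proba output (0.05648411).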

## Donor Most Likely to Donate

The predictions that result from the predictive model reflect how likely it is that someone is a target. For instance, assume that you constructed a model to predict whether a donor will donate more than 50 Euro for a certain campaign. If the prediction for a certain donor is 0.82, it means that there is an 82% chance that they will donate more than 50 Euro.

In this exercise you will find the donor that is most likely to donate more than 50 Euro.

Recall that you can sort a pandas dataframe df according to a certain column c using df.sort_values("c").

In [11]:
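# Put the probability of being a target in a dataframe so that it can be sorted by column
predictions_df = pd.DataFrame({'probability': predictions[:, 1]}, index=new_data.index)

# Sort the predictions
predictions_sorted = predictions_df.sort_values('probability')

# Print the row of predictions_sorted that has the donor that is most likely to donate
print(predictions_sorted.tail(1))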

# 2.  Forward stepwise variable selection for logistic regression

Learn why variable selection is crucial for building a useful model. You'll also learn how to implement forward stepwise variable selection for logistic regression and how to decide on the number of variables to include in your final model.

Reasons to limit the number of variables:

- Avoid overfitting, which becomes more likely as more variables are used
- Reduce the run time of the model
- Keep the model interpretable, which is difficult when the number of variables is too large

Goal:  Select a set of variables that has optimal performance. Performance is measured here with the AUC value, which reflects how well the model ranks targets above non-targets.

## Calculating AUC

The AUC value assesses how well a model can order observations from low probability to be target to high probability to be target. In Python, the roc_auc_score function can be used to calculate the AUC of the model. It takes the true values of the target and the predictions as arguments.
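
The ranking interpretation can be made concrete with a small toy example. The AUC equals the fraction of (target, non-target) pairs in which the target receives the higher score, with ties counting as half. The helper `pairwise_auc` below is a hypothetical, deliberately slow illustration of that definition and is not how scikit-learn computes the value; `roc_auc_score` should return the same number for this input.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (target, non-target) pairs where the target has the higher score; ties count half."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy data: two non-targets followed by two targets, with their predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Three of the four (target, non-target) pairs are ordered correctly.
toy_auc = pairwise_auc(y_true, scores)
print(toy_auc)
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ordering, which is why the 0.6 to 0.7 range seen in this notebook indicates a modest but real signal.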

You will make predictions again before calculating the roc_auc_score of the model.

In [14]:
import numpy as np
from sklearn.metrics import roc_auc_score

# Make predictions
predictions = logreg.predict_proba(X)
predictions_target = predictions[:,1]

# Calculate the AUC value
auc = roc_auc_score(y, predictions_target)
print(round(auc,2))

0.63


Note:  Let's see if we can improve by using a different set of variables

Adding more variables and therefore more complexity to your logistic regression model does not automatically result in more accurate models. In this exercise you can verify whether adding 3 variables to a model leads to a more accurate model.

variables_1 and variables_2 are available in your environment: you can print them to the console to explore what they look like.

In [16]:
variables_1 = ['mean_gift','income_low']
variables_2 = ['mean_gift', 'income_low', 'gender_F', 'country_India', 'age']

In [17]:
# Create appropriate dataframes
X_1 = basetable[variables_1]
X_2 = basetable[variables_2]
y = basetable[["target"]]

In [23]:
# Create the logistic regression model
logreg = linear_model.LogisticRegression()

# Make predictions using the first set of variables and assign the AUC to auc_1
logreg.fit(X_1, y.values.ravel())
predictions_1 = logreg.predict_proba(X_1)[:,1]
auc_1 = roc_auc_score(y, predictions_1)

# Make predictions using the second set of variables and assign the AUC to auc_2
logreg.fit(X_2, y.values.ravel())
predictions_2 = logreg.predict_proba(X_2)[:,1]
auc_2 = roc_auc_score(y, predictions_2)

# Print auc_1 and auc_2
print(round(auc_1,2))
print(round(auc_2,2))

0.68
0.69


Note:  You can see that the model with 5 variables has a similar AUC to the model using only 2 variables. Adding more variables doesn't always increase the AUC.

##  Forward Stepwise Variable Selection

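Forward stepwise variable selection first selects, among all variables, the one that gives the best AUC on its own. Next, it selects the variable that gives the best AUC in combination with the first variable. This process repeats until all variables are used or the desired number of variables is reached.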

##  Implementation of Forward Stepwise Procedure

- Function auc that calculates the AUC given a certain set of variables
- Function next_best that returns the next best variable in combination with the current variables
- Loop until the desired number of variables is reached

##  Selecting the next best variable

The forward stepwise variable selection method starts with an empty variable set and proceeds in steps, where in each step the next best variable is added. To implement this procedure, two handy functions have been implemented for you.

The auc function calculates for a given variable set variables the AUC of the model that uses this variable set as predictors. The next_best function calculates which variable should be added in the next step to the variable list.

In this exercise, you will experiment with these functions to better understand their purpose. You will calculate the AUC of a given variable set, calculate which variable should be added next, and verify that this indeed results in an optimal AUC.

In [27]:
def auc(variables, target, basetable):
    # Fit a logistic regression model on the given variables
    X = basetable[variables]
    Y = basetable[target].values.ravel()
    logreg = linear_model.LogisticRegression()
    logreg.fit(X, Y)
    # Score the same data, using the predicted probability of being a target
    predictions = logreg.predict_proba(X)[:,1]
    return roc_auc_score(Y, predictions)

Note:  This is a custom function, but if you want to see the source of a function you are using, you can type the following in IPython or Jupyter:  function_name??

In [25]:
def next_best(current_variables, candidate_variables, target, basetable):
    best_auc = -1
    best_variable = None

    for v in candidate_variables:
        # Calculate the auc score of adding v to the current variables
        auc_v = auc(current_variables + [v], target, basetable)

        # Update best_auc and best_variable if adding v led to a better auc score
        if auc_v >= best_auc:
            best_auc = auc_v
            best_variable = v

    return best_variable

- The auc function has been created. Calculate the AUC of a model that uses "max_gift", "mean_gift" and "min_gift" as predictors. You should pass these variables in a list as the first argument to the auc function.
- The next_best function has been created. Calculate which variable should be added next, given that "max_gift", "mean_gift" and "min_gift" are currently in the model, and "age" and "gender_F" are the candidate next predictors. The first argument of the next_best function is a list with the current variables, while the second argument is a list with the candidate predictors.
- Calculate the AUC of a model that uses "max_gift", "mean_gift", "min_gift" and "age" as predictors.
- Calculate the AUC of a model that uses "max_gift", "mean_gift", "min_gift" and "gender_F" as predictors.

In [35]:
# Calculate the AUC of a model that uses "max_gift", "mean_gift" and "min_gift" as predictors
auc_current = auc(['max_gift', 'mean_gift', 'min_gift'], ["target"], basetable)
print(round(auc_current,4))

# Calculate which variable among "age" and "gender_F" should be added to the variables "max_gift", "mean_gift" and "min_gift"
next_variable = next_best(['max_gift','mean_gift','min_gift'], ['age','gender_F'], ["target"], basetable)
print(next_variable)

# Calculate the AUC of a model that uses "max_gift", "mean_gift", "min_gift" and "age" as predictors
auc_current_age = auc(['max_gift','mean_gift','min_gift','age'], ["target"], basetable)
print(round(auc_current_age,4))

# Calculate the AUC of a model that uses "max_gift", "mean_gift", "min_gift" and "gender_F" as predictors
auc_current_gender_F = auc(['max_gift','mean_gift','min_gift','gender_F'], ["target"], basetable)
print(round(auc_current_gender_F,4))

0.7126
age
0.7149
0.7131


Note:  The model that has age as next variable has a better AUC than the model that has gender_F as next variable. Therefore, age is selected as the next best variable.




##  Finding the Order of Variables

The forward stepwise variable selection procedure starts with an empty set of variables, and adds predictors one by one. In each step, the predictor that has the highest AUC in combination with the current variables is selected.

In this exercise you will learn to implement the forward stepwise variable selection procedure. To this end, you can use the next_best function that has been implemented for you. It can be used as follows:

next_best(current_variables,candidate_variables,target,basetable)
where current_variables is the list of variables that is already in the model and candidate_variables the list of variables that can be added next.

In [43]:
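# Find the candidate variables (every column except the target and the saved row index "Unnamed: 0")
candidate_variables = list(basetable.columns.values)
candidate_variables.remove("target")
candidate_variables.remove("Unnamed: 0")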

# Initialize the current variables
current_variables = []

# The forward stepwise variable selection procedure
number_iterations = 5
for i in range(0, number_iterations):
    next_variable = next_best(current_variables, candidate_variables, ["target"], basetable)
    current_variables = current_variables + [next_variable]
    candidate_variables.remove(next_variable)
    print("Variable added in step " + str(i+1)  + " is " + next_variable + ".")
print(current_variables)

Variable added in step 1 is max_gift.
Variable added in step 2 is number_gift.
Variable added in step 3 is time_since_last_gift.
Variable added in step 4 is mean_gift.


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Variable added in step 5 is age.
['max_gift', 'number_gift', 'time_since_last_gift', 'mean_gift', 'age']


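Note:  The message above is a ConvergenceWarning rather than an error: the solver stopped at its iteration limit before it converged. As the message itself suggests, this can be addressed by increasing max_iter or by scaling the data.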
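
A minimal sketch of the two remedies named in the warning, assuming scikit-learn's `StandardScaler` and `make_pipeline` are acceptable additions to the notebook. The data here is synthetic (the names `X_demo`, `y_demo` and `scaled_logreg` are made up for illustration) and only mimics predictors on very different scales, such as age and a time-since-gift column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the basetable: two predictors on very different scales.
rng = np.random.default_rng(0)
n_rows = 1000
age = rng.integers(18, 90, size=n_rows)
time_since_first_gift = rng.integers(100, 3000, size=n_rows)
X_demo = np.column_stack([age, time_since_first_gift]).astype(float)

# Synthetic target whose probability depends on both predictors.
logit = -1.0 + 0.02 * (age - 50) - 0.001 * (time_since_first_gift - 1500)
y_demo = (rng.random(n_rows) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Standardize the predictors, then fit with a higher iteration limit than the default of 100.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_demo, y_demo)

demo_probabilities = scaled_logreg.predict_proba(X_demo)[:, 1]
print(demo_probabilities[:5])
```

In the notebook, the same pipeline could take the place of linear_model.LogisticRegression() inside the auc function. Keep in mind that after scaling, the coefficients are expressed per standard deviation of each predictor rather than per original unit.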

##  Correlated Variables

The first 10 variables that are added to the model are the following:

['max_gift', 'number_gift', 'time_since_last_gift', 'mean_gift', 'income_high', 'age', 'country_USA', 'gender_F', 'income_low', 'country_UK']

As you can see, min_gift is not added. Does this mean that it is a bad variable? You can test the performance of the variable by using it in a model as a single variable and calculating the AUC. How does the AUC of min_gift compare to the AUC of income_high? To this end, you can use the function auc():

auc(variables, target, basetable)

It can happen that a good variable is not added because it is highly correlated with a variable that is already in the model. You can test this by calculating the correlation between these variables:

import numpy
numpy.corrcoef(basetable["variable_1"],basetable["variable_2"])[0,1]

In [48]:
import numpy as np

# Calculate the AUC of the model using min_gift only
auc_min_gift = auc(['min_gift'], ["target"], basetable)
print(round(auc_min_gift,2))

# Calculate the AUC of the model using income_high only
auc_income_high = auc(['income_high'], ["target"], basetable)
print(round(auc_income_high,2))

# Calculate the correlation between min_gift and mean_gift
correlation = np.corrcoef(basetable["min_gift"], basetable["mean_gift"])[0,1]
print(round(correlation,2))
 

0.57
0.52
0.76


Note: You can observe that min_gift has more predictive power than income_high, but that it is highly correlated with mean_gift and therefore not included in the selected variables.

##  Decide on Number of Variables

Partitioning

In order to properly evaluate a model, one can partition the data into a train and a test set. The train set contains the data the model is built on, and the test set is used to evaluate the model. This division is done randomly, but when the target incidence is low, it could be necessary to stratify, that is, to make sure that the train and test data contain an equal percentage of targets.
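
The idea behind stratification can be sketched in plain Python: split the rows of each class separately, so that both parts inherit the same share of targets. The function `stratified_split` below is a hypothetical illustration on toy data, not scikit-learn's implementation; in practice, passing stratify to train_test_split handles this for you.

```python
import random

def stratified_split(targets, test_fraction=0.5, seed=42):
    """Return (train_indices, test_indices) with roughly the same share of targets in each part."""
    rng = random.Random(seed)
    train_indices, test_indices = [], []
    for label in sorted(set(targets)):
        # Shuffle the rows of this class and send a fixed fraction of them to the test set.
        indices = [i for i, t in enumerate(targets) if t == label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return train_indices, test_indices

# Toy target column with a 5% incidence (50 targets in 1000 rows).
targets = [1] * 50 + [0] * 950
train_idx, test_idx = stratified_split(targets)

train_incidence = sum(targets[i] for i in train_idx) / len(train_idx)
test_incidence = sum(targets[i] for i in test_idx) / len(test_idx)
print(train_incidence, test_incidence)
```

Without stratification, a purely random 50-50 split of a low-incidence target can leave one part with noticeably fewer targets than the other, which makes the evaluation less reliable.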

In this exercise you will partition the data with stratification and verify that the train and test data have equal target incidence. The train_test_split function is imported at the top of the cell, and the X and y dataframes are created from the basetable.

In [49]:
# Load the partitioning module (train_test_split lives in sklearn.model_selection; sklearn.cross_validation has been removed)
from sklearn.model_selection import train_test_split

# Create dataframes with variables and target
X = basetable.drop(columns="target")
y = basetable["target"]

# Carry out 50-50 partitioning with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, stratify = y)

# Create the final train and test basetables
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)

# Check whether train and test have same percentage targets
print(round(sum(train['target'])/len(train), 2))
print(round(sum(test['target'])/len(test), 2))