# Introduction to Predictive Analytics in Python

You will learn how to build a logistic regression model with meaningful variables, how to use this model to make predictions, and how to present the model and its performance to business stakeholders.


## Building Logistic Regression Models

Learn the basics of logistic regression: how to predict a binary target with continuous variables, how to interpret the resulting model, and how to use it to make predictions for new examples.

## Exploring the base table

Before diving into model building, it is important to understand the data you are working with. In this exercise, you will learn how to obtain the population size, the number of targets, and the target incidence (the share of observations that are targets) from a given basetable.

In [7]:
import pandas as pd
import numpy as np

basetable = pd.read_csv('C:/Users/15027/Documents/GitHub_JK/IntroPredictiveAnalyticsPython/basetable.csv')

In [3]:
basetable.head()

Unnamed: 0,target,gender_F,income_high,income_low,country_USA,country_India,country_UK,age,time_since_last_gift,time_since_first_gift,max_gift,min_gift,mean_gift,number_gift
0,0,1,0,1,0,1,0,65,530,2265,166,87,116.0,7
1,0,1,0,0,0,1,0,71,715,715,90,90,90.0,1
2,0,1,0,0,0,1,0,28,150,1806,125,74,96.0,9
3,0,1,0,1,1,0,0,52,725,2274,117,97,104.25,4
4,0,1,1,0,1,0,0,82,805,805,80,80,80.0,1


In [8]:
# Assign the number of rows in the basetable to the variable 'population_size'.
population_size  = len(basetable)

# Print the population size.
print(population_size)

# Assign the number of targets to the variable 'targets_count'.
targets_count = sum(basetable['target'])

# Print the number of targets.
print(targets_count)

# Print the target incidence.
print(targets_count / population_size)

25000
1187
0.04748
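Because the target is coded as 0/1, the incidence is simply the mean of the target column; in the output above, 1187 / 25000 = 0.04748. The sketch below shows the same calculation on a small made-up frame (the name `toy` and its values are assumptions for illustration, not part of the basetable).

```python
import pandas as pd

# Hypothetical stand-in for the basetable: 10 rows, 2 of which are targets.
toy = pd.DataFrame({"target": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]})

population_size = len(toy)                # number of rows
targets_count = int(toy["target"].sum())  # number of rows with target == 1
incidence = toy["target"].mean()          # mean of a 0/1 column = targets / rows

print(population_size, targets_count, incidence)
```

Using `.sum()` and `.mean()` on the column gives the same numbers as the built-in `sum()` and the explicit division, and is usually faster on large tables.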


In [9]:
# Count and print the number of females.
print(sum(basetable['gender_F'] == 1))

# Count and print the number of males.
print(sum(basetable['gender_F'] == 0))

# If gender were stored as a single 'gender' column with values 'F' and 'M',
# the equivalent counts would be:
# print(sum(basetable['gender'] == 'F'))
# print(sum(basetable['gender'] == 'M'))

12579
12421


## Building a Logistic Regression Model

You can build a logistic regression model using the linear_model module from sklearn. First, you create a logistic regression model by instantiating the LogisticRegression class:

logreg = linear_model.LogisticRegression()

Next, you need to feed data to the logistic regression model so that it can be fit. X contains the predictive variables, whereas y holds the target. Because fit expects a one-dimensional target, the single-column dataframe y is flattened with y.values.ravel():

X = basetable[["predictor_1","predictor_2","predictor_3"]]

y = basetable[["target"]]

logreg.fit(X, y.values.ravel())

In this exercise you will build your first predictive model using three predictors.

In [12]:
# Import linear_model from sklearn.
from sklearn import linear_model

# Create a dataframe X that only contains the candidate predictors age, gender_F and time_since_last_gift.
X = basetable[['age', 'gender_F', 'time_since_last_gift']]

# Create a dataframe y that contains the target.
y = basetable[['target']]

# Create a logistic regression model logreg and fit it to the data.
logreg = linear_model.LogisticRegression()
logreg.fit(X, y.values.ravel())

LogisticRegression()

## Showing the coefficients and intercept

Once the logistic regression model is ready, it can be interesting to have a look at the coefficients to check whether the model makes sense.

Given a fitted logistic regression model logreg, you can retrieve the coefficients using the attribute coef_. The coefficients appear in the same order as the variables were passed to the model. The intercept can be retrieved using the attribute intercept_.

The logistic regression model that you built in the previous exercises has been added and fitted for you in logreg.

In [13]:
# Construct a logistic regression model that predicts the target using age, gender_F and time_since_last_gift
predictors = ["age","gender_F","time_since_last_gift"]
X = basetable[predictors]
y = basetable[["target"]]
logreg = linear_model.LogisticRegression()
logreg.fit(X, y.values.ravel())

# Assign the coefficients to the array coef (one row, one entry per predictor)
coef = logreg.coef_
for p,c in zip(predictors,list(coef[0])):
    print(p + '\t' + str(c))
    
# Assign the intercept to the variable intercept
intercept = logreg.intercept_
print(intercept)

age	0.007801469599056383
gender_F	0.10964341264647998
time_since_last_gift	-0.001287260703994978
[-2.59072469]
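To see how these numbers turn into a prediction, the sketch below applies the logistic function by hand: the log-odds are the intercept plus the coefficient-weighted predictor values, and the probability is 1 / (1 + exp(-log_odds)). The coefficients are copied from the output above (the intercept only to its printed precision), and the donor values are an illustrative assumption that happens to match the first row of current_data used in the prediction section.

```python
import math

# Coefficients and intercept as printed above.
coef = {
    "age": 0.007801469599056383,
    "gender_F": 0.10964341264647998,
    "time_since_last_gift": -0.001287260703994978,
}
intercept = -2.59072469

# Illustrative donor: 87 years old, male, last gift 702 days ago.
donor = {"age": 87, "gender_F": 0, "time_since_last_gift": 702}

# Linear part of the model: the log-odds of being a target.
log_odds = intercept + sum(coef[name] * donor[name] for name in coef)

# The logistic (sigmoid) function maps log-odds to a probability.
probability = 1.0 / (1.0 + math.exp(-log_odds))

print(round(probability, 4))  # 0.0565
```

A positive coefficient (age, gender_F) raises the predicted probability as the variable grows, while a negative one (time_since_last_gift) lowers it, which is one quick way to check whether the model makes sense.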


## Making Predictions with a Logistic Regression Model

Once your model is ready, you can use it to make predictions for a campaign. It is important to always use the latest information to make predictions.

In this exercise you will, given a fitted logistic regression model, learn how to make predictions for a new, updated basetable.

The logistic regression model that you built in the previous exercises has been added and fitted for you in logreg.

In [17]:
current_data = pd.read_csv('C:/Users/15027/Documents/GitHub_JK/IntroPredictiveAnalyticsPython/current_data.csv')

current_data.head()

Unnamed: 0,gender_F,age,time_since_last_gift
0,0,87,702
1,0,21,1773
2,0,28,782
3,0,82,1121
4,0,60,1137


In [19]:
# Fit a logistic regression model
from sklearn import linear_model
X = basetable[["age","gender_F","time_since_last_gift"]]
y = basetable[["target"]]
logreg = linear_model.LogisticRegression()
logreg.fit(X, y.values.ravel())

# Create a dataframe new_data from current_data that has only the relevant predictors 
new_data = current_data[['age','gender_F','time_since_last_gift']]

# Make a prediction for each observation in new_data and assign it to predictions
predictions = logreg.predict_proba(new_data)
print(predictions[0:5])

[[0.94351589 0.05648411]
 [0.99106857 0.00893143]
 [0.96703924 0.03296076]
 [0.96751723 0.03248277]
 [0.97304475 0.02695525]]


In [30]:
new_data.head()

Unnamed: 0,age,gender_F,time_since_last_gift
0,87,0,702
1,21,0,1773
2,28,0,782
3,82,0,1121
4,60,0,1137


In [31]:
# Creating a pandas dataframe from the numpy array and joining it to new_data on the row index
# dataset = pd.DataFrame({'Non_Target_Prob': predictions[:, 0], 'Target_Prob': predictions[:, 1]}, index=new_data.index)
# df_all = new_data.join(dataset)

Note: The second value in each row of the array above is the probability that the observation is a target (class 1); the first value is the probability that it is not.
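The columns returned by predict_proba follow the order of the fitted model's classes_ attribute, so with a 0/1 target the second column holds the probability of class 1. As a minimal sketch, the snippet below attaches that column to the predictors, using the five rows of new_data and predictions printed above as literal values; the column name `target_prob` and the variable `scored` are assumptions chosen for this example.

```python
import numpy as np
import pandas as pd

# First five rows of new_data, copied from the output above.
new_data = pd.DataFrame({
    "age": [87, 21, 28, 82, 60],
    "gender_F": [0, 0, 0, 0, 0],
    "time_since_last_gift": [702, 1773, 782, 1121, 1137],
})

# First five rows of the predict_proba output, copied from above.
predictions = np.array([
    [0.94351589, 0.05648411],
    [0.99106857, 0.00893143],
    [0.96703924, 0.03296076],
    [0.96751723, 0.03248277],
    [0.97304475, 0.02695525],
])

# Column 0 is P(target = 0), column 1 is P(target = 1); each row sums to 1.
scored = new_data.assign(target_prob=predictions[:, 1])

print(scored)
```

Keeping only the second column is usually enough, since the first is just one minus the second.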

## Donor Most Likely to Donate

The predictions that result from the predictive model reflect how likely it is that someone is a target. For instance, assume that you constructed a model to predict whether a donor will donate more than 50 Euro for a certain campaign. If the prediction for a certain donor is 0.82, the model estimates an 82% chance that this donor will donate more than 50 Euro.

In this exercise you will find the donor that is most likely to donate more than 50 Euro.

Recall that you can sort a pandas dataframe df according to a certain column c using df.sort_values("c").
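As a hedged sketch of this step, the snippet below sorts a small dataframe of predicted probabilities and picks out the most likely donor. The frame holds the first five target probabilities from the predict_proba output shown earlier; the column name `probability` and the use of the row index as a donor identifier are assumptions for illustration.

```python
import pandas as pd

# Target probabilities for the first five donors (second column of predict_proba).
predictions = pd.DataFrame(
    {"probability": [0.05648411, 0.00893143, 0.03296076, 0.03248277, 0.02695525]}
)

# sort_values sorts ascending by default, so the most likely donor ends up last.
predictions_sorted = predictions.sort_values("probability")
top_donor = predictions_sorted.tail(1)
print(top_donor)

# Equivalent shortcut when only the row label of the maximum is needed.
best_index = predictions["probability"].idxmax()
print(best_index)
```

Sorting is convenient when you want the top N donors for a campaign; idxmax is enough when only the single best candidate matters.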

In [22]:
predictions
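# Sort the predictions (assumes predictions is a dataframe with a 'probability' column;
# DataFrame.sort is no longer available in current pandas, so sort_values is used)
# predictions_sorted = predictions.sort_values('probability')

# Print the row of predictions_sorted that has the donor that is most likely to donate
# print(predictions_sorted.tail(1))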

array([[0.94351589, 0.05648411],
       [0.99106857, 0.00893143],
       [0.96703924, 0.03296076],
       [0.96751723, 0.03248277],
       [0.97304475, 0.02695525],
       [0.99670801, 0.00329199],
       [0.99675616, 0.00324384],
       [0.99427554, 0.00572446],
       [0.9456776 , 0.0543224 ],
       [0.98845718, 0.01154282],
       [0.96582973, 0.03417027],
       [0.99769552, 0.00230448],
       [0.92230869, 0.07769131],
       [0.99546043, 0.00453957],
       [0.99775107, 0.00224893],
       [0.99096161, 0.00903839],
       [0.87781713, 0.12218287],
       [0.937766  , 0.062234  ],
       [0.97730938, 0.02269062],
       [0.92327366, 0.07672634],
       [0.94658774, 0.05341226],
       [0.97872119, 0.02127881],
       [0.99367743, 0.00632257],
       [0.99499597, 0.00500403],
       [0.98595216, 0.01404784],
       [0.96080186, 0.03919814],
       [0.99179936, 0.00820064],
       [0.99752858, 0.00247142],
       [0.995342  , 0.004658  ],
       [0.93206783, 0.06793217],
       [0.