# Logistic regression model

## Overview

Logistic regression model example using TensorFlow 2.0 and the `tf.estimator` API.
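
For orientation, logistic regression passes a weighted sum of the input features through the sigmoid function to obtain a probability for the positive class. The sketch below is a minimal pure-Python illustration of that idea; the helper names, weights, bias, and feature values are made up for the example and are not taken from the model trained later in this notebook.

```python
import math

def sigmoid(z):
    """Map a real-valued score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for a single example."""
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(logit)

# Hypothetical weights and inputs, chosen only for illustration.
weights = [0.8, -0.5]
bias = 0.1
p = predict_proba([1.0, 2.0], weights, bias)
print(round(p, 3))
```

A logit of zero corresponds to a probability of 0.5, which is the usual decision boundary between the two classes.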


## Setup

In [0]:
%%capture
!pip install scikit-learn

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import os
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import clear_output
from six.moves import urllib

In [0]:
%%capture
!pip install tensorflow==2.0.0-alpha0

import tensorflow.compat.v2.feature_column as fc
import tensorflow as tf

## Load and preview the dataset


**Dataset source:** UCI Machine Learning Repository

**Dataset description:** The data describes customers of a financial institution and their responses to a marketing campaign. The goal (`y`) is to predict whether a client will subscribe (1) or not (0) to a term deposit after receiving a call from a marketer.

In [0]:
# Download dataset
dataset_url='https://raw.githubusercontent.com/madmashup/targeted-marketing-predictive-engine/master/banking.csv'
dataset=pd.read_csv(dataset_url, header=0)

In [0]:
dataset.dtypes

In [0]:
# Preview dataset
dataset.head()

## Explore the data

In [0]:
print(dataset.shape)

The dataset contains 41,188 rows with 10 numerical and 10 categorical input variables, plus the target `y`.

In [0]:
dataset.describe()

In [0]:
# Data exploration
dataset['y'].hist(grid=False)

The dataset contains far more clients who did not subscribe to a term deposit than clients who did.
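
With imbalanced classes it helps to know how well a trivial model would do. The sketch below computes the accuracy of always predicting the majority class; `majority_baseline` is a hypothetical helper, and the labels are made-up values rather than the real `y` column.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_class, accuracy of always predicting that class)."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)

# Hypothetical labels: nine non-subscribers for every subscriber.
labels = [0] * 90 + [1] * 10
cls, acc = majority_baseline(labels)
print(cls, acc)
```

Applied to the real data, `majority_baseline(dataset['y'])` should give the accuracy that any trained model needs to beat.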

In [0]:
dataset.age.hist(bins=20,grid=False)

The 30-45 age group is the most heavily represented in the dataset.

In [0]:
dataset['job'].value_counts().plot(kind='barh')

Administrative jobs are the most common occupation in the dataset.

## Split to train and test datasets

We'll split the dataset into 70% training and 30% evaluation subsets.

In [0]:
from sklearn.model_selection import train_test_split

# Separate the features from the target without modifying `dataset`,
# so the cell can be re-run safely.
x = dataset.drop(columns='y')
y = dataset['y']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

In [0]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
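
As a rough illustration of what a seeded 70/30 split does, the sketch below shuffles row indices with a fixed seed and cuts them into two groups. `split_indices` is a hypothetical helper; it does not reproduce scikit-learn's exact permutation or rounding, only the general idea.

```python
import random

def split_indices(n_rows, test_fraction=0.3, seed=0):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    # The first n_test shuffled indices form the test set, the rest the train set.
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split repeatable, which is the role `random_state=0` plays in the cell above.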

## Feature Engineering for the Model
Estimators use [feature columns](https://www.tensorflow.org/guide/feature_columns) to describe the type of each feature.

For categorical columns we'll apply one-hot encoding, and we'll cast numerical values to a common type, `float32`.



In [0]:
CATEGORICAL_COLUMNS = ['job',
                       'marital',
                       'education',
                       'housing',
                       'loan',
                       'month',
                       'poutcome']
NUMERIC_COLUMNS = ['age',
                   'duration',
                   'pdays',
                   'previous',
                   'emp_var_rate',
                   'euribor3m',
                   'nr_employed']

feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
  vocabulary = x_train[feature_name].unique()
  feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))

for feature_name in NUMERIC_COLUMNS:
  feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))

In [0]:
# Print prepared feature columns
feature_columns
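
The one-hot encoding mentioned earlier can be sketched in plain Python as follows. `one_hot` is a hypothetical helper, and the vocabulary is a made-up example for the `marital` column; the feature columns handle this encoding internally, so this is only meant to show the idea.

```python
def one_hot(value, vocabulary):
    """Encode value as a list with a single 1.0 at its vocabulary position."""
    return [1.0 if value == item else 0.0 for item in vocabulary]

# Hypothetical vocabulary for the 'marital' column.
vocabulary = ['married', 'single', 'divorced']
print(one_hot('single', vocabulary))
print(one_hot('widowed', vocabulary))  # value not present in the vocabulary
```

In this sketch a value that is missing from the vocabulary encodes to all zeros, which is one reason the vocabulary is built from the training data before the columns are created.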

In [0]:
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
  def input_function():
    """Convert the DataFrames to a batched tf.data.Dataset."""
    ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))
    if shuffle:
      ds = ds.shuffle(1000)
    ds = ds.batch(batch_size).repeat(num_epochs)
    return ds
  return input_function

train_input_fn = make_input_fn(x_train, y_train)
eval_input_fn = make_input_fn(x_test, y_test, num_epochs=1, shuffle=False)

Preview prepared dataset: feature names, feature values, labels.

In [0]:
ds = make_input_fn(x_train, y_train, batch_size=10)()

for feature_batch, label_batch in ds.take(1):
  print('Some feature keys:', list(feature_batch.keys()))
  print()
  print('A batch of categorical feature values:', feature_batch['job'].numpy())
  print()
  print('A batch of Labels:', label_batch.numpy())
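
Because the input function batches first and then repeats, every epoch yields the same number of batches, including one final partial batch when the row count is not a multiple of the batch size. The arithmetic can be sketched as below; `total_batches` is a hypothetical helper and the sizes are made up.

```python
import math

def total_batches(n_rows, batch_size, num_epochs):
    """Number of batches produced by ds.batch(batch_size).repeat(num_epochs)."""
    batches_per_epoch = math.ceil(n_rows / batch_size)
    return batches_per_epoch * num_epochs

# Hypothetical sizes for illustration.
print(total_batches(n_rows=100, batch_size=32, num_epochs=10))
```

For the training input function, `total_batches(len(x_train), 32, 10)` should match the number of training steps, assuming `train` is called without a `steps` limit.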

## Model training

With the feature columns and input functions defined, we can train the model.

In [0]:
linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns)
linear_est.train(train_input_fn)

clear_output()
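
To give a feel for what training does, the sketch below fits a one-feature logistic regression by minimizing the log loss with plain batch gradient descent. This is a simplification: `LinearClassifier` uses the FTRL optimizer by default, and the toy data, learning rate, and step count here are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(xs, ys, w, b):
    """Mean binary cross-entropy of a one-feature model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)

def train(xs, ys, learning_rate=0.1, steps=200):
    """Fit the weight w and bias b with plain batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data, made up for the example: larger x tends to mean label 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]
w, b = train(xs, ys)
print(log_loss(xs, ys, 0.0, 0.0), log_loss(xs, ys, w, b))
```

The second printed loss should be lower than the first, showing that the fitted parameters explain the labels better than the all-zero starting point.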

## Model evaluation

In [0]:
result = linear_est.evaluate(eval_input_fn)
result

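The trained model reaches about 91% accuracy on the evaluation set. Since most clients did not subscribe, the `auc`, `precision`, and `recall` values in `result` are worth reading alongside the accuracy.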
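
Accuracy, precision, and recall all come from counting how thresholded predictions line up with the true labels. The sketch below shows that calculation in plain Python; `classification_metrics` is a hypothetical helper, and the labels and probabilities are made-up values, not model output.

```python
def classification_metrics(y_true, probs, threshold=0.5):
    """Accuracy, precision, and recall for probabilities cut at threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    pairs = list(zip(y_true, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.6, 0.2, 0.7, 0.4]
print(classification_metrics(y_true, probs))
```

In this toy example the accuracy looks high even though only half of the positive cases are found, which is the pattern to watch for on imbalanced data.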

In [0]:
linear_est.get_variable_names()

## Make prediction

Make predictions on the evaluation set using `linear_est.predict(eval_input_fn)`, then plot a histogram of the predicted probabilities of the positive class.



In [0]:
pred_dicts = list(linear_est.predict(eval_input_fn))
probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts])

probs.plot(kind='hist', bins=20, title='predicted probabilities')

We'll plot the receiver operating characteristic (ROC) curve of the results to analyze the tradeoff between the true positive rate and the false positive rate.

In [0]:
from sklearn.metrics import roc_curve
from matplotlib import pyplot as plt

fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr)
plt.title('ROC curve')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.xlim(0,)
plt.ylim(0,)
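
To make the ROC computation concrete, the sketch below builds the curve by hand: each distinct score is used as a threshold, and the area under the curve is estimated with the trapezoidal rule. `roc_points` and `auc` are hypothetical helpers, the labels and scores are made up, and the code assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists with one point per distinct threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    # Lowering the threshold step by step marks more examples as positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Hypothetical labels and scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
print(fpr, tpr, auc(fpr, tpr))
```

An area of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.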