# CS110 Final Project - Churn Prediction Application (https://github.com/snehankekre/ANN_Churn)
   **Snehan Kekre** 

   **Minerva Schools at KGI**


## Brief description of the dataset 

**Dataset Location:** https://www.sgi.com/tech/mlc/db/churn.all (https://archive.fo/HJx3i)


### Application:  Predicting churn

Below is my analysis using a supervised model to predict churn (i.e., when customers cancel their plan). Predicting churn is valuable because it is usually easier to retain current customers than to acquire new ones.

The analyzed dataset contains the following variables:

* State: the states in the U.S. (categorical)
* Account length: the number of days that the account has been active
* Area code: area code
* Phone: phone number
* Int'l Plan: whether the customer has an international plan or not
* VMail Plan: whether the customer has a voicemail plan or not
* VMail Message: number of voice mail messages
* Day Mins: number of minutes the customer spoke during the day
* Day Calls: number of calls made by the customer during the day
* Day Charge: charge incurred by the customer during the day
* Eve Mins: number of minutes the customer spoke during the evening
* Eve Calls: number of calls made by the customer in the evening
* Eve Charge: charge incurred by the customer in the evening
* Night Mins: number of minutes the customer spoke at night
* Night Calls:  number of calls made by the customer at night
* Night Charge: charge incurred by the customer at night
* Intl Mins: number of minutes spent in international calls
* Intl Calls: number of international calls made by the customer
* Intl Charge: charge incurred by the customer for international calls
* CustServ Calls: number of calls to customer service
* Churn: whether the customer has cancelled the plan or not

## 1. Importing the required libraries

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline

## 2. Parsing and exploring trends in the dataset

In [2]:
exp_data = pd.read_csv("churn.csv", sep=',', decimal='.', header=0)

In [3]:
exp_data.describe()

| | Account Length | Area Code | VMail Message | Day Mins | Day Calls | Day Charge | Eve Mins | Eve Calls | Eve Charge | Night Mins | Night Calls | Night Charge | Intl Mins | Intl Calls | Intl Charge | CustServ Calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 |
| mean | 101.064806 | 437.182418 | 8.09901 | 179.775098 | 100.435644 | 30.562307 | 200.980348 | 100.114311 | 17.08354 | 200.872037 | 100.107711 | 9.039325 | 10.237294 | 4.479448 | 2.764581 | 1.562856 |
| std | 39.822106 | 42.37129 | 13.688365 | 54.467389 | 20.069084 | 9.259435 | 50.713844 | 19.922625 | 4.310668 | 50.573847 | 19.568609 | 2.275873 | 2.79184 | 2.461214 | 0.753773 | 1.315491 |
| min | 1.0 | 408.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 23.2 | 33.0 | 1.04 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 74.0 | 408.0 | 0.0 | 143.7 | 87.0 | 24.43 | 166.6 | 87.0 | 14.16 | 167.0 | 87.0 | 7.52 | 8.5 | 3.0 | 2.3 | 1.0 |
| 50% | 101.0 | 415.0 | 0.0 | 179.4 | 101.0 | 30.5 | 201.4 | 100.0 | 17.12 | 201.2 | 100.0 | 9.05 | 10.3 | 4.0 | 2.78 | 1.0 |
| 75% | 127.0 | 510.0 | 20.0 | 216.4 | 114.0 | 36.79 | 235.3 | 114.0 | 20.0 | 235.3 | 113.0 | 10.59 | 12.1 | 6.0 | 3.27 | 2.0 |
| max | 243.0 | 510.0 | 51.0 | 350.8 | 165.0 | 59.64 | 363.7 | 170.0 | 30.91 | 395.0 | 175.0 | 17.77 | 20.0 | 20.0 | 5.4 | 9.0 |


## 3. Reading and transforming fields of interest

### 3.1 Loading the dataset

In [10]:
def read_dataset(filename):
    # List the columns that are not of interest, besides the target
    drop_cols = ['State','Area Code','Phone','Churn?']
    
    # List the categorical columns
    yes_no_cols = ["Int'l Plan","VMail Plan"]
    
    # Load the dataset
    churn_data = pd.read_csv(filename, sep=',', decimal='.', header=0)
    
    # Convert the categorical columns to boolean
    churn_data[yes_no_cols] = churn_data[yes_no_cols] == 'yes'
    
    # Isolate the target
    y = churn_data['Churn?'] 
    
    # Remove the listed columns
    churn_data = churn_data.drop(drop_cols, axis=1)
    
    # Isolate the names of the columns
    feature_names = churn_data.columns
    
    # Convert the dataset to an array
    # (DataFrame.as_matrix() was removed in pandas 1.0; .values is equivalent)
    X = churn_data.values.astype(np.float32)
    
    # Apply one-hot encoding to the target variable
    y_one_hot = pd.get_dummies(y)
    
    # Convert the target to an array
    y = y_one_hot.values.astype(np.float32)
    
    print("X.shape")
    print(X.shape)
    print("y.shape")
    print(y.shape)
    
    return X, y, feature_names

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

### 3.2 Split into train and test data and standardize

In [12]:
def scale_and_split(X, y):
    
    # Split the dataset into Train and Test
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=2)
    
    # Standardize the variables
    scaler = StandardScaler()
    
    X_train = scaler.fit_transform(X_train)

    X_test = scaler.transform(X_test)

    print(X_train.shape)
    
    print(X_test.shape)
        
    return X_train, X_test, y_train, y_test
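
`scale_and_split` calls `fit_transform` on the training split and only `transform` on the test split, so the scaling statistics come from the training data alone. The following is a minimal NumPy sketch of that idea on made-up arrays. The helper name `standardize_with_train_stats` and the toy data are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def standardize_with_train_stats(X_train, X_test):
    """Scale both splits with the mean and std of the training split only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)      # population std (ddof=0), as StandardScaler uses
    std[std == 0] = 1.0            # avoid division by zero for constant columns
    return (X_train - mean) / std, (X_test - mean) / std

# Toy data, purely illustrative
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

X_train_s, X_test_s = standardize_with_train_stats(X_train, X_test)
print(X_train_s.mean(axis=0))  # approximately zero in every column
print(X_test_s)                # test rows scaled with the training statistics
```

Because the test rows never influence the mean or standard deviation, the test score stays an honest estimate of performance on unseen customers.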

### Feedforward

![Image of feedforward_matrix](https://i.imgur.com/diNer3g.png)

In [13]:
def forward_prop(tf_dataset, tf_labels, tf_dropout_rate):
    # Hidden layer
    with tf.name_scope("hidden_layer1"):
        
        # Define the weights matrix
        weights = tf.Variable(tf.truncated_normal([17, num_hidden_units_1]))
        
        # Define the bias vector
        biases = tf.Variable(tf.zeros([num_hidden_units_1]), name="biases")
        
        # Define the net entries of the network
        h1_net = tf.matmul(tf_dataset, weights) + biases
        
        # Define the activation function
        h1_activ = tf.nn.relu(h1_net)
        
        h1_reg = tf.nn.l2_loss(weights)
    
    # Output layer
    with tf.name_scope("output_layer"):
        
        weights = tf.Variable(tf.truncated_normal([num_hidden_units_1, num_labels]))
        
        biases = tf.Variable(tf.zeros([num_labels]), name="biases")
        
        out_net = tf.matmul(h1_activ, weights) + biases
        
        out_reg = tf.nn.l2_loss(weights)

    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=out_net, labels=tf_labels))
    
    loss = loss + l2_reg_param * (h1_reg + out_reg)
    
    return out_net, loss
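
To make the shapes in `forward_prop` concrete, here is a rough NumPy sketch of the same two matrix products and of the L2 penalty term. The layer sizes (17, 50, 2) and `l2_reg_param` come from the notebook; the random inputs and weights are placeholders, and `l2_loss` is a hypothetical NumPy stand-in for `tf.nn.l2_loss`, which is defined as half the sum of squared entries.

```python
import numpy as np

rng = np.random.default_rng(0)

batch = rng.normal(size=(50, 17))             # one fake mini-batch: 50 rows, 17 features
W1, b1 = rng.normal(size=(17, 50)), np.zeros(50)
W2, b2 = rng.normal(size=(50, 2)), np.zeros(2)

h1_activ = np.maximum(batch @ W1 + b1, 0.0)   # ReLU hidden layer, shape (50, 50)
out_net = h1_activ @ W2 + b2                  # raw logits, shape (50, 2)

def l2_loss(w):
    """NumPy counterpart of tf.nn.l2_loss: half the sum of squared entries."""
    return 0.5 * np.sum(w ** 2)

l2_reg_param = 0.5e-3
# As in forward_prop, only the weight matrices are penalized, not the biases
penalty = l2_reg_param * (l2_loss(W1) + l2_loss(W2))

print(h1_activ.shape, out_net.shape, penalty)
```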

In [14]:
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn import linear_model

## Set the parameters of the neural network

In [15]:
batch_size = 50
num_hidden_units_1 = 50
num_hidden_units_2 = 30
l2_reg_param = 0.5e-3 # Scale the loss on output and inner layers
learning_rate = 1.5 
num_steps = 2000 # Number of steps is inversely proportional to batch size
num_labels = 2
dropout_rate = 0.5

In [16]:
# Return the accuracy of the method
def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))/ predictions.shape[0])
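
As a quick sanity check of `accuracy`, the sketch below applies the same definition to four made-up rows of softmax outputs and one-hot labels. The numbers are invented for illustration only.

```python
import numpy as np

def accuracy(predictions, labels):
    # Same definition as above: percentage of rows whose argmax matches
    return 100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1)) / predictions.shape[0]

# Made-up softmax outputs for four customers and their one-hot labels
predictions = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]])
labels = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])

print(accuracy(predictions, labels))  # 3 of 4 rows match -> 75.0
```

Accuracy should be read alongside the class balance: in the test split reported further down, 571 of 667 customers did not churn, so a model that always predicts "no churn" would already score about 85.6%.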

In [17]:
def run_ann(X_train, X_test, y_train, y_test):
    
    graph = tf.Graph()
    
    with graph.as_default():
        
        tf_dataset = tf.placeholder(tf.float32, shape=(None, X_train.shape[1]))
        
        tf_labels = tf.placeholder(tf.float32, shape=(None, num_labels))
        
        tf_dropout_rate = tf.placeholder(tf.float32)
        
        print("DataSet tf")
        
        print(tf_dataset.get_shape()[1])
        
        logits, loss = forward_prop(tf_dataset, tf_labels, tf_dropout_rate)
        
        optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
        
        prediction = tf.nn.softmax(logits)

    with tf.Session(graph=graph) as session:
        
        tf.initialize_all_variables().run()
        
        print("Initialized")
        
        for step in range(num_steps):
            
            offset = (step * batch_size) % (y_train.shape[0] - batch_size)
            
            # Generate a minibatch.
            batch_data = X_train[offset:(offset + batch_size), :]
            
            batch_labels = y_train[offset:(offset + batch_size), :]
            
            feed_dict = {tf_dataset : batch_data, tf_labels : batch_labels, tf_dropout_rate: dropout_rate}
            
            _, l, predictions = session.run(
            [optimizer, loss, prediction], feed_dict=feed_dict)
            
            if (step % 100 == 0):
                
                idx = np.random.permutation(batch_size)
                
                print("Minibatch loss at step %d: %f" % (step, l))
                
                print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))

        #idx = np.random.permutation(batch_size)
        test_pred = session.run(prediction, feed_dict={tf_dataset: X_test, tf_dropout_rate: dropout_rate})
        print("\n\nTest accuracy: %.1f%%" % accuracy(test_pred, y_test))
        #correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))

        # test_pred and y_test are already NumPy arrays, so no extra graph ops are needed
        y_hat = np.argmax(test_pred, 1)
        y_true = np.argmax(y_test, 1)

        print('Y hat\n',y_hat)
        print('Y True\n',y_true)

        print(metrics.classification_report(y_true, y_hat))
        print('AUC score: ', metrics.roc_auc_score(y_true, y_hat))
        print("Accuracy: %f" % metrics.accuracy_score(y_true, y_hat))
        cm = metrics.confusion_matrix(y_true, y_hat)
        fig = plt.figure()
        ax = fig.add_subplot(111)
        ax.matshow(cm.T)  # transpose so cell colors line up with the text below (x = real, y = predicted)
        #plt.title('Confusion Matrix',size=10)
        ax.set_xticklabels([''] + ['no churn', 'churn'], size=10)
        ax.set_yticklabels([''] + ['no churn', 'churn'], size=10)
        #plt.ylabel('Prediction',size=10)
        plt.xlabel('Real',size=10)
        for i in range(2):
            for j in range(2):
                ax.text(i, j, cm[i,j], va='center', ha='center',color='white',size=20)
        fig.set_size_inches(4,4)
        plt.show()

In [None]:
def main():
    X, y, feature_names = read_dataset('churn.csv')
    X_train, X_test, y_train, y_test = scale_and_split(X, y)
    run_ann(X_train, X_test, y_train, y_test)

if __name__ == '__main__':
    main()

In [None]:
x.shape
(3333, 17)
y.shape
(3333, 2)
(2666, 17)
(667, 17)
DataSet tf
17
WARNING:tensorflow:From <ipython-input-17-85ca0fd19645>:25: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed)
Instructions for updating:
Use `tf.global_variables_initializer` instead.
Initialized
Minibatch loss at step 0: 9.642170
Minibatch accuracy: 36.0%
Minibatch loss at step 100: 0.496644
Minibatch accuracy: 80.0%
Minibatch loss at step 200: 0.475079
Minibatch accuracy: 86.0%
Minibatch loss at step 300: 0.247336
Minibatch accuracy: 88.0%
Minibatch loss at step 400: 0.261087
Minibatch accuracy: 88.0%
Minibatch loss at step 500: 0.305205
Minibatch accuracy: 92.0%
Minibatch loss at step 600: 0.269468
Minibatch accuracy: 94.0%
Minibatch loss at step 700: 0.165396
Minibatch accuracy: 96.0%
Minibatch loss at step 800: 0.199834
Minibatch accuracy: 94.0%
Minibatch loss at step 900: 0.107609
Minibatch accuracy: 100.0%
Minibatch loss at step 1000: 0.296621
Minibatch accuracy: 92.0%
Minibatch loss at step 1100: 0.145511
Minibatch accuracy: 96.0%
Minibatch loss at step 1200: 0.161333
Minibatch accuracy: 96.0%
Minibatch loss at step 1300: 0.182208
Minibatch accuracy: 96.0%
Minibatch loss at step 1400: 0.225527
Minibatch accuracy: 94.0%
Minibatch loss at step 1500: 0.219653
Minibatch accuracy: 94.0%
Minibatch loss at step 1600: 0.066998
Minibatch accuracy: 100.0%
Minibatch loss at step 1700: 0.309811
Minibatch accuracy: 88.0%
Minibatch loss at step 1800: 0.063317
Minibatch accuracy: 100.0%
Minibatch loss at step 1900: 0.101592
Minibatch accuracy: 98.0%


Test accuracy: 92.7%
Y hat
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1
 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0
 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0
 0]
Y True
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0
 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1
 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0
 0 0 1 1 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1
 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0
 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
 0]
             precision    recall  f1-score   support

          0       0.94      0.98      0.96       571
          1       0.82      0.62      0.71        96

avg / total       0.92      0.93      0.92       667

AUC score:  0.801116462347
Accuracy: 0.926537

![Image of Confusion Matrix](https://i.imgur.com/onx2ENg.png)
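
The reported precision, recall, accuracy, and AUC can all be derived from a single 2 x 2 confusion matrix. The counts in the sketch below are an assumption: they were inferred because they reproduce the reported supports (571 and 96) and metrics, not read from the notebook output. The sketch also shows why the AUC is only about 0.80: `roc_auc_score` was given hard 0/1 predictions, in which case the score reduces to the mean of recall and specificity.

```python
import numpy as np

# Assumed counts (rows = true class, columns = predicted class),
# inferred from the classification report rather than taken from the notebook
cm = np.array([[558, 13],
               [36, 60]])

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)
recall = tp / (tp + fn)            # true positive rate
specificity = tn / (tn + fp)       # true negative rate
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

# With hard 0/1 predictions the ROC curve has a single interior point,
# so the trapezoidal AUC is the mean of recall and specificity.
auc_from_labels = (recall + specificity) / 2

print(round(precision, 2), round(recall, 2), round(f1, 2))
print(round(accuracy, 6), round(auc_from_labels, 6))
```

Passing the predicted churn probability (the second column of the softmax output) to `roc_auc_score` instead of the argmax would give a threshold-independent AUC.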

## Overview and comments

I used TensorFlow in this application for its portability (its flexible architecture allows computation to be deployed to one or more CPUs or GPUs) and, more importantly, for its automatic differentiation.

TensorFlow keeps the definition of computations separate from their execution: a graph is created first, and a session then carries out the operations in it: ![Image of Data Flow Graph](https://i.imgur.com/QMnMGX1.jpg)

Variables, constants, and operators are the nodes of the **Data Flow Graph**, while tensors (n-dimensional arrays) are its edges. To ease computation and lower the load on hardware, a graph can be split into smaller subgraphs that run in parallel across multiple cores (distributed computation). With the underlying graph and activation function fixed, the network is parameterized by a weight vector **w** belonging to **R^d**, and training amounts to learning **w**.

A benefit of the above implementation is that it avoids the **O(n^3)** computational complexity of the Normal Equation, which comes from inverting **X^T X**, an ***n x n*** matrix where ***n*** is the number of features. A trained linear regression model does, however, predict quickly, because the cost of prediction is linear in both the number of instances and the number of features.
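
For comparison, here is a minimal sketch of the Normal Equation, w = (X^T X)^{-1} X^T y, on synthetic data. The data, sizes, and variable names are illustrative assumptions and this model is not used anywhere in the notebook. The n x n system being solved is the step whose cost grows roughly cubically with the number of features.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3                                  # m instances, n features (toy sizes)
X = rng.normal(size=(m, n))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=m)     # synthetic targets with small noise

# Normal Equation: w = (X^T X)^{-1} X^T y.
# X^T X is n x n, so solving this system costs roughly O(n^3) in the number of features.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction is a single matrix-vector product: linear in instances and features
y_pred = X @ w_hat
print(w_hat)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is the usual, more numerically stable way to evaluate the same expression.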

I initially wanted to implement ***Batch Gradient Descent Optimization***, where the gradient of the cost function is computed with respect to every parameter over the full training set. An advantage of this method is that the gradients can be computed all at once using the following equation: ![Image of Gradient vector of the cost function](https://i.imgur.com/aR0Q4NQ.jpg)

For our sizable training set, Batch Gradient Descent is computationally very expensive because it uses the entire training data at every step. Another option was Stochastic Gradient Descent (SGD), which is faster because it picks a random point in the training set and computes the gradients from that one point alone. However, because of the method's stochastic nature, the cost function decreases only on average, so the final parameter values are good but not optimal. This can be mitigated by gradually reducing the learning rate, a process akin to simulated annealing, but the extra cost associated with this makes it undesirable. I instead used ***Mini-batch Gradient Descent Optimization***: unlike Batch Gradient Descent, which computes gradients on the entire training set, and unlike SGD, which uses a single point, it computes gradients on small subsets, or **mini-batches**, of the training set. This lowers the computational cost per step and takes advantage of efficient **tensor** operations.
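
The mini-batch update can be sketched in a few lines of NumPy. This is an illustration on a synthetic linear regression problem, not the notebook's TensorFlow training loop: the data, learning rate, and step count are assumptions, and only the sliding-window `offset` formula is borrowed from `run_ann`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w                                   # noiseless synthetic targets

batch_size, learning_rate, num_steps = 50, 0.1, 500
w = np.zeros(3)

for step in range(num_steps):
    # Same sliding-window batching as in run_ann
    offset = (step * batch_size) % (y.shape[0] - batch_size)
    X_b = X[offset:offset + batch_size]
    y_b = y[offset:offset + batch_size]

    # Gradient of the mean squared error, computed on the mini-batch only
    grad = 2.0 / batch_size * X_b.T @ (X_b @ w - y_b)
    w -= learning_rate * grad

print(w)
```

Setting `batch_size` to 1 turns this loop into SGD, and setting it to the full number of rows would correspond to Batch Gradient Descent (the `offset` line would then need to be replaced by using all rows).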

Computing the loss adds time complexity because the Softmax function is computationally expensive: it computes the exponential of every score and then normalizes them: ![Image of Softmax function](https://i.imgur.com/02UPYxW.jpg)

The cost function being minimized is the cross entropy, which measures how well the estimated probabilities match the target classes (*tf.nn.softmax_cross_entropy_with_logits*).
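
A small NumPy sketch of the two steps, softmax followed by cross entropy, may help make the definitions concrete. The function names and the example scores are illustrative assumptions; TensorFlow fuses both steps inside `tf.nn.softmax_cross_entropy_with_logits`.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, one_hot_labels):
    """Mean cross entropy between predicted probabilities and one-hot targets."""
    return -np.mean(np.sum(one_hot_labels * np.log(probs), axis=1))

logits = np.array([[2.0, 0.0], [0.0, 0.0]])     # made-up scores for two customers
labels = np.array([[1.0, 0.0], [0.0, 1.0]])

probs = softmax(logits)
print(probs)                                    # each row sums to 1
print(cross_entropy(probs, labels))
```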

The aforementioned gradients are computed by taking the derivative of y with respect to each tensor in the list. TensorFlow allows for automatic differentiation computed on a graph: ![Image of Gradients computed for a graph](https://i.imgur.com/3pdWiyI.jpg)
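
To illustrate the kind of result automatic differentiation produces, the sketch below compares the known closed-form gradient of softmax cross entropy with respect to the logits, softmax(z) - y, against a finite-difference estimate. This is only a numerical sanity check under made-up inputs; it is not how TensorFlow computes gradients internally, and all function names are hypothetical.

```python
import numpy as np

def loss_fn(z, y):
    """Softmax cross entropy for one example with logits z and one-hot label y."""
    shifted = z - z.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -np.sum(y * log_probs)

def analytic_grad(z, y):
    """Known closed form: d loss / d z = softmax(z) - y."""
    exp = np.exp(z - z.max())
    return exp / exp.sum() - y

def numeric_grad(f, z, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(z)
    for k in range(z.size):
        step = np.zeros_like(z)
        step[k] = eps
        grad[k] = (f(z + step) - f(z - step)) / (2 * eps)
    return grad

z = np.array([1.0, -0.5])      # made-up logits
y = np.array([0.0, 1.0])       # one-hot label

print(analytic_grad(z, y))
print(numeric_grad(lambda v: loss_fn(v, y), z))
```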

To reduce the likelihood of vanishing gradients, I chose ReLU as the activation function instead of a sigmoid or softplus function.

## Rectified linear function

![Image of Softplus_vs_Rectifier](https://i.imgur.com/y7WFOOC.png)
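
The difference between these activations shows up in their derivatives, which are the factors that get multiplied together during backpropagation. The sketch below evaluates them at a few arbitrary points; the helper names are illustrative.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)             # at most 0.25, near 0 for large |x|

def softplus_grad(x):
    return 1.0 / (1.0 + np.exp(-x))  # derivative of log(1 + e^x) is the sigmoid

def relu_grad(x):
    return (x > 0).astype(float)     # exactly 1 for positive inputs, 0 otherwise

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(x))
print(softplus_grad(x))
print(relu_grad(x))
```

A sigmoid never passes back more than a quarter of the incoming gradient, so the signal shrinks as layers are stacked, whereas ReLU passes it through unchanged for every active unit.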

The scaling behavior of backpropagation in this example can be understood through its time complexity. For ***x*** training points, ***y*** features, ***z*** hidden layers (***z*** = 1 here) with ***n*** neurons each, ***o*** output neurons, and ***i*** iterations, the time complexity is ***O(x · y · n^z · o · i)***.
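
As a rough worked example, the product inside the big-O expression can be evaluated with the values used in this notebook (17 features, 50 hidden units, 2 outputs, 2000 steps, 2666 training rows). Two assumptions are made here: z is taken as 1 for the single hidden layer built in `forward_prop`, and the mini-batch case substitutes the batch size of 50 for x because each step only touches one mini-batch. The result is a growth term, not an exact operation count, and the function name is hypothetical.

```python
def backprop_cost_estimate(x, y, n, z, o, i):
    """Evaluate the product inside O(x * y * n^z * o * i); a growth term, not an operation count."""
    return x * y * (n ** z) * o * i

# Every step over the full training split of 2666 rows
full_batch = backprop_cost_estimate(x=2666, y=17, n=50, z=1, o=2, i=2000)
# Assumption: with mini-batches each step only touches batch_size = 50 rows
mini_batch = backprop_cost_estimate(x=50, y=17, n=50, z=1, o=2, i=2000)

print(full_batch, mini_batch, full_batch / mini_batch)
```

Under these assumptions the mini-batch run scales about 53 times more cheaply than full-batch training for the same number of steps.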
