<a href="https://colab.research.google.com/github/marchaem/Binomial-classification-tensorflow/blob/master/Lead_Scoring_Neural_Nets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color=Darkred>Neural Networks for Binomial Classification</font>

---


## <font color=Darkblue>Developing Deep Learning with Keras  </font>

> __*Using Google Colab*__

In this exercise, we build a lead scoring model with a neural network.
<br>

The goal is to predict the probability that a customer is a qualified lead for a demo.
<br>

This is done by building a *binomial classifier*. It estimates the probability of being a qualified lead by applying supervised learning to past data containing demographic and behavioural information on users over the last 90 days.
<br>

<br><br>
### <font color=Darkred> 1. Import the basic Python libraries:</font>

*   **numpy**: scientific computing in Python.
*   **pandas**: data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables.
*   **matplotlib**: 2D plotting of publication-quality figures in a variety of hardcopy formats and interactive environments.

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

<br><br>
### <font color=Darkred>2. Upload the dataset</font>


In [0]:
from google.colab import files
uploaded = files.upload()

In [0]:
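# Save the uploaded file on the Virtual Machine's disk.
# files.upload() returns a dict mapping file names to bytes, so write in binary mode.
with open("lead_score_B2B_SaaS_to_do (3).csv", 'wb') as f:
    f.write(uploaded[list(uploaded.keys())[0]])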

In [0]:
# Once your file is on the Virtual Machine, you can check if the file is there.
!ls


<br><br>
### <font color=Darkred>**3. Declare the dataset as a data frame and visualise it**</font>

In [0]:
ls_data=pd.read_csv("lead_score_B2B_SaaS_to_do (3).csv")    # This imports the dataset as a dataframe
ls_data.head(10)      # This shows the first 10 rows of the data frame

In [0]:
ls_data.shape # It contains 4999 rows and 15 columns.

<br><br>
Let's also see which types of variables we have. 

There are two main types of **numerical** variables:

*   A *float* is a number that has a decimal point,
*   An *int* is a number without a decimal point.


**Categorical** or string variables are of two types:

* An *object* is text labelling many categories,
* A *boolean* indicates a binomial variable (True/False, Yes/No, etc.).

In [0]:
ls_data.dtypes

<br><br>
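To produce descriptive statistics and, later on, the model, we have to convert the categorical variables (*object* and *boolean*) into a numerical representation.<br>
This is called _one-hot encoding_. With `drop_first=True`, the first level of each variable is dropped, so a variable with $k$ levels yields $k-1$ indicator columns.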

In [0]:
ls_data = pd.read_csv("lead_score_B2B_SaaS_to_do (3).csv",
                   dtype={"tag_b2c": object, 
                          "subscribed_newsletter": object,
                          "clicked_last_email_link": object,
                          "webinar_attended": object,
                          "downloaded_ebook": object,
                          "visited_pricing_page": object,
                          "qualified_lead": object                         
                         },
                   index_col="hashed_id")  # This indicates that the column "hashed_id" is the index column

ls_data = pd.get_dummies(ls_data, drop_first=True) # This one-hot encodes the categorical variables
ls_data.head(10)
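
To see what `drop_first=True` does, here is a minimal sketch on a toy data frame. The rows and the `toy` / `toy_encoded` names are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and two categorical columns
toy = pd.DataFrame({
    "days_since_registration": [12, 40, 7],
    "country": ["US", "UK", "FR"],
    "qualified_lead": ["TRUE", "FALSE", "TRUE"],
})

# Each categorical column becomes indicator columns; the first level
# (in sorted order) is dropped, so "FR" and "FALSE" become the baselines.
toy_encoded = pd.get_dummies(toy, drop_first=True)
print(toy_encoded.columns.tolist())
```

This is why the outcome column is later referred to as `qualified_lead_TRUE`: the `FALSE` level is the dropped baseline.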

<br><br>
### <font color=Darkred>**4. Take a look at the descriptive stats**</font>

* What is the average number of days since registration?  
* What is the maximum number of days since last session?
* Which country has the most registered users?
* What is the proportion of users who have subscribed to the newsletter?
* What is the rate of qualified leads?

In [0]:
ls_data.describe().round(2)

In [0]:
# Create a histogram to observe the distribution of Recency (Days since last session)
ls_data['days_since_last_session'].hist(color='red', alpha=0.5, bins=20)

# Add labels
plt.title('Histogram')
plt.xlabel('Days Since Last Session')
plt.ylabel('Frequency')

<br><br>
### <font color=Darkred>5. Declare the outcome variable ($y$) and the set of features ($X_i$)</font>

In [0]:
y = ls_data.loc[:,"qualified_lead_TRUE"].values
y.shape

In [0]:
X = ls_data.loc[:,"days_since_registration":"visited_pricing_page_TRUE"].astype("float64")
X.shape

<br><br>
### <font color=Darkred>6. Set up the Training and Testing set</font>

In [0]:
# The scikit-learn library implements a set of machine learning tools
!pip install -U scikit-learn
from sklearn.model_selection import train_test_split

# Fix random seed for reproducibility
seed = 1337
np.random.seed(seed)

# Set the train ratio to 70%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

In [0]:
# Let's check the number of qualified leads in the training set
from collections import Counter
print('Training dataset shape {}'.format(Counter(y_train)))

There are 3499 users in the training set. Of these, 387 are qualified leads.
<br> 
To balance the training data, we can oversample the minority class by drawing samples from it at random with replacement.

In [0]:
# The imblearn.over_sampling module provides a set of methods to perform over-sampling.
!pip install -U imbalanced-learn 
from imblearn.over_sampling import RandomOverSampler 


In [0]:
# Random over-sampler
ros = RandomOverSampler()
# fit_sample was renamed fit_resample in newer imbalanced-learn releases
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
print('Resampled dataset shape {}'.format(Counter(y_resampled)))

Now we have an equal number of cases for qualified and non-qualified leads.
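
For intuition, random oversampling can be sketched in plain NumPy. This is only an illustration of the idea under simple assumptions, not the imbalanced-learn implementation; the helper `random_oversample` and the toy arrays are hypothetical.

```python
import numpy as np
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class (with replacement)
    until every class has as many rows as the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    index_parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the missing rows at random, with replacement
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        index_parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(index_parts)
    return X[idx], y[idx]

# Toy data: 8 negatives and 2 positives
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(Counter(y_bal.tolist()))
```

Only the training set is resampled; the test set keeps its original class balance so that the evaluation reflects real-world proportions.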

<br>
Our features are measured in different units (e.g. days since last session is a count of days, while the categorical variables are binary), but many machine learning algorithms, including neural networks, generally work better when all features are on the same scale.

Let's use the so-called **Min-Max scaling** (often also simply called “normalization”). In this approach, the data is scaled to a fixed range - usually 0 to 1.

In [0]:
# Feature Scaling
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler()
X_train = sc.fit_transform(X_resampled)
X_test = sc.transform(X_test)  # Reuse the scaling fitted on the training data
y_train = y_resampled
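
As a rough sketch of what Min-Max scaling computes, each feature $x$ is mapped to $(x - x_{min}) / (x_{max} - x_{min})$, where the minimum and maximum come from the training data only. The toy arrays and variable names below are assumptions made for illustration.

```python
import numpy as np

# Toy training and test data with two features on different scales
train_toy = np.array([[0.0, 10.0],
                      [5.0, 20.0],
                      [10.0, 30.0]])
test_toy = np.array([[2.5, 40.0]])

# "Fit": learn the per-column minimum and range from the training data only
col_min = train_toy.min(axis=0)
col_range = train_toy.max(axis=0) - col_min
col_range[col_range == 0] = 1.0  # guard against constant columns

# "Transform": apply the same mapping to both sets
train_scaled = (train_toy - col_min) / col_range
test_scaled = (test_toy - col_min) / col_range

print(train_scaled)
print(test_scaled)
```

Because the test set is scaled with the training statistics, a test value can fall outside the range 0 to 1, as the second feature does here. That is expected and keeps the two sets comparable.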

<br><br>
### <font color=Darkred>7. Start Keras</font>
Keras is a high-level API for building artificial neural networks. It uses TensorFlow for its under-the-hood operations. 

To install Keras, you must have TensorFlow installed on your machine. Colaboratory has TensorFlow already installed on the Virtual Machine.

In [0]:
# To determine which version of TensorFlow the virtual machine is using... 
!pip show tensorflow

In [0]:
# Importing the Keras libraries and packages.
!pip install -q keras
import keras
from keras import regularizers # Specifies the type of regularization
from keras.models import Sequential # Specifies the model
from keras.layers import Dense # Specifies the layers 
from keras.wrappers.scikit_learn import KerasClassifier # Wraps a Keras model so it can be used as a scikit-learn classifier
from keras.models import load_model # Enables saving and reloading a pre-trained model

# Install and import TensorBoard for Colaboratory
!pip install -U tensorboardcolab
from tensorboardcolab import TensorBoardColab, TensorBoardColabCallback
tbc=TensorBoardColab() # TensorFlow tool to visualize and explore your models

# More scikit-learn libraries and packages.
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_curve



<br><br>
### <font color=Darkred>8. Define the Neural Network model</font>

Let's start with a baseline model: a simple network with:

* One hidden layer (shallow) containing 100 neurons;
* Rectifier activation function (ReLU);
* L2 regularization ($\lambda$= 0.001);
* Adam optimizer.
<br><br>

In summary:

<font color=Darkblue>27 inputs </font>  $\rightarrow$ <font color=Darkblue>Hidden layer[ReLU: 100 nodes] </font> $\rightarrow$ <font color=Darkblue>1 output [sigmoid]: probability of Yes = 1</font>

<br>Below, see an example of how to code it.

```python
# Define baseline model 
def baseline_model():
	# create model
	model = Sequential()
	model.add(Dense(100, input_dim=27, kernel_regularizer=regularizers.l2(0.001), activation='relu'))  # Hidden layer
	model.add(Dense(1, activation='sigmoid')) #Output layer
	# Compile model
	model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model
```
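
To make the architecture concrete, here is a minimal NumPy sketch of the forward pass and loss of such a one-hidden-layer network. It is not how Keras implements it internally: the weights are random placeholders, the names (`W1`, `forward`, `regularized_loss`, and so on) are hypothetical, and the L2 penalty is assumed to be $\lambda$ times the sum of squared hidden-layer weights.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_inputs, n_hidden = 27, 100

# Random placeholder weights; training would adjust these
W1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)

def forward(X):
    """Hidden layer with ReLU, then a single sigmoid output unit."""
    hidden = relu(X @ W1 + b1)
    return sigmoid(hidden @ W2 + b2).ravel()

def regularized_loss(X, y, lam=0.001):
    """Binary cross-entropy plus an L2 penalty on the hidden-layer weights."""
    p = np.clip(forward(X), 1e-7, 1 - 1e-7)
    bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return bce + lam * np.sum(W1 ** 2)

X_demo = rng.random((5, n_inputs))   # five fake, already-scaled users
y_demo = np.array([0, 1, 0, 0, 1])
probs_demo = forward(X_demo)
loss_demo = regularized_loss(X_demo, y_demo)
print(probs_demo.shape, loss_demo)
```

The single sigmoid unit returns one number per user, which is read as the probability of the positive class.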

<br>
Now, let's build a deep neural network with:

* Two hidden layers, containing 100 and 50 neurons respectively;
* Rectifier activation function (ReLU) in the hidden layers;
* L2 regularization ($\lambda$ = 0.001);
* Adam optimizer.

<br>
In summary:

<font color=Darkblue>27 inputs </font>  $\rightarrow$ <font color=Darkblue>Hidden layers[ReLU: 100 nodes, 50 nodes] </font> $\rightarrow$  <font color=Darkblue>1 output [sigmoid]: probability of Yes = 1</font>

<br>Execute the code below.

In [0]:
# Define the neural net model 
def nnet_model():
    # Create model
    model = Sequential()
    model.add(Dense(100, input_dim = 27,
                    kernel_regularizer=regularizers.l2(0.001), 
                    activation='relu'))  #First hidden layer
    model.add(Dense(50, 
                    kernel_regularizer=regularizers.l2(0.001),
                    activation='relu'))  #Second hidden layer
    model.add(Dense(1, 
                    activation='sigmoid')) #Output layer
    
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [0]:
# Get a summary of the model
model = nnet_model()
model.summary()
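
As a sanity check on the summary, the number of trainable parameters can be worked out by hand. The sketch below assumes the layer sizes used in `nnet_model()` (27 inputs, then 100, 50 and 1 units); each dense layer has one weight per input-output pair plus one bias per unit.

```python
# Layer sizes: inputs, first hidden, second hidden, output
layer_sizes = [27, 100, 50, 1]

# Weights (n_in * n_out) plus biases (n_out) for each dense layer
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer)  # [2800, 5050, 51]
print(total_params)      # 7901
```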

<br><br>
### <font color=Darkred>9. Train the model</font>

In [0]:
train_hist = model.fit(X_train, y_train, validation_split=0.30, 
                       epochs=400, batch_size=200, verbose=1, 
                       callbacks=[TensorBoardColabCallback(tbc)])

# While the model is trained we can follow its learning history (verbose=1)

In [0]:
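# summarize history for accuracy
print(train_hist.history.keys())
# Older Keras releases name the metric 'acc'; newer ones name it 'accuracy'
acc_key = 'acc' if 'acc' in train_hist.history else 'accuracy'
plt.plot(train_hist.history[acc_key])
plt.plot(train_hist.history['val_' + acc_key])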
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

In [0]:
# summarize history for loss
plt.plot(train_hist.history['loss'])
plt.plot(train_hist.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

<br><br>
### <font color=Darkred>10. Evaluate the model on the test set</font>

In [0]:
accuracy = model.evaluate(X_test, y_test, batch_size=200)
print("%s: %.2f%%" % (model.metrics_names[1], accuracy[1]*100))

Then, we use the model on the reserved test data to generate the probability values.
<br>

In [0]:
# Predict the test set results as probabilities
y_score = model.predict(X_test).ravel()

After that, we use the probabilities and ground-truth labels to generate the two arrays needed to plot a ROC curve:

* ```fpr```: false positive rates for each possible threshold;
* ```tpr```: true positive rates for each possible threshold.

<br>We can call the ```roc_curve()``` function to generate the two. Here is the code to make this happen:

In [0]:
# Generating true positive rates and false positive rates
fpr_nnet, tpr_nnet, thresholds = roc_curve(y_test, y_score)

# The area under the curve (AUC) value can also be calculated 
auc_nnet = auc(fpr_nnet, tpr_nnet)
print(" Test ROC AUC: %0.3f " % auc_nnet.round(3))
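
To illustrate what the ROC computation does, the sketch below builds the curve by hand on four toy predictions. It is a simplified illustration rather than the scikit-learn implementation; `roc_point` and the `_toy` arrays are hypothetical names.

```python
import numpy as np

def roc_point(y_true, y_score, threshold):
    """Return (fpr, tpr) when scores >= threshold are predicted positive."""
    y_pred = y_score >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    return fp / np.sum(y_true == 0), tp / np.sum(y_true == 1)

y_true_toy = np.array([0, 0, 1, 1])
y_score_toy = np.array([0.1, 0.4, 0.35, 0.8])

# One point per distinct score, from the strictest threshold to the loosest,
# starting from (0, 0) where nothing is predicted positive
thresholds_toy = np.sort(np.unique(y_score_toy))[::-1]
points = [(0.0, 0.0)] + [roc_point(y_true_toy, y_score_toy, t) for t in thresholds_toy]
fpr_toy = np.array([p[0] for p in points])
tpr_toy = np.array([p[1] for p in points])

# Area under the curve with the trapezoidal rule
auc_toy = np.sum(np.diff(fpr_toy) * (tpr_toy[1:] + tpr_toy[:-1]) / 2)
print(auc_toy)  # 0.75
```

An AUC of 0.5 corresponds to random guessing (the dashed diagonal in a ROC chart), and 1.0 to a perfect ranking of positives above negatives.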

In [0]:
# Generate a ROC AUC chart
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_nnet, tpr_nnet, label='[ReLU: 100, 50] (area = {:.3f})'.format(auc_nnet))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()

In [0]:
# Plot the Precision-Recall curve
precision, recall, _ = precision_recall_curve(y_test, y_score)

plt.step(recall, precision, color='b', alpha=0.2,
         where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2,
                 color='b')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.0])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall curve');

In [0]:
# Building the Confusion Matrix
y_pred = (y_score>0.8)
cm = confusion_matrix(y_test, y_pred)  
cm

This is how to read it (rows are actual classes, columns are predicted classes):
*  True Negatives = top left;
*  False Negatives = bottom left;
*  False Positives = top right;
*  True Positives = bottom right.
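
The layout can be checked on a small example. This sketch counts the four cells by hand with NumPy, using made-up labels and scores and the same 0.8 threshold as above; the `_toy` names are hypothetical.

```python
import numpy as np

y_true_toy = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_score_toy = np.array([0.1, 0.9, 0.3, 0.85, 0.4, 0.2, 0.95, 0.6])
y_pred_toy = (y_score_toy > 0.8).astype(int)

# Rows index the actual class, columns the predicted class
cm_toy = np.zeros((2, 2), dtype=int)
np.add.at(cm_toy, (y_true_toy, y_pred_toy), 1)

tn, fp, fn, tp = cm_toy.ravel()
precision_toy = tp / (tp + fp)  # share of predicted leads that are real leads
recall_toy = tp / (tp + fn)     # share of real leads that were found

print(cm_toy)
print(precision_toy, recall_toy)
```

Raising the threshold usually trades recall for precision, which is what the Precision-Recall curve shows across all thresholds.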

<br>Finally, you can save the model...

In [0]:
!pip install h5py
model.save('nnet_DDMMYY.h5')
!ls  # To check whether the model has been saved to your virtual machine.
