# Project Walkthrough: Analyzing Sales Data

Here are the imports used throughout the rest of this walkthrough. We need sklearn, pandas, numpy, and matplotlib.
* conda doesn't have __`pdpbox`__, so you'll need to __`pip install pdpbox`__

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from pdpbox import pdp

### The data comes from the shared Google Drive folder. Link it under the working directory from which you launch Jupyter.

In [None]:
dat = pd.read_csv("ml_course_shared/WA_Fn-UseC_-Sales-Win-Loss.csv")

### The data is sales information for different opportunities, including contextual information such as region, timeframe, client details, and whether or not the opportunity was a success (Won).

In [None]:
dat

* The data above contains categorical data.  Unfortunately, __`scikit-learn`__ does not support categorical data for most of its model types
* As a first step, we will turn categorical label fields into a series of per-label boolean fields (this process is called _binarization_, also known as one-hot encoding)
* The cell below prints out the list of original field names as well as the final binarized versions
* We also need to drop some fields from training because they are not valid measurements (Opportunity Number, which is an ID), they are *[leakage](https://www.kaggle.com/wiki/Leakage)* fields (Opportunity Amount USD, which is only populated if the opportunity is won), or they are the objective field itself (Opportunity Result)

In [None]:
output = dat['Opportunity Result'] == "Won"
dat_filtered = dat.drop(["Opportunity Result","Opportunity Number","Opportunity Amount USD"],axis=1)
dat_filtered = pd.get_dummies(dat_filtered)
print('***ORIGINAL***\n', dat.columns)
print('\n')
print('***BINARIZED***\n', dat_filtered.columns)
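
To make the binarization step concrete, here is a minimal sketch on a toy frame. The column names and values are hypothetical and are not taken from the sales dataset; the only call used is `pd.get_dummies`, as in the cell above.

```python
import pandas as pd

# Toy frame with hypothetical values (not from the sales dataset).
toy = pd.DataFrame({
    "Region": ["Northwest", "Pacific", "Northwest"],
    "Elapsed Days": [12, 40, 7],
})

# Numeric columns pass through unchanged; each label of a categorical
# column becomes its own indicator column named "<column>_<label>".
toy_binarized = pd.get_dummies(toy)
print(list(toy_binarized.columns))
print(toy_binarized)
```

Each distinct label gets its own column, so a field with many distinct labels can widen the training frame considerably.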

* For most modeling tasks, we need to split the data into a training set and a test set
* We train the model on one set of data, and then evaluate it on another.  Models are always at risk of *[overfitting](https://en.wikipedia.org/wiki/Overfitting)* the data they are trained on
* Evaluating a model on new data gives a better understanding of its potential performance in a real world scenario
* The function below splits the filtered data and objective field into train/test counterparts
* The result below shows the shape of the two datasets generated:  ~68K records for training, and ~10K records for evaluation
* There's no hard and fast rule for what ratio to use; generally, an evaluation set that contains 10%-20% of the original data is a good place to start
* You'll also need a minimum number of records for evaluation, which differs depending on the dataset characteristics

For a more formal evaluation, consider using [cross-fold validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))

In [None]:
train_split, test_split, train_output, test_output = train_test_split(dat_filtered, output, test_size=10000)
print(train_split.shape)
print(test_split.shape)
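
As a rough illustration of the cross-fold validation mentioned above, the sketch below scores a logistic regression with 5-fold cross-validation. It runs on synthetic data from `make_classification` rather than the sales data, and the fold count, `max_iter`, and `roc_auc` scoring are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; swap in dat_filtered / output).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each of the 5 folds is held out once while the model trains on the rest.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")

print("per-fold AUC:", scores)
print("mean AUC:", scores.mean(), "std:", scores.std())
```

The spread across folds gives a sense of how much a single train/test split could vary by chance.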

* Once the data is prepared, training is a snap
* We create the model, and pass in the train/objective data as arguments

In [None]:
logreg = LogisticRegression()
logclf = logreg.fit(train_split, train_output)

* Logistic regression models have one coefficient for each feature they are trained on
* Inspecting these coefficients reveals which features the model believes are important for prediction, and whether those features have a positive or negative impact on the predicted score
* We can look at a sorted list for the best-to-worst indicators of a successful opportunity.

In [None]:
coef = pd.DataFrame({"coef" : logclf.coef_[0].tolist()}, index=train_split.columns)
coef.sort_values('coef', ascending=False)
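
Logistic regression coefficients are expressed in log-odds, so exponentiating one gives the factor by which the odds of a win are multiplied for a one-unit increase in that feature. The sketch below uses hypothetical coefficient values, not values from the fitted model.

```python
import math

# Hypothetical coefficients for illustration only (not from logclf).
coefs = {"feature_a": 0.7, "feature_b": -1.2, "feature_c": 0.0}

# exp(coef) > 1 raises the odds, < 1 lowers them, == 1 leaves them unchanged.
odds_ratios = {name: math.exp(value) for name, value in coefs.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: odds multiplied by {ratio:.2f} per unit increase")
```

Coefficient magnitudes are only directly comparable when the features are on similar scales, which is worth keeping in mind when reading the sorted list.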

* We'll create a small helper function for evaluation
* This function creates an *[ROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)* curve, which helps us understand the model's performance across a range of decision thresholds
* The area under the ROC curve ([AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve)) is a good overall performance metric for a binary classifier

In [None]:
def performance(model, data, actual):
    probas = model.predict_proba(data)
    fpr, tpr, thr = roc_curve(actual, probas[:,1])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, 'b', label="AUC = %0.2f" % roc_auc)
    plt.plot([0,1],[0,1],'r--')
    plt.title('Receiver Operating Characteristic')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.legend(loc='lower right')
    plt.show()

In [None]:
performance(logclf, test_split, test_output)
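
One way to build intuition for AUC: it equals the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it that way on a handful of hypothetical scores. The helper name `pairwise_auc` is made up for this example, and the quadratic loop is only practical for small inputs.

```python
def pairwise_auc(actual, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for a, s in zip(actual, scores) if a]
    neg = [s for a, s in zip(actual, scores) if not a]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted scores.
actual = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]
print(pairwise_auc(actual, scores))
```

A model that ranks every positive above every negative scores 1.0, while random scoring lands near 0.5, which matches the dashed diagonal in the plot.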

* Creating and evaluating the random forest model proceeds the same way
* Note that the random forest model has a higher AUC than Logistic Regression
* On a pure performance basis, random forest is the winner
* However, random forest models are more complex than linear models, both in the algorithmic sense, and in the interpretability sense

In [None]:
rf = RandomForestClassifier()
rfclf = rf.fit(train_split,train_output)

In [None]:
performance(rfclf, test_split, test_output)

* Random forests do not use coefficients; instead, they have a series of split criteria scattered across a multitude of decision trees
* We calculate feature importance by measuring how much each split point reduces impurity for its given feature, for example via *[information gain](https://en.wikipedia.org/wiki/Information_gain_in_decision_trees)* (the default criterion in __`scikit-learn`__ is Gini impurity)
* If a split point uses a feature to effectively separate a large amount of data, the importance score for that split is added to the feature
* All importance scores are then tallied
* __`scikit-learn`__ provides this information as part of the fitted model object

* Note that we *cannot* deduce whether the feature has an overall positive or negative effect on the score based on its importance
* Inside a non-linear model such as a random forest, a given feature value could have positive *and* negative effects at multiple parts of a tree, and in multiple trees in the forest

In [None]:
importances = rfclf.feature_importances_
pd.DataFrame({"feature" : train_split.columns, "importance" : importances}).sort_values("importance", ascending=False)
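
As a small check on the point that importances carry no sign, the sketch below fits a forest on synthetic data (an assumption, standing in for the sales data) and confirms that the scores are non-negative and normalized to sum to 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the sizes here are arbitrary choices.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

importances = forest.feature_importances_
print("all non-negative:", bool(importances.min() >= 0))
print("sum of importances:", float(importances.sum()))
```

Because every score is a non-negative share of the total, a large value only says the feature was useful for splitting, not which direction it pushes the prediction.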

* It's still possible to understand the effects of a given feature range using techniques like *[partial dependence plots](https://towardsdatascience.com/introducing-pdpbox-2aa820afd312)*
* These techniques evaluate the predicted value for the given data, while sweeping a given feature through its entire range
* It's possible to understand if a given feature produces a linear response in the predicted value, or if it suggests a more complex (or non-linear) function  
* In this case, we'll look at __`Total Days Identified Through Closing`__, i.e., the number of days the deal has been active
  * It's clear that the likelihood of a deal closing increases with its age, but only to a point
  * After that point, the likelihood of it closing falls off dramatically
  * A more complex distribution like this is not going to be modeled well by a simpler model
  * However, the random forest is able to capture this behavior through training

* As an aside, __`scikit-learn`__ does not provide a default partial dependence plot function for random forests; a separate library is utilized here
* The ICEplot title refers to the paper and technique for *[Individual Conditional Expectation](https://arxiv.org/abs/1309.6392)*

In [None]:
pdp_elapsed = pdp.pdp_isolate(rfclf, train_split, 'Total Days Identified Through Closing')
pdp.pdp_plot(pdp_elapsed, 'Total Days Identified Through Closing', plot_org_pts=True, plot_lines=True, frac_to_plot=1000)
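
The sweep described above can also be sketched by hand, without a plotting library: for each grid value, overwrite the feature in every row, predict, and average. This is a simplified illustration on synthetic data, not the `pdpbox` implementation. The helper name `manual_partial_dependence` and the toy target (positive only when the first feature is near zero) are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def manual_partial_dependence(model, X, feature_idx, grid):
    """Average predicted probability with one feature forced to each grid value."""
    averages = []
    for value in grid:
        X_swept = X.copy()
        X_swept[:, feature_idx] = value          # force the feature to this value
        averages.append(model.predict_proba(X_swept)[:, 1].mean())
    return np.array(averages)

# Synthetic data with a non-monotonic response in feature 0.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 3))
y = (np.abs(X[:, 0]) < 1).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
grid = np.linspace(-3, 3, 7)
pdp_values = manual_partial_dependence(model, X, 0, grid)
print(dict(zip(grid.round(1), pdp_values.round(2))))
```

The averaged probability rises toward the middle of the range and falls again at the ends, a shape that a single linear coefficient cannot express.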

* If an opportunity is __`won`__, the data provides additional information on the amount of money the opportunity is worth
* We can capture this data and model it separately as a linear regression
* We just need to filter for __`won`__ opportunities

In [None]:
won = dat[dat["Opportunity Result"] == "Won"].drop(["Opportunity Result"], axis=1)
won = pd.get_dummies(won)
won_output = won["Opportunity Amount USD"]
won_filtered = won.drop(["Opportunity Number", "Opportunity Amount USD"],axis=1)
won_train, won_test, won_train_output, won_test_output = train_test_split(won_filtered, won_output, test_size=10000)

In [None]:
regr = LinearRegression()
linreg = regr.fit(won_train, won_train_output)

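* Linear regression coefficients can be interpreted in much the same way as logistic regression coefficients: the sign gives the direction of the effect and the magnitude its strength, here in USD per unit of the feature rather than in log-odds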

In [None]:
coef = pd.DataFrame({"coef" : linreg.coef_.tolist()},index=won_train.columns)
coef.sort_values('coef', ascending=False)

* Evaluating the model for a linear regression typically involves understanding the error of the prediction
* We need to know how "off" the model is on average (_mean absolute error_)
* It's also helpful to put this number in context by giving the mean of the opportunity amount as well

In [None]:
from sklearn.metrics import mean_absolute_error
y_pred = linreg.predict(won_test)
y_actual = won_test_output
print("mae : $" , mean_absolute_error(y_actual, y_pred))
print("mean: $", np.mean(y_actual))
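
For reference, mean absolute error is simply the average of the absolute differences between actual and predicted values. The sketch below computes it by hand on hypothetical amounts and expresses it relative to the mean, which is one way to put the number in context. The helper name `mean_abs_error` is made up for this example.

```python
def mean_abs_error(actual, predicted):
    """Average of |actual - predicted| over all records."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical opportunity amounts in USD.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

mae = mean_abs_error(actual, predicted)
mean_actual = sum(actual) / len(actual)
relative_error = mae / mean_actual     # MAE as a fraction of the mean amount

print("mae :", mae)
print("mean:", mean_actual)
print("mae / mean:", relative_error)
```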

* We can dig deeper into the results of the model by better understanding the distribution of actual values
  * To do that, we can create a small histogram generation function.

In [None]:
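def gen_histogram(dist, x_label, y_label, main, log_scale=False):
    # density=True normalizes the histogram (replaces the removed `normed` argument)
    n, bins, patches = plt.hist(dist, 100, density=True, facecolor='green', alpha=0.75)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(main)
    plt.grid(True)
    if log_scale:
        # `nonpositive` replaces the older `nonposy` keyword in recent matplotlib
        plt.yscale('log', nonpositive='clip')
    plt.show()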


* We can inspect the histogram of opportunity amounts
  * This looks like a fairly standard distribution with an exponential fall-off
  * We can examine this information in original or log scaling

In [None]:
gen_histogram(y_actual, 'USD','Probability','Histogram of Actual Opportunity Amount USD')
gen_histogram(y_actual, 'USD','Probability','(Log) Histogram of Actual Opportunity Amount USD', True)

* We can examine the distribution of *residuals* from the model
  * The residuals are the predicted amounts subtracted from the original opportunity amounts

* The histogram shows that most of the errors are centered around 0, which is a good sign
  * The distribution looks a bit bimodal... or rather trimodal
  * Finding lumps of errors and/or misclassifications like this typically can highlight records that the model gets consistently wrong

In [None]:
residual = y_actual - y_pred
gen_histogram(residual, 'Error','Probability','Histogram of Residual Predicted Opportunity Amount USD')
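
One possible follow-up to lumps in the residual histogram is to group the residuals by a categorical field and look for segments with a consistently large average error. The sketch below shows the idea on a hypothetical frame; the `segment` column and its values are invented for illustration and do not come from the sales data.

```python
import pandas as pd

# Hypothetical residuals and segment labels, for illustration only.
toy = pd.DataFrame({
    "segment": ["A", "A", "B", "B"],
    "residual": [10.0, -10.0, 500.0, 700.0],
})

# A mean far from zero suggests the model is consistently off for that group.
mean_residual = toy.groupby("segment")["residual"].mean()
print(mean_residual.sort_values(ascending=False))
```

A segment whose mean residual sits far from zero is a candidate for a missing feature or a separate model.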