# Classification Methods, SVMs, Tuning and CV


Like R, Python relies on packages for data mining and machine learning. The three most common ones are pandas (data manipulation), scikit-learn (machine learning), and Matplotlib (graphics).

In [None]:
#Add packages
%matplotlib inline 
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, auc, confusion_matrix, classification_report
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
import scipy.stats as ss
from sklearn.neighbors import KNeighborsClassifier
import time
from operator import itemgetter
import os
os.getcwd()

In [None]:
cd '/Users/mpgartland/Documents/Courses/Data Mining/Week 5'

# Read in Data
# Churn Calls Data
This is a Pandas operation.

In [None]:
#import data
df = pd.read_csv("Churn_Calls.csv", sep=',')
df.head(10)

In [None]:
# See each column name
print(df.columns)

In [None]:
df.shape

# Target
In this step I moved the target variable to the first column. I also created a reference to it called targetName, which helps with some of the steps below.

In [None]:
# designate target variable name
targetName = 'churn'
# move target variable into first column
targetSeries = df[targetName]
del df[targetName]
df.insert(0, targetName, targetSeries)
expected=targetName
df.head(10)

# EDA
Just a touch of EDA. This is the distribution of the target. As you can see, the dataset is imbalanced and the target class of interest, "yes", is in the minority (a common occurrence in classification).

In [None]:
gb = df.groupby(targetName)
targetEDA=gb[targetName].aggregate(len)
plt.figure()
targetEDA.plot(kind='bar', grid=False)
plt.axhline(0, color='k')

# Preprocessing
The two steps below are for preprocessing. The first cell converts the yes/no values of the target to numeric, because some models require a numeric target. The second cell takes all the categorical features and creates dummy variables from them. This is stock code I have used for a long time (and I did not write it). It is nice because it takes a dataframe of any size and handles its categorical features without my changing a single line, so it can be used generically on basically any dataframe. That saves a lot of time compared with coding each feature by hand.

In [None]:
from sklearn import preprocessing
le_dep = preprocessing.LabelEncoder()
#to convert into numbers
df['churn'] = le_dep.fit_transform(df['churn'])

In [None]:
# perform data transformation
for col in df.columns[1:]:
	attName = col
	dType = df[col].dtype
	missing = pd.isnull(df[col]).any()
	uniqueCount = len(df[attName].value_counts(normalize=False))
	# discretize (create dummies)
	if dType == object:
		df = pd.concat([df, pd.get_dummies(df[col], prefix=col)], axis=1)
		del df[attName]
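To make the dummy-coding step concrete, here is a minimal pure-Python sketch of what one-hot (dummy) encoding produces for a single categorical column. The helper `one_hot` and the toy `plan` column are hypothetical and only for illustration; in the cell above the real work is done by `pd.get_dummies`.

```python
def one_hot(values, prefix):
    """Return one 0/1 indicator column per distinct value, named prefix_value."""
    levels = sorted(set(values))
    return {
        "{}_{}".format(prefix, level): [1 if v == level else 0 for v in values]
        for level in levels
    }

# Hypothetical categorical column with two levels
plan = ["yes", "no", "no", "yes"]
dummies = one_hot(plan, "plan")
for name in sorted(dummies):
    print(name, dummies[name])
```

Each categorical column is replaced by as many indicator columns as it has levels, which is why the feature count grows after this step.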

# Test/Train
I split the data into a 60/40 train/test split. The features are stored in "features_train" and "features_test". The targets are in "target_train" and "target_test". I use a bigger test set when the data is imbalanced.

In [None]:
# split dataset into testing and training
# .iloc selects by position: column 0 is the target, the remaining columns are features
features_train, features_test, target_train, target_test = train_test_split(
    df.iloc[:,1:].values, df.iloc[:,0].values, test_size=0.40, random_state=0)

Just a view of the size of each test/train set.
Note there are now 73 features, and the test set is imbalanced (14.6%)

In [None]:
print(features_test.shape)
print(features_train.shape)
print(target_test.shape)
print(target_train.shape)
print("Proportion of target that is Yes: {}".format(target_test.mean()))

# Models
All the models are built with scikit-learn.

# Decision Tree
I created a decision tree from the data. The model's accuracy on the test data was about 92%. However, notice that the "yes" class (the class I am interested in) was classified with only 74% precision and 71% recall. That is so-so. Again, this is not uncommon with imbalanced data.

In [None]:
#Decision Tree train model
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features_train, target_train)
#DT test model
target_predicted_dt = clf.predict(features_test)
print("DT Accuracy Score: {}".format(accuracy_score(target_test, target_predicted_dt)))
# print classification report
target_names = ["Churn = no", "Churn = yes"]
print(classification_report(target_test, target_predicted_dt, target_names=target_names))
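The gap between overall accuracy and the minority-class scores is easier to see when the metrics are computed by hand. The sketch below uses made-up confusion-matrix counts (not the results of this notebook), and `binary_metrics` is a hypothetical helper; it applies the standard definitions that `classification_report` reports for the positive class.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 for the positive class, plus overall accuracy."""
    precision = tp / float(tp + fp)   # of predicted "yes", how many were right
    recall = tp / float(tp + fn)      # of actual "yes", how many were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / float(tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Made-up counts for an imbalanced test set: 100 actual "yes", 900 actual "no"
precision, recall, f1, accuracy = binary_metrics(tp=70, fp=30, fn=30, tn=870)
print("precision={:.2f} recall={:.2f} f1={:.2f} accuracy={:.2f}".format(
    precision, recall, f1, accuracy))
```

With these counts the accuracy is high because the majority class dominates, while precision and recall on the minority class stay noticeably lower.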

# Cross Validation of Decision Tree
I cross-validated with 10 folds. You can see the accuracy score for each fold and the mean. The mean is .92, which is quite close to the original model. I am not going to worry about overfitting.

In [None]:
#verify DT with Cross Validation
scores = cross_val_score(clf, features_train, target_train, cv=10)
print("Cross Validation Score for each fold: {}".format(scores))
scores.mean()
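As a rough illustration of what `cv=10` does, the sketch below splits sample indices into folds so that every sample is held out exactly once. `k_fold_indices` is a hypothetical helper written for this example. It is plain k-fold; for a classifier with an integer `cv`, scikit-learn uses a stratified variant that also keeps class proportions similar across folds.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; each sample is in exactly one test fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=25, n_folds=10))
for train_idx, test_idx in folds[:2]:
    print("held out:", test_idx, "train size:", len(train_idx))
```

The model is refit once per fold, and the reported scores are the accuracies on each held-out part.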

# Visual of Confusion Matrix for Decision Tree

In [None]:
# display confusion matrix
cm = confusion_matrix(target_test, target_predicted_dt)
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
print(cm)

# Random Forest
Using the same data, I built a random forest with 500 bootstrapped trees. Notice I parallelized this across all available cores (n_jobs=-1; 4 on my machine), as a big random forest can be computationally expensive.

My overall results went up by 3% over the decision tree. The precision for my minority target class also improved, but the recall decreased.

In [None]:
# train random forest model
#parallelized across all available cores (n_jobs=-1)
rf = RandomForestClassifier(n_estimators= 500, n_jobs=-1,oob_score=True)
rf.fit(features_train, target_train)
# test random forest model
target_predicted_rf = rf.predict(features_test)
print(accuracy_score(target_test, target_predicted_rf))
target_names = ["Churn = no", "Churn = yes"]
print(classification_report(target_test, target_predicted_rf, target_names=target_names))
print(confusion_matrix(target_test, target_predicted_rf))


# Cross Validation of Random Forest
I cross-validated with 10 folds. You can see the accuracy score for each fold and the mean. The mean is .949, which is quite close to the original model. I am not going to worry about overfitting.

In [None]:
#verify RF with cross validation
scores_rf = cross_val_score(rf, features_train, target_train, cv=10, n_jobs=-1)
print("Cross Validation Score for each fold: {}".format(scores_rf))
scores_rf.mean()

# Visual of Confusion Matrix for Random Forest

In [None]:
# display confusion matrix
cm = confusion_matrix(target_test, target_predicted_rf)
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
print(cm)

# Model Tuning
You can tune any argument in these models. I did a grid search only on max_features (mtry in R), parallelized across all available cores for speed. You can see that a max_features (mtry) of 5 had the best results, but frankly there was very little difference from the other parameter values.

In [None]:
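from sklearn.model_selection import GridSearchCV
# grid over max_features only
param_grid = {"max_features": [2, 3, 4, 5]}
start_time = time.time()

# run grid search
grid_search = GridSearchCV(rf, param_grid=param_grid, n_jobs=-1)
grid_search.fit(features_train, target_train)

print("{} seconds".format(time.time() - start_time))
# mean cross-validated accuracy for each value of max_features
for params, score in zip(grid_search.cv_results_["params"],
                         grid_search.cv_results_["mean_test_score"]):
    print("{} {:.4f}".format(params, score))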
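A grid search simply tries every combination of the listed values. The sketch below enumerates the candidates with `itertools.product`; `expand_grid` is a hypothetical helper, and the `n_estimators` values are an assumption added only to show a two-parameter grid (the cell above tunes `max_features` alone).

```python
from itertools import product

def expand_grid(param_grid):
    """Return one dict per combination of parameter values."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

# max_features values match the cell above; n_estimators values are illustrative
candidates = expand_grid({"max_features": [2, 3, 4, 5], "n_estimators": [100, 500]})
print(len(candidates), "candidate settings")
print(candidates[0])
```

Each candidate is fit once per cross-validation fold, so the total cost is the number of candidates times the number of folds, which is why grids grow expensive quickly.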


# KNN
I performed KNN with K=3 and K=5. The accuracy was 85% and 87%, respectively, and I still have problems with the minority class. KNN and Decision Tree perform about the same. I find this to be true frequently, which is why I use them as my base comparative models.

In [None]:
neigh3 = KNeighborsClassifier(n_neighbors=3)
neigh3.fit(features_train, target_train)
# test KNN 3
target_predicted_knn3 = neigh3.predict(features_test)
print(accuracy_score(target_test, target_predicted_knn3))
target_names = ["Churn = no", "Churn = yes"]
print(classification_report(target_test, target_predicted_knn3, target_names=target_names))

In [None]:
neigh5 = KNeighborsClassifier(n_neighbors=5)
neigh5.fit(features_train, target_train)
# test KNN 5
target_predicted_knn5 = neigh5.predict(features_test)
print(accuracy_score(target_test, target_predicted_knn5))
target_names = ["Churn = no", "Churn = yes"]
print(classification_report(target_test, target_predicted_knn5, target_names=target_names))
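For readers new to KNN, here is a minimal pure-Python sketch of the idea: find the K closest training points by Euclidean distance and take a majority vote. `knn_predict` and the toy points are hypothetical; `KNeighborsClassifier` does the same job far more efficiently.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest points."""
    distances = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(point, query))), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy data: three points near the origin (class 0), two near (1, 1) (class 1)
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
labels = [0, 0, 0, 1, 1]
pred_near_origin = knn_predict(points, labels, (0.15, 0.15), k=3)
pred_near_one = knn_predict(points, labels, (0.95, 1.0), k=3)
print(pred_near_origin, pred_near_one)
```

Because the vote depends on raw distances, features measured on large scales can dominate the result, so scaling the features is often worth trying with KNN.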

# More Details
Now that we know the random forest was the best of the three models I ran, I will gather some other information. Below is an unordered list of feature importances. I show only the first 20 to save space.

In [None]:
#Show importance of the first 20 features in Random Forest
list(zip(df.columns[1:21], rf.feature_importances_))
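If an ordered view is more useful, the name/importance pairs can be sorted by importance. This is a small sketch with hypothetical feature names and values; `rank_features` is a helper invented for this example and uses `itemgetter`, which is already imported at the top of the notebook.

```python
from operator import itemgetter

def rank_features(names, importances, top_n=20):
    """Pair names with importances and return the top_n, largest first."""
    pairs = zip(names, importances)
    return sorted(pairs, key=itemgetter(1), reverse=True)[:top_n]

# Hypothetical feature names and importance values
names = ["feature_a", "feature_b", "feature_c"]
importances = [0.10, 0.65, 0.25]
ranked = rank_features(names, importances)
for name, value in ranked:
    print("{:<12} {:.2f}".format(name, value))
```

In this notebook the call would be `rank_features(df.columns[1:], rf.feature_importances_)`, since the feature columns start at position 1.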

# ROC curve for Random Forest
Finally a ROC curve that shows the lift I get from the Random Forest model. 

In [None]:
# Determine the false positive and true positive rates
fpr, tpr, _ = roc_curve(target_test, rf.predict_proba(features_test)[:,1]) 
    
# Calculate the AUC
roc_auc = auc(fpr, tpr)
print('ROC AUC: %0.3f' % roc_auc)
 
# Plot of a ROC curve for a specific class
plt.figure()
plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
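The area under the ROC curve is computed with the trapezoidal rule over the (false positive rate, true positive rate) points. The sketch below does that by hand on made-up points; `trapezoid_auc` and the toy values are hypothetical and are not the results of this notebook.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a curve given by points sorted by increasing fpr."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2.0
        area += width * mean_height
    return area

# Made-up ROC points, from (0, 0) to (1, 1)
fpr_toy = [0.0, 0.1, 0.4, 1.0]
tpr_toy = [0.0, 0.6, 0.9, 1.0]
toy_auc = trapezoid_auc(fpr_toy, tpr_toy)
chance_auc = trapezoid_auc([0.0, 1.0], [0.0, 1.0])
print("toy AUC: %0.3f, chance line: %0.3f" % (toy_auc, chance_auc))
```

The dashed diagonal in the plot corresponds to an area of 0.5, so the further the curve bows toward the top-left corner, the larger the area.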

Random Forest does the best, but I am still not getting the accuracy I want on my target class of interest. I have a few tricks I can use to work on this, but that is for another day/class.

# SUPPORT VECTOR MACHINES

A linear SVM with an L2 penalty, a cost parameter (C) of 1, and balanced class weights.

In [None]:
from sklearn.svm import LinearSVC
# class_weight='balanced' weights each class inversely to its frequency
clf_linSVC=LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, C=1.0, class_weight='balanced')
clf_linSVC.fit(features_train, target_train)
predicted_SVC=clf_linSVC.predict(features_test)
expected = target_test
# summarize the fit of the model
print(classification_report(expected, predicted_SVC,target_names=['No', 'Yes']))
print(confusion_matrix(expected, predicted_SVC))
print(accuracy_score(expected,predicted_SVC))

# SVC kernel = linear
# Change class_weight

In [None]:
from sklearn.svm import SVC
#standard linear SVC, no class weighting (gamma is ignored by the linear kernel, so it is not set)
clf_lin = SVC(kernel='linear', C=1.0, class_weight=None)
clf_lin.fit(features_train, target_train)
predicted_SVM=clf_lin.predict(features_test)
expected = target_test
# summarize the fit of the model
print(classification_report(expected, predicted_SVM,target_names=['No', 'Yes']))
print(confusion_matrix(expected, predicted_SVM))
print(accuracy_score(expected,predicted_SVM))

In [None]:
from sklearn.svm import SVC
#linear SVC with balanced class weights
clf_lin = SVC(kernel='linear', C=1.0, class_weight='balanced')
clf_lin.fit(features_train, target_train)
predicted_SVM=clf_lin.predict(features_test)
expected = target_test
# summarize the fit of the model
print(classification_report(expected, predicted_SVM,target_names=['No', 'Yes']))
print(confusion_matrix(expected, predicted_SVM))
print(accuracy_score(expected,predicted_SVM))
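Class weighting changes how much each misclassified example costs. As a rough sketch, scikit-learn's `class_weight='balanced'` option weights each class by `n_samples / (n_classes * class_count)`. The helper `balanced_class_weights` and the toy labels below are hypothetical and only reproduce that formula by hand.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / float(n_classes * count)
            for cls, count in counts.items()}

# Toy imbalanced target: 85 "no" (0) and 15 "yes" (1)
toy_labels = [0] * 85 + [1] * 15
weights = balanced_class_weights(toy_labels)
print({cls: round(w, 3) for cls, w in sorted(weights.items())})
```

The minority class gets the larger weight, so errors on "yes" cases are penalized more heavily during fitting, which typically trades some overall accuracy for better minority-class recall.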

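# Grid Search of Cost Function (with cross validation)

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
# candidate values for the cost parameter C
parameters = {'C':[.01,.05,1,2,3,4,5,9,10]}
svr = SVC(kernel='linear')
grid_svm = GridSearchCV(svr, parameters,n_jobs=-1, cv=5)
grid_svm.fit(features_train, target_train)
# mean cross-validated accuracy for each candidate
print("SCORES {}".format(list(zip(grid_svm.cv_results_['params'], grid_svm.cv_results_['mean_test_score']))))
print("BEST SCORE {}".format(grid_svm.best_score_))
print("BEST PARAM {}".format(grid_svm.best_params_))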

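# Grid Search of Several Functions (with cross validation)

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
# search over both the kernel and the cost parameter C
parameters = {'kernel':('linear', 'rbf'), 'C':[.001,.01,1,3,5,10]}
svr = SVC()
grid_svm = GridSearchCV(svr, parameters,n_jobs=-1, cv=5)
grid_svm.fit(features_train, target_train)
# mean cross-validated accuracy for each candidate
print("SCORES {}".format(list(zip(grid_svm.cv_results_['params'], grid_svm.cv_results_['mean_test_score']))))
print("BEST Estm {}".format(grid_svm.best_estimator_))
print("BEST SCORE {}".format(grid_svm.best_score_))
print("BEST PARAM {}".format(grid_svm.best_params_))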

# How does "Best" perform?

In [None]:
from sklearn.svm import SVC
#linear SVC with the best C from the grid search and balanced class weights
clf_lin = SVC(kernel='linear', C=10.0, class_weight='balanced')
clf_lin.fit(features_train, target_train)
predicted_SVM=clf_lin.predict(features_test)
expected = target_test
# summarize the fit of the model
print(classification_report(expected, predicted_SVM,target_names=['No', 'Yes']))
print(confusion_matrix(expected, predicted_SVM))
print(accuracy_score(expected,predicted_SVM))

# SVM using an RBF (non-linear) Kernel (High-dimensional Space). Untuned.

In [None]:
from sklearn.svm import SVC
#RBF-kernel SVC with balanced class weights
clf_rbf = SVC(kernel='rbf', C=1.0, class_weight='balanced', gamma=0.1)
clf_rbf.fit(features_train, target_train)
predicted_rbf=clf_rbf.predict(features_test)
expected = target_test
# summarize the fit of the model
print(classification_report(expected, predicted_rbf,target_names=['No', 'Yes']))
print(confusion_matrix(expected, predicted_rbf))
print(accuracy_score(expected,predicted_rbf))

# SVM using Polynomial Kernel (2nd Degree), untuned.
The model did not finish fitting at degree 2 or degree 3 within 24 hours.

from sklearn.svm import SVC
#polynomial-kernel SVC
clf_poly = SVC(kernel='poly', degree=2, C=1.0,class_weight=None)
clf_poly.fit(features_train, target_train)
predicted_poly=clf_poly.predict(features_test)
expected = target_test
# summarize the fit of the model
print(classification_report(expected, predicted_poly,target_names=['No', 'Yes']))
print(confusion_matrix(expected, predicted_poly))
print(accuracy_score(expected,predicted_poly))

In [None]:
#Gradient Boost Classification
from sklearn.ensemble import GradientBoostingClassifier
clf_GBC = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
clf_GBC.fit(features_train, target_train)
predicted_GBC=clf_GBC.predict(features_test)
expected = target_test
print(classification_report(expected, predicted_GBC,target_names=['No', 'Yes']))
print(confusion_matrix(expected, predicted_GBC))
print(accuracy_score(expected,predicted_GBC))

In [None]:
#AdaBoost of a Decision Tree
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                         algorithm="SAMME",
                         n_estimators=200)
bdt.fit(features_train, target_train)
predicted_bdt=bdt.predict(features_test)
expected = target_test
print(classification_report(expected, predicted_bdt,target_names=['No', 'Yes']))
print(confusion_matrix(expected, predicted_bdt))
print(accuracy_score(expected,predicted_bdt))

In [None]:
#Extra Trees- Extremely Randomized Trees
from sklearn.ensemble import ExtraTreesClassifier
#use the imported ExtraTreesClassifier; min_samples_split must be at least 2
xtree = ExtraTreesClassifier(max_depth=None, min_samples_split=2,
                             random_state=0)
xtree.fit(features_train, target_train)
predicted_xtree=xtree.predict(features_test)
scores_xtree = cross_val_score(xtree, features_train, target_train)
print(scores_xtree)
print(scores_xtree.mean())
print(classification_report(expected, predicted_xtree,target_names=['No', 'Yes']))
print(confusion_matrix(expected, predicted_xtree))
print(accuracy_score(expected,predicted_xtree))

In [None]:
#Adaboost Only
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(n_estimators=100)
ada.fit(features_train, target_train)
predicted_ada=ada.predict(features_test)
scores_ada = cross_val_score(ada, features_train, target_train)
print(scores_ada)
print(scores_ada.mean())
print(classification_report(expected, predicted_ada,target_names=['No', 'Yes']))
print(confusion_matrix(expected, predicted_ada))
print(accuracy_score(expected,predicted_ada))

In [None]:
#Stochastic Gradient Descent
from sklearn.linear_model import SGDClassifier
#as with all models, there are lots of arguments to adjust
#max_iter=5 with tol=None runs exactly 5 passes over the training data
SGD=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=5, tol=None, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)
SGD.fit(features_train, target_train)
predicted_SGD=SGD.predict(features_test)
scores_SGD = cross_val_score(SGD, features_train, target_train)
print(scores_SGD)
print(scores_SGD.mean())
print(classification_report(expected, predicted_SGD,target_names=['No', 'Yes']))
print(confusion_matrix(expected, predicted_SGD))
print(accuracy_score(expected,predicted_SGD))