# Introduction

This project is complex enough to capture many concepts: (1) it is a relational dataset; (2) it contains text and non-text features; (3) its labels is imbalanced, i.e. only 10% of the labels are exciting (successful). 

The data process is described in process_data.py. 

Briefly speaking, we merge 3 datasets by their primary keys, randomly sample 20% from the merged datasets. We then scale numeric features to have zero mean and unit variance, create dummy variables for categorical features, and extract TF-IDF then apply LSA for text features. When everything is ready, the processed data is split into training set and testing set.

The purpose of machine learning is to learn the underlying function in training set and generalize the algorithm to testing set. Although are many ways to evaluate the performance of machine learning models, in this context we use cross validation, which is suitable for medium-sized dataset, to select the best model. Finally, we will look at precision instead of accuracy due to the imbalance of labels.

In [1]:
from process_data import get_processed_data
features_processed_all, labels = get_processed_data()

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(features_processed_all, labels, test_size=0.2, random_state=1111)

# Majority Voting Classifier

A majority voting classifier in employed, which is comprised of many base learners: logistic regression, random forest, dicision tree, support vector machine, and K nearest neighbors. Before we feed the data into base classifiers, a dimension reduction tool (principal component analysis) is applied to reduce the noise in the features.

First presented is our base estimator. Afterwards, we utilize GridSearchCV for tuning model parameters. Since there are 5 classifiers, it is a daunting task to exhaust all possible parameter combinations. In the following we only demonstrate the concept of model (parameter) tuning via GridSearchCV.

In [2]:
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

pca = PCA(n_components=10)
clf1 = LogisticRegression(class_weight='balanced', C=1)
clf2 = RandomForestClassifier(n_estimators=200, random_state=1111)
clf3 = DecisionTreeClassifier(max_depth=1, criterion='entropy', random_state=1111)
clf4 = SVC(kernel='poly')
clf5 = KNeighborsClassifier(n_neighbors=1, p=2, metric='minkowski')

pipe1 = Pipeline([ ['pca', pca], ['lr', clf1] ])
pipe2 = Pipeline([ ['pca', pca], ['rf', clf2] ])
pipe3 = Pipeline([ ['pca', pca], ['tree', clf3] ])
pipe4 = Pipeline([ ['pca', pca], ['svm', clf4] ])
pipe5 = Pipeline([ ['pca', pca], ['knn', clf5] ])

from sklearn.ensemble import VotingClassifier
mv_clf = VotingClassifier(estimators=[('lr', pipe1), ('rf', pipe2), ('tree', pipe3), ('svm', pipe4), ('knn', pipe5)], voting='hard')

In [3]:
mv_clf.fit(X_train, Y_train)
Y_predicted_mv = mv_clf.predict(X_test)

# Precision and Recall

The result shows that the precision is about 89% and recall is about 69%. What does this mean? 

Precision aims to answer the question: what proportion of positive "predictions" was actually correct?

Recall aims to answer the question: What proportion of "real" positives was predicted correctly?

In [4]:
confusion_matrix = confusion_matrix(Y_test, Y_predicted_mv)
print(confusion_matrix)

[[7102   72]
 [ 255  572]]


In [5]:
print(metrics.classification_report(Y_test, Y_predicted_mv))

             precision    recall  f1-score   support

          0       0.97      0.99      0.98      7174
          1       0.89      0.69      0.78       827

avg / total       0.96      0.96      0.96      8001



# Model Tuning

What if we are not satisfied with the base estimator and intend to try other parameters?

GridSearchCV can be used to find best parameters, as illustrated below. If we use pipeline in the procedure, it may not be easy to find parameters in GridSearchCV. Fortunately, the .get_params().keys() method is useful to figure out what those parameters are.

In [6]:
# check parameters by: mv_clf.get_params().keys()
params = {'lr__lr__C': [0.5, 1.0, 1.5], 'rf__rf__n_estimators': [200, 300, 400], 'svm__svm__kernel': ['rbf', 'poly']}
grid = GridSearchCV(estimator=mv_clf, param_grid=params, cv=5)
grid = grid.fit(X_test, Y_test)

In [7]:
print("Best parameters set found on development set:")
print()
print(grid.best_params_)
print()
print("Grid scores on development set:")
print()
means = grid.cv_results_['mean_test_score']
stds = grid.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
print()

Best parameters set found on development set:

{'lr__lr__C': 0.5, 'rf__rf__n_estimators': 400, 'svm__svm__kernel': 'rbf'}

Grid scores on development set:

0.953 (+/-0.010) for {'lr__lr__C': 0.5, 'rf__rf__n_estimators': 200, 'svm__svm__kernel': 'rbf'}
0.943 (+/-0.013) for {'lr__lr__C': 0.5, 'rf__rf__n_estimators': 200, 'svm__svm__kernel': 'poly'}
0.953 (+/-0.010) for {'lr__lr__C': 0.5, 'rf__rf__n_estimators': 300, 'svm__svm__kernel': 'rbf'}
0.943 (+/-0.011) for {'lr__lr__C': 0.5, 'rf__rf__n_estimators': 300, 'svm__svm__kernel': 'poly'}
0.954 (+/-0.010) for {'lr__lr__C': 0.5, 'rf__rf__n_estimators': 400, 'svm__svm__kernel': 'rbf'}
0.943 (+/-0.009) for {'lr__lr__C': 0.5, 'rf__rf__n_estimators': 400, 'svm__svm__kernel': 'poly'}
0.953 (+/-0.011) for {'lr__lr__C': 1.0, 'rf__rf__n_estimators': 200, 'svm__svm__kernel': 'rbf'}
0.944 (+/-0.013) for {'lr__lr__C': 1.0, 'rf__rf__n_estimators': 200, 'svm__svm__kernel': 'poly'}
0.954 (+/-0.010) for {'lr__lr__C': 1.0, 'rf__rf__n_estimators': 300, 'sv

Since the best parameters is C = 0.5 in logistic regression, n_estimators = 400 in random forest, and svm_kernel = 'rbf in support vector machine as suggested by cross validation score, it is worth trying these parameters and see what will happen to the precision and the recall.

The result shows that the precision decreases 1% but recall increases 9%, hence eventually we have 88% precision and 79% recall. The improvement could be made by considering: (1) proceed with more refined feature engineering; (2) favor "soft" voting rule because not every classifier possesses the same certainty; (3) exhaust all possible combinations of parameters in GridSearchCV.

In [8]:
pca = PCA(n_components=10)
clf1_2 = LogisticRegression(class_weight='balanced', C=0.5)
clf2_2 = RandomForestClassifier(n_estimators=400, random_state=1111)
clf3_2 = DecisionTreeClassifier(max_depth=1, criterion='entropy', random_state=1111)
clf4_2 = SVC(kernel='rbf')
clf5_2 = KNeighborsClassifier(n_neighbors=1, p=2, metric='minkowski')

pipe1_2 = Pipeline([ ['pca', pca], ['lr', clf1_2] ])
pipe2_2 = Pipeline([ ['pca', pca], ['rf', clf2_2] ])
pipe3_2 = Pipeline([ ['pca', pca], ['tree', clf3_2] ])
pipe4_2 = Pipeline([ ['pca', pca], ['svm', clf4_2] ])
pipe5_2 = Pipeline([ ['pca', pca], ['knn', clf5_2] ])

from sklearn.ensemble import VotingClassifier
mv_clf2 = VotingClassifier(estimators=[('lr', pipe1_2), ('rf', pipe2_2), ('tree', pipe3_2), ('svm', pipe4_2), ('knn', pipe5_2)], voting='hard')

In [9]:
mv_clf2.fit(X_train, Y_train)
Y_predicted_mv2 = mv_clf2.predict(X_test)

In [10]:
confusion_matrix2 = confusion_matrix(Y_test, Y_predicted_mv2)
print(confusion_matrix2)

[[7089   85]
 [ 176  651]]


In [11]:
print(metrics.classification_report(Y_test, Y_predicted_mv2))

             precision    recall  f1-score   support

          0       0.98      0.99      0.98      7174
          1       0.88      0.79      0.83       827

avg / total       0.97      0.97      0.97      8001



Other measurements coming soon...