This project mainly contains 3 relational datasets with text and non-text features. The data process is described in process_data.py. 

Briefly speaking, we merge 3 datasets by their primary keys, randomly sample 20% from the merged datasets. We then scale numeric features to have zero mean and unit variance, create dummy variables for categorical features, and extract TF-IDF then apply LSA for text features. When everything is ready, the processed data is split into training set and testing set.

The purpose of machine learning is to learn the underlying function in training set and generalize the algorithm to testing set. Although there are different measures for the performance of this generalization, in the following context we adopt precision and recall as our measurements.

In [1]:
from process_data import get_processed_data
features_processed_all, labels = get_processed_data()

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(features_processed_all, labels, test_size=0.2, random_state=1111)

We employ majority voting classifier in this project, which is comprised of many weak learners: logistic regression, random forest, dicision tree, support vector machine, and nearest neighbors. Before we feed data into weak classifiers, a dimension reduction tool (principal component analysis) is performed to filter noise in the features.

In [2]:
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

pca = PCA(n_components=10)
clf1 = LogisticRegression(class_weight='balanced')
clf2 = RandomForestClassifier(n_estimators=200, random_state=1111)
clf3 = DecisionTreeClassifier(max_depth=1, criterion='entropy', random_state=1111)
clf4 = SVC(kernel='poly')
clf5 = KNeighborsClassifier(n_neighbors=1, p=2, metric='minkowski')

pipe1 = Pipeline([ ['pca', pca], ['lr', clf1] ])
pipe2 = Pipeline([ ['pca', pca], ['rf', clf2] ])
pipe3 = Pipeline([ ['pca', pca], ['tree', clf3] ])
pipe4 = Pipeline([ ['pca', pca], ['svm', clf4] ])
pipe5 = Pipeline([ ['pca', pca], ['knn', clf5] ])

from sklearn.ensemble import VotingClassifier
mv_clf = VotingClassifier(estimators=[('lr', pipe1), ('rf', pipe2), ('tree', pipe3), ('svm', pipe4), ('knn', pipe5)], voting='hard')

We first present our base model. The result shows that the precision is about 89% and recall is about 69%. What does this mean? 

Precision aims to answer the question: what proportion of positive "predictions" was actually correct?

Recall aims to answer the question: What proportion of "real" positives was predicted correctly?

In [3]:
import warnings
warnings.filterwarnings('ignore')
mv_clf.fit(X_train, Y_train)
Y_predicted_mv = mv_clf.predict(X_test)

In [4]:
confusion_matrix = confusion_matrix(Y_test, Y_predicted_mv)
print(confusion_matrix)

[[7102   72]
 [ 254  573]]


In [5]:
print(metrics.classification_report(Y_test, Y_predicted_mv))

             precision    recall  f1-score   support

          0       0.97      0.99      0.98      7174
          1       0.89      0.69      0.78       827

avg / total       0.96      0.96      0.96      8001



GridSearchCV can be used to find best parameters, as illustrated below. If we use pipeline in the procedure, it may not be easy to find parameters in GridSearchCV. Fortunately, the .get_params().keys() method is useful to figure out what those parameters are.

In [6]:
# check parameters by: mv_clf.get_params().keys()
params = {'lr__lr__C': [0.5, 1.0, 1.5], 'rf__rf__n_estimators': [200, 300, 400], 'svm__svm__kernel': ['rbf', 'poly']}
grid = GridSearchCV(estimator=mv_clf, param_grid=params, cv=5)
grid = grid.fit(X_test, Y_test)

In [7]:
print("Best parameters set found on development set:")
print()
print(grid.best_params_)
print()
print("Grid scores on development set:")
print()
means = grid.cv_results_['mean_test_score']
stds = grid.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
print()

Best parameters set found on development set:

{'lr__lr__C': 1.5, 'rf__rf__n_estimators': 400, 'svm__svm__kernel': 'rbf'}

Grid scores on development set:

0.953 (+/-0.011) for {'lr__lr__C': 0.5, 'rf__rf__n_estimators': 200, 'svm__svm__kernel': 'rbf'}
0.943 (+/-0.013) for {'lr__lr__C': 0.5, 'rf__rf__n_estimators': 200, 'svm__svm__kernel': 'poly'}
0.954 (+/-0.010) for {'lr__lr__C': 0.5, 'rf__rf__n_estimators': 300, 'svm__svm__kernel': 'rbf'}
0.943 (+/-0.011) for {'lr__lr__C': 0.5, 'rf__rf__n_estimators': 300, 'svm__svm__kernel': 'poly'}
0.953 (+/-0.010) for {'lr__lr__C': 0.5, 'rf__rf__n_estimators': 400, 'svm__svm__kernel': 'rbf'}
0.943 (+/-0.009) for {'lr__lr__C': 0.5, 'rf__rf__n_estimators': 400, 'svm__svm__kernel': 'poly'}
0.953 (+/-0.010) for {'lr__lr__C': 1.0, 'rf__rf__n_estimators': 200, 'svm__svm__kernel': 'rbf'}
0.943 (+/-0.013) for {'lr__lr__C': 1.0, 'rf__rf__n_estimators': 200, 'svm__svm__kernel': 'poly'}
0.954 (+/-0.011) for {'lr__lr__C': 1.0, 'rf__rf__n_estimators': 300, 'sv