<h1>Proper names detector</h1>
<p>Training a binary classifier that identifies whether the input string is a proper name.</p>
<h3>Approach</h3>
<p>The basic idea is to train a Random Forest classifier with in-word character n-grams as input features.</p>
<p><b>Features:</b> Character n-grams of sufficient length capture almost all reasonable features for proper name detection: gazetteer entries, suffixes, prefixes, punctuation, and common character sequences. To tell prefixes from suffixes, we add the symbol '_' at the beginning and the end of each word.</p>
<p><b>Classifier:</b> To avoid designing complex conjunctive features by hand (e.g. prefix='Abd' &amp; suffix='la'), we use a non-linear classifier. GradientBoosting is buggy in sklearn-0.18, and deep learning requires more manual programming and hyper-parameter optimization. RandomForest is a good option, though the resulting model may be quite large.</p>
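<p>To make the role of the boundary marker concrete, here is a minimal sketch of in-word n-gram extraction. The helper name <code>char_ngrams</code> and its defaults are assumptions made for illustration; the notebook itself relies on <code>CountVectorizer</code> rather than on this function.</p>

```python
def char_ngrams(word, n_min=3, n_max=8, marker='_'):
    """Return the set of character n-grams of `word` wrapped in boundary markers.

    Hypothetical helper for illustration only; not part of the notebook.
    """
    padded = marker + word + marker
    grams = set()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the padded word.
        for start in range(len(padded) - n + 1):
            grams.add(padded[start:start + n])
    return grams


grams = char_ngrams('abdulla')
# '_ab' can only come from the start of a word, 'la_' only from its end,
# while 'abd' may occur anywhere inside a word.
print(sorted(g for g in grams if len(g) == 3))
```

<p>Because the marker is part of the n-gram, a prefix such as <code>_ab</code> and a suffix such as <code>la_</code> become distinct features without any dedicated prefix or suffix extractor.</p>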

In [55]:
import numpy as np
from random import randint
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn import cross_validation
from sklearn.grid_search import GridSearchCV
import re

<h3>Data preparation</h3>
<p><i>proper_names.txt</i> -- proper names from <i>persondata_en.nt</i> with XML tags stripped and end-word markers added. We will use only half of them.</p>
<p><i>geo_names.txt</i> -- geographic names from dbpedia <i>geonames_links_en.ttl</i> with XML tags stripped and end-word markers added.</p>
<p><i>company_names.txt</i> -- companies listed on NYMEX (quite few, actually) with end-word markers added.</p>
<p><i>random_phrases.txt</i> -- 200k random 1- to 4-grams built from the most frequent words of English, Spanish, German, Italian and French, with end-word markers added.</p>
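<p>The preprocessing scripts that produced these files are not shown. As a rough sketch of what "end-word markers added" could look like, the hypothetical helper below wraps every word of a phrase in '_'. The function name, the lowercasing step, and the exact marker placement are assumptions, not the actual preprocessing.</p>

```python
def add_word_markers(phrase, marker='_'):
    """Lowercase a phrase and wrap each whitespace-separated word in markers.

    Hypothetical helper; the real data files may have been built differently.
    """
    words = phrase.lower().split()
    return ' '.join(marker + word + marker for word in words)


marked = add_word_markers('Goldie Goldthorpe')
print(marked)

# Removing the markers recovers the plain lowercase phrase.
print(marked.replace('_', ''))
```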

In [2]:
PROPER_NAMES_FILE_PATH = '/data/home/mkudinov/Data/proper_names/proper_names.txt'
GEO_NAMES_FILE_PATH = '/data/home/mkudinov/Data/proper_names/geo_names.txt'
COMPANY_NAMES_FILE_PATH = '/data/home/mkudinov/Data/proper_names/company_names.txt'
RANDOM_PHRASES_FILE_PATH = '/data/home/mkudinov/Data/proper_names/random_phrases.txt'

In [3]:
def read_dataset():
    dataset = []
    with open(PROPER_NAMES_FILE_PATH, 'r') as source_file:
        for line in source_file:
            rn = randint(1,100) #use half of the list of proper names
            if rn <= 50:
                dataset.append(unicode(line, 'utf-8').strip())
    labels = [1] * len(dataset)
    print "Proper Names: %s" % len(dataset) 
    lens = [0] * 3
    for i, source in enumerate([GEO_NAMES_FILE_PATH, COMPANY_NAMES_FILE_PATH, RANDOM_PHRASES_FILE_PATH]):
        with open(source, 'r') as source_file:
            for line in source_file:
                dataset.append(unicode(line, 'utf-8').strip())
                lens[i] +=1
    print "Geo: %s Company: %s Random: %s" % (lens[0], lens[1], lens[2])
    labels += [0] * sum(lens)
    return dataset, labels

In [21]:
#fit model parameters and print metrics
def model_fit(clf_, x_train_, y_train_, cv_): 
    clf_.fit(x_train_, y_train_)
    train_pred = clf_.predict(x_train_)
    train_prob = clf_.predict_proba(x_train_)[:,1]
    print "\nModel Report"
    print "Accuracy : %.4g" % metrics.accuracy_score(y_train_, train_pred)
    print "Precision : %.4g" % metrics.precision_score(y_train_, train_pred)
    print "Recall : %.4g" % metrics.recall_score(y_train_, train_pred)
    print "F-1 measure : %.4g" % metrics.f1_score(y_train_, train_pred)
    print "AUC Score (Train): %f" % metrics.roc_auc_score(y_train_, train_prob)
    if cv_ > 0:
        cv_score = cross_validation.cross_val_score(clf_, x_train_, y_train_, cv=cv_, scoring='roc_auc', n_jobs=-1)
        print "CV Score : Mean - %.7g | Std - %.7g | Min - %.7g | Max - %.7g" % (np.mean(cv_score),np.std(cv_score),np.min(cv_score),np.max(cv_score))

<h3>Read data and extract features</h3>
<p>We use character n-grams with 3 &lt;= n &lt;= 8 that occur in at least 70 different phrases.</p>
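<p>The frequency cut-off is a document-frequency filter: an n-gram is counted once per phrase, however many times it appears inside that phrase. The sketch below illustrates the idea in plain Python on a toy corpus. The function names are hypothetical, and the sketch skips details of <code>CountVectorizer</code>'s <code>char_wb</code> analyzer such as its whitespace padding.</p>

```python
from collections import Counter


def document_frequency(phrases, n_min=3, n_max=8):
    """Count, for each in-word character n-gram, how many phrases contain it."""
    df = Counter()
    for phrase in phrases:
        seen = set()  # each n-gram is counted at most once per phrase
        for word in phrase.split():
            for n in range(n_min, n_max + 1):
                for start in range(len(word) - n + 1):
                    seen.add(word[start:start + n])
        df.update(seen)
    return df


def select_features(phrases, min_df):
    """Keep the n-grams that occur in at least `min_df` phrases."""
    df = document_frequency(phrases)
    return sorted(gram for gram, count in df.items() if count >= min_df)


toy_phrases = ['_anna_ _smith_', '_anna_ _jones_', '_hanna_']
features = select_features(toy_phrases, min_df=3)
print(features)
```

<p>With a threshold of 3 on this toy corpus, only n-grams shared by all three phrases survive; word-initial n-grams such as <code>_an</code> appear in two phrases only and are dropped.</p>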

In [22]:
X, Y = read_dataset()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)
vectorizer = CountVectorizer(strip_accents='unicode', analyzer='char_wb', min_df=70, ngram_range=(3,8), binary=True)
x_train_bin = vectorizer.fit_transform(x_train)
x_test_bin = vectorizer.transform(x_test)
print "Number of features: ",len(vectorizer.get_feature_names())

Proper Names: 534311
Geo: 326222 Company: 2690 Random: 200000
Number of features:  93754


<h3>Grid search</h3>
Grid search with cross-validation over the maximum tree depth. Not surprisingly, the deeper the better :)

In [10]:
param_depth = {'max_depth':range(20,80,10)}
gsearch2 = GridSearchCV(estimator = RandomForestClassifier(n_estimators=100), 
param_grid = param_depth, scoring='roc_auc',n_jobs=6,iid=False, cv=5)
gsearch2.fit(x_train_bin,y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=False, n_jobs=6,
       param_grid={'max_depth': [20, 30, 40, 50, 60, 70]},
       pre_dispatch='2*n_jobs', refit=True, scoring='roc_auc', verbose=0)

In [13]:
print gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_
best_depth = gsearch2.best_params_['max_depth']

[mean: 0.88721, std: 0.00152, params: {'max_depth': 20}, mean: 0.90948, std: 0.00115, params: {'max_depth': 30}, mean: 0.92608, std: 0.00201, params: {'max_depth': 40}, mean: 0.93723, std: 0.00096, params: {'max_depth': 50}, mean: 0.94501, std: 0.00096, params: {'max_depth': 60}, mean: 0.95236, std: 0.00027, params: {'max_depth': 70}] {'max_depth': 70} 0.952358981906


<h3>Test best performing classifier</h3>
The best-performing model has max_depth=70. We check accuracy, AUC, precision, recall and F-1 on both the train and the test set.

In [42]:
clf = RandomForestClassifier(n_estimators=100, max_depth=best_depth).fit(x_train_bin, y_train)
model_fit(clf, x_train_bin, y_train, 0)


Model Report
Accuracy : 0.889
Precision : 0.9381
Recall : 0.8341
F-1 measure : 0.8831
AUC Score (Train): 0.966983


In [45]:
test_pred = clf.predict(x_test_bin)
test_prob = clf.predict_proba(x_test_bin)[:,1]

In [46]:
print "\nModel Report"
print "Accuracy : %.4g" % metrics.accuracy_score(y_test, test_pred)
print "Precision : %.4g" % metrics.precision_score(y_test, test_pred)
print "Recall : %.4g" % metrics.recall_score(y_test, test_pred)
print "F-1 measure : %.4g" % metrics.f1_score(y_test, test_pred)
print "AUC Score (Test): %f" % metrics.roc_auc_score(y_test, test_prob)


Model Report
Accuracy : 0.8722
Precision : 0.9192
Recall : 0.8173
F-1 measure : 0.8652
AUC Score (Test): 0.954025
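
<p>As a sanity check on the test-set report above, F-1 is the harmonic mean of precision and recall, so it can be recomputed from the two printed values. The sketch below is plain Python with hypothetical function names, independent of <code>sklearn.metrics</code>; it also shows how the three metrics follow from raw predictions on a toy example.</p>

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F-1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from_pr(precision, recall)


# Recompute F-1 from the precision and recall printed in the test report.
reported_f1 = f1_from_pr(0.9192, 0.8173)
print("F-1 from reported P and R: %.3f" % reported_f1)

# Toy example: 3 true positives, 1 false positive, 1 false negative.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0]
toy_p, toy_r, toy_f1 = precision_recall_f1(toy_true, toy_pred)
```

<p>The recomputed value agrees with the reported F-1 of 0.8652 up to rounding of the inputs. Recall is noticeably lower than precision, which is consistent with the error listing in the next section being dominated by false negatives.</p>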


<h3>50 misclassified strings</h3>

In [68]:
errors = []
for i in range(len(y_test)):
    if test_pred[i] != y_test[i]:
        sample = re.sub(r'_', '', x_test[i])
        if y_test[i] == 1:
            errors.append((sample, 'FN'))
        else:
            errors.append((sample, 'FP'))

In [69]:
for i in range(50):
    print u'{:<30} {}'.format(errors[i][0], errors[i][1])

goldie goldthorpe              FN
subhash misra                  FN
micah kogo                     FN
malaefou mackenzie             FN
elena lunda                    FN
nicoya                         FP
rado krošelj                   FN
chitra magimairaj              FN
raoul trujillo                 FN
kitso mokaila                  FN
ashley green                   FP
neal abberley                  FN
virignin                       FP
chiquito filipe do carmo       FN
pilar seurat                   FN
håkon col                      FP
handover mars jellyfish        FP
nancy cato                     FN
dale neufeld                   FN
greg halsey-brandt             FN
regis pitbull                  FN
felix cheong                   FN
ramiro prialé                  FN
yiorgo moutsiaras              FN
silvana mangano                FN
sia carol                      FP
lola cornero                   FN
clem ohameze                   FN
jacques awoke                  FP
cynthia sikes 