### Add feature engineering one piece at a time and monitor performance.  This notebook covers the first and second models and small variations on them.  See part 2 for later models.

| model | agg. log loss  |  agg. F1 score |  comment
|-------|:--------------:|:--------------:|----------
| mod0  |  1.356         |    0.441       | numerical features only
|mod0_1 |  1.323         |    0.441       | same as mod0, but use standard scaler before prediction
|mod0_1a|  1.295         |    0.454       | scaling + convert total to absolute value
| mod0_2|  1.362         |    0.406       | same as mod0 but use standard scaler and default imputer before prediction
| mod1  |  0.512         | 0.853          | pipeline, numerical features and text features (fillna with empty string; combine all text columns within row; default count vectorizer)
|mod1_1 | 0.094          | 0.974          | same as mod1 but ignore numerical data
|mod1_1_1 | 0.094          | 0.974          | same as mod1_1; work around n_jobs=-1 bug for faster fit


    

##### Note: Using n_jobs=-1 (all processors) with OneVsRestClassifier has caused several problems.  Sometimes it works well; other times it hangs.  The code runs fine without the parameter, just more slowly.  Upgrading to the latest version in an attempt to fix this made things worse: code that worked before the upgrade started failing.  I can't reproduce the problem on a smaller problem.

#### Imports/setup

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
pd.set_option('display.max_columns', 60)

from timeit import default_timer as timer

In [2]:
# metrics
from sklearn.metrics import f1_score, accuracy_score, classification_report
# unflattener
import python.flat_to_labels as ftl
# drivendata's splitter: ensures train and test both have enough of all the labels
from python.multilabel import multilabel_train_test_split
# drivendata's log loss metric
from python.dd_mmll import multi_multi_log_loss, BOX_PLOTS_COLUMN_INDICES

In [3]:
# for the pipeline
from sklearn.pipeline import Pipeline
# for the selectors
from sklearn.preprocessing import FunctionTransformer, StandardScaler
# for gluing preprocessed text and numbers together
from sklearn.pipeline import FeatureUnion
# for nans in the numeric data
from sklearn.preprocessing import Imputer

In [4]:
# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

#### Load the data

In [5]:
# Get data
df = pd.read_csv('data/TrainingData.csv', index_col=0)

In [6]:
# take a look
df.head()

Unnamed: 0,Function,Use,Sharing,Reporting,Student_Type,Position_Type,Object_Type,Pre_K,Operating_Status,Object_Description,Text_2,SubFund_Description,Job_Title_Description,Text_3,Text_4,Sub_Object_Description,Location_Description,FTE,Function_Description,Facility_or_Department,Position_Extra,Total,Program_Description,Fund_Description,Text_1
134338,Teacher Compensation,Instruction,School Reported,School,NO_LABEL,Teacher,NO_LABEL,NO_LABEL,PreK-12 Operating,,,,Teacher-Elementary,,,,,1.0,,,KINDERGARTEN,50471.81,KINDERGARTEN,General Fund,
206341,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,Non-Operating,CONTRACTOR SERVICES,BOND EXPENDITURES,BUILDING FUND,(blank),Regular,,,,,RGN GOB,,UNDESIGNATED,3477.86,BUILDING IMPROVEMENT SERVICES,,BUILDING IMPROVEMENT SERVICES
326408,Teacher Compensation,Instruction,School Reported,School,Unspecified,Teacher,Base Salary/Compensation,Non PreK,PreK-12 Operating,Personal Services - Teachers,,,TCHER 2ND GRADE,,Regular Instruction,,,1.0,,,TEACHER,62237.13,Instruction - Regular,General Purpose School,
364634,Substitute Compensation,Instruction,School Reported,School,Unspecified,Substitute,Benefits,NO_LABEL,PreK-12 Operating,EMPLOYEE BENEFITS,TEACHER SUBS,GENERAL FUND,"Teacher, Short Term Sub",Regular,,,,,UNALLOC BUDGETS/SCHOOLS,,PROFESSIONAL-INSTRUCTIONAL,22.3,GENERAL MIDDLE/JUNIOR HIGH SCH,,REGULAR INSTRUCTION
47683,Substitute Compensation,Instruction,School Reported,School,Unspecified,Teacher,Substitute Compensation,NO_LABEL,PreK-12 Operating,TEACHER COVERAGE FOR TEACHER,TEACHER SUBS,GENERAL FUND,"Teacher, Secondary (High)",Alternative,,,,,NON-PROJECT,,PROFESSIONAL-INSTRUCTIONAL,54.166,GENERAL HIGH SCHOOL EDUCATION,,REGULAR INSTRUCTION


####  Encode the targets as categorical variables

In [7]:
### bind variable LABELS - these are actually the targets and we're going to one-hot encode them...
LABELS = ['Function',  'Use',  'Sharing',  'Reporting',  'Student_Type',  'Position_Type', 
          'Object_Type',  'Pre_K',  'Operating_Status']

### This turns out to be key.  Submission requires the dummy versions of these vars to be in this order.
LABELS.sort()

# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')

# Convert df[LABELS] to a categorical type
df[LABELS] = df[LABELS].apply(categorize_label, axis=0)

# Print the converted dtypes
print(df[LABELS].dtypes)

Function            category
Object_Type         category
Operating_Status    category
Position_Type       category
Pre_K               category
Reporting           category
Sharing             category
Student_Type        category
Use                 category
dtype: object
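
The effect of sorting `LABELS` on the one-hot column order can be checked on a toy frame. This is a minimal sketch with made-up rows (the `toy` frame and its two targets are hypothetical stand-ins for the nine real ones); `prefix_sep='__'` mirrors the submission cell later in this notebook.

```python
import pandas as pd

# Hypothetical two-row frame holding two of the nine targets.
toy = pd.DataFrame({
    'Use': ['Instruction', 'NO_LABEL'],
    'Function': ['Teacher Compensation', 'NO_LABEL'],
})

toy_labels = sorted(['Use', 'Function'])          # ['Function', 'Use']
toy_cat = toy[toy_labels].astype('category')

# get_dummies emits one block of columns per target, in the column order given,
# and orders the columns inside each block by category.
toy_dummies = pd.get_dummies(toy_cat, prefix_sep='__')
toy_columns = list(toy_dummies.columns)
print(toy_columns)
```

With the targets sorted first, the `Function` block precedes the `Use` block, which is the kind of ordering the submission format relies on.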


##### Let's save the unique labels for each output (category)

In [8]:
# build a dictionary
the_labels = {col : df[col].unique().tolist() for col in df[LABELS].columns}

In [9]:
the_labels['Use']

['Instruction',
 'NO_LABEL',
 'O&M',
 'Pupil Services & Enrichment',
 'ISPD',
 'Leadership',
 'Business Services',
 'Untracked Budget Set-Aside']

#### Setting up a train-test split  for modeling

In [10]:
NUMERIC_COLUMNS = ['FTE', 'Total']

In [11]:
# Create the new DataFrame: numeric_data_only
numeric_data_only = df[NUMERIC_COLUMNS].fillna(-1000)

# Get labels and convert to dummy variables: label_dummies
label_dummies = pd.get_dummies(df[LABELS])

# Create training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(numeric_data_only,
                                                               label_dummies,
                                                               size=0.2, 
                                                               seed=123)

__======================== Begin Model 0 =========================__

#### Start with a simple model

The first model ignores everything but the two numeric columns, to get started and to check that the output has the correct format (104 columns of predictions).  Create a multi-label classifier, mod0, by placing LogisticRegression() inside OneVsRestClassifier().

In [12]:
# Make the classifier
mod0 = OneVsRestClassifier(LogisticRegression(), n_jobs=-1)

start = timer()
# Fit the classifier to the training data
mod0.fit(X_train, y_train)
end = timer()
print('fit time: {:0.2f} seconds'.format(end - start))

fit time: 126.77165587835282 seconds


##### The accuracy metric cannot be applied directly here because the label predictions from this model (mod0.predict(X_test)) are not correct.  ftl.flat_to_labels(probas) produces the predictions the model should make.

The overall model should be constrained to predict the highest-probability label within each target.  One-vs-rest doesn't work that way; it produces a set of 104 independent classifiers, each 1-vs-103.

##### The call to flat_to_labels produces 9 columns of targets, each populated with the appropriate per-target labels.  We also need to apply it to y_test, since that has been one-hot encoded.
##### FTL restores the original y data.  We then produce the actual predictions from the predicted probabilities: within a single target, the most likely label is taken as the prediction.

##### With data in this format the standard sklearn metrics can be applied to each column.
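
The "most likely label within each target" rule described above can be sketched with NumPy. This is not the `ftl.flat_to_labels` implementation (its source is not shown here); `argmax_within_targets` and `block_sizes` are hypothetical names, and the sketch returns label indices rather than label strings.

```python
import numpy as np

def argmax_within_targets(probas, block_sizes):
    """For each target's block of columns, pick the index of the most probable label."""
    probas = np.asarray(probas)
    picks = []
    start = 0
    for size in block_sizes:
        block = probas[:, start:start + size]   # columns belonging to one target
        picks.append(block.argmax(axis=1))      # best label index per row
        start += size
    return np.column_stack(picks)

# Two rows, two hypothetical targets with 3 and 2 labels respectively.
toy_probas = np.array([[0.2, 0.7, 0.1, 0.6, 0.4],
                       [0.5, 0.3, 0.9, 0.1, 0.8]])
toy_picks = argmax_within_targets(toy_probas, block_sizes=[3, 2])
print(toy_picks)
```

Each row ends up with exactly one label per target, even though the underlying one-vs-rest probabilities are independent and need not sum to 1 within a block.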

In [13]:
# ftl wants ndarray, not pd.DataFrame
the_ys = ftl.flat_to_labels(y_test.values)

In [14]:
### First we need the probabilities, not the predicted labels
mod0_train_probas = mod0.predict_proba(X_train)
mod0_test_probas = mod0.predict_proba(X_test)

In [15]:
# check accuracy on first column (target: Function)
accuracy_score(the_ys[:, 0], ftl.flat_to_labels(mod0_test_probas)[:, 0])

0.2759602773093498

In [16]:
# check F1 on first column (target: Function)
f1_score(the_ys[:, 0], ftl.flat_to_labels(mod0_test_probas)[:, 0], average='weighted')

0.1996178882196799

#### Show metrics for each target and average for all targets.

In [17]:
def report_f1(true, pred):
    the_scores = []
    for target in range(len(LABELS)):
        the_score = f1_score(true[:, target], pred[:, target], average='weighted')
        print('F1 score for target {}: {:.3f}'.format(LABELS[target], the_score))
        the_scores.append(the_score)
    print('Average F1 score for all targets : {:.3f}'.format(np.mean(the_scores)))

def report_accuracy(true, pred):
    the_scores = []
    for target in range(len(LABELS)):
        the_score = accuracy_score(true[:, target], pred[:, target])
        print('Accuracy score for target {}: {:.3f}'.format(LABELS[target], the_score))
        the_scores.append(the_score)
    print('Average accuracy score for all targets : {:.3f}'.format(np.mean(the_scores)))


In [18]:
report_accuracy(the_ys, ftl.flat_to_labels(mod0_test_probas))

Accuracy score for target Function: 0.276
Accuracy score for target Object_Type: 0.425
Accuracy score for target Operating_Status: 0.859
Accuracy score for target Position_Type: 0.373
Accuracy score for target Pre_K: 0.766
Accuracy score for target Reporting: 0.641
Accuracy score for target Sharing: 0.634
Accuracy score for target Student_Type: 0.557
Accuracy score for target Use: 0.508
Average accuracy score for all targets : 0.560


In [19]:
report_f1(the_ys, ftl.flat_to_labels(mod0_test_probas))

F1 score for target Function: 0.200
F1 score for target Object_Type: 0.297
F1 score for target Operating_Status: 0.794
F1 score for target Position_Type: 0.277
F1 score for target Pre_K: 0.665
F1 score for target Reporting: 0.501
F1 score for target Sharing: 0.492
F1 score for target Student_Type: 0.399
F1 score for target Use: 0.343
Average F1 score for all targets : 0.441


#### Log loss

In [20]:
multi_multi_log_loss(mod0_train_probas, y_train.values, BOX_PLOTS_COLUMN_INDICES)

1.353521736853036

In [21]:
multi_multi_log_loss(mod0_test_probas, y_test.values, BOX_PLOTS_COLUMN_INDICES)

1.3557282270290794
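
For orientation, a metric of this kind can be sketched as the average of per-target multiclass log losses. This is an assumption about the general shape only, not DrivenData's `multi_multi_log_loss`; the function name, `block_sizes`, and the `eps` clipping value are hypothetical.

```python
import numpy as np

def multi_target_log_loss(probas, actual, block_sizes, eps=1e-15):
    """Average, over targets, of the multiclass log loss within each target's columns."""
    probas = np.asarray(probas, dtype=float)
    actual = np.asarray(actual, dtype=float)
    losses = []
    start = 0
    for size in block_sizes:
        p = probas[:, start:start + size]
        y = actual[:, start:start + size]
        # Rescale so the probabilities within one target sum to 1 per row.
        p = p / np.clip(p.sum(axis=1, keepdims=True), eps, np.inf)
        # Clip to avoid log(0).
        p = np.clip(p, eps, 1 - eps)
        losses.append(-np.mean(np.sum(y * np.log(p), axis=1)))
        start += size
    return float(np.mean(losses))

# Two hypothetical targets with two labels each; predictions are uninformative.
toy_actual = np.array([[1, 0, 0, 1],
                       [0, 1, 1, 0]])
toy_uniform = np.full((2, 4), 0.5)
uniform_loss = multi_target_log_loss(toy_uniform, toy_actual, block_sizes=[2, 2])
perfect_loss = multi_target_log_loss(toy_actual, toy_actual, block_sizes=[2, 2])
print(uniform_loss, perfect_loss)
```

Uniform guesses over two labels give a loss of ln 2 per target, and perfect predictions give a loss close to 0, which is a useful sanity check when reading the scores in the table at the top.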

#### Submitted prediction file.  Scored 1.33.

__================ End of Model 0 ==================__

---

__================ Begin Mod0_1 *(add scaling)* ==================__

In [22]:
# Instantiate the classifier
mod0_1 = Pipeline([('scale', StandardScaler()),
                   ('clf',   OneVsRestClassifier(LogisticRegression(), n_jobs=-1))
                  ])

start = timer()
# Fit the classifier to the training data
mod0_1.fit(X_train, y_train)
end = timer()
print('fit time: {:0.2f} seconds'.format(end - start))

fit time: 37.78306902934091 seconds


#### Fits much faster with scaled numerical input.  Accuracy and F1 are essentially unchanged; log loss improves slightly (1.356 to 1.323).

In [23]:
# predict probabilities
mod0_1_test_probas = mod0_1.predict_proba(X_test)

In [24]:
# get accuracy.  The ys are the same as before (test set hasn't changed)
report_accuracy(the_ys, ftl.flat_to_labels(mod0_1_test_probas))

Accuracy score for target Function: 0.283
Accuracy score for target Object_Type: 0.417
Accuracy score for target Operating_Status: 0.859
Accuracy score for target Position_Type: 0.383
Accuracy score for target Pre_K: 0.766
Accuracy score for target Reporting: 0.641
Accuracy score for target Sharing: 0.634
Accuracy score for target Student_Type: 0.557
Accuracy score for target Use: 0.508
Average accuracy score for all targets : 0.561


In [25]:
report_f1(the_ys, ftl.flat_to_labels(mod0_1_test_probas))

F1 score for target Function: 0.206
F1 score for target Object_Type: 0.281
F1 score for target Operating_Status: 0.794
F1 score for target Position_Type: 0.289
F1 score for target Pre_K: 0.665
F1 score for target Reporting: 0.501
F1 score for target Sharing: 0.492
F1 score for target Student_Type: 0.399
F1 score for target Use: 0.343
Average F1 score for all targets : 0.441


In [30]:
multi_multi_log_loss(mod0_1_test_probas, y_test.values, BOX_PLOTS_COLUMN_INDICES)

1.3230752955144411

__================ End of Mod0_1 ==================__

---

__================ Begin Mod0_1a *(add scaling; convert total to absolute value)* ==================__

In [31]:
def rectify_total(in_df):
    # gets a copy of the desired columns and fills nans with 0
    rval = in_df[['FTE','Total']].fillna(0) 
    # now munge the copy and return it
    rval.loc[:, 'Total'] = np.abs(rval['Total'])
    return rval

In [32]:
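# convert Total to abs val before fitting; rectify_total also fills NaNs with 0 (a large negative fill does bad things to the mean later)
# note: X_train here was already built with fillna(-1000), so the fill with 0 has nothing left to fill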
get_numeric_data_abs = FunctionTransformer(lambda x: rectify_total(x[NUMERIC_COLUMNS]), validate=False)

In [33]:
# make a classifier; abs val Total before fitting

mod0_1a = Pipeline([('getnum', get_numeric_data_abs),
                   ('scale', StandardScaler()),
                   ('clf',   OneVsRestClassifier(LogisticRegression(), n_jobs=-1))
                  ])

start = timer()
# Fit the classifier to the training data
mod0_1a.fit(X_train, y_train)
end = timer()
print('fit time: {:0.2f} seconds'.format(end - start))

fit time: 34.34 seconds


#### Classification is improved slightly by using absolute value of Total.

In [34]:
# predict probabilities
mod0_1a_test_probas = mod0_1a.predict_proba(X_test)

In [36]:
# get accuracy.  The ys are the same as before (test set hasn't changed)
report_accuracy(the_ys, ftl.flat_to_labels(mod0_1a_test_probas))

Accuracy score for target Function: 0.292
Accuracy score for target Object_Type: 0.447
Accuracy score for target Operating_Status: 0.859
Accuracy score for target Position_Type: 0.392
Accuracy score for target Pre_K: 0.766
Accuracy score for target Reporting: 0.643
Accuracy score for target Sharing: 0.636
Accuracy score for target Student_Type: 0.563
Accuracy score for target Use: 0.517
Average accuracy score for all targets : 0.568


In [37]:
report_f1(the_ys, ftl.flat_to_labels(mod0_1a_test_probas))

F1 score for target Function: 0.215
F1 score for target Object_Type: 0.332
F1 score for target Operating_Status: 0.795
F1 score for target Position_Type: 0.299
F1 score for target Pre_K: 0.665
F1 score for target Reporting: 0.506
F1 score for target Sharing: 0.498
F1 score for target Student_Type: 0.416
F1 score for target Use: 0.363
Average F1 score for all targets : 0.454


In [38]:
multi_multi_log_loss(mod0_1a_test_probas, y_test.values, BOX_PLOTS_COLUMN_INDICES)

1.2952599960510929

__================ End of Mod0_1a ==================__

__================ Begin Mod0_2 *(add scaling, default imputer)* ==================__

#### Redo train/test split so we can use Imputer (instead of replacing NaNs with -1000)
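
To see why the choice of fill value matters once the data is standardized, here is a small sketch using only the standard library. The `fte` values are made up, and `None` stands in for NaN; the mean fill corresponds to the default `Imputer()` strategy, while -1000 is the sentinel used earlier in this notebook.

```python
from statistics import mean, pstdev

# Hypothetical FTE column with two missing entries.
fte = [1.0, None, 0.5, None, 1.0]

def fill(values, fill_value):
    """Replace missing entries with fill_value."""
    return [fill_value if v is None else v for v in values]

observed = [v for v in fte if v is not None]
sentinel_filled = fill(fte, -1000)
mean_filled = fill(fte, mean(observed))

sentinel_mean = mean(sentinel_filled)
mean_filled_mean = mean(mean_filled)
print('sentinel fill: mean={:.2f}, std={:.2f}'.format(sentinel_mean, pstdev(sentinel_filled)))
print('mean fill:     mean={:.2f}, std={:.2f}'.format(mean_filled_mean, pstdev(mean_filled)))
```

The sentinel drags the column mean and standard deviation far from the observed values, so after scaling the real observations are squeezed together. Mean imputation keeps the statistics close to the observed ones but gives up the signal that a value was missing.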

In [40]:
# X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NUMERIC_COLUMNS],
#                                                                label_dummies,
#                                                                size=0.2, 
#                                                                seed=123)
### Work on the full input data; we'll select numerics in the pipeline.
X_train, X_test, y_train, y_test = multilabel_train_test_split(df,
                                                               label_dummies,
                                                               size=0.2, 
                                                               seed=123)

In [41]:
# make a FunctionTransformer; validate=False skips input checking, so NaNs pass through to the Imputer

get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)

# Instantiate the classifier: mod0_2
mod0_2 = Pipeline([('select num', get_numeric_data),
                   ('imputer', Imputer()),
                   ('scale', StandardScaler()),
                   ('clf',   OneVsRestClassifier(LogisticRegression(), n_jobs=-1))
                  ])

start = timer()
# Fit the classifier to the training data
mod0_2.fit(X_train, y_train)
end = timer()
print('fit time: {:0.2f} seconds'.format(end - start))

fit time: 3.8e+01 seconds


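#### Fit time with mean imputation is comparable to the scaled models (about 38 seconds in the run shown).  Classification is somewhat degraded: average F1 drops from 0.441 to 0.406, mostly because of Object_Type and Position_Type.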

In [42]:
# predict probabilities
mod0_2_test_probas = mod0_2.predict_proba(X_test)

In [43]:
report_accuracy(the_ys, ftl.flat_to_labels(mod0_2_test_probas))

Accuracy score for target Function: 0.281
Accuracy score for target Object_Type: 0.197
Accuracy score for target Operating_Status: 0.859
Accuracy score for target Position_Type: 0.322
Accuracy score for target Pre_K: 0.766
Accuracy score for target Reporting: 0.641
Accuracy score for target Sharing: 0.634
Accuracy score for target Student_Type: 0.557
Accuracy score for target Use: 0.508
Average accuracy score for all targets : 0.529


In [44]:
report_f1(the_ys, ftl.flat_to_labels(mod0_2_test_probas))

F1 score for target Function: 0.163
F1 score for target Object_Type: 0.098
F1 score for target Operating_Status: 0.793
F1 score for target Position_Type: 0.201
F1 score for target Pre_K: 0.665
F1 score for target Reporting: 0.501
F1 score for target Sharing: 0.492
F1 score for target Student_Type: 0.398
F1 score for target Use: 0.342
Average F1 score for all targets : 0.406


In [45]:
multi_multi_log_loss(mod0_2_test_probas, y_test.values, BOX_PLOTS_COLUMN_INDICES)

1.3624418801241875

__================ End of Mod0_2 ==================__

---

__================ Beginning of Mod1 ==================__

### Add text processing to the model

#### Combining text columns for tokenization

In [49]:
# merge all text feature strings into one string.
def combine_text_columns(df, to_drop=NUMERIC_COLUMNS + LABELS):
    """ converts all text columns in each row of df to single string """
    # Drop non-text columns that are in the df (keep a list so drop gets list-like labels)
    to_drop = [col for col in to_drop if col in df.columns]
    text_data = df.drop(to_drop, axis=1)
    # Replace nans with blanks (returns a new frame instead of filling in place)
    text_data = text_data.fillna('')
    # Join all text items in a row with a space in between
    return text_data.apply(lambda x: " ".join(x), axis=1)

#### Rebinding X/y train/test...

It needs to be done because X is a different feature subset.

In [51]:
# Get the dummy encoding of the labels
dummy_labels = pd.get_dummies(df[LABELS])

# Get the columns that are features in the original df
NON_LABELS = [c for c in df.columns if c not in LABELS]

# Split into training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NON_LABELS],
                                                               dummy_labels,
                                                               0.2, 
                                                               seed=123)
# Preprocess the text data: get_text_data
get_text_data = FunctionTransformer(combine_text_columns, validate=False)

# Preprocess the numeric data: get_numeric_data
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)

#### Build the pipeline

In [52]:
# Complete the pipeline: mod1
mod1 = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([('selector', get_numeric_data),
                                               ('imputer', Imputer())])),
                ('text_features', Pipeline([('selector', get_text_data),
                                            ('vectorizer', CountVectorizer())]))
             ])),
        ('clf', OneVsRestClassifier(LogisticRegression(), n_jobs=-1))
    ])

start = timer()
# Fit to the training data
mod1.fit(X_train, y_train)
end = timer()
print('fit time: {:0.1f} seconds'.format(end - start))
# old 489 sec; new machine 418 sec

fit time: 469.1 seconds


In [53]:
### For log loss we need the probabilities, not the predicted labels
start = timer()
mod1_train_probas = mod1.predict_proba(X_train)
mod1_test_probas = mod1.predict_proba(X_test)
end = timer()
print('Predict.proba time: {:0.2f} seconds'.format(end - start))

Predict.proba time: 17.28 seconds


In [54]:
print('log loss on training set: {:0.4f}'.format(multi_multi_log_loss(mod1_train_probas, 
                                                                      y_train.values, BOX_PLOTS_COLUMN_INDICES)))
print('log loss on test set: {:0.4f}'.format(multi_multi_log_loss(mod1_test_probas, 
                                                                      y_test.values, BOX_PLOTS_COLUMN_INDICES)))

log loss on training set: 0.5110
log loss on test set: 0.5117


In [55]:
report_f1(the_ys, ftl.flat_to_labels(mod1_test_probas))

F1 score for target Function: 0.785
F1 score for target Object_Type: 0.869
F1 score for target Operating_Status: 0.931
F1 score for target Position_Type: 0.834
F1 score for target Pre_K: 0.951
F1 score for target Reporting: 0.858
F1 score for target Sharing: 0.793
F1 score for target Student_Type: 0.867
F1 score for target Use: 0.791
Average F1 score for all targets : 0.853


In [56]:
report_accuracy(the_ys, ftl.flat_to_labels(mod1_test_probas))

Accuracy score for target Function: 0.798
Accuracy score for target Object_Type: 0.873
Accuracy score for target Operating_Status: 0.936
Accuracy score for target Position_Type: 0.842
Accuracy score for target Pre_K: 0.953
Accuracy score for target Reporting: 0.865
Accuracy score for target Sharing: 0.815
Accuracy score for target Student_Type: 0.873
Accuracy score for target Use: 0.803
Average accuracy score for all targets : 0.862


### Predicted, submitted and scored with log-loss of 0.75.

#### =============================== End of Mod1 ============================================

---

#### ================= Beginning of Mod1_1; just the text features (no numerics) ===================================

#### Oddly, when I simplify the pipeline (removing the feature union and the selection/preprocessing for numeric data), OneVsRest fails with n_jobs=-1.  It runs without it, but slowly (about 2x slower).

#### Build the pipeline, but ignore numerical features.

In [57]:
# Complete the pipeline: mod1_1
mod1_1 = Pipeline([('selector', get_text_data),
                   ('vectorizer', CountVectorizer()),
                   ('clf', OneVsRestClassifier(LogisticRegression()))])
start = timer()
# Fit to the training data
mod1_1.fit(X_train, y_train)
end = timer()
print('fit time: {:0.2f} seconds'.format(end - start))

fit time: 826.2440557586237 seconds


In [58]:
### For log loss we need the probabilities, not the predicted labels
start = timer()
mod1_1_train_probas = mod1_1.predict_proba(X_train)
mod1_1_test_probas = mod1_1.predict_proba(X_test)
end = timer()
print('Predict.proba time: {:0.2f} seconds'.format(end - start))

Predict.proba time: 17.47 seconds


In [59]:
print('log loss on training set: {:0.4f}'.format(multi_multi_log_loss(mod1_1_train_probas, 
                                                                      y_train.values, BOX_PLOTS_COLUMN_INDICES)))
print('log loss on test set: {:0.4f}'.format(multi_multi_log_loss(mod1_1_test_probas, 
                                                                      y_test.values, BOX_PLOTS_COLUMN_INDICES)))

log loss on training set: 0.0874
log loss on test set: 0.0940


In [60]:
report_f1(the_ys, ftl.flat_to_labels(mod1_1_test_probas))

F1 score for target Function: 0.955
F1 score for target Object_Type: 0.984
F1 score for target Operating_Status: 0.984
F1 score for target Position_Type: 0.982
F1 score for target Pre_K: 0.990
F1 score for target Reporting: 0.972
F1 score for target Sharing: 0.962
F1 score for target Student_Type: 0.973
F1 score for target Use: 0.961
Average F1 score for all targets : 0.974


In [61]:
report_accuracy(the_ys, ftl.flat_to_labels(mod1_1_test_probas))

Accuracy score for target Function: 0.955
Accuracy score for target Object_Type: 0.984
Accuracy score for target Operating_Status: 0.985
Accuracy score for target Position_Type: 0.983
Accuracy score for target Pre_K: 0.990
Accuracy score for target Reporting: 0.973
Accuracy score for target Sharing: 0.962
Accuracy score for target Student_Type: 0.973
Accuracy score for target Use: 0.961
Average accuracy score for all targets : 0.974


#### Predict on holdout set and create submission file.

In [3]:
# # Load the holdout data: holdout
# ### Over here the file is TestData.csv
# holdout = pd.read_csv('data/TestData.csv', index_col=0)

In [4]:
# start = timer()
# # Generate predictions: predictions
# mod1_1_predictions = mod1_1.predict_proba(holdout)
# end = timer()
# print('predict time: {} seconds'.format(end - start))

In [5]:
# pred_mod1_1 = pd.DataFrame(columns=pd.get_dummies(df[LABELS], prefix_sep='__').columns, 
#                              index=holdout.index,
#                              data=mod1_1_predictions)

# pred_mod1_1.to_csv('pred_mod1_1.csv')

### Top 10 finish!!  0.6827 on holdout set at DrivenData

#### ======================== End of Mod1_1 ===================================

---

#### ==================== Begin Mod1_1_1 - work around the n_jobs problem ================================

### Okay, the numeric data is not helpful.

Between mod1 and mod3 (mod2 only changes the classifier to RandomForest, leaving preprocessing alone) the changes are:

1. tokenize on alphanumeric (instead of default)
2. Add bigrams to CountVectorizer (previously was used with default settings).  This doubles the size of the wordvec space.
3. Dimension reduction with SelectKBest using chi-squared (300 features)
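
As a rough illustration of item 2, word bigrams add one feature for each pair of adjacent tokens on top of the single-token features. The helper below is a hypothetical pure-Python sketch of what `ngram_range=(1, 2)` produces for one tokenized string; it is not CountVectorizer's own code.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams for n in ngram_range (inclusive), unigrams first."""
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

toy_tokens = ['general', 'fund', 'teacher']
toy_grams = word_ngrams(toy_tokens)
print(toy_grams)
```

Three tokens yield three unigrams plus two bigrams, which is why the feature space grows quickly once bigrams are switched on.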

Some things to note about the default CountVectorizer:

1. All strings are lowercased.
2. The default setting selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).  This means single-letter or single-digit tokens are ignored.
3. If the vectorizer is used to transform another input (e.g. the test set), any tokens not in the original corpus are ignored.
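
These three behaviours can be reproduced with the standard library alone. The regular expression below is CountVectorizer's documented default `token_pattern`; the helper name and the sample strings are hypothetical, and the counting step is only a sketch of how a fixed vocabulary treats unseen tokens.

```python
import re

# CountVectorizer's default token_pattern: runs of 2 or more word characters.
DEFAULT_TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')

def default_like_tokens(text):
    """Lowercase the text, then keep tokens of 2+ word characters."""
    return DEFAULT_TOKEN_PATTERN.findall(text.lower())

sample_tokens = default_like_tokens('TCHER 2ND GRADE - K/1 O&M')
print(sample_tokens)

# Vocabulary learned from the "training" text; unseen tokens get no column.
vocabulary = sorted(set(default_like_tokens('Teacher Elementary')))
test_tokens = default_like_tokens('teacher substitute')
counts = [test_tokens.count(term) for term in vocabulary]
print(vocabulary, counts)
```

Note that `K/1` and `O&M` vanish entirely, since every piece is a single character once punctuation splits them. That loss is one motivation for a custom token pattern in later models.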


#### One other way to work around the bug exposed by CountVectorizer/OneVsRest/LogisticRegression would be to replace all the numeric values with 0.  The classifiers should ignore the all-zero columns (and the fit might then work with n_jobs=-1).

Yes, this works well and uses all processors, yielding the same results as the slower, 1-processor version above.  It fits in roughly 6 to 8 minutes (464 sec in one run, 384 sec in the run recorded here) instead of 827 sec.

Why am I recreating X_train, etc. here?  I originally did it because I had used sample to downsize the data set.
I'll leave it here for now.  Without the bigrams this should take about 7 minutes.

In [66]:
# Get the dummy encoding of the labels
dummy_labels = pd.get_dummies(df[LABELS])

# Get the columns that are features in the original df
NON_LABELS = [c for c in df.columns if c not in LABELS]

# Split into training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NON_LABELS],
                                                               dummy_labels,
                                                               0.2, 
                                                               seed=123)
# Preprocess the text data: get_text_data
get_text_data = FunctionTransformer(combine_text_columns, validate=False)

# Use all 0s instead of noise: get_numeric_data_hack
# (np.float was removed from NumPy; the builtin float is equivalent here)
get_numeric_data_hack = FunctionTransformer(lambda x: np.zeros(x[NUMERIC_COLUMNS].shape, dtype=float), validate=False)

#### Build the pipeline

In [67]:
# Complete the pipeline: mod_1_1_1
mod_1_1_1 = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([('selector', get_numeric_data_hack),
                                               ('imputer', Imputer())])),
                ('text_features', Pipeline([('selector', get_text_data),
                                            ('vectorizer', CountVectorizer())]))
             ])),
        ('clf', OneVsRestClassifier(LogisticRegression(), n_jobs=-1))
    ])

start = timer()
# Fit to the training data
mod_1_1_1.fit(X_train, y_train)
end = timer()
print('fit time: {:0.2f} seconds'.format(end - start))

fit time: 384.25 seconds


In [68]:
### For log loss we need the probabilities, not the predicted labels
start = timer()
mod_1_1_1_train_probas = mod_1_1_1.predict_proba(X_train)
mod_1_1_1_test_probas = mod_1_1_1.predict_proba(X_test)
end = timer()
print('Predict.proba time: {:0.2f} seconds'.format(end - start))

Predict.proba time: 16.33 seconds


In [69]:
print('log loss on training set: {:0.4f}'.format(multi_multi_log_loss(mod_1_1_1_train_probas, 
                                                                      y_train.values, BOX_PLOTS_COLUMN_INDICES)))
print('log loss on test set: {:0.4f}'.format(multi_multi_log_loss(mod_1_1_1_test_probas, 
                                                                      y_test.values, BOX_PLOTS_COLUMN_INDICES)))

log loss on training set: 0.0874
log loss on test set: 0.0940


In [70]:
report_f1(the_ys, ftl.flat_to_labels(mod_1_1_1_test_probas))

F1 score for target Function: 0.955
F1 score for target Object_Type: 0.984
F1 score for target Operating_Status: 0.984
F1 score for target Position_Type: 0.982
F1 score for target Pre_K: 0.990
F1 score for target Reporting: 0.972
F1 score for target Sharing: 0.962
F1 score for target Student_Type: 0.973
F1 score for target Use: 0.961
Average F1 score for all targets : 0.974


In [71]:
report_accuracy(the_ys, ftl.flat_to_labels(mod_1_1_1_test_probas))

Accuracy score for target Function: 0.955
Accuracy score for target Object_Type: 0.984
Accuracy score for target Operating_Status: 0.985
Accuracy score for target Position_Type: 0.983
Accuracy score for target Pre_K: 0.990
Accuracy score for target Reporting: 0.973
Accuracy score for target Sharing: 0.963
Accuracy score for target Student_Type: 0.973
Accuracy score for target Use: 0.961
Average accuracy score for all targets : 0.974


#### =============================== End of mod_1_1_1 ============================================

***

#### ====================== Beginning of mod_1_1_2: now add bigrams =======================================

#### Build the pipeline

### Super strange: this runs very well on 90% of the data.  With all of the data, it seems to fit and then never returns.  Even though the machine is not busy, the kernel refuses to be interrupted.  I have broken this out into its own file for experimentation.

In [None]:
# # Complete the pipeline: pl
# mod_1_1_2 = Pipeline([
#         ('union', FeatureUnion(
#             transformer_list = [
#                 ('numeric_features', Pipeline([('selector', get_numeric_data_hack),
#                                                ('imputer', Imputer())])),
#                 ('text_features', Pipeline([('selector', get_text_data),
#                                             ('vectorizer', CountVectorizer(ngram_range=(1,2)))]))
#              ])),
#         ('clf', OneVsRestClassifier(LogisticRegression(), n_jobs=-1))
#     ])

# start = timer()
# # Fit to the training data
# mod_1_1_2.fit(X_train, y_train)
# end = timer()
# print('fit time: {:0.2f} seconds'.format(end - start))

In [68]:
# ### For log loss we need the probabilities, not the predicted labels
# start = timer()
# mod_1_1_2_train_probas = mod_1_1_2.predict_proba(X_train)
# mod_1_1_2_test_probas = mod_1_1_2.predict_proba(X_test)
# end = timer()
# print('Predict.proba time: {:0.2f} seconds'.format(end - start))


In [69]:
# print('log loss on training set: {:0.4f}'.format(multi_multi_log_loss(mod_1_1_2_train_probas, 
#                                                                       y_train.values, BOX_PLOTS_COLUMN_INDICES)))
# print('log loss on test set: {:0.4f}'.format(multi_multi_log_loss(mod_1_1_2_test_probas, 
#                                                                       y_test.values, BOX_PLOTS_COLUMN_INDICES)))


In [70]:
# report_f1(the_ys, ftl.flat_to_labels(mod_1_1_2_test_probas))


In [71]:
# report_accuracy(the_ys, ftl.flat_to_labels(mod_1_1_2_test_probas))


#### ======================== End of mod_1_1_2: now add bigrams =========================================