### Do one model.

#### Try some regularization 

I thought I had dialed in a good C, but it did not score better on the holdout: I overdid the regularization. __C=0.01 improves things considerably; C=0.10 is even better__.

#### Here are the defaults for LogisticRegression:
```
LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, 
                   class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', 
                   verbose=0, warm_start=False, n_jobs=1)
```

In [1]:
#### Imports/setup

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
pd.set_option('display.max_columns', 60)

from timeit import default_timer as timer

# for the pipeline
from sklearn.pipeline import Pipeline
# for the selectors
from sklearn.preprocessing import FunctionTransformer, StandardScaler
# for gluing preprocessed text and numbers together
from sklearn.pipeline import FeatureUnion
# for nans in the numeric data
from sklearn.preprocessing import Imputer

# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# metrics
from sklearn.metrics import f1_score, accuracy_score, classification_report

# unflattener
import python.flat_to_labels as ftl

#### Set up a train-test split making sure we have all labels in both splits
from python.multilabel import multilabel_train_test_split

from python.dd_mmll import multi_multi_log_loss, BOX_PLOTS_COLUMN_INDICES

#### Load the data

In [2]:
# Get data
the_data = pd.read_csv('data/TrainingData.csv', index_col=0)

# take a look
the_data.head()

Unnamed: 0,Function,Use,Sharing,Reporting,Student_Type,Position_Type,Object_Type,Pre_K,Operating_Status,Object_Description,Text_2,SubFund_Description,Job_Title_Description,Text_3,Text_4,Sub_Object_Description,Location_Description,FTE,Function_Description,Facility_or_Department,Position_Extra,Total,Program_Description,Fund_Description,Text_1
134338,Teacher Compensation,Instruction,School Reported,School,NO_LABEL,Teacher,NO_LABEL,NO_LABEL,PreK-12 Operating,,,,Teacher-Elementary,,,,,1.0,,,KINDERGARTEN,50471.81,KINDERGARTEN,General Fund,
206341,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,NO_LABEL,Non-Operating,CONTRACTOR SERVICES,BOND EXPENDITURES,BUILDING FUND,(blank),Regular,,,,,RGN GOB,,UNDESIGNATED,3477.86,BUILDING IMPROVEMENT SERVICES,,BUILDING IMPROVEMENT SERVICES
326408,Teacher Compensation,Instruction,School Reported,School,Unspecified,Teacher,Base Salary/Compensation,Non PreK,PreK-12 Operating,Personal Services - Teachers,,,TCHER 2ND GRADE,,Regular Instruction,,,1.0,,,TEACHER,62237.13,Instruction - Regular,General Purpose School,
364634,Substitute Compensation,Instruction,School Reported,School,Unspecified,Substitute,Benefits,NO_LABEL,PreK-12 Operating,EMPLOYEE BENEFITS,TEACHER SUBS,GENERAL FUND,"Teacher, Short Term Sub",Regular,,,,,UNALLOC BUDGETS/SCHOOLS,,PROFESSIONAL-INSTRUCTIONAL,22.3,GENERAL MIDDLE/JUNIOR HIGH SCH,,REGULAR INSTRUCTION
47683,Substitute Compensation,Instruction,School Reported,School,Unspecified,Teacher,Substitute Compensation,NO_LABEL,PreK-12 Operating,TEACHER COVERAGE FOR TEACHER,TEACHER SUBS,GENERAL FUND,"Teacher, Secondary (High)",Alternative,,,,,NON-PROJECT,,PROFESSIONAL-INSTRUCTIONAL,54.166,GENERAL HIGH SCHOOL EDUCATION,,REGULAR INSTRUCTION


####  Encode the targets as categorical variables

In [3]:
### bind variable LABELS - these are actually the targets and we're going to one-hot encode them...
LABELS = ['Function',  'Use',  'Sharing',  'Reporting',  'Student_Type',  'Position_Type', 
          'Object_Type',  'Pre_K',  'Operating_Status']

### This turns out to be key.  Submission requires the dummy versions of these vars to be in this order.
LABELS.sort()

# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')

# Convert df[LABELS] to a categorical type
the_data[LABELS] = the_data[LABELS].apply(categorize_label, axis=0)

# Print the converted dtypes
print(the_data[LABELS].dtypes)

Function            category
Object_Type         category
Operating_Status    category
Position_Type       category
Pre_K               category
Reporting           category
Sharing             category
Student_Type        category
Use                 category
dtype: object


#### Save the unique labels for each output (category)

In [4]:
# build a dictionary
the_labels = {col : the_data[col].unique().tolist() for col in the_data[LABELS].columns}
# take a look at one entry
the_labels['Use']

['Instruction',
 'NO_LABEL',
 'O&M',
 'Pupil Services & Enrichment',
 'ISPD',
 'Leadership',
 'Business Services',
 'Untracked Budget Set-Aside']

#### Change fraction to suit.
Note: small fractions will have a hard time ensuring labels in both splits.

In [5]:
# downsize it or not
# df = the_data.sample(frac=0.10, random_state=777) # this seed gets a split with enough labels in both sets
df = the_data.sample(frac=1.0, random_state=777)

#### Get targets as set of one-hot encoded columns

In [6]:
# name these columns
NUMERIC_COLUMNS = ['FTE', 'Total']

# Get labels and convert to dummy variables: label_dummies
label_dummies = pd.get_dummies(df[LABELS])
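
A minimal sketch of why the `LABELS.sort()` step matters, using a toy frame with made-up values (not the competition data). It shows that `pd.get_dummies` emits columns in the order of the input columns, with the categories sorted inside each one. The `prefix_sep='__'` setting mirrors the submission cells later in this notebook.

```python
import pandas as pd

# Hypothetical two-target toy frame; the real data has nine targets.
toy = pd.DataFrame({
    "Use": ["O&M", "Instruction", "O&M"],
    "Function": ["Teacher", "Aides", "Teacher"],
})

# Sort the target names, then convert each target to a categorical column.
toy_labels = sorted(["Use", "Function"])
toy[toy_labels] = toy[toy_labels].apply(lambda col: col.astype("category"))

# One indicator column per (target, category) pair, in sorted order.
toy_dummies = pd.get_dummies(toy[toy_labels], prefix_sep="__")
print(list(toy_dummies.columns))
```

Each row has exactly one active indicator per target, so the row sums equal the number of targets.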

#### Setting up a train-test split  for modeling

#### ======================== Begin Mod3_1  (add bigrams) with regularization ===================================

Some things to note about the default CountVectorizer and HashingVectorizer:
1. All strings are downcased
2. The default setting selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).  This means single letter or digit tokens are ignored.
3. If a fitted CountVectorizer is used to transform another input (e.g. test), any tokens not in the original corpus are ignored. HashingVectorizer keeps no vocabulary, so it hashes unseen tokens like any other token.
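
The `CountVectorizer` behaviors listed above can be checked on a couple of made-up strings. This is only an illustrative sketch; the strings are hypothetical and the vectorizer uses default settings apart from `ngram_range` in the last step.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["Teacher-Elementary A 1 KINDERGARTEN"])

# Lowercased, split on punctuation, single-character tokens dropped.
vocab = sorted(vec.vocabulary_)
print(vocab)

# 'substitute' was not seen during fit, so only 'teacher' is counted.
unseen = vec.transform(["substitute teacher"])
print(unseen.sum())

# With ngram_range=(1, 2), adjacent token pairs become features too.
bigram_vec = CountVectorizer(ngram_range=(1, 2)).fit(["General Fund"])
bigram_vocab = sorted(bigram_vec.vocabulary_)
print(bigram_vocab)
```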

In [7]:
# define combine_text_columns()
def combine_text_columns(df, to_drop=NUMERIC_COLUMNS + LABELS):
    """ converts all text columns in each row of df to single string """
    # Drop non-text columns that are in the df
    to_drop = set(to_drop) & set(df.columns.tolist())
    text_data = df.drop(to_drop, axis=1)  
    # Replace nans with blanks
    text_data.fillna('', inplace=True)    
    # Join all text items in a row that have a space in between
    return text_data.apply(lambda x: " ".join(x), axis=1)

In [8]:
# Import FunctionTransformer
from sklearn.preprocessing import FunctionTransformer

# Get the dummy encoding of the labels
dummy_labels = pd.get_dummies(df[LABELS])

# Get the features in the data
NON_LABELS = [c for c in df.columns if c not in LABELS]

# Split into training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NON_LABELS],
                                                               dummy_labels,
                                                               0.2, 
                                                               seed=123)
# Preprocess the text data: get_text_data
get_text_data = FunctionTransformer(combine_text_columns, validate=False)

# Use all 0s instead of noise: get_numeric_data_hack
# (np.float was removed in NumPy 1.24; the builtin float is equivalent)
get_numeric_data_hack = FunctionTransformer(lambda x: np.zeros(x[NUMERIC_COLUMNS].shape, dtype=float), validate=False)

In [9]:
#### Build the pipeline
mod_reg = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', get_numeric_data_hack),
                ('text_features', Pipeline([('selector', get_text_data),
                                            ('vectorizer', CountVectorizer(ngram_range=(1,2)))]))
             ])),
    # no scaler here  
    ('clf', OneVsRestClassifier(LogisticRegression(C=0.0009), n_jobs=-1))
    ])

start = timer()
# Fit to the training data
mod_reg.fit(X_train, y_train)
end = timer()
print('fit time: {:0.2f} seconds'.format(end - start))

fit time: 220.05 seconds
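
For readers who want to see the shape of this pipeline without the full dataset, here is a small sketch on a toy frame. The column names, values, and the helpers `select_text` and `zero_numeric` are hypothetical stand-ins for `get_text_data` and `get_numeric_data_hack`; the structure (a `FeatureUnion` of zeroed numerics and bigram counts feeding a one-vs-rest logistic regression) follows the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

toy = pd.DataFrame({
    "Job_Title": ["teacher elementary", "bus driver", "teacher secondary", "bus mechanic"],
    "Total": [50000.0, 30000.0, 52000.0, 31000.0],
})
# Two indicator columns, one per (hypothetical) label.
toy_y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

def select_text(frame):
    """Return the text column as a Series of strings."""
    return frame["Job_Title"].fillna("")

def zero_numeric(frame):
    """Replace the numeric column with zeros, as in the notebook's hack."""
    return np.zeros((len(frame), 1), dtype=float)

toy_model = Pipeline([
    ("union", FeatureUnion([
        ("numeric_features", FunctionTransformer(zero_numeric, validate=False)),
        ("text_features", Pipeline([
            ("selector", FunctionTransformer(select_text, validate=False)),
            ("vectorizer", CountVectorizer(ngram_range=(1, 2))),
        ])),
    ])),
    ("clf", OneVsRestClassifier(LogisticRegression(C=1.0))),
])

toy_model.fit(toy, toy_y)
toy_probas = toy_model.predict_proba(toy)
print(toy_probas.shape)  # one probability column per indicator
```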


#### regularization results on 1/10 of the data

* C=1: 43 sec; train: 0.0522 test: 0.1074
* C=0.5: 39 sec; train: 0.0678 test: 0.1162
* C=0.75: 41 sec; train: 0.0580 test: 0.1103
* C=0.25:  34 sec; train: 0.0897 test: 0.1318
* C=0.125: 31 sec; train: 0.1195 test: 0.1562
* C=0.0625: 28 sec; train: 0.1589 test: 0.1913
* C=0.030: 27 sec; train: 0.2131 test: 0.2421
* C=0.016: 24 sec; train: 0.2721 test: 0.2987
* C=0.010: 23 sec; train: 0.3250 test: 0.3502
* C=0.0125: 24 sec; train: 0.2989 test: 0.3247
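
The sweep above was run by hand, one value of C at a time. A loop like the following sketch does the same kind of comparison; it uses a synthetic dataset from `make_classification` and plain `log_loss` rather than the competition data and metric, so the numbers will not match the ones listed. In scikit-learn, C is the inverse of the regularization strength, so smaller C means a stronger penalty.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: any binary problem works for the illustration).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

sweep = {}
for C in [1.0, 0.1, 0.01, 0.001]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    train_loss = log_loss(y_tr, clf.predict_proba(X_tr))
    test_loss = log_loss(y_te, clf.predict_proba(X_te))
    sweep[C] = (train_loss, test_loss)
    print('C={}: train: {:0.4f} test: {:0.4f}'.format(C, train_loss, test_loss))
```

Training loss can only get worse as C shrinks; what matters is whether the gap between train and test narrows enough to help on unseen data.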

#### regularization results on all the data, C=0.0125
log loss on training set: 0.1386, log loss on test set: 0.1412

#### regularization results on all the data, C=0.0100, 329sec
log loss on training set: 0.1386, log loss on test set: 0.1412

#### regularization results on all the data, C=0.0050, 282 sec
log loss on training set: 0.1897 log loss on test set: 0.1921

#### regularization results on all the data, C=0.0010, 219 sec
log loss on training set: 0.3366 log loss on test set: 0.3386

#### regularization results on all the data, C=0.0030, 265 sec
log loss on training set: 0.2274 log loss on test set: 0.2297

#### regularization results on all the data, C=0.0001, 159 sec
log loss on training set: 0.7229 log loss on test set: 0.7237

#### regularization results on all the data, C=0.0005, 203 sec
log loss on training set: 0.4291 log loss on test set: 0.4308

#### regularization results on all the data, C=0.0006, 210 sec
log loss on training set: 0.4028 log loss on test set: 0.4046

#### regularization results on all the data, C=0.0007, 210 sec
log loss on training set: 0.4028 log loss on test set: 0.4046

#### regularization results on all the data, C=0.0008, 214 sec
log loss on training set: 0.3642 log loss on test set: 0.3661

### regularization results on all the data, C=0.0009, 214 sec  - this is the thing to submit...
```
log loss on training set: 0.3494 log loss on test set: 0.3514;
log loss on training set: 0.3493, log loss on test set: 0.3518; rs=999
log loss on training set: 0.3489, log loss on test set: 0.3534; rs=123
log loss on training set: 0.3497, log loss on test set: 0.3504, rs=1111
log loss on training set: 0.3491, log loss on test set: 0.3527; rs=5851
log loss on training set: 0.3499, log loss on test set: 0.3494; rs=2559
log loss on training set: 0.3495, log loss on test set: 0.3512; rs=3097
log loss on training set: 0.3495, log loss on test set: 0.3514; rs=20001207
log loss on training set: 0.3495, log loss on test set: 0.3514; rs=37
log loss on training set: 0.3494, log loss on test set: 0.3514; rs=777
```


#### For log loss we need the probabilities, not the predicted labels

In [10]:
# get probas
start = timer()
mod_reg_train_probas = mod_reg.predict_proba(X_train)
mod_reg_test_probas = mod_reg.predict_proba(X_test)
end = timer()
print('Predict.proba time: {:0.2f} seconds'.format(end - start))

Predict.proba time: 24.95 seconds


In [11]:
print('log loss on training set: {:0.4f}'.format(multi_multi_log_loss(mod_reg_train_probas, 
                                                                      y_train.values, BOX_PLOTS_COLUMN_INDICES)))
print('log loss on test set: {:0.4f}'.format(multi_multi_log_loss(mod_reg_test_probas, 
                                                                      y_test.values, BOX_PLOTS_COLUMN_INDICES)))

log loss on training set: 0.3494
log loss on test set: 0.3514
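
`multi_multi_log_loss` comes from the local `python.dd_mmll` module, whose source is not shown in this notebook. The sketch below illustrates the general idea of a grouped log loss under stated assumptions: each target owns a contiguous slice of indicator columns, probabilities are renormalized within each slice, and the per-target losses are averaged. The name `grouped_log_loss` and its arguments are hypothetical and may differ from the real implementation.

```python
import numpy as np

def grouped_log_loss(probas, actual, group_slices, eps=1e-15):
    """Average multiclass log loss over column groups (one group per target)."""
    losses = []
    for sl in group_slices:
        p = probas[:, sl].astype(float)
        # Renormalize so each row sums to 1 within the group.
        p = p / np.clip(p.sum(axis=1, keepdims=True), eps, np.inf)
        # Clip to avoid log(0).
        p = np.clip(p, eps, 1 - eps)
        y = actual[:, sl]
        losses.append(-np.mean(np.sum(y * np.log(p), axis=1)))
    return float(np.mean(losses))

# Two targets with two classes each; uniform predictions give ln(2) per target.
toy_actual = np.array([[1, 0, 0, 1], [0, 1, 1, 0]])
toy_uniform = np.full((2, 4), 0.5)
toy_groups = [slice(0, 2), slice(2, 4)]
uniform_loss = grouped_log_loss(toy_uniform, toy_actual, toy_groups)
print('{:0.4f}'.format(uniform_loss))
```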


In [12]:
def report_f1(true, pred):
    the_scores = []
    for target in range(len(LABELS)):
        the_score = f1_score(true[:, target], pred[:, target], average='weighted')
        print('F1 score for target {}: {:.3f}'.format(LABELS[target], the_score))
        the_scores.append(the_score)
    print('Average F1 score for all targets : {:.3f}'.format(np.mean(the_scores)))

def report_accuracy(true, pred):
    the_scores = []
    for target in range(len(LABELS)):
        the_score = accuracy_score(true[:, target], pred[:, target])
        print('Accuracy score for target {}: {:.3f}'.format(LABELS[target], the_score))
        the_scores.append(the_score)
    print('Average accuracy score for all targets : {:.3f}'.format(np.mean(the_scores)))

In [13]:
# ftl wants ndarray, not pd.Dataframe
the_ys = ftl.flat_to_labels(y_test.values)
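
`ftl.flat_to_labels` is also a local helper that is not shown here. Judging from how it is used (it accepts both the one-hot targets and the predicted probabilities), it presumably collapses each target's block of columns to a single class index. A minimal sketch of that idea, with the hypothetical name `flat_to_label_indices` and an explicit list of column slices as an assumed argument:

```python
import numpy as np

def flat_to_label_indices(flat, group_slices):
    """Pick the argmax column within each target's slice; returns (n_rows, n_targets)."""
    return np.column_stack([flat[:, sl].argmax(axis=1) for sl in group_slices])

# Two targets: the first with 2 classes, the second with 3.
toy_flat = np.array([[0.1, 0.9, 0.7, 0.2, 0.1],
                     [0.8, 0.2, 0.1, 0.1, 0.8]])
toy_label_idx = flat_to_label_indices(toy_flat, [slice(0, 2), slice(2, 5)])
print(toy_label_idx)
```

The result has one column per target, which is the layout `report_f1` and `report_accuracy` index into with `true[:, target]`.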

In [14]:
report_f1(the_ys, ftl.flat_to_labels(mod_reg_test_probas))

report_accuracy(the_ys, ftl.flat_to_labels(mod_reg_test_probas))


F1 score for target Function: 0.849
F1 score for target Object_Type: 0.939
F1 score for target Operating_Status: 0.966
F1 score for target Position_Type: 0.911
F1 score for target Pre_K: 0.975
F1 score for target Reporting: 0.934
F1 score for target Sharing: 0.899
F1 score for target Student_Type: 0.933
F1 score for target Use: 0.889
Average F1 score for all targets : 0.922
Accuracy score for target Function: 0.861
Accuracy score for target Object_Type: 0.941
Accuracy score for target Operating_Status: 0.968
Accuracy score for target Position_Type: 0.916
Accuracy score for target Pre_K: 0.976
Accuracy score for target Reporting: 0.935
Accuracy score for target Sharing: 0.904
Accuracy score for target Student_Type: 0.935
Accuracy score for target Use: 0.894
Average accuracy score for all targets : 0.925


#### =========================== predict on holdout set ==================================

In [15]:
# Load the holdout data: holdout
### Over here the file is TestData.csv
holdout = pd.read_csv('data/TestData.csv', index_col=0)


In [16]:
start = timer()
# Generate predictions: predictions
mod_reg_predictions = mod_reg.predict_proba(holdout)
end = timer()
print('predict time: {} seconds'.format(end - start))

predict time: 3.068298013462197 seconds


In [17]:
pred_mod_reg = pd.DataFrame(columns=pd.get_dummies(df[LABELS], prefix_sep='__').columns, 
                             index=holdout.index,
                             data=mod_reg_predictions)

pred_mod_reg.to_csv('pred_mod_reg.csv')

#### ====================== End of mod_reg; score: 0.6943 ============================================

####  ============ Try another submission with C=0.01.  I might have overdone it. ========================

In [12]:
#### Build the pipeline
mod_reg_01 = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', get_numeric_data_hack),
                ('text_features', Pipeline([('selector', get_text_data),
                                            ('vectorizer', CountVectorizer(ngram_range=(1,2)))]))
             ])),
        ('clf', OneVsRestClassifier(LogisticRegression(C=0.01), n_jobs=-1))
    ])

start = timer()
# Fit to the training data
mod_reg_01.fit(X_train, y_train)
end = timer()
print('fit time: {:0.2f} seconds'.format(end - start))

fit time: 368.27 seconds


#### =========================== predict on holdout set ==================================

In [13]:
# Load the holdout data: holdout
### Over here the file is TestData.csv
holdout = pd.read_csv('data/TestData.csv', index_col=0)


In [14]:
start = timer()
# Generate predictions: predictions
mod_reg_01_predictions = mod_reg_01.predict_proba(holdout)
end = timer()
print('predict time: {} seconds'.format(end - start))

predict time: 2.7244055672565537 seconds


In [17]:
pred_mod_reg_01 = pd.DataFrame(columns=pd.get_dummies(df[LABELS], prefix_sep='__').columns, 
                             index=holdout.index,
                             data=mod_reg_01_predictions)

pred_mod_reg_01.to_csv('pred_mod_reg_01.csv')

####  ============ end mod_reg_01.  0.5400: 4th place ========================

####  ============ Try another submission with C=0.1, mod_reg_tenth.  I might have overdone it. ========================

In [25]:
#### Build the pipeline
mod_reg_tenth = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', get_numeric_data_hack),
                ('text_features', Pipeline([('selector', get_text_data),
                                            ('vectorizer', CountVectorizer(ngram_range=(1,2)))]))
             ])),
        ('clf', OneVsRestClassifier(LogisticRegression(C=0.1), n_jobs=-1))
    ])

start = timer()
# Fit to the training data
mod_reg_tenth.fit(X_train, y_train)
end = timer()
print('fit time: {:0.2f} seconds'.format(end - start))

fit time: 506.39 seconds


#### =========================== predict on holdout set ==================================

In [26]:
# Load the holdout data: holdout
### Over here the file is TestData.csv
holdout = pd.read_csv('data/TestData.csv', index_col=0)


In [27]:
holdout.shape, the_data.shape

((50064, 16), (400277, 25))

In [28]:
start = timer()
# Generate predictions: predictions
mod_reg_tenth_predictions = mod_reg_tenth.predict_proba(holdout)
end = timer()
print('predict time: {} seconds'.format(end - start))

predict time: 4.0949001371336635 seconds


In [29]:
pred_mod_reg_tenth = pd.DataFrame(columns=pd.get_dummies(df[LABELS], prefix_sep='__').columns, 
                             index=holdout.index,
                             data=mod_reg_tenth_predictions)

pred_mod_reg_tenth.to_csv('pred_mod_reg_tenth.csv')

####  ============ end mod_reg_tenth 0.5341.   4th  ========================

####  ============ Try another submission with C=0.33.    not run yet... ========================

##### Note: typo in the cell below: it uses C=0.0333; s.b. C=0.3333 (separate file).

In [18]:
#### Build the pipeline
mod_reg_third = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', get_numeric_data_hack),
                ('text_features', Pipeline([('selector', get_text_data),
                                            ('vectorizer', CountVectorizer(ngram_range=(1,2)))]))
             ])),
        ('clf', OneVsRestClassifier(LogisticRegression(C=0.0333), n_jobs=-1))
    ])

start = timer()
# Fit to the training data
mod_reg_third.fit(X_train, y_train)
end = timer()
print('fit time: {:0.2f} seconds'.format(end - start))

fit time: 354.80 seconds


#### =========================== predict on holdout set ==================================

In [19]:
# Load the holdout data: holdout
### Over here the file is TestData.csv
holdout = pd.read_csv('data/TestData.csv', index_col=0)


In [24]:
holdout.shape, the_data.shape

((50064, 16), (400277, 25))

In [20]:
start = timer()
# Generate predictions: predictions
mod_reg_third_predictions = mod_reg_third.predict_proba(holdout)
end = timer()
print('predict time: {} seconds'.format(end - start))

predict time: 4.815847237900016 seconds


In [21]:
pred_mod_reg_third = pd.DataFrame(columns=pd.get_dummies(df[LABELS], prefix_sep='__').columns, 
                             index=holdout.index,
                             data=mod_reg_third_predictions)

pred_mod_reg_third.to_csv('pred_mod_reg_third.csv')

####  =================== end mod_reg_third  ========================

#### Just for yucks let's try a test set that's a similar size to the hold out and see if scores get closer to holdout scores...

##### Note: the cell below overwrites X_train, X_test, y_train and y_test.  Rerun the original split cell if we need the old split.

In [30]:
# Import FunctionTransformer
from sklearn.preprocessing import FunctionTransformer

# Get the dummy encoding of the labels
dummy_labels = pd.get_dummies(df[LABELS])

# Get the features in the data
NON_LABELS = [c for c in df.columns if c not in LABELS]

# Split into training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NON_LABELS],
                                                               dummy_labels,
                                                               0.2, 
                                                               seed=123)
# Preprocess the text data: get_text_data
get_text_data = FunctionTransformer(combine_text_columns, validate=False)

# Use all 0s instead of noise: get_numeric_data_hack
# (np.float was removed in NumPy 1.24; the builtin float is equivalent)
get_numeric_data_hack = FunctionTransformer(lambda x: np.zeros(x[NUMERIC_COLUMNS].shape, dtype=float), validate=False)

In [31]:
#### Build the pipeline
mod_reg_tenth = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', get_numeric_data_hack),
                ('text_features', Pipeline([('selector', get_text_data),
                                            ('vectorizer', CountVectorizer(ngram_range=(1,2)))]))
             ])),
        ('clf', OneVsRestClassifier(LogisticRegression(C=0.1), n_jobs=-1))
    ])

start = timer()
# Fit to the training data
mod_reg_tenth.fit(X_train, y_train)
end = timer()
print('fit time: {:0.2f} seconds'.format(end - start))

fit time: 499.28 seconds


In [32]:
# get probas
start = timer()
mod_reg_tenth_train_probas = mod_reg_tenth.predict_proba(X_train)
mod_reg_tenth_test_probas = mod_reg_tenth.predict_proba(X_test)
end = timer()
print('Predict.proba time: {:0.2f} seconds'.format(end - start))

Predict.proba time: 24.92 seconds


In [33]:
print('log loss on training set: {:0.4f}'.format(multi_multi_log_loss(mod_reg_tenth_train_probas, 
                                                                      y_train.values, BOX_PLOTS_COLUMN_INDICES)))
print('log loss on test set: {:0.4f}'.format(multi_multi_log_loss(mod_reg_tenth_test_probas, 
                                                                      y_test.values, BOX_PLOTS_COLUMN_INDICES)))

log loss on training set: 0.0743
log loss on test set: 0.0808


#### Moving to a 90/10 split doesn't seem to move results any closer to holdout results.

***