### Objective: Label line items in a school budget
How can line items in a school budget be classified accurately by what the money is used for?<br>
There are 9 label columns. Each is converted to a categorical variable, so the task amounts to 9 one-vs-all classification problems.<br>
There are two numeric features and fourteen free-form text columns.<br>
The free-form text columns are converted to tens of thousands of features using tokenization, bag-of-words, and n-gram NLP techniques.
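
As a rough illustration of the tokenization and n-gram step, the sketch below counts unigrams and bigrams in plain Python. It reuses the `TOKENS_ALPHANUMERIC` pattern defined later in this notebook; the helper `bag_of_ngrams` and the sample strings are hypothetical, and the real pipeline delegates this work to `CountVectorizer`.

```python
import re
from collections import Counter

# Same pattern as TOKENS_ALPHANUMERIC in this notebook:
# runs of letters/digits that are followed by whitespace.
TOKENS_ALPHANUMERIC = r'[A-Za-z0-9]+(?=\s+)'


def bag_of_ngrams(text, max_n=2):
    """Count all 1..max_n-grams over the tokens of `text` (hypothetical helper)."""
    # CountVectorizer lowercases by default, so do the same here.
    tokens = re.findall(TOKENS_ALPHANUMERIC, text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Sample text with a trailing space, as produced when the last joined column is empty.
counts = bag_of_ngrams("Personal Services - Secretaries SECRETARY ")

# Without trailing whitespace the final token does not match the pattern.
last_dropped = bag_of_ngrams("Telephone Service")
```

Because the pattern requires trailing whitespace, a token at the very end of a string is not matched, which is why `last_dropped` contains only `"telephone"`.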

In [225]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# read training dataset
df = pd.read_csv('TrainingData.csv', index_col=0)
print(df.shape)

(400277, 25)


There are 9 label columns in the dataset. Each of these columns is a category that can take many possible values. The competition website lists the label column names under the "Label example" section; they are used to populate the LABELS list below.

In [226]:
LABELS = ['Function', 'Object_Type', 'Operating_Status', 'Position_Type', 'Pre_K', 'Reporting', 'Sharing', 'Student_Type', 'Use']

In [227]:
NUMERIC_COLUMNS = ['FTE', "Total"]

Convert the label columns to the pandas categorical dtype.

In [228]:
# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')

# Convert df[LABELS] to a categorical type
df[LABELS] = df[LABELS].apply(categorize_label, axis=0)

# Print the converted dtypes
print(df[LABELS].dtypes)

Function            category
Object_Type         category
Operating_Status    category
Position_Type       category
Pre_K               category
Reporting           category
Sharing             category
Student_Type        category
Use                 category
dtype: object


Always create a small subset of df for testing. The subset, and the training and test sets split from it, should include samples covering all classes.

In [229]:
from multilabel import multilabel_sample_dataframe, multilabel_train_test_split

NON_LABELS = [c for c in df.columns if c not in LABELS]

# flag for production (submission)
f_prod = 1

if f_prod:
    print('production run')
    dummy_labels = pd.get_dummies(df[LABELS])
    
    #X_train=df[NON_LABELS]
    #y_train=dummy_labels
    
    X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NON_LABELS],
                                                               dummy_labels,
                                                               0.2,
                                                               min_count=3,
                                                               seed=43)
    
    #print(X_train.shape)
    #print(y_train.shape)
    
else:
    print('pre-production run')
    SAMPLE_SIZE = 40000
    
    df_samples = multilabel_sample_dataframe(df,
                                       pd.get_dummies(df[LABELS]),
                                       size=SAMPLE_SIZE,
                                       min_count=25,
                                       seed=34)
    
    dummy_labels = pd.get_dummies(df_samples[LABELS])
                                             
    X_train, X_test, y_train, y_test = multilabel_train_test_split(df_samples[NON_LABELS],
                                                               dummy_labels,
                                                               0.2,
                                                               min_count=3,
                                                               seed=43)


production run
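
The `multilabel` helper module is not shown in this notebook. As a hedged sketch of the idea behind `min_count`, the plain-Python function below first reserves `min_count` rows for every label and then fills the rest of the sample at random. The name `multilabel_sample`, its signature, and the toy data are assumptions for illustration and may differ from the actual helper.

```python
import random


def multilabel_sample(rows, size, min_count=1, seed=None):
    """Pick `size` row indices so that every label occurs at least `min_count` times.

    rows: list of sets, one set of labels per row (hypothetical input format).
    """
    rng = random.Random(seed)
    labels = sorted(set().union(*rows))
    chosen = set()

    # Step 1: guarantee min_count rows for each label.
    for label in labels:
        candidates = [i for i, row in enumerate(rows) if label in row]
        if len(candidates) < min_count:
            raise ValueError("label %r has fewer than min_count rows" % label)
        chosen.update(rng.sample(candidates, min_count))

    if len(chosen) > size:
        raise ValueError("size is too small to cover every label min_count times")

    # Step 2: fill the remainder of the sample uniformly at random.
    remaining = [i for i in range(len(rows)) if i not in chosen]
    chosen.update(rng.sample(remaining, size - len(chosen)))
    return sorted(chosen)


# Toy data: label "b" is rare, so a naive random sample could miss it entirely.
toy_rows = [{"a"}] * 8 + [{"b"}] * 2
toy_idx = multilabel_sample(toy_rows, size=4, min_count=2, seed=0)
```

A train/test split can apply the same idea to the test portion, so that rare classes appear on both sides of the split.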


In [230]:
type(dummy_labels)

pandas.core.frame.DataFrame

In [231]:
dummy_labels.shape[1]

104
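
The 9 label columns expand to 104 indicator columns because `pd.get_dummies` emits one column per (label, value) pair, named `<label><prefix_sep><value>`. The plain-Python sketch below mimics only that naming scheme; the helper `dummy_column_names` is hypothetical and the toy values are a small subset of the column names that appear in the prediction output of this notebook.

```python
def dummy_column_names(label_values, prefix_sep="_"):
    """Build indicator column names, one per distinct value of each label.

    label_values: dict mapping a label column name to its observed values.
    """
    names = []
    for label, values in label_values.items():
        # Categories are sorted, as pandas does for astype('category').
        for value in sorted(set(values)):
            names.append("%s%s%s" % (label, prefix_sep, value))
    return names


toy_labels = {
    "Function": ["Communications", "Aides Compensation", "Communications"],
    "Use": ["NO_LABEL", "Instruction"],
}

single = dummy_column_names(toy_labels)                    # default separator
double = dummy_column_names(toy_labels, prefix_sep="__")   # separator used for predictions.csv
```

The number of names equals the total number of distinct values across all labels, which is how 9 columns become 104 in the full dataset.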

### Utility functions

In [232]:
def compute_log_loss(predicted, actual, eps=1e-14):
    """ Computes the logarithmic loss between predicted and
    actual when these are 1D arrays.
    :param predicted: The predicted probabilities as floats between 0-1
    :param actual: The actual binary labels. Either 0 or 1.
    :param eps (optional): log(0) is inf, so we need to offset our
    predicted values slightly by eps from 0 or 1.
    """
    predicted = np.clip(predicted, eps, 1 - eps)
    loss = -1 * np.mean(actual * np.log(predicted)
            + (1 - actual)
            * np.log(1 - predicted))
    return loss

In [233]:
def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Takes the dataset as read in, drops the non-feature, non-text columns and
        then combines all of the text columns into a single vector that has all of
        the text for a row.
        
        :param data_frame: The data as read in with read_csv (no preprocessing necessary)
        :param to_drop (optional): Removes the numeric and label columns by default.
    """
    # drop non-text columns that are in the df
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)
    
    # replace nans with blanks
    text_data.fillna("", inplace=True)
    
    # joins all of the text items in a row (axis=1)
    # with a space in between
    return text_data.apply(lambda x: " ".join(x), axis=1)

In [234]:
# wrap plain Python functions as transformers so they can be used as Pipeline steps
from sklearn.preprocessing import FunctionTransformer

get_text_data = FunctionTransformer(combine_text_columns, validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)

In [235]:
# note: df_samples is only created when f_prod = 0 (pre-production branch above)
get_text_data.fit_transform(df_samples.head(5))

13     Personal Services - Secretaries   SECRETARY  L...
200    Personal Services - Other Compensation     Off...
375    EMPLOYEE BENEFITS STUDENT SERVICES CHARTER SCH...
533    SALARIES OF REGULAR EMPLOYEES MAINTENANCE  MIL...
610    Telephone Service  Community Services    Commu...
dtype: object

In [236]:
# note: df_samples is only created when f_prod = 0 (pre-production branch above)
get_numeric_data.fit_transform(df_samples.head(5))

     FTE         Total
13   1.0  21128.632098
200  0.0       1166.79
375  NaN        143.84
533  NaN    1627.94296
610  NaN          3.66


### Build model using Pipeline

In [237]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import chi2, SelectKBest

from sklearn.pipeline import Pipeline, FeatureUnion

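try:
    # scikit-learn >= 0.20: Imputer was replaced by SimpleImputer (same mean default)
    from sklearn.impute import SimpleImputer as Imputer
except ImportError:
    # older scikit-learn releases
    from sklearn.preprocessing import Imputer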

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import MaxAbsScaler

from sklearn.feature_extraction.text import CountVectorizer
from SparseInteractions import SparseInteractions

from sklearn.metrics import log_loss

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Number of best features to keep when SelectKBest(chi2) is enabled
chi_k = 8000

In [238]:
# neural network
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score

input_d = chi_k * (chi_k - 1)
layer1_nodes=chi_k**2
output_nodes=dummy_labels.shape[1]

# define baseline model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(layer1_nodes, input_shape=(1128753,), activation='relu'))
    model.add(Dense(output_nodes, activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [239]:
%%time

# Instantiate pipeline: pl
pl = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                                                   ngram_range=(1, 2)))
#                    ('vectorizer', HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
#                                                      non_negative=True, norm=None, binary=False,
#                                                     ngram_range=(1, 2))),
#                    ('dim_red', SelectKBest(chi2, chi_k))
                ]))
             ]
        )),
#        ('int', SparseInteractions(degree=2)),
        ('scale', MaxAbsScaler()),
#        ('clf', OneVsRestClassifier(SGDClassifier(loss='log', penalty='elasticnet', alpha=0.0001, random_state=22)))
        ('clf', OneVsRestClassifier(LogisticRegression()))
#        ('clf', KerasClassifier(build_fn=baseline_model, epochs=200, batch_size=400, verbose=0))
    ])


"""
parameters = {'clf__estimator__loss': ['log'],
             'clf__estimator__alpha': [0.001, 0.0001, 0.00001],
           'clf__estimator__penalty': ['elasticnet']}

# Instantiate the GridSearchCV object: pl_cv
pl_cv = GridSearchCV(pl, param_grid=parameters, cv=5)

pl_cv.fit(X_train, y_train)

# Print the optimal parameters and best score
print("Tuned Logistic Regression Parameter: {}".format(pl_cv.best_params_))
print("Tuned Logistic Regression Accuracy: {}".format(pl_cv.best_score_))
"""

#pl.fit_transform(X_train, y_train)
pl.fit(X_train, y_train)



Wall time: 16min 52s


In [240]:
X_train.shape

(320222, 16)

In [241]:
#if f_prod:
if 1:
    print('pre-production run')
    
    # Compute and print accuracy
    accuracy = pl.score(X_test, y_test)
    print("\nAccuracy on data - test data: ", accuracy)

    # compute log loss instead
    predictionsCV = pl.predict_proba(X_test)
    
    #print(predictionsCV.)



    predictionCV_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS],prefix_sep='_').columns,
                             index=X_test.index,
                             data=predictionsCV)

    ll = log_loss(y_test, predictionCV_df)

    print("\nlog_loss on test data: ", ll)


pre-production run

Accuracy on data - test data:  0.920067453626

log_loss on test data:  20.3658167409
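
A log loss near 20 alongside 92% accuracy looks suspicious. It appears that `log_loss` treats the 104 indicator columns as a single multiclass problem, even though every row has one true class in each of the 9 label groups. One alternative is to compute a multiclass log loss per label group and average the results. The sketch below does this in plain Python; the name `multi_multi_log_loss`, the `group_slices` argument, and the toy numbers are assumptions for illustration, and the columns of each label are assumed to be contiguous.

```python
import math


def multi_multi_log_loss(predicted, actual, group_slices, eps=1e-15):
    """Average, over label groups, of the multiclass log loss within each group.

    predicted, actual: lists of rows (lists of floats / 0-1 indicators).
    group_slices: one slice per label group, selecting that group's columns.
    """
    group_losses = []
    for sl in group_slices:
        total = 0.0
        for p_row, y_row in zip(predicted, actual):
            # Clip to avoid log(0), then renormalize so the group sums to 1.
            p = [min(max(v, eps), 1 - eps) for v in p_row[sl]]
            s = sum(p)
            total -= sum(y * math.log(v / s) for y, v in zip(y_row[sl], p))
        group_losses.append(total / len(predicted))
    return sum(group_losses) / len(group_losses)


# Toy example: two label groups with 2 and 3 classes, uniform predictions.
toy_slices = [slice(0, 2), slice(2, 5)]
toy_pred = [[0.5, 0.5, 1 / 3, 1 / 3, 1 / 3]]
toy_true = [[1, 0, 0, 1, 0]]
toy_loss = multi_multi_log_loss(toy_pred, toy_true, toy_slices)
```

With uniform predictions the loss of each group is the log of its class count, so the toy result is the mean of log 2 and log 3. Confident, correct predictions drive the value toward 0.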


In [248]:
predictionCV_df.head()

Unnamed: 0,Function_Aides Compensation,Function_Career & Academic Counseling,Function_Communications,Function_Curriculum Development,Function_Data Processing & Information Services,Function_Development & Fundraising,Function_Enrichment,Function_Extended Time & Tutoring,Function_Facilities & Maintenance,Function_Facilities Planning,...,Student_Type_Special Education,Student_Type_Unspecified,Use_Business Services,Use_ISPD,Use_Instruction,Use_Leadership,Use_NO_LABEL,Use_O&M,Use_Pupil Services & Enrichment,Use_Untracked Budget Set-Aside
206341,0.000179,0.000286,9.2e-05,5.628961e-05,0.000224,7.3e-05,0.000189,2.7e-05,2e-06,6.6e-05,...,9.6e-05,5e-06,0.000257,6.6e-05,4.6e-05,1e-06,0.999997,1.8e-05,0.000254,8.7e-05
126378,7.3e-05,3.8e-05,5.2e-05,4.448786e-05,0.000104,2e-05,0.001135,2.5e-05,0.002636,1.6e-05,...,6.2e-05,0.99922,0.000301,0.000113,0.656508,0.001842,0.016257,0.033548,0.001633,4.2e-05
18698,1e-05,1e-05,1e-05,9.472766e-07,1.5e-05,4e-06,3e-06,4.3e-05,3.4e-05,3e-06,...,0.00059,0.996658,7e-06,2.9e-05,0.999914,4e-06,0.000319,2.1e-05,2e-06,1e-06
169914,7.7e-05,0.00027,0.00084,3.486713e-05,6.1e-05,0.000747,0.000125,9.6e-05,0.017872,6.9e-05,...,0.000309,0.993098,0.001401,0.000169,7e-06,0.002416,0.015353,0.994863,0.001797,0.000108
43727,0.000106,5.7e-05,0.0001,0.006676766,0.000174,2.2e-05,0.001222,4.8e-05,0.000122,3.1e-05,...,0.000108,0.036362,0.00054,0.016078,0.000424,0.012615,0.951176,0.000195,0.002471,0.063056


In [242]:
print(predictionsCV.shape)

(80055, 104)


### Make prediction and export to predictions.csv file

In [243]:
if f_prod:
    print('production run')
    # Load the holdout data: holdout
    holdout = pd.read_csv('HoldoutData.csv', index_col=0, low_memory=False)

    predictions = pl.predict_proba(holdout)

    # Format predictions in DataFrame: prediction_df
    prediction_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS],prefix_sep='__').columns,
                             index=holdout.index,
                             data=predictions)

    prediction_df.to_csv('predictions.csv')

    print(predictions.shape)


production run
(50064, 104)


In [244]:
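# coefficient matrix of the first one-vs-rest estimator; shape is (1, n_features),
# where n_features = 2 numeric columns + size of the CountVectorizer vocabulary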
pl.get_params()['clf'].estimators_[0].coef_.shape

(1, 28322)