# Notebook 4: Predicting Small Trigger Classes

## Overview

This notebook focuses on the smaller classes. We aim to determine whether a custom sampling method can improve identification and prediction of the smaller trigger classes. To do this, we build a training set from evenly sized samples of the larger trigger classes. These samples form the "Is Trigger" set, which we balance with an equally sized "Not Trigger" set. We then make predictions on the remaining data, which includes every sentence from the smaller classes.

Steps performed by the included functions:
- Import the reshaped and lemmatized data from Notebook 1
- Create a custom training set by downsampling the larger classes and balancing them with an equal number of nontrigger sentences
- Fit a logistic regression model and review its performance against the test set, which contains the unseen trigger categories
- Calculate the percentage of sentences in the unseen trigger categories that were correctly predicted
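
The sampling steps above can be sketched on a toy data set. This is a minimal illustration rather than the notebook's own function: the column values, the `toy_split` helper, and the threshold of 3 are assumptions made for the example, and it uses `DataFrame.sample` in place of `sklearn.utils.resample`.

```python
import pandas as pd

# Hypothetical toy data: one-hot label columns, exactly one label per sentence.
toy = pd.DataFrame({
    "SentenceLemmas": [f"sentence {i}" for i in range(20)],
    "loan_default":   [1] * 6 + [0] * 14,
    "bankruptcy":     [0] * 6 + [1] * 2 + [0] * 12,
    "nontrigger":     [0] * 8 + [1] * 12,
})

def toy_split(df, trigger_cols, threshold, seed=99):
    """Sample `threshold` rows per large class, balance with nontrigger rows, return (train, test)."""
    counts = df[trigger_cols].sum()
    # Only classes with more than `threshold` sentences are sampled into the training set.
    large = [c for c in trigger_cols if counts[c] > threshold]
    triggers = pd.concat(
        [df[df[c] == 1].sample(n=threshold, random_state=seed) for c in large]
    )
    # Balance with the same number of nontrigger sentences.
    nontriggers = df[df["nontrigger"] == 1].sample(n=len(triggers), random_state=seed)
    train = pd.concat([triggers, nontriggers])
    # Everything that was not sampled becomes the test set.
    test = df.drop(train.index)
    return train, test

train, test = toy_split(toy, ["loan_default", "bankruptcy"], threshold=3)
print(len(train), len(test))
```

In this toy example `bankruptcy` has only two sentences, so it falls below the threshold, never enters the training set, and appears only in the test set.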

## Import Packages

In [1]:
import pandas as pd
import numpy as np
np.random.seed(99)
RANDOM_STATE = 99
from sklearn.utils import resample
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from spacy.lang.en.stop_words import STOP_WORDS
import time

## Define Model Functions

In [2]:
def get_reshaped_lemmatized():
    # Import the CSV file containing the reshaped data set
    df = pd.read_csv('../data/reshaped_lemmatized.csv').drop(columns = 'Unnamed: 0')
    return df

In [3]:
def downsampling_data_set(df, threshold):
    '''This function creates custom classes for our training and test set. The training set consists 
    of an equal number of sentences sampled from each of the larger classes (those occurring in more 
    than 'threshold' sentences), balanced with an equal number of nontrigger sentences. 
    All remaining data (including the small classes) becomes the test set.'''
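    # Total number of labels per row. `downsampling_set` is the same object as `df`, so the
    # 'Total' column is added to the caller's DataFrame in place (later steps rely on it).
    # Any existing 'Total' and all text columns are left out of the sum, so repeated calls agree.
    downsampling_set = df
    label_columns = downsampling_set.drop(columns = 'Total', errors = 'ignore')
    downsampling_set.loc[:,'Total'] = label_columns.sum(axis=1, numeric_only=True)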

    # select only Sentences with 1 or 2 triggers
    downsampling_set = downsampling_set[downsampling_set['Total'].isin([1,2])]

    
    # isolate the trigger columns to sample from 
    # (only includes trigger types that exist in more than the threshold number of sentences)
    n = threshold
    trigger_cols = downsampling_set.drop(['Document', 'Sentence', 'SentenceLemmas', 'SentenceTokens'], axis=1).sum(axis=0)
    trigger_cols = trigger_cols.where(lambda x: x > n).dropna() # Skip types that occur in the threshold number of sentences or fewer
    trigger_cols = [t for t in list(trigger_cols.index) if t not in ['Total', 'nontrigger', 'unspecified']]
    nontrigger_cols = ['nontrigger']
    
    # Randomly sample n rows (without replacement) from each of the larger trigger classes;
    # the concatenated `samples` frame becomes the training set
    sampled_cols = [
        resample(downsampling_set[downsampling_set[col] == 1], replace = False, n_samples = n, random_state = RANDOM_STATE)
        for col in trigger_cols
    ]
    samples = pd.concat(sampled_cols)
            
    n_nontrigger = samples.shape[0] # Number of nontrigger rows needed to balance the trigger samples
    # Randomly sample an equivalently sized set of nontrigger data
    for col in nontrigger_cols:
        temp_col = downsampling_set[downsampling_set[col] == 1]
        nontrigger_sampled_col = resample(temp_col, replace = False, n_samples = n_nontrigger, random_state = RANDOM_STATE)
        samples = pd.concat([samples, nontrigger_sampled_col])
        
    # remove these rows from the main data set - select index and remove by index (call new set "filtered")
    rmv_index = list(samples.index)
    filtered = df.drop(rmv_index, axis='index') # This will become our Test Set
    
    # make 'is trigger' column
    samples['istrigger'] = np.where(samples['nontrigger'] > 0, 0, 1)
    filtered['istrigger'] = np.where(filtered['nontrigger'] > 0, 0, 1)
    
    # Check which trigger types were included in the training set
    in_train_set = (downsampling_set.drop(['Document', 'Sentence', 'Total', 'unspecified','SentenceLemmas', 'SentenceTokens'], axis=1).sum(axis=0) > n).to_frame()
    
    return samples, filtered, in_train_set # Samples will be the Training Set, Filtered will be the test set

In [4]:
# Incorporate Stopwords
def get_stopwords():
    '''This function includes the creation/usage of various Stopword lists, which can be modified as needed.'''
    short_stopwords = ['the', 'to', 'of', 'be', 'and', 'in', 'a', 'marriott']
    short_stopwords2 = ['the', 'and', 'a', 'to', 'it', 'be', 'for', 'with', 'that', 'marriott']
    stopwords = list(STOP_WORDS) + ['marriott']

    return short_stopwords, short_stopwords2, stopwords
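
As a small illustration of how a custom stop word list changes what `CountVectorizer` sees, the sketch below fits the vectorizer with and without a list. The two sentences and the stop word list are made up for the example and are not taken from the data set.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical lemmatized sentences.
docs = [
    "the borrower be in default",
    "marriott be the operator of the hotel",
]

# Vocabulary learned with no stop word filtering.
default_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)

# Vocabulary learned after removing a short, domain-specific stop word list.
custom_stopwords = ["the", "be", "of", "in", "marriott"]
custom_vocab = sorted(CountVectorizer(stop_words=custom_stopwords).fit(docs).vocabulary_)

print(default_vocab)
print(custom_vocab)
```

Adding a frequent proper noun such as `marriott` to the list keeps the model from treating it as a signal for either class.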

In [5]:
# Function to fit and score the binary "istrigger" model on the custom training/test split
def run_model(df, threshold):
    '''This function carries out the Logistic Regression modeling for the custom dataset, 
    with the GridSearch set of hyperparameters defined below'''
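    # Build the split here instead of relying on notebook-level globals. A copy without any
    # existing 'Total' column is passed in so that repeated calls produce the same split.
    samples, filtered, _ = downsampling_data_set(df.drop(columns = 'Total', errors = 'ignore'), threshold)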
    short_stopwords, short_stopwords2, stopwords = get_stopwords()
    
    X_train = samples['SentenceLemmas']
    y_train = samples['istrigger']
    X_test = filtered['SentenceLemmas']
    y_test = filtered['istrigger']

    y_train = y_train.astype('int')
    y_test = y_test.astype('int')

    
    train_index = samples.index
    test_index = filtered.index
    
    pipe_cvec = Pipeline([('cvec', CountVectorizer()), ('lr', LogisticRegression(solver = 'liblinear', random_state = RANDOM_STATE))]) 
    cvec_params = {
        'cvec__ngram_range': [(1,2), (1,3), (1,4), (1,5)],
        'cvec__stop_words': [short_stopwords, short_stopwords2, stopwords],  
        'cvec__max_features': [100, 200, 400, 600, 1000],
        'cvec__min_df': [2],
        'cvec__max_df': [.99],
        }

    gs_cvec = GridSearchCV(pipe_cvec, param_grid = cvec_params, cv = 3, scoring = 'roc_auc')

    # Fit the data set (predicting "istrigger" yes/no)
    results_cvec = gs_cvec.fit(X_train, y_train)

    # Print train/test scores (ROC AUC, as set by `scoring` in the grid search above)
    print(f'Training score is {results_cvec.score(X_train, y_train):.3f}')
    print(f'Test score is {results_cvec.score(X_test, y_test):.3f}')
    
    return results_cvec, X_train, y_train, X_test, y_test, train_index, test_index
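
The pipeline and grid search pattern used in `run_model` can be tried on a tiny corpus. The sketch below is only an illustration under stated assumptions: the sentences, labels, and the reduced parameter grid are invented for the example and are far smaller than the notebook's data and grid.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = trigger sentence, 0 = nontrigger sentence.
texts = [
    "event of default occur", "borrower default on loan", "loan default trigger sweep",
    "default under mezzanine loan", "event of default continue", "default trigger cash sweep",
    "hotel have many room", "property locate in city", "hotel open last year",
    "property include parking lot", "hotel locate near airport", "room rate increase last year",
]
labels = [1] * 6 + [0] * 6

# Same two-step structure as the notebook: bag-of-words counts, then logistic regression.
pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("lr", LogisticRegression(solver="liblinear", random_state=99)),
])

# Reduced grid (an assumption for this sketch); step names prefix the parameter names.
grid = {"cvec__ngram_range": [(1, 1), (1, 2)], "cvec__max_features": [10, 50]}

search = GridSearchCV(pipe, param_grid=grid, cv=3, scoring="roc_auc")
search.fit(texts, labels)

print(search.best_params_)
print(f"Mean cross-validated ROC AUC: {search.best_score_:.3f}")
```

Because `scoring="roc_auc"` is set, `search.score(X, y)` also reports ROC AUC rather than accuracy, which is how the training and test scores printed by `run_model` should be read.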

In [6]:
def misclassification(results_cvec, X_train, y_train, X_test, y_test, train_index, test_index, filtered):
    '''This function creates a view of the misclassified predictions for each of the trigger categories.
    From here, we can calculate how many of the small classes (excluded from the training set) were
    correctly predicted. Note: it also reads the notebook-level variables `df` and `in_train_set`.'''
    # Predict on the test set with the best estimator found by the grid search
    best_model = results_cvec.best_estimator_
    preds = best_model.predict(X_test)

    # Line up predictions with the true labels, keyed by the original row index
    results = pd.DataFrame({'index': list(test_index),'prediction': list(preds), 'actual': list(y_test), 'model_input': list(X_test)})
    results.set_index('index', inplace = True)

    # Keep the misclassified rows and attach their trigger-type columns from the full data set
    misclassified = results[results['prediction'] != results['actual']]
    misclassified = misclassified.merge(df, how = 'left', left_index = True, right_index = True)
    misclassified = misclassified[['prediction', 'actual', 'model_input', 'Document', 'Sentence', 'loan_default', 'aggregate_dscr_fall', 'dscr_fall', 'unspecified', 'debt_yield_fall', 'aggregate_debt_yield_fall', 'mezzanine_default', 'tenant_failure', 'mezzanine_outstanding', 'operator_termination', 'bankruptcy', 'sponsor_termination', 'renovations', 'nontrigger', 'sff', 'delayed_repayment']]

    # Count sentences per trigger type in the full test set and among the misclassified rows
    full_test_set = filtered.drop(['Document', 'Sentence', 'Total', 'istrigger', 'SentenceTokens', 'SentenceLemmas'], axis = 1).sum(axis = 0).to_frame()
    misclassified_test_set = misclassified.drop(['prediction', 'actual', 'Document', 'Sentence', 'model_input'], axis=1).sum(axis=0).to_frame()
    misclassified_results = full_test_set.merge(misclassified_test_set, left_index = True, right_index = True)
    misclassified_results.rename(columns = {'0_x': 'full_test_set', '0_y': 'num_misclassified'}, inplace = True)

    # Percentage of each trigger type that was misclassified
    misclassified_results['percent_misclassified'] = 100 * misclassified_results['num_misclassified'] / misclassified_results['full_test_set']
    misclassified_results['percent_misclassified'] = misclassified_results['percent_misclassified'].round(1)

    # Flag whether each trigger type was large enough to be sampled into the training set
    misclassified_results = misclassified_results.merge(in_train_set, left_index = True, right_index = True)
    misclassified_results.rename(columns = {0: 'in_train_set'}, inplace = True)
    misclassified_results['in_train_set'] = misclassified_results['in_train_set'].map({True: 'yes', False: 'no'})
    return misclassified_results

## Perform modeling steps

We now call the functions defined above. They build a custom training set from evenly sized samples of the largest classes, fit the model, and test how well it performs against all remaining data.

In [7]:
df = get_reshaped_lemmatized()
samples, filtered, in_train_set = downsampling_data_set(df, 10)
results_cvec, X_train, y_train, X_test, y_test, train_index, test_index = run_model(df, 10)
misclassified_results = misclassification(results_cvec, X_train, y_train, X_test, y_test, train_index, test_index, filtered)

Training score is 1.000
Test score is 0.911


In [8]:
unseen = misclassified_results[misclassified_results['in_train_set'] == 'no']
small_class_predict_correct = 1 - unseen['num_misclassified'].sum() / unseen['full_test_set'].sum()
print(f'{100 *small_class_predict_correct:.1f}% of the small classes were predicted correctly.')

91.4% of the small classes were predicted correctly.
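
The figure above is a pooled rate: misclassified sentences are summed over all classes missing from the training set and divided by the total number of sentences in those classes. The sketch below shows the same arithmetic on made-up counts (the class names and numbers are hypothetical) and contrasts it with a simple per-class average.

```python
import pandas as pd

# Hypothetical counts, shaped like misclassified_results.
toy_results = pd.DataFrame(
    {
        "full_test_set": [50, 10, 40],
        "num_misclassified": [5, 4, 0],
        "in_train_set": ["yes", "no", "no"],
    },
    index=["class_a", "class_b", "class_c"],
)

# Restrict to the classes that were never sampled into the training set.
unseen = toy_results[toy_results["in_train_set"] == "no"]

# Pooled rate: every sentence counts equally, so larger classes carry more weight.
pooled_correct = 1 - unseen["num_misclassified"].sum() / unseen["full_test_set"].sum()

# Per-class rate: each class counts equally, regardless of its size.
per_class_correct = 1 - unseen["num_misclassified"] / unseen["full_test_set"]

print(f"Pooled: {100 * pooled_correct:.1f}%")
print(f"Per-class mean: {100 * per_class_correct.mean():.1f}%")
```

The pooled rate can hide a poorly predicted small class when a larger unseen class is predicted well, so the per-class column in the summary table is worth reading alongside it.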


In [9]:
# Display a summary table of the results
misclassified_results

| | full_test_set | num_misclassified | percent_misclassified | in_train_set |
|---|---|---|---|---|
| loan_default | 524 | 37 | 7.1 | yes |
| aggregate_dscr_fall | 7 | 0 | 0.0 | no |
| dscr_fall | 13 | 0 | 0.0 | yes |
| debt_yield_fall | 171 | 10 | 5.8 | yes |
| aggregate_debt_yield_fall | 9 | 0 | 0.0 | yes |
| mezzanine_default | 62 | 0 | 0.0 | yes |
| tenant_failure | 64 | 6 | 9.4 | yes |
| mezzanine_outstanding | 7 | 2 | 28.6 | no |
| operator_termination | 7 | 0 | 0.0 | yes |
| bankruptcy | 44 | 2 | 4.5 | no |
