# Baseline with Text Classification

Now that we have a baseline model and a text classification model, we need a way to combine the two.

In [1]:
import pandas as pd
# tqdm_notebook is deprecated; tqdm.notebook provides the same progress bar
from tqdm.notebook import tqdm

In [2]:
# retrieve the data
merged = pd.read_csv('../../DataPlus/all_data_merged.csv')

## Baseline Model

Model based only on age and Gleason score

In [3]:
# predictive model module
import GeneralModel as gm

In [4]:
# prepares dataframe for the baseline model (prepare_df defaults to baseline)
bm_df = gm.prepare_df(merged)

# of Data Points: 392


In [5]:
bm_fscore, _ = gm.general_model(bm_df)

KeyboardInterrupt: 

In [None]:
# F-score for baseline model
bm_fscore

## Text Classification

Model based only on text classification

In [None]:
import Preprocessing as pre
import KFoldTextClassification as ktc
import CompilingCorpus as cc

In [None]:
PROCESS_PIPELINE = [
    pre.remove_non_alpha,
    pre.remove_parentheses,
    pre.make_lowercase,
    pre.remove_stopwords,
    pre.lemmatize
]

In [None]:
# takes dataframe and processes it for text classification
def create_text_df(df):
    # copy so the column assignment below does not modify a view of df
    dropped_df = df.dropna(subset=['Convo_1', 'txgot_binary']).copy()
    
    # preprocesses transcripts
    col_processed = [pre.text_preprocessing(text, PROCESS_PIPELINE) for text in tqdm(dropped_df['Convo_1'])]
    dropped_df['Convo_1'] = cc.untokenize(col_processed)
    
    return dropped_df

In [None]:
text_df = create_text_df(merged)

In [None]:
decision_values, _ = ktc.strat_kfold_text(text_df, folds=10, iterations=5)

In [None]:
import pickle

In [None]:
# save decision values for later use (the with-block closes the file)
with open('dec_values.p', 'wb') as f:
    pickle.dump(decision_values, f)

### Visualizing Text Classification Results

The decision values returned above are the raw outputs of the model's decision function. When predicting, we applied a threshold to turn these decision values into discrete predictions. But maybe we can use the decision values themselves as a feature, adding them to the baseline model.
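To make the thresholding step concrete, here is a minimal sketch of how continuous decision values become discrete predictions. The helper `predict_from_decision_values` and the example values are hypothetical and are not part of `KFoldTextClassification`; the sketch assumes that values above the threshold map to the positive class (1).

```python
import numpy as np


def predict_from_decision_values(decision_values, threshold=0.0):
    """Turn continuous decision values into 0/1 predictions (hypothetical helper)."""
    values = np.asarray(decision_values, dtype=float)
    # values strictly above the threshold are labelled positive (1)
    return (values > threshold).astype(int)


# made-up decision values, for illustration only
example_values = [-1.2, -0.1, 0.3, 2.0]

default_preds = predict_from_decision_values(example_values)
lowered_preds = predict_from_decision_values(example_values, threshold=-0.5)
```

Lowering the threshold moves borderline cases (here the value -0.1) into the positive class, which is the information a hard prediction throws away and a continuous feature keeps.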

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
text_df['decision_values'] = decision_values

In [None]:
g = sns.FacetGrid(text_df,  row="txgot_binary")
g = g.map(plt.hist, "decision_values")

So it looks like the decision values are distributed quite differently for those who chose active surveillance and those who did not. I'm guessing that the default threshold for text classification is 0, but judging by the distributions, a lower threshold could produce even better results.
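One way to check the threshold hunch would be to sweep a few candidate thresholds and compare F-scores. The sketch below is only an illustration on made-up data: `f_score`, `best_threshold`, the toy labels, and the toy values are all hypothetical, and it assumes that label 1 is the positive class.

```python
import numpy as np


def f_score(y_true, y_pred):
    """F1 score for binary 0/1 labels; returns 0.0 when there are no true positives."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, decision_values, candidates):
    """Return the (threshold, F-score) pair with the highest F-score.

    Ties keep the first candidate encountered.
    """
    values = np.asarray(decision_values, dtype=float)
    scored = [(t, f_score(y_true, (values > t).astype(int))) for t in candidates]
    return max(scored, key=lambda pair: pair[1])


# made-up labels and decision values, for illustration only
toy_labels = [0, 0, 1, 1, 1]
toy_values = [-1.5, -0.8, -0.4, 0.2, 1.1]

threshold, score = best_threshold(toy_labels, toy_values, np.linspace(-1.0, 1.0, 5))
```

On this toy data a threshold below 0 separates the classes best. A threshold tuned on the same data it is scored on will look optimistic, so in practice it should be chosen on held-out folds.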

## Combining Text Classification and Baseline Model

To combine text classification and the baseline model, we will use the decision values from text classification as a new feature to add to the baseline model.
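The general idea can be sketched with scikit-learn on synthetic data. This is not the implementation inside `GeneralModel` or `KFoldTextClassification`: `LinearSVC` and `LogisticRegression` are stand-ins chosen for illustration, and the random features, labels, and ages are invented. The sketch assumes the decision values are computed out-of-fold, so that no row's feature comes from a model trained on that row's own label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 60

# synthetic stand-ins for the real data (assumptions, not the study data)
labels = np.array([0, 1] * (n // 2))
text_features = rng.normal(size=(n, 5)) + labels[:, None]
age = rng.normal(loc=65, scale=8, size=n)

# out-of-fold decision values: each row is scored by a model
# that did not see that row during training
oof_decision_values = cross_val_predict(
    LinearSVC(), text_features, labels, cv=5, method="decision_function"
)

# stack the decision values next to a baseline feature and fit a second model
combined_features = np.column_stack([age, oof_decision_values])
combined_model = LogisticRegression(max_iter=1000).fit(combined_features, labels)
```

Treating the decision value as a continuous variable lets the second model weigh how confident the text classifier was, rather than receiving only a yes/no prediction.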

In [None]:
# we will use text_df as the base dataframe
raw_df = text_df.copy()

In [None]:
combined_df = gm.prepare_df(raw_df, cont_vars=['age', 'decision_values'])

In [None]:
# general_model returns a pair, as in the baseline cell; keep only the F-score
combined_fscore, _ = gm.general_model(combined_df)

In [None]:
# F-score for combined model
combined_fscore