# General Model Walkthrough

This walkthrough covers the capabilities of the GeneralModel module and shows how to use it for analysis.

In [1]:
import pandas as pd
import GeneralModel as gm

In [2]:
# retrieves data
merged = pd.read_csv('../../DataPlus/feature_dataframe.csv')

## Preparing DataFrame

Before training a model, we first have to prepare a dataframe. Processing steps include dealing with null values, normalizing continuous variables, and converting categorical variables into dummy variables.

prepare_df(df, cont_vars=['age'], cat_vars=['gleason'], target_var='txgot_binary', print_dims=True)

Inputs (meaning - default):
* df: original dataframe
* cont_vars: continuous features (as list) - baseline model feature 'age'
* cat_vars: categorical features (as list) - baseline model feature 'gleason'
* target_var: target variable - 'txgot_binary', a binary variable indicating whether the patient received active surveillance
* print_dims: whether to print # of examples - True

Output:
* dataframe ready to pass to the general_model function
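
The processing steps described above can be sketched with plain pandas. This is a minimal illustration, not the implementation of `prepare_df`: the helper name `prepare_df_sketch`, the choice to drop rows containing nulls, the z-score normalization, and the toy data are all assumptions made for the example.

```python
import pandas as pd


def prepare_df_sketch(df, cont_vars, cat_vars, target_var):
    """Hypothetical stand-in for gm.prepare_df; the real function may differ."""
    # Keep only the columns the model needs.
    out = df[cont_vars + cat_vars + [target_var]].copy()

    # Assumption: rows with any missing value are dropped.
    out = out.dropna()

    # Assumption: continuous variables are z-score normalized.
    for col in cont_vars:
        out[col] = (out[col] - out[col].mean()) / out[col].std()

    # Convert categorical variables into dummy (indicator) columns.
    out = pd.get_dummies(out, columns=cat_vars)
    return out


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "age": [60.0, 65.0, None, 70.0],
    "gleason": [6, 7, 7, 6],
    "txgot_binary": [1, 0, 0, 1],
})
prepared_toy = prepare_df_sketch(toy, ["age"], ["gleason"], "txgot_binary")
print(prepared_toy.shape)
```

The row with the missing age is removed, and the single `gleason` column is replaced by one indicator column per observed value.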

In [3]:
# prepares dataframe for the baseline model
prepared_df = gm.prepare_df(merged)

# of Data Points: 392


In [4]:
# adds education feature to baseline model
prepared_edu_df = gm.prepare_df(merged, cat_vars=['gleason', 'edu_binary'])

# of Data Points: 392


## Training Models

Train a model with the general_model function.

general_model(df, algorithm='svm', pred_var='txgot_binary', folds=10, iterations=3, print_results=True, tqdm_on=True, find_auc=True)

Inputs (meaning - default):
* df: dataframe prepared for training
* algorithm: algorithm used to train on data - support vector machine
* pred_var: target variable - 'txgot_binary'
* folds: # of folds for k-fold cross validation - 10 is a standard value for k
* iterations: # of iterations over the k-fold cross validation - 3 to balance time and noise
* print_results: whether to print precision/recall metrics - True
* tqdm_on: whether to display progress bar - True
* find_auc: whether to calculate area under ROC curve - True

Output (in order):
* F-score showing precision and recall performance on the positive class
* dictionary of precision/recall performance for positive and negative class
* area under curve metric (if find_auc set to True)
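
To make the `folds` and `iterations` parameters concrete, here is a rough sketch of repeated k-fold cross validation using scikit-learn. It is not the code inside `general_model`: the synthetic data, the use of stratified folds, and the default `SVC` settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data; 392 matches the number of data points printed above.
X, y = make_classification(n_samples=392, n_features=4, random_state=0)

# folds=10 and iterations=3 correspond to n_splits and n_repeats here.
# Assumption: folds are stratified by class; the module may split differently.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)

# One positive-class F-score per fold per repeat.
scores = cross_val_score(SVC(), X, y, scoring="f1", cv=cv)

print(len(scores))
print(round(scores.mean(), 3))
```

With 10 folds and 3 iterations the model is fit 30 times, and averaging over all fits reduces the noise from any single random split. That averaging is also why the F-scores in the examples that follow vary slightly from run to run.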

### Baseline Model Example (Basic Output)

In [5]:
_, _, _ = gm.general_model(prepared_df, print_results=False, tqdm_on=False, find_auc=False)


F-score: 0.686


### With Progress Bar

In [6]:
_, _, _ = gm.general_model(prepared_df, print_results=False, tqdm_on=True, find_auc=False)



F-score: 0.698


### With Precision/Recall Results

In [7]:
_, _, _ = gm.general_model(prepared_df, print_results=True, tqdm_on=True, find_auc=False)


Average Metrics:
Positive Class Precision: 0.643
Positive Class Recall: 0.756
Negative Class Precision: 0.842
Negative Class Recall: 0.737

F-score: 0.695
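
The reported F-score is consistent with the harmonic mean (F1) of the positive-class precision and recall. Whether `general_model` computes it from the averaged metrics or averages per-fold F-scores is not shown in this walkthrough, so the following is a sanity check rather than the module's own formula. The function name `f_score` is hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Positive-class precision and recall from the output above.
print(round(f_score(0.643, 0.756), 3))  # 0.695
```

If the module averages per-fold F-scores instead, the result can differ from this calculation in the last decimal place.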


### With AUC Metric

In [8]:
_, _, _ = gm.general_model(prepared_df, print_results=True, tqdm_on=True, find_auc=True)


Average Metrics:
Positive Class Precision: 0.639
Positive Class Recall: 0.777
Negative Class Precision: 0.852
Negative Class Recall: 0.73

F-score: 0.702
AUC: 0.754
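
For readers unfamiliar with the AUC metric, one way to read it is as the fraction of (positive, negative) example pairs in which the positive example receives the higher score. The sketch below computes it that way on invented toy scores. The function name `auc_from_scores` is hypothetical and this is not how `general_model` calculates its value.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy labels and classifier scores, invented for illustration.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc_from_scores(toy_labels, toy_scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.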


### Using Logistic Regression and Random Forest

When the algorithm is logistic regression ('lr') or random forest ('rf'), the printed output also includes the model's coefficients or feature importances, respectively.

In [9]:
# random forest
_, _, _ = gm.general_model(prepared_df, algorithm='rf', print_results=False)


  Feature    Weight
0     age  0.416956
1       7  0.583044


F-score: 0.601
AUC: 0.687


In [12]:
# logistic regression
_, _, _ = gm.general_model(prepared_df, algorithm='lr', print_results=False)


  Feature    Weight
0     age  0.651040
1       7 -2.479551


F-score: 0.67
AUC: 0.734
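
The two weight tables above have different meanings. As a sketch of how such tables could be built with scikit-learn (an assumption, since the internals of `general_model` are not shown here), the snippet below fits both model types on synthetic data and tabulates their weights. The feature names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data and placeholder feature names, for illustration only.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
lr = LogisticRegression().fit(X, y)

# Random forest importances are non-negative and sum to 1.
rf_table = pd.DataFrame({"Feature": feature_names, "Weight": rf.feature_importances_})

# Logistic regression coefficients are signed and unbounded.
lr_table = pd.DataFrame({"Feature": feature_names, "Weight": lr.coef_[0]})

print(rf_table)
print(lr_table)
```

This matches the pattern in the outputs above: the random forest weights sum to 1, while a logistic regression weight can be negative, indicating that the feature lowers the predicted odds of the positive class.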


### Adding Education Feature to Baseline Model

The features used for training are chosen when preparing the dataframe. The example below trains a random forest on the dataframe that includes the education variable.

In [11]:
_, _, _ = gm.general_model(prepared_edu_df, algorithm='rf', print_results=False)


             Feature    Weight
0                age  0.561160
1                  7  0.403151
2  No College Degree  0.035689


F-score: 0.609
AUC: 0.687
