# BA prediction ML Trial 4

## Summary up to this point:
### Previous:
* __"1D Feature Experiments:"__
  * The _"1D Features"_ are the inputs of the original experiments: quantitative (and typically whole-molecule) RDKit calculations for each drug, e.g., MolWt, LogP, NumHDonors, NumHAcceptors, etc. 
  * Of the many RDKit calculations available as features, we tested the performance of a linear regression model on several feature sets, each a different subset of the RDKit calculations:
    * feature set 1 (1a, 1b) had all RDKit calculations; 1a and 1b differed only in whether VSA was represented as absolute VSA _versus_ percent of total VSA.
    * feature set 4 had the fewest calculations, focusing only on those reported in the literature as being strongly tied to bioavailability; these are the metrics commonly used as filters in screening
    * feature set 2 was a copy of feature set 1, narrowed by running set 1 through a cross-correlation matrix and removing variables with correlation > 0.95
    * there were also ANOVA variants of the feature sets, which ran an ANOVA test against a 3-category label (low BA, mid BA, high BA) and dropped features that failed the statistical significance criterion (i.e., features for which none of the categories differed significantly from the others)
  * Regarding feature selection, __Feature set 1b__ was the best performing RDKit feature subset, followed most closely by 1b-ANOVA.
  * Regarding labeling scheme, only _Continuous_ (regression) and _3-category_ (classification) labels were tested.
    * for regression, the best r2 value (i.e., with feature set 1b) was typically between 0.14 and 0.20 in Linear Regression models.
    * for classification (3-category), accuracy was around 0.5 
<br><br>
* __"Fragment experiments:"__
  * Hypothesis: 
    1. it is hypothesized that the model does not predict well __because it covers many drug types / classes, which are not significantly differentiated by the features__. 
    2. it is further hypothesized that the missing information about drug classes _could be_ tied to 2D structure, or more specifically to the presence/absence of key structural motifs / scaffolds in the chemical structure of various drug classes. _(For a simple example of such a link: acidic/basic groups in a drug allow its preparation as a pharmaceutical salt, which can boost bioavailability.)_
    <br>
  * Thus, it was decided to attempt a representation of drug classes and chemical structure using information about their 2D chemical structure - _specifically_, using an array of chemical fragments. 
    * Each drug was subjected to fragmentation, creating a fragment library.
    * Each drug's Murcko Scaffold, and fragmentation products thereof, were added to the fragment library.
    * Drugs were linked to the fragments, scaffolds, and scaffold fragments by explicit lookup between drug & fragment library, _but were also_ linked indirectly by searching for Substructure Matches across all drugs/fragments. <br><br>
  * __Latest finding:__ 
    * The fragment set having most correlation with bioavailability was __`frags_all`__, which is the comprehensive set based on both fragment production _and_ matches of substructure search.  
    * The NLP representations of the generated fragments ranked as follows, best first: (1) __CountVectorizer__ on the fragment array, (2) __TF-IDF__, (3) __n-gram (n=2) + CountVectorizer__, (4) word2vec
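
To make the vectorization options above more concrete, here is a minimal plain-Python sketch (no Spark needed) of how a drug's fragment list can become a count vector, and how the n=2 variant first joins adjacent fragments into bigrams. The drug names and fragment strings are invented for illustration, and this is only the idea behind the features, not the pipeline that produced them.

```python
from collections import Counter

# Hypothetical fragment lists (SMILES-like tokens), one list per drug.
drug_fragments = {
    "drug_A": ["c1ccccc1", "C(=O)O", "N"],
    "drug_B": ["c1ccccc1", "c1ccccc1", "O"],
}

def bigrams(tokens):
    """Join each pair of adjacent tokens into one bigram token."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def count_vectors(docs):
    """Return (vocabulary, {doc name: count vector}) for tokenised docs."""
    totals = Counter(tok for toks in docs.values() for tok in toks)
    # Most frequent token first; ties broken alphabetically for determinism.
    vocab = sorted(totals, key=lambda t: (-totals[t], t))
    vectors = {}
    for name, toks in docs.items():
        counts = Counter(toks)
        vectors[name] = [counts[t] for t in vocab]
    return vocab, vectors

# Unigram ("CountVectorizer"-style) representation.
vocab, vectors = count_vectors(drug_fragments)

# Bigram representation: build bigram tokens first, then count them.
bigram_docs = {name: bigrams(toks) for name, toks in drug_fragments.items()}
bigram_vocab, bigram_vectors = count_vectors(bigram_docs)
```

One thing the sketch makes visible: bigrams depend on the order of fragments within each list, so they can only add signal if that order is meaningful.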

### Agenda for this NB:

* Determine the best Regression Model for BA prediction based on fragments
* Determine the best Classification Model for BA prediction based on fragments
* Eventually:
  * compare performance of fragments when used in conjunction with original 1D feature sets

## UPDATED Agenda for this NB:

* Initially, an attempt was made to combine the fragment NLP features with the RDKit "1D" features. The resulting r2 value was terrible (negative); this is because the numerical RDKit features become rather useless once they've been joined into the large, high-dimensional vector of NLP features.
* __To resolve this problem,__ 
  * We will use NLP features and the Numerical RDKit calculations separately, instead of combining.
  * Essentially, the new idea will be to: 
    - Run a Classification Model using the NLP features to classify the BA of each drug (using the categorical labels)
    - Run a Regression Model using the numerical features _and the predicted class_ to predict the final bioavailability 

### <font color='red'> Actual Steps: </font>
* <font color='green'>__Step 1.__</font> verify this approach is feasible using RFC w/ NLP features (__"Model 1"__) followed by RFR w/ Numerical features (__"Model 2"__) <br><br>
* <font color='green'>__Step 2(a).__</font> if feasible (as per above), then determine the best model type for _Model 1_ (classification step) 
* <font color='green'>__Step 2(b).__</font> does the choice of categorical label affect which model type appears best for classification _Model 1_ ?  <br> _(e.g., label_cat0, label_cat1, label_cat2, label_cat3)_ <br><br>

* <font color='green'>__Step 3.__</font> Run trials comparing the combination of Model 1 and Model 2. <br> __Variables to study include:__
  * Classification as 1, Regression as 2 <br> _vs._ Regression as 1, Classification as 2 <br> _vs._ Regression as 1 and 2 <br> _vs._ Classification as 1 and 2 
  * number of classes for Classification _(e.g., label_cat0, label_cat1, label_cat2, label_cat3)_
  * Labels for Regression Model 2: predict absolute BA value..? or predict ___residual BA___ from model 1?
  * Manual combination of Model 1/Model 2 predictions, <br> _versus_ direct use of Model 1 prediction in Model 2 features?
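
The "residual BA" option in Step 3 can be read in more than one way. The sketch below shows one possible reading, under a loud assumption that is not stated in these notes: Model 1's predicted class is turned into a BA estimate by taking the mean BA of that class in the training data, and Model 2 is then trained on whatever is left over. All numbers and helper names are hypothetical.

```python
from statistics import mean

# Hypothetical training rows: (class predicted by Model 1, true BA in %).
train = [(0, 10.0), (0, 20.0), (1, 50.0), (1, 60.0), (2, 90.0)]

def class_means(rows):
    """Mean BA per predicted class: Model 1's implied BA estimate (assumption)."""
    grouped = {}
    for cls, ba in rows:
        grouped.setdefault(cls, []).append(ba)
    return {cls: mean(vals) for cls, vals in grouped.items()}

means = class_means(train)

# Residual label for Model 2: true BA minus Model 1's implied estimate.
residual_labels = [ba - means[cls] for cls, ba in train]

def final_prediction(pred_class, pred_residual):
    """Manual combination: Model 1's estimate plus Model 2's predicted residual."""
    return means[pred_class] + pred_residual

example = final_prediction(1, 3.5)
```

With absolute BA as the Model 2 label, the predicted class would instead just be one more input feature and no manual recombination step would be needed.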

In [1]:
# MSM VM config prep
import findspark
findspark.init('/home/mitch/spark-3.3.0-bin-hadoop2')
import pyspark
 
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('BApredsV4').getOrCreate()

# --- suppress future spark warnings/error/etc output ---
spark.sparkContext.setLogLevel("OFF")

22/09/08 16:35:20 WARN Utils: Your hostname, mitch-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
22/09/08 16:35:20 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/09/08 16:35:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
import pandas as pd
data = pd.read_pickle("data_final_NEW_2022-09-05_aspandas.pkl")
pd.set_option('display.max_columns',None)
data = data.rename(columns={'Name':'name','_c0':'index'})
data = data.drop(columns='drug_name')
data = spark.createDataFrame(data)

In [9]:
data.printSchema()

root
 |-- index: long (nullable = true)
 |-- name: string (nullable = true)
 |-- label_q0: double (nullable = true)
 |-- label_cat0: double (nullable = true)
 |-- label_cat1: double (nullable = true)
 |-- label_cat2: double (nullable = true)
 |-- label_cat3: double (nullable = true)
 |-- label_cat4: double (nullable = true)
 |-- smile: string (nullable = true)
 |-- MolWt: double (nullable = true)
 |-- ExactMolWt: double (nullable = true)
 |-- qed: double (nullable = true)
 |-- MolLogP: double (nullable = true)
 |-- MolMR: double (nullable = true)
 |-- VSA_total: double (nullable = true)
 |-- LabuteASA: double (nullable = true)
 |-- TPSA: double (nullable = true)
 |-- MaxPartialCharge: double (nullable = true)
 |-- MinPartialCharge: double (nullable = true)
 |-- MaxAbsPartialCharge: double (nullable = true)
 |-- MinAbsPartialCharge: double (nullable = true)
 |-- NumHAcceptors: double (nullable = true)
 |-- NumHDonors: double (nullable = true)
 |-- HeavyAtomCount: double (nullable = true)

## List of column titles

In [4]:
# all column titles in data
columns_all = [
    'index','name',
    'label_q0','label_cat0','label_cat1','label_cat2','label_cat3','label_cat4',
    'smile','MolWt','ExactMolWt','qed','MolLogP','MolMR','VSA_total','LabuteASA',
    'TPSA','MaxPartialCharge','MinPartialCharge','MaxAbsPartialCharge','MinAbsPartialCharge','NumHAcceptors',
    'NumHDonors','HeavyAtomCount','NumHeteroatoms','NumRotatableBonds','NHOHCount','NOCount','FractionCSP3',
    'RingCount','NumAliphaticRings','NumAromaticRings','NumAliphaticHeterocycles','NumAromaticHeterocycles',
    'NumSaturatedHeterocycles','NumSaturatedRings','BalabanJ','BertzCT','HallKierAlpha','fracVSA_PEOE01',
    'fracVSA_PEOE02','fracVSA_PEOE03','fracVSA_PEOE04','fracVSA_PEOE05','fracVSA_PEOE06','fracVSA_PEOE07',
    'fracVSA_PEOE08','fracVSA_PEOE09','fracVSA_PEOE10','fracVSA_PEOE11','fracVSA_PEOE12','fracVSA_PEOE13',
    'fracVSA_PEOE14','fracVSA_SMR01','fracVSA_SMR02','fracVSA_SMR03','fracVSA_SMR04','fracVSA_SMR05',
    'fracVSA_SMR06','fracVSA_SMR07','fracVSA_SMR08','fracVSA_SMR09','fracVSA_SMR10','fracVSA_SlogP01',
    'fracVSA_SlogP02','fracVSA_SlogP03','fracVSA_SlogP04','fracVSA_SlogP05','fracVSA_SlogP06','fracVSA_SlogP07',
    'fracVSA_SlogP08','fracVSA_SlogP09','fracVSA_SlogP10','fracVSA_SlogP11','fracVSA_SlogP12',
    'FEAT_rdkit_1a','FEAT_rdkit_1b','FEAT_rdkit_2a','FEAT_rdkit_2b','FEAT_rdkit_3','FEAT_rdkit_4a',
    'FEAT_rdkit_4b','FEAT_rdkit_1bANOVA','FEAT_rdkit_2bANOVA',
    'frags_all','frags_better','frags_best','frags_efgs','frags_brics',
    'frags_cv','frags_cv_idf','frags_tf','frags_tf_idf','frags_w2v','frags_n2g','frags_n2g_cv','frags_n2g_cv_idf',
    'FEAT_frags_cv','FEAT_frags_cv_idf','FEAT_frags_tf','FEAT_frags_tf_idf',
    'FEAT_frags_w2v','FEAT_frags_n2g_cv','FEAT_frags_n2g_cv_idf']

# column titles - all labels and feature vectors
columns_labelsAndFeatures_all = [
    'index','name',
    'label_q0','label_cat0','label_cat1','label_cat2','label_cat3','label_cat4',
    
    'FEAT_rdkit_1a','FEAT_rdkit_1b','FEAT_rdkit_2a','FEAT_rdkit_2b','FEAT_rdkit_3','FEAT_rdkit_4a',
    'FEAT_rdkit_4b','FEAT_rdkit_1bANOVA','FEAT_rdkit_2bANOVA',
    
    'FEAT_frags_cv','FEAT_frags_cv_idf','FEAT_frags_tf','FEAT_frags_tf_idf',
    'FEAT_frags_w2v','FEAT_frags_n2g_cv','FEAT_frags_n2g_cv_idf']

# column titles - labels + feature vectors (in order of performance)
columns_labelsAndFeatures_subset = [
    'index','name',
    'label_q0','label_cat0','label_cat1','label_cat2','label_cat3','label_cat4',
    
    # rdkit features in order of efficacy
    'FEAT_rdkit_1b','FEAT_rdkit_1bANOVA','FEAT_rdkit_2b','FEAT_rdkit_3',
    
    # fragment features in order of efficacy
    'FEAT_frags_cv','FEAT_frags_cv_idf', 
    'FEAT_frags_tf','FEAT_frags_tf_idf',
    'FEAT_frags_n2g_cv','FEAT_frags_n2g_cv_idf',
    'FEAT_frags_w2v']

columns_labelsAndFeatures = [
    'index','name',
    'label_q0','label_cat0','label_cat1','label_cat2','label_cat3','label_cat4',
    'FEAT_rdkit_1b',
    'FEAT_frags_cv']

### Select data subset for use in training and testing

In [68]:
data_final = data.select(columns_labelsAndFeatures_subset)

In [69]:
(training,testing) = data_final.randomSplit([0.7,0.3])

# Notebook Agenda, step 1:
* <font color='green'>__Step 1.__</font> verify this approach is feasible using RFC w/ NLP features (__"Model 1"__) followed by RFR w/ Numerical features (__"Model 2"__) <br>

* Note: labels are
  *	'label_q0' (Regression)
  *	'label_cat0' (5-class Classification)
  *	'label_cat1' (3-class Classification)
  *	'label_cat2' (4-class Classification)
  *	'label_cat3' (5-class Classification) 
  *	'label_cat4' (5-class Classification)
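
For reference, `label_cat0` appears to come from a 5-bucket quantile discretization of the BA percentage (see the label scratch work at the end of this notebook). A minimal plain-Python sketch of that kind of bucketing is below; the BA values are made up, and the exact cut points Spark would choose may differ since its quantiles are approximate.

```python
from bisect import bisect_right
from statistics import quantiles

# Hypothetical BA percentages.
ba_values = [5.0, 12.0, 20.0, 35.0, 50.0, 60.0, 75.0, 80.0, 90.0, 99.0]

# Four cut points split the data into five roughly equal-sized buckets.
cut_points = quantiles(ba_values, n=5, method="inclusive")

def bucket(value):
    """Index (0..4) of the quantile bucket that a BA value falls into."""
    return bisect_right(cut_points, value)

labels = [bucket(v) for v in ba_values]
```

Quantile buckets give balanced class sizes by construction, which is worth keeping in mind when comparing accuracy across label schemes with different numbers of classes.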


In [78]:
(training,testing) = data_final.randomSplit([0.5,0.5])

In [79]:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

featuresName = 'FEAT_frags_cv_idf'
labelName = 'label_cat1' #'label_cat4'

rfc = RandomForestClassifier(featuresCol=featuresName,labelCol=labelName,predictionCol='prediction_1')
mcEvaluator = MulticlassClassificationEvaluator(labelCol=labelName, predictionCol="prediction_1", metricName="accuracy")

mymodel_rfc = rfc.fit(training)
myresults_rfc = mymodel_rfc.transform(testing)

model_eval_rfc = mcEvaluator.evaluate(myresults_rfc)

print(f"RFC accuracy = {model_eval_rfc}")

RFC accuracy = 0.4889705882352941


In [80]:
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

featuresName = 'FEAT_rdkit_1b'
labelName = 'label_q0' 

rfr = RandomForestRegressor(featuresCol=featuresName,labelCol=labelName,predictionCol='prediction_2')
regEvaluator = RegressionEvaluator(labelCol=labelName,predictionCol='prediction_2',metricName='r2')

mymodel_rfr1 = rfr.fit(training)
myresults_rfr1 = mymodel_rfr1.transform(myresults_rfc)

model_eval_rfr1 = regEvaluator.evaluate(myresults_rfr1)

print(f"RFR_1 r2 = {model_eval_rfr1}")

RFR_1 r2 = 0.20825864238974512


* let's try training an alternative to RFR1 that adds the RFC prediction (`prediction_1`) to its features

In [99]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler( inputCols=['prediction_1','FEAT_rdkit_1b'],outputCol='features_RFR2')

combined_model_data = assembler.transform(myresults_rfc)

In [100]:
(combined_training,combined_testing) = combined_model_data.randomSplit([0.5,0.5])

In [101]:
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

featuresName = 'features_RFR2'
labelName = 'label_q0' 

rfr2 = RandomForestRegressor(featuresCol=featuresName,labelCol=labelName,predictionCol='combinedPrediction_1')
regEvaluator2 = RegressionEvaluator(labelCol=labelName,predictionCol='combinedPrediction_1',metricName='r2')

mymodel_rfr2 = rfr2.fit(combined_training)
myresults_rfr2 = mymodel_rfr2.transform(combined_testing)

model_eval_rfr2 = regEvaluator2.evaluate(myresults_rfr2)

print(f"RFR_2 r2 = {model_eval_rfr2}")

RFR_2 r2 = 0.28112328420151345


* let's try training a separate model (RFR3) that uses only the first two independent predictions (`prediction_1` from RFC and `prediction_2` from RFR1) as its features

In [84]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler( inputCols=['prediction_1','prediction_2'],outputCol='features_RFR3')
combined_model_data = assembler.transform(myresults_rfr1)

In [85]:
(combined_training,combined_testing) = combined_model_data.randomSplit([0.7,0.3])

In [86]:
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

featuresName = 'features_RFR3'
labelName = 'label_q0' 

rfr3 = RandomForestRegressor(featuresCol=featuresName,labelCol=labelName,predictionCol='prediction_3')
regEvaluator3 = RegressionEvaluator(labelCol=labelName,predictionCol='prediction_3',metricName='r2')

mymodel_rfr3 = rfr3.fit(combined_training)
myresults_rfr3 = mymodel_rfr3.transform(combined_testing)

model_eval_rfr3 = regEvaluator3.evaluate(myresults_rfr3)

print(f"RFR_3 r2 = {model_eval_rfr3}")

RFR_3 r2 = 0.20084863301003586


# <font color='red'> Prior Scratch Work </font>

In [38]:
''' # create a history log for evaluation trials
'''
evaluation_history = {}

import pickle
with open('evaluation_history_temp.pickle', 'wb') as handle:
    pickle.dump(evaluation_history, handle, protocol=pickle.HIGHEST_PROTOCOL)


# Note: to load eval history, use:
# with open('evaluation_history.pickle', 'rb') as handle:
#    evaluation_history = pickle.load(handle)
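
Note that the cell above writes an empty dict to `evaluation_history_temp.pickle`, while later cells read and write `evaluation_history.pickle`. A small pair of helpers can make the load-or-create step explicit so an existing log is never replaced by accident. This is only a sketch; the helper names are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

def load_history(path):
    """Return the saved history dict, or an empty dict if no file exists yet."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open("rb") as handle:
        return pickle.load(handle)

def save_history(history, path):
    """Write the history dict to disk, replacing any previous file."""
    with Path(path).open("wb") as handle:
        pickle.dump(history, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Usage: round-trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    history_file = Path(tmp) / "evaluation_history.pickle"
    first_load = load_history(history_file)   # no file yet -> {}
    save_history({"trial_0": {"lr": {0: {"r2": -2.5}}}}, history_file)
    reloaded = load_history(history_file)
```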

In [53]:
(training,testing) = data_final.randomSplit([0.7,0.3])

In [54]:
trial = 0
trial_description = 'regression performance test: best NLP & RDkit features combined (frags_all CV with RDKit features_1b)'
featuresName = 'features' 
labelName = 'label_q0'  

'''# Import Regression Models
'''
from pyspark.ml.regression import (LinearRegression,
                                   DecisionTreeRegressor,RandomForestRegressor,GBTRegressor,
                                   GeneralizedLinearRegression,IsotonicRegression)

lr = LinearRegression(featuresCol=featuresName,labelCol=labelName,predictionCol='prediction')
dtr = DecisionTreeRegressor(featuresCol=featuresName,labelCol=labelName,predictionCol='prediction')
rfr = RandomForestRegressor(featuresCol=featuresName,labelCol=labelName,predictionCol='prediction')
gbtr = GBTRegressor(featuresCol=featuresName,labelCol=labelName,predictionCol='prediction')
glr = GeneralizedLinearRegression(featuresCol=featuresName,labelCol=labelName,predictionCol='prediction')
ir = IsotonicRegression(featuresCol=featuresName,labelCol=labelName,predictionCol='prediction')

models_regression = {'lr':lr,'dtr':dtr,'rfr':rfr,'gbtr':gbtr,'glr':glr,'ir':ir}


''' # Create DataFrame for displaying eval metrics of current trial
'''
eval_df = pd.DataFrame() # fresh, empty metrics table (this replaces any eval_df left over from a prior run)


''' # Specify experiment type
'''
modeltype = lr  # --- SPECIFY!!! (lr,dtr,rfr,gbtr,glr,ir)
modelname = 'lr'
modelname_short = 'lr' # SPECIFY
iteration = 0


''' # Load eval History file for logging eval metrics of all trials
'''
import pickle
with open('evaluation_history.pickle', 'rb') as handle:
    evaluation_history = pickle.load(handle)

trialName = f"trial_{trial}"
iterationName = f"iteration_{iteration}"
evaluation_history[trialName] = {}
evaluation_history[trialName]['features_set'] = featuresName
evaluation_history[trialName]['description'] = trial_description
evaluation_history[trialName][modelname_short] = {}
evaluation_history[trialName][modelname_short][iteration] = {}
evaluation_history[trialName][modelname_short][iteration]['label'] = labelName


''' # BEGIN TRIAL
'''


# (no re-initialisation here: the per-iteration dict created above already holds 'label')

mymodel = modeltype.fit(training)
myresults = mymodel.transform(testing)

# CALCULATE KEY EVALS
from pyspark.ml.evaluation import RegressionEvaluator
regEvaluator = RegressionEvaluator(labelCol=labelName,predictionCol='prediction')

evaluator = regEvaluator
evalMetrics = {regEvaluator:['rmse','mse','mae','r2','var']}

evaluation = []

for each_metric in evalMetrics[evaluator]:        
    metric = each_metric

    result = evaluator.evaluate(myresults, {evaluator.metricName: metric})

    evaluation.append((metric,result))

    evaluation_history[trialName][modelname_short][iteration][metric] = result

#r2_adj = mymodel.summary.r2adj
#evaluation.append(('r2_adj(Training)',r2_adj))
column0 = [x for x,y in evaluation]
column1 = [y for x,y in evaluation]
eval_df['metric'] = column0
eval_df[modelname] = column1
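
Assigning `{}` to a nested key, as the cell above does for the trial entry, replaces whatever was stored there, so running a second model under the same trial number would drop the first model's metrics. One way around this is `dict.setdefault`, which only creates a level when it is missing. The sketch below is an assumption about how the log could be written, with a hypothetical helper name and illustrative values.

```python
def log_metric(history, trial, model, iteration, key, value):
    """Store one value under history[trial][model][iteration][key]
    without discarding anything already logged alongside it."""
    run = (history.setdefault(trial, {})
                  .setdefault(model, {})
                  .setdefault(iteration, {}))
    run[key] = value
    return history

# Illustrative values only.
history = {}
log_metric(history, "trial_0", "lr", 0, "label", "label_q0")
log_metric(history, "trial_0", "lr", 0, "r2", -2.5)
log_metric(history, "trial_0", "rfr", 0, "r2", 0.14)
```

After these three calls, both the `lr` and `rfr` entries are present under `trial_0`, and the `label` entry sits next to the metrics.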

In [55]:
eval_df.head(10)

| | metric | lr |
|---|---|---|
| 0 | rmse | 62.343237 |
| 1 | mse | 3886.679154 |
| 2 | mae | 44.574759 |
| 3 | r2 | -2.518276 |
| 4 | var | 3521.569333 |


In [45]:
eval_df.head(10)

| | metric | lr |
|---|---|---|
| 0 | rmse | 61.310566 |
| 1 | mse | 3758.985443 |
| 2 | mae | 46.397048 |
| 3 | r2 | -2.205402 |
| 4 | var | 3413.480928 |
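
A note on reading the negative r2 values above: with the usual definition, r2 = 1 - SS_res / SS_tot, a value below zero means the model's squared error is larger than that of always predicting the mean label. The sketch below works this through on made-up numbers; it uses the textbook formula, which as far as I know matches what the Spark evaluator reports.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [10.0, 30.0, 50.0, 70.0, 90.0]      # hypothetical BA values
baseline = [50.0] * 5                        # always predict the mean
bad_model = [90.0, 70.0, 50.0, 30.0, 10.0]   # systematically wrong

r2_baseline = r2_score(y_true, baseline)     # SS_res equals SS_tot
r2_bad = r2_score(y_true, bad_model)         # SS_res is 4x SS_tot
```

Here the mean-only baseline scores exactly 0 and the systematically wrong model scores -3, so an r2 around -2.5 says the combined-feature linear model did considerably worse than a constant prediction.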


In [None]:
''' # BACKUP EVALUATION HISTORY
'''
import pickle
with open('evaluation_history.pickle', 'wb') as handle:
    pickle.dump(evaluation_history, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [9]:
# iteration 5
pd.set_option('display.max_columns',None)
eval_df

| | metric | rfr5_FEAT_frags_cv | rfr5_FEAT_frags_cv_idf | rfr5_FEAT_frags_tf | rfr5_FEAT_frags_tf_idf | rfr5_FEAT_frags_w2v | rfr5_FEAT_frags_n2g_cv | rfr5_FEAT_frags_n2g_cv_idf |
|---|---|---|---|---|---|---|---|---|
| 0 | rmse | 31.357003 | 31.357003 | 31.452699 | 31.452699 | 32.463218 | 32.245083 | 32.245083 |
| 1 | mse | 983.261616 | 983.261616 | 989.272279 | 989.272279 | 1053.860549 | 1039.74539 | 1039.74539 |
| 2 | mae | 26.999519 | 26.999519 | 27.02489 | 27.02489 | 27.826697 | 27.886761 | 27.886761 |
| 3 | r2 | 0.142043 | 0.142043 | 0.136798 | 0.136798 | 0.080441 | 0.092757 | 0.092757 |
| 4 | var | 113.862026 | 113.862026 | 109.195099 | 109.195099 | 93.289556 | 71.799789 | 71.799789 |


In [20]:
# iteration 4
pd.set_option('display.max_columns',None)
eval_df

Unnamed: 0,metric,rfr4_all_cv,rfr4_all_cv_idf,rfr4_all_w2v,rfr4_all_n2g_cv,rfr4_all_n2g_cv_idf,rfr4_better_cv,rfr4_better_cv_idf,rfr4_better_w2v,rfr4_better_n2g_cv,rfr4_better_n2g_cv_idf,rfr4_best_cv,rfr4_best_cv_idf,rfr4_best_w2v,rfr4_best_n2g_cv,rfr4_best_n2g_cv_idf,rfr4_efgs_cv,rfr4_efgs_cv_idf,rfr4_efgs_w2v,rfr4_efgs_n2g_cv,rfr4_efgs_n2g_cv_idf,rfr4_brics_cv,rfr4_brics_cv_idf,rfr4_brics_w2v,rfr4_brics_n2g_cv,rfr4_brics_n2g_cv_idf
0,rmse,30.17598,30.17598,31.409608,30.949282,30.949282,30.280202,30.280202,31.080018,30.751594,30.751594,32.942816,32.942816,32.942816,32.942816,32.942816,30.937496,30.937496,31.27324,31.722631,31.722631,31.940796,31.940796,32.455095,32.44206,32.44206
1,mse,910.589795,910.589795,986.563479,957.85808,957.85808,916.890642,916.890642,965.967548,945.66053,945.66053,1085.229105,1085.229105,1085.229105,1085.229105,1085.229105,957.128681,957.128681,978.015518,1006.325306,1006.325306,1020.214464,1020.214464,1053.33322,1052.487287,1052.487287
2,mae,25.837154,25.837154,26.815872,26.594224,26.594224,26.007038,26.007038,26.338364,26.332094,26.332094,28.537745,28.537745,28.537745,28.537745,28.537745,26.648317,26.648317,26.654979,27.422521,27.422521,27.685907,27.685907,27.903745,28.05279,28.05279
3,r2,0.160775,0.160775,0.090755,0.117211,0.117211,0.154968,0.154968,0.109737,0.128453,0.128453,-0.000178,-0.000178,-0.000178,-0.000178,-0.000178,0.117883,0.117883,0.098633,0.072542,0.072542,0.059742,0.059742,0.029218,0.029998,0.029998
4,var,115.263109,115.263109,89.208954,69.161578,69.161578,116.994071,116.994071,119.705079,90.484681,90.484681,0.192768,0.192768,0.192768,0.192768,0.192768,63.837409,63.837409,131.507831,39.141144,39.141144,47.292935,47.292935,85.452032,19.895657,19.895657


### Results:
**Findings:**
1. the most effective fragment representation is `frags_all`
2. the most effective nlp representation is **CountVectorizer** or CV-IDF > bigram CV > word2vec

## NEW: Train & Evaluate `frags_all` (CV, bigram CV, word2vec) using different regression models
**Goal:** 
1. Determine the most useful type of regression model for this data

In [19]:
# iteration 3
eval_df

| | metric | rfr3_w2vec_def | rfr3_w2vec_cust | rfr3_cv_idf | rfr3_tf_idf | rfr3_n2grams_cv_idf | rfr3_n3grams_cv_idf | rfr3_n4grams_cv_idf |
|---|---|---|---|---|---|---|---|---|
| 0 | rmse | 31.168074 | 31.133882 | 30.123685 | 30.159424 | 30.834382 | 31.182895 | 31.495046 |
| 1 | mse | 971.448852 | 969.318623 | 907.436409 | 909.590845 | 950.759135 | 972.372924 | 991.937937 |
| 2 | mae | 26.523286 | 26.843417 | 25.845951 | 25.864042 | 26.494599 | 26.798564 | 27.15982 |
| 3 | r2 | 0.102098 | 0.104067 | 0.161264 | 0.159273 | 0.121221 | 0.101244 | 0.08316 |
| 4 | var | 111.697618 | 81.278117 | 114.494916 | 117.116534 | 69.047863 | 69.24537 | 60.039057 |


* Observations:
  * at least when sourced from "frags_all", the most effective vector features for the fragments are:
    * cv-idf > tf-idf > 2-gram-cv-idf > word2vec
    * sometimes, word2vec has been better than bigram cv-idf
    * among the word2vec variants, the default settings usually perform best
  * it might be interesting to **check how well bigrams perform in a word2vec model**

In [22]:
''' # BACKUP EVALUATION HISTORY
# --- NOTE: to load eval history, use:
# with open('evaluation_history.pickle', 'rb') as handle:
#    evaluation_history = pickle.load(handle)
'''
import pickle

with open('evaluation_history.pickle', 'wb') as handle:
    pickle.dump(evaluation_history, handle, protocol=pickle.HIGHEST_PROTOCOL)



In [35]:
eval_df

| | metric | rfr_features_w2vec_def | rfr_features_w2vec_cust | rfr_features_cv_idf | rfr_features_tf_idf | rfr_features_ngrams_cv_idf |
|---|---|---|---|---|---|---|
| 0 | rmse | 30.965411 | 31.48614 | 29.843273 | 30.128548 | 30.991749 |
| 1 | mse | 958.856685 | 991.377036 | 890.620956 | 907.729404 | 960.488518 |
| 2 | mae | 26.766015 | 27.368773 | 25.910357 | 26.088403 | 26.862186 |
| 3 | r2 | 0.115671 | 0.085678 | 0.178603 | 0.162824 | 0.114166 |
| 4 | var | 88.740958 | 70.540491 | 96.338847 | 105.874621 | 64.765121 |


In [32]:
testx = eval_df.set_index('metric')
testx = testx.transpose()
testx.head(6)

| metric | rmse | mse | mae | r2 | var |
|---|---|---|---|---|---|
| rfr_features_w2vec_def | 32.081343 | 1029.212583 | 27.718737 | 0.091856 | 100.840483 |
| rfr_features_w2vec_cust | 32.149451 | 1033.587185 | 27.879429 | 0.087996 | 85.211317 |
| rfr_features_cv_idf | 31.647149 | 1001.542048 | 27.534192 | 0.116271 | 114.702764 |
| rfr_features_tf_idf | 31.820964 | 1012.573776 | 27.683315 | 0.106537 | 94.232878 |
| rfr_features_ngrams_cv_idf | 32.336071 | 1045.621459 | 28.12613 | 0.077377 | 55.617227 |


In [156]:
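''' # Check feature importances
'''
import pandas as pd

# NOTE: `test_subset_final` and `rfmodel` are expected to exist from an earlier
# scratch session; they are not defined in the cells above.
rf_output = rfmodel.featureImportances  # sparse vector of per-feature importances
attrs = test_subset_final.schema["features_tf_idf"].metadata["ml_attr"]["attrs"]

# attrs maps each attribute type (e.g. 'numeric') to a list of {'idx','name'} entries;
# combine all types into one frame instead of keeping only the last one
features_df = pd.concat([pd.DataFrame(v) for v in attrs.values()], ignore_index=True)
features_df['Importance'] = features_df['idx'].apply(lambda x: rf_output[x] if x in rf_output.indices else 0)

features_df.sort_values("Importance", ascending=False, inplace=True)
features_df.head()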


| | idx | name | Importance |
|---|---|---|---|
| 582 | 582 | frags_tf_idf_582 | 0.082723 |
| 1602 | 1602 | frags_tf_idf_1602 | 0.074515 |
| 2375 | 2375 | frags_tf_idf_2375 | 0.0478 |
| 866 | 866 | frags_tf_idf_866 | 0.039371 |
| 2280 | 2280 | frags_tf_idf_2280 | 0.034088 |


* NOTE: to get vocab back, you can do it from the CountVectorizer indices
  * fit CountVectorizer separately, store the vocab like this:
    * `vectorizer = CountVectorizer(inputCol="tokens", outputCol="features").fit(df)`
    * `vectorizer.vocabulary`
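
Once the vocabulary is stored, the lookup itself is simple: position `i` in the vocabulary list names the fragment behind feature index `i`. The plain-Python sketch below uses hypothetical stand-ins for both the vocabulary and the importances; in Spark these would come from the fitted CountVectorizer model and the random forest model.

```python
# Hypothetical stand-ins: fragment vocabulary (list position = feature index)
# and sparse feature importances (feature index -> importance).
vocabulary = ["c1ccccc1", "C(=O)O", "N", "O"]
importances = {0: 0.05, 1: 0.40, 3: 0.10}

def top_fragments(vocabulary, importances, n=3):
    """Return the n (fragment, importance) pairs with the highest importance."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [(vocabulary[idx], score) for idx, score in ranked[:n]]

top = top_fragments(vocabulary, importances, n=2)
```

This only works for vectors whose indices come from a fitted vocabulary. If a feature column was built with a hashing transform instead, there is no vocabulary to invert and the indices cannot be mapped back this way.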

## Test: Check feature importances of fragments as features 

# <font color='red'> Proceed with other feature vectors </font>

In [14]:
''' 
# FEATURE SELECTION:
'''
# to load the Features Information, use the command:
featuresDF = pd.read_parquet('featuresCatalogDF_2022-08-16.parquet')

In [None]:
#index_pos = featuresDF[featuresDF['name']=='F1a'].index[0]
feature_set1a = featuresDF.loc[0,'features']
feature_set1b = featuresDF.loc[1,'features']
feature_set2a = featuresDF.loc[2,'features']
feature_set2b = featuresDF.loc[3,'features']
feature_set3  = featuresDF.loc[4,'features']
feature_set4a = featuresDF.loc[5,'features']
feature_set4b = featuresDF.loc[6,'features']


# VECTOR ASSEMBLY - feature sets 1a,1b,2a,2b,3,4a,4b
from pyspark.ml.linalg import Vector
from pyspark.ml.feature import (VectorAssembler,VectorIndexer)

vec_assembler1a = VectorAssembler(inputCols = feature_set1a, outputCol='features1a')
vec_assembler1b = VectorAssembler(inputCols = feature_set1b, outputCol='features1b')
vec_assembler2a = VectorAssembler(inputCols = feature_set2a, outputCol='features2a')
vec_assembler2b = VectorAssembler(inputCols = feature_set2b, outputCol='features2b')
vec_assembler3 = VectorAssembler(inputCols = feature_set3, outputCol='features3')
vec_assembler4a = VectorAssembler(inputCols = feature_set4a, outputCol='features4a')
vec_assembler4b = VectorAssembler(inputCols = feature_set4b, outputCol='features4b')

from pyspark.ml import Pipeline
feature_pipeline = Pipeline(stages=[vec_assembler1a,
                                    vec_assembler1b,
                                    vec_assembler2a,
                                    vec_assembler2b,
                                    vec_assembler3,
                                    vec_assembler4a,
                                    vec_assembler4b])
data_features = feature_pipeline.fit(data).transform(data)


''' 
# LABELS 
# Data has 5 labels; 1 continuous BA percentage, 4 discretized groups.
#  - ranges of discretized groups were selected by examining histogram distributions, mean & stdev of BA %
#  - we'll add one more discretized label alternative using Spark's QuantileDiscretizer, into 5 groups

# -- Add QuantileDiscretizer labels
from pyspark.ml.feature import QuantileDiscretizer
import pandas as pd
qd5 = QuantileDiscretizer(numBuckets=5,inputCol='BA_pct',outputCol='label_QD5')

data_features = qd5.fit(data_features).transform(data_features)

# -- INDEX / ENCODE LABELS
from pyspark.ml.feature import (StringIndexer,OneHotEncoder)

label_quant0 = 'BA_pct'
label_cat0_vector = OneHotEncoder(inputCol='label_QD5',outputCol='label_cat0_vector')

label_cat1_index = StringIndexer(inputCol='label1',outputCol='label_cat1_index')
label_cat1_vector = OneHotEncoder(inputCol='label_cat1_index',outputCol='label_cat1_vector')

label_cat2_index = StringIndexer(inputCol='label2',outputCol='label_cat2_index')
label_cat2_vector = OneHotEncoder(inputCol='label_cat2_index',outputCol='label_cat2_vector')

label_cat3_index = StringIndexer(inputCol='label3a',outputCol='label_cat3_index')
label_cat3_vector = OneHotEncoder(inputCol='label_cat3_index',outputCol='label_cat3_vector')

label_cat4_index = StringIndexer(inputCol='label3b',outputCol='label_cat4_index')
label_cat4_vector = OneHotEncoder(inputCol='label_cat4_index',outputCol='label_cat4_vector')

from pyspark.ml import Pipeline
label_pipeline = Pipeline(stages=[label_cat0_vector,
                                 label_cat1_index,label_cat1_vector,
                                 label_cat2_index,label_cat2_vector,
                                 label_cat3_index,label_cat3_vector,
                                 label_cat4_index,label_cat4_vector])
'''
data_features = data_features.select(['Name','BA_pct','label_QD5','label1','label2','label3a','label3b','features1a','features1b','features2a','features2b','features3','features4a','features4b'])

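# NOTE: `data_prefinal` (used below) only exists if the label pipeline in the string block above is re-enabled and the next line is uncommented
#data_prefinal = label_pipeline.fit(data_features).transform(data_features)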

data_prefinal2 = data_prefinal.withColumnRenamed('BA_pct','label_q0')
data_prefinal2 = data_prefinal2.withColumnRenamed('label_QD5','label_cat0')
data_prefinal2 = data_prefinal2.withColumnRenamed('label_cat1_index','label_cat1')
data_prefinal2 = data_prefinal2.withColumnRenamed('label_cat2_index','label_cat2')
data_prefinal2 = data_prefinal2.withColumnRenamed('label3a','label3')
data_prefinal2 = data_prefinal2.withColumnRenamed('label3b','label4')
data_prefinal2 = data_prefinal2.withColumnRenamed('label_cat3_index','label_cat3')
data_prefinal2 = data_prefinal2.withColumnRenamed('label_cat4_index','label_cat4')

data_final = data_prefinal2.select(['Name','label_q0',
                                    'label_cat0','label_cat1','label_cat2','label_cat3','label_cat4',
                                    'features1a','features1b','features2a','features2b','features3','features4a','features4b'])