Throughout this testing document, arrays from the original code may be replaced with my_array in the solutions to make testing easier.

In [1]:
import numpy as np

my_array = np.array([[0,0,0,1],[4,4,4,4], [3,2,9,0]])

In [2]:
np.all(my_array == 4, axis = 1 )

array([False,  True, False])

# Problem 1

This issue occurs at line 120 in genra.rax.skl.reg and involves both a zero denominator and an inconsistency between tests using non-Jaccard metrics. 

The two proposed solutions yield very different results depending on fingerprint type and metric. TEST prints perform on par with our optimal fingerprint under Solution 1 with the Canberra metric, while Morgan fingerprints have traditionally performed best under the Jaccard metric with a version of Solution 2 applied. In early tests, applying Solution 2 sharply reduced the performance of TEST prints, and applying Solution 1 to Morgan prints also yielded poor results. Perhaps this should become a keyword setting in GenRA (universal_distance = True) that can be toggled on and off?

## Proposed Solution 1: Chemical-specific Maximum Distance

Passing 1 as the axis argument of the max function makes numpy calculate the maximum of each row, so the divisor depends only on the chemical being tested. Setting every zero in the division array to 1 eliminates the zero-denominator problem. This substitution is harmless: in those rows every distance is zero, so we obtain 0/1 = 0 and a similarity score of 1.

With this method, the nth (farthest) neighbor in the set does not directly contribute to the prediction. Its distance serves as the scaling factor, but its endpoint value is not included in the prediction because its relative similarity is zero.

In [3]:
division_array = my_array.max(1).copy()
division_array[division_array == 0] = 1
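
As a rough illustration of how the division array above would be applied, the sketch below converts neighbor distances to similarities row by row. The helper name row_scaled_similarity and the formula similarity = 1 - distance / row maximum are assumptions inferred from the description of Solution 1, not code copied from genra-py.

```python
import numpy as np

def row_scaled_similarity(neigh_dist):
    """Hypothetical helper: similarity = 1 - distance / (maximum distance in the row)."""
    neigh_dist = np.asarray(neigh_dist, dtype=float)
    division_array = neigh_dist.max(axis=1)
    # Rows whose distances are all zero would divide by zero; dividing by 1
    # instead gives 0/1 = 0, i.e. a similarity of 1 for every neighbor.
    division_array[division_array == 0] = 1
    return 1 - neigh_dist / division_array[:, np.newaxis]

sims = row_scaled_similarity([[0, 0, 0],   # identical neighbors
                              [1, 2, 4],   # farthest neighbor only sets the scale
                              [3, 3, 3]])  # all neighbors equally far
print(sims)
```

In the second row the farthest neighbor receives a similarity of zero, so it only sets the scale. In the third row every similarity is zero, which is the zero-denominator case that Problem 2 deals with.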

In [4]:
from genra.rax.skl.hybrid import GenRAPredValueHybrid
import pandas as pd 

chems = {'id': ['DTXSID234567', 'DTXSID123456', 'DTXSID3456789', 'DTXSID4567890', 'DTXSID012345', 'DTXSID112233'], 'p11':[0,0,1,1,1,1],\
          'p12':[0,1,1,1,1,1], 'p13':[1,0,0,0,0,0], 'p14':[0,1,0,0,0,0], 'p21':[1,1,1,1,1,1], 'p22': [0,0,0,0,0,0,], 'p23':[1,1,0,0,0,0], \
            'p24':[0,0,0,0,0,0], 'y' : [1,1,0,1,0,1]}
chems_df = pd.DataFrame(chems)

chems_df

Unnamed: 0,id,p11,p12,p13,p14,p21,p22,p23,p24,y
0,DTXSID234567,0,0,1,0,1,0,1,0,1
1,DTXSID123456,0,1,0,1,1,0,1,0,1
2,DTXSID3456789,1,1,0,0,1,0,0,0,0
3,DTXSID4567890,1,1,0,0,1,0,0,0,1
4,DTXSID012345,1,1,0,0,1,0,0,0,0
5,DTXSID112233,1,1,0,0,1,0,0,0,1


In [5]:
slices = [slice(0,4), slice(4,8)]
estimator = GenRAPredValueHybrid(n_neighbors=3, slices = slices, hybrid_weights=[0,1], metric = 'canberra')
estimator.fit(chems_df.iloc[2:,1:-1], chems_df.iloc[2:,-1])
estimator.predict(chems_df.iloc[0:2, 1:-1])

array([0.33333333, 0.33333333])

In [6]:
estimator.kneighbors(chems_df.iloc[0:2,1:-1])

(array([[1., 1., 1.],
        [1., 1., 1.]]),
 array([[0, 1, 2],
        [0, 1, 2]], dtype=int64))

## Proposed Solution 2: Universal Maximum Distance

This will work only for metrics that have a maximum, or for binary-only fingerprints. We would need an alternative method to adapt to more complex metrics and fingerprints. While the original code could be adapted for error handling, as in Problem 2, it produces different results for different test sets when implemented here, which is very undesirable.

For some metrics, like Canberra, this is a counterintuitive solution. The Canberra metric takes a value between 0 and 1 for each fingerprint feature, meaning that its maximum distance equals the number of features in the fingerprint. However, unlike the Jaccard metric, features which are not present on one chemical but are present on the other pose only a small penalty. As most fingerprints consist of many zeroes and only a few ones, the odds of a distance anywhere near the maximum possible distance are often very low, so dividing by this global maximum greatly compresses the differences between neighbors, leading all neighbors to contribute approximately equally to the prediction for a chemical.

Note: The code for this is also used in Problem 2. 

In [7]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import pairwise_distances
#The following function is defined immediately before kneighbors_sim, and is already incorporated and tested in my
#local genra-py
def maxDistance(self, X):
    """
    Compute the maximum distance between two chemicals (with binary print values, also works for
    Canberra metric with continuous prints)
    
    Helps to identify test chemicals lacking source analogues.
    """
    dims = X.shape[1]
    empty = np.zeros((1, dims), dtype=np.float64)
    full = np.empty((1,dims), dtype=np.float64)
    full.fill(1)
    max_distance = pairwise_distances(empty, full, metric = self.metric)
    
    return max_distance

In [8]:
#Test run
estimator = GenRAPredValueHybrid(n_neighbors=3, slices = slices, hybrid_weights=[1,1], metric = 'canberra')
self = estimator
X = np.array(chems_df.iloc[0:2, 1:-1])
estimator.maxDistance(chems_df.iloc[0:2, 1:-1])

4.0
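
To illustrate the compression described above for Canberra, here is a small numpy-only sketch. The canberra function is a plain reimplementation for illustration (0/0 terms are counted as 0), and the fingerprints, the length of 1000 and the 1 - distance / maximum conversion are all assumptions chosen for the example.

```python
import numpy as np

def canberra(u, v):
    """Canberra distance: sum of |u - v| / (|u| + |v|), counting 0/0 terms as 0."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    num = np.abs(u - v)
    den = np.abs(u) + np.abs(v)
    terms = np.divide(num, den, out=np.zeros_like(num), where=den != 0)
    return terms.sum()

dims = 1000                                    # assumed fingerprint length
query = np.zeros(dims); query[:5] = 1          # five bits set
near = np.zeros(dims); near[:4] = 1            # differs from query in one bit
far = np.zeros(dims); far[5:10] = 1            # shares no bits with query

global_max = canberra(np.zeros(dims), np.ones(dims))   # empty vs. full print
dists = np.array([canberra(query, near), canberra(query, far)])

universal_sim = 1 - dists / global_max         # divide by the universal maximum
local_sim = 1 - dists / dists.max()            # divide by the row maximum
print(global_max, dists, universal_sim, local_sim)
```

With the universal maximum, a neighbor sharing four of five bits and a neighbor sharing none end up with nearly identical similarities, while the row maximum keeps them clearly apart.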

# Problem 2

This issue occurs at approximately line 154, in the statement "y_pred[:, j] = num / denom". Because denom = np.sum(neigh_sim, axis=1), the denominator can be 0 if the tested chemical has no similarity with any chemical in the fit set. We could use the same trick as in Problem 1, Solution 1 to eliminate the zero denominator, but if a chemical has no true source analogues we should probably issue an error message as well, ideally without disrupting the other tests.

In [None]:
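# Sketch of the in-package fix. _y and j are defined inside the package's
# prediction loop, so this cell does not run on its own.
denom = np.sum(my_array, axis=1)
neigh_sim, neigh_ind = estimator.kneighbors_sim(chems_df.iloc[0:2,1:-1])
if 0 in denom:
    # Row indices of the test chemicals whose similarities sum to zero
    zeros = np.flatnonzero(denom == 0).tolist()
    # Fall back to an unweighted mean over the neighbors
    denom[denom == 0] = neigh_sim.shape[1]
    num = np.sum(_y[neigh_ind, j], axis = 1)
    print(f"The training data may not contain source analogues for the chemical(s) \n with row indices: {zeros} ")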
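
The cell above depends on variables that live inside the package. As a standalone sketch of the same idea, the function below computes a similarity-weighted mean and falls back to an unweighted mean for rows whose similarities sum to zero. The function name, the use of warnings.warn instead of print, and the toy inputs are assumptions made for illustration.

```python
import warnings
import numpy as np

def weighted_prediction(neigh_y, neigh_sim):
    """Hypothetical helper: similarity-weighted mean with a zero-denominator fallback."""
    neigh_y = np.asarray(neigh_y, dtype=float)
    neigh_sim = np.asarray(neigh_sim, dtype=float)
    num = np.sum(neigh_y * neigh_sim, axis=1)
    denom = np.sum(neigh_sim, axis=1)
    zero_rows = np.flatnonzero(denom == 0)
    if zero_rows.size:
        warnings.warn(
            "The training data may not contain source analogues for the "
            f"chemical(s) with row indices: {zero_rows.tolist()}"
        )
        # Unweighted mean: sum of neighbor values divided by the neighbor count
        num[zero_rows] = neigh_y[zero_rows].sum(axis=1)
        denom[zero_rows] = neigh_sim.shape[1]
    return num / denom, zero_rows.tolist()

preds, flagged = weighted_prediction(
    neigh_y=[[0, 1, 0], [1, 1, 0]],
    neigh_sim=[[0, 0, 0], [1, 0.5, 0]],
)
print(preds, flagged)
```

The first row has no similarity to any neighbor, so it is flagged and receives the plain mean of its neighbors' values; the other row is unaffected.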


In [9]:
# This code block and the following 2 will recreate the error if run with the 
# original GenRA
from genra.rax.skl.hybrid import GenRAPredValueHybrid
import pandas as pd 

chems = {'id': ['DTXSID234567', 'DTXSID123456', 'DTXSID3456789', 'DTXSID4567890', 'DTXSID012345', 'DTXSID112233'], 'p11':[0,0,1,1,1,1],\
          'p12':[0,1,1,1,1,1], 'p13':[1,0,0,0,0,0], 'p14':[0,1,0,0,0,0], 'p21':[1,1,1,1,1,1], 'p22': [0,0,0,0,0,0,], 'p23':[1,1,0,0,0,0], \
            'p24':[0,0,0,0,0,0], 'y' : [1,1,0,1,0,1]}
chems_df = pd.DataFrame(chems)

chems_df

Unnamed: 0,id,p11,p12,p13,p14,p21,p22,p23,p24,y
0,DTXSID234567,0,0,1,0,1,0,1,0,1
1,DTXSID123456,0,1,0,1,1,0,1,0,1
2,DTXSID3456789,1,1,0,0,1,0,0,0,0
3,DTXSID4567890,1,1,0,0,1,0,0,0,1
4,DTXSID012345,1,1,0,0,1,0,0,0,0
5,DTXSID112233,1,1,0,0,1,0,0,0,1


In [10]:
slices = [slice(0,4), slice(4,8)]
estimator = GenRAPredValueHybrid(n_neighbors=3, slices = slices, hybrid_weights=[0,1], metric = 'canberra')
estimator.fit(chems_df.iloc[2:,1:-1], chems_df.iloc[2:,-1])
estimator.predict(chems_df.iloc[0:2, 1:-1])

array([0.33333333, 0.33333333])

This led to recognition of an additional issue: distinguishing a chemical whose neighbors are all equally far from it from a chemical with no real neighbors (all at the maximum possible distance).

My current solution is the following: 
- Introduce a function that calculates the maximum distance between points. This will only work if there is a maximum distance, as in Canberra, Jaccard, Cosine, etc. It will not work with Minkowski or Euclidean. 
- Create a printed warning if all neighbors identified in kneighbors() are at this maximum distance. 

Note: This first version does not treat hybrid fingerprints correctly, as it ignores their weight in calculating maxDistance. I have since edited hybrid.py to account for this. I believe it to now be working appropriately, but there may be some poor error handling that I have not caught yet.

In [11]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import pairwise_distances
#The following function is defined immediately before kneighbors_sim
def maxDistance(self, X):
    """
    Compute the maximum distance between two chemicals (with binary print values, also works for
    Canberra metric with continuous prints)
    
    Helps to identify test chemicals lacking source analogues.
    """
    dims = X.shape[1]
    empty = np.zeros((1, dims), dtype=np.float64)
    full = np.empty((1,dims), dtype=np.float64)
    full.fill(1)
    max_distance = pairwise_distances(empty, full, metric = self.metric)
    
    return max_distance

# Modification to give values to variables housed in the package code
self = estimator
X = chems_df.iloc[0:2, 1:-1]
# This code currently resides inside the kneighbors_sim function after the use of kneighbors 
neigh_dist, neigh_ind = self.kneighbors(X)

lost_chems = np.all(neigh_dist == self.maxDistance(X), axis = 1)
# Row indices of test chemicals whose neighbors are all at the maximum distance
lost_chem_indices = np.flatnonzero(lost_chems).tolist()
if len(lost_chem_indices) > 0:
    print(f"The training data may not contain source analogues for the chemical(s) with the \n following row indices: {lost_chem_indices} ")
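
The distinction between equally distant neighbors and no real neighbors can be shown on a toy distance matrix. This is a minimal sketch: the helper name flag_lost_rows and the use of np.isclose (a tolerance-based comparison instead of ==) are assumptions, and the maximum distance of 4.0 is simply taken as given.

```python
import numpy as np

def flag_lost_rows(neigh_dist, max_distance):
    """Hypothetical helper: rows whose neighbors all sit at the maximum distance."""
    at_max = np.isclose(neigh_dist, max_distance)
    return np.flatnonzero(at_max.all(axis=1)).tolist()

neigh_dist = np.array([[2.0, 2.0, 2.0],   # equidistant, but real neighbors
                       [4.0, 4.0, 4.0],   # every neighbor at the maximum
                       [1.0, 4.0, 4.0]])  # one real neighbor
lost = flag_lost_rows(neigh_dist, max_distance=4.0)
print(lost)
```

Only the second row is flagged. The first row would still produce a zero denominator under row-wise scaling, but its neighbors are genuine.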


In [12]:
chems = {'id': ['DTXSID234567', 'DTXSID123456', 'DTXSID3456789', 'DTXSID4567890', 'DTXSID012345', 'DTXSID112233'], 'p11':[0,0,1,1,1,0],\
          'p12':[0,1,1,1,1,0], 'p13':[1,0,1,1,1,0], 'p14':[0,1,1,1,1,0], 'p21':[1,1,1,1,1,0], 'p22': [0,0,1,1,1,0,], 'p23':[1,1,1,1,1,0], \
            'p24':[0,0,1,1,1,0], 'y' : [1,1,0,1,0,0]}
chems_df = pd.DataFrame(chems)

chems_df

Unnamed: 0,id,p11,p12,p13,p14,p21,p22,p23,p24,y
0,DTXSID234567,0,0,1,0,1,0,1,0,1
1,DTXSID123456,0,1,0,1,1,0,1,0,1
2,DTXSID3456789,1,1,1,1,1,1,1,1,0
3,DTXSID4567890,1,1,1,1,1,1,1,1,1
4,DTXSID012345,1,1,1,1,1,1,1,1,0
5,DTXSID112233,0,0,0,0,0,0,0,0,0


In [13]:
slices = [slice(0,4), slice(4,8)]
estimator = GenRAPredValueHybrid(n_neighbors=3, slices = slices, hybrid_weights=[0,1], metric = 'canberra')
estimator.fit(chems_df.iloc[2:5,1:-1], chems_df.iloc[2:5,-1])
estimator.predict(chems_df.iloc[5:, 1:-1])

According to this metric, the training data may not contain source analogues for the chemical(s) 
 with the following row indices within the testing set: [0] 


array([0.33333333])

# Fit-Predict Feature Addition

The following tests whether the changes that allow predictions on the fit set are functional. This feature required many small changes, but most of them simply turned X into an optional argument instead of a required one.
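
For readers unfamiliar with the pattern, here is a toy sketch of an optional X. TinyKNN is a stand-in written for this note, not the GenRA API: when X is omitted, the fit set is queried and each chemical is excluded from its own neighborhood. The Manhattan distance and the unweighted mean are simplifications.

```python
import numpy as np

class TinyKNN:
    """Toy stand-in (not the GenRA API) showing X as an optional argument."""

    def __init__(self, n_neighbors=2):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        self._X = np.asarray(X, dtype=float)
        self._y = np.asarray(y, dtype=float)
        return self

    def kneighbors(self, X=None):
        query_is_train = X is None
        Q = self._X if query_is_train else np.asarray(X, dtype=float)
        # Manhattan distance between every query row and every fit row
        dist = np.abs(Q[:, None, :] - self._X[None, :, :]).sum(axis=2)
        if query_is_train:
            # A chemical must not be counted as its own neighbor
            np.fill_diagonal(dist, np.inf)
        ind = np.argsort(dist, axis=1, kind="stable")[:, :self.n_neighbors]
        return np.take_along_axis(dist, ind, axis=1), ind

    def predict(self, X=None):
        _, ind = self.kneighbors(X)
        return self._y[ind].mean(axis=1)

model = TinyKNN(n_neighbors=2).fit([[0], [1], [2], [10]], [0, 1, 2, 10])
fit_set_preds = model.predict()          # predictions on the fit set
new_preds = model.predict([[0]])         # predictions on new data
print(fit_set_preds, new_preds)
```

Without the diagonal masking, every fit-set chemical would find itself at distance zero and the prediction would be biased toward its own value.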

In [14]:
estimator = GenRAPredValueHybrid(n_neighbors=3, slices = slices, hybrid_weights=[0,1], metric = 'canberra')
estimator.fit(chems_df.iloc[0:5,1:-1], chems_df.iloc[0:5,-1])
estimator.predict()

array([1. , 1. , 0.5, 0. , 0.5])

In [15]:
estimator.kneighbors_sim()

(array([[1., 0., 0.],
        [1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 0.],
        [1., 1., 0.]]),
 array([[1, 2, 3],
        [0, 2, 3],
        [3, 4, 0],
        [2, 4, 0],
        [2, 3, 0]], dtype=int64))

In [16]:
from genra.rax.skl.reg import GenRAPredValue
estimator = GenRAPredValue(n_neighbors = 3, metric = 'jaccard')
estimator.fit(np.array(chems_df.iloc[0:, 1:-1]), chems_df.iloc[:, -1])
estimator.predict()

According to this metric, the training data may not contain source analogues for the chemical(s) 
 with the following row indices within the testing set: [5] 




array([0.34782609, 0.33333333, 0.6       , 0.2       , 0.6       ,
       0.66666667])

## Finalization Tests

Here I bring in all of the LD50 data to see how long computations take on this set, and make sure that all changes are running smoothly. 

In [1]:
import pandas as pd
ld50 = pd.read_csv("LD50_pre-grid-search_and_TEST.csv").set_index("Unnamed: 0")
ld50.head()

Unnamed: 0,LD50_LM,tp_fp0,tp_fp1,tp_fp2,tp_fp3,tp_fp4,tp_fp5,tp_fp6,tp_fp7,tp_fp8,...,P=S,-CF3 [aliphatic attach],-CF3 [aromatic attach],-CCl3 [aromatic attach],-CCl3 [aliphatic attach],Halogen [Nitrogen attach],As(=O),-N=C=S,Sn=O,-N=S=O
DTXSID5020281,-0.465339,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DTXSID8020961,-0.734786,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DTXSID0021834,-0.087091,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DTXSID2044347,-1.058925,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DTXSID4025745,-1.022972,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [218]:
ld50.iloc[0:10, 725:735]

Unnamed: 0,tp_fp724,tp_fp725,tp_fp726,tp_fp727,tp_fp728,mg_fp0,mg_fp1,mg_fp2,mg_fp3,mg_fp4
DTXSID5020281,0,0,0,0,0,0,0,0,0,0
DTXSID8020961,0,0,0,0,0,0,0,0,0,0
DTXSID0021834,0,0,0,0,0,0,0,0,0,0
DTXSID2044347,0,0,0,0,0,0,0,0,0,0
DTXSID4025745,0,0,0,0,0,0,0,0,0,0
DTXSID9059208,0,0,0,0,0,0,0,0,0,0
DTXSID7026653,0,0,0,0,0,0,0,0,0,0
DTXSID6026080,0,0,0,0,0,0,0,0,0,0
DTXSID80870440,0,0,0,0,0,0,0,0,0,0
DTXSID7026655,0,0,0,0,0,0,0,0,0,0


In [219]:
from sklearn.metrics import r2_score
slices = [slice(0,729),slice(729,2777), slice(2777,4825), slice(4825, 5727), slice(5727, None)]
tester = GenRAPredValueHybrid(n_neighbors = 8, slices = slices, hybrid_weights=[0,1,0,0,0],  metric = 'jaccard')
tester.fit(ld50.iloc[:,1:], ld50.iloc[:,0])
y_preds = tester.predict()



In [220]:
r2_score(ld50.iloc[:,0], y_preds)

0.5291021179031044

In [221]:
#Test that Pred and PredHybrid are making the same calculations if fed the same prints and metric
from sklearn.metrics import r2_score
slices = [slice(0,729),slice(729,2777), slice(2777,4825), slice(4825, 5727), slice(5727, None)]
value_tester = GenRAPredValue(n_neighbors = 8,  metric = 'jaccard')
value_tester.fit(ld50.iloc[:,730:2778], ld50.iloc[:,0])
y_preds_value = value_tester.predict()
r2_score(ld50.iloc[:,0], y_preds_value)



0.5289070399583202

In [223]:
diffs2 = y_preds - y_preds_value
list(diffs2).index(diffs2.max())

1895

In [224]:
ld50.iloc[1895, 0]

1.71038745566

In [225]:
diffs2.max()

0.3525366596899835

Neither the weight (100 vs. 1) nor query_is_train (True vs. False) affects the results, but hybrid vs. traditional does. The results yield worse r2 scores than usual. The problem persists under Jaccard, so it should not be a result of the Problem 1 fixes.

With the new query_is_train functionality, kneighbors and kneighbors_sim work appropriately and match for both sets, even where the final results do not.

In [147]:
sim_test = tester.kneighbors_sim()




In [148]:
value_sim_test = value_tester.kneighbors_sim()





We see here that the two tests disagree on which chemicals are the neighbors. Despite identical distances, the neighbor indices do not match. This is also true in the version of GenRA from before my edits began. It appears that the two predictors prioritize equidistant chemicals differently, which can alter predictions depending on which indices are kept in the neighborhood.
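
A toy example makes the effect of tie-breaking concrete. The distances and endpoint values below are invented, and the two traversal orders only stand in for whatever ordering differences exist between the two predictors.

```python
import numpy as np

dist = np.array([0.2, 0.5, 0.5, 0.5, 0.9])   # three candidates tie at 0.5
y = np.array([1.0, 0.0, 4.0, 8.0, 2.0])      # invented endpoint values
k = 3

# Tie-break 1: stable sort over the candidates in their stored order
first = np.argsort(dist, kind="stable")[:k]

# Tie-break 2: stable sort over the same candidates visited in reverse order
rev = np.argsort(dist[::-1], kind="stable")[:k]
last = len(dist) - 1 - rev                   # map back to the original indices

print(first, dist[first], y[first].mean())
print(last, dist[last], y[last].mean())
```

Both neighborhoods have identical distances, yet they contain different chemicals and so give different predictions.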

In [149]:
sim_test[0][1895]

array([0.71428571, 0.53488372, 0.36363636, 0.29411765, 0.29411765,
       0.27777778, 0.27777778, 0.27027027])

In [180]:
sim_test[1][1895]

array([5602, 5648, 2644, 1342, 4955, 4954, 5105,  349], dtype=int64)

In [150]:
value_sim_test[0][1895]

array([0.71428571, 0.53488372, 0.36363636, 0.29411765, 0.29411765,
       0.27777778, 0.27777778, 0.27027027])

In [181]:
value_sim_test[1][1895]

array([5602, 5648, 2644, 4955, 1342, 4954, 5105, 4718], dtype=int64)

In [152]:
y_preds[1895], y_preds_value[1895]

(0.9205860407023327, 0.5680493810123493)

In [213]:
#Manual calculation on problem chemical (1895) using hybrid similarity
np.dot(ld50.iloc[list(sim_test[1][1895]), 0], sim_test[0][1895]/sum(sim_test[0][1895]))

-0.8158115911278622

In [179]:
#Manual calculation on problem chemical (1895) using plain similarity
np.dot(ld50.iloc[list(value_sim_test[1][1895]), 0], value_sim_test[0][1895]/sum(value_sim_test[0][1895]))

0.5680493810123494

In [155]:
checking = tester.kneighbors_sim()



In [169]:
checking[1][0]

array([   2,    4,    1, 5805,  837,  870,  103, 3701], dtype=int64)

In [170]:
#Manual calculation on first chemical
np.dot(ld50.iloc[list(checking[1][0]), 0], checking[0][0]/sum(checking[0][0]))

-0.8992473910260691

In [171]:
#Manually produce predictions
neigh_sim = checking[0][0:10]
neigh_ind = checking[1][0:10]
_y = np.array(ld50.iloc[:, 0]).reshape((-1,1))
y_pred = np.empty((10, _y.shape[1]), dtype=np.float64)

denom=np.sum(neigh_sim, axis=1)
for j in range(_y.shape[1]):
    num = np.sum(_y[neigh_ind, j] * neigh_sim, axis=1)
    if 0 in denom:
        denom[denom == 0] = neigh_sim.shape[1]
        num = np.sum(_y[neigh_ind, j], axis = 1)
    y_pred[:, j] = num / denom

In [172]:
# Put the chem in its own prediction to check whether the model is doing the same
neighbors_and_self = list(checking[1][0])[:7]
neighbors_and_self.append(0)

filtered_sims = list(checking[0][0])[:7]
filtered_sims.append(1)
neighbors_and_self, filtered_sims

([2, 4, 1, 5805, 837, 870, 103, 0],
 [0.5172413793103448,
  0.46875,
  0.4666666666666667,
  0.4666666666666667,
  0.4545454545454546,
  0.4516129032258065,
  0.4411764705882353,
  1])

In [173]:
#We see here whether the prediction engine is still cheating, as the y_preds value should not match
# what we obtain if we use the chemical in its own prediction
np.dot(ld50.iloc[neighbors_and_self, 0], filtered_sims/sum(filtered_sims))

-0.7366903749546412

In [174]:
#Predicted value
y_preds[0:10]

array([-0.89924739, -0.81581159, -0.83386653, -1.09043454, -0.73886316,
        0.47486605, -1.2055316 , -1.02671351, -0.73293375,  0.42294367])

In [175]:
#Actual value
ld50.iloc[0:10,0]

Unnamed: 0
DTXSID5020281    -0.465339
DTXSID8020961    -0.734786
DTXSID0021834    -0.087091
DTXSID2044347    -1.058925
DTXSID4025745    -1.022972
DTXSID9059208    -1.176648
DTXSID7026653    -1.090401
DTXSID6026080    -1.071803
DTXSID80870440   -0.988958
DTXSID7026655    -1.295371
Name: LD50_LM, dtype: float64

In [187]:
#This is a check that the numbers are the same between old calculations and new, excepting where we have fixed the calculations
print('scp aleary@v2626umcth031.rtord.epa.gov:/home/aleary/Grid-Search_Optimization/outputs/pre-edit_hybrids.csv data/')

scp aleary@v2626umcth031.rtord.epa.gov:/home/aleary/Grid-Search_Optimization/outputs/pre-edit_hybrids.csv data/


In [188]:
#Import the slow version of the results done on a VM
old_df = pd.read_csv('pre-edit_hybrids.csv').set_index("Unnamed: 0")

In [206]:
#Check for significant (non-rounding) differences
((old_df['0'] - y_preds) > 0.001).value_counts()

0
False    5829
True        1
Name: count, dtype: int64

In [197]:
#The only difference occurs at chemical 1 (note that other differences are just rounding issues, as they
# are extremely small)
old_df['0']-y_preds

Unnamed: 0
0       0.000000e+00
1       2.027197e-01
2       0.000000e+00
3       0.000000e+00
4       0.000000e+00
            ...     
5825   -1.110223e-16
5826    0.000000e+00
5827   -8.326673e-17
5828    0.000000e+00
5829    1.110223e-16
Name: 0, Length: 5830, dtype: float64
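
One caveat about the threshold check above: a signed comparison such as (old - new) > 0.001 only catches cases where the old value is larger. The sketch below, on invented numbers, compares it with a two-sided check using np.isclose; whether any such cases exist in the real data is not something this example shows.

```python
import numpy as np

old = np.array([0.5, 0.7, -1.0, 0.3])
new = np.array([0.5, 0.7 + 1e-16, -1.2, 0.5])

# Signed check: flags only rows where old exceeds new by more than 0.001
signed = (old - new) > 0.001

# Two-sided check: flags rows that differ by more than 0.001 in either direction
two_sided = ~np.isclose(old, new, rtol=0, atol=0.001)

print(signed.sum(), two_sided.sum())
```

The last row differs by 0.2 but is missed by the signed check because the new value is the larger one.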

In [216]:
#Checking this with VM results, we find that this is because of differences in selecting between equidistant
#neighbors, a difference which we've seen also arises between GenRAPredValue and GenRAPredValueHybrid
sim_test[0][1], sim_test[1][1]

(array([0.46666667, 0.46666667, 0.46666667, 0.45454545, 0.4516129 ,
        0.4375    , 0.42424242, 0.42424242]),
 array([   0,    2, 5805, 4143,  870, 3701,    4,    5], dtype=int64))

## universal_distance

In [4]:
from sklearn.metrics import r2_score
from genra.rax.skl.reg import GenRAPredValue
slices = [slice(0,729),slice(729,2777), slice(2777,4825), slice(4825, 5727), slice(5727, None)]
value_tester = GenRAPredValue(n_neighbors = 8,  metric = 'canberra', universal_distance = False)
value_tester.fit(ld50.iloc[:,730:2778], ld50.iloc[:,0])
y_preds_value = value_tester.predict()
r2_score(ld50.iloc[:,0], y_preds_value)

0.40869731381220464

In [5]:
from genra.rax.skl.hybrid import GenRAPredValueHybrid
slices = [slice(0,729),slice(729,2777), slice(2777,4825), slice(4825, 5727), slice(5727, None)]
tester = GenRAPredValueHybrid(n_neighbors = 8, slices = slices, hybrid_weights=[0,1,0,0,0],  metric = 'canberra', universal_distance = False)
tester.fit(ld50.iloc[:,1:], ld50.iloc[:,0])
y_preds = tester.predict()
r2_score(ld50.iloc[:, 0], y_preds)

0.4085822190914844

In [6]:
from sklearn.metrics import r2_score
from genra.rax.skl.reg import GenRAPredValue
slices = [slice(0,729),slice(729,2777), slice(2777,4825), slice(4825, 5727), slice(5727, None)]
value_tester = GenRAPredValue(n_neighbors = 8,  metric = 'canberra', universal_distance = True)
value_tester.fit(ld50.iloc[:,730:2778], ld50.iloc[:,0])
y_preds_value = value_tester.predict()
r2_score(ld50.iloc[:,0], y_preds_value)

0.3208001073985547

In [7]:
from genra.rax.skl.hybrid import GenRAPredValueHybrid
slices = [slice(0,729),slice(729,2777), slice(2777,4825), slice(4825, 5727), slice(5727, None)]
tester = GenRAPredValueHybrid(n_neighbors = 8, slices = slices, hybrid_weights=[0,1,0,0,0],  metric = 'canberra', universal_distance = True)
tester.fit(ld50.iloc[:,1:], ld50.iloc[:,0])
y_preds = tester.predict()
r2_score(ld50.iloc[:, 0], y_preds)

0.32413184429146624

# Grid Search

In [44]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import get_scorer, r2_score, root_mean_squared_error

def GenRAGridSearch(X, y, params, estimator, scoring, scorer_names = None, n_splits = 5, refit = None):
    cv = ShuffleSplit(n_splits = n_splits, test_size = 0.25)
    cv_results = {}
    number_names = scorer_names is None
    if number_names:
        scorer_names = range(len(scoring))
    for param in params.keys():
        cv_results[param] = []
    for split in range(n_splits):
        for scorer in scorer_names:
            cv_results[f'split{split}_scorer_{scorer}'] = []
    params_list = ParameterGrid(params)
    # Use the X passed in by the caller rather than overwriting it with the global ld50 frame
    for test_case in params_list:
        for key, value in test_case.items():
            setattr(estimator, key, value)
            cv_results[key].append(value)
        estimator.fit(X,y)
        y_preds = estimator.predict()
        split = 0
        for train, test in cv.split(X):
        
            for x in range(len(scoring)):
                cv_results[f'split{split}_scorer_{scorer_names[x]}'].append(scoring[x](y.iloc[test].values, y_preds[test]))
            split += 1
    return cv_results

In [19]:
get_scorer('r2')(y_true = [1,0,1], estimator = estimator, X = ld50.iloc[:3, 1:])

-7.741799641215682

In [29]:
import numpy as np

def generalJaccard(row1, row2):
    diff = np.array(row1)-np.array(row2)
    overlap = np.dot(row1, row2)
    denom = np.dot(diff, diff) + overlap
    # Two all-zero rows give a zero denominator; treat them as having no similarity
    if denom != 0:
        similarity = overlap/denom
    else:
        similarity = 0
    return similarity

def generalJaccardDistance(row1, row2):
    similarity = generalJaccard(row1, row2)
    distance = 1 - similarity
    return distance
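
As a quick check of the formula above: expanding the squared difference gives a·b / (|a|² + |b|² - a·b), which for binary rows reduces to intersection over union, the ordinary Jaccard similarity. The sketch below uses that expanded form; the function name and the test vectors are chosen only for illustration.

```python
import numpy as np

def tanimoto_similarity(row1, row2):
    """Expanded form of the same ratio: a.b / (|a|^2 + |b|^2 - a.b)."""
    a = np.asarray(row1, dtype=float)
    b = np.asarray(row2, dtype=float)
    dot = np.dot(a, b)
    denom = np.dot(a, a) + np.dot(b, b) - dot
    return dot / denom if denom != 0 else 0.0

# Binary rows: 2 shared bits, 3 bits set in either row
binary_sim = tanimoto_similarity([1, 1, 0, 1], [1, 0, 0, 1])

# Continuous rows: a.b = 0.5, |a - b|^2 = 4.25, so the ratio is 0.5 / 4.75
continuous_sim = tanimoto_similarity([0.5, 2.0], [1.0, 0.0])

# Two all-zero rows fall back to a similarity of 0
empty_sim = tanimoto_similarity([0, 0], [0, 0])
print(binary_sim, continuous_sim, empty_sim)
```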

In [45]:
#Test without score names
from genra.rax.skl.hybrid import GenRAPredValueHybrid

estimator = GenRAPredValueHybrid(slices = slices, hybrid_weights = [0,10,0,0,0])
params = {'n_neighbors': [5,3], 'metric':['cosine', 'canberra']}
results = GenRAGridSearch(ld50.iloc[:, 1:], ld50.iloc[:,0], estimator = estimator, params = params, scoring = [r2_score, root_mean_squared_error])

In [46]:
#View and check results
import pandas as pd
pd.DataFrame(results)

Unnamed: 0,n_neighbors,metric,split0_scorer_0,split0_scorer_1,split1_scorer_0,split1_scorer_1,split2_scorer_0,split2_scorer_1,split3_scorer_0,split3_scorer_1,split4_scorer_0,split4_scorer_1
0,5,cosine,0.552874,0.620729,0.477578,0.625033,0.517619,0.610308,0.482408,0.64006,0.493,0.655923
1,3,cosine,0.435959,0.656569,0.442906,0.660034,0.447501,0.678343,0.461575,0.661315,0.466902,0.649122
2,5,canberra,0.346869,0.751469,0.393115,0.715251,0.372982,0.73675,0.360284,0.723898,0.310428,0.781357
3,3,canberra,0.310255,0.769751,0.303151,0.72838,0.308976,0.748938,0.359187,0.722717,0.339781,0.727279


In [47]:
#Test with score names
from genra.rax.skl.hybrid import GenRAPredValueHybrid

estimator = GenRAPredValueHybrid(slices = slices, hybrid_weights = [0,10,0,0,0])
params = {'n_neighbors': [5,3], 'metric':['cosine', 'canberra']}
results = GenRAGridSearch(ld50.iloc[:, 1:], ld50.iloc[:,0], estimator = estimator, params = params, scoring = [r2_score, root_mean_squared_error], scorer_names = ['r2', 'rmse'])

In [48]:
#View and check results
import pandas as pd
pd.DataFrame(results)

Unnamed: 0,n_neighbors,metric,split0_scorer_r2,split0_scorer_rmse,split1_scorer_r2,split1_scorer_rmse,split2_scorer_r2,split2_scorer_rmse,split3_scorer_r2,split3_scorer_rmse,split4_scorer_r2,split4_scorer_rmse
0,5,cosine,0.502115,0.657672,0.509102,0.656562,0.490355,0.651194,0.451952,0.646172,0.476466,0.670157
1,3,cosine,0.470067,0.688506,0.491152,0.656325,0.510373,0.633938,0.493559,0.663551,0.460882,0.665037
2,5,canberra,0.337227,0.722201,0.346824,0.715709,0.315082,0.752058,0.2946,0.748112,0.315651,0.73762
3,3,canberra,0.363521,0.736552,0.380568,0.720413,0.352519,0.73916,0.321911,0.747936,0.386011,0.699463


In [None]:
from genra.rax.skl.hybrid import GenRAPredValueHybrid
from sklearn.base import BaseEstimator, clone
from sklearn.model_selection._validation import _fit_and_score


params = {'n_neighbors': [0,1,3], 'metric':['jaccard', 'canberra']}
params_list = ParameterGrid(params)
for i in params_list:
    print(i)

estimator = GenRAPredValueHybrid(hybrid_weights = [0,1,0,0,0], slices = slices)
# i still holds the last parameter combination from the loop above
for key, value in i.items():
    setattr(estimator, key, value)
estimator

{'metric': 'jaccard', 'n_neighbors': 0}
{'metric': 'jaccard', 'n_neighbors': 1}
{'metric': 'jaccard', 'n_neighbors': 3}
{'metric': 'canberra', 'n_neighbors': 0}
{'metric': 'canberra', 'n_neighbors': 1}
{'metric': 'canberra', 'n_neighbors': 3}


In [46]:
base_estimator = clone(estimator)
base_estimator.set_params(n_neighbors = 8)
base_estimator

TypeError: BaseEstimator.set_params() takes 1 positional argument but 2 were given

In [16]:
X.iloc[train, 1:]

Unnamed: 0,tp_fp1,tp_fp2,tp_fp3,tp_fp4,tp_fp5,tp_fp6,tp_fp7,tp_fp8,tp_fp9,tp_fp10,...,P=S,-CF3 [aliphatic attach],-CF3 [aromatic attach],-CCl3 [aromatic attach],-CCl3 [aliphatic attach],Halogen [Nitrogen attach],As(=O),-N=C=S,Sn=O,-N=S=O
DTXSID3040649,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DTXSID3047477,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DTXSID3059397,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DTXSID7044974,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DTXSID30604010,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
DTXSID10188045,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DTXSID9023889,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DTXSID4045638,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DTXSID20213352,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
