# Preparing LOCO-CV and kernelised LOCO-CV splits

In this file we'll go through an example of preparing clusterings for use with LOCO-CV. We will do this with and without a kernel function, using the radial basis function (RBF) as our example kernel.

## First some imports and configurations

In [1]:
import pandas as pd
import os
from utilities import find_clusterings
from sklearn.kernel_approximation import RBFSampler
import json

In [2]:
data_folder = 'data/case_studies'
task_info = 'task_info.json'

with open(task_info) as f:
    tasks = json.load(f)

## Now choose which featurisation and dataset

In [3]:
featurisation_method = 'oliynyk'
task = 'HH stability'

In [4]:
task_folder = os.path.join(data_folder,  # where the data is
                 'featurised',  # whether we are investigating CBFVs or random projections
                 tasks[task]['study_folder'],  # which study?
                 'LOCO-CV',  # 80_20_split or LOCO-CV?
                 tasks[task]['type'],  # regression or classification?
                 tasks[task]['task_folder'])  # which task?
data_file = os.path.join(task_folder,f'{featurisation_method}_CBFV.csv')

## We define a function to split up a given data space with several applications of k-means
From the source code we can see that this is a simple function.

In [5]:
?? find_clusterings

Signature: find_clusterings(data, formulae)
Source:
def find_clusterings(data, formulae):
    """clusters data using kmeans clustering for values of k between 2 and 10

    Parameters:
    data (pandas Dataframe or np.ndarray): data to cluster
    formulae (pandas Series or list of strings): formulae associated with each row of data
    Returns:
    list: clusters in the form [{'k':2, 'formulae':['H2O','NaCl'....],
          'clusters':[0,1...]},{'k':2...]

   """
    clusters = []
    for k in range(2,11

## Read in file

In [6]:
df = pd.read_csv(data_file)
formulae = df['formula']
featurised_data = df.drop(['target','formula'], axis=1)

## For normal LOCO-CV we can just pass this data to our function
We can then save the result for later use.

In [7]:
#This line can take a minute or two
clusters = find_clusterings(featurised_data,formulae)
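`find_clusterings` is imported from the project's `utilities` module. For readers who want to see the idea end to end, here is a minimal sketch of what a function with the documented behaviour might look like. It assumes scikit-learn's `KMeans`; the name `find_clusterings_sketch`, the `k_values` and `seed` parameters, and the toy data are all hypothetical and are not taken from the original source.

```python
import json

import numpy as np
from sklearn.cluster import KMeans


def find_clusterings_sketch(data, formulae, k_values=range(2, 11), seed=0):
    """Hypothetical stand-in for utilities.find_clusterings (not the original code).

    Runs k-means once per value of k and records the cluster label of every formula.
    """
    X = np.asarray(data, dtype=float)
    formulae = [str(f) for f in formulae]
    clusterings = []
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        clusterings.append({
            'k': int(k),
            'formulae': formulae,
            # Cast to plain ints so the result can be written with json.dump
            'clusters': [int(label) for label in labels],
        })
    return clusterings


# Toy data: three well-separated blobs of 10 points each (illustrative only)
rng = np.random.default_rng(0)
toy_X = np.vstack([rng.normal(loc=c, scale=0.1, size=(10, 4)) for c in (0.0, 5.0, 10.0)])
toy_formulae = [f'compound_{i}' for i in range(len(toy_X))]

toy_clusterings = find_clusterings_sketch(toy_X, toy_formulae, k_values=range(2, 5))
serialised = json.dumps(toy_clusterings)  # plain Python types serialise cleanly
```

Casting labels to built-in `int` and formulae to `str` is a design choice made here so that the output can be passed straight to `json.dump`; whether the original function does the same is not shown in the source above.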

In [8]:
with open('example_clustering.json', 'w') as f:
    json.dump(clusters, f)
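Once a clustering is saved, it can be turned into leave-one-cluster-out splits: each cluster in turn becomes the test set and the remaining clusters form the training set. The sketch below assumes the list-of-dicts layout described in the `find_clusterings` docstring (`'k'`, `'formulae'`, `'clusters'`); the helper name `loco_cv_splits` and the toy clustering are hypothetical.

```python
import numpy as np


def loco_cv_splits(clustering):
    """Yield (held_out_cluster, train_idx, test_idx) for one clustering dict."""
    labels = np.asarray(clustering['clusters'])
    for cluster_id in np.unique(labels):
        test_idx = np.flatnonzero(labels == cluster_id)   # rows in the held-out cluster
        train_idx = np.flatnonzero(labels != cluster_id)  # rows in every other cluster
        yield int(cluster_id), train_idx, test_idx


# Toy clustering in the documented format (values are illustrative only)
example = {'k': 3,
           'formulae': ['H2O', 'NaCl', 'MgO', 'LiF', 'KCl', 'CaO'],
           'clusters': [0, 1, 2, 1, 0, 2]}

splits = list(loco_cv_splits(example))
for cluster_id, train_idx, test_idx in splits:
    held_out = [example['formulae'][i] for i in test_idx]
    print(f'cluster {cluster_id}: test on {held_out}, train on {len(train_idx)} rows')
```

With real data, the same helper could be applied to each entry of the list loaded back with `json.load` from `example_clustering.json`.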

## For kernelised LOCO-CV we must first apply the kernel function
Again, we can then save the result for later use.

In [9]:
rbf = RBFSampler()
#This line can take a minute or two
kernelised_clusters = find_clusterings(rbf.fit_transform(featurised_data),formulae)
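`RBFSampler` does not compute the RBF kernel directly. It maps each row to random Fourier features whose inner products approximate the kernel, so running k-means on the transformed rows approximates clustering in the kernel's feature space. The sketch below checks that approximation on synthetic data; the values of `gamma`, `n_components` and `random_state` are illustrative assumptions, not settings used in this notebook.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.metrics.pairwise import rbf_kernel

# Synthetic data standing in for a featurised dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))

gamma = 0.1  # illustrative kernel width
sampler = RBFSampler(gamma=gamma, n_components=2000, random_state=0)
Z = sampler.fit_transform(X)  # random Fourier features, one row per sample

# Inner products of the features should be close to exp(-gamma * ||x - y||^2)
approx_kernel = Z @ Z.T
exact_kernel = rbf_kernel(X, gamma=gamma)
max_abs_error = np.abs(approx_kernel - exact_kernel).max()
print(f'max abs error: {max_abs_error:.3f}')
```

Two practical points follow from this. The transform is random, so fixing `random_state` is one way to make the kernelised clustering repeatable between runs. The quality of the approximation depends on `n_components`, and the kernel itself depends on `gamma` relative to the scale of the features, so the defaults may not suit every featurisation.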

In [10]:
with open('example_kernelised_clustering.json', 'w') as f:
    json.dump(kernelised_clusters, f)