This notebook merges the solver runtimes recorded on Janus with Boyana's matrix properties for the UF collection.

In [1]:
import pandas as pd
import numpy as np

In [2]:
timings = pd.read_csv('./data/all_results_janus_single_node_1-14-17.csv')

Rename the columns to simpler names for ease of use.

In [3]:
timings.columns = ['np', 'matrix', 'solver', 'prec', 'status', 'time', 'iters', 'resid']

Read in the properties file and merge the files together based on the 'matrix' column.

In [4]:
properties = pd.read_csv('./data/uflorida-features.csv', header=0)

In [5]:
# NOTE: 'min_nnz_row' appears twice in this list (3rd and 18th label);
# check it against the CSV header, since duplicate labels make column lookups ambiguous.
properties.columns = [
    'rows', 'cols', 'min_nnz_row', 'row_var', 'col_var', 'diag_var', 'nnz',
    'frob_norm', 'symm_frob_norm', 'antisymm_frob_norm', 'one_norm', 'inf_norm',
    'symm_inf_norm', 'antisymm_inf_norm', 'max_nnz_row', 'trace', 'abs_trace',
    'min_nnz_row', 'avg_nnz_row', 'dummy_rows', 'dummy_rows_kind',
    'num_value_symm_1', 'nnz_pattern_symm_1', 'num_value_symm_2', 'nnz_pattern_symm_2',
    'row_diag_dom', 'col_diag_dom', 'diag_avg', 'diag_sign', 'diag_nnz',
    'lower_bw', 'upper_bw', 'row_log_val_spread', 'col_log_val_spread', 'symm', 'matrix',
]
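
Duplicate labels are easy to miss in a list this long, and 'min_nnz_row' appears twice in the one above. A minimal sketch of a check that could run before assigning the labels; the shortened `names` list and the helper `find_duplicates` are illustrative and not part of the original notebook.

```python
import pandas as pd

def find_duplicates(names):
    """Return the labels that occur more than once, in first-seen order."""
    idx = pd.Index(names)
    # duplicated() marks every occurrence after the first one
    return list(idx[idx.duplicated()].unique())

# Shortened, illustrative subset of the property labels
names = ['rows', 'cols', 'min_nnz_row', 'max_nnz_row', 'min_nnz_row', 'matrix']
dupes = find_duplicates(names)
print(dupes)
```

An empty result means every label is unique and column lookups by name are unambiguous.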

Combine the two dataframes into a single dataframe called 'combined',  
then replace the string data with numerical data.


In [6]:
combined = pd.merge(timings, properties, on='matrix')
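
The default `pd.merge` is an inner join, so timing rows whose matrix has no entry in the properties file are dropped without warning. A hedged sketch on toy data (all values invented for illustration), assuming each matrix should appear at most once in the properties table; `validate` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for the real files; values are invented for illustration
timings_demo = pd.DataFrame({
    'matrix': ['a.mtx', 'a.mtx', 'b.mtx', 'c.mtx'],
    'time': [0.5, 0.7, 1.2, 3.4],
})
properties_demo = pd.DataFrame({
    'matrix': ['a.mtx', 'b.mtx'],
    'rows': [100, 250],
})

# validate='many_to_one' raises MergeError if 'matrix' repeats in properties_demo.
# how='left' with indicator=True keeps unmatched timing rows and labels them.
checked = pd.merge(timings_demo, properties_demo, on='matrix',
                   how='left', validate='many_to_one', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'matrix'].tolist()
print(unmatched)

# The default inner merge drops the unmatched rows
inner = pd.merge(timings_demo, properties_demo, on='matrix')
print(len(inner))
```

Comparing `len(inner)` with the number of timing rows is a quick way to see how many runs were lost in the merge.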

In [None]:
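# Not executed in this run. It has to run after the *_num columns are created below,
# because those cells still read combined.solver, combined.prec and combined.status.
combined = combined.drop(['matrix', 'solver', 'prec', 'status', 'iters', 'resid'], axis=1)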

In [7]:
# Returns a new array; the result is not assigned, so `combined` itself is unchanged
np.nan_to_num(combined)

array([[1, 'saylr3.mtx', 'FIXED_POINT', ..., 5.31281, 5.31543, 0],
       [1, 'saylr3.mtx', 'FIXED_POINT', ..., 5.31281, 5.31543, 0],
       [1, 'saylr3.mtx', 'FIXED_POINT', ..., 5.31281, 5.31543, 0],
       ..., 
       [12, 'patents_main.mtx', 'PSEUDOBLOCK_GMRES', ..., 6.30103, 6.30103,
        0],
       [12, 'patents_main.mtx', 'PSEUDOBLOCK_GMRES', ..., 6.30103, 6.30103,
        0],
       [12, 'patents_main.mtx', 'PSEUDOBLOCK_GMRES', ..., 6.30103, 6.30103,
        0]], dtype=object)
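
The array above has dtype=object because the string columns are still present. A sketch of one way to do the replacement in pandas itself, assuming the intent is to set missing numeric values to 0 and leave the string columns untouched (toy data, invented values):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'matrix': ['a.mtx', 'b.mtx', 'c.mtx'],
    'time': [0.5, np.nan, 3.4],
    'iters': [10.0, 20.0, np.nan],
})

# Pick only the numeric columns, then fill their NaNs with 0
num_cols = demo.select_dtypes(include='number').columns
demo[num_cols] = demo[num_cols].fillna(0)
print(demo)
```

Whether 0 is a sensible fill value for a runtime or an iteration count is a modelling decision; dropping the affected rows with `dropna` is another option.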

Convert the string-valued categories to integer codes.

In [9]:
solver_codes = {
    'FIXED_POINT': 0, 'BICGSTAB': 1, 'MINRES': 2, 'PSEUDOBLOCK_CG': 3,
    'PSEUDOBLOCK_STOCHASTIC_CG': 4, 'PSEUDOBLOCK_TFQMR': 5, 'TFQMR': 6,
    'LSQR': 7, 'PSEUDOBLOCK_GMRES': 8,
}
combined['solver_num'] = combined.solver.map(solver_codes).astype(int)

In [10]:
prec_codes = {'ILUT': 0, 'RILUK': 1, 'RELAXATION': 2, 'CHEBYSHEV': 3, 'NONE': 4}
combined['prec_num'] = combined.prec.map(prec_codes).astype(int)

In [11]:
status_codes = {'error': -1, 'unconverged': 0, 'converged': 1}
combined['status_num'] = combined.status.map(status_codes).astype(int)
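
`Series.map` with a dict returns NaN for any value that is missing from the dict, and `.astype(int)` then fails on those NaNs with an error that does not name the offending category. A sketch of a guard that reports the unmapped values first; the helper `encode` and the toy series are hypothetical.

```python
import pandas as pd

def encode(series, mapping):
    """Map string categories to ints, failing loudly on unmapped values."""
    mapped = series.map(mapping)
    # Rows that came back NaN are the ones whose category is not in the dict
    missing = sorted(series[mapped.isna()].unique())
    if missing:
        raise ValueError(f"unmapped categories: {missing}")
    return mapped.astype(int)

status_map = {'error': -1, 'unconverged': 0, 'converged': 1}
status = pd.Series(['converged', 'error', 'unconverged', 'converged'])
codes = encode(status, status_map)
print(codes.tolist())
```

The same helper could be applied to the solver and preconditioner columns with their respective dicts.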

Write the result to CSV if needed:  
combined.to_csv('./data/combined.csv')
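
By default `to_csv` also writes the row index as an unnamed first column, which comes back as an extra 'Unnamed: 0' column when the file is read again. A small round-trip sketch, using an in-memory buffer instead of a file and invented values:

```python
import io
import pandas as pd

demo = pd.DataFrame({'np': [1, 12], 'time': [5.3, 6.3]})

with_index = io.StringIO()
demo.to_csv(with_index)              # default: index written as first column
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

no_index = io.StringIO()
demo.to_csv(no_index, index=False)   # omit the index column
no_index.seek(0)
cols_clean = pd.read_csv(no_index).columns.tolist()
print(cols_default, cols_clean)
```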

Check for NaN, infinite, or otherwise unexpected values before running the model.  
Should print True False False False

In [34]:
print(np.all(np.isfinite(combined)), np.any(np.isnan(combined)), np.any(np.isinf(combined)), np.all(np.isscalar(combined))) 

True False False False
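
Two caveats on the check above: `np.isfinite` raises a TypeError if any string column is still present, and `np.isscalar` is False for any DataFrame, so the fourth value carries no information. A sketch of a more targeted report restricted to the numeric columns; `finite_report` is a hypothetical helper and the data is invented.

```python
import numpy as np
import pandas as pd

def finite_report(df):
    """Count NaN and +/-inf entries in the numeric columns of df."""
    values = df.select_dtypes(include='number').to_numpy(dtype=float)
    return {
        'all_finite': bool(np.isfinite(values).all()),
        'n_nan': int(np.isnan(values).sum()),
        'n_inf': int(np.isinf(values).sum()),
    }

demo = pd.DataFrame({
    'time': [0.5, np.inf, 3.4],
    'iters': [10, 20, 30],
    'solver': ['TFQMR', 'LSQR', 'MINRES'],  # ignored: not numeric
})
report = finite_report(demo)
print(report)
```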


% Commented out because fitting can eat up all of the machine's resources; it also needs a 'good_or_bad' label column, which is not created in this notebook  
from sklearn.ensemble import RandomForestClassifier  
X_train = combined.drop('good_or_bad', axis=1)  
Y_train = combined['good_or_bad']  
rf = RandomForestClassifier(n_estimators=100)  
rf.fit(X_train.to_numpy(), Y_train.to_numpy())  