## Name: GraphAny for tabular data
### Date: 05/09/2024
### Status: Done. Seems to work surprisingly well, better than RF on average across many datasets.
### Idea: 
Apply the idea of [GraphAny](https://arxiv.org/pdf/2405.20445) to tabular data.

As a first step, we build a kNN graph for each dataset and then filter it. Specifically, for each data point we draw edges to its 100 closest samples, which gives an adjacency matrix. The adjacency matrix is then filtered to keep only the top 10% of the edges, based on distance.

After that, we run GraphAny as usual, either transductively or inductively.
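The graph construction described above could be sketched roughly as follows. This is a minimal sketch and not the code used for these experiments: the function name `knn_graph`, the Euclidean metric, reading "top 10%" as the 10% shortest edges over the whole graph, and the final symmetrisation are all assumptions.

```python
import numpy as np


def knn_graph(X, k=100, keep_frac=0.10):
    """Build a kNN adjacency matrix, then keep only the shortest fraction of edges."""
    n = X.shape[0]
    k = min(k, n - 1)

    # Pairwise Euclidean distances (assumed metric); O(n^2) memory, fine for a sketch.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbour

    # Directed edges from every point to its k nearest neighbours.
    nbrs = np.argsort(dist, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    edge_dist = dist[rows, cols]

    # Keep the keep_frac shortest edges overall (one reading of "top 10%").
    n_keep = max(1, int(np.ceil(keep_frac * edge_dist.size)))
    keep = np.argsort(edge_dist)[:n_keep]

    adj = np.zeros((n, n), dtype=bool)
    adj[rows[keep], cols[keep]] = True
    return adj | adj.T  # symmetrise (assumption)
```

For large datasets the dense distance matrix would have to be replaced by a proper nearest-neighbour index; the sketch only illustrates the two steps (top-k edges, then a global distance filter).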



### Results:

1. Across 74 datasets, both inductive and transductive GraphAny beat RF on 40/74 datasets (54%). On mean rank, GraphAny_linear (2.04) is ahead of RF (2.45), while GraphAny_ind (2.91) is behind it.
2. What I found is that, usually, passthrough, low1, and high1 are good enough.
3. The results are consistent irrespective of homophily.
4. There are a lot of interesting by-products to look at.
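These notes do not define the `edge_homophily` column that the analysis relies on. Assuming it follows the usual definition (the fraction of edges whose two endpoints share a class label), it could be computed from an adjacency matrix like this. The helper name `edge_homophily` and its signature are hypothetical.

```python
import numpy as np


def edge_homophily(adj, labels):
    """Fraction of undirected edges whose endpoints have the same label."""
    # Look only at the upper triangle so each undirected edge is counted once.
    src, dst = np.nonzero(np.triu(adj, k=1))
    if src.size == 0:
        return float("nan")  # undefined for a graph without edges
    return float(np.mean(labels[src] == labels[dst]))


# Toy path graph 0 - 1 - 2 with labels [0, 0, 1]: one of the two edges is homophilous.
toy_adj = np.array([[0, 1, 0],
                    [1, 0, 1],
                    [0, 1, 0]], dtype=bool)
toy_labels = np.array([0, 0, 1])
toy_homophily = edge_homophily(toy_adj, toy_labels)
```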

In [7]:
import pandas as pd
import plotly.express as px


def minority_ratio(pos_ratio):
    """Return the minority-class share given the positive-class ratio."""
    return min(pos_ratio, 1 - pos_ratio)


# Load and combine the three result files.
res1 = pd.read_csv("./graphany_many_big_ind.csv")
res2 = pd.read_csv("./graphany_many_small_ind.csv")
res3 = pd.read_csv("./graphany_nik.csv")
res = pd.concat((res1, res2, res3))

# Derived columns: minority-class share and its product with edge homophily.
res['minority_class_ratio'] = res['pos_class_ratio'].apply(minority_ratio)
res['m_times_e'] = res['minority_class_ratio'] * res['edge_homophily']

res.drop_duplicates(inplace=True)

# Keep only the models compared below.
res = res[res['model'].isin(['RF', 'GraphAny_ind', 'GraphAny_linear', 'LR', 'kNN'])]


In [5]:
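# f1 against edge homophily, with one OLS trend line per model (trendline='ols' needs statsmodels).
px.scatter(res, x='edge_homophily', y='f1', color='model', symbol='model', trendline='ols', hover_data=['dataset'])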

In [10]:
# Sort the models within each dataset by f1, best first.
sorted_df = res.groupby('dataset').apply(lambda x: x.sort_values(by='f1', ascending=False)).reset_index(drop=True)

# Rank the models within each dataset (1 = best). Note that cumcount gives
# tied models distinct ranks, in whatever order the sort left them.
sorted_df['rank'] = sorted_df.groupby('dataset').cumcount() + 1

# Mean rank of each model across all datasets (lower is better).
mean_ranks = sorted_df.groupby('model')['rank'].mean().reset_index().sort_values(by='rank')

mean_ranks



             model      rank
1  GraphAny_linear  2.040541
3               RF  2.445946
0     GraphAny_ind  2.905405
2               LR  3.445946
4              kNN  4.162162
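One caveat on the mean ranks: the sort-then-`cumcount` approach assigns different ranks to models with identical f1 on a dataset, so two models that tie often can end up with different mean ranks. A tie-aware variant could look like the sketch below. The data here is a made-up toy frame, not the real results, and the names `toy` and `tie_aware_mean_ranks` are purely illustrative.

```python
import pandas as pd

# Toy scores (not the real results): models A and B tie on dataset d1.
toy = pd.DataFrame({
    "dataset": ["d1", "d1", "d1", "d2", "d2", "d2"],
    "model":   ["A", "B", "C", "A", "B", "C"],
    "f1":      [0.9, 0.9, 0.5, 0.7, 0.8, 0.6],
})

# method="average" gives tied models the same (averaged) rank within a dataset.
toy["rank"] = toy.groupby("dataset")["f1"].rank(ascending=False, method="average")

# Mean rank per model across datasets, lower is better.
tie_aware_mean_ranks = toy.groupby("model")["rank"].mean().sort_values()
```

On d1, A and B share rank 1.5 instead of being split into 1 and 2, so neither gets an arbitrary advantage.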


In [9]:
import numpy as np

model_names = res['model'].unique()
# wins_score[i, j] counts the datasets on which model i beats model j; ties count for neither.
wins_score = np.zeros((len(model_names), len(model_names)))

score_to_check = 'f1'
datasets = res['dataset'].unique()
print(f'NUM of EXPERIMENTS: {len(datasets)}')
for classification_dataset in datasets:
    # One score per model on this dataset, indexed by model name.
    scores = res[res['dataset'] == classification_dataset].set_index('model')[score_to_check]
    for i, m1 in enumerate(model_names):
        for j, m2 in enumerate(model_names[i + 1:], start=i + 1):
            if scores[m1] > scores[m2]:
                wins_score[i, j] += 1
            elif scores[m1] < scores[m2]:
                wins_score[j, i] += 1

# Order rows and columns by average number of wins, best model first.
order_of_models = wins_score.mean(axis=1).argsort()[::-1]
wins_score = wins_score[order_of_models, :][:, order_of_models]
ordered_names = np.array(model_names)[order_of_models]
print('WINS')
print(pd.DataFrame(wins_score, columns=ordered_names, index=ordered_names))

NUM of EXPERIMENTS: 74
WINS
                   RF  GraphAny_linear  GraphAny_ind    LR   kNN
RF                0.0             34.0          34.0  53.0  67.0
GraphAny_linear  40.0              0.0          13.0  53.0  57.0
GraphAny_ind     40.0              5.0           0.0  53.0  57.0
LR               21.0             20.0          20.0   0.0  51.0
kNN               6.0             16.0          16.0  22.0   0.0
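To get a feel for how strong a win count like 40 vs 34 is, one option is an exact two-sided sign test on the paired wins. This is an addition for illustration only and was not part of the original analysis; the helper name `sign_test_p_value` is hypothetical, and the test assumes datasets are independent and that ties are dropped.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Exact two-sided sign test: how likely is a split at least this lopsided by chance?"""
    n = wins + losses          # tied datasets are not counted
    k = max(wins, losses)
    # One-sided tail P(X >= k) for X ~ Binomial(n, 0.5), then doubled and capped at 1.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# GraphAny vs RF from the wins table above: 40 wins against 34.
p_graphany_vs_rf = sign_test_p_value(40, 34)
```

A large p-value here would mean the 40 vs 34 split on its own is compatible with a coin flip, so the win count is best read together with the mean ranks rather than in isolation.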
