
***Project 3: K-fold Validation*** \\
Joya Debi, Melina Tsai, Sue (Xueru) Zhou

***Summary***
Re-implement the example in section 7.10.2 using any simple, out-of-the-box classifier (such as K-nearest neighbors from scikit-learn), and reproduce the results for the incorrect and correct ways of doing cross-validation.

In the incorrect approach, the 100 features most correlated with the labels are selected once, using all 50 samples, before the data are split into folds. On this randomly generated data, validation accuracy was then 90-100% and roughly the same in every fold. Because the features were chosen with the validation labels in view, K-fold cross-validation done this way can be quite deceiving: it suggests strong patterns in data that contain none.

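When K-fold cross-validation is done properly on the same kind of data, with feature selection repeated inside each fold using only that fold's training samples, accuracy ranges from about 20% to 70% and varies from fold to fold. This is what we expect here: the labels are independent of the features, so accuracy should scatter around the 50% chance level.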
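
The selection effect behind the inflated scores can be checked directly. The sketch below is not part of the original notebook: it regenerates noise with the same shape, using an arbitrarily chosen seed and a vectorized Pearson correlation (both assumptions of this example), and compares a typical |r| with the |r| values that survive a top-100 cut.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an arbitrary choice for this sketch

# Same shapes as the notebook: 50 samples, 5000 pure-noise features
X = rng.normal(0, 0.1, size=(50, 5000))
y = rng.integers(0, 2, size=50)

# Pearson correlation of every column with the labels, computed in one pass
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Sort |r| ascending so the last 100 entries are the selected features
abs_r = np.sort(np.abs(r))
print("median |r| over all 5000 features:", round(float(np.median(abs_r)), 3))
print("smallest |r| among the top 100:   ", round(float(abs_r[-100]), 3))
```

Even though every true correlation is zero, the 100 selected features should show a sample correlation well above the typical value, simply because they are the extremes of 5000 draws. A classifier that sees only those columns then appears to separate the classes.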

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy
from scipy.stats import pearsonr
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

In [None]:
# make data: 50 samples x 5000 Gaussian features (mean 0, standard deviation 0.1)
# note: no random seed is set, so every run of this cell gives different results
data = np.random.normal(0, 0.1,(50,5000))
data_df= pd.DataFrame(data)

#make random 0/1 labels (each is 1 with probability 0.5, so the split is only approximately 50/50)
labels = np.random.randint(2, size=50)
labels_df =pd.DataFrame(labels)

In [None]:
#function to return the data restricted to the 100 features most correlated with the labels, plus the indices of those features

def get_top_100(data_df, labels_df, right, tmptrain):

  #flatten labels to 1-D, since pearsonr expects 1-D inputs
  labels = np.asarray(labels_df).ravel()

  #length-5000 vector to hold one correlation value per feature
  correlation = np.zeros(data_df.shape[1])

  #go through every column in data and correlate with labels
  for i in range(data_df.shape[1]):
    corr = scipy.stats.pearsonr(data_df.iloc[:,i], labels)
    correlation[i] = corr[0]

  #Pearson's coefficient ranges from -1 to 1, so take the absolute value
  correlation = np.abs(correlation)

  #get indices of 100 most correlated
  top_100 = np.argsort(correlation)[-100:]

  #extract top columns to make training set
  if right:
    top_100_df = tmptrain.iloc[:,top_100]
  else:
    top_100_df = data_df.iloc[:,top_100]
  return top_100_df, top_100

## The Wrong Way ##

In [None]:
#in the wrong way, only get top 100 features once, outside the fold
top_100_df, top_100 = get_top_100(data_df, labels_df, False, data_df)
neigh = KNeighborsClassifier(n_neighbors=1)

# make folds
nfolds = 5
kf = KFold(n_splits=nfolds)
for train_index, test_index in kf.split(top_100_df):
  train_data = top_100_df.iloc[train_index]
  train_labels = labels_df.iloc[train_index]
  val_data = top_100_df.iloc[test_index]
  val_labels = labels_df.iloc[test_index]
  neigh.fit(train_data, train_labels.to_numpy().flatten())
  acc = neigh.score(val_data,val_labels)
  print(acc)




1.0
0.9
1.0
0.9
1.0
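
To see how misleading these fold scores are, one can score the same kind of model on freshly generated data. The sketch below is an addition, not part of the original notebook. It assumes an arbitrary seed and uses scikit-learn's `SelectKBest` with `f_classif` as a stand-in for `get_top_100`; for a two-class label the F-statistic ranks features in the same order as the absolute Pearson correlation.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)  # seed is an arbitrary choice for this sketch

# Noise data and random labels with the same shapes as the notebook
X = rng.normal(0, 0.1, size=(50, 5000))
y = rng.integers(0, 2, size=50)

# Select 100 features and fit 1-NN using all 50 samples
selector = SelectKBest(f_classif, k=100).fit(X, y)
knn = KNeighborsClassifier(n_neighbors=1).fit(selector.transform(X), y)

# Fresh samples drawn the same way, never seen during selection or fitting
X_new = rng.normal(0, 0.1, size=(1000, 5000))
y_new = rng.integers(0, 2, size=1000)

fresh_acc = knn.score(selector.transform(X_new), y_new)
print("accuracy on fresh data:", fresh_acc)
```

Because the fresh labels are independent of everything the model has seen, this accuracy should sit close to 50%, far from the 90-100% reported by the leaky cross-validation.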


## The Right Way ##

In [None]:
nfolds = 5
kf = KFold(n_splits=nfolds)
#right way to split data
for train_index, test_index in kf.split(data_df):
  #get training set from folds
  train_temp = data_df.iloc[train_index]
  train_labels = labels_df.iloc[train_index]

  #get training set with top 100 features, and indices of those features
  r_top_100_df, r_top_100 = get_top_100(train_temp, train_labels, True, train_temp)
  train_data = r_top_100_df
  neigh.fit(train_data, train_labels.to_numpy().flatten()) #flatten labels to 1-D to avoid sklearn's DataConversionWarning
  
  #make val data with test indices and top 100 features
  val_data = data_df.iloc[test_index, r_top_100]
  val_labels = labels_df.iloc[test_index]

  #two different ways to compute accuracy: acc1 uses .predict and a separate accuracy_score, acc2 uses .score
  pred = neigh.predict(val_data)
  acc1 = accuracy_score(val_labels,pred)
  acc2 = neigh.score(val_data,val_labels)
  print(acc1,acc2)

0.7 0.7
0.5 0.5
0.7 0.7
0.3 0.3
0.6 0.6
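
The per-fold selection above can also be expressed with scikit-learn's built-in tools, which makes the leak harder to introduce by accident. This is a sketch of an alternative, not the notebook's own method: it assumes an arbitrary seed and swaps in `SelectKBest(f_classif, k=100)` for `get_top_100`. Placing the selector inside a pipeline means `cross_val_score` refits it on each fold's training samples only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)  # seed is an arbitrary choice for this sketch
X = rng.normal(0, 0.1, size=(50, 5000))
y = rng.integers(0, 2, size=50)

cv = KFold(n_splits=5)

# Right way: selection is a pipeline step, so it is refit inside every fold
pipe = make_pipeline(SelectKBest(f_classif, k=100),
                     KNeighborsClassifier(n_neighbors=1))
right_scores = cross_val_score(pipe, X, y, cv=cv)

# Wrong way, for comparison: selection sees all 50 labels before splitting
X_leaky = SelectKBest(f_classif, k=100).fit_transform(X, y)
wrong_scores = cross_val_score(KNeighborsClassifier(n_neighbors=1),
                               X_leaky, y, cv=cv)

print("right way:", right_scores, "mean", right_scores.mean())
print("wrong way:", wrong_scores, "mean", wrong_scores.mean())
```

The pipeline version keeps the whole procedure (selection plus classifier) as the unit being validated, which is the point of the exercise.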
