<a href="https://colab.research.google.com/github/harvard-visionlab/neuro_science_fiction/blob/main/2022/analyze_my_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analyze My Features

The goal of this notebook is to analyze your features so you can identify which raters and features to use in your "brain-prediction" / "mind-reading" experiment.

# Imports (run me first)

This section imports helpful Python libraries and defines the functions that analyze your ratings. Run every cell in this section, or collapse the section (arrows on the left) and run it all at once (see the video walkthrough if this is unclear).

In [None]:
%config InlineBackend.figure_format = 'retina'

In [None]:
import os
from collections import defaultdict
from datetime import date
from functools import lru_cache
from pdb import set_trace

import numpy as np
import pandas as pd
import seaborn as sns

def download_ratings(team_name, dropRaters=[], dropFeatures=[]):
  print(f"==> Downloading data: {team_name}")
  team_name = team_name.lower()
  filename = f"{team_name}_Ratings.csv"
  url = f"https://scorsese.wjh.harvard.edu/turk/experiments/nsf/survey/{team_name}/data"
  df = pd.read_csv(url)
  df = df[~df.workerId.isin(dropRaters)]
  df = df[~df.featureName.isin(dropFeatures)]

  # drop rows for raters with incomplete datasets
  # assuming max_count of ratings is expected/desired count
  counts = df.groupby('workerId').rating.count()
  max_count = counts.max()
  dropRaters = counts[counts<max_count].index.values
  if len(dropRaters) > 0:
    print(f"==> Dropping incomplete datasets: {dropRaters}")
    df = df[~df.workerId.isin(dropRaters)]

  group_name = df.iloc[0].groupName
  num_items = len(df.itemName.unique())
  num_features = len(df.featureName.unique())
  num_raters = len(df.workerId.unique())

  print("="*50)
  print(f"FEATURE RATINGS: {group_name}")
  print(f"{num_items} items, {num_features} features, {num_raters} raters")
  print(date.today().strftime("%B %d, %Y"))
  print("="*50)
  
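  df.to_csv(filename, index=False)
  # compute_ratings_by_feature is lru_cached by team name, so clear it;
  # otherwise later analyses would reuse ratings from a previous download
  compute_ratings_by_feature.cache_clear()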

  items = sorted(df.itemName.unique())
  features = sorted(df.featureName.unique())
  raters = sorted(df.workerId.unique())
  num_rows = len(df)
  print("items:", items)
  print("features:", features)
  print("raters:", raters)
  expected_rows = len(items) * len(features) * len(raters)
  #assert expected_rows == len(df), f"Oops, expected {expected_rows}, got {len(df)}"
  #print("number of rows:", num_rows)

  return df 

def load_ratings(team_name):
  team_name = team_name.lower()
  filename = f"{team_name}_Ratings.csv"
  df = pd.read_csv(filename)
  group_name = df.iloc[0].groupName
  num_items = len(df.itemName.unique())
  num_features = len(df.featureName.unique())
  num_raters = len(df.workerId.unique())

  print("="*50)
  print(f"FEATURE RATINGS: {group_name}")
  print(f"{num_items} items, {num_features} features, {num_raters} raters")
  print(date.today().strftime("%B %d, %Y"))
  print("="*50)

  items = sorted(df.itemName.unique())
  features = sorted(df.featureName.unique())
  raters = sorted(df.workerId.unique())
  num_rows = len(df)
  print("items:", items)
  print("features:", features)
  print("raters:", raters)
  expected_rows = len(items) * len(features) * len(raters)
  assert expected_rows == len(df), f"Oops, expected {expected_rows}, got {len(df)}"
  print("number of rows:", num_rows)

  return df

@lru_cache(maxsize=None)
def compute_ratings_by_feature(team_name):
  df = load_ratings(team_name)
  items = sorted(df.itemName.unique())
  features = sorted(df.featureName.unique())
  raters = sorted(df.workerId.unique())

  RatingsByFeature = {}
  for featureName in features:
    RatingsByFeature[featureName] = []
    for rater in raters:
      ratings = []
      for itemName in items:
        subset = df[(df.featureName==featureName) & (df.workerId==rater) & (df.itemName==itemName)]
        assert len(subset)==1
        ratings.append(subset.iloc[0].ratingScaled)
      RatingsByFeature[featureName].append(np.array(ratings))
    RatingsByFeature[featureName] = np.array(RatingsByFeature[featureName])

  return df, RatingsByFeature

def sort_features_by_consistency(RatingsByFeature):
  results = defaultdict(list)
  for feature_name in RatingsByFeature.keys():
    rater_vs_rater = np.corrcoef(RatingsByFeature[feature_name])
    num_raters = rater_vs_rater.shape[0]
    avg_corr = np.nanmean(rater_vs_rater[np.triu_indices(num_raters,k=1)])
    results['feature_name'].append(feature_name)
    results['avg_corr'].append(avg_corr)
  results = pd.DataFrame(results)
  sorted_results = results.sort_values('avg_corr')  # avoid shadowing built-in sorted()
  return sorted_results

def rater_agreement(df, RatingsByFeature):
  raters = sorted(df.workerId.unique())
  results = defaultdict(list)
  rater_vs_rater_all = []
  for feature_name in RatingsByFeature.keys():
    rater_vs_rater = np.corrcoef(RatingsByFeature[feature_name])
    num_raters = rater_vs_rater.shape[0]
    rater_consistency = (rater_vs_rater.sum(axis=1)-1)/(num_raters-1)
    for rater_num,rater in enumerate(raters):
      results['feature_name'].append(feature_name)
      results['rater'].append(rater)
      results['avg_corr'].append(rater_consistency[rater_num])
    rater_vs_rater_all.append(rater_vs_rater)
  rater_vs_rater_all = np.stack(rater_vs_rater_all)
  results = pd.DataFrame(results)

  return results, rater_vs_rater_all

def plot_sorted_features(sorted_features):
  sns.set(rc={'figure.figsize':(14, len(sorted_features)*.33)})
  ax = sns.barplot(data=sorted_features, y="feature_name", x="avg_corr", orient="h")
  ax.set_ylabel("Feature Name")
  ax.set_xlabel("Average Correlation Between Raters (across items)");
  ax.set_xlim([0,1]);
  return ax 

def compute_feature_reliability(team_name):  
  df, RatingsByFeature = compute_ratings_by_feature(team_name)
  sorted_features = sort_features_by_consistency(RatingsByFeature)

  print("="*50)
  print("FEATURE RELIABILITY:")
  print("How consistent were the feature ratings between raters?")
  print("Consider two raters, and their 60 ratings (1 per item).")
  print("Correlate those ratings, then repeat for all pairs of")
  print("raters, and compute the average correlation. Higher numbers")
  print("indicate higher consistency in ratings.")
  print("="*50)
  print(sorted_features)
  print("\n")
  ax = plot_sorted_features(sorted_features)
  return ax

def compute_rater_agreement(team_name):
  df, RatingsByFeature = compute_ratings_by_feature(team_name)
  raters = sorted(df.workerId.unique())
  agreement, rater_vs_rater = rater_agreement(df, RatingsByFeature)

  print("="*50)
  print(f"RATER AGREEMENT:")
  print("How well did each rater correlate with the other ")
  print("raters, on average?")
  print("="*50)
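  # numeric_only=True skips the text column (feature_name), which newer pandas refuses to average
  print(agreement.groupby('rater').mean(numeric_only=True))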

  ax = sns.heatmap(np.nanmean(rater_vs_rater, axis=0), square=True,
                   xticklabels=raters, yticklabels=raters,
                   vmin=0, vmax=1)
  return ax

In [None]:
import matplotlib.pyplot as plt

def compute_feature_vs_feature_corr(df):
  features = sorted(df.featureName.unique())
  items = sorted(df.itemName.unique())
  M = []
  for feature in features:
    item_ratings = []
    for item in items:
      subset = df[(df.featureName==feature) & (df.itemName==item)]
      item_ratings.append(subset.ratingScaled.mean())
    M.append(item_ratings)
  M = np.array(M)
  feature_vs_feature = np.corrcoef(M)

  return M, feature_vs_feature

def feature_redundancy(df, feature_vs_feature):
  features = sorted(df.featureName.unique())
  items = sorted(df.itemName.unique())
  corrs = defaultdict(list)
  num_features = len(features)
  for feature1 in range(0,num_features-1):
    for feature2 in range(feature1+1, num_features):
      corr = feature_vs_feature[feature1,feature2]
      featureName1 = features[feature1]
      featureName2 = features[feature2]
      pair = f"{featureName1}_{featureName2}"
      corrs['item1'].append(featureName1)
      corrs['item2'].append(featureName2)
      corrs['pair'].append(pair)
      corrs['correlation'].append(corr)
      corrs['abs_correlation'].append(np.abs(corr))
      corrs['sign'].append("positive" if corr>0 else "negative")
  corrs = pd.DataFrame(corrs)
  corrs = corrs.sort_values('abs_correlation')
  
  return corrs

def plot_feature_correlation_heatmap(features, feature_vs_feature):
  print("="*50)
  print(f"FEATURE REDUNDANCY:")
  print("How correlated are our features?")
  print("Here we're plotting the abs(correlation) between")
  print("every pair of features. Lower correlations are")
  print("preferred because it means the features carry")
  print("independent information.")
  print("="*50)

  sns.set(rc={'figure.figsize':(8,8)})
  ax = sns.heatmap(np.abs(feature_vs_feature), square=True,
                   xticklabels=features, yticklabels=features,
                   vmin=0, vmax=1);  
  plt.show()

def plot_feature_correlation_bars(features, corrs, threshold=.9):
  print("="*50)
  print(f"Now as bars, sorted from lowest to highest abs(correlation).")
  print("="*50)

  num_features = len(features)
  sns.set(rc={'figure.figsize':(14,num_features*3)})
  ax = sns.barplot(data=corrs, y="pair", x="abs_correlation", orient="h")
  ax.set_ylabel("Feature Pair")
  ax.set_xlabel("abs(Correlation) Across Items");
  ax.set_xlim([0,1]);
  ax.axvline(threshold, color='gray', linestyle='--');
  plt.show()  

def compute_feature_redundancy(team_name, threshold = .90):
  df = load_ratings(team_name)
  features = sorted(df.featureName.unique())
  
  M, feature_vs_feature = compute_feature_vs_feature_corr(df)
  corrs = feature_redundancy(df, feature_vs_feature)

  plot_feature_correlation_heatmap(features, feature_vs_feature)
  
  plot_feature_correlation_bars(features, corrs, threshold=threshold)
  

In [None]:
import matplotlib.pyplot as plt

def compute_item_vs_item_corr(df):
  features = sorted(df.featureName.unique())
  items = sorted(df.itemName.unique())
  M = []
  for feature in features:
    item_ratings = []
    for item in items:
      subset = df[(df.featureName==feature) & (df.itemName==item)]
      item_ratings.append(subset.ratingScaled.mean())
    M.append(item_ratings)
  M = np.transpose(np.array(M))
  item_vs_item = np.corrcoef(M)

  return M, item_vs_item


def plot_item_correlation_heatmap(items, item_vs_item):
  print("="*50)
  print(f"ITEM SIMILARITY:")
  print("="*50)

  sns.set(rc={'figure.figsize':(14,14)})
  # plot signed correlations so the data match the [-1, 1] color range
  ax = sns.heatmap(item_vs_item, square=True,
                   xticklabels=items, yticklabels=items,
                   vmin=-1, vmax=1);  
  plt.show()

def compute_item_similarity(team_name, threshold = .80):
  df = load_ratings(team_name)
  features = sorted(df.featureName.unique())
  items = sorted(df.itemName.unique())

  M, item_vs_item = compute_item_vs_item_corr(df)
  
  num_items = item_vs_item.shape[0]
  upper_diag = item_vs_item[np.triu_indices(num_items,k=1)]
  total_pairs = upper_diag.shape[0]
  mean_corr = upper_diag.mean()
  number_similar = (upper_diag > threshold).sum()
  percent_similar = number_similar/total_pairs * 100

  plot_item_correlation_heatmap(items, item_vs_item)

  print("\n")
  print(f"Total Pairs of Items: {total_pairs}")
  print(f"Average correlation across pairs (lower is better): {mean_corr:4.2f}")
  print(f"Number of similar pairs (r > {threshold}): {number_similar}")
  print(f"Percent similar: {percent_similar:3.2f}")

  print("\n")
  print(f"So about {percent_similar:3.2f} percent of the time your model is ")  
  print("likely to fail, not because it isn't a good set of features, but")  
  print("because these items are so similar to each other on these features.")  


In [None]:
def show_most_similar_item(team_name, N=1):
  df = load_ratings(team_name)
  features = sorted(df.featureName.unique())
  items = sorted(df.itemName.unique())

  M, item_vs_item = compute_item_vs_item_corr(df)

  results = defaultdict(list)
  for item_idx in range(len(items)):
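    corrs = item_vs_item[item_idx].copy()  # copy so we don't overwrite the matrix
    corrs[item_idx] = -np.inf  # exclude the item itself
    most_similar_idx = corrs.argsort()[-N]  # index of the N-th most similar item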
    itemName = items[item_idx]
    mostSimilarItem = items[most_similar_idx]
    corr = corrs[most_similar_idx]
    results['itemName'].append(itemName)
    results['mostSimilarItem'].append(mostSimilarItem)
    results['correlation'].append(corr)
  results = pd.DataFrame(results)
  
  print("\n")
  print("="*50)
  print(f"MOST SIMILAR ITEM (N={N}):")
  print("="*50)
  return results

# Download Your Dataset

Let's make sure your dataset is available and that we can download it. This cell fetches your data from the server and stores a local copy. The data is returned as a "dataframe" whose total length should equal numItems * numFeatures * numRaters. For example, with 60 items, 16 features, and 3 raters, the length should be `60*16*3=2880`.
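
As a quick sanity check, the expected row count can be computed directly from a ratings table. The sketch below uses a small made-up table (the items, features, and rater IDs are hypothetical, not your real data) with the same column names this notebook uses (`itemName`, `featureName`, `workerId`).

```python
import itertools
import pandas as pd

# Toy data (hypothetical): 3 items x 2 features x 2 raters, one row per combination.
items = ["bear", "cat", "hammer"]
features = ["is_alive", "is_heavy"]
raters = ["raterA", "raterB"]
rows = list(itertools.product(items, features, raters))
toy_df = pd.DataFrame(rows, columns=["itemName", "featureName", "workerId"])

# Expected length = numItems * numFeatures * numRaters
expected_rows = (toy_df.itemName.nunique()
                 * toy_df.featureName.nunique()
                 * toy_df.workerId.nunique())
print(f"expected {expected_rows} rows, got {len(toy_df)}")
```

If the two numbers differ on your real data, some rater probably has missing or duplicated ratings.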

In [None]:
team_name = 'dopaminemachine'

dropRaters = []
dropFeatures = []
df = download_ratings(team_name, dropRaters=dropRaters, dropFeatures=dropFeatures)
df.head()

# Feature Reliability

How consistently did people rate these features?
We want to use features that people showed pretty high agreement on.

Consider the extreme, where people respond randomly. Of course you would not expect your feature to predict brain data (because it's a bunch of random numbers!).

We can test for agreement (reliability) in our feature estimates by correlating each rater's ratings with every other rater's ratings.

The average of these pairwise correlations tells us how much agreement there was between our raters.

Features are sorted from least to most reliable (higher values are better).

(reliability = average correlation across pairs of raters)
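
To make the reliability number concrete, here is a minimal sketch on made-up ratings for a single feature (3 hypothetical raters, 5 items). It follows the same recipe as the notebook's functions: build the rater-by-rater correlation matrix, take each pair of raters once, and average.

```python
import numpy as np

# Toy ratings (hypothetical): 3 raters (rows) x 5 items (columns) for one feature.
ratings = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],   # rater 1
    [2.0, 4.0, 6.0, 8.0, 10.0],  # rater 2: same ordering as rater 1 (r = 1)
    [5.0, 4.0, 3.0, 2.0, 1.0],   # rater 3: reversed ordering (r = -1 with both)
])

rater_vs_rater = np.corrcoef(ratings)          # 3 x 3 correlation matrix
pair_idx = np.triu_indices(len(ratings), k=1)  # each pair of raters once
pairwise = rater_vs_rater[pair_idx]            # r(1,2), r(1,3), r(2,3)
reliability = np.nanmean(pairwise)             # average over pairs

print("pairwise correlations:", np.round(pairwise, 2))
print("reliability:", round(float(reliability), 2))
```

Here two raters agree perfectly, but the third rates in the opposite direction, which drags the average pairwise correlation down to about -0.33. A single rater like this is a candidate for dropping.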

In [None]:
compute_feature_reliability(team_name);

In [None]:
compute_rater_agreement(team_name);

# Feature Redundancy

Next, we correlate the ratings from each feature with each other feature.

We want to know whether each feature is providing potentially useful information.

Highly correlated features are redundant (given one feature, the other feature doesn't tell you anything new or useful).

The same is true for highly negatively correlated features, so here we're sorting by absolute value of the correlation, from lowest to highest. Higher abs(correlation) values are worse (more redundancy).

How high is too high? Greater than .9 could be problematic, because it could make it difficult for the regression algorithm to find an appropriate "weight" for each of your features.
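
As an illustration, the sketch below flags redundant pairs in a made-up feature-by-item matrix (the feature names and values are hypothetical). It uses the same idea as the analysis in this notebook: correlate features across items, then compare abs(correlation) to the .9 threshold.

```python
import numpy as np

# Toy matrix (hypothetical): 3 features (rows) x 6 items (columns),
# each value standing in for the average rating of that item on that feature.
names = ["size", "smallness", "softness"]
M = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # size
    [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],  # smallness: mirror image of size
    [1.0, 3.0, 2.0, 3.0, 1.0, 2.0],  # softness: unrelated to size
])

feature_vs_feature = np.corrcoef(M)
threshold = 0.9

redundant_pairs = []
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = feature_vs_feature[i, j]
        if abs(r) > threshold:  # strongly negative counts as redundant too
            redundant_pairs.append((names[i], names[j], round(float(r), 2)))

print(redundant_pairs)
```

Only the size/smallness pair is flagged: the two are perfectly negatively correlated, so knowing one tells you the other.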

In [None]:
compute_feature_redundancy(team_name)

# Feature Discriminability

How different are the 60 items on your features?
Even if your model is perfect (i.e., these are the exact features the brain cares about), if all of your items share the same features, your model will fail. 

Let's flesh out why:

Remember that we use Mitchell's scoring method. He holds out two items, and then trains the model on the remaining 58. Then he uses the model to predict brain activity for the 2 held-out items.  

The prediction is scored as correct if the predicted brain pattern for an item is more like the true brain pattern for that item than like the true brain pattern for the other held-out item.

But if the features are identical for two items, the model will predict exactly the same brain response to each item! So basically it will fail every time. 

This doesn't mean your model is wrong, it means you have the wrong items to test your particular model. So we want features that we think are useful not just in general, but for telling these particular items apart. 

Ideally, our features could be used to tell all possible items apart!
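
The scoring logic above can be sketched in a few lines. This is a simplified illustration, not the actual analysis code. It assumes Pearson correlation as the similarity measure, compares the correct pairing of predictions to true patterns against the swapped pairing, and counts ties as incorrect. The feature-to-voxel weights and brain patterns are random, made-up numbers.

```python
import numpy as np

def leave_two_out_correct(pred1, pred2, true1, true2):
    """True if the correct pairing of predicted and true patterns is more
    similar (summed Pearson correlation) than the swapped pairing."""
    def corr(a, b):
        return np.corrcoef(a, b)[0, 1]
    matched = corr(pred1, true1) + corr(pred2, true2)
    swapped = corr(pred1, true2) + corr(pred2, true1)
    return bool(matched > swapped)  # a tie counts as incorrect

rng = np.random.default_rng(0)
num_features, num_voxels = 4, 50
weights = rng.normal(size=(num_features, num_voxels))  # hypothetical learned weights

# Case 1: two held-out items with different features, and a "perfect" model
# (the true pattern is exactly features @ weights).
feat_a = np.array([1.0, 0.0, 0.5, 0.2])
feat_b = np.array([0.0, 1.0, 0.1, 0.9])
true_a, true_b = feat_a @ weights, feat_b @ weights
distinct_ok = leave_two_out_correct(feat_a @ weights, feat_b @ weights, true_a, true_b)

# Case 2: two held-out items with identical features but different true patterns.
# The model must predict the same pattern for both, so the two pairings tie.
true_c = rng.normal(size=num_voxels)
true_d = rng.normal(size=num_voxels)
same_pred = feat_a @ weights
identical_ok = leave_two_out_correct(same_pred, same_pred, true_c, true_d)

print("different features ->", distinct_ok)
print("identical features ->", identical_ok)
```

With identical features, the matched and swapped pairings get exactly the same score, so the model has no basis for telling the two items apart.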


In [None]:
compute_item_similarity(team_name)

In [None]:
show_most_similar_item(team_name, N=1)

# Now What?

Next, send an e-mail to alvarez@wjh.harvard.edu to let me know:
- 1) Whether you want to drop any raters
- 2) Whether you want to drop any features
- 3) Whether you want to add any features (and if so, include your feature survey questions for your new features)
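
The `dropRaters` and `dropFeatures` arguments of `download_ratings` filter rows with an `isin` mask. The same filtering is sketched here on a made-up table (the rater IDs, feature names, and ratings are hypothetical), in case you want to see what a drop does before deciding:

```python
import pandas as pd

# Toy ratings table (hypothetical): 3 raters x 2 features.
toy_df = pd.DataFrame({
    "workerId":    ["raterA", "raterA", "raterB", "raterB", "raterC", "raterC"],
    "featureName": ["is_alive", "is_heavy"] * 3,
    "rating":      [5, 1, 4, 2, 1, 5],
})

dropRaters = ["raterC"]
dropFeatures = ["is_heavy"]

# Keep only rows whose rater and feature are NOT in the drop lists.
kept = toy_df[~toy_df.workerId.isin(dropRaters)]
kept = kept[~kept.featureName.isin(dropFeatures)]
print(kept)
```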

I need your final features at least 2 days before our next meeting, so that I have time to run the brain-prediction & mind-reading analyses ahead of time (they're too slow for us to run in a 1-hour meeting, so I'll compute the feature-to-voxel weights in advance for you).