# Custom code for manuscript "Neural Similarity Induces Friendship"

Copyright 2024 Yixuan Shen

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

---

__Authors: Yixuan Lisa Shen and Ryan Hyon__

__Last updated: 2025/2/26__

This is the custom code used for the main analyses in the manuscript “Neural Similarity Induces Friendship”, under revision at _Nature Human Behaviour_. In compliance with the guidelines in the journal’s Code and Software Submission Checklist, we provide a small simulated dataset to demonstrate the code. On a 2021 MacBook Pro with 16 GB of RAM, running the full demo on the simulated dataset takes about 10 minutes.

This code uses simulated data that include 10 participants and their simulated timeseries (100 TRs) in 5 brain regions. There are five main input files (additional lookup tables are read from the demo_info folder during setup):
1. subject_ts.csv: simulated timeseries (100 TRs) in 5 brain regions for each of the 10 participants.
2. subject_demo_ratings.csv: the 10 participants' demographic information (age, gender, nationality, hometown, college, major, and industry) and handedness, as well as their post-scan self-reported enjoyment and interest ratings for each of the 5 videos they viewed in the scanner.
3. dyad_list.csv: each of the 45 unique dyads formed from the 10 subjects.
4. edgelist_t2.csv: the edgelist at Time 2, arranged in two columns (source and target).
5. edgelist_t3.csv: the edgelist at Time 3, arranged in two columns (source and target).
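
For orientation, the 45 dyads are all unordered pairs of the 10 participants (10 choose 2 = 45). The following is a minimal sketch of how such a list could be generated, assuming hypothetical subject labels sub01 to sub10 (the actual IDs in subject_ts.csv may differ). The column names dyad_subject1 and dyad_subject2 are the ones the setup code reads from dyad_list.csv.

```python
from itertools import combinations

# Hypothetical subject labels; the real IDs come from subject_ts.csv
subjects = [f"sub{n:02d}" for n in range(1, 11)]

# Every unordered pair of distinct subjects is one dyad
dyads = [
    {"dyad_subject1": a, "dyad_subject2": b}
    for a, b in combinations(subjects, 2)
]

print(len(dyads))  # 45
```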
---

## Setting up

Read in the input files (note: set filepath to the correct path on your local machine).

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import pearsonr
import igraph
import random
import pickle
import os
import warnings
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from statsmodels.stats import multitest

pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore", category=DeprecationWarning)

filepath = f'/Users/parkinsonlab/Desktop/NHB_friendship/demo_code' #set the file paths to the correct paths on your local machine
subject_ts = pd.read_csv(f'{filepath}/subject_ts.csv', index_col=False)
subject_demo_ratings = pd.read_csv(f'{filepath}/subject_demo_ratings.csv', index_col=False)
dyad = pd.read_csv(f'{filepath}/dyad_list.csv', index_col=False)

def SavePickle(infile, outfile):
    with open(outfile, 'wb') as f:
        pickle.dump(infile, f, protocol = 2)

def LoadPickle(infile):
    with open(infile, 'rb') as f:
        outfile = pickle.load(f)
        return outfile

Create dyad-level variables and store in dyad-level dataframe (dyad_df): 
* age_dist: absolute difference in age
* gender_similarity: binary indicator of whether two subjects are of the same gender
* nationality_similarity: binary indicator of whether two subjects are of the same nationality
* hometown_population_similarity: absolute difference in the size of population between two subjects' hometowns
* dist_hometown: distance between two subjects' hometowns
* dist_college: distance between two subjects' undergraduate alma maters
* college_pub_priv_similarity: binary indicator of whether two subjects' undergraduate alma maters are of the same type of institution (public or private)
* major_similarity: binary indicator of whether two subjects' undergraduate majors belong to the same category
* industry_similarity: binary indicator of whether two subjects work in the same industry
* handedness_similarity: binary indicator of whether two subjects are of the same handedness
* enjoy_similarity: Euclidean distance between the two subjects' z-scored enjoyment rating vectors (smaller values indicate more similar ratings)
* interest_similarity: Euclidean distance between the two subjects' z-scored interest rating vectors (smaller values indicate more similar ratings)

In [2]:
dyad_df = dyad.copy()

dyad_df['age_dist'] =''
dyad_df['gender_similarity'] = ''
dyad_df['nationality_similarity'] = ''
dyad_df['hometown_population_similarity'] = ''
dyad_df['dist_hometown'] = ''
dyad_df['dist_college'] = ''
dyad_df['college_pub_priv_similarity'] = ''
dyad_df['major_similarity'] = ''
dyad_df['industry_similarity'] = ''
dyad_df['handedness_similarity'] = ''
dyad_df['interest_similarity'] = ''
dyad_df['enjoy_similarity'] = ''

#load files that contain demographic information
dist_colleges = pd.read_csv(f'{filepath}/demo_info/dist_colleges.csv', index_col=False)
dist_hometowns = pd.read_csv(f'{filepath}/demo_info/dist_hometowns.csv', index_col=False)
dict_major_cat = LoadPickle(f'{filepath}/demo_info/dict_major_cat.pkl')
dict_college_public_private = LoadPickle(f'{filepath}/demo_info/dict_college_public_private.pkl')
dict_hometown_population = LoadPickle(f'{filepath}/demo_info/dict_hometown_population.pkl')

#get the columns with enjoyment and interest ratings
enjoy_cols = [i for i in subject_demo_ratings.columns if 'enjoy' in i]
interest_cols = [i for i in subject_demo_ratings.columns if 'interest' in i]

for i in range(0,len(dyad_df)):
    dyad_subj1 = dyad_df['dyad_subject1'][i]
    dyad_subj2 = dyad_df['dyad_subject2'][i]
    
    #age_dist
    dyad_subj1_age = subject_demo_ratings[subject_demo_ratings['subject']==dyad_subj1]['age'].item()
    dyad_subj2_age = subject_demo_ratings[subject_demo_ratings['subject']==dyad_subj2]['age'].item()
    age_dist = np.abs(dyad_subj1_age - dyad_subj2_age)
    dyad_df.loc[i, 'age_dist'] = age_dist

    #gender_similarity
    dyad_subj1_gender = subject_demo_ratings[subject_demo_ratings['subject']==dyad_subj1]['gender'].item()
    dyad_subj2_gender = subject_demo_ratings[subject_demo_ratings['subject']==dyad_subj2]['gender'].item()
    gender_similarity = 1 if dyad_subj1_gender == dyad_subj2_gender else 0
    dyad_df.loc[i, 'gender_similarity'] = gender_similarity
    
    #nationality_similarity
    dyad_subj1_nationality = subject_demo_ratings[subject_demo_ratings['subject']==dyad_subj1]['nationality'].item()
    dyad_subj2_nationality = subject_demo_ratings[subject_demo_ratings['subject']==dyad_subj2]['nationality'].item()
    nationality_similarity = 1 if dyad_subj1_nationality == dyad_subj2_nationality else 0
    dyad_df.loc[i, 'nationality_similarity'] = nationality_similarity

    #hometown_population_similarity
    dyad_subj1_hometown = subject_demo_ratings[subject_demo_ratings['subject']==dyad_subj1]['hometown'].item()
    dyad_subj2_hometown = subject_demo_ratings[subject_demo_ratings['subject']==dyad_subj2]['hometown'].item()
    hometown_population_similarity = np.abs(dict_hometown_population[dyad_subj1_hometown] - dict_hometown_population[dyad_subj2_hometown])
    dyad_df.loc[i, 'hometown_population_similarity'] = hometown_population_similarity

    #dist_hometown
    if dyad_subj1_hometown == dyad_subj2_hometown:
        dyad_df.loc[i, 'dist_hometown'] = 0
    else: 
        #look up the pair in either column order
        hometown_idx = dist_hometowns.index[(dist_hometowns['City1'] == dyad_subj1_hometown) & (dist_hometowns['City2'] == dyad_subj2_hometown)].tolist()
        if not hometown_idx:
            hometown_idx = dist_hometowns.index[(dist_hometowns['City2'] == dyad_subj1_hometown) & (dist_hometowns['City1'] == dyad_subj2_hometown)].tolist()
        dyad_df.loc[i, 'dist_hometown'] = dist_hometowns.loc[hometown_idx[0], 'dist_hometown']

    #dist_college
    dyad_subj1_college = subject_demo_ratings[subject_demo_ratings['subject']==dyad_subj1]['college'].item()
    dyad_subj2_college = subject_demo_ratings[subject_demo_ratings['subject']==dyad_subj2]['college'].item()

    if dyad_subj1_college == dyad_subj2_college:
        dyad_df.loc[i, 'dist_college'] = 0
    else:
        #look up the pair in either column order
        college_idx = dist_colleges.index[(dist_colleges['college1'] == dyad_subj1_college) & (dist_colleges['college2'] == dyad_subj2_college)].tolist()
        if not college_idx:
            college_idx = dist_colleges.index[(dist_colleges['college2'] == dyad_subj1_college) & (dist_colleges['college1'] == dyad_subj2_college)].tolist()
        dyad_df.loc[i, 'dist_college'] = dist_colleges.loc[college_idx[0], 'dist_college']

    #college_pub_priv_similarity
    college_pub_priv_similarity = 1 if dict_college_public_private[dyad_subj1_college] == dict_college_public_private[dyad_subj2_college] else 0
    dyad_df.loc[i, 'college_pub_priv_similarity'] = college_pub_priv_similarity

    #major_similarity
    dyad_subj1_major = subject_demo_ratings[subject_demo_ratings['subject']==dyad_subj1]['major'].item()
    dyad_subj2_major = subject_demo_ratings[subject_demo_ratings['subject']==dyad_subj2]['major'].item()
    major_similarity = 1 if dict_major_cat[dyad_subj1_major] == dict_major_cat[dyad_subj2_major] else 0
    dyad_df.loc[i, 'major_similarity'] = major_similarity

    #industry_similarity
    dyad_subj1_industry = subject_demo_ratings[subject_demo_ratings['subject']==dyad_subj1]['industry'].item()
    dyad_subj2_industry = subject_demo_ratings[subject_demo_ratings['subject']==dyad_subj2]['industry'].item()
    industry_similarity = 1 if dyad_subj1_industry == dyad_subj2_industry else 0
    dyad_df.loc[i, 'industry_similarity'] = industry_similarity

    #handedness_similarity
    dyad_subj1_handedness = subject_demo_ratings[subject_demo_ratings['subject']==dyad_subj1]['handedness'].item()
    dyad_subj2_handedness = subject_demo_ratings[subject_demo_ratings['subject']==dyad_subj2]['handedness'].item()
    handedness_similarity = 1 if dyad_subj1_handedness == dyad_subj2_handedness else 0
    dyad_df.loc[i, 'handedness_similarity'] = handedness_similarity

    #enjoy_similarity
    dyad_subj1_enjoy_vec = stats.zscore(subject_demo_ratings[subject_demo_ratings['subject'] == dyad_subj1][enjoy_cols].values.flatten())
    dyad_subj2_enjoy_vec = stats.zscore(subject_demo_ratings[subject_demo_ratings['subject'] == dyad_subj2][enjoy_cols].values.flatten())
    enjoy_distance = np.linalg.norm(dyad_subj1_enjoy_vec - dyad_subj2_enjoy_vec)
    dyad_df.loc[i, 'enjoy_similarity'] = float(enjoy_distance)
   
    #interest_similarity
    dyad_subj1_interest_vec = stats.zscore(subject_demo_ratings[subject_demo_ratings['subject'] == dyad_subj1][interest_cols].values.flatten())
    dyad_subj2_interest_vec = stats.zscore(subject_demo_ratings[subject_demo_ratings['subject'] == dyad_subj2][interest_cols].values.flatten())
    interest_distance = np.linalg.norm(dyad_subj1_interest_vec - dyad_subj2_interest_vec)
    dyad_df.loc[i, 'interest_similarity'] = float(interest_distance)
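
To make the rating-based measures concrete, here is a minimal sketch of the distance computed for enjoy_similarity and interest_similarity: each subject's rating vector is z-scored, and the Euclidean distance between the two standardized vectors is taken. The helper names and the ratings are made up for illustration. NumPy's population standard deviation is used, which matches the default of scipy.stats.zscore.

```python
import numpy as np

def zscore(v):
    """Standardize a vector to mean 0 and (population) standard deviation 1."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def rating_distance(ratings_a, ratings_b):
    """Euclidean distance between two z-scored rating vectors (0 = identical profiles)."""
    return float(np.linalg.norm(zscore(ratings_a) - zscore(ratings_b)))

# Made-up ratings for 5 videos (illustrative values only)
subj_a = [1, 2, 3, 4, 5]
subj_b = [2, 3, 4, 5, 6]   # same profile as subj_a, shifted by +1
subj_c = [5, 4, 3, 2, 1]   # reversed profile

print(rating_distance(subj_a, subj_b))      # 0.0: z-scoring removes the shift
print(rating_distance(subj_a, subj_c) > 0)  # True
```

Because the measure is a distance, larger values mean less similar rating profiles. A subject who gives the same rating to every video has zero variance, so the z-score (and therefore the distance) would be undefined for that subject.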

Calculate ISCs in each of the 5 brain regions for each of the 45 unique subject pairs

In [3]:
parcels = ['brain_region1', 'brain_region2', 'brain_region3', 'brain_region4', 'brain_region5']
dyad_df[parcels] = ''

for i in range(0,len(dyad_df)):
    #get the two subjects in the dyad
    dyad_subj1 = dyad_df['dyad_subject1'][i]
    dyad_subj2 = dyad_df['dyad_subject2'][i]
    
    for parcel in parcels:
        #get the timeseries for each of the two subjects in the dyad 
        subj1_index = subject_ts[(subject_ts['subject'] == dyad_subj1) & (subject_ts['brain_parcel'] == parcel)].index.to_list()
        subj1_ts = subject_ts.iloc[subj1_index, 2:103].values.flatten()
        subj2_index = subject_ts[(subject_ts['subject'] == dyad_subj2) & (subject_ts['brain_parcel'] == parcel)].index.to_list()
        subj2_ts = subject_ts.iloc[subj2_index, 2:103].values.flatten()

        #calculate their ISCs within a given brain region and apply Fisher-z transformation
        dyad_df.loc[i, parcel] = np.arctanh(pearsonr(subj1_ts, subj2_ts)[0])
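
The ISC for one dyad in one region is a Pearson correlation between the two subjects' timeseries, followed by a Fisher z-transformation (the inverse hyperbolic tangent). A minimal sketch on simulated signals, assuming two hypothetical subjects who share a stimulus-driven component:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated time series for two hypothetical subjects in one region (100 TRs):
# a shared stimulus-driven signal plus independent noise for each subject
shared = rng.standard_normal(100)
subj1_ts = shared + rng.standard_normal(100)
subj2_ts = shared + rng.standard_normal(100)

# Inter-subject correlation (Pearson r) and its Fisher z-transform
r = np.corrcoef(subj1_ts, subj2_ts)[0, 1]
isc_z = np.arctanh(r)

print(round(r, 3), round(isc_z, 3))
```

The transformation stretches correlations near -1 and 1, which makes the values better suited to averaging and to the mean-difference contrasts used later.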

For each brain region, identify outlying neural similarity values, defined as values more than 1.5 times the interquartile range (IQR) above the upper quartile (75th percentile) or below the lower quartile (25th percentile). Replace each outlier with the corresponding bound: the upper quartile plus 1.5 times the IQR, or the lower quartile minus 1.5 times the IQR, respectively.

In [4]:
def IQR_outliers(df):
    # recode outliers using IQR method
    cols = [i for i in df.columns if 'brain_region' in i]
    for col in cols:
        Q1 = df[col].quantile(.25)
        Q3 = df[col].quantile(.75)
        IQR = Q3-Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df[col] = df[col].clip(lower = lower_bound, upper = upper_bound)

    return df

dyad_df = IQR_outliers(dyad_df)
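
A small worked example of the clipping rule, using made-up values and NumPy only. np.percentile with its default linear interpolation gives the same quartiles as the default of pandas' Series.quantile.

```python
import numpy as np

# Made-up Fisher-z ISC values with one extreme value
values = np.array([0.1, 0.2, 0.3, 0.4, 2.0])

q1, q3 = np.percentile(values, [25, 75])   # lower and upper quartiles (0.2 and 0.4)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside the bounds are replaced by the nearest bound
clipped = np.clip(values, lower_bound, upper_bound)
print(clipped)  # the extreme value 2.0 is pulled in to the upper bound (about 0.7)
```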

Create network graphs using the edgelists and get social distance at Time 2 and Time 3 for each dyad. \
Create a master dataframe (master_df) to include all the data needed for the analysis. 

In [5]:
#read in edgelist files
edgelist_t2 = pd.read_csv(f'{filepath}/edgelist_t2.csv', index_col=False)
edgelist_t3 = pd.read_csv(f'{filepath}/edgelist_t3.csv', index_col=False)

#function to create network graph from edgelist
# mode = 'mutual' is specified to make sure a tie exists between 2 nodes only if the friendship is mutually reported (i.e., A nominated B as a friend and B nominated A as a friend)
def make_graph(edgelist_time):
    g = igraph.Graph.DataFrame(edgelist_time, use_vids=False, directed=True)
    g = igraph.Graph.as_undirected(g, mode = 'mutual')

    return g

#create network graphs for t2 and t3
g_t2 = make_graph(edgelist_t2)
g_t3 = make_graph(edgelist_t3)

#create a master dataframe master_df that includes all the data needed for the analysis. 
master_df = dyad_df.copy()

master_df['soc_dist2'] = ''
master_df['soc_dist3'] = ''

for i in range(0,len(master_df)):
    #get the two subjects in the dyad
    dyad_subj1 = dyad_df['dyad_subject1'][i]
    dyad_subj2 = dyad_df['dyad_subject2'][i]
    
    #get the distance between the two subjects in the dyad 
    master_df.loc[i, 'soc_dist2'] = igraph.Graph.shortest_paths(g_t2, source = dyad_subj1, target = dyad_subj2)[0][0]
    master_df.loc[i, 'soc_dist3'] = igraph.Graph.shortest_paths(g_t3, source = dyad_subj1, target = dyad_subj2)[0][0]
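
The two ideas in this step (keeping only mutually reported ties, then measuring geodesic distance) can be illustrated without igraph. The sketch below is a dependency-free illustration on hypothetical nominations, not the igraph implementation used above.

```python
from collections import deque

# Hypothetical directed nominations (source nominated target as a friend)
nominations = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("C", "D"), ("A", "D")]

# Keep a tie only if the nomination is reciprocated
nominated = set(nominations)
mutual = {frozenset(e) for e in nominated if (e[1], e[0]) in nominated}

# Undirected adjacency list built from the mutual ties
neighbors = {}
for tie in mutual:
    a, b = tuple(tie)
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

def social_distance(source, target):
    """Length of the shortest path between two nodes (breadth-first search)."""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")  # unreachable pairs get an infinite distance

print(social_distance("A", "B"))  # 1
print(social_distance("A", "C"))  # 2
print(social_distance("A", "D"))  # inf (C->D and A->D are not reciprocated)
```

Dyads with no connecting path have an infinite distance, so they do not fall into any of the distance 1, 2, or 3 groups used in the contrasts.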

Permute the social network graph 1000 times at each time point (Time 2 and Time 3)

In [6]:
#set up function permute_graph()
def permute_graph(subject_ids, n_permutations, time):
    if time == 't2': 
        g_orig = g_t2
    elif time == 't3':
        g_orig = g_t3
    else: 
        raise ValueError('time input is invalid')

    out_dir = f'{filepath}/derivatives/igraph_data/{time}/permuted_networks/'
    os.makedirs(out_dir, exist_ok=True)

    ids = list(subject_ids)
    for i in range(1, n_permutations+1):
        #work on a copy so that the observed graph (g_t2 or g_t3) is never modified
        g = g_orig.copy()

        #shuffle the subject labels and relabel all nodes in a single step;
        #renaming nodes one at a time creates temporary duplicate names, so a later lookup can relabel the wrong node
        shuffled_ids = random.sample(ids, len(ids))
        id_map = dict(zip(ids, shuffled_ids))
        g.vs['name'] = [id_map.get(name, name) for name in g.vs['name']]

        SavePickle(g, f'{out_dir}igraph_undirected_mutual_p{i}.pkl')

subject_ids = subject_ts['subject'].unique()

#permute the social network graph at each timepoint
for time in ['t2', 't3']:
    permute_graph(subject_ids, n_permutations=1000, time=time)
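
Node-label permutation keeps the network topology fixed and only reassigns which participant sits at which node. A minimal, dependency-free sketch of this idea on a hypothetical four-person network follows. The labels, the edges, and the id_map name are assumptions for illustration.

```python
import random

# Hypothetical undirected ties among four participants
edges = [("A", "B"), ("B", "C"), ("C", "D")]
ids = ["A", "B", "C", "D"]

random.seed(0)  # only to make the sketch reproducible
shuffled_ids = random.sample(ids, len(ids))

# Build the full old-label -> new-label mapping first, then apply it to every
# edge in one pass, so no label is overwritten before it has been looked up
id_map = dict(zip(ids, shuffled_ids))
permuted_edges = [(id_map[a], id_map[b]) for a, b in edges]

print(id_map)
print(permuted_edges)
```

The permuted network has the same number of ties and the same degree sequence as the original. Only the correspondence between participants and network positions changes, which is what the null model requires.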

Generate master_dfs_permuted based on the permuted graphs: for each of the 1000 permuted graphs, social distances are recomputed and saved as master_dfs_p{perm}.pkl (1000 files in total). These are used to build the null distributions for assessing statistical significance in the analysis.

In [7]:
def get_soc_dist_pemuted(n_permutations):
    for perm in range(1, n_permutations+1):
        dyad_df_permuted = dyad_df.copy()
        
        for i in range(0,len(dyad_df_permuted)):
            #get the two subjects in the dyad
            dyad_subj1 = dyad_df_permuted['dyad_subject1'][i]
            dyad_subj2 = dyad_df_permuted['dyad_subject2'][i]
            
            for time in ['t2', 't3']:
                #read in the permuted network graph
                g = LoadPickle(f'{filepath}/derivatives/igraph_data/{time}/permuted_networks/igraph_undirected_mutual_p{perm}.pkl')
            
                #get the social distances between two nodes in the permuted graph at each timepoint
                if time == 't2':
                    dyad_df_permuted.loc[i, 'soc_dist2'] = igraph.Graph.shortest_paths(g, source = dyad_subj1, target = dyad_subj2)[0][0]
                if time == 't3':
                    dyad_df_permuted.loc[i, 'soc_dist3'] = igraph.Graph.shortest_paths(g, source = dyad_subj1, target = dyad_subj2)[0][0]
            
        if not os.path.exists(f'{filepath}/derivatives/master_dfs_permuted/'):
            os.makedirs(f'{filepath}/derivatives/master_dfs_permuted/')
        
        SavePickle(dyad_df_permuted, f'{filepath}/derivatives/master_dfs_permuted/master_dfs_p{perm}.pkl')

#create 1000 master_dfs_p{perm} 
get_soc_dist_pemuted(1000)

## Analysis

Setting up functions to control for all sociodemographic variables and handedness as well as enjoyment and interest ratings.

In [8]:
# function to control for sociodemographic variables and handedness
def regress_out_covariates(df):
    regressor_cols = ['age_dist', 'gender_similarity', 'nationality_similarity', 'hometown_population_similarity', 'dist_hometown','dist_college', 'college_pub_priv_similarity', 'major_similarity', 'industry_similarity', 'handedness_similarity',]
    regressors_var = df[regressor_cols]
    cols = parcels
    df[cols] = StandardScaler().fit_transform(df[cols])
    outcome_var = df[cols]

    pipe = LinearRegression()
    pipe.fit(regressors_var, outcome_var)
    predicted = pipe.predict(regressors_var)
    actual = outcome_var.values
    resid = actual - predicted

    resid_df = pd.DataFrame(resid, columns = cols)
    df_subset = df[[col for col in df.columns if not col in cols]]
    df_final = pd.concat([df_subset, resid_df], axis = 1)

    return df_final

# function to control for enjoyment and interest 
def regress_out_enjoyment_interest(df):
    regressor_cols = ['enjoy_similarity', 'interest_similarity']
    regressors_var = df[regressor_cols]
    cols = parcels
    df[cols] = StandardScaler().fit_transform(df[cols])
    outcome_var = df[cols]

    pipe = LinearRegression()
    pipe.fit(regressors_var, outcome_var)
    predicted = pipe.predict(regressors_var)
    actual = outcome_var.values
    resid = actual - predicted

    resid_df = pd.DataFrame(resid, columns = cols)
    df_subset = df[[col for col in df.columns if not col in cols]]
    df_final = pd.concat([df_subset.reset_index(drop = True), resid_df.reset_index(drop = True)], axis = 1)

    return df_final
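
Conceptually, both functions standardize the neural similarity values, fit an ordinary least squares model with the control variables as predictors, and keep the residuals. A minimal NumPy sketch of that residualization for a single outcome, using simulated covariates (all values and variable names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n_dyads = 45

# Hypothetical dyad-level covariates (e.g., age distance, same-gender indicator)
covariates = np.column_stack([
    rng.uniform(0, 10, n_dyads),
    rng.integers(0, 2, n_dyads),
])
# Hypothetical neural similarity that partly depends on the first covariate
isc = 0.05 * covariates[:, 0] + rng.standard_normal(n_dyads)

# Standardize the outcome, then fit ordinary least squares with an intercept
isc_z = (isc - isc.mean()) / isc.std()
X = np.column_stack([np.ones(n_dyads), covariates])
beta, *_ = np.linalg.lstsq(X, isc_z, rcond=None)
residuals = isc_z - X @ beta

# Residuals are uncorrelated with every covariate by construction
print(np.round(X.T @ residuals, 10))
```

The residuals carry the part of neural similarity that the covariates cannot linearly explain, and that part is what enters the group contrasts.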


### Analysis testing if pre-existing neural similarity differed between levels of social distance at Time 3
Setting up functions compare_groups_permuted() and calc_sig() to run the analysis.

In [9]:
# functions to run the analysis and calculate statistical significance
def compare_groups_permuted(time, contrast, control):
    vals = np.zeros([1000, len(parcels)])
    for i in range(1000):
        df = LoadPickle(f'{filepath}/derivatives/master_dfs_permuted/master_dfs_p{i+1}.pkl')

        if control == 'demo':
            df = regress_out_covariates(df)
        if control == 'enjoyment-interest':
            df = regress_out_enjoyment_interest(df)

        cols = parcels
        df[cols] = StandardScaler().fit_transform(df[cols])

        if time == 't2':
            soc_dist_col = 'soc_dist2'
        elif time == 't3':
            soc_dist_col = 'soc_dist3'
        else:
            raise ValueError('invalid time input')

        for j in range(len(cols)):
            col = cols[j]
            x = df[df[soc_dist_col].isin([1])][col].values
            if contrast == '1v2':
                y = df[df[soc_dist_col].isin([2])][col].values
            elif contrast == '1v3':
                y = df[df[soc_dist_col].isin([3])][col].values
            elif contrast == '1v23':
                y = df[df[soc_dist_col].isin([2,3])][col].values

            delta = x.mean() - y.mean()
            vals[i, j] = delta
        
    if not os.path.exists(f'{filepath}/derivatives/friend_group_contrast/'):
        os.makedirs(f'{filepath}/derivatives/friend_group_contrast/')

    df_permuted = pd.DataFrame(vals, columns = cols)
    df_permuted.to_csv(f'{filepath}/derivatives/friend_group_contrast/null_{contrast}_control-{control}_v1000_{time}.csv', index = False)

def calc_sig(time, contrast, control):
    df = master_df

    if control == 'demo':
        df = regress_out_covariates(df)
    if control == 'enjoyment-interest':
        df = regress_out_enjoyment_interest(df)

    cols = parcels
    df[cols] = StandardScaler().fit_transform(df[cols])

    if time == 't2':
        soc_dist_col = 'soc_dist2'
    elif time == 't3':
        soc_dist_col = 'soc_dist3'
    else:
        raise ValueError('invalid time input')

    stats_dict = {}
    for col in cols:
        x = df[df[soc_dist_col].isin([1])][col].values
        if contrast == '1v2':
            y = df[df[soc_dist_col].isin([2])][col].values
        elif contrast == '1v3':
            y = df[df[soc_dist_col].isin([3])][col].values
        elif contrast == '1v23':
            y = df[df[soc_dist_col].isin([2,3])][col].values
        
        delta = x.mean() - y.mean()
        stats_dict[col] = delta

    df_true = pd.DataFrame(stats_dict, index = ['isc_delta']).T

    df_permuted = pd.read_csv(f'{filepath}/derivatives/friend_group_contrast/null_{contrast}_control-{control}_v1000_{time}.csv')

    rois = cols
    dict_true = dict(zip(df_true.index, df_true.isc_delta))

    pvals = []
    for roi in rois:
        true_delta = dict_true[roi]
        permuted_deltas = df_permuted[roi].values
        pval = (1000 - (permuted_deltas < true_delta).sum()) / 1000
        pvals.append(pval)

    df_true['pval'] = pvals
    fdr_pvals = multitest.fdrcorrection(list(df_true['pval']))[1]
    df_true['pval_fdr'] = fdr_pvals
    df_true[df_true.pval_fdr < .05]

    df_true.sort_values(by='pval_fdr').to_csv(f'{filepath}/derivatives/friend_group_contrast/results_{contrast}_control-{control}_v1000_{time}.csv')
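
The significance test compares the observed difference with the null distribution from the permuted networks, then corrects across brain regions. The sketch below reproduces the one-sided p-value formula used in calc_sig() and adds an illustrative re-implementation of the Benjamini-Hochberg adjustment, which is the default method of statsmodels' fdrcorrection. The function names and numbers are made up for illustration.

```python
import numpy as np

def permutation_pvalue(true_stat, null_stats):
    """One-sided p-value: share of null statistics at least as large as the observed one."""
    null_stats = np.asarray(null_stats)
    return (len(null_stats) - (null_stats < true_stat).sum()) / len(null_stats)

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values (illustrative re-implementation)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards, cap at 1
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted

# Made-up null distribution of 1000 deltas and an observed delta
null = np.linspace(-1, 1, 1000)
print(permutation_pvalue(0.9, null))       # 0.05
print(bh_fdr([0.01, 0.04, 0.03, 0.005]))   # [0.02 0.04 0.04 0.02]
```

The test is one-sided: it asks whether the observed difference is larger than expected under the null, in line with the hypothesis that closer dyads show greater neural similarity.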

Running the analyses testing if pre-existing neural similarity differed between levels of social distance at Time 3.

In [10]:
time = 't3'
contrasts = ['1v2', '1v3', '1v23']
controls = ['none', 'demo', 'enjoyment-interest']

for contrast in contrasts:
    for control in controls:
        compare_groups_permuted(time, contrast, control)
        calc_sig(time, contrast, control)

### Analysis testing if neural similarity significantly differed as a function of the direction of change in social distance between Time 2 and Time 3
Setting up functions compare_groups_permuted_long() and calc_sig_long() to run the analysis

In [11]:
def compare_groups_permuted_long(contrast, control):
    vals = np.zeros([1000, len(parcels)])
    for i in range(1000):
        df = LoadPickle(f'{filepath}/derivatives/master_dfs_permuted/master_dfs_p{i+1}.pkl')

        #calculate the change in social distance from time2 to time3
        df['soc_dist_diff'] = df['soc_dist3'] - df['soc_dist2']

        if control == 'demo':
            df = regress_out_covariates(df)
        if control == 'enjoyment-interest':
            df = regress_out_enjoyment_interest(df)

        cols = parcels
        df[cols] = StandardScaler().fit_transform(df[cols])

        for j in range(len(cols)):
            col = cols[j]

            x = df[df['soc_dist_diff'] < 0][col].values # grew closer
            if contrast == 'closer_vs_same':
                y = df[df['soc_dist_diff'] == 0][col].values # didnt change
            elif contrast == 'closer_vs_farther':
                y = df[df['soc_dist_diff'] > 0][col].values # grew apart
            elif contrast == 'closer_vs_same-farther':
                y = df[df['soc_dist_diff'] >= 0][col].values # didnt change or grew apart

            delta = x.mean() - y.mean()
            vals[i, j] = delta

    if not os.path.exists(f'{filepath}/derivatives/dist_change_contrast/'):
        os.makedirs(f'{filepath}/derivatives/dist_change_contrast/')

    df_permuted = pd.DataFrame(vals, columns = cols)
    df_permuted.to_csv(f'{filepath}/derivatives/dist_change_contrast/null_{contrast}_control-{control}_v1000.csv', index = False)

def calc_sig_long(contrast, control):
    df = master_df

    df['soc_dist_diff'] = df['soc_dist3'] - df['soc_dist2']
   
    if control == 'demo':
        df = regress_out_covariates(df)
    if control == 'enjoyment-interest':
        df = regress_out_enjoyment_interest(df)

    cols = parcels
    df[cols] = StandardScaler().fit_transform(df[cols])

    stats_dict = {}
    for col in cols:
        x = df[df['soc_dist_diff'] < 0][col].values # grew closer

        if contrast == 'closer_vs_same':
            y = df[df['soc_dist_diff'] == 0][col].values # didnt change
        elif contrast == 'closer_vs_farther':
            y = df[df['soc_dist_diff'] > 0][col].values # grew apart
        elif contrast == 'closer_vs_same-farther':
            y = df[df['soc_dist_diff'] >= 0][col].values # didnt change or grew apart

        delta = x.mean() - y.mean()
        stats_dict[col] = delta

    df_true = pd.DataFrame(stats_dict, index = ['isc_delta']).T

    df_permuted = pd.read_csv(f'{filepath}/derivatives/dist_change_contrast/null_{contrast}_control-{control}_v1000.csv')
    rois = list(df_permuted.columns)
    dict_true = dict(zip(df_true.index, df_true.isc_delta))

    pvals = []
    for roi in rois:
        true_delta = dict_true[roi]
        permuted_deltas = df_permuted[roi].values
        pval = (1000 - (permuted_deltas < true_delta).sum()) / 1000
        pvals.append(pval)

    df_true['pval'] = pvals
    fdr_pvals = multitest.fdrcorrection(list(df_true['pval']))[1]
    df_true['pval_fdr'] = fdr_pvals
    df_true[df_true.pval_fdr < .05]

    df_true.sort_values(by='pval_fdr').to_csv(f'{filepath}/derivatives/dist_change_contrast/results_{contrast}_control-{control}_v1000.csv')
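
To illustrate the longitudinal contrasts, the sketch below classifies a handful of made-up dyads by the sign of their change in social distance and computes the closer_vs_same difference. All numbers are illustrative.

```python
import numpy as np

# Made-up social distances for six dyads at Time 2 and Time 3
soc_dist2 = np.array([3, 2, 2, 1, 1, 3])
soc_dist3 = np.array([1, 1, 2, 1, 2, 3])
isc = np.array([0.8, 0.6, 0.1, 0.3, -0.2, 0.0])  # illustrative ISC values

soc_dist_diff = soc_dist3 - soc_dist2   # negative = grew closer

closer = isc[soc_dist_diff < 0]
same = isc[soc_dist_diff == 0]
farther = isc[soc_dist_diff > 0]

# Contrast 'closer_vs_same': mean ISC of dyads that grew closer
# minus mean ISC of dyads whose distance did not change
delta = closer.mean() - same.mean()
print(round(delta, 3))  # 0.567
```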

Running the analysis testing if neural similarity significantly differed as a function of the direction of change in social distance between Time 2 and Time 3

In [12]:
dist_change_contrasts = ['closer_vs_same', 'closer_vs_farther', 'closer_vs_same-farther']
controls = ['none', 'demo', 'enjoyment-interest']
for contrast in dist_change_contrasts:
    for control in controls:
        compare_groups_permuted_long(contrast, control)
        calc_sig_long(contrast, control)

### Analyses testing if inter-individual similarity in self-reported ratings of enjoyment or interest partially but significantly accounted for the significant differences observed in the two sets of main analyses

Create 1000 permuted datasets in which enjoyment and interest ratings are shuffled at the individual level while everything else in the dataset is held constant. 

In [13]:
subject_demo_ratings_permuted = subject_demo_ratings.copy()

for perm in range(1000):
    shuffle_cols = [interest_cols] + [enjoy_cols]
    for shuffle_col in shuffle_cols:
        subject_demo_ratings_permuted[shuffle_col] = subject_demo_ratings_permuted[shuffle_col].sample(frac=1).reset_index(drop=True)

    #create permuted master dataframe by dropping the "real" similarity in enjoyment and interest while holding everything else constant
    master_df_preference_permuted = master_df.drop(['enjoy_similarity', 'interest_similarity'], axis = 1)
    master_df_preference_permuted['enjoy_similarity'] = ''
    master_df_preference_permuted['interest_similarity'] = '' 

    for i in range(0,len(master_df_preference_permuted)):
        dyad_subj1 = master_df_preference_permuted['dyad_subject1'][i]
        dyad_subj2 = master_df_preference_permuted['dyad_subject2'][i]

        #calculate the enjoyment and interest similarity on the permuted dataset with the enjoyment and interest ratings shuffled
        #enjoy_similarity
        dyad_subj1_enjoy_vec = stats.zscore(subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject'] == dyad_subj1][enjoy_cols].values.flatten())
        dyad_subj2_enjoy_vec = stats.zscore(subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject'] == dyad_subj2][enjoy_cols].values.flatten())
        enjoy_distance = np.linalg.norm(dyad_subj1_enjoy_vec - dyad_subj2_enjoy_vec)
        master_df_preference_permuted.loc[i, 'enjoy_similarity'] = float(enjoy_distance)
    
        #interest_similarity
        dyad_subj1_interest_vec = stats.zscore(subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject'] == dyad_subj1][interest_cols].values.flatten())
        dyad_subj2_interest_vec = stats.zscore(subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject'] == dyad_subj2][interest_cols].values.flatten())
        interest_distance = np.linalg.norm(dyad_subj1_interest_vec - dyad_subj2_interest_vec)
        master_df_preference_permuted.loc[i, 'interest_similarity'] = float(interest_distance)

    if not os.path.exists(f'{filepath}/derivatives/master_dfs_preference_permuted/'):
        os.makedirs(f'{filepath}/derivatives/master_dfs_preference_permuted/')
        
    SavePickle(master_df_preference_permuted, f'{filepath}/derivatives/master_dfs_preference_permuted/master_dfs_preference_p{perm+1}.pkl')
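
Shuffling at the individual level means that each subject's whole rating vector is reassigned to another subject, rather than shuffling single ratings. A minimal sketch with made-up ratings for four hypothetical subjects:

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up enjoyment ratings: one row per subject, one column per video
subjects = ["s1", "s2", "s3", "s4"]
enjoy = np.array([
    [1, 2, 3, 4, 5],
    [5, 4, 3, 2, 1],
    [3, 3, 4, 2, 1],
    [2, 5, 1, 4, 3],
])

# Shuffle whole rows: each rating vector stays intact but is
# reassigned to a (possibly different) subject
row_order = rng.permutation(len(subjects))
enjoy_permuted = enjoy[row_order]

for subj, row in zip(subjects, enjoy_permuted):
    print(subj, row)
```

Keeping the vectors intact preserves the set of rating profiles in the sample while breaking the link between a subject's ratings and that subject's neural data and network position.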

Setting up the function to regress out similarity in either enjoyment or interest ratings.

In [14]:
def regress_out_preferences(df, regressor):
    regressor_cols = [regressor]
    regressors_var = df[regressor_cols]
    cols = parcels
    df[cols] = StandardScaler().fit_transform(df[cols])
    outcome_var = df[cols]

    pipe = LinearRegression()
    pipe.fit(regressors_var, outcome_var)
    predicted = pipe.predict(regressors_var)
    actual = outcome_var.values
    resid = actual - predicted

    resid_df = pd.DataFrame(resid, columns = cols)
    df_subset = df[[col for col in df.columns if not col in cols]]
    df_final = pd.concat([df_subset, resid_df], axis = 1)

    return df_final

Setting up the function to test, for each brain region in which a significant difference in neural similarity between levels of social distance at Time 3 was observed, if inter-individual similarity in self-reported ratings of enjoyment or interest of the stimuli accounted for a significant portion of this difference.

In [15]:
def pref_testing_time3(regressor, contrast):
    df1 = master_df
    cols = parcels
    df1[cols] = StandardScaler().fit_transform(df1[cols])
    df2 = regress_out_preferences(df1, regressor = regressor)

    soc_dist = 'soc_dist3'

    dd_dict = {}
    for col in cols:
        x1 = df1[df1[soc_dist].isin([1])][col].values
        x2 = df2[df2[soc_dist].isin([1])][col].values

        if contrast == '1v2':
            y1 = df1[df1[soc_dist].isin([2])][col].values
            y2 = df2[df2[soc_dist].isin([2])][col].values
        elif contrast == '1v3':
            y1 = df1[df1[soc_dist].isin([3])][col].values
            y2 = df2[df2[soc_dist].isin([3])][col].values
        elif contrast == '1v23':
            y1 = df1[df1[soc_dist].isin([2,3])][col].values
            y2 = df2[df2[soc_dist].isin([2,3])][col].values

        #calculate delta 1, which is the difference in ISC (the contrast)
        delta1 = x1.mean() - y1.mean()
        #calculate delta 2, which is the difference in ISC (the contrast), controlling for the preference variable (enjoyment or interest)
        delta2 = x2.mean() - y2.mean()

        #calculate dd, which is the extent to which the 'uncontrolled' ISC difference is greater than the 'controlled' ISC difference
        #thus, dd captures the extent to which the ISC difference is reduced when controlling for the preference variable
        #thereby capturing the extent to which the preference variable might account for the ISC difference
        dd = delta1 - delta2

        dd_dict[col] = dd

    df_dd = pd.DataFrame(dd_dict, index = ['dd']).T

    # Permutation testing, repeating the procedure above but using preferences shuffled at the individual level
    vals = np.zeros([1000, len(parcels)])
    for i in range(1000):
        df1 = LoadPickle(f'{filepath}/derivatives/master_dfs_preference_permuted/master_dfs_preference_p{i+1}.pkl')
        df1[cols] = StandardScaler().fit_transform(df1[cols])
        df2 = regress_out_preferences(df1, regressor = regressor)

        soc_dist = 'soc_dist3'

        for j in range(len(cols)):
            col = cols[j]

            x1 = df1[df1[soc_dist].isin([1])][col].values
            x2 = df2[df2[soc_dist].isin([1])][col].values

            if contrast == '1v2':
                y1 = df1[df1[soc_dist].isin([2])][col].values
                y2 = df2[df2[soc_dist].isin([2])][col].values
            elif contrast == '1v3':
                y1 = df1[df1[soc_dist].isin([3])][col].values
                y2 = df2[df2[soc_dist].isin([3])][col].values
            elif contrast == '1v23':
                y1 = df1[df1[soc_dist].isin([2,3])][col].values
                y2 = df2[df2[soc_dist].isin([2,3])][col].values

            delta1 = x1.mean() - y1.mean()
            delta2 = x2.mean() - y2.mean()
            dd = delta1 - delta2

            vals[i, j] = dd
    df_permuted = pd.DataFrame(vals, columns = cols)
    dict_true = dict(zip(df_dd.index, df_dd.dd))

    pvals = []
    for roi in cols:
        true_delta = dict_true[roi]
        permuted_deltas = df_permuted[roi].values
        pval = (1000 - (permuted_deltas < true_delta).sum()) / 1000
        pvals.append(pval)

    df_dd['pval'] = pvals
    df_dd.sort_values(by='pval').to_csv(f'{filepath}/derivatives/friend_group_contrast/{contrast}_{regressor}-DoD_v1000_t3.csv')
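
The difference-of-differences (dd) statistic and its permutation p-value can be illustrated on a single parcel. The sketch below is only a toy example: the ISC values, the permuted dd values, and the helper name `dod_pvalue` are hypothetical and are not part of the analysis code. It uses the same one-sided arithmetic as the function above, which is equivalent to the share of permuted dd values that are greater than or equal to the observed dd.

```python
import numpy as np

def dod_pvalue(true_dd, permuted_dds):
    """One-sided p-value: share of permuted dd values >= the observed dd."""
    permuted_dds = np.asarray(permuted_dds)
    n_perm = permuted_dds.size
    # Same arithmetic as (1000 - (permuted_deltas < true_delta).sum()) / 1000
    return (n_perm - (permuted_dds < true_dd).sum()) / n_perm

# Hypothetical standardized ISC values for one parcel
isc_dist1_raw = np.array([0.6, 0.4])    # social distance 1, uncontrolled
isc_dist3_raw = np.array([0.1, 0.3])    # social distance 3, uncontrolled
isc_dist1_resid = np.array([0.5, 0.3])  # social distance 1, regressor removed
isc_dist3_resid = np.array([0.2, 0.2])  # social distance 3, regressor removed

delta1 = isc_dist1_raw.mean() - isc_dist3_raw.mean()      # uncontrolled contrast
delta2 = isc_dist1_resid.mean() - isc_dist3_resid.mean()  # controlled contrast
dd = delta1 - delta2  # reduction in the contrast after controlling

# Hypothetical dd values obtained from four permuted datasets
permuted_dds = np.array([-0.05, 0.0, 0.02, 0.2])
p_value = dod_pvalue(dd, permuted_dds)
print(round(dd, 3), p_value)
```

A small p-value here would indicate that controlling for the real regressor reduces the ISC difference more than controlling for shuffled versions of it.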


Testing if inter-individual similarity in self-reported ratings of enjoyment or interest of the stimuli accounted for the observed difference in pre-existing neural similarity between friends (with a social distance of 1) versus friends-of-friends-of-friends (with a social distance of 3) in all brain parcels

In [16]:
contrast = '1v3' #where significant differences in pre-existing neural similarity were observed in the source data

for regressor in ['enjoy_similarity', 'interest_similarity']:
    pref_testing_time3(regressor, contrast)

Check whether the brain parcels in which a significant difference in neural similarity between levels of social distance at Time 3 was observed include any parcel in which either enjoyment or interest similarity significantly accounts for the ISC difference (an empty dataframe suggests that enjoyment/interest ratings do not significantly account for the observed ISC difference)

In [17]:
def check_pref_testing_time3(regressor, contrast):
    df = pd.read_csv(f'{filepath}/derivatives/friend_group_contrast/results_{contrast}_control-none_v1000_t3.csv')
    pref = pd.read_csv(f'{filepath}/derivatives/friend_group_contrast/{contrast}_{regressor}-DoD_v1000_t3.csv')
    
    #get significant brain parcels
    sig_rois = df.loc[df['pval_fdr'] < .05, 'Unnamed: 0'].tolist()
    pref_sig = pref[pref['pval'] < .05]

    foo = pref_sig[pref_sig['Unnamed: 0'].isin(sig_rois)]
    return foo

for regressor in ['enjoy_similarity', 'interest_similarity']:
    contrast = '1v3'
    print(f'{regressor}_{contrast}:')
    print(check_pref_testing_time3(regressor, contrast))

enjoy_similarity_1v3:
Empty DataFrame
Columns: [Unnamed: 0, dd, pval]
Index: []
interest_similarity_1v3:
Empty DataFrame
Columns: [Unnamed: 0, dd, pval]
Index: []
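
The `Unnamed: 0` column in the output above arises because the results were written with `to_csv` using an unnamed index (the parcel names), which `pd.read_csv` reads back as a column called `Unnamed: 0`. The following sketch reproduces the check on hypothetical data; the parcel names and p-values are made up, and the CSV round trip goes through an in-memory buffer rather than the files used above.

```python
import io
import pandas as pd

parcel_names = ['parcel_a', 'parcel_b', 'parcel_c']  # hypothetical parcels

# Hypothetical main-analysis results and DoD results, indexed by parcel
main = pd.DataFrame({'pval_fdr': [0.01, 0.20, 0.03]}, index=parcel_names)
pref = pd.DataFrame({'dd': [0.02, 0.05, 0.01],
                     'pval': [0.40, 0.01, 0.03]}, index=parcel_names)

def roundtrip(df):
    """Write a dataframe to CSV in memory and read it back."""
    buf = io.StringIO()
    df.to_csv(buf)
    buf.seek(0)
    return pd.read_csv(buf)

main_csv = roundtrip(main)
pref_csv = roundtrip(pref)

# Parcels with a significant ISC difference in the main analysis
sig_rois = main_csv.loc[main_csv['pval_fdr'] < .05, 'Unnamed: 0'].tolist()

# Of those, parcels where the regressor significantly accounts for the difference
overlap = pref_csv[(pref_csv['pval'] < .05) & pref_csv['Unnamed: 0'].isin(sig_rois)]
print(overlap)
```

In this toy case only `parcel_c` passes both filters; in the demo output above, no parcel does.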


Setting up the function to test, for each brain region in which a significant difference in neural similarity between levels of social distance change from Time 2 to Time 3 was observed, whether inter-individual similarity in self-reported ratings of enjoyment of or interest in the stimuli accounted for a significant portion of this difference

In [18]:
def pref_testing_dist_change(regressor, contrast):
    df1 = master_df
    cols = parcels
    df1[cols] = StandardScaler().fit_transform(df1[cols])
    df2 = regress_out_preferences(df1, regressor = regressor)

    dd_dict = {}
    for col in cols:
        x1 = df1[df1['soc_dist_diff'] < 0][col].values
        x2 = df2[df2['soc_dist_diff'] < 0][col].values

        if contrast == 'closer_vs_same':
            y1 = df1[df1['soc_dist_diff'] == 0][col].values
            y2 = df2[df2['soc_dist_diff'] == 0][col].values
        elif contrast == 'closer_vs_farther':
            y1 = df1[df1['soc_dist_diff'] > 0][col].values
            y2 = df2[df2['soc_dist_diff'] > 0][col].values
        elif contrast == 'closer_vs_same-farther':
            y1 = df1[df1['soc_dist_diff'] >= 0][col].values
            y2 = df2[df2['soc_dist_diff'] >= 0][col].values

        #calculate delta 1, which is the difference in ISC (the contrast)
        delta1 = x1.mean() - y1.mean()
        #calculate delta 2, which is the difference in ISC (the contrast), controlling for the preference variable (enjoyment or interest)
        delta2 = x2.mean() - y2.mean()

        #calculate dd, which is the extent to which the 'uncontrolled' ISC difference is greater than the 'controlled' ISC difference
        #thus, dd captures the extent to which the ISC difference is reduced when controlling for the preference variable
        #thereby capturing the extent to which the preference variable might account for the ISC difference
        dd = delta1 - delta2

        dd_dict[col] = dd

    df_dd = pd.DataFrame(dd_dict, index = ['dd']).T

    # Permutation testing, repeating the procedure above but using preferences shuffled at the individual level
    vals = np.zeros([1000, len(parcels)])
    for i in range(1000):
        df1 = LoadPickle(f'{filepath}/derivatives/master_dfs_preference_permuted/master_dfs_preference_p{i+1}.pkl')
        df1[cols] = StandardScaler().fit_transform(df1[cols])
        df2 = regress_out_preferences(df1, regressor = regressor)

        for j in range(len(cols)):
            col = cols[j]

            x1 = df1[df1['soc_dist_diff'] < 0][col].values
            x2 = df2[df2['soc_dist_diff'] < 0][col].values

            if contrast == 'closer_vs_same':
                y1 = df1[df1['soc_dist_diff'] == 0][col].values
                y2 = df2[df2['soc_dist_diff'] == 0][col].values
            elif contrast == 'closer_vs_farther':
                y1 = df1[df1['soc_dist_diff'] > 0][col].values
                y2 = df2[df2['soc_dist_diff'] > 0][col].values
            elif contrast == 'closer_vs_same-farther':
                y1 = df1[df1['soc_dist_diff'] >= 0][col].values
                y2 = df2[df2['soc_dist_diff'] >= 0][col].values

            delta1 = x1.mean() - y1.mean()
            delta2 = x2.mean() - y2.mean()
            dd = delta1 - delta2

            vals[i, j] = dd
    df_permuted = pd.DataFrame(vals, columns = cols)
    dict_true = dict(zip(df_dd.index, df_dd.dd))

    pvals = []
    for roi in cols:
        true_delta = dict_true[roi]
        permuted_deltas = df_permuted[roi].values
        pval = (1000 - (permuted_deltas < true_delta).sum()) / 1000
        pvals.append(pval)

    df_dd['pval'] = pvals
    df_dd.sort_values(by='pval').to_csv(f'{filepath}/derivatives/dist_change_contrast/{contrast}_{regressor}-DoD_v1000.csv')
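
The contrast-selection logic is repeated for the real and the permuted data in the function above. One possible way to factor it out is sketched below; `dist_change_masks` is a hypothetical helper, not part of the original code. It assumes, as in the function above, that negative values of `soc_dist_diff` mark dyads who grew closer. Unlike the `if`/`elif` chain, it raises an explicit error for an unrecognized contrast name.

```python
import pandas as pd

def dist_change_masks(soc_dist_diff, contrast):
    """Return boolean masks (closer, comparison) for a dist-change contrast."""
    closer = soc_dist_diff < 0
    if contrast == 'closer_vs_same':
        other = soc_dist_diff == 0
    elif contrast == 'closer_vs_farther':
        other = soc_dist_diff > 0
    elif contrast == 'closer_vs_same-farther':
        other = soc_dist_diff >= 0
    else:
        raise ValueError(f'Unknown contrast: {contrast}')
    return closer, other

# Hypothetical dyad-level data: change in social distance and one parcel's ISC
toy = pd.DataFrame({'soc_dist_diff': [-1, 0, 1, -2, 2],
                    'parcel_1': [0.9, 0.1, -0.4, 0.6, -0.2]})

closer, farther = dist_change_masks(toy['soc_dist_diff'], 'closer_vs_farther')
delta = toy.loc[closer, 'parcel_1'].mean() - toy.loc[farther, 'parcel_1'].mean()
print(delta)
```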

Testing if inter-individual similarity in self-reported ratings of enjoyment or interest of the stimuli accounted for the observed difference in pre-existing neural similarity between dyads who grew closer versus those who grew apart in all brain parcels

In [19]:
contrast = 'closer_vs_farther' #where significant differences in pre-existing neural similarity were observed in the source data

for regressor in ['enjoy_similarity', 'interest_similarity']:
    pref_testing_dist_change(regressor, contrast)

Check whether the brain parcels in which a significant difference in neural similarity between directions of change in social distance between Time 2 and Time 3 was observed include any parcel in which either enjoyment or interest similarity significantly accounts for the ISC difference (an empty dataframe suggests that enjoyment/interest ratings do not significantly account for the observed ISC difference)

In [20]:
def check_pref_testing_dist_change(regressor, contrast):
    df = pd.read_csv(f'{filepath}/derivatives/dist_change_contrast/results_{contrast}_control-none_v1000.csv')
    pref = pd.read_csv(f'{filepath}/derivatives/dist_change_contrast/{contrast}_{regressor}-DoD_v1000.csv')
    
    #get significant brain parcels
    sig_rois = df.loc[df['pval_fdr'] < .05, 'Unnamed: 0'].tolist()
    pref_sig = pref[pref['pval'] < .05]

    foo = pref_sig[pref_sig['Unnamed: 0'].isin(sig_rois)]
    return foo

contrast = 'closer_vs_farther'

for regressor in ['enjoy_similarity', 'interest_similarity']:
    print(f'{regressor}_{contrast}:')
    print(check_pref_testing_dist_change(regressor, contrast))

enjoy_similarity_closer_vs_farther:
Empty DataFrame
Columns: [Unnamed: 0, dd, pval]
Index: []
interest_similarity_closer_vs_farther:
Empty DataFrame
Columns: [Unnamed: 0, dd, pval]
Index: []


### Analyses testing if inter-individual similarity in sociodemographic variables partially but significantly accounted for the significant differences in pre-existing neural similarity observed in the first set of main analyses (friends vs friends-of-friends-of-friends) at Time 3

Create 1000 permuted datasets in which all sociodemographic variables were shuffled at the individual level while holding everything else in the dataset constant.

In [21]:
subject_demo_ratings_permuted = subject_demo_ratings.copy()

for perm in range(1000):
    shuffle_cols = ['age', 'gender', 'nationality', 'college', 'hometown', 'major', 'industry']
    for shuffle_col in shuffle_cols:
        subject_demo_ratings_permuted[shuffle_col] = subject_demo_ratings_permuted[shuffle_col].sample(frac=1).reset_index(drop=True)

    #create permuted master dataframe by dropping the "real" demographic similarity variables while holding everything else constant
    master_df_demo_permuted = master_df.drop(['age_dist', 'gender_similarity', 'nationality_similarity', 'hometown_population_similarity', 'dist_hometown','dist_college', 'college_pub_priv_similarity', 'major_similarity', 'industry_similarity'], axis = 1)
    master_df_demo_permuted['age_dist'] = ''
    master_df_demo_permuted['gender_similarity'] = '' 
    master_df_demo_permuted['nationality_similarity'] = '' 
    master_df_demo_permuted['hometown_population_similarity'] = '' 
    master_df_demo_permuted['dist_hometown'] = '' 
    master_df_demo_permuted['dist_college'] = '' 
    master_df_demo_permuted['college_pub_priv_similarity'] = '' 
    master_df_demo_permuted['major_similarity'] = '' 
    master_df_demo_permuted['industry_similarity'] = '' 

    for i in range(0,len(master_df_demo_permuted)):
        dyad_subj1 = master_df_demo_permuted['dyad_subject1'][i]
        dyad_subj2 = master_df_demo_permuted['dyad_subject2'][i]

        #calculate the demographic similarity variables on the permuted dataset with the demographic variables shuffled
        #age_dist
        dyad_subj1_age = subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject']==dyad_subj1]['age'].item()
        dyad_subj2_age = subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject']==dyad_subj2]['age'].item()
        age_dist = np.abs(dyad_subj1_age - dyad_subj2_age)
        #use .loc to avoid chained assignment, which may not write back to the dataframe in newer pandas versions
        master_df_demo_permuted.loc[i, 'age_dist'] = age_dist

        #gender_similarity
        dyad_subj1_gender = subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject']==dyad_subj1]['gender'].item()
        dyad_subj2_gender = subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject']==dyad_subj2]['gender'].item()
        gender_similarity = 1 if dyad_subj1_gender == dyad_subj2_gender else 0
        master_df_demo_permuted.loc[i, 'gender_similarity'] = gender_similarity
        
        #nationality_similarity
        dyad_subj1_nationality = subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject']==dyad_subj1]['nationality'].item()
        dyad_subj2_nationality = subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject']==dyad_subj2]['nationality'].item()
        nationality_similarity = 1 if dyad_subj1_nationality == dyad_subj2_nationality else 0
        master_df_demo_permuted.loc[i, 'nationality_similarity'] = nationality_similarity

        #hometown_population_similarity
        dyad_subj1_hometown = subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject']==dyad_subj1]['hometown'].item()
        dyad_subj2_hometown = subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject']==dyad_subj2]['hometown'].item()
        hometown_population_similarity = np.abs(dict_hometown_population[dyad_subj1_hometown] - dict_hometown_population[dyad_subj2_hometown])
        master_df_demo_permuted.loc[i, 'hometown_population_similarity'] = hometown_population_similarity

        #dist_hometown
        if dyad_subj1_hometown == dyad_subj2_hometown:
            master_df_demo_permuted.loc[i, 'dist_hometown'] = 0
        else: 
            hometown_idx = dist_hometowns.index[(dist_hometowns['City1'] == dyad_subj1_hometown) & (dist_hometowns['City2'] == dyad_subj2_hometown)].tolist()
            if not hometown_idx:
                hometown_idx = dist_hometowns.index[(dist_hometowns['City2'] == dyad_subj1_hometown) & (dist_hometowns['City1'] == dyad_subj2_hometown)].tolist()
            master_df_demo_permuted.loc[i, 'dist_hometown'] = dist_hometowns.loc[hometown_idx[0], 'dist_hometown']

        #dist_college
        dyad_subj1_college = subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject']==dyad_subj1]['college'].item()
        dyad_subj2_college = subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject']==dyad_subj2]['college'].item()

        if dyad_subj1_college == dyad_subj2_college:
            master_df_demo_permuted.loc[i, 'dist_college'] = 0
        else:
            college_idx = dist_colleges.index[(dist_colleges['college1'] == dyad_subj1_college) & (dist_colleges['college2'] == dyad_subj2_college)].tolist()
            if not college_idx:
                college_idx = dist_colleges.index[(dist_colleges['college2'] == dyad_subj1_college) & (dist_colleges['college1'] == dyad_subj2_college)].tolist()
            master_df_demo_permuted.loc[i, 'dist_college'] = dist_colleges.loc[college_idx[0], 'dist_college']

        #college_pub_priv_similarity
        college_pub_priv_similarity = 1 if dict_college_public_private[dyad_subj1_college] == dict_college_public_private[dyad_subj2_college] else 0
        master_df_demo_permuted.loc[i, 'college_pub_priv_similarity'] = college_pub_priv_similarity

        #major_similarity
        dyad_subj1_major = subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject']==dyad_subj1]['major'].item()
        dyad_subj2_major = subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject']==dyad_subj2]['major'].item()
        major_similarity = 1 if dict_major_cat[dyad_subj1_major] == dict_major_cat[dyad_subj2_major] else 0
        master_df_demo_permuted.loc[i, 'major_similarity'] = major_similarity

        #industry_similarity
        dyad_subj1_industry = subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject']==dyad_subj1]['industry'].item()
        dyad_subj2_industry = subject_demo_ratings_permuted[subject_demo_ratings_permuted['subject']==dyad_subj2]['industry'].item()
        industry_similarity = 1 if dyad_subj1_industry == dyad_subj2_industry else 0
        master_df_demo_permuted.loc[i, 'industry_similarity'] = industry_similarity

    if not os.path.exists(f'{filepath}/derivatives/master_dfs_demo_permuted/'):
        os.makedirs(f'{filepath}/derivatives/master_dfs_demo_permuted/')
        
    SavePickle(master_df_demo_permuted, f'{filepath}/derivatives/master_dfs_demo_permuted/master_dfs_demo_p{perm+1}.pkl')
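
Shuffling at the individual level means that each subject keeps their ID (and therefore their place in every dyad) but receives demographic values drawn from other subjects, with each column shuffled independently. The sketch below shows this on hypothetical data. The fixed seed and the helper name `shuffle_columns` are assumptions added for illustration; the cell above uses unseeded `sample(frac=1)`, so its permutations differ from run to run.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumption: a fixed seed for reproducibility

# Hypothetical subject-level demographics
subjects = pd.DataFrame({'subject': [1, 2, 3, 4],
                         'age': [25, 30, 35, 40],
                         'gender': ['F', 'M', 'F', 'M']})

def shuffle_columns(df, cols, rng):
    """Return a copy of df with each listed column independently permuted."""
    out = df.copy()
    for col in cols:
        out[col] = rng.permutation(out[col].to_numpy())
    return out

permuted = shuffle_columns(subjects, ['age', 'gender'], rng)
print(permuted)
```

Subject IDs stay in place and each column keeps the same set of values, so any dyad-level similarity recomputed from `permuted` reflects only chance pairings.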

Setting up the function to test, for each brain region in which a significant difference in neural similarity between levels of social distance at Time 3 was observed, whether inter-individual similarity in any or all demographic variables accounted for a significant portion of this difference

In [22]:
def regress_out_demo(df, regressor_cols):
    regressors_var = df[regressor_cols]
    cols = parcels
    df[cols] = StandardScaler().fit_transform(df[cols])
    outcome_var = df[cols]

    pipe = LinearRegression()
    pipe.fit(regressors_var, outcome_var)
    predicted = pipe.predict(regressors_var)
    actual = outcome_var.values
    resid = actual - predicted

    resid_df = pd.DataFrame(resid, columns = cols)
    df_subset = df[[col for col in df.columns if not col in cols]]
    df_final = pd.concat([df_subset, resid_df], axis = 1)

    return df_final


def demo_testing_time3(regressor, contrast):
    df1 = master_df
    cols = parcels
    df1[cols] = StandardScaler().fit_transform(df1[cols])
    df2 = regress_out_demo(df1, regressor)

    soc_dist = 'soc_dist3'

    dd_dict = {}
    for col in cols:
        x1 = df1[df1[soc_dist].isin([1])][col].values
        x2 = df2[df2[soc_dist].isin([1])][col].values

        if contrast == '1v2':
            y1 = df1[df1[soc_dist].isin([2])][col].values
            y2 = df2[df2[soc_dist].isin([2])][col].values
        elif contrast == '1v3':
            y1 = df1[df1[soc_dist].isin([3])][col].values
            y2 = df2[df2[soc_dist].isin([3])][col].values
        elif contrast == '1v23':
            y1 = df1[df1[soc_dist].isin([2,3])][col].values
            y2 = df2[df2[soc_dist].isin([2,3])][col].values

        delta1 = x1.mean() - y1.mean()
        delta2 = x2.mean() - y2.mean()

        # delta 1 is the difference in ISC (the contrast)
        # delta 2 is the difference in ISC (the contrast), controlling for the demographic variable(s)
        # dd is the extent to which the 'uncontrolled' ISC difference is greater than the 'controlled' ISC difference
        # thus, dd captures the extent to which the ISC difference is reduced when controlling for the demographic variable(s)
        # thereby capturing the extent to which the demographic variable(s) might account for the ISC difference
        dd = delta1 - delta2

        dd_dict[col] = dd

    df_dd = pd.DataFrame(dd_dict, index = ['dd']).T

    # Permutation testing, repeating the procedure above but using demographic variables shuffled at the individual level
    vals = np.zeros([1000, len(parcels)])
    for i in range(1000):
        df1 = LoadPickle(f'{filepath}/derivatives/master_dfs_demo_permuted/master_dfs_demo_p{i+1}.pkl')
        df1[cols] = StandardScaler().fit_transform(df1[cols])
        df2 = regress_out_demo(df1, regressor)

        soc_dist = 'soc_dist3'

        for j in range(len(cols)):
            col = cols[j]

            x1 = df1[df1[soc_dist].isin([1])][col].values
            x2 = df2[df2[soc_dist].isin([1])][col].values

            if contrast == '1v2':
                y1 = df1[df1[soc_dist].isin([2])][col].values
                y2 = df2[df2[soc_dist].isin([2])][col].values
            elif contrast == '1v3':
                y1 = df1[df1[soc_dist].isin([3])][col].values
                y2 = df2[df2[soc_dist].isin([3])][col].values
            elif contrast == '1v23':
                y1 = df1[df1[soc_dist].isin([2,3])][col].values
                y2 = df2[df2[soc_dist].isin([2,3])][col].values

            delta1 = x1.mean() - y1.mean()
            delta2 = x2.mean() - y2.mean()
            dd = delta1 - delta2

            vals[i, j] = dd
    df_permuted = pd.DataFrame(vals, columns = cols)
    dict_true = dict(zip(df_dd.index, df_dd.dd))

    pvals = []
    for roi in cols:
        true_delta = dict_true[roi]
        permuted_deltas = df_permuted[roi].values
        pval = (1000 - (permuted_deltas < true_delta).sum()) / 1000
        pvals.append(pval)

    df_dd['pval'] = pvals
    return df_dd
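
The "controlling for" step in `regress_out_demo` replaces each parcel's ISC values with the residuals of an ordinary least squares fit on the regressors. The sketch below illustrates the property this relies on, using simulated data: the residuals have zero mean and are uncorrelated with every regressor. It uses `np.linalg.lstsq` with an explicit intercept column, which is assumed here to give the same fit as `LinearRegression` with its default intercept; the data and coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_dyads = 45
X = rng.normal(size=(n_dyads, 2))            # two hypothetical dyad-level regressors
coefs = np.array([[0.5, 0.0], [0.0, -0.3]])  # hypothetical effects on two parcels
Y = X @ coefs + rng.normal(size=(n_dyads, 2))

# Ordinary least squares with an intercept column
design = np.column_stack([np.ones(n_dyads), X])
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)

# Residuals: what remains of each parcel's ISC after removing the fitted part
resid = Y - design @ beta

print(np.abs(resid.mean(axis=0)).max())  # close to zero
print(np.abs(X.T @ resid).max())         # close to zero
```

Because the residuals carry no linear trace of the regressors, any group difference that survives in them cannot be attributed to a linear effect of those regressors.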

Testing if inter-individual similarity in demographic variable(s) accounted for the observed difference in pre-existing neural similarity between friends (with a social distance of 1) versus friends-of-friends-of-friends (with a social distance of 3) in all brain parcels

In [23]:
# List of demographic variables as regressors 
demo_regressor_cols = ['age_dist', 'gender_similarity', 'nationality_similarity', 'dist_hometown','dist_college', 'college_pub_priv_similarity', 'major_similarity', 'hometown_population_similarity', 'industry_similarity']

contrast = '1v3' #where significant differences in pre-existing neural similarity were observed in the source data

# Check for all regressors (altogether)
demo_all_DoD = demo_testing_time3(demo_regressor_cols, contrast) 
demo_all_DoD.sort_values(by='pval').to_csv(f'{filepath}/derivatives/friend_group_contrast/{contrast}_demo-all-DoD_v1000_t3.csv')

# Check for individual regressor
for demo_regressor_col in demo_regressor_cols:
    demo_regressor_col1 = [demo_regressor_col]
    demo_DoD = demo_testing_time3(demo_regressor_col1, contrast)
    demo_DoD.sort_values(by='pval').to_csv(f'{filepath}/derivatives/friend_group_contrast/{contrast}_{demo_regressor_col}-DoD_v1000_t3.csv')

Check whether the brain parcels in which a significant difference in neural similarity between levels of social distance at Time 3 was observed include any parcel in which the demographic variable(s) significantly account(s) for the ISC difference (an empty dataframe suggests that the demographic variable(s) do not significantly account for the observed ISC difference)

In [24]:
def check_demo_testing_time3(regressor, contrast):
    df = pd.read_csv(f'{filepath}/derivatives/friend_group_contrast/results_{contrast}_control-none_v1000_t3.csv')
    demo = pd.read_csv(f'{filepath}/derivatives/friend_group_contrast/{contrast}_{regressor}-DoD_v1000_t3.csv')
    
    #get significant brain parcels
    sig_rois = df.loc[df['pval_fdr'] < .05, 'Unnamed: 0'].tolist()
    demo_sig = demo[demo['pval'] < .05]

    foo = demo_sig[demo_sig['Unnamed: 0'].isin(sig_rois)]
    return foo

for regressor in demo_regressor_cols + ['demo-all']:
    contrast = '1v3'
    print(f'{regressor}_{contrast}:')
    print(check_demo_testing_time3(regressor, contrast))

age_dist_1v3:
Empty DataFrame
Columns: [Unnamed: 0, dd, pval]
Index: []
gender_similarity_1v3:
Empty DataFrame
Columns: [Unnamed: 0, dd, pval]
Index: []
nationality_similarity_1v3:
Empty DataFrame
Columns: [Unnamed: 0, dd, pval]
Index: []
dist_hometown_1v3:
Empty DataFrame
Columns: [Unnamed: 0, dd, pval]
Index: []
dist_college_1v3:
Empty DataFrame
Columns: [Unnamed: 0, dd, pval]
Index: []
college_pub_priv_similarity_1v3:
Empty DataFrame
Columns: [Unnamed: 0, dd, pval]
Index: []
major_similarity_1v3:
Empty DataFrame
Columns: [Unnamed: 0, dd, pval]
Index: []
hometown_population_similarity_1v3:
Empty DataFrame
Columns: [Unnamed: 0, dd, pval]
Index: []
industry_similarity_1v3:
Empty DataFrame
Columns: [Unnamed: 0, dd, pval]
Index: []
demo-all_1v3:
Empty DataFrame
Columns: [Unnamed: 0, dd, pval]
Index: []
