# **Exclusions**
This is a short notebook that calculates the exclusions from the aggregated bi-section task data. 
To run the analysis, click on each block of code and then click the run icon on the toolbar at the top. Alternatively, click a code block and press Shift+Enter. [Click here for help](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Running%20Code.html)

If an error such as "No module named XXX" occurs, please install the missing module. It is usually easier to download Anaconda, which comes with the relevant libraries pre-packaged. 
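
A quick way to see which modules are missing before running anything else is sketched below. The list of names is taken from the imports in the first code cell; `REQUIRED` and `find_missing` are illustrative names and not part of this notebook.

```python
import importlib.util

# Top-level packages imported by the first code cell of this notebook.
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "scipy",
            "statsmodels", "colorama", "IPython"]

def find_missing(modules):
    """Return the names in `modules` that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Any name that is reported as missing can then be installed before running the import cell.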

*Refer to the [Exclusion criterion notebook](Exclusion criterion .ipynb) for a detailed walkthrough.*


In [29]:
import numpy as np
import pandas as pd 


import matplotlib.pyplot as plt 
from matplotlib.backends.backend_pdf import PdfPages
import seaborn as sns



import math 
import pylab 
from scipy import stats

from statsmodels import robust

import colorama
from colorama import Fore

from IPython.display import display, HTML

<a id='1'></a>
## *Getting the data*
This section will get the data from your disk. Ensure that you fill in the file path prompt correctly; otherwise there will be nothing to work with.

__[How to get a file-path on a mac](https://apple.stackexchange.com/questions/252171/mac-finder-getting-the-path-of-a-directory-or-file-as-as-string)__
<br> __[How to get a file-path on windows](https://stackoverflow.com/questions/32573080/how-can-i-get-the-path-to-a-file-in-windows-10)__

In [30]:
print("What condition is this?")
condition_name = input()
print("What is the file path?")
file_path = input()

What condition is this?
Gamma 3
What is the file path?
/Users/Akshi/Desktop/Correlation/Correlation_Analysis/Data/Gamma 3.0 Final Analysis (with reruns).xlsm


Make sure the above file path is correct before running the next block of code.
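
One way to check the path before loading is sketched here. It assumes the workbook has an `.xlsx`, `.xlsm` or `.xls` extension; `clean_path` and `path_problem` are hypothetical helper names, not part of this notebook.

```python
from pathlib import Path

def clean_path(raw):
    """Strip whitespace and surrounding quotes that often come along when a path is pasted."""
    return Path(raw.strip().strip('"').strip("'")).expanduser()

def path_problem(raw):
    """Return a description of what is wrong with the path, or None if it looks usable."""
    path = clean_path(raw)
    if not path.is_file():
        return f"No file found at: {path}"
    if path.suffix.lower() not in {".xlsx", ".xlsm", ".xls"}:
        return f"Not an Excel workbook: {path.name}"
    return None

# Example with a made-up path that does not exist
print(path_problem("  '/no/such/folder/data.xlsm' "))
```

Calling `path_problem(file_path)` after the prompt and seeing `None` suggests the file is at least present and has an Excel extension.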

In [31]:
#Read the data *** Make sure path is set to the correct file-path *** 
path = file_path
data_sheet = pd.ExcelFile(path)

#Parse the Exclusions sheet to create a Pandas DataFrame
exclusions = data_sheet.parse('1. Exclusions')
#Select the columns that are needed and create a new DataFrame with them
DF = exclusions[["ID","subCondition","highRef","estimatedMid","lowRef","roundType", "AnchorValues"]]
#Drop NaN values
DF = DF.dropna(subset=["estimatedMid"])

#Group the DataFrame by subcondition
sub_cond_df = DF.groupby("subCondition")
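
The effect of `dropna` and `groupby` in the cell above can be seen on a small made-up table. This is only a sketch: the values are invented and only three of the columns are reproduced. Because the `transform` helper defined later steps through the rows four at a time, it may be worth checking that every participant still has four attempts once rows with a missing `estimatedMid` are dropped.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the '1. Exclusions' sheet: one subcondition, two participants,
# four attempts each, with one missing estimate for participant 2.
toy = pd.DataFrame({
    "ID":           [1, 1, 1, 1, 2, 2, 2, 2],
    "subCondition": [1, 1, 1, 1, 1, 1, 1, 1],
    "estimatedMid": [0.52, 0.48, 0.50, 0.47, 0.61, np.nan, 0.58, 0.40],
})

# Rows without an estimate are removed; the other columns are left untouched.
toy = toy.dropna(subset=["estimatedMid"])

# get_group(k) returns the rows belonging to subcondition k as a DataFrame.
group = toy.groupby("subCondition").get_group(1)
print(len(group), "rows remain in subcondition 1")

# Count attempts per participant and list anyone who no longer has four.
counts = group.groupby("ID").size()
incomplete = counts[counts != 4]
print("Participants without four attempts:", incomplete.index.tolist())
```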

<a id='2'></a>
## *Functions*
The next code block contains the helper and main functions that will be used to conduct the analysis. 
Ensure that this block is run.
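
As a small worked example of what the `transform` helper in the next block computes: each participant's four attempts are combined as |(first + third) - (second + fourth)|, and a cube root is applied afterwards. The sketch below is a vectorised illustration on made-up numbers, assuming the rows are ordered participant by participant with four attempts each; `combine_attempts` is a hypothetical name.

```python
import numpy as np

def combine_attempts(estimates):
    """Vectorised sketch of |(a1 + a3) - (a2 + a4)| per participant.

    Assumes rows are ordered participant by participant, four attempts each.
    """
    estimates = np.asarray(estimates, dtype=float)
    n = len(estimates) // 4                 # incomplete trailing rows are ignored
    a = estimates[: n * 4].reshape(n, 4)    # one row per participant
    return np.abs((a[:, 0] + a[:, 2]) - (a[:, 1] + a[:, 3]))

# Two made-up participants, four attempts each
toy = [10.0, 12.0, 11.0, 13.0,   20.0, 15.0, 22.0, 19.0]
combined = combine_attempts(toy)
print(combined)            # [4. 8.]
print(np.cbrt(combined))   # cube root, as applied afterwards in the analysis
```

For the first participant, (10 + 11) - (12 + 13) = -4, so the combined value is 4; for the second, (20 + 22) - (15 + 19) = 8.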

In [32]:
#Functions to transform the data. Either box-cox or cbrt transforms are applied after accounting for anchoring

def transform(sub_cond):
    '''A function that combines attempts for a subcondition, in order to account for anchoring.
        Returns a numpy array with transformed data. Resulting distribution should be Gaussian.
        
        @param sub_cond: DataFrame holding the rows of the subcondition that will be transformed
        @return uni_modal: np.array with transformed data'''
    
    #0 corresponds to first attempt 
    first_idx = 0 
    second_idx = 1
    third_idx = 2
    fourth_idx = 3 
    
    #Get estimatedMid column from DataFrame
    estimates = sub_cond['estimatedMid']
    
    #Create a np-array for transformed data
    uni_modal= np.empty(int(len(sub_cond)/4))
    
    for i in range(int(len(sub_cond)/4)):
        #Get attempts for a participant
        first_atmpt = estimates.iloc[first_idx]
        second_atmpt = estimates.iloc[second_idx]
        third_atmpt = estimates.iloc[third_idx]
        fourth_atmpt = estimates.iloc[fourth_idx]
        #Calculate new estimate via Spencer's suggested formula
        estimate = abs((first_atmpt+third_atmpt) - (second_atmpt+fourth_atmpt))
        #Add to np-array
        uni_modal[i] = estimate
        #Increase index to next participant
        first_idx+=4
        second_idx+=4
        third_idx+=4
        fourth_idx+=4
    
        
    #return transformed data 
    return uni_modal



def cbrt(data):
    '''Function that applies a cube-root transform and returns the array
        @param data : array of estimates that are going to be transformed
        @return measurements: array with transformed data '''
    
    #Apply cube-root transform 
    measurements = (data**(1/3))
     
    #return transformed data 
    return measurements


def box_cox(data):
    '''Function that performs a Box-Cox transform and returns the array
        @param data : array of estimates that are going to be transformed
        @return measurements: array with transformed data '''
    
    #Apply Box-Cox transform with lambda fixed at 0 (equivalent to a natural log)
    measurements = stats.boxcox(data, 0)

    #return transformed data 
    return measurements


#End of transformation functions
#----------------------------------------------------------------------------------------------------------------
#A function that tests for normality of data

def norm_test(data, alpha=0.05):
    '''Function that determines if the given data is normal or not
    @param data: array containing the data that will have the K^2 test applied to it
    @param alpha: significance level (default is 0.05)
    @return normal,stat,p: normal is a boolean signifying if the data is normally distributed or not (H0: sample is Gaussian). stat and p are the results of the test'''
    
    normal = True 
    
    #K^2 test
    stat, p = stats.normaltest(data)
    
    #Print results
    print('\033[0m'+  Fore.BLUE + 'Statistics=%.3f, p=%.3f' % (stat, p))
    
    # interpret p value against the chosen significance level
    
    if( p > alpha):
        print('\033[0m' + Fore.BLUE+ 'Sample looks Gaussian (fail to reject H0)')
    else:
        print('\033[0m'+ Fore.RED + 'SAMPLE NOT GAUSSIAN!!!!'  + '(reject H0)')
        normal = False
    print('\033[0m' + Fore.BLUE + "---------------------------------")
    
    return [normal,stat,p]

#----------------------------------------------------------------------------------------------------------------
#Functions to get Robust score and exclusions

def get_score(data):
    '''Function to calculate RobustScore, defined as: RS = (x - median)/MAD, where MAD is the Median Absolute Deviation
        @param data: array for which score will be calculated
        @return score_list: np array containing RS for each data point'''
    
    #empty numpy array 
    score_list = np.empty(len(data))
    
    #calculate MAD and median
    mad = robust.scale.mad(data)
    median = np.median(data)
    
    for i in range(len(data)):
        #Calculate score for each data point
        num = (data[i]-median)
        denom = mad
        score = num/denom
        #add to list
        score_list[i] = score
        
    return score_list

def exclusion(data):
    '''Function that calculates the exclusions for an array and returns the IDs of participants that should be excluded
        @param data : data for which exclusions will be calculated
        @return IDs : list containing IDs of participants that should be excluded'''
    
    #exclusions_idx contains the indices of any participants for the given subcondition that should be excluded 
    exclusions_idx = []

    #get robust score
    data_score = np.abs(np.array(get_score(data)))
    
    #get indices of exclusions
    exclusions_idx = np.where(data_score > 2.5)[0].tolist()
    
    #increment index to match participant IDs if there are any exclusions
    IDs =  exclusions_idx
    if(IDs):
        ID = [x+1 for x in IDs]
        IDs = ID
    
    

    return IDs 

        
#----------------------------------------------------------------------------------------------------------------
#General purpose plotting function
def dist_plotter(measurements, sub_condition, transformed=False):
    '''Function to plot the distribution and Normal QQ
        @param measurements: np array of values that will be plotted
        @param sub_condition: int sub_condition being plotted  
        @param transformed: bool, default is False. Set to True if plotting transformed data'''
    
    #sub_condition is documented as an int, so convert it before building the titles
    sub_condition = str(sub_condition)
    
    #plot histogram
    plt.subplot(1,2,1)
    sns.distplot(measurements, kde=False)
    if(transformed):
        plt.title("Subcondition " + sub_condition + " transformed data")
    else: 
        plt.title("Subcondition " + sub_condition + " raw data")
    #Normal QQ plot
    plt.subplot(1,2,2)
    stats.probplot(measurements, dist="norm", plot=plt)
    plt.title("Subcondition " + sub_condition + " QQ plot")
    plt.show()

#----------------------------------------------------------------------------------------------------------------
#Main function that will be called to do the analysis work
def analyse():
    '''Main function that calls the relevant helpers'''
    
    #Create a DF for the output
    columns = ["Subcondition", "Statistic", "p-value", "Gaussian","Exclusions"]
    exclusions_df = pd.DataFrame(columns=columns)
    
    
    for i in range(1,16):
        #Get subcondition
        sub_cond = sub_cond_df.get_group(i)
        #transform
        transformed = transform(sub_cond)
        #normalise
        normal_data = cbrt(transformed)
        #test for normality
        print('\033[1m' + '\033[4m' + Fore.BLUE + "Subconditon " + str(i)) 
        
        #Add subcond to DF
        exclusions_df.at[i-1,'Subcondition'] = i
        
        norm = norm_test(normal_data, 0.05)
        _gaussian = norm[0]
        stat = norm[1]
        p_val = norm[2]
        
        exclusions_df.at[i-1,"Statistic"]= stat
        exclusions_df.at[i-1,"p-value"] = p_val
        exclusions_df.at[i-1,"Gaussian"] = _gaussian
        
        #commented this out for now. Should try to fix later
        #if( _gaussian != True):
        #    print(Fore.RED + "box_cox applied")
        #    normal_data = box_cox(transformed)
        #    _norm2 = norm_test(normal_data,0.05)
            
        #    count = 5
        #     while(_norm2 != True): 
        #            print(Fore.BLUE+ "Sample still not normal, enter lower significance level")
        #           alpha_lvl = input()
        #            normal_data = box_cox(transformed)
        #            _norm2 = norm_test(normal_data,alpha_lvl)
        #            count -= 1
        #            if(count == 0):
        #                print(Fore.RED+ "Sample is problematic, be careful before proceeding further")
        #               break
            
        #Get RS
        scores = get_score(normal_data)
        #get exclusions
        exclusions = exclusion(scores)
        if(exclusions):
            string =  ', '.join(str(x) for x in exclusions)
        else: string = "No exclusions"
        exclusions_df.at[i-1,"Exclusions"] = string
        
        if exclusions:
            print('\033[0m' +  Fore.RED + "Exclude participant(s): "  , end='' )
            print(*exclusions, sep=',')
            print('\n' )
        else: 
            print('\033[0m' + Fore.GREEN+"No exclusions" )
            print('\n')
            print('\n' )
    
   
    pd.set_option('colheader_justify', 'center')
    html_string = '''<html>
      <head>
        <title>Exclusions table</title>
        <link rel="stylesheet" type="text/css" href="main.css"/>
      </head>
      <body>
        {table}
      </body>
    </html>'''
    
    with open(condition_name+ ' Exclusions' + '.html', 'w') as f:
        f.write(html_string.format(table= exclusions_df.to_html(classes='df style', index=False)))
    
    #pdfkit.from_file( condition_name + '.html' , 'Analysis of ' + condition_name + '.pdf')
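
The robust score used by `get_score` and `exclusion` above, RS = (x - median)/MAD with a cut-off of 2.5, can be illustrated on a handful of made-up values. The sketch below uses NumPy only and assumes the MAD is rescaled by a constant of about 0.6745 so that it is comparable to a standard deviation for normal data; to my understanding this matches the default scaling of `robust.scale.mad` in statsmodels, but treat that as an assumption. `robust_scores` and `flag_outliers` are hypothetical names.

```python
import numpy as np

def robust_scores(values, c=0.6745):
    """Return (x - median) / MAD for every value, with MAD rescaled by c."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median)) / c
    if mad == 0:
        raise ValueError("MAD is zero; robust scores are undefined")
    return (values - median) / mad

def flag_outliers(values, cutoff=2.5):
    """Return 1-based positions whose absolute robust score exceeds the cutoff."""
    scores = np.abs(robust_scores(values))
    return (np.where(scores > cutoff)[0] + 1).tolist()

toy = [1.0, 1.1, 0.9, 1.2, 1.0, 3.0]   # made-up values; the last one is far from the rest
print(flag_outliers(toy))               # [6]
```

Here the median is 1.05 and the rescaled MAD is roughly 0.148, so the last value scores about 13 and is the only one flagged. As in `exclusion`, positions are shifted by one on the assumption that participant IDs start at 1 and follow row order.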

    


Run the next block of code to get the exclusions.

In [33]:
analyse()

[1m[4m[34mSubconditon 1
[0m[34mStatistics=5.889, p=0.053
[0m[34mSample looks Gaussian (fail to reject H0)
[0m[34m---------------------------------
[0m[32mNo exclusions




[1m[4m[34mSubconditon 2
[0m[34mStatistics=0.038, p=0.981
[0m[34mSample looks Gaussian (fail to reject H0)
[0m[34m---------------------------------
[0m[32mNo exclusions




[1m[4m[34mSubconditon 3
[0m[34mStatistics=2.855, p=0.240
[0m[34mSample looks Gaussian (fail to reject H0)
[0m[34m---------------------------------
[0m[32mNo exclusions




[1m[4m[34mSubconditon 4
[0m[34mStatistics=0.744, p=0.689
[0m[34mSample looks Gaussian (fail to reject H0)
[0m[34m---------------------------------
[0m[32mNo exclusions




[1m[4m[34mSubconditon 5
[0m[34mStatistics=1.901, p=0.386
[0m[34mSample looks Gaussian (fail to reject H0)
[0m[34m---------------------------------
[0m[31mExclude participant(s): 4,5


[1m[4m[34mSubconditon 6
[0m[34mStatistics=2.054, p=0.358
[0m[34mSa