# Overview
* Describe the logic of ANCOVA, and the tests for parallelism and inequality of slopes.
* Use OLS regression to fit various models containing a combination of quantitative and dummy coded variables and correctly interpret model coefficient estimates.
* Implement a nested modeling approach to construct standard ANCOVA hypothesis tests.

The study tested the hypothesis that people who are required to evaluate someone they have just hurt tend to denigrate the victim as a means of justifying the harmful act. To investigate this hypothesis, 20 white male college students were required to give a series of either painful or mild electric shocks (treatment A) as feedback for errors made by a white or black confederate (treatment B) working at a learning task. The confederates were part of the research team and were not actually harmed, but the subjects did not know this. Before and after administering shocks, the subjects rated the confederates in terms of likability, intelligence, and personal adjustment. The dependent variable was the change in ratings from pretest to posttest (i.e., posttest minus pretest). The 20 subjects were randomly assigned so that 5 students received each of the 4 treatment combinations.
<br>
The four treatment combinations were:
1. Mild electric shock and white confederate;
2. Severe electric shock and white confederate;
3. Mild electric shock and black confederate; 
4. Severe electric shock and black confederate. 

The experiment proceeded as follows: The subjects, randomly assigned to a treatment combination, evaluated the confederate in terms of likability, intelligence, and personal adjustment. The subjects then watched the confederate perform a learning task and were required to shock the confederate any time the confederate made an error. The confederate (who was part of the research team) made errors on purpose so that the subject would have to administer the shock treatment. After the learning task was completed, the subject was again asked to evaluate the confederate in terms of likability, intelligence, and personal adjustment. The response variable (Y) is the difference between the two assessments (i.e., posttest minus pretest). The covariate (X) is a measure of the anxiety the subjects exhibited at the beginning of the experiment.
<br>
The 4 treatment combinations arise from crossing two experimental factors with the following levels:

FACTOR A:
* a1 = mild shock         
* a2 = severe shock<br>

FACTOR B:
* b1 = black confederate
* b2 = white confederate

Columns:
* group = treatment groups

In [0]:
%r
#install.packages("RCurl") 
#install.packages("lessR") 

NULL

In [0]:
%r

library(RCurl) # use RCurl to obtain data from github
data_raw <- getURL("https://raw.githubusercontent.com/mattlibonati/Machine-Learning/main/datasets/ANCOVA%20Example_confederate.csv")
data <- read.csv(text = data_raw) 

display(data)

Subject,shock,confed,diff,anxiety,group,d_1,d_2,d_3,d_4
1,1,1,28,21,1,1,0,0,0
2,1,1,20,18,1,1,0,0,0
3,1,1,4,17,1,1,0,0,0
4,1,1,-4,18,1,1,0,0,0
5,1,1,12,17,1,1,0,0,0
6,1,2,0,16,2,0,1,0,0
7,1,2,-16,16,2,0,1,0,0
8,1,2,-8,15,2,0,1,0,0
9,1,2,16,19,2,0,1,0,0
10,1,2,8,17,2,0,1,0,0



Attaching package: ‘RCurl’

The following object is masked _by_ ‘.GlobalEnv’:

    complete


Sometimes you do not want to model two factors as separate variables --> they can be combined into a single variable that enumerates all possible combinations of their levels <br>
The "group" field identifies the combination of "shock" and "confed" for each subject <br>
Each dummy-coded variable (d_1 to d_4) indicates membership in one of the groups <br>
diff = response variable (y)

If the model were fit using only the dummy variables, the fitted values would be the group means.
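As a small illustration of this point, the sketch below fits a dummy-only model to the ten rows displayed in the table above. Those rows cover groups 1 and 2 only, so this is not the full analysis. The names `shown`, `fit_means`, and `group_means` are introduced just for this example.

```r
# Rows 1-10 exactly as displayed above (groups 1 and 2 only).
# `shown` is a name introduced for this illustration.
shown <- data.frame(
  diff = c(28, 20, 4, -4, 12, 0, -16, -8, 16, 8),
  d_2  = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

# Dummy-only model: group 1 is the reference group
fit_means <- lm(diff ~ d_2, data = shown)

# Group means computed directly for comparison
group_means <- tapply(shown$diff, shown$d_2, mean)

# Intercept = mean of group 1; d_2 coefficient = mean of group 2 minus mean of group 1
print(coef(fit_means))
print(group_means)
```

With only dummy variables in the model, the intercept reproduces the reference group's mean and each dummy coefficient is that group's difference from it.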

In [0]:
%r
library(lessR)



In [0]:
%r
# This should produce a plot; the plot is not rendered in Databricks
Plot(anxiety, diff, by=group, fit=TRUE, data=data)

dev.new(): using pdf(file="Rplots5.pdf")


group: 1   Fit: Mean Squared Error, MSE = 107
 
group: 2   Fit: Mean Squared Error, MSE = 45
 
group: 3   Fit: Mean Squared Error, MSE = 43
 
group: 4   Fit: Mean Squared Error, MSE = 19
 

Some Parameter values (can be manually set) 
------------------------------------------------------- 
size: 0.80  size of plotted points 
jitter_y: 0.00  random vertical movement of points 
jitter_x: 0.00  random horizontal movement of points 


In [0]:
%r

######  Test for Interaction (UNEQUAL Slopes)  #####
y<-data$diff
x<-data$anxiety
d1<-data$d_1
d2<-data$d_2
d3<-data$d_3
d4<-data$d_4

# These are the products used to test for interaction:
# the continuous explanatory variable (anxiety) times each group dummy variable
# anx1 is excluded because group 1 is the reference (control) group, the basis of interpretation
anx2<-x*d2
anx3<-x*d3
anx4<-x*d4

###### Full Model

full<-lm(y~x+d2+d3+d4+anx2+anx3+anx4)
print(anova(full))
print(summary(full))

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value      Pr(>F)    
x          1 4010.1  4010.1 61.4080 0.000004643 ***
d2         1    1.2     1.2  0.0178     0.89608    
d3         1  417.1   417.1  6.3871     0.02655 *  
d4         1    9.1     9.1  0.1396     0.71517    
anx2       1    9.4     9.4  0.1434     0.71156    
anx3       1   23.7    23.7  0.3622     0.55849    
anx4       1    0.8     0.8  0.0126     0.91238    
Residuals 12  783.6    65.3                        
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Call:
lm(formula = y ~ x + d2 + d3 + d4 + anx2 + anx3 + anx4)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.9630  -2.1954   0.2553   5.3676   9.0370 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -82.3704    44.8992  -1.835   0.0915 .
x             5.1852     2.4590   2.109   0.0567 .
d2          -33.1079    63.1266  -0.524   0.6095  
d3          -45.2660    60.4521  -0.749

<b>Full Model Results:</b>
$$ \hat{Y} = -82.37 +  5.19x - 33.11d2 - 45.27d3 - 5.25d4 + 1.77anx2 + 2.09anx3 + 0.35anx4 $$

Because each dummy variable is either 1 or 0, the slope of x within each group can be interpreted as follows (group 1 is the reference group): <br>
* group 1 (d1) slope = 5.19
* group 2 (d2) slope = 5.19 + 1.77 = 6.96
* group 3 (d3) slope = 5.19 + 2.09 = 7.28
* group 4 (d4) slope = 5.19 + 0.35 = 5.54
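The same reasoning applies to the intercepts, so the full model implies a separate regression line for each group. A minimal sketch, using the rounded coefficients from the fitted full-model equation above (the object names are introduced here for illustration):

```r
# Rounded coefficient estimates copied from the full-model equation above
b0    <- -82.37                        # intercept (reference group 1)
b_x   <- 5.19                          # slope of x in reference group 1
b_d   <- c(0, -33.11, -45.27, -5.25)   # intercept shifts: 0 for group 1, then d2, d3, d4
b_anx <- c(0, 1.77, 2.09, 0.35)        # slope shifts: 0 for group 1, then anx2, anx3, anx4

# One intercept and one slope per group
group_lines <- data.frame(
  group     = 1:4,
  intercept = b0 + b_d,
  slope     = b_x + b_anx
)
print(group_lines)
```

Each row gives the line for one group: the dummy coefficient shifts the intercept, and the interaction coefficient shifts the slope.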

<b>Hypothesis Testing:</b>
1. Regression sum of squares for the full model = 4010.1 + 1.2 + 417.1 + 9.1 + 9.4 + 23.7 + 0.8 = 4471.4, with 7 degrees of freedom (one per independent variable in the model)
2. Error sum of squares for the full model = 783.6, with 12 degrees of freedom (the Residuals row of the ANOVA output)

<b>Reduced Model:</b> 
Testing for interaction

In [0]:
%r
######  Reduced Model 
# the reduced model drops the interaction terms (anx2, anx3, anx4)
reduced<-lm(y~x+d2+d3+d4)
print(anova(reduced))
print(summary(reduced))

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value      Pr(>F)    
x          1 4010.1  4010.1 73.5824 0.000000361 ***
d2         1    1.2     1.2  0.0213      0.8858    
d3         1  417.1   417.1  7.6534      0.0144 *  
d4         1    9.1     9.1  0.1673      0.6883    
Residuals 15  817.5    54.5                        
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Call:
lm(formula = y ~ x + d2 + d3 + d4)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.7899  -2.7395   0.3697   5.0252   9.2101 

Coefficients:
            Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  -98.118     19.752  -4.967  0.000169 ***
x              6.050      1.070   5.655 0.0000458 ***
d2            -2.319      4.973  -0.466  0.647640    
d3           -11.429      5.919  -1.931  0.072630 .  
d4             2.101      5.136   0.409  0.688293    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.3

<b>Reduced Model Results:</b>
$$ \hat{Y} = -98.12 +  6.05x - 2.32d2 - 11.43d3 + 2.10d4 $$

Without the interaction terms, the slope relating x to y is the same (6.05) for all groups. The d2, d3, and d4 coefficients shift the intercept up or down for each group relative to group 1.

<b>Hypothesis Testing:</b>
1. Regression sum of squares for the reduced model = 4010.1 + 1.2 + 417.1 + 9.1 = 4437.5, with 4 degrees of freedom (one per independent variable in the model)
2. Error sum of squares for the reduced model = 817.5, with 15 degrees of freedom (the Residuals row of the ANOVA output)

<b>Test for Unequal Slopes: </b> Nested F-Test <br>
* The test isolates the sum of squares attributable to the unequal-slopes (interaction) terms
* F = ([SS(Reg-Full) - SS(Reg-Reduced)] / [df(Reg-Full) - df(Reg-Reduced)]) / (SS(Err-Full) / df(Err-Full))
  * Numerator = (4471.4 - 4437.5)/(7 - 4) = 33.9/3 = 11.3
  * Denominator = 783.6/12 = 65.3
  * F = 11.3/65.3 = 0.173, on 3 and 12 degrees of freedom
* F = 0.173 is well below 1, so the interaction terms explain less variation per degree of freedom than the residual error. We fail to reject the null hypothesis of equal slopes (parallelism) and opt for the reduced model.
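The arithmetic above can be checked with a short sketch that uses only the sums of squares and degrees of freedom printed in the two ANOVA tables. The variable names are introduced for this example. `pf()` and `qf()` are base R functions for the F distribution, and the 0.05 significance level is an assumption, since the notebook does not state one.

```r
# Error sums of squares and degrees of freedom copied from the ANOVA tables above
sse_full    <- 783.6; df_full    <- 12
sse_reduced <- 817.5; df_reduced <- 15

# Total SS is identical for both models, so the increase in error SS
# (817.5 - 783.6) equals the drop in regression SS (4471.4 - 4437.5) = 33.9
num_df <- df_reduced - df_full
f_stat <- ((sse_reduced - sse_full) / num_df) / (sse_full / df_full)

# Upper-tail p-value and the critical value at an assumed 0.05 level
p_value <- pf(f_stat, num_df, df_full, lower.tail = FALSE)
f_crit  <- qf(0.95, num_df, df_full)

cat("F =", round(f_stat, 3),
    " critical F =", round(f_crit, 2),
    " p =", round(p_value, 3), "\n")
```

With the `reduced` and `full` model objects from the cells above in memory, `anova(reduced, full)` should report the same comparison directly.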