# Final Exam: Notes and Computations

## 9. Correlation Matrices

A correlation matrix shows the pairwise (bivariate) correlations among several variables at once.

In R, execute `cor` on the entire data frame:

```
cor(mtcars, use="complete.obs")
round(cor(mtcars, use="complete.obs"),2)
```

To interpret the matrix, find the cell where a row and a column intersect: it holds the correlation between those two variables. The diagonal is always 1 (each variable with itself), and the matrix is symmetric.

```
# correlation
cor(mtcars$mpg, mtcars$hp)
# p-value: statistical significance
cor.test(mtcars$mpg, mtcars$hp)

# correlation matrix
cor(mtcars, use="complete.obs")
round(cor(mtcars, use="complete.obs"),2)

# look at specific variables (subsetting)
mtcars2 <- with(mtcars, data.frame(hp, wt, mpg, am, gear))
cor(mtcars2)

# fancier: correlation matrix with p-value matrix
library(Hmisc)
rcorr(as.matrix(mtcars))
rcorr(as.matrix(mtcars2))

# plot
library(corrplot)
mtcars_corr <- round(cor(mtcars, use="complete.obs"),2)
corrplot(mtcars_corr, method="number")
corrplot(mtcars_corr, method="circle")
```
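To make the row/column reading concrete, here is a minimal Python sketch (separate from the R workflow above). The three variables reuse `mtcars` column names, but the values are made-up toy numbers, and `pearson` and `corr_matrix` are hypothetical helper names introduced only for this illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Toy data (assumed for illustration only, not taken from mtcars)
data = {
    "mpg": [21.0, 22.8, 18.7, 14.3, 24.4],
    "hp":  [110, 93, 175, 245, 62],
    "wt":  [2.6, 2.3, 3.4, 3.6, 3.2],
}

def corr_matrix(columns):
    """Nested dict: matrix[row][col] is the correlation of row with col."""
    return {r: {c: pearson(columns[r], columns[c]) for c in columns}
            for r in columns}

matrix = corr_matrix(data)

# Read the cell at the intersection of row "mpg" and column "hp"
print(round(matrix["mpg"]["hp"], 2))
```

Looking up `matrix["hp"]["mpg"]` instead returns the same value, which is the symmetry mentioned above.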

## (??) Estimation and Interpretation (multiple regression)

`typeprof` is a dummy variable with value 0 or 1; its coefficient tells us the difference in the dependent variable when prof = 1 compared to when prof = 0

it is the difference between the professional occupations and the omitted category, which is blue-collar; it is not the difference between professional and white-collar

we can say professional occupations have a level of `prestige` 16.66 higher than blue-collar (comparing to the referent category)

when doing multiple regression, it is not just the difference between these two groups (professional and blue-collar); it is the difference between these two groups controlling for the other independent variables in the model

say (in reference to the coefficient for `typeprof`):
"the predicted difference in prestige between professional and blue-collar (the omitted category) is 16.66, controlling for education and income"

for dummy variables, always keep in mind the referent category
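A minimal Python sketch of what "controlling for education and income" means for the dummy coefficient. Only the 16.66 `typeprof` coefficient comes from the notes; the intercept, the education and income coefficients, and the input values are hypothetical placeholders, and the white-collar dummy is left out for brevity.

```python
# Coefficient from the notes
B_TYPEPROF = 16.66

# Hypothetical placeholder values (NOT from the notes), for illustration only
INTERCEPT = 10.0
B_EDUCATION = 1.0
B_INCOME = 0.001

def predicted_prestige(education, income, prof):
    """Prediction with prof = 1 for professional, 0 for blue-collar (referent)."""
    return (INTERCEPT + B_EDUCATION * education
            + B_INCOME * income + B_TYPEPROF * prof)

# Same education and income; only the dummy changes
pred_prof = predicted_prestige(education=12, income=8000, prof=1)
pred_blue = predicted_prestige(education=12, income=8000, prof=0)

# The gap equals the typeprof coefficient, whatever values are held fixed
print(round(pred_prof - pred_blue, 2))  # 16.66
```

Changing `education` or `income` (to the same value in both calls) leaves the gap unchanged, which is what "controlling for" expresses in a model without interaction terms.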

## 10. Multivariate Regression

In [1]:
a = -19.672
b1 = -0.000629
b2 = 1.7399
b3 = 0.40994
b4 = 2.0357
b5 = -0.0344

def haty(x1, x2, x3, x4, x5):
    return a + (b1*x1) + (b2*x2) + (b3*x3) + (b4*x4) + (b5*x5)

In [2]:
outlets = 1739
cars = 9.27
income = 85.4
age = 3.5
bosses = 9.0

In [3]:
haty(outlets, cars, income, age, bosses)

37.187268

$$F = \frac{R^{2}/k}{(1 - R^{2})/[n - (k + 1)]}$$
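
As a worked check of the formula, substituting the values used in the cell further down ($R^{2} = 0.994$, $k = 5$, $n = 10$) gives:

$$F = \frac{0.994/5}{(1 - 0.994)/[10 - (5 + 1)]} = \frac{0.1988}{0.006/4} = \frac{0.1988}{0.0015} \approx 132.53$$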

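In [6]:
def fscore(r2, k, n):
    # F = (R^2 / k) / ((1 - R^2) / (n - (k + 1)))
    # the denominator divides by the residual degrees of freedom
    return (r2/k) / ((1 - r2)/(n - (k + 1)))

In [7]:
round(fscore(0.994, 5, 10), 2)

132.53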

## 11. Interaction Term

### part A: interpretation

in multiple regression, a categorical variable is replaced by dummy variables, one per group, with one group left out (the referent category)

when the variable is binary (e.g., smokers and non-smokers), there is just one dummy and so just one interaction term: the one for the "yes" response

**Interpretation (base term)**:
> the effect of [variable] when [categorical variable] category 1 = 0 (and category 2 = 0, ...),
> controlling for other variables

or
> the effect of [variable] for the referent category, controlling for other variables

"for the referent category, every one-unit increase in [variable] is associated with a [coefficient]-unit increase/decrease in the response variable... is/is not statistically significant... with a [p-value] and a [range] confidence interval"

**Interpreting the type variables**:
- these are the dummy variables for the categorical variable
- each coefficient is the intercept difference between a group and the omitted group (within the categorical variable)
- it applies when the other variables = 0

"the difference between [group] and [omitted group] when other variables = 0 is: [group] has on average a [coefficient] higher/lower level of [dependent variable] than [omitted group]"

**Interaction Terms** (4m44s)
- the difference in the slope
- given `varB:varA`
- where `varB` is a categorical variable, and `varA` is a quantitative variable
- the coefficient expresses the difference in slope between the shown group of `varB` and the referent category
- the total slope for a group is the `varA` coefficient plus that group's interaction coefficient (which may be negative)

the interaction is roughly like estimating a separate model for each group: it allows the slope to differ by group (only one group's dummy equals 1 at any given time; all others equal 0)
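
The per-group slopes can be sketched in Python. All coefficient values below are hypothetical placeholders chosen for illustration, and the names (`B_INCOME`, `INTERACTION`, `income_slope`) are assumptions; blue-collar (`"bc"`) is taken as the referent category.

```python
# Hypothetical coefficients (NOT from any fitted model in these notes)
B_INCOME = 0.003                     # base term: slope for the referent group
INTERACTION = {                      # varB:varA terms, one per non-referent group
    "prof": -0.002,
    "wc": -0.001,
}

def income_slope(group):
    """Total income slope for a group: base slope plus its interaction term.

    The referent group ("bc") has all dummies equal to 0, so no interaction
    term applies and its slope is the base coefficient alone.
    """
    return B_INCOME + INTERACTION.get(group, 0.0)

for g in ("bc", "prof", "wc"):
    print(g, round(income_slope(g), 4))
```

Each group picks up at most one interaction coefficient, which is why the model behaves like a set of separate slopes that share the remaining terms.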

the coefficients are/are not statistically significant: look at the p-value

"the effect of income for the different types is statistically significant ... "

"there is no statistically significant difference in the effect of income for the different types (professions)"


### part B: computations

Using:

- $y$: faculty salary
- $x_{1}$: total enrollment (enart)
- $x_{2}$: % of students receiving federal aid (pfedaid)
- $x_{3}$: total revenues in millions of dollars (tot_rev_millions)
- $x_{4}$: whether an institution is a land-grant college or university (landgrant; 1 = yes, 0 = no)

The multiple regression formula is:

$\hat{y} = 70507.52921 + 0.43514x_{1} - 138.86349x_{2} + 10.31621x_{3} + 4519.55901x_{4} - 4.41862x_{3}x_{4}$

In [2]:
a = 70507.52921
b1 = 0.43514
b2 = -138.86349
b3 = 10.31621
b4 = 4519.55901
b34 = -4.41862

In [3]:
def haty(x1, x2, x3, x4):
    return a + (b1*x1) + (b2*x2) + (b3*x3) + (b4*x4) + (b34*x3*x4)

```
Total Enrollment = 20,000
% Receiving Federal Financial Aid = 25
Total Revenues = $30 Million
Landgrant = 1
```

In [4]:
x1 = 20000
x2 = 25
x3 = 30
x4 = 1

In [5]:
haty(x1, x2, x3, x4)

80435.22867

```
Total Enrollment = 20,000
% Receiving Federal Financial Aid = 25
Total Revenues = $30 Million
Landgrant = 0
```

In [6]:
x1 = 20000
x2 = 25
x3 = 30
x4 = 0

In [7]:
haty(x1, x2, x3, x4)

76048.22826
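
The gap between the two predictions comes entirely from the `landgrant` terms. A short Python sketch using the coefficients above (the function names `revenue_slope` and `gap` are introduced here for illustration only):

```python
b3 = 10.31621     # total revenues (millions)
b4 = 4519.55901   # landgrant dummy
b34 = -4.41862    # revenues x landgrant interaction

def revenue_slope(landgrant):
    """Slope on total revenues for landgrant = 0 or 1."""
    return b3 + b34 * landgrant

def gap(revenue_millions):
    """Predicted salary difference, landgrant minus non-landgrant,
    holding enrollment, federal aid, and revenues fixed."""
    return b4 + b34 * revenue_millions

print(round(revenue_slope(0), 5))  # 10.31621
print(round(revenue_slope(1), 5))  # 5.89759
print(round(gap(30), 5))           # 4387.00041
```

The value 4387.00041 matches the difference between the two predictions above (80435.22867 - 76048.22826). Because of the interaction, this gap shrinks as total revenues grow.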

## State Life Expectancy

In [1]:
v = -2.664e-01

In [2]:
v

-0.2664