## Regression for Hours Spent on Labs Survey

In [2]:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(broom))

This analysis will investigate the following question:

*How does the number of times a Master of Data Science (MDS) student goes to office hours affect the average number of hours they spend working on labs per week?*

based on the results collected from this [survey](https://ubc-mds.slack.com/archives/C24HU8X0W/p1554330074049900). The raw data for this study is stored in the following [repo](https://github.ubc.ca/bettybhz/Hours_Spent_On_Labs_Survey_Data).

#### Data Preparation

In [5]:
# Load survey data
rawdata = suppressMessages(read_csv("Hours_Spent_On_Labs_Survey_Final.csv"))

# Data cleaning
raw_df = data.frame(rawdata)
row_to_remove = c(1:2)
col_to_remove = c(1:19, 21, 25, 30, 31)
df = raw_df[-row_to_remove, -col_to_remove]
names(df) <- c("attend_OH", "lab_hours", "group", "academic", "yrs_out_school", "program", "stat", "optional")
df <- df %>%
  mutate(
    optional = as.integer(optional), stat = as.integer(stat), program = as.integer(program),
    lab_hours = as.integer(lab_hours), attend_OH = as.integer(attend_OH),
    yrs_out_school = as.integer(yrs_out_school),
    group = as.factor(group), academic = as.factor(academic)
  ) %>%
  select(lab_hours, everything())
# Bin years out of school into [0, 3), [3, 6), [6, 16); use the full column name rather than partial matching
df$yrs_group <- cut(df$yrs_out_school, breaks = c(0, 3, 6, 16), right = FALSE, labels = c("0-2", "3-5", "6+"))
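
To see which values land in which bin, here is a small standalone check of `cut()` with the same breaks and `right = FALSE`. The vector `yrs_example` is made up for illustration and is not survey data; the default interval labels are kept so the bin boundaries are visible.

```r
# Hypothetical years-out-of-school values (illustration only, not survey data)
yrs_example <- c(0, 2, 3, 5, 6, 15)

# Same breaks as in the cleaning step; right = FALSE makes each bin [lower, upper)
bins <- cut(yrs_example, breaks = c(0, 3, 6, 16), right = FALSE)

# Pair each value with the bin it falls into
data.frame(yrs = yrs_example, bin = bins)
```

With these breaks, 3 and 5 both fall in `[3,6)`, and a value of 16 or more would become `NA`.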

#### Baseline Model

In [6]:
base <- glm(lab_hours ~ attend_OH , data = df, family = gaussian(link = "log"))
tidy(base)

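Table 1: Coefficient estimates for the baseline model.

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.08684341 | 0.10185073 | 30.307525 | 3.75847e-35 |
| attend_OH | 0.09727971 | 0.04064051 | 2.393664 | 0.02025323 |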


#### Models with Potential Confounding Variables

- optional
- stat
- program

> **a. model with optional**

In [7]:
mod1 <- glm(lab_hours ~ attend_OH + optional, data = df, family = gaussian(link = "log"))
tidy(mod1)

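Table 2: Coefficient estimates for `mod1`.

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.20967986 | 0.18795945 | 17.076449 | 5.683253e-23 |
| attend_OH | 0.09585089 | 0.04096773 | 2.339668 | 0.02317576 |
| optional | -0.05053274 | 0.06356079 | -0.79503 | 0.43021 |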


In [13]:
tidy(anova(mod1, test= "F"))

“The following column names in ANOVA output were not recognized or transformed: Deviance, Resid..Df, Resid..Dev”

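Table 3: ANOVA F-test for `mod1`.

| term | df | Deviance | Resid..Df | Resid..Dev | statistic | p.value |
|---|---|---|---|---|---|---|
|  |  |  | 54 | 8592.109 |  |  |
| attend_OH | 1.0 | 818.88788 | 53 | 7773.221 | 5.5483027 | 0.02230556 |
| optional | 1.0 | 98.40167 | 52 | 7674.82 | 0.6667119 | 0.41792538 |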


**Observation:**

According to Table 2, the coefficient for `attend_OH` is 0.096, which lies within the 95% confidence interval (0.0176, 0.177) of the baseline estimate, so adjusting for `optional` does not materially change the estimated effect. According to Table 3, the ANOVA F-test (p = 0.418) also shows that adding `optional` does not significantly improve the model. We therefore find no evidence that `optional` is a confounding variable, and we do not include it in the final model.
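
The interval quoted above can be reproduced from the baseline estimate and standard error of `attend_OH`. This is a sketch assuming a normal-approximation (Wald) interval, which matches the quoted bounds; calling `confint()` on the fitted `glm` object may give slightly different numbers. Variable names are illustrative.

```r
# Baseline estimate and standard error for attend_OH (copied from the baseline output)
est <- 0.09727971
se  <- 0.04064051

# 95% Wald interval: estimate +/- z * standard error
ci <- est + c(-1, 1) * qnorm(0.975) * se
round(ci, 4)

# attend_OH coefficients after adjusting for each candidate variable
# (copied from the model outputs in this section)
adjusted <- c(optional = 0.09585089, stat = 0.11071148, program = 0.08740459)

# TRUE where the adjusted coefficient stays inside the baseline interval
adjusted > ci[1] & adjusted < ci[2]
```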

> **b. model with stat**

In [10]:
mod2 <- glm(lab_hours ~ attend_OH + stat, data = df, family = gaussian(link = "log"))
tidy(mod2)

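Table 4: Coefficient estimates for `mod2`.

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2.8416988 | 0.25095069 | 11.323734 | 1.194923e-15 |
| attend_OH | 0.11071148 | 0.0427523 | 2.589603 | 0.01243474 |
| stat | 0.07627068 | 0.06779053 | 1.125093 | 0.2657159 |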


In [11]:
tidy(anova(mod2, test= "F"))

“The following column names in ANOVA output were not recognized or transformed: Deviance, Resid..Df, Resid..Dev”

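Table 5: ANOVA F-test for `mod2`.

| term | df | Deviance | Resid..Df | Resid..Dev | statistic | p.value |
|---|---|---|---|---|---|---|
|  |  |  | 54 | 8592.109 |  |  |
| attend_OH | 1.0 | 818.8879 | 53 | 7773.221 | 5.61848 | 0.02151522 |
| stat | 1.0 | 194.3135 | 52 | 7578.908 | 1.333207 | 0.25351428 |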


**Observation:**

According to Table 4, the coefficient for `attend_OH` is 0.111, which lies within the 95% confidence interval (0.0176, 0.177) of the baseline estimate, so adjusting for `stat` does not materially change the estimated effect. According to Table 5, the ANOVA F-test (p = 0.254) also shows that adding `stat` does not significantly improve the model. We therefore find no evidence that `stat` is a confounding variable, and we do not include it in the final model.
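
The F statistic reported for `stat` can be recovered from the deviance columns of the ANOVA output for `mod2`. The sketch below assumes the dispersion is estimated from the larger model as residual deviance divided by residual degrees of freedom, which reproduces the reported values up to rounding; variable names are illustrative.

```r
# Values copied from the ANOVA output for mod2
dev_drop  <- 194.3135   # reduction in deviance from adding stat
df_term   <- 1          # degrees of freedom used by stat
resid_dev <- 7578.908   # residual deviance of the model that includes stat
resid_df  <- 52         # residual degrees of freedom of that model

# Dispersion estimate and F statistic
dispersion <- resid_dev / resid_df
f_stat <- (dev_drop / df_term) / dispersion

# Upper-tail p-value from the F distribution
p_value <- pf(f_stat, df_term, resid_df, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```

The same calculation with the `optional` or `program` rows gives the statistics shown in their ANOVA outputs.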

> **c. model with program**

In [14]:
mod3 <- glm(lab_hours ~ attend_OH + program, data = df, family = gaussian(link = "log"))
tidy(mod3)

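Table 6: Coefficient estimates for `mod3`.

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.3652191 | 0.21324337 | 15.78112 | 1.790939e-21 |
| attend_OH | 0.08740459 | 0.04070128 | 2.147466 | 0.03643819 |
| program | -0.09015294 | 0.06261829 | -1.439722 | 0.1559394 |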


In [15]:
tidy(anova(mod3, test= "F"))

“The following column names in ANOVA output were not recognized or transformed: Deviance, Resid..Df, Resid..Dev”

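Table 7: ANOVA F-test for `mod3`.

| term | df | Deviance | Resid..Df | Resid..Dev | statistic | p.value |
|---|---|---|---|---|---|---|
|  |  |  | 54 | 8592.109 |  |  |
| attend_OH | 1.0 | 818.8879 | 53 | 7773.221 | 5.696671 | 0.02066977 |
| program | 1.0 | 298.4055 | 52 | 7474.816 | 2.075886 | 0.15563769 |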


**Observation:**

According to Table 6, the coefficient for `attend_OH` is 0.087, which lies within the 95% confidence interval (0.0176, 0.177) of the baseline estimate, so adjusting for `program` does not materially change the estimated effect. According to Table 7, the ANOVA F-test (p = 0.156) also shows that adding `program` does not significantly improve the model. We therefore find no evidence that `program` is a confounding variable, and we do not include it in the final model.

### Conclusion

Based on our exploratory data analysis (EDA), we decided to focus our empirical investigation on three potential confounding variables: `optional`, `stat`, and `program`. However, after fitting the regressions and running F-tests on these variables, we found no evidence that they are confounders, and adding them does not improve our model.

Therefore, our final model is our baseline model $E(Y) = \exp(\beta_0 + \beta_{\text{attend_OH}}X_{\text{attend_OH}})$.
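
Under this log-link model, a one-unit increase in the predictor multiplies the expected response by a constant factor. A short derivation, writing $x$ for the number of office-hours visits and $\beta_1$ as shorthand (introduced here) for the `attend_OH` coefficient:

$$
\frac{E(Y \mid x + 1)}{E(Y \mid x)} = \frac{\exp(\beta_0 + \beta_1 (x + 1))}{\exp(\beta_0 + \beta_1 x)} = \exp(\beta_1)
$$

So $\exp(\beta_1)$ is the ratio of expected weekly lab hours between students who differ by one visit.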

The p-value for `attend_OH` in the baseline model is 0.02, so we reject the null hypothesis at the 0.05 significance level. We therefore conclude that the number of times an MDS student attends office hours is significantly associated with the average number of hours spent working on labs per week. The coefficient for `attend_OH` is 0.097, and because the model uses a log link, exp(0.097) ≈ 1.10 is the multiplicative effect of one additional visit. That is, on average, each additional office-hours visit is expected to multiply the hours spent on labs per week by about 1.10, an increase of roughly 10%.
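
The multiplicative effect can be computed directly from the baseline coefficients. This is a sketch using the estimates copied from the baseline output; the exponentiated Wald interval is an assumption about how to express uncertainty on the ratio scale, and variable names are illustrative.

```r
# Baseline coefficients (copied from the baseline model output)
b0  <- 3.08684341   # intercept
b1  <- 0.09727971   # attend_OH
se1 <- 0.04064051   # standard error of attend_OH

# Multiplicative change in expected lab hours per additional visit
ratio <- exp(b1)

# Approximate 95% interval for that ratio (Wald interval, then exponentiated)
ratio_ci <- exp(b1 + c(-1, 1) * qnorm(0.975) * se1)

# Expected weekly lab hours for 0 to 3 visits under the fitted model
visits <- 0:3
expected_hours <- exp(b0 + b1 * visits)

round(ratio, 3)
round(ratio_ci, 3)
round(expected_hours, 1)
```

Each step from one visit count to the next multiplies the expected hours by the same `ratio`, which is what the log link implies.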
