# Lecture 1.4: Data Analysis in R

## Hypothesis Tests  

In [None]:
# Suppose we want to compare two different treatments

treat0 = c(79.98, 80.04, 80.02, 80.04, 80.03, 80.03, 80.04, 79.97,
80.05, 80.03, 80.02, 80.00, 80.02)

treat1 = c(80.11, 79.96, 79.83, 79.95, 79.92, 80.07, 79.92, 79.99, 80.12)

boxplot(treat0, treat1)

In [None]:
# Perform a two-sample t-test (Welch's test by default, which does not assume equal variances)

t.test(treat0, treat1)
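
The printed output of `t.test` is a summary of an object of class `"htest"`, and its components can be extracted by name. A minimal sketch, repeating the data so it runs on its own (the name `res` is only illustrative):

```r
treat0 <- c(79.98, 80.04, 80.02, 80.04, 80.03, 80.03, 80.04, 79.97,
            80.05, 80.03, 80.02, 80.00, 80.02)
treat1 <- c(80.11, 79.96, 79.83, 79.95, 79.92, 80.07, 79.92, 79.99, 80.12)

res <- t.test(treat0, treat1)  # Welch test, since var.equal defaults to FALSE

res$statistic  # t statistic
res$parameter  # Welch-adjusted degrees of freedom
res$p.value    # two-sided p-value
res$conf.int   # 95% confidence interval for the difference in means
res$estimate   # the two sample means
```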

In [None]:
# If you believe the variances of the two groups are equal, use the pooled-variance test
t.test(treat0, treat1, var.equal = TRUE)
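
One way to examine the equal-variance assumption is an F test with `var.test`. This is a sketch rather than part of the lecture; the F test is sensitive to departures from normality, so it is best read alongside the boxplot.

```r
treat0 <- c(79.98, 80.04, 80.02, 80.04, 80.03, 80.03, 80.04, 79.97,
            80.05, 80.03, 80.02, 80.00, 80.02)
treat1 <- c(80.11, 79.96, 79.83, 79.95, 79.92, 80.07, 79.92, 79.99, 80.12)

vt <- var.test(treat0, treat1)  # H0: the two population variances are equal

vt$estimate  # ratio of the sample variances, var(treat0) / var(treat1)
vt$p.value   # a small value suggests the variances differ
```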

In [None]:
# There are also other tests you could use,
# for example the Wilcoxon rank-sum test

wilcox.test(treat0, treat1)

In [None]:
# Or the Kolmogorov-Smirnov test for comparing distributions

ks.test(treat0, treat1)

In [None]:
# Another example using the sleep data 
data(sleep)
?sleep

In [None]:
sleep

In [None]:
summary(sleep)

In [None]:
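# Paired t-test; the formula interface rejects paired = TRUE in R >= 4.4.0, so pass the two groups directly
with(sleep, t.test(extra[group == 1], extra[group == 2], paired = TRUE))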
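
A paired t-test is the same as a one-sample t-test on the within-subject differences. The sketch below checks this on the `sleep` data, relying on the rows being ordered by subject `ID` within each group; the names `g1`, `g2`, `paired` and `one_sample` are illustrative.

```r
data(sleep)

g1 <- sleep$extra[sleep$group == 1]  # extra sleep under drug 1, subjects 1..10
g2 <- sleep$extra[sleep$group == 2]  # extra sleep under drug 2, same subjects

paired     <- t.test(g1, g2, paired = TRUE)
one_sample <- t.test(g1 - g2, mu = 0)  # test whether the mean difference is zero

# Both comparisons should return TRUE
all.equal(unname(paired$statistic), unname(one_sample$statistic))
all.equal(paired$p.value, one_sample$p.value)
```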

## Linear Regression

We will use the 'trees' data to demonstrate how to build linear regression models in R.

The data contains measurements of the girth, height and volume of timber in 31 felled black cherry trees.

In [None]:
trees

In [None]:
?trees

In [None]:
summary(trees)

In [None]:
pairs(trees, panel = panel.smooth, main = "trees data")

In [None]:
fit1 = lm(Volume ~ Girth, data = trees)
summary(fit1)

In [None]:
names(fit1)

In [None]:
coef(fit1)

In [None]:
confint(fit1)

In [None]:
# Predict the average volume, with a confidence interval for the mean response
predict(fit1, data.frame(Girth = c(11, 13)), interval = "confidence")

In [None]:
# Predict for individual trees, with a prediction interval for a new observation
predict(fit1, data.frame(Girth = c(11, 13)), interval = "prediction")
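
The two interval types share the same point estimate but differ in width: a prediction interval also accounts for the noise of an individual tree, so it is wider than the confidence interval for the mean. A small sketch that compares them (the names `fit`, `new_girth`, `conf_int` and `pred_int` are illustrative):

```r
fit <- lm(Volume ~ Girth, data = trees)
new_girth <- data.frame(Girth = c(11, 13))

conf_int <- predict(fit, new_girth, interval = "confidence")
pred_int <- predict(fit, new_girth, interval = "prediction")

# Same fitted values in both cases
all.equal(conf_int[, "fit"], pred_int[, "fit"])

# Interval widths: the prediction interval is the wider one
conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
cbind(conf_width, pred_width)
```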

In [None]:
# Add another predictor to the model
fit2 = update(fit1, . ~ . + Height, data = trees)
summary(fit2)

In [None]:
# Add an interaction term to the model
fit3 = update(fit2, . ~ . + Girth * Height, data = trees)
summary(fit3)
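
In an R formula, `Girth * Height` is shorthand for both main effects plus their interaction, `Girth + Height + Girth:Height`. The sketch below fits the model both ways and confirms that the coefficients agree (the names `fit_star` and `fit_colon` are illustrative):

```r
fit_star  <- lm(Volume ~ Girth * Height, data = trees)
fit_colon <- lm(Volume ~ Girth + Height + Girth:Height, data = trees)

names(coef(fit_star))
# "(Intercept)" "Girth" "Height" "Girth:Height"

all.equal(coef(fit_star), coef(fit_colon))  # TRUE
```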

In [None]:
par(mfrow = c(2, 2))
plot(fit3)

In [None]:
hist(fit3$residuals)

In [None]:
qqnorm(fit3$residuals)

In [None]:
shapiro.test(fit3$residuals)

In [None]:
# We can also obtain the residuals using:

residuals(fit3)

In [None]:
# Obtain the fitted values, i.e. y-hat

predict(fit3)
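
With no new data supplied, `predict` returns the same values as `fitted`, and the residuals are the observed responses minus those fitted values. A quick check, refitting the interaction model under the illustrative name `fit`:

```r
fit <- lm(Volume ~ Girth * Height, data = trees)

# predict() without newdata gives the fitted values
all.equal(fitted(fit), predict(fit))

# residual = observed - fitted
all.equal(residuals(fit), trees$Volume - fitted(fit))

# fit$residuals holds the same numbers as residuals(fit)
all.equal(unname(fit$residuals), unname(residuals(fit)))
```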

In [None]:
# To check for high leverage points

plot(hatvalues(fit3))

In [None]:
# To find which point has the highest leverage
which.max(hatvalues(fit3))
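
The hat values sum to the number of coefficients $p$, so the average leverage is $p/n$. A common rule of thumb, which is an assumption here and not something stated in the lecture, flags points with leverage above $2p/n$. A sketch:

```r
fit <- lm(Volume ~ Girth * Height, data = trees)

h <- hatvalues(fit)
p <- length(coef(fit))  # number of estimated coefficients, including the intercept
n <- nrow(trees)

sum(h)                  # equals p, up to rounding error

threshold <- 2 * p / n  # rule-of-thumb cutoff, not a formal test
which(h > threshold)    # observations worth a closer look
```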

In [None]:
# We can also transform predictors if needed, for example by adding a quadratic term
# (I() makes ^ mean arithmetic power rather than a formula operator)
fit4 = lm(Volume ~ Girth + I(Girth ^2), data = trees)
summary(fit4)

* Let's look at another example. The 'Boston' dataset contains data on housing values in the suburbs of Boston.

In [None]:
library(MASS)
Boston

In [None]:
?Boston

In [None]:
summary(Boston)

In [None]:
names(Boston)

In [None]:
fit1 = lm(medv ~ ., data = Boston)
summary(fit1)

In [None]:
par(mfrow = c(2, 2))
plot(fit1)

In [None]:
# Checking for multicollinearity
library(car)
vif(fit1)
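
For a numeric predictor, the variance inflation factor is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on all the other predictors. The sketch below computes it by hand for `tax` without using `car`; the names `aux`, `r2` and `vif_tax` are illustrative.

```r
library(MASS)

# Regress tax on every other predictor (the response medv is excluded)
aux <- lm(tax ~ . - medv, data = Boston)

r2 <- summary(aux)$r.squared
vif_tax <- 1 / (1 - r2)  # large values indicate that tax is well explained by the others
vif_tax
```

The result should match the `tax` entry returned by `vif` on the full model.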

In [None]:
# Stepwise selection by AIC (backward elimination, since no scope is given)
fit2 = step(fit1)

In [None]:
summary(fit2)
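
To see what `step` changed, it can help to list the terms it dropped and compare the AIC of the two models. A sketch, using the illustrative names `full` and `reduced` and `trace = 0` to suppress the step-by-step log:

```r
library(MASS)

full    <- lm(medv ~ ., data = Boston)
reduced <- step(full, trace = 0)  # backward elimination by AIC

# Terms present in the full model but removed by step()
setdiff(names(coef(full)), names(coef(reduced)))

# The selected model should have an AIC no larger than the full model
AIC(full)
AIC(reduced)
```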

**Exercise**: Perform model diagnostics on the model above.

## ANOVA

In [None]:
# We can use ANOVA for model comparison, for example
anova(fit2, fit1)
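
`anova` applied to two nested linear models carries out a partial F test of the extra terms. When the larger model adds a single term, the F statistic equals the square of that term's t statistic. The sketch below checks this on the `trees` data (the names `small`, `big` and `cmp` are illustrative):

```r
small <- lm(Volume ~ Girth, data = trees)
big   <- lm(Volume ~ Girth + Height, data = trees)

cmp <- anova(small, big)  # H0: the coefficient of Height is zero
cmp

f_stat <- cmp$F[2]
t_stat <- summary(big)$coefficients["Height", "t value"]

all.equal(f_stat, t_stat^2)  # TRUE
```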

## Logistic Regression

In [None]:
# We will use the 'Smarket' data from the "ISLR" library
install.packages("ISLR")  # only needs to be run once
library(ISLR)
names(Smarket)

In [None]:
dim(Smarket)

In [None]:
summary(Smarket)

In [None]:
# Fit a logistic regression model to the data
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial)
summary(glm.fit)

In [None]:
coef(glm.fit)
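
The coefficients of a logistic regression are on the log-odds scale. `predict` with `type = "response"` converts them to probabilities, which can then be thresholded into predicted classes. To stay runnable without extra packages, the sketch below uses the built-in `mtcars` data as a stand-in for `Smarket`, and the 0.5 cutoff is an assumption rather than a recommendation.

```r
# Stand-in example: probability that a car has a manual gearbox (am = 1) given its weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)

log_odds <- predict(fit)                     # linear predictor (log-odds)
probs    <- predict(fit, type = "response")  # fitted probabilities

all.equal(probs, plogis(log_odds))           # the inverse logit links the two scales

pred_class <- ifelse(probs > 0.5, 1, 0)      # classify with a 0.5 cutoff
table(predicted = pred_class, actual = mtcars$am)  # confusion matrix
mean(pred_class == mtcars$am)                # training accuracy
```

The same calls apply to the `Smarket` model, where the response levels are `Down` and `Up` instead of 0 and 1.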

## Simulation  

We simulated data from some known distributions in the previous lectures. For example, to simulate 20 values from a $N(2, 25)$ distribution (mean 2, variance 25; note that `rnorm` takes the standard deviation, 5, rather than the variance):

In [None]:
rnorm(20, 2, 5)

We can set the seed so we obtain the same set of data every time.

In [None]:
set.seed(1)
rnorm(20, 2, 5)

In [None]:
set.seed(1)
rnorm(20, 2, 5)
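
The two cells above print the same numbers, which can be confirmed programmatically. The sketch also draws a larger sample to check that the third argument of `rnorm` is the standard deviation, so the sample variance should be close to 25 (the names `a`, `b` and `big` are illustrative):

```r
set.seed(1)
a <- rnorm(20, mean = 2, sd = 5)

set.seed(1)
b <- rnorm(20, mean = 2, sd = 5)

identical(a, b)  # TRUE: the same seed gives the same draws

set.seed(1)
big <- rnorm(1e5, mean = 2, sd = 5)
mean(big)  # close to 2
var(big)   # close to 25, i.e. sd^2
```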

**Exercise**: Assume that we have the following model:

$$ y = 2.3 + 5x + \varepsilon $$
$$ \varepsilon \sim N(0, 1.2)$$

Write some R code to simulate 100 data points from this model, and plot the simulated data.

**Exercise**: We will work with the 'Cars93' data to predict the car price.

Note, you will need to remove 'Min.Price' and 'Max.Price' before fitting a model to the data.  

Perform all necessary steps to obtain the best model.

In [None]:
data(Cars93, package = "MASS")