# Introduction to R - Part 3
In this final part of Introduction to R we will look at in-depth data analysis and reporting techniques. We will cover:
1. Performing statistical analysis on data frames
2. Plotting data
3. Reporting findings

## 1. Performing statistical analysis on data frames
R has functions for most statistical measures. We'll illustrate some of these using the NAFLD data set.

In [37]:
nafldDataset <- read.csv("nafld_dataset2.csv")

# Let's compute the mean age in the patient group and the corresponding standard deviation
mean(nafldDataset$age)
sd(nafldDataset$age)

# Now let's compute the median and IQR of age
median(nafldDataset$age)
IQR(nafldDataset$age)
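
These functions return `NA` as soon as the input contains a missing value, which is why later cells pass `na.rm=TRUE`. The sketch below shows this on a small made-up vector; `ages` and its values are hypothetical and are not taken from the NAFLD data set. It also checks that `IQR()` is the distance between the 75th and 25th percentiles.

```r
# Hypothetical ages with one missing value (not from the NAFLD data set)
ages <- c(34, 51, 47, NA, 62, 58, 45)

mean(ages)                  # NA: missing values propagate by default
mean(ages, na.rm = TRUE)    # mean of the six observed values
sd(ages, na.rm = TRUE)
median(ages, na.rm = TRUE)

# IQR() is the difference between the 75th and 25th percentiles
q <- quantile(ages, probs = c(0.25, 0.75), na.rm = TRUE)
iqr_by_hand <- unname(q[2] - q[1])
iqr_builtin <- IQR(ages, na.rm = TRUE)
all.equal(iqr_by_hand, iqr_builtin)
```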

Let's now look at the mean of patients' heights and its 95% confidence interval. The 95% confidence interval of the mean can be estimated as: $$\overline{x} \pm t_{0.05, n-1}\frac{s}{\sqrt{n}}$$ where $n$ is the number of observations, $t_{0.05, n-1}$ is the two-tailed 5% t-value with $n-1$ degrees of freedom, $s$ is the standard deviation and $\overline{x}$ is the mean of the observations.

In [45]:
# The stats package provides qt() and the other t-distribution functions.
# It is attached by default, so this call is only here to make the dependency explicit.
library(stats)

m <- mean(nafldDataset$height, na.rm=TRUE)
s <- sd(nafldDataset$height, na.rm=TRUE)
n <- sum(!is.na(nafldDataset$height))

# Calculate 95% confidence intervals of the mean height
lower <- m - qt(1 - 0.05/2, n - 1) * s / sqrt(n)
upper <- m + qt(1 - 0.05/2, n - 1) * s / sqrt(n)
# cat() writes its output directly and returns NULL, so it should not be wrapped in print()
cat(c(lower, m, upper, "\n"))

1.62336012932764 1.64513513513514 1.66691014094263 


__Q__: Writing two extra lines every time to calculate confidence intervals is tedious. Write a function that takes the mean, the standard deviation and the number of observations and returns a vector with two elements: the lower and upper boundaries of the 95% confidence interval.

Performing t-tests in R is a simple task. While you can calculate the t-statistic of the test yourself and compare it to the critical t-value from a t-distribution table, R offers a function that does all of this. Let's start with a one-sample t-test to see whether the mean patient height differs from 1.7 m.

In [49]:
t.test(nafldDataset$height, mu=1.7)


	One Sample t-test

data:  nafldDataset$height
t = -5.0216, df = 73, p-value = 3.509e-06
alternative hypothesis: true mean is not equal to 1.7
95 percent confidence interval:
 1.62336 1.66691
sample estimates:
mean of x 
 1.645135 
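
To see what `t.test()` does here, the statistic and the p-value can be reproduced by hand. The following is a minimal sketch on made-up values: the vector `x` and the name `mu0` are hypothetical and are not part of the NAFLD data set.

```r
# Hypothetical heights in metres (illustrative values, not the NAFLD data)
x <- c(1.62, 1.70, 1.58, 1.75, 1.66, 1.69, 1.61, 1.73)
mu0 <- 1.7  # hypothesised mean

n <- length(x)

# t-statistic: distance of the sample mean from mu0 in standard-error units
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# Compare with the built-in test
result <- t.test(x, mu = mu0)
all.equal(unname(result$statistic), t_stat)
all.equal(result$p.value, p_value)
```

Note that the 95% confidence interval printed by `t.test()` above matches the interval computed manually from the formula earlier in this section.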


Next, compare the heights of patients with a fibrosis stage of 3 or higher with those of patients with a fibrosis stage lower than 3. By default (`var.equal=FALSE`), `t.test()` performs the unequal-variance, or Welch, two-sample test.

In [47]:
t.test(nafldDataset$height[nafldDataset$bx_fib>=3], nafldDataset$height[nafldDataset$bx_fib<3])


	Welch Two Sample t-test

data:  nafldDataset$height[nafldDataset$bx_fib >= 3] and nafldDataset$height[nafldDataset$bx_fib < 3]
t = -0.4461, df = 69.053, p-value = 0.6569
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.05149005  0.03267017
sample estimates:
mean of x mean of y 
 1.639286  1.648696 
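
The non-integer degrees of freedom in the output come from the Welch-Satterthwaite approximation. As a rough sketch, the statistic and degrees of freedom can be recomputed by hand; the vectors `a` and `b` below are hypothetical values chosen for illustration, not NAFLD data.

```r
# Hypothetical measurements for two groups (illustrative values only)
a <- c(1.60, 1.72, 1.65, 1.58, 1.69, 1.63)
b <- c(1.66, 1.71, 1.59, 1.74, 1.68, 1.62, 1.70)

# Squared standard errors of the two group means
va <- var(a) / length(a)
vb <- var(b) / length(b)

# Welch t-statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(a) - mean(b)) / sqrt(va + vb)
df_welch <- (va + vb)^2 / (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))

welch <- t.test(a, b)                     # default: var.equal = FALSE
pooled <- t.test(a, b, var.equal = TRUE)  # Student's test with pooled variance

all.equal(unname(welch$statistic), t_stat)
all.equal(unname(welch$parameter), df_welch)
pooled$parameter                          # length(a) + length(b) - 2
```

Setting `var.equal = TRUE` is only appropriate when the two groups can reasonably be assumed to share the same variance.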


Some of the patients underwent lifestyle change and weight-reduction surgery and had their serum ALT levels measured again 6 months after the surgery. We can use the paired t-test to assess whether the treatment made any change to patients' ALT levels.

In [51]:
preSurgeryAlt <- c(45, 141, 84, 22, 84, 54, 23, 32, 78, 26, 75, 68, 38, 73, 20, 158, 49, 44)
postSurgeryAlt <- c(49, 78, 66, 34, 64, 43, 28, 29, 42, 24, 70, 41, 34, 52, 27, 87, 36, 34)
t.test(preSurgeryAlt, postSurgeryAlt, paired=TRUE)


	Paired t-test

data:  preSurgeryAlt and postSurgeryAlt
t = 2.8846, df = 17, p-value = 0.01029
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  4.118537 26.548130
sample estimates:
mean of the differences 
               15.33333 
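
A paired t-test is equivalent to a one-sample t-test on the within-patient differences. The sketch below checks this using the same ALT values as above; the names `differences`, `paired` and `oneSample` are introduced here for illustration only.

```r
preSurgeryAlt <- c(45, 141, 84, 22, 84, 54, 23, 32, 78, 26, 75, 68, 38, 73, 20, 158, 49, 44)
postSurgeryAlt <- c(49, 78, 66, 34, 64, 43, 28, 29, 42, 24, 70, 41, 34, 52, 27, 87, 36, 34)

# Per-patient change in ALT (pre minus post)
differences <- preSurgeryAlt - postSurgeryAlt

paired <- t.test(preSurgeryAlt, postSurgeryAlt, paired = TRUE)
oneSample <- t.test(differences, mu = 0)

# Both calls give the same statistic, p-value and confidence interval
all.equal(unname(paired$statistic), unname(oneSample$statistic))
all.equal(paired$p.value, oneSample$p.value)
all.equal(as.numeric(paired$conf.int), as.numeric(oneSample$conf.int))
```

Because the test works on differences, both vectors must list the patients in the same order.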
