What is a dataframe?

A dataframe is a collection of values arranged as a table: the rows represent cases (or observations) and the columns represent variables. Dataframes are the most common form of raw data input you will use in R.
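As a minimal sketch of the rows-and-columns idea, the cell below builds a tiny dataframe by hand and inspects it. The name `birds` and all of its values are made up for illustration; they do not come from weaver.csv.

```r
# Hypothetical toy data, not taken from weaver.csv
birds <- data.frame(
  id        = 1:4,
  treatment = c("Eggs_left", "Eggs_removed", "Eggs_left", "Eggs_removed"),
  mass      = c(38.2, 41.5, 39.9, 44.1)
)

str(birds)    # one line per column (variable), with its type
nrow(birds)   # number of rows (observations)
ncol(birds)   # number of columns (variables)
birds$mass    # extract a single column as a vector
```

The `$` operator used here is the same one used on `weaver` throughout the rest of this notebook.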

Set the working directory and load the libraries

In [None]:
library(Biostatistics)
library(mosaic)
library(sciplot)
library(ggplot2)

Read in data set and run summary

In [None]:
#data("weaver")
weaver<-read.csv("weaver.csv")
summary(weaver)
head(weaver)

Summary statistics

In [None]:
mass.mean=mean(weaver$mass_final)
mass.mean

In [None]:
mass.median=median(weaver$mass_final)
mass.median

In [None]:
mass.variance=var(weaver$mass_final)
mass.variance

Get the standard deviation

In [None]:
mass.sd=sd(weaver$mass_final)
mass.sd
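The variance and standard deviation are linked: the standard deviation is the square root of the variance. A small check on a made-up vector (the values in `x` are hypothetical) could look like this:

```r
x <- c(38, 40, 41, 43, 48)   # hypothetical masses, not from weaver.csv

v <- var(x)                  # sample variance (divides by n - 1)
s <- sd(x)                   # sample standard deviation

all.equal(s, sqrt(v))        # sd is the square root of the variance

# The same variance computed by hand from the definition
sum((x - mean(x))^2) / (length(x) - 1)
```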

Suppose we want to analyze only the "Eggs_left" treatment in the weaver dataframe. This is done with subset(), which keeps only the rows that satisfy a condition.
Subset the data to just the rows with Treatment "Eggs_left"

In [None]:
weaver.eggs.left<-subset(weaver, Treatment=="Eggs_left")
summary(weaver.eggs.left)
head(weaver.eggs.left)

In [None]:
mass.eggs.left.mean=mean(weaver.eggs.left$mass_final)
mass.eggs.left.mean
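To see what subsetting by a condition does, here is a sketch on a hypothetical four-row dataframe. The column names mirror the ones used above, but the values are invented. It also shows the equivalent bracket-indexing form.

```r
# Hypothetical toy data with the same column names as the weaver dataframe
birds <- data.frame(
  Treatment  = c("Eggs_left", "Eggs_removed", "Eggs_left", "Eggs_removed"),
  mass_final = c(38, 42, 40, 44)
)

# subset() evaluates the condition inside the dataframe,
# so the column can be named without the birds$ prefix
left <- subset(birds, Treatment == "Eggs_left")

# Equivalent selection with logical row indexing
left_idx <- birds[birds$Treatment == "Eggs_left", ]

left
mean(left$mass_final)   # mean of the two selected rows
```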

Test if mass.eggs.left.mean is greater than mass.mean

In [None]:
mass.eggs.left.mean > mass.mean

Get the 25th percentile of the variable (note: pdata/qdata require the mosaic package)

In [None]:
mass.q25=qdata(weaver$mass_final,0.25)
mass.q25
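If mosaic is not available, base R's quantile() gives the same kind of summary. The sketch below uses a made-up vector `x`; whether its result matches qdata exactly depends on the quantile method each function uses, which is not checked here.

```r
x <- c(30, 35, 38, 40, 42, 45, 50, 55)   # hypothetical masses

# 25th percentile (first quartile) with base R
q25 <- quantile(x, probs = 0.25)
q25

# Several percentiles at once
quantile(x, probs = c(0.25, 0.5, 0.75))
```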

Get the proportion of the population less than 40

In [None]:
mass.less.than.40=pdata(weaver$mass_final,40)
mass.less.than.40

Get the proportion of the population greater than 39

In [None]:
mass.more.than.39= 1-pdata(weaver$mass_final,39)
mass.more.than.39
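A proportion like this is just the fraction of values that meet a condition, so it can be sketched in base R by taking the mean of a logical vector. The vector `x` is hypothetical, and the sketch assumes pdata counts values at or below the cutoff.

```r
x <- c(35, 38, 39, 40, 42, 47)   # hypothetical masses

# TRUE counts as 1 and FALSE as 0, so the mean is a proportion
p_le_40 <- mean(x <= 40)         # proportion at or below 40
p_gt_39 <- mean(x > 39)          # proportion above 39

# The complement rule used above: P(X > 39) = 1 - P(X <= 39)
all.equal(p_gt_39, 1 - mean(x <= 39))
```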

Extract all large birds (final mass greater than 45) from the population

In [None]:
weaver.large=subset(weaver, weaver$mass_final>45)
summary(weaver.large)

To randomly resample the dataset, use sample, either without replacement (Rand1) or with replacement (Rand2). With mosaic loaded, sample draws rows when it is given a dataframe.

In [None]:
weaverRand1 = sample(weaver, 10)
weaverRand1 

In [None]:
weaverRand2 = sample(weaver, 10, replace=TRUE)
weaverRand2 
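Without mosaic, the same idea can be sketched in base R by sampling row numbers and indexing with them. The dataframe `birds` below is hypothetical, and set.seed is used only so the draw is repeatable.

```r
set.seed(1)                                   # make the random draw repeatable
birds <- data.frame(id = 1:20, mass = seq(30, 49))   # hypothetical data

# Without replacement: each row can appear at most once
no_repl <- birds[sample(nrow(birds), 10), ]

# With replacement: the same row may be drawn more than once
with_repl <- birds[sample(nrow(birds), 10, replace = TRUE), ]

anyDuplicated(no_repl$id)     # 0 means no row was drawn twice
nrow(with_repl)
```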

What is the distribution of "mass" in the weaver study?

In [None]:
histogram(weaver$mass_final, type="count", xlab="Mass",  ylab="Count",
main="Histogram of Mass", col="red")

Color by treatment

In [None]:
histogram(weaver$mass_final, groups=weaver$Treatment, stripes='horizontal', type="count", xlab="Mass",  ylab="Count",
main="Histogram of Mass")

Using ggplot2 for nicer plotting

In [None]:
ggplot(weaver,aes(x=mass_final, fill=Treatment))+  geom_histogram(binwidth = 2,  color="black", stat="bin")+theme_bw()

R orders factor levels alphabetically, so let's reorder the levels to change the order of the colors

In [None]:
weaver$Treatment<-factor(weaver$Treatment, levels=c("Eggs_removed", "Eggs_left"))
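The effect of the levels argument can be seen on a small character vector. This is only a sketch; `trt` is a made-up vector that reuses the two treatment names.

```r
trt <- c("Eggs_left", "Eggs_removed", "Eggs_left")   # hypothetical values

# Default: levels are sorted alphabetically
f_default <- factor(trt)
levels(f_default)

# Explicit levels: the order given here is the order used in plots and tables
f_ordered <- factor(trt, levels = c("Eggs_removed", "Eggs_left"))
levels(f_ordered)

table(f_ordered)   # counts are reported in the new level order
```

Only the order of the levels changes; the underlying values stay the same.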

In [None]:
ggplot(weaver,aes(x=mass_final, fill=Treatment))+  geom_histogram(binwidth = 2,  color="black", stat="bin")+theme_bw()

Create a non-stacked plot using position="dodge"

In [None]:
ggplot(weaver,aes(x=mass_final, fill=Treatment))+  geom_histogram(binwidth = 2,  color="black", stat="bin",position="dodge")+theme_bw()

Let's assume we want to see if there is some difference in mass between Eggs_removed and Eggs_left.
We can draw a box-and-whisker plot. Note that boxplot accepts either a formula (Y~X) or the groups as separate vectors (GroupA, GroupB).

In [None]:
boxplot(mass_final ~ Treatment,data=weaver,  xlab="Treatment",ylab="Mass",col=c("blue","red"), names=c("Eggs Removed","Eggs Left"))

▶ box edges represent the first and third quartiles (the box holds the middle 50% of the data)
▶ black line is the median
▶ lower 50% of the data is below the median
▶ lower 25% of the data occurs between the bottom edge of the box and the end of the lower whisker (or beyond it, if there are outliers)
▶ upper 25% of the data occurs between the top edge of the box and the end of the upper whisker (or beyond it, if there are outliers)
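The numbers behind a box plot can be inspected with boxplot.stats from base R. The sketch below uses a made-up vector `x` with one extreme value, to show how a point beyond the whisker is reported separately.

```r
x <- c(30, 35, 38, 40, 42, 45, 50, 70)   # hypothetical masses, one extreme value

b <- boxplot.stats(x)

# Five values: lower whisker, lower hinge (about Q1), median,
# upper hinge (about Q3), upper whisker
b$stats

# Points lying beyond the whiskers, drawn individually as outliers
b$out
```

By default the whiskers extend to the most extreme data point within 1.5 times the box height from the box, which is why the largest value here is listed in `b$out` rather than as the upper whisker.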

Frequency distribution and barplot
Determine the frequency distribution for group size in the weaver data and make a bar graph.
First we need to get the frequency distribution (use the command table to make it).

In [None]:
freq.of.group=table(weaver$GrpSize)
freq.of.group
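To make clear what table returns, here is a sketch on a short made-up vector of group sizes. The counts are stored under names taken from the distinct values.

```r
grp <- c(2, 3, 3, 5, 5, 5, 7)   # hypothetical group sizes

freq <- table(grp)
freq                # counts for each distinct value

names(freq)         # the distinct values, as character labels
as.vector(freq)     # the counts alone
prop.table(freq)    # counts converted to proportions
```

barplot uses the names as bar labels and the counts as bar heights.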

In [None]:
barplot(freq.of.group,xlab="number in Group",  ylab="counts",col="gold")

barplot works with categorical factors as well. Let's create a new categorical variable "large_group", which tells us whether the group size is greater than 5. It is a true/false variable that is added to the end of our dataframe.

In [None]:
weaver$large_group<-weaver$GrpSize>5
summary(weaver)

In [None]:
freq.of.large=table(weaver$large_group)
freq.of.large
barplot(freq.of.large,xlab="Group size over 5",  ylab="counts",col="gold")

Let's subset by group size and look at the mass

In [None]:
Large<-subset(weaver, large_group)
summary(Large)

In [None]:
Small<-subset(weaver, !large_group)
summary(Small)

In [None]:
boxplot(Small$mass_final,Large$mass_final, xlab="Group size",ylab="Mass",col=c("purple","forestgreen"), names=c("Small","Large"))

Table and barplot can be done for two categorical variables

In [None]:
freq.of.large.treat=table(weaver$large_group, weaver$Treatment)
freq.of.large.treat

In [None]:
barplot(freq.of.large.treat,ylab="counts",beside=TRUE)

In [None]:
ggplot(weaver, aes(x=Treatment, fill=large_group)) +
     geom_bar(position="dodge") +
     xlab("Treatment") +
     ylab("count") +
     scale_fill_manual(values=c("blue","red")) +
     theme_classic()