Percentiles and coverage intervals

By now you should understand that in statistics we are really interested in estimating variation. One way to describe variation is with percentiles and coverage intervals.

▶ The 25th percentile is the value below which 1/4 of your data fall
▶ The 75th percentile is the value below which 3/4 of your data fall
▶ The 50th percentile is the value below which 1/2 of your data fall (this is the sample median)
▶ The interquartile range (IQR) is the distance between the 25th and 75th percentiles; the interval between them is a 50% coverage interval
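
A minimal base R sketch of these definitions, using `quantile()` and `IQR()` from the stats package on a hypothetical vector `x` (the integers 1 to 99, chosen only for illustration). R's default interpolation rule may give slightly different percentiles than other software on small samples.

```r
# Hypothetical data for illustration: the integers 1 to 99
x <- 1:99

# 25th, 50th and 75th percentiles (default interpolation rule, type 7)
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q

# The 50th percentile matches the sample median
median(x)

# The IQR is the distance between the 75th and 25th percentiles
iqr <- IQR(x)
iqr

# Roughly a quarter of the data lie below the 25th percentile,
# and roughly half lie inside the 50% coverage interval
below_q25  <- mean(x < q[1])
inside_iqr <- mean(x >= q[1] & x <= q[3])
c(below_q25 = below_q25, inside_iqr = inside_iqr)
```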

In R we can calculate percentiles and ranges with the qdata and pdata functions from the mosaic package.

We will use "mtcars", a data set built into R from the "Motor Trend Car Road Tests".

First we load the packages:

In [None]:
library(plyr)
library(mosaic)
library(sciplot)

Then we look at the data set:

In [None]:
summary(mtcars)
help(mtcars)

Get the 25th and 75th percentiles and the interquartile range:

In [None]:
qdata(mtcars$mpg, 0.25)

qdata(mtcars$mpg, 0.75)

IQR(mtcars$mpg)

Get the percentiles from 0 to 100 in steps of 10:

In [None]:
qdata(mtcars$mpg, seq(0, 1, by = 0.1))
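
The examples above go from a proportion to a value. Going the other way, from a value to the proportion of the data at or below it, is the job of pdata. A base R sketch of that idea, assuming the "at or below" definition and using a cutoff of 20 mpg chosen only for illustration:

```r
# Cutoff chosen for illustration only
cutoff <- 20

# Proportion of cars with mpg at or below the cutoff
p_below <- mean(mtcars$mpg <= cutoff)
p_below

# The empirical cumulative distribution function gives the same number
p_ecdf <- ecdf(mtcars$mpg)(cutoff)
p_ecdf
```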

Suppose we want to count the cars in the mtcars data frame by their number of gears.
First we use the table command:

In [None]:
freq_gear <- table(mtcars$gear)
freq_gear

Then we can plot it:

In [None]:
barplot(freq_gear,
col="blue",xlab="Gear",ylab= "Cars")

We can also use ggplot to do this:

In [None]:
ggplot(mtcars, aes(gear))+
geom_bar(color="blue",
fill="blue")
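
The bar heights above are raw counts. When proportions are easier to compare, base R's `prop.table()` converts a table of counts into shares of the total. A short sketch (the names `gear_counts` and `gear_props` are chosen here for illustration):

```r
# Counts of cars per number of gears
gear_counts <- table(mtcars$gear)

# Convert counts to proportions of all cars
gear_props <- prop.table(gear_counts)
round(gear_props, 2)

# Plot proportions instead of counts
barplot(gear_props, col = "blue", xlab = "Gear", ylab = "Proportion of cars")
```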

Now we want to summarize the number of cars by both gears and carburetors:

In [None]:
freq_gear_carb <- table(mtcars$gear, mtcars$carb)

freq_gear_carb

In [None]:
barplot(freq_gear_carb, col=c("purple","orange","black"),
xlab="Number of carburetors", ylab="Counts",
legend=c("Gear-3","Gear-4","Gear-5"))

Usually we want separate bars for each group, side by side:

In [None]:
barplot(freq_gear_carb, col=c("purple","orange","black"),
xlab="Number of carburetors", ylab="Counts",
legend=c("Gear-3","Gear-4","Gear-5"), beside = TRUE)
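
In both plots the x-axis follows the columns of the table (carburetors) and the bar colors follow its rows (gears), which is why the legend lists the gears. A small sketch for checking that layout and the totals, using base R's `dimnames()` and `addmargins()` (the `dnn` labels and the name `gear_carb` are added here for readability):

```r
# Two-way table with labelled dimensions: rows = gears, columns = carburetors
gear_carb <- table(mtcars$gear, mtcars$carb, dnn = c("gear", "carb"))

# Row and column labels, which barplot() uses for the legend order and x-axis
dimnames(gear_carb)

# Add row and column totals to the table
addmargins(gear_carb)

# Build the legend from the row names instead of typing it by hand
barplot(gear_carb, beside = TRUE,
        col = c("purple", "orange", "black"),
        xlab = "Number of carburetors", ylab = "Counts",
        legend.text = paste0("Gear-", rownames(gear_carb)))
```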

We can add error bars to a plot using the sciplot package:

In [None]:
bargraph.CI(mtcars$gear,mtcars$mpg,
            xlab="Gears",
            ylab="MPG", 
            ci.fun=function(x) 
              c(mean(x)-sd(x),mean(x)+sd(x)))

The same graph, but displaying only the upper standard deviation bar:

In [None]:
bargraph.CI(mtcars$gear,mtcars$mpg,
            xlab="Gears",
            ylab="MPG",
            ci.fun=function(x)
            c(mean(x),mean(x)+sd(x)))
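
The error bars in the two sciplot graphs span one standard deviation around each group mean, as defined by the `ci.fun` argument. A base R sketch that computes the same numbers directly, so the bar heights and limits can be checked (the data frame name `mpg_by_gear` is chosen here for illustration):

```r
# Mean and standard deviation of mpg within each gear group
group_mean <- tapply(mtcars$mpg, mtcars$gear, mean)
group_sd   <- tapply(mtcars$mpg, mtcars$gear, sd)

# Bar height and the lower/upper ends of the mean +/- 1 SD error bars
mpg_by_gear <- data.frame(
  gear  = names(group_mean),
  mean  = as.vector(group_mean),
  lower = as.vector(group_mean - group_sd),
  upper = as.vector(group_mean + group_sd)
)
mpg_by_gear
```

Standard deviation bars describe the spread of the data; they are wider than standard error bars, which describe the uncertainty in the mean.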

We can also use plyr to build a summary table and plot it ourselves.
We will use the iris data set:

In [None]:
summary(iris)

# Mean sepal length per species, plus and minus one standard deviation
sum.iris <- ddply(iris, c("Species"), summarise,
                  mean_sepal  = mean(Sepal.Length),
                  upper_sepal = mean(Sepal.Length) + sd(Sepal.Length),
                  lower_sepal = mean(Sepal.Length) - sd(Sepal.Length))

sum.iris

Then we use ggplot to plot the summary:

In [None]:
ggplot(sum.iris,aes(x=Species, y=mean_sepal,color=Species,fill=Species))+
    geom_bar(aes(x=Species, y=mean_sepal),stat="identity")+ 
    geom_errorbar(aes(ymin=lower_sepal, ymax=upper_sepal),width=.2)+
    xlab("Iris Species")+  ylab("Sepal Length")+  theme_classic()

For two continuous variables you can use xyplot to draw a scatter plot with a regression line:

In [None]:
xyplot(Sepal.Length~Petal.Length,iris, type = c("p", "r"))

In ggplot, use the following code. The shaded band is the 95% confidence interval for the fitted linear relationship:

In [None]:
ggplot(iris,aes(x=Petal.Length, y=Sepal.Length))+  geom_point(color="blue",size=2)+
xlab("Petal.Length")+  ylab("Sepal.Length")+  stat_smooth(method = "lm")+  theme_classic()
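
The line and shaded band can also be reproduced numerically. A sketch, assuming the band is the pointwise 95% confidence interval from an ordinary least-squares fit (the petal lengths in `new_petals` are arbitrary values chosen for illustration):

```r
# Fit the same kind of linear model that stat_smooth(method = "lm") draws
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)
coef(fit)

# Arbitrary petal lengths at which to evaluate the fitted line
new_petals <- data.frame(Petal.Length = c(2, 4, 6))

# Fitted values with the lower and upper limits of the 95% confidence interval
band <- predict(fit, newdata = new_petals, interval = "confidence", level = 0.95)
cbind(new_petals, band)
```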