# Tutorial 3

## Visual and Numerical Summaries of Categorical Data

Today we will do some more work with the same data as in Tutorial 2.

>The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

>We will focus on a random sample of **20,000** people from the BRFSS survey conducted in 2000. While there are over **200** variables in this data set, we will work with a small subset.

We begin by loading the data set of 20,000 observations into the R workspace.



In [None]:
source("http://www.openintro.org/stat/data/cdc.R")

I am going to start by printing the first 6 rows of the data frame, so we can see what we are working with.

In [None]:
head(cdc)

Today we will be working with the categorical variables, that is, the columns `genhlth`, `hlthplan`, `smoke100`, and `gender`.

## Tables

In [None]:
table(cdc$gender)

The `table` command summarises categorical variables. The table above gives us a count of how many male and how many female entries there are in our data.

I might want to compute the proportion in each category. We can accomplish this by dividing the counts by 20,000, the number of participants in our data.

In [None]:
table(cdc$gender)/20000

In [None]:
table(cdc$smoke100)/20000

In [None]:
table(cdc$genhlth)/20000
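
Hard-coding 20000 works for this sample, but the same proportions can be computed without knowing the sample size in advance. Below is a minimal sketch on a small made-up vector; `toy_gender` and its values are invented for illustration and are not taken from `cdc`.

```r
# Hypothetical toy data, standing in for cdc$gender
toy_gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

counts <- table(toy_gender)

# Divide by the number of observations instead of a hard-coded total
props_manual <- counts / length(toy_gender)

# prop.table() performs the same division in one step
props <- prop.table(counts)

print(props)
```

With the tutorial data, the equivalent call would be `prop.table(table(cdc$gender))`.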

### Barplot

I may also want to display a categorical variable visually. One option is a barplot, which can show either frequencies or relative frequencies.

In [None]:
barplot(table(cdc$genhlth),main="Frequency of General Health")

In [None]:
barplot(table(cdc$genhlth)/20000,main="Relative Frequency of General Health")

## Pairs of Categorical Variables

I can also examine pairs of variables.  I am going to compare gender and smoking by producing a frequency table.

In [None]:
table(cdc$gender,cdc$smoke100)

There are now more options if we want to compute relative frequencies.

In [None]:
gensmoke<-table(cdc$smoke100,cdc$gender)

In [None]:
prop.table(gensmoke,1)

In [None]:
prop.table(gensmoke,2)

The second argument of `prop.table` tells R whether to compute row or column proportions.

- `1` computes proportions within each level of the first variable (the rows): "what percentage of smokers are male?"
  - The sum of each row is 1.
- `2` computes proportions within each level of the second variable (the columns): "what percentage of men or women smoke?"
  - The sum of each column is 1.
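
The effect of the margin argument can be checked on a small made-up table. This is only a sketch: the counts in `toy` are invented for illustration and do not come from `cdc`.

```r
# Hypothetical counts: rows = smoking status (0/1), columns = gender
toy <- as.table(matrix(c(30, 20, 10, 40), nrow = 2,
                       dimnames = list(smoke = c("0", "1"),
                                       gender = c("m", "f"))))

row_props <- prop.table(toy, 1)  # proportions within each row
col_props <- prop.table(toy, 2)  # proportions within each column

rowSums(row_props)  # every row sums to 1
colSums(col_props)  # every column sums to 1
```

In this toy table, `row_props["1", "m"]` is the share of smokers who are male, while `col_props["1", "m"]` is the share of males who smoke.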

**Does the data show a relationship between the two variables?**

---

## Barplots with 2 variables

Now we will visually display this data.  First I will produce a few barplots to compare the options.

In [None]:
barplot(gensmoke)

I will add a legend, and improve the labels to make this plot more useful.

In [None]:
barplot(gensmoke,legend=c("Non Smoker","Smoker"),names.arg=c("Male","Female"))

Another option is to place the bars side by side by setting `beside=TRUE`, as I have done below.

In [None]:
barplot(gensmoke,beside=TRUE,legend=c("Non Smoker","Smoker"),names.arg=c("Male","Female"))

Looking at the bar plots, which do you think best displays the data?

Is there evidence of a relationship between smoking and gender?
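
When the two groups differ in size, raw counts can be hard to compare. One possible approach, sketched here on invented counts rather than on `cdc`, is to plot column proportions instead of frequencies, just as the relative frequency barplot did for a single variable.

```r
# Hypothetical counts: rows = smoking status (0/1), columns = gender
toy_counts <- as.table(matrix(c(45, 55, 70, 50), nrow = 2,
                              dimnames = list(smoke = c("0", "1"),
                                              gender = c("m", "f"))))

# Proportions within each gender, so each group's bars sum to 1
toy_props <- prop.table(toy_counts, 2)

barplot(toy_props, beside = TRUE,
        legend.text = c("Non Smoker", "Smoker"),
        names.arg = c("Male", "Female"),
        ylab = "Proportion within gender")
```

For the tutorial data, the same idea would be `barplot(prop.table(gensmoke, 2), beside=TRUE)`.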


---


If I use the generic `plot` command on the table `gensmoke`, it will produce a mosaic plot.

In [None]:
plot(gensmoke)

There is also a dedicated mosaic plot command (`mosaicplot`), which allows for more flexible options.

In [None]:
mosaicplot(gensmoke,color=TRUE)

In [None]:
mosaicplot(gensmoke,color=TRUE,main="Participants split by Gender and Smoking",xlab="Non-Smokers vs Smokers")

I want to print the plots side by side for comparison.

In [None]:
par(mfrow=c(1,2))

mosaicplot(gensmoke,color=TRUE,main="Participants split by Gender and Smoking",xlab="Non-Smokers vs Smokers")


barplot(gensmoke,beside=TRUE,legend=c("Non Smoker","Smoker"),names.arg=c("Male","Female"))



In [None]:
par(mfrow=c(1,2))

mosaicplot(cdc$genhlth~cdc$exerany,color=TRUE,main="Exercise vs General Health")


barplot(table(cdc$exerany,cdc$genhlth),beside=TRUE)

In [None]:
table(cdc$genhlth,cdc$exerany)

## Subsets

Sometimes you may want to examine a subset of the data set which matches some criteria.

To select a particular row or a few rows you can use the following. 

In [None]:
cdc[4,]

In [None]:
cdc[2:10,]
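
The bracket notation has the general form `data[rows, columns]`; leaving one side of the comma empty selects everything along that dimension. A small sketch on a made-up data frame (`toy` and its values are hypothetical, not part of `cdc`):

```r
# Hypothetical toy data frame, standing in for cdc
toy <- data.frame(
  genhlth = c("good", "poor", "fair", "poor", "excellent"),
  age     = c(27, 31, 21, 60, 45),
  gender  = c("m", "f", "f", "m", "f")
)

toy[4, ]                        # row 4, all columns
toy[2:3, ]                      # rows 2 to 3, all columns
toy[2:3, c("genhlth", "age")]   # rows 2 to 3, two named columns
toy[, "age"]                    # all rows, one column (returned as a vector)
```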

Or I can use a rule to select a subset of the data.

In [None]:
subset(cdc,cdc$genhlth=="poor")

You can use `&` to combine multiple conditions; a row is kept only if all of them are true.

In [None]:
subset(cdc,cdc$genhlth=="poor" & cdc$age<35)
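
Inside `subset`, column names can also be written without the `cdc$` prefix, and the same selection can be made with a logical condition in brackets. The sketch below illustrates both on a made-up data frame; `toy` is hypothetical and only stands in for `cdc`.

```r
# Hypothetical toy data frame, standing in for cdc
toy <- data.frame(
  genhlth = c("good", "poor", "fair", "poor", "excellent"),
  age     = c(27, 31, 21, 60, 45)
)

# subset() looks up column names inside the data frame
young_poor <- subset(toy, genhlth == "poor" & age < 35)

# Equivalent selection with a logical condition in brackets
same_rows <- toy[toy$genhlth == "poor" & toy$age < 35, ]

nrow(young_poor)  # number of rows matching both conditions
```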

The vertical line `|` below means "or". You can see I selected all of the rows with age 27 or 21.

In [None]:
subset(cdc,cdc$age==27 | cdc$age==21)
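
When the list of accepted values grows, chaining `|` becomes unwieldy. One common alternative is the `%in%` operator, sketched here on a made-up data frame (`toy` is hypothetical, not part of `cdc`):

```r
# Hypothetical toy data frame, standing in for cdc
toy <- data.frame(
  genhlth = c("good", "poor", "fair", "poor", "excellent"),
  age     = c(27, 31, 21, 60, 45)
)

# Two conditions joined with "or"
either <- subset(toy, age == 27 | age == 21)

# The same rows, selected by membership in a set of values
same <- subset(toy, age %in% c(27, 21))
```

Both calls keep the rows whose age is 27 or 21.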