# Tutorial 2

## Visual and Numerical Summaries

Today we will be using data from the Centers for Disease Control and Prevention (CDC).

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of **20,000** people from the BRFSS survey conducted in 2000. While there are over **200** variables in this data set, we will work with a small subset.

We begin by loading the data set of 20,000 observations into the R workspace.



In [None]:
source("http://www.openintro.org/stat/data/cdc.R")

We have loaded the data. Each row in this data frame represents an individual, and each column represents a variable.

The data frame is called **`cdc`**.

We will start by exploring the data.



In [None]:
names(cdc)

In [None]:
cdc

In [None]:
head(cdc)

In [None]:
tail(cdc)
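
To check that each row is an individual and each column is a variable, it can help to ask R for the size and structure of a data frame. The sketch below uses a tiny made-up data frame called `toy` (a hypothetical stand-in, not the real survey data) so that it runs on its own. The same functions should work on `cdc`.

```r
# Toy stand-in for the cdc data frame (values are made up for illustration)
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "fair")),
  exerany = c(0, 1, 1),
  height  = c(70, 64, 60),
  weight  = c(175, 125, 105)
)

dim(toy)    # number of rows, then number of columns
nrow(toy)   # number of individuals (rows)
ncol(toy)   # number of variables (columns)
str(toy)    # type and first few values of each column
```

With the real data, `dim(cdc)` would be expected to report 20,000 rows and one column for each of the nine variable names.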

The columns are named `genhlth`, `exerany`, `hlthplan`, `smoke100`, `height`, `weight`, `wtdesire`, `age`, and `gender`.

Each one of these variables corresponds to a question that was asked in the survey. 




* genhlth  
 * respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor
* exerany 
 * whether the respondent exercised in the past month (1) or did not (0)
* hlthplan 
 * whether the respondent had some form of health coverage (1) or did not (0)
* smoke100 
 * whether the respondent had smoked at least 100 cigarettes in their lifetime
 
The other variables record the respondent’s height in inches (`height`), weight in pounds (`weight`), desired weight (`wtdesire`), age in years (`age`), and gender (`gender`).
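
Variables coded as 1 and 0 are easier to read in tables once they carry labels. Here is a minimal sketch using a made-up vector, `exerany_toy`, standing in for a column such as `cdc$exerany`. The label text ("no", "yes") is my own choice, not something defined by the survey.

```r
# Made-up 0/1 responses standing in for a column such as cdc$exerany
exerany_toy <- c(1, 0, 1, 1, 0)

# Attach readable labels to the codes: 0 becomes "no", 1 becomes "yes"
exerany_lab <- factor(exerany_toy, levels = c(0, 1), labels = c("no", "yes"))

exerany_lab          # the labelled responses
table(exerany_lab)   # counts per label

# Because the codes are 0 and 1, the mean is the proportion of 1s
mean(exerany_toy)
```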

In [None]:
summary(cdc)

The `table` function summarises categorical data by counting how many observations fall into each category.

In [None]:
table(cdc$gender)
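
Counts are often easier to compare as proportions. The sketch below shows one way to convert a table of counts into proportions with `prop.table`. The vector `gender_toy` is made up for illustration. With the real data the same idea would be `prop.table(table(cdc$gender))`.

```r
# Made-up categorical values standing in for cdc$gender
gender_toy <- c("m", "f", "f", "m", "f")

counts <- table(gender_toy)    # number of observations in each category
props  <- prop.table(counts)   # each count divided by the total

counts
props
```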

In the cells below I have asked R to print particular rows in the data frame. This can be useful when trying to identify particular occurrences.

In [None]:
cdc[4,]

In [None]:
cdc[4:10,]
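
Rows can also be selected by a condition rather than by position. The following is a small sketch on a made-up data frame, `toy`, whose column names mirror some of those in `cdc`. The values are invented.

```r
# Made-up data frame with columns named like those in cdc
toy <- data.frame(
  gender = c("m", "f", "f", "m"),
  age    = c(25, 41, 33, 52),
  weight = c(180, 130, 150, 200)
)

toy[2, ]                        # row 2, all columns
toy[2:3, c("age", "weight")]    # rows 2 to 3, two named columns

over30 <- toy[toy$age > 30, ]   # rows where the condition is TRUE
over30

# subset() expresses the same idea without repeating the data frame name
women_over30 <- subset(toy, gender == "f" & age > 30)
women_over30
```

The entry before the comma picks rows and the entry after it picks columns. Leaving one side empty keeps everything on that side.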

I can also use R to define new objects. Below I compute the body mass index (BMI) of every respondent from their weight in pounds and height in inches, and save the result as a new vector called `bmi`.

In [None]:
bmi <- (cdc$weight / cdc$height^2) * 703
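
To see what the formula does, here is a worked sketch for two made-up people. The factor 703 converts from pounds and inches to the metric definition of BMI (kilograms divided by metres squared), so the two calculations should agree closely. All variable names below are my own and the values are invented.

```r
# Two made-up respondents
weight_lb <- c(175, 125)
height_in <- c(70, 64)

# Same formula as above, applied element by element
bmi_toy <- (weight_lb / height_in^2) * 703
round(bmi_toy, 1)   # 25.1 and 21.5

# Metric check: 1 lb = 0.45359237 kg and 1 in = 0.0254 m
weight_kg  <- weight_lb * 0.45359237
height_m   <- height_in * 0.0254
bmi_metric <- weight_kg / height_m^2

# The two versions differ only because 703 is a rounded conversion factor
bmi_toy - bmi_metric
```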

In [None]:
bmi

You can see why we may not want to print the output from this calculation. We don't need to see all 20,000 values to verify our formula.

I should look at just the first few values instead.


In [None]:
head(bmi)

In [None]:
summary(bmi)

In [None]:
hist(bmi)

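I can vary many options with histograms. I am going to try changing the number of bins. Note that when `breaks` is a single number, R treats it as a suggestion and may choose a slightly different number of bins.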


In [None]:
hist(bmi,breaks=5)

In [None]:
hist(bmi,breaks=25)
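
To check how many bins R actually used, the histogram can be computed without drawing it. This sketch uses simulated values (not the survey data) and `plot = FALSE`, which returns the bin edges and counts. The variable names are my own.

```r
set.seed(1)
x <- rnorm(200, mean = 26, sd = 4)   # simulated values, not BRFSS data

h5  <- hist(x, breaks = 5,  plot = FALSE)
h25 <- hist(x, breaks = 25, plot = FALSE)

length(h5$counts)    # number of bins actually used for breaks = 5
length(h25$counts)   # number of bins actually used for breaks = 25
h5$breaks            # the bin edges R chose

# Supplying a vector of edges fixes the bins exactly
edges   <- seq(0, 60, by = 5)
h_exact <- hist(x, breaks = edges, plot = FALSE)
length(h_exact$counts)   # always length(edges) - 1
```

Passing a vector of edges is the more predictable choice when two histograms need to be compared on identical bins.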

I can also make box plots.

In [None]:
boxplot(bmi)
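
The numbers behind a box plot can be printed directly. The sketch below uses a short made-up vector with one unusually large value. `fivenum` gives the five-number summary, and `boxplot.stats` gives the values that `boxplot` draws, including which points are flagged as outliers.

```r
# Made-up values with one unusually large observation
x <- c(18, 21, 22, 24, 25, 26, 27, 29, 31, 45)

fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum

stats <- boxplot.stats(x)
stats$stats   # whisker end, box edge, median, box edge, whisker end
stats$out     # points drawn individually as outliers
```

By default the whiskers stop at the most extreme data point within 1.5 times the box length of the box, so here the upper whisker ends at 31 and the value 45 is drawn as a separate point.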

I would like to investigate whether BMI is related to general health. I will do this by producing side-by-side box plots, one for each category of `genhlth`.

In [None]:
boxplot(bmi~cdc$genhlth)

In [None]:
boxplot(bmi~cdc$gender)
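
Side-by-side box plots can be backed up with a numerical summary for each group. Here is a minimal sketch using `tapply` on a made-up data frame. The values and the three health levels are invented for illustration. With the real data the equivalent call would be `tapply(bmi, cdc$genhlth, median)`.

```r
# Made-up BMI values with a health rating for each person
toy <- data.frame(
  bmi     = c(22.1, 24.5, 27.3, 30.2, 23.8, 31.0),
  genhlth = factor(
    c("excellent", "excellent", "good", "poor", "good", "poor"),
    levels = c("excellent", "good", "poor")
  )
)

# Median BMI within each health category
med <- tapply(toy$bmi, toy$genhlth, median)
med

# The same summary returned as a data frame
aggregate(bmi ~ genhlth, data = toy, FUN = median)
```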