# Lab 5 Practice: Examining Categorical Data

## Reminder - working with notebooks

#### 1) It is important to save your work, exit the notebook, and log out of syzygy whenever you are finished working on the notebook for that session. Simply closing the window in which you are working will leave the notebook running, which can produce some minor problems when you next try to log in to resume working on the notebook.

- **Select File > Save Notebook or select the Save icon above to save your work.**
- **To exit the notebook, select File > Close and Shutdown Notebook.**
- **Select File > Log Out.**


#### 2) When you resume your work on a notebook, your previous work/output may still be displayed, but none of your previous work is maintained in memory accessible by the notebook. In particular, you will need to load the dataset again in order to continue working with the data. One easy way to refresh your notebook is to go to the notebook cell where you left off and do the following.

- **Select Kernel > Restart Kernel and Run up to Selected Cell.**
#### This will run all of the code in your notebook up to the selected cell.

## Objectives
* numerical summaries of a single categorical (qualitative) variable
    * counts
    * proportions
* graphical summaries of a single categorical variable
    * bar plots
* exploration of the relationship between categorical variables
  * contingency tables
  * conditional distributions (row proportions, column proportions)
  * stacked bar plot
  * side-by-side bar plot
  * mosaic plot  

## Load Data: 

In [None]:
source("http://www.openintro.org/stat/data/cdc.R")

The `source` function is used to import the dataset that will be used in the tutorial. The data that is available to you is called `cdc`.

## Data Information:

### Data Set:

Today we will be using data from the CDC, the Centers for Disease Control and Prevention in the U.S.

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.

 
#### Name: #### 
* `cdc` - health data from the sample of the BRFSS survey from 2000.

#### Variables: ####
* `genhlth` - respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor
* `exerany` - whether the respondent exercised in the past month (recent) or did not (not_recent). 0 = did not exercise, 1 = exercised
* `hlthplan` - whether the respondent had some form of health coverage (insured) or did not (uninsured). 0 = does not have health plan, 1 = has health plan
* `smoke100` - whether the respondent had smoked at least 100 cigarettes in their lifetime (smoker) or has not (nonsmoker). 0 = nonsmoker, 1 = smoker
* `height` - the respondent's height measured in inches.
* `weight` - the respondent's weight measured in pounds.
* `wtdesire` - the respondent's desired weight measured in pounds.
* `age` - the respondent's age measured in years.
* `gender` - whether the respondent said they were female or male.

## Getting Started

R stores data in data frames, which you might think of as a type of spreadsheet. Each row is a different observation (a different subject) and each column is a different variable (the first is `genhlth`, the second `exerany` and so on). 

To view the names of the variables, type the command

In [None]:
names(cdc)

This returns the names `genhlth`, `exerany`, `hlthplan`, `smoke100`, `height`, `weight`, `wtdesire`, `age`, and `gender`. Each one of these variables corresponds to a question that was asked in the survey. For example, for `genhlth`, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The `exerany` variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, `hlthplan` indicates whether the respondent had some form of health coverage (1) or did not (0). The `smoke100` variable indicates whether the respondent had smoked at least 100 cigarettes in their lifetime (1) or has not (0). The other variables record the respondent's height in inches, weight in pounds, desired weight in pounds (`wtdesire`), age in years, and gender.

We can see the size of the data frame by using the `dim` function

In [None]:
dim(cdc)

which will return the number of rows and columns.

We can look at the first few entries (rows) of our data with the command

In [None]:
head(cdc)

and similarly we can look at the last few by typing

In [None]:
tail(cdc)
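
The same inspection functions can be tried on a small data frame where the answers are easy to check by eye. The sketch below uses a made-up data frame called `toy` (a hypothetical stand-in, not the real `cdc` data) and the optional `n` argument of `head` and `tail` to control how many rows are shown.

```r
# Hypothetical toy data frame standing in for cdc (not real survey data)
toy = data.frame(
  genhlth = factor(c("good", "fair", "good", "excellent", "poor", "good", "fair", "good")),
  exerany = c(1, 0, 1, 1, 0, 1, 0, 1),
  age     = c(34L, 51L, 29L, 42L, 67L, 38L, 55L, 23L)
)

dim(toy)          # number of rows, then number of columns
nrow(toy)         # number of rows only
ncol(toy)         # number of columns only
head(toy, n = 3)  # first 3 rows instead of the default 6
tail(toy, n = 2)  # last 2 rows
```

The same calls should work on `cdc` once it has been loaded.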

In addition to showing some of the values recorded for each variable, the `head` and `tail` functions provide information on how each variable is being treated by R. For example, at the top of the `genhlth` column, you will see the label `<fct>`. This indicates that `genhlth` is being treated as a `Factor` variable which is equivalent to a **categorical** variable. Similarly, `gender` is being treated as a `Factor`, or **categorical** variable. 

The other variables are all being treated as numerical variables. Those labeled `<dbl>` are formally treated as continuous numerical variables, while those labeled `<int>` are formally treated as discrete numerical variables.

Note that when we begin exploring the data frame, there may be variables that are not treated by R in the correct manner. For example, the variable `hlthplan` records whether the respondent has some sort of health coverage. This is clearly a categorical variable. However, it was recorded as either a `0` for those without coverage or a `1` for those with health coverage. Upon encountering a series of `0`s and `1`s, R initially treats the variable as numerical.

The variable type can be changed in R if it is incorrect. We will discuss that later.
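
As a brief preview, one common way to make this change is the `factor` function. The sketch below uses a short made-up vector, `hlthplan_raw`, as a stand-in for `cdc$hlthplan`; the labels "uninsured" and "insured" are illustrative choices, and the lab may take a different approach when it returns to this topic.

```r
# Hypothetical 0/1 responses standing in for cdc$hlthplan
hlthplan_raw = c(1, 0, 1, 1, 0, 1)
class(hlthplan_raw)   # "numeric": R treats the 0s and 1s as numbers

# Convert to a factor; the labels are illustrative choices
hlthplan_fct = factor(hlthplan_raw, levels = c(0, 1), labels = c("uninsured", "insured"))
class(hlthplan_fct)   # "factor": now treated as categorical
levels(hlthplan_fct)  # the categories, in the order given by labels
table(hlthplan_fct)   # counts per category
```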

You could also look at all of the data frame at once by typing its name, `cdc`,  into a code cell, but that might be unwise here. We know `cdc` has 20,000 rows, so viewing the entire data set could be overwhelming. Better to observe smaller portions of the dataframe using `head`, `tail`, or other subsetting techniques.

You could also use the `str` function to review the structure of the data frame. This will identify the variables in the data frame, the type of each variable, and provide an example of some of the values for each variable.

In [None]:
str(cdc)

### Summarizing Categorical Data

#### Frequency Table and Relative Frequency Table

The BRFSS questionnaire contains a massive amount of information. A good first step in any analysis is to reduce all of that information into a few summary statistics and graphics. For categorical data, we could consider the sample frequency or relative frequency distribution. The function `table` does this for you by counting the number of times each kind of response was given. For example, to see the general health for people surveyed, `genhlth`, type

In [None]:
table(cdc$genhlth)

To look at the relative frequency, one would need to adjust the frequency in each category by the total number of observations. Since there are 20000 subjects in this data frame, you could look at the relative frequency distribution by typing

In [None]:
table(cdc$genhlth)/20000

Notice how R automatically divides all entries in the table by 20,000 in the command above. This is similar to something we observed in the previous tutorial; when we multiplied or divided a vector with a number, R applied that action across entries in the vectors. As we see above, this also works for tables.
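
Typing the sample size by hand works, but it has to be updated whenever the data change. A minimal sketch of the same calculation with the total computed by R is shown below; the vector `health` is made-up data standing in for `cdc$genhlth`.

```r
# Hypothetical responses standing in for cdc$genhlth
health = c("good", "fair", "good", "excellent", "good", "poor", "fair", "good")

counts = table(health)  # frequency of each category
n = length(health)      # total number of observations
counts / n              # relative frequencies; every count is divided by n
```

For the lab data, `length(cdc$genhlth)` or `nrow(cdc)` would play the role of `n`.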

There is an easier way to obtain the relative frequency table. The `prop.table` function takes a table of counts and converts those counts into proportions, or relative frequencies. For example, to obtain the relative frequency table for `genhlth`, type

In [None]:
genhlth_count = table(cdc$genhlth) 
# this counts the number of subjects in each of the categories of the response variable
# then saves the results in the object called genhlth_count
genhlth_count 
# this displays the table of counts

In [None]:
genhlth_prop = prop.table(genhlth_count)
# this converts table of counts into a table of proportions, or relative frequencies
# then saves results in object called genhlth_prop
genhlth_prop 
# this displays the table of proportions
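
Proportions are often easier to read after rounding or converting to percentages. The sketch below shows one way to do that with `round`, using a made-up vector `health7` rather than the real `cdc` data.

```r
# Hypothetical responses (7 observations) standing in for cdc$genhlth
health7 = c("good", "fair", "good", "excellent", "good", "poor", "fair")

health7_prop = prop.table(table(health7))
sum(health7_prop)             # the proportions add up to 1
round(health7_prop, 2)        # proportions rounded to 2 decimal places
round(100 * health7_prop, 1)  # the same values expressed as percentages
```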

### Exercise: Find both the count and the proportion of people who have smoked 100 cigarettes, `smoke100`, in their lifetime

<details>

<summary><b>Click to view sample code:</b></summary>


```
# this counts the number of subjects in each of the categories of the response variable (nonsmoker = 0, smoker = 1)
# then saves the results in the object called smoke_count
smoke_count = table(cdc$smoke100)
# this displays the table of counts
smoke_count
```

</details>

<details>

<summary><b>Click to view sample code:</b></summary>


```
# this converts the table of counts into a table of proportions, or relative frequencies
# then saves the results in the object called smoke_prop
smoke_prop = prop.table(smoke_count)
# this displays the proportions
smoke_prop
```

</details>

#### Graphical Summaries

Bar plots, or bar charts, are a common graphical summary for categorical variables.  
  
The function `barplot` creates plots based upon the counts in each category, or the proportion in each category, where the height of each bar represents the count, or proportion, in each category. The function requires information regarding count, or proportion, in each category, hence it requires information from the `table` or `prop.table` function.  
  
Use the `barplot` function to produce a bar plot of the sample proportions in each category. The `barplot` function only requires the data to be graphed, but there are some optional arguments that may be used to customize the bar plot. Some of these optional arguments include:
* `xlab` - specify the label for the x-axis, eg `xlab = "General Health"`
* `ylab` - specify the label for the y-axis, eg `ylab = "Proportion"`
* `ylim` - specify the minimum and maximum value for the y-axis, eg `ylim=c(minimum, maximum)`
* `main` - specify a main title for the graph, eg `main = "General Health"`

In [None]:
barplot(genhlth_prop, ylab="Proportion", main = "General Health" ) 
# this produces a barplot of the proportions in each category
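
The optional arguments listed above can be combined in a single call. The following sketch uses made-up data (`health_toy`, a hypothetical stand-in for `cdc$genhlth`) and fixes the y-axis to run from 0 to 1 so that bar heights read directly as proportions.

```r
# Hypothetical responses standing in for cdc$genhlth
health_toy = c("good", "fair", "good", "excellent", "good", "poor", "fair", "good")
health_toy_prop = prop.table(table(health_toy))

barplot(health_toy_prop,
        xlab = "General health",             # label for the x-axis
        ylab = "Proportion",                 # label for the y-axis
        ylim = c(0, 1),                      # y-axis runs from 0 to 1
        main = "General Health (toy data)")  # main title
```

Leaving out `ylim` lets R choose the axis range from the data, which is usually fine for a single plot; setting it explicitly helps when several plots need to be compared on the same scale.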

### Exercise: Construct a barplot for the proportion of people who have smoked 100 cigarettes in their lifetime.

<details>

<summary><b>Click to view sample code:</b></summary>


```
# this produces a barplot of the proportions in each category
barplot(smoke_prop, xlab = "Nonsmoker = 0 and Smoker = 1", ylab="Proportion", main = "Smoking Status" )
```

</details>

#### Contingency Table

If two categorical variables are being compared, the numerical summary usually includes creating a contingency table which displays the number of observations that appear for each combination of categories from the two categorical variables. Categorical variables may be summarized by considering the counts, or proportion, in each of the categories.  
  
For example, the `table` command may be used to create a contingency table of counts to examine which participants have smoked, `smoke100`, across male and female participants, `gender`. Since there are two categorical variables appearing in the table, the order of the variables will determine the row variable and column variable for the table.

In [None]:
table(cdc$gender, cdc$smoke100)

Here, we see column labels of 0 and 1. Recall that 1 indicates a respondent has smoked at least 100 cigarettes. The rows refer to gender.
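
The effect of argument order can be seen on a small example. The sketch below uses two made-up vectors, `gender_toy` and `smoke_toy` (hypothetical stand-ins for `cdc$gender` and `cdc$smoke100`), and the optional `dnn` argument of `table` to label the row and column variables.

```r
# Hypothetical toy vectors standing in for cdc$gender and cdc$smoke100
gender_toy = c("m", "f", "f", "m", "f", "m", "f", "f")
smoke_toy  = c(1, 0, 1, 1, 0, 0, 0, 1)

# The first argument becomes the rows, the second becomes the columns
table(gender_toy, smoke_toy)

# Swapping the order swaps the rows and columns
table(smoke_toy, gender_toy)

# dnn supplies display names for the row variable and the column variable
table(gender_toy, smoke_toy, dnn = c("Gender", "Smoked 100+"))
```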

Since the interest is in comparing the proportion of males who have smoked to the proportion of females who have smoked, we usually want to estimate the relevant proportions for each group. The `prop.table` function is an easy way to accomplish this. It takes a table of counts and converts those counts into proportions. The default for `prop.table` is to convert counts into proportions of the overall total. However, the `margin` argument allows you to specify row proportions (`margin = 1`) or column proportions (`margin = 2`). ***In our table the rows refer to gender, which means that row proportions will be of primary interest***. Use the `prop.table` function below to compare the various options. Since the `prop.table` function uses a table of counts as its input, we could save the table of counts as an object to be used by the `prop.table` function.  

In [None]:
gender_smoke_count = table(cdc$gender, cdc$smoke100)
# this creates a contingency table representing the number of smokers among male and female participants
# then saves the results in the object called gender_smoke_count
gender_smoke_count 
# this displays the table of counts

#### Proportion of All Subjects

In [None]:
gender_smoke_prop = prop.table(gender_smoke_count)
# this calculates the proportion of all subjects in each combination of gender and smoking status
gender_smoke_prop
# this displays the table of proportions

#### Row Proportions - Proportions of smokers and nonsmokers for male participants and female participants

In [None]:
gender_smoke_row_prop = prop.table(gender_smoke_count, margin = 1) 
# this calculates the proportions for each row
gender_smoke_row_prop
# this displays the table of row proportions

#### Column Proportions - Proportions of males and females within each smoking category

In [None]:
gender_smoke_col_prop = prop.table(gender_smoke_count, margin = 2) 
# this calculates the proportions for each column
gender_smoke_col_prop
# this displays the table of column proportions
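
A quick way to confirm which version of `prop.table` was produced is to check what adds up to 1. The sketch below does this on a small made-up table (the vectors `g` and `s` are hypothetical stand-ins for `cdc$gender` and `cdc$smoke100`).

```r
# Hypothetical toy vectors standing in for cdc$gender and cdc$smoke100
g = c("m", "f", "f", "m", "f", "m", "f", "f")
s = c(1, 0, 1, 1, 0, 0, 0, 1)
toy_count = table(g, s)

overall = prop.table(toy_count)              # proportions of the overall total
by_row  = prop.table(toy_count, margin = 1)  # row proportions
by_col  = prop.table(toy_count, margin = 2)  # column proportions

sum(overall)     # the whole table sums to 1
rowSums(by_row)  # each row sums to 1
colSums(by_col)  # each column sums to 1
```

In the row-proportion table, each row is the conditional distribution of smoking status within one gender, which is why it is the table used to compare males with females.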

### Exercise: Construct a contingency table of counts to explore the relationship between exercise, `exerany`, and smoking history, `smoke100`. Set exercise as the row variable.

<details>

<summary><b>Click to view sample code:</b></summary>


```
# creates a contingency table representing the number of smokers among exercisers (1) and non exercisers (0)
# the dnn argument is not required, but allows us to identify the name of the row variable and the column variable
# saves the results in the object called exercise_smoke_count
exercise_smoke_count = table(cdc$exerany, cdc$smoke100, dnn = c("Exercise", "Smoke"))
# displays the table of counts
exercise_smoke_count
```

</details>

### Exercise: Construct a contingency table that displays the proportion of smokers and nonsmokers for each exercise group.

<details>

<summary><b>Click to view sample code:</b></summary>


```
# calculates the proportions for each row
exercise_smoke_row_prop = prop.table(exercise_smoke_count, margin = 1)
# displays the table of proportions
exercise_smoke_row_prop
```

</details>

### Graphical summaries

Barplots, or bar charts, and mosaic plots are common graphical summaries for categorical variables.  
  
The `barplot` function creates plots based upon specified counts or proportions. The height of each bar represents the relevant count or proportion, so the function requires a table produced by the `table` or `prop.table` function. Since there are two categorical variables, the information for each variable is shown using either stacked bars or grouped bars (a side-by-side plot). It may also be necessary to include a legend in order to better convey the information contained in the graph. 
  
The `barplot` function only requires the data to be graphed, but as with the `plot` function, there are some optional arguments that may be used to customize the bar plot. Some of these optional arguments include:
* `beside` - setting `beside = TRUE` produces a grouped (side-by-side) barplot; setting `beside = FALSE` produces a stacked barplot 
* `legend` - setting `legend = TRUE` adds a legend to the barplot
* `xlab` - specify the label for the x-axis
* `ylab` - specify the label for the y-axis
* `ylim` - specify the minimum and maximum value for the y-axis, eg `ylim=c(minimum, maximum)`
* `main` - specify a main title for the graph, eg `main = "Comparing treatments of Type II Diabetes"`

Since we want to compare the smoking behaviour of males vs females, the `gender_smoke_row_prop` table is of primary interest. This table shows the proportions of smokers and nonsmokers among male participants and female participants. There are several options for summarizing this table graphically using the `barplot` function.

#### Grouped barplots (side-by-side barplot)

In [None]:
barplot(gender_smoke_row_prop, beside =TRUE, legend=TRUE, main = "Comparing smoking among male and female participants" )
# this produces a grouped barplot of row proportions; 
# this reflects the distribution of smoking among male and female participants

R makes decisions about which categorical variable to display on the x-axis and which categorical variable to use to differentiate the bars based upon which categorical variable is the row variable and which is the column variable. If you would prefer to switch how the variables are being treated, you need to use the `t` function to transpose the rows and columns as shown below.

In [None]:
barplot(t(gender_smoke_row_prop), beside =TRUE, legend=TRUE, main = "Comparing smoking among male and female participants" )
# this produces a grouped barplot of the row proportions, 
# this uses t() to transpose the role of the row variable and column variable when creating the graph
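
To see why transposing changes the plot, it helps to know that `barplot` forms one group of bars per column of the table, with the rows distinguishing the bars inside each group. The sketch below illustrates this with a small made-up matrix of row proportions, `toy_row_prop` (hypothetical values, not taken from `cdc`).

```r
# Hypothetical row proportions: rows are gender, columns are smoking status
toy_row_prop = matrix(c(0.6, 0.4,
                        0.5, 0.5),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(gender = c("f", "m"), smoke100 = c("0", "1")))

colnames(toy_row_prop)     # groups on the x-axis before transposing: "0", "1"
colnames(t(toy_row_prop))  # groups on the x-axis after transposing: "f", "m"

# After transposing, each group is a gender and the bars within it are smoking status
barplot(t(toy_row_prop), beside = TRUE, legend = TRUE,
        main = "Smoking status within each gender (toy data)")
```

With row proportions, the transposed version is often the easier one to read, because the bars within each group then sum to 1.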

#### Stacked barplots

In [None]:
barplot(gender_smoke_row_prop, legend=TRUE, main = "Comparing smoking among male and female participants" )
# this produces a stacked barplot of row proportions; 
# each bar is a smoking category (0 or 1), split into segments by gender

As with the grouped barplot, R uses the column variable to define the bars along the x-axis and the row variable to define the segments within each bar. If you would prefer to switch how the variables are being treated, use the `t` function to transpose the rows and columns as shown below.

In [None]:
barplot(t(gender_smoke_row_prop),legend=TRUE, main = "Comparing smoking among male and female participants" )
# this produces a stacked barplot of the row proportions
# this uses t() to transpose the role of the row variable and column variable when creating the graph

***Consider how each of the barplots above represents the proportions that appear in the `gender_smoke_row_prop` table.***

### Exercise: Construct a stacked barplot that displays the proportion of smokers and nonsmokers for each exercise group.

<details>

<summary><b>Click to view sample code:</b></summary>


```
# produces a stacked barplot of the row proportions
# uses t() to transpose the role of the row variable and column variable when creating the graph
barplot(t(exercise_smoke_row_prop),legend=TRUE, xlab = "Exercise in past month", main = "Comparing smoking among exercisers (1) and non exercisers (0)" )
```

</details>

#### Mosaic plot

The function `mosaicplot` may be used to create a mosaic plot for a single categorical variable or a pair of categorical variables. It uses the table of counts to produce the mosaic plot. Therefore, to produce a mosaic plot that reflects the relationship between gender, `gender`, and smoking, `smoke100`, we would enter the following command.

In [None]:
mosaicplot(gender_smoke_count, main = "Comparing smoking among male and female participants" )
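
In a mosaic plot of a two-way table, the width of each column of tiles reflects how many observations fall in that category of the row variable, and the heights within a column reflect the split across the second variable. The sketch below draws one from a small made-up table (`mg` and `ms` are hypothetical stand-ins for `cdc$gender` and `cdc$smoke100`) so the tile sizes can be checked against the counts.

```r
# Hypothetical toy vectors standing in for cdc$gender and cdc$smoke100
mg = c("m", "f", "f", "m", "f", "m", "f", "f")
ms = c(1, 0, 1, 1, 0, 0, 0, 1)
toy_mosaic_count = table(mg, ms)
toy_mosaic_count  # 5 females and 3 males, so the "f" column should be wider

mosaicplot(toy_mosaic_count,
           xlab = "Gender",
           ylab = "Smoked 100+ cigarettes",
           main = "Smoking by gender (toy data)")
```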

### Exercise: Based upon the available information in the contingency table and graphs, does there seem to be an association between gender and smoking? If so, how would you describe the association?

#### Answer: *The contingency tables and graphs suggest that there is an association between gender and smoking. Those subjects who identified as male were more likely to be smokers than those who identified as female. 52.5% of males were smokers while 42.4% of females were smokers.*

### Exercise: Construct a mosaic plot that displays the relationship between exercise and smoking history.

<details>

<summary><b>Click to view sample code:</b></summary>


```
mosaicplot(exercise_smoke_count, xlab = "Exercise in past month", ylab = "Smoking status", main = "Comparing smoking among exercisers and non exercisers" )
```

</details>

### Exercise: Based upon the available information in the contingency table and graphs, does there seem to be an association between exercise and smoking history? If so, how would you describe the association?

### Answer: 

<details>

<summary><b>Sample Answer:</b></summary>

<br>*The contingency tables and graphs suggest that there may be an association between exercise and smoking history, but it is not particularly compelling. Those subjects who had not exercised in the past month were slightly more likely to be smokers than those who had exercised. 50% of non exercisers were smokers while 46.3% of exercisers were smokers.*

</details>

#### Let’s stop here. 

#### It is important to save your work, exit the notebook, and log out of syzygy when you are done. Simply closing the window in which you are working will leave the notebook running, which can produce some minor problems when you next try to log in.

- **Select File > Save Notebook or select the Save icon above to save your work.**
- **To exit the notebook, select File > Close and Shutdown Notebook.**
- **Select File > Log Out.**