## Learning outcomes for this notebook

* Be able to produce a contingency table and interpret what it means.
* Be able to create a clearly labelled grouped bar chart and interpret it.

## Displaying the association between two categorical variables

### Introduction

Here we show how to display data for two categorical variables simultaneously. The goal is to produce a table or graph that reveals any association between the two variables.

If two categorical variables are associated, the relative frequencies for one variable will differ among categories of the other variable. To reveal an association, we show the frequencies using a contingency table, or a grouped bar graph.

### Avian malaria and reproduction

Is reproduction hazardous to health? If not, then it is difficult to explain why adults in many organisms seem to hold back on the number of offspring they raise. Oppliger *et al.* (1996) investigated the impact of reproductive effort on the susceptibility to malaria in wild great tits (*Parus major*) breeding in nest boxes. 

![](images/02_ex_03-A_fmt.jpeg)

They divided 65 nesting females into two treatment groups. In one group of 30 females, each bird had two eggs stolen from her nest, causing the female to lay an additional egg. The extra effort required might increase stress on these females. The remaining 35 females were left alone, establishing the control group. A blood sample was taken from each female 14 days after her eggs hatched to test for infection by avian malaria. The dataset is contained in the file [`avian_malaria.csv`](avian_malaria.csv).

<div class="alert alert-danger">
In the code cell below, read in the dataset from the file `avian_malaria.csv` and print it to see the variable names and how the data is structured. 
</div>

In this dataset each row is an individual bird. Notice that the printed output has four columns: the index starting from 0, the bird identification number (`Bird id`), which can be ignored, and the categorical variables `'treatment'` and `'infection'`.

### Displaying the data in a contingency table

The most common way of displaying the association between two categorical variables is with a contingency table. Pandas can create contingency tables for you using the `pd.crosstab()` function as follows:

```python
print(pd.crosstab(DataFrame['explanatory variable'], DataFrame['response variable']))
```

This prints a contingency table with the explanatory variable categories as rows and the response variable categories as columns.
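
As a minimal sketch of how this call behaves, the example below builds a small made-up DataFrame and cross-tabulates it. The column names `diet` and `outcome` and all the values are hypothetical and are used only for illustration; they do not come from any dataset in this notebook.

```python
import pandas as pd

# Made-up data for illustration only (not one of the notebook's datasets).
# 'diet' plays the role of the explanatory variable, 'outcome' the response variable.
toy = pd.DataFrame({
    'diet':    ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'outcome': ['healthy', 'sick', 'healthy', 'sick', 'sick', 'healthy', 'sick'],
})

# Rows are the categories of 'diet'; columns are the categories of 'outcome'.
# Each cell counts the individuals with that combination of categories.
toy_table = pd.crosstab(toy['diet'], toy['outcome'])
print(toy_table)
```

Each individual is counted in exactly one cell, so the cell counts add up to the number of rows in the DataFrame.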
<div class="alert alert-danger">
Print a contingency table for the avian malaria dataset.
</div>

Take some time to understand what this table means. 

The columns represent whether malaria infection was discovered or not. The rows represent the treatment groups, either "Control group" or "Egg-removal group". In the control group, 7 birds had malaria and 28 did not. In the egg-removal group, 15 birds had malaria and 15 did not. 

This is called a 2x2 contingency table because both variables have two categories each. If, for example, a dataset had one variable with two categories and another variable with three categories, the contingency table would be 2x3. 

Contingency tables usually come with column and row totals. These can be displayed using the `margins=True` argument in `pd.crosstab()`.
<div class="alert alert-danger">
Add `margins=True` to your call to `pd.crosstab()` in the code cell above and satisfy yourself that the totals are correct. 
</div>
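
To illustrate what `margins=True` adds, here is a short sketch on made-up data. The DataFrame, its column names (`diet`, `outcome`) and its values are hypothetical and chosen only to keep the totals easy to check by hand.

```python
import pandas as pd

# Made-up data for illustration only (not one of the notebook's datasets).
toy = pd.DataFrame({
    'diet':    ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'outcome': ['healthy', 'sick', 'healthy', 'sick', 'sick', 'healthy', 'sick'],
})

# margins=True appends a row and a column of totals, both labelled 'All' by default.
toy_totals = pd.crosstab(toy['diet'], toy['outcome'], margins=True)
print(toy_totals)
```

The bottom-right cell is the grand total, which should equal the number of individuals in the dataset.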

<div class="alert alert-danger">
Given the contingency table, would you say there is an impact of reproductive effort on the susceptibility to malaria in wild great tits? Give your answer below and explain your reasoning.
</div>

>Write your answer here.

### Displaying two categorical variables in a graph

A contingency table is usually sufficient for displaying the association between two categorical variables. However, association can also be displayed graphically.

For a single categorical variable we created a bar graph with the seaborn function `sns.countplot()` to display the data. We do the same when we have two categorical variables, except now we need to include both variables. This is done as follows:
```python
sns.countplot(x=DataFrame['explanatory variable'], hue=DataFrame['response variable'])
```
or
```python
sns.countplot(x='explanatory variable', hue='response variable', data=DataFrame)
```

This is also known as a *grouped bar graph*. 

<div class="alert alert-danger">
Create a clearly labelled grouped bar graph of malaria infection. (Don't forget to import seaborn and matplotlib.)
</div>

The categories of the variable specified by the `hue` argument of `countplot()` are given different colours and are indicated in a legend.

In terms of graph design, it is better to keep the bars filled with colour so that it is easy to compare frequencies of the response variable `'infection'` between the two treatment groups.
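
The sketch below shows one possible way to draw and label a grouped bar graph, again on made-up data. The column names `diet` and `outcome`, the values, and the label text are all hypothetical; other ways of setting labels work equally well.

```python
import pandas as pd
import seaborn as sns

# Made-up data for illustration only (not one of the notebook's datasets).
toy = pd.DataFrame({
    'diet':    ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'outcome': ['healthy', 'sick', 'healthy', 'sick', 'sick', 'healthy', 'sick'],
})

# countplot returns the matplotlib Axes it drew on, which we use to set labels.
ax = sns.countplot(x='diet', hue='outcome', data=toy)
ax.set_xlabel('Diet')
ax.set_ylabel('Number of individuals')

# The legend is created automatically from the hue variable; give it a readable title.
ax.get_legend().set_title('Outcome')
```

Keeping the Axes object returned by `sns.countplot()` is a convenient way to adjust the axis labels and legend title after the graph has been drawn.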

### Trematode infection in fish

Many parasites have more than one species of host, so the individual parasite must get from one host to another to complete its life cycle. Trematodes of the species *Euhaplorchis californiensis* use three hosts during their life cycle. 
![](images/euhaplorchis_californiensis.jpg)

Worms mature in birds and lay eggs that pass out of the bird in its faeces. The horn snail *Cerithidea californica* eats these eggs, which hatch and grow to another life stage in the snail, sterilising the snail in the process. When an infected snail is eaten by the California killifish *Fundulus parvipinnis*, the parasite develops to the next life stage and encysts in the fish’s braincase. Finally, when the killifish is eaten by a bird, the worm becomes a mature adult and starts the cycle again. Researchers have observed that infected fish spend excessive time near the water surface, where they may be more vulnerable to bird predation. This would certainly be to the worm’s advantage, as it would increase its chances of being ingested by a bird, its next host.

Lafferty and Morris (1996) tested the hypothesis that infection influences risk of predation by birds. A large outdoor tank was stocked with three kinds of killifish: unparasitized, lightly infected, and heavily infected. This tank was left open to foraging by birds, especially great egrets, great blue herons, and snowy egrets. The file [`trematodes.csv`](trematodes.csv) lists each fish, its infection level and whether it was eaten or not.

<div class="alert alert-danger">

1. Read the data in from the file `trematodes.csv`. 
2. What are the names of the categorical variables? (Use the markdown cell below for your answers.)
3. What are the categories of the categorical variables?
4. Construct a contingency table. 
5. Create a clearly labelled grouped bar graph.
6. Does this data support Lafferty and Morris's hypothesis that infection influences risk of predation by birds? Explain your answer.
</div>
<br>

>Write your answers here.

In the above two examples, there were similar numbers of individuals in each treatment group (see Table 1).

Dataset | Numbers in each treatment group
:--- | ---:
Bird malaria| 35 (control), 30 (egg removal)
Trematodes | 50 (uninfected), 45 (lightly infected), 46 (highly infected)

<center><b>Table 1:</b> Number of individuals in each treatment group in the last two examples.</center>

When this is the case the contingency table makes it easy to see if there is an association between the explanatory variable (treatments) and the response variable.

When the numbers of individuals in each treatment group are not similar the contingency table needs to be modified to help reveal an association. This is explored in the next example. 

### Schizophrenia and famine

Can environmental factors influence the incidence of schizophrenia? St. Clair *et al.* (2005) measured the incidence of the disease among children born in a region of eastern China during a severe famine in 1960, before the famine in 1956, and after the famine in 1965. The data can be found in the file [`schizophrenia.csv`](schizophrenia.csv).

<div class="alert alert-danger">

1. Read the data in from the file `schizophrenia.csv`. 
2. What are the names of the categorical variables?  (Use the markdown cell below for your answers.)
3. What are the categories of the categorical variables?
4. Construct a contingency table. 
5. Based on this contingency table is there an association between the three time periods and the incidence of schizophrenia?
</div><br>

>Write your answers here.

You might find it difficult to see whether there is an association based on just the frequencies. That's because of the large variation in the number of individuals in each category of the explanatory variable, `'timePeriod'`, which ranges from 13,748 during the famine to 83,536 after the famine. What we want to do is compare the relative frequencies across time periods. To do this we include the argument `normalize='index'` in `pd.crosstab()`. This makes the relative frequencies in each row sum to 1, which makes comparison across the categories of the explanatory variable easier. You may also want to use `.round()` (see notebook 19) to round the relative frequencies to a more readable number of decimal places.
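
As a hedged illustration of row normalisation, the sketch below uses made-up data with very unequal group sizes. The column names `site` and `status` and all the counts are hypothetical and unrelated to the schizophrenia dataset.

```python
import pandas as pd

# Made-up data for illustration only: 10 individuals at one site, 200 at the other.
toy = pd.DataFrame({
    'site':   ['small'] * 10 + ['large'] * 200,
    'status': ['affected'] * 2 + ['unaffected'] * 8
              + ['affected'] * 20 + ['unaffected'] * 180,
})

# Raw counts are hard to compare because the two groups differ so much in size.
counts = pd.crosstab(toy['site'], toy['status'])
print(counts)

# normalize='index' divides each row by its row total, so every row sums to 1.
proportions = pd.crosstab(toy['site'], toy['status'], normalize='index').round(2)
print(proportions)
```

In this toy example the larger group has ten times as many affected individuals, yet the proportion affected is lower in the larger group, which is only apparent after normalising.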

<div class="alert alert-danger">

* Normalise and round the frequencies in each time period. 
* Is there an association between the two variables?
</div>
<br>

>Write your answer here.

### References

Oppliger, A. *et al.* (1996). Clutch size and malaria resistance. *Nature* **381**:565.

Lafferty, K. D. and Morris, A. K. (1996). Altered behavior of parasitized killifish increases susceptibility to predation by bird final hosts. *Ecology* **77**:1390-1397.

St. Clair, D. *et al.* (2005). Rates of adult schizophrenia following prenatal exposure to the Chinese famine of 1959-1961. *J. Am. Med. Ass.* **294**:557-562.