# Understanding empiric contact data
Having worked with some "fake empiric" matrices under the assumption of
frequency-dependent transmission in the previous notebook,
let's move on to using some actual data derived from contact surveys.
In the previous notebook, we also briefly introduced the POLYMOD study,
which is described
[here](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0050074#s5)
and includes a link to the survey results in matrix form
[here](https://doi.org/10.1371/journal.pmed.0050074.st005).

The matrices are presented in a form that is transposed relative to the way
we have been thinking about contact structures.
In the document downloaded from the link above, 
the rows are labelled "age of contact"
and the columns are labelled "age group of participant".
Because we have preferred to think in terms of
a vector of prevalence values multiplied through by a row of the mixing matrix,
we want each row to correspond to a participant age group
and each column to a contact age group,
which is the reverse of the published layout.
We'll therefore transpose these matrices from the outset
to keep with the same conventions we were using previously.
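
As a minimal sketch of what this transposition does, consider a made-up two-group table laid out in the published format.
The names `published`, `ours` and `prevalence`, and all of the numbers, are hypothetical and only there to illustrate the convention.

```python
import numpy as np

# Hypothetical 2x2 table in the published layout:
# rows = age of contact, columns = age group of participant
published = np.array(
    [
        [5.0, 1.0],  # younger contacts, as reported by younger and older participants
        [2.0, 3.0],  # older contacts, as reported by younger and older participants
    ]
)

# Our convention: rows = participant (respondent), columns = contact
ours = published.T

# Illustrative prevalence of infection in each contact age group
prevalence = np.array([0.10, 0.02])

# Multiplying each row through by the prevalence vector gives the rate of
# contact with infectious people for the corresponding respondent group
infectious_contacts = ours @ prevalence
print(infectious_contacts)
```

After transposing, `ours[0, 1]` is the rate at which younger participants report older contacts, so each row can be combined directly with the prevalence vector.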

In [None]:
import numpy as np
import pandas as pd
pd.options.plotting.backend = "plotly"
import plotly.express as px
import plotly.graph_objects as go

In [None]:
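def build_polymod_britain_matrix():
    """Return the POLYMOD contact matrix for Britain in our convention.

    The values are entered as published (rows are age of contact,
    columns are age group of participant) and transposed on return,
    so that rows are respondent age groups and columns are contact age groups.
    """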
    matrix = [
        [1.92, 0.65, 0.41, 0.24, 0.46, 0.73, 0.67, 0.83, 0.24, 0.22, 0.36, 0.20, 0.20, 0.26, 0.13],
        [0.95, 6.64, 1.09, 0.73, 0.61, 0.75, 0.95, 1.39, 0.90, 0.16, 0.30, 0.22, 0.50, 0.48, 0.20],
        [0.48, 1.31, 6.85, 1.52, 0.27, 0.31, 0.48, 0.76, 1.00, 0.69, 0.32, 0.44, 0.27, 0.41, 0.33],
        [0.33, 0.34, 1.03, 6.71, 1.58, 0.73, 0.42, 0.56, 0.85, 1.16, 0.70, 0.30, 0.20, 0.48, 0.63],
        [0.45, 0.30, 0.22, 0.93, 2.59, 1.49, 0.75, 0.63, 0.77, 0.87, 0.88, 0.61, 0.53, 0.37, 0.33],
        [0.79, 0.66, 0.44, 0.74, 1.29, 1.83, 0.97, 0.71, 0.74, 0.85, 0.88, 0.87, 0.67, 0.74, 0.33],
        [0.97, 1.07, 0.62, 0.50, 0.88, 1.19, 1.67, 0.89, 1.02, 0.91, 0.92, 0.61, 0.76, 0.63, 0.27],
        [1.02, 0.98, 1.26, 1.09, 0.76, 0.95, 1.53, 1.50, 1.32, 1.09, 0.83, 0.69, 1.02, 0.96, 0.20],
        [0.55, 1.00, 1.14, 0.94, 0.73, 0.88, 0.82, 1.23, 1.35, 1.27, 0.89, 0.67, 0.94, 0.81, 0.80],
        [0.29, 0.54, 0.57, 0.77, 0.97, 0.93, 0.57, 0.80, 1.32, 1.87, 0.61, 0.80, 0.61, 0.59, 0.57],
        [0.33, 0.38, 0.40, 0.41, 0.44, 0.85, 0.60, 0.61, 0.71, 0.95, 0.74, 1.06, 0.59, 0.56, 0.57],
        [0.31, 0.21, 0.25, 0.33, 0.39, 0.53, 0.68, 0.53, 0.55, 0.51, 0.82, 1.17, 0.85, 0.85, 0.33],
        [0.26, 0.25, 0.19, 0.24, 0.19, 0.34, 0.40, 0.39, 0.47, 0.55, 0.41, 0.78, 0.65, 0.85, 0.57],
        [0.09, 0.11, 0.12, 0.20, 0.19, 0.22, 0.13, 0.30, 0.23, 0.13, 0.21, 0.28, 0.36, 0.70, 0.60],
        [0.14, 0.15, 0.21, 0.10, 0.24, 0.17, 0.15, 0.41, 0.50, 0.71, 0.53, 0.76, 0.47, 0.74, 1.47],
    ]
    return np.array(matrix).T  # Transposing here

OK, next let's see what an empiric matrix looks like.

In [None]:
# Lower bound of each age bracket (0, 5, ..., 70), the last being 70 years and above
age_groups = list(range(0, 75, 5))
empiric_matrix = build_polymod_britain_matrix()
px.imshow(
    empiric_matrix, 
    x=age_groups, 
    y=age_groups, 
    labels={"x": "contact age", "y": "respondent age"},
)

In [None]:
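# Label rows and columns with ages so the axes show ages rather than positions
empiric_dataframe = pd.DataFrame(empiric_matrix, index=age_groups, columns=age_groups)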
fig = go.Figure(
    data=[
        go.Surface(
            x=empiric_dataframe.columns, 
            y=empiric_dataframe.index, 
            z=empiric_dataframe.values,
        )
    ]
)
fig.update_layout(
    width=800, 
    height=600, 
    scene=dict(
        xaxis_title="contact age",  # x follows the columns, which are contact ages
        yaxis_title="respondent age",  # y follows the rows, which are respondent ages
        zaxis_title="contacts",
    ),
)
fig.show()

## Matrix characteristics
Already there are several features of this matrix that are of epidemiological interest:
- The matrix is generally assortative, 
with its diagonal elements often taking the largest values
(so people tend to associate most strongly with other people of the same age group)
- There is a second, fainter peak in contacts that is loosely parallel to the diagonal
of the matrix, but separated by about 30 years. This represents mixing
between two successive generations (i.e. parents with their children and vice versa)
- The most intense mixing appears to occur in children and young adults,
although we'll go into that in more detail next
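
The first two features can also be checked numerically.
The sketch below is one possible way to do this, shown on a made-up four-group matrix;
the helper names `diagonal_is_row_maximum` and `mean_of_offset_diagonal` are hypothetical and not part of any library.

```python
import numpy as np


def diagonal_is_row_maximum(matrix):
    """Return a boolean array, True where a row's largest value lies on the diagonal."""
    matrix = np.asarray(matrix)
    return matrix.argmax(axis=1) == np.arange(matrix.shape[0])


def mean_of_offset_diagonal(matrix, offset):
    """Mean of the values lying `offset` age groups either side of the main diagonal."""
    matrix = np.asarray(matrix)
    upper = np.diagonal(matrix, offset=offset)
    lower = np.diagonal(matrix, offset=-offset)
    return np.concatenate([upper, lower]).mean()


# Made-up 4-group matrix, for illustration only
toy_matrix = np.array(
    [
        [6.0, 1.0, 2.0, 0.5],
        [1.0, 5.0, 1.0, 2.0],
        [2.0, 1.0, 3.0, 1.0],
        [0.5, 2.0, 1.0, 2.5],
    ]
)

print(diagonal_is_row_maximum(toy_matrix))     # assortativity, row by row
print(mean_of_offset_diagonal(toy_matrix, 1))  # adjacent age groups
print(mean_of_offset_diagonal(toy_matrix, 2))  # groups two brackets apart
```

With the five-year brackets used here, a 30-year separation corresponds to `offset=6`,
so `mean_of_offset_diagonal(empiric_matrix, 6)` could be compared against the neighbouring offsets.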

So there is plenty to think about in terms of the implications
for infectious disease transmission even before we incorporate
these empiric data into our simulation model.

Of course, there is a bit of jiggle or random variability,
as there will be with any empirically collected data,
and these contact rates/patterns are highly dependent on the social characteristics
of the setting that they have been collected from.
Nevertheless, this pattern is not unusual,
and was also observed for the other POLYMOD countries.

### Average contacts reported per participant
Next, let's have a look at the total of each row of the matrix.
Given that each row represents a susceptible age bracket at risk of being infected,
the row totals should represent the average total number of contacts reported
by the study participants in that age bracket.

We can see from the graph below
that older children and young adults report the greatest number of contacts,
and there is a gradual decline thereafter with increasing age through the rest of adulthood.

Remember that these numbers are simply averages of the number of reported contacts 
of survey participants.
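
For reference, the row totals themselves can be obtained by summing across the columns of each row.
This is a small sketch on a made-up three-group matrix (`toy_matrix` is hypothetical), using the same row and column convention as above.

```python
import numpy as np

# Made-up 3-group matrix: rows = respondent age, columns = contact age
toy_matrix = np.array(
    [
        [2.0, 1.0, 0.5],
        [1.0, 6.0, 1.5],
        [0.5, 1.5, 7.0],
    ]
)

# axis=1 sums across the columns of each row, giving one total per respondent group
contacts_per_respondent = toy_matrix.sum(axis=1)
print(contacts_per_respondent)
```

Applied to our data, `empiric_matrix.sum(axis=1)` should trace the top edge of the stacked area chart in the next cell.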

In [None]:
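# Rows (respondent ages) go on the x-axis, with one stacked band per column (contact age)
axis_labels = {"index": "respondent age", "value": "contacts", "variable": "contact age"}
pd.DataFrame(empiric_matrix, index=age_groups, columns=age_groups).plot.area(labels=axis_labels)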

### Age of contacts reported by all participants
Looking at the sums over the columns,
we see a largely similar pattern, for similar reasons
to those described above for the row totals.
However, there is one noticeable difference,
which is that there is an uptick at the top for
the oldest age bracket.
Perhaps this is just random variability?
Yet the rate nearly doubles from the second-top
(65 to 69 years) age bracket to the top one (70 years and above),
and this seems to happen for many respondent age groups,
so perhaps there's another explanation.
Note that this is the same observation as noting
that the right-most column of the mixing matrix
we plotted above is marginally brighter 
than the one immediately to its left.

In fact, there is a very good reason for this:
now that we're summing over the columns,
we're considering quite a different quantity.
This again illustrates the point from the
[previous notebook](./16-contact-surveys.ipynb)
that the number of reported contacts
(overall and for any specific age group of respondents)
is determined to some extent by the number of people in
the contact age group who are available to come into contact with.
For these age brackets and considering a high-income country,
we would expect that the last age group would be larger than the second-last
because the second-last only includes people aged 65 to 69 years,
whereas the last one includes people of any age from 70 upwards.
Even though this age group may be less socially active,
in many high-income settings this age group is at least double the size
of the second-oldest bracket.

So given that the survey results reflect the reported
rates of contact with any person from a given age bracket,
we would expect the reported rates of contact
to be higher for contacts from larger age groups.
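
To see this effect in isolation, here is a small sketch under the simplifying assumption of purely random mixing,
in which every respondent makes the same number of contacts and chooses them in proportion to the size of each age group.
The population sizes and the value of `contacts_per_person` are made up.

```python
import numpy as np

# Made-up population sizes for three age brackets; the last is an open-ended
# bracket that is twice the size of the one before it
population = np.array([4_000_000, 3_000_000, 6_000_000])
contacts_per_person = 10.0  # assumed to be the same for every respondent group

# Under random mixing, contacts are shared out in proportion to contact group size
proportions = population / population.sum()
random_mixing_matrix = np.tile(contacts_per_person * proportions, (len(population), 1))

# Rows = respondent, columns = contact, so summing down each column (axis=0)
# totals the contacts reported with each contact age group
contacts_by_contact_age = random_mixing_matrix.sum(axis=0)
print(contacts_by_contact_age)
print(contacts_by_contact_age[2] / contacts_by_contact_age[1])
```

Even though every respondent behaves identically in this toy example, the column total for the largest bracket is twice that of the bracket half its size.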

In [None]:
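# After transposing, contact ages go on the x-axis, with one stacked band per respondent age
pd.DataFrame(empiric_matrix.T, index=age_groups, columns=age_groups).plot.area(
    labels={"index": "contact age", "value": "contacts", "variable": "respondent age"},
)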

This should reinforce the point that we have to think
carefully when adapting a matrix from one setting to another,
even if we want the contact structures to be essentially retained.
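
One simple adjustment that could be considered, offered here as an assumption rather than something established above,
is to rescale each contact column by the ratio of the target setting's population proportion to that of the source setting.
The function `adjust_for_population` and all of the values below are hypothetical,
and the sketch ignores any other differences between settings.

```python
import numpy as np


def adjust_for_population(matrix, source_population, target_population):
    """Rescale each contact column by the ratio of target to source population proportions.

    Assumes rows = respondent age and columns = contact age.
    """
    source_props = np.asarray(source_population) / np.sum(source_population)
    target_props = np.asarray(target_population) / np.sum(target_population)
    # Broadcasting applies one ratio per column, that is, per contact age group
    return np.asarray(matrix) * (target_props / source_props)


# All values below are made up for illustration
source_matrix = np.array([[4.0, 2.0], [2.0, 3.0]])
source_population = [500, 500]  # setting in which the survey was done
target_population = [750, 250]  # younger setting to which the matrix is adapted

adapted = adjust_for_population(source_matrix, source_population, target_population)
print(adapted)
```

In this example, contact rates with the younger group are scaled up by 1.5 and those with the older group are halved,
while the relative pattern within each column is retained.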